10  Spatial and Image Data Using GeoPandas

10.1 Learning Objectives

  • Manipulating raster data with rasterio
  • Manipulating vector data with geopandas
  • Working with raster and vector data together

10.2 Introduction

In this lesson, we’ll be working with geospatial raster and vector data to do an analysis on vessel traffic in south central Alaska. If you aren’t already familiar, geospatial vector data consists of points, lines, and/or polygons, which represent locations on the Earth. Geospatial vector data can have differing geometries, depending on what it is representing (eg: points for cities, lines for rivers, polygons for states.) Raster data uses a set of regularly gridded cells (or pixels) to represent geographic features.

Both geospatial vector and raster data have a coordinate reference system, which describes how the points in the dataset relate to the 3-dimensional sphereoid of Earth. A coordinate reference system contains both a datum and a projection. The datum is how you georeference your points (in 3 dimensions!) onto a spheroid. The projection is how these points are mathematically transformed to represent the georeferenced point on a flat piece of paper. All coordinate reference systems require a datum. However, some coordinate reference systems are “unprojected” (also called geographic coordinate systems). Coordinates in latitude/longitude use a geographic (unprojected) coordinate system. One of the most commonly used geographic coordinate systems is WGS 1984.

Coordinate reference systems are often referenced using a shorthand 4 digit code called an EPSG code. We’ll be working with two coordinate reference systems in this lesson with the following codes:

  • 3338: Alaska Albers
  • 4326: WGS84 (World Geodetic System 1984), used in GPS

In this lesson, we are going to take two datasets:

and use them to calculate the total distance travelled by ships within each fishing area.

The high level steps will be

  • read in the datasets
  • reproject them so they are in the same projection
  • extract a subset of the raster and vector data using a bounding box
  • turn each polygon in the vector data into a raster mask
  • use the masks to calculate the total distance travelled (sum of pixels) for each fishing area

10.3 Pre-processing raster data

First we need to load in our libraries. We’ll use geopandas for vector manipulation, rasterio for raster maniupulation.

First, we’ll use requests to download the ship traffic raster from Kapsar et al.. We grab a one month slice from August, 2020 of a coastal subset of data with 1km resolution. To get the URL in the code chunk below, you can right click the download button for the file of interest and select “copy link address.”

import urllib

url = 'https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A6b847ab0-9a3d-4534-bf28-3a96c5fa8d72'

msg = urllib.request.urlretrieve(url, "Coastal_2020_08.tif")

Using rasterio, open the raster file, plot it, and look at the metadata. We use the with here as a context manager. This ensures that the connection to the raster file is closed and cleaned up when we are done with it.

import rasterio
import matplotlib.pyplot as plt

with rasterio.open("Coastal_2020_08.tif") as ship_con:
    # read in raster (1st band)
    ships = ship_con.read(1)
    ships_meta = ship_con.profile

plt.imshow(ships)
print(ships_meta)
{'driver': 'GTiff', 'dtype': 'float32', 'nodata': -3.3999999521443642e+38, 'width': 3087, 'height': 2308, 'count': 1, 'crs': CRS.from_epsg(3338), 'transform': Affine(999.7994153462766, 0.0, -2550153.29233849,
       0.0, -999.9687691991521, 2711703.104608573), 'tiled': False, 'compress': 'lzw', 'interleave': 'band'}

You’ll notice that we are saving two objects here, ships and ships_meta. Looking at the types of these two objects is useful to understand what rasterio is doing.

type(ships)
numpy.ndarray
type(ships_meta)
rasterio.profiles.Profile

The ships object is a numpy array, while the ships_meta is a special rasterio class called Profile. To understand why the raster data is represented as an array, and what that profile object is, let’s look into what raster data are, exactly.

Source: Leah A. Wasser, Megan A. Jones, Zack Brym, Kristina Riemer, Jason Williams, Jeff Hollister, Mike Smorul. Raster 00: Intro to Raster Data in R. National Evologial Observatory Network (NEON).

The upper left panel of the figure above shows some satellite imagery data. These data are in raster format, which when you zoom in, you can see consist of regularly gridded pixels, each of which contails a value. When we plot these data, we can assign a color map to the pixel values, which generates the image we see. The data themselves, though, are just an n-dimensional grid of numbers. Another way we might describe this is…an array! So, this is why raster data is represented in python using a numpy array.

This is all great, and the array of values is a lot of information, but there are some key items that are missing. This array isn’t imaginary, it represents a physical space on this earth, so where is all of that contextual information? The answer is in the rasterio profile object. This object contains all of the metadata needed to interpret the raster array. Here is what our ships_meta contains:

'driver': 'GTiff',
'dtype': 'float32',
'nodata': -3.3999999521443642e+38,
'width': 3087,
'height': 2308,
'count': 1,
'crs': CRS.from_epsg(3338),
'transform': Affine(999.7994153462766, 0.0, -2550153.29233849, 0.0, -999.9687691991521, 2711703.104608573),
'tiled': False,
'compress': 'lzw',
'interleave': 'band'}

This object gives us critical information, like the CRS of the data, the no data value, and the transform. The transform is what allows us to move from image pixel (row, column) coordinates to and from geographic/projected (x, y) coordinates. The transform and the CRS are crititally important, and related. If the CRS are instructions for how the coordinates can be represented in space and on a flat surface (in the case of projected coordinate systems), then the transform describes how to locate the raster array positions in the correct coordinates given by the CRS.

Note that since the array and the profile are in separate objects it is easy to lose track of one of them, accidentally overwrite it, etc. Try to adopt a naming convention that works for you because they usually need to work together in geospatial operations.

10.4 Pre-processing vector data

Now download a vector shapefile of commercial fishing districts in Alaska.

url = 'https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3A7c942c45-1539-4d47-b429-205499f0f3e4'

msg = urllib.request.urlretrieve(url, "Alaska_Commercial_Salmon_Boundaries.gpkg")

Read in the data using geopandas.

import geopandas as gpd

comm = gpd.read_file("Alaska_Commercial_Salmon_Boundaries.gpkg")

Note the “pandas” in the library name “geopandas.” Our comm object is really just a special type of pandas data frame called a geodataframe. This means that in addition to any geospatial stuff we need to do, we can also just do regular pandas things on this data frame.

For example, we can get a list of column names (there are a lot!)

comm.columns.values
array(['OBJECTID', 'GEOMETRY_START_DATE', 'GEOMETRY_END_DATE',
       'STAT_AREA', 'STAT_AREA_NAME', 'FISHERY_GROUP_CODE',
       'GIS_SERIES_NAME', 'GIS_SERIES_CODE', 'REGION_CODE',
       'REGISTRATION_AREA_NAME', 'REGISTRATION_AREA_CODE',
       'REGISTRATION_AREA_ID', 'REGISTRATION_LOCATION_ABBR',
       'MANAGEMENT_AREA_NAME', 'MANAGEMENT_AREA_CODE', 'DISTRICT_NAME',
       'DISTRICT_CODE', 'DISTRICT_ID', 'SUBDISTRICT_NAME',
       'SUBDISTRICT_CODE', 'SUBDISTRICT_ID', 'SECTION_NAME',
       'SECTION_CODE', 'SECTION_ID', 'SUBSECTION_NAME', 'SUBSECTION_CODE',
       'SUBSECTION_ID', 'COAR_AREA_CODE', 'CREATOR', 'CREATE_DATE',
       'EDITOR', 'EDIT_DATE', 'COMMENTS', 'STAT_AREA_VERSION_ID',
       'Shape_Length', 'Shape_Area', 'geometry'], dtype=object)

We can also look at the head of the data frame:

comm.head
<bound method NDFrame.head of      OBJECTID       GEOMETRY_START_DATE GEOMETRY_END_DATE  STAT_AREA  \
0          12 1975-01-01 00:00:00+00:00               NaT      33461   
1          13 1975-01-01 00:00:00+00:00               NaT      33462   
2          18 1978-01-01 00:00:00+00:00               NaT      33431   
3          19 1980-01-01 00:00:00+00:00               NaT      33442   
4          20 1980-01-01 00:00:00+00:00               NaT      33443   
..        ...                       ...               ...        ...   
860       959                       NaT               NaT      19241   
861       960                       NaT               NaT      19242   
862       961 1994-01-01 00:00:00+00:00               NaT      19245   
863       962                       NaT               NaT      19250   
864       963 1994-01-01 00:00:00+00:00               NaT      18252   

                                 STAT_AREA_NAME FISHERY_GROUP_CODE  \
0        Tanana River mouth to Kantishna River                   B   
1                 Kantishna River to Wood River                  B   
2                    Toklik to Cottonwood Point                  B   
3    Right Bank, Bishop Rock to Illinois Creek                   B   
4      Left Bank, Cone Point to Illinois Creek                   B   
..                                          ...                ...   
860                               Kaliakh River                  B   
861                                  Tsiu River                  B   
862                             Midtimber River                  B   
863                                  Seal River                  B   
864                               Middle Italio                  B   

    GIS_SERIES_NAME GIS_SERIES_CODE  REGION_CODE    REGISTRATION_AREA_NAME  \
0            Salmon               B            3                Yukon Area   
1            Salmon               B            3                Yukon Area   
2            Salmon               B            3                Yukon Area   
3            Salmon               B            3                Yukon Area   
4            Salmon               B            3                Yukon Area   
..              ...             ...          ...                       ...   
860          Salmon               B            1  Southeastern Alaska Area   
861          Salmon               B            1  Southeastern Alaska Area   
862          Salmon               B            1  Southeastern Alaska Area   
863          Salmon               B            1  Southeastern Alaska Area   
864          Salmon               B            1  Southeastern Alaska Area   

     ... COAR_AREA_CODE         CREATOR               CREATE_DATE  \
0    ...             YU   Evelyn Russel 2006-03-26 00:00:00+00:00   
1    ...             YU   Evelyn Russel 2006-03-26 00:00:00+00:00   
2    ...             YL   Evelyn Russel 2006-03-26 00:00:00+00:00   
3    ...             YU   Evelyn Russel 2006-03-26 00:00:00+00:00   
4    ...             YU   Evelyn Russel 2006-03-26 00:00:00+00:00   
..   ...            ...             ...                       ...   
860  ...             A2   Evelyn Russel 2006-03-26 00:00:00+00:00   
861  ...             A2   Evelyn Russel 2006-03-26 00:00:00+00:00   
862  ...             A2   Evelyn Russel 2006-03-26 00:00:00+00:00   
863  ...             A2   Evelyn Russel 2006-03-26 00:00:00+00:00   
864  ...             A2  Sabrina Larsen 2017-05-05 00:00:00+00:00   

             EDITOR                 EDIT_DATE  \
0    Sabrina Larsen 2017-02-02 00:00:00+00:00   
1    Sabrina Larsen 2017-02-02 00:00:00+00:00   
2    Sabrina Larsen 2017-02-02 00:00:00+00:00   
3    Sabrina Larsen 2017-02-02 00:00:00+00:00   
4    Sabrina Larsen 2017-02-02 00:00:00+00:00   
..              ...                       ...   
860  Sabrina Larsen                       NaT   
861  Sabrina Larsen                       NaT   
862  Sabrina Larsen                       NaT   
863  Sabrina Larsen                       NaT   
864            None                       NaT   

                                              COMMENTS STAT_AREA_VERSION_ID  \
0    Yukon District, 6 Subdistrict and 6-A Section ...                 None   
1    Yukon District, 6 Subdistrict and 6-B Section ...                 None   
2      Yukon District and 3 Subdistrict until 1/1/1980                 None   
3                                                 None                 None   
4                                                 None                 None   
..                                                 ...                  ...   
860                                               None                 None   
861                                               None                 None   
862                                               None                 None   
863                                               None                 None   
864                                               None                 None   

    Shape_Length Shape_Area                                           geometry  
0       4.610183   0.381977  MULTIPOLYGON (((-151.32805 64.96913, -151.3150...  
1       3.682421   0.321943  MULTIPOLYGON (((-149.96255 64.70518, -149.9666...  
2       2.215641   0.198740  MULTIPOLYGON (((-161.39853 61.55463, -161.4171...  
3       9.179852   0.382788  MULTIPOLYGON (((-153.15234 65.24944, -153.0761...  
4       9.500826   0.378262  MULTIPOLYGON (((-152.99905 65.17027, -152.9897...  
..           ...        ...                                                ...  
860     0.223565   0.000408  MULTIPOLYGON (((-142.90787 60.09177, -142.9051...  
861     0.030506   0.000006  MULTIPOLYGON (((-143.00416 60.07711, -143.0046...  
862     0.019805   0.000012  MULTIPOLYGON (((-143.28504 60.05800, -143.2861...  
863     0.096016   0.000238  MULTIPOLYGON (((-143.49701 60.04832, -143.5083...  
864     0.016237   0.000006  MULTIPOLYGON (((-139.15063 59.30748, -139.1489...  

[865 rows x 37 columns]>

Note the existence of the geometry column. This is where the actual geospatial points that comprise the vector data are stored, and this brings up the important difference between raster and vector data - while raster data is regularly gridded at a specific resolution, vector data are just points in space.

Source: Leah A. Wasser, Megan A. Jones, Zack Brym, Kristina Riemer, Jason Williams, Jeff Hollister, Mike Smorul. Raster 00: Intro to Raster Data in R. National Evologial Observatory Network (NEON).

The diagram above shows the three different types of geometries that geospatial vector data can take, points, lines or polygons. Whatever the geometry type, the geometry information (the x,y points) is stored in the column named geometry in the geopandas data frame. In this example, we have a dataset containing polygons of fishing districts. Each row in the dataset corresonds to a district, with unique attributes (the other columns in the dataset), and it’s own set of points defining the boundaries of the district, contained in the geometry column.

comm['geometry'][:5]
0    MULTIPOLYGON (((-151.32805 64.96913, -151.3150...
1    MULTIPOLYGON (((-149.96255 64.70518, -149.9666...
2    MULTIPOLYGON (((-161.39853 61.55463, -161.4171...
3    MULTIPOLYGON (((-153.15234 65.24944, -153.0761...
4    MULTIPOLYGON (((-152.99905 65.17027, -152.9897...
Name: geometry, dtype: geometry

So, now we know where our x,y points are, where is all of the other information like the crs? With vector data, all of this information is contained within the geodataframe. We can access the crs attribute on the data frame and print it like so:

comm.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Now that we know a little about what we are working with, let’s get this data ready to analyze. First, we can make a plot of it just to see what we have.

comm.plot()
<AxesSubplot:>

This plot doesn’t look so good. Turns out, these data are in WGS 84 (EPSG 4326), as opposed to Alaska Albers (EPSG 3338), which is what our raster data are in. To make pretty plots, and allow our raster data and vector data to be analyzed together, we’ll need to reproject the vector data into 3338. To to this, we’ll use the to_crs method on our comm object, and specify as an argument the projection we want to transform to.

comm_3338 = comm.to_crs("EPSG:3338")

comm_3338.plot()
<AxesSubplot:>

Much better!

10.5 Crop data to area of interest

For this example, we are only interested in south central Alaska, encompassing Prince William Sound, Cook Inlet, and Kodiak. Our raster data is significantly larger than that, and the vector data is statewide. So, as a first step we might want to crop our data to the area of interest.

First, we’ll need to create a bounding box. We use the box function from shapely to create the bounding box, then create a geoDataFrame from the points, and finally convert the WGS 84 coordinates to the Alaska Albers projection.

from shapely.geometry import box

coord_box = box(-159.5, 55, -144.5, 62)

coord_box_df = gpd.GeoDataFrame(
    crs = 'EPSG:4326',
    geometry = [coord_box]).to_crs("EPSG:3338")

Now, we can read in raster again cropped to bounding box. We use the mask function from rasterio.mask. Note that we apply this to the connection to the raster file (with rasterio.open(...)), then update the metadata associated with the raster, because the mask function requires as its first dataset argument a dataset object opened in r mode.

import rasterio.mask
import numpy as np

with rasterio.open("Coastal_2020_08.tif") as ship_con:
    shipc_arr, shipc_transform = rasterio.mask.mask(ship_con,
                                                    coord_box_df["geometry"],
                                                    crop=True)
    shipc_meta = ship_con.meta
    # select just the 2-D array (by default a 3-D array is returned even though we only have one band)
    shipc_arr = shipc_arr[0,:,:]
    # turn the no-data values into NaNs.
    shipc_arr[shipc_arr == ship_con.nodata] = np.nan
shipc_meta.update({"driver": "GTiff",
                 "height": shipc_arr.shape[0],
                 "width": shipc_arr.shape[1],
                 "transform": shipc_transform,
                 "compress": "lzw"})

Now we’ll do a similar task with the vector data. Tin this case, we use a spatial join. The join will be an inner join, and select only rows from the left side (our fishing districts) that are within the right side (our bounding box). I chose this method as opposed to a clipping type operation because it ensures that we don’t end up with any open polygons at the boundaries of our box, which could cause problems for us down the road.

comm_clip = gpd.sjoin(comm_3338,
                      coord_box_df,
                      how='inner',
                      predicate='within')

10.5.1 Check extents

Now let’s look at the two cropped datasets overlayed on each other to ensure that the extents look right.

import rasterio.plot

# set up plot
fig, ax = plt.subplots(figsize=(7, 7))
# plot the raster
rasterio.plot.show(shipc_arr,
                   ax=ax,
                   vmin = 0,
                   vmax = 50000,
                   transform = shipc_transform)
# plot the vector
comm_clip.plot(ax=ax, facecolor='none', edgecolor='red')
<AxesSubplot:>

10.6 Calculate total distance per fishing area

In this step, we rasterize each polygon in the shapefile, such that pixels in or touching the polygon get a value of 1, and pixels not touching it get a value of 0. Then, for each polygon, we extract the indices of the raster array that are equal to 1. We then extract the values of these indicies from the original ship traffic raster data, and calculate the sum of the values over all of those pixels.

Here is a simplified diagram of the process:

10.6.0.1 Zonal statistics over one polygon

Let’s look at how this works over just one fishing area first. We use the rasterize method from the features module in rasterio. This takes as arguments the data to rasterize (in this case the 40th row of our dataset), and the shape and transform the output raster will take. We alo set the all_touched argument to true, which means any pixel that touches a boundary of our vector will be burned into the mask.

from rasterio import features

r40 = features.rasterize(comm_clip['geometry'][40].geoms,
                                    out_shape=shipc_arr.shape,
                                    transform=shipc_meta['transform'],
                                    all_touched=True)

If we have a look at a plot of our rasterized version of the single fishing district, we can see that instead of a vector, we now have a raster showing the rasterized district (with pixel values of 1) and any area not in the district has a pixel value of 0.

# set up plot
fig, ax = plt.subplots(figsize=(7, 7))
# plot the raster
rasterio.plot.show(r40,
                   ax=ax,
                   vmin = 0,
                   vmax = 1,
                   transform = shipc_meta['transform'])
# plot the vector
comm_clip.plot(ax=ax, facecolor='none', edgecolor='red')
<AxesSubplot:>

A quick call to np.unique shows our unique values are 0 or 1, which is what we expect.

np.unique(r40)
array([0, 1], dtype=uint8)

Finally, we need to know is the indices of the original raster where the fishing district is. We can use np.where to extract this information

r40_index = np.where(r40 == 1)
print(r40_index)
(array([108, 108, 108, 108, 108, 109, 109, 109, 109, 109, 109, 109, 109,
       109, 109, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110,
       110, 110, 110, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111,
       111, 111, 111, 111, 112, 112, 112, 112, 112, 112, 112, 112, 112,
       112, 112, 112, 112, 112, 112, 113, 113, 113, 113, 113, 113, 113,
       113, 113, 113, 113, 113, 113, 113, 113, 114, 114, 114, 114, 114,
       114, 114, 114, 114, 114, 114, 115, 115, 115, 115, 115, 115, 116]), array([759, 760, 762, 763, 764, 755, 756, 757, 758, 759, 761, 762, 763,
       764, 765, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763,
       764, 765, 766, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762,
       763, 764, 765, 766, 753, 754, 755, 756, 757, 758, 759, 760, 761,
       762, 763, 764, 765, 766, 767, 753, 754, 755, 756, 757, 758, 759,
       760, 761, 762, 763, 764, 765, 766, 767, 753, 754, 755, 756, 757,
       758, 759, 760, 761, 762, 763, 753, 754, 755, 756, 757, 758, 754]))

In the last step, we’ll using these indices to extract the values of the data from the fishing raster, and sum them to get a total distance travelled.

np.nansum(shipc_arr[r40_index])
14369028.0

Now that we know the individual steps, let’s run this over all of the districts. First we’ll create an id column in the vector data frame. This will help us track unique fishing districts later.

10.6.0.2 Zonal statistics over all polygons

comm_clip['id'] = range(0,len(comm_clip))

For each district (with geometry and id), we run the features.rasterize function. Then we calculate the sum of the values of the shipping raster r_array based on the indicies in the raster where the district is located.

distance_dict = {}
for geom, idx in zip(comm_clip['geometry'], comm_clip['id']):
    rasterized = features.rasterize(geom.geoms,
                                    out_shape=shipc_arr.shape,
                                    transform=shipc_meta['transform'],
                                    all_touched=True)

    r_index = np.where(rasterized == 1)
    distance_dict[idx] = np.nansum(shipc_arr[r_index])

Now we just create a data frame from that dictionary, and join it to the vector data using pandas operations.

import pandas as pd

# create a data frame from the result
distance_df = pd.DataFrame.from_dict(distance_dict,
                                     orient='index',
                                     columns=['distance'])

# extract the index of the data frame as a column to use in a join and convert distance to kilometers
distance_df['id'] = distance_df.index
distance_df['distance'] = distance_df['distance']/1000

Now we join the result to the original geodataframe.

# join the sums to the original data frame
res_full = comm_clip.merge(distance_df,
                           on = "id",
                           how = 'inner')

Finally, we can plot our result!

import matplotlib.ticker
fig, ax = plt.subplots(figsize=(7, 7))

ax = res_full.plot(column = "distance", legend = True, ax = ax)
fig = ax.figure
label_format = '{:,.0f}'
cb_ax = fig.axes[1]
ticks_loc = cb_ax.get_yticks().tolist()
cb_ax.yaxis.set_major_locator(matplotlib.ticker.FixedLocator(ticks_loc))
cb_ax.set_yticklabels([label_format.format(x) for x in ticks_loc])
ax.set_axis_off()
ax.set_title("Distance Traveled by Ships in Kilometers")
plt.show()

From here we can do any additional geopandas operations we might be interested in. For example, what if we want to calculate the total distance by registration area (a superset of fishing district). We can do that using dissolve from geopandas.

reg_area = res_full.dissolve(by = "REGISTRATION_AREA_NAME", 
                             aggfunc = 'sum')

Let’s have a look at the same plot as before, but this time over our aggregated data.

fig, ax = plt.subplots(figsize=(7, 7))

ax = reg_area.plot(column = "distance", legend = True, ax = ax)
fig = ax.figure
label_format = '{:,.0f}'
cb_ax = fig.axes[1]
ticks_loc = cb_ax.get_yticks().tolist()
cb_ax.yaxis.set_major_locator(matplotlib.ticker.FixedLocator(ticks_loc))
cb_ax.set_yticklabels([label_format.format(x) for x in ticks_loc])
ax.set_axis_off()
ax.set_title("Distance Traveled by Ships in Kilometers")
plt.show()

10.7 Summary

We covered a lot of ground here, so let’s recap some of the high level points:

  • Raster data consist of regularly gridded values, and can be represented in python as an array
  • Vector data consist of any number of points, that might be connected, and is represented in python as a geodataframe
  • We can do geospatial operations like changing the projection or cropping the data to a particular extent on both raster and vector data
  • You can use vector data to help analyze raster data (and vice versa!) by rasterizing the vector data and using numpy operations on the resulting array.