import urllib
= 'https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A6b847ab0-9a3d-4534-bf28-3a96c5fa8d72'
url
= urllib.request.urlretrieve(url, "Coastal_2020_08.tif") msg
9 Spatial and Image Data Using GeoPandas
9.1 Learning Objectives
- Manipulating raster data with
rasterio
- Manipulating vector data with
geopandas
- Working with raster and vector data together
9.2 Introduction
In this lesson, we’ll be working with geospatial raster and vector data to do an analysis on vessel traffic in south central Alaska. If you aren’t already familiar, geospatial vector data consists of points, lines, and/or polygons, which represent locations on the Earth. Geospatial vector data can have differing geometries, depending on what it is representing (eg: points for cities, lines for rivers, polygons for states.) Raster data uses a set of regularly gridded cells (or pixels) to represent geographic features.
Both geospatial vector and raster data have a coordinate reference system, which describes how the points in the dataset relate to the 3-dimensional sphereoid of Earth. A coordinate reference system contains both a datum and a projection. The datum is how you georeference your points (in 3 dimensions!) onto a spheroid. The projection is how these points are mathematically transformed to represent the georeferenced point on a flat piece of paper. All coordinate reference systems require a datum. However, some coordinate reference systems are “unprojected” (also called geographic coordinate systems). Coordinates in latitude/longitude use a geographic (unprojected) coordinate system. One of the most commonly used geographic coordinate systems is WGS 1984.
Coordinate reference systems are often referenced using a shorthand 4 digit code called an EPSG code. We’ll be working with two coordinate reference systems in this lesson with the following codes:
- 3338: Alaska Albers
- 4326: WGS84 (World Geodetic System 1984), used in GPS
In this lesson, we are going to take two datasets:
- Alaskan commercial salmon fishing statisical areas
- North Pacific and Arctic Marine Vessel Traffic Dataset
and use them to calculate the total distance travelled by ships within each fishing area.
The high level steps will be
- read in the datasets
- reproject them so they are in the same projection
- extract a subset of the raster and vector data using a bounding box
- turn each polygon in the vector data into a raster mask
- use the masks to calculate the total distance travelled (sum of pixels) for each fishing area
9.3 Pre-processing raster data
First we need to load in our libraries. We’ll use geopandas
for vector manipulation, rasterio
for raster maniupulation.
First, we’ll use requests
to download the ship traffic raster from Kapsar et al.. We grab a one month slice from August, 2020 of a coastal subset of data with 1km resolution. To get the URL in the code chunk below, you can right click the download button for the file of interest and select “copy link address.”
Using rasterio
, open the raster file, plot it, and look at the metadata. We use the with
here as a context manager. This ensures that the connection to the raster file is closed and cleaned up when we are done with it.
import rasterio
import matplotlib.pyplot as plt
with rasterio.open("Coastal_2020_08.tif") as ship_con:
# read in raster (1st band)
= ship_con.read(1)
ships = ship_con.profile
ships_meta
plt.imshow(ships)print(ships_meta)
{'driver': 'GTiff', 'dtype': 'float32', 'nodata': -3.3999999521443642e+38, 'width': 3087, 'height': 2308, 'count': 1, 'crs': CRS.from_epsg(3338), 'transform': Affine(999.7994153462766, 0.0, -2550153.29233849,
0.0, -999.9687691991521, 2711703.104608573), 'tiled': False, 'compress': 'lzw', 'interleave': 'band'}
You’ll notice that we are saving two objects here, ships
and ships_meta
. Looking at the types of these two objects is useful to understand what rasterio
is doing.
type(ships)
numpy.ndarray
type(ships_meta)
rasterio.profiles.Profile
The ships
object is a numpy
array, while the ships_meta
is a special rasterio
class called Profile
. To understand why the raster data is represented as an array, and what that profile object is, let’s look into what raster data are, exactly.
The upper left panel of the figure above shows some satellite imagery data. These data are in raster format, which when you zoom in, you can see consist of regularly gridded pixels, each of which contails a value. When we plot these data, we can assign a color map to the pixel values, which generates the image we see. The data themselves, though, are just an n-dimensional grid of numbers. Another way we might describe this is…an array! So, this is why raster data is represented in python using a numpy
array.
This is all great, and the array of values is a lot of information, but there are some key items that are missing. This array isn’t imaginary, it represents a physical space on this earth, so where is all of that contextual information? The answer is in the rasterio
profile object. This object contains all of the metadata needed to interpret the raster array. Here is what our ships_meta
contains:
'driver': 'GTiff',
'dtype': 'float32',
'nodata': -3.3999999521443642e+38,
'width': 3087,
'height': 2308,
'count': 1,
'crs': CRS.from_epsg(3338),
'transform': Affine(999.7994153462766, 0.0, -2550153.29233849, 0.0, -999.9687691991521, 2711703.104608573),
'tiled': False,
'compress': 'lzw',
'interleave': 'band'}
This object gives us critical information, like the CRS of the data, the no data value, and the transform. The transform is what allows us to move from image pixel (row, column) coordinates to and from geographic/projected (x, y) coordinates. The transform and the CRS are crititally important, and related. If the CRS are instructions for how the coordinates can be represented in space and on a flat surface (in the case of projected coordinate systems), then the transform describes how to locate the raster array positions in the correct coordinates given by the CRS.
Note that since the array and the profile are in separate objects it is easy to lose track of one of them, accidentally overwrite it, etc. Try to adopt a naming convention that works for you because they usually need to work together in geospatial operations.
9.4 Pre-processing vector data
Now download a vector shapefile of commercial fishing districts in Alaska.
= 'https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3A7c942c45-1539-4d47-b429-205499f0f3e4'
url
= urllib.request.urlretrieve(url, "Alaska_Commercial_Salmon_Boundaries.gpkg") msg
Read in the data using geopandas
.
import geopandas as gpd
= gpd.read_file("Alaska_Commercial_Salmon_Boundaries.gpkg") comm
Note the “pandas” in the library name “geopandas.” Our comm
object is really just a special type of pandas data frame called a geodataframe. This means that in addition to any geospatial stuff we need to do, we can also just do regular pandas
things on this data frame.
For example, we can get a list of column names (there are a lot!)
comm.columns.values
array(['OBJECTID', 'GEOMETRY_START_DATE', 'GEOMETRY_END_DATE',
'STAT_AREA', 'STAT_AREA_NAME', 'FISHERY_GROUP_CODE',
'GIS_SERIES_NAME', 'GIS_SERIES_CODE', 'REGION_CODE',
'REGISTRATION_AREA_NAME', 'REGISTRATION_AREA_CODE',
'REGISTRATION_AREA_ID', 'REGISTRATION_LOCATION_ABBR',
'MANAGEMENT_AREA_NAME', 'MANAGEMENT_AREA_CODE', 'DISTRICT_NAME',
'DISTRICT_CODE', 'DISTRICT_ID', 'SUBDISTRICT_NAME',
'SUBDISTRICT_CODE', 'SUBDISTRICT_ID', 'SECTION_NAME',
'SECTION_CODE', 'SECTION_ID', 'SUBSECTION_NAME', 'SUBSECTION_CODE',
'SUBSECTION_ID', 'COAR_AREA_CODE', 'CREATOR', 'CREATE_DATE',
'EDITOR', 'EDIT_DATE', 'COMMENTS', 'STAT_AREA_VERSION_ID',
'Shape_Length', 'Shape_Area', 'geometry'], dtype=object)
We can also look at the head of the data frame:
comm.head()
OBJECTID | GEOMETRY_START_DATE | GEOMETRY_END_DATE | STAT_AREA | STAT_AREA_NAME | FISHERY_GROUP_CODE | GIS_SERIES_NAME | GIS_SERIES_CODE | REGION_CODE | REGISTRATION_AREA_NAME | ... | COAR_AREA_CODE | CREATOR | CREATE_DATE | EDITOR | EDIT_DATE | COMMENTS | STAT_AREA_VERSION_ID | Shape_Length | Shape_Area | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12 | 1975-01-01 00:00:00+00:00 | NaT | 33461 | Tanana River mouth to Kantishna River | B | Salmon | B | 3 | Yukon Area | ... | YU | Evelyn Russel | 2006-03-26 00:00:00+00:00 | Sabrina Larsen | 2017-02-02 00:00:00+00:00 | Yukon District, 6 Subdistrict and 6-A Section ... | NaN | 4.610183 | 0.381977 | MULTIPOLYGON (((-151.32805 64.96913, -151.3150... |
1 | 13 | 1975-01-01 00:00:00+00:00 | NaT | 33462 | Kantishna River to Wood River | B | Salmon | B | 3 | Yukon Area | ... | YU | Evelyn Russel | 2006-03-26 00:00:00+00:00 | Sabrina Larsen | 2017-02-02 00:00:00+00:00 | Yukon District, 6 Subdistrict and 6-B Section ... | NaN | 3.682421 | 0.321943 | MULTIPOLYGON (((-149.96255 64.70518, -149.9666... |
2 | 18 | 1978-01-01 00:00:00+00:00 | NaT | 33431 | Toklik to Cottonwood Point | B | Salmon | B | 3 | Yukon Area | ... | YL | Evelyn Russel | 2006-03-26 00:00:00+00:00 | Sabrina Larsen | 2017-02-02 00:00:00+00:00 | Yukon District and 3 Subdistrict until 1/1/1980 | NaN | 2.215641 | 0.198740 | MULTIPOLYGON (((-161.39853 61.55463, -161.4171... |
3 | 19 | 1980-01-01 00:00:00+00:00 | NaT | 33442 | Right Bank, Bishop Rock to Illinois Creek | B | Salmon | B | 3 | Yukon Area | ... | YU | Evelyn Russel | 2006-03-26 00:00:00+00:00 | Sabrina Larsen | 2017-02-02 00:00:00+00:00 | NaN | NaN | 9.179852 | 0.382788 | MULTIPOLYGON (((-153.15234 65.24944, -153.0761... |
4 | 20 | 1980-01-01 00:00:00+00:00 | NaT | 33443 | Left Bank, Cone Point to Illinois Creek | B | Salmon | B | 3 | Yukon Area | ... | YU | Evelyn Russel | 2006-03-26 00:00:00+00:00 | Sabrina Larsen | 2017-02-02 00:00:00+00:00 | NaN | NaN | 9.500826 | 0.378262 | MULTIPOLYGON (((-152.99905 65.17027, -152.9897... |
5 rows × 37 columns
Note the existence of the geometry
column. This is where the actual geospatial points that comprise the vector data are stored, and this brings up the important difference between raster and vector data - while raster data is regularly gridded at a specific resolution, vector data are just points in space.
The diagram above shows the three different types of geometries that geospatial vector data can take, points, lines or polygons. Whatever the geometry type, the geometry information (the x,y points) is stored in the column named geometry
in the geopandas data frame. In this example, we have a dataset containing polygons of fishing districts. Each row in the dataset corresonds to a district, with unique attributes (the other columns in the dataset), and it’s own set of points defining the boundaries of the district, contained in the geometry
column.
'geometry'][:5] comm[
0 MULTIPOLYGON (((-151.32805 64.96913, -151.3150...
1 MULTIPOLYGON (((-149.96255 64.70518, -149.9666...
2 MULTIPOLYGON (((-161.39853 61.55463, -161.4171...
3 MULTIPOLYGON (((-153.15234 65.24944, -153.0761...
4 MULTIPOLYGON (((-152.99905 65.17027, -152.9897...
Name: geometry, dtype: geometry
So, now we know where our x,y points are, where is all of the other information like the crs
? With vector data, all of this information is contained within the geodataframe. We can access the crs
attribute on the data frame and print it like so:
comm.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
We can also plot one geometry cell by simply calling it:
'geometry'][8] comm[
Now that we know a little about what we are working with, let’s get this data ready to analyze. First, we can make a plot of it just to see what we have.
=(9,9)) comm.plot(figsize
<AxesSubplot:>
This plot doesn’t look so good. Turns out, these data are in WGS 84 (EPSG 4326), as opposed to Alaska Albers (EPSG 3338), which is what our raster data are in. To make pretty plots, and allow our raster data and vector data to be analyzed together, we’ll need to reproject the vector data into 3338. To to this, we’ll use the to_crs
method on our comm
object, and specify as an argument the projection we want to transform to.
= comm.to_crs("EPSG:3338")
comm_3338
comm_3338.plot()
<AxesSubplot:>
Much better!
9.5 Crop data to area of interest
For this example, we are only interested in south central Alaska, encompassing Prince William Sound, Cook Inlet, and Kodiak. Our raster data is significantly larger than that, and the vector data is statewide. So, as a first step we might want to crop our data to the area of interest.
First, we’ll need to create a bounding box. We use the box
function from shapely
to create the bounding box, then create a geoDataFrame
from the points, and finally convert the WGS 84 coordinates to the Alaska Albers projection.
from shapely.geometry import box
= box(-159.5, 55, -144.5, 62)
coord_box
= gpd.GeoDataFrame(
coord_box_df = 'EPSG:4326',
crs = [coord_box]).to_crs("EPSG:3338") geometry
Now, we can read in raster again cropped to bounding box. We use the mask
function from rasterio.mask
. Note that we apply this to the connection to the raster file (with rasterio.open(...)
), then update the metadata associated with the raster, because the mask
function requires as its first dataset
argument a dataset object opened in r
mode.
import rasterio.mask
import numpy as np
with rasterio.open("Coastal_2020_08.tif") as ship_con:
= rasterio.mask.mask(ship_con,
shipc_arr, shipc_transform "geometry"],
coord_box_df[=True)
crop= ship_con.meta
shipc_meta # select just the 2-D array (by default a 3-D array is returned even though we only have one band)
= shipc_arr[0,:,:]
shipc_arr # turn the no-data values into NaNs.
== ship_con.nodata] = np.nan shipc_arr[shipc_arr
"driver": "GTiff",
shipc_meta.update({"height": shipc_arr.shape[0],
"width": shipc_arr.shape[1],
"transform": shipc_transform,
"compress": "lzw"})
Now we’ll do a similar task with the vector data. Tin this case, we use a spatial join. The join will be an inner join, and select only rows from the left side (our fishing districts) that are within the right side (our bounding box). I chose this method as opposed to a clipping type operation because it ensures that we don’t end up with any open polygons at the boundaries of our box, which could cause problems for us down the road.
= gpd.sjoin(comm_3338,
comm_clip
coord_box_df,='inner',
how='within') predicate
9.5.1 Check extents
Now let’s look at the two cropped datasets overlayed on each other to ensure that the extents look right.
import rasterio.plot
# set up plot
= plt.subplots(figsize=(7, 7))
fig, ax # plot the raster
rasterio.plot.show(shipc_arr,=ax,
ax= 0,
vmin = 50000,
vmax = shipc_transform)
transform # plot the vector
=ax, facecolor='none', edgecolor='red') comm_clip.plot(ax
<AxesSubplot:>
9.6 Calculate total distance per fishing area
In this step, we rasterize each polygon in the shapefile, such that pixels in or touching the polygon get a value of 1, and pixels not touching it get a value of 0. Then, for each polygon, we extract the indices of the raster array that are equal to 1. We then extract the values of these indicies from the original ship traffic raster data, and calculate the sum of the values over all of those pixels.
Here is a simplified diagram of the process:
9.6.0.1 Zonal statistics over one polygon
Let’s look at how this works over just one fishing area first. We use the rasterize
method from the features
module in rasterio
. This takes as arguments the data to rasterize (in this case the 40th row of our dataset), and the shape and transform the output raster will take. We alo set the all_touched
argument to true, which means any pixel that touches a boundary of our vector will be burned into the mask.
from rasterio import features
= features.rasterize(comm_clip['geometry'][40].geoms,
r40 =shipc_arr.shape,
out_shape=shipc_meta['transform'],
transform=True) all_touched
If we have a look at a plot of our rasterized version of the single fishing district, we can see that instead of a vector, we now have a raster showing the rasterized district (with pixel values of 1) and any area not in the district has a pixel value of 0.
# set up plot
= plt.subplots(figsize=(7, 7))
fig, ax # plot the raster
rasterio.plot.show(r40,=ax,
ax= 0,
vmin = 1,
vmax = shipc_meta['transform'])
transform # plot the vector
=ax, facecolor='none', edgecolor='red') comm_clip.plot(ax
<AxesSubplot:>
A quick call to np.unique
shows our unique values are 0 or 1, which is what we expect.
np.unique(r40)
array([0, 1], dtype=uint8)
Finally, we need to know is the indices of the original raster where the fishing district is. We can use np.where
to extract this information
= np.where(r40 == 1)
r40_index print(r40_index)
(array([108, 108, 108, 108, 108, 109, 109, 109, 109, 109, 109, 109, 109,
109, 109, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110,
110, 110, 110, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111,
111, 111, 111, 111, 112, 112, 112, 112, 112, 112, 112, 112, 112,
112, 112, 112, 112, 112, 112, 113, 113, 113, 113, 113, 113, 113,
113, 113, 113, 113, 113, 113, 113, 113, 114, 114, 114, 114, 114,
114, 114, 114, 114, 114, 114, 115, 115, 115, 115, 115, 115, 116]), array([759, 760, 762, 763, 764, 755, 756, 757, 758, 759, 761, 762, 763,
764, 765, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763,
764, 765, 766, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762,
763, 764, 765, 766, 753, 754, 755, 756, 757, 758, 759, 760, 761,
762, 763, 764, 765, 766, 767, 753, 754, 755, 756, 757, 758, 759,
760, 761, 762, 763, 764, 765, 766, 767, 753, 754, 755, 756, 757,
758, 759, 760, 761, 762, 763, 753, 754, 755, 756, 757, 758, 754]))
In the last step, we’ll using these indices to extract the values of the data from the fishing raster, and sum them to get a total distance travelled.
np.nansum(shipc_arr[r40_index])
14369028.0
Now that we know the individual steps, let’s run this over all of the districts. First we’ll create an id
column in the vector data frame. This will help us track unique fishing districts later.
9.6.0.2 Zonal statistics over all polygons
'id'] = range(0,len(comm_clip)) comm_clip[
For each district (with geometry
and id
), we run the features.rasterize
function. Then we calculate the sum of the values of the shipping raster r_array
based on the indicies in the raster where the district is located.
= {}
distance_dict for geom, idx in zip(comm_clip['geometry'], comm_clip['id']):
= features.rasterize(geom.geoms,
rasterized =shipc_arr.shape,
out_shape=shipc_meta['transform'],
transform=True)
all_touched
= np.where(rasterized == 1)
r_index = np.nansum(shipc_arr[r_index]) distance_dict[idx]
Now we just create a data frame from that dictionary, and join it to the vector data using pandas
operations.
import pandas as pd
# create a data frame from the result
= pd.DataFrame.from_dict(distance_dict,
distance_df ='index',
orient=['distance'])
columns
# extract the index of the data frame as a column to use in a join and convert distance to kilometers
'id'] = distance_df.index
distance_df['distance'] = distance_df['distance']/1000 distance_df[
Now we join the result to the original geodataframe.
# join the sums to the original data frame
= comm_clip.merge(distance_df,
res_full = "id",
on = 'inner') how
Finally, we can plot our result!
import matplotlib.ticker
= plt.subplots(figsize=(7, 7))
fig, ax
= res_full.plot(column = "distance", legend = True, ax = ax)
ax = ax.figure
fig = '{:,.0f}'
label_format = fig.axes[1]
cb_ax = cb_ax.get_yticks().tolist()
ticks_loc
cb_ax.yaxis.set_major_locator(matplotlib.ticker.FixedLocator(ticks_loc))format(x) for x in ticks_loc])
cb_ax.set_yticklabels([label_format.
ax.set_axis_off()"Distance Traveled by Ships in Kilometers")
ax.set_title( plt.show()
From here we can do any additional geopandas
operations we might be interested in. For example, what if we want to calculate the total distance by registration area (a superset of fishing district). We can do that using dissolve
from geopandas.
= res_full.dissolve(by = "REGISTRATION_AREA_NAME",
reg_area = 'sum') aggfunc
/home/jclark/.virtualenvs/scomp/lib/python3.9/site-packages/geopandas/geodataframe.py:1686: FutureWarning:
The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
Let’s have a look at the same plot as before, but this time over our aggregated data.
= plt.subplots(figsize=(7, 7))
fig, ax
= reg_area.plot(column = "distance", legend = True, ax = ax)
ax = ax.figure
fig = '{:,.0f}'
label_format = fig.axes[1]
cb_ax = cb_ax.get_yticks().tolist()
ticks_loc
cb_ax.yaxis.set_major_locator(matplotlib.ticker.FixedLocator(ticks_loc))format(x) for x in ticks_loc])
cb_ax.set_yticklabels([label_format.
ax.set_axis_off()"Distance Traveled by Ships in Kilometers")
ax.set_title( plt.show()
9.7 Summary
We covered a lot of ground here, so let’s recap some of the high level points:
- Raster data consist of regularly gridded values, and can be represented in python as an array
- Vector data consist of any number of points, that might be connected, and is represented in python as a geodataframe
- We can do geospatial operations like changing the projection or cropping the data to a particular extent on both raster and vector data
- You can use vector data to help analyze raster data (and vice versa!) by rasterizing the vector data and using numpy operations on the resulting array.