Week 2 - Manipulating spatial data

In week 1, you got to load up some spatial data and make some pretty maps. This week, we will be stepping up a gear and learning how to crop and subset spatial data. We will also be go through the process of resampling rasters.

Learning outcomes

By the end of this week, you will be able to:

Clip and subset vector and raster data
Resample rasters
Relate spatial data

Load the necessary libraries for this week

library(sp)
library(raster)
library(leaflet)
library(rgdal)
library(geosphere)
library(rgeos)
library(wesanderson)
library(stats)
library(ggplot2)

First we are going to subset some spatial (polygon) data. For this exersize, we are going to use the admin 1 boundaries for Ethiopia we used in week 1. As a reminder, we can load these in from a local shapefile using the readOGR function, or we can use the handy getData function from the raster package to access GADM data.

ETH_Adm_1 <- raster::getData("GADM", country="ETH", level = 1)

You can subset a SpatialPolygonsDataFrame just like a data frame. Let’s subset the data first by row/polygon

ETH_Adm_1_cropped <- ETH_Adm_1[1,]

# Get a summary of the cropped data
ETH_Adm_1_cropped

## class       : SpatialPolygonsDataFrame 
## features    : 1 
## extent      : 38.6394, 38.90624, 8.833486, 9.098195  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 10
## names       : GID_0,   NAME_0,   GID_1,      NAME_1,                                     VARNAME_1, NL_NAME_1,    TYPE_1, ENGTYPE_1, CC_1, HASC_1 
## value       :   ETH, Ethiopia, ETH.1_1, Addis Abeba, Āddīs Ābaba|Addis Ababa|Adis-Abeba|Ādīs Ābeba,        NA, Astedader,      City,   14,  ET.AA

# Plot over the top of the full dataset
plot(ETH_Adm_1_cropped)
lines(ETH_Adm_1_cropped, col="red", lwd=2)

You can also subset by name. For example, if we wanted to extract the polygon representing the boundary of the province “Amhara”

ETH_Adm_1_Amhara <- subset(ETH_Adm_1, ETH_Adm_1$NAME_1=="Amhara") #OR ETH_Adm_1[ETH_Adm_1$NAME_1=="Amhara",] will also work
ETH_Adm_1_Amhara

## class       : SpatialPolygonsDataFrame 
## features    : 1 
## extent      : 35.25711, 40.21244, 8.714812, 13.7687  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 10
## names       : GID_0,   NAME_0,   GID_1, NAME_1, VARNAME_1, NL_NAME_1, TYPE_1, ENGTYPE_1, CC_1, HASC_1 
## value       :   ETH, Ethiopia, ETH.3_1, Amhara,     Amara,        NA,  Kilil,     State,   03,  ET.AM

#Plot the result
plot(ETH_Adm_1)
lines(ETH_Adm_1_Amhara, col="blue", lwd=2)

### Pop quiz * How would you plot all provinces except Amhara? * Try plotting all province using leaflet, with Amhara colored red and all others colored orange.

Spatial overlays

Often, we have point and polygon data and wish to relate them. For example, we might want to summarize point data over regions. To illustrate this, we are going to use the Ethiopia malaria point prevalence data and aggregate that to provincial level to get a provincial level estimate of prevalence.

# Get the point prevalence data from the GitHub repo
ETH_malaria_data <- read.csv("https://raw.githubusercontent.com/HughSt/HughSt.github.io/master/course_materials/week1/Lab_files/Data/mal_data_eth_2009_no_dups.csv",header=T)

# Convert to a SPDF
ETH_malaria_data_SPDF <- SpatialPointsDataFrame(coords = ETH_malaria_data[,c("longitude", "latitude")],
                                      data = ETH_malaria_data[,c("examined", "pf_pos", "pf_pr")],
                                      proj4string = CRS("+init=epsg:4326"))

To identify the Province each point lies within you can use the over function from the sp package

ETH_Adm_1_per_point <- over(ETH_malaria_data_SPDF, ETH_Adm_1)
  Error in .local(x, y, returnList, fn, ...) : identicalCRS(x, y) is not TRUE

This throws an error, because ETH_malaria_data_SPDF and ETH_Adm_1 do not have exactly the same coordinate reference system (CRS). Let’s take a look

crs(ETH_Adm_1)

## CRS arguments: +proj=longlat +datum=WGS84 +no_defs

crs(ETH_malaria_data_SPDF)

## CRS arguments: +proj=longlat +datum=WGS84 +no_defs

To reproject to the same CRS, you can use the spTransform function from the sp package

ETH_malaria_data_SPDF <- spTransform(ETH_malaria_data_SPDF, crs(ETH_Adm_1))

# Check the new;y projected object
ETH_malaria_data_SPDF

## class       : SpatialPointsDataFrame 
## features    : 203 
## extent      : 34.5418, 42.4915, 3.8966, 9.9551  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 3
## names       : examined, pf_pos,       pf_pr 
## min values  :       37,      0,           0 
## max values  :      221,     14, 0.127272727

Now we can re-run the over command

ETH_Adm_1_per_point <- over(ETH_malaria_data_SPDF, ETH_Adm_1)

This gives us a table where each row represents a point from ETH_malaria_data_SPDF and columns represent the data from ETH_Adm_1. Let’s take a look

head(ETH_Adm_1_per_point)

##   GID_0   NAME_0   GID_1 NAME_1 VARNAME_1 NL_NAME_1 TYPE_1 ENGTYPE_1 CC_1
## 1   ETH Ethiopia ETH.8_1 Oromia   Oromiya      <NA>  Kilil     State   04
## 2   ETH Ethiopia ETH.8_1 Oromia   Oromiya      <NA>  Kilil     State   04
## 3   ETH Ethiopia ETH.8_1 Oromia   Oromiya      <NA>  Kilil     State   04
## 4   ETH Ethiopia ETH.8_1 Oromia   Oromiya      <NA>  Kilil     State   04
## 5   ETH Ethiopia ETH.8_1 Oromia   Oromiya      <NA>  Kilil     State   04
## 6   ETH Ethiopia ETH.8_1 Oromia   Oromiya      <NA>  Kilil     State   04
##   HASC_1
## 1  ET.OR
## 2  ET.OR
## 3  ET.OR
## 4  ET.OR
## 5  ET.OR
## 6  ET.OR

Now we can use this to calculate admin unit specific statistics. We might be interested in the number of sites per admin unit. To get that, we could just create a frequency table

table(ETH_Adm_1_per_point$NAME_1)

## 
## Benshangul-Gumaz  Gambela Peoples           Oromia 
##                1                1              201

Or we can use the tapply function for more complex calculations. tapply allows us to apply a function across groups. Let’s look at the number examined per admin unit

Nex_per_Adm1 <- tapply(ETH_malaria_data_SPDF$examined, ETH_Adm_1_per_point$NAME_1, sum)
Nex_per_Adm1

## Benshangul-Gumaz  Gambela Peoples           Oromia 
##              109              108            24350

Now let’s get the number of positives by admin unit

Npos_per_Adm1 <- tapply(ETH_malaria_data_SPDF$pf_pos, ETH_Adm_1_per_point$NAME_1, sum)
Npos_per_Adm1

## Benshangul-Gumaz  Gambela Peoples           Oromia 
##                1                0               78

From these numbers, we can calculate the prevalence per province

prev_per_Adm1 <- Npos_per_Adm1 / Nex_per_Adm1
prev_per_Adm1

## Benshangul-Gumaz  Gambela Peoples           Oromia 
##      0.009174312      0.000000000      0.003203285

If you want to merge these provincial prevalence estimates back into the province object ETH_Adm_1 it is best practice to create a new table of prevalence by provice with unique ID for each province. That unique ID can be used to relate and merge the data with ETH_Adm_1.

First convert your prev_per_Adm1 vector into a dataframe with an ID column

prev_per_Adm1_df <- data.frame(NAME_1 = names(prev_per_Adm1),
                               prevalence = prev_per_Adm1,
                               row.names=NULL)

Now merge this with the ETH_Adm_1 data frame

ETH_Adm_1 <- merge(ETH_Adm_1, prev_per_Adm1_df,
                  by = "NAME_1")

You can now see that the additional prevalence field has been added

head(ETH_Adm_1)

##             NAME_1 GID_0   NAME_0   GID_1
## 1      Addis Abeba   ETH Ethiopia ETH.1_1
## 2             Afar   ETH Ethiopia ETH.2_1
## 3           Amhara   ETH Ethiopia ETH.3_1
## 4 Benshangul-Gumaz   ETH Ethiopia ETH.4_1
## 5        Dire Dawa   ETH Ethiopia ETH.5_1
## 6  Gambela Peoples   ETH Ethiopia ETH.6_1
##                                       VARNAME_1 NL_NAME_1    TYPE_1 ENGTYPE_1
## 1 Āddīs Ābaba|Addis Ababa|Adis-Abeba|Ādīs Ābeba      <NA> Astedader      City
## 2                                                    <NA>     Kilil     State
## 3                                         Amara      <NA>     Kilil     State
## 4                              Beneshangul Gumu      <NA>     Kilil     State
## 5                                                    <NA> Astedader      City
## 6                                       Gambela      <NA>     Kilil     State
##   CC_1 HASC_1  prevalence
## 1   14  ET.AA          NA
## 2   02  ET.AF          NA
## 3   03  ET.AM          NA
## 4   06  ET.BE 0.009174312
## 5   15  ET.DD          NA
## 6   12  ET.GA 0.000000000

We can now plot province colored by prevalence. Let’s use the leaflet package

# First define a color palette based on prevalence
colorPal <- colorNumeric(wes_palette("Zissou1")[1:5], ETH_Adm_1$prevalence)

# Plot with leaflet
leaflet() %>% addProviderTiles("CartoDB.Positron") %>% addPolygons(data=ETH_Adm_1, 
                                         col=colorPal(ETH_Adm_1$prevalence),
                                         fillOpacity=0.6) %>%
                                         addLegend(pal = colorPal, 
                                         values = ETH_Adm_1$prevalence,
                                         title = "Prevalence")

Notes on table joins

In this example, we joined the table of prevalence values for each State using the common NAME_1 field Often, however, when you are joining a table of data to some spatial data, they are from different sources and the common field on which to match rows can be formatted differently. For example if you were merging a table of state level data for the USA using the state name, your table might have an entry for California which would not match with your spatial data if it has california as the corresponding entry. It is always preferable to use ID codes over names for matching as these are often less variable. If you do have to use a character string such as name, the following functions are useful ways to reformat characters to make sure they match:

substr - this allows you to extract substrings, for example substr("Cali", 1,2) extracts the first to second characters and would give you back Ca
tolower - converts all characters to lower case. toupper is the reverse.
gsub - allows you to replace characters. For example gsub(" ", "-", "CA USA") would replace any whitespace with -, i.e. in this example it would return CA-USA.

See here for a worked example of table joins in R.

Pop quiz

Try generating the same plot using a different color palette.
How would you plot only the provinces for which you have prevalence estimates?

Manipulating raster data

You’ve now seen how to subset polygons and relate point and polygon data. Now we are going to look at basic manipulations of raster data. We are going to load 2 raster file, elevation and land use for Ethiopia.

# Get elevation using the getData function from the raster package
ETH_elev <- raster::getData("alt", country="ETH")

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum Unknown based on WGS84 ellipsoid in CRS definition

plot(ETH_elev)

# Land use (# For information on land use classifications see http://due.esrin.esa.int/files/GLOBCOVER2009_Validation_Report_2.2.pdf)
ETH_land_use <- raster("https://github.com/HughSt/HughSt.github.io/blob/master/course_materials/week2/Lab_files/ETH_land_use.tif?raw=true")
ETH_land_use

## class      : RasterLayer 
## dimensions : 4121, 5384, 22187464  (nrow, ncol, ncell)
## resolution : 0.002777778, 0.002777778  (x, y)
## extent     : 33.00139, 47.95694, 3.398611, 14.84583  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## source     : https://github.com/HughSt/HughSt.github.io/blob/master/course_materials/week2/Lab_files/ETH_land_use.tif?raw=true 
## names      : ETH_land_use.tif.raw.true 
## values     : 11, 210  (min, max)

#Plot the land use raster
plot(ETH_land_use)

# For a break down of the classes in Ethiopia aka how often each land use type occurs
#(Note: this is just the number of pixels per land use type - NOT acres)
table(ETH_land_use[])

## 
##      11      14      20      30      40      60      90     110     120     130 
##  105074  336324 2451623 2223305  129786  813227       8 5254463  580793 1900890 
##     140     150     160     170     180     190     200     210 
## 1403832 1998099    6765      18   26026    6375 2947145 2003711

Resampling rasters

Its good practice to resample rasters to the same extent and resolution (i.e. same grid). This makes it easier to deal with later and to relate rasters to each other. The resample command in the raster package makes this process easy. Here we are going to resample our land use raster, but for a deeper dive on resampling rasters of different data types, see here.

The default method is bilinear interpolation, which doesn’t make sense for our categorical variable, so we should use the nearest neighbour function ‘ngb’

# Takes a little time to run..
ETH_land_use_resampled <- resample(ETH_land_use, ETH_elev, method="ngb") 

# Get summaries of both raster objects to check resolution and extent
# and to see whether resampled values look right
ETH_land_use_resampled

## class      : RasterLayer 
## dimensions : 1416, 1824, 2582784  (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 32.9, 48.1, 3.2, 15  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## source     : /private/var/folders/1t/vvyjk4r52pn8rx79p7cl5s4r0000gn/T/RtmpHBOpLV/raster/r_tmp_2021-02-24_143834_26120_37754.grd 
## names      : ETH_land_use.tif.raw.true 
## values     : 11, 210  (min, max)

ETH_elev

## class      : RasterLayer 
## dimensions : 1416, 1824, 2582784  (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 32.9, 48.1, 3.2, 15  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## source     : /Users/sturrockh/Documents/Work/MEI/DiSARM/GitRepos/spatial-epi-course/_posts/ETH_msk_alt.grd 
## names      : ETH_msk_alt 
## values     : -189, 4420  (min, max)

Manipulating rasters

It is often the case that we want to change the resolution of a raster for analysis. For example, for computational reasons we might want to work at a coarser resolution. First, let’s check the resolution

res(ETH_elev) # in decimal degrees. 1 dd roughly 111km at the equator

## [1] 0.008333333 0.008333333

Let’s aggregate (make lower resolution) by a factor of 10

ETH_elev_low_res <- aggregate(ETH_elev, fact = 10) # by default, calculates mean
res(ETH_elev_low_res)

## [1] 0.08333333 0.08333333

plot(ETH_elev_low_res)

You can change the values of the pixels easily. For example, if you want to change the ETH_elev raster from its native meters to feet, you can mulitply by 3.28

ETH_elev_feet <- ETH_elev*3.28
plot(ETH_elev_feet)

Similarly, you can categorize raster values

ETH_elev_categorized <- cut(ETH_elev, 4)
plot(ETH_elev_categorized)

If a raster is the same resolution and extent, you can perform joint operations on them, for example subtract values of one from another

new_raster <- ETH_elev - ETH_land_use_resampled # Meaningless! Just for illustrative purposes..
plot(new_raster)

Extracting data from rasters

Now let’s extract values of elevation at each survey point. You can use the extract function from the raster package and insert the extracted values as a new field on ETH_malaria_data_SPDF

ETH_malaria_data_SPDF$elev <- extract(ETH_elev, ETH_malaria_data_SPDF)
ETH_malaria_data_SPDF # now has 3 variables

## class       : SpatialPointsDataFrame 
## features    : 203 
## extent      : 34.5418, 42.4915, 3.8966, 9.9551  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 4
## names       : examined, pf_pos,       pf_pr, elev 
## min values  :       37,      0,           0,  817 
## max values  :      221,     14, 0.127272727, 2451

You can also extract values using polygons e.g to get admin 1 level elevations. You just have to define a function to apply, otherwise you get all the pixel values per polygon. For very large rasters, check out the velox package.

ETH_Adm_1$elev <- extract(ETH_elev, ETH_Adm_1, fun=mean, na.rm=TRUE) # takes a little longer..

## Warning in .getRat(x, ratvalues, ratnames, rattypes): NAs introduced by coercion

Exploratory spatial analysis

We can now have a quick look at the relationship between prevalence and elevation. First generate a prevalence variable

ETH_malaria_data_SPDF$prevalence <- ETH_malaria_data_SPDF$pf_pos / ETH_malaria_data_SPDF$examined

Now you can plot the relationship between prevalence and elevation

ggplot(ETH_malaria_data_SPDF@data) + geom_point(aes(elev, prevalence))

You might also be interested in distances to/from other features (e.g. health facilities, water). Here we are going to load up a waterbody layer (obtained via http://www.diva-gis.org/Data) and calculate distance from each point. In this case, the file is in GeoJSON format instead of Shapefile. readOGR is able to handle GeoJSON easily.

waterbodies <- readOGR("https://raw.githubusercontent.com/HughSt/HughSt.github.io/master/course_materials/week2/Lab_files/ETH_waterbodies.geojson")

## OGR data source with driver: GeoJSON 
## Source: "https://raw.githubusercontent.com/HughSt/HughSt.github.io/master/course_materials/week2/Lab_files/ETH_waterbodies.geojson", layer: "ETH_waterbodies"
## with 380 features
## It has 5 fields

waterbodies

## class       : SpatialPolygonsDataFrame 
## features    : 380 
## extent      : 33.00001, 46.80059, 4.232061, 14.55  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 5
## names       : ISO,  COUNTRY,                 F_CODE_DES,                             HYC_DESCRI,                  NAME 
## min values  : ETH, Ethiopia,               Inland Water, Non-Perennial/Intermittent/Fluctuating, ABAY WENZ (BLUE NILE) 
## max values  : ETH, Ethiopia, Land Subject to Inundation,                    Perennial/Permanent,           ZIWAY HAYK'

plot(waterbodies)

The goesphere package has some nice functions such as dist2Line which calculates distance in meters from spatial data recorded using decimal degrees. Warning: takes a little while to compute

dist_to_water <- dist2Line(ETH_malaria_data_SPDF, waterbodies)

This produces a matrix, where each row represents each point in ETH_malaria_data_SPDF and the first column is the distance in meters to the nearest waterbody

head(dist_to_water)

##       distance      lon      lat  ID
## [1,] 116153.32 38.03253 6.426888 363
## [2,] 163802.37 38.56778 7.082849 358
## [3,] 137683.25 40.65231 8.808335 238
## [4,] 173427.48 37.62588 5.794201 367
## [5,]  40482.23 37.00735 4.811708 365
## [6,] 163677.02 38.56340 7.075962 358

# Can add to your data frame by extracting the first column
ETH_malaria_data_SPDF$dist_to_water <- dist_to_water[,1]

If the objects you are interested in calucating distance to are points as opposed to polygons/lines (as above) you first have to calculate the distance to every point and then identify the minimum. For example, imagine waterbodies data was only available as a point dataset (we can fake this by calculating the centroid of each polygon)

waterbodies_points <- gCentroid(waterbodies, byid=TRUE)

Now calucate a distance matrix showing distances between each observation and each waterbody point. the distm function creates a distance matrix between every pair of data points and waterbody points in meters.

dist_matrix <- distm(ETH_malaria_data_SPDF, waterbodies_points)

Then use the apply function to apply the ‘minimum’ function to each row (as each row represents the distance of every waterbody point from our first observation)

ETH_malaria_data_SPDF$dist_to_water_point <- apply(dist_matrix, 1, min)

The alternative, much faster, but potentially less accurate method to ‘distm’, is to use the nn2 function from the RANN package. This allows you to calculate the nearest point from each observation and then you can use the ‘distGeo’ function from the ‘geosphere’ package to calculate the distance in meters. The reason this could be inaccurate, is that the nearest point in decimal degrees might not be the nearest in meters as degrees are not a good measure of distance. In most cases, you are probably OK.

library(RANN)
library(geosphere)

# Get the index of the waterbody points that are nearest to each observation
nn <- nn2(waterbodies_points@coords,ETH_malaria_data_SPDF@coords, 
          k=1)
          
# Calculate the distance in meters between each observation and its nearest waterbody point   
ETH_malaria_data_SPDF$dist_to_water_point <- distGeo(ETH_malaria_data_SPDF@coords,
        waterbodies_points@coords[nn$nn.idx,])

Useful resources

The raster package vignette is extremely useful
If you are bumping into speed issues extracting/summarizing raster data, have a look at the velox package

Key readings

This week is all about practice. Instead of working through journal articles, have a play with your data and get to know how the functions work.

Pop quiz answers

Can be found here