Introduction
Last week, we spoke about cluster analysis (week 4) to reveal noteworthy spatial patterns in the outcome data (health related events/measures) and ways to assess spatial autocorrelation. In week 3 (Spatial variation in risk), we also used spatial autocorrelation to interpolate the data (kriging) to produce smooth maps of our outcomes.
Starting next week and through the end of the class, we will talk about spatial regression modeling which combines regression modeling (first order trend) and spatial autocorrelation modeling (second order trend).
Today, in this transition session, we will cover some of the ways you can access and generate useful spatial data to incoporate into spatial regression analyses.
Accessing spatial data
Many spatial data have been made directly available within R thanks to bespoke packages, but for some data, you might need to look online or even generate these yourself.
From R packages
You have already used spatial data from the “raster” package via the getData
function. Let’s take a deeper dive on these datasets.
So you can download administrative boundaries (option name = 'GADM'
), elevation (name = 'alt'
) and even climate variables (name = 'worldclim'
).
For the GADM and elevation datasets, you need to specify which country layer you want using the country ISO3 code. These are standard codes. If you are unsure, you can google it, or get a table using the following command:
The level
option gives you different administrative boundaries. Level 0 is country boundary, level 1 is province, level 2 district and for some countries you can get level 3 or 4. For elevation, disabling the mask
option returns an elevation raster for the whole bounding box of the country as opposed to clipped to the country boundary:
If you are requesting WorldClim data, you also need to specify the res
parameter which controls the resolution of the data you would like in minutes (60 minutes in 1 degree). If you are requesting the 0.5 resolution version, these are only available as tiles and you therefore also have to specify lon
and lat
of a point which identifies the tile you would like - e.g. this would be a point in the country of interest which you can obtain using google earth, geojson.io or a similar tool.
Let’s take a look at the two objects with different resolutions:
## class : RasterStack
## dimensions : 900, 2160, 1944000, 12 (nrow, ncol, ncell, nlayers)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## names : prec1, prec2, prec3, prec4, prec5, prec6, prec7, prec8, prec9, prec10, prec11, prec12
## min values : 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## max values : 885, 736, 719, 820, 955, 1850, 2088, 1386, 904, 980, 893, 914
## class : RasterStack
## dimensions : 3600, 3600, 12960000, 12 (nrow, ncol, ncell, nlayers)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 30, 60, 0, 30 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## names : prec1_27, prec2_27, prec3_27, prec4_27, prec5_27, prec6_27, prec7_27, prec8_27, prec9_27, prec10_27, prec11_27, prec12_27
## min values : 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## max values : 126, 137, 240, 697, 289, 360, 458, 457, 340, 450, 584, 184
The WorldClim data are actually raster stacks, i.e. 12 rasterLayers stacked together in a single (cube) object referring to the long term average for each month. You can extract specific months (layers of the stack) using the [[]]
command as per an R list.
As you are dealing with tiles rather than country level datasets when using 0.5 resolution, in some situations you will need to obtain and merge more than 1 tile. For example, if we wanted to obtain the 0.5 resolution October precipitation for South Sudan, we would have to down load both required tiles and merge.
Other R packages
Here are a few R packages with spatial data available:
- raster
- getSpatialData
- rgistools
- rstoolbox
- sentinel2
- MODIS
Online
Often times those R packages simply provide easier access to data hosted online by various research groups and can have useful additional functions to process the data. Sometimes though, you just need access to the raw data and will need to download the data from an online source (which can come in many different formats) and import in R yourself.
Download the precipitation WorldClim variable at 10’ resolution available here. What you get is a folder with 12 .tif files, representing monthly precipitation (in mm and averaged over the 1970-2000 period), over the entire globe. As you encounter new file formats that you don’t know how to import in R, it’s good practice to scout forums. Let’s google How to import .tif files in R? and find our answer on that post. Fortunately, it looks like the raster function of the ‘raster’ package reads in .tif files.
## [1] "readme.txt" "wc2.0_10m_prec_01.tif"
## [3] "wc2.0_10m_prec_02.tif" "wc2.0_10m_prec_03.tif"
## [5] "wc2.0_10m_prec_04.tif" "wc2.0_10m_prec_05.tif"
## [7] "wc2.0_10m_prec_06.tif" "wc2.0_10m_prec_07.tif"
## [9] "wc2.0_10m_prec_08.tif" "wc2.0_10m_prec_09.tif"
## [11] "wc2.0_10m_prec_10.tif" "wc2.0_10m_prec_11.tif"
## [13] "wc2.0_10m_prec_12.tif"
## [1] "wc2.0_10m_prec_01.tif" "wc2.0_10m_prec_02.tif"
## [3] "wc2.0_10m_prec_03.tif" "wc2.0_10m_prec_04.tif"
## [5] "wc2.0_10m_prec_05.tif" "wc2.0_10m_prec_06.tif"
## [7] "wc2.0_10m_prec_07.tif" "wc2.0_10m_prec_08.tif"
## [9] "wc2.0_10m_prec_09.tif" "wc2.0_10m_prec_10.tif"
## [11] "wc2.0_10m_prec_11.tif" "wc2.0_10m_prec_12.tif"
## class : RasterLayer
## dimensions : 1080, 2160, 2332800 (nrow, ncol, ncell)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : https://raw.githubusercontent.com/HughSt/HughSt.github.io/master/_posts/Week_5_Spatial_Epi_files/wc2/wc2.0_10m_prec_10.tif
## names : wc2.0_10m_prec_10
## values : 0, 2328 (min, max)
## class : RasterLayer
## dimensions : 900, 2160, 1944000 (nrow, ncol, ncell)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## source : /Users/francoisrerolle/Desktop/wc10/prec10.bil
## names : prec10
## values : 0, 980 (min, max)
## class : RasterStack
## dimensions : 1080, 2160, 2332800, 12 (nrow, ncol, ncell, nlayers)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## names : wc2.0_10m_prec_01, wc2.0_10m_prec_02, wc2.0_10m_prec_03, wc2.0_10m_prec_04, wc2.0_10m_prec_05, wc2.0_10m_prec_06, wc2.0_10m_prec_07, wc2.0_10m_prec_08, wc2.0_10m_prec_09, wc2.0_10m_prec_10, wc2.0_10m_prec_11, wc2.0_10m_prec_12
## min values : 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## max values : 908, 793, 720, 1004, 2068, 2210, 2381, 1674, 1955, 2328, 718, 806
## class : RasterLayer
## dimensions : 1080, 2160, 2332800 (nrow, ncol, ncell)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : https://raw.githubusercontent.com/HughSt/HughSt.github.io/master/_posts/Week_5_Spatial_Epi_files/wc2/wc2.0_10m_prec_10.tif
## names : wc2.0_10m_prec_10
## values : 0, 2328 (min, max)
Other online resources
Here are a few good websites with free downloadable spatial data
Spatial data: where do they come from?
We often interact with climatic and environmental raster data, like the land use dataset shown below. But how are these datasets produced? The answer is via remotely sensed imagery, such as images taken by satellites, planes or drones. These data can sometimes be combined with ground data, such as data from weather stations.
Remote sensing data
Satellites orbiting around the globe carry instruments to take measurements. Just likes our eyes or cameras, the sensors onboard satellites receive a radiation that was emitted by a source of light (sun, flash, light bulb, etc…) and reflected by an object. The differences between the pre and post reflection radiations characterize properties of the reflecting object such as color for which our eyes have been optimized to perceive. You can picture the radiation as a wave signal. Speed radars and the SRTM mission act in the same way (except they also send the incident radiation) and from the difference in the 2 wave signals calculate speed and elevation respectively.
Speed, elevation, color are calculated based on the wavelengths property of the wave. Sensors onboard remote sensing satelitte mission can receive multiple waves across the spectrum.
For instance the satellite Landsat-8 has 11 bands (i.e wavelengths ranges) at which it measures the reflectance (fraction of incident electromagnetic power that is reflected) which we can simply see as a measure of ‘light’ intensity reflected by the earth surface.
## Landsat_2017_Band1
## Min. 11.0
## 1st Qu. 338.5
## Median 404.0
## 3rd Qu. 494.5
## Max. 2538.5
## NA's 0.0
How do you get from reflectance to land use surfaces?
First you can read some documentation about the wavelength ranges covered by band for your satellite.
Pop quiz: what do you think is in the top right corner of our spatial extent?
Notice the difference in shading and range of legends between the different bands. This is because different surface features reflect the incident solar radiation differently. Each layer represent how much incident solar radiation is reflected for a particular wavelength range. For example, vegetation reflects more energy in NIR than other wavelengths and thus appears brighter. In contrast, water absorbs most of the energy in the NIR wavelength and it appears dark.
We do not gain that much information from these grey-scale plots; they are often combined into a “composite” to create more interesting plots. To make a “true (or natural) color” image, that is, something that looks like a normal photograph (vegetation in green, water blue etc), we need bands in the red, green and blue regions. For this Landsat image, band 4 (red), 3 (green), and 2 (blue) can be used. The plotRGB method can be used to combine them into a single composite. You can also supply additional arguments to plotRGB to improve the visualization (e.g. a linear stretch of the values, using stretch = “lin”).
The true-color composite reveals much more about the landscape than the earlier gray images. Another popular image visualization method in remote sensing is known “false color” image in which NIR, red, and green bands are combined. This representation is popular as it makes it easy to see the vegetation (in red).
Relation between bands
A scatterplot matrix can be helpful in exploring relationships between raster layers. This can be done with the pairs() function of the “raster” package.
Plot of reflection in the ultra-blue wavelength against reflection in the blue wavelength.
Plot of reflection in the red wavelength against reflection in the NIR wavelength.
The first plot reveals high correlations between the blue wavelength regions. Because of the high correlation, we can just use one of the blue bands without losing much information. The distribution of points in second plot (between NIR and red) is unique due to its triangular shape. Vegetation reflects more in the NIR range than in the red and creates the upper corner close to NIR (y) axis. Water absorbs energy from all the bands and occupies the location close to origin. The furthest corner is created due to highly reflecting surface features like bright soil or concrete.
Spectral profiles
A plot of the spectrum (all bands) for pixels representing a certain earth surface features (e.g. water) is known as a spectral profile. Such profiles demonstrate the differences in spectral properties of various earth surface features and constitute the basis for image analysis.
Let’s start by loading some training data from the region which compiles GPS coordinates of points for which we know the Land use and Land cover (LULC) class.
## Latitude Longitude LULC
## 1 11.75859 37.26399 12
## 2 11.70965 37.14431 12
## 3 11.79550 37.38545 17
## 4 11.78845 37.15329 10
## 5 11.50553 37.15888 10
## 6 11.79360 37.29655 10
##
## Savanna Grassland Wetland Cropland Urban Water
## 4 481 27 328 4 156
You can now convert the data frame to a spatial object to extract reflectance values at training points. Spectral values can be extracted from any multispectral data set using extract() function of the “raster” package.
## LULC Ultra.blue Blue Green Red NIR SWIR1 SWIR2
## 1 Cropland 372 512 823 991 2494 2548 1620
## 2 Cropland 375 483 795 1183 2038 1886 1613
## 3 Water 603 780 1378 1111 259 171 141
## 4 Grassland 364 485 806 917 1767 1524 997
## 5 Grassland 355 437 693 765 1972 1990 1427
## 6 Grassland 407 499 759 868 2254 2161 1433
We can now compute the mean reflectance values for each class and each band and plot to obtain the spectral profiles.
## LULC Ultra.blue Blue Green Red NIR SWIR1 SWIR2
## 1 Savanna 404.00 455.25 702.12 747.62 2303.88 1639.50 1085.62
## 2 Grassland 367.07 450.39 731.63 883.92 2137.84 2044.79 1478.32
## 3 Wetland 360.65 440.81 760.96 664.87 1929.70 1099.56 586.33
## 4 Cropland 408.25 503.26 794.93 1030.95 2308.46 2169.48 1615.76
## 5 Urban 476.38 580.12 848.00 957.88 2128.75 1884.00 1393.12
## 6 Water 571.12 731.72 1313.05 1131.43 302.03 170.37 136.47
The spectral profiles show (dis)similarity in the reflectance of different features on the earth’s surface (or above it). ‘Water’ shows relatively low reflection in all wavelengths, while all other classes have relatively high reflectance in the longer wavelengts. Remember though that while our training data is made up of 1000 points across 6 LULC classes, we have relatively few observations for ‘Savanna’, ‘Wetland’ and ‘Urban’ classes.
Classification
So now we have observed classes and the spectral values for each of those observations, it is possible to build a classification model. There are a number of different ways of fitting such a classifcation (sometimes called multi-nomial) model. Here we are going to use a random forest model which is a type of decision tree machine learning model. We are going to use the “randomForest” package.
##
## Call:
## randomForest(x = Training_Data_Reflectance[, 2:8], y = Training_Data_Reflectance[, 1])
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 22.9%
## Confusion matrix:
## Savanna Grassland Wetland Cropland Urban Water class.error
## Savanna 0 1 2 1 0 0 1.00000000
## Grassland 1 378 8 92 0 2 0.21413721
## Wetland 1 10 12 0 0 4 0.55555556
## Cropland 0 99 1 228 0 0 0.30487805
## Urban 0 2 0 2 0 0 1.00000000
## Water 0 1 2 0 0 153 0.01923077
By inspecting the output of the model, we realize it is not doing a super job. The OOB estimate of error rate
refers to ‘out of bag’ error rate. In this context, out of bag is similar to cross-vadliation we explored as part of interpolation - i.e. how well the model is predicting data not included in the model. Again remember we only have a few observations in our training dataset. Classification of the ‘water’ class is pretty good, thanks to its particular spectral profile. Now that we have a model of the relationship between land cover and reflectance values, we can predict land cover over our entire region.
##
|
| | 0%
|
|================ | 25%
|
|================================ | 50%
|
|================================================= | 75%
|
|=================================================================| 100%
##
Concluding remarks
Now you have a sense of how remote sensing can be used to produce all sort of environmental layers. The quality of those will depend on the spatial and temporal resolutions of the sensors as well as the techniques used to process the raw data (clouds?) and model it to produce more useful layers.
Resources
This week borrows heavily from this excellent tutorial
Readings
Hay SI. An overview of remote sensing and geodesy for epidemiology and public health application. Advances in parasitology. 2000 Jan 1;47:1-35.
Simoonga C, Utzinger J, Brooker S, Vounatsou P, Appleton CC, Stensgaard AS, Olsen A, Kristensen TK. Remote sensing, geographical information system and spatial analysis for schistosomiasis epidemiology and ecology in Africa. Parasitology. 2009 Nov;136(13):1683-93.
Savory DJ, Andrade-Pacheco R, Gething PW, Midekisa A, Bennett A, Sturrock HJ. Intercalibration and Gaussian process modeling of nighttime lights imagery for measuring urbanization trends in Africa 2000–2013. Remote Sensing. 2017 Jul;9(7):713.
Yang GJ, Vounatsou P, Xiao-Nong Z, Utzinger J, Tanner M. A review of geographic information system and remote sensing with applications to the epidemiology and control of schistosomiasis in China. Acta tropica. 2005 Nov 1;96(2-3):117-29.