This week we are going to explore methods to understand and predict risk across space from point data. These may be point-level data (i.e. measurements of something of interest at particular locations) or point process data (i.e. occurrences of events in a given area). When you load this week's libraries, you may be prompted to download XQuartz.
First, load up some obfuscated malaria case-control data from Namibia. These data consist of the latitudes and longitudes of cases and controls.
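A minimal sketch of what loading and inspecting the data might look like (the file name and the object name CaseControl are assumptions; adjust to wherever your copy of the data lives):

```r
# Read in the obfuscated case-control data (file name is an assumption)
CaseControl <- read.csv("CaseControl.csv")

# Inspect the first few rows
head(CaseControl)
```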
## household_id lat long case
## 1 1 -17.51470 16.05666 1
## 2 2 -17.82175 16.15147 1
## 3 3 -17.78743 15.93465 1
## 4 4 -17.51352 15.83933 1
## 5 5 -17.63668 15.91185 1
## 6 6 -17.64459 16.16105 1
To set ourselves up for further analyses, let’s create objects of just cases and just controls
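For example, using the case column shown above (object names are illustrative):

```r
# Cases are coded 1, controls 0
cases <- subset(CaseControl, case == 1)
controls <- subset(CaseControl, case == 0)
```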
We are also going to create a SpatialPointsDataFrame of the case-control data
And get hold of a boundary file for Namibia
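A sketch of both steps, assuming the sp and raster packages are available and using the column names shown above (object names are assumptions):

```r
library(sp)
library(raster)

# Promote the data frame to a SpatialPointsDataFrame (longitude first, then latitude)
CaseControl_SPDF <- SpatialPointsDataFrame(coords = CaseControl[, c("long", "lat")],
                                           data = CaseControl[, c("household_id", "case")])

# National (level 0) boundary for Namibia from GADM, via the raster package
NAM_Adm0 <- raster::getData("GADM", country = "NAM", level = 0)
```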
Let’s plot and see what we have. First, create a color scheme based on the case classification (0 or 1)
Then, plot
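One way this might look, reusing the objects sketched above:

```r
# Index a two-color vector by case status: 0 -> blue (control), 1 -> red (case)
case_color_scheme <- c("blue", "red")[CaseControl$case + 1]

# Plot the boundary and overlay the points
plot(NAM_Adm0)
points(CaseControl_SPDF, col = case_color_scheme, pch = 16, cex = 0.5)
```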
Risk Mapping using Kernel Density
To generate a kernel density estimate, we first need to create a point pattern object (aka ppp). Before we can do that, we need to define a window representing the area (the population at risk) from which the cases arose
Now we can define the ppp object of the cases
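A sketch of both steps with spatstat. The original analysis probably used the Namibia boundary as the window; a simple bounding box of the points is used here as an approximation:

```r
library(spatstat)

# Observation window: bounding box of all points (a polygonal owin could be used instead)
namibia_owin <- owin(xrange = range(CaseControl$long), yrange = range(CaseControl$lat))

# Point pattern object of the cases only
cases_ppp <- ppp(x = cases$long, y = cases$lat, window = namibia_owin)
```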
We can now generate and plot a kernel density estimate of cases
It's possible to use different bandwidths. The larger the bandwidth, the smoother the density estimate.
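For example (the sigma value is illustrative; the coordinates are in decimal degrees):

```r
# Kernel density estimate of case locations with the default bandwidth
case_density <- density(cases_ppp)
plot(case_density)

# A larger bandwidth (sigma) gives a smoother surface
plot(density(cases_ppp, sigma = 0.1))
```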
If you want to map using leaflet, you have to convert the density object to a RasterLayer with a coordinate reference system
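A sketch of the conversion, assuming the density object from above:

```r
# Convert the spatstat 'im' object to a RasterLayer and give it a lon/lat CRS
density_raster <- raster::raster(case_density)
crs(density_raster) <- "+proj=longlat +datum=WGS84"
```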
But this is just a density of cases, i.e. it doesn't account for the denominator: the controls. To account for this, we can use the Kelsall & Diggle method, which calculates the ratio of the density estimate of cases to that of controls
First we have to add 'marks' to the points. Marks are just values associated with each point, such as case or control status (1/0)
Now we can use the relrisk function from the spatstat package to look at the risk of being a case relative to the background population. To obtain an output of relative risk (the probability of being a case relative to the probability of being a control), we must specify relative = TRUE. If relative is not specified, it defaults to FALSE and the output is the probability of being a case. You can set sigma (the bandwidth), but the default is to use cross-validation to find a common bandwidth for cases and controls. See ?bw.relrisk for more details.
Obtaining a relative risk of being a case
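Putting the marks and relrisk steps together might look like this (cases and controls have to sit in a single point pattern; object names are assumptions):

```r
# Single point pattern containing both cases and controls
cc_ppp <- ppp(x = CaseControl$long, y = CaseControl$lat, window = namibia_owin)

# Attach the case/control indicator as marks; relrisk() expects a factor
marks(cc_ppp) <- factor(CaseControl$case)

# Ratio of case density to control density (Kelsall & Diggle style relative risk)
rr <- relrisk(cc_ppp, relative = TRUE)
plot(rr)
```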
To plot on a web map, first specify the projection
Then define a color palette
Then plot with leaflet
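One way the three steps might look, reusing the raster conversion idiom from earlier (the palette choice and object names are illustrative):

```r
library(leaflet)

# Convert the relative risk surface to a raster and set its projection
rr_raster <- raster::raster(rr)
crs(rr_raster) <- "+proj=longlat +datum=WGS84"

# Color palette spanning the range of risk values
pal <- colorNumeric(palette = "YlOrRd", domain = values(rr_raster), na.color = "transparent")

# Interactive web map
leaflet() %>%
  addTiles() %>%
  addRasterImage(rr_raster, colors = pal, opacity = 0.7) %>%
  addLegend(pal = pal, values = values(rr_raster), title = "Relative risk")
```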
Interpolation of point (prevalence etc.) data
First load Ethiopia malaria prevalence data
Get the Ethiopia Admin 1 level boundary file using the raster package, which provides access to GADM data
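A sketch of both steps (the file name, and the column names longitude, latitude and prevalence used below, are assumptions):

```r
# Ethiopia malaria prevalence data (file name is an assumption)
ETH_malaria_data <- read.csv("ETH_malaria_data.csv")

# Admin 1 boundaries for Ethiopia from GADM
ETH_Adm1 <- raster::getData("GADM", country = "ETH", level = 1)
```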
Inverse distance weighting (IDW)
Inverse distance weighting is one method of interpolation. To perform IDW using the spatstat package, as with kernel density estimates, we have to create a ppp object with the outcome we wish to interpolate as marks. We start by setting the observation window. In this case, we are going to use the bounding box around Oromia State, from which these data were collected. To set the window for the ppp function, we need to use the owin function.
Then define a ppp of the prevalence data
Set the parameters for displaying multiple plots on one screen and plot different IDW results. NB: 1) power sets the power of the inverse-distance weighting function; 2) 'at' can be 'pixels', which generates estimates across a grid of pixels, or 'points', which interpolates values at every observation point using leave-one-out cross-validation
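A sketch of the window, the marked ppp and a comparison of a few powers (the column names and candidate powers are assumptions):

```r
# Observation window: bounding box of the survey points (approximating the Oromia bounding box)
oromia_owin <- owin(xrange = range(ETH_malaria_data$longitude),
                    yrange = range(ETH_malaria_data$latitude))

# Point pattern with prevalence attached as marks
eth_ppp <- ppp(x = ETH_malaria_data$longitude,
               y = ETH_malaria_data$latitude,
               window = oromia_owin,
               marks = ETH_malaria_data$prevalence)

# Compare IDW surfaces for a few powers side by side
par(mfrow = c(2, 2))
for (p in c(0.5, 1, 2, 4)) {
  plot(idw(eth_ppp, power = p, at = "pixels"), main = paste("power =", p))
}
par(mfrow = c(1, 1))
```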
To calculate the 'best' power to use, you can use cross-validation. This is possible using the argument at = 'points' when running the idw function. There is no off-the-shelf function (that I know of) to do this, so you have to loop through different power values and find the one that produces the lowest cross-validated error.
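A minimal sketch of such a loop, using mean squared error (the sequence of candidate powers is arbitrary):

```r
# Leave-one-out cross-validated mean squared error for a sequence of powers
powers <- seq(0.5, 4, by = 0.5)
cv_mse <- sapply(powers, function(p) {
  cv_estimates <- idw(eth_ppp, power = p, at = "points")  # estimates at the data points
  mean((cv_estimates - marks(eth_ppp))^2)
})
```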
See which produced the lowest error
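Continuing the sketch above:

```r
# Power with the lowest cross-validated error
optimal_power <- powers[which.min(cv_mse)]
optimal_power
```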
Plot observed versus expected with optimal power
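Continuing the sketch, observed prevalence against the cross-validated estimates at the optimal power:

```r
cv_optimal <- idw(eth_ppp, power = optimal_power, at = "points")
plot(marks(eth_ppp), cv_optimal,
     xlab = "Observed prevalence", ylab = "Cross-validated IDW estimate")
abline(0, 1)
```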
Plot using leaflet.
Kriging
We are going to use the geoR package to perform kriging. First, we have to create a geodata object with geoR. This requires a data frame of the x and y coordinates plus the data values
We can plot a summary using the lowess argument. The lowess option adds lowess (locally weighted smoothing) curves showing the relationship between the data and the x and y coordinates
It's important to assess whether there is a first-order trend in the data before kriging. We can see from the plots of prevalence against the x and y coordinates that there isn't really any evidence of such a trend. Were there to be evidence, you can add trend = "1st" or trend = "2nd" to the plot command to see the result after having regressed prevalence against x and y using a linear or polynomial effect respectively.
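A sketch of creating the geodata object and the summary plot (the column positions are assumptions):

```r
library(geoR)

# geodata object: coordinate columns first, then the outcome
eth_geo <- as.geodata(ETH_malaria_data[, c("longitude", "latitude", "prevalence")],
                      coords.col = 1:2, data.col = 3)

# Summary plot with lowess curves; add trend = "1st" or trend = "2nd" to inspect residual trends
plot(eth_geo, lowess = TRUE)
```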
Now generate and plot a variogram. As a rule of thumb, it's a good idea to limit variogram estimation to half the maximum interpoint distance
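A sketch of this step; since the next step bins by distance, the original most likely plotted the unbinned variogram cloud here, so option = "cloud" is assumed:

```r
# Roughly half the maximum distance between points
max_dist <- max(dist(eth_geo$coords)) / 2

# Variogram cloud out to half the maximum interpoint distance
Vario_cloud <- variog(eth_geo, option = "cloud", max.dist = max_dist)
plot(Vario_cloud)
```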
## variog: computing omnidirectional variogram
To make it easier to interpret, we can bin points by distance
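For example, using variog's default binning:

```r
# Empirical (binned) variogram with default distance classes
Vario_binned <- variog(eth_geo, max.dist = max_dist)
plot(Vario_binned)
```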
## variog: computing omnidirectional variogram
It's possible to change the way the variogram bins are constructed. Just be careful not to have too few pairs of points in any distance class. NB: the uvec argument provides the values used to define the variogram binning (i.e. let's try bins of 0.2 decimal degrees, about 22 km)
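A sketch of custom binning; the object name Vario matches the variofit calls echoed further down:

```r
# Bin the empirical variogram in 0.2 decimal-degree (~22 km) distance classes
Vario <- variog(eth_geo, max.dist = max_dist, uvec = seq(0.1, max_dist, by = 0.2))
plot(Vario)
```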
## variog: computing omnidirectional variogram
Let’s look at the number in each bin
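The pair counts are stored in the n component of the variogram object:

```r
# Number of point pairs in each distance bin
Vario$n
```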
## [1] 85 432 541 586 692 607 652 661 679 663 736 764 711 692 577 585 594 551 630
## [20] 724
What is the minimum? A rule of thumb is to have at least 30 pairs in each bin
## [1] 85
Plot
We can now fit a variogram model by (weighted) least squares using different covariance models. In this case we are just going to use a 'spherical' and an 'exponential' model.
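From the calls echoed in the output below this uses variofit; the names of the fitted objects are assumptions:

```r
# Fit spherical and exponential covariance models to the empirical variogram
VarioMod_sph <- variofit(Vario, cov.model = "sph")
VarioMod_exp <- variofit(Vario, cov.model = "exp")
```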
## variofit: covariance model used is spherical
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## Warning in variofit(Vario, cov.model = "sph"): initial values not provided -
## running the default search
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "0" "3.05" "0" "0.5"
## status "est" "est" "est" "fix"
## loss value: 2.28256710551259e-05
## variofit: covariance model used is exponential
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## Warning in variofit(Vario, cov.model = "exp"): initial values not provided -
## running the default search
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "0" "1.22" "0" "0.5"
## status "est" "est" "est" "fix"
## loss value: 2.76112575299845e-05
Plot results
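Continuing the sketch:

```r
# Empirical variogram with the two fitted models overlaid
plot(Vario)
lines(VarioMod_sph, col = "blue")
lines(VarioMod_exp, col = "red")
```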
Get summaries of the fits
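For example:

```r
summary(VarioMod_sph)
summary(VarioMod_exp)
```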
## $pmethod
## [1] "WLS (weighted least squares)"
##
## $cov.model
## [1] "spherical"
##
## $spatial.component
## sigmasq phi
## 0.000160867 3.048000000
##
## $spatial.component.extra
## kappa
## 0.5
##
## $nugget.component
## tausq
## 8.043352e-05
##
## $fix.nugget
## [1] FALSE
##
## $fix.kappa
## [1] TRUE
##
## $practicalRange
## [1] 3.048
##
## $sum.of.squares
## value
## 2.282567e-05
##
## $estimated.pars
## tausq sigmasq phi
## 8.043352e-05 1.608670e-04 3.048000e+00
##
## $weights
## [1] "npairs"
##
## $call
## variofit(vario = Vario, cov.model = "sph")
##
## attr(,"class")
## [1] "summary.variomodel"
## $pmethod
## [1] "WLS (weighted least squares)"
##
## $cov.model
## [1] "exponential"
##
## $spatial.component
## sigmasq phi
## 0.0002281533 1.2192006253
##
## $spatial.component.extra
## kappa
## 0.5
##
## $nugget.component
## tausq
## 3.042044e-05
##
## $fix.nugget
## [1] FALSE
##
## $fix.kappa
## [1] TRUE
##
## $practicalRange
## [1] 3.652398
##
## $sum.of.squares
## value
## 2.643998e-05
##
## $estimated.pars
## tausq sigmasq phi
## 3.042044e-05 2.281533e-04 1.219201e+00
##
## $weights
## [1] "npairs"
##
## $call
## variofit(vario = Vario, cov.model = "exp")
##
## attr(,"class")
## [1] "summary.variomodel"
In this case, the spherical model has a slightly lower sum of squares, suggesting it provides a better fit to the data.
Now that we have a variogram model depicting the covariance between pairs of points as a function of the distance between them, we can use it to krige values at prediction locations. To allow us to compare with IDW, first get the grid of points from the IDW example
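A sketch of building a prediction grid from an IDW surface and kriging over it (how the original grid was constructed is an assumption):

```r
# Pixel centres of an IDW surface as prediction locations
idw_surface <- idw(eth_ppp, power = optimal_power, at = "pixels")
pred_grid <- expand.grid(x = idw_surface$xcol, y = idw_surface$yrow)

# Ordinary kriging with the fitted spherical variogram model
krige_res <- krige.conv(eth_geo, locations = pred_grid,
                        krige = krige.control(obj.model = VarioMod_sph))
```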
## krige.conv: model with constant mean
## krige.conv: Kriging performed using global neighbourhood
Visualize predictions
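For a quick static map of the kriged surface (colours are illustrative):

```r
# Image plot of the kriged predictions over the grid
image(krige_res, col = heat.colors(100))
```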
If you want to create a raster of your predictions, you can use the rasterFromXYZ function
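For example, combining the prediction grid with the kriged values:

```r
# Raster of the kriged predictions (requires a regularly spaced grid)
krige_raster <- rasterFromXYZ(data.frame(x = pred_grid$x,
                                         y = pred_grid$y,
                                         z = krige_res$predict))
```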
Generating cross-validated predictions is straightforward in geoR using the xvalid function. Two types of validation are possible: 1. leave-one-out cross-validation, where each data location (all or a subset) is removed in turn and predicted using the remaining locations, for a given model; 2. external validation, which can predict to locations outside of the dataset. Here we will use the default leave-one-out cross-validation for all points
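A sketch of the default leave-one-out run and a quick check of the predictions:

```r
# Leave-one-out cross-validation of the spherical kriging model at every data location
xv <- xvalid(eth_geo, model = VarioMod_sph)

# Observed values against cross-validated predictions
plot(xv$data, xv$predicted, xlab = "Observed", ylab = "Cross-validated prediction")
abline(0, 1)
```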
## xvalid: number of data locations = 203
## xvalid: number of validation locations = 203
## xvalid: performing cross-validation at location ... 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203,
## xvalid: end of cross-validation
You might notice that some of the kriged values are < 0. As we are modeling probabilities, this can't be true. In these situations, it is possible to apply a transformation to your data before kriging and then back-transform the results. One transformation useful for probabilities is the logit transform (used in logistic regression). The logit and inv.logit functions from the gtools package can be used for this. Note that it doesn't work if you have values of exactly 0, as you can't take log(0). You can add a small amount to avoid this situation. The process would look like this
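A sketch of the workflow on the logit scale (the small constant, and all object names other than Vario_logit, which appears in the output below, are assumptions):

```r
library(gtools)

# Logit-transform prevalence, adding a small constant to avoid logit(0) = -Inf
eth_geo_logit <- eth_geo
eth_geo_logit$data <- logit(eth_geo$data + 0.001)

# Re-estimate and re-fit the variogram on the logit scale
Vario_logit <- variog(eth_geo_logit, max.dist = max_dist)
VarioMod_logit <- variofit(Vario_logit, cov.model = "sph")

# Cross-validate on the logit scale, then back-transform the predictions
xv_logit <- xvalid(eth_geo_logit, model = VarioMod_logit)
xv_pred <- inv.logit(xv_logit$predicted)
```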
## variog: computing omnidirectional variogram
## variofit: covariance model used is spherical
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## Warning in variofit(Vario_logit, cov.model = "sph"): initial values not provided
## - running the default search
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "1.16" "2.45" "0.15" "0.5"
## status "est" "est" "est" "fix"
## loss value: 448.598010419967
## xvalid: number of data locations = 203
## xvalid: number of validation locations = 203
## xvalid: performing cross-validation at location ... 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203,
## xvalid: end of cross-validation
Pop quiz
- How could you compare how well the best-fitting IDW performs versus kriging?
- Which appears to be more accurate?
- Can you visualize where predictions from IDW differ from kriging?
- Does inclusion of a trend surface improve kriging estimates?
Answers here
Key readings
Good overview
Pfeiffer, D., T. P. Robinson, M. Stevenson, K. B. Stevens, D. J. Rogers and A. C. Clements (2008). Spatial Analysis in Epidemiology. Oxford University Press, Oxford. Chapter 6.
Technical paper covering kernel estimation of relative risk. Key reference but not necessary to understand in detail.
Kelsall, Julia E., and Peter J. Diggle. “Kernel estimation of relative risk.” Bernoulli 1.1-2 (1995): 3-16.
Illustration of the Kelsall & Diggle approach used to map sleeping sickness risk
Additional readings
Nice example of kriging applied across space and time
Additional example of the Kelsall & Diggle approach in action
Di Salvo, Francesca, et al. “Spatial variation in mortality risk for hematological malignancies near a petrochemical refinery: A population-based case-control study.” Environmental research 140 (2015): 641-648.