When we undertake a Species Distribution Model (SDM) we use both species occurrence data collected from field surveys or collated observations, as well as environmental predictor data. Before beginning to use covariate data to model occurrence over space, a crucial first step is to familiarise ourselves with the data and its limitations. It is important to do this before modelling the data, as data exploration can result in consequences for the choice of model applied as well as the accurracy of the inference taken from the model. Equally, choice of model a priori may bring a unique set of assumptions that we need to check for in our data. This guide outlines a few suggested steps implemented in both base R as well as using zoon
output
and process
modules for data exploration (Figure 1), but is by no means exhaustive.
To begin to familiarise ourselves with the data, it is very useful to examine summary information, including min, mean, and max values for each covariate that you have loaded. Environmental covariate data is subject to error. For example, maybe the maximum value in your elevation variable is 1000 m, but you know that the highest peak in your study region is only 500 m. Maybe your vegetation classification shows ten levels despite it being an eight-category scale.
It can also help us to check the format of the covariates - to ensure continuous variables are stored as numeric data and categorical variables as factors. Categorical variables may be erroneously listed in a numerical index (e.g. vegetation categories identified as 1-10), causing them to be mis-classed. Similarly, if there are any typos in numerical data entries that introduce characters (e.g. ‘2 cm’ instead of ‘2’), then a continuous variable could be classed as a factor.
NA values are to be expected in our raster data as they are commonly masked to cover only a particular region, however finding NA values for covariates in the training data could indicate that some data points have incorrect latitude/longitude values. This can result from being mapped to locations outide the extent of our study, or suggest that the data point sits on the border of the study region and misses the raster due to its resolution and should be adjusted slightly.
With the zoon
output module DataSummary
, examining covariate data for these potential pitfalls is very easy.
library(zoon)
Carolina_Wren_Workflow <- workflow(occurrence = CarolinaWrenPO,
covariate = CarolinaWrenRasters,
process = Background(1000),
model = NullModel,
output = DataSummary)
$PointData
longitude latitude pcMix pcDec
Min. :-100.19 Min. :25.10 Min. :-0.28163 Min. :-0.4023
1st Qu.: -90.09 1st Qu.:32.47 1st Qu.:-0.25373 1st Qu.:-0.1441
Median : -84.22 Median :35.90 Median : 0.07983 Median : 1.0642
Mean : -85.09 Mean :35.54 Mean : 0.51684 Mean : 1.4522
3rd Qu.: -80.24 3rd Qu.:38.74 3rd Qu.: 0.81348 3rd Qu.: 2.5997
Max. : -70.01 Max. :43.48 Max. : 9.72729 Max. : 5.7853
pcCon pcGr Lat Lon
Min. :-0.4118 Min. :-0.4333 Min. :-1.47041 Min. :-0.2513
1st Qu.:-0.3641 1st Qu.:-0.4028 1st Qu.:-0.69941 1st Qu.: 0.2841
Median :-0.1350 Median :-0.3277 Median :-0.25919 Median : 0.5832
Mean : 0.2708 Mean :-0.1669 Mean :-0.29397 Mean : 0.5262
3rd Qu.: 0.7252 3rd Qu.:-0.1607 3rd Qu.: 0.09957 3rd Qu.: 0.7851
Max. : 3.5748 Max. : 4.4750 Max. : 0.83779 Max. : 1.1720
$RasterData
pcMix pcDec pcCon pcGr
Min. :-0.28 Min. :-0.40 Min. :-0.41 Min. :-0.43
1st Qu.:-0.28 1st Qu.:-0.40 1st Qu.:-0.41 1st Qu.:-0.38
Median :-0.27 Median :-0.34 Median :-0.27 Median :-0.22
Mean : 0.25 Mean : 0.34 Mean : 0.36 Mean : 0.38
3rd Qu.: 0.06 3rd Qu.: 0.58 3rd Qu.: 0.66 3rd Qu.: 0.55
Max. :10.95 Max. : 5.79 Max. : 5.67 Max. : 5.12
NA's :35451 NA's :35451 NA's :35451 NA's :35451
Lat Lon
Min. :-1.56 Min. :-1.34
1st Qu.:-0.34 1st Qu.:-0.69
Median : 0.22 Median :-0.17
Mean : 0.18 Mean :-0.15
3rd Qu.: 0.74 3rd Qu.: 0.35
Max. : 1.51 Max. : 1.26
NA's :35451 NA's :35451
EXPAND ON WHAT WE SEE ONCE THE UPDATED MODULE PULL REQUEST GOES THROUGH
Species distribution datasets are, to varying degrees, reliant on observation records gathered by humans and therefore subject to human error. Even in situations where we are fitting a model to entirely remotely-sensed data such as bioclimatic variables, our species occurrence records are usually pen-and-paper recordings from the field. This manual data entry can lead to mistakes. One way to check for nonsense entries is to the plot occurrence data to covariates one by one, allowing us to isolate unusual entries. For example, if we plotted against latitude, we might see a single occurrence value at a latitude outside the range of our study.
You can visualise these plots using the CovHistogram
output module.
Carolina_Wren_Workflow <- workflow(occurrence = CarolinaWrenPO,
covariate = CarolinaWrenRasters,
process = Background(1000),
model = NullModel,
output = CovHistograms)
Collinearity is the existence of correlations among covariates, for example between altitude and temperature. For some modelling methods, though notably not all of them, when two covariates in a model are correlated, the modelling method may struggle to identify the impact of each variable independently and the significance of either may be masked. If covariate A and covariate B are correlated, one approach would be to include a single one in the model, preferably based on sound biological justification. Should variable A be included, then in the discussion of the results it will be important to note that the observed effect could equally be driven by correlated covariate B.
You can check for collinearity in many ways, but the simplest is to look at a pair plot of your covariate data. You can do this using the zoon
output module PairPlot
. The upper panel shows a density scatterplots of relationships between covariates, the diagonal panel shows a density of the points as well as variable names, and the lower panel shows the \(r^2\) correlation values where darker colours indicate stronger correlation (blue in the positive direction, red in the negative).
Carolina_Wren_Workflow <- workflow(occurrence = CarolinaWrenPO,
covariate = CarolinaWrenRasters,
process = Background(1000),
model = NullModel,
output = PairPlot)
A tidy summary of many of these data exploration methods is provided with the GenerateCovariateReport
Output
module. This is based on the GenerateReport()
function in the DataExplorer
R
package, but tailored specifically for SDM analyses. These reports show our data structure, the percentage of missing data, the distribution of our covariates (histograms for continuous data, bar charts for discrete), and show the results of a basic correlation analysis. We need to tell the module which report(s) to generate by setting the type
argument to one of “D” (Data Report only), “R” (Raster report only), or “DR” (Data and Raster Report).
output = GenerateCovariateReport(type = "DR")
Where there is no module already designed, we can manipulate our data directly by using the accessor function Process()
to generate a dataframe object with extracted raster covariate values for each occurrence point. We can examine the first six rows of this data frame by using the base R function head()
.
occ.cov.df <- Process(Carolina_Wren_Workflow)$df
head(occ.cov.df)
longitude latitude value type fold pcMix pcDec pcCon
1 -87.50607 34.92022 1 presence 1 0.5902465 1.8175174 -0.2232220
2 -87.78094 34.93984 1 presence 1 0.3973857 1.8771694 -0.1192491
3 -88.05596 34.95882 1 presence 1 1.5005269 2.5864888 0.2768159
4 -86.73598 34.41256 1 presence 1 1.0277690 1.6152677 -0.1380109
5 -86.70923 34.63505 1 presence 1 0.7697110 1.4580649 -0.1153329
6 -86.98288 34.65655 1 presence 1 0.6184651 0.5775564 -0.2159737
pcGr Lat Lon
1 -0.3189337 -0.4050766 0.4139044
2 -0.3031093 -0.4075254 0.3991733
3 -0.2853706 -0.4098892 0.3844383
4 -0.3246833 -0.4765890 0.4622093
5 -0.3219977 -0.4501391 0.4608309
6 -0.3036037 -0.4263944 0.4446908
We can check the format of the data with the str()
function, or check individual variables with class()
.
str(occ.cov.df)
'data.frame': 2505 obs. of 11 variables:
$ longitude: num -87.5 -87.8 -88.1 -86.7 -86.7 ...
$ latitude : num 34.9 34.9 35 34.4 34.6 ...
$ value : num 1 1 1 1 1 1 1 1 1 1 ...
$ type : chr "presence" "presence" "presence" "presence" ...
$ fold : num 1 1 1 1 1 1 1 1 1 1 ...
$ pcMix : num 0.59 0.397 1.501 1.028 0.77 ...
$ pcDec : num 1.82 1.88 2.59 1.62 1.46 ...
$ pcCon : num -0.223 -0.119 0.277 -0.138 -0.115 ...
$ pcGr : num -0.319 -0.303 -0.285 -0.325 -0.322 ...
$ Lat : num -0.405 -0.408 -0.41 -0.477 -0.45 ...
$ Lon : num 0.414 0.399 0.384 0.462 0.461 ...
- attr(*, "call_path")=List of 1
..$ occurrence: chr "CarolinaWrenPO"
- attr(*, "covCols")= chr "pcMix" "pcDec" "pcCon" "pcGr" ...
- attr(*, "na.action")=Class 'omit' Named int 218
.. ..- attr(*, "names")= chr "218"
class(occ.cov.df$pcMix)
[1] "numeric"