Introduction

When we undertake a Species Distribution Model (SDM), a good first step is to explore our data to familiarise ourselves with its structure and limitations. We explore this topic in the Data Exploration vignette.

After exploring our data, and in conjunction with variable selection where appropriate, we may want to incorporate various data processing steps into our zoon workflow. There are some generally agreed upon guidelines for this data processing, but it is also highly subjective and open to interpretation. This guide outlines a few suggested steps, implemented both in base R and with zoon output and process modules. These include data cleaning, variable standardisation and transformation, and adding interaction terms (Figure 1).

*Figure 1. Conceptual flowchart for process modules in covariate selection and preparation.*


library(zoon)

Data cleaning

Plotting occurrence data can make it possible to identify errors. One of the zoon modules, Clean, removes impossible, incomplete, or unlikely species occurrence points. Impossible occurrence points are those where the location doesn't exist, incomplete records are missing either a longitude or latitude value (or both), and unlikely points are those that fall well outside the geographic range of your study area (for example, in the middle of the sea). Within Clean these options are referred to by number as impossible (1), incomplete (2), and unlikely (3), and the module is used as follows:

process = Clean(which = c(1,2,3))
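
Before relying on the module, we can also make a rough check of the same three problems ourselves in base R. The sketch below uses a small hypothetical data frame of occurrence points (occ) and a made-up study extent, purely for illustration:

# Hypothetical occurrence records, with one deliberately bad record of each type
occ <- data.frame(longitude = c(-72.5, -190.0,   NA, -40.0),
                  latitude  = c( 43.9,   44.0, 43.0,  30.0))

# Impossible: coordinates outside the valid range of longitude/latitude
impossible <- abs(occ$longitude) > 180 | abs(occ$latitude) > 90

# Incomplete: missing longitude and/or latitude
incomplete <- is.na(occ$longitude) | is.na(occ$latitude)

# Unlikely: outside a (hypothetical) study extent of xmin, xmax, ymin, ymax
study_ext <- c(-100, -60, 35, 60)
unlikely <- occ$longitude < study_ext[1] | occ$longitude > study_ext[2] |
            occ$latitude  < study_ext[3] | occ$latitude  > study_ext[4]

# Plot only the records that pass all three checks
keep <- !(impossible | incomplete | unlikely)
plot(occ$longitude[keep], occ$latitude[keep],
     xlab = "Longitude", ylab = "Latitude", pch = 16)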

Standardising variables

For some modelling methods, it is helpful to standardise variables, thereby putting them on the same scale. Always check the assumptions of your selected model, however, as this is not always necessary, for example with Boosted Regression Trees or Random Forest models. Standardisation is achieved by subtracting the mean of a covariate from each individual value, so that all covariates are centred on zero, and optionally also dividing by the standard deviation, which makes the standard deviation of each covariate equal to one (the result is called the 'z-score'). One major reason to standardise covariates is to aid interpretation of our model results. Regression parameters of covariates on the same scale allow us to compare the influence of different variables on species distributions. Without standardisation, the regression coefficient for the distance of a site to roads might be +0.003 and the effect of average temperature could be -10. How would we compare these coefficients?

In addition, standardisation of covariates can aid in model fitting. If the regression coefficients of two covariates in a model are on very different scales, then the software may struggle to converge on parameter estimates for both terms.
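
Outside of zoon, the same calculations are straightforward in base R. The short sketch below uses a hypothetical covariate vector to show plain centring, the z-score, and the Gelman variant (dividing by two standard deviations):

# Hypothetical covariate: mean annual temperature at five sites
temp <- c(5.2, 7.8, 11.4, 9.1, 15.6)

# Centring only: subtract the mean so values are centred on zero
temp_centred <- temp - mean(temp)

# z-score: also divide by the standard deviation
temp_z <- (temp - mean(temp)) / sd(temp)

# Gelman variant: divide by two standard deviations instead of one
temp_gelman <- (temp - mean(temp)) / (2 * sd(temp))

# scale() does centring and scaling in a single step
temp_z_alt <- scale(temp, center = TRUE, scale = TRUE)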

We can standardise covariates in zoon with the StandardiseCov process module. By default, the module standardises all variables by subtracting their mean and dividing by their standard deviation (calculating their ‘z-score’). To use this module we need to choose which variables to exclude from standardisation (if any), and whether to use the Gelman variant (standardises by two standard deviations instead of one). Some examples of how to use the module are below:

process = StandardiseCov() # default form

process = StandardiseCov(Gelman = TRUE,
                         exclude = c("VarB", "VarC"))

Data transformation

Transformation of covariate data is a way to express the relationship between a covariate and the response in a non-linear form. Generally, decisions about transformation are subjective, and depend upon the assumptions of our chosen model as well as the natural biological underpinnings of our parameters. A few key reasons to transform data:

  • Biological justification: Many biological parameters are naturally better understood on a particular scale, for example, many plant biometric variables are typically log-transformed because the data are naturally skewed, with only a few very large values.

  • Skewed/systematically varying residuals: Transformation doesn't just change the distribution of the raw data, it also changes the distribution of the model residuals. If there is a pattern among the residuals ('heteroscedasticity') then transformation can be used to remove that pattern (ensuring 'homoscedasticity'), an assumption of some models.

  • To linearise a relationship/simplify a model: If the relationship between a covariate and the response variable is non-linear, then you may need to add extra terms to the model, making it more complicated. Transformation to linearise this relationship thus simplifies the model.

  • To add flexibility: Polynomial transformation is useful for variables whose impact on occurrence is not linear. Some environmental variables, such as elevation, temperature, and rainfall, actually have a peak value for occurrence and would therefore be good candidates for polynomial transformation.
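
To make the skewness and flexibility points above concrete, the base R sketch below simulates two hypothetical covariates: a right-skewed biometric variable that a log transformation pulls towards symmetry, and a peaked occurrence response that a quadratic (polynomial) term could capture:

set.seed(42)

# Hypothetical right-skewed covariate (e.g. a plant biometric variable)
biomass <- rlnorm(1000, meanlog = 0, sdlog = 1)
par(mfrow = c(1, 2))
hist(biomass,      main = "Raw (skewed)",    xlab = "biomass")
hist(log(biomass), main = "Log-transformed", xlab = "log(biomass)")
par(mfrow = c(1, 1))

# Hypothetical peaked response: occurrence probability highest at mid elevations
elevation <- runif(1000, 0, 2000)
p_occ <- plogis(1 - ((elevation - 1000) / 300)^2)
plot(elevation, p_occ, pch = 16, cex = 0.4,
     ylab = "Occurrence probability")  # a quadratic term can capture this shape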

Great care must be taken when transforming our data, and there are several good reasons either to avoid transformations or to apply them with caution:

  • Can complicate a model: Polynomial terms in particular add complexity to a model, and should only be attempted when you have enough data to reliably estimate those additional parameters.

  • Masking outliers: Without a biological justification for doing so, transforming data simply to mask the impact of outlying values hides valuable insight in our data.

There is a process module for data transformation in zoon called Transform. To use this module we need to define the transformation, nominate the variables to be transformed, and decide whether to replace the original variables or to create new ones. We define the transformation in much the same way as we define a function in base R: we set the trans argument of the module to a function of the form function(x) {our transformation}. We select the variables to transform by supplying a character vector to the which_cov argument, and determine whether to replace them by setting the replace argument to TRUE or FALSE.

Let’s run through a couple of examples. If we want to square a variable called VarA and save it as an additional variable in our dataset (i.e. do not replace original variable) we would use this:

process = Transform(trans = function(x) {x^2},
                    which_cov = "VarA",
                    replace = FALSE)

If we want to perform a log transformation to the variables VarA, VarB, and VarC, and replace the original variables in the dataset with our newly transformed variables, we would use this:

process = Transform(trans = function(x) {log(x)},
                    which_cov = c("VarA", "VarB", "VarC"),
                    replace = TRUE)

If we want to get fancy and provide different transformations to different variables we can achieve this using the Chain() function:

process = Chain(Transform(trans = function(x) {x^2},
                          which_cov = c("VarA", "VarB"),
                          replace = FALSE),
                Transform(trans = function(x) {log(x)},
                          which_cov = c("VarC", "VarD")))

Interactions

Typical inference assumes that each predictor variable has an independent effect on our response variable (species occurrence). This makes sense in many cases, when we are reasonably comfortable with the biological justification. For example, it is reasonable, and backed by experimental evidence, to suppose that mosquitoes are more common in humid regions. This would be an independent effect of humidity on mosquito occurrence. However, we need to consider the conditionality of this assumption. Occurrence data are conditional upon how they enter our dataset (is data collection biased towards regions where we expect mosquitoes to occur?). It is also possible that the relationship between mosquito occurrence and humidity is conditional on another environmental variable; in this example, perhaps temperature moderates this relationship. A pairwise interaction is the interaction between two variables in a model such that:

\(Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 (X_1 \times X_2)\)

where \(b_3\) is the coefficient of the interaction between the variables \(X_1\) and \(X_2\).
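
Outside of zoon, the same pairwise interaction can be written directly in an R model formula. The sketch below simulates hypothetical humidity, temperature, and mosquito presence data, then fits a logistic regression in which humidity * temperature expands to the three terms of the equation above:

set.seed(1)

# Hypothetical data: presence depends on humidity, temperature, and their interaction
n <- 500
humidity    <- rnorm(n)
temperature <- rnorm(n)
presence <- rbinom(n, 1, plogis(0.5 * humidity + 0.2 * temperature +
                                0.8 * humidity * temperature))

# humidity * temperature expands to humidity + temperature + humidity:temperature
m <- glm(presence ~ humidity * temperature, family = binomial)
summary(m)  # the humidity:temperature row estimates b3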

Once we have decided to add an interaction effect into our model, we can select from three possible ways to implement it in our model: add all pairwise interactions, define set interactions between a select group of variables, or specify polynomial terms.

Pairwise interactions

To implement all possible pairwise interactions in the model, we use the addInteraction process module like this:

process = addInteraction(which_covs = 'pairs')

Set interactions

Rather than a blanket application of interaction terms across our model, we might decide that it is more ecologically reasonable to define interactions only between a select group of variables. There are multiple ways to achieve this, so let's go through them one at a time:

  • To define the pairwise interaction between any two variables as a character vector:
process = addInteraction(which_covs = c("A", "B"))   # adds an interaction between A & B
  • To define multiple pairwise interactions, but not all pairwise interactions, we make use of R’s list() function. We provide a list of interaction terms as character vectors like so:
process = addInteraction(which_covs = list(c("A","B"), c("A","C")))   # adds interactions between A & B and A & C, but not B & C
  • To define higher order interactions between more than two variables we just need to extend the length of our character vectors. This will define the highest order interaction term between all of the selected variables as well as all combinations of lower-order interaction terms.
process = addInteraction(which_covs = c('A', 'B', 'C'))   # adds all two-way (e.g. A & B) interactions and a three-way interaction between A, B & C

Interactions as polynomial terms

The addInteraction module can also be used as an alternative to coding in polynomial terms.

process = addInteraction(which_covs = c('A', 'A'))   # leads to a quadratic polynomial

process = addInteraction(which_covs = c('A', 'A', 'A'))   # leads to a cubic polynomial
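
For comparison, in a standard base R model formula the same quadratic and cubic terms are usually written with I() or poly(). The sketch below uses hypothetical simulated data and is only the base R analogue, not part of the addInteraction module:

# Hypothetical data with a peaked (quadratic) relationship between A and occurrence
set.seed(2)
dat <- data.frame(A = runif(300, -2, 2))
dat$y <- rbinom(300, 1, plogis(1 - dat$A^2))

# Quadratic term, analogous to which_covs = c('A', 'A')
m_quadratic <- glm(y ~ A + I(A^2), data = dat, family = binomial)

# Cubic term, analogous to which_covs = c('A', 'A', 'A')
m_cubic <- glm(y ~ poly(A, 3, raw = TRUE), data = dat, family = binomial)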

Example

We will start by setting up a basic zoon workflow() with the Sugar Maple data and using a raster covariate dataset from Bioclim for illustration.

ext <- c(-170, -20, 10, 80)
SugarMaple_Workflow <- workflow(occurrence = SugarMaple,
                                covariate = Bioclim(extent = as.vector(ext)),
                                process = addInteraction(which_covs = c('Bio3', 'Bio3')),
                                model = NullModel,
                                output = ResponsePlot)