Introduction

Species distribution models are used for all sorts of important decisions: choosing protected areas, prioritising conservation actions, informing development applications, even determining the conservation status of species. As such, we need to know how much faith we can put in our models. We need to know if they are up to the task we set for them. We do that by evaluating our model’s performance.

Model evaluation is the process of determining whether our model is close enough to reality for our purpose. Models are smaller, simpler, hopefully useful versions of reality, but all models are wrong. Whether our imperfect model is useful depends on what we’re going to use it for. For example, if we wish to predict the geographic expansion of a species under a changing climate, we need to know if our model can extrapolate across geographic space. But if we wish only to accurately map the current distribution of a species, we don’t much mind whether our model can extrapolate accurately. It is thus important to consider not how good our model is in absolute terms, but rather whether our model is good enough for its purpose. Model evaluation, then, is the process of checking our model outputs to determine if they are a close enough representation of reality for the purpose we’re using them for.

Fit-for-purpose

  • trade-off between fit and predictive capacity

The purpose of our model will decide for us both how we evaluate our model and whether the goodness of fit, or lack thereof, determined by our evaluation is tolerable. Determining if our model is ‘good enough’ is not a simple task and generally has to be justified on a case-by-case basis. Are we building a model to identify the gradients a species responds to and contribute to ecological knowledge? Or are we running models on behalf of a government agency that requires accurate predictions? We’ll address some of these nuances throughout this guide. However, there is one facet of model fit worth discussing straight up: over-fitting.

Models are considered over-fitted when they go beyond identifying a pattern in the data and instead fit a relationship to the noise in the data - to the stochastic variation. In species distribution models, sometimes we want to avoid over-fitting and sometimes we don’t mind. Generally, if we only want to describe the here and now as precisely as possible, over-fitting our models is of little concern. For example, if a species is found at a location that is an outlier as far as the species is concerned, we still want to protect that site, so a model that predicts a presence there is useful. A model that treats the site as an outlier, and therefore assigns it a low probability of presence, will cause us to miss it in our conservation prioritisations. Over-fitted models are difficult to identify and the jury is still out on precisely what makes a model over-fitted.
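
As a toy illustration outside the zoon workflow (simulated data, made-up object names), the sketch below compares a simple logistic regression with a deliberately over-flexible one: the flexible model fits its training data at least as well, but tends to predict held-out data worse.

# simulate presence/absence along a single environmental gradient
set.seed(1)
dat <- data.frame(env = runif(200, 0, 10))
dat$pres <- rbinom(200, 1, plogis(-2 + 0.5 * dat$env))
train <- dat[1:100, ]
test  <- dat[101:200, ]

# a simple model and a deliberately over-flexible one
simple   <- glm(pres ~ env, family = binomial, data = train)
flexible <- glm(pres ~ poly(env, 10), family = binomial, data = train)

# in-sample fit: the flexible model always looks at least as good
deviance(simple)
deviance(flexible)

# out-of-sample fit: the flexible model will often do worse
log_loss <- function(obs, pred) {
  pred <- pmin(pmax(pred, 1e-12), 1 - 1e-12)   # guard against log(0)
  -mean(obs * log(pred) + (1 - obs) * log(1 - pred))
}
log_loss(test$pres, predict(simple, newdata = test, type = "response"))
log_loss(test$pres, predict(flexible, newdata = test, type = "response"))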

Model evaluation, then, even of over-fitted models, is an exercise in justifying that our model is fit for the purpose for which we made it. It is not an exercise in validating that our model perfectly reflects reality. For example, a model that explains only 15% of the variance in the data but identifies an important environmental gradient is still a useful model for generating ecological knowledge. It is probably not a model we should use to inform development applications, but it is a useful model nonetheless.

Calibration versus discrimination

We cannot validate a model by comparing it to another model

How

In this zoon guide, we will discuss the different methods of evaluating a model (or set of models), when we might choose one method of evaluation over another, and how to use zoon to implement these different evaluation methods. We split model evaluation methods and metrics into three broad categories of increasing thoroughness: the bare minimum, internal evaluation, and cross-validation.

However, no matter which category of evaluation method and metric we choose, there are a few over-arching principles to keep in mind. Firstly, we want our results to make sense ecologically and reflect our raw data. Secondly, we would like the most parsimonious model: following Occam’s razor, the simplest adequate explanation is usually preferred. Thus, given the choice between two equally performing models, we would choose the less complicated one, the model with the fewest environmental covariates. Lastly, we want our model to be able to predict the data we fitted it with; we want it to be accurate and not biased. In this guide, we’ll discuss the various model evaluation methods and metrics in line with these guiding principles.

The bare minimum

Let’s first run through the bare minimum. The bare minimum is what we absolutely must do in order to be able to defend our choice of model. Remember that a model consists of the response data, the environmental covariates, and the modelling method we choose. Some of the bare minimum is just part and parcel of interpreting the results, but while we’re interpreting our results it’s good to remember that we need to question the model, not accept the printed map or response curve as gospel truth.

The bare minimum can be split into two categories: checking the model results against our data and checking the model against our knowledge of the ecology we’re modelling.

Check maps match data (interactive map)

library(zoon)

# fit a logistic regression to the Carolina wren presence-absence data
# and view the predicted distribution on an interactive map
zoon_workflow <- workflow(occurrence = CarolinaWrenPA,
                          covariate = CarolinaWrenRasters,
                          process = NoProcess,
                          model = LogisticRegression,
                          output = InteractiveMap)

One can go a little further in evaluating model fit by plotting species response curves where available (response curves)
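
For example, response curves can be requested alongside the map by chaining output modules. This is only a sketch: the ResponseCurve output module name is an assumption here, so check GetModuleList() for the response-curve module available in your module repository.

# as above, but chaining a response-curve output onto the interactive map
# (ResponseCurve is an assumed module name - check GetModuleList())
zoon_workflow <- workflow(occurrence = CarolinaWrenPA,
                          covariate = CarolinaWrenRasters,
                          process = NoProcess,
                          model = LogisticRegression,
                          output = Chain(InteractiveMap, ResponseCurve))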

For parametric models, check that the coefficients make sense (that they go the same way as the data and give back sensible results) (coefficient plot)
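
As a sketch of what a coefficient check might look like, the code below builds a coefficient plot for an ordinary glm fitted to simulated data (all names are made up); the same plot can be produced from the logistic regression stored inside a zoon workflow once it has been extracted.

# simulate presence/absence driven by two hypothetical covariates
set.seed(2)
dat <- data.frame(temp = rnorm(300), rain = rnorm(300))
dat$pres <- rbinom(300, 1, plogis(-1 + 1.5 * dat$temp - 0.5 * dat$rain))

fit <- glm(pres ~ temp + rain, family = binomial, data = dat)
est <- coef(fit)[-1]                              # drop the intercept
ci  <- confint.default(fit)[-1, , drop = FALSE]   # Wald confidence intervals

# do the signs and magnitudes make ecological sense?
plot(est, seq_along(est), xlim = range(ci), yaxt = "n",
     xlab = "coefficient (log-odds scale)", ylab = "")
axis(2, at = seq_along(est), labels = names(est), las = 1)
segments(ci[, 1], seq_along(est), ci[, 2], seq_along(est))
abline(v = 0, lty = 2)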

Independence

An underlying assumption of most presence-background regression models is that the occurrence data are conditionally independent given the covariates, meaning the covariates are expected to explain all the spatial variation in our data. Typically, however, this is not the case. Spatial dependence, for example through clustering or social aggregation of species, may take place, invalidating this assumption of independence. There are a handful of modelling approaches to identifying spatial dependence structures, including Gibbs or Cox process models, that will be discussed elsewhere.

Internal evaluation

The next step up in the model evaluation game is internal evaluation. Internal evaluation checks if our model is a good description of the data we fit it with.

Deviance and Variance

R-squared (variance explained)

Deviance
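
For a fitted glm, such as the logistic regression used above, the residual deviance and a deviance-based pseudo R-squared can be read straight off the model object. The sketch below uses simulated data and hypothetical names purely to show where the numbers come from.

# simulate data and fit a logistic regression
set.seed(3)
dat <- data.frame(temp = rnorm(300))
dat$pres <- rbinom(300, 1, plogis(-1 + 1.5 * dat$temp))
fit <- glm(pres ~ temp, family = binomial, data = dat)

deviance(fit)                            # residual deviance: lower means a closer fit
fit$null.deviance                        # deviance of an intercept-only model
1 - deviance(fit) / fit$null.deviance    # proportion of deviance explained (a pseudo R-squared)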

Information criteria

Adding more parameters almost always improves the fit of the model, so why not just put them all in? While adding more parameters improves the fit, it may reduce the ability of the model to predict new data outside our current sample.

There are two fundamental types of statistical error that we must deal with when evaluating and interpreting model results: overfitting and underfitting. Both of these errors lead to poor prediction by the model, whether of new data or in its ability to retrodict the original data. When we overfit, we learn too much from our data; when we underfit, we learn too little.

The information theoretic approach to model selection uses information criteria, which are metrics to score models by prediction accuracy with a penalty for complexity.

This approach is in contrast to significance-testing approaches that focus on selecting models in which all terms are significant at p < 0.05. Some ‘significant’ terms may not help improve prediction, whereas some non-significant terms might.

Information criteria build on deviance measures by penalising models for additional covariates. That is, they select for accurate and parsimonious models.
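
As a sketch, AIC values for competing glms can be compared directly in R: the lower the value, the better the trade-off between fit and complexity. The data and covariate names below are simulated and hypothetical.

# two competing models: the second adds covariates unrelated to the response
set.seed(4)
dat <- data.frame(temp = rnorm(300), rain = rnorm(300), noise = rnorm(300))
dat$pres <- rbinom(300, 1, plogis(-1 + 1.5 * dat$temp))

m1 <- glm(pres ~ temp, family = binomial, data = dat)
m2 <- glm(pres ~ temp + rain + noise, family = binomial, data = dat)

# the extra covariates improve the raw fit slightly, but the complexity
# penalty usually favours the simpler model
AIC(m1, m2)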

Cross-validation

Cross-validation is the gold standard of model evaluation in species distribution models, at least in the absence of an independent evaluation dataset.

There are many ways to do cross-validation, and again the choice depends on which aspect of model performance we want to check.

Cross-validation works as follows: we iteratively split the data into a test dataset and a training dataset. The model is fit (or ‘trained’) with the training dataset and evaluated with the test dataset.

K-fold
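
A minimal sketch of k-fold cross-validation for a logistic regression on simulated data (all names are hypothetical); the fitting and scoring inside the loop could equally use any of the models and metrics discussed above.

# assign each observation to one of k folds at random
set.seed(5)
dat <- data.frame(temp = rnorm(300))
dat$pres <- rbinom(300, 1, plogis(-1 + 1.5 * dat$temp))
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))
cv_log_loss <- numeric(k)

for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  fit   <- glm(pres ~ temp, family = binomial, data = train)
  pred  <- predict(fit, newdata = test, type = "response")
  pred  <- pmin(pmax(pred, 1e-12), 1 - 1e-12)   # guard against log(0)
  cv_log_loss[i] <- -mean(test$pres * log(pred) + (1 - test$pres) * log(1 - pred))
}

mean(cv_log_loss)   # average out-of-sample performance across the folds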

Bootstrap

Extrapolation in space

We want to do this if we need to predict a range shift or predict to a new location. Olden (ref) refers to this as the transferability of the model in space. One way to assess it is to hold out data from a geographically distinct area, fit the model to the remaining data, and evaluate the predictions in the held-out area (see the sketch below).
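
A minimal sketch of a spatial hold-out using simulated data (coordinates and names are hypothetical): fit the model to one part of the study area and check how well it predicts to the other.

# simulate occurrences across a longitudinal gradient
set.seed(6)
dat <- data.frame(lon = runif(400, 140, 150), temp = rnorm(400))
dat$pres <- rbinom(400, 1, plogis(-1 + 1.2 * dat$temp))

west <- dat[dat$lon < 145, ]    # fit the model here...
east <- dat[dat$lon >= 145, ]   # ...and test whether it transfers here

fit  <- glm(pres ~ temp, family = binomial, data = west)
pred <- predict(fit, newdata = east, type = "response")

# compare predictions with the held-out observations, e.g. via log loss
pred <- pmin(pmax(pred, 1e-12), 1 - 1e-12)
-mean(east$pres * log(pred) + (1 - east$pres) * log(1 - pred))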

Can our model predict extremes? Can it extrapolate?

The controversial AUC

Specificity

Sensitivity
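
As a sketch of how these quantities are computed from predicted probabilities and observed presences/absences (simulated data, hypothetical names): sensitivity and specificity come from a confusion matrix at a chosen threshold, while AUC summarises discrimination across all possible thresholds via the rank (Mann-Whitney) formulation.

# simulated observations and imperfect predicted probabilities
set.seed(7)
obs  <- rbinom(200, 1, 0.4)
pred <- plogis(qlogis(0.4) + 1.5 * (obs - 0.4) + rnorm(200))

# sensitivity and specificity at a 0.5 threshold
predicted_presence <- pred >= 0.5
sensitivity <- mean(predicted_presence[obs == 1])    # true positive rate
specificity <- mean(!predicted_presence[obs == 0])   # true negative rate

# AUC: probability that a random presence outranks a random absence
n1  <- sum(obs == 1)
n0  <- sum(obs == 0)
auc <- (sum(rank(pred)[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)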

Call to action