Typically, the first step in species distribution modelling is sourcing species observation records (presence-only, presence-absence, or abundance) and environmental data (such as bioclimatic variables or vegetation maps in the form of rasters) that we will use for our model. In some instances we might be using data we recorded ourselves, other times we might want to use data from public repositories, or even a combination of the two. This tutorial will guide you through the process of accessing some of these data sources within a zoon
workflow()
.
Using the GetModuleList()
command we can see all of the Occurrence
modules under the $occurrence
sub-heading. We can also view the Covariate
modules under the $covariate
sub-heading.
modules <- GetModuleList()
modules$occurrence
[1] "CWBZimbabwe" "CarolinaWrenPO"
[3] "LocalOccurrenceData" "Lorem_ipsum_UK"
[5] "NaiveRandomPresence" "NaiveRandomPresenceAbsence"
[7] "SugarMaple" "UKAnophelesPlumbeus"
[9] "CarolinaWrenPA" "NATrees"
[11] "AnophelesPlumbeus" "SpOcc"
modules$covariate
[1] "Bioclim_future" "CarolinaWrenRasters" "LocalRaster"
[4] "NaiveRandomRaster" "UKBioclim" "NCEP"
[7] "UKAir" "AirNCEP" "Bioclim"
zoon
datasetszoon
comes with several example dataset modules that we can use. These pre-loaded datasets are useful examples for experimenting with zoon
modules, or as test datasets when building new modules of your own. For example, SugarMaple
provides presence/absence data of sugar maple in North America, CWBZimbabwe
provides presence/absence data for the coffee white stem border in Zimbabwe, and NATrees
provides occurrence records for 21 eastern North American trees.
To gain familiarity with zoon
, we could, for example, choose to fit a model to the Carolina wren data using the CarolinaWrenPO
or CarolinaWrenPA
occurrence modules (presence-only and presence-absence, respectively) with the CarolinaWrenRasters
covariate module.
Carolina_Wren <- workflow(occurrence = CarolinaWrenPO,
covariate = CarolinaWrenRasters,
process = Background(100),
model = NullModel,
output = InteractiveOccurrenceMap)
SDMs are commonly fit to datasets we have collected ourselves, and zoon
has modules to help us load that data. The two modules of interest here are LocalOccurrenceData
for our observation records and LocalRaster
for our raster-based data. To ensure that all datasets loaded into a zoon
workflow()
are compatible with model
modules, there are specific requirements for the structure and formate of each of these data types, which we will go through below.
The LocalOccurrenceData
module requires our data to be saved either as .csv, .xlsx, .tab, or .tsv files and be structured in a very specific format. The first and second columns are the longitude
and latitude
values (in that order), and the third column is the value
of the observation (0 for absence, 1 for presence, or an integer for abundance data). If your coordinate system is not latitude/longitude then you can supply an optional fourth column called CRS
that contains the PROJ.4 code for your coordinate system (e.g. +init=epsg:27700
for easting/northing data for the UK). If no CRS
column is supplied then latitude/longitude is assumed.
To use the LocalOccurrenceData
module you call occurrence
in your workflow
like this:
occurrence = LocalOccurrenceData(filename = "myData.csv", # File path to your data file
occurrenceType = "presence", # The type of data you have
columns = c(long = "longitude", # The names of the columns in
lat = "latitude", # your .csv that match the
value = "value"), # required columns
externalValidation = FALSE) # Only required if validation
# data is set up externally
Raster data loaded into a workflow
using LocalRaster
also follows a set format, but it is a simpler process than for occurrence data. This module reads in either a single raster or raster-stack, or a list or vector of rasters and creates a raster-stack. You can find the set of allowed formats for loading rasters by looking at ?writeRasters
.
To use this module you call the covariate
module like this:
covariate = LocalRaster(rasters = c("myRaster1", # Filepath to a raster
"myRaster2")) # Filepath to a second raster
covariate = LocalRaster(rasters = "myRasterStack") # A RasterStack object already loaded
To serve as an example of loading our own data, here we make use of data on the brown-throated sloth, Bradypus variegatus, provided in the dismo
R
package. The code below will access this dataset and then do the necessary pre-processing to get it into the correct format.
filename <- paste(system.file(package = "dismo"),
'/ex/bradypus.csv', sep='') # filepath to .csv
Bradypus <- read.csv(filename, header = TRUE) # read in .csv file
Bradypus <- Bradypus[,2:3] # remove columns of species name
Bradypus <- cbind(Bradypus, rep(1, nrow(Bradypus))) # all records are presences, so add column of 1s
colnames(Bradypus) <- c("longitude", "latitude", "value") # set necessary column names
write.csv(Bradypus, "Bradypus.csv") # create a .csv file of the required format
The raster data available through dismo
is in the correct format, but still needs to be loaded into our environment for use.
files <- list.files(path = paste(system.file(package = "dismo"), '/ex', sep=''),
pattern = 'grd', full.names = TRUE ) # list all .grd files in `dismo`
predictors <- stack(files) # create a raster-stack
predictors <- dropLayer(x = predictors, i = 9) # drop unwanted raster layer
Now that we have our data in the required format (a .csv file and a raster-stack object) we can fit a model using the LocalOccurrenceData
and LocalRaster
modules.
Our_Data <- workflow(occurrence = LocalOccurrenceData(filename = "Bradypus.csv",
occurrenceType = "presence",
columns = c(long = "longitude",
lat = "latitude",
value = "value")),
covariate = LocalRaster(predictors),
process = Background(100),
model = NullModel,
output = InteractiveOccurrenceMap)
Our output, an interactive occurrence map for the brown-throated sloth, is shown here:
Sometimes we want to source our occurrence
and/or covariate
data from online sources. Some modules that are useful for sourcing online data include SpOcc
, Bioclim
, Bioclim-future
, and NCEP
.
SpOcc
The SpOcc
module is used to obtain species occurrence records from a selection of online databases, including: GBIF, BISON, iNat, eBird, Ecoengine, and AntWeb. These databases provide free access to species observation records taken by institutions all over the world. This means there is a wealth of data available to ecologists to tackle questions on species distributions without having to undertake costly field work themselves, or to supplement their own observations. The Global Biodiversity Information Facility, or GBIF, is the largest of these repositories (the others restricted by geographical region or taxon). We can call this module like this:
occurrence = SpOcc(species = "SpeciesName", # Species scientific name
extent = c(-1, 0, 51, 52), # Coordinates for the extent of the region
databases = "gbif", # List of data bases to use
type = "presence", # Type of data you want
limit = 10000) # A maximum limit of records to obtain
The Bioclim
module obtains bioclimatic variables from WorldClim. There are 19 bioclimatic variables available that are derived from monthly measurements of temperature and precipitation values to provide more biologically meaningful variables. These variables represent annual trends, seasonality, and limiting factors. This data is available at various resolutions (2.5, 5, or 10 minutes). We can call this module like this:
covariate = Bioclim(extent = c(-180, 180, -90, 90), # Coordinates for the extent of the region
resolution = 10, # Required resolution
layers = 1:5) # Variables we want (between 1-19)
We can also obtain these bioclimatic variables for predictions of future climate using the Bioclim_future
module. Predicting species distributions under future climate conditions is useful for assessing the effects of climate change on a species for the purposes of conservation. There are multiple models used for predicting climate conditions in the future, and different magnitudes of change, so we must put some thought into the scenario we wish to predict. We can find more information about different scenarios used to predict future climates here and here. Once we have done our research about how type of pathway and model we will use to predict, we can call the Bioclim_future
module
like this:
covariate = Bioclim_future(extent = c(-10, 10, 45, 65), # Coordinates of the extent of the region
resolution = 10, # Resolution of the data
layers = 1:19, # Required Bioclim variables
rcp = 45, # Representative Concentration Pathways
model = "AC", # General Circulation Models
year = 70) # Time period for the prediction
The NCEP
module obtains environmental data from the National Centers for Environmental Prediction. This repository provides a source of environmental variable not available through other repositories, such as air pressure, humidity measures, and soil moisture. For a full list of variables look at ?NCEP.gather
. We can call this module like this:
covariate = NCEP(extent = c(-5, 5, 50, 60), # Coordinates of the extent of the region
variables = "hgt", # Character vector of variables of interest
status.bar = FALSE) # Show a status bar of download progress?
For example, here we obtain presence-only data for the brown bear, Ursus arctos, in North America from GBIF, and bioclimatic variables from Wordclim.
Online <- workflow(occurrence = SpOcc(species = "Ursus arctos",
extent = c(-175, -65, 20, 75),
databases = "gbif",
type = "presence"),
covariate = Bioclim(extent = c(-175, -65, 20, 75),
resolution = 10,
layers = 1:19),
process = Background(1000),
model = NullModel,
output = InteractiveOccurrenceMap)
Now that we’ve seen how to use datasets from different sources, lets construct a workflow using a combination of sources with two sets of occurrence data for comparison. Let’s use CarolinaWrenPO
and SpOcc
to obtain presence-only data for the Carolina wren and Bioclim
for our environmental rasters.
CombinationA <- workflow(occurrence = list(CarolinaWrenPO,
SpOcc("Thryothorus ludovicianus",
extent = c(-138.71, -52.58, 18.15, 54.95))),
covariate = Bioclim(extent = c(-138.71, -52.58, 18.15, 54.95)),
process = Background(100),
model = LogisticRegression,
output = PrintMap)
Both of the above plots were generated using the same process and model mdoules; they only differ in the source of their occurrence records. It is obvious that there are significant differences in the predicted distribution of the Carolina wren depending on the dataset chosen. This highlights a very important point to be considered during an analysis: bias in data source. Different data sources are going to have different amounts (and quality) of data-points, and their own inherent sampling biases. Do they contain scientific surveys only? Goverment agency surveys? All occurrence records?
If we pretend that the dataset from CarolinaWrenPO
was our own survey data from a study restricted to the contiguous USA, while the SpOcc
dataset (via the GBIF repository) includes all occurrence records aross North America (thus including Canada and Mexico), then we have a spatial sampling bias to consider. The latter dataset potentially includes occurrence records outside of the environmental range of the former, which could impact the estimation of the species’ response to the environmental variables.
Other potential sources of bias to consider include the sampled range of environmental variables, spatial sampling bias related to site accessibility (e.g. museum records commonly occur near roads), and reliability of the data source. Source A might have twice the number of occurrence records as Source B, possibly over a larger spatial and/or environmental range, but Source B may be more reliable as it includes records only from trained surveyors to eliminate possibilities of species mis-identification.
As a final example, lets explore the ability to use the SpOcc
module to obtain presence records for species other than our focus species and use them as background data in our model. This can be used to implement a ‘targetted background group’ approach for generating our background data for overcoming observation bias. Here we use presence records of the sedge wren, Cistothorus platensis, and the winter wren, Troglodytes hiemalis, as background data for our Carolina wren analysis. You can see that our background data points are now from geographically similar regions as our presence records rather than randomly distributed throughout the environment.
TargettedBG <- workflow(occurrence = Chain(SpOcc("Thryothorus ludovicianus",
extent = c(-138.71, -52.58, 18.15, 54.95),
type = "presence"),
SpOcc("Cistothorus platensis",
extent = c(-138.71, -52.58, 18.15, 54.95),
type = "background"),
SpOcc("Troglodytes hiemalis",
extent = c(-138.71, -52.58, 18.15, 54.95),
type = "background")),
covariate = Bioclim(extent = c(-138.71, -52.58, 18.15, 54.95)),
process = NoProcess,
model = NullModel,
output = InteractiveOccurrenceMap)