In this vignette we explain how to fit, evaluate and interpret a trophic species distribution model (trophic SDM). Trophic SDM is a statistical model to model the distribution of species accounting for their known trophic interactions. We refer to Poggiato et al., “Integrating trophic food webs in species distribution models improves ecological niche estimation and prediction”, in preparation for the full description of trophic SDM.
For illustrative purposes, we use simulated data for which we know the parameter values. We thus simulate a trophic interaction network and species distribution for six species, whose names are ordered accordingly to their trophic level (Y1 being basal and Y6 being at the highest trophic position). We simulate data with a stronger dependence on the abiotic variables for species in lower trophic levels, and increasing dependence on the biotic terms (i.e., their preys) for species with a higher trophic level. We will then use trophic SDM and see if the model can retrieve this simulate process and the species potential niches.
We first create the trophic interaction network. The network needs to be a directed acyclic graph (DAG), with arrows pointing from predator to prey. We thus simulate an interaction network such that species only have prey in a lower trophic position.
# Set number of species
S = 6
# Create the adjacency matrix of the graph
A = matrix(0,nrow=S,ncol=S)
# Ensure that the graph is connected
while(!igraph::is.connected(igraph::graph_from_adjacency_matrix(A))){
A = A_w = matrix(0,nrow=S,ncol=S)
# Create links (i.e. zeros and ones of the adjacency matrix)
for(j in 1:(S-1)){
A[c((j+1):S),j] = sample(c(0,1), S-j, replace = T)
}
# Sample weights for each link
#A_w[which(A != 0)] = runif(1,min=0,max=2)
}
colnames(A) = rownames(A) = paste0("Y",as.character(1:S))
# Build an igraph object from A
G = igraph::graph_from_adjacency_matrix(A)
plot(G)
We now simulate species distribution at 300 locations along two environmental gradients. We simulate species such that species Y1 to Y3 have a negative dependence to the first environmental variable X_1 and a positive dependence to X_2, viceversa for species Y4 to Y6. Moreover, the strength of the environmental term decreases along the trophic level, while the effect of the prey increases. Finally, preys always have a positive effect on predators.
# Number of sites
n = 500
# Simulate environmental gradient
X_1= scale(runif(n, max = 1))
X_2= scale(runif(n, max = 1))
X = cbind(1, X_1^2, X_2^2)
B = matrix(c(1, 1, 1, 1, 1, 1,
-2, -2, -2, 2, 2, 2,
2, 2, 2, -2, -2, -2
), ncol = 3)
# Sample the distribution of each species as a logit model as a function of the
# environment and the distribution of preys
Y = V = prob = matrix(0, nrow = n, ncol = S)
for(j in 1:S){
# The logit of the probability of presence of each species is given by the
# effect of the environment and of the preys.
# Create the linear term. As j increases, the weight of the environmental effect
# decreases and the biotic effect increases
eff_bio = seq(0.1, 0.9, length.out = S)[j]
eff_abio = seq(0.1, 0.9, length.out = S)[S+1-j]
V[,j]= eff_abio*X%*%B[j,] + eff_bio*Y%*%A[j,]*3
# Transform to logit
prob[,j] = 1/(1 + exp(-V[,j]))
# Sample the probability of presence
Y[,j] = rbinom(n = n, size = 1, prob = prob[,j])
}
# Set names
colnames(Y) = paste0("Y",as.character(1:S))
Now that we simulate the data, we can fit a species distribution
model with the function trophicSDM
. The function requires
the species distribution data Y
(a sites x species matrix),
the environmental variables X
and an igraph object
G
representing the trophic interaction network.
Importantly, G
must be acyclic and arrows must point from
predator to prey. Finally, the user has to specify a formula for the
environmental part of the model and the description of the error
distribution and link function to be used in the model (parameter
family
), just as with any glm
.
By default, trophicSDM
models each species as a function of
their prey (set mode = "predator"
to model species as a
function of their predators) and of the environmental variables (as
defined in env.formula
). Trophic SDM can be also fitted as
a function of the summary variables of preys (i.e. composite variables
like prey richness), instead of each prey species directly. We cover
this topic in the vignette ‘Composite variables’.
The algorithm behind trophic SDM fits a set of generalised linear model
(glm), one for each species. These glm can be fitted in the frequentist
framework by relying on the function glm
(set
method = "glm"
), or in the Bayesian framework by relying on
the function rstanarm
(set
method = "stan_glm"
, the default choice). Model fitting is
parallelised if run.parallel = TRUE
. These glm can be
penalised to shrink regression coefficients (by default the model is run
without penalisation). In the Bayesian framework set
penal = "horshoe"
to implement penalised glm using horshoe
prior. In the frequentist framework set
penal = "elasticnet"
to implement penalised glm with
elasticnet. See the examples in ?trophicSDM
. Hereafter, we
only fit trophic SDM in the Bayesian framework with no penalisation.
However, all the functions and analyses (except for MCMC diagnostic)
presented in this vignette applies straightforwardly with or without
penalisation and in the frequentist framework.
Hereafter we model species as a function of preys and the two
environmental covariates in the Bayesian framework without penalisation.
We though need to specify the parameters of the MCMC chains. We need to
decide how many chains to sample (chains
), how many samples
to obtain per chain (iter
), and if we wish to see the
progress of the MCMC sampling (verbose
).
We are now ready to fit a trophic SDM.
X = data.frame(X_1 = X_1, X_2 = X_2)
m = trophicSDM(Y = Y, X = X, G = G,
env.formula = "~ I(X_1^2) + I(X_2^2)",
family = binomial(link = "logit"),
method = "stan_glm", chains = nchains, iter = iter, verbose = verbose)
The fitted model m
is of class
"trophicSDMfit"
. We also show here the formulas used to fit
the each glm, where each predator is modeled as a function of the preys
and of the environmental variables.
m
#> ==================================================================
#> A trophicSDM fit with no penalty, fitted using stan_glm
#> with a bottom-up approach
#>
#> Number of species : 6
#> Number of links : 10
#> ==================================================================
#> * Useful fields
#> $coef
#> * Useful S3 methods
#> print(), coef(), plot(), predict(), evaluateModelFit()
#> predictFundamental(), plotG(), plotG_inferred(), computeVariableImportance()
#> * Local models (i.e. single species SDM) can be accessed through
#> $model
class(m)
#> [1] "trophicSDMfit"
m$form.all
#> $Y1
#> [1] "y ~ I(X_1^2) + I(X_2^2)"
#>
#> $Y2
#> [1] "y ~ I(X_1^2)+I(X_2^2)+Y1"
#>
#> $Y3
#> [1] "y ~ I(X_1^2)+I(X_2^2)+Y1+Y2"
#>
#> $Y4
#> [1] "y ~ I(X_1^2)+I(X_2^2)+Y1+Y3"
#>
#> $Y5
#> [1] "y ~ I(X_1^2)+I(X_2^2)+Y1+Y3+Y4"
#>
#> $Y6
#> [1] "y ~ I(X_1^2)+I(X_2^2)+Y3+Y5"
The field $model
contains all the the single species glm
(i.e., all the local models). $data
and
trophicSDM
contains informations on the data and choices to
fit the model respectively. $form.all
contains the formula
used to fit each glm (i.e., both abiotic and biotic terms).
$coef
contains the inferred regression coefficients of the
glm for all species. $AIC
and $log.lik
are
respectively the AIC and log-likelihood of the fitted trophic SDM.
Finally, in the Bayesian framework, the field $mcmc.diag
contains the convergence diagnostic metrics of the MCMC chains for each
species.
In this vignette, we first ensure that the MCMC chains correctly converged. Then, we analyse the inferred model, by focusing in particular on the effect of the biotic versus abiotic terms. We then analyse the predictive performances of the model (comparing it to environment-only SDM) and use it to predict the species potential niches.
First, we can have a look at effective sample size to total sample size and of the potential scale reduction factor (aka ‘R hat’) for each species. Each line corresponds to a given parameter.
# Plot mcmc potential scale reduction factor (aka rhat)
p1 = mcmc_rhat(m$mcmc.diag$rhat)
# Plot mcmc ratio of effective sample size to total sample size
p2 = mcmc_neff(m$mcmc.diag$neff.ratio)
grid.arrange(p1, p2, ncol = 2)
We observe that the effective sample sizes are very close to the
theoretical value of the actual number of samples, as the ratio is close
to 1. In few cases the ratio is lower, and we might consider running the
chains for longer, or thin them, to improve convergence. This indicates
that there is very little autocorrelation among consecutive samples. The
potential scale reduction factors are very close to one, which indicates
that two chains gave consistent results. We can have a closer look to
the MCMC chains by looking at the traceplots of the parameters of a
given model, for example here for species Y5 who feeds on species Y1,
Y2, Y3 and Y4. To do so, we first assess the local model of species Y5,
and in particular its field $model
that gives us the output
of the function stan_glm
. We can then exploit the plot
method of this object.
Again, we see that the chains mix very well and that there’s little autocorrelation among samples within the same chain.
We now analyse the estimates of the regression coefficients. We can have a first glance of the regression coefficients with the function plot. This function plots the posterior distribution of the regression coefficients of each species.
We can visually see that the regression coefficients for the
environmental part of the model match the simulated parameters, as
species Y1 to Y3 have a positive response to X_1 and a negative one to
X_2, viceversa for species Y3 to Y6. We also remark that the
environmental coefficients are less important for basal species than for
predators. Indeed, the posterior mean of the regression coefficients of
the environmental part gets closer to zero and their posterior
distribution tend to overlap zero more and more as we move from species
Y1 to species Y6. We can quantify this visual assessment by looking at
the posterior mean and quantile of the regression coefficients. The
parameter level
set the probability of the quantile
(default 0.95).
coef(m, level = 0.9)
#> $Y1
#> mean 5% 95%
#> (Intercept) 1.193315 0.8631366 1.551786
#> I(X_1^2) -2.153769 -2.4827322 -1.854351
#> I(X_2^2) 1.904532 1.5618766 2.281827
#>
#> $Y2
#> mean 5% 95%
#> (Intercept) 0.4378192 -0.05096073 0.9425659
#> I(X_1^2) -1.3103497 -1.64827655 -0.9781233
#> I(X_2^2) 1.6693748 1.28262664 2.0953004
#> Y1 1.1014760 0.60271353 1.6253656
#>
#> $Y3
#> mean 5% 95%
#> (Intercept) 0.9246159 0.3048754 1.537713
#> I(X_1^2) -1.3782372 -1.7729318 -1.011603
#> I(X_2^2) 1.2837309 0.8501504 1.799605
#> Y1 0.8750454 0.2487845 1.523552
#> Y2 1.3356857 0.7836524 1.874160
#>
#> $Y4
#> mean 5% 95%
#> (Intercept) -0.149971 -0.9501498 0.6711274
#> I(X_1^2) 1.132746 0.7019883 1.5938620
#> I(X_2^2) -1.138931 -1.5981180 -0.6976561
#> Y1 2.550784 1.6477388 3.4849039
#> Y3 2.000125 1.2299528 2.9196692
#>
#> $Y5
#> mean 5% 95%
#> (Intercept) 0.5422551 -0.76409358 2.0854414
#> I(X_1^2) 0.9843555 0.02913223 2.0103196
#> I(X_2^2) -1.4438633 -2.74833106 -0.2571721
#> Y1 3.0216650 0.36971766 6.3588373
#> Y3 4.4991171 2.06335620 7.4694324
#> Y4 1.9616552 0.64162409 3.4955833
#>
#> $Y6
#> mean 5% 95%
#> (Intercept) -0.18963593 -2.0485881 1.5794885
#> I(X_1^2) 0.06055848 -0.9455691 0.9899717
#> I(X_2^2) -0.18911690 -1.2013288 0.8962215
#> Y3 3.62070445 1.1223913 6.5666245
#> Y5 3.69026866 2.1993395 5.4164494
Notice that regression coefficients are directly available in the
field $coef
of an object trophicSDMfit (with
level = 0.95
). As we previously hinted, we see that the
estimate of the regression coefficients of the environmental part get
closer and closer to zero with credible interval that often overlap zero
(implying they are not significant with a confidence level of 10%) for
species with a higher trophic level. In order to be able to correctly
quantify the effect of the different predictors across species (i.e.,
their relative importance), we need to standardise the regression
coefficients (see Grace et al. 2018). To do so, we need to set
standardise = TRUE
.
coef(m, level = 0.9, standardise = T)
#> $Y1
#> estimate 5% 95%
#> (Intercept) 1.1933152 1.1933152 1.1933152
#> I(X_1^2) -0.6032735 -0.6954163 -0.5194061
#> I(X_2^2) 0.5524401 0.4530474 0.6618806
#>
#> $Y2
#> estimate 5% 95%
#> (Intercept) 0.4378192 0.43781923 0.4378192
#> I(X_1^2) -0.3967764 -0.49910135 -0.2961777
#> I(X_2^2) 0.5234732 0.40219885 0.6570325
#> Y1 0.1815471 0.09934023 0.2678954
#>
#> $Y3
#> estimate 5% 95%
#> (Intercept) 0.9246159 0.92461585 0.9246159
#> I(X_1^2) -0.4018688 -0.51695451 -0.2949649
#> I(X_2^2) 0.3876289 0.25670715 0.5433997
#> Y1 0.1388822 0.03948565 0.2418093
#> Y2 0.2027708 0.11896649 0.2845168
#>
#> $Y4
#> estimate 5% 95%
#> (Intercept) -0.1499710 -0.1499710 -0.1499710
#> I(X_1^2) 0.4526599 0.2805236 0.6369279
#> I(X_2^2) -0.4713231 -0.6613485 -0.2887107
#> Y1 0.5548413 0.3584128 0.7580292
#> Y3 0.3802667 0.2338405 0.5550919
#>
#> $Y5
#> estimate 5% 95%
#> (Intercept) 0.5422551 0.542255094 0.54225509
#> I(X_1^2) 0.2863784 0.008475438 0.58486205
#> I(X_2^2) -0.4350071 -0.828017071 -0.07748081
#> Y1 0.4785091 0.058548274 1.00698178
#> Y3 0.6227408 0.285597372 1.03387396
#> Y4 0.1816850 0.059426083 0.32375471
#>
#> $Y6
#> estimate 5% 95%
#> (Intercept) -0.18963593 -0.1896359 -0.1896359
#> I(X_1^2) 0.02223722 -0.3472152 0.3635200
#> I(X_2^2) -0.07191471 -0.4568244 0.3408025
#> Y3 0.63254324 0.1960837 1.1472005
#> Y5 0.17891028 0.1066276 0.2625983
This results confirms the previous analysis. We can analyse the
relative importance of different group of variables with the function
computeVariableImportance
. This function can take as input
the definition of groups of variables in the argument
groups
(by default, with groups = NULL
, each
explanatory variable is a different group). The variable importance of
groups of variables is computed as the standardised regression
coefficients summed across variable of the same group.
We hereafter use this function to compute the relative importance of the
abiotic versus biotic covariates.
VarImpo = computeVariableImportance(m,
groups = list("Abiotic" = c("X_1","X_2"),
"Biotic" = c("Y1","Y2", "Y3", "Y4", "Y5", "Y6")))
VarImpo = apply(VarImpo, 2, function(x) x/(x[1]+x[2]))
tab = reshape2::melt(VarImpo)
tab$Var2 = factor(tab$Var2, levels = colnames(Y))
ggplot(tab, aes(x = Var2, y = value, fill = Var1)) + geom_bar(stat="identity") +
theme_classic()
We clearly see that our model correctly retrieved the simulated pattern, with the importance of biotic variables with respect to the abiotic ones becomes greater as the trophic level increases.
Finally, we can visualise the effect of each trophic interactions
with the function plotG_inferred
, which plots the metaweb
G
together with the variable importance (i.e. the
standardised regression coefficient) of each link. The parameter
level
sets the confidence level with which coefficients are
deemed as significant or non-significant.
We see that the model correctly retrieves that preys have a posite effect on the predators and that this effect becomes more important for predators at higher trophic positions.
In order to examine the package and practice with it, we here show how to easily manipulate local models, i.e. each glm (that can be seen as a single species SDMs), the smallest pieces of trophic SDM. These local models belong to the class “SDMfit”, for which multiple methods exists. We hereafter show how to some of these methods for species Y6.
SDM = m$model$Y6
# The formula used to fit the model
SDM$form.all
#> [1] "y ~ I(X_1^2)+I(X_2^2)+Y3+Y5"
# The inferred model
plot(SDM)
# The regression coefficients
coef(SDM, level = 0.9, standardise = T)
#> estimate 5% 95%
#> (Intercept) -0.18963593 -0.1896359 -0.1896359
#> I(X_1^2) 0.02223722 -0.3472152 0.3635200
#> I(X_2^2) -0.07191471 -0.4568244 0.3408025
#> Y3 0.63254324 0.1960837 1.1472005
#> Y5 0.17891028 0.1066276 0.2625983
We can also predict the probability of presence of species Y6 at a site with X_1 = 0.5, X_2 = 0.5, assuming its preys to be present present.
preds = predict(SDM, newdata = data.frame(X_1 = 0.5, X_2 = 0.5, Y3 = 1, Y5 = 1))
# Posterior mean of the probability of presence
mean(preds$predictions.prob)
#> [1] 0.9983807
We see that the species has a high probability of presence in this environmental conditions when its prey are present.
We showed how to fit and analyse a trophic SDM and how to predict a
single species (Y6 in the previous section), when we can fix the
distribution of its prey. However, the distribution of prey is typically
unavailable when we want to predict the distribution of species at
unobserved sites and/or under future environmental conditions.
Intuitively, in a simple network of two trophic level we would need to
predict the prey first and use these predictions to predict the
predator. To generalize this idea to complex networks, we predict
species following the topological ordering of the metaweb. This order
that exists for every DAG guarantees that when predicting a given
species, all its prey will have already been predicted.
The function predict
computes the topological order of
species, and then predicts them sequentially. As any
predict
function, it takes into account the values of the
environmental variables in the argument Xnew
. We can
provide to the function the number of samples from the posterior
distribution.
pred = predict(m, Xnew = data.frame(X_1 = 0.5, X_2 = 0.5), pred_samples = 100)
dim(pred$Y2)
#> [1] 1 100
By default, the function predict
transforms the
probability of presence of prey in presence-absence, and then use these
presence-absences to predict the predator. However, we can use directly
the probability of presence of prey by setting
prob.cov = TRUE
.
Pred is a list containing the posterior distribution of the predicted
probability of presence of each species (i.e., number of sites x
pred_samples). Notice that we can ask the function to directly provide a
summary of the posterior distribution (i.e. posterior mean and 95%
quantiles) of the probabilities of presence by setting
fullPost = FALSE
. As for trophicSDM
, we can
choose to parallelise some parts of the predict
function by
setting run.parallel = TRUE
(which can speed up
computational times).
We now want to evaluate how well the model fits the data and how well
it generalises to new data with respect to classical SDM. We thus first
run a classical SDM, by providing an empty graph to
trophicSDM
. Then, we will compare model AIC and their
predictive performances both on the training data and in
cross-validation. Notice that when we compare classical SDM and trophic
SDM it is important to account for the fact that trophic SDM have a
higher number of parameters, and will therefore have, by default, a
higher fit on the the training data.
empty_graph = graph_from_adjacency_matrix(matrix(0, nrow = S, ncol = S,
dimnames = list(colnames(Y), colnames(Y))))
m_classic = trophicSDM(Y = Y, X = X,
G = empty_graph,
env.formula = "~ I(X_1^2) + I(X_2^2)", family = binomial(link = "logit"),
method = "stan_glm", chains = nchains, iter = iter, verbose = verbose)
We first compare model AIC and an approximation of the leave-one-out cross-validation log-likelihood (loo, see Vehtari, A. et al 2017). AIC penalises the likelihood of the model as a function of the number of model parameter, while loo approximate the likelihood of the model on leave-one out cross validation. Therefore, both these metrics can be used to correctly compare the two approaches
m$AIC
#> [1] 1455.775
m_classic$AIC
#> [1] 1584.531
loo(m)
#> Warning: Found 6 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 6 times to compute the ELPDs for the problematic observations directly.
#> [1] -707.626
loo(m_classic)
#> [1] -777.2048
We see that our trophic model has a lower AIC and a higher loo, which proves that including prey improves the model.
We now compare the predictive performance of the two approaches on
the training dataset. To do so, we first use the function
predict
, and then the function
evaluateModelFit
to compare how well predictions match the
true data.
# Trophic SDM (when Xnew is not specified, predictions are carried on
# the training dataset by setting Xnew = m$data$X)
Ypred = predict(m, fullPost = FALSE)
# Transfom in a table
Ypred = do.call(cbind,
lapply(Ypred, function(x) x$predictions.mean))
# Re order columns
Ypred = Ypred[,colnames(Y)]
metrics = evaluateModelFit(m, Ynew = Y, Ypredicted = Ypred)
# Classical SDM
Ypred_classic = predict(m, fullPost = FALSE)
Ypred_classic = do.call(cbind,
lapply(Ypred_classic, function(x) x$predictions.mean))
Ypred_classic = Ypred_classic[,colnames(Y)]
metrics_classic = evaluateModelFit(m, Ynew = Y, Ypredicted = Ypred_classic)
# Mean AUC
mean(metrics$auc)
#> [1] 0.7937958
mean(metrics_classic$auc)
#> [1] 0.8031646
# Mean TSS
mean(metrics$tss)
#> [1] 0.5325318
mean(metrics_classic$tss)
#> [1] 0.5312451
On the training dataset models have similar performances, with our trophic model that slightly improves SDM’s TSS.
We now compare the two approaches in K-fold cross validation with the
function trophicSDM_CV
. This function creates a partition
based on the specified number of fold (but can also takes a
user-specified partition in the argument partition
) and run
a cross validation. This can take a bit of time. Set
run.parallel = TRUE
if you want to reduce computation
time.
# 3-fold cross validation
CV = trophicSDM_CV(m, K = 3, prob.cov = T, iter = 2000,
pred_samples = 500, run.parallel = FALSE)
#> [1] "Fold 1 out of 3 \n"
#> [1] "Fold 2 out of 3 \n"
#> [1] "Fold 3 out of 3 \n"
# Transfom in a table
Ypred = CV$meanPred
# Re order columns
Ypred = Ypred[,colnames(Y)]
metrics = evaluateModelFit(m, Ynew = Y, Ypredicted = Ypred)
# Classical SDM
# We specify the same partition for classic SDM so that the two cross-validations are comparable.
CV_classic = trophicSDM_CV(m_classic, K = 3, prob.cov = T, iter = 2000,
pred_samples = 500, run.parallel = FALSE,
partition = CV$partition)
#> [1] "Fold 1 out of 3 \n"
#> [1] "Fold 2 out of 3 \n"
#> [1] "Fold 3 out of 3 \n"
Ypred_classic = CV_classic$meanPred
Ypred_classic = Ypred_classic[,colnames(Y)]
metrics_classic = evaluateModelFit(m, Ynew = Y, Ypredicted = Ypred_classic)
# Mean AUC
mean(metrics$auc)
#> [1] 0.6999403
mean(metrics_classic$auc)
#> [1] 0.668446
# Mean TSS
mean(metrics$tss)
#> [1] 0.4260548
mean(metrics_classic$tss)
#> [1] 0.3773521
The increase in mean AUC and TSS is even higher in 3-fold cross-validation than on the fitted data. As we simulated the data with a strong biotic signal, trophic SDM proves to be a better tool than classic SDM.
We also implemented a function to predict species potential niches, defined as the probability of presence of species along the environmental gradient when the biotic constrain is realised, i.e., when all its prey are set to present (vice-versa, when modeling species top-down, when the predator are absent). We here compute the probability of presence of species at environmental conditions X_1= 0.5, X_2 = 0.5 when the preys are available.
preds = predictPotential(m, Xnew = data.frame(X_1 = 0.5, X_2 = 0.5))
#> Error in predictPotential(m, Xnew = data.frame(X_1 = 0.5, X_2 = 0.5)): could not find function "predictPotential"
# Take the posterior mean
lapply(preds,mean)
#> Warning in mean.default(X[[i]], ...): l'argomento non è numerico o logico: si restituisce NA
#> $predictions.prob
#> [1] 0.9983807
#>
#> $predictions.bin
#> [1] NA
Grace, J. B., Johnson, D. J., Lefcheck, J. S., and Byrnes, J. E. K.. 2018. Quantifying relative importance: computing standardized effects in models with binary outcomes. Ecosphere 9(6):e02283.
Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 27(5), 1413–1432.