This section briefly describes the steps involved in the series of
model-based analyses of eDNA metabarcoding data using the
occumb
R package.
We will use the package’s built-in data fish
(see
?fish
for documentation) to analyze the sequence count data
using the occumb
package. The summary()
function can be used to provide an overview of the dataset.
#> Sequence read counts:
#> Number of species, I = 50
#> Number of sites, J = 50
#> Maximum number of replicates per site, K = 3
#> Number of missing observations = 6
#> Number of replicates per site: 2.88 (average), 0.33 (sd)
#> Sequencing depth: 77910 (average), 98034.7 (sd)
#>
#> Species covariates:
#> mismatch (continuous)
#> Site covariates:
#> riverbank (categorical)
#> Replicate covariates:
#> (None)
#>
#> Labels for species:
#> Abbottina rivularis, Acanthogobius lactipes, Acheilognathus macropterus, Acheilognathus rhombeus, Anguilla japonica, Biwia zezera, Carassius cuvieri, Carassius spp., Channa argus, Ctenopharyngodon idella, Cyprinus carpio, Gambusia affinis, Gnathopogon spp., Gymnogobius castaneus, Gymnogobius petschiliensis, Gymnogobius urotaenia, Hemibarbus spp., Hypomesus nipponensis, Hypophthalmichthys spp., Hyporhamphus intermedius, Ictalurus punctatus, Ischikauia steenackeri, Lepomis macrochirus macrochirus, Leucopsarion petersii, Megalobrama amblycephala, Micropterus dolomieu dolomieu, Micropterus salmoides, Misgurnus spp., Monopterus albus, Mugil cephalus cephalus, Mylopharyngodon piceus, Nipponocypris sieboldii, Nipponocypris temminckii, Opsariichthys platypus, Opsariichthys uncirostris uncirostris, Oryzias latipes, Plecoglossus altivelis altivelis, Pseudogobio spp., Pseudorasbora parva, Rhinogobius spp., Rhodeus ocellatus ocellatus, Salangichthys microdon, Sarcocheilichthys variegatus microoculus, Silurus asotus, Squalidus chankaensis biwae, Tachysurus tokiensis, Tanakia lanceolata, Tribolodon brandtii maruta, Tribolodon hakonensis, Tridentiger spp.
#> Labels for sites:
#> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50
#> Labels for replicates:
#> L, C, R
A summary of the sequence read count data can be found in the first
block of the output. Data were obtained from samples taken from 50 sites
\(\times\) (up to) three replicates,
with 50 fish species recorded. A summary of the missing samples, number
of replicates per site, and sequencing depth (i.e., the total number of
sequence reads per sample) is also presented. The second block of the
output indicates that the fish
data also have two
covariates: one continuous species covariate, mismatch
, and
one discrete site covariate, riverbank
, which can be used
in the following analysis. The third block of the output shows the
labels assigned to the 50 species, 50 sites, and three replicates.
As such, the fish
data brought together a set of data
relevant to the analysis using the occumb
package. When
analyzing one’s own dataset, you can use occumbData()
to
set up a data object.
If you want to display the entire data, just type
fish
.
Using the fish
dataset (or the dataset constructed with
occumbData()
), one can fit a multispecies site occupancy
model for eDNA metabarcoding using the occumb()
function
which fits the model in a fully Bayesian approach via the Markov Chain
Monte Carlo (MCMC) method implemented in JAGS.
occumb()
has several arguments that accept model
formulas to fit different model variants using the usual R formula
syntax. See vignette("model_specification")
for an overview
of the models that occumb()
can fit and details on how to
specify models.
Fitting a null model (i.e., an intercept-only model, although the value of the intercept is species-specific) is very simple, as follows:
This model fitting can take approximately one hour. Given that
occumb()
fits a class of complex hierarchical models with
many parameters and unknown variables, it may require, depending on the
dataset size, a long time and/or a lot of memory for model fitting.
Setting parallel = TRUE
is recommended for faster model
fitting through parallel computations.
A model incorporating species and site covariates from the
fish
dataset can be fitted as follows:
fit1 <- occumb(formula_psi = ~ riverbank,
formula_phi_shared = ~ mismatch,
data = fish,
parallel = TRUE)
Here, the riverbank
is specified as the covariate for
site occupancy probability (psi
) in the
formula_psi
argument, and mismatch
as the
covariate for the relative dominance of sequence (phi
) in
the formula_phi_shared
argument. In the latter, since
mismatch
is a species covariate, the
formula_phi_shared
argument is used instead of
formula_phi
so that the effect of mismatch
is
constant across species; see
vignette("model_specification")
for more details.
Thus, as in the generalized linear models (GLMs), model parameters
can be expressed as functions of covariates, and occumb()
allows straightforward covariate specification using the formula syntax.
We will see later how to examine the estimated parameter values and
covariate effects.
occumb()
also has several arguments for controlling the
MCMC computation to fit the model. For example, to obtain more precise
posterior estimates than in the above example, one can explicitly set
n.iter
and n.thin
arguments to obtain longer
and less autocorrelated MCMC samples. In the following,
n.iter
and n.thin
are set to twice the
default.
fit1x <- occumb(formula_psi = ~ riverbank,
formula_phi_shared = ~ mismatch,
data = fish,
n.thin = 20,
n.iter = 40000,
parallel = TRUE)
Note that when fitting a model using occumb()
, a warning
message "At least one Rhat value could not be calculated."
may appear. Usually, this warning can be ignored safely. In models fit
by occumb()
, the values of some latent variables may not
change across MCMC iterations, making it impossible to compute
Rhat
values for them. This is the correct behavior of the
model, but it can cause the above warning to be displayed even when the
MCMC method is working properly.
With the fit1
object, we will examine how to access and
diagnose the posterior sample obtained using occumb()
.
Enter the object name fit1
to display a summary of the
MCMC results, including a table of posterior estimates for each saved
parameter.
The plot()
function draws a trace plot and density plot
for each parameter.
Users of the {jagsUI}
package should be familiar with
the outputs of these operations. Model fitting in occumb()
relies on functions provided by the {jagsUI}
package, and
the above operations apply print
and plot
methods to {jagsUI}
objects.
The summary()
function provides a high-level summary of
the results. This is based on a previous version of the
summary
method for {jagsUI}
objects
(implemented until {jagsUI}
package version 1.5.2) and is
modified for the occumb()
output, which displays a data
summary, model equations, MCMC settings, and a quick overview of the
convergence of the relevant model parameters.
#> Summary for an occumbFit object
#>
#> Summary of data:
#> Number of species, I = 50
#> Number of sites, J = 50
#> Maximum number of replicates per site, K = 3
#> Number of missing observations = 6
#> Number of replicates per site: 2.88 (average), 0.33 (sd)
#> Sequencing depth: 77910 (average), 98034.7 (sd)
#>
#> Model specification:
#> formula_phi: ~ 1
#> formula_theta: ~ 1
#> formula_psi: ~ riverbank
#> formula_phi_shared: ~ mismatch
#> formula_theta_shared: ~ 1
#> formula_psi_shared: ~ 1
#> prior_prec: 1e-04
#> prior_ulim: 10000
#>
#> Saved parameters:
#> Mu sigma rho alpha beta gamma alpha_shared phi theta psi z pi deviance
#>
#> MCMC ran in parallel for 43.593 minutes at time 2023-09-11 20:16:31.787026:
#> For each of 4 chains:
#> Adaptation: 100 iterations (sufficient)
#> Burn-in: 10000 iterations
#> Thin rate: 10 iterations
#> Total chain length: 20100 iterations
#> Posterior sample size: 1000 draws
#>
#> Summary of posterior samples:
#> Mu:
#> Number of parameters: 4
#> Rhat: 1.009 (min), 1.027 (median), 1.043 (mean), 1.109 (max)
#> n.eff: 30 (min), 117.5 (median), 154.2 (mean), 352 (max)
#> sigma:
#> Number of parameters: 4
#> Rhat: 1.004 (min), 1.04 (median), 1.074 (mean), 1.213 (max)
#> n.eff: 17 (min), 67.5 (median), 217 (mean), 716 (max)
#> rho:
#> Number of parameters: 6
#> Rhat: 1.042 (min), 1.107 (median), 1.106 (mean), 1.187 (max)
#> n.eff: 19 (min), 33.5 (median), 40.7 (mean), 79 (max)
#> alpha:
#> Number of parameters: 50
#> Rhat: 1.027 (min), 1.093 (median), 1.157 (mean), 1.77 (max)
#> n.eff: 7 (min), 34 (median), 40.6 (mean), 104 (max)
#> beta:
#> Number of parameters: 50
#> Rhat: 1.002 (min), 1.018 (median), 1.028 (mean), 1.237 (max)
#> n.eff: 16 (min), 183.5 (median), 349.9 (mean), 4000 (max)
#> gamma:
#> Number of parameters: 100
#> Rhat: 1.001 (min), 1.019 (median), 1.029 (mean), 1.136 (max)
#> n.eff: 24 (min), 173.5 (median), 363.5 (mean), 2756 (max)
#> alpha_shared:
#> Number of parameters: 1
#> Rhat: 1.1
#> n.eff: 31
#> phi:
#> Number of parameters: 50
#> Rhat: 1.001 (min), 1.07 (median), 1.127 (mean), 1.664 (max)
#> n.eff: 8 (min), 42.5 (median), 202.7 (mean), 3161 (max)
#> theta:
#> Number of parameters: 50
#> Rhat: 1.002 (min), 1.016 (median), 1.024 (mean), 1.219 (max)
#> n.eff: 16 (min), 196 (median), 358.1 (mean), 4000 (max)
#> psi:
#> Number of parameters: 2500
#> Rhat: 1 (min), 1.008 (median), 1.013 (mean), 1.101 (max)
#> n.eff: 36 (min), 396 (median), 679.3 (mean), 4000 (max)
#> z:
#> Number of parameters: 2500
#> Rhat: 1 (min), 1.006 (median), 1.009 (mean), 1.184 (max), 946 (Number of NAs)
#> n.eff: 1 (min), 1233.5 (median), 1628.6 (mean), 4000 (max)
#> pi:
#> Number of parameters: 7500
#> Rhat: 1 (min), 1.078 (median), 1.102 (mean), 1.294 (max), 30 (Number of NAs)
#> n.eff: 1 (min), 1 (median), 1009.1 (mean), 4000 (max)
#> deviance:
#> Rhat: 1.001
#> n.eff: 1332
Convergence of the MCMC algorithm is determined based on the
Rhat
value, which becomes 1 when the chains are converged.
Customarily, Rhat
values of less than 1.1 will be required
for the parameters of interest. Overall, the fit1
model is
not very far from convergent, but the convergence of phi
and alpha
, the parameters related to the relative dominance
of species sequence reads, is not good (i.e., high values of
Rhat
); hence, longer MCMC runs will be required for a
formal inference.
The get_post_samples()
and
get_post_summary()
functions can be used to access the
posterior samples of individual parameters or their summaries. This may
require understanding the model parameters and their names: see
vignette("model_specification")
for details. As an example,
we attempt to obtain a posterior summary of the site occupancy
probability psi
using the get_post_summary()
function.
#> mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff overlap0 f
#> psi[1,1] 0.64607375 0.083034133 0.466158634 0.59382744 0.65078275 0.70478301 0.79318634 1.001697 1264 0 1
#> psi[2,1] 0.28250539 0.089275014 0.136970706 0.21789079 0.27480943 0.33672122 0.47971573 1.013796 317 0 1
#> psi[3,1] 0.38321762 0.088583160 0.218908875 0.32150169 0.38055944 0.44171514 0.56510758 1.005127 472 0 1
#> psi[4,1] 0.16899696 0.070833671 0.058363836 0.11684541 0.15943020 0.21075828 0.32821669 1.001143 3788 0 1
#> psi[5,1] 0.20435372 0.088327893 0.069729103 0.13950376 0.19315615 0.25286705 0.41681236 1.006308 596 0 1
#> psi[6,1] 0.10737782 0.061179251 0.024299550 0.06246265 0.09573034 0.13925811 0.25577838 1.002291 1418 0 1
#> psi[7,1] 0.84639119 0.053478041 0.730770215 0.81327035 0.85048238 0.88494478 0.93751979 1.003575 654 0 1
#> psi[8,1] 0.94296846 0.028288919 0.875373936 0.92739641 0.94770976 0.96354354 0.98337253 1.002581 1006 0 1
#> (Omitted the remaining)
In the fit1
model, psi
was assumed to have
different values for each site depending on the riverbank
covariate (remember the psi_formula
argument for
fit1
). Therefore, psi
was estimated for each
of the 50 species \(\times\) 50 sites
(note that psi
differs for each species by default), and
the two subscripts in brackets distinguish them. Which species and site
parameter was psi[1,1]
about? How about
psi[2,1]
? Information regarding the dimensions of the
parameters is provided in the attributes of the resulting object.
#> $Species
#> [1] "Abbottina rivularis"
#> [2] "Acanthogobius lactipes"
#> [3] "Acheilognathus macropterus"
#> [4] "Acheilognathus rhombeus"
#> [5] "Anguilla japonica"
#> [6] "Biwia zezera"
#> [7] "Carassius cuvieri"
#> [8] "Carassius spp."
#> [9] "Channa argus"
#> [10] "Ctenopharyngodon idella"
#> [11] "Cyprinus carpio"
#> [12] "Gambusia affinis"
#> [13] "Gnathopogon spp."
#> [14] "Gymnogobius castaneus"
#> [15] "Gymnogobius petschiliensis"
#> [16] "Gymnogobius urotaenia"
#> [17] "Hemibarbus spp."
#> [18] "Hypomesus nipponensis"
#> [19] "Hypophthalmichthys spp."
#> [20] "Hyporhamphus intermedius"
#> [21] "Ictalurus punctatus"
#> [22] "Ischikauia steenackeri"
#> [23] "Lepomis macrochirus macrochirus"
#> [24] "Leucopsarion petersii"
#> [25] "Megalobrama amblycephala"
#> [26] "Micropterus dolomieu dolomieu"
#> [27] "Micropterus salmoides"
#> [28] "Misgurnus spp."
#> [29] "Monopterus albus"
#> [30] "Mugil cephalus cephalus"
#> [31] "Mylopharyngodon piceus"
#> [32] "Nipponocypris sieboldii"
#> [33] "Nipponocypris temminckii"
#> [34] "Opsariichthys platypus"
#> [35] "Opsariichthys uncirostris uncirostris"
#> [36] "Oryzias latipes"
#> [37] "Plecoglossus altivelis altivelis"
#> [38] "Pseudogobio spp."
#> [39] "Pseudorasbora parva"
#> [40] "Rhinogobius spp."
#> [41] "Rhodeus ocellatus ocellatus"
#> [42] "Salangichthys microdon"
#> [43] "Sarcocheilichthys variegatus microoculus"
#> [44] "Silurus asotus"
#> [45] "Squalidus chankaensis biwae"
#> [46] "Tachysurus tokiensis"
#> [47] "Tanakia lanceolata"
#> [48] "Tribolodon brandtii maruta"
#> [49] "Tribolodon hakonensis"
#> [50] "Tridentiger spp."
#>
#> $Site
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
#> [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
#> [46] "46" "47" "48" "49" "50"
Thus, from the $dimension
attribute, one can know that
the first subscript of psi
represents species and the
second sites. From the $label
attribute, one can know that
psi[1,1]
represents the value of psi
for
Abbottina rivularis at site 1, and psi[2,1]
represents the value of psi
for Acanthogobius
lactipes at site 1.
Let us also look at the posterior summary of gamma
,
which denotes species-specific effects on psi
.
gamma
had two effect values for each species: the
intercept, (Intercept)
, and the effect of the absence of
vegetation on the riverbank, riverbankwithout_vegetation
.
Displaying the summary, one can see that, for many species, the absence
of vegetation has a negative effect on psi
(on its link
scale).
#> mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff overlap0 f
#> (Omitted the beginning)
#> gamma[1,2] -0.68135951 0.4102181 -1.4106885 -0.964385346 -0.71198673 -0.41908665 0.197784615 1.0031831 720 1 0.93875
#> gamma[2,2] -0.57124068 0.4861383 -1.4368696 -0.906331725 -0.60645174 -0.27472631 0.480190462 1.0141750 414 1 0.87925
#> gamma[3,2] -1.07479024 0.4260032 -1.9550037 -1.336688784 -1.07308990 -0.80888005 -0.213626578 1.0040679 632 0 0.99250
#> gamma[4,2] -0.96563859 0.5007208 -1.9760743 -1.276917710 -0.95208744 -0.64960250 0.016721077 1.0020286 1835 1 0.97225
#> gamma[5,2] -1.01038376 0.5436033 -2.1340409 -1.339282006 -0.98783592 -0.65193007 0.006887964 1.0148153 190 1 0.97400
#> gamma[6,2] -0.96443244 0.5694898 -2.1291123 -1.307706597 -0.96394933 -0.62099851 0.205483803 1.0043719 989 1 0.95325
#> gamma[7,2] -1.44522243 0.4356167 -2.4011743 -1.721131931 -1.39965212 -1.14153735 -0.689367497 1.0021543 1065 0 0.99925
#> gamma[8,2] -1.36146823 0.4820477 -2.3661876 -1.654265660 -1.32819584 -1.04709249 -0.483307820 1.0014346 1791 0 0.99900
#> (Omitted the remaining)
A posterior summary of Mu
, which denotes the community
average of species-specific effects, shows that
riverbankwithout_vegetation
has a community-wide negative
effect on psi
(note from the label
attribute
that Mu[4]
is the community-level effect of
riverbankwithout_vegetation
).
#> mean sd 2.5% 25% 50% 75%
#> Mu[1] -1.18914419 0.1466486 -1.4845230 -1.2872349 -1.18205680 -1.0888031
#> Mu[2] 0.94327206 0.1945952 0.5640255 0.8102745 0.94803155 1.0691758
#> Mu[3] 0.02522906 0.3264013 -0.6034793 -0.1859071 0.01929096 0.2317505
#> Mu[4] -1.03746412 0.1609537 -1.3626390 -1.1414288 -1.03743346 -0.9292544
#> 97.5% Rhat n.eff overlap0 f
#> Mu[1] -0.9119219 1.108793 30 0 1.0000
#> Mu[2] 1.3284452 1.031154 100 0 1.0000
#> Mu[3] 0.6951928 1.008576 352 1 0.5255
#> Mu[4] -0.7312586 1.022904 135 0 1.0000
#> attr(,"dimension")
#> [1] "Effects"
#> attr(,"label")
#> attr(,"label")$Effects
#> [1] "phi | (Intercept)" "theta | (Intercept)"
#> [3] "psi | (Intercept)" "psi | riverbankwithout_vegetation"
Next, let us access the posterior samples rather than their summary
using the get_post_samples()
function. We extract the
posterior sample of psi
from fit1
as an
example.
The resulting object, post_sample_psi
, is a
3-dimensional array containing the posterior sample of psi
.
Information on the dimensions of the extracted sample is provided in the
attributes.
#> $dim
#> [1] 4000 50 50
#>
#> $dimension
#> [1] "Sample" "Species" "Site"
#>
#> $label
#> $label$Sample
#> NULL
#>
#> $label$Species
#> [1] "Abbottina rivularis"
#> [2] "Acanthogobius lactipes"
#> [3] "Acheilognathus macropterus"
#> [4] "Acheilognathus rhombeus"
#> [5] "Anguilla japonica"
#> [6] "Biwia zezera"
#> [7] "Carassius cuvieri"
#> [8] "Carassius spp."
#> [9] "Channa argus"
#> [10] "Ctenopharyngodon idella"
#> [11] "Cyprinus carpio"
#> [12] "Gambusia affinis"
#> [13] "Gnathopogon spp."
#> [14] "Gymnogobius castaneus"
#> [15] "Gymnogobius petschiliensis"
#> [16] "Gymnogobius urotaenia"
#> [17] "Hemibarbus spp."
#> [18] "Hypomesus nipponensis"
#> [19] "Hypophthalmichthys spp."
#> [20] "Hyporhamphus intermedius"
#> [21] "Ictalurus punctatus"
#> [22] "Ischikauia steenackeri"
#> [23] "Lepomis macrochirus macrochirus"
#> [24] "Leucopsarion petersii"
#> [25] "Megalobrama amblycephala"
#> [26] "Micropterus dolomieu dolomieu"
#> [27] "Micropterus salmoides"
#> [28] "Misgurnus spp."
#> [29] "Monopterus albus"
#> [30] "Mugil cephalus cephalus"
#> [31] "Mylopharyngodon piceus"
#> [32] "Nipponocypris sieboldii"
#> [33] "Nipponocypris temminckii"
#> [34] "Opsariichthys platypus"
#> [35] "Opsariichthys uncirostris uncirostris"
#> [36] "Oryzias latipes"
#> [37] "Plecoglossus altivelis altivelis"
#> [38] "Pseudogobio spp."
#> [39] "Pseudorasbora parva"
#> [40] "Rhinogobius spp."
#> [41] "Rhodeus ocellatus ocellatus"
#> [42] "Salangichthys microdon"
#> [43] "Sarcocheilichthys variegatus microoculus"
#> [44] "Silurus asotus"
#> [45] "Squalidus chankaensis biwae"
#> [46] "Tachysurus tokiensis"
#> [47] "Tanakia lanceolata"
#> [48] "Tribolodon brandtii maruta"
#> [49] "Tribolodon hakonensis"
#> [50] "Tridentiger spp."
#>
#> $label$Site
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
#> [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
#> [46] "46" "47" "48" "49" "50"
Thus, the second dimension of post_sample_psi
corresponds to species and the third to sites. The labels for species
and sites indicate that post_sample_psi[, 1, 1]
contains
the posterior sample of psi
of Abbottina rivularis
at site 1.
#> [1] 0.6673664 0.7505735 0.6416083 0.7011978 0.5935338 0.6087940 0.6159829
#> [8] 0.7758206 0.6710466 0.7423708 0.6244308 0.6237291 0.6344345 0.6715390
#> [15] 0.6255393 0.7306071 0.6781538 0.7391794 0.6678208 0.6715868 0.7582302
#> [22] 0.5947020 0.4880676 0.6958859 0.6202520 0.5487584 0.6781225 0.7073072
#> (Omitted the remaining)
In occumb()
, three parameters, psi
,
theta
, and phi
, can be modeled as a function
of covariates (see vignette("model_specification")
for
details). Prediction of these parameters conditional on a given set of
covariate values can be easily obtained using the predict()
function to the fitted object.
The predict()
function has the newdata
argument that accepts an optional dataset object used for prediction. If
newdata
is not specified, predictions are made using the
fitted covariates; this could be an alternative to the
get_post_summary()
and get_post_samples()
functions for accessing posterior summaries and posterior samples of
parameters. For example, the following will output a 3-dimensional array
of the posterior median and upper and lower limits of the 95% credible
interval for psi
of the 50 species and 50 sites.
#> $dim
#> [1] 3 50 50
#>
#> $dimnames
#> $dimnames[[1]]
#> [1] "50%" "2.5%" "97.5%"
#>
#> $dimnames[[2]]
#> NULL
#>
#> $dimnames[[3]]
#> NULL
#>
#>
#> $parameter
#> [1] "psi"
#>
#> $scale
#> [1] "response"
#>
#> $dimension
#> [1] "Statistics" "Species" "Sites"
#>
#> $label
#> $label$Statistics
#> [1] "50%" "2.5%" "97.5%"
#>
#> $label$Species
#> [1] "Abbottina rivularis"
#> [2] "Acanthogobius lactipes"
#> [3] "Acheilognathus macropterus"
#> [4] "Acheilognathus rhombeus"
#> [5] "Anguilla japonica"
#> [6] "Biwia zezera"
#> [7] "Carassius cuvieri"
#> [8] "Carassius spp."
#> [9] "Channa argus"
#> [10] "Ctenopharyngodon idella"
#> [11] "Cyprinus carpio"
#> [12] "Gambusia affinis"
#> [13] "Gnathopogon spp."
#> [14] "Gymnogobius castaneus"
#> [15] "Gymnogobius petschiliensis"
#> [16] "Gymnogobius urotaenia"
#> [17] "Hemibarbus spp."
#> [18] "Hypomesus nipponensis"
#> [19] "Hypophthalmichthys spp."
#> [20] "Hyporhamphus intermedius"
#> [21] "Ictalurus punctatus"
#> [22] "Ischikauia steenackeri"
#> [23] "Lepomis macrochirus macrochirus"
#> [24] "Leucopsarion petersii"
#> [25] "Megalobrama amblycephala"
#> [26] "Micropterus dolomieu dolomieu"
#> [27] "Micropterus salmoides"
#> [28] "Misgurnus spp."
#> [29] "Monopterus albus"
#> [30] "Mugil cephalus cephalus"
#> [31] "Mylopharyngodon piceus"
#> [32] "Nipponocypris sieboldii"
#> [33] "Nipponocypris temminckii"
#> [34] "Opsariichthys platypus"
#> [35] "Opsariichthys uncirostris uncirostris"
#> [36] "Oryzias latipes"
#> [37] "Plecoglossus altivelis altivelis"
#> [38] "Pseudogobio spp."
#> [39] "Pseudorasbora parva"
#> [40] "Rhinogobius spp."
#> [41] "Rhodeus ocellatus ocellatus"
#> [42] "Salangichthys microdon"
#> [43] "Sarcocheilichthys variegatus microoculus"
#> [44] "Silurus asotus"
#> [45] "Squalidus chankaensis biwae"
#> [46] "Tachysurus tokiensis"
#> [47] "Tanakia lanceolata"
#> [48] "Tribolodon brandtii maruta"
#> [49] "Tribolodon hakonensis"
#> [50] "Tridentiger spp."
#>
#> $label$Sites
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
#> [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
#> [46] "46" "47" "48" "49" "50"
One can obtain posterior samples of psi
with the
predict()
function by specifying
type = "samples"
instead of
type = "quantiles"
.
For an example of specifying the newdata
argument, let
us prepare a new dataset containing two sites with different
riverbank
values. Recall that in fit1
,
psi
was expressed as a function of the site covariate
riverbank
. Here, we use the new dataset created below to
predict psi
of two sites with different
riverbank
values. In the following, fish_raw
is a data object built into the package that stores the same data as
fish
in the list format (see ?fish_raw
for
documentation).
new_y <- array(1, dim = c(50, 2, 1))
new_riverbank <- factor(levels(fish_raw$riverbank))
dimnames(new_y)[[1]] <- dimnames(fish_raw$y)[[1]]
dimnames(new_y)[[2]] <- levels(fish_raw$riverbank)
newdata <- occumbData(y = new_y,
spec_cov = list(mismatch = fish_raw$mismatch),
site_cov = list(riverbank = new_riverbank))
Covariate values are relevant to the prediction but not the sequence
read counts. Hence, any hypothetical data may be given to the sequence
read count data, y
, which was set to an array with elements
of 1, new_y
, here. Let’s verify that newdata
is a dataset with I = 50
species, J = 2
sites,
and K = 1
replicate, including mismatch
and
riverbank
as covariates. Note that the first site is
labeled (for demonstration purposes) as a site with vegetation
(with_vegetation
) and the second site as a site without
vegetation (without_vegetation
).
#> Sequence read counts:
#> Number of species, I = 50
#> Number of sites, J = 2
#> Maximum number of replicates per site, K = 1
#> Number of missing observations = 0
#> Number of replicates per site: 1 (average), 0 (sd)
#> Sequencing depth: 50 (average), 0 (sd)
#>
#> Species covariates:
#> mismatch (continuous)
#> Site covariates:
#> riverbank (categorical)
#> Replicate covariates:
#> (None)
#>
#> Labels for species:
#> Abbottina rivularis, Acanthogobius lactipes, Acheilognathus macropterus, Acheilognathus rhombeus, Anguilla japonica, Biwia zezera, Carassius cuvieri, Carassius spp., Channa argus, Ctenopharyngodon idella, Cyprinus carpio, Gambusia affinis, Gnathopogon spp., Gymnogobius castaneus, Gymnogobius petschiliensis, Gymnogobius urotaenia, Hemibarbus spp., Hypomesus nipponensis, Hypophthalmichthys spp., Hyporhamphus intermedius, Ictalurus punctatus, Ischikauia steenackeri, Lepomis macrochirus macrochirus, Leucopsarion petersii, Megalobrama amblycephala, Micropterus dolomieu dolomieu, Micropterus salmoides, Misgurnus spp., Monopterus albus, Mugil cephalus cephalus, Mylopharyngodon piceus, Nipponocypris sieboldii, Nipponocypris temminckii, Opsariichthys platypus, Opsariichthys uncirostris uncirostris, Oryzias latipes, Plecoglossus altivelis altivelis, Pseudogobio spp., Pseudorasbora parva, Rhinogobius spp., Rhodeus ocellatus ocellatus, Salangichthys microdon, Sarcocheilichthys variegatus microoculus, Silurus asotus, Squalidus chankaensis biwae, Tachysurus tokiensis, Tanakia lanceolata, Tribolodon brandtii maruta, Tribolodon hakonensis, Tridentiger spp.
#> Labels for sites:
#> with_vegetation, without_vegetation
#> Labels for replicates:
#> (None)
When the predict()
function is applied with the
newdata
argument, predictions are made using the covariate
values contained in the newdata
. The following will provide
a prediction of psi
for each of the sites with and without
vegetation.
#> , , 1
#>
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> 50% 0.6615133 0.2751231 0.3827295 0.15224665 0.17718177 0.09182312 0.8490621
#> 2.5% 0.4802434 0.1300821 0.2306797 0.05507663 0.06783645 0.02441254 0.7212899
#> 97.5% 0.7915599 0.4723714 0.5649520 0.32693805 0.38346795 0.24695832 0.9368627
#> [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> 50% 0.9491461 0.6094722 0.4620570 0.9901477 0.08998769 0.8800989 0.15458012
#> 2.5% 0.8789348 0.4101236 0.2996333 0.9557121 0.02421783 0.7735010 0.05625485
#> 97.5% 0.9844834 0.8144525 0.6430711 0.9986149 0.23751947 0.9596028 0.31844988
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21]
#> 50% 0.12125019 0.9695970 0.9028737 0.15509196 0.5713331 0.2300419 0.8826944
#> 2.5% 0.03724413 0.9110775 0.8015870 0.05106154 0.3814354 0.1092197 0.7664569
#> 97.5% 0.26923379 0.9924366 0.9631803 0.34754250 0.7741282 0.4051336 0.9550034
#> [,22] [,23] [,24] [,25] [,26] [,27] [,28]
#> 50% 0.7916404 0.7469745 0.06964821 0.15088247 0.10296476 0.8278437 0.9750893
#> 2.5% 0.6445037 0.5788305 0.01384897 0.04985234 0.02831969 0.7060083 0.9284540
#> 97.5% 0.9020019 0.8627692 0.19692563 0.34110667 0.25821933 0.9216507 0.9943395
#> [,29] [,30] [,31] [,32] [,33] [,34] [,35]
#> 50% 0.06220973 0.4863689 0.07515230 0.4353972 0.2299977 0.8332137 0.5024979
#> 2.5% 0.01098996 0.2994199 0.01757379 0.2667359 0.1056246 0.6977411 0.3344813
#> 97.5% 0.19309849 0.6536841 0.21486234 0.6188191 0.4107509 0.9326166 0.6992168
#> [,36] [,37] [,38] [,39] [,40] [,41] [,42]
#> 50% 0.5793313 0.09849281 0.5462387 0.9190994 0.9970209 0.8173368 0.09874564
#> 2.5% 0.4008518 0.02705771 0.3669093 0.8250025 0.9815437 0.6718077 0.02554623
#> 97.5% 0.7541354 0.25569027 0.7265439 0.9755158 0.9997452 0.9114977 0.26906733
#> [,43] [,44] [,45] [,46] [,47] [,48] [,49]
#> 50% 0.15160773 0.5472967 0.3177699 0.1336518 0.16453410 0.3119199 0.1010661
#> 2.5% 0.05407187 0.3702903 0.1750710 0.0401887 0.06069858 0.1646373 0.0270361
#> 97.5% 0.32445164 0.7429405 0.4983112 0.3168030 0.32694679 0.4919569 0.2257502
#> [,50]
#> 50% 0.9885246
#> 2.5% 0.9521853
#> 97.5% 0.9984315
#>
#> , , 2
#>
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> 50% 0.4756873 0.16831019 0.17344502 0.06827157 0.08065680 0.03799725
#> 2.5% 0.3210251 0.07468112 0.08268459 0.02152790 0.02558621 0.00904520
#> 97.5% 0.6477671 0.31803370 0.30375314 0.16653407 0.20628815 0.10914663
#> [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> 50% 0.5747380 0.8277408 0.4451700 0.2415741 0.9426904 0.04637033 0.6435848
#> 2.5% 0.4085796 0.6832309 0.2727587 0.1336498 0.8361653 0.01254011 0.4719155
#> 97.5% 0.7316505 0.9241055 0.6653103 0.3976390 0.9855532 0.12900086 0.7834428
#> [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> 50% 0.07037047 0.06163691 0.8602899 0.6909106 0.08289891 0.3366483 0.1156426
#> 2.5% 0.02364689 0.01866216 0.7192849 0.5273816 0.02726444 0.2012909 0.0498471
#> 97.5% 0.16850684 0.15171748 0.9468762 0.8219908 0.20203910 0.5227359 0.2284146
#> [,21] [,22] [,23] [,24] [,25] [,26] [,27]
#> 50% 0.7204299 0.5831452 0.5653456 0.030925710 0.06781351 0.05881219 0.6000844
#> 2.5% 0.5574411 0.4218203 0.4080076 0.006236538 0.02063876 0.01733949 0.4410841
#> 97.5% 0.8503492 0.7422631 0.7276702 0.094870829 0.17049638 0.16207666 0.7470483
#> [,28] [,29] [,30] [,31] [,32] [,33]
#> 50% 0.9126102 0.035225441 0.3407391 0.04236923 0.16616006 0.08886842
#> 2.5% 0.8020100 0.007672009 0.2031182 0.01018349 0.07272355 0.03188771
#> 97.5% 0.9696633 0.110552419 0.5178231 0.13816851 0.29161845 0.18440940
#> [,34] [,35] [,36] [,37] [,38] [,39] [,40]
#> 50% 0.5159626 0.2344728 0.3705463 0.04820736 0.2161039 0.7693729 0.9831972
#> 2.5% 0.3512136 0.1224432 0.2295261 0.01206818 0.1039605 0.6147098 0.9306534
#> 97.5% 0.6751210 0.3846941 0.5527593 0.13468654 0.3594184 0.8923270 0.9976299
#> [,41] [,42] [,43] [,44] [,45] [,46]
#> 50% 0.6527351 0.06006526 0.06753850 0.2371226 0.15535572 0.05145973
#> 2.5% 0.4967540 0.01537025 0.02157825 0.1134572 0.07282467 0.01382803
#> 97.5% 0.7956659 0.16542028 0.16515022 0.3848913 0.27405126 0.13368309
#> [,47] [,48] [,49] [,50]
#> 50% 0.08990812 0.13653424 0.05038579 0.9428926
#> 2.5% 0.03330197 0.05777936 0.01466287 0.8401821
#> 97.5% 0.20500919 0.25327092 0.12890174 0.9873862
#>
#> attr(,"parameter")
#> [1] "psi"
#> attr(,"scale")
#> [1] "response"
#> attr(,"dimension")
#> [1] "Statistics" "Species" "Sites"
#> attr(,"label")
#> attr(,"label")$Statistics
#> [1] "50%" "2.5%" "97.5%"
#>
#> attr(,"label")$Species
#> [1] "Abbottina rivularis"
#> [2] "Acanthogobius lactipes"
#> [3] "Acheilognathus macropterus"
#> [4] "Acheilognathus rhombeus"
#> [5] "Anguilla japonica"
#> [6] "Biwia zezera"
#> [7] "Carassius cuvieri"
#> [8] "Carassius spp."
#> [9] "Channa argus"
#> [10] "Ctenopharyngodon idella"
#> [11] "Cyprinus carpio"
#> [12] "Gambusia affinis"
#> [13] "Gnathopogon spp."
#> [14] "Gymnogobius castaneus"
#> [15] "Gymnogobius petschiliensis"
#> [16] "Gymnogobius urotaenia"
#> [17] "Hemibarbus spp."
#> [18] "Hypomesus nipponensis"
#> [19] "Hypophthalmichthys spp."
#> [20] "Hyporhamphus intermedius"
#> [21] "Ictalurus punctatus"
#> [22] "Ischikauia steenackeri"
#> [23] "Lepomis macrochirus macrochirus"
#> [24] "Leucopsarion petersii"
#> [25] "Megalobrama amblycephala"
#> [26] "Micropterus dolomieu dolomieu"
#> [27] "Micropterus salmoides"
#> [28] "Misgurnus spp."
#> [29] "Monopterus albus"
#> [30] "Mugil cephalus cephalus"
#> [31] "Mylopharyngodon piceus"
#> [32] "Nipponocypris sieboldii"
#> [33] "Nipponocypris temminckii"
#> [34] "Opsariichthys platypus"
#> [35] "Opsariichthys uncirostris uncirostris"
#> [36] "Oryzias latipes"
#> [37] "Plecoglossus altivelis altivelis"
#> [38] "Pseudogobio spp."
#> [39] "Pseudorasbora parva"
#> [40] "Rhinogobius spp."
#> [41] "Rhodeus ocellatus ocellatus"
#> [42] "Salangichthys microdon"
#> [43] "Sarcocheilichthys variegatus microoculus"
#> [44] "Silurus asotus"
#> [45] "Squalidus chankaensis biwae"
#> [46] "Tachysurus tokiensis"
#> [47] "Tanakia lanceolata"
#> [48] "Tribolodon brandtii maruta"
#> [49] "Tribolodon hakonensis"
#> [50] "Tridentiger spp."
#>
#> attr(,"label")$Sites
#> [1] "with_vegetation" "without_vegetation"
Assessing the goodness-of-fit of a model is an essential step in data
analysis based on statistical models. The goodness-of-fit of the models
fitted with occumb()
is evaluated using the
gof()
function, which computes the Bayesian p-value using
the posterior predictive check approach. The Bayesian p-value may take
extreme values (e.g., p < 0.05
or
p > 0.95
) when the model was poorly fitted.
For faster computation, it is recommended that the cores
argument be specified explicitly for parallel computations. By default,
the gof()
function outputs a scatter plot of fit statistics
(in this case, the default Freeman-Tukey statistics).
Enter the name of the resulting object to summarize the results.
#> Posterior predictive check for an occumbFit object:
#> Statistics: Freeman-Tukey
#> p-value: 0.2005
#> Discrepancy statistics for observed data: 542.81 (mean), 17.92 (sd)
#> Discrepancy statistics for replicated data: 523.28 (mean), 16.47 (sd)
A moderate p-value indicated a lack of evidence of the model’s inadequacy in fitting.
The plot()
function can be used to (re)draw a scatter
plot of the fit statistics for the resulting object.
The model fitted with occumb()
can be used to identify
eDNA metabarcoding study designs that can effectively detect the species
present. How many sites, within-site replicates, and sequencing depths
are required for reliable species detection? What is the best balance
between the number of sites visited, number of within-site replicates,
and sequencing depth under a limited budget? The multispecies site
occupancy model fit by occumb()
answers these questions by
predicting the number of species expected to be detected using specific
study designs.
eval_util_L()
and eval_util_R()
are
functions available for these purposes. These two functions assume
species diversity assessments at different spatial scales (i.e., the
former is for local (L) and the latter for regional (R)). Specifically,
eval_util_L()
is appropriate if you are interested in
assessing species diversity only for the study sites included in the
dataset, and eval_util_R()
is appropriate if you are
interested in assessing species diversity in a broader area that
includes the study sites in the dataset (i.e., the population of sites
or a “metacommunity”).
In the context of the fish
dataset,
eval_util_L()
evaluates the expected number of species
detected per site at 50 sites in the dataset under different
combinations of number of replicates and sequencing depth. For example,
the following will give the expected number of species detected per site
(referred to as Utility
below) when the number of
replicates K
and the sequencing depth N
take
the values (1, 2, 3) and (1000, 10000, 100000), respectively.
#> K N Utility
#> 1 1 1e+03 13.03225
#> 2 2 1e+03 16.72177
#> 3 3 1e+03 18.21783
#> 4 1 1e+04 14.34826
#> 5 2 1e+04 17.72304
#> 6 3 1e+04 18.92009
#> 7 1 1e+05 15.04591
#> 8 2 1e+05 18.20935
#> 9 3 1e+05 19.23478
It can be seen that as K
and N
increase,
the Utility
value increases. For faster computation, it is
recommended that the cores
argument be explicitly specified
for parallel computation.
If one knows the cost per sequence read for high-throughput
sequencing, cost per replicate for library preparation, and research
budget values, the list_cond_L()
function can be used to
obtain a set of feasible settings under these cost and budget values.
This can be applied to eval_util_L()
to identify the
optimal study design under the budget constraints.
settings <- list_cond_L(budget = 875 * 1E3,
lambda1 = 0.01,
lambda2 = 5000,
fit1)
utilL2 <- eval_util_L(settings, fit1, cores = 4)
utilL2
#> budget lambda1 lambda2 K N Utility
#> 1 875000 0.01 5000 1 1250000.00 15.45671
#> 2 875000 0.01 5000 2 375000.00 18.37862
#> 3 875000 0.01 5000 3 83333.33 19.21953
Hence, under this specific budget and cost, it is best to have
K = 3
replicates per site for effective species
detection.
In contrast to eval_util_L()
, eval_util_R()
considers differences in the number of sites visited. Nevertheless,
eval_util_R()
can be applied similarly as
eval_util_L()
. In eval_util_R()
,
Utility
is the number of species expected to be detected in
the region of interest under the settings of number of sites
J
, number of replicates K
, and sequencing
depth N
. Because list_cond_R()
may return a
large list of possible settings, one may want to obtain a manually
restricted list using its J
and/or K
arguments.
#> 1 1 1 1e+03 13.06227
#> 2 2 1 1e+03 18.77537
#> 3 3 1 1e+03 22.21901
#> 4 1 2 1e+03 16.75208
#> 5 2 2 1e+03 22.80510
#> 6 3 2 1e+03 26.33776
#> 7 1 3 1e+03 18.18526
#> 8 2 3 1e+03 24.63089
#> 9 3 3 1e+03 28.13433
#> 10 1 1 1e+04 14.36287
#> 11 2 1 1e+04 20.30642
#> 12 3 1 1e+04 23.85725
#> 13 1 2 1e+04 17.65318
#> 14 2 2 1e+04 23.98433
#> 15 3 2 1e+04 27.55905
#> 16 1 3 1e+04 19.03370
#> 17 2 3 1e+04 25.30884
#> 18 3 3 1e+04 29.14037
#> 19 1 1 1e+05 14.97290
#> 20 2 1 1e+05 21.16216
#> 21 3 1 1e+05 24.76543
#> 22 1 2 1e+05 18.25193
#> 23 2 2 1e+05 24.64478
#> 24 3 2 1e+05 28.20150
#> 25 1 3 1e+05 19.33757
#> 26 2 3 1e+05 25.85813
#> 27 3 3 1e+05 29.68874
settings <- list_cond_R(budget = 1125 * 1E3,
lambda1 = 0.01,
lambda2 = 5000,
lambda3 = 5000,
J = seq(5, 50, 5),
K = 1:4)
utilR2 <- eval_util_R(settings, fit1, cores = 4)
utilR2
#> budget lambda1 lambda2 lambda3 J K N Utility
#> 1 1125000 0.01 5000 5000 5 1 21500000.00 30.04136
#> 2 1125000 0.01 5000 5000 10 1 10250000.00 35.64544
#> 3 1125000 0.01 5000 5000 15 1 6500000.00 38.60797
#> 4 1125000 0.01 5000 5000 20 1 4625000.00 40.60944
#> 5 1125000 0.01 5000 5000 25 1 3500000.00 42.02112
#> 6 1125000 0.01 5000 5000 30 1 2750000.00 43.15096
#> 7 1125000 0.01 5000 5000 35 1 2214285.71 43.89833
#> 8 1125000 0.01 5000 5000 40 1 1812500.00 44.71556
#> 9 1125000 0.01 5000 5000 45 1 1500000.00 45.22666
#> 10 1125000 0.01 5000 5000 50 1 1250000.00 45.71167
#> 11 1125000 0.01 5000 5000 5 2 10500000.00 33.34409
#> 12 1125000 0.01 5000 5000 10 2 4875000.00 38.81906
#> 13 1125000 0.01 5000 5000 15 2 3000000.00 41.58322
#> 14 1125000 0.01 5000 5000 20 2 2062500.00 43.47530
#> 15 1125000 0.01 5000 5000 25 2 1500000.00 44.69525
#> 16 1125000 0.01 5000 5000 30 2 1125000.00 45.58398
#> 17 1125000 0.01 5000 5000 35 2 857142.86 46.27036
#> 18 1125000 0.01 5000 5000 40 2 656250.00 46.82490
#> 19 1125000 0.01 5000 5000 45 2 500000.00 47.24841
#> 20 1125000 0.01 5000 5000 50 2 375000.00 47.56585
#> 21 1125000 0.01 5000 5000 5 3 6833333.33 34.55557
#> 22 1125000 0.01 5000 5000 10 3 3083333.33 40.03808
#> 23 1125000 0.01 5000 5000 15 3 1833333.33 42.85560
#> 24 1125000 0.01 5000 5000 20 3 1208333.33 44.51759
#> 25 1125000 0.01 5000 5000 25 3 833333.33 45.67757
#> 26 1125000 0.01 5000 5000 30 3 583333.33 46.48685
#> 27 1125000 0.01 5000 5000 35 3 404761.90 47.14072
#> 28 1125000 0.01 5000 5000 40 3 270833.33 47.60570
#> 29 1125000 0.01 5000 5000 45 3 166666.67 47.93433
#> 30 1125000 0.01 5000 5000 50 3 83333.33 48.12801
#> 31 1125000 0.01 5000 5000 5 4 5000000.00 35.17767
#> 32 1125000 0.01 5000 5000 10 4 2187500.00 40.72769
#> 33 1125000 0.01 5000 5000 15 4 1250000.00 43.51095
#> 34 1125000 0.01 5000 5000 20 4 781250.00 45.16898
#> 35 1125000 0.01 5000 5000 25 4 500000.00 46.23139
#> 36 1125000 0.01 5000 5000 30 4 312500.00 46.98586
#> 37 1125000 0.01 5000 5000 35 4 178571.43 47.53393
#> 38 1125000 0.01 5000 5000 40 4 78125.00 47.89744