This package provides tools to determine the probably effect of undersampling in bipartite ecological networks and tools to determine likely gaps in ecological data following the approaches discussed in Finding missing links in interaction networks by Terry & Lewis (in prep).
While this cannot replace further sampling, by making it as easy as possible to assess the most likely missing links, we hope we can encourage further consideration of the quality of ecological networks.
We provide two ‘high level functions’ that are designed to work as
easily as possible out of the box: PredictLinks()
and
RarefyNetwork()
. Both functions are designed to ‘play
nicely’ with the bipartite
package (Dormann et
al., 2008). They both come with accompanying plotting functions
that return ggplot objects displaying the key information.
Source code is hosted on github at https://github.com/jcdterry/cassandRa/. Any issues, comments or suggestions are probably best routed through there. If you make use of the link prediction aspect of the package, please cite the paper (either the preprint, or the paper if/when it comes out. Otherwise, if you use resampling, then please just cite the package.
Both functions assume that you have a quantitative ecological network
matrix in the format ready to go with the widely used
bipartite
package. It should be in bipartite form, i.e. a
potentially rectangular matrix of side lengths NxM rather than SxS.
There should not be any additional columns with totals or other
summaries.
The package bipartite
provides some useful tools to help
with this such as frame2webs()
.
Data in this format has rows being each of the focal layer (referred to as ‘Host’ in the underlying function documentation) and columns detailing the response layer (referred to as ‘Wasp’ in the underlying documentation). The distinction is that the focal layer is the layer where you have control over the sampling - for example which plants were watched, which hosts were gathered for parasite-rearing etc.
If present, row names and column names are retained for use. Species that are not observed interacting with any others are removed.
For this vignette, we will use one of the datasets bundled with the
bipartite
package:
PredictLinks()
- Inferring Missing LinksThis is the simplest function to quickly generate predictions based on a bipartite network. A description and reasoning and behind each model is given in the paper. The function here is deliberately as simple as possible - to tinker with the too many of the settings you will need to dig into the functions which it calls. See the SI for the original paper for a script that uses the other functions in this package to conduct some more specific and detailed analyses.
The coverage deficit models assume that the data is raw count data - so don’t transform or standardise it in any way before use. The network structure models work from presence-absence data, so any transformations would not affect them.
Calling PredictLinks(web)
will return a list object with
many components. The most important is $Predictions
, which
is a data frame with columns detailing the predictions of each of the
five models discussed in the paper.
These are returned ‘raw’ and standardised (preceded by
std_
). The raw values include the probabilities of each
interaction (observed or unobserved) as fitted by each model. The
standardised equivalents are divided through by the mean probability of
each unobserved interaction, to give a relative probability for
each potential missing link.
There is just one additional optional argument:
RepeatModels
which controls how many times to fit each
statistical model. The best half (rounding up) model predictions are
averaged which evens out edge cases and helps with the optimisation runs
which may get stuck at poor local optima.. The default is set to 10,
which fits relatively quickly (under 30 seconds in most cases). It might
be worth raising this if there is time.
PredictFit <- PredictLinks(Safariland , RepeatModels = 1) # Set to 1 here for speed in the vignette
#> Starting:_ at 2024-06-07 12:26:41.628216
#> Fitting:
#> C
#> M
#> B
#> -SBM
int_code | Observed | Centrality_Prob | Matching_Prob | Both_Prob | SBM_Prob | C_def_Prob | Interaction | std_Centrality_Prob | std_Matching_Prob | std_Both_Prob | std_SBM_Prob | std_C_def_Prob |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Aristotelia chilensis - Policana albopilosa | TRUE | 0.1345 | 0.1664 | 0.1495 | 0.3667 | 0.0002 | Aristotelia chilensis - Policana albopilosa | NA | NA | NA | NA | NA |
Alstroemeria aurea - Policana albopilosa | FALSE | 0.5462 | 0.1664 | 0.2749 | 0.1000 | 0.0029 | Alstroemeria aurea - Policana albopilosa | 4.7078 | 1 | 2.7790 | 0.8218 | 0.7147 |
Schinus patagonicus - Policana albopilosa | FALSE | 0.0198 | 0.1664 | 0.0171 | 0.3667 | 0.0023 | Schinus patagonicus - Policana albopilosa | 0.1710 | 1 | 0.1727 | 3.0134 | 0.5606 |
Berberis darwinii - Policana albopilosa | FALSE | 0.0369 | 0.1664 | 0.0154 | 0.1000 | 0.0005 | Berberis darwinii - Policana albopilosa | 0.3178 | 1 | 0.1552 | 0.8218 | 0.1339 |
Rosa eglanteria - Policana albopilosa | FALSE | 0.0811 | 0.1664 | 0.0439 | 0.1000 | 0.0026 | Rosa eglanteria - Policana albopilosa | 0.6989 | 1 | 0.4439 | 0.8218 | 0.6337 |
Cynanchum diemii - Policana albopilosa | FALSE | 0.0879 | 0.1664 | 0.0814 | 0.3667 | 0.0007 | Cynanchum diemii - Policana albopilosa | 0.7573 | 1 | 0.8225 | 3.0134 | 0.1795 |
The function PlotFit()
extracts information from the
list object to produce a heat-map of the strengths of predictions, both
in terms of how the goodness of fit to the observed interactions and the
distribution of predictions.
It needs to be supplied with a network list, for example the output
from PredictLinks()
, and the network model that you wish to
plot to Matrix_to_plot()
.
This uses a shorthand:
M
= the latent-trait ‘Matching’
modelC
= the ‘Centrality’ model of
vulnerability and generalityB
= the ‘Both’ matching-centrality
modelSBM
= the stochastic
block modelC_def
= the coverage
deficit modelIf a vector of multiple models is provided, they will combined
according to the Combine
argument. If '+'
they
will be averaged, if '*'
they will be multiplied.
There are various options for arranging the species. By default it
will take a guess based on the predictive model, to best display how it
has fit the network, but this can be controlled manually with
OrderBy
.
The options for OrderBy
are:
'Degree'
: Sort each level by number of observed
interactors'Manual'
: Controlled by the order of
WaspNames
and HostNames
in the network list
(not really for easy use)'AsPerMatrix'
: Order will follow the original web'SBM'
: Sort by group assignations in the SBM'LatentTrait'
: Sort by the best fitting latent trait
(Matching) model parametersThere are three further graphical options:
if addDots = TRUE
then observed species will be have a
black dot, and if addDots = 'Size'
then a dot proportional
to the number of observations will be drawn. This is most useful for the
coverage deficit model since the others only make use of the binary
interaction matrix.
and
if RemoveTP = TRUE
then the observed species will not be
coloured. This is mostly useful for plotting coverage deficit model.
Finally title
adds a manual title.
The function returns a ggplot object, which can be further customised by adding ggplot elements in the usual way. The most common cases would probably be switching off the guides, since the colours are relative. There might be few coercion warnings - these should be ok, mostly a product of trying the make the code as robust as possible to either having or not having named species.
RarefyNetwork()
- Metric Responses to ResamplingThe purpose of this function is to provide an easy way to get some
information on the confidence with which network metrics can be
assigned, given a particular level of sampling. It does this by
redrawing observations with probabilities proportional to the relative
interaction frequency, and assessing the spread of the network metrics.
Before starting, I would recommend also looking at capacity built into
bipartite
, specifically bipartite::nullmodel
and the discussion in the Intro2bipartite
vignette. These
functions include other ways of generating null models that may be more
relevant.
This has three main purposes: to (visually) assess whether a metric has stabilised with the level of sampling, to assess (via bootstrapped confidence intervals) the precision with which that metric can be reported, and to allow the rarefaction of different networks to a matched sample size to allow a more equitable comparison of metrics that are known to be correlated with sample size.
The function calls networklevel()
from the
bipartite
package, which offers the vast majority of
bipartite network metrics used by ecologists. Note that if the
re-sampling procedure means that certain species are no longer present
in the network, then they will be dropped by the function by default.
This could lead to some problems for some metrics such as overall
connectance since the number of possible interactions will decrease with
a smaller web size. This can be overcome by passing
empty.web=FALSE
to the ...
argument.
A basic way used to calculate confidence intervals is to sort the
metric values from the re-sampled networks and find the value at the 5th
and 95th percentile. These values must still be treated with some
caution as the re-sampling procedure will not generate unobserved
interaction. Nonetheless it can give an indication of the extent to
which the sampling is sufficient to make a particular conclusion. Any
NA
values are excluded. These are only likely to occur for
certain quantitative metrics that may fail for small networks
(e.g. weighted nestedness
can do this). By excluding NA’s a
value can still be returned, by since these NA’s are non-random they may
skew the results. The underlying networklevel()
function
should warn which metric and for what size this may be an issue.
web
a network formatted for bipartite (see above)n_per_level
How many samples to draw at each sampling
level. Should be quite large to get a converged distribution, default is
1000metrics
Vector of metrics to calculate. This is passed
directly to networklevel()
, so see that help file for the
full list. Useful shortcuts include:
'ALLBUTDD'
which calculates all indices except degree
distribution fits'info'
for basic information'binary'
for some qualitative metrics'quantitative'
for some quantitative metrics.Beware that each metric will be called potentially many 1000s of
times, so don’t be too greedy! The degree distribution fit metrics will
not work as they cannot be coerced easily into a vector.
?networklevel()
discusses time considerations.
and then ONE of:
frac_sample_levels
A sequence detailing fractions of
the original sample size to rarefy to. Can be larger than 1 to
extrapolate potential future confidence.abs_sample_levels
Manually specify a sequence of sample
sizes to rarefy to. Suitable for use with non-integer quantitative
webs.Then there are two optional arguments if you want to take advantage
of parallel computation. If PARALLEL=TRUE
, it will use the
parallel
package to split the network calculations over the
number of cores set by cores
.
By default it will return a data frame of the raw recalculated metrics, with a separate column for each metric, and the last column specifying the resample size.
This data frame can then be passed to ComputeCI()
to
return basic confidence intervals (in a data frame containing 5 columns:
Metric, LowerCI, UpperCI, Mean & SampleSize), or to
PlotRarefaction()
to make a simple ggplot. Both these
functions have no additional arguments and are pretty basic. The look of
the ggplots can be modified as normal.
If you want to skip this step, it is possible to pass an
output
argument direct to RarefyNetwork()
. If
'plot'
will return a ggplot object using
PlotRarefaction()
, and if output='CI'
it will
return a data frame using ComputeCI()
.
MetricsAfterReSampling<- RarefyNetwork(Safariland,
n_per_level = 50, # v. small for the vignette
metrics = c('H2', 'nestedness', 'links per species'))
PlotRarefaction(MetricsAfterReSampling)
ComputeCI(MetricsAfterReSampling)
#> Metric LowerCI UpperCI Mean
#> H2...1 H2 0.8293225 0.9249341 0.8796841
#> H2...2 H2 0.8248465 0.9233817 0.8663997
#> H2...3 H2 0.8350408 0.9090144 0.8724597
#> H2...4 H2 0.8264402 0.9014620 0.8597829
#> H2...5 H2 0.8277802 0.8961996 0.8568270
#> links per species...6 links per species 0.7600000 0.9583333 0.8414517
#> links per species...7 links per species 0.8518519 1.0333333 0.9256467
#> links per species...8 links per species 0.9032258 1.0645161 0.9780459
#> links per species...9 links per species 0.9677419 1.0909091 1.0158670
#> links per species...10 links per species 1.0000000 1.0909091 1.0330300
#> nestedness...11 nestedness 22.4026640 38.6126773 27.3835563
#> nestedness...12 nestedness 19.4509950 32.1398845 24.2783014
#> nestedness...13 nestedness 19.0295559 27.3292434 22.6104222
#> nestedness...14 nestedness 18.2888526 26.5684360 22.1101334
#> nestedness...15 nestedness 17.9240325 25.7960163 21.8303458
#> SampleSize
#> H2...1 226
#> H2...2 452
#> H2...3 678
#> H2...4 904
#> H2...5 1130
#> links per species...6 226
#> links per species...7 452
#> links per species...8 678
#> links per species...9 904
#> links per species...10 1130
#> nestedness...11 226
#> nestedness...12 452
#> nestedness...13 678
#> nestedness...14 904
#> nestedness...15 1130
sessionInfo()
#> R version 4.4.0 (2024-04-24 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United Kingdom.utf8
#> [3] LC_MONETARY=English_United Kingdom.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.utf8
#>
#> time zone: Europe/London
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 tidyr_1.3.1 dplyr_1.1.4
#> [4] purrr_1.0.2 cassandRa_0.2.0 bipartite_2.20
#> [7] sna_2.7-2 network_1.18.2 statnet.common_4.9.0
#> [10] vegan_2.6-6.1 lattice_0.22-6 permute_0.9-7
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.4
#> [5] digest_0.6.35 magrittr_2.0.3 evaluate_0.23 grid_4.4.0
#> [9] fastmap_1.2.0 maps_3.4.2 plyr_1.8.9 jsonlite_1.8.8
#> [13] Matrix_1.7-0 mgcv_1.9-1 fansi_1.0.6 spam_2.10-0
#> [17] viridisLite_0.4.2 scales_1.3.0 jquerylib_0.1.4 cli_3.6.2
#> [21] rlang_1.1.3 munsell_0.5.1 splines_4.4.0 withr_3.0.0
#> [25] cachem_1.1.0 yaml_2.3.8 tools_4.4.0 parallel_4.4.0
#> [29] reshape2_1.4.4 coda_0.19-4.1 colorspace_2.1-0 boot_1.3-30
#> [33] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 stringr_1.5.1
#> [37] MASS_7.3-60.2 cluster_2.1.6 pkgconfig_2.0.3 pillar_1.9.0
#> [41] bslib_0.7.0 gtable_0.3.5 glue_1.7.0 Rcpp_1.0.12
#> [45] fields_15.2 highr_0.11 xfun_0.44 tibble_3.2.1
#> [49] tidyselect_1.2.1 rstudioapi_0.16.0 knitr_1.46 farver_2.1.2
#> [53] htmltools_0.5.8.1 nlme_3.1-164 igraph_2.0.3 labeling_0.4.3
#> [57] rmarkdown_2.27 dotCall64_1.1-1 compiler_4.4.0