1 Department of Biology, Philipps-University Marburg, Marburg, Germany
2 Department of Natural Product Biosynthesis, Max Planck Institute for Chemical Ecology, Jena, Germany
3 Department of Environment and Biodiversity, University of Salzburg, Salzburg, Austria
chemodiv
is an R package for analysing phytochemical
data. It includes a number of functions that makes it straightforward to
quantify and visualize phytochemical diversity and dissimilarity for any
type of phytochemical samples, such as herbivore defence compounds,
volatiles or similar. Importantly, calculations of diversity and
dissimilarity can incorporate biosynthetic and/or structural properties
of the phytochemical compounds, resulting in more comprehensive
quantifications of diversity and dissimilarity.
This introduction serves as a tutorial explaining the main intended use of all the functions in the package. A complete description of the package is available in Petrén et al. 2023a, while a more in-depth discussion and review of phytochemical diversity is available in Petrén et al. 2023b.
The current version of the package can be installed from CRAN.
Alternatively, the developmental version of the package can be installed
from GitHub using the install_github()
function from the
devtools
package.
# Install current version
install.packages("chemodiv")
# Install developmental version
install.packages("devtools") # Install devtools if not already installed
library("devtools")
install_github("hpetren/chemodiv")
# Load chemodiv
library(chemodiv)
We illustrate the use of the chemodiv package by using data, included in the package, on floral scent of the plant Arabis alpina (Petrén et al. 2021). We use a subset of the data consisting of 87 individuals from three populations.
Two separate datasets are needed to use the full set of analyses in the chemodiv package.
The first dataset, which we call the sample dataset, should be a data frame containing data on the relative concentrations (proportions) of different compounds (columns) in different samples (rows):
data("alpinaSampData")
head(alpinaSampData)[,1:5]
#> ZBetaOcimene EBetaOcimene Benzaldehyde MethylBenzoate Phenylacetaldehyde
#> 1 0.01584642 0.007014541 0.8540981 0.001959341 0
#> 2 0.00000000 0.000000000 0.9116668 0.013874965 0
#> 3 0.00000000 0.000000000 0.9305998 0.002476401 0
#> 4 0.00000000 0.000000000 0.7920615 0.018483767 0
#> 5 0.00000000 0.000000000 0.9251283 0.013249618 0
#> 6 0.02653006 0.014940792 0.8090447 0.003321790 0
The dataset contains the relative concentration of 15 phytochemical compounds in different samples.
The second dataset, which we call the compound dataset, should be a data frame containing, in each of three columns, the common name, SMILES and InChIKey IDs of all the compounds that are present in the sample dataset:
data("alpinaCompData")
head(alpinaCompData)
#> compound smiles inchikey
#> 1 ZBetaOcimene CC(=CC/C=C(/C)\\C=C)C IHPKGUQCSIINRJ-NTMALXAHSA-N
#> 2 EBetaOcimene CC(=CC/C=C(\\C)/C=C)C IHPKGUQCSIINRJ-CSKARUKUSA-N
#> 3 Benzaldehyde C1=CC=C(C=C1)C=O HUMNYLRZRPPJDN-UHFFFAOYSA-N
#> 4 MethylBenzoate COC(=O)C1=CC=CC=C1 QPJVMBTYPHYUOC-UHFFFAOYSA-N
#> 5 Phenylacetaldehyde C1=CC=C(C=C1)CC=O DTUQWGWMVIHBKE-UHFFFAOYSA-N
#> 6 Acetophenone CC(=O)C1=CC=CC=C1 KWOLFJPFCHCOCG-UHFFFAOYSA-N
SMILES and InChIKey are chemical identifiers that are easily obtained for each compound by searching for compounds in PubChem. Searching for compounds by their common name, or a number of other chemical identifiers, will bring up the matching molecule, along with the SMILES and InChIKey. Additionally, various automated tools such as the PubChem Identifier Exchange Service or The Chemical Translation Service can be used to automatically obtain IDs for lists of compounds. The user has to compile the SMILES and InChIKey manually to ensure correctness, as lists of compounds very often contain compounds wrongly named, wrongly formatted, under various synonyms etc. which prevents efficient automatic translation of compound names to SMILES and InChIKey.
As the compound dataset consists of a list of known compounds, analyses in the chemodiv package will work best for sets of data, commonly generated by chemical ecologists using GC-MS, LC-MS or similar, where all or most compounds in the samples have been confidently identified.
When both datasets are prepared, we can use the function
chemoDivCheck()
to make sure that the sample
dataset and the compound dataset are formatted correctly, so
that they can be used by the other functions in the package:
chemoDivCheck(compoundData = alpinaCompData, sampleData = alpinaSampData)
#> The two datasets appear to be correctly formatted.
In addition to the sample and compound datasets, a third dataset indicating what groups different samples belong to can be used in the plotting functions below.
data("alpinaPopData")
table(alpinaPopData)
#> Population
#> G1 It8 S1
#> 17 37 33
In this case, our samples belong to three different populations: G1 (Greece), It8 (Italy) and S1 (Sweden).
Now we have all necessary datasets, which are correctly formatted, and we can begin with analyses.
Two functions are used to classify and compare phytochemical compounds, and are applied to the compound dataset.
The function NPCTable()
classifies compounds with NPClassifier (Kim et
al. 2021). NPClassifier is a deep-learning tool that classifies
phytochemical compounds into a hierarchical classification of three
groups, pathway, superclass and class, largely corresponding to
biosynthetic pathways:
alpinaNPC <- NPCTable(compoundData = alpinaCompData)
alpinaNPC[1,] # Classification of the first compound in dataset
#> compound smiles inchikey pathway
#> 1 ZBetaOcimene CC(=CC/C=C(/C)\\C=C)C IHPKGUQCSIINRJ-NTMALXAHSA-N Terpenoids
#> superclass class
#> 1 Monoterpenoids Acyclic monoterpenoids
Here, the first compound in the list, (Z)-beta-Ocimene, is classified as Terpenoids > Monoterpenoids > Acyclic monoterpenoids. The classification of the compounds is put into a data frame and can subsequently be used by other functions.
The function compDis()
compares phytochemical compounds
by calculating pairwise Jaccard dissimilarities between them. The
dissimilarity calculations can be based on the biosynthesis and/or
structure of the compounds. type = "NPClassifier"
calculates dissimilarities based on the classification made by
NPClassifier, which largely corresponds to biosynthetic pathways.
type = "PubChemFingerprint"
and type = "fMCS"
are two similar methods that calculate dissimilarities based on the
structure of the compounds. In this case, molecules that have similar
substructures/features will have a low dissimilarity, while molecules
not having similar substructures/features will have a high
dissimilarity.
We calculate compound dissimilarities with
type = "PubChemFingerprint"
.
alpinaCompDis <- compDis(compoundData = alpinaCompData,
type = "PubChemFingerprint")
alpinaCompDis$fingerDisMat[1:4, 1:4] # Part of compound dissimilarity matrix
#> ZBetaOcimene EBetaOcimene Benzaldehyde MethylBenzoate
#> ZBetaOcimene 0.0000000 0.0000000 0.7027027 0.7191011
#> EBetaOcimene 0.0000000 0.0000000 0.7027027 0.7191011
#> Benzaldehyde 0.7027027 0.7027027 0.0000000 0.2637363
#> MethylBenzoate 0.7191011 0.7191011 0.2637363 0.0000000
The output from the function is a list with one or several compound
dissimilarity matrices, depending on which type
was used as
input. If multiple type
are used as input, a matrix with
mean values of the other matrices will also calculated. If
type
includes "NPClassifier"
, a matrix with
“mixed” values is also calculated. In this case, values are based on
NPClassifier when these are > 0, and otherwise based the
PubChem fingerprints/fMCS values (see manual for details). Importantly,
a resulting matrix of compound dissimilarities is used by other
functions in the package that quantify phytochemical diversity and
dissimilarity, and can be used to visualize how similar sets of
compounds are to each other.
Calculations of phytochemical diversity is the core of the
chemodiv package. Phytochemical diversity can be calculated for
the sample dataset with functions calcDiv()
,
calcBetaDiv()
and calcDivProf()
.
calcDiv()
calculates alpha diversity for each sample
(row) in the sample dataset. The function can calculate a
number of different diversity and evenness indices, depending on what
type
is used as input. The default and recommended way of
calculating diversity is as Hill numbers (Chao et al. 2014), which
provides a number of advantages, including the use of the parameter
q which controls the sensitivity of the measure to the relative
concentrations of compounds. This can be done as “normal” Hill
diversity, which depends on compound richness and evenness, and as
functional Hill diversity, which additionally considers compound
dissimilarity (Chiu & Chao 2014), utilizing the compound
dissimilarity matrix from compDis()
. This means that, for
calculations of functional Hill diversity, a set of compounds that are
biosynthetically/structurally different from each other is more diverse
than a similar set of compounds that are biosynthetically/structurally
more similar to each other.
We calculate functional Hill diversity for q = 1.
alpinaDiv <- calcDiv(sampleData = alpinaSampData,
compDisMat = alpinaCompDis$fingerDisMat,
type = "FuncHillDiv",
q = 1)
head(alpinaDiv)
#> FuncHillDiv
#> 1 4.698821
#> 2 2.479441
#> 3 1.925043
#> 4 3.497343
#> 5 2.946872
#> 6 5.703370
The function outputs a data frame with samples as rows and the
selected type
of measures as columns.
calcDivProf()
can be used to calculate Hill diversity
for a range of q-values simultaneously, generating a so called diversity
profile. This allows for a more nuanced exploration of the
diversity.
We calculate a diversity profile for functional Hill diversity, using the default range of q = 0 to q = 3.
alpinaDivProf <- calcDivProf(sampleData = alpinaSampData,
compDisMat = alpinaCompDis$fingerDisMat,
type = "FuncHillDiv")
head(alpinaDivProf$divProf)[,1:5] # Part of the diversity profile data frame
#> q0 q0.1 q0.2 q0.3 q0.4
#> 1 19.453742 15.693727 12.805088 10.609162 8.950470
#> 2 5.629633 4.918416 4.340654 3.879160 3.514778
#> 3 5.629633 4.566221 3.792641 3.237913 2.841879
#> 4 11.517752 9.660744 8.169274 6.987923 6.061618
#> 5 11.517752 8.982790 7.142494 5.836170 4.920329
#> 6 19.453742 16.340454 13.829040 11.825500 10.239775
The function outputs a list with input parameters and the diversity profile as a data frame, with samples as rows and diversity values at different values of q as columns.
calcBetaDiv()
calculates beta diversity for a set of
samples, in the Hill numbers framework. The function calculates a single
beta-diversity value for the supplied sample data. This is calculated as
beta = gamma / alpha, where gamma is the diversity of the
pooled data set and alpha represents the mean diversity of individual
samples.
alpinaBetaDiv <- calcBetaDiv(sampleData = alpinaSampData,
compDisMat = alpinaCompDis$fingerDisMat,
type = "FuncHillDiv")
alpinaBetaDiv
#> Type q gamma alpha beta
#> 1 FuncHillDiv 1 9.998989 2.594547 3.853847
The function outputs a data frame with type of beta diversity calculated, q, and values for gamma diversity, mean alpha diversity and beta diversity.
Calculations of pairwise phytochemical dissimilarities between
samples can be made with the function sampDis()
. This
calculates Bray-Curtis dissimilarities and/or Generalized UniFrac
dissimilarities (Chen et al. 2012, Junker 2018) between samples in the
sample dataset. Generalized UniFrac dissimilarities utilize the
compound dissimilarity matrix from compDis()
, such that two
samples containing more biosynthetically/structurally different
compounds have a higher pairwise dissimilarity than two samples
containing more biosynthetically/structurally similar compounds.
We calculate phytochemical dissimilarity as Generalized UniFrac dissimilarities:
alpinaSampDis <- sampDis(sampleData = alpinaSampData,
compDisMat = alpinaCompDis$fingerDisMat,
type = "GenUniFrac")
alpinaSampDis$GenUniFrac[1:4, 1:4] # Part of sample dissimilarity matrix
#> 1 2 3 4
#> 1 0.00000000 0.05244257 0.04518892 0.04880070
#> 2 0.05244257 0.00000000 0.01254811 0.06126169
#> 3 0.04518892 0.01254811 0.00000000 0.05839175
#> 4 0.04880070 0.06126169 0.05839175 0.00000000
The output from the function is a list with one or several sample
dissimilarity matrices, depending on which type
was used as
input.
Function molNet()
uses a matrix generated by the
compDis()
function to create a molecular network of the
phytochemical compounds, and calculates some properties of the network.
Such networks are useful to visualize relationships between compound
similarities and abundances.
We create a molecular network based on the compound dissimilarity
matrix. We can also include the the NPClassifier classification, where
the “pathway” group will control node colour. We manually set
cutOff = 0.75
in this case. This limits the number of edges
between nodes, as edges are only plotted between nodes if their
similarity (where similarity = 1 - dissimilarity) is larger
than the cut-off value.
alpinaNetwork <- molNet(compDisMat = alpinaCompDis$fingerDisMat,
npcTable = alpinaNPC,
cutOff = 0.75)
summary(alpinaNetwork)
#> Length Class Mode
#> networkObject 15 tbl_graph list
#> nCompounds 1 -none- numeric
#> nNpcPathways 1 -none- numeric
#> modularity 1 -none- numeric
The output is a list including the network object, and some basic network parameters.
Once we have calculated different measures of phytochemical diversity and dissimilarity using the above functions, two plotting functions can be used to conveniently create different types of plots and molecular networks.
molNetPlot()
creates a basic plot of the molecular
network generated by the molNet()
function. We create a
single molecular network for the whole dataset, including compound names
and the classification by NPClassifier.
molNetPlot(sampleData = alpinaSampData,
networkObject = alpinaNetwork$networkObject,
npcTable = alpinaNPC,
plotNames = TRUE)
Nodes represent individual compounds. They are coloured by their “pathway” classification by NPClassifier. Node size corresponds to the mean proportional concentration of the compounds in the samples. Edge widths represent compound similarity, and are only plotted for similarities higher than the cutoff value. We see that the floral scent bouquet consists mostly of compounds belonging to the “Shikimates and Phenylpropanoids” pathway. These compounds are connected to each other in a network, but separated from the three “Terpenoids” compounds, indicating that the two groups of compounds belonging to different pathways are also structurally different to each other.
chemoDivPlot()
can be used to conveniently create basic
plots of the different types of diversity and dissimilarity measurements
calculated by the functions above. Four types of plots can be created,
in any combination. With argument compDisMat
a dendrogram
visualizing compound dissimilarities is created. With argument
divData
, diversity/evenness values are visualized with
boxplots. With argument divProfData
a diversity profile
will be created. With argument sampDisMat
sample
dissimilarities will be visualized as an NMDS plot. Grouping data can be
supplied with argument groupData
.
chemoDivPlot(compDisMat = alpinaCompDis$fingerDisMat,
divData = alpinaDiv,
divProfData = alpinaDivProf,
sampDisMat = alpinaSampDis$GenUniFrac,
groupData = alpinaPopData)
The output consists of the selected plots. The dendrogram visualizes similarities between compounds in a way complementary to the molecular network above, with the three terpenoids having a high dissimilarity to the other compounds. The boxplot indicates that the functional Hill diversity of the floral scent compounds is highest in the G1 population. Further analyses can examine more exactly what components of diversity are higher in this population. The diversity profile demonstrates how at q = 1 (shown also in boxplot) diversity is highest for population G1, but at q = 0, where compound proportions are not taken into account, diversity is highest for population It8. Finally, the NMDS indicates that all three populations are compositionally/structurally different to each other, and that in population It8, dissimilarities between samples in the same population is lower than for the other two populations.
In this example, we end at the plots visualizing phytochemical diversity and dissimilarity. Such plots may inform about patterns of variation within and between groups of samples that represent different populations, species etc. Importantly, testing for associations between measures of diversity/dissimilarity and variables such as herbivore performance, pollinator visitation rates and plant fitness may provide insights on the effects of phytochemical variation on plants for various ecological interactions and evolutionary processes.
The function quickChemoDiv()
uses many of the above
functions to in one simple step calculate and visualize chemodiversity
for users wanting to quickly explore their data using standard
parameters. This can be used to generate the same four types of plots as
above.
quickChemoDiv(compoundData = alpinaCompData,
sampleData = alpinaSampData,
groupData = alpinaPopData,
outputType = "plots") # Not run
?chemodiv
for a detailed description on how to structure
datasets, and compile chemical identifiers for compounds.?chemodiv
for a description on how datasets with missing
data can be handled, and for alternative ways to calculate compound
dissimilarities when most or all compounds are unidentified.?compDis
for details on how
compound dissimilarities are calculated using the three different
methods. See Petrén et al. 2023a for a discussion on what method is
suitable depending on type of data and research question addressed.compDis()
.
The compDis()
function can calculate compound
dissimilarities using NPClassifier
,
PubChemFingerprint
and fMCS
. For larger
datasets, this will take some time as data is downloaded and pairwise
compound dissimilarities are calculated. Of the three methods,
fMCS
is much more computationally intensive than the
others, and may take a very long time for datasets with a large number
of structurally complex molecules. In such cases, it is recommended to
use PubChemFingerprint
instead.?calcDiv
for a detailed description
on how different measures of diversity and evenness are calculated. See
Petrén et al. 2023a for a comparison of different diversity indices, and
Petrén et al. 2023b for a more in-depth discussion and review of
different components and measures of phytochemical diversity in the
context of their function, mechanism and ecology.chemoDivPlot()
. The chemoDivPlot()
function can be used to conveniently create basic plots of
chemodiversity. If the function argument sampDisMat
is
included, a Nonmetric Multidimensional Scaling (NMDS) will be performed.
This might take some time for larger datasets, and excluding this
argument will make the plotting much quicker.chemoDivPlot()
or molNetPlot()
? These functions exist to provide
an easy way to create basic chemodiversity plots and molecular networks,
and therefore have limited customization options. Customized plots are
easily created with the ggplot2
and/or ggraph
packages.chemodiv
gives a warning
message. Installing the package using
install.packages("chemodiv")
may result in the warning
message
Warning in install.packages : dependencies ‘fmcsR’, ‘ChemmineR’ are not available
,
meaning that these package dependencies from Bioconductor have not been
installed. It is recommended to install the package using the
install()
function in the BiocManager
package
instead. See the installation instructions in the README file for
details.Error in loadNamespace(x) : there is no package called ‘ChemmineR’
or
Error in loadNamespace(x) : there is no package called ‘fmcsR’
,
it means these package dependencies from Bioconductor have not been
installed. Either install these separately, or reinstall
chemodiv
using the install()
function in the
BiocManager
package. See the installation instructions in
the README file for details.Chao, A., C.-H. Chiu, and L. Jost. 2014. Unifying Species Diversity, Phylogenetic Diversity, Functional Diversity, and Related Similarity and Differentiation Measures Through Hill Numbers. Annu. Rev. Ecol. Evol. Syst. 45:297–324.
Chen, J., K. Bittinger, E. S. Charlson, C. Hoffmann, J. Lewis, G. D. Wu, R. G. Collman, F. D. Bushman, and H. Li. 2012. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics 28:2106–2113.
Chiu, C.-H., and A. Chao. 2014. Distance-Based Functional Diversity Measures and Their Decomposition: A Framework Based on Hill Numbers. PLoS ONE 9:e100014.
Junker, R. R. 2018. A biosynthetically informed distance measure to compare secondary metabolite profiles. Chemoecology 28:29–37.
Kim, H. W., M. Wang, C. A. Leber, L.-F. Nothias, R. Reher, K. B. Kang, J. J. J. van der Hooft, P. C. Dorrestein, W. H. Gerwick, and G. W. Cottrell. 2021. NPClassifier: A Deep Neural Network-Based Structural Classification Tool for Natural Products. J. Nat. Prod. 84:2795–2807.
Petrén, H., P. Toräng, J. Ågren, and M. Friberg. 2021. Evolution of floral scent in relation to self-incompatibility and capacity for autonomous self-pollination in the perennial herb Arabis alpina. Annals of Botany 127:737–747.
Petrén H., T. G. Köllner and R. R. Junker. 2023a. Quantifying chemodiversity considering biochemical and structural properties of compounds with the R package chemodiv. New Phytologist 237: 2478-2492.
Petrén H., R. A. Anaia, K. S. Aragam, A. Bräutigam, S. Eckert, R. Heinen, R. Jakobs, L. Ojeda, M. Popp, R. Sasidharan, J-P. Schnitzler, A. Steppuhn, F. Thon, S. Tschikin, S. B. Unsicker, N. M. van Dam, W. W. Weisser, M. J. Wittmann, S. Yepes, D. Ziaja, C. Müller, R. R. Junker. 2023b. Understanding the phytochemical diversity of plants: Quantification, variation and ecological function. bioRxiv doi: 10.1101/2023.03.23.533415.