Untargeted metabolomics goal is to quantify as metabolites as
possible from a sample. We can use liquid chromatography coupled to mass
spectrometry (LC/MS) for this purpose. It is a great challenge to
transform LC/MS data into a profile of annotated metabolites that
provides us meaningful biological information. A very important
limitation to the annotation of metabolomic experiments is that the
number of m/z processed signals, called features, is much bigger than
the putative number of metabolites in a sample. The sources that produce
multiple features from a single metabolite are multiple and variable.
Natural isotopes as carbon isotopes produce isotope features. Ionization
of metabolites produce the so called adducts of the metabolite, which
are detected as different features depending on the ion adduct involved
([M+Na]+, [M+H]+, etc..). Apart from adduct features, ionization also
produces metabolite fragmentation and other reactions as dimerizations,
trimerizations, all of them being detected as multiple features. Being
the reduction of multiple features to single metabolites a crucial step
for the correct annotation of LC/MS experiments, we will show how to use
cliqueMS
to do so.
cliqueMS
annotates samples one by one. Annotation can be
summarised in these three steps:
Annotation steps are stored in an anClique
S4 object.
This object can be created from a XCMSnExp
or a
xcmsSet
object with processed m/z data from
xcms
package. First m/z raw data is processed:
library(cliqueMS)
mzfile <- system.file("standards.mzXML", package = "cliqueMS")
library(xcms)
mzraw <- MSnbase::readMSData(files = mzfile, mode = "onDisk")
cpw <- CentWaveParam(ppm = 15, peakwidth = c(5,20), snthresh = 10)
mzData <- findChromPeaks(object = mzraw, param = cpw)
Then we can create an anClique
object:
ex.anClique <- createanClique(mzData)
show(ex.anClique)
#> anClique S4 object with 126 features
#> No computed clique groups
#> No isotope annotation
#> No adduct annotation
Here we see an anClique
object before any annotation
step. Features have not been grouped, isotopes and adducts are not
annotated. Now let’s see the three steps in detail.
Metabolites produce multiple features and very often they do not separate completely in the chromatography, so we observe coelution. This increases the difficulty of the annotation because many features coming from different metabolites might appear very close in the chromatogram. Before trying to annotate isotopes, adducts and fragments we want to make groups of features. Ideally, each group should include all the features produced by a single metabolite.
cliqueMS
uses a similarity network to find groups of
features. Each feature is a node, and edges are weighted according to
the cosine similarity between features:
\(c_{ij}=\frac{\sum_k f_i(t_k)f_j(t_k)}{\| {f}_i\| \|{f}_j\|}\)
Values from cosine similarity are useful to discriminate pairs of features that come from the same metabolite from pairs of features that come from different metabolites[1]. We compute the cosine similarity using the profile mode of the data, having each feature a m/z value and vector intensities. All features are discretized into a vector of equal bins \(k\). Each vector position relative to retention time \(t_k\) contains the intensity of the feature \(i_k\) at that moment of the chromatography. Features with no coelution at all have a cosine similarity = 0. Edges with weight = 0 are not included in the network, nor nodes without any edge.
Once we have the network, it is time to divide the features into
groups. cliqueMS
assumes that the similarity between all
features that come from the same metabolite must be \(c_{ij} > 0\). Additionally, similarity
values between features of the same metabolite should be generally
higher than between features of different metabolites. With this
information, cliqueMS
uses a probabilistic model to find
the feature groups. This model find cliques, fully connected components
so for all nodes \(c_{ij} > 0\). The
similarity inside this cliques should be higher than the similarity
between features outside the clique. cliqueMS
estimates the
log-likelihood for a particular assignment of features into different
clique groups. For details in the probabilistic model and the
log-likelihood maximisation see[1]. The log-likelihood maximisation
procedure can be summarised in the following way:
cliqueMS
starts with each node as a different clique
group.Now let’s see how to find this groups with
getCliques
.
With the function getCliques
we assign clique groups to
our features. This function creates the network of similarity and then
computes the clique groups. As input data it uses a xcmsSet
object. getCliques
outputs an anClique
S4
object, which will be used to store all annotation steps.
set.seed(2)
ex.cliqueGroups <- getCliques(mzData, filter = TRUE)
#> Creating anClique object
#> Creating network
#> Features filtered: 0
#> Computing cliques
#> Beggining value of logl is -712.347
#> Aggregate cliques done, with 163 rounds
#> Kernighan-Lin done with 1 rounds
#> Finishing value of logl is -174.171
show(ex.cliqueGroups)
#> anClique S4 object with 126 features
#> Features have been splitted into 15 cliques
#> No isotope annotation
#> No adduct annotation
As we see from the printed messages, the function
getCliques
first creates a network, and then it filters
features if parameter filter = TRUE
. As m/z signal
processing algorithms may produce two artefact features from a single
one, it is recommended to set filter = TRUE
to drop
repeated features. This filter only drops features with similarity >
0.99, and equal values of m/z, retention time and maximum intensity,
defined by the relative error parameters mzerror
,
rtdiff
and intdiff
. From the output of the
function we see the computed log-likelihood at the beginning, when each
feature is in a different clique and the computed log-likelihood at the
end of the process. If we now look at the summary
of the
resulting anClique
object, we see that the features have
been grouped into 69 clique groups. Now we can annotate isotopes.
cliqueMS
annotates isotopes within each clique group.
cliqueMS
searches pairs of features than can be carbon
isotopes based in these two rules:
Isotopes are annotated with the function getIsotopes
.
This function finds pairs of features that fulfil the conditions of an
isotope. Then it creates the isotope annotation after removing
incoherences like two monoisotopic masses for one isotope, two second
isotopes for one first isotope, etc… In all this cases the removed pair
is the one with smaller similarity. The use of the function is pretty
straightforward:
ex.Isotopes <- getIsotopes(ex.cliqueGroups, ppm = 10)
#> Computing isotopes
#> Updating anClique object
show(ex.Isotopes)
#> anClique S4 object with 126 features
#> Features have been splitted into 15 cliques
#> 37 Features are isotopes
#> No adduct annotation
Parameter ppm
is important because it defines in ppm
units the range of the accepted relative error. Once isotopes are
annotated we can annotate adducts and fragments.
The last step of cliqueMS
is to annotate adducts. Each
feature has a m/z value that is the neutral mass of the metabolite plus
the mass of the ion adduct (or fragmentation ion adduct). The neutral
mass is an unknown value, but the ion adduct mass is to some degree
known as many ion adducts are known. The list of possible adducts should
be given as input to cliqueMS
by the user or either use one
of the default adduct lists (positive.adinfo
or
negative.adinfo
). Here is how the default lists look:
data(positive.adinfo)
head(positive.adinfo)
#> adduct log10freq massdiff nummol charge
#> 1 [M+2H-NH3]2+ -3.512904 -15.012016600 1 2
#> 2 [Cat]3+ -3.512904 -0.001645737 1 3
#> 3 [Cat]2+ -3.512904 -0.001040400 1 2
#> 4 [Cat+H]2+ -3.336813 1.006178842 1 2
#> 5 [M+2H]2+ -1.813934 2.014552000 1 2
#> 6 [M+H+Na]2+ -2.699991 23.996494000 1 2
data(negative.adinfo)
head(negative.adinfo)
#> adduct log10freq massdiff nummol charge
#> 1 [M-2H]2- -1.4029243 -2.014552 1 -2
#> 2 [M+Na-3H]2- -4.2480223 19.967390 1 -2
#> 3 [M-H]- -0.1721106 -1.007276 1 -1
#> 4 [M-H-H2O]- -0.8198875 -19.018390 1 -1
#> 5 [M-H-NH3]- -1.9469923 -18.033815 1 -1
#> 6 [M-H+H2O]- -2.1340790 17.003275 1 -1
The lists should have a column with the name of the adduct, one for the log10 frequency of that adduct, another for the mass of the adduct, one for the number of molecules involved and also one for the charge (see[1] for details in how default lists were made). With the adduct list we can estimate neutral masses.
cliqueMS
searches in each clique for groups of two or
more features compatible with a neutral mass and two or more adducts in
the adduct list. Neutral masses with only one adduct are not included in
the scoring. Once we have all possible neutral masses and their
corresponding adducts, the algorithm tries combinations of different
adducts and neutral masses to find the most plausible annotation. All
combinations are scored and the top five annotations are returned. The
scoring is based on the following criteria:
The computed score (which is a logarithmic score) is the sum of the
adducts log-frequencies plus the number of empty features (which has a
log-frequency smaller than the least frequent adduct) and the number of
neutral masses. Within a clique group, it may happen than the annotation
of some features is independent from the annotation of some other
features, as there is not a single neutral mass with adducts in both
groups of features. In those cases, the clique group is splitted in non
overlapping groups, called annotation groups. This is common in big
cliques. The reported scores refer to annotation groups. The score is
useful to see how probable is the first annotation compared to second
annotation, third annotation, etc… within an annotation group, but it is
not intended to compare annotations between different annotation groups
because the score will be smaller when the number of features in the
group is bigger.To compare scores from different groups, the option
normalizeScore
should be set to TRUE
. The
normalized score value is 100 when the score is similar to the
theoretical maximum score (all the features annotated with the most
frequent adducts and the minimum number of neutral masses) and goes
until 0, which is the extreme case that all features of the group are
not annotated. To find annotation cliqueMS
uses the
function getAnnotation
.
Here is an example of annotating adducts with
getAnnotation
ex.Adducts <- getAnnotation(ex.Isotopes, ppm = 10,
adinfo = positive.adinfo, polarity = "positive",
normalizeScore = TRUE)
#> Computing annotation
#> Annotation computed, updating peaklist
show(ex.Adducts)
#> anClique S4 object with 126 features
#> Features have been splitted into 15 cliques
#> 37 Features are isotopes
#> 85 features annotated
As we see from the summary
output, 178 of 275 features
have annotation. Function getAnnotation
requires as input
an adduct list, the parameter adinfo
. Users can use the
default adduct list positive.adinfo
for positive charged
adducts and negative.adinfo
for negative charged adducts.
polarity
must be set, either to positive
or
negative
. Lots of neutral masses are found when the clique
groups have many features. In those cases, scoring all annotations could
take much time as there are many possible combinations. To prevent this,
neutral masses that likely will be in the final top annotations are
selected and annotation is computed quickly. The selected masses have
the highest frequency adducts and the largest number of adducts. For
each clique group, all neutral masses are ordered depending on their
score. A number of top scoring masses controlled by
topmasstotal
parameter are selected. Additionally and for
every feature, the ordered list of scored neutral masses is subsetted to
only the neutral masses with adducts in that feature. Then a number of
top scoring masses set by topmassf
parameter are selected
in each sublist, and added to the set of selected masses. After the mass
selection, and in cases of big cliques (size of a “big” clique is
defined by parameter sizeanG
), annotation groups are
splitted again in new non overlapping groups just taking into account
the set of selected neutral masses.
getCliques
stores the annotation in the
peaklist
of the anClique
object. Here we can
see an overview of some annotated features in our sample:
features.clique6 <- getlistofCliques(ex.Adducts)[[6]]
head(getPeaklistanClique(ex.Adducts)[features.clique6,
c("an1","mass1","an2", "mass2", "an3", "mass3", "an4", "mass4", "an5",
"mass5")], n = 10)
#> an1 mass1 an2 mass2 an3 mass3 an4
#> CP037 [M+K]+ 242.0805 NA [M+H+K]2+ 522.1188
#> CP038 [2M+H]+ 242.0805 [M+H]+ 484.1605 [M+Na]+ 462.1786 [M+H-H2O]+
#> CP039 NA [M+NH4]+ 484.1605 NA
#> CP040 [2M+K]+ 242.0805 [M+K]+ 484.1605 [M+H]+ 522.1188 [M+K-H2O]+
#> CP041 [Cat-H]+ 105.0583 [Cat]2+ 208.1014 [Cat-H]+ 105.0583 [Cat-H]+
#> CP042 [M+H-OH]+ 216.0768 [M+H]+ 199.0745 [M+H]+ 199.0745 [M+H]+
#> CP043 NA NA NA
#> CP044 [Cat]+ 118.0649 [Cat]+ 118.0649 [Cat]+ 118.0649 [Cat]+
#> CP045 [Cat-H2O]+ 216.0768 [Cat-H]+ 199.0745 [Cat-H]+ 199.0745 [Cat-H]+
#> CP046 [M+Na]+ 242.0805 [M-H+2Na]+ 220.0986 [M+Na]+ 242.0805 [M-H+2Na]+
#> mass4 an5 mass5
#> CP037 NA [M+H+K]2+ 522.1188
#> CP038 502.1711 [M+H-NH3]+ 501.1871
#> CP039 NA [M+H]+ 501.1871
#> CP040 502.1711 [M+H]+ 522.1188
#> CP041 105.0583 [Cat-H]+ 105.0583
#> CP042 199.0745 [M+H]+ 199.0745
#> CP043 NA NA
#> CP044 118.0649 [Cat]+ 118.0649
#> CP045 199.0745 [Cat-H]+ 199.0745
#> CP046 220.0986 [M-H+2Na]+ 220.0986
Now we have obtained the neutral mass and the adduct annotation for our features. We could use the neutral mass together with the retention time and MS/MS data to annotate more confidently some of these metabolites. We also know how many features in the dataset are isotopes. Finally, we have achieved a reduction in the complexity of our the dataset, from many features to a signficant smaller number annotated neutral masses that have different adducts and isotopes.
[1]: “CliqueMS: a computational tool for annotating in-source metabolite ions from LC-MS untargeted metabolomics data based on a coelution similarity network”. Oriol Senan, Antoni Aguilar-Mogas, Miriam Navarro, Jordi Capellades, Luke Noon, Deborah Burks, Oscar Yanes, Roger Guimerà and Marta Sales-Pardo. Bioinformatics. Accepted March 2019.