Adaptive gPCA (Fukuyama, J. (2017)) is a flexible method for incorporating side information about the variables into principal components analysis. The method has a single parameter, \(r\), which controls how much weight to give to the side information: \(r = 0\) corresponds to standard PCA (the side information is not used at all), and \(r = 1\) corresponds to the maximum amount of weight on the side information.
The output from adaptive gPCA is analogous to that given by PCA: we get representations of the samples and the variables, which can be visualized as either independently or as a biplot. The difference with standard PCA is that, assuming \(r > 0\), the variable loadings are regularized so that similar variables are encouraged to have similar loadings on the principal axes.
There are two main ways of using this package. The function
adaptivegpca
will choose the value of \(r\) automatically, while the function
gpcaFullFamily
produces analogous output for the full range
of values of \(r\), which is then
chosen manually. We will illustrate both methods in this vignette.
Although adaptive gPCA can be applied more generally, the method was developed to incorporate phylogenetic structure in microbiome data analysis, and so there are also functions to interface with phyloseq objects and functions to visualize adaptive gPCA output along with a phylogenetic tree.
The data set we will use to illustrate the functionality of this
package is an antibiotic time course study1. In this experiment,
three subjects were given two courses of antibiotics, and the abundances
of bacteria in the gut microbiome were measured before, during, and
after each course of antibiotics. The data is stored in a phyloseq
object called AntibioticPhyloseq
, which contains
variance-stabilized and centered abundances on the
otu_table
slot, a phylogenetic tree describing the
relationships between the bacterial species measured in the
phy_tree
slot, and information about the samples in the
sample_data
slot. We want a low-dimensional representation
of the samples which takes into account the phylogenetic similarities
between the bacteria, which is what adaptive gPCA will provide for
us.
Now that we have described the data we will be using, we will show how to use the package. The first step is to load the required packages and data.
library(adaptiveGPCA)
library(ggplot2)
library(phyloseq)
data(AntibioticPhyloseq)
theme_set(theme_bw())
Next, we create the inputs required for the adaptivegpca
function. If we have a phyloseq
object containing taxon
abundances and a phylogenetic tree, the function
processPhyloseq
will create a list with the following
elements:
A centered data matrix, stored as X
in the
output.
A similarity matrix based on the phylogeny, stored as
Q
in the output.
A vector of weights, stored as weights
. The weights
are only relevant for correspondence analysis, so if the function is
invoked with ca = FALSE
(the default), the weights vector
will be a vector containing only 1’s and can be ignored.
pp = processPhyloseq(AntibioticPhyloseq)
Next, we pass our data matrix and our similarity matrix to the
adaptivegpca
function. The arguments to
adaptivegpca
are
X
: A centered data matrix.
Q
: A symmetric, positive definite matrix describing
the similarities between the variables.
k
: The number of axes to keep, defaults to
2.
weights
: A vector with sample weights, defaults to a
vector of all 1’s (equal weights on each sample).
The matrices X
and Q
will be generated from
a phyloseq object using the OTU table and the phylogenetic tree with the
processPhyloseq
function. If you have another kind of data,
you can generate these matrices yourself. Below we use the output from
processPhyloseq
to fit an adaptive gPCA model.
out.agpca = adaptivegpca(pp$X, pp$Q, k = 2)
The output from the adaptivegpca
function is stored in
an object of class adaptivegpca
, which has some generic
printing and plotting functions associated with it. As such, if we just
print out the object, it will give us some basic information about the
results, namely the number of axes, the value of \(r\) chosen, and the fraction of the
variance explained by the top axes.
out.agpca
## An object of class adaptivegpca
## -------------------------------
## Number of axes: 2
## Value of r chosen: 0.462
## Fraction of variance explained
## by first 2 axes:
## 0.195 0.156
The output is a list with several elements. These are:
U
: The sample scores on the principal axes.
V
: The variable loadings before rotation. These are
not used for plotting — they need to be rotated into the space of the
samples for the biplot representation.
QV
: The variable loadings rotated to be in the same
space as the sample scores. These are the variable loadings that should
be plotted for a biplot representation of the samples and the
variables.
vars
: The variances along the principal
axes.
r
: The value of \(r\) chosen by adaptivegpca, corresponding
to how much regularization to perform according to the structure of the
variables. \(r = 0\) is standard PCA
(variable structure is not considered at all), and \(r = 1\) is the maximal amount of
regularization.
evals
: The eigenvalues of the modified similarity
matrix. This is described in more detail in the paper, but it is an
alternate way of understanding how much regularization is performed by
the method. The larger the top eigenvalues are compared with the
subsequent ones, the more regularization there is toward the top
eigenvectors of the similarity matrix.
We can also plot the output from adaptivegpca
using a
generic plotting function. By default, this will produce a scree plot,
as shown below:
plot(out.agpca)
The plotting function has a type
argument, which can be
either scree
, samples
, or
variables
. type = scree
will give a plot
showing the fraction of the variance explained by each of the axes,
type = samples
will show the sample scores along the
principal axes, and type = variables
will show the variable
loadings on the principal axes. The axes
argument allows
the user to specify which axes to plot, by default these are axes 1 and
2.
plot(out.agpca, type = "samples", axes = c(1,2))
plot(out.agpca, type = "variables", axes = c(1,2))
Finally, the inspectTaxonomy
function allows interactive
inspection of the taxa or variables plot, assuming you started the
analysis with a phyloseq object. This function will open a browser
window with a representation of the phylogenetic tree and the taxon
loadings on the adaptive gPCA axes. Taxa in this plot can be selected by
“brushing” the plot (drawing a rectangular box), and the selected taxa
will be highlighted on the phylogenetic tree and their taxonomic
assignments will be printed below. An example call is given below:
inspectTaxonomy(out.agpca, AntibioticPhyloseq)
The arguments to inspectTaxonomy
are
agpcafit
: The output from
adaptivegpca
.
physeq
: A phyloseq object with a taxonomy table and
a phylogenetic tree. Should be the same one used for for the adaptive
gPCA fit.
axes
: Which axes to plot. Defaults to the first two
axes.
br.length
: Should the tree include branch lengths?
The trees tend to look better when the branches are all plotted as being
the same length, so this defaults to FALSE
.
height
: The height of the plotting area in pixels.
Defaults to 600.
If you click the “done” button in the browser window, the app will close and the function will return a table describing the taxonomy, position along selected axes, and names of the selected taxa.
We can also choose the value of \(r\) manually. In the previous section, we
described how to use the adaptivegpca
function, which
chooses the value of \(r\)
automatically, but if we would like to see the results for different
values of \(r\) and to choose one
manually, we can first use the gpcaFullFamily
function to
create a full set of models and then use the
visualizeFullFamily
function to visualize the biplots for
each value of \(r\).
gpcaFullFamily
takes the same arguments as
adaptivegpca
, along with some additional arguments
describing the form of the output.
The visualizeFullFamily
function is a shiny gadget.
Calling this function will open a browser window where you can visualize
the data set for a range of values of \(r\). Clicking “done” in this window will
give as output an object of the same format as that given by the
adaptivegpca
function, the difference being that the value
of \(r\) was chosen manually instead of
automatically. The sample_data
/sample_mapping
and var_data
/var_mapping
arguments allow you
to customize the visualization. Without these arguments, the first two
axes will be plotted for both the samples and the variables. If you
include these arguments, you can customize the plots by providing
aesthetic mappings for ggplot. Note that the code below is only included
as an example and not evaluated in the vignette.
out.ff = gpcaFullFamily(pp$X, pp$Q, k = 2)
out.agpca = visualizeFullFamily(out.ff,
sample_data = sample_data(AntibioticPhyloseq),
sample_mapping = aes(x = Axis1, y = Axis2, color = type),
var_data = tax_table(AntibioticPhyloseq),
var_mapping = aes(x = Axis1, y = Axis2, color = Phylum))
In either case, we can plot the results. This time we will do it
manually, using ggplot. The sample scores on the principal axes are
located in out.agpca$U
and the loadings of the variables on
the principal axes are located in out.agpca$QV
. To plot the
samples, we make a data frame that includes out.agpca$U
and
the sample data, and to plot the taxa we make a data frame containing
out.agpca$QV
and the taxonomy table.
ggplot(data.frame(out.agpca$U, sample_data(AntibioticPhyloseq))) +
geom_point(aes(x = Axis1, y = Axis2, color = type, shape = ind)) +
xlab("Axis 1") + ylab("Axis 2")
ggplot(data.frame(out.agpca$QV, tax_table(AntibioticPhyloseq))) +
geom_point(aes(x = Axis1, y = Axis2, color = Phylum)) +
xlab("Axis 1") + ylab("Axis 2")
Finally, note that the processPhyloseq
function also has
an argument ca
(for correspondence analysis). This should
be used with phyloseq objects containing raw counts, and it will process
a phyloseq object so as to do an adaptive gPCA version of correspondence
analysis (this entails transforming counts to relative abunadnces,
computing sample weights based on the overall counts for the samples,
and finally doing a weighted centering of the relative abundances).
Dethlefsen, L., and D. A. Relman. “Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation.” Proc Natl Acad Sci USA 108. Suppl 1 (2010): 4554-4561.↩︎