The input data for the phylogenetic comparative methods covered in
the PCMBase package consists of a phylogenetic tree of \(N\) tips and a \(k\times N\) matrix, \(X\) of observed trait-values, where \(k\) is the number of traits. The matrix
\(X\) can contain NA
s
corresponding to missing measurements and NaN
’s
corresponding to non-existing traits for some of the tips.
Given a number of traits \(k\), a model is defined as a set of parameters with dimensionality possibly depending on \(k\) and a rule stating how these parameters are used to calculate the model likelihood for an observed tree and data and/or to simulate data along a tree. Often, we use the term model type to denote a family of models sharing the same rule.
Currently, PCMBase supports Gaussian model types from the so called \(\mathcal{G}_{LInv}\)-family (Mitov et al. 2019). Specifically, these are models representing branching stochastic processes and satisfying the following two conditions (Mitov et al. 2019):
after a branching point on the tree the traits evolve independently in the two descending lineages,
the distribution of the trait \(\vec{X}\), at time \(t\) conditional on the value at time \(s < t\) is Gaussian with the mean and variance satisfying
2.a The expectation depends linearly on the ancestral trait value, i.e. \(\text{E}\big[{\vec{X}(t) \vert \vec{X}(s)}\big] = \vec{\omega}_{s,t} + \mathbf{\Phi}_{s,t} \vec{X}(s)\);
2.b The variance is invariant (does not depend on) with respect to the ancestral trait, i.e. \(\text{V}\big[{\vec{X}(t) \vert \vec{X}(s)}\big] = \mathbf{V}_{s,t}\),
where \(\vec{\omega}\) and the matrices \(\mathbf{\Phi}\), \(\mathbf{V}\) may depend on \(s\) and \(t\) but do not depend on the previous trajectory of the trait \(\vec{X}(\cdot)\).
As an example, let’s consider a model representing a \(k\)-variate Ornstein-Uhlenbeck branching stochastic process. This is defined by the following stochastic differential equation: \[\begin{equation}\label{eq:mammals:OUprocess} d\vec{X}(t)=\mathbf{H}\big(\vec{\theta}-\vec{X}(t)\big)dt+\mathbf{\mathbf{\Sigma}}_{\chi} dW(t). \end{equation}\] In the above equation, \(\vec{X}(t)\) is a \(k\)-dimensional real vector, \(\mathbf{H}\) is a \(k\times k\)-dimensional eigen-decomposable real matrix, \(\vec{\theta}\) is a \(k\)-dimensional real vector, \(\mathbf{\Sigma}_{\chi}\) is a \(k\times k\)-dimensional real positive definite matrix and \(W(t)\) denotes the \(k\)-dimensional standard Wiener process.
Biologically, \(\vec{X}(t)\) denotes the mean values of \(k\) continuous traits in a species at a time \(t\) from the root, the parameter \(\mathbf{\Sigma}=\mathbf{\Sigma}_{\chi}\mathbf{\Sigma}_{\chi}^T\) defines the magnitude and shape of the momentary fluctuations in the mean vector due to genetic drift, the matrix \(\mathbf{H}\) and the vector \(\vec{\theta}\) specify the trajectory of the population mean through time. When \(\mathbf{H}\) is the zero matrix, the process is equivalent to a Brownian motion process and the parameter \(\vec{\theta}\) is irrelevant. When \(\mathbf{H}\) has strictly positive eigenvalues, the population mean converges in the long term towards \(\vec{\theta}\), although the trajectory of this convergence can be complex.
So, the OU-model defines the following set of parameters:
X0
) : a \(k\)-vector of
initial values;H
) : a \(k\times k\)
matrix denoting the selection strength of the process;Theta
) : a \(k\)-vector of
long-term optimal trait values;Sigma_x
) : a \(k\times k\)
matrix denoting the an upper triangular factor or the stochastic drift
variance-covariances;
The rule defining how the parameters of the OU-model are used to calculate the model likelihood and to simulate data consists in the definition of the functions \(\vec{\omega}\) and matrices \(\mathbf{\Phi}\), \(\mathbf{V}\) (Mitov et al. 2019): \[\begin{equation}\label{eq:mammals:omegaPhiVOU} \begin{array}{l} \vec{\omega}_{s,t}=\bigg(\mathbf{I}-\text{Exp}\big(-(t-s)\mathbf{H}\big)\bigg)\vec{\theta} \\ \mathbf{\Phi}_{s,t}=\text{Exp}(-(t-s)\mathbf{H}) \\ \mathbf{V}_{s,t}=\int_{0}^{t-s}\text{Exp}(-v\mathbf{H})(\mathbf{\Sigma}_{\chi}\mathbf{\Sigma}_{\chi}^T)\text{Exp}(-v\mathbf{H}^T)dv \end{array} \end{equation}\]
Together, \(\vec{X}_{0}\), \(\vec{\omega}\), \(\mathbf{\Phi}\), \(\mathbf{V}\) and the tree define a \(kN\)-variate Gaussian distribution for the
vector of trait values at the tips. This is the defining property of all
Gaussian phylogenetic models. Hence, calculating the model likelihood is
equivalent to calculating the density of this Gaussian distribution at
observed trait values, and simulating data under the model is equivalent
to drawing a random sample from this distribution. The functions
PCMMean()
and PCMVar()
allow to calculate the
mean \(kN\)-vector and the \(kN\times kN\) variance covariance matrix of
this distribution. This can be useful, in particular, to compare two
models by calculating a distance metric such as the Mahalanobis
distance, or the Bhattacharyya distance. However, for big \(k\) and/or \(N\), it is inefficient to use these
functions in combination with a general purpose multivariate normal
implementation (e.g. the mvtnorm::dmvnorm
and
mvtnorm::rmvnorm
), to calculate the likelihood or simulate
data assuming an OU model. The main purpose of PCMBase package is to
provide a generic and computationally efficient way to perform these two
operations.
It is convenient to group model types into smaller subsets with named elements. This allows to use simple names, such as letters, as aliases for otherwise very long model type names. For example, the PCMBase package defines a subset of six so called “default model types”, which are commonly used in macroevolutionary studies. These are model types based on parameterizations of the \(k\)-variate Ornstein-Uhlenbeck (OU) process. All of these six models restrict \(\mathbf{H}\) to have non-negative eigenvalues - a negative eigenvalue of \(\mathbf{H}\) transforms the process into repulsion with respect to \(\vec{\theta}\), which, while biologically plausible, is not identifiable in a ultrametric tree. The six default models are defined as follows:
Calling the function PCMDefaultModelTypes()
returns a
named vector of the technical class-names for these six models:
# scroll to the right in the following listing to see the full model type names
# and their corresponding alias:
PCMDefaultModelTypes()
## A
## "BM__Global_X0__Diagonal_WithNonNegativeDiagonal_Sigma_x__Omitted_Sigmae_x"
## B
## "BM__Global_X0__UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Sigma_x__Omitted_Sigmae_x"
## C
## "OU__Global_X0__Schur_Diagonal_WithNonNegativeDiagonal_Transformable_H__Theta__Diagonal_WithNonNegativeDiagonal_Sigma_x__Omitted_Sigmae_x"
## D
## "OU__Global_X0__Schur_Diagonal_WithNonNegativeDiagonal_Transformable_H__Theta__UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Sigma_x__Omitted_Sigmae_x"
## E
## "OU__Global_X0__Schur_UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Transformable_H__Theta__UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Sigma_x__Omitted_Sigmae_x"
## F
## "OU__Global_X0__Schur_WithNonNegativeDiagonal_Transformable_H__Theta__UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Sigma_x__Omitted_Sigmae_x"
The PCMBase package comes with numerous other predefined model types.
At present, all these are members of the \(\mathcal{G}_{LInv}\)-family. A list of
these model types is returned from calling the function
PCMModels()
.
For each model type, it is possible to check how the conditional distribution of \(\vec{X}\) at the end of a time interval of length \(t\) is defined from an ancestral value \(X_{0}\). In particular, for \(\mathcal{G}_{LInv}\) models, this is the definition of the functions \(\vec{\omega}\), \(\mathbf{\Phi}\) and \(\mathbf{V}\). For example,
PCMFindMethod(PCMDefaultModelTypes()["A"], "PCMCond")
## function (tree, model, r = 1, metaI = PCMInfo(NULL, tree, model,
## verbose = verbose), verbose = FALSE)
## {
## Sigma_x <- GetSigma_x(model, "Sigma", r)
## Sigma <- Sigma_x %*% t(Sigma_x)
## if (!is.null(model$Sigmae_x)) {
## Sigmae_x <- GetSigma_x(model, "Sigmae", r)
## Sigmae <- Sigmae_x %*% t(Sigmae_x)
## }
## else {
## Sigmae <- NULL
## }
## V <- PCMCondVOU(matrix(0, nrow(Sigma), ncol(Sigma)), Sigma,
## Sigmae)
## omega <- function(t, edgeIndex, metaI) {
## rep(0, nrow(Sigma))
## }
## Phi <- function(t, edgeIndex, metaI, e_Ht = NULL) {
## diag(nrow(Sigma))
## }
## list(omega = omega, Phi = Phi, V = V)
## }
## <bytecode: 0x562e5134edb0>
## <environment: namespace:PCMBase>
# The complex maths is implemented in the function PCMCondVOU. You can see its
# R-code by typing :
# PCMBase::PCMCondVOU
In the computer memory, models are represented by S3 objects,
i.e. ordinary R-lists with a class attribute. The base S3 class of all
models is called "PCM"
, which is inherited by more specific
model-classes. Let us create a BM PCM for two traits:
<- PCM(model = "BM", k = 2) modelBM
Printing the model object shows a short verbal description, the S3-class, the number of traits, k, the number of numerical parameters of the model, p, the model regimes and the current values of the parameters for each regime (more on regimes in the next sub-section):
modelBM
## Brownian motion model
## S3 class: BM, GaussianPCM, PCM; k=2; p=8; regimes: 1. Parameters/sub-models:
## X0 (VectorParameter, _Global, numeric; trait values at the root):
## [1] 0 0
## Sigma_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; factor of the unit-time variance rate):
## , , 1
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
## Sigmae_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; factor of the non-heritable variance or the variance of the measurement error):
## , , 1
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
##
One may wonder why in the above description, p = 8 instead of 10 (see
also ?PCMParamCount
). The reason is that both, the matrix
Sigma and the matrix Sigmae, are symmetric matrices and their matching
off-diagonal elements are counted only one time.
Model regimes are different models associated with different parts of the phylogenetic tree. This is a powerful concept allowing to model different evolutionary modes on different lineages on the tree. Let us create a 2-trait BM model with two regimes called a and b:
<- PCM("BM", k = 2, regimes = c("a", "b"))
modelBM.ab modelBM.ab
## Brownian motion model
## S3 class: BM, GaussianPCM, PCM; k=2; p=14; regimes: a, b. Parameters/sub-models:
## X0 (VectorParameter, _Global, numeric; trait values at the root):
## [1] 0 0
## Sigma_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; factor of the unit-time variance rate):
## , , a
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
## , , b
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
## Sigmae_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; factor of the non-heritable variance or the variance of the measurement error):
## , , a
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
## , , b
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
##
Now, we can set some different values for the parameters of the model we’ve just created. First, let us specify an initial value vector different from the default 0-vector:
$X0[] <- c(5, 2) modelBM.ab
X0 is defined as a parameter with S3 class
class(modelBM.ab$X0)
. This specifies that X0
is global vector parameter shared by all model regimes. This is also the
reason, why the number of parameters is not the double of the number of
parameters in the first model:
PCMParamCount(modelBM)
## [1] 8
PCMParamCount(modelBM.ab)
## [1] 14
The other parameters, Sigma_x and Sigmae_x are local for each regime:
# in regime 'a' the traits evolve according to two independent BM processes (starting from the global vecto X0).
$Sigma_x[,, "a"] <- rbind(c(1.6, 0),
modelBM.abc(0, 2.4))
$Sigmae_x[,, "a"] <- rbind(c(.1, 0),
modelBM.abc(0, .4))
# in regime 'b' there is a correlation between the traits
$Sigma_x[,, "b"] <- rbind(c(1.6, .8),
modelBM.abc(.8, 2.4))
$Sigmae_x[,, "b"] <- rbind(c(.1, 0),
modelBM.abc(0, .4))
The above way of setting values for model parameters, while human readable, is not handy during model fitting procedures, such as likelihood maximization. Thus, there is another way to set (or get) the model parameter values from a numerical vector:
<- double(PCMParamCount(modelBM.ab))
param
# load the current model parameters into param
PCMParamLoadOrStore(modelBM.ab, param, offset=0, load=FALSE)
## [1] 14
print(param)
## [1] 5.0 2.0 1.6 0.0 2.4 1.6 0.8 2.4 0.1 0.0 0.4 0.1 0.0 0.4
# modify slightly the model parameters
<- jitter(param)
param2
print(param2)
## [1] 4.9805426428 1.9891297395 1.6111129699 -0.0022761540 2.3996846637
## [6] 1.6089305746 0.8001918500 2.3839385408 0.0933156090 0.0004943791
## [11] 0.3885086007 0.1198546167 0.0046077819 0.4077723531
# set the new parameter vector
PCMParamLoadOrStore(modelBM.ab, param2, offset = 0, load=TRUE)
## [1] 14
print(modelBM.ab)
## Brownian motion model
## S3 class: BM, GaussianPCM, PCM; k=2; p=14; regimes: a, b. Parameters/sub-models:
## X0 (VectorParameter, _Global, numeric; trait values at the root):
## [1] 4.980543 1.989130
## Sigma_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; factor of the unit-time variance rate):
## , , a
##
## [,1] [,2]
## [1,] 1.611113 -0.002276154
## [2,] 0.000000 2.399684664
##
## , , b
##
## [,1] [,2]
## [1,] 1.608931 0.8001919
## [2,] 0.800000 2.3839385
##
## Sigmae_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; factor of the non-heritable variance or the variance of the measurement error):
## , , a
##
## [,1] [,2]
## [1,] 0.09331561 0.0004943791
## [2,] 0.00000000 0.3885086007
##
## , , b
##
## [,1] [,2]
## [1,] 0.1198546 0.004607782
## [2,] 0.0000000 0.407772353
##
##
It can be handy to print the parameters of a model in the form of a
table with rows corresponding to the different regimes. For that purpose
we use the PCMTable()
function, which generates a
PCMTable
object for a given PCM
object. The
PCMTable
S3 class inherits from data.table
and
implements a print method that allows for beautiful formatting of matrix
and vector parameters. Here is an example:
options(digits = 2)
print(PCMTable(modelBM.ab), xtable = TRUE, type="html")
regime | \(X0\) | \(\Sigma\) | \(\Sigma_{e}\) |
---|---|---|---|
:global: | \(\begin{bmatrix}{} 4.98 \\ 1.99 \\ \end{bmatrix}\) | ||
a | \(\begin{bmatrix}{} 2.60 & -0.01 \\ -0.01 & 5.76 \\ \end{bmatrix}\) | \(\begin{bmatrix}{} 0.01 & 0.00 \\ 0.00 & 0.15 \\ \end{bmatrix}\) | |
b | \(\begin{bmatrix}{} 3.23 & 3.19 \\ 3.19 & 6.32 \\ \end{bmatrix}\) | \(\begin{bmatrix}{} 0.01 & 0.00 \\ 0.00 & 0.17 \\ \end{bmatrix}\) |
Check the help-page for the PCMTable
-function for more
details.
One of the features of the package is the possibility to specify
models in which different types of processes are associated with
different regimes. We call these “mixed Gaussian” models. To create a
mixed Gaussian model, we use the constructor function
MixedGaussian
:
<- MixedGaussian(
model.OU.BM k = 3,
modelTypes = c(
BM = paste0("BM",
"__Omitted_X0",
"__UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Sigma_x",
"__Omitted_Sigmae_x"),
OU = paste0("OU",
"__Omitted_X0",
"__H",
"__Theta",
"__UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Sigma_x",
"__Omitted_Sigmae_x")),
mapping = c(a = 2, b = 1),
Sigmae_x = structure(
0,
class = c("MatrixParameter", "_Omitted"),
description = "upper triangular factor of the non-phylogenetic variance-covariance"))
In the above, snippet, notice the long names for the BM and the OU
model types. These are the model types that can be mapped to regimes. It
is important to notice that both of these model types omit the parameter
X0
, which is a global parameter shared by all regimes.
Similarly the two model types omit the parameter Sigmae_x
.
However, in this example, we specify that the whole mixed Gaussian model
does not have a Sigmae_x
parameter, i.e. the model assumes
that the trait variation is completely explainable by the mixed Gaussian
phylogenetic process and there is no non-heritable component. For more
information on model parametrizations, see the PCMBase
parameterizations guide. For further information on mixed Gaussian
models, see the ?MixedGaussian
help page.
We can set the parameters of the model manually as follows:
$X0[] <- c(NA, NA, NA)
model.OU.BM$`a`$H[,,1] <- cbind(
model.OU.BMc(.1, -.7, .6),
c(1.3, 2.2, -1.4),
c(0.8, 0.2, 0.9))
$`a`$Theta[] <- c(1.3, -.5, .2)
model.OU.BM$`a`$Sigma_x[,,1] <- cbind(
model.OU.BMc(1, 0, 0),
c(1.0, 0.5, 0),
c(0.3, -.8, 1))
$`b`$Sigma_x[,,1] <- cbind(
model.OU.BMc(0.8, 0, 0),
c(1, 0.3, 0),
c(0.4, 0.5, 0.3))
We can print the mixed Gaussian model as a table:
print(PCMTable(model.OU.BM), xtable = TRUE, type="html")
regime | type | \(X0\) | \(H\) | \(\Theta\) | \(\Sigma\) |
---|---|---|---|---|---|
:global: | NA | \(\begin{bmatrix}{} . \\ . \\ . \\ \end{bmatrix}\) | |||
a | OU | \(\begin{bmatrix}{} 0.10 & 1.30 & 0.80 \\ -0.70 & 2.20 & 0.20 \\ 0.60 & -1.40 & 0.90 \\ \end{bmatrix}\) | \(\begin{bmatrix}{} 1.30 \\ -0.50 \\ 0.20 \\ \end{bmatrix}\) | \(\begin{bmatrix}{} 2.09 & 0.26 & 0.30 \\ 0.26 & 0.89 & -0.80 \\ 0.30 & -0.80 & 1.00 \\ \end{bmatrix}\) | |
b | BM | \(\begin{bmatrix}{} 1.80 & 0.50 & 0.12 \\ 0.50 & 0.34 & 0.15 \\ 0.12 & 0.15 & 0.09 \\ \end{bmatrix}\) |
The first functionality of the PCMBase package is to provide an easy way to simulate multiple trait data on a tree under a given (possibly multiple regime) PCM.
For this example, first we simulate a birth death tree with two parts
“a” and “b” using the ape
R-package:
# make results reproducible
set.seed(2, kind = "Mersenne-Twister", normal.kind = "Inversion")
# number of regimes
<- 2
R
# number of extant tips
<- 100
N
<- PCMTree(rtree(n=N))
tree.a PCMTreeSetLabels(tree.a)
PCMTreeSetPartRegimes(tree.a, part.regime = c(`101` = "a"), setPartition = TRUE)
<- PCMTreeListDescendants(tree.a)
lstDesc <- names(lstDesc)[which(sapply(lstDesc, length) > N/2 & sapply(lstDesc, length) < 2*N/3)][1]
splitNode
<- PCMTreeInsertSingletons(
tree.ab nodes = as.integer(splitNode),
tree.a, positions = PCMTreeGetBranchLength(tree.a, as.integer(splitNode))/2)
PCMTreeSetPartRegimes(
tree.ab,part.regime = structure(c("a", "b"), names = as.character(c(N+1, splitNode))),
setPartition = TRUE)
if(requireNamespace("ggtree")) {
<- PCMColorPalette(2, c("a", "b"))
palette
# Plot the tree with branches colored according to the regimes.
# The following code works only if the ggtree package is installed, which is not on CRAN.
# The tree would not be depicted correctly if ggtree is not installed.
<- PCMTreePlot(tree.ab)
plTree <- plTree + ggtree::geom_nodelab(size = 2)
plTree
plTree }
Now we can simulate data on the tree using the modelBM.ab$X0 as a starting value:
<- PCMSim(tree.ab, modelBM.ab, modelBM.ab$X0) traits
Calculating a model likelihood for a given tree and data is the other key functionality of the PCMBase package.
PCMLik(traits, tree.ab, modelBM.ab)
## [1] -414
## attr(,"X0")
## [1] 5 2
For faster and repeated likelihood evaluation, I recommend creating a
likelihood function for a given data, tree and model object. Passing
this function object to optim
would save the need for
pre-processing the data and tree at every likelihood evaluation.
# a function of a numerical parameter vector:
<- PCMCreateLikelihood(traits, tree.ab, modelBM.ab)
likFun
likFun(param2)
## [1] -414
## attr(,"X0")
## [1] 5 2