This vignette illustrates the standard use of the PLNLDA
function and the methods accompanying the R6 Classes PLNLDA
and PLNLDAfit
.
The packages required for the analysis are PLNmodels plus some others for data manipulation and representation:
We illustrate our point with the trichoptera data set, a full description of which can be found in the corresponding vignette. Data preparation is also detailed in the specific vignette.
The trichoptera
data frame stores a matrix of counts
(trichoptera$Abundance
), a matrix of offsets
(trichoptera$Offset
) and some vectors of covariates
(trichoptera$Wind
, trichoptera$Temperature
,
etc.) In the following, we’re particularly interested in the
trichoptera$Group
discrete covariate which
corresponds to disjoint time spans during which the catching took place.
The correspondence between group label and time spans is:
Label | Number.of.Consecutive.Nights | Date |
---|---|---|
1 | 12 | June 59 |
2 | 5 | June 59 |
3 | 5 | June 59 |
4 | 4 | June 59 |
5 | 4 | July 59 |
6 | 1 | June 59 |
7 | 3 | June 60 |
8 | 4 | June 60 |
9 | 5 | June 60 |
10 | 4 | June 60 |
11 | 1 | June 60 |
12 | 1 | July 60 |
In the vein of Fisher (1936) and Rao (1948), we introduce a multi-class LDA model for multivariate count data which is a variant of the Poisson Lognormal model of Aitchison and Ho (1989) (see the PLN vignette as a reminder). Indeed, it can viewed as a PLN model with a discrete group structure in the latent Gaussian space.
This PLN-LDA model can be written in a hierarchical framework where a sample of \(p\)-dimensional observation vectors \(\mathbf{Y}_i\) is related to some \(p\)-dimensional vectors of latent variables \(\mathbf{Z}_i\) and a discrete structure with \(K\) groups in the following way: \[\begin{equation} \begin{array}{rcl} \text{group structure } & \mathbf{\mu}_i = \mu_{g_i} & g_i \in \{1, \dots, K\} \\ \text{latent space } & \mathbf{Z}_i \quad \text{indep.} & \mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i, \boldsymbol{\Sigma}) & \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right) \end{array} \end{equation}\] where \(g_i\) denotes the group sample \(i\) belongs to.
The different parameters \({\boldsymbol\mu}_k \in\mathbb{R}^p\) corresponds to the group-specific main effects and the variance matrix \(\boldsymbol{\Sigma}\) is shared among groups. An equivalent way of writing this model is the following: \[\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i,\boldsymbol\Sigma) & \boldsymbol{\mu}_i = \mathbf{g}_i^\top \mathbf{M} \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right), \end{array} \end{equation}\] where, with a slight abuse of notation, \(\mathbf{g}_i\) is a group-indicator vector of length \(K\) (\(g_{ik} = 1 \Leftrightarrow g_i = k\)) and \(\mathbf{M} = [\boldsymbol{\mu}_1^\top, \dots, \boldsymbol{\mu}_K^\top]^\top\) is a \(K \times p\) matrix collecting the group-specific main effects.
Just like PLN, PLN-LDA generalizes to a formulation close to a multivariate generalized linear model where the main effect is due to a linear combination of the discrete group structure, \(d\) covariates \(\mathbf{x}_i\) and a vector \(\mathbf{o}_i\) of \(p\) offsets in sample \(i\). The latent layer then reads \[\begin{equation} \mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{g}_i^\top \mathbf{M} + \mathbf{x}_i^\top\mathbf{B}},\boldsymbol\Sigma) \end{equation}\] where \(\mathbf{B}\) is a \(d\times p\) matrix of regression parameters.
Given:
We can predict the observation’s group using Bayes rule as follows: for \(k \in {1, \dots, K}\), compute \[\begin{equation} \begin{aligned} f_k(\mathbf{Y}) & = p(\mathbf{Y} | \mathbf{g} = k, \mathbf{o}, \mathbf{x}, \hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}) \\ & = \boldsymbol{\Phi}_{PLN}(\mathbf{Y}; \mathbf{o} + \boldsymbol{\mu}_k + \mathbf{x}^\top \hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}) \\ p_k & = \frac{n_k}{\sum_{k' = 1}^K n_{k'}} \end{aligned} \end{equation}\] where \(\boldsymbol{\Phi}_{PLN}(\bullet; \boldsymbol{\mu}, \boldsymbol{\Sigma})\) is the density function of a PLN distribution with parameters \((\boldsymbol{\mu}, \boldsymbol{\Sigma})\). \(f_k(\mathbf{Y})\) and \(p_k\) are respectively plug-in estimates of (i) the probability of observing counts \(\mathbf{Y}\) in a sample from group \(k\) and (ii) the probability that a sample originates from group \(k\).
The posterior probability \(\hat{\pi}_k(\mathbf{Y})\) that observation \(\mathbf{Y}\) belongs to group \(k\) and most likely group \(\hat{k}(\mathbf{Y})\) can thus be defined as \[\begin{equation} \begin{aligned} \hat{\pi}_k(\mathbf{Y}) & = \frac{p_k f_k(\mathbf{Y})}{\sum_{k' = 1}^K p_{k'} f_{k'}(\mathbf{Y})} \\ \hat{k}(\mathbf{Y}) & = \underset{k \in \{1, \dots, K\}}{\arg\max} \hat{\pi}_k(\mathbf{Y}) \end{aligned} \end{equation}\]
Classification and prediction are the main objectives in (PLN-)LDA. To reach this goal, we first need to estimate the model parameters. Inference in PLN-LDA focuses on the group-specific main effects \(\mathbf{M}\), the regression parameters \(\mathbf{B}\) and the covariance matrix \(\boldsymbol\Sigma\). Technically speaking, we can treat \(\mathbf{g}_i\) as a discrete covariate and estimate \([\mathbf{M}, \mathbf{B}]\) using the same strategy as for the standard PLN model. Briefly, we adopt a variational strategy to approximate the log-likelihood function and optimize the consecutive variational surrogate of the log-likelihood with a gradient-ascent-based approach. To this end, we rely on the CCSA algorithm of Svanberg (2002) implemented in the C++ library (Johnson 2011), which we link to the package.
In the package, the PLN-LDA model is adjusted with the function
PLNLDA
, which we review in this section. This function
adjusts the model and stores it in a object of class
PLNLDAfit
which inherits from the class
PLNfit
, so we strongly recommend the reader to be somehow
comfortable with PLN
and PLNfit
before using
PLNLDA
(see the PLN vignette).
We start by adjusting the above model to the Trichoptera data set. We
use Group
, the catching time spans, as a discrete structure
and use log as an offset to capture differences in sampling luck.
The model can be fitted with the function PLNLDA
as
follows:
##
## Performing discriminant Analysis...
## DONE!
Note that PLNLDA
uses the standard formula
interface, like every other model in the PLNmodels
package.
PLNLDAfit
The myLDA_nocov
variable is an R6
object
with class PLNLDAfit
, which comes with a couple of methods.
The most basic is the show/print
method, which sends a
brief summary of the estimation process and available methods:
## Linear Discriminant Analysis for Poisson Lognormal distribution
## ==================================================================
## nb_param loglik BIC ICL
## 357 -799.824 -1494.514 -1082.492
## ==================================================================
## * Useful fields
## $model_par, $latent, $latent_pos, $var_par, $optim_par
## $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
## print(), coef(), sigma(), vcov(), fitted()
## predict(), predict_cond(), standard_error()
## * Additional fields for LDA
## $percent_var, $corr_map, $scores, $group_means
## * Additional S3 methods for LDA
## plot.PLNLDAfit(), predict.PLNLDAfit()
Comprehensive information about PLNLDAfit
is available
via ?PLNLDAfit
.
The user can easily access several fields of the
PLNLDAfit
object using S3
methods, the most
interesting ones are
NULL
as there are no covariates)## NULL
grouping1 | grouping2 | grouping3 | grouping4 | grouping5 | grouping6 | grouping7 | grouping8 | grouping9 | grouping10 | grouping11 | grouping12 |
---|---|---|---|---|---|---|---|---|---|---|---|
-26.99 | -6.82 | -26.56 | -25.38 | -28.34 | -28.92 | -27.23 | -26.37 | -26.45 | -5.64 | -3.46 | -25.37 |
-26.98 | -27.72 | -5.67 | -25.37 | -7.44 | -28.90 | -27.22 | -26.36 | -26.43 | -26.50 | -24.34 | -4.48 |
-2.38 | -3.96 | -2.36 | -2.92 | -6.51 | -5.54 | -2.70 | -2.03 | -2.71 | -2.65 | -24.44 | -3.85 |
-26.99 | -27.74 | -5.69 | -4.52 | -28.28 | -28.96 | -27.25 | -5.51 | -5.59 | -4.91 | -24.39 | -25.41 |
-0.28 | -0.26 | -0.62 | -1.09 | -0.61 | -0.11 | -0.66 | -0.80 | -0.39 | -0.45 | -0.61 | -1.16 |
-4.00 | -3.32 | -2.93 | -4.47 | -6.72 | -8.00 | -2.85 | -3.06 | -5.53 | -3.99 | -24.32 | -25.34 |
The PLNLDAfit
class also benefits from two important
methods: plot
and predict
.
plot
methodThe plot
methods provides easy to interpret graphics
which reveals here that the groups are well separated:
By default, plot
shows the first 3 axis of the LDA when
there are 4 or more groups and uses special representations for the edge
cases of 3 or less groups.
ggplot2
-savvy users who want to make their own
representations can extracts the \(n \times
(K-1)\) matrix of sample scores from the PLNLDAfit
object …
LD1 | LD2 | LD3 | LD4 | LD5 | LD6 | LD7 | LD8 | LD9 | LD10 | LD11 |
---|---|---|---|---|---|---|---|---|---|---|
4323755 | -258930.5 | -86710.46 | -95694.11 | -28601.92 | 15856.26 | 2050.06 | 2377.53 | 160.19 | 51.09 | -165.24 |
4323566 | -263460.8 | -86587.85 | -96223.82 | -28408.38 | 15864.20 | 2072.05 | 2435.32 | 153.66 | 53.12 | -166.51 |
4323404 | -305794.6 | -85065.32 | -99745.12 | -26724.08 | 15650.09 | 1846.05 | 2387.27 | 183.75 | 57.62 | -159.26 |
4322958 | -353775.7 | -83386.55 | -103532.26 | -24844.54 | 15344.36 | 1476.14 | 2140.68 | 230.84 | 60.06 | -149.95 |
4323121 | -326011.6 | -84396.06 | -101269.07 | -25953.61 | 15490.85 | 1644.05 | 2197.54 | 208.88 | 58.04 | -154.87 |
4323379 | -290829.9 | -85633.04 | -98443.34 | -27331.31 | 15700.71 | 1882.34 | 2328.87 | 178.17 | 55.01 | -161.42 |
…or the \(p \times (K-1)\) matrix of correlations between scores and (latent) variables
LD1 | LD2 | LD3 | LD4 | LD5 | LD6 | LD7 | LD8 | LD9 | LD10 | LD11 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Che | -0.28 | 0.23 | 0.60 | 0.22 | 0.59 | -0.40 | 0.15 | 0.39 | 0.84 | -0.50 | 0.02 |
Hyc | -0.08 | -0.44 | 0.02 | 0.39 | 0.03 | -0.40 | -0.65 | -0.94 | -0.05 | -0.17 | 0.57 |
Hym | 0.07 | -0.19 | -0.21 | -0.76 | -0.33 | -0.15 | 0.34 | 0.01 | -0.32 | 0.03 | -0.60 |
Hys | -0.51 | 0.19 | -0.49 | -0.05 | -0.30 | -0.49 | 0.33 | -0.01 | -0.30 | -0.07 | 0.25 |
Psy | 0.28 | 0.23 | 0.23 | -0.28 | 0.40 | 0.20 | 0.23 | 0.08 | 0.49 | -0.58 | -0.37 |
Aga | 0.10 | -0.15 | -0.24 | -0.89 | -0.05 | -0.21 | 0.15 | 0.15 | -0.12 | -0.20 | -0.52 |
predict
methodThe predict
method has a slightly different behavior
than its siblings in other models of the PLNmodels. The
goal of predict
is to predict the discrete class based on
observed species counts (rather than predicting counts from
known covariates).
By default, the predict
use the argument
type = "posterior"
to output the matrix of log-posterior
probabilities \(\log(\hat{\pi})_k\)
predicted.class <- predict(myLDA_nocov, newdata = trichoptera)
## equivalent to
## predicted.class <- predict(myLDA_nocov, newdata = trichoptera, type = "posterior")
predicted.class %>% head() %>% knitr::kable(digits = 2)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
-17.45 | -46.50 | -43.54 | -25.48 | -54.71 | -32.73 | -25.92 | -87.53 | -84.59 | -20.60 | -207.84 | -74.88 |
-9.67 | -17.04 | -59.96 | -21.33 | -72.21 | -27.29 | -14.70 | -15.58 | -15.21 | -14.70 | -121.96 | -66.44 |
-12.94 | -22.57 | -115.18 | -39.94 | -120.03 | -30.78 | -22.54 | -50.11 | -44.51 | -22.94 | -149.15 | -120.62 |
-18.43 | -31.03 | -122.03 | -115.41 | -144.04 | -54.03 | -41.92 | -114.68 | -103.31 | -44.66 | -313.94 | -227.41 |
-13.84 | -20.01 | -52.73 | -60.79 | -72.69 | -38.18 | -26.89 | -51.84 | -45.29 | -25.88 | -196.75 | -115.64 |
-9.25 | -13.49 | -37.24 | -24.40 | -48.20 | -23.73 | -14.62 | -15.91 | -15.30 | -14.32 | -101.14 | -67.77 |
You can also show them in the standard (and human-friendly) \([0, 1]\) scale with
scale = "prob"
to get the matrix \(\hat{\pi}_k\)
predicted.class <- predict(myLDA_nocov, newdata = trichoptera, scale = "prob")
predicted.class %>% head() %>% knitr::kable(digits = 3)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
0.959 | 0.000 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.041 | 0 | 0 |
0.980 | 0.001 | 0 | 0 | 0 | 0 | 0.006 | 0.003 | 0.004 | 0.006 | 0 | 0 |
1.000 | 0.000 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 |
1.000 | 0.000 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 |
0.998 | 0.002 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 |
0.972 | 0.014 | 0 | 0 | 0 | 0 | 0.005 | 0.001 | 0.002 | 0.006 | 0 | 0 |
Setting type = "response"
, we can predict the most
likely group \(\hat{k}\) instead:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 1 1 1 1 1 9 2 1 1 1 1 2 3 2 2 2 3 3 3 9 3 4 1 4 4
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 12 5 4 5 6 7 7 7 8 8 8 8 8 1 9 9 9 10 10 10 10 11 12
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12
We can assess that the predictions are quite similar to the real group (this is not a proper validation of the method as we used data set for both model fitting and prediction and are thus at risk of overfitting).
## true
## predicted 1 2 3 4 5 6 7 8 9 10 11 12
## 1 10 0 0 1 0 0 0 0 1 0 0 0
## 2 1 4 0 0 0 0 0 0 0 0 0 0
## 3 0 1 4 0 0 0 0 0 0 0 0 0
## 4 0 0 0 3 1 0 0 0 0 0 0 0
## 5 0 0 0 0 2 0 0 0 0 0 0 0
## 6 0 0 0 0 0 1 0 0 0 0 0 0
## 7 0 0 0 0 0 0 3 0 0 0 0 0
## 8 0 0 0 0 0 0 0 4 1 0 0 0
## 9 1 0 1 0 0 0 0 0 3 0 0 0
## 10 0 0 0 0 0 0 0 0 0 4 0 0
## 11 0 0 0 0 0 0 0 0 0 0 1 0
## 12 0 0 0 0 1 0 0 0 0 0 0 1
Finally, we can get the coordinates of the new data on the same graph
at the original ones with type = "scores"
. This is done by
averaging the latent positions \(\hat{\mathbf{Z}}_i + \boldsymbol{\mu}_k\)
(found when the sample is assumed to come from group \(k\)) and weighting them with the \(\hat{\pi}_k\). Some samples, have
compositions that put them very far from their group mean.
library(ggplot2)
predicted.scores <- predict(myLDA_nocov, newdata = trichoptera, type = "scores")
colnames(predicted.scores) <- paste0("Axis.", 1:ncol(predicted.scores))
predicted.scores <- as.data.frame(predicted.scores)
predicted.scores$group <- trichoptera$Group
plot(myLDA_nocov, map = "individual", nb_axes = 2, plot = FALSE) +
geom_point(data = predicted.scores,
aes(x = Axis.1, y = Axis.2, color = group, label = NULL))
It is possible to correct for other covariates before finding the LDA
axes that best separate well the groups. In our case ,we’re going to use
Wind
as a covariate and illustrate the main differences
with before :
myLDA_cov <- PLNLDA(Abundance ~ Wind + 0 + offset(log(Offset)),
grouping = Group,
data = trichoptera)
##
## Performing discriminant Analysis...
## DONE!
All fields of our new PLNLDA
fit can be accessed as
before with similar results. The only important difference is the result
of coef
: since we included a covariate in the model,
coef
now returns a 1-column matrix for \(\hat{\mathbf{B}}\) instead of
NULL
Wind | |
---|---|
Che | -0.3301236 |
Hyc | 1.4197887 |
Hym | -0.1927953 |
Hys | -0.4544197 |
Psy | 0.0365670 |
Aga | -0.0611589 |
The group-specific main effects can still be accessed with
$group_means
grouping1 | grouping2 | grouping3 | grouping4 | grouping5 | grouping6 | grouping7 | grouping8 | grouping9 | grouping10 | grouping11 | grouping12 |
---|---|---|---|---|---|---|---|---|---|---|---|
-20.37 | -10.84 | -24.28 | -19.17 | -25.82 | -43.78 | -29.97 | -28.34 | -19.42 | 4.58 | 11.73 | -24.00 |
-23.96 | -43.19 | -1.38 | -16.70 | -10.50 | -48.18 | -33.36 | -30.16 | -15.34 | -17.79 | -6.64 | 1.88 |
-1.07 | -4.47 | -1.70 | -1.68 | -5.75 | -6.99 | -2.65 | -1.82 | -1.45 | -1.44 | -25.75 | -3.26 |
-24.76 | -33.10 | -2.80 | 0.42 | -27.47 | -38.28 | -30.20 | -5.07 | -0.26 | 0.33 | -19.55 | -26.16 |
-0.05 | -0.42 | -0.47 | -0.85 | -0.44 | -0.43 | -0.59 | -0.71 | -0.07 | -0.15 | -0.19 | -1.03 |
-1.73 | -4.14 | -1.67 | -2.32 | -5.45 | -10.63 | -2.71 | -2.73 | -3.21 | -1.66 | -24.12 | -27.53 |
plot
methodOnce again, the plot
method is very useful to get a
quick look at the results.
predict
methodWe can again predict the most likely group for each sample :
and check that we recover the correct class in most cases (again, we used the same data set for model fitting and group prediction only for ease of exposition):
## true
## predicted 1 2 3 4 5 6 7 8 9 10 11 12
## 1 11 0 0 1 0 0 0 0 1 0 0 0
## 2 1 4 0 0 0 0 0 0 0 0 0 0
## 3 0 1 4 0 0 0 0 0 0 0 0 0
## 4 0 0 0 3 1 0 0 0 0 0 0 0
## 5 0 0 0 0 2 0 0 0 0 0 0 0
## 6 0 0 0 0 0 1 0 0 0 0 0 0
## 7 0 0 0 0 0 0 3 0 0 0 0 0
## 8 0 0 0 0 0 0 0 4 1 0 0 0
## 9 0 0 1 0 0 0 0 0 3 0 0 0
## 10 0 0 0 0 0 0 0 0 0 4 0 0
## 11 0 0 0 0 0 0 0 0 0 0 1 0
## 12 0 0 0 0 1 0 0 0 0 0 0 1