gemma2
is an R implementation of an EM algorithm (Zhou and Stephens 2014) to fit a multivariate linear mixed effects model. Specifically, it fits, via restricted maximum likelihood (REML), the model:
\[vec(Y) = X vec(B) + vec(G) + vec(E)\]
where \(Y\) is a n by 2 phenotypes matrix, with each row being a single subject and each column a distinct phenotype. \(X\) is a 2n by 2f matrix of genotypes, where \(n\) is the sample size and \(f\) is the number of founder lines. \(X\) has a block-diagonal structure, so that it contains, on the diagonal, two \(n\) by \(f\) blocks of genotype data. The two off-diagonal blocks are all zeros. \(B\) is a \(f\) by \(2\) matrix of founder allele effect sizes. \(G\) is a \(n\) by \(2\) matrix of polygenic random effects that follows the distribution:
\[G \sim MN_{n x 2}(0, K, V_g)\]
where \(MN\) denotes the matrix-variate normal distribution with mean being the n by 2 zero matrix, \(K\) - a kinship matrix - being the n by n covariance among rows, and \(V_g\) being the 2 by 2 covariance matrix among columns.
\(E\) is a n by 2 matrix of random errors that follows the distribution:
\[E \sim MN_{n x 2}(0, I_n, V_e)\] where \(I_n\) is the n by n identity matrix.
Additionally, we have the usual assumption that \(E\) and \(G\) are independent.
By applying the vectorization operator, “vec”, we see that:
\[vec(Y) \sim N(Xvec(B), V_g \otimes K + V_e \otimes I_n)\] where \(\otimes\) denotes the Kronecker product.
gemma2
I wrote gemma2
by translating, line by line, the C++ code for version 0.96 of GEMMA. Specifically, I drew heavily on the file mvlmm.cpp when writing the update functions for the EM algorithm.
When fitting the above multivariate linear mixed effects model, Zhou’s GEMMA uses first an EM algorithm to get reasonable starting values for input to a Newton-Raphson algorithm. The model fit is taken as the output of the Newton-Raphson algorithm. They explain that the reason for doing this is that the Newton-Raphson algorithm - with reasonable starting values - converges much faster than does the EM algorithm. I elected to not implement the Newton-Raphson algorithm. Instead, I use exclusively the EM algorithm output. For that reason, I can’t merely look at the output of Zhou’s GEMMA and compare it directly to my output from gemma2
. Zhou’s GEMMA doesn’t print the (intermediate) output of its EM algorithm.
gemma2
To compare the output of Zhou’s GEMMA with that of gemma2
, I decided to insert print statements in the C++ code of GEMMA. That way, when I run GEMMA, I get verbose output of intermediate results. That is, I can see exactly how each component is updated in the EM algorithm because I print the numerical values at each iteration of the GEMMA EM algorithm.
I can then ask the question:
for a set of starting values, does the first iteration of EM updates by GEMMA give the same values as the first iteration of EM updates by
gemma2
?
There is an additional complication. GEMMA is set up so that you can’t merely specify a chosen set of starting values for EM algorithm. Rather, it takes a data set as input. GEMMA then fit univariate LMMs to get the starting values for the variance components, \(V_g\) and \(V_e\). Note that the off-diagonal elements of \(V_g\) and \(V_e\) start at zero for GEMMA’s methods. In other words, GEMMA takes the inputted data and fits a univariate LMM for each phenotype. It then uses those fitted values for input to the EM algorithm for a multivariate LMM, before, ultimately, using a NR algorithm to get the final mvlmm fit.
Because of how I’ve added the print statements to GEMMA’s C++ code, I’ll need to look at the printed output from GEMMA to ensure that I use the same starting values for gemma2
. In other words, I can’t pre-specify starting values and use them in both GEMMA and gemma2
. Instead, I need to specify a data set, then look to see where GEMMA starts its EM algorithm for mvLMM, and use input those same starting values into gemma2
.
gemma2
with example dataFirst, we read the data into R.
readr::read_csv(system.file("extdata", "mouse100.geno.txt", package = "gemma2"), col_names = FALSE) -> geno
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> .default = col_double(),
#> X1 = col_character(),
#> X2 = col_character(),
#> X3 = col_character()
#> )
#> ℹ Use `spec()` for the full column specifications.
readr::read_tsv(system.file("extdata", "mouse100.pheno.txt", package = "gemma2"), col_names = FALSE) -> pheno
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> X1 = col_double(),
#> X2 = col_double(),
#> X3 = col_double(),
#> X4 = col_double(),
#> X5 = col_double(),
#> X6 = col_double()
#> )
readr::read_tsv(system.file("extdata", "mouse100.cXX.txt", package = "gemma2"), col_names = FALSE)[, 1:100] -> kinship
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> .default = col_double(),
#> X101 = col_logical()
#> )
#> ℹ Use `spec()` for the full column specifications.
We then isolate the genotypes for a single marker and convert it to a matrix.
# isolate first SNP
g1 <- geno[1, - c(1:3)] # first 3 columns are SNP annotations!
t(as.matrix(g1)) -> g1m
as.matrix(pheno[, c(1, 6)]) -> phe16
We load the gemma2
R package before we decompose the kinship matrix into eigenvalues and eigenvectors.
We then multiply the genotypes matrix by the eigenvectors:
We then call the function - MphEM
- that uses the EM algorithm:
MphEM(eval = e_out$values,
X = X1U,
Y = t(phe16) %*% e_out$vectors,
V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2),
V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2)
) -> foo
The output of MphEM
has a complicated structure. It’s a list of lists.
There is one list for every iteration of the EM algorithm. For each iteration of the EM algorithm, the list contains 17 components.
We can examine the code for MphEM
to understand the outputs:
MphEM
#> function (max_iter = 10000, max_prec = 1/1e+06, eval, X, Y, V_g,
#> V_e, verbose_output = FALSE)
#> {
#> n_size <- length(eval)
#> c_size <- nrow(X)
#> d_size <- nrow(Y)
#> dc_size <- c_size * d_size
#> XXt <- X %*% t(X)
#> logl_const <- -(n_size - c_size) * d_size * log(2 * pi)/2 +
#> d_size * log(det(XXt))/2
#> out <- list()
#> for (t in 1:max_iter) {
#> ep_out <- eigen_proc(V_g, V_e)
#> logdet_Ve <- ep_out[[1]]
#> UltVeh <- ep_out[[2]]
#> UltVehi <- ep_out[[3]]
#> D_l <- ep_out[[4]]
#> cq_out <- calc_qi(eval, D_l, X)
#> Qi <- cq_out[[1]]
#> lndetQ <- cq_out[[2]]
#> UltVehiY <- UltVehi %*% Y
#> xHiy <- calc_XHiY(eval, D_l, X, UltVehiY)
#> logl_new <- logl_const + MphCalcLogL(eval = eval, xHiy = xHiy,
#> D_l = D_l, UltVehiY = UltVehiY, Qi = Qi) - 0.5 *
#> n_size * logdet_Ve
#> logl_new <- logl_new - 0.5 * (lndetQ - c_size * logdet_Ve)
#> if (t > 1) {
#> if (logl_new - logl_old < max_prec) {
#> break
#> }
#> }
#> logl_old <- logl_new
#> co_out <- calc_omega(eval, D_l)
#> OmegaU <- co_out[[1]]
#> OmegaE <- co_out[[2]]
#> UltVehiB <- UpdateRL_B(xHiy, Qi, d_size = nrow(Y))
#> UltVehiBX <- UltVehiB %*% X
#> UltVehiU <- update_u(OmegaE, UltVehiY, UltVehiBX)
#> UltVehiE <- update_e(UltVehiY, UltVehiBX, UltVehiU)
#> U_hat <- t(UltVeh) %*% UltVehiU
#> E_hat <- t(UltVeh) %*% UltVehiE
#> B <- t(UltVeh) %*% UltVehiB
#> cs_out <- calc_sigma(eval = eval, D_l = D_l, X = X, OmegaU = OmegaU,
#> OmegaE = OmegaE, UltVeh = UltVeh, Qi = Qi)
#> Sigma_ee <- cs_out[[1]]
#> Sigma_uu <- cs_out[[2]]
#> uv_out <- update_v(eval, U_hat, E_hat, Sigma_uu, Sigma_ee)
#> V_g <- uv_out[[1]]
#> V_e <- uv_out[[2]]
#> if (verbose_output) {
#> out[[t]] <- list(logl_new = logl_new, Vg = V_g, Ve = V_e,
#> Sigma_uu = Sigma_uu, Sigma_ee = Sigma_ee, B = B,
#> U_hat = U_hat, E_hat = E_hat, OmegaU = OmegaU,
#> OmegaE = OmegaE, logdet_Ve = logdet_Ve, UltVeh = UltVeh,
#> UltVehi = UltVehi, Dl = D_l, xHiy = xHiy, logl_const = logl_const,
#> UltVehiU = UltVehiU)
#> }
#> else {
#> out[[1]] <- list(logl_new = logl_new, Vg = V_g, Ve = V_e,
#> Sigma_uu = Sigma_uu, Sigma_ee = Sigma_ee, B = B,
#> U_hat = U_hat, E_hat = E_hat, OmegaU = OmegaU,
#> OmegaE = OmegaE, logdet_Ve = logdet_Ve, UltVeh = UltVeh,
#> UltVehi = UltVehi, Dl = D_l, xHiy = xHiy, logl_const = logl_const,
#> UltVehiU = UltVehiU)
#> }
#> }
#> if (length(out) == max_iter) {
#> warning("EM algorithm didn't converge.")
#> }
#> return(out)
#> }
#> <bytecode: 0x55b48f936ff0>
#> <environment: namespace:gemma2>
We see that the first component is the log-likelihood.
We can verify that the log-likelihood uniformly increases (over EM iterations) with the following code:
gemma2
We can run GEMMA (v0.96) with a collection of data in text files that I distribute with this package. They are a subset of the data that come with GEMMA.
According to GEMMA v0.96, we should first be updating Vg from its initial value: \[\left(\begin{array}{cc} 1.91352 & 0\\ 0 & 0.530827 \end{array}\right)\]
to
\[\left(\begin{array}{cc} 1.91352 & 0.0700492\\ 0.0700492 & 0.530827 \end{array}\right)\]
in the first EM iteration.
We then ask,
when we give
gemma2::MphEM()
the same starting values and same data, do we see that the first update of \(V_g\) coincides with that from GEMMA (above)?
e_out <- eigen2(as.matrix(kinship))
center_kinship(as.matrix(kinship)) -> kinship_centered
ec_out <- eigen2(kinship_centered)
## first two
MphEM(eval = ec_out$values,
X = t(g1m) %*% ec_out$vectors,
Y = t(phe16) %*% ec_out$vectors,
V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2),
V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2)
) -> bar00
MphEM(eval = e_out$values,
X = t(g1m) %*% e_out$vectors,
Y = t(phe16) %*% e_out$vectors,
V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2),
V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2)
) -> bar10
## second two
MphEM(eval = ec_out$values,
X = t(cbind(1, g1m)) %*% ec_out$vectors,
Y = t(phe16) %*% ec_out$vectors,
V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2),
V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2)
) -> bar01
MphEM(eval = e_out$values,
X = t(cbind(1, g1m)) %*% e_out$vectors,
Y = t(phe16) %*% e_out$vectors,
V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2),
V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2)
) -> bar11
We see that the centering of the kinship matrix doesn’t change the results here. Look also at the convergence points:
bar00[[length(bar00)]][[2]]
#> [,1] [,2]
#> [1,] 2.8112632 -0.4202874
#> [2,] -0.4202874 0.5839737
bar01[[length(bar01)]][[2]]
#> [,1] [,2]
#> [1,] 2.6853565 -0.3978194
#> [2,] -0.3978194 0.7337082
bar10[[length(bar10)]][[2]]
#> [,1] [,2]
#> [1,] 2.8112632 -0.4202874
#> [2,] -0.4202874 0.5839737
bar11[[length(bar11)]][[2]]
#> [,1] [,2]
#> [1,] 2.6853565 -0.3978194
#> [2,] -0.3978194 0.7337082
I see now that the X, as inputted to MphEM is printed 3 times in each of my output files. I’m not sure why this is.
t(cbind(1, g1m)) %*% e_out$vectors
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] -10.0 8.209571e-10 3.423880e-10 8.799904e-10 3.741043e-10 5.193286e-10
#> [2,] -8.2 5.214029e-01 1.593502e-01 2.047141e-01 -2.119088e-01 1.434733e-01
#> [,7] [,8] [,9] [,10] [,11]
#> [1,] -9.694780e-10 -3.196897e-10 6.163396e-10 5.822616e-10 3.875483e-10
#> [2,] -2.042852e-01 -1.109870e-01 6.902534e-02 -2.025183e-01 -1.264110e-01
#> [,12] [,13] [,14] [,15] [,16]
#> [1,] -5.354989e-10 4.053215e-10 -1.139380e-09 -1.024021e-09 -2.905492e-10
#> [2,] 3.671612e-01 -3.176738e-01 -2.499061e-01 -4.861219e-01 5.697363e-02
#> [,17] [,18] [,19] [,20] [,21]
#> [1,] -4.402925e-10 3.027143e-10 4.031493e-10 -2.617206e-10 9.958686e-10
#> [2,] 1.413093e-01 4.235653e-01 -9.595364e-02 2.091495e-01 5.928482e-02
#> [,22] [,23] [,24] [,25] [,26]
#> [1,] -4.930074e-10 -1.710612e-10 -3.388599e-10 -7.418881e-10 5.581811e-10
#> [2,] 5.366786e-01 -2.983362e-01 -3.451517e-01 2.110054e-01 1.807756e-01
#> [,27] [,28] [,29] [,30] [,31]
#> [1,] -9.200053e-10 -9.621139e-11 4.626981e-10 -6.783181e-10 6.249647e-11
#> [2,] 2.627168e-01 1.685653e-01 7.195078e-01 4.955564e-02 1.142125e-01
#> [,32] [,33] [,34] [,35] [,36]
#> [1,] -2.818461e-10 4.676824e-10 2.764171e-10 1.328437e-10 5.698039e-10
#> [2,] -9.182112e-01 -1.214265e+00 -1.380134e-01 -5.150178e-01 -5.743965e-01
#> [,37] [,38] [,39] [,40] [,41]
#> [1,] 8.583911e-11 -2.430915e-10 -5.453552e-10 1.853685e-10 2.150999e-10
#> [2,] 3.627758e-01 1.455956e-01 -3.543663e-01 4.206109e-01 -1.125620e+00
#> [,42] [,43] [,44] [,45] [,46]
#> [1,] 2.055196e-10 3.165112e-10 4.230766e-11 2.304351e-10 -6.131784e-11
#> [2,] 4.366374e-03 4.839497e-01 5.627990e-01 1.881103e-02 1.534764e-03
#> [,47] [,48] [,49] [,50] [,51]
#> [1,] 1.198367e-10 4.710556e-10 5.443409e-10 2.393824e-10 -8.476331e-12
#> [2,] -9.599920e-02 2.913756e-01 -1.670098e-01 -3.429212e-01 9.796257e-02
#> [,52] [,53] [,54] [,55] [,56]
#> [1,] 1.665049e-10 -2.681996e-11 -2.718945e-10 -3.359916e-10 2.576899e-10
#> [2,] 8.098477e-01 -1.309711e+00 -5.100377e-02 -4.045755e-01 -4.973091e-01
#> [,57] [,58] [,59] [,60] [,61]
#> [1,] -2.166084e-11 2.015501e-10 -8.589462e-12 3.397737e-10 1.629878e-10
#> [2,] 6.244735e-01 2.753963e-01 3.056504e-01 1.027420e+00 5.171210e-01
#> [,62] [,63] [,64] [,65] [,66]
#> [1,] -1.125520e-10 -9.604466e-11 -1.782041e-10 6.516876e-11 -2.114665e-10
#> [2,] -9.679433e-01 9.252804e-01 -2.548765e-01 -6.264708e-01 -8.376359e-01
#> [,67] [,68] [,69] [,70] [,71]
#> [1,] -4.046121e-10 1.220681e-10 1.596400e-10 -4.175104e-10 -2.283430e-10
#> [2,] -6.859459e-01 -8.372176e-01 8.521224e-01 1.009573e-02 2.286194e-01
#> [,72] [,73] [,74] [,75] [,76]
#> [1,] -3.944047e-10 -1.824441e-10 -2.485945e-10 2.511755e-11 -2.587797e-11
#> [2,] -1.347910e+00 5.240299e-01 5.949179e-01 4.958164e-01 1.359179e+00
#> [,77] [,78] [,79] [,80] [,81]
#> [1,] 1.795558e-10 4.519914e-12 2.866099e-11 -2.266703e-10 -9.518052e-11
#> [2,] -1.211559e-01 6.212741e-01 7.158448e-01 -9.853067e-01 1.570249e-01
#> [,82] [,83] [,84] [,85] [,86]
#> [1,] -7.734569e-11 -4.454652e-11 -5.285478e-11 -1.012893e-10 3.718089e-11
#> [2,] 7.387338e-01 1.681314e-02 -1.937415e+00 1.512452e-01 8.376756e-01
#> [,87] [,88] [,89] [,90] [,91]
#> [1,] -1.109072e-10 1.729715e-10 -4.520380e-11 2.238981e-10 -3.103531e-12
#> [2,] 8.955207e-01 -1.930832e+00 -8.796979e-01 2.219885e-01 5.667558e-01
#> [,92] [,93] [,94] [,95] [,96]
#> [1,] 1.040781e-10 -7.791080e-11 2.907737e-11 1.831045e-11 -7.721866e-11
#> [2,] 1.251593e-01 -9.984738e-01 6.009364e-02 -1.296282e-01 -1.681463e+00
#> [,97] [,98] [,99] [,100]
#> [1,] -8.969529e-11 -2.186182e-11 -3.175127e-11 2.421263e-11
#> [2,] 2.364884e-01 3.328094e-01 -5.143277e-01 -1.334516e+00
I see that the third instance of “X,” - in the file of GEMMA output - specifies that X has two rows, while the first two seem to say that X has only one row. Furthermore, it seems that the third instance of “X,” has the same entries that I get when calculating X %*% U
- namely, the first column is -10 and -8.2, the remaining entries in the first row are near zero, and the remaining entries in the second row begin with 0.52.
I see now that I misread the output file from GEMMA. Within it, there are actually 3 calls (ie, per SNP) to MphEM. It’s the third call that we want to reproduce.
From the above output, we see that the initial values, for the third call to MphEM() are:
\[V_g = \left(\begin{array}{cc} 2.73 & -0.39 \\ -0.39 & 0.74 \end{array} \right)\]
and
\[V_e = \left(\begin{array}{cc} 0.15 & 0.27 \\ 0.27 & 0.50 \end{array} \right)\]
and, after a single update, the matrices are:
\[V_g = \left(\begin{array}{cc} 2.74 & -0.40 \\ -0.40 & 0.74 \end{array} \right)\]
and
\[V_e = \left(\begin{array}{cc} 0.15 & 0.27 \\ 0.27 & 0.51 \end{array} \right)\]
Can we get our gemma2
R code to do this?
MphEM(eval = e_out$values,
X = t(cbind(1, g1m)) %*% e_out$vectors,
Y = t(phe16) %*% e_out$vectors,
V_g = matrix(c(2.73, - 0.39, - 0.39, 0.74), nrow = 2),
V_e = matrix(c(0.15, 0.27, 0.27, 0.50), nrow = 2)
) -> out1
out1[[1]][c(2:3)]
#> $Vg
#> [,1] [,2]
#> [1,] 2.706773 -0.4119620
#> [2,] -0.411962 0.7428087
#>
#> $Ve
#> [,1] [,2]
#> [1,] 0.1531700 0.2761671
#> [2,] 0.2761671 0.5119295
Now try with ec_out
MphEM(eval = ec_out$values,
X = t(cbind(1, g1m)) %*% ec_out$vectors,
Y = t(phe16) %*% ec_out$vectors,
V_g = matrix(c(2.731324, - 0.394865, - 0.394865, 0.737116), nrow = 2),
V_e = matrix(c(0.147467, 0.272090, 0.272090, 0.502031), nrow = 2)
) -> out1
out1[[1]][c(2:3)]
#> $Vg
#> [,1] [,2]
#> [1,] 2.7372540 -0.4041481
#> [2,] -0.4041481 0.7434534
#>
#> $Ve
#> [,1] [,2]
#> [1,] 0.1495667 0.2759642
#> [2,] 0.2759642 0.5091793
The results, above, are accurate to within rounding.
Zhou, Xiang, and Matthew Stephens. 2014. “Efficient Multivariate Linear Mixed Model Algorithms for Genome-Wide Association Studies.” Nature Methods 11 (4): 407–9.