4. Cross-validation by mask matrix/tensor (kFoldMaskTensor)

Koki Tsuyuzaki

Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research
k.t.the-answer@hotmail.co.jp

Itoshi Nikaido

Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research

2024-05-13

Introduction

In this vignette, we introduce a cross-validation approach using mask tensor (Fujita 2018; Owen 2009) to estimate the optimal rank parameter for matrix/tensor decomposition. Mask matrix/tensor has the same size of data matrix/tensor and contains only 0 or 1 elements, 0 means not observed value, and 1 means observed value. In this approach, only non-zero and non-NA elements are randomly specified as 0 and estimated by other elements reconstructed by the result of matrix/tensor decomposition. This can be considered a cross-validation approach because the elements specified as 1 are the training dataset and the elements specified as 0 are the test dataset.

Here, we use these packages.

library("nnTensor")
library("ggplot2")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Here, we use three types of demo data as follows:

data_matrix <- toyModel("NMF")
data_matrices <- toyModel("siNMF_Easy")
data_tensor <- toyModel("CP")

NMF

To set mask matrix/tensor, here we use kFoldMaskTensor, which divides only the elements other than NA and 0 into k mask matrices/tensors. In NMF, the mask matrix can be specified as M and the rank parameter (J) with the smallest test error is considered the optimal rank. Here, three mask matrices are prepared for each rank, and the average is used to estimate the optimal rank.

out_NMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
  masks_NMF <- kFoldMaskTensor(data_matrix, k=3)
  for(j in 1:3){
    out_NMF[count, 3] <- rev(
      NMF(data_matrix,
        M = masks_NMF[[j]],
        J = i)$TestRecError)[1]
    count <- count + 1
  }
}

Looking at the average test error for each rank, the optimal rank appears to be around 8 to 10 (with some variation depending on random seeds).

ggplot(out_NMF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")

(group_by(out_NMF, rank) |>
  summarize(Avg = mean(value)) -> avg_test_error_NMF)
## # A tibble: 10 × 2
##    rank    Avg
##    <fct> <dbl>
##  1 1     4193.
##  2 2     3332.
##  3 3     2119.
##  4 4     1364.
##  5 5     1313.
##  6 6     1313.
##  7 7     1312.
##  8 8      934.
##  9 9      710.
## 10 10     732.
avg_test_error_NMF[which(avg_test_error_NMF$Avg == min(avg_test_error_NMF$Avg))[1], ]
## # A tibble: 1 × 2
##   rank    Avg
##   <fct> <dbl>
## 1 9      710.

NMTF

Same as NMF, mask matrix M can be specified in NMTF, and the rank parameter (rank) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.

out_NMTF <- expand.grid(replicate=1:3, rank2=1:10, rank1=1:10, value=0)
rank_NMTF <- paste0(out_NMTF$rank1, "-", out_NMTF$rank2)
out_NMTF <- cbind(out_NMTF, rank=factor(rank_NMTF, levels=unique(rank_NMTF)))
count <- 1
for(i in 1:10){
  for(j in 1:10){
    masks_NMTF <- kFoldMaskTensor(data_matrix, k=3)
    for(k in 1:3){
      out_NMTF[count, 4] <- rev(
        NMTF(data_matrix,
          M = masks_NMTF[[k]],
          rank = c(i, j))$TestRecError)[1]
      count <- count + 1
    }
  }
}
ggplot(out_NMTF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
(out_NMTF |>
  group_by(rank1, rank2) |>
  summarize(Avg = mean(value)) -> avg_test_error_NMTF)
avg_test_error_NMTF[which(avg_test_error_NMTF$Avg == min(avg_test_error_NMTF$Avg))[1], ]

siNMF

Same as NMF, mask matrices M can be specified in siNMF, and the rank parameter (J) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.

out_siNMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
  masks_siNMF <- lapply(1:3, function(x){
    list(
      kFoldMaskTensor(data_matrices[[1]], k=3)[[x]],
      kFoldMaskTensor(data_matrices[[2]], k=3)[[x]],
      kFoldMaskTensor(data_matrices[[3]], k=3)[[x]])
  })
  for(j in 1:3){
    out_siNMF[count, 3] <- rev(
      siNMF(data_matrices,
        M = masks_siNMF[[j]],
        J = i)$TestRecError)[1]
    count <- count + 1
  }
}
ggplot(out_siNMF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
(out_siNMF |>
  group_by(rank) |>
  summarize(Avg = mean(value)) -> avg_test_error_siNMF)
avg_test_error_siNMF[which(avg_test_error_siNMF$Avg == min(avg_test_error_siNMF$Avg))[1], ]

jNMF

Same as NMF, mask matrices M can be specified in jNMF, and the rank parameter (J) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.

out_jNMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
  masks_jNMF <- lapply(1:3, function(x){
    list(
      kFoldMaskTensor(data_matrices[[1]], k=3)[[x]],
      kFoldMaskTensor(data_matrices[[2]], k=3)[[x]],
      kFoldMaskTensor(data_matrices[[3]], k=3)[[x]])
  })
  for(j in 1:3){
    out_jNMF[count, 3] <- rev(
      jNMF(data_matrices,
        M = masks_jNMF[[j]],
        J = i)$TestRecError)[1]
    count <- count + 1
  }
}
ggplot(out_jNMF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
(out_jNMF |>
  group_by(rank) |>
  summarize(Avg = mean(value)) -> avg_test_error_jNMF)
avg_test_error_jNMF[which(avg_test_error_jNMF$Avg == min(avg_test_error_jNMF$Avg))[1], ]

NTF

Same as NMF, mask tensor M can be specified in NTF, and the rank parameter (rank) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.

out_NTF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
  masks_NTF <- kFoldMaskTensor(data_tensor, k=3)
  for(j in 1:3){
    out_NTF[count, 3] <- rev(
      NTF(data_tensor,
        M = masks_NTF[[j]],
        rank = i)$TestRecError)[1]
    count <- count + 1
  }
}
ggplot(out_NTF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
(group_by(out_NTF, rank) |>
  summarize(Avg = mean(value)) -> avg_test_error_NTF)
avg_test_error_NTF[which(avg_test_error_NTF$Avg == min(avg_test_error_NTF$Avg))[1], ]

NTD

Same as NMF, mask tensor M can be specified in NTD, and the rank parameter (rank) with the smallest test error is considered the optimal rank. The following code is time-consuming and should be executed in your own environment.

out_NTD <- expand.grid(replicate=1:3, rank3=1:5, rank2=1:5, rank1=1:5, value=0)
rank_NTD <- paste0(out_NTD$rank1, "-", out_NTD$rank2,
  "-", out_NTD$rank3)
out_NTD <- cbind(out_NTD, rank=factor(rank_NTD, levels=unique(rank_NTD)))
count <- 1
for(i in 1:5){
  for(j in 1:5){
    for(k in 1:5){
      masks_NTD <- kFoldMaskTensor(data_tensor, k=3)
      for(k in 1:3){
        out_NTD[count, 5] <- rev(
          NTD(data_tensor,
            M = masks_NTD[[k]],
            rank = c(i, j, k))$TestRecError)[1]
        count <- count + 1
      }
    }
  }
}
ggplot(out_NTD, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
(out_NTD |>
  group_by(rank1, rank2, rank3) |>
  summarize(Avg = mean(value)) -> avg_test_error_NTD)
avg_test_error_NTD[which(avg_test_error_NTD$Avg == min(avg_test_error_NTD$Avg))[1], ]

Session Information

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.4    ggplot2_3.4.2  rTensor_1.4.8  nnTensor_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.3      sass_0.4.6         utf8_1.2.3         generics_0.1.3    
##  [5] tcltk_4.3.0        digest_0.6.31      magrittr_2.0.3     evaluate_0.21     
##  [9] grid_4.3.0         RColorBrewer_1.1-3 fastmap_1.1.1      maps_3.4.1        
## [13] jsonlite_1.8.5     misc3d_0.9-1       gridExtra_2.3      fansi_1.0.4       
## [17] spam_2.9-1         viridisLite_0.4.2  scales_1.2.1       jquerylib_0.1.4   
## [21] cli_3.6.1          rlang_1.1.1        munsell_0.5.0      withr_2.5.0       
## [25] cachem_1.0.8       yaml_2.3.7         tools_4.3.0        colorspace_2.1-0  
## [29] vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.3    plot3D_1.4        
## [33] MASS_7.3-60        pkgconfig_2.0.3    pillar_1.9.0       bslib_0.5.0       
## [37] gtable_0.3.3       glue_1.6.2         Rcpp_1.0.10        fields_14.1       
## [41] xfun_0.39          tibble_3.2.1       tidyselect_1.2.1   highr_0.10        
## [45] knitr_1.43         farver_2.1.1       htmltools_0.5.5    rmarkdown_2.22    
## [49] labeling_0.4.2     tagcloud_0.6       dotCall64_1.0-2    compiler_4.3.0

References

Fujita, N. et al. 2018. “Biomarker Discovery by Integrated Joint Non-Negative Matrix Factorization and Pathway Signature Analyses.” Scientific Report 8(1): 9743.
Owen, A. B. et al. 2009. “Bi-Cross-Validation of the SVD and the Nonnegative Matrix Factorization.” The Annals of Applied Statistics 3(2): 564–94.