kFoldMaskTensor
)In this vignette, we introduce a cross-validation approach using mask tensor (Fujita 2018; Owen 2009) to estimate the optimal rank parameter for matrix/tensor decomposition. Mask matrix/tensor has the same size of data matrix/tensor and contains only 0 or 1 elements, 0 means not observed value, and 1 means observed value. In this approach, only non-zero and non-NA elements are randomly specified as 0 and estimated by other elements reconstructed by the result of matrix/tensor decomposition. This can be considered a cross-validation approach because the elements specified as 1 are the training dataset and the elements specified as 0 are the test dataset.
Here, we use these packages.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Here, we use three types of demo data as follows:
To set mask matrix/tensor, here we use kFoldMaskTensor
,
which divides only the elements other than NA and 0 into k
mask matrices/tensors. In NMF
, the mask matrix can be
specified as M
and the rank parameter (J
) with
the smallest test error is considered the optimal rank. Here, three mask
matrices are prepared for each rank, and the average is used to estimate
the optimal rank.
out_NMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
masks_NMF <- kFoldMaskTensor(data_matrix, k=3)
for(j in 1:3){
out_NMF[count, 3] <- rev(
NMF(data_matrix,
M = masks_NMF[[j]],
J = i)$TestRecError)[1]
count <- count + 1
}
}
Looking at the average test error for each rank, the optimal rank appears to be around 8 to 10 (with some variation depending on random seeds).
ggplot(out_NMF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
## # A tibble: 10 × 2
## rank Avg
## <fct> <dbl>
## 1 1 4193.
## 2 2 3332.
## 3 3 2119.
## 4 4 1364.
## 5 5 1313.
## 6 6 1313.
## 7 7 1312.
## 8 8 934.
## 9 9 710.
## 10 10 732.
## # A tibble: 1 × 2
## rank Avg
## <fct> <dbl>
## 1 9 710.
Same as NMF
, mask matrix M
can be specified
in NMTF
, and the rank parameter (rank
) with
the smallest test error is considered the optimal rank. The following
code is time-consuming and should be executed in your own
environment.
out_NMTF <- expand.grid(replicate=1:3, rank2=1:10, rank1=1:10, value=0)
rank_NMTF <- paste0(out_NMTF$rank1, "-", out_NMTF$rank2)
out_NMTF <- cbind(out_NMTF, rank=factor(rank_NMTF, levels=unique(rank_NMTF)))
count <- 1
for(i in 1:10){
for(j in 1:10){
masks_NMTF <- kFoldMaskTensor(data_matrix, k=3)
for(k in 1:3){
out_NMTF[count, 4] <- rev(
NMTF(data_matrix,
M = masks_NMTF[[k]],
rank = c(i, j))$TestRecError)[1]
count <- count + 1
}
}
}
ggplot(out_NMTF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Same as NMF
, mask matrices M
can be
specified in siNMF
, and the rank parameter (J
)
with the smallest test error is considered the optimal rank. The
following code is time-consuming and should be executed in your own
environment.
out_siNMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
masks_siNMF <- lapply(1:3, function(x){
list(
kFoldMaskTensor(data_matrices[[1]], k=3)[[x]],
kFoldMaskTensor(data_matrices[[2]], k=3)[[x]],
kFoldMaskTensor(data_matrices[[3]], k=3)[[x]])
})
for(j in 1:3){
out_siNMF[count, 3] <- rev(
siNMF(data_matrices,
M = masks_siNMF[[j]],
J = i)$TestRecError)[1]
count <- count + 1
}
}
Same as NMF
, mask matrices M
can be
specified in jNMF
, and the rank parameter (J
)
with the smallest test error is considered the optimal rank. The
following code is time-consuming and should be executed in your own
environment.
out_jNMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
masks_jNMF <- lapply(1:3, function(x){
list(
kFoldMaskTensor(data_matrices[[1]], k=3)[[x]],
kFoldMaskTensor(data_matrices[[2]], k=3)[[x]],
kFoldMaskTensor(data_matrices[[3]], k=3)[[x]])
})
for(j in 1:3){
out_jNMF[count, 3] <- rev(
jNMF(data_matrices,
M = masks_jNMF[[j]],
J = i)$TestRecError)[1]
count <- count + 1
}
}
Same as NMF
, mask tensor M
can be specified
in NTF
, and the rank parameter (rank
) with the
smallest test error is considered the optimal rank. The following code
is time-consuming and should be executed in your own environment.
out_NTF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
masks_NTF <- kFoldMaskTensor(data_tensor, k=3)
for(j in 1:3){
out_NTF[count, 3] <- rev(
NTF(data_tensor,
M = masks_NTF[[j]],
rank = i)$TestRecError)[1]
count <- count + 1
}
}
Same as NMF
, mask tensor M
can be specified
in NTD
, and the rank parameter (rank
) with the
smallest test error is considered the optimal rank. The following code
is time-consuming and should be executed in your own environment.
out_NTD <- expand.grid(replicate=1:3, rank3=1:5, rank2=1:5, rank1=1:5, value=0)
rank_NTD <- paste0(out_NTD$rank1, "-", out_NTD$rank2,
"-", out_NTD$rank3)
out_NTD <- cbind(out_NTD, rank=factor(rank_NTD, levels=unique(rank_NTD)))
count <- 1
for(i in 1:5){
for(j in 1:5){
for(k in 1:5){
masks_NTD <- kFoldMaskTensor(data_tensor, k=3)
for(k in 1:3){
out_NTD[count, 5] <- rev(
NTD(data_tensor,
M = masks_NTD[[k]],
rank = c(i, j, k))$TestRecError)[1]
count <- count + 1
}
}
}
}
ggplot(out_NTD, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 ggplot2_3.4.2 rTensor_1.4.8 nnTensor_1.3.0
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.3 sass_0.4.6 utf8_1.2.3 generics_0.1.3
## [5] tcltk_4.3.0 digest_0.6.31 magrittr_2.0.3 evaluate_0.21
## [9] grid_4.3.0 RColorBrewer_1.1-3 fastmap_1.1.1 maps_3.4.1
## [13] jsonlite_1.8.5 misc3d_0.9-1 gridExtra_2.3 fansi_1.0.4
## [17] spam_2.9-1 viridisLite_0.4.2 scales_1.2.1 jquerylib_0.1.4
## [21] cli_3.6.1 rlang_1.1.1 munsell_0.5.0 withr_2.5.0
## [25] cachem_1.0.8 yaml_2.3.7 tools_4.3.0 colorspace_2.1-0
## [29] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.3 plot3D_1.4
## [33] MASS_7.3-60 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.0
## [37] gtable_0.3.3 glue_1.6.2 Rcpp_1.0.10 fields_14.1
## [41] xfun_0.39 tibble_3.2.1 tidyselect_1.2.1 highr_0.10
## [45] knitr_1.43 farver_2.1.1 htmltools_0.5.5 rmarkdown_2.22
## [49] labeling_0.4.2 tagcloud_0.6 dotCall64_1.0-2 compiler_4.3.0