Performing a bootstrap analysis with qtl2pleio
requires first sampling a (bivariate) phenotype from the inferred bivariate normal distribution, then a two-dimensional, two-QTL QTL scan over the defined scan region. Often, one wants to include more than 100 markers in a scan region. Each two-dimensional scan can take tens of minutes with a single core. Thus, performing a bootstrap analysis with, say, 1000 bootstrap samples is prohibitively time-consuming on most computers. For this reason, we used a high-throughput computing cluster at the University of Wisconsin-Madison (where qtl2pleio
was developed). The UW-Madison Center for High-Throughput Computing (CHTC) is available to all UW-Madison researchers. While we recognize that qtl2pleio
users may not have access to the CHTC, we anticipate that some will have access to computing clusters. Hopefully this vignette can be adapted to your needs.
We placed on github a repository that contains the results from our bootstrap analysis of the Recla data.
Here is the repository’s url: https://github.com/fboehm/qtl2pleio-manuscript-chtc/
Within this repo are three subdirectories. We’ll examine in detail the “Recla-bootstrap” subdirectory.
Within this directory, we created four subdirectories:
The Github repository also contains a fifth subdirectory, squid
and a sixth subdirectory results
, that are included for completeness. In practice, files in squid
were in a distinct directory. We describe it below.
Rscript
The Rscript
subdirectory contains a file, boot-Recla-10-22.R
that contains R code for our bootstrap analysis.
fn <- file.path("https://raw.githubusercontent.com",
"fboehm/qtl2pleio-manuscript/master/chtc",
"Recla-bootstrap/Rscript/boot-Recla-10-22.R"
)
foo <- readLines(fn)
Here we display the contents of our boot-Recla-10-22.R
file.
##
pleio_peak_index <- 788
##
library(qtl2pleio)
library(stringr)
library(dplyr)
library(readr)
args <- R.utils::commandArgs(trailingOnly = TRUE, asValues = TRUE)
# translate command line args
print(args)
print(args$argname)
(proc_num <- as.numeric(args$argname))
(run_num <- as.numeric(args$run_num))
(nboot_per_job <- args$nboot_per_job) # think 1 boot per job
(nsnp <- as.integer(args$nsnp))
(s1 <- as.integer(args$s1))
###############
library(qtl2)
recla <- readRDS("data/recla.rds")
# make sex a covariate for use in pvl_scan
recla[[6]][ , 1, drop = FALSE] -> sex
sex <- sex == "female"
# insert pseudomarkers
insert_pseudomarkers(recla, step = 0.1) -> pseudomap
## ------------------------------------------------------------------------
## ------------------------------------------------------------------------
pp <- readRDS("data/recla-aprobs-chr8.rds")
## ------------------------------------------------------------------------
kinship <- readRDS("data/recla-kinship.rds")
## ------------------------------------------------------------------------
recla$pheno -> ph
log(ph) -> lph
apply(FUN = broman::winsorize, X = lph, MARGIN = 2) -> wlph
## ------------------------------------------------------------------------
phe <- wlph[, c(10, 22)]
gm <- pseudomap$`8`
k <- kinship[[8]]
## ------------------------------------------------------------------------
library(qtl2pleio)
###############
# simulate a phenotype
X1 <- pp[ , , pleio_peak_index]
cbind(X1, unlist(sex)) -> Xpre
## remove subjects with missing values of phenotype
is.na(phe[ , 1]) | is.na(phe[ , 2]) -> missing_indic
phe_nona <- phe[!missing_indic, ]
Xpre_nona <- Xpre[!missing_indic, ]
k_nona <- k[!missing_indic, !missing_indic]
##
gemma2::stagger_mats(Xpre_nona, Xpre_nona) -> X
set.seed(proc_num)
calc_covs(pheno = phe_nona, kinship = k_nona) -> cc_out
(cc_out$Vg -> Vg)
(cc_out$Ve -> Ve)
# calculate Sigma
calc_Sigma(Vg = Vg, Ve = Ve, K = k_nona) -> Sigma
solve(Sigma) -> Sigma_inv
# calc Bhat
B <- calc_Bhat(X = X, Sigma_inv = Sigma_inv, Y = phe_nona)
# Start loop
lrt <- numeric()
for (i in 1:nboot_per_job){
sim1(X = X, B = B, Vg = Vg, Ve = Ve, kinship = k_nona) -> foo
matrix(foo, ncol = 2, byrow = FALSE) -> Ysim
rownames(Ysim) <- rownames(phe_nona)
colnames(Ysim) <- c("t1", "t2")
scan_pvl(probs = pp[!missing_indic, , ], pheno = Ysim, covariates = sex[!missing_indic, , drop = FALSE], kinship = k_nona, start_snp1 = s1, n_snp = nsnp) -> loglik
# in above call, s1 & nsnp come from command line args
calc_lrt_tib(loglik) -> lrt[i]
}
fn_out <- paste0("recla-boot-run", run_num, "_", proc_num, ".txt")
write.table(lrt, fn_out)
q("no")
data
The data subdirectory contains three rds files, one for each of the three input components.
shell_scripts
We examine only the file that is needed for the Recla analysis, boot-Recla-10-22.sh
.
Here is the contents of the file.
#!/bin/bash
# untar your R installation
tar -xzf R13.tar.gz
tar -xzf SLIBS.tar.gz
# make sure the script will use your R installation
export PATH=$(pwd)/R/bin:$PATH
export LD_LIBRARY_PATH=$(pwd)/SS:$LD_LIBRARY_PATH
# run R, with the name of your R script
R CMD BATCH '--args argname='$1' nsnp='$2' s1='$3' nboot_per_job='$4' run_num='$5'' Rscript/boot-Recla-10-22.R 'boot400_run'$5'-'$1'.Rout'
We need to unzip the R installation, which we’ve named R13.tar.gz, and SLIBS file on the remote computer where the actual computing will occur. Following that, and needed adjustments to the paths, we execute R. Note that we have, in this file, specified the variable values by numbers, like $1
for argname
. The values assigned to each of these is specified in the “submit” file.
submit_files
The HTCondor “submit” file provides instructions for controlling the interaction with the high-throughput resources. Below is the text in the submit file boot-Recla-10-22.sub
# hello-chtc.sub
# My very first HTCondor submit file
s1=650
# start scan at this SNP index value
nsnp=350
# length of scan in number of SNPs
# which run in the experimental design?
run_num=561
#
nboot_per_job=1
# for almost all jobs), the desired name of the HTCondor log file,
# and the desired name of the standard error file.
# Wherever you see $(Cluster), HTCondor will insert the queue number
# assigned to this set of jobs at the time of submission.
universe = vanilla
log = boot-$(Process)-run$(run_num).log
error = boot-$(Process)-run$(run_num).err
#
# Specify your executable (single binary or a script that runs several
# commands), arguments, and a files for HTCondor to store standard
# output (or "screen output").
# $(Process) will be a integer number for each job, starting with "0"
# and increasing for the relevant number of jobs.
executable = ../shell_scripts/boot-Recla-10-22.sh
arguments = $(Process) $(nsnp) $(s1) $(nboot_per_job) $(run_num)
output = boot-$(Process)-run$(run_num).out
#
# Specify that HTCondor should transfer files to and from the
# computer where each job runs. The last of these lines *would* be
when_to_transfer_output = ON_EXIT
transfer_input_files = ../data,../Rscript,../shell_scripts,http://proxy.chtc.wisc.edu/SQUID/fjboehm/R13.tar.gz,http://proxy.chtc.wisc.edu/SQUID/SLIBS.tar.gz
#
# Tell HTCondor what amount of compute resources
# each job will need on the computer where it runs.
request_cpus = 1
request_memory = 3GB
request_disk = 1GB
#
requirements = (OpSysMajorVer == 6) || (OpSysMajorVer == 7)
# which computer grids to use:
+WantFlocking = true
+WantGlideIn = true
# Tell HTCondor to run instances of our job:
queue 1000
The ordering of the arguments on in the line that starts with arguments =
dictates the numbers that are assigned to each argument. These numbers are used in the shell script, above.
SQUID
The UW-Madison CHTC provides for each user disk space on a web proxy. More information about SQUID
is available online.
In practice, we placed files containing founder allele dosages (for chromosomes of interest) and our (compressed) R installation on SQUID
.
We followed the usual procedure for submitting R jobs with the CHTC. To submit the file boot-Recla-10-22.sub
, we first changed into its directory. We then typed
By using this command from within the submit_files
subdirectory, we ensure that our outputted files are returned to the submit_files
subdirectory. We then manually moved the outputted results files to the results
subdirectory.
We found that up to ten percent of jobs would initially fail and require re-submission. There are a variety of reasons that may explain the need for re-submission. Because of this need, we include in the git repository the file boot-Recla-10-22-fix.sub
, which gives the code that we used for resubmission. The file bad-jobs-run561
gives the ids of the jobs that failed the initial runs.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## os Ubuntu 20.10
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate C
## ctype en_US.UTF-8
## tz America/New_York
## date 2020-12-01
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [3] CRAN (R 4.0.0)
## callr 3.5.1 2020-10-13 [3] CRAN (R 4.0.3)
## cli 2.2.0 2020-11-20 [3] CRAN (R 4.0.3)
## crayon 1.3.4.9000 2020-10-22 [3] Github (r-lib/crayon@6b3f0c6)
## desc 1.2.0 2018-05-01 [3] CRAN (R 4.0.0)
## devtools 2.3.2 2020-09-18 [3] CRAN (R 4.0.3)
## digest 0.6.27 2020-10-24 [3] CRAN (R 4.0.3)
## ellipsis 0.3.1 2020-05-15 [3] CRAN (R 4.0.0)
## evaluate 0.14 2019-05-28 [3] CRAN (R 4.0.0)
## fansi 0.4.1 2020-01-08 [3] CRAN (R 4.0.0)
## fs 1.5.0 2020-07-31 [3] CRAN (R 4.0.3)
## glue 1.4.2 2020-08-27 [3] CRAN (R 4.0.2)
## htmltools 0.5.0 2020-06-16 [3] CRAN (R 4.0.0)
## knitr 1.30 2020-09-22 [3] CRAN (R 4.0.3)
## magrittr 2.0.1 2020-11-17 [3] CRAN (R 4.0.3)
## memoise 1.1.0 2017-04-21 [3] CRAN (R 4.0.0)
## pkgbuild 1.1.0 2020-07-13 [3] CRAN (R 4.0.2)
## pkgload 1.1.0 2020-05-29 [3] CRAN (R 4.0.0)
## prettyunits 1.1.1 2020-01-24 [3] CRAN (R 4.0.0)
## processx 3.4.4 2020-09-03 [3] CRAN (R 4.0.3)
## ps 1.4.0 2020-10-07 [3] CRAN (R 4.0.3)
## R6 2.5.0 2020-10-28 [3] CRAN (R 4.0.3)
## remotes 2.2.0 2020-07-21 [3] CRAN (R 4.0.3)
## rlang 0.4.9 2020-11-26 [3] CRAN (R 4.0.3)
## rmarkdown 2.5 2020-10-21 [3] CRAN (R 4.0.3)
## rprojroot 2.0.2 2020-11-15 [3] CRAN (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [3] CRAN (R 4.0.0)
## stringi 1.5.3 2020-09-09 [3] CRAN (R 4.0.2)
## stringr 1.4.0 2019-02-10 [3] CRAN (R 4.0.0)
## testthat 3.0.0 2020-10-31 [3] CRAN (R 4.0.3)
## usethis 1.6.3 2020-09-17 [3] CRAN (R 4.0.3)
## withr 2.3.0 2020-09-22 [3] CRAN (R 4.0.3)
## xfun 0.19 2020-10-30 [3] CRAN (R 4.0.3)
## yaml 2.2.1 2020-02-01 [3] CRAN (R 4.0.0)
##
## [1] /tmp/RtmpbYyM8o/Rinst8229dd84ff4f
## [2] /tmp/RtmpQ3N82V/temp_libpath803da596e6e5d
## [3] /home/fred/R/x86_64-pc-linux-gnu-library/4.0
## [4] /usr/local/lib/R/site-library
## [5] /usr/lib/R/site-library
## [6] /usr/lib/R/library