NB: Running on a cluster usually requires some scripting and coding skills; however, with graphical connections over VPN, it is becoming easier for non-programmers to run any software. Below, we provide some example scripts that can usually be copied and used with small modifications on many clusters. If in doubt, check with your administrator and/or write to us!
To run Haplin on a cluster, you need an MPI implementation and the Rmpi package, both installed manually before installing the Haplin package. How to install extra R packages varies from cluster to cluster, so check the manual!
To run a job on a cluster, one usually needs to submit a script to a job queue. The submission method depends on the queue system used, so check the help pages of your cluster. Here, we use the popular SLURM queueing system as an example.
Below is an example script that sets up a SLURM job:
#!/bin/bash
#SBATCH --job-name=haplin_cluster_run
#SBATCH --output=haplin_cluster_run.out
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8
#SBATCH --time=8:00:00
#SBATCH --mem-per-cpu=100
#SBATCH --mail-user=user@domain.com
#SBATCH --mail-type=ALL
module load R
module load openmpi
echo "nodes: $SLURM_JOB_NODELIST"
myhostfile="cur_nodes.dat"
echo "----STARTING THE JOB----"
date
echo "------------------------"
mpiexec --hostfile $myhostfile -n 1 R --save < haplin_cluster_run.r >& mpi_run.out
exit_status=$?
echo "----JOB EXITED WITH STATUS---: $exit_status"
echo "----DONE----"
exit $exit_status
Here, the important part is the mpiexec line, where the R session is launched to run in parallel on several cores. To achieve this with the Rmpi package, one needs to provide a list of the cores currently available to the user, which is done through the --hostfile $myhostfile part. This means that the given file should hold a list of cores; if this is not available automatically on the cluster, one can extract it from the $SLURM_JOB_NODELIST variable (see the submit_haplin_cluster_rmpi.sh script in this folder).
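As a rough sketch of the extraction step described above, the helper below turns a list of node names into a hostfile in Open MPI's "hostname slots=N" format. The function name write_hostfile is made up for illustration, the slot count of 8 simply mirrors --ntasks-per-node=8 from the job script, and this is not necessarily how submit_haplin_cluster_rmpi.sh does it. The only SLURM call used is scontrol show hostnames, which expands a compact node list into one host name per line.

```shell
#!/bin/bash
# Hypothetical helper (not part of Haplin): read one host name per line from
# stdin and print Open MPI hostfile entries of the form "hostname slots=N".
write_hostfile() {
  local slots="$1"
  local host
  # "|| [ -n "$host" ]" also handles a last line without a trailing newline
  while IFS= read -r host || [ -n "$host" ]; do
    # skip empty lines
    if [ -n "$host" ]; then
      printf '%s slots=%s\n' "$host" "$slots"
    fi
  done
  return 0
}

# Inside a SLURM job: expand the compact node list (e.g. "node[01-03]") and
# write the file that mpiexec reads via --hostfile. Skipped outside SLURM.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
  scontrol show hostnames "$SLURM_JOB_NODELIST" | write_hostfile 8 > cur_nodes.dat
fi
```

If placed before the mpiexec line of the job script, this would create cur_nodes.dat on the fly; whether your MPI installation expects exactly this hostfile layout is something to confirm with your cluster documentation.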
For a more detailed explanation of the #SBATCH commands, see, e.g., the official documentation.
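To make the resource request in the job script concrete: it asks for 3 nodes with 8 tasks each, i.e. 24 tasks in total, and, since SLURM reads a bare number in --mem-per-cpu as megabytes, 100 MB per allocated CPU. Assuming the default of one CPU per task, that is 2400 MB overall. The sketch below only reproduces this arithmetic; the function names are illustrative and not part of SLURM or Haplin.

```shell
# Illustrative helpers (hypothetical names) for sanity-checking a request.

# total_tasks NODES TASKS_PER_NODE
total_tasks() {
  echo $(( $1 * $2 ))
}

# total_mem_mb NODES TASKS_PER_NODE MEM_PER_CPU_MB (assumes one CPU per task)
total_mem_mb() {
  echo $(( $1 * $2 * $3 ))
}

total_tasks 3 8        # prints 24
total_mem_mb 3 8 100   # prints 2400
```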
The most effective way of using Haplin on a cluster is to run haplinSlide on a large GWAS dataset. The data preparation and the call to haplinSlide are the same as for a single run; see the section above. However, before calling any parallel function, one needs to set up the cluster with the function:
initParallelRun()
This will make use of the maximum number of available cores. If one wants to limit the run to a specific number of CPUs, the cpus argument needs to be specified.
Then, when invoking the analysis, one needs to specify that the Rmpi package will be used:
haplinSlide( trial.data2.prep, use.missing = TRUE, ccvar = 2, design = "cc.triad",
  reference = "ref.cat", response = "mult", para.env = "Rmpi" )
Finally, right before the script finishes, we need to close all the threads created by initParallelRun:
finishParallelRun()
CAUTION: If the user forgets to call this function before exiting R, all the work will still be saved; however, mpirun will end with an error.
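One simple safeguard against this, sketched below on the assumption that the R script is the haplin_cluster_run.r file named in the job script, is to check before submission that the script contains a call to finishParallelRun(). The function name check_finish_call is hypothetical, and the check is purely textual: it would also match a commented-out call and cannot prove that the call is reached at run time.

```shell
# Hypothetical pre-submission check (not part of Haplin or SLURM).
# Usage: check_finish_call path/to/script.r
check_finish_call() {
  # -q: no output, just an exit status; -F: match the text literally,
  # so the parentheses are not interpreted as regular-expression syntax
  if grep -qF 'finishParallelRun()' "$1"; then
    echo "OK: $1 calls finishParallelRun()"
  else
    echo "WARNING: $1 never calls finishParallelRun()" >&2
    return 1
  fi
}

# Example use before submitting the job (skipped if the file is absent):
if [ -f haplin_cluster_run.r ]; then
  check_finish_call haplin_cluster_run.r || true
fi
```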
To sum up, an example R script to run on a cluster would look like this:
library( Haplin )
initParallelRun()
chosen.markers <- 3:55
data.in <- genDataLoad( filename = "mynicedata" )
# analysis without maternal risks calculated
results1 <- haplinSlide( data = data.in, markers = chosen.markers, winlength = 2,
  design = "triad", use.missing = TRUE, maternal = FALSE, response = "free",
  cpus = 2, verbose = FALSE, printout = FALSE, para.env = "Rmpi" )
# analysis with maternal risks calculated
results2 <- haplinSlide( data = data.in, markers = chosen.markers, winlength = 2,
  design = "triad", use.missing = TRUE, maternal = TRUE, response = "mult",
  cpus = 2, verbose = FALSE, printout = FALSE, para.env = "Rmpi" )
finishParallelRun()
IMPORTANT: To run in parallel, we need to specify both the cpus and para.env arguments; however, the true number of CPUs used will be set within initParallelRun and not by the cpus argument.