Running rMIDAS on a server instance

MIDAS offers a scalable solution to multiple imputation of very large datasets. Such applications can, of course, have sizable memory requirements that push the limits of a personal computer. Users may therefore wish to “offshore” model training to a virtual server that has greater memory capacity.

This guide runs through how to configure an Amazon AWS server instance to run MIDAS in R (and Python).

Choice of server instance

Amazon AWS, like other server providers, offers a range of hardware configurations. In our experience, memory is the largest bottleneck when training on sizable datasets. Both the t5 and c6 classes offer cost-efficient and memory-scalable solutions, which we have used successfully in our own testing and workflows.
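
Since memory is the main constraint, it can help to confirm how much RAM an instance actually exposes before uploading any data. The following is a minimal sketch, assuming a Linux instance where /proc/meminfo is available; the function name mem_total_gib is our own and not part of MIDAS:

```shell
# Print total RAM in GiB, read from /proc/meminfo (or from a file given as $1).
mem_total_gib() {
  awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Only query the live system when /proc/meminfo exists (i.e. on Linux).
if [ -r /proc/meminfo ]; then
  echo "Total memory: $(mem_total_gib) GiB"
fi
```

Comparing this figure with the size of the dataset on disk gives a rough first indication of whether a larger instance class is needed.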

Users may also wish to use GPU acceleration. More information on GPU-compatible instance classes can be found in the AWS documentation. Note that MIDASpy is not currently configured to take advantage of distributed training.

Server setup

When configuring the instance initially, we recommend choosing the Ubuntu operating system (the latest version at time of writing was 22.04 LTS).

Once the instance is running, connect to it over SSH from the command line (making sure any authentication key is stored in the current directory):

ssh -i [your_authentication_key_name.pem] ubuntu@[server.address]
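
SSH refuses private keys that other users on your machine can read, so a freshly downloaded key usually needs its permissions tightened before the command above will work. A minimal sketch, to be run on your local machine; secure_key is a hypothetical helper name and the key file name is the same placeholder as above:

```shell
# Make a private key readable by its owner only, then show the result.
secure_key() {
  chmod 400 "$1" && ls -l "$1"
}

# Usage (placeholder file name, replace with your own key):
# secure_key your_authentication_key_name.pem
```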

At this point, we are ready to configure the software on the server instance. First, we refresh the package index and install a series of common libraries required for compiling software:

sudo apt update
sudo apt install x11-common
sudo apt install libssl-dev
sudo apt install libcurl4-openssl-dev
sudo apt install libxml2-dev

Then, we install C++ compilers:

sudo apt install gcc
sudo apt install g++
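
Before moving on, it may be worth confirming that the compilers are actually on the PATH. The sketch below uses only the shell's built-in command lookup; check_tools is a hypothetical helper name, not part of any package:

```shell
# Report whether each named tool is on the PATH; return 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: MISSING"
      missing=1
    fi
  done
  return "$missing"
}

check_tools gcc g++ || echo "Install the missing tools before continuing."
```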

Next, we download and install Miniconda to isolate the Python dependencies for rMIDAS. If you are asked whether to initialise Miniconda, type yes, and start a new shell session once the installer finishes so that the conda command becomes available:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Finally, we set up a conda environment, either from a downloaded copy of the YAML file available on GitHub or directly from its URL:

conda env create -f https://raw.githubusercontent.com/MIDASverse/rMIDAS/master/rmidas-env.yml
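
To confirm that the environment was created, we can look for it in the output of conda env list. This is a sketch under the assumption that the YAML file names the environment rmidas (the name used when pointing R to the environment later in this guide); has_conda_env is a hypothetical helper name:

```shell
# Succeed if a conda environment with the given name appears in `conda env list`.
has_conda_env() {
  conda env list 2> /dev/null |
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

if has_conda_env rmidas; then
  echo "rmidas environment is available"
else
  echo "rmidas environment not found"
fi
```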

R-specific setup

To set up R on the server instance, we need to install R itself, along with a few extra helper packages, by running the following code at the command line:

sudo apt update -qq
sudo apt install --no-install-recommends software-properties-common dirmngr
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
sudo apt update
sudo apt install r-base
sudo apt install build-essential
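
With R in place, the rMIDAS package itself still needs to be installed. The sketch below assumes installation from CRAN into a per-user library, so that no root access is required; install_rmidas is a hypothetical helper name, and the mirror is the same cloud.r-project.org address used above:

```shell
# Install rMIDAS from CRAN into the per-user R library, creating it if needed.
install_rmidas() {
  Rscript -e '
    lib <- path.expand(Sys.getenv("R_LIBS_USER"))
    dir.create(lib, recursive = TRUE, showWarnings = FALSE)
    install.packages("rMIDAS", lib = lib, repos = "https://cloud.r-project.org")
  '
}

# Usage, once R is installed:
# install_rmidas
```

Compiling the package's dependencies from source is what the libraries and compilers installed earlier are for, so this step can take several minutes.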

The server is now configured and ready to be used. With the rMIDAS package installed in R (for example, via install.packages("rMIDAS")), remember to include a line at the beginning of your script that points R to the conda environment we set up above:

library(rMIDAS)
set_python_env("rmidas", type = "conda")
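
Training on large datasets can outlast an SSH session, so one option is to launch the script in the background and write its output to a log file. This is a sketch only, using the standard nohup utility; run_detached is a hypothetical helper name, and my_script.R and midas.log are placeholder file names:

```shell
# Run a command in the background, immune to hangups, logging to the file in $1.
# Prints the process ID of the background job.
run_detached() {
  log_file="$1"
  shift
  nohup "$@" > "$log_file" 2>&1 &
  echo "$!"
}

# Usage (placeholder names):
# run_detached midas.log Rscript my_script.R
# tail -f midas.log
```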