How to Run on GPU Cluster

This describes how to set up the environment to run APPFL on a GPU cluster. The tutorial is based on the SWING GPU cluster at Argonne National Laboratory; the cluster information is available at the Laboratory Computing Resource Center. In this tutorial, we use the MNIST example to run APPFL on the cluster.

Preparing Training

We assume the user has already run the MNIST example on a local machine, as described in Our first run MNIST. The MNIST dataset is downloaded while running the MNIST example.

We upload the data and example code from the local machine to the cluster.

$ cd APPFL/examples
$ ssh [your_id]@[cluster_destination] mkdir -p workspace
$ scp -r * [your_id]@[cluster_destination]:workspace

Please check that the workspace folder contains the “datasets” and “models” directories and the MNIST example script used in this tutorial.
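For example, you can list the uploaded files from the local machine (the path assumes the workspace directory created above):

$ ssh [your_id]@[cluster_destination] ls workspace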

Loading Modules

This tutorial uses modules on the SWING cluster. The module configuration may vary depending on the cluster.

$ module load gcc/9.2.0-r4tyw54 cuda/11.4.0-gqbcqie openmpi/4.1.4-cuda-ucx anaconda3
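If the module names or versions differ on your cluster, you can search the available modules and confirm what has been loaded (standard commands in the Environment Modules and Lmod systems):

$ module avail
$ module list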

Creating Conda Environment and Installing APPFL

An Anaconda environment is used to manage the dependencies.

$ conda create -n APPFL python=3.8
$ conda activate APPFL
$ pip install pip --upgrade
$ pip install "appfl[dev,examples,analytics]"

Modifying Dependencies for CUDA Support

The SWING cluster uses CUDA 11.4, so we need to change the torch version to match the CUDA version. The CUDA version may vary depending on the cluster, and a different CUDA version may require different torch versions.
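If you are unsure which CUDA version your cluster provides, you can check it after loading the cuda module, for example:

$ nvcc --version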

$ pip uninstall torch torchvision
$ pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

Alternatively, PyTorch with CUDA support can be installed via conda:

$ conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch


You may need to run pip install chardet to resolve a dependency issue from the torchvision package.
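You can verify the torch installation and its CUDA support with a quick check; note that torch.cuda.is_available() may return False on a login node without GPUs, so the check is most meaningful on a GPU compute node:

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"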

Creating Batch Script

The SWING cluster uses the Slurm workload manager for job management. The job management configuration may vary depending on the cluster.
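Depending on the cluster, you may also need to request a specific partition (e.g., #SBATCH --partition=gpu); the available partitions and nodes can be listed with:

$ sinfo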

$ vim [batch_script_file]

#!/bin/bash
#SBATCH --job-name=APPFL-test
#SBATCH --account=<your_project_name>
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --time=00:05:00

mpiexec -np 2 --mca opal_cuda_support 1 python ./[mnist_example_script] --num_clients=2
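Note that a Slurm batch job usually starts in a fresh shell, so the modules and conda environment set up interactively above may not be active inside the job. Assuming that is the case on your cluster, add the corresponding setup commands to the script before the mpiexec line, for example:

module load gcc/9.2.0-r4tyw54 cuda/11.4.0-gqbcqie openmpi/4.1.4-cuda-ucx anaconda3
conda activate APPFL

Depending on how conda is initialized on the cluster, source activate APPFL may be needed instead of conda activate APPFL.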

Submit the script to run the job.

$ sbatch [batch_script_file]
Submitted batch job {job_id}
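While the job is queued or running, you can check its status, for example:

$ squeue -u [your_id]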

An output file is generated when the script runs.

$ cat slurm-{job_id}.out