Machine Learning on Frontera

Frontera is well equipped to provide researchers with the latest in Machine Learning frameworks, PyTorch and Tensorflow. We recommend using the Python virtual environment to manage machine learning packages.

Running PyTorch

  1. Request a single compute node in Frontera's rtx-dev queue using the idev utility:

    login2.frontera$ idev -N 1 -n 1 -p rtx-dev -t 02:00:00

  2. Create a Python virtual environment

    c123-456$ python3 -m venv /path/to/virtual-env  # (e.g., $SCRATCH/python-envs/test)

  3. Activate the Python virtual environment

    c123-456$ source /path/to/virtual-env/bin/activate

  4. Install PyTorch

    c123-456$ pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Single-Node

  1. Download the benchmark:

    c123-456$ cd $SCRATCH
    c123-456$ git clone https://github.com/gpauloski/kfac-pytorch.git
    c123-456$ cd kfac-pytorch
    c123-456$ git checkout tags/v0.3.2
    c123-456$ pip3 install -e .
    c123-456$ pip3 install torchinfo tqdm Pillow
    c123-456$ export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH

  2. Run the benchmark on one node (4 GPUs):

    c123-456$ python3 -m torch.distributed.launch --nproc_per_node=4 examples/torch_cifar10_resnet.py --kfac-update-freq 0

Multi-Node

  1. Request two nodes in the rtx-dev queue using the idev utility:

    login2.frontera$ idev -N 2 -n 2 -p rtx-dev -t 02:00:00

  2. Go to the benchmark directory:

    c123-456$ cd $SCRATCH/kfac-pytorch

  3. Create a script called "run.sh". This script needs two parameters, the hostname of the master node and the number of nodes.

    #!/bin/bash

    HOST=$1 NODES=$2 LOCAL_RANK=${PMI_RANK}

    python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \ examples/torch_cifar10_resnet.py --kfac-update-freq 0

  4. Run multi-gpu training:

    c123-456$ ibrun -np 2 ./run.sh c123-456 2

Running Tensorflow

Follow these instructions to install and run TensorFlow benchmarks on Frontera RTX. Frontera RTX runs TensorFlow 2.8.0 with Python 3.8.2. Frontera supports CUDA/10.1, CUDA/11.0, and CUDA/11.1. By default, we use CUDA/11.3. Select the appropriate CUDA version for your TensorFlow version.

  1. Request a single compute node in Frontera's rtx-dev queue using the idev utility:

    login2.frontera$ idev -N 1 -n 1 -p rtx-dev -t 02:00:00

  2. Create a Python virtual environment:

    c123-456$ python3 -m venv /path/to/virtual-env # e.g., $SCRATCH/python-envs/test

  3. Activate the Python virtual environment:

    c123-456$ source /path/to/virtual-env/bin/activate

  4. Install TensorFlow and Horovod

    c123-456$ module load cuda/11.3 cudnn nccl
    c123-456$ pip3 install tensorflow-gpu==2.8.2

    We suggest installing Horovod version 0.25.0. If you wish to install other versions of Horovod, please submit a support ticket with the subject "Request for Horovod" and TACC staff will provide special instructions.

    c123-456$ HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc \
        HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_TENSORFLOW=1 pip3 install horovod==0.25.0

Single-Node

  1. Download the tensorflow benchmark to your $SCRATCH directory, then check out the branch that matches your tensorflow version.

    c123-456$ cds; git clone https://github.com/tensorflow/benchmarks.git
    c123-456$ cd benchmarks 
    c123-456$ git checkout 51d647f     # master head as of 08/18/2022

  2. Activate the Python virtual environment

    c123-456$ source /path/to/virtual-env/bin/activate

  3. Benchmark the performance with synthetic dataset on 1 GPU

    c123-456$ cd scripts/tf_cnn_benchmarks
    c123-456$ python3 tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200

  4. Benchmark the performance with synthetic dataset on 4 GPUs

c123-456$ cd scripts/tf_cnn_benchmarks
c123-456$ ibrun -np 4 python3 tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 \
    --model resnet50 --batch_size 32 --num_batches 200 --allow_growth=True

Multi-Node

Coming Soon