High performance computing
NOTE: We are currently trialing this service with users so that we can make it as accommodating and secure as possible. This means that items concerning the service, including this documentation, are subject to change. We will do our best to keep everyone updated and notified of changes as they come.
Introduction
At the OCF we offer a High Performance Computing (HPC) service for individuals and groups that need to run computationally demanding software. We currently have one main HPC server; however, we have plans to expand the cluster to make use of the resources at our disposal.
Gaining Access
In order to access the HPC cluster, please send an access request to help@ocf.berkeley.edu. Make sure to include your OCF username or group account name and a detailed technical description of the projects you plan to run on our HPC infrastructure. This should include information about the nature of the software being run, as well as the amount of computational resources you expect to need.
Connecting
Once your proposal has been submitted and approved, you will be able to connect to our Slurm master node via SSH by running the following command:
ssh my_ocf_username@hpcctl.ocf.berkeley.edu
If you have trouble connecting, please contact us at help@ocf.berkeley.edu, or come to staff hours when the lab is open and chat with us in person. We also have a #hpc_users channel on Slack and IRC where you can ask questions and talk to us about anything HPC.
The Cluster
As of Fall 2018, the OCF HPC cluster is composed of one server, with the following specifications:
- 2 Intel Xeon E5-2640v4 CPUs (10c/20t @ 2.4GHz)
- 4 NVIDIA GTX 1080 Ti GPUs
- 256GB ECC DDR4-2400 RAM
We have plans to expand the cluster with additional nodes of comparable specifications as funding becomes available. The current hardware was generously funded by a series of grants from the Student Tech Fund.
Slurm
We currently use Slurm as our workload manager for the cluster. Slurm is a free and open source job scheduler that evenly distributes jobs across an HPC cluster, where each computer in the cluster is referred to as a node. The only way to access our HPC nodes is through Slurm.
Detailed documentation for how to access Slurm is here.
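Once you are connected, a few standard Slurm commands are useful for getting your bearings; these are part of Slurm itself, so they should behave on our cluster as they do anywhere else:
sinfo
squeue -u my_ocf_username
scancel my_job_id
sinfo lists the partitions and the state of each node, squeue shows your queued and running jobs, and scancel cancels a job by its ID (my_job_id above is just a placeholder).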
Dependencies
For managing application dependencies, you currently have two options:
Virtual Environments
First, you can use a virtual environment if you only need Python packages. To create a virtual environment, navigate to your home directory and run the following commands:
virtualenv -p python3 venv
. venv/bin/activate
This will allow you to pip install any Python packages that the OCF does not already provide for your program.
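For example, with the virtual environment activated, installing and checking a package looks like the following (numpy here is only a stand-in for whatever your program actually needs):
pip install numpy
python3 -c "import numpy; print(numpy.__version__)"
Run deactivate when you are done working in the environment.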
Singularity
For those who need non-Python dependencies or have already integrated their program with Docker, the second option is to use Singularity containers. Singularity is a containerization platform developed at Lawrence Berkeley National Laboratory that is designed specifically for HPC environments. To read more about the benefits of Singularity, you can look here. We suggest the following workflow, which will help simplify deploying your program on our infrastructure.
Installing
We recommend that you do your development on our HPC infrastructure, but you can also develop on your own machine if you would like. If you are running Linux on your system, you can install Singularity from the official apt repos:
sudo apt install singularity-container
If you do not have an apt-based Linux distribution, installation instructions can be found here. Otherwise, if you are running macOS you can look here, or Windows here.
Building Your Container
singularity build --sandbox ./my_container docker://ubuntu
This will create a Singularity container named my_container. If you are working on our infrastructure you will not be able to install non-pip packages in your container, because you do not have root privileges. If you would like to create your own container with new packages, you must create the container on your own machine, using the above command with sudo prepended, and then transfer it over to our infrastructure.
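As a rough sketch of that workflow on your own machine, you could build a writable sandbox with sudo and then install system packages into it; git below is only a placeholder for whatever non-pip dependency you actually need:
sudo singularity build --sandbox ./my_container docker://ubuntu
sudo singularity exec --writable ./my_container apt-get update
sudo singularity exec --writable ./my_container apt-get install -y git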
The docker://ubuntu option tells Singularity to bootstrap the container from the official Ubuntu Docker image on Docker Hub. There is also a Singularity Hub, from which you can pull Singularity images directly in a similar fashion. We also have some pre-built containers that you may use to avoid having to build your own. They are currently located at /home/containers on the Slurm master node.
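For example, you can list the pre-built containers and shell into one of them directly; the container name below is only a placeholder:
ls /home/containers
singularity shell /home/containers/some_prebuilt_container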
Using Your Container
singularity shell my_container
The above command will allow you to shell into your container. By default your home directory in the container is linked to your real home directory outside of the container environment, which helps you avoid having to transfer files in and out of the container.
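Because of this bind, a file created inside the container also shows up in your home directory on the host. For example, the touch below runs inside the container shell, and the file is still visible after you exit:
singularity shell my_container
touch ~/created_inside_container.txt
exit
ls ~/created_inside_container.txt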
singularity exec --nv my_container ./my_executable.sh
This command will open your container and run the my_executable.sh script in the container environment. The --nv option allows the container to interface with the GPU. This command is useful when using srun, so you can run your program in a single command.
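As a quick sanity check that the --nv option is working, you can run nvidia-smi from inside the container, since the host's NVIDIA tools and libraries are bound in:
singularity exec --nv my_container nvidia-smi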
Working on HPC Infrastructure
If you were using a sandboxed container for testing, we suggest you convert it to a Singularity image file. This is because images are more portable and easier to interact with than sandboxed containers. You can make this conversion using the following command:
sudo singularity build my_image.simg ./my_sandboxed_container
If you were working on the image on your own computer, you can transfer it over to your home directory on our infrastructure using the following command:
scp my_image.simg my_ocf_username@hpcctl.ocf.berkeley.edu:~/
To actually submit a Slurm job that uses your Singularity container and runs your script my_executable.sh, run the following command:
srun --gres=gpu --partition=ocf-hpc singularity exec --nv my_image.simg ./my_executable.sh
This will submit a Slurm job to run your executable on the ocf-hpc Slurm partition. The --gres=gpu option requests a GPU for your job and is what allows multiple users' jobs to share the GPUs on a single node, so it is important to include it. Without it, you will not be able to interface with the GPUs.
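If you would rather queue the job non-interactively, the same job can be written as a batch script and submitted with sbatch. This is only a sketch; the job name, output file, and script name are arbitrary:
#!/bin/bash
#SBATCH --partition=ocf-hpc
#SBATCH --gres=gpu
#SBATCH --job-name=my_hpc_job
#SBATCH --output=my_hpc_job.%j.out
singularity exec --nv my_image.simg ./my_executable.sh
Save this as my_job.sh and submit it with sbatch my_job.sh; you can then check on it with squeue -u my_ocf_username.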