Tutorials
- 1: Cybertraining
- 2: Globus
- 3: Rclone on Rivanna
- 4: Rivanna
- 5: Rivanna Pod
- 6: Raspberry Pi Cluster
- 7: Rivanna and Singularity
1 - Cybertraining
A large number of tutorials and modules are available as part of our cybertraining educational activities.
Cybertraining
The main links to our cybertraining material are:
- Tutorials: https://cybertraining-dsc.github.io/docs/tutorial/
- Modules: https://cybertraining-dsc.github.io/docs/modules/
2 - Globus
Getting the Cosmoflow data via globus commandline
Data Directory
We showcase how to transfer data via the Globus command line tools.
In our example we use the following data directory:
export DATA=/project/bii_dsc_community/$USER/cosmoflow/data
Globus Set Up on Rivanna
Rivanna allows you to load the Globus file transfer command line tools via the module command with the following commands. However, prior to executing globus login, please visit https://www.globus.org/ and log in using your UVA credentials.
module load globus_cli
globus login
The globus login method will output a unique link per user that you
should paste into a web browser and sign in using your UVA
credentials. Afterwards, the website will present you with a unique
sign-in key that you will need to paste back into the command line to
verify your login.
After executing globus login, your console should look like the
following block.
NOTE: this is a unique link generated for the example login; each user will have a different link.
-bash-4.2$ globus login
Please authenticate with Globus here:
------------------------------------
https://auth.globus.org/v2/oauth2/authorize?client_id=affbecb5-5f93-404e-b342-957af296dea0&redirect_uri=https%3A%2F%2Fauth.globus.org%2Fv2%2Fweb%2Fauth-code&scope=openid+profile+email+urn%3Aglobus%3Aauth%3Ascope%3Aauth.globus.org%3Aview_identity_set+urn%3Aglobus%3Aauth%3Ascope%3Atransfer.api.globus.org%3Aall&state=_default&response_type=code&access_type=offline&prompt=login
------------------------------------
Enter the resulting Authorization Code here:
Follow the URL and input the authorization code to log in successfully.
Source Endpoint Search
First, verify that you were able to sign in properly by checking your
identity, and then search for the source endpoint of the data you want
to transfer. In this example, our endpoint is named CosmoFlow benchmark data cosmoUniverse_2019_02_4parE.
Please note that the file to be downloaded is 1.7 TB in size. Make
sure that the system on which you download it has enough space.
The following commands verify your sign-in identity and then search
for the endpoint given in single quotation marks.
globus get-identities -v 'youremail@gmailprobably.com'
globus endpoint search 'CosmoFlow benchmark data cosmoUniverse_2019_02_4parE'
Each globus endpoint has a unique endpoint ID. In this case our source endpoint ID is:
d0b1b73a-efd3-11e9-993f-0a8c187e8c12
Set up a variable SRC_ENDPOINT so you can use the endpoint more easily without retyping it.
Also set a variable SRC_DIR to indicate the directory with the files to be transferred.
export SRC_ENDPOINT=d0b1b73a-efd3-11e9-993f-0a8c187e8c12
export SRC_DIR=/~/
You can look at the files in the globus endpoint using globus ls
to
verify that you are looking at the right endpoint.
globus ls $SRC_ENDPOINT
Destination Endpoint Set Up
Rivanna HPC has set a special endpoint for data transfers into the
/project
, /home
, or /scratch
directories. The name of this
destination endpoint will be UVA Standard Security Storage
.
Repeat the above steps with this endpoint and set up the variables
including a path
variable with the desired path to write to.
globus endpoint search 'UVA Standard Security Storage'
export DEST_ENDPOINT=e6b338df-213b-4d31-b02c-1bc2c628ca07
export DEST_DIR=/dtn/landings/users/u/uj/$USER/project/bii_dsc_community/uja2wd/cosmoflow/
NOTE: We cannot set the path to start at the root level in Rivanna and instead need to follow a few steps to find the specific path to write to.
To begin, our path must start with /dtn/landings/users/ and is then extended by a sequence that depends on the user's computing ID. As an example, if your computing ID is abc5xy, the next three path components are /a/ab/abc5xy (first character, first two characters, computing ID). At this point the user is essentially at the root level of Rivanna and can access /home, /project, or /scratch as they normally would.
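For illustration, the following minimal bash sketch derives such a destination path from a computing ID, assuming the /dtn/landings/users convention described above; the ID abc5xy and the cosmoflow subdirectory are placeholders taken from the examples in this section.
ID=abc5xy                                   # replace with your computing ID
export DEST_DIR=/dtn/landings/users/${ID:0:1}/${ID:0:2}/$ID/project/bii_dsc_community/$ID/cosmoflow/
echo $DEST_DIR                              # /dtn/landings/users/a/ab/abc5xy/project/bii_dsc_community/abc5xy/cosmoflow/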
Note: If you want to use the web interface of Globus to find the path instead, follow the steps below to find the desired value of your path variable.
- First, sign in to the Globus web interface
- Locate file manager on the left side of the screen
- In the collections box at the top of the screen, begin to search for UVA Standard Security Storage
- Select our destination endpoint
- Use the GUI tool to select exactly where you wish to write to
- Copy the path from the box immediately below collections
- Write this value to the DEST_DIR variable created above (I have included my path to where I wish to write to)
Initiate the Transfer
Finally, execute the transfer
globus transfer $SRC_ENDPOINT:$SRC_DIR $DEST_ENDPOINT:$DEST_DIR
NOTE: Your first transfer may fail because you need to give Globus permission to initiate transfers via the CLI instead of via the web tool. I was given the following unique command by my terminal:
-bash-4.2$ globus transfer $SRC_ENDPOINT:$SRC_DIR $DEST_ENDPOINT:$DEST_DIR
The collection you are trying to access data on requires you to grant
consent for the Globus CLI to access it. message: Missing required
data_access consent
Please run
globus session consent 'urn:globus:auth:scope:transfer.api.globus.org:all[*https://auth.globus.org/scopes/e6b338df-213b-4d31-b02c-1bc2c628ca07/data_access]'
to login with the required scopes
After initiating this command, a sign-in verification similar to the
globus login method will be conducted: the CLI outputs a URL to
follow, the user signs in, and pastes back a verification code.
After fixing this, remember to re-initiate the transfer with the
globus transfer command as previously described.
Managing Tasks
To monitor the status of active transfers, use
globus task list
or similarly you can use the web tool to verify transfers.
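If you want more detail on a single transfer, the Globus CLI also provides task subcommands; the TASK_ID placeholder below stands for whatever ID globus transfer printed when you submitted the job.
globus task show TASK_ID      # status, bytes transferred, faults
globus task cancel TASK_ID    # cancel a transfer that is no longer needed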
References:
- Globus Data Transfer, Rivanna HPC https://www.rc.virginia.edu/userinfo/globus/
3 - Rclone on Rivanna
Using the Rclone Module on Rivanna
Rclone is a useful tool to upload to and download from cloud services such as Google Drive by using the command line. However, a web browser is required for initial setup, which can be done from the computer that you use to log into Rivanna.
Setup Rclone on Rivanna
First, load the newer version of the rclone module; otherwise, Rivanna loads an incompatible, older version by default. Then, initialize a new rclone configuration and enter the following inputs:
$ module load rclone/1.61.1
$ rclone config
n/s/q> n
name> gdrive
Storage> drive
A client ID is required to create a provision that interfaces with Google Drive. Follow the instructions at https://rclone.org/drive/#making-your-own-client-id to create a client ID and then input the values into Rivanna.
client_id> myCoolID..
client_secret> verySecretClientSecret..
scope> 2 # read only
service_account_file> # just press enter
Edit advanced config?
y) Yes
n) No (default)
y/n> n
Use web browser to automatically authenticate rclone with remote?
y/n> n
Install Rclone on Client Computer
If the computer used to log on to Rivanna is running Windows, and the computer has Chocolatey, then download Rclone using an administrative Git Bash instance with
$ choco install rclone -y
Otherwise, for Linux and macOS, use
$ sudo -v ; curl https://rclone.org/install.sh | sudo bash
Then, after opening a new instance of the terminal, paste the command given by rclone config on Rivanna into Git Bash and follow the instructions.
Rclone Authentication
In the web browser, click Advanced when Google says that it has not verified this app; this is safe and expected. Then click Go to rclone, then Continue.
When Rclone gives the config token, ensure that all new line characters are removed. This can be done by pasting the code into an application such as Notepad and manually ensuring that all characters are on the same line. Otherwise, the code will be split across new prompts, breaking the setup.
This is bad:
sjgnkajdfnkj
fdnskjafnkad
asdfnasjkffd
This is good:
sjgnkajdfnkjfdnskjafnkadasdfnasjkffd
Paste the fixed token into Rivanna.
config_token> myCoolCodeThatHasNoNewLineCharacters
Configure this as a Shared Drive (Team Drive)?
y) Yes
n) No (default)
y/n> n
Keep this "gdrive" remote?
y) Yes this is OK (default)
y/e/d> y
q) Quit config
e/n/d/r/c/s/q> q
An example command to use Rclone is as follows.
The flag --drive-shared-with-me
restricts the scope to
only shared files.
$ rclone copy --drive-shared-with-me gdrive:Colab\ Datasets/EarthquakeDec2020 /scratch/$USER/EarthquakeDec2020 -P
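A few additional hedged examples, assuming the remote was named gdrive as above; the upload source and destination paths are hypothetical.
$ rclone listremotes                     # should list gdrive:
$ rclone lsd gdrive:                     # list top-level folders in the drive
$ rclone copy /scratch/$USER/results gdrive:results -P   # hypothetical upload in the other direction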
4 - Rivanna
Rivanna is the University of Virginia’s High-Performance Computing (HPC) system. As a centralized resource, it has many software packages available. Currently, the Rivanna supercomputer has 603 nodes with over 20,476 cores and 8 PB of various storage. Rivanna has multiple nodes equipped with GPUs, including RTX2080, RTX3090, K80, P100, V100, A100-40GB, and A100-80GB.
Communication
We have a team Discord at: uva-bii-community
Please subscribe if you work on Rivanna and are part of the bii_dsc_community.
Rivanna at UVA
The official Web page for Rivanna is located at
In case you need support you can ask the staff using a ticket system at
- https://www.rc.virginia.edu/support/
- This page also contains zoom office hours Tue 3-5 pm, Thu 10-12 pm
It is important that, before you use Rivanna, you attend an introductory seminar, which is given upon request every Wednesday. To sign up, use the link:
Please note that in this introduction we provide you with additional information that may make the use of Rivanna easier. We encourage you to add to this information and share your tips.
Getting Permissions to use Rivanna
To use Rivanna you need to have special authorization. In case you work with a faculty member you will need to be added to a special group (or multiple) to be able to access it. The faculty member will know which group it is. This is managed via the group management portal by the faculty member. Please do not use the previous link and instead communicate with your faculty member first.
- Note: For BII work conducted with Geoffrey Fox or Gregor von Laszewski, please contact Gregor at laszewski@gmail.com
TODO: IS THIS THE CASE?
Once you are added to the group, you will receive an invitation email to set up a password for the research computing support portal. If you do not receive such an email, please visit the support portal at
TBD
This password is also the password that you will use to log into the system.
END TODO IS THIS THE CASE
After your account is set up, you can try to log in through the Web-based access. Please test it to make sure you have the proper access already.
However, we will typically not use the online portal but instead use the more advanced batch system, as it provides significant advantages for you when managing multiple jobs on Rivanna.
Accessing an HPC Computer via command line
If you need to use X11 on Rivanna, you can find documentation in the Rivanna documentation. In case you need to run Jupyter notebooks directly on Rivanna, please consult the Rivanna documentation.
VPN (required)
You can access Rivanna via ssh only through the VPN. UVA requires you to use the VPN to access any computer on campus. The VPN is offered by IT services but officially only supported for Mac and Windows.
However, if you have a Linux machine you can follow the VPN install instructions for Linux. If you have issues installing it, attend an online support session with the Rivanna staff.
Access via the Web Browser
Rivanna can be accessed right from the Web browser. Although this may be helpful for those on systems where a proper terminal cannot be accessed, it cannot leverage the features of your own desktop or laptop, for example advanced editors or keeping the file system of your machine in sync with the HPC file system.
Therefore, practical experience shows that you benefit while using a terminal and your own computer for software development.
Additional documentation by the Rivanna system staff is provided at
Access Rivanna from macOS and Linux
To access Rivanna from macOS, use the terminal and ssh to connect to it. We will provide an in-depth configuration tutorial on this later on. We use the same programs as on Linux and Windows, so we only have to provide one set of documentation and it is uniform across platforms.
Please remember to use
$ ssh-agent
$ ssh-add
to activate the ssh agent in your terminal.
Access Rivanna from Windows
While exploring the various choices for accessing Rivanna from Windows, you can use PuTTY or MobaXterm.
However, a possibly better recent choice is Git Bash. Git Bash is trivial to install. However, you need to read the configuration options carefully. READ CAREFULLY. Let us know your options so we can add them here.
To simplify the setup of a Windows computer for research we have prepared a separate guide.
It addresses the installation of Git Bash, Python, PyCharm (much better than VSCode), and other useful tools such as Chocolatey.
With Git Bash, you get a bash terminal that works the same as a Linux bash terminal and is similar to the zsh terminal on a Mac.
Set up the connection (mac/Linux)
The first thing to do when trying to connect to Rivanna is to create an ssh key if you have not yet done so.
To do this use the command
ssh-keygen
Please make sure you use a passphrase when generating the key. Make
sure not to just skip the passphrase by typing ENTER, but instead
use a real, not easy to guess passphrase, as this is best practice and
not in violation of security policies. You can always use
ssh-agent and ssh-add so you do not have to repeatedly enter
your passphrase.
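A minimal sketch of that sequence, assuming the default RSA key location that is used later in this section:
ssh-keygen                     # accept the default ~/.ssh/id_rsa and choose a strong passphrase
eval "$(ssh-agent -s)"         # start an ssh agent for this shell
ssh-add ~/.ssh/id_rsa          # enter the passphrase once; it is reused for this session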
The ssh-keygen program will generate a public-private keypair in the
files ~/.ssh/id_rsa.pub (public key) and ~/.ssh/id_rsa (private key). Please
never share the private key with anyone.
Next, we need to add the public key to Rivanna’s
rivanna:~/.ssh/authorized_keys file
. The easiest way to do this is
to use the program ssh-copy-id
.
ssh-copy-id username@rivanna.hpc.virginia.edu
Please use your password when using ssh-copy-id
. Your username is
your UVA computing id. Now you should be ready to connect with
ssh username@rivanna.hpc.virginia.edu
Commandline editor
Sometimes it is necessary to edit files on Rivanna. For this, we recommend that you learn a command line editor. There are lots of debates on which one is better. When I was young I used vi, but found it too cumbersome. So I spent one day learning emacs, which is just great and all you need to learn. You can install it also on Linux, Mac, and Windows. This way you have one editor with very advanced features that is easy to learn.
If you do not have one day to familiarize yourself with editors such as emacs, vim, or vi, you can use editors such as nano and pico.
The best commandline editor is emacs. It is extremely easy to learn when using just the basics. The advantage is that the same commands also work in the terminal.
Keys | Action |
---|---|
CTRL-x CTRL-s | Save in emacs |
CTRL-x CTRL-c | Leave emacs |
CTRL-g | If something goes wrong |
CTRL-a | Go to beginning of line |
CTRL-e | Go to end of line |
CTRL-k | Delete till end of line from cursor |
cursor keys | Just work ;-) |
PyCharm
The best editor for Python development is PyCharm. Install it on your desktop. The educational version is free.
VSCode
An inferior editor for python development is VSCode. It can be configured to also use a Remote-SSH plugin.
Moving data from your desktop to Rivanna
To copy a directory, use scp.
If only a few lines have changed, use rsync.
To mount Rivanna's file system onto your computer, use fuse-ssh.
This will allow you, for example, to use PyCharm to directly edit files on Rivanna.
Developers, however, often also use GitHub to push the code and then use pull on Rivanna to get the code. This has the advantage that you can use PyCharm on your local system while synchronizing the code via git onto Rivanna.
However, often scp and rsync may just be sufficient.
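As hedged examples, assuming the rivanna host alias from the config file shown in the next subsection and the project directory convention used later in this tutorial (replace abc2de with your computing ID):
scp -r mycode rivanna:/project/bii_dsc_community/abc2de/mycode          # copy a whole directory
rsync -avz mycode/ rivanna:/project/bii_dsc_community/abc2de/mycode/    # transfer only the changes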
Example Config file
Replace abc2de with your computing ID and place this on your computer in ~/.ssh/config:
ServerAliveInterval 60
Host rivanna
User abc2de
HostName rivanna.hpc.virginia.edu
IdentityFile ~/.ssh/id_rsa.pub
Host b1
User abc2de
HostName biihead1.bii.virginia.edu
IdentityFile ~/.ssh/id_rsa.pub
Host b2
User abc2de
HostName biihead2.bii.virginia.edu
IdentityFile ~/.ssh/id_rsa.pub
Adding it allows you to just ssh to the machines with
ssh rivanna
ssh b1
ssh b2
Rivanna’s filesystem
The file systems on Rivanna have some restrictions that are set by system-wide policies that you need to inspect:
- TODO: add link here
We distinguish:
- home directory: /home/<uvaid> or ~
- scratch directory: /scratch/bii_dsc_community/<uvaid>
- project directory: /project/bii_dsc_community/projectname/<uvaid>
In your home directory, you will find system directories and files such as
~/.ssh
, ~/.bashrc
and ~/.zshrc
The difference in the file systems is explained at
Dealing with limited space under HOME
As we conduct research, you may find that the file space in your home
directory is insufficient. This is especially the case when using
conda. Therefore, it is recommended that you create softlinks from
your home directory to a location where you have more space. This is
typically somewhere under /project.
We describe next how to relocate some of the directories to /project.
In ~/.bashrc, add the following lines to define a project alias and a PROJECT environment variable. Open the file with an editor, e.g.
$ vi ~/.bashrc
and add
PS1="\w \$"
alias project='cd /project/bii_dsc_community/$USER'
export PROJECT="/project/bii_dsc_community/$USER"
At the end of the .bashrc file, add
cd $PROJECT
so you always cd directly into your project directory instead of your home directory.
The home directory only has 50GB. Installing everything in the home directory will exceed this allocation and cause problems during execution. So it’s better to move conda and all other package installation directories to $PROJECT.
First explore what is in your home directory and how much space it consumes with the following commands.
cd $HOME
$ ls -lisa
$ du -h .
Select from this list the directories that you want to move (those that you have not already moved).
Let us assume you want to move the directories .local, .vscode-server, and .conda.
It is important that .conda and .local are
moved, as they may include lots of files and you may run out of space
quickly.
Hence, do the following next.
$ cd $PROJECT
$ mv ~/.local .
$ mv ~/.vscode-server .
$ mv ~/.conda .
Then create symbolic links to the home directory installed folder.
$ cd $PROJECT
$ ln -s $PROJECT/.local ~/.local
$ ln -s $PROJECT/.vscode-server ~/.vscode-server
$ ln -s $PROJECT/.conda ~/.conda
Check all symbolic links:
$ ls -lisa
20407358289 4 lrwxrwxrwx 1 $USER users 40 May 5 10:58 .local -> /project/bii_dsc_community/djy8hg/.local
20407358290 4 lrwxrwxrwx 1 $USER users 48 May 5 10:58 .vscode-server -> /project/bii_dsc_community/djy8hg/.vscode-server
In case you use python venv, do not place them in home but under project.
module load python3.8
python -m venv $PROJECT/ENV3
source $PROJECT/ENV3/bin/activate
If you succeed, you can also place the source line in your .bashrc file.
In case you use conda and Python, we also recommend that you create a venv from the conda Python, so you have a copy of it in ENV3 and, if something goes wrong, it is easy to recreate from your default Python. Those that use this path ought to improve how to do this here.
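A hedged sketch of that approach; the anaconda module name is an assumption, so verify the exact name with module spider anaconda:
module load anaconda                     # assumed module name; verify with: module spider anaconda
python -m venv $PROJECT/ENV3             # create the venv from the conda-provided python
source $PROJECT/ENV3/bin/activate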
Load modules
Modules are preconfigured packages that allow specific software to be loaded into your environment without needing to install it from source. To find out more about a particular package, such as cmake, you can use the command
module spider cmake # check whether cmake is available and details
Load the needed module (you can add version info). Note that some
modules depend on other modules (clang/10.0.1 depends on
gcc/9.2.0, so gcc needs to be loaded first).
# module load gcc/9.2.0 clang/10.0.1
module load gcc clang
module load cmake/3.23.3 git/2.4.1 ninja/1.10.2-py3.8 llvm cuda/11.4.2
check currently loaded modules
module list
clean all the modules
module purge
Request GPUs to use interactively
TODO: explain what -A is
rivanna$ ijob -c number_of_cpus -A group_name -p queue_name --gres=gpu:gpu_model:number_of_gpus --time=day-hours:minutes:seconds
An example to request 1 CPU with one A100 GPU for 10 minutes in the ‘gpu’ partition is
rivanna$ ijob -c 1 -A bii_dsc_community -p gpu --gres=gpu:a100:1 --time=0-00:10:00
Rivanna has different partitions with different resource availability
and charging rates. dev is free but limited to 1 hour per
session/allocation, and no GPU is available. To list the different
partitions, use the command qlist.
Queue (partition) | Total Cores | Free Cores | Jobs Running | Jobs Pending | Time Limit | SU Charge |
---|---|---|---|---|---|---|
bii | 4640 | 4306 | 38 | 1949 | 7-00:00:00 | 1 |
standard | 5644 | 1391 | 706 | 12 | 7-00:00:00 | 1 |
dev | 456 | 426 | 2 | 1 | 1:00:00 | 0 |
parallel | 5680 | 364 | 32 | 21 | 3-00:00:00 | 1 |
instructional | 2320 | 2180 | | | 3-00:00:00 | 1 |
largemem | 208 | 123 | 10 | 2 | 4-00:00:00 | 1 |
gpu | 2372 | 1745 | 67 | 1 | 3-00:00:00 | 3 |
bii-gpu | 608 | 554 | 4 | | 3-00:00:00 | 1 |
bii-largemem | 288 | 171 | 1 | | 7-00:00:00 | 1 |
To list the limits, use the command qlimits
Queue (partition) | Maximum Submit per User | Maximum Cores(GPU) per Job | Minimum Cores | Maximum Mem/Node in MB | Maximum Mem/Core in MB | Default Mem/Core in MB | Maximum Nodes per Job | Minimum Nodes per Job |
---|---|---|---|---|---|---|---|---|
bii | 10000 | cpu=400 | | 354000+ | | 9400 | 112 | |
standard | 10000 | cpu=1000 | | 255000+ | | 9000 | 1 | |
dev | 10000 | cpu=16 | | 127000+ | 9000 | 6000 | 2 | |
parallel | 2000 | cpu=1000 | 4 | 384000 | 9600 | 9000 | 25 | 2 |
instructional | 2000 | cpu=20 | | 112000+ | | 6000 | 5 | |
largemem | 2000 | cpu=32 | | 1000000+ | 64000 | 60000 | 2 | |
gpu | 10000 | gres/gpu=32 | | 128000+ | 32000 | 6000 | 4 | |
bii-gpu | 10000 | | | 384000+ | | 9400 | 12 | |
bii-largemem | 10000 | | | 1500000 | | 31000 | 2 | |
Linux commands for HPC
Many useful commands can be found in Gregor’s book at
The following additional commands are quite useful on HPC systems:
command | description |
---|---|
allocations | check available accounts and balances |
hdquota | check the storage you have used |
du -h --max-depth=1 | check which directory uses the most space |
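For example, to see at a glance which top-level directories under your home directory use the most space, du can be combined with sort:
du -h --max-depth=1 $HOME | sort -h      # the largest directories are listed last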
SLURM Batch Parameters
We present next a number of default parameters for using a variety of GPUs on Rivanna. Please note that you may need to adapt some parameters to adjust cores or memory according to your application.
Running on v100
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --partition=bii-gpu
#SBATCH --account=bii_dsc_community
#SBATCH --gres=gpu:v100:1
#SBATCH --job-name=MYNAME
#SBATCH --output=%u-%j.out
#SBATCH --error=%u-%j.err
Running on a100-40GB
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --partition=bii-gpu
#SBATCH --account=bii_dsc_community
#SBATCH --gres=gpu:a100:1
#SBATCH --job-name=MYNAME
#SBATCH --output=%u-%j.out
#SBATCH --error=%u-%j.err
Running on special fox node a100-80GB
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --partition=bii-gpu
#SBATCH --account=bii_dsc_community
#SBATCH --gres=gpu:a100:1
#SBATCH --job-name=MYNAME
#SBATCH --output=%u-%j.out
#SBATCH --error=%u-%j.err
#SBATCH --reservation=bi_fox_dgx
#SBATCH --constraint=a100_80gb
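As a hedged end-to-end sketch, any of the headers above (shown here with the v100 variant) can be combined into a complete job script. The script name job.slurm, the training script train.py, and the ENV3 virtual environment are assumptions based on earlier examples in this tutorial; adjust them to your setup.
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --partition=bii-gpu
#SBATCH --account=bii_dsc_community
#SBATCH --gres=gpu:v100:1
#SBATCH --job-name=MYNAME
#SBATCH --output=%u-%j.out
#SBATCH --error=%u-%j.err

module purge
module load python3.8                    # module name as used earlier in this tutorial; adjust as needed
source $PROJECT/ENV3/bin/activate        # virtual environment created under $PROJECT
python train.py                          # hypothetical training script
Submit and monitor the job with
sbatch job.slurm
squeue -u $USER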
Some suggestions
When compiling large projects, you may need to make sure you have enough time and memory to conduct such compiles. This can best be achieved by using an interactive node, possibly from the large memory partition.
References
- Presentation about SLURM on rivanna
- Tutorial on using Rivanna
- Gregor’s book chapters
- MPI with python:
- https://cloudmesh.github.io/cloudmesh-mpi/report-mpi.pdf
- https://github.com/cloudmesh/cloudmesh-mpi
- Tutorials about cybertraining: https://cybertraining-dsc.github.io/docs/tutorial/
- Modules about cybertraining: https://cybertraining-dsc.github.io/docs/modules/
Help Support
When requesting help from Gregor or anyone else, make sure to completely specify the issue; a lot of things cannot be solved if you are not clear on the issue and where it is occurring. Include:
- The issue you are encountering.
- Where it is occurring.
- What you have done to try to resolve the issue.
A good example is:
I ran application xyz from url xyz on Rivanna. I placed the code in directory /project/… or I placed the data in /project/… The download worked and I placed about 600GB. However, when I uncompress the data with command xyz I get the error xyz. What should we do now?
5 - Rivanna Pod
This documentation is so far only useful for beta testers. In this group we have
- Gregor von Laszewski
Dear GPU beta testers,
Thank you for signing up as beta testers for the new GPU POD on Rivanna. We appreciate your patience during the longer-than-expected installation phase. This email will unveil some details about the new hardware and provide instructions on access and usage.
Introducing the NVIDIA DGX BasePOD
You might have seen or heard the term SuperPOD in our earlier communications or from other sources. Since then the vendor has rebranded the specific type purchased by UVA as BasePOD, which as of today comprises 10 DGX A100 nodes, each with 8 A100 GPU devices, 2 TB of local node memory (per node), and 80 GB of GPU memory (per GPU device). I’ll just refer to it as the POD for the remainder of the email.
Unbeknown to most users, these nodes have been up and running on Rivanna since last summer as regular GPU nodes. We are pleased to inform you that the following Advanced Features have now been enabled on the POD:
- NVLink for fast multi-GPU communication
- GPUDirect RDMA Peer Memory for fast multi-node multi-GPU communication
- GPUDirect Storage with 200 TB IBM ESS3200 (NVMe) SpectrumScale storage array
What this means to you is that the POD is ideal for the following scenarios:
- The job needs multiple GPUs and/or even multiple nodes.
- The job (can be single- or multi-GPU) is I/O intensive.
- The job (can be single- or multi-GPU) requires more than 40 GB GPU memory. (We have 12 A100 nodes in total, 10 of which are the POD and 2 are regular with 40 GB GPU memory per device.)
Detailed specs can be found in the official document (Chapter 3.1):
Accessing the POD
Allocation
As a token of appreciation, we have created a superpodtest allocation such that you may run benchmarks and tests without spending your own allocation. A single job can request up to 4 nodes with 32 GPUs. Before running multi-node jobs, please make sure it can scale well to 8 GPUs on a single node.
We kindly ask you to keep other beta testers and the general users in mind by refraining from dominating the queue with high-throughput jobs through this allocation.
If you are the PI and wish to delegate the testing work to someone else in your group, you are welcome to provide one or two names with their computing IDs.
Slurm script
Please include the following lines:
#SBATCH -p gpu
#SBATCH --gres=gpu:a100:X # replace X with the number of GPUs per node
#SBATCH -C gpupod
#SBATCH -A superpodtest
Open OnDemand
In Optional: Slurm Option write:
-C gpupod
Remarks
Many of you may have already used the POD by simply requesting an A100 node, since 10 out of the total 12 A100 nodes are POD nodes. Hence, if you do not see any performance improvement, do not be disappointed. As we expand our infrastructure, there could be changes to the Slurm directives and job resource limitations in the future. Please keep an eye out for our announcements.
Usage examples
Deep learning
We will be migrating toward NVIDIA’s NGC containers for deep learning frameworks such as PyTorch and TensorFlow, as they have been heavily optimized to achieve excellent multi-GPU performance. These containers have not yet been installed as modules but can be accessed under /share/resources/containers/singularity:
- pytorch_23.03-py3.sif
- tensorflow_23.03-tf1-py3.sif
- tensorflow_23.03-tf2-py3.sif
(NGC has its own versioning scheme. The corresponding PyTorch, TensorFlow 1, and TensorFlow 2 versions are 2.0.0, 1.15.5, and 2.11.0, respectively.)
The singularity command is of the form:
singularity run --nv /path/to/sif python /path/to/python/script
Warning: Distributed training is not automatic! Your code must be parallelizable. If you are not familiar with this concept, please visit:
- TF distributed training https://www.tensorflow.org/guide/distributed_training
- PyTorch DDP https://pytorch.org/docs/stable/notes/ddp.html
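As a hedged illustration of a single-node, multi-GPU launch with PyTorch DDP inside the NGC container: the container path follows the listing above, and train.py is a hypothetical script that already uses torch.distributed.
singularity exec --nv /share/resources/containers/singularity/pytorch_23.03-py3.sif \
    torchrun --nproc_per_node=8 train.py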
MPI codes
Please check the manual for your code regarding the relationship between the number of MPI ranks and the number of GPUs. For computational chemistry codes (e.g. VASP, QuantumEspresso, LAMMPS) the two are oftentimes equal, e.g.
#SBATCH --gres=gpu:a100:8
#SBATCH --ntasks-per-node=8
If you are building your own code, please load the modules nvhpc and cuda which provide NVIDIA compilers and CUDA libraries. The compute capability of the POD A100 is 8.0.
For documentation and demos, refer to the Resources section at the bottom of this page: https://developer.nvidia.com/hpc-sdk
We will be updating our website documentation gradually in the near future as we iron out some operational specifics. GPU-enabled modules are now marked with a (g) in the module avail command as shown below:
TODO: output from module avail to be included
6 - Raspberry Pi Cluster
Links
The main web page for this is at https://piplanet.org
Additional tutorials and resources are:
7 - Rivanna and Singularity
TODO: that’s where the images are: /share/resources/containers/singularity
Singularity
Singularity is a container runtime that implements a unique security model to mitigate privilege escalation risks and provides a platform to capture a complete application environment into a single file (SIF).
Singularity is often used in HPC centers.
The University of Virginia granted us special permission to create Singularity images on Rivanna. We discuss here how to build and run Singularity images.
Access
In order for you to be able to access singularity and build images, you must be in the following groups:
biocomplexity
nssac_students
bii_dsc_community
To find out if you are, ssh into rivanna and issue the command
$ groups
If any of the groups is missing, please send Gregor an e-mail at
laszewski@gmail.com
.
build.def
To build an image you will need a build definition file.
We show next an example of a simple build.def file that uses
internally an NVIDIA NGC PyTorch container.
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.02-py3
Next you can follow the steps that are detailed in https://docs.sylabs.io/guides/3.7/user-guide/definition_files.html#sections
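As a hedged illustration, a build.def can also carry %post and %environment sections as described in the linked documentation; the pip package below is only a hypothetical example of an extra dependency.
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.02-py3

%post
    pip install cloudmesh-common    # hypothetical extra dependency

%environment
    export LC_ALL=C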
However, for Rivanna we MUST create the image as discussed next.
Creating the Singularity Image
In order for you to create a singularity container from the
build.def
file please login to either of the following special nodes
on Rivanna:
biihead1.bii.virginia.edu
biihead2.bii.virginia.edu
For example:
ssh $USER@biihead1.bii.virginia.edu
where $USER is your computing ID on Rivanna.
Now that you are logged in to the special node, you can create the singularity image with the following command:
sudo /opt/singularity/3.7.1/bin/singularity build output_image.sif build.def
Note: It is important that you type in only this command. If you modify the name output_image.sif or build.def, the command will not work and you will receive an authorization error.
In case you need to rename the image to a better name, please use the mv command.
In case you also need a name other than build.def,
the following Makefile is very useful. We assume you use myimage.def
and myimage.sif. Include it in a Makefile such as:
BUILD=myimage.def
IMAGE=myimage.sif
image:
cp ${BUILD} build.def
sudo /opt/singularity/3.7.1/bin/singularity build output_image.sif build.def
cp output_image.sif ${IMAGE}
make clean
clean:
rm -rf build.def output_image.sif
Having such a Makefile will allow you to use the command
make image
and the image myimage.sif will be created. With make clean you
delete the temporary files build.def and output_image.sif.
Create a singularity image for tensorflow
TODO
Work with Singularity container
Now that you have an image, you can use it while using the documentation provided at https://www.rc.virginia.edu/userinfo/rivanna/software/containers/
Run GPU images
To use an NVIDIA GPU with Singularity, the --nv flag is needed.
singularity exec --nv output_image.sif python myscript.py
TODO: THE NEXT PARAGRAPH IS WRONG
Since Python is defined as the default command to be executed, singularity passes the argument(s) after the image name, i.e. myscript.py, to the Python interpreter. So the above singularity command is equivalent to
singularity run --nv output_image.sif myscript.py
Run Images Interactively
ijob -A mygroup -p gpu --gres=gpu -c 1
module purge
module load singularity
singularity shell --nv output_image.sif
Singularity Filesystem on Rivanna
The following paths are exposed to the container by default
- /tmp
- /proc
- /sys
- /dev
- /home
- /scratch
- /nv
- /project
Adding Custom Bind Paths
For example, the following command adds the /scratch/$USER directory as an overlay without overlaying any other user directories provided by the host:
singularity run -c -B /scratch/$USER output_image.sif
To add the /home directory on the host as /rivanna/home inside the container:
singularity run -c -B /home:/rivanna/home output_image.sif
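Similarly, a hedged example that also exposes your project directory (the path follows the convention used earlier in this tutorial) while running a script with GPU support:
singularity exec --nv -B /project/bii_dsc_community/$USER output_image.sif python myscript.py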
FAQ
Adding singularity to slurm scripts
TBD
Running on v100
TBD
Running on a100-40GB
TBD
Running on a100-80GB
TBD
Running on special fox node a100-80GB
TBD