1 - Getting Started with the bii_dsc_community

Getting Started

Contributing your tutorial and experiences

Please contribute to infomall.org as your experiences will help. Please remember that technology evolves fast, and we like to stay up to date by improving information.

Each page has an "edit here" feature that you can use to propose changes. The changes will be reviewed by Gregor and are not automatically posted online.

Once a change is accepted, the Web site will be published and the updates become visible. Send an e-mail to Gregor for urgent updates.

Activating your account

Send an e-mail to Gregor with the following content:

Subject: Activate my account

Body: (fill in lastname and firstname; do not use all caps)

Firstname: 
Lastname:
e-mail:
github.com:

* [ ] Please add me to the `discord` 
* [ ] Please add me to the unix groups:
  * [ ] `biocomplexity`
  * [ ] `nssac_students`
  * [ ] `bii_dsc_community`

Preparing your computer for research

See the documentation at

Using Docker on your computer

To isolate your computer from changes and to develop portable code, we recommend using Docker images. This is especially the case when using GPUs on your computer, as Docker images are these days the default distribution mechanism for NVIDIA research software.

Using Singularity on your computer

As Rivanna uses Singularity, it is also beneficial to use Singularity on your local computer, as this can be used to create images for Rivanna. However, note that due to the transfer speeds to Rivanna the experience may be limited. For that reason, we recommend you visit our [Tutorial on Singularity on Rivanna](https://infomall.org/uva/docs/tutorial/singularity/).

Getting an account on Rivanna

Please read

Do not make your account insecure. In Rivanna’s documentation you will find a statement that we do NOT RECOMMEND TO FOLLOW, as it is not best security practice and can in almost all cases be handled differently. The official UVA Rivanna Web site states:

“Sometimes you will need to enable passwordless ssh. We allow passwordless ssh to frontend nodes from UVA IP addresses. Key authentication works by matching two halves of an encrypted keypair. The “public” key is placed within your home directory on the remote server and the “private” key is kept safely on your own workstation. You should treat private keys as securely as you would any password.”

Instead, protect your key with a passphrase and use ssh-agent so you do not have to retype it:

computer> 
  eval `ssh-agent`
  ssh-add

Using Python

When using anaconda, be careful as it takes over your python installation and may introduce inconsistent libraries when you do more complex things. Evaluate whether you need anaconda or not. In many cases it is best to just use vanilla python and pip.

You can also switch between anaconda and regular python. For that, DO NOT USE

conda init

Fix or comment out the anaconda section in your .bashrc or .zshrc files.

If you are a conda expert, give us some tips and tutorials on this topic.

Always check if you use the correct version of python with

computer>
  which python
  python --version
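The check above can also be scripted. The following is a minimal sketch, not part of any official tooling: the function name `in_active_venv` is hypothetical, and it relies on the `VIRTUAL_ENV` variable that the `activate` script of a python venv exports.

```shell
# in_active_venv PYTHON_PATH
# Succeeds if PYTHON_PATH lies inside the currently activated venv.
# Hypothetical helper; VIRTUAL_ENV is exported by a venv's activate script.
in_active_venv() {
    # Without an activated venv nothing can match.
    [ -n "${VIRTUAL_ENV:-}" ] || return 1
    case "$1" in
        "$VIRTUAL_ENV"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage:
#   in_active_venv "$(command -v python)" || echo "python is not from your venv"
```

If the message is printed although you expected your venv to be active, check whether conda or another entry in your PATH takes precedence.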

Please keep in mind: when attending university classes, some teachers may give you convenient but insufficient instructions on how to use python. They are typically designed to make the use of python easy for a specific class and not necessarily easy for research.

Please keep in mind that you may have python versions that do not work properly on your computer if you attended classes some years back. You will likely need to update your python. Often it is good to uninstall your previous version and reinstall.

If you need multiple python versions, for example because teacher A wants version X and teacher B wants version Y, this is possible. Just use python virtual environments, containers, or virtual machines. Which one you choose is up to you.

Using Rivanna

Read

Using Singularity on Rivanna

Read

Using Docker on Rivanna via Singularity

Which they do not document but we do on infomall.org

I will go into this in the tutorial. If you have already created a passwordless key, please redo it with a passphrase ….

Onramping Tutorial with Gregor

If you need help assessing your computer for research, you can optionally send the following info to me.

os:
size ram:
size hdd/ssd:
free space on hdd/ssd:
date purchased:

We observed that when using Chrome, PyCharm, and Zoom you may need lots of memory. Shut down all other applications. We recommend 16GB RAM these days. However, many students have 8GB, which may slow things down in some cases as you may hit the memory limit.

For example, when Gregor runs Chrome and PyCharm he uses up 8.1GB RAM, so an 8GB machine would slow down. However, your RAM usage may vary depending on the plugins, the software versions, and the OS you use.

  • Please make sure to have some free space on your computer's HDD; send me how much free space you have
  • if Windows, please install gitbash before the meeting
  • if Windows, I recommend chocolatey, but be careful what you install
  • make sure you know how to use the UVA VPN
  • set up an ssh key with ssh-keygen and use a passphrase; WRITE THE PASSPHRASE DOWN
  • set up ~/.ssh/config as shown in the Example Config file section of the Rivanna page
  • upload your public ssh key to GitHub

Make sure you employ a backup strategy using an external HDD, Google, or something similar. I have seen too many computer HDDs break, and this is standard best practice. We can discuss this in the meeting.

If anything is unclear or you have questions, let me know. We will also go through the ssh key setup if you do not understand it.

Editor

  • use PyCharm (best) on your local computer, alternatively VSCode
  • learn a commandline editor for Rivanna; emacs is best, alternatives are nano, pico, and vim

Cloudmesh is useful

You will see that cloudmesh has many features that you will find useful. We focus here on a number of libraries useful for rivanna.

Please create a venv; how to do this depends on your OS.

Name it ~/ENV3 (if you use conda, do it in whatever fashion conda does; as I do not use conda, you can help us by writing documentation about it).

Create and activate it, then install cloudmesh:

computer>
  python -m venv ~/ENV3
  source ~/ENV3/bin/activate
  pip install pip -U
  pip install cloudmesh-common
  pip install cloudmesh-sbatch
  pip install cloudmesh-rivanna
  cms help

On Rivanna

rivanna>
  python -m venv /project/bii_dsc_community/$USER/ENV3
  source /project/bii_dsc_community/$USER/ENV3/bin/activate
  pip install pip -U
  pip install cloudmesh-common
  pip install cloudmesh-sbatch
  pip install cloudmesh-rivanna
  pip install cloudmesh-gpu
  cms help

Make sure you are in Gregor's discord.

In the future, learn how to use the cloudmesh StopWatch so you can conveniently augment your code with timers.

Gregor von Laszewski laszewski@gmail.com

2 - Rivanna

Rivanna

Rivanna is the University of Virginia’s High-Performance Computing (HPC) system. It is a centralized resource and has many software packages available. Currently, the Rivanna supercomputer has 603 nodes with over 20476 cores and 8PB of various storage. Rivanna has multiple nodes equipped with GPUs including RTX2080, RTX3090, K80, P100, V100, A100-40GB, and A100-80GB.

Communication

We have a team discord at: uva-bii-community

https://discord.gg/7K2PQqxYz7

To be added, send an e-mail to laszewski@gmail.com with the subject: Please add me to discord for bii_dsc_community

Please subscribe if you work on Rivanna and are part of the bii_dsc_community.

Rivanna at UVA

The official Web page for Rivanna is located at

In case you need support you can ask the staff using a ticket system at

Before you use Rivanna, it is important to attend a seminar that is given upon request every Wednesday. To sign up, use the link:

Please note that in this introduction we will provide you with additional information that may make the use of Rivanna easier. We encourage you to add to this information and share your tips.

Getting Permissions to use Rivanna

To use Rivanna you need to have special authorization. In case you work with a faculty member you will need to be added to a special group (or multiple) to be able to access it. The faculty member will know which group it is. This is managed via the group management portal by the faculty member. Please do not use the previous link and instead communicate with your faculty member first.

  • Note: For BII work conducted with Geoffrey Fox or Gregor von Laszewski, please contact Gregor at laszewski@gmail.com

TODO: IS THIS THE CASE?

Once you are added to the group, you will receive an invitation e-mail to set up a password for the research computing support portal. If you do not receive such an e-mail, please visit the support portal at

TBD

This password is also the password that you will use to log into the system.

END TODO IS THIS THE CASE

After your account is set up, you can try to log in through the Web-based access. Please test it to make sure you have the proper access already.

However, we will typically not use the online portal but instead use the more advanced batch system, as it provides significant advantages when you submit and manage multiple jobs on Rivanna.

Accessing an HPC Computer via command line

If you need to use X11 on Rivanna, you can find documentation in the Rivanna documentation. In case you need to run jupyter notebooks directly on Rivanna, please also consult the Rivanna documentation.

VPN (required)

You can access Rivanna via ssh only through the VPN. UVA requires you to use the VPN to access any computer on campus. The VPN is offered by IT services but officially only supported for Mac and Windows.

However, if you have a Linux machine you can follow the VPN install instructions for Linux. If you have issues installing it, attend an online support session with the Rivanna staff.

Access via the Web Browser

Rivanna can be accessed right from the Web browser. This may be helpful for those with systems where a proper terminal cannot be accessed. However, it cannot leverage the features of your own desktop or laptop, such as advanced editors or keeping the file system of your machine in sync with the HPC file system.

Therefore, practical experience shows that you benefit from using a terminal and your own computer for software development.

Additional documentation by the Rivanna system staff is provided at

Access Rivanna from macOS and Linux

To access Rivanna from macOS or Linux, use the terminal and connect with ssh. We will provide an in-depth configuration tutorial on this later on. We use the same programs on macOS, Linux, and Windows, so we only have to provide one documentation that is uniform across platforms.

Please remember to use

$ eval `ssh-agent`
$ ssh-add

to activate the ssh agent in your terminal.

Access Rivanna from Windows

Among the various choices for accessing Rivanna from Windows are putty and MobaXterm.

However, more recently a possibly better choice has become available with gitbash. Git bash is trivial to install. However, you need to read the configuration options carefully. READ CAREFULLY. Let us know your options so we can add them here.

To simplify the setup of a Windows computer for research we have prepared a separate

It addresses the installation of gitbash, Python, PyCharm (much better than VSCode), and other useful tools such as chocolatey.

With git bash, you get a bash terminal that works the same as a Linux bash terminal and is similar to the zsh terminal on a Mac.

Set up the connection (mac/Linux)

The first thing to do when trying to connect to Rivanna is to create an ssh key if you have not yet done so.

To do this use the command

ssh-keygen

Please make sure you use a passphrase when generating the key. Do not skip the passphrase by just pressing ENTER; instead, use a real passphrase that is not easy to guess, as this is best practice and not in violation of security policies. You can always use ssh-agent and ssh-add so you do not have to repeatedly enter your passphrase.

The ssh-keygen program will generate a public-private keypair in the directory ~/.ssh, namely ~/.ssh/id_rsa.pub (public key) and ~/.ssh/id_rsa (private key). Please never share the private key with anyone.
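To verify that an existing key really is protected by a passphrase, a check along the following lines can be used. This is a minimal sketch that assumes OpenSSH's `ssh-keygen` is installed; the function name `key_has_passphrase` is hypothetical.

```shell
# key_has_passphrase KEYFILE
# Succeeds if the private key in KEYFILE is protected by a passphrase.
# Hypothetical helper; returns 2 if KEYFILE does not exist.
key_has_passphrase() {
    # A missing file is reported as an error, not as "protected".
    [ -f "$1" ] || return 2
    # ssh-keygen -y prints the public key that belongs to a private key.
    # With an empty passphrase (-P "") this only succeeds for unprotected keys.
    if ssh-keygen -y -P "" -f "$1" >/dev/null 2>&1 </dev/null; then
        return 1
    fi
    return 0
}

# Example usage:
#   key_has_passphrase ~/.ssh/id_rsa || echo "please add a passphrase"
```

A passphrase can be added to or changed on an existing key with `ssh-keygen -p -f ~/.ssh/id_rsa`, so the key does not need to be regenerated.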

Next, we need to add the public key to the file ~/.ssh/authorized_keys on Rivanna. The easiest way to do this is to use the program ssh-copy-id.

ssh-copy-id username@rivanna.hpc.virginia.edu

Please use your password when using ssh-copy-id. Your username is your UVA computing id. Now you should be ready to connect with

ssh username@rivanna.hpc.virginia.edu

Commandline editor

Sometimes it is necessary to edit files on Rivanna. For this, we recommend that you learn a command line editor. There are lots of debates on which one is better. When I was young I used vi, but found it too cumbersome. So I spent one day learning emacs, which is just great and all you need to learn. You can also install it on Linux, Mac, and Windows. This way you have one editor with very advanced features that is easy to learn.

If you do not have one day to familiarize yourself with editors such as emacs, vim, or vi, you can use editors such as nano and pico.

The best commandline editor is emacs. It is extremely easy to learn when using just the basics. The advantage is that several of the same commands (such as CTRL-a, CTRL-e, and CTRL-k) also work in the terminal.

| Keys | Action |
|---|---|
| CTRL-x CTRL-s | Save in emacs |
| CTRL-x CTRL-c | Leave |
| CTRL-g | If something goes wrong |
| CTRL-a | Go to beginning of line |
| CTRL-e | Go to end of line |
| CTRL-k | Delete from cursor till end of line |
| cursor keys | Just work ;-) |

PyCharm

The best editor for python development is PyCharm. Install it on your desktop. The education version is free.

VSCode

An inferior editor for python development is VSCode. It can be configured to also use a Remote-SSH plugin.

Moving data from your desktop to Rivanna

To copy a directory, use scp.

If only a small part of the data has changed, use rsync, which transfers only the differences.

To mount Rivanna's file system onto your computer, use sshfs (FUSE). This will allow you, for example, to use PyCharm to directly edit files on Rivanna.

Developers, however, often also use GitHub: they push the code to git and then on Rivanna use pull to get the code from git. This has the advantage that you can use PyCharm on your local system while synchronizing the code via git onto Rivanna.

However, often scp and rsync are sufficient.

Example Config file

Replace abc2de with your computing id

place this on your computer in ~/.ssh/config

ServerAliveInterval 60

Host rivanna
     User abc2de
     HostName rivanna.hpc.virginia.edu
     IdentityFile ~/.ssh/id_rsa.pub
     
Host b1
     User abc2de
     HostName biihead1.bii.virginia.edu
     IdentityFile ~/.ssh/id_rsa.pub
     
Host b2
     User abc2de
     HostName biihead2.bii.virginia.edu
     IdentityFile ~/.ssh/id_rsa.pub

Adding it allows you to just ssh to the machines with

ssh rivanna
ssh b1
ssh b2

Rivanna’s filesystem

The file systems on Rivanna have some restrictions that are set by system-wide policies that you need to be aware of:

  • TODO: add link here

You can also see your quota with

rivanna>
  hdquota

We distinguish

  • home directory: /home/<uvaid> or ~
  • /scratch/bii_dsc_community/<uvaid>
  • /project/bii_dsc_community/projectname/<uvaid>

In your home directory, you will find system directories and files such as ~/.ssh, ~/.bashrc, and ~/.zshrc

The difference in the file systems is explained at

Dealing with limited space under HOME

As we conduct research you may find that the file space in your home directory is insufficient. This is especially the case when using conda. Therefore, it is recommended that you create softlinks from your home directory to a location where you have more space. This is typically somewhere under /project.

We describe next how to relocate some of the directories to /project

Edit ~/.bashrc and add the following lines to define a project directory.

$ vi ~/.bashrc

PS1="\w \$"
alias project='cd /project/bii_dsc_community/$USER'
export PROJECT="/project/bii_dsc_community/$USER"

At the end of the .bashrc file use

cd $PROJECT

So you always cd directly into your project directory instead of home.

The home directory only has 50GB. Installing everything in the home directory will exceed the allocation and cause problems with any execution. So it is better to move conda and all other package installation directories to $PROJECT.

First, explore what is in your home directory and how much space it consumes with the following commands.

$ cd $HOME
$ ls -lisa
$ du -h .

Select from this list the directories that you want to move (those that you have not already moved).

Let us assume you want to move the directories .local, .vscode-server, and .conda. It is important to make sure that .conda and .local are moved, as they may include lots of files and you may run out of space quickly. Hence, do the following next.

rivanna>
  $ cd $PROJECT
  $ mv ~/.local .
  $ mv ~/.vscode-server .
  $ mv ~/.conda .

Then create symbolic links in the home directory that point to the moved folders.

rivanna>
  $ cd $PROJECT
  $ ln -s $PROJECT/.local ~/.local
  $ ln -s $PROJECT/.vscode-server ~/.vscode-server
  $ ln -s $PROJECT/.conda ~/.conda

Check all symbolic links:

rivanna>
  $ ls -lisa

20407358289   4 lrwxrwxrwx    1 $USER users          40 May  5 10:58 .local -> /project/bii_dsc_community/djy8hg/.local
20407358290   4 lrwxrwxrwx    1 $USER users          48 May  5 10:58 .vscode-server -> /project/bii_dsc_community/djy8hg/.vscode-server
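The move-and-link steps above can be combined into one reusable function. The following is a minimal sketch under the assumption that `PROJECT` is set as described earlier; the function name `relocate_dir` is hypothetical and not part of any Rivanna tooling.

```shell
# relocate_dir SRC DEST_PARENT
# Moves the directory SRC into DEST_PARENT and leaves a symbolic link at SRC.
# Hypothetical helper, not part of any Rivanna tooling.
relocate_dir() {
    src=$1
    dest_parent=$2
    name=$(basename "$src")

    # Nothing to do if SRC is already a symbolic link (moved earlier).
    if [ -L "$src" ]; then
        echo "skip: $src is already a link"
        return 0
    fi
    # Nothing to do if SRC does not exist.
    if [ ! -d "$src" ]; then
        echo "skip: $src does not exist"
        return 0
    fi
    # Refuse to overwrite an existing target.
    if [ -e "$dest_parent/$name" ]; then
        echo "error: $dest_parent/$name already exists" >&2
        return 1
    fi

    mkdir -p "$dest_parent"
    mv "$src" "$dest_parent/$name"
    ln -s "$dest_parent/$name" "$src"
    echo "moved: $src -> $dest_parent/$name"
}

# Example usage on Rivanna:
#   for d in .local .vscode-server .conda; do
#       relocate_dir "$HOME/$d" "$PROJECT"
#   done
```

Because directories that are already links are skipped, the loop can be run again later when new directories need to be moved.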

Singularity Cache

In case you use Singularity to build images, you need to set the Singularity cache. This is because the cache is usually created in your home directory, which is often far too small for even our small projects. Thus you need to set it as follows

rivanna>
  mkdir -p /project/bii_dsc_community/$USER/.singularity/cache
  export SINGULARITY_CACHEDIR=/project/bii_dsc_community/$USER/.singularity/cache

Python

In case you use python venv, do not place them in home but under project.

rivanna>
  module load python3.8
  python -m venv $PROJECT/ENV3
  source $PROJECT/ENV3/bin/activate

If you succeed, you can also place the source line in your .bashrc file.

In case you use conda and python, we also recommend that you create a venv from the conda python, so you have a copy of that in ENV3 and if something goes wrong it is easy to recreate from your default python. Those that use that path ought to improve how to do this here.

Adding cloudmesh rivanna specific commands and tools

On your computer in your ENV3 add the following to enable the commands

computer> 
  pip install pip -U
  pip install cloudmesh-common
  pip install cloudmesh-rivanna
  pip install cloudmesh-sbatch
  pip install cloudmesh-vpn

On Rivanna in ENV3 also add the gpu monitor

rivanna>
  pip install pip -U
  pip install cloudmesh-common
  pip install cloudmesh-gpu
  pip install cloudmesh-rivanna
  pip install cloudmesh-sbatch

Note: Please send an e-mail to laszewski@gmail.com if any requirements are missing, as I may not yet have included all of them in the pip package.

Once you have activated it the cloudmesh rivanna command shows you combinations of SBATCH flags that you can use.

To see them type in

cms rivanna slurm list

To log into a specific node (let us assume you would like to log into a k80 node), you can say

cms rivanna login k80

Please be reminded that interactive login is only allowed for debugging; all jobs must be submitted through sbatch.

To get the directives template to use that GPU, use

cms rivanna slurm k80

cloudmesh sbatch

Cloudmesh-sbatch is a super cool extension to sbatch that allows you to automatically run parameter studies by creating permutations of experiment parameters. At this time we are trying to create some sample applications, but you can also arrange a 30 minute meeting with Gregor so we can try setting it up for your application with his help.

See also:

cloudmesh vpn command

cloudmesh has a simple commandline vpn command that you can use to switch on and off vpn for UVA (and other vpn’s, we can add that feature ;-))

computer> 
  cms vpn connect
  ... do your work in vpn such as working on rivanna
  cms vpn disconnect
  ... work on your regular network 

Load modules

Modules are preconfigured packages that allow specific software to be loaded into your environment without you needing to install it from source. To find out more about a particular package such as cmake you can use the command

rivanna>
  module spider cmake # check whether cmake is available and details

Load the needed module (you can add version info). Note that some modules depend on other modules (clang/10.0.1 depends on gcc/9.2.0, so gcc needs to be loaded first).

rivanna>
  # module load gcc/9.2.0 clang/10.0.1
  module load clanggcc
  module load cmake/3.23.3 git/2.4.1 ninja/1.10.2-py3.8 llvm cuda/11.4.2

Check the currently loaded modules

rivanna>
  module list

Clean all the modules

rivanna>
  module purge

Request GPUs to use interactively

TODO: explain what -A is

rivanna$ ijob -c number_of_cpus \
              -A group_name \
              -p queue_name \
              --gres=gpu:gpu_model:number_of_gpus \
              --time=day-hours:minutes:seconds

An example to request 1 cpu with 1 a100 gpu for 10 minutes in the ‘gpu’ partition is

rivanna$ ijob -c 1 -A bii_dsc_community -p gpu --gres=gpu:a100:1 --time=0-00:10:00

Rivanna has different partitions with different resource availability and charging rates. dev is free but limited to 1 hour for each session/allocation, and no GPU is available. To list the different partitions, use qlist.

Last checked July 28th; note that these values may change.

| Queue (partition) | Total Cores | Free Cores | Jobs Running | Jobs Pending | Time Limit | SU Charge |
|---|---|---|---|---|---|---|
| bii | 4640 | 3331 | 31 | 15 | 7-00:00:00 | 1 |
| standard | 4080 | 496 | 1209 | 5670 | 7-00:00:00 | 1 |
| dev | 160 | 86 | 5 | | 1:00:00 | 0 |
| parallel | 4880 | 1594 | 21 | 3 | 3-00:00:00 | 1 |
| instructional | 480 | 280 | 16 | | 3-00:00:00 | 1 |
| largemem | 144 | 80 | 2 | 1 | 4-00:00:00 | 1 |
| gpu | 1876 | 1066 | 99 | 210 | 3-00:00:00 | 3 |
| bii-gpu | 608 | 542 | 18 | 1 | 3-00:00:00 | 1 |
| bii-largemem | 288 | 224 | | | 7-00:00:00 | 1 |

To list the limits, use the command qlimits

Last Checked July 28th, note these values may change.

Queue Maximum Maximum Minimum Maximum Maximum Default Maximum Minimum
(partition) Submit Cores(GPU)/User Cores/Job Mem/Node(MB) Mem/Core(MB) Mem/Core(MB) Nodes/Job Nodes/Job
bii 10000 cpu=400 354000+ 9400 112
standard 10000 cpu=1000 384000+ 9000 1
dev 10000 cpu=16 384000 9000 6000 2
parallel 2000 cpu=1500 4 384000 9600 9000 50 2
instructional 2000 cpu=20 384000 6000 5
largemem 2000 cpu=32 1500000 64000 60000 2
gpu 10000 gres/gpu=32 128000+ 32000 6000 4
bii-gpu 10000 384000+ 9400 12
bii-largemem 10000 1500000 31000 2

Linux commands for HPC

Many useful commands can be found in Gregor’s book at

The following additional commands are quite useful on HPC systems

| command | description |
|---|---|
| allocations | check available accounts and balances |
| hdquota | check the storage you have used |
| du -h --max-depth=1 | check which directory uses the most space |
| qlist | list the queues |
| qlimits | print the limits of the queues |
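When a quota is nearly exhausted, it helps to see the largest directories first. The following is a minimal sketch that builds on the du command listed above; the function name `largest_dirs` is hypothetical. It uses `-k` instead of `-h` so the sizes can be sorted numerically, and `-d 1` as the short form of `--max-depth=1`.

```shell
# largest_dirs DIR [N]
# Prints the N largest entries directly below DIR (sizes in KB), largest first.
# The first line is the total for DIR itself. Hypothetical helper.
largest_dirs() {
    dir=${1:-.}
    count=${2:-10}
    # -k reports sizes in kilobytes so that sort -n can order them numerically.
    du -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example usage on Rivanna:
#   largest_dirs "$HOME" 5
```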

SLURM Batch Parameters

We present next a number of default parameters for using a variety of GPUs on Rivanna. Please note that you may need to adapt some parameters to adjust the cores or memory according to your application.

Running on v100

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --partition=bii-gpu
#SBATCH --account=bii_dsc_community
#SBATCH --gres=gpu:v100:1
#SBATCH --job-name=MYNAME
#SBATCH --output=%u-%j.out
#SBATCH --error=%u-%j.err

Running on a100-40GB

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --partition=bii-gpu
#SBATCH --account=bii_dsc_community
#SBATCH --gres=gpu:a100:1
#SBATCH --job-name=MYNAME
#SBATCH --output=%u-%j.out
#SBATCH --error=%u-%j.err

Running on special fox node a100-80GB

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --partition=bii-gpu
#SBATCH --account=bii_dsc_community
#SBATCH --gres=gpu:a100:1
#SBATCH --job-name=MYNAME
#SBATCH --output=%u-%j.out
#SBATCH --error=%u-%j.err
#SBATCH --reservation=bi_fox_dgx
#SBATCH --constraint=a100_80gb
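The directives above only form the header of a batch script. The following is a minimal sketch of how a complete script could be assembled from them; the function name `make_gpu_job`, the file name `job.slurm`, and the program `train.py` are hypothetical, and the `nvidia-smi` line is only an optional check that prints the GPU that was assigned.

```shell
# make_gpu_job GPU JOBNAME COMMAND...
# Writes a complete batch script for one GPU to standard output.
# Hypothetical helper; the directives are the defaults listed above.
make_gpu_job() {
    gpu=$1
    name=$2
    shift 2
    cat <<EOF
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --partition=bii-gpu
#SBATCH --account=bii_dsc_community
#SBATCH --gres=gpu:${gpu}:1
#SBATCH --job-name=${name}
#SBATCH --output=%u-%j.out
#SBATCH --error=%u-%j.err

nvidia-smi
$*
EOF
}

# Example usage on Rivanna:
#   make_gpu_job v100 mytest python train.py > job.slurm
#   sbatch job.slurm
```

Generating the script keeps the GPU type and job name in one place, which is convenient when the same program is tried on several GPU types.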

Some suggestions

When compiling large projects, you may need to make sure you have enough time and memory to conduct such compiles. This can best be achieved by using an interactive node, possibly from the large memory partition.

References

Help Support

When requesting help from Gregor or anyone else, make sure to specify the issue completely. Many things cannot be solved if you are not clear about the issue and where it occurs. Include:

  • The issue you are encountering.
  • Where it is occurring.
  • What you have done to try to resolve the issue.

A good example is:

I ran the application xyz, from url xyz on Rivanna. I placed code in the directory /project/…. or I placed the data in /project/… The download worked and I placed about 600GB. However when I uncompress the data with the command xyz I get the error xyz. What should we do now?

3 - Rivanna Pod

Rivanna

This documentation is so far only useful for beta testers. In this group we have

  • Gregor von Laszewski

The rivanna documentation for the basic pod is available at

https://www.rc.virginia.edu/userinfo/rivanna/basepod/

Introducing the NVIDIA DGX BasePOD

Rivanna contains a BasePod with

  • 10 DGX A100 nodes
  • 8 A100 GPU devices (per node)
  • 2 TB local node memory (per node)
  • 80 GB GPU memory (per GPU device)

The following Advanced Features have now been enabled on the BasePOD:

  • NVLink for fast multi-GPU communication
  • GPUDirect RDMA Peer Memory for fast multi-node multi-GPU communication
  • GPUDirect Storage with 200 TB IBM ESS3200 (NVMe) SpectrumScale storage array

What this means to you is that the POD is ideal for the following scenarios:

  • The job needs multiple GPUs and/or even multiple nodes.
  • The job (can be single- or multi-GPU) is I/O intensive.
  • The job (can be single- or multi-GPU) requires more than 40 GB GPU memory. (We have 12 A100 nodes in total, 10 of which are the POD and 2 are regular with 40 GB GPU memory per device.)

Detailed specs can be found in the official document (Chapter 3.1):

Accessing the POD

Allocation

A single job can request up to 4 nodes with 32 GPUs. Before running multi-node jobs, please make sure it can scale well to 8 GPUs on a single node.

Slurm script

Please include the following lines:

#SBATCH -p gpu
#SBATCH --gres=gpu:a100:X # replace X with the number of GPUs per node
#SBATCH -C gpupod

Open OnDemand

In Optional: Slurm Option write:

-C gpupod

Interactive login

Interactive login to the nodes should be VERY limited, and you need to use the batch queue for most activities. In case you need to look at things, you can use our cloudmesh program to do so.

Make sure to have the VPN enabled and cloudmesh-rivanna installed via pip.

  cms rivanna login a100-pod

This will log you into a node. The time is set by default to 30 minutes. Please log out immediately after you are done with your interactive work.

Usage examples

Deep learning

We will be migrating toward NVIDIA’s NGC containers for deep learning frameworks such as PyTorch and TensorFlow, as they have been heavily optimized to achieve excellent multi-GPU performance. These containers have not yet been installed as modules but can be accessed under /share/resources/containers/singularity:

  • pytorch_23.03-py3.sif
  • tensorflow_23.03-tf1-py3.sif
  • tensorflow_23.03-tf2-py3.sif

(NGC has their own versioning scheme. The PyTorch and TensorFlow versions are 2.0.0, 1.15.5, 2.11.0, respectively.)

The singularity command is of the form:

singularity run --nv /path/to/sif python /path/to/python/script
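Combined with the Slurm lines for the POD, the singularity command can be placed into a batch script. The following is a minimal sketch, not an official template: the function name `make_pod_job`, the 30 minute time limit, the use of the `bii_dsc_community` allocation, and the script name `train.py` are assumptions made for illustration.

```shell
# make_pod_job NGPUS SCRIPT
# Writes a batch script to standard output that runs SCRIPT in the
# NGC PyTorch container on a POD node. Hypothetical helper.
make_pod_job() {
    ngpus=$1
    script=$2
    cat <<EOF
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:a100:${ngpus}
#SBATCH -C gpupod
#SBATCH -A bii_dsc_community
#SBATCH --time=00:30:00

module purge
module load singularity
singularity run --nv /share/resources/containers/singularity/pytorch_23.03-py3.sif python ${script}
EOF
}

# Example usage on Rivanna:
#   make_pod_job 1 train.py > pod.slurm
#   sbatch pod.slurm
```

Requesting more than one GPU only pays off if the python script itself distributes the work across the GPUs.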

Warning: Distributed training is not automatic! Your code must be parallelizable. If you are not familiar with this concept, please visit:

MPI codes

Please check the manual for your code regarding the relationship between the number of MPI ranks and the number of GPUs. For computational chemistry codes (e.g. VASP, QuantumEspresso, LAMMPS) the two are oftentimes equal, e.g.

#SBATCH --gres=gpu:a100:8
#SBATCH --ntasks-per-node=8

If you are building your own code, please load the modules nvhpc and cuda which provide NVIDIA compilers and CUDA libraries. The compute capability of the POD A100 is 8.0.

For documentation and demos, refer to the Resources section at the bottom of this page: https://developer.nvidia.com/hpc-sdk

We will be updating our website documentation gradually in the near future as we iron out some operational specifics. GPU-enabled modules are now marked with a (g) in the module avail command as shown below:

TODO: output from module avail to be included

4 - Rivanna and Singularity

Singularity.

Singularity

Singularity is a container runtime that implements a unique security model to mitigate privilege escalation risks and provides a platform to capture a complete application environment into a single file (SIF).

Singularity is often used in HPC centers.

The University of Virginia granted us special permission to create Singularity images on Rivanna. We discuss here how to build and run Singularity images.

Access

In order for you to be able to access singularity and build images, you must be in the following groups:

biocomplexity
nssac_students
bii_dsc_community

To find out if you are, ssh into rivanna and issue the command

$ groups

If any of the groups is missing, please send Gregor an e-mail at laszewski@gmail.com.
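The comparison against the three required groups can be automated. The following is a minimal sketch; the function name `missing_groups` is hypothetical, and it expects the space-separated list that the `groups` command prints.

```shell
# missing_groups "GROUP LIST"
# Prints every required group that does not occur in the given list.
# Hypothetical helper; the required groups are the three named above.
missing_groups() {
    have=" $1 "
    for g in biocomplexity nssac_students bii_dsc_community; do
        case "$have" in
            *" $g "*) ;;          # group is present, nothing to report
            *) echo "$g" ;;       # group is missing
        esac
    done
}

# Example usage on Rivanna:
#   missing_groups "$(groups)"
```

If the command prints nothing, you are in all required groups; otherwise include the printed names in your e-mail.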

Singularity cache

Before you can build images you need to set the singularity cache. This is due to the fact that the cache usually is created in your home directory and is often far too small for even our small projects. Thus you need to set it as follows

rivanna>
  mkdir -p /scratch/$USER/.singularity/cache
  export SINGULARITY_CACHEDIR=/scratch/$USER/.singularity/cache

Please remember that scratch is not permanent. In case you would like a more permanent location, you can alternatively use

rivanna>
  mkdir -p /project/bii_dsc_community/$USER/.singularity/cache
  export SINGULARITY_CACHEDIR=/project/bii_dsc_community/$USER/.singularity/cache
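Since the export is lost when you log out, it can be convenient to wrap both steps into a function that is called from your ~/.bashrc. The following is a minimal sketch; the function name `use_singularity_cache` is hypothetical.

```shell
# use_singularity_cache BASE
# Creates BASE/.singularity/cache and points Singularity at it.
# Hypothetical helper for use in ~/.bashrc.
use_singularity_cache() {
    cache="$1/.singularity/cache"
    # Stop if the directory cannot be created (for example, missing permissions).
    mkdir -p "$cache" || return 1
    export SINGULARITY_CACHEDIR="$cache"
    echo "SINGULARITY_CACHEDIR=$SINGULARITY_CACHEDIR"
}

# Example usage on Rivanna:
#   use_singularity_cache /project/bii_dsc_community/$USER
```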

build.def

To build an image you will need a build definition file

We show next an example of a simple build.def file that internally uses an NVIDIA NGC PyTorch container.

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.02-py3

Next you can follow the steps that are detailed in https://docs.sylabs.io/guides/3.7/user-guide/definition_files.html#sections

However, for Rivanna we MUST create the image as discussed next.

Creating the Singularity Image

In order for you to create a singularity container from the build.def file please login to either of the following special nodes on Rivanna:

  • biihead1.bii.virginia.edu
  • biihead2.bii.virginia.edu

For example:

ssh $USER@biihead1.bii.virginia.edu

where $USER is your computing ID on Rivanna.

Now that you are logged in to the special node, you can create the singularity image with the following command:

sudo /opt/singularity/3.7.1/bin/singularity build output_image.sif build.def

Note: It is important that you type in only this command. If you modify the name output_image.sif or build.def, the command will not work and you will receive an authorization error.

In case you need to rename the image to a better name please use the mv command.

In case you also need a name other than build.def, the following Makefile is very useful. We assume you use myimage.def and myimage.sif. Include it into a makefile such as:

BUILD=myimage.def
IMAGE=myimage.sif

image:
	cp ${BUILD} build.def
	sudo /opt/singularity/3.7.1/bin/singularity build output_image.sif build.def
	cp output_image.sif ${IMAGE}
	$(MAKE) clean

clean:
	rm -rf build.def output_image.sif

Having such a Makefile will allow you to use the command

make image

and the image myimage.sif will be created. With make clean you will delete the temporary files build.def and output_image.sif.

Create a singularity image for tensorflow

TODO

Work with Singularity container

Now that you have an image, you can use it while using the documentation provided at https://www.rc.virginia.edu/userinfo/rivanna/software/containers/

Run GPU images

To use NVIDIA GPUs with Singularity, the --nv flag is needed.

singularity exec --nv output_image.sif python myscript.py

The singularity run command executes the runscript defined in the image and passes the argument(s) after the image name, i.e. myscript.py, to it. Only if Python is defined as that default command is the above singularity command equivalent to

singularity run --nv output_image.sif myscript.py

Run Images Interactively

ijob  -A mygroup -p gpu --gres=gpu -c 1
module purge
module load singularity
singularity shell --nv output_image.sif

Singularity Filesystem on Rivanna

The following paths are exposed to the container by default

  • /tmp
  • /proc
  • /sys
  • /dev
  • /home
  • /scratch
  • /nv
  • /project

Adding Custom Bind Paths

For example, the following command adds the /scratch/$USER directory as an overlay without overlaying any other user directories provided by the host:

singularity run -c -B /scratch/$USER output_image.sif

To add the /home directory on the host as /rivanna/home inside the container:

singularity run -c -B /home:/rivanna/home output_image.sif
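When several directories are needed at once, the -B option also accepts a comma-separated list of bind specifications. The following is a minimal sketch of a small helper that assembles such a list; the function name bind_list and the computing ID abc5xy are hypothetical and only serve as illustration.

```shell
# bind_list is a hypothetical helper (not part of Singularity): it joins its
# arguments with commas, the form accepted by a single -B (--bind) option.
bind_list() {
    out=""
    for spec in "$@"; do
        if [ -z "$out" ]; then
            out="$spec"
        else
            out="$out,$spec"
        fi
    done
    printf '%s\n' "$out"
}

bind_list /scratch/abc5xy /home:/rivanna/home
# prints /scratch/abc5xy,/home:/rivanna/home
```

Under these assumptions, the result could be used as singularity run -c -B "$(bind_list /scratch/$USER /home:/rivanna/home)" output_image.sif.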

FAQ

Adding singularity to slurm scripts

TBD

Running on v100

TBD

Running on a100-40GB

TBD

Running on a100-80GB

TBD

Running on special fox node a100-80GB

TBD

5 - Rclone on Rivanna

Using Rclone to upload and download from cloud services

Using the Rclone Module on Rivanna

Rclone is a useful tool to upload and download from cloud services such as Google Drive by using the command line. However, a web browser is required for the initial setup, which can be done from the computer that logs into Rivanna.

Setup Rclone on Rivanna

First, load the newer version of the module; otherwise, Rivanna loads an incompatible, older version by default. Then, initialize a new rclone configuration and enter the following inputs:

$ module load rclone/1.61.1
$ rclone config
n/s/q> n
name> gdrive
Storage> drive

A client ID is required to create a provision that interfaces with Google Drive. Follow the instructions at https://rclone.org/drive/#making-your-own-client-id to create a client ID and then input the values into Rivanna.

client_id> myCoolID..
client_secret> verySecretClientSecret..
scope> 2 # read only
service_account_file> # just press enter
Edit advanced config?
y) Yes
n) No (default)
y/n> n
Use web browser to automatically authenticate rclone with remote?
y/n> n

Install Rclone on Client Computer

If the computer used to log on to Rivanna is running Windows, and the computer has Chocolatey, then download Rclone using an administrative Git Bash instance with

$ choco install rclone -y

Otherwise, for Linux and macOS, use

$ sudo -v ; curl https://rclone.org/install.sh | sudo bash

Then, after opening a new instance of the terminal, paste the rclone authorize command that rclone config printed on Rivanna into that terminal (Git Bash on Windows) and follow the instructions.

Rclone Authentication

In the web browser, click Advanced when Google says that they have not verified this app; this is safe and expected. Then click Go to rclone, then Continue.

When Rclone gives the config token, ensure that all new line characters are removed. This can be done by pasting the code into an application such as Notepad and manually ensuring that all characters are on the same line. Otherwise, the code will be split across new prompts, breaking the setup.

This is bad:

sjgnkajdfnkj
fdnskjafnkad
asdfnasjkffd

This is good:

sjgnkajdfnkjfdnskjafnkadasdfnasjkffd
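As an alternative to editing the token by hand, the line breaks can be removed on the command line. The following is a minimal sketch, assuming the token is available as text on standard input; the function name join_token is hypothetical and the token shown is the dummy value from above.

```shell
# join_token is a hypothetical helper: it reads text on standard input and
# prints it with all line breaks (LF and CR characters) removed.
join_token() {
    tr -d '\r\n'
}

printf 'sjgnkajdfnkj\nfdnskjafnkad\nasdfnasjkffd\n' | join_token
# prints sjgnkajdfnkjfdnskjafnkadasdfnasjkffd
```

The helper only deletes line-break characters, so spaces or other stray characters inside a pasted token would still need to be removed manually.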

Paste the fixed token into Rivanna.

config_token> myCoolCodeThatHasNoNewLineCharacters
Configure this as a Shared Drive (Team Drive)?

y) Yes
n) No (default)
y/n> n
Keep this "gdrive" remote?
y) Yes this is OK (default)
y/e/d> y
q) Quit config
e/n/d/r/c/s/q> q

An example command to use Rclone is as follows. The flag --drive-shared-with-me restricts the scope to only shared files.

$ rclone copy --drive-shared-with-me gdrive:Colab\ Datasets/EarthquakeDec2020  /scratch/$USER/EarthquakeDec2020 -P

6 - Docker

Cybertraining Links

Docker drivers images from NVIDIA

Install GPU drivers in a docker image

NVIDIA GPU drivers can be installed into docker images. As the software may change frequently, we recommend looking at the NVIDIA documentation.

An example to add to a Debian-based Dockerfile to install the GPU drivers (this may be incomplete and you need to check the instructions):

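# Add the NVIDIA repository key and the package list for this distribution
RUN curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
    apt-key add - && \
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID) && \
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
    tee /etc/apt/sources.list.d/nvidia-container-runtime.list
# Install the runtime from the repository added above
RUN apt-get update && \
    apt-get install -y nvidia-container-runtime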

7 - Globus

File transfer with Globus

Getting the Cosmoflow data via globus commandline

Data Directory

We will showcase how to transfer data via globus commandline tools.

In our example we will use the data directory as

export DATA=/project/bii_dsc_community/$USER/cosmoflow/data

Globus Set Up on Rivanna

On Rivanna, the Globus file transfer command line tools are loaded via the module command as shown next. However, prior to executing globus login, please visit https://www.globus.org/ and log in using your UVA credentials.

module load globus_cli
globus login

The globus login command outputs a link that is unique per user; paste it into a web browser and sign in using your UVA credentials. Afterwards, the website presents a unique authorization code that you need to paste back into the command line to verify your login.

After executing globus login your console should look like the following block.

NOTE: this is a unique link generated for the example login, each user will have a different link.

-bash-4.2$ globus login
Please authenticate with Globus here:
------------------------------------
https://auth.globus.org/v2/oauth2/authorize?client_id=affbecb5-5f93-404e-b342-957af296dea0&redirect_uri=https%3A%2F%2Fauth.globus.org%2Fv2%2Fweb%2Fauth-code&scope=openid+profile+email+urn%3Aglobus%3Aauth%3Ascope%3Aauth.globus.org%3Aview_identity_set+urn%3Aglobus%3Aauth%3Ascope%3Atransfer.api.globus.org%3Aall&state=_default&response_type=code&access_type=offline&prompt=login
------------------------------------

Enter the resulting Authorization Code here:

Follow the URL and enter the authorization code to log in successfully.

First, verify that you signed in properly by checking your identity, and then search for the source endpoint of the data you want to transfer. In this example, our endpoint is named CosmoFlow benchmark data cosmoUniverse_2019_02_4parE. Please note that the data to be downloaded is 1.7 TB in size. Make sure that the system on which you download it has enough space. The following commands verify your sign-in identity and then search for the endpoint named within the single quotation marks.

globus get-identities -v 'youremail@gmailprobably.com'
globus endpoint search 'CosmoFlow benchmark data cosmoUniverse_2019_02_4parE'

Each globus endpoint has a unique endpoint ID. In this case our source endpoint ID is:

  • d0b1b73a-efd3-11e9-993f-0a8c187e8c12

Set up a variable SRC_ENDPOINT so you can use the endpoint more easily without retyping it. Also set a variable SRC_DIR to indicate the directory with the files to be transferred.

export SRC_ENDPOINT=d0b1b73a-efd3-11e9-993f-0a8c187e8c12
export SRC_DIR=/~/

You can look at the files in the globus endpoint using globus ls to verify that you are looking at the right endpoint.

globus ls $SRC_ENDPOINT

Destination Endpoint Set Up

Rivanna HPC has set up a special endpoint for data transfers into the /project, /home, or /scratch directories. The name of this destination endpoint is UVA Standard Security Storage.

Repeat the above steps with this endpoint and set up the variables including a path variable with the desired path to write to.

globus endpoint search 'UVA Standard Security Storage'
export DEST_ENDPOINT=e6b338df-213b-4d31-b02c-1bc2c628ca07
export DEST_DIR=/dtn/landings/users/u/uj/$USER/project/bii_dsc_community/uja2wd/cosmoflow/

NOTE: We cannot set the path to start at the root level of Rivanna and instead need to follow a few steps to find the specific path to write to.

To begin, the path must start with /dtn/landings/users/, followed by a sequence derived from the user's computing ID. As an example, if your computing ID is abc5xy, the next three components are /a/ab/abc5xy (first character, first two characters, computing ID). At this point the user is essentially at the root level of Rivanna and can access /home, /project, or /scratch as they normally would.
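The rule above can be sketched as a small shell function. This is a minimal illustration, assuming the computing ID is passed as the first argument; the name dtn_prefix is hypothetical and not part of the Globus CLI or of Rivanna.

```shell
# dtn_prefix is a hypothetical helper: given a computing ID, it prints the
# Globus path prefix /dtn/landings/users/<first char>/<first two chars>/<id>.
dtn_prefix() {
    id="$1"
    first=$(printf '%s\n' "$id" | cut -c1)        # first character, e.g. a
    first_two=$(printf '%s\n' "$id" | cut -c1-2)  # first two characters, e.g. ab
    printf '/dtn/landings/users/%s/%s/%s\n' "$first" "$first_two" "$id"
}

dtn_prefix abc5xy
# prints /dtn/landings/users/a/ab/abc5xy
```

With this helper, a destination could be written as, for example, export DEST_DIR="$(dtn_prefix $USER)/scratch/$USER/cosmoflow/", where the part after the prefix is an assumed location that you need to adapt.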


Note: If you want to use the web interface of Globus to find the path instead, follow the steps below to find the desired value of your path variable.

  • First sign into the web format of globus
  • Locate file manager on the left side of the screen
  • In the collections box at the top of the screen begin to search for UVA Standard Security Storage
  • Select our destination endpoint
  • Use the GUI tool to select exactly where you wish to write to
  • Copy the path from the box immediately below collections
  • Write this value to the DEST_DIR variable created above (I have included my path to where I wish to write to)

Initiate the Transfer

Finally, execute the transfer

globus transfer $SRC_ENDPOINT:$SRC_DIR $DEST_ENDPOINT:$DEST_DIR
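Because the transfer command depends on four shell variables, an unset variable results in a malformed endpoint or path. The following is a minimal sketch of a check that can be run before the transfer; the function name check_transfer_vars is hypothetical, and it only tests that the variables are non-empty, not that the endpoints or paths exist.

```shell
# check_transfer_vars is a hypothetical helper. It returns 0 only if all four
# variables used by the transfer command are set and non-empty, and reports
# each missing variable on standard error.
check_transfer_vars() {
    status=0
    for name in SRC_ENDPOINT SRC_DIR DEST_ENDPOINT DEST_DIR; do
        eval "value=\${$name:-}"   # look up the variable called $name
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            status=1
        fi
    done
    return $status
}

# Possible use (not executed here):
#   check_transfer_vars && globus transfer $SRC_ENDPOINT:$SRC_DIR $DEST_ENDPOINT:$DEST_DIR
```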

NOTE: Your first transfer may fail because you need to give Globus permission to initiate transfers via the CLI instead of via the web tool. In my case the terminal printed the following message with a unique command:

-bash-4.2$ globus transfer $SRC_ENDPOINT:$SRC_DIR $DEST_ENDPOINT:$DEST_DIR
The collection you are trying to access data on requires you to grant
consent for the Globus CLI to access it.  message: Missing required
data_access consent

Please run

  globus session consent 'urn:globus:auth:scope:transfer.api.globus.org:all[*https://auth.globus.org/scopes/e6b338df-213b-4d31-b02c-1bc2c628ca07/data_access]'

to login with the required scopes

After you run this command, a sign-in and verification similar to globus login takes place: the CLI outputs a URL to follow, the user signs in, and then pastes the returned verification code back into the terminal.

After fixing this, remember to re-initiate the transfer with the

globus transfer

command as previously described.

Managing Tasks

To monitor the status of active transfers, use

globus task list

or similarly you can use the web tool to verify transfers.

References:

  1. Globus Data Transfer, Rivanna HPC https://www.rc.virginia.edu/userinfo/globus/

8 - Cybertraining

Cybertraining Links

A large number of tutorials and modules are available in our cybertraining educational activities.

Cybertraining

The main links to our cybertraining material are:

9 - Raspberry Pi Cluster

Raspberry Pi Cluster

The main web page for this is at https://piplanet.org

Additional tutorials and resources are

10 - Create infomall.org

Description on how to create infomall.org

We assume you have hugo and cloudmesh-vpn installed.

You need to have Python 3.

python -m venv ~/ENV3
source ~/ENV3/bin/activate  # if windows in gitbash source ~/ENV3/Scripts/activate
pip install cloudmesh-vpn -U
cms help

Creating a draft

To create a new version of the code from the repository use

rivanna terminal 1>
  git clone git@github.com:DSC-SPIDAL/infomall-org-uva.git
  cd infomall-org-uva
  make serve

To view the content say

rivanna terminal 2>
  make view

Publish

The Web site is currently published by Gregor as follows. No one else should publish it.

computer>
  cms vpn info # make sure vpn is set to UVA
  cms vpn connect # only needed if vpn is off
  make huge
  make rsync
  cms vpn disconnect #optional to make sure vpn is off