Grid'5000 is a large-scale and versatile testbed for experiment-driven research in all areas of computer science, with a focus on parallel and distributed computing, including Cloud, HPC, and Big Data (see Grid5000's website).
This guide focuses on using Grid5000 as an alternative for processing when hardware such as GPUs is not available locally.
First step: Get an account.
There are two main types of accounts:
- Academics from France: those currently working on any research project in France, or academics abroad collaborating with academics in France (the latter should ask their French collaborators for details).
- Open Access Program: people who are not part of such a collaboration can request a lower-priority account. Interested private companies need to contact Grid5000's executive committee members.
For this step you will need to provide your SSH public key. If you have not generated one, follow this tutorial to generate one.
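If you prefer a quick reference, a minimal sketch is below, assuming the OpenSSH client is installed; the key path and comment are placeholders, and the empty passphrase (`-N ""`) is for brevity only — consider setting a real one.

```shell
# Generate an ed25519 key pair at a hypothetical path (adjust to taste).
# -N "" sets an empty passphrase for brevity; a real passphrase is safer.
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519_g5k

# This prints the public key you will paste into the account request form:
cat ~/.ssh/id_ed25519_g5k.pub
```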
This is a list of all the hardware available on Grid5000. Check it to see which cluster best suits your needs. At the time of writing, these were the clusters with CUDA-capable GPUs:
Site | Cluster | Available GPUs | Queue |
---|---|---|---|
Lille | chifflet | Nvidia GTX 1080Ti x 2 | default |
Lille | chifflot | Nvidia Tesla P100 x 2 and Nvidia Tesla V100 x 2 | default |
Lyon | orion | Nvidia Tesla M2075 | default |
Nancy | graphique | Nvidia Titan Black x 2 and Nvidia GTX 980 x 2 | production |
Nancy | grele | Nvidia GTX 1080Ti x 2 | production |
Nancy | grimani | Nvidia Tesla K40M | default |
Once you have chosen a cluster, log in to your account via: ssh [email protected]
, then ssh to the site that hosts the cluster you want to work on, e.g. ssh nancy / ssh lille / ssh lyon
. You should now be able to access your home directory on any of that site's clusters.
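To avoid typing both hops every time, you can add a ProxyJump entry to your SSH configuration. A sketch of a ~/.ssh/config fragment, assuming OpenSSH 7.3+ (for ProxyJump); the username jdoe and the Nancy site are placeholders:

```
# ~/.ssh/config -- sketch only; replace jdoe with your Grid5000 username
Host g5k-nancy
    User jdoe
    HostName nancy            # resolved from the access gateway
    ProxyJump [email protected]
```

With this entry, `ssh g5k-nancy` should reach the Nancy frontend in a single command.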
Note about storage: the default quota per user on Grid5000 is 25 GB, but if you need more, you can request a bigger quota in the Grid5000 API user storage tab (login needed).
The steps listed here are based on this tutorial (login needed) by the user Ibada. This guide only covers setup with Anaconda, specifically Miniconda, since it is lighter. All commands are executed from the user's home directory.
First, download Miniconda according to the Python version you will be working with. If you are working with Python 2.7, use Miniconda2 instead of Miniconda3.
For Python 3.7:
user@site:~$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
Then, run it:
user@site:~$ bash Miniconda3-latest-Linux-x86_64.sh
Here, the installation script will prompt you to choose the path where Miniconda will be installed, and whether the conda environment should be activated when bash starts (the default is no).
Finally, copy the .bashrc
script available in this repository and replace user with your username on lines 119, 123, 124, and 127. It contains many useful features to personalize your bash experience; more importantly, if you kept the default option where conda starts disabled, sourcing this script lets you activate the conda environment.
user@site:~$ source .bashrc
(base) user@site:~$ #Environment now
With Miniconda set up, you can now create environments for your projects via:
conda create --name env
Then install conda-supported libraries and packages from Anaconda Cloud, conda-forge, or any other channel you want.
You can check each cluster's availability here (login needed) to see whether your desired hardware is busy.
These bash scripts facilitate the process of requesting jobs. Both are built around the oarsub
command and use the default queue; check the hardware table above to see which queue the GPUs you want are in:
- ask_for_job_fixed_time.sh
has a fixed job time and can be used to quickly test whether the environment recognizes the cluster's GPUs.
- ask_for_job_input_time.sh
lets you pass the time as an argument in hh:mm:ss format, e.g. when you have an estimated training time for a network.

For example, using ask_for_job_input_time.sh
:
user@flille:~$ bash ask_for_job_scripts/ask_for_job_input_time.sh 00:05:00
Remember to source bashrc!
Remember to activate the conda env!
[ADMISSION RULE] Modify resource description with type constraints
[ADMISSION_RULE] Resources properties : {'property' => 'type = \'default\'','resources' => [{'resource' => 'host','value' => '1'}]}
[ADMISSION RULE] Job properties : (GPU <> 'NO') AND maintenance = 'NO'
Generate a job key...
OAR_JOB_ID=1681786
Interactive mode: waiting...
Starting...
Connect to OAR job 1681786 via the node chifflet-6.lille.grid5000.fr
user@chifflet-6:~$ source .bashrc
(base) user@chifflet-6:~$ conda activate pytorch_env
(pytorch_env) user@chifflet-6:~$ python pytorch_probe_gpus.py
GeForce GTX 1080 Ti detected on device 0
GeForce GTX 1080 Ti detected on device 1
(pytorch_env) user@chifflet-6:~$ #GPUs detected!
Once you are in a job, you can use the available hardware on that specific cluster for your computations.
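The hh:mm:ss argument is handed to oarsub as a walltime, so a malformed value only fails at submission time. A minimal sketch of how such a script could validate the argument first; `validate_walltime` is a hypothetical helper, not part of the repository's scripts:

```shell
# Hypothetical helper: check that the argument looks like hh:mm:ss
# before passing it to oarsub as a walltime.
validate_walltime() {
    [[ "$1" =~ ^[0-9]{1,2}:[0-5][0-9]:[0-5][0-9]$ ]]
}

if validate_walltime "00:05:00"; then
    echo "walltime accepted"   # prints for a well-formed value
fi
```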
Since some clusters have more than one type of GPU, you can request a specific GPU on a cluster with ask_for_job_input_time_and_gpu.sh. This script takes the time in hh:mm:ss format as its first argument and the desired GPU name in 'quotes' as its second. The names can be consulted under OAR Properties on each site's Monika page, linked from the G5000 status page (login needed).
user@fnancy:~/usingGrid5000$ bash ask_for_job_scripts/ask_for_job_input_time_and_gpu.sh 00:05:00 'GTX 980'
Remember to source bashrc!
Remember to activate the conda env!
Asking for job with GTX 980
[ADMISSION RULE] Modify resource description with type constraints
[ADMISSION RULE] Assign max_walltime property for production resources selection
[ADMISSION_RULE] Resources properties : {'resources' => [{'value' => '1','resource' => 'host'}],'property' => '((type = \'default\') AND production = \'YES\') AND (max_walltime >= 300 OR max_walltime <= 0)'}
[ADMISSION RULE] Job properties : (GPU = 'GTX 980') AND maintenance = 'NO'
Generate a job key...
OAR_JOB_ID=1930362
Interactive mode: waiting...
Starting...
Connect to OAR job 1930362 via the node graphique-5.nancy.grid5000.fr
user@graphique-5:~/usingGrid5000$ source .bashrc
(base) user@graphique-5:~/usingGrid5000$ conda activate pytorch-env
(pytorch-env) user@graphique-5:~/usingGrid5000$ python pytorch_probe_gpus.py
GeForce GTX 980 detected on device 0
GeForce GTX 980 detected on device 1
(pytorch-env) user@graphique-5:~/usingGrid5000$ #Got wanted GPUs!
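Under the hood, the GPU argument ends up in an OAR property filter like the one visible in the job log above. A sketch of how the script might build that filter string; the variable names are illustrative, not taken from the actual scripts:

```shell
# Build the OAR property filter from the requested GPU name.
GPU_NAME="GTX 980"   # would come from "$2" in the actual script
PROPERTY="(GPU = '${GPU_NAME}') AND maintenance = 'NO'"
echo "$PROPERTY"     # prints: (GPU = 'GTX 980') AND maintenance = 'NO'

# The filter would then be passed to an interactive job request, e.g.:
#   oarsub -I -p "$PROPERTY" ...
```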
To transfer a file from the machines to your local PC via secure copy:
user@localPC:~$ scp [email protected]:site/path_from_home/file.py /home/user/directory/file.py #for single files
user@localPC:~$ scp -r [email protected]:site/path_from_home/directory /home/user/directory/ #for directories
To transfer a file from your PC to a cluster via secure copy:
user@localPC:~$ scp /home/user/directory/file.py [email protected]:site/path_from_home/file.py #for single files
user@localPC:~$ scp -r /home/user/directory/ [email protected]:site/path_from_home/directory #for directories
Commands to check or delete jobs:
user@site:~$ oarstat -u #check whether you have any jobs running on this site and their state
user@site:~$ oardel JOB_ID #delete any job you no longer need by giving the JOB_ID number
Check your storage:
user@site:~$ du -h --max-depth=1 | sort -hr
For more in-depth usage of Grid5000 for deep learning, check Ibada's tutorial.