Merge pull request #327 from hpcleuven/fix_genius_quickstart_memory
Use appropriate memory per core value + other polishing (Genius quickstart)
MaximeVdB authored Jun 2, 2023
2 parents 833e139 + fb6320c commit 27eb7d1
Showing 2 changed files with 74 additions and 55 deletions.
121 changes: 70 additions & 51 deletions source/leuven/genius_quick_start.rst
@@ -2,7 +2,6 @@

Genius quick start guide
========================

:ref:`Genius <Genius hardware>` is one of the two KU Leuven/UHasselt Tier-2 clusters,
besides :ref:`wICE <wice hardware>`.
Given the architectural diversity of compute nodes on Genius, this cluster is suited
@@ -15,61 +14,79 @@ For example, to log in to any of the login nodes using SSH::
$ ssh [email protected]


.. _running jobs on genius:
.. _running_jobs_on_genius:

Running jobs on Genius
----------------------

Genius is equipped with normal (thin) compute nodes, two kinds of large memory nodes,
and GPU nodes. The resources specifications for jobs have to be tuned to use these
Genius is equipped with regular (thin) compute nodes, two kinds of big memory nodes,
and GPU nodes. The resource specifications for jobs have to be tuned to use these
nodes properly.

In case you are not yet familiar with the system, you can find more
information on
The `Slurm Workload Manager <https://slurm.schedmd.com>`_ is the scheduler,
resource manager and credit accounting manager on Genius (and wICE).

- :ref:`hardware specification <Genius hardware>`
- Submitting jobs using :ref:`Slurm Workload Manager <Antwerp Slurm>` and the
:ref:`Advanced topics <Antwerp advanced Slurm>`
- :ref:`running jobs <running jobs>`
- :ref:`obtaining credits <KU Leuven credits>` and
:ref:`Slurm accounting <accounting_leuven>`
In case you are not yet familiar with Slurm and/or the Genius hardware, you can find
more information on the following pages:

The default ``batch`` partition allows jobs with maximum 3 days of walltime.
Jobs which require a walltime up to maximum 7 days must be submitted to the
``batch_long`` partition explicitly.
- :ref:`Genius hardware <Genius hardware>`
- :ref:`Slurm jobs (basics) <Antwerp Slurm>` and
:ref:`Slurm jobs (advanced) <Antwerp advanced Slurm>`
- :ref:`General info on running jobs <running jobs>`
- :ref:`Obtaining compute credits <KU Leuven credits>` and
:ref:`Slurm credit accounting <accounting_leuven>`

The `Slurm Workload Manager <https://slurm.schedmd.com>`_ is the scheduler, resource manager and
accounting manager on Genius (and wICE).
To get started with Slurm, you may refer to the internal documentation on
:ref:`Basics of Slurm <Antwerp Slurm>` and :ref:`Advanced Slurm <Antwerp advanced Slurm>` usage.

.. _submit to genius compute node:
.. _submit_genius_batch:

Submit to a compute node
~~~~~~~~~~~~~~~~~~~~~~~~
Submitting a compute job boils down to specifying the required number of nodes, cores-per-node, memory and walltime.
You may e.g. request two full nodes like this::
Submitting to a regular compute node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The regular (thin) compute nodes are gathered in the default ``batch`` partition.
This partition allows jobs with a walltime of at most 3 days. Jobs which require a
walltime of up to 7 days need to be submitted to the ``batch_long`` partition
instead.

Submitting a regular compute job boils down to specifying the required number of
nodes, cores-per-node, memory and walltime. You may e.g. request two full nodes like
this::

$ sbatch -A lp_myproject -M genius -t 2:00:00 --nodes=2 --ntasks-per-node=36 myjobscript.slurm
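
For reference, a minimal ``myjobscript.slurm`` for such a two-node run could look
roughly as follows; the ``srun`` launcher and the ``./my_mpi_app`` executable are
placeholders to be replaced by whatever fits your own application, and the resource
options may equally well be kept on the ``sbatch`` command line instead::

   #!/bin/bash -l
   #SBATCH --account=lp_myproject
   #SBATCH --clusters=genius
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=36
   #SBATCH --time=2:00:00

   # Start one MPI rank per allocated task (placeholder executable).
   srun ./my_mpi_app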

You may also request only a part of the resources on a node.
For instance, to test a multi-threaded application which performs optimally using 4 cores, you may submit your job like this::
For instance, to test a multi-threaded application which performs optimally using 4 cores,
you may submit your job like this::

$ sbatch -A lp_myproject -M genius -t 2:00:00 --ntasks=4 myjobscript.slurm
# or
$ sbatch -A lp_myproject -M genius -t 2:00:00 --ntasks=1 --cpus-per-task=4 myjobscript.slurm
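
As an illustration of the second variant, a job script for such a multi-threaded
run could be sketched as follows (``./my_threaded_app`` is again a placeholder)::

   #!/bin/bash -l
   #SBATCH --account=lp_myproject
   #SBATCH --clusters=genius
   #SBATCH --ntasks=1
   #SBATCH --cpus-per-task=4
   #SBATCH --time=2:00:00

   # Use as many threads as there are cores allocated to the task.
   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   ./my_threaded_app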

.. note::

Please bear in mind to not exceed the maximum allowed resources on compute nodes for the targeted partition.
E.g. you can request at most 36 cores per node (``--ntasks=36``).
In general, we advise you to only request as much resources as needed by your application.
Please bear in mind not to exceed the maximum allowed resources on the compute
nodes of the targeted partition. E.g. you can request at most 36 cores per
node (``--ntasks=36``). In general, we advise you to only request as many
resources as your application needs.

.. note::

If you do not provide a walltime for your job, then a default walltime will
be applied. This is 1 hour for all partitions, except for the debug partitions
where it is 30 minutes (see :ref:`submit_genius_debug` below).

.. note::

If you do not specify the number of tasks and cores per task for your job,
then it will default to a single task running on a single core.

.. note::

By default, each job will use a single core on a single node for a duration of 1 hour.
In other words, these default values are implicitly applied
``--nodes=1 --ntasks=1 --mem-per-cpu=5000M --time=1:00:00``.
Each partition also has a default amount of memory that is provided for
every allocated core. For the ``batch`` partition, for example, this is 5000 MB,
which corresponds to the ``--mem-per-cpu=5000M`` Slurm option.
You may choose higher values if your application requires more memory
than what is provided by default. When doing so, keep in mind that e.g.
specifying ``--mem-per-cpu=10G`` will be interpreted as a request for
10240 MB and not 10000 MB.
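
As a hypothetical example, the following job requests 10000 MB per core by
spelling out the value in MB::

   $ sbatch -A lp_myproject -M genius -t 2:00:00 --ntasks=4 --mem-per-cpu=10000M myjobscript.slurm

Writing ``--mem-per-cpu=10G`` instead would reserve 4 x 10240 MB rather than
4 x 10000 MB.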


Advanced node usage
^^^^^^^^^^^^^^^^^^^
@@ -81,27 +98,28 @@ Otherwise, your job will land on the first available node(s) as decided by Slurm
By default, all nodes are shared among all jobs and users, unless the resource specifications
would imply an exclusive access to a node by a job or user.
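
If a job really must not share its node(s) with other jobs, the standard Slurm
``--exclusive`` option can be added, for instance::

   $ sbatch -A lp_myproject -M genius -t 2:00:00 --ntasks=4 --exclusive myjobscript.slurm

Use this sparingly, as any cores left idle on an exclusively allocated node remain
blocked for other users.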

.. _submit to genius GPU node:

Submit to a GPU node
~~~~~~~~~~~~~~~~~~~~
The GPU nodes are accessible via different partitions.
The table below summarizes all the possibilities:
.. _submit_genius_gpu:

Submitting to a GPU node
~~~~~~~~~~~~~~~~~~~~~~~~
The GPU nodes are accessible via the following partitions:

+---------------+----------+----------------------------------------+-------------+
| Partition     | Walltime | Resources                              | CPU model   |
+===============+==========+========================================+=============+
| gpu_p100      | 3 days   | 20 nodes, 4x Nvidia P100 GPUs per node | Skylake     |
+---------------+----------+                                        |             |
| gpu_p100_long | 7 days   |                                        |             |
+---------------+----------+----------------------------------------+-------------+
| gpu_v100      | 3 days   | 2 nodes, 8x Nvidia V100 GPUs per node  | Cascadelake |
+---------------+----------+                                        |             |
| gpu_v100_long | 7 days   |                                        |             |
+---------------+----------+----------------------------------------+-------------+


Similar to the other nodes, the GPU nodes can be shared by different jobs from
different users.
However every user will have exclusive access to the number of GPUs requested.
However, every user will have exclusive access to the number of GPUs requested.
If you want to use only 1 GPU of type P100 you can submit for example like this::

$ sbatch -A lp_my_project -M genius -N 1 -n 9 --gpus-per-node=1 -p gpu_p100 myjobscript.slurm
@@ -114,36 +132,37 @@ requested, so in case of for example 3 GPUs you will have to specify this::

To specifically request V100 GPUs, you can submit for example like this::

$ sbatch -A lp_my_project -M genius -N 1 -n 4 --gpus-per-node=1 --mem-per-cpu=20G -p gpu_v100 myjobscript.slurm
$ sbatch -A lp_my_project -M genius -N 1 -n 4 --gpus-per-node=1 --mem-per-cpu=20000M -p gpu_v100 myjobscript.slurm
For the V100 type of GPU, it is required that you request 4 cores for each GPU.
Also notice that these nodes offer a much larger memory bank.
Also notice that these nodes offer a much larger amount of CPU memory.
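
Putting the above together, a single-GPU V100 job script could be sketched as
follows (``./my_gpu_app`` is a placeholder for a GPU-enabled application, and
``nvidia-smi`` merely lists the allocated device)::

   #!/bin/bash -l
   #SBATCH --account=lp_my_project
   #SBATCH --clusters=genius
   #SBATCH --nodes=1
   #SBATCH --ntasks=4
   #SBATCH --gpus-per-node=1
   #SBATCH --mem-per-cpu=20000M
   #SBATCH --partition=gpu_v100
   #SBATCH --time=2:00:00

   # Show which GPU was assigned to the job, then run the application.
   nvidia-smi
   ./my_gpu_app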


.. _submit to genius big memory node:
.. _submit_genius_bigmem:

Submit to a big memory node
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The big memory nodes are hosted by the ``bigmem`` and ``bigmem_long`` partitions.
Submitting to a big memory node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The big memory nodes are located in the ``bigmem`` and ``bigmem_long`` partitions.
In case of the big memory nodes it is also important to add your memory requirements,
for example::

$ sbatch -A lp_my_project -M genius -N 1 -n 36 --mem-per-cpu=20G -p bigmem myjobscript.slurm
$ sbatch -A lp_my_project -M genius -N 1 -n 36 --mem-per-cpu=20000M -p bigmem myjobscript.slurm
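
As a rough sanity check, the memory requested in the example above amounts to::

   36 cores/node x 20000 MB/core = 720000 MB (about 703 GiB per node)

which still fits within the 768 GB of RAM of a big memory node.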

.. _submit to genius AMD node:

Submit to an AMD node
~~~~~~~~~~~~~~~~~~~~~
.. _submit_genius_amd:

Submitting to an AMD node
~~~~~~~~~~~~~~~~~~~~~~~~~
The AMD nodes are accessible via the ``amd`` and ``amd_long`` partitions.
Besides specifying the partition, it is also important to note that the default memory
per core in this partition is 3800 MB, and each node offers maximum 64 cores.
per core in this partition is 3800 MB, and each node contains 64 cores.
For example, to request two full nodes::

$ sbatch -A lp_my_project -M genius -N 2 --ntasks-per-node=64 -p amd myjobscript.slurm
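
Note that the default memory per core roughly matches what an AMD node offers in
total::

   64 cores/node x 3800 MB/core = 243200 MB (about 237 GiB per node)

Jobs that request more memory per core than this default will therefore be able to
use fewer than 64 cores per node.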


.. _submit_genius_debug:

Running debug jobs
------------------
Debugging on a busy cluster can be taxing due to long queue times.
8 changes: 4 additions & 4 deletions source/leuven/tier2_hardware/genius_hardware.rst
@@ -32,7 +32,7 @@ Hardware details
- 2 Xeon Gold 6140 CPUs\@2.3 GHz (Skylake), 18 cores each
- 768 GB RAM
- 200 GB SSD local disk
- partition ``bigmem``, specific Slurm :ref:`options <submit to genius big memory node>` apply
- partition ``bigmem``, specific Slurm :ref:`options <submit_genius_bigmem>` apply

- 22 GPGPU nodes, 96 GPU devices

@@ -42,23 +42,23 @@ Hardware details
- 192 GB RAM
- 4 NVIDIA P100 SXM2\@1.3 GHz, 16 GB GDDR, connected with NVLink
- 200 GB SSD local disk
- partition ``gpu_p100``, specific Slurm :ref:`options <submit to genius GPU node>` apply
- partition ``gpu_p100``, specific Slurm :ref:`options <submit_genius_gpu>` apply

- 2 V100 nodes

- 2 Xeon Gold 6240 CPUs\@2.6 GHz (Cascadelake), 18 cores each
- 768 GB RAM
- 8 NVIDIA V100 SXM2\@1.5 GHz, 32 GB GDDR, connected with NVLink
- 200 GB SSD local disk
- partition ``gpu_v100``, specific Slurm :ref:`options <submit to genius GPU node>` apply
- partition ``gpu_v100``, specific Slurm :ref:`options <submit_genius_gpu>` apply


- 4 AMD nodes

- 2 EPYC 7501 CPUs\@2.0 GHz, 32 cores each
- 256 GB RAM
- 200 GB SSD local disk
- partition ``amd``, specific Slurm :ref:`options <submit to genius AMD node>` apply
- partition ``amd``, specific Slurm :ref:`options <submit_genius_amd>` apply

The nodes are connected using an InfiniBand EDR network (bandwidth 25 Gb/s); the islands
are indicated in the diagram below.
