Merge branch 'main' into Cal-wip
Signed-off-by: Cal <[email protected]>
CallumWalley authored Dec 5, 2023
2 parents 23d4006 + da98009 commit 5643aff
Showing 6 changed files with 62 additions and 126 deletions.
3 changes: 2 additions & 1 deletion .markdownlint.json
@@ -3,4 +3,5 @@
"MD033": false,
"MD038": false,
"MD046": false
}
"MD041": false
}
6 changes: 6 additions & 0 deletions .proselint.json
@@ -0,0 +1,6 @@
{
"checks":{
"hyperbole.misc": false,
"typography.exclamation": false,
"typography.symbols": false
}}
11 changes: 7 additions & 4 deletions checks/run_proselint.py
@@ -6,18 +6,21 @@

import sys
import proselint
from proselint import config

from proselint import config, tools


files = sys.argv[1:]

ret_code = 0
proselint.config.default["checks"]["hyperbole.misc"] = False

# Load defaults from config.
config_custom = tools.load_options(config_file_path=".proselint.json", conf_default=config.default)

print(config_custom)

for file in files:
    with open(file, "r", encoding="utf8") as f:
        for notice in proselint.tools.lint(f.read(), config=config.default):
        for notice in proselint.tools.lint(f.read(), config=config_custom):
            if (notice[7] == "error"):
                ret_code = 1
            print(f"::{notice[7]} file={file},line={notice[2]},col={notice[3]},endLine={notice[2]+notice[6]},title={notice[0]}::'{notice[1]}'")
78 changes: 29 additions & 49 deletions docs/Getting_Started/Next_Steps/Finding_Job_Efficiency.md
@@ -4,22 +4,12 @@ hidden: false
position: 5
tags:
- slurm
title: Finding Job Efficiency
vote_count: 8
vote_sum: 8
zendesk_article_id: 360000903776
zendesk_section_id: 360000189716
---



[//]: <> (REMOVE ME IF PAGE VALIDATED)
[//]: <> (vvvvvvvvvvvvvvvvvvvv)
!!! warning
This page has been automatically migrated and may contain formatting errors.
[//]: <> (^^^^^^^^^^^^^^^^^^^^)
[//]: <> (REMOVE ME IF PAGE VALIDATED)

## On Job Completion

It is good practice to have a look at the resources your job used on
@@ -29,13 +19,13 @@ future.
Once your job has finished, check the relevant details using the tools
`nn_seff` or `sacct`. For example:

**nn\_seff**
### Using `nn_seff`

``` sl
```bash
nn_seff 30479534
```

``` sl
```txt
Job ID: 1936245
Cluster: mahuika
User/Group: user/group
@@ -48,30 +38,25 @@ CPU Efficiency: 98.55% 00:01:08 of 00:01:09 core-walltime
Mem Efficiency: 10.84% 111.00 MB of 1.00 GB
```

Notice that the CPU efficiency was high but the memory efficiency was
very low and consideration should be given to reducing memory requests
for similar jobs.  If in doubt, please contact <[email protected]> for
guidance.


Notice that the CPU efficiency was high but the memory efficiency was low, so consider reducing memory requests
for similar jobs. If in doubt, please contact [[email protected]](mailto:[email protected]) for guidance.
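
For instance, based on the ~111 MB peak usage reported above, a lower request along these lines (an illustrative value, not a recommendation) would still leave generous headroom:

```bash
#SBATCH --mem=512M
```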

**sacct**
### Using `sacct`

``` sl
```bash
sacct --format="JobID,JobName,Elapsed,AveCPU,MinCPU,TotalCPU,Alloc,NTask,MaxRSS,State" -j <jobid>
```
!!! prerequisite Tip

!!! tip
*If you want to make this your default `sacct` setting, run:*
``` sl
```bash
echo 'export SACCT_FORMAT="JobID,JobName,Elapsed,AveCPU,MinCPU,TotalCPU,Alloc%2,NTask%2,MaxRSS,State"' >> ~/.bash_profile
source ~/.bash_profile
```

------------------------------------------------------------------------

Below is an output for reference:

``` sl
```txt
JobID JobName Elapsed AveCPU MinCPU TotalCPU AllocCPUS NTasks MaxRSS State
------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ----------
3007056 rfm_ANSYS+ 00:27:07 03:35:55 16 COMPLETED
@@ -82,9 +67,7 @@ Below is an output for reference:
*All of the adjustments below still allow for a degree of variation.
There may be factors you have not accounted for.*

------------------------------------------------------------------------

### **Walltime**
#### Walltime

From the `Elapsed` field we may want to update our next run to have a
more appropriate walltime.
@@ -93,7 +76,7 @@ more appropriate walltime.
#SBATCH --time=00:40:00
```

### **Memory**
#### Memory

The `MaxRSS` field shows the maximum memory used by each of the job
steps, so in this case 13 GB. For our next run we may want to set:
@@ -102,7 +85,7 @@ steps, so in this case 13 GB. For our next run we may want to set:
#SBATCH --mem=15G
```

### **CPU's**
#### CPUs

`TotalCPU` is the number of computation hours; in the best-case scenario
the computation hours would be equal to `Elapsed` x `AllocCPUS`.
@@ -116,8 +99,6 @@ however bear in mind there are other factors that affect CPU efficiency.
#SBATCH --cpus-per-task=10
```
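
As a rough, illustrative cross-check using the `rfm_ANSYS` numbers from the `sacct` output above, CPU efficiency can be estimated as `TotalCPU / (Elapsed x AllocCPUS)`:

```bash
# Back-of-the-envelope CPU efficiency for the example job above:
# TotalCPU = 03:35:55 (12955 s), Elapsed = 00:27:07 (1627 s), AllocCPUS = 16.
awk 'BEGIN { printf "%.1f%%\n", 12955 / (1627 * 16) * 100 }'    # prints 49.8%
```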



Note: When using sacct to determine the amount of memory your job used -
in order to reduce memory wastage - please keep in mind that Slurm
reports the figure as RSS (Resident Set Size) when in fact the metric
@@ -153,19 +134,20 @@ If 'nodelist' is not one of the fields in the output of your `sacct` or
`squeue` commands, you can find the node a job is running on using the
command `squeue -h -o %N -j <jobid>`. The node name will look something like
`wbn123` on Mahuika or `nid00123` on Māui.
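
For example (the job ID and node name below are made up for illustration):

```bash
squeue -h -o %N -j 30479534
# wbn175
```
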
!!! prerequisite Note

!!! note
If your job is using MPI it may be running on multiple nodes

### htop 
### Using `htop`

``` sl
```bash
ssh -t wbn175 htop -u $USER
```

If it is your first time connecting to that particular node, you may be
prompted:

``` sl
```txt
The authenticity of host can't be established 
Are you sure you want to continue connecting (yes/no)?
```
@@ -185,15 +167,16 @@ Processes in green can be ignored

**S** - State, what the thread is currently doing.

- R - Running
- S - Sleeping, waiting on another thread to finish.
- D - Uninterruptible sleep (usually waiting on disk or network I/O).
- Any other letter - Something has gone wrong!

**CPU%** - Percentage CPU utilisation.

**MEM% **Percentage Memory utilisation.
!!! prerequisite Warning
**MEM%** - Percentage Memory utilisation.

!!! warning
If the job finishes or is killed, you will be kicked off the node. If
htop freezes, type `reset` to clear your terminal.

@@ -204,21 +187,18 @@ time* the CPUs are in use. This is not enough to get a picture of
overall job efficiency, as required CPU time *may vary by number of CPUs*.

The only way to get the full context, is to compare walltime performance
between jobs at different scale. See [Job
Scaling](../../Getting_Started/Next_Steps/Job_Scaling_Ascertaining_job_dimensions.md)
for more details.
The only way to get the full context is to compare walltime performance between jobs at different scales. See [Job Scaling](../../Getting_Started/Next_Steps/Job_Scaling_Ascertaining_job_dimensions.md) for more details.
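
One illustrative way to gather those timings is to submit the same job script at several core counts and compare the reported walltimes afterwards; the script name and core counts below are placeholders:

```bash
# Sketch only: run the same workload at several CPU counts, then compare
# Elapsed time and CPU efficiency for each job with nn_seff or sacct.
for n in 2 4 8 16; do
    sbatch --cpus-per-task="$n" --job-name="scaling_${n}cpu" my_job.sl
done
```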

### Example

![qdyn\_eff.png](../../assets/images/Finding_Job_Efficiency_0.png)

From the above plot of CPU efficiency, you might decide a 5% reduction
of CPU efficiency is acceptable and scale your job up to 18 CPU cores.

![qdyn\_walltime.png](../../assets/images/Finding_Job_Efficiency_1.png)

However, when looking at a plot of walltime it becomes apparent that
performance gains per CPU added drop significantly after 4 CPUs, and in
fact absolute performance losses (negative returns) are seen after 8
CPUs.
88 changes: 16 additions & 72 deletions docs/Getting_Started/Next_Steps/Job_Scaling_Ascertaining_job_dimensions.md
@@ -11,15 +11,6 @@ zendesk_article_id: 360000728016
zendesk_section_id: 360000189716
---



[//]: <> (REMOVE ME IF PAGE VALIDATED)
[//]: <> (vvvvvvvvvvvvvvvvvvvv)
!!! warning
This page has been automatically migrated and may contain formatting errors.
[//]: <> (^^^^^^^^^^^^^^^^^^^^)
[//]: <> (REMOVE ME IF PAGE VALIDATED)

When you run software in an interactive environment such as your
ordinary workstation (desktop PC or laptop), the software is able to
request from the operating system whatever resources it needs from
@@ -32,9 +23,9 @@ others to use at the same time.
The three resources that every single job submitted on the platforms
needs to request are:

- CPUs (i.e. logical CPU cores), and
- Memory (RAM), and
- Time.

Some jobs will also need to request GPUs.
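
As a minimal sketch (the values and names are placeholders, not recommendations), these requests typically appear at the top of a Slurm batch script:

```bash
#!/bin/bash -e
#SBATCH --job-name=my_job       # placeholder job name
#SBATCH --cpus-per-task=4       # CPUs (logical cores)
#SBATCH --mem=2G                # memory
#SBATCH --time=01:00:00         # wall time

srun my_program                 # placeholder executable
```

Jobs that need a GPU would add a GPU request as well (for example `--gpus-per-node=1`), though the exact option can depend on the cluster configuration.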

@@ -44,65 +35,18 @@ When you are initially trying to set up your jobs it can be difficult to
ascertain how much of each of these resources you will need. Asking for
too little or too much, however, can both cause problems: your jobs will
be at increased risk of taking a long time in the queue or failing, and
your project's [fair share
score](../../Scientific_Computing/Running_Jobs_on_Maui_and_Mahuika/Fair_Share.md)
is likely to suffer.  Your project's fair share score will be reduced in
your project's [fair share score](../../Scientific_Computing/Running_Jobs_on_Maui_and_Mahuika/Fair_Share.md)
is likely to suffer. Your project's fair share score will be reduced in
view of compute time spent regardless of whether you obtain a result or
not.

<table style="width: 646px;">
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<tbody>
<tr class="odd">
<td class="wysiwyg-text-align-center"
style="width: 60px"><strong>Resource</strong></td>
<td class="wysiwyg-text-align-center"
style="width: 287px"><strong>Asking for too much</strong></td>
<td class="wysiwyg-text-align-center" style="width: 293px"><strong>Not
asking for enough</strong></td>
</tr>
<tr class="even">
<td style="width: 60px">Number of CPUs</td>
<td style="width: 287px"><ul>
<li>The job may wait in the queue for longer.</li>
<li>Your fair share score will <span>fall rapidly (your project will be
charged for CPU cores that it reserved but didn't use)</span></li>
</ul></td>
<td style="width: 293px"><ul>
<li>The job will run more slowly than expected, and so may run out of
time and get killed for exceeding its time limit.</li>
</ul></td>
</tr>
<tr class="odd">
<td style="width: 60px">Memory</td>
<td style="width: 287px"><ul>
<li>The job may wait in the queue for longer.</li>
<li>Your fair share score will fall more than necessary.</li>
</ul></td>
<td style="width: 293px"><ul>
<li>Your job will fail, probably with an 'OUT OF MEMORY' error,
segmentation fault or bus error. This may not happen immediately.</li>
</ul></td>
</tr>
<tr class="even">
<td style="width: 60px">Wall time</td>
<td style="width: 287px"><ul>
<li>The job may wait in the queue for longer than necessary</li>
</ul></td>
<td style="width: 293px"><ul>
<li>The job will run out of time and get killed. </li>
</ul></td>
</tr>
</tbody>
</table>
| Resource | Asking for too much | Not asking for enough |
|---|---|---|
| CPUs | The job may wait in the queue for longer. Your fair share score will fall rapidly (your project will be charged for CPU cores that it reserved but didn't use) | The job will run more slowly than expected, and so may run out of time and get killed for exceeding its time limit. |
| Memory | The job may wait in the queue for longer. Your fair share score will fall more than necessary. | Your job will fail, probably with an 'OUT OF MEMORY' error, segmentation fault or bus error. This may not happen immediately. |
| Wall time | The job may wait in the queue for longer than necessary | The job will run out of time and get killed. |

***See our ["What is an allocation?" support
page](../../Getting_Started/Accounts-Projects_and_Allocations/What_is_an_allocation.md)
for more details on how each resource effects your compute usage.***
***See our ["What is an allocation?" support page](../../Getting_Started/Accounts-Projects_and_Allocations/What_is_an_allocation.md) for more details on how each resource affects your compute usage.***

It is therefore important to try to make your jobs' resource requests
reasonably accurate. In this article we will discuss how you can scale
@@ -131,7 +75,7 @@ they will both queue faster and run for less time. Also, if one of these
jobs fails due to not asking for enough resources, a small scale job
will (hopefully) not have waited for hours or days in the queue
beforehand.
!!! prerequisite Examples
[Multithreading
Scaling](../../Getting_Started/Next_Steps/Multithreading_Scaling_Example.md)
[MPI Scaling](../../Getting_Started/Next_Steps/MPI_Scaling_Example.md)

!!! example
- [Multithreading Scaling](../../Getting_Started/Next_Steps/Multithreading_Scaling_Example.md)
- [MPI Scaling](../../Getting_Started/Next_Steps/MPI_Scaling_Example.md)
2 changes: 2 additions & 0 deletions docs/format.md
@@ -304,6 +308,8 @@ Note the additional spacing around the `+` else it will appear cramped.

snake-case anchors are automatically generated for all headers.

For example, a header `## This is my Header` can be linked to with the anchor `[Anchor Link](#this-is-my-header)`.

### Tooltips

[Hover over me](https://example.com "I'm a link with a custom tooltip.")
