Slurm job template: how a job can probe instance topology and hostname-instanceid mappings… #268
base: main
…e-instanceid mappings
Not sure how I feel about this being a job submission script as opposed to a bash script customers can include in their own job submission script. It seems counterproductive to tell the narrative "you can bring your own Slurm submission scripts" while also suggesting they use our template, which is very HyperPod-specific.
Just some food for thought, I'm ok to merge once comments are addressed.
# Helper function to query instance topology
lstopo_ec2() {
    local INSTANCE_IDS=( $(srun cat /sys/devices/virtual/dmi/id/board_asset_tag) )
    aws ec2 describe-instance-topology --instance-ids "${INSTANCE_IDS[@]}"
nice! I like how simple this is.
@@ -0,0 +1,154 @@
# Slurm job template to probe EC2 informations

Usage: review and customize [job-template.sbatch](job-template.sbatch) to your need.
Please include bullet points on the features here. If I'm reading this correctly, the script:
- Outputs instance topology
- Outputs the hostname to instance ID mapping
- Checks that all instances are on the same network spine
- Prints the start, end, and elapsed time of the job
- Outputs instance IDs after the job completes (so you can see if any have been replaced)
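The "same network spine" check in the list above could be sketched roughly as follows. This is an illustration only, not the script under review: `same_top_node` is a hypothetical helper name, and the sketch assumes that the first entry of each instance's `NetworkNodes` list in the `aws ec2 describe-instance-topology` output represents the top network layer.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): reads "instance_id<TAB>network_node"
# lines on stdin and succeeds only if every line reports the same network node.
same_top_node() {
    awk 'NF >= 2 && !seen[$2]++ { n++ } END { exit (n == 1 ? 0 : 1) }'
}

# Intended use inside a job (assumes AWS CLI credentials are available and that
# INSTANCE_IDS was collected the same way lstopo_ec2 does it):
#   aws ec2 describe-instance-topology --instance-ids "${INSTANCE_IDS[@]}" \
#       --query 'Instances[].[InstanceId,NetworkNodes[0]]' --output text \
#     | same_top_node || echo "WARNING: instances span multiple top-level network nodes"
```

Keeping the comparison separate from the AWS call means the check can be exercised on a saved topology dump without cluster access.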
I don't understand the value of this since Slurm only comes with network topology awareness.
response in a separate comment.
#SBATCH --nodes=2 # number of nodes to use

set -exuo pipefail
Do we need to print all output and commands? Seems too verbose.
response in a separate comment.
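For adopters who find `set -exuo pipefail` too verbose, one possible way to tone it down is to keep strict mode and trace only selected commands. The sketch below is a suggestion rather than anything in the PR; `traced` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): run one command with xtrace enabled
# inside a subshell, so the tracing does not leak into the rest of the script.
traced() {
    ( set -x; "$@" )
}

# Intended use in a job script:
#   set -euo pipefail                   # strict mode without global tracing
#   traced srun -l /usr/bin/hostname    # only this command is echoed to stderr
```

The subshell returns the command's exit status, so a failing traced command still stops the job under `set -e`.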
srun -l bash -c "echo \"hostname <=> instance_id mapping: \$(hostname) <=> \$(cat /sys/devices/virtual/dmi/id/board_asset_tag)\""

# Track per-instance cumulative Lustre statistics. In this example, we only show the write_bytes.
srun -l bash -c "echo BEFORE: \$(hostname) \$(sudo lctl get_param llite.*.stats | grep write_bytes)" || true
is this attempting to show lustre read/write statistics for the job?
yes.
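Because the Lustre counters are cumulative, the per-job figure is the difference between a reading taken before and one taken after the workload. A minimal sketch of that arithmetic follows. `write_bytes_total` is a hypothetical helper, and it assumes the usual stats line layout `write_bytes <count> samples [bytes] <min> <max> <sum>`, which may differ between Lustre versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): read `lctl get_param llite.*.stats`
# output on stdin and print the summed write_bytes total across all mounts.
# Assumes the cumulative byte count is the 7th whitespace-separated field.
write_bytes_total() {
    awk '$1 == "write_bytes" { total += $7 } END { printf "%.0f\n", total }'
}

# Intended use on a compute node:
#   before=$(sudo lctl get_param llite.*.stats | write_bytes_total)
#   ... run the workload ...
#   after=$(sudo lctl get_param llite.*.stats | write_bytes_total)
#   echo "$(hostname) wrote $(( after - before )) bytes during the job"
```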
################################################################################
BEGIN_TRAINING=$(date)
SECONDS=0
srun -l /usr/bin/hostname
Add a block here for the training code
################################################################################
## Insert training code here...
################################################################################
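The timing around the training block relies on bash's built-in `SECONDS` counter, which counts up from whatever value it is assigned. A sketch of how the start, end, and elapsed time might be reported is shown below; `format_elapsed` is a hypothetical helper, not part of the template.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): format a number of seconds as HH:MM:SS.
format_elapsed() {
    local total=$1
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# Intended use around the training block:
#   BEGIN_TRAINING=$(date)
#   SECONDS=0                 # bash increments SECONDS once per second from here
#   ... training commands ...
#   echo "Started: ${BEGIN_TRAINING}; ended: $(date); elapsed: $(format_elapsed "$SECONDS")"
```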
This template provides a collection of recipes. It's not meant to be plug-and-play; the recipes are meant to be cherry-picked. It's also intentionally very verbose, and it's up to the adopter to tone it down. On describing the topology, you're right: the intent is not to affect scheduling, but for the job to collect runtime information that can later be analyzed (or traced back).
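As one example of the later analysis mentioned here, the hostname to instance ID mappings logged before and after a job could be compared to spot replaced instances. The sketch below is an assumption about how that might be done, not code from the PR: `changed_instances` is a hypothetical helper, and it expects each listing to hold one `hostname instance_id` pair per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): compare two "hostname instance_id"
# listings (file paths in $1 and $2) and print each hostname whose ID changed.
changed_instances() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && before[$1] != $2 { print $1 ": " before[$1] " -> " $2 }' "$1" "$2"
}

# Intended use (the listings could be captured with the same board_asset_tag
# lookup the template uses):
#   srun bash -c 'echo "$(hostname) $(cat /sys/devices/virtual/dmi/id/board_asset_tag)"' > before.txt
#   ... job runs ...
#   srun bash -c 'echo "$(hostname) $(cat /sys/devices/virtual/dmi/id/board_asset_tag)"' > after.txt
#   changed_instances before.txt after.txt
```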
dd72aaf to d155905
44e448e to 1209815
Issue #, if available: N/A
Description of changes: a sample template for writing a Slurm job that probes EC2 information, so that job logs contain as much information as possible for later analysis.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.