Slurm job template: how a job can probe instance topology and hostname-instanceid mappings… #268

Draft
wants to merge 1 commit into main from slurm-job-template-probe-ec2

Conversation

verdimrc
Contributor

Issue #, if available: N/A

Description of changes: a sample template showing how to write a Slurm job that probes EC2 information, so that job logs contain as much info as possible for later analysis.

  • check instance topology
  • display the mapping between the hostnames of allocated nodes and their instance IDs.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@sean-smith sean-smith left a comment

Not sure how I feel about this being a job submission script as opposed to a bash script customers can include in their own job submission scripts. It seems counterproductive to tell the narrative "you can bring your own Slurm submission scripts" while on the other hand suggesting they use our template, which is very HyperPod specific.

Just some food for thought; I'm OK to merge once comments are addressed.

# Helper function to query instance topology
lstopo_ec2() {
    # Each allocated node reports its EC2 instance id via the board_asset_tag sysfs entry.
    local INSTANCE_IDS=( $(srun cat /sys/devices/virtual/dmi/id/board_asset_tag) )
    aws ec2 describe-instance-topology --instance-ids "${INSTANCE_IDS[@]}"
}
Contributor

nice! I like how simple this is.
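
For reference, a minimal usage sketch of this helper inside the job script, assuming the AWS CLI is recent enough to support describe-instance-topology (default JSON output) and the node has the ec2:DescribeInstanceTopology permission; the output file name is illustrative:

# Snapshot the topology of the allocated instances into a per-job file so it
# can be inspected after the job finishes.
lstopo_ec2 | tee "topology-${SLURM_JOB_ID}.json"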

@@ -0,0 +1,154 @@
# Slurm job template to probe EC2 information

Usage: review and customize [job-template.sbatch](job-template.sbatch) to your needs.
Contributor

Please include bullet points on the features here. If I'm reading this correctly, the script does:

  • Outputs instance topology
  • Outputs hostname to instance ID mapping
  • Checks all instances are on the same network spine
  • Prints out start, end, and elapsed time in the job
  • Outputs instance IDs after the job completes (so you can see if any have been replaced)

Contributor

I don't understand the value of this since Slurm already comes with network topology awareness.

Contributor Author

response in a separate comment.


#SBATCH --nodes=2 # number of nodes to use

set -exuo pipefail
Contributor

Do we need to print all output and commands? Seems too verbose.

Contributor Author

response in a separate comment.

srun -l bash -c "echo \"hostname <=> instance_id mapping: \$(hostname) <=> \$(cat /sys/devices/virtual/dmi/id/board_asset_tag)\""

# Track per-instance cumulative Lustre statistics. In this example, we only show the write_bytes.
srun -l bash -c "echo BEFORE: \$(hostname) \$(sudo lctl get_param llite.*.stats | grep write_bytes)" || true
Contributor

is this attempting to show lustre read/write statistics for the job?

Contributor Author

yes.
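
The BEFORE snapshot above, together with a matching AFTER snapshot (assuming the template logs one once the training step finishes), can be diffed offline to estimate how much the job wrote to Lustre. A rough post-processing sketch, assuming the usual llite stats line layout ("<name> <count> samples [bytes] <min> <max> <sum>") and an illustrative job log name:

# Sum the cumulative write_bytes counters (last field) across all nodes,
# before and after the training step, then print the difference.
before=$(grep 'BEFORE:' slurm-job.out | awk '{s += $NF} END {print s}')
after=$(grep 'AFTER:' slurm-job.out | awk '{s += $NF} END {print s}')
echo "Approximate bytes written during the job: $((after - before))"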

################################################################################
BEGIN_TRAINING=$(date)
SECONDS=0
srun -l /usr/bin/hostname
Contributor

Add a block here for the training code

################################################################################
## Insert training code here...
################################################################################
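
Building on the BEGIN_TRAINING / SECONDS bookkeeping earlier in the hunk, the tail end of such a block could report the start, end, and elapsed time; a sketch of the idea, not necessarily the exact lines in this PR:

# Report wall-clock start, end, and elapsed time once the training step returns.
END_TRAINING=$(date)
echo "Training began : ${BEGIN_TRAINING}"
echo "Training ended : ${END_TRAINING}"
echo "Elapsed seconds: ${SECONDS}"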

@verdimrc verdimrc marked this pull request as draft April 18, 2024 04:55
@verdimrc
Contributor Author

This template provides a collection of recipes. It's not meant to be plug-and-play, but cherry-picked from. It's intentionally very verbose, and it's up to adopters to tone it down.

On the describe-topology call, you're right that the intent is not to affect scheduling, but for the job to collect runtime information that can later be analyzed (or traced back).
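
As a concrete example of that collect-now-analyze-later workflow, a topology snapshot saved by the job (such as the hypothetical topology-${SLURM_JOB_ID}.json above) can be checked afterwards to confirm all instances sat under the same spine, assuming jq is available and that the first NetworkNodes entry is the topmost layer of the hierarchy:

# Count distinct top-level network nodes across the allocated instances;
# a result of 1 suggests every node shared the same network spine.
jq -r '.Instances[].NetworkNodes[0]' "topology-${SLURM_JOB_ID}.json" | sort -u | wc -l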

@KeitaW KeitaW force-pushed the slurm-job-template-probe-ec2 branch 2 times, most recently from dd72aaf to d155905 Compare June 4, 2024 02:26
@KeitaW KeitaW force-pushed the main branch 2 times, most recently from 44e448e to 1209815 Compare June 4, 2024 02:30