Allow customization of Parallelstore mounts #3144

wiktorn · 2024-10-17T09:26:53Z

Add two customization options:

daos_agent_config - to allow providing additional configuration in /etc/daos/daos_agent.yml
dfuse_environment - to allow providing environment variables for dfuse process

Topics for discussion:

whether using templates is proper approach here (it gives a lot of flexibility)
support for whitespaces in mount points (currently not supported, not sure if it was supported earlier)
handling of /etc/daos/daos_agent.yml - currently anyway everything is commented, and we just uncomment specific lines with sed. The other approach is either to create a new file (as implemnted), or just make sure that everything is commented, and just add configuration at the end. I think this aproach is cleaner in terms of what is expected result / file content.
usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior
Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

TODO

align those changes with modules/file-system/pre-existing-network-storage

modules/file-system/parallelstore/variables.tf

harshthakkar01 · 2024-11-12T19:40:08Z

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

harshthakkar01 · 2024-11-12T19:41:41Z

usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior

Current behavior we dont use systemd unit to mount the instance. We manually mount it using bash and enable the service to be run for every restart.

tpdownes · 2024-11-12T19:56:38Z

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

I strongly recommend Restart=on-failure for almost any SystemD unit unless always provides some benefit you can clearly articulate. I believe always isn't allowed for oneshots.

wiktorn · 2024-11-13T13:22:30Z

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

I strongly recommend Restart=on-failure for almost any SystemD unit unless always provides some benefit you can clearly articulate. I believe always isn't allowed for oneshots.

My idea was to run dfuse in foreground and do not make this a oneshoot, so systemd can monitor the process and if it crashes (what may happen), and whatever the reason it will be for it to exit, it systemd will restart it.

The only thing to do is to instead call directly the dfuse binary, call a script, which will call unmount if needed.

wiktorn · 2024-11-13T13:25:46Z

usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior

Current behavior we dont use systemd unit to mount the instance. We manually mount it using bash and enable the service to be run for every restart.

Using systemd always would simplify the code, as there would only one way to mount the filesystem, whether this is the first startup of the instance, or the next one. The only drawback is that I did not find a way, to fail systemctl enable --now (or similar command) when service did not start successfully.

I can add a check after the call, with some sleeps to check if either service started successfully or the mountpoint is mounted and fail the startup script if those checks fail. WDYT?

mounts

mr0re1 · 2024-11-15T21:34:53Z

/gcbrun

harshthakkar01 · 2024-11-15T21:56:14Z

You may need to rebase this after #3256 (this one was high priorty). I have discussed with Ivan and he will follow up with on this as I am ooo for 3 weeks from today.

Btw, we agreed on moving forward with adding customization for mounts. Thanks for this PR.

wiktorn · 2024-11-16T17:47:16Z

You may need to rebase this after #3256 (this one was high priorty). I have discussed with Ivan and he will follow up with on this as I am ooo for 3 weeks from today.

Btw, we agreed on moving forward with adding customization for mounts. Thanks for this PR.

Thanks for letting me know. For now, I rebased this PR on top of #3256, once it is in, I can skip a few commits from here.

I aligned now pre-existing-network-storage with parallelstore in terms of scripts and APIs and fixed some smaller issues I faced (like installation failing on RHEL 8, but started to work well after dnf clean all 🤷 )

wiktorn · 2024-12-16T13:50:44Z

modules/scripts/startup-script/templates/startup-script-custom.tftpl

@@ -31,7 +31,7 @@ stdlib::runner() {
  stdlib::info "=== start executing runner: $object ==="
  case "$1" in
    ansible-local) stdlib::run_playbook "$destpath/$filename" "$args";;
-    shell) chmod u+x /$destpath/$filename && ./$destpath/$filename $args;;
+    shell) chmod u+x /$destpath/$filename && $destpath/$filename $args;;


This change is to fix failures like this:

startup-script: Mon Dec 16 08:44:43 -0500 2024 Info [4589]: === start executing runner: early_run_hotfixes.sh-b383 === startup-script: /slurm/custom_scripts/controller.d/ghpc_daos_mount.sh: line 373: .//tmp/tmp.g1yoDyA6T5/early_run_hotfixes.sh: No such file or directory startup-script: Mon Dec 16 08:44:43 -0500 2024 Info [4589]: === early_run_hotfixes.sh-b383 finished with exit_code=127 === startup-script: Mon Dec 16 08:44:43 -0500 2024 Error [4589]: === execution of early_run_hotfixes.sh-b383 failed, exiting ===

If re-running the script, using for example: DEBUG=1 google_metadata_script_runner startup

harshthakkar01 self-assigned this Oct 21, 2024

harshthakkar01 reviewed Oct 23, 2024

View reviewed changes

modules/file-system/parallelstore/variables.tf Show resolved Hide resolved

harshthakkar01 added the release-improvements Added to release notes under the "Improvements" heading. label Nov 12, 2024

harshthakkar01 marked this pull request as ready for review November 12, 2024 19:46

Update mount parallelstore script to support multiple parallelstore

02c7962

mounts

wiktorn marked this pull request as draft November 15, 2024 21:35

harshthakkar01 assigned mr0re1 and unassigned mr0re1 and harshthakkar01 Nov 15, 2024

wiktorn added 2 commits November 16, 2024 11:59

POC - use templates for DAOS mounts

63e84c6

Use templates in pre-existing-storage

87f6475

wiktorn force-pushed the template_pstore_mount branch from 0d3506a to 87f6475 Compare November 16, 2024 17:44

wiktorn marked this pull request as ready for review November 16, 2024 17:44

wiktorn commented Dec 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow customization of Parallelstore mounts #3144

Allow customization of Parallelstore mounts #3144

wiktorn commented Oct 17, 2024 •

edited by harshthakkar01

Loading

harshthakkar01 commented Nov 12, 2024

harshthakkar01 commented Nov 12, 2024

tpdownes commented Nov 12, 2024

wiktorn commented Nov 13, 2024

wiktorn commented Nov 13, 2024

mr0re1 commented Nov 15, 2024

harshthakkar01 commented Nov 15, 2024 •

edited

Loading

wiktorn commented Nov 16, 2024

wiktorn Dec 16, 2024 •

edited

Loading

Allow customization of Parallelstore mounts #3144

Are you sure you want to change the base?

Allow customization of Parallelstore mounts #3144

Conversation

wiktorn commented Oct 17, 2024 • edited by harshthakkar01 Loading

Topics for discussion:

TODO

harshthakkar01 commented Nov 12, 2024

harshthakkar01 commented Nov 12, 2024

tpdownes commented Nov 12, 2024

wiktorn commented Nov 13, 2024

wiktorn commented Nov 13, 2024

mr0re1 commented Nov 15, 2024

harshthakkar01 commented Nov 15, 2024 • edited Loading

wiktorn commented Nov 16, 2024

wiktorn Dec 16, 2024 • edited Loading

Choose a reason for hiding this comment

wiktorn commented Oct 17, 2024 •

edited by harshthakkar01

Loading

harshthakkar01 commented Nov 15, 2024 •

edited

Loading

wiktorn Dec 16, 2024 •

edited

Loading