-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Meta-issue on Parsl functionalities with QuAcc #2419
Comments
This one seems relatively trivial to fix if we can understand what the problem is. Quacc does correctly pick up on changes to In [1]: import os
In [2]: os.environ["PARSL_MPI_PREFIX"]="cow"
In [3]: import quacc
In [4]: from quacc import QuaccSettings
In [5]: QuaccSettings().ESPRESSO_PARALLEL_CMD
Out[5]: ('cow', '')
In [6]: os.environ["PARSL_MPI_PREFIX"]="sheep"
In [7]: QuaccSettings().ESPRESSO_PARALLEL_CMD
Out[7]: ('sheep', '')
This is not too surprising, as classes are a bit more complex to hash. As noted in the Parsl docs: By default Parsl can compute sensible hashes for basic data types: str, int, float, None, as well as more some complex types: functions, and dictionaries and lists containing hashable types. So, if we want to help the user out here, we need to supply a custom hashing function that users can rely on. We already have a hashing algorithm implemented for
In the context of Parsl or any other workflow engine, this seems like it should be an improvement to the retry handlers there. Independent of Parsl, we won't be implementing retry handling in quacc itself. This is already handled beautifully by a pre-existing dependency called Custodian. Custodian is currently used for VASP and Q-Chem runs. It supports both pre-made and custom error handlers and is quite robust. Users interested in implementing generic retry handlers for a specific code are best off implementing those features in Custodian and having the calculator call Custodian. |
I think it might be a little bit more complicated than that. Because Parsl needs to be able to handle cases where different apps have different resources, so it seems to me that the env variable is defined "per function" somewhat. I will investigate when I will have more time. Note that in your example you defined the environment variable before importing Quacc, while right now, it happens definitely after (when the function is called). That might be it?
Yes, I think that's pretty much it, do you think this has its place in the Quacc documentation? Can this be done automatically when Quacc start (if Parsl is the workflow engine of course)?
Yep, that's exactly why I opened this issue, so that we can look at what features are missing for each workflow engine (Parsl here) and improve these areas without the need to modify Quacc too much (improve the documentation maybe?). The reason for that is: we are trying to use Quacc/Parsl for a project with collaborators and we ran into these questions, which I think are important. If you feel like I am overstepping please feel free to tell me/close this issue. |
I see. We already have support for supplying
That also seems to work okay when I test it out.
Definitely, I think this should be provided with the Parsl usage instructions. I don't actually know how the memo function works with Parsl, but once someone figures it out and confirms it works as expected, the details should definitely be added. As for whether it can be done automatically, I try to set things up automatically when the workflow engine is selected but if this is something that must be supplied when you instantiate a Parsl config, that might need to be left for the end-user (but still documented). In other words, if we can automate it, absolutely, but I don't know if we can.
Yes, this is all very fair. And I appreciate the comments! You are not overstepping. Personally, I think improved retry handling is something that should be done on Parsl's side (i.e. make it easier to inject logic into the retry handler). There is technically a way to do this right now, but it's hacky. The suggestion from the Parsl team is to take advantage of the |
## Summary of Changes As discussed in (#2419) this is an attempt to fix the current issue with the ## Expected behavior Users can redecorate their job, adding the `parsl_resource_specification`, Parsl take these informations and define environment variables such as `PARSL_MPI_PREFIX`. If the users redecorate multiple jobs, different `PARSL_MPI_PREFIX` will be accessible on a per job basis. ### Example: ``` python resource_specification_slow = { "num_nodes": 1, "ranks_per_node": 128, "num_ranks": 128, "launcher_options": "-vv --cpu-freq=2250000 --hint=nomultithread --distribution=block:block", } resource_specification_fast = { "num_nodes": 4, "ranks_per_node": 128, "num_ranks": 128 * 4, "launcher_options": "-vv --cpu-freq=2250000 --hint=nomultithread --distribution=block:block", } relax_job_slow = redecorate( relax_job, job(parsl_resource_specification=resource_specification_slow) ) relax_job_fast = redecorate( relax_job, job(parsl_resource_specification=resource_specification_fast) ) ``` In this case, each job will have access to a different `PARSL_MPI_PREFIX` environment variable. ## The problem Right now, the `ESPRESSO_PARALLEL_CMD` is modified in `settings.py` based on the `PARSL_MPI_PREFIX`, the problem is that this env variable is only defined within jobs. `ESPRESSO_PARALLEL_CMD` end up being defined before that, leading to wrong commands. ## The problem One solution is to modify the command being lunched at the last moment inside the espresso calculator before the command is lunched. ### Requirements - [x] My PR is focused on a [single feature addition or bugfix](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/getting-started/best-practices-for-pull-requests#write-small-prs). - [ ] My PR has relevant, comprehensive [unit tests](https://quantum-accelerators.github.io/quacc/dev/contributing.html#unit-tests). - [x] My PR is on a [custom branch](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-and-deleting-branches-within-your-repository) (i.e. is _not_ named `main`). Note: If you are an external contributor, you will see a comment from [@buildbot-princeton](https://github.com/buildbot-princeton). This is solely for the maintainers.
What would you like to report?
Here is a list of requirements that (in my opinion) should be checked to ensure that the combo QuAcc/Parsl is production-ready.
Use the full capabilities of the software and the HPC: run flexible parallel commands without additional complexity. For now, this is done by providing
parsl_resource_specification
parameter in the job decorator. This gives access to thePARSL_MPI_PREFIX
environment variable that should be used to launch the job. This is critical because without it jobs are run fully relying on the good faith of the task executor (srun mainly) to correctly dispatch jobs to the nodes. For example, this does not work on the MMM Young cluster which usesmpirun
.HPCs time limitations should not be a burden: For this Parsl checkpoints seem like a natural solution: they allow users to keep results in the cache, and to restart a workflow only where it stopped.
If possible, it should be possible to ask for retries in case of job failures. If possible there should be various options depending on the nature of the error, and it should be possible to customize the input parameters for retries (e.g. my meta-gga calculation failed, I lower the
mixing_beta
, etc...)For the first point, it seems that it does not work correctly currently as the environment variable is retrieved in
settings.py
which is probably done before the job is launched (#2411).For the second point, I think the Parsl checkpointing currently does not work with Quacc, as Parsl keeps complaining about Atoms object not being hashable (to check).
For the third point, Parsl retry handling does not allow to specify new parameters. This makes it generally not fitted for computational chemistry as most users will not care much about blindly repeating the same calculations. Some Parsl team members suggested that, conceptually, if one wants to change the parameters, they should instead launch a new job, this approach will be definitely possible with in-built try-except statements, after (#2410), although somewhat limited.
Additionally, reading QuAcc documentation it is a little bit confusing what functionality is usable with QuAcc.
I will work on this when I have more time to do what I like (end of september 😅) if I still have access to some HPCs.
The text was updated successfully, but these errors were encountered: