
Meta-issue on Parsl functionalities with QuAcc #2419

Open
3 tasks
tomdemeyere opened this issue Aug 13, 2024 · 3 comments

@tomdemeyere
Contributor

tomdemeyere commented Aug 13, 2024

What would you like to report?

Here is a list of requirements that (in my opinion) should be checked to ensure that the combo QuAcc/Parsl is production-ready.

  • Use the full capabilities of the software and the HPC: run flexible parallel commands without additional complexity. For now, this is done by providing the parsl_resource_specification parameter in the job decorator, which gives access to the PARSL_MPI_PREFIX environment variable that should be used to launch the job. This is critical: without it, jobs rely entirely on the good faith of the task launcher (mainly srun) to correctly dispatch them to the nodes. For example, this does not work on the MMM Young cluster, which uses mpirun.

  • HPC time limitations should not be a burden: for this, Parsl checkpoints seem like a natural solution, as they allow users to keep results in the cache and to restart a workflow from where it stopped.

  • If possible, it should be possible to request retries in case of job failures, with various options depending on the nature of the error, and with the ability to customize the input parameters for the retry (e.g. my meta-GGA calculation failed, so I lower the mixing_beta, etc.).

For the first point, it seems that it does not work correctly currently as the environment variable is retrieved in settings.py which is probably done before the job is launched (#2411).

For the second point, I think Parsl checkpointing currently does not work with Quacc, as Parsl keeps complaining that the Atoms object is not hashable (to check).

For the third point, Parsl retry handling does not allow new parameters to be specified. This makes it generally ill-suited for computational chemistry, as most users will not want to blindly repeat the same calculation. Some Parsl team members suggested that, conceptually, if one wants to change the parameters, one should instead launch a new job; this approach will definitely be possible with in-built try-except statements after (#2410), although it is somewhat limited.
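The launch-a-new-job pattern the Parsl team suggested could be sketched with a plain try/except wrapper. The `run_with_fallback` helper and the `mixing_beta` adjustment below are purely illustrative names, not quacc's API:

``` python
# Sketch: catch a failed calculation and relaunch it once with
# adjusted input parameters, instead of blindly retrying.
def run_with_fallback(calc, atoms, **params):
    try:
        return calc(atoms, **params)
    except RuntimeError:
        # e.g. the meta-GGA run failed to converge: halve mixing_beta
        params["mixing_beta"] = params.get("mixing_beta", 0.7) / 2
        return calc(atoms, **params)
```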

Additionally, reading the QuAcc documentation, it is a little unclear which Parsl functionality is usable with QuAcc.

I will work on this when I have more time to do what I like (end of september 😅) if I still have access to some HPCs.

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Aug 13, 2024

For the first point, it seems that it does not work correctly currently as the environment variable is retrieved in settings.py which is probably done before the job is launched (#2411).

This one seems relatively trivial to fix if we can understand what the problem is. Quacc does correctly pick up on changes to PARSL_MPI_PREFIX, so it's not clear to me what the issue is you're encountering. Presumably, if you launch the Python script as a Slurm job and export PARSL_MPI_PREFIX before you execute the Python script, it should pick up on it just fine.

``` python
In [1]: import os

In [2]: os.environ["PARSL_MPI_PREFIX"] = "cow"

In [3]: import quacc

In [4]: from quacc import QuaccSettings

In [5]: QuaccSettings().ESPRESSO_PARALLEL_CMD
Out[5]: ('cow', '')

In [6]: os.environ["PARSL_MPI_PREFIX"] = "sheep"

In [7]: QuaccSettings().ESPRESSO_PARALLEL_CMD
Out[7]: ('sheep', '')
```

For the second point, I think Parsl checkpointing currently does not work with Quacc, as Parsl keeps complaining that the Atoms object is not hashable (to check).

This is not too surprising, as classes are a bit more complex to hash.

As noted in the Parsl docs: By default, Parsl can compute sensible hashes for basic data types: str, int, float, None, as well as some more complex types: functions, and dictionaries and lists containing hashable types.
Attempting to cache apps invoked with other, non-hashable, data types will lead to an exception at invocation.
In that case, mechanisms to hash new types can be registered by a program by implementing the parsl.dataflow.memoization.id_for_memo function for the new type.

So, if we want to help the user out here, we need to supply a custom hashing function that users can rely on. We already have a hashing algorithm implemented for Atoms objects: it's quacc.atoms.core.get_atoms_id.

If possible, it should be possible to request retries in case of job failures, with various options depending on the nature of the error, and with the ability to customize the input parameters for the retry (e.g. my meta-GGA calculation failed, so I lower the mixing_beta, etc.).

In the context of Parsl or any other workflow engine, this seems like it should be an improvement to the retry handlers there.

Independent of Parsl, we won't be implementing retry handling in quacc itself. This is already handled beautifully by a pre-existing dependency called Custodian. Custodian is currently used for VASP and Q-Chem runs. It supports both pre-made and custom error handlers and is quite robust. Users interested in implementing generic retry handlers for a specific code are best off implementing those features in Custodian and having the calculator call Custodian.

@tomdemeyere
Contributor Author

tomdemeyere commented Aug 13, 2024

This one seems relatively trivial to fix if we can understand what the problem is. Quacc does correctly pick up on changes to PARSL_MPI_PREFIX, so it's not clear to me what the issue is you're encountering. Presumably, if you launch the Python script as a Slurm job and export PARSL_MPI_PREFIX before you execute the Python script, it should pick up on it just fine.

I think it might be a little more complicated than that, because Parsl needs to handle cases where different apps have different resources, so it seems to me that the env variable is defined somewhat "per function". I will investigate when I have more time. Note that in your example you defined the environment variable before importing Quacc, whereas right now it definitely happens after (when the function is called). That might be it?

As noted in the Parsl docs: By default, Parsl can compute sensible hashes for basic data types: str, int, float, None, as well as some more complex types: functions, and dictionaries and lists containing hashable types.
Attempting to cache apps invoked with other, non-hashable, data types will lead to an exception at invocation.
In that case, mechanisms to hash new types can be registered by a program by implementing the parsl.dataflow.memoization.id_for_memo function for the new type.

Yes, I think that's pretty much it. Do you think this has its place in the Quacc documentation? Can this be done automatically when Quacc starts (if Parsl is the workflow engine, of course)?

Independent of Parsl, we won't be implementing retry handling in quacc itself.

Yep, that's exactly why I opened this issue: so that we can look at what features are missing for each workflow engine (Parsl here) and improve those areas without needing to modify Quacc too much (perhaps by improving the documentation?). The reason is that we are trying to use Quacc/Parsl for a project with collaborators and we ran into these questions, which I think are important. If you feel like I am overstepping, please feel free to tell me/close this issue.

@Andrew-S-Rosen
Member

I think it might be a little more complicated than that, because Parsl needs to handle cases where different apps have different resources, so it seems to me that the env variable is defined somewhat "per function".

I see. We already have support for supplying parsl_resource_specification via the decorator, but I see that this kind of flexibility is not possible for the Parsl MPI prefix. There may be a misunderstanding with the Parsl team about intended use, because it does not make sense to rely on an environment variable (PARSL_MPI_PREFIX) that changes on a per-app basis. Perhaps I'm missing something. I guess they just assume that all the apps run the same code, parallelized the same way? That seems like a potential oversight worth bringing up with them.

Note that in your example you defined the environment variable before importing Quacc, whereas right now it definitely happens after (when the function is called). That might be it?

That also seems to work okay when I test it out.

Yes, I think that's pretty much it. Do you think this has its place in the Quacc documentation? Can this be done automatically when Quacc starts (if Parsl is the workflow engine, of course)?

Definitely, I think this should be provided with the Parsl usage instructions. I don't actually know how the memo function works with Parsl, but once someone figures it out and confirms it works as expected, the details should definitely be added.

As for whether it can be done automatically: I try to set things up automatically when the workflow engine is selected, but if this is something that must be supplied when you instantiate a Parsl config, it might need to be left to the end user (but still documented). In other words, if we can automate it, absolutely, but I don't know if we can.

Yep, that's exactly why I opened this issue: so that we can look at what features are missing for each workflow engine (Parsl here) and improve those areas without needing to modify Quacc too much (perhaps by improving the documentation?). The reason is that we are trying to use Quacc/Parsl for a project with collaborators and we ran into these questions, which I think are important. If you feel like I am overstepping, please feel free to tell me/close this issue.

Yes, this is all very fair. And I appreciate the comments! You are not overstepping.

Personally, I think improved retry handling is something that should be done on Parsl's side (i.e. make it easier to inject logic into the retry handler). There is technically a way to do this right now, but it's hacky. The suggestion from the Parsl team is to take advantage of the retry_handler keyword argument in Config and supply a custom function that introduces some "smart" logic for when/how to correct things. It seems very messy and prone to lots of problems, though.

Andrew-S-Rosen pushed a commit that referenced this issue Aug 14, 2024
## Summary of Changes

As discussed in (#2419), this is an attempt to fix the current issue with the `PARSL_MPI_PREFIX` environment variable.

## Expected behavior

Users can redecorate their job, adding the
`parsl_resource_specification`; Parsl takes this information and defines
environment variables such as `PARSL_MPI_PREFIX`. If users redecorate
multiple jobs, a different `PARSL_MPI_PREFIX` will be accessible on a
per-job basis.

### Example:

``` python
resource_specification_slow = {
    "num_nodes": 1,
    "ranks_per_node": 128,
    "num_ranks": 128,
    "launcher_options": "-vv --cpu-freq=2250000 --hint=nomultithread --distribution=block:block",
}
resource_specification_fast = {
    "num_nodes": 4,
    "ranks_per_node": 128,
    "num_ranks": 128 * 4,
    "launcher_options": "-vv --cpu-freq=2250000 --hint=nomultithread --distribution=block:block",
}

relax_job_slow = redecorate(
    relax_job, job(parsl_resource_specification=resource_specification_slow)
)
relax_job_fast = redecorate(
    relax_job, job(parsl_resource_specification=resource_specification_fast)
)
```
In this case, each job will have access to a different
`PARSL_MPI_PREFIX` environment variable.

## The problem

Right now, the `ESPRESSO_PARALLEL_CMD` is modified in `settings.py`
based on `PARSL_MPI_PREFIX`. The problem is that this env variable
is only defined within jobs, so `ESPRESSO_PARALLEL_CMD` ends up being
defined before that, leading to wrong commands.

## The solution

One solution is to modify the command at the last moment, inside the
Espresso calculator, just before it is launched.
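A minimal sketch of the idea, with a hypothetical helper name (`build_launch_command` is not the actual quacc function): resolve the environment variable at launch time rather than at settings-import time.

``` python
import os

# Sketch: read PARSL_MPI_PREFIX when the command is built, inside the
# calculator, so the per-job value set by Parsl is actually picked up.
def build_launch_command(binary="pw.x"):
    prefix = os.environ.get("PARSL_MPI_PREFIX", "")
    return f"{prefix} {binary}".strip()
```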

### Requirements

- [x] My PR is focused on a [single feature addition or
bugfix](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/getting-started/best-practices-for-pull-requests#write-small-prs).
- [ ] My PR has relevant, comprehensive [unit
tests](https://quantum-accelerators.github.io/quacc/dev/contributing.html#unit-tests).
- [x] My PR is on a [custom
branch](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-and-deleting-branches-within-your-repository)
(i.e. is _not_ named `main`).

Note: If you are an external contributor, you will see a comment from
[@buildbot-princeton](https://github.com/buildbot-princeton). This is
solely for the maintainers.