[feature request] be able to perform continuation with Quacc #2399
Thank you for the note, @tomdemeyere and colleagues. First, let me start with an unhelpful comment: most of your concerns are actually about the underlying workflow management system and not necessarily about quacc itself. In short, you are hitting pain points of Parsl. If the workflow engine were constructed well, it should be able to handle "calculations will crash, will run out of time, or simply won't converge". For instance, FireWorks/Covalent/Prefect allow you to redispatch jobs (and some have continuation logic), but Parsl does not make it easy. Other issues like caching of results so jobs aren't rerun are also done on the workflow management level. This is the problem with workflow engines --- we have hundreds of them, and they all are mediocre in various ways. In spite of the above comment, I still think there is significant room to have quacc help the user out, and I would welcome any improvements in this regard.
I agree this is an area for improvement. Many workflow tools, like FireWorks and Covalent, allow the user to add metadata to their calculations before executing them. This makes it trivial to link up to a given result that is stored. Parsl or Dask, on the other hand, do not do such a great job here. I don't necessarily have a suggestion here but agree there is more to do.
Dashboards are a core part of many workflow engines and can help in this regard. Still, there is room for improvement within quacc itself for cases where a dashboard is not viable. The sea of quacc folders will hopefully become easier to manage with #2296, which is not stale despite the lack of movement you see there on the PR. The schema can only be generated, however, if the calculation finishes. The results dictionary is not really the place to store task metadata if the job fails. So, this begs the question: where should we store that data if the user isn't using a workflow engine that can handle task metadata well? Again, this is something that a workflow engine should be providing for the user, but if we want to have a redundant solution that is fine. Covalent, FireWorks, Redun, and others make it clear to the user where things are failing and why. Parsl does not. Technically, shouldn't the errors be logged to stderr though via the
This can already be done by swapping out
This is getting into workflow management territory. Quacc has no knowledge about a database on its own. What you are describing is similar to caching, which many workflow tools support (e.g.
All of these pain points are pain points of Parsl, which does a good job at launching tasks but not managing them. That's not to say any of the other supported workflow tools are better by the way. They all are painful in their own special ways, which is why I couldn't settle with just supporting one of them. Again, I'm happy to add features to quacc to address the limitations of the workflow tools themselves though. Sorry. I understand that this is not a "fix" for your issues. The issues you bring up are important but are also non-trivial and require dedicated work from an individual to address. Sadly, that isn't going to be me at this very point in time, but I am open to addressing your concerns.
Thanks for your answer @Andrew-S-Rosen. I must say that I understand your need for balance between the work of QuAcc and the work that should be done by workflow engines. Let me try to push some ideas before giving up:
While QuAcc shouldn't shoulder the entire burden of workflow management, there are indeed some features that could enhance its functionality without overstepping its primary role. Here are some suggestions for improvements:
But as you said, one problem is the potential overlap of these features with those that already exist in some workflow engines. Normally there should not be a need for such functionalities (especially the last one, which I agree should be taken care of by the workflow engine). But at the same time I will come back to something you wrote:
This is very true, and after testing some of them I can say that there is always some small detail that makes a given engine unusable in a particular situation, due to a mix of technical constraints and project-specific needs. If QuAcc had the above functionalities built in, this would reduce the dependency on workflow engine-specific features. I believe the first two suggestions (job annotation and error handling) are relatively neutral additions that would enhance QuAcc's usability without significantly overstepping on workflow engine territory. While the restart mode feature might be more contentious, it could serve as a fallback option for users dealing with workflow engines that lack robust continuation logic (or when running QuAcc without a workflow engine, which is my main usage for now). EDIT:
I am not sure I agree with that. Sometimes the job failing is an important part of the process: if you pick a mixing scheme and want to test its robustness for high-throughput, some jobs will surely fail, and that is an important result. Often, a job might return an error to
@tomdemeyere: Thank you for your comments. Could you please do me a favor and open three specific issues, one for each topic? You can simply copy/paste your examples here but please link back to this issue. You can then close this "meta-issue". That will help ensure things hopefully get addressed.
I am all ears. No need to worry about giving up. My goal at this point is to get additional clarity so we can brainstorm how to best proceed.
Yes, agreed. This was actually the original intention but never made its way into
I think the error should already be captured here and is logged with its directory. Is there another error you're referring to? It should also be clear which jobs failed --- they are written out to directories with
This is actually a separate feature request: write out There is likely going to be an easier way to do what you want. Perhaps the solution you're looking for is to write out the info not just to the logger but also to disk in that calculation's
Indeed, this does have overlap with things like the retry handlers in Parsl. Most workflow engines have something along these lines, but I agree it's not always intuitive or useful as-implemented. I don't know how easy it would be to do this in quacc (we would have to serialize/deserialize functions with pickle, which gets messy...), but I'll keep this one in mind. We also can't assume that the user is running with a database. This is one I'm open to, but I admittedly don't know how to go about it nicely offhand. To me, this one seems more like workflow engine territory. If we provide the user with enough metadata to know which jobs have failed, they can hopefully write their own scripts to rerun what is needed.
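To make the last point concrete, here is a minimal sketch of what "enough metadata to know which jobs have failed" could look like without any workflow engine or database. It uses only the standard library and is not quacc's implementation: `run_and_record`, `list_failed`, the `error.json` filename, and the one-directory-per-job layout are all assumptions made for illustration.

```python
import json
import traceback
from pathlib import Path

# Hypothetical filename; quacc does not define this.
ERROR_FILENAME = "error.json"


def run_and_record(func, workdir, *args, **kwargs):
    """Run func; on failure, write error details into workdir and re-raise."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        error_info = {
            "function": getattr(func, "__name__", repr(func)),
            "error_type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }
        (workdir / ERROR_FILENAME).write_text(json.dumps(error_info, indent=2))
        raise  # the caller (or workflow engine) still sees the failure


def list_failed(results_root):
    """Map each failed job directory name to its recorded error type."""
    failed = {}
    for error_file in sorted(Path(results_root).glob(f"*/{ERROR_FILENAME}")):
        info = json.loads(error_file.read_text())
        failed[error_file.parent.name] = info["error_type"]
    return failed
```

A rerun script could then loop over `list_failed(results_root)` and resubmit only those entries. Storing plain JSON next to the calculation avoids pickling functions, at the cost of the user having to know how to rebuild the job from its directory name.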
This is fair. Just to note, we of course can't have every
## Summary of Changes

This PR addresses the request in #2399 to have an `additional_fields` keyword argument for all `@job`s in quacc. Will not merge without input from @tomdemeyere.

### Requirements

- [X] My PR is focused on a [single feature addition or bugfix](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/getting-started/best-practices-for-pull-requests#write-small-prs).
- [X] My PR has relevant, comprehensive [unit tests](https://quantum-accelerators.github.io/quacc/dev/contributing.html#unit-tests).
- [X] My PR is on a [custom branch](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-and-deleting-branches-within-your-repository) (i.e. is _not_ named `main`).

Note: If you are an external contributor, you will see a comment from [@buildbot-princeton](https://github.com/buildbot-princeton). This is solely for the maintainers.
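For readers unfamiliar with the idea, the following is a rough, standalone sketch of how an `additional_fields` keyword could behave on a function that returns a results dictionary. It is not the code from the PR and does not use quacc's `@job` decorator: `with_additional_fields` and `fake_relax` are hypothetical names, and merging by plain dictionary update is an assumption.

```python
import functools


def with_additional_fields(func):
    """Let a dict-returning function accept an `additional_fields` keyword."""

    @functools.wraps(func)
    def wrapper(*args, additional_fields=None, **kwargs):
        result = func(*args, **kwargs)
        # Copy so the wrapped function's own return value is not mutated.
        merged = dict(result)
        # Assumption: user-supplied fields win if a key collides.
        merged.update(additional_fields or {})
        return merged

    return wrapper


@with_additional_fields
def fake_relax(n_atoms):
    """Stand-in for a calculation; returns a small results dictionary."""
    return {"n_atoms": n_atoms, "converged": True}


summary = fake_relax(3, additional_fields={"label": "Cu_slab_001", "project": "demo"})
```

With a label stored in the results themselves, a finished calculation can be matched back to the request that produced it even when the workflow engine keeps no task metadata.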
What new feature would you like to see?
We want to perform a simple workflow with Quacc:
It seems easy, but in fact there are plenty of things that could go wrong here. Very often, computational chemistry is complex and things will not go as expected, e.g. calculations will crash, will run out of time, or simply won't converge.
In this case, the above workflow becomes much more complicated with Quacc. Users who want to be proactive can put all kinds of checks in place (e.g. try/except), but with workflow engines this is often not ideal. The easiest route to avoid this issue then seems to be pre-filtering which calculations you want to run based on which ones are already finished.
However, this route is a little bit complicated still:
What would be ideal is that, for a given project, you have a given database/results folder, and then for each calculation with a given label, Quacc would check whether this calculation was already done and converged, either from the database or from the "quacc_results.json" file. Settings or keywords might manage this behavior? Other things can be implemented to keep flexibility as well.
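As a rough illustration of the pre-filtering we currently do by hand, the check could look something like the sketch below. Everything about the layout is an assumption for the example: one sub-folder per label under the results folder, a results file named `quacc_results.json` inside it, and a top-level boolean `converged` key in that file. The function names are hypothetical and not part of Quacc.

```python
import json
from pathlib import Path


def find_finished_labels(results_root, results_filename="quacc_results.json"):
    """Return the set of labels whose results file exists and reports convergence.

    Assumed (hypothetical) layout: results_root/<label>/<results_filename>,
    where the JSON holds a top-level boolean "converged" key.
    """
    finished = set()
    for results_file in Path(results_root).glob(f"*/{results_filename}"):
        try:
            data = json.loads(results_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or partially written file: treat as not finished
        if isinstance(data, dict) and data.get("converged") is True:
            finished.add(results_file.parent.name)
    return finished


def labels_to_run(all_labels, results_root):
    """Keep only the labels that still need to be computed, preserving order."""
    finished = find_finished_labels(results_root)
    return [label for label in all_labels if label not in finished]
```

Calling `labels_to_run(my_labels, "results/")` before submitting would skip anything already converged, while crashed, timed-out, or unconverged calculations stay in the list.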
Maybe I am mistaken here and that's not what Quacc was made for? For our practical case, though, it is really problematic, and we often end up spending a lot of time looking into Quacc's folders.
@Nekkrad @julianholland @BCAyers2000