Overhead from file writing #79
Hi @JaGeo, thanks for reporting this. We have not used jobflow-remote for such large calculations yet, but I was somewhat expecting that new issues would surface once it starts being used for different use cases. Can you be more explicit about where the issues happen? Where do the jobs fail? Do you have a stack trace? Can you also share a simple script that would allow us to reproduce the problem?
You mean that MongoDB is reachable from the cluster nodes, and thus getting inputs and inserting outputs directly from the Jobs is possible? While I think it could be technically possible, I am a bit reluctant to go down this road: having things happen in different ways may lead to bugs and make the code more difficult to maintain and extend. Let us first see if there is anything we can do to optimize the usage of the Store.
I will make an example script to reproduce later today.
@gpetretto, here it is. The structure I uploaded is nonsense but should illustrate the problem, as it creates 24000 displacements (I therefore leave out the optimization part of the phonon run). I get failures when the supercells are created in jobflow-remote, but not in jobflow. The issues reported via Slurm are memory errors.
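The uploaded script itself is not reproduced here, but as a rough standalone illustration of where the scale comes from (the cell, supercell matrix, and displacement distance below are stand-ins, not the author's settings): with no symmetry and plus/minus displacements, phonopy generates on the order of 6 displaced supercells per atom, so a >4000-atom cell ends up with roughly 24000 supercells to serialize.

```python
import json

from phonopy import Phonopy
from pymatgen.core import Lattice, Structure
from pymatgen.io.phonopy import get_phonopy_structure, get_pmg_structure

# Small stand-in unit cell; the real case discussed here is a >4000-atom cell
# without symmetry, which yields on the order of 6 * N_atoms ~ 24000 displacements.
unit_cell = Structure(Lattice.cubic(3.0), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])

phonon = Phonopy(get_phonopy_structure(unit_cell), supercell_matrix=[[2, 0, 0], [0, 2, 0], [0, 0, 2]])
phonon.generate_displacements(distance=0.01)
supercells = phonon.supercells_with_displacements

# Rough size of a single displaced supercell once converted to a JSON document,
# to get a feeling for the total payload when thousands of them are serialized.
doc = json.dumps(get_pmg_structure(supercells[0]).as_dict())
print(f"{len(supercells)} displaced supercells, ~{len(doc) / 1024:.1f} kB each as JSON")
```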
Thanks! I will have a look at it. Just to be sure, when you say running with jobflow, you mean using …
I assume fireworks should work as well but I don't have a good MongoDB setup on the cluster where I wanted to run it. |
Thanks. Actually I could have deduced that from your script if I had looked before asking. One potential issue with fireworks is that when you create the detours, a single FWAction containing the as_dict of all the detour Fireworks should be inserted in a … One problem is that I suppose … I will test and see if I can pinpoint the issue. One more question: how much memory was allocated on the node?
I tested it on one 850 GB node and it failed.
It's also not failing while creating the Response, but while creating the structures, a step before. Of course, I could combine both steps.
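For readers unfamiliar with the mechanism being discussed, here is a schematic sketch (not the atomate2 implementation; the job bodies are trivial placeholders) of such a structure-generating job, showing that the displaced structures are materialized in memory in step 1, before the Response with the new jobs is even assembled in step 2:

```python
from jobflow import Flow, Response, job
from pymatgen.core import Lattice, Structure


@job
def static_job(structure):
    # Hypothetical stand-in for a real force/energy evaluation job.
    return {"n_sites": len(structure)}


@job
def generate_displaced_jobs(structure, n_displacements):
    # Step 1: build all displaced structures in memory. In the real workflow this
    # is ~24000 large supercells, which is where the memory pressure appears,
    # before any Response is created.
    displaced = [structure.copy() for _ in range(n_displacements)]

    # Step 2: wrap each structure in a new job and return them as a replacement flow.
    new_jobs = [static_job(s) for s in displaced]
    return Response(replace=Flow(new_jobs))


structure = Structure(Lattice.cubic(3.0), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
root_job = generate_displaced_jobs(structure, 5)
```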
I have done some tests on this workflow and on (de)serialization in general. As a first comment, I should correct what I said above concerning the execution of jobflow alone. While even running with a … The second element is that I tried to check the timings of the execution of the …
At this point I think there are different problems with flows involving large Jobs that should be addressed in the different packages. I will start to highlight them here, but it may be worth opening issues in the respective repositories to check what would be possible.

**Jobflow**

Concerning jobflow, I think that it will always play badly with large data, because (de)serialization is expensive, especially the JSON one. Whether you are storing the data in a file or in a DB, converting all the objects to dictionaries and then transferring the data is going to take time and require a large amount of memory. A minimal optimization would be to at least make sure that …

Aside from this, I suspect that obtaining further improvements would require ad hoc solutions, with other kinds of (de)serialization or changes in the logic. For example, big data might be stored directly in files that could be dumped to the file system and read in chunks. This could avoid loading all the data into memory at once, or allow access to only the required information.

**Atomate2**

For this specific kind of workflow, involving relatively fast calculations and many big cells, it is possible that any workflow manager would add more overhead than it is worth. Considering the timings above, it should be considered that after the …

Again, dealing with large data might require ad hoc solutions (see also materialsproject/atomate2#515).

**Jobflow remote**

Coming to what can be done to improve the situation for jobflow-remote, it is definitely true that the …

The current implementation uses orjson, but from a few tests it seems that when deserializing a big JSON file the amount of used memory is larger than with other JSON libraries. However, since there is no strict need to interact with other software outside jobflow-remote, any compatible format should be fine. I made an implementation for msgpack, which could be more efficient; other libraries like msgspec could also be considered. I made some tests with these as well, but I don't have a conclusive answer with respect to the required memory and time for very large files. Also, should very large files be the target, or should we optimize for smaller file sizes? The choice of (de)serialization library could be a project option that a user tunes depending on the kind of simulations that have to run.

The biggest downside, though, remains that the Runner will still need to load all those data to write to and read from the final JobStore. While the Worker should have enough memory to handle the big data, this is likely not true for the machine hosting the Runner, and I don't think there is any way of avoiding this in the context of jobflow and JSON-like data. One option already mentioned would be to allow the worker to directly interact with the JobStore, if accessible. However, I am afraid that this would have some problems of its own:
On top of that, the limitations from standard jobflow listed above will still be there. So, I think that it would be worth addressing this first in jobflow and in the specific workflow. In any case I am not discarding this option. During these tests I also realized one additional issue with the current version of jobflow-remote: if there is a …
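To make the (de)serialization cost discussed above more concrete, here is a rough standalone benchmark sketch (not code from jobflow-remote; the payload size and timings are purely illustrative, and it assumes pymatgen, orjson, and msgpack are installed):

```python
import json
import time

import msgpack
import orjson
from pymatgen.core import Lattice, Structure

# Toy payload: many structure dictionaries, mimicking a Job output that contains
# thousands of displaced supercells (kept small here).
structure = Structure(Lattice.cubic(3.0), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
payload = [structure.as_dict() for _ in range(1000)]


def bench(name, dump, load):
    t0 = time.perf_counter()
    blob = dump(payload)
    t1 = time.perf_counter()
    load(blob)
    t2 = time.perf_counter()
    print(f"{name:8s} dump {t1 - t0:.3f}s  load {t2 - t1:.3f}s  size {len(blob) / 1e6:.2f} MB")


bench("json", lambda o: json.dumps(o).encode(), json.loads)
bench("orjson", orjson.dumps, orjson.loads)
bench("msgpack", msgpack.packb, msgpack.unpackb)
```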
First of all, thank you so much for looking into it! And also for the detailed answer!
I agree that potentially just running the whole workflow in jobflow is faster. However, any kind of checkpointing would be missing, and a restart in case of a failure would be really hard. I would therefore prefer to run the flow with a workflow manager.
Thank you so much, @gpetretto. Yes, I would need a merge of this branch with the one for MFA. Thanks in advance! I would love to test this.
Short additional question that is related, @gpetretto: …
I see your point. However, I would also suggest making a few tests. If, exaggerating, the multi-Job flow required 10x the time of the single-Job flow, it would still be way more convenient to just rerun the whole workflow for the cases that fail. This could be true even for a much lower speedup.
Here is the branch: https://github.com/Matgenix/jobflow-remote/tree/store-interactive
Thank you! I will try to do it by the end of the week! I also hope, of course, that these issues help with making jobflow-remote even better and extending its use cases!
Just to add: this only works in one direction. If I say that …
Maybe an additional flag could help. In any case, I could solve my issue now and will install the new jobflow-remote version for the test ;)
Tried it, but it still needs more memory (it did not kill the server, but I needed more than 80% of the memory) and the file-writing process takes very long. It did not finish overnight.
Thanks a lot for testing this. I am sorry this did not solve your problem. While I expected it to take longer and require more memory, it seemed to have a lower overhead in my tests. Of course, I had to use a smaller test case, so maybe it scales badly at even larger sizes.
Do you mean running the … An additional downside of the current approach in atomate2 is that the 24000 structures will occupy a considerable amount of space in the output Store. There will be the whole output of …
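As a side note on keeping large blobs out of the main output collection, jobflow's additional-store mechanism can route a specific output key to a separate store. A minimal sketch, with in-memory stores and a trivial placeholder job (not the atomate2 phonon jobs):

```python
from jobflow import JobStore, job, run_locally
from maggma.stores import MemoryStore


# The "data" keyword below refers to the additional store name; the (potentially huge)
# "structures" field of the output is stored there, and the main docs store only keeps
# a reference to it.
@job(data="structures")
def make_displacements(n):
    # Placeholder for the real supercell/displacement generation.
    return {"num_displacements": n, "structures": [f"displaced_cell_{i}" for i in range(n)]}


# In-memory stores as stand-ins; in production the "data" store could be GridFS, S3, etc.
store = JobStore(MemoryStore(), additional_stores={"data": MemoryStore()})
responses = run_locally(make_displacements(5), store=store)
```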
Thanks @gpetretto. We are using a GAP model for our calculations. Thus, it will take slightly longer, but yeah, that might be an approach to take.
Hi all,
I have been playing around with jobflow-remote for the last weeks now. I really like it.
One thing, however, that leads to some issues is the following: I am trying to create a huge number of structures within the phonon workflow (>>4000 atoms, no symmetry, thus 24,000 structures). Of course, one could do MD; however, similar things could happen if one aims to calculate thermal conductivity. When I run this workflow with jobflow-remote, the jobs fail locally and remotely due to memory issues. There are no issues if I just run the same job in jobflow. Thus, I assume that the JSONStore and the saving process might lead to problems.
Have you encountered similar problems? Would one need to adapt how atomate2 workflows are written, or are there other possible solutions? Locally, in my setup, for example, the MongoDB database could be used directly, and a JSONStore would not be needed.
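For reference, a minimal sketch of the two setups contrasted here (database names, connection details, and file names are placeholders): a JobStore backed directly by MongoDB versus a file-based JSONStore.

```python
from jobflow import JobStore
from maggma.stores import JSONStore, MongoStore

# JobStore talking directly to MongoDB (possible when the database is reachable
# from wherever the jobs run).
mongo_job_store = JobStore(
    MongoStore(database="jobflow_db", collection_name="outputs", host="localhost", port=27017)
)

# File-based JobStore: every output has to be (de)serialized to JSON on disk,
# which adds the kind of file-writing overhead discussed in this issue.
json_job_store = JobStore(JSONStore("outputs.json", read_only=False))
```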