re: Memory leak problem (tensorflow backend) #343
Any suggestions?
One thing that immediately sticks out from your model is that you create the generator in each iteration. That might have something to do with your problem. I suggest looking at #11 for examples of how you might want to do it instead.
Thanks for pointing out this potential problem. I've put the generator inside the function that builds the model because it uses one of the grid parameters (batch_size). I'll try to see if your suggestion helps by fixing the batch_size for now.
Hi, sorry I could not work on this earlier. There is definitely a memory leak, and your suggestion to call the generator once outside the Talos function helps reduce its impact. It seems that each time I call the generator class, the memory load increases more than it did on the previous call. However, if I remove the generator call from the Talos function, I can no longer tune the batch size. One thing that helped a little bit is to use one worker only: out = model.fit_generator(generator=train_gen, …). By doing so, I was able to run a few more iterations with Talos, although the iterations were slower. But at a certain point, the program just stops producing outputs. Any thoughts?
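For reference, a minimal sketch of running fit_generator with a single worker. The toy generator, model and data below are placeholders (assumptions), not the code from this issue; the workers/use_multiprocessing arguments are the point.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import Sequence

class TrainGen(Sequence):
    """Toy stand-in for the generator class discussed in this issue."""
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size
    def __len__(self):
        return len(self.x) // self.batch_size
    def __getitem__(self, i):
        s = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.x[s], self.y[s]

x, y = np.random.rand(1000, 20), np.random.rand(1000, 1)
model = Sequential([Dense(8, activation='relu', input_shape=(20,)), Dense(1)])
model.compile(optimizer='adam', loss='mse')

train_gen = TrainGen(x, y, batch_size=50)
# fit_generator is the 2019-era API discussed here; newer tf.keras versions
# accept the same Sequence object directly in model.fit().
out = model.fit_generator(
    generator=train_gen,
    epochs=2,
    workers=1,                 # single worker: slower, but reduced the leak's impact
    use_multiprocessing=False  # keep data loading in the main process
)
```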
Have you profiled your code in a loop without Talos? Just run it as a standalone Keras model with everything else the same, but run it in a for loop and see what happens.
Not yet, but I'll do it and let you know. I am also thinking that maybe this has to do with Keras rather than Talos, but let's see. Thanks
Thanks. I will be very interested in knowing what you learn. I know there are a lot of issues people have reported regarding the use of generators in Keras models, which is consistent with my own experience.
@Adnane017 any updates on this?
Hi @mikkokotila. Really sorry for the delay in sending an update on this (and that you are having to push for a solution instead of me doing it). I was practically unavailable over the last two weeks (off to a conference). I noticed too that many Keras users have reported memory leaks. Someone told me he experienced the same problem with TensorFlow, so it might even have to do with TensorFlow rather than Keras or Talos. I will run the test you suggested and let you know later this week. Many thanks again.
I have the same problem. The first iteration happens very fast, but the GPU memory does not empty, so the second iteration already starts with the GPU memory full, and then the nightmare begins, taking up to 3 days to tune the hyperparameters. I've already researched this online and really can't find a solution.
@TraderJair sorry to hear that you have joined the sizable group of people frustrated with using generators with Keras on the TensorFlow backend. In the new Talos docs there is an example you can try. Try it first with the example as it is, and if everything goes fine (it works very well even on a low-end laptop), then replace it with your own data (which I assume is much bigger). @Adnane017, could you try the same?
@Adnane017 This is really wonderful work. I love it :) The name is changed to … Can you share the text output from cProfile or similar, so we can see exactly what is eating the memory? That way there is no doubt. Ideally, given you have already put time into this, you could create a repo with the code you have and the profiler/similar text outputs, and I will then repeat the exact same thing on two different systems. About your experiment, did I understand correctly that you have done the following:
i.e. the three graphs above are "apples-to-apples". Performance is definitely a key concern/focus, and we shall get to the bottom of this! :) Thanks again so far. Great work.
Can you try with …?
...having looked at both of the graphs you provided, neither of them implies a memory leak. The characteristic of a memory leak is continuously building memory pressure, as with the Keras example without manual release. That does not seem to be the case with either of the Talos graphs. I kind of "know" there is no memory leak because I have thousands of hours of use myself, zero reported issues on it, and my background is in high-performance industry systems that handle hundreds of billions of events per day, where considerable attention has gone to this topic, including extensive profiling of memory use. What I'm alarmed about here is that there is an added memory cost in comparison to Keras, which should be marginal at worst (which is not the case in your experiment). So if your experiment is indeed apples to apples, and not, for example, that Talos changes the parameters while the Keras version does not, then this will definitely become a top priority in terms of fixes.
@Adnane017 I'm working on this currently, and indeed, with a … This might take a couple of days as I also want to repeat the test on another OS (I'm on Mac OS X) as well as without a data generator. After that I should be able to publish conclusive results, hopefully with a fix for whatever is causing this. Regarding the added memory cost your plots imply, can you share the params dictionary you are using? I can't replicate that, but instead see that Keras with memory release and Talos with …
Turns out I was wrong. Thanks for your persistence in this matter @Adnane017 :)

Experiment

Here I've used the MNIST dataset with a simple CNN, and batch_size 256 for 2 epochs. The cycle is repeated 10 times without changing anything else. "Keras release" means exactly the process that is used in Talos when …

Results

(Memory-use graphs: Talos with generator, Keras with generator, Talos without generator, Keras without generator.)

This is very clear: there is indeed a memory leak, albeit a small one. The issue is of course that it's not uncommon for a researcher to have a very large dataset, and an experiment that runs for several days at a time.

Next Steps
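A minimal sketch of the kind of repeated-training loop described above (MNIST, a simple CNN, batch_size 256, 2 epochs, 10 cycles), with per-round memory readings via psutil. The exact CNN architecture and the memory-probe details are assumptions, not the code used in the test.

```python
import os
import psutil
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense
from tensorflow.keras import backend as K

(x_train, y_train), _ = mnist.load_data()
x_train = x_train[..., None].astype('float32') / 255.0

proc = psutil.Process(os.getpid())

for round_no in range(10):                      # 10 identical cycles
    model = Sequential([
        Conv2D(16, 3, activation='relu', input_shape=(28, 28, 1)),
        Flatten(),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    model.fit(x_train, y_train, batch_size=256, epochs=2, verbose=0)

    # resident memory after each round, in MB
    print(round_no, proc.memory_info().rss // 2**20, 'MB')

    # "Keras release" variant: drop the model and clear the TF session between rounds
    del model
    K.clear_session()
```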
@mikkokotila to respond to these 3 questions (and to some of your subsequent comments):
I used the same model configuration for both Keras and Talos (LSTM with the same input and output dimensions), but did not use exactly the same hyperparameters. For Talos, I allowed the parameters to vary; probably the most relevant parameters for the comparison are 'number of epochs' ([1, 5, 10]) and 'batch size' ([10, 25, 50, 100, 200]). For Keras, I used a fixed combination of parameters (epochs = 10 and batch_size = 50). So I think Talos had an advantage here, at least from a run-time perspective, since for many simulations it ran for only 1 or 5 epochs.

Now looking at the text generated by the memory_profiler (unfortunately I did not save it, so I cannot share it with you), I noticed that the memory use/leak is higher for large batches, which I think points to a problem with the generator. I will soon run some simulations at large scale with a basic program that does grid search with Keras directly, to see if it can run without problems. That will tell us for sure whether there is something in the current version of Talos that is causing the memory leak, or whether it is purely due to Keras.
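For concreteness, the parameter grid described above might look roughly like this in Talos dictionary form; only the epochs and batch-size values come from the comment, the remaining entries are placeholders.

```python
# 'epochs' and 'batch_size' values are the ones quoted above;
# everything else is a placeholder for whatever the model actually tunes.
p = {
    'epochs': [1, 5, 10],
    'batch_size': [10, 25, 50, 100, 200],
    'lstm_units': [32, 64],        # placeholder
    'dropout': [0.0, 0.25, 0.5],   # placeholder
}

# The fixed Keras baseline used epochs=10 and batch_size=50.
```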
@mikkokotila Really great stuff! You've got the answer to what I wanted to do. But that is weird, because you only need two lines, which you already have: 'clear_session()' and 'del model', right? Looking forward to knowing what is causing the problem.
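As a reference point, a minimal sketch of that manual release step between training rounds, assuming the TensorFlow backend; the explicit gc.collect() is an extra precaution mentioned elsewhere in this thread, not something Talos is confirmed to do.

```python
import gc
from tensorflow.keras import backend as K

# ... after model.fit(...) and after copying out whatever metrics/weights are needed ...

del model          # drop the Python reference to the Keras model
K.clear_session()  # destroy the current TF graph/session and free its resources
gc.collect()       # optionally force garbage collection right away
```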
Mystery solved

Inside the scan_object (resulting from …
I've used two different approaches to measure the recursive memory size of the stored object, and it matches exactly the increase in memory between each round. So it is in fact by-design functionality of Talos: we store the weights so we can evaluate and pick the winning model once the experiment is over, and deploy it if needed. In the example case, with the MNIST dataset and the simple CNN, the increase is roughly 11MB per round, in other words 1GB for every 100 rounds. That is the cost of storing those weights.

I think generally bigger models run on bigger systems and end up running for a smaller number of rounds, so this would generally be a non-issue, especially when the standard working of Keras is as we witnessed it to be in the tests. I guess there are a couple of things that could be done anyhow, as I can see how this would be an invisible issue particularly for those less savvy with system performance, who might never look at the performance indicators of their systems, i.e. we can't simply leave it be.

The most obvious thing is to add an argument … Another, more nuanced approach is to have …
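A sketch of one way to check the recursive in-memory size of the stored weights, using pympler (one plausible tool for this; the thread doesn't say which two approaches were actually used, and saved_weights is an assumed attribute name on the scan object).

```python
from pympler import asizeof

# scan_object comes from a completed talos.Scan(...) run.
# 'saved_weights' is an assumed name for the per-round weights that
# Talos keeps so the winning model can be rebuilt later.
weights_size = asizeof.asizeof(scan_object.saved_weights)
print('stored weights: %.1f MB' % (weights_size / 2**20))

# Comparing this figure against the per-round growth reported by a memory
# profiler is one way to confirm the growth is the stored weights, not a leak.
```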
- Related to #343, it's now possible to avoid saving model weights in `scan_object`, which might be desirable for very long runs with very large networks, due to the memory cost of keeping the weights throughout the experiment.
- Fixed a small bug in `AutoParams` where choosing `network=False` resulted in 'dense' being split into characters.
- `max_param_values` is now optional in `AutoScan`; instead, created issue #367 to handle the underlying problem properly.
- Fixed all the tests according to the change in `AutoScan` arguments.
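A sketch of how the new option might be used. The exact argument name is not preserved in this thread, so save_weights below is an assumed spelling, as are the surrounding Scan arguments; x, y, p and build_model stand for the user's own data, parameter grid and model function.

```python
import talos

# 'save_weights=False' is an assumed name for the new Scan argument described
# above; check the Talos docs for the exact spelling in your version.
scan_object = talos.Scan(
    x=x, y=y,
    params=p,
    model=build_model,
    experiment_name='memory_test',
    save_weights=False,   # skip storing per-round weights to keep RAM use flat
)
```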
Talos with …
Happy ending then :-). Can't you save the weight matrices on the hard drive instead of keeping them in RAM? For example, I know that they do a similar thing in this package: https://pypi.org/project/pyndl/. For at least one function, they store temporary files on the hard drive while the model is being trained (see the function pyndl.ndl.ndl - https://github.com/quantling/pyndl/blob/master/pyndl/ndl.py). Also, when will you release a stable version on PyPI that includes the fix? The software engineering team at our university is very strict about the software they can install on the high-performance computing server (they will only install packages that are stable and do not conflict with existing packages). Many thanks for fixing the problem!
@Adnane017 thanks for the pointer. I guess we could do that with HDF5 in the Keras way, and then load the weights back once the experiment has ended. That does make sense. I'll create a ticket for it, but it might take time to do as we have now taken the pressure out of this issue. Regarding stability, I would say Talos is generally very stable in the grander scheme of things. If you go through the tickets, you will see that there have been very few actual bug reports for many months already, and we make a good effort to keep things that way. I think that's quite rare generally, and especially in this field. Specifically with PyPI, I think v0.6.3 will be the first v0.6 release to make it to PyPI, which is probably a week or two away. Later this week it will become master, and the usual practice is to have a buffer of at least a week between master and production (PyPI).
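A sketch of the HDF5 idea mentioned above, assuming per-round files and the standard Keras save_weights/load_weights calls; this is an illustration of the approach, not the eventual Talos implementation, and round_no, best_params, best_round and build_model are hypothetical names.

```python
# During the experiment: write each round's weights to disk instead of keeping them in RAM.
model.save_weights('round_%03d.h5' % round_no)

# After the experiment: rebuild the winning architecture and reload its weights.
best_model = build_model(best_params)              # hypothetical model-building helper
best_model.load_weights('round_%03d.h5' % best_round)
```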
No worries, I'll wait for the next release then. What I meant is that they won't accept installing a version from GitHub, and that they will check that it doesn't conflict with other packages. For the time I have been using the package, I agree that I haven't encountered any problems other than the memory leak.
Hi,
Thanks for your quick answer to issue #342. Unfortunately, the option "clear_tf_session = True" doesn't help; I had already tried it without success. I also tried putting gc.collect() at the start of the function that builds the model.
Finally, I played with a toy example (a dataset with 6000 training examples), and I noticed that after the iterations end, the memory used by Python is very large compared with the usage at the start, even if I remove all the variables in the workspace and call gc.collect().
I've attached my code for reference. It is a simple problem where I try to predict the verb tense in a sentence (7 possible tenses) from the surrounding words (limited to the 10000 most frequent words).
Cheers,
FNN_tense1Verb_paramsearch.zip