Many cores systems problem (1 core loaded and 95 are free) and fitted_pipeline_ problem #1349

Open
AlekseyGur opened this issue May 7, 2024 · 1 comment

@AlekseyGur

Hi!
Thank you for creating TPOT!

I'm using TPOT==0.12.2.

My server has 96 cores, and I have noticed that TPOT uses all of them only for the first 5-7 minutes of each "generation" (is that the right term?). After that, TPOT uses only 1 core for the next few hours.

I tried to limit the lifetime of that process using different combinations of arguments:

  • generations = None
  • max_time_mins = 30
  • max_eval_time_mins = 30

At first I thought that "max_time_mins" worked like "signal" in Python and raised an exception to stop the batch of processes that you call a "generation". But it doesn't, and I don't understand why.

Could you tell me what combination of arguments I should set to stop one "generation" after 30 minutes?

P.S.
I created a custom metric to explore the problem and found a strange situation. For example, TPOT found a good pipeline at the beginning of the run (within the first 5 minutes after start, while all 96 cores were working hard). The metric for that pipeline was printed, and it was "perfect" for me. After that, 95 processes finished and 1 process kept working for the next hour. When TPOT finally finished all jobs, fitted_pipeline_ did NOT contain the "perfect" pipeline, but a "random" (?) or last (?) pipeline with a very bad score. I wanted to get the "perfect" pipeline (the one with the printed score), but I received a poor one instead. Why? If this is a bug, it is probably connected with that one very long process.

P.P.S.
It's very painful to watch a single working core for a few hours.

@perib
Contributor

perib commented Jun 18, 2024

Each generation, TPOT evaluates "population_size" pipelines. What often happens is that all of the pipelines have completed except for one (often SVC). TPOT will not evaluate the next generation/batch until the current batch is completed, so you will see only one core utilized until that individual finishes.
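To make the effect concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not TPOT code: the function name `batch_utilization` and the durations are hypothetical, and it assumes each pipeline in the batch gets its own core (population size no larger than the core count).

```python
def batch_utilization(durations, n_cores):
    """Estimate wall time and average core utilization for one batch.

    Assumes every pipeline runs on its own core (len(durations) <= n_cores)
    and that the next batch cannot start until the slowest pipeline is done.
    """
    if len(durations) > n_cores:
        raise ValueError("this sketch assumes one core per pipeline")
    wall_time = max(durations)          # the batch ends with its slowest pipeline
    busy_core_minutes = sum(durations)  # total time cores actually spent working
    utilization = busy_core_minutes / (wall_time * n_cores)
    return wall_time, utilization


# Hypothetical numbers: 95 pipelines take 5 minutes each, one takes 180 minutes.
durations = [5.0] * 95 + [180.0]
wall_time, utilization = batch_utilization(durations, n_cores=96)
print(f"wall time: {wall_time:.0f} min, average utilization: {utilization:.1%}")
```

With these made-up numbers the whole batch is held up by the single slow pipeline, and the cores sit idle for the vast majority of the run.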

Another bug is that TPOT is sometimes unable to actually terminate a long-running pipeline, which can then take a very long time. I think that is likely what is happening here.

I am not sure about your second issue, where it is not selecting the best score. Was this an sklearn scorer?

We resolved these issues in the next version of the package, called TPOT2. This version should correctly time out all pipelines. TPOT2 also supports custom objective functions, and it returns a pandas dataframe with all evaluated pipelines and their scores to make them easier to access. You can find that here. Note that the search_space_api branch is currently the latest version.

TPOT2 also includes a second version of the evolutionary algorithm, called TPOTEstimatorSteadyState, which can be found here. This version does not wait for batches to be completed. As soon as a pipeline finishes evaluation, another one is submitted. This ensures that all cores are always utilized.
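The difference between the two scheduling styles can be sketched with a small simulation. This is only an illustration, not TPOT2's implementation: the function names and durations are hypothetical, and it ignores how a real evolutionary algorithm chooses the next pipeline based on earlier results.

```python
import heapq


def generational_wall_time(durations, n_cores, population_size):
    """Run pipelines in fixed batches; each batch waits for its slowest member."""
    total = 0.0
    for i in range(0, len(durations), population_size):
        batch = durations[i:i + population_size]
        cores = [0.0] * n_cores  # time at which each core becomes free
        heapq.heapify(cores)
        for d in batch:
            # Give the pipeline to the core that frees up first.
            heapq.heappush(cores, heapq.heappop(cores) + d)
        total += max(cores)  # barrier: next batch starts only after this
    return total


def steady_state_wall_time(durations, n_cores):
    """Submit the next pipeline as soon as any core is free (no barrier)."""
    cores = [0.0] * n_cores
    heapq.heapify(cores)
    for d in durations:
        heapq.heappush(cores, heapq.heappop(cores) + d)
    return max(cores)


# Hypothetical workload: two batches of four pipelines, each with one slow one.
durations = [1, 1, 1, 10, 1, 1, 1, 10]
gen_time = generational_wall_time(durations, n_cores=4, population_size=4)
steady_time = steady_state_wall_time(durations, n_cores=4)
print(f"generational: {gen_time}, steady state: {steady_time}")
```

In this toy workload the steady-state schedule finishes sooner because the three fast cores keep picking up new work while the slow pipeline is still running.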
