Many cores systems problem (1 core loaded and 95 are free) and fitted_pipeline_ problem #1349

AlekseyGur · 2024-05-07T14:26:33Z

Hi!
Thank you for creating TPOT!

I'm using TPOT==0.12.2.

My server has 96 cores, and I always notice that TPOT uses all cores only for the first 5-7 minutes of each "generation" (is that the right term?). After that, TPOT uses only 1 core for the next few hours.

I tried to limit that process's lifetime using different combinations of arguments:

generations = None
max_time_mins = 30
max_eval_time_mins = 30

At first I thought that "max_time_mins" works like "signal" in Python and raises an exception to stop the batch of processes that you call a "generation". But it doesn't, and I don't understand why.

Could you tell me what combination of arguments I should set to stop one "generation" after 30 minutes?

P.S.
I've created a custom metric to explore the problem and found a strange situation. For example, TPOT found a good pipeline at the beginning of the run (within the first 5 minutes after start, while all 96 cores were working hard). The metric of that pipeline was printed, and it was "perfect" for me. After that, 95 processes finished and 1 process kept working for the next hour. After that hour TPOT finished all jobs, and fitted_pipeline_ was NOT the "perfect" pipeline but a "random" (?) or last (?) pipeline with a very bad score. I wanted to get the "perfect" pipeline (the one with the printed score), but I received a poor one. Why? If this is a bug, it is probably connected with that 1 very long process.

P.P.S.
It's very painful to watch a single working core for a few hours.

perib · 2024-06-18T23:12:47Z

In each generation, TPOT evaluates "population_size" pipelines. What often happens is that all the pipelines in the batch are complete except for one (often SVC). TPOT will not start evaluating the next generation/batch until the current batch is completed, so you will see only one core utilized until that last individual finishes.
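To make the straggler effect concrete, here is a minimal sketch that simulates one batch on a many-core machine. It is not TPOT code: the function name `simulate_batch`, the greedy scheduling, and all durations are assumptions chosen for illustration.

```python
import heapq

def simulate_batch(durations, n_cores):
    """Greedily schedule one batch; return (wall_time, core_utilization)."""
    # Each heap entry is the time at which one core becomes free.
    cores = [0.0] * n_cores
    heapq.heapify(cores)
    for d in durations:
        start = heapq.heappop(cores)      # earliest free core takes the next pipeline
        heapq.heappush(cores, start + d)
    wall_time = max(cores)                # the batch ends when the slowest core is done
    utilization = sum(durations) / (wall_time * n_cores)
    return wall_time, utilization

# Hypothetical batch: 99 pipelines of 5 minutes each and one of 180 minutes.
durations = [5.0] * 99 + [180.0]
wall_time, utilization = simulate_batch(durations, n_cores=96)
print(f"batch wall time: {wall_time:.0f} min, core utilization: {utilization:.1%}")
```

With numbers like these, almost all of the wall time is spent waiting on a single pipeline, which matches the "1 core loaded and 95 are free" picture.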

A separate bug is that TPOT is sometimes unable to actually terminate a long-running pipeline, so that pipeline can run for a very long time past its limit. I think that is likely what is happening here.

I am not sure about your second issue, where it is not selecting the best score. Was this an sklearn scorer?
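One way to cross-check which pipeline scored best is to look through the scores TPOT recorded. The sketch below assumes the fitted estimator exposes `evaluated_individuals_` as a dict keyed by pipeline string, where each value is a dict containing an `internal_cv_score` entry, and that pipelines which failed or timed out carry a non-finite score. The helper `best_evaluated` and the sample data are hypothetical and stand in for a real run.

```python
import math

def best_evaluated(evaluated_individuals):
    """Return (pipeline_string, score) for the highest finite internal CV score."""
    finite = {
        name: info["internal_cv_score"]
        for name, info in evaluated_individuals.items()
        if math.isfinite(info["internal_cv_score"])   # skip failed/timed-out entries
    }
    if not finite:
        raise ValueError("no pipeline finished with a finite score")
    name = max(finite, key=finite.get)
    return name, finite[name]

# Made-up stand-in for `tpot.evaluated_individuals_` after a run.
sample = {
    "pipeline_a": {"internal_cv_score": 0.91},
    "pipeline_b": {"internal_cv_score": float("-inf")},  # assumed timeout marker
    "pipeline_c": {"internal_cv_score": 0.42},
}
best_name, best_score = best_evaluated(sample)
print(best_name, best_score)
```

If the best entry found this way differs from the score printed by a custom metric during the run, that would suggest the printed value and the internal CV score are not measuring the same thing.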

We resolved these issues in the next version of the package, called TPOT2. This version should correctly time out all pipelines. TPOT2 also supports custom objective functions, and it returns a pandas dataframe with all evaluated pipelines and their scores to make them easier to access. You can find that here. Note that currently the search_space_api branch is the latest version.

There is also a second version of the evolutionary algorithm included in TPOT2, called TPOTEstimatorSteadyState, which can be found here. This version does not wait for batches to be completed: as soon as a pipeline finishes evaluation, another one is submitted. This ensures that all cores are always utilized.
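The difference between the two scheduling styles can be sketched with a toy simulation. This is not the TPOT2 implementation: both functions, the core count, and the workload are assumptions, and the model ignores the fact that a real steady-state search would also generate different pipelines.

```python
import heapq

def _run_on_cores(durations, n_cores):
    """Hand each task to the earliest free core; return the finish time."""
    cores = [0.0] * n_cores
    heapq.heapify(cores)
    for d in durations:
        heapq.heappush(cores, heapq.heappop(cores) + d)
    return max(cores)

def generational_time(durations, n_cores, batch_size):
    """Each batch must fully finish before the next batch starts."""
    return sum(
        _run_on_cores(durations[i:i + batch_size], n_cores)
        for i in range(0, len(durations), batch_size)
    )

def steady_state_time(durations, n_cores):
    """A new task is submitted as soon as any core becomes free."""
    return _run_on_cores(durations, n_cores)

# Hypothetical workload: 4 batches of 8 pipelines, one slow pipeline per batch.
durations = ([60.0] + [5.0] * 7) * 4
gen = generational_time(durations, n_cores=4, batch_size=8)
steady = steady_state_time(durations, n_cores=4)
print(f"generational: {gen:.0f} min, steady state: {steady:.0f} min")
```

In the generational case every batch pays for its slowest pipeline, while in the steady-state case the idle cores keep pulling new work while the slow pipeline runs.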
