[8pt] Upgrade agg_by_huc logging #1325

RobHanna-NOAA · 2024-10-20T16:44:35Z

We see weird performance issues with aggregate_by_huc.py, but it is not yet clear why.

I would like to add a very specific logging upgrade aimed at two things: watching key variables at key times and improving how we track errors. On top of that, we add a separate HUC processing duration log.

Let's bolt in the new src/utils/fim_logger.py system. It is now running in the CatFIM system. I can show you how to plug it in. By default, it automatically creates a basic log file, a separate warning log file and a separate error log file.
We also want to add a separate "duration-by-huc" csv file, which is not a feature of the logging system. agg by huc uses multi-proc. We will want to add a "futures" system into the multi-proc. I don't think we have it anywhere else, but it could help in other places. It solves exactly the problem we have here.
- Each multi-proc process can send values back. From each future (the return from each HUC mp), we want the HUC number, a true/false for whether an error occurred, and the run time duration in seconds.
- Ensure that there is a try/except inside the function used for MP (i.e. one per HUC). I can show you some tricks to ensure we always get the logs we need plus the right futures return value. In the try/except, if the error is HUC related, trap it, log it and still return the "futures" values. If it is catastrophic and we need to stop the entire app, you can re-throw the error to the process pool, which can shut down the entire tool. I can show you how.
- For each future returned (from each HUC mp), we add it to a list of dictionaries outside the MP block.
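A rough sketch of the futures pattern described above, using the standard library `concurrent.futures`. The names `FatalAggregationError`, `aggregate_one_huc`, `process_huc` and `run_all_hucs` are hypothetical placeholders, not existing code in the repo, and plain `logging` stands in for the fim_logger calls.

```python
import concurrent.futures as cf
import logging
import time


class FatalAggregationError(Exception):
    """Hypothetical marker for errors that should stop the whole run."""


def aggregate_one_huc(huc):
    """Placeholder for the real per-HUC work (hypothetical)."""
    if not huc.isdigit():
        raise ValueError(f"invalid HUC: {huc!r}")


def process_huc(huc, work=aggregate_one_huc):
    """Runs once per HUC inside a worker; returns huc, success, duration."""
    start = time.monotonic()
    success = True
    try:
        work(huc)
    except FatalAggregationError:
        raise  # re-thrown so the caller can stop the entire tool
    except Exception:
        # HUC-related failure: trap it, log it, and still return values.
        logging.exception("HUC %s failed", huc)
        success = False
    duration = time.monotonic() - start
    return {"huc": huc, "success": success, "duration_sec": duration}


def run_all_hucs(hucs, max_workers=4, executor_cls=cf.ProcessPoolExecutor):
    """Submit one future per HUC and collect the returned dictionaries."""
    results = []  # list of dictionaries, built outside the MP block
    with executor_cls(max_workers=max_workers) as executor:
        futures = [executor.submit(process_huc, huc) for huc in hucs]
        try:
            for future in cf.as_completed(futures):
                # result() re-raises anything the worker re-threw.
                results.append(future.result())
        except FatalAggregationError:
            for pending in futures:
                pending.cancel()  # drop HUCs that have not started yet
            raise
    return results
```

With a process pool, the call to `run_all_hucs` should sit under an `if __name__ == "__main__":` guard. Results arrive in completion order, not submission order, which is fine because each dictionary carries its own HUC.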
When the process pool MP set is done, output an independent csv file based purely on those durations. Columns requested:
- HUC (watch for zero padding)
- Success (True / False or some variant of that)
- Duration as mm:ss (skip hours)
- Duration in decimal (base 10) minutes, i.e. time as a decimal number. Where normal time would show 4:30 (4 min 30 seconds), this column would show 4.50. The decimal form makes averaging and summing easy later if we want.
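One way the requested columns could be produced from the list of dictionaries is sketched below. The function names, the column headers and the `duration_sec` key are assumptions for illustration. HUCs are kept as strings so the zero padding survives, and minutes keep counting past 59 since hours are skipped.

```python
import csv


def format_mm_ss(seconds):
    """Format seconds as mm:ss; minutes keep counting past 59 (no hours)."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


def decimal_minutes(seconds):
    """Express seconds as base-10 minutes, so 4 min 30 s becomes 4.5."""
    return round(seconds / 60, 2)


def write_duration_csv(results, csv_path):
    """Write one row per HUC from dicts with huc, success, duration_sec."""
    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["huc", "success", "duration_mm_ss", "duration_minutes"])
        for row in sorted(results, key=lambda r: r["huc"]):
            writer.writerow([
                row["huc"],  # string, so leading zeros are preserved
                row["success"],
                format_mm_ss(row["duration_sec"]),
                f"{decimal_minutes(row['duration_sec']):.2f}",
            ])
```

Anything that reads the file back (a spreadsheet, pandas) needs to treat the HUC column as text, otherwise the leading zero can be dropped on load.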

The additional "duration-by-huc" part will help tell us which HUCs are slower than others. There is also a suspicion that the problem might not be so much HUC related but something about the code itself, e.g. it is possible that it slows down as each HUC progresses if there is a memory leak or an object that is not being cleaned up fast enough.

RobHanna-NOAA self-assigned this Oct 20, 2024

RobHanna-NOAA added the enhancement and FIM4 labels Oct 20, 2024

RobHanna-NOAA removed their assignment Oct 20, 2024
