Log status of fluxsite and comparison runs #291

SeanBryan51 · 2024-05-08T01:38:32Z

This change improves how the exit status of fluxsite tasks and bitwise comparison tasks is reported in the PBS log files, so that users know which tasks succeeded or failed.

A State object is introduced as a minimal way of persisting state between separate processes. This is necessary for correctly showing the status of fluxsite and comparison runs, because these tasks run inside child processes, which do not share data structures with the parent process.

Fixes #180
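
One way such a state object could work is with marker files on a shared filesystem: a child process creates an empty file when it finishes, and the parent checks for that file afterwards. The sketch below is illustrative only. The `FileState` class, its method names and the default prefix are assumptions made for this example, not necessarily what benchcab implements.

```python
from pathlib import Path


class FileState:
    """Hypothetical file-backed flag store shared between processes."""

    def __init__(self, state_dir: Path, prefix: str = ".state_attr_"):
        self.state_dir = state_dir
        self.prefix = prefix
        # Any process may be the first to touch the directory.
        self.state_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, attr: str) -> Path:
        # One marker file per attribute, named <prefix><attr>.
        return self.state_dir / f"{self.prefix}{attr}"

    def set(self, attr: str) -> None:
        """Record that `attr` is set by creating an empty marker file."""
        self._path(attr).touch()

    def is_set(self, attr: str) -> bool:
        """Return True if any process has set `attr`."""
        return self._path(attr).exists()

    def reset(self) -> None:
        """Remove every marker file owned by this state object."""
        for marker in self.state_dir.glob(f"{self.prefix}*"):
            marker.unlink()
```

In this sketch a child process would call `set("done")` on success, and the parent would later call `is_set("done")` on its own instance pointing at the same directory. Because the flag lives on disk, it survives the child process exiting.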

codecov · 2024-05-08T01:44:06Z

Codecov Report

Attention: Patch coverage is 77.77778% with 14 lines in your changes missing coverage. Please review.

Project coverage is 70.55%. Comparing base (bb0ab3d) to head (d490ed9).
Report is 11 commits behind head on main.

Files with missing lines	Patch %	Lines
src/benchcab/benchcab.py	7.69%	12 Missing ⚠️
src/benchcab/fluxsite.py	71.42%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #291      +/-   ##
==========================================
+ Coverage   69.97%   70.55%   +0.57%     
==========================================
  Files          18       19       +1     
  Lines         986     1046      +60     
==========================================
+ Hits          690      738      +48     
- Misses        296      308      +12

SeanBryan51 · 2024-05-10T06:12:23Z

I've pasted below what the PBS log file looks like for the integration test (with debug information removed for clarity):

/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-run-tasks --config=config.yaml
2024-05-10 16:09:00,934 - INFO - benchcab.benchcab.py:307 - Running fluxsite tasks...
2024-05-10 16:09:00,935 - INFO - benchcab.benchcab.py:308 - tasks: 8 (models: 2, sites: 1, science configurations: 4)
2024-05-10 16:10:46,165 - INFO - benchcab.benchcab.py:319 - 0 failed, 8 passed
/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-bitwise-cmp --config=config.yaml
2024-05-10 16:10:47,103 - INFO - benchcab.benchcab.py:336 - Running comparison tasks...
2024-05-10 16:10:47,104 - INFO - benchcab.benchcab.py:337 - tasks: 4 (models: 2, sites: 1, science configurations: 4)
2024-05-10 16:10:59,550 - INFO - comparison.comparison.py:66 - Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S2_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S2_out.nc are identical
2024-05-10 16:10:59,764 - INFO - comparison.comparison.py:66 - Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S0_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S0_out.nc are identical
2024-05-10 16:10:59,848 - INFO - comparison.comparison.py:66 - Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S3_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S3_out.nc are identical
2024-05-10 16:10:59,848 - INFO - comparison.comparison.py:66 - Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S1_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S1_out.nc are identical
2024-05-10 16:10:59,862 - INFO - benchcab.benchcab.py:348 - 0 failed, 4 passed

======================================================================================
                  Resource Usage on 2024-05-10 16:11:03:
   Job Id:             115370058.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1.25
   NCPUs Requested:    18                     NCPUs Used: 18              
                                           CPU Time Used: 00:11:32        
   Memory Requested:   30.0GB                Memory Used: 912.21MB        
   Walltime requested: 06:00:00            Walltime Used: 00:02:05        
   JobFS requested:    100.0MB                JobFS used: 0B              
======================================================================================

bschroeter

Nice PR - just a few small things.

Let me know if anything is unclear.

src/benchcab/fluxsite.py

src/benchcab/comparison.py

src/benchcab/fluxsite.py

src/benchcab/workdir.py

ccarouge

It's good but I have a few comments:

More tests need modification. fluxsite.py and comparison.py functions now run their tasks AND create the state file. We should test that these files are created. We also need to modify the tests for the cleanup. They need to test the state files are deleted.
We will eventually need the same functionality on the spatial tests since we will want to plug the outputs to ilamb on me.org. Want to add to this PR or create a new issue for that?

I have a final comment that is beyond the scope of this PR. I'll open an issue; I'm mentioning it here for your information. I am starting to dislike the number of functions in the Benchcab class that are not CLI. I'm wondering if we need to rethink the organisation of that class.

ccarouge · 2024-05-17T05:46:50Z

src/benchcab/benchcab.py

+
+        tasks_failed = [task for task in tasks if not task.is_done()]
+        n_failed, n_success = len(tasks_failed), len(tasks) - len(tasks_failed)
+        logger.info(f"{n_failed} failed, {n_success} passed")


Would it be useful to users if there was a warning level log entry if there were any failed tasks? And some indication either of which tasks failed or how to find out which failed. This is valid for the bitwise comparison as well.

I've checked that failure cases always output to the PBS log file. Here is the PBS log of a benchcab run with a task that fails due to a malformed namelist file:

/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-run-tasks --config=config.yaml
2024-05-20 12:30:37,479 - INFO - benchcab.benchcab.py:307 - Running fluxsite tasks...
2024-05-20 12:30:37,483 - INFO - benchcab.benchcab.py:308 - tasks: 20 (models: 1, sites: 5, science configurations: 4)
2024-05-20 12:30:37,615 - ERROR - fluxsite.fluxsite.py:252 - Error: CABLE returned an error for task AU-Tum_2002-2017_OzFlux_Met_R0_S0
2024-05-20 12:33:47,687 - INFO - benchcab.benchcab.py:319 - 1 failed, 19 passed
/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-bitwise-cmp --config=config.yaml
2024-05-20 12:33:48,725 - INFO - benchcab.benchcab.py:336 - Running comparison tasks...
2024-05-20 12:33:48,729 - INFO - benchcab.benchcab.py:337 - tasks: 0 (models: 1, sites: 5, science configurations: 4)
2024-05-20 12:33:48,773 - INFO - benchcab.benchcab.py:348 - 0 failed, 0 passed

======================================================================================
                  Resource Usage on 2024-05-20 12:33:52:
   Job Id:             116077189.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1.93
   NCPUs Requested:    18                     NCPUs Used: 18
                                           CPU Time Used: 00:40:46
   Memory Requested:   30.0GB                Memory Used: 1.86GB
   Walltime requested: 06:00:00            Walltime Used: 00:03:13
   JobFS requested:    100.0MB                JobFS used: 0B
======================================================================================
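
The suggestion in this thread (a warning-level entry that names the failed tasks) could be sketched roughly as follows. This is only an illustration: the `log_task_summary` function and the `name` attribute on tasks are assumptions made for the example; only `is_done()` and the summary message format appear in the diff.

```python
import logging
from typing import Iterable, Protocol


class Task(Protocol):
    """Minimal interface assumed for a task (hypothetical)."""

    name: str

    def is_done(self) -> bool: ...


def log_task_summary(tasks: Iterable[Task], logger: logging.Logger) -> int:
    """Log a pass/fail summary and return the number of failed tasks."""
    tasks = list(tasks)
    failed = [task for task in tasks if not task.is_done()]
    n_failed, n_success = len(failed), len(tasks) - len(failed)
    logger.info(f"{n_failed} failed, {n_success} passed")
    if failed:
        # Name each failed task so users do not have to search the log.
        names = ", ".join(task.name for task in failed)
        logger.warning(f"Failed tasks: {names}")
    return n_failed
```

Returning the failure count would also let a caller decide whether to exit with a non-zero status, although that is a separate design decision from the logging itself.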

ccarouge · 2024-05-17T05:50:23Z

src/benchcab/internal.py

@@ -30,6 +30,10 @@

 # DIRECTORY PATHS/STRUCTURE:

+# Path to hidden state directory:
+STATE_DIR = Path(".state")
+STATE_PREFIX = ".attr_"


What does the STATE_PREFIX give us? Is it in case we want to use the same functionality for something else and we need to differentiate the various files created?

I prefixed the state files to differentiate them from ordinary files, and they are hidden so that users are less likely to accidentally break something. The .attr_ prefix isn't that descriptive now that I'm thinking about it - I will change this to .state_attr_.

SeanBryan51 · 2024-05-20T02:48:07Z

@ccarouge I will add in the extra tests.

We will eventually need the same functionality on the spatial tests since we will want to plug the outputs to ilamb on me.org. Want to add to this PR or create a new issue for that?

I will need to look more into this as I'm not sure how to get payu to dump a state file on a successful run. Perhaps save it for another PR 😄? We still haven't tested out running ILAMB in meorg yet so I'd say it is not a priority yet for spatial runs.

ccarouge

A couple more tests, sorry.

ccarouge · 2024-07-09T00:58:38Z

src/benchcab/workdir.py

+    if internal.STATE_DIR.exists():
+        shutil.rmtree(internal.STATE_DIR)
+


It seems my previous comment about cleanup wasn't clear. I was talking about updating this test:

benchcab/tests/test_workdir.py

Line 136 in fc625c0

def test_clean_submission_files(self, runs_path, pbs_job_files: List[Path]):

To check that the STATE_DIR is removed.
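
A test along these lines could look roughly like the sketch below. It is self-contained for illustration, so it does not use the `runs_path` or `pbs_job_files` fixtures from the real test; `remove_state_dir` is a hypothetical stand-in for the cleanup logic shown in the diff, and the marker file name is made up.

```python
import shutil
import tempfile
from pathlib import Path


def remove_state_dir(state_dir: Path) -> None:
    """Delete the state directory if present (mirrors the diff above)."""
    if state_dir.exists():
        shutil.rmtree(state_dir)


def test_state_dir_is_removed() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        # Set up a state directory containing one marker file.
        state_dir = Path(tmp) / ".state"
        state_dir.mkdir()
        (state_dir / ".state_attr_done").touch()

        remove_state_dir(state_dir)
        assert not state_dir.exists()

        # A second call must be a no-op rather than an error.
        remove_state_dir(state_dir)
```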

src/benchcab/fluxsite.py

ccarouge

All good.

This change introduces a State object as a minimal way of persisting state between separate processes. This is necessary for correctly showing the status of fluxsite and comparison runs, because these tasks run inside child processes, which do not share data structures with the parent process. Fixes #180

SeanBryan51 linked an issue May 8, 2024 that may be closed by this pull request

Log run summary of tasks to standard output #180

Closed

ccarouge added the priority:high label (high priority issues that should be included in the next release) May 8, 2024

SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch from 27e2fa8 to ddb8b70 Compare May 9, 2024 01:59

Make error message visible on failure

bb20f5e

SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch 3 times, most recently from 0b72b72 to e897894 Compare May 10, 2024 05:59

SeanBryan51 marked this pull request as ready for review May 10, 2024 06:18

SeanBryan51 requested review from bschroeter and ccarouge May 10, 2024 06:18

bschroeter requested changes May 13, 2024

SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch from 94e885d to 6f2958d Compare May 14, 2024 02:33

SeanBryan51 requested review from bschroeter and removed request for ccarouge May 14, 2024 05:10

bschroeter approved these changes May 16, 2024

SeanBryan51 mentioned this pull request May 16, 2024

Unsafe practices when deleting directories #295

Open

SeanBryan51 requested a review from ccarouge May 16, 2024 01:32

SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch from 1bd9d46 to 4f4e54d Compare May 16, 2024 01:40

ccarouge requested changes May 17, 2024

ccarouge mentioned this pull request May 17, 2024

Rethink what is included in the Benchcab class #297

Open

SeanBryan51 changed the title ~~Log status of model runs and bitwise comparisons~~ Log status of fluxsite and comparison runs May 20, 2024

SeanBryan51 requested a review from ccarouge May 21, 2024 01:34

SeanBryan51 mentioned this pull request Jul 8, 2024

Bitwise comparisons: confusing message to users #299

Closed

ccarouge requested changes Jul 9, 2024

SeanBryan51 requested a review from ccarouge July 9, 2024 05:14

ccarouge approved these changes Jul 9, 2024

SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch from df55ae5 to d490ed9 Compare July 10, 2024 02:07

SeanBryan51 merged commit a314471 into main Jul 10, 2024
4 checks passed

SeanBryan51 deleted the 180-log-run-summary-of-tasks-to-standard-output branch July 10, 2024 02:21
