Log status of fluxsite and comparison runs #291

Merged
merged 2 commits into from
Jul 10, 2024

@SeanBryan51 SeanBryan51 commented May 8, 2024

This change improves how the exit status of fluxsite tasks and bitwise comparison tasks is reported in the PBS log files, so that users know which tasks succeeded and which failed.

A State object is introduced as a minimal way of persisting state between separate processes. This is necessary for correctly showing the status of fluxsite and comparison runs, because these tasks run inside child processes that do not share data structures with the parent process.

Fixes #180
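
A minimal sketch of the idea, assuming state is recorded as empty marker files in a directory. The class name, method names and default prefix below are illustrative assumptions and may not match the benchcab implementation:

```python
from pathlib import Path


class State:
    """Record named flags as empty marker files so any process can read them.

    Hypothetical sketch: the API shown here is an assumption, not benchcab's.
    """

    def __init__(self, state_dir: Path, prefix: str = ".state_attr_") -> None:
        self.state_dir = Path(state_dir)
        self.prefix = prefix
        # Create the directory up front so set() never fails on a missing parent.
        self.state_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, attr: str) -> Path:
        # Each attribute maps to one (hidden) marker file.
        return self.state_dir / f"{self.prefix}{attr}"

    def set(self, attr: str) -> None:
        """Mark attr as set by creating its marker file."""
        self._path(attr).touch()

    def is_set(self, attr: str) -> bool:
        """Return True if any process has set attr."""
        return self._path(attr).exists()
```

Because the flag lives on the filesystem rather than in memory, a child process can call `set("done")` and the parent can later construct a `State` on the same directory and call `is_set("done")` to see the result.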

@SeanBryan51 SeanBryan51 linked an issue May 8, 2024 that may be closed by this pull request

codecov bot commented May 8, 2024

Codecov Report

Attention: Patch coverage is 77.77778% with 14 lines in your changes missing coverage. Please review.

Project coverage is 70.55%. Comparing base (bb0ab3d) to head (d490ed9).
Report is 11 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/benchcab/benchcab.py | 7.69% | 12 Missing ⚠️ |
| src/benchcab/fluxsite.py | 71.42% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #291      +/-   ##
==========================================
+ Coverage   69.97%   70.55%   +0.57%     
==========================================
  Files          18       19       +1     
  Lines         986     1046      +60     
==========================================
+ Hits          690      738      +48     
- Misses        296      308      +12     


@ccarouge ccarouge added the priority:high High priority issues that should be included in the next release. label May 8, 2024
@SeanBryan51 SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch from 27e2fa8 to ddb8b70 Compare May 9, 2024 01:59
@SeanBryan51 SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch 3 times, most recently from 0b72b72 to e897894 Compare May 10, 2024 05:59
@SeanBryan51
Collaborator Author

I've pasted below what the PBS log file looks like for the integration test (with debug information removed for clarity):

/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-run-tasks --config=config.yaml
2024-05-10 16:09:00,934 - INFO - benchcab.benchcab.py:307 - Running fluxsite tasks...
2024-05-10 16:09:00,935 - INFO - benchcab.benchcab.py:308 - tasks: 8 (models: 2, sites: 1, science configurations: 4)
2024-05-10 16:10:46,165 - INFO - benchcab.benchcab.py:319 - 0 failed, 8 passed
/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-bitwise-cmp --config=config.yaml
2024-05-10 16:10:47,103 - INFO - benchcab.benchcab.py:336 - Running comparison tasks...
2024-05-10 16:10:47,104 - INFO - benchcab.benchcab.py:337 - tasks: 4 (models: 2, sites: 1, science configurations: 4)
2024-05-10 16:10:59,550 - INFO - comparison.comparison.py:66 - Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S2_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S2_out.nc are identical
2024-05-10 16:10:59,764 - INFO - comparison.comparison.py:66 - Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S0_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S0_out.nc are identical
2024-05-10 16:10:59,848 - INFO - comparison.comparison.py:66 - Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S3_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S3_out.nc are identical
2024-05-10 16:10:59,848 - INFO - comparison.comparison.py:66 - Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S1_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S1_out.nc are identical
2024-05-10 16:10:59,862 - INFO - benchcab.benchcab.py:348 - 0 failed, 4 passed

======================================================================================
                  Resource Usage on 2024-05-10 16:11:03:
   Job Id:             115370058.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1.25
   NCPUs Requested:    18                     NCPUs Used: 18              
                                           CPU Time Used: 00:11:32        
   Memory Requested:   30.0GB                Memory Used: 912.21MB        
   Walltime requested: 06:00:00            Walltime Used: 00:02:05        
   JobFS requested:    100.0MB                JobFS used: 0B              
======================================================================================

@SeanBryan51 SeanBryan51 marked this pull request as ready for review May 10, 2024 06:18
Collaborator

@bschroeter bschroeter left a comment

Nice PR - just a few small things.

Let me know if anything is unclear.

src/benchcab/fluxsite.py (outdated, resolved)
src/benchcab/comparison.py (outdated, resolved)
src/benchcab/comparison.py (outdated, resolved)
src/benchcab/fluxsite.py (outdated, resolved)
src/benchcab/fluxsite.py (outdated, resolved)
src/benchcab/workdir.py (resolved)
@SeanBryan51 SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch from 94e885d to 6f2958d Compare May 14, 2024 02:33
@SeanBryan51 SeanBryan51 requested review from bschroeter and removed request for ccarouge May 14, 2024 05:10
@SeanBryan51 SeanBryan51 requested a review from ccarouge May 16, 2024 01:32
@SeanBryan51 SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch from 1bd9d46 to 4f4e54d Compare May 16, 2024 01:40
Collaborator

@ccarouge ccarouge left a comment

It's good but I have a few comments:

  1. More tests need modification. fluxsite.py and comparison.py functions now run their tasks AND create the state file. We should test that these files are created. We also need to modify the tests for the cleanup. They need to test the state files are deleted.
  2. We will eventually need the same functionality on the spatial tests since we will want to plug the outputs to ilamb on me.org. Want to add to this PR or create a new issue for that?

I have a final comment that is beyond the scope of this PR. I'll open an issue and am mentioning it here for your information. I am starting to dislike the number of non-CLI functions in the Benchcab class, and I'm wondering if we need to rethink the organisation of that class.


tasks_failed = [task for task in tasks if not task.is_done()]
n_failed, n_success = len(tasks_failed), len(tasks) - len(tasks_failed)
logger.info(f"{n_failed} failed, {n_success} passed")
Collaborator

Would it be useful to users to emit a warning-level log entry when any tasks fail, with some indication of which tasks failed or how to find out which ones did? The same applies to the bitwise comparison.
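
One way this suggestion could look is sketched below. Only `is_done()` and the info-level summary line come from the snippet under review; the `DemoTask` stand-in, its `name` field, the `log_summary` helper and the logger name are assumptions made for illustration:

```python
import logging
from dataclasses import dataclass
from typing import List, Sequence

logger = logging.getLogger("benchcab_sketch")  # hypothetical logger name


@dataclass
class DemoTask:
    """Stand-in for a task; `name` and `done` are assumed fields."""

    name: str
    done: bool

    def is_done(self) -> bool:
        return self.done


def log_summary(tasks: Sequence[DemoTask]) -> List[str]:
    """Log the pass/fail counts and, if needed, a warning naming failed tasks."""
    failed = [task.name for task in tasks if not task.is_done()]
    n_failed, n_success = len(failed), len(tasks) - len(failed)
    logger.info(f"{n_failed} failed, {n_success} passed")
    if failed:
        # Warning-level entry so failures stand out from the info-level summary.
        logger.warning("Failed tasks: %s", ", ".join(failed))
    return failed
```

Returning the list of failed names keeps the helper easy to test and lets the caller decide whether to act on the failures beyond logging them.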

Collaborator Author

I've checked that failure cases always output to the PBS log file. Here is the PBS log of a benchcab run with a task that fails due to a malformed namelist file:

/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-run-tasks --config=config.yaml
2024-05-20 12:30:37,479 - INFO - benchcab.benchcab.py:307 - Running fluxsite tasks...
2024-05-20 12:30:37,483 - INFO - benchcab.benchcab.py:308 - tasks: 20 (models: 1, sites: 5, science configurations: 4)
2024-05-20 12:30:37,615 - ERROR - fluxsite.fluxsite.py:252 - Error: CABLE returned an error for task AU-Tum_2002-2017_OzFlux_Met_R0_S0
2024-05-20 12:33:47,687 - INFO - benchcab.benchcab.py:319 - 1 failed, 19 passed
/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-bitwise-cmp --config=config.yaml
2024-05-20 12:33:48,725 - INFO - benchcab.benchcab.py:336 - Running comparison tasks...
2024-05-20 12:33:48,729 - INFO - benchcab.benchcab.py:337 - tasks: 0 (models: 1, sites: 5, science configurations: 4)
2024-05-20 12:33:48,773 - INFO - benchcab.benchcab.py:348 - 0 failed, 0 passed

======================================================================================
                  Resource Usage on 2024-05-20 12:33:52:
   Job Id:             116077189.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1.93
   NCPUs Requested:    18                     NCPUs Used: 18              
                                           CPU Time Used: 00:40:46        
   Memory Requested:   30.0GB                Memory Used: 1.86GB          
   Walltime requested: 06:00:00            Walltime Used: 00:03:13        
   JobFS requested:    100.0MB                JobFS used: 0B              
======================================================================================

@@ -30,6 +30,10 @@

# DIRECTORY PATHS/STRUCTURE:

# Path to hidden state directory:
STATE_DIR = Path(".state")
STATE_PREFIX = ".attr_"
Collaborator

What does the STATE_PREFIX give us? Is it in case we want to use the same functionality for something else and we need to differentiate the various files created?

Collaborator Author

I prefixed the state files to differentiate them from ordinary files, and they are hidden so that users are less likely to accidentally break something. The .attr_ prefix isn't that descriptive now that I'm thinking about it - I will change this to .state_attr_.

@SeanBryan51
Collaborator Author

@ccarouge I will add in the extra tests.

> We will eventually need the same functionality on the spatial tests since we will want to plug the outputs to ilamb on me.org. Want to add to this PR or create a new issue for that?

I will need to look more into this as I'm not sure how to get payu to dump a state file on a successful run. Perhaps save it for another PR 😄? We still haven't tested out running ILAMB in meorg yet so I'd say it is not a priority yet for spatial runs.

@SeanBryan51 SeanBryan51 changed the title Log status of model runs and bitwise comparisons Log status of fluxsite and comparison runs May 20, 2024
@SeanBryan51 SeanBryan51 requested a review from ccarouge May 21, 2024 01:34
Collaborator

@ccarouge ccarouge left a comment

A couple more tests, sorry.

Comment on lines +27 to +29
if internal.STATE_DIR.exists():
    shutil.rmtree(internal.STATE_DIR)

Collaborator

It seems my previous comment about cleanup wasn't clear. I was talking about updating this test:

def test_clean_submission_files(self, runs_path, pbs_job_files: List[Path]):

To check that the STATE_DIR is removed.
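
A rough sketch of such a check, written as a standalone pytest-style test. The `clean_state` helper is a hypothetical stand-in for the real cleanup routine, and the marker file name is an assumption. Only the `STATE_DIR` value and the `exists()`/`rmtree` pattern come from the diff:

```python
import shutil
from pathlib import Path

STATE_DIR = Path(".state")  # mirrors the constant shown in the diff


def clean_state(root: Path) -> None:
    """Hypothetical stand-in for the cleanup step: remove root/.state if present."""
    state_dir = root / STATE_DIR
    if state_dir.exists():
        shutil.rmtree(state_dir)


def test_clean_removes_state_dir(tmp_path: Path) -> None:
    """tmp_path is pytest's built-in temporary directory fixture."""
    state_dir = tmp_path / STATE_DIR
    state_dir.mkdir()
    (state_dir / ".state_attr_done").touch()  # assumed marker file name

    clean_state(tmp_path)

    assert not state_dir.exists()
```

In the real test suite the assertion would presumably sit inside `test_clean_submission_files` and call the project's own cleanup function rather than this stand-in.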

src/benchcab/fluxsite.py (resolved)
@SeanBryan51 SeanBryan51 requested a review from ccarouge July 9, 2024 05:14
Collaborator

@ccarouge ccarouge left a comment

All good.

This change introduces a State object as a minimal way of having state
persist between separate processes. This is necessary for correctly
showing the status of fluxsite and comparison runs as these tasks are
run inside child processes which do not share the same data structures
in the parent process.

Fixes #180
@SeanBryan51 SeanBryan51 force-pushed the 180-log-run-summary-of-tasks-to-standard-output branch from df55ae5 to d490ed9 Compare July 10, 2024 02:07
@SeanBryan51 SeanBryan51 merged commit a314471 into main Jul 10, 2024
4 checks passed
@SeanBryan51 SeanBryan51 deleted the 180-log-run-summary-of-tasks-to-standard-output branch July 10, 2024 02:21
Development

Successfully merging this pull request may close these issues.

Log run summary of tasks to standard output
3 participants