-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log status of fluxsite and comparison runs #291
Log status of fluxsite and comparison runs #291
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #291 +/- ##
==========================================
+ Coverage 69.97% 70.55% +0.57%
==========================================
Files 18 19 +1
Lines 986 1046 +60
==========================================
+ Hits 690 738 +48
- Misses 296 308 +12 ☔ View full report in Codecov by Sentry. |
27e2fa8
to
ddb8b70
Compare
0b72b72
to
e897894
Compare
I've pasted what the PBS log file looks like below for the integration test (with debug information removed for clarity)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR - just a few small things.
Let me know if anything is unclear.
94e885d
to
6f2958d
Compare
1bd9d46
to
4f4e54d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good but I have a few comments:
- More tests need modification.
fluxsite.py
andcomparison.py
functions now run their tasks AND create the state file. We should test that these files are created. We also need to modify the tests for the cleanup. They need to test the state files are deleted. - We will eventually need the same functionality on the spatial tests since we will want to plug the outputs to ilamb on me.org. Want to add to this PR or create a new issue for that?
I have a final comment that is beyond this PR. I'll open an issue, mentioning it here for your information. I am starting to dislike the amount of functions that are not CLI in the Benchcab class. I'm wondering if we need to rethink the organisation of that class.
|
||
tasks_failed = [task for task in tasks if not task.is_done()] | ||
n_failed, n_success = len(tasks_failed), len(tasks) - len(tasks_failed) | ||
logger.info(f"{n_failed} failed, {n_success} passed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to users if there was a warning level log entry if there were any failed tasks? And some indication either of which tasks failed or how to find out which failed. This is valid for the bitwise comparison as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked that failure cases always output to the PBS log file. Here is the PBS log of a benchcab run with a task that fails due to a malformed namelist file:
/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-run-tasks --config=config.yaml
2024-05-20 12:30:37,479 - INFO - benchcab.benchcab.py:307 - Running fluxsite tasks...
2024-05-20 12:30:37,483 - INFO - benchcab.benchcab.py:308 - tasks: 20 (models: 1, sites: 5, science configurations: 4)
2024-05-20 12:30:37,615 - ERROR - fluxsite.fluxsite.py:252 - Error: CABLE returned an error for task AU-Tum_2002-2017_OzFlux_Met_R0_S0
2024-05-20 12:33:47,687 - INFO - benchcab.benchcab.py:319 - 1 failed, 19 passed
/scratch/tm70/sb8430/conda/envs/benchcab-dev/bin/benchcab fluxsite-bitwise-cmp --config=config.yaml
2024-05-20 12:33:48,725 - INFO - benchcab.benchcab.py:336 - Running comparison tasks...
2024-05-20 12:33:48,729 - INFO - benchcab.benchcab.py:337 - tasks: 0 (models: 1, sites: 5, science configurations: 4)
2024-05-20 12:33:48,773 - INFO - benchcab.benchcab.py:348 - 0 failed, 0 passed
======================================================================================
Resource Usage on 2024-05-20 12:33:52:
Job Id: 116077189.gadi-pbs
Project: tm70
Exit Status: 0
Service Units: 1.93
NCPUs Requested: 18 NCPUs Used: 18
CPU Time Used: 00:40:46
Memory Requested: 30.0GB Memory Used: 1.86GB
Walltime requested: 06:00:00 Walltime Used: 00:03:13
JobFS requested: 100.0MB JobFS used: 0B
======================================================================================
src/benchcab/internal.py
Outdated
@@ -30,6 +30,10 @@ | |||
|
|||
# DIRECTORY PATHS/STRUCTURE: | |||
|
|||
# Path to hidden state directory: | |||
STATE_DIR = Path(".state") | |||
STATE_PREFIX = ".attr_" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the STATE_PREFIX give us? Is it in case we want to use the same functionality for something else and we need to differentiate the various files created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefixed the state files to differentiate them from ordinary files and are hidden so that users are less likely to accidentally break something. The .attr_
prefix isn't that descriptive now that I'm thinking about it - I will change this to .state_attr_
.
@ccarouge I will add in the extra tests.
I will need to look more into this as I'm not sure how to get payu to dump a state file on a successful run. Perhaps save it for another PR 😄? We still haven't tested out running ILAMB in meorg yet so I'd say it is not a priority yet for spatial runs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more tests, sorry.
if internal.STATE_DIR.exists(): | ||
shutil.rmtree(internal.STATE_DIR) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems my previous comment about cleanup wasn't clear. I was talking about updating this test:
benchcab/tests/test_workdir.py
Line 136 in fc625c0
def test_clean_submission_files(self, runs_path, pbs_job_files: List[Path]): |
To check that the STATE_DIR is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good.
This change introduces a State object as a minimal way of having state persist between separate processes. This is necessary for correctly showing the status of fluxsite and comparison runs as these tasks are run inside child processes which do not share the same data structures in the parent process. Fixes #180
df55ae5
to
d490ed9
Compare
This change improves how the exit status of fluxsite tasks and bitwise comparison tasks are reported in the PBS log files so that users know which tasks succeeded/failed.
A
State
object is introduced as a minimal way of having state persist between separate processes. This is necessary for correctly showing the status of fluxsite and comparison runs as these tasks are run inside child processes which do not share the same data structures in the parent process.Fixes #180