benchcab not working with latest MAIN(?) #302

Open
har917 opened this issue Jul 17, 2024 · 12 comments

@har917
Collaborator

har917 commented Jul 17, 2024

@ccarouge @SeanBryan51 @abhaasgoyal As of 17/7/2024 - I'm having difficulty getting benchcab to run (anything).

First issue - following recent updates to check%ranges the current default namelist (so what is supposed to be used for regression testing) still has check%ranges = .false. when created via git clone (and so the runs fail).

Using

realisations:
  - repo:
      git:
        branch: main
    patch:
      cable:
        check:
          ranges: 0
  - repo: 
      git: 
        branch: 335-facilitate-output-of-potential-evaporation-directly-from-the-offline-code-base
    patch:
      cable:
        check:
          ranges: 0
  
modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]

as the benchcab.yaml file appears to successfully create cable.nml files with the correct entries.

However benchcab then throws (in the qsub.sh.o*** file)

/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/bin/benchcab fluxsite-run-tasks --config=config.yaml
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/bin/benchcab", line 10, in <module>
    sys.exit(main())
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/main.py", line 42, in main
    parse_and_dispatch(parser)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/main.py", line 32, in parse_and_dispatch
    func(**args)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/benchcab.py", line 279, in fluxsite_run_tasks
    config = self._get_config(config_path)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/benchcab.py", line 136, in _get_config
    self._config = read_config(config_path)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/config.py", line 165, in read_config
    config = read_config_file(config_path)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/config.py", line 139, in read_config_file
    with Path.open(Path(config_path), "r", encoding="utf-8") as file:
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'

Interestingly the spatial runs appear to have completed successfully. I don't see a .yaml file in the runs/fluxsite directory which is consistent with the error message.

Any thoughts?
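
The traceback ends with `Path.open` being called on the bare relative name `config.yaml`, so whether the file is found depends on the working directory of the process at that moment. A minimal standard-library sketch of that behaviour follows; `describe_config_lookup` is a hypothetical diagnostic helper and is not part of benchcab.

```python
import os
import tempfile
from pathlib import Path


def describe_config_lookup(config_path: str) -> str:
    """Report where a (possibly relative) config path actually points.

    Hypothetical diagnostic helper, not part of benchcab.
    """
    # A relative name is resolved against the current working directory.
    resolved = Path(config_path).resolve()
    status = "found" if resolved.is_file() else "missing"
    return f"{status}: {resolved}"


if __name__ == "__main__":
    start = Path.cwd()
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        (root / "config.yaml").write_text("realisations: []\n", encoding="utf-8")
        (root / "runs").mkdir()

        os.chdir(root)
        print(describe_config_lookup("config.yaml"))  # file exists next to cwd

        os.chdir(root / "runs")
        print(describe_config_lookup("config.yaml"))  # same name, different cwd

        os.chdir(start)  # leave the temporary directory before it is removed
```

Printing the resolved absolute path in this way shows which directory the lookup was actually attempted in, which the bare `'config.yaml'` in the error message does not.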

@abhaasgoyal

abhaasgoyal commented Jul 17, 2024

@har917 not sure, but I think the issue is that config file name should be config.yaml instead of benchcab.yaml (also the namelist files have been updated in bench_example and cable so no patch is needed - this is why the namelist file contents could be correct).

@har917
Collaborator Author

har917 commented Jul 17, 2024

Perhaps to add more detail.

This testing is based off a fresh git clone (as of today) - the cable.nml that is downloaded into the benchcab_example/namelists directory has the old check%ranges = .false. line

In the above (and this wasn't clear - apologies) the file that I refer to as benchcab.yaml is the config.yaml file that the user edits in the benchcab_example root directory (I named it that because there are other config.yaml files created elsewhere in the structure)

Is there supposed to be a config.yaml file created in the benchcab_example/runs/fluxsite/ directory (equivalent to the .yaml files created in the spatial/crujra_access_* directories)?

@abhaasgoyal

abhaasgoyal commented Jul 17, 2024

I see, regarding bench_example we still have to merge the (approved) PR CABLE-LSM/bench_example#23 (so it will be done soon; edit: we still need to see how to manage namelist compatibility)

Now, the following set of commands seem to work for me

$ git clone git@github.com:CABLE-LSM/bench_example.git
$ cd bench_example
$ vim config.yaml
# Following lines go in this file
realisations:
  - repo:
      git:
        branch: main
    patch:
      cable:
        check:
          ranges: 0
  - repo: 
      git: 
        branch: 335-facilitate-output-of-potential-evaporation-directly-from-the-offline-code-base
    patch:
      cable:
        check:
          ranges: 0
  
modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]
$ benchcab run -v

By any chance, was benchcab run from another directory?

Also, there shouldn't be any config.yaml file in benchcab_example/runs/fluxsite

@har917
Collaborator Author

har917 commented Jul 17, 2024

By any chance, was benchcab run from another directory?

I don't think so though - there is a possibility that I ran it from one layer too high but I thought it would completely fail if I did that (I have a /benchcab directory on scratch into which git clone creates the /benchcab_example directory and think I ran from /benchcab_example). All this was run via a VS code terminal.

I didn't use the -v option - is that important?

@abhaasgoyal

I didn't use the -v option - is that important?

Not really, it is for verbose output (just to check whether there were any warnings/issues before submitting job)

Maybe it detected a config.yaml on top/environment path (little chance but just in case). It seems to work well for me, but maybe somebody else (@SeanBryan51 @ccarouge) can recreate this issue. Meanwhile @har917 maybe run the above set of commands from /scratch and if you could recheck that'd be great.

@har917
Collaborator Author

har917 commented Jul 17, 2024

I've just completed a completely fresh run using the commands above - with the only thing different being that I used the VS code editor not vim (since I'm not a vim user).

It's failed in the same way - it's

  • compiled the two branches successfully,
  • created the expected output directory structure
  • written namelists into the runs/fluxsite/tasks/ directory with the correct check%ranges entry
  • done something in the spatial section (it's created an N9.o*** file in each cru_access*/task directory)
  • but failed with the same error as above.

Likely contradicting my earlier thinking - I'm not sure it's done anything in the payu section in that there's nothing in the work directory (only in the archive directory).

One thing I've just thought of - is there a project dependence somewhere in here? I've been running these tests from p66 - should I try from a different project (e.g. x45, rp23).

@har917
Collaborator Author

har917 commented Jul 17, 2024

@AlisonBennett Could you have a go at following the instructions (4^) from @abhaasgoyal above to see whether you can get this to run?

Just trying to figure out whether this is at my end or somewhere else.

ps. you'll get to see how quick the updated compilation/build is - only takes a couple of minutes in contrast to 15+ with BLAZE_9814

@AlisonBennett

@har917 yes - I have done this and it seems to have run (i.e. it built some stuff and then submitted a PBS job which took a while to run through and now there is a bunch of extra stuff in some new directories). I'm not really sure what output to expect though, so perhaps it's best for you to have a look at scratch/x45/ab7412/benchcab_test to see if that is what it is meant to do.

There were a few errors before I got this far. To overcome these, I had to:
a) copy @abhaasgoyal's code for the .yaml file into my .yaml file (before I did that I got lots of errors very similar to your initial post). I think the yaml syntax is very fussy. You could try taking a copy of my .yaml file to see if that solves your problem?
b) follow instructions to load benchcab modules here (before I did that my environment didn't know about benchcab)
c) start a new arc session with adding access to both projects gdata/hh5 and gdata/ks32 (before I did that it said it didn't have access to the meteorology for one of the flux sites).

Hope this helps.

@ccarouge
Collaborator

@har917 mind sharing the path where you are running from?

@ccarouge
Collaborator

Actually @har917 what's the -l storage line in the qsub job and where do you run? Are you running from /g/data/p66 and it isn't in the -l storage line for example?

@har917
Collaborator Author

har917 commented Jul 18, 2024

@ccarouge I've been running from /scratch/x45 but likely submitted the job under p66 (as that's my default project).

@AlisonBennett has successfully run the regression test (under x45) this morning.

I'm trying again (but ensuring that I'm under x45) - and this is certainly behaving differently (in that it's produced fluxsite outputs) however it hasn't produced a benchmark_cable_qsub.sh.o*** file even though the job has apparently finished (via qstat)

The -l storage line in both sets of runs is #PBS -l storage=gdata/ks32+gdata/hh5+gdata/wd9

Basically I think the problem is that I've been essentially asking a job under p66 to write to scratch under x45 and it's said no (understandable) - but the error message is a bit odd.

On further thought - what's likely happened is that benchcab tries to copy the config.yaml file from its root directory to somewhere else as part of the workflow (that fails because of the gadi permissions requirement), then benchcab tries to read the copy of the config.yaml file (which doesn't exist) and you get the error above.

Perhaps a note in the benchcab 'how to' about matching project with the PBS storage and/or matching project with calling point is needed
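
As a rough illustration of the project/storage match described here, the sketch below checks whether a run directory on /scratch would be visible to a job. It assumes only the behaviour reported in this thread: the scratch area of the job's own project is available automatically, and any other scratch area has to appear in the `-l storage` flags. The function names and the example path are hypothetical and are not part of benchcab or PBS.

```python
from pathlib import PurePosixPath
from typing import Optional


def scratch_project(path: str) -> Optional[str]:
    """Return the project code of a /scratch/<project>/... path, else None."""
    parts = PurePosixPath(path).parts
    if len(parts) >= 3 and parts[0] == "/" and parts[1] == "scratch":
        return parts[2]
    return None


def scratch_is_mounted(run_dir: str, job_project: str, storage: str) -> bool:
    """Check whether a job charged to job_project could see run_dir.

    Assumption (taken from this thread): the job's own project scratch is
    mounted automatically; other scratch areas must be listed in storage,
    a '+'-separated string such as 'gdata/hh5+scratch/x45'.
    """
    project = scratch_project(run_dir)
    if project is None:
        raise ValueError(f"not a /scratch/<project>/ path: {run_dir}")
    if project == job_project:
        return True
    return f"scratch/{project}" in storage.split("+")


if __name__ == "__main__":
    storage = "gdata/ks32+gdata/hh5+gdata/wd9"  # flags quoted in this thread
    run_dir = "/scratch/x45/someuser/bench_example"  # hypothetical path
    print(scratch_is_mounted(run_dir, "p66", storage))
    print(scratch_is_mounted(run_dir, "x45", storage))
```

Running a check of this kind before submission would flag the mismatch directly instead of leaving it to surface later as a missing-file error.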

EDIT: it's now produced a .o*** file so all good.

@ccarouge
Collaborator

ccarouge commented Jul 18, 2024

@har917 When you run the job using p66, Gadi will automatically mount /scratch/p66 but not /scratch/x45. If you run using x45 resources, Gadi will mount /scratch/x45 (and not /scratch/p66).
In config.yaml, it's possible to give additional projects to mount: https://benchcab.readthedocs.io/en/latest/user_guide/config_options/#+pbs.storage
You may want to add scratch/x45 so it works no matter what resources are used

Edit: I'm assuming you run switchproj before running benchcab since we haven't provided a way to run benchcab under a different project than the current project of the user.
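
To make the effect of adding an extra storage entry concrete, here is a small sketch that merges additional flags into the storage line quoted earlier in the thread. How benchcab itself assembles this directive is not shown here, so `storage_directive` is a hypothetical helper used purely for illustration.

```python
from typing import Iterable, List


def storage_directive(default_flags: Iterable[str], extra_flags: Iterable[str] = ()) -> str:
    """Build a '#PBS -l storage=...' line, keeping order and dropping duplicates.

    Hypothetical helper; not benchcab's actual implementation.
    """
    merged: List[str] = []
    for flag in [*default_flags, *extra_flags]:
        if flag not in merged:
            merged.append(flag)
    return "#PBS -l storage=" + "+".join(merged)


if __name__ == "__main__":
    defaults = ["gdata/ks32", "gdata/hh5", "gdata/wd9"]  # flags quoted in this thread
    print(storage_directive(defaults, ["scratch/x45"]))
    # prints: #PBS -l storage=gdata/ks32+gdata/hh5+gdata/wd9+scratch/x45
```

With `scratch/x45` present in the directive, the run directory would be mounted whichever project the job is charged to.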
