subtitle: | New features and running ACCESS-OM2 models |
---|---|
Author: | Aidan Heerdegen |
description: | A training course to introduce new and upcoming features |
Date: | 17 October 2018 |
- payu recap
- New payu features
- Upcoming payu features
- Running ACCESS-OM2 models
... a python based "scientific workflow manager"
That means it runs your model for you. In short:
- Setup model run directory (
work
) - Run the model
- Move outputs/restarts to
archive
directory - Clean up the run directory
- Run again (if instructed to do so)
There is a new mppnccombine in town ... and it's fast.
How fast?
Probably not faster than the Waco Kid
.. notes:: That is it collates tiled outputs to multiple files, which makes model input and output faster Directly copies the compressed chunks from one file to another, skipping the decompress/recompress step Example of the raison d'etre of CMS, to improve researcher productivity
But seriously fast, hence the name mppnccombine-fast
https://github.com/coecms/mppnccombine-fast
- Written by Scott Wales
- Collates any tiled FMS model output (MOM5/MOM6/GOLD)
- Particularly fast with netCDF4 compressed data
- Copy
/short/public/aph502/mppnccombine-fast
to/short/$PROJECT/$model/bin
or
- Specify full path in
config.yaml
- A version of
payu
of0.10
or greater (module load payu/0.10
onraijin
) - Updated
config.yaml
syntax
collate: true
collate_mem: 16GB
collate_queue: express
collate_ncpus: 4
collate_flags: -n4 -r
.. notes:: Must specify mpi to use mppnccombine-fast. Minimum of 2 cpus, so can't use copyq ncpus per thread is ncpus / nthreads nthreads defaults to 1 ncpus defaults to 2 enable defaults to true Don't need to specify flags, enable or exe Fewer flags, as mppnccombine-fast has fewer options Don't get your hopes up Ryan, I haven't written restart collation, but when it is done, adding restart:true will collate restarts when the restart cleaning is done
Replaces collate_
options with dictionary
collate:
enable: true
queue: express
memory: 4GB
walltime: 00:30:00
mpi: true
ncpus: 4
threads: 2
# flags: -v
# exe: /full/path/to/mppnccombine-fast
# restart: true
.. notes:: Memory use should only depend on chunksize in the compressed file, not on the overall size of the file being written, so resolution independent. Unfortunately a memory leak bug in the underlying ``HDF5`` library means memory use will go up with the number of times data is written to a collated file. It is difficult to predict, but 2-4GB per thread has been the upper limit observed so far. No speed-up for low resolution outputs (MPI overhead swamps fast run times). Quarter degree 10-50x faster. Tenth 100x faster.
- Memory independent of resolution (<4GB per thread)
- Walltime in minutes
- No speed-up for low resolution (1 deg global model)
- Minimum of 2 cpus
- Chunk sizes chosen automatically by netCDF4 and depend on tile size
- Inconsistent tile sizes => inconsistent chunk sizes
- Inconsistent chunk sizes makes program slow (has to uncompress/compress)
- Make processor layout an integer divisor of grid
- Make io_layout an integer divisor of layout
.. notes:: Might think with io_layout would make consistent tile sizes, but the decomposition algorithm has already chosen some distribution of different tile sizes that cannot be evenly combined with io_layout Surprise to me to!
Quarter degree MOM-SIS model: 1440 x 1080.
layout = 64, 30
io_layout = 8, 6
- 1920 CPUs
- Tiles are 22 x 36 and 23 x 36
- IO tiles are 184 x 180, and 176 x 180
- Slow for collating normal data and slow for untiled data (restarts and regional output)
Quarter degree MOM-SIS model: 1440 x 1080.
layout = 60, 36
io_layout = 10, 6
- 2160 CPUs
- Tiles are 24 x 10
- IO tile is 144 x 180
- Fast for collating tiled and untiled output
.. notes:: Don't agree with Marshall from first payu training session nf_limits -P project -q queue -n ncpus 48 hrs < 256 CPUs 255 < 24 hrs < 512 512 < 10 hrs < 1024
- For low CPU count model: walltime up to 48 hours
- Maximise walltime to reduce effect of queue time
- A single 48 hour model run: What if crashes? Output non optimal?
.. notes:: Runspersub to the rescue! Being conservative with walltime in case some runs take > 2hr When last run crashes, only time of last run is lost
runspersub: 23
walltime: 48:00:00
- Say model takes 2hr per run
- Above config would run the model 23 times per PBS submit
walltime
must allow forrunspersub
runs of the model- If
walltime
exceeded last run will crash.payu
will not resubmit
payu
can resubmit itself with-n
command line option- Using same model example if I wanted 50 runs of the model:
payu run -n 50
runspersub: 1
=> 50 PBS submissions, single run in eachrunspersub: 23
=> 3 PBS submissions, 23/23/4 model runs respectively
Wanted to do this for a long long time
.. notes:: Very early in this job, there was a "dodgy aerosol file" that had been used in some simulations, but hard/impossible to say which runs/files were affected
- Track input files used for each model run
- Reproducibly re-run previous experiment
- Share experiments more easily as input files all specified
- Flexibility with specifying path to input files
- Identify all runs using specified file (possible future feature)
.. notes:: Executables and inputs are not expected to change. Can specify a flag to either warn if they do and stop, or update manifest and continue Restarts are the opposite, and by default are always expected to be different for each run, unless a flag is specified to reproduce a run, in which case any difference will flag an error and stop
Executables | mf_exe.yaml |
Inputs | mf_inputs.yaml |
Restarts | mf_restarts.yaml |
- Uses yamanifest
- Creates a
YaML
file - Each file (symlink) in
work
is dictionary key
.. notes:: Note there is a header and a version string, can ignore All files in work are either config files (which are tracked by git) or symbolic links to files elsewhere on filesystem Issues with getting this working has to do with enforcing this for all models - can be difficult with hardwired paths etc
fullpath
is the actual location of the file- The hashes uniquely identify file
.. notes:: binhash uses datestamp and size combined with first 100MB of a file. Not guaranteed unique, but likely to detect if the file has changed
- yamanifest supports multiple hashes => hierachy of hashes
- Unique hashes (md5, sha128) take too long on large files
- Fast hashing to check for file changes
- Use unique hash check when necessary (or periodically?)
ACCESS-OM2 model suite from 1 degree global to 0.1 degree global, Ocean/Ice model forced with atmospheric data and almost identical model parameters.
Single access-om2
repository with all code and configs
https://github.com/OceansAus/access-om2
Ocean | MOM5 |
Ice | CICE5 |
Atmosphere | libaccessom2 |
Coupler | OASIS3-MCT |
MOM5 |
https://github.com/mom-ocean/MOM5 |
CICE5 |
https://github.com/OceansAus/cice5 |
libaccessom2 |
https://github.com/OceansAus/libaccessom2 |
OASIS3-MCT |
https://github.com/OceansAus/oasis3-mct |
- Uses JRA55 reanalysis derivative product JRA55-do
http://jra.kishou.go.jp/JRA-55/index_en.html https://www.sciencedirect.com/science/article/pii/S146350031830235X
- IAF (Interannual Forcing) : JRA55-do (1955-present)
- RYF (Repeat Year Forcing) : RYF8485, RYF9091, RYF0304
- Nominal 1 degree global resolution
- JRA55 RYF and IAF, and CORE-II configurations
https://github.com/OceansAus/1deg_jra55_iaf https://github.com/OceansAus/1deg_jra55_ryf https://github.com/OceansAus/1deg_core_nyf
- Nominal 0.25 degree global resolution
- JRA55 RYF and IAF configurations
https://github.com/OceansAus/025deg_jra55_ryf https://github.com/OceansAus/025deg_jra55_iaf
.. notes:: Don't suggest anyone runs this without contacting COSIMA as runs are expensive and a bit tricky to get running on raijin.
- Nominal 0.1 degree global resolution
- JRA55 RYF and IAF configurations
- Minimal JRA55 IAF configuration (fewer cores)
https://github.com/OceansAus/01deg_jra55_iaf https://github.com/OceansAus/01deg_jra55_ryf https://github.com/OceansAus/minimal_01deg_jra55_iaf
.. notes:: Can run in a branch to keep config clean Can fork
- Follow the Quick Start instructions in the ACCESS-OM2 Wiki on github
https://github.com/OceansAus/access-om2/wiki/Getting-started#quick-start
.. notes:: All executables and Can fork
Use the 1 deg JRA55 IAF configuration:
The PBS and platform specific options for normalbw
queue
The model options
Miscellaneous options (including collation)