Upgrade 1-Degree to 0.25-Degree RYF Configuration #48

Merged
10 commits merged into 025deg_jra55do_ryf from 025deg_jra55do_ryf_iss101 on Apr 23, 2024

Conversation

ezhilsabareesh8
Contributor

This pull request upgrades the 1-degree RYF configuration to a higher-resolution 0.25-degree RYF configuration.

Changes Implemented:

  1. A new CICE grid has been generated from the MOM supergrid to resolve the longitude mismatch issue reported in these PRs: 1 and 2

  2. The CFL violation error caused by bad departure points has been fixed by adjusting the time steps of CICE and MOM, similar to the OM2 quarter-degree configurations (see the sketch after this description). Note that only the time step has been changed; ndtd is unchanged, to avoid truncation errors in MOM6 (refer here).

  3. Fixed Restart File Issue: Multiple restart files were being generated by MOM6 due to netCDF file limits. This issue has been addressed and fixed in this PR.

  4. Updated datm_in, drof_in, and nuopc.runconfig with a nomask ESMF mesh file for the atmosphere and runoff components to resolve NaNs.

  5. Used a default ice initial condition for cold start, which will be updated with the initial conditions generated by running this configuration for a longer period.

  6. All input files for this configuration have been updated to match the OM2 input files (refer here).

  7. This 0.25-degree configuration has been tested and run for two years without encountering any errors.

Note that the MOM6 parameters listed here need to be updated and the block size in ice_in needs to be optimised for performance.
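For reference, a minimal sketch of the timestep changes described in item 2, assuming the OM2-style arrangement in which the CICE and coupling timesteps match the MOM6 baroclinic timestep (the nuopc.runconfig attribute names are illustrative assumptions, not taken from this PR):

! MOM_input (from the diff in this PR)
DT = 1350.0               ! baroclinic timestep, reduced from 1800.0

! ice_in, &setup_nml (sketch)
  dt = 1350.0             ! CICE timestep kept equal to the coupling timestep
  ndtd = 1                ! dynamics subcycling left unchanged (assumed default)

! nuopc.runconfig (illustrative attribute names)
  ice_cpl_dt = 1350
  ocn_cpl_dt = 1350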

@anton-seaice anton-seaice (Contributor) left a comment

Well done Ezhil! Apologies for the many questions!

Can you rebase this to use the latest RYF config changes please?

5. Used a default ice initial condition for cold start, which will be updated with the initial conditions generated by running this configuration for a longer period.

For the 1deg we used initial conditions from a 3-hour run of OM2; do we still want to generate those?

Shall we update ice_in to match the OM2 config now or in a later PR?

Does @minghangli-uni have suggestions for performance changes or is that also a later PR?

Shall we try and make a clean commit history to explain each change of some sort? (Or is this all one merge commit?)

@minghangli-uni

minghangli-uni commented Apr 3, 2024

Shall we update ice_in to match the OM2 config now or in a later PR?
Does @minghangli-uni have suggestions for performance changes or is that also a later PR?

I suggest so. For 0.25deg, slenderX2 is not recommended for the processor shape; square-ice should be used instead. Similarly, the distribution type should be roundrobin rather than cartesian. These two changes omit land-only blocks and improve load balance.

This is the updated &domain_nml in ice_in, which is consistent with OM2.

&domain_nml
  block_size_x = 16
  block_size_y = 15
  distribution_type = "roundrobin"
  distribution_wght = "latitude"
  maskhalo_bound = .true.
  maskhalo_dyn = .true.
  maskhalo_remap = .true.
  max_blocks = 1000
  ns_boundary_type = "tripole"
  nx_global = 1440
  ny_global = 1080
  processor_shape = "square-ice"
/

The default 1-degree setting:

&domain_nml
  block_size_x = 16
  block_size_y = 15
  distribution_type = "cartesian"
  distribution_wght = "latitude"
  maskhalo_bound = .true.
  maskhalo_dyn = .true.
  maskhalo_remap = .true.
  max_blocks = 1000
  ns_boundary_type = "tripole"
  nx_global = 1440
  ny_global = 1080
  processor_shape = "slenderX2"
/

With the updated ice_in, the speedup of the ice component can be around 35%, calculated from the mean time of two consecutive runs.

Shall we try and make a clean commit history to explain each change of somesort? (Or is this all one merge commit?)

A clean commit history is better for users to track and understand the changes made.

@anton-seaice
Contributor

anton-seaice commented Apr 3, 2024

Thanks @minghangli-uni. I think you can push that change straight to this branch (025deg_jra55do_ryf_iss101), or if you prefer, make a PR into this branch for Ezhil to review first.

@minghangli-uni

Sure thing. I can make a PR into this branch soon.

MOM_input
@@ -1,7 +1,7 @@
 ! This input file provides the adjustable run-time parameters for version 6 of the Modular Ocean Model (MOM6).

 ! === module MOM ===
-DT = 1800.0
+DT = 1350.0


Should DT_THERM be an integer multiple of DT?

Contributor Author

Ideally, DT_THERM should be an integer multiple of DT and less than the forcing or coupling time-step (DT = 1350.0). However, there are no warnings or errors with the current DT_THERM, but we may have to reduce it. Any thoughts @aekiss?

Collaborator

If we want to do what you've done here, we'll also need to set THERMO_SPANS_COUPLING = True, which may require some testing.

But do we really want to change the coupling timestep? Is this done only because the CICE and coupling timesteps are currently linked? Could we instead try to relax this requirement?

(Sorry, lots of questions)

Contributor Author

This issue will be handled in a separate PR, as discussed here.

@dougiesquire dougiesquire (Collaborator) Apr 8, 2024

I think we need to decide what to do about DT_THERM here. The current settings are confusing:
DT_THERM > DT but THERMO_SPANS_COUPLING = False (by default).

With these settings, I think the value of DT_THERM in the MOM_input will not be used - instead DT_THERM will be reset internally to DT. If we do want to have DT_THERM > DT we need to set THERMO_SPANS_COUPLING = True which will require we also change to DIABATIC_FIRST = False due to limitations in the MOM6 NUOPC cap.

Thoughts @aekiss?
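For concreteness, a sketch of the two self-consistent MOM_input combinations being weighed up here (presumably the "option 1" and "option 2" discussed below; the values are taken from tests later in this thread and are assumptions, not a recommendation):

! Option A: thermodynamics at the baroclinic/coupling timestep
DT = 1350.0
DT_THERM = 1350.0             ! equal to DT; THERMO_SPANS_COUPLING then has no effect

! Option B: long thermodynamic timestep spanning coupling intervals
DT = 1350.0
DT_THERM = 5400.0             ! = 4 coupling timesteps
THERMO_SPANS_COUPLING = True
DIABATIC_FIRST = False        ! required by the NUOPC cap when spanning coupling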

Contributor

It's certainly confusing. Do we set single_step_call anywhere? It defaults to true, in which case DT, DT_THERM and THERMO_SPANS_COUPLING are ignored, according to this.

Additional relevant documentation in MOM6 cap is here.

Contributor

As discussed in today's meeting, let's do option 1 for now.

We can then explore option 2 with different DT_THERM in subsequent tests and PRs. (Note that when THERMO_SPANS_COUPLING = True, DT_THERM will be rounded down to an integer number of coupling timesteps.)

@aekiss aekiss (Contributor) Apr 10, 2024

Given how confusingly these parameters interact, it will be good to have MOM_parameter_doc.* added to the repos to provide a sanity check on what is actually being used for the tests when we do option 2 - see COSIMA/access-om3#117. We could do this manually until CI is in place to automate it.

@aekiss aekiss (Contributor) Apr 10, 2024

In option 2, if DT_THERM exceeds the coupling timestep, but the coupling timestep matches the dynamic timestep in CICE and baroclinic DT in MOM6, would it make sense to set the thermo timestep in CICE to match that in MOM?

This would involve setting ndtd in CICE to DT_THERM/DT and setting the CICE thermo timestep to ndtd*DT.
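A worked example of this suggestion, using the DT = 1350 s and DT_THERM = 5400 s values tested later in this thread (a sketch only; whether CICE accepts this is the open question below):

! ice_in, &setup_nml (hypothetical)
  dt   = 5400.0             ! CICE thermo timestep = ndtd * DT = 4 * 1350
  ndtd = 4                  ! = DT_THERM / DT = 5400 / 1350

The CICE dynamics would then step at dt/ndtd = 1350 s, matching the MOM6 baroclinic DT and the coupling timestep.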

Collaborator

Open question: can CICE run with a thermo timestep greater than the coupling timestep?

@ezhilsabareesh8
Contributor Author

For the 1deg we used initial conditions from a 3-hour run of OM2; do we still want to generate those?

There's an ongoing discussion about the ice initial conditions as highlighted in issue #50. We initially used the default option due to uncertainties in using OM2-generated initial conditions when switching to a C-grid, as discussed here.

The question remains: should we generate these initial conditions, or should we run the current configuration for a longer duration and examine the ice area output to see whether OM2-generated initial conditions are needed?

@dougiesquire dougiesquire (Collaborator) left a comment

Thanks @ezhilsabareesh8. A couple of initial comments. I'd still like to try and run the configuration, but I haven't had time today.

I think we maybe need to discuss whether we really want to be changing the coupling timestep here (and if so, what we want to do with MOM's thermodynamic timestep).

Also, from what I can tell the tool to generate the CICE grid isn't yet finalised. Should this wait until there is a single established tool?

@dougiesquire
Collaborator

@ezhilsabareesh8 I see you just force-pushed changes, which has made GitHub think that some of my review comments are outdated. Please view my comments in this conversation rather than in the "Files changed" tab, where a number of comments have disappeared.

@anton-seaice
Contributor

For the 1deg we used initial conditions from a 3-hour run of OM2; do we still want to generate those?

There's an ongoing discussion about the ice initial conditions as highlighted in issue #50. We initially used the default option due to uncertainties in using OM2-generated initial conditions when switching to a C-grid, as discussed here.

The question remains: should we generate these initial conditions, or should we run the current configuration for a longer duration and examine the ice area output to see whether OM2-generated initial conditions are needed?

I think we should generate the initial conditions, so that

  • the config is as close as possible to OM2 for verification of output, and
  • to reduce the number of potential stability issues during model spin-up (noting the same initial conditions will probably be used for the CM3 initial set-up too).

@anton-seaice
Contributor

I think we maybe need to discuss whether we really want to be changing the coupling timestep here (and if so, what we want to do with MOM's thermodynamic timestep).

I guess we could reduce the CICE timestep without reducing the coupling timestep? Is that what you are thinking? Isn't changing the coupling timestep here just matching the OM2 configuration?

Also, from what I can tell the tool to generate the CICE grid isn't yet finalised. Should this wait until there is a single established tool?

I think we should progress:

  • We will have to update all the configs when the CICE grid is settled.
  • Finalising this PR allows other improvements to be made (noting this is still a config under development), e.g. any CICE parameter configs, performance tuning, etc.

@minghangli-uni

But do we really want to change the coupling timestep? Is this done only because the CICE and coupling timesteps are currently linked? Could we instead try to relax this requirement?

@dougiesquire In the current NUOPC setup, if the coupling timestep equals or exceeds 1800 s, the two issues below popped up:

  1. FATAL from PE 100: write energy: Ocean velocity has been truncated too many times
  2. (abort ice) error = (diagnostic abort) ERROR: bad departure points

From my perspective, the issues mainly stem from CICE, irrespective of the value assigned to ndtd in the current tests, whether it's set to 1 or 2.

@dougiesquire
Collaborator

I guess we could reduce the CICE timestep without reducing the coupling timestep? Is that what you are thinking?

Yeah, really just wondering whether that's even sensible. Probably not. And yeah, in OM2 the atmosphere-to-ice coupling is fixed at 3 hourly, but the ice-to-ocean coupling changes with the ocean-baroclinic/ice-thermodynamic timestep (which are the same). In MOM6 we have to think about what we want to do with DT_THERM - i.e. do we want this to be longer than the coupling timestep?

@dougiesquire
Collaborator

dougiesquire commented Apr 4, 2024

@dougiesquire In the current NUOPC setup, if the coupling timestep equals or exceeds 1800 s, the two issues below popped up

Okay, so maybe we don't really have a choice (at least for now). Changing the coupling timestep along with the ice and mom-baroclinic timesteps is certainly the easiest approach and aligns best with OM2. So the question that remains is what to do now with DT_THERM. In the long run, I presume we'll want this to be longer than the coupling timestep, so maybe now is the time to give THERMO_SPANS_COUPLING a whirl?

(P.S. the reason I'm so hung up on DT_THERM/THERMO_SPANS_COUPLING is because I'll need to think about how this impacts WOMBAT coupling)

@anton-seaice
Contributor

For reference, Kieran did test a patch to the CICE NUOPC driver to run CICE several times per coupling timestep when he was messing around with the atmosphere coupling. It's looking like it won't be needed though.

@anton-seaice
Contributor

Also - can someone make the 025deg_jra55do_ryf branch write-protected?

@minghangli-uni

minghangli-uni commented Apr 4, 2024

Okay, so maybe we don't really have a choice (at least for now). Changing the coupling timestep along with the ice and mom-baroclinic timesteps is certainly the easiest approach and aligns best with OM2. So the question that remains is what to do now with DT_THERM. In the long run, I presume we'll want this to be longer than the coupling timestep, so maybe now is the time to give THERMO_SPANS_COUPLING a whirl?

@dougiesquire These two, COSIMA/mom6-panan#28 and https://github.com/jsimkins2/nwa25/tree/main/run, could be useful for dt_therm selection.

For a quick update, my current test case with DT=1350 and DT_THERM=5400 can run up to 2 years and is still running now.

@dougiesquire
Collaborator

dougiesquire commented Apr 4, 2024

For a quick update, my current test case with DT=1350 and DT_THERM=5400 can run up to 2 years and is still running now.

Did you also set THERMO_SPANS_COUPLING = True? I'm guessing not, because the NUOPC cap cannot handle THERMO_SPANS_COUPLING = True with DIABATIC_FIRST = True.

@minghangli-uni

Did you also set THERMO_SPANS_COUPLING = True? I'm guessing not, because the NUOPC cap cannot handle THERMO_SPANS_COUPLING = True with DIABATIC_FIRST = True.

No. THERMO_SPANS_COUPLING is False by default, and DIABATIC_FIRST = False.

@dougiesquire
Collaborator

No. THERMO_SPANS_COUPLING is False by default, and DIABATIC_FIRST = False.

If you do not set THERMO_SPANS_COUPLING = True then your thermodynamic timestep is just getting reset internally to the baroclinic timestep. Also, have you changed the DIABATIC_FIRST setting? In this PR it is set to True.

@minghangli-uni

have you changed the DIABATIC_FIRST setting? In this PR it is set to True.

It was discussed in https://forum.access-hive.org.au/t/namelist-configuration-discussion-meeting/1917/8 and agreed to better leave it in default first.

@ezhilsabareesh8
Contributor Author

have you changed the DIABATIC_FIRST setting? In this PR it is set to True.

It was discussed in https://forum.access-hive.org.au/t/namelist-configuration-discussion-meeting/1917/8 and agreed to better leave it in default first.

Hi @minghangli-uni, did you change DIABATIC_FIRST to True in the 1deg_jra55do_ryf branch? This PR is rebased from that.

@minghangli-uni

For 1deg_jra55do_ryf, it is True. But following the MOM6 parameter discussion, I haven't run 1deg_jra55do_ryf.

@dougiesquire
Collaborator

dougiesquire commented Apr 4, 2024

I think the DIABATIC_FIRST issue should be handled in a separate PR. Here, as in our other config branches, it is currently set to True.

ADDED: Of course, that is, unless we decide to have DT_THERM > the coupling timestep, in which case we need to set THERMO_SPANS_COUPLING = True for it to have any effect. Then we will also have to change to DIABATIC_FIRST = False, since the NUOPC cap isn't set up to run with THERMO_SPANS_COUPLING = True and DIABATIC_FIRST = True.

@aekiss
Contributor

aekiss commented Apr 10, 2024

Do we also want to merge this or do that as a separate PR?
https://github.com/COSIMA/MOM6-CICE6/tree/025deg_jra55do_ryf_iss135
COSIMA/access-om3#135

@ezhilsabareesh8
Contributor Author

I'm just doing a test run with this PR and if that's successful I'll approve. Are you happy for me to rebase this and deal with the conflicts?

Thanks @dougiesquire, yes that would be great.

@dougiesquire
Collaborator

Do we also want to merge this or do that as a separate PR?
https://github.com/COSIMA/MOM6-CICE6/tree/025deg_jra55do_ryf_iss135
COSIMA/access-om3#135

I think the changes in https://github.com/COSIMA/MOM6-CICE6/tree/025deg_jra55do_ryf_iss135 that aren't already included in this PR should be handled in separate PR(s). @minghang, can you have another look at the changes here to check that you're happy with them before we merge? The CICE block_size_x, block_size_y and max_blocks, in particular, are different from what you propose in COSIMA/access-om3#135

@minghangli-uni

Hi @dougiesquire. My account is @minghangli-uni, not @minghang, which you have been referring to.

Regarding the parameters block_size_x=30 and block_size_y=27: these values were taken directly from the OM2 technical report, and max_blocks=8 is calculated from this code. As I used 288 cores, the corresponding value for max_blocks was 8. However, for the current config, which employs 240 cores, max_blocks should be set to 10.

Given that we have reliable references to validate these parameter values, I prefer to maintain block_size_x=30, block_size_y=27, and max_blocks=10.

I am happy with the rest of the changes.
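For reference, a back-of-envelope check of those numbers, assuming max_blocks must at least cover the even split of blocks across PEs plus headroom for roundrobin load imbalance (the referenced code is not shown here, so this is an approximation):

! nblocks_x = nx_global / block_size_x = 1440 / 30 = 48
! nblocks_y = ny_global / block_size_y = 1080 / 27 = 40
! total blocks = 48 * 40 = 1920
! 288 cores: 1920 / 288 ~ 6.7  ->  max_blocks = 8 with headroom
! 240 cores: 1920 / 240 = 8    ->  max_blocks = 10 with headroom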

@dougiesquire
Collaborator

My account is @minghangli-uni, not @minghang, which you have been referring to.

Whoops, sorry!

@minghangli-uni

No problem at all. I was just concerned that if you quote the wrong person, I might not receive your message in time.

@anton-seaice
Contributor

Regarding the parameters block_size_x=30 and block_size_y=27: these values were taken directly from the OM2 technical report, and max_blocks=8 is calculated from this code. As I used 288 cores, the corresponding value for max_blocks was 8. However, for the current config, which employs 240 cores, max_blocks should be set to 10.

The larger block sizes seem like a more sensible default :)

Feel free to mess with the numbers as much as you would like during the profiling work. The CICE doco states:

In practice, blocks should probably not have fewer than about 8 to 10 grid cells in each direction, and more square blocks tend to optimize the volume-to-surface ratio important for communication cost. Often 3 to 8 blocks per processor provide the decomposition flexibility to create reasonable load balance configurations.

So moving to bigger blocks rather than increasing max_blocks might make sense.
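As a purely hypothetical illustration of that trade-off (the block sizes here are assumptions, not a tested configuration):

! block_size_x = 36, block_size_y = 30  (hypothetical, untested)
! nblocks = (1440 / 36) * (1080 / 30) = 40 * 36 = 1440
! 1440 blocks / 240 cores = 6 blocks per PE, within the 3-8 per-processor guideline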

Also see https://cice-consortium-cice.readthedocs.io/en/main/user_guide/ug_implementation.html#fig-distribscorecard . I don't know why we use 'sectrobin' either.

Anyway - this is for future work, not this PR :)

@minghangli-uni

We use roundrobin

https://github.com/COSIMA/MOM6-CICE6/blob/185b44f46f10aa4700fd3b094dfa7fec9484f1e9/ice_in#L70

@dougiesquire
Collaborator

@ezhilsabareesh8 I'm just doing a test run with this PR and if that's successful I'll approve.

So this ran for me, but it took 19 hours to complete a year-long run. Is that expected?

@minghangli-uni

Yes, with the current configuration, using DT=DT_THERM=1350s, a year-long run took approximately 19 hours to complete.

@dougiesquire dougiesquire (Collaborator) left a comment

Thanks @ezhilsabareesh8 et al. LGTM and ran as is.

@anton-seaice
Contributor

Can this get merged?

@dougiesquire
Collaborator

@ezhilsabareesh8 we're happy for you to merge

@ezhilsabareesh8
Contributor Author

@ezhilsabareesh8 we're happy for you to merge

Thanks @dougiesquire, I will merge then.

@ezhilsabareesh8 ezhilsabareesh8 merged commit 56fe064 into 025deg_jra55do_ryf Apr 23, 2024
@dougiesquire dougiesquire deleted the 025deg_jra55do_ryf_iss101 branch April 23, 2024 07:09