
tripole initial/restart file with inconsistent values on the tripole seam #937

Open
DeniseWorthen opened this issue Feb 27, 2024 · 15 comments

@DeniseWorthen (Contributor) commented Feb 27, 2024

NOTE added later:
The issue here is that the initial condition used in UFS for tripole grids did not have consistent values on the tripole seam. The values on the seam are not unique (the seam is folded on itself with a particular orientation). Initial conditions were generated for UFS by interpolating from one restart file to another, and the tripole seam was not handled correctly. On startup, reading the file with netCDF and with PIO did not produce bit-for-bit identical answers, which is what prompted this issue. With additional debugging, the problems in the initial condition file became clearer, but without a clear answer about how CICE should deal with them. At present, netCDF seems to use the inconsistent values directly while PIO makes some adjustments; it's not clear that either is correct or reasonable. If CICE writes the restart itself, this situation does not occur.
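
For reference, here is a minimal sketch of what "consistent on the seam" means, assuming a netCDF restart whose uvel field has (nj, ni) as its last two dimensions and assuming a common U-fold convention in which 1-based column i on the top row pairs with column nx-i and vector components reverse sign. The exact index mapping depends on the grid convention, so this is illustrative only, not the CICE algorithm.

    # Illustrative seam-consistency check (assumed U-fold mapping; see note above).
    import numpy as np
    import netCDF4

    def check_tripole_seam(path, varname="uvel", tol=1e-12):
        with netCDF4.Dataset(path) as nc:
            field = np.ma.filled(nc.variables[varname][:], np.nan)
        top = field[..., -1, :]                  # values on j = ny_global
        nx = top.shape[-1]
        partner = (nx - 2 - np.arange(nx)) % nx  # assumed 1-based mapping i <-> nx-i
        mismatch = np.nanmax(np.abs(top + top[..., partner]))
        print(f"{varname}: max |u(i) + u(fold(i))| on the seam = {mismatch:.3e}")
        return mismatch < tol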

=================================

UFS's implementation of CICE has always used the NetCDF interface. In testing the new PIO features, I found what appears to be an issue with restart reads through PIO. This issue predates the recent PR.

Our current develop branch sits just after the 6.4 tag. Using that code, I ran two test cases, one with PIO and one with NetCDF, using our 1/4-deg CICE-MOM6-DATM configuration. In the PIO case, I am setting restart_format = history_format = pio_pnetcdf. The run starts with

    runtype        = 'initial'
    runid          = 'unknown'
    ice_ic         = 'cice_model.res.nc'
    restart        = .true.
    restart_ext    = .false.
    use_restart_time = .false.

If I write out the IC file (write_ic = .true.), I see 8 fields that differ between NetCDF and PIO (compared using cprnc). These 8 fields are:

2:240: RMS uvel                             3.1287E-03            NORMALIZED  9.4347E-02
4:249: RMS vvel                             3.4885E-03            NORMALIZED  1.1972E-01
6:258: RMS icespd                           4.5378E-03            NORMALIZED  9.1556E-02
8:267: RMS icedir                           3.3009E+00            NORMALIZED  1.8086E-02
10:703: RMS uvel_d                           2.4702E-03            NORMALIZED  1.1950E-01
12:712: RMS vvel_d                           2.7543E-03            NORMALIZED  1.5164E-01
14:1120: RMS uvel_h                           2.4702E-03            NORMALIZED  1.1950E-01
16:1129: RMS vvel_h                           2.7543E-03            NORMALIZED  1.5164E-01

All the differences are located on the last jrow. For the field uvel, I see (black = netcdf, red=pio)

[Screenshot: uvel on the last jrow, NetCDF (black) vs PIO (red)]

The mediator history files also show differences on the last row at the first coupling timestep, so the differences are being propagated into the model fields. The netcdf/pio cases quickly diverge.

I then tried the same two cases, but using ice_ic = none. In that case, the netcdf/pio cases were B4B after 24 hours.

The final bit of the puzzle: the corruption of the last jrow happens for our 1/4-, 1/2-, and 5-deg tripole grid cases. It does not happen with our 1-deg tripole case.
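
For anyone reproducing this, here is a rough sketch of pulling the last jrow out of the two iceh_ic files for a side-by-side look. The file names and the assumption that uvel's last two dimensions are (nj, ni) are placeholders of mine; the cprnc output above is the authoritative comparison.

    # Illustrative comparison of the last jrow of uvel between the NetCDF- and
    # PIO-written IC history files (file names are placeholders).
    import numpy as np
    import netCDF4

    def last_jrow(path, varname="uvel"):
        with netCDF4.Dataset(path) as nc:
            data = np.ma.filled(nc.variables[varname][:], np.nan)
        return np.squeeze(data)[-1, :]           # j = ny_global

    u_nc  = last_jrow("iceh_ic.netcdf.nc")
    u_pio = last_jrow("iceh_ic.pio.nc")
    diff  = u_nc - u_pio
    print("max |netcdf - pio| on the last jrow:", np.nanmax(np.abs(diff)))
    print("columns that differ:", np.nonzero(np.abs(diff) > 0)[0])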

@apcraig (Contributor) commented Feb 27, 2024

I'm going to speculate this has to do with a tripole halo update on restart. There are several questions. First, the uvel at initialization looks like it's symmetric across the center gridpoint on the tripole line (as expected), but netcdf and pio have different signs. That's not good. Second, we do run some tests on tx1 in the CICE test suite, but none use PIO. I will see if I can run some tx1 tests with PIO in CICE. Finally, @DeniseWorthen, can you email me a sample ice_in file for one of your cases? I just want to check things like the boundary_type settings and so forth, so I can more readily try to duplicate the problem. I assume this bug would produce non-exact restarts, so I'll do that test too with tx1 and PIO. My guess is that this is a problem in PIO with tripole, not netCDF, but I will try to confirm that too.

@DeniseWorthen (Contributor, Author) commented:

@apcraig Thanks, I'll email you the ice_in shortly. My next step was going to be to move backwards in the commit history, but I would love it if you could reproduce it. I haven't yet tried to figure out why the 1-deg config would not be impacted.

@apcraig (Contributor) commented Feb 27, 2024

Thanks @DeniseWorthen. I'm trying to duplicate this error in CICE now with tx1. Hopefully, I'll find something. The test suite currently does not cover this combination. It is odd that your 1-deg case behaves differently; let's see if I can duplicate it first. I will keep you posted.

@apcraig (Contributor) commented Feb 27, 2024

Quick tests with tx1 in CICE don't show a problem. I'm trying to understand better what might be happening:

  • Is it possible your one-degree initial condition file has different properties for u/v on the tripole? Were those restart files created differently or at different times? Can you quickly look at the uvel field at ny_global in the restart files for the one-degree and non-one-degree cases and check whether they exhibit differences similar to your plot above?
  • Do you know whether your netcdf and pio cases both restart exactly, even if the solutions based on the same initial condition file are not identical? How about trying the following tests: run a netcdf and a pio non-one-degree restart test. Do those both restart exactly (even if they produce different solutions from the same initial condition file)? Then can you run the 2nd part of the restart test for the netcdf and pio cases but with the restart files swapped? Which, if any, of those four runs (from the new pio and netcdf restart files) produce identical results?

The implementation of the netcdf and pio "halo update" on restart is very different. The netcdf code reads a global array and scatters it, with the halo update done as part of the scatter. The pio code reads directly into the decomposed blocks and then does an explicit halo update on the decomposed data. There could be other differences; I'm still looking. But what I want to understand better is whether the initial conditions you're using are playing a role. That might explain why the one-degree case doesn't show the same problem (maybe the one-degree restart file is newer/different somehow?). And if both CICE standalone and your cases restart exactly with "new" restart files and are interchangeable between pio and netcdf, then that also suggests something may be funny in the initial condition files. That doesn't mean it's not a bug in CICE, but it would help to understand whether the initial condition files are playing an important role.
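
To make the terminology concrete, here is a sketch of what a northern tripole halo update does on a single global array, independent of how the data were read or scattered. The choice of source row and the index mapping vary with the fold type and variable staggering, so the specific values below are assumptions for illustration, not the CICE implementation.

    # Illustrative northern tripole halo fill on a global (nj+1, ni) array with
    # one ghost row at the top.  Source row and index mapping are assumptions.
    import numpy as np

    def tripole_north_halo(field, vector=True, fold="U"):
        jghost = field.shape[0] - 1         # index of the northern ghost row
        nx = field.shape[1]
        i = np.arange(nx)
        if fold == "U":                     # grid_type = 'tripole' (fold through U points)
            src_row = jghost - 2            # reflect the row below the fold line (assumed)
            partner = (nx - 2 - i) % nx     # assumed 1-based mapping i <-> nx-i
        else:                               # 'tripoleT' (fold through T points)
            src_row = jghost - 1
            partner = (nx - 1 - i) % nx     # assumed 1-based mapping i <-> nx+1-i
        sign = -1.0 if vector else 1.0      # vector components reverse sign across the fold
        field[jghost, :] = sign * field[src_row, partner]
        return field

Whatever the read path, both implementations ultimately have to leave the seam and halo rows in a state consistent with this kind of fold; the question in this issue is what they do when the incoming data are not already consistent.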

Depending on how these tests go, it might be worthwhile to set up your one-degree and 5-deg (or 1/2-deg) tests in CICE standalone to see if I can duplicate. I may ask for the grid, kmt, ice_in, and restart files for those configurations from UFS, if it's possible to share.

@DeniseWorthen (Contributor, Author) commented:

Hi Tony,

I'll take on your suggestions and let you know what I find.

I'll be able to share files if that is useful to you. Either derecho or hera works for me (my current tests are on hera).

For restarts, yes, our current cases using NetCDF restart exactly. I'm not going to say the same about PIO; I believe it does, but I can't lay my hands on that case at the moment.

@dabail10 (Contributor) commented:

Are your grids tripole or tripoleT? That is, does the tripole seam / fold go through U points or T points? Also, are you running C-grid? I think the tripole halo extrapolate gets turned off for the C-grid. Did you try restart_ext = .true.?

@DeniseWorthen (Contributor, Author) commented:

@dabail10 The grid is 'tripole' (both grid_type = 'tripole' and ns_boundary_type = 'tripole'). This is not the C-grid.

The two cases were run in the same sandbox, with identical inputs and ice_in other than the _format specifiers. The only difference was the executable and whether I ran CICE compiled w/ NetCDF or compiled with PIO.

@dabail10 (Contributor) commented:

This is a crazy thought, but are you changing processors across a restart? Do you have land block elimination? That is, distribution_type = 'block'? Also, does restart_ext = .true. help?

@DeniseWorthen (Contributor, Author) commented:

@dabail10 These are not actually restart runs, they are 'initial'. I'm reading a restart file as an IC. I am not using land block elimination. I'll try the restart_ext=.true. setting along w/ Tony's suggestions.

@dabail10 (Contributor) commented:

OK. Sorry I missed that. Do you ever use land block elimination? That is, do your initial files have missing values? I'm guessing that restart_ext will not help here, since the initial files probably did not have the ghost cells saved. One thing that is different with PIO compared to netCDF is that land blocks are set to a missing value internal to PIO. I've run into this previously where I tried to set land to zeroes, but PIO sets the land blocks to a different fill value. I had added some code in the restart to check for the PIO missing values. However, I ended up deciding to just remove land block elimination.
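
For context, here is a generic sketch of the kind of fill-value handling described above, where eliminated land blocks come back as a fill value rather than zeros. This is plain netCDF masking, not the actual CICE/PIO code, and the variable name is a placeholder.

    # Illustrative read that replaces fill values (e.g. from eliminated land
    # blocks) with zeros.  Variable name and zero-fill choice are assumptions.
    import numpy as np
    import netCDF4

    def read_field_fill_as_zero(path, varname="uvel"):
        with netCDF4.Dataset(path) as nc:
            data = nc.variables[varname][:]  # masked wherever _FillValue appears
        return np.ma.filled(data, 0.0)       # treat land / fill points as zero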

@DeniseWorthen (Contributor, Author) commented:

Quick tests with tx1 in CICE don't show a problem. I'm trying to understand better what might be happening:

  • Is it possible your one-degree initial condition file has different properties for u/v on the tripole? Were those restart files created differently or at different times? Can you quickly look at the uvel field at ny_global in the restart files for the one-degree and non-one-degree cases and check whether they exhibit differences similar to your plot above?

I've checked the 1-deg and 1/2-deg IC (restart) files. They were generated using the same process, which was down-scaling of a 1/4-deg restart file from the CPC reanalysis. This was done using a modified NCL script that Dave and Xingren originally developed, just with a different set of ESMF weights. The 1/4 deg is for a different date, but for the 1-deg vs 1/2-deg comparison, this is the uvel and vvel on the final row. The upper panel is the 1 deg (imax = 360) and the bottom is the 1/2 deg (imax = 720).

[Screenshot: uvel and vvel on the final row; upper panel 1 deg (imax = 360), lower panel 1/2 deg (imax = 720)]

What I've spent most of today on is understanding why the 1 deg netcdf/pio cases reproduce each other, but the 1/2deg does not. I am now more perplexed than ever. The following shows the 1 deg case, uvel on last row. These are from the iceh_ic file and the IC file.

[Figure: 1-deg case, uvel on the last row from the iceh_ic files (netcdf, pio) and from the IC file]

Clearly, the uvel is not the same for the netcdf/pio cases, and the netcdf case looks like the IC. But in this case, somehow, whatever is happening on that last row doesn't break B4B between the netcdf and pio cases.

The same figure for the 1/2 deg case.

[Figure: the same comparison for the 1/2-deg case]

I have not yet run the PIO restart cases, but I will do that next.

@apcraig (Contributor) commented Feb 28, 2024

Thanks @DeniseWorthen, sorry this is so difficult. Let me know if you want me to help debug the UFS cases, I have access to hera and derecho.

The fact that the netcdf iceh and ic results are "the same" is good to see, I think. PIO is different; that's where the problems almost certainly arise. The fact that 1 degree has the same problem but doesn't diverge is odd, but maybe we should set that aside for the moment.

I would be interested to see the netcdf, pio, and ic plots for a restart file generated by the model after 5 days (or something) and then restarted. I still wonder if the ic files you're using are playing a role. Points along the tripole are not unique, and maybe there is an expectation that the restart files satisfy some specific tripole requirements, but the "interpolated" ic files don't meet that requirement while restart files written by the model do (whether with pio or netcdf). That would mean there is still a fix needed in either netcdf or pio so they behave the same, but is that driven largely by an ic file that is technically flawed or not?

@apcraig (Contributor) commented Feb 29, 2024

Also, just to clarify and add some additional info: the tx1 cases in standalone CICE do not start from an ic file; they start from some internal initial conditions. But if I look at the uvel at j=ny_global in the restart file of a tx1 CICE run, the plot is more like the PIO values (symmetric with opposite sign across the center). Also, the netcdf and pio restarts are the same for standalone CICE, and the results are identical and bit-for-bit throughout the run, including through a restart.

[Figure: uvel at j=ny_global from a standalone CICE tx1 restart file]

So, I think a contributing factor is that the j=ny_global values in your IC files are NOT technically correct. Then netcdf and pio "adjust" the field in different ways; maybe one is right or maybe both are wrong, we don't know yet. If the IC files were technically correct, then whatever netcdf and pio are doing to them must ultimately result in the j=ny_global values being unchanged.

So, I think there are a few things to think about. Are there requirements on the ic files for tripole grids on the tripole seam? Should the code be checking the ic files? Should the code be "fixing" the ic data on the tripole seam if it's technically incorrect? If so, what is the correct way to do that? There are many ways it could be done; clearly the netcdf and pio versions of the code are not doing the same thing. Basically, the ic's being used are not valid on the tripole seam, so there is no right answer (on some level) to how the code should start up.

Looking at your plots of "netcdf" and "pio" at j=ny_global, it looks to me like netcdf is just using the data as it comes in, without any correction to the uvel. I guess the code runs, and after a timestep the data becomes technically correct on the seam, although the science on the seam is probably not so great for that first timestep. But it probably doesn't affect climate, and the initial condition is just that. It looks to me like the pio version is trying to "reconcile" the data on the two sides of the seam by averaging one side with the opposite sign of the other side, then assigning that value to both sides of the tripole with opposite signs. I assume the initial interpolation is introducing asymmetry in the ic along the seam, and that is showing up in the "pio" values, which should probably be uvel==0 but aren't quite.
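
Here is a sketch of the reconciliation described above, as inferred from the plots rather than from the PIO code path, using the same assumed U-fold index mapping as the earlier sketches: average each seam point with the negated value of its fold partner. If the seam already satisfies u(i) = -u(fold(i)), this operation is a no-op, which is consistent with model-written restarts staying bit-for-bit.

    # Illustrative "reconciliation" of a velocity component on j = ny_global
    # (inferred behavior, assumed fold mapping).
    import numpy as np

    def reconcile_seam(u_top):
        nx = u_top.size
        partner = (nx - 2 - np.arange(nx)) % nx  # assumed 1-based mapping i <-> nx-i
        u_new = 0.5 * (u_top - u_top[partner])   # average with the negated partner value
        # By construction u_new[partner] == -u_new, so the result is consistent
        # on the seam; if the input was already consistent, u_new == u_top.
        return u_new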

@DeniseWorthen (Contributor, Author) commented:

@apcraig Thanks for your help on this. At this point, I'm convinced the issue lies with the IC files we are using. It is puzzling why some resolutions reproduce and others do not, but fundamentally I can see that the signs are incorrect in our current ICs.

I carried out a test by generating a CICE restart after 2 hours using netcdf. Using that restart as the IC, I started both the netcdf and pio cases (as 'initial' runs). The netcdf and pio cases were B4B after 6 hours for both 1deg and 1/2 deg. This tells me that even though netcdf and PIO have different methods of reading/scattering the restart file fields, as long as the restart fields are self-consistent w/ CICE, netcdf and PIO are B4B with each other. This is consistent with your standalone tests etc.

I'm satisfied at this point, so we can close the issue.

@apcraig (Contributor) commented Feb 29, 2024

Thanks @DeniseWorthen. I am going to leave this issue open for now and update the title. I think there is one outstanding question: if an initial or restart CICE file for a tripole grid is not technically correct on the tripole seam, what should the code do, if anything?

@apcraig changed the title from "issue with restart read using PIO" to "tripole initial/restart file with inconsistent values on the tripole seam" on Feb 29, 2024