Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with 'use_leap_years' #343

Closed
abouchat opened this issue Aug 7, 2019 · 36 comments
Closed

Problem with 'use_leap_years' #343

abouchat opened this issue Aug 7, 2019 · 36 comments

Comments

@abouchat
Copy link
Contributor

abouchat commented Aug 7, 2019

When running the QC test case with the 'boxrestore' option activated, I get errors in the thermo step as soon as Jan 1st 2000 is reached. This seems to be due to having the option 'use_leap_years = .true.' activated in the namelist (2000 is a leap year). If I manually put 'use_leap_years = .false.' then the case runs and there are no errors.

Maybe related to #162 ?
Might also be related to using 'ktherm = 0' (i.e. zero-layer thermodynamics) in conjunction with leap years since the errors seem to originate from the 'zerolayer_temperature' routine. See sample of errors below:

(zerolayer_temperature)Thermo iteration does not converge,
 (zerolayer_temperature)Ice thickness:  0.210543751420792
 (zerolayer_temperature)Snow thickness:  6.959153238059593E-002
 (zerolayer_temperature)dTsf, Tsf_errmax:  -325.110616024919       5.0000000000
00000E-004
 (zerolayer_temperature)Tsf:  -1982.97625185358
 (zerolayer_temperature)fsurfn:  -5880.71078384918
 (zerolayer_temperature)fcondtopn, fcondbot  -5874.50719981974       -5874.5071
9981974
   (icepack_warnings_setabort) T :file icepack_therm_0layer.F90 :line
303
(zerolayer_temperature) zerolayer_temperature: Thermo iteration does not conver
ge
(icepack_warnings_aborted) ... (thermo_vertical)
(icepack_warnings_aborted) ... (icepack_step_therm1)
(icepack_step_therm1) ice: Vertical thermo error:
(icepack_warnings_aborted) ... (icepack_step_therm1)

and so on....

@phil-blain
Copy link
Member

The use_leap_year namelist flag is only active for the option 'boxrestore', and in the test suites the 'boxrestore' option only runs for 1 or 2 days (I think) so this might not have been noticed before.

@dabail10
Copy link
Contributor

dabail10 commented Aug 7, 2019

This is very curious. I don't know why the leap year option would cause a thermodynamic error like this. Did you try turning on use_leap_years for a ktherm=1 or ktherm=2 case? I have a feeling this is just coincidental.

@dabail10
Copy link
Contributor

dabail10 commented Aug 7, 2019

Actually, one potential issue is that shortwave is not being generated correctly for the "leap days". However, I would think this is only a problem when you reach Feb 29.

@eclare108213
Copy link
Contributor

The code tends to fail with thermodynamic errors if anything is wrong with any of the forcing, so most likely having the extra day is causing some forcing field to be computed or read in incorrectly, for leap years. Could be shortwave, might be something else. Good catch.

@abouchat
Copy link
Contributor Author

abouchat commented Aug 9, 2019

I ran new cases with ktherm=1 or ktherm=2 and use_leap_years on, and the thermodynamic also fails when reaching Jan. 1st 2000 in both cases.

@dabail10
Copy link
Contributor

dabail10 commented Aug 9, 2019

Thanks for checking. In the CESM, the driver deals with leap years so we haven't seen this in CICE before. Good catch as Elizabeth said.

@eclare108213
Copy link
Contributor

@abouchat would you mind setting up a test to debug the forcing? Use -s debug for the setup, and swap out CICE_RunMod.F90 with the version that prints out various fields (including forcing), called CICE_RunMod.F90_debug (it should be in the same driver directory, and identical except for the print calls). You'll probably need to change the timestep on which printing begins, in ice_diagnostics.F90 -- it just needs to start a few timesteps before the error occurs, so that you can see what changes. If you don't have time, one of the others (maybe me) can do it. Thanks!

@abouchat
Copy link
Contributor Author

abouchat commented Aug 9, 2019

I am afraid I won't have time to complete this as my internship at ECCC ends on Aug. 13th, and I won't be able to access the output after this. However, @phil-blain mentioned he will take care of this soon!

@phil-blain
Copy link
Member

I'm looking at this now. Just setting use_leap_year=true and compiling in debug mode makes the code abort with a division by zero error :

 Forcing data year =         2005
 Atmospheric data files:
 /users/dor/afsg/phb/local/FORCING/CICE_data/forcing/gx1/COREII/MONTHLY/cldf.omi
 p.dat
 /users/dor/afsg/phb/local/FORCING/CICE_data/forcing/gx1/COREII/MONTHLY/prec.nmy
 r.dat
 /users/dor/afsg/phb/local/FORCING/CICE_data/forcing/gx1/COREII/4XDAILY/u_10.200
 5.dat
 /users/dor/afsg/phb/local/FORCING/CICE_data/forcing/gx1/COREII/4XDAILY/v_10.200
 5.dat
 /users/dor/afsg/phb/local/FORCING/CICE_data/forcing/gx1/COREII/4XDAILY/t_10.200
 5.dat
 /users/dor/afsg/phb/local/FORCING/CICE_data/forcing/gx1/COREII/4XDAILY/q_10.200
 5.dat
  Current forcing data year =         2005
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
cice               0000000001889FC5  Unknown               Unknown  Unknown
cice               0000000001887D87  Unknown               Unknown  Unknown
cice               000000000183E104  Unknown               Unknown  Unknown
cice               000000000183DF16  Unknown               Unknown  Unknown
cice               00000000017CFB36  Unknown               Unknown  Unknown
cice               00000000017D45B6  Unknown               Unknown  Unknown
libpthread.so.0    00007F5118376330  Unknown               Unknown  Unknown
cice               0000000000A70FC6  ice_forcing_mp_co        2286  ice_forcing.F90
cice               0000000000A6A6E6  ice_forcing_mp_ly        2177  ice_forcing.F90
cice               0000000000A2FBBC  ice_forcing_mp_ge         541  ice_forcing.F90
cice               0000000000437DB2  cice_initmod_mp_c         192  CICE_InitMod.F90
cice               0000000000436D6B  cice_initmod_mp_c          51  CICE_InitMod.F90
cice               0000000000436511  MAIN__                     44  CICE.F90
cice               000000000043649E  Unknown               Unknown  Unknown
libc.so.6          00007F5117FBEF45  Unknown               Unknown  Unknown
cice               00000000004363A9  Unknown               Unknown  Unknown

That is on the gx1 grid. It does not happen on gx3.
I will run the code in Totalview to investigate where this comes from.

@phil-blain
Copy link
Member

the divide by zero error comes from line 2286 in function compute_shortwave of ice_forcing.F90:

e = 1.e5*Qa(i,j)/(0.622_dbl_kind + 0.378_dbl_kind*Qa(i,j))

For some reason the denominator is exactly zero for some indices. In my case it errors at
iblk = 2
i = 32
j =21
when creating the case with -g gx1 -p 1x1.

So the values in Qa are different depending on if use_leap_years is set are not...

@eclare108213
Copy link
Contributor

That's weird. Qa would have to be negative for the denominator to be zero, so it's getting corrupted. Are you testing using data that has realistic values for 366 days in a year? If not, then it does not make sense to set use_leap_years=T, and perhaps we should add a consistency check on this with an abort. The new JRA55 data does include leap days, I believe, but probably (hopefully?) only for 2008.

@phil-blain
Copy link
Member

phil-blain commented Aug 28, 2019

The values in Qa are different because of the values of ixm, ixp which are passed to subroutine read_data at line 2134 of ice_forcing.F90, in subroutine LY_data.
The indices ixm, ixp themselves are computed using recnum, which is computed at line 2104 in LY_data using the variable yday from module ice_calendar.

      recnum = 4*int(yday) - 3 + int(real(sec,kind=dbl_kind)/sec6hr)

When use_leap_years=false, I get yday=1, but with use_leap_years=true, I get yday=362, leading to different values of recnum.
yday is set in init_calendar of ice_calendar.F90 so something is fishy in this calendar initialization... so this is definitely related to #162.

@phil-blain
Copy link
Member

phil-blain commented Aug 28, 2019

I'm using COREII forcing with the standard namelist values, so

 year_init      = 1997
 use_restart_time = .true.
fyear_init      = 2005
ycycle          = 1

maybe there is an interaction between year_init and fyear_init... I will try some other combination of options. However this happens at initialization so I think it is related to the calendar initialization

@eclare108213
Copy link
Contributor

I don't think the COREII forcing includes leap days, so it probably won't work no matter what years you put in the namelist.

@phil-blain
Copy link
Member

phil-blain commented Aug 28, 2019

I agree, but I would have thought that the problem would manifest itself later on. I investigated further and here is what I found:
the value of yday is computed first (=1) in init_calendar (called from cice_init) along with other calendar variables. These variables all get the same values if use_leap_year is set to True or False (except basis_seconds, but from what I understand I think this is normal for this variable).
Then, cice_init calls calendar(time) (line 182) and then the subroutine calendar computes new values for yday (and other values) at line 248 :

call sec2time(nyr,month,mday,basis_seconds+ttime)

After this call, with use_leap_years=false we get :

nyr = 2017 (off by 20 years from year_init, as we mentioned in #350)
month = 1
mday = 1

but with use_leap_years=true we get :

nyr = 2016
month = 12
mday = 27 (4 days difference with the above)

then yday is computed:

 yday = mday + daycal(month)   ! day of the year

So this explain why yday has different values : this comes from the subroutine sec2time which computes stuff differently when use_leap_years=True.

@phil-blain
Copy link
Member

When using use_restart_time=.false., the divide by zero is not encountered and the code seems to run ok.
Same when using year_init=2005.

I think this issue should be taken care of carefully as part of #162. I wanted to see if a quick fix was possible before the release, but it seems it is not...
I think aborting the code early in the initialization process would be safer, but what should we check for ? @eclare108213 do you think we should add this abort before the next release or we wait ?

@dabail10
Copy link
Contributor

Did you try re-running with use_restart_time = .false.? I think the problem is that it is counting steps from year_init to the date on the restart file. When leap_years is .true., this means it counts the extra leap days between 1997 and 2017. Now, that said it shouldn't matter for the forcing, because the forcing cycle repeats right? When reading the LY data, the relevant variables that it needs it to call interp_coeff. The variable maxrec is hard coded to 365x4 = 1460 as the LY data do not have leap days. The variable recnum is the time slice in the model year. So, ixm will be the first slice in the LY data it needs. ixx is the second slice and ixp is not used in this case. So, marching through the year:

recnum, ixm, ixx, ixm (leap), ixx (leap)
1, 2, 1, 2, 1
2, 3, 2, 3, 2
3, 4, 3, 4, 3
...

1459, 1460, 1459, 1460, 1459
1460, 1, 1460, 1, 1460
1461, *, *, 2, 1
1462, *, *, 3, 2
1463, *, *, 4, 3
1464, *, *, 5, 4

So, in a leap year, you are getting January 1st/2nd slices from the LY 4x daily data. Maybe not ideal, but it works. I just used the following code in NCL:

recnum = ispan(1,1464,1)
ixm = mod(recnum+1460,1460)+1
ixx = mod(recnum-1,1460)+1

Just saw the note about use_restart_time = .false. So, I guess the issue has to do with the difference between the restart time and fyear_init.
Dave

@eclare108213
Copy link
Contributor

Please add a check/abort to subroutine init_forcing_atmo. I'd check whether the forcing data is JRA55 or default, and abort if it's not. Hard-coding is not optimal, but this is a temporary fix until we can overhaul the time manager.

Alternatively, we could reset use_leap_year to false if the forcing is not JRA55, but we're moving away from less-than-transparent changes like that. Aborting is annoying but better.

@rallard77
Copy link
Contributor

rallard77 commented Sep 11, 2019 via email

@apcraig
Copy link
Contributor

apcraig commented Feb 2, 2021

I have a new time manager which should fix many of the problems, although I'm not sure we understand all of them. I'm also reviewing the ice_forcing.F90 code and also testing the older time manager myself. I'm a little surprised by a few things in the implement. In particular, that we still compute the "record number" of the input data instead of reading the time axis from the file. I understand this is probably a hold-over from binary input days, but it's far from ideal. As part of that, it's very difficult to deal with cases where the model calendar and the input data calendar are different. In the case where the input data is gregorian, but we're running noleap in the model, the skipped input data will be Dec 31 on leap years. That seems rather arbitrary. For cases where the model is gregorian and the input data is no leap, it's likely the model is just going to abort trying to read data on Dec 31 that doesn't exist. All this also has to assume the input data frequency (8x, 4x, etc) and there are no checks that the dates are consistent in any way. It's all hardwired into the code.

I'm a little reluctant to fix all this now, but I will review the implementation and add some extra checks to the JRA55 forcing where I can. I'm not going to touch the other forcing modes at this point. I think if we do improve anything, it should be the JRA55 forcing. If I think it's doable, I might go ahead and rewrite some of it. Let me know if that raises any concerns.

@eclare108213
Copy link
Contributor

LOL! @apcraig thanks for pointing this out. Now that you say it, of course I recognize that you're right, but I hadn't even considered how hard-wired the forcing code is. It indeed is a hold-over from binary input, which we aren't long past (e.g. the gx3 core forcing still isn't completely deprecated). My recommendation is to consider fixing these issues for JRA and not worry about the older forcing routines -- I'd like to remove them from the code completely at some point.

@apcraig
Copy link
Contributor

apcraig commented Feb 2, 2021

In terms of JRA55 forcing, it looks like the model reads the forcing every timestep and that it effectively does no interpolation because it reads the same record into the two input data
slots. Is this what we expect? You can see this in the code here,

     fieldname = 'airtmp'
     call ice_read_nc(ncid,recnum,fieldname,Tair_data(:,:,1,:),debug_n_d, &
          field_loc=field_loc_center, &
          field_type=field_type_scalar)
     call ice_read_nc(ncid,recnum,fieldname,Tair_data(:,:,2,:),debug_n_d, &
          field_loc=field_loc_center, &
          field_type=field_type_scalar)

where recnum is the same in both calls and this is called every timestep even if recnum is not changing. I would have expected that the JRA55 is read only when needed and that it reads two consecutive input times and interpolates between them.

One other thing, it also looks like when the data is shifting from one year to the next, the forcing is going to not interpolate from the end of the prior year, but from the end of the current year in some forcing cases.

Something else I've noticed, the JRA55 input files do have a time axis on them and it seems to be correct for the gregorian calendar. All that is good. But the first timestep on the file is basically Jan 1, 00z. The documentation in the code for the JRA55 data says

!-------------------------------------------------------------------                        
! 3-hourly data                                                                             
!                                                                                           
! Assume that the 3-hourly value is located at the end of the                               
!  3-hour period.  This is the convention for NCEP reanalysis data.                         
!  E.g. record 1 gives conditions at 3 am GMT on 1 January.                                 
!-------------------------------------------------------------------                        

That suggests to me the time axis defined on the JRA55 input files is off by 3 hours or the model is assuming the data is offset by 3 hours and it isn't. We need to make these consistent. @rallard77 @daveh150 any comments on any of these points? Thanks.

@daveh150
Copy link
Contributor

daveh150 commented Feb 2, 2021 via email

@apcraig
Copy link
Contributor

apcraig commented Feb 2, 2021

@daveh150, thanks for the clarification. I will fix this in the code. I'm working on a few other things in the same space. And I will try to address the wrap-around issue as well. For a single year, it's fine. For multiple years, you really do want to read data from two different years during the period where you're between the last timestep on one file and the first from the other.

@daveh150
Copy link
Contributor

daveh150 commented Feb 2, 2021 via email

@apcraig apcraig mentioned this issue Feb 27, 2021
16 tasks
@apcraig
Copy link
Contributor

apcraig commented Feb 27, 2021

I think we can close this soon. The JRA55 bugs have been addressed in #562. I believe the problems that @phil-blain pointed out in the initial date when switching between leap years off and on will be fixed with the new time manager. The new time manager keeps track of the date, not the time in seconds and uses the date information on restart. It could be that there are still some issues, but the JRA55 forcing should work fine with noleap or leap calendars. With noleap it'll just be skipping Dec 31 data of leap years. If someone can summarize a test for me to do with the new time manager, I can give that a try. I do think our experience and bug fixing in the longer JRA55 runs has provided a relatively robust place for testing.

@apcraig
Copy link
Contributor

apcraig commented Feb 27, 2021

I think one outstanding issue maybe unrelated to the time manager is

  • Do we have an issue with a divide by zero with Qa that we need to code around or abort on?

@apcraig
Copy link
Contributor

apcraig commented Mar 16, 2021

This should be fixed in #566

@apcraig apcraig closed this as completed Mar 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment