"Nearest neighbor" is different with different number of processors #261
-
Requirements
Affiliation(s)
NSF-NCAR

ESMF Version
No response

Issue
In CTSM, we use ESMF to read some input files. One particular pair of input files, specifying crop sowing window start and end dates, is at half-degree resolution. We tell ESMF to do nearest-neighbor spatial interpolation as necessary to match the simulation grid. When I do a run at 10°x15° resolution, some of the simulation gridcell centers are located exactly at the "corners" of four half-degree input pixels, meaning that those four neighbors are equally near. It doesn't matter to me which of those ESMF chooses as the "nearest neighbor," as long as it's consistent.

Unfortunately, it's not: at least one gridcell has a different "nearest neighbor" chosen depending on how many processors the job is split across. As an example, I've made a figure based on two cases that are identical in setup except that Case 1 used 128 processors and Case 2 used 64. Due to this issue, a certain crop in the gridcell centered at latitude 0, longitude 30°E gets a sowing window of days 7-82 in Case 1 and 336-46 in Case 2.

The white/gray/black in this figure represents the half-degree sowing window files. Gray pixels match the values in Case 1, black pixels match Case 2, and white pixels match neither. The red lines intersect at the center of the 10x15 CTSM gridcell.

Some notes:
Tagging @ekluzek, @billsacks, and @briandobbins, who have expressed interest in this. By the way, I think I mentioned to y'all that I was having an ERP test pass but the equivalent PEM test fail. This is why! The read of sowing windows only happens at the very beginning of the test, so changing processor count halfway through makes no difference.
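To make the tie concrete, here is a minimal Python sketch of how an exact four-way tie can resolve differently under different decompositions. It is a toy model, not ESMF's actual search. It assumes the half-degree pixel edges fall on whole and half degrees, so that (0°, 30°E) is a shared corner, and it uses planar squared distance in degrees. The block decomposition, the "later rank wins ties" merge rule, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Centers of the four half-degree pixels that share the corner at (0, 30E).
# Assumption: pixel edges lie on whole and half degrees.
SRC = [(0.25, 29.75), (0.25, 30.25), (-0.25, 29.75), (-0.25, 30.25)]
DST = (0.0, 30.0)  # center of the coarse gridcell, as (lat, lon)


def sq_dist(p, q):
    """Planar squared distance in degrees; exact for these binary-friendly values."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def split_contiguous(n_items, n_procs):
    """Contiguous block decomposition of range(n_items) over n_procs ranks."""
    base, extra = divmod(n_items, n_procs)
    blocks, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def nearest_naive(src, dst, n_procs):
    """Hypothetical parallel search: each rank keeps its first local minimum,
    and the merge lets a later rank win on ties (note the <=)."""
    best_idx, best_d = None, math.inf
    for block in split_contiguous(len(src), n_procs):
        if len(block) == 0:
            continue
        local = min(block, key=lambda i: sq_dist(src[i], dst))
        d = sq_dist(src[local], dst)
        if d <= best_d:
            best_idx, best_d = local, d
    return best_idx


def nearest_deterministic(src, dst, n_procs):
    """Same search, but ties are broken by the lowest global source index."""
    candidates = []
    for block in split_contiguous(len(src), n_procs):
        if len(block) == 0:
            continue
        candidates.append(min((sq_dist(src[i], dst), i) for i in block))
    return min(candidates)[1]


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, nearest_naive(SRC, DST, n), nearest_deterministic(SRC, DST, n))
```

In this toy model the naive search picks a different one of the four tied pixels for 1, 2, and 4 ranks, while the version that breaks ties on the global index picks the same pixel every time. Whether ESMF's tie handling resembles either variant is not something this sketch can show.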
Replies: 3 comments 8 replies
-
It looks like the specific nearest-neighbor method being used is:

```fortran
else if (trim(sdat%stream(ns)%mapalgo) == 'nn') then
   call ESMF_FieldReGridStore(sdat%pstrm(ns)%field_stream, lfield_dst, &
        routehandle=sdat%pstrm(ns)%routehandle, &
        regridmethod=ESMF_REGRIDMETHOD_NEAREST_STOD, &
        dstMaskValues=(/sdat%stream(ns)%dst_mask_val/), &
        srcMaskValues=(/sdat%stream(ns)%src_mask_val/), &
        srcTermProcessing=srcTermProcessing_Value, ignoreDegenerate=.true., &
        unmappedaction=ESMF_UNMAPPEDACTION_IGNORE, rc=rc)
```

I see that there's also a
-
Thanks for all of these details, @samsrabin. It does seem like the presence of equally close points in the nearest-neighbor search, together with the statement that the choice is arbitrary in this case, is likely to be the issue.

How time-critical is this? Specifically: can it wait a few weeks until @oehmke is back from vacation?

One thing you could try, to verify that the problem is in ESMF, is to do an identical regridding using the offline ESMF_RegridWeightGen tool. ESMF does its regridding in parallel, and you can try running that tool in a batch job using the same number of processors as are being used in the CTSM runs. Let us know if you'd like to try this, and if so I could try to help you set this up if you're not clear on how.

That said, this could be a rabbit hole: if you are able to reproduce it, that will clearly point to it being an issue within ESMF, but if you're not able to reproduce the problem in ESMF_RegridWeightGen, there could be questions of whether it's just hard to reproduce there for some reason or if the problem is truly elsewhere. So I understand if you don't want to go there right now, or if you'd like to turn it over to the ESMF team at this point.
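As a rough sketch of what that offline check could look like, the Python helper below builds one ESMF_RegridWeightGen command per processor count and compares the resulting source choices. The file names, the `mpiexec -n` launcher, and both function names are placeholders, not anything from the CTSM setup. The `-s`, `-d`, `-w`, and `-m neareststod` options are the tool's source, destination, weight-file, and method options. Depending on the ESMF version and file formats, additional options (for example, file-type flags) may be needed, and the batch system may require a different launcher.

```python
import shlex


def regrid_weightgen_command(src_file, dst_file, weight_file, n_procs,
                             launcher="mpiexec"):
    """Build an argument list for a parallel ESMF_RegridWeightGen run.

    The launcher and its "-n" flag are assumptions; adjust for your system.
    """
    return [
        launcher, "-n", str(n_procs),
        "ESMF_RegridWeightGen",
        "-s", src_file,        # source grid/mesh file
        "-d", dst_file,        # destination grid/mesh file
        "-w", weight_file,     # output weight file
        "-m", "neareststod",   # nearest source-to-destination
    ]


def differing_destinations(rows_a, cols_a, rows_b, cols_b):
    """Destination indices whose chosen source index differs between two runs.

    rows_* are destination indices and cols_* the matching source indices,
    e.g. as read from the "row" and "col" variables of each weight file.
    Destinations present in only one run are ignored.
    """
    map_a = dict(zip(rows_a, cols_a))
    map_b = dict(zip(rows_b, cols_b))
    return sorted(d for d in map_a.keys() & map_b.keys()
                  if map_a[d] != map_b[d])


if __name__ == "__main__":
    # Placeholder file names; substitute the real source file and mesh.
    for n in (128, 64):
        cmd = regrid_weightgen_command(
            "sowing_windows_src.nc", "dst_mesh.nc", f"weights_np{n}.nc", n)
        print(shlex.join(cmd))
```

If the two weight files map any destination to different source cells, `differing_destinations` would list those destination indices, which would point to the tie being resolved differently at the two processor counts.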
-
Closing this discussion; we have opened #276 to fix the underlying issue.
I'm not 100% sure, but I think the destination grid / mesh file should be the same as the mesh_lnd file specified in nuopc.runconfig. For example, for a 10x15 run that I have sitting around, this is share/meshes/10x15_nomask_c110308_ESMFmesh.nc, though you should confirm that.
You could also print the arguments to the call that sets up the stream stuff to verify that this is the mesh that's being used here.
Thanks for the info on the time-critical nature of this. I'm concerned that, if this is an issue in ESMF, it seems unlikely that we'll have a fix in place and a new ESMF version ready in time to meet this timeline. It makes me wonder if there might be a workaround you could do to get this working…