Operations on DataTree objects should not create duplicate coordinates on sub-trees #9475
Comments
As a starting point, it might make sense to disable most automatically ported DataTree methods, at least those that can't be implemented via |
I'm kind of surprised that the existing |
This could be because |
In the meeting just now we decided that this is a deficiency in the data model, not just in the implementation of This is important to do as otherwise almost every user will run into this problem, especially using this pattern:

dt['lat'] = ... # coord present on root
ds = dt[path].dataset
result_ds = some_operation(ds)
dt[path].dataset = result_ds # coord now present on both root and descendant!

Currently this would duplicate the

del dt[path]['lat']
'lat' in dt[path] # would return true!

(because only the duplicated coordinate on the child would have been deleted, not the original on the root, which would still be inherited).

IO

Disallowing duplicated coordinates in general would subtly affect round-tripping behaviour. It would no longer be possible to write out a netCDF/Zarr file with duplicated coordinates using Round-tripping is then broken for any file with duplicated coordinates, as reading the file would de-duplicate the coordinates, and the information wouldn't be propagated in order to re-duplicate them before writing.

Mostly this wouldn't matter - there is not much reason to define duplicated coordinates in netCDF/Zarr in the first place. It might lead to subtle trickiness for files containing CF metadata that refers to other variables in different groups using absolute paths, but that is not something that xarray promises to handle anyway, as interpreting metadata contents is not part of xarray's data model (beyond what happens during decoding). However again

All of this is a consequence of the fact that whilst very similar, the data models of netCDF, Zarr, and DataTree are not identical.

cc @owenlittlejohns @flamingbear @eni-awowale

Not sure exactly what the internal implementation of de-duplication would be (perhaps some kind of |
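To make the inheritance problem above concrete, here is a minimal sketch in plain Python. It is a toy model, not xarray: the `Node` class, its `own` dict and its `view()` method are assumptions invented for illustration. It only shows why writing an inherited view back onto a child creates a duplicate, and why deleting that duplicate leaves the coordinate visible.

```python
# Toy model (NOT xarray): each node stores its own coords and inherits its parent's.
class Node:
    def __init__(self, coords=None, parent=None):
        self.own = dict(coords or {})  # coordinates stored directly on this node
        self.parent = parent

    def __contains__(self, name):
        # A name is visible if it is stored locally or on any ancestor.
        node = self
        while node is not None:
            if name in node.own:
                return True
            node = node.parent
        return False

    def view(self):
        # Dataset-like view: inherited coords merged with the local ones.
        merged = self.parent.view() if self.parent else {}
        merged.update(self.own)
        return merged


root = Node({"lat": [10, 20]})
child = Node(parent=root)

ds = child.view()                   # the view includes the inherited "lat"
result = dict(ds, air=[1.0, 2.0])   # stand-in for some_operation(ds)
child.own = result                  # naive assignment copies "lat" onto the child

duplicated = "lat" in child.own and "lat" in root.own
del child.own["lat"]                # removes only the child's copy
still_visible = "lat" in child      # still found, because root provides it
```

In this toy model both `duplicated` and `still_visible` end up true, which mirrors the surprising `del` behaviour described in the comment.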
It should be straightforward enough to handle duplicated index coordinates. Non-identical coordinates with an index are already disallowed between parent and child nodes, because the parent and child nodes would fail the alignment check in However, there is one tricky case: duplicated coordinates without an associated index. Here's an example of how these currently manifest themselves:
Notice the duplicated coordinates For consistency with the rest of Xarray, we should try to handle this like duplicate coordinates that result from other Xarray operations like arithmetic. As noted in #9481, the current strategy requires evaluating coordinate values to see if they are equal or not in order to decide where they appear. This is attractive for DataTree as well, but unfortunately is not compatible with lazy evaluation. Instead, I propose that we switch to the equivalent of Exceptions to this rule are operations that explicitly assign a child coordinate, e.g., |
Thanks for this great write up! I think in the long run this will probably help to avoid confusion and is probably the best way forward. But from more of a DAAC archiving, service provider and data validation perspective (lol I wear a lot of hats) this makes me a little nervous. At GES DISC we use DataTree to do data validation checks against server hosted services and cloud hosted services. The part that makes me nervous is that using DataTree to open netCDF4 files would be modifying the original file by removing the duplicate coordinate variables at each node. I don't think this will actually break anything in the backend for us but there is some complexity, especially with metadata, that was mentioned. Would folks be opposed to some kind of flag that's like |
The proposed change here would make such a flag impossible - your intended result of What we could perhaps do though is add a flag to constructors / openers that would raise on encountering duplicated coordinates instead of silently de-duplicating them, e.g.

dt = open_datatree('unvalidated.nc', duplicate_coords='raise')

where the error message tells you which coordinates are duplicated, and refers you to

xarray/xarray/namedarray/core.py Line 1001 in aeaa082
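A rough sketch of what such a check could do, in plain Python rather than xarray. Everything here is hypothetical: `find_duplicate_coords`, `check_tree`, and the representation of a tree as a mapping from group path to a set of coordinate names are assumptions for illustration, and the `duplicate_coords` keyword simply mirrors the flag name floated above. It is not an existing xarray option.

```python
def find_duplicate_coords(groups):
    """Return {path: sorted names} for coords that also exist on an ancestor.

    `groups` maps a path such as "/" or "/daily" to a set of coordinate names.
    (Hypothetical representation, chosen only for this sketch.)
    """
    duplicates = {}
    for path, names in groups.items():
        if path == "/":
            continue
        parts = path.strip("/").split("/")
        # Every proper ancestor of `path`, starting from the root.
        ancestors = ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        inherited = set()
        for ancestor in ancestors:
            inherited |= set(groups.get(ancestor, ()))
        clash = set(names) & inherited
        if clash:
            duplicates[path] = sorted(clash)
    return duplicates


def check_tree(groups, duplicate_coords="drop"):
    # "raise" reports duplicates; "drop" silently removes them from the children.
    duplicates = find_duplicate_coords(groups)
    if duplicates and duplicate_coords == "raise":
        raise ValueError(f"duplicated coordinates: {duplicates}")
    return {
        path: set(names) - set(duplicates.get(path, ()))
        for path, names in groups.items()
    }


groups = {"/": {"lat", "lon"}, "/daily": {"time", "lat"}}
deduplicated = check_tree(groups)  # default: drop the child's copy of "lat"
```

Note that this compares by name only, which is exactly the simplification questioned later in the thread: a child coordinate that deliberately overrides its parent would be flagged too.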
Yes, this is a great idea! I would actually suggest such a stricter mode for The cases where we should be more lenient by default (automatically dropping conflicts) are situations where users are constructing a new DataTree from a collection of Dataset objects, which were likely created using |
If we add the flag kwarg to |
So summarising another meeting's worth of discussion on this... (including special guest @castelao) The de-duplication idea has some issues.
1. As implied above, any de-duplication should ideally occur in the data model itself, not as a special feature of
2. Comparing two coordinate variables to decide if they are de-duplicated could just be done by comparing names, as in #9510 (comment). But removing anything of the same name (and doing so in
3. If you don't compare by names you have to compare something else. You could compare the

   def mean(ds: Dataset) -> Dataset:
       return ds.mean() # this creates a shallow copy of all the variables!

   dt.map_over_subtree(mean) # this would therefore still end up with duplicated coordinates

4. You can instead identify duplicated coordinates by comparing values directly, but this implies loading the variable into memory. Mostly our coordinate variables will be backed by in-memory indexes, so we can compare those and all would be fine. So a data model of "only inherit index-backed coordinates" works quite nicely, in that you can then always cheaply do comparison of inherited coordinates to check for de-duplication, and you can distinguish overridden from duplicated coordinates.
5. The fly in that ointment is that it's possible to have coordinate variables that are not backed by indexes, and as these can still be multi-dimensional they can still lazily point to large amounts of data. If the data model is now "only inherit coordinate variables backed by indexes", then it limits the usefulness of inheritance. If we try to do inheritance of non-indexed coordinate variables, we can't use the solution from step (4). |
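The trade-off between the comparison strategies in this list can be illustrated with a small plain-Python sketch. The `Variable` class below is a stand-in invented for this example (it is not `xarray.Variable`), and `copy.copy` is used as an assumed analogue of the shallow copy that a Dataset-to-Dataset function returns.

```python
import copy


class Variable:
    # Minimal stand-in for a coordinate variable (hypothetical, not xarray's class).
    def __init__(self, dims, values):
        self.dims = dims
        self.values = values


def same_by_identity(a, b):
    # Cheap, but defeated by any operation that returns a new wrapper object.
    return a is b


def same_by_values(a, b):
    # Robust, but requires the values to be loaded into memory.
    return a.dims == b.dims and list(a.values) == list(b.values)


parent_lat = Variable(("lat",), [75.0, 72.5, 70.0])
child_lat = copy.copy(parent_lat)  # shallow copy: new object, same underlying values

identity_match = same_by_identity(parent_lat, child_lat)
value_match = same_by_values(parent_lat, child_lat)
```

Here the identity check misses the duplicate while the value check finds it, which is why value comparison is only attractive when the values are already in memory, as they are for index-backed coordinates.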
My only concern is what @eni-awowale already mentioned, removing duplicated coordinates from child-groups will break the original file data model. Overriding coordinates in child groups is also a CF conventions feature, it should at least be handled correctly via some switch/flag. |
This is a _partial_ solution to the duplicate coordinates issue from pydata#9475. Here we remove all duplicate coordinates between parent and child nodes with an index (these are already checked for equality via the alignment check). Other repeated coordinates (which we cannot automatically check for equality) are still allowed for now. We will need an alternative solution for these, as discussed in pydata#9475, but it is less obvious what the right solution is so I'm holding off on it for now.
This is option (4) from pydata#9475 (comment)
I made a few slides going through my proposed fix (only inheriting index variables): |
I realised that the
In [6]: ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()
In [7]: ds_daily = ds.resample(time="D").mean("time")
In [8]: ds_weekly = ds.resample(time="W").mean("time")
In [9]: ds_monthly = ds.resample(time="ME").mean("time")
In [12]: dt = xr.DataTree.from_dict(
    ...:     {
    ...:         "/": ds.drop_dims("time"),
    ...:         "daily": ds_daily.drop_vars(["lat", "lon"]),
    ...:         "weekly": ds_weekly.drop_vars(["lat", "lon"]),
    ...:         "monthly": ds_monthly.drop_vars(["lat", "lon"]),
    ...:     }
    ...: )
In [13]: dt
Out[13]:
<xarray.DataTree>
Group: /
│ Dimensions: (lat: 25, lon: 53)
│ Coordinates:
│ * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│ * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
├── Group: /daily
│ Dimensions: (time: 730, lat: 25, lon: 53)
│ Coordinates:
│ * time (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
│ Data variables:
│ air (time, lat, lon) float64 8MB 241.9 242.3 242.7 ... 295.9 295.5
├── Group: /weekly
│ Dimensions: (time: 105, lat: 25, lon: 53)
│ Coordinates:
│ * time (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
│ Data variables:
│ air (time, lat, lon) float64 1MB 245.3 245.2 245.0 ... 296.6 296.2
└── Group: /monthly
Dimensions: (time: 24, lat: 25, lon: 53)
Coordinates:
* time (time) datetime64[ns] 192B 2013-01-31 2013-02-28 ... 2014-12-31
Data variables:
air (time, lat, lon) float64 254kB 244.5 244.7 244.7 ... 297.7 297.7
In [14]: dt.sel(lat=75, lon=300)
Out[14]:
<xarray.DataTree>
Group: /
│ Dimensions: ()
│ Coordinates:
│ lat float32 4B 75.0
│ lon float32 4B 300.0
├── Group: /daily
│ Dimensions: (time: 730)
│ Coordinates:
│ lat float32 4B 75.0
│ lon float32 4B 300.0
│ * time (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
│ Data variables:
│ air (time) float64 6kB 242.7 245.6 244.9 249.8 ... 254.8 255.6 256.8
├── Group: /weekly
│ Dimensions: (time: 105)
│ Coordinates:
│ lat float32 4B 75.0
│ lon float32 4B 300.0
│ * time (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
│ Data variables:
│ air (time) float64 840B 247.2 251.7 256.2 261.4 ... 249.8 248.2 255.7
└── Group: /monthly
Dimensions: (time: 24)
Coordinates:
lat float32 4B 75.0
lon float32 4B 300.0
* time (time) datetime64[ns] 192B 2013-01-31 2013-02-28 ... 2014-12-31
Data variables:
air (time) float64 192B 254.0 252.8 256.9 258.7 ... 265.1 261.8 251.7
In [15]: dt.sel(lat=[75], lon=[300])
Out[15]:
<xarray.DataTree>
Group: /
│ Dimensions: (lat: 1, lon: 1)
│ Coordinates:
│ * lat (lat) float32 4B 75.0
│ * lon (lon) float32 4B 300.0
├── Group: /daily
│ Dimensions: (time: 730, lat: 1, lon: 1)
│ Coordinates:
│ * time (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
│ Data variables:
│ air (time, lat, lon) float64 6kB 242.7 245.6 244.9 ... 255.6 256.8
├── Group: /weekly
│ Dimensions: (time: 105, lat: 1, lon: 1)
│ Coordinates:
│ * time (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
│ Data variables:
│ air (time, lat, lon) float64 840B 247.2 251.7 256.2 ... 248.2 255.7
└── Group: /monthly
Dimensions: (time: 24, lat: 1, lon: 1)
Coordinates:
* time (time) datetime64[ns] 192B 2013-01-31 2013-02-28 ... 2014-12-31
Data variables:
air (time, lat, lon) float64 192B 254.0 252.8 256.9 ... 261.8 251.7

The first call In some ways this is weird, but it also kind of makes sense: dimensions are inherited, and the I wondered what people thought about encouraging these kinds of index-preserving operations as an escape hatch, especially @castelao.

EDIT: This is really the same thing as what @kmuehlbauer and @castelao were saying last meeting - that knowing that they could add length-1 dimensions to scalars to have them be inherited was at least a way to have "scalars" be inherited... |
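The difference between the two `sel` calls in the session above can be reduced to a rule of thumb, sketched here in plain Python. This is a toy model under a stated assumption: a coordinate counts as index-backed when it is one-dimensional along a dimension of its own name. The helper names `is_index_backed` and `split_coords` are invented for illustration and are not part of xarray.

```python
def is_index_backed(name, dims):
    # Toy rule: a coordinate is indexed when it is 1-D along its own name.
    return dims == (name,)


def split_coords(parent_coords):
    """Split parent coords into those children inherit and those they must copy.

    `parent_coords` maps a coordinate name to its tuple of dimension names.
    """
    inherited = {n: d for n, d in parent_coords.items() if is_index_backed(n, d)}
    copied = {n: d for n, d in parent_coords.items() if n not in inherited}
    return inherited, copied


# Like dt.sel(lat=75, lon=300): scalar coordinates, no dimensions left.
scalar_inherited, scalar_copied = split_coords({"lat": (), "lon": ()})

# Like dt.sel(lat=[75], lon=[300]): length-1 dimensions keep their indexes.
list_inherited, list_copied = split_coords({"lat": ("lat",), "lon": ("lon",)})
```

Under this rule the scalar selection leaves nothing to inherit, so `lat` and `lon` show up on every child, while the list selection keeps both on the root only, matching the two outputs shown.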
For |
I'm not sure I totally follow. You mean so that even if the result of indexing is a non-indexed coordinate (e.g. a scalar), special-case logic in isel/sel de-duplicates that result coordinate automatically? |
Yes, exactly. Coordinates defined at a higher level could automatically be excluded from the indexing result. This is similar to how |
I think that's already tracked in #8949. This would be a nice thing to get in before release, because it's definitely going to annoy people who try to use datatrees with inheritance for analysis. |
This should be closed by #9555 |
* remove too-long underline
* draft section on data alignment
* fixes
* draft section on coordinate inheritance
* various improvements
* more improvements
* link from other page
* align call include all 3 datasets
* link back to use cases
* clarification
* small improvements
* remove TODO after #9532
* add todo about #9475
* correct xr.align example call
* add links to netCDF4 documentation
* Consistent voice (Co-authored-by: Maximilian Roos)
* keep indexes in lat lon selection to dodge #9475
* unpack generator properly (Co-authored-by: Stephan Hoyer)
* ideas for next section
* briefly summarize what alignment means
* clarify that it's the data in each node that was previously unrelated
* fix incorrect indentation of code block
* display the tree with redundant coordinates again
* remove content about non-inherited coords for a follow-up PR
* remove todo
* remove todo now that aggregations are re-implemented
* remove link to (unmerged) migration guide
* remove todo about improving error message
* correct statement in data-structures docs
* fix internal link

Co-authored-by: Maximilian Roos
Co-authored-by: Stephan Hoyer
What is your issue?
This is very obviously an issue using the repr from #9470:
Instead, the result for tree * 2 should look like:

We probably also need to revisit other Dataset methods that are ported to DataTree via map_over_subtree (xref #9472). Some of these (e.g., arithmetic, aggregations) can likely easily be corrected simply by mapping over nodes with inherited=False. Others (e.g., indexing) will need more careful consideration.
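As a hedged illustration of why mapping with inherited=False avoids the duplication, here is a toy model in plain Python. The tree is a dict from path to a node with separate "coords" and "data" dicts, limited to one level of children under "/". The function `map_over_nodes` and its `inherited` keyword are assumptions modelled on the wording above, not the real `map_over_subtree` signature.

```python
def map_over_nodes(tree, func, inherited=True):
    """Apply `func` to every data variable of every node (toy model, not xarray).

    With inherited=True each child is processed through a view that also
    contains the root's coordinates, and that view is what gets stored back.
    """
    root_coords = tree["/"]["coords"]
    result = {}
    for path, node in tree.items():
        coords = dict(node["coords"])
        if inherited and path != "/":
            coords = {**root_coords, **coords}  # parent coords leak into the result
        data = {name: func(values) for name, values in node["data"].items()}
        result[path] = {"coords": coords, "data": data}
    return result


def double(values):
    return [2 * v for v in values]


tree = {
    "/": {"coords": {"x": [1, 2]}, "data": {}},
    "/child": {"coords": {}, "data": {"a": [10, 20]}},
}
with_inheritance = map_over_nodes(tree, double, inherited=True)
without_inheritance = map_over_nodes(tree, double, inherited=False)
```

In this sketch the data is doubled either way, but only the inherited=True result stores "x" on the child as well as on the root.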