[Docs]: Extra unsupplied attrs in VoltageClampStimulusSeries, merge inheritance behavior needs more documentation #1954

sneakers-the-rat opened this issue Sep 3, 2024 · 8 comments

@sneakers-the-rat

What happened?

I've been validating nwb_linkml & nwb_models against live and synthetic data, and it looks like VoltageClampStimulusSeries adds extra attributes into the data dataset, and the validator doesn't catch it.

In core version 2.7.0, VoltageClampStimulusSeries.data is:

  - name: data
    doc: Stimulus voltage applied.
    attributes:
    - name: unit
      dtype: text
      value: volts
      doc: Base unit of measurement for working with the data. which is fixed to 'volts'.
        Actual stored values are not necessarily stored in these units. To access the data in these units,
        multiply 'data' by 'conversion' and add 'offset'.

and yet the data dataset in my generated file has conversion=1.0, offset=0.0, and resolution=-1.0 on it.

This appears to come from how the PatchClampSeries constructor (incorrectly, I think) specifies that data can be a TimeSeries:

"type": ("array_data", "data", TimeSeries),

despite no such neurodata_type_inc existing in the schema.

Unless it's the case that with neurodata_type_inc one is supposed to recursively merge all attributes of the datasets and groups defined in the parent into the inheriting class, rather than overriding them, which would be... interesting.

My MWE below is literally just following the docs example, so not too thrilling

Steps to Reproduce

from pynwb import NWBFile, NWBHDF5IO
from pynwb.icephys import VoltageClampStimulusSeries
from datetime import datetime
import h5py

nwbfile = NWBFile(session_description='', identifier='', session_start_time=datetime.now())
device = nwbfile.create_device(name="Heka ITC-1600")
electrode = nwbfile.create_icephys_electrode(
    name="elec0", description="a mock intracellular electrode", device=device
)

stimulus_template = VoltageClampStimulusSeries(
    name="ccst",
    data=[0, 1, 2, 3, 4],
    starting_time=0.0,
    rate=10e3,
    electrode=electrode,
    gain=0.02,
)
nwbfile.add_stimulus_template(stimulus_template)

with NWBHDF5IO('example.nwb', 'w') as io:
    io.write(nwbfile)

>>> h5f = h5py.File('example.nwb', 'r')

>>> print(dict(h5f.get('/stimulus/templates/ccst/data').attrs.items()))
{'conversion': 1.0, 'offset': 0.0, 'resolution': -1.0, 'unit': 'volts'}

Traceback

No response

Operating System

macOS

Python Executable

Python

Python Version

3.11

Package Versions

pynwb==2.8.1
hdmf==3.14.3
h5py==3.11.0


@sneakers-the-rat

sneakers-the-rat commented Sep 3, 2024

I can keep checking for these mismatches if you want, since my models are literal transcriptions of the schema, but I also don't want to generate too much noise or be annoying, so just lmk.

Edit: I mean this in the sense of "I need to keep doing close reads of the schema because the models keep refusing to instantiate," not "the models are perfect." Like, "while I'm down here in the schema store, do y'all want me to get anything?" lol

@stephprince

@sneakers-the-rat I believe it is the case that neurodata_type_inc indicates that the VoltageClampStimulusSeries should inherit the specification for all attributes in datasets and groups of TimeSeries. The additional information specified in VoltageClampStimulusSeries.data is meant to refine the data specification from the base TimeSeries.data spec (in this case restricting the unit to volts). So it is expected that VoltageClampStimulusSeries.data would include the conversion, offset, and resolution attributes since it inherits these attributes from TimeSeries.
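
For illustration, a rough sketch of the merged specification this reading implies for VoltageClampStimulusSeries.data, written out as a Python dict (attribute names and defaults taken from the TimeSeries.data spec in core 2.7.0; illustrative only, not the literal resolved spec object):

merged_data_spec = {
    "name": "data",
    "doc": "Stimulus voltage applied.",
    "attributes": [
        # refined in VoltageClampStimulusSeries: unit is fixed to 'volts'
        {"name": "unit", "dtype": "text", "value": "volts"},
        # inherited unchanged from TimeSeries.data
        {"name": "conversion", "dtype": "float32", "default_value": 1.0},
        {"name": "offset", "dtype": "float32", "default_value": 0.0},
        {"name": "resolution", "dtype": "float32", "default_value": -1.0},
    ],
}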

For the section of code you highlighted in icephys.py where the PatchClampSeries constructor can accept a TimeSeries type for the data argument, I think TimeSeries is a valid input for PatchClampSeries.data because of the option to have linked TimeSeries. See the TimeSeries constructor here:

pynwb/src/pynwb/base.py, lines 171 to 172 in 2d00afe:

if isinstance(data, TimeSeries):
    data.__add_link('data_link', self)
Here, if the data argument is a TimeSeries, that TimeSeries will be linked to, so the same data can be reused in multiple locations in the file.
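
A minimal usage sketch of that reuse pattern (illustrative names; assuming a regularly sampled TimeSeries):

from pynwb import TimeSeries

ts_a = TimeSeries(name="source", data=[0., 1., 2.], unit="volts",
                  starting_time=0.0, rate=10e3)
# passing ts_a itself as `data` links to its data rather than copying it
ts_b = TimeSeries(name="reuse", data=ts_a, unit="volts",
                  starting_time=0.0, rate=10e3)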

@sneakers-the-rat

Ooh boy, OK, that's what I was afraid of. So just to be clear: if I do this:

groups:
- neurodata_type_def: MyData
  datasets:
  - name: data
    attributes:
    - name: parent_attribute
      dtype: float32
      default_value: 1.0
- neurodata_type_def: MyChild
  neurodata_type_inc: MyData
  datasets:
  - name: data
    attributes:
    - name: parent_attribute
      dtype: text

then we would expect all unset attributes to merge down into the child class, and so the default_value of MyChild.data.parent_attribute here should be 1.0?
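
Rendered as a merged dict, I would expect that reading to produce something like this (purely illustrative):

rolled_down_child_data = {
    "name": "data",
    "attributes": [
        {
            "name": "parent_attribute",
            "dtype": "text",        # overridden by MyChild
            "default_value": 1.0,   # rolled down from MyData under this reading
        },
    ],
}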

That is not what I would have expected: I expect inheritance from the parent class, but I did not expect it to also recursively merge all properties, so that even if I override a dataset, group, or attribute, the rest of the parent's values come along with it. The docs could be clearer on this, and I am not sure this happens uniformly throughout the schema, but I'll check.


Re: being able to use a TimeSeries for TimeSeries.data, that's even more unexpected, since TimeSeries.data is a dataset and TimeSeries is a group, and I was unaware of a mechanism in the schema for references like that; I thought a reference dtype or a link was the only way to do it. Is there any documentation in the schema for that behavior? When can I substitute a group for a dataset: is it just TimeSeries, or is this a regular mechanism used throughout NWB?

@oruebel

oruebel commented Sep 4, 2024

That is not what I would have expected

The main reason is that otherwise this would break compliance with the base type. I.e., if TimeSeries.data requires that a resolution attribute exists, then we cannot simply remove it in a derived type; otherwise the derived type really is no longer a TimeSeries. The behavior is akin to inheritance in Python, C++, etc., where we can refine a class through inheritance but we cannot alter the base type. Here, if TimeSeries.data says it can be of any data type, then a derived type can further restrict that type to, e.g., float. However, we cannot make a type more general or remove required fields through inheritance.
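
As a loose Python analogy of what refinement can and cannot do (illustrative names only, not real pynwb classes):

from typing import Any, Sequence

class TimeSeriesLike:
    data: Sequence[Any]        # base spec: any element dtype is allowed
    resolution: float = -1.0   # part of the base contract

class VoltageClampLike(TimeSeriesLike):
    data: Sequence[float]      # allowed: narrows the element dtype
    # not allowed: removing `resolution` or widening `data` back to Any,
    # because instances must still satisfy the TimeSeriesLike contract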

@sneakers-the-rat

Yes yes, inheritance is to be expected from the docs; I sort of assumed monotonicity (though that could be made more explicit in the docs).

The thing that is surprising to me here is that the inheritance is recursive through overrides.

So I would imagine it working analogously to this:

from dataclasses import dataclass

@dataclass
class TimeSeries:
    data: 'Data'
    timestamps: list[float]
    # and the rest of the attrs/datasets/groups...

    @dataclass
    class Data:
        unit: str
        conversion: float = 1.0
        offset: float = 0.0
        resolution: float = -1.0

@dataclass
class VoltageClampStimulusSeries(TimeSeries):
    data: 'Data'
    # we have the rest of the top-level fields like timestamps here from inheritance

    @dataclass
    class Data:
        unit: str = "volts"
        # because `Data` is defined locally within TimeSeries and VoltageClampStimulusSeries,
        # we don't inherit `conversion`, `offset`, or `resolution`

Here Data is defined within the classes to match the structure of the schema, but it would be the same if it weren't. By declaring data on the child class, I imagined it would override the whole data description of the parent class, rather than being merged with it. I see the note in the docs that says the current neurodata_type_inc syntax is derived from a merge keyword, but I don't think the details of that merging made it through into the docs.

This part is really no problem for me, because I have models for all the schema language elements, and the adapter classes already have walk methods for iterating through schema contents, so I could easily add a roll_down method to the NamespacesAdapter, similar to the way it resolves imports between schemas: it would just be deep-merging two dicts and then casting them back to the pydantic models.

My question there is whether I should do that with everything in the parent model: should every field in every attribute, dataset, and group merge downwards recursively unless that individual field is overridden/refined by the child model?
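
For concreteness, the kind of deep merge I have in mind is roughly this (a minimal sketch over plain dicts, not the actual adapter code; how to match items inside lists of datasets/attributes is a separate question):

def roll_down(parent: dict, child: dict) -> dict:
    """Recursively merge a parent spec dict down into a child spec dict.

    Values the child sets explicitly win; anything the child leaves unset
    (missing or None) is filled in from the parent.
    """
    merged = dict(parent)
    for key, child_value in child.items():
        parent_value = merged.get(key)
        if isinstance(child_value, dict) and isinstance(parent_value, dict):
            merged[key] = roll_down(parent_value, child_value)
        elif child_value is not None:
            merged[key] = child_value
    return merged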

Can I draft some language for the docs that clarifies this? I would like to help clear things up for the next person :) being the dogfood eater who finds the places that aren't obvious externally but might be obvious internally.


The other behavior, TimeSeries being an allowable value for TimeSeries.data, is much more surprising to me. If I'm reading the above comment correctly, this is what I'm understanding:

from dataclasses import dataclass
from typing import Union

@dataclass
class TimeSeriesData:
    ...  # the fields from the TimeSeries.data dataset

@dataclass
class TimeSeries:
    data: Union[TimeSeriesData, "TimeSeries"]

I can see something like a subclass being able to refine the dtype of TimeSeriesData to dtype: {target_type: TimeSeries, reftype: object}, analogously to TimeSeriesReferenceVectorData, but that would be an array of references to TimeSeries rather than another TimeSeries instance taking the place of TimeSeries.data. If that's correct, it would seem to be an out-of-schema special-case behavior? Unless I am missing something, which is always possible and usually likely.

Or is it the case that it just means something like this:

ts_a = TimeSeries(...)
ts_b = TimeSeries(data=ts_a.data)

which would be fine and wouldn't require any changes for me, since I read by doing a reverse topological sort on the dependency graph of the datasets and resolving hard links, so ts_b.data == ts_a.data would be resolved automatically.

If it's the latter, then that's fine, just an optimization that's within the schema, but let me know if it's the former, because I'll have to inspect all the signatures in pynwb and add all those extra ranges to my models.

@sneakers-the-rat

Implementing this now, and I'm struggling a bit: some properties seem to be recursively rolled down and merged, and others don't.

For example, TimeSeriesReferenceVectorData is a subclass of VectorData. VectorData has a required attribute, description, without a default. TimeSeriesReferenceVectorData does not override description in the schema, but pynwb provides a default for it:

'default': "Column storing references to a TimeSeries (rows). For each TimeSeries this "
with the default taken from the doc value. It seems like it only rolls down the data field (from get_docval) but provides its own override of the default manually.
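
To make the distinction concrete, an API-side default looks roughly like this (a simplified sketch using hdmf's docval, not the actual TimeSeriesReferenceVectorData code):

from hdmf.utils import docval

class ExampleColumn:
    @docval({'name': 'description', 'type': str,
             'doc': 'a description of what is in this column',
             # default supplied by the API, even though the schema leaves
             # `description` required and without a default
             'default': 'some API-provided description'})
    def __init__(self, **kwargs):
        self.description = kwargs['description']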

When they are in conflict, should I prefer the API's implementation or the schema?

sneakers-the-rat changed the title from "[Bug]: Extra unsupplied attrs in VoltageClampStimulusSeries" to "[Docs]: Extra unsupplied attrs in VoltageClampStimulusSeries, merge inheritance behavior needs more documentation" on Sep 14, 2024
@sneakers-the-rat

sneakers-the-rat commented Sep 20, 2024

Another example where I am actually not sure what the expected behavior should be: when recursively rolling down inheritance, I am assuming that when a property is not set in the child, it should receive the last-set value from its parent(s).

The quantity property is a tricky case, and I guess I have questions about defaults generally. The docs say:

The default value is quantity=1. If name is defined, quantity may not be >1.

TimeSeries.timestamps has quantity '?'. SpikeEventSeries overrides timestamps with this comment:

Unlike for TimeSeries, timestamps are required for SpikeEventSeries and are thus re-specified here.

and pynwb indeed has SpikeEventSeries.timestamps as required.

The redefinition of timestamps in SpikeEventSeries does not set an explicit quantity=1; it just leaves quantity blank, so it gets the default quantity=1.

When rolling down, though, it would receive a value for quantity from the parent class ('?'), so my baseline expectation would be that one would need to set the value explicitly to override it.

I think it would be fine to have default values bind more tightly on the child classes, so that defaults override values from the parents, but then what about the other properties that have defaults? Ctrl+F'ing through the documentation, these are the properties that have some notion of a default:

  • dtype (default=None)
  • dims (default=None)
  • default_value (is a default value for value)
  • default_name (is a default value for name)
  • linkable (default=True)

If the tight-binding rule for quantity defaults applied generally, then I would also expect that when a dataset is overridden to, e.g., change the dtype, the shape information would also be overridden by its default (None), so the dataset would become a scalar; but that's not the case (and it would be pretty annoying).

So, trying to come up with a description of the inheritance behavior, am I correct here: do default values that are not null take precedence over those inherited from the parent class? Or is quantity a special case, and otherwise values inherited from the parent always override the child's defaults?
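
As a concrete rendering of the two candidate rules (illustrative values only):

# TimeSeries.timestamps has quantity='?'; SpikeEventSeries.timestamps leaves
# quantity unspecified, and the schema language default is 1.
parent_quantity = "?"
child_quantity = None  # not set explicitly in the child spec

# (a) only explicitly-set child values override: the parent's '?' rolls down
quantity_if_explicit_only = child_quantity or parent_quantity   # -> "?"
# (b) defaults bind tightly: the unset child value means the language default
quantity_if_defaults_bind = child_quantity or 1                 # -> 1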

Additionally, why does the whole timestamps dataset need to be redefined in SpikeEventSeries? Given the inheritance behavior elsewhere, I would assume it would be possible to override only the quantity property and specify nothing else, and the rest would be rolled down from the parent. This need to override the whole dataset is what I initially expected at the top of the post, so is this another historical thing where inheritance used to work differently?

@sneakers-the-rat

sneakers-the-rat commented Sep 20, 2024

OK I believe this captures the inheritance behavior:

https://github.com/p2p-ld/nwb-linkml/blob/734088f18e72ffc2746de9206786a324cd0a798a/nwb_linkml/src/nwb_linkml/adapters/namespaces.py#L397-L490

applied here: https://github.com/p2p-ld/nwb-linkml/blob/734088f18e72ffc2746de9206786a324cd0a798a/nwb_linkml/src/nwb_linkml/adapters/namespaces.py#L176-L207

  • Roll down starting from the topmost ancestor.
  • For all ancestor classes, do a "complete" roll-down, where all attributes, datasets, and groups are merged from the parent into the child, such that:
    • the child recursively overrides any model properties, attributes, datasets, and groups that it explicitly sets (i.e., that aren't None),
    • for lists of datasets, groups, and attributes, the name field is used to match definitions in the parent with those in the child (see the sketch after this list), and
    • list items that don't have a name set are copied over without attempting to merge, because there is no principled way of matching them to each other; e.g., if a parent class has a neurodata_type_inc with quantity '*' (meaning it can contain any number of that type of group/dataset), it is ambiguous whether the child is overriding/refining it or adding another.
  • (Optional aesthetic/simplicity step) In the leaf class, don't do a "complete" roll-down of every value that is set in the parent but not in the child; only roll down items that the child overrides, because both LinkML and Python inheritance handle the untouched ones correctly. Each item that is overridden in the child is merged "completely", because normal Python/LinkML inheritance would not preserve properties of a nested dataset/attribute/group that are set in the parent but not in the child.
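
A sketch of that list-merging step, reusing the roll_down helper sketched earlier in the thread (illustrative only, not the actual adapter code):

def merge_named_lists(parent_items: list, child_items: list) -> list:
    """Merge lists of dataset/group/attribute specs: pair up parent and child
    items by `name` and merge the pairs recursively (child wins); items
    without a name are copied through without any attempt to pair them."""
    child_by_name = {item["name"]: item for item in child_items if item.get("name")}
    merged = []
    for parent_item in parent_items:
        name = parent_item.get("name")
        if name in child_by_name:
            merged.append(roll_down(parent_item, child_by_name.pop(name)))
        else:
            merged.append(parent_item)
    # child items that were never matched (including unnamed ones) are appended as-is
    merged.extend(item for item in child_items
                  if not item.get("name") or item["name"] in child_by_name)
    return merged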

If that seems right, I'd be happy to contribute to the docs to clarify the inheritance behavior.
