
speed up loading of namespaces: return shallow copy in build_const_args #1103

Merged: 2 commits merged into hdmf-dev:dev on Aug 19, 2024

Conversation

magland (Contributor) commented Apr 22, 2024

Motivation and description

I am trying to speed up the loading of namespaces in pynwb; on initial load it can take up to 6 seconds. While tracing through the code to find the cause of the slowness, I came across a deepcopy in build_const_args, a low-level function that gets called many times during namespace loading. I replaced it with a shallow copy and noticed a significant improvement in load time.

IMPORTANT: I am not familiar enough with the code to know whether this change is going to break anything.

This is one of two PRs I am submitting to try and speed things up.
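The core of the change can be illustrated with a minimal sketch. The real function is hdmf's build_const_args; the before/after bodies and the spec dictionary below are illustrative, not the actual hdmf code:

```python
from copy import deepcopy

def build_const_args_before(spec_dict):
    # Before the PR: deep copy duplicates every nested dict/list on each call.
    return deepcopy(spec_dict)

def build_const_args_after(spec_dict):
    # After the PR: shallow copy duplicates only the top-level mapping;
    # nested structures are shared with the original.
    return dict(spec_dict)

spec = {'name': 'x', 'dtype': {'target_type': 'Y', 'reftype': 'object'}}
deep = build_const_args_before(spec)
shallow = build_const_args_after(spec)

assert deep['dtype'] is not spec['dtype']   # deep copy: independent nested dict
assert shallow['dtype'] is spec['dtype']    # shallow copy: shared nested dict
```

The speedup comes from skipping the recursive duplication of nested structures, which deepcopy performs on every call.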

How to test the behavior?

Run this script three times: twice before the change and once after. The first run downloads the needed data and saves the loaded file segments to a cache directory, so the second and third runs do not include the download time. On my machine, loading takes around 4 sec before the change and around 1.5 sec after the change.

import time
import remfile
import pynwb
import h5py


def example_slow_load_namespace():
    # https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/c04f6b30-82bf-40e1-9210-34f0bcd8be24/download/
    h5_url = 'https://api.dandiarchive.org/api/assets/c04f6b30-82bf-40e1-9210-34f0bcd8be24/download/'
    disk_cache = remfile.DiskCache('test_cache')
    remf = remfile.File(h5_url, disk_cache=disk_cache)
    timer = time.time()
    with h5py.File(remf, 'r') as h5f:
        with pynwb.NWBHDF5IO(file=h5f, mode='r', load_namespaces=True) as io:
            nwbfile = io.read()
            print(nwbfile)
    elapsed = time.time() - timer
    print('Elapsed time:', elapsed)


if __name__ == '__main__':
    example_slow_load_namespace()

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

@oruebel @rly


codecov bot commented Apr 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.88%. Comparing base (b0f068e) to head (da2c1d3).
Report is 3 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1103      +/-   ##
==========================================
- Coverage   88.88%   88.88%   -0.01%     
==========================================
  Files          45       45              
  Lines        9836     9835       -1     
  Branches     2795     2795              
==========================================
- Hits         8743     8742       -1     
  Misses        776      776              
  Partials      317      317              


mavaylon1 (Contributor) commented Apr 25, 2024

The change to a shallow copy should be fine. The only reason I would suspect a deepcopy is needed is if we wanted to modify an independent copy.
We do a modification here:

@classmethod
def build_const_args(cls, spec_dict):
    ''' Build constructor arguments for this Spec class from a dictionary '''
    ret = super().build_const_args(spec_dict)
    if isinstance(ret['dtype'], dict):
        ret['dtype'] = RefSpec.build_spec(ret['dtype'])
    return ret

(lines 276–282)

And also in namespace.py

if parent_cls.def_key() in spec_dict:
    spec_dict[spec_cls.def_key()] = spec_dict.pop(parent_cls.def_key())
if parent_cls.inc_key() in spec_dict:
    spec_dict[spec_cls.inc_key()] = spec_dict.pop(parent_cls.inc_key())

I would need to dive in deeper to confirm that the order in which we call these methods does not conflict with using a shallow copy. I'll tackle this next week when I am back.
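For the two snippets quoted above, a shallow copy is still protective as long as the modifications only rebind or rename top-level keys. A small sketch (the dict contents and the renamed key are illustrative placeholders, not real spec fields):

```python
original = {'doc': 'a spec', 'dtype': {'target_type': 'Y'}}
ret = dict(original)  # shallow copy, as in the PR

# Rebinding a top-level key on the copy (analogous to
# ret['dtype'] = RefSpec.build_spec(ret['dtype'])) does not
# touch the original mapping:
ret['dtype'] = 'RefSpec placeholder'
assert original['dtype'] == {'target_type': 'Y'}

# A pop-and-reassign of top-level keys (analogous to the namespace.py
# snippet), performed on the copy, also leaves the original intact:
ret['renamed_key'] = ret.pop('doc')
assert 'doc' in original and 'renamed_key' not in original
```

Only in-place mutation of the *nested* objects would behave differently under a shallow copy.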

oruebel (Contributor) commented Apr 25, 2024

> The only reason I would suspect a deepcopy is needed is if we wanted to modify an independent copy.

I think this will require careful testing. We should check if/where the spec object is actually being modified, and why. If the spec is being modified downstream, then I'd suspect this could lead to issues when reading multiple files: you could get undesirable side-effects where the spec is modified when reading file A, and then when reading file B it would see the modifications made when reading A. I'm not sure whether that is actually the case or whether using a deepcopy was just done to be extra careful.
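The cross-file side-effect described here is exactly what a shallow copy does not prevent when nested structures are mutated in place. A sketch of the failure mode, with illustrative dict contents standing in for a cached spec:

```python
cached_spec = {'attributes': [{'name': 'help'}]}

copy_for_file_a = dict(cached_spec)  # shallow copy taken when reading file A
copy_for_file_b = dict(cached_spec)  # shallow copy taken when reading file B

# If reading file A mutates a *nested* structure in place...
copy_for_file_a['attributes'].append({'name': 'unit'})

# ...the change is visible through file B's copy, because both copies
# reference the same inner list:
assert len(copy_for_file_b['attributes']) == 2
```

Whether this matters in practice depends on whether the real code ever mutates nested spec contents after the copy, which is the question being raised.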

mavaylon1 (Contributor) replied:

> The only reason I would suspect a deepcopy is needed is if we wanted to modify an independent copy.
>
> I think this will require careful testing. We should check if/where the spec object is actually being modified, and why. If the spec is being modified downstream, then I'd suspect this could lead to issues when reading multiple files: you could get undesirable side-effects where the spec is modified when reading file A, and then when reading file B it would see the modifications made when reading A. I'm not sure whether that is actually the case or whether using a deepcopy was just done to be extra careful.

Agreed.

sneakers-the-rat commented:
> The only reason I would suspect a deepcopy is needed is if we wanted to modify an independent copy.
>
> I think this will require careful testing. We should check if/where the spec object is actually being modified, and why. If the spec is being modified downstream, then I'd suspect this could lead to issues when reading multiple files: you could get undesirable side-effects where the spec is modified when reading file A, and then when reading file B it would see the modifications made when reading A. I'm not sure whether that is actually the case or whether using a deepcopy was just done to be extra careful.

I checked this out over here: #1152 (comment)

tl;dr: the deepcopy doesn't protect against mutation anyway, because of when it is called and what calls it; the main thing the deepcopy seems to be doing is giving derived objects a new id.
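The "new id" observation can be shown directly: the observable difference between the two copies for nested objects is fresh identity versus shared identity. A minimal sketch with an illustrative dict:

```python
from copy import copy, deepcopy

d = {'inner': {'dtype': 'int'}}

assert deepcopy(d)['inner'] is not d['inner']  # deepcopy: fresh object, new id()
assert copy(d)['inner'] is d['inner']          # shallow copy: same object, same id()
```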

@mavaylon1 mavaylon1 merged commit 875712b into hdmf-dev:dev Aug 19, 2024
29 checks passed
Labels
category: enhancement (improvements of code or code behavior) · topic: performance