Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Micro optimize dataset.isel for speed on large datasets #9003

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

hmaarrfk
Copy link
Contributor

@hmaarrfk hmaarrfk commented May 6, 2024

This targets optimization for datasets with many "scalar" variables (that is variables without any dimensions). This can happen in the context where you have many pieces of small metadata that relate to various facts about an experimental condition.

For example, we have about 80 of these in our datasets (and I want to incrase this number)

Our datasets are quite large (On the order of 1TB uncompresed) so we often have one dimension that is in the 10's of thousands.

However, it has become quite slow to index in the dataset.

We therefore often "carefully slice out the matadata we need" prior to doing anything with our dataset, but that isn't quite possible with you want to orchestrate things with a parent application.

These optimizations are likely "minor" but considering the results of the benchmark, I think they are quite worthwhile:

Thanks for considering.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

xref: #2799
xref: #7045

@hmaarrfk hmaarrfk marked this pull request as ready for review May 6, 2024 01:58
@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented May 6, 2024

I'm happy to add benchmarks for these if you think it would help.

That said. I would love to leave that addition for future work. My time spent playing on this kind of speedup is up for the week.

xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
@dcherian dcherian added the run-benchmark Run the ASV benchmark workflow label May 6, 2024
@dcherian
Copy link
Contributor

dcherian commented May 6, 2024

Thanks. Do you see any changes in our asv benchmarks in asv_bench/?

We'd be happy to take updates for those too :)

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented May 6, 2024

Thanks. Do you see any changes in our asv benchmarks in asv_bench/?

I didn't get to running asv locally. (was focused on getting pytest + mypy working).

The speedups here are more associated with:

  1. few variables of interest in a dataset.
  2. Many variables with no dims.
  3. Slicing

I don't think this benchmark exists at quick glance. I could create one

Comment on lines 2990 to 2993
# Fastpath, skip all of this for variables with no dimensions
# Keep the result cached for future dictionary update
elif var_dims := var.dims:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Fastpath, skip all of this for variables with no dimensions
# Keep the result cached for future dictionary update
elif var_dims := var.dims:
elif var.ndim == 0:
continue
else:

Does this work

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no wait, i spoke too soon. i had a typo. oddly, it is slower...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
index ec756176..4e8c31e5 100644
--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -2987,22 +2987,20 @@ class Dataset(
             if name in index_variables:
                 var = index_variables[name]
                 dims.update(zip(var.dims, var.shape))
-            # Fastpath, skip all of this for variables with no dimensions
-            # Keep the result cached for future dictionary update
-            elif var_dims := var.dims:
+            elif var.ndim == 0:
+                continue
+            else:
                 # Large datasets with alot of metadata may have many scalars
                 # without any relevant dimensions for slicing.
                 # Pick those out quickly and avoid paying the cost below
                 # of resolving the var_indexers variables
-                if var_indexer_keys := all_keys.intersection(var_dims):
+                if var_indexer_keys := all_keys.intersection(var.dims):
                     var_indexers = {k: indexers[k] for k in var_indexer_keys}
                     var = var.isel(var_indexers)
                     if drop and var.ndim == 0 and name in coord_names:
                         coord_names.remove(name)
                         continue
-                    # Update our reference to `var_dims` after the call to isel
-                    var_dims = var.dims
-                dims.update(zip(var_dims, var.shape))
+                dims.update(zip(var.dims, var.shape))
             variables[name] = var

         return self._construct_direct(

was slower.... this is somewhat unexpected. ndim should be "instant".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me add a benchmark tonight to "show" that this is the better way explicitely, otherwise it will be too easy to undo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My conclusion is that:

  • len(tuple) seems to be pretty fast.
  • But the .shape attribute is only resolved after 4-5 different python indirections going down to a LazilyIndexedArray, MemoryCachedArray, H5BackedArray (sorry, i'm not getting the class names right), but ultimately it isn't "readily available and needs to be resolved.

My little heuristic test is that with my dataset (93 variables long)

In [16]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.ndim
    ...:
119 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [17]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.shape
    ...:
105 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [18]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.dims
    ...:
7.66 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [19]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v._dims
    ...:
3.1 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [20]: len(dataset._variables)
93

I mean, micro-optimizations are sometimes dumb. So that is why i've been breaking them out into distinct ideas when I find them, but together they can add up, especially when taken together.

So in other words, my hypothesis is that the the use of _dims is really helpful because it avoids many indirections in shape since dims is a "cached" version of the shape (where every number is replaced with a string).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(v.dims) or len(v._dims) sounds OK to me. They're both readily understandable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I better understand the xarray style,

The truthyness of tuples is not obvious enough while Len(tuple) is more obviously associated with a true/false statement

Would a comment be ok if Len(tuple) hurts performance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It not about style, but about readability and understandability.

I've read this snippet about 6 times now, but I still have to look at it closely to see what it does. The perf improvement is also sensitive to order of iteration over variables (what if you alternated between 0D and 1D variable as you iterated through?)

This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if you alternated between 0D and 1D variable as you iterated through?

You know, this is something I've thought about alot.

I'm generally not too happy with this optimization.

This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.

Ok understood. The challenge is that this PR doesn't do much on my benchmarks without #9002 and my current theory is that we are limited by calls to python methods, so I feel like even len(tuple) will slow things down.

I'll try again, but if its OK, i'm going to rebase onto #9002 until a resolution is found for those optimizations.

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented May 7, 2024

On main:

[50.00%] ··· Running (indexing.IndexingDask.time_indexing_vectorized--).                                              
[56.25%] ··· indexing.Indexing.time_indexing_basic                                                                 ok 
[56.25%] ··· ================== ===========                                                                           
                    key                                                                                               
             ------------------ -----------                                                                           
                  1scalar        173±0.7μs                                                                            
                   1slice        180±0.9μs                                                                            
               1slice-1scalar     219±1μs                                                                             
              2slicess-1scalar    301±2μs  
             ================== ===========

[62.50%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[62.50%] ··· ================== =============
                    key                       
             ------------------ -------------
                  1scalar        3.07±0.02ms 
                   1slice        3.08±0.01ms 
               1slice-1scalar    3.17±0.01ms 
              2slicess-1scalar   3.30±0.02ms 
             ================== =============

On this branch

[ 0.00%] ·· Benchmarking existing-py_home_mark_miniforge3_envs_xr_bin_python
[25.00%] ··· Running (indexing.Indexing.time_indexing_basic--)..
[75.00%] ··· indexing.Indexing.time_indexing_basic                                                                 ok
[75.00%] ··· ================== ===========
                    key                    
             ------------------ -----------
                  1scalar        172±0.9μs 
                   1slice        179±0.7μs 
               1slice-1scalar     217±1μs  
              2slicess-1scalar    299±1μs  
             ================== ===========

[100.00%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[100.00%] ··· ================== =============
                     key                      
              ------------------ -------------
                   1scalar        2.67±0.01ms 
                    1slice        2.67±0.01ms 
                1slice-1scalar    2.71±0.01ms 
               2slicess-1scalar   2.81±0.01ms 
              ================== =============

On the combined #9002 + this branch:

[ 0.00%] ·· Benchmarking existing-py_home_mark_miniforge3_envs_xr_bin_python
[25.00%] ··· Running (indexing.Indexing.time_indexing_basic--)..
[75.00%] ··· indexing.Indexing.time_indexing_basic                                                                 ok
[75.00%] ··· ================== ===========
                    key                    
             ------------------ -----------
                  1scalar        155±0.5μs 
                   1slice         146±1μs  
               1slice-1scalar     182±1μs  
              2slicess-1scalar    233±1μs  
             ================== ===========

[100.00%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[100.00%] ··· ================== =============
                     key                      
              ------------------ -------------
                   1scalar        2.67±0.01ms 
                    1slice        2.65±0.01ms 
                1slice-1scalar    2.71±0.02ms 
               2slicess-1scalar   2.77±0.01ms 
              ================== =============

@hmaarrfk hmaarrfk marked this pull request as draft May 18, 2024 14:31
@hmaarrfk hmaarrfk force-pushed the micro_datasetisel branch 2 times, most recently from 5a7c903 to 2592076 Compare June 12, 2024 02:42
@hmaarrfk
Copy link
Contributor Author

I want to make sure I have a good reason for choosing the syntax I did. So without taking dcherian's suggestions:

| Change   | Before [cb3663d0] <main>   | After [2592076b] <micro_datasetisel>   |   Ratio | Benchmark (Parameter)                                                  |
|----------|----------------------------|----------------------------------------|---------|------------------------------------------------------------------------|
| -        | 2.64±0.01ms                | 2.33±0.02ms                            |    0.88 | indexing.Indexing.time_indexing_basic_ds_large('1slice')               |
| -        | 2.66±0.02ms                | 2.34±0.03ms                            |    0.88 | indexing.IndexingDask.time_indexing_basic_ds_large('1slice')           |
| -        | 2.66±0.03ms                | 2.30±0.02ms                            |    0.87 | indexing.Indexing.time_indexing_basic_ds_large('1scalar')              |
| -        | 2.66±0.02ms                | 2.33±0.03ms                            |    0.87 | indexing.IndexingDask.time_indexing_basic_ds_large('1scalar')          |
| -        | 2.80±0.02ms                | 2.41±0.02ms                            |    0.86 | indexing.Indexing.time_indexing_basic_ds_large('2slicess-1scalar')     |
| -        | 2.75±0.02ms                | 2.36±0.02ms                            |    0.86 | indexing.IndexingDask.time_indexing_basic_ds_large('1slice-1scalar')   |
| -        | 2.83±0.01ms                | 2.41±0.02ms                            |    0.85 | indexing.IndexingDask.time_indexing_basic_ds_large('2slicess-1scalar') |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

With the len(var_dims := var.dims) we get to

| Change   | Before [cb3663d0] <main>   | After [fadd8763] <micro_datasetisel>   |   Ratio | Benchmark (Parameter)                                                  |
|----------|----------------------------|----------------------------------------|---------|------------------------------------------------------------------------|
| -        | 2.62±0.02ms                | 2.30±0.02ms                            |    0.88 | indexing.Indexing.time_indexing_basic_ds_large('1scalar')              |
| -        | 2.62±0.01ms                | 2.29±0.02ms                            |    0.88 | indexing.Indexing.time_indexing_basic_ds_large('1slice')               |
| -        | 2.65±0.03ms                | 2.32±0.03ms                            |    0.88 | indexing.IndexingDask.time_indexing_basic_ds_large('1scalar')          |
| -        | 2.63±0.01ms                | 2.31±0.01ms                            |    0.88 | indexing.IndexingDask.time_indexing_basic_ds_large('1slice')           |
| -        | 2.58±0.1ms                 | 2.24±0.04ms                            |    0.87 | indexing.Assignment.time_assignment_outer('2d')                        |
| -        | 2.69±0.01ms                | 2.34±0.04ms                            |    0.87 | indexing.Indexing.time_indexing_basic_ds_large('1slice-1scalar')       |
| -        | 2.73±0.02ms                | 2.34±0.03ms                            |    0.86 | indexing.IndexingDask.time_indexing_basic_ds_large('1slice-1scalar')   |
| -        | 2.80±0.01ms                | 2.38±0.05ms                            |    0.85 | indexing.Indexing.time_indexing_basic_ds_large('2slicess-1scalar')     |
| -        | 2.82±0.02ms                | 2.41±0.02ms                            |    0.85 | indexing.IndexingDask.time_indexing_basic_ds_large('2slicess-1scalar') |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Which seems to be the same if not better.

@hmaarrfk hmaarrfk marked this pull request as ready for review June 12, 2024 03:29
@hmaarrfk
Copy link
Contributor Author

So final numbers again for myself on my internal benchmark (I know it isn't of that much use to others, but there are the other benchmarks for that)

Main:
100%|█| 100000/100000 [00:23<00:00, 4284.12it/s]
100%|█| 100000/100000 [00:23<00:00, 4169.37it/s]

With 9003
100%|█| 100000/100000 [00:16<00:00, 6008.62it/s]
100%|█| 100000/100000 [00:16<00:00, 5947.75it/s]

A 42% increase in throughput. i'll take it!

@headtr1ck headtr1ck added the plan to merge Final call for comments label Jun 21, 2024
@dcherian
Copy link
Contributor

@hmaarrfk does the latest version see the same speedups?

@hmaarrfk
Copy link
Contributor Author

Rerunning the benchmarks. Are the test failures due to this pR???

@hmaarrfk
Copy link
Contributor Author

The improvement is not as significant, but i'm unsure if it is related to the test failures:

| Change   | Before [3fd162e4] <main>   | After [84ef5474] <micro_datasetisel>   |   Ratio | Benchmark (Parameter)                                                  |
|----------|----------------------------|----------------------------------------|---------|------------------------------------------------------------------------|
| -        | 2.90±0.03ms                | 2.63±0.01ms                            |    0.91 | indexing.IndexingDask.time_indexing_basic_ds_large('2slicess-1scalar') |

This targets optimization for datasets with many "scalar" variables
(that is variables without any dimensions). This can happen in the
context where you have many pieces of small metadata that relate to
various facts about an experimental condition.

For example, we have about 80 of these in our datasets (and I want to
incrase this number)

Our datasets are quite large (On the order of 1TB uncompresed) so we
often have one dimension that is in the 10's of thousands.

However, it has become quite slow to index in the dataset.

We therefore often "carefully slice out the matadata we need" prior to
doing anything with our dataset, but that isn't quite possible with you
want to orchestrate things with a parent application.

These optimizations are likely "minor" but considering the results of
the benchmark, I think they are quite worthwhile:

* main (as of pydata#9001) - 2.5k its/s
* With pydata#9002 - 4.2k its/s
* With this Pull Request (on top of pydata#9002) -- 6.1k its/s

Thanks for considering.
@hmaarrfk
Copy link
Contributor Author

Sorry i force pushed, let me run benchmarks again, my computer is currently in use right now, so i'll run them a little bit later.

This reverts commit f7945a3.
@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Jun 22, 2024

So on my personal benchmark:

  • Main 4k its/s
  • with "cleanups": 5kits /s
  • without "cleanups": 6k its/s

Trying to rerun the benchmark with and without cleanups.

I definitely see an improvement on my dataset (and I'm willing to "donate" it), but I just can't recreate the significant improvements that I saw on the benchmarking suite. Maybe a dependency changed in terms of performance which is now hiding the effects?

Nothing pops out in the only change i can find to the indexing benchmark: #9013

@dcherian
Copy link
Contributor

dcherian commented Jun 30, 2024

I suspect that caching shape in your previous PR has improved it so this PR makes no difference.

Can you upload a representative dataset somewhere? Or provide code to generate one pleasE?

@veenstrajelmer
Copy link
Contributor

@dcherian could you link the PR you are referring to? I would be very much interested in potential h5netcdf performance improvements (see my comments in h5netcdf/h5netcdf#195 and h5netcdf/h5netcdf#221 and Deltares/dfm_tools#484). There are many related issues and PR's open and I have lost track of the potential causes and fixes. Are you saying that this PR is redundant after another PR was already merged or is there still potential?

@hmaarrfk
Copy link
Contributor Author

I think the problem is that i never got arrond to uploading my sample dataset.

Do you have a ssample one we can add to the benchmarks? If so that might help speed things up.

@veenstrajelmer
Copy link
Contributor

veenstrajelmer commented Sep 17, 2024

I am affraid not, the datasets where I notice performance issues (time or memory wise) are all approx 3GB. Most of the smaller test datasets I use do not show these issues. What I noticed is that the performance issues are mainly happening for datasets with many variables, since these cause significant increases of accessing of sizes/dims/properties of variables. So in that light, I have tested with this code (based on an xarray test):

import numpy as np
from xarray import Dataset
import xarray as xr

nx = 80
ny = 50
ntime = 100
file_nc = f"test_x{nx}_y{ny}_time{ntime}.nc"

write_dataset = False
if write_dataset:
    ds = Dataset(
        dict(
            z1=(["y", "x"], np.random.randn(ny, nx)),
            z2=(["time", "y"], np.random.randn(ntime, ny)),
            z3=(["y", "x", "time"], np.random.randn(ny, nx, ntime)),
        ),
        dict(
            x=("x", np.linspace(0, 1.0, nx)),
            time=("time", np.linspace(0, 1.0, ntime)),
            y=range(ny),
        ),
    )
    varnum = 1500
    for i in range(varnum):
        ds[f"var_{i:04d}"] = ds["z3"]
    ds.to_netcdf(file_nc)

print("opening file")
ds_in = xr.open_dataset(file_nc, engine="h5netcdf")

I ran this code with memory_profiler with mprof run python script.py, excluding the writing of the netcdf by setting write_dataset=False but you would first have to run this code to read the file of course.

Results in this timings/usage for engine="netcdf4":
image

And these timings/usage for engine="h5netcdf":
image

This shows similar behaviour to what I documented in Deltares/dfm_tools#484, but simplified: h5netcdf consumes less memory but is significantly slower. I cannot judge if this testdata and/or results add value to this PR though.

@hmaarrfk
Copy link
Contributor Author

Lets try to keep this conversation focused on the performance of isel. your benchmarks are interesting, but unrelated to this PR.

@hmaarrfk
Copy link
Contributor Author

@dcherian please find a zip containing a file called

  • ramona_optics_metadata.nc
  • time_metadata.py

On the tag 2024.09.0

I get the following performance (

 (lscpu | grep "Model name") && python time_metadata.py
Model name:                           Intel(R) Core(TM) Ultra 7 165U
Using xarray 2024.9.0
Engine h5netcdf: 100%|█████████| 100000/100000 [00:14<00:00, 6981.62it/s]
Engine netcdf4: 100%|██████████| 100000/100000 [00:14<00:00, 6681.45it/s]

On this branch

(lscpu | grep "Model name") && python time_metadata.py
Model name:                           Intel(R) Core(TM) Ultra 7 165U
Using xarray 2024.9.1.dev19+gfac41781
Engine: h5netcdf: 100%|██████████████████████████████████| 100000/100000 [00:09<00:00, 10835.77it/s]
Engine: netcdf4: 100%|███████████████████████████████████| 100000/100000 [00:09<00:00, 10633.27it/s]

On my threadripper

$ (lscpu | grep "Model name") && python time_metadata.py
Model name:                           AMD Ryzen Threadripper 2950X 16-Core Processor
Using xarray 2024.9.1.dev0+ged0418bd.d20240918
Engine h5netcdf: 100%|███████| 100000/100000 [00:26<00:00, 3742.96it/s]
Engine netcdf4: 100%|████████| 100000/100000 [00:26<00:00, 3708.77it/s]

(lscpu | grep "Model name") && python time_metadata.py
Model name:                           AMD Ryzen Threadripper 2950X 16-Core Processor
Using xarray 2024.9.1.dev0+ged0418bd.d20240918
Engine h5netcdf: 100%|████████████████████████████████████| 100000/100000 [00:17<00:00, 5717.31it/s]
Engine netcdf4: 100%|█████████████████████████████████████| 100000/100000 [00:17<00:00, 5631.24it/s

It has:

  • 93 variables
  • 17 coords
  • 18 dims
  • 874 kiB

time_metadata.zip

I acknowledge that I am the owner of this data and I give you permission (@dcherian) to use it how you see fit.

@hmaarrfk
Copy link
Contributor Author

now that the legalize has gone out of the way, let me know what you want me to do to get it integrated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-benchmark Run the ASV benchmark workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants