Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask and Zarr not loading obsp and obsm from remote s3 #951

Open
will-moore opened this issue Mar 16, 2023 · 15 comments · May be fixed by #1247
Open

Dask and Zarr not loading obsp and obsm from remote s3 #951

will-moore opened this issue Mar 16, 2023 · 15 comments · May be fixed by #1247

Comments

@will-moore
Copy link

will-moore commented Mar 16, 2023

Hi,

I'm using @ilan-gold's nice sample code at
https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread,write%7D_dispatched.html
to read remote anndata, which is working great when I'm serving data locally via http.

But when I'm using minio to serve the data, I'm not getting the obsp, obsm or uns parts of the AnnData object. See code sample below.

A UI view of the data is at https://deploy-preview-20--ome-ngff-validator.netlify.app/?source=https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/
(from ome/ome-ngff-validator#20) where I'm using the extra "keys" in e.g. https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/obsm/.zattrs to load those groups:

{
    "encoding-type": "dict",
    "encoding-version": "0.1.0",
    "keys": [
        "X_scanorama",
        "X_umap",
        "spatial"
    ]
}

Is there any way I can use those 'keys' to load obsp and obsm data?

Thanks!

from typing import Callable, Union
import dask.array as da
import zarr
from anndata import AnnData
from anndata._io.specs import IOSpec
from anndata.compat import H5Array, H5Group, ZarrArray, ZarrGroup

# ** requires anndata==0.9.0.rc1
from anndata.experimental import read_dispatched, read_elem
from zarr.storage import FSStore
from ome_zarr.io import parse_url
StorageType = Union[H5Array, H5Group, ZarrArray, ZarrGroup]


def read_remote_anndata(store: FSStore, name: str) -> AnnData:
    table_group = zarr.group(store=store, path=name)

    def callback(
        func: Callable, elem_name: str, elem: StorageType, iospec: IOSpec
    ) -> AnnData:
        if iospec.encoding_type in (
            "dataframe",
            "csr_matrix",
            "csc_matrix",
            "awkward-array",
        ):
            # Preventing recursing inside of these types
            return read_elem(elem)
        elif iospec.encoding_type == "array":
            return da.from_zarr(elem)
        else:
            return func(elem)

    adata = read_dispatched(table_group, callback=callback)
    return adata

url = "https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/"
name = "regions_table"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)

# anndata_obj AnnData object with n_obs × n_vars = 1045 × 36
#     obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
#     var: 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'


url = "http://localhost:8000/test_segment.zarr/tables/"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)

# anndata_obj AnnData object with n_obs × n_vars = 1045 × 36
#     obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
#     var: 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
#     uns: 'Cluster_colors', 'batch_colors', 'neighbors', 'spatial', 'umap'
#     obsm: 'X_scanorama', 'X_umap', 'spatial'
#     obsp: 'connectivities', 'distances'
@ilan-gold
Copy link
Contributor

If not, I should probably make that more obvious as an essential step in the tutorial. @ivirshup That is actually something I've been meaning to mention - we should be able to resolve as @will-moore does the keys of obsm etc. without consolidating metadata at a slight penalty. Should we implement this?

@will-moore
Copy link
Author

Ah, right - that's coming back to the discussion at kevinyamauchi/ome-ngff-tables-prototype#12 where we preferred consolidated metadata not to be part of the NGFF spec. cc @joshmoore

@ilan-gold
Copy link
Contributor

I mean, y'all are big users and the feature probably makes sense anyway so we probably should implement it. But the penalty is of course a bit higher since you need to open each .zarray for each column.

@will-moore
Copy link
Author

By "implement it" you mean support for "keys"?
Would this need to happen in anndata itself, or is there a way to add it in the example above?

This is code I'm adding to ome/ome-zarr-py#256 so all NGFF python users will be able to use it.

@ilan-gold
Copy link
Contributor

I am not sure @will-moore but "implement it" means using the trick you use in the browser to discover columns, whether that's in this example or in general. At the moment, we iterate over the keys of the incoming obsm object (a dict) to read it and consolidated metadata lets you have those keys.

@will-moore
Copy link
Author

will-moore commented Mar 21, 2023

I seem to have got the loading of obsm working, using the custom keys (see below), but obsp is beyond what I can get my head around just now...

I don't know if I'm getting close, or if this is really a lot more complex and we should resign ourselves to using consolidate_metadata if we want this remote access to work?

from typing import Callable, Union
import json

import dask.array as da
import zarr
from anndata import AnnData
from anndata._io.specs import IOSpec
from anndata.compat import H5Array, H5Group, ZarrArray, ZarrGroup

# ** requires anndata==0.9.0.rc1
from anndata.experimental import read_dispatched, read_elem
from zarr.storage import FSStore

from ome_zarr.io import parse_url

StorageType = Union[H5Array, H5Group, ZarrArray, ZarrGroup]

# From https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread,write%7D_dispatched.html


def read_remote_anndata(store: FSStore, name: str) -> AnnData:
    table_group = zarr.group(store=store, path=name)

    def callback(
        func: Callable, elem_name: str, elem: StorageType, iospec: IOSpec
    ) -> AnnData:
        print("el", elem_name, iospec.encoding_type, elem)
        if iospec.encoding_type in (
            "dataframe",
            "csr_matrix",
            "csc_matrix",
            "awkward-array",
        ):
            # Preventing recursing inside of these types
            return read_elem(elem)
        elif iospec.encoding_type == "array":
            return da.from_zarr(elem)
        elif iospec.encoding_type == "dict" and "obsm" in elem_name:    # or "obsp" in elem_name:
            # load .zattrs
            attrs = store.get(elem_name + "/.zattrs")
            attrs = json.loads(attrs)
            print("attrs", attrs)
            if "keys" in attrs:
                to_return = {}
                for key in attrs["keys"]:
                    print("URL", elem_name + "/" + key)
                    try:
                        arr = da.from_zarr(store, elem_name + "/" + key)
                        to_return[key] = arr
                    except:
                        print("Failed " + elem_name + "/" + key)
                        pass
                return to_return
        # not handled above, call func()
        rsp = func(elem)
        return rsp

    adata = read_dispatched(table_group, callback=callback)
    return adata

url = "https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/"
name = "regions_table"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)

@github-actions
Copy link

github-actions bot commented Jun 9, 2023

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

@flying-sheep
Copy link
Member

@ivirshup what do you think about this?

@ivirshup That is actually something I've been meaning to mention - we should be able to resolve as @will-moore does the keys of obsm etc. without consolidating metadata at a slight penalty. Should we implement this?

@ilan-gold
Copy link
Contributor

@flying-sheep I am doing this in the upcoming PR already #947

@will-moore
Copy link
Author

Thanks all for following up on this 👍

@flying-sheep
Copy link
Member

Should we mark that PR as Fixes #951 then?

@ilan-gold
Copy link
Contributor

Ah, good call. I'll add it.

@ilan-gold
Copy link
Contributor

ilan-gold commented Nov 30, 2023

Hi @will-moore as an alternative to #947 you can also try out https://github.com/scverse/anndata/tree/ig/xarray_compat (or #1247) which is a bit lighter weight, but similar with a read_backed method. This returns a normal AnnData object but with xarray.

@will-moore
Copy link
Author

Thanks for the update @ilan-gold - Unfortunately I'm not looking to include this work in ome-zarr-py now since the spec proposal isn't going ahead just now ome/ngff#64

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants