Commit
* Initial commit of biosample index
* Make minimal class
* Tidy up first draft of adding biosample index
* Add beginning of logic for checking if biosample from a studyindex is in biosample index
* Make early file for merging multiple biosample indices into one
* Finish adding basic iteration of biosample index, needs debugging
* Tweak slightly
* Modified the parser to accept JSON files
* Update biosample index
* Tests and docs
* Updating tests
* Revert GWAS catalog file
* fix(biosample index): update to match pre-commit standards
* fix(biosample index): merging indices fix
* fix(biosample index): update study index qc logic
* fix(biosample index): fix missing mock_biosample_index
* chore(biosample index): change datasource name from ontologies
* fix(biosample index): add dataset doc
* fix(biosample index): change dbXrefs to xrefs
* chore (biosample index): better commenting

  Co-authored-by: Daniel Suveges <[email protected]>
* fix(biosample index): various minor tweaks to biosample index
* fix(biosample index): minor bug
* fix(biosample index): fix merge shift to method
* feat(biosample index): make biosampleName not nullable

---------

Co-authored-by: Daniel Suveges <[email protected]>
1 parent 148e26e, commit ccdb1f2
Showing 19 changed files with 1,735 additions and 2 deletions.
@@ -0,0 +1,9 @@
---
title: Biosample index
---

::: gentropy.dataset.biosample_index.BiosampleIndex

## Schema

--8<-- "assets/schemas/biosample_index.md"
docs/python_api/datasources/biosample_ontologies/_cell_ontology.md
5 changes: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
---
title: Cell Ontology
---

The [Cell Ontology](http://www.obofoundry.org/ontology/cl.html) is a structured controlled vocabulary for cell types. It is used to annotate cell types in single-cell RNA-seq data and other omics data.
@@ -0,0 +1,5 @@
---
title: Uberon
---

The [Uberon](http://uberon.github.io/) ontology is a multi-species anatomy ontology that integrates cross-species ontologies into a single ontology.
@@ -0,0 +1,5 @@
---
title: biosample_index
---

::: gentropy.biosample_index.BiosampleIndexStep
@@ -0,0 +1,83 @@
{
  "type": "struct",
  "fields": [
    {
      "name": "biosampleId",
      "type": "string",
      "nullable": false,
      "metadata": {}
    },
    {
      "name": "biosampleName",
      "type": "string",
      "nullable": false,
      "metadata": {}
    },
    {
      "name": "description",
      "type": "string",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "xrefs",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "synonyms",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "parents",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "ancestors",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "descendants",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "children",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    }
  ]
}
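As a rough illustration of what the `nullable` flags in the schema above imply, the sketch below reads a trimmed copy of the schema with the standard `json` module and reports which required fields a record lacks. The helper names `required_fields` and `missing_required` and the sample records are hypothetical and are not part of gentropy; in the commit itself the schema is consumed by Spark rather than by code like this.

```python
from __future__ import annotations

import json

# Trimmed copy of the schema above: two required fields, one optional
# scalar and one optional array. The remaining array fields are omitted.
SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {"name": "biosampleId", "type": "string", "nullable": false, "metadata": {}},
    {"name": "biosampleName", "type": "string", "nullable": false, "metadata": {}},
    {"name": "description", "type": "string", "nullable": true, "metadata": {}},
    {"name": "xrefs",
     "type": {"type": "array", "elementType": "string", "containsNull": true},
     "nullable": true, "metadata": {}}
  ]
}
"""


def required_fields(schema: dict) -> list[str]:
    """Return the names of fields declared with "nullable": false."""
    return [field["name"] for field in schema["fields"] if not field["nullable"]]


def missing_required(record: dict, schema: dict) -> list[str]:
    """Return required field names that are absent or None in the record."""
    return [name for name in required_fields(schema) if record.get(name) is None]


schema = json.loads(SCHEMA_JSON)

# Illustrative records only; the identifiers are placeholders.
complete = {"biosampleId": "CL_0000000", "biosampleName": "cell"}
incomplete = {"biosampleId": "CL_0000000", "description": "no name given"}

print(required_fields(schema))
print(missing_required(complete, schema))
print(missing_required(incomplete, schema))
```

Only `biosampleId` and `biosampleName` are required, which matches the last item in the commit message (`make biosampleName not nullable`).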
@@ -0,0 +1,34 @@
"""Step to generate biosample index dataset."""
from __future__ import annotations

from gentropy.common.session import Session
from gentropy.datasource.biosample_ontologies.utils import extract_ontology_from_json


class BiosampleIndexStep:
    """Biosample index step.
    This step generates a Biosample index dataset from the various ontology sources. Currently Cell Ontology and Uberon are supported.
    """

    def __init__(
        self,
        session: Session,
        cell_ontology_input_path: str,
        uberon_input_path: str,
        biosample_index_path: str,
    ) -> None:
        """Run Biosample index generation step.
        Args:
            session (Session): Session object.
            cell_ontology_input_path (str): Input cell ontology dataset path.
            uberon_input_path (str): Input uberon dataset path.
            biosample_index_path (str): Output biosample index dataset path.
        """
        cell_ontology_index = extract_ontology_from_json(cell_ontology_input_path, session.spark)
        uberon_index = extract_ontology_from_json(uberon_input_path, session.spark)

        biosample_index = cell_ontology_index.merge_indices([uberon_index])

        biosample_index.df.write.mode(session.write_mode).parquet(biosample_index_path)
@@ -0,0 +1,72 @@
"""Biosample index dataset."""

from __future__ import annotations

from dataclasses import dataclass
from functools import reduce
from typing import TYPE_CHECKING

import pyspark.sql.functions as f
from pyspark.sql import DataFrame
from pyspark.sql.types import ArrayType, StringType

from gentropy.common.schemas import parse_spark_schema
from gentropy.dataset.dataset import Dataset

if TYPE_CHECKING:
    from pyspark.sql.types import StructType


@dataclass
class BiosampleIndex(Dataset):
    """Biosample index dataset.
    A Biosample index dataset captures the metadata of the biosamples (e.g. tissues, cell types, cell lines, etc) such as alternate names and relationships with other biosamples.
    """

    @classmethod
    def get_schema(cls: type[BiosampleIndex]) -> StructType:
        """Provide the schema for the BiosampleIndex dataset.
        Returns:
            StructType: The schema of the BiosampleIndex dataset.
        """
        return parse_spark_schema("biosample_index.json")

    def merge_indices(
        self: BiosampleIndex,
        biosample_indices: list[BiosampleIndex]
    ) -> BiosampleIndex:
        """Merge a list of biosample indices into a single biosample index.
        Where single-valued fields conflict, the first non-null value is taken. For list-valued fields, the distinct union of all values is taken.
        Args:
            biosample_indices (list[BiosampleIndex]): Biosample indices to merge.
        Returns:
            BiosampleIndex: Merged biosample index.
        """
        # Extract the DataFrames from the BiosampleIndex objects
        biosample_dfs = [biosample_index.df for biosample_index in biosample_indices] + [self.df]

        # Merge the DataFrames
        merged_df = reduce(DataFrame.unionAll, biosample_dfs)

        # Determine aggregation functions for each column
        # Currently this will take the first value for single values and merge lists for list values
        agg_funcs = []
        for field in merged_df.schema.fields:
            if field.name != "biosampleId":  # Skip the grouping column
                if field.dataType == ArrayType(StringType()):
                    agg_funcs.append(f.array_distinct(f.flatten(f.collect_list(field.name))).alias(field.name))
                else:
                    agg_funcs.append(f.first(f.col(field.name), ignorenulls=True).alias(field.name))

        # Perform aggregation
        aggregated_df = merged_df.groupBy("biosampleId").agg(*agg_funcs)

        return BiosampleIndex(
            _df=aggregated_df,
            _schema=BiosampleIndex.get_schema()
        )
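The merge rule in `merge_indices` (first non-null value for single-valued fields, distinct union for list-valued fields, grouped by `biosampleId`) can be sketched in plain Python without Spark. This is only an analogue for reasoning about the behaviour, not the code from the commit: `merge_records` and the sample rows are hypothetical, and "first" here means first in list order, whereas Spark's `first` after a `groupBy` depends on row order, which is not guaranteed to be deterministic.

```python
from __future__ import annotations


def merge_records(records: list[dict], key: str = "biosampleId") -> list[dict]:
    """Merge records that share the same key.

    Scalars keep the first non-None value seen in input order.
    Lists become an order-preserving union of distinct items.
    """
    merged: dict[str, dict] = {}
    for record in records:
        target = merged.setdefault(record[key], {key: record[key]})
        for name, value in record.items():
            if name == key or value is None:
                continue  # nulls never overwrite or contribute anything
            if isinstance(value, list):
                combined = target.setdefault(name, [])
                for item in value:
                    if item not in combined:
                        combined.append(item)
            elif target.get(name) is None:
                target[name] = value
    return list(merged.values())


# Placeholder rows standing in for two ontology sources.
rows = [
    {"biosampleId": "X_1", "biosampleName": "first name",
     "description": None, "synonyms": ["a", "b"]},
    {"biosampleId": "X_1", "biosampleName": "second name",
     "description": "from second source", "synonyms": ["b", "c"]},
    {"biosampleId": "X_2", "biosampleName": "other",
     "description": None, "synonyms": None},
]

result = merge_records(rows)
for row in result:
    print(row)
```

One difference from the Spark version: a field that is null in every input row is simply absent from the merged dict here, while the Spark aggregation would still emit the column.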
@@ -0,0 +1,3 @@
"""Biosample index data source."""

from __future__ import annotations