feat: drop v2g and reimplement distance features #771

Merged: 61 commits, Oct 1, 2024
Changes from 55 commits
50d98ed
refactor(L2GFeatureMatrix): remove schema validation
ireneisdoomed Sep 3, 2024
66a0f0b
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 3, 2024
e1f7c5c
refactor(FeatureFactory): reshape feature generation WIP
ireneisdoomed Sep 3, 2024
a7757ac
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Sep 3, 2024
646a810
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 4, 2024
8a70bf2
chore: set l2gfeature properties with decorator
ireneisdoomed Sep 6, 2024
c690ffc
chore(l2gfeature): make credible_set and input_dependency instance at…
ireneisdoomed Sep 6, 2024
a54e694
chore(l2gfeature): make credible_set and input_dependency instance at…
ireneisdoomed Sep 6, 2024
85a7bf4
chore(featurefactory): distanceTssMeanFeature working
ireneisdoomed Sep 6, 2024
d24de6d
refactor(l2g): improve step dependency management
ireneisdoomed Sep 9, 2024
6a3af69
feat: implement
ireneisdoomed Sep 9, 2024
09d5291
chore: fix mypy issues
ireneisdoomed Sep 9, 2024
6211d8d
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 9, 2024
5561b74
Merge branch 'il-3252' of https://github.com/opentargets/gentropy int…
ireneisdoomed Sep 9, 2024
b1f607b
feat: l2gfeaturematrix.from_features_list working
ireneisdoomed Sep 9, 2024
021e159
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 10, 2024
da20073
chore: comment out obsolete refs
ireneisdoomed Sep 10, 2024
d06c059
chore(L2GFeatureMatrix): change `mode` attribute to `with_gold_standard`
ireneisdoomed Sep 10, 2024
0a007a7
refactor(l2g): move feature matrix writing to training module
ireneisdoomed Sep 10, 2024
abfdf22
feat(L2GFeatureMatrix): accept L2GGoldStandard or StudyLocus as inputs
ireneisdoomed Sep 10, 2024
1eed6f3
feat: implement methods to build a feature matrix based on a studyloc…
ireneisdoomed Sep 10, 2024
b4a86a1
feat: coloc logic prototype
ireneisdoomed Sep 10, 2024
0b09193
feat(l2g): filter non gwas credible sets at the start of the step
ireneisdoomed Sep 11, 2024
a60095b
feat: rewrite colocalisation feature factory
ireneisdoomed Sep 13, 2024
16085ad
test: add `test_colocalisation_feature_type`
ireneisdoomed Sep 13, 2024
7ab1ff1
test(colocalisation): add test_extract_maximum_coloc_probability_per_…
ireneisdoomed Sep 13, 2024
e56e8ea
feat(L2GFeatureInputLoader): support multiple deps by passing loader …
ireneisdoomed Sep 13, 2024
b8525ad
test: add integration tests `test_build_feature_matrix`
ireneisdoomed Sep 13, 2024
ad8481e
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 13, 2024
e032bae
feat(variant_index): add `get_distance_to_gene` and deprecate `get_di…
ireneisdoomed Sep 17, 2024
6370f67
feat(variant_index): deprecate `get_most_severe_transcript_consequence`
ireneisdoomed Sep 17, 2024
fae8256
feat(variant_index): add `get_loftee` and deprecate `get_plof_v2g`
ireneisdoomed Sep 17, 2024
4ca943c
chore: reduce v2g assesments to intervals
ireneisdoomed Sep 17, 2024
36f8804
feat(feature_factory): add distance to footprint features
ireneisdoomed Sep 17, 2024
71042cb
test: refactor `test_feature_factory_return_type`
ireneisdoomed Sep 17, 2024
99477ab
feat(feature_factory): add all distance neighbourhood features
ireneisdoomed Sep 17, 2024
73e795c
chore: delete v2g
ireneisdoomed Sep 17, 2024
65a6771
feat(feature_factory): add all colocalisation neighbourhood features
ireneisdoomed Sep 18, 2024
f4f8ae0
chore: final v2g deletion
ireneisdoomed Sep 18, 2024
3fa9b55
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 18, 2024
95793c6
chore: drop config yamls
ireneisdoomed Sep 18, 2024
cb5c169
refactor: move feature classes to datasets module
ireneisdoomed Sep 18, 2024
d3498b4
docs: update feature docs
ireneisdoomed Sep 18, 2024
0d52dba
Merge branch 'il-3252' of https://github.com/opentargets/gentropy int…
ireneisdoomed Sep 18, 2024
03b11e2
fix: import
ireneisdoomed Sep 18, 2024
9149e8e
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 24, 2024
87d1877
test: add semantic `TestCommonColocalisationFeatureLogic`
ireneisdoomed Sep 24, 2024
dbc5d2e
test: add semantic `TestCommonDistanceFeatureLogic`
ireneisdoomed Sep 25, 2024
69d7112
refactor: separate features into diff modules
ireneisdoomed Sep 25, 2024
cda5b12
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 25, 2024
d0a9126
fix: documentation references
ireneisdoomed Sep 25, 2024
bb47b01
feat: implement distance to sentinel and adapt definitions
ireneisdoomed Sep 27, 2024
3f5de30
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 27, 2024
f08e432
docs: update distance class names
ireneisdoomed Sep 27, 2024
67a32c0
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 30, 2024
7215984
fix: add all variant index mandatory fields in tests
ireneisdoomed Sep 30, 2024
72ea515
fix(schema_validator): remove extra `[]` from parent prefix
ireneisdoomed Sep 30, 2024
161fa9f
Merge branch 'dev' of https://github.com/opentargets/gentropy into il…
ireneisdoomed Sep 30, 2024
087b897
fix: convert studylocusid to string in tests
ireneisdoomed Sep 30, 2024
f52b6e2
revert: revert 72ea515fb3ea9cb07c448072be2449f4ced0dab3 (it was ok)
ireneisdoomed Sep 30, 2024
fea5fa8
fix: adapt test
ireneisdoomed Sep 30, 2024
1 change: 0 additions & 1 deletion docs/howto/command_line/run_step_in_cli.md
@@ -24,7 +24,6 @@ Available options:
ukbiobank
variant_annotation
variant_index
-variant_to_gene

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
29 changes: 0 additions & 29 deletions docs/python_api/datasets/l2g_feature.md

This file was deleted.

11 changes: 11 additions & 0 deletions docs/python_api/datasets/l2g_features/_l2g_feature.md
@@ -0,0 +1,11 @@
---
title: L2G Feature
---

## Abstract Class

::: gentropy.dataset.l2g_features.l2g_feature.L2GFeature

## Schema

--8<-- "assets/schemas/l2g_feature.md"
27 changes: 27 additions & 0 deletions docs/python_api/datasets/l2g_features/colocalisation.md
@@ -0,0 +1,27 @@
---
title: From colocalisation
---

## List of features

::: gentropy.dataset.l2g_features.colocalisation.EQtlColocClppMaximumFeature
::: gentropy.dataset.l2g_features.colocalisation.PQtlColocClppMaximumFeature
::: gentropy.dataset.l2g_features.colocalisation.SQtlColocClppMaximumFeature
::: gentropy.dataset.l2g_features.colocalisation.TuQtlColocClppMaximumFeature
::: gentropy.dataset.l2g_features.colocalisation.EQtlColocH4MaximumFeature
::: gentropy.dataset.l2g_features.colocalisation.PQtlColocH4MaximumFeature
::: gentropy.dataset.l2g_features.colocalisation.SQtlColocH4MaximumFeature
::: gentropy.dataset.l2g_features.colocalisation.TuQtlColocH4MaximumFeature
::: gentropy.dataset.l2g_features.colocalisation.EQtlColocClppMaximumNeighbourhoodFeature
::: gentropy.dataset.l2g_features.colocalisation.PQtlColocClppMaximumNeighbourhoodFeature
::: gentropy.dataset.l2g_features.colocalisation.SQtlColocClppMaximumNeighbourhoodFeature
::: gentropy.dataset.l2g_features.colocalisation.TuQtlColocClppMaximumNeighbourhoodFeature
::: gentropy.dataset.l2g_features.colocalisation.EQtlColocH4MaximumNeighbourhoodFeature
::: gentropy.dataset.l2g_features.colocalisation.PQtlColocH4MaximumNeighbourhoodFeature
::: gentropy.dataset.l2g_features.colocalisation.SQtlColocH4MaximumNeighbourhoodFeature
::: gentropy.dataset.l2g_features.colocalisation.TuQtlColocH4MaximumNeighbourhoodFeature

## Common logic

::: gentropy.dataset.l2g_features.colocalisation.common_colocalisation_feature_logic
::: gentropy.dataset.l2g_features.colocalisation.common_neighbourhood_colocalisation_feature_logic
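
The `...ColocClppMaximumFeature` and `...ColocH4MaximumFeature` class names suggest that each feature keeps the highest colocalisation probability observed for a gene at a given locus. The following is a minimal plain-Python sketch of that idea, not the gentropy implementation (which operates on Spark DataFrames). The row layout, the sample values and the function name `max_coloc_per_gene` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (studyLocusId, geneId, colocalisation probability such as CLPP or H4).
rows = [
    ("sl1", "geneA", 0.20),
    ("sl1", "geneA", 0.85),
    ("sl1", "geneB", 0.40),
    ("sl2", "geneA", 0.10),
]


def max_coloc_per_gene(rows):
    """Keep the highest probability seen for each (studyLocusId, geneId) pair."""
    best = defaultdict(float)  # probabilities are non-negative, so 0.0 is a safe start
    for study_locus_id, gene_id, prob in rows:
        key = (study_locus_id, gene_id)
        best[key] = max(best[key], prob)
    return dict(best)


features = max_coloc_per_gene(rows)
```

In a real pipeline the same reduction would more likely be a group-by with a `max` aggregation per QTL type; the dictionary here only illustrates the grouping key and the aggregation.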
19 changes: 19 additions & 0 deletions docs/python_api/datasets/l2g_features/distance.md
@@ -0,0 +1,19 @@
---
title: From distance
---

## List of features

::: gentropy.dataset.l2g_features.distance.DistanceSentinelTssFeature
::: gentropy.dataset.l2g_features.distance.DistanceSentinelTssNeighbourhoodFeature
::: gentropy.dataset.l2g_features.distance.DistanceTssMeanFeature
::: gentropy.dataset.l2g_features.distance.DistanceTssMeanNeighbourhoodFeature
::: gentropy.dataset.l2g_features.distance.DistanceSentinelFootprintFeature
::: gentropy.dataset.l2g_features.distance.DistanceSentinelFootprintNeighbourhoodFeature
::: gentropy.dataset.l2g_features.distance.DistanceFootprintMeanFeature
::: gentropy.dataset.l2g_features.distance.DistanceFootprintMeanNeighbourhoodFeature

## Common logic

::: gentropy.dataset.l2g_features.distance.common_distance_feature_logic
::: gentropy.dataset.l2g_features.distance.common_neighbourhood_distance_feature_logic
9 changes: 0 additions & 9 deletions docs/python_api/datasets/variant_to_gene.md

This file was deleted.

7 changes: 2 additions & 5 deletions docs/python_api/methods/l2g/_l2g.md
@@ -9,13 +9,10 @@ The **“locus-to-gene” (L2G)** model derives features to prioritize likely ca
- **Chromatin Interaction:** (e.g., promoter-capture Hi-C)
- **Variant Pathogenicity:** (from VEP)

The L2G model is distinct from the variant-to-gene (V2G) pipeline in that it:

- Uses a machine-learning model to learn the weights of each evidence source based on a gold standard of previously identified causal genes.
- Relies upon fine-mapping and colocalization data.

Some of the predictive features weight variant-to-gene (or genomic region-to-gene) evidence based on the posterior probability that the variant is causal, determined through fine-mapping of the GWAS association.

For a more detailed description of how each feature is computed, see [the L2G Feature documentation](../../datasets/l2g_features/_l2g_feature.md).

Details of the L2G model are provided in our Nature Genetics publication (ref - [Nature Genetics Publication](https://www.nature.com/articles/s41588-021-00945-5)):

- **Title:** An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci.
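
The documentation above states that some features weight variant-to-gene evidence by the posterior probability that the variant is causal. One way such a weighting could look is a posterior-probability-weighted mean over the variants of a credible set. This is a hedged sketch only: the function name `pp_weighted_mean`, the `(score, posterior probability)` tuple layout and the sample numbers are assumptions, and the exact formula used by gentropy is not shown in this diff.

```python
def pp_weighted_mean(evidence):
    """Average per-variant evidence scores, weighting each by its posterior probability.

    `evidence` is a list of (score, posterior_probability) tuples (assumed layout).
    Returns None when the total weight is zero, since the mean is undefined.
    """
    total_pp = sum(pp for _, pp in evidence)
    if total_pp == 0:
        return None
    return sum(score * pp for score, pp in evidence) / total_pp


# Hypothetical credible set: the variant most likely to be causal dominates the result.
credible_set = [(1.0, 0.7), (0.5, 0.2), (0.0, 0.1)]
weighted = pp_weighted_mean(credible_set)
```

With these inputs the result is close to 0.8, compared with an unweighted mean of 0.5, which shows how a high-probability variant pulls the feature value towards its own evidence score.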
5 changes: 0 additions & 5 deletions docs/python_api/steps/variant_to_gene_step.md

This file was deleted.

85 changes: 4 additions & 81 deletions notebooks/Release_QC_metrics.ipynb
@@ -13,21 +13,17 @@
"1. Import necessary modules and set up the release path and version.\n",
"2. Load and analyze the variant index data:\n",
" - Count the number of unique variants.\n",
-"3. Load and analyze the variant-to-gene (v2g) data:\n",
-" - Count the number of unique variants and total variant-to-gene assignments.\n",
-" - Count the number of v2g assignments where the score is > 0.8.\n",
-" - Plot a histogram/density plot for the \"score\" column.\n",
-"4. Load and analyze the study index data for different data sources (FinnGen, GWASCat, eQTLcat):\n",
+"3. Load and analyze the study index data for different data sources (FinnGen, GWASCat, eQTLcat):\n",
" - Count the number of unique studies for each data source.\n",
-"5. Analyze the credible sets for each datasource (Finngen, gwascat, eqtlcat):\n",
+"4. Analyze the credible sets for each datasource (Finngen, gwascat, eqtlcat):\n",
" - Analyze the credible sets:\n",
" - Count the number of unique credible sets and unique study IDs.\n",
" - Plot a scatter plot of the credible set size vs. the top posterior probability.\n",
" - Count the number of credible sets with a top SNP posterior probability > 0.9..\n",
-"6. Analyze colocalization data:\n",
+"5. Analyze colocalization data:\n",
" - Count the total number of colocalizations and the number with clpp > 0.8.\n",
" - Calculate the average number of overlaps per credible set.\n",
-"7. Analyze locus-to-gene (L2G) predictions:\n",
+"6. Analyze locus-to-gene (L2G) predictions:\n",
" - Load the locus-to-gene predictions data.\n",
" - How many Studylocus contains a \"good\" l2g prediction? (l2g_score > 0.5)\n",
" - How does l2g perform based on different datasource inputs? (impossible to tell)\n",
@@ -126,79 +122,6 @@
"#variant_index.filter(variant_index[\"alleleFrequencies.populationName\"] > 0.05).show(10, False)"
]
},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"\n",
-"#### 3. Load and analyze the variant-to-gene (v2g) data:\n",
-" - Count the number of unique variants and total variant-to-gene assignments.\n",
-" - Count the number of v2g assignments where the score is > 0.8."
-]
-},
-{
-"cell_type": "code",
-"execution_count": 3,
-"metadata": {},
-"outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-" \r"
-]
-},
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"Unique variants in v2g release: 5090991 , total variant to gene assignments: 105771851 , number of v2g assignments where score > 0.8: 23176515 ( 4.552 %)\n"
-]
-},
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-" \r"
-]
-},
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"Summary of v2g_score: Mean: 0.5909395615801637 L.quart: 0.29 Median: 0.62 U.quart: 0.94\n"
-]
-}
-],
-"source": [
-"#v2g_path='gs://genetics_etl_python_playground/releases/24.03/variant_to_gene'\n",
-"v2g_path=f\"{release_path}/{release_ver}/variant_to_gene\"\n",
-"v2g=session.spark.read.parquet(v2g_path, recursiveFileLookup=True)\n",
-"\n",
-"#How many variants?\n",
-"sample_size_quartiles = v2g.stat.approxQuantile(\"score\", [0.25, 0.5, 0.75], 0.01)\n",
-"#v2g.select().toPandas().plot.hist()\n",
-"#v2g.show()"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-" - Plot a histogram/density plot for the \"score\" column."
-]
-},
-{
-"cell_type": "code",
-"execution_count": 4,
-"metadata": {},
-"outputs": [],
-"source": [
-"#The histogram/density plot for “score”\n",
-"# Out of mem error:\n",
-"#v2g.select(f.col(\"score\")).toPandas().plot.hist(bins=10, alpha=0.5, label=\"v2g scores\")"
-]
-},
{
"cell_type": "markdown",
"metadata": {},
77 changes: 0 additions & 77 deletions src/gentropy/assets/schemas/v2g.json

This file was deleted.

6 changes: 3 additions & 3 deletions src/gentropy/common/spark_helpers.py
@@ -6,7 +6,7 @@
import sys
from functools import reduce, wraps
from itertools import chain
-from typing import TYPE_CHECKING, Any, Callable, Dict, Iterable, Optional, TypeVar
+from typing import TYPE_CHECKING, Any, Callable, Iterable, Optional, TypeVar

import pyspark.sql.functions as f
import pyspark.sql.types as t
@@ -447,14 +447,14 @@ def order_array_of_structs_by_two_fields(
)


-def map_column_by_dictionary(col: Column, mapping_dict: Dict[str, str]) -> Column:
+def map_column_by_dictionary(col: Column, mapping_dict: dict[str, str]) -> Column:
"""Map column values to dictionary values by key.

Missing consequence label will be converted to None, unmapped consequences will be mapped as None.

Args:
col (Column): Column containing labels to map.
-mapping_dict (Dict[str, str]): Dictionary with mapping key/value pairs.
+mapping_dict (dict[str, str]): Dictionary with mapping key/value pairs.

Returns:
Column: Column with mapped values.
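
The `spark_helpers.py` change above replaces `typing.Dict` with the built-in generic `dict[str, str]`, which is valid in annotations from Python 3.9 onwards. The docstring also describes the mapping contract: missing labels and labels without a dictionary entry both become `None`. The sketch below illustrates that contract on a single value in plain Python. It is not the Spark implementation, and the name `map_value_by_dictionary` and the example mapping are hypothetical.

```python
from typing import Optional


def map_value_by_dictionary(
    value: Optional[str], mapping_dict: dict[str, str]
) -> Optional[str]:
    """Map one label through a dictionary; missing or unmapped labels give None."""
    if value is None:
        # A missing label stays missing.
        return None
    # dict.get returns None when the key has no entry.
    return mapping_dict.get(value)


# Hypothetical mapping, used only to exercise the three cases.
example_mapping = {"label_a": "ID_1", "label_b": "ID_2"}

mapped = map_value_by_dictionary("label_a", example_mapping)
unmapped = map_value_by_dictionary("label_z", example_mapping)
missing = map_value_by_dictionary(None, example_mapping)
```

Using the built-in `dict` in the annotation removes the need to import `Dict` from `typing`, which matches the import change in the same file.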