The variant index dataset is the result of intersecting the variant annotation dataset with the variants for which V2D information is available.
Source code in src/otg/dataset/variant_index.py
```python
@dataclass
class VariantIndex(Dataset):
    """Variant index dataset."""

    @classmethod
    def from_variant_annotation(
        cls: type[VariantIndex],
        variant_annotation: VariantAnnotation,
        study_locus: StudyLocus,
    ) -> VariantIndex:
        """Initialise VariantIndex from pre-existing variant annotation dataset."""
        unchanged_cols = [
            # ...
            "alleleFrequencies",
            "cadd",
        ]
        va_slimmed = variant_annotation.filter_by_variant_df(
            study_locus.unique_variants_in_locus(), ["variantId", "chromosome"]
        )
        return cls(
            _df=(
                va_slimmed.df.select(
                    *unchanged_cols,
                    f.col("vep.mostSevereConsequence").alias("mostSevereConsequence"),
                    # filters/rsid are arrays that can be empty, in this case we convert them to null
                    # ...
                )
            ),
            _schema=cls.get_schema(),
        )
```
Initialise VariantIndex from pre-existing variant annotation dataset.
Source code in src/otg/dataset/variant_index.py
```python
@classmethod
def from_variant_annotation(
    cls: type[VariantIndex],
    variant_annotation: VariantAnnotation,
    study_locus: StudyLocus,
) -> VariantIndex:
    """Initialise VariantIndex from pre-existing variant annotation dataset."""
    unchanged_cols = [
        # ...
        "alleleFrequencies",
        "cadd",
    ]
    va_slimmed = variant_annotation.filter_by_variant_df(
        study_locus.unique_variants_in_locus(), ["variantId", "chromosome"]
    )
    return cls(
        _df=(
            va_slimmed.df.select(
                *unchanged_cols,
                f.col("vep.mostSevereConsequence").alias("mostSevereConsequence"),
                # filters/rsid are arrays that can be empty, in this case we convert them to null
                # ...
            )
        ),
        _schema=cls.get_schema(),
    )
```
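A minimal usage sketch (the paths below are made up, and `session` is assumed to be an already-initialised otg Session):

```python
# Hypothetical inputs loaded with the generic Dataset.from_parquet helper.
variant_annotation = VariantAnnotation.from_parquet(session, "gs://example-bucket/variant_annotation")
study_locus = StudyLocus.from_parquet(session, "gs://example-bucket/study_locus")

# Build the index restricted to variants that appear in at least one study locus.
variant_index = VariantIndex.from_variant_annotation(
    variant_annotation=variant_annotation,
    study_locus=study_locus,
)
variant_index.df.printSchema()
```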
The steps in this section only ever need to be done once on any particular system.
Google Cloud configuration:
1. Install the Google Cloud SDK: https://cloud.google.com/sdk/docs/install.
2. Log in to your work Google Account: run gcloud auth login and follow the instructions.
3. Obtain Google application credentials: run gcloud auth application-default login and follow the instructions.
Check that you have the make utility installed, and if not (which is unlikely), install it using your system package manager.
Run make setup-dev to install/update the necessary packages and activate the development environment. You need to do this every time you open a new shell.
It is recommended to use VS Code as an IDE for development.
"},{"location":"contributing/guidelines/#how-to-run-the-code","title":"How to run the code","text":"
All pipelines in this repository are intended to be run in Google Dataproc. Running them locally is not currently supported.
In order to run the code:
Manually edit your local workflow/dag.yaml file and comment out the steps you do not want to run.
Manually edit your local pyproject.toml file and modify the version of the code.
This must be different from the version used by anyone else working on the repository to avoid deployment conflicts, so it's a good idea to include your name, for example: 1.2.3+jdoe.
You can also add a brief branch description, for example: 1.2.3+jdoe.myfeature.
Note that the version must comply with PEP 440 conventions, otherwise Poetry will not allow it to be deployed (a quick way to check this is shown after these steps).
Do not use underscores or hyphens in your version name: when the WHL file is built, they are automatically converted to dots, so the file name no longer matches the version and the build fails. Use dots instead.
Run make build.
This will create a bundle containing the necessary code, configuration and dependencies to run the ETL pipeline, and then upload this bundle to Google Cloud.
A version-specific subpath is used, so uploading the code will not affect any branch but your own.
If there was already a code bundle uploaded with the same version number, it will be replaced.
Submit the Dataproc job with poetry run python workflow/workflow_template.py
You will need to specify additional parameters, some are mandatory and some are optional. Run with --help to see usage.
The script will provision the cluster and submit the job.
The cluster will take a few minutes to be provisioned and start running, during which the script will not output anything; this is normal.
Once submitted, you can monitor the progress of your job on this page: https://console.cloud.google.com/dataproc/jobs?project=open-targets-genetics-dev.
On completion (whether successful or a failure), the cluster will be automatically removed, so you don't have to worry about shutting it down to avoid incurring charges.
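If you want to double-check that a version string is acceptable, the snippet below (illustrative only, not part of the repository) uses the packaging library to parse it and show its PEP 440 normalisation:

```python
from packaging.version import InvalidVersion, Version

for candidate in ["1.2.3+jdoe.myfeature", "1.2.3+jdoe_my-feature"]:
    try:
        # PEP 440 normalises underscores and hyphens in the local segment to dots,
        # which is why a WHL file name can stop matching the version you declared.
        print(candidate, "->", Version(candidate))
    except InvalidVersion:
        print(candidate, "-> not PEP 440 compliant")
```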
When making changes, and especially when implementing a new module or feature, it's essential to ensure that all relevant sections of the code base are modified.
- [ ] Run make check. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
- [ ] Develop unit tests for your code and run make test. This will run all unit tests in the repository, including the examples included in the docstrings of some methods.
- [ ] Update the configuration if necessary.
- [ ] Update the documentation and check it with make build-documentation. This will start a local server to browse it (the URL will be printed, usually http://127.0.0.1:8000/).
For more details on each of these steps, see the sections below.
If during development you had a question which wasn't covered in the documentation, and someone explained it to you, add it to the documentation. The same applies if you encountered any instructions in the documentation which were obsolete or incorrect.
Documentation autogeneration expressions start with :::. They will automatically generate sections of the documentation based on class and method docstrings. Be sure to update them for:
Dataset definitions in docs/reference/dataset (example: docs/reference/dataset/study_index/study_index_finngen.md)
Step definitions in docs/reference/step (example: docs/reference/step/finngen.md)
If you see errors related to BLAS/LAPACK libraries, see this StackOverflow post for guidance.
"},{"location":"contributing/troubleshooting/#pyenv-and-poetry","title":"Pyenv and Poetry","text":"
If you see various errors thrown by Pyenv or Poetry, they can be hard to specifically diagnose and resolve. In this case, it often helps to remove those tools from the system completely. Follow these steps:
Close your currently activated environment, if any: exit
Officially, PySpark requires Java version 8 (a.k.a. 1.8) or above to work. However, if you have a very recent version of Java, you may experience issues, as it may introduce breaking changes that PySpark hasn't had time to integrate. For example, as of May 2023, PySpark did not work with Java 20.
If you are encountering problems with initialising a Spark session, try using Java 11.
If you see an error message thrown by pre-commit, which looks like this (SyntaxError: Unexpected token '?'), followed by a JavaScript traceback, the issue is likely with your system NodeJS version.
One solution which can help in this case is to upgrade your system NodeJS version. However, this may not always be possible; for example, as of July 2023 the Ubuntu repositories are several major versions behind the latest NodeJS release.
Another solution which helps is to remove Node, NodeJS, and npm from your system entirely. In this case, pre-commit will not try to rely on a system version of NodeJS and will install its own, suitable one.
On Ubuntu, this can be done using sudo apt remove node nodejs npm, followed by sudo apt autoremove. But in some cases, depending on your existing installation, you may need to also manually remove some files. See this StackOverflow answer for guidance.
After running these commands, you are advised to open a fresh shell, and then also reinstall Pyenv and Poetry to make sure they pick up the changes (see relevant section above).
Dataset is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the json.schemas module.
Source code in src/otg/dataset/dataset.py
@dataclass\nclass Dataset(ABC):\n \"\"\"Open Targets Genetics Dataset.\n\n `Dataset` is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the `json.schemas` module.\n \"\"\"\n\n _df: DataFrame\n _schema: StructType\n\n def __post_init__(self: Dataset) -> None:\n \"\"\"Post init.\"\"\"\n self.validate_schema()\n\n @property\n def df(self: Dataset) -> DataFrame:\n \"\"\"Dataframe included in the Dataset.\"\"\"\n return self._df\n\n @df.setter\n def df(self: Dataset, new_df: DataFrame) -> None: # noqa: CCE001\n \"\"\"Dataframe setter.\"\"\"\n self._df: DataFrame = new_df\n self.validate_schema()\n\n @property\n def schema(self: Dataset) -> StructType:\n \"\"\"Dataframe expected schema.\"\"\"\n return self._schema\n\n @classmethod\n @abstractmethod\n def get_schema(cls: type[Dataset]) -> StructType:\n \"\"\"Abstract method to get the schema. Must be implemented by child classes.\"\"\"\n pass\n\n @classmethod\n def from_parquet(\n cls: type[Dataset], session: Session, path: str, **kwargs: Dict[str, Any]\n ) -> Dataset:\n \"\"\"Reads a parquet file into a Dataset with a given schema.\"\"\"\n schema = cls.get_schema()\n df = session.read_parquet(path=path, schema=schema, **kwargs)\n return cls(_df=df, _schema=schema)\n\n def validate_schema(self: Dataset) -> None: # sourcery skip: invert-any-all\n \"\"\"Validate DataFrame schema against expected class schema.\n\n Raises:\n ValueError: DataFrame schema is not valid\n \"\"\"\n expected_schema = self._schema\n expected_fields = flatten_schema(expected_schema)\n observed_schema = self._df.schema\n observed_fields = flatten_schema(observed_schema)\n\n # Unexpected fields in dataset\n if unexpected_field_names := [\n x.name\n for x in observed_fields\n if x.name not in [y.name for y in expected_fields]\n ]:\n raise ValueError(\n f\"The {unexpected_field_names} fields are not included in DataFrame schema: {expected_fields}\"\n )\n\n # Required fields not in dataset\n required_fields = [x.name for x in expected_schema if not x.nullable]\n if missing_required_fields := [\n req\n for req in required_fields\n if not any(field.name == req for field in observed_fields)\n ]:\n raise ValueError(\n f\"The {missing_required_fields} fields are required but missing: {required_fields}\"\n )\n\n # Fields with duplicated names\n if duplicated_fields := [\n x for x in set(observed_fields) if observed_fields.count(x) > 1\n ]:\n raise ValueError(\n f\"The following fields are duplicated in DataFrame schema: {duplicated_fields}\"\n )\n\n # Fields with different datatype\n observed_field_types = {\n field.name: type(field.dataType) for field in observed_fields\n }\n expected_field_types = {\n field.name: type(field.dataType) for field in expected_fields\n }\n if fields_with_different_observed_datatype := [\n name\n for name, observed_type in observed_field_types.items()\n if name in expected_field_types\n and observed_type != expected_field_types[name]\n ]:\n raise ValueError(\n f\"The following fields present differences in their datatypes: {fields_with_different_observed_datatype}.\"\n )\n\n def persist(self: Dataset) -> Dataset:\n \"\"\"Persist in memory the DataFrame included in the Dataset.\"\"\"\n self.df = self._df.persist()\n return self\n\n def unpersist(self: Dataset) -> Dataset:\n \"\"\"Remove the persisted DataFrame from memory.\"\"\"\n self.df = self._df.unpersist()\n return self\n
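As a rough sketch of how the wrapper is meant to be used (the class name, schema file, path and import locations below are illustrative; real child classes such as StudyIndex ship their own schema), a child class only needs to implement get_schema, after which schema validation and parquet loading come for free:

```python
from __future__ import annotations

from dataclasses import dataclass

from pyspark.sql.types import StructType

from otg.common.schemas import parse_spark_schema  # assumed import path, mirroring the source layout
from otg.dataset.dataset import Dataset  # assumed import path


@dataclass
class MyDataset(Dataset):
    """Hypothetical child dataset, used purely for illustration."""

    @classmethod
    def get_schema(cls: type[MyDataset]) -> StructType:
        # "my_dataset.json" is a made-up schema file name.
        return parse_spark_schema("my_dataset.json")


# Loading validates the dataframe against the schema; any mismatch raises ValueError.
# `session` is assumed to be an already-initialised otg Session.
my_dataset = MyDataset.from_parquet(session, "gs://example-bucket/my_dataset")
```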
Abstract method to get the schema. Must be implemented by child classes.
Source code in src/otg/dataset/dataset.py
@classmethod\n@abstractmethod\ndef get_schema(cls: type[Dataset]) -> StructType:\n \"\"\"Abstract method to get the schema. Must be implemented by child classes.\"\"\"\n pass\n
Persist in memory the DataFrame included in the Dataset.
Source code in src/otg/dataset/dataset.py
def persist(self: Dataset) -> Dataset:\n \"\"\"Persist in memory the DataFrame included in the Dataset.\"\"\"\n self.df = self._df.persist()\n return self\n
Validate DataFrame schema against expected class schema.
Raises:
| Type | Description |
| --- | --- |
| ValueError | DataFrame schema is not valid |
Source code in src/otg/dataset/dataset.py
def validate_schema(self: Dataset) -> None: # sourcery skip: invert-any-all\n \"\"\"Validate DataFrame schema against expected class schema.\n\n Raises:\n ValueError: DataFrame schema is not valid\n \"\"\"\n expected_schema = self._schema\n expected_fields = flatten_schema(expected_schema)\n observed_schema = self._df.schema\n observed_fields = flatten_schema(observed_schema)\n\n # Unexpected fields in dataset\n if unexpected_field_names := [\n x.name\n for x in observed_fields\n if x.name not in [y.name for y in expected_fields]\n ]:\n raise ValueError(\n f\"The {unexpected_field_names} fields are not included in DataFrame schema: {expected_fields}\"\n )\n\n # Required fields not in dataset\n required_fields = [x.name for x in expected_schema if not x.nullable]\n if missing_required_fields := [\n req\n for req in required_fields\n if not any(field.name == req for field in observed_fields)\n ]:\n raise ValueError(\n f\"The {missing_required_fields} fields are required but missing: {required_fields}\"\n )\n\n # Fields with duplicated names\n if duplicated_fields := [\n x for x in set(observed_fields) if observed_fields.count(x) > 1\n ]:\n raise ValueError(\n f\"The following fields are duplicated in DataFrame schema: {duplicated_fields}\"\n )\n\n # Fields with different datatype\n observed_field_types = {\n field.name: type(field.dataType) for field in observed_fields\n }\n expected_field_types = {\n field.name: type(field.dataType) for field in expected_fields\n }\n if fields_with_different_observed_datatype := [\n name\n for name, observed_type in observed_field_types.items()\n if name in expected_field_types\n and observed_type != expected_field_types[name]\n ]:\n raise ValueError(\n f\"The following fields present differences in their datatypes: {fields_with_different_observed_datatype}.\"\n )\n
Colocalisation results for pairs of overlapping study-locus.
Source code in src/otg/dataset/colocalisation.py
@dataclass\nclass Colocalisation(Dataset):\n \"\"\"Colocalisation results for pairs of overlapping study-locus.\"\"\"\n\n @classmethod\n def get_schema(cls: type[Colocalisation]) -> StructType:\n \"\"\"Provides the schema for the Colocalisation dataset.\"\"\"\n return parse_spark_schema(\"colocalisation.json\")\n
Provides the schema for the Colocalisation dataset.
Source code in src/otg/dataset/colocalisation.py
@classmethod\ndef get_schema(cls: type[Colocalisation]) -> StructType:\n \"\"\"Provides the schema for the Colocalisation dataset.\"\"\"\n return parse_spark_schema(\"colocalisation.json\")\n
@classmethod\ndef get_schema(cls: type[GeneIndex]) -> StructType:\n \"\"\"Provides the schema for the GeneIndex dataset.\"\"\"\n return parse_spark_schema(\"gene_index.json\")\n
Intervals dataset links genes to genomic regions based on genome interaction studies.
Source code in src/otg/dataset/intervals.py
@dataclass\nclass Intervals(Dataset):\n \"\"\"Intervals dataset links genes to genomic regions based on genome interaction studies.\"\"\"\n\n @classmethod\n def get_schema(cls: type[Intervals]) -> StructType:\n \"\"\"Provides the schema for the Intervals dataset.\"\"\"\n return parse_spark_schema(\"intervals.json\")\n\n def v2g(self: Intervals, variant_index: VariantIndex) -> V2G:\n \"\"\"Convert intervals into V2G by intersecting with a variant index.\n\n Args:\n variant_index (VariantIndex): Variant index dataset\n\n Returns:\n V2G: Variant-to-gene evidence dataset\n \"\"\"\n return V2G(\n _df=(\n # TODO: We can include the start and end position as part of the `on` clause in the join\n self.df.alias(\"interval\")\n .join(\n variant_index.df.selectExpr(\n \"chromosome as vi_chromosome\", \"variantId\", \"position\"\n ).alias(\"vi\"),\n on=[\n f.col(\"vi.vi_chromosome\") == f.col(\"interval.chromosome\"),\n f.col(\"vi.position\").between(\n f.col(\"interval.start\"), f.col(\"interval.end\")\n ),\n ],\n how=\"inner\",\n )\n .drop(\"start\", \"end\", \"vi_chromosome\")\n ),\n _schema=V2G.get_schema(),\n )\n
@classmethod\ndef get_schema(cls: type[Intervals]) -> StructType:\n \"\"\"Provides the schema for the Intervals dataset.\"\"\"\n return parse_spark_schema(\"intervals.json\")\n
Convert intervals into V2G by intersecting with a variant index.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| variant_index | VariantIndex | Variant index dataset | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| V2G | V2G | Variant-to-gene evidence dataset |
Source code in src/otg/dataset/intervals.py
def v2g(self: Intervals, variant_index: VariantIndex) -> V2G:\n \"\"\"Convert intervals into V2G by intersecting with a variant index.\n\n Args:\n variant_index (VariantIndex): Variant index dataset\n\n Returns:\n V2G: Variant-to-gene evidence dataset\n \"\"\"\n return V2G(\n _df=(\n # TODO: We can include the start and end position as part of the `on` clause in the join\n self.df.alias(\"interval\")\n .join(\n variant_index.df.selectExpr(\n \"chromosome as vi_chromosome\", \"variantId\", \"position\"\n ).alias(\"vi\"),\n on=[\n f.col(\"vi.vi_chromosome\") == f.col(\"interval.chromosome\"),\n f.col(\"vi.position\").between(\n f.col(\"interval.start\"), f.col(\"interval.end\")\n ),\n ],\n how=\"inner\",\n )\n .drop(\"start\", \"end\", \"vi_chromosome\")\n ),\n _schema=V2G.get_schema(),\n )\n
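A hedged usage sketch (paths are made up, `session` is an already-initialised otg Session): once both datasets are loaded, the join yields V2G evidence for every variant whose position falls inside an interval:

```python
intervals = Intervals.from_parquet(session, "gs://example-bucket/intervals")
variant_index = VariantIndex.from_parquet(session, "gs://example-bucket/variant_index")

# Variants with interval.start <= position <= interval.end become variant-to-gene records.
interval_v2g = intervals.v2g(variant_index)
interval_v2g.df.show(5)
```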
Dataset containing linkage disequilibrium information between variants.
Source code in src/otg/dataset/ld_index.py
@dataclass\nclass LDIndex(Dataset):\n \"\"\"Dataset containing linkage desequilibrium information between variants.\"\"\"\n\n @classmethod\n def get_schema(cls: type[LDIndex]) -> StructType:\n \"\"\"Provides the schema for the LDIndex dataset.\"\"\"\n return parse_spark_schema(\"ld_index.json\")\n
@classmethod\ndef get_schema(cls: type[LDIndex]) -> StructType:\n \"\"\"Provides the schema for the LDIndex dataset.\"\"\"\n return parse_spark_schema(\"ld_index.json\")\n
A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.
Source code in src/otg/dataset/study_index.py
@dataclass\nclass StudyIndex(Dataset):\n \"\"\"Study index dataset.\n\n A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.\n \"\"\"\n\n @staticmethod\n def _aggregate_samples_by_ancestry(merged: Column, ancestry: Column) -> Column:\n \"\"\"Aggregate sample counts by ancestry in a list of struct colmns.\n\n Args:\n merged (Column): A column representing merged data (list of structs).\n ancestry (Column): The `ancestry` parameter is a column that represents the ancestry of each\n sample. (a struct)\n\n Returns:\n the modified \"merged\" column after aggregating the samples by ancestry.\n \"\"\"\n # Iterating over the list of ancestries and adding the sample size if label matches:\n return f.transform(\n merged,\n lambda a: f.when(\n a.ancestry == ancestry.ancestry,\n f.struct(\n a.ancestry.alias(\"ancestry\"),\n (a.sampleSize + ancestry.sampleSize).alias(\"sampleSize\"),\n ),\n ).otherwise(a),\n )\n\n @staticmethod\n def _map_ancestries_to_ld_population(gwas_ancestry_label: Column) -> Column:\n \"\"\"Normalise ancestry column from GWAS studies into reference LD panel based on a pre-defined map.\n\n This function assumes all possible ancestry categories have a corresponding\n LD panel in the LD index. It is very important to have the ancestry labels\n moved to the LD panel map.\n\n Args:\n gwas_ancestry_label (Column): A struct column with ancestry label like Finnish,\n European, African etc. and the corresponding sample size.\n\n Returns:\n Column: Struct column with the mapped LD population label and the sample size.\n \"\"\"\n # Loading ancestry label to LD population label:\n json_dict = json.loads(\n pkg_resources.read_text(\n data, \"gwas_population_2_LD_panel_map.json\", encoding=\"utf-8\"\n )\n )\n map_expr = f.create_map(*[f.lit(x) for x in chain(*json_dict.items())])\n\n return f.struct(\n map_expr[gwas_ancestry_label.ancestry].alias(\"ancestry\"),\n gwas_ancestry_label.sampleSize.alias(\"sampleSize\"),\n )\n\n @classmethod\n def get_schema(cls: type[StudyIndex]) -> StructType:\n \"\"\"Provide the schema for the StudyIndex dataset.\"\"\"\n return parse_spark_schema(\"study_index.json\")\n\n @classmethod\n def aggregate_and_map_ancestries(\n cls: type[StudyIndex], discovery_samples: Column\n ) -> Column:\n \"\"\"Map ancestries to populations in the LD reference and calculate relative sample size.\n\n Args:\n discovery_samples (Column): A list of struct column. 
Has an `ancestry` column and a `sampleSize` columns\n\n Returns:\n A list of struct with mapped LD population and their relative sample size.\n \"\"\"\n # Map ancestry categories to population labels of the LD index:\n mapped_ancestries = f.transform(\n discovery_samples, cls._map_ancestries_to_ld_population\n )\n\n # Aggregate sample sizes belonging to the same LD population:\n aggregated_counts = f.aggregate(\n mapped_ancestries,\n f.array_distinct(\n f.transform(\n mapped_ancestries,\n lambda x: f.struct(\n x.ancestry.alias(\"ancestry\"), f.lit(0.0).alias(\"sampleSize\")\n ),\n )\n ),\n cls._aggregate_samples_by_ancestry,\n )\n # Getting total sample count:\n total_sample_count = f.aggregate(\n aggregated_counts, f.lit(0.0), lambda total, pop: total + pop.sampleSize\n ).alias(\"sampleSize\")\n\n # Calculating relative sample size for each LD population:\n return f.transform(\n aggregated_counts,\n lambda ld_population: f.struct(\n ld_population.ancestry.alias(\"ldPopulation\"),\n (ld_population.sampleSize / total_sample_count).alias(\n \"relativeSampleSize\"\n ),\n ),\n )\n\n def study_type_lut(self: StudyIndex) -> DataFrame:\n \"\"\"Return a lookup table of study type.\n\n Returns:\n DataFrame: A dataframe containing `studyId` and `studyType` columns.\n \"\"\"\n return self.df.select(\"studyId\", \"studyType\")\n
Map ancestries to populations in the LD reference and calculate relative sample size.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| discovery_samples | Column | A list-of-structs column; each struct has an ancestry and a sampleSize field | required |
Returns:
| Type | Description |
| --- | --- |
| Column | A list of structs with mapped LD population and their relative sample size |
Source code in src/otg/dataset/study_index.py
@classmethod\ndef aggregate_and_map_ancestries(\n cls: type[StudyIndex], discovery_samples: Column\n) -> Column:\n \"\"\"Map ancestries to populations in the LD reference and calculate relative sample size.\n\n Args:\n discovery_samples (Column): A list of struct column. Has an `ancestry` column and a `sampleSize` columns\n\n Returns:\n A list of struct with mapped LD population and their relative sample size.\n \"\"\"\n # Map ancestry categories to population labels of the LD index:\n mapped_ancestries = f.transform(\n discovery_samples, cls._map_ancestries_to_ld_population\n )\n\n # Aggregate sample sizes belonging to the same LD population:\n aggregated_counts = f.aggregate(\n mapped_ancestries,\n f.array_distinct(\n f.transform(\n mapped_ancestries,\n lambda x: f.struct(\n x.ancestry.alias(\"ancestry\"), f.lit(0.0).alias(\"sampleSize\")\n ),\n )\n ),\n cls._aggregate_samples_by_ancestry,\n )\n # Getting total sample count:\n total_sample_count = f.aggregate(\n aggregated_counts, f.lit(0.0), lambda total, pop: total + pop.sampleSize\n ).alias(\"sampleSize\")\n\n # Calculating relative sample size for each LD population:\n return f.transform(\n aggregated_counts,\n lambda ld_population: f.struct(\n ld_population.ancestry.alias(\"ldPopulation\"),\n (ld_population.sampleSize / total_sample_count).alias(\n \"relativeSampleSize\"\n ),\n ),\n )\n
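A small illustration of the intent (assuming an active `spark` session; the ancestry labels and sample sizes are invented and may not all have an entry in the LD panel map): given a `discoverySamples` array, the method returns each mapped LD population together with its share of the total sample size:

```python
import pyspark.sql.functions as f

df = spark.createDataFrame(
    [([("European", 10000.0), ("East Asian", 5000.0)],)],
    "discoverySamples: array<struct<ancestry: string, sampleSize: double>>",
)

# Each ancestry is mapped to an LD panel population and weighted by its share
# of the 15000 total samples (here 2/3 and 1/3).
df.withColumn(
    "ldPopulationStructure",
    StudyIndex.aggregate_and_map_ancestries(f.col("discoverySamples")),
).show(truncate=False)
```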
@classmethod\ndef get_schema(cls: type[StudyIndex]) -> StructType:\n \"\"\"Provide the schema for the StudyIndex dataset.\"\"\"\n return parse_spark_schema(\"study_index.json\")\n
A dataframe containing studyId and studyType columns.
Source code in src/otg/dataset/study_index.py
def study_type_lut(self: StudyIndex) -> DataFrame:\n \"\"\"Return a lookup table of study type.\n\n Returns:\n DataFrame: A dataframe containing `studyId` and `studyType` columns.\n \"\"\"\n return self.df.select(\"studyId\", \"studyType\")\n
This dataset captures associations between studies/traits and genetic loci as provided by fine-mapping methods.
Source code in src/otg/dataset/study_locus.py
@dataclass\nclass StudyLocus(Dataset):\n \"\"\"Study-Locus dataset.\n\n This dataset captures associations between study/traits and a genetic loci as provided by finemapping methods.\n \"\"\"\n\n @staticmethod\n def _overlapping_peaks(credset_to_overlap: DataFrame) -> DataFrame:\n \"\"\"Calculate overlapping signals (study-locus) between GWAS-GWAS and GWAS-Molecular trait.\n\n Args:\n credset_to_overlap (DataFrame): DataFrame containing at least `studyLocusId`, `studyType`, `chromosome` and `tagVariantId` columns.\n\n Returns:\n DataFrame: containing `leftStudyLocusId`, `rightStudyLocusId` and `chromosome` columns.\n \"\"\"\n # Reduce columns to the minimum to reduce the size of the dataframe\n credset_to_overlap = credset_to_overlap.select(\n \"studyLocusId\", \"studyType\", \"chromosome\", \"tagVariantId\"\n )\n return (\n credset_to_overlap.alias(\"left\")\n .filter(f.col(\"studyType\") == \"gwas\")\n # Self join with complex condition. Left it's all gwas and right can be gwas or molecular trait\n .join(\n credset_to_overlap.alias(\"right\"),\n on=[\n f.col(\"left.chromosome\") == f.col(\"right.chromosome\"),\n f.col(\"left.tagVariantId\") == f.col(\"right.tagVariantId\"),\n (f.col(\"right.studyType\") != \"gwas\")\n | (f.col(\"left.studyLocusId\") > f.col(\"right.studyLocusId\")),\n ],\n how=\"inner\",\n )\n .select(\n f.col(\"left.studyLocusId\").alias(\"leftStudyLocusId\"),\n f.col(\"right.studyLocusId\").alias(\"rightStudyLocusId\"),\n f.col(\"left.chromosome\").alias(\"chromosome\"),\n )\n .distinct()\n .repartition(\"chromosome\")\n .persist()\n )\n\n @staticmethod\n def _align_overlapping_tags(\n loci_to_overlap: DataFrame, peak_overlaps: DataFrame\n ) -> StudyLocusOverlap:\n \"\"\"Align overlapping tags in pairs of overlapping study-locus, keeping all tags in both loci.\n\n Args:\n loci_to_overlap (DataFrame): containing `studyLocusId`, `studyType`, `chromosome`, `tagVariantId`, `logABF` and `posteriorProbability` columns.\n peak_overlaps (DataFrame): containing `left_studyLocusId`, `right_studyLocusId` and `chromosome` columns.\n\n Returns:\n StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n \"\"\"\n # Complete information about all tags in the left study-locus of the overlap\n stats_cols = [\n \"logABF\",\n \"posteriorProbability\",\n \"beta\",\n \"pValueMantissa\",\n \"pValueExponent\",\n ]\n overlapping_left = loci_to_overlap.select(\n f.col(\"chromosome\"),\n f.col(\"tagVariantId\"),\n f.col(\"studyLocusId\").alias(\"leftStudyLocusId\"),\n *[f.col(col).alias(f\"left_{col}\") for col in stats_cols],\n ).join(peak_overlaps, on=[\"chromosome\", \"leftStudyLocusId\"], how=\"inner\")\n\n # Complete information about all tags in the right study-locus of the overlap\n overlapping_right = loci_to_overlap.select(\n f.col(\"chromosome\"),\n f.col(\"tagVariantId\"),\n f.col(\"studyLocusId\").alias(\"rightStudyLocusId\"),\n *[f.col(col).alias(f\"right_{col}\") for col in stats_cols],\n ).join(peak_overlaps, on=[\"chromosome\", \"rightStudyLocusId\"], how=\"inner\")\n\n # Include information about all tag variants in both study-locus aligned by tag variant id\n overlaps = overlapping_left.join(\n overlapping_right,\n on=[\n \"chromosome\",\n \"rightStudyLocusId\",\n \"leftStudyLocusId\",\n \"tagVariantId\",\n ],\n how=\"outer\",\n ).select(\n \"leftStudyLocusId\",\n \"rightStudyLocusId\",\n \"chromosome\",\n \"tagVariantId\",\n f.struct(\n *[f\"left_{e}\" for e in stats_cols] + [f\"right_{e}\" for e in stats_cols]\n ).alias(\"statistics\"),\n )\n return 
StudyLocusOverlap(\n _df=overlaps,\n _schema=StudyLocusOverlap.get_schema(),\n )\n\n @staticmethod\n def _update_quality_flag(\n qc: Column, flag_condition: Column, flag_text: StudyLocusQualityCheck\n ) -> Column:\n \"\"\"Update the provided quality control list with a new flag if condition is met.\n\n Args:\n qc (Column): Array column with the current list of qc flags.\n flag_condition (Column): This is a column of booleans, signing which row should be flagged\n flag_text (StudyLocusQualityCheck): Text for the new quality control flag\n\n Returns:\n Column: Array column with the updated list of qc flags.\n \"\"\"\n qc = f.when(qc.isNull(), f.array()).otherwise(qc)\n return f.when(\n flag_condition,\n f.array_union(qc, f.array(f.lit(flag_text.value))),\n ).otherwise(qc)\n\n @staticmethod\n def assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column:\n \"\"\"Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.\n\n Args:\n study_id_col (Column): column name with a study ID\n variant_id_col (Column): column name with a variant ID\n\n Returns:\n Column: column with a study locus ID\n\n Examples:\n >>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n >>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n +----------+----------+--------------------+\n | studyId| variantId| study_locus_id|\n +----------+----------+--------------------+\n |GCST000001|1_1000_A_C| 7437284926964690765|\n |GCST000002|1_1000_A_C|-7653912547667845377|\n +----------+----------+--------------------+\n <BLANKLINE>\n \"\"\"\n return f.xxhash64(*[study_id_col, variant_id_col]).alias(\"studyLocusId\")\n\n @classmethod\n def get_schema(cls: type[StudyLocus]) -> StructType:\n \"\"\"Provides the schema for the StudyLocus dataset.\"\"\"\n return parse_spark_schema(\"study_locus.json\")\n\n def filter_credible_set(\n self: StudyLocus,\n credible_interval: CredibleInterval,\n ) -> StudyLocus:\n \"\"\"Filter study-locus tag variants based on given credible interval.\n\n Args:\n credible_interval (CredibleInterval): Credible interval to filter for.\n\n Returns:\n StudyLocus: Filtered study-locus dataset.\n \"\"\"\n self.df = self._df.withColumn(\n \"locus\",\n f.expr(f\"filter(locus, tag -> (tag.{credible_interval.value}))\"),\n )\n return self\n\n def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:\n \"\"\"Calculate overlapping study-locus.\n\n Find overlapping study-locus that share at least one tagging variant. 
All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always\n appearing on the right side.\n\n Args:\n study_index (StudyIndex): Study index to resolve study types.\n\n Returns:\n StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n \"\"\"\n loci_to_overlap = (\n self.df.join(study_index.study_type_lut(), on=\"studyId\", how=\"inner\")\n .withColumn(\"locus\", f.explode(\"locus\"))\n .select(\n \"studyLocusId\",\n \"studyType\",\n \"chromosome\",\n f.col(\"locus.variantId\").alias(\"tagVariantId\"),\n f.col(\"locus.logABF\").alias(\"logABF\"),\n f.col(\"locus.posteriorProbability\").alias(\"posteriorProbability\"),\n f.col(\"locus.pValueMantissa\").alias(\"pValueMantissa\"),\n f.col(\"locus.pValueExponent\").alias(\"pValueExponent\"),\n f.col(\"locus.beta\").alias(\"beta\"),\n )\n .persist()\n )\n\n # overlapping study-locus\n peak_overlaps = self._overlapping_peaks(loci_to_overlap)\n\n # study-locus overlap by aligning overlapping variants\n return self._align_overlapping_tags(loci_to_overlap, peak_overlaps)\n\n def unique_lead_tag_variants(self: StudyLocus) -> DataFrame:\n \"\"\"All unique lead and tag variants contained in the `StudyLocus` dataframe.\n\n Returns:\n DataFrame: A dataframe containing `variantId` and `chromosome` columns.\n \"\"\"\n lead_tags = (\n self.df.select(\n f.col(\"variantId\"),\n f.col(\"chromosome\"),\n f.explode(\"ldSet.tagVariantId\").alias(\"tagVariantId\"),\n )\n .repartition(\"chromosome\")\n .persist()\n )\n return (\n lead_tags.select(\"variantId\", \"chromosome\")\n .union(\n lead_tags.select(f.col(\"tagVariantId\").alias(\"variantId\"), \"chromosome\")\n )\n .distinct()\n )\n\n def neglog_pvalue(self: StudyLocus) -> Column:\n \"\"\"Returns the negative log p-value.\n\n Returns:\n Column: Negative log p-value\n \"\"\"\n return calculate_neglog_pvalue(\n self.df.pValueMantissa,\n self.df.pValueExponent,\n )\n\n def annotate_credible_sets(self: StudyLocus) -> StudyLocus:\n \"\"\"Annotate study-locus dataset with credible set flags.\n\n Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds\n `is95CredibleSet` and `is99CredibleSet` fields to the elements, indicating which are the tagging variants whose cumulative sum\n of their `posteriorProbability` values is below 0.95 and 0.99, respectively.\n\n Returns:\n StudyLocus: including annotation on `is95CredibleSet` and `is99CredibleSet`.\n \"\"\"\n if \"locus\" not in self.df.columns:\n raise ValueError(\"Locus column not available.\")\n\n self.df = self.df.withColumn(\n # Sort credible set by posterior probability in descending order\n \"locus\",\n f.when(\n f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n order_array_of_structs_by_field(\"locus\", \"posteriorProbability\"),\n ),\n ).withColumn(\n # Calculate array of cumulative sums of posterior probabilities to determine which variants are in the 95% and 99% credible sets\n # and zip the cumulative sums array with the credible set array to add the flags\n \"locus\",\n f.when(\n f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n f.zip_with(\n f.col(\"locus\"),\n f.transform(\n f.sequence(f.lit(1), f.size(f.col(\"locus\"))),\n lambda index: f.aggregate(\n f.slice(\n # By using `index - 1` we introduce a value of `0.0` in the cumulative sums array. 
to ensure that the last variant\n # that exceeds the 0.95 threshold is included in the cumulative sum, as its probability is necessary to satisfy the threshold.\n f.col(\"locus.posteriorProbability\"),\n 1,\n index - 1,\n ),\n f.lit(0.0),\n lambda acc, el: acc + el,\n ),\n ),\n lambda struct_e, acc: struct_e.withField(\n CredibleInterval.IS95.value, (acc < 0.95) & acc.isNotNull()\n ).withField(\n CredibleInterval.IS99.value, (acc < 0.99) & acc.isNotNull()\n ),\n ),\n ),\n )\n return self\n\n def clump(self: StudyLocus) -> StudyLocus:\n \"\"\"Perform LD clumping of the studyLocus.\n\n Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n Returns:\n StudyLocus: with empty credible sets for linked variants and QC flag.\n \"\"\"\n self.df = (\n self.df.withColumn(\n \"is_lead_linked\",\n LDclumping._is_lead_linked(\n self.df.studyId,\n self.df.variantId,\n self.df.pValueExponent,\n self.df.pValueMantissa,\n self.df.ldSet,\n ),\n )\n .withColumn(\n \"ldSet\",\n f.when(f.col(\"is_lead_linked\"), f.array()).otherwise(f.col(\"ldSet\")),\n )\n .withColumn(\n \"qualityControls\",\n StudyLocus._update_quality_flag(\n f.col(\"qualityControls\"),\n f.col(\"is_lead_linked\"),\n StudyLocusQualityCheck.LD_CLUMPED,\n ),\n )\n .drop(\"is_lead_linked\")\n )\n return self\n\n def _qc_unresolved_ld(\n self: StudyLocus,\n ) -> StudyLocus:\n \"\"\"Flag associations with variants that are not found in the LD reference.\n\n Returns:\n StudyLocusGWASCatalog | StudyLocus: Updated study locus.\n \"\"\"\n self.df = self.df.withColumn(\n \"qualityControls\",\n self._update_quality_flag(\n f.col(\"qualityControls\"),\n f.col(\"ldSet\").isNull(),\n StudyLocusQualityCheck.UNRESOLVED_LD,\n ),\n )\n return self\n\n def _qc_no_population(self: StudyLocus) -> StudyLocus:\n \"\"\"Flag associations where the study doesn't have population information to resolve LD.\n\n Returns:\n StudyLocusGWASCatalog | StudyLocus: Updated study locus.\n \"\"\"\n # If the tested column is not present, return self unchanged:\n if \"ldPopulationStructure\" not in self.df.columns:\n return self\n\n self.df = self.df.withColumn(\n \"qualityControls\",\n self._update_quality_flag(\n f.col(\"qualityControls\"),\n f.col(\"ldPopulationStructure\").isNull(),\n StudyLocusQualityCheck.NO_POPULATION,\n ),\n )\n return self\n
Annotate study-locus dataset with credible set flags.
Sorts the elements of the array in the locus column by their posteriorProbability values in descending order, and adds is95CredibleSet and is99CredibleSet fields to each element, flagging the tagging variants whose cumulative sum of posteriorProbability values is below 0.95 and 0.99, respectively.
Returns:
| Name | Type | Description |
| --- | --- | --- |
| StudyLocus | StudyLocus | Including annotation on is95CredibleSet and is99CredibleSet |
Source code in src/otg/dataset/study_locus.py
def annotate_credible_sets(self: StudyLocus) -> StudyLocus:\n \"\"\"Annotate study-locus dataset with credible set flags.\n\n Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds\n `is95CredibleSet` and `is99CredibleSet` fields to the elements, indicating which are the tagging variants whose cumulative sum\n of their `posteriorProbability` values is below 0.95 and 0.99, respectively.\n\n Returns:\n StudyLocus: including annotation on `is95CredibleSet` and `is99CredibleSet`.\n \"\"\"\n if \"locus\" not in self.df.columns:\n raise ValueError(\"Locus column not available.\")\n\n self.df = self.df.withColumn(\n # Sort credible set by posterior probability in descending order\n \"locus\",\n f.when(\n f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n order_array_of_structs_by_field(\"locus\", \"posteriorProbability\"),\n ),\n ).withColumn(\n # Calculate array of cumulative sums of posterior probabilities to determine which variants are in the 95% and 99% credible sets\n # and zip the cumulative sums array with the credible set array to add the flags\n \"locus\",\n f.when(\n f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n f.zip_with(\n f.col(\"locus\"),\n f.transform(\n f.sequence(f.lit(1), f.size(f.col(\"locus\"))),\n lambda index: f.aggregate(\n f.slice(\n # By using `index - 1` we introduce a value of `0.0` in the cumulative sums array. to ensure that the last variant\n # that exceeds the 0.95 threshold is included in the cumulative sum, as its probability is necessary to satisfy the threshold.\n f.col(\"locus.posteriorProbability\"),\n 1,\n index - 1,\n ),\n f.lit(0.0),\n lambda acc, el: acc + el,\n ),\n ),\n lambda struct_e, acc: struct_e.withField(\n CredibleInterval.IS95.value, (acc < 0.95) & acc.isNotNull()\n ).withField(\n CredibleInterval.IS99.value, (acc < 0.99) & acc.isNotNull()\n ),\n ),\n ),\n )\n return self\n
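To make the cumulative-sum logic concrete, here is a plain-Python rendition of the same arithmetic on a toy locus (not the Spark implementation above):

```python
# Toy posterior probabilities, already sorted in descending order.
posteriors = [0.60, 0.25, 0.10, 0.05]

for index, pp in enumerate(posteriors, start=1):
    # Cumulative sum of the *preceding* elements, mirroring the `index - 1` slice above.
    cumulative_before = sum(posteriors[: index - 1])
    is95 = cumulative_before < 0.95
    is99 = cumulative_before < 0.99
    print(pp, cumulative_before, is95, is99)

# The first three variants (cumulative 0.0, 0.60, 0.85) fall in the 95% credible set;
# the fourth (cumulative 0.95) drops out of the 95% set but remains in the 99% set.
```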
@staticmethod\ndef assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column:\n \"\"\"Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.\n\n Args:\n study_id_col (Column): column name with a study ID\n variant_id_col (Column): column name with a variant ID\n\n Returns:\n Column: column with a study locus ID\n\n Examples:\n >>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n >>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n +----------+----------+--------------------+\n | studyId| variantId| study_locus_id|\n +----------+----------+--------------------+\n |GCST000001|1_1000_A_C| 7437284926964690765|\n |GCST000002|1_1000_A_C|-7653912547667845377|\n +----------+----------+--------------------+\n <BLANKLINE>\n \"\"\"\n return f.xxhash64(*[study_id_col, variant_id_col]).alias(\"studyLocusId\")\n
Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and GWAS-molecular trait overlaps are computed, with the molecular traits always appearing on the right side.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| study_index | StudyIndex | Study index to resolve study types | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| StudyLocusOverlap | StudyLocusOverlap | Pairs of overlapping study-locus with aligned tags |
Source code in src/otg/dataset/study_locus.py
def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:\n \"\"\"Calculate overlapping study-locus.\n\n Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always\n appearing on the right side.\n\n Args:\n study_index (StudyIndex): Study index to resolve study types.\n\n Returns:\n StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n \"\"\"\n loci_to_overlap = (\n self.df.join(study_index.study_type_lut(), on=\"studyId\", how=\"inner\")\n .withColumn(\"locus\", f.explode(\"locus\"))\n .select(\n \"studyLocusId\",\n \"studyType\",\n \"chromosome\",\n f.col(\"locus.variantId\").alias(\"tagVariantId\"),\n f.col(\"locus.logABF\").alias(\"logABF\"),\n f.col(\"locus.posteriorProbability\").alias(\"posteriorProbability\"),\n f.col(\"locus.pValueMantissa\").alias(\"pValueMantissa\"),\n f.col(\"locus.pValueExponent\").alias(\"pValueExponent\"),\n f.col(\"locus.beta\").alias(\"beta\"),\n )\n .persist()\n )\n\n # overlapping study-locus\n peak_overlaps = self._overlapping_peaks(loci_to_overlap)\n\n # study-locus overlap by aligning overlapping variants\n return self._align_overlapping_tags(loci_to_overlap, peak_overlaps)\n
@classmethod\ndef get_schema(cls: type[StudyLocus]) -> StructType:\n \"\"\"Provides the schema for the StudyLocus dataset.\"\"\"\n return parse_spark_schema(\"study_locus.json\")\n
Study-Locus quality control options listing concerns on the quality of the association.
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| SUBSIGNIFICANT_FLAG | str | p-value below significance threshold |
| NO_GENOMIC_LOCATION_FLAG | str | Incomplete genomic mapping |
| COMPOSITE_FLAG | str | Composite association due to variant x variant interactions |
| VARIANT_INCONSISTENCY_FLAG | str | Inconsistencies in the reported variants |
| NON_MAPPED_VARIANT_FLAG | str | Variant not mapped to GnomAd |
| PALINDROMIC_ALLELE_FLAG | str | Alleles are palindromic - cannot harmonize |
| AMBIGUOUS_STUDY | str | Association with ambiguous study |
| UNRESOLVED_LD | str | Variant not found in LD reference |
| LD_CLUMPED | str | Explained by a more significant variant in high LD (clumped) |
Source code in src/otg/dataset/study_locus.py
class StudyLocusQualityCheck(Enum):\n \"\"\"Study-Locus quality control options listing concerns on the quality of the association.\n\n Attributes:\n SUBSIGNIFICANT_FLAG (str): p-value below significance threshold\n NO_GENOMIC_LOCATION_FLAG (str): Incomplete genomic mapping\n COMPOSITE_FLAG (str): Composite association due to variant x variant interactions\n VARIANT_INCONSISTENCY_FLAG (str): Inconsistencies in the reported variants\n NON_MAPPED_VARIANT_FLAG (str): Variant not mapped to GnomAd\n PALINDROMIC_ALLELE_FLAG (str): Alleles are palindromic - cannot harmonize\n AMBIGUOUS_STUDY (str): Association with ambiguous study\n UNRESOLVED_LD (str): Variant not found in LD reference\n LD_CLUMPED (str): Explained by a more significant variant in high LD (clumped)\n \"\"\"\n\n SUBSIGNIFICANT_FLAG = \"Subsignificant p-value\"\n NO_GENOMIC_LOCATION_FLAG = \"Incomplete genomic mapping\"\n COMPOSITE_FLAG = \"Composite association\"\n INCONSISTENCY_FLAG = \"Variant inconsistency\"\n NON_MAPPED_VARIANT_FLAG = \"No mapping in GnomAd\"\n PALINDROMIC_ALLELE_FLAG = \"Palindrome alleles - cannot harmonize\"\n AMBIGUOUS_STUDY = \"Association with ambiguous study\"\n UNRESOLVED_LD = \"Variant not found in LD reference\"\n LD_CLUMPED = \"Explained by a more significant variant in high LD (clumped)\"\n NO_POPULATION = \"Study does not have population annotation to resolve LD\"\n
Interval within which an unobserved parameter value falls with a particular probability.
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| IS95 | str | 95% credible interval |
| IS99 | str | 99% credible interval |
Source code in src/otg/dataset/study_locus.py
class CredibleInterval(Enum):\n \"\"\"Credible interval enum.\n\n Interval within which an unobserved parameter value falls with a particular probability.\n\n Attributes:\n IS95 (str): 95% credible interval\n IS99 (str): 99% credible interval\n \"\"\"\n\n IS95 = \"is95CredibleSet\"\n IS99 = \"is99CredibleSet\"\n
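A usage sketch (path made up, `session` assumed): the enum values are the struct field names written by annotate_credible_sets, so they can be used directly to subset loci:

```python
study_locus = StudyLocus.from_parquet(session, "gs://example-bucket/study_locus")

# Keep only tag variants flagged as part of the 95% credible set.
credible_95 = study_locus.annotate_credible_sets().filter_credible_set(CredibleInterval.IS95)
```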
This dataset captures pairs of overlapping StudyLocus: that is, associations whose credible sets share at least one tagging variant.
Note
This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.
Source code in src/otg/dataset/study_locus_overlap.py
@dataclass\nclass StudyLocusOverlap(Dataset):\n \"\"\"Study-Locus overlap.\n\n This dataset captures pairs of overlapping `StudyLocus`: that is associations whose credible sets share at least one tagging variant.\n\n !!! note\n This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.\n \"\"\"\n\n @classmethod\n def get_schema(cls: type[StudyLocusOverlap]) -> StructType:\n \"\"\"Provides the schema for the StudyLocusOverlap dataset.\"\"\"\n return parse_spark_schema(\"study_locus_overlap.json\")\n\n @classmethod\n def from_associations(\n cls: type[StudyLocusOverlap], study_locus: StudyLocus, study_index: StudyIndex\n ) -> StudyLocusOverlap:\n \"\"\"Find the overlapping signals in a particular set of associations (StudyLocus dataset).\n\n Args:\n study_locus (StudyLocus): Study-locus associations to find the overlapping signals\n study_index (StudyIndex): Study index to find the overlapping signals\n\n Returns:\n StudyLocusOverlap: Study-locus overlap dataset\n \"\"\"\n return study_locus.find_overlaps(study_index)\n
Find the overlapping signals in a particular set of associations (StudyLocus dataset).
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| study_locus | StudyLocus | Study-locus associations to find the overlapping signals | required |
| study_index | StudyIndex | Study index to find the overlapping signals | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| StudyLocusOverlap | StudyLocusOverlap | Study-locus overlap dataset |
Source code in src/otg/dataset/study_locus_overlap.py
@classmethod\ndef from_associations(\n cls: type[StudyLocusOverlap], study_locus: StudyLocus, study_index: StudyIndex\n) -> StudyLocusOverlap:\n \"\"\"Find the overlapping signals in a particular set of associations (StudyLocus dataset).\n\n Args:\n study_locus (StudyLocus): Study-locus associations to find the overlapping signals\n study_index (StudyIndex): Study index to find the overlapping signals\n\n Returns:\n StudyLocusOverlap: Study-locus overlap dataset\n \"\"\"\n return study_locus.find_overlaps(study_index)\n
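A usage sketch with made-up paths (`session` is an already-initialised otg Session):

```python
study_locus = StudyLocus.from_parquet(session, "gs://example-bucket/study_locus")
study_index = StudyIndex.from_parquet(session, "gs://example-bucket/study_index")

# Pairs of study-loci whose credible sets share at least one tagging variant.
overlaps = StudyLocusOverlap.from_associations(study_locus, study_index)
overlaps.df.show(5)
```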
Provides the schema for the StudyLocusOverlap dataset.
Source code in src/otg/dataset/study_locus_overlap.py
@classmethod\ndef get_schema(cls: type[StudyLocusOverlap]) -> StructType:\n \"\"\"Provides the schema for the StudyLocusOverlap dataset.\"\"\"\n return parse_spark_schema(\"study_locus_overlap.json\")\n
A summary statistics dataset contains all single point statistics resulting from a GWAS.
Source code in src/otg/dataset/summary_statistics.py
@dataclass\nclass SummaryStatistics(Dataset):\n \"\"\"Summary Statistics dataset.\n\n A summary statistics dataset contains all single point statistics resulting from a GWAS.\n \"\"\"\n\n @classmethod\n def get_schema(cls: type[SummaryStatistics]) -> StructType:\n \"\"\"Provides the schema for the SummaryStatistics dataset.\"\"\"\n return parse_spark_schema(\"summary_statistics.json\")\n\n def pvalue_filter(self: SummaryStatistics, pvalue: float) -> SummaryStatistics:\n \"\"\"Filter summary statistics based on the provided p-value threshold.\n\n Args:\n pvalue (float): upper limit of the p-value to be filtered upon.\n\n Returns:\n SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.\n \"\"\"\n # Converting p-value to mantissa and exponent:\n (mantissa, exponent) = split_pvalue(pvalue)\n\n # Applying filter:\n df = self._df.filter(\n (f.col(\"pValueExponent\") < exponent)\n | (\n (f.col(\"pValueExponent\") == exponent)\n & (f.col(\"pValueMantissa\") <= mantissa)\n )\n )\n return SummaryStatistics(_df=df, _schema=self._schema)\n\n def window_based_clumping(\n self: SummaryStatistics,\n distance: int,\n gwas_significance: float = 5e-8,\n baseline_significance: float = 0.05,\n locus_collect_distance: int | None = None,\n ) -> StudyLocus:\n \"\"\"Generate study-locus from summary statistics by distance based clumping + collect locus.\n\n Args:\n distance (int): Distance in base pairs to be used for clumping.\n gwas_significance (float, optional): GWAS significance threshold. Defaults to 5e-8.\n baseline_significance (float, optional): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.\n locus_collect_distance (int, optional): The distance to collect locus around semi-indices. If not provided, defaults to `distance`.\n\n Returns:\n StudyLocus: Clumped study-locus containing variants based on window.\n \"\"\"\n # If locus collect distance is present, collect locus with the provided distance:\n if locus_collect_distance:\n clumped_df = WindowBasedClumping.clump_with_locus(\n self,\n window_length=distance,\n p_value_significance=gwas_significance,\n p_value_baseline=baseline_significance,\n locus_window_length=locus_collect_distance,\n )\n else:\n clumped_df = WindowBasedClumping.clump(\n self, window_length=distance, p_value_significance=gwas_significance\n )\n\n return clumped_df\n\n def exclude_region(self: SummaryStatistics, region: str) -> SummaryStatistics:\n \"\"\"Exclude a region from the summary stats dataset.\n\n Args:\n region (str): region given in \"chr##:#####-####\" format\n\n Returns:\n SummaryStatistics: filtered summary statistics.\n \"\"\"\n (chromosome, start_position, end_position) = parse_region(region)\n\n return SummaryStatistics(\n _df=(\n self.df.filter(\n ~(\n (f.col(\"chromosome\") == chromosome)\n & (\n (f.col(\"position\") >= start_position)\n & (f.col(\"position\") <= end_position)\n )\n )\n )\n ),\n _schema=SummaryStatistics.get_schema(),\n )\n
Provides the schema for the SummaryStatistics dataset.
Source code in src/otg/dataset/summary_statistics.py
@classmethod\ndef get_schema(cls: type[SummaryStatistics]) -> StructType:\n \"\"\"Provides the schema for the SummaryStatistics dataset.\"\"\"\n return parse_spark_schema(\"summary_statistics.json\")\n
Filter summary statistics based on the provided p-value threshold.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pvalue | float | Upper limit of the p-value to be filtered upon | required |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| SummaryStatistics | SummaryStatistics | Summary statistics object containing single point associations with p-values at least as significant as the provided threshold |
Source code in src/otg/dataset/summary_statistics.py
def pvalue_filter(self: SummaryStatistics, pvalue: float) -> SummaryStatistics:\n \"\"\"Filter summary statistics based on the provided p-value threshold.\n\n Args:\n pvalue (float): upper limit of the p-value to be filtered upon.\n\n Returns:\n SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.\n \"\"\"\n # Converting p-value to mantissa and exponent:\n (mantissa, exponent) = split_pvalue(pvalue)\n\n # Applying filter:\n df = self._df.filter(\n (f.col(\"pValueExponent\") < exponent)\n | (\n (f.col(\"pValueExponent\") == exponent)\n & (f.col(\"pValueMantissa\") <= mantissa)\n )\n )\n return SummaryStatistics(_df=df, _schema=self._schema)\n
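To illustrate the mantissa/exponent comparison, here is the same condition written for a single record in plain Python (a sketch, assuming split_pvalue returns e.g. (5.0, -8) for a threshold of 5e-8):

```python
threshold_mantissa, threshold_exponent = 5.0, -8  # i.e. a threshold of 5e-8


def passes(mantissa: float, exponent: int) -> bool:
    """Mirror of the Spark filter above for one (mantissa, exponent) pair."""
    return exponent < threshold_exponent or (
        exponent == threshold_exponent and mantissa <= threshold_mantissa
    )


print(passes(3.0, -9))  # True: 3e-9 is more significant than 5e-8
print(passes(4.9, -8))  # True: same exponent, smaller mantissa
print(passes(6.0, -8))  # False: 6e-8 is less significant than 5e-8
```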
Generate study-locus from summary statistics by distance-based clumping and locus collection.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| distance | int | Distance in base pairs to be used for clumping | required |
| gwas_significance | float | GWAS significance threshold | 5e-08 |
| baseline_significance | float | Baseline significance threshold for inclusion in the locus | 0.05 |
| locus_collect_distance | int | The distance to collect locus around semi-indices; if not provided, defaults to distance | None |
Returns:
| Name | Type | Description |
| --- | --- | --- |
| StudyLocus | StudyLocus | Clumped study-locus containing variants based on window |
Source code in src/otg/dataset/summary_statistics.py
def window_based_clumping(\n self: SummaryStatistics,\n distance: int,\n gwas_significance: float = 5e-8,\n baseline_significance: float = 0.05,\n locus_collect_distance: int | None = None,\n) -> StudyLocus:\n \"\"\"Generate study-locus from summary statistics by distance based clumping + collect locus.\n\n Args:\n distance (int): Distance in base pairs to be used for clumping.\n gwas_significance (float, optional): GWAS significance threshold. Defaults to 5e-8.\n baseline_significance (float, optional): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.\n locus_collect_distance (int, optional): The distance to collect locus around semi-indices. If not provided, defaults to `distance`.\n\n Returns:\n StudyLocus: Clumped study-locus containing variants based on window.\n \"\"\"\n # If locus collect distance is present, collect locus with the provided distance:\n if locus_collect_distance:\n clumped_df = WindowBasedClumping.clump_with_locus(\n self,\n window_length=distance,\n p_value_significance=gwas_significance,\n p_value_baseline=baseline_significance,\n locus_window_length=locus_collect_distance,\n )\n else:\n clumped_df = WindowBasedClumping.clump(\n self, window_length=distance, p_value_significance=gwas_significance\n )\n\n return clumped_df\n
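A usage sketch (the path and window sizes are illustrative, `session` assumed):

```python
summary_stats = SummaryStatistics.from_parquet(session, "gs://example-bucket/summary_statistics")

# Clump genome-wide significant hits within 500 kb windows and collect the
# surrounding locus within 250 kb of each semi-index.
clumped = summary_stats.window_based_clumping(
    distance=500_000,
    gwas_significance=5e-8,
    locus_collect_distance=250_000,
)
```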
Source code in src/otg/dataset/variant_annotation.py
@dataclass\nclass VariantAnnotation(Dataset):\n \"\"\"Dataset with variant-level annotations.\"\"\"\n\n @classmethod\n def get_schema(cls: type[VariantAnnotation]) -> StructType:\n \"\"\"Provides the schema for the VariantAnnotation dataset.\"\"\"\n return parse_spark_schema(\"variant_annotation.json\")\n\n def max_maf(self: VariantAnnotation) -> Column:\n \"\"\"Maximum minor allele frequency accross all populations.\n\n Returns:\n Column: Maximum minor allele frequency accross all populations.\n \"\"\"\n return f.array_max(\n f.transform(\n self.df.alleleFrequencies,\n lambda af: f.when(\n af.alleleFrequency > 0.5, 1 - af.alleleFrequency\n ).otherwise(af.alleleFrequency),\n )\n )\n\n def filter_by_variant_df(\n self: VariantAnnotation, df: DataFrame, cols: list[str]\n ) -> VariantAnnotation:\n \"\"\"Filter variant annotation dataset by a variant dataframe.\n\n Args:\n df (DataFrame): A dataframe of variants\n cols (List[str]): A list of columns to join on\n\n Returns:\n VariantAnnotation: A filtered variant annotation dataset\n \"\"\"\n self.df = self._df.join(f.broadcast(df.select(cols)), on=cols, how=\"inner\")\n return self\n\n def get_transcript_consequence_df(\n self: VariantAnnotation, filter_by: Optional[GeneIndex] = None\n ) -> DataFrame:\n \"\"\"Dataframe of exploded transcript consequences.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index. Defaults to None.\n\n Returns:\n DataFrame: A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence\n \"\"\"\n # exploding the array removes records without VEP annotation\n transript_consequences = self.df.withColumn(\n \"transcriptConsequence\", f.explode(\"vep.transcriptConsequences\")\n ).select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"transcriptConsequence\",\n f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n )\n if filter_by:\n transript_consequences = transript_consequences.join(\n f.broadcast(filter_by.df),\n on=[\"chromosome\", \"geneId\"],\n )\n return transript_consequences.persist()\n\n def get_most_severe_vep_v2g(\n self: VariantAnnotation,\n vep_consequences: DataFrame,\n filter_by: GeneIndex,\n ) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments based on VEP's predicted consequence on the transcript.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n vep_consequences (DataFrame): A dataframe of VEP consequences\n filter_by (GeneIndex): A gene index to filter by. 
Defaults to None.\n\n Returns:\n V2G: High and medium severity variant to gene assignments\n \"\"\"\n vep_lut = vep_consequences.select(\n f.element_at(f.split(\"Accession\", r\"/\"), -1).alias(\n \"variantFunctionalConsequenceId\"\n ),\n f.col(\"Term\").alias(\"label\"),\n f.col(\"v2g_score\").cast(\"double\").alias(\"score\"),\n )\n\n return V2G(\n _df=self.get_transcript_consequence_df(filter_by).select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n f.explode(\"transcriptConsequence.consequenceTerms\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"variantConsequence\").alias(\"datasourceId\"),\n )\n # A variant can have multiple predicted consequences on a transcript, the most severe one is selected\n .join(\n f.broadcast(vep_lut),\n on=\"label\",\n how=\"inner\",\n )\n .filter(f.col(\"score\") != 0)\n .transform(\n lambda df: get_record_with_maximum_value(\n df, [\"variantId\", \"geneId\"], \"score\"\n )\n ),\n _schema=V2G.get_schema(),\n )\n\n def get_polyphen_v2g(\n self: VariantAnnotation, filter_by: Optional[GeneIndex] = None\n ) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a PolyPhen's predicted score on the transcript.\n\n Polyphen informs about the probability that a substitution is damaging. Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by. Defaults to None.\n\n Returns:\n V2G: variant to gene assignments with their polyphen scores\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.polyphenScore\").isNotNull())\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n f.col(\"transcriptConsequence.polyphenScore\").alias(\"score\"),\n f.col(\"transcriptConsequence.polyphenPrediction\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"polyphen\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n\n def get_sift_v2g(self: VariantAnnotation, filter_by: GeneIndex) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a SIFT's predicted score on the transcript.\n\n SIFT informs about the probability that a substitution is tolerated so scores nearer zero are more likely to be deleterious.\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n\n Returns:\n V2G: variant to gene assignments with their SIFT scores\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.siftScore\").isNotNull())\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n f.expr(\"1 - transcriptConsequence.siftScore\").alias(\"score\"),\n f.col(\"transcriptConsequence.siftPrediction\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"sift\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n\n def get_plof_v2g(self: VariantAnnotation, filter_by: GeneIndex) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n\n Returns:\n V2G: variant to gene assignments from the LOFTEE 
algorithm\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.lof\").isNotNull())\n .withColumn(\n \"isHighQualityPlof\",\n f.when(f.col(\"transcriptConsequence.lof\") == \"HC\", True).when(\n f.col(\"transcriptConsequence.lof\") == \"LC\", False\n ),\n )\n .withColumn(\n \"score\",\n f.when(f.col(\"isHighQualityPlof\"), 1.0).when(\n ~f.col(\"isHighQualityPlof\"), 0\n ),\n )\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n \"isHighQualityPlof\",\n f.col(\"score\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"loftee\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n\n def get_distance_to_tss(\n self: VariantAnnotation,\n filter_by: GeneIndex,\n max_distance: int = 500_000,\n ) -> V2G:\n \"\"\"Extracts variant to gene assignments for variants falling within a window of a gene's TSS.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n max_distance (int): The maximum distance from the TSS to consider. Defaults to 500_000.\n\n Returns:\n V2G: variant to gene assignments with their distance to the TSS\n \"\"\"\n return V2G(\n _df=(\n self.df.alias(\"variant\")\n .join(\n f.broadcast(filter_by.locations_lut()).alias(\"gene\"),\n on=[\n f.col(\"variant.chromosome\") == f.col(\"gene.chromosome\"),\n f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n <= max_distance,\n ],\n how=\"inner\",\n )\n .withColumn(\n \"inverse_distance\",\n max_distance - f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\")),\n )\n .transform(lambda df: normalise_column(df, \"inverse_distance\", \"score\"))\n .select(\n \"variantId\",\n f.col(\"variant.chromosome\").alias(\"chromosome\"),\n \"position\",\n \"geneId\",\n \"score\",\n f.lit(\"distance\").alias(\"datatypeId\"),\n f.lit(\"canonical_tss\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Extracts variant to gene assignments for variants falling within a window of a gene's TSS.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `filter_by` | `GeneIndex` | A gene index to filter by. | required |
| `max_distance` | `int` | The maximum distance from the TSS to consider. | `500_000` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `V2G` | `V2G` | Variant to gene assignments with their distance to the TSS |
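To make the distance weighting concrete, here is a minimal PySpark sketch: the raw value is the inverse distance to the TSS, which the pipeline then rescales to a 0 to 1 score (the `normalise_column` helper is assumed to perform a min-max rescale and is not shown here; the variant/gene pairs are made up).

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Made-up variant/gene pairs: variant position and the gene's TSS
pairs = spark.createDataFrame(
    [("1_1000000_A_T", 1_000_000, 1_010_000), ("1_1000000_A_T", 1_000_000, 1_400_000)],
    ["variantId", "position", "tss"],
)
max_distance = 500_000

# Variants closer to the TSS get a larger inverse distance; the pipeline then
# rescales this column to [0, 1] (assumed min-max normalisation, not shown here)
pairs.withColumn(
    "inverse_distance", max_distance - f.abs(f.col("position") - f.col("tss"))
).show()
```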
Source code in src/otg/dataset/variant_annotation.py
def get_distance_to_tss(\n self: VariantAnnotation,\n filter_by: GeneIndex,\n max_distance: int = 500_000,\n) -> V2G:\n \"\"\"Extracts variant to gene assignments for variants falling within a window of a gene's TSS.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n max_distance (int): The maximum distance from the TSS to consider. Defaults to 500_000.\n\n Returns:\n V2G: variant to gene assignments with their distance to the TSS\n \"\"\"\n return V2G(\n _df=(\n self.df.alias(\"variant\")\n .join(\n f.broadcast(filter_by.locations_lut()).alias(\"gene\"),\n on=[\n f.col(\"variant.chromosome\") == f.col(\"gene.chromosome\"),\n f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n <= max_distance,\n ],\n how=\"inner\",\n )\n .withColumn(\n \"inverse_distance\",\n max_distance - f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\")),\n )\n .transform(lambda df: normalise_column(df, \"inverse_distance\", \"score\"))\n .select(\n \"variantId\",\n f.col(\"variant.chromosome\").alias(\"chromosome\"),\n \"position\",\n \"geneId\",\n \"score\",\n f.lit(\"distance\").alias(\"datatypeId\"),\n f.lit(\"canonical_tss\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Creates a dataset with variant to gene assignments based on VEP's predicted consequence on the transcript.
Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `vep_consequences` | `DataFrame` | A dataframe of VEP consequences | required |
| `filter_by` | `GeneIndex` | A gene index to filter by. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `V2G` | `V2G` | High and medium severity variant to gene assignments |
Source code in src/otg/dataset/variant_annotation.py
def get_most_severe_vep_v2g(\n self: VariantAnnotation,\n vep_consequences: DataFrame,\n filter_by: GeneIndex,\n) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments based on VEP's predicted consequence on the transcript.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n vep_consequences (DataFrame): A dataframe of VEP consequences\n filter_by (GeneIndex): A gene index to filter by. Defaults to None.\n\n Returns:\n V2G: High and medium severity variant to gene assignments\n \"\"\"\n vep_lut = vep_consequences.select(\n f.element_at(f.split(\"Accession\", r\"/\"), -1).alias(\n \"variantFunctionalConsequenceId\"\n ),\n f.col(\"Term\").alias(\"label\"),\n f.col(\"v2g_score\").cast(\"double\").alias(\"score\"),\n )\n\n return V2G(\n _df=self.get_transcript_consequence_df(filter_by).select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n f.explode(\"transcriptConsequence.consequenceTerms\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"variantConsequence\").alias(\"datasourceId\"),\n )\n # A variant can have multiple predicted consequences on a transcript, the most severe one is selected\n .join(\n f.broadcast(vep_lut),\n on=\"label\",\n how=\"inner\",\n )\n .filter(f.col(\"score\") != 0)\n .transform(\n lambda df: get_record_with_maximum_value(\n df, [\"variantId\", \"geneId\"], \"score\"\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.
Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `filter_by` | `GeneIndex` | A gene index to filter by. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `V2G` | `V2G` | Variant to gene assignments from the LOFTEE algorithm |
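As a quick illustration of how the LOFTEE flag is turned into a score, here is a minimal sketch using a toy `lof` column with the high-confidence (`HC`) and low-confidence (`LC`) values LOFTEE emits; the variant identifiers are made up.

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Toy LOFTEE calls: HC = high-confidence loss of function, LC = low-confidence
calls = spark.createDataFrame([("var1", "HC"), ("var2", "LC")], ["variantId", "lof"])

(
    calls.withColumn(
        "isHighQualityPlof",
        f.when(f.col("lof") == "HC", True).when(f.col("lof") == "LC", False),
    )
    .withColumn(
        "score",
        f.when(f.col("isHighQualityPlof"), 1.0).when(~f.col("isHighQualityPlof"), 0.0),
    )
    .show()
)
```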
Source code in src/otg/dataset/variant_annotation.py
def get_plof_v2g(self: VariantAnnotation, filter_by: GeneIndex) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n\n Returns:\n V2G: variant to gene assignments from the LOFTEE algorithm\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.lof\").isNotNull())\n .withColumn(\n \"isHighQualityPlof\",\n f.when(f.col(\"transcriptConsequence.lof\") == \"HC\", True).when(\n f.col(\"transcriptConsequence.lof\") == \"LC\", False\n ),\n )\n .withColumn(\n \"score\",\n f.when(f.col(\"isHighQualityPlof\"), 1.0).when(\n ~f.col(\"isHighQualityPlof\"), 0\n ),\n )\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n \"isHighQualityPlof\",\n f.col(\"score\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"loftee\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Creates a dataset with variant to gene assignments with PolyPhen's predicted score on the transcript.
PolyPhen informs about the probability that a substitution is damaging. Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `filter_by` | `GeneIndex` | A gene index to filter by. Defaults to None. | `None` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `V2G` | `V2G` | Variant to gene assignments with their PolyPhen scores |
Source code in src/otg/dataset/variant_annotation.py
def get_polyphen_v2g(\n self: VariantAnnotation, filter_by: Optional[GeneIndex] = None\n) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a PolyPhen's predicted score on the transcript.\n\n Polyphen informs about the probability that a substitution is damaging. Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by. Defaults to None.\n\n Returns:\n V2G: variant to gene assignments with their polyphen scores\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.polyphenScore\").isNotNull())\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n f.col(\"transcriptConsequence.polyphenScore\").alias(\"score\"),\n f.col(\"transcriptConsequence.polyphenPrediction\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"polyphen\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Provides the schema for the VariantAnnotation dataset.
Source code in src/otg/dataset/variant_annotation.py
```python
@classmethod
def get_schema(cls: type[VariantAnnotation]) -> StructType:
    """Provides the schema for the VariantAnnotation dataset."""
    return parse_spark_schema("variant_annotation.json")
```
Creates a dataset with variant to gene assignments with SIFT's predicted score on the transcript.
SIFT informs about the probability that a substitution is tolerated, so scores nearer zero are more likely to be deleterious. Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `filter_by` | `GeneIndex` | A gene index to filter by. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `V2G` | `V2G` | Variant to gene assignments with their SIFT scores |
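Because raw SIFT scores close to zero indicate deleterious substitutions, the V2G score is the inverted value. A minimal sketch with made-up scores:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Made-up raw SIFT scores: 0.0 is most deleterious, 1.0 most tolerated
sift = spark.createDataFrame([(0.0,), (0.05,), (0.8,)], ["siftScore"])

# Invert the score so that higher values mean a stronger variant-to-gene signal
sift.withColumn("score", f.expr("1 - siftScore")).show()
```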
Source code in src/otg/dataset/variant_annotation.py
def get_sift_v2g(self: VariantAnnotation, filter_by: GeneIndex) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a SIFT's predicted score on the transcript.\n\n SIFT informs about the probability that a substitution is tolerated so scores nearer zero are more likely to be deleterious.\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n\n Returns:\n V2G: variant to gene assignments with their SIFT scores\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.siftScore\").isNotNull())\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n f.expr(\"1 - transcriptConsequence.siftScore\").alias(\"score\"),\n f.col(\"transcriptConsequence.siftPrediction\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"sift\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `filter_by` | `GeneIndex` | A gene index. Defaults to None. | `None` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `DataFrame` | `DataFrame` | A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence |
Source code in src/otg/dataset/variant_annotation.py
def get_transcript_consequence_df(\n self: VariantAnnotation, filter_by: Optional[GeneIndex] = None\n) -> DataFrame:\n \"\"\"Dataframe of exploded transcript consequences.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index. Defaults to None.\n\n Returns:\n DataFrame: A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence\n \"\"\"\n # exploding the array removes records without VEP annotation\n transript_consequences = self.df.withColumn(\n \"transcriptConsequence\", f.explode(\"vep.transcriptConsequences\")\n ).select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"transcriptConsequence\",\n f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n )\n if filter_by:\n transript_consequences = transript_consequences.join(\n f.broadcast(filter_by.df),\n on=[\"chromosome\", \"geneId\"],\n )\n return transript_consequences.persist()\n
Variant index dataset is the result of intersecting the variant annotation dataset with the variants with V2D available information.
Source code in src/otg/dataset/variant_index.py
@dataclass\nclass VariantIndex(Dataset):\n \"\"\"Variant index dataset.\n\n Variant index dataset is the result of intersecting the variant annotation dataset with the variants with V2D available information.\n \"\"\"\n\n @classmethod\n def get_schema(cls: type[VariantIndex]) -> StructType:\n \"\"\"Provides the schema for the VariantIndex dataset.\"\"\"\n return parse_spark_schema(\"variant_index.json\")\n\n @classmethod\n def from_variant_annotation(\n cls: type[VariantIndex],\n variant_annotation: VariantAnnotation,\n ) -> VariantIndex:\n \"\"\"Initialise VariantIndex from pre-existing variant annotation dataset.\"\"\"\n unchanged_cols = [\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"referenceAllele\",\n \"alternateAllele\",\n \"chromosomeB37\",\n \"positionB37\",\n \"alleleType\",\n \"alleleFrequencies\",\n \"cadd\",\n ]\n return cls(\n _df=(\n variant_annotation.df.select(\n *unchanged_cols,\n f.col(\"vep.mostSevereConsequence\").alias(\"mostSevereConsequence\"),\n # filters/rsid are arrays that can be empty, in this case we convert them to null\n nullify_empty_array(f.col(\"rsIds\")).alias(\"rsIds\"),\n )\n .repartition(400, \"chromosome\")\n .sortWithinPartitions(\"chromosome\", \"position\")\n ),\n _schema=cls.get_schema(),\n )\n
```python
@classmethod
def get_schema(cls: type[VariantIndex]) -> StructType:
    """Provides the schema for the VariantIndex dataset."""
    return parse_spark_schema("variant_index.json")
```
A variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific biofeatures (e.g. cell types)
Source code in src/otg/dataset/v2g.py
@dataclass\nclass V2G(Dataset):\n \"\"\"Variant-to-gene (V2G) evidence dataset.\n\n A variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific `biofeatures` (e.g. cell types)\n \"\"\"\n\n @classmethod\n def get_schema(cls: type[V2G]) -> StructType:\n \"\"\"Provides the schema for the V2G dataset.\"\"\"\n return parse_spark_schema(\"v2g.json\")\n\n def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:\n \"\"\"Filter by V2G dataset by genes.\n\n Args:\n genes (GeneIndex): Gene index dataset to filter by\n\n Returns:\n V2G: V2G dataset filtered by genes\n \"\"\"\n self.df = self._df.join(genes.df.select(\"geneId\"), on=\"geneId\", how=\"inner\")\n return self\n
```python
@classmethod
def get_schema(cls: type[V2G]) -> StructType:
    """Provides the schema for the V2G dataset."""
    return parse_spark_schema("v2g.json")
```
The following information is aggregated/extracted:
- Study ID in the special format (FINNGEN_R9_*)
- Trait name (for example, Amoebiasis)
- Number of cases and controls
- Link to the summary statistics location
Some fields are also populated as constants, such as study type and the initial sample size.
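A minimal sketch of how the study identifier and summary statistics location are assembled from the raw manifest. The release prefix and URL prefix/suffix below are placeholders, not the production configuration:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw FinnGen manifest row and release settings
studies = spark.createDataFrame(
    [("AB1_AMOEBIASIS", "Amoebiasis")], ["phenocode", "phenostring"]
)
release_prefix = "FINNGEN_R9"        # assumed release prefix
url_prefix = "gs://finngen-bucket/"  # made-up location prefix
url_suffix = ".gz"                   # made-up suffix

studies.select(
    f.concat(f.lit(f"{release_prefix}_"), f.col("phenocode")).alias("studyId"),
    f.col("phenostring").alias("traitFromSource"),
    f.concat(f.lit(url_prefix), f.col("phenocode"), f.lit(url_suffix)).alias(
        "summarystatsLocation"
    ),
).show(truncate=False)
```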
Source code in src/otg/datasource/finngen/study_index.py
class FinnGenStudyIndex(StudyIndex):\n \"\"\"Study index dataset from FinnGen.\n\n The following information is aggregated/extracted:\n\n - Study ID in the special format (FINNGEN_R9_*)\n - Trait name (for example, Amoebiasis)\n - Number of cases and controls\n - Link to the summary statistics location\n\n Some fields are also populated as constants, such as study type and the initial sample size.\n \"\"\"\n\n @classmethod\n def from_source(\n cls: type[FinnGenStudyIndex],\n finngen_studies: DataFrame,\n finngen_release_prefix: str,\n finngen_summary_stats_url_prefix: str,\n finngen_summary_stats_url_suffix: str,\n ) -> FinnGenStudyIndex:\n \"\"\"This function ingests study level metadata from FinnGen.\n\n Args:\n finngen_studies (DataFrame): FinnGen raw study table\n finngen_release_prefix (str): Release prefix pattern.\n finngen_summary_stats_url_prefix (str): URL prefix for summary statistics location.\n finngen_summary_stats_url_suffix (str): URL prefix suffix for summary statistics location.\n\n Returns:\n FinnGenStudyIndex: Parsed and annotated FinnGen study table.\n \"\"\"\n return FinnGenStudyIndex(\n _df=finngen_studies.select(\n f.concat(f.lit(f\"{finngen_release_prefix}_\"), f.col(\"phenocode\")).alias(\n \"studyId\"\n ),\n f.col(\"phenostring\").alias(\"traitFromSource\"),\n f.col(\"num_cases\").alias(\"nCases\"),\n f.col(\"num_controls\").alias(\"nControls\"),\n (f.col(\"num_cases\") + f.col(\"num_controls\")).alias(\"nSamples\"),\n f.lit(finngen_release_prefix).alias(\"projectId\"),\n f.lit(\"gwas\").alias(\"studyType\"),\n f.lit(True).alias(\"hasSumstats\"),\n f.lit(\"377,277 (210,870 females and 166,407 males)\").alias(\n \"initialSampleSize\"\n ),\n f.array(\n f.struct(\n f.lit(377277).cast(\"long\").alias(\"sampleSize\"),\n f.lit(\"Finnish\").alias(\"ancestry\"),\n )\n ).alias(\"discoverySamples\"),\n f.concat(\n f.lit(finngen_summary_stats_url_prefix),\n f.col(\"phenocode\"),\n f.lit(finngen_summary_stats_url_suffix),\n ).alias(\"summarystatsLocation\"),\n ).withColumn(\n \"ldPopulationStructure\",\n cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n ),\n _schema=cls.get_schema(),\n )\n
The information comes from LD matrices made available by GnomAD in Hail's native format. We aggregate the LD information across 8 ancestries. The basic steps to generate the LDIndex are:
1. Convert an LD matrix to a Spark DataFrame.
2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.
3. Aggregate the LD information across populations.
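One detail worth noting in step 1 is that the matrices store the correlation r, while the cut-off is configured as r2, so entries are kept when |r| >= sqrt(min_r2). A tiny illustration with made-up values:

```python
# The LD matrices store r, while the threshold is configured as r2,
# so entries are kept when |r| >= sqrt(min_r2)
min_r2 = 0.5
r_values = [0.9, -0.75, 0.3]

kept = [r for r in r_values if abs(r) >= min_r2**0.5]
print(kept)  # [0.9, -0.75]
```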
Source code in src/otg/datasource/gnomad/ld.py
class GnomADLDMatrix:\n \"\"\"Importer of LD information from GnomAD.\n\n The information comes from LD matrices [made available by GnomAD](https://gnomad.broadinstitute.org/downloads/#v2-linkage-disequilibrium) in Hail's native format. We aggregate the LD information across 8 ancestries.\n The basic steps to generate the LDIndex are:\n\n 1. Convert a LD matrix to a Spark DataFrame.\n 2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.\n 3. Aggregate the LD information across populations.\n\n \"\"\"\n\n @staticmethod\n def _aggregate_ld_index_across_populations(\n unaggregated_ld_index: DataFrame,\n ) -> DataFrame:\n \"\"\"Aggregate LDIndex across populations.\n\n Args:\n unaggregated_ld_index (DataFrame): Unaggregate LDIndex index dataframe each row is a variant pair in a population\n\n Returns:\n DataFrame: Aggregated LDIndex index dataframe each row is a variant with the LD set across populations\n\n Examples:\n >>> data = [(\"1.0\", \"var1\", \"X\", \"var1\", \"pop1\"), (\"1.0\", \"X\", \"var2\", \"var2\", \"pop1\"),\n ... (\"0.5\", \"var1\", \"X\", \"var2\", \"pop1\"), (\"0.5\", \"var1\", \"X\", \"var2\", \"pop2\"),\n ... (\"0.5\", \"var2\", \"X\", \"var1\", \"pop1\"), (\"0.5\", \"X\", \"var2\", \"var1\", \"pop2\")]\n >>> df = spark.createDataFrame(data, [\"r\", \"variantId\", \"chromosome\", \"tagvariantId\", \"population\"])\n >>> GnomADLDMatrix._aggregate_ld_index_across_populations(df).printSchema()\n root\n |-- variantId: string (nullable = true)\n |-- chromosome: string (nullable = true)\n |-- ldSet: array (nullable = false)\n | |-- element: struct (containsNull = false)\n | | |-- tagVariantId: string (nullable = true)\n | | |-- rValues: array (nullable = false)\n | | | |-- element: struct (containsNull = false)\n | | | | |-- population: string (nullable = true)\n | | | | |-- r: string (nullable = true)\n <BLANKLINE>\n \"\"\"\n return (\n unaggregated_ld_index\n # First level of aggregation: get r/population for each variant/tagVariant pair\n .withColumn(\"r_pop_struct\", f.struct(\"population\", \"r\"))\n .groupBy(\"chromosome\", \"variantId\", \"tagVariantId\")\n .agg(\n f.collect_set(\"r_pop_struct\").alias(\"rValues\"),\n )\n # Second level of aggregation: get r/population for each variant\n .withColumn(\"r_pop_tag_struct\", f.struct(\"tagVariantId\", \"rValues\"))\n .groupBy(\"variantId\", \"chromosome\")\n .agg(\n f.collect_set(\"r_pop_tag_struct\").alias(\"ldSet\"),\n )\n )\n\n @staticmethod\n def _convert_ld_matrix_to_table(\n block_matrix: BlockMatrix, min_r2: float\n ) -> DataFrame:\n \"\"\"Convert LD matrix to table.\"\"\"\n table = block_matrix.entries(keyed=False)\n return (\n table.filter(hl.abs(table.entry) >= min_r2**0.5)\n .to_spark()\n .withColumnRenamed(\"entry\", \"r\")\n )\n\n @staticmethod\n def _create_ldindex_for_population(\n population_id: str,\n ld_matrix_path: str,\n ld_index_raw_path: str,\n grch37_to_grch38_chain_path: str,\n min_r2: float,\n ) -> DataFrame:\n \"\"\"Create LDIndex for a specific population.\"\"\"\n # Prepare LD Block matrix\n ld_matrix = GnomADLDMatrix._convert_ld_matrix_to_table(\n BlockMatrix.read(ld_matrix_path), min_r2\n )\n\n # Prepare table with variant indices\n ld_index = GnomADLDMatrix._process_variant_indices(\n hl.read_table(ld_index_raw_path),\n grch37_to_grch38_chain_path,\n )\n\n return GnomADLDMatrix._resolve_variant_indices(ld_index, ld_matrix).select(\n \"*\",\n f.lit(population_id).alias(\"population\"),\n )\n\n @staticmethod\n def _process_variant_indices(\n ld_index_raw: 
hl.Table, grch37_to_grch38_chain_path: str\n ) -> DataFrame:\n \"\"\"Creates a look up table between variants and their coordinates in the LD Matrix.\n\n !!! info \"Gnomad's LD Matrix and Index are based on GRCh37 coordinates. This function will lift over the coordinates to GRCh38 to build the lookup table.\"\n\n Args:\n ld_index_raw (hl.Table): LD index table from GnomAD\n grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n\n Returns:\n DataFrame: Look up table between variants in build hg38 and their coordinates in the LD Matrix\n \"\"\"\n ld_index_38 = _liftover_loci(\n ld_index_raw, grch37_to_grch38_chain_path, \"GRCh38\"\n )\n\n return (\n ld_index_38.to_spark()\n # Filter out variants where the liftover failed\n .filter(f.col(\"`locus_GRCh38.position`\").isNotNull())\n .withColumn(\n \"chromosome\", f.regexp_replace(\"`locus_GRCh38.contig`\", \"chr\", \"\")\n )\n .withColumn(\n \"position\",\n convert_gnomad_position_to_ensembl(\n f.col(\"`locus_GRCh38.position`\"),\n f.col(\"`alleles`\").getItem(0),\n f.col(\"`alleles`\").getItem(1),\n ),\n )\n .select(\n \"chromosome\",\n f.concat_ws(\n \"_\",\n f.col(\"chromosome\"),\n f.col(\"position\"),\n f.col(\"`alleles`\").getItem(0),\n f.col(\"`alleles`\").getItem(1),\n ).alias(\"variantId\"),\n f.col(\"idx\"),\n )\n # Filter out ambiguous liftover results: multiple indices for the same variant\n .withColumn(\"count\", f.count(\"*\").over(Window.partitionBy([\"variantId\"])))\n .filter(f.col(\"count\") == 1)\n .drop(\"count\")\n )\n\n @staticmethod\n def _resolve_variant_indices(\n ld_index: DataFrame, ld_matrix: DataFrame\n ) -> DataFrame:\n \"\"\"Resolve the `i` and `j` indices of the block matrix to variant IDs (build 38).\"\"\"\n ld_index_i = ld_index.selectExpr(\n \"idx as i\", \"variantId as variantId_i\", \"chromosome\"\n )\n ld_index_j = ld_index.selectExpr(\"idx as j\", \"variantId as variantId_j\")\n return (\n ld_matrix.join(ld_index_i, on=\"i\", how=\"inner\")\n .join(ld_index_j, on=\"j\", how=\"inner\")\n .drop(\"i\", \"j\")\n )\n\n @staticmethod\n def _transpose_ld_matrix(ld_matrix: DataFrame) -> DataFrame:\n \"\"\"Transpose LD matrix to a square matrix format.\n\n Args:\n ld_matrix (DataFrame): Triangular LD matrix converted to a Spark DataFrame\n\n Returns:\n DataFrame: Square LD matrix without diagonal duplicates\n\n Examples:\n >>> df = spark.createDataFrame(\n ... [\n ... (1, 1, 1.0, \"1\", \"AFR\"),\n ... (1, 2, 0.5, \"1\", \"AFR\"),\n ... (2, 2, 1.0, \"1\", \"AFR\"),\n ... ],\n ... [\"variantId_i\", \"variantId_j\", \"r\", \"chromosome\", \"population\"],\n ... 
)\n >>> GnomADLDMatrix._transpose_ld_matrix(df).show()\n +-----------+-----------+---+----------+----------+\n |variantId_i|variantId_j| r|chromosome|population|\n +-----------+-----------+---+----------+----------+\n | 1| 2|0.5| 1| AFR|\n | 1| 1|1.0| 1| AFR|\n | 2| 1|0.5| 1| AFR|\n | 2| 2|1.0| 1| AFR|\n +-----------+-----------+---+----------+----------+\n <BLANKLINE>\n \"\"\"\n ld_matrix_transposed = ld_matrix.selectExpr(\n \"variantId_i as variantId_j\",\n \"variantId_j as variantId_i\",\n \"r\",\n \"chromosome\",\n \"population\",\n )\n return ld_matrix.filter(\n f.col(\"variantId_i\") != f.col(\"variantId_j\")\n ).unionByName(ld_matrix_transposed)\n\n @classmethod\n def as_ld_index(\n cls: type[GnomADLDMatrix],\n ld_populations: list[str],\n ld_matrix_template: str,\n ld_index_raw_template: str,\n grch37_to_grch38_chain_path: str,\n min_r2: float,\n ) -> LDIndex:\n \"\"\"Create LDIndex dataset aggregating the LD information across a set of populations.\"\"\"\n ld_indices_unaggregated = []\n for pop in ld_populations:\n try:\n ld_matrix_path = ld_matrix_template.format(POP=pop)\n ld_index_raw_path = ld_index_raw_template.format(POP=pop)\n pop_ld_index = cls._create_ldindex_for_population(\n pop,\n ld_matrix_path,\n ld_index_raw_path.format(pop),\n grch37_to_grch38_chain_path,\n min_r2,\n )\n ld_indices_unaggregated.append(pop_ld_index)\n except Exception as e:\n print(f\"Failed to create LDIndex for population {pop}: {e}\")\n sys.exit(1)\n\n ld_index_unaggregated = (\n GnomADLDMatrix._transpose_ld_matrix(\n reduce(lambda df1, df2: df1.unionByName(df2), ld_indices_unaggregated)\n )\n .withColumnRenamed(\"variantId_i\", \"variantId\")\n .withColumnRenamed(\"variantId_j\", \"tagVariantId\")\n )\n return LDIndex(\n _df=cls._aggregate_ld_index_across_populations(ld_index_unaggregated),\n _schema=LDIndex.get_schema(),\n )\n
GnomAD variants included in the GnomAD genomes dataset.
Source code in src/otg/datasource/gnomad/variants.py
class GnomADVariants:\n \"\"\"GnomAD variants included in the GnomAD genomes dataset.\"\"\"\n\n @staticmethod\n def _convert_gnomad_position_to_ensembl_hail(\n position: Int32Expression,\n reference: StringExpression,\n alternate: StringExpression,\n ) -> Int32Expression:\n \"\"\"Convert GnomAD variant position to Ensembl variant position in hail table.\n\n For indels (the reference or alternate allele is longer than 1), then adding 1 to the position, for SNPs, the position is unchanged.\n More info about the problem: https://www.biostars.org/p/84686/\n\n Args:\n position (Int32Expression): Position of the variant in the GnomAD genome.\n reference (StringExpression): The reference allele.\n alternate (StringExpression): The alternate allele\n\n Returns:\n The position of the variant according to Ensembl genome.\n \"\"\"\n return hl.if_else(\n (reference.length() > 1) | (alternate.length() > 1), position + 1, position\n )\n\n @classmethod\n def as_variant_annotation(\n cls: type[GnomADVariants],\n gnomad_file: str,\n grch38_to_grch37_chain: str,\n populations: list,\n ) -> VariantAnnotation:\n \"\"\"Generate variant annotation dataset from gnomAD.\n\n Some relevant modifications to the original dataset are:\n\n 1. The transcript consequences features provided by VEP are filtered to only refer to the Ensembl canonical transcript.\n 2. Genome coordinates are liftovered from GRCh38 to GRCh37 to keep as annotation.\n 3. Field names are converted to camel case to follow the convention.\n\n Args:\n gnomad_file (str): Path to `gnomad.genomes.vX.X.X.sites.ht` gnomAD dataset\n grch38_to_grch37_chain (str): Path to chain file for liftover\n populations (list): List of populations to include in the dataset\n\n Returns:\n VariantAnnotation: Variant annotation dataset\n \"\"\"\n # Load variants dataset\n ht = hl.read_table(\n gnomad_file,\n _load_refs=False,\n )\n\n # Liftover\n grch37 = hl.get_reference(\"GRCh37\")\n grch38 = hl.get_reference(\"GRCh38\")\n grch38.add_liftover(grch38_to_grch37_chain, grch37)\n\n # Drop non biallelic variants\n ht = ht.filter(ht.alleles.length() == 2)\n # Liftover\n ht = ht.annotate(locus_GRCh37=hl.liftover(ht.locus, \"GRCh37\"))\n # Select relevant fields and nested records to create class\n return VariantAnnotation(\n _df=(\n ht.select(\n gnomad3VariantId=hl.str(\"-\").join(\n [\n ht.locus.contig.replace(\"chr\", \"\"),\n hl.str(ht.locus.position),\n ht.alleles[0],\n ht.alleles[1],\n ]\n ),\n chromosome=ht.locus.contig.replace(\"chr\", \"\"),\n position=GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n ht.locus.position, ht.alleles[0], ht.alleles[1]\n ),\n variantId=hl.str(\"_\").join(\n [\n ht.locus.contig.replace(\"chr\", \"\"),\n hl.str(\n GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n ht.locus.position, ht.alleles[0], ht.alleles[1]\n )\n ),\n ht.alleles[0],\n ht.alleles[1],\n ]\n ),\n chromosomeB37=ht.locus_GRCh37.contig.replace(\"chr\", \"\"),\n positionB37=ht.locus_GRCh37.position,\n referenceAllele=ht.alleles[0],\n alternateAllele=ht.alleles[1],\n rsIds=ht.rsid,\n alleleType=ht.allele_info.allele_type,\n cadd=hl.struct(\n phred=ht.cadd.phred,\n raw=ht.cadd.raw_score,\n ),\n alleleFrequencies=hl.set([f\"{pop}-adj\" for pop in populations]).map(\n lambda p: hl.struct(\n populationName=p,\n alleleFrequency=ht.freq[ht.globals.freq_index_dict[p]].AF,\n )\n ),\n vep=hl.struct(\n mostSevereConsequence=ht.vep.most_severe_consequence,\n transcriptConsequences=hl.map(\n lambda x: hl.struct(\n aminoAcids=x.amino_acids,\n 
consequenceTerms=x.consequence_terms,\n geneId=x.gene_id,\n lof=x.lof,\n polyphenScore=x.polyphen_score,\n polyphenPrediction=x.polyphen_prediction,\n siftScore=x.sift_score,\n siftPrediction=x.sift_prediction,\n ),\n # Only keeping canonical transcripts\n ht.vep.transcript_consequences.filter(\n lambda x: (x.canonical == 1)\n & (x.gene_symbol_source == \"HGNC\")\n ),\n ),\n ),\n )\n .key_by(\"chromosome\", \"position\")\n .drop(\"locus\", \"alleles\")\n .select_globals()\n .to_spark(flatten=False)\n ),\n _schema=VariantAnnotation.get_schema(),\n )\n
If the disease assigned to the study and the disease assigned to the association disagree, we assume the study needs to be split. Disease EFOs, trait names and study IDs are then consolidated.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `studies` | `GWASCatalogStudyIndex` | GWAS Catalog studies. | required |
| `associations` | `GWASCatalogAssociations` | GWAS Catalog associations. | required |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]` | A tuple of the split studies and associations. |
Source code in src/otg/datasource/gwas_catalog/study_splitter.py
@classmethod\ndef split(\n cls: type[GWASCatalogStudySplitter],\n studies: GWASCatalogStudyIndex,\n associations: GWASCatalogAssociations,\n) -> Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]:\n \"\"\"Splitting multi-trait GWAS Catalog studies.\n\n If assigned disease of the study and the association don't agree, we assume the study needs to be split.\n Then disease EFOs, trait names and study ID are consolidated\n\n Args:\n studies (GWASCatalogStudyIndex): GWAS Catalog studies.\n associations (StudyLocusGWASCatalog): GWAS Catalog associations.\n\n Returns:\n A tuple of the split associations and studies.\n \"\"\"\n # Composite of studies and associations to resolve scattered information\n st_ass = (\n associations.df.join(f.broadcast(studies.df), on=\"studyId\", how=\"inner\")\n .select(\n \"studyId\",\n \"subStudyDescription\",\n cls._resolve_study_id(\n f.col(\"studyId\"), f.col(\"subStudyDescription\")\n ).alias(\"updatedStudyId\"),\n cls._resolve_trait(\n f.col(\"traitFromSource\"),\n f.split(\"subStudyDescription\", r\"\\|\").getItem(0),\n f.split(\"subStudyDescription\", r\"\\|\").getItem(1),\n ).alias(\"traitFromSource\"),\n cls._resolve_efo(\n f.split(\"subStudyDescription\", r\"\\|\").getItem(2),\n f.col(\"traitFromSourceMappedIds\"),\n ).alias(\"traitFromSourceMappedIds\"),\n )\n .persist()\n )\n\n return (\n studies.update_study_id(\n st_ass.select(\n \"studyId\",\n \"updatedStudyId\",\n \"traitFromSource\",\n \"traitFromSourceMappedIds\",\n ).distinct()\n ),\n associations.update_study_id(\n st_ass.select(\n \"updatedStudyId\", \"studyId\", \"subStudyDescription\"\n ).distinct()\n )._qc_ambiguous_study(),\n )\n
Some fields are populated as constants, such as projectID, studyType, and initial sample size.
Source code in src/otg/datasource/ukbiobank/study_index.py
class UKBiobankStudyIndex(StudyIndex):\n \"\"\"Study index dataset from UKBiobank.\n\n The following information is extracted:\n\n - studyId\n - pubmedId\n - publicationDate\n - publicationJournal\n - publicationTitle\n - publicationFirstAuthor\n - traitFromSource\n - ancestry_discoverySamples\n - ancestry_replicationSamples\n - initialSampleSize\n - nCases\n - replicationSamples\n\n Some fields are populated as constants, such as projectID, studyType, and initial sample size.\n \"\"\"\n\n @classmethod\n def from_source(\n cls: type[UKBiobankStudyIndex],\n ukbiobank_studies: DataFrame,\n ) -> UKBiobankStudyIndex:\n \"\"\"This function ingests study level metadata from UKBiobank.\n\n The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).\n\n The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.\n\n Args:\n ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session.\n\n Returns:\n UKBiobankStudyIndex: Annotated UKBiobank study table.\n \"\"\"\n return StudyIndex(\n _df=(\n ukbiobank_studies.select(\n f.col(\"code\").alias(\"studyId\"),\n f.lit(\"UKBiobank\").alias(\"projectId\"),\n f.lit(\"gwas\").alias(\"studyType\"),\n f.col(\"trait\").alias(\"traitFromSource\"),\n # Make publication and ancestry schema columns.\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"30104761\").alias(\n \"pubmedId\"\n ),\n f.when(\n f.col(\"code\").startswith(\"SAIGE_\"),\n \"Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies\",\n )\n .otherwise(None)\n .alias(\"publicationTitle\"),\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Wei Zhou\").alias(\n \"publicationFirstAuthor\"\n ),\n f.when(f.col(\"code\").startswith(\"NEALE2_\"), \"2018-08-01\")\n .otherwise(\"2018-10-24\")\n .alias(\"publicationDate\"),\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Nature Genetics\").alias(\n \"publicationJournal\"\n ),\n f.col(\"n_total\").cast(\"string\").alias(\"initialSampleSize\"),\n f.col(\"n_cases\").cast(\"long\").alias(\"nCases\"),\n f.array(\n f.struct(\n f.col(\"n_total\").cast(\"long\").alias(\"sampleSize\"),\n f.concat(f.lit(\"European=\"), f.col(\"n_total\")).alias(\n \"ancestry\"\n ),\n )\n ).alias(\"discoverySamples\"),\n f.col(\"in_path\").alias(\"summarystatsLocation\"),\n f.lit(True).alias(\"hasSumstats\"),\n )\n .withColumn(\n \"traitFromSource\",\n f.when(\n f.col(\"traitFromSource\").contains(\":\"),\n f.concat(\n f.initcap(\n f.split(f.col(\"traitFromSource\"), \": \").getItem(1)\n ),\n f.lit(\" | \"),\n f.lower(f.split(f.col(\"traitFromSource\"), \": \").getItem(0)),\n ),\n ).otherwise(f.col(\"traitFromSource\")),\n )\n .withColumn(\n \"ldPopulationStructure\",\n cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n )\n ),\n _schema=StudyIndex.get_schema(),\n )\n
This function ingests study level metadata from UKBiobank.
The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).
The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `ukbiobank_studies` | `DataFrame` | UKBiobank study manifest file loaded in spark session. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `UKBiobankStudyIndex` | `UKBiobankStudyIndex` | Annotated UKBiobank study table. |
Source code in src/otg/datasource/ukbiobank/study_index.py
@classmethod\ndef from_source(\n cls: type[UKBiobankStudyIndex],\n ukbiobank_studies: DataFrame,\n) -> UKBiobankStudyIndex:\n \"\"\"This function ingests study level metadata from UKBiobank.\n\n The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).\n\n The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.\n\n Args:\n ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session.\n\n Returns:\n UKBiobankStudyIndex: Annotated UKBiobank study table.\n \"\"\"\n return StudyIndex(\n _df=(\n ukbiobank_studies.select(\n f.col(\"code\").alias(\"studyId\"),\n f.lit(\"UKBiobank\").alias(\"projectId\"),\n f.lit(\"gwas\").alias(\"studyType\"),\n f.col(\"trait\").alias(\"traitFromSource\"),\n # Make publication and ancestry schema columns.\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"30104761\").alias(\n \"pubmedId\"\n ),\n f.when(\n f.col(\"code\").startswith(\"SAIGE_\"),\n \"Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies\",\n )\n .otherwise(None)\n .alias(\"publicationTitle\"),\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Wei Zhou\").alias(\n \"publicationFirstAuthor\"\n ),\n f.when(f.col(\"code\").startswith(\"NEALE2_\"), \"2018-08-01\")\n .otherwise(\"2018-10-24\")\n .alias(\"publicationDate\"),\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Nature Genetics\").alias(\n \"publicationJournal\"\n ),\n f.col(\"n_total\").cast(\"string\").alias(\"initialSampleSize\"),\n f.col(\"n_cases\").cast(\"long\").alias(\"nCases\"),\n f.array(\n f.struct(\n f.col(\"n_total\").cast(\"long\").alias(\"sampleSize\"),\n f.concat(f.lit(\"European=\"), f.col(\"n_total\")).alias(\n \"ancestry\"\n ),\n )\n ).alias(\"discoverySamples\"),\n f.col(\"in_path\").alias(\"summarystatsLocation\"),\n f.lit(True).alias(\"hasSumstats\"),\n )\n .withColumn(\n \"traitFromSource\",\n f.when(\n f.col(\"traitFromSource\").contains(\":\"),\n f.concat(\n f.initcap(\n f.split(f.col(\"traitFromSource\"), \": \").getItem(1)\n ),\n f.lit(\" | \"),\n f.lower(f.split(f.col(\"traitFromSource\"), \": \").getItem(0)),\n ),\n ).otherwise(f.col(\"traitFromSource\")),\n )\n .withColumn(\n \"ldPopulationStructure\",\n cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n )\n ),\n _schema=StudyIndex.get_schema(),\n )\n
Clumping is a commonly used post-processing method that allows for identification of independent association signals from GWAS summary statistics and curated associations. This process is critical because of the complex linkage disequilibrium (LD) structure in human populations, which can result in multiple statistically significant associations within the same genomic region. Clumping methods help reduce redundancy in GWAS results and ensure that each reported association represents an independent signal.
We have implemented 2 clumping methods:
Clumping based on Linkage Disequilibrium (LD)
LD clumping reports the most significant genetic associations in a region in terms of a smaller number of "clumps" of genetically linked SNPs.
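A minimal usage sketch, assuming `study_locus` is an already LD-annotated StudyLocus dataset and that the class is importable from the module path shown below:

```python
from otg.method.clump import LDclumping

# Flags lead variants that are linked to a stronger lead in the same study
# and clears the locus information for those records (study_locus is assumed
# to be an existing, LD-annotated StudyLocus dataset)
clumped = LDclumping.clump(study_locus)
```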
Source code in src/otg/method/clump.py
class LDclumping:\n \"\"\"LD clumping reports the most significant genetic associations in a region in terms of a smaller number of \u201cclumps\u201d of genetically linked SNPs.\"\"\"\n\n @staticmethod\n def _is_lead_linked(\n study_id: Column,\n variant_id: Column,\n p_value_exponent: Column,\n p_value_mantissa: Column,\n ld_set: Column,\n ) -> Column:\n \"\"\"Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n Args:\n study_id (Column): studyId\n variant_id (Column): Lead variant id\n p_value_exponent (Column): p-value exponent\n p_value_mantissa (Column): p-value mantissa\n locus (Column): Credible set <array of structs>\n\n Returns:\n Column: Boolean in which True indicates that the lead is linked to another tag in the same dataset.\n \"\"\"\n leads_in_study = f.collect_set(variant_id).over(Window.partitionBy(study_id))\n tags_in_studylocus = f.array_union(\n # Get all tag variants from the credible set per studyLocusId\n f.transform(ld_set, lambda x: x.tagVariantId),\n # And append the lead variant so that the intersection is the same for all studyLocusIds in a study\n f.array(variant_id),\n )\n intersect_lead_tags = f.array_sort(\n f.array_intersect(leads_in_study, tags_in_studylocus)\n )\n return (\n # If the lead is in the credible set, we rank the peaks by p-value\n f.when(\n f.size(intersect_lead_tags) > 0,\n f.row_number().over(\n Window.partitionBy(study_id, intersect_lead_tags).orderBy(\n p_value_exponent, p_value_mantissa\n )\n )\n > 1,\n )\n # If the intersection is empty (lead is not in the credible set or cred set is empty), the association is not linked\n .otherwise(f.lit(False))\n )\n\n @classmethod\n def clump(cls: type[LDclumping], associations: StudyLocus) -> StudyLocus:\n \"\"\"Perform clumping on studyLocus dataset.\n\n Args:\n associations (StudyLocus): StudyLocus dataset\n\n Returns:\n StudyLocus: including flag and removing locus information for LD clumped loci.\n \"\"\"\n return associations.clump()\n
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `associations` | `StudyLocus` | StudyLocus dataset | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `StudyLocus` | `StudyLocus` | Including flag and removing locus information for LD clumped loci. |
Source code in src/otg/method/clump.py
```python
@classmethod
def clump(cls: type[LDclumping], associations: StudyLocus) -> StudyLocus:
    """Perform clumping on studyLocus dataset.

    Args:
        associations (StudyLocus): StudyLocus dataset

    Returns:
        StudyLocus: including flag and removing locus information for LD clumped loci.
    """
    return associations.clump()
```
Calculate bayesian colocalisation based on overlapping signals from credible sets.
Based on the R COLOC package, which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that only one single causal variant exists for any given trait in any genomic region.
| Hypothesis | Description |
| --- | --- |
| H0 | no association with either trait in the region |
| H1 | association with trait 1 only |
| H2 | association with trait 2 only |
| H3 | both traits are associated, but have different single causal variants |
| H4 | both traits are associated and share the same single causal variant |
Approximate Bayes factors required
Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (logABF column).
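The posterior for each hypothesis is its share of the total evidence, computed from the per-hypothesis log Bayes factors with a numerically stable log-sum-exp, mirroring the helpers in the source below. A small standalone sketch with made-up values:

```python
import numpy as np

def logsum(log_abf: np.ndarray) -> float:
    # Log of the sum of exponentials, subtracting the max for numerical stability
    themax = np.max(log_abf)
    return float(themax + np.log(np.sum(np.exp(log_abf - themax))))

# Made-up log Bayes factors for hypotheses H0..H4
l_h = np.array([0.0, 2.3, 1.1, 4.0, 6.5])

# Each hypothesis' posterior is its normalised share of the evidence
posteriors = np.exp(l_h - logsum(l_h))
print(posteriors.round(3), posteriors.sum())  # posteriors sum to 1.0
```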
Source code in src/otg/method/colocalisation.py
class Coloc:\n \"\"\"Calculate bayesian colocalisation based on overlapping signals from credible sets.\n\n Based on the [R COLOC package](https://github.com/chr1swallace/coloc/blob/main/R/claudia.R), which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that **only one single causal variant** exists for any given trait in any genomic region.\n\n | Hypothesis | Description |\n | ------------- | --------------------------------------------------------------------- |\n | H<sub>0</sub> | no association with either trait in the region |\n | H<sub>1</sub> | association with trait 1 only |\n | H<sub>2</sub> | association with trait 2 only |\n | H<sub>3</sub> | both traits are associated, but have different single causal variants |\n | H<sub>4</sub> | both traits are associated and share the same single causal variant |\n\n !!! warning \"Approximate Bayes factors required\"\n Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (`logABF` column).\n\n \"\"\"\n\n @staticmethod\n def _get_logsum(log_abf: ndarray) -> float:\n \"\"\"Calculates logsum of vector.\n\n This function calculates the log of the sum of the exponentiated\n logs taking out the max, i.e. insuring that the sum is not Inf\n\n Args:\n log_abf (ndarray): log approximate bayes factor\n\n Returns:\n float: logsum\n\n Example:\n >>> l = [0.2, 0.1, 0.05, 0]\n >>> round(Coloc._get_logsum(l), 6)\n 1.476557\n \"\"\"\n themax = np.max(log_abf)\n result = themax + np.log(np.sum(np.exp(log_abf - themax)))\n return float(result)\n\n @staticmethod\n def _get_posteriors(all_abfs: ndarray) -> DenseVector:\n \"\"\"Calculate posterior probabilities for each hypothesis.\n\n Args:\n all_abfs (ndarray): h0-h4 bayes factors\n\n Returns:\n DenseVector: Posterior\n\n Example:\n >>> l = np.array([0.2, 0.1, 0.05, 0])\n >>> Coloc._get_posteriors(l)\n DenseVector([0.279, 0.2524, 0.2401, 0.2284])\n \"\"\"\n diff = all_abfs - Coloc._get_logsum(all_abfs)\n abfs_posteriors = np.exp(diff)\n return Vectors.dense(abfs_posteriors)\n\n @classmethod\n def colocalise(\n cls: type[Coloc],\n overlapping_signals: StudyLocusOverlap,\n priorc1: float = 1e-4,\n priorc2: float = 1e-4,\n priorc12: float = 1e-5,\n ) -> Colocalisation:\n \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n Args:\n overlapping_signals (StudyLocusOverlap): overlapping peaks\n priorc1 (float): Prior on variant being causal for trait 1. Defaults to 1e-4.\n priorc2 (float): Prior on variant being causal for trait 2. Defaults to 1e-4.\n priorc12 (float): Prior on variant being causal for traits 1 and 2. 
Defaults to 1e-5.\n\n Returns:\n Colocalisation: Colocalisation results\n \"\"\"\n # register udfs\n logsum = f.udf(Coloc._get_logsum, DoubleType())\n posteriors = f.udf(Coloc._get_posteriors, VectorUDT())\n return Colocalisation(\n _df=(\n overlapping_signals.df\n # Before summing log_abf columns nulls need to be filled with 0:\n .fillna(0, subset=[\"statistics.left_logABF\", \"statistics.right_logABF\"])\n # Sum of log_abfs for each pair of signals\n .withColumn(\n \"sum_log_abf\",\n f.col(\"statistics.left_logABF\") + f.col(\"statistics.right_logABF\"),\n )\n # Group by overlapping peak and generating dense vectors of log_abf:\n .groupBy(\"chromosome\", \"leftStudyLocusId\", \"rightStudyLocusId\")\n .agg(\n f.count(\"*\").alias(\"numberColocalisingVariants\"),\n fml.array_to_vector(\n f.collect_list(f.col(\"statistics.left_logABF\"))\n ).alias(\"left_logABF\"),\n fml.array_to_vector(\n f.collect_list(f.col(\"statistics.right_logABF\"))\n ).alias(\"right_logABF\"),\n fml.array_to_vector(f.collect_list(f.col(\"sum_log_abf\"))).alias(\n \"sum_log_abf\"\n ),\n )\n .withColumn(\"logsum1\", logsum(f.col(\"left_logABF\")))\n .withColumn(\"logsum2\", logsum(f.col(\"right_logABF\")))\n .withColumn(\"logsum12\", logsum(f.col(\"sum_log_abf\")))\n .drop(\"left_logABF\", \"right_logABF\", \"sum_log_abf\")\n # Add priors\n # priorc1 Prior on variant being causal for trait 1\n .withColumn(\"priorc1\", f.lit(priorc1))\n # priorc2 Prior on variant being causal for trait 2\n .withColumn(\"priorc2\", f.lit(priorc2))\n # priorc12 Prior on variant being causal for traits 1 and 2\n .withColumn(\"priorc12\", f.lit(priorc12))\n # h0-h2\n .withColumn(\"lH0abf\", f.lit(0))\n .withColumn(\"lH1abf\", f.log(f.col(\"priorc1\")) + f.col(\"logsum1\"))\n .withColumn(\"lH2abf\", f.log(f.col(\"priorc2\")) + f.col(\"logsum2\"))\n # h3\n .withColumn(\"sumlogsum\", f.col(\"logsum1\") + f.col(\"logsum2\"))\n # exclude null H3/H4s: due to sumlogsum == logsum12\n .filter(f.col(\"sumlogsum\") != f.col(\"logsum12\"))\n .withColumn(\"max\", f.greatest(\"sumlogsum\", \"logsum12\"))\n .withColumn(\n \"logdiff\",\n (\n f.col(\"max\")\n + f.log(\n f.exp(f.col(\"sumlogsum\") - f.col(\"max\"))\n - f.exp(f.col(\"logsum12\") - f.col(\"max\"))\n )\n ),\n )\n .withColumn(\n \"lH3abf\",\n f.log(f.col(\"priorc1\"))\n + f.log(f.col(\"priorc2\"))\n + f.col(\"logdiff\"),\n )\n .drop(\"right_logsum\", \"left_logsum\", \"sumlogsum\", \"max\", \"logdiff\")\n # h4\n .withColumn(\"lH4abf\", f.log(f.col(\"priorc12\")) + f.col(\"logsum12\"))\n # cleaning\n .drop(\n \"priorc1\", \"priorc2\", \"priorc12\", \"logsum1\", \"logsum2\", \"logsum12\"\n )\n # posteriors\n .withColumn(\n \"allABF\",\n fml.array_to_vector(\n f.array(\n f.col(\"lH0abf\"),\n f.col(\"lH1abf\"),\n f.col(\"lH2abf\"),\n f.col(\"lH3abf\"),\n f.col(\"lH4abf\"),\n )\n ),\n )\n .withColumn(\n \"posteriors\", fml.vector_to_array(posteriors(f.col(\"allABF\")))\n )\n .withColumn(\"h0\", f.col(\"posteriors\").getItem(0))\n .withColumn(\"h1\", f.col(\"posteriors\").getItem(1))\n .withColumn(\"h2\", f.col(\"posteriors\").getItem(2))\n .withColumn(\"h3\", f.col(\"posteriors\").getItem(3))\n .withColumn(\"h4\", f.col(\"posteriors\").getItem(4))\n .withColumn(\"h4h3\", f.col(\"h4\") / f.col(\"h3\"))\n .withColumn(\"log2h4h3\", f.log2(f.col(\"h4h3\")))\n # clean up\n .drop(\n \"posteriors\",\n \"allABF\",\n \"h4h3\",\n \"lH0abf\",\n \"lH1abf\",\n \"lH2abf\",\n \"lH3abf\",\n \"lH4abf\",\n )\n .withColumn(\"colocalisationMethod\", f.lit(\"COLOC\"))\n ),\n _schema=Colocalisation.get_schema(),\n )\n
It extends the CAVIAR framework to explicitly estimate the posterior probability that the same variant is causal in 2 studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (CLPP) by utilizing the marginal posterior probabilities. This framework allows for multiple variants to be causal in a single locus.
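The per-variant CLPP is the product of that variant's posterior probabilities in the two studies, and the locus-level CLPP reported here is the sum over the shared variants. A tiny worked example with made-up posteriors:

```python
# Made-up posterior probabilities for two shared variants in the left and right study
left_pp = [0.5, 0.25]
right_pp = [0.5, 0.75]

clpp_per_variant = [left * right for left, right in zip(left_pp, right_pp)]
print(clpp_per_variant, sum(clpp_per_variant))  # [0.25, 0.1875] 0.4375
```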
Source code in src/otg/method/colocalisation.py
class ECaviar:\n \"\"\"ECaviar-based colocalisation analysis.\n\n It extends [CAVIAR](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5142122/#bib18)\u00a0framework to explicitly estimate the posterior probability that the same variant is causal in 2 studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (**CLPP**) by utilizing the marginal posterior probabilities. This framework allows for **multiple variants to be causal** in a single locus.\n \"\"\"\n\n @staticmethod\n def _get_clpp(left_pp: Column, right_pp: Column) -> Column:\n \"\"\"Calculate the colocalisation posterior probability (CLPP).\n\n If the fact that the same variant is found causal for two studies are independent events,\n CLPP is defined as the product of posterior porbabilities that a variant is causal in both studies.\n\n Args:\n left_pp (Column): left posterior probability\n right_pp (Column): right posterior probability\n\n Returns:\n Column: CLPP\n\n Examples:\n >>> d = [{\"left_pp\": 0.5, \"right_pp\": 0.5}, {\"left_pp\": 0.25, \"right_pp\": 0.75}]\n >>> df = spark.createDataFrame(d)\n >>> df.withColumn(\"clpp\", ECaviar._get_clpp(f.col(\"left_pp\"), f.col(\"right_pp\"))).show()\n +-------+--------+------+\n |left_pp|right_pp| clpp|\n +-------+--------+------+\n | 0.5| 0.5| 0.25|\n | 0.25| 0.75|0.1875|\n +-------+--------+------+\n <BLANKLINE>\n\n \"\"\"\n return left_pp * right_pp\n\n @classmethod\n def colocalise(\n cls: type[ECaviar], overlapping_signals: StudyLocusOverlap\n ) -> Colocalisation:\n \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n Args:\n overlapping_signals (StudyLocusOverlap): overlapping signals.\n\n Returns:\n Colocalisation: colocalisation results based on eCAVIAR.\n \"\"\"\n return Colocalisation(\n _df=(\n overlapping_signals.df.withColumn(\n \"clpp\",\n ECaviar._get_clpp(\n f.col(\"statistics.left_posteriorProbability\"),\n f.col(\"statistics.right_posteriorProbability\"),\n ),\n )\n .groupBy(\"leftStudyLocusId\", \"rightStudyLocusId\", \"chromosome\")\n .agg(\n f.count(\"*\").alias(\"numberColocalisingVariants\"),\n f.sum(f.col(\"clpp\")).alias(\"clpp\"),\n )\n .withColumn(\"colocalisationMethod\", f.lit(\"eCAVIAR\"))\n ),\n _schema=Colocalisation.get_schema(),\n )\n
Class to annotate linkage disequilibrium (LD) operations from GnomAD.
Source code in src/otg/method/ld.py
class LDAnnotator:\n \"\"\"Class to annotate linkage disequilibrium (LD) operations from GnomAD.\"\"\"\n\n @staticmethod\n def _calculate_weighted_r_overall(ld_set: Column) -> Column:\n \"\"\"Aggregation of weighted R information using ancestry proportions.\"\"\"\n return f.transform(\n ld_set,\n lambda x: f.struct(\n x[\"tagVariantId\"].alias(\"tagVariantId\"),\n # r2Overall is the accumulated sum of each r2 relative to the population size\n f.aggregate(\n x[\"rValues\"],\n f.lit(0.0),\n lambda acc, y: acc\n + f.coalesce(\n f.pow(y[\"r\"], 2) * y[\"relativeSampleSize\"], f.lit(0.0)\n ), # we use coalesce to avoid problems when r/relativeSampleSize is null\n ).alias(\"r2Overall\"),\n ),\n )\n\n @staticmethod\n def _add_population_size(ld_set: Column, study_populations: Column) -> Column:\n \"\"\"Add population size to each rValues entry in the ldSet.\n\n Args:\n ld_set (Column): LD set\n study_populations (Column): Study populations\n\n Returns:\n Column: LD set with added 'relativeSampleSize' field\n \"\"\"\n # Create a population to relativeSampleSize map from the struct\n populations_map = f.map_from_arrays(\n study_populations[\"ldPopulation\"],\n study_populations[\"relativeSampleSize\"],\n )\n return f.transform(\n ld_set,\n lambda x: f.struct(\n x[\"tagVariantId\"].alias(\"tagVariantId\"),\n f.transform(\n x[\"rValues\"],\n lambda y: f.struct(\n y[\"population\"].alias(\"population\"),\n y[\"r\"].alias(\"r\"),\n populations_map[y[\"population\"]].alias(\"relativeSampleSize\"),\n ),\n ).alias(\"rValues\"),\n ),\n )\n\n @classmethod\n def ld_annotate(\n cls: type[LDAnnotator],\n associations: StudyLocus,\n studies: StudyIndex,\n ld_index: LDIndex,\n ) -> StudyLocus:\n \"\"\"Annotate linkage disequilibrium (LD) information to a set of studyLocus.\n\n This function:\n 1. Annotates study locus with population structure information from the study index\n 2. Joins the LD index to the StudyLocus\n 3. Adds the population size of the study to each rValues entry in the ldSet\n 4. Calculates the overall R weighted by the ancestry proportions in every given study.\n\n Args:\n associations (StudyLocus): Dataset to be LD annotated\n studies (StudyIndex): Dataset with study information\n ld_index (LDIndex): Dataset with LD information for every variant present in LD matrix\n\n Returns:\n StudyLocus: including additional column with LD information.\n \"\"\"\n return (\n StudyLocus(\n _df=(\n associations.df\n # Drop ldSet column if already available\n .select(*[col for col in associations.df.columns if col != \"ldSet\"])\n # Annotate study locus with population structure from study index\n .join(\n studies.df.select(\"studyId\", \"ldPopulationStructure\"),\n on=\"studyId\",\n how=\"left\",\n )\n # Bring LD information from LD Index\n .join(\n ld_index.df,\n on=[\"variantId\", \"chromosome\"],\n how=\"left\",\n )\n # Add population size to each rValues entry in the ldSet if population structure available:\n .withColumn(\n \"ldSet\",\n f.when(\n f.col(\"ldPopulationStructure\").isNotNull(),\n cls._add_population_size(\n f.col(\"ldSet\"), f.col(\"ldPopulationStructure\")\n ),\n ),\n )\n # Aggregate weighted R information using ancestry proportions\n .withColumn(\n \"ldSet\",\n f.when(\n f.col(\"ldPopulationStructure\").isNotNull(),\n cls._calculate_weighted_r_overall(f.col(\"ldSet\")),\n ),\n ).drop(\"ldPopulationStructure\")\n ),\n _schema=StudyLocus.get_schema(),\n )\n ._qc_no_population()\n ._qc_unresolved_ld()\n )\n
Annotate linkage disequilibrium (LD) information to a set of studyLocus.
This function:

1. Annotates the study locus with population structure information from the study index
2. Joins the LD index to the StudyLocus
3. Adds the population size of the study to each rValues entry in the ldSet
4. Calculates the overall R weighted by the ancestry proportions in every given study (illustrated just below)
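The weighting in step 4 can be illustrated with plain Python: for each tag variant, r2Overall is the sum of every population's r squared, weighted by that population's relative sample size in the study. The values below are made up:

```python
# For one tag variant, r2Overall is the sum of each population's r^2
# weighted by that population's relative sample size in the study
r_values = [
    {"population": "nfe", "r": 0.9, "relativeSampleSize": 0.8},
    {"population": "afr", "r": 0.4, "relativeSampleSize": 0.2},
]

r2_overall = sum((entry["r"] ** 2) * entry["relativeSampleSize"] for entry in r_values)
print(round(r2_overall, 3))  # 0.68
```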
Parameters:
Name Type Description Default associationsStudyLocus
Dataset to be LD annotated
required studiesStudyIndex
Dataset with study information
required ld_indexLDIndex
Dataset with LD information for every variant present in LD matrix
required
Returns:
Name Type Description StudyLocusStudyLocus
including additional column with LD information.
Source code in src/otg/method/ld.py
@classmethod\ndef ld_annotate(\n cls: type[LDAnnotator],\n associations: StudyLocus,\n studies: StudyIndex,\n ld_index: LDIndex,\n) -> StudyLocus:\n \"\"\"Annotate linkage disequilibrium (LD) information to a set of studyLocus.\n\n This function:\n 1. Annotates study locus with population structure information from the study index\n 2. Joins the LD index to the StudyLocus\n 3. Adds the population size of the study to each rValues entry in the ldSet\n 4. Calculates the overall R weighted by the ancestry proportions in every given study.\n\n Args:\n associations (StudyLocus): Dataset to be LD annotated\n studies (StudyIndex): Dataset with study information\n ld_index (LDIndex): Dataset with LD information for every variant present in LD matrix\n\n Returns:\n StudyLocus: including additional column with LD information.\n \"\"\"\n return (\n StudyLocus(\n _df=(\n associations.df\n # Drop ldSet column if already available\n .select(*[col for col in associations.df.columns if col != \"ldSet\"])\n # Annotate study locus with population structure from study index\n .join(\n studies.df.select(\"studyId\", \"ldPopulationStructure\"),\n on=\"studyId\",\n how=\"left\",\n )\n # Bring LD information from LD Index\n .join(\n ld_index.df,\n on=[\"variantId\", \"chromosome\"],\n how=\"left\",\n )\n # Add population size to each rValues entry in the ldSet if population structure available:\n .withColumn(\n \"ldSet\",\n f.when(\n f.col(\"ldPopulationStructure\").isNotNull(),\n cls._add_population_size(\n f.col(\"ldSet\"), f.col(\"ldPopulationStructure\")\n ),\n ),\n )\n # Aggregate weighted R information using ancestry proportions\n .withColumn(\n \"ldSet\",\n f.when(\n f.col(\"ldPopulationStructure\").isNotNull(),\n cls._calculate_weighted_r_overall(f.col(\"ldSet\")),\n ),\n ).drop(\"ldPopulationStructure\")\n ),\n _schema=StudyLocus.get_schema(),\n )\n ._qc_no_population()\n ._qc_unresolved_ld()\n )\n
Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.
Source code in src/otg/method/pics.py
class PICS:\n \"\"\"Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.\"\"\"\n\n @staticmethod\n def _pics_relative_posterior_probability(\n neglog_p: float, pics_snp_mu: float, pics_snp_std: float\n ) -> float:\n \"\"\"Compute the PICS posterior probability for a given SNP.\n\n !!! info \"This probability needs to be scaled to take into account the probabilities of the other variants in the locus.\"\n\n Args:\n neglog_p (float): Negative log p-value of the lead variant\n pics_snp_mu (float): Mean P value of the association between a SNP and a trait\n pics_snp_std (float): Standard deviation for the P value of the association between a SNP and a trait\n\n Returns:\n Relative posterior probability of a SNP being causal in a locus\n\n Examples:\n >>> rel_prob = PICS._pics_relative_posterior_probability(neglog_p=10.0, pics_snp_mu=1.0, pics_snp_std=10.0)\n >>> round(rel_prob, 3)\n 0.368\n \"\"\"\n return float(norm(pics_snp_mu, pics_snp_std).sf(neglog_p) * 2)\n\n @staticmethod\n def _pics_standard_deviation(neglog_p: float, r2: float, k: float) -> float | None:\n \"\"\"Compute the PICS standard deviation.\n\n This distribution is obtained after a series of permutation tests described in the PICS method, and it is only\n valid when the SNP is highly linked with the lead (r2 > 0.5).\n\n Args:\n neglog_p (float): Negative log p-value of the lead variant\n r2 (float): LD score between a given SNP and the lead variant\n k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n Returns:\n Standard deviation for the P value of the association between a SNP and a trait\n\n Examples:\n >>> PICS._pics_standard_deviation(neglog_p=1.0, r2=1.0, k=6.4)\n 0.0\n >>> round(PICS._pics_standard_deviation(neglog_p=10.0, r2=0.5, k=6.4), 3)\n 1.493\n >>> print(PICS._pics_standard_deviation(neglog_p=1.0, r2=0.0, k=6.4))\n None\n \"\"\"\n return (\n abs(((1 - (r2**0.5) ** k) ** 0.5) * (neglog_p**0.5) / 2)\n if r2 >= 0.5\n else None\n )\n\n @staticmethod\n def _pics_mu(neglog_p: float, r2: float) -> float | None:\n \"\"\"Compute the PICS mu that estimates the probability of association between a given SNP and the trait.\n\n This distribution is obtained after a series of permutation tests described in the PICS method, and it is only\n valid when the SNP is highly linked with the lead (r2 > 0.5).\n\n Args:\n neglog_p (float): Negative log p-value of the lead variant\n r2 (float): LD score between a given SNP and the lead variant\n\n Returns:\n Mean P value of the association between a SNP and a trait\n\n Examples:\n >>> PICS._pics_mu(neglog_p=1.0, r2=1.0)\n 1.0\n >>> PICS._pics_mu(neglog_p=10.0, r2=0.5)\n 5.0\n >>> print(PICS._pics_mu(neglog_p=10.0, r2=0.3))\n None\n \"\"\"\n return neglog_p * r2 if r2 >= 0.5 else None\n\n @staticmethod\n def _finemap(ld_set: list[Row], lead_neglog_p: float, k: float) -> list | None:\n \"\"\"Calculates the probability of a variant being causal in a study-locus context by applying the PICS method.\n\n It is intended to be applied as an UDF in `PICS.finemap`, where each row is a StudyLocus association.\n The function iterates over every SNP in the `ldSet` array, and it returns an updated locus with\n its association signal and causality probability as of PICS.\n\n Args:\n ld_set (list): list of tagging variants after expanding the locus\n lead_neglog_p (float): P value of the association 
signal between the lead variant and the study in the form of -log10.\n k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n Returns:\n List of tagging variants with an estimation of the association signal and their posterior probability as of PICS.\n\n Examples:\n >>> from pyspark.sql import Row\n >>> ld_set = [\n ... Row(variantId=\"var1\", r2Overall=0.8),\n ... Row(variantId=\"var2\", r2Overall=1),\n ... ]\n >>> PICS._finemap(ld_set, lead_neglog_p=10.0, k=6.4)\n [{'variantId': 'var1', 'r2Overall': 0.8, 'standardError': 0.07420896512708416, 'posteriorProbability': 0.07116959886882368}, {'variantId': 'var2', 'r2Overall': 1, 'standardError': 0.9977000638225533, 'posteriorProbability': 0.9288304011311763}]\n >>> empty_ld_set = []\n >>> PICS._finemap(empty_ld_set, lead_neglog_p=10.0, k=6.4)\n []\n >>> ld_set_with_no_r2 = [\n ... Row(variantId=\"var1\", r2Overall=None),\n ... Row(variantId=\"var2\", r2Overall=None),\n ... ]\n >>> PICS._finemap(ld_set_with_no_r2, lead_neglog_p=10.0, k=6.4)\n [{'variantId': 'var1', 'r2Overall': None}, {'variantId': 'var2', 'r2Overall': None}]\n \"\"\"\n if ld_set is None:\n return None\n elif not ld_set:\n return []\n tmp_credible_set = []\n new_credible_set = []\n # First iteration: calculation of mu, standard deviation, and the relative posterior probability\n for tag_struct in ld_set:\n tag_dict = (\n tag_struct.asDict()\n ) # tag_struct is of type pyspark.Row, we'll represent it as a dict\n if (\n not tag_dict[\"r2Overall\"]\n or tag_dict[\"r2Overall\"] < 0.5\n or not lead_neglog_p\n ):\n # If PICS cannot be calculated, we'll return the original credible set\n new_credible_set.append(tag_dict)\n continue\n\n pics_snp_mu = PICS._pics_mu(lead_neglog_p, tag_dict[\"r2Overall\"])\n pics_snp_std = PICS._pics_standard_deviation(\n lead_neglog_p, tag_dict[\"r2Overall\"], k\n )\n pics_snp_std = 0.001 if pics_snp_std == 0 else pics_snp_std\n if pics_snp_mu is not None and pics_snp_std is not None:\n posterior_probability = PICS._pics_relative_posterior_probability(\n lead_neglog_p, pics_snp_mu, pics_snp_std\n )\n tag_dict[\"standardError\"] = 10**-pics_snp_std\n tag_dict[\"relativePosteriorProbability\"] = posterior_probability\n\n tmp_credible_set.append(tag_dict)\n\n # Second iteration: calculation of the sum of all the posteriors in each study-locus, so that we scale them between 0-1\n total_posteriors = sum(\n tag_dict.get(\"relativePosteriorProbability\", 0)\n for tag_dict in tmp_credible_set\n )\n\n # Third iteration: calculation of the final posteriorProbability\n for tag_dict in tmp_credible_set:\n if total_posteriors != 0:\n tag_dict[\"posteriorProbability\"] = float(\n tag_dict.get(\"relativePosteriorProbability\", 0) / total_posteriors\n )\n tag_dict.pop(\"relativePosteriorProbability\")\n new_credible_set.append(tag_dict)\n return new_credible_set\n\n @classmethod\n def finemap(\n cls: type[PICS], associations: StudyLocus, k: float = 6.4\n ) -> StudyLocus:\n \"\"\"Run PICS on a study locus.\n\n !!! 
info \"Study locus needs to be LD annotated\"\n The study locus needs to be LD annotated before PICS can be calculated.\n\n Args:\n associations (StudyLocus): Study locus to finemap using PICS\n k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n Returns:\n StudyLocus: Study locus with PICS results\n \"\"\"\n # Register UDF by defining the structure of the output locus array of structs\n # it also renames tagVariantId to variantId\n\n picsed_ldset_schema = t.ArrayType(\n t.StructType(\n [\n t.StructField(\"tagVariantId\", t.StringType(), True),\n t.StructField(\"r2Overall\", t.DoubleType(), True),\n t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n t.StructField(\"standardError\", t.DoubleType(), True),\n ]\n )\n )\n picsed_study_locus_schema = t.ArrayType(\n t.StructType(\n [\n t.StructField(\"variantId\", t.StringType(), True),\n t.StructField(\"r2Overall\", t.DoubleType(), True),\n t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n t.StructField(\"standardError\", t.DoubleType(), True),\n ]\n )\n )\n _finemap_udf = f.udf(\n lambda locus, neglog_p: PICS._finemap(locus, neglog_p, k),\n picsed_ldset_schema,\n )\n return StudyLocus(\n _df=(\n associations.df\n # Old locus column will be dropped if available\n .select(*[col for col in associations.df.columns if col != \"locus\"])\n # Estimate neglog_pvalue for the lead variant\n .withColumn(\"neglog_pvalue\", associations.neglog_pvalue())\n # New locus containing the PICS results\n .withColumn(\n \"locus\",\n f.when(\n f.col(\"ldSet\").isNotNull(),\n _finemap_udf(f.col(\"ldSet\"), f.col(\"neglog_pvalue\")).cast(\n picsed_study_locus_schema\n ),\n ),\n )\n # Rename tagVariantId to variantId\n .drop(\"neglog_pvalue\")\n ),\n _schema=StudyLocus.get_schema(),\n )\n
The study locus needs to be LD annotated before PICS can be calculated.
Parameters:
- associations (StudyLocus): Study locus to finemap using PICS. Required.
- k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended. Default: 6.4.

Returns:
- StudyLocus: Study locus with PICS results.
Source code in src/otg/method/pics.py
@classmethod\ndef finemap(\n cls: type[PICS], associations: StudyLocus, k: float = 6.4\n) -> StudyLocus:\n \"\"\"Run PICS on a study locus.\n\n !!! info \"Study locus needs to be LD annotated\"\n The study locus needs to be LD annotated before PICS can be calculated.\n\n Args:\n associations (StudyLocus): Study locus to finemap using PICS\n k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n Returns:\n StudyLocus: Study locus with PICS results\n \"\"\"\n # Register UDF by defining the structure of the output locus array of structs\n # it also renames tagVariantId to variantId\n\n picsed_ldset_schema = t.ArrayType(\n t.StructType(\n [\n t.StructField(\"tagVariantId\", t.StringType(), True),\n t.StructField(\"r2Overall\", t.DoubleType(), True),\n t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n t.StructField(\"standardError\", t.DoubleType(), True),\n ]\n )\n )\n picsed_study_locus_schema = t.ArrayType(\n t.StructType(\n [\n t.StructField(\"variantId\", t.StringType(), True),\n t.StructField(\"r2Overall\", t.DoubleType(), True),\n t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n t.StructField(\"standardError\", t.DoubleType(), True),\n ]\n )\n )\n _finemap_udf = f.udf(\n lambda locus, neglog_p: PICS._finemap(locus, neglog_p, k),\n picsed_ldset_schema,\n )\n return StudyLocus(\n _df=(\n associations.df\n # Old locus column will be dropped if available\n .select(*[col for col in associations.df.columns if col != \"locus\"])\n # Estimate neglog_pvalue for the lead variant\n .withColumn(\"neglog_pvalue\", associations.neglog_pvalue())\n # New locus containing the PICS results\n .withColumn(\n \"locus\",\n f.when(\n f.col(\"ldSet\").isNotNull(),\n _finemap_udf(f.col(\"ldSet\"), f.col(\"neglog_pvalue\")).cast(\n picsed_study_locus_schema\n ),\n ),\n )\n # Rename tagVariantId to variantId\n .drop(\"neglog_pvalue\")\n ),\n _schema=StudyLocus.get_schema(),\n )\n
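In practice fine-mapping follows LD annotation. A minimal sketch, assuming ld_annotated was produced as in the LD annotation example above:

from otg.method.pics import PICS

# k keeps its recommended default of 6.4; the resulting StudyLocus carries a
# locus array with posteriorProbability and standardError per tag variant.
finemapped = PICS.finemap(ld_annotated, k=6.4)
finemapped.df.select("studyLocusId", "locus").show(truncate=False)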
Get semi-lead SNPs from summary statistics using a window-based function.
Source code in src/otg/method/window_based_clumping.py
class WindowBasedClumping:\n \"\"\"Get semi-lead snps from summary statistics using a window based function.\"\"\"\n\n @staticmethod\n def _cluster_peaks(\n study: Column, chromosome: Column, position: Column, window_length: int\n ) -> Column:\n \"\"\"Cluster GWAS significant variants, were clusters are separated by a defined distance.\n\n !! Important to note that the length of the clusters can be arbitrarily big.\n\n Args:\n study (Column): study identifier\n chromosome (Column): chromosome identifier\n position (Column): position of the variant\n window_length (int): window length in basepair\n\n Returns:\n Column: containing cluster identifier\n\n Examples:\n >>> data = [\n ... # Cluster 1:\n ... ('s1', 'chr1', 2),\n ... ('s1', 'chr1', 4),\n ... ('s1', 'chr1', 12),\n ... # Cluster 2 - Same chromosome:\n ... ('s1', 'chr1', 31),\n ... ('s1', 'chr1', 38),\n ... ('s1', 'chr1', 42),\n ... # Cluster 3 - New chromosome:\n ... ('s1', 'chr2', 41),\n ... ('s1', 'chr2', 44),\n ... ('s1', 'chr2', 50),\n ... # Cluster 4 - other study:\n ... ('s2', 'chr2', 55),\n ... ('s2', 'chr2', 62),\n ... ('s2', 'chr2', 70),\n ... ]\n >>> window_length = 10\n >>> (\n ... spark.createDataFrame(data, ['studyId', 'chromosome', 'position'])\n ... .withColumn(\"cluster_id\",\n ... WindowBasedClumping._cluster_peaks(\n ... f.col('studyId'),\n ... f.col('chromosome'),\n ... f.col('position'),\n ... window_length\n ... )\n ... ).show()\n ... )\n +-------+----------+--------+----------+\n |studyId|chromosome|position|cluster_id|\n +-------+----------+--------+----------+\n | s1| chr1| 2| s1_chr1_2|\n | s1| chr1| 4| s1_chr1_2|\n | s1| chr1| 12| s1_chr1_2|\n | s1| chr1| 31|s1_chr1_31|\n | s1| chr1| 38|s1_chr1_31|\n | s1| chr1| 42|s1_chr1_31|\n | s1| chr2| 41|s1_chr2_41|\n | s1| chr2| 44|s1_chr2_41|\n | s1| chr2| 50|s1_chr2_41|\n | s2| chr2| 55|s2_chr2_55|\n | s2| chr2| 62|s2_chr2_55|\n | s2| chr2| 70|s2_chr2_55|\n +-------+----------+--------+----------+\n <BLANKLINE>\n\n \"\"\"\n # By adding previous position, the cluster boundary can be identified:\n previous_position = f.lag(position).over(\n Window.partitionBy(study, chromosome).orderBy(position)\n )\n # We consider a cluster boudary if subsequent snps are further than the defined window:\n cluster_id = f.when(\n (previous_position.isNull())\n | (position - previous_position > window_length),\n f.concat_ws(\"_\", study, chromosome, position),\n )\n # The cluster identifier is propagated across every variant of the cluster:\n return f.when(\n cluster_id.isNull(),\n f.last(cluster_id, ignorenulls=True).over(\n Window.partitionBy(study, chromosome)\n .orderBy(position)\n .rowsBetween(Window.unboundedPreceding, Window.currentRow)\n ),\n ).otherwise(cluster_id)\n\n @staticmethod\n def _prune_peak(position: ndarray, window_size: int) -> DenseVector:\n \"\"\"Establish lead snps based on their positions listed by p-value.\n\n The function `find_peak` assigns lead SNPs based on their positions listed by p-value within a specified window size.\n\n Args:\n position (ndarray): positions of the SNPs sorted by p-value.\n window_size (int): the distance in bp within which associations are clumped together around the lead snp.\n\n Returns:\n DenseVector: binary vector where 1 indicates a lead SNP and 0 indicates a non-lead SNP.\n\n Examples:\n >>> from pyspark.ml import functions as fml\n >>> from pyspark.ml.linalg import DenseVector\n >>> WindowBasedClumping._prune_peak(np.array((3, 9, 8, 4, 6)), 2)\n DenseVector([1.0, 1.0, 0.0, 0.0, 1.0])\n\n \"\"\"\n # Initializing the lead list 
with zeroes:\n is_lead: ndarray = np.zeros(len(position))\n\n # List containing indices of leads:\n lead_indices: list = []\n\n # Looping through all positions:\n for index in range(len(position)):\n # Looping through leads to find out if they are within a window:\n for lead_index in lead_indices:\n # If any of the leads within the window:\n if abs(position[lead_index] - position[index]) < window_size:\n # Skipping further checks:\n break\n else:\n # None of the leads were within the window:\n lead_indices.append(index)\n is_lead[index] = 1\n\n return DenseVector(is_lead)\n\n @classmethod\n def clump(\n cls: type[WindowBasedClumping],\n summary_stats: SummaryStatistics,\n window_length: int,\n p_value_significance: float = 5e-8,\n ) -> StudyLocus:\n \"\"\"Clump summary statistics by distance.\n\n Args:\n summary_stats (SummaryStatistics): summary statistics to clump\n window_length (int): window length in basepair\n p_value_significance (float): only more significant variants are considered\n\n Returns:\n StudyLocus: clumped summary statistics\n \"\"\"\n # Create window for locus clusters\n # - variants where the distance between subsequent variants is below the defined threshold.\n # - Variants are sorted by descending significance\n cluster_window = Window.partitionBy(\n \"studyId\", \"chromosome\", \"cluster_id\"\n ).orderBy(f.col(\"pValueExponent\").asc(), f.col(\"pValueMantissa\").asc())\n\n return StudyLocus(\n _df=(\n summary_stats\n # Dropping snps below significance - all subsequent steps are done on significant variants:\n .pvalue_filter(p_value_significance)\n .df\n # Clustering summary variants for efficient windowing (complexity reduction):\n .withColumn(\n \"cluster_id\",\n WindowBasedClumping._cluster_peaks(\n f.col(\"studyId\"),\n f.col(\"chromosome\"),\n f.col(\"position\"),\n window_length,\n ),\n )\n # Within each cluster variants are ranked by significance:\n .withColumn(\"pvRank\", f.row_number().over(cluster_window))\n # Collect positions in cluster for the most significant variant (complexity reduction):\n .withColumn(\n \"collectedPositions\",\n f.when(\n f.col(\"pvRank\") == 1,\n f.collect_list(f.col(\"position\")).over(\n cluster_window.rowsBetween(\n Window.currentRow, Window.unboundedFollowing\n )\n ),\n ).otherwise(f.array()),\n )\n # Get semi indices only ONCE per cluster:\n .withColumn(\n \"semiIndices\",\n f.when(\n f.size(f.col(\"collectedPositions\")) > 0,\n fml.vector_to_array(\n f.udf(WindowBasedClumping._prune_peak, VectorUDT())(\n fml.array_to_vector(f.col(\"collectedPositions\")),\n f.lit(window_length),\n )\n ),\n ),\n )\n # Propagating the result of the above calculation for all rows:\n .withColumn(\n \"semiIndices\",\n f.when(\n f.col(\"semiIndices\").isNull(),\n f.first(f.col(\"semiIndices\"), ignorenulls=True).over(\n cluster_window\n ),\n ).otherwise(f.col(\"semiIndices\")),\n )\n # Keeping semi indices only:\n .filter(f.col(\"semiIndices\")[f.col(\"pvRank\") - 1] > 0)\n .drop(\"pvRank\", \"collectedPositions\", \"semiIndices\", \"cluster_id\")\n # Adding study-locus id:\n .withColumn(\n \"studyLocusId\",\n StudyLocus.assign_study_locus_id(\"studyId\", \"variantId\"),\n )\n # Initialize QC column as array of strings:\n .withColumn(\n \"qualityControls\", f.array().cast(t.ArrayType(t.StringType()))\n )\n ),\n _schema=StudyLocus.get_schema(),\n )\n\n @classmethod\n def clump_with_locus(\n cls: type[WindowBasedClumping],\n summary_stats: SummaryStatistics,\n window_length: int,\n p_value_significance: float = 5e-8,\n p_value_baseline: float = 
0.05,\n locus_window_length: int | None = None,\n ) -> StudyLocus:\n \"\"\"Clump significant associations while collecting locus around them.\n\n Args:\n summary_stats (SummaryStatistics): Input summary statistics dataset\n window_length (int): Window size in bp, used for distance based clumping.\n p_value_significance (float, optional): GWAS significance threshold used to filter peaks. Defaults to 5e-8.\n p_value_baseline (float, optional): Least significant threshold. Below this, all snps are dropped. Defaults to 0.05.\n locus_window_length (int, optional): The distance for collecting locus around the semi indices.\n\n Returns:\n StudyLocus: StudyLocus after clumping with information about the `locus`\n \"\"\"\n # If no locus window provided, using the same value:\n if locus_window_length is None:\n locus_window_length = window_length\n\n # Run distance based clumping on the summary stats:\n clumped_dataframe = WindowBasedClumping.clump(\n summary_stats,\n window_length=window_length,\n p_value_significance=p_value_significance,\n ).df.alias(\"clumped\")\n\n # Get list of columns from clumped dataset for further propagation:\n clumped_columns = clumped_dataframe.columns\n\n # Dropping variants not meeting the baseline criteria:\n sumstats_baseline = summary_stats.pvalue_filter(p_value_baseline).df\n\n # Renaming columns:\n sumstats_baseline_renamed = sumstats_baseline.selectExpr(\n *[f\"{col} as tag_{col}\" for col in sumstats_baseline.columns]\n ).alias(\"sumstat\")\n\n study_locus_df = (\n sumstats_baseline_renamed\n # Joining the two datasets together:\n .join(\n f.broadcast(clumped_dataframe),\n on=[\n (f.col(\"sumstat.tag_studyId\") == f.col(\"clumped.studyId\"))\n & (f.col(\"sumstat.tag_chromosome\") == f.col(\"clumped.chromosome\"))\n & (\n f.col(\"sumstat.tag_position\")\n >= (f.col(\"clumped.position\") - locus_window_length)\n )\n & (\n f.col(\"sumstat.tag_position\")\n <= (f.col(\"clumped.position\") + locus_window_length)\n )\n ],\n how=\"right\",\n )\n .withColumn(\n \"locus\",\n f.struct(\n f.col(\"tag_variantId\").alias(\"variantId\"),\n f.col(\"tag_beta\").alias(\"beta\"),\n f.col(\"tag_pValueMantissa\").alias(\"pValueMantissa\"),\n f.col(\"tag_pValueExponent\").alias(\"pValueExponent\"),\n f.col(\"tag_standardError\").alias(\"standardError\"),\n ),\n )\n .groupby(\"studyLocusId\")\n .agg(\n *[\n f.first(col).alias(col)\n for col in clumped_columns\n if col != \"studyLocusId\"\n ],\n f.collect_list(f.col(\"locus\")).alias(\"locus\"),\n )\n )\n\n return StudyLocus(\n _df=study_locus_df,\n _schema=StudyLocus.get_schema(),\n )\n
Clump significant associations while collecting locus around them.
Parameters:
- summary_stats (SummaryStatistics): Input summary statistics dataset. Required.
- window_length (int): Window size in bp, used for distance-based clumping. Required.
- p_value_significance (float): GWAS significance threshold used to filter peaks. Defaults to 5e-8.
- p_value_baseline (float): Least significant threshold; below this, all SNPs are dropped. Defaults to 0.05.
- locus_window_length (int): The distance for collecting the locus around the semi indices. Defaults to None, in which case window_length is used.

Returns:
- StudyLocus: StudyLocus after clumping with information about the locus.
Source code in src/otg/method/window_based_clumping.py
@classmethod\ndef clump_with_locus(\n cls: type[WindowBasedClumping],\n summary_stats: SummaryStatistics,\n window_length: int,\n p_value_significance: float = 5e-8,\n p_value_baseline: float = 0.05,\n locus_window_length: int | None = None,\n) -> StudyLocus:\n \"\"\"Clump significant associations while collecting locus around them.\n\n Args:\n summary_stats (SummaryStatistics): Input summary statistics dataset\n window_length (int): Window size in bp, used for distance based clumping.\n p_value_significance (float, optional): GWAS significance threshold used to filter peaks. Defaults to 5e-8.\n p_value_baseline (float, optional): Least significant threshold. Below this, all snps are dropped. Defaults to 0.05.\n locus_window_length (int, optional): The distance for collecting locus around the semi indices.\n\n Returns:\n StudyLocus: StudyLocus after clumping with information about the `locus`\n \"\"\"\n # If no locus window provided, using the same value:\n if locus_window_length is None:\n locus_window_length = window_length\n\n # Run distance based clumping on the summary stats:\n clumped_dataframe = WindowBasedClumping.clump(\n summary_stats,\n window_length=window_length,\n p_value_significance=p_value_significance,\n ).df.alias(\"clumped\")\n\n # Get list of columns from clumped dataset for further propagation:\n clumped_columns = clumped_dataframe.columns\n\n # Dropping variants not meeting the baseline criteria:\n sumstats_baseline = summary_stats.pvalue_filter(p_value_baseline).df\n\n # Renaming columns:\n sumstats_baseline_renamed = sumstats_baseline.selectExpr(\n *[f\"{col} as tag_{col}\" for col in sumstats_baseline.columns]\n ).alias(\"sumstat\")\n\n study_locus_df = (\n sumstats_baseline_renamed\n # Joining the two datasets together:\n .join(\n f.broadcast(clumped_dataframe),\n on=[\n (f.col(\"sumstat.tag_studyId\") == f.col(\"clumped.studyId\"))\n & (f.col(\"sumstat.tag_chromosome\") == f.col(\"clumped.chromosome\"))\n & (\n f.col(\"sumstat.tag_position\")\n >= (f.col(\"clumped.position\") - locus_window_length)\n )\n & (\n f.col(\"sumstat.tag_position\")\n <= (f.col(\"clumped.position\") + locus_window_length)\n )\n ],\n how=\"right\",\n )\n .withColumn(\n \"locus\",\n f.struct(\n f.col(\"tag_variantId\").alias(\"variantId\"),\n f.col(\"tag_beta\").alias(\"beta\"),\n f.col(\"tag_pValueMantissa\").alias(\"pValueMantissa\"),\n f.col(\"tag_pValueExponent\").alias(\"pValueExponent\"),\n f.col(\"tag_standardError\").alias(\"standardError\"),\n ),\n )\n .groupby(\"studyLocusId\")\n .agg(\n *[\n f.first(col).alias(col)\n for col in clumped_columns\n if col != \"studyLocusId\"\n ],\n f.collect_list(f.col(\"locus\")).alias(\"locus\"),\n )\n )\n\n return StudyLocus(\n _df=study_locus_df,\n _schema=StudyLocus.get_schema(),\n )\n
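A minimal sketch of running the clumping end to end. The SummaryStatistics import path and the input path are assumptions for illustration, and the 250 kb window is an arbitrary choice:

from otg.dataset.summary_statistics import SummaryStatistics  # assumed module path
from otg.method.window_based_clumping import WindowBasedClumping

# Hypothetical parquet location, read with the shared Session object:
sumstats = SummaryStatistics.from_parquet(session, "gs://bucket/summary_statistics")

# Clump genome-wide significant peaks and collect every tag with p < 0.05
# within 250 kb of each semi-lead variant.
clumped = WindowBasedClumping.clump_with_locus(
    sumstats,
    window_length=250_000,
    p_value_significance=5e-8,
    p_value_baseline=0.05,
    locus_window_length=250_000,
)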
This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).
Source code in src/otg/colocalisation.py
@dataclass\nclass ColocalisationStep(ColocalisationStepConfig):\n \"\"\"Colocalisation step.\n\n This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).\n \"\"\"\n\n session: Session = Session()\n\n def run(self: ColocalisationStep) -> None:\n \"\"\"Run colocalisation step.\"\"\"\n # Study-locus information\n sl = StudyLocus.from_parquet(self.session, self.study_locus_path)\n si = StudyIndex.from_parquet(self.session, self.study_index_path)\n\n # Study-locus overlaps for 95% credible sets\n sl_overlaps = sl.credible_set(CredibleInterval.IS95).overlaps(si)\n\n coloc_results = Coloc.colocalise(\n sl_overlaps, self.priorc1, self.priorc2, self.priorc12\n )\n ecaviar_results = ECaviar.colocalise(sl_overlaps)\n\n coloc_results.df.unionByName(ecaviar_results.df, allowMissingColumns=True)\n\n coloc_results.df.write.mode(self.session.write_mode).parquet(self.coloc_path)\n
Colocalisation step requirements.
Attributes:
- study_locus_path (DictConfig): Input Study-locus path.
- coloc_path (DictConfig): Output Colocalisation path.
- priorc1 (float): Prior on variant being causal for trait 1.
- priorc2 (float): Prior on variant being causal for trait 2.
- priorc12 (float): Prior on variant being causal for traits 1 and 2.
Source code in src/otg/config.py
@dataclass\nclass ColocalisationStepConfig:\n \"\"\"Colocalisation step requirements.\n\n Attributes:\n study_locus_path (DictConfig): Input Study-locus path.\n coloc_path (DictConfig): Output Colocalisation path.\n priorc1 (float): Prior on variant being causal for trait 1.\n priorc2 (float): Prior on variant being causal for trait 2.\n priorc12 (float): Prior on variant being causal for traits 1 and 2.\n \"\"\"\n\n _target_: str = \"otg.colocalisation.ColocalisationStep\"\n study_locus_path: str = MISSING\n study_index_path: str = MISSING\n coloc_path: str = MISSING\n priorc1: float = 1e-4\n priorc2: float = 1e-4\n priorc12: float = 1e-5\n
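Because the step is a plain dataclass layered on top of this config, it can be instantiated directly for local experimentation (the supported execution path remains the Dataproc workflow). The paths below are hypothetical:

from otg.colocalisation import ColocalisationStep

step = ColocalisationStep(
    study_locus_path="gs://bucket/study_locus",
    study_index_path="gs://bucket/study_index",
    coloc_path="gs://bucket/colocalisation",
    # priorc1, priorc2 and priorc12 keep their documented defaults.
)
step.run()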
The variant annotation step produces a dataset of the type VariantAnnotation derived from gnomAD's gnomad.genomes.vX.X.X.sites.ht Hail table. This dataset is used to validate variants and as a source of annotation.
Source code in src/otg/variant_annotation.py
@dataclass\nclass VariantAnnotationStep(VariantAnnotationStepConfig):\n \"\"\"Variant annotation step.\n\n Variant annotation step produces a dataset of the type `VariantAnnotation` derived from gnomADs `gnomad.genomes.vX.X.X.sites.ht` Hail's table. This dataset is used to validate variants and as a source of annotation.\n \"\"\"\n\n session: Session = Session()\n\n def run(self: VariantAnnotationStep) -> None:\n \"\"\"Run variant annotation step.\"\"\"\n # init hail session\n hl.init(sc=self.session.spark.sparkContext, log=\"/dev/null\")\n\n \"\"\"Run variant annotation step.\"\"\"\n variant_annotation = GnomADVariants.as_variant_annotation(\n self.gnomad_genomes,\n self.chain_38_to_37,\n self.populations,\n )\n # Writing data partitioned by chromosome and position:\n (\n variant_annotation.df.repartition(400, \"chromosome\")\n .sortWithinPartitions(\"chromosome\", \"position\")\n .write.partitionBy(\"chromosome\")\n .mode(self.session.write_mode)\n .parquet(self.variant_annotation_path)\n )\n
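The write pattern used here (repartition by chromosome, sort within partitions, then write partitioned by chromosome) is a generic PySpark idiom for producing position-sorted, chromosome-partitioned parquet. A standalone sketch of the same idiom on a toy DataFrame, with a hypothetical output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1", 1000, "1_1000_A_T"), ("1", 500, "1_500_G_C"), ("2", 42, "2_42_T_A")],
    ["chromosome", "position", "variantId"],
)

(
    df.repartition(4, "chromosome")  # co-locate rows of the same chromosome
    .sortWithinPartitions("chromosome", "position")  # order by position inside each partition
    .write.partitionBy("chromosome")  # one output directory per chromosome
    .mode("overwrite")
    .parquet("/tmp/variant_annotation_example")  # hypothetical output path
)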
Using a VariantAnnotation dataset as a reference, this step creates and writes a dataset of the type VariantIndex that includes only variants that have disease-association data with a reduced set of annotations.
Source code in src/otg/variant_index.py
@dataclass\nclass VariantIndexStep(VariantIndexStepConfig):\n \"\"\"Variant index step.\n\n Using a `VariantAnnotation` dataset as a reference, this step creates and writes a dataset of the type `VariantIndex` that includes only variants that have disease-association data with a reduced set of annotations.\n \"\"\"\n\n session: Session = Session()\n\n def run(self: VariantIndexStep) -> None:\n \"\"\"Run variant index step.\"\"\"\n # Variant annotation dataset\n va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n\n # Study-locus dataset\n study_locus = StudyLocus.from_parquet(self.session, self.study_locus_path)\n\n # Reduce scope of variant annotation dataset to only variants in study-locus sets:\n va_slimmed = va.filter_by_variant_df(\n study_locus.unique_lead_tag_variants(), [\"id\", \"chromosome\"]\n )\n\n # Generate variant index ussing a subset of the variant annotation dataset\n vi = VariantIndex.from_variant_annotation(va_slimmed)\n\n # Write data:\n # self.etl.logger.info(\n # f\"Writing invalid variants from the credible set to: {self.variant_invalid}\"\n # )\n # vi.invalid_variants.write.mode(self.etl.write_mode).parquet(\n # self.variant_invalid\n # )\n\n self.session.logger.info(f\"Writing variant index to: {self.variant_index_path}\")\n (\n vi.df.write.partitionBy(\"chromosome\")\n .mode(self.session.write_mode)\n .parquet(self.variant_index_path)\n )\n
This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:
1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).
2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.
3. Distance between the variant and each gene's canonical transcription start site (TSS).
Source code in src/otg/v2g.py
@dataclass\nclass V2GStep(V2GStepConfig):\n \"\"\"Variant-to-gene (V2G) step.\n\n This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:\n\n 1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).\n 2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.\n 3. Distance between the variant and each gene's canonical transcription start site (TSS).\n\n \"\"\"\n\n session: Session = Session()\n\n def run(self: V2GStep) -> None:\n \"\"\"Run V2G dataset generation.\"\"\"\n # Filter gene index by approved biotypes to define V2G gene universe\n gene_index_filtered = GeneIndex.from_parquet(\n self.session, self.gene_index_path\n ).filter_by_biotypes(self.approved_biotypes)\n\n vi = VariantIndex.from_parquet(self.session, self.variant_index_path).persist()\n va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n vep_consequences = self.session.spark.read.csv(\n self.vep_consequences_path, sep=\"\\t\", header=True\n )\n\n # Variant annotation reduced to the variant index to define V2G variant universe\n va_slimmed = va.filter_by_variant_df(vi.df, [\"id\", \"chromosome\"]).persist()\n\n # lift over variants to hg38\n lift = LiftOverSpark(\n self.liftover_chain_file_path, self.liftover_max_length_difference\n )\n\n # Expected andersson et al. schema:\n v2g_datasets = [\n va_slimmed.get_distance_to_tss(gene_index_filtered, self.max_distance),\n # variant effects\n va_slimmed.get_most_severe_vep_v2g(vep_consequences, gene_index_filtered),\n va_slimmed.get_polyphen_v2g(gene_index_filtered),\n va_slimmed.get_sift_v2g(gene_index_filtered),\n va_slimmed.get_plof_v2g(gene_index_filtered),\n # intervals\n IntervalsAndersson.parse(\n IntervalsAndersson.read_andersson(self.session, self.anderson_path),\n gene_index_filtered,\n lift,\n ).v2g(vi),\n IntervalsJavierre.parse(\n IntervalsJavierre.read_javierre(self.session, self.javierre_path),\n gene_index_filtered,\n lift,\n ).v2g(vi),\n IntervalsJung.parse(\n IntervalsJung.read_jung(self.session, self.jung_path),\n gene_index_filtered,\n lift,\n ).v2g(vi),\n IntervalsThurman.parse(\n IntervalsThurman.read_thurman(self.session, self.thurman_path),\n gene_index_filtered,\n lift,\n ).v2g(vi),\n ]\n\n # merge all V2G datasets\n v2g = V2G(\n _df=reduce(\n lambda x, y: x.unionByName(y, allowMissingColumns=True),\n [dataset.df for dataset in v2g_datasets],\n ).repartition(\"chromosome\")\n )\n # write V2G dataset\n (\n v2g.df.write.partitionBy(\"chromosome\")\n .mode(self.session.write_mode)\n .parquet(self.v2g_path)\n )\n
def run(self: V2GStep) -> None:\n \"\"\"Run V2G dataset generation.\"\"\"\n # Filter gene index by approved biotypes to define V2G gene universe\n gene_index_filtered = GeneIndex.from_parquet(\n self.session, self.gene_index_path\n ).filter_by_biotypes(self.approved_biotypes)\n\n vi = VariantIndex.from_parquet(self.session, self.variant_index_path).persist()\n va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n vep_consequences = self.session.spark.read.csv(\n self.vep_consequences_path, sep=\"\\t\", header=True\n )\n\n # Variant annotation reduced to the variant index to define V2G variant universe\n va_slimmed = va.filter_by_variant_df(vi.df, [\"id\", \"chromosome\"]).persist()\n\n # lift over variants to hg38\n lift = LiftOverSpark(\n self.liftover_chain_file_path, self.liftover_max_length_difference\n )\n\n # Expected andersson et al. schema:\n v2g_datasets = [\n va_slimmed.get_distance_to_tss(gene_index_filtered, self.max_distance),\n # variant effects\n va_slimmed.get_most_severe_vep_v2g(vep_consequences, gene_index_filtered),\n va_slimmed.get_polyphen_v2g(gene_index_filtered),\n va_slimmed.get_sift_v2g(gene_index_filtered),\n va_slimmed.get_plof_v2g(gene_index_filtered),\n # intervals\n IntervalsAndersson.parse(\n IntervalsAndersson.read_andersson(self.session, self.anderson_path),\n gene_index_filtered,\n lift,\n ).v2g(vi),\n IntervalsJavierre.parse(\n IntervalsJavierre.read_javierre(self.session, self.javierre_path),\n gene_index_filtered,\n lift,\n ).v2g(vi),\n IntervalsJung.parse(\n IntervalsJung.read_jung(self.session, self.jung_path),\n gene_index_filtered,\n lift,\n ).v2g(vi),\n IntervalsThurman.parse(\n IntervalsThurman.read_thurman(self.session, self.thurman_path),\n gene_index_filtered,\n lift,\n ).v2g(vi),\n ]\n\n # merge all V2G datasets\n v2g = V2G(\n _df=reduce(\n lambda x, y: x.unionByName(y, allowMissingColumns=True),\n [dataset.df for dataset in v2g_datasets],\n ).repartition(\"chromosome\")\n )\n # write V2G dataset\n (\n v2g.df.write.partitionBy(\"chromosome\")\n .mode(self.session.write_mode)\n .parquet(self.v2g_path)\n )\n
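The merge at the end relies on a generic PySpark pattern: folding a list of DataFrames with unionByName(..., allowMissingColumns=True), so sources with different column subsets line up by name and missing columns are null-filled. A standalone sketch of that pattern:

from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_distance = spark.createDataFrame(
    [("1_1000_A_T", "ENSG01", 0.9)], ["variantId", "geneId", "score"]
)
df_vep = spark.createDataFrame(
    [("1_2000_G_C", "ENSG02", "missense_variant")],
    ["variantId", "geneId", "mostSevereConsequence"],
)

# Columns absent from one source become nulls in the union.
merged = reduce(
    lambda x, y: x.unionByName(y, allowMissingColumns=True),
    [df_distance, df_vep],
)
merged.show()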
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Open Targets Genetics","text":"
Ingestion and analysis of genetic and functional genomic data for the identification and prioritisation of drug targets.
This project is still in an experimental phase. Please refer to the roadmap section for more information.
For information on how to contribute to the project see the contributing section.
The steps in this section only ever need to be done once on any particular system.
Google Cloud configuration: 1. Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install. 1. Log in to your work Google Account: run gcloud auth login and follow instructions. 1. Obtain Google application credentials: run gcloud auth application-default login and follow instructions.
Check that you have the make utility installed, and if not (which is unlikely), install it using your system package manager.
Run make setup-dev to install/update the necessary packages and activate the development environment. You need to do this every time you open a new shell.
It is recommended to use VS Code as an IDE for development.
"},{"location":"contributing/guidelines/#how-to-run-the-code","title":"How to run the code","text":"
All pipelines in this repository are intended to be run in Google Dataproc. Running them locally is not currently supported.
In order to run the code:
Manually edit your local workflow/dag.yaml file and comment out the steps you do not want to run.
Manually edit your local pyproject.toml file and modify the version of the code.
This must be different from the version used by any other people working on the repository to avoid any deployment conflicts, so it's a good idea to use your name, for example: 1.2.3+jdoe.
You can also add a brief branch description, for example: 1.2.3+jdoe.myfeature.
Note that the version must comply with PEP440 conventions, otherwise Poetry will not allow it to be deployed.
Do not use underscores or hyphens in your version name. When building the WHL file, they will be automatically converted to dots, which means the file name will no longer match the version and the build will fail. Use dots instead.
Run make build.
This will create a bundle containing the necessary code, configuration and dependencies to run the ETL pipeline, and then upload this bundle to Google Cloud.
A version specific subpath is used, so uploading the code will not affect any branches but your own.
If there was already a code bundle uploaded with the same version number, it will be replaced.
Submit the Dataproc job with poetry run python workflow/workflow_template.py
You will need to specify additional parameters, some are mandatory and some are optional. Run with --help to see usage.
The script will provision the cluster and submit the job.
The cluster will take a few minutes to get provisioned and running, during which the script will not output anything; this is normal.
Once submitted, you can monitor the progress of your job on this page: https://console.cloud.google.com/dataproc/jobs?project=open-targets-genetics-dev.
On completion (whether successful or a failure), the cluster will be automatically removed, so you don't have to worry about shutting it down to avoid incurring charges.
When making changes, and especially when implementing a new module or feature, it's essential to ensure that all relevant sections of the code base are modified.
- [ ] Run make check. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
- [ ] Develop unit tests for your code and run make test. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods.
- [ ] Update the configuration if necessary.
- [ ] Update the documentation and check it with run build-documentation. This will start a local server to browse it (the URL will be printed, usually http://127.0.0.1:8000/).
For more details on each of these steps, see the sections below.
If during development you had a question which wasn't covered in the documentation, and someone explained it to you, add it to the documentation. The same applies if you encountered any instructions in the documentation which were obsolete or incorrect.
Documentation autogeneration expressions start with :::. They will automatically generate sections of the documentation based on class and method docstrings. Be sure to update them for:
Dataset definitions in docs/reference/dataset (example: docs/reference/dataset/study_index/study_index_finngen.md)
Step definitions in docs/reference/step (example: docs/reference/step/finngen.md)
If you see errors related to BLAS/LAPACK libraries, see this StackOverflow post for guidance.
"},{"location":"contributing/troubleshooting/#pyenv-and-poetry","title":"Pyenv and Poetry","text":"
If you see various errors thrown by Pyenv or Poetry, they can be hard to specifically diagnose and resolve. In this case, it often helps to remove those tools from the system completely. Follow these steps:
Close your currently activated environment, if any: exit
Officially, PySpark requires Java version 8 (a.k.a. 1.8) or above to work. However, if you have a very recent version of Java, you may experience issues, as it may introduce breaking changes that PySpark hasn't had time to integrate. For example, as of May 2023, PySpark did not work with Java 20.
If you are encountering problems with initialising a Spark session, try using Java 11.
If you see an error message thrown by pre-commit, which looks like this (SyntaxError: Unexpected token '?'), followed by a JavaScript traceback, the issue is likely with your system NodeJS version.
One solution which can help in this case is to upgrade your system NodeJS version. However, this may not always be possible. For example, the Ubuntu repository is several major versions behind the latest version as of July 2023.
Another solution which helps is to remove Node, NodeJS, and npm from your system entirely. In this case, pre-commit will not try to rely on a system version of NodeJS and will install its own, suitable one.
On Ubuntu, this can be done using sudo apt remove node nodejs npm, followed by sudo apt autoremove. But in some cases, depending on your existing installation, you may need to also manually remove some files. See this StackOverflow answer for guidance.
After running these commands, you are advised to open a fresh shell, and then also reinstall Pyenv and Poetry to make sure they pick up the changes (see relevant section above).
Dataset is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the json.schemas module.
Source code in src/otg/dataset/dataset.py
@dataclass\nclass Dataset(ABC):\n \"\"\"Open Targets Genetics Dataset.\n\n `Dataset` is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the `json.schemas` module.\n \"\"\"\n\n _df: DataFrame\n _schema: StructType\n\n def __post_init__(self: Dataset) -> None:\n \"\"\"Post init.\"\"\"\n self.validate_schema()\n\n @property\n def df(self: Dataset) -> DataFrame:\n \"\"\"Dataframe included in the Dataset.\"\"\"\n return self._df\n\n @df.setter\n def df(self: Dataset, new_df: DataFrame) -> None: # noqa: CCE001\n \"\"\"Dataframe setter.\"\"\"\n self._df: DataFrame = new_df\n self.validate_schema()\n\n @property\n def schema(self: Dataset) -> StructType:\n \"\"\"Dataframe expected schema.\"\"\"\n return self._schema\n\n @classmethod\n @abstractmethod\n def get_schema(cls: type[Dataset]) -> StructType:\n \"\"\"Abstract method to get the schema. Must be implemented by child classes.\"\"\"\n pass\n\n @classmethod\n def from_parquet(\n cls: type[Dataset], session: Session, path: str, **kwargs: Dict[str, Any]\n ) -> Dataset:\n \"\"\"Reads a parquet file into a Dataset with a given schema.\"\"\"\n schema = cls.get_schema()\n df = session.read_parquet(path=path, schema=schema, **kwargs)\n return cls(_df=df, _schema=schema)\n\n def validate_schema(self: Dataset) -> None: # sourcery skip: invert-any-all\n \"\"\"Validate DataFrame schema against expected class schema.\n\n Raises:\n ValueError: DataFrame schema is not valid\n \"\"\"\n expected_schema = self._schema\n expected_fields = flatten_schema(expected_schema)\n observed_schema = self._df.schema\n observed_fields = flatten_schema(observed_schema)\n\n # Unexpected fields in dataset\n if unexpected_field_names := [\n x.name\n for x in observed_fields\n if x.name not in [y.name for y in expected_fields]\n ]:\n raise ValueError(\n f\"The {unexpected_field_names} fields are not included in DataFrame schema: {expected_fields}\"\n )\n\n # Required fields not in dataset\n required_fields = [x.name for x in expected_schema if not x.nullable]\n if missing_required_fields := [\n req\n for req in required_fields\n if not any(field.name == req for field in observed_fields)\n ]:\n raise ValueError(\n f\"The {missing_required_fields} fields are required but missing: {required_fields}\"\n )\n\n # Fields with duplicated names\n if duplicated_fields := [\n x for x in set(observed_fields) if observed_fields.count(x) > 1\n ]:\n raise ValueError(\n f\"The following fields are duplicated in DataFrame schema: {duplicated_fields}\"\n )\n\n # Fields with different datatype\n observed_field_types = {\n field.name: type(field.dataType) for field in observed_fields\n }\n expected_field_types = {\n field.name: type(field.dataType) for field in expected_fields\n }\n if fields_with_different_observed_datatype := [\n name\n for name, observed_type in observed_field_types.items()\n if name in expected_field_types\n and observed_type != expected_field_types[name]\n ]:\n raise ValueError(\n f\"The following fields present differences in their datatypes: {fields_with_different_observed_datatype}.\"\n )\n\n def persist(self: Dataset) -> Dataset:\n \"\"\"Persist in memory the DataFrame included in the Dataset.\"\"\"\n self.df = self._df.persist()\n return self\n\n def unpersist(self: Dataset) -> Dataset:\n \"\"\"Remove the persisted DataFrame from memory.\"\"\"\n self.df = self._df.unpersist()\n return self\n
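A minimal sketch of what a child dataset looks like; the inline schema here is purely illustrative, whereas the real datasets load theirs with parse_spark_schema from the json.schemas module:

from __future__ import annotations

from dataclasses import dataclass

from pyspark.sql.types import StringType, StructField, StructType

from otg.dataset.dataset import Dataset


@dataclass
class ToyDataset(Dataset):
    """Illustrative child dataset with a hard-coded schema."""

    @classmethod
    def get_schema(cls: type[ToyDataset]) -> StructType:
        # A real dataset would call parse_spark_schema("toy_dataset.json") instead.
        return StructType([StructField("variantId", StringType(), nullable=False)])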
Abstract method to get the schema. Must be implemented by child classes.
Source code in src/otg/dataset/dataset.py
@classmethod\n@abstractmethod\ndef get_schema(cls: type[Dataset]) -> StructType:\n \"\"\"Abstract method to get the schema. Must be implemented by child classes.\"\"\"\n pass\n
Persist in memory the DataFrame included in the Dataset.
Source code in src/otg/dataset/dataset.py
def persist(self: Dataset) -> Dataset:\n \"\"\"Persist in memory the DataFrame included in the Dataset.\"\"\"\n self.df = self._df.persist()\n return self\n
Validate DataFrame schema against expected class schema.
Raises:
- ValueError: DataFrame schema is not valid.
Source code in src/otg/dataset/dataset.py
def validate_schema(self: Dataset) -> None: # sourcery skip: invert-any-all\n \"\"\"Validate DataFrame schema against expected class schema.\n\n Raises:\n ValueError: DataFrame schema is not valid\n \"\"\"\n expected_schema = self._schema\n expected_fields = flatten_schema(expected_schema)\n observed_schema = self._df.schema\n observed_fields = flatten_schema(observed_schema)\n\n # Unexpected fields in dataset\n if unexpected_field_names := [\n x.name\n for x in observed_fields\n if x.name not in [y.name for y in expected_fields]\n ]:\n raise ValueError(\n f\"The {unexpected_field_names} fields are not included in DataFrame schema: {expected_fields}\"\n )\n\n # Required fields not in dataset\n required_fields = [x.name for x in expected_schema if not x.nullable]\n if missing_required_fields := [\n req\n for req in required_fields\n if not any(field.name == req for field in observed_fields)\n ]:\n raise ValueError(\n f\"The {missing_required_fields} fields are required but missing: {required_fields}\"\n )\n\n # Fields with duplicated names\n if duplicated_fields := [\n x for x in set(observed_fields) if observed_fields.count(x) > 1\n ]:\n raise ValueError(\n f\"The following fields are duplicated in DataFrame schema: {duplicated_fields}\"\n )\n\n # Fields with different datatype\n observed_field_types = {\n field.name: type(field.dataType) for field in observed_fields\n }\n expected_field_types = {\n field.name: type(field.dataType) for field in expected_fields\n }\n if fields_with_different_observed_datatype := [\n name\n for name, observed_type in observed_field_types.items()\n if name in expected_field_types\n and observed_type != expected_field_types[name]\n ]:\n raise ValueError(\n f\"The following fields present differences in their datatypes: {fields_with_different_observed_datatype}.\"\n )\n
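Validation runs on construction, so a DataFrame that does not match the declared schema fails fast. A sketch using the hypothetical ToyDataset defined above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The unexpected "foo" column triggers the first check in validate_schema.
bad_df = spark.createDataFrame([("1_1000_A_T", "x")], ["variantId", "foo"])
try:
    ToyDataset(_df=bad_df, _schema=ToyDataset.get_schema())
except ValueError as err:
    print(err)  # lists the unexpected field names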
Colocalisation results for pairs of overlapping study-locus.
Source code in src/otg/dataset/colocalisation.py
@dataclass\nclass Colocalisation(Dataset):\n \"\"\"Colocalisation results for pairs of overlapping study-locus.\"\"\"\n\n @classmethod\n def get_schema(cls: type[Colocalisation]) -> StructType:\n \"\"\"Provides the schema for the Colocalisation dataset.\"\"\"\n return parse_spark_schema(\"colocalisation.json\")\n
Provides the schema for the Colocalisation dataset.
Source code in src/otg/dataset/colocalisation.py
@classmethod\ndef get_schema(cls: type[Colocalisation]) -> StructType:\n \"\"\"Provides the schema for the Colocalisation dataset.\"\"\"\n return parse_spark_schema(\"colocalisation.json\")\n
Provides the schema for the GeneIndex dataset.
Source code in src/otg/dataset/gene_index.py
@classmethod\ndef get_schema(cls: type[GeneIndex]) -> StructType:\n    \"\"\"Provides the schema for the GeneIndex dataset.\"\"\"\n    return parse_spark_schema(\"gene_index.json\")\n
Intervals dataset links genes to genomic regions based on genome interaction studies.
Source code in src/otg/dataset/intervals.py
@dataclass\nclass Intervals(Dataset):\n \"\"\"Intervals dataset links genes to genomic regions based on genome interaction studies.\"\"\"\n\n @classmethod\n def get_schema(cls: type[Intervals]) -> StructType:\n \"\"\"Provides the schema for the Intervals dataset.\"\"\"\n return parse_spark_schema(\"intervals.json\")\n\n def v2g(self: Intervals, variant_index: VariantIndex) -> V2G:\n \"\"\"Convert intervals into V2G by intersecting with a variant index.\n\n Args:\n variant_index (VariantIndex): Variant index dataset\n\n Returns:\n V2G: Variant-to-gene evidence dataset\n \"\"\"\n return V2G(\n _df=(\n # TODO: We can include the start and end position as part of the `on` clause in the join\n self.df.alias(\"interval\")\n .join(\n variant_index.df.selectExpr(\n \"chromosome as vi_chromosome\", \"variantId\", \"position\"\n ).alias(\"vi\"),\n on=[\n f.col(\"vi.vi_chromosome\") == f.col(\"interval.chromosome\"),\n f.col(\"vi.position\").between(\n f.col(\"interval.start\"), f.col(\"interval.end\")\n ),\n ],\n how=\"inner\",\n )\n .drop(\"start\", \"end\", \"vi_chromosome\")\n ),\n _schema=V2G.get_schema(),\n )\n
@classmethod\ndef get_schema(cls: type[Intervals]) -> StructType:\n \"\"\"Provides the schema for the Intervals dataset.\"\"\"\n return parse_spark_schema(\"intervals.json\")\n
Convert intervals into V2G by intersecting with a variant index.
Parameters:
- variant_index (VariantIndex): Variant index dataset. Required.

Returns:
- V2G: Variant-to-gene evidence dataset.
Source code in src/otg/dataset/intervals.py
def v2g(self: Intervals, variant_index: VariantIndex) -> V2G:\n \"\"\"Convert intervals into V2G by intersecting with a variant index.\n\n Args:\n variant_index (VariantIndex): Variant index dataset\n\n Returns:\n V2G: Variant-to-gene evidence dataset\n \"\"\"\n return V2G(\n _df=(\n # TODO: We can include the start and end position as part of the `on` clause in the join\n self.df.alias(\"interval\")\n .join(\n variant_index.df.selectExpr(\n \"chromosome as vi_chromosome\", \"variantId\", \"position\"\n ).alias(\"vi\"),\n on=[\n f.col(\"vi.vi_chromosome\") == f.col(\"interval.chromosome\"),\n f.col(\"vi.position\").between(\n f.col(\"interval.start\"), f.col(\"interval.end\")\n ),\n ],\n how=\"inner\",\n )\n .drop(\"start\", \"end\", \"vi_chromosome\")\n ),\n _schema=V2G.get_schema(),\n )\n
Dataset containing linkage disequilibrium information between variants.
Source code in src/otg/dataset/ld_index.py
@dataclass\nclass LDIndex(Dataset):\n \"\"\"Dataset containing linkage desequilibrium information between variants.\"\"\"\n\n @classmethod\n def get_schema(cls: type[LDIndex]) -> StructType:\n \"\"\"Provides the schema for the LDIndex dataset.\"\"\"\n return parse_spark_schema(\"ld_index.json\")\n
@classmethod\ndef get_schema(cls: type[LDIndex]) -> StructType:\n \"\"\"Provides the schema for the LDIndex dataset.\"\"\"\n return parse_spark_schema(\"ld_index.json\")\n
A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.
Source code in src/otg/dataset/study_index.py
@dataclass\nclass StudyIndex(Dataset):\n \"\"\"Study index dataset.\n\n A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.\n \"\"\"\n\n @staticmethod\n def _aggregate_samples_by_ancestry(merged: Column, ancestry: Column) -> Column:\n \"\"\"Aggregate sample counts by ancestry in a list of struct colmns.\n\n Args:\n merged (Column): A column representing merged data (list of structs).\n ancestry (Column): The `ancestry` parameter is a column that represents the ancestry of each\n sample. (a struct)\n\n Returns:\n the modified \"merged\" column after aggregating the samples by ancestry.\n \"\"\"\n # Iterating over the list of ancestries and adding the sample size if label matches:\n return f.transform(\n merged,\n lambda a: f.when(\n a.ancestry == ancestry.ancestry,\n f.struct(\n a.ancestry.alias(\"ancestry\"),\n (a.sampleSize + ancestry.sampleSize).alias(\"sampleSize\"),\n ),\n ).otherwise(a),\n )\n\n @staticmethod\n def _map_ancestries_to_ld_population(gwas_ancestry_label: Column) -> Column:\n \"\"\"Normalise ancestry column from GWAS studies into reference LD panel based on a pre-defined map.\n\n This function assumes all possible ancestry categories have a corresponding\n LD panel in the LD index. It is very important to have the ancestry labels\n moved to the LD panel map.\n\n Args:\n gwas_ancestry_label (Column): A struct column with ancestry label like Finnish,\n European, African etc. and the corresponding sample size.\n\n Returns:\n Column: Struct column with the mapped LD population label and the sample size.\n \"\"\"\n # Loading ancestry label to LD population label:\n json_dict = json.loads(\n pkg_resources.read_text(\n data, \"gwas_population_2_LD_panel_map.json\", encoding=\"utf-8\"\n )\n )\n map_expr = f.create_map(*[f.lit(x) for x in chain(*json_dict.items())])\n\n return f.struct(\n map_expr[gwas_ancestry_label.ancestry].alias(\"ancestry\"),\n gwas_ancestry_label.sampleSize.alias(\"sampleSize\"),\n )\n\n @classmethod\n def get_schema(cls: type[StudyIndex]) -> StructType:\n \"\"\"Provide the schema for the StudyIndex dataset.\"\"\"\n return parse_spark_schema(\"study_index.json\")\n\n @classmethod\n def aggregate_and_map_ancestries(\n cls: type[StudyIndex], discovery_samples: Column\n ) -> Column:\n \"\"\"Map ancestries to populations in the LD reference and calculate relative sample size.\n\n Args:\n discovery_samples (Column): A list of struct column. 
Has an `ancestry` column and a `sampleSize` columns\n\n Returns:\n A list of struct with mapped LD population and their relative sample size.\n \"\"\"\n # Map ancestry categories to population labels of the LD index:\n mapped_ancestries = f.transform(\n discovery_samples, cls._map_ancestries_to_ld_population\n )\n\n # Aggregate sample sizes belonging to the same LD population:\n aggregated_counts = f.aggregate(\n mapped_ancestries,\n f.array_distinct(\n f.transform(\n mapped_ancestries,\n lambda x: f.struct(\n x.ancestry.alias(\"ancestry\"), f.lit(0.0).alias(\"sampleSize\")\n ),\n )\n ),\n cls._aggregate_samples_by_ancestry,\n )\n # Getting total sample count:\n total_sample_count = f.aggregate(\n aggregated_counts, f.lit(0.0), lambda total, pop: total + pop.sampleSize\n ).alias(\"sampleSize\")\n\n # Calculating relative sample size for each LD population:\n return f.transform(\n aggregated_counts,\n lambda ld_population: f.struct(\n ld_population.ancestry.alias(\"ldPopulation\"),\n (ld_population.sampleSize / total_sample_count).alias(\n \"relativeSampleSize\"\n ),\n ),\n )\n\n def study_type_lut(self: StudyIndex) -> DataFrame:\n \"\"\"Return a lookup table of study type.\n\n Returns:\n DataFrame: A dataframe containing `studyId` and `studyType` columns.\n \"\"\"\n return self.df.select(\"studyId\", \"studyType\")\n
Map ancestries to populations in the LD reference and calculate relative sample size.
Parameters:
- discovery_samples (Column): A list-of-struct column; each struct has an ancestry field and a sampleSize field. Required.

Returns:
- Column: A list of structs with the mapped LD population and their relative sample size.
Source code in src/otg/dataset/study_index.py
@classmethod\ndef aggregate_and_map_ancestries(\n cls: type[StudyIndex], discovery_samples: Column\n) -> Column:\n \"\"\"Map ancestries to populations in the LD reference and calculate relative sample size.\n\n Args:\n discovery_samples (Column): A list of struct column. Has an `ancestry` column and a `sampleSize` columns\n\n Returns:\n A list of struct with mapped LD population and their relative sample size.\n \"\"\"\n # Map ancestry categories to population labels of the LD index:\n mapped_ancestries = f.transform(\n discovery_samples, cls._map_ancestries_to_ld_population\n )\n\n # Aggregate sample sizes belonging to the same LD population:\n aggregated_counts = f.aggregate(\n mapped_ancestries,\n f.array_distinct(\n f.transform(\n mapped_ancestries,\n lambda x: f.struct(\n x.ancestry.alias(\"ancestry\"), f.lit(0.0).alias(\"sampleSize\")\n ),\n )\n ),\n cls._aggregate_samples_by_ancestry,\n )\n # Getting total sample count:\n total_sample_count = f.aggregate(\n aggregated_counts, f.lit(0.0), lambda total, pop: total + pop.sampleSize\n ).alias(\"sampleSize\")\n\n # Calculating relative sample size for each LD population:\n return f.transform(\n aggregated_counts,\n lambda ld_population: f.struct(\n ld_population.ancestry.alias(\"ldPopulation\"),\n (ld_population.sampleSize / total_sample_count).alias(\n \"relativeSampleSize\"\n ),\n ),\n )\n
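To make the arithmetic concrete, suppose a study reports 9,000 European and 1,000 East Asian samples: after each label is mapped to its LD panel, the relative sample sizes are simply 0.9 and 0.1. The sketch below assumes those ancestry labels are keys of gwas_population_2_LD_panel_map.json; a label missing from the map would come back with a null population:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

from otg.dataset.study_index import StudyIndex

spark = SparkSession.builder.getOrCreate()

studies = spark.createDataFrame(
    [("GCST000001", [("European", 9000.0), ("East Asian", 1000.0)])],
    "studyId string, discoverySamples array<struct<ancestry:string,sampleSize:double>>",
)

# Expect two LD populations with relativeSampleSize 0.9 and 0.1
# (the exact ldPopulation labels depend on the mapping file).
studies.select(
    "studyId",
    StudyIndex.aggregate_and_map_ancestries(f.col("discoverySamples")).alias(
        "ldPopulationStructure"
    ),
).show(truncate=False)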
@classmethod\ndef get_schema(cls: type[StudyIndex]) -> StructType:\n \"\"\"Provide the schema for the StudyIndex dataset.\"\"\"\n return parse_spark_schema(\"study_index.json\")\n
Returns:

- `DataFrame`: A dataframe containing `studyId` and `studyType` columns.
Source code in src/otg/dataset/study_index.py
def study_type_lut(self: StudyIndex) -> DataFrame:\n \"\"\"Return a lookup table of study type.\n\n Returns:\n DataFrame: A dataframe containing `studyId` and `studyType` columns.\n \"\"\"\n return self.df.select(\"studyId\", \"studyType\")\n
This dataset captures associations between studies/traits and genetic loci, as provided by fine-mapping methods.
Source code in src/otg/dataset/study_locus.py
@dataclass\nclass StudyLocus(Dataset):\n \"\"\"Study-Locus dataset.\n\n This dataset captures associations between study/traits and a genetic loci as provided by finemapping methods.\n \"\"\"\n\n @staticmethod\n def _overlapping_peaks(credset_to_overlap: DataFrame) -> DataFrame:\n \"\"\"Calculate overlapping signals (study-locus) between GWAS-GWAS and GWAS-Molecular trait.\n\n Args:\n credset_to_overlap (DataFrame): DataFrame containing at least `studyLocusId`, `studyType`, `chromosome` and `tagVariantId` columns.\n\n Returns:\n DataFrame: containing `leftStudyLocusId`, `rightStudyLocusId` and `chromosome` columns.\n \"\"\"\n # Reduce columns to the minimum to reduce the size of the dataframe\n credset_to_overlap = credset_to_overlap.select(\n \"studyLocusId\", \"studyType\", \"chromosome\", \"tagVariantId\"\n )\n return (\n credset_to_overlap.alias(\"left\")\n .filter(f.col(\"studyType\") == \"gwas\")\n # Self join with complex condition. Left it's all gwas and right can be gwas or molecular trait\n .join(\n credset_to_overlap.alias(\"right\"),\n on=[\n f.col(\"left.chromosome\") == f.col(\"right.chromosome\"),\n f.col(\"left.tagVariantId\") == f.col(\"right.tagVariantId\"),\n (f.col(\"right.studyType\") != \"gwas\")\n | (f.col(\"left.studyLocusId\") > f.col(\"right.studyLocusId\")),\n ],\n how=\"inner\",\n )\n .select(\n f.col(\"left.studyLocusId\").alias(\"leftStudyLocusId\"),\n f.col(\"right.studyLocusId\").alias(\"rightStudyLocusId\"),\n f.col(\"left.chromosome\").alias(\"chromosome\"),\n )\n .distinct()\n .repartition(\"chromosome\")\n .persist()\n )\n\n @staticmethod\n def _align_overlapping_tags(\n loci_to_overlap: DataFrame, peak_overlaps: DataFrame\n ) -> StudyLocusOverlap:\n \"\"\"Align overlapping tags in pairs of overlapping study-locus, keeping all tags in both loci.\n\n Args:\n loci_to_overlap (DataFrame): containing `studyLocusId`, `studyType`, `chromosome`, `tagVariantId`, `logABF` and `posteriorProbability` columns.\n peak_overlaps (DataFrame): containing `left_studyLocusId`, `right_studyLocusId` and `chromosome` columns.\n\n Returns:\n StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n \"\"\"\n # Complete information about all tags in the left study-locus of the overlap\n stats_cols = [\n \"logABF\",\n \"posteriorProbability\",\n \"beta\",\n \"pValueMantissa\",\n \"pValueExponent\",\n ]\n overlapping_left = loci_to_overlap.select(\n f.col(\"chromosome\"),\n f.col(\"tagVariantId\"),\n f.col(\"studyLocusId\").alias(\"leftStudyLocusId\"),\n *[f.col(col).alias(f\"left_{col}\") for col in stats_cols],\n ).join(peak_overlaps, on=[\"chromosome\", \"leftStudyLocusId\"], how=\"inner\")\n\n # Complete information about all tags in the right study-locus of the overlap\n overlapping_right = loci_to_overlap.select(\n f.col(\"chromosome\"),\n f.col(\"tagVariantId\"),\n f.col(\"studyLocusId\").alias(\"rightStudyLocusId\"),\n *[f.col(col).alias(f\"right_{col}\") for col in stats_cols],\n ).join(peak_overlaps, on=[\"chromosome\", \"rightStudyLocusId\"], how=\"inner\")\n\n # Include information about all tag variants in both study-locus aligned by tag variant id\n overlaps = overlapping_left.join(\n overlapping_right,\n on=[\n \"chromosome\",\n \"rightStudyLocusId\",\n \"leftStudyLocusId\",\n \"tagVariantId\",\n ],\n how=\"outer\",\n ).select(\n \"leftStudyLocusId\",\n \"rightStudyLocusId\",\n \"chromosome\",\n \"tagVariantId\",\n f.struct(\n *[f\"left_{e}\" for e in stats_cols] + [f\"right_{e}\" for e in stats_cols]\n ).alias(\"statistics\"),\n )\n return 
StudyLocusOverlap(\n _df=overlaps,\n _schema=StudyLocusOverlap.get_schema(),\n )\n\n @staticmethod\n def _update_quality_flag(\n qc: Column, flag_condition: Column, flag_text: StudyLocusQualityCheck\n ) -> Column:\n \"\"\"Update the provided quality control list with a new flag if condition is met.\n\n Args:\n qc (Column): Array column with the current list of qc flags.\n flag_condition (Column): This is a column of booleans, signing which row should be flagged\n flag_text (StudyLocusQualityCheck): Text for the new quality control flag\n\n Returns:\n Column: Array column with the updated list of qc flags.\n \"\"\"\n qc = f.when(qc.isNull(), f.array()).otherwise(qc)\n return f.when(\n flag_condition,\n f.array_union(qc, f.array(f.lit(flag_text.value))),\n ).otherwise(qc)\n\n @staticmethod\n def assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column:\n \"\"\"Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.\n\n Args:\n study_id_col (Column): column name with a study ID\n variant_id_col (Column): column name with a variant ID\n\n Returns:\n Column: column with a study locus ID\n\n Examples:\n >>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n >>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n +----------+----------+--------------------+\n | studyId| variantId| study_locus_id|\n +----------+----------+--------------------+\n |GCST000001|1_1000_A_C| 7437284926964690765|\n |GCST000002|1_1000_A_C|-7653912547667845377|\n +----------+----------+--------------------+\n <BLANKLINE>\n \"\"\"\n return f.xxhash64(*[study_id_col, variant_id_col]).alias(\"studyLocusId\")\n\n @classmethod\n def get_schema(cls: type[StudyLocus]) -> StructType:\n \"\"\"Provides the schema for the StudyLocus dataset.\"\"\"\n return parse_spark_schema(\"study_locus.json\")\n\n def filter_credible_set(\n self: StudyLocus,\n credible_interval: CredibleInterval,\n ) -> StudyLocus:\n \"\"\"Filter study-locus tag variants based on given credible interval.\n\n Args:\n credible_interval (CredibleInterval): Credible interval to filter for.\n\n Returns:\n StudyLocus: Filtered study-locus dataset.\n \"\"\"\n self.df = self._df.withColumn(\n \"locus\",\n f.expr(f\"filter(locus, tag -> (tag.{credible_interval.value}))\"),\n )\n return self\n\n def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:\n \"\"\"Calculate overlapping study-locus.\n\n Find overlapping study-locus that share at least one tagging variant. 
All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always\n appearing on the right side.\n\n Args:\n study_index (StudyIndex): Study index to resolve study types.\n\n Returns:\n StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n \"\"\"\n loci_to_overlap = (\n self.df.join(study_index.study_type_lut(), on=\"studyId\", how=\"inner\")\n .withColumn(\"locus\", f.explode(\"locus\"))\n .select(\n \"studyLocusId\",\n \"studyType\",\n \"chromosome\",\n f.col(\"locus.variantId\").alias(\"tagVariantId\"),\n f.col(\"locus.logABF\").alias(\"logABF\"),\n f.col(\"locus.posteriorProbability\").alias(\"posteriorProbability\"),\n f.col(\"locus.pValueMantissa\").alias(\"pValueMantissa\"),\n f.col(\"locus.pValueExponent\").alias(\"pValueExponent\"),\n f.col(\"locus.beta\").alias(\"beta\"),\n )\n .persist()\n )\n\n # overlapping study-locus\n peak_overlaps = self._overlapping_peaks(loci_to_overlap)\n\n # study-locus overlap by aligning overlapping variants\n return self._align_overlapping_tags(loci_to_overlap, peak_overlaps)\n\n def unique_variants_in_locus(self: StudyLocus) -> DataFrame:\n \"\"\"All unique variants collected in a `StudyLocus` dataframe.\n\n Returns:\n DataFrame: A dataframe containing `variantId` and `chromosome` columns.\n \"\"\"\n return (\n self.df.withColumn(\n \"variantId\",\n # Joint array of variants in that studylocus. Locus can be null\n f.explode(\n f.array_union(\n f.array(f.col(\"variantId\")),\n f.coalesce(f.col(\"locus.variantId\"), f.array()),\n )\n ),\n )\n .select(\n \"variantId\", f.split(f.col(\"variantId\"), \"_\")[0].alias(\"chromosome\")\n )\n .distinct()\n )\n\n def neglog_pvalue(self: StudyLocus) -> Column:\n \"\"\"Returns the negative log p-value.\n\n Returns:\n Column: Negative log p-value\n \"\"\"\n return calculate_neglog_pvalue(\n self.df.pValueMantissa,\n self.df.pValueExponent,\n )\n\n def annotate_credible_sets(self: StudyLocus) -> StudyLocus:\n \"\"\"Annotate study-locus dataset with credible set flags.\n\n Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds\n `is95CredibleSet` and `is99CredibleSet` fields to the elements, indicating which are the tagging variants whose cumulative sum\n of their `posteriorProbability` values is below 0.95 and 0.99, respectively.\n\n Returns:\n StudyLocus: including annotation on `is95CredibleSet` and `is99CredibleSet`.\n \"\"\"\n if \"locus\" not in self.df.columns:\n raise ValueError(\"Locus column not available.\")\n\n self.df = self.df.withColumn(\n # Sort credible set by posterior probability in descending order\n \"locus\",\n f.when(\n f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n order_array_of_structs_by_field(\"locus\", \"posteriorProbability\"),\n ),\n ).withColumn(\n # Calculate array of cumulative sums of posterior probabilities to determine which variants are in the 95% and 99% credible sets\n # and zip the cumulative sums array with the credible set array to add the flags\n \"locus\",\n f.when(\n f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n f.zip_with(\n f.col(\"locus\"),\n f.transform(\n f.sequence(f.lit(1), f.size(f.col(\"locus\"))),\n lambda index: f.aggregate(\n f.slice(\n # By using `index - 1` we introduce a value of `0.0` in the cumulative sums array. 
to ensure that the last variant\n # that exceeds the 0.95 threshold is included in the cumulative sum, as its probability is necessary to satisfy the threshold.\n f.col(\"locus.posteriorProbability\"),\n 1,\n index - 1,\n ),\n f.lit(0.0),\n lambda acc, el: acc + el,\n ),\n ),\n lambda struct_e, acc: struct_e.withField(\n CredibleInterval.IS95.value, (acc < 0.95) & acc.isNotNull()\n ).withField(\n CredibleInterval.IS99.value, (acc < 0.99) & acc.isNotNull()\n ),\n ),\n ),\n )\n return self\n\n def clump(self: StudyLocus) -> StudyLocus:\n \"\"\"Perform LD clumping of the studyLocus.\n\n Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n Returns:\n StudyLocus: with empty credible sets for linked variants and QC flag.\n \"\"\"\n self.df = (\n self.df.withColumn(\n \"is_lead_linked\",\n LDclumping._is_lead_linked(\n self.df.studyId,\n self.df.variantId,\n self.df.pValueExponent,\n self.df.pValueMantissa,\n self.df.ldSet,\n ),\n )\n .withColumn(\n \"ldSet\",\n f.when(f.col(\"is_lead_linked\"), f.array()).otherwise(f.col(\"ldSet\")),\n )\n .withColumn(\n \"qualityControls\",\n StudyLocus._update_quality_flag(\n f.col(\"qualityControls\"),\n f.col(\"is_lead_linked\"),\n StudyLocusQualityCheck.LD_CLUMPED,\n ),\n )\n .drop(\"is_lead_linked\")\n )\n return self\n\n def _qc_unresolved_ld(\n self: StudyLocus,\n ) -> StudyLocus:\n \"\"\"Flag associations with variants that are not found in the LD reference.\n\n Returns:\n StudyLocusGWASCatalog | StudyLocus: Updated study locus.\n \"\"\"\n self.df = self.df.withColumn(\n \"qualityControls\",\n self._update_quality_flag(\n f.col(\"qualityControls\"),\n f.col(\"ldSet\").isNull(),\n StudyLocusQualityCheck.UNRESOLVED_LD,\n ),\n )\n return self\n\n def _qc_no_population(self: StudyLocus) -> StudyLocus:\n \"\"\"Flag associations where the study doesn't have population information to resolve LD.\n\n Returns:\n StudyLocusGWASCatalog | StudyLocus: Updated study locus.\n \"\"\"\n # If the tested column is not present, return self unchanged:\n if \"ldPopulationStructure\" not in self.df.columns:\n return self\n\n self.df = self.df.withColumn(\n \"qualityControls\",\n self._update_quality_flag(\n f.col(\"qualityControls\"),\n f.col(\"ldPopulationStructure\").isNull(),\n StudyLocusQualityCheck.NO_POPULATION,\n ),\n )\n return self\n
Annotate study-locus dataset with credible set flags.
Sorts the array in the locus column elements by their posteriorProbability values in descending order and adds is95CredibleSet and is99CredibleSet fields to the elements, indicating which are the tagging variants whose cumulative sum of their posteriorProbability values is below 0.95 and 0.99, respectively.
Returns:

- `StudyLocus`: the study-locus dataset including annotation on `is95CredibleSet` and `is99CredibleSet`.
Source code in src/otg/dataset/study_locus.py
def annotate_credible_sets(self: StudyLocus) -> StudyLocus:\n \"\"\"Annotate study-locus dataset with credible set flags.\n\n Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds\n `is95CredibleSet` and `is99CredibleSet` fields to the elements, indicating which are the tagging variants whose cumulative sum\n of their `posteriorProbability` values is below 0.95 and 0.99, respectively.\n\n Returns:\n StudyLocus: including annotation on `is95CredibleSet` and `is99CredibleSet`.\n \"\"\"\n if \"locus\" not in self.df.columns:\n raise ValueError(\"Locus column not available.\")\n\n self.df = self.df.withColumn(\n # Sort credible set by posterior probability in descending order\n \"locus\",\n f.when(\n f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n order_array_of_structs_by_field(\"locus\", \"posteriorProbability\"),\n ),\n ).withColumn(\n # Calculate array of cumulative sums of posterior probabilities to determine which variants are in the 95% and 99% credible sets\n # and zip the cumulative sums array with the credible set array to add the flags\n \"locus\",\n f.when(\n f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n f.zip_with(\n f.col(\"locus\"),\n f.transform(\n f.sequence(f.lit(1), f.size(f.col(\"locus\"))),\n lambda index: f.aggregate(\n f.slice(\n # By using `index - 1` we introduce a value of `0.0` in the cumulative sums array. to ensure that the last variant\n # that exceeds the 0.95 threshold is included in the cumulative sum, as its probability is necessary to satisfy the threshold.\n f.col(\"locus.posteriorProbability\"),\n 1,\n index - 1,\n ),\n f.lit(0.0),\n lambda acc, el: acc + el,\n ),\n ),\n lambda struct_e, acc: struct_e.withField(\n CredibleInterval.IS95.value, (acc < 0.95) & acc.isNotNull()\n ).withField(\n CredibleInterval.IS99.value, (acc < 0.99) & acc.isNotNull()\n ),\n ),\n ),\n )\n return self\n
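To make the cumulative-sum logic above concrete, here is an illustrative plain-Python sketch (not part of the package): sort tags by posterior probability, compute the cumulative sum up to the *previous* element (the `index - 1` slice in the Spark code), and flag elements while that running sum is still below the threshold, so the variant that crosses the threshold is kept in the credible set.

```python
def flag_credible_set(posteriors: list[float], threshold: float) -> list[bool]:
    """Flag which tag variants belong to the credible set at `threshold`."""
    ordered = sorted(posteriors, reverse=True)
    flags = []
    cumulative_before = 0.0  # sum of posteriors of the *preceding* variants
    for p in ordered:
        flags.append(cumulative_before < threshold)
        cumulative_before += p
    return flags

# The third variant is included because the first two only sum to 0.9 (< 0.95).
print(flag_credible_set([0.6, 0.3, 0.08, 0.02], 0.95))  # [True, True, True, False]
```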
@staticmethod\ndef assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column:\n \"\"\"Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.\n\n Args:\n study_id_col (Column): column name with a study ID\n variant_id_col (Column): column name with a variant ID\n\n Returns:\n Column: column with a study locus ID\n\n Examples:\n >>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n >>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n +----------+----------+--------------------+\n | studyId| variantId| study_locus_id|\n +----------+----------+--------------------+\n |GCST000001|1_1000_A_C| 7437284926964690765|\n |GCST000002|1_1000_A_C|-7653912547667845377|\n +----------+----------+--------------------+\n <BLANKLINE>\n \"\"\"\n return f.xxhash64(*[study_id_col, variant_id_col]).alias(\"studyLocusId\")\n
Find overlapping study-locus pairs that share at least one tagging variant. All GWAS-GWAS and GWAS-molecular trait overlaps are computed, with molecular traits always appearing on the right side.
Parameters:

- `study_index` (`StudyIndex`, required): Study index to resolve study types.

Returns:

- `StudyLocusOverlap`: Pairs of overlapping study-locus with aligned tags.
Source code in src/otg/dataset/study_locus.py
def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:\n \"\"\"Calculate overlapping study-locus.\n\n Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always\n appearing on the right side.\n\n Args:\n study_index (StudyIndex): Study index to resolve study types.\n\n Returns:\n StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n \"\"\"\n loci_to_overlap = (\n self.df.join(study_index.study_type_lut(), on=\"studyId\", how=\"inner\")\n .withColumn(\"locus\", f.explode(\"locus\"))\n .select(\n \"studyLocusId\",\n \"studyType\",\n \"chromosome\",\n f.col(\"locus.variantId\").alias(\"tagVariantId\"),\n f.col(\"locus.logABF\").alias(\"logABF\"),\n f.col(\"locus.posteriorProbability\").alias(\"posteriorProbability\"),\n f.col(\"locus.pValueMantissa\").alias(\"pValueMantissa\"),\n f.col(\"locus.pValueExponent\").alias(\"pValueExponent\"),\n f.col(\"locus.beta\").alias(\"beta\"),\n )\n .persist()\n )\n\n # overlapping study-locus\n peak_overlaps = self._overlapping_peaks(loci_to_overlap)\n\n # study-locus overlap by aligning overlapping variants\n return self._align_overlapping_tags(loci_to_overlap, peak_overlaps)\n
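A minimal usage sketch with hypothetical parquet paths (the datasets and their locations are assumptions, not from the docs): load a study index and a study-locus dataset, then compute their overlaps.

```python
from pyspark.sql import SparkSession

from otg.dataset.study_index import StudyIndex
from otg.dataset.study_locus import StudyLocus

spark = SparkSession.builder.getOrCreate()

study_index = StudyIndex(
    _df=spark.read.parquet("path/to/study_index"),  # hypothetical path
    _schema=StudyIndex.get_schema(),
)
study_locus = StudyLocus(
    _df=spark.read.parquet("path/to/study_locus"),  # hypothetical path
    _schema=StudyLocus.get_schema(),
)

# Pairs of overlapping study-loci with aligned tag variants.
overlaps = study_locus.find_overlaps(study_index)
```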
@classmethod\ndef get_schema(cls: type[StudyLocus]) -> StructType:\n \"\"\"Provides the schema for the StudyLocus dataset.\"\"\"\n return parse_spark_schema(\"study_locus.json\")\n
Study-Locus quality control options listing concerns on the quality of the association.
Attributes:

- `SUBSIGNIFICANT_FLAG` (`str`): p-value below significance threshold
- `NO_GENOMIC_LOCATION_FLAG` (`str`): Incomplete genomic mapping
- `COMPOSITE_FLAG` (`str`): Composite association due to variant x variant interactions
- `VARIANT_INCONSISTENCY_FLAG` (`str`): Inconsistencies in the reported variants
- `NON_MAPPED_VARIANT_FLAG` (`str`): Variant not mapped to GnomAD
- `PALINDROMIC_ALLELE_FLAG` (`str`): Alleles are palindromic - cannot harmonize
- `AMBIGUOUS_STUDY` (`str`): Association with ambiguous study
- `UNRESOLVED_LD` (`str`): Variant not found in LD reference
- `LD_CLUMPED` (`str`): Explained by a more significant variant in high LD (clumped)
Source code in src/otg/dataset/study_locus.py
class StudyLocusQualityCheck(Enum):\n \"\"\"Study-Locus quality control options listing concerns on the quality of the association.\n\n Attributes:\n SUBSIGNIFICANT_FLAG (str): p-value below significance threshold\n NO_GENOMIC_LOCATION_FLAG (str): Incomplete genomic mapping\n COMPOSITE_FLAG (str): Composite association due to variant x variant interactions\n VARIANT_INCONSISTENCY_FLAG (str): Inconsistencies in the reported variants\n NON_MAPPED_VARIANT_FLAG (str): Variant not mapped to GnomAd\n PALINDROMIC_ALLELE_FLAG (str): Alleles are palindromic - cannot harmonize\n AMBIGUOUS_STUDY (str): Association with ambiguous study\n UNRESOLVED_LD (str): Variant not found in LD reference\n LD_CLUMPED (str): Explained by a more significant variant in high LD (clumped)\n \"\"\"\n\n SUBSIGNIFICANT_FLAG = \"Subsignificant p-value\"\n NO_GENOMIC_LOCATION_FLAG = \"Incomplete genomic mapping\"\n COMPOSITE_FLAG = \"Composite association\"\n INCONSISTENCY_FLAG = \"Variant inconsistency\"\n NON_MAPPED_VARIANT_FLAG = \"No mapping in GnomAd\"\n PALINDROMIC_ALLELE_FLAG = \"Palindrome alleles - cannot harmonize\"\n AMBIGUOUS_STUDY = \"Association with ambiguous study\"\n UNRESOLVED_LD = \"Variant not found in LD reference\"\n LD_CLUMPED = \"Explained by a more significant variant in high LD (clumped)\"\n NO_POPULATION = \"Study does not have population annotation to resolve LD\"\n
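The following plain-Python sketch (not from the package) illustrates how the `_update_quality_flag` helper above accumulates flags: each check appends its enum value to the `qualityControls` array only when its condition is met, leaving any existing flags untouched.

```python
def update_quality_flag(qc: list[str], condition: bool, flag_text: str) -> list[str]:
    # Append the flag only when the condition is met; existing flags are preserved.
    return qc + [flag_text] if condition else qc

quality_controls: list[str] = []
quality_controls = update_quality_flag(
    quality_controls, True, "Variant not found in LD reference"
)
quality_controls = update_quality_flag(
    quality_controls, False, "Subsignificant p-value"
)
print(quality_controls)  # ['Variant not found in LD reference']
```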
Interval within which an unobserved parameter value falls with a particular probability.
Attributes:

- `IS95` (`str`): 95% credible interval
- `IS99` (`str`): 99% credible interval
Source code in src/otg/dataset/study_locus.py
class CredibleInterval(Enum):\n \"\"\"Credible interval enum.\n\n Interval within which an unobserved parameter value falls with a particular probability.\n\n Attributes:\n IS95 (str): 95% credible interval\n IS99 (str): 99% credible interval\n \"\"\"\n\n IS95 = \"is95CredibleSet\"\n IS99 = \"is99CredibleSet\"\n
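A short usage sketch, reusing the `study_locus` object constructed in the earlier overlap example (an assumption, not from the docs): annotate credible set flags and then keep only tag variants in the 95% credible set.

```python
from otg.dataset.study_locus import CredibleInterval

# Annotate `is95CredibleSet`/`is99CredibleSet`, then filter the locus arrays
# down to the 95% credible set.
study_locus_95 = study_locus.annotate_credible_sets().filter_credible_set(
    CredibleInterval.IS95
)
```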
This dataset captures pairs of overlapping StudyLocus: that is, associations whose credible sets share at least one tagging variant.
Note
This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.
Source code in src/otg/dataset/study_locus_overlap.py
@dataclass\nclass StudyLocusOverlap(Dataset):\n \"\"\"Study-Locus overlap.\n\n This dataset captures pairs of overlapping `StudyLocus`: that is associations whose credible sets share at least one tagging variant.\n\n !!! note\n This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.\n \"\"\"\n\n @classmethod\n def get_schema(cls: type[StudyLocusOverlap]) -> StructType:\n \"\"\"Provides the schema for the StudyLocusOverlap dataset.\"\"\"\n return parse_spark_schema(\"study_locus_overlap.json\")\n\n @classmethod\n def from_associations(\n cls: type[StudyLocusOverlap], study_locus: StudyLocus, study_index: StudyIndex\n ) -> StudyLocusOverlap:\n \"\"\"Find the overlapping signals in a particular set of associations (StudyLocus dataset).\n\n Args:\n study_locus (StudyLocus): Study-locus associations to find the overlapping signals\n study_index (StudyIndex): Study index to find the overlapping signals\n\n Returns:\n StudyLocusOverlap: Study-locus overlap dataset\n \"\"\"\n return study_locus.find_overlaps(study_index)\n
Find the overlapping signals in a particular set of associations (StudyLocus dataset).
Parameters:

- `study_locus` (`StudyLocus`, required): Study-locus associations to find the overlapping signals
- `study_index` (`StudyIndex`, required): Study index to find the overlapping signals

Returns:

- `StudyLocusOverlap`: Study-locus overlap dataset
Source code in src/otg/dataset/study_locus_overlap.py
@classmethod\ndef from_associations(\n cls: type[StudyLocusOverlap], study_locus: StudyLocus, study_index: StudyIndex\n) -> StudyLocusOverlap:\n \"\"\"Find the overlapping signals in a particular set of associations (StudyLocus dataset).\n\n Args:\n study_locus (StudyLocus): Study-locus associations to find the overlapping signals\n study_index (StudyIndex): Study index to find the overlapping signals\n\n Returns:\n StudyLocusOverlap: Study-locus overlap dataset\n \"\"\"\n return study_locus.find_overlaps(study_index)\n
Provides the schema for the StudyLocusOverlap dataset.
Source code in src/otg/dataset/study_locus_overlap.py
@classmethod\ndef get_schema(cls: type[StudyLocusOverlap]) -> StructType:\n \"\"\"Provides the schema for the StudyLocusOverlap dataset.\"\"\"\n return parse_spark_schema(\"study_locus_overlap.json\")\n
A summary statistics dataset contains all single point statistics resulting from a GWAS.
Source code in src/otg/dataset/summary_statistics.py
@dataclass\nclass SummaryStatistics(Dataset):\n \"\"\"Summary Statistics dataset.\n\n A summary statistics dataset contains all single point statistics resulting from a GWAS.\n \"\"\"\n\n @classmethod\n def get_schema(cls: type[SummaryStatistics]) -> StructType:\n \"\"\"Provides the schema for the SummaryStatistics dataset.\"\"\"\n return parse_spark_schema(\"summary_statistics.json\")\n\n def pvalue_filter(self: SummaryStatistics, pvalue: float) -> SummaryStatistics:\n \"\"\"Filter summary statistics based on the provided p-value threshold.\n\n Args:\n pvalue (float): upper limit of the p-value to be filtered upon.\n\n Returns:\n SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.\n \"\"\"\n # Converting p-value to mantissa and exponent:\n (mantissa, exponent) = split_pvalue(pvalue)\n\n # Applying filter:\n df = self._df.filter(\n (f.col(\"pValueExponent\") < exponent)\n | (\n (f.col(\"pValueExponent\") == exponent)\n & (f.col(\"pValueMantissa\") <= mantissa)\n )\n )\n return SummaryStatistics(_df=df, _schema=self._schema)\n\n def window_based_clumping(\n self: SummaryStatistics,\n distance: int,\n gwas_significance: float = 5e-8,\n baseline_significance: float = 0.05,\n locus_collect_distance: int | None = None,\n ) -> StudyLocus:\n \"\"\"Generate study-locus from summary statistics by distance based clumping + collect locus.\n\n Args:\n distance (int): Distance in base pairs to be used for clumping.\n gwas_significance (float, optional): GWAS significance threshold. Defaults to 5e-8.\n baseline_significance (float, optional): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.\n locus_collect_distance (int, optional): The distance to collect locus around semi-indices. If not provided, defaults to `distance`.\n\n Returns:\n StudyLocus: Clumped study-locus containing variants based on window.\n \"\"\"\n # If locus collect distance is present, collect locus with the provided distance:\n if locus_collect_distance:\n clumped_df = WindowBasedClumping.clump_with_locus(\n self,\n window_length=distance,\n p_value_significance=gwas_significance,\n p_value_baseline=baseline_significance,\n locus_window_length=locus_collect_distance,\n )\n else:\n clumped_df = WindowBasedClumping.clump(\n self, window_length=distance, p_value_significance=gwas_significance\n )\n\n return clumped_df\n\n def exclude_region(self: SummaryStatistics, region: str) -> SummaryStatistics:\n \"\"\"Exclude a region from the summary stats dataset.\n\n Args:\n region (str): region given in \"chr##:#####-####\" format\n\n Returns:\n SummaryStatistics: filtered summary statistics.\n \"\"\"\n (chromosome, start_position, end_position) = parse_region(region)\n\n return SummaryStatistics(\n _df=(\n self.df.filter(\n ~(\n (f.col(\"chromosome\") == chromosome)\n & (\n (f.col(\"position\") >= start_position)\n & (f.col(\"position\") <= end_position)\n )\n )\n )\n ),\n _schema=SummaryStatistics.get_schema(),\n )\n
Provides the schema for the SummaryStatistics dataset.
Source code in src/otg/dataset/summary_statistics.py
@classmethod\ndef get_schema(cls: type[SummaryStatistics]) -> StructType:\n \"\"\"Provides the schema for the SummaryStatistics dataset.\"\"\"\n return parse_spark_schema(\"summary_statistics.json\")\n
Filter summary statistics based on the provided p-value threshold.
Parameters:

- `pvalue` (`float`, required): Upper limit of the p-value to be filtered upon.

Returns:

- `SummaryStatistics`: Summary statistics object containing single point associations with p-values at least as significant as the provided threshold.
Source code in src/otg/dataset/summary_statistics.py
def pvalue_filter(self: SummaryStatistics, pvalue: float) -> SummaryStatistics:\n \"\"\"Filter summary statistics based on the provided p-value threshold.\n\n Args:\n pvalue (float): upper limit of the p-value to be filtered upon.\n\n Returns:\n SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.\n \"\"\"\n # Converting p-value to mantissa and exponent:\n (mantissa, exponent) = split_pvalue(pvalue)\n\n # Applying filter:\n df = self._df.filter(\n (f.col(\"pValueExponent\") < exponent)\n | (\n (f.col(\"pValueExponent\") == exponent)\n & (f.col(\"pValueMantissa\") <= mantissa)\n )\n )\n return SummaryStatistics(_df=df, _schema=self._schema)\n
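The comparison above works on a mantissa/exponent decomposition of the p-value. The following plain-Python sketch (not from the package; `split_pvalue_sketch` is a hypothetical stand-in for the package's `split_pvalue` helper) shows the decomposition and the "at least as significant" test used in the filter.

```python
import math

def split_pvalue_sketch(pvalue: float) -> tuple[float, int]:
    # Decompose a p-value into mantissa and exponent, e.g. 5e-8 -> roughly (5.0, -8).
    exponent = math.floor(math.log10(pvalue))
    return pvalue / 10**exponent, exponent

def is_at_least_as_significant(mantissa: float, exponent: int, threshold: float) -> bool:
    # Mirrors the Spark filter: a smaller exponent always passes; equal exponents
    # compare mantissas.
    t_mantissa, t_exponent = split_pvalue_sketch(threshold)
    return exponent < t_exponent or (exponent == t_exponent and mantissa <= t_mantissa)

print(is_at_least_as_significant(3.2, -9, 5e-8))  # True: 3.2e-9 passes a 5e-8 cut
print(is_at_least_as_significant(6.0, -8, 5e-8))  # False: 6e-8 does not
```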
Generate study-locus from summary statistics by distance-based clumping, optionally collecting the surrounding locus.
Parameters:

- `distance` (`int`, required): Distance in base pairs to be used for clumping.
- `gwas_significance` (`float`, default `5e-08`): GWAS significance threshold.
- `baseline_significance` (`float`, default `0.05`): Baseline significance threshold for inclusion in the locus.
- `locus_collect_distance` (`int`, default `None`): The distance to collect locus around semi-indices. If not provided, defaults to `distance`.

Returns:

- `StudyLocus`: Clumped study-locus containing variants based on window.
Source code in src/otg/dataset/summary_statistics.py
def window_based_clumping(\n self: SummaryStatistics,\n distance: int,\n gwas_significance: float = 5e-8,\n baseline_significance: float = 0.05,\n locus_collect_distance: int | None = None,\n) -> StudyLocus:\n \"\"\"Generate study-locus from summary statistics by distance based clumping + collect locus.\n\n Args:\n distance (int): Distance in base pairs to be used for clumping.\n gwas_significance (float, optional): GWAS significance threshold. Defaults to 5e-8.\n baseline_significance (float, optional): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.\n locus_collect_distance (int, optional): The distance to collect locus around semi-indices. If not provided, defaults to `distance`.\n\n Returns:\n StudyLocus: Clumped study-locus containing variants based on window.\n \"\"\"\n # If locus collect distance is present, collect locus with the provided distance:\n if locus_collect_distance:\n clumped_df = WindowBasedClumping.clump_with_locus(\n self,\n window_length=distance,\n p_value_significance=gwas_significance,\n p_value_baseline=baseline_significance,\n locus_window_length=locus_collect_distance,\n )\n else:\n clumped_df = WindowBasedClumping.clump(\n self, window_length=distance, p_value_significance=gwas_significance\n )\n\n return clumped_df\n
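A usage sketch with hypothetical paths and parameter choices (the 500 kb window and 1.5 Mb locus collection distance are illustrative assumptions): clump summary statistics into a `StudyLocus`, collecting the surrounding locus.

```python
from pyspark.sql import SparkSession

from otg.dataset.summary_statistics import SummaryStatistics

spark = SparkSession.builder.getOrCreate()

sumstats = SummaryStatistics(
    _df=spark.read.parquet("path/to/summary_statistics"),  # hypothetical path
    _schema=SummaryStatistics.get_schema(),
)

# Clump within 500 kb of each lead variant and collect the surrounding locus
# within 1.5 Mb of each semi-index.
clumped = sumstats.window_based_clumping(
    distance=500_000,
    gwas_significance=5e-8,
    locus_collect_distance=1_500_000,
)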
Source code in src/otg/dataset/variant_annotation.py
@dataclass\nclass VariantAnnotation(Dataset):\n \"\"\"Dataset with variant-level annotations.\"\"\"\n\n @classmethod\n def get_schema(cls: type[VariantAnnotation]) -> StructType:\n \"\"\"Provides the schema for the VariantAnnotation dataset.\"\"\"\n return parse_spark_schema(\"variant_annotation.json\")\n\n def max_maf(self: VariantAnnotation) -> Column:\n \"\"\"Maximum minor allele frequency accross all populations.\n\n Returns:\n Column: Maximum minor allele frequency accross all populations.\n \"\"\"\n return f.array_max(\n f.transform(\n self.df.alleleFrequencies,\n lambda af: f.when(\n af.alleleFrequency > 0.5, 1 - af.alleleFrequency\n ).otherwise(af.alleleFrequency),\n )\n )\n\n def filter_by_variant_df(\n self: VariantAnnotation, df: DataFrame, cols: list[str]\n ) -> VariantAnnotation:\n \"\"\"Filter variant annotation dataset by a variant dataframe.\n\n Args:\n df (DataFrame): A dataframe of variants\n cols (List[str]): A list of columns to join on\n\n Returns:\n VariantAnnotation: A filtered variant annotation dataset\n \"\"\"\n self.df = self._df.join(f.broadcast(df.select(cols)), on=cols, how=\"inner\")\n return self\n\n def get_transcript_consequence_df(\n self: VariantAnnotation, filter_by: Optional[GeneIndex] = None\n ) -> DataFrame:\n \"\"\"Dataframe of exploded transcript consequences.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index. Defaults to None.\n\n Returns:\n DataFrame: A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence\n \"\"\"\n # exploding the array removes records without VEP annotation\n transript_consequences = self.df.withColumn(\n \"transcriptConsequence\", f.explode(\"vep.transcriptConsequences\")\n ).select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"transcriptConsequence\",\n f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n )\n if filter_by:\n transript_consequences = transript_consequences.join(\n f.broadcast(filter_by.df),\n on=[\"chromosome\", \"geneId\"],\n )\n return transript_consequences.persist()\n\n def get_most_severe_vep_v2g(\n self: VariantAnnotation,\n vep_consequences: DataFrame,\n filter_by: GeneIndex,\n ) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments based on VEP's predicted consequence on the transcript.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n vep_consequences (DataFrame): A dataframe of VEP consequences\n filter_by (GeneIndex): A gene index to filter by. 
Defaults to None.\n\n Returns:\n V2G: High and medium severity variant to gene assignments\n \"\"\"\n vep_lut = vep_consequences.select(\n f.element_at(f.split(\"Accession\", r\"/\"), -1).alias(\n \"variantFunctionalConsequenceId\"\n ),\n f.col(\"Term\").alias(\"label\"),\n f.col(\"v2g_score\").cast(\"double\").alias(\"score\"),\n )\n\n return V2G(\n _df=self.get_transcript_consequence_df(filter_by).select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n f.explode(\"transcriptConsequence.consequenceTerms\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"variantConsequence\").alias(\"datasourceId\"),\n )\n # A variant can have multiple predicted consequences on a transcript, the most severe one is selected\n .join(\n f.broadcast(vep_lut),\n on=\"label\",\n how=\"inner\",\n )\n .filter(f.col(\"score\") != 0)\n .transform(\n lambda df: get_record_with_maximum_value(\n df, [\"variantId\", \"geneId\"], \"score\"\n )\n ),\n _schema=V2G.get_schema(),\n )\n\n def get_polyphen_v2g(\n self: VariantAnnotation, filter_by: Optional[GeneIndex] = None\n ) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a PolyPhen's predicted score on the transcript.\n\n Polyphen informs about the probability that a substitution is damaging. Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by. Defaults to None.\n\n Returns:\n V2G: variant to gene assignments with their polyphen scores\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.polyphenScore\").isNotNull())\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n f.col(\"transcriptConsequence.polyphenScore\").alias(\"score\"),\n f.col(\"transcriptConsequence.polyphenPrediction\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"polyphen\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n\n def get_sift_v2g(self: VariantAnnotation, filter_by: GeneIndex) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a SIFT's predicted score on the transcript.\n\n SIFT informs about the probability that a substitution is tolerated so scores nearer zero are more likely to be deleterious.\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n\n Returns:\n V2G: variant to gene assignments with their SIFT scores\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.siftScore\").isNotNull())\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n f.expr(\"1 - transcriptConsequence.siftScore\").alias(\"score\"),\n f.col(\"transcriptConsequence.siftPrediction\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"sift\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n\n def get_plof_v2g(self: VariantAnnotation, filter_by: GeneIndex) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n\n Returns:\n V2G: variant to gene assignments from the LOFTEE 
algorithm\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.lof\").isNotNull())\n .withColumn(\n \"isHighQualityPlof\",\n f.when(f.col(\"transcriptConsequence.lof\") == \"HC\", True).when(\n f.col(\"transcriptConsequence.lof\") == \"LC\", False\n ),\n )\n .withColumn(\n \"score\",\n f.when(f.col(\"isHighQualityPlof\"), 1.0).when(\n ~f.col(\"isHighQualityPlof\"), 0\n ),\n )\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n \"isHighQualityPlof\",\n f.col(\"score\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"loftee\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n\n def get_distance_to_tss(\n self: VariantAnnotation,\n filter_by: GeneIndex,\n max_distance: int = 500_000,\n ) -> V2G:\n \"\"\"Extracts variant to gene assignments for variants falling within a window of a gene's TSS.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n max_distance (int): The maximum distance from the TSS to consider. Defaults to 500_000.\n\n Returns:\n V2G: variant to gene assignments with their distance to the TSS\n \"\"\"\n return V2G(\n _df=(\n self.df.alias(\"variant\")\n .join(\n f.broadcast(filter_by.locations_lut()).alias(\"gene\"),\n on=[\n f.col(\"variant.chromosome\") == f.col(\"gene.chromosome\"),\n f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n <= max_distance,\n ],\n how=\"inner\",\n )\n .withColumn(\n \"inverse_distance\",\n max_distance - f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\")),\n )\n .transform(lambda df: normalise_column(df, \"inverse_distance\", \"score\"))\n .select(\n \"variantId\",\n f.col(\"variant.chromosome\").alias(\"chromosome\"),\n \"position\",\n \"geneId\",\n \"score\",\n f.lit(\"distance\").alias(\"datatypeId\"),\n f.lit(\"canonical_tss\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Extracts variant to gene assignments for variants falling within a window of a gene's TSS.
Parameters:

- `filter_by` (`GeneIndex`, required): A gene index to filter by.
- `max_distance` (`int`, default `500000`): The maximum distance from the TSS to consider.

Returns:

- `V2G`: Variant to gene assignments with their distance to the TSS
Source code in src/otg/dataset/variant_annotation.py
def get_distance_to_tss(\n self: VariantAnnotation,\n filter_by: GeneIndex,\n max_distance: int = 500_000,\n) -> V2G:\n \"\"\"Extracts variant to gene assignments for variants falling within a window of a gene's TSS.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n max_distance (int): The maximum distance from the TSS to consider. Defaults to 500_000.\n\n Returns:\n V2G: variant to gene assignments with their distance to the TSS\n \"\"\"\n return V2G(\n _df=(\n self.df.alias(\"variant\")\n .join(\n f.broadcast(filter_by.locations_lut()).alias(\"gene\"),\n on=[\n f.col(\"variant.chromosome\") == f.col(\"gene.chromosome\"),\n f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n <= max_distance,\n ],\n how=\"inner\",\n )\n .withColumn(\n \"inverse_distance\",\n max_distance - f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\")),\n )\n .transform(lambda df: normalise_column(df, \"inverse_distance\", \"score\"))\n .select(\n \"variantId\",\n f.col(\"variant.chromosome\").alias(\"chromosome\"),\n \"position\",\n \"geneId\",\n \"score\",\n f.lit(\"distance\").alias(\"datatypeId\"),\n f.lit(\"canonical_tss\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
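To illustrate the scoring above, here is a tiny plain-Python sketch (not from the package): variants closer to the TSS get a larger "inverse distance", which the real pipeline then normalises into a 0-1 score via `normalise_column`.

```python
MAX_DISTANCE = 500_000  # same default window as the method above

def inverse_distance(variant_position: int, gene_tss: int) -> int:
    # Larger values mean the variant is closer to the gene's TSS.
    return MAX_DISTANCE - abs(variant_position - gene_tss)

print(inverse_distance(1_000_500, 1_000_000))  # 499500: very close to the TSS
print(inverse_distance(1_400_000, 1_000_000))  # 100000: near the edge of the window
```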
Creates a dataset with variant to gene assignments based on VEP's predicted consequence on the transcript.
Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

- `vep_consequences` (`DataFrame`, required): A dataframe of VEP consequences
- `filter_by` (`GeneIndex`, required): A gene index to filter by.

Returns:

- `V2G`: High and medium severity variant to gene assignments
Source code in src/otg/dataset/variant_annotation.py
def get_most_severe_vep_v2g(\n self: VariantAnnotation,\n vep_consequences: DataFrame,\n filter_by: GeneIndex,\n) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments based on VEP's predicted consequence on the transcript.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n vep_consequences (DataFrame): A dataframe of VEP consequences\n filter_by (GeneIndex): A gene index to filter by. Defaults to None.\n\n Returns:\n V2G: High and medium severity variant to gene assignments\n \"\"\"\n vep_lut = vep_consequences.select(\n f.element_at(f.split(\"Accession\", r\"/\"), -1).alias(\n \"variantFunctionalConsequenceId\"\n ),\n f.col(\"Term\").alias(\"label\"),\n f.col(\"v2g_score\").cast(\"double\").alias(\"score\"),\n )\n\n return V2G(\n _df=self.get_transcript_consequence_df(filter_by).select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n f.explode(\"transcriptConsequence.consequenceTerms\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"variantConsequence\").alias(\"datasourceId\"),\n )\n # A variant can have multiple predicted consequences on a transcript, the most severe one is selected\n .join(\n f.broadcast(vep_lut),\n on=\"label\",\n how=\"inner\",\n )\n .filter(f.col(\"score\") != 0)\n .transform(\n lambda df: get_record_with_maximum_value(\n df, [\"variantId\", \"geneId\"], \"score\"\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.
Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

- `filter_by` (`GeneIndex`, required): A gene index to filter by.

Returns:

- `V2G`: Variant to gene assignments from the LOFTEE algorithm
Source code in src/otg/dataset/variant_annotation.py
def get_plof_v2g(self: VariantAnnotation, filter_by: GeneIndex) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n\n Returns:\n V2G: variant to gene assignments from the LOFTEE algorithm\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.lof\").isNotNull())\n .withColumn(\n \"isHighQualityPlof\",\n f.when(f.col(\"transcriptConsequence.lof\") == \"HC\", True).when(\n f.col(\"transcriptConsequence.lof\") == \"LC\", False\n ),\n )\n .withColumn(\n \"score\",\n f.when(f.col(\"isHighQualityPlof\"), 1.0).when(\n ~f.col(\"isHighQualityPlof\"), 0\n ),\n )\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n \"isHighQualityPlof\",\n f.col(\"score\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"loftee\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Creates a dataset with variant to gene assignments with PolyPhen's predicted score on the transcript.
PolyPhen informs about the probability that a substitution is damaging. Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

- `filter_by` (`GeneIndex`, default `None`): A gene index to filter by.

Returns:

- `V2G`: Variant to gene assignments with their PolyPhen scores
Source code in src/otg/dataset/variant_annotation.py
def get_polyphen_v2g(\n self: VariantAnnotation, filter_by: Optional[GeneIndex] = None\n) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a PolyPhen's predicted score on the transcript.\n\n Polyphen informs about the probability that a substitution is damaging. Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by. Defaults to None.\n\n Returns:\n V2G: variant to gene assignments with their polyphen scores\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.polyphenScore\").isNotNull())\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n f.col(\"transcriptConsequence.polyphenScore\").alias(\"score\"),\n f.col(\"transcriptConsequence.polyphenPrediction\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"polyphen\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
Provides the schema for the VariantAnnotation dataset.
Source code in src/otg/dataset/variant_annotation.py
@classmethod\ndef get_schema(cls: type[VariantAnnotation]) -> StructType:\n \"\"\"Provides the schema for the VariantAnnotation dataset.\"\"\"\n return parse_spark_schema(\"variant_annotation.json\")\n
Creates a dataset with variant to gene assignments with SIFT's predicted score on the transcript.
SIFT informs about the probability that a substitution is tolerated, so scores nearer zero are more likely to be deleterious. Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

- `filter_by` (`GeneIndex`, required): A gene index to filter by.

Returns:

- `V2G`: Variant to gene assignments with their SIFT scores
Source code in src/otg/dataset/variant_annotation.py
def get_sift_v2g(self: VariantAnnotation, filter_by: GeneIndex) -> V2G:\n \"\"\"Creates a dataset with variant to gene assignments with a SIFT's predicted score on the transcript.\n\n SIFT informs about the probability that a substitution is tolerated so scores nearer zero are more likely to be deleterious.\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index to filter by.\n\n Returns:\n V2G: variant to gene assignments with their SIFT scores\n \"\"\"\n return V2G(\n _df=(\n self.get_transcript_consequence_df(filter_by)\n .filter(f.col(\"transcriptConsequence.siftScore\").isNotNull())\n .select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"geneId\",\n f.expr(\"1 - transcriptConsequence.siftScore\").alias(\"score\"),\n f.col(\"transcriptConsequence.siftPrediction\").alias(\"label\"),\n f.lit(\"vep\").alias(\"datatypeId\"),\n f.lit(\"sift\").alias(\"datasourceId\"),\n )\n ),\n _schema=V2G.get_schema(),\n )\n
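A small plain-Python sketch (not from the package) of the score inversion above: SIFT scores near zero are the most deleterious, so the pipeline computes `1 - siftScore` so that higher V2G scores always mean stronger support for the gene assignment.

```python
sift_scores = [0.0, 0.05, 0.6, 1.0]          # 0.0 = most deleterious substitution
v2g_scores = [1 - s for s in sift_scores]    # invert so higher = more deleterious
print(v2g_scores)  # [1.0, 0.95, 0.4, 0.0]
```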
Dataframe of exploded transcript consequences. Optionally the transcript consequences can be reduced to the universe of a gene index.
Parameters:

- `filter_by` (`GeneIndex`, default `None`): A gene index.

Returns:

- `DataFrame`: A dataframe exploded by transcript consequences with the columns `variantId`, `chromosome`, `transcriptConsequence`
Source code in src/otg/dataset/variant_annotation.py
def get_transcript_consequence_df(\n self: VariantAnnotation, filter_by: Optional[GeneIndex] = None\n) -> DataFrame:\n \"\"\"Dataframe of exploded transcript consequences.\n\n Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n Args:\n filter_by (GeneIndex): A gene index. Defaults to None.\n\n Returns:\n DataFrame: A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence\n \"\"\"\n # exploding the array removes records without VEP annotation\n transript_consequences = self.df.withColumn(\n \"transcriptConsequence\", f.explode(\"vep.transcriptConsequences\")\n ).select(\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"transcriptConsequence\",\n f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n )\n if filter_by:\n transript_consequences = transript_consequences.join(\n f.broadcast(filter_by.df),\n on=[\"chromosome\", \"geneId\"],\n )\n return transript_consequences.persist()\n
Variant index dataset is the result of intersecting the variant annotation dataset with the variants with V2D available information.
Source code in src/otg/dataset/variant_index.py
@dataclass\nclass VariantIndex(Dataset):\n \"\"\"Variant index dataset.\n\n Variant index dataset is the result of intersecting the variant annotation dataset with the variants with V2D available information.\n \"\"\"\n\n @classmethod\n def get_schema(cls: type[VariantIndex]) -> StructType:\n \"\"\"Provides the schema for the VariantIndex dataset.\"\"\"\n return parse_spark_schema(\"variant_index.json\")\n\n @classmethod\n def from_variant_annotation(\n cls: type[VariantIndex],\n variant_annotation: VariantAnnotation,\n study_locus: StudyLocus,\n ) -> VariantIndex:\n \"\"\"Initialise VariantIndex from pre-existing variant annotation dataset.\"\"\"\n unchanged_cols = [\n \"variantId\",\n \"chromosome\",\n \"position\",\n \"referenceAllele\",\n \"alternateAllele\",\n \"chromosomeB37\",\n \"positionB37\",\n \"alleleType\",\n \"alleleFrequencies\",\n \"cadd\",\n ]\n va_slimmed = variant_annotation.filter_by_variant_df(\n study_locus.unique_variants_in_locus(), [\"variantId\", \"chromosome\"]\n )\n return cls(\n _df=(\n va_slimmed.df.select(\n *unchanged_cols,\n f.col(\"vep.mostSevereConsequence\").alias(\"mostSevereConsequence\"),\n # filters/rsid are arrays that can be empty, in this case we convert them to null\n nullify_empty_array(f.col(\"rsIds\")).alias(\"rsIds\"),\n )\n .repartition(400, \"chromosome\")\n .sortWithinPartitions(\"chromosome\", \"position\")\n ),\n _schema=cls.get_schema(),\n )\n
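A usage sketch with hypothetical paths (and reusing the `study_locus` object from the earlier sketch, which is an assumption): build a `VariantIndex` by slimming the variant annotation dataset down to the variants that appear in study-locus credible sets.

```python
from pyspark.sql import SparkSession

from otg.dataset.variant_annotation import VariantAnnotation
from otg.dataset.variant_index import VariantIndex

spark = SparkSession.builder.getOrCreate()

variant_annotation = VariantAnnotation(
    _df=spark.read.parquet("path/to/variant_annotation"),  # hypothetical path
    _schema=VariantAnnotation.get_schema(),
)

# `study_locus` as constructed in the earlier StudyLocus sketch.
variant_index = VariantIndex.from_variant_annotation(variant_annotation, study_locus)
```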
@classmethod\ndef get_schema(cls: type[VariantIndex]) -> StructType:\n \"\"\"Provides the schema for the VariantIndex dataset.\"\"\"\n return parse_spark_schema(\"variant_index.json\")\n
Variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific biofeatures (e.g. cell types).
Source code in src/otg/dataset/v2g.py
@dataclass\nclass V2G(Dataset):\n \"\"\"Variant-to-gene (V2G) evidence dataset.\n\n A variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific `biofeatures` (e.g. cell types)\n \"\"\"\n\n @classmethod\n def get_schema(cls: type[V2G]) -> StructType:\n \"\"\"Provides the schema for the V2G dataset.\"\"\"\n return parse_spark_schema(\"v2g.json\")\n\n def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:\n \"\"\"Filter by V2G dataset by genes.\n\n Args:\n genes (GeneIndex): Gene index dataset to filter by\n\n Returns:\n V2G: V2G dataset filtered by genes\n \"\"\"\n self.df = self._df.join(genes.df.select(\"geneId\"), on=\"geneId\", how=\"inner\")\n return self\n
@classmethod\ndef get_schema(cls: type[V2G]) -> StructType:\n \"\"\"Provides the schema for the V2G dataset.\"\"\"\n return parse_spark_schema(\"v2g.json\")\n
The following information is aggregated/extracted:
- Study ID in the special format (FINNGEN_R9_*)
- Trait name (for example, Amoebiasis)
- Number of cases and controls
- Link to the summary statistics location
Some fields are also populated as constants, such as study type and the initial sample size.
Source code in src/otg/datasource/finngen/study_index.py
class FinnGenStudyIndex(StudyIndex):\n \"\"\"Study index dataset from FinnGen.\n\n The following information is aggregated/extracted:\n\n - Study ID in the special format (FINNGEN_R9_*)\n - Trait name (for example, Amoebiasis)\n - Number of cases and controls\n - Link to the summary statistics location\n\n Some fields are also populated as constants, such as study type and the initial sample size.\n \"\"\"\n\n @classmethod\n def from_source(\n cls: type[FinnGenStudyIndex],\n finngen_studies: DataFrame,\n finngen_release_prefix: str,\n finngen_summary_stats_url_prefix: str,\n finngen_summary_stats_url_suffix: str,\n ) -> FinnGenStudyIndex:\n \"\"\"This function ingests study level metadata from FinnGen.\n\n Args:\n finngen_studies (DataFrame): FinnGen raw study table\n finngen_release_prefix (str): Release prefix pattern.\n finngen_summary_stats_url_prefix (str): URL prefix for summary statistics location.\n finngen_summary_stats_url_suffix (str): URL prefix suffix for summary statistics location.\n\n Returns:\n FinnGenStudyIndex: Parsed and annotated FinnGen study table.\n \"\"\"\n return FinnGenStudyIndex(\n _df=finngen_studies.select(\n f.concat(f.lit(f\"{finngen_release_prefix}_\"), f.col(\"phenocode\")).alias(\n \"studyId\"\n ),\n f.col(\"phenostring\").alias(\"traitFromSource\"),\n f.col(\"num_cases\").alias(\"nCases\"),\n f.col(\"num_controls\").alias(\"nControls\"),\n (f.col(\"num_cases\") + f.col(\"num_controls\")).alias(\"nSamples\"),\n f.lit(finngen_release_prefix).alias(\"projectId\"),\n f.lit(\"gwas\").alias(\"studyType\"),\n f.lit(True).alias(\"hasSumstats\"),\n f.lit(\"377,277 (210,870 females and 166,407 males)\").alias(\n \"initialSampleSize\"\n ),\n f.array(\n f.struct(\n f.lit(377277).cast(\"long\").alias(\"sampleSize\"),\n f.lit(\"Finnish\").alias(\"ancestry\"),\n )\n ).alias(\"discoverySamples\"),\n f.concat(\n f.lit(finngen_summary_stats_url_prefix),\n f.col(\"phenocode\"),\n f.lit(finngen_summary_stats_url_suffix),\n ).alias(\"summarystatsLocation\"),\n ).withColumn(\n \"ldPopulationStructure\",\n cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n ),\n _schema=cls.get_schema(),\n )\n
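A usage sketch with hypothetical inputs (the input path and URL templates are assumptions; the raw table is expected to provide `phenocode`, `phenostring`, `num_cases` and `num_controls`): ingest the FinnGen phenotype table into a study index.

```python
from pyspark.sql import SparkSession

from otg.datasource.finngen.study_index import FinnGenStudyIndex

spark = SparkSession.builder.getOrCreate()

# Hypothetical path to the raw FinnGen phenotype endpoint table.
finngen_studies = spark.read.json("path/to/finngen_phenotypes.json")

finngen_index = FinnGenStudyIndex.from_source(
    finngen_studies,
    finngen_release_prefix="FINNGEN_R9",
    finngen_summary_stats_url_prefix="https://example.org/finngen/sumstats/",  # hypothetical
    finngen_summary_stats_url_suffix=".gz",  # hypothetical
)
```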
The information comes from LD matrices made available by GnomAD in Hail's native format. We aggregate the LD information across 8 ancestries. The basic steps to generate the LDIndex are:
1. Convert an LD matrix to a Spark DataFrame.
2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.
3. Aggregate the LD information across populations.
Source code in src/otg/datasource/gnomad/ld.py
class GnomADLDMatrix:\n \"\"\"Importer of LD information from GnomAD.\n\n The information comes from LD matrices [made available by GnomAD](https://gnomad.broadinstitute.org/downloads/#v2-linkage-disequilibrium) in Hail's native format. We aggregate the LD information across 8 ancestries.\n The basic steps to generate the LDIndex are:\n\n 1. Convert a LD matrix to a Spark DataFrame.\n 2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.\n 3. Aggregate the LD information across populations.\n\n \"\"\"\n\n @staticmethod\n def _aggregate_ld_index_across_populations(\n unaggregated_ld_index: DataFrame,\n ) -> DataFrame:\n \"\"\"Aggregate LDIndex across populations.\n\n Args:\n unaggregated_ld_index (DataFrame): Unaggregate LDIndex index dataframe each row is a variant pair in a population\n\n Returns:\n DataFrame: Aggregated LDIndex index dataframe each row is a variant with the LD set across populations\n\n Examples:\n >>> data = [(\"1.0\", \"var1\", \"X\", \"var1\", \"pop1\"), (\"1.0\", \"X\", \"var2\", \"var2\", \"pop1\"),\n ... (\"0.5\", \"var1\", \"X\", \"var2\", \"pop1\"), (\"0.5\", \"var1\", \"X\", \"var2\", \"pop2\"),\n ... (\"0.5\", \"var2\", \"X\", \"var1\", \"pop1\"), (\"0.5\", \"X\", \"var2\", \"var1\", \"pop2\")]\n >>> df = spark.createDataFrame(data, [\"r\", \"variantId\", \"chromosome\", \"tagvariantId\", \"population\"])\n >>> GnomADLDMatrix._aggregate_ld_index_across_populations(df).printSchema()\n root\n |-- variantId: string (nullable = true)\n |-- chromosome: string (nullable = true)\n |-- ldSet: array (nullable = false)\n | |-- element: struct (containsNull = false)\n | | |-- tagVariantId: string (nullable = true)\n | | |-- rValues: array (nullable = false)\n | | | |-- element: struct (containsNull = false)\n | | | | |-- population: string (nullable = true)\n | | | | |-- r: string (nullable = true)\n <BLANKLINE>\n \"\"\"\n return (\n unaggregated_ld_index\n # First level of aggregation: get r/population for each variant/tagVariant pair\n .withColumn(\"r_pop_struct\", f.struct(\"population\", \"r\"))\n .groupBy(\"chromosome\", \"variantId\", \"tagVariantId\")\n .agg(\n f.collect_set(\"r_pop_struct\").alias(\"rValues\"),\n )\n # Second level of aggregation: get r/population for each variant\n .withColumn(\"r_pop_tag_struct\", f.struct(\"tagVariantId\", \"rValues\"))\n .groupBy(\"variantId\", \"chromosome\")\n .agg(\n f.collect_set(\"r_pop_tag_struct\").alias(\"ldSet\"),\n )\n )\n\n @staticmethod\n def _convert_ld_matrix_to_table(\n block_matrix: BlockMatrix, min_r2: float\n ) -> DataFrame:\n \"\"\"Convert LD matrix to table.\"\"\"\n table = block_matrix.entries(keyed=False)\n return (\n table.filter(hl.abs(table.entry) >= min_r2**0.5)\n .to_spark()\n .withColumnRenamed(\"entry\", \"r\")\n )\n\n @staticmethod\n def _create_ldindex_for_population(\n population_id: str,\n ld_matrix_path: str,\n ld_index_raw_path: str,\n grch37_to_grch38_chain_path: str,\n min_r2: float,\n ) -> DataFrame:\n \"\"\"Create LDIndex for a specific population.\"\"\"\n # Prepare LD Block matrix\n ld_matrix = GnomADLDMatrix._convert_ld_matrix_to_table(\n BlockMatrix.read(ld_matrix_path), min_r2\n )\n\n # Prepare table with variant indices\n ld_index = GnomADLDMatrix._process_variant_indices(\n hl.read_table(ld_index_raw_path),\n grch37_to_grch38_chain_path,\n )\n\n return GnomADLDMatrix._resolve_variant_indices(ld_index, ld_matrix).select(\n \"*\",\n f.lit(population_id).alias(\"population\"),\n )\n\n @staticmethod\n def _process_variant_indices(\n ld_index_raw: 
hl.Table, grch37_to_grch38_chain_path: str\n ) -> DataFrame:\n \"\"\"Creates a look up table between variants and their coordinates in the LD Matrix.\n\n !!! info \"Gnomad's LD Matrix and Index are based on GRCh37 coordinates. This function will lift over the coordinates to GRCh38 to build the lookup table.\"\n\n Args:\n ld_index_raw (hl.Table): LD index table from GnomAD\n grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n\n Returns:\n DataFrame: Look up table between variants in build hg38 and their coordinates in the LD Matrix\n \"\"\"\n ld_index_38 = _liftover_loci(\n ld_index_raw, grch37_to_grch38_chain_path, \"GRCh38\"\n )\n\n return (\n ld_index_38.to_spark()\n # Filter out variants where the liftover failed\n .filter(f.col(\"`locus_GRCh38.position`\").isNotNull())\n .withColumn(\n \"chromosome\", f.regexp_replace(\"`locus_GRCh38.contig`\", \"chr\", \"\")\n )\n .withColumn(\n \"position\",\n convert_gnomad_position_to_ensembl(\n f.col(\"`locus_GRCh38.position`\"),\n f.col(\"`alleles`\").getItem(0),\n f.col(\"`alleles`\").getItem(1),\n ),\n )\n .select(\n \"chromosome\",\n f.concat_ws(\n \"_\",\n f.col(\"chromosome\"),\n f.col(\"position\"),\n f.col(\"`alleles`\").getItem(0),\n f.col(\"`alleles`\").getItem(1),\n ).alias(\"variantId\"),\n f.col(\"idx\"),\n )\n # Filter out ambiguous liftover results: multiple indices for the same variant\n .withColumn(\"count\", f.count(\"*\").over(Window.partitionBy([\"variantId\"])))\n .filter(f.col(\"count\") == 1)\n .drop(\"count\")\n )\n\n @staticmethod\n def _resolve_variant_indices(\n ld_index: DataFrame, ld_matrix: DataFrame\n ) -> DataFrame:\n \"\"\"Resolve the `i` and `j` indices of the block matrix to variant IDs (build 38).\"\"\"\n ld_index_i = ld_index.selectExpr(\n \"idx as i\", \"variantId as variantId_i\", \"chromosome\"\n )\n ld_index_j = ld_index.selectExpr(\"idx as j\", \"variantId as variantId_j\")\n return (\n ld_matrix.join(ld_index_i, on=\"i\", how=\"inner\")\n .join(ld_index_j, on=\"j\", how=\"inner\")\n .drop(\"i\", \"j\")\n )\n\n @staticmethod\n def _transpose_ld_matrix(ld_matrix: DataFrame) -> DataFrame:\n \"\"\"Transpose LD matrix to a square matrix format.\n\n Args:\n ld_matrix (DataFrame): Triangular LD matrix converted to a Spark DataFrame\n\n Returns:\n DataFrame: Square LD matrix without diagonal duplicates\n\n Examples:\n >>> df = spark.createDataFrame(\n ... [\n ... (1, 1, 1.0, \"1\", \"AFR\"),\n ... (1, 2, 0.5, \"1\", \"AFR\"),\n ... (2, 2, 1.0, \"1\", \"AFR\"),\n ... ],\n ... [\"variantId_i\", \"variantId_j\", \"r\", \"chromosome\", \"population\"],\n ... 
)\n >>> GnomADLDMatrix._transpose_ld_matrix(df).show()\n +-----------+-----------+---+----------+----------+\n |variantId_i|variantId_j| r|chromosome|population|\n +-----------+-----------+---+----------+----------+\n | 1| 2|0.5| 1| AFR|\n | 1| 1|1.0| 1| AFR|\n | 2| 1|0.5| 1| AFR|\n | 2| 2|1.0| 1| AFR|\n +-----------+-----------+---+----------+----------+\n <BLANKLINE>\n \"\"\"\n ld_matrix_transposed = ld_matrix.selectExpr(\n \"variantId_i as variantId_j\",\n \"variantId_j as variantId_i\",\n \"r\",\n \"chromosome\",\n \"population\",\n )\n return ld_matrix.filter(\n f.col(\"variantId_i\") != f.col(\"variantId_j\")\n ).unionByName(ld_matrix_transposed)\n\n @classmethod\n def as_ld_index(\n cls: type[GnomADLDMatrix],\n ld_populations: list[str],\n ld_matrix_template: str,\n ld_index_raw_template: str,\n grch37_to_grch38_chain_path: str,\n min_r2: float,\n ) -> LDIndex:\n \"\"\"Create LDIndex dataset aggregating the LD information across a set of populations.\"\"\"\n ld_indices_unaggregated = []\n for pop in ld_populations:\n try:\n ld_matrix_path = ld_matrix_template.format(POP=pop)\n ld_index_raw_path = ld_index_raw_template.format(POP=pop)\n pop_ld_index = cls._create_ldindex_for_population(\n pop,\n ld_matrix_path,\n ld_index_raw_path.format(pop),\n grch37_to_grch38_chain_path,\n min_r2,\n )\n ld_indices_unaggregated.append(pop_ld_index)\n except Exception as e:\n print(f\"Failed to create LDIndex for population {pop}: {e}\")\n sys.exit(1)\n\n ld_index_unaggregated = (\n GnomADLDMatrix._transpose_ld_matrix(\n reduce(lambda df1, df2: df1.unionByName(df2), ld_indices_unaggregated)\n )\n .withColumnRenamed(\"variantId_i\", \"variantId\")\n .withColumnRenamed(\"variantId_j\", \"tagVariantId\")\n )\n return LDIndex(\n _df=cls._aggregate_ld_index_across_populations(ld_index_unaggregated),\n _schema=LDIndex.get_schema(),\n )\n
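A usage sketch for building the LD index across populations. Everything below is illustrative: the import path, bucket paths, population codes and the `min_r2` value are assumptions, and a Hail-initialised Spark session is required.

```python
from otg.datasource.gnomad.ld import GnomADLDMatrix  # assumed import path

# Placeholder paths: the real templates point at GnomAD's per-ancestry LD
# BlockMatrix ({POP} is substituted) and variant-index Hail tables.
ld_index = GnomADLDMatrix.as_ld_index(
    ld_populations=["afr", "nfe"],
    ld_matrix_template="gs://my-bucket/gnomad_ld/{POP}.ld.bm",
    ld_index_raw_template="gs://my-bucket/gnomad_ld/{POP}.ld.variant_indices.ht",
    grch37_to_grch38_chain_path="gs://my-bucket/chains/grch37_to_grch38.over.chain.gz",
    min_r2=0.5,
)
```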
GnomAD variants included in the GnomAD genomes dataset.
Source code in src/otg/datasource/gnomad/variants.py
class GnomADVariants:\n \"\"\"GnomAD variants included in the GnomAD genomes dataset.\"\"\"\n\n @staticmethod\n def _convert_gnomad_position_to_ensembl_hail(\n position: Int32Expression,\n reference: StringExpression,\n alternate: StringExpression,\n ) -> Int32Expression:\n \"\"\"Convert GnomAD variant position to Ensembl variant position in hail table.\n\n For indels (the reference or alternate allele is longer than 1), then adding 1 to the position, for SNPs, the position is unchanged.\n More info about the problem: https://www.biostars.org/p/84686/\n\n Args:\n position (Int32Expression): Position of the variant in the GnomAD genome.\n reference (StringExpression): The reference allele.\n alternate (StringExpression): The alternate allele\n\n Returns:\n The position of the variant according to Ensembl genome.\n \"\"\"\n return hl.if_else(\n (reference.length() > 1) | (alternate.length() > 1), position + 1, position\n )\n\n @classmethod\n def as_variant_annotation(\n cls: type[GnomADVariants],\n gnomad_file: str,\n grch38_to_grch37_chain: str,\n populations: list,\n ) -> VariantAnnotation:\n \"\"\"Generate variant annotation dataset from gnomAD.\n\n Some relevant modifications to the original dataset are:\n\n 1. The transcript consequences features provided by VEP are filtered to only refer to the Ensembl canonical transcript.\n 2. Genome coordinates are liftovered from GRCh38 to GRCh37 to keep as annotation.\n 3. Field names are converted to camel case to follow the convention.\n\n Args:\n gnomad_file (str): Path to `gnomad.genomes.vX.X.X.sites.ht` gnomAD dataset\n grch38_to_grch37_chain (str): Path to chain file for liftover\n populations (list): List of populations to include in the dataset\n\n Returns:\n VariantAnnotation: Variant annotation dataset\n \"\"\"\n # Load variants dataset\n ht = hl.read_table(\n gnomad_file,\n _load_refs=False,\n )\n\n # Liftover\n grch37 = hl.get_reference(\"GRCh37\")\n grch38 = hl.get_reference(\"GRCh38\")\n grch38.add_liftover(grch38_to_grch37_chain, grch37)\n\n # Drop non biallelic variants\n ht = ht.filter(ht.alleles.length() == 2)\n # Liftover\n ht = ht.annotate(locus_GRCh37=hl.liftover(ht.locus, \"GRCh37\"))\n # Select relevant fields and nested records to create class\n return VariantAnnotation(\n _df=(\n ht.select(\n gnomad3VariantId=hl.str(\"-\").join(\n [\n ht.locus.contig.replace(\"chr\", \"\"),\n hl.str(ht.locus.position),\n ht.alleles[0],\n ht.alleles[1],\n ]\n ),\n chromosome=ht.locus.contig.replace(\"chr\", \"\"),\n position=GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n ht.locus.position, ht.alleles[0], ht.alleles[1]\n ),\n variantId=hl.str(\"_\").join(\n [\n ht.locus.contig.replace(\"chr\", \"\"),\n hl.str(\n GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n ht.locus.position, ht.alleles[0], ht.alleles[1]\n )\n ),\n ht.alleles[0],\n ht.alleles[1],\n ]\n ),\n chromosomeB37=ht.locus_GRCh37.contig.replace(\"chr\", \"\"),\n positionB37=ht.locus_GRCh37.position,\n referenceAllele=ht.alleles[0],\n alternateAllele=ht.alleles[1],\n rsIds=ht.rsid,\n alleleType=ht.allele_info.allele_type,\n cadd=hl.struct(\n phred=ht.cadd.phred,\n raw=ht.cadd.raw_score,\n ),\n alleleFrequencies=hl.set([f\"{pop}-adj\" for pop in populations]).map(\n lambda p: hl.struct(\n populationName=p,\n alleleFrequency=ht.freq[ht.globals.freq_index_dict[p]].AF,\n )\n ),\n vep=hl.struct(\n mostSevereConsequence=ht.vep.most_severe_consequence,\n transcriptConsequences=hl.map(\n lambda x: hl.struct(\n aminoAcids=x.amino_acids,\n 
consequenceTerms=x.consequence_terms,\n geneId=x.gene_id,\n lof=x.lof,\n polyphenScore=x.polyphen_score,\n polyphenPrediction=x.polyphen_prediction,\n siftScore=x.sift_score,\n siftPrediction=x.sift_prediction,\n ),\n # Only keeping canonical transcripts\n ht.vep.transcript_consequences.filter(\n lambda x: (x.canonical == 1)\n & (x.gene_symbol_source == \"HGNC\")\n ),\n ),\n ),\n )\n .key_by(\"chromosome\", \"position\")\n .drop(\"locus\", \"alleles\")\n .select_globals()\n .to_spark(flatten=False)\n ),\n _schema=VariantAnnotation.get_schema(),\n )\n
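A usage sketch for the variant annotation ingestion; the import path, input paths and population codes are placeholders, and a Hail-initialised Spark session is assumed.

```python
from otg.datasource.gnomad.variants import GnomADVariants  # assumed import path

# Placeholder inputs: a GnomAD sites Hail table and a GRCh38 -> GRCh37 chain file.
variant_annotation = GnomADVariants.as_variant_annotation(
    gnomad_file="gs://my-bucket/gnomad.genomes.v3.1.2.sites.ht",
    grch38_to_grch37_chain="gs://my-bucket/chains/grch38_to_grch37.over.chain.gz",
    populations=["afr", "nfe", "eas"],
)
```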
If the disease assigned to the study and to the association don't agree, we assume the study needs to be split. Disease EFOs, trait names and study IDs are then consolidated.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `studies` | `GWASCatalogStudyIndex` | GWAS Catalog studies. | required |
| `associations` | `StudyLocusGWASCatalog` | GWAS Catalog associations. | required |
Returns:

| Type | Description |
| --- | --- |
| `Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]` | A tuple of the split studies and associations. |
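A minimal usage sketch, assuming `gwas_studies` and `gwas_associations` are already-loaded GWAS Catalog study index and association datasets (the import path is an assumption):

```python
from otg.datasource.gwas_catalog.study_splitter import GWASCatalogStudySplitter  # assumed path

# Multi-trait studies are split so that each resulting (sub)study carries a single trait.
split_studies, split_associations = GWASCatalogStudySplitter.split(
    gwas_studies, gwas_associations
)
```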
Source code in src/otg/datasource/gwas_catalog/study_splitter.py
@classmethod\ndef split(\n cls: type[GWASCatalogStudySplitter],\n studies: GWASCatalogStudyIndex,\n associations: GWASCatalogAssociations,\n) -> Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]:\n \"\"\"Splitting multi-trait GWAS Catalog studies.\n\n If assigned disease of the study and the association don't agree, we assume the study needs to be split.\n Then disease EFOs, trait names and study ID are consolidated\n\n Args:\n studies (GWASCatalogStudyIndex): GWAS Catalog studies.\n associations (StudyLocusGWASCatalog): GWAS Catalog associations.\n\n Returns:\n A tuple of the split associations and studies.\n \"\"\"\n # Composite of studies and associations to resolve scattered information\n st_ass = (\n associations.df.join(f.broadcast(studies.df), on=\"studyId\", how=\"inner\")\n .select(\n \"studyId\",\n \"subStudyDescription\",\n cls._resolve_study_id(\n f.col(\"studyId\"), f.col(\"subStudyDescription\")\n ).alias(\"updatedStudyId\"),\n cls._resolve_trait(\n f.col(\"traitFromSource\"),\n f.split(\"subStudyDescription\", r\"\\|\").getItem(0),\n f.split(\"subStudyDescription\", r\"\\|\").getItem(1),\n ).alias(\"traitFromSource\"),\n cls._resolve_efo(\n f.split(\"subStudyDescription\", r\"\\|\").getItem(2),\n f.col(\"traitFromSourceMappedIds\"),\n ).alias(\"traitFromSourceMappedIds\"),\n )\n .persist()\n )\n\n return (\n studies.update_study_id(\n st_ass.select(\n \"studyId\",\n \"updatedStudyId\",\n \"traitFromSource\",\n \"traitFromSourceMappedIds\",\n ).distinct()\n ),\n associations.update_study_id(\n st_ass.select(\n \"updatedStudyId\", \"studyId\", \"subStudyDescription\"\n ).distinct()\n )._qc_ambiguous_study(),\n )\n
Some fields are populated as constants, such as projectID, studyType, and initial sample size.
Source code in src/otg/datasource/ukbiobank/study_index.py
class UKBiobankStudyIndex(StudyIndex):\n \"\"\"Study index dataset from UKBiobank.\n\n The following information is extracted:\n\n - studyId\n - pubmedId\n - publicationDate\n - publicationJournal\n - publicationTitle\n - publicationFirstAuthor\n - traitFromSource\n - ancestry_discoverySamples\n - ancestry_replicationSamples\n - initialSampleSize\n - nCases\n - replicationSamples\n\n Some fields are populated as constants, such as projectID, studyType, and initial sample size.\n \"\"\"\n\n @classmethod\n def from_source(\n cls: type[UKBiobankStudyIndex],\n ukbiobank_studies: DataFrame,\n ) -> UKBiobankStudyIndex:\n \"\"\"This function ingests study level metadata from UKBiobank.\n\n The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).\n\n The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.\n\n Args:\n ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session.\n\n Returns:\n UKBiobankStudyIndex: Annotated UKBiobank study table.\n \"\"\"\n return StudyIndex(\n _df=(\n ukbiobank_studies.select(\n f.col(\"code\").alias(\"studyId\"),\n f.lit(\"UKBiobank\").alias(\"projectId\"),\n f.lit(\"gwas\").alias(\"studyType\"),\n f.col(\"trait\").alias(\"traitFromSource\"),\n # Make publication and ancestry schema columns.\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"30104761\").alias(\n \"pubmedId\"\n ),\n f.when(\n f.col(\"code\").startswith(\"SAIGE_\"),\n \"Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies\",\n )\n .otherwise(None)\n .alias(\"publicationTitle\"),\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Wei Zhou\").alias(\n \"publicationFirstAuthor\"\n ),\n f.when(f.col(\"code\").startswith(\"NEALE2_\"), \"2018-08-01\")\n .otherwise(\"2018-10-24\")\n .alias(\"publicationDate\"),\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Nature Genetics\").alias(\n \"publicationJournal\"\n ),\n f.col(\"n_total\").cast(\"string\").alias(\"initialSampleSize\"),\n f.col(\"n_cases\").cast(\"long\").alias(\"nCases\"),\n f.array(\n f.struct(\n f.col(\"n_total\").cast(\"long\").alias(\"sampleSize\"),\n f.concat(f.lit(\"European=\"), f.col(\"n_total\")).alias(\n \"ancestry\"\n ),\n )\n ).alias(\"discoverySamples\"),\n f.col(\"in_path\").alias(\"summarystatsLocation\"),\n f.lit(True).alias(\"hasSumstats\"),\n )\n .withColumn(\n \"traitFromSource\",\n f.when(\n f.col(\"traitFromSource\").contains(\":\"),\n f.concat(\n f.initcap(\n f.split(f.col(\"traitFromSource\"), \": \").getItem(1)\n ),\n f.lit(\" | \"),\n f.lower(f.split(f.col(\"traitFromSource\"), \": \").getItem(0)),\n ),\n ).otherwise(f.col(\"traitFromSource\")),\n )\n .withColumn(\n \"ldPopulationStructure\",\n cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n )\n ),\n _schema=StudyIndex.get_schema(),\n )\n
This function ingests study level metadata from UKBiobank.
The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).
The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `ukbiobank_studies` | `DataFrame` | UKBiobank study manifest file loaded in spark session. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `UKBiobankStudyIndex` | `UKBiobankStudyIndex` | Annotated UKBiobank study table. |
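A usage sketch, assuming an active Spark session and a placeholder path to the UKBiobank study manifest:

```python
from otg.datasource.ukbiobank.study_index import UKBiobankStudyIndex  # assumed import path

# The manifest is expected to contain columns such as code, trait, n_total,
# n_cases and in_path; the path below is a placeholder.
ukbiobank_manifest = spark.read.csv(
    "path/to/ukbiobank_study_manifest.tsv", sep="\t", header=True
)
ukb_study_index = UKBiobankStudyIndex.from_source(ukbiobank_manifest)
```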
Source code in src/otg/datasource/ukbiobank/study_index.py
@classmethod\ndef from_source(\n cls: type[UKBiobankStudyIndex],\n ukbiobank_studies: DataFrame,\n) -> UKBiobankStudyIndex:\n \"\"\"This function ingests study level metadata from UKBiobank.\n\n The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).\n\n The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.\n\n Args:\n ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session.\n\n Returns:\n UKBiobankStudyIndex: Annotated UKBiobank study table.\n \"\"\"\n return StudyIndex(\n _df=(\n ukbiobank_studies.select(\n f.col(\"code\").alias(\"studyId\"),\n f.lit(\"UKBiobank\").alias(\"projectId\"),\n f.lit(\"gwas\").alias(\"studyType\"),\n f.col(\"trait\").alias(\"traitFromSource\"),\n # Make publication and ancestry schema columns.\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"30104761\").alias(\n \"pubmedId\"\n ),\n f.when(\n f.col(\"code\").startswith(\"SAIGE_\"),\n \"Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies\",\n )\n .otherwise(None)\n .alias(\"publicationTitle\"),\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Wei Zhou\").alias(\n \"publicationFirstAuthor\"\n ),\n f.when(f.col(\"code\").startswith(\"NEALE2_\"), \"2018-08-01\")\n .otherwise(\"2018-10-24\")\n .alias(\"publicationDate\"),\n f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Nature Genetics\").alias(\n \"publicationJournal\"\n ),\n f.col(\"n_total\").cast(\"string\").alias(\"initialSampleSize\"),\n f.col(\"n_cases\").cast(\"long\").alias(\"nCases\"),\n f.array(\n f.struct(\n f.col(\"n_total\").cast(\"long\").alias(\"sampleSize\"),\n f.concat(f.lit(\"European=\"), f.col(\"n_total\")).alias(\n \"ancestry\"\n ),\n )\n ).alias(\"discoverySamples\"),\n f.col(\"in_path\").alias(\"summarystatsLocation\"),\n f.lit(True).alias(\"hasSumstats\"),\n )\n .withColumn(\n \"traitFromSource\",\n f.when(\n f.col(\"traitFromSource\").contains(\":\"),\n f.concat(\n f.initcap(\n f.split(f.col(\"traitFromSource\"), \": \").getItem(1)\n ),\n f.lit(\" | \"),\n f.lower(f.split(f.col(\"traitFromSource\"), \": \").getItem(0)),\n ),\n ).otherwise(f.col(\"traitFromSource\")),\n )\n .withColumn(\n \"ldPopulationStructure\",\n cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n )\n ),\n _schema=StudyIndex.get_schema(),\n )\n
Clumping is a commonly used post-processing method that allows for identification of independent association signals from GWAS summary statistics and curated associations. This process is critical because of the complex linkage disequilibrium (LD) structure in human populations, which can result in multiple statistically significant associations within the same genomic region. Clumping methods help reduce redundancy in GWAS results and ensure that each reported association represents an independent signal.
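As a rough intuition (not the package implementation), a toy distance-based clumping in plain Python: keep the most significant variant and drop anything within a fixed window of an already-selected lead.

```python
def greedy_clump(variants, window=500_000):
    """Toy clumping: variants are (position, p-value) tuples on one chromosome."""
    leads = []
    for pos, pval in sorted(variants, key=lambda v: v[1]):  # most significant first
        if all(abs(pos - lead_pos) > window for lead_pos, _ in leads):
            leads.append((pos, pval))
    return leads

# Three significant SNPs within 50 kb collapse into one independent signal.
print(greedy_clump([(1_000_000, 1e-12), (1_020_000, 1e-9), (1_050_000, 1e-8), (3_000_000, 1e-10)]))
```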
We have implemented two clumping methods:
"},{"location":"python_api/method/clumping/#clumping-based-on-linkage-disequilibrium-ld","title":"Clumping based on Linkage Disequilibrium (LD)","text":"
LD clumping reports the most significant genetic associations in a region in terms of a smaller number of “clumps” of genetically linked SNPs.
Source code in src/otg/method/clump.py
class LDclumping:\n \"\"\"LD clumping reports the most significant genetic associations in a region in terms of a smaller number of \u201cclumps\u201d of genetically linked SNPs.\"\"\"\n\n @staticmethod\n def _is_lead_linked(\n study_id: Column,\n variant_id: Column,\n p_value_exponent: Column,\n p_value_mantissa: Column,\n ld_set: Column,\n ) -> Column:\n \"\"\"Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n Args:\n study_id (Column): studyId\n variant_id (Column): Lead variant id\n p_value_exponent (Column): p-value exponent\n p_value_mantissa (Column): p-value mantissa\n locus (Column): Credible set <array of structs>\n\n Returns:\n Column: Boolean in which True indicates that the lead is linked to another tag in the same dataset.\n \"\"\"\n leads_in_study = f.collect_set(variant_id).over(Window.partitionBy(study_id))\n tags_in_studylocus = f.array_union(\n # Get all tag variants from the credible set per studyLocusId\n f.transform(ld_set, lambda x: x.tagVariantId),\n # And append the lead variant so that the intersection is the same for all studyLocusIds in a study\n f.array(variant_id),\n )\n intersect_lead_tags = f.array_sort(\n f.array_intersect(leads_in_study, tags_in_studylocus)\n )\n return (\n # If the lead is in the credible set, we rank the peaks by p-value\n f.when(\n f.size(intersect_lead_tags) > 0,\n f.row_number().over(\n Window.partitionBy(study_id, intersect_lead_tags).orderBy(\n p_value_exponent, p_value_mantissa\n )\n )\n > 1,\n )\n # If the intersection is empty (lead is not in the credible set or cred set is empty), the association is not linked\n .otherwise(f.lit(False))\n )\n\n @classmethod\n def clump(cls: type[LDclumping], associations: StudyLocus) -> StudyLocus:\n \"\"\"Perform clumping on studyLocus dataset.\n\n Args:\n associations (StudyLocus): StudyLocus dataset\n\n Returns:\n StudyLocus: including flag and removing locus information for LD clumped loci.\n \"\"\"\n return associations.clump()\n
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `associations` | `StudyLocus` | StudyLocus dataset | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `StudyLocus` | `StudyLocus` | including flag and removing locus information for LD clumped loci. |
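A minimal usage sketch, assuming `study_locus` is an LD-annotated StudyLocus dataset (import path assumed):

```python
from otg.method.clump import LDclumping  # assumed import path

# Flags lead variants that are linked to a more significant lead in the same study.
clumped_study_locus = LDclumping.clump(study_locus)
```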
Source code in src/otg/method/clump.py
@classmethod\ndef clump(cls: type[LDclumping], associations: StudyLocus) -> StudyLocus:\n \"\"\"Perform clumping on studyLocus dataset.\n\n Args:\n associations (StudyLocus): StudyLocus dataset\n\n Returns:\n StudyLocus: including flag and removing locus information for LD clumped loci.\n \"\"\"\n return associations.clump()\n
Calculate bayesian colocalisation based on overlapping signals from credible sets.
Based on the R COLOC package, which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that only one single causal variant exists for any given trait in any genomic region.
| Hypothesis | Description |
| --- | --- |
| H0 | no association with either trait in the region |
| H1 | association with trait 1 only |
| H2 | association with trait 2 only |
| H3 | both traits are associated, but have different single causal variants |
| H4 | both traits are associated and share the same single causal variant |
Approximate Bayes factors required
Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (logABF column).
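The same arithmetic can be sketched outside Spark in a few lines of numpy; the logABF values and priors below are toy numbers, not real data.

```python
import numpy as np

def logsumexp(x):
    """Stable log of the sum of exponentials, as used for logsum1/logsum2/logsum12."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

left = np.array([8.0, 2.0, 1.0])   # logABF per credible-set variant, trait 1
right = np.array([7.5, 1.0, 0.5])  # logABF per credible-set variant, trait 2
p1, p2, p12 = 1e-4, 1e-4, 1e-5     # default priors

logsum1, logsum2 = logsumexp(left), logsumexp(right)
logsum12 = logsumexp(left + right)

lH0 = 0.0
lH1 = np.log(p1) + logsum1
lH2 = np.log(p2) + logsum2
sumlogsum = logsum1 + logsum2
m = max(sumlogsum, logsum12)
lH3 = np.log(p1) + np.log(p2) + m + np.log(np.exp(sumlogsum - m) - np.exp(logsum12 - m))
lH4 = np.log(p12) + logsum12

abfs = np.array([lH0, lH1, lH2, lH3, lH4])
posteriors = np.exp(abfs - logsumexp(abfs))
print(dict(zip(["h0", "h1", "h2", "h3", "h4"], posteriors.round(4))))
```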
Source code in src/otg/method/colocalisation.py
class Coloc:\n \"\"\"Calculate bayesian colocalisation based on overlapping signals from credible sets.\n\n Based on the [R COLOC package](https://github.com/chr1swallace/coloc/blob/main/R/claudia.R), which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that **only one single causal variant** exists for any given trait in any genomic region.\n\n | Hypothesis | Description |\n | ------------- | --------------------------------------------------------------------- |\n | H<sub>0</sub> | no association with either trait in the region |\n | H<sub>1</sub> | association with trait 1 only |\n | H<sub>2</sub> | association with trait 2 only |\n | H<sub>3</sub> | both traits are associated, but have different single causal variants |\n | H<sub>4</sub> | both traits are associated and share the same single causal variant |\n\n !!! warning \"Approximate Bayes factors required\"\n Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (`logABF` column).\n\n \"\"\"\n\n @staticmethod\n def _get_logsum(log_abf: ndarray) -> float:\n \"\"\"Calculates logsum of vector.\n\n This function calculates the log of the sum of the exponentiated\n logs taking out the max, i.e. insuring that the sum is not Inf\n\n Args:\n log_abf (ndarray): log approximate bayes factor\n\n Returns:\n float: logsum\n\n Example:\n >>> l = [0.2, 0.1, 0.05, 0]\n >>> round(Coloc._get_logsum(l), 6)\n 1.476557\n \"\"\"\n themax = np.max(log_abf)\n result = themax + np.log(np.sum(np.exp(log_abf - themax)))\n return float(result)\n\n @staticmethod\n def _get_posteriors(all_abfs: ndarray) -> DenseVector:\n \"\"\"Calculate posterior probabilities for each hypothesis.\n\n Args:\n all_abfs (ndarray): h0-h4 bayes factors\n\n Returns:\n DenseVector: Posterior\n\n Example:\n >>> l = np.array([0.2, 0.1, 0.05, 0])\n >>> Coloc._get_posteriors(l)\n DenseVector([0.279, 0.2524, 0.2401, 0.2284])\n \"\"\"\n diff = all_abfs - Coloc._get_logsum(all_abfs)\n abfs_posteriors = np.exp(diff)\n return Vectors.dense(abfs_posteriors)\n\n @classmethod\n def colocalise(\n cls: type[Coloc],\n overlapping_signals: StudyLocusOverlap,\n priorc1: float = 1e-4,\n priorc2: float = 1e-4,\n priorc12: float = 1e-5,\n ) -> Colocalisation:\n \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n Args:\n overlapping_signals (StudyLocusOverlap): overlapping peaks\n priorc1 (float): Prior on variant being causal for trait 1. Defaults to 1e-4.\n priorc2 (float): Prior on variant being causal for trait 2. Defaults to 1e-4.\n priorc12 (float): Prior on variant being causal for traits 1 and 2. 
Defaults to 1e-5.\n\n Returns:\n Colocalisation: Colocalisation results\n \"\"\"\n # register udfs\n logsum = f.udf(Coloc._get_logsum, DoubleType())\n posteriors = f.udf(Coloc._get_posteriors, VectorUDT())\n return Colocalisation(\n _df=(\n overlapping_signals.df\n # Before summing log_abf columns nulls need to be filled with 0:\n .fillna(0, subset=[\"statistics.left_logABF\", \"statistics.right_logABF\"])\n # Sum of log_abfs for each pair of signals\n .withColumn(\n \"sum_log_abf\",\n f.col(\"statistics.left_logABF\") + f.col(\"statistics.right_logABF\"),\n )\n # Group by overlapping peak and generating dense vectors of log_abf:\n .groupBy(\"chromosome\", \"leftStudyLocusId\", \"rightStudyLocusId\")\n .agg(\n f.count(\"*\").alias(\"numberColocalisingVariants\"),\n fml.array_to_vector(\n f.collect_list(f.col(\"statistics.left_logABF\"))\n ).alias(\"left_logABF\"),\n fml.array_to_vector(\n f.collect_list(f.col(\"statistics.right_logABF\"))\n ).alias(\"right_logABF\"),\n fml.array_to_vector(f.collect_list(f.col(\"sum_log_abf\"))).alias(\n \"sum_log_abf\"\n ),\n )\n .withColumn(\"logsum1\", logsum(f.col(\"left_logABF\")))\n .withColumn(\"logsum2\", logsum(f.col(\"right_logABF\")))\n .withColumn(\"logsum12\", logsum(f.col(\"sum_log_abf\")))\n .drop(\"left_logABF\", \"right_logABF\", \"sum_log_abf\")\n # Add priors\n # priorc1 Prior on variant being causal for trait 1\n .withColumn(\"priorc1\", f.lit(priorc1))\n # priorc2 Prior on variant being causal for trait 2\n .withColumn(\"priorc2\", f.lit(priorc2))\n # priorc12 Prior on variant being causal for traits 1 and 2\n .withColumn(\"priorc12\", f.lit(priorc12))\n # h0-h2\n .withColumn(\"lH0abf\", f.lit(0))\n .withColumn(\"lH1abf\", f.log(f.col(\"priorc1\")) + f.col(\"logsum1\"))\n .withColumn(\"lH2abf\", f.log(f.col(\"priorc2\")) + f.col(\"logsum2\"))\n # h3\n .withColumn(\"sumlogsum\", f.col(\"logsum1\") + f.col(\"logsum2\"))\n # exclude null H3/H4s: due to sumlogsum == logsum12\n .filter(f.col(\"sumlogsum\") != f.col(\"logsum12\"))\n .withColumn(\"max\", f.greatest(\"sumlogsum\", \"logsum12\"))\n .withColumn(\n \"logdiff\",\n (\n f.col(\"max\")\n + f.log(\n f.exp(f.col(\"sumlogsum\") - f.col(\"max\"))\n - f.exp(f.col(\"logsum12\") - f.col(\"max\"))\n )\n ),\n )\n .withColumn(\n \"lH3abf\",\n f.log(f.col(\"priorc1\"))\n + f.log(f.col(\"priorc2\"))\n + f.col(\"logdiff\"),\n )\n .drop(\"right_logsum\", \"left_logsum\", \"sumlogsum\", \"max\", \"logdiff\")\n # h4\n .withColumn(\"lH4abf\", f.log(f.col(\"priorc12\")) + f.col(\"logsum12\"))\n # cleaning\n .drop(\n \"priorc1\", \"priorc2\", \"priorc12\", \"logsum1\", \"logsum2\", \"logsum12\"\n )\n # posteriors\n .withColumn(\n \"allABF\",\n fml.array_to_vector(\n f.array(\n f.col(\"lH0abf\"),\n f.col(\"lH1abf\"),\n f.col(\"lH2abf\"),\n f.col(\"lH3abf\"),\n f.col(\"lH4abf\"),\n )\n ),\n )\n .withColumn(\n \"posteriors\", fml.vector_to_array(posteriors(f.col(\"allABF\")))\n )\n .withColumn(\"h0\", f.col(\"posteriors\").getItem(0))\n .withColumn(\"h1\", f.col(\"posteriors\").getItem(1))\n .withColumn(\"h2\", f.col(\"posteriors\").getItem(2))\n .withColumn(\"h3\", f.col(\"posteriors\").getItem(3))\n .withColumn(\"h4\", f.col(\"posteriors\").getItem(4))\n .withColumn(\"h4h3\", f.col(\"h4\") / f.col(\"h3\"))\n .withColumn(\"log2h4h3\", f.log2(f.col(\"h4h3\")))\n # clean up\n .drop(\n \"posteriors\",\n \"allABF\",\n \"h4h3\",\n \"lH0abf\",\n \"lH1abf\",\n \"lH2abf\",\n \"lH3abf\",\n \"lH4abf\",\n )\n .withColumn(\"colocalisationMethod\", f.lit(\"COLOC\"))\n ),\n _schema=Colocalisation.get_schema(),\n )\n
It extends the CAVIAR framework to explicitly estimate the posterior probability that the same variant is causal in two studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (CLPP) from the marginal posterior probabilities. This framework allows for multiple variants to be causal in a single locus.
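The CLPP itself is just the product of the two marginal posteriors, summed over the overlapping variants of a locus; a toy illustration:

```python
# Toy posterior probabilities for the same three variants in two studies.
left_pp = [0.6, 0.3, 0.1]
right_pp = [0.5, 0.4, 0.1]
clpp = sum(l * r for l, r in zip(left_pp, right_pp))
print(round(clpp, 3))  # 0.43
```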
Source code in src/otg/method/colocalisation.py
class ECaviar:\n \"\"\"ECaviar-based colocalisation analysis.\n\n It extends [CAVIAR](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5142122/#bib18)\u00a0framework to explicitly estimate the posterior probability that the same variant is causal in 2 studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (**CLPP**) by utilizing the marginal posterior probabilities. This framework allows for **multiple variants to be causal** in a single locus.\n \"\"\"\n\n @staticmethod\n def _get_clpp(left_pp: Column, right_pp: Column) -> Column:\n \"\"\"Calculate the colocalisation posterior probability (CLPP).\n\n If the fact that the same variant is found causal for two studies are independent events,\n CLPP is defined as the product of posterior porbabilities that a variant is causal in both studies.\n\n Args:\n left_pp (Column): left posterior probability\n right_pp (Column): right posterior probability\n\n Returns:\n Column: CLPP\n\n Examples:\n >>> d = [{\"left_pp\": 0.5, \"right_pp\": 0.5}, {\"left_pp\": 0.25, \"right_pp\": 0.75}]\n >>> df = spark.createDataFrame(d)\n >>> df.withColumn(\"clpp\", ECaviar._get_clpp(f.col(\"left_pp\"), f.col(\"right_pp\"))).show()\n +-------+--------+------+\n |left_pp|right_pp| clpp|\n +-------+--------+------+\n | 0.5| 0.5| 0.25|\n | 0.25| 0.75|0.1875|\n +-------+--------+------+\n <BLANKLINE>\n\n \"\"\"\n return left_pp * right_pp\n\n @classmethod\n def colocalise(\n cls: type[ECaviar], overlapping_signals: StudyLocusOverlap\n ) -> Colocalisation:\n \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n Args:\n overlapping_signals (StudyLocusOverlap): overlapping signals.\n\n Returns:\n Colocalisation: colocalisation results based on eCAVIAR.\n \"\"\"\n return Colocalisation(\n _df=(\n overlapping_signals.df.withColumn(\n \"clpp\",\n ECaviar._get_clpp(\n f.col(\"statistics.left_posteriorProbability\"),\n f.col(\"statistics.right_posteriorProbability\"),\n ),\n )\n .groupBy(\"leftStudyLocusId\", \"rightStudyLocusId\", \"chromosome\")\n .agg(\n f.count(\"*\").alias(\"numberColocalisingVariants\"),\n f.sum(f.col(\"clpp\")).alias(\"clpp\"),\n )\n .withColumn(\"colocalisationMethod\", f.lit(\"eCAVIAR\"))\n ),\n _schema=Colocalisation.get_schema(),\n )\n
Class to annotate linkage disequilibrium (LD) operations from GnomAD.
Source code in src/otg/method/ld.py
class LDAnnotator:\n \"\"\"Class to annotate linkage disequilibrium (LD) operations from GnomAD.\"\"\"\n\n @staticmethod\n def _calculate_weighted_r_overall(ld_set: Column) -> Column:\n \"\"\"Aggregation of weighted R information using ancestry proportions.\"\"\"\n return f.transform(\n ld_set,\n lambda x: f.struct(\n x[\"tagVariantId\"].alias(\"tagVariantId\"),\n # r2Overall is the accumulated sum of each r2 relative to the population size\n f.aggregate(\n x[\"rValues\"],\n f.lit(0.0),\n lambda acc, y: acc\n + f.coalesce(\n f.pow(y[\"r\"], 2) * y[\"relativeSampleSize\"], f.lit(0.0)\n ), # we use coalesce to avoid problems when r/relativeSampleSize is null\n ).alias(\"r2Overall\"),\n ),\n )\n\n @staticmethod\n def _add_population_size(ld_set: Column, study_populations: Column) -> Column:\n \"\"\"Add population size to each rValues entry in the ldSet.\n\n Args:\n ld_set (Column): LD set\n study_populations (Column): Study populations\n\n Returns:\n Column: LD set with added 'relativeSampleSize' field\n \"\"\"\n # Create a population to relativeSampleSize map from the struct\n populations_map = f.map_from_arrays(\n study_populations[\"ldPopulation\"],\n study_populations[\"relativeSampleSize\"],\n )\n return f.transform(\n ld_set,\n lambda x: f.struct(\n x[\"tagVariantId\"].alias(\"tagVariantId\"),\n f.transform(\n x[\"rValues\"],\n lambda y: f.struct(\n y[\"population\"].alias(\"population\"),\n y[\"r\"].alias(\"r\"),\n populations_map[y[\"population\"]].alias(\"relativeSampleSize\"),\n ),\n ).alias(\"rValues\"),\n ),\n )\n\n @classmethod\n def ld_annotate(\n cls: type[LDAnnotator],\n associations: StudyLocus,\n studies: StudyIndex,\n ld_index: LDIndex,\n ) -> StudyLocus:\n \"\"\"Annotate linkage disequilibrium (LD) information to a set of studyLocus.\n\n This function:\n 1. Annotates study locus with population structure information from the study index\n 2. Joins the LD index to the StudyLocus\n 3. Adds the population size of the study to each rValues entry in the ldSet\n 4. Calculates the overall R weighted by the ancestry proportions in every given study.\n\n Args:\n associations (StudyLocus): Dataset to be LD annotated\n studies (StudyIndex): Dataset with study information\n ld_index (LDIndex): Dataset with LD information for every variant present in LD matrix\n\n Returns:\n StudyLocus: including additional column with LD information.\n \"\"\"\n return (\n StudyLocus(\n _df=(\n associations.df\n # Drop ldSet column if already available\n .select(*[col for col in associations.df.columns if col != \"ldSet\"])\n # Annotate study locus with population structure from study index\n .join(\n studies.df.select(\"studyId\", \"ldPopulationStructure\"),\n on=\"studyId\",\n how=\"left\",\n )\n # Bring LD information from LD Index\n .join(\n ld_index.df,\n on=[\"variantId\", \"chromosome\"],\n how=\"left\",\n )\n # Add population size to each rValues entry in the ldSet if population structure available:\n .withColumn(\n \"ldSet\",\n f.when(\n f.col(\"ldPopulationStructure\").isNotNull(),\n cls._add_population_size(\n f.col(\"ldSet\"), f.col(\"ldPopulationStructure\")\n ),\n ),\n )\n # Aggregate weighted R information using ancestry proportions\n .withColumn(\n \"ldSet\",\n f.when(\n f.col(\"ldPopulationStructure\").isNotNull(),\n cls._calculate_weighted_r_overall(f.col(\"ldSet\")),\n ),\n ).drop(\"ldPopulationStructure\")\n ),\n _schema=StudyLocus.get_schema(),\n )\n ._qc_no_population()\n ._qc_unresolved_ld()\n )\n
Annotate linkage disequilibrium (LD) information to a set of studyLocus.
This function:

1. Annotates the study locus with population structure information from the study index
2. Joins the LD index to the StudyLocus
3. Adds the population size of the study to each rValues entry in the ldSet
4. Calculates the overall R weighted by the ancestry proportions in every given study (see the sketch below)
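For step 4, the weight for a single tag variant is simply the sum of each population's r² multiplied by that population's relative sample size; a toy sketch with made-up numbers:

```python
# r and relativeSampleSize per population for one tag variant (toy values).
r_values = [
    {"population": "nfe", "r": 0.9, "relativeSampleSize": 0.7},
    {"population": "afr", "r": 0.4, "relativeSampleSize": 0.3},
]
r2_overall = sum(v["r"] ** 2 * v["relativeSampleSize"] for v in r_values)
print(round(r2_overall, 3))  # 0.615
```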
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `associations` | `StudyLocus` | Dataset to be LD annotated | required |
| `studies` | `StudyIndex` | Dataset with study information | required |
| `ld_index` | `LDIndex` | Dataset with LD information for every variant present in LD matrix | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `StudyLocus` | `StudyLocus` | including additional column with LD information. |
Source code in src/otg/method/ld.py
@classmethod\ndef ld_annotate(\n cls: type[LDAnnotator],\n associations: StudyLocus,\n studies: StudyIndex,\n ld_index: LDIndex,\n) -> StudyLocus:\n \"\"\"Annotate linkage disequilibrium (LD) information to a set of studyLocus.\n\n This function:\n 1. Annotates study locus with population structure information from the study index\n 2. Joins the LD index to the StudyLocus\n 3. Adds the population size of the study to each rValues entry in the ldSet\n 4. Calculates the overall R weighted by the ancestry proportions in every given study.\n\n Args:\n associations (StudyLocus): Dataset to be LD annotated\n studies (StudyIndex): Dataset with study information\n ld_index (LDIndex): Dataset with LD information for every variant present in LD matrix\n\n Returns:\n StudyLocus: including additional column with LD information.\n \"\"\"\n return (\n StudyLocus(\n _df=(\n associations.df\n # Drop ldSet column if already available\n .select(*[col for col in associations.df.columns if col != \"ldSet\"])\n # Annotate study locus with population structure from study index\n .join(\n studies.df.select(\"studyId\", \"ldPopulationStructure\"),\n on=\"studyId\",\n how=\"left\",\n )\n # Bring LD information from LD Index\n .join(\n ld_index.df,\n on=[\"variantId\", \"chromosome\"],\n how=\"left\",\n )\n # Add population size to each rValues entry in the ldSet if population structure available:\n .withColumn(\n \"ldSet\",\n f.when(\n f.col(\"ldPopulationStructure\").isNotNull(),\n cls._add_population_size(\n f.col(\"ldSet\"), f.col(\"ldPopulationStructure\")\n ),\n ),\n )\n # Aggregate weighted R information using ancestry proportions\n .withColumn(\n \"ldSet\",\n f.when(\n f.col(\"ldPopulationStructure\").isNotNull(),\n cls._calculate_weighted_r_overall(f.col(\"ldSet\")),\n ),\n ).drop(\"ldPopulationStructure\")\n ),\n _schema=StudyLocus.get_schema(),\n )\n ._qc_no_population()\n ._qc_unresolved_ld()\n )\n
Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.
Source code in src/otg/method/pics.py
class PICS:\n \"\"\"Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.\"\"\"\n\n @staticmethod\n def _pics_relative_posterior_probability(\n neglog_p: float, pics_snp_mu: float, pics_snp_std: float\n ) -> float:\n \"\"\"Compute the PICS posterior probability for a given SNP.\n\n !!! info \"This probability needs to be scaled to take into account the probabilities of the other variants in the locus.\"\n\n Args:\n neglog_p (float): Negative log p-value of the lead variant\n pics_snp_mu (float): Mean P value of the association between a SNP and a trait\n pics_snp_std (float): Standard deviation for the P value of the association between a SNP and a trait\n\n Returns:\n Relative posterior probability of a SNP being causal in a locus\n\n Examples:\n >>> rel_prob = PICS._pics_relative_posterior_probability(neglog_p=10.0, pics_snp_mu=1.0, pics_snp_std=10.0)\n >>> round(rel_prob, 3)\n 0.368\n \"\"\"\n return float(norm(pics_snp_mu, pics_snp_std).sf(neglog_p) * 2)\n\n @staticmethod\n def _pics_standard_deviation(neglog_p: float, r2: float, k: float) -> float | None:\n \"\"\"Compute the PICS standard deviation.\n\n This distribution is obtained after a series of permutation tests described in the PICS method, and it is only\n valid when the SNP is highly linked with the lead (r2 > 0.5).\n\n Args:\n neglog_p (float): Negative log p-value of the lead variant\n r2 (float): LD score between a given SNP and the lead variant\n k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n Returns:\n Standard deviation for the P value of the association between a SNP and a trait\n\n Examples:\n >>> PICS._pics_standard_deviation(neglog_p=1.0, r2=1.0, k=6.4)\n 0.0\n >>> round(PICS._pics_standard_deviation(neglog_p=10.0, r2=0.5, k=6.4), 3)\n 1.493\n >>> print(PICS._pics_standard_deviation(neglog_p=1.0, r2=0.0, k=6.4))\n None\n \"\"\"\n return (\n abs(((1 - (r2**0.5) ** k) ** 0.5) * (neglog_p**0.5) / 2)\n if r2 >= 0.5\n else None\n )\n\n @staticmethod\n def _pics_mu(neglog_p: float, r2: float) -> float | None:\n \"\"\"Compute the PICS mu that estimates the probability of association between a given SNP and the trait.\n\n This distribution is obtained after a series of permutation tests described in the PICS method, and it is only\n valid when the SNP is highly linked with the lead (r2 > 0.5).\n\n Args:\n neglog_p (float): Negative log p-value of the lead variant\n r2 (float): LD score between a given SNP and the lead variant\n\n Returns:\n Mean P value of the association between a SNP and a trait\n\n Examples:\n >>> PICS._pics_mu(neglog_p=1.0, r2=1.0)\n 1.0\n >>> PICS._pics_mu(neglog_p=10.0, r2=0.5)\n 5.0\n >>> print(PICS._pics_mu(neglog_p=10.0, r2=0.3))\n None\n \"\"\"\n return neglog_p * r2 if r2 >= 0.5 else None\n\n @staticmethod\n def _finemap(ld_set: list[Row], lead_neglog_p: float, k: float) -> list | None:\n \"\"\"Calculates the probability of a variant being causal in a study-locus context by applying the PICS method.\n\n It is intended to be applied as an UDF in `PICS.finemap`, where each row is a StudyLocus association.\n The function iterates over every SNP in the `ldSet` array, and it returns an updated locus with\n its association signal and causality probability as of PICS.\n\n Args:\n ld_set (list): list of tagging variants after expanding the locus\n lead_neglog_p (float): P value of the association 
signal between the lead variant and the study in the form of -log10.\n k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n Returns:\n List of tagging variants with an estimation of the association signal and their posterior probability as of PICS.\n\n Examples:\n >>> from pyspark.sql import Row\n >>> ld_set = [\n ... Row(variantId=\"var1\", r2Overall=0.8),\n ... Row(variantId=\"var2\", r2Overall=1),\n ... ]\n >>> PICS._finemap(ld_set, lead_neglog_p=10.0, k=6.4)\n [{'variantId': 'var1', 'r2Overall': 0.8, 'standardError': 0.07420896512708416, 'posteriorProbability': 0.07116959886882368}, {'variantId': 'var2', 'r2Overall': 1, 'standardError': 0.9977000638225533, 'posteriorProbability': 0.9288304011311763}]\n >>> empty_ld_set = []\n >>> PICS._finemap(empty_ld_set, lead_neglog_p=10.0, k=6.4)\n []\n >>> ld_set_with_no_r2 = [\n ... Row(variantId=\"var1\", r2Overall=None),\n ... Row(variantId=\"var2\", r2Overall=None),\n ... ]\n >>> PICS._finemap(ld_set_with_no_r2, lead_neglog_p=10.0, k=6.4)\n [{'variantId': 'var1', 'r2Overall': None}, {'variantId': 'var2', 'r2Overall': None}]\n \"\"\"\n if ld_set is None:\n return None\n elif not ld_set:\n return []\n tmp_credible_set = []\n new_credible_set = []\n # First iteration: calculation of mu, standard deviation, and the relative posterior probability\n for tag_struct in ld_set:\n tag_dict = (\n tag_struct.asDict()\n ) # tag_struct is of type pyspark.Row, we'll represent it as a dict\n if (\n not tag_dict[\"r2Overall\"]\n or tag_dict[\"r2Overall\"] < 0.5\n or not lead_neglog_p\n ):\n # If PICS cannot be calculated, we'll return the original credible set\n new_credible_set.append(tag_dict)\n continue\n\n pics_snp_mu = PICS._pics_mu(lead_neglog_p, tag_dict[\"r2Overall\"])\n pics_snp_std = PICS._pics_standard_deviation(\n lead_neglog_p, tag_dict[\"r2Overall\"], k\n )\n pics_snp_std = 0.001 if pics_snp_std == 0 else pics_snp_std\n if pics_snp_mu is not None and pics_snp_std is not None:\n posterior_probability = PICS._pics_relative_posterior_probability(\n lead_neglog_p, pics_snp_mu, pics_snp_std\n )\n tag_dict[\"standardError\"] = 10**-pics_snp_std\n tag_dict[\"relativePosteriorProbability\"] = posterior_probability\n\n tmp_credible_set.append(tag_dict)\n\n # Second iteration: calculation of the sum of all the posteriors in each study-locus, so that we scale them between 0-1\n total_posteriors = sum(\n tag_dict.get(\"relativePosteriorProbability\", 0)\n for tag_dict in tmp_credible_set\n )\n\n # Third iteration: calculation of the final posteriorProbability\n for tag_dict in tmp_credible_set:\n if total_posteriors != 0:\n tag_dict[\"posteriorProbability\"] = float(\n tag_dict.get(\"relativePosteriorProbability\", 0) / total_posteriors\n )\n tag_dict.pop(\"relativePosteriorProbability\")\n new_credible_set.append(tag_dict)\n return new_credible_set\n\n @classmethod\n def finemap(\n cls: type[PICS], associations: StudyLocus, k: float = 6.4\n ) -> StudyLocus:\n \"\"\"Run PICS on a study locus.\n\n !!! 
info \"Study locus needs to be LD annotated\"\n The study locus needs to be LD annotated before PICS can be calculated.\n\n Args:\n associations (StudyLocus): Study locus to finemap using PICS\n k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n Returns:\n StudyLocus: Study locus with PICS results\n \"\"\"\n # Register UDF by defining the structure of the output locus array of structs\n # it also renames tagVariantId to variantId\n\n picsed_ldset_schema = t.ArrayType(\n t.StructType(\n [\n t.StructField(\"tagVariantId\", t.StringType(), True),\n t.StructField(\"r2Overall\", t.DoubleType(), True),\n t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n t.StructField(\"standardError\", t.DoubleType(), True),\n ]\n )\n )\n picsed_study_locus_schema = t.ArrayType(\n t.StructType(\n [\n t.StructField(\"variantId\", t.StringType(), True),\n t.StructField(\"r2Overall\", t.DoubleType(), True),\n t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n t.StructField(\"standardError\", t.DoubleType(), True),\n ]\n )\n )\n _finemap_udf = f.udf(\n lambda locus, neglog_p: PICS._finemap(locus, neglog_p, k),\n picsed_ldset_schema,\n )\n return StudyLocus(\n _df=(\n associations.df\n # Old locus column will be dropped if available\n .select(*[col for col in associations.df.columns if col != \"locus\"])\n # Estimate neglog_pvalue for the lead variant\n .withColumn(\"neglog_pvalue\", associations.neglog_pvalue())\n # New locus containing the PICS results\n .withColumn(\n \"locus\",\n f.when(\n f.col(\"ldSet\").isNotNull(),\n _finemap_udf(f.col(\"ldSet\"), f.col(\"neglog_pvalue\")).cast(\n picsed_study_locus_schema\n ),\n ),\n )\n # Rename tagVariantId to variantId\n .drop(\"neglog_pvalue\")\n ),\n _schema=StudyLocus.get_schema(),\n )\n
The study locus needs to be LD annotated before PICS can be calculated.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `associations` | `StudyLocus` | Study locus to finemap using PICS | required |
| `k` | `float` | Empiric constant that can be adjusted to fit the curve, 6.4 recommended. | 6.4 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `StudyLocus` | `StudyLocus` | Study locus with PICS results |
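The per-tag maths can be restated in a few lines of Python (scipy's normal distribution is used); this toy sketch mirrors the mu, standard deviation and normalisation steps described above, not the Spark implementation.

```python
from scipy.stats import norm

def pics_relative_posterior(neglog_p, r2, k=6.4):
    """Relative posterior for one tag variant; only defined when r2 >= 0.5."""
    if r2 < 0.5:
        return None
    mu = neglog_p * r2
    std = abs(((1 - (r2**0.5) ** k) ** 0.5) * (neglog_p**0.5) / 2)
    std = std if std != 0 else 0.001
    return float(norm(mu, std).sf(neglog_p) * 2)

# Two tags in LD with a lead at -log10(p) = 10; relative posteriors are then
# rescaled so they sum to 1 across the locus.
raw = [pics_relative_posterior(10.0, r2) for r2 in (0.8, 1.0)]
total = sum(raw)
print([round(p / total, 3) for p in raw])
```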
Source code in src/otg/method/pics.py
@classmethod\ndef finemap(\n cls: type[PICS], associations: StudyLocus, k: float = 6.4\n) -> StudyLocus:\n \"\"\"Run PICS on a study locus.\n\n !!! info \"Study locus needs to be LD annotated\"\n The study locus needs to be LD annotated before PICS can be calculated.\n\n Args:\n associations (StudyLocus): Study locus to finemap using PICS\n k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n Returns:\n StudyLocus: Study locus with PICS results\n \"\"\"\n # Register UDF by defining the structure of the output locus array of structs\n # it also renames tagVariantId to variantId\n\n picsed_ldset_schema = t.ArrayType(\n t.StructType(\n [\n t.StructField(\"tagVariantId\", t.StringType(), True),\n t.StructField(\"r2Overall\", t.DoubleType(), True),\n t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n t.StructField(\"standardError\", t.DoubleType(), True),\n ]\n )\n )\n picsed_study_locus_schema = t.ArrayType(\n t.StructType(\n [\n t.StructField(\"variantId\", t.StringType(), True),\n t.StructField(\"r2Overall\", t.DoubleType(), True),\n t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n t.StructField(\"standardError\", t.DoubleType(), True),\n ]\n )\n )\n _finemap_udf = f.udf(\n lambda locus, neglog_p: PICS._finemap(locus, neglog_p, k),\n picsed_ldset_schema,\n )\n return StudyLocus(\n _df=(\n associations.df\n # Old locus column will be dropped if available\n .select(*[col for col in associations.df.columns if col != \"locus\"])\n # Estimate neglog_pvalue for the lead variant\n .withColumn(\"neglog_pvalue\", associations.neglog_pvalue())\n # New locus containing the PICS results\n .withColumn(\n \"locus\",\n f.when(\n f.col(\"ldSet\").isNotNull(),\n _finemap_udf(f.col(\"ldSet\"), f.col(\"neglog_pvalue\")).cast(\n picsed_study_locus_schema\n ),\n ),\n )\n # Rename tagVariantId to variantId\n .drop(\"neglog_pvalue\")\n ),\n _schema=StudyLocus.get_schema(),\n )\n
Get semi-lead SNPs from summary statistics using a window-based function.
Source code in src/otg/method/window_based_clumping.py
class WindowBasedClumping:\n \"\"\"Get semi-lead snps from summary statistics using a window based function.\"\"\"\n\n @staticmethod\n def _cluster_peaks(\n study: Column, chromosome: Column, position: Column, window_length: int\n ) -> Column:\n \"\"\"Cluster GWAS significant variants, were clusters are separated by a defined distance.\n\n !! Important to note that the length of the clusters can be arbitrarily big.\n\n Args:\n study (Column): study identifier\n chromosome (Column): chromosome identifier\n position (Column): position of the variant\n window_length (int): window length in basepair\n\n Returns:\n Column: containing cluster identifier\n\n Examples:\n >>> data = [\n ... # Cluster 1:\n ... ('s1', 'chr1', 2),\n ... ('s1', 'chr1', 4),\n ... ('s1', 'chr1', 12),\n ... # Cluster 2 - Same chromosome:\n ... ('s1', 'chr1', 31),\n ... ('s1', 'chr1', 38),\n ... ('s1', 'chr1', 42),\n ... # Cluster 3 - New chromosome:\n ... ('s1', 'chr2', 41),\n ... ('s1', 'chr2', 44),\n ... ('s1', 'chr2', 50),\n ... # Cluster 4 - other study:\n ... ('s2', 'chr2', 55),\n ... ('s2', 'chr2', 62),\n ... ('s2', 'chr2', 70),\n ... ]\n >>> window_length = 10\n >>> (\n ... spark.createDataFrame(data, ['studyId', 'chromosome', 'position'])\n ... .withColumn(\"cluster_id\",\n ... WindowBasedClumping._cluster_peaks(\n ... f.col('studyId'),\n ... f.col('chromosome'),\n ... f.col('position'),\n ... window_length\n ... )\n ... ).show()\n ... )\n +-------+----------+--------+----------+\n |studyId|chromosome|position|cluster_id|\n +-------+----------+--------+----------+\n | s1| chr1| 2| s1_chr1_2|\n | s1| chr1| 4| s1_chr1_2|\n | s1| chr1| 12| s1_chr1_2|\n | s1| chr1| 31|s1_chr1_31|\n | s1| chr1| 38|s1_chr1_31|\n | s1| chr1| 42|s1_chr1_31|\n | s1| chr2| 41|s1_chr2_41|\n | s1| chr2| 44|s1_chr2_41|\n | s1| chr2| 50|s1_chr2_41|\n | s2| chr2| 55|s2_chr2_55|\n | s2| chr2| 62|s2_chr2_55|\n | s2| chr2| 70|s2_chr2_55|\n +-------+----------+--------+----------+\n <BLANKLINE>\n\n \"\"\"\n # By adding previous position, the cluster boundary can be identified:\n previous_position = f.lag(position).over(\n Window.partitionBy(study, chromosome).orderBy(position)\n )\n # We consider a cluster boudary if subsequent snps are further than the defined window:\n cluster_id = f.when(\n (previous_position.isNull())\n | (position - previous_position > window_length),\n f.concat_ws(\"_\", study, chromosome, position),\n )\n # The cluster identifier is propagated across every variant of the cluster:\n return f.when(\n cluster_id.isNull(),\n f.last(cluster_id, ignorenulls=True).over(\n Window.partitionBy(study, chromosome)\n .orderBy(position)\n .rowsBetween(Window.unboundedPreceding, Window.currentRow)\n ),\n ).otherwise(cluster_id)\n\n @staticmethod\n def _prune_peak(position: ndarray, window_size: int) -> DenseVector:\n \"\"\"Establish lead snps based on their positions listed by p-value.\n\n The function `find_peak` assigns lead SNPs based on their positions listed by p-value within a specified window size.\n\n Args:\n position (ndarray): positions of the SNPs sorted by p-value.\n window_size (int): the distance in bp within which associations are clumped together around the lead snp.\n\n Returns:\n DenseVector: binary vector where 1 indicates a lead SNP and 0 indicates a non-lead SNP.\n\n Examples:\n >>> from pyspark.ml import functions as fml\n >>> from pyspark.ml.linalg import DenseVector\n >>> WindowBasedClumping._prune_peak(np.array((3, 9, 8, 4, 6)), 2)\n DenseVector([1.0, 1.0, 0.0, 0.0, 1.0])\n\n \"\"\"\n # Initializing the lead list 
with zeroes:\n is_lead: ndarray = np.zeros(len(position))\n\n # List containing indices of leads:\n lead_indices: list = []\n\n # Looping through all positions:\n for index in range(len(position)):\n # Looping through leads to find out if they are within a window:\n for lead_index in lead_indices:\n # If any of the leads within the window:\n if abs(position[lead_index] - position[index]) < window_size:\n # Skipping further checks:\n break\n else:\n # None of the leads were within the window:\n lead_indices.append(index)\n is_lead[index] = 1\n\n return DenseVector(is_lead)\n\n @classmethod\n def clump(\n cls: type[WindowBasedClumping],\n summary_stats: SummaryStatistics,\n window_length: int,\n p_value_significance: float = 5e-8,\n ) -> StudyLocus:\n \"\"\"Clump summary statistics by distance.\n\n Args:\n summary_stats (SummaryStatistics): summary statistics to clump\n window_length (int): window length in basepair\n p_value_significance (float): only more significant variants are considered\n\n Returns:\n StudyLocus: clumped summary statistics\n \"\"\"\n # Create window for locus clusters\n # - variants where the distance between subsequent variants is below the defined threshold.\n # - Variants are sorted by descending significance\n cluster_window = Window.partitionBy(\n \"studyId\", \"chromosome\", \"cluster_id\"\n ).orderBy(f.col(\"pValueExponent\").asc(), f.col(\"pValueMantissa\").asc())\n\n return StudyLocus(\n _df=(\n summary_stats\n # Dropping snps below significance - all subsequent steps are done on significant variants:\n .pvalue_filter(p_value_significance)\n .df\n # Clustering summary variants for efficient windowing (complexity reduction):\n .withColumn(\n \"cluster_id\",\n WindowBasedClumping._cluster_peaks(\n f.col(\"studyId\"),\n f.col(\"chromosome\"),\n f.col(\"position\"),\n window_length,\n ),\n )\n # Within each cluster variants are ranked by significance:\n .withColumn(\"pvRank\", f.row_number().over(cluster_window))\n # Collect positions in cluster for the most significant variant (complexity reduction):\n .withColumn(\n \"collectedPositions\",\n f.when(\n f.col(\"pvRank\") == 1,\n f.collect_list(f.col(\"position\")).over(\n cluster_window.rowsBetween(\n Window.currentRow, Window.unboundedFollowing\n )\n ),\n ).otherwise(f.array()),\n )\n # Get semi indices only ONCE per cluster:\n .withColumn(\n \"semiIndices\",\n f.when(\n f.size(f.col(\"collectedPositions\")) > 0,\n fml.vector_to_array(\n f.udf(WindowBasedClumping._prune_peak, VectorUDT())(\n fml.array_to_vector(f.col(\"collectedPositions\")),\n f.lit(window_length),\n )\n ),\n ),\n )\n # Propagating the result of the above calculation for all rows:\n .withColumn(\n \"semiIndices\",\n f.when(\n f.col(\"semiIndices\").isNull(),\n f.first(f.col(\"semiIndices\"), ignorenulls=True).over(\n cluster_window\n ),\n ).otherwise(f.col(\"semiIndices\")),\n )\n # Keeping semi indices only:\n .filter(f.col(\"semiIndices\")[f.col(\"pvRank\") - 1] > 0)\n .drop(\"pvRank\", \"collectedPositions\", \"semiIndices\", \"cluster_id\")\n # Adding study-locus id:\n .withColumn(\n \"studyLocusId\",\n StudyLocus.assign_study_locus_id(\"studyId\", \"variantId\"),\n )\n # Initialize QC column as array of strings:\n .withColumn(\n \"qualityControls\", f.array().cast(t.ArrayType(t.StringType()))\n )\n ),\n _schema=StudyLocus.get_schema(),\n )\n\n @classmethod\n def clump_with_locus(\n cls: type[WindowBasedClumping],\n summary_stats: SummaryStatistics,\n window_length: int,\n p_value_significance: float = 5e-8,\n p_value_baseline: float = 
0.05,\n locus_window_length: int | None = None,\n ) -> StudyLocus:\n \"\"\"Clump significant associations while collecting locus around them.\n\n Args:\n summary_stats (SummaryStatistics): Input summary statistics dataset\n window_length (int): Window size in bp, used for distance based clumping.\n p_value_significance (float, optional): GWAS significance threshold used to filter peaks. Defaults to 5e-8.\n p_value_baseline (float, optional): Least significant threshold. Below this, all snps are dropped. Defaults to 0.05.\n locus_window_length (int, optional): The distance for collecting locus around the semi indices.\n\n Returns:\n StudyLocus: StudyLocus after clumping with information about the `locus`\n \"\"\"\n # If no locus window provided, using the same value:\n if locus_window_length is None:\n locus_window_length = window_length\n\n # Run distance based clumping on the summary stats:\n clumped_dataframe = WindowBasedClumping.clump(\n summary_stats,\n window_length=window_length,\n p_value_significance=p_value_significance,\n ).df.alias(\"clumped\")\n\n # Get list of columns from clumped dataset for further propagation:\n clumped_columns = clumped_dataframe.columns\n\n # Dropping variants not meeting the baseline criteria:\n sumstats_baseline = summary_stats.pvalue_filter(p_value_baseline).df\n\n # Renaming columns:\n sumstats_baseline_renamed = sumstats_baseline.selectExpr(\n *[f\"{col} as tag_{col}\" for col in sumstats_baseline.columns]\n ).alias(\"sumstat\")\n\n study_locus_df = (\n sumstats_baseline_renamed\n # Joining the two datasets together:\n .join(\n f.broadcast(clumped_dataframe),\n on=[\n (f.col(\"sumstat.tag_studyId\") == f.col(\"clumped.studyId\"))\n & (f.col(\"sumstat.tag_chromosome\") == f.col(\"clumped.chromosome\"))\n & (\n f.col(\"sumstat.tag_position\")\n >= (f.col(\"clumped.position\") - locus_window_length)\n )\n & (\n f.col(\"sumstat.tag_position\")\n <= (f.col(\"clumped.position\") + locus_window_length)\n )\n ],\n how=\"right\",\n )\n .withColumn(\n \"locus\",\n f.struct(\n f.col(\"tag_variantId\").alias(\"variantId\"),\n f.col(\"tag_beta\").alias(\"beta\"),\n f.col(\"tag_pValueMantissa\").alias(\"pValueMantissa\"),\n f.col(\"tag_pValueExponent\").alias(\"pValueExponent\"),\n f.col(\"tag_standardError\").alias(\"standardError\"),\n ),\n )\n .groupby(\"studyLocusId\")\n .agg(\n *[\n f.first(col).alias(col)\n for col in clumped_columns\n if col != \"studyLocusId\"\n ],\n f.collect_list(f.col(\"locus\")).alias(\"locus\"),\n )\n )\n\n return StudyLocus(\n _df=study_locus_df,\n _schema=StudyLocus.get_schema(),\n )\n
Clump significant associations while collecting locus around them.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `summary_stats` | `SummaryStatistics` | Input summary statistics dataset | required |
| `window_length` | `int` | Window size in bp, used for distance-based clumping. | required |
| `p_value_significance` | `float` | GWAS significance threshold used to filter peaks. | `5e-08` |
| `p_value_baseline` | `float` | Least significant threshold; below this, all SNPs are dropped. | `0.05` |
| `locus_window_length` | `int \| None` | The distance for collecting locus around the semi indices. | `None` |
Returns:

| Name | Type | Description |
| --- | --- | --- |
| `StudyLocus` | `StudyLocus` | StudyLocus after clumping with information about the `locus` |
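A minimal end-to-end usage sketch follows. The import path for `Session` and the parquet locations are placeholders/assumptions, not values taken from this repository's configs:

```python
from otg.common.session import Session  # import path assumed
from otg.dataset.summary_statistics import SummaryStatistics  # import path assumed
from otg.method.window_based_clumping import WindowBasedClumping

session = Session()

# Any parquet dataset with the SummaryStatistics schema; the path is a placeholder:
sumstats = SummaryStatistics.from_parquet(session, "gs://my-bucket/sumstats")

clumped = WindowBasedClumping.clump_with_locus(
    sumstats,
    window_length=500_000,      # 500 kb clumping window
    p_value_significance=5e-8,  # genome-wide significance for the peaks
    p_value_baseline=0.05,      # tag variants below this are dropped
)

# Placeholder output location:
clumped.df.write.mode("overwrite").parquet("gs://my-bucket/clumped_study_locus")
```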
Source code in src/otg/method/window_based_clumping.py
This workflow runs colocalization analyses that assess the degree to which independent association signals share the same causal variant in a region of the genome, typically delimited by linkage disequilibrium (LD).
Source code in src/otg/colocalisation.py
```python
@dataclass
class ColocalisationStep(ColocalisationStepConfig):
    """Colocalisation step.

    This workflow runs colocalization analyses that assess the degree to which independent association signals share the same causal variant in a region of the genome, typically delimited by linkage disequilibrium (LD).
    """

    session: Session = Session()

    def run(self: ColocalisationStep) -> None:
        """Run colocalisation step."""
        # Study-locus information
        sl = StudyLocus.from_parquet(self.session, self.study_locus_path)
        si = StudyIndex.from_parquet(self.session, self.study_index_path)

        # Study-locus overlaps for 95% credible sets
        sl_overlaps = sl.credible_set(CredibleInterval.IS95).overlaps(si)

        coloc_results = Coloc.colocalise(
            sl_overlaps, self.priorc1, self.priorc2, self.priorc12
        )
        ecaviar_results = ECaviar.colocalise(sl_overlaps)

        # Combine COLOC and eCAVIAR results before writing:
        all_results = coloc_results.df.unionByName(
            ecaviar_results.df, allowMissingColumns=True
        )

        all_results.write.mode(self.session.write_mode).parquet(self.coloc_path)
```
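A direct invocation sketch (hypothetical: in this repository the steps are normally launched on Dataproc via the workflow DAG, and the paths below are placeholders):

```python
from otg.colocalisation import ColocalisationStep  # module per the `_target_` below

step = ColocalisationStep(
    study_locus_path="gs://my-bucket/study_locus",  # placeholder
    study_index_path="gs://my-bucket/study_index",  # placeholder
    coloc_path="gs://my-bucket/colocalisation",     # placeholder
)
step.run()
```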
Colocalisation step requirements.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `study_locus_path` | `DictConfig` | Input Study-locus path. |
| `coloc_path` | `DictConfig` | Output Colocalisation path. |
| `priorc1` | `float` | Prior on variant being causal for trait 1. |
| `priorc2` | `float` | Prior on variant being causal for trait 2. |
| `priorc12` | `float` | Prior on variant being causal for traits 1 and 2. |
Source code in src/otg/config.py
```python
@dataclass
class ColocalisationStepConfig:
    """Colocalisation step requirements.

    Attributes:
        study_locus_path (DictConfig): Input Study-locus path.
        coloc_path (DictConfig): Output Colocalisation path.
        priorc1 (float): Prior on variant being causal for trait 1.
        priorc2 (float): Prior on variant being causal for trait 2.
        priorc12 (float): Prior on variant being causal for traits 1 and 2.
    """

    _target_: str = "otg.colocalisation.ColocalisationStep"
    study_locus_path: str = MISSING
    study_index_path: str = MISSING
    coloc_path: str = MISSING
    priorc1: float = 1e-4
    priorc2: float = 1e-4
    priorc12: float = 1e-5
```
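Because the config carries a `_target_`, the step can also be built with Hydra's `instantiate`. Whether the repository wires Hydra exactly this way is not shown here, so treat this as an illustrative sketch with placeholder paths:

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.structured(ColocalisationStepConfig)
cfg.study_locus_path = "gs://my-bucket/study_locus"  # placeholder for a MISSING field
cfg.study_index_path = "gs://my-bucket/study_index"  # placeholder for a MISSING field
cfg.coloc_path = "gs://my-bucket/colocalisation"     # placeholder for a MISSING field

step = instantiate(cfg)  # resolves _target_ to otg.colocalisation.ColocalisationStep
step.run()
```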
The variant annotation step produces a dataset of the type `VariantAnnotation`, derived from gnomAD's `gnomad.genomes.vX.X.X.sites.ht` Hail table. This dataset is used to validate variants and as a source of annotation.
Source code in src/otg/variant_annotation.py
```python
@dataclass
class VariantAnnotationStep(VariantAnnotationStepConfig):
    """Variant annotation step.

    The variant annotation step produces a dataset of the type `VariantAnnotation`, derived from gnomAD's `gnomad.genomes.vX.X.X.sites.ht` Hail table. This dataset is used to validate variants and as a source of annotation.
    """

    session: Session = Session()

    def run(self: VariantAnnotationStep) -> None:
        """Run variant annotation step."""
        # Initialise a Hail session on top of the existing Spark context:
        hl.init(sc=self.session.spark.sparkContext, log="/dev/null")

        variant_annotation = GnomADVariants.as_variant_annotation(
            self.gnomad_genomes,
            self.chain_38_to_37,
            self.populations,
        )
        # Writing data partitioned by chromosome and sorted by position:
        (
            variant_annotation.df.repartition(400, "chromosome")
            .sortWithinPartitions("chromosome", "position")
            .write.partitionBy("chromosome")
            .mode(self.session.write_mode)
            .parquet(self.variant_annotation_path)
        )
```
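Partitioning the output by chromosome pays off at read time: a filter on the partition column prunes whole directories. A small sketch, assuming an active `spark` session and the placeholder path used above:

```python
# Assumes an active SparkSession named `spark`; the path is a placeholder:
va = spark.read.parquet("gs://my-bucket/variant_annotation")

# Only the chromosome=22 partition directory is scanned (partition pruning):
chr22 = va.filter(va.chromosome == "22")
print(chr22.count())
```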
Using a VariantAnnotation dataset as a reference, this step creates and writes a dataset of the type VariantIndex that includes only variants with disease-association data, carrying a reduced set of annotations.
Source code in src/otg/variant_index.py
```python
@dataclass
class VariantIndexStep(VariantIndexStepConfig):
    """Variant index step.

    Using a `VariantAnnotation` dataset as a reference, this step creates and writes a dataset of the type `VariantIndex` that includes only variants with disease-association data, carrying a reduced set of annotations.
    """

    session: Session = Session()

    def run(self: VariantIndexStep) -> None:
        """Run variant index step to keep only variants in study-locus sets."""
        # Extract
        va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)
        study_locus = StudyLocus.from_parquet(
            self.session, self.study_locus_path, recursiveFileLookup=True
        )

        # Transform
        vi = VariantIndex.from_variant_annotation(va, study_locus)

        # Load
        self.session.logger.info(f"Writing variant index to: {self.variant_index_path}")
        (
            vi.df.write.partitionBy("chromosome")
            .mode(self.session.write_mode)
            .parquet(self.variant_index_path)
        )
```
This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include (see the merge sketch after this list):

1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).
2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.
3. Distance between the variant and each gene's canonical transcription start site (TSS).
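The merge of these sources (in `run` below) relies on `unionByName` with `allowMissingColumns=True`, which stacks DataFrames with different column sets and null-fills the gaps. A toy sketch with made-up schemas, not the pipeline's actual ones:

```python
from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two made-up evidence sources sharing only variantId/geneId:
distance = spark.createDataFrame(
    [("v1", "g1", 1200)], ["variantId", "geneId", "distance"]
)
vep = spark.createDataFrame(
    [("v1", "g2", "missense_variant")],
    ["variantId", "geneId", "mostSevereConsequence"],
)

# Columns missing on either side are filled with nulls instead of raising an error:
merged = reduce(
    lambda x, y: x.unionByName(y, allowMissingColumns=True), [distance, vep]
)
merged.show()
# Roughly: two rows, with `distance` null for the VEP row and
# `mostSevereConsequence` null for the distance row.
```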
Source code in src/otg/v2g.py
```python
@dataclass
class V2GStep(V2GStepConfig):
    """Variant-to-gene (V2G) step.

    This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:

    1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).
    2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.
    3. Distance between the variant and each gene's canonical transcription start site (TSS).
    """

    session: Session = Session()

    def run(self: V2GStep) -> None:
        """Run V2G dataset generation."""
        # Filter gene index by approved biotypes to define the V2G gene universe:
        gene_index_filtered = GeneIndex.from_parquet(
            self.session, self.gene_index_path
        ).filter_by_biotypes(self.approved_biotypes)

        vi = VariantIndex.from_parquet(self.session, self.variant_index_path).persist()
        va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)
        vep_consequences = self.session.spark.read.csv(
            self.vep_consequences_path, sep="\t", header=True
        )

        # Variant annotation reduced to the variant index to define the V2G variant universe:
        va_slimmed = va.filter_by_variant_df(vi.df, ["id", "chromosome"]).persist()

        # Lift over interval coordinates to hg38:
        lift = LiftOverSpark(
            self.liftover_chain_file_path, self.liftover_max_length_difference
        )

        # Assemble the individual V2G evidence sources:
        v2g_datasets = [
            va_slimmed.get_distance_to_tss(gene_index_filtered, self.max_distance),
            # variant effects
            va_slimmed.get_most_severe_vep_v2g(vep_consequences, gene_index_filtered),
            va_slimmed.get_polyphen_v2g(gene_index_filtered),
            va_slimmed.get_sift_v2g(gene_index_filtered),
            va_slimmed.get_plof_v2g(gene_index_filtered),
            # intervals
            IntervalsAndersson.parse(
                IntervalsAndersson.read_andersson(self.session, self.anderson_path),
                gene_index_filtered,
                lift,
            ).v2g(vi),
            IntervalsJavierre.parse(
                IntervalsJavierre.read_javierre(self.session, self.javierre_path),
                gene_index_filtered,
                lift,
            ).v2g(vi),
            IntervalsJung.parse(
                IntervalsJung.read_jung(self.session, self.jung_path),
                gene_index_filtered,
                lift,
            ).v2g(vi),
            IntervalsThurman.parse(
                IntervalsThurman.read_thurman(self.session, self.thurman_path),
                gene_index_filtered,
                lift,
            ).v2g(vi),
        ]

        # Merge all V2G datasets:
        v2g = V2G(
            _df=reduce(
                lambda x, y: x.unionByName(y, allowMissingColumns=True),
                [dataset.df for dataset in v2g_datasets],
            ).repartition("chromosome")
        )
        # Write the V2G dataset:
        (
            v2g.df.write.partitionBy("chromosome")
            .mode(self.session.write_mode)
            .parquet(self.v2g_path)
        )
```
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 79609f04f..6474258b9 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ