
feat: vep annotation (dockerised + google batch + airflow) #608

Merged: 24 commits merged into dev on Jul 4, 2024

Conversation

@d0choa (Collaborator) commented May 15, 2024

✨ Context

We need a mechanism to retrieve Variant Annotation from variants outside GnomAD. After some consideration, we concluded Ensembl VEP is the best stable source to retrieve the required annotations.

🛠 What does this PR implement

A mechanism to run Ensembl VEP for a given set of variants. WIP (more details soon)

🙈 Missing

WIP

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@DSuveges DSuveges linked an issue Jun 5, 2024 that may be closed by this pull request
@DSuveges DSuveges marked this pull request as ready for review June 25, 2024 16:02
@ireneisdoomed (Contributor) left a comment:

Thank you so much! The step works as expected (job). Just take a look at my comments about moving these functions into the utils script.

@@ -44,7 +44,11 @@ create-dev-cluster: build ## Spin up a simple dataproc cluster with all dependen
--master-machine-type n1-standard-16 \
--initialization-actions=gs://genetics_etl_python_playground/initialisation/${VERSION_NO}/install_dependencies_on_cluster.sh \
--metadata="PACKAGE=gs://genetics_etl_python_playground/initialisation/${VERSION_NO}/gentropy-${VERSION_NO}-py3-none-any.whl,CONFIGTAR=gs://genetics_etl_python_playground/initialisation/${VERSION_NO}/config.tar.gz" \
--single-node \
--num-workers 4 \
Contributor:
I wouldn't set these specs by default; I suggest reverting these changes.

import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any, List
Contributor:
Python 3.10 supports the built-in list as a type hint (PEP 585), so typing.List is unnecessary.

Suggested change
from typing import Any, List
from typing import Any
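As a brief illustration of this suggestion: since PEP 585 (Python 3.9+), built-in collection types can be used directly as generics, so the typing aliases are no longer needed. A minimal sketch (the helper name and arguments here are hypothetical, not taken from the PR):

```python
# PEP 585 (Python >= 3.9): built-in types such as list and dict can be
# parameterised directly, so `from typing import List` is unnecessary.
def create_commands(script: str, args: list[str]) -> list[str]:
    """Hypothetical helper: build a command list for a container runnable."""
    return [script, *args]

print(create_commands("run_vep.sh", ["--offline", "--cache"]))
# → ['run_vep.sh', '--offline', '--cache']
```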

return runnable


def create_task_spec(image: str, commands: List[str]) -> batch_v1.TaskSpec:
Contributor:
This function would be useful inside common_airflow.py; I've used it in my branch to submit Batch jobs. I suggest moving it there.

return runnable


def create_task_spec(image: str, commands: List[str]) -> batch_v1.TaskSpec:
Contributor:
Suggested change
def create_task_spec(image: str, commands: List[str]) -> batch_v1.TaskSpec:
def create_task_spec(image: str, commands: list[str]) -> batch_v1.TaskSpec:

return task


def set_up_mouting_points(
Contributor:
This function would be useful inside common_airflow.py; I've used it in my branch to submit Batch jobs. I suggest moving it there.


Args:
image (str): The Docker image to use.
commands (List[str]): The commands to run in the container.
Contributor:
Suggested change
commands (List[str]): The commands to run in the container.
commands (list[str]): The commands to run in the container.

return volumes


def create_batch_job(
Contributor:
This function would be useful inside common_airflow.py; I've used it in my branch to submit Batch jobs. I suggest moving it there.

return list(self.path_dictionary.values())


def create_container_runnable(image: str, commands: List[str]) -> batch_v1.Runnable:
Contributor:
This function would be useful inside common_airflow.py; I've used it in my branch to submit Batch jobs. I suggest moving it there.

Contributor:
I've refactored this function to be able to pass params to the container (to run the gentropy step with the HYDRA_FULL_ERROR=1 env variable.
I suggest the following refactoring:

def create_container_runnable(
    image: str, commands: list[str], **kwargs: Any
) -> batch_v1.Runnable:
    """Create a container runnable for a Batch job with additional optional parameters.

    Args:
        image (str): The Docker image to use.
        commands (list[str]): The commands to run in the container.
        kwargs (Any): Additional optional parameters to set on the container.

    Returns:
        batch_v1.Runnable: The container runnable.
    """
    container = batch_v1.Runnable.Container(
        image_uri=image, entrypoint="/bin/sh", commands=commands, **kwargs
    )
    return batch_v1.Runnable(container=container)

gcs_volume.gcs = gcs_bucket
gcs_volume.mount_path = mount["mount_point"]
volumes.append(gcs_volume)
return volumes
Contributor:
I don't think I understand the necessity of mounting the volumes in the case of VEP. Could you briefly explain?

Contributor:
VEP is a command-line application that cannot read or write files directly from a Google Cloud Storage location, so the folders containing the cache, input, and output files need to be mounted into the container.
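The mounting described above can be sketched as plain dictionaries whose field names mirror the google.cloud.batch_v1 Volume and GCS messages (bucket names and mount points below are illustrative, not the PR's actual values):

```python
# Sketch of the GCS volume configuration a Google Batch task expects,
# written as plain dicts mirroring the batch_v1.Volume / batch_v1.GCS fields.
# Bucket names and mount points are hypothetical examples.
mounting_points = [
    {"bucket": "vep_cache_bucket", "mount_point": "/opt/vep/.vep"},
    {"bucket": "vep_input_bucket", "mount_point": "/mnt/disks/input"},
    {"bucket": "vep_output_bucket", "mount_point": "/mnt/disks/output"},
]

def build_volume_configs(mounts: list[dict]) -> list[dict]:
    """Translate bucket/mount-point pairs into Batch-style volume configs."""
    return [
        {"gcs": {"remote_path": m["bucket"]}, "mount_path": m["mount_point"]}
        for m in mounts
    ]

volumes = build_volume_configs(mounting_points)
print(volumes[0])
```

Inside the container, VEP then reads and writes through the mount paths (e.g. /opt/vep/.vep for the cache) as if they were local directories.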

@DSuveges DSuveges merged commit 6487b31 into dev Jul 4, 2024
4 checks passed
project-defiant added a commit that referenced this pull request Jul 12, 2024
* feat: custom dockerfile to run ensembl vep

* ci: automate vep image build and artifact registry

* chore: update airflow google operators (not required)

* feat: working version of google batch airflow vep job

* feat: working version of google batch airflow vep job

* feat(VEP): adding CADD plugin

* feat: local loftee file

* feat: working with input bucket full of input files

* feat: prevent writing html

* fix: minor adjustments to retry strategy

* feat(airflow): separating mounting points for input/output and cache

* fix: typo in airflow dag

* fix: pre-commit pain

* chore: rename airflow dag file

---------

Co-authored-by: DSuveges <[email protected]>
Co-authored-by: Szymon Szyszkowski <[email protected]>
@DSuveges DSuveges deleted the dsdo_airflow_batch_vep branch September 18, 2024 21:59
Successfully merging this pull request may close these issues.

Modelling new variant annotation dataset for variant page
4 participants