Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: drop v2g and reimplement distance features #771

Merged
merged 61 commits into from
Oct 1, 2024
Merged

Conversation

ireneisdoomed
Copy link
Contributor

@ireneisdoomed ireneisdoomed commented Sep 18, 2024

✨ Context

This PR closes 3434 and 3258

The definition of the features have been implemented as agreed with @addramir and @xyg123*. The spreadsheet that documents them has been updated accordingly https://docs.google.com/spreadsheets/d/1wUs1AprRCCGItZmgDhc1fF5BtwCSosdzFv4NQ8V6Dtg/edit?gid=452826388#gid=452826388

🛠 What does this PR implement

  • Removal of V2G as a step and a dataset.
  • We still have the intervals dataset, but we are not dumping it (will be processed by L2G)
  • Generation of distance features based on the variant index, not V2G
  • Generation of negative gold standard based on the variant index, not V2G
  • Addition of distance to footprint features
  • Addition of all features in the neighbourhood
  • Testing

🙈 Missing

  • Discuss the filtering on protein coding genes when calculating the neighbourhood features. To be implemented in another PR @addramir

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

ireneisdoomed and others added 30 commits September 3, 2024 12:28
@ireneisdoomed ireneisdoomed marked this pull request as ready for review September 18, 2024 17:04
@d0choa
Copy link
Collaborator

d0choa commented Sep 19, 2024

The new features are definitely a step forward to make it more modular and easy to interpret.

.groupBy("studyLocusId", "geneId")
.agg(agg_expr.alias(feature_name))
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using weighted distances as the feature now? instead of the normalised distance score between 0-1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to weight them earlier as well. The feature itself is not normalised until later.

Copy link
Contributor

@xyg123 xyg123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really nice, good bye v2g, just need to make sure we are happy with the weighted distance features. x

Base automatically changed from il-3252 to dev September 23, 2024 12:30
@@ -74,32 +71,3 @@ def from_source(
source_class = source_to_class[source_name]
data = source_class.read(spark, source_path) # type: ignore
return source_class.parse(data, gene_index, lift) # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, what is the plan with the interval dataclass? Do we plan to have a separate DAG to process these and save them somewhere?

Copy link
Contributor

@xyg123 xyg123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good!

@ireneisdoomed ireneisdoomed merged commit b7ccfae into dev Oct 1, 2024
5 checks passed
@ireneisdoomed ireneisdoomed deleted the il-3434 branch October 1, 2024 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drop V2G step
3 participants