Add --maintain-topo and --no-simplify options to grg convert #14

Merged: 1 commit into main from ts_maintain_topo, Sep 19, 2024

@dcdehaas dcdehaas commented Sep 19, 2024

Conversion from tree sequence (TS) to GRG previously ignored topology changes below a mutation if those changes did not affect the sample set of the mutation. This happens when the break point and the join point of a recombination both occur entirely below a mutation.

You can now run `grg convert --maintain-topo` to retain such topology changes instead. The algorithm change is simple: instead of traversing from added/removed edges only up to their MRCA, the conversion now traverses all the way to the root(s). The nodes along the way are marked to be split the next time they are encountered by a mutation.
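The difference between the two traversals can be illustrated with a minimal Python sketch. This is not the grg implementation: the single child-to-parent map `PARENT`, the helper names `ancestors_to_root`, `mark_to_mrca`, and `mark_to_root`, and the choice to leave the MRCA itself unmarked are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional, Set

# Hypothetical toy tree (child -> parent). "R" is the root.
PARENT: Dict[str, str] = {
    "X": "R", "Y": "R",
    "A": "X", "B": "X",
    "s1": "A", "s2": "A", "s3": "B",
}


def ancestors_to_root(node: Optional[str], parent: Dict[str, str]) -> Iterator[str]:
    """Yield `node` and then each of its ancestors, ending at the root."""
    while node is not None:
        yield node
        node = parent.get(node)  # None once we step past the root


def mark_to_mrca(a: str, b: str, parent: Dict[str, str]) -> Set[str]:
    """Mark nodes from `a` and `b` upward, stopping below their MRCA.

    Assumption: the MRCA itself is left unmarked.
    """
    ancestors_of_b = set(ancestors_to_root(b, parent))
    mrca = next((n for n in ancestors_to_root(a, parent) if n in ancestors_of_b), None)
    marked: Set[str] = set()
    for start in (a, b):
        for n in ancestors_to_root(start, parent):
            if n == mrca:
                break
            marked.add(n)
    return marked


def mark_to_root(changed: Iterable[str], parent: Dict[str, str]) -> Set[str]:
    """Mark every node from each changed node all the way up to the root."""
    marked: Set[str] = set()
    for start in changed:
        for n in ancestors_to_root(start, parent):
            if n in marked:
                break  # everything above n was marked by an earlier walk
            marked.add(n)
    return marked


if __name__ == "__main__":
    # A topology change touching A and B sits entirely below X.
    print(sorted(mark_to_mrca("A", "B", PARENT)))
    print(sorted(mark_to_root(["A", "B"], PARENT)))
```

In this toy tree, a mutation sitting on `X` or `R` lies above the change. The walk that stops at the MRCA never marks those nodes, while the walk to the root does, which is the kind of node the PR describes as being split on the next mutation that reaches it.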

On a reasonably large Tree-Seq file (500k samples, 100Mbp long), there is no noticeable timing difference between the two versions of the algorithm.

GRG file size is essentially unchanged as well; the node and edge counts show only a small difference:

  Original algorithm == Nodes: 3998656, Edges: 13190551
  Maintain topo == Nodes: 4011727, Edges: 13245575

Both files are 140Mb.

Additionally, I exposed the already existing `--no-simplify` option to `grg convert`. When you turn off simplification, you do see a size difference:

  Maintain topo + no simplify == Nodes: 10648753, Edges: 19297506

File size = 201Mb vs. 140Mb
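To put the reported counts side by side, here is a small Python sketch that computes the relative changes. The numbers are copied from this description; the dictionary layout and the helper name `ratio` are just illustrative choices.

```python
# Counts and file sizes as reported in this PR description.
RUNS = {
    "original": {"nodes": 3998656, "edges": 13190551, "size_mb": 140},
    "maintain_topo": {"nodes": 4011727, "edges": 13245575, "size_mb": 140},
    "maintain_topo_no_simplify": {"nodes": 10648753, "edges": 19297506, "size_mb": 201},
}


def ratio(after: str, before: str, key: str) -> float:
    """Return RUNS[after][key] divided by RUNS[before][key]."""
    return RUNS[after][key] / RUNS[before][key]


if __name__ == "__main__":
    for key in ("nodes", "edges", "size_mb"):
        topo = ratio("maintain_topo", "original", key)
        raw = ratio("maintain_topo_no_simplify", "maintain_topo", key)
        print(f"{key}: maintain-topo vs original x{topo:.4f}, "
              f"no-simplify vs maintain-topo x{raw:.4f}")
```

With these figures, `--maintain-topo` adds well under 1% to both nodes and edges, while turning off simplification multiplies the node count by roughly 2.65 and the edge count by roughly 1.46.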

@dcdehaas dcdehaas merged commit 7528bfc into main Sep 19, 2024
3 checks passed
@dcdehaas dcdehaas deleted the ts_maintain_topo branch September 19, 2024 17:51