Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Placements sensitive to outlier sequences? #33

Open
gavinmdouglas opened this issue Dec 16, 2019 · 3 comments
Open

Placements sensitive to outlier sequences? #33

gavinmdouglas opened this issue Dec 16, 2019 · 3 comments
Assignees

Comments

@gavinmdouglas
Copy link

I recently noticed that the placements of all query sequences can be affected by the presence of a small number of outlier sequences when placing onto a large tree. This problem appears to be especially affect the pendant length estimates and less so the edge placements. I noticed this issue when running different subsets of a dataset with PICRUSt2, which wraps EPA-NG.

I've reproduced this for query datasets of 323 sequences into a tree of 20,000 sequences. In this example only one query sequence differs between the test datasets (essentially in one case the query sequence doesn't align the reference).

In the original case the focal query sequence alignment looks like this:

>02905cfb87861c837dde629596d9272b
....-----GGTCTTGACATC--CCTCT-GACGAGTGAGTAATGT-CG--CTT--T-C--CC--T----T----C--GG---------G------G--C-A-G-A-GGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAGTAGCCAGCA----GTAAGA-----TGGGAACTCTAGAGAGACTGCCGGGGATAACCCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCAGGGCTACACACGTGCTACAATGGC-G-T-A-A-ACAGAGGGA-AGCGACCCTGTGAAGGTAAGCAAATCCCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGAATCAGA-ATGTCGCGGTGAATACGTTCCCGG----...

I swapped in random DNA for a different test file (with the same header) and the alignment looks like this (only a single base aligned):

>02905cfb87861c837dde629596d9272b
....--------------T---------------....

You can see in the jplace outputs of running EPA-NG that there are many differences in the placements, particularly in the pendant distances.

I think this issue might be related to issue #29. I didn't expect all placements to be affected by a single weird sequence. I wasn't able to reproduce this issue with the example datasets used in the EPA-NG paper and I'm thinking that maybe this issue only arises with large trees. Any insight would be greatly appreciated!

I ran EPA-NG with this command:

epa-ng --tree pro_ref.tre \
       --ref-msa ref_seqs_hmmalign.fasta \
       --query STUDY_SEQS \
       --chunk-size 5000 \
       -T 20 \
       -m pro_ref.model \
       -w epa_out \
       --filter-acc-lwr 0.99 \
       --filter-max 100

You can see the input and output files attached in this zipfile: placement_test.zip.

STUDY_SEQS corresponds to study_seqs_hmmalign_original.fasta and study_seqs_hmmalign_funky.fasta for the original dataset and the dataset with the outlier sequence, respectively. The output jplace files are named the same way.

Thanks,

Gavin

@pierrebarbera
Copy link
Owner

Hi Gavin,

thank you for finding this! it could very well be related to #29 as you say... though even then it seems strange. My initial thought is there might be some side effects of the optimization functions, though I thought I had properly isolated the relevant code, even to the (slight) detriment of performance.

Early in the new year, when my current paper is finally submitted, I will most likely have another go at epa to solve some issues and include a bunch of useful functionality, such as the issues with memory handling... I'm afraid you will have to stay tuned until then.

Cheers,
Pierre

@pierrebarbera pierrebarbera self-assigned this Dec 17, 2019
@gavinmdouglas
Copy link
Author

Thanks for looking into it Pierre! I should have mentioned that I could reproduce this problem with both EPA-ng v0.3.5 and v0.3.6.

Best,

Gavin

@brendanf
Copy link

brendanf commented Dec 6, 2023

I am confused... My impression was that the placement of each query sequence by EPA-ng was completely independent from all the other query sequences. Is this just a bug which is becomes visible for large trees and/or weird query sequences, or is there actually some kind of intentional information sharing between queries? This is particularly relevant for me because I have been assuming that it makes no difference for the results to split up my queries (which I have millions of) into smaller batches and run epa-ng on them independently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants