1 genome long series of overlapping overlaps --> nonsensical translation table. #54

Open
Sebastien-Raguideau opened this issue Jan 12, 2024 · 4 comments

Comments

@Sebastien-Raguideau

Hello,

I resumed working on Hifiasm assembly graphs and I come to you with new issues.

The graph I am looking at possesses a stretch of uninterrupted overlapping overlaps spanning almost a full genome. It is unclear whether that causes the issue I'm observing, but 10000 of 86000 nodes have an erroneous translation (the origin unitig sequence doesn't match). Some of these are clearly nonsensical:
98125999 s2.utg080687l[18446744073709541131:18446744073709541132]-,s2.utg112978l[18446744073709541131:18446744073709541132]-,s2.utg070338l[18446744073709541383:18446744073709541384]-
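A note on the coordinates above: they sit just below 2^64, which is what a small negative offset looks like when it is stored in an unsigned 64-bit integer. That reading is an inference from the numbers alone, not something confirmed in this thread. A minimal sketch for flagging such entries follows; the line layout (`node_id`, whitespace, then comma-separated `name[start:stop]strand` entries) is assumed from the single example above, and `suspicious_entries` is a hypothetical helper, not part of GetBlunted.

```python
import re

UINT64_MOD = 1 << 64
# Entry layout inferred from the one example line in this issue (an assumption).
ENTRY_RE = re.compile(r"^(?P<name>.+)\[(?P<start>\d+):(?P<stop>\d+)\](?P<strand>[+-])$")


def as_signed(value):
    """Reinterpret an unsigned 64-bit value as a signed two's-complement value."""
    return value - UINT64_MOD if value >= (1 << 63) else value


def suspicious_entries(line):
    """Return (node_id, [(name, signed_start, signed_stop), ...]) for wrapped coordinates."""
    node_id, entries = line.split(None, 1)
    flagged = []
    for entry in entries.strip().split(","):
        match = ENTRY_RE.match(entry)
        if match is None:
            continue  # Unknown layout: skip rather than guess.
        start = as_signed(int(match["start"]))
        stop = as_signed(int(match["stop"]))
        if start < 0 or stop < 0:
            flagged.append((match["name"], start, stop))
    return node_id, flagged


line = ("98125999 s2.utg080687l[18446744073709541131:18446744073709541132]-,"
        "s2.utg112978l[18446744073709541131:18446744073709541132]-,"
        "s2.utg070338l[18446744073709541383:18446744073709541384]-")
node_id, flagged = suspicious_entries(line)
for name, start, stop in flagged:
    print(node_id, name, start, stop)
```

Read this way, the three entries would correspond to offsets of about -10485 and -10233, which may help narrow down where an index goes negative during translation.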

Here is a link to the initial .gfa + output of GetBlunted. I also ran the validate_bluntification.py script and put the output there.

I compiled the current version from the GitHub repository, and git status says I'm up to date, but the GetBlunted help shows v0.0.3. I'm wondering whether that is just a typo or whether I should use the release from over a year ago.

Hopefully this graph is not unbluntifiable :)

Best,
Seb

@rlorigro
Collaborator

Hi Seb, it will unfortunately be some time before I can get to this. The translation step was written as sort of an afterthought and is not as thoroughly tested as the bluntification steps. Do you think the graph itself is incorrect?

@rlorigro
Collaborator

As a way to investigate, you could use the extract_subgraph executable to fetch a small radius around the malformed translations and then bluntify locally.

@rlorigro
Collaborator

Also, I forget if we discussed this in one of our previous issues, but hifiasm has made-up overlaps that are simply xM where x is the length of one of the reads that was involved in the overlap step of assembly. It's not conforming to the GFA spec.

So even if we realign these nodes to each other, there is no guarantee that the alignment is meaningful because we don't know where the true overlap starts/ends for the participating nodes. This means we could leave out or bring in extra bases that should/shouldn't be involved.

The proper way to address this would involve something more along the lines of mapping, not simple POA anymore, at which point we are basically recapitulating the process of assembly.
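To get a rough idea of how many links in a graph carry this kind of overlap, one could scan the GFA for overlaps that are a single `xM` run or that are longer than one of the segments they join. The sketch below is only an illustration under stated assumptions: GFA1 tab-separated `S` and `L` lines with inline sequences, and the convention that `M/=/X/D/N` consume the first segment while `M/=/X/I/S` consume the second. `check_links` is a hypothetical helper, not a GetBlunted function.

```python
import re

CIGAR_OP_RE = re.compile(r"(\d+)([MIDNSHPX=])")
PURE_MATCH_RE = re.compile(r"^\d+M$")


def cigar_spans(cigar):
    """Return (bases consumed on the 'from' segment, bases consumed on the 'to' segment)."""
    from_len = to_len = 0
    for count, op in CIGAR_OP_RE.findall(cigar):
        n = int(count)
        if op in "M=XDN":
            from_len += n
        if op in "M=XIS":
            to_len += n
    return from_len, to_len


def check_links(gfa_lines):
    """Yield (from, to, cigar, reason) for links whose overlap looks unreliable."""
    lengths = {}
    links = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            lengths[fields[1]] = len(fields[2])  # assumes the sequence is inline, not "*"
        elif fields[0] == "L" and len(fields) >= 6:
            links.append((fields[1], fields[3], fields[5]))
    for src, dst, cigar in links:
        from_len, to_len = cigar_spans(cigar)
        if from_len > lengths.get(src, 0) or to_len > lengths.get(dst, 0):
            yield src, dst, cigar, "overlap longer than a segment"
        elif PURE_MATCH_RE.match(cigar):
            yield src, dst, cigar, "pure xM overlap, alignment not verified"


gfa = [
    "S\ta\tACGTACGT\n",
    "S\tb\tACGTTT\n",
    "S\tc\tGGTT\n",
    "L\ta\t+\tb\t+\t4M\n",
    "L\tb\t+\tc\t+\t6M\n",
    "L\ta\t+\tc\t+\t2M1D1M\n",
]
report = list(check_links(gfa))
for row in report:
    print(*row)
```

A pure `xM` overlap is not wrong by itself; the flag only marks links where the CIGAR gives no evidence that an alignment was actually computed.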

@Sebastien-Raguideau
Author

Hi Ryan,

I am not sure the graph is incorrect. It doesn't look bad, but if the translation table is wrong, what else could be wrong? I didn't take any steps to check that and have stopped looking at this for the moment.

I do need the translation table, as I am not fully convinced by some of the linear subgraphs made of strings of 1-nucleotide nodes which, when you check them, are actually generated by only a few unitigs from the initial .gfa. From there, I started to think about what the ideal graph would be for me, and I think what I want is a graph which retains the same information while maximising the minimum node size. That implies mostly duplicating everything and losing the information about which bits are shared between parallel paths.
I can do that from the GetBlunted output, but I need the translation table.

Yes, I remember the issues with hifiasm. I think you discussed them with somebody else and I read that, which is where I started from: I wrote a small pipeline to re-align the overlaps between nodes and make sure the gfa is valid. One of the reasons for the strings of 1-nucleotide nodes was a strange overlap pattern where, for instance, the overlap between 2 nodes does not start directly at the edge between them but hundreds of nucleotides within. That can be encoded with a valid cigar, but it caused issues. In this example I didn't see any such case, but instead just some mismatches in the middle of the overlap, which can be encoded into a valid cigar and which are translated into a string of 1-nucleotide nodes in the output of GetBlunted.
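For readers who want to see what "mismatches in the middle of the overlap" look like against a CIGAR, here is a minimal sketch that walks a link's CIGAR over the suffix of the first node and the prefix of the second and counts mismatching columns. It is not the pipeline mentioned above; `overlap_mismatches` is a hypothetical helper, and it assumes the overlap is anchored at the very end of the first node and the very start of the second, with only `M`, `=`, `X`, `I` and `D` operations.

```python
import re

CIGAR_OP_RE = re.compile(r"(\d+)([MIDX=])")
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_mismatches(from_seq, from_orient, to_seq, to_orient, cigar):
    """Count mismatching columns when the suffix of `from` is aligned to the prefix of `to`."""
    a = from_seq if from_orient == "+" else revcomp(from_seq)
    b = to_seq if to_orient == "+" else revcomp(to_seq)
    ops = [(int(n), op) for n, op in CIGAR_OP_RE.findall(cigar)]
    from_span = sum(n for n, op in ops if op in "M=XD")
    to_span = sum(n for n, op in ops if op in "M=XI")
    if from_span > len(a) or to_span > len(b):
        raise ValueError("overlap is longer than one of the segments")
    i = len(a) - from_span  # overlap start on the oriented `from` sequence
    j = 0                   # overlap start on the oriented `to` sequence
    mismatches = 0
    for n, op in ops:
        if op in "M=X":
            mismatches += sum(x != y for x, y in zip(a[i:i + n], b[j:j + n]))
            i += n
            j += n
        elif op == "D":
            i += n  # bases present only in `from`
        elif op == "I":
            j += n  # bases present only in `to`
    return mismatches


print(overlap_mismatches("ACGTACGT", "+", "ACGTTT", "+", "4M"))
print(overlap_mismatches("ACGTACGT", "+", "ACCTTT", "+", "4M"))
```

Since `M` in a CIGAR covers both matches and mismatches, a non-zero count is still a valid overlap; it simply marks the columns that a bluntified graph has to represent as alternative single-base nodes.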

Anyhow, thanks for the answer; whenever you can get to it will be fine.

Best,
Seb
