1 genome long series of overlapping overlaps --> nonsensical translation table. #54

Sebastien-Raguideau · 2024-01-12T18:58:28Z

Hello,

I resumed working on Hifiasm assembly graphs and I come to you with new issues.

The graph I am looking at has a stretch of uninterrupted overlapping overlaps spanning almost a full genome. It is unclear whether that causes the issue I'm observing, but 10000 of 86000 nodes have an erroneous translation (the origin unitig sequence doesn't match). Some of these are clearly nonsensical:
98125999 s2.utg080687l[18446744073709541131:18446744073709541132]-,s2.utg112978l[18446744073709541131:18446744073709541132]-,s2.utg070338l[18446744073709541383:18446744073709541384]-
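
The coordinates in the entry above sit just below 2^64, which is how a small negative value would appear if it were stored in an unsigned 64-bit integer. Whether that is what actually happens inside GetBlunted is an assumption. A minimal sketch for flagging such entries, assuming each translation line has the form `<node_id> <name>[<start>:<stop>]<strand>,...` as in the example (the helper names are hypothetical and not part of GetBlunted):

```python
import re

# Assumed entry format, taken from the example line above:
#   <name>[<start>:<stop>]<strand>
ENTRY_RE = re.compile(
    r"(?P<name>[^\[,\s]+)\[(?P<start>\d+):(?P<stop>\d+)\](?P<strand>[+-])"
)

UINT64_MOD = 1 << 64
INT64_MAX = (1 << 63) - 1


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as a signed 64-bit value."""
    return value - UINT64_MOD if value > INT64_MAX else value


def parse_translation_line(line):
    """Split one line into (node_id, [(name, start, stop, strand), ...])."""
    parts = line.strip().split(None, 1)
    node_id = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    entries = [
        (m["name"], int(m["start"]), int(m["stop"]), m["strand"])
        for m in ENTRY_RE.finditer(rest)
    ]
    return node_id, entries


def suspicious_entries(line):
    """Return entries whose coordinates are too large for a signed 64-bit
    integer, with the coordinates reinterpreted as signed values."""
    _, entries = parse_translation_line(line)
    return [
        (name, as_signed64(start), as_signed64(stop), strand)
        for name, start, stop, strand in entries
        if start > INT64_MAX or stop > INT64_MAX
    ]


example = (
    "98125999 s2.utg080687l[18446744073709541131:18446744073709541132]-,"
    "s2.utg112978l[18446744073709541131:18446744073709541132]-,"
    "s2.utg070338l[18446744073709541383:18446744073709541384]-"
)
for entry in suspicious_entries(example):
    print(entry)
```

Read as signed values, the first interval becomes -10485:-10484, which would point to a subtraction going below zero somewhere in the coordinate arithmetic rather than to random corruption.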

Here is a link to the initial .gfa + output of GetBlunted. I also ran the validate_bluntification.py script and put the output there.

I compiled the current version from the GitHub repo, and git status says I'm up to date, but the GetBlunted help shows v0.0.3. Is that just a typo, or should I use the release from more than a year ago?

Hopefully this graph is not unbluntifiable :)

Best,
Seb

rlorigro · 2024-01-23T17:54:45Z

Hi Seb, it will unfortunately be some time before I can get to this. The translation step was written as sort of an afterthought and is not as thoroughly tested as the bluntification steps. Do you think the graph itself is incorrect?

rlorigro · 2024-01-23T18:13:33Z

As a way to investigate, you could use the extract_subgraph executable to fetch a small radius around the malformed translations and then bluntify locally.

rlorigro · 2024-01-23T18:35:57Z

Also, I forget if we discussed this in one of our previous issues, but hifiasm has made-up overlaps that are simply xM where x is the length of one of the reads that was involved in the overlap step of assembly. It's not conforming to the GFA spec.

So even if we realign these nodes to each other, there is no guarantee that the alignment is meaningful because we don't know where the true overlap starts/ends for the participating nodes. This means we could leave out or bring in extra bases that should/shouldn't be involved.

The proper way to address this would involve something more along the lines of mapping, not simple POA anymore, at which point we are basically recapitulating the process of assembly.
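
One way to see how widespread these placeholder overlaps are is to list every link whose CIGAR is a single xM run and compare x with the lengths of the two segments. A minimal sketch, assuming a GFA1 file with tab-separated S and L lines. The function names are hypothetical, and a pure xM CIGAR is only a hint, not proof, that the overlap was made up:

```python
import re

# A CIGAR consisting of exactly one match run, e.g. "1500M".
PURE_MATCH_RE = re.compile(r"^(\d+)M$")


def segment_lengths(gfa_lines):
    """Map segment name -> length, from the sequence or the LN:i: tag."""
    lengths = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue
        name, seq = fields[1], fields[2]
        if seq != "*":
            lengths[name] = len(seq)
            continue
        # Sequence omitted: fall back to the optional LN tag if present.
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                lengths[name] = int(tag[5:])
    return lengths


def pure_match_links(gfa_lines):
    """Return (from, to, overlap_len, fits) for links with a single-xM CIGAR.

    `fits` is False when the claimed overlap is longer than either segment,
    or when a segment length is unknown.
    """
    gfa_lines = list(gfa_lines)
    lengths = segment_lengths(gfa_lines)
    flagged = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "L" or len(fields) < 6:
            continue
        src, dst, cigar = fields[1], fields[3], fields[5]
        match = PURE_MATCH_RE.match(cigar)
        if match is None:
            continue
        overlap = int(match.group(1))
        src_len, dst_len = lengths.get(src), lengths.get(dst)
        fits = (
            src_len is not None
            and dst_len is not None
            and overlap <= min(src_len, dst_len)
        )
        flagged.append((src, dst, overlap, fits))
    return flagged


# Toy input, not taken from the graph in this issue.
toy_gfa = [
    "S\ta\tACGTACGT",
    "S\tb\tACGT",
    "L\ta\t+\tb\t+\t6M",
    "L\tb\t+\ta\t-\t2M1I1M",
]
for link in pure_match_links(toy_gfa):
    print(link)
```

Links reported with `fits` set to False cannot be literal suffix/prefix overlaps, so they are the first candidates to realign or inspect by hand.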

Sebastien-Raguideau · 2024-01-25T17:11:37Z

Hi Ryan,

I am not sure the graph is incorrect; it doesn't look bad, but if the translation table is wrong, what else could be wrong? I haven't taken any steps to check that, and I have stopped looking at this for the moment.

I do need the translation table, as I am not fully convinced by some of the linear subgraphs made of strings of 1-nucleotide nodes, which, when you check them, are actually generated by only a few unitigs from the initial .gfa. From there, I started thinking about what the ideal graph would be for me, and I think what I want is a graph that retains the same information while maximising the minimum node size. That implies mostly duplicating everything and losing the information about which bits are shared between parallel paths.
I can do that from the GetBlunted output, but I need the translation table.

Yes, I remember the issues with hifiasm. I think you discussed them with somebody else and I read that thread, which is where I started from: I wrote a small pipeline to realign the overlaps between nodes and make sure the gfa is valid. One of the reasons for the string-of-1-nucleotide issue was a strange overlap pattern where, for instance, the overlap between 2 nodes does not start directly at the edge between them but hundreds of nucleotides in. That can be encoded with a valid cigar, but it caused issues. In this example I didn't see any such case, just some mismatches in the middle of the overlap, which can be encoded in a valid cigar and which GetBlunted translates into a subgraph of strings of 1-nucleotide nodes.
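
To locate mismatches in the middle of an overlap, one can walk the CIGAR over the suffix of the first node and the prefix of the second. This is a minimal sketch, not the pipeline mentioned above. It assumes both sequences are already in their link orientation, that the first node plays the reference role and the second the query role (a SAM-like convention chosen here for illustration), and that only the M, I, D, = and X operations occur. The function names are hypothetical:

```python
import re

CIGAR_RE = re.compile(r"(?:\d+[MID=X])+")
CIGAR_OP_RE = re.compile(r"(\d+)([MID=X])")


def parse_cigar(cigar):
    """Return [(length, op), ...], rejecting anything but M, I, D, =, X."""
    if CIGAR_RE.fullmatch(cigar) is None:
        raise ValueError(f"unsupported CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP_RE.findall(cigar)]


def consumed_lengths(ops):
    """Bases consumed on the first (reference) and second (query) node."""
    ref_len = sum(n for n, op in ops if op in "MD=X")
    qry_len = sum(n for n, op in ops if op in "MI=X")
    return ref_len, qry_len


def overlap_mismatches(first, second, cigar):
    """Offsets within the overlapping suffix of `first` where an aligned
    column of the CIGAR pairs two different bases."""
    ops = parse_cigar(cigar)
    ref_len, qry_len = consumed_lengths(ops)
    if ref_len > len(first) or qry_len > len(second):
        raise ValueError("CIGAR is longer than one of the sequences")
    ref = first[len(first) - ref_len:]   # suffix of the first node
    qry = second[:qry_len]               # prefix of the second node
    i = j = 0
    mismatches = []
    for n, op in ops:
        if op in "M=X":
            # Aligned columns: compare base by base.
            for k in range(n):
                if ref[i + k] != qry[j + k]:
                    mismatches.append(i + k)
            i += n
            j += n
        elif op == "D":
            i += n   # bases present only in the first node
        else:  # op == "I"
            j += n   # bases present only in the second node
    return mismatches


# Toy sequences, not taken from the graph in this issue.
print(overlap_mismatches("AAACGTT", "CGATGG", "4M"))
```

An empty list means every aligned column agrees. A non-empty list gives the positions that would end up as alternative single-base nodes once the overlap is collapsed.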

Anyhow, thanks for the answer, and whenever you can look into it would be nice.

Best,
Seb
