Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CRAM writer (alpha) #313

Merged
merged 10 commits into from
Jul 22, 2024
Merged

Add CRAM writer (alpha) #313

merged 10 commits into from
Jul 22, 2024

Conversation

athos
Copy link
Member

@athos athos commented Jul 4, 2024

This PR adds an implementation of the CRAM writer, based on CRAM v3.1.

Like the initial implementation of the CRAM reader, the current CRAM writer is quite naive and supports only a limited set of features. Additionally, its performance has not been tuned yet.

The structure of the CRAM writer mirrors that of the CRAM reader in many ways, making it easy to understand for those familiar with the CRAM reader's implementation.

The biggest difference can be found in cljam.io.cram.writer, which scans and aggregates the records in slices and containers to calculate various header components. Some of these might be moved to separate namespaces in the future.

Currently, the following features are not supported (roughly ordered by priority):

  • Index file generation
  • Block compression
  • Changing the encoding of data series
  • Fine-grained container/slice partitioning
  • Delta encoding of the AP data series
  • Reference embedding
  • Construction of substitution matrices considering SNV frequency
  • Encoding of non-detached mate reads
  • QNAME omission

Also, the PR contains the following commits for miscellaneous fixes within the CRAM reader:

  • 079f699: Renames :external-id to :content-id for consistency with the CRAM writer
  • 380ef64: Adds the missing :bases case for record-cigar
  • ac1fa5e: Trims trailing null characters from the decoded file ID in the CRAM file definition

@athos athos requested review from alumi and a team July 4, 2024 07:43
@athos athos self-assigned this Jul 4, 2024
Copy link

codecov bot commented Jul 4, 2024

Codecov Report

Attention: Patch coverage is 93.96450% with 51 lines in your changes missing coverage. Please review.

Project coverage is 89.28%. Comparing base (3b567e8) to head (800eecf).

Files Patch % Lines
src/cljam/io/cram/writer.clj 90.85% 11 Missing and 4 partials ⚠️
src/cljam/io/cram/encode/structure.clj 93.65% 12 Missing and 1 partial ⚠️
src/cljam/io/cram/data_series.clj 95.25% 2 Missing and 9 partials ⚠️
src/cljam/io/cram/encode/record.clj 94.94% 3 Missing and 6 partials ⚠️
src/cljam/io/cram.clj 88.88% 1 Missing ⚠️
src/cljam/io/cram/decode/record.clj 0.00% 1 Missing ⚠️
src/cljam/io/cram/encode/alignment_stats.clj 96.87% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
+ Coverage   88.85%   89.28%   +0.43%     
==========================================
  Files          93       97       +4     
  Lines        8091     8841     +750     
  Branches      465      481      +16     
==========================================
+ Hits         7189     7894     +705     
- Misses        437      466      +29     
- Partials      465      481      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alumi alumi requested review from a team and niyarin and removed request for a team July 5, 2024 02:30
Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this feature!
I think both the design and the implementation are excellent 👍👍
I have added one minor comment about the document, so please check it 🙏

a map {<data series name> <encoder>}, where:
- <data series name>: a keyword representing the data series name
- <encoding>: a map representing the encoding of the data series
- <encoder>: a function that takes one argument to encode"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The encoder is actually an overloaded function that acts as both a single-argument function that performs encoding and a zero-argument function that returns the encoding result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! It slipped my mind when I was writing these docstrings. Thanks for pointing that out 🙏
Fixed in 800eecf.

@athos athos requested a review from a team as a code owner July 16, 2024 05:09
@athos athos requested review from r6eve and removed request for a team and r6eve July 16, 2024 05:09
Copy link
Contributor

@niyarin niyarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review.
LGTM!

@niyarin niyarin merged commit 121e544 into master Jul 22, 2024
17 checks passed
@niyarin niyarin deleted the feature/cram-writer branch July 22, 2024 02:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants