Error during make_examples #912

Open
yangao07 opened this issue Dec 7, 2024 · 8 comments

Comments

@yangao07

yangao07 commented Dec 7, 2024

Hi, here is the command:

deepvariant_sif=/homes2/yangao/software/deepvariant/deepvariant_1.8.0-gpu.sif
THREADS=16
model=PACBIO
in_bam=/hlilab/yangao/data/HG002/HG002_PacBio-HiFi-Revio_20231031_48x_GRCh38-GIABv3.bam
ref_fa=/hlilab/yangao/data/HG002/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz

singularity run --nv \
    -B ${INPUT_DIR},${OUTPUT_DIR} \
    ${deepvariant_sif} \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type ${model} \
    --ref ${ref_fa} \
    --reads ${in_bam} \
    --output_vcf ${output1_vcf} \
    --num_shards ${THREADS}

and the error message:

...
I1206 06:45:42.280306 139926186369472 make_examples_core.py:322] 9218017 candidates (4159794 examples) [7.76s elapsed]
2024-12-06 06:46:36.505705: F ./third_party/nucleus/core/statusor.h:230] Non-OK-status: status_ status: INVALID_ARGUMENT: Invalid interval: reference_name: "KMT2C_chr14_3610318_3640421" start: 30103 end: 30103
Fatal Python error: Aborted

Current thread 0x00007f431aa221c0 (most recent call first):
  File "/tmp/Bazel.runfiles_qtm4cv8_/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1732 in writes_examples_in_region
  File "/tmp/Bazel.runfiles_qtm4cv8_/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 3064 in make_examples_runner
  File "/tmp/Bazel.runfiles_qtm4cv8_/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 224 in main
  File "/tmp/Bazel.runfiles_qtm4cv8_/runfiles/absl_py/absl/app.py", line 258 in _run_main
  File "/tmp/Bazel.runfiles_qtm4cv8_/runfiles/absl_py/absl/app.py", line 312 in run
  File "/tmp/Bazel.runfiles_qtm4cv8_/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 234 in <module>
...

The error is related to the "KMT2C_chr14_3610318_3640421" contig.
I would appreciate any suggestions!
Thanks!

@kishwarshafin
Collaborator

@yangao07,

It looks like the reference used to generate HG002_PacBio-HiFi-Revio_20231031_48x_GRCh38-GIABv3.bam differs from the reference you are passing to DeepVariant. You can inspect the BAM header by running samtools view -H HG002_PacBio-HiFi-Revio_20231031_48x_GRCh38-GIABv3.bam to see which reference the reads were aligned against, and then use that exact reference to run DeepVariant.
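
For readers following along, the header-versus-reference comparison suggested here can be sketched in Python. This is only an illustration, not part of DeepVariant: it assumes you have saved the output of samtools view -H as text and have the .fai index that samtools faidx writes next to the FASTA. The function names and the toy contigs in the usage note are hypothetical.

```python
def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs; split on the first colon only.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def parse_fai(fai_text):
    """Return {contig_name: length} from the text of a .fai index (name, length, ...)."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def compare_contigs(bam_contigs, ref_contigs):
    """List contigs missing from either side or present with different lengths."""
    shared = set(bam_contigs) & set(ref_contigs)
    return {
        "only_in_bam": sorted(set(bam_contigs) - set(ref_contigs)),
        "only_in_ref": sorted(set(ref_contigs) - set(bam_contigs)),
        "length_mismatch": sorted(
            name for name in shared if bam_contigs[name] != ref_contigs[name]
        ),
    }
```

A possible use, with made-up contigs: compare_contigs(parse_sq_lines(header_text), parse_fai(fai_text)) returns three lists, and all three being empty means the names and lengths agree. Matching names and lengths do not prove the sequences are identical, so this is a quick sanity check rather than a guarantee.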

@yangao07
Author

yangao07 commented Dec 7, 2024

Hi, thanks for your quick reply.
The reference FASTA and the BAM do match; they both contain the contig "KMT2C_chr14_3610318_3640421".

@kishwarshafin
Collaborator

@yangao07,

Could you please re-download the FASTA file on your end? It may have been corrupted during download. I am also downloading the files on my end to see what the issue might be.

@yangao07
Author

yangao07 commented Dec 9, 2024

Hi, I double-checked the md5sum, and the FASTA file is complete.

md5sum GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz
939ce19062d1462c09b88c55faca4d76  GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz

Here is the md5sum link: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/checksum.md5
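
The same integrity check can be scripted, for example when the file is too large to re-download casually. The sketch below is a minimal illustration, assuming the published checksum file uses the usual md5sum layout of one "<hex digest>  <path>" pair per line; that layout and the helper names are assumptions, not something confirmed in this thread.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(checksum_text, filename):
    """Find the digest listed for `filename` in md5sum-style text, or None if absent."""
    for line in checksum_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        # md5sum marks binary mode with a leading '*'; compare on the base name only.
        listed = parts[1].strip().lstrip("*")
        if os.path.basename(listed) == filename:
            return parts[0].lower()
    return None
```

If md5_of_file(local_path) equals expected_md5(checksum_text, os.path.basename(local_path)), the download is byte-identical to the published file. A match rules out corruption but says nothing about whether the file is the right reference for a given BAM.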

@kishwarshafin
Collaborator

kishwarshafin commented Dec 9, 2024

@yangao07,

Please see the header here:

samtools view -H HG002_PacBio-HiFi-Revio_20231031_48x_CHM13v2.0.bam
@RG	ID:b0776b05/0--0	PL:PACBIO	DS:READTYPE=CCS;Ipd:Frames=ip;PulseWidth:Frames=pw;BINDINGKIT=102-739-100;SEQUENCINGKIT=102-118-800;BASECALLERVERSION=5.0;FRAMERATEHZ=100.000000;BarcodeFile=metadata/m84039_230928_213653_s3.barcodes.fasta;BarcodeHash=e7c4279103df8c8de7036efdbdca9008;BarcodeCount=113;BarcodeMode=Symmetric;BarcodeQuality=Score	LB:NIST Hg002	PU:m84039_230928_213653_s3	SM:HG002	PM:REVIO	BC:AGAGAGAT	CM:R/P1-C1/5.0-25M
@RG	ID:57265ef9/0--0	PL:PACBIO	DS:READTYPE=CCS;Ipd:Frames=ip;PulseWidth:Frames=pw;BINDINGKIT=102-739-100;SEQUENCINGKIT=102-118-800;BASECALLERVERSION=5.0;FRAMERATEHZ=100.000000;BarcodeFile=metadata/m84039_231005_222902_s1.barcodes.fasta;BarcodeHash=e7c4279103df8c8de7036efdbdca9008;BarcodeCount=113;BarcodeMode=Symmetric;BarcodeQuality=Score	LB:NIST HG002	PU:m84039_231005_222902_s1	SM:HG002	PM:REVIO	BC:AGAGAGAT	CM:R/P1-C1/5.0-25M
@PG	ID:ccs	PN:ccs	VN:7.0.0 (commit v7.0.0)	DS:Generate circular consensus sequences (ccs) from subreads.	CL:/opt/pacbio/tag-ccs-current/bin/ccs --streamed --log-level INFO --stderr-json-log --kestrel-files-layout --movie-name m84039_230928_213653_s3 --log-file metadata/m84039_230928_213653_s3.ccs.log --min-rq 0.9 --non-hifi-prefix fail --knrt-ada --pbdc-model /opt/pacbio/tag-ccs-current/bin/../models/revio_v1.onnx --alarms metadata/m84039_230928_213653_s3.ccs.alarms.json
@PG	ID:lima	VN:2.7.1 (commit v2.7.1-1-gf067520)	CL:/opt/pacbio/tag-lima-current/bin/lima --movie-name m84039_230928_213653_s3 --kestrel-files-layout --quality hifi --output-missing-pairs --shared-prefix --hifi-preset SYMMETRIC-ADAPTERS --store-unbarcoded --split-named --reuse-source-uuid --reuse-biosample-uuids --stderr-json-log --alarms metadata/m84039_230928_213653_s3.hifi_reads.lima.alarms.json --log-file metadata/m84039_230928_213653_s3.hifi_reads.lima.log pb_formats/m84039_230928_213653_s3.hifi_reads.consensusreadset.primrose.xml metadata/m84039_230928_213653_s3.barcodes.fasta hifi_reads/m84039_230928_213653_s3.hifi_reads.demux.bam
@PG	ID:pbmm2	PN:pbmm2	VN:1.10.0 (commit v1.10.0)	CL:pbmm2 align --num-threads 24 --sort-memory 4G --preset HIFI --sample HG002 --log-level DEBUG --sort --unmapped /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/human_chm13v2p0_maskedY_rCRS.fasta /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/m84039_230928_213653_s3.hifi_reads.default.bam HG002.m84039_230928_213653_s3.hifi_reads.default.CHM13.aligned.bam
@PG	ID:primrose	VN:1.4.0 (commit v1.4.0)	CL:/opt/pacbio/tag-primrose-current/bin/primrose --movie-name m84039_230928_213653_s3 --kestrel-files-layout --quality hifi --reuse-source-uuid --stderr-json-log --log-file metadata/m84039_230928_213653_s3.hifi_reads.primrose.log --alarms metadata/m84039_230928_213653_s3.hifi_reads.primrose.alarms.json
@PG	PN:hiphase	ID:hiphase-v0.10.2	VN:0.10.2	CL:hiphase --threads 16 --sample-name HG002 --vcf /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.CHM13.deepvariant.vcf.gz --vcf /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.CHM13.pbsv.vcf.gz --output-vcf HG002.CHM13.deepvariant.phased.vcf.gz --output-vcf HG002.CHM13.pbsv.phased.vcf.gz --bam /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.m84039_230928_213653_s3.hifi_reads.default.CHM13.aligned.bam --bam /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.m84039_231005_222902_s1.hifi_reads.default.CHM13.aligned.bam --output-bam HG002.m84039_230928_213653_s3.hifi_reads.default.CHM13.aligned.haplotagged.bam --output-bam HG002.m84039_231005_222902_s1.hifi_reads.default.CHM13.aligned.haplotagged.bam --reference /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/human_chm13v2p0_maskedY_rCRS.fasta --summary-file HG002.CHM13.hiphase.stats.tsv --blocks-file HG002.CHM13.hiphase.blocks.tsv --haplotag-file HG002.CHM13.hiphase.haplotags.tsv --global-realignment-cputime 300
@PG	ID:ccs-931C1C8	PN:ccs	VN:7.0.0 (commit v7.0.0)	DS:Generate circular consensus sequences (ccs) from subreads.	CL:/opt/pacbio/tag-ccs-current/bin/ccs --streamed --log-level INFO --stderr-json-log --kestrel-files-layout --movie-name m84039_231005_222902_s1 --log-file metadata/m84039_231005_222902_s1.ccs.log --min-rq 0.9 --non-hifi-prefix fail --knrt-ada --pbdc-model /opt/pacbio/tag-ccs-current/bin/../models/revio_v1.onnx --alarms metadata/m84039_231005_222902_s1.ccs.alarms.json
@PG	ID:lima-20709C12	VN:2.7.1 (commit v2.7.1-1-gf067520)	CL:/opt/pacbio/tag-lima-current/bin/lima --movie-name m84039_231005_222902_s1 --kestrel-files-layout --quality hifi --output-missing-pairs --shared-prefix --hifi-preset SYMMETRIC-ADAPTERS --store-unbarcoded --split-named --reuse-source-uuid --reuse-biosample-uuids --stderr-json-log --alarms metadata/m84039_231005_222902_s1.hifi_reads.lima.alarms.json --log-file metadata/m84039_231005_222902_s1.hifi_reads.lima.log pb_formats/m84039_231005_222902_s1.hifi_reads.consensusreadset.primrose.xml metadata/m84039_231005_222902_s1.barcodes.fasta hifi_reads/m84039_231005_222902_s1.hifi_reads.demux.bam
@PG	ID:pbmm2-C3420F4	PN:pbmm2	VN:1.10.0 (commit v1.10.0)	CL:pbmm2 align --num-threads 24 --sort-memory 4G --preset HIFI --sample HG002 --log-level DEBUG --sort --unmapped /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/human_chm13v2p0_maskedY_rCRS.fasta /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/m84039_231005_222902_s1.hifi_reads.default.bam HG002.m84039_231005_222902_s1.hifi_reads.default.CHM13.aligned.bam
@PG	ID:primrose-6BE3DF33	VN:1.4.0 (commit v1.4.0)	CL:/opt/pacbio/tag-primrose-current/bin/primrose --movie-name m84039_231005_222902_s1 --kestrel-files-layout --quality hifi --reuse-source-uuid --stderr-json-log --log-file metadata/m84039_231005_222902_s1.hifi_reads.primrose.log --alarms metadata/m84039_231005_222902_s1.hifi_reads.primrose.alarms.json
@PG	PN:hiphase	ID:hiphase-v0.10.2-4A0390CA	VN:0.10.2	CL:hiphase --threads 16 --sample-name HG002 --vcf /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.CHM13.deepvariant.vcf.gz --vcf /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.CHM13.pbsv.vcf.gz --output-vcf HG002.CHM13.deepvariant.phased.vcf.gz --output-vcf HG002.CHM13.pbsv.phased.vcf.gz --bam /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.m84039_230928_213653_s3.hifi_reads.default.CHM13.aligned.bam --bam /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.m84039_231005_222902_s1.hifi_reads.default.CHM13.aligned.bam --output-bam HG002.m84039_230928_213653_s3.hifi_reads.default.CHM13.aligned.haplotagged.bam --output-bam HG002.m84039_231005_222902_s1.hifi_reads.default.CHM13.aligned.haplotagged.bam --reference /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/human_chm13v2p0_maskedY_rCRS.fasta --summary-file HG002.CHM13.hiphase.stats.tsv --blocks-file HG002.CHM13.hiphase.blocks.tsv --haplotag-file HG002.CHM13.hiphase.haplotags.tsv --global-realignment-cputime 300
@PG	ID:samtools	PN:samtools	PP:hiphase-v0.10.2-4A0390CA	VN:1.14	CL:samtools merge -@ 7 -o HG002.CHM13.haplotagged.bam /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.m84039_230928_213653_s3.hifi_reads.default.CHM13.aligned.haplotagged.bam /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/HG002.m84039_231005_222902_s1.hifi_reads.default.CHM13.aligned.haplotagged.bam
@PG	ID:samtools.1	PN:samtools	PP:samtools	VN:1.15	CL:samtools view -H HG002_PacBio-HiFi-Revio_20231031_48x_CHM13v2.0.bam

This BAM was aligned against human_chm13v2p0_maskedY_rCRS.fasta, not GRCh38. Please use the matching reference and DeepVariant should run correctly.
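
Reading the aligner's reference out of a long header by eye is error-prone, so here is a small Python sketch that pulls it from the pbmm2 @PG records. It is a heuristic under stated assumptions: it only looks at records whose PN field is pbmm2, and it treats any command-line token with a FASTA-like suffix as a reference. The suffix list and the function name are hypothetical choices for illustration.

```python
import shlex

# Assumed set of suffixes that identify a reference FASTA on the command line.
FASTA_SUFFIXES = (".fasta", ".fa", ".fna", ".fasta.gz", ".fa.gz")


def pbmm2_references(header_text):
    """Return the sorted base names of FASTA files named in pbmm2 @PG command lines."""
    refs = set()
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        # Split each TAG:VALUE field on the first colon so values may contain colons.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if fields.get("PN") != "pbmm2":
            continue
        for token in shlex.split(fields.get("CL", "")):
            if token.endswith(FASTA_SUFFIXES):
                refs.add(token.rsplit("/", 1)[-1])
    return sorted(refs)
```

Applied to the saved text of samtools view -H, this returns the reference file names the reads were aligned against. More than one name in the result would suggest that BAMs aligned to different references were merged.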

@yangao07
Author

yangao07 commented Dec 10, 2024

Oh, I just found that I put the wrong BAM link here; it should be GRCh38, not CHM13.
I am so sorry!

Here is the correct link for the BAM I used: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_HiFi-Revio_20231031/HG002_PacBio-HiFi-Revio_20231031_48x_GRCh38-GIABv3.bam

Also the header of this BAM:

samtools view HG002_PacBio-HiFi-Revio_20231031_48x_GRCh38-GIABv3.bam -H | grep pbmm2
@PG	ID:pbmm2	PN:pbmm2	VN:1.10.0 (commit v1.10.0)	CL:pbmm2 align --num-threads 24 --sort-memory 4G --preset HIFI --sample HG002 --log-level DEBUG --sort --unmapped /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/m84039_230928_213653_s3.hifi_reads.default.bam HG002.m84039_230928_213653_s3.hifi_reads.default.GRCh38.aligned.bam
@PG	ID:pbmm2-52B9AB0B	PN:pbmm2	VN:1.10.0 (commit v1.10.0)	CL:pbmm2 align --num-threads 24 --sort-memory 4G --preset HIFI --sample HG002 --log-level DEBUG --sort --unmapped /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/m84039_231005_222902_s1.hifi_reads.default.bam HG002.m84039_231005_222902_s1.hifi_reads.default.GRCh38.aligned.bam

Here are the links: (just in case this helps) https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz

https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_HiFi-Revio_20231031/HG002_PacBio-HiFi-Revio_20231031_48x_CHM13v2.0.bam

@lucasbrambrink
Collaborator

Hi @yangao07,

I just wanted to quickly check in and mention that I was able to reproduce the error you are seeing.

2024-12-12 23:27:13.948492: F ./third_party/nucleus/core/statusor.h:230] Non-OK-status: status_ status: INVALID_ARGUMENT: Invalid interval: reference_name: "KMT2C_chr14_3610318_3640421" start: 30103 end: 30103

The interval is indeed invalid because it has length 0 (start: 30103, end: 30103). Interestingly, when you run on just that contig (--regions "KMT2C_chr14_3610318_3640421"), it runs fine. I'm working on an easy way to split up the regions so that the full run works as well. Additionally, I've made a note to look into why that region is computed with length zero. Stay tuned!
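
To make the zero-length observation concrete, the Python sketch below parses the interval out of the fatal log line and checks whether it is empty. It is not DeepVariant code. It assumes the interval is half-open, so start equal to end means no bases are covered. The span_from_name helper is hypothetical and rests on a further assumption: that a contig name ending in _<start>_<end> encodes source coordinates.

```python
import re

# Matches the interval fields as they appear in the error message quoted above.
INTERVAL_RE = re.compile(
    r'reference_name: "(?P<contig>[^"]+)" start: (?P<start>\d+) end: (?P<end>\d+)'
)


def parse_interval(message):
    """Extract (contig, start, end) from an 'Invalid interval' message, or None."""
    match = INTERVAL_RE.search(message)
    if match is None:
        return None
    return match["contig"], int(match["start"]), int(match["end"])


def is_empty(start, end):
    """Under the assumed half-open convention [start, end), end <= start covers nothing."""
    return end <= start


def span_from_name(contig):
    """Hypothetical helper: for names ending in _<start>_<end>, return end - start."""
    match = re.search(r"_(\d+)_(\d+)$", contig)
    return int(match[2]) - int(match[1]) if match else None
```

For the message in this thread, parse_interval gives start and end both equal to 30103, and the two numbers in the contig name differ by 3640421 - 3610318 = 30103 as well. If that difference really is the contig length, which would need confirming against the FASTA index, the failing interval sits exactly at the end of the contig. That would be consistent with a boundary case in region splitting, but this is speculation, not a diagnosis.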
