Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General questions: mode and usage #7

Open
HLHsieh opened this issue Jun 13, 2024 · 1 comment
Open

General questions: mode and usage #7

HLHsieh opened this issue Jun 13, 2024 · 1 comment

Comments

@HLHsieh
Copy link

HLHsieh commented Jun 13, 2024

Hi Kinsey,

Thank you for your quick response to my issue; everything was fixed. I tried HMMSTR on several of my own datasets, and the experience was amazing.

However, I have several questions:

  1. There are two input modes. Could you confirm whether they return the same results or if there are any differences between them?
  2. Regarding the definition of our own input files, is the information based on a specific reference genome or is it possible to extend this to other species. I am currently using hg38.
  3. Is HMMSTR applicable to other types of repeats, and are there any length limits? I attempted to detect a 67-bp repeat in my data but encountered an error as follows:
hmmstr targets_tsv /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed ./test/scratch/kinfai_root/kinfai0/hsinlun/Data_R10/test/test.fasta --cpus 8
Parameters for this HMMSTR run:
subcommand : targets_tsv
out : ./test
inFile : /scratch/kinfai_root/kinfai0/hsinlun/Data_R10/test/test.fasta
output_plots : False
output_labelled_seqs : False
max_peaks : 2
cpus : 8
flanking_size : 100
mode : map-ont
mapq_cutoff : 30
k : None
w : None
use_full_read : False
peakcalling_method : auto
bandwidth : None
kernel : gaussian
bootstrap : False
call_width : 0.95
resample_size : 100
allele_specific_CIs : False
allele_specific_plots : False
discard_outliers : False
filter_quantile : 0.25
flanking_like_filter : False
stranded_report : False
cluster_only : False
save_intermediates : False
background : None
E_probs : None
A_probs : None
custom_RM : None
hmm_pre : None
targets : /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed 
Target tsv input selected! Reading input from  /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed ...
All models built! finished .... time was:  79.32333261705935
Starting multiprocess, cpu's in use: 8
All reads processed, the pooled run took:  1037.1308538229205
Traceback (most recent call last):
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/bin/hmmstr", line 8, in <module>
    sys.exit(main())
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/HMMSTR/HMMSTR.py", line 773, in main
    geno_df_final = geno_df.apply(sort_outputs, args=(args.max_peaks,args.bootstrap,args.allele_specific_CIs), axis=1)
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/HMMSTR_utils/HMMSTR_utils.py", line 246, in sort_outputs
    new_row = curr_row[column_names_in_order].copy()
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/series.py", line 1153, in __getitem__
    return self._get_with(key)
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/series.py", line 1194, in _get_with
    return self.loc[key]
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1420, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1360, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis)
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1558, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['num_supporting_reads'] not in index"
  1. Regarding the output, genotype_calls.tsv shows:
name	A1:median	A1:mode	A1:SD	A1:supporting_reads	A2:median	A2:mode	A2:SD	A2:supporting_reads	num_supporting_reads	bandwidth	peak_calling_method
C9ORF72	0	0	0	0	264.0	264	0.7339693949098695	68	68	0.5	kde_throw_outliers

I am wondering why the copy number of allele 1 is 0 while that of allele 2 is 264.0, rather than having only allele 1 with a copy number of 264.0.

read_assignments.tsv shows:

name	read_id	strand	align_score	neg_log_likelihood	subset_likelihood	repeat_likelihood	repeat_start	repeat_end	align_start	align_end	counts	freq	cluster_assignmentsoutlier	peak_calling_method
C9ORF72	C9ORF72-8_27568299_aligned_278233_F_39_15559_10	forward	120	-1546.114	-308.153	-296.803	5254	6833	5058	7032	264.0	37	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27560224_aligned_280856_F_1_38138_44	forward	120	-1726.833	-484.1819999999999	-466.743	13276	14853	13075	15051	261.0	2		True	kde_throw_outliers
C9ORF72	C9ORF72-8_27559176_aligned_334855_F_1_32063_41	forward	120	-1741.146	-499.6199999999999	-476.6540000000001	14314	15905	14114	16112	264.0	37	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27539687_aligned_480453_R_390_40843_20094	reverse	120	-1562.965	-316.68600000000004	-299.563	25510	27089	25311	27288	264.0	37	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27572376_aligned_18870_R_2_15374_0	reverse	120	-1595.043	-359.16100000000006	-331.653	12651	14227	12448	14426	263.0	21	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27552425_aligned_44556_F_1216_30940_528	forward	120	-1562.619	-316.34000000000003	-304.352	22331	23916	22131	24115	264.0	37	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27558694_aligned_147495_R_41_18456_9	reverse	120	-1631.929	-387.199	-343.42	2046	3625	1845	3824	263.0	21	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27518711_aligned_305149_R_43_57261_0	reverse	120	-1647.123	-411.2410000000001	-368.809	860	2438	661	2632	263.0	21	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27553579_aligned_420791_R_43_32918_1	reverse	120	-1597.8	-353.6	-326.093	11359	12942	11160	13142	264.0	37	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27572786_aligned_505298_F_1_10977_44	forward	120	-1621.927	-375.648	-358.381	744	2327	544	2525	264.0	37	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27571470_aligned_514863_R_8_12938_30	reverse	120	-1525.204	-279.501	-262.02599999999995	9322	10906	9123	11105	264.0	37	1	False	kde_throw_outliers
C9ORF72	C9ORF72-8_27565841_aligned_527646_R_33_16083_12	reverse	120	-1636.656	-400.775	-318.755	6816	8394	6618	8598	262.0	6	1	False	kde_throw_outliers

However, all the reads were assigned to allele one. Additionally, I noticed that some reads have a blank in the cluster_assignments field. What does this mean, and how should I interpret these reads?

C9ORF72	C9ORF72-8_27560224_aligned_280856_F_1_38138_44	forward	120	-1726.833	-484.1819999999999	-466.743	13276	14853	13075	15051	261.0	2		True	kde_throw_outliers

Could you also provide a brief explanation of the "freq" column (the frequency of the copy number for the assigned target)? I supposed that this value should be less than 1, but in my results, the values are greater than 1.

Any suggestions and comments would be appreciated.

Best regards,
Hsin

@kvandeynze
Copy link
Collaborator

Hi Hsin,
Thank you for your interest in HMMSTR! Here are my responses for your quesitons:

  1. The only file difference in outputs between the coordinates and targets_tsv modes is that under coordinates mode the *_inputs.tsv file will be produced from the input coordinates. All other outputs should be equivalent
  2. The input tsv can be any target sequence based on any species you would like! The benefit of this mode is that it is 100% reference free so hg38 or any other sequence you would like to target should be compatible in this mode.
  3. HMMSTR should be compatible with any tandemly repeated DNA element, however if the sequence diverges too far from the expected repeat unit HMMSTR may return that there were no reads found with that underlying repeat sequence. As for your error, this appears to be a bug where I failed to initialize the 'num_supporting_reads' column for the first target returned with no supporting reads detected. In response to this I have patched where I believe the problem was introduced but please let me know if you are still encountering this issue.
  4. This was an inconsistency between the allele sorting in the genotypes vs read_assignment files, I have now updated the outputs such that the allele assignment number will be the same across both output files. To note, the alleles are sorted by size so all homozygous calls will have A1 equal to 0. As for the rows in read_assignments with no cluster assignment: these reads were called as outliers and were thus dropped before the clustering step so they were not assigned to a cluster. If you would like all reads to be clustered, the 'auto' peakcalling_method can be overridden to 'gmm' or 'kde' to prevent 'kde_throw_outliers', which automatically discards reads with copy numbers that lie outside of the IQR of the data.

Hope this helps and please let me know if you have any other questions,
Kinsey

@aboyle aboyle transferred this issue from Boyle-Lab/HMMSTR_depreciated Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants