Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conditional analysis - Summary Gene Centric Noncoding not running to completion #58

Open
samreenzafer opened this issue Apr 25, 2024 · 6 comments

Comments

@samreenzafer
Copy link

samreenzafer commented Apr 25, 2024

Thanks for a great easy to use tool. I've finally reached the step where I would like to do a Gene Centric Noncoding conditional analysis on some select rare variants I identified in my data set.
As per the tutorial, I am running the STAARpipelineSummary_Gene_Centric_Noncoding.r script, and this time give a nonNULL input to the known_loci variable.

`

known_loci
CHR POS REF ALT
1 2 168944600 C T
2 2 168976689 T G
3 2 168973831 C A
4 2 168976617 T C
5 2 168979945 A G
6 2 168990819 T C
7 2 168993859 G A
8 2 168996703 C T
9 2 168996703 C T
10 2 169013329 G T
11 2 168970065 C T
12 2 168990800 C T
13 2 168995461 C T

`

The script first generates the combined summary stats as before
`

UTR.Rdata
upstream.Rdata
downstream.Rdata
promoter_CAGE.Rdata
promoter_DHS.Rdata
enhancer_CAGE.Rdata
enhancer_DHS.Rdata
results_ncRNA_genome.Rdata
UTR_sig.csv
upstream_sig.csv
downstream_sig.csv
promoter_CAGE_sig.csv
promoter_DHS_sig.csv
enhancer_CAGE_sig.csv
enhancer_DHS_sig.csv
ncRNA_sig.csv
noncoding_sig.csv

`

and then went on running for over 24hours before being killed, and never produced any of the *_cond_sig.* files. I'm wondering if this is indeed a very long step, and expected behavior ?

@xihaoli
Copy link
Owner

xihaoli commented Apr 25, 2024

Hi @samreenzafer,

Thanks for the question. If the input to known_loci is particularly long (e.g. thousands of variants directly extracted from GWAS Catalog), then you should perform LD pruning to select the subset of independent variants from the full list for conditional analysis.

Have you performed the LD pruning step already?

Best,
Xihao

@samreenzafer
Copy link
Author

Hi Xihao,
I have only these 13 variants in the known_loci . They are rare variants, and I don't want to prune them, since I specifically want to identify if there are other samples in my cohort with other variants that could be contributing to my phenotype, after conditioning on these 13 variants.

Thanks.

@samreenzafer
Copy link
Author

samreenzafer commented Apr 25, 2024

I did try to run the STAARpipelineSummary_Known_Loci_Pruning.r script, but it gave me a NULL LDpruned file at the end. Maybe this is what is leading the STAARpipelineSummary_Gene_Centric_Noncoding.r script also to just keep on running?
Although I don't understand why no LDpruned variants are being identified .

> 
> Rscript  scripts/STAARpipelineSummary_Known_Loci_Pruning.r 2
>          used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 272200 14.6     665751 35.6   411493 22.0
> Vcells 456006  3.5    8388608 64.0  1820273 13.9
> [1] 2
> [1] "Input variants"
>    CHR       POS REF ALT
> 1    2 168944600   C   T
> 2    2 168976689   T   G
> 3    2 168973831   C   A
> 4    2 168976617   T   C
> 5    2 168979945   A   G
> 6    2 168990819   T   C
> 7    2 168993859   G   A
> 8    2 168996703   C   T
> 9    2 168996703   C   T
> 10   2 169013329   G   T
> 11   2 168970065   C   T
> 12   2 168990800   C   T
> 13   2 168995461   C   T
> # of selected samples: 483
> # of selected variants: 10
> [1] "LD pruned variants"
> NULL
> 
> 

I modified the script to print both data before and after calling the LD_pruning function as below.

`

print("Input variants")
print(variants_list)
known_loci <- LD_pruning(chr=chr,genofile=genofile,obj_nullmodel=obj_nullmodel,variants_list=variants_list,maf_cutoff=maf_cutoff,
method_cond=method_cond,QC_label=QC_label,
variant_type=variant_type,geno_missing_imputation=geno_missing_imputation)

known_loci_chr <- rbind(known_loci_chr,known_loci)
print("LD pruned variants")
print(known_loci_chr)

`

@xihaoli
Copy link
Owner

xihaoli commented Apr 25, 2024

Hi @samreenzafer,

Thanks for sharing more details. There are two relevant but separate things to consider:

  1. LD_pruning: In LD pruning procedure, we have included a parameter maf_cutoff to only keep the variants with a given minimum minor allele frequency. Typically, this parameter is set to be 0.01 (by default, to keep common or low-frequency variants) or 19.5/(2 * n) (which is equivalent to MAC >= 20). As you mentioned, these 13 variants are rare, but if you are using a maf_cutoff of 0.01, then none of the variants will remain as part of the output of LD_pruning. Another reason for having an output of LD_pruning being NULL is the cond_p_thresh parameter (default is 1e-04), where you may want to check the manual as well. As a minor note, I've recently updated the LD_pruning function to the STAARpipeline GitHub repository. Please make sure to install the latest version of the STAARpipeline package (although I don't think this is the cause of the results being NULL).

  2. STAARpipelineSummary_Known_Loci_Pruning.r: I agree that if you only have 13 variants, then in general there is no need to prune them further. But based on the rule of thumb in (1), please check whether any of the 13 variants are extremely rare (i.e. MAC < 20), and in that case, you may not want to adjust for them.

Hope this helps.

Best,
Xihao

@samreenzafer
Copy link
Author

samreenzafer commented Apr 26, 2024

Thanks for the detailed response.

  1. I do see that after pruning, by adjusting the maf_cutoff parameters, I get only 1 selected variant. Relaxing these parameters still give me the same variant.

MAFcutoff=1e-05 cond_p_thresh=1e-04
[1] "LD pruned variants"
  CHR       POS REF ALT
   2 168996703   C   T

MAFcutoff=1e-10 cond_p_thresh=1
[1] "LD pruned variants"
  CHR       POS REF ALT
   2 168996703   C   T

  1. I went through the LD_pruning code and looks like the only difference in code are the left_join lines example;
    Old code that I had downloaded few weeks back.
    left_join(variants_list_chr,Gene_info,by=c(POS="POS",REF="REF",ALT="ALT")
    New code from your GitHub page
    left_join(variants_list_chr,Gene_info,by=c("POS"="POS","REF"="REF","ALT"="ALT")
    So maybe this is not the issue. And I only need to focus on the variants list.
    Going back to the variants list, my variants are super rare, with MAC=1 or 2, would you expect STAAR to not reach end of computation ?

@xihaoli
Copy link
Owner

xihaoli commented Apr 26, 2024

Hi @samreenzafer,

Thanks for the input.

  1. In this case, other variants in the input list can all be perfectly explained by this variant. You may perform a quick check by looking at the genotypes for all these 13 variants.

  2. I am referring to line 175 of the LD_pruning function, where we additionally included QC_label parameter in Individual_Analysis_cond last week and is not the change you mentioned above.

  3. I am still not sure why you would want to adjust these extremely rare variants (MAC = 1 or 2) in conditional analysis. Please refer to the STAARpipeline paper for the best practices of rare variant analysis.

Best,
Xihao

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants