Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelise MAGMA steps #110

Open
bschilder opened this issue Jul 29, 2022 · 1 comment
Open

Parallelise MAGMA steps #110

bschilder opened this issue Jul 29, 2022 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@bschilder
Copy link
Collaborator

A parallelised version of MAGMA was recently described here:
https://star-protocols.cell.com/protocols/1392

@bschilder bschilder self-assigned this Jul 29, 2022
@bschilder bschilder added the enhancement New feature or request label Jul 29, 2022
@bschilder
Copy link
Collaborator Author

bschilder commented Aug 10, 2022

Notes on parallelisation

The following tests were conducted on the Neurogenomics private cloud, within a virtual machine instance with 252GB of RAM, (8?)TB of storage, and 64 CPU cores (128 threads).

MungeSumstats::import_sumstats

Error description

Documented here: https://github.com/neurogenomics/MungeSumstats/issues/113

BiocParallel errors
  0 remote errors, element index: 
  50 unevaluated and other errors
  first remote error:
  • No sumstats produced.
  • Also recorded messages from the wrong/next id!
  • Seems to occur with smaller files, <2M variants.
  • VCF and tabix index download fine (checked file sizes against manual download from OpenGWAS).
  • Only using ~half of available memory at most, so doesn't seem to be crashing due to that.
  • Seems to trace back to read_vcf_parallel step.
  • Restarting the Docker container and/or decreasing the number of threads to 30 seems to resolve this.
    • This suggests that perhaps the issue is breaking the VCFs into too small of chunks when the VCF is already small leads to situations where some chunks are empty, causing an error when being merged with all the other chunks.

Current solution

  • Limit the number of parallel threads to <=30.

Reprex

subcategories3 <- c("neurological","Immune","cardio")
metagwas3 <- MungeSumstats::find_sumstats(subcategories = subcategories3)
meta <- filter_traits(meta = metagwas3,  
                           group_var = "subcategory",
                           topn = 100) 

gwas_paths <- MungeSumstats::import_sumstats(
  ids = meta$id, 
  save_dir = here::here("data/GWAS_sumstats"), 
  nThread = 30, # >30 causes issues with read_vcf_parallel
  parallel_across_ids = FALSE, 
  force_new_vcf = FALSE,
  force_new = FALSE,
  vcf_download = TRUE,
  vcf_dir = here::here("data/VCFs"),
  ### axel will keep trying forever if the URL doesn't exist (or is private)
  # download_method = "axel",
  #### Record logs
  log_folder_ind = TRUE,
  log_mungesumstats_msgs = TRUE,
  ) 

MAGMA.Celltyping::map_snps_to_genes

Error description

Parallelising map_snps_to_genes across too many threads actually seems to slow everything down, despite having sufficient memory.

This might be because:

  1. They are all calling to the same underlying MAGMA C code.
  2. They are reserving more memory in C than the computer thinks is available. (seems likely)
  3. They are all reading from the same reference files (e.g. 1KG).
  4. MAGMA is somehow already parallelising processes internally.
  5. Docker is launching some processes that don't get killed when restarting the container.
  6. Intermediate MAGMA files are getting repeatedly appended and causing problems in subsequent reruns.
  7. genes_only=TRUE? but this should make things faster, not slower
    • Confirmed that setting genes_only=TRUE seems to speed things up (thought unsure how significantly).
  8. Newer versions of MAGMA(v1.10) has bugs that causes everything to slow down.
    • MAGMAv1.08 might be a bit faster, but hard to tell bc uses diff files each time.

Current solution

Limit the number of parallel threads to <=20. This results in low memory usage (~11/252GB at any given time, indicated by the green bar in htop) but, perhaps importantly, ensures that the amount of memory being reserved only reaches ~2/3 of the max (the yellow bar in htop).

Screenshot 2022-08-10 at 11 45 51

Reprex

source("https://github.com/neurogenomics/MAGMA_Files_Public/raw/master/code/utils.R")

save_dir <- here::here("data/GWAS_sumstats")
meta <- gather_metadata(save_dir = save_dir,
                        N_dict=c("Wightman2021"=1126563, 
                                 "Vuckovic2020"=408112))
data.table::setkey(meta,id)

t1 <- Sys.time()
magma_files <- parallel::mclapply(seq_len(nrow(meta)),
                                  function(i){ 
  EWCE:::message_parallel("----- ",i," : ",
                                      meta$id[i]," -----") 
     tryCatch(expr = { 
       MAGMA.Celltyping::map_snps_to_genes(
         # version = "1.08",
         path_formatted = meta$munged_path[i],
         genome_build = meta$build_final[i],
         N = if(is.na(meta$N[i])) NULL else meta$N[i],
         population = meta$population_1KG[i],
         upstream_kb = 35,  
         downstream_kb = 10,
         genes_only = TRUE, 
         force_new = FALSE
         )
     }, error = function(e) {EWCE:::message_parallel(e);NULL}) 
}, mc.cores = min(nrow(meta),20) ) |> `names<-`(meta$id) 
t2 <- Sys.time()
print(t2-t1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant