Finding Contaminants and Removing them #28

Open
ajkarloss opened this issue Feb 25, 2019 · 2 comments

ajkarloss commented Feb 25, 2019

Add an option to the sequence quality check to screen for possible contaminants.
Use mash to predict the contaminants in the raw sequences:
-- Prepare/download the contaminant database from NCBI
-- Prokaryotes database - will need to be updated regularly

-- Make a summary with Håkon's script - NB: as it stands it is not OK for metagenomics - this can be specified further

Problem: we need to remove phiX - maybe during trimming -> ask Thomas for advice on this issue
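
A minimal sketch of how the screening step could be driven and its output read, assuming `mash` is installed and a sketch database already exists. The function names and the `ScreenHit` type are hypothetical, and any file names passed in are placeholders. The column order follows the `mash screen` output (identity, shared hashes, median multiplicity, p-value, query ID, query comment).

```python
from typing import NamedTuple


class ScreenHit(NamedTuple):
    identity: float
    shared_hashes: str          # kept as text, e.g. "991/1000"
    median_multiplicity: float
    p_value: float
    query_id: str
    query_comment: str


def build_mash_screen_cmd(sketch_db, reads, threads=4, winner_takes_all=True):
    """Return the argument list for a `mash screen` call (nothing is executed here)."""
    cmd = ["mash", "screen", "-p", str(threads)]
    if winner_takes_all:
        # -w: winner-takes-all, reduces redundant hits among related genomes
        cmd.append("-w")
    cmd.append(sketch_db)
    cmd.extend(reads)
    return cmd


def parse_screen_line(line):
    """Parse one tab-separated line of `mash screen` output into a ScreenHit."""
    fields = line.rstrip("\n").split("\t")
    identity, shared, multiplicity, p_value, query_id = fields[:5]
    # The comment column may be absent for some database entries.
    comment = fields[5] if len(fields) > 5 else ""
    return ScreenHit(float(identity), shared, float(multiplicity),
                     float(p_value), query_id, comment)
```

The command list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`, and each line of the captured standard output passed to `parse_screen_line`.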

@evezeyl
Contributor

evezeyl commented Feb 26, 2019

As Karin said: we might need some advice on the best way of creating the database: the default database contains all sequences...

  • Do we have a way to clean the database, i.e. clean the entry names (or maybe better to modify the name filter in the R script)?
    -- complete genomes only or not? all eukaryote sequences?
    -- the database will need to be updated regularly
    • update frequency? Can we automate as much as possible? E.g. is it possible to schedule updating/creating the database with specified parameters?
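
One possible building block for automating the updates is a staleness check that a scheduled job (cron, a pipeline step, or similar) could call before deciding to rebuild the database. This is only a sketch: the function name and the 30-day default are assumptions, not something agreed in this thread.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def database_is_stale(path, max_age_days=30, now=None):
    """Return True if the database file is missing or older than max_age_days.

    `now` is a Unix timestamp; it defaults to the current time and is only
    exposed so the check can be tested with a fixed clock.
    """
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / SECONDS_PER_DAY
    return age_days > max_age_days
```

The check relies on the file modification time, so it assumes the database file is rewritten on every rebuild.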

Karin does not want any modification of the files here -> maybe remove phiX and trim adaptors, but do not output the files -> send them directly into the channel. Would that be a good enough solution?
Not removing phiX and adaptors should affect mash screen ...

We need slight modifications to Håkon's script: https://github.com/hkaspersen/misc-scripts/blob/master/scripts/mash_screen.R

  • on the organism of interest: in Håkon's script we filter the organism of interest by name, e.g. "Listeria monocytogenes", but <Listeria.monocytogenes> was not filtered out and popped up as a likely contaminant because of the dot inserted in the name in the mash database -> so we might need to improve the filter.
  • line 74: needs to be modified for pattern matching, according to the nextflow script
  • maybe add an option to transpose the output tables (question of preference - I prefer it transposed - easy to modify)
  • a short explanation of what the filter is/does, to help with selecting options -> on Bifrost/Håkon (towards 0 we also get rare matching reads; towards 1, high values filter out all of the low-abundance sequences and we only get the ones that dominate the files)
  • we might require some packages to be installed for R and Bifrost/conda (i.e. I had to install the cairo library on my Ubuntu system to be able to use the script, plus the additional svglite package in R - but maybe these are already in the R system)
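
To illustrate the name-filter problem from the first bullet, here is a small sketch of a separator-tolerant match. It is written in Python for illustration only and is not Håkon's R code. All function names are hypothetical, the sample descriptions are made up, and it assumes the filter value is a minimum identity threshold, which this thread does not confirm.

```python
import re


def normalize_name(name):
    """Lower-case and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def is_organism_of_interest(description, organism):
    """True if the normalized organism name occurs as whole words in the description."""
    wanted = normalize_name(organism)
    padded = f" {normalize_name(description)} "
    return f" {wanted} " in padded


def likely_contaminants(hits, organism, min_identity=0.95):
    """Keep hits at or above min_identity that do not match the organism.

    `hits` is an iterable of (identity, description) pairs.
    """
    return [
        (identity, description)
        for identity, description in hits
        if identity >= min_identity
        and not is_organism_of_interest(description, organism)
    ]


# Made-up example rows: both spellings of the target are filtered out.
example_hits = [
    (0.99, "entry_1 Listeria monocytogenes chromosome"),
    (0.98, "entry_2 Listeria.monocytogenes strain A"),
    (0.97, "entry_3 Escherichia phage phiX174"),
    (0.50, "entry_4 Homo sapiens"),
]
print(likely_contaminants(example_hits, "Listeria monocytogenes"))
```

Because dots, underscores and spaces all normalize to a single space, "Listeria.monocytogenes" and "Listeria monocytogenes" are treated as the same organism. Lowering `min_identity` towards 0 lets weaker hits through, such as the last row above.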

@Thomieh73
Member

I think this paper was really helpful when it comes to human contamination.

Interestingly, in discussion with Jen Lu, who worked on Kraken2, I heard that when she classifies metagenomic reads, she includes an unmasked human genome in order to catch anything that looks human.

karinlag added a commit that referenced this issue Aug 21, 2022