WinstonCleaner is a software tool for detecting and removing cross-contaminated contigs from assembled transcriptomes. The program uses BLAST to identify suspicious contigs and RPKM values to classify them as either correct or contamination.
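As an illustration of that idea (a minimal sketch, not WinstonCleaner's actual code; the function names and the exact decision rule are assumptions), the RPKM comparison for one BLAST hit between contigs from two datasets might look like:

```python
def rpkm(mapped_reads, contig_len_bp, total_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return mapped_reads / (contig_len_bp / 1000.0) / (total_reads / 1000000.0)

def looks_contaminated(own_rpkm, partner_rpkm, coverage_ratio):
    """Flag a contig as contamination when its partner from the other
    dataset is covered `coverage_ratio` times better by its own reads.
    This is a simplification of the real decision rule."""
    return partner_rpkm > own_rpkm * coverage_ratio
```

With a ratio of 1.1, a contig whose BLAST partner has more than 1.1 times its coverage would be flagged in this simplified scheme.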
To install WinstonCleaner, the following steps must be performed:
- Check out the repository:
  git clone https://github.com/kolecko007/WinstonCleaner.git
  cd WinstonCleaner
- Install pip dependencies:
  pip2 install --user -r requirements.txt
- Initialize the settings:
  cp config/settings.yml.default config/settings.yml
- Check the installation by running test/integration/run.sh from the WinstonCleaner folder.
- Prepare the folder with the input data and an empty folder for the results
- Open config/settings.yml and specify the input and output paths
- Run bin/prepare_data.py
- Run bin/find_contaminations.py
- Inspect the results in the output folder
The input data should be presented as a set of triads of files, one triad per dataset. For each dataset it is necessary to prepare:
- left reads (.fastq file)
- right reads (.fastq file)
- assembled transcriptome (.fasta file)
Names of the files must be in the following format:
NAME_1.fastq
NAME_2.fastq
NAME.fasta
For example:
brucei_1.fastq
brucei_2.fastq
brucei.fasta
giardia_1.fastq
giardia_2.fastq
giardia.fasta
Only letters, digits, and the underscore (_) are allowed in file names.
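The naming rules above can be checked before a run. The following sketch (a standalone helper, not part of WinstonCleaner) groups the files of an input folder into NAME triads and reports datasets with missing files:

```python
import re
from collections import defaultdict

TRIAD_SUFFIXES = {"_1.fastq", "_2.fastq", ".fasta"}
NAME_RE = re.compile(r"^[A-Za-z0-9_]+$")  # only letters, digits and _

def group_triads(filenames):
    """Map dataset NAME -> set of the triad suffixes found for it."""
    triads = defaultdict(set)
    for fn in filenames:
        for suffix in TRIAD_SUFFIXES:
            if fn.endswith(suffix):
                name = fn[: -len(suffix)]
                if NAME_RE.match(name):
                    triads[name].add(suffix)
                break
    return triads

def incomplete(triads):
    """Names that are missing at least one of the three required files."""
    return {n: TRIAD_SUFFIXES - s for n, s in triads.items() if s != TRIAD_SUFFIXES}
```

Running `incomplete(group_triads(...))` over the input folder listing before starting a long run catches typos in file names early.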
All the files must be placed together in one folder.
All the settings are declared in config/settings.yml.
- winston.paths.input — input folder with reads and contigs
- winston.paths.output — output folder with the results
- winston.paths.tools.pileup_sh — (optional) bbtools pileup.sh execution command
- winston.paths.tools.bowtie2 — (optional) bowtie2 execution command
- winston.paths.tools.bowtie2_build — (optional) bowtie2-build execution command
- winston.hits_filtering.len_ratio — minimal qcovhsp for hits filtering
- winston.hits_filtering.len_minimum — minimal hit length for hits filtering
- winston.coverage_ratio.regular — coverage ratio for the REGULAR dataset pair type (lower values make contamination prediction stricter, so fewer contaminations will be found)
- winston.coverage_ratio.close — coverage ratio for the CLOSE dataset pair type
- winston.threads.multithreading — enable multithreading (disabling it is convenient for debugging purposes)
- winston.threads.count — number of threads if multithreading is enabled
- winston.tools.blast.threads — number of threads for BLAST processing
- winston.tools.bowtie.threads — number of threads for bowtie2 processing
- winston.in_memory_db — load the coverage database into RAM at startup; makes contamination lookup faster, but requires a decent amount of memory
The default configuration can be found in the file config/settings.yml.default:
winston:
  in_memory_db: false
  paths:
    input: /path/to/folder/with/data/
    output: /path/to/output/folder
  hits_filtering:
    len_ratio: 70
    len_minimum: 100
  coverage_ratio:
    REGULAR: 1.1
    CLOSE: 0.04
  threads:
    multithreading: true
    count: 8
  tools:
    blast:
      threads: 8
    bowtie:
      threads: 8
The first step is to prepare the data for WinstonCleaner processing:
bin/prepare_data.py
The result will be stored in the folder specified in the winston.paths.output option.
After the preparation, the file types.csv can be inspected and edited.
It contains all possible combinations of dataset pairs and their types.
The default types are:
- CLOSE - taxonomically close organisms
- REGULAR - simple pair of organisms
Any number of custom types can also be specified in types.csv. Their names must be in upper case. For example:
predator,prey,95.0,LEFT_EATS_RIGHT
prey,predator,95.0,RIGHT_EATS_LEFT
In this case, the coverage ratio for each custom type must be specified in the winston.coverage_ratio section of the settings.yml file:
...
coverage_ratio:
  REGULAR: 1.1
  CLOSE: 0.04
  LEFT_EATS_RIGHT: 10
  RIGHT_EATS_LEFT: 0.1
...
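A quick consistency check is to make sure every pair type used in types.csv has an entry in winston.coverage_ratio. A sketch, assuming the types.csv columns are (left dataset, right dataset, identity, pair type):

```python
import csv
import io

# Example types.csv content, matching the custom-type example above.
types_csv = """predator,prey,95.0,LEFT_EATS_RIGHT
prey,predator,95.0,RIGHT_EATS_LEFT
"""

# Mirrors the winston.coverage_ratio section of settings.yml.
coverage_ratio = {"REGULAR": 1.1, "CLOSE": 0.04,
                  "LEFT_EATS_RIGHT": 10, "RIGHT_EATS_LEFT": 0.1}

# Collect pair types that appear in types.csv but have no configured ratio.
missing = sorted({row[3] for row in csv.reader(io.StringIO(types_csv))
                  if row[3] not in coverage_ratio})
```

If `missing` is non-empty, the run will use an unknown pair type, so the corresponding ratios should be added to settings.yml first.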
The contamination search itself is performed by running:
bin/find_contaminations.py
The results will be saved in the folder specified in the winston.paths.output option.
For each dataset, the following files will be produced.
- DATASET_NAME_clean.fasta — clean contigs
- DATASET_NAME_deleted.fasta — contaminated contigs
- DATASET_NAME_suspicious_hits.csv — all suspicious BLAST hits
- DATASET_NAME_contamination_sources.csv — sources of contaminations, with the following columns: source contamination dataset name, number of sequences
- DATASET_NAME_contaminations.csv — list of BLAST hits from which contaminations were detected
- DATASET_NAME_missing_coverage.csv — list of contig IDs without coverage
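A simple post-run sanity check is to count how many contigs ended up in the clean versus deleted FASTA files. A sketch (the toy file below stands in for a real DATASET_NAME_clean.fasta):

```python
import os
import tempfile

def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith(">"))

# Demo on a two-sequence toy file.
tmp = tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False)
tmp.write(">contig1\nACGT\n>contig2\nGGCC\n")
tmp.close()
clean_count = count_fasta_records(tmp.name)
os.unlink(tmp.name)
```

Comparing the counts of the _clean and _deleted files against the original transcriptome confirms that no contigs were lost during filtering.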
Planned features:
- Moving to Python 3
- Logging system
- Extended testing
- Export to graph format