olsonanl/FastOrtho

orthomcl starts with protein sequences grouped by genome and generates
ortholog groups by running the mcl program on input derived from an
all-by-all blast of the sequences.

FastOrtho is a reimplementation of the orthomcl program that does not require
the use of databases or perl.

To create the FastOrtho executable, type make in the src directory.


FastOrtho has many input options (listed below). They are probably
most easily configured by using the included SetFast.jar GUI to create
an option file that can be given to FastOrtho as its only input.
To start the GUI, you may need to run
java -jar SetFast.jar


FastOrtho Command line Options  (see below for sample use)

	--option_file file_name
		Used to read options from a file.
		Expects at most one option per line in the file.

////////  OPERATIONAL INPUT PARAMETER copied from OrthoMCL //////
	--pv_cutoff  maximum_e_value
		Used to discard blast hits with large e-values.
		default = 1e-5
		
	--pi_cutoff minimum_percent_identity
		Used to discard blast hits with small percent identity values.
		Percent identity for a query subject pair is the weighted mean
		over all blast lines for that pair.
		The weight of a line is based on the length of its alignment section.
		default = 0.0  (does not generate any discards)
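The weighted mean described for --pi_cutoff can be sketched in Python as follows. This is a minimal illustration, not FastOrtho's own code: the function names and the (percent_identity, alignment_length) tuple layout are hypothetical, and using the alignment length directly as the weight is an assumption based on the description above.

```python
def weighted_percent_identity(blast_lines):
    """Length-weighted mean percent identity for one query subject pair.

    blast_lines: list of (percent_identity, alignment_length) tuples,
    one tuple per blast line (hypothetical layout).
    """
    total_length = sum(length for _, length in blast_lines)
    if total_length == 0:
        return 0.0
    weighted_sum = sum(pid * length for pid, length in blast_lines)
    return weighted_sum / total_length


def passes_pi_cutoff(blast_lines, pi_cutoff=0.0):
    """Keep the pair unless its weighted identity falls below the cutoff (assumed rule)."""
    return weighted_percent_identity(blast_lines) >= pi_cutoff


# Two blast lines for the same pair: a short strong one and a long weak one.
lines = [(90.0, 100), (50.0, 300)]
mean_identity = weighted_percent_identity(lines)
```

The longer alignment dominates the mean, so a pair with one short high-identity line and one long low-identity line can still be discarded by a moderate cutoff.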
		
	--pmatch_cutoff minimum_percent_matching
		Used to discard blast hits in which too small a percent of the
		protein sequences are involved in the blast alignments.
		Percent applies to shorter of the query and subject sequences.
		default = 0.0 (does not generate any discards)
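One plausible reading of the --pmatch_cutoff rule is sketched below in Python. It is an assumption, not FastOrtho's confirmed logic: overlapping alignment sections are merged so that no position is counted twice, and the covered fraction is measured on whichever sequence is shorter. The function names and the 1-based inclusive (start, end) interval layout are hypothetical.

```python
def covered_length(intervals):
    """Count positions covered by 1-based inclusive (start, end) intervals."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # fully inside a region already counted
        covered += end - max(start, current_end + 1) + 1
        current_end = end
    return covered


def percent_match(query_intervals, subject_intervals, query_length, subject_length):
    """Percent of the shorter sequence involved in the alignments (assumed rule)."""
    if query_length <= subject_length:
        return 100.0 * covered_length(query_intervals) / query_length
    return 100.0 * covered_length(subject_intervals) / subject_length


# Two overlapping alignments on a 200-residue query against a 300-residue subject.
value = percent_match([(1, 100), (50, 150)], [(1, 150)], 200, 300)
```

The sequence lengths are inputs here because they are not part of the standard tabular blast columns.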
		
	--maximum_weight numeric_value
		Weights for mcl are computed using -log10(e-value) from blast hits,
		which has no meaning for an e-value of 0.0.
		This value is used in place of -log10(0.0).
		default = 316.0
		316.0 is larger than -log10(x) for any x larger than 0
		that can be stored as a normalized double floating point primitive.
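The role of --maximum_weight can be illustrated with a short Python sketch. The function name mcl_weight is hypothetical, and the sketch covers only the -log10 step described above, not any further processing FastOrtho may apply before calling mcl.

```python
import math


def mcl_weight(e_value, maximum_weight=316.0):
    """Return -log10(e_value), substituting maximum_weight when e_value is 0.0."""
    if e_value == 0.0:
        return maximum_weight
    return -math.log10(e_value)


weight_for_zero = mcl_weight(0.0)
weight_for_small = mcl_weight(1e-300)
```

With the default, a hit reported with an e-value of 0.0 receives a weight above that of a hit with a very small non-zero e-value such as 1e-300.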

	--inflation numeric_value
		Provides a value for the -I option to use when FastOrtho
		calls the mcl program.
		default = 1.5
		
	--blast_cpus numeric_value
		Only used when FastOrtho handles launching NCBI blast.  
		Provides a value for blastall -a option or the blastp -num_threads.
		default = 1
		
	--blast_b numeric_value
		Only used when FastOrtho handles launching NCBI blast.  
		Provides a value for blastall -b option or the blastp -num_descriptions.
		default = 1000
		
	--blast_e numeric_value
		Only used when FastOrtho handles launching NCBI blast.  
		Provides a value for blastall -e option or the blastp -evalue.
		default = 1e-5
		
	--blast_v numeric_value
		Only used when FastOrtho handles launching NCBI blast.  
		Provides a value for blastall -v option or the blastp -num_alignments.
		default = 1000
	
	--only_fastas
		Allows FastOrtho to be used to create a combined protein sequence
		file for use by an NCBI blast called outside of FastOrtho.
		It also results in a *.glg file being generated for later use
		with the --gg_file option.

////////  FILE LOCATION SPECIFICATION //////		
	--single_genome_fasta file_path
		Describes a protein sequence file to be used as input for 
		NCBI blast processing.  All sequences in this file will be
		considered members of a genome with a name derived from the
		file_path value. Even if NCBI blast processing is not being
		performed, this option is useful when no appropriate input
		is available for the --gg_file option.
		  
		
	--mixed_genome_fasta file_path
		Describes a protein sequence file to be used as input for 
		NCBI blast processing.  The genome name to associate with
		a protein sequence will be derived from the text enclosed by
		[] at the end of the sequence's > name line.
		Even if NCBI blast processing is not being
		performed, this option is useful when no appropriate input
		is available for the --gg_file option.
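The two ways of assigning a genome name can be sketched in Python. Both helpers are hypothetical: the regular expression is one reasonable way to read "text enclosed by [] at the end" of a > name line, and taking the file name without its extension is an assumption that matches the organism_A style names in the sample option files.

```python
import re
from pathlib import PurePosixPath

# Text inside the last [] pair, allowing only trailing whitespace after it.
_TRAILING_BRACKETS = re.compile(r"\[([^\[\]]+)\]\s*$")


def genome_from_header(header_line):
    """Genome name for --mixed_genome_fasta style input, or None if absent."""
    match = _TRAILING_BRACKETS.search(header_line)
    return match.group(1) if match else None


def genome_from_path(file_path):
    """Genome name for --single_genome_fasta style input (assumed: file stem)."""
    return PurePosixPath(file_path).stem


mixed_name = genome_from_header(">gi|123 hypothetical protein [Escherichia coli]")
single_name = genome_from_path("/home/mscott/fasta/organism_A.faa")
```

A header without a trailing [] section yields None in this sketch; how FastOrtho itself treats such sequences is not stated here.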

	--blast_file file_path
		Allows FastOrtho to use pre-computed NCBI blast output generated
		using legacy blastall -m 8 or current blastp -outfmt 6 output format.
		
	--bpo_file file_path
		Allows FastOrtho to use *.bpo files which were generated by classic
		orthomcl using NCBI blast output.
		
	--gg_file file_path
		Classic orthomcl generates *.gg files to detail the genomes
		involved in a project and which gene names belong to which genomes.
		FastOrtho generates similar *.glg files.  FastOrtho needs this
		membership information to operate.  If no such files are available
		FastOrtho can generate the information from a list of the
		protein sequence files used to prepare the input for NCBI blast.
		
	--working_directory directory_path
		FastOrtho generates several files during its work flow and needs
		a directory where it has permission to create these files.
		
	--project_name file_prefix
		All temporary files generated by FastOrtho will begin with
		this value and be placed in the working_directory.
		
	--formatdb_path file_path
		Only used when FastOrtho is tasked with running NCBI blast.
		Provides text for running makeblastdb executable or
		legacy formatdb executable.  Not required if executable 
		will run without a path specification.
		
	--blastall_path file_path
		Only used when FastOrtho is tasked with running NCBI blast.
		Provides text for running blastp executable or
		legacy blastall executable.  Not required if executable 
		will run without a path specification.
	
	--mcl_path file_path
		Allows FastOrtho to apply its input to the mcl program.
		Not required if simple mcl will execute from the command line.
		
	--result_file file_path
		Specifies where FastOrtho should store its final results.

	
////////  SUPPORT FOR BLAST HITS WITH NON-STANDARD COLUMN ARRANGEMENTS //////	
	--query_index numeric_value
		FastOrtho expects blast hit data in column format.  This value
		specifies where to read the query name.
		default = 0   
			if FastOrtho is running NCBI blast or using
				--blast_file input.
		default = 1
			if FastOrtho is using --bpo_file input
		This option allows the use of files that are similar to those
		produced by NCBI blast but with different column placements
			
	--subject_index numeric_value
		see --query_index with defaults 1, 3 instead of 0, 1
	
	--e_value_index numeric_value
		see --query_index with defaults 10, 5 instead of 0, 1
	
	--percent_idenity_index numeric_value
		see --query_index with defaults 2, 6 instead of 0, 1
		
	--alignment_length_index numeric_value
		see --query_index with default 3 instead of 0
		 (does not apply to --bpo_file)
		
	--query_start_index numeric_value
		see --query_index with default 6 instead of 0
		 (does not apply to --bpo_file)
		
	--query_end_index numeric_value
		see --query_index with default 7 instead of 0
		 (does not apply to --bpo_file)
	
	--query_length_index numeric_value
		see --query_index with default x, 2 instead of 0, 1
		 (Only applies to --bpo_file)
		
	--subject_start_index numeric_value
		see --query_index with default 8 instead of 0
		 (does not apply to --bpo_file)
	
	--subject_end_index numeric_value
		see --query_index with default 9 instead of 0
		 (does not apply to --bpo_file)
	
	--subject_length_index numeric_value
		see --query_index with default x, 4 instead of 0, 1
		 (Only applies to --bpo_file)
				
	--mapping_index numeric_value
		see --query_index with default x, 7 instead of 0, 1
		 (Only applies to --bpo_file)
	
	--split_char single_character
		Specifies the character used to separate columns in blast hit file.
		default = tab   
			if FastOrtho is running NCBI blast or using
				--blast_file input.
		default = ;
			if FastOrtho is using --bpo_file input
		
	--use_tab_split
		Equivalent to --split_char single_character where
		 single_character = tab
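The column options above amount to a lookup table from field name to column position. The Python sketch below reads one tabular blast line using the defaults listed for --blast_file input. The names BLAST_DEFAULTS and parse_hit are hypothetical, and the sample line is made up for illustration.

```python
# Default column indexes for --blast_file input, taken from the option list above.
BLAST_DEFAULTS = {
    "query": 0,
    "subject": 1,
    "percent_identity": 2,
    "alignment_length": 3,
    "query_start": 6,
    "query_end": 7,
    "subject_start": 8,
    "subject_end": 9,
    "e_value": 10,
}

TEXT_FIELDS = {"query", "subject"}
FLOAT_FIELDS = {"percent_identity", "e_value"}


def parse_hit(line, indexes=None, split_char="\t"):
    """Split one blast hit line and pick fields by column index."""
    indexes = BLAST_DEFAULTS if indexes is None else indexes
    columns = line.rstrip("\r\n").split(split_char)
    hit = {}
    for name, index in indexes.items():
        raw = columns[index]
        if name in TEXT_FIELDS:
            hit[name] = raw
        elif name in FLOAT_FIELDS:
            hit[name] = float(raw)
        else:
            hit[name] = int(raw)  # lengths and start/end positions
    return hit


line = "geneA\tgeneB\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t400"
hit = parse_hit(line)

# A file with query and subject columns swapped only needs two overrides.
swapped = parse_hit(line, dict(BLAST_DEFAULTS, query=1, subject=0))
```

Overriding a single entry of the table plays the same role as passing one of the index options, and split_char corresponds to --split_char.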
		 
///////////  SPECIAL FLAGS ///////////////
	--match_OrthoMcl
		Ensures that FastOrtho uses the exact logic of classic orthomcl.
		In classic orthomcl, discarding a paralog blast hit because of
		low percent identity will block all subsequent paralog hits in
		the same query block even if they pass all the other blast
		hit filters.  This did not seem reasonable and is not the
		default behavior of FastOrtho.
		
		
	--legacy_blast
		Only used when FastOrtho handles launching NCBI blast.
		Tells FastOrtho to use formatdb & blastall instead of the
		defaults makeblastdb & blastp.  When using legacy NCBI blast
		this option needs to be included even if --formatdb_path 
		and --blastall_path have been specified since the legacy programs
		use different strings for specifying option values.


///////////////////////////////   SAMPLE EXAMPLES OF TEXT LINES FOR --option_file

//// Smallest option set
///      Assumes the $PATH environment variable will provide the locations of
///      mcl and the NCBI programs makeblastdb and blastp
///      final result will be found in /home/mscott/projects/samples/version1.end
--mixed_genome_fasta  /home/mscott/fasta/samples.faa
--working_directory /home/mscott/projects/samples
--project_name version1

//// Smallest option set where a blast file has been provided
///      samples.faa specification is required to link proteins to their genome
///      final result will be found in /home/mscott/projects/samples/version2.end
--mixed_genome_fasta  /home/mscott/fasta/samples.faa
--blast_file /home/mscott/projects/samples/version1.out
--working_directory /home/mscott/projects/samples
--project_name version2


/// short option set using --single_genome_fasta instead of --mixed_genome_fasta
//    final result will be found in /home/mscott/projects/samples/version3.end
//      genome names will consist of organism_A, organism_B, and organism_C 
--single_genome_fasta /home/mscott/fasta/organism_A.faa
--single_genome_fasta /home/mscott/fasta/organism_B.faa
--single_genome_fasta /home/mscott/fasta/organism_C.faa
--working_directory /home/mscott/projects/samples
--project_name version3
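For scripted runs, an option file like the ones above can also be produced without the GUI. The Python sketch below builds the text with one option per line. The helper name option_file_text is hypothetical, and the options are given as a list of pairs because --single_genome_fasta may appear more than once.

```python
def option_file_text(options):
    """Render (name, value) pairs as option file lines, one option per line.

    A value of None renders a flag with no argument, such as --legacy_blast.
    """
    lines = []
    for name, value in options:
        if value is None:
            lines.append(f"--{name}")
        else:
            lines.append(f"--{name} {value}")
    return "\n".join(lines) + "\n"


options = [
    ("single_genome_fasta", "/home/mscott/fasta/organism_A.faa"),
    ("single_genome_fasta", "/home/mscott/fasta/organism_B.faa"),
    ("single_genome_fasta", "/home/mscott/fasta/organism_C.faa"),
    ("working_directory", "/home/mscott/projects/samples"),
    ("project_name", "version3"),
]
text = option_file_text(options)
```

The resulting text can be written to a file of your choosing and passed to FastOrtho with --option_file.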
