olsonanl/FastOrtho
OrthoMCL starts with protein sequences grouped by genome and generates
ortholog groups by running an all-against-all blast of the sequences and
converting the blast hits into input for the mcl program. FastOrtho is a
reimplementation of OrthoMCL that does not require the use of databases or
Perl.

To create the FastOrtho executable, type make in the src directory.

FastOrtho has many input options (listed below). They are probably most
easily configured by using the included SetFast.jar GUI to create an option
file that can be given to FastOrtho as its only input. (You may need to run
java -jar SetFast.jar to start the GUI.)

FastOrtho Command Line Options (see below for sample use)

--option_file file_name
    Used to read options from a file. Expects at most one option per line
    in the file.

//////// OPERATIONAL INPUT PARAMETER copied from OrthoMCL //////

--pv_cutoff maximum_e_value
    Used to discard blast hits with large e-values.
    default = 1e-5

--pi_cutoff minimum_percent_identity
    Used to discard blast hits with small percent identity values.
    Percent identity for a query-subject pair is based on the weighted mean
    of all blast lines for that pair. The weight of a line is based on the
    length of its alignment section.
    default = 0.0 (does not generate any discards)

--pmatch_cutoff minimum_percent_matching
    Used to discard blast hits in which too small a percent of the protein
    sequences are involved in the blast alignments. The percent applies to
    the shorter of the query and subject sequences.
    default = 0.0 (does not generate any discards)

--maximum_weight numeric_value
    Weights for mcl are computed using -log10(e-value) from blast hits,
    which has no meaning for an e-value of 0.0. This value is used in place
    of -log10(0.0).
    default = 316.0
    316.0 is larger than -log10(x) for any x larger than 0 that can be
    stored as a normalized double-precision floating point value.

--inflation numeric_value
    Provides a value for the -I option to use when FastOrtho calls the mcl
    program.
    default = 1.5

--blast_cpus numeric_value
    Only used when FastOrtho handles launching NCBI blast. Provides a value
    for the blastall -a option or the blastp -num_threads option.
    default = 1

--blast_b numeric_value
    Only used when FastOrtho handles launching NCBI blast. Provides a value
    for the blastall -b option or the blastp -num_descriptions option.
    default = 1000

--blast_e numeric_value
    Only used when FastOrtho handles launching NCBI blast. Provides a value
    for the blastall -e option or the blastp -evalue option.
    default = 1e-5

--blast_v numeric_value
    Only used when FastOrtho handles launching NCBI blast. Provides a value
    for the blastall -v option or the blastp -num_alignments option.
    default = 1000

--only_fastas
    Allows FastOrtho to be used to create a combined protein sequence file
    for use by an NCBI blast run outside of FastOrtho. It also causes a
    *.glg file to be generated for later use with the --gg_file option.

//////// FILE LOCATION SPECIFICATION //////

--single_genome_fasta file_path
    Describes a protein sequence file to be used as input for NCBI blast
    processing. All sequences in this file will be considered members of a
    genome with a name derived from the file_path value. Even if NCBI blast
    processing is not being performed, this option is useful when no
    appropriate input is available for the --gg_file option.

--mixed_genome_fasta file_path
    Describes a protein sequence file to be used as input for NCBI blast
    processing. The genome name to associate with a protein sequence will
    be derived from the text enclosed by [] at the end of the sequence's
    > name line. Even if NCBI blast processing is not being performed, this
    option is useful when no appropriate input is available for the
    --gg_file option.

--blast_file file_path
    Allows FastOrtho to use pre-computed NCBI blast output generated in the
    legacy blastall -m 8 or current blastp -outfmt 6 output format.

--bpo_file file_path
    Allows FastOrtho to use *.bpo files that were generated by classic
    OrthoMCL from NCBI blast output.

--gg_file file_path
    Classic OrthoMCL generates *.gg files to detail the genomes involved in
    a project and which gene names belong to which genomes. FastOrtho
    generates similar *.glg files. FastOrtho needs this membership
    information to operate. If no such files are available, FastOrtho can
    generate the information from a list of the protein sequence files used
    to prepare the input for NCBI blast.

--working_directory directory_path
    FastOrtho generates several files during its workflow and needs a
    directory where it has permission to create these files.

--project_name file_prefix
    All temporary files generated by FastOrtho will begin with this value
    and be placed in the working_directory.

--formatdb_path file_path
    Only used when FastOrtho is tasked with running NCBI blast. Provides
    the text for running the makeblastdb executable or the legacy formatdb
    executable. Not required if the executable will run without a path
    specification.

--blastall_path file_path
    Only used when FastOrtho is tasked with running NCBI blast. Provides
    the text for running the blastp executable or the legacy blastall
    executable. Not required if the executable will run without a path
    specification.

--mcl_path file_path
    Allows FastOrtho to apply its input to the mcl program. Not required if
    a simple mcl will execute from the command line.

--result_file file_path
    Specifies where FastOrtho should store its final results.

//////// SUPPORT FOR BLAST HITS WITH NON-STANDARD COLUMN ARRANGEMENTS //////

--query_index numeric_value
    FastOrtho expects blast hit data in column format. This value specifies
    the column from which to read the query name. This option allows the
    use of files that are similar to those produced by NCBI blast but with
    different column placements.
    default = 0 if FastOrtho is running NCBI blast or using --blast_file
    input.
    default = 1 if FastOrtho is using --bpo_file input.

--subject_index numeric_value
    see --query_index with defaults 1, 3 instead of 0, 1

--e_value_index numeric_value
    see --query_index with defaults 10, 5 instead of 0, 1

--percent_idenity_index numeric_value
    see --query_index with defaults 2, 6 instead of 0, 1

--alignment_length_index numeric_value
    see --query_index with default 3 instead of 0
    (does not apply to --bpo_file)

--query_start_index numeric_value
    see --query_index with default 6 instead of 0
    (does not apply to --bpo_file)

--query_end_index numeric_value
    see --query_index with default 7 instead of 0
    (does not apply to --bpo_file)

--query_length_index numeric_value
    see --query_index with default x, 2 instead of 0, 1
    (only applies to --bpo_file)

--subject_start_index numeric_value
    see --query_index with default 8 instead of 0
    (does not apply to --bpo_file)

--subject_end_index numeric_value
    see --query_index with default 9 instead of 0
    (does not apply to --bpo_file)

--subject_length_index numeric_value
    see --query_index with default x, 4 instead of 0, 1
    (only applies to --bpo_file)

--mapping_index numeric_value
    see --query_index with default x, 7 instead of 0, 1
    (only applies to --bpo_file)

--split_char single_character
    Specifies the character used to separate columns in the blast hit file.
    default = tab if FastOrtho is running NCBI blast or using --blast_file
    input.
    default = ; if FastOrtho is using --bpo_file input.

--use_tab_split
    Equivalent to --split_char single_character where single_character =
    tab.

/////////// SPECIAL FLAGS ///////////////

--match_OrthoMcl
    Ensures that FastOrtho uses the exact logic of classic OrthoMCL. In
    classic OrthoMCL, discarding a paralog blast hit because of low percent
    identity will block all subsequent paralog hits in the same query
    block, even if they pass all the other blast hit filters. This did not
    seem reasonable and is not the default behavior of FastOrtho.

--legacy_blast
    Only used when FastOrtho handles launching NCBI blast. Tells FastOrtho
    to use formatdb & blastall instead of the defaults makeblastdb &
    blastp. When using legacy NCBI blast, this option needs to be included
    even if --formatdb_path and --blastall_path have been specified, since
    the legacy programs use different strings for specifying option values.

/////////////////////////////// SAMPLE EXAMPLES OF TEXT LINES FOR --option_file

//// Smallest option set
/// Assumes the $PATH environment variable will provide the locations of
/// mcl and the NCBI programs makeblastdb and blastp
/// final result will be found in /home/mscott/projects/samples/version1.end
--mixed_genome_fasta /home/mscott/fasta/samples.faa
--working_directory /home/mscott/projects/samples
--project_name version1

//// Smallest option set where a blast file has been provided
/// samples.faa specification is required to link proteins to their genome
/// final result will be found in /home/mscott/projects/samples/version2.end
--mixed_genome_fasta /home/mscott/fasta/samples.faa
--blast_file /home/mscott/projects/samples/version1.out
--working_directory /home/mscott/projects/samples
--project_name version2

/// short option set using --single_genome_fasta instead of --mixed_genome_fasta
// final result will be found in /home/mscott/projects/samples/version3.end
// genome names will consist of organism_A, organism_B, and organism_C
--single_genome_fasta /home/mscott/fasta/organism_A.faa
--single_genome_fasta /home/mscott/fasta/organism_B.faa
--single_genome_fasta /home/mscott/fasta/organism_C.faa
--working_directory /home/mscott/projects/samples
--project_name version3
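
For anyone who prefers a script to the SetFast.jar GUI, the option sets above can also be generated programmatically. The following is only a minimal sketch in Python, not part of FastOrtho. The helper names build_option_lines and write_option_file are hypothetical, and the sketch assumes that the executable built by make is named FastOrtho and is reachable through $PATH. It writes one option per line, as --option_file expects, and returns the command line without running it.

```python
from pathlib import Path


def build_option_lines(fasta_paths, working_directory, project_name):
    """Return option-file lines, one option per line (hypothetical helper)."""
    # One --single_genome_fasta entry per genome, as in the third sample.
    lines = [f"--single_genome_fasta {path}" for path in fasta_paths]
    lines.append(f"--working_directory {working_directory}")
    lines.append(f"--project_name {project_name}")
    return lines


def write_option_file(option_path, lines):
    """Write the option lines and return the command that would use them."""
    Path(option_path).write_text("\n".join(lines) + "\n")
    # Assumption: the executable is called FastOrtho and is on $PATH.
    return ["FastOrtho", "--option_file", str(option_path)]


lines = build_option_lines(
    [
        "/home/mscott/fasta/organism_A.faa",
        "/home/mscott/fasta/organism_B.faa",
        "/home/mscott/fasta/organism_C.faa",
    ],
    "/home/mscott/projects/samples",
    "version3",
)
print("\n".join(lines))
```

The list returned by write_option_file can be passed to subprocess.run once the option file is in place. Whether FastOrtho accepts paths that contain spaces in an option file is not stated above, so the sketch makes no attempt to quote them.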