diff VelvetOptimiser-2.1.7_modified/README @ 0:50ae1360fbbe default tip

Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
author konradpaszkiewicz
date Tue, 07 Jun 2011 18:07:56 -0400
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/VelvetOptimiser-2.1.7_modified/README	Tue Jun 07 18:07:56 2011 -0400
@@ -0,0 +1,325 @@
+NAME
+====
+
+VelvetOptimiser
+
+VERSION
+=======
+
+Version 2.1.7
+
+LICENCE
+=======
+
+Copyright 2009 - Simon Gladman - CSIRO.
+	
+simon.gladman@csiro.au
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+    
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+        
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+MA 02110-1301, USA.
+
+
+INTRODUCTION
+============
+
+The VelvetOptimiser is designed to run as a wrapper script for the Velvet
+assembler (Daniel Zerbino, EBI UK) and to assist with optimising the
+assembly.  It searches a supplied hash value range for the optimum,
+estimates the expected coverage and then searches for the optimum coverage
+cutoff.  It uses Velvet's internal mechanism for estimating insert lengths
+for paired end libraries.  It can optimise the assemblies by either the
+default optimisation condition or by a user supplied one.  It outputs the
+results to a subdirectory and records all its operations in a logfile.
+
+Expected coverage is estimated using the length weighted mode of the contig
+coverage in all active short columns of the stats.txt file.
+
+
+PREREQUISITES
+=============
+
+Velvet => 0.7.51
+Perl => 5.8.8
+BioPerl >= 1.4
+GNU utilities: grep sed free cut
+
+
+COMMAND LINE
+============
+	
+	VelvetOptimiser.pl [options] -f 'velveth input line'
+  
+  Options:
+  
+  --help          	This help.
+  --v|verbose+    	Verbose logging, includes all velvet output in the logfile. (default '0').
+  --s|hashs=i     	The starting (lower) hash value (default '19').
+  --e|hashe=i     	The end (higher) hash value (default '31').
+  --f|velvethfiles=s The file section of the velveth command line. (default '0').
+  --a|amosfile!   	Turn on velvet's read tracking and amos file output. (default '0').
+  --o|velvetgoptions=s Extra velvetg options to pass through.  eg. -long_mult_cutoff -max_coverage etc (default '').
+  --t|threads=i   	The maximum number of simulataneous velvet instances to run. (default '4').
+  --g|genomesize=f 	The approximate size of the genome to be assembled in megabases.
+					Only used in memory use estimation. If not specified, memory use estimation
+					will not occur. If memory use is estimated, the results are shown and then program exits. (default '0').
+  --k|optFuncKmer=s The optimisation function used for k-mer choice. (default 'n50').
+  --c|optFuncCov=s 	The optimisation function used for cov_cutoff optimisation. (default 'Lbp').
+  --p|prefix=s    	The prefix for the output filenames, the default is the date and time in the format DD-MM-YYYY-HH-MM_. (default 'auto').
+
+Advanced!: Changing the optimisation function(s)
+
+Velvet optimiser assembly optimisation function can be built from the following variables.
+	Lbp = The total number of base pairs in large contigs
+	Lcon = The number of large contigs
+	max = The length of the longest contig
+	n50 = The n50
+	ncon = The total number of contigs
+	tbp = The total number of basepairs in contigs
+Examples are:
+	'Lbp' = Just the total basepairs in contigs longer than 1kb
+	'n50*Lcon' = The n50 times the number of long contigs.
+	'n50*Lcon/tbp+log(Lbp)' = The n50 times the number of long contigs divided
+		by the total bases in all contigs plus the log of the number of bases
+		in long contigs.
+
+
+
+EXAMPLES
+========
+
+Find the best assembly for a lane of Illumina single-end reads, trying k-values between 27 and 31:
+
+% VelvetOptimiser.pl -s 27 -e 31 -f '-short -fastq s_1_sequence.txt'
+
+Print an estimate of how much RAM is needed by the above command, if we use eight threads at once,
+and we estimate our assembled genome to be 4.5 megabases long:
+
+% VelvetOptimiser.pl -s 27 -e 31 -f '-short -fastq s_1_sequence.txt' -g 4.5
+
+Find the best assembly for Illumina paired end reads just for k=31, using four threads (eg. quad core CPU), 
+but optimizing for N50 for k-mer length rather than sum of large contig sizes:
+
+% VelvetOptimiser.pl -s 31 -e 31 -f '-shortPaired -fasta interleaved.fasta' -t 4 --optFuncKmer 'n50'
+
+
+DETAILED OPTIONS
+================
+
+-h or --help
+
+	Prints the commandline help to STDOUT.
+
+-v or --verbose
+
+	Adds the full velveth and velvetg output to the logfile. (Handy for
+        looking at the insert lengths and sds that Velvet has chosen for each library.)
+
+-s or --hashs
+
+	Parameter type required: odd integer > 0 & <= the MAXKMERLENGTH velvet was compiled with.
+	Default: 19
+	
+	This is the lower end of the hash value range that the optimiser will search for the optimum.
+	If the supplied value is even, it will be lowered by 1.
+	If the supplied value is higher than MAXKMERLENGTH, it will be dropped to MAXKMERLENGTH.
+	
+-e or --hashe
+
+	Parameter type required: odd integer >= 'hashs' & <= the MAXKMERLENGTH velvet was compiled with.
+	Default: MAXKMERLENGTH
+	
+	This is the upper end of the hash value range that the optimiser will search for the optimum.
+	If the supplied value is even, it will be lowered by 1.
+	If the supplied value is higher than MAXKMERLENGTH, it will be dropped to MAXKMERLENGTH.
+	If the supplied value is lower than 'hashs' then it will be set to equal 'hashs'.
+	
+-f or --velvethfiles
+
+	Parameter type required: string with '' or ""
+	No default.
+	
+	This is a required parameter.  If this option is not specified, then the optimisers usage will be displayed.
+	
+	You need to supply everything you would normally supply velveth at this point except for the hash size and the 
+        directory name in the following format.  
+		
+	{[-file_format][-read_type] filename} repeated for as many read files as you have.
+
+
+	File format options:
+		-fasta
+		-fastq
+		-fasta.gz
+		-fastq.gz
+		-bam
+		-sam
+		-eland
+		-gerald
+
+	Read type options:
+		-short
+		-shortPaired
+		-short2
+		-shortPaired2
+		-long
+		-longPaired
+
+	At this stage the optimiser does not support more than the 2 standard CATEGORIES but this will be added in the near future.
+	
+	Examples:
+	
+	-f 'reads.fna'
+		reads.fna is short not paired and fasta.  (these are the defaults: -short and -fasta)
+		
+	-f '-shortPaired -fastq paired_reads.fastq -long long_reads.fna'
+		Two read files supplied, first one is a paired end fastq file and the second is a long single ended read file.
+		
+	-f '-shortPaired paired_reads_1.fna -shortPaired2 paired_reads_2.fna'
+		Two read files supplied, both are short paired fastas but come from two different libraries, therefore needing two different CATEGORIES.
+		
+	There is a fairly extensive checker built into the optimiser to check if the format of the string is correct.  However, it won't check the read files for their format (fasta, fastq, eland etc.)
+	
+-a or --amosfile
+
+	Turns on Velvets read tracking and amos file output.
+	This option is the same as specifying '-amos_file yes -read_trkg yes' in velvetg.  However, it will only be added to the final assembly and not to the intermediate ones.
+
+-o or --velvetgoptions
+
+	Parameter type required: string.
+	No default
+	
+	String should contain extra options to be passed to velvetg as required such as "-long_mult_cutoff 1" or "-max_coverage 50" etc.  Warning, there is no sanity check, so be careful.  The optimiser will crash if you give velvetg something it doesn't handle.
+	
+-t or --threads
+
+	Parameter type required: integer
+	
+	Specifies the maximum number of threads (simulataneous Velvet instances) to run.  It defaults to the number of CPUs in the current computer.
+
+-g or --genomesize
+
+	Parameter type required: float.
+	No default.
+	
+	This option will run the Optimiser's memory estimator.  It will estimate the memory required to run Velvet with the current -f parameter and number of threads.  Once the estimator has finsihed its calulations, it will display the required memory, make a recommendation and then exit the script.  This is useful for determining if you will have sufficient free RAM to run the assembly before you start.
+	You need to supply the approximate size of the genome you are assembling in mega bases.  For example, for a Salmonella genome I would use: -g 5
+	
+-k or --optFuncKmer
+
+	Parameter type required: string.
+	Default: 'n50'
+		
+	This option will change the function that the Optimiser uses to find the best hash value from the given range.  The default is to use the n50.  I have found this function to work for me better than the previous single optimisation function, but you may wish to change it.  A list of possible variables to use in your optimisation function and some examples are shown below.
+
+-c or --optFuncCov
+
+	Parameter type required: string.
+	Default: 'Lbp'
+		
+	This option will change the function that the Optimiser uses to find the best hash value from the given range.  The default is to use the number of basepairs in contigs greater than 1 kilobase in length.  I have found this function to work for me but you may wish to change it.  A list of possible variables to use in your optimisation function and some examples are shown below.
+		
+	Velvet optimiser assembly optimisation functions can be built from the following variables:
+		
+		Lbp = The total number of base pairs in large contigs
+		Lcon = The number of large contigs
+		max = The length of the longest contig
+		n50 = The n50
+		ncon = The total number of contigs
+		tbp = The total number of basepairs in contigs
+
+	Examples are:
+	
+		'Lbp' = Just the total basepairs in contigs longer than 1kb
+		'n50*Lcon' = The n50 times the number of long contigs.
+		'n50*Lcon/tbp+log(Lbp)' = The n50 times the number of long contigs divided
+			by the total bases in all contigs plus the log of the number of bases
+			in long contigs.
+
+	Be warned! The optimiser doesn't care what you supply in this string and will attempt to run anyway.  If you give it a nonsensical optimisation function be prepared to receive a nonsensical assembly!
+	
+-p or --prefix
+
+	Parameter type required: string
+	Default: The current date and time in the format "DD-MM-YYYY-HH-MM-SS_"
+	
+	Names the logfile and the output directory with whatever prefix is supplied followed by "_logfile.txt" for the logfile and "_data_k" where k is the optimum hash value for the ouput data directory.
+
+
+BUGS
+====
+
+* None that I am aware of.
+
+
+CHANGE LOG
+==========
+
+Changes since Version 2.0:
+
+2.0.1:
+
+*	Added Mikael Brandstrom Durling's code to get free_mem and num_cpus for the Mac.
+*	Fixed a bug where if no assembly score was calculable the program crashed.  It now sets the assembly score to 0 instead.
+
+2.1.0:
+
+*	Added two stage optimisation functions.  First one is used to optimise for hash value and second to optimise for cov_cutoff.  Both are user definable and default to "n50" for k-mer size and "Lbp" for cov_cutoff.
+*	Above necessitated change in command line option letters to minimise confusion.  first stage opt. func is -k for k-mer size and second is -c for cov_cutoff
+*	Fixed a bug in Utils.pm where the exp_cov was only calculated for the first two categories and left out the rest.  Now uses all short read categories.
+*	Added a command line option -o to pass through extra commands to velvetg (such as long_mult_cutoff etc.)  NB: No sanity checking here!
+
+2.1.1:
+
+*	Fixed a bug where prefixs containing '-' or '.' would cause the script to fail.
+
+2.1.2:
+
+*	Fixed a bug where estExpCov would try and incorporate columns in the stats.txt file that contained "Inf" or "N/A" into the calculation and thereby crash.
+
+2.1.3:
+
+*	Now gives a nice warning when optimisation function returns undef or 0, instead of cryptic perl error message.
+
+2.1.4:
+
+*	Fixed another bug in estExpCov in Utils.pm so it now doesn't count stats with coverage < 2 and contigs of less than 3 * kmer size - 1.
+
+2.1.5:
+
+*	Added support for velveth's new input file types. (bam, sam and raw) and attempted to future proof it..
+
+2.1.6
+
+*	Now prints Velvet calculated insert sizes and standard deviations in assembly summaries, both in the logfile and on screen
+
+2.1.7
+
+*	Takes new velveth help format into account.  Thanks to Alexie Papanicolaou - CSIRO for the patch.
+
+TO DO
+=====
+	
+	* Add the number of N's in the assembly output to the list of variables available to the optimisation function.  
+
+
+CONTACT
+=======
+
+Simon Gladman <simon.gladman@csiro.au>
+
+
+
+