diff filter_spades_repeats.xml @ 0:ddd1e15df88c draft
Uploaded
author | aaronpetkau |
---|---|
date | Sat, 04 Jul 2015 09:45:30 -0400 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/filter_spades_repeats.xml Sat Jul 04 09:45:30 2015 -0400
@@ -0,0 +1,160 @@
+<tool id="filter_spades_repeat" name="Filter SPAdes repeats" version="1.0.0">
+  <description>Remove short and repeat contigs/scaffolds</description>
+  <requirements>
+    <requirement type="package" version="5.18.1">perl</requirement>
+  </requirements>
+  <command interpreter="perl">nml_filter_spades_repeats.pl -i $fasta_input -t $tab_input -c $cov_cutoff -r $rep_cutoff -l $len_cutoff -o $output_with_repeats -u $output_without_repeats -n $repeat_sequences_only -e $cov_len_cutoff -f $discarded_sequences -s $summary
+  </command>
+
+  <inputs>
+    <param name="fasta_input" type="data" format="fasta" label="Contigs or scaffolds file" help="Contigs/scaffolds output file from SPAdes" />
+    <param name="tab_input" type="data" format="tabular" label="Stats file" help="The stats file corresponding to the fasta file selected above" />
+    <param name="cov_cutoff" type="float" value="0.33" min="0" label="Coverage cut-off ratio" help="Cutoff expressed as a ratio of the average coverage. For example: if the average coverage is 100 and a coverage cut-off ratio of 0.5 is used, then any contigs with coverage lower than 50 will be eliminated." />
+    <param name="rep_cutoff" type="float" value="1.75" min="0" label="Repeat cut-off ratio" help="Coverage ratio cutoff used to identify repeat contigs. For example: if the average coverage is 100 and a repeat cut-off ratio of 1.75 is used, then any contigs with coverage greater than or equal to 175 will be marked as repeats." />
+    <param name="len_cutoff" type="integer" value="1000" min="0" label="Length cut-off" help="Contigs with length under the chosen cut-off will be eliminated." />
+    <param name="cov_len_cutoff" type="integer" value="5000" min="0" label="Length for average coverage calculation" help="Only contigs above this length will be used to calculate the average coverage." />
+    <param name="keep_leftover" type="select" label="Print out a fasta file containing the discarded sequences?">
+      <option value="yes">Yes</option>
+      <option value="no">No</option>
+    </param>
+    <param name="print_summary" type="select" label="Print out a summary of all the results?">
+      <option value="yes">Yes</option>
+      <option value="no">No</option>
+    </param>
+  </inputs>
+  <outputs>
+    <data format="fasta" name="output_with_repeats" label="Filtered sequences (with repeats)" />
+    <data format="fasta" name="output_without_repeats" label="Filtered sequences (no repeats)" />
+    <data format="fasta" name="repeat_sequences_only" label="Repeat sequences" />
+    <data format="fasta" name="discarded_sequences" label="Discarded sequences">
+      <filter>keep_leftover == "yes"</filter>
+    </data>
+    <data format="txt" name="summary" label="Results summary">
+      <filter>print_summary == "yes"</filter>
+    </data>
+  </outputs>
+
+
+
+
+  <help>
+================
+**What it does**
+================
+
+Using the output of SPAdes (a fasta file and a stats file, from either contigs or scaffolds), this tool filters the fasta file, discarding all sequences that are under a given length or under a calculated coverage cutoff. Repeat contigs are detected based on coverage.
+
+--------------------------------------
+
+==========
+**Output**
+==========
+
+- **Filtered sequences (with repeats)**
+  - Will contain the filtered contigs/scaffolds including the repeats. These are the sequences that passed the length and minimum coverage cutoffs.
+  - For workflows, this output is named **output_with_repeats**
+- **Filtered sequences (no repeats)**
+  - Will contain the filtered contigs/scaffolds excluding the repeats. These are the sequences that passed the length, minimum coverage and repeat cutoffs.
+  - For workflows, this output is named **output_without_repeats**
+- **Repeat sequences**
+  - Will contain the repeated contigs/scaffolds only. These are the sequences that were excluded for having high coverage (determined by the repeat cutoff).
+  - For workflows, this output is named **repeat_sequences_only**
+- **Discarded sequences**
+  - If selected, will contain the discarded sequences. These are the sequences that fell below the length or minimum coverage cutoff and were discarded.
+  - For workflows, this output is named **discarded_sequences**
+- **Results summary** : If selected, will contain a summary of all the results.
+
+------------------------------------------
+
+============
+**Example**
+============
+
+Stats file input:
+------------------
+
+    +------------+------------+------------+
+    |#name       |length      |coverage    |
+    +============+============+============+
+    |NODE_1      |2500        |15.5        |
+    +------------+------------+------------+
+    |NODE_2      |102         |3.0         |
+    +------------+------------+------------+
+    |NODE_3      |1300        |50.0        |
+    +------------+------------+------------+
+    |NODE_4      |1000        |2.3         |
+    +------------+------------+------------+
+    |NODE_5      |5000        |14.3        |
+    +------------+------------+------------+
+    |NODE_6      |450         |25.2        |
+    +------------+------------+------------+
+
+User Inputs:
+------------
+
+- Coverage cut-off ratio = 0.33
+- Repeat cut-off ratio = 1.75
+- Length cut-off = 500
+- Length for average coverage calculation = 1000
+
+Calculations:
+-------------
+
+**Average coverage will be calculated based on contigs with length >= 1000bp**
+
+
+- Average coverage = (15.5 + 50.0 + 2.3 + 14.3) / 4 = 20.5
+
+**Contigs with coverage below 0.33 times the average coverage will be eliminated.**
+
+- Coverage cut-off = 20.5 * 0.33 = 6.8
+
+**Contigs with high coverage (at least 1.75 times the average coverage) are considered to be repeated contigs.**
+
+- Repeat cut-off = 20.5 * 1.75 = 35.9
+
+**The number of copies is calculated by dividing the sequence coverage by the average coverage.**
+
+- Number of repeats for NODE_3 = 50.0 / 20.5 = 2.4, reported as 2 copies
+
+
+Output (in fasta format):
+--------------------------
+
+**Filtered sequences (with repeats)**
+
+::
+
+    >NODE_1
+    >NODE_3 (2 copies)
+    >NODE_5
+
+**Filtered sequences (no repeats)**
+
+::
+
+    >NODE_1
+    >NODE_5
+
+**Repeat sequences**
+
+::
+
+    >NODE_3 (2 copies)
+
+**Discarded sequences**
+
+::
+
+    >NODE_2
+    >NODE_4
+    >NODE_6
+
+---------------------------------------
+
+
+
+
+</help>
+
+</tool>
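
The worked example in the tool's help text can be reproduced with a short Python sketch. This is not the logic of `nml_filter_spades_repeats.pl`, which is not part of this changeset. The `Contig` class and `filter_contigs` function are hypothetical names, and two behaviours are assumptions chosen to match the example: the averaging length threshold is treated as inclusive (the example says ">= 1000bp"), and the copy count is rounded to the nearest integer.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int
    coverage: float


def filter_contigs(contigs, cov_ratio=0.33, rep_ratio=1.75,
                   len_cutoff=1000, cov_len_cutoff=5000):
    """Classify contigs as described in the help text (illustrative only)."""
    # Average coverage is taken over sufficiently long contigs only.
    # Assumption: the length threshold is inclusive, as in the help example.
    long_contigs = [c for c in contigs if c.length >= cov_len_cutoff]
    if not long_contigs:
        raise ValueError("no contig reaches the averaging length threshold")
    avg = sum(c.coverage for c in long_contigs) / len(long_contigs)

    cov_cutoff = avg * cov_ratio  # below this coverage: discarded
    rep_cutoff = avg * rep_ratio  # at or above this coverage: repeat

    result = {"with_repeats": [], "without_repeats": [],
              "repeats": [], "discarded": []}
    for c in contigs:
        if c.length < len_cutoff or c.coverage < cov_cutoff:
            result["discarded"].append(c.name)
        elif c.coverage >= rep_cutoff:
            # Assumption: copy number is rounded to the nearest integer.
            copies = round(c.coverage / avg)
            label = f"{c.name} ({copies} copies)"
            result["with_repeats"].append(label)
            result["repeats"].append(label)
        else:
            result["with_repeats"].append(c.name)
            result["without_repeats"].append(c.name)
    return avg, cov_cutoff, rep_cutoff, result


# Stats table and user inputs from the help example.
example = [
    Contig("NODE_1", 2500, 15.5),
    Contig("NODE_2", 102, 3.0),
    Contig("NODE_3", 1300, 50.0),
    Contig("NODE_4", 1000, 2.3),
    Contig("NODE_5", 5000, 14.3),
    Contig("NODE_6", 450, 25.2),
]
avg, cov_cutoff, rep_cutoff, result = filter_contigs(
    example, cov_ratio=0.33, rep_ratio=1.75, len_cutoff=500, cov_len_cutoff=1000
)
print(round(avg, 1), round(cov_cutoff, 1), round(rep_cutoff, 1))
for group, names in result.items():
    print(group, names)
```

With the example inputs, the four groups match the fasta listings in the help text: NODE_2 and NODE_6 fail the length cut-off, NODE_4 fails the coverage cut-off, and NODE_3 is flagged as a repeat.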