Mercurial > repos > aaronpetkau > filter_spades_repeats

<tool id="filter_spades_repeat" name="Filter SPAdes repeats" version="1.0.0">
	<description>Remove short and repeat contigs/scaffolds</description>
	<requirements>
		<requirement type="package" version="5.18.1">perl</requirement>
	</requirements>
	<command interpreter="perl">nml_filter_spades_repeats.pl -i $fasta_input -t $tab_input -c $cov_cutoff -r $rep_cutoff -l $len_cutoff -o $output_with_repeats -u $output_without_repeats -n $repeat_sequences_only -e $cov_len_cutoff -f $discarded_sequences -s $summary
	</command>

	<inputs>
		<param name="fasta_input" type="data" format="fasta" label="Contigs or scaffolds file" help="Contigs/Scaffolds output file from Spades" />
		<param name="tab_input" type="data" format="tabular" label="Stats file" help="Enter the corresponding stats file of the fasta file input above" />
		<param name="cov_cutoff" type="float" value="0.33" min="0" label="Coverage cut-off ratio" help="This is the average coverage ratio cutoff. For example: if the average coverage is 100 and a coverage cut-off ratio of 0.5 is used, then any contigs with coverage lower than 50 will be eliminated." />
		<param name="rep_cutoff" type="float" value="1.75" min="0" label="Repeat cut-off ratio" help="This is the coverage ratio cutoff to determine repeats in contigs. For exmaple: if the average coverage is 100 and a repeat cut-off ratio of 1.75 is used, then any contigs with coverage more than or equal to 175 will be marked as repeats." />
		<param name="len_cutoff" type="integer" value="1000" min="0" label="Length cut-off" help="Contigs with length under the chosen cut-off will be eliminated." />
                <param name="cov_len_cutoff" type="integer" value="5000" min="0" label="Length for average coverage calculation" help="Only contigs above this length will be used to calculate the average coverage." />
		<param name="keep_leftover" type="select" label="Print out a fasta file containing the discarded sequences?">
			<option value="yes">Yes</option>
			<option value="no">No</option>
		</param>
                <param name="print_summary" type="select" label="Print out a summary of all the results?">
                        <option value="yes">Yes</option>
                        <option value="no">No</option>
               </param>
	</inputs>
	<outputs>
		<data format="fasta" name="output_with_repeats" label="Filtered sequences (with repeats)" />
		<data format="fasta" name="output_without_repeats" label="Filtered sequences (no repeats)" />
                <data format="fasta" name="repeat_sequences_only" label="Repeat sequences" />
		<data format="fasta" name="discarded_sequences" label="Discarded sequences">
			<filter>keep_leftover == "yes"</filter>
		</data>
                <data format="txt" name="summary" label="Results summary">
                        <filter>print_summary == "yes"</filter>
               </data>
	</outputs>


	<help>
================
**What it does**
================

Using the output of SPAdes (a fasta and a stats file, either from contigs or scaffolds), it filters the fasta files, discarding all sequences that are under a given length or under a calculated coverage. Repeated contigs are detected based on coverage.

--------------------------------------

==========
**Output**
==========

- **Filtered sequences (with repeats)**
	- Will contain the filtered contigs/scaffolds including the repeats. These are the sequences that passed the length and minumum coverage cutoffs.
	- For workflows, this output is named **output_with_repeats**
- **Filtered sequences (no repeats)**
	-  Will contain the filtered contigs/scaffolds excluding the repeats. These are the sequences that passed the length, minimum coverage and repeat cutoffs.
	- For workflows, this output is named **output_without_repeats**
- **Repeat sequences**
	- Will contain the repeated contigs/scaffolds only. These are the sequences that were exluded for having high coverage (determined by the repeat cutoff).
	- For workflows, this output is named **repeat_sequences_only**
- **Discarded sequences**
	- If selected, will contain the discarded sequences. These are the sequences that fell below the length and minumum coverage cutoffs, and got discarded.
	- For workflows, this output is named **discarded_sequences**
- **Results summary**  : If selected, will contain a summary of all the results.

------------------------------------------

============
**Example**
============

Stats file input:
------------------

    +------------+------------+------------+
    |#name       |length      |coverage    |
    +============+============+============+
    |NODE_1      |2500        |15.5        |
    +------------+------------+------------+
    |NODE_2      |102         |3.0         |
    +------------+------------+------------+
    |NODE_3      |1300        |50.0        |
    +------------+------------+------------+
    |NODE_4      |1000        |2.3         |
    +------------+------------+------------+
    |NODE_5      |5000        |14.3        |
    +------------+------------+------------+
    |NODE_6      |450         |25.2        |
    +------------+------------+------------+

User Inputs:
------------

- Coverage cut-off ratio = 0.33
- Repeat cut-off ratio = 1.75
- Length cut-off = 500
- Length for average coverage calculation = 1000

Calculations:
-------------

**Average coverage will be calculatd based on contigs with length >= 1000bp**


- Average coverage = 15.5 + 50.0 + 2.3 + 14.3 / 4 = 20.5

**Contigs that have coverage in the lower 1/3 of the average coverage will be eliminated.**

- Coverage cut-off = 20.5 * 0.33 = 6.8

**Contigs with high coverage (larger than 1.75 times the average coverage) are considered to be repeated contigs.**

- Repeat cut-off = 20.5 * 1.75 = 35.9

**Number of copies are calculated by dividing the sequence coverage by the average coverage.**

- Number of repeats for NODE_3  = 50 / 20.5 = 2 copies


Output (in fasta format):
--------------------------

**Filtered sequences (with repeats)**

::

	>NODE_1
	>NODE_3 (2 copies)
	>NODE_5

**Filtered sequences (no repeats)**

::

	>NODE_1
	>NODE_5

**Repeat sequences**

::

      >NODE_3 (2 copies)

**Discarded sequences**

::

	>NODE_2
	>NODE_4
    >NODE_6

---------------------------------------


</help>

</tool>
author	aaronpetkau
date	Sat, 04 Jul 2015 09:45:30 -0400
parents
children