<tool id="filter_spades_repeat" name="Filter SPAdes repeats" version="1.0.0">
    <description>Remove short and repeat contigs/scaffolds</description>
    <requirements>
        <requirement type="package" version="5.18.1">perl</requirement>
    </requirements>
    <command interpreter="perl">nml_filter_spades_repeats.pl -i $fasta_input -t $tab_input -c $cov_cutoff -r $rep_cutoff -l $len_cutoff -o $output_with_repeats -u $output_without_repeats -n $repeat_sequences_only -e $cov_len_cutoff -f $discarded_sequences -s $summary
    </command>

    <inputs>
        <param name="fasta_input" type="data" format="fasta" label="Contigs or scaffolds file" help="Contigs/scaffolds output file from SPAdes" />
        <param name="tab_input" type="data" format="tabular" label="Stats file" help="The stats file that corresponds to the fasta file selected above" />
        <param name="cov_cutoff" type="float" value="0.33" min="0" label="Coverage cut-off ratio" help="Minimum coverage, expressed as a ratio of the average coverage. For example: if the average coverage is 100 and a coverage cut-off ratio of 0.5 is used, then any contig with coverage lower than 50 will be eliminated." />
        <param name="rep_cutoff" type="float" value="1.75" min="0" label="Repeat cut-off ratio" help="Coverage ratio used to identify repeated contigs. For example: if the average coverage is 100 and a repeat cut-off ratio of 1.75 is used, then any contig with coverage greater than or equal to 175 will be marked as a repeat." />
        <param name="len_cutoff" type="integer" value="1000" min="0" label="Length cut-off" help="Contigs shorter than the chosen cut-off will be eliminated." />
        <param name="cov_len_cutoff" type="integer" value="5000" min="0" label="Length for average coverage calculation" help="Only contigs of at least this length will be used to calculate the average coverage." />
        <param name="keep_leftover" type="select" label="Print out a fasta file containing the discarded sequences?">
            <option value="yes">Yes</option>
            <option value="no">No</option>
        </param>
        <param name="print_summary" type="select" label="Print out a summary of all the results?">
            <option value="yes">Yes</option>
            <option value="no">No</option>
        </param>
    </inputs>
    <outputs>
        <data format="fasta" name="output_with_repeats" label="Filtered sequences (with repeats)" />
        <data format="fasta" name="output_without_repeats" label="Filtered sequences (no repeats)" />
        <data format="fasta" name="repeat_sequences_only" label="Repeat sequences" />
        <data format="fasta" name="discarded_sequences" label="Discarded sequences">
            <filter>keep_leftover == "yes"</filter>
        </data>
        <data format="txt" name="summary" label="Results summary">
            <filter>print_summary == "yes"</filter>
        </data>
    </outputs>

    <help>
================
**What it does**
================

Using the output of SPAdes (a fasta file and a stats file, from either contigs or scaffolds), this tool filters the fasta file, discarding all sequences that are shorter than a given length or that fall below a calculated coverage cutoff. Repeated contigs are detected based on coverage.

--------------------------------------

==========
**Output**
==========

- **Filtered sequences (with repeats)**

  - Will contain the filtered contigs/scaffolds, including the repeats. These are the sequences that passed the length and minimum coverage cutoffs.
  - For workflows, this output is named **output_with_repeats**

- **Filtered sequences (no repeats)**

  - Will contain the filtered contigs/scaffolds, excluding the repeats. These are the sequences that passed the length, minimum coverage and repeat cutoffs.
  - For workflows, this output is named **output_without_repeats**

- **Repeat sequences**

  - Will contain the repeated contigs/scaffolds only. These are the sequences that were excluded for having high coverage (determined by the repeat cutoff).
  - For workflows, this output is named **repeat_sequences_only**

- **Discarded sequences**

  - If selected, will contain the discarded sequences. These are the sequences that fell below the length or minimum coverage cutoff and were discarded.
  - For workflows, this output is named **discarded_sequences**

- **Results summary**: If selected, will contain a summary of all the results.

------------------------------------------

============
**Example**
============

Stats file input:
------------------

+------------+------------+------------+
|#name       |length      |coverage    |
+============+============+============+
|NODE_1      |2500        |15.5        |
+------------+------------+------------+
|NODE_2      |102         |3.0         |
+------------+------------+------------+
|NODE_3      |1300        |50.0        |
+------------+------------+------------+
|NODE_4      |1000        |2.3         |
+------------+------------+------------+
|NODE_5      |5000        |14.3        |
+------------+------------+------------+
|NODE_6      |450         |25.2        |
+------------+------------+------------+

User Inputs:
------------

- Coverage cut-off ratio = 0.33
- Repeat cut-off ratio = 1.75
- Length cut-off = 500
- Length for average coverage calculation = 1000

Calculations:
-------------

**Average coverage is calculated from contigs with length >= 1000 bp**

- Average coverage = (15.5 + 50.0 + 2.3 + 14.3) / 4 = 20.5

**Contigs with coverage below 0.33 times the average coverage will be eliminated.**

- Coverage cut-off = 20.5 * 0.33 = 6.8

**Contigs with high coverage (at least 1.75 times the average coverage) are considered to be repeated contigs.**

- Repeat cut-off = 20.5 * 1.75 = 35.9

**The number of copies is calculated by dividing the sequence coverage by the average coverage.**

- Number of copies for NODE_3 = 50 / 20.5 = 2.4, reported as 2 copies

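The calculations above can be reproduced with a short Python sketch. This is only an illustration of the arithmetic in this example and is not the Perl script itself: the function name ``classify``, its parameter names, and the rounding of the copy number to the nearest integer are all assumptions made for the sketch.

```python
# Illustrative sketch only; names and rounding behaviour are assumptions,
# not taken from nml_filter_spades_repeats.pl.

# Stats table from the example: name -> (length, coverage)
stats = {
    "NODE_1": (2500, 15.5),
    "NODE_2": (102, 3.0),
    "NODE_3": (1300, 50.0),
    "NODE_4": (1000, 2.3),
    "NODE_5": (5000, 14.3),
    "NODE_6": (450, 25.2),
}


def classify(stats, cov_ratio=0.33, rep_ratio=1.75, len_cutoff=500, cov_len_cutoff=1000):
    """Split contigs into kept, repeat and discarded groups."""
    # Average coverage uses only contigs at least cov_len_cutoff long.
    long_cov = [cov for length, cov in stats.values() if length >= cov_len_cutoff]
    avg = sum(long_cov) / len(long_cov)
    cov_cutoff = avg * cov_ratio
    rep_cutoff = avg * rep_ratio

    kept, repeats, discarded = [], {}, []
    for name, (length, cov) in stats.items():
        if not (length >= len_cutoff and cov >= cov_cutoff):
            # Too short, or coverage lower than the coverage cut-off.
            discarded.append(name)
        elif cov >= rep_cutoff:
            # Assumption: copy number is rounded to the nearest integer.
            repeats[name] = round(cov / avg)
        else:
            kept.append(name)
    return avg, cov_cutoff, rep_cutoff, kept, repeats, discarded


avg, cov_cutoff, rep_cutoff, kept, repeats, discarded = classify(stats)
print(f"average={avg:.1f} coverage cut-off={cov_cutoff:.1f} repeat cut-off={rep_cutoff:.1f}")
print("no repeats:", kept)
print("repeats:", repeats)
print("discarded:", discarded)
```

In this sketch the "with repeats" output corresponds to the ``kept`` and ``repeats`` groups taken together, and the "no repeats" output to ``kept`` alone.
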

Output (in fasta format):
--------------------------

**Filtered sequences (with repeats)**

::

    >NODE_1
    >NODE_3 (2 copies)
    >NODE_5

**Filtered sequences (no repeats)**

::

    >NODE_1
    >NODE_5

**Repeat sequences**

::

    >NODE_3 (2 copies)

**Discarded sequences**

::

    >NODE_2
    >NODE_4
    >NODE_6

---------------------------------------

    </help>

</tool>