comparison filter_spades_repeats.xml @ 0:ddd1e15df88c draft
Uploaded
author | aaronpetkau |
---|---|
date | Sat, 04 Jul 2015 09:45:30 -0400 |
parents | |
children |
-1:000000000000 | 0:ddd1e15df88c |
---|---|
1 <tool id="filter_spades_repeat" name="Filter SPAdes repeats" version="1.0.0"> | |
2 <description>Remove short and repeat contigs/scaffolds</description> | |
3 <requirements> | |
4 <requirement type="package" version="5.18.1">perl</requirement> | |
5 </requirements> | |
6 <command interpreter="perl">nml_filter_spades_repeats.pl -i $fasta_input -t $tab_input -c $cov_cutoff -r $rep_cutoff -l $len_cutoff -o $output_with_repeats -u $output_without_repeats -n $repeat_sequences_only -e $cov_len_cutoff -f $discarded_sequences -s $summary | |
7 </command> | |
8 | |
9 <inputs> | |
10 <param name="fasta_input" type="data" format="fasta" label="Contigs or scaffolds file" help="Contigs/scaffolds output file from SPAdes" /> | |
11 <param name="tab_input" type="data" format="tabular" label="Stats file" help="Enter the corresponding stats file of the fasta file input above" /> | |
12 <param name="cov_cutoff" type="float" value="0.33" min="0" label="Coverage cut-off ratio" help="This is the average coverage ratio cutoff. For example: if the average coverage is 100 and a coverage cut-off ratio of 0.5 is used, then any contigs with coverage lower than 50 will be eliminated." /> | |
13 <param name="rep_cutoff" type="float" value="1.75" min="0" label="Repeat cut-off ratio" help="This is the coverage ratio cutoff to determine repeats in contigs. For example: if the average coverage is 100 and a repeat cut-off ratio of 1.75 is used, then any contigs with coverage more than or equal to 175 will be marked as repeats." /> | |
14 <param name="len_cutoff" type="integer" value="1000" min="0" label="Length cut-off" help="Contigs with length under the chosen cut-off will be eliminated." /> | |
15 <param name="cov_len_cutoff" type="integer" value="5000" min="0" label="Length for average coverage calculation" help="Only contigs of at least this length will be used to calculate the average coverage." /> | |
16 <param name="keep_leftover" type="select" label="Print out a fasta file containing the discarded sequences?"> | |
17 <option value="yes">Yes</option> | |
18 <option value="no">No</option> | |
19 </param> | |
20 <param name="print_summary" type="select" label="Print out a summary of all the results?"> | |
21 <option value="yes">Yes</option> | |
22 <option value="no">No</option> | |
23 </param> | |
24 </inputs> | |
25 <outputs> | |
26 <data format="fasta" name="output_with_repeats" label="Filtered sequences (with repeats)" /> | |
27 <data format="fasta" name="output_without_repeats" label="Filtered sequences (no repeats)" /> | |
28 <data format="fasta" name="repeat_sequences_only" label="Repeat sequences" /> | |
29 <data format="fasta" name="discarded_sequences" label="Discarded sequences"> | |
30 <filter>keep_leftover == "yes"</filter> | |
31 </data> | |
32 <data format="txt" name="summary" label="Results summary"> | |
33 <filter>print_summary == "yes"</filter> | |
34 </data> | |
35 </outputs> | |
36 | |
37 | |
38 | |
39 | |
40 <help> | |
41 ================ | |
42 **What it does** | |
43 ================ | |
44 | |
45 Using the output of SPAdes (a fasta file and its stats file, for either contigs or scaffolds), this tool filters the fasta file, discarding all sequences that are shorter than a given length or below a calculated coverage cutoff. Repeated contigs are detected based on coverage. | |
46 | |
47 -------------------------------------- | |
48 | |
49 ========== | |
50 **Output** | |
51 ========== | |
52 | |
53 - **Filtered sequences (with repeats)** | |
54 - Will contain the filtered contigs/scaffolds including the repeats. These are the sequences that passed the length and minimum coverage cutoffs. | |
55 - For workflows, this output is named **output_with_repeats** | |
56 - **Filtered sequences (no repeats)** | |
57 - Will contain the filtered contigs/scaffolds excluding the repeats. These are the sequences that passed the length, minimum coverage and repeat cutoffs. | |
58 - For workflows, this output is named **output_without_repeats** | |
59 - **Repeat sequences** | |
60 - Will contain the repeated contigs/scaffolds only. These are the sequences that were excluded for having high coverage (determined by the repeat cutoff). | |
61 - For workflows, this output is named **repeat_sequences_only** | |
62 - **Discarded sequences** | |
63 - If selected, will contain the discarded sequences. These are the sequences that fell below the length or minimum coverage cutoff and were discarded. | |
64 - For workflows, this output is named **discarded_sequences** | |
65 - **Results summary**: If selected, will contain a summary of all the results. For workflows, this output is named **summary**. | |
66 | |
67 ------------------------------------------ | |
68 | |
69 ============ | |
70 **Example** | |
71 ============ | |
72 | |
73 Stats file input: | |
74 ------------------ | |
75 | |
76 +------------+------------+------------+ | |
77 |#name |length |coverage | | |
78 +============+============+============+ | |
79 |NODE_1 |2500 |15.5 | | |
80 +------------+------------+------------+ | |
81 |NODE_2 |102 |3.0 | | |
82 +------------+------------+------------+ | |
83 |NODE_3 |1300 |50.0 | | |
84 +------------+------------+------------+ | |
85 |NODE_4 |1000 |2.3 | | |
86 +------------+------------+------------+ | |
87 |NODE_5 |5000 |14.3 | | |
88 +------------+------------+------------+ | |
89 |NODE_6 |450 |25.2 | | |
90 +------------+------------+------------+ | |
91 | |
92 User Inputs: | |
93 ------------ | |
94 | |
95 - Coverage cut-off ratio = 0.33 | |
96 - Repeat cut-off ratio = 1.75 | |
97 - Length cut-off = 500 | |
98 - Length for average coverage calculation = 1000 | |
99 | |
100 Calculations: | |
101 ------------- | |
102 | |
103 **Average coverage will be calculated based on contigs with length >= 1000 bp** | |
104 | |
105 | |
106 - Average coverage = (15.5 + 50.0 + 2.3 + 14.3) / 4 = 20.5 | |
107 | |
108 **Contigs with coverage below 0.33 times the average coverage will be eliminated.** | |
109 | |
110 - Coverage cut-off = 20.5 * 0.33 = 6.8 | |
111 | |
112 **Contigs with high coverage (larger than 1.75 times the average coverage) are considered to be repeated contigs.** | |
113 | |
114 - Repeat cut-off = 20.5 * 1.75 = 35.9 | |
115 | |
116 **The number of copies is calculated by dividing the sequence coverage by the average coverage.** | |
117 | |
118 - Number of copies for NODE_3 = 50.0 / 20.5 = 2.4, reported as 2 copies | |
119 | |
120 | |
121 Output (in fasta format): | |
122 -------------------------- | |
123 | |
124 **Filtered sequences (with repeats)** | |
125 | |
126 :: | |
127 | |
128 >NODE_1 | |
129 >NODE_3 (2 copies) | |
130 >NODE_5 | |
131 | |
132 **Filtered sequences (no repeats)** | |
133 | |
134 :: | |
135 | |
136 >NODE_1 | |
137 >NODE_5 | |
138 | |
139 **Repeat sequences** | |
140 | |
141 :: | |
142 | |
143 >NODE_3 (2 copies) | |
144 | |
145 **Discarded sequences** | |
146 | |
147 :: | |
148 | |
149 >NODE_2 | |
150 >NODE_4 | |
151 >NODE_6 | |
152 | |
153 --------------------------------------- | |
154 | |
155 | |
156 | |
157 | |
158 </help> | |
159 | |
160 </tool> |
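
The worked example in the help text can be reproduced with a short Python sketch. This is not the code of `nml_filter_spades_repeats.pl`; it only mirrors the arithmetic described in the help section. The names `Contig` and `filter_contigs` are hypothetical, and two details are assumptions: contigs of exactly the averaging length are included (as NODE_4 is in the example), and the copy number is rounded to the nearest whole number (the help text only shows the result, 2 copies).

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int
    coverage: float


def filter_contigs(contigs, cov_ratio, rep_ratio, len_cutoff, cov_len_cutoff):
    """Split contigs into the four groups described in the help text."""
    # Average coverage uses only contigs of at least cov_len_cutoff bp
    # (assumption: the comparison is inclusive, as in the worked example).
    reference = [c for c in contigs if c.length >= cov_len_cutoff]
    if not reference:
        raise ValueError("no contig is long enough to estimate average coverage")
    average = sum(c.coverage for c in reference) / len(reference)

    min_coverage = average * cov_ratio     # below this: discarded
    repeat_coverage = average * rep_ratio  # at or above this: repeat

    groups = {"with_repeats": [], "without_repeats": [], "repeats": [], "discarded": []}
    for c in contigs:
        if c.length < len_cutoff or c.coverage < min_coverage:
            groups["discarded"].append(c.name)
            continue
        groups["with_repeats"].append(c.name)
        if c.coverage >= repeat_coverage:
            # Assumption: copy number is rounded to the nearest integer.
            copies = round(c.coverage / average)
            groups["repeats"].append((c.name, copies))
        else:
            groups["without_repeats"].append(c.name)
    return average, groups


# Stats table and user inputs taken from the example in the help section.
stats = [
    Contig("NODE_1", 2500, 15.5),
    Contig("NODE_2", 102, 3.0),
    Contig("NODE_3", 1300, 50.0),
    Contig("NODE_4", 1000, 2.3),
    Contig("NODE_5", 5000, 14.3),
    Contig("NODE_6", 450, 25.2),
]
average, groups = filter_contigs(
    stats, cov_ratio=0.33, rep_ratio=1.75, len_cutoff=500, cov_len_cutoff=1000
)
print(f"average coverage: {average:.1f}")
for label, members in groups.items():
    print(label, members)
```

With the example inputs, the groups match the four fasta listings shown in the help section: NODE_1, NODE_3 and NODE_5 pass the filters, NODE_3 is flagged as a repeat, and NODE_2, NODE_4 and NODE_6 are discarded.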