comparison tools/protein_analysis/signalp3.xml @ 11:99b82a2b1272 draft

Uploaded v0.2.0 which added PSORTb wrapper (written with Konrad Paszkiewicz)
author peterjc
date Wed, 03 Apr 2013 10:49:10 -0400
parents e52220a9ddad
children 7de64c8b258d
10:09ff180d1615 11:99b82a2b1272
1 <tool id="signalp3" name="SignalP 3.0" version="0.0.10"> 1 <tool id="signalp3" name="SignalP 3.0" version="0.0.11">
2 <description>Find signal peptides in protein sequences</description> 2 <description>Find signal peptides in protein sequences</description>
3 <!-- If job splitting is enabled, break up the query file into parts --> 3 <!-- If job splitting is enabled, break up the query file into parts -->
4 <!-- Using 2000 chunks meaning 4 threads doing 500 each is ideal --> 4 <!-- Using 2000 chunks meaning 4 threads doing 500 each is ideal -->
5 <parallelism method="basic" split_inputs="fasta_file" split_mode="to_size" split_size="2000" merge_outputs="tabular_file"></parallelism> 5 <parallelism method="basic" split_inputs="fasta_file" split_mode="to_size" split_size="2000" merge_outputs="tabular_file"></parallelism>
6 <command interpreter="python"> 6 <command interpreter="python">
69 69
70 This calls the SignalP v3.0 tool for prediction of signal peptides, which uses both a Neural Network (NN) and Hidden Markov Model (HMM) to produce two sets of scores. 70 This calls the SignalP v3.0 tool for prediction of signal peptides, which uses both a Neural Network (NN) and Hidden Markov Model (HMM) to produce two sets of scores.
71 71
72 The input is a FASTA file of protein sequences, and the output is tabular with twenty columns (one row per protein): 72 The input is a FASTA file of protein sequences, and the output is tabular with twenty columns (one row per protein):
73 73
74 * Sequence identifier 74 ====== =================================================
75 * Neural Network (NN) predictions (13 columns) 75 Column Description
76 * Hidden Markov Model (HMM) predictions (6 columns) 76 ------ -------------------------------------------------
77 1 Sequence identifier
78 2-14 Neural Network (NN) predictions (13 columns)
79 15-20 Hidden Markov Model (HMM) predictions (6 columns)
80 ====== =================================================
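The column layout in the new table can be illustrated with a short Python sketch that slices one tab-separated output row into its three groups. The function name `split_signalp_row` is hypothetical and is not part of the wrapper; the only thing taken from the help text is the 1 + 13 + 6 column layout.

```python
def split_signalp_row(line):
    """Split one tab-separated output row into (identifier, NN fields, HMM fields).

    Assumes the twenty-column layout described in the help text:
    column 1 is the identifier, columns 2-14 are the NN predictions,
    and columns 15-20 are the HMM predictions.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 20:
        raise ValueError("Expected 20 columns, found %d" % len(fields))
    identifier = fields[0]
    nn_fields = fields[1:14]    # columns 2-14 (13 values)
    hmm_fields = fields[14:20]  # columns 15-20 (6 values)
    return identifier, nn_fields, hmm_fields
```

Header or comment lines, if the output has any, would need to be skipped before calling this function.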
77 81
78 Internally the input FASTA file is divided into parts (to allow multiple processors to be used), and the proteins truncated as specified (see below). The raw output from SignalP is then reformatted into a tabular layout suitable for Galaxy (see below). 82 Internally the input FASTA file is divided into parts (to allow multiple processors to be used), and the proteins truncated as specified (see below). The raw output from SignalP is then reformatted into a tabular layout suitable for Galaxy (see below).
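As a rough illustration of the divide-and-truncate step described above, the following Python sketch reads FASTA records, truncates each sequence, and groups the records into batches. It is not the wrapper's actual code: the names `read_fasta`, `truncate_and_batch`, `max_len` and `batch_size` are hypothetical, and treating a `max_len` of 0 as "no truncation" is an assumption made for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def truncate_and_batch(records, max_len, batch_size):
    """Truncate each sequence to max_len residues and yield lists of records.

    A max_len of 0 is treated as "do not truncate" (an assumption here).
    """
    batch = []
    for header, seq in records:
        batch.append((header, seq[:max_len] if max_len else seq))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Each batch could then be written to its own temporary FASTA file and handed to a separate SignalP process.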
79 83
80 **Neural Network Scores** 84 **Neural Network Scores**
81 85
82 For each organism class (Eukaryote, Gram-negative and Gram-positive), two different neural networks are used, one for predicting the actual signal peptide and one for predicting the position of the signal peptidase I (SPase I) cleavage site. 86 For each organism class (Eukaryote, Gram-negative and Gram-positive), two different neural networks are used, one for predicting the actual signal peptide and one for predicting the position of the signal peptidase I (SPase I) cleavage site.
83 87
84 The NN output comprises three different scores (C-max, S-max and Y-max) and two scores derived from them (S-mean and D-score). 88 The NN output comprises three different scores (C-max, S-max and Y-max) and two scores derived from them (S-mean and D-score).
85 89
86 The C-score is the 'cleavage site' score. For each position in the submitted sequence, a C-score is reported, which should only be significantly high at the cleavage site. Confusion is often seen with the position numbering of the cleavage site. When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein, meaning that a predicted cleavage site between amino acid 26-27 is reported as 27, corresponding to the mature protein starting at (and including) position 27. 90 ====== ======= ===============================================================
87 91 Column Name Description
88 The S-score for the signal peptide prediction is calculated for every single amino acid position in the submitted sequence (not shown in the output via Galaxy), with high scores indicating that the corresponding amino acid is part of a signal peptide, and low scores indicating that the amino acid is part of a mature protein. 92 ------ ------- ---------------------------------------------------------------
89 93 2-4 C-score The C-score is the 'cleavage site' score. For each position in
90 Y-max is a derivative of the C-score combined with the S-score resulting in a better cleavage site prediction than the raw C-score alone. This is due to the fact that multiple high-peaking C-scores can be found in one sequence, where only one is the true cleavage site. The cleavage site is assigned from the Y-score where the slope of the S-score is steep and a significant C-score is found. 94 the submitted sequence, a C-score is reported, which should
91 95 only be significantly high at the cleavage site. Confusion is
92 The S-mean is the average of the S-score, ranging from the N-terminal amino acid to the amino acid assigned with the highest Y-max score, thus the S-mean score is calculated for the length of the predicted signal peptide. The S-mean score was in SignalP version 2.0 used as the criteria for discrimination of secretory and non-secretory proteins. 96 often seen with the position numbering of the cleavage site.
93 97 When a cleavage site position is referred to by a single number,
94 The D-score was introduced in SignalP version 3.0 and is a simple average of the S-mean and Y-max score. The score shows superior discrimination performance of secretory and non-secretory proteins to that of the S-mean score which was used in SignalP version 1 and 2. 98 the number indicates the first residue in the mature protein,
99 meaning that a predicted cleavage site between amino acid 26-27
100 is reported as 27, corresponding to the mature protein starting
101 at (and including) position 27.
102 ------ ------- ---------------------------------------------------------------
103 5-7 S-score The S-score for the signal peptide prediction is calculated for
104 every single amino acid position in the submitted sequence (not
105 shown in the output via Galaxy), with high scores indicating
106 that the corresponding amino acid is part of a signal peptide,
107 and low scores indicating that the amino acid is part of a
108 mature protein.
109 ------ ------- ---------------------------------------------------------------
110 8-10 Y-max Y-max is a derivative of the C-score combined with the S-score
111 resulting in a better cleavage site prediction than the raw
112 C-score alone. This is due to the fact that multiple high-peaking
113 C-scores can be found in one sequence, where only one is the
114 true cleavage site. The cleavage site is assigned from the
115 Y-score where the slope of the S-score is steep and a
116 significant C-score is found.
117 ------ ------- ---------------------------------------------------------------
118 11-12 S-mean The S-mean is the average of the S-score, ranging from the
119 N-terminal amino acid to the amino acid assigned with the
120 highest Y-max score, thus the S-mean score is calculated for
121 the length of the predicted signal peptide. The S-mean score
122 was in SignalP version 2.0 used as the criteria for
123 discrimination of secretory and non-secretory proteins.
124 ------ ------- ---------------------------------------------------------------
125 13-14 D-score The D-score was introduced in SignalP version 3.0 and is a
126 simple average of the S-mean and Y-max score. The score shows
127 superior discrimination performance of secretory and
128 non-secretory proteins to that of the S-mean score which was
129 used in SignalP version 1 and 2.
130 ====== ======= ===============================================================
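Two points from the table lend themselves to a small worked example in Python: the D-score as a simple average of S-mean and Y-max, and the convention that a cleavage site reported as 27 means the mature protein starts at residue 27. The function names below are hypothetical and the input values are made up for illustration; they are not SignalP output.

```python
def d_score(s_mean, y_max):
    """Return the simple average of S-mean and Y-max, as the table describes."""
    return (s_mean + y_max) / 2.0


def split_at_cleavage(sequence, cleavage_pos):
    """Split a protein into (signal peptide, mature protein).

    cleavage_pos is the 1-based position of the first residue of the
    mature protein, so a site between residues 26 and 27 is given as 27.
    """
    signal_peptide = sequence[:cleavage_pos - 1]
    mature_protein = sequence[cleavage_pos - 1:]
    return signal_peptide, mature_protein
```

With a reported position of 27, the signal peptide is residues 1-26 and the mature protein begins with residue 27.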
95 131
96 For non-secretory proteins all the scores represented in the SignalP3-NN output should ideally be very low. 132 For non-secretory proteins all the scores represented in the SignalP3-NN output should ideally be very low.
97 133
98 **Hidden Markov Model Scores** 134 **Hidden Markov Model Scores**
99 135