comparison test-data/extrinsic.truncated.cfg @ 2:0d425a4b6896 draft
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/augustus commit 0fed5bb024a096dcb5b2858520ba191da7798b6d
author | iuc |
---|---|
date | Thu, 23 May 2019 18:17:22 -0400 |
parents | |
children |
1:1fbb1135da16 | 2:0d425a4b6896 |
---|---|
1 # extrinsic information configuration file for AUGUSTUS | |
2 # | |
3 # protein hints | |
4 # include with --extrinsicCfgFile=filename | |
5 # date: 16.10.2007 | |
6 # Mario Stanke (mstanke@gwdg.de) | |
7 | |
8 | |
9 # source of extrinsic information: | |
10 # M manual anchor (required) | |
11 # P protein database hit | |
12 # E EST/cDNA database hit | |
13 # C combined est/protein database hit | |
14 # D Dialign | |
15 # R retroposed genes | |
16 # T transMapped refSeqs | |
17 # W wiggle track coverage info from RNA-Seq | |
18 | |
19 [SOURCES] | |
20 M RM E W | |
21 | |
22 # | |
23 # individual_liability: Only unsatisfiable hints are disregarded. By default this flag is not set | |
24 # and the whole hint group is disregarded when one hint in it is unsatisfiable. | |
25 # 1group1gene: Try to predict a single gene that covers all hints of a given group. This is relevant for | |
26 # hint groups with gaps, e.g. when two ESTs, say 5' and 3', from the same clone align nearby. | |
27 # | |
28 [SOURCE-PARAMETERS] | |
29 | |
30 | |
31 # feature bonus malus gradelevelcolumns | |
32 # r+/r- | |
33 # | |
34 # the gradelevel columns have the following format for each source | |
35 # sourcecharacter numscoreclasses boundary ... boundary gradequot ... gradequot | |
36 # | |
37 | |
38 [GENERAL] | |
39 start 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
40 stop 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
41 tss 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
42 tts 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
43 ass 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
44 dss 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
45 exonpart 1 .992 M 1 1e+100 RM 1 1 E 1 1 W 1 1.005 | |
46 exon 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
47 intronpart 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
48 intron 1 .8 M 1 1e+100 RM 1 1 E 1 1000 W 1 1 | |
49 CDSpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
50 CDS 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
51 UTRpart 1 1 .973 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
52 UTR 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
53 irpart 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
54 nonexonpart 1 1 M 1 1e+100 RM 1 1.01 E 1 1 W 1 1 | |
55 genicpart 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1 | |
56 | |
57 # | |
58 # Explanation: | |
59 # | |
60 # The gff/gtf file containing the hints must contain somewhere in the last | |
61 # column an entry source=?, where ? is one of the source characters listed in | |
62 # the line after [SOURCES] above. You can use different sources when you have | |
63 # hints of different reliability of the same type, e.g. exon hints from ESTs | |
64 # and exon hints from evolutionary conservation information. | |
65 # | |
66 # In the [GENERAL] section the entry in the second column specifies a bonus for obeying | |
67 # a hint and the entry in the third column specifies a malus (penalty) for | |
68 # predicting a feature that is not supported by any hint. The bonus and the | |
69 # malus are factors that are multiplied into the posterior probability of gene | |
70 # structures. | |
71 # Example: | |
72 # CDS 1000 0.7 .... | |
73 # means that, when AUGUSTUS is searching for the most likely gene structure, | |
74 # every gene structure that has a CDS exactly as given in a hint gets | |
75 # a bonus factor of 1000. Also, for every CDS that is not supported the | |
76 # probability of the gene structure gets a malus of 0.7. Increase the bonus to | |
77 # make AUGUSTUS obey more hints, decrease the malus to make AUGUSTUS predict fewer | |
78 # features that are not supported by hints. The malus helps increase | |
79 # specificity, e.g. when the exons predicted by AUGUSTUS are suspicious because | |
80 # there is no evidence from ESTs, mRNAs, protein databases, sequence | |
81 # conservation, or transMapped expressed sequences. | |
82 # Setting the malus to 1.0 disables those penalties. Setting the bonus to 1.0 | |
83 # disables the boni. | |
84 # | |
85 # start: translation start (start codon), specifies an interval that contains | |
86 # the start codon. The interval can be larger than 3bp, in which case | |
87 # every ATG in the interval gets a bonus. The highest bonus is given | |
88 # to ATGs in the middle of the interval, the bonus fades off towards the ends. | |
89 # stop: translation end (stop codon), see 'start' | |
90 # tss: transcription start site, see 'start' | |
91 # tts: transcription termination site, see 'start' | |
92 # ass: acceptor (3') splice site, the last intron position | |
93 # dss: donor (5') splice site, the first intron position | |
94 # exonpart: part of an exon in the biological sense. The bonus applies only | |
95 # to exons that contain the interval from the hint. Just | |
96 # overlapping means no bonus at all. The malus applies to every | |
97 # base of an exon. Therefore the malus for an exon is exponential | |
98 # in the length of an exon: malus=exonpartmalus^length. | |
99 # Therefore the malus should be close to 1, e.g. 0.99. | |
100 # exon: exon in the biological sense. Only exons that exactly match the | |
101 # hint get a bonus. Exception: The exons that contain the start | |
102 # codon and stop codon. The malus applies to a complete exon | |
103 # independent of its length. | |
104 # intronpart: introns both between coding and non-coding exons. The bonus | |
105 # applies to every intronic base in the interval of the hint. | |
106 # intron: An intron gets the bonus if and only if it is exactly as in the hint. | |
107 # CDSpart: part of the coding part of an exon. (CDS = coding sequence) | |
108 # CDS: coding part of an exon with exact boundaries. For internal exons | |
109 # of a multi exon gene this is identical to the biological | |
110 # boundaries of the exon. For the first and the last coding exon | |
111 # the boundaries are the boundaries of the coding sequence (start, stop). | |
112 # UTR: exact boundaries of a UTR exon or the untranslated part of a | |
113 # partially coding exon. | |
114 # UTRpart: The hint interval must be included in the UTR part of an exon. | |
115 # irpart: The bonus applies to every base of the intergenic region. If UTR | |
116 # prediction is turned on (--UTR=on) then UTR is considered | |
117 # genic. If you choose, against the usual meaning, the bonus of | |
118 # irparts to be much smaller than 1 in the configuration file, you | |
119 # can force AUGUSTUS to not predict an intergenic region in the | |
120 # specified interval. This is useful if you want to tell AUGUSTUS | |
121 # that two distant exons belong to the same gene, when AUGUSTUS | |
122 # tends to split that gene into smaller genes. | |
123 # nonexonpart: intergenic region or intron. The bonus applies to every non-exon | |
124 # base that overlaps with the interval from the hint. It is | |
125 # geometric in the length of that overlap, so choose it close to | |
126 # 1.0. This is useful as a weak kind of masking, e.g. when it is | |
127 # unlikely that a retroposed gene contains a coding region but you | |
128 # do not want to completely forbid exons. | |
129 # genicpart: everything that is not intergenic region, i.e. intron or exon or UTR if | |
130 # applicable. The bonus applies to every genic base that overlaps with the | |
131 # interval from the hint. This can be used in particular to make Augustus | |
132 # predict one gene between positions a and b if a and b are experimentally | |
133 # confirmed to be part of the same gene, e.g. through ESTs from the same clone. | |
134 # alias: nonirpart | |
135 # | |
136 # Any hints of types dss, intron, exon, CDS, UTR that (implicitly) suggest a donor splice | |
137 # site allow AUGUSTUS to predict a donor splice site that has a GC instead of the much more common GT. | |
138 # AUGUSTUS does not predict a GC donor splice site unless there is a hint for one. | |
139 # | |
140 # Starting in column number 4 you can tell AUGUSTUS how to modify the bonus | |
141 # depending on the source of the hint and the score of the hint. | |
142 # The score of the hints is specified in the 6th column of the hint gff/gtf. | |
143 # If the score is used at all, the score is not used directly through some | |
144 # conversion formula but by distinguishing different classes of scores, e.g. low | |
145 # score, medium score, high score. The format is the following: | |
146 # First, you specify the source character, then the number of classes (say n), then you | |
147 # specify the score boundaries that separate the classes (n-1 thresholds) and then you specify | |
148 # for each score class the multiplicative modifier to the bonus (n factors). | |
149 # | |
150 # Examples: | |
151 # | |
152 # M 1 1e+100 | |
153 # means for the manual hint there is only one score class, the bonus for this | |
154 # type of hint is multiplied by 10^100. This practically forces AUGUSTUS to obey | |
155 # all manual hints. | |
156 # | |
157 # T 2 1.5 1 5e29 | |
158 # For the transMap hints distinguish 2 classes: those with a score below 1.5 and | |
159 # those with a score above 1.5. The bonus of the lower score hints is unchanged and | |
160 # the bonus of the higher score hints is multiplied by 5x10^29. | |
161 # | |
162 # D 8 1.5 2.5 3.5 4.5 5.5 6.5 7.5 0.58 0.4 0.2 2.9 0.87 0.44 0.31 7.3 | |
163 # Use 8 score classes for the DIALIGN hints. DIALIGN hints give a score, a strand and | |
164 # reading frame information for CDSpart hints. The strand and reading frame are often correct but not | |
165 # often enough to rely on them. To account for that I generated hints for all | |
166 # 6 combinations of a strand and reading frame and then used 2x2x2=8 different | |
167 # score classes: | |
168 # {low score, high score} x {DIALIGN strand, opposite strand} x {DIALIGN reading frame, other reading frame} | |
169 # This example shows that scores don't have to be monotonic. A higher score | |
170 # does not have to mean a higher bonus. They are merely a way of classifying the | |
171 # hints into categories as you wish. In particular, you could get the effect of | |
172 # having different sources by having just hints of one source and then distinguishing | |
173 # more score classes. | |
174 # | |
175 # | |
176 # Future plans: | |
177 # - Add fuzzy intron hints. Introns get a bonus only when they approximately | |
178 # have the same boundaries as in the hint. | |
179 # - Make the splice site hints fuzzy also. Allow a hint interval that contains a | |
180 # likely splice site, as opposed to only an individual position. | |
181 # - Write a program that automatically optimizes the boni and mali given an | |
182 # annotated test set of genes and hints for that set of sequences. | |
183 |
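
The grade-level format described in the comments above (source character, number of classes n, n-1 score boundaries, n multiplicative factors) can be illustrated with a small Python sketch. This is not AUGUSTUS code: the function names `parse_general_line`, `effective_bonus` and `unsupported_exon_malus` are hypothetical, and two details are assumptions rather than facts from the file. First, a score exactly equal to a boundary is placed in the higher class. Second, any third numeric value before the first source character (as in the `CDSpart` and `UTRpart` lines) is kept as `extra` without interpretation, since the file does not explain it.

```python
from bisect import bisect_right


def _is_number(token):
    """Return True if the token parses as a float (e.g. '1', '.992', '1e+100')."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_general_line(line):
    """Parse one [GENERAL] line: feature, bonus, malus, then per-source columns."""
    tokens = line.split()
    feature = tokens[0]
    i = 1
    leading = []
    # Numeric tokens before the first source character: bonus, malus, and
    # possibly further values that the file does not explain (kept as 'extra').
    while i < len(tokens) and _is_number(tokens[i]):
        leading.append(float(tokens[i]))
        i += 1
    entry = {
        "feature": feature,
        "bonus": leading[0],
        "malus": leading[1],
        "extra": leading[2:],
        "sources": {},
    }
    # Each source block: source, n, (n-1) boundaries, n factors.
    while i < len(tokens):
        source = tokens[i]
        n = int(tokens[i + 1])
        boundaries = [float(t) for t in tokens[i + 2 : i + n + 1]]
        factors = [float(t) for t in tokens[i + n + 1 : i + 2 * n + 1]]
        entry["sources"][source] = (boundaries, factors)
        i += 2 * n + 1
    return entry


def effective_bonus(entry, source, score=0.0):
    """Base bonus times the factor of the score class the hint falls into.

    Assumption: a score equal to a boundary belongs to the higher class.
    """
    boundaries, factors = entry["sources"][source]
    return entry["bonus"] * factors[bisect_right(boundaries, score)]


def unsupported_exon_malus(exonpart_malus, length):
    """Per-base exonpart malus compounded over an exon: malus ** length."""
    return exonpart_malus ** length


intron = parse_general_line("intron 1 .8 M 1 1e+100 RM 1 1 E 1 1000 W 1 1")
print(effective_bonus(intron, "E"))       # bonus for an EST-supported intron
print(unsupported_exon_malus(0.992, 100))  # penalty for a 100 bp unsupported exon
```

The second call shows why the comments recommend a per-base malus close to 1: the penalty compounds with every base of the exon, so even a value such as 0.992 shrinks noticeably over a few hundred bases.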