augustus_training: test-data/extrinsic.truncated.cfg comparison

comparison test-data/extrinsic.truncated.cfg @ 2:0d425a4b6896 draft

planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/augustus commit 0fed5bb024a096dcb5b2858520ba191da7798b6d

author	iuc
date	Thu, 23 May 2019 18:17:22 -0400


-:1fbb1135da16
+:0d425a4b6896
+# extrinsic information configuration file for AUGUSTUS
+#
+# protein hints
+# include with --extrinsicCfgFile=filename
+# date: 16.10.2007
+# Mario Stanke (mstanke@gwdg.de)
+# source of extrinsic information:
+# M manual anchor (required)
+# P protein database hit
+# E EST/cDNA database hit
+# C combined est/protein database hit
+# D Dialign
+# R retroposed genes
+# T transMapped refSeqs
+# W wiggle track coverage info from RNA-Seq
+[SOURCES]
+M RM E W
+#
+# individual_liability: Only unsatisfiable hints are disregarded. By default this flag is not set
+# and the whole hint group is disregarded when one hint in it is unsatisfiable.
+# 1group1gene: Try to predict a single gene that covers all hints of a given group. This is relevant for
+# hint groups with gaps, e.g. when two ESTs, say 5' and 3', from the same clone align nearby.
+#
+[SOURCE-PARAMETERS]
+#   feature        bonus         malus   gradelevelcolumns
+#		r+/r-
+#
+# the gradelevel columns have the following format for each source
+# sourcecharacter numscoreclasses boundary    ...  boundary    gradequot  ...  gradequot
+#
+[GENERAL]
+start        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+stop        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+tss        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+tts        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+ass        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+dss        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+exonpart        1     .992  M    1  1e+100  RM  1     1    E 1    1    W 1  1.005
+exon        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+intronpart        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+intron        1       .8  M    1  1e+100  RM  1     1    E 1    1000 W 1    1
+CDSpart        1  1 0.985  M    1  1e+100  RM  1     1    E 1    1	  W 1    1
+CDS        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+UTRpart        1   1 .973  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+UTR        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+irpart        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+nonexonpart        1        1  M    1  1e+100  RM  1     1.01 E 1    1    W 1    1
+genicpart        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
+#
+# Explanation:
+#
+# The gff/gtf file containing the hints must contain somewhere in the last
+# column an entry source=?, where ? is one of the source characters listed in
+# the line after [SOURCES] above. You can use different sources when you have
+# hints of different reliability of the same type, e.g. exon hints from ESTs
+# and exon hints from evolutionary conservation information.
+#
+# In the [GENERAL] section the entries in the second column specify a bonus for
+# obeying a hint and the entries in the third column specify a malus (penalty)
+# for predicting a feature that is not supported by any hint. The bonus and the
+# malus are factors that are multiplied to the posterior probability of gene
+# structures.
+# Example:
+#   CDS     1000  0.7  ....
+# means that, when AUGUSTUS is searching for the most likely gene structure,
+# every gene structure that has a CDS exactly as given in a hint gets
+# a bonus factor of 1000. Also, for every CDS that is not supported the
+# probability of the gene structure gets a malus of 0.7. Increase the bonus to
+# make AUGUSTUS obey more hints, decrease the malus to make AUGUSTUS predict fewer
+# features that are not supported by hints. The malus helps increase
+# specificity, e.g. when the exons predicted by AUGUSTUS are suspicious because
+# there is no evidence from ESTs, mRNAs, protein databases, sequence
+# conservation, or transMapped expressed sequences.
+# Setting the malus to 1.0 disables those penalties. Setting the bonus to 1.0
+# disables the boni.
+#
+#       start: translation start (start codon), specifies an interval that contains
+#              the start codon. The interval can be larger than 3bp, in which case
+#              every ATG in the interval gets a bonus. The highest bonus is given
+#              to ATGs in the middle of the interval, the bonus fades off towards the ends.
+#        stop: translation end  (stop codon), see 'start'
+#         tss: transcription start site, see 'start'
+#         tts: transcription termination site, see 'start'
+#         ass: acceptor (3') splice site, the last intron position
+#         dss: donor (5') splice site, the first intron position
+#    exonpart: part of an exon in the biological sense. The bonus applies only
+#              to exons that contain the interval from the hint. Just
+#              overlapping means no bonus at all. The malus applies to every
+#              base of an exon. Therefore the malus for an exon is exponential
+#              in the length of an exon: malus=exonpartmalus^length.
+#              Therefore the malus should be close to 1, e.g. 0.99.
+#        exon: exon in the biological sense. Only exons that exactly match the
+#              hint get a bonus. Exception: The exons that contain the start
+#              codon and stop codon. This malus applies to a complete exon
+#              independent of its length.
+#  intronpart: introns both between coding and non-coding exons. The bonus
+#              applies to every intronic base in the interval of the hint.
+#      intron: An intron gets the bonus if and only if it is exactly as in the hint.
+#     CDSpart: part of the coding part of an exon. (CDS = coding sequence)
+#         CDS: coding part of an exon with exact boundaries. For internal exons
+#              of a multi exon gene this is identical to the biological
+#              boundaries of the exon. For the first and the last coding exon
+#              the boundaries are the boundaries of the coding sequence (start, stop).
+#         UTR: exact boundaries of a UTR exon or the untranslated part of a
+#              partially coding exon.
+#     UTRpart: The hint interval must be included in the UTR part of an exon.
+#      irpart: The bonus applies to every base of the intergenic region. If UTR
+#              prediction is turned on (--UTR=on) then UTR is considered
+#              genic. If, against the usual meaning, you choose the bonus of
+#              irparts to be much smaller than 1 in the configuration file, you
+#              can force AUGUSTUS to not predict an intergenic region in the
+#              specified interval. This is useful if you want to tell AUGUSTUS
+#              that two distant exons belong to the same gene, when AUGUSTUS
+#              tends to split that gene into smaller genes.
+# nonexonpart: intergenic region or intron. The bonus applies to every non-exon
+#              base that overlaps with the interval from the hint. It is
+#              geometric in the length of that overlap, so choose it close to
+#              1.0. This is useful as a weak kind of masking, e.g. when it is
+#              unlikely that a retroposed gene contains a coding region but you
+#              do not want to completely forbid exons.
+#   genicpart: everything that is not intergenic region, i.e. intron or exon or UTR if
+#              applicable. The bonus applies to every genic base that overlaps with the
+#              interval from the hint. This can be used in particular to make Augustus
+#              predict one gene between positions a and b if a and b are experimentally
+#              confirmed to be part of the same gene, e.g. through ESTs from the same clone.
+#              alias: nonirpart
+#
+# Any hints of types dss, intron, exon, CDS, UTR that (implicitly) suggest a donor splice
+# site allow AUGUSTUS to predict a donor splice site that has a GC instead of the much more common GT.
+# AUGUSTUS does not predict a GC donor splice site unless there is a hint for one.
+#
+# Starting in column number 4 you can tell AUGUSTUS how to modify the bonus
+# depending on the source of the hint and the score of the hint.
+# The score of the hints is specified in the 6th column of the hint gff/gtf.
+# If the score is used at all, the score is not used directly through some
+# conversion formula but by distinguishing different classes of scores, e.g. low
+# score, medium score, high score. The format is the following:
+# First, you specify the source character, then the number of classes (say n), then you
+# specify the score boundaries that separate the classes (n-1 thresholds) and then you specify
+# for each score class the multiplicative modifier to the bonus (n factors).
+#
+# Examples:
+#
+# M 1 1e+100
+# means for the manual hint there is only one score class, the bonus for this
+# type of hint is multiplied by 10^100. This practically forces AUGUSTUS to obey
+# all manual hints.
+#
+# T    2       1.5 1 5e29
+# For the transMap hints distinguish 2 classes. Those with a score below 1.5 and
+# with a score above 1.5. The bonus of the lower score hints is unchanged and
+# the bonus of the higher score hints is multiplied by 5x10^29.
+#
+# D    8     1.5  2.5  3.5  4.5  5.5  6.5  7.5  0.58  0.4  0.2  2.9  0.87  0.44 0.31  7.3
+# Use 8 score classes for the DIALIGN hints. DIALIGN hints give a score, a strand and
+# reading frame information for CDSpart hints. The strand and reading frame are often correct but not
+# often enough to rely on them. To account for that I generated hints for all
+# 6 combinations of a strand and reading frame and then used 2x2x2=8 different
+# score classes:
+# {low score, high score} x {DIALIGN strand, opposite strand} x {DIALIGN reading frame, other reading frame}
+# This example shows that scores don't have to be monotonic. A higher score
+# does not have to mean a higher bonus. They are merely a way of classifying the
+# hints into categories as you wish. In particular, you could get the effect of
+# having different sources by having just hints of one source and then distinguishing
+# more score classes.
+#
+#
+# Future plans:
+# - Add fuzzy intron hints. Introns get a bonus only when they approximately
+# have the same boundaries as in the hint.
+# - Make the splice site hints fuzzy also. Allow a hint interval that contains a
+# likely splice site, as opposed to only an individual position.
+# - Write a program that automatically optimizes the boni and mali given an
+# annotated test set of genes and hints for that set of sequences.
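
The grade-level columns described in the file's comments (source character, number of score classes n, n-1 boundaries, n factors) can be sketched as a small parser. This is a minimal illustration in Python, not AUGUSTUS's own parsing code. The function names are hypothetical, and the treatment of a score that falls exactly on a boundary is an assumption, since the comments only speak of scores "below" and "above" a threshold. The comments name only a bonus and a malus column, so any further numeric column before the first source character (as on the CDSpart and UTRpart lines) is kept uninterpreted.

```python
from bisect import bisect_right


def parse_grade_columns(tokens, sources):
    """Parse 'src n b_1..b_(n-1) f_1..f_n' groups into {src: (boundaries, factors)}."""
    specs = {}
    i = 0
    while i < len(tokens):
        src = tokens[i]
        if src not in sources:
            raise ValueError(f"unknown source character: {src!r}")
        n = int(tokens[i + 1])
        boundaries = [float(t) for t in tokens[i + 2 : i + 1 + n]]
        factors = [float(t) for t in tokens[i + 1 + n : i + 1 + 2 * n]]
        if len(factors) != n:
            raise ValueError(f"source {src!r}: expected {n} factors")
        specs[src] = (boundaries, factors)
        i += 1 + 2 * n  # skip source, class count, n-1 boundaries, n factors
    return specs


def parse_general_line(line, sources):
    """Split one [GENERAL] line into feature name, leading numbers, per-source specs."""
    tokens = line.split()
    # The numeric columns end where the first source character appears.
    first = next(k for k, t in enumerate(tokens) if k > 0 and t in sources)
    return {
        "feature": tokens[0],
        "numeric": [float(t) for t in tokens[1:first]],  # bonus, malus, any extra column
        "sources": parse_grade_columns(tokens[first:], sources),
    }


def modifier(spec, score):
    """Return the multiplicative bonus modifier for a hint score.

    Assumption: a score equal to a boundary counts as the higher class.
    """
    boundaries, factors = spec
    return factors[bisect_right(boundaries, score)]


SOURCES = {"M", "RM", "E", "W"}  # from the [SOURCES] line of this file
exonpart = parse_general_line(
    "exonpart        1     .992  M    1  1e+100  RM  1     1    E 1    1    W 1  1.005",
    SOURCES,
)
transmap = parse_grade_columns("T 2 1.5 1 5e29".split(), {"T"})["T"]
```

With the file's own transMap example, `modifier(transmap, 1.0)` leaves the bonus unchanged and `modifier(transmap, 2.0)` multiplies it by 5e29.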

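
The comments on `exonpart` state that the per-base malus compounds as malus = exonpartmalus^length, which is why the value should stay close to 1. The sketch below only illustrates that arithmetic in Python. The function names are hypothetical, and the second function assumes the bonus is applied once per feature that matches a hint and the malus once per unsupported feature, which is one reading of the `CDS 1000 0.7` example rather than a statement about AUGUSTUS internals.

```python
import math


def per_base_malus(malus, length):
    """Total malus for an unsupported feature of `length` bases (malus ** length)."""
    return malus ** length


def log_structure_factor(bonus, malus, n_supported, n_unsupported):
    """Log of the combined factor for whole-feature hints such as CDS or exon.

    Assumption: one bonus per supported feature, one malus per unsupported one.
    Working in log space avoids overflow with bonuses as large as 1e+100.
    """
    return n_supported * math.log(bonus) + n_unsupported * math.log(malus)


# exonpart malus from this file (.992) over a hypothetical 300-base exon
total_exonpart = per_base_malus(0.992, 300)

# The 'CDS 1000 0.7' example: two CDS matching hints, one CDS without support
log_cds = log_structure_factor(1000, 0.7, n_supported=2, n_unsupported=1)
```

Even a malus as mild as .992 shrinks the factor by more than an order of magnitude over a few hundred bases, so values much further from 1 would effectively forbid unsupported exons.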

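
The explanation in the file requires every hint line to carry a `source=?` entry in its last column, with `?` drawn from the `[SOURCES]` line, and places the hint score in the 6th column. A quick pre-flight check of a hints file could look like the following Python sketch. The function names and the sample hint line are hypothetical, and the sketch matches only the literal `source=` key named in the comments.

```python
import re

SOURCE_RE = re.compile(r"(?:^|[;\s])source=([^;\s]+)")


def hint_source(gff_line):
    """Return the source character of a tab-separated hint line, or None if absent."""
    cols = gff_line.rstrip("\n").split("\t")
    match = SOURCE_RE.search(cols[-1])  # 'somewhere in the last column'
    return match.group(1) if match else None


def hint_score(gff_line):
    """Return the raw 6th-column score field of a hint line."""
    return gff_line.rstrip("\n").split("\t")[5]


def is_known_source(gff_line, sources):
    """True if the hint names a source that is listed under [SOURCES]."""
    return hint_source(gff_line) in sources


SOURCES = {"M", "RM", "E", "W"}  # from the [SOURCES] line of this file
# Hypothetical hint line, for illustration only
sample = "chr1\tmyhints\tintron\t1000\t2000\t0\t+\t.\tgroup=clone7;source=E"
ok = is_known_source(sample, SOURCES)
```

A hint whose source is missing or not listed under `[SOURCES]` makes `is_known_source` return `False`, which is a cheap way to catch mismatches before running a prediction.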