# HG changeset patch
# User jasper
# Date 1494351286 14400
# Node ID 647866afd876ca20c3326f878ca3b8f870a531bf
# Parent 6c6b16cab42a9bc34cb24411188339ed38caa6bc
Uploaded
diff -r 6c6b16cab42a -r 647866afd876 tools/align_back_trans/align_back_trans.py
--- a/tools/align_back_trans/align_back_trans.py Tue May 09 13:22:02 2017 -0400
+++ b/tools/align_back_trans/align_back_trans.py Tue May 09 13:34:46 2017 -0400
@@ -25,10 +25,6 @@
from Bio import AlignIO
from Bio.Data.CodonTable import ambiguous_generic_by_id
-if "-v" in sys.argv or "--version" in sys.argv:
- print "v0.0.7"
- sys.exit(0)
-
def check_trans(identifier, nuc, prot, table):
"""Returns nucleotide sequence if works (can remove trailing stop)"""
diff -r 6c6b16cab42a -r 647866afd876 tools/align_back_trans/align_back_trans.xml
--- a/tools/align_back_trans/align_back_trans.xml Tue May 09 13:22:02 2017 -0400
+++ b/tools/align_back_trans/align_back_trans.xml Tue May 09 13:34:46 2017 -0400
@@ -9,7 +9,6 @@
- align_back_trans.py --version
align_back_trans.py $prot_align.ext "$prot_align" "$nuc_file" "$out_nuc_align" "$table"
diff -r 6c6b16cab42a -r 647866afd876 tools/test-data/demo_nuc_align.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/test-data/demo_nuc_align.fasta Tue May 09 13:34:46 2017 -0400
@@ -0,0 +1,6 @@
+>Alpha
+GATGAGGAACGA
+>Beta
+GATGAG---CGU
+>Gamma
+GAT------CGG
diff -r 6c6b16cab42a -r 647866afd876 tools/test-data/demo_nucs.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/test-data/demo_nucs.fasta Tue May 09 13:34:46 2017 -0400
@@ -0,0 +1,6 @@
+>Alpha
+GATGAGGAACGA
+>Beta
+GATGAGCGU
+>Gamma
+GATCGG
diff -r 6c6b16cab42a -r 647866afd876 tools/test-data/demo_nucs_trailing_stop.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/test-data/demo_nucs_trailing_stop.fasta Tue May 09 13:34:46 2017 -0400
@@ -0,0 +1,6 @@
+>Alpha
+GATGAGGAACGA
+>Beta
+GATGAGCGU
+>Gamma
+GATCGGTAA
diff -r 6c6b16cab42a -r 647866afd876 tools/test-data/demo_prot_align.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/test-data/demo_prot_align.fasta Tue May 09 13:34:46 2017 -0400
@@ -0,0 +1,6 @@
+>Alpha
+DEER
+>Beta
+DE-R
+>Gamma
+D--R
diff -r 6c6b16cab42a -r 647866afd876 tools/tools/align_back_trans/README.rst
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/tools/align_back_trans/README.rst Tue May 09 13:34:46 2017 -0400
@@ -0,0 +1,131 @@
+Galaxy tool to back-translate a protein alignment to nucleotides
+================================================================
+
+This tool is copyright 2012-2015 by Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the licence text below (MIT licence).
+
+This tool is a short Python script (using Biopython library functions) to
+load a protein alignment, and matching nucleotide FASTA file of unaligned
+sequences, which are threaded onto the protein alignment in order to produce
+a codon aware nucleotide alignment - which can be viewed as a back translation.
+
+This tool is available from the Galaxy Tool Shed at:
+
+* http://toolshed.g2.bx.psu.edu/view/peterjc/align_back_trans
+
+The underlying Python script can also be used outside of Galaxy, for
+details run::
+
+ $ python align_back_trans.py
+
+Automated Installation
+======================
+
+This should be straightforward using the Galaxy Tool Shed, which should be
+able to automatically install the dependency on Biopython, and then install
+this tool and run its unit tests.
+
+
+Manual Installation
+===================
+
+There are just two files to install to use this tool from within Galaxy:
+
+* ``align_back_trans.py`` (the Python script)
+* ``align_back_trans.xml`` (the Galaxy tool definition)
+
+The suggested location is in a dedicated ``tools/align_back_trans`` folder.
+
+You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer
+the tool. One suggested location is in the multiple alignments section. Simply
+add the line::
+
+
+
+You will also need to install Biopython 1.62 or later.
+
+If you wish to run the unit tests, also move/copy the ``test-data/`` files
+under Galaxy's ``test-data/`` folder. Then::
+
+ ./run_tests.sh -id align_back_trans
+
+That's it.
+
+
+History
+=======
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.1 - Initial version, based on a previously written Python script
+v0.0.2 - Optionally check the translation is consistent
+v0.0.3 - First official release
+v0.0.4 - Simplified XML to apply input format to output data.
+ - Fixed error message when sequence length not a multiple of three.
+v0.0.5 - More explicit error messages when seqences lengths do not match.
+ - Tool definition now embeds citation information.
+v0.0.6 - Reorder XML elements (internal change only).
+ - Use ``format_source=...`` tag.
+ - Planemo for Tool Shed upload (``.shed.yml``, internal change only).
+v0.0.7 - Minor Python code style improvements (internal change only).
+======= ======================================================================
+
+
+Developers
+==========
+
+This script was initially developed on this repository:
+https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py
+
+With the addition of a Galaxy wrapper, developement moved here:
+https://github.com/peterjc/pico_galaxy/tree/master/tools/align_back_trans
+
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which requires you have set your Tool Shed access details in
+``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+ $ planemo shed_update -t testtoolshed --check_diff ~/repositories/pico_galaxy/tools/align_back_trans/
+ ...
+
+or::
+
+ $ planemo shed_update -t toolshed --check_diff ~/repositories/pico_galaxy/tools/align_back_trans/
+ ...
+
+To just build and check the tar ball, use::
+
+ $ planemo shed_upload --tar_only ~/repositories/pico_galaxy/tools/align_back_trans/
+ ...
+ $ tar -tzf shed_upload.tar.gz
+ test-data/demo_nucs.fasta
+ test-data/demo_nucs_trailing_stop.fasta
+ test-data/demo_prot_align.fasta
+ test-data/demo_nuc_align.fasta
+ tools/align_back_trans/README.rst
+ tools/align_back_trans/align_back_trans.py
+ tools/align_back_trans/align_back_trans.xml
+ tools/align_back_trans/tool_dependencies.xml
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff -r 6c6b16cab42a -r 647866afd876 tools/tools/align_back_trans/align_back_trans.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/tools/align_back_trans/align_back_trans.py Tue May 09 13:34:46 2017 -0400
@@ -0,0 +1,185 @@
+#!/usr/bin/env python
+"""Back-translate a protein alignment to nucleotides
+
+This tool is a short Python script (using Biopython library functions) to
+load a protein alignment, and matching nucleotide FASTA file of unaligned
+sequences, in order to produce a codon aware nucleotide alignment - which
+can be viewed as a back translation.
+
+The development repository for this tool is here:
+
+* https://github.com/peterjc/pico_galaxy/tree/master/tools/align_back_trans
+
+This tool is available with a Galaxy wrapper from the Galaxy Tool Shed at:
+
+* http://toolshed.g2.bx.psu.edu/view/peterjc/align_back_trans
+
+See accompanying text file for licence details (MIT licence).
+"""
+
+import sys
+from Bio.Seq import Seq
+from Bio.Alphabet import generic_protein
+from Bio.Align import MultipleSeqAlignment
+from Bio import SeqIO
+from Bio import AlignIO
+from Bio.Data.CodonTable import ambiguous_generic_by_id
+
+
+def check_trans(identifier, nuc, prot, table):
+ """Returns nucleotide sequence if works (can remove trailing stop)"""
+ if len(nuc) % 3:
+ sys.exit("Nucleotide sequence for %s is length %i (not a multiple of three)"
+ % (identifier, len(nuc)))
+
+ p = str(prot).upper().replace("*", "X")
+ t = str(nuc.translate(table)).upper().replace("*", "X")
+ if len(t) == len(p) + 1:
+ if str(nuc)[-3:].upper() in ambiguous_generic_by_id[table].stop_codons:
+ # Allow this...
+ t = t[:-1]
+ nuc = nuc[:-3] # edit return value
+ if len(t) != len(p):
+ err = ("Inconsistent lengths for %s, ungapped protein %i, "
+ "tripled %i vs ungapped nucleotide %i." %
+ (identifier, len(p), len(p) * 3, len(nuc)))
+ if t.endswith(p):
+ err += "\nThere are %i extra nucleotides at the start." % (len(t) - len(p))
+ elif t.startswith(p):
+ err += "\nThere are %i extra nucleotides at the end." % (len(t) - len(p))
+ elif p in t:
+ # TODO - Calculate and report the number to trim at start and end?
+ err += "\nHowever, protein sequence found within translated nucleotides."
+ elif p[1:] in t:
+ err += "\nHowever, ignoring first amino acid, protein sequence found within translated nucleotides."
+ sys.exit(err)
+
+ if t == p:
+ return nuc
+ elif p.startswith("M") and "M" + t[1:] == p:
+ # Close, was there a start codon?
+ if str(nuc[0:3]).upper() in ambiguous_generic_by_id[table].start_codons:
+ return nuc
+ else:
+ sys.exit("Translation check failed for %s\n"
+ "Would match if %s was a start codon (check correct table used)\n"
+ % (identifier, nuc[0:3].upper()))
+ else:
+ # Allow * vs X here? e.g. internal stop codons
+ m = "".join("." if x == y else "!" for (x, y) in zip(p, t))
+ if len(prot) < 70:
+ sys.stderr.write("Protein: %s\n" % p)
+ sys.stderr.write(" %s\n" % m)
+ sys.stderr.write("Translation: %s\n" % t)
+ else:
+ for offset in range(0, len(p), 60):
+ sys.stderr.write("Protein: %s\n" % p[offset:offset + 60])
+ sys.stderr.write(" %s\n" % m[offset:offset + 60])
+ sys.stderr.write("Translation: %s\n\n" % t[offset:offset + 60])
+ sys.exit("Translation check failed for %s\n" % identifier)
+
+
+def sequence_back_translate(aligned_protein_record, unaligned_nucleotide_record, gap, table=0):
+ # TODO - Separate arguments for protein gap and nucleotide gap?
+ if not gap or len(gap) != 1:
+ raise ValueError("Please supply a single gap character")
+
+ alpha = unaligned_nucleotide_record.seq.alphabet
+ if hasattr(alpha, "gap_char"):
+ gap_codon = alpha.gap_char * 3
+ assert len(gap_codon) == 3
+ else:
+ from Bio.Alphabet import Gapped
+ alpha = Gapped(alpha, gap)
+ gap_codon = gap * 3
+
+ ungapped_protein = aligned_protein_record.seq.ungap(gap)
+ ungapped_nucleotide = unaligned_nucleotide_record.seq
+ if table:
+ ungapped_nucleotide = check_trans(aligned_protein_record.id, ungapped_nucleotide, ungapped_protein, table)
+ elif len(ungapped_protein) * 3 != len(ungapped_nucleotide):
+ sys.exit("Inconsistent lengths for %s, ungapped protein %i, "
+ "tripled %i vs ungapped nucleotide %i" %
+ (aligned_protein_record.id,
+ len(ungapped_protein),
+ len(ungapped_protein) * 3,
+ len(ungapped_nucleotide)))
+
+ seq = []
+ nuc = str(ungapped_nucleotide)
+ for amino_acid in aligned_protein_record.seq:
+ if amino_acid == gap:
+ seq.append(gap_codon)
+ else:
+ seq.append(nuc[:3])
+ nuc = nuc[3:]
+ assert not nuc, "Nucleotide sequence for %r longer than protein %r" \
+ % (unaligned_nucleotide_record.id, aligned_protein_record.id)
+
+ aligned_nuc = unaligned_nucleotide_record[:] # copy for most annotation
+ aligned_nuc.letter_annotation = {} # clear this
+ aligned_nuc.seq = Seq("".join(seq), alpha) # replace this
+ assert len(aligned_protein_record.seq) * 3 == len(aligned_nuc)
+ return aligned_nuc
+
+
+def alignment_back_translate(protein_alignment, nucleotide_records, key_function=None, gap=None, table=0):
+ """Thread nucleotide sequences onto a protein alignment."""
+ # TODO - Separate arguments for protein and nucleotide gap characters?
+ if key_function is None:
+ key_function = lambda x: x
+ if gap is None:
+ gap = "-"
+
+ aligned = []
+ for protein in protein_alignment:
+ try:
+ nucleotide = nucleotide_records[key_function(protein.id)]
+ except KeyError:
+ raise ValueError("Could not find nucleotide sequence for protein %r"
+ % protein.id)
+ aligned.append(sequence_back_translate(protein, nucleotide, gap, table))
+ return MultipleSeqAlignment(aligned)
+
+
+if len(sys.argv) == 4:
+ align_format, prot_align_file, nuc_fasta_file = sys.argv[1:]
+ nuc_align_file = sys.stdout
+ table = 0
+elif len(sys.argv) == 5:
+ align_format, prot_align_file, nuc_fasta_file, nuc_align_file = sys.argv[1:]
+ table = 0
+elif len(sys.argv) == 6:
+ align_format, prot_align_file, nuc_fasta_file, nuc_align_file, table = sys.argv[1:]
+else:
+ sys.exit("""This is a Python script for 'back-translating' a protein alignment,
+
+It requires three, four or five arguments:
+- alignment format (e.g. fasta, clustal),
+- aligned protein file (in specified format),
+- unaligned nucleotide file (in fasta format).
+- aligned nucleotiode output file (in same format), optional.
+- NCBI translation table (0 for none), optional
+
+The nucleotide alignment is printed to stdout if no output filename is given.
+
+Example usage:
+
+$ python align_back_trans.py fasta demo_prot_align.fasta demo_nucs.fasta demo_nuc_align.fasta
+
+Warning: If the output file already exists, it will be overwritten.
+
+This script is available with sample data and a Galaxy wrapper here:
+https://github.com/peterjc/pico_galaxy/tree/master/tools/align_back_trans
+http://toolshed.g2.bx.psu.edu/view/peterjc/align_back_trans
+""")
+
+try:
+ table = int(table)
+except ValueError:
+ sys.exit("Bad table argument %r" % table)
+
+prot_align = AlignIO.read(prot_align_file, align_format, alphabet=generic_protein)
+nuc_dict = SeqIO.index(nuc_fasta_file, "fasta")
+nuc_align = alignment_back_translate(prot_align, nuc_dict, gap="-", table=table)
+AlignIO.write(nuc_align, nuc_align_file, align_format)
diff -r 6c6b16cab42a -r 647866afd876 tools/tools/align_back_trans/align_back_trans.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/tools/align_back_trans/align_back_trans.xml Tue May 09 13:34:46 2017 -0400
@@ -0,0 +1,128 @@
+
+ Gives a codon aware alignment
+
+ biopython
+ Bio
+
+
+
+
+
+
+
+align_back_trans.py $prot_align.ext "$prot_align" "$nuc_file" "$out_nuc_align" "$table"
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+**What it does**
+
+Takes an input file of aligned protein sequences (typically FASTA or Clustal
+format), and a matching file of unaligned nucleotide sequences (FASTA format,
+using the same identifiers), and threads the nucleotide sequences onto the
+protein alignment to produce a codon aware nucleotide alignment - which can
+be viewed as a back translation.
+
+If you specify one of the standard NCBI genetic codes (recommended), then the
+translation is verified. This will allow fuzzy matching if stop codons in the
+protein sequence have been reprented as X, and will allow for a trailing stop
+codon present in the nucleotide sequences but not the protein.
+
+Note - the protein and nucleotide sequences must use the same identifers.
+
+Note - If no translation table is specified, the provided nucleotide sequences
+should be exactly three times the length of the protein sequences (exluding the gaps).
+
+Note - the nucleotide FASTA file may contain extra sequences not in the
+protein alignment, they will be ignored. This can be useful if for example
+you have a nucleotide FASTA file containing all the genes in an organism,
+while the protein alignment is for a specific gene family.
+
+**Example**
+
+Given this protein alignment in FASTA format::
+
+ >Alpha
+ DEER
+ >Beta
+ DE-R
+ >Gamma
+ D--R
+
+and this matching unaligned nucleotide FASTA file::
+
+ >Alpha
+ GATGAGGAACGA
+ >Beta
+ GATGAGCGU
+ >Gamma
+ GATCGG
+
+the tool would return this nucleotide alignment::
+
+ >Alpha
+ GATGAGGAACGA
+ >Beta
+ GATGAG---CGU
+ >Gamma
+ GAT------CGG
+
+Notice that all the gaps are multiples of three in length.
+
+
+**Citation**
+
+This tool uses Biopython, so if you use this Galaxy tool in work leading to a
+scientific publication please cite the following paper:
+
+Cock et al (2009). Biopython: freely available Python tools for computational
+molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
+http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
+
+This tool is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/align_back_trans
+
+
+ 10.7717/peerj.167
+ 10.1093/bioinformatics/btp163
+
+