# HG changeset patch # User davidmurphy # Date 1326456913 18000 # Node ID 33ac482245236a49e4e44564746637e72b1a1877 # Parent c55bdc2fb9fad5f637f4491f11502b1b9bb71181 Deleted selected files diff -r c55bdc2fb9fa -r 33ac48224523 Codonlogo.xml --- a/Codonlogo.xml Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,128 +0,0 @@ - - generator for fasta (eg Clustal alignments) - - codonlogo -F $outformat -s $size -f $input -o $output -t "$logoname" -m $frame -n $stacks - #if $range.mode == 'part' - -l "$range.seqstart" -u "$range.seqend" - #end if - #if $comp.mode == 'equiprobable' - --composition 'equiprobable' - #end if - #if $comp.mode == 'none' - --composition 'none' - #end if - #if $comp.mode == 'file' - -R $compfile - #end if - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -**Note** - -This tool uses CodonLogo in Galaxy to generate a sequence logo from a provided alignment. -The input file must be an alignment in your current history. -The tool will attempt to read a provided file and if it is unable to it will generate an error. - -A typical output looks like this - -.. image:: ./static/images/CodonLogoExample.png - ----- - -**Warning about input files** - -The program used by this tool will fail if your alignment files are not all the same length. - -Fasta alignments from ClustalW Galaxy tool will work but many other fasta files may cause this tool to fail - please do not file -a Galaxy bug report - this is a feature of the tool and a problem with your source data - not a tool error - please make certain all your fasta -sequences are the same length! - ----- - -**Attribution** - - -This Galaxy wrapper was modified for CodonLogo by David Murphy and is based on the wrapper written by Ross Lazarus for the rgenetics project and the source code is licensed under the LGPL_ - -.. _Weblogo3: http://weblogo.berkeley.edu/ -.. 
_LGPL: http://www.gnu.org/copyleft/lesser.html -.. _CodonLogo: http://recode.ucc.ie/CodonLogo - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 LICENSE.txt --- a/LICENSE.txt Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,135 +0,0 @@ -=============================================================================== - - CoreBio and WebLogo : Copyrights and Licenses - -=============================================================================== - -This package is distributed under the new BSD Open Source License. Much -of the code was written by Gavin E. Crooks, Gary Hon, Steven Brenner, John-Marc -Chandonia, Liana Lareau, David Ding, Clare Gollnick,David Murphy and other contributers. - - -Copyright Notice -================ -Copyright (c) 2006, The Regents of the University of California, through -Lawrence Berkeley National Laboratory (subject to receipt of any required -approvals from the U.S. Dept. of Energy). All rights reserved. - -Parts of this software package are covered by the individual copyrights of the -contributers. Please refer to individual source code files for further details. - -NOTICE. This software was developed under funding from the U.S. Department of -Energy. As such, the U.S. Government has been granted for itself and others -acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in -the Software to reproduce, prepare derivative works, and perform publicly and -display publicly. Beginning five (5) years after the date permission to assert -copyright is obtained from the U.S. Department of Energy, and subject to any -subsequent five (5) year renewals, the U.S. Government is granted for itself -and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide -license in the Software to reproduce, prepare derivative works, distribute -copies to the public, perform publicly and display publicly, and to permit -others to do so. 
- - -The new BSD Open Source License -=============================== - -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. 
- - - -The MIT Open Source License -=========================== - -Parts of the code are covered by the MIT license, which is equivalent to the -BSD license apart from the third, no promotion clause. Please refer to -individual source code files for further details. - -# Copyright (c) 2003-2004 The Regents of the University of California. -# Copyright (c) 2005 Gavin E. Crooks -# Copyright (c) 2006 David Ding -# Copyright (c) 2006 Clare Gollnick -# Copyright (c) 2002-2005 ActiveState Corp. - -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING -# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS -# IN THE SOFTWARE. - - - -The Biopython License (biopython.org) -===================================== -Parts of the code are covered by the biopython license, which is functionally -equivalent to the BSD license. 
- - -# Biopython License Agreement -# -# Permission to use, copy, modify, and distribute this software and its -# documentation with or without modifications and for any purpose and -# without fee is hereby granted, provided that any copyright notices -# appear in all copies and that both those copyright notices and this -# permission notice appear in supporting documentation, and that the -# names of the contributors or copyright holders not be used in -# advertising or publicity pertaining to distribution of the software -# without specific prior permission. - -# THE CONTRIBUTORS AND COPYRIGHT HOLDERS OF THIS SOFTWARE DISCLAIM ALL -# WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED -# WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE -# CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY SPECIAL, INDIRECT -# OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS -# OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE -# OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE -# OR PERFORMANCE OF THIS SOFTWARE. - - diff -r c55bdc2fb9fa -r 33ac48224523 PKG-INFO --- a/PKG-INFO Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,62 +0,0 @@ -Metadata-Version: 1.0 -Name: weblogo -Version: 3.0 -Summary: WebLogo3 : Sequence Logos Redrawn -Home-page: http://code.google.com/p/weblogo/ -Author: Gavin Crooks -Author-email: gec@threeplusone.com -License: UNKNOWN -Download-URL: http://weblogo.googlecode.com/svn/dist/weblogo-3.0.tar.gz -Description: - WebLogo (http://code.google.com/p/weblogo/) is a tool for creating sequence - logos from biological sequence alignments. It can be run on the command line, - as a standalone webserver, as a CGI webapp, or as a python library. 
- - The main WebLogo webserver is located at http://bespoke.lbl.gov/weblogo/ - - Please consult the manual for installation instructions and more information: - (Also located in the weblogolib/htdocs subdirectory.) - - http://bespoke.lbl.gov/weblogo/manual.html - - For help on the command line interface run - ./weblogo --help - - To build a simple logo run - ./weblogo < cap.fa > logo0.eps - - To run as a standalone webserver at localhost:8080 - ./weblogo --server - - To create a logo in python code: - >>> from weblogolib import * - >>> fin = open('cap.fa') - >>> seqs = read_seq_data(fin) - >>> data = LogoData.from_seqs(seqs) - >>> options = LogoOptions() - >>> options.title = "A Logo Title" - >>> format = LogoFormat(data, options) - >>> fout = open('cap.eps', 'w') - >>> eps_formatter( data, format, fout) - - - -- Distribution and Modification -- - This package is distributed under the new BSD Open Source License. - Please see the LICENSE.txt file for details on copyright and licensing. - The WebLogo source code can be downloaded from - http://code.google.com/p/weblogo/ - - WebLogo requires Python 2.3, 2.4 or 2.5, the corebio python toolkit for - computational biology (http://code.google.com/p/corebio), and the python - array package 'numpy' (http://www.scipy.org/Download) - -Platform: UNKNOWN -Classifier: Development Status :: 5 - Production/Stable -Classifier: Intended Audience :: Science/Research -Classifier: License :: OSI Approved :: BSD License -Classifier: Topic :: Scientific/Engineering :: Bio-Informatics -Classifier: Programming Language :: Python -Classifier: Natural Language :: English -Classifier: Operating System :: OS Independent -Classifier: Topic :: Software Development :: Libraries -Classifier: Topic :: Software Development :: Libraries :: Python Modules diff -r c55bdc2fb9fa -r 33ac48224523 README.txt --- a/README.txt Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,25 +0,0 @@ - -CodonLogo 
(http://recode.ucc.ie/CodonLogo) is a tool for creating sequence -logos from biological sequence alignments. It can be run on the command line, -as a standalone webserver or as a CGI webapp. - - -For help on the command line interface run - ./codonlogo --help - -To build a simple logo run - ./codonlogo < cap.fa > logo0.eps - -To run as a standalone webserver at localhost:8080 - ./codonlogo --server - - --- Distribution and Modification -- -This package is distributed under the new BSD Open Source License. -Please see the LICENSE.txt file for details on copyright and licensing. -The CodonLogo source code can be downloaded from -http://recode.ucc.ie/CodonLogo - -CodonLogo requires Python 2.6 or 2.7, the corebio python toolkit for -computational biology (http://code.google.com/p/corebio), and the python -array package 'numpy' (http://www.scipy.org/Download) diff -r c55bdc2fb9fa -r 33ac48224523 build_test.sh --- a/build_test.sh Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,148 +0,0 @@ - -mkdir -p tmp - -echo "# Test weblogo by building logos with many different options." - -echo -ne '.' -./weblogo < cap.fa > tmp/logo0.eps ||exit - -echo -ne '.' -./weblogo --title "Default Logo with Title" < cap.fa > tmp/logo1.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint "Default Logo with this fineprint and debug on" < cap.fa > tmp/logo2.eps ||exit - -echo -ne '.' -./weblogo --debug no --fineprint "--debug no" --debug no < cap.fa > tmp/logo3.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint "" --title "No fine print" --debug yes < cap.fa > tmp/logo4.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint "No title" --title "" < cap.fa > tmp/logo5.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint "--first-index -10" --first-index -10 < cap.fa > tmp/logo6.eps ||exit - -echo -ne '.' 
-./weblogo --debug yes --fineprint " --first-index -10 --stacks-per-line 11 " --first-index -10 --stacks-per-line 11 < cap.fa > tmp/logo7a.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint " --first-index -10 --stacks-per-line 8 " --first-index -10 --stacks-per-line 8 < cap.fa > tmp/logo7b.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint " --first-index -10 --stacks-per-line 7 " --first-index -10 --stacks-per-line 7 < cap.fa > tmp/logo7c.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint "Test fin and fout" --fin cap.fa --fout logo8.eps ||exit - -# Test Y Axis - -echo -ne '.' -./weblogo --debug yes --fineprint "Custom yaxis label " --ylabel 'yaxis label' < cap.fa > tmp/logo9a.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint "Custom units" --units 'nats' < cap.fa > tmp/logo9b.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint "Override custom units with custom yaxis label." --ylabel 'yaxis label' --units nats < cap.fa > tmp/logo9c.eps ||exit - -echo -ne '.' -./weblogo --debug yes --fineprint "Empty ylabel" --ylabel '' < cap.fa > tmp/logo9d.eps - -echo -ne '.' -./weblogo --debug yes --fineprint "No Yaxis" --show-yaxis no < cap.fa > tmp/logo9e.eps ||exit - -# Test X Axis - -echo -ne '.' -./weblogo --debug yes --format pdf --fineprint "Custom xaxis label " --xlabel 'xaxis label' < cap.fa > tmp/logo10a.pdf ||exit - -echo -ne '.' -./weblogo --debug yes --format pdf --fineprint "Empty xlabel" --xlabel '' < cap.fa > tmp/logo10b.pdf ||exit - -echo -ne '.' -./weblogo --debug yes --format pdf --fineprint "No Xaxis" --show-xaxis no < cap.fa > tmp/logo10c.pdf ||exit - -echo -ne '.' -./weblogo --debug yes --format pdf --fineprint "No Xaxis, custom label" --xlabel "Custom xlabel" --show-xaxis no < cap.fa > tmp/logo10d.pdf ||exit - -# Test Formats - -echo -ne '.' -./weblogo --debug no --fineprint "Format: eps" --format eps < cap.fa > tmp/logo11a.eps ||exit - -echo -ne '.' 
-./weblogo --debug no --fineprint "Format: png" --size large --format png < cap.fa > tmp/logo11b.png ||exit - -echo -ne '.' -./weblogo --debug no --fineprint "Format: png high res" --format png_print < cap.fa > tmp/logo11c.png ||exit - -echo -ne '.' -./weblogo --debug no --fineprint "Format: pdf" --format pdf < cap.fa > tmp/logo11d.pdf ||exit - -echo -ne '.' -./weblogo --debug no --fineprint "Format: jpeg" --size large --format jpeg < cap.fa > tmp/logo11e.jpeg ||exit - -echo -ne '.' -./weblogo --debug no --fineprint "Format: EPS" --format EPS < cap.fa > tmp/logo11f.eps ||exit - -# Test Sizes - -echo -ne '.' -./weblogo --debug no --format png_print --fineprint "default size" < cap.fa > tmp/logo12_default.png ||exit - -echo -ne '.' -./weblogo --debug no --format png_print --fineprint "--size large" --size large < cap.fa > tmp/logo12_large.png ||exit - -echo -ne '.' -./weblogo --debug no --format png_print --fineprint "--size medium" --size medium < cap.fa > tmp/logo12_medium.png ||exit - -echo -ne '.' -./weblogo --debug no --format png_print --fineprint "--size small" --size small < cap.fa > tmp/logo12_small.png ||exit - - - -echo -ne '.' -./weblogo --format pdf --fineprint "" > tmp/logo13.pdf << LimitString -> -GTTGTTGTTGTT -> -GTCGTCGTCGTC -> -GGGGGGGGGGGG -> -GGAGGAGGAGGA -LimitString - - - - -# Test unit options -echo -ne '.' -./weblogo --format pdf --fineprint "probability" --unit probability > tmp/logo14a.pdf < cap.fa ||exit - -echo -ne '.' -./weblogo --format pdf --fineprint "bits" --unit bits > tmp/logo14b.pdf < cap.fa ||exit - -echo -ne '.' -./weblogo --format pdf --fineprint "nats" --unit nats > tmp/logo14c.pdf < cap.fa ||exit - -echo -ne '.' -./weblogo --format pdf --fineprint "kJ/mol" --unit kJ/mol \ - > tmp/logo14d.pdf < cap.fa ||exit - -echo -ne '.' -./weblogo --format pdf --fineprint "kT" --unit kT \ - > tmp/logo14e.pdf < cap.fa ||exit - -echo -ne '.' 
-./weblogo --format pdf --fineprint "kcal/mol" --unit kcal/mol \ - > tmp/logo14f.pdf < cap.fa || exit - - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 cap.fa --- a/cap.fa Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,98 +0,0 @@ ->aldB -18->4 -attcgtgatagctgtcgtaaag ->ansB 103->125 -ttttgttacctgcctctaactt ->araB1 109->131 -aagtgtgacgccgtgcaaataa ->araB2 147->169 -tgccgtgattatagacactttt ->cdd 1 107->129 -atttgcgatgcgtcgcgcattt ->cdd 2 57->79 -taatgagattcagatcacatat ->crp 1 115->137 -taatgtgacgtcctttgcatac ->crp 2 -gaaggcgacctgggtcatgctg ->cya 151->173 -aggtgttaaattgatcacgttt ->cytR 1 125->147 -cgatgcgaggcggatcgaaaaa ->cytR 2 106->128 -aaattcaatattcatcacactt ->dadAX 1 95->117 -agatgtgagccagctcaccata ->dadAX 2 32->54 -agatgtgattagattattattc ->deoP2 1 75->97 -aattgtgatgtgtatcgaagtg ->deoP2 2 128->150 -ttatttgaaccagatcgcatta ->fur 136->158 -aaatgtaagctgtgccacgttt ->gal 56->78 -aagtgtgacatggaataaatta ->glpACB (glpTQ) 1 54->76 -ttgtttgatttcgcgcatattc ->glpACB (glpTQ) 2 94->116 -aaacgtgatttcatgcgtcatt ->glpACB (glpTQ) 144->166 -atgtgtgcggcaattcacattt ->glpD (glpE) 95->117 -taatgttatacatatcactcta ->glpFK 1 120->142 -ttttatgacgaggcacacacat ->glpFK 2 95->117 -aagttcgatatttctcgttttt ->gut (srlA) 72->94 -ttttgcgatcaaaataacactt ->ilvB 87->109 -aaacgtgatcaacccctcaatt ->lac 1 (lacZ) 88->110 -taatgtgagttagctcactcat ->lac 2 (lacZ) 16->38 -aattgtgagcggataacaattt ->malEpKp1 110->132 -ttgtgtgatctctgttacagaa ->malEpKp2 139->161 -TAAtgtggagatgcgcacaTAA ->malEpKp3 173->195 -TTTtgcaagcaacatcacgAAA ->malEpKp4 205->227 -GACctcggtttagttcacaGAA ->malT 121->143 -aattgtgacacagtgcaaattc ->melR 52->74 -aaccgtgctcccactcgcagtc ->mtl 302->324 -TCTTGTGATTCAGATCACAAAG ->nag 156->178 -ttttgtgagttttgtcaccaaa ->nupG2 97->119 -aaatgttatccacatcacaatt ->nupG1 47->69 -ttatttgccacaggtaacaaaa ->ompA 166->188 -atgcctgacggagttcacactt ->ompR 161->183 -taacgtgatcatatcaacagaa ->ptsH A 316->338 -Ttttgtggcctgcttcaaactt ->ptsH B 188->210 -ttttatgatttggttcaattct ->rhaS (rhaB) 
161->183 -aattgtgaacatcatcacgttc ->rot 1 (ppiA) 182->204 -ttttgtgatctgtttaaatgtt ->rot 2 (ppiA) 129->151 -agaggtgattttgatcacggaa ->tdcA 60->82 -atttgtgagtggtcgcacatat ->tnaL 73->95 -gattgtgattcgattcacattt ->tsx 2 146->168 -gtgtgtaaacgtgaacgcaatc ->tsx 1 107->129 -aactgtgaaacgaaacatattt ->uxuAB 165->187 -TCTTGTGATGTGGTTAACCAAT diff -r c55bdc2fb9fa -r 33ac48224523 capu.fa --- a/capu.fa Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,98 +0,0 @@ ->aldB -18->4 -auucgugauagcugucguaaag ->ansB 103->125 -uuuuguuaccugccucuaacuu ->araB1 109->131 -aagugugacgccgugcaaauaa ->araB2 147->169 -ugccgugauuauagacacuuuu ->cdd 1 107->129 -auuugcgaugcgucgcgcauuu ->cdd 2 57->79 -uaaugagauucagaucacauau ->crp 1 115->137 -uaaugugacguccuuugcauac ->crp 2 -gaaggcgaccugggucaugcug ->cya 151->173 -agguguuaaauugaucacguuu ->cyuR 1 125->147 -cgaugcgaggcggaucgaaaaa ->cyuR 2 106->128 -aaauucaauauucaucacacuu ->dadAX 1 95->117 -agaugugagccagcucaccaua ->dadAX 2 32->54 -agaugugauuagauuauuauuc ->deoP2 1 75->97 -aauugugauguguaucgaagug ->deoP2 2 128->150 -uuauuugaaccagaucgcauua ->fur 136->158 -aaauguaagcugugccacguuu ->gal 56->78 -aagugugacauggaauaaauua ->glpACB (glpUQ) 1 54->76 -uuguuugauuucgcgcauauuc ->glpACB (glpUQ) 2 94->116 -aaacgugauuucaugcgucauu ->glpACB (glpUQ) 144->166 -augugugcggcaauucacauuu ->glpD (glpE) 95->117 -uaauguuauacauaucacucua ->glpFK 1 120->142 -uuuuaugacgaggcacacacau ->glpFK 2 95->117 -aaguucgauauuucucguuuuu ->guu (srlA) 72->94 -uuuugcgaucaaaauaacacuu ->ilvB 87->109 -aaacgugaucaaccccucaauu ->lac 1 (lacZ) 88->110 -uaaugugaguuagcucacucau ->lac 2 (lacZ) 16->38 -aauugugagcggauaacaauuu ->malEpKp1 110->132 -uugugugaucucuguuacagaa ->malEpKp2 139->161 -UAAuguggagaugcgcacaUAA ->malEpKp3 173->195 -UUUugcaagcaacaucacgAAA ->malEpKp4 205->227 -GACcucgguuuaguucacaGAA ->malU 121->143 -aauugugacacagugcaaauuc ->melR 52->74 -aaccgugcucccacucgcaguc ->mul 302->324 -UCUUGUGAUUCAGAUCACAAAG ->nag 156->178 -uuuugugaguuuugucaccaaa ->nupG2 97->119 -aaauguuauccacaucacaauu ->nupG1 
47->69 -uuauuugccacagguaacaaaa ->ompA 166->188 -augccugacggaguucacacuu ->ompR 161->183 -uaacgugaucauaucaacagaa ->pusH A 316->338 -Uuuuguggccugcuucaaacuu ->pusH B 188->210 -uuuuaugauuugguucaauucu ->rhaS (rhaB) 161->183 -aauugugaacaucaucacguuc ->rou 1 (ppiA) 182->204 -uuuugugaucuguuuaaauguu ->rou 2 (ppiA) 129->151 -agaggugauuuugaucacggaa ->udcA 60->82 -auuugugaguggucgcacauau ->unaL 73->95 -gauugugauucgauucacauuu ->usx 2 146->168 -guguguaaacgugaacgcaauc ->usx 1 107->129 -aacugugaaacgaaacauauuu ->uxuAB 165->187 -UCUUGUGAUGUGGUUAACCAAU diff -r c55bdc2fb9fa -r 33ac48224523 codonlogo --- a/codonlogo Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,80 +0,0 @@ -#!/usr/bin/env python - - - - -# -------------------------------- WebLogo -------------------------------- - -# Copyright (c) 2003-2004 The Regents of the University of California. -# Copyright (c) 2005 Gavin E. Crooks -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. 
-# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - -""" WebLogo is a tool for creating sequence logos from biological sequence -alignments. It can be run on the command line, as a standalone webserver, as a -CGI webapp, or as a python library. - -For help on the command line interface run - ./codonlogo --help - -To build a simple logo run - ./codonlogo < cap.fa > logo0.eps - -To run as a standalone webserver at localhost:8080 - ./codonlogo --serve - - -""" -import weblogolib - -# Standard python voodoo for CLI -if __name__ == "__main__": - ## Code Profiling. Uncomment these lines - #import hotshot, hotshot.stats - #prof = hotshot.Profile("stones.prof") - #prof.runcall(main) - #prof.close() - #stats = hotshot.stats.load("stones.prof") - #stats.strip_dirs() - #stats.sort_stats('cumulative', 'calls') - #stats.print_stats(40) - #sys.exit() - - weblogolib.main() - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/__init__.py --- a/corebio/__init__.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,62 +0,0 @@ - -# Copyright (c) 2005 Gavin E. 
Crooks -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - - -""" A python toolkit for computational biology. 
- -http://code.google.com/p/corebio/ -""" - -__all__ = [ 'data', - "moremath", - "resource", - "seq", - "seq_io", - 'ssearch_io', - "utils", - 'transform', - ] - -from _version import __version__ -from _version import description - -__doc__ = description +' : ' + __doc__ - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/_future/__init__.py --- a/corebio/_future/__init__.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,78 +0,0 @@ - -""" -Private compatability module for running under python version 2.3. - -Replacement for - o string.Template -- introduced in python 2.4 - o subprocess -- introduced in python 2.4 - - o resource_string -- introduced in pkg_resource of setuptools - o resource_stream - o resource_filename - -from string import Template -> from corebio._future import Template - -""" - - -try : - import pkg_resources -except ImportError : - pkg_resources = None - - -try : - from string import Template -except ImportError : - from _string import Template - - - - -def resource_string( modulename, resource, basefilename = None): - """Locate and return a resource as a string. - >>> f = resource_string( __name, 'somedatafile', __file__) - """ - if pkg_resources : - return pkg_resources.resource_string(modulename, resource) - - f = resource_stream( modulename, resource, basefilename) - return f.read() - -def resource_stream( modulename, resource, basefilename = None): - """Locate and return a resource as a stream. - >>> f = resource_stream( __name__, 'somedatafile', __file__) - """ - if pkg_resources : - return pkg_resources.resource_stream(modulename, resource) - - return open( resource_filename( modulename, resource, basefilename) ) - -def resource_filename( modulename, resource, basefilename = None): - """Locate and return a resource filename. - >>> f = resource_stream( __name__, 'somedatafile', __file__) - - A resource is a data file stored with the python code in a package. 
- All three resource methods (resource_string, resource_stream, - resource_filename) call the corresponding methods in the 'pkg_resources' - module, if installed. Otherwise, we resort to locating the resource - in the local filesystem. However, this does not work if the package - is located inside a zip file. - """ - if pkg_resources : - return pkg_resources.resource_filename(modulename, resource) - - if basefilename is None : - raise NotImplementedError( - "Require either basefilename or pkg_resources") - - import os - return os.path.join(os.path.dirname(basefilename), resource) - - - - - - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/_future/_string.py --- a/corebio/_future/_string.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,127 +0,0 @@ -#################################################################### -import re as _re - -class _multimap: - """Helper class for combining multiple mappings. - - Used by .{safe_,}substitute() to combine the mapping and keyword - arguments. 
- """ - def __init__(self, primary, secondary): - self._primary = primary - self._secondary = secondary - - def __getitem__(self, key): - try: - return self._primary[key] - except KeyError: - return self._secondary[key] - - -class _TemplateMetaclass(type): - pattern = r""" - %(delim)s(?: - (?P<escaped>%(delim)s) | # Escape sequence of two delimiters - (?P<named>%(id)s) | # delimiter and a Python identifier - {(?P<braced>%(id)s)} | # delimiter and a braced identifier - (?P<invalid>) # Other ill-formed delimiter exprs - ) - """ - - def __init__(cls, name, bases, dct): - super(_TemplateMetaclass, cls).__init__(name, bases, dct) - if 'pattern' in dct: - pattern = cls.pattern - else: - pattern = _TemplateMetaclass.pattern % { - 'delim' : _re.escape(cls.delimiter), - 'id' : cls.idpattern, - } - cls.pattern = _re.compile(pattern, _re.IGNORECASE | _re.VERBOSE) - - -class Template: - """A string class for supporting $-substitutions.""" - __metaclass__ = _TemplateMetaclass - - delimiter = '$' - idpattern = r'[_a-z][_a-z0-9]*' - - def __init__(self, template): - self.template = template - - # Search for $$, $identifier, ${identifier}, and any bare $'s - - def _invalid(self, mo): - i = mo.start('invalid') - lines = self.template[:i].splitlines(True) - if not lines: - colno = 1 - lineno = 1 - else: - colno = i - len(''.join(lines[:-1])) - lineno = len(lines) - raise ValueError('Invalid placeholder in string: line %d, col %d' % - (lineno, colno)) - - def substitute(self, *args, **kws): - if len(args) > 1: - raise TypeError('Too many positional arguments') - if not args: - mapping = kws - elif kws: - mapping = _multimap(kws, args[0]) - else: - mapping = args[0] - # Helper function for .sub() - def convert(mo): - # Check the most common path first. - named = mo.group('named') or mo.group('braced') - if named is not None: - val = mapping[named] - # We use this idiom instead of str() because the latter will - # fail if val is a Unicode containing non-ASCII characters.
- return '%s' % val - if mo.group('escaped') is not None: - return self.delimiter - if mo.group('invalid') is not None: - self._invalid(mo) - raise ValueError('Unrecognized named group in pattern', - self.pattern) - return self.pattern.sub(convert, self.template) - - def safe_substitute(self, *args, **kws): - if len(args) > 1: - raise TypeError('Too many positional arguments') - if not args: - mapping = kws - elif kws: - mapping = _multimap(kws, args[0]) - else: - mapping = args[0] - # Helper function for .sub() - def convert(mo): - named = mo.group('named') - if named is not None: - try: - # We use this idiom instead of str() because the latter - # will fail if val is a Unicode containing non-ASCII - return '%s' % mapping[named] - except KeyError: - return self.delimiter + named - braced = mo.group('braced') - if braced is not None: - try: - return '%s' % mapping[braced] - except KeyError: - return self.delimiter + '{' + braced + '}' - if mo.group('escaped') is not None: - return self.delimiter - if mo.group('invalid') is not None: - return self.delimiter - raise ValueError('Unrecognized named group in pattern', - self.pattern) - return self.pattern.sub(convert, self.template) - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/_future/subprocess.py --- a/corebio/_future/subprocess.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,1165 +0,0 @@ -# subprocess - Subprocesses with accessible I/O streams -# -# For more information about this module, see PEP 324. 
-# -# Copyright (c) 2003-2004 by Peter Astrand -# -# By obtaining, using, and/or copying this software and/or its -# associated documentation, you agree that you have read, understood, -# and will comply with the following terms and conditions: -# -# Permission to use, copy, modify, and distribute this software and -# its associated documentation for any purpose and without fee is -# hereby granted, provided that the above copyright notice appears in -# all copies, and that both that copyright notice and this permission -# notice appear in supporting documentation, and that the name of the -# author not be used in advertising or publicity pertaining to -# distribution of the software without specific, written prior -# permission. -# -# THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, -# INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. -# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR -# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS -# OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, -# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION -# WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. - -r"""subprocess - Subprocesses with accessible I/O streams - -This module allows you to spawn processes, connect to their -input/output/error pipes, and obtain their return codes. This module -intends to replace several other, older modules and functions, like: - -os.system -os.spawn* -os.popen* -popen2.* -commands.* - -Information about how the subprocess module can be used to replace these -modules and functions can be found below. 
- - - -Using the subprocess module -=========================== -This module defines one class called Popen: - -class Popen(args, bufsize=0, executable=None, - stdin=None, stdout=None, stderr=None, - preexec_fn=None, close_fds=False, shell=False, - cwd=None, env=None, universal_newlines=False, - startupinfo=None, creationflags=0): - - -Arguments are: - -args should be a string, or a sequence of program arguments. The -program to execute is normally the first item in the args sequence or -string, but can be explicitly set by using the executable argument. - -On UNIX, with shell=False (default): In this case, the Popen class -uses os.execvp() to execute the child program. args should normally -be a sequence. A string will be treated as a sequence with the string -as the only item (the program to execute). - -On UNIX, with shell=True: If args is a string, it specifies the -command string to execute through the shell. If args is a sequence, -the first item specifies the command string, and any additional items -will be treated as additional shell arguments. - -On Windows: the Popen class uses CreateProcess() to execute the child -program, which operates on strings. If args is a sequence, it will be -converted to a string using the list2cmdline method. Please note that -not all MS Windows applications interpret the command line the same -way: The list2cmdline is designed for applications using the same -rules as the MS C runtime. - -bufsize, if given, has the same meaning as the corresponding argument -to the built-in open() function: 0 means unbuffered, 1 means line -buffered, any other positive value means use a buffer of -(approximately) that size. A negative bufsize means to use the system -default, which usually means fully buffered. The default value for -bufsize is 0 (unbuffered). - -stdin, stdout and stderr specify the executed programs' standard -input, standard output and standard error file handles, respectively. 
-Valid values are PIPE, an existing file descriptor (a positive -integer), an existing file object, and None. PIPE indicates that a -new pipe to the child should be created. With None, no redirection -will occur; the child's file handles will be inherited from the -parent. Additionally, stderr can be STDOUT, which indicates that the -stderr data from the applications should be captured into the same -file handle as for stdout. - -If preexec_fn is set to a callable object, this object will be called -in the child process just before the child is executed. - -If close_fds is true, all file descriptors except 0, 1 and 2 will be -closed before the child process is executed. - -if shell is true, the specified command will be executed through the -shell. - -If cwd is not None, the current directory will be changed to cwd -before the child is executed. - -If env is not None, it defines the environment variables for the new -process. - -If universal_newlines is true, the file objects stdout and stderr are -opened as a text files, but lines may be terminated by any of '\n', -the Unix end-of-line convention, '\r', the Macintosh convention or -'\r\n', the Windows convention. All of these external representations -are seen as '\n' by the Python program. Note: This feature is only -available if Python is built with universal newline support (the -default). Also, the newlines attribute of the file objects stdout, -stdin and stderr are not updated by the communicate() method. - -The startupinfo and creationflags, if given, will be passed to the -underlying CreateProcess() function. They can specify things such as -appearance of the main window and priority for the new process. -(Windows only) - - -This module also defines two shortcut functions: - -call(*args, **kwargs): - Run command with arguments. Wait for command to complete, then - return the returncode attribute. - - The arguments are the same as for the Popen constructor. 
Example: - - retcode = call(["ls", "-l"]) - - -Exceptions ----------- -Exceptions raised in the child process, before the new program has -started to execute, will be re-raised in the parent. Additionally, -the exception object will have one extra attribute called -'child_traceback', which is a string containing traceback information -from the child's point of view. - -The most common exception raised is OSError. This occurs, for -example, when trying to execute a non-existent file. Applications -should prepare for OSErrors. - -A ValueError will be raised if Popen is called with invalid arguments. - - -Security --------- -Unlike some other popen functions, this implementation will never call -/bin/sh implicitly. This means that all characters, including shell -metacharacters, can safely be passed to child processes. - - -Popen objects -============= -Instances of the Popen class have the following methods: - -poll() - Check if child process has terminated. Returns returncode - attribute. - -wait() - Wait for child process to terminate. Returns returncode attribute. - -communicate(input=None) - Interact with process: Send data to stdin. Read data from stdout - and stderr, until end-of-file is reached. Wait for process to - terminate. The optional input argument should be a string to be - sent to the child process, or None, if no data should be sent to - the child. - - communicate() returns a tuple (stdout, stderr). - - Note: The data read is buffered in memory, so do not use this - method if the data size is large or unlimited. - -The following attributes are also available: - -stdin - If the stdin argument is PIPE, this attribute is a file object - that provides input to the child process. Otherwise, it is None. - -stdout - If the stdout argument is PIPE, this attribute is a file object - that provides output from the child process. Otherwise, it is - None.
- -stderr - If the stderr argument is PIPE, this attribute is file object that - provides error output from the child process. Otherwise, it is - None. - -pid - The process ID of the child process. - -returncode - The child return code. A None value indicates that the process - hasn't terminated yet. A negative value -N indicates that the - child was terminated by signal N (UNIX only). - - -Replacing older functions with the subprocess module -==================================================== -In this section, "a ==> b" means that b can be used as a replacement -for a. - -Note: All functions in this section fail (more or less) silently if -the executed program cannot be found; this module raises an OSError -exception. - -In the following examples, we assume that the subprocess module is -imported with "from subprocess import *". - - -Replacing /bin/sh shell backquote ---------------------------------- -output=`mycmd myarg` -==> -output = Popen(["mycmd", "myarg"], stdout=PIPE).communicate()[0] - - -Replacing shell pipe line -------------------------- -output=`dmesg | grep hda` -==> -p1 = Popen(["dmesg"], stdout=PIPE) -p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE) -output = p2.communicate()[0] - - -Replacing os.system() ---------------------- -sts = os.system("mycmd" + " myarg") -==> -p = Popen("mycmd" + " myarg", shell=True) -sts = os.waitpid(p.pid, 0) - -Note: - -* Calling the program through the shell is usually not required. - -* It's easier to look at the returncode attribute than the - exitstatus. 
- -A more real-world example would look like this: - -try: - retcode = call("mycmd" + " myarg", shell=True) - if retcode < 0: - print >>sys.stderr, "Child was terminated by signal", -retcode - else: - print >>sys.stderr, "Child returned", retcode -except OSError, e: - print >>sys.stderr, "Execution failed:", e - - -Replacing os.spawn* -------------------- -P_NOWAIT example: - -pid = os.spawnlp(os.P_NOWAIT, "/bin/mycmd", "mycmd", "myarg") -==> -pid = Popen(["/bin/mycmd", "myarg"]).pid - - -P_WAIT example: - -retcode = os.spawnlp(os.P_WAIT, "/bin/mycmd", "mycmd", "myarg") -==> -retcode = call(["/bin/mycmd", "myarg"]) - - -Vector example: - -os.spawnvp(os.P_NOWAIT, path, args) -==> -Popen([path] + args[1:]) - - -Environment example: - -os.spawnlpe(os.P_NOWAIT, "/bin/mycmd", "mycmd", "myarg", env) -==> -Popen(["/bin/mycmd", "myarg"], env={"PATH": "/usr/bin"}) - - -Replacing os.popen* -------------------- -pipe = os.popen(cmd, mode='r', bufsize) -==> -pipe = Popen(cmd, shell=True, bufsize=bufsize, stdout=PIPE).stdout - -pipe = os.popen(cmd, mode='w', bufsize) -==> -pipe = Popen(cmd, shell=True, bufsize=bufsize, stdin=PIPE).stdin - - -(child_stdin, child_stdout) = os.popen2(cmd, mode, bufsize) -==> -p = Popen(cmd, shell=True, bufsize=bufsize, - stdin=PIPE, stdout=PIPE, close_fds=True) -(child_stdin, child_stdout) = (p.stdin, p.stdout) - - -(child_stdin, - child_stdout, - child_stderr) = os.popen3(cmd, mode, bufsize) -==> -p = Popen(cmd, shell=True, bufsize=bufsize, - stdin=PIPE, stdout=PIPE, stderr=PIPE, close_fds=True) -(child_stdin, - child_stdout, - child_stderr) = (p.stdin, p.stdout, p.stderr) - - -(child_stdin, child_stdout_and_stderr) = os.popen4(cmd, mode, bufsize) -==> -p = Popen(cmd, shell=True, bufsize=bufsize, - stdin=PIPE, stdout=PIPE, stderr=STDOUT, close_fds=True) -(child_stdin, child_stdout_and_stderr) = (p.stdin, p.stdout) - - -Replacing popen2.* ------------------- -Note: If the cmd argument to popen2 functions is a string, the command -is executed 
through /bin/sh. If it is a list, the command is directly -executed. - -(child_stdout, child_stdin) = popen2.popen2("somestring", bufsize, mode) -==> -p = Popen(["somestring"], shell=True, bufsize=bufsize, - stdin=PIPE, stdout=PIPE, close_fds=True) -(child_stdout, child_stdin) = (p.stdout, p.stdin) - - -(child_stdout, child_stdin) = popen2.popen2(["mycmd", "myarg"], bufsize, mode) -==> -p = Popen(["mycmd", "myarg"], bufsize=bufsize, - stdin=PIPE, stdout=PIPE, close_fds=True) -(child_stdout, child_stdin) = (p.stdout, p.stdin) - -The popen2.Popen3 and popen2.Popen4 basically work as subprocess.Popen, -except that: - -* subprocess.Popen raises an exception if the execution fails -* the capturestderr argument is replaced with the stderr argument. -* stdin=PIPE and stdout=PIPE must be specified. -* popen2 closes all file descriptors by default, but you have to specify - close_fds=True with subprocess.Popen. - - -""" - -import sys -mswindows = (sys.platform == "win32") - -import os -import types -import traceback - -if mswindows: - import threading - import msvcrt - if 0: # <-- change this to use pywin32 instead of the _subprocess driver - import pywintypes - from win32api import GetStdHandle, STD_INPUT_HANDLE, \ - STD_OUTPUT_HANDLE, STD_ERROR_HANDLE - from win32api import GetCurrentProcess, DuplicateHandle, \ - GetModuleFileName, GetVersion - from win32con import DUPLICATE_SAME_ACCESS, SW_HIDE - from win32pipe import CreatePipe - from win32process import CreateProcess, STARTUPINFO, \ - GetExitCodeProcess, STARTF_USESTDHANDLES, \ - STARTF_USESHOWWINDOW, CREATE_NEW_CONSOLE - from win32event import WaitForSingleObject, INFINITE, WAIT_OBJECT_0 - else: - from _subprocess import * - class STARTUPINFO: - dwFlags = 0 - hStdInput = None - hStdOutput = None - hStdError = None - class pywintypes: - error = IOError -else: - import select - import errno - import fcntl - import pickle - -__all__ = ["Popen", "PIPE", "STDOUT", "call"] - -try: - MAXFD = os.sysconf("SC_OPEN_MAX") -except:
- MAXFD = 256 - -# True/False does not exist on 2.2.0 -try: - False -except NameError: - False = 0 - True = 1 - -_active = [] - -def _cleanup(): - for inst in _active[:]: - inst.poll() - -PIPE = -1 -STDOUT = -2 - - -def call(*args, **kwargs): - """Run command with arguments. Wait for command to complete, then - return the returncode attribute. - - The arguments are the same as for the Popen constructor. Example: - - retcode = call(["ls", "-l"]) - """ - return Popen(*args, **kwargs).wait() - - -def list2cmdline(seq): - """ - Translate a sequence of arguments into a command line - string, using the same rules as the MS C runtime: - - 1) Arguments are delimited by white space, which is either a - space or a tab. - - 2) A string surrounded by double quotation marks is - interpreted as a single argument, regardless of white space - contained within. A quoted string can be embedded in an - argument. - - 3) A double quotation mark preceded by a backslash is - interpreted as a literal double quotation mark. - - 4) Backslashes are interpreted literally, unless they - immediately precede a double quotation mark. - - 5) If backslashes immediately precede a double quotation mark, - every pair of backslashes is interpreted as a literal - backslash. If the number of backslashes is odd, the last - backslash escapes the next double quotation mark as - described in rule 3. - """ - - # See - # http://msdn.microsoft.com/library/en-us/vccelng/htm/progs_12.asp - result = [] - needquote = False - for arg in seq: - bs_buf = [] - - # Add a space to separate this argument from the others - if result: - result.append(' ') - - needquote = (" " in arg) or ("\t" in arg) - if needquote: - result.append('"') - - for c in arg: - if c == '\\': - # Don't know if we need to double yet. - bs_buf.append(c) - elif c == '"': - # Double backspaces. 
- result.append('\\' * len(bs_buf)*2) - bs_buf = [] - result.append('\\"') - else: - # Normal char - if bs_buf: - result.extend(bs_buf) - bs_buf = [] - result.append(c) - - # Add remaining backspaces, if any. - if bs_buf: - result.extend(bs_buf) - - if needquote: - result.extend(bs_buf) - result.append('"') - - return ''.join(result) - - -class Popen(object): - def __init__(self, args, bufsize=0, executable=None, - stdin=None, stdout=None, stderr=None, - preexec_fn=None, close_fds=False, shell=False, - cwd=None, env=None, universal_newlines=False, - startupinfo=None, creationflags=0): - """Create new Popen instance.""" - _cleanup() - - if not isinstance(bufsize, (int, long)): - raise TypeError("bufsize must be an integer") - - if mswindows: - if preexec_fn is not None: - raise ValueError("preexec_fn is not supported on Windows " - "platforms") - if close_fds: - raise ValueError("close_fds is not supported on Windows " - "platforms") - else: - # POSIX - if startupinfo is not None: - raise ValueError("startupinfo is only supported on Windows " - "platforms") - if creationflags != 0: - raise ValueError("creationflags is only supported on Windows " - "platforms") - - self.stdin = None - self.stdout = None - self.stderr = None - self.pid = None - self.returncode = None - self.universal_newlines = universal_newlines - - # Input and output objects. The general principle is like - # this: - # - # Parent Child - # ------ ----- - # p2cwrite ---stdin---> p2cread - # c2pread <--stdout--- c2pwrite - # errread <--stderr--- errwrite - # - # On POSIX, the child objects are file descriptors. On - # Windows, these are Windows file handles. The parent objects - # are file descriptors on both platforms. The parent objects - # are None when not using PIPEs. The child objects are None - # when not redirecting. 
- - (p2cread, p2cwrite, - c2pread, c2pwrite, - errread, errwrite) = self._get_handles(stdin, stdout, stderr) - - self._execute_child(args, executable, preexec_fn, close_fds, - cwd, env, universal_newlines, - startupinfo, creationflags, shell, - p2cread, p2cwrite, - c2pread, c2pwrite, - errread, errwrite) - - if p2cwrite: - self.stdin = os.fdopen(p2cwrite, 'wb', bufsize) - if c2pread: - if universal_newlines: - self.stdout = os.fdopen(c2pread, 'rU', bufsize) - else: - self.stdout = os.fdopen(c2pread, 'rb', bufsize) - if errread: - if universal_newlines: - self.stderr = os.fdopen(errread, 'rU', bufsize) - else: - self.stderr = os.fdopen(errread, 'rb', bufsize) - - _active.append(self) - - - def _translate_newlines(self, data): - data = data.replace("\r\n", "\n") - data = data.replace("\r", "\n") - return data - - - if mswindows: - # - # Windows methods - # - def _get_handles(self, stdin, stdout, stderr): - """Construct and return tupel with IO objects: - p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite - """ - if stdin == None and stdout == None and stderr == None: - return (None, None, None, None, None, None) - - p2cread, p2cwrite = None, None - c2pread, c2pwrite = None, None - errread, errwrite = None, None - - if stdin == None: - p2cread = GetStdHandle(STD_INPUT_HANDLE) - elif stdin == PIPE: - p2cread, p2cwrite = CreatePipe(None, 0) - # Detach and turn into fd - p2cwrite = p2cwrite.Detach() - p2cwrite = msvcrt.open_osfhandle(p2cwrite, 0) - elif type(stdin) == types.IntType: - p2cread = msvcrt.get_osfhandle(stdin) - else: - # Assuming file-like object - p2cread = msvcrt.get_osfhandle(stdin.fileno()) - p2cread = self._make_inheritable(p2cread) - - if stdout == None: - c2pwrite = GetStdHandle(STD_OUTPUT_HANDLE) - elif stdout == PIPE: - c2pread, c2pwrite = CreatePipe(None, 0) - # Detach and turn into fd - c2pread = c2pread.Detach() - c2pread = msvcrt.open_osfhandle(c2pread, 0) - elif type(stdout) == types.IntType: - c2pwrite = msvcrt.get_osfhandle(stdout) - 
else: - # Assuming file-like object - c2pwrite = msvcrt.get_osfhandle(stdout.fileno()) - c2pwrite = self._make_inheritable(c2pwrite) - - if stderr == None: - errwrite = GetStdHandle(STD_ERROR_HANDLE) - elif stderr == PIPE: - errread, errwrite = CreatePipe(None, 0) - # Detach and turn into fd - errread = errread.Detach() - errread = msvcrt.open_osfhandle(errread, 0) - elif stderr == STDOUT: - errwrite = c2pwrite - elif type(stderr) == types.IntType: - errwrite = msvcrt.get_osfhandle(stderr) - else: - # Assuming file-like object - errwrite = msvcrt.get_osfhandle(stderr.fileno()) - errwrite = self._make_inheritable(errwrite) - - return (p2cread, p2cwrite, - c2pread, c2pwrite, - errread, errwrite) - - - def _make_inheritable(self, handle): - """Return a duplicate of handle, which is inheritable""" - return DuplicateHandle(GetCurrentProcess(), handle, - GetCurrentProcess(), 0, 1, - DUPLICATE_SAME_ACCESS) - - - def _find_w9xpopen(self): - """Find and return absolut path to w9xpopen.exe""" - w9xpopen = os.path.join(os.path.dirname(GetModuleFileName(0)), - "w9xpopen.exe") - if not os.path.exists(w9xpopen): - # Eeek - file-not-found - possibly an embedding - # situation - see if we can locate it in sys.exec_prefix - w9xpopen = os.path.join(os.path.dirname(sys.exec_prefix), - "w9xpopen.exe") - if not os.path.exists(w9xpopen): - raise RuntimeError("Cannot locate w9xpopen.exe, which is " - "needed for Popen to work with your " - "shell or platform.") - return w9xpopen - - - def _execute_child(self, args, executable, preexec_fn, close_fds, - cwd, env, universal_newlines, - startupinfo, creationflags, shell, - p2cread, p2cwrite, - c2pread, c2pwrite, - errread, errwrite): - """Execute program (MS Windows version)""" - - if not isinstance(args, types.StringTypes): - args = list2cmdline(args) - - # Process startup details - default_startupinfo = STARTUPINFO() - if startupinfo == None: - startupinfo = default_startupinfo - if not None in (p2cread, c2pwrite, errwrite): - 
startupinfo.dwFlags |= STARTF_USESTDHANDLES - startupinfo.hStdInput = p2cread - startupinfo.hStdOutput = c2pwrite - startupinfo.hStdError = errwrite - - if shell: - default_startupinfo.dwFlags |= STARTF_USESHOWWINDOW - default_startupinfo.wShowWindow = SW_HIDE - comspec = os.environ.get("COMSPEC", "cmd.exe") - args = comspec + " /c " + args - if (GetVersion() >= 0x80000000L or - os.path.basename(comspec).lower() == "command.com"): - # Win9x, or using command.com on NT. We need to - # use the w9xpopen intermediate program. For more - # information, see KB Q150956 - # (http://web.archive.org/web/20011105084002/http://support.microsoft.com/support/kb/articles/Q150/9/56.asp) - w9xpopen = self._find_w9xpopen() - args = '"%s" %s' % (w9xpopen, args) - # Not passing CREATE_NEW_CONSOLE has been known to - # cause random failures on win9x. Specifically a - # dialog: "Your program accessed mem currently in - # use at xxx" and a hopeful warning about the - # stability of your system. Cost is Ctrl+C wont - # kill children. - creationflags |= CREATE_NEW_CONSOLE - - # Start the process - try: - hp, ht, pid, tid = CreateProcess(executable, args, - # no special security - None, None, - # must inherit handles to pass std - # handles - 1, - creationflags, - env, - cwd, - startupinfo) - except pywintypes.error, e: - # Translate pywintypes.error to WindowsError, which is - # a subclass of OSError. FIXME: We should really - # translate errno using _sys_errlist (or simliar), but - # how can this be done from Python? - raise WindowsError(*e.args) - - # Retain the process handle, but close the thread handle - self._handle = hp - self.pid = pid - ht.Close() - - # Child is launched. Close the parent's copy of those pipe - # handles that only the child should have open. You need - # to make sure that no handles to the write end of the - # output pipe are maintained in this process or else the - # pipe will not close when the child process exits and the - # ReadFile will hang. 
- if p2cread != None: - p2cread.Close() - if c2pwrite != None: - c2pwrite.Close() - if errwrite != None: - errwrite.Close() - - - def poll(self): - """Check if child process has terminated. Returns returncode - attribute.""" - if self.returncode == None: - if WaitForSingleObject(self._handle, 0) == WAIT_OBJECT_0: - self.returncode = GetExitCodeProcess(self._handle) - _active.remove(self) - return self.returncode - - - def wait(self): - """Wait for child process to terminate. Returns returncode - attribute.""" - if self.returncode == None: - obj = WaitForSingleObject(self._handle, INFINITE) - self.returncode = GetExitCodeProcess(self._handle) - _active.remove(self) - return self.returncode - - - def _readerthread(self, fh, buffer): - buffer.append(fh.read()) - - - def communicate(self, input=None): - """Interact with process: Send data to stdin. Read data from - stdout and stderr, until end-of-file is reached. Wait for - process to terminate. The optional input argument should be a - string to be sent to the child process, or None, if no data - should be sent to the child. - - communicate() returns a tuple (stdout, stderr).""" - stdout = None # Return - stderr = None # Return - - if self.stdout: - stdout = [] - stdout_thread = threading.Thread(target=self._readerthread, - args=(self.stdout, stdout)) - stdout_thread.setDaemon(True) - stdout_thread.start() - if self.stderr: - stderr = [] - stderr_thread = threading.Thread(target=self._readerthread, - args=(self.stderr, stderr)) - stderr_thread.setDaemon(True) - stderr_thread.start() - - if self.stdin: - if input != None: - self.stdin.write(input) - self.stdin.close() - - if self.stdout: - stdout_thread.join() - if self.stderr: - stderr_thread.join() - - # All data exchanged. Translate lists into strings. - if stdout != None: - stdout = stdout[0] - if stderr != None: - stderr = stderr[0] - - # Translate newlines, if requested. 
We cannot let the file - # object do the translation: It is based on stdio, which is - # impossible to combine with select (unless forcing no - # buffering). - if self.universal_newlines and hasattr(open, 'newlines'): - if stdout: - stdout = self._translate_newlines(stdout) - if stderr: - stderr = self._translate_newlines(stderr) - - self.wait() - return (stdout, stderr) - - else: - # - # POSIX methods - # - def _get_handles(self, stdin, stdout, stderr): - """Construct and return tupel with IO objects: - p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite - """ - p2cread, p2cwrite = None, None - c2pread, c2pwrite = None, None - errread, errwrite = None, None - - if stdin == None: - pass - elif stdin == PIPE: - p2cread, p2cwrite = os.pipe() - elif type(stdin) == types.IntType: - p2cread = stdin - else: - # Assuming file-like object - p2cread = stdin.fileno() - - if stdout == None: - pass - elif stdout == PIPE: - c2pread, c2pwrite = os.pipe() - elif type(stdout) == types.IntType: - c2pwrite = stdout - else: - # Assuming file-like object - c2pwrite = stdout.fileno() - - if stderr == None: - pass - elif stderr == PIPE: - errread, errwrite = os.pipe() - elif stderr == STDOUT: - errwrite = c2pwrite - elif type(stderr) == types.IntType: - errwrite = stderr - else: - # Assuming file-like object - errwrite = stderr.fileno() - - return (p2cread, p2cwrite, - c2pread, c2pwrite, - errread, errwrite) - - - def _set_cloexec_flag(self, fd): - try: - cloexec_flag = fcntl.FD_CLOEXEC - except AttributeError: - cloexec_flag = 1 - - old = fcntl.fcntl(fd, fcntl.F_GETFD) - fcntl.fcntl(fd, fcntl.F_SETFD, old | cloexec_flag) - - - def _close_fds(self, but): - for i in range(3, MAXFD): - if i == but: - continue - try: - os.close(i) - except: - pass - - - def _execute_child(self, args, executable, preexec_fn, close_fds, - cwd, env, universal_newlines, - startupinfo, creationflags, shell, - p2cread, p2cwrite, - c2pread, c2pwrite, - errread, errwrite): - """Execute program (POSIX 
version)""" - - if isinstance(args, types.StringTypes): - args = [args] - - if shell: - args = ["/bin/sh", "-c"] + args - - if executable == None: - executable = args[0] - - # For transferring possible exec failure from child to parent - # The first char specifies the exception type: 0 means - # OSError, 1 means some other error. - errpipe_read, errpipe_write = os.pipe() - self._set_cloexec_flag(errpipe_write) - - self.pid = os.fork() - if self.pid == 0: - # Child - try: - # Close parent's pipe ends - if p2cwrite: - os.close(p2cwrite) - if c2pread: - os.close(c2pread) - if errread: - os.close(errread) - os.close(errpipe_read) - - # Dup fds for child - if p2cread: - os.dup2(p2cread, 0) - if c2pwrite: - os.dup2(c2pwrite, 1) - if errwrite: - os.dup2(errwrite, 2) - - # Close pipe fds. Make sure we doesn't close the same - # fd more than once. - if p2cread: - os.close(p2cread) - if c2pwrite and c2pwrite not in (p2cread,): - os.close(c2pwrite) - if errwrite and errwrite not in (p2cread, c2pwrite): - os.close(errwrite) - - # Close all other fds, if asked for - if close_fds: - self._close_fds(but=errpipe_write) - - if cwd != None: - os.chdir(cwd) - - if preexec_fn: - apply(preexec_fn) - - if env == None: - os.execvp(executable, args) - else: - os.execvpe(executable, args, env) - - except: - exc_type, exc_value, tb = sys.exc_info() - # Save the traceback and attach it to the exception object - exc_lines = traceback.format_exception(exc_type, - exc_value, - tb) - exc_value.child_traceback = ''.join(exc_lines) - os.write(errpipe_write, pickle.dumps(exc_value)) - - # This exitcode won't be reported to applications, so it - # really doesn't matter what we return. 
- os._exit(255) - - # Parent - os.close(errpipe_write) - if p2cread and p2cwrite: - os.close(p2cread) - if c2pwrite and c2pread: - os.close(c2pwrite) - if errwrite and errread: - os.close(errwrite) - - # Wait for exec to fail or succeed; possibly raising exception - data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB - os.close(errpipe_read) - if data != "": - os.waitpid(self.pid, 0) - child_exception = pickle.loads(data) - raise child_exception - - - def _handle_exitstatus(self, sts): - if os.WIFSIGNALED(sts): - self.returncode = -os.WTERMSIG(sts) - elif os.WIFEXITED(sts): - self.returncode = os.WEXITSTATUS(sts) - else: - # Should never happen - raise RuntimeError("Unknown child exit status!") - - _active.remove(self) - - - def poll(self): - """Check if child process has terminated. Returns returncode - attribute.""" - if self.returncode == None: - try: - pid, sts = os.waitpid(self.pid, os.WNOHANG) - if pid == self.pid: - self._handle_exitstatus(sts) - except os.error: - pass - return self.returncode - - - def wait(self): - """Wait for child process to terminate. Returns returncode - attribute.""" - if self.returncode == None: - pid, sts = os.waitpid(self.pid, 0) - self._handle_exitstatus(sts) - return self.returncode - - - def communicate(self, input=None): - """Interact with process: Send data to stdin. Read data from - stdout and stderr, until end-of-file is reached. Wait for - process to terminate. The optional input argument should be a - string to be sent to the child process, or None, if no data - should be sent to the child. - - communicate() returns a tuple (stdout, stderr).""" - read_set = [] - write_set = [] - stdout = None # Return - stderr = None # Return - - if self.stdin: - # Flush stdio buffer. This might block, if the user has - # been writing to .stdin in an uncontrolled fashion. 
- self.stdin.flush() - if input: - write_set.append(self.stdin) - else: - self.stdin.close() - if self.stdout: - read_set.append(self.stdout) - stdout = [] - if self.stderr: - read_set.append(self.stderr) - stderr = [] - - while read_set or write_set: - rlist, wlist, xlist = select.select(read_set, write_set, []) - - if self.stdin in wlist: - # When select has indicated that the file is writable, - # we can write up to PIPE_BUF bytes without risk - # blocking. POSIX defines PIPE_BUF >= 512 - bytes_written = os.write(self.stdin.fileno(), input[:512]) - input = input[bytes_written:] - if not input: - self.stdin.close() - write_set.remove(self.stdin) - - if self.stdout in rlist: - data = os.read(self.stdout.fileno(), 1024) - if data == "": - self.stdout.close() - read_set.remove(self.stdout) - stdout.append(data) - - if self.stderr in rlist: - data = os.read(self.stderr.fileno(), 1024) - if data == "": - self.stderr.close() - read_set.remove(self.stderr) - stderr.append(data) - - # All data exchanged. Translate lists into strings. - if stdout != None: - stdout = ''.join(stdout) - if stderr != None: - stderr = ''.join(stderr) - - # Translate newlines, if requested. We cannot let the file - # object do the translation: It is based on stdio, which is - # impossible to combine with select (unless forcing no - # buffering). - if self.universal_newlines and hasattr(open, 'newlines'): - if stdout: - stdout = self._translate_newlines(stdout) - if stderr: - stderr = self._translate_newlines(stderr) - - self.wait() - return (stdout, stderr) - - -def _demo_posix(): - # - # Example 1: Simple redirection: Get process list - # - plist = Popen(["ps"], stdout=PIPE).communicate()[0] - print "Process list:" - print plist - - # - # Example 2: Change uid before executing child - # - if os.getuid() == 0: - p = Popen(["id"], preexec_fn=lambda: os.setuid(100)) - p.wait() - - # - # Example 3: Connecting several subprocesses - # - print "Looking for 'hda'..." 
- p1 = Popen(["dmesg"], stdout=PIPE) - p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE) - print repr(p2.communicate()[0]) - - # - # Example 4: Catch execution error - # - print - print "Trying a weird file..." - try: - print Popen(["/this/path/does/not/exist"]).communicate() - except OSError, e: - if e.errno == errno.ENOENT: - print "The file didn't exist. I thought so..." - print "Child traceback:" - print e.child_traceback - else: - print "Error", e.errno - else: - print >>sys.stderr, "Gosh. No error." - - -def _demo_windows(): - # - # Example 1: Connecting several subprocesses - # - print "Looking for 'PROMPT' in set output..." - p1 = Popen("set", stdout=PIPE, shell=True) - p2 = Popen('find "PROMPT"', stdin=p1.stdout, stdout=PIPE) - print repr(p2.communicate()[0]) - - # - # Example 2: Simple execution of program - # - print "Executing calc..." - p = Popen("calc") - p.wait() - - -if __name__ == "__main__": - if mswindows: - _demo_windows() - else: - _demo_posix() diff -r c55bdc2fb9fa -r 33ac48224523 corebio/_version.py --- a/corebio/_version.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,18 +0,0 @@ - - -# Keywords between dollar signs are subsituted by subversion. -# The date and build will only tell the truth after a branch or tag, -# since different files in trunk will have been changed at different times -date ="$Date: 2006-11-27 12:10:21 -0800 (Mon, 27 Nov 2006) $".split()[1] -revision = "$Revision: 167 $".split()[1] - -# major.minor.patch -# The patch level should be zero in trunk, a positive number in a release -# branch. During a release, increment the minor number in trunk and set the -# patch level to 1 in the branch. Increment patch number for bug fix releases. 
-__version__ = '0.5.0' #b' + revision - - -description = "CoreBio %s (%s)" % (__version__, date) - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data.py --- a/corebio/data.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,385 +0,0 @@ -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - -""" -Standard information used in computational biology. - - -To convert a property dictionary to a list : ->>> comp = [ amino_acid_composition[k] for k in amino_acid_letters] - - -Resources: (Various standard data files.) - - -BLOSUM Scoring Matrices - Source: ftp://ftp.ncbi.nih.gov/repository/blocks/unix/blosum - These are all new blast style with 1/3 bit scaling - - blosum35 - - blosum45 - - blosum62 - - blosum40 - - blosum50 - - blosum80 - - blosum100 - -Other subsitution scoring matrices: - - dist20_comp - - pam250 - - pam120 - - -Status: Beta (Data needs to be proof checked.) -""" -# TODO: add this datafile? -# Description of database cross references : -# - dbxref.txt (http://www.expasy.org/cgi-bin/lists?dbxref.txt) - - -# FIXME: Move documentation of data to docstring above. docstrings -# after variables don't work. - - -# The ExPasy ProtScale tool is a great source of amino acid properties. -# http://au.expasy.org/cgi-bin/protscale.pl - -from StringIO import StringIO -from corebio._future import resource_string, resource_stream,resource_filename -from corebio import utils - -# Explictly list set of available data resources. We want to be able to access -# these resources in, for example, a webapp, without inadvertently allowing -# unrestricted read access to the local file system. 
- -resource_names = [ - 'blosum35', - 'blosum45', - 'blosum62', - 'blosum40', - 'blosum50', - 'blosum80', - 'blosum100', - 'dist20_comp', - 'pam250', - 'pam120', - ] - -_resource_filenames = { - 'blosum35': 'data/blosum35.mat', - 'blosum45': 'data/blosum45.mat', - 'blosum62': 'data/blosum62.mat', - 'blosum40': 'data/blosum40.mat', - 'blosum50': 'data/blosum50.mat', - 'blosum80': 'data/blosum80.mat', - 'blosum100': 'data/blosum100.mat', - 'dist20_comp': 'data/dist20_comp.mat', - 'pam250': 'data/pam250.mat', - 'pam120': 'data/pam120.mat', - } - -# TODO: Subsitution matrix parser, SeqMatrix.read -_resource_parsers = {} - -def data_string( name ): - fn = _resource_filenames[name] - return resource_string(__name__, fn , __file__) - -def data_stream( name ): - fn = _resource_filenames[name] - return resource_stream(__name__, fn , __file__) - -def data_filename( name ): - fn = _resource_filenames[name] - return resource_filename(__name__, fn, __file__) - -def data_object( name, parser = None) : - if parser is None : - if name in _resource_parsers : - parser = _resource_parsers[name] - else : - parser = str - return parser( data_stream(name) ) - - -amino_acid_letters = "ACDEFGHIKLMNPQRSTVWY" -"""Standard codes for the 20 canonical amino acids, in alphabetic order.""" - -amino_acid_alternative_letters = "ARNDCQEGHILKMFPSTWYV" -"""Amino acid one letter codes, alphabetic by three letter codes.""" - -amino_acid_extended_letters = "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" - - -dna_letters = "GATC" -dna_extended_letters = "GATCRYWSMKHBVDN" - -rna_letters = "GAUC" -rna_extended_letters = "GAUCRYWSMKHBVDN" - - -dna_ambiguity = { - "A": "A", - "C": "C", - "G": "G", - "T": "T", - "M": "AC", - "R": "AG", - "W": "AT", - "S": "CG", - "Y": "CT", - "K": "GT", - "V": "ACG", - "H": "ACT", - "D": "AGT", - "B": "CGT", - "X": "GATC", - "N": "GATC", -} - -rna_ambiguity = { - "A": "A", - "C": "C", - "G": "G", - "U": "U", - "M": "AC", - "R": "AG", - "W": "AU", - "S": "CG", - "Y": "CU", - "K": "GU", - "V": 
"ACG", - "H": "ACU", - "D": "AGU", - "B": "CGU", - "X": "GAUC", - "N": "GAUC", -} - -amino_acid_ambiguity = { - "A": "A", - "B": "ND", - "C": "C", - "D": "D", - "E": "E", - "F": "F", - "G": "G", - "H": "H", - "I": "I", - "K": "K", - "L": "L", - "M": "M", - "N": "N", - "P": "P", - "Q": "Q", - "R": "R", - "S": "S", - "T": "T", - "V": "V", - "W": "W", - "X": "ACDEFGHIKLMNPQRSTVWY", - "Y": "Y", - "Z": "QE", - "J": "IL", - 'U': 'U', - 'O': 'O', -} - - -# Monomer isotopically averaged molecular mass -# Data Checked GEC Nov 2006 -amino_acid_mass = { - "A": 89.09, - "B" : 132.66, # Averaged proportional to amino_acid_composition - "C": 121.16, - "D": 133.10, - "E": 147.13, - "F": 165.19, - "G": 75.07, - "H": 155.16, - "I": 131.18, - "J": 131.18, - "K": 146.19, - "L": 131.18, - "M": 149.21, - "N": 132.12, - # "O" : ???, # TODO - "P": 115.13, - "Q": 146.15, - "R": 174.20, - "S": 105.09, - "T": 119.12, - "U" : 168.05, - "V": 117.15, - "W": 204.23, - "X" : 129.15, # Averaged proportional to amino_acid_composition - "Y": 181.19, - "Z" : 146.76, # Averaged proportional to amino_acid_composition - } - -dna_mass = { - "A": 347., - "C": 323., - "G": 363., - "T": 322., - } - -rna_mass = { - "A": 363., - "C": 319., - "G": 379., - "U": 340., -} - -one_to_three = { - 'A':'Ala', 'B':'Asx', 'C':'Cys', 'D':'Asp', - 'E':'Glu', 'F':'Phe', 'G':'Gly', 'H':'His', - 'I':'Ile', 'K':'Lys', 'L':'Leu', 'M':'Met', - 'N':'Asn', 'P':'Pro', 'Q':'Gln', 'R':'Arg', - 'S':'Ser', 'T':'Thr', 'V':'Val', 'W':'Trp', - 'Y':'Tyr', 'Z':'Glx', 'X':'Xaa', - 'U':'Sec', 'J':'Xle', 'O':'Pyl' - } -""" Map between standard 1 letter amino acid codes and standard three letter codes. - -Ref: http://www.ebi.ac.uk/RESID/faq.html -""" - -standard_three_to_one = utils.invert_dict(one_to_three) -""" Map between standard three letter amino acid codes and standard one letter codes. 
- -Ref: http://www.ebi.ac.uk/RESID/faq.html -""" - - -extended_three_to_one= { -'2as':'D', '3ah':'H', '5hp':'E', 'Acl':'R', 'Agm':'R', 'Aib':'A', 'Ala':'A', 'Alm':'A', 'Alo':'T', 'Aly':'K', 'Arg':'R', 'Arm':'R', 'Asa':'D', 'Asb':'D', 'Ask':'D', 'Asl':'D', 'Asn':'N', 'Asp':'D', 'Asq':'D', 'Asx':'B', 'Aya':'A', 'Bcs':'C', 'Bhd':'D', 'Bmt':'T', 'Bnn':'A', 'Buc':'C', 'Bug':'L', 'C5c':'C', 'C6c':'C', 'Ccs':'C', 'Cea':'C', 'Cgu':'E', 'Chg':'A', 'Cle':'L', 'Cme':'C', 'Csd':'A', 'Cso':'C', 'Csp':'C', 'Css':'C', 'Csw':'C', 'Csx':'C', 'Cxm':'M', 'Cy1':'C', 'Cy3':'C', 'Cyg':'C', 'Cym':'C', 'Cyq':'C', 'Cys':'C', 'Dah':'F', 'Dal':'A', 'Dar':'R', 'Das':'D', 'Dcy':'C', 'Dgl':'E', 'Dgn':'Q', 'Dha':'A', 'Dhi':'H', 'Dil':'I', 'Div':'V', 'Dle':'L', 'Dly':'K', 'Dnp':'A', 'Dpn':'F', 'Dpr':'P', 'Dsn':'S', 'Dsp':'D', 'Dth':'T', 'Dtr':'W', 'Dty':'Y', 'Dva':'V', 'Efc':'C', 'Fla':'A', 'Fme':'M', 'Ggl':'E', 'Gl3':'G', 'Gln':'Q', 'Glu':'E', 'Glx':'Z', 'Gly':'G', 'Glz':'G', 'Gma':'E', 'Gsc':'G', 'Hac':'A', 'Har':'R', 'Hic':'H', 'Hip':'H', 'His':'H', 'Hmr':'R', 'Hpq':'F', 'Htr':'W', 'Hyp':'P', 'Iil':'I', 'Ile':'I', 'Iyr':'Y', 'Kcx':'K', 'Leu':'L', 'Llp':'K', 'Lly':'K', 'Ltr':'W', 'Lym':'K', 'Lys':'K', 'Lyz':'K', 'Maa':'A', 'Men':'N', 'Met':'M', 'Mhs':'H', 'Mis':'S', 'Mle':'L', 'Mpq':'G', 'Msa':'G', 'Mse':'M', 'Mva':'V', 'Nem':'H', 'Nep':'H', 'Nle':'L', 'Nln':'L', 'Nlp':'L', 'Nmc':'G', 'Oas':'S', 'Ocs':'C', 'Omt':'M', 'Paq':'Y', 'Pca':'E', 'Pec':'C', 'Phe':'F', 'Phi':'F', 'Phl':'F', 'Pr3':'C', 'Pro':'P', 'Prr':'A', 'Ptr':'Y', 'Pyl':'O', 'Sac':'S', 'Sar':'G', 'Sch':'C', 'Scs':'C', 'Scy':'C', 'Sec':'U', 'Sel':'U', 'Sep':'S', 'Ser':'S', 'Set':'S', 'Shc':'C', 'Shr':'K', 'Smc':'C', 'Soc':'C', 'Sty':'Y', 'Sva':'S', 'Ter':'*', 'Thr':'T', 'Tih':'A', 'Tpl':'W', 'Tpo':'T', 'Tpq':'A', 'Trg':'K', 'Tro':'W', 'Trp':'W', 'Tyb':'Y', 'Tyq':'Y', 'Tyr':'Y', 'Tys':'Y', 'Tyy':'Y', 'Unk':'X', 'Val':'V', 'Xaa':'X', 'Xer':'X', 'Xle':'J'} - -""" Map between three letter amino acid codes and standard one letter codes. 
-This map contains many nonstandard three letter codes, used, for example, to specify chemically modified amino acids in PDB files. - -Ref: http://astral.berkeley.edu/ -Ref: http://www.ebi.ac.uk/RESID/faq.html -""" -# Initial table is from the ASTRAL RAF release notes. -# added UNK -# Extra IUPAC: Xle, Xaa, Sec, Pyl -# The following have been seen in biopython code. -# Ter : '*' Termination -# Sel : 'U' A typo for Sec, selenocysteine? -# Xer : 'X' Another alternative for unknown? - - -amino_acid_names = { - 'A' : 'alanine', - 'M' : 'methionine', - 'C' : 'cysteine', - 'N' : 'asparagine', - 'D' : 'aspartic acid', - 'P' : 'proline', - 'E' : 'glutamic acid', - 'Q' : 'glutamine', - 'F' : 'phenylalanine', - 'R' : 'arginine', - 'G' : 'glycine', - 'S' : 'serine', - 'H' : 'histidine', - 'T' : 'threonine', - 'I' : 'isoleucine', - 'V' : 'valine', - 'K' : 'lysine', - 'W' : 'tryptophan', - 'L' : 'leucine', - 'Y' : 'tyrosine', - 'B' : 'aspartic acid or asparagine', - 'J' : 'leucine or isoleucine', - 'X' : 'unknown', - 'Z' : 'glutamic acid or glutamine', - 'U' : 'selenocysteine', - 'O' : 'pyrrolysine', - '*' : 'translation stop', - '-' : 'gap' - } - -amino_acid_composition = dict( - A = .082, R = .057, N = .044, D = .053, C = .017, - Q = .040, E = .062, G = .072, H = .022, I = .052, - L = .090, K = .057, M = .024, F =.039, P = .051, - S = .069, T = .058, W = .013, Y= .032, V =.066 ) - -""" -Overall amino acid composition of proteins. -Ref: McCaldon P., Argos P. Proteins 4:99-122 (1988). -""" -# FIXME : Proof these values - -kyte_doolittle_hydrophobicity = dict( - A=1.8, R=-4.5, N=-3.5, D=-3.5, C=2.5, - Q=-3.5, E=-3.5, G=-0.4, H=-3.2, I=4.5, - L=3.8, K=-3.9, M=1.9, F=2.8, P=-1.6, - S=-0.8, T=-0.7, W=-0.9, Y=-1.3, V=4.2 ) -""" -Kyte-Doolittle hydrophobicity scale. -Ref: Kyte J., Doolittle R.F. J. Mol. Biol. 
157:105-132 (1982) -""" -# FIXME : Proof these values - - -nucleotide_names = { - 'A' : 'Adenosine', - 'C' : 'Cytidine', - 'G' : 'Guanine', - 'T' : 'Thymidine', - 'U' : 'Uracil', - 'R' : 'G A (puRine)', - 'Y' : 'T C (pYrimidine)', - 'K' : 'G T (Ketone)', - 'M' : 'A C (aMino group)', - 'S' : 'G C (Strong interaction)', - 'W' : 'A T (Weak interaction)', - 'B' : 'G T C (not A) (B comes after A)', - 'D' : 'G A T (not C) (D comes after C)', - 'H' : 'A C T (not G) (H comes after G)', - 'V' : 'G C A (not T, not U) (V comes after U)', - 'N' : 'A G C T (aNy)', - '-' : 'gap', - } - - - - - - - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/blosum100.mat --- a/corebio/data/blosum100.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -# Matrix made by matblas from blosum100_3.iij -# * column uses minimum score -# BLOSUM Clustered Scoring Matrix in 1/3 Bit Units -# Blocks Database = /data/blocks_5.0/blocks.dat -# Cluster Percentage: >= 100 -# Entropy = 1.4516, Expected = -1.0948 - A R N D C Q E G H I L K M F P S T W Y V B Z X * -A 8 -3 -4 -5 -2 -2 -3 -1 -4 -4 -4 -2 -3 -5 -2 1 -1 -6 -5 -2 -4 -2 -2 -10 -R -3 10 -2 -5 -8 0 -2 -6 -1 -7 -6 3 -4 -6 -5 -3 -3 -7 -5 -6 -4 -1 -3 -10 -N -4 -2 11 1 -5 -1 -2 -2 0 -7 -7 -1 -5 -7 -5 0 -1 -8 -5 -7 5 -2 -3 -10 -D -5 -5 1 10 -8 -2 2 -4 -3 -8 -8 -3 -8 -8 -5 -2 -4 -10 -7 -8 6 0 -4 -10 -C -2 -8 -5 -8 14 -7 -9 -7 -8 -3 -5 -8 -4 -4 -8 -3 -3 -7 -6 -3 -7 -8 -5 -10 -Q -2 0 -1 -2 -7 11 2 -5 1 -6 -5 2 -2 -6 -4 -2 -3 -5 -4 -5 -2 5 -2 -10 -E -3 -2 -2 2 -9 2 10 -6 -2 -7 -7 0 -5 -8 -4 -2 -3 -8 -7 -5 0 7 -3 -10 -G -1 -6 -2 -4 -7 -5 -6 9 -6 -9 -8 -5 -7 -8 -6 -2 -5 -7 -8 -8 -3 -5 -4 -10 -H -4 -1 0 -3 -8 1 -2 -6 13 -7 -6 -3 -5 -4 -5 -3 -4 -5 1 -7 -2 -1 -4 -10 -I -4 -7 -7 -8 -3 -6 -7 -9 -7 8 2 -6 1 -2 -7 -5 -3 -6 -4 4 -8 -7 -3 -10 -L -4 -6 -7 -8 -5 -5 -7 -8 -6 2 8 -6 3 0 -7 -6 -4 -5 -4 0 -8 -6 -3 -10 -K -2 3 -1 -3 -8 2 0 -5 -3 -6 -6 10 -4 -6 -3 -2 -3 -8 -5 -5 -2 0 -3 -10 -M -3 -4 -5 
-8 -4 -2 -5 -7 -5 1 3 -4 12 -1 -5 -4 -2 -4 -5 0 -7 -4 -3 -10 -F -5 -6 -7 -8 -4 -6 -8 -8 -4 -2 0 -6 -1 11 -7 -5 -5 0 4 -3 -7 -7 -4 -10 -P -2 -5 -5 -5 -8 -4 -4 -6 -5 -7 -7 -3 -5 -7 12 -3 -4 -8 -7 -6 -5 -4 -4 -10 -S 1 -3 0 -2 -3 -2 -2 -2 -3 -5 -6 -2 -4 -5 -3 9 2 -7 -5 -4 -1 -2 -2 -10 -T -1 -3 -1 -4 -3 -3 -3 -5 -4 -3 -4 -3 -2 -5 -4 2 9 -7 -5 -1 -2 -3 -2 -10 -W -6 -7 -8 -10 -7 -5 -8 -7 -5 -6 -5 -8 -4 0 -8 -7 -7 17 2 -5 -9 -7 -6 -10 -Y -5 -5 -5 -7 -6 -4 -7 -8 1 -4 -4 -5 -5 4 -7 -5 -5 2 12 -5 -6 -6 -4 -10 -V -2 -6 -7 -8 -3 -5 -5 -8 -7 4 0 -5 0 -3 -6 -4 -1 -5 -5 8 -7 -5 -3 -10 -B -4 -4 5 6 -7 -2 0 -3 -2 -8 -8 -2 -7 -7 -5 -1 -2 -9 -6 -7 6 0 -4 -10 -Z -2 -1 -2 0 -8 5 7 -5 -1 -7 -6 0 -4 -7 -4 -2 -3 -7 -6 -5 0 6 -2 -10 -X -2 -3 -3 -4 -5 -2 -3 -4 -4 -3 -3 -3 -3 -4 -4 -2 -2 -6 -4 -3 -4 -2 -3 -10 -* -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 1 diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/blosum35.mat --- a/corebio/data/blosum35.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -# Matrix made by matblas from blosum35.iij -# * column uses minimum score -# BLOSUM Clustered Scoring Matrix in 1/4 Bit Units -# Blocks Database = /data/blocks_5.0/blocks.dat -# Cluster Percentage: >= 35 -# Entropy = 0.2111, Expected = -0.1550 - A R N D C Q E G H I L K M F P S T W Y V B Z X * -A 5 -1 -1 -1 -2 0 -1 0 -2 -1 -2 0 0 -2 -2 1 0 -2 -1 0 -1 -1 0 -5 -R -1 8 -1 -1 -3 2 -1 -2 -1 -3 -2 2 0 -1 -2 -1 -2 0 0 -1 -1 0 -1 -5 -N -1 -1 7 1 -1 1 -1 1 1 -1 -2 0 -1 -1 -2 0 0 -2 -2 -2 4 0 0 -5 -D -1 -1 1 8 -3 -1 2 -2 0 -3 -2 -1 -3 -3 -1 -1 -1 -3 -2 -2 5 1 -1 -5 -C -2 -3 -1 -3 15 -3 -1 -3 -4 -4 -2 -2 -4 -4 -4 -3 -1 -5 -5 -2 -2 -2 -2 -5 -Q 0 2 1 -1 -3 7 2 -2 -1 -2 -2 0 -1 -4 0 0 0 -1 0 -3 0 4 -1 -5 -E -1 -1 -1 2 -1 2 6 -2 -1 -3 -1 1 -2 -3 0 0 -1 -1 -1 -2 0 5 -1 -5 -G 0 -2 1 -2 -3 -2 -2 7 -2 -3 -3 -1 -1 -3 -2 1 -2 -1 -2 -3 0 -2 -1 -5 -H -2 -1 1 0 -4 -1 -1 -2 12 -3 -2 -2 1 -3 -1 -1 -2 -4 0 -4 0 -1 -1 -5 -I -1 -3 -1 -3 
-4 -2 -3 -3 -3 5 2 -2 1 1 -1 -2 -1 -1 0 4 -2 -3 0 -5 -L -2 -2 -2 -2 -2 -2 -1 -3 -2 2 5 -2 3 2 -3 -2 0 0 0 2 -2 -2 0 -5 -K 0 2 0 -1 -2 0 1 -1 -2 -2 -2 5 0 -1 0 0 0 0 -1 -2 0 1 0 -5 -M 0 0 -1 -3 -4 -1 -2 -1 1 1 3 0 6 0 -3 -1 0 1 0 1 -2 -2 0 -5 -F -2 -1 -1 -3 -4 -4 -3 -3 -3 1 2 -1 0 8 -4 -1 -1 1 3 1 -2 -3 -1 -5 -P -2 -2 -2 -1 -4 0 0 -2 -1 -1 -3 0 -3 -4 10 -2 0 -4 -3 -3 -1 0 -1 -5 -S 1 -1 0 -1 -3 0 0 1 -1 -2 -2 0 -1 -1 -2 4 2 -2 -1 -1 0 0 0 -5 -T 0 -2 0 -1 -1 0 -1 -2 -2 -1 0 0 0 -1 0 2 5 -2 -2 1 -1 -1 0 -5 -W -2 0 -2 -3 -5 -1 -1 -1 -4 -1 0 0 1 1 -4 -2 -2 16 3 -2 -3 -1 -1 -5 -Y -1 0 -2 -2 -5 0 -1 -2 0 0 0 -1 0 3 -3 -1 -2 3 8 0 -2 -1 -1 -5 -V 0 -1 -2 -2 -2 -3 -2 -3 -4 4 2 -2 1 1 -3 -1 1 -2 0 5 -2 -2 0 -5 -B -1 -1 4 5 -2 0 0 0 0 -2 -2 0 -2 -2 -1 0 -1 -3 -2 -2 5 0 -1 -5 -Z -1 0 0 1 -2 4 5 -2 -1 -3 -2 1 -2 -3 0 0 -1 -1 -1 -2 0 4 0 -5 -X 0 -1 0 -1 -2 -1 -1 -1 -1 0 0 0 0 -1 -1 0 0 -1 -1 0 -1 0 -1 -5 -* -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 1 diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/blosum40.mat --- a/corebio/data/blosum40.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -# Matrix made by matblas from blosum40.iij -# * column uses minimum score -# BLOSUM Clustered Scoring Matrix in 1/4 Bit Units -# Blocks Database = /data/blocks_5.0/blocks.dat -# Cluster Percentage: >= 40 -# Entropy = 0.2851, Expected = -0.2090 - A R N D C Q E G H I L K M F P S T W Y V B Z X * -A 5 -2 -1 -1 -2 0 -1 1 -2 -1 -2 -1 -1 -3 -2 1 0 -3 -2 0 -1 -1 0 -6 -R -2 9 0 -1 -3 2 -1 -3 0 -3 -2 3 -1 -2 -3 -1 -2 -2 -1 -2 -1 0 -1 -6 -N -1 0 8 2 -2 1 -1 0 1 -2 -3 0 -2 -3 -2 1 0 -4 -2 -3 4 0 -1 -6 -D -1 -1 2 9 -2 -1 2 -2 0 -4 -3 0 -3 -4 -2 0 -1 -5 -3 -3 6 1 -1 -6 -C -2 -3 -2 -2 16 -4 -2 -3 -4 -4 -2 -3 -3 -2 -5 -1 -1 -6 -4 -2 -2 -3 -2 -6 -Q 0 2 1 -1 -4 8 2 -2 0 -3 -2 1 -1 -4 -2 1 -1 -1 -1 -3 0 4 -1 -6 -E -1 -1 -1 2 -2 2 7 -3 0 -4 -2 1 -2 -3 0 0 -1 -2 -2 -3 1 5 -1 -6 -G 1 -3 0 -2 -3 -2 -3 8 -2 -4 -4 -2 -2 -3 -1 0 -2 -2 -3 -4 -1 -2 
-1 -6 -H -2 0 1 0 -4 0 0 -2 13 -3 -2 -1 1 -2 -2 -1 -2 -5 2 -4 0 0 -1 -6 -I -1 -3 -2 -4 -4 -3 -4 -4 -3 6 2 -3 1 1 -2 -2 -1 -3 0 4 -3 -4 -1 -6 -L -2 -2 -3 -3 -2 -2 -2 -4 -2 2 6 -2 3 2 -4 -3 -1 -1 0 2 -3 -2 -1 -6 -K -1 3 0 0 -3 1 1 -2 -1 -3 -2 6 -1 -3 -1 0 0 -2 -1 -2 0 1 -1 -6 -M -1 -1 -2 -3 -3 -1 -2 -2 1 1 3 -1 7 0 -2 -2 -1 -2 1 1 -3 -2 0 -6 -F -3 -2 -3 -4 -2 -4 -3 -3 -2 1 2 -3 0 9 -4 -2 -1 1 4 0 -3 -4 -1 -6 -P -2 -3 -2 -2 -5 -2 0 -1 -2 -2 -4 -1 -2 -4 11 -1 0 -4 -3 -3 -2 -1 -2 -6 -S 1 -1 1 0 -1 1 0 0 -1 -2 -3 0 -2 -2 -1 5 2 -5 -2 -1 0 0 0 -6 -T 0 -2 0 -1 -1 -1 -1 -2 -2 -1 -1 0 -1 -1 0 2 6 -4 -1 1 0 -1 0 -6 -W -3 -2 -4 -5 -6 -1 -2 -2 -5 -3 -1 -2 -2 1 -4 -5 -4 19 3 -3 -4 -2 -2 -6 -Y -2 -1 -2 -3 -4 -1 -2 -3 2 0 0 -1 1 4 -3 -2 -1 3 9 -1 -3 -2 -1 -6 -V 0 -2 -3 -3 -2 -3 -3 -4 -4 4 2 -2 1 0 -3 -1 1 -3 -1 5 -3 -3 -1 -6 -B -1 -1 4 6 -2 0 1 -1 0 -3 -3 0 -3 -3 -2 0 0 -4 -3 -3 5 0 -1 -6 -Z -1 0 0 1 -3 4 5 -2 0 -4 -2 1 -2 -4 -1 0 -1 -2 -2 -3 0 5 -1 -6 -X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 0 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -6 -* -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 1 diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/blosum45.mat --- a/corebio/data/blosum45.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -# Matrix made by matblas from blosum45.iij -# * column uses minimum score -# BLOSUM Clustered Scoring Matrix in 1/3 Bit Units -# Blocks Database = /data/blocks_5.0/blocks.dat -# Cluster Percentage: >= 45 -# Entropy = 0.3795, Expected = -0.2789 - A R N D C Q E G H I L K M F P S T W Y V B Z X * -A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -2 -2 0 -1 -1 0 -5 -R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1 -1 -2 -1 -2 -1 0 -1 -5 -N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1 0 -4 -2 -3 4 0 -1 -5 -D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0 -1 -4 -2 -3 5 1 -1 -5 -C -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1 -2 -3 -2 -5 -Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0 -1 -2 -1 -3 0 4 -1 -5 -E -1 0 0 2 -3 
2 6 -2 0 -3 -2 1 -2 -3 0 0 -1 -3 -2 -3 1 4 -1 -5 -G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -5 -H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1 -2 -3 2 -3 0 0 -1 -5 -I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2 -1 -2 0 3 -3 -3 -1 -5 -L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3 -1 -2 0 1 -3 -2 -1 -5 -K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1 -1 -2 -1 -2 0 1 -1 -5 -M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2 -1 -2 0 1 -2 -1 -1 -5 -F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2 -1 1 3 0 -3 -3 -1 -5 -P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1 -1 -3 -3 -3 -2 -1 -1 -5 -S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4 2 -4 -2 -1 0 0 0 -5 -T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2 5 -3 -1 0 0 -1 0 -5 -W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4 -3 15 3 -3 -4 -2 -2 -5 -Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8 -1 -2 -2 -1 -5 -V 0 -2 -3 -3 -1 -3 -3 -3 -3 3 1 -2 1 0 -3 -1 0 -3 -1 5 -3 -3 -1 -5 -B -1 -1 4 5 -2 0 1 -1 0 -3 -3 0 -2 -3 -2 0 0 -4 -2 -3 4 0 -1 -5 -Z -1 0 0 1 -3 4 4 -2 0 -3 -2 1 -1 -3 -1 0 -1 -2 -2 -3 0 4 -1 -5 -X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -2 -1 -1 -1 -1 -1 -5 -* -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 1 diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/blosum50.mat --- a/corebio/data/blosum50.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -# Matrix made by matblas from blosum50.iij -# * column uses minimum score -# BLOSUM Clustered Scoring Matrix in 1/3 Bit Units -# Blocks Database = /data/blocks_5.0/blocks.dat -# Cluster Percentage: >= 50 -# Entropy = 0.4808, Expected = -0.3573 - A R N D C Q E G H I L K M F P S T W Y V B Z X * - 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -2 -1 -1 -3 -1 1 0 -3 -2 0 -2 -1 -1 -5 --2 7 -1 -2 -4 1 0 -3 0 -4 -3 3 -2 -3 -3 -1 -1 -3 -1 -3 -1 0 -1 -5 --1 -1 7 2 -2 0 0 0 1 -3 -4 0 -2 -4 -2 1 0 -4 -2 -3 4 0 -1 -5 --2 -2 2 8 -4 0 2 -1 -1 -4 -4 -1 -4 -5 -1 0 -1 -5 -3 -4 5 1 -1 -5 --1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 
-2 -2 -4 -1 -1 -5 -3 -1 -3 -3 -2 -5 --1 1 0 0 -3 7 2 -2 1 -3 -2 2 0 -4 -1 0 -1 -1 -1 -3 0 4 -1 -5 --1 0 0 2 -3 2 6 -3 0 -4 -3 1 -2 -3 -1 -1 -1 -3 -2 -3 1 5 -1 -5 - 0 -3 0 -1 -3 -2 -3 8 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4 -1 -2 -2 -5 --2 0 1 -1 -3 1 0 -2 10 -4 -3 0 -1 -1 -2 -1 -2 -3 2 -4 0 0 -1 -5 --1 -4 -3 -4 -2 -3 -4 -4 -4 5 2 -3 2 0 -3 -3 -1 -3 -1 4 -4 -3 -1 -5 --2 -3 -4 -4 -2 -2 -3 -4 -3 2 5 -3 3 1 -4 -3 -1 -2 -1 1 -4 -3 -1 -5 --1 3 0 -1 -3 2 1 -2 0 -3 -3 6 -2 -4 -1 0 -1 -3 -2 -3 0 1 -1 -5 --1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7 0 -3 -2 -1 -1 0 1 -3 -1 -1 -5 --3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 -4 -3 -2 1 4 -1 -4 -4 -2 -5 --1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3 -2 -1 -2 -5 - 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 2 -4 -2 -2 0 0 -1 -5 - 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0 0 -1 0 -5 --3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 2 -3 -5 -2 -3 -5 --2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8 -1 -3 -2 -1 -5 - 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5 -4 -3 -1 -5 --2 -1 4 5 -3 0 1 -1 0 -4 -4 0 -3 -4 -2 0 0 -5 -3 -4 5 0 -1 -5 --1 0 0 1 -3 4 5 -2 0 -3 -3 1 -1 -4 -1 0 -1 -2 -2 -3 0 5 -1 -5 --1 -1 -1 -1 -2 -1 -1 -2 -1 -1 -1 -1 -1 -2 -2 -1 0 -3 -1 -1 -1 -1 -1 -5 --5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 1 diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/blosum62.mat --- a/corebio/data/blosum62.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -# Matrix made by matblas from blosum62_3.iij -# * column uses minimum score -# BLOSUM Clustered Scoring Matrix in 1/3 Bit Units -# Blocks Database = /data/blocks_5.0/blocks.dat -# Cluster Percentage: >= 62 -# Entropy = 0.6979, Expected = -0.5209 - A R N D C Q E G H I L K M F P S T W Y V B Z X * -A 6 -2 -2 -3 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 2 0 -4 -3 0 -2 -1 -1 -6 -R -2 8 -1 -2 -5 1 0 -3 0 -4 -3 3 -2 -4 -3 -1 -2 -4 -3 -4 -2 0 -2 -6 -N -2 -1 8 2 -4 0 0 -1 1 -5 -5 0 -3 -4 -3 1 0 -6 -3 -4 5 0 -2 -6 
-D -3 -2 2 9 -5 0 2 -2 -2 -5 -5 -1 -5 -5 -2 0 -2 -6 -5 -5 6 1 -2 -6 -C -1 -5 -4 -5 13 -4 -5 -4 -4 -2 -2 -5 -2 -4 -4 -1 -1 -3 -4 -1 -5 -5 -3 -6 -Q -1 1 0 0 -4 8 3 -3 1 -4 -3 2 -1 -5 -2 0 -1 -3 -2 -3 0 5 -1 -6 -E -1 0 0 2 -5 3 7 -3 0 -5 -4 1 -3 -5 -2 0 -1 -4 -3 -4 1 6 -1 -6 -G 0 -3 -1 -2 -4 -3 -3 8 -3 -6 -5 -2 -4 -5 -3 0 -2 -4 -5 -5 -1 -3 -2 -6 -H -2 0 1 -2 -4 1 0 -3 11 -5 -4 -1 -2 -2 -3 -1 -3 -4 3 -5 -1 0 -2 -6 -I -2 -4 -5 -5 -2 -4 -5 -6 -5 6 2 -4 2 0 -4 -4 -1 -4 -2 4 -5 -5 -2 -6 -L -2 -3 -5 -5 -2 -3 -4 -5 -4 2 6 -4 3 1 -4 -4 -2 -2 -2 1 -5 -4 -2 -6 -K -1 3 0 -1 -5 2 1 -2 -1 -4 -4 7 -2 -5 -2 0 -1 -4 -3 -3 -1 1 -1 -6 -M -1 -2 -3 -5 -2 -1 -3 -4 -2 2 3 -2 8 0 -4 -2 -1 -2 -1 1 -4 -2 -1 -6 -F -3 -4 -4 -5 -4 -5 -5 -5 -2 0 1 -5 0 9 -5 -4 -3 1 4 -1 -5 -5 -2 -6 -P -1 -3 -3 -2 -4 -2 -2 -3 -3 -4 -4 -2 -4 -5 11 -1 -2 -5 -4 -4 -3 -2 -2 -6 -S 2 -1 1 0 -1 0 0 0 -1 -4 -4 0 -2 -4 -1 6 2 -4 -3 -2 0 0 -1 -6 -T 0 -2 0 -2 -1 -1 -1 -2 -3 -1 -2 -1 -1 -3 -2 2 7 -4 -2 0 -1 -1 -1 -6 -W -4 -4 -6 -6 -3 -3 -4 -4 -4 -4 -2 -4 -2 1 -5 -4 -4 16 3 -4 -6 -4 -3 -6 -Y -3 -3 -3 -5 -4 -2 -3 -5 3 -2 -2 -3 -1 4 -4 -3 -2 3 10 -2 -4 -3 -2 -6 -V 0 -4 -4 -5 -1 -3 -4 -5 -5 4 1 -3 1 -1 -4 -2 0 -4 -2 6 -5 -4 -1 -6 -B -2 -2 5 6 -5 0 1 -1 -1 -5 -5 -1 -4 -5 -3 0 -1 -6 -4 -5 5 0 -2 -6 -Z -1 0 0 1 -5 5 6 -3 0 -5 -4 1 -2 -5 -2 0 -1 -4 -3 -4 0 5 -1 -6 -X -1 -2 -2 -2 -3 -1 -1 -2 -2 -2 -2 -1 -1 -2 -2 -1 -1 -3 -2 -1 -2 -1 -2 -6 -* -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 1 diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/blosum80.mat --- a/corebio/data/blosum80.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -# Matrix made by matblas from blosum80_3.iij -# * column uses minimum score -# BLOSUM Clustered Scoring Matrix in 1/3 Bit Units -# Blocks Database = /data/blocks_5.0/blocks.dat -# Cluster Percentage: >= 80 -# Entropy = 0.9868, Expected = -0.7442 - A R N D C Q E G H I L K M F P S T W Y V B Z X * -A 7 -3 -3 -3 -1 -2 -2 0 -3 -3 -3 -1 -2 -4 -1 2 
0 -5 -4 -1 -3 -2 -1 -8 -R -3 9 -1 -3 -6 1 -1 -4 0 -5 -4 3 -3 -5 -3 -2 -2 -5 -4 -4 -2 0 -2 -8 -N -3 -1 9 2 -5 0 -1 -1 1 -6 -6 0 -4 -6 -4 1 0 -7 -4 -5 5 -1 -2 -8 -D -3 -3 2 10 -7 -1 2 -3 -2 -7 -7 -2 -6 -6 -3 -1 -2 -8 -6 -6 6 1 -3 -8 -C -1 -6 -5 -7 13 -5 -7 -6 -7 -2 -3 -6 -3 -4 -6 -2 -2 -5 -5 -2 -6 -7 -4 -8 -Q -2 1 0 -1 -5 9 3 -4 1 -5 -4 2 -1 -5 -3 -1 -1 -4 -3 -4 -1 5 -2 -8 -E -2 -1 -1 2 -7 3 8 -4 0 -6 -6 1 -4 -6 -2 -1 -2 -6 -5 -4 1 6 -2 -8 -G 0 -4 -1 -3 -6 -4 -4 9 -4 -7 -7 -3 -5 -6 -5 -1 -3 -6 -6 -6 -2 -4 -3 -8 -H -3 0 1 -2 -7 1 0 -4 12 -6 -5 -1 -4 -2 -4 -2 -3 -4 3 -5 -1 0 -2 -8 -I -3 -5 -6 -7 -2 -5 -6 -7 -6 7 2 -5 2 -1 -5 -4 -2 -5 -3 4 -6 -6 -2 -8 -L -3 -4 -6 -7 -3 -4 -6 -7 -5 2 6 -4 3 0 -5 -4 -3 -4 -2 1 -7 -5 -2 -8 -K -1 3 0 -2 -6 2 1 -3 -1 -5 -4 8 -3 -5 -2 -1 -1 -6 -4 -4 -1 1 -2 -8 -M -2 -3 -4 -6 -3 -1 -4 -5 -4 2 3 -3 9 0 -4 -3 -1 -3 -3 1 -5 -3 -2 -8 -F -4 -5 -6 -6 -4 -5 -6 -6 -2 -1 0 -5 0 10 -6 -4 -4 0 4 -2 -6 -6 -3 -8 -P -1 -3 -4 -3 -6 -3 -2 -5 -4 -5 -5 -2 -4 -6 12 -2 -3 -7 -6 -4 -4 -2 -3 -8 -S 2 -2 1 -1 -2 -1 -1 -1 -2 -4 -4 -1 -3 -4 -2 7 2 -6 -3 -3 0 -1 -1 -8 -T 0 -2 0 -2 -2 -1 -2 -3 -3 -2 -3 -1 -1 -4 -3 2 8 -5 -3 0 -1 -2 -1 -8 -W -5 -5 -7 -8 -5 -4 -6 -6 -4 -5 -4 -6 -3 0 -7 -6 -5 16 3 -5 -8 -5 -5 -8 -Y -4 -4 -4 -6 -5 -3 -5 -6 3 -3 -2 -4 -3 4 -6 -3 -3 3 11 -3 -5 -4 -3 -8 -V -1 -4 -5 -6 -2 -4 -4 -6 -5 4 1 -4 1 -2 -4 -3 0 -5 -3 7 -6 -4 -2 -8 -B -3 -2 5 6 -6 -1 1 -2 -1 -6 -7 -1 -5 -6 -4 0 -1 -8 -5 -6 6 0 -3 -8 -Z -2 0 -1 1 -7 5 6 -4 0 -6 -5 1 -3 -6 -2 -1 -2 -5 -4 -4 0 6 -1 -8 -X -1 -2 -2 -3 -4 -2 -2 -3 -2 -2 -2 -2 -2 -3 -3 -1 -1 -5 -3 -2 -3 -1 -2 -8 -* -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 1 diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/dist20_comp.mat --- a/corebio/data/dist20_comp.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,25 +0,0 @@ -#TOTAL: 1.000000 -# entropy = 0.404334 -A 4 -2 -2 -2 0 -1 -1 -1 -2 -2 -2 -2 -1 -2 -1 0 -1 -3 -2 -1 -2 -1 -1 -R -2 6 -1 -1 -4 1 0 -3 0 -3 -3 2 -2 -3 -2 -1 
-1 -2 -2 -3 3 -1 -1 -N -2 -1 7 1 -3 0 0 -1 0 -5 -4 0 -3 -4 -2 0 -1 -3 -2 -4 3 -1 -1 -D -2 -1 1 7 -4 0 1 -1 -1 -6 -5 0 -4 -5 -1 0 -1 -4 -3 -5 0 -2 -2 -C 0 -4 -3 -4 12 -3 -4 -3 -3 -1 -2 -4 -1 -2 -3 -2 -1 -2 -2 0 -3 5 -2 -Q -1 1 0 0 -3 6 1 -2 0 -3 -3 1 -2 -3 -1 0 -1 -2 -2 -3 0 1 -1 -E -1 0 0 1 -4 1 5 -2 -1 -4 -4 1 -3 -4 -1 -1 -1 -3 -3 -4 0 -1 -1 -G -1 -3 -1 -1 -3 -2 -2 7 -2 -6 -5 -2 -4 -5 -2 -1 -2 -4 -4 -5 -2 -2 -2 -H -2 0 0 -1 -3 0 -1 -2 9 -3 -3 -1 -2 -1 -2 -1 -1 0 0 -3 0 -1 -1 -I -2 -3 -5 -6 -1 -3 -4 -6 -3 5 2 -4 1 0 -4 -4 -2 -1 -1 3 -4 -2 -2 -L -2 -3 -4 -5 -2 -3 -4 -5 -3 2 5 -3 2 1 -3 -3 -2 -1 -1 1 -4 -2 -2 -K -2 2 0 0 -4 1 1 -2 -1 -4 -3 5 -2 -4 -1 -1 -1 -3 -3 -3 1 -1 -1 -M -1 -2 -3 -4 -1 -2 -3 -4 -2 1 2 -2 7 1 -3 -2 -1 0 0 1 -3 -2 -1 -F -2 -3 -4 -5 -2 -3 -4 -5 -1 0 1 -4 1 7 -3 -3 -2 3 3 0 -3 -2 -1 -P -1 -2 -2 -1 -3 -1 -1 -2 -2 -4 -3 -1 -3 -3 8 -1 -2 -3 -3 -3 -2 -2 -2 -S 0 -1 0 0 -2 0 -1 -1 -1 -4 -3 -1 -2 -3 -1 4 1 -3 -2 -3 0 -1 -1 -T -1 -1 -1 -1 -1 -1 -1 -2 -1 -2 -2 -1 -1 -2 -2 1 5 -2 -2 -1 -1 -1 -1 -W -3 -2 -3 -4 -2 -2 -3 -4 0 -1 -1 -3 0 3 -3 -3 -2 12 3 -2 -3 -2 -1 -Y -2 -2 -2 -3 -2 -2 -3 -4 0 -1 -1 -3 0 3 -3 -2 -2 3 8 -2 -2 -2 -1 -V -1 -3 -4 -5 0 -3 -4 -5 -3 3 1 -3 1 0 -3 -3 -1 -2 -2 5 -4 -2 -2 -B -2 3 3 0 -3 0 0 -2 0 -4 -4 1 -3 -3 -2 0 -1 -3 -2 -4 3 -1 -1 -Z -1 -1 -1 -2 5 1 -1 -2 -1 -2 -2 -1 -2 -2 -2 -1 -1 -2 -2 -2 -1 3 -1 -X -1 -1 -1 -2 -2 -1 -1 -2 -1 -2 -2 -1 -1 -1 -2 -1 -1 -1 -1 -2 -1 -1 -1 diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/pam120.mat --- a/corebio/data/pam120.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,34 +0,0 @@ -# -# This matrix was produced by "pam" Version 1.0.6 [28-Jul-93] -# -# PAM 120 substitution matrix, scale = ln(2)/2 = 0.346574 -# -# Expected score = -1.64, Entropy = 0.979 bits -# -# Lowest score = -8, Highest score = 12 -# - A R N D C Q E G H I L K M F P S T W Y V B Z X -A 3 -3 -1 0 -3 -1 0 1 -3 -1 -3 -2 -2 -4 1 1 1 -7 -4 0 0 -1 -1 -R -3 6 -1 -3 -4 1 -3 -4 1 -2 -4 2 -1 -5 -1 -1 -2 1 -5 -3 
-2 -1 -2 -N -1 -1 4 2 -5 0 1 0 2 -2 -4 1 -3 -4 -2 1 0 -4 -2 -3 3 0 -1 -D 0 -3 2 5 -7 1 3 0 0 -3 -5 -1 -4 -7 -3 0 -1 -8 -5 -3 4 3 -2 -C -3 -4 -5 -7 9 -7 -7 -4 -4 -3 -7 -7 -6 -6 -4 0 -3 -8 -1 -3 -6 -7 -4 -Q -1 1 0 1 -7 6 2 -3 3 -3 -2 0 -1 -6 0 -2 -2 -6 -5 -3 0 4 -1 -E 0 -3 1 3 -7 2 5 -1 -1 -3 -4 -1 -3 -7 -2 -1 -2 -8 -5 -3 3 4 -1 -G 1 -4 0 0 -4 -3 -1 5 -4 -4 -5 -3 -4 -5 -2 1 -1 -8 -6 -2 0 -2 -2 -H -3 1 2 0 -4 3 -1 -4 7 -4 -3 -2 -4 -3 -1 -2 -3 -3 -1 -3 1 1 -2 -I -1 -2 -2 -3 -3 -3 -3 -4 -4 6 1 -3 1 0 -3 -2 0 -6 -2 3 -3 -3 -1 -L -3 -4 -4 -5 -7 -2 -4 -5 -3 1 5 -4 3 0 -3 -4 -3 -3 -2 1 -4 -3 -2 -K -2 2 1 -1 -7 0 -1 -3 -2 -3 -4 5 0 -7 -2 -1 -1 -5 -5 -4 0 -1 -2 -M -2 -1 -3 -4 -6 -1 -3 -4 -4 1 3 0 8 -1 -3 -2 -1 -6 -4 1 -4 -2 -2 -F -4 -5 -4 -7 -6 -6 -7 -5 -3 0 0 -7 -1 8 -5 -3 -4 -1 4 -3 -5 -6 -3 -P 1 -1 -2 -3 -4 0 -2 -2 -1 -3 -3 -2 -3 -5 6 1 -1 -7 -6 -2 -2 -1 -2 -S 1 -1 1 0 0 -2 -1 1 -2 -2 -4 -1 -2 -3 1 3 2 -2 -3 -2 0 -1 -1 -T 1 -2 0 -1 -3 -2 -2 -1 -3 0 -3 -1 -1 -4 -1 2 4 -6 -3 0 0 -2 -1 -W -7 1 -4 -8 -8 -6 -8 -8 -3 -6 -3 -5 -6 -1 -7 -2 -6 12 -2 -8 -6 -7 -5 -Y -4 -5 -2 -5 -1 -5 -5 -6 -1 -2 -2 -5 -4 4 -6 -3 -3 -2 8 -3 -3 -5 -3 -V 0 -3 -3 -3 -3 -3 -3 -2 -3 3 1 -4 1 -3 -2 -2 0 -8 -3 5 -3 -3 -1 -B 0 -2 3 4 -6 0 3 0 1 -3 -4 0 -4 -5 -2 0 0 -6 -3 -3 4 2 -1 -Z -1 -1 0 3 -7 4 4 -2 1 -3 -3 -1 -2 -6 -1 -1 -2 -7 -5 -3 2 4 -1 -X -1 -2 -1 -2 -4 -1 -1 -2 -2 -1 -2 -2 -2 -3 -2 -1 -1 -5 -3 -1 -1 -1 -2 - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/data/pam250.mat --- a/corebio/data/pam250.mat Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,34 +0,0 @@ -# -# This matrix was produced by "pam" Version 1.0.6 [28-Jul-93] -# -# PAM 250 substitution matrix, scale = ln(2)/3 = 0.231049 -# -# Expected score = -0.844, Entropy = 0.354 bits -# -# Lowest score = -8, Highest score = 17 -# - A R N D C Q E G H I L K M F P S T W Y V B Z X -A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 0 0 0 -R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 -1 0 -1 -N 0 0 2 2 -4 1 1 0 2 
-2 -3 1 -2 -3 0 1 0 -4 -2 -2 2 1 0 -D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 3 3 -1 -C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -4 -5 -3 -Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 1 3 -1 -E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 3 3 -1 -G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 0 0 -1 -H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 1 2 -1 -I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -2 -2 -1 -L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -3 -3 -1 -K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 1 0 -1 -M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -2 -2 -1 -F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -4 -5 -2 -P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 -1 0 -1 -S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 0 0 0 -T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 0 -1 0 -W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -5 -6 -4 -Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -3 -4 -2 -V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 -2 -2 -1 -B 0 -1 2 3 -4 1 3 0 1 -2 -3 1 -2 -4 -1 0 0 -5 -3 -2 3 2 -1 -Z 0 0 1 3 -5 3 3 0 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3 -1 -X 0 -1 0 -1 -3 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 0 0 -4 -2 -1 -1 -1 -1 - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/moremath.py --- a/corebio/moremath.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,455 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. 
-# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - - -""" Various bits of useful math not in the standard python library. - -Constants : - -- euler_gamma = 0.577215... -- catalan = 0.915965... -- golden_ratio = 1.618033... -- bits_per_nat = log2(e) = 1/log(2) -- sqrt_2pi = 2.50662... - -Special Functions : - -- gamma() -- Gamma function. -- lngamma() -- Logarithm of the gamma function -- factorial() -- The factorial function. -- digamma() -- Digamma function (logarithmic derivative of gamma). -- trigamma() -- Trigamma function (derivative of digamma). -- entropy() -- The entropy of a probability vector -- incomplete_gamma() -- The 'upper' incomplete gamma function. -- normalized_incomplete_gamma() -- -- lg() -- Base 2 logarithms. - - -Vector Operations : - -- rmsd() -- Root mean squared deviation of two point vectors -- minimize_rmsd() -- Find the rigid transformation that minimized the - RMSD between two vectors of points. 
- -Minimization : - -- find_root() -- 1d root finding - -Probability Distributions : -- Gamma -- Dirichlet -- Multinomial -- Gaussian - -""" - - - -__all__ = ('euler_gamma', 'catalan', 'golden_ratio', 'bits_per_nat', 'sqrt_2pi', - 'gamma', 'lngamma', 'factorial', 'digamma', 'trigamma', - 'entropy', 'log2', - 'incomplete_gamma', 'normalized_incomplete_gamma', - # 'integrate', - # 'rmsd', 'minimize_rmsd', 'find_root', - # 'Gamma', 'Dirichlet', - # 'decompose_log_odds_array', - 'argmax', 'argmin' - ) - -from math import * -import random -from itertools import izip, count - -# Some mathematical constants -euler_gamma = 0.57721566490153286060651 -catalan = 0.91596559417721901505460 -golden_ratio = 1.6180339887498948482046 -bits_per_nat = 1.44269504088896340735992468100 # = log_2(e) = 1/log(2) -sqrt_2pi = 2.5066282746310005024157652848110 - - - - - -# The Lanczos approximation for the gamma function is -# -# -(z + g + 1/2) (z + 1/2) -# Gamma(z+1) = e * (z + g + 1/2) * Sqrt(2Pi) * C -# -# -# c[1] c[2] c[3] -# C = [c[0] + ----- + ----- + ----- + ... ] -# z + 1 z + 2 z + 3 -# -# -# To calculate digamma and trigamma functions we take an analytic derivative -# of the Lanczos approximation. -# -# Gamma(z) = Gamma(z+1)/z -# Digamma(z) = D ln Gamma(z) -# Trigamma(z) = D Digamma(z) - -# These Lanczos constants are from -# "A note on the computation of the convergent -# Lanczos complex Gamma approximation." Paul Godfrey (2001) -# http://my.fit.edu/~gabdo/gamma.txt - - -__lanczos_gamma = 607./128. 
-__lanczos_coefficients = ( - 0.99999999999999709182, - 57.156235665862923517, - -59.597960355475491248, - 14.136097974741747174, - -0.49191381609762019978, - .33994649984811888699e-4, - .46523628927048575665e-4, - -.98374475304879564677e-4, - .15808870322491248884e-3, - -.21026444172410488319e-3, - .21743961811521264320e-3, - -.16431810653676389022e-3, - .84418223983852743293e-4, - -.26190838401581408670e-4, - .36899182659531622704e-5) - -__factorial =( - 1., - 1., - 2., - 6., - 24., - 120., - 720., - 5040., - 40320., - 362880., - 3628800., - 39916800., - 479001600., - 6227020800., - 87178291200., - 1307674368000., - 20922789888000., - 355687428096000., - 6402373705728000., - 121645100408832000., - 2432902008176640000., - 51090942171709440000., - 1124000727777607680000., - 25852016738884976640000., - 620448401733239439360000., - 15511210043330985984000000., - 403291461126605635584000000., - 10888869450418352160768000000., - 304888344611713860501504000000., - 8841761993739701954543616000000., - 265252859812191058636308480000000., - 8222838654177922817725562880000000., - 263130836933693530167218012160000000. ) - -def gamma(z) : - """The gamma function. Returns exact results for small integers. Will - overflow for modest sized arguments. Use lngamma(z) instead. - - See: Eric W. Weisstein. "Gamma Function." From MathWorld, A Wolfram Web Resource. - http://mathworld.wolfram.com/GammaFunction.html - - """ - - n = floor(z) - if n == z : - if z <= 0 : - return 1.0/0.0 # Infinity - elif n <= len(__factorial) : - return __factorial[int(n)-1] - - zz = z - if z < 0.5 : - zz = 1-z - - - g = __lanczos_gamma - c = __lanczos_coefficients - - zz = zz - 1. 
- zh = zz + 0.5 - zgh = zh + g - zp = zgh** (zh*0.5) # trick for avoiding FP overflow above z=141 - - ss = 0.0 - for k in range(len(c)-1,0,-1): - ss += c[k]/(zz+k) - - f = (sqrt_2pi*(c[0]+ss)) * (( zp*exp(-zgh)) *zp) - - if z<0.5 : - f = pi /( sin(pi*z) *f) - - return f - - -def lngamma(z) : - """The logarithm of the gamma function. - """ - - # common case optimization - - n = floor(z) - if n == z : - if z <= 0 : - return 1.0/0.0 # Infinity - elif n <= len(__factorial) : - return __factorial[int(n)-1] - - zz = z - if z < 0.5 : - zz = 1-z - - - g = __lanczos_gamma - c = __lanczos_coefficients - - zz = zz - 1. - zh = zz + 0.5 - zgh = zh + g - zp = zgh** (zh*0.5) # trick for avoiding FP overflow above z=141 - - ss = 0.0 - for k in range(len(c)-1,0,-1): - ss += c[k]/(zz+k) - - f = (sqrt_2pi*(c[0]+ss)) * (( zp*exp(-zgh)) *zp) - - if z<0.5 : - f = pi /( sin(pi*z) *f) - - return log(f) - - -def factorial(z) : - """ The factorial function. - factorial(z) == gamma(z+1) - """ - return gamma(z+1) - - -def digamma(z) : - """The digamma function, the logarithmic derivative of the gamma function. - digamma(z) = d/dz ln( gamma(z) ) - - See: Eric W. Weisstein. "Digamma Function." From MathWorld-- - A Wolfram Web Resource. http://mathworld.wolfram.com/DigammaFunction.html - """ - - g = __lanczos_gamma - c = __lanczos_coefficients - - zz = z - if z < 0.5 : - zz = 1 -z - - n=0. - d=0. - for k in range(len(c)-1,0,-1): - dz =1./(zz+(k+1)-2); - dd =c[k] * dz - d = d + dd - n = n - dd * dz - - d = d + c[0] - gg = zz + g - 0.5 - f = log(gg) + (n/d - g/gg) - - if z<0.5 : - f -= pi / tan( pi * z) - - return f - - -def trigamma(z) : - """The trigamma function, the derivative of the digamma function. - trigamma(z) = d/dz digamma(z) = d/dz d/dz ln( gamma(z) ) - - See: Eric W. Weisstein. "Digamma Function." From MathWorld-- - A Wolfram Web Resource. http://mathworld.wolfram.com/TrigammaFunction.html - """ - - g = __lanczos_gamma - c = __lanczos_coefficients - - t1=0. - t2=0. - t3=0. 
- for k in range(len(c)-1,0,-1): - dz =1./(z+k); - dd1 = c[k]* dz - t1 += dd1 - dd2 = dd1 * dz - t2 += dd2 - t3 += dd2 * dz - - t1 += c[0] - c = - (t2*t2)/(t1*t1) +2*t3/t1 - - result = 1./(z*z) - gg = z + g + 0.5 - result += - (z+0.5)/ (gg*gg) - result += 2./gg - - result += c - - return result - -def incomplete_gamma(a,x) : - """The 'upper' incomplete gamma function: - - oo - - - | -t a-1 - incomplete_gamma(a,x) = | e t dt. - | - - - x - - In Mathematica, Gamma[a,x]. - - Note that, very confusingly, the phrase 'incomplete gamma fucntion' - can also refer to the same integral between 0 and x, (the 'lower' - incomplete gamma function) or to the normalized versions, - normalized_incomplete_gamma() ) - - - See: Eric W. Weisstein. "Gamma Function." From MathWorld, A Wolfram Web Resource. - http://mathworld.wolfram.com/IncompleteGammaFunction.html - - Bugs : - This implentation is not very accurate for some arguments. - """ - return normalized_incomplete_gamma(a,x) * gamma(a) - - -def normalized_incomplete_gamma(a,x) : - """The upper, incomplete gamma function normalized so that the limiting - values are zero and one. - - Q(a,x) = incomplete_gamma(a,x) / gamma(a) - - See: - incomplete_gamma() - Bugs : - This implentation is not very accurate for some arguments. - """ - maxiter = 100 - epsilon = 1.48e-8 - small = 1e-30 - - - if a<=0 or x<0 : - raise ValueError("Invalid arguments") - if x == 0.0 : return 1.0 - - if x<= a+1 : - # Use the series representation - term = 1./a - total = term - for n in range(1,maxiter) : - term *= x/(a+n) - total += term - if abs(term/total) < epsilon : - return 1. - total * exp(-x+a*log(x) - lngamma(a) ) - raise RuntimeError( - "Failed to converge after %d iterations." % (maxiter) ) - else : - # Use the continued fraction representation - total = 1.0 - b = x + 1. -a - c = 1./small - d = 1./b - h = d - for i in range(1, maxiter) : - an = -i * (i-a) - b = b+2. 
- d = an * d + b - if abs(d) < small : d = small - c = b + an /c - if abs(c) < small : c= small - d = 1./d - term = d * c - h = h * term - if abs( term-1.) < epsilon : - return h * exp(-x+a*log(x) - lngamma(a) ) - raise RuntimeError( - "Failed to converge after %d iterations." % (maxiter) ) - - - -def log2( x) : - """ Return the base 2 logarithm of x """ - return log(x,2) - - -def entropy( pvec, base= exp(1) ) : - """ The entropy S = -Sum_i p_i ln p_i - pvec is a frequency vector, not necessarily normalized. - """ - # TODO: Optimize - if len(pvec) ==0 : - raise ValueError("Zero length vector") - - - total = 0.0 - ent = 0.0 - for p in pvec: - if p>0 : # 0 log(0) =0 - total += p - ent += - log(float(p)) *p - elif p<0: - raise ValueError("Negative probability") - - - ent = (ent/total) + log(total) - ent /= log(base) - - return ent - - - - - -def argmax( alist) : - """Return the index of the last occurance of the maximum value in the list.""" - return max(izip(alist, count() ))[1] - -def argmin( alist) : - """Return the index of the first occurance of the minimum value in the list.""" - return min(izip(alist, count() ))[1] - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/resource/__init__.py --- a/corebio/resource/__init__.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,38 +0,0 @@ -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. 
-# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - - -"""Access to programs, complex file formats and databases used in -computational biology. -""" \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/resource/astral.py --- a/corebio/resource/astral.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,341 +0,0 @@ - -# Copyright 2000 by Jeffrey Chang. All rights reserved. -# Copyright 2001 by Gavin E. Crooks. All rights reserved. -# Modifications Copyright 2004/2005 James Casbon. -# Copyright 2005 by Regents of the University of California. All rights reserved -# (Major rewrite for conformance to corebio. 
Gavin Crooks) -# -# This code is derived from the Biopython distribution and is governed by it's -# license. Please see the LICENSE file that should have been included -# as part of this package. - -""" ASTRAL dataset IO. - -From http://astral.berkeley.edu/ : - -The ASTRAL Compendium for Sequence and Structure Analysis - -The ASTRAL compendium provides databases and tools useful for analyzing protein structures and their sequences. It is partially derived from, and augments the SCOP: Structural Classification of Proteins database. Most of the resources depend upon the coordinate files maintained and distributed by the Protein Data Bank. - -* Classes : - - Raf -- A file ofASTRAL RAF (Rapid Access Format) Sequence Maps. - - RafSeqMap -- A sequence map, a RAF record. - - Res -- A single residue mapping from a RAF record. - -* Functions : - - parse_domain -- Convert an ASTRAL fasta header string into a Scop domain. - - normalize_letters -- Bormalize RAF amino acid codes. - -""" - -# TODO : Need to pull more of James Casbon's Astral code. - -import re -from copy import copy - -from corebio.resource.scop import Domain, Residues -from corebio.data import extended_three_to_one as to_one_letter_code -from corebio.utils import FileIndex - -__all__ = ('astral_evalues', 'astral_percent_identities', - 'astral_evalues_filenames', 'normalize_letters', 'parse_domain', - 'Raf', 'RafSeqMap', 'Res') - -# Percentage identity filtered ASTRAL SCOP genetic domain sequence subset -astral_percent_identities = [10,20,25,30,35,40,50,70,90,95,100] - -# E-value filtered ASTRAL SCOP genetic domain sequence subsets, based on PDB SEQRES records. -astral_evalues = [10, 5, 1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 1e-4, 1e-5, 1e-10, 1e-15,1e-20, 1e-25, 1e-50] - -# A map between evalues and astral filename suffixes. 
-astral_evalues_filenames = { - 10: 'e+1', 5: 'e+0,7', 1: 'e+0', 0.5: 'e-0,3', 0.1: 'e-1', - 0.05: 'e-1,3', 0.01: 'e-2', 0.005: 'e-2,3', 0.001: 'e-3', - 1e-4: 'e-4', 1e-5: 'e-5', 1e-10: 'e-10', 1e-15: 'e-15', - 1e-20: 'e-20', 1e-25: 'e-25', 1e-50: 'e-50' } - - - -def normalize_letters(one_letter_code) : - """Convert RAF one-letter amino acid codes into IUPAC standard codes. - Letters are uppercased, and "." ("Unknown") is converted to "X". - """ - if one_letter_code == '.' : - return 'X' - else : - return one_letter_code.upper() - -_domain_re = re.compile(r">?([\w_\.]*)\s+([\w\.]*)\s+\(([^)]*)\) (.*)") -def parse_domain(str) : - """Convert an ASTRAL fasta header string into a Scop domain. - - An ASTRAL (http://astral.stanford.edu/) header contains a concise - description of a SCOP domain. A very similar format is used when a - Domain object is converted into a string. The Domain returned by this - method contains most of the SCOP information, but it will not be located - within the SCOP hierarchy (i.e. The parent node will be None). The - description is composed of the SCOP protein and species descriptions. - - A typical ASTRAL header looks like -- - >d1tpt_1 a.46.2.1 (1-70) Thymidine phosphorylase {Escherichia coli} - """ - - m = _domain_re.match(str) - if (not m) : raise ValueError("Domain: "+ str) - - dom = Domain() - dom.sid = m.group(1) - dom.sccs = m.group(2) - dom.residues = Residues(m.group(3)) - if not dom.residues.pdbid : - dom.residues.pdbid= dom.sid[1:5] - dom.description = m.group(4).strip() - - return dom - - -class Raf(FileIndex) : - """ASTRAL RAF (Rapid Access Format) Sequence Maps. - - The ASTRAL RAF Sequence Maps record the relationship between the PDB SEQRES - records (representing the sequence of the molecule used in an experiment) - and the ATOM records (representing the atoms experimentally observed). - - This data is derived from the Protein Data Bank CIF files. 
Known errors in - the CIF files are corrected manually, with the original PDB file serving as - the final arbiter in case of discrepancies. - - Residues are referenced by residue ID. This consists of a the PDB residue - sequence number (up to 4 digits) and an optional PDB insertion code (an - ascii alphabetic character, a-z, A-Z). e.g. "1", "10A", "1010b", "-1" - - See "ASTRAL RAF Sequence Maps":http://astral.stanford.edu/raf.html - - The RAF file itself is about 50 MB. Each line consists of a sequence map of - a different protein chain. This index provides rapid, random - access of RAF records without having to load the entire file into memory. - - This class does not load the entire RAF file into memory. Instead, it - reads the file once, noting the location and content of each RafSeqMap. - The index key is a concatenation of the PDB ID and chain ID. e.g - "2drcA", "155c_". RAF uses an underscore to indicate blank - chain IDs. Custom maps of subsequences or spanning multiple chains, can - be constructed with the get_seqmap method. - - """ - def __init__(self, raf_file) : - def linekey(line) : - if not line or len(line)<5 or line.isspace() or line[0]=='#': - return None - return line[0:5] - def parser( f) : return RafSeqMap(f.readline()) - - FileIndex.__init__(self, raf_file, linekey, parser) - - - def get_seqmap(self, residues) : - """Get the sequence map for a collection of residues. - - residues -- A SCOP style description of a collection of residues from a - PDB strucure, (e.g. '(1bba A:10-20,B:)'), as a string or a - scop.Residues instance. 
- """ - if type(residues)== str : - residues = Residues(residues) - - pdbid = residues.pdbid - frags = residues.fragments - if not frags: frags =(('_','',''),) # All residues of unnamed chain - - seqMap = None - for frag in frags : - chainid = frag[0] - if chainid=='' or chainid=='-' or chainid==' ' or chainid=='_': - chainid = '_' - sid = pdbid + chainid - - sm = self[sid] - - # Cut out fragment of interest - start = 0 - end = len(sm.res) - if frag[1] : start = int(sm.index(frag[1], chainid)) - if frag[2] : end = int(sm.index(frag[2], chainid)+1) - - sm = sm[start:end] - - if seqMap is None : - seqMap = sm - else : - seqMap += sm - - return seqMap - # End Raf - -class RafSeqMap(object) : - """ASTRAL RAF (Rapid Access Format) Sequence Maps. - - RafSeqMap is a list like object; You can find the location of particular - residues with index(), slice this RafSeqMap into fragments, and glue - fragments back together with extend(). - - - pdbid -- The PDB 4 character ID - - pdb_datestamp -- From the PDB file - - version -- The RAF format version. e.g. 0.01 - - flags -- RAF flags. (See release notes for more information.) - - res -- A list of Res objects, one for each residue in this sequence map - """ - - def __init__(self, raf_record=None) : - """Parses a RAF record into a RafSeqMap object.""" - - self.pdbid = '' - self.pdb_datestamp = '' - self.version = '' - self.flags = '' - self.res = [] - - if not raf_record : return - - header_len = 38 - line = raf_record.rstrip() # no trailing whitespace - - if len(line) -# -# This software is distributed under the MIT Open Source License. 
-# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. - -"""STRIDE: Protein secondary structure assignment from atomic coordinates. - -This module provides an interface to STRIDE, a c program used to recognize -secondary structural elements in proteins from their atomic coordinates. 
- -""" - -from corebio.seq import Seq, protein_alphabet, Alphabet -from corebio.resource.astral import to_one_letter_code - -# alphabet for stride secondary structure -stride_alphabet = Alphabet("HGIEBC12345678@&T") - -# Dictionary for conversion between names and alphabet -stride_alphabet_names = ( - "H", "AlphaHelix", - "G", "310Helix", - "I", "PiHelix", - "E", "Strand", - "b", "Bridge", - "B", "Bridge", - "C", "Coil", - "1", "TurnI", - "2", "TurnI'", - "3", "TurnII", - "4", "TurnII'", - "5", "TurnVIa", - "6", "TurnVIb", - "7", "TurnVIII", - "8", "TurnIV", - "@", "GammaClassic", - "&", "GammaInv", - "T", "Turn" - ) - - -class Stride(object) : - def __init__(self, stride_file) : - """ Read and parse a STRIDE output file. - - args: - - stride_file : An open file handle - attributes : - - pdbid : The PDB id. - - res : A list of Res objects, one per PDB resiude - """ - res =[] - f=stride_file - self.pdbid = f.readline()[75:79] - for l in f: - if l[0:3] =="ASG": - res.append(Res(l)) - - self.res = res # A list of Res objects - - self._res_dict = None - - def total_area(self) : - """ Return the solvent accessible area """ - area = 0 - for i in self.res : - area += i.solvent_acc_area - return area - - def primary(self): - """ Return the protein primary sequence as a Seq object.""" - return Seq(''.join([r.primary_seq for r in self.res]), protein_alphabet) - - def secondary(self): - """Return the secondary structure of the protien as a Seq object""" - return Seq(''.join([r.secondary_str for r in self.res]), stride_alphabet) - - - def get_res(self, chainid, resid) : - """ Return the given resiude """ - if not self._res_dict : - d = {} - for r in self.res : - d[ (r.chainid, r.resid)] = r - self._res_dict =d - - return self._res_dict[(chainid, resid)] - - - -class Res(object): - """ Structural information of a single resiude. An ASG line from a stride - output file. 
- - Attributes : - - chainid - - resid - - primary_seq - - secondary_str - - solvent_acc_area - - phi - - psi - """ - - def __init__(self, res_line) : - """ Eats a single 'ASG' line from a stride file, splits it up - into parts and return a Res object.""" - - if (len(res_line)<70): - raise ValueError("Line not long enough") - try: - self.chainid = res_line[9:10] - # STRIDE converts blank chain ids into dashes. Undo. - if self.chainid=="-" : self.chainid = " " - - # In rare cases STRIDE columns can be misaligned. Grab extra - # white space to compensate. - self.resid = res_line[10:15].strip() - self.primary_seq = to_one_letter_code[res_line[5:8].capitalize()] - self.secondary_str = res_line[24:25] - self.solvent_acc_area = float(res_line[64:71]) - self.phi = float(res_line[42:49].strip()) - self.psi = float(res_line[52:59].strip()) - except FloatingPointError: - raise FloatingPointError("Can't float phi, psi, or area") - except KeyError: - raise KeyError("Can't find three letter code in dictionary") - except LookupError: - raise LookupError("One of the values is out of index of res_line") - - - - - - - - - - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq.py --- a/corebio/seq.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,665 +0,0 @@ - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. 
-# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - - - -""" Alphabetic sequences and associated tools and data. - -Seq is a subclass of a python string with additional annotation and an alphabet. -The characters in string must be contained in the alphabet. Various standard -alphabets are provided. - - -Classes : - Alphabet -- A subset of non-null ascii characters - Seq -- An alphabetic string - SeqList -- A collection of Seq's - -Alphabets : - o generic_alphabet -- A generic alphabet. Any printable ASCII character. - o protein_alphabet -- IUCAP/IUB Amino Acid one letter codes. - o nucleic_alphabet -- IUPAC/IUB Nucleic Acid codes 'ACGTURYSWKMBDHVN-' - o dna_alphabet -- Same as nucleic_alphabet, with 'U' (Uracil) an - alternative for 'T' (Thymidine). - o rna_alphabet -- Same as nucleic_alphabet, with 'T' (Thymidine) an - alternative for 'U' (Uracil). 
- o reduced_nucleic_alphabet -- All ambiguous codes in 'nucleic_alphabet' are - alternative to 'N' (aNy) - o reduced_protein_alphabet -- All ambiguous ('BZJ') and non-canonical amino - acids codes ( 'U', Selenocysteine and 'O', Pyrrolysine) in - 'protein_alphabet' are alternative to 'X'. - o unambiguous_dna_alphabet -- 'ACGT' - o unambiguous_rna_alphabet -- 'ACGU' - o unambiguous_protein_alphabet -- The twenty canonical amino acid one letter - codes, in alphabetic order, 'ACDEFGHIKLMNPQRSTVWY' - -Amino Acid Codes: - Code Alt. Meaning - ----------------- - A Alanine - B Aspartic acid or Asparagine - C Cysteine - D Aspartate - E Glutamate - F Phenylalanine - G Glycine - H Histidine - I Isoleucine - J Leucine or Isoleucine - K Lysine - L Leucine - M Methionine - N Asparagine - O Pyrrolysine - P Proline - Q Glutamine - R Arginine - S Serine - T Threonine - U Selenocysteine - V Valine - W Tryptophan - Y Tyrosine - Z Glutamate or Glutamine - X ? any - * translation stop - - .~ gap - -Nucleotide Codes: - Code Alt. Meaning - ------------------------------ - A Adenosine - C Cytidine - G Guanine - T Thymidine - U Uracil - R G A (puRine) - Y T C (pYrimidine) - K G T (Ketone) - M A C (aMino group) - S G C (Strong interaction) - W A T (Weak interaction) - B G T C (not A) (B comes after A) - D G A T (not C) (D comes after C) - H A C T (not G) (H comes after G) - V G C A (not T, not U) (V comes after U) - N X? A G C T (aNy) - - .~ A gap - - - - -Refs: - http://www.chem.qmw.ac.uk/iupac/AminoAcid/A2021.html - http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html -Status: - Beta -Authors: - GEC 2004,2005 -""" - -# TODO: Add this to docstring somewhere. -# To replace all ambiguous nucleic code by 'N', replace alphabet and then n -# normalize. 
-# -# >>> Seq( 'ACGT-RYKM', reduced_nucleic_alphabet).normalized() -# 'ACGT-NNNN' - -from array import array -from string import maketrans -from corebio.moremath import argmax, sqrt - -__all__ = [ - 'Alphabet', - 'Seq', - 'rna', 'dna', 'protein', - 'SeqList', - 'generic_alphabet', - 'protein_alphabet', - 'nucleic_alphabet', - 'dna_alphabet', - 'rna_alphabet', - 'reduced_nucleic_alphabet', - 'reduced_protein_alphabet', - 'unambiguous_dna_alphabet', - 'unambiguous_dna_alphabet', - 'unambiguous_rna_alphabet', - 'unambiguous_protein_alphabet', - 'generic_alphabet' - ] - - - -class Alphabet(object) : - """An ordered subset of printable ascii characters. - - Status: - Beta - Authors: - - GEC 2005 - """ - __slots__ = ['_letters', '_alternatives','_ord_table', '_chr_table'] - - # We're immutable, so use __new__ not __init__ - def __new__(cls, letters, alternatives= None) : - """Create a new, immutable Alphabet. - - arguments: - - letters -- the letters in the alphabet. The ordering determines - the ordinal position of each character in this alphabet. - - alt -- A list of (alternative, canonical) letters. The alternatives - are given the same ordinal position as the canonical character. - e.g. (('?','X'),('x', 'X')) states that '?' and 'x' are synomonous - with 'X'. Values that are not in 'letters' are ignored. Alternatives - that are already in 'letters' are also ignored. If the same - alternative character is used twice then the alternative is assigned - to the canonical character that occurs first in 'letters'. The - default is to assume that upper and lower case characters are - equivalent, unless both cases are included in 'letters'. - raises: - ValueError : Repetitive or otherwise illegal set of letters. 
- """ - self = object.__new__(cls) - - # Printable Ascii characters - ascii_letters = "".join([chr(__i) for __i in range(32,128)]) - - if letters is None : letters = ascii_letters - self._letters = letters - - equivalent_by_case = zip( 'abcdefghijklmnopqrstuvwxyz', - 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') - - if alternatives is None : alternatives = equivalent_by_case - - - # The ord_table maps between the ordinal position of a character in ascii - # and the ordinal position in this alphabet. Characters not in the - # alphabet are given a position of 255. The ord_table is stored as a - # string. - ord_table = ["\xff",] * 256 - for i,a in enumerate(letters) : - n = ord(a) - if n == 0 : - raise ValueError("Alphabet cannot contain null character \\0") - if ord_table[ n ] != "\xff": - raise ValueError("Repetitive alphabet") - ord_table[ n ] = chr(i) - - # Add alternatives - _from = [] - _to = [] - for e, c in alternatives : - if c in letters : - n = ord(e) - if ord_table[ n ] == "\xff" : # empty - ord_table[ n ] = ord_table[ ord(c)] - _from.append(e) - _to.append(c) - self._alternatives = (''.join(_from), ''.join(_to)) - - ord_table = "".join(ord_table) - assert( ord_table[0] == "\xff") - self._ord_table = ord_table - - # The chr_table maps between ordinal position in the alphabet letters - # and the ordinal position in ascii. This map is not the inverse of - # ord_table if there are alternatives. 
- chr_table = ["\x00"]*256 - for i,a in enumerate(letters) : - chr_table[ i ] = a - chr_table = "".join(chr_table) - self._chr_table = chr_table - - return self - - - def alphabetic(self, string) : - """True if all characters of the string are in this alphabet.""" - table = self._ord_table - for s in str(string): - if table[ord(s)] == "\xff" : - return False - return True - - def chr(self, n) : - """ The n'th character in the alphabet (zero indexed) or \\0 """ - return self._chr_table[n] - - def ord(self, c) : - """The ordinal position of the character c in this alphabet, - or 255 if no such character. - """ - return ord(self._ord_table[ord(c)]) - - def chrs(self, sequence_of_ints) : - """Convert a sequence of ordinals into an alphabetic string.""" - if not isinstance(sequence_of_ints, array) : - sequence_of_ints = array('B', sequence_of_ints) - s = sequence_of_ints.tostring().translate(self._chr_table) - return Seq(s, self) - - def ords(self, string) : - """Convert an alphabetic string into a byte array of ordinals.""" - string = str(string) - s = string.translate(self._ord_table) - a = array('B',s) - return a - - - def normalize(self, string) : - """Normalize an alphabetic string by converting all alternative symbols - to the canonical equivalent in 'letters'. 
- """ - if not self.alphabetic(string) : - raise ValueError("Not an alphabetic string.") - return self.chrs(self.ords(string)) - - def letters(self) : - """ Letters of the alphabet as a string.""" - return str(self) - - def _all_letters(self) : - """ All allowed letters, including alternatives.""" - let = [] - let.append(self._letters) - for key, value in self._alternatives : - let.append(value) - return ''.join(let) - - def __repr__(self) : - return "Alphabet( '" + self._letters +"', zip"+ repr(self._alternatives)+" )" - - def __str__(self) : - return str(self._letters) - - def __len__(self) : - return len(self._letters) - - def __eq__(self, other) : - if not hasattr(other, "_ord_table") : return False - return self._ord_table == other._ord_table - - def __ne__(self, other) : - return not self.__eq__(other) - - def __iter__(self) : - return iter(self._letters) - - def __getitem__(self, key) : - return self._letters[key] - - -# End class Alphabet - -# ------------------- Standard ALPHABETS ------------------- -# Standard alphabets are defined here, after Alphabet class. 
- -generic_alphabet = Alphabet(None, None) - - -protein_alphabet = Alphabet('ACDEFGHIKLMNOPQRSTUVWYBJZX*-', - zip('acdefghiklmnopqrstuvwybjzx?.~', - 'ACDEFGHIKLMNOPQRSTUVWYBJZXX--') ) - - -nucleic_alphabet = Alphabet("ACGTURYSWKMBDHVN-", - zip("acgturyswkmbdhvnXx?.~", - "ACGTURYSWKMBDHVNNNN--") ) - -dna_alphabet = Alphabet("ACGTRYSWKMBDHVN-", - zip('acgtryswkmbdhvnXx?.~Uu', - 'ACGTRYSWKMBDHVNNNN--TT') ) - -rna_alphabet = Alphabet("ACGURYSWKMBDHVN-", - zip('acguryswkmbdhvnXx?.~Tt', - 'ACGURYSWKMBDHVNNNN--UU') ) - -reduced_nucleic_alphabet = Alphabet("ACGTN-", - zip('acgtryswkmbdhvnXx?.~TtRYSWKMBDHV', - 'ACGTNNNNNNNNNNNNNN--TTNNNNNNNNNN') ) - -reduced_protein_alphabet = Alphabet('ACDEFGHIKLMNPQRSTVWYX*-', - zip('acdefghiklmnpqrstvwyx?.~BbZzUu', - 'ACDEFGHIKLMNPQRSTVWYXX--XXXXCC') ) - -unambiguous_dna_alphabet = Alphabet("ACGT", zip('acgt','ACGT') ) - -unambiguous_rna_alphabet = Alphabet("ACGU", zip('acgu','ACGU') ) - -unambiguous_protein_alphabet = Alphabet("ACDEFGHIKLMNPQRSTVWY", - zip('acdefghiklmnopqrstuvwy', - 'ACDEFGHIKLMNOPQRSTUVWY') ) - - -_complement_table = maketrans("ACGTRYSWKMBDHVN-acgtUuryswkmbdhvnXx?.~", - "TGCAYRSWMKVHDBN-tgcaAayrswmkvhdbnXx?.~") - - - -class Seq(str): - """ An alphabetic string. A subclass of "str" consisting solely of - letters from the same alphabet. - - Attributes: - alphabet -- A string or Alphabet of allowed characters. - name -- A short string used to identify the sequence. - description -- A string describing the sequence - - Authors : - GEC 2005 - """ - # TODO: need a method to return a copy of the string with a new alphabet, - # preserving the sequence, name and alphabet? 
- - def __new__(cls, obj, - alphabet= generic_alphabet, - name =None, description=None, - ): - self = str.__new__(cls, obj) - if alphabet is None: - alphabet = generic_alphabet - if not isinstance(alphabet, Alphabet): - alphabet = Alphabet(alphabet) - if not alphabet.alphabetic(self) : - raise ValueError("Sequence not alphabetic %s, '%s'" %(alphabet, self)) - - self._alphabet=alphabet - self.name = name - self.description = description - - return self - - # BEGIN PROPERTIES - - # Make alphabet constant - def _get_alphabet(self): - return self._alphabet - alphabet = property(_get_alphabet) - - # END PROPERTIES - - - def ords(self) : - """ Convert sequence to an array of integers - in the range [0, len(alphabet) ) - """ - return self.alphabet.ords(self) - - def tally(self, alphabet = None): - """Counts the occurrences of alphabetic characters. - - Arguments: - - alphabet -- an optional alternative alphabet - - Returns : - A list of character counts in alphabetic order. - """ - # Renamed from count() since this conflicts with str.count(). - if not alphabet : alphabet = self.alphabet - L = len(alphabet) - counts = [0,] * L - - ords = alphabet.ords(self) - - for n in ords: - if n= N : # Skip non-alphabetic kmers - i += k - continue - #FIXME: this should be a function of alphabet? 
- n = sum([multi[j]* ords[i+j] for j in range(k) ]) - counts[n] +=1 - - return counts - - def __getslice__(self, i, j): - cls = self.__class__ - return cls( str.__getslice__(self,i,j), self.alphabet) - - def __getitem__(self, key) : - cls = self.__class__ - return cls( str.__getitem__(self,key), self.alphabet) - - def __add__(self, other) : - # called for "self + other" - cls = self.__class__ - return cls( str.__add__(self, other), self.alphabet) - - def __radd__(self, other) : - # Called when "other + self" and other is superclass of self - cls = self.__class__ - return cls( str.__add__(self, other), self.alphabet) - - def join(self, str_list) : - cls = self.__class__ - return cls( super(Seq, self).join(str_list), self.alphabet) - - def __eq__(self, other) : - if not hasattr(other, "alphabet") : return False - if self.alphabet != other.alphabet : - return False - return str.__eq__(self, other) - - def __ne__(self, other) : - return not self.__eq__(other) - - def tostring(self) : - """ Converts Seq to a raw string. - """ - # Compatibility with biopython - return str(self) - - # ---- Transformations of Seq ---- - def reverse(self) : - """Return the reversed sequence. - - Not that this method returns a new object, in contrast to - the in-place reverse() method of list objects. - """ - cls = self.__class__ - return cls( self[::-1], self.alphabet) - - def ungap(self) : - # FIXME: Gap symbols should be specified by the Alphabet? - return self.remove( '-.~') - - def remove(self, delchars) : - """Return a new alphabetic sequence with all characters in 'delchars' - removed. - """ - cls = self.__class__ - return cls( str(self).translate(maketrans('',''), delchars), self.alphabet) - - def lower(self) : - """Return a lower case copy of the sequence. """ - cls = self.__class__ - trans = maketrans('ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz') - return cls(str(self).translate(trans), self.alphabet) - - def upper(self) : - """Return a lower case copy of the sequence. 
""" - cls = self.__class__ - trans = maketrans('abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ') - return cls(str(self).translate(trans), self.alphabet) - - def mask(self, letters= 'abcdefghijklmnopqrstuvwxyz', mask='X') : - """Replace all occurences of letters with the mask character. - The default is to replace all lower case letters with 'X'. - """ - LL = len(letters) - if len(mask) !=1 : - raise ValueError("Mask should be single character") - to = mask * LL - trans = maketrans( letters, to) - cls = self.__class__ - return cls(str(self).translate(trans), self.alphabet) - - def translate(self) : - """Translate a nucleotide sequence to a polypeptide using full - IUPAC ambiguities in DNA/RNA and amino acid codes, using the - standard genetic code. See corebio.transform.GeneticCode for - details and more options. - """ - # Note: masks str.translate - from transform import GeneticCode - return GeneticCode.std().translate(self) - - def back_translate(self) : - """Translate a protein sequence back into coding DNA, using using the - standard genetic code. See corebio.transform.GeneticCode for - details and more options. - """ - from transform import GeneticCode - return GeneticCode.std().back_translate(self) - - - def reverse_complement(self) : - """Returns reversed complementary nucleic acid sequence (i.e. the other - strand of a DNA sequence.) - """ - return self.reverse().complement() - - def complement(self) : - """Returns complementary nucleic acid sequence.""" - if not nucleic_alphabet.alphabetic(self.alphabet): - raise ValueError("Incompatable alphabets") - s = str.translate(self, _complement_table) - cls = self.__class__ - return cls(s, self.alphabet, self.name, self.description) - - -# end class Seq - - -class SeqList(list): - """ A list of sequences. - - Status: - Beta - """ - # TODO: If alphabet given, we should ensure that all sequences conform. - # TODO: Need an isaligned() method. All seqs same length, same alphabet. 
- __slots__ =["alphabet", "name", "description"] - - def __init__(self, alist=[], alphabet=None, name=None, description=None): - list.__init__(self, alist) - self.alphabet = alphabet - self.name = name - self.description = description - - # TOOWTDI. Replicates seq_io.read() - #@classmethod - #def read(cls, afile, alphabet = None): - # return corebio.seq_io.read(afile, alphabet) - #read = classmethod(read) - - def ords(self, alphabet=None) : - """ Convert sequence list into a 2D array of ordinals. - """ - if not alphabet : alphabet = self.alphabet - if not alphabet : raise ValueError("No alphabet") - k = [] - for s in self: - k.append( alphabet.ords(s) ) - return k - - def tally(self, alphabet = None): - """Counts the occurrences of characters in each column.""" - if not alphabet : alphabet = self.alphabet - if not alphabet : raise ValueError("No alphabet") - - N = len(alphabet) - ords = self.ords(alphabet) - L = len(ords[0]) - counts = [ [0,]*N for l in range(0,L)] - - for o in ords : - for j,n in enumerate(o) : - if n -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. 
of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - - - - - -""" Sequence file reading and writing. - -Biological sequence data is stored and transmitted using a wide variety of -different file formats. This package provides convient methods to read and -write several of these file fomats. - -CoreBio is often capable of guessing the correct file type, either from the -file extension or the structure of the file: ->>> import corebio.seq_io ->>> afile = open("test_corebio/data/cap.fa") ->>> seqs = corebio.seq_io.read(afile) - -Alternatively, each sequence file type has a seperate module named FILETYPE_io -(e.g. fasta_io, clustal_io). 
->>> import corebio.seq_io.fasta_io ->>> afile = open("test_corebio/data/cap.fa") ->>> seqs = corebio.seq_io.fasta_io.read( afile ) - -Sequence data can also be written back to files: ->>> fout = open("out.fa", "w") ->>> corebio.seq_io.fasta_io.write( fout, seqs ) - - -Supported File Formats ----------------------- - -Module Name Extension read write features ---------------------------------------------------------------------------- -array_io array, flatfile yes yes none -clustal_io clustalw aln yes yes -fasta_io fasta, Pearson fa yes yes none -genbank_io genbank gb yes -intelligenetics_io intelligenetics ig yes yes -msf_io msf msf yes -nbrf_io nbrf, pir pir yes -nexus_io nexus nexus yes -phylip_io phylip phy yes -plain_io plain, raw txt yes yes none -table_io table tbl yes yes none - -Each IO module defines one or more of the following functions and variables: - -read(afile, alphabet=None) - Read a file of sequence data and return a SeqList, a collection - of Seq's (Alphabetic strings) and features. - -read_seq(afile, alphabet=None) - Read a single sequence from a file. - -iter_seq(afile, alphabet =None) - Iterate over the sequences in a file. - -index(afile, alphabet = None) - Instead of loading all of the sequences into memory, scan the file and - return an index map that will load sequences on demand. Typically not - implemented for formats with interleaved sequences. - -write(afile, seqlist) - Write a collection of sequences to the specifed file. - -write_seq(afile, seq) - Write one sequence to the file. Only implemented for non-inteleaved, - headerless formats, such as fasta and plain. - -example - A string containing a short example of the file format - -names - A list of synonyms for the file format. e.g. for fasta_io, ( 'fasta', - 'pearson', 'fa'). The first entry is the preferred format name. - -extensions - A list of file name extensions used for this file format. e.g. 
- fasta_io.extensions is ('fa', 'fasta', 'fast', 'seq', 'fsa', 'fst', 'nt', - 'aa','fna','mpfa'). The preferred or standard extension is first in the - list. - - -Attributes : -- formats -- Available seq_io format parsers -- format_names -- A map between format names and format parsers. -- format_extensions -- A map between filename extensions and parsers. - -""" - -# Dev. References : -# -# - http://iubio.bio.indiana.edu/soft/molbio/readseq/java/Readseq2-help.html -# - http://www.ebi.ac.uk/help/formats_frame.html -# - http://www.cmbi.kun.nl/bioinf/tools/crab_pir.html -# - http://bioperl.org/HOWTOs/html/SeqIO.html -# - http://emboss.sourceforge.net/docs/themes/SequenceFormats.html -# - http://www.cse.ucsc.edu/research/compbio/a2m-desc.html (a2m) -# - http://www.genomatix.de/online_help/help/sequence_formats.html - -from corebio.seq import * - -import clustal_io -import fasta_io -import msf_io -import nbrf_io -import nexus_io -import plain_io -import phylip_io -#import null_io -import stockholm_io -import intelligenetics_io -import table_io -import array_io -import genbank_io - -__all__ = [ - 'clustal_io', - 'fasta_io', - 'msf_io', - 'nbrf_io', - 'nexus_io', - 'plain_io', - 'phylip_io', - 'null_io', - 'stockholm_io', - 'intelligenetics_io', - 'table_io', - 'array_io', - 'genbank_io', - 'read', - 'formats', - 'format_names', - 'format_extensions', - ] - -formats = ( clustal_io, fasta_io, plain_io, msf_io, genbank_io,nbrf_io, nexus_io, phylip_io, stockholm_io, intelligenetics_io, table_io, array_io) -"""Available seq_io formats""" - - -def format_names() : - """Return a map between format names and format modules""" - global formats - fnames = {} - for f in formats : - for name in f.names : - assert name not in fnames # Insanity check - fnames[name] = f - return fnames - -def format_extensions() : - """Return a map between filename extensions and sequence file types""" - global formats - fext = {} - for f in formats : - for ext in f.extensions : - assert ext not in 
fext # Insanity check - fext[ext] = f - return fext - - -# seq_io._parsers is an ordered list of sequence parsers that are tried, in -# turn, on files of unknown format. Each parser must raise an exception when -# fed a format further down the list. -# -# The general trend is most common to least common file format. However, -# 'nbrf_io' is before 'fasta_io' because nbrf looks like fasta with extras, and -# 'array_io' is last, since it is very general. -_parsers = (nbrf_io, fasta_io, clustal_io, phylip_io, genbank_io, stockholm_io, msf_io, nexus_io, table_io, array_io) - - -def _get_parsers(fin) : - global _parsers - - fnames = format_names() - fext = format_extensions() - parsers = list(_parsers) - best_guess = parsers[0] - - # If a filename is supplied use the extension to guess the format. - if hasattr(fin, "name") and '.' in fin.name : - extension = fin.name.split('.')[-1] - if extension in fnames: - best_guess = fnames[extension] - elif extension in fext : - best_guess = fext[extension] - - if best_guess in parsers : - parsers.remove(best_guess) - parsers.insert(0,best_guess) - - return parsers - - - -def read(fin, alphabet=None) : - """ Read a sequence file and attempt to guess its format. - First the filename extension (if available) is used to infer the format. - If that fails, then we attempt to parse the file using several common - formats. - - returns : - SeqList - raises : - ValueError - If the file cannot be parsed. - ValueError - Sequence do not conform to the alphabet. - """ - - alphabet = Alphabet(alphabet) - parsers = _get_parsers(fin) - - for p in _get_parsers(fin) : - try: - return p.read(fin, alphabet) - except ValueError: - pass - fin.seek(0) # FIXME. Non seakable stdin? 
- - names = ", ".join([ p.names[0] for p in parsers]) - raise ValueError("Cannot parse sequence file: Tried %s " % names) - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/_nexus/Nodes.py --- a/corebio/seq_io/_nexus/Nodes.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,172 +0,0 @@ -# Copyright 2005 by Frank Kauff & Cymon J. Cox. All rights reserved. -# This code is part of the Biopython distribution and governed by its -# license. Please see the LICENSE file that should have been included -# as part of this package. -# -# Nodes.py -# -# Provides functionality of a linked list. -# Each node has one (or none) predecessor, and an arbitrary number of successors. -# Nodes can store arbitrary data in a NodeData class. -# -# Subclassed by Nexus.Trees to store phylogenetic trees. -# -# Bug reports to Frank Kauff (fkauff@duke.edu) -# - -class ChainException(Exception): - pass - -class NodeException(Exception): - pass - -class Chain: - """Stores a list of nodes that are linked together.""" - - def __init__(self): - """Initiates a node chain: (self).""" - self.chain={} - self.id=-1 - - def _get_id(self): - """Gets a new id for a node in the chain.""" - self.id+=1 - return self.id - - def all_ids(self): - """Return a list of all node ids.""" - return self.chain.keys() - - def add(self,node,prev=None): - """Attaches node to another: (self, node, prev).""" - if prev is not None and prev not in self.chain: - raise ChainException('Unknow predecessor: '+str(prev)) - else: - id=self._get_id() - node.set_id(id) - node.set_prev(prev) - if prev is not None: - self.chain[prev].add_succ(id) - self.chain[id]=node - return id - - def collapse(self,id): - """Deletes node from chain and relinks successors to predecessor: collapse(self, id).""" - if id not in self.chain: - raise ChainException('Unknown ID: '+str(id)) - prev_id=self.chain[id].get_prev() - self.chain[prev_id].remove_succ(id) - succ_ids=self.chain[id].get_succ() - for i in 
succ_ids: - self.chain[i].set_prev(prev_id) - self.chain[prev_id].add_succ(succ_ids) - node=self.chain[id] - self.kill(id) - return node - - def kill(self,id): - """Kills a node from chain without caring to what it is connected: kill(self,id).""" - if id not in self.chain: - raise ChainException('Unknown ID: '+str(id)) - else: - del self.chain[id] - - def unlink(self,id): - """Disconnects node from his predecessor: unlink(self,id).""" - if id not in self.chain: - raise ChainException('Unknown ID: '+str(id)) - else: - prev_id=self.chain[id].prev - if prev_id is not None: - self.chain[prev_id].succ.pop(self.chain[prev_id].succ.index(id)) - self.chain[id].prev=None - return prev_id - - def link(self, parent,child): - """Connects son to parent: link(self,son,parent).""" - if child not in self.chain: - raise ChainException('Unknown ID: '+str(child)) - elif parent not in self.chain: - raise ChainException('Unknown ID: '+str(parent)) - else: - self.unlink(child) - self.chain[parent].succ.append(child) - self.chain[child].set_prev(parent) - - def is_parent_of(self,parent,grandchild): - """Check if grandchild is a subnode of parent: is_parent_of(self,parent,grandchild).""" - if grandchild==parent or grandchild in self.chain[parent].get_succ(): - return True - else: - for sn in self.chain[parent].get_succ(): - if self.is_parent_of(sn,grandchild): - return True - else: - return False - - def trace(self,start,finish): - """Returns a list of all node_ids between two nodes (excluding start, including end): trace(start,end).""" - if start not in self.chain or finish not in self.chain: - raise NodeException('Unknown node.') - if not self.is_parent_of(start,finish) or start==finish: - return [] - for sn in self.chain[start].get_succ(): - if self.is_parent_of(sn,finish): - return [sn]+self.trace(sn,finish) - -class Node: - """A single node.""" - - def __init__(self,data=None): - """Represents a node with one predecessor and multiple successors: (self, data=None).""" - self.id=None - 
self.data=data - self.prev=None - self.succ=[] - - def set_id(self,id): - """Sets the id of a node, if not set yet: (self,id).""" - if self.id is not None: - raise NodeException, 'Node id cannot be changed.' - self.id=id - - def get_id(self): - """Returns the node's id: (self).""" - return self.id - - def get_succ(self): - """Returns a list of the node's successors: (self).""" - return self.succ - - def get_prev(self): - """Returns the id of the node's predecessor: (self).""" - return self.prev - - def add_succ(self,id): - """Adds a node id to the node's successors: (self,id).""" - if isinstance(id,type([])): - self.succ.extend(id) - else: - self.succ.append(id) - - def remove_succ(self,id): - """Removes a node id from the node's successors: (self,id).""" - self.succ.remove(id) - - def set_succ(self,new_succ): - """Sets the node's successors: (self,new_succ).""" - if not isinstance(new_succ,type([])): - raise NodeException, 'Node successor must be of list type.' - self.succ=new_succ - - def set_prev(self,id): - """Sets the node's predecessor: (self,id).""" - self.prev=id - - def get_data(self): - """Returns a node's data: (self).""" - return self.data - - def set_data(self,data): - """Sets a node's data: (self,data).""" - self.data=data diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/_nexus/Trees.py --- a/corebio/seq_io/_nexus/Trees.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,686 +0,0 @@ - -# -# Trees.py -# -# Copyright 2005 by Frank Kauff & Cymon J. Cox. All rights reserved. -# This code is part of the Biopython distribution and governed by its -# license. Please see the LICENSE file that should have been included -# as part of this package. -# -# Tree class handles phylogenetic trees. Provides a set of methods to read and write newick-format tree -# descriptions, get information about trees (monphyly of taxon sets, congruence between trees, common ancestors,...) 
-# and to manipulate trees (reroot trees, split terminal nodes). -# -# Bug reports welcome: fkauff@duke.edu -# - -import sys, random, sets -import Nodes - -PRECISION_BRANCHLENGTH=6 -PRECISION_SUPPORT=6 - -class TreeError(Exception): pass - -class NodeData: - """Stores tree-relevant data associated with nodes (e.g. branches or otus).""" - def __init__(self,taxon=None,branchlength=0.0,support=None): - self.taxon=taxon - self.branchlength=branchlength - self.support=support - -class Tree(Nodes.Chain): - """Represents a tree using a chain of nodes with on predecessor (=ancestor) - and multiple successors (=subclades). - """ - # A newick tree is parsed into nested list and then converted to a node list in two stages - # mostly due to historical reasons. This could be done in one swoop). Note: parentheses ( ) and - # colon : are not allowed in taxon names. This is against NEXUS standard, but makes life much - # easier when parsing trees. - - ## NOTE: Tree should store its data class in something like self.dataclass=data, - ## so that nodes that are generated have easy access to the data class - ## Some routines use automatically NodeData, this needs to be more concise - - def __init__(self,tree=None,weight=1.0,rooted=False,name='',data=NodeData,values_are_support=False,max_support=1.0): - """Ntree(self,tree).""" - Nodes.Chain.__init__(self) - self.dataclass=data - self.__values_are_support=values_are_support - self.max_support=max_support - self.weight=weight - self.rooted=rooted - self.name=name - root=Nodes.Node(data()) - self.add(root) - self.root=root.id - if tree: # use the tree we have - # if Tree is called from outside Nexus parser, we need to get rid of linebreaks, etc - tree=tree.strip().replace('\n','').replace('\r','') - # there's discrepancy whether newick allows semicolons et the end - tree=tree.rstrip(';') - self._add_subtree(parent_id=root.id,tree=self._parse(tree)[0]) - - def _parse(self,tree): - """Parses (a,b,c...)[[[xx]:]yy] into subcomponents and 
travels down recursively.""" - - if tree.count('(')!=tree.count(')'): - raise TreeError, 'Parentheses do not match in (sub)tree: '+tree - if tree.count('(')==0: # a leaf - colon=tree.rfind(':') - if colon>-1: - return [tree[:colon],self._get_values(tree[colon+1:])] - else: - return [tree,[None]] - else: - closing=tree.rfind(')') - val=self._get_values(tree[closing+1:]) - if not val: - val=[None] - subtrees=[] - plevel=0 - prev=1 - for p in range(1,closing): - if tree[p]=='(': - plevel+=1 - elif tree[p]==')': - plevel-=1 - elif tree[p]==',' and plevel==0: - subtrees.append(tree[prev:p]) - prev=p+1 - subtrees.append(tree[prev:closing]) - subclades=[self._parse(subtree) for subtree in subtrees] - return [subclades,val] - - def _add_subtree(self,parent_id=None,tree=None): - """Adds leaf or tree (in newick format) to a parent_id. (self,parent_id,tree).""" - - if parent_id is None: - raise TreeError('Need node_id to connect to.') - for st in tree: - if type(st[0])==list: # it's a subtree - nd=self.dataclass() - if len(st[1])>=2: # if there's two values, support comes first. Is that always so? - nd.support=st[1][0] - if st[1][1] is not None: - nd.branchlength=st[1][1] - elif len(st[1])==1: # otherwise it could be real branchlengths or support as branchlengths - if not self.__values_are_support: # default - if st[1][0] is not None: - nd.branchlength=st[1][0] - else: - nd.support=st[1][0] - sn=Nodes.Node(nd) - self.add(sn,parent_id) - self._add_subtree(sn.id,st[0]) - else: # it's a leaf - nd=self.dataclass() - nd.taxon=st[0] - if len(st)>1: - if len(st[1])>=2: # if there's two values, support comes first. Is that always so? 
- nd.support=st[1][0] - if st[1][1] is not None: - nd.branchlength=st[1][1] - elif len(st[1])==1: # otherwise it could be real branchlengths or support as branchlengths - if not self.__values_are_support: # default - if st[1][0] is not None: - nd.branchlength=st[1][0] - else: - nd.support=st[1][0] - leaf=Nodes.Node(nd) - self.add(leaf,parent_id) - - def _get_values(self, text): - """Extracts values (support/branchlength) from xx[:yyy], xx.""" - - if text=='': - return None - return [float(t) for t in text.split(':') if t.strip()] - - def _walk(self,node=None): - """Return all node_ids downwards from a node.""" - - if node is None: - node=self.root - for n in self.node(node).succ: - yield n - for sn in self._walk(n): - yield sn - - def node(self,node_id): - """Return the instance of node_id. - - node = node(self,node_id) - """ - if node_id not in self.chain: - raise TreeError('Unknown node_id: %d' % node_id) - return self.chain[node_id] - - def split(self,parent_id=None,n=2,branchlength=1.0): - """Speciation: generates n (default two) descendants of a node. - - [new ids] = split(self,parent_id=None,n=2,branchlength=1.0): - """ - if parent_id is None: - raise TreeError('Missing node_id.') - ids=[] - parent_data=self.chain[parent_id].data - for i in range(n): - node=Nodes.Node() - if parent_data: - node.data=self.dataclass() - # each node has taxon and branchlength attribute - if parent_data.taxon: - node.data.taxon=parent_data.taxon+str(i) - node.data.branchlength=branchlength - ids.append(self.add(node,parent_id)) - return ids - - def search_taxon(self,taxon): - """Returns the first matching taxon in self.data.taxon. Not restricted to terminal nodes. - - node_id = search_taxon(self,taxon) - """ - for id,node in self.chain.items(): - if node.data.taxon==taxon: - return id - return None - - def prune(self,taxon): - """Prunes a terminal taxon from the tree. 
- - id_of_previous_node = prune(self,taxon) - If taxon is from a bifurcation, the connectiong node will be collapsed - and its branchlength added to remaining terminal node. This might be no - longer a meaningful value' - """ - - id=self.search_taxon(taxon) - if id is None: - raise TreeError('Taxon not found: %s' % taxon) - elif id not in self.get_terminals(): - raise TreeError('Not a terminal taxon: %s' % taxon) - else: - prev=self.unlink(id) - self.kill(id) - if not prev==self.root and len(self.node(prev).succ)==1: - succ=self.node(prev).succ[0] - new_bl=self.node(prev).data.branchlength+self.node(succ).data.branchlength - self.collapse(prev) - self.node(succ).data.branchlength=new_bl - return prev - - def get_taxa(self,node_id=None): - """Return a list of all otus downwards from a node (self, node_id). - - nodes = get_taxa(self,node_id=None) - """ - - if node_id is None: - node_id=self.root - if node_id not in self.chain: - raise TreeError('Unknown node_id: %d.' % node_id) - if self.chain[node_id].succ==[]: - if self.chain[node_id].data: - return [self.chain[node_id].data.taxon] - else: - return None - else: - list=[] - for succ in self.chain[node_id].succ: - list.extend(self.get_taxa(succ)) - return list - - def get_terminals(self): - """Return a list of all terminal nodes.""" - return [i for i in self.all_ids() if self.node(i).succ==[]] - - def sum_branchlength(self,root=None,node=None): - """Adds up the branchlengths from root (default self.root) to node. - - sum = sum_branchlength(self,root=None,node=None) - """ - - if root is None: - root=self.root - if node is None: - raise TreeError('Missing node id.') - blen=0.0 - while node is not None and node is not root: - blen+=self.node(node).data.branchlength - node=self.node(node).prev - return blen - - def set_subtree(self,node): - """Return subtree as a set of nested sets. 
- - sets = set_subtree(self,node) - """ - - if self.node(node).succ==[]: - return self.node(node).data.taxon - else: - return sets.Set([self.set_subtree(n) for n in self.node(node).succ]) - - def is_identical(self,tree2): - """Compare tree and tree2 for identity. - - result = is_identical(self,tree2) - """ - return self.set_subtree(self.root)==tree2.set_subtree(tree2.root) - - def is_compatible(self,tree2,threshold,strict=True): - """Compares branches with support>threshold for compatibility. - - result = is_compatible(self,tree2,threshold) - """ - - # check if both trees have the same set of taxa. strict=True enforces this. - missing2=sets.Set(self.get_taxa())-sets.Set(tree2.get_taxa()) - missing1=sets.Set(tree2.get_taxa())-sets.Set(self.get_taxa()) - if strict and (missing1 or missing2): - if missing1: - print 'Taxon/taxa %s is/are missing in tree %s' % (','.join(missing1) , self.name) - if missing2: - print 'Taxon/taxa %s is/are missing in tree %s' % (','.join(missing2) , tree2.name) - raise TreeError, 'Can\'t compare trees with different taxon compositions.' - t1=[(sets.Set(self.get_taxa(n)),self.node(n).data.support) for n in self.all_ids() if \ - self.node(n).succ and\ - (self.node(n).data and self.node(n).data.support and self.node(n).data.support>=threshold)] - t2=[(sets.Set(tree2.get_taxa(n)),tree2.node(n).data.support) for n in tree2.all_ids() if \ - tree2.node(n).succ and\ - (tree2.node(n).data and tree2.node(n).data.support and tree2.node(n).data.support>=threshold)] - conflict=[] - for (st1,sup1) in t1: - for (st2,sup2) in t2: - if not st1.issubset(st2) and not st2.issubset(st1): # don't hiccup on upstream nodes - intersect,notin1,notin2=st1 & st2, st2-st1, st1-st2 # all three are non-empty sets - # if notin1==missing1 or notin2==missing2 <==> st1.issubset(st2) or st2.issubset(st1) ??? 
- if intersect and not (notin1.issubset(missing1) or notin2.issubset(missing2)): # omit conflicts due to missing taxa - conflict.append((st1,sup1,st2,sup2,intersect,notin1,notin2)) - return conflict - - def common_ancestor(self,node1,node2): - """Return the common ancestor that connects to nodes. - - node_id = common_ancestor(self,node1,node2) - """ - - l1=[self.root]+self.trace(self.root,node1) - l2=[self.root]+self.trace(self.root,node2) - return [n for n in l1 if n in l2][-1] - - - def distance(self,node1,node2): - """Add and return the sum of the branchlengths between two nodes. - dist = distance(self,node1,node2) - """ - - ca=self.common_ancestor(node1,node2) - return self.sum_branchlength(ca,node1)+self.sum_branchlength(ca,node2) - - def is_monophyletic(self,taxon_list): - """Return node_id of common ancestor if taxon_list is monophyletic, -1 otherwise. - - result = is_monophyletic(self,taxon_list) - """ - if isinstance(taxon_list,str): - taxon_set=sets.Set([taxon_list]) - else: - taxon_set=sets.Set(taxon_list) - node_id=self.root - while 1: - subclade_taxa=sets.Set(self.get_taxa(node_id)) - if subclade_taxa==taxon_set: # are we there? 
- return node_id - else: # check subnodes - for subnode in self.chain[node_id].succ: - if sets.Set(self.get_taxa(subnode)).issuperset(taxon_set): # taxon_set is downstream - node_id=subnode - break # out of for loop - else: - return -1 # taxon set was not with successors, for loop exhausted - - def is_bifurcating(self,node=None): - """Return True if tree downstream of node is strictly bifurcating.""" - if not node: - node=self.root - if node==self.root and len(self.node(node).succ)==3: #root can be trifurcating, because it has no ancestor - return self.is_bifurcating(self.node(node).succ[0]) and \ - self.is_bifurcating(self.node(node).succ[1]) and \ - self.is_bifurcating(self.node(node).succ[2]) - if len(self.node(node).succ)==2: - return self.is_bifurcating(self.node(node).succ[0]) and self.is_bifurcating(self.node(node).succ[1]) - elif len(self.node(node).succ)==0: - return True - else: - return False - - - - def branchlength2support(self): - """Move values stored in data.branchlength to data.support, and set branchlength to 0.0 - - This is necessary when support has been stored as branchlength (e.g. paup), and has thus - been read in as branchlength. - """ - - for n in self.chain.keys(): - self.node(n).data.support=self.node(n).data.branchlength - self.node(n).data.branchlength=0.0 - - def convert_absolute_support(self,nrep): - """Convert absolute support (clade-count) to rel. frequencies. - - Some software (e.g. PHYLIP consense) just calculate how often clades appear, instead of - calculating relative frequencies.""" - - for n in self._walk(): - if self.node(n).data.support: - self.node(n).data.support/=float(nrep) - - def randomize(self,ntax=None,taxon_list=None,branchlength=1.0,branchlength_sd=None,bifurcate=True): - """Generates a random tree with ntax taxa and/or taxa from taxlabels. - - new_tree = randomize(self,ntax=None,taxon_list=None,branchlength=1.0,branchlength_sd=None,bifurcate=True) - Trees are bifurcating by default. 
(Polytomies not yet supported). - """ - - if not ntax and taxon_list: - ntax=len(taxon_list) - elif not taxon_list and ntax: - taxon_list=['taxon'+str(i+1) for i in range(ntax)] - elif not ntax and not taxon_list: - raise TreeError('Either numer of taxa or list of taxa must be specified.') - elif ntax<>len(taxon_list): - raise TreeError('Length of taxon list must correspond to ntax.') - # initiate self with empty root - self.__init__() - terminals=self.get_terminals() - # bifurcate randomly at terminal nodes until ntax is reached - while len(terminals)1: - raise TreeError, 'Isolated nodes in tree description: %s' % ','.join(oldroot) - elif len(oldroot)==1: - self.kill(oldroot[0]) - return self.root - - -def consensus(trees, threshold=0.5,outgroup=None): - """Compute a majority rule consensus tree of all clades with relative frequency>=threshold from a list of trees.""" - - total=len(trees) - if total==0: - return None - # shouldn't we make sure that it's NodeData or subclass?? - dataclass=trees[0].dataclass - max_support=trees[0].max_support - clades={} - #countclades={} - alltaxa=sets.Set(trees[0].get_taxa()) - # calculate calde frequencies - c=0 - for t in trees: - c+=1 - #if c%50==0: - # print c - if alltaxa!=sets.Set(t.get_taxa()): - raise TreeError, 'Trees for consensus must contain the same taxa' - t.root_with_outgroup(outgroup=outgroup) - for st_node in t._walk(t.root): - subclade_taxa=t.get_taxa(st_node) - subclade_taxa.sort() - subclade_taxa=str(subclade_taxa) # lists are not hashable - if subclade_taxa in clades: - clades[subclade_taxa]+=float(t.weight)/total - else: - clades[subclade_taxa]=float(t.weight)/total - #if subclade_taxa in countclades: - # countclades[subclade_taxa]+=t.weight - #else: - # countclades[subclade_taxa]=t.weight - # weed out clades below threshold - for (c,p) in clades.items(): - if p' -MRBAYESSAFE='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890_' -WHITESPACE=' \t\n' -#SPECIALCOMMENTS=['!','&','%','/','\\','@'] 
#original list of special comments -SPECIALCOMMENTS=['&'] # supported special comment ('tree' command), all others are ignored -CHARSET='chars' -TAXSET='taxa' - -class NexusError(Exception): pass - -class CharBuffer: - """Helps reading NEXUS-words and characters from a buffer.""" - def __init__(self,string): - if string: - self.buffer=list(string) - else: - self.buffer=[] - - def peek(self): - if self.buffer: - return self.buffer[0] - else: - return None - - def peek_nonwhitespace(self): - b=''.join(self.buffer).strip() - if b: - return b[0] - else: - return None - - def next(self): - if self.buffer: - return self.buffer.pop(0) - else: - return None - - def next_nonwhitespace(self): - while True: - p=self.next() - if p is None: - break - if p not in WHITESPACE: - return p - return None - - def skip_whitespace(self): - while self.buffer[0] in WHITESPACE: - self.buffer=self.buffer[1:] - - def next_until(self,target): - for t in target: - try: - pos=self.buffer.index(t) - except ValueError: - pass - else: - found=''.join(self.buffer[:pos]) - self.buffer=self.buffer[pos:] - return found - else: - return None - - def peek_word(self,word): - return ''.join(self.buffer[:len(word)])==word - - def next_word(self): - """Return the next NEXUS word from a string, dealing with single and double quotes, - whitespace and punctuation. - """ - - word=[] - quoted=False - first=self.next_nonwhitespace() # get first character - if not first: # return empty if only whitespace left - return None - word.append(first) - if first=="'": # word starts with a quote - quoted=True - elif first in PUNCTUATION: # if it's punctuation, return immediately - return first - while True: - c=self.peek() - if c=="'": # a quote? 
- word.append(self.next()) # store quote - if self.peek()=="'": # double quote - skip=self.next() # skip second quote - elif quoted: # second single quote ends word - break - elif quoted: - word.append(self.next()) # if quoted, then add anything - elif not c or c in PUNCTUATION or c in WHITESPACE: # if not quoted and special character, stop - break - else: - word.append(self.next()) # standard character - return ''.join(word) - - def rest(self): - """Return the rest of the string without parsing.""" - return ''.join(self.buffer) - -class StepMatrix: - """Calculate a stepmatrix for weighted parsimony. - See Wheeler (1990), Cladistics 6:269-275. - """ - - def __init__(self,symbols,gap): - self.data={} - self.symbols=[s for s in symbols] - self.symbols.sort() - if gap: - self.symbols.append(gap) - for x in self.symbols: - for y in [s for s in self.symbols if s!=x]: - self.set(x,y,0) - - def set(self,x,y,value): - if x>y: - x,y=y,x - self.data[x+y]=value - - def add(self,x,y,value): - if x>y: - x,y=y,x - self.data[x+y]+=value - - def sum(self): - return reduce(lambda x,y:x+y,self.data.values()) - - def transformation(self): - total=self.sum() - if total!=0: - for k in self.data: - self.data[k]=self.data[k]/float(total) - return self - - def weighting(self): - for k in self.data: - if self.data[k]!=0: - self.data[k]=-math.log(self.data[k]) - return self - - def smprint(self,name='your_name_here'): - matrix='usertype %s stepmatrix=%d\n' % (name,len(self.symbols)) - matrix+=' %s\n' % ' '.join(self.symbols) - for x in self.symbols: - matrix+='[%s]'.ljust(8) % x - for y in self.symbols: - if x==y: - matrix+=' . ' - else: - if x>y: - x1,y1=y,x - else: - x1,y1=x,y - if self.data[x1+y1]==0: - matrix+='inf. ' - else: - matrix+='%2.2f'.ljust(10) % (self.data[x1+y1]) - matrix+='\n' - matrix+=';\n' - return matrix - -def safename(name,mrbayes=False): - """Return a taxon identifier according to NEXUS standard. 
- Wrap quotes around names with punctuation or whitespace, and double single quotes. - mrbayes=True: write names without quotes, whitespace or punctuation for mrbayes. - """ - if mrbayes: - safe=name.replace(' ','_') - safe=''.join([c for c in safe if c in MRBAYESSAFE]) - else: - safe=name.replace("'","''") - if sets.Set(safe).intersection(sets.Set(WHITESPACE+PUNCTUATION)): - safe="'"+safe+"'" - return safe - -def quotestrip(word): - """Remove quotes and/or double quotes around identifiers.""" - if not word: - return None - while (word.startswith("'") and word.endswith("'")) or (word.startswith('"') and word.endswith('"')): - word=word[1:-1] - return word - -def get_start_end(sequence, skiplist=['-','?']): - """Return position of first and last character which is not in skiplist (defaults to ['-','?']).""" - - length=len(sequence) - if length==0: - return None,None - end=length-1 - while end>=0 and (sequence[end] in skiplist): - end-=1 - start=0 - while start1: - step=1 - for i,x in enumerate(clist): - if x==clist[0]+i*step: # are we still in the right step? - continue - elif i==1 and len(clist)>3 and clist[i+1]-x==x-clist[0]: - # second element, and possibly at least 3 elements to link, - # and the next one is in the right step - step=x-clist[0] - else: # pattern broke, add all values before current position to new list - sub=clist[:i] - if len(sub)==1: - shortlist.append(str(sub[0]+1)) - else: - if step==1: - shortlist.append('%d-%d' % (sub[0]+1,sub[-1]+1)) - else: - shortlist.append('%d-%d\\%d' % (sub[0]+1,sub[-1]+1,step)) - clist=clist[i:] - break - return ' '.join(shortlist) - -def combine(matrices): - """Combine matrices in [(name,nexus-instance),...] and return new nexus instance. - - combined_matrix=combine([(name1,nexus_instance1),(name2,nexus_instance2),...] - Character sets, character partitions and taxon sets are prefixed, readjusted and present in - the combined matrix. 
- """ - - if not matrices: - return None - name=matrices[0][0] - combined=copy.deepcopy(matrices[0][1]) # initiate with copy of first matrix - mixed_datatypes=(len(sets.Set([n[1].datatype for n in matrices]))>1) - if mixed_datatypes: - combined.datatype='None' # dealing with mixed matrices is application specific. You take care of that yourself! - # raise NexusError, 'Matrices must be of same datatype' - combined.charlabels=None - combined.statelabels=None - combined.interleave=False - combined.translate=None - - # rename taxon sets and character sets and name them with prefix - for cn,cs in combined.charsets.items(): - combined.charsets['%s.%s' % (name,cn)]=cs - del combined.charsets[cn] - for tn,ts in combined.taxsets.items(): - combined.taxsets['%s.%s' % (name,tn)]=ts - del combined.taxsets[tn] - # previous partitions usually don't make much sense in combined matrix - # just initiate one new partition parted by single matrices - combined.charpartitions={'combined':{name:range(combined.nchar)}} - for n,m in matrices[1:]: # add all other matrices - both=[t for t in combined.taxlabels if t in m.taxlabels] - combined_only=[t for t in combined.taxlabels if t not in both] - m_only=[t for t in m.taxlabels if t not in both] - for t in both: - # concatenate sequences and unify gap and missing character symbols - combined.matrix[t]+=Seq(m.matrix[t].tostring().replace(m.gap,combined.gap).replace(m.missing,combined.missing),combined.alphabet) - # replace date of missing taxa with symbol for missing data - for t in combined_only: - combined.matrix[t]+=Seq(combined.missing*m.nchar,combined.alphabet) - for t in m_only: - combined.matrix[t]=Seq(combined.missing*combined.nchar,combined.alphabet)+\ - Seq(m.matrix[t].tostring().replace(m.gap,combined.gap).replace(m.missing,combined.missing),combined.alphabet) - combined.taxlabels.extend(m_only) # new taxon list - for cn,cs in m.charsets.items(): # adjust character sets for new matrix - combined.charsets['%s.%s' % 
(n,cn)]=[x+combined.nchar for x in cs] - if m.taxsets: - if not combined.taxsets: - combined.taxsets={} - combined.taxsets.update(dict([('%s.%s' % (n,tn),ts) for tn,ts in m.taxsets.items()])) # update taxon sets - combined.charpartitions['combined'][n]=range(combined.nchar,combined.nchar+m.nchar) # update new charpartition - # update charlabels - if m.charlabels: - if not combined.charlabels: - combined.charlabels={} - combined.charlabels.update(dict([(combined.nchar+i,label) for (i,label) in m.charlabels.items()])) - combined.nchar+=m.nchar # update nchar and ntax - combined.ntax+=len(m_only) - - return combined - -def _kill_comments_and_break_lines(text): - """Delete []-delimited comments out of a file and break into lines separated by ';'. - - stripped_text=_kill_comments_and_break_lines(text): - Nested and multiline comments are allowed. [ and ] symbols within single - or double quotes are ignored, newline ends a quote, all symbols with quotes are - treated the same (thus not quoting inside comments like [this character ']' ends a comment]) - Special [&...] and [\...] comments remain untouched, if not inside standard comment. - Quotes inside special [& and [\ are treated as normal characters, - but no nesting inside these special comments allowed (like [& [\ ]]). - ';' ist deleted from end of line. 
- - NOTE: this function is very slow for large files, and obsolete when using C extension cnexus - """ - contents=CharBuffer(text) - newtext=[] - newline=[] - quotelevel='' - speciallevel=False - commlevel=0 - while True: - #plain=contents.next_until(["'",'"','[',']','\n',';']) # search for next special character - #if not plain: - # newline.append(contents.rest) # not found, just add the rest - # break - #newline.append(plain) # add intermediate text - t=contents.next() # and get special character - if t is None: - break - if t==quotelevel and not (commlevel or speciallevel): # matching quote ends quotation - quotelevel='' - elif not quotelevel and not (commlevel or speciallevel) and (t=='"' or t=="'"): # single or double quote starts quotation - quotelevel=t - elif not quotelevel and t=='[': # opening bracket outside a quote - if contents.peek() in SPECIALCOMMENTS and commlevel==0 and not speciallevel: - speciallevel=True - else: - commlevel+=1 - elif not quotelevel and t==']': # closing bracket ioutside a quote - if speciallevel: - speciallevel=False - else: - commlevel-=1 - if commlevel<0: - raise NexusError, 'Nexus formatting error: unmatched ]' - continue - if commlevel==0: # copy if we're not in comment - if t==';' and not quotelevel: - newtext.append(''.join(newline)) - newline=[] - else: - newline.append(t) - #level of comments should be 0 at the end of the file - if newline: - newtext.append('\n'.join(newline)) - if commlevel>0: - raise NexusError, 'Nexus formatting error: unmatched [' - return newtext - - -def _adjust_lines(lines): - """Adjust linebreaks to match ';', strip leading/trailing whitespace - - list_of_commandlines=_adjust_lines(input_text) - Lines are adjusted so that no linebreaks occur within a commandline - (except matrix command line) - """ - formatted_lines=[] - for l in lines: - #Convert line endings - l=l.replace('\r\n','\n').replace('\r','\n').strip() - if l.lower().startswith('matrix'): - formatted_lines.append(l) - else: - 
l=l.replace('\n',' ') - if l: - formatted_lines.append(l) - return formatted_lines - -def _replace_parenthesized_ambigs(seq,rev_ambig_values): - """Replaces ambigs in xxx(ACG)xxx format by IUPAC ambiguity code.""" - - opening=seq.find('(') - while opening>-1: - closing=seq.find(')') - if closing<0: - raise NexusError, 'Missing closing parenthesis in: '+seq - elif closing 0: - try: - options = options.replace('=', ' = ').split() - valued_indices=[(n-1,n,n+1) for n in range(len(options)) if options[n]=='=' and n!=0 and n!=len((options))] - indices = [] - for sl in valued_indices: - indices.extend(sl) - token_indices = [n for n in range(len(options)) if n not in indices] - for opt in valued_indices: - #self.options[options[opt[0]].lower()] = options[opt[2]].lower() - self.options[options[opt[0]].lower()] = options[opt[2]] - for token in token_indices: - self.options[options[token].lower()] = None - except ValueError: - raise NexusError, 'Incorrect formatting in line: %s' % line - -class Block: - """Represent a NEXUS block with block name and list of commandlines .""" - def __init__(self,title=None): - self.title=title - self.commandlines=[] - -class Nexus(object): - - __slots__=['original_taxon_order','__dict__'] - - def __init__(self, input=None): - self.ntax=0 # number of taxa - self.nchar=0 # number of characters - self.taxlabels=[] # labels for taxa, ordered by their id - self.charlabels=None # ... and for characters - self.statelabels=None # ... and for states - self.datatype='dna' # (standard), dna, rna, nucleotide, protein - self.respectcase=False # case sensitivity - self.missing='?' 
# symbol for missing characters - self.gap='-' # symbol for gap - self.symbols=None # set of symbols - self.equate=None # set of symbol synonyms - self.matchchar=None # matching char for matrix representation - self.labels=None # left, right, no - self.transpose=False # whether matrix is transposed - self.interleave=False # whether matrix is interleaved - self.tokens=False # unsupported - self.eliminate=None # unsupported - self.matrix=None # ... - self.unknown_blocks=[] # blocks we don't care about - self.taxsets={} - self.charsets={} - self.charpartitions={} - self.taxpartitions={} - self.trees=[] # list of Trees (instances of tree class) - self.translate=None # Dict to translate taxon <-> taxon numbers - self.structured=[] # structured input representation - self.set={} # dict of the set command to set various options - self.options={} # dict of the options command in the data block - - # some defaults - self.options['gapmode']='missing' - - if input: - self.read(input) - - def get_original_taxon_order(self): - """Included for backwards compatibility.""" - return self.taxlabels - def set_original_taxon_order(self,value): - """Included for backwards compatibility.""" - self.taxlabels=value - original_taxon_order=property(get_original_taxon_order,set_original_taxon_order) - - def read(self,input): - """Read and parse NEXUS input (filename, file handle, or string).""" - - # 1.
Assume we have the name of a file in the execution dir - # Note we need to add parsing of the path to dir/filename - try: - file_contents = open(os.path.expanduser(input),'rU').read() - self.filename=input - except (TypeError,IOError,AttributeError): - #2 Assume we have a string from a fh.read() - #if isinstance(input, str): - # file_contents = input - # self.filename='input_string' - #3 Assume we have a file object - if hasattr(input,'read'): # file objects or StringIO objects - file_contents=input.read() - # GEC : Change next line so that StringIO objects work - #if input.name: - if hasattr(input, 'name'): - self.filename=input.name - else: - self.filename='Unknown_nexus_file' - else: - print input.strip()[:6] - raise NexusError, 'Unrecognized input: %s ...' % input[:100] - if C: - decommented=cnexus.scanfile(file_contents) - #check for unmatched brackets - if decommented=='[' or decommented==']': - raise NexusError, 'Unmatched %s' % decommented - # cnexus can't return lists, so in analogy we separate commandlines with chr(7) - # (a character that shouldn't be part of a nexus file under normal circumstances) - commandlines=_adjust_lines(decommented.split(chr(7))) - else: - commandlines=_adjust_lines(_kill_comments_and_break_lines(file_contents)) - # strip the leading '#NEXUS' token - try: - if commandlines[0][:6].upper()=='#NEXUS': - commandlines[0]=commandlines[0][6:].strip() - except: - pass - # now loop through blocks (we parse only data in known blocks, thus ignoring non-block commands) - nexus_block_gen = self._get_nexus_block(commandlines) - while 1: - try: - title, contents = nexus_block_gen.next() - except StopIteration: - break - if title in KNOWN_NEXUS_BLOCKS: - self._parse_nexus_block(title, contents) - else: - self._unknown_nexus_block(title, contents) - - def _get_nexus_block(self,file_contents): - """Generator for looping through Nexus blocks.""" - inblock=False - blocklines=[] - while file_contents: - cl=file_contents.pop(0) - if
cl.lower().startswith('begin'): - if not inblock: - inblock=True - title=cl.split()[1].lower() - else: - raise NexusError('Illegal block nesting in block %s' % title) - elif cl.lower().startswith('end'): - if inblock: - inblock=False - yield title,blocklines - blocklines=[] - else: - raise NexusError('Unmatched \'end\'.') - elif inblock: - blocklines.append(cl) - - def _unknown_nexus_block(self,title, contents): - block = Block() - block.commandlines.append(contents) - block.title = title - self.unknown_blocks.append(block) - - def _parse_nexus_block(self,title, contents): - """Parse a known Nexus Block """ - # attach the structured block representation - self._apply_block_structure(title, contents) - #now check for taxa,characters,data blocks. If this stuff is defined more than once - #the later occurrences will override the previous ones. - block=self.structured[-1] - for line in block.commandlines: - try: - getattr(self,'_'+line.command)(line.options) - except AttributeError: - raise - raise NexusError, 'Unknown command: %s ' % line.command - - def _dimensions(self,options): - if options.has_key('ntax'): - self.ntax=eval(options['ntax']) - if options.has_key('nchar'): - self.nchar=eval(options['nchar']) - - def _format(self,options): - # print options - # we first need to test respectcase, then symbols (which depends on respectcase) - # then datatype (which, if standard, depends on symbols and respectcase in order to generate - # dicts for ambiguous values and alphabet - if options.has_key('respectcase'): - self.respectcase=True - # adjust symbols for respectcase - if options.has_key('symbols'): - self.symbols=options['symbols'] - if (self.symbols.startswith('"') and self.symbols.endswith('"')) or\ - (self.symbols.startswith("'") and self.symbols.endswith("'")): - self.symbols=self.symbols[1:-1].replace(' ','') - if not self.respectcase: - self.symbols=self.symbols.lower()+self.symbols.upper() - self.symbols=list(sets.Set(self.symbols)) - if
options.has_key('datatype'): - self.datatype=options['datatype'].lower() - if self.datatype=='dna' or self.datatype=='nucleotide': - self.alphabet=IUPAC.ambiguous_dna - self.ambiguous_values=IUPACData.ambiguous_dna_values - self.unambiguous_letters=IUPACData.unambiguous_dna_letters - elif self.datatype=='rna': - self.alphabet=IUPAC.ambiguous_rna - self.ambiguous_values=IUPACData.ambiguous_rna_values - self.unambiguous_letters=IUPACData.unambiguous_rna_letters - elif self.datatype=='protein': - self.alphabet=IUPAC.protein - self.ambiguous_values={'B':'DN','Z':'EQ','X':IUPACData.protein_letters} # that's how PAUP handles it - self.unambiguous_letters=IUPACData.protein_letters+'*' # stop-codon - elif self.datatype=='standard': - raise NexusError('Datatype standard is not yet supported.') - #self.alphabet=None - #self.ambiguous_values={} - #if not self.symbols: - # self.symbols='01' # if nothing else defined, then 0 and 1 are the default states - #self.unambiguous_letters=self.symbols - else: - raise NexusError, 'Unsupported datatype: '+self.datatype - self.valid_characters=''.join(self.ambiguous_values.keys())+self.unambiguous_letters - if not self.respectcase: - self.valid_characters=self.valid_characters.lower()+self.valid_characters.upper() - #we have to sort the reverse ambig coding dict key characters: - #to be sure that it's 'ACGT':'N' and not 'GTCA':'N' - rev=dict([(i[1],i[0]) for i in self.ambiguous_values.items() if i[0]!='X']) - self.rev_ambiguous_values={} - for (k,v) in rev.items(): - key=[c for c in k] - key.sort() - self.rev_ambiguous_values[''.join(key)]=v - #overwrite symbols for datype rna,dna,nucleotide - if self.datatype in ['dna','rna','nucleotide']: - self.symbols=self.alphabet.letters - if self.missing not in self.ambiguous_values: - self.ambiguous_values[self.missing]=self.unambiguous_letters+self.gap - self.ambiguous_values[self.gap]=self.gap - elif self.datatype=='standard': - if not self.symbols: - self.symbols=['1','0'] - if 
options.has_key('missing'): - self.missing=options['missing'][0] - if options.has_key('gap'): - self.gap=options['gap'][0] - if options.has_key('equate'): - self.equate=options['equate'] - if options.has_key('matchchar'): - self.matchchar=options['matchchar'][0] - if options.has_key('labels'): - self.labels=options['labels'] - if options.has_key('transpose'): - raise NexusError, 'TRANSPOSE is not supported!' - self.transpose=True - if options.has_key('interleave'): - if options['interleave']==None or options['interleave'].lower()=='yes': - self.interleave=True - if options.has_key('tokens'): - self.tokens=True - if options.has_key('notokens'): - self.tokens=False - - - def _set(self,options): - self.set=options; - - def _options(self,options): - self.options=options; - - def _eliminate(self,options): - self.eliminate=options - - def _taxlabels(self,options): - """Get taxon labels.""" - self.taxlabels=[] - opts=CharBuffer(options) - while True: - taxon=quotestrip(opts.next_word()) - if not taxon: - break - self.taxlabels.append(taxon) - - def _check_taxlabels(self,taxon): - """Check for presence of taxon in self.taxlabels.""" - # According to NEXUS standard, underscores shall be treated as spaces..., - # so checking for identity is more difficult - nextaxa=dict([(t.replace(' ','_'),t) for t in self.taxlabels]) - nexid=taxon.replace(' ','_') - return nextaxa.get(nexid) - - def _charlabels(self,options): - self.charlabels={} - opts=CharBuffer(options) - while True: - try: - # get id and state - w=opts.next_word() - if w is None: # McClade saves and reads charlabel-lists with terminal comma?! - break - identifier=self._resolve(w,set_type=CHARSET) - state=quotestrip(opts.next_word()) - self.charlabels[identifier]=state - # check for comma or end of command - c=opts.next_nonwhitespace() - if c is None: - break - elif c!=',': - raise NexusError,'Missing \',\' in line %s.' % options - except NexusError: - raise - except: - raise NexusError,'Format error in line %s.' 
% options - - def _charstatelabels(self,options): - # warning: charstatelabels supports only charlabels-syntax! - self._charlabels(options) - - def _statelabels(self,options): - #self.charlabels=options - #print 'Command statelabels is not supported and will be ignored.' - pass - - def _matrix(self,options): - if not self.ntax or not self.nchar: - raise NexusError,'Dimensions must be specified before matrix!' - taxlabels_present=(self.taxlabels!=[]) - self.matrix={} - taxcount=0 - block_interleave=0 - #eliminate empty lines and leading/trailing whitespace - lines=[l.strip() for l in options.split('\n') if l.strip()<>''] - lineiter=iter(lines) - while 1: - try: - l=lineiter.next() - except StopIteration: - if taxcountself.ntax: - raise NexusError, 'Too many taxa in matrix.' - else: - break - # count the taxa and check for interleaved matrix - taxcount+=1 - ##print taxcount - if taxcount>self.ntax: - if not self.interleave: - raise NexusError, 'Too many taxa in matrix - should matrix be interleaved?' - else: - taxcount=1 - block_interleave=1 - #get taxon name and sequence - linechars=CharBuffer(l) - id=quotestrip(linechars.next_word()) - l=linechars.rest().strip() - if taxlabels_present and not self._check_taxlabels(id): - raise NexusError,'Taxon '+id+' not found in taxlabels.' 
- chars='' - if self.interleave: - #interleaved matrix - #print 'In interleave' - if l: - chars=''.join(l.split()) - else: - chars=''.join(lineiter.next().split()) - else: - #non-interleaved matrix - chars=''.join(l.split()) - while len(chars) [0,1,2,3,4,'dog','cat',10,13,16,19] - """ - opts=CharBuffer(options) - name=self._name_n_vector(opts,separator=separator) - indices=self._parse_list(opts,set_type=set_type) - if indices is None: - raise NexusError, 'Formatting error in line: %s ' % options - return name,indices - - def _name_n_vector(self,opts,separator='='): - """Extract name and check that it's not in vector format.""" - rest=opts.rest() - name=opts.next_word() - if not name: - raise NexusError, 'Formatting error in line: %s ' % rest - name=quotestrip(name) - if opts.peek_nonwhitespace()=='(': - open=opts.next_nonwhitespace() - qualifier=opts.next_word() - close=opts.next_nonwhitespace() - if qualifier.lower()=='vector': - raise NexusError, 'Unsupported VECTOR format in line %s' % rest - elif qualifier.lower()!='standard': - raise NexusError, 'Unknown qualifier %s in line %s' % (qualifier,rest) - if opts.next_nonwhitespace()!=separator: - raise NexusError, 'Formatting error in line: %s ' % rest - return name - - def _parse_list(self,options_buffer,set_type): - """Parse a NEXUS list: [1, 2, 4-8\\2, dog, cat] --> [1,2,4,6,8,17,21], - (assuming dog is taxon no. 17 and cat is taxon no. 21). - """ - plain_list=[] - if options_buffer.peek_nonwhitespace(): - try: # capture all possible exceptions and treat them as formatting errors, if they are not NexusError - while True: - identifier=options_buffer.next_word() # next list element - if not identifier: # end of list?
- break - start=self._resolve(identifier,set_type=set_type) - if options_buffer.peek_nonwhitespace()=='-': # followed by - - end=start - step=1 - # get hyphen and end of range - hyphen=options_buffer.next_nonwhitespace() - end=self._resolve(options_buffer.next_word(),set_type=set_type) - if set_type==CHARSET: - if options_buffer.peek_nonwhitespace()=='\\': # followed by \ - backslash=options_buffer.next_nonwhitespace() - step=int(options_buffer.next_word()) # get backslash and step - plain_list.extend(range(start,end+1,step)) - else: - if type(start)==list or type(end)==list: - raise NexusError, 'Set names are not allowed in range definitions: %s' % identifier - start=self.taxlabels.index(start) - end=self.taxlabels.index(end) - taxrange=self.taxlabels[start:end+1] - plain_list.extend(taxrange) - else: - if type(start)==list: # start was the name of charset or taxset - plain_list.extend(start) - else: # start was an ordinary identifier - plain_list.append(start) - except NexusError: - raise - except: - return None - return plain_list - - def _resolve(self,identifier,set_type=None): - """Translate identifier in list into character/taxon index. - Characters (which are referred to by their index in Nexus.py): - Plain numbers are returned minus 1 (Nexus indices to python indices) - Text identifiers are translated into their indices (if plain character identifiers), - the first hit in charlabels is returned (charlabels don't need to be unique) - or the range of indices is returned (if names of character sets). - Taxa (which are referred to by their unique name in Nexus.py): - Plain numbers are translated into their taxon name, underscores and spaces are considered equal.
- Names are returned unchanged (if plain taxon identifiers), or the names in - the corresponding taxon set are returned - """ - identifier=quotestrip(identifier) - if not set_type: - raise NexusError('INTERNAL ERROR: Need type to resolve identifier.') - if set_type==CHARSET: - try: - n=int(identifier) - except ValueError: - if self.charlabels and identifier in self.charlabels.values(): - for k in self.charlabels: - if self.charlabels[k]==identifier: - return k - elif self.charsets and identifier in self.charsets: - return self.charsets[identifier] - else: - raise NexusError, 'Unknown character identifier: %s' % identifier - else: - if n<=self.nchar: - return n-1 - else: - raise NexusError, 'Illegal character identifier: %d>nchar (=%d).' % (n,self.nchar) - elif set_type==TAXSET: - try: - n=int(identifier) - except ValueError: - taxlabels_id=self._check_taxlabels(identifier) - if taxlabels_id: - return taxlabels_id - elif self.taxsets and identifier in self.taxsets: - return self.taxsets[identifier] - else: - raise NexusError, 'Unknown taxon identifier: %s' % identifier - else: - if n>0 and n<=self.ntax: - return self.taxlabels[n-1] - else: - raise NexusError, 'Illegal taxon identifier: %d>ntax (=%d).' % (n,self.ntax) - else: - raise NexusError('Unknown set specification: %s.'% set_type) - - def _stateset(self, options): - #Not implemented - pass - - def _changeset(self, options): - #Not implemented - pass - - def _treeset(self, options): - #Not implemented - pass - - def _treepartition(self, options): - #Not implemented - pass - - def write_nexus_data_partitions(self, matrix=None, filename=None, blocksize=None, interleave=False, - exclude=[], delete=[], charpartition=None, comment='',mrbayes=False): - """Writes a nexus file for each partition in charpartition. - Only non-excluded characters and non-deleted taxa are included, just the data block is written.
- """ - - if not matrix: - matrix=self.matrix - if not matrix: - return - if not filename: - filename=self.filename - if charpartition: - pfilenames={} - for p in charpartition: - total_exclude=[]+exclude - total_exclude.extend([c for c in range(self.nchar) if c not in charpartition[p]]) - total_exclude=_make_unique(total_exclude) - pcomment=comment+'\nPartition: '+p+'\n' - dot=filename.rfind('.') - if dot>0: - pfilename=filename[:dot]+'_'+p+'.data' - else: - pfilename=filename+'_'+p - pfilenames[p]=pfilename - self.write_nexus_data(filename=pfilename,matrix=matrix,blocksize=blocksize, - interleave=interleave,exclude=total_exclude,delete=delete,comment=pcomment,append_sets=False, - mrbayes=mrbayes) - return pfilenames - else: - fn=self.filename+'.data' - self.write_nexus_data(filename=fn,matrix=matrix,blocksize=blocksize,interleave=interleave, - exclude=exclude,delete=delete,comment=comment,append_sets=False, - mrbayes=mrbayes) - return fn - - def write_nexus_data(self, filename=None, matrix=None, exclude=[], delete=[],\ - blocksize=None, interleave=False, interleave_by_partition=False,\ - comment=None,omit_NEXUS=False,append_sets=True,mrbayes=False): - """ Writes a nexus file with data and sets block. Character sets and partitions - are appended by default, and are adjusted according - to excluded characters (i.e. character sets still point to the same sites (not necessarily same positions), - without including the deleted characters. 
- """ - if not matrix: - matrix=self.matrix - if not matrix: - return - if not filename: - filename=self.filename - if [t for t in delete if not self._check_taxlabels(t)]: - raise NexusError, 'Unknwon taxa: %s' % ', '.join(sets.Set(delete).difference(sets.Set(self.taxlabels))) - if interleave_by_partition: - if not interleave_by_partition in self.charpartitions: - raise NexusError, 'Unknown partition: '+interleave_by_partition - else: - partition=self.charpartitions[interleave_by_partition] - # we need to sort the partition names by starting position before we exclude characters - names=_sort_keys_by_values(partition) - newpartition={} - for p in partition: - newpartition[p]=[c for c in partition[p] if c not in exclude] - # how many taxa and how many characters are left? - undelete=[taxon for taxon in self.taxlabels if taxon in matrix and taxon not in delete] - cropped_matrix=_seqmatrix2strmatrix(self.crop_matrix(matrix,exclude=exclude,delete=delete)) - ntax_adjusted=len(undelete) - nchar_adjusted=len(cropped_matrix[undelete[0]]) - if not undelete or (undelete and undelete[0]==''): - return - if isinstance(filename,str): - try: - fh=open(filename,'w') - except IOError: - raise NexusError, 'Could not open %s for writing.' 
% filename - elif isinstance(filename,file): - fh=filename - if not omit_NEXUS: - fh.write('#NEXUS\n') - if comment: - fh.write('['+comment+']\n') - fh.write('begin data;\n') - fh.write('\tdimensions ntax=%d nchar=%d;\n' % (ntax_adjusted, nchar_adjusted)) - fh.write('\tformat datatype='+self.datatype) - if self.respectcase: - fh.write(' respectcase') - if self.missing: - fh.write(' missing='+self.missing) - if self.gap: - fh.write(' gap='+self.gap) - if self.matchchar: - fh.write(' matchchar='+self.matchchar) - if self.labels: - fh.write(' labels='+self.labels) - if self.equate: - fh.write(' equate='+self.equate) - if interleave or interleave_by_partition: - fh.write(' interleave') - fh.write(';\n') - #if self.taxlabels: - # fh.write('taxlabels '+' '.join(self.taxlabels)+';\n') - if self.charlabels: - newcharlabels=self._adjust_charlabels(exclude=exclude) - clkeys=newcharlabels.keys() - clkeys.sort() - fh.write('charlabels '+', '.join(["%s %s" % (k+1,safename(newcharlabels[k])) for k in clkeys])+';\n') - fh.write('matrix\n') - if not blocksize: - if interleave: - blocksize=70 - else: - blocksize=self.nchar - # delete deleted taxa and ecxclude excluded characters... 
- namelength=max([len(safename(t,mrbayes=mrbayes)) for t in undelete]) - if interleave_by_partition: - # interleave by partitions, but adjust partitions with regard to excluded characters - seek=0 - for p in names: - fh.write('[%s: %s]\n' % (interleave_by_partition,p)) - if len(newpartition[p])>0: - for taxon in undelete: - fh.write(safename(taxon,mrbayes=mrbayes).ljust(namelength+1)) - fh.write(cropped_matrix[taxon][seek:seek+len(newpartition[p])]+'\n') - fh.write('\n') - else: - fh.write('[empty]\n\n') - seek+=len(newpartition[p]) - elif interleave: - for seek in range(0,nchar_adjusted,blocksize): - for taxon in undelete: - fh.write(safename(taxon,mrbayes=mrbayes).ljust(namelength+1)) - fh.write(cropped_matrix[taxon][seek:seek+blocksize]+'\n') - fh.write('\n') - else: - for taxon in undelete: - if blocksize.""" - if not self.charsets and not self.taxsets and not self.charpartitions: - return '' - sets=['\nbegin sets'] - # - now if characters have been excluded, the character sets need to be adjusted, - # so that they still point to the right character positions - # calculate a list of offsets: for each deleted character, the following character position - # in the new file will have an additional offset of -1 - offset=0 - offlist=[] - for c in range(self.nchar): - if c in exclude: - offset+=1 - offlist.append(-1) # dummy value as these character positions are excluded - else: - offlist.append(c-offset) - # now adjust each of the character sets - for n,ns in self.charsets.items(): - cset=[offlist[c] for c in ns if c not in exclude] - if cset: - sets.append('charset %s = %s' % (safename(n),_compact4nexus(cset))) - for n,s in self.taxsets.items(): - tset=[safename(t,mrbayes=mrbayes) for t in s if t not in delete] - if tset: - sets.append('taxset %s = %s' % (safename(n),' '.join(tset))) - for n,p in self.charpartitions.items(): - # as characters have been excluded, the partitions must be adjusted - # if a partition is empty, it will be omitted from the charpartition 
command - # (although paup allows charpartition part=t1:,t2:,t3:1-100) - names=_sort_keys_by_values(p) - newpartition={} - for sn in names: - nsp=[offlist[c] for c in p[sn] if c not in exclude] - if nsp: - newpartition[sn]=nsp - if newpartition: - sets.append('charpartition %s = %s' % (safename(n),\ - ', '.join(['%s: %s' % (sn,_compact4nexus(newpartition[sn])) for sn in names if sn in newpartition]))) - # now write charpartititions, much easier than charpartitions - for n,p in self.taxpartitions.items(): - names=_sort_keys_by_values(p) - newpartition={} - for sn in names: - nsp=[t for t in p[sn] if t not in delete] - if nsp: - newpartition[sn]=nsp - if newpartition: - sets.append('taxpartition %s = %s' % (safename(n),\ - ', '.join(['%s: %s' % (safename(sn),' '.join(map(safename,newpartition[sn]))) for sn in names if sn in newpartition]))) - # add 'end' and return everything - sets.append('end;\n') - return ';\n'.join(sets) - f.close() - - def export_fasta(self, filename=None, width=70): - """Writes matrix into a fasta file: (self, filename=None, width=70).""" - if not filename: - if '.' 
in filename and self.filename.split('.')[-1].lower() in ['paup','nexus','nex','dat']: - filename='.'.join(self.filename.split('.')[:-1])+'.fas' - else: - filename=self.filename+'.fas' - fh=open(filename,'w') - for taxon in self.taxlabels: - fh.write('>'+safename(taxon)+'\n') - for i in range(0, len(self.matrix[taxon].tostring()), width): - fh.write(self.matrix[taxon].tostring()[i:i+width] + '\n') - fh.close() - - def constant(self,matrix=None,delete=[],exclude=[]): - """Return a list with all constant characters.""" - if not matrix: - matrix=self.matrix - undelete=[t for t in self.taxlabels if t in matrix and t not in delete] - if not undelete: - return None - elif len(undelete)==1: - return [x for x in range(len(matrix[undelete[0]])) if x not in exclude] - # get the first sequence and expand all ambiguous values - constant=[(x,self.ambiguous_values.get(n.upper(),n.upper())) for - x,n in enumerate(matrix[undelete[0]].tostring()) if x not in exclude] - for taxon in undelete[1:]: - newconstant=[] - for site in constant: - #print '%d (paup=%d)' % (site[0],site[0]+1), - seqsite=matrix[taxon][site[0]].upper() - #print seqsite,'checked against',site[1],'\t', - if seqsite==self.missing or (seqsite==self.gap and self.options['gapmode'].lower()=='missing') or seqsite==site[1]: - # missing or same as before -> ok - newconstant.append(site) - elif seqsite in site[1] or site[1]==self.missing or (self.options['gapmode'].lower()=='missing' and site[1]==self.gap): - # subset of an ambig or only missing in previous -> take subset - newconstant.append((site[0],self.ambiguous_values.get(seqsite,seqsite))) - elif seqsite in self.ambiguous_values: # is it an ambig: check the intersection with prev. 
values - intersect=sets.Set(self.ambiguous_values[seqsite]).intersection(sets.Set(site[1])) - if intersect: - newconstant.append((site[0],''.join(intersect))) - # print 'ok' - #else: - # print 'failed' - #else: - # print 'failed' - constant=newconstant - cpos=[s[0] for s in constant] - return constant - # return [x[0] for x in constant] - - def cstatus(self,site,delete=[],narrow=True): - """Summarize character. - narrow=True: paup-mode (a c ? --> ac; ? ? ? --> ?) - narrow=false: (a c ? --> a c g t -; ? ? ? --> a c g t -) - """ - undelete=[t for t in self.taxlabels if t not in delete] - if not undelete: - return None - cstatus=[] - for t in undelete: - c=self.matrix[t][site].upper() - if self.options.get('gapmode')=='missing' and c==self.gap: - c=self.missing - if narrow and c==self.missing: - if c not in cstatus: - cstatus.append(c) - else: - cstatus.extend([b for b in self.ambiguous_values[c] if b not in cstatus]) - if self.missing in cstatus and narrow and len(cstatus)>1: - cstatus=[c for c in cstatus if c!=self.missing] - cstatus.sort() - return cstatus - - def weighted_stepmatrix(self,name='your_name_here',exclude=[],delete=[]): - """Calculates a stepmatrix for weighted parsimony. - See Wheeler (1990), Cladistics 6:269-275 and - Felsenstein (1981), Biol. J. Linn. Soc. 
16:183-196 - """ - m=StepMatrix(self.unambiguous_letters,self.gap) - for site in [s for s in range(self.nchar) if s not in exclude]: - cstatus=self.cstatus(site,delete) - for i,b1 in enumerate(cstatus[:-1]): - for b2 in cstatus[i+1:]: - m.add(b1.upper(),b2.upper(),1) - return m.transformation().weighting().smprint(name=name) - - def crop_matrix(self,matrix=None, delete=[], exclude=[]): - """Return a matrix without deleted taxa and excluded characters.""" - if not matrix: - matrix=self.matrix - if [t for t in delete if not self._check_taxlabels(t)]: - raise NexusError, 'Unknwon taxa: %s' % ', '.join(sets.Set(delete).difference(self.taxlabels)) - if exclude!=[]: - undelete=[t for t in self.taxlabels if t in matrix and t not in delete] - if not undelete: - return {} - m=[matrix[k].tostring() for k in undelete] - zipped_m=zip(*m) - sitesm=[s for i,s in enumerate(zipped_m) if i not in exclude] - if sitesm==[]: - return dict([(t,Seq('',self.alphabet)) for t in undelete]) - else: - zipped_sitesm=zip(*sitesm) - m=[Seq(s,self.alphabet) for s in map(''.join,zipped_sitesm)] - return dict(zip(undelete,m)) - else: - return dict([(t,matrix[t]) for t in self.taxlabels if t in matrix and t not in delete]) - - def bootstrap(self,matrix=None,delete=[],exclude=[]): - """Return a bootstrapped matrix.""" - if not matrix: - matrix=self.matrix - seqobjects=isinstance(matrix[matrix.keys()[0]],Seq) # remember if Seq objects - cm=self.crop_matrix(delete=delete,exclude=exclude) # crop data out - if not cm: # everything deleted? - return {} - elif len(cm[cm.keys()[0]])==0: # everything excluded? 
- return cm - undelete=[t for t in self.taxlabels if t in cm] - if seqobjects: - sitesm=zip(*[cm[t].tostring() for t in undelete]) - alphabet=matrix[matrix.keys()[0]].alphabet - else: - sitesm=zip(*[cm[t] for t in undelete]) - bootstrapsitesm=[sitesm[random.randint(0,len(sitesm)-1)] for i in range(len(sitesm))] - bootstrapseqs=map(''.join,zip(*bootstrapsitesm)) - if seqobjects: - bootstrapseqs=[Seq(s,alphabet) for s in bootstrapseqs] - return dict(zip(undelete,bootstrapseqs)) - - def add_sequence(self,name,sequence): - """Adds a sequence to the matrix.""" - if not name: - raise NexusError, 'New sequence must have a name' - diff=self.nchar-len(sequence) - if diff<0: - self.insert_gap(self.nchar,-diff) - elif diff>0: - sequence+=self.missing*diff - - self.matrix[name]=Seq(sequence,self.alphabet) - self.ntax+=1 - self.taxlabels.append(name) - #taxlabels? - - def insert_gap(self,pos,n=1,leftgreedy=False): - """Add a gap into the matrix and adjust charsets and partitions. - - pos=0: first position - pos=nchar: last position - """ - - def _adjust(set,x,d,leftgreedy=False): - """Adjusts chartacter sets if gaps are inserted, taking care of - new gaps within a coherent character set.""" - # if 3 gaps are inserted at pos. 
9 in a set that looks like 1 2 3 8 9 10 11 13 14 15 - # then the adjusted set will be 1 2 3 8 9 10 11 12 13 14 15 16 17 18 - # but inserting into position 8 it will stay like 1 2 3 11 12 13 14 15 16 17 18 - set.sort() - addpos=0 - for i,c in enumerate(set): - if c>=x: - set[i]=c+d - # if we add gaps within a group of characters, we want the gap position included in this group - if c==x: - if leftgreedy or (i>0 and set[i-1]==c-1): - addpos=i - if addpos>0: - set[addpos:addpos]=range(x,x+d) - return set - - if pos<0 or pos>self.nchar: - raise NexusError('Illegal gap position: %d' % pos) - if n==0: - return - sitesm=zip(*[self.matrix[t].tostring() for t in self.taxlabels]) - sitesm[pos:pos]=[['-']*len(self.taxlabels)]*n - # #self.matrix=dict([(taxon,Seq(map(''.join,zip(*sitesm))[i],self.alphabet)) for\ - # i,taxon in enumerate(self.taxlabels)]) - zipped=zip(*sitesm) - mapped=map(''.join,zipped) - listed=[(taxon,Seq(mapped[i],self.alphabet)) for i,taxon in enumerate(self.taxlabels)] - self.matrix=dict(listed) - self.nchar+=n - # now adjust character sets - for i,s in self.charsets.items(): - self.charsets[i]=_adjust(s,pos,n,leftgreedy=leftgreedy) - for p in self.charpartitions: - for sp,s in self.charpartitions[p].items(): - self.charpartitions[p][sp]=_adjust(s,pos,n,leftgreedy=leftgreedy) - # now adjust character state labels - self.charlabels=self._adjust_charlabels(insert=[pos]*n) - return self.charlabels - - def _adjust_charlabels(self,exclude=None,insert=None): - """Return adjusted indices of self.charlabels if characters are excluded or inserted.""" - if exclude and insert: - raise NexusError, 'Can\'t exclude and insert at the same time' - if not self.charlabels: - return None - labels=self.charlabels.keys() - labels.sort() - newcharlabels={} - if exclude: - exclude.sort() - exclude.append(sys.maxint) - excount=0 - for c in labels: - if not c in exclude: - while c>exclude[excount]: - excount+=1 - newcharlabels[c-excount]=self.charlabels[c] - elif insert: - 
insert.sort() - insert.append(sys.maxint) - icount=0 - for c in labels: - while c>=insert[icount]: - icount+=1 - newcharlabels[c+icount]=self.charlabels[c] - else: - return self.charlabels - return newcharlabels - - def invert(self,charlist): - """Returns all character indices that are not in charlist.""" - return [c for c in range(self.nchar) if c not in charlist] - - def gaponly(self,include_missing=False): - """Return gap-only sites.""" - gap=sets.Set(self.gap) - if include_missing: - gap.add(self.missing) - sitesm=zip(*[self.matrix[t].tostring() for t in self.taxlabels]) - gaponly=[i for i,site in enumerate(sitesm) if sets.Set(site).issubset(gap)] - return gaponly - - def terminal_gap_to_missing(self,missing=None,skip_n=True): - """Replaces all terminal gaps with missing character. - - Mixtures like ???------??------- are properly resolved.""" - - if not missing: - missing=self.missing - replace=[self.missing,self.gap] - if not skip_n: - replace.extend(['n','N']) - for taxon in self.taxlabels: - sequence=self.matrix[taxon].tostring() - length=len(sequence) - start,end=get_start_end(sequence,skiplist=replace) - sequence=sequence[:end+1]+missing*(length-end-1) - sequence=start*missing+sequence[start:] - assert length==len(sequence), 'Illegal sequence manipulation in Nexus.termial_gap_to_missing in taxon %s' % taxon - self.matrix[taxon]=Seq(sequence,self.alphabet) - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/array_io.py --- a/corebio/seq_io/array_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,133 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. 
-# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Read and write a rectangular array of sequence data. - -One sequence per line and nothing else. Each line must contain the same number -of characters. Blank lines and white space are ignored. 
- ---- Example Array --- - ---------------------------LENSTSPYDYGENESD-------FSDSPPCPQDF ---------------------------LENLEDLF-WELDRLD------NYNDTSLVENH- ---------------------------MSNITDPQMWDFDDLN-------FTGMPPADEDY ------------------------------------YTSDN---------YSGSGDYDSNK --SL-------NFDRTFLPALYSLLFLLGLLGNGAVAAVLLSQRTALSSTDTFLLHLAVAD ---LC-PATMASFKAVFVPVAYSLIFLLGVIGNVLVLVILERHRQTRSSTETFLFHLAVAD --SPC-MLETETLNKYVVIIAYALVFLLSLLGNSLVMLVILYSRVGRSVTDVYLLNLALAD --EPC-RDENVHFNRIFLPTIYFIIFLTGIVGNGLVILVMGYQKKLRSMTDKYRLHLSVAD -""" - -from corebio.seq import * -from corebio.utils import * - -example = """ ---------------------------LENSTSPYDYGENESD-------FSDSPPCPQDF ---------------------------LENLEDLF-WELDRLD------NYNDTSLVENH- ---------------------------MSNITDPQMWDFDDLN-------FTGMPPADEDY ------------------------------------YTSDN---------YSGSGDYDSNK --SL-------NFDRTFLPALYSLLFLLGLLGNGAVAAVLLSQRTALSSTDTFLLHLAVAD ---LC-PATMASFKAVFVPVAYSLIFLLGVIGNVLVLVILERHRQTRSSTETFLFHLAVAD --SPC-MLETETLNKYVVIIAYALVFLLSLLGNSLVMLVILYSRVGRSVTDVYLLNLALAD --EPC-RDENVHFNRIFLPTIYFIIFLTGIVGNGLVILVMGYQKKLRSMTDKYRLHLSVAD -""" - -names = ("array",'flatfile') -extensions = () - -def read(fin, alphabet=None): - """Read a file of raw sequecne alignment data. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Returns: - SeqList -- A list of sequences - Raises: - ValueError -- If the file is unparsable - """ - seqs = [ s for s in iterseq(fin, alphabet)] - return SeqList(seqs) - - -def iterseq(fin, alphabet=None) : - """ Read one line of sequence data and yeild the sequence. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Yeilds: - Seq -- One alphabetic sequence at a time. 
- Raises: - ValueError -- If the file is unparsable - """ - - alphabet = Alphabet(alphabet) - line_length = 0 - - for linenum, line in enumerate(fin) : - if line.isspace(): continue # Blank line - line = line.strip() - - if line[0] == '>' : # probable a fasta file. Fail. - raise ValueError( - "Parse Error on input line: %d " % (linenum) ) - - line = remove_whitespace(line) - - if not alphabet.alphabetic(line) : - raise ValueError( - "Character on line: %d not in alphabet: %s : %s" % \ - (linenum, alphabet, line) ) - - if line_length and line_length != len(line) : - raise ValueError("Line %d has a incommensurate length." % linenum) - line_length = len(line) - - yield Seq(line, alphabet) - - -def write(afile, seqs): - """Write raw sequence data, one line per sequence. - - arguments: - afile -- A writable stream. - seqs -- A list of Seq's - """ - for s in seqs : - writeseq(afile, s) - - -def writeseq(afile, seq): - """ Write a single sequence in raw format. - - arguments: - afile -- A writable stream. - seq -- A Seq instance - """ - print >>afile, seq - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/clustal_io.py --- a/corebio/seq_io/clustal_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,222 +0,0 @@ - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. 
-# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -""" Read and write the CLUSTAL sequence file format. - -See : -- http://www.cmpharm.ucsf.edu/~goh/Treecorr/sampleAlignment.html -- http://www.bioperl.org/wiki/ClustalW_multiple_alignment_format - -Ref : -- Higgins D., Thompson J., Gibson T., Thompson J.D., Higgins D.G., Gibson - T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple - sequence alignment through sequence weighting, position-specific gap - penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680. -""" - -# TODO: What happens if CLUSTAL is not the first line of the file? - - -import re - -from corebio.utils import * -from corebio.seq import * -from corebio.seq_io import * - -__all__ = ('example', 'names', 'extensions', 'read') - -example = """ -CLUSTAL W (1.81) multiple sequence alignment - - -CXCR3_MOUSE --------------------------LENSTSPYDYGENESD-------FSDSPPCPQDF -BLR_HUMAN --------------------------LENLEDLF-WELDRLD------NYNDTSLVENH- -CXCR1_HUMAN --------------------------MSNITDPQMWDFDDLN-------FTGMPPADEDY -CXCR4_MURINE -----------------------------------YTSDN---------YSGSGDYDSNK - : : :.. .. - -CXCR3_MOUSE -SL-------NFDRTFLPALYSLLFLLGLLGNGAVAAVLLSQRTALSSTDTFLLHLAVAD -BLR_HUMAN --LC-PATMASFKAVFVPVAYSLIFLLGVIGNVLVLVILERHRQTRSSTETFLFHLAVAD -CXCR1_HUMAN -SPC-MLETETLNKYVVIIAYALVFLLSLLGNSLVMLVILYSRVGRSVTDVYLLNLALAD -CXCR4_MURINE -EPC-RDENVHFNRIFLPTIYFIIFLTGIVGNGLVILVMGYQKKLRSMTDKYRLHLSVAD - :. 
.: * ::** .::** * :: : * *: : ::*::** - -CXCR3_MOUSE VLLVLTLPLWAVDAA-VQWVFGPGLCKVAGALFNINFYAGAFLLACISFDRYLSIVHATQ -BLR_HUMAN LLLVFILPFAVAEGS-VGWVLGTFLCKTVIALHKVNFYCSSLLLACIAVDRYLAIVHAVH -CXCR1_HUMAN LLFALTLPIWAASKV-NGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRYLAIVHATR -CXCR4_MURINE LLFVITLPFWAVDAM-ADWYFGKFLCKAVHIIYTVNLYSSVLILAFISLDRYLAIVHATN - :*:.: **: ... * :* ***.. : :*:*.. ::** *:.****:****.. -""" - - - -names = ("clustal", "clustalw",) -extensions = ('aln',) - - -header_line = re.compile(r'(CLUSTAL.*)$') - -# (sequence_id) (Sequence) (Optional sequence number) -seq_line = re.compile(r'(\s*\S+\s+)(\S+)\s*(\d*)$') - -# Saved group includes variable length leading space. -# Must consult a seq_line to figure out how long the leading spoace is since -# the maximum CLUSTAL ids length (normally 10 characters) can be changed. -match_line = re.compile(r'([\s:\.\*]*)$') - - -def iterseq(fin, alphabet=None): - """Iterate over the sequences in the file.""" - # Default implementation - return iter(read(fin, alphabet) ) - - -def read(fin, alphabet=None) : - alphabet = Alphabet(alphabet) - seq_ids = [] - seqs = [] - block_count = 0 - - - for token in _scan(fin): - if token.typeof== "begin_block": - block_count = 0 - elif token.typeof == "seq_id": - if len(seqs) <= block_count : - seq_ids.append(token.data) - seqs.append([]) - elif token.typeof == "seq": - if not alphabet.alphabetic(token.data) : - raise ValueError( - "Character on line: %d not in alphabet: %s : %s" % ( - token.lineno, alphabet, token.data) ) - seqs[block_count].append(token.data) - block_count +=1 - - - seqs = [ Seq("".join(s), alphabet, name= i) for s,i in zip(seqs,seq_ids)] - return SeqList(seqs) - - -# 1) The word "CLUSTAL" should be the first word on the first line of the file. -# (But sometimes isn't.) -# 2) The alignment is displayed in blocks of fixed length. -# 3) Each line in the block corresponds to one sequence. 
-# 4) Each sequence line starts with a sequence name followed by at least one -# space and then the sequence. - -def _scan( fin ): - """Scan a clustal format MSA file and yeild tokens. - The basic file structure is - begin_document - header? - (begin_block - (seq_id seq seq_index?)+ - match_line? - end_block)* - end_document - - Usage: - for token in scan(clustal_file): - do_somthing(token) - """ - header, body, block = range(3) - - yield Token("begin") - leader_width = -1 - state = header - for L, line in enumerate(fin): - if state==header : - if line.isspace() : continue - m = header_line.match(line) - state = body - if m is not None : - yield Token("header", m.group() ) - continue - else : - raise ValueError("Cannot find required header") - - - if state == body : - if line.isspace() : continue - yield Token("begin_block") - state = block - # fall through to block - - if state == block: - if line.isspace() : - yield Token("end_block") - state = body - continue - - m = match_line.match(line) - if m is not None : - yield Token("match_line", line[leader_width:-1]) - continue - - m = seq_line.match(line) - if m is None: - raise ValueError("Parse error on line: %d" % L) - leader_width = len(m.group(1)) - yield Token("seq_id", m.group(1).strip() ) - yield Token("seq", m.group(2).strip() ) - if m.group(3) : - yield Token("seq_num", m.group(3)) - continue - - # END state blocks. 
If I ever get here something has gone terrible wrong - raise RuntimeError() - - if state==block: - yield Token("end_block") - yield Token("end") - return - -def write(fout, seqs) : - """Write 'seqs' to 'fout' as text in clustal format""" - header = "CLUSTAL W (1.81) multiple sequence alignment" - name_width = 17 - seq_width = 60 - - print >>fout, header - print >>fout - print >>fout - - L = 0 - for s in seqs: L = max(L, len(s)) - - for block in range(0, L, seq_width): - for s in seqs : - start = min(block, len(s)) - end = min( start+seq_width, len(s)) - print >>fout, s.name.ljust(name_width), - print >>fout, s[start:end] - print >>fout - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/fasta_io.py --- a/corebio/seq_io/fasta_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,230 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Read and write sequence information in FASTA format. - -This is a very common format for unannotated biological sequence data, -accepted by many multiple sequence alignment programs. Each sequence -consists of a single-line description, followed by lines of sequence data. -The first character of the description line is a greater-than (">") symbol -in the first column. The first word of the description is often the name or -ID of the sequence. Fasta files containing multiple sequences have one -sequence listed right after another. - - -Example Fasta File :: - ->Lamprey GLOBIN V - SEA LAMPREY -PIVDTGSVA-P------------------LSAAEKTKIRSAWAPVYSTY---ETSGVDILVKFFTSTPAAQEFFPKFKGL -TT-----ADQLKKSA---DVRWHA-ERIINAVNDAVASMDDTEKMS--MKL-RDLSGKH----AKSFQV-----DPQYFK -VLAAVI-AD-TVAAGD--AGFEKLMSM------I---CILLR----S-----A-----Y------------ ->Hagfish GLOBIN III - ATLANTIC HAGFISH -PITDHGQPP-T------------------LSEGDKKAIRESWPQIYKNF---EQNSLAVLLEFLKKFPKAQDSFPKFSAK -KS-------HLEQDP---AVKLQA-EVIINAVNHTIGLMDKEAAMK--KYL-KDLSTKH----STEFQV-----NPDMFK -ELSAVF-VS-TMG-GK--AAYEKLFSI------I---ATLLR----S-----T-----YDA---------- ->Frog HEMOGLOBIN BETA CHAIN - EDIBLE FROG -----------GS-----------------------DLVSGFWGKV--DA---HKIGGEALARLLVVYPWTQRYFTTFGNL -GSADAIC-----HNA---KVLAHG-EKVLAAIGEGLKHPENLKAHY--AKL-SEYHSNK----LHVDPANFRLLGNVFIT -VLARHF-QH-EFTPELQ-HALEAHFCA------V---GDALA----K-----A-----YH----------- - - -""" -import re -from corebio.utils import * -from corebio.seq import * -from corebio.seq_io import * - - -names = ( 'fasta', 'pearson', 'fa') -extensions = ('fa', 'fasta', 'fast', 'seq', 'fsa', 'fst', 'nt', 'aa','fna','mpfa', 'faa', 'fnn','mfasta') - - -example = """ ->Lamprey GLOBIN V - SEA LAMPREY 
-PIVDTGSVA-P------------------LSAAEKTKIRSAWAPVYSTY---ETSGVDILVKFFTSTPAAQEFFPKFKGL -TT-----ADQLKKSA---DVRWHA-ERIINAVNDAVASMDDTEKMS--MKL-RDLSGKH----AKSFQV-----DPQYFK -VLAAVI-AD-TVAAGD--AGFEKLMSM------I---CILLR----S-----A-----Y------------ - ->Hagfish GLOBIN III - ATLANTIC HAGFISH -PITDHGQPP-T------------------LSEGDKKAIRESWPQIYKNF---EQNSLAVLLEFLKKFPKAQDSFPKFSAK -KS-------HLEQDP---AVKLQA-EVIINAVNHTIGLMDKEAAMK--KYL-KDLSTKH----STEFQV-----NPDMFK -ELSAVF-VS-TMG-GK--AAYEKLFSI------I---ATLLR----S-----T-----YDA---------- - ->Frog HEMOGLOBIN BETA CHAIN - EDIBLE FROG -----------GS-----------------------DLVSGFWGKV--DA---HKIGGEALARLLVVYPWTQRYFTTFGNL -GSADAIC-----HNA---KVLAHG-EKVLAAIGEGLKHPENLKAHY--AKL-SEYHSNK----LHVDPANFRLLGNVFIT -VLARHF-QH-EFTPELQ-HALEAHFCA------V---GDALA----K-----A-----YH----------- - -""" - - -def read(fin, alphabet=None): - """Read and parse a fasta file. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Returns: - SeqList -- A list of sequences - Raises: - ValueError -- If the file is unparsable - """ - seqs = [ s for s in iterseq(fin, alphabet)] - name = names[0] - if hasattr(fin, "name") : name = fin.name - return SeqList(seqs, name=name) - - -def readseq(fin, alphabet=None) : - """Read one sequence from the file, starting - from the current file position.""" - return iterseq(fin, alphabet).next() - - -def iterseq(fin, alphabet=None): - """ Parse a fasta file and generate sequences. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Yeilds: - Seq -- One alphabetic sequence at a time. - Raises: - ValueError -- If the file is unparsable - """ - alphabet = Alphabet(alphabet) - - seqs = [] - comments = [] # FIXME: comments before first sequence are lost. 
- header = None - header_lineno = -1 - - def build_seq(seqs,alphabet, header, header_lineno,comments) : - try : - name = header.split(' ',1)[0] - if comments : - header += '\n' + '\n'.join(comments) - s = Seq( "".join(seqs), alphabet, name=name, description=header) - except ValueError: - raise ValueError( - "Parsed failed with sequence starting at line %d: " - "Character not in alphabet: %s" % (header_lineno, alphabet) ) - return s - - for lineno, line in enumerate(fin) : - line = line.strip() - if line == '' : continue - if line.startswith('>') : - if header is not None : - yield build_seq(seqs,alphabet, header, header_lineno, comments) - header = None - seqs = [] - header = line[1:] - header_lineno = lineno - comments = [] - elif line.startswith(';') : - # Optional (and unusual) comment line - comments.append(line[1:]) - else : - if header is None : - raise ValueError ( - "Parse failed on line %d: sequence before header" - % (lineno) ) - seqs.append(line) - - if not seqs: return - yield build_seq(seqs,alphabet, header, header_lineno, comments) - - -def write(fout, seqs): - """Write a fasta file. - - Args: - fout -- A writable stream. - seqs -- A list of Seq's - """ - if seqs.description : - for line in seqs.description.splitlines(): - print >>fout, ';'+ line - for s in seqs : - writeseq(fout, s) - - -def writeseq(afile, seq): - """ Write a single sequence in fasta format. - - Args: - afile -- A writable stream. 
- seq -- A Seq instance - """ - - header = seq.description or seq.name or '' - - # We prepend '>' to the first header line - # Additional lines start with ';' to indicate comment lines - if header : - header = header.splitlines() - print >>afile, '>'+header[0] - if len(header) > 1 : - for h in header[1:] : - print >>afile, ';' +h - else : - print >>afile, '>' - - L = len(seq) - line_length = 80 - for n in range (1+ L/line_length) : - print >>afile, seq[n * line_length: (n+1) * line_length] - print >>afile - - -def index(afile, alphabet=None) : - """Return a FileIndex for the fasta file. Sequences can be retrieved - by item number or name. - """ - def parser( afile) : - return readseq(afile, alphabet) - - key = re.compile(r"^>\s*(\S*)") - def linekey( line): - k = key.search(line) - if k is None : return None - return k.group(1) - - return FileIndex(afile, linekey, parser) - - - - - - - - - - - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/genbank_io.py --- a/corebio/seq_io/genbank_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,85 +0,0 @@ -#!/usr/bin/env python - - -"""Read GenBank flat files. - -Currently only reads sequence data and not annotations. - -""" -from corebio.utils import * -from corebio.seq import * - - -names = ( 'genbank',) -extensions = ('gb','genbank', 'gbk') - - - -def read(fin, alphabet=None): - """Read and parse a file of genbank records. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - - Returns: - SeqList -- A list of sequences - - Raises: - ValueError -- If the file is unparsable - """ - seqs = [ s for s in iterseq(fin, alphabet)] - return SeqList(seqs) - - -def iterseq(fin, alphabet=None): - """ Iterate over genbank records - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - - Yeilds: - Seq -- One alphabetic sequence at a time. 
- - Raises: - ValueError -- If the file is unparsable - """ - alphabet = Alphabet(alphabet) - - seq = [] - - def notblank(string) : - return not isblank(string) - - lines = Reiterate(iter(fin)) - - - while True : - line = lines.filter( notblank ) - if not line.startswith('LOCUS') : - raise ValueError( - "Cannot find start of record at line %d"% lines.index() ) - - line = lines.filter(lambda s : s.startswith('ORIGIN') - or s.startswith('//') ) - - if line.startswith('//') : - # No sequence data - yield Seq( '', alphabet) - else: - for line in lines : - if line.startswith('//') : - yield Seq( ''.join(seq), alphabet) - seq = [] - break - seq.extend( line.split()[1:] ) - - - - - - - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/intelligenetics_io.py --- a/corebio/seq_io/intelligenetics_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,203 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Read and write sequence information in IntelliGenetics format. - -A sequence file in IG format can contain several sequences, each consisting of a -number of comment lines that must begin with a semicolon (";"), a line with the -sequence name and the sequence itself terminated with the termination character -'1' for linear or '2' for circular sequences. The termination caracter is -defacto optional. - ---- Example IG File --- - -;H.sapiens fau mRNA, 518 bases -HSFAU -ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc -actcttaagtcttttgtaattctggctttctctaataaaaaagccactta -gttcagtcaaaaaaaaaa1 -;H.sapiens fau 1 gene, 2016 bases -HSFAU1 -ctaccattttccctctcgattctatatgtacactcgggacaagttctcct -gatcgaaaacggcaaaactaaggccccaagtaggaatgccttagttttcg -gggttaacaatgattaacactgagcctcacacccacgcgatgccctcagc -tcctcgctcagcgctctcaccaacagccgtagcccgcagccccgctggac -accggttctccatccccgcagcgtagcccggaacatggtagctgccatct -ttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgcccc2 - -""" - -from corebio.utils import * -from corebio.seq import * -from corebio.seq_io import * - - -names = ( 'intelligenetics', 'ig', 'stanford', ) -extensions = ('ig') - - -example = """ -;H.sapiens fau mRNA, 518 bases -HSFAU -ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc -actcttaagtcttttgtaattctggctttctctaataaaaaagccactta -gttcagtcaaaaaaaaaa1 -;H.sapiens fau 1 gene, 2016 bases -HSFAU1 -ctaccattttccctctcgattctatatgtacactcgggacaagttctcct -gatcgaaaacggcaaaactaaggccccaagtaggaatgccttagttttcg -gggttaacaatgattaacactgagcctcacacccacgcgatgccctcagc -tcctcgctcagcgctctcaccaacagccgtagcccgcagccccgctggac -accggttctccatccccgcagcgtagcccggaacatggtagctgccatct -ttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgcccc2 -""" - - - - -def read(fin, alphabet=None): - """Read and 
parse an IG file. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Returns: - SeqList -- A list of sequences - Raises: - ValueError -- If the file is unparsable - """ - seqs = [ s for s in iterseq(fin, alphabet)] - return SeqList(seqs) - - -def iterseq(fin, alphabet=None): - """ Parse an IG file and generate sequences. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Yeilds: - Seq -- One alphabetic sequence at a time. - Raises: - ValueError -- If the file is unparsable - """ - alphabet = Alphabet(alphabet) - - seqs = [] - header = [] - start_lineno = -1 - name = None - - def build_seq(seqs,alphabet, name, comments, lineno) : - try : - desc = '\n'.join(comments) - s = Seq( "".join(seqs), alphabet, name=name, description=desc) - except ValueError : - raise ValueError( - "Parsed failed with sequence starting at line %d: " - "Character not in alphabet: %s" % (lineno, alphabet) ) - return s - - for lineno, line in enumerate(fin) : - line = line.strip() - if line == '' : continue - if line.startswith(';') : - if seqs : - # end of sequence - yield build_seq(seqs,alphabet, name, header, start_lineno) - header = [] - seqs = [] - name = None - header.append(line[1:]) - start_lineno = lineno - elif not name : - name = line - elif line[-1] == '1' or line[-1]=='2': - # End of sequence - seqs.append(remove_whitespace(line[0:-1])) - yield build_seq(seqs,alphabet, name, header, start_lineno) - header = [] - seqs = [] - name = None - else: - seqs.append( remove_whitespace(line)) - - if seqs : - yield build_seq(seqs,alphabet, name, header, start_lineno) - return - - - - - -def write(fout, seqs): - """Write an IG file. - - Args: - fout -- A writable stream. - seqs -- A list of Seq's - Raises: - ValueError -- If a sequence is missing a name - """ - for s in seqs : - writeseq(fout, s) - - -def writeseq(fout, seq): - """ Write a single sequence in IG format. 
- - Args: - afile -- A writable stream. - seq -- A Seq instance - Raises: - ValueError -- If a sequence is missing a name - """ - - desc = seq.description or '' - - # We prepend ';' to each line - for h in desc.splitlines() : - print >> fout, ';' +h - - if not seq.name : - raise ValueError( - "Write failed with missing sequence name: %s"% str(seq) ) - print >>fout, seq.name - L = len(seq) - line_length = 80 - for n in range (1+ int(L/line_length)) : - print >>fout, seq[n * line_length: (n+1) * line_length] - print >>fout - - - - - - - - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/msf_io.py --- a/corebio/seq_io/msf_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,156 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 Clare Gollnick -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. 
-# - - -"""Read sequence information in MSF format. - -This is a file format for biological sequence data. The sequences are interweaved and each line is labeled with the sequence name. The MSF format can be identified in one, or more of the following ways: -1. The word PileUp on the first line (optional) -2. the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT at the start of the file (optional) -3. the word MSF on the first line of the file, and the characters ".." at the end of this line (optional) -4. A header containing sequence information followed by a line with the characters "//" -""" -example= """ - - PileUp - - -MSF: 64 Type: P Check: 767 .. - - Name: Cow Len: 100 Check: 3761 Weight: 1.00 - Name: Carp Len: 100 Check: 1550 Weight: 1.00 - Name: Chicken Len: 100 Check: 2397 Weight: 1.00 - Name: Human Len: 100 Check: 9021 Weight: 1.00 - Name: Loach Len: 100 Check: 984 Weight: 1.00 - Name: Mouse Len: 100 Check: 2993 Weight: 1.00 - - -// - - - Cow MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL - Carp MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL -Chicken MANHSQLGFQ DASSPIMEEL VEFHDHALMV ALAICSLVLY LLTLMLMEKL - Human MAHAAQVGLQ DATSPIMEEL ITFHDHALMI IFLICFLVLY ALFLTLTTKL - Loach MAHPTQLGFQ DAASPVMEEL LHFHDHALMI VFLISALVLY VIITTVSTKL - Mouse MAYPFQLGLQ DATSPIMEEL MNFHDHTLMI VFLISSLVLY IISLMLTTKL - - - - Cow THTSTMDAQE VETIWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM - Carp TNKYILDSQE IEIVWTILPA VILVLIALPS LRILYLMDEI NDPHLTIKAM -Chicken S.SNTVDAQE VELIWTILPA IVLVLLALPS LQILYMMDEI DEPDLTLKAI - Human TNTNISDAQE METVWTILPA IILVLIALPS LRILYMTDEV NDPSLTIKSI - Loach TNMYILDSQE IEIVWTVLPA LILILIALPS LRILYLMDEI NDPHLTIKAM - Mouse THTSTMDAQE VETIWTILPA VILIMIALPS LRILYMMDEI NNPVLTVKTM - - """ - -import re - -from corebio.seq import * -from corebio.seq_io import * -from corebio.utils import * - -names = ('msf', 'gcg-msf', 'gcg', 'PileUp') -extensions = ('msf') - -end_header=re.compile(r'(//)(\s*)$') -seq_line=re.compile(r'\s*(\S+)\s+([\S\s.?]+)$') - -def 
iterseq(fin, alphabet=None): - """Iterate over the sequences in the file.""" - # Default implementation - return iter(read(fin, alphabet) ) - - - -def read(fin, alphabet=None): - alphabet =Alphabet(alphabet) - seq_ids=[] - seqs=[] - block_count=0 - - for token in _line_is(fin): - if token.typeof=="begin_block": - block_count=0 - - elif token.typeof == "seq_id": - if len(seqs)<= block_count: - seq_ids.append(token.data) - seqs.append([]) - elif token.typeof=="seq": - if not alphabet.alphabetic(token.data): - raise ValueError( - "Character on line: %d not in alphabet: %s : %s" % ( - token.lineno, alphabet, token.data) ) - seqs[block_count].append(token.data) - block_count +=1 - if seq_ids==[]: - raise ValueError("Parse error, possible wrong format") - seqs = [ Seq("".join(s), alphabet, name= i) for s,i in zip(seqs,seq_ids)] - return SeqList(seqs) - -def _line_is(fin): - header, body, block = range(3) - yield Token("begin") - state=header - for L, line in enumerate(fin): - if state==header: - if line.isspace():continue - m=end_header.match(line) - if m is not None: - yield Token("end_header") - state=body - continue - else: continue - - if state==body: - if line.isspace():continue - yield Token("begin_block") - state=block - #skips to a block of sequences - - if state==block: - if line.isspace(): - yield Token("end_block") - state=body - continue - m=seq_line.match(line) - if m is None: - raise ValueError("Parse error on line: %d" % L) - if m.group(1).isdigit() and m.group(2).strip().isdigit(): - continue - yield Token("seq_id",m.group(1).strip() ) - data=m.group(2) - data="".join((data.split())) - yield Token("seq",data.strip() ) - - - - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/nbrf_io.py --- a/corebio/seq_io/nbrf_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,169 +0,0 @@ - -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National 
Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - -"""Sequence IO for NBRF/PIR format. - -The format is similar to fasta. The header line consistins of '>', a two- -letter sequence type (P1, F1, DL, DC, RL, RC, or XX), a semicolon, and a -sequence ID. 
The next line is a textual description of the sequence, -followed by one or more lines containing the sequence data. The end of -the sequence is marked by a "*" (asterisk) character. - -type_code -- A map between NBRF two letter type codes and Alphabets. - - -see: http://www.cmbi.kun.nl/bioinf/tools/crab_pir.html - ---- Example NBRF File --- - ->P1;CRAB_ANAPL -ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN). - MDITIHNPLI RRPLFSWLAP SRIFDQIFGE HLQESELLPA SPSLSPFLMR - SPIFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMVEIH - GKHEERQDEH GFIAREFNRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ - SDVPERSIPI TREEKPAIAG AQRK* - ->P1;CRAB_BOVIN -ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN). - MDIAIHHPWI RRPFFPFHSP SRLFDQFFGE HLLESDLFPA STSLSPFYLR - PPSFLRAPSW IDTGLSEMRL EKDRFSVNLD VKHFSPEELK VKVLGDVIEV - HGKHEERQDE HGFISREFHR KYRIPADVDP LAITSSLSSD GVLTVNGPRK - QASGPERTIP ITREEKPAVT AAPKK* - -""" - -from corebio.utils import * -from corebio.seq import * -from corebio.seq_io import * - -names = ("nbrf", "pir",) -extensions = ('nbrf', 'pir', 'ali') - - - - -type_code = { - 'P1' : protein_alphabet, # Protein (complete) - 'F1' : protein_alphabet, # Protein (fragment) - 'DL' : dna_alphabet, # DNA (linear) - 'DC' : dna_alphabet, # DNA (circular) - 'RC' : rna_alphabet, # RNA (linear) - 'RL' : rna_alphabet, # RNA (circular) - 'N3' : rna_alphabet, # tRNA - 'N1' : rna_alphabet, # other functional RNA - 'XX' : generic_alphabet - } - -def read(fin, alphabet=None): - """Read and parse a NBRF seqquence file. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data. If not supplied, then - an appropriate alphabet will be inferred from the data. - Returns: - SeqList -- A list of sequences - Raises: - ValueError -- If the file is unparsable - """ - seqs = [ s for s in iterseq(fin, alphabet)] - return SeqList(seqs) - - - -def iterseq(fin, alphabet=None): - """ Generate sequences from an NBRF file. 
- - arguments: - fin -- A stream or file to read - alphabet -- - yeilds : - Seq - raises : - ValueError -- On a parse error. - """ - - body, header,sequence = range(3) # Internal states - - state = body - seq_id = None - seq_desc = None - seq_alpha = None - seqs = [] - - for lineno, line in enumerate(fin) : - if state == body : - if line == "" or line.isspace() : - continue - if line[0] == '>': - seq_type, seq_id = line[1:].split(';') - if alphabet : - seq_alpha = alphabet - else : - seq_alpha = type_code[seq_type] - state = header - continue - raise ValueError("Parse error on line: %d" % lineno) - - elif state == header : - seq_desc = line.strip() - state = sequence - continue - - elif state == sequence : - data = "".join(line.split()) # Strip out white space - if data[-1] =='*' : - # End of sequence data - seqs.append(data[:-1]) - - seq = Seq( "".join(seqs), name = seq_id.strip(), - description = seq_desc, alphabet = seq_alpha) - - yield seq - state= body - seq_id = None - seq_desc = None - seqs = [] - continue - else : - seqs.append(data) - continue - else : - # If we ever get here something has gone terrible wrong - assert(False) - - # end for - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/nexus_io.py --- a/corebio/seq_io/nexus_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,76 +0,0 @@ -#!/usr/bin/env python - -# Copyright 2005 Gavin E. Crooks -# Copyright 2005-2006 The Regents of the University of California. -# -# This software is distributed under the MIT Open Source License. 
-# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Read the sequence data from a nexus file. - -This IO code only gives read access to the sequence data. - -Reference: -'NEXUS: An extensible file format for systematic information' -Maddison, Swofford, Maddison. 1997. Syst. Biol. 
46(4):590-621 -""" - -from corebio.seq import Seq, SeqList, Alphabet -from corebio.seq_io._nexus import Nexus, safename - - - - - -names = ( 'nexus', 'paup') -extensions = ('nex', 'nexus', 'paup', 'nxs') - -def iterseq(fin, alphabet=None): - """Iterate over the sequences in the file.""" - # Default implementation - return iter(read(fin, alphabet) ) - - -def read(fin, alphabet=None): - """ Extract sequence data from a nexus file.""" - n = Nexus(fin) - - seqs = [] - for taxon in n.taxlabels: - name = safename(taxon) - r = n.matrix[taxon] - if alphabet is None : - s = Seq(r, name = name, alphabet=r.alphabet) - else : - s = Seq(r, name = name, alphabet=alphabet ) - seqs.append(s) - - if len(seqs) == 0 : - # Something went terrible wrong. - raise ValueError("Cannot parse file") - - return SeqList(seqs) - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/null_io.py --- a/corebio/seq_io/null_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,57 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Null sequence IO. Acts like /dev/null. Read returns empty sequences or sequence lists, writes do nothing.""" - - -from corebio.seq import Seq, SeqList - -names = () -extensions = () - -def read(fin, alphabet=None): - assert fin is not None # Do something with arguments to quite pychecker - if alphabet is not None : pass - return SeqList([]) - -def iterseq(fin, alphabet=None) : - assert fin is not None - if alphabet is not None : pass - yield Seq('') - return - -def write(fout, seqs): - assert fout is not None - assert seqs is not None - return - - -def writeseq(fout, seq): - assert fout is not None - assert seq is not None - return - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/phylip_io.py --- a/corebio/seq_io/phylip_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,167 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 David D. Ding -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. 
-# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Read Sequences in interleaved Phylip format (not sequential) and returns a -list of sequences. Phylips is a very common phylogeny generating sequence type -that has the following traits -1) First line contains number of species and number of characters in a species' -sequence. Options can may follow, and they can be spaced or unspaced. Options are -simply letters such as A and W after the number of characters. -2) Options doesn't have to contain U in order for a usertree to appear. -3) If there are options then options appear first, then the sequences. For the -first iteration of sequences the first ten spaces are reserved for names of -options and species, the rest is for sequences. -4) For the second and following iterations the names are removed, only -sequence appears -4) At end of file an usertree may appear. First there is a number that indicts -the number of lines the usertree will take, and then the usertrees follow. 
- -Examples: - 6 50 W -W 0101001111 0101110101 01011 -dmras1 GTCGTCGTTG GACCTGGAGG CGTGG -hschras GTGGTGGTGG GCGCCGGCCG TGTGG -ddrasa GTTATTGTTG GTGGTGGTGG TGTCG -spras GTAGTTGTAG GAGATGGTGG TGTTG -scras1 GTAGTTGTCG GTGGAGGTGG CGTTG -scras2 GTCGTCGTTG GTGGTGGTGG TGTTG - -0101001111 0101110101 01011 -GTCGTCGTTG GACCTGGAGG CGTGG -GTGGTGGTGG GCGCCGGCCG TGTGG -GTTATTGTTG GTGGTGGTGG TGTCG -GTAGTTGTAG GAGATGGTGG TGTTG -GTAGTTGTCG GTGGAGGTGG CGTTG -GTCGTCGTTG GTGGTGGTGG TGTTG - -1 -((dmras1,ddrasa),((hschras,spras),(scras1,scras2))); - - -""" - -from corebio.seq import * - -names = ( 'phylip',) -extensions = ('phy',) - -def iterseq(fin, alphabet=None): - """Iterate over the sequences in the file.""" - # Default implementation - return iter(read(fin, alphabet) ) - - -#Read takes in a phylip file name, read it, processes it, and returns a SeqList -def read(fin, alphabet=None): - - - sequence=[] #where sequences are stored - idents=[] - num_seq=0 - num_total_seq=0 #length of sequence of 1 species - tracker=0 #track what sequence the line is on - usertree_tracker=0 #track usertree lines - options='' #options - num_options=0 #number/lens of options - U - - line=fin.readline() - while line: - s_line=line.split() #for ease of use, not used in all scenarios, but easier on the eye - - if s_line == []: #see nothing do nothing - pass - - elif (s_line[0].isdigit() and len(s_line) == 1 and len(sequence)==num_seq and len(sequence[0])==num_total_seq): #identifies usertree - usertree_tracker = int(s_line[0]) - pass - - elif num_options > 0: - if len(sequence) < num_seq: - if s_line[0][0] in options: - num_options -= 1 - pass - else: - raise ValueError('Not an option, but it should be one') - else: - num_options -= 1 - pass - - elif usertree_tracker > 0: #baskically skip usertree - if len(sequence[num_seq-1]) == num_total_seq: - usertree_tracker -=1 - pass - else: - raise ValueError('User Tree in Wrong Place') - - #####problems parse error unexpected - elif s_line[0].isdigit(): - if 
len(s_line) >= 2 and len(sequence) == 0: #identifies first line of file - num_seq = int(s_line[0]) #get number of sequences - num_total_seq = int(s_line[1]) #get length of sequences - if len(s_line) > 2: #takes care of the options - options= (''.join(s_line[2:])) - num_options=len(options) - options.count('U') - else: - raise ValueError('parse error') - - - #when options end, this take care of the sequence - elif num_options == 0: - if (num_seq==0): - raise ValueError("Empty File, or possibly wrong file") - elif tracker < num_seq: - if num_seq > len(sequence): - sequence.append(''.join(line[10:].split())) #removes species name - idents.append(line[0:10].strip()) - tracker +=1 - - else: - sequence[tracker] += (''.join(s_line)) - tracker +=1 - - if tracker == num_seq: - tracker = 0 - num_options = len(options)-options.count('U') - - line=fin.readline() - - if len(sequence) != len(idents) or len(sequence)!=num_seq: - raise ValueError("Number of different sequences wrong") - - seqs = [] - for i in range (0, len(idents)): - if len(sequence[i])==num_total_seq: - seqs.append(Seq(sequence[i], alphabet, idents[i])) - else: - raise ValueError("extra sequence in list") - - return SeqList(seqs) - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/plain_io.py --- a/corebio/seq_io/plain_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,129 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. 
-# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Read and write raw, unformatted sequence data. The whole file is read -in as a sequence. Whitespace is removed. 
- - ---- Example Plain/Raw/Text File --- - ---------------------------LENSTSPYDYGENESD-------FSDSPPCPQDF ---------------------------LENLEDLF-WELDRLD------NYNDTSLVENH- ---------------------------MSNITDPQMWDFDDLN-------FTGMPPADEDY ------------------------------------YTSDN---------YSGSGDYDSNK --SL-------NFDRTFLPALYSLLFLLGLLGNGAVAAVLLSQRTALSSTDTFLLHLAVAD ---LC-PATMASFKAVFVPVAYSLIFLLGVIGNVLVLVILERHRQTRSSTETFLFHLAVAD --SPC-MLETETLNKYVVIIAYALVFLLSLLGNSLVMLVILYSRVGRSVTDVYLLNLALAD --EPC-RDENVHFNRIFLPTIYFIIFLTGIVGNGLVILVMGYQKKLRSMTDKYRLHLSVAD -""" - -from corebio.seq import * -from corebio.utils import remove_whitespace - -example = """ ---------------------------LENSTSPYDYGENESD-------FSDSPPCPQDF ---------------------------LENLEDLF-WELDRLD------NYNDTSLVENH- ---------------------------MSNITDPQMWDFDDLN-------FTGMPPADEDY ------------------------------------YTSDN---------YSGSGDYDSNK --SL-------NFDRTFLPALYSLLFLLGLLGNGAVAAVLLSQRTALSSTDTFLLHLAVAD ---LC-PATMASFKAVFVPVAYSLIFLLGVIGNVLVLVILERHRQTRSSTETFLFHLAVAD --SPC-MLETETLNKYVVIIAYALVFLLSLLGNSLVMLVILYSRVGRSVTDVYLLNLALAD --EPC-RDENVHFNRIFLPTIYFIIFLTGIV -""" - -names = ("plain","raw") -extensions = ('txt', ) - -def read(fin, alphabet=None): - """Read a file of raw sequecne data. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Returns: - SeqList -- A list of sequences - Raises: - ValueError -- If the file is unparsable - """ - seqs = [ s for s in iterseq(fin, alphabet)] - return SeqList(seqs) - - -def iterseq(fin, alphabet=None) : - """ Read the sequence data and yeild one (and only one) sequence. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Yeilds: - Seq -- One alphabetic sequence at a time. 
- Raises: - ValueError -- If the file is unparsable - """ - - alphabet = Alphabet(alphabet) - lines = [] - for linenum, line in enumerate(fin) : - if line.isspace(): continue # Blank line - line = line.strip() - - - if line[0] == '>' : # probable a fasta file. Fail. - raise ValueError( - "Parse Error on input line: %d " % (linenum) ) - line = remove_whitespace(line) - - if not alphabet.alphabetic(line) : - raise ValueError( - "Character on line: %d not in alphabet: %s : %s" % \ - (linenum, alphabet, line) ) - lines.append(line) - - yield Seq(''.join(lines), alphabet) - - - -def write(afile, seqs): - """Write raw sequence data, one line per sequence. - - arguments: - afile -- A writable stream. - seqs -- A list of Seq's - """ - for s in seqs : - writeseq(afile, s) - - -def writeseq(afile, seq): - """ Write a single sequence in raw format. - - arguments: - afile -- A writable stream. - seq -- A Seq instance - """ - print >>afile, seq - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/stockholm_io.py --- a/corebio/seq_io/stockholm_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,172 +0,0 @@ - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. 
-# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Read a multiple sequence alignment in STOCKHOLM format. - -This file format is used by PFAM and HMMER. At present, all annotation -information is ignored. - -See: - - http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html - - HMMER manual - -""" - -import re - -from corebio.utils import * -from corebio.seq import * -from corebio.seq_io import * - - - -example = """ -# STOCKHOLM 1.0 -#=GF ID CBS -#=GF AC PF00571 -#=GF DE CBS domain -#=GF AU Bateman A -#=GF CC CBS domains are small intracellular modules mostly found -#=GF CC in 2 or four copies within a protein. 
-#=GF SQ 67 -#=GS O31698/18-71 AC O31698 -#=GS O83071/192-246 AC O83071 -#=GS O83071/259-312 AC O83071 -#=GS O31698/88-139 AC O31698 -#=GS O31698/88-139 OS Bacillus subtilis -O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS -#=GR O83071/192-246 SA 999887756453524252..55152525....36463774777 -O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY -#=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE -O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS -#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH -O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE -#=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH -#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH -O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE -#=GR O31699/88-139 AS ________________*__________________________ -#=GR_O31699/88-139_IN ____________1______________2__________0____ -// -""" - - - -names = ("stockholm", "pfam",) -extensions = ('sth', 'stockholm', 'align') - - -header_line = re.compile(r'#\s+STOCKHOLM\s+1.\d\s+$') - -def iterseq(fin, alphabet=None): - """Iterate over the sequences in the file.""" - # Default implementation - return iter(read(fin, alphabet) ) - - -def read(fin, alphabet=None) : - alphabet = Alphabet(alphabet) - seq_ids = [] - seqs = [] - block_count = 0 - - - for token in _scan(fin): - if token.typeof== "begin_block": - block_count = 0 - elif token.typeof == "seq_id": - if len(seqs) <= block_count : - seq_ids.append(token.data) - seqs.append([]) - elif token.typeof == "seq": - if not alphabet.alphabetic(token.data) : - raise ValueError ( - "Character on line: %d not in alphabet: %s : %s" % ( - token.lineno, alphabet, token.data) ) - seqs[block_count].append(token.data) - block_count +=1 - - - seqs = [ Seq("".join(s), alphabet, name= i) for s,i in zip(seqs,seq_ids)] - return SeqList(seqs) - - -def _scan( fin ): - - header, body, block = range(3) - - yield Token("begin") - state = header - for L, line 
in enumerate(fin): - - - if state==header : - if line.isspace() : continue - m = header_line.match(line) - state = body - if m is not None : - # print "header: ", m.group() - yield Token("header", m.group() ) - continue - else : - raise ValueError("Parse error on line: %d" % L) - - - if state == body : - if line.isspace() : continue - yield Token("begin_block") - state = block - # fall through to block - - if state == block: - if line.isspace() : - yield Token("end_block") - state = body - continue - if line.strip() == '//' : - yield Token("end_block") - return - - - if line[0] =='#' : # Comment or annotation line - continue - - name_seq = line.split(None,1) # Split into two parts at first whitespace - if len(name_seq) != 2 : - raise ValueError("Parse error on line: %d" % L) - - - yield Token("seq_id", name_seq[0].strip() ) - yield Token("seq", name_seq[1].strip() ) - continue - - # END state blocks. If I ever get here something has gone terrible wrong - raise RuntimeError() - - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/seq_io/table_io.py --- a/corebio/seq_io/table_io.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,126 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. 
-# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Read and write sequence information in tab delimited format. - -This very simple format has two columns per line. The first column is a sequence name, the second column is the sequence itself. The columns are separated by a single tab ("\\t") character. - -""" -from corebio.utils import * -from corebio.seq import * -from corebio.seq_io import * - - -names = ( 'table', 'tab') -extensions = ('tbl') - - -example = """ -EC0001 MKRISTTITTTITITTGNGAG -EC0002 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAM -EC0003 MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLG -EC0004 MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLD -EC0005 MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGH -EC0006 MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLM -EC0007 MPDFFSFINSVLWGSVMIYLLFGAGCWFTFRTGFVQFRYIRQFGKSLKNS -EC0008 MTDKLTSLRQYTTVVADTGDIAAMKLYQPQDATTNPSLILNAAQIPEYRK -EC0009 MNTLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDE -EC0010 MGNTKLANPAPLGLMGFGMTTILLNLHNVGYFALDGIILAMGIFYGGIAQ -""" - - - - -def read(fin, alphabet=None): - """Read and parse file. - - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Returns: - SeqList -- A list of sequences - Raises: - ValueError -- If the file is unparsable - """ - seqs = [ s for s in iterseq(fin, alphabet)] - return SeqList(seqs) - - -def iterseq(fin, alphabet=None): - """ Parse a file and generate sequences. 
- - Args: - fin -- A stream or file to read - alphabet -- The expected alphabet of the data, if given - Yeilds: - Seq -- One alphabetic sequence at a time. - Raises: - ValueError -- If the file is unparsable - """ - alphabet = Alphabet(alphabet) - - for lineno, line in enumerate(fin) : - line = line.strip() - if line == '' : continue - - columns = line.split('\t') - if len(columns) !=2 : - raise ValueError( "Parse failed on line %d: did not find two " - "columns seperated by a tab." % (lineno) ) - yield Seq(columns[1], alphabet=alphabet, name=columns[0]) - - -def write(fout, seqs): - """Write a two column, tab delineated file. - - Args: - fout -- A writable stream. - seqs -- A list of Seq's - """ - for s in seqs : writeseq(fout, s) - - -def writeseq(fout, seq): - """ Write a single sequence in fasta format. - - Args: - afile -- A writable stream. - seq -- A Seq instance - """ - - name = seq.name or '' - print >>fout, name, '\t', seq - - - - - - - - - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/ssearch_io/__init__.py --- a/corebio/ssearch_io/__init__.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,207 +0,0 @@ - -# Copyright (c) 2006 John Gilman -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. 
-# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. - -""" Parse the output of BLAST and similar sequence search analysis reports. - -The result of a sequence database search is represented by the Report class. - o Each Report contains one or more results, one for each database query. - o Each Result contains one or more hits - o Each Hit may contain one or more Alignments (High scoring Sequence pairs) - -CoreBio is often capable of guessing the correct format: ->>> from corebio import ssearch_io ->>> afile = open("test_corebio/data/ssearch/ssearch_out.txt") ->>> report = ssearch_io.read(afile) ->>> print report - -Alternatively, each report type has a seperate module. Each module defines a -read(fin) method that can parse that report format. - ->>> from corebio.ssearch_io import fasta ->>> report = fasta.read( open("test_corebio/data/ssearch/ssearch_out.txt") ) ->>> print report - -Module Application Comments ---------------------------------------------------------------------------- -fasta FASTA / SSEARCH Default (-m 1) or compact (-m 9 -d 0) -blastxml NCBI Blast NCBI XML format - -Status: Beta -""" -# Dev. References : -# Inspired by Bioperls searchIO system -# http://www.bioperl.org/wiki/HOWTO:SearchIO - -__all__ = ['read', 'Report', 'Result', - 'Hit','Annotation', 'Alignment'] - - -from corebio.utils import stdrepr - -def read(fin) : - """ Read and parse an analysis report. - - returns : - A database search Report. 
- raises : - ValueError - If the file cannot be parsed - """ - - import fasta - import blastxml - parsers = (fasta, blastxml) - for p in parsers: - try: - return p.read(fin) - except ValueError, e: - pass - fin.seek(0) # FIXME. Non seakable stdin? - - raise ValueError("Cannot parse sequence file: Tried fasta and blastxml") - - - -class Report(object) : - """The results of a database search. The Report contains a list of 1 or more - Results, one for each query. Each query result containts a list of hits. - Each Hit contains a list of HSP's (High scoring segment pairs). - - The structure of the report will vary somewhat depending on the source. - - algorithm -- e.g. 'BLASTX' - algorithm_version -- e.g. '2.2.4 [Aug-26-2002]' - algorithm_reference -- - database_name -- e.g. 'test.fa' - database_letters -- number of residues in database e.g. 1291 - database_entries -- number of database entries - - parameters -- Dictionary of parameters used in search - - results -- A list of list of Results, one per query - """ - __slots__ = ['algorithm', 'algorithm_version', 'algorithm_reference','database_name', - 'database_letters', 'database_entries', 'parameters', 'results'] - - def __init__(self) : - for name in self.__slots__ : setattr(self, name, None) - self.parameters = {} - self.results = [] - - def __repr__(self): - return stdrepr(self) - - -class Result(object) : - """ The result from searching a database with a single query sequence. - - query -- Information about the query sequence - statistics -- A dictionary of search statistics - hits -- A list of Hits - """ - __slots__ = ['query', 'statistics', 'hits'] - - def __init__(self) : - for name in self.__slots__ : setattr(self, name, None) - self.query = Annotation() - self.statistics = {} - self.hits = [] - - def __repr__(self): - return stdrepr(self) - - -class Hit(object) : - """ A search hit between a query sequence and a subject sequence. 
- Each hit may have one or more Alignments - - target -- Information about the target sequence. - raw_score -- Typically the ignficance of the hit in bits, e.g. 92.0 - significance -- Typically evalue. e.g '2e-022' - alignments -- A list of alignments between subject and target - """ - __slots__ =['target', 'raw_score', 'bit_score', 'significance', - 'alignments'] - def __init__(self) : - for name in self.__slots__ : setattr(self, name, None) - self.target = Annotation() - self.alignments = [] - - def __repr__(self): - return stdrepr(self) - -class Annotation(object) : - """ Information about a subject or query sequence. - - name -- subject sequence name, e.g. '443893|124775' - description -- e.g. 'LaForas sequence' - length -- subject sequence length, e.g. 331 - locus -- e.g. '124775' - accession -- e.g. '443893' - """ - # Fixme: change into generic sequence annotation class? - __slots__ = ['name', 'description', 'length', 'locus', 'accession', ] - - def __init__(self): - for name in self.__slots__ : - setattr(self, name, None) - - def __repr__(self): - return stdrepr(self) - -class Alignment(object): - """An alignment between query and subject sequences. - For BLAST, these are High scoring Segment pairs (HSPs) - - raw_score -- Typically signficance of the hit in bits, e.g. 92.0 - significance -- Typically evalue. 
e.g '2e-022' - - similar -- number of conserved residues #FIXME eiter frac or num - identical -- number of identical residues - gaps -- number of gaps - length -- length of the alignment - - query_seq -- query string from alignment - target_seq -- hit string from alignment - mid_seq -- - - query_start -- - query_frame -- - - target_start -- - target_frame -- - - """ - __slots__ = ['raw_score', 'bit_score', 'significance', 'similar', - 'identical', 'gaps', 'length', 'query_seq', 'target_seq', 'mid_seq', - 'query_start', 'query_frame', 'target_start', - 'target_frame'] - - def __init__(self): - for name in self.__slots__ : - setattr(self, name, None) - - def __repr__(self): - return stdrepr(self) - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/ssearch_io/blastxml.py --- a/corebio/ssearch_io/blastxml.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,249 +0,0 @@ - -# Copyright (c) 2006 John Gilman -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. - - -"""Read BLAST XML output. - -The DTD is available at -http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.mod.dtd - -""" - -# See also -# -# http://bugzilla.open-bio.org/show_bug.cgi?id=1933 -#http://portal.open-bio.org/pipermail/biojava-dev/2004-December/002513.html - - -from corebio.ssearch_io import Report, Result, Hit, Annotation, Alignment - -import xml.sax -from xml.sax.handler import ContentHandler - -__all__ = 'read' - -def read(fin): - """Read BLAST xml output and return a list of Result objects. - """ - parser = xml.sax.make_parser() - handler = _BlastHandler() - parser.setContentHandler(handler) - - #To avoid ValueError: unknown url type: NCBI_BlastOutput.dtd - parser.setFeature(xml.sax.handler.feature_validation, 0) - parser.setFeature(xml.sax.handler.feature_namespaces, 0) - parser.setFeature(xml.sax.handler.feature_external_pes, 0) - parser.setFeature(xml.sax.handler.feature_external_ges, 0) - - try : - parser.parse(fin) - except xml.sax.SAXParseException, e : - raise ValueError( "Cannot parse file; "+str(e)) - return handler.report - -class _BlastHandler( ContentHandler) : - def __init__(self): - """ - """ - ContentHandler.__init__(self) - self._content = [] - self.report = None - self._result = None - self._hit = None - self._hsp = None - - - def characters(self, ch): - self._content.append(ch) - - def startDocument(self): - self.report = Report() - - def endDocument(self) : - pass - - def startElement(self, name, attr): - if name == 'BlastOutput' : - pass - elif name == 'Iteration' : - result = Result() - self._result = result - self.report.results.append(result) - elif name == 'Parameters' : - pass - elif name == 'Statistics' : - pass - elif name == 'Hit' : - self._hit = Hit() - 
self._result.hits.append(self._hit) - elif name == 'Hsp' : - self._hsp = Alignment() - self._hit.alignments.append(self._hsp) - else : - pass - - - def endElement(self, name): - content = ''.join(self._content).strip() - self._content = [] - - report = self.report - result = self._result - hsp = self._hsp - hit = self._hit - - if name == 'BlastOutput' : - pass - elif name == 'BlastOutput_program' : - report.algorithm = content - elif name == 'BlastOutput_version' : - report.algorithm_version = content.split()[1] - elif name == 'BlastOutput_reference' : - report.algorithm_reference = content - elif name == 'BlastOutput_db' : - report.database_name = content - elif name == 'BlastOutput_query-ID' : pass - elif name == 'BlastOutput_query-def' : pass - elif name == 'BlastOutput_query-len' : pass - elif name == 'BlastOutput_query-seq' : pass - elif name == 'BlastOutput_param' : pass - elif name == 'BlastOutput_iterations' : pass - elif name == 'BlastOutput_mbstat' : pass - - elif name == 'Iteration' : pass - elif name == 'Iteration_iter-num' : pass - elif name == 'Iteration_query-ID' : - result.query.name = content - elif name == 'Iteration_query-def' : - result.query.description = content - elif name == 'Iteration_query-len' : - result.query.length = int(content) - elif name == 'Iteration_hits' : pass - elif name == 'Iteration_stat' : pass - elif name == 'Iteration_message' : pass - - elif name == 'Parameters' : - pass - elif name == 'Parameters_matrix' : - report.parameters['matrix'] = content - elif name == 'Parameters_expect' : - report.parameters['expect'] = content - elif name == 'Parameters_include' : - report.parameters['include'] = content - elif name == 'Parameters_sc-match' : - report.parameters['sc-match'] = content - elif name == 'Parameters_sc-mismatch' : - report.parameters['sc-mismatch'] = content - elif name == 'Parameters_gap-open' : - report.parameters['gap-open'] = content - elif name == 'Parameters_gap-extend' : - report.parameters['gap-extend'] = 
content - elif name == 'Parameters_filter' : - report.parameters['filter'] = content - elif name == 'Parameters_pattern' : - report.parameters['pattern'] = content - elif name == 'Parameters_entrez-query' : - report.parameters['entrez-query'] = content - - elif name == 'Statistics' : - pass - elif name == 'Statistics_db-num' : - result.statistics['db-num'] = int(content) - elif name == 'Statistics_db-len' : - result.statistics['db-len'] = int(content) - elif name == 'Statistics_hsp-len' : - result.statistics['hsp-len'] = int(content) - elif name == 'Statistics_eff-space' : - result.statistics['eff-space'] = float(content) - elif name == 'Statistics_kappa' : - result.statistics['kappa'] = float(content) - elif name == 'Statistics_lambda' : - result.statistics['lambda'] = float(content) - elif name == 'Statistics_entropy' : - result.statistics['entropy'] = float(content) - - elif name == 'Hit' : - self._hit = None - elif name == 'Hit_num' : - pass - elif name == 'Hit_id' : - hit.target.name = content - elif name == 'Hit_def' : - hit.target.description = content - elif name == 'Hit_accession' : - hit.target.accession = content - elif name == 'Hit_len' : - hit.target.length = int(content) - elif name == 'Hit_hsps' : - pass - - elif name == 'Hsp' : - self._hsp = None - elif name == 'Hsp_num' : - pass - elif name == 'Hsp_bit-score' : - hsp.bit_score = float(content) - elif name == 'Hsp_score' : - hsp.raw_score = float(content) - elif name == 'Hsp_evalue' : - hsp.significance = float(content) - elif name == 'Hsp_query-from' : - hsp.query_start = int(content) -1 - elif name == 'Hsp_query-to' : - #hsp.query_end= int(content) - pass - elif name == 'Hsp_hit-from' : - hsp.target_start = int(content) -1 - elif name == 'Hsp_hit-to' : - #hsp.target_end = int(content) - pass - elif name == 'Hsp_pattern-from' : - pass - elif name == 'Hsp_pattern-to' : - pass - elif name == 'Hsp_query-frame' : - hsp.query_frame = int(content) - elif name == 'Hsp_hit-frame' : - hsp.target_frame = 
int(content) - elif name == 'Hsp_identity' : - hsp.identical = int(content) - elif name == 'Hsp_positive' : - hsp.similar = int(content) - elif name == 'Hsp_gaps' : - hsp.gaps = int(content) - elif name == 'Hsp_align-len' : - hsp.length = int(content) - elif name == 'Hsp_density' : - pass - elif name == 'Hsp_qseq' : - hsp.query_seq = content - elif name == 'Hsp_hseq' : - hsp.target_seq = content - elif name == 'Hsp_midline' : - hsp.mid_seq = content - else : - pass - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/ssearch_io/fasta.py --- a/corebio/ssearch_io/fasta.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,349 +0,0 @@ -# Copyright (c) 2006 John Gilman -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. - - - - -"""Read the output of a fasta sequence similarity search. 
- -FASTA is a DNA and Protein sequence alignment software package first described -by David J. Lipman and William R. Pearson in 1985. In addition to rapid -heuristic search methods, the FASTA package provides SSEARCH, an implementation -of the optimal Smith Waterman algorithm. - -The module can parse the output from fasta, ssearch and other search programs -in the fasta collection. It will parse both default ('-m 1') and compact -('-m 9 -d 0') output. - -Refs: - ftp.virginia.edu/pub/fasta - http://en.wikipedia.org/wiki/FASTA -""" - - - -from corebio.utils import Reiterate, Token, isblank -from corebio.ssearch_io import Report, Result, Hit, Annotation, Alignment -from math import floor -import re - -__all__ = 'read' - -_rangere = re.compile(r"$(\d+)-\d+:(\d+)-\d+$") - - -def read(fin) : - """Read and parse a fasta search output file. - - returns: a list of Results - """ - scanner = _scan(fin) - - report = None - result = None - hit = None - #query_seq = None - #target_seq = None - alignment_num = 0 - - - for token in scanner : - #print token - typeof = token.typeof - value = token.data - - if typeof == 'begin_report' : - report = Report() - elif typeof == 'algorithm' : - report.algorithm = value - elif typeof == 'algorithm_version' : - report.algorithm_version = value - elif typeof == 'algorithm_reference' : - report.algorithm_reference = value - elif typeof == 'database_name' : - report.database_name = value - elif typeof == 'database_letters' : - report.database_letters = value - elif typeof == 'database_entries' : - report.database_entries = value - elif typeof == 'end_report' : - # Final sanity checking - break - elif typeof == 'parameter' : - key = value[0] - value = value[1] - report.parameters[key] = value - - elif typeof == 'begin_result' : - result = Result() - report.results.append(result) - - elif typeof == 'query_name' : - result.query.name = value - elif typeof == 'query_description' : - result.query.description = value - elif typeof == 'end_result' : - 
pass - - elif typeof == 'begin_hit' : - hit = Hit() - elif typeof == 'target_name' : - hit.target.name = value - elif typeof == 'target_description' : - hit.target.description = value - elif typeof == 'target_length' : - hit.target.length = value - elif typeof == 'raw_score' : - hit.raw_score = value - elif typeof == 'bit_score' : - hit.bit_score = value - elif typeof == 'significance' : - hit.significance = value - elif typeof == 'end_hit' : - result.hits.append(hit) - hit = None - - elif typeof == 'begin_alignment' : - alignment = Alignment() - tseq = [] - qseq = [] - elif typeof == 'end_alignment' : - tseq = ''.join(tseq) - qseq = ''.join(qseq) - L = max (len(tseq), len(qseq) ) - tseq = tseq.ljust(L).replace(' ', '.') - qseq = qseq.ljust(L).replace(' ', '.') - alignment.query_seq = tseq - alignment.target_seq = qseq - result.hits[alignment_num].alignments.append(alignment) - alignment_num+=1 - tseq = None - qseq = None - elif typeof == 'target_seq' : - tseq += value - elif typeof == 'query_seq' : - qseq += value - elif typeof == 'alignment_raw_score' : - alignment.raw_score = value - - elif typeof == 'alignment_bit_score' : - alignment.bit_score = value - elif typeof == 'alignment_significance' : - alignment.significance = value - elif typeof == 'alignment_length' : - alignment.length = value - elif typeof == 'alignment_similar' : - alignment.similar = value - elif typeof == 'alignment_identical' : - alignment.identical = value - elif typeof == 'alignment_query_start' : - alignment.query_start = value - elif typeof == 'alignment_target_start' : - alignment.target_start = value - - else: - # Should never get here. 
- raise RuntimeError("Unrecoverable internal parse error (SPE)") - pass - - - return report -# End method read() - - -def _scan(fin) : - - def next_nonempty(i) : - L = i.next() - while L.strip() == '': L = i.next() - return L - - - lines = Reiterate(iter(fin)) - try : - - yield Token("begin_report", lineno= lines.index()) - - # find header line : "SSEARCH searches a sequence data bank" - L = lines.next() - - if L[0] == '#' : - yield Token("parameter", ("command", L[1:].strip()), lines.index()) - L = lines.next() - - while not L : L= lines.next() - algorithm = L.split()[0] - expected = [ "SSEARCH", "FASTA","TFASTA","FASTX", - "FASTY","TFASTX","TFASTY"] - if algorithm not in expected: - raise ValueError("Parse failed: line %d" % lines.index() ) - yield Token ("algorithm", algorithm, lines.index() ) - - # Next line should be the version - L = lines.next() - if not L.startswith(" version") : - raise ValueError("Parse failed: Cannot find version.") - yield Token( "algorithm_version", L[8:].split()[0].strip(), lines.index()) - - # Algorithm reference - L = lines.next() - if not L.startswith("Please cite:") : - raise ValueError("Parse failed: Expecting citation" + L) - cite = lines.next().strip() + ' ' + lines.next().strip() - yield Token( "algorithm_reference", cite) - - # Find line "searching testset.fa library" - L = lines.next() - while not L.startswith("searching") : L = lines.next() - yield Token("database_name", L[10:-8], lines.index() ) - - # Results - L = lines.next() - while isblank(L) : L = lines.next() - if ">>>" not in L : - raise ValueError("Parse failed on line %d: " % lines.index()) - - while ">>>" in L : - yield Token("begin_result", lineno= lines.index()) - index = L.find('>>>') - (name, description) = L[index+3:].split(' ',1) - yield Token("query_name", name, lines.index()) - yield Token("query_description", description, lines.index()) - - while not L.startswith("The best scores are:") : - L = lines.next() - L = lines.next() - # hits - while not 
isblank(L) : - lineno = lines.index() - desc = L[0:49] - yield Token("begin_hit", lineno= lineno) - yield Token("target_description", desc, lineno, 0) - yield Token("target_name", desc.split(' ',1)[0], lineno, 0) - yield Token("target_length", int(L[52:56]), lineno, 52) - fields = L[57:].split() - raw, bit, sig = fields[0], fields[1], fields[2] - #print raw, bit, sig - yield Token("raw_score", float(raw), lineno, 57) - yield Token("bit_score", float(bit), lineno) - yield Token("significance", float(sig), lineno) - yield Token("end_hit", lineno=lineno) - L = lines.next() - - # Optimal alignment information - L = next_nonempty(lines) - #print ">>>", L, L.startswith('>>') - while L.startswith('>>'): - if L.startswith('>>>') : break - - yield Token("begin_alignment", lineno=lines.index() ) - - # 1 2 3 4 - #01234567890123456789012345678901234567890123456789 - # s-w opt: 46 Z-score: 70.7 bits: 18.5 E(): 3.6 - L = lines.next() - fields = L.split() - raw, bit, sig = fields[2], fields[6], fields[8] - yield Token("alignment_raw_score", float(raw), lineno) - yield Token("alignment_bit_score", float(bit), lineno) - yield Token("alignment_significance", float(sig), lineno) - - #Smith-Waterman score: 46; 38.095% identity (71.429% similar) in 21 aa overlap (2-22:36-56) - L = lines.next() - lineno = lines.index() - fields = L.split() - assert( len(fields) ==12) - alen = int(fields[8]) - identical = int( floor(0.5+alen* float(fields[3][:-1])/100.)) - similar = int( floor(0.5+alen* float(fields[3][:-1])/100.)) - yield Token("alignment_length", alen, lineno) - yield Token("alignment_similar", similar, lineno) - yield Token("alignment_identical", identical, lineno) - - m = _rangere.match( fields[11]) - assert (m is not None) - yield Token("alignment_query_start", int(m.group(1))-1, lineno) - yield Token("alignment_target_start", int(m.group(2))-1, lineno) - - - count = 1 - while True: - L = lines.next() - count += 1 - - - - if L.startswith('>>'): break - if '>>>' in L: - lines.push(L) 
- break - if 'residues' in L and 'sequences' in L : - lines.push(L) - break - if not L or L[0].isspace() : continue - - - # there are 2 lines before the first query sequence (but - # we started the count at 1). There is 1 line between query - # and target, 3 lines between target and query, unless the - # query ends before the ends and the target wraps onto another - # Then there are two lines between target and target. - -# Smith-Waterman score: 34; 35.294% identity ... -# -# 30 40 50 60 70 -# d1pfsa EGFLHLEDKPHPLQCQFFVESVIPAGSYQVPYRINVNNG-RPELAFDFKAMKRA -# : . . .:: .: .:: -# d8rxna MKKYVCTVCGYEYDPAEGDPDNGVKPGTSFDDLPADWVCPVCGA -# 10 20 30 40 -# -# d8rxna PKSEFEAA -# 50 - - lineno=lines.index() - if count==4 : - yield Token("query_seq", L[7:].rstrip(), lineno) - else : - yield Token("target_seq", L[7:].rstrip(),lineno) - count = 0 - - yield Token("end_alignment", lineno=lines.index() ) - yield Token("end_result", lineno= lines.index()) - L = next_nonempty(lines) - # End results - - # "13355 residues in 93 query sequences" - # "13355 residues in 93 library sequences" - #print '>>', L - LL = L.split() - yield Token("database_letters",int(LL[0]), lines.index() ) - yield Token("database_entries", int(LL[3]), lines.index() ) - - yield Token("end_report", lineno= lines.index()) - except StopIteration : - raise ValueError("Premature end of file ") - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/transform.py --- a/corebio/transform.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,605 +0,0 @@ -# Copyright (c) 2006 John Gilman -# -# This software is distributed under the MIT Open Source License. 
-# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. - -""" Transformations of Seqs (alphabetic sequences). - - - -Classes : -- Transform -- Simple transforms of alphabetic strings. -- GeneticCode -- The genetic mapping of dna to protein. - -Functions : -- mask_low_complexity -- Implementation of Seg algorithm to remove low complexity - regions from protein sequences. - - -""" - - -from corebio.data import dna_extended_letters, dna_ambiguity -from corebio.seq import Seq, protein_alphabet, nucleic_alphabet, dna_alphabet -from string import maketrans -from corebio.moremath import log2 , entropy - -__all__ = [ - 'Transform', - 'mask_low_complexity', - 'GeneticCode' - ] - -class Transform(object) : - """A translation between alphabetic strings. - (This class is not called 'Translation' to avoid confusion with the - biological translation of rna to protein.) 
- - Example: - trans = Transform( - Seq("ACGTRYSWKMBDHVN-acgtUuryswkmbdhvnXx?.~'", dna_alphabet), - Seq("ACGTRYSWKMNNNNN-acgtUuryswkmbnnnnXx?.~", reduced_nucleic_alphabet) - ) - s0 = Seq("AAAAAV", nucleic_alphabet) - s1 = trans(s0) - assert(s1.alphabet == reduced_nucleic_alphabet) - assert(s2 == Seq("AAAAAN", reduced_nucleic_alphabet) - - Status : Beta - """ - - __slots__ = ["table", "source", "target"] - def __init__(self, source, target) : - - self.table = maketrans(source, target) - self.source = source - self.target = target - - - def __call__(self, seq) : - """Translate sequence.""" - if not self.source.alphabet.alphabetic(seq) : - raise ValueError("Incompatable alphabets") - s = str.translate(seq, self.table) - cls = self.target.__class__ - return cls(s, self.target.alphabet, seq.name, seq.description) -# End class Translation - -# FIXME: Test, document, add to seq. -dna_complement = Transform( - Seq("ACGTRYSWKMBDHVN-acgtUuryswkmbdhvnXx?.~", dna_alphabet), - Seq("TGCAYRSWMKVHDBN-tgcaAayrswmkvhdbnXx?.~", dna_alphabet), - ) - - - -def mask_low_complexity(seq, width =12, trigger=1.8, extension=2.0, mask='X') : - """ Mask low complexity regions in protein sequences. - - Uses the method of Seg [1] by Wootton & Federhen [2] to divide a sequence - into regions of high and low complexity. The sequence is divided into - overlapping windows. Low complexity windows either have a sequence entropy - less that the trigger complexity, or have an entropy less than the extension - complexity and neighbor other low-complexity windows. The sequence within - low complexity regions are replaced with the mask character (default 'X'), - and the masked alphabetic sequence is returned. - - The default parameters, width=12, trigger=1.8, extension=2.0, mask='X' are - suitable for masking protein sequences before a database search. 
The - standard default seg parameters are width=12, trigger=2.2, extension=2.5 - - Arguments: - Seq seq -- An alphabetic sequence - int width -- Window width - float trigger -- Entropy in bits between 0 and 4.3.. ( =log_2(20) ) - float extension -- Entropy in bits between 0 and 4.3.. ( =log_2(20) ) - char mask -- The mask character (default: 'X') - Returns : - Seq -- A masked alphabetic sequence - Raises : - ValueError -- On invalid arguments - Refs: - [1] seg man page: - http://bioportal.weizmann.ac.il/education/materials/gcg/seg.html - [2] Wootton & Federhen (Computers and Chemistry 17; 149-163, (1993)) - Authors: - GEC 2005 - Future : - - Optional mask character. - - Option to lower case masked symbols. - - Remove arbitary restriction to protein. - """ - - lg20 = log2(20) - if trigger<0 or trigger>lg20 : - raise ValueError("Invalid trigger complexity: %f"% trigger) - if extension<0 or extension>lg20 or extension<trigger: - raise ValueError("Invalid extension complexity: %f"% extension) - if width<0 : - raise ValueError("Invalid width: %d"% width) - - if width > len(seq) : return seq - - s = seq.ords() - - X = seq.alphabet.ord(mask) - - - nwindows = len(seq)- width +1 - ent = [ 0 for x in range(0, nwindows)] - count = [ 0 for x in range(0, len(seq.alphabet) )] - - for c in s[0:width] : count[c] +=1 - ent[0] = entropy(count,2) - - for i in range(1, nwindows) : - count[ s[i-1] ] -= 1 - count[ s[i+width-1] ] +=1 - ent[i] = entropy(count,2) - - prev_segged = False - for i in range(0, nwindows) : - if ((prev_segged and ent[i]< extension) or - ent[i]< trigger) : - for j in range(0, width) : s[i+j]=X - prev_segged=True - else : - prev_segged = False - - - # Redo, only backwards - prev_segged = False - for i in range(nwindows-1, -1, -1) : - if ((prev_segged and ent[i]< extension) or - ent[i]< trigger) : - for j in range(0, width) : s[i+j]=X - prev_segged=True - else : - prev_segged = False - - - return seq.alphabet.chrs(s) -# end mask_low_complexity() - - -class GeneticCode(object): - """An encoding of amino acids by DNA triplets.
- - Example : - - Genetic Code [1]: Standard - T C A G - +---------+---------+---------+---------+ - T | TTT F | TCT S | TAT Y | TGT C | T - T | TTC F | TCC S | TAC Y | TGC C | C - T | TTA L | TCA S | TAA Stop| TGA Stop| A - T | TTG L(s)| TCG S | TAG Stop| TGG W | G - +---------+---------+---------+---------+ - C | CTT L | CCT P | CAT H | CGT R | T - C | CTC L | CCC P | CAC H | CGC R | C - C | CTA L | CCA P | CAA Q | CGA R | A - C | CTG L(s)| CCG P | CAG Q | CGG R | G - +---------+---------+---------+---------+ - A | ATT I | ACT T | AAT N | AGT S | T - A | ATC I | ACC T | AAC N | AGC S | C - A | ATA I | ACA T | AAA K | AGA R | A - A | ATG M(s)| ACG T | AAG K | AGG R | G - +---------+---------+---------+---------+ - G | GTT V | GCT A | GAT D | GGT G | T - G | GTC V | GCC A | GAC D | GGC G | C - G | GTA V | GCA A | GAA E | GGA G | A - G | GTG V | GCG A | GAG E | GGG G | G - +---------+---------+---------+---------+ - - - See Also : - -- http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c - -- http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html#7.5 - Authors: - JXG, GEC - """ - # TODO: Explain use of '?' in translated sequence. - # TODO: Does translate fails with aproriate execption when fed gaps? - # TODO: Can back_translate handle gaps? - - def __init__(self, ident, description, - amino_acid, start, base1, base2, base3): - """Create a new GeneticCode. - - Args: - -- ident - Standarad identifier (Or zero). An integer - -- description - -- amino acid - A sequecne of amino acids and stop codons. e.g. - "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG" - -- start - A sequence indicating start codons, e.g., - "---M---------------M---------------M----------------------------" - -- base1 - The first base of each codon. e.g., - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG" - -- base2 - The second base of each codon. 
e.g., - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG" - -- base3 - The last base of each codon. e.g., - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG" - """ - self.ident = ident - self.description = description - - self.amino_acid = amino_acid - self.start = start - self.base1 = base1 - self.base2 = base2 - self.base3 = base3 - - stop_codons = [] - start_codons = [] - for i, a in enumerate(amino_acid) : - codon = base1[i] + base2[i] + base3[i] - if a=='*' : stop_codons.append(codon) - if start[i] == 'M': start_codons.append(codon) - - self.stop_codons = tuple(stop_codons) - self.start_codons = tuple(start_codons) - - # Building the full translation table is expensive, - # so we avoid doing so until necessary. - self._table = None - self._back_table = None - - #@staticmethod - def std_list(): - "Return a list of standard genetic codes." - return _codon_tables - std_list = staticmethod(std_list) - - #@staticmethod - def std(): - "The standard 'universal' genetic code." - return _codon_tables[0] - std = staticmethod(std) - - - #@staticmethod - def by_name(name) : - """Find a genetic code in the code list by name or identifier. 
- """ - for t in _codon_tables : - if t.ident == name or t.description == name : - return t - raise ValueError("No such translation table: %s" % str(name) ) - by_name = staticmethod(by_name) - - - def _get_table(self) : - if self._table is None : self._create_table() - return self._table - table = property(_get_table, None, "A map between codons and amino acids") - - def _get_back_table(self) : - if self._back_table is None : - self._create_table() - return self._back_table - back_table = property(_get_back_table, None, "A map between amino acids and codons") - - - def _create_table(self) : - aa = self.amino_acid - base1 = self.base1 - base2 = self.base2 - base3 = self.base3 - - # Construct a table of unambiguous codon translations - table = {} - for i, a in enumerate(aa) : - codon = base1[i] + base2[i] + base3[i] - table[codon] = a - - # Build the back table. - back_table = {} - items = table.items() - items.sort() - for codon, aa in items[::-1] : - back_table[aa] = codon # Use first codon, alphabetically. - back_table['X'] = 'NNN' - back_table['B'] = 'NNN' - back_table['Z'] = 'NNN' - back_table['J'] = 'NNN' - self._back_table = back_table - - ltable = {} - letters = dna_extended_letters+'U' # include RNA in table - - # Create a list of all possble codons - codons = [] - for c1 in letters: - for c2 in letters: - for c3 in letters : - codons.append( c1+c2+c3) - - # For each ambiguous codon, construct all compatible unambiguous codons. - # Translate and collect a set of all possible translated amino acids. - # If more than one translation look for possible amino acid ambiguity - # codes. 
- for C in codons : - translated = dict() # Use dict, because no set in py2.3 - c = C.replace('U', 'T') # Convert rna codon to dna - for c1 in dna_ambiguity[c[0]]: - for c2 in dna_ambiguity[c[1]]: - for c3 in dna_ambiguity[c[2]]: - aa = table[ c1+c2+c3 ] - translated[aa] = '' - translated = list(translated.keys()) - translated.sort() - if len(translated) ==1 : - trans = list(translated)[0] - elif translated == ['D','N'] : - trans = 'B' - elif translated == ['E','Q'] : - trans = 'Z' - elif translated == ['I','L'] : - trans = 'J' - elif '*' in translated: - trans = '?' - else : - trans = 'X' - ltable[C] = trans - - self._table = ltable - # End create tables - - def translate(self, seq, frame=0) : - """Translate a DNA sequence to a polypeptide using full - IUPAC ambiguities in DNA/RNA and amino acid codes. - - Returns : - -- Seq - A polypeptide sequence - """ - # TODO: Optimize. - # TODO: Insanity check alphabet. - seq = str(seq) - table = self.table - trans = [] - L = len(seq) - for i in range(frame, L-2, 3) : - codon = seq[i:i+3].upper() - trans.append( table[codon]) - return Seq(''.join(trans), protein_alphabet) - - - def back_translate(self, seq) : - """Convert protein back into coding DNA. - - Args: - -- seq - A polypeptide sequence. - - Returns : - -- Seq - A dna sequence - """ - # TODO: Optimzie - # TODO: Insanity check alphabet. - table = self.back_table - seq = str(seq) - trans = [ table[a] for a in seq] - return Seq(''.join(trans), dna_alphabet) - - #TODO: translate_orf(self, seq, start) ? - #TODO: translate_to_stop(self, seq, frame) ? - #TODO: translate_all_frames(self,seq) -> 6 translations. 
- - def __repr__(self) : - string = [] - string += 'GeneticCode( %d, "' % self.ident - string += self.description - string += '", \n' - string += ' amino_acid = "' - string += self.amino_acid - string += '",\n' - string += ' start = "' - string += self.start - string += '",\n' - string += ' base1 = "' - string += self.base1 - string += '",\n' - string += ' base2 = "' - string += self.base2 - string += '",\n' - string += ' base3 = "' - string += self.base3 - string += '" )' - return ''.join(string) - - - def __str__(self) : - """Returns a text representation of this genetic code.""" - # Inspired by http://bugzilla.open-bio.org/show_bug.cgi?id=1963 - letters = "TCAG" # Convectional ordering for codon tables. - string = [] - - if self.ident : - string += 'Genetic Code [%d]: ' % self.ident - else : - string += 'Genetic Code: ' - string += self.description or '' - - string += "\n " - string += " ".join( [" %s " % c2 for c2 in letters] ) - - string += "\n +" - string += "+".join(["---------" for c2 in letters]) + "+ " - - table = self.table - - for c1 in letters : - for c3 in letters : - string += '\n ' - string += c1 - string += " |" - for c2 in letters : - codon = c1+c2+c3 - string += " " + codon - if codon in self.stop_codons : - string += " Stop|" - else : - amino = table.get(codon, '?') - if codon in self.start_codons : - string += " %s(s)|" % amino - else : - string += " %s |" % amino - string += " " + c3 - - string += "\n +" - string += "+".join(["---------" for c2 in letters]) - string += "+ " - string += '\n' - return ''.join(string) -# end class GeneticCode - - -# Data from http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html#7.5 -# Aug. 
2006 -# Genetic Code Tables -# -# Authority International Sequence Databank Collaboration -# Contact NCBI -# Scope /transl_table qualifier -# URL http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c -_codon_tables = ( - GeneticCode(1, "Standard", - "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "---M---------------M---------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(2, "Vertebrate Mitochondrial", - "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG", - "--------------------------------MMMM---------------M------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(3, "Yeast Mitochondrial", - "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "----------------------------------MM----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(4, "Mold, Protozoan, Coelenterate Mitochondrial & Mycoplasma/Spiroplasma", - "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "--MM---------------M------------MMMM---------------M------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(5, "Invertebrate Mitochondrial", - "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG", - 
"---M----------------------------MMMM---------------M------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(6, "Ciliate, Dasycladacean and Hexamita Nuclear", - "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "-----------------------------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(9, "Echinoderm and Flatworm Mitochondrial", - "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG", - "-----------------------------------M---------------M------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(10, "Euplotid Nuclear", - "FFLLSSSSYY**CCCWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "-----------------------------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(11, "Bacterial and Plant Plastid", - "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "---M---------------M------------MMMM---------------M------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(12, "Alternative Yeast Nuclear", - 
"FFLLSSSSYY**CC*WLLLSPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "-------------------M---------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(13,"Ascidian Mitochondrial", - "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSGGVVVVAAAADDEEGGGG", - "-----------------------------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(14, "Alternative Flatworm Mitochondrial", - "FFLLSSSSYYY*CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG", - "-----------------------------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(15, "Blepharisma Nuclear", - "FFLLSSSSYY*QCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "-----------------------------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(16, "Chlorophycean Mitochondrial", - "FFLLSSSSYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "-----------------------------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(21, "Trematode Mitochondrial", - 
"FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNNKSSSSVVVVAAAADDEEGGGG", - "-----------------------------------M---------------M------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(22, "Scenedesmus obliquus Mitochondrial", - "FFLLSS*SYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "-----------------------------------M----------------------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"), - - GeneticCode(23,"Thraustochytrium Mitochondrial", - "FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", - "--------------------------------M--M---------------M------------", - "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG", - "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG", - "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG",), - ) - - - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/utils/__init__.py --- a/corebio/utils/__init__.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,470 +0,0 @@ - - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. 
-# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - - -"""Extra utilities and core classes not in standard python. -""" - - -__all__ = ('isblank', 'isfloat', 'isint', 'fcmp', 'remove_whitespace', - 'invert_dict','update', 'stdrepr', 'Token', 'Struct', 'Reiterate', - 'deoptparse', 'crc32', 'crc64', 'FileIndex', 'find_command', - 'ArgumentError', 'frozendict') - -import os.path -import math - -def isblank( string) : - """Is this whitespace or an empty string?""" - if string == '' : return True - return string.isspace() - -def isfloat(s): - """Does this object represent a floating point number? """ - try: - float(s) - return True - except (ValueError, TypeError): - return False - -def isint(s): - """Does this object represent an integer?""" - try: - int(s) - return True - except (ValueError, TypeError): - return False - -def fcmp(x, y, precision): - """Floating point comparison.""" - # TODO: Doc string, default precision. 
Test - if math.fabs(x-y) < precision: - return 0 - elif x < y: - return -1 - return 1 - -def remove_whitespace( astring) : - """Remove all whitespace from a string.""" - # TODO: Is this horrible slow? - return "".join(astring.split()) - - -def invert_dict( dictionary) : - """Constructs a new dictionary with inverted mappings so that keys become - values and vice versa. If the values of the original dictionary are not - unique then only one of the original kesys will be included in the new - dictionary. - """ - return dict( [(value, key) for key, value in dictionary.iteritems()] ) - - - -def update(obj, **entries): - """Update an instance with new values. - - >>> update({'a': 1}, a=10, b=20) - {'a': 10, 'b': 20} - """ - if hasattr(obj, 'update') : - obj.update( entries) - else : - for k, v in entries.iteritems() : - setattr(obj, k, v) - return obj - - - -def stdrepr( obj, attributes=None, name=None) : - """Create a standard representation of an object.""" - if name==None : name = obj.__class__.__name__ - if attributes==None: attributes = obj.__class__.__slots__ - args = [] - for a in attributes : - args.append( '%s=%s' % ( a, repr( getattr(obj, a) ) ) ) - args = ',\n'.join(args).replace('\n', '\n ') - return '%s(\n %s\n)' % (name, args) - - -class Token(object): - """Represents the items returned by a file scanner, normally processed - by a parser. 
- - Attributes : - o typeof -- a string describing the kind of token - o data -- the value of the token - o lineno -- the line of the file on which the data was found (if known) - o offset -- the offset of the data within the line (if known) - """ - __slots__ = [ 'typeof', 'data', 'lineno', 'offset'] - def __init__(self, typeof, data=None, lineno=-1, offset=-1) : - self.typeof = typeof - self.data = data - self.lineno = lineno - self.offset = offset - - def __repr__(self) : - return stdrepr( self) - - def __str__(self): - coord = str(self.lineno) - if self.offset != -1 : coord += ':'+str(self.offset) - coord = coord.ljust(7) - return (coord+ ' '+ self.typeof +' : ').ljust(32)+ str(self.data or '') - - - -def Struct(**kwargs) : - """Create a new instance of an anonymous class with the supplied attributes - and values. - - >>> s = Struct(a=3,b=4) - >>> s - Struct( - a=3, - b=4 - ) - >>> s.a - 3 - - """ - name = 'Struct' - - def _init(obj, **kwargs) : - for k, v in kwargs.iteritems() : - setattr( obj, k, v) - - def _repr(obj) : - return stdrepr( obj, obj.__slots__, name) - - adict = {} - adict['__slots__'] = kwargs.keys() - adict['__init__'] = _init - adict['__repr__'] = _repr - - return type( name, (object,) , adict)(**kwargs) - - -class Reiterate(object): - """ A flexible wrapper around a simple iterator. - """ - def __new__(cls, iterator): - if isinstance(iterator, cls) : return iterator - new = object.__new__(cls) - new._iterator = iter(iterator) - new._stack = [] - new._index = 0 - return new - - def __init__(self, *args, **kw): - pass - - - def __iter__(self): - return self - - def next(self): - """Return the next item in the iteration.""" - self._index +=1 - if self._stack : - return self._stack.pop() - else: - return self._iterator.next() - - def index(self) : - """The number of items returned. 
Incremented by next(), Decremented - by push(), unchanged by peek() """ - return self._index - - def push(self, item) : - """Push an item back onto the top of the iterator,""" - self._index -=1 - self._stack.append(item) - - def peek(self) : - """Returns the next item, but does not advance the iteration. - Returns None if no more items. (Bit may also return None as the - next item.)""" - try : - item = self.next() - self.push(item) - return item - except StopIteration: - return None - - def has_item(self) : - """More items to return?""" - try : - item = self.next() - self.push(item) - return True - except StopIteration: - return False - - def filter(self, predicate): - """Return the next item in the iteration that satisifed the - predicate.""" - next = self.next() - while not predicate(next) : next = self.next() - return next -# End class Reiterate - - - - - -def crc32(string): - """Return the standard CRC32 checksum as a hexidecimal string.""" - import binascii - return "%08X"% binascii.crc32(string) - -_crc64_table =None - -def crc64(string): - """ Calculate ISO 3309 standard cyclic redundancy checksum. - Used, for example, by SWISS-PROT. - - Returns : The CRC as a hexadecimal string. - - Reference: - o W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, - "Numerical recipes in C", 2nd ed., Cambridge University Press. Pages 896ff. 
- """ - # Adapted from biopython, which was adapted from bioperl - global _crc64_table - if _crc64_table is None : - # Initialisation of CRC64 table - table = [] - for i in range(256): - l = i - part_h = 0 - for j in range(8): - rflag = l & 1 - l >>= 1 - if part_h & 1: l |= (1L << 31) - part_h >>= 1L - if rflag: part_h ^= 0xd8000000L - table.append(part_h) - _crc64_table= tuple(table) - - crcl = 0 - crch = 0 - for c in string: - shr = (crch & 0xFF) << 24 - temp1h = crch >> 8 - temp1l = (crcl >> 8) | shr - idx = (crcl ^ ord(c)) & 0xFF - crch = temp1h ^ _crc64_table[idx] - crcl = temp1l - - return "%08X%08X" % (crch, crcl) -# End crc64 - - -class FileIndex(object) : - """Line based random access to a file. Quickly turn a file into a read-only - database. - - Attr: - - indexfile -- The file to be indexed. Can be set to None and latter - replaced with a new file handle, for exampel, if you need to - close and latter reopen the file. - - Bugs: - User must set the indexedfile to None before pickling this class. - - """ - __slots__ = [ 'indexedfile', '_parser', '_positions', '_keys', '_key_dict'] - - def __init__(self, indexedfile, linekey = None, parser=None) : - """ - - Args: - - indexedfile -- The file to index - - linekey -- An optional function. keyofline() will be passed each line - of the file in turn and should return a string to index the line, - or None. If keyofline() is supplied, then only lines that generate - keys are indexed. - - parser -- An optional parser. A function that reads from a file handle - positioned at the start of a record and returns an object. 
- """ - - def default_parser(seekedfile) : - return seekedfile.readline() - - if parser is None : parser = default_parser - self._parser = parser - - indexedfile.seek(0) - positions = [] - keys = [] - - while True : - position = indexedfile.tell() - line = indexedfile.readline() - if line == '' : break - - if linekey : - k = linekey(line) - if k is None: continue - keys.append(k) - - positions.append(position) - - self.indexedfile = indexedfile - self._positions = tuple(positions) - - if linekey : - self._keys = tuple(keys) - self._key_dict = dict( zip(keys, positions)) - - - def tell(self, item) : - if isinstance(item, str) : - p = self._key_dict[item] - else : - p = self._positions[item] - return p - - def seek(self, item) : - """Seek the indexfile to the position of item.""" - self.indexedfile.seek(self.tell(item)) - - def __iter__(self) : - for i in range(0, len(self)) : - yield self[i] - - def __len__(self) : - return len(self._positions) - - def __getitem__(self, item) : - self.indexedfile.seek(self.tell(item)) - return self._parser(self.indexedfile) - - def __contains__(self, item) : - try: - self.tell(item) - return True - except KeyError : - return False - except IndexError : - return False - -# End class FileIndex - - -def find_command(command, path=None): - """Return the full path to the first match of the given command on - the path. - - Arguments: - - command -- is a the name of the executable to search for. - - path -- is an optional alternate path list to search. The default it - to use the COREBIOPATH environment variable, if it exists, else the - PATH environment variable. - - Raises: - - EnvironmentError -- If no match is found for the command. - - By default the COREBIO or PATH environment variable is searched (as well - as, on Windows, the AppPaths key in the registry), but a specific 'path' - list to search may be specified as well. 
- - Author: Adapted from code by Trent Mick (TrentM@ActiveState.com) - See: http://trentm.com/projects/which/ - """ - import _which - if path is None : - path = os.environ.get("COREBIOPATH", "").split(os.pathsep) - if path==['']: path = None - - try : - match =_which.whichgen(command, path).next() - except StopIteration, _which.WhichError: - raise EnvironmentError("Could not find '%s' on the path." % command) - return match - - - -class ArgumentError(ValueError) : - """ A subclass of ValueError raised when a function receives an argument - that has the right type but an inappropriate value, and the situation is not - described by a more precise exception such as IndexError. The name of the - argument or component at fault and (optionally) the value are also stored. - """ - - def __init__(self, message, key, value=None) : - """ Args: - - message -- An error message. - - key -- The name of the argument or component at fault. - - value -- Optional value of the argument. - """ - ValueError.__init__(self, message) - self.key = key - self.value = value -# end class ArgumentError - - -class frozendict(dict): - """A frozendict is a dictionary that cannot be modified after being created - - but it is hashable and may serve as a member of a set or a key in a - dictionary. - # Author: Adapted from code by Oren Tirosh - """ - # See: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/414283 - - def _blocked_attribute(obj): - raise AttributeError, "A frozendict cannot be modified." 
- _blocked_attribute = property(_blocked_attribute) - - __delitem__ = _blocked_attribute - __setitem__ = _blocked_attribute - clear = _blocked_attribute - pop = _blocked_attribute - popitem = _blocked_attribute - setdefault = _blocked_attribute - update = _blocked_attribute - - def __new__(cls, *args, **kw): - new = dict.__new__(cls) - dict.__init__(new, *args, **kw) - return new - - def __init__(self, *args, **kw): - pass - - def __hash__(self): - try: - return self._cached_hash - except AttributeError: - # Hash keys, not items, since items can be mutable and unhasahble. - h = self._cached_hash = hash(tuple(sorted(self.keys()))) - return h - - def __repr__(self): - return "frozendict(%s)" % dict.__repr__(self) -# end class frozendict - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 corebio/utils/_which.py --- a/corebio/utils/_which.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,335 +0,0 @@ -#!/usr/bin/env python -# Copyright (c) 2002-2005 ActiveState Corp. -# See LICENSE.txt for license details. -# Author: -# Trent Mick (TrentM@ActiveState.com) -# Home: -# http://trentm.com/projects/which/ - -r"""Find the full path to commands. - -which(command, path=None, verbose=0, exts=None) - Return the full path to the first match of the given command on the - path. - -whichall(command, path=None, verbose=0, exts=None) - Return a list of full paths to all matches of the given command on - the path. - -whichgen(command, path=None, verbose=0, exts=None) - Return a generator which will yield full paths to all matches of the - given command on the path. - -By default the PATH environment variable is searched (as well as, on -Windows, the AppPaths key in the registry), but a specific 'path' list -to search may be specified as well. On Windows, the PATHEXT environment -variable is applied as appropriate. - -If "verbose" is true then a tuple of the form - (, ) -is returned for each match. 
The latter element is a textual description -of where the match was found. For example: - from PATH element 0 - from HKLM\SOFTWARE\...\perl.exe -""" - -_cmdlnUsage = """ - Show the full path of commands. - - Usage: - which [...] [...] - - Options: - -h, --help Print this help and exit. - -V, --version Print the version info and exit. - - -a, --all Print *all* matching paths. - -v, --verbose Print out how matches were located and - show near misses on stderr. - -q, --quiet Just print out matches. I.e., do not print out - near misses. - - -p , --path= - An alternative path (list of directories) may - be specified for searching. - -e , --exts= - Specify a list of extensions to consider instead - of the usual list (';'-separate list, Windows - only). - - Show the full path to the program that would be run for each given - command name, if any. Which, like GNU's which, returns the number of - failed arguments, or -1 when no was given. - - Near misses include duplicates, non-regular files and (on Un*x) - files without executable access. 
-""" - -__revision__ = "$Id: which.py 430 2005-08-20 03:11:58Z trentm $" -__version_info__ = (1, 1, 0) -__version__ = '.'.join(map(str, __version_info__)) - -import os -import sys -import getopt -import stat - - -#---- exceptions - -class WhichError(Exception): - pass - - - -#---- internal support stuff - -def _getRegisteredExecutable(exeName): - """Windows allow application paths to be registered in the registry.""" - registered = None - if sys.platform.startswith('win'): - if os.path.splitext(exeName)[1].lower() != '.exe': - exeName += '.exe' - import _winreg - try: - key = "SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\App Paths\\" +\ - exeName - value = _winreg.QueryValue(_winreg.HKEY_LOCAL_MACHINE, key) - registered = (value, "from HKLM\\"+key) - except _winreg.error: - pass - if registered and not os.path.exists(registered[0]): - registered = None - return registered - -def _samefile(fname1, fname2): - if sys.platform.startswith('win'): - return ( os.path.normpath(os.path.normcase(fname1)) ==\ - os.path.normpath(os.path.normcase(fname2)) ) - else: - return os.path.samefile(fname1, fname2) - -def _cull(potential, matches, verbose=0): - """Cull inappropriate matches. Possible reasons: - - a duplicate of a previous match - - not a disk file - - not executable (non-Windows) - If 'potential' is approved it is returned and added to 'matches'. - Otherwise, None is returned. 
- """ - for match in matches: # don't yield duplicates - if _samefile(potential[0], match[0]): - if verbose: - sys.stderr.write("duplicate: %s (%s)\n" % potential) - return None - else: - if not stat.S_ISREG(os.stat(potential[0]).st_mode): - if verbose: - sys.stderr.write("not a regular file: %s (%s)\n" % potential) - elif not os.access(potential[0], os.X_OK): - if verbose: - sys.stderr.write("no executable access: %s (%s)\n"\ - % potential) - else: - matches.append(potential) - return potential - - -#---- module API - -def whichgen(command, path=None, verbose=0, exts=None): - """Return a generator of full paths to the given command. - - "command" is a the name of the executable to search for. - "path" is an optional alternate path list to search. The default it - to use the PATH environment variable. - "verbose", if true, will cause a 2-tuple to be returned for each - match. The second element is a textual description of where the - match was found. - "exts" optionally allows one to specify a list of extensions to use - instead of the standard list for this system. This can - effectively be used as an optimization to, for example, avoid - stat's of "foo.vbs" when searching for "foo" and you know it is - not a VisualBasic script but ".vbs" is on PATHEXT. This option - is only supported on Windows. - - This method returns a generator which yields either full paths to - the given command or, if verbose, tuples of the form (, ). - """ - matches = [] - if path is None: - usingGivenPath = 0 - path = os.environ.get("PATH", "").split(os.pathsep) - if sys.platform.startswith("win"): - path.insert(0, os.curdir) # implied by Windows shell - else: - usingGivenPath = 1 - - # Windows has the concept of a list of extensions (PATHEXT env var). - if sys.platform.startswith("win"): - if exts is None: - exts = os.environ.get("PATHEXT", "").split(os.pathsep) - # If '.exe' is not in exts then obviously this is Win9x and - # or a bogus PATHEXT, then use a reasonable default. 
- for ext in exts: - if ext.lower() == ".exe": - break - else: - exts = ['.COM', '.EXE', '.BAT'] - elif not isinstance(exts, list): - raise TypeError("'exts' argument must be a list or None") - else: - if exts is not None: - raise WhichError("'exts' argument is not supported on "\ - "platform '%s'" % sys.platform) - exts = [] - - # File name cannot have path separators because PATH lookup does not - # work that way. - if os.sep in command or os.altsep and os.altsep in command: - pass - else: - for i in range(len(path)): - dirName = path[i] - # On windows the dirName *could* be quoted, drop the quotes - if sys.platform.startswith("win") and len(dirName) >= 2\ - and dirName[0] == '"' and dirName[-1] == '"': - dirName = dirName[1:-1] - for ext in ['']+exts: - absName = os.path.abspath( - os.path.normpath(os.path.join(dirName, command+ext))) - if os.path.isfile(absName): - if usingGivenPath: - fromWhere = "from given path element %d" % i - elif not sys.platform.startswith("win"): - fromWhere = "from PATH element %d" % i - elif i == 0: - fromWhere = "from current directory" - else: - fromWhere = "from PATH element %d" % (i-1) - match = _cull((absName, fromWhere), matches, verbose) - if match: - if verbose: - yield match - else: - yield match[0] - match = _getRegisteredExecutable(command) - if match is not None: - match = _cull(match, matches, verbose) - if match: - if verbose: - yield match - else: - yield match[0] - - -def which(command, path=None, verbose=0, exts=None): - """Return the full path to the first match of the given command on - the path. - - "command" is a the name of the executable to search for. - "path" is an optional alternate path list to search. The default it - to use the PATH environment variable. - "verbose", if true, will cause a 2-tuple to be returned. The second - element is a textual description of where the match was found. - "exts" optionally allows one to specify a list of extensions to use - instead of the standard list for this system. 
This can - effectively be used as an optimization to, for example, avoid - stat's of "foo.vbs" when searching for "foo" and you know it is - not a VisualBasic script but ".vbs" is on PATHEXT. This option - is only supported on Windows. - - If no match is found for the command, a WhichError is raised. - """ - try: - match = whichgen(command, path, verbose, exts).next() - except StopIteration: - raise WhichError("Could not find '%s' on the path." % command) - return match - - -def whichall(command, path=None, verbose=0, exts=None): - """Return a list of full paths to all matches of the given command - on the path. - - "command" is a the name of the executable to search for. - "path" is an optional alternate path list to search. The default it - to use the PATH environment variable. - "verbose", if true, will cause a 2-tuple to be returned for each - match. The second element is a textual description of where the - match was found. - "exts" optionally allows one to specify a list of extensions to use - instead of the standard list for this system. This can - effectively be used as an optimization to, for example, avoid - stat's of "foo.vbs" when searching for "foo" and you know it is - not a VisualBasic script but ".vbs" is on PATHEXT. This option - is only supported on Windows. - """ - return list( whichgen(command, path, verbose, exts) ) - - - -#---- mainline - -def main(argv): - all = 0 - verbose = 0 - altpath = None - exts = None - try: - optlist, args = getopt.getopt(argv[1:], 'haVvqp:e:', - ['help', 'all', 'version', 'verbose', 'quiet', 'path=', 'exts=']) - except getopt.GetoptError, msg: - sys.stderr.write("which: error: %s. 
Your invocation was: %s\n"\ - % (msg, argv)) - sys.stderr.write("Try 'which --help'.\n") - return 1 - for opt, optarg in optlist: - if opt in ('-h', '--help'): - print _cmdlnUsage - return 0 - elif opt in ('-V', '--version'): - print "which %s" % __version__ - return 0 - elif opt in ('-a', '--all'): - all = 1 - elif opt in ('-v', '--verbose'): - verbose = 1 - elif opt in ('-q', '--quiet'): - verbose = 0 - elif opt in ('-p', '--path'): - if optarg: - altpath = optarg.split(os.pathsep) - else: - altpath = [] - elif opt in ('-e', '--exts'): - if optarg: - exts = optarg.split(os.pathsep) - else: - exts = [] - - if len(args) == 0: - return -1 - - failures = 0 - for arg in args: - #print "debug: search for %r" % arg - nmatches = 0 - for match in whichgen(arg, path=altpath, verbose=verbose, exts=exts): - if verbose: - print "%s (%s)" % match - else: - print match - nmatches += 1 - if not all: - break - if not nmatches: - failures += 1 - return failures - - -if __name__ == "__main__": - sys.exit( main(sys.argv) ) - - diff -r c55bdc2fb9fa -r 33ac48224523 corebio/utils/deoptparse.py --- a/corebio/utils/deoptparse.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,258 +0,0 @@ -# Copyright (c) 2004 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. 
-# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. -# - -"""Custom extensions to OptionParse for parsing command line options.""" -# FIXME: Docstring - -# TODO: Add profiling option - -# DeOptionParser : -# -# http://docs.python.org/lib/module-optparse.html -# -# Random_options : -# Set random generator and seed. Use options.random as -# source of random numbers -# Copyright : -# print copyright information - -# Documentation : -# print extended document information -# -# Additional file_in and file_out types - -import sys -from copy import copy -from optparse import Option -from optparse import OptionParser -from optparse import IndentedHelpFormatter -from optparse import OptionValueError -import random - - - -def _copyright_callback(option, opt, value, parser): - if option or opt or value or parser: pass # Shut up lint checker - print parser.copyright - sys.exit() - -def _doc_callback(option, opt, value, parser): - if option or opt or value or parser: pass # Shut up lint checker - print parser.long_description - sys.exit() - - -class DeHelpFormatter(IndentedHelpFormatter) : - def __init__ (self, - indent_increment=2, - max_help_position=32, - width=78, - short_first=1): - IndentedHelpFormatter.__init__( - self, indent_increment, max_help_position, - width, short_first) - - def format_option_strings (self, option): - """Return a comma-separated list of option strings & metavariables.""" - if option.takes_value(): - metavar = option.metavar or option.dest.upper() - short_opts = option._short_opts - long_opts = [lopt + " " + metavar for 
lopt in option._long_opts] - else: - short_opts = option._short_opts - long_opts = option._long_opts - - if not short_opts : short_opts = [" ",] - - if self.short_first: - opts = short_opts + long_opts - else: - opts = long_opts + short_opts - - return " ".join(opts) - - - -def _check_file_in(option, opt, value): - if option or opt or value : pass # Shut up lint checker - try: - return file(value, "r") - except IOError: - raise OptionValueError( - "option %s: cannot open file: %s" % (opt, value) ) - -def _check_file_out(option, opt, value): - if option or opt or value : pass # Shut up lint checker - try: - return file(value, "w+") - except IOError: - raise OptionValueError( - "option %s: cannot open file: %s" % (opt, value) ) - -def _check_boolean(option, opt, value) : - if option or opt or value : pass # Shut up lint checker - v = value.lower() - choices = {'no': False, 'false':False, '0': False, - 'yes': True, 'true': True, '1':True } - try: - return choices[v] - except KeyError: - raise OptionValueError( - "option %s: invalid choice: '%s' " \ - "(choose from 'yes' or 'no', 'true' or 'false')" % (opt, value)) - -def _check_dict(option, opt, value) : - if option or opt or value : pass # Shut up lint checker - v = value.lower() - choices = option.choices - try: - return choices[v] - except KeyError: - raise OptionValueError( - "option %s: invalid choice: '%s' " \ - "(choose from '%s')" % (opt, value, "', '".join(choices))) - - - -class DeOption(Option): - TYPES = Option.TYPES + ("file_in","file_out", "boolean", "dict") - TYPE_CHECKER = copy(Option.TYPE_CHECKER) - TYPE_CHECKER["file_in"] = _check_file_in - TYPE_CHECKER["file_out"] = _check_file_out - TYPE_CHECKER["boolean"] = _check_boolean - TYPE_CHECKER["dict"] = _check_dict - choices = None - - def _new_check_choice(self): - if self.type == "dict": - if self.choices is None: - raise OptionValueError( - "must supply a dictionary of choices for type 'dict'") - elif not isinstance(self.choices, dict): - raise 
OptionValueError( - "choices must be a dictinary ('%s' supplied)" - % str(type(self.choices)).split("'")[1]) - return - self._check_choice() - - # Have to override _check_choices so that we can parse - # a dict through to check_dict - CHECK_METHODS = Option.CHECK_METHODS - CHECK_METHODS[2] = _new_check_choice - - - - - -class DeOptionParser(OptionParser) : - def __init__(self, - usage=None, - option_list=None, - option_class=DeOption, - version=None, - conflict_handler="error", - description=None, - long_description = None, - formatter=DeHelpFormatter(), - add_help_option=True, - prog=None, - copyright=None, - add_verbose_options=True, - add_random_options=False - ): - - OptionParser.__init__(self, - usage, - option_list, - option_class, - version, - conflict_handler, - description, - formatter, - add_help_option, - prog ) - - if long_description : - self.long_description = long_description - self.add_option("--doc", - action="callback", - callback=_doc_callback, - help="Detailed documentation") - - if copyright : - self.copyright = copyright - self.add_option("--copyright", - action="callback", - callback=_copyright_callback, - help="") - - if add_verbose_options : - self.add_option("-q", "--quite", - action="store_false", - dest="verbose", - default=False, - help="Run quietly (default)") - - self.add_option("-v", "--verbose", - action="store_true", - dest="verbose", - default=False, - help="Verbose output (Not quite)") - - self.random_options = False - if add_random_options : - self.random_options = True - self.add_option("--seed", - action="store", - type = "int", - dest="random_seed", - help="Initial seed for pseudo-random number generator. 
(default: System time)", - metavar="INTEGER" ) - - self.add_option("--generator", - action="store", - dest="random_generator", - default="MersenneTwister", - help="Select MersenneTwister (default) or WichmannHill pseudo-random number generator", - metavar="TYPE" ) - - def parse_args(self,args, values=None) : - (options, args) = OptionParser.parse_args(self, args, values) - - if self.random_options : - if options.random_generator is None or options.random_generator =="MersenneTwister" : - r = random.Random() - elif options.random_generator == "WichmannHill" : - r = random.WichmannHill() - else : - self.error("Acceptible generators are MersenneTwister (default) or WichmannHill") - if options.random_seed : - r.seed(options.random_seed) - - options.__dict__["random"] = r - - - return (options, args) - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 inter.py --- a/inter.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,29 +0,0 @@ -#!/usr/tmp/bin/python2.7 - - -import weblogolib -from corebio.utils.deoptparse import DeOptionParser -import sys - - -#setup -def blackboxcodonl(inputf,outputf): - parser = weblogolib._build_option_parser() - - (opts, args) = parser.parse_args(['--size', 'large', '--composition', 'none', '--fin',inputf,'--fout', outputf]) - - - if args : parser.error("Unparsable arguments: %s " % args) - #best not to change anything in the try except block. 
- try: - data = weblogolib._build_logodata(opts) - format = weblogolib._build_logoformat(data, opts) - formatter = opts.formatter - formatter(data, format, opts.fout) - except ValueError, err : - print >>sys.stderr, 'Error:', err - sys.exit(2) - except KeyboardInterrupt, err: - sys.exit(0) - -blackboxcodonl('/home/david/examples/cluster17.aln',"outfile.eps") diff -r c55bdc2fb9fa -r 33ac48224523 setup.py --- a/setup.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,93 +0,0 @@ -#!/usr/bin/env python - -import sys - -from distutils.core import setup -from distutils.core import Extension -from distutils.command.build import build -from distutils.command.install_data import install_data - -# Supress warning that distutils generates for the install_requires option -import warnings -warnings.simplefilter('ignore', UserWarning, lineno =236) - -# check dependancies -if not hasattr(sys, 'version_info') or sys.version_info < (2,3,0,'final'): - raise SystemExit, \ - "Dependancy error: CodonLogo requires Python 2.3 or later." 
- - -from weblogolib import __version__ - -def main() : - long_description = open("README.txt").read() - - - setup( - name = "codonlogo", - version = __version__, - description = "CodonLogo: WebLogo3 messed around with", - long_description = long_description, - maintainer = "David Murphy", - maintainer_email = "Murphy.David@gmail.com", - classifiers =[ - 'Development Status :: Alpha', - 'Intended Audience :: Science/Research', - 'License :: OSI Approved :: BSD License', - 'Topic :: Scientific/Engineering :: Bio-Informatics', - 'Programming Language :: Python', - 'Natural Language :: English', - 'Operating System :: OS Independent', - 'Topic :: Software Development :: Libraries', - 'Topic :: Software Development :: Libraries :: Python Modules', - ], - - scripts = [ 'codonlogo', ], - packages = [ 'weblogolib',], - data_files = ['weblogolib/htdocs/*.*','weblogolib/template.eps'], - install_requires=['numpy', 'corebio'], - - cmdclass= {"install_data" : _install_data}, - ) - - -# Python 2.3 compatability -# Rework the install_data command to act like the package_data distutils -# command included with python 2.4. -# Adapted from biopython, which was adapted from mxtexttools -class _install_data(install_data): - def finalize_options(self): - if self.install_dir is None: - installobj = self.distribution.get_command_obj('install') - # Use install_lib rather than install_platlib because we are - # currently a pure python distribution (No c extensions.) 
- self.install_dir = installobj.install_lib - #print installobj.install_lib - install_data.finalize_options(self) - - def run (self): - import glob - import os - if not self.dry_run: - self.mkpath(self.install_dir) - data_files = self.get_inputs() - for entry in data_files: - if type(entry) is not type(""): - raise ValueError("data_files must be strings") - # Unix- to platform-convention conversion - entry = os.sep.join(entry.split("/")) - filenames = glob.glob(entry) - for filename in filenames: - dst = os.path.join(self.install_dir, filename) - dstdir = os.path.split(dst)[0] - if not self.dry_run: - self.mkpath(dstdir) - outfile = self.copy_file(filename, dst)[0] - else: - outfile = dst - self.outfiles.append(outfile) - -if __name__ == '__main__' : - main() - - \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 test --- a/test Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,97 +0,0 @@ - -./codonlogo --version < baseline.txt > output.eps -display output.eps -./codonlogo --help < baseline.txt > output.eps -display output.eps -./codonlogo -f baseline.txt > output.eps -display output.eps -./codonlogo -o output.eps < baseline.txt -display output.eps -./codonlogo -F eps < baseline.txt > output.eps -display output.eps -./codonlogo -F png < baseline.txt > output.png -display output.png -./codonlogo -F png_print < baseline.txt > output.png -display output.png -./codonlogo -F pdf < baseline.txt > output.pdf -display output.pdf -./codonlogo -F jpeg < baseline.txt > output.eps -display output.jpeg -./codonlogo -F txt < baseline.txt > output.eps -kate output.txt -./codonlogo -m 1 < baseline.txt > output.eps -display output.eps -./codonlogo -m 4 < baseline.txt > output.eps -display output.eps -./codonlogo -T True < baseline.txt > output.eps -display output.eps -./codonlogo -T False < baseline.txt > output.eps -display output.eps -./codonlogo -U 'bits' < baseline.txt > output.eps -display output.eps -./codonlogo -U 'nats' < 
baseline.txt > output.eps -display output.eps -./codonlogo -U 'digits' < baseline.txt > output.eps -display output.eps -./codonlogo -U 'kT' < baseline.txt > output.eps -display output.eps -./codonlogo -U 'kJ/mol' < baseline.txt > output.eps -display output.eps -./codonlogo -U 'kcal/mol' < baseline.txt > output.eps -display output.eps -./codonlogo -U 'probability' < baseline.txt > output.eps -display output.eps -./codonlogo --weight 32 < baseline.txt > output.eps -display output.eps -./codonlogo -i 2 < baseline.txt > output.eps -display output.eps -./codonlogo -l 2 < baseline.txt > output.eps -display output.eps -./codonlogo -u 4 < baseline.txt > output.eps -display output.eps -./codonlogo -n 2 < baseline.txt > output.eps -display output.eps -./codonlogo -t "Testing" < baseline.txt > output.eps -display output.eps -./codonlogo --label "WHeeeeeee" < baseline.txt > output.eps -display output.eps -./codonlogo -X True < baseline.txt > output.eps -display output.eps -./codonlogo -X False < baseline.txt > output.eps -display output.eps -./codonlogo -x "Hello" < baseline.txt > output.eps -display output.eps -./codonlogo -S 7 < baseline.txt > output.eps -display output.eps -./codonlogo -Y True < baseline.txt > output.eps -display output.eps -./codonlogo -Y False < baseline.txt > output.eps -display output.eps -./codonlogo -y "World" < baseline.txt > output.eps -display output.eps -./codonlogo -E True < baseline.txt > output.eps -display output.eps -./codonlogo --fineprint "Testing" < baseline.txt > output.eps -display output.eps -./codonlogo --ticmarks 3 < baseline.txt > output.eps -display output.eps -./codonlogo --errorbars False < baseline.txt > output.eps -display output.eps -./codonlogo -W 40 < baseline.txt > output.eps -display output.eps -./codonlogo -H 200 < baseline.txt > output.eps -display output.eps -./codonlogo --box True < baseline.txt > output.eps -display output.eps -./codonlogo --box False < baseline.txt > output.eps -display output.eps -./codonlogo 
--resolution 1200 < baseline.txt > output.eps -display output.eps -./codonlogo --scale-width YES < baseline.txt > output.eps -display output.eps -./codonlogo --scale-width NO < baseline.txt > output.eps -display output.eps -./codonlogo --debug True < baseline.txt > output.eps -display output.eps -./codonlogo --debug False < baseline.txt > output.eps -display output.eps diff -r c55bdc2fb9fa -r 33ac48224523 test_weblogo.py --- a/test_weblogo.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,544 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - - -import unittest - -import weblogolib -from weblogolib import * -from weblogolib import parse_prior, GhostscriptAPI -from weblogolib.color import * -from weblogolib.colorscheme import * -from StringIO import StringIO -import sys - -from numpy import array, asarray, float64, ones, zeros, int32,all,any, shape -import numpy as na - -from corebio import seq_io -from corebio.seq import * - -# python2.3 compatability -from corebio._future.subprocess import * -from corebio._future import resource_stream - -from corebio.moremath import entropy -from math import log, sqrt -codon_alphabetU=['AAA', 'AAC', 'AAG', 'AAU', 'ACA', 'ACC', 'ACG', 'ACU', 'AGA', 'AGC', 'AGG', 'AGU', 'AUA', 'AUC', 'AUG', 'AUU', 'CAA', 'CAC', 'CAG', 'CAU', 'CCA', 'CCC', 'CCG', 'CCU', 'CGA', 'CGC', 'CGG', 'CGU', 'CUA', 'CUC', 'CUG', 'CUU', 'GAA', 'GAC', 'GAG', 'GAU', 'GCA', 'GCC', 'GCG', 'GCU', 'GGA', 'GGC', 'GGG', 'GGU', 'GUA', 'GUC', 'GUG', 'GUU', 'UAA', 'UAC', 'UAG', 'UAU', 'UCA', 'UCC', 'UCG', 'UCU', 'UGA', 'UGC', 'UGG', 'UGU', 'UUA', 'UUC', 'UUG', 'UUU'] -codon_alphabetT=['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', 'AGA', 'AGC', 'AGG', 'AGT', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CAT', 'CCA', 'CCC', 'CCG', 'CCT', 'CGA', 'CGC', 'CGG', 'CGT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAC', 'GAG', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'TAA', 'TAC', 'TAG', 'TAT', 'TCA', 'TCC', 'TCG', 'TCT', 'TGA', 'TGC', 
'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT'] - - -def testdata_stream( name ): - return resource_stream(__name__, 'tests/data/'+name, __file__) - -class test_logoformat(unittest.TestCase) : - - def test_options(self) : - options = LogoOptions() - - -class test_ghostscript(unittest.TestCase) : - def test_version(self) : - version = GhostscriptAPI().version - - - -class test_parse_prior(unittest.TestCase) : - def assertTrue(self, bool) : - self.assertEquals( bool, True) - - def test_parse_prior_none(self) : - self.assertEquals( None, - parse_prior(None, unambiguous_protein_alphabet ) ) - self.assertEquals( None, - parse_prior( 'none', unambiguous_protein_alphabet ) ) - self.assertEquals( None, - parse_prior( 'noNe', None) ) - - def test_parse_prior_equiprobable(self) : - self.assertTrue( all(20.*equiprobable_distribution(20) == - parse_prior( 'equiprobable', unambiguous_protein_alphabet ) ) ) - - self.assertTrue( - all( 1.2* equiprobable_distribution(3) - == parse_prior( ' equiprobablE ', Alphabet('123'), 1.2 ) ) ) - - def test_parse_prior_percentage(self) : - #print parse_prior( '50%', unambiguous_dna_alphabet, 1. ) - self.assertTrue( all( equiprobable_distribution(4) - == parse_prior( '50%', unambiguous_dna_alphabet, 1. ) ) ) - - self.assertTrue( all( equiprobable_distribution(4) - == parse_prior( ' 50.0 % ', unambiguous_dna_alphabet, 1. ) ) ) - - self.assertTrue( all( array( (0.3,0.2,0.2,0.3), float64) - == parse_prior( ' 40.0 % ', unambiguous_dna_alphabet, 1. ) ) ) - - def test_parse_prior_float(self) : - self.assertTrue( all( equiprobable_distribution(4) - == parse_prior( '0.5', unambiguous_dna_alphabet, 1. ) ) ) - - self.assertTrue( all( equiprobable_distribution(4) - == parse_prior( ' 0.500 ', unambiguous_dna_alphabet, 1. ) ) ) - - self.assertTrue( all( array( (0.3,0.2,0.2,0.3), float64) - == parse_prior( ' 0.40 ', unambiguous_dna_alphabet, 1. 
) ) ) - - def test_auto(self) : - self.assertTrue( all(4.*equiprobable_distribution(4) == - parse_prior( 'auto', unambiguous_dna_alphabet ) ) ) - self.assertTrue( all(4.*equiprobable_distribution(4) == - parse_prior( 'automatic', unambiguous_dna_alphabet ) ) ) - - def test_weight(self) : - self.assertTrue( all(4.*equiprobable_distribution(4) == - parse_prior( 'automatic', unambiguous_dna_alphabet ) ) ) - self.assertTrue( all(123.123*equiprobable_distribution(4) == - parse_prior( 'auto', unambiguous_dna_alphabet , 123.123) ) ) - - def test_explicit(self) : - s = "{'A':10, 'C':40, 'G':40, 'T':10}" - p = array( (10, 40, 40,10), float64)*4./100. - self.assertTrue( all( - p == parse_prior( s, unambiguous_dna_alphabet ) ) ) - - -class test_logooptions(unittest.TestCase) : - def test_create(self) : - opt = LogoOptions() - opt.small_fontsize =10 - options = repr(opt) - - opt = LogoOptions(title="sometitle") - assert opt.title == "sometitle" - -class test_logosize(unittest.TestCase) : - def test_create(self) : - s = LogoSize(101.0,10.0) - assert s.stack_width == 101.0 - r = repr(s) - - -class test_seqlogo(unittest.TestCase) : - # FIXME: The version of python used by Popen may not be the - # same as that used to run this test. 
- def _exec(self, args, outputtext, returncode =0, stdin=None) : - if not stdin : - stdin = testdata_stream("cap.fa") - args = ["./weblogo"] + args - p = Popen(args,stdin=stdin,stdout=PIPE, stderr=PIPE) - (out, err) = p.communicate() - if returncode ==0 and p.returncode >0 : - print err - self.assertEquals(returncode, p.returncode) - if returncode == 0 : self.assertEquals( len(err), 0) - - for item in outputtext : - self.failUnless(item in out) - - - - def test_malformed_options(self) : - self._exec( ["--notarealoption"], [], 2) - self._exec( ["extrajunk"], [], 2) - self._exec( ["-I"], [], 2) - - def test_help_option(self) : - self._exec( ["-h"], ["options"]) - self._exec( ["--help"], ["options"]) - - def test_version_option(self) : - self._exec( ['--version'], weblogolib.__version__) - - - def test_default_build(self) : - self._exec( [], ["%%Title: Sequence Logo:"] ) - - - # Format options - def test_width(self) : - self._exec( ['-W','1234'], ["/stack_width 1234"] ) - self._exec( ['--stack-width','1234'], ["/stack_width 1234"] ) - - def test_height(self) : - self._exec( ['-H','1234'], ["/stack_height 1234"] ) - self._exec( ['--stack-height','1234'], ["/stack_height 1234"] ) - - - def test_stacks_per_line(self) : - self._exec( ['-n','7'], ["/stacks_per_line 7 def"] ) - self._exec( ['--stacks-per-line','7'], ["/stacks_per_line 7 def"] ) - - - def test_title(self) : - self._exec( ['-t', '3456'], ['/logo_title (3456) def', - '/show_title True def']) - self._exec( ['-t', ''], ['/logo_title () def', - '/show_title False def']) - self._exec( ['--title', '3456'], ['/logo_title (3456) def', - '/show_title True def']) - - - - - -class test_which(unittest.TestCase) : - def test_which(self): - tests = ( - (seq_io.read(testdata_stream('cap.fa')), codon_alphabetT), - (seq_io.read(testdata_stream('capu.fa')), codon_alphabetU), - - #(seq_io.read(testdata_stream('cox2.msf')), unambiguous_protein_alphabet), - #(seq_io.read(testdata_stream('Rv3829c.fasta')), 
unambiguous_protein_alphabet), - ) - for t in tests : - self.failUnlessEqual(which_alphabet(t[0]), t[1]) - - - - -class test_colorscheme(unittest.TestCase) : - - def test_colorgroup(self) : - cr = ColorGroup( "ABC", "black", "Because") - self.assertEquals( cr.description, "Because") - - def test_colorscheme(self) : - cs = ColorScheme([ - ColorGroup("G", "orange"), - ColorGroup("TU", "red"), - ColorGroup("C", "blue"), - ColorGroup("A", "green") - ], - title = "title", - description = "description", - ) - - self.assertEquals( cs.color('A'), Color.by_name("green")) - self.assertEquals( cs.color('X'), cs.default_color) - - - -class test_color(unittest.TestCase) : - # 2.3 Python compatibility - assertTrue = unittest.TestCase.failUnless - assertFalse = unittest.TestCase.failIf - - def test_color_names(self) : - names = Color.names() - self.failUnlessEqual( len(names), 147) - - for n in names: - c = Color.by_name(n) - self.assertTrue( c != None ) - - - def test_color_components(self) : - white = Color.by_name("white") - self.failUnlessEqual( 1.0, white.red) - self.failUnlessEqual( 1.0, white.green) - self.failUnlessEqual( 1.0, white.blue) - - - c = Color(0.3, 0.4, 0.2) - self.failUnlessEqual( 0.3, c.red) - self.failUnlessEqual( 0.4, c.green) - self.failUnlessEqual( 0.2, c.blue) - - c = Color(0,128,0) - self.failUnlessEqual( 0.0, c.red) - self.failUnlessEqual( 128./255., c.green) - self.failUnlessEqual( 0.0, c.blue) - - - def test_color_from_rgb(self) : - white = Color.by_name("white") - - self.failUnlessEqual(white, Color(1.,1.,1.) ) - self.failUnlessEqual(white, Color(255,255,255) ) - self.failUnlessEqual(white, Color.from_rgb(1.,1.,1.) 
) - self.failUnlessEqual(white, Color.from_rgb(255,255,255) ) - - - def test_color_from_hsl(self) : - red = Color.by_name("red") - lime = Color.by_name("lime") - saddlebrown = Color.by_name("saddlebrown") - darkgreen = Color.by_name("darkgreen") - blue = Color.by_name("blue") - green = Color.by_name("green") - - self.failUnlessEqual(red, Color.from_hsl(0, 1.0,0.5) ) - self.failUnlessEqual(lime, Color.from_hsl(120, 1.0, 0.5) ) - self.failUnlessEqual(blue, Color.from_hsl(240, 1.0, 0.5) ) - self.failUnlessEqual(Color.by_name("gray"), Color.from_hsl(0,0,0.5) ) - - self.failUnlessEqual(saddlebrown, Color.from_hsl(25, 0.76, 0.31) ) - - self.failUnlessEqual(darkgreen, Color.from_hsl(120, 1.0, 0.197) ) - - - def test_color_by_name(self): - white = Color.by_name("white") - self.failUnlessEqual(white, Color.by_name("white")) - self.failUnlessEqual(white, Color.by_name("WHITE")) - self.failUnlessEqual(white, Color.by_name(" wHiTe \t\n\t")) - - - self.failUnlessEqual(Color(255,255,240), Color.by_name("ivory")) - self.failUnlessEqual(Color(70,130,180), Color.by_name("steelblue")) - - self.failUnlessEqual(Color(0,128,0), Color.by_name("green")) - - - def test_color_from_invalid_name(self): - self.failUnlessRaises( ValueError, Color.by_name, "not_a_color") - - - def test_color_clipping(self): - red = Color.by_name("red") - self.failUnlessEqual(red, Color(255,0,0) ) - self.failUnlessEqual(red, Color(260,-10,0) ) - self.failUnlessEqual(red, Color(1.1,-0.,-1.) 
) - - self.failUnlessEqual( Color(1.0001, 213.0, 1.2).red, 1.0 ) - self.failUnlessEqual( Color(-0.001, -2183.0, -1.0).red, 0.0 ) - self.failUnlessEqual( Color(1.0001, 213.0, 1.2).green, 1.0 ) - self.failUnlessEqual( Color(-0.001, -2183.0, -1.0).green, 0.0 ) - self.failUnlessEqual( Color(1.0001, 213.0, 1.2).blue, 1.0 ) - self.failUnlessEqual( Color(-0.001, -2183.0, -1.0).blue, 0.0 ) - - - def test_color_fail_on_mixed_type(self): - self.failUnlessRaises( TypeError, Color.from_rgb, 1,1,1.0 ) - self.failUnlessRaises( TypeError, Color.from_rgb, 1.0,1,1.0 ) - - def test_color_red(self) : - # Check Usage comment in Color - red = Color.by_name("red") - self.failUnlessEqual( red , Color(255,0,0) ) - self.failUnlessEqual( red, Color(1., 0., 0.) ) - - self.failUnlessEqual( red , Color.from_rgb(1.,0.,0.) ) - self.failUnlessEqual( red , Color.from_rgb(255,0,0) ) - self.failUnlessEqual( red , Color.from_hsl(0.,1., 0.5) ) - - self.failUnlessEqual( red , Color.from_string("red") ) - self.failUnlessEqual( red , Color.from_string("RED") ) - self.failUnlessEqual( red , Color.from_string("#F00") ) - self.failUnlessEqual( red , Color.from_string("#FF0000") ) - self.failUnlessEqual( red , Color.from_string("rgb(255, 0, 0)") ) - self.failUnlessEqual( red , Color.from_string("rgb(100%, 0%, 0%)") ) - self.failUnlessEqual( red , Color.from_string("hsl(0, 100%, 50%)") ) - - - def test_color_from_string(self) : - purple = Color(128,0,128) - red = Color(255,0,0) - skyblue = Color(135,206,235) - - red_strings = ("red", - "ReD", - "RED", - " Red \t", - "#F00", - "#FF0000", - "rgb(255, 0, 0)", - "rgb(100%, 0%, 0%)", - "hsl(0, 100%, 50%)") - - for s in red_strings: - self.failUnlessEqual( red, Color.from_string(s) ) - - skyblue_strings = ("skyblue", - "SKYBLUE", - " \t\n SkyBlue \t", - "#87ceeb", - "rgb(135,206,235)" - ) - - for s in skyblue_strings: - self.failUnlessEqual( skyblue, Color.from_string(s) ) - - - - def test_color_equality(self): - c1 = Color(123,99,12) - c2 = Color(123,99,12) - - 
self.failUnlessEqual(c1,c2) - - - - - - -class test_Dirichlet(unittest.TestCase) : - # 2.3 Python compatibility - assertTrue = unittest.TestCase.failUnless - assertFalse = unittest.TestCase.failIf - - - def test_init(self) : - d = Dirichlet( ( 1,1,1,1,) ) - - - def test_random(self) : - - - def do_test( alpha, samples = 1000) : - ent = zeros( (samples,), float64) - #alpha = ones( ( K,), Float64 ) * A/K - - #pt = zeros( (len(alpha) ,), Float64) - d = Dirichlet(alpha) - for s in range(samples) : - p = d.sample() - #print p - #pt +=p - ent[s] = entropy(p) - - #print pt/samples - - m = mean(ent) - v = var(ent) - - dm = d.mean_entropy() - dv = d.variance_entropy() - - #print alpha, ':', m, v, dm, dv - error = 4. * sqrt(v/samples) - self.assertTrue( abs(m-dm) < error) - self.assertTrue( abs(v-dv) < error) # dodgy error estimate - - - do_test( (1., 1.) ) - do_test( (2., 1.) ) - do_test( (3., 1.) ) - do_test( (4., 1.) ) - do_test( (5., 1.) ) - do_test( (6., 1.) ) - - do_test( (1., 1.) ) - do_test( (20., 20.) ) - do_test( (1., 1., 1., 1., 1., 1., 1., 1., 1., 1.) ) - do_test( (.1, .1, .1, .1, .1, .1, .1, .1, .1, .1) ) - do_test( (.01, .01, .01, .01, .01, .01, .01, .01, .01, .01) ) - do_test( (2.0, 6.0, 1.0, 1.0) ) - - - def test_mean(self) : - alpha = ones( ( 10,), float64 ) * 23. - d = Dirichlet(alpha) - m = d.mean() - self.assertAlmostEqual( m[2], 1./10) - self.assertAlmostEqual( sum(m), 1.0) - - def test_covariance(self) : - alpha = ones( ( 4,), float64 ) - d = Dirichlet(alpha) - cv = d.covariance() - self.assertEqual( cv.shape, (4,4) ) - self.assertAlmostEqual( cv[0,0], 1.0 * (1.0 - 1./4.0)/ (4.0 * 5.0) ) - self.assertAlmostEqual( cv[0,1], - 1 / ( 4. * 4. * 5.) 
) - - def test_mean_x(self) : - alpha = (1.0, 2.0, 3.0, 4.0) - xx = (2.0, 2.0, 2.0, 2.0) - m = Dirichlet(alpha).mean_x(xx) - self.assertEquals( m, 2.0) - - alpha = (1.0, 1.0, 1.0, 1.0) - xx = (2.0, 3.0, 4.0, 3.0) - m = Dirichlet(alpha).mean_x(xx) - self.assertEquals( m, 3.0) - - def test_variance_x(self) : - alpha = (1.0, 1.0, 1.0, 1.0) - xx = (2.0, 2.0, 2.0, 2.0) - v = Dirichlet(alpha).variance_x(xx) - self.assertAlmostEquals( v, 0.0) - - alpha = (1.0, 2.0, 3.0, 4.0) - xx = (2.0, 0.0, 1.0, 10.0) - v = Dirichlet(alpha).variance_x(xx) - #print v - # TODO: Don't actually know if this is correct - - def test_relative_entropy(self): - alpha = (2.0, 10.0, 1.0, 1.0) - d = Dirichlet(alpha) - pvec = (0.1, 0.2, 0.3, 0.4) - - rent = d.mean_relative_entropy(pvec) - vrent = d.variance_relative_entropy(pvec) - low, high = d.interval_relative_entropy(pvec, 0.95) - - #print - #print '> ', rent, vrent, low, high - - # This test can fail randomly, but the precision form a few - # thousand samples is low. 
Increasing samples, 1000->2000 - samples = 2000 - sent = zeros( (samples,), float64) - - for s in range(samples) : - post = d.sample() - e = -entropy(post) - for k in range(4) : - e += - post[k] * log(pvec[k]) - sent[s] = e - sent.sort() - self.assertTrue( abs(sent.mean() - rent) < 4.*sqrt(vrent) ) - self.assertAlmostEqual( sent.std(), sqrt(vrent), 1 ) - self.assertTrue( abs(low-sent[ int( samples *0.025)])<0.2 ) - self.assertTrue( abs(high-sent[ int( samples *0.975)])<0.2 ) - - #print '>>', mean(sent), var(sent), sent[ int( samples *0.025)] ,sent[ int( samples *0.975)] - - - -def mean( a) : - return sum(a)/ len(a) - -def var(a) : - return (sum(a*a) /len(a) ) - mean(a)**2 - - - - -if __name__ == '__main__': - unittest.main() diff -r c55bdc2fb9fa -r 33ac48224523 tests/data/Rv3829c.fasta --- a/tests/data/Rv3829c.fasta Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,13 +0,0 @@ ->Rv3829c -MTGYDAIVIGAGHNGLTAAVLLQRAGLRTACLDAKRYAGGMASTVELFDG -YRFEIAGSVQFPTSSAVSSELGLDSLPTVDLEVMSVALRGVGDDPVVQFT -DPTKMLTHLHRVHGADAVTGMAGLLAWSQAPTRALGRFEAGTLPKSFDEM -YACATNEFERSAIDDMLFGSVTDVLDRHFPDREKHGALRGSMTVLAVNTL -YRGPATPGSAAALAFGLGVPEGDFVRWKKLRGGIGALTTHLSQLLERTGG -EVRLRSKVTEIVVDNSRSSARVRGVRTAAGDTLTSPIVVSAIAPDVTINE -LIDPAVLPSEIRDRYLRIDHRGSYLQMHFALAQPPAFAAPYQALNDPSMQ -ASMGIFCTPEQVQQQWEDCRRGIVPADPTVVLQIPSLHDPSLAPAGKQAA -SAFAMWFPIEGGSKYGGYGRAKVEMGQNVIDKITRLAPNFKGSILRYTTF -TPKHMGVMFGAPGGDYCHALLHSDQIGPNRPGPKGFIGQPIPIAGLYLGS -AGCHGGPGITFIPGYNAARQALADRRAANCCVLSGR -* \ No newline at end of file diff -r c55bdc2fb9fa -r 33ac48224523 tests/data/cap.fa --- a/tests/data/cap.fa Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,98 +0,0 @@ ->aldB -18->4 -attcgtgatagctgtcgtaaag ->ansB 103->125 -ttttgttacctgcctctaactt ->araB1 109->131 -aagtgtgacgccgtgcaaataa ->araB2 147->169 -tgccgtgattatagacactttt ->cdd 1 107->129 -atttgcgatgcgtcgcgcattt ->cdd 2 57->79 -taatgagattcagatcacatat ->crp 1 115->137 -taatgtgacgtcctttgcatac ->crp 2 
-gaaggcgacctgggtcatgctg ->cya 151->173 -aggtgttaaattgatcacgttt ->cytR 1 125->147 -cgatgcgaggcggatcgaaaaa ->cytR 2 106->128 -aaattcaatattcatcacactt ->dadAX 1 95->117 -agatgtgagccagctcaccata ->dadAX 2 32->54 -agatgtgattagattattattc ->deoP2 1 75->97 -aattgtgatgtgtatcgaagtg ->deoP2 2 128->150 -ttatttgaaccagatcgcatta ->fur 136->158 -aaatgtaagctgtgccacgttt ->gal 56->78 -aagtgtgacatggaataaatta ->glpACB (glpTQ) 1 54->76 -ttgtttgatttcgcgcatattc ->glpACB (glpTQ) 2 94->116 -aaacgtgatttcatgcgtcatt ->glpACB (glpTQ) 144->166 -atgtgtgcggcaattcacattt ->glpD (glpE) 95->117 -taatgttatacatatcactcta ->glpFK 1 120->142 -ttttatgacgaggcacacacat ->glpFK 2 95->117 -aagttcgatatttctcgttttt ->gut (srlA) 72->94 -ttttgcgatcaaaataacactt ->ilvB 87->109 -aaacgtgatcaacccctcaatt ->lac 1 (lacZ) 88->110 -taatgtgagttagctcactcat ->lac 2 (lacZ) 16->38 -aattgtgagcggataacaattt ->malEpKp1 110->132 -ttgtgtgatctctgttacagaa ->malEpKp2 139->161 -TAAtgtggagatgcgcacaTAA ->malEpKp3 173->195 -TTTtgcaagcaacatcacgAAA ->malEpKp4 205->227 -GACctcggtttagttcacaGAA ->malT 121->143 -aattgtgacacagtgcaaattc ->melR 52->74 -aaccgtgctcccactcgcagtc ->mtl 302->324 -TCTTGTGATTCAGATCACAAAG ->nag 156->178 -ttttgtgagttttgtcaccaaa ->nupG2 97->119 -aaatgttatccacatcacaatt ->nupG1 47->69 -ttatttgccacaggtaacaaaa ->ompA 166->188 -atgcctgacggagttcacactt ->ompR 161->183 -taacgtgatcatatcaacagaa ->ptsH A 316->338 -Ttttgtggcctgcttcaaactt ->ptsH B 188->210 -ttttatgatttggttcaattct ->rhaS (rhaB) 161->183 -aattgtgaacatcatcacgttc ->rot 1 (ppiA) 182->204 -ttttgtgatctgtttaaatgtt ->rot 2 (ppiA) 129->151 -agaggtgattttgatcacggaa ->tdcA 60->82 -atttgtgagtggtcgcacatat ->tnaL 73->95 -gattgtgattcgattcacattt ->tsx 2 146->168 -gtgtgtaaacgtgaacgcaatc ->tsx 1 107->129 -aactgtgaaacgaaacatattt ->uxuAB 165->187 -TCTTGTGATGTGGTTAACCAAT diff -r c55bdc2fb9fa -r 33ac48224523 tests/data/capu.fa --- a/tests/data/capu.fa Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,98 +0,0 @@ ->aldB -18->4 -auucgugauagcugucguaaag ->ansB 103->125 
-uuuuguuaccugccucuaacuu ->araB1 109->131 -aagugugacgccgugcaaauaa ->araB2 147->169 -ugccgugauuauagacacuuuu ->cdd 1 107->129 -auuugcgaugcgucgcgcauuu ->cdd 2 57->79 -uaaugagauucagaucacauau ->crp 1 115->137 -uaaugugacguccuuugcauac ->crp 2 -gaaggcgaccugggucaugcug ->cya 151->173 -agguguuaaauugaucacguuu ->cyuR 1 125->147 -cgaugcgaggcggaucgaaaaa ->cyuR 2 106->128 -aaauucaauauucaucacacuu ->dadAX 1 95->117 -agaugugagccagcucaccaua ->dadAX 2 32->54 -agaugugauuagauuauuauuc ->deoP2 1 75->97 -aauugugauguguaucgaagug ->deoP2 2 128->150 -uuauuugaaccagaucgcauua ->fur 136->158 -aaauguaagcugugccacguuu ->gal 56->78 -aagugugacauggaauaaauua ->glpACB (glpUQ) 1 54->76 -uuguuugauuucgcgcauauuc ->glpACB (glpUQ) 2 94->116 -aaacgugauuucaugcgucauu ->glpACB (glpUQ) 144->166 -augugugcggcaauucacauuu ->glpD (glpE) 95->117 -uaauguuauacauaucacucua ->glpFK 1 120->142 -uuuuaugacgaggcacacacau ->glpFK 2 95->117 -aaguucgauauuucucguuuuu ->guu (srlA) 72->94 -uuuugcgaucaaaauaacacuu ->ilvB 87->109 -aaacgugaucaaccccucaauu ->lac 1 (lacZ) 88->110 -uaaugugaguuagcucacucau ->lac 2 (lacZ) 16->38 -aauugugagcggauaacaauuu ->malEpKp1 110->132 -uugugugaucucuguuacagaa ->malEpKp2 139->161 -UAAuguggagaugcgcacaUAA ->malEpKp3 173->195 -UUUugcaagcaacaucacgAAA ->malEpKp4 205->227 -GACcucgguuuaguucacaGAA ->malU 121->143 -aauugugacacagugcaaauuc ->melR 52->74 -aaccgugcucccacucgcaguc ->mul 302->324 -UCUUGUGAUUCAGAUCACAAAG ->nag 156->178 -uuuugugaguuuugucaccaaa ->nupG2 97->119 -aaauguuauccacaucacaauu ->nupG1 47->69 -uuauuugccacagguaacaaaa ->ompA 166->188 -augccugacggaguucacacuu ->ompR 161->183 -uaacgugaucauaucaacagaa ->pusH A 316->338 -Uuuuguggccugcuucaaacuu ->pusH B 188->210 -uuuuaugauuugguucaauucu ->rhaS (rhaB) 161->183 -aauugugaacaucaucacguuc ->rou 1 (ppiA) 182->204 -uuuugugaucuguuuaaauguu ->rou 2 (ppiA) 129->151 -agaggugauuuugaucacggaa ->udcA 60->82 -auuugugaguggucgcacauau ->unaL 73->95 -gauugugauucgauucacauuu ->usx 2 146->168 -guguguaaacgugaacgcaauc ->usx 1 107->129 -aacugugaaacgaaacauauuu ->uxuAB 165->187 -UCUUGUGAUGUGGUUAACCAAU 
diff -r c55bdc2fb9fa -r 33ac48224523 tests/data/cox2.msf --- a/tests/data/cox2.msf Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,33 +0,0 @@ - mini.msf MSF: 166 Type: N January 01, 1776 12:00 Check: 8077 .. - - Name: cox2_leita Len: 166 Check: 2103 Weight: 1.00 - Name: cox2_crifa Len: 166 Check: 1179 Weight: 1.00 - Name: cox2_trybb Len: 166 Check: 999 Weight: 1.00 - Name: Cox2_bsalt Len: 166 Check: 2740 Weight: 1.00 - Name: cox2_tborr Len: 166 Check: 1056 Weight: 1.00 - -// - - cox2_leita MAFILSFWMI FLLDSVIVL? ???LSFVCFV CVWICALLFS TVLLVSKLN? - cox2_crifa MAFILSFWMI FLIDAVIVL? ???LSFVCFV CIWICSLFFS SFLLVSKIN? - cox2_trybb MSFILTFWMI FLMDSIIVL? ???ISFSIFL SVWICALIIA TVLTVTKIN? - Cox2_bsalt MSFIISF?ML FLIDSLIVL? ???LSGAIFV CIWICSLFFL CILFICKLD? - cox2_tborr MLFFINQLLL LLVDTFVIL? ???EIFSLFV CVFIIVMYIL FINYNIFLK? - - cox2_leita ?NIYCTWDFT ASKFIDVYWF TIGGMFSLGL ?LLRLCLLLY FGHLN????? - cox2_crifa ?NVYCTWDFT ASKFIDAYWF TIGGMFVLCL ?LLRLCLLLY FGCLN????? - cox2_trybb ?NIYCTWDFI SSKFIDTYWF VLGMMFILCL ?LLRLCLLLY FSCIN????? - Cox2_bsalt ?YIFCS?DFI SAKFIDLY?F TLGCLFIVCL ?LIRLCLLLY FSCLN????? - cox2_tborr ?NINVYLDFI GSKYLDLYWF LIGIFFVIVL ?LIRLCLLLY YSWIS????? - - cox2_leita ???FVSFDLC KVVGFQWYWV YFIFG????? ??ETTIFSNL ILESDYMIGD - cox2_crifa ???FVSFDLC KVVGFQWYWV YFIFG????? ??ETTIFSNL ILESDYLIGD - cox2_trybb ???FVSFDLC KVIGFQWYWV YFLFG????? ??ETTIFSNL ILESDYLIGD - Cox2_bsalt ???FVCFDLC KCIGFQ?Y?V YFIFG????? ??ETTIFSNL ILESDYLIGD - cox2_tborr ???LLIFDLC KIMGFQWYWI FFVFK????? ??ENVIFSNL LIESDYWIGD - - cox2_leita LR???????? ?????? - cox2_crifa LR???????? ?????? - cox2_trybb LR???????? ?????? - Cox2_bsalt LR???????? ?????? - cox2_tborr LR???????? ?????? 
diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/.___init__.py Binary file weblogolib/.___init__.py has changed diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/__init__.py --- a/weblogolib/__init__.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,2078 +0,0 @@ -#!/usr/bin/env python - -# -------------------------------- WebLogo -------------------------------- - -# Copyright (c) 2003-2004 The Regents of the University of California. -# Copyright (c) 2005 Gavin E. Crooks -# Copyright (c) 2006, The Regents of the University of California, through print -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - -# Replicates README.txt - -""" -WebLogo (http://code.google.com/p/weblogo/) is a tool for creating sequence -logos from biological sequence alignments. It can be run on the command line, -as a standalone webserver, as a CGI webapp, or as a python library. - -The main WebLogo webserver is located at http://bespoke.lbl.gov/weblogo/ - - -Codonlogo is based on Weblogo. - -Please consult the manual for installation instructions and more information: -(Also located in the weblogolib/htdocs subdirectory.) - -For help on the command line interface run - ./codonlogo --help - -To build a simple logo run - ./codonlogo < cap.fa > logo0.eps - -To run as a standalone webserver at localhost:8080 - ./codonlogo --server - -To create a logo in python code: - >>> from weblogolib import * - >>> fin = open('cap.fa') - >>> seqs = read_seq_data(fin) - >>> data = LogoData.from_seqs(seqs) - >>> options = LogoOptions() - >>> options.title = "A Logo Title" - >>> format = LogoFormat(data, options) - >>> fout = open('cap.eps', 'w') - >>> eps_formatter( data, format, fout) - - --- Distribution and Modification -- -This package is distributed under the new BSD Open Source License. -Please see the LICENSE.txt file for details on copyright and licensing. 
-The WebLogo source code can be downloaded from http://code.google.com/p/weblogo/ -WebLogo requires Python 2.3, 2.4 or 2.5, the corebio python toolkit for computational -biology (http://code.google.com/p/corebio), and the python array package -'numpy' (http://www.scipy.org/Download) - -# -------------------------------- CodonLogo -------------------------------- - - -""" - -from math import * -import random -from itertools import izip, count -import sys -import copy -import os -from itertools import product -from datetime import datetime -from StringIO import StringIO - - - -from corebio.data import rna_letters, dna_letters, amino_acid_letters -import random - -# python2.3 compatability -from corebio._future import Template -from corebio._future.subprocess import * -from corebio._future import resource_string, resource_filename - - -from math import log, sqrt - -# Avoid 'from numpy import *' since numpy has lots of names defined -from numpy import array, asarray, float64, ones, zeros, int32,all,any, shape -import numpy as na - -from corebio.utils.deoptparse import DeOptionParser -from optparse import OptionGroup - -from color import * -from colorscheme import * -from corebio.seq import Alphabet, Seq, SeqList -from corebio import seq_io -from corebio.utils import isfloat, find_command, ArgumentError -from corebio.moremath import * -from corebio.data import amino_acid_composition -from corebio.seq import unambiguous_rna_alphabet, unambiguous_dna_alphabet, unambiguous_protein_alphabet - -codon_alphabetU=['AAA', 'AAC', 'AAG', 'AAU', 'ACA', 'ACC', 'ACG', 'ACU', 'AGA', 'AGC', 'AGG', 'AGU', 'AUA', 'AUC', 'AUG', 'AUU', 'CAA', 'CAC', 'CAG', 'CAU', 'CCA', 'CCC', 'CCG', 'CCU', 'CGA', 'CGC', 'CGG', 'CGU', 'CUA', 'CUC', 'CUG', 'CUU', 'GAA', 'GAC', 'GAG', 'GAU', 'GCA', 'GCC', 'GCG', 'GCU', 'GGA', 'GGC', 'GGG', 'GGU', 'GUA', 'GUC', 'GUG', 'GUU', 'UAA', 'UAC', 'UAG', 'UAU', 'UCA', 'UCC', 'UCG', 'UCU', 'UGA', 'UGC', 'UGG', 'UGU', 'UUA', 'UUC', 'UUG', 'UUU'] -codon_alphabetT=['AAA', 
'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', 'AGA', 'AGC', 'AGG', 'AGT', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CAT', 'CCA', 'CCC', 'CCG', 'CCT', 'CGA', 'CGC', 'CGG', 'CGT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAC', 'GAG', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'TAA', 'TAC', 'TAG', 'TAT', 'TCA', 'TCC', 'TCG', 'TCT', 'TGA', 'TGC', 'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT'] - -altype="codonsT" -offset=0 -col=[] - - - -__all__ = ['LogoSize', - 'LogoOptions', - 'description', - '__version__', - 'LogoFormat', - 'LogoData', - 'Dirichlet', - 'GhostscriptAPI', - 'std_color_schemes', - 'default_color_schemes', - 'classic', - 'std_units', - 'std_sizes', - 'std_alphabets', - 'std_percentCG', - 'pdf_formatter', - 'jpeg_formatter', - 'png_formatter', - 'png_print_formatter', - 'txt_formatter', - 'eps_formatter', - 'formatters', - 'default_formatter', - 'base_distribution', - 'equiprobable_distribution', - 'read_seq_data', - 'which_alphabet', - 'color', - 'colorscheme', - ] - -description = "Create sequence logos from biological sequence alignments." - -__version__ = "1.0" - -# These keywords are subsituted by subversion. 
-# The date and revision will only tell the truth after a branch or tag, -# since different files in trunk will have been changed at different times -release_date ="$Date: 2011-09-17 16:30:00 -0700 (Tue, 14 Oct 2008) $".split()[1] -release_build = "$Revision: 53 $".split()[1] -release_description = "CodonLogo %s (%s)" % (__version__, release_date) - - - -def cgi(htdocs_directory) : - import weblogolib._cgi - weblogolib._cgi.main(htdocs_directory) - -class GhostscriptAPI(object) : - """Interface to the command line program Ghostscript ('gs')""" - - formats = ('png', 'pdf', 'jpeg') - - def __init__(self, path=None) : - try: - command = find_command('gs', path=path) - except EnvironmentError: - try: - command = find_command('gswin32c.exe', path=path) - except EnvironmentError: - raise EnvironmentError("Could not find Ghostscript on path." - " There should be either a gs executable or a gswin32c.exe on your system's path") - - self.command = command - - def version(self) : - args = [self.command, '--version'] - try : - p = Popen(args, stdout=PIPE) - (out,err) = p.communicate() - except OSError : - raise RuntimeError("Cannot communicate with ghostscript.") - return out.strip() - - def convert(self, format, fin, fout, width, height, resolution=300) : - device_map = { 'png':'png16m', 'pdf':'pdfwrite', 'jpeg':'jpeg'} - - try : - device = device_map[format] - except KeyError: - raise ValueError("Unsupported format.") - - args = [self.command, - "-sDEVICE=%s" % device, - "-dPDFSETTINGS=/printer", - #"-q", # Quite: Do not dump messages to stdout. 
- "-sstdout=%stderr", # Redirect messages and errors to stderr - "-sOutputFile=-", # Stdout - "-dDEVICEWIDTHPOINTS=%s" % str(width), - "-dDEVICEHEIGHTPOINTS=%s" % str(height), - "-dSAFER", # For added security - "-dNOPAUSE",] - - if device != 'pdf' : - args.append("-r%s" % str(resolution) ) - if resolution < 300 : # Antialias if resolution is Less than 300 DPI - args.append("-dGraphicsAlphaBits=4") - args.append("-dTextAlphaBits=4") - args.append("-dAlignToPixels=0") - - args.append("-") # Read from stdin. Must be last argument. - - error_msg = "Unrecoverable error : Ghostscript conversion failed " \ - "(Invalid postscript?). %s" % " ".join(args) - - source = fin.read() - - try : - p = Popen(args, stdin=PIPE, stdout = PIPE, stderr= PIPE) - (out,err) = p.communicate(source) - except OSError : - raise RuntimeError(error_msg) - - if p.returncode != 0 : - error_msg += '\nReturn code: %i\n' % p.returncode - if err is not None : error_msg += err - raise RuntimeError(error_msg) - - print >>fout, out -# end class Ghostscript - - -aa_composition = [ amino_acid_composition[_k] for _k in - unambiguous_protein_alphabet] - - - -# ------ DATA ------ - -classic = ColorScheme([ - ColorGroup("G", "orange" ), - ColorGroup("TU", "red"), - ColorGroup("C", "blue"), - ColorGroup("A", "green") - ] ) - -std_color_schemes = {"auto": None, # Depends on sequence type - "monochrome": monochrome, - "base pairing": base_pairing, - "classic": classic, - "hydrophobicity" : hydrophobicity, - "chemistry" : chemistry, - "charge" : charge, - }# - -default_color_schemes = { - unambiguous_protein_alphabet: hydrophobicity, - unambiguous_rna_alphabet: base_pairing, - unambiguous_dna_alphabet: base_pairing - #codon_alphabet:codonsU - } - - -std_units = { - "bits" : 1./log(2), - "nats" : 1., - "digits" : 1./log(10), - "kT" : 1., - "kJ/mol" : 8.314472 *298.15 /1000., - "kcal/mol": 1.987 *298.15 /1000., - "probability" : None, -} - -class LogoSize(object) : - def __init__(self, stack_width, stack_height) : - 
self.stack_width = stack_width - self.stack_height = stack_height - - def __repr__(self): - return stdrepr(self) - -# The base stack width is set equal to 9pt Courier. -# (Courier has a width equal to 3/5 of the point size.) -# Check that can get 80 characters in journal page @small -# 40 chacaters in a journal column -std_sizes = { - "small" : LogoSize( stack_width = 10, stack_height = 10*1*5), - "medium" : LogoSize( stack_width = 10*2, stack_height = 10*2*5), - "large" : LogoSize( stack_width = 10*3, stack_height = 10*3*5), -} - - -std_alphabets = { - 'protein': unambiguous_protein_alphabet, - 'rna': unambiguous_rna_alphabet, - 'dna': unambiguous_dna_alphabet, - 'codonsU':codon_alphabetU, - 'codonsT':codon_alphabetT -} - -std_percentCG = { - 'H. sapiens' : 40., - 'E. coli' : 50.5, - 'S. cerevisiae' : 38., - 'C. elegans' : 36., - 'D. melanogaster': 43., - 'M. musculus' : 42., - 'T. thermophilus' : 69.4, -} - -# Thermus thermophilus: Henne A, Bruggemann H, Raasch C, Wiezer A, Hartsch T, -# Liesegang H, Johann A, Lienard T, Gohl O, Martinez-Arias R, Jacobi C, -# Starkuviene V, Schlenczeck S, Dencker S, Huber R, Klenk HP, Kramer W, -# Merkl R, Gottschalk G, Fritz HJ: The genome sequence of the extreme -# thermophile Thermus thermophilus. -# Nat Biotechnol 2004, 22:547-53 - -def stdrepr(obj) : - attr = vars(obj).items() - - - attr.sort() - args = [] - for a in attr : - if a[0][0]=='_' : continue - args.append( '%s=%s' % ( a[0], repr(a[1])) ) - args = ',\n'.join(args).replace('\n', '\n ') - return '%s(\n %s\n)' % (obj.__class__.__name__, args) - - -class LogoOptions(object) : - """ A container for all logo formating options. Not all of these - are directly accessible through the CLI or web interfaces. - - To display LogoOption defaults: - >>> from weblogolib import * - >>> LogoOptions() - - - Attributes: - o alphabet - o creator_text -- Embedded as comment in figures. 
- o logo_title - o logo_label - o stacks_per_line - o unit_name - o show_yaxis - o yaxis_label -- Default depends on other settings. - o yaxis_tic_interval - o yaxis_minor_tic_ratio - o yaxis_scale - o show_xaxis - o xaxis_label - o xaxis_tic_interval - o rotate_numbers - o number_interval - o show_ends - o show_fineprint - o fineprint - o show_boxes - o shrink_fraction - o show_errorbars - o errorbar_fraction - o errorbar_width_fraction - o errorbar_gray - o resolution -- Dots per inch - o default_color - o color_scheme - o debug - o logo_margin - o stroke_width - o tic_length - o size - o stack_margin - o pad_right - o small_fontsize - o fontsize - o title_fontsize - o number_fontsize - o text_font - o logo_font - o title_font - o first_index - o logo_start - o logo_end - o scale_width - """ - - def __init__(self, **kwargs) : - """ Create a new LogoOptions instance. - - >>> L = LogoOptions(logo_title = "Some Title String") - >>> L.show_yaxis = False - >>> repr(L) - """ - - self.creator_text = release_description, - self.alphabet = None - - self.logo_title = "" - self.logo_label = "" - self.stacks_per_line = 20 - - self.unit_name = "bits" - - self.show_yaxis = True - # yaxis_lable default depends on other settings. See LogoFormat - self.yaxis_label = None - self.yaxis_tic_interval = 1. - self.yaxis_minor_tic_ratio = 5 - self.yaxis_scale = None - - self.show_xaxis = True - self.xaxis_label = "" - self.xaxis_tic_interval =1 - self.rotate_numbers = False - self.number_interval = 5 - self.show_ends = False - - self.show_fineprint = True - self.fineprint = "CodonLogo "+__version__ - - self.show_boxes = False - self.shrink_fraction = 0.5 - - self.show_errorbars = True - self.altype = True - - self.errorbar_fraction = 0.90 - self.errorbar_width_fraction = 0.25 - self.errorbar_gray = 0.75 - - self.resolution = 96. 
# Dots per inch - - self.default_color = Color.by_name("black") - self.color_scheme = None - #self.show_color_key = False # NOT yet implemented - - self.debug = False - - self.logo_margin = 2 - self.stroke_width = 0.5 - self.tic_length = 5 - - self.size = std_sizes["medium"] - - self.stack_margin = 0.5 - self.pad_right = False - - self.small_fontsize = 6 - self.fontsize = 10 - self.title_fontsize = 12 - self.number_fontsize = 8 - - self.text_font = "ArialMT" - self.logo_font = "Arial-BoldMT" - self.title_font = "ArialMT" - - self.first_index = 1 - self.logo_start = None - self.logo_end=None - - # Scale width of characters proportional to gaps - self.scale_width = True - - from corebio.utils import update - update(self, **kwargs) - - def __repr__(self) : - attr = vars(self).items() - attr.sort() - args = [] - for a in attr : - if a[0][0]=='_' : continue - args.append( '%s=%s' % ( a[0], repr(a[1])) ) - args = ',\n'.join(args).replace('\n', '\n ') - return '%s(\n %s\n)' % (self.__class__.__name__, args) -# End class LogoOptions - - - - -class LogoFormat(LogoOptions) : - """ Specifies the format of the logo. Requires a LogoData and LogoOptions - objects. 
- - >>> data = LogoData.from_seqs(seqs ) - >>> options = LogoOptions() - >>> options.title = "A Logo Title" - >>> format = LogoFormat(data, options) - """ - # TODO: Raise ArgumentErrors instead of ValueError and document - def __init__(self, data, options= None) : - - LogoOptions.__init__(self) - #global offset - if options is not None : - self.__dict__.update(options.__dict__) - - #offset=options.frame - - self.alphabet = data.alphabet - self.seqlen = data.length - self.altype = True - self.show_title = False - self.show_xaxis_label = False - self.yaxis_minor_tic_interval = None - self.lines_per_logo = None - self.char_width = None - self.line_margin_left = None - self.line_margin_right = None - self.line_margin_bottom = None - self.line_margin_top = None - self.title_height = None - self.xaxis_label_height = None - self.line_height = None - self.line_width = None - self.logo_height = None - self.logo_width = None - self.creation_date = None - self.end_type = None - - if self.stacks_per_line< 1 : - raise ArgumentError("Stacks per line should be greater than zero.", - "stacks_per_line" ) - - if self.size.stack_height<=0.0 : - raise ArgumentError( - "Stack height must be greater than zero.", "stack_height") - if (self.small_fontsize <= 0 or self.fontsize <=0 or - self.title_fontsize<=0 ): - raise ValueError("Font sizes must be positive.") - - if self.errorbar_fraction<0.0 or self.errorbar_fraction>1.0 : - raise ValueError( - "The visible fraction of the error bar must be between zero and one.") - - if self.yaxis_tic_interval<=0.0 : - raise ArgumentError( "The yaxis tic interval cannot be negative.", - 'yaxis_tic_interval') - - if self.size.stack_width <= 0.0 : - raise ValueError( - "The width of a stack should be a positive number.") - - if self.yaxis_minor_tic_interval and \ - self.yaxis_minor_tic_interval<=0.0 : - raise ValueError("Distances cannot be negative.") - - if self.xaxis_tic_interval<=0 : - raise ValueError("Tic interval must be greater than zero.") - - 
if self.number_interval<=0 : - raise ValueError("Invalid interval between numbers.") - - if self.shrink_fraction<0.0 or self.shrink_fraction>1.0 : - raise ValueError("Invalid shrink fraction.") - - if self.stack_margin<=0.0 : - raise ValueError("Invalid stack margin." ) - - if self.logo_margin<=0.0 : - raise ValueError("Invalid logo margin." ) - - if self.stroke_width<=0.0 : - raise ValueError("Invalid stroke width.") - - if self.tic_length<=0.0 : - raise ValueError("Invalid tic length.") - - # FIXME: More validation - - # Inclusive upper and lower bounds - # FIXME: Validate here. Move from eps_formatter - if self.logo_start is None: self.logo_start = self.first_index - - if self.logo_end is None : - self.logo_end = self.seqlen + self.first_index -1 - - self.total_stacks = self.logo_end - self.logo_start +1 - - if self.logo_start - self.first_index <0 : - raise ArgumentError( - "Logo range extends before start of available sequence.", - 'logo_range') - - if self.logo_end - self.first_index >= self.seqlen : - raise ArgumentError( - "Logo range extends beyond end of available sequence.", - 'logo_range') - - if self.logo_title : self.show_title = True - if not self.fineprint : self.show_fineprint = False - if self.xaxis_label : self.show_xaxis_label = True - - if self.yaxis_label is None : - self.yaxis_label = self.unit_name - - if self.yaxis_label : - self.show_yaxis_label = True - else : - self.show_yaxis_label = False - self.show_ends = False - - if not self.yaxis_scale : - conversion_factor = std_units[self.unit_name] - if conversion_factor : - self.yaxis_scale=log(len(self.alphabet))*conversion_factor - #self.yaxis_scale=max(data.entropy)*conversion_factor - #marker# this is where I handle the max height. needs revision. - else : - self.yaxis_scale=1.0 # probability units - - if self.yaxis_scale<=0.0 : - raise ValueError(('yaxis_scale', "Invalid yaxis scale")) - if self.yaxis_tic_interval >= self.yaxis_scale: - self.yaxis_tic_interval /= 2. 
- - self.yaxis_minor_tic_interval \ - = float(self.yaxis_tic_interval)/self.yaxis_minor_tic_ratio - - if self.color_scheme is None : - #if self.alphabet in default_color_schemes : - #self.color_scheme = default_color_schemes[self.alphabet] - #else : - self.color_scheme = codonsT - #else: - #for color, symbols, desc in options.colors: - #try : - #self.color_scheme.append( ColorGroup(symbols, color, desc) ) - #print >>sys.stderr, color_scheme.groups[2] - #except ValueError : - #raise ValueError( - #"error: option --color: invalid value: '%s'" % color ) - - - self.lines_per_logo = 1+ ( (self.total_stacks-1) / self.stacks_per_line) - - if self.lines_per_logo==1 and not self.pad_right: - self.stacks_per_line = min(self.stacks_per_line, self.total_stacks) - - self.char_width = self.size.stack_width - 2* self.stack_margin - - - if self.show_yaxis : - self.line_margin_left = self.fontsize * 3.0 - else : - self.line_margin_left = 0 - - if self.show_ends : - self.line_margin_right = self.fontsize *1.5 - else : - self.line_margin_right = self.fontsize - - if self.show_xaxis : - if self.rotate_numbers : - self.line_margin_bottom = self.number_fontsize *2.5 - else: - self.line_margin_bottom = self.number_fontsize *1.5 - else : - self.line_margin_bottom = 4 - - self.line_margin_top = 4 - - if self.show_title : - self.title_height = self.title_fontsize - else : - self.title_height = 0 - - self.xaxis_label_height =0. 
- if self.show_xaxis_label : - self.xaxis_label_height += self.fontsize - if self.show_fineprint : - self.xaxis_label_height += self.small_fontsize - - self.line_height = (self.size.stack_height + self.line_margin_top + - self.line_margin_bottom ) - self.line_width = (self.size.stack_width*self.stacks_per_line + - self.line_margin_left + self.line_margin_right ) - - self.logo_height = int(2*self.logo_margin + self.title_height \ - + self.xaxis_label_height + self.line_height*self.lines_per_logo) - self.logo_width = int(2*self.logo_margin + self.line_width ) - - - self.creation_date = datetime.now().isoformat(' ') - - end_type = '-' - end_types = { - unambiguous_protein_alphabet: 'p', - unambiguous_rna_alphabet: '-', - unambiguous_dna_alphabet: 'd' - } - if self.show_ends and self.alphabet in end_types: - end_type = end_types[self.alphabet] - self.end_type = end_type - # End __init__ -# End class LogoFormat - - - -# ------ Logo Formaters ------ -# Each formatter is a function f(LogoData, LogoFormat, output file). -# that draws a represntation of the logo into the given file. -# The main graphical formatter is eps_formatter. A mapping 'formatters' -# containing all available formatters is located after the formatter -# definitions. 
- -def pdf_formatter(data, format, fout) : - """ Generate a logo in PDF format.""" - - feps = StringIO() - eps_formatter(data, format, feps) - feps.seek(0) - - gs = GhostscriptAPI() - gs.convert('pdf', feps, fout, format.logo_width, format.logo_height) - - -def _bitmap_formatter(data, format, fout, device) : - feps = StringIO() - eps_formatter(data, format, feps) - feps.seek(0) - - gs = GhostscriptAPI() - gs.convert(device, feps, fout, - format.logo_width, format.logo_height, format.resolution) - - -def jpeg_formatter(data, format, fout) : - """ Generate a logo in JPEG format.""" - _bitmap_formatter(data, format, fout, device="jpeg") - - -def png_formatter(data, format, fout) : - """ Generate a logo in PNG format.""" - - _bitmap_formatter(data, format, fout, device="png") - - -def png_print_formatter(data, format, fout) : - """ Generate a logo in PNG format with print quality (600 DPI) resolution.""" - format.resolution = 600 - _bitmap_formatter(data, format, fout, device="png") - - -def txt_formatter( logodata, format, fout) : - """ Create a text representation of the logo data. 
- """ - print >>fout, str(logodata) - - - - -def eps_formatter( logodata, format, fout) : - """ Generate a logo in Encapsulated Postscript (EPS)""" - - subsitutions = {} - from_format =[ - "creation_date", "logo_width", "logo_height", - "lines_per_logo", "line_width", "line_height", - "line_margin_right","line_margin_left", "line_margin_bottom", - "line_margin_top", "title_height", "xaxis_label_height", - "creator_text", "logo_title", "logo_margin", - "stroke_width", "tic_length", - "stacks_per_line", "stack_margin", - "yaxis_label", "yaxis_tic_interval", "yaxis_minor_tic_interval", - "xaxis_label", "xaxis_tic_interval", "number_interval", - "fineprint", "shrink_fraction", "errorbar_fraction", - "errorbar_width_fraction", - "errorbar_gray", "small_fontsize", "fontsize", - "title_fontsize", "number_fontsize", "text_font", - "logo_font", "title_font", - "logo_label", "yaxis_scale", "end_type", - "debug", "show_title", "show_xaxis", - "show_xaxis_label", "show_yaxis", "show_yaxis_label", - "show_boxes", "show_errorbars", "show_fineprint", - "rotate_numbers", "show_ends", "altype", - - ] - - for s in from_format : - subsitutions[s] = getattr(format,s) - - - from_format_size = ["stack_height", "stack_width"] - for s in from_format_size : - subsitutions[s] = getattr(format.size,s) - - subsitutions["shrink"] = str(format.show_boxes).lower() - - - # --------- COLORS -------------- - def format_color(color): - return " ".join( ("[",str(color.red) , str(color.green), - str(color.blue), "]")) - - subsitutions["default_color"] = format_color(format.default_color) - global col - colors = [] - - if altype=="codonsT" or altype=="codonsU": - for group in col: - cf = format_color(group.color) - colors.append( " ("+group.symbols+") " + cf ) - for group in format.color_scheme.groups : - cf = format_color(group.color) - - colors.append( " ("+group.symbols+") " + cf ) - #print >>sys.stderr,opts.colors - #print >>sys.stderr,logodata.options - #print >>sys.stderr, group.symbols - #print 
>>sys.stderr, cf - - - - else: - for group in format.color_scheme.groups : - cf = format_color(group.color) - for s in group.symbols : - colors.append( " ("+s+") " + cf ) - - subsitutions["color_dict"] = "\n".join(colors) - data = [] - - # Unit conversion. 'None' for probability units - conv_factor = std_units[format.unit_name] - - data.append("StartLine") - - - seq_from = format.logo_start- format.first_index - seq_to = format.logo_end - format.first_index +1 - - # seq_index : zero based index into sequence data - # logo_index : User visible coordinate, first_index based - # stack_index : zero based index of visible stacks - for seq_index in range(seq_from, seq_to) : - logo_index = seq_index + format.first_index - stack_index = seq_index - seq_from - - if stack_index!=0 and (stack_index % format.stacks_per_line) ==0 : - data.append("") - data.append("EndLine") - data.append("StartLine") - data.append("") - - if logo_index % format.number_interval == 0 : - data.append("(%d) StartStack" % logo_index) - else : - data.append("() StartStack" ) - - if conv_factor: - stack_height = logodata.entropy[seq_index] * std_units[format.unit_name] - else : - stack_height = 1.0 # Probability - - # if logodata.entropy_interval is not None and conv_factor: - # Draw Error bars - # low, high = logodata.entropy_interval[seq_index] - # center = logodata.entropy[seq_index] - - - # down = (center - low) * conv_factor - # up = (high - center) * conv_factor - # data.append(" %f %f %f DrawErrorbarFirst" % (down, up, stack_height) ) - - s = zip(logodata.counts[seq_index], logodata.alphabet) - def mycmp( c1, c2 ) : - # Sort by frequency. 
If equal frequency then reverse alphabetic - if c1[0] == c2[0] : return cmp(c2[1], c1[1]) - return cmp(c1[0], c2[0]) - - s.sort(mycmp) - - C = float(sum(logodata.counts[seq_index])) - if C > 0.0 : - fraction_width = 1.0 - if format.scale_width : - fraction_width = logodata.weight[seq_index] - for c in s: - data.append(" %f %f (%s) ShowSymbol" % (fraction_width, c[0]*stack_height/C, c[1]) ) - - # Draw error bar on top of logo. Replaced by DrawErrorbarFirst above. - if logodata.entropy_interval is not None and conv_factor: - low, high = logodata.entropy_interval[seq_index] - center = logodata.entropy[seq_index] - - down = (center - low) * conv_factor - up = (high - center) * conv_factor - data.append(" %f %f DrawErrorbar" % (down, up) ) - - data.append("EndStack") - data.append("") - - data.append("EndLine") - subsitutions["logo_data"] = "\n".join(data) - - - # Create and output logo - template = resource_string( __name__, 'template.eps', __file__) - logo = Template(template).substitute(subsitutions) - print >>fout, logo - - -# map between output format names and logo -formatters = { - 'eps': eps_formatter, - 'pdf': pdf_formatter, - 'png': png_formatter, - 'png_print' : png_print_formatter, - 'jpeg' : jpeg_formatter, - 'txt' : txt_formatter, - } - -default_formatter = eps_formatter - - - - - -def parse_prior(composition, alphabet, weight=None) : - """ Parse a description of the expected monomer distribution of a sequence. - - Valid compositions: - - - None or 'none' : No composition sepecified - - 'auto' or 'automatic': Use the typical average distribution - for proteins and an equiprobable distribution for - everything else. - - 'equiprobable' : All monomers have the same probability. - - a percentage, e.g. '45%' or a fraction '0.45': - The fraction of CG bases for nucleotide alphabets - - a species name, e.g. 'E. coli', 'H. sapiens' : - Use the average CG percentage for the specie's - genome. - - An explicit distribution, e.g. 
{'A':10, 'C':40, 'G':40, 'T':10} - """ - - - if composition is None: return None - comp = composition.strip() - - if comp.lower() == 'none': return None - - if weight is None and alphabet is not None: weight = float(len(alphabet)) - - if comp.lower() == 'equiprobable' : - prior = weight * equiprobable_distribution(len(alphabet)) - elif comp.lower() == 'auto' or comp.lower() == 'automatic': - if alphabet == unambiguous_protein_alphabet : - prior = weight * asarray(aa_composition, float64) - else : - prior = weight * equiprobable_distribution(len(alphabet)) - elif comp in std_percentCG : - prior = weight * base_distribution(std_percentCG[comp]) - - elif comp[-1] == '%' : - prior = weight * base_distribution( float(comp[:-1])) - - elif isfloat(comp) : - prior = weight * base_distribution( float(comp)*100. ) - - elif composition[0] == '{' and composition[-1] == '}' : - explicit = composition[1: -1] - explicit = explicit.replace(',',' ').replace("'", ' ').replace('"',' ').replace(':', ' ').split() - - if len(explicit) != len(alphabet)*2 : - #print explicit - raise ValueError("Explicit prior does not match length of alphabet") - prior = - ones(len(alphabet), float64) - try : - for r in range(len(explicit)/2) : - letter = explicit[r*2] - index = alphabet.index(letter) - value = float(explicit[r*2 +1]) - prior[index] = value - except ValueError : - raise ValueError("Cannot parse explicit composition") - - if any(prior==-1.) : - raise ValueError("Explicit prior does not match alphabet") - prior/= sum(prior) - prior *= weight - - - else : - raise ValueError("Unknown or malformed composition: %s"%composition) - if len(prior) != len(alphabet) : - raise ValueError( - "The sequence alphabet and composition are incompatible.") - - return prior - - -def base_distribution(percentCG) : - A = (1. - (percentCG/100.))/2. - C = (percentCG/100.)/2. - G = (percentCG/100.)/2. - T = (1. - (percentCG/100))/2. 
- return asarray((A,C,G,T), float64) - -def equiprobable_distribution( length) : - return ones( (length), float64) /length - - - - -def read_seq_data(fin, input_parser=seq_io.read,alphabet=None, ignore_lower_case=False, max_file_size=0): - # TODO: Document this method and enviroment variable - max_file_size =int(os.environ.get("WEBLOGO_MAX_FILE_SIZE", max_file_size)) - - # If max_file_size is set, or if fin==stdin (which is non-seekable), we - # read the data and replace fin with a StringIO object. - - if(max_file_size>0) : - data = fin.read(max_file_size) - - more_data = fin.read(2) - if more_data != "" : - raise IOError("File exceeds maximum allowed size: %d bytes" % max_file_size) - - fin = StringIO(data) - elif fin == sys.stdin: - fin = StringIO(fin.read()) - - seqs = input_parser(fin) - - if seqs is None or len(seqs) ==0 : - raise ValueError("Please provide a multiple sequence alignment") - - if ignore_lower_case : - # Case is significant. Do not count lower case letters. - for i,s in enumerate(seqs) : - seqs[i] = s.mask() - - global altype - if(altype=="codonsT" or altype=="codonsU"): - if 'T' in seqs[0] or 't' in seqs[0]: - altype="codonsT" - if 'U' in seqs[0] or 'u' in seqs[0]: - altype="codonsU" - global offset - seq2=[""]*len(seqs) - seq2 = [] - for i in xrange(len(seqs)): - seq2.append([]) - if(offset%6>2): - for x in range(0,len(seqs)): - backs=seqs[x][::-1] - for y in range(0,len(backs)): - seq2[x].append(str(backs[y])) - - if(altype=="codonsU"): - for x in range(0,len(seq2)): - for y in range(0,len(seq2[x])): - if(cmp(seq2[x][y],'G')==0): - seq2[x][y]="C" - elif(cmp(seq2[x][y],'A')==0): - seq2[x][y]='U' - elif(cmp(seq2[x][y],'U')==0): - seq2[x][y]='A' - elif(cmp(seq2[x][y],'C')==0): - seq2[x][y]='G' - if(altype=="codonsT"): - for x in range(0,len(seq2)): - for y in range(0,len(seq2[x])): - if(cmp(seq2[x][y],'G')==0): - seq2[x][y]='C' - elif(cmp(seq2[x][y],'A')==0): - seq2[x][y]='T' - elif(cmp(seq2[x][y],'T')==0): - seq2[x][y]='A' - 
elif(cmp(seq2[x][y],'C')==0): - seq2[x][y]='G' - offset=offset%3 - for x in range(0,len(seqs)): - seqs[x]=Seq("".join(seq2[x])) - - - # Add alphabet to seqs. - if alphabet : - seqs.alphabet = alphabet - else : - seqs.alphabet = which_alphabet(seqs) - - return seqs - - - -#TODO: Move to seq_io? -# Would have to check that no symbol outside of full alphabet? -def which_alphabet(seqs) : - """ Returns the most appropriate unambiguous protien, rna or dna alphabet - for a Seq or SeqList. - """ - alphabets = (unambiguous_protein_alphabet, - unambiguous_rna_alphabet, - unambiguous_dna_alphabet - ) - # Heuristic - # Count occurances of each letter. Downweight longer alphabet. - #for x in seqs: - - - if( altype=="codonsU"): - return codon_alphabetU - if( altype=="codonsT"): - return codon_alphabetT - else: - score = [1.0*asarray(seqs.tally(a)).sum()/sqrt(len(a)) for a in alphabets] - #print score - best = argmax(score) # Ties go to last in list. - a = alphabets[best] - return a - - - -class LogoData(object) : - """The data needed to generate a sequence logo. - - - alphabet - - length - - counts -- An array of character counts - - entropy -- The relative entropy of each column - - entropy_interval -- entropy confidence interval - """ - - def __init__(self, length=None, alphabet = None, counts =None, - entropy =None, entropy_interval = None, weight=None) : - """Creates a new LogoData object""" - self.length = length - self.alphabet = alphabet - self.counts = counts - self.entropy = entropy - self.entropy_interval = entropy_interval - self.weight = weight - - - - #@classmethod - def from_counts(cls, alphabet, counts, prior= None): - - """Build a logodata object from counts.""" - seq_length, A = counts.shape - - if prior is not None: prior = array(prior, float64) - if prior is None : - R = log(A) - ent = zeros( seq_length, float64) - entropy_interval = None - for i in range (0, seq_length) : - C = sum(counts[i]) - #FIXME: fixup corebio.moremath.entropy()? 
- if C == 0 : - ent[i] = 0.0 - else : - ent[i] = R - entropy(counts[i]) - else : - ent = zeros( seq_length, float64) - entropy_interval = zeros( (seq_length,2) , float64) - - R = log(A) - - for i in range (0, seq_length) : - alpha = array(counts[i] , float64) - alpha += prior - - posterior = Dirichlet(alpha) - ent[i] = posterior.mean_relative_entropy(prior/sum(prior)) - entropy_interval[i][0], entropy_interval[i][1] = \ - posterior.interval_relative_entropy(prior/sum(prior), 0.95) - weight = array( na.sum(counts,axis=1) , float) - - weight /= max(weight) - - return cls(seq_length, alphabet, counts, ent, entropy_interval, weight) - from_counts = classmethod(from_counts) - - - #@classmethod - def from_seqs(cls, seqs, prior= None): - - - alphabet=seqs.alphabet - - #get the offset and if it's greater than 2 flip the sequences to the negative strand and reverse. - for x in range(0,len(seqs)): - seqs[x]=seqs[x].upper() - counter=0 - - - """Build a 2D array from a SeqList, a list of sequences.""" - # --- VALIDATE DATA --- - # check that there is at least one sequence of length 1 - if len(seqs)==0 or len(seqs[0]) ==0: - raise ValueError("No sequence data found.") - sys.exit(0) - # Check sequence lengths - seq_length = len(seqs[0]) - for i,s in enumerate(seqs) : - #print i,s, len(s) - if seq_length != len(s) : - raise ArgumentError( - "Sequence number %d differs in length from the previous sequences" % (i+1) ,'sequences') - sys.exit(0) - - if(altype=="codonsT" or altype=="codonsU"): - x = [[0]*64 for x in xrange(seq_length/3)] - counter=offset - - while counter+offset>out, '## LogoData' - print >>out, '# First column is position number, couting from zero' - print >>out, '# Subsequent columns are raw symbol counts' - print >>out, '# Entropy is mean entropy measured in nats.' - print >>out, '# Low and High are the 95% confidence limits.' - print >>out, '# Weight is the fraction of non-gap symbols in the column.' 
- print >>out, '#\t' - print >>out, '#\t', - for a in self.alphabet : - print >>out, a, '\t', - print >>out, 'Entropy\tLow\tHigh\tWeight' - - for i in range(self.length) : - print >>out, i, '\t', - for c in self.counts[i] : print >>out, c, '\t', - print >>out, self.entropy[i], '\t', - if self.entropy_interval is not None: - print >>out, self.entropy_interval[i][0], '\t', - print >>out, self.entropy_interval[i][1], '\t', - else : - print >>out, '\t','\t', - if self.weight is not None : - print >>out, self.weight[i], - print >>out, '' - print >>out, '# End LogoData' - - return out.getvalue() - -# ====================== Main: Parse Command line ============================= -def main(): - """CodonLogo command line interface """ - - # ------ Parse Command line ------ - parser = _build_option_parser() - (opts, args) = parser.parse_args(sys.argv[1:]) - if args : parser.error("Unparsable arguments: %s " % args) - - if opts.serve: - httpd_serve_forever(opts.port) # Never returns? - sys.exit(0) - - - # ------ Create Logo ------ - try: - data = _build_logodata(opts) - - - format = _build_logoformat(data, opts) - - - formatter = opts.formatter - formatter(data, format, opts.fout) - - except ValueError, err : - print >>sys.stderr, 'Error:', err - sys.exit(2) - except KeyboardInterrupt, err: - sys.exit(0) -# End main() - - -def httpd_serve_forever(port=8080) : - """ Start a webserver on a local port.""" - import BaseHTTPServer - import CGIHTTPServer - - class __HTTPRequestHandler(CGIHTTPServer.CGIHTTPRequestHandler): - def is_cgi(self) : - if self.path == "/create.cgi": - self.cgi_info = '', 'create.cgi' - return True - return False - - # Add current directory to PYTHONPATH. This is - # so that we can run the standalone server - # without having to run the install script. 
- pythonpath = os.getenv("PYTHONPATH", '') - pythonpath += ":" + os.path.abspath(sys.path[0]).split()[0] - os.environ["PYTHONPATH"] = pythonpath - - htdocs = resource_filename(__name__, 'htdocs', __file__) - os.chdir(htdocs) - - HandlerClass = __HTTPRequestHandler - ServerClass = BaseHTTPServer.HTTPServer - httpd = ServerClass(('', port), HandlerClass) - print "Serving HTTP on localhost:%d ..." % port - - try : - httpd.serve_forever() - except KeyboardInterrupt: - sys.exit(0) -# end httpd_serve_forever() - -def read_priors(finp, alphabet ,max_file_size=0): - - max_file_size =int(os.environ.get("WEBLOGO_MAX_FILE_SIZE", max_file_size)) - if(max_file_size>0) : - data = finp.read(max_file_size) - more_data = finp.read(2) - if more_data != "" : - raise IOError("File exceeds maximum allowed size: %d bytes" % max_file_size) - finp = StringIO(data) - priordict={} - while 1: - line = finp.readline() - if not line: - break - line = line.split() - priordict[line[0]]=line[1] - return priordict - -def _build_logodata(options) : - global offset - offset=options.frame - options.alphabet = None - options.ignore_lower_case = False - #options.default_color = Color.by_name("black") - options.color_scheme=None - #options.colors=[] - options.show_ends=False - seqs = read_seq_data(options.fin, - options.input_parser.read, - alphabet=options.alphabet, - ignore_lower_case = options.ignore_lower_case) - if(options.priorfile!=None): - if(altype=="CodonsT"): - options.composition= str(read_priors(options.priorfile,codon_alphabetT)) - options.alphabet = codon_alphabetT - else: - options.composition= str(read_priors(options.priorfile,codon_alphabetU)) - options.alphabet = codon_alphabetU - - prior = parse_prior( options.composition,seqs.alphabet, options.weight) - data = LogoData.from_seqs(seqs, prior) - return data - - -def _build_logoformat( logodata, opts) : - """ Extract and process relevant option values and return a - LogoFormat object.""" - - args = {} - direct_from_opts = [ - 
"stacks_per_line", - "logo_title", - "yaxis_label", - "show_xaxis", - "show_yaxis", - "xaxis_label", - "show_ends", - "fineprint", - "show_errorbars", - "altype", - "show_boxes", - "yaxis_tic_interval", - "resolution", - "alphabet", - "debug", - "show_ends", - "default_color", - #"show_color_key", - "color_scheme", - "unit_name", - "logo_label", - "yaxis_scale", - "first_index", - "logo_start", - "logo_end", - "scale_width", - "frame", - ] - - for k in direct_from_opts: - args[k] = opts.__dict__[k] - logo_size = copy.copy(opts.__dict__['logo_size']) - size_from_opts = ["stack_width", "stack_height"] - for k in size_from_opts : - length = getattr(opts, k) - if length : setattr( logo_size, k, length ) - args["size"] = logo_size - - global col - - if opts.colors: - color_scheme = ColorScheme() - for color, symbols, desc in opts.colors: - try : - #c = Color.from_string(color) - color_scheme.groups.append( ColorGroup(symbols, color, desc) ) - #print >> sys.stderr, color - - col.append( ColorGroup(symbols, color, desc) ) - - except ValueError : - raise ValueError( - "error: option --color: invalid value: '%s'" % color ) - if(altype!="codonsU" and altype!="codonsT") : - args["color_scheme"] = color_scheme - - #cf = colorscheme.format_color(col[0]) - #col.append( " ("+group.symbols+") " + cf ) - - logooptions = LogoOptions() - for a, v in args.iteritems() : - setattr(logooptions,a,v) - - - - theformat = LogoFormat(logodata, logooptions ) - - return theformat - - - - - - -# ========================== OPTIONS ========================== -def _build_option_parser() : - defaults = LogoOptions() - parser = DeOptionParser(usage="%prog [options] < sequence_data.fa > sequence_logo.eps", - description = description, - version = __version__ , - add_verbose_options = False - ) - - io_grp = OptionGroup(parser, "Input/Output Options",) - data_grp = OptionGroup(parser, "Logo Data Options",) - format_grp = OptionGroup(parser, "Logo Format Options", - "These options control the format and 
display of the logo.") - color_grp = OptionGroup(parser, "Color Options", - "Colors can be specified using CSS2 syntax. e.g. 'red', '#FF0000', etc.") - advanced_grp = OptionGroup(parser, "Advanced Format Options", - "These options provide fine control over the display of the logo. ") - server_grp = OptionGroup(parser, "CodonLogo Server", - "Run a standalone webserver on a local port.") - - - parser.add_option_group(io_grp) - parser.add_option_group(data_grp) - parser.add_option_group(format_grp) - parser.add_option_group(color_grp) - parser.add_option_group(advanced_grp) - parser.add_option_group(server_grp) - - # ========================== IO OPTIONS ========================== - - - - io_grp.add_option( "-f", "--fin", - dest="fin", - action="store", - type="file_in", - default=sys.stdin, - help="Sequence input file (default: stdin)", - metavar="FILENAME") - - io_grp.add_option( "-R", "--prior", - dest="priorfile", - action="store", - type="file_in", - help="A file with 64 codons and their prior probabilities, one per line, each codon followed by a space and it's probability.", - metavar="FILENAME") - - io_grp.add_option("", "--fin-format", - dest="input_parser", - action="store", type ="dict", - default = seq_io, - choices = seq_io.format_names(), - help="Multiple sequence alignment format: (%s)" % - ', '.join([ f.names[0] for f in seq_io.formats]), - metavar="FORMAT") - - io_grp.add_option("-o", "--fout", dest="fout", - type="file_out", - default=sys.stdout, - help="Output file (default: stdout)", - metavar="FILENAME") - - - - io_grp.add_option( "-F", "--format", - dest="formatter", - action="store", - type="dict", - choices = formatters, - metavar= "FORMAT", - help="Format of output: eps (default), png, png_print, pdf, jpeg, txt", - default = default_formatter) - - - # ========================== Data OPTIONS ========================== - - data_grp.add_option("-m", "--frame", - dest="frame", - action="store", - type="int", - default=0, - help="Offset of the 
reading frame you wish to look in (default: 0)", - metavar="COUNT") - - data_grp.add_option("-T", "--type", - dest="altype", - action="store", - type="boolean", - default=True, - help="Generate a codon logo rather than a sequence logo (default: True)", - metavar="YES/NO") - - - - #data_grp.add_option( "-A", "--sequence-type", - #dest="alphabet", - #action="store", - #type="dict", - #choices = std_alphabets, - #help="The type of sequence data: 'protein', 'rna' or 'dna'.", - #metavar="TYPE") - - #data_grp.add_option( "-a", "--alphabet", - #dest="alphabet", - #action="store", - #help="The set of symbols to count, e.g. 'AGTC'. " - #"All characters not in the alphabet are ignored. " - #"If neither the alphabet nor sequence-type are specified then codonlogo will examine the input data and make an educated guess. " - #"See also --sequence-type, --ignore-lower-case" ) - - # FIXME Add test? - #data_grp.add_option( "", "--ignore-lower-case", - #dest="ignore_lower_case", - #action="store", - #type = "boolean", - #default=False, - #metavar = "YES/NO", - #help="Disregard lower case letters and only count upper case letters in sequences? (Default: No)" - #) - - data_grp.add_option( "-U", "--units", - dest="unit_name", - action="store", - choices = std_units.keys(), - type="choice", - default = defaults.unit_name, - help="A unit of entropy ('bits' (default), 'nats', 'digits'), or a unit of free energy ('kT', 'kJ/mol', 'kcal/mol'), or 'probability' for probabilities", - metavar = "NUMBER") - - - data_grp.add_option( "", "--composition", - dest="composition", - action="store", - type="string", - default = "auto", - help="The expected composition of the sequences: 'auto' (default), 'equiprobable', 'none' (Do not perform any compositional adjustment), a CG percentage, a species name (e.g. 'E. coli', 'H. sapiens'), or an explicit distribution (e.g. {'A':10, 'C':40, 'G':40, 'T':10}). 
The automatic option uses a typical distribution for proteins and equiprobable distribution for everything else. ", - metavar="COMP.") - - data_grp.add_option( "", "--weight", - dest="weight", - action="store", - type="float", - default = None, - help="The weight of prior data. Default: total pseudocounts equal to the number of monomer types.", - metavar="NUMBER") - - data_grp.add_option( "-i", "--first-index", - dest="first_index", - action="store", - type="int", - default = 1, - help="Index of first position in sequence data (default: 1)", - metavar="INDEX") - - data_grp.add_option( "-l", "--lower", - dest="logo_start", - action="store", - type="int", - help="Lower bound of sequence to display", - metavar="INDEX") - - data_grp.add_option( "-u", "--upper", - dest="logo_end", - action="store", - type="int", - help="Upper bound of sequence to display", - metavar="INDEX") - - # ========================== FORMAT OPTIONS ========================== - - format_grp.add_option( "-s", "--size", - dest="logo_size", - action="store", - type ="dict", - choices = std_sizes, - metavar = "LOGOSIZE", - default = defaults.size, - help="Specify a standard logo size (small, medium (default), large)" ) - - - - format_grp.add_option( "-n", "--stacks-per-line", - dest="stacks_per_line", - action="store", - type="int", - help="Maximum number of logo stacks per logo line. (default: %default)", - default = defaults.stacks_per_line, - metavar="COUNT") - - format_grp.add_option( "-t", "--title", - dest="logo_title", - action="store", - type="string", - help="Logo title text.", - default = defaults.logo_title, - metavar="TEXT") - - format_grp.add_option( "", "--label", - dest="logo_label", - action="store", - type="string", - help="A figure label, e.g. 
'2a'", - default = defaults.logo_label, - metavar="TEXT") - - format_grp.add_option( "-X", "--show-xaxis", - action="store", - type = "boolean", - default= defaults.show_xaxis, - metavar = "YES/NO", - help="Display sequence numbers along x-axis? (default: %default)") - - format_grp.add_option( "-x", "--xlabel", - dest="xaxis_label", - action="store", - type="string", - default = defaults.xaxis_label, - help="X-axis label", - metavar="TEXT") - - format_grp.add_option( "-S", "--yaxis", - dest="yaxis_scale", - action="store", - type="float", - help="Height of yaxis in units. (Default: Maximum value with uninformative prior.)", - metavar = "UNIT") - - format_grp.add_option( "-Y", "--show-yaxis", - action="store", - type = "boolean", - dest = "show_yaxis", - default= defaults.show_yaxis, - metavar = "YES/NO", - help="Display entropy scale along y-axis? (default: %default)") - - format_grp.add_option( "-y", "--ylabel", - dest="yaxis_label", - action="store", - type="string", - help="Y-axis label (default depends on plot type and units)", - metavar="TEXT") - - #format_grp.add_option( "-E", "--show-ends", - #action="store", - #type = "boolean", - #default= defaults.show_ends, - #metavar = "YES/NO", - #help="Label the ends of the sequence? (default: %default)") - - format_grp.add_option( "-P", "--fineprint", - dest="fineprint", - action="store", - type="string", - default= defaults.fineprint, - help="The fine print (default: Codonlogo version)", - metavar="TEXT") - - format_grp.add_option( "", "--ticmarks", - dest="yaxis_tic_interval", - action="store", - type="float", - default= defaults.yaxis_tic_interval, - help="Distance between ticmarks (default: %default)", - metavar = "NUMBER") - - - format_grp.add_option( "", "--errorbars", - dest = "show_errorbars", - action="store", - type = "boolean", - default= defaults.show_errorbars, - metavar = "YES/NO", - help="Display error bars? 
(default: %default)") - - - - # ========================== Color OPTIONS ========================== - # TODO: Future Feature - # color_grp.add_option( "-K", "--color-key", - # dest= "show_color_key", - # action="store", - # type = "boolean", - # default= defaults.show_color_key, - # metavar = "YES/NO", - # help="Display a color key (default: %default)") - - - #color_scheme_choices = std_color_schemes.keys() - #color_scheme_choices.sort() - #color_grp.add_option( "-c", "--color-scheme", - #dest="color_scheme", - #action="store", - #type ="dict", - #choices = std_color_schemes, - #metavar = "SCHEME", - #default = None, # Auto - #help="Specify a standard color scheme (%s)" % \ - #", ".join(color_scheme_choices) ) - - color_grp.add_option( "-C", "--color", - dest="colors", - action="append", - metavar="COLOR SYMBOLS DESCRIPTION ", - nargs = 3, - default=[], - help="Specify symbol colors, e.g. --color black AG 'Purine' --color red TC 'Pyrimidine' ") - - color_grp.add_option( "", "--default-color", - dest="default_color", - action="store", - metavar="COLOR", - default= defaults.default_color, - help="Symbol color if not otherwise specified.") - - # ========================== Advanced options ========================= - - advanced_grp.add_option( "-W", "--stack-width", - dest="stack_width", - action="store", - type="float", - default= None, - help="Width of a logo stack (default: %s)"% defaults.size.stack_width, - metavar="POINTS" ) - - advanced_grp.add_option( "-H", "--stack-height", - dest="stack_height", - action="store", - type="float", - default= None, - help="Height of a logo stack (default: %s)"%defaults.size.stack_height, - metavar="POINTS" ) - - advanced_grp.add_option( "", "--box", - dest="show_boxes", - action="store", - type = "boolean", - default=False, - metavar = "YES/NO", - help="Draw boxes around symbols? 
(default: no)") - - advanced_grp.add_option( "", "--resolution", - dest="resolution", - action="store", - type="float", - default=96, - help="Bitmap resolution in dots per inch (DPI). (default: 96 DPI, except png_print, 600 DPI) Low resolution bitmaps (DPI<300) are antialiased.", - metavar="DPI") - - advanced_grp.add_option( "", "--scale-width", - dest="scale_width", - action="store", - type = "boolean", - default= True, - metavar = "YES/NO", - help="Scale the visible stack width by the fraction of symbols in the column? (i.e. columns with many gaps or unknowns are narrow.) (default: yes)") - - advanced_grp.add_option( "", "--debug", - action="store", - type = "boolean", - default= defaults.debug, - metavar = "YES/NO", - help="Output additional diagnostic information. (default: %default)") - - - # ========================== Server options ========================= - server_grp.add_option( "", "--serve", - dest="serve", - action="store_true", - default= False, - help="Start a standalone CodonLogo server for creating sequence logos.") - - server_grp.add_option( "", "--port", - dest="port", - action="store", - type="int", - default= 8080, - help="Listen to this local port. (Default: %default)", - metavar="PORT") - - return parser - - # END _build_option_parser - - -############################################################## - -class Dirichlet(object) : - """The Dirichlet probability distribution. The Dirichlet is a continuous - multivariate probability distribution across non-negative unit length - vectors. In other words, the Dirichlet is a probability distribution of - probability distributions. It is conjugate to the multinomial - distribution and is widely used in Bayesian statistics. - - The Dirichlet probability distribution of order K-1 is - - p(theta_1,...,theta_K) d theta_1 ... 
d theta_K = - (1/Z) prod_i=1,K theta_i^{alpha_i - 1} delta(1 -sum_i=1,K theta_i) - - The normalization factor Z can be expressed in terms of gamma functions: - - Z = {prod_i=1,K Gamma(alpha_i)} / {Gamma( sum_i=1,K alpha_i)} - - The K constants, alpha_1,...,alpha_K, must be positive. The K parameters, - theta_1,...,theta_K are nonnegative and sum to 1. - - Status: - Alpha - """ - __slots__ = 'alpha', '_total', '_mean', - - - - - def __init__(self, alpha) : - """ - Args: - - alpha -- The parameters of the Dirichlet prior distribution. - A vector of non-negative real numbers. - """ - # TODO: Check that alphas are positive - #TODO : what if alpha's not one dimensional? - self.alpha = asarray(alpha, float64) - - self._total = sum(alpha) - self._mean = None - - - def sample(self) : - """Return a randomly generated probability vector. - - Random samples are generated by sampling K values from gamma - distributions with parameters a=\alpha_i, b=1, and renormalizing. - - Ref: - A.M. Law, W.D. Kelton, Simulation Modeling and Analysis (1991). - Authors: - Gavin E. Crooks (2002) - """ - alpha = self.alpha - K = len(alpha) - theta = zeros( (K,), float64) - - for k in range(K): - theta[k] = random.gammavariate(alpha[k], 1.0) - theta /= sum(theta) - - return theta - - def mean(self) : - if self._mean ==None: - self._mean = self.alpha / self._total - return self._mean - - def covariance(self) : - alpha = self.alpha - A = sum(alpha) - #A2 = A * A - K = len(alpha) - cv = zeros( (K,K), float64) - - for i in range(K) : - cv[i,i] = alpha[i] * (1. - alpha[i]/A) / (A * (A+1.) ) - - for i in range(K) : - for j in range(i+1,K) : - v = - alpha[i] * alpha[j] / (A * A * (A+1.) 
) - cv[i,j] = v - cv[j,i] = v - return cv - - def mean_x(self, x) : - x = asarray(x, float64) - if shape(x) != shape(self.alpha) : - raise ValueError("Argument must be same dimension as Dirichlet") - return sum( x * self.mean()) - - def variance_x(self, x) : - x = asarray(x, float64) - if shape(x) != shape(self.alpha) : - raise ValueError("Argument must be same dimension as Dirichlet") - - cv = self.covariance() - var = na.dot(na.dot(na.transpose( x), cv), x) - return var - - - def mean_entropy(self) : - """Calculate the average entropy of probabilities sampled - from this Dirichlet distribution. - - Returns: - The average entropy. - - Ref: - Wolpert & Wolf, PRE 53:6841-6854 (1996) Theorem 7 - (Warning: this paper contains typos.) - Status: - Alpha - Authors: - GEC 2005 - - """ - # TODO: Optimize - alpha = self.alpha - A = float(sum(alpha)) - ent = 0.0 - for a in alpha: - if a>0 : ent += - 1.0 * a * digamma( 1.0+a) # FIXME: Check - ent /= A - ent += digamma(A+1.0) - return ent - - - - def variance_entropy(self): - """Calculate the variance of the Dirichlet entropy. - - Ref: - Wolpert & Wolf, PRE 53:6841-6854 (1996) Theorem 8 - (Warning: this paper contains typos.) - """ - alpha = self.alpha - A = float(sum(alpha)) - A2 = A * (A+1) - L = len(alpha) - - dg1 = zeros( (L) , float64) - dg2 = zeros( (L) , float64) - tg2 = zeros( (L) , float64) - - for i in range(L) : - dg1[i] = digamma(alpha[i] + 1.0) - dg2[i] = digamma(alpha[i] + 2.0) - tg2[i] = trigamma(alpha[i] + 2.0) - - dg_Ap2 = digamma( A+2. ) - tg_Ap2 = trigamma( A+2. ) - - mean = self.mean_entropy() - var = 0.0 - - for i in range(L) : - for j in range(L) : - if i != j : - var += ( - ( dg1[i] - dg_Ap2 ) * (dg1[j] - dg_Ap2 ) - tg_Ap2 - ) * (alpha[i] * alpha[j] ) / A2 - else : - var += ( - ( dg2[i] - dg_Ap2 ) **2 + ( tg2[i] - tg_Ap2 ) - ) * ( alpha[i] * (alpha[i]+1.) 
) / A2 - - var -= mean**2 - return var - - - - def mean_relative_entropy(self, pvec) : - ln_p = na.log(pvec) - return - self.mean_x(ln_p) - self.mean_entropy() - - - def variance_relative_entropy(self, pvec) : - ln_p = na.log(pvec) - return self.variance_x(ln_p) + self.variance_entropy() - - - def interval_relative_entropy(self, pvec, frac) : - mean = self.mean_relative_entropy(pvec) - variance = self.variance_relative_entropy(pvec) - # If the variance is small, use the standard 95% - # confidence interval: mean +/- 1.96 * sd - if variance< 0.1 : - sd = sqrt(variance) - return max(0.0, mean - sd*1.96), mean + sd*1.96 - sd = sqrt(variance) - return max(0.0, mean - sd*1.96), mean + sd*1.96 - - g = gamma.from_mean_variance(mean, variance) - low_limit = g.inverse_cdf( (1.-frac)/2.) - high_limit = g.inverse_cdf( 1. - (1.-frac)/2. ) - - return low_limit, high_limit - - -# Standard python voodoo for CLI -if __name__ == "__main__": - ## Code Profiling. Uncomment these lines - #import hotshot, hotshot.stats - #prof = hotshot.Profile("stones.prof") - #prof.runcall(main) - #prof.close() - #stats = hotshot.stats.load("stones.prof") - #stats.strip_dirs() - #stats.sort_stats('cumulative', 'calls') - #stats.print_stats(40) - #sys.exit() - - main() - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/_cgi.py --- a/weblogolib/_cgi.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,446 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2003-2004 The Regents of the University of California. -# Copyright (c) 2005 Gavin E. Crooks -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. 
-# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - -import sys -import cgi as cgilib -import cgitb; cgitb.enable() - -#print "Content-Type: text/html\n\n" -#print "HELLO WORLD" -#print __name__ - -from StringIO import StringIO -from color import * -from colorscheme import ColorScheme, ColorGroup - -import weblogolib -from corebio.utils import * -from corebio._future import Template - - -# TODO: Check units - -# TODO: In WebLogo2: why slash create.cgi? 
I think this was a workaround -# for some browser quirk -#

- -def resource_string(resource, basefilename) : - import os - fn = os.path.join(os.path.dirname(basefilename), resource) - return open( fn ).read() - -mime_type = { - 'eps': 'application/postscript', - 'pdf': 'application/pdf', - 'png': 'image/png', - 'png_print': 'image/png', - 'txt' : 'text/plain', - 'jpeg' : 'image/jpeg', -} - -extension = { - 'eps': 'eps', - 'pdf': 'pdf', - 'png': 'png', - 'png_print': 'png', - 'txt' : 'txt', - 'jpeg' : 'png' -} - - -alphabets = { - 'alphabet_auto': None, - 'alphabet_protein': weblogolib.unambiguous_protein_alphabet, - 'alphabet_rna': weblogolib.unambiguous_rna_alphabet, - 'alphabet_dna': weblogolib.unambiguous_dna_alphabet} - -color_schemes = {} -for k in weblogolib.std_color_schemes.keys(): - color_schemes[ 'color_'+k.replace(' ', '_')] = weblogolib.std_color_schemes[k] - - -composition = {'comp_none' : 'none', - 'comp_auto' : 'auto', - 'comp_equiprobable':'equiprobable', - 'comp_CG': 'percentCG', - 'comp_Celegans' : 'C. elegans', - 'comp_Dmelanogaster' : 'D. melanogaster', - 'comp_Ecoli' : 'E. coli', - 'comp_Hsapiens': 'H. sapiens', - 'comp_Mmusculus' : 'M. musculus', - 'comp_Scerevisiae': 'S. 
cerevisiae' -} - -class Field(object) : - """ A representation of an HTML form field.""" - def __init__(self, name, default=None, conversion= None, options=None, errmsg="Illegal value.") : - self.name = name - self.default = default - self.value = default - self.conversion = conversion - self.options = options - self.errmsg = errmsg - - def get_value(self) : - if self.options : - if not self.value in self.options : - raise ValueError, (self.name, self.errmsg) - - if self.conversion : - try : - return self.conversion(self.value) - except ValueError, e : - raise ValueError, (self.name, self.errmsg) - else: - return self.value - - -def string_or_none(value) : - if value is None or value == 'auto': - return None - return str(value) - -def truth(value) : - if value== "true" : return True - return bool(value) - -def int_or_none(value) : - if value =='' or value is None or value == 'auto': - return None - return int(value) - -def float_or_none(value) : - if value =='' or value is None or value == 'auto': - return None - return float(value) - - -def main(htdocs_directory = None) : - - logooptions = weblogolib.LogoOptions() - - # A list of form fields. - # The default for checkbox values must be False (irrespective of - # the default in logooptions) since a checked checkbox returns 'true' - # but an unchecked checkbox returns nothing. 
- controls = [ - Field( 'sequences', ''), - Field( 'format', 'png', weblogolib.formatters.get , - options=['png_print', 'png', 'jpeg', 'eps', 'pdf', 'txt'] , - errmsg="Unknown format option."), - Field( 'stacks_per_line', logooptions.stacks_per_line , int, - errmsg='Invalid number of stacks per line.'), - Field( 'size','medium', weblogolib.std_sizes.get, - options=['small', 'medium', 'large'], errmsg='Invalid logo size.'), - Field( 'alphabet','alphabet_auto', alphabets.get, - options=['alphabet_auto', 'alphabet_protein', 'alphabet_dna', - 'alphabet_rna'], - errmsg="Unknown sequence type."), - Field( 'unit_name', 'bits', - options=[ 'probability', 'bits', 'nats', 'kT', 'kJ/mol', - 'kcal/mol']), - Field( 'first_index', 1, int_or_none), - Field( 'logo_start', '', int_or_none), - Field( 'logo_end', '', int_or_none), - Field( 'composition', 'comp_auto', composition.get, - options=['comp_none','comp_auto','comp_equiprobable','comp_CG', - 'comp_Celegans','comp_Dmelanogaster','comp_Ecoli', - 'comp_Hsapiens','comp_Mmusculus','comp_Scerevisiae'], - errmsg= "Illegal sequence composition."), - Field( 'percentCG', '', float_or_none, errmsg="Invalid CG percentage."), - Field( 'show_errorbars', False , truth), - Field( 'altype', False , truth), - Field( 'logo_title', logooptions.logo_title ), - Field( 'logo_label', logooptions.logo_label ), - Field( 'show_xaxis', False, truth), - Field( 'xaxis_label', logooptions.xaxis_label ), - Field( 'show_yaxis', False, truth), - Field( 'yaxis_label', logooptions.yaxis_label, string_or_none ), - Field( 'yaxis_scale', logooptions.yaxis_scale , float_or_none, - errmsg="The yaxis scale must be a positive number." 
), - Field( 'yaxis_tic_interval', logooptions.yaxis_tic_interval , - float_or_none), - Field( 'show_ends', False, truth), - Field( 'show_fineprint', False , truth), - Field( 'color_scheme', 'color_auto', color_schemes.get, - options=color_schemes.keys() , - errmsg = 'Unknown color scheme'), - Field( 'color0', ''), - Field( 'symbols0', ''), - Field( 'desc0', ''), - Field( 'color1', ''), - Field( 'symbols1', ''), - Field( 'desc1', ''), - Field( 'color2', ''), - Field( 'symbols2', ''), - Field( 'desc2', ''), - Field( 'color3', ''), - Field( 'symbols3', ''), - Field( 'desc3', ''), - Field( 'color4', ''), - Field( 'symbols4', ''), - Field( 'desc4', ''), - Field( 'ignore_lower_case', False, truth), - Field( 'altype', False, truth), - Field( 'scale_width', False, truth), - ] - - form = {} - for c in controls : - form[c.name] = c - - - form_values = cgilib.FieldStorage() - - # Send default form? - if len(form_values) ==0 or form_values.has_key("cmd_reset"): - # Load default truth values now. - form['show_errorbars'].value = logooptions.show_errorbars - form['show_xaxis'].value = logooptions.show_xaxis - form['show_yaxis'].value = logooptions.show_yaxis - form['show_ends'].value = logooptions.show_ends - form['show_fineprint'].value = logooptions.show_fineprint - form['scale_width'].value = logooptions.scale_width - form['altype'].value = logooptions.altype - - send_form(controls, htdocs_directory = htdocs_directory) - return - - # Get form content - for c in controls : - c.value = form_values.getfirst( c.name, c.default) - - - options_from_form = ['format', 'stacks_per_line', 'size', - 'alphabet', 'unit_name', 'first_index', 'logo_start','logo_end', - 'composition', - 'show_errorbars', 'logo_title', 'logo_label', 'show_xaxis', - 'xaxis_label', - 'show_yaxis', 'yaxis_label', 'yaxis_scale', 'yaxis_tic_interval', - 'show_ends', 'show_fineprint', 'scale_width','altype'] - - errors = [] - for optname in options_from_form : - try : - value = form[optname].get_value() - if 
value!=None : setattr(logooptions, optname, value) - except ValueError, err : - errors.append(err.args) - - #check if using codons or not. - if logooptions.altype!=True: - weblogolib.altype = "" - print >> sys.stderr,logooptions.altype - print >> sys.stderr, "--nn-" - - # Construct custom color scheme - custom = ColorScheme() - for i in range(0,5) : - color = form["color%d"%i].get_value() - symbols = form["symbols%d"%i].get_value() - desc = form["desc%d"%i].get_value() - - if color : - try : - custom.groups.append(weblogolib.ColorGroup(symbols, color, desc)) - except ValueError, e : - errors.append( ('color%d'%i, "Invalid color: %s" % color) ) - - if form["color_scheme"].value == 'color_custom' : - logooptions.color_scheme = custom - else : - try : - logooptions.color_scheme = form["color_scheme"].get_value() - except ValueError, err : - errors.append(err.args) - - sequences = None - - # FIXME: Ugly fix: Must check that sequence_file key exists - # FIXME: Sending malformed or missing form keys should not cause a crash - # sequences_file = form["sequences_file"] - if form_values.has_key("sequences_file") : - sequences = form_values.getvalue("sequences_file") - assert type(sequences) == str - - if not sequences or len(sequences) ==0: - sequences = form["sequences"].get_value() - - if not sequences or len(sequences) ==0: - errors.append( ("sequences", "Please enter a multiple-sequence alignment in the box above, or select a file to upload.")) - - - - # If we have uncovered errors or we want the chance to edit the logo - # ("cmd_edit" command from examples page) then we return the form now. - # We do not proceed to the time consuming logo creation step unless - # required by a 'create' or 'validate' command, and no errors have been - # found yet. - if form_values.has_key("cmd_edit") or errors : - send_form(controls, errors, htdocs_directory) - return - - - - - # We write the logo into a local buffer so that we can catch and - # handle any errors. 
Once the "Content-Type:" header has been sent - # we can't send any useful feedback - logo = StringIO() - try : - comp = form["composition"].get_value() - percentCG = form["percentCG"].get_value() - ignore_lower_case = form_values.has_key("ignore_lower_case") - seqs = weblogolib.read_seq_data(StringIO( sequences), - alphabet=logooptions.alphabet, - ignore_lower_case=ignore_lower_case - ) - if comp=='percentCG': comp = str(percentCG/100) - prior = weblogolib.parse_prior(comp, seqs.alphabet) - data = weblogolib.LogoData.from_seqs(seqs, prior) - logoformat = weblogolib.LogoFormat(data, logooptions) - format = form["format"].value - weblogolib.formatters[format](data, logoformat, logo) - except ValueError, err : - errors.append( err.args ) - except IOError, err : - errors.append( err.args) - except RuntimeError, err : - errors.append( err.args ) - - if form_values.has_key("cmd_validate") or errors : - send_form(controls, errors, htdocs_directory) - return - - - # - # RETURN LOGO OVER HTTP - # - - print "Content-Type:", mime_type[format] - # Content-Disposition: inline Open logo in browser window - # Content-Disposition: attachment Download logo - if form_values.has_key("download") : - print 'Content-Disposition: attachment; ' \ - 'filename="logo.%s"' % extension[format] - else : - print 'Content-Disposition: inline; ' \ - 'filename="logo.%s"' % extension[format] - - - # Separate header from data - print - - # Finally, and at last, send the logo. 
- print logo.getvalue() - - -def send_form(controls, errors=[], htdocs_directory=None) : - if htdocs_directory is None : - htdocs_directory = os.path.join( - os.path.dirname(__file__), "htdocs" ) - - subsitutions = {} - subsitutions["version"] = weblogolib.release_description - - for c in controls : - if c.options : - for opt in c.options : - subsitutions[opt.replace('/','_')] = '' - subsitutions[c.value.replace('/','_')] = 'selected' - else : - value = c.value - if value == None : value = 'auto' - if value=='true': - subsitutions[c.name] = 'checked' - elif type(value)==bool : - if value : - subsitutions[c.name] = 'checked' - else : - subsitutions[c.name] = '' - else : - subsitutions[c.name] = str(value) - subsitutions[c.name+'_err'] = '' - - if errors : - print >>sys.stderr, errors - error_message = [] - for e in errors : - if type(e) is str : - msg = e - elif len(e)==2: - subsitutions[e[0]+"_err"] = "class='error'" - msg = e[1] - else : - msg = e[0] - - - error_message += "ERROR: " - error_message += msg - error_message += '
' - - error_message += \ - " " - subsitutions["error_message"] = ''.join(error_message) - else : - subsitutions["error_message"] = "" - - - template = resource_string("create_html_template.html", htdocs_directory) - html = Template(template).safe_substitute(subsitutions) #FIXME - - print "Content-Type: text/html\n\n" - print html - - # DEBUG - # keys = subsitutions.keys() - # keys.sort() - # for k in keys : - # print k,"=", subsitutions[k], "
" - - #for k in controls : - # print k.name,"=", k.get_value(), "
" - - - -if __name__=="__main__" : - main() - - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/color.py --- a/weblogolib/color.py Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,324 +0,0 @@ - - -# Copyright (c) 2005 Gavin E. Crooks -# -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. - -""" Color specifications using CSS2 (Cascading Style Sheet) syntax.""" - -class Color: - """ Color specifications using CSS2 (Cascading Style Sheet) syntax. - - http://www.w3.org/TR/REC-CSS2/syndata.html#color-units - - Usage: - - red = Color(255,0,0) - red = Color(1., 0., 0.) - red = Color.by_name("red") - red = Color.from_rgb(1.,0.,0.) 
- red = Color.from_rgb(255,0,0) - red = Color.from_hsl(0.,1., 0.5) - - red = Color.from_string("red") - red = Color.from_string("RED") - red = Color.from_string("#F00") - red = Color.from_string("#FF0000") - red = Color.from_string("rgb(255, 0, 0)") - red = Color.from_string("rgb(100%, 0%, 0%)") - red = Color.from_string("hsl(0, 100%, 50%)") - - """ - def __init__(self, red, green, blue) : - - if ( type(red) is not type(green) ) or (type(red) is not type(blue)): - raise TypeError("Mixed floats and integers?") - - if type(red) is type(1) : red = float(red)/255. - if type(green) is type(1) : green = float(green)/255. - if type(blue) is type(1) : blue = float(blue)/255. - - self.red = max(0., min(float(red), 1.0)) - self.green = max(0., min(float(green), 1.0)) - self.blue = max(0., min(float(blue), 1.0)) - - #@staticmethod - def names(): - "Return a list of standard color names." - return _std_colors.keys() - names = staticmethod(names) - - #@classmethod - def from_rgb(cls, r, g, b): - return cls(r,g,b) - from_rgb = classmethod(from_rgb) - - #@classmethod - def from_hsl(cls, hue_angle, saturation, lightness ): - def hue_to_rgb( v1, v2, vH) : - if vH < 0.0 : vH += 1.0 - if vH > 1.0 : vH -= 1.0 - if vH*6.0 < 1.0 : return (v1 + (v2 - v1) * 6.0 * vH) - if vH*2.0 < 1.0 : return v2 - if vH*3.0 < 2.0 : return (v1 + (v2 - v1) * ((2.0/3.0) - vH) * 6.0) - return v1 - - hue = (((hue_angle % 360.) + 360.) % 360.)/360. - - if not (saturation >= 0.0 and saturation <=1.0) : - raise ValueError("Out-of-range saturation %f"% saturation) - if not (lightness >= 0.0 and lightness <=1.0) : - raise ValueError("Out-of-range lightness %f"% lightness) - - if saturation == 0 : - # greyscale - return cls.from_rgb( lightness, lightness, lightness) - - if lightness < 0.5 : - v2 = lightness * (1.0+ saturation) - else : - v2 = (lightness + saturation) - (saturation* lightness) - - v1 = 2.0 * lightness - v2 - r = hue_to_rgb( v1, v2, hue + (1./3.) 
) - g = hue_to_rgb( v1, v2, hue ) - b = hue_to_rgb( v1, v2, hue - (1./3.) ) - - return cls(r,g,b) - from_hsl = classmethod(from_hsl) - - - #@staticmethod - def by_name(string): - s = string.strip().lower().replace(' ', '') - - try: - return _std_colors[s] - except KeyError: - raise ValueError("Unknown color name: %s"% s) - by_name = staticmethod(by_name) - - #@classmethod - def from_string(cls, string): - def to_frac(string) : - # string can be "255" or "100%" - if string[-1]=='%': - return float(string[0:-1])/100. - else: - return float(string)/255. - - s = string.strip().lower().replace(' ', '').replace('_', '') - - if s in _std_colors : # "red" - return _std_colors[s] - - if s[0] == "#" : # "#fef" - if len(s) == 4 : - r = int(s[1]+s[1],16) - g = int(s[2]+s[2],16) - b = int(s[3]+s[3],16) - return cls(r,g,b) - elif len(s) ==7 : # "#ff00aa" - r = int(s[1:3],16) - g = int(s[3:5],16) - b = int(s[5:7],16) - return cls(r,g,b) - else : - raise ValueError("Cannot parse string: %s" % s) - - if s[0:4] == 'rgb(' and s[-1] == ')' : - rgb = s[4:-1].split(",") - if len(rgb) != 3 : - raise ValueError("Cannot parse string a: %s" % s) - return cls( to_frac(rgb[0]), to_frac(rgb[1]), to_frac(rgb[2])) - - if s[0:4] == 'hsl(' and s[-1] == ')' : - hsl = s[4:-1].split(",") - if len(hsl) != 3 : - raise ValueError("Cannot parse string a: %s" % s) - return cls.from_hsl( int(hsl[0]), to_frac(hsl[1]), to_frac(hsl[2])) - - raise ValueError("Cannot parse string: %s" % s) - from_string = classmethod(from_string) - - def __eq__(self, other) : - req = int(0.5+255.*self.red) == int(0.5+255.*other.red) - beq = int(0.5+255.*self.blue) == int(0.5+255.*other.blue) - geq = int(0.5+255.*self.green) == int(0.5+255.*other.green) - - return req and beq and geq - - def __repr__(self): - return "Color(%f,%f,%f)" % (self.red, self.green, self.blue) - - -_std_colors = dict( - aliceblue = Color(240,248,255), #f0f8ff - antiquewhite = Color(250,235,215), #faebd7 - aqua = Color(0,255,255), #00ffff - aquamarine = 
Color(127,255,212), #7fffd4 - azure = Color(240,255,255), #f0ffff - beige = Color(245,245,220), #f5f5dc - bisque = Color(255,228,196), #ffe4c4 - black = Color(0,0,0), #000000 - blanchedalmond = Color(255,235,205), #ffebcd - blue = Color(0,0,255), #0000ff - blueviolet = Color(138,43,226), #8a2be2 - brown = Color(165,42,42), #a52a2a - burlywood = Color(222,184,135), #deb887 - cadetblue = Color(95,158,160), #5f9ea0 - chartreuse = Color(127,255,0), #7fff00 - chocolate = Color(210,105,30), #d2691e - coral = Color(255,127,80), #ff7f50 - cornflowerblue = Color(100,149,237), #6495ed - cornsilk = Color(255,248,220), #fff8dc - crimson = Color(220,20,60), #dc143c - cyan = Color(0,255,255), #00ffff - darkblue = Color(0,0,139), #00008b - darkcyan = Color(0,139,139), #008b8b - darkgoldenrod = Color(184,134,11), #b8860b - darkgray = Color(169,169,169), #a9a9a9 - darkgreen = Color(0,100,0), #006400 - darkgrey = Color(169,169,169), #a9a9a9 - darkkhaki = Color(189,183,107), #bdb76b - darkmagenta = Color(139,0,139), #8b008b - darkolivegreen = Color(85,107,47), #556b2f - darkorange = Color(255,140,0), #ff8c00 - darkorchid = Color(153,50,204), #9932cc - darkred = Color(139,0,0), #8b0000 - darksalmon = Color(233,150,122), #e9967a - darkseagreen = Color(143,188,143), #8fbc8f - darkslateblue = Color(72,61,139), #483d8b - darkslategray = Color(47,79,79), #2f4f4f - darkslategrey = Color(47,79,79), #2f4f4f - darkturquoise = Color(0,206,209), #00ced1 - darkviolet = Color(148,0,211), #9400d3 - deeppink = Color(255,20,147), #ff1493 - deepskyblue = Color(0,191,255), #00bfff - dimgray = Color(105,105,105), #696969 - dimgrey = Color(105,105,105), #696969 - dodgerblue = Color(30,144,255), #1e90ff - firebrick = Color(178,34,34), #b22222 - floralwhite = Color(255,250,240), #fffaf0 - forestgreen = Color(34,139,34), #228b22 - fuchsia = Color(255,0,255), #ff00ff - gainsboro = Color(220,220,220), #dcdcdc - ghostwhite = Color(248,248,255), #f8f8ff - gold = Color(255,215,0), #ffd700 - goldenrod = 
Color(218,165,32), #daa520 - gray = Color(128,128,128), #808080 - green = Color(0,128,0), #008000 - greenyellow = Color(173,255,47), #adff2f - grey = Color(128,128,128), #808080 - honeydew = Color(240,255,240), #f0fff0 - hotpink = Color(255,105,180), #ff69b4 - indianred = Color(205,92,92), #cd5c5c - indigo = Color(75,0,130), #4b0082 - ivory = Color(255,255,240), #fffff0 - khaki = Color(240,230,140), #f0e68c - lavender = Color(230,230,250), #e6e6fa - lavenderblush = Color(255,240,245), #fff0f5 - lawngreen = Color(124,252,0), #7cfc00 - lemonchiffon = Color(255,250,205), #fffacd - lightblue = Color(173,216,230), #add8e6 - lightcoral = Color(240,128,128), #f08080 - lightcyan = Color(224,255,255), #e0ffff - lightgoldenrodyellow = Color(250,250,210), #fafad2 - lightgray = Color(211,211,211), #d3d3d3 - lightgreen = Color(144,238,144), #90ee90 - lightgrey = Color(211,211,211), #d3d3d3 - lightpink = Color(255,182,193), #ffb6c1 - lightsalmon = Color(255,160,122), #ffa07a - lightseagreen = Color(32,178,170), #20b2aa - lightskyblue = Color(135,206,250), #87cefa - lightslategray = Color(119,136,153), #778899 - lightslategrey = Color(119,136,153), #778899 - lightsteelblue = Color(176,196,222), #b0c4de - lightyellow = Color(255,255,224), #ffffe0 - lime = Color(0,255,0), #00ff00 - limegreen = Color(50,205,50), #32cd32 - linen = Color(250,240,230), #faf0e6 - magenta = Color(255,0,255), #ff00ff - maroon = Color(128,0,0), #800000 - mediumaquamarine = Color(102,205,170), #66cdaa - mediumblue = Color(0,0,205), #0000cd - mediumorchid = Color(186,85,211), #ba55d3 - mediumpurple = Color(147,112,219), #9370db - mediumseagreen = Color(60,179,113), #3cb371 - mediumslateblue = Color(123,104,238), #7b68ee - mediumspringgreen = Color(0,250,154), #00fa9a - mediumturquoise = Color(72,209,204), #48d1cc - mediumvioletred = Color(199,21,133), #c71585 - midnightblue = Color(25,25,112), #191970 - mintcream = Color(245,255,250), #f5fffa - mistyrose = Color(255,228,225), #ffe4e1 - moccasin = 
Color(255,228,181), #ffe4b5 - navajowhite = Color(255,222,173), #ffdead - navy = Color(0,0,128), #000080 - oldlace = Color(253,245,230), #fdf5e6 - olive = Color(128,128,0), #808000 - olivedrab = Color(107,142,35), #6b8e23 - orange = Color(255,165,0), #ffa500 - orangered = Color(255,69,0), #ff4500 - orchid = Color(218,112,214), #da70d6 - palegoldenrod = Color(238,232,170), #eee8aa - palegreen = Color(152,251,152), #98fb98 - paleturquoise = Color(175,238,238), #afeeee - palevioletred = Color(219,112,147), #db7093 - papayawhip = Color(255,239,213), #ffefd5 - peachpuff = Color(255,218,185), #ffdab9 - peru = Color(205,133,63), #cd853f - pink = Color(255,192,203), #ffc0cb - plum = Color(221,160,221), #dda0dd - powderblue = Color(176,224,230), #b0e0e6 - purple = Color(128,0,128), #800080 - red = Color(255,0,0), #ff0000 - rosybrown = Color(188,143,143), #bc8f8f - royalblue = Color(65,105,225), #4169e1 - saddlebrown = Color(139,69,19), #8b4513 - salmon = Color(250,128,114), #fa8072 - sandybrown = Color(244,164,96), #f4a460 - seagreen = Color(46,139,87), #2e8b57 - seashell = Color(255,245,238), #fff5ee - sienna = Color(160,82,45), #a0522d - silver = Color(192,192,192), #c0c0c0 - skyblue = Color(135,206,235), #87ceeb - slateblue = Color(106,90,205), #6a5acd - slategray = Color(112,128,144), #708090 - slategrey = Color(112,128,144), #708090 - snow = Color(255,250,250), #fffafa - springgreen = Color(0,255,127), #00ff7f - steelblue = Color(70,130,180), #4682b4 - tan = Color(210,180,140), #d2b48c - teal = Color(0,128,128), #008080 - thistle = Color(216,191,216), #d8bfd8 - tomato = Color(255,99,71), #ff6347 - turquoise = Color(64,224,208), #40e0d0 - violet = Color(238,130,238), #ee82ee - wheat = Color(245,222,179), #f5deb3 - white = Color(255,255,255), #ffffff - whitesmoke = Color(245,245,245), #f5f5f5 - yellow = Color(255,255,0), #ffff00 - yellowgreen = Color(154,205,50) #9acd32 - ) - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/colorscheme.py --- a/weblogolib/colorscheme.py 
Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,455 +0,0 @@ - -# Copyright (c) 2003-2005 The Regents of the University of California. -# Copyright (c) 2005 Gavin E. Crooks - -# This software is distributed under the MIT Open Source License. -# -# -# Permission is hereby granted, free of charge, to any person obtaining a -# copy of this software and associated documentation files (the "Software"), -# to deal in the Software without restriction, including without limitation -# the rights to use, copy, modify, merge, publish, distribute, sublicense, -# and/or sell copies of the Software, and to permit persons to whom the -# Software is furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included -# in all copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -# THE SOFTWARE. - -""" Popular color codings for nucleic and amino acids. - -Classes: - ColorScheme -- A color scheme - ColorGroup - - -Generic - monochrome - -Nucleotides - nucleotide - base pairing - -Amino Acid - hydrophobicity - chemistry - charge - taylor - -Status : Beta - Needs documentation. - -""" -# Good online references include bioruby and the JalView alignment editor. -# Clamp, M., Cuff, J., Searle, S. M. and Barton, G. J. 
(2004), -# "The Jalview Java Alignment Editor," Bioinformatics, 12, 426-7 -# http://www.jalview.org - -import sys - -from corebio import seq -from color import Color -codon_alphabetU=['AAA', 'AAU', 'AAC', 'AAG', 'AUA', 'AUU', 'AUC', 'AUG', 'ACA', 'ACU', 'ACC', 'ACG', 'AGA', 'AGU', 'AGC', 'AGG', 'UAA', 'UAU', 'UAC', 'UAG', 'UUA', 'UUU', 'UUC', 'UUG', 'UCA', 'UCU', 'UCC', 'UCG', 'UGA', 'UGU', 'UGC', 'UGG', 'CAA', 'CAU', 'CAC', 'CAG', 'CUA', 'CUU', 'CUC', 'CUG', 'CCA', 'CCU', 'CCC', 'CCG', 'CGA', 'CGU', 'CGC', 'CGG', 'GAA', 'GAU', 'GAC', 'GAG', 'GUA', 'GUU', 'GUC', 'GUG', 'GCA', 'GCU', 'GCC', 'GCG', 'GGA', 'GGU', 'GGC', 'GGG'] -codon_alphabetT=['AAA', 'AAT', 'AAC', 'AAG', 'ATA', 'ATT', 'ATC', 'ATG', 'ACA', 'ACT', 'ACC', 'ACG', 'AGA', 'AGT', 'AGC', 'AGG', 'TAA', 'TAT', 'TAC', 'TAG', 'TTA', 'TTT', 'TTC', 'TTG', 'TCA', 'TCT', 'TCC', 'TCG', 'TGA', 'TGT', 'TGC', 'TGG', 'CAA', 'CAT', 'CAC', 'CAG', 'CTA', 'CTT', 'CTC', 'CTG', 'CCA', 'CCT', 'CCC', 'CCG', 'CGA', 'CGT', 'CGC', 'CGG', 'GAA', 'GAT', 'GAC', 'GAG', 'GTA', 'GTT', 'GTC', 'GTG', 'GCA', 'GCT', 'GCC', 'GCG', 'GGA', 'GGT', 'GGC', 'GGG'] - -class ColorScheme(object): - """ A coloring of an alphabet. 
- - title : string -- A human readable description - defualt_color : Color -- - groups : list of color groups - alphabet : string -- The set of colored symbols - color -- A map between a symbol and a Coloring - - - """ - - def __init__(self, - groups = [], - title = "", - description = "", - default_color = "black", - alphabet = seq.generic_alphabet) : - """ """ - self.title= title - self.description = description - self.default_color = Color.from_string(default_color) - self.groups = groups - self.alphabet = alphabet - #print >> sys.stderr, groups - altype="codons" - #print >> sys.stderr,altype - #if(alphabet==codon_alphabet): - #print >> sys.stderr,"haleyulia it works" - - color = {} - #print >> sys.stderr, groups - if(alphabet!=codon_alphabetT and alphabet!=codon_alphabetU): - for cg in groups : - #print >> sys.stderr, cg - for s in cg.symbols : - color[s] = cg.color - #print >> sys.stderr, s - #print >> sys.stderr, cg - if s not in alphabet : - raise KeyError("Colored symbol does not exist in alphabet.") - else: - for cg in groups : - #print >> sys.stderr, cg - color[cg.symbols] = cg.color - #print >> sys.stderr, cg.symbols - self._color = color - - def color(self, symbol) : - if symbol in self._color : - return self._color[symbol] - return self.default_color - -class ColorGroup(object) : - """Associate a group of symbols with a color""" - def __init__(self, symbols, color, description=None) : - self.symbols = symbols - self.color = Color.from_string(color) - self.description = description - - - -monochrome = ColorScheme([]) # This list intentionally left blank - -# From makelogo -nucleotide = ColorScheme([ - ColorGroup("G", "orange"), - ColorGroup("TU", "red"), - ColorGroup("C", "blue"), - ColorGroup("A", "green") - ]) - -base_pairing = ColorScheme([ - ColorGroup("TAU", "darkorange", "Weak (2 Watson-Crick hydrogen bonds)"), - ColorGroup("GC", "blue", "Strong (3 Watson-Crick hydrogen bonds)")], - ) - - -hydrophobicity = ColorScheme([ - ColorGroup( "RKDENQ", 
"black", "hydrophobic"), - ColorGroup( "SGHTAP", "green", "neutral" ), - ColorGroup( "YVMCLFIW", "blue", "hydrophilic") ], - alphabet = seq.unambiguous_protein_alphabet - ) - -# from makelogo -chemistry = ColorScheme([ - ColorGroup( "GSTYC", "green", "polar"), - ColorGroup( "NQ", "purple", "neutral"), - ColorGroup( "KRH", "blue", "basic"), - ColorGroup( "DE", "red", "acidic"), - ColorGroup("PAWFLIMV", "black", "hydrophobic") ], - alphabet = seq.unambiguous_protein_alphabet - ) - - -codonsU = ColorScheme([ -ColorGroup( 'CAT', '#00FFFF'), -ColorGroup( 'CAU', '#00FFFF'), -ColorGroup( 'CAC', '#00FFFF'), - -ColorGroup( 'AAA', '#00FFFF'), -ColorGroup( 'AAG', '#00FFFF'), - -ColorGroup( 'CGT', '#00FFFF'), -ColorGroup( 'CGU', '#00FFFF'), -ColorGroup( 'CGC', '#00FFFF'), -ColorGroup( 'CGA', '#00FFFF'), -ColorGroup( 'CGG', '#00FFFF'), -ColorGroup( 'AGA', '#00FFFF'), -ColorGroup( 'AGG', '#00FFFF'), - - -ColorGroup( 'GAT', '#FF0000'), -ColorGroup( 'GAU', '#FF0000'), -ColorGroup( 'GAC', '#FF0000'), - -ColorGroup( 'GAA', '#FF0000'), -ColorGroup( 'GAG', '#FF0000'), - - -ColorGroup( 'TCT', '#00FF00'), -ColorGroup( 'UCU', '#00FF00'), -ColorGroup( 'TCC', '#00FF00'), -ColorGroup( 'UCC', '#00FF00'), -ColorGroup( 'TCA', '#00FF00'), -ColorGroup( 'UCA', '#00FF00'), -ColorGroup( 'TCG', '#00FF00'), -ColorGroup( 'UCG', '#00FF00'), -ColorGroup( 'AGT', '#00FF00'), -ColorGroup( 'AGU', '#00FF00'), -ColorGroup( 'AGC', '#00FF00'), - -ColorGroup( 'ACT', '#00FF00'), -ColorGroup( 'ACU', '#00FF00'), -ColorGroup( 'ACC', '#00FF00'), -ColorGroup( 'ACA', '#00FF00'), -ColorGroup( 'ACG', '#00FF00'), - -ColorGroup( 'CAA', '#00FF00'), -ColorGroup( 'CAG', '#00FF00'), - -ColorGroup( 'AAT', '#00FF00'), -ColorGroup( 'AAU', '#00FF00'), -ColorGroup( 'AAC', '#00FF00'), - - -ColorGroup( 'GCT', '#5555FF'), -ColorGroup( 'GCU', '#5555FF'), -ColorGroup( 'GCC', '#5555FF'), -ColorGroup( 'GCA', '#5555FF'), -ColorGroup( 'GCG', '#5555FF'), - -ColorGroup( 'GTT', '#5555FF'), -ColorGroup( 'GUU', '#5555FF'), -ColorGroup( 'GTC', 
'#5555FF'), -ColorGroup( 'GUC', '#5555FF'), -ColorGroup( 'GTA', '#5555FF'), -ColorGroup( 'GUA', '#5555FF'), -ColorGroup( 'GTG', '#5555FF'), -ColorGroup( 'GUG', '#5555FF'), - -ColorGroup( 'CTT', '#5555FF'), -ColorGroup( 'CUU', '#5555FF'), -ColorGroup( 'CTC', '#5555FF'), -ColorGroup( 'CUC', '#5555FF'), -ColorGroup( 'CTA', '#5555FF'), -ColorGroup( 'CUA', '#5555FF'), -ColorGroup( 'CTG', '#5555FF'), -ColorGroup( 'CUG', '#5555FF'), -ColorGroup( 'TTA', '#5555FF'), -ColorGroup( 'UUA', '#5555FF'), -ColorGroup( 'TTG', '#5555FF'), -ColorGroup( 'UUG', '#5555FF'), - -ColorGroup( 'ATT', '#5555FF'), -ColorGroup( 'AUU', '#5555FF'), -ColorGroup( 'ATC', '#5555FF'), -ColorGroup( 'AUC', '#5555FF'), -ColorGroup( 'ATA', '#5555FF'), -ColorGroup( 'AUA', '#5555FF'), - -ColorGroup( 'ATG', '#5555FF'), -ColorGroup( 'AUG', '#5555FF'), - - -ColorGroup( 'TTT', '#FF00FF'), -ColorGroup( 'UUU', '#FF00FF'), -ColorGroup( 'TTC', '#FF00FF'), -ColorGroup( 'UUC', '#FF00FF'), - -ColorGroup( 'TAT', '#FF00FF'), -ColorGroup( 'UAU', '#FF00FF'), -ColorGroup( 'TAC', '#FF00FF'), -ColorGroup( 'UAC', '#FF00FF'), - -ColorGroup( 'TGG', '#FF00FF'), -ColorGroup( 'UGG', '#FF00FF'), - - -ColorGroup( 'GGT', '#996600'), -ColorGroup( 'GGU', '#996600'), -ColorGroup( 'GGC', '#996600'), -ColorGroup( 'GGA', '#996600'), -ColorGroup( 'GGG', '#996600'), - -ColorGroup( 'CCT', '#996600'), -ColorGroup( 'CCU', '#996600'), -ColorGroup( 'CCC', '#996600'), -ColorGroup( 'CCA', '#996600'), -ColorGroup( 'CCG', '#996600'), - - -ColorGroup( 'TGT', '#FFFF00'), -ColorGroup( 'UGU', '#FFFF00'), -ColorGroup( 'TGC', '#FFFF00'), -ColorGroup( 'UGC', '#FFFF00'), - -ColorGroup( 'TAA', '#000000'), -ColorGroup( 'UAA', '#000000'), -ColorGroup( 'TAG', '#000000'), -ColorGroup( 'UAG', '#000000'), -ColorGroup( 'TGA', '#000000'), -ColorGroup( 'UGA', '#000000')], - alphabet = codon_alphabetU - ) -codonsT = ColorScheme([ -ColorGroup( 'CAT', '#00FFFF'), -ColorGroup( 'CAU', '#00FFFF'), -ColorGroup( 'CAC', '#00FFFF'), - -ColorGroup( 'AAA', '#00FFFF'), -ColorGroup( 
'AAG', '#00FFFF'), - -ColorGroup( 'CGT', '#00FFFF'), -ColorGroup( 'CGU', '#00FFFF'), -ColorGroup( 'CGC', '#00FFFF'), -ColorGroup( 'CGA', '#00FFFF'), -ColorGroup( 'CGG', '#00FFFF'), -ColorGroup( 'AGA', '#00FFFF'), -ColorGroup( 'AGG', '#00FFFF'), - - -ColorGroup( 'GAT', '#FF0000'), -ColorGroup( 'GAU', '#FF0000'), -ColorGroup( 'GAC', '#FF0000'), - -ColorGroup( 'GAA', '#FF0000'), -ColorGroup( 'GAG', '#FF0000'), - - -ColorGroup( 'TCT', '#00FF00'), -ColorGroup( 'UCU', '#00FF00'), -ColorGroup( 'TCC', '#00FF00'), -ColorGroup( 'UCC', '#00FF00'), -ColorGroup( 'TCA', '#00FF00'), -ColorGroup( 'UCA', '#00FF00'), -ColorGroup( 'TCG', '#00FF00'), -ColorGroup( 'UCG', '#00FF00'), -ColorGroup( 'AGT', '#00FF00'), -ColorGroup( 'AGU', '#00FF00'), -ColorGroup( 'AGC', '#00FF00'), - -ColorGroup( 'ACT', '#00FF00'), -ColorGroup( 'ACU', '#00FF00'), -ColorGroup( 'ACC', '#00FF00'), -ColorGroup( 'ACA', '#00FF00'), -ColorGroup( 'ACG', '#00FF00'), - -ColorGroup( 'CAA', '#00FF00'), -ColorGroup( 'CAG', '#00FF00'), - -ColorGroup( 'AAT', '#00FF00'), -ColorGroup( 'AAU', '#00FF00'), -ColorGroup( 'AAC', '#00FF00'), - - -ColorGroup( 'GCT', '#5555FF'), -ColorGroup( 'GCU', '#5555FF'), -ColorGroup( 'GCC', '#5555FF'), -ColorGroup( 'GCA', '#5555FF'), -ColorGroup( 'GCG', '#5555FF'), - -ColorGroup( 'GTT', '#5555FF'), -ColorGroup( 'GUU', '#5555FF'), -ColorGroup( 'GTC', '#5555FF'), -ColorGroup( 'GUC', '#5555FF'), -ColorGroup( 'GTA', '#5555FF'), -ColorGroup( 'GUA', '#5555FF'), -ColorGroup( 'GTG', '#5555FF'), -ColorGroup( 'GUG', '#5555FF'), - -ColorGroup( 'CTT', '#5555FF'), -ColorGroup( 'CUU', '#5555FF'), -ColorGroup( 'CTC', '#5555FF'), -ColorGroup( 'CUC', '#5555FF'), -ColorGroup( 'CTA', '#5555FF'), -ColorGroup( 'CUA', '#5555FF'), -ColorGroup( 'CTG', '#5555FF'), -ColorGroup( 'CUG', '#5555FF'), -ColorGroup( 'TTA', '#5555FF'), -ColorGroup( 'UUA', '#5555FF'), -ColorGroup( 'TTG', '#5555FF'), -ColorGroup( 'UUG', '#5555FF'), - -ColorGroup( 'ATT', '#5555FF'), -ColorGroup( 'AUU', '#5555FF'), -ColorGroup( 'ATC', '#5555FF'), 
-ColorGroup( 'AUC', '#5555FF'), -ColorGroup( 'ATA', '#5555FF'), -ColorGroup( 'AUA', '#5555FF'), - -ColorGroup( 'ATG', '#5555FF'), -ColorGroup( 'AUG', '#5555FF'), - - -ColorGroup( 'TTT', '#FF00FF'), -ColorGroup( 'UUU', '#FF00FF'), -ColorGroup( 'TTC', '#FF00FF'), -ColorGroup( 'UUC', '#FF00FF'), - -ColorGroup( 'TAT', '#FF00FF'), -ColorGroup( 'UAU', '#FF00FF'), -ColorGroup( 'TAC', '#FF00FF'), -ColorGroup( 'UAC', '#FF00FF'), - -ColorGroup( 'TGG', '#FF00FF'), -ColorGroup( 'UGG', '#FF00FF'), - - -ColorGroup( 'GGT', '#996600'), -ColorGroup( 'GGU', '#996600'), -ColorGroup( 'GGC', '#996600'), -ColorGroup( 'GGA', '#996600'), -ColorGroup( 'GGG', '#996600'), - -ColorGroup( 'CCT', '#996600'), -ColorGroup( 'CCU', '#996600'), -ColorGroup( 'CCC', '#996600'), -ColorGroup( 'CCA', '#996600'), -ColorGroup( 'CCG', '#996600'), - - -ColorGroup( 'TGT', '#FFFF00'), -ColorGroup( 'UGU', '#FFFF00'), -ColorGroup( 'TGC', '#FFFF00'), -ColorGroup( 'UGC', '#FFFF00'), - -ColorGroup( 'TAA', '#000000'), -ColorGroup( 'UAA', '#000000'), -ColorGroup( 'TAG', '#000000'), -ColorGroup( 'UAG', '#000000'), -ColorGroup( 'TGA', '#000000'), -ColorGroup( 'UGA', '#000000')], - alphabet = codon_alphabetT - ) - - - - -charge = ColorScheme([ - ColorGroup("KRH", "blue", "Positive" ), - ColorGroup( "DE", "red", "Negative") ], - alphabet = seq.unambiguous_protein_alphabet - ) - - -taylor = ColorScheme([ - ColorGroup( 'A', '#CCFF00' ), - ColorGroup( 'C', '#FFFF00' ), - ColorGroup( 'D', '#FF0000'), - ColorGroup( 'E', '#FF0066' ), - ColorGroup( 'F', '#00FF66'), - ColorGroup( 'G', '#FF9900'), - ColorGroup( 'H', '#0066FF'), - ColorGroup( 'I', '#66FF00'), - ColorGroup( 'K', '#6600FF'), - ColorGroup( 'L', '#33FF00'), - ColorGroup( 'M', '#00FF00'), - ColorGroup( 'N', '#CC00FF'), - ColorGroup( 'P', '#FFCC00'), - ColorGroup( 'Q', '#FF00CC'), - ColorGroup( 'R', '#0000FF'), - ColorGroup( 'S', '#FF3300'), - ColorGroup( 'T', '#FF6600'), - ColorGroup( 'V', '#99FF00'), - ColorGroup( 'W', '#00CCFF'), - ColorGroup( 'Y', '#00FFCC')], - 
title = "Taylor", - description = "W. Taylor, Protein Engineering, Vol 10 , 743-746 (1997)", - alphabet = seq.unambiguous_protein_alphabet - ) - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/create.cgi --- a/weblogolib/htdocs/create.cgi Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,48 +0,0 @@ -#!/usr/bin/env python - -# Copyright (c) 2003-2004 The Regents of the University of California. -# Copyright (c) 2005 Gavin E. Crooks -# Copyright (c) 2006, The Regents of the University of California, through -# Lawrence Berkeley National Laboratory (subject to receipt of any required -# approvals from the U.S. Dept. of Energy). All rights reserved. - -# This software is distributed under the new BSD Open Source License. -# -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# (1) Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# (2) Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and or other materials provided with the distribution. -# -# (3) Neither the name of the University of California, Lawrence Berkeley -# National Laboratory, U.S. Dept. of Energy nor the names of its contributors -# may be used to endorse or promote products derived from this software -# without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -# ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - -import cgi -import cgitb; cgitb.enable() -import weblogolib - -if __name__=="__main__" : - weblogolib.cgi(__file__) - - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/create_html_template.html --- a/weblogolib/htdocs/create_html_template.html Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,350 +0,0 @@ - - - - - - -CodonLogo 1.0- Create - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Codonlogo 1.0 : Create

- -

- · - about · - create · - examples · - manual · -
-$version -

-Sequence data -

- -

-(or paste sequence data below) -

- -

-
-Download -

-${error_message} -

-Output format

- - -

-Logo size -

- -

- Stacks per line -

- -

-Sequence type

- - -

-Ignore lower case -

- -

-Use codons: -

- -

-Units

- - -

-First position number -

- -

-Logo range

- - -

-Composition -

- -or - % CG -

-Scale stack widths -

- -

-Error bars -

- -

-Title -

- -

-Figure label

- -

-X-axis -

- -Label: - -

-Y-axis -

- -Label: - -

-Y-axis scale: -

- -

-Y-axis tic spacing: -

- -

-Sequence end labels -

- -

-Version fineprint -

- -

Color scheme -

- - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Symbols	Color

- -

- - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/examples.html --- a/weblogolib/htdocs/examples.html Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,2210 +0,0 @@ - - - - - -CodonLogo - Examples - - - - - - - - - - - - - - - - - - - -

CodonLogo 1.0: Examples

- -

- · - about · - create · - examples · - manual · -
- -

- -

CAP HTH motif
Transcription Factors
E. coli Promoters
Globins
HTH motif
Splice Signals

-The Edit Logo buttons will transfer the relevant -sequence data to the Logo creation form. -There you can examine the sequence data and recreate the logo for -yourself. - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ---> - -

- - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/img/example.png Binary file weblogolib/htdocs/img/example.png has changed diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/img/feed-icon-16x16.png Binary file weblogolib/htdocs/img/feed-icon-16x16.png has changed diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/img/weblogo-fig1.png Binary file weblogolib/htdocs/img/weblogo-fig1.png has changed diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/img/weblogo_create.png Binary file weblogolib/htdocs/img/weblogo_create.png has changed diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/index.html --- a/weblogolib/htdocs/index.html Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,135 +0,0 @@ - - - - -CodonLogo - About - - - - - - - - - - - - - - - - - -

CodonLogo 1.0

- -

- · - about · - create · - examples · - manual · -
- - -

- - -

Introduction

-CodonLogo is based on WebLogo, a web-based application designed to make the -generation of sequence logos easy and painless. CodonLogo extends WebLogo to allow generation of sequence logos with codons rather than amino acids or nucleotides. -

- -

-Sequence logos are a graphical representation of an amino acid -or nucleic acid multiple sequence alignment developed by -Tom Schneider - and Mike - Stephens. -Each logo consists of stacks of symbols, one stack for each position in the -sequence. The overall height of the stack indicates the sequence conservation -at that position, while the height of symbols within the stack indicates the -relative frequency of each codon at that position. In general, -a sequence logo provides a richer and more precise description of, for example, -a binding site, than would a consensus sequence. -

- -

- - -

References

- - -

- -Crooks GE, -Hon G, -Chandonia JM, -Brenner SE -WebLogo: A sequence logo -generator,
-Genome Research, 14:1188-1190, (2004) -[Full Text ] - -

- -

- -Schneider TD, Stephens RM. 1990. -Sequence Logos: A New Way to Display Consensus Sequences. -Nucleic Acids Res. 18:6097-6100 - -

- - - - - -

Disclaimer

- -

-While no permanent records are kept of submitted sequences, we cannot -undertake to guarantee that data sent to CodonLogo remains secure. Moreover, -no guarantees whatsoever are provided about data generated by CodonLogo. -

- - - -

Feedback

-Suggestions on how to improve CodonLogo are heartily welcomed! -Please direct questions to murphy.david@gmail.com -

- -

- -a - - - - - - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/logo.css --- a/weblogolib/htdocs/logo.css Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,67 +0,0 @@ - -/* - Logo Cascading Style Sheet -*/ - -body { - font-family: sans-serif; - color: black; - background: white; -} -a:link { color: #369; background: transparent } -a:visited { color: #066; background: transparent } -a:active { color: #C00; background: transparent } - -a.selected:visited { - color: #000; - background: #EEE; - text-decoration: none; -} - -li {margin-bottom: 0.5em} - - -th, td { /* ns 4 */ - font-family: sans-serif; -} - -h1, h2, h3, h4, h5, h6 { text-align: left } -h1 { color: #900 } -h1 { text-align: left } -h1 { font: 170% sans-serif} -h2 { font: bold 140% sans-serif } -h3 { font: bold 120% sans-serif } -h4 { font: bold 100% sans-serif } -h5 { font: italic 100% sans-serif } -h6 { font: small-caps 100% sans-serif } - -h2 { margin-top: 2em} -h4 { margin-bottom: 0.2em} - -.discourse { - font-size: small - } - - -p {text-align: justify; margin-top: 0em} -p.copyright { font-size: small; text-align: center } - -pre { margin-left: 2em ; } - -dt, dd { margin-top: 0; margin-bottom: 0 } /* opera 3.50 */ -dt { font-weight: bold } - -/* navigator 4 requires this */ -pre, code { - font-family: monospace ; - -} - - - - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/manual.html --- a/weblogolib/htdocs/manual.html Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,576 +0,0 @@ - - - - -CodonLogo - User's Manual - - - - - - - - - - - -

CodonLogo 1.0: User's Manual

- -

- · - about · - create · - examples · - manual · -
- - -

- -

- - -

Introduction -
Creating Sequence Logos using the Web interface -
Downloading and Installing CodonLogo -
Command Line Interface (CLI) -
Application Programmer Interface (API) -
Development and Future Features -
Miscellanea -

- - -

Introduction

- - -

-CodonLogo -is a web based application designed to make the -generation of -codon sequence logos as easy and painless as possible. -It is almost entirely based on the application WebLogo. -

- - -

-Sequence logos -are a graphical representation of an amino acid -or nucleic acid multiple sequence alignment. -Each logo consists of stacks of symbols, one stack for each position in the -sequence. The overall height of the stack indicates the sequence conservation -at that position, while the height of symbols within the stack indicates the -relative frequency of each amino or nucleic acid at that position. The width of the stack is proportional to the fraction of valid symbols in that position. (Positions with many gaps have thin stacks.) In general, a sequence logo provides a richer and more precise description of, for example, -a binding site, than would a consensus sequence. -

- - - - - - - - - -

References

- - -

-Crooks GE, -Hon G, -Chandonia JM, -Brenner SE -WebLogo: A sequence logo -generator, -Genome Research, 14:1188-1190, (2004) -[Full Text ] -

- -

-Schneider TD, Stephens RM. 1990. -Sequence Logos: A New Way to Display Consensus Sequences. -Nucleic Acids Res. 18:6097-6100 -

- - - - - - - - -

Creating Sequence Logos using the Web interface

- - -

Sequence Data

-Enter your multiple sequence alignment here, or select a file to upload. Supported file formats include CLUSTALW, FASTA, plain flatfile, MSF, NBRF, PIR, NEXUS and PHYLIP. All sequences must be the same length, else CodonLogo will return an error and report the first sequence that differed in length from previous sequences. - -

Output format

- -

PNG : (600 DPI) Print resolution bitmap -
PNG : (low res, 96 DPI) Screen resolution bitmap -
JPEG :Screen resolution bitmap -
EPS : Encapsulated postscript -
PDF : Portable Document Format -

-Generally speaking, vector formats (EPS and PDF) are better for printing, while bitmaps (JPEG and PNG) are more suitable for displaying on the screen or embedding into a web page. - -

Logo size

-The physical dimensions of the generated logo. -Specifically, this controls the size of individual symbol stacks. -

small : 5.4 points wide (Same as 9pt Courier), aspect ratio 5:1 -
medium : Double the width and height of small. -
large : Triple the width and height of small. -

-The choices have been limited to promote inter-logo consistency. Small logos can fit 80 stacks across a printed page, or 40 across a half page column. The command line interface provides greater control, if so desired. - -

Stacks per line

-If the length of the sequences is greater than this maximum number of stacks per line, then the logo will be split across multiple lines. - -
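The relationship between sequence length, stacks per line, and logo size can be sketched as follows. This is a minimal illustration, not CodonLogo's layout code: the function names are hypothetical, the medium and large widths are derived from "double" and "triple" the small width, and margins and axis labels are ignored.

```python
import math

# Stack widths in points. "small" is the value given in the manual; the
# other two are derived from it and are an assumption of this sketch.
STACK_WIDTH_POINTS = {"small": 5.4, "medium": 2 * 5.4, "large": 3 * 5.4}


def lines_needed(sequence_length, stacks_per_line):
    """Number of logo lines needed to show every sequence position."""
    if stacks_per_line < 1:
        raise ValueError("stacks_per_line must be at least 1")
    return math.ceil(sequence_length / stacks_per_line)


def line_width_points(size, stacks_per_line):
    """Width of one full line of stacks, excluding margins and labels."""
    return STACK_WIDTH_POINTS[size] * stacks_per_line
```

Under these assumptions, 80 small stacks occupy about 432 points (6 inches), which matches the "80 stacks across a printed page" figure quoted for small logos.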

Sequence type

-The type of biological molecule. -

auto: Automatically guess sequence type from the data -
protein -
dna -
rna -

- -

Ignore lower case

-Disregard lower case letters and only count -upper case letters in sequences? - -

Units

-The units used for the y-axis. -

probability: Show residue probabilities, rather than information content. If compositional adjustment is disabled, then these are the raw residue frequencies. -
bits: Information content in bits -
nats: Natural units, 1 bit = ln 2 (0.69) nats -
kT : Thermal energy units in natural units (Numerically the same as nats) -
kJ/mol : Thermal energy (Assuming T = 300 K) -
kcal/mol : Thermal energy (Assuming T = 300 K) -
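The unit choices above differ only by a constant factor once the information content is known in bits. A minimal sketch of the conversions, assuming the standard molar gas constant and the thermochemical calorie (the exact constants used by CodonLogo are not stated here, and `convert_bits` is a hypothetical helper):

```python
import math

GAS_CONSTANT_KJ_PER_MOL_K = 8.314462618e-3  # assumed value of R
KJ_PER_KCAL = 4.184                         # assumed thermochemical calorie
TEMPERATURE_K = 300.0                       # temperature stated in the manual


def convert_bits(bits, unit):
    """Convert an information content in bits to another y-axis unit."""
    nats = bits * math.log(2)  # 1 bit = ln 2 nats
    if unit == "bits":
        return bits
    if unit in ("nats", "kT"):  # kT is numerically the same as nats
        return nats
    if unit == "kJ/mol":
        return nats * GAS_CONSTANT_KJ_PER_MOL_K * TEMPERATURE_K
    if unit == "kcal/mol":
        return nats * GAS_CONSTANT_KJ_PER_MOL_K * TEMPERATURE_K / KJ_PER_KCAL
    raise ValueError("Unknown unit: %s" % unit)
```

The probability option is not covered by this sketch because it plots residue frequencies rather than a rescaled information content.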

- - -

First position number

-The numerical label of the first residue in the multiple sequence alignment. The label must be an integer. Residue labels for the logo will be relative to this number. (See also: Logo Range) -

Logo range

-By default, all sequence data is displayed in the Sequence Logo. With this option, you can instead show a subrange of the entire sequence. Start and end positions are included, and the numbering of positions is relative to the sequence number of the first position. (See also: First Position Number ) Thus, if the first position number is "2", start is "5" and end is "10", then the 4th through 9th (inclusive) sequence positions will be displayed, and they will be numbered "5", "6", "7", "8", "9" and "10". - - -
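The interaction between the first position number and the logo range can be checked with a short sketch. The function name and return shape are hypothetical; only the arithmetic follows the description above.

```python
def logo_range(first_position, start, end):
    """Map a labelled logo range to 1-based sequence positions.

    Returns (positions, labels): the sequence positions that are drawn
    and the numbers printed under them.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    first_index = start - first_position + 1
    last_index = end - first_position + 1
    if first_index < 1:
        raise ValueError("start lies before the first position number")
    positions = list(range(first_index, last_index + 1))
    labels = list(range(start, end + 1))
    return positions, labels
```

Calling `logo_range(2, 5, 10)` reproduces the example in the text: sequence positions 4 through 9, labelled 5 through 10.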

Composition

-The background composition of the genome or proteome from which the sequences have been drawn. The default, automatic option is to use equiprobable background for nucleic acids and a typical amino acid usage pattern for proteins. However, you may also explicitly set the expected CG content for nucleic acid sequences, insist on equiprobable background distributions, or turn off composition adjustment altogether. -

-Compositional adjustment has two effects. First, the information content of a site is defined as the relative entropy of the monomers at that site to the background distribution. Consequently, rare monomers have higher information content (when they occur) than relatively common monomers. -

-Secondly, the background composition is used in the small sample correction of information content. Briefly, if only a few sequences are available in the multiple sequence alignment, then sites typically appear more conserved than they really are. Small samples bias the relative entropy upwards. To compensate, we add pseudocounts to the actual counts, proportional to the expected background composition. These pseudocounts smooth the data for small samples, but become irrelevant for large samples. The proportionality constant is set to 4 for nucleic acid sequences, and 20 for proteins (These numbers have been found to give reasonable results in practice). -

-Behind the scenes, things are more complex. We do a full Bayesian calculation, starting with explicit Dirichlet priors based on the background composition, to which we add the data and then calculate both the posterior mean relative entropy (the stack height) and Bayesian 95% confidence intervals for error bars. These interesting details will be explained elsewhere. -
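The pseudocount idea can be illustrated with a simplified point estimate. This is only a sketch: it computes the relative entropy of pseudocount-smoothed frequencies, not the full Bayesian posterior mean or the confidence intervals described above, and the function name and argument layout are hypothetical.

```python
import math


def smoothed_information(counts, background, pseudocount_weight):
    """Relative entropy (in bits) of smoothed frequencies to the background.

    counts             -- observed symbol counts at one position
    background         -- expected symbol frequencies (all > 0, sum to 1)
    pseudocount_weight -- proportionality constant, e.g. 4 for nucleic acids
    """
    total = sum(counts.get(symbol, 0) for symbol in background)
    information = 0.0
    for symbol, prior in background.items():
        # Add pseudocounts in proportion to the background composition.
        frequency = (counts.get(symbol, 0) + pseudocount_weight * prior) / (
            total + pseudocount_weight
        )
        if frequency > 0:
            information += frequency * math.log2(frequency / prior)
    return information
```

With an equiprobable DNA background, five identical symbols score well below the 2-bit maximum once pseudocounts are added, while five thousand identical symbols score close to it, which is the small-sample behaviour the text describes.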

- -

Scale stack width

-Scale the visible stack width by the fraction of symbols in the column? (i.e. columns with many gaps or unknown residues are narrow.) - -

Error bars

-Display error bars. These indicate an approximate, Bayesian 95% confidence interval. - -

Title

-Give your logo a title. - -

Figure label

-An optional figure label, added to the top left (e.g. '(a)') - -

X-axis

-Add a label to the x-axis, or hide the axis altogether. - -

Y-axis

-The vertical axis indicates the information content of a sequence position. Use this option to toggle the y-axis and override the default axis label. - -

Y-axis scale

-The height of the y-axis in designated units. The automatic option will pick reasonable defaults based on the sequence type and axis unit. - -

Y-axis tic spacing

-The distance between major tic marks on the Y-axis. - -

Sequence end labels

-Choose this option to label the 5' & 3' ends of nucleic acid or the N & C termini of amino acid sequences. - -

Version fineprint

-Toggle display of the CodonLogo version information in the lower right corner. - -

Color Scheme

auto : use Base Pairing for nucleic acids (NA), Hydrophobicity for amino acids (AA). -
monochrome: All symbols black -
Base Pairing (NA default) : - - - -
2 Watson-Crick hydrogen bonds TAU dark orange
3 Watson-Crick hydrogen bonds GC blue
- -
Classic (NA) : WebLogo (version 1 and 2) and makelogo default color scheme for nucleic acids: G, orange; T & U, red; C, blue; and A, green. - - - - - -
G orange
TU red
C blue
A green
- -
Hydrophobicity (AA default) : - - - - -
Hydrophilic RKDENQ blue
Neutral SGHTAP green
Hydrophobic YVMCLFIW black
- -
Chemistry (AA): Color amino acids according to chemical properties. WebLogo (version 1 and 2) and makelogo default color. - - - - - -
Polar G,S,T,Y,C,Q,N green
Basic K,R,H blue
Acidic D,E red
Hydrophobic A,V,L,I,P,W,F,M black
- -
Charge (AA) : - - - -
Positive KRH blue
Negative DE red
- - - - -
Custom : A custom color scheme can be specified in the input field below. Specify colors on the left and associated symbols on the right. Colors are entered using CSS2 (Cascading Style Sheet) syntax. (e.g. 'red', '#F00', '#FF0000', 'rgb(255, 0, 0)', 'rgb(100%, 0%, 0%)' or 'hsl(0, 100%, 50%)' for the color red.) -
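
A color scheme is essentially a mapping from symbols to colors. The sketch below encodes two of the schemes listed above (Classic and Charge) as plain Python dictionaries; the names and the black fallback color are assumptions made for illustration and are not taken from the CodonLogo source.

```python
# Symbol-to-color tables transcribed from the scheme descriptions above.
CLASSIC_NA = {"G": "orange", "T": "red", "U": "red", "C": "blue", "A": "green"}
CHARGE_AA = {**dict.fromkeys("KRH", "blue"), **dict.fromkeys("DE", "red")}

def symbol_color(symbol, scheme, default="black"):
    """Look up the color for a symbol; `default` is an assumed fallback."""
    return scheme.get(symbol.upper(), default)

print(symbol_color("g", CLASSIC_NA))   # orange
print(symbol_color("K", CHARGE_AA))    # blue
print(symbol_color("L", CHARGE_AA))    # falls back to the default
```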

- -

More Options

-The CodonLogo command line client, codonlogo, provides many more options and greater control over the final logo appearance. - - - - - - - - -

Installing CodonLogo

- - -

Dependencies

-CodonLogo is written in Python. It requires Python 2.3, 2.4 or 2.5 and the -extension packages numpy and -corebio -to be installed before it will run. CodonLogo also requires a recent version of Ghostscript to create PNG and PDF output. -

- - - - -

-After unpacking the CodonLogo tarfile, it should be possible to immediately create -logos using the command line client (Provided that python, numpy, corebio and ghostscript have already been installed). -

 
-./codonlogo --format PNG < htdocs/examples/cap_hth.fa > cap_hth.png   
-

-Please consult the file build_examples.sh for more examples. -

-To run CodonLogo as a standalone web service, run the logo server command: -

 
-./codonlogo --serve 
-

-It should now be possible to access CodonLogo at http://localhost:8080/. -

- -

-The command line client and WebLogo libraries can be permanently installed using the supplied setup.py script. -

 
-sudo python setup.py install
-

-Run python setup.py --help for more installation options. For example, to specifically install the CodonLogo script to /usr/local/bin: -

-sudo python setup.py install_scripts --install-dir /usr/local/bin
-

- - -

Web App

- -

- To use CodonLogo as a web application, first install the weblogo dependencies and libraries as above, then - place (or link) the weblogolib/weblogo_htdocs directory - somewhere within the document root of your webserver. The webserver - must be able to execute the CGI script create.cgi. For Apache, you may have to add an ExecCGI - option and add a CGI handler in the httpd.conf configuration file. - Something like this: -

-<Directory "/home/httpd/htdocs/weblogo/">
-    Options FollowSymLinks MultiViews ExecCGI
-    AllowOverride None
-    Order allow,deny
-    Allow from all
-</Directory>
-...
-# To use CGI scripts outside of ScriptAliased directories:
-# (You will also need to add "ExecCGI" to the "Options" directive.)
-#
-AddHandler cgi-script .cgi
-

-It may also be necessary to set the PATH and PYTHONPATH environment variables. -

-SetEnv PYTHONPATH /path/to/weblogo/libraries
-

-The CGI script also has to be able to find the 'gs' Ghostscript executable. If Ghostscript is installed in a non-standard location, add the following environment variable. -

-SetEnv COREBIOPATH /path/to/gs
-

-The maximum number of bytes of uploaded sequence data can be controlled with the WEBLOGO_MAX_FILE_SIZE environment variable. -

-SetEnv WEBLOGO_MAX_FILE_SIZE 1000000
-
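
For illustration, a CGI script might read such a limit from its environment roughly as follows. This is a hypothetical sketch, not the code in create.cgi: the function names and the fallback value (chosen to match the SetEnv example) are assumptions.

```python
import os

FALLBACK_MAX_BYTES = 1000000  # assumption: mirrors the SetEnv example above

def max_upload_bytes(environ=None):
    """Return the upload limit in bytes, read from WEBLOGO_MAX_FILE_SIZE."""
    environ = os.environ if environ is None else environ
    raw = environ.get("WEBLOGO_MAX_FILE_SIZE", "")
    try:
        limit = int(raw)
    except ValueError:
        # Unset or malformed value: use the assumed fallback.
        return FALLBACK_MAX_BYTES
    return limit if limit > 0 else FALLBACK_MAX_BYTES

def accept_upload(data, environ=None):
    """True if the uploaded sequence data fits within the configured limit."""
    return len(data) <= max_upload_bytes(environ)

print(accept_upload(b"ACGT", {"WEBLOGO_MAX_FILE_SIZE": "4"}))
```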

- - - - - - -

`codonlogo`, The CodonLogo Command Line Interface (CLI)

-The command line client has many options not available through the web interface. Please consult the bundled build_examples.sh script for inspiration. -

-	Usage: codonlogo [options]  < sequence_data.fa > sequence_logo.eps
-
-	Create sequence logos from biological sequence alignments.
-
-	Options:
-	     --version                  show program's version number and exit
-	  -h --help                     show this help message and exit
-
-	  Input/Output Options:
-	    -f --fin FILENAME           Sequence input file (default: stdin)
-	       --fin-format FORMAT      Multiple sequence alignment format: (clustal,
-	                                fasta, plain, msf, genbank, nbrf, nexus,
-	                                phylip, stockholm, intelligenetics, table,
-	                                array)
-	    -o --fout FILENAME          Output file (default: stdout)
-	    -F --format FORMAT          Format of output: eps (default), png,
-	                                png_print, pdf, jpeg, txt
-
-	  Logo Data Options:
-	    -A --sequence-type TYPE     The type of sequence data: 'protein', 'rna' or
-	                                'dna'.
-	    -a --alphabet ALPHABET      The set of symbols to count, e.g. 'AGTC'. All
-	                                characters not in the alphabet are ignored. If
-	                                neither the alphabet nor sequence-type are
-	                                specified then weblogo will examine the input
-	                                data and make an educated guess. See also
-	                                --sequence-type, --ignore-lower-case
-	       --ignore-lower-case YES/NO
-	                                Disregard lower case letters and only count
-	                                upper case letters in sequences? (Default: No)
-	    -U --units NUMBER           A unit of entropy ('bits' (default), 'nats',
-	                                'digits'), or a unit of free energy ('kT',
-	                                'kJ/mol', 'kcal/mol'), or 'probability' for
-	                                probabilities
-	       --composition COMP.      The expected composition of the sequences:
-	                                'auto' (default), 'equiprobable', 'none' (Do
-	                                not perform any compositional adjustment), a
-	                                CG percentage, a species name (e.g. 'E. coli',
-	                                'H. sapiens'), or an explicit distribution
-	                                (e.g. {'A':10, 'C':40, 'G':40, 'T':10}). The
-	                                automatic option uses a typical distribution
-	                                for proteins and equiprobable distribution for
-	                                everything else.
-	       --weight NUMBER          The weight of prior data.  Default: total
-	                                pseudocounts equal to the number of monomer
-	                                types.
-	    -i --first-index INDEX      Index of first position in sequence data
-	                                (default: 1)
-	    -l --lower INDEX            Lower bound of sequence to display
-	    -u --upper INDEX            Upper bound of sequence to display
-
-	  Logo Format Options:
-	    These options control the format and display of the logo.
-
-	    -s --size LOGOSIZE          Specify a standard logo size (small, medium
-	                                (default), large)
-	    -n --stacks-per-line COUNT  Maximum number of logo stacks per logo line.
-	                                (default: 40)
-	    -t --title TEXT             Logo title text.
-	       --label TEXT             A figure label, e.g. '2a'
-	    -X --show-xaxis YES/NO      Display sequence numbers along x-axis?
-	                                (default: True)
-	    -x --xlabel TEXT            X-axis label
-	    -S --yaxis UNIT             Height of yaxis in units. (Default: Maximum
-	                                value with uninformative prior.)
-	    -Y --show-yaxis YES/NO      Display entropy scale along y-axis? (default:
-	                                True)
-	    -y --ylabel TEXT            Y-axis label  (default depends on plot type
-	                                and units)
-	    -E --show-ends YES/NO       Label the ends of the sequence? (default:
-	                                False)
-	    -P --fineprint TEXT         The fine print (default: weblogo version)
-	       --ticmarks NUMBER        Distance between ticmarks (default: 1.0)
-	       --errorbars YES/NO       Display error bars? (default: True)
-
-	  Color Options:
-	    Colors can be specified using CSS2 syntax. e.g. 'red', '#FF0000', etc.
-
-	    -c --color-scheme SCHEME    Specify a standard color scheme (auto, base
-	                                pairing, charge, chemistry, classic,
-	                                hydrophobicity, monochrome)
-	    -C --color COLOR SYMBOLS DESCRIPTION 
-	                                Specify symbol colors, e.g. --color black AG
-	                                'Purine' --color red TC 'Pyrimidine'
-	       --default-color COLOR    Symbol color if not otherwise specified.
-
-	  Advanced Format Options:
-	    These options provide fine control over the display of the logo.
-
-	    -W --stack-width POINTS     Width of a logo stack (default: 10.8)
-	    -H --stack-height POINTS    Height of a logo stack (default: 54.0)
-	       --box YES/NO             Draw boxes around symbols? (default: no)
-	       --resolution DPI         Bitmap resolution in dots per inch (DPI).
-	                                (default: 96 DPI, except png_print, 600 DPI)
-	                                Low resolution bitmaps (DPI<300) are
-	                                antialiased.
-	       --scale-width YES/NO     Scale the visible stack width by the fraction
-	                                of symbols in the column?  (i.e. columns with
-	                                many gaps of unknowns are narrow.)  (default:
-	                                yes)
-	       --debug YES/NO           Output additional diagnostic information.
-	                                (default: False)
-
-	  CodonLogo Server:
-	    Run a standalone webserver on a local port.
-
-	       --serve                  Start a standalone CodonLogo server for creating
-	                                sequence logos.
-	       --port PORT              Listen to this local port. (Default: 8080)
-
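
When calling the client from a script, it can be convenient to assemble the argument list programmatically. The sketch below uses only options that appear in the usage text above; the helper function itself is hypothetical, the file names are placeholders, and it assumes a Python 3 interpreter for the calling script.

```python
import shlex

def build_codonlogo_command(fin, fout, out_format="png", size="medium",
                            title=None, lower=None, upper=None,
                            composition=None, colors=()):
    """Return an argv list for the codonlogo client (hypothetical helper)."""
    argv = ["codonlogo", "--fin", fin, "--fout", fout,
            "--format", out_format, "--size", size]
    if title is not None:
        argv += ["--title", title]
    if lower is not None:
        argv += ["--lower", str(lower)]
    if upper is not None:
        argv += ["--upper", str(upper)]
    if composition is not None:
        argv += ["--composition", composition]
    # Each custom color is a (COLOR, SYMBOLS, DESCRIPTION) triple.
    for color, symbols, description in colors:
        argv += ["--color", color, symbols, description]
    return argv

argv = build_codonlogo_command(
    "cap_hth.fa", "cap_hth.png", title="CAP", lower=1, upper=30,
    composition="equiprobable",
    colors=[("black", "AG", "Purine"), ("red", "TC", "Pyrimidine")],
)
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be passed to subprocess.run(argv, check=True), assuming the codonlogo script is on the PATH.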

- - -

WebLogo Application Programmer Interface (API)

- -The WebLogo Python libraries provide even greater flexibility than the command line client. The code is split between two principal packages, weblogo itself, which contains specialized sequence logo generation code, and corebio, a package that contains code of more general utility. -Please consult the WebLogo and CoreBio API documentation. - - - - - -

WebLogo Development and Future Features

-The development project is hosted at -http://code.google.com/p/weblogo. - -If you wish to extend WebLogo or to contribute code, then you should download the full source code development package directly from the subversion repository. -

-> svn checkout http://weblogo.googlecode.com/svn/trunk/ weblogo
->  cd weblogo
-> ./weblogo < cap.fa > cap.eps
-

-Please consult the developer notes, DEVELOPERS.txt and software license LICENSE.txt -

- -

Outstanding bugs and feature requests are listed on the WebLogo issue tracker. -

- - -

Miscellanea

Release Notes and Known Bugs

-The WebLogo release notes detail changes to WebLogo and known issues with particular versions. - -

WebLogo 2

-The legacy WebLogo 2 server can be found here. - - - -

Acknowledgments

- -

-WebLogo was created by -Gavin E. Crooks, -Liana Lareau, -Gary Hon, -John-Marc Chandonia and -Steven E. Brenner. -Many others have provided suggestions, bug fixes and moral support. -

- -

-WebLogo was originally based upon the programs -alpro and -makelogo, -both of which are part of Tom Schneider's -delila package. Many thanks -are due to him for making this software freely available and for encouraging its use. -

- - - - - -

- - - - - - - - - - - - - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/htdocs/test.html --- a/weblogolib/htdocs/test.html Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,543 +0,0 @@ - - - - - -CodonLogo - Tests - - - - - - - - - - - - - - - - - -

CodonLogo: Tests

- -

- -

- -Various tests of the CodonLogo webapp.
- -

- -

Relative Entropy Test: -The entropy should be 2 bits, 1 bit, 0 bits -(The small sample correction should be turned off.) -
-
- - - - - - - - - - -GTTGTTGTTGTT -> -GTCGTCGTCGTC -> -GGGGGGGGGGGG -> -GGAGGAGGAGGA -" > -
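
The expected values for this test can be checked by hand or with a short script. The sketch below treats each nucleotide column separately, assumes an equiprobable four-letter background, and applies no small sample correction; the function name is hypothetical.

```python
import math
from collections import Counter

def column_information_bits(column, alphabet_size=4):
    """Information content of one column: log2(alphabet size) minus entropy."""
    counts = Counter(column)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

sequences = ["GTTGTTGTTGTT", "GTCGTCGTCGTC", "GGGGGGGGGGGG", "GGAGGAGGAGGA"]
bits = [column_information_bits(col) for col in zip(*sequences)]
print(bits[:3])
```

The first column is all G (2 bits), the second is split evenly between T and G (1 bit), and the third contains each of the four bases once (0 bits). The pattern repeats every three columns.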
-
Title And Labels Test: -Replace and display x-label, y-label and title. -
- - - - - - - - - - - - -AAAGTGAAAGTGAAAGTGAAAGTG -> -AAAGCGAAAGCGAAAGCGAAAGCG -> -TGCCCTTGCCCTTGCCCTTGCCCT -> -TGCCTTTGCCTTTGCCTTTGCCTT -" > -
- -Same, but do not show axes -
- - - - - - - - - - -AAAGTGAAAGTGAAAGTGAAAGTG -> -AAAGCGAAAGCGAAAGCGAAAGCG -> -TGCCCTTGCCCTTGCCCTTGCCCT -> -TGCCTTTGCCTTTGCCTTTGCCTT -" > -
-
Format Test: -Ensure that this logo can be created in each of the available formats -
- - - - - - - - -Format: - - -AAAGTGAAAGTGAAAGTGAAAGTG -> -AAAGCGAAAGCGAAAGCGAAAGCG -> -TGCCCTTGCCCTTGCCCTTGCCCT -> -TGCCTTTGCCTTTGCCTTTGCCTT -" > -
-
Test Alphabetic Order: -Each character in each stack has the same entropy. The letters -should be alphabetized, top down. -
- - - - - - - - -(NA) - -AAAGTG -> -AAAGCG -> -TGCCCT -> -TGCCTT -" > -
- -
- - - - - - (AA) - - -CA -> -CA -> -VA -> -VG -> -KG -> -KG -" > -
-
One Single Column: -Should not die just because there is only 1 stack. -
- - - - -A -> -A -> -G -> -G -> -G -> -G -" > -
-
Small Sample Correction Test: -The samples per column decrease from 32 (left) to 1 (right). -Before the small sample correction the relative entropy of each stack -is 2 bits. -
- - - - - - - - - -GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG -> -GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG- -> -GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG-- -> -GGGGGGGGGGGGGGGGGGGGGGGGGGGGG--- -> -GGGGGGGGGGGGGGGGGGGGGGGGGGGG---- -> -GGGGGGGGGGGGGGGGGGGGGGGGGGG----- -> -GGGGGGGGGGGGGGGGGGGGGGGGGG------ -> -GGGGGGGGGGGGGGGGGGGGGGGGG------- -> -GGGGGGGGGGGGGGGGGGGGGGGG-------- -> -GGGGGGGGGGGGGGGGGGGGGGG--------- -> -GGGGGGGGGGGGGGGGGGGGGG---------- -> -GGGGGGGGGGGGGGGGGGGGG----------- -> -GGGGGGGGGGGGGGGGGGGG------------ -> -GGGGGGGGGGGGGGGGGGG------------- -> -GGGGGGGGGGGGGGGGGG-------------- -> -GGGGGGGGGGGGGGGG---------------- -> -GGGGGGGGGGGGGGG----------------- -> -GGGGGGGGGGGGGG------------------ -> -GGGGGGGGGGGGG------------------- -> -GGGGGGGGGGGG-------------------- -> -GGGGGGGGGGG--------------------- -> -GGGGGGGGGG---------------------- -> -GGGGGGGGG----------------------- -> -GGGGGGGG------------------------ -> -GGGGGGG------------------------- -> -GGGGGG-------------------------- -> -GGGGG--------------------------- -> -GGGG---------------------------- -> -GGG----------------------------- -> -GG------------------------------ -> -G------------------------------- -" > -
-
Test Graceful Failure: -Each of these tests should result in a polite and informative error message. -
- - - - - - - -
- -
- - - - - - - -
- -
- - - - - - - - -
- -
- - - - - - - -
- - -
Logos Sizes: -Same sequences, three different sizes. -
- - - - - - -Size: - - - -aldB -18->4 -attcgtgatagctgtcgtaaag ->ansB 103->125 -ttttgttacctgcctctaactt ->araB1 109->131 -aagtgtgacgccgtgcaaataa ->araB2 147->169 -tgccgtgattatagacactttt ->cdd 1 107->129 -atttgcgatgcgtcgcgcattt ->cdd 2 57->79 -taatgagattcagatcacatat ->crp 1 115->137 -taatgtgacgtcctttgcatac ->crp 2 -gaaggcgacctgggtcatgctg ->cya 151->173 -aggtgttaaattgatcacgttt ->cytR 1 125->147 -cgatgcgaggcggatcgaaaaa ->cytR 2 106->128 -aaattcaatattcatcacactt ->dadAX 1 95->117 -agatgtgagccagctcaccata ->dadAX 2 32->54 -agatgtgattagattattattc ->deoP2 1 75->97 -aattgtgatgtgtatcgaagtg ->deoP2 2 128->150 -ttatttgaaccagatcgcatta ->fur 136->158 -aaatgtaagctgtgccacgttt ->gal 56->78 -aagtgtgacatggaataaatta ->glpACB (glpTQ) 1 54->76 -ttgtttgatttcgcgcatattc ->glpACB (glpTQ) 2 94->116 -aaacgtgatttcatgcgtcatt ->glpACB (glpTQ) 144->166 -atgtgtgcggcaattcacattt ->glpD (glpE) 95->117 -taatgttatacatatcactcta ->glpFK 1 120->142 -ttttatgacgaggcacacacat ->glpFK 2 95->117 -aagttcgatatttctcgttttt ->gut (srlA) 72->94 -ttttgcgatcaaaataacactt ->ilvB 87->109 -aaacgtgatcaacccctcaatt ->lac 1 (lacZ) 88->110 -taatgtgagttagctcactcat ->lac 2 (lacZ) 16->38 -aattgtgagcggataacaattt ->malEpKp1 110->132 -ttgtgtgatctctgttacagaa ->malEpKp2 139->161 -TAAtgtggagatgcgcacaTAA ->malEpKp3 173->195 -TTTtgcaagcaacatcacgAAA ->malEpKp4 205->227 -GACctcggtttagttcacaGAA ->malT 121->143 -aattgtgacacagtgcaaattc ->melR 52->74 -aaccgtgctcccactcgcagtc ->mtl 302->324 -TCTTGTGATTCAGATCACAAAG ->nag 156->178 -ttttgtgagttttgtcaccaaa ->nupG2 97->119 -aaatgttatccacatcacaatt ->nupG1 47->69 -ttatttgccacaggtaacaaaa ->ompA 166->188 -atgcctgacggagttcacactt ->ompR 161->183 -taacgtgatcatatcaacagaa ->ptsH A 316->338 -Ttttgtggcctgcttcaaactt ->ptsH B 188->210 -ttttatgatttggttcaattct ->rhaS (rhaB) 161->183 -aattgtgaacatcatcacgttc ->rot 1 (ppiA) 182->204 -ttttgtgatctgtttaaatgtt ->rot 2 (ppiA) 129->151 -agaggtgattttgatcacggaa ->tdcA 60->82 -atttgtgagtggtcgcacatat ->tnaL 73->95 -gattgtgattcgattcacattt ->tsx 2 146->168 -gtgtgtaaacgtgaacgcaatc 
->tsx 1 107->129 -aactgtgaaacgaaacatattt ->uxuAB 165->187 -TCTTGTGATGTGGTTAACCAAT -" > -
-

- - - -

- - - - diff -r c55bdc2fb9fa -r 33ac48224523 weblogolib/template.eps --- a/weblogolib/template.eps Thu Oct 27 12:09:09 2011 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,669 +0,0 @@ -%!PS-Adobe-3.0 EPSF-3.0 -%%Title: Sequence Logo: ${logo_title} -%%Creator: ${creator_text} -%%CreationDate: ${creation_date} -%%BoundingBox: 0 0 ${logo_width} ${logo_height} -%%Pages: 0 -%%DocumentFonts: -%%EndComments - - -% ---- VARIABLES ---- - -/True true def -/False false def - -/debug ${debug} def - -/logo_height ${logo_height} def -/logo_width ${logo_width} def -/logo_title (${logo_title}) def -/show_title ${show_title} def - -/logo_margin ${logo_margin} def -/xaxis_label_height ${xaxis_label_height} def -/title_height ${title_height} def -/stroke_width ${stroke_width} def -/tic_length ${tic_length} def - -/lines_per_logo ${lines_per_logo} def -/line_width ${line_width} def -/line_height ${line_height} def -/line_margin_left ${line_margin_left} def -/line_margin_right ${line_margin_right} def -/line_margin_bottom ${line_margin_bottom} def -/line_margin_top ${line_margin_top} def - -/stack_width ${stack_width} def -/stack_height ${stack_height} def -/stacks_per_line ${stacks_per_line} def -/stack_margin ${stack_margin} def - -/show_yaxis ${show_yaxis} def -/show_yaxis_label ${show_yaxis_label} def -/yaxis_label (${yaxis_label}) def -/yaxis_scale ${yaxis_scale} def % height in units -/yaxis_tic_interval ${yaxis_tic_interval} def % in units -/yaxis_minor_tic_interval ${yaxis_minor_tic_interval} def % in units - -/show_xaxis_label ${show_xaxis_label} def % True or False -/show_xaxis ${show_xaxis} def % True or False -/xaxis_label (${xaxis_label}) def -/xaxis_tic_interval ${xaxis_tic_interval} def -/rotate_numbers ${rotate_numbers} def % True or False -/number_interval ${number_interval} def -/show_ends ${show_ends} def -/end_type (${end_type}) def % d: DNA, p: PROTEIN, -: none - -/show_fineprint ${show_fineprint} def -/fineprint (${fineprint}) def -/logo_label 
(${logo_label}) def - -/show_boxes ${show_boxes} def % True or False -/shrink ${shrink} def % True or False -/shrink_fraction ${shrink_fraction} def - -/show_errorbars ${show_errorbars} def % True or False -/errorbar_fraction ${errorbar_fraction} def -/errorbar_width_fraction ${errorbar_width_fraction} def -/errorbar_gray ${errorbar_gray} def - -/fontsize ${fontsize} def -/small_fontsize ${small_fontsize} def -/title_fontsize ${title_fontsize} def -/number_fontsize ${number_fontsize} def - - -/UseCIEColor true def % Fix for issue 4 -/default_color ${default_color} def -/color_dict << -${color_dict} ->> def - - - -% ---- DERIVED PARAMETERS ---- - -/char_width stack_width 2 stack_margin mul sub def -/char_width2 char_width 2 div def -/char_width4 char_width 4 div def - -% movements to place 5'/N and 3'/C symbols -/leftEndDeltaX fontsize neg def -/leftEndDeltaY fontsize 1.25 mul neg def -/rightEndDeltaX fontsize 0.25 mul def -/rightEndDeltaY leftEndDeltaY def - - -% ---- PROCEDURES ---- - - -/SetTitleFont {/${title_font} findfont title_fontsize scalefont setfont} bind def -/SetLogoFont {/${logo_font} findfont char_width scalefont setfont} bind def -/SetStringFont{/${text_font} findfont fontsize scalefont setfont} bind def -/SetPrimeFont {/Symbol findfont fontsize scalefont setfont} bind def -/SetSmallFont {/${text_font} findfont small_fontsize scalefont setfont} bind def -/SetNumberFont {/${text_font} findfont number_fontsize scalefont setfont} bind def - -/DrawBox { % width height - /hh exch def - /ww exch def - gsave - 0.2 setlinewidth - %0.5 setgray - - %0 0 moveto - hh 0 rlineto - 0 ww rlineto - hh neg 0 rlineto - 0 ww neg rlineto - stroke - grestore -} bind def - - -/StartLogo { - %save - gsave - - - debug { - logo_margin logo_margin moveto - logo_height logo_margin 2 mul sub - logo_width logo_margin 2 mul sub - DrawBox } if - - show_title { DrawTitle } if - show_xaxis_label { DrawXaxisLable } if - show_fineprint { DrawFineprint } if - DrawLogoLabel - - - 
MoveToFirstLine -} bind def - - -/DrawLogoLabel { - gsave - SetTitleFont - - logo_margin - logo_height title_fontsize sub logo_margin sub - moveto - - debug { title_fontsize logo_label stringwidth pop DrawBox } if - 0 title_fontsize 4 div rmoveto % Move up to baseline (approximatly) - logo_label show - - grestore -} bind def - -/DrawTitle { - gsave - SetTitleFont - - logo_width 2 div logo_title stringwidth pop 2 div sub - logo_height title_fontsize sub logo_margin sub - moveto - - debug { title_fontsize logo_title stringwidth pop DrawBox } if - - 0 title_fontsize 4 div rmoveto % Move up to baseline (approximatly) - logo_title show - - grestore -} bind def - -/DrawXaxisLable { - % Print X-axis label, bottom center - gsave - SetStringFont - - logo_width 2 div xaxis_label stringwidth pop 2 div sub - xaxis_label_height logo_margin add fontsize sub - moveto - %fontsize 3 div - - debug { fontsize xaxis_label stringwidth pop DrawBox } if - - xaxis_label show - - grestore -} bind def - - -/DrawFineprint { - gsave - - SetSmallFont - - logo_width fineprint stringwidth pop sub - logo_margin sub line_margin_right sub - logo_margin - moveto - - debug { small_fontsize fineprint stringwidth pop DrawBox } if - - fineprint show - grestore -} bind def - -/MoveToFirstLine { - logo_margin - logo_height logo_margin sub title_height sub line_height sub - moveto -} bind def - -/EndLogo { - grestore - %showpage - %restore -} bind def - - -/StartLine{ - gsave - - % Draw outer box - debug { line_height line_width DrawBox } if - - % Move to lower left corner of content area - line_margin_left line_margin_bottom rmoveto - - % Draw inner content box - debug { - line_height line_margin_bottom sub line_margin_top sub - line_width line_margin_left sub line_margin_right sub - DrawBox - } if - - show_yaxis { DrawYaxis } if - show_xaxis { DrawLeftEnd } if - -} bind def - -/EndLine{ - show_xaxis { DrawRightEnd } if - grestore - 0 line_height neg rmoveto -} bind def - - -/DrawYaxis { - gsave - 
stack_margin neg 0 translate - DrawYaxisBar - DrawYaxisLabel - grestore -} bind def - - -/DrawYaxisBar { - gsave - stack_margin neg 0 rmoveto - - SetNumberFont - stroke_width setlinewidth - - /str 10 string def % string to hold number - /smallgap stack_margin def - - % Draw first tic and bar - gsave - tic_length neg 0 rmoveto - tic_length 0 rlineto - 0 stack_height rlineto - stroke - grestore - - % Draw the tics - % initial increment limit proc for - 0 yaxis_tic_interval yaxis_scale abs - {/loopnumber exch def - - % convert the number coming from the loop to a string - % and find its width - loopnumber 10 str cvrs - /stringnumber exch def % string representing the number - - stringnumber stringwidth pop - /numberwidth exch def % width of number to show - - /halfnumberheight - stringnumber CharBoxHeight 2 div - def - - gsave - numberwidth % move back width of number - neg loopnumber stack_height yaxis_scale div mul % shift on y axis - halfnumberheight sub % down half the digit - rmoveto % move back the width of the string - - tic_length neg smallgap sub % Move back a bit more - 0 rmoveto % move back the width of the tic - - stringnumber show - smallgap 0 rmoveto % Make a small gap - - % now show the tic mark - 0 halfnumberheight rmoveto % shift up again - tic_length 0 rlineto - stroke - grestore - } for - - % Draw the minor tics - % initial increment limit proc for - 0 yaxis_minor_tic_interval yaxis_scale abs - {/loopnumber2 exch def - gsave - 0 - loopnumber2 stack_height yaxis_scale div mul - rmoveto - - tic_length 2 div neg 0 rlineto - stroke - grestore - } for - - grestore -} bind def - -/DrawYaxisLabel { - gsave - SetStringFont - - % How far we move left depends on the size of - % the tic labels. 
- /str 10 string def % string to hold number - yaxis_scale yaxis_tic_interval div cvi yaxis_tic_interval mul - str cvs stringwidth pop - tic_length 1.25 mul add neg - - stack_height - yaxis_label stringwidth pop - sub 2 div - - rmoveto - 90 rotate - - yaxis_label show - grestore -} bind def - - -%Take a single character and return the bounding box -/CharBox { % CharBox - gsave - newpath - 0 0 moveto - % take the character off the stack and use it here: - true charpath - flattenpath - pathbbox % compute bounding box of 1 pt. char => lx ly ux uy - % the path is here, but toss it away ... - grestore -} bind def - - -% The height of a characters bounding box -/CharBoxHeight { % CharBoxHeight - CharBox - exch pop sub neg exch pop -} bind def - - -% The width of a characters bounding box -/CharBoxWidth { % CharBoxHeight - CharBox - pop exch pop sub neg -} bind def - - -/DrawLeftEnd { - gsave - SetStringFont - leftEndDeltaX leftEndDeltaY rmoveto - - show_ends { - debug { leftEndDeltaY neg leftEndDeltaX neg DrawBox } if - end_type (d) eq {(5) show DrawPrime} if - end_type (p) eq {(N) show} if - } if - grestore -} bind def - -/DrawRightEnd { - gsave - SetStringFont - rightEndDeltaX rightEndDeltaY rmoveto - - show_ends { - debug { rightEndDeltaY neg leftEndDeltaX neg DrawBox } if - end_type (d) eq {(3) show DrawPrime} if - end_type (p) eq {(C) show} if - } if - grestore -} bind def - -/DrawPrime { - gsave - SetPrimeFont - (\242) show - grestore -} bind def - - -/StartStack { % startstack - show_xaxis {DrawNumber}{pop} ifelse - gsave - debug { stack_height stack_width DrawBox } if - -} bind def - -/EndStack { - grestore - stack_width 0 rmoveto -} bind def - - -/DrawNumber { % number MakeNumber - /n exch def - - - gsave - %0 stack_margin neg rmoveto - stroke_width setlinewidth - stack_width 0 rlineto - stack_width 2 div neg 0 rmoveto - - n () eq - { 0 tic_length 4 div neg rlineto } - { 0 tic_length 2 div neg rlineto } - ifelse - - stroke - grestore - - - - gsave - n - 
SetNumberFont - stack_width 2 div tic_length 2 div neg rmoveto - - rotate_numbers { - 90 rotate - dup stringwidth pop neg % find the length of the number - stack_margin sub % Move down a bit - (0) CharBoxHeight 2 div neg % left half height of numbers - rmoveto - show - } { - dup stringwidth pop neg 2 div number_fontsize neg rmoveto - show - } ifelse - - - - grestore -} bind def - - - -% Draw a character whose height is proportional to symbol bits -/ShowSymbol{ % interval character ShowSymbol - /char exch def - /interval exch def - /fraction_width exch def - - /char_height - interval yaxis_scale div stack_height mul - stack_margin sub - dup - % if char_height is negative or very small replace with zero - % BUG FIX: This used to be '0.0 gt' but it seems that DrawHeight - % has a finite, non-zero minimum, which results in a rangecheck error - 0.001 gt {}{pop 0.0} ifelse - def - - char_height 0.0 gt { - show_boxes { - gsave - /ww char_height stack_margin add def - /hh stack_width def - stroke_width setlinewidth - hh 0 rlineto - 0 ww rlineto - hh neg 0 rlineto - 0 ww neg rlineto - stroke - grestore - } if - - gsave - stack_margin stack_margin rmoveto - debug { char_height char_width DrawBox } if - 1 fraction_width sub char_width mul 2 div 0 rmoveto - fraction_width char_width mul char_height char DrawChar - grestore - - } if - 0 interval yaxis_scale div stack_height mul rmoveto -} bind def - - -/DrawChar { % ShowChar - /tc exch def % The character - /ysize exch def % the y size of the character - /xsize exch def % the x size of the character - /xmulfactor 1 def - /ymulfactor 1 def - - gsave - SetLogoFont - tc SetColor - - % IReplacementHack - % Deal with the lack of bars on the letter 'I' in Arial and Helvetica - % by replacing with 'I' from Courier. 
- tc (I) eq { - /Courier findfont char_width scalefont setfont - } if - - - shrink { - xsize 1 shrink_fraction sub 2 div mul - ysize 1 shrink_fraction sub 2 div mul rmoveto - shrink_fraction shrink_fraction scale - } if - - % Calculate the font scaling factors - % Loop twice to catch small correction due to first scaling - 2 { - gsave - xmulfactor ymulfactor scale - - ysize % desired size of character in points - tc CharBoxHeight - dup 0.0 ne { - div % factor by which to scale up the character - /ymulfactor exch def - } {pop pop} ifelse - - xsize % desired size of character in points - tc CharBoxWidth - dup 0.0 ne { - div % factor by which to scale up the character - /xmulfactor exch def - } {pop pop} ifelse - grestore - } repeat - - - - % Draw the character - xmulfactor ymulfactor scale - % Move lower left corner of character to start point - tc CharBox pop pop % llx lly : Lower left corner - exch neg exch neg - rmoveto - - tc show - - grestore -} bind def - -/SetColor{ % SetColor - dup color_dict exch known { - color_dict exch get aload pop setrgbcolor - } { - pop - default_color aload pop setrgbcolor - } ifelse -} bind def - - -/DrawErrorbar{ % interval_down interval_up DrawErrorbar - - gsave - /points_per_unit stack_height yaxis_scale div def - /height_up exch points_per_unit mul def - /height_down exch points_per_unit mul def - - show_errorbars { - - stroke_width setlinewidth - errorbar_gray setgray - stack_width 2 div 0 rmoveto - - /errorbar_width char_width errorbar_width_fraction mul def - /errorbar_width2 errorbar_width 2 div def - - gsave - 0 height_down neg rmoveto - errorbar_width2 neg 0 rlineto - errorbar_width 0 rlineto - errorbar_width2 neg 0 rlineto - 0 height_down errorbar_fraction mul rlineto - stroke - grestore - - gsave - 0 height_up rmoveto - errorbar_width2 neg 0 rlineto - errorbar_width 0 rlineto - errorbar_width2 neg 0 rlineto - 0 height_up neg errorbar_fraction mul rlineto - stroke - grestore - } if - - grestore - -} bind def - 
-/DrawErrorbarFirst{ % interval_down interval_up center DrawErrorbarFirst - gsave - /points_per_unit stack_height yaxis_scale div def - /center exch points_per_unit mul def - - 0 center rmoveto - DrawErrorbar - grestore -} bind def - -%%EndProlog - -%%Page: 1 1 - -% Example Data -%StartLogo -% StartLine -% (1) StartStack -% 1.2 (C) ShowSymbol -% 2.2 (I) ShowSymbol -% 0.5 0.5 DrawErrorbar -% EndStack -% (2) StartStack -% 0.5 (I) ShowSymbol -% 0.9 (L) ShowSymbol -% 1.0 (G) ShowSymbol -% -% 0.5 0.5 DrawErrorbar -% EndStack -% (234) StartStack -% EndStack -% (235) StartStack -% EndStack -% EndLine -%EndLogo - -StartLogo - -${logo_data} - -EndLogo - - -%%EOF
