Mercurial > repos > miller-lab > genome_diversity
changeset 31:a631c2f6d913
Update to Miller Lab devshed revision 3c4110ffacc3
author | Richard Burhans <burhans@bx.psu.edu>
---|---
date | Fri, 20 Sep 2013 13:25:27 -0400
parents | 4188853b940b
children | 03c22b722882
files | BeautifulSoup.py README add_fst_column.xml aggregate_gd_indivs.xml assignment_of_optimal_breeding_pairs.py assignment_of_optimal_breeding_pairs.xml average_fst.xml cluster_kegg.xml cluster_onConnctdComps.py coverage_distributions.xml discover_familial_relationships.py discover_familial_relationships.xml diversity_pi.py diversity_pi.xml dpmix.py dpmix.xml dpmix_plot.py draw_variants.py draw_variants.xml filter_gd_snp.xml find_intervals.xml gd_snp2vcf.pl gd_snp2vcf.xml genome_diversity/Makefile genome_diversity/bin/gd_ploteig genome_diversity/bin/varplot genome_diversity/src/Fst_ave.c genome_diversity/src/Fst_column.c genome_diversity/src/Fst_lib.c genome_diversity/src/Fst_lib.h genome_diversity/src/Huang.c genome_diversity/src/Huang.h genome_diversity/src/Makefile genome_diversity/src/admix_prep.c genome_diversity/src/aggregate.c genome_diversity/src/coords2admix.c genome_diversity/src/coverage.c genome_diversity/src/dist_mat.c genome_diversity/src/dpmix.c genome_diversity/src/eval2pct.c genome_diversity/src/filter_snps.c genome_diversity/src/get_pi.c genome_diversity/src/lib.c genome_diversity/src/lib.h genome_diversity/src/mito_lib.c genome_diversity/src/mito_lib.h genome_diversity/src/mk_Ji.c genome_diversity/src/mt_pi.c genome_diversity/src/sweep.c inbreeding_and_kinship.py inbreeding_and_kinship.xml make_phylip.py make_phylip.xml nucleotide_diversity_pi.xml offspring_heterozygosity.py offspring_heterozygosity.xml offspring_heterozygosity_pedigree.py offspring_heterozygosity_pedigree.xml pathway_image.xml pca.xml phylogenetic_tree.xml population_structure.xml prepare_population_structure.xml rank_pathways.xml rank_terms.xml raxml.py raxml.xml reorder.xml static/images/cluster_kegg_formula.png static/images/gd_coverage.png tool_dependencies.xml
diffstat | 71 files changed, 2914 insertions(+), 6230 deletions(-)
--- a/BeautifulSoup.py Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,2014 +0,0 @@ -"""Beautiful Soup -Elixir and Tonic -"The Screen-Scraper's Friend" -http://www.crummy.com/software/BeautifulSoup/ - -Beautiful Soup parses a (possibly invalid) XML or HTML document into a -tree representation. It provides methods and Pythonic idioms that make -it easy to navigate, search, and modify the tree. - -A well-formed XML/HTML document yields a well-formed data -structure. An ill-formed XML/HTML document yields a correspondingly -ill-formed data structure. If your document is only locally -well-formed, you can use this library to find and process the -well-formed part of it. - -Beautiful Soup works with Python 2.2 and up. It has no external -dependencies, but you'll have more success at converting data to UTF-8 -if you also install these three packages: - -* chardet, for auto-detecting character encodings - http://chardet.feedparser.org/ -* cjkcodecs and iconv_codec, which add more encodings to the ones supported - by stock Python. - http://cjkpython.i18n.org/ - -Beautiful Soup defines classes for two main parsing strategies: - - * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific - language that kind of looks like XML. - - * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid - or invalid. This class has web browser-like heuristics for - obtaining a sensible parse tree in the face of common HTML errors. - -Beautiful Soup also defines a class (UnicodeDammit) for autodetecting -the encoding of an HTML or XML document, and converting it to -Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser. - -For more than you ever wanted to know about Beautiful Soup, see the -documentation: -http://www.crummy.com/software/BeautifulSoup/documentation.html - -Here, have some legalese: - -Copyright (c) 2004-2010, Leonard Richardson - -All rights reserved. 
- -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are -met: - - * Redistributions of source code must retain the above copyright - notice, this list of conditions and the following disclaimer. - - * Redistributions in binary form must reproduce the above - copyright notice, this list of conditions and the following - disclaimer in the documentation and/or other materials provided - with the distribution. - - * Neither the name of the the Beautiful Soup Consortium and All - Night Kosher Bakery nor the names of its contributors may be - used to endorse or promote products derived from this software - without specific prior written permission. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS -"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT -LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR -A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR -CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF -LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING -NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT. 
- -""" -from __future__ import generators - -__author__ = "Leonard Richardson (leonardr@segfault.org)" -__version__ = "3.2.0" -__copyright__ = "Copyright (c) 2004-2010 Leonard Richardson" -__license__ = "New-style BSD" - -from sgmllib import SGMLParser, SGMLParseError -import codecs -import markupbase -import types -import re -import sgmllib -try: - from htmlentitydefs import name2codepoint -except ImportError: - name2codepoint = {} -try: - set -except NameError: - from sets import Set as set - -#These hacks make Beautiful Soup able to parse XML with namespaces -sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*') -markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match - -DEFAULT_OUTPUT_ENCODING = "utf-8" - -def _match_css_class(str): - """Build a RE to match the given CSS class.""" - return re.compile(r"(^|.*\s)%s($|\s)" % str) - -# First, the classes that represent markup elements. - -class PageElement(object): - """Contains the navigational information for some part of the page - (either a tag or a piece of text)""" - - def setup(self, parent=None, previous=None): - """Sets up the initial relations between this element and - other elements.""" - self.parent = parent - self.previous = previous - self.next = None - self.previousSibling = None - self.nextSibling = None - if self.parent and self.parent.contents: - self.previousSibling = self.parent.contents[-1] - self.previousSibling.nextSibling = self - - def replaceWith(self, replaceWith): - oldParent = self.parent - myIndex = self.parent.index(self) - if hasattr(replaceWith, "parent")\ - and replaceWith.parent is self.parent: - # We're replacing this element with one of its siblings. - index = replaceWith.parent.index(replaceWith) - if index and index < myIndex: - # Furthermore, it comes before this element. That - # means that when we extract it, the index of this - # element will change. 
- myIndex = myIndex - 1 - self.extract() - oldParent.insert(myIndex, replaceWith) - - def replaceWithChildren(self): - myParent = self.parent - myIndex = self.parent.index(self) - self.extract() - reversedChildren = list(self.contents) - reversedChildren.reverse() - for child in reversedChildren: - myParent.insert(myIndex, child) - - def extract(self): - """Destructively rips this element out of the tree.""" - if self.parent: - try: - del self.parent.contents[self.parent.index(self)] - except ValueError: - pass - - #Find the two elements that would be next to each other if - #this element (and any children) hadn't been parsed. Connect - #the two. - lastChild = self._lastRecursiveChild() - nextElement = lastChild.next - - if self.previous: - self.previous.next = nextElement - if nextElement: - nextElement.previous = self.previous - self.previous = None - lastChild.next = None - - self.parent = None - if self.previousSibling: - self.previousSibling.nextSibling = self.nextSibling - if self.nextSibling: - self.nextSibling.previousSibling = self.previousSibling - self.previousSibling = self.nextSibling = None - return self - - def _lastRecursiveChild(self): - "Finds the last element beneath this object to be parsed." - lastChild = self - while hasattr(lastChild, 'contents') and lastChild.contents: - lastChild = lastChild.contents[-1] - return lastChild - - def insert(self, position, newChild): - if isinstance(newChild, basestring) \ - and not isinstance(newChild, NavigableString): - newChild = NavigableString(newChild) - - position = min(position, len(self.contents)) - if hasattr(newChild, 'parent') and newChild.parent is not None: - # We're 'inserting' an element that's already one - # of this object's children. - if newChild.parent is self: - index = self.index(newChild) - if index > position: - # Furthermore we're moving it further down the - # list of this object's children. That means that - # when we extract this element, our target index - # will jump down one. 
- position = position - 1 - newChild.extract() - - newChild.parent = self - previousChild = None - if position == 0: - newChild.previousSibling = None - newChild.previous = self - else: - previousChild = self.contents[position-1] - newChild.previousSibling = previousChild - newChild.previousSibling.nextSibling = newChild - newChild.previous = previousChild._lastRecursiveChild() - if newChild.previous: - newChild.previous.next = newChild - - newChildsLastElement = newChild._lastRecursiveChild() - - if position >= len(self.contents): - newChild.nextSibling = None - - parent = self - parentsNextSibling = None - while not parentsNextSibling: - parentsNextSibling = parent.nextSibling - parent = parent.parent - if not parent: # This is the last element in the document. - break - if parentsNextSibling: - newChildsLastElement.next = parentsNextSibling - else: - newChildsLastElement.next = None - else: - nextChild = self.contents[position] - newChild.nextSibling = nextChild - if newChild.nextSibling: - newChild.nextSibling.previousSibling = newChild - newChildsLastElement.next = nextChild - - if newChildsLastElement.next: - newChildsLastElement.next.previous = newChildsLastElement - self.contents.insert(position, newChild) - - def append(self, tag): - """Appends the given tag to the contents of this tag.""" - self.insert(len(self.contents), tag) - - def findNext(self, name=None, attrs={}, text=None, **kwargs): - """Returns the first item that matches the given criteria and - appears after this Tag in the document.""" - return self._findOne(self.findAllNext, name, attrs, text, **kwargs) - - def findAllNext(self, name=None, attrs={}, text=None, limit=None, - **kwargs): - """Returns all items that match the given criteria and appear - after this Tag in the document.""" - return self._findAll(name, attrs, text, limit, self.nextGenerator, - **kwargs) - - def findNextSibling(self, name=None, attrs={}, text=None, **kwargs): - """Returns the closest sibling to this Tag that matches 
the - given criteria and appears after this Tag in the document.""" - return self._findOne(self.findNextSiblings, name, attrs, text, - **kwargs) - - def findNextSiblings(self, name=None, attrs={}, text=None, limit=None, - **kwargs): - """Returns the siblings of this Tag that match the given - criteria and appear after this Tag in the document.""" - return self._findAll(name, attrs, text, limit, - self.nextSiblingGenerator, **kwargs) - fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x - - def findPrevious(self, name=None, attrs={}, text=None, **kwargs): - """Returns the first item that matches the given criteria and - appears before this Tag in the document.""" - return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs) - - def findAllPrevious(self, name=None, attrs={}, text=None, limit=None, - **kwargs): - """Returns all items that match the given criteria and appear - before this Tag in the document.""" - return self._findAll(name, attrs, text, limit, self.previousGenerator, - **kwargs) - fetchPrevious = findAllPrevious # Compatibility with pre-3.x - - def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs): - """Returns the closest sibling to this Tag that matches the - given criteria and appears before this Tag in the document.""" - return self._findOne(self.findPreviousSiblings, name, attrs, text, - **kwargs) - - def findPreviousSiblings(self, name=None, attrs={}, text=None, - limit=None, **kwargs): - """Returns the siblings of this Tag that match the given - criteria and appear before this Tag in the document.""" - return self._findAll(name, attrs, text, limit, - self.previousSiblingGenerator, **kwargs) - fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x - - def findParent(self, name=None, attrs={}, **kwargs): - """Returns the closest parent of this Tag that matches the given - criteria.""" - # NOTE: We can't use _findOne because findParents takes a different - # set of arguments. 
- r = None - l = self.findParents(name, attrs, 1) - if l: - r = l[0] - return r - - def findParents(self, name=None, attrs={}, limit=None, **kwargs): - """Returns the parents of this Tag that match the given - criteria.""" - - return self._findAll(name, attrs, None, limit, self.parentGenerator, - **kwargs) - fetchParents = findParents # Compatibility with pre-3.x - - #These methods do the real heavy lifting. - - def _findOne(self, method, name, attrs, text, **kwargs): - r = None - l = method(name, attrs, text, 1, **kwargs) - if l: - r = l[0] - return r - - def _findAll(self, name, attrs, text, limit, generator, **kwargs): - "Iterates over a generator looking for things that match." - - if isinstance(name, SoupStrainer): - strainer = name - # (Possibly) special case some findAll*(...) searches - elif text is None and not limit and not attrs and not kwargs: - # findAll*(True) - if name is True: - return [element for element in generator() - if isinstance(element, Tag)] - # findAll*('tag-name') - elif isinstance(name, basestring): - return [element for element in generator() - if isinstance(element, Tag) and - element.name == name] - else: - strainer = SoupStrainer(name, attrs, text, **kwargs) - # Build a SoupStrainer - else: - strainer = SoupStrainer(name, attrs, text, **kwargs) - results = ResultSet(strainer) - g = generator() - while True: - try: - i = g.next() - except StopIteration: - break - if i: - found = strainer.search(i) - if found: - results.append(found) - if limit and len(results) >= limit: - break - return results - - #These Generators can be used to navigate starting from both - #NavigableStrings and Tags. 
- def nextGenerator(self): - i = self - while i is not None: - i = i.next - yield i - - def nextSiblingGenerator(self): - i = self - while i is not None: - i = i.nextSibling - yield i - - def previousGenerator(self): - i = self - while i is not None: - i = i.previous - yield i - - def previousSiblingGenerator(self): - i = self - while i is not None: - i = i.previousSibling - yield i - - def parentGenerator(self): - i = self - while i is not None: - i = i.parent - yield i - - # Utility methods - def substituteEncoding(self, str, encoding=None): - encoding = encoding or "utf-8" - return str.replace("%SOUP-ENCODING%", encoding) - - def toEncoding(self, s, encoding=None): - """Encodes an object to a string in some encoding, or to Unicode. - .""" - if isinstance(s, unicode): - if encoding: - s = s.encode(encoding) - elif isinstance(s, str): - if encoding: - s = s.encode(encoding) - else: - s = unicode(s) - else: - if encoding: - s = self.toEncoding(str(s), encoding) - else: - s = unicode(s) - return s - -class NavigableString(unicode, PageElement): - - def __new__(cls, value): - """Create a new NavigableString. - - When unpickling a NavigableString, this method is called with - the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be - passed in to the superclass's __new__ or the superclass won't know - how to handle non-ASCII characters. - """ - if isinstance(value, unicode): - return unicode.__new__(cls, value) - return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING) - - def __getnewargs__(self): - return (NavigableString.__str__(self),) - - def __getattr__(self, attr): - """text.string gives you text. 
This is for backwards - compatibility for Navigable*String, but for CData* it lets you - get the string without the CData wrapper.""" - if attr == 'string': - return self - else: - raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr) - - def __unicode__(self): - return str(self).decode(DEFAULT_OUTPUT_ENCODING) - - def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): - if encoding: - return self.encode(encoding) - else: - return self - -class CData(NavigableString): - - def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): - return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding) - -class ProcessingInstruction(NavigableString): - def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): - output = self - if "%SOUP-ENCODING%" in output: - output = self.substituteEncoding(output, encoding) - return "<?%s?>" % self.toEncoding(output, encoding) - -class Comment(NavigableString): - def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): - return "<!--%s-->" % NavigableString.__str__(self, encoding) - -class Declaration(NavigableString): - def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): - return "<!%s>" % NavigableString.__str__(self, encoding) - -class Tag(PageElement): - - """Represents a found HTML tag with its attributes and contents.""" - - def _invert(h): - "Cheap function to invert a hash." - i = {} - for k,v in h.items(): - i[v] = k - return i - - XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'", - "quot" : '"', - "amp" : "&", - "lt" : "<", - "gt" : ">" } - - XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS) - - def _convertEntities(self, match): - """Used in a call to re.sub to replace HTML, XML, and numeric - entities with the appropriate Unicode characters. 
If HTML - entities are being converted, any unrecognized entities are - escaped.""" - x = match.group(1) - if self.convertHTMLEntities and x in name2codepoint: - return unichr(name2codepoint[x]) - elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS: - if self.convertXMLEntities: - return self.XML_ENTITIES_TO_SPECIAL_CHARS[x] - else: - return u'&%s;' % x - elif len(x) > 0 and x[0] == '#': - # Handle numeric entities - if len(x) > 1 and x[1] == 'x': - return unichr(int(x[2:], 16)) - else: - return unichr(int(x[1:])) - - elif self.escapeUnrecognizedEntities: - return u'&%s;' % x - else: - return u'&%s;' % x - - def __init__(self, parser, name, attrs=None, parent=None, - previous=None): - "Basic constructor." - - # We don't actually store the parser object: that lets extracted - # chunks be garbage-collected - self.parserClass = parser.__class__ - self.isSelfClosing = parser.isSelfClosingTag(name) - self.name = name - if attrs is None: - attrs = [] - elif isinstance(attrs, dict): - attrs = attrs.items() - self.attrs = attrs - self.contents = [] - self.setup(parent, previous) - self.hidden = False - self.containsSubstitutions = False - self.convertHTMLEntities = parser.convertHTMLEntities - self.convertXMLEntities = parser.convertXMLEntities - self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities - - # Convert any HTML, XML, or numeric entities in the attribute values. 
- convert = lambda(k, val): (k, - re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);", - self._convertEntities, - val)) - self.attrs = map(convert, self.attrs) - - def getString(self): - if (len(self.contents) == 1 - and isinstance(self.contents[0], NavigableString)): - return self.contents[0] - - def setString(self, string): - """Replace the contents of the tag with a string""" - self.clear() - self.append(string) - - string = property(getString, setString) - - def getText(self, separator=u""): - if not len(self.contents): - return u"" - stopNode = self._lastRecursiveChild().next - strings = [] - current = self.contents[0] - while current is not stopNode: - if isinstance(current, NavigableString): - strings.append(current.strip()) - current = current.next - return separator.join(strings) - - text = property(getText) - - def get(self, key, default=None): - """Returns the value of the 'key' attribute for the tag, or - the value given for 'default' if it doesn't have that - attribute.""" - return self._getAttrMap().get(key, default) - - def clear(self): - """Extract all children.""" - for child in self.contents[:]: - child.extract() - - def index(self, element): - for i, child in enumerate(self.contents): - if child is element: - return i - raise ValueError("Tag.index: element not in tag") - - def has_key(self, key): - return self._getAttrMap().has_key(key) - - def __getitem__(self, key): - """tag[key] returns the value of the 'key' attribute for the tag, - and throws an exception if it's not there.""" - return self._getAttrMap()[key] - - def __iter__(self): - "Iterating over a tag iterates over its contents." - return iter(self.contents) - - def __len__(self): - "The length of a tag is the length of its list of contents." - return len(self.contents) - - def __contains__(self, x): - return x in self.contents - - def __nonzero__(self): - "A tag is non-None even if it has no contents." 
- return True - - def __setitem__(self, key, value): - """Setting tag[key] sets the value of the 'key' attribute for the - tag.""" - self._getAttrMap() - self.attrMap[key] = value - found = False - for i in range(0, len(self.attrs)): - if self.attrs[i][0] == key: - self.attrs[i] = (key, value) - found = True - if not found: - self.attrs.append((key, value)) - self._getAttrMap()[key] = value - - def __delitem__(self, key): - "Deleting tag[key] deletes all 'key' attributes for the tag." - for item in self.attrs: - if item[0] == key: - self.attrs.remove(item) - #We don't break because bad HTML can define the same - #attribute multiple times. - self._getAttrMap() - if self.attrMap.has_key(key): - del self.attrMap[key] - - def __call__(self, *args, **kwargs): - """Calling a tag like a function is the same as calling its - findAll() method. Eg. tag('a') returns a list of all the A tags - found within this tag.""" - return apply(self.findAll, args, kwargs) - - def __getattr__(self, tag): - #print "Getattr %s.%s" % (self.__class__, tag) - if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3: - return self.find(tag[:-3]) - elif tag.find('__') != 0: - return self.find(tag) - raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag) - - def __eq__(self, other): - """Returns true iff this tag has the same name, the same attributes, - and the same contents (recursively) as the given tag. - - NOTE: right now this will return false if two tags have the - same attributes in a different order. 
Should this be fixed?""" - if other is self: - return True - if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other): - return False - for i in range(0, len(self.contents)): - if self.contents[i] != other.contents[i]: - return False - return True - - def __ne__(self, other): - """Returns true iff this tag is not identical to the other tag, - as defined in __eq__.""" - return not self == other - - def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING): - """Renders this tag as a string.""" - return self.__str__(encoding) - - def __unicode__(self): - return self.__str__(None) - - BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|" - + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)" - + ")") - - def _sub_entity(self, x): - """Used with a regular expression to substitute the - appropriate XML entity for an XML special character.""" - return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";" - - def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING, - prettyPrint=False, indentLevel=0): - """Returns a string or Unicode representation of this tag and - its contents. To get Unicode, pass None for encoding. - - NOTE: since Python's HTML parser consumes whitespace, this - method is not certain to reproduce the whitespace present in - the original string.""" - - encodedName = self.toEncoding(self.name, encoding) - - attrs = [] - if self.attrs: - for key, val in self.attrs: - fmt = '%s="%s"' - if isinstance(val, basestring): - if self.containsSubstitutions and '%SOUP-ENCODING%' in val: - val = self.substituteEncoding(val, encoding) - - # The attribute value either: - # - # * Contains no embedded double quotes or single quotes. - # No problem: we enclose it in double quotes. - # * Contains embedded single quotes. No problem: - # double quotes work here too. - # * Contains embedded double quotes. No problem: - # we enclose it in single quotes. 
- # * Embeds both single _and_ double quotes. This - # can't happen naturally, but it can happen if - # you modify an attribute value after parsing - # the document. Now we have a bit of a - # problem. We solve it by enclosing the - # attribute in single quotes, and escaping any - # embedded single quotes to XML entities. - if '"' in val: - fmt = "%s='%s'" - if "'" in val: - # TODO: replace with apos when - # appropriate. - val = val.replace("'", "&squot;") - - # Now we're okay w/r/t quotes. But the attribute - # value might also contain angle brackets, or - # ampersands that aren't part of entities. We need - # to escape those to XML entities too. - val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val) - - attrs.append(fmt % (self.toEncoding(key, encoding), - self.toEncoding(val, encoding))) - close = '' - closeTag = '' - if self.isSelfClosing: - close = ' /' - else: - closeTag = '</%s>' % encodedName - - indentTag, indentContents = 0, 0 - if prettyPrint: - indentTag = indentLevel - space = (' ' * (indentTag-1)) - indentContents = indentTag + 1 - contents = self.renderContents(encoding, prettyPrint, indentContents) - if self.hidden: - s = contents - else: - s = [] - attributeString = '' - if attrs: - attributeString = ' ' + ' '.join(attrs) - if prettyPrint: - s.append(space) - s.append('<%s%s%s>' % (encodedName, attributeString, close)) - if prettyPrint: - s.append("\n") - s.append(contents) - if prettyPrint and contents and contents[-1] != "\n": - s.append("\n") - if prettyPrint and closeTag: - s.append(space) - s.append(closeTag) - if prettyPrint and closeTag and self.nextSibling: - s.append("\n") - s = ''.join(s) - return s - - def decompose(self): - """Recursively destroys the contents of this tree.""" - self.extract() - if len(self.contents) == 0: - return - current = self.contents[0] - while current is not None: - next = current.next - if isinstance(current, Tag): - del current.contents[:] - current.parent = None - current.previous = None - 
current.previousSibling = None - current.next = None - current.nextSibling = None - current = next - - def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING): - return self.__str__(encoding, True) - - def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING, - prettyPrint=False, indentLevel=0): - """Renders the contents of this tag as a string in the given - encoding. If encoding is None, returns a Unicode string..""" - s=[] - for c in self: - text = None - if isinstance(c, NavigableString): - text = c.__str__(encoding) - elif isinstance(c, Tag): - s.append(c.__str__(encoding, prettyPrint, indentLevel)) - if text and prettyPrint: - text = text.strip() - if text: - if prettyPrint: - s.append(" " * (indentLevel-1)) - s.append(text) - if prettyPrint: - s.append("\n") - return ''.join(s) - - #Soup methods - - def find(self, name=None, attrs={}, recursive=True, text=None, - **kwargs): - """Return only the first child of this Tag matching the given - criteria.""" - r = None - l = self.findAll(name, attrs, recursive, text, 1, **kwargs) - if l: - r = l[0] - return r - findChild = find - - def findAll(self, name=None, attrs={}, recursive=True, text=None, - limit=None, **kwargs): - """Extracts a list of Tag objects that match the given - criteria. You can specify the name of the Tag and any - attributes you want the Tag to have. - - The value of a key-value pair in the 'attrs' map can be a - string, a list of strings, a regular expression object, or a - callable that takes a string and returns whether or not the - string matches for some custom definition of 'matches'. 
The - same is true of the tag name.""" - generator = self.recursiveChildGenerator - if not recursive: - generator = self.childGenerator - return self._findAll(name, attrs, text, limit, generator, **kwargs) - findChildren = findAll - - # Pre-3.x compatibility methods - first = find - fetch = findAll - - def fetchText(self, text=None, recursive=True, limit=None): - return self.findAll(text=text, recursive=recursive, limit=limit) - - def firstText(self, text=None, recursive=True): - return self.find(text=text, recursive=recursive) - - #Private methods - - def _getAttrMap(self): - """Initializes a map representation of this tag's attributes, - if not already initialized.""" - if not getattr(self, 'attrMap'): - self.attrMap = {} - for (key, value) in self.attrs: - self.attrMap[key] = value - return self.attrMap - - #Generator methods - def childGenerator(self): - # Just use the iterator from the contents - return iter(self.contents) - - def recursiveChildGenerator(self): - if not len(self.contents): - raise StopIteration - stopNode = self._lastRecursiveChild().next - current = self.contents[0] - while current is not stopNode: - yield current - current = current.next - - -# Next, a couple classes to represent queries and their results. 
class SoupStrainer:
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""

    def __init__(self, name=None, attrs={}, text=None, **kwargs):
        self.name = name
        if isinstance(attrs, basestring):
            kwargs['class'] = _match_css_class(attrs)
            attrs = None
        if kwargs:
            if attrs:
                attrs = attrs.copy()
                attrs.update(kwargs)
            else:
                attrs = kwargs
        self.attrs = attrs
        self.text = text

    def __str__(self):
        if self.text:
            return self.text
        else:
            return "%s|%s" % (self.name, self.attrs)

    def searchTag(self, markupName=None, markupAttrs={}):
        found = None
        markup = None
        if isinstance(markupName, Tag):
            markup = markupName
            markupAttrs = markup
        callFunctionWithTagData = callable(self.name) \
                                and not isinstance(markupName, Tag)

        if (not self.name) \
               or callFunctionWithTagData \
               or (markup and self._matches(markup, self.name)) \
               or (not markup and self._matches(markupName, self.name)):
            if callFunctionWithTagData:
                match = self.name(markupName, markupAttrs)
            else:
                match = True
                markupAttrMap = None
                for attr, matchAgainst in self.attrs.items():
                    if not markupAttrMap:
                        if hasattr(markupAttrs, 'get'):
                            markupAttrMap = markupAttrs
                        else:
                            markupAttrMap = {}
                            for k,v in markupAttrs:
                                markupAttrMap[k] = v
                    attrValue = markupAttrMap.get(attr)
                    if not self._matches(attrValue, matchAgainst):
                        match = False
                        break
            if match:
                if markup:
                    found = markup
                else:
                    found = markupName
        return found

    def search(self, markup):
        #print 'looking for %s in %s' % (self, markup)
        found = None
        # If given a list of items, scan it for a text element that
        # matches.
        if hasattr(markup, "__iter__") \
                and not isinstance(markup, Tag):
            for element in markup:
                if isinstance(element, NavigableString) \
                       and self.search(element):
                    found = element
                    break
        # If it's a Tag, make sure its name or attributes match.
        # Don't bother with Tags if we're searching for text.
        elif isinstance(markup, Tag):
            if not self.text:
                found = self.searchTag(markup)
        # If it's text, make sure the text matches.
        elif isinstance(markup, NavigableString) or \
                 isinstance(markup, basestring):
            if self._matches(markup, self.text):
                found = markup
        else:
            raise Exception, "I don't know how to match against a %s" \
                  % markup.__class__
        return found

    def _matches(self, markup, matchAgainst):
        #print "Matching %s against %s" % (markup, matchAgainst)
        result = False
        if matchAgainst is True:
            result = markup is not None
        elif callable(matchAgainst):
            result = matchAgainst(markup)
        else:
            #Custom match methods take the tag as an argument, but all
            #other ways of matching match the tag name as a string.
            if isinstance(markup, Tag):
                markup = markup.name
            if markup and not isinstance(markup, basestring):
                markup = unicode(markup)
            #Now we know that chunk is either a string, or None.
            if hasattr(matchAgainst, 'match'):
                # It's a regexp object.
                result = markup and matchAgainst.search(markup)
            elif hasattr(matchAgainst, '__iter__'): # list-like
                result = markup in matchAgainst
            elif hasattr(matchAgainst, 'items'):
                result = markup.has_key(matchAgainst)
            elif matchAgainst and isinstance(markup, basestring):
                if isinstance(markup, unicode):
                    matchAgainst = unicode(matchAgainst)
                else:
                    matchAgainst = str(matchAgainst)

            if not result:
                result = matchAgainst == markup
        return result

class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source):
        list.__init__([])
        self.source = source

# Now, some helper functions.

def buildTagMap(default, *args):
    """Turns a list of maps, lists, or scalars into a single map.
    Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
    NESTING_RESET_TAGS maps out of lists and partial maps."""
    built = {}
    for portion in args:
        if hasattr(portion, 'items'):
            #It's a map. Merge it.
            for k,v in portion.items():
                built[k] = v
        elif hasattr(portion, '__iter__'): # is a list
            #It's a list. Map each item to the default.
            for k in portion:
                built[k] = default
        else:
            #It's a scalar. Map it to the default.
            built[portion] = default
    return built

# Now, the parser classes.

class BeautifulStoneSoup(Tag, SGMLParser):

    """This class contains the basic parser and search code. It defines
    a parser that knows nothing about tag behavior except for the
    following:

      You can't close a tag without closing all the tags it encloses.
      That is, "<foo><bar></foo>" actually means
      "<foo><bar></bar></foo>".

    [Another possible explanation is "<foo><bar /></foo>", but since
    this class defines no SELF_CLOSING_TAGS, it will never use that
    explanation.]

    This class is useful for parsing XML or made-up markup languages,
    or when BeautifulSoup makes an assumption counter to what you were
    expecting."""

    SELF_CLOSING_TAGS = {}
    NESTABLE_TAGS = {}
    RESET_NESTING_TAGS = {}
    QUOTE_TAGS = {}
    PRESERVE_WHITESPACE_TAGS = []

    MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'),
                       lambda x: x.group(1) + ' />'),
                      (re.compile('<!\s+([^<>]*)>'),
                       lambda x: '<!' + x.group(1) + '>')
                      ]

    ROOT_TAG_NAME = u'[document]'

    HTML_ENTITIES = "html"
    XML_ENTITIES = "xml"
    XHTML_ENTITIES = "xhtml"
    # TODO: This only exists for backwards-compatibility
    ALL_ENTITIES = XHTML_ENTITIES

    # Used when determining whether a text node is all whitespace and
    # can be replaced with a single space. A text node that contains
    # fancy Unicode spaces (usually non-breaking) should be left
    # alone.
    STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }

    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
                 convertEntities=None, selfClosingTags=None, isHTML=False):
        """The Soup object is initialized as the 'root tag', and the
        provided markup (which can be a string or a file-like object)
        is fed into the underlying parser.

        sgmllib will process most bad HTML, and the BeautifulSoup
        class has some tricks for dealing with some HTML that kills
        sgmllib, but Beautiful Soup can nonetheless choke or lose data
        if your data uses self-closing tags or declarations
        incorrectly.

        By default, Beautiful Soup uses regexes to sanitize input,
        avoiding the vast majority of these problems. If the problems
        don't apply to you, pass in False for markupMassage, and
        you'll get better performance.

        The default parser massage techniques fix the two most common
        instances of invalid HTML that choke sgmllib:

         <br/> (No space between name of closing tag and tag close)
         <! --Comment--> (Extraneous whitespace in declaration)

        You can pass in a custom list of (RE object, replace method)
        tuples to get Beautiful Soup to scrub your input the way you
        want."""

        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding
        self.smartQuotesTo = smartQuotesTo
        self.convertEntities = convertEntities
        # Set the rules for how we'll deal with the entities we
        # encounter
        if self.convertEntities:
            # It doesn't make sense to convert encoded characters to
            # entities even while you're converting entities to Unicode.
            # Just convert it all to Unicode.
            self.smartQuotesTo = None
            if convertEntities == self.HTML_ENTITIES:
                self.convertXMLEntities = False
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = True
            elif convertEntities == self.XHTML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = False
            elif convertEntities == self.XML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = False
                self.escapeUnrecognizedEntities = False
        else:
            self.convertXMLEntities = False
            self.convertHTMLEntities = False
            self.escapeUnrecognizedEntities = False

        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        SGMLParser.__init__(self)

        if hasattr(markup, 'read'):        # It's a file-type object.
            markup = markup.read()
        self.markup = markup
        self.markupMassage = markupMassage
        try:
            self._feed(isHTML=isHTML)
        except StopParsing:
            pass
        self.markup = None                 # The markup can now be GCed

    def convert_charref(self, name):
        """This method fixes a bug in Python's SGMLParser."""
        try:
            n = int(name)
        except ValueError:
            return
        if not 0 <= n <= 127 : # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n)

    def _feed(self, inDocumentEncoding=None, isHTML=False):
        # Convert the document to Unicode.
        markup = self.markup
        if isinstance(markup, unicode):
            if not hasattr(self, 'originalEncoding'):
                self.originalEncoding = None
        else:
            dammit = UnicodeDammit\
                     (markup, [self.fromEncoding, inDocumentEncoding],
                      smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
            markup = dammit.unicode
            self.originalEncoding = dammit.originalEncoding
            self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
        if markup:
            if self.markupMassage:
                if not hasattr(self.markupMassage, "__iter__"):
                    self.markupMassage = self.MARKUP_MASSAGE
                for fix, m in self.markupMassage:
                    markup = fix.sub(m, markup)
                # TODO: We get rid of markupMassage so that the
                # soup object can be deepcopied later on. Some
                # Python installations can't copy regexes. If anyone
                # was relying on the existence of markupMassage, this
                # might cause problems.
                del(self.markupMassage)
        self.reset()

        SGMLParser.feed(self, markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
        while self.currentTag.name != self.ROOT_TAG_NAME:
            self.popTag()

    def __getattr__(self, methodName):
        """This method routes method call requests to either the SGMLParser
        superclass or the Tag superclass, depending on the method name."""
        #print "__getattr__ called on %s.%s" % (self.__class__, methodName)

        if methodName.startswith('start_') or methodName.startswith('end_') \
               or methodName.startswith('do_'):
            return SGMLParser.__getattr__(self, methodName)
        elif not methodName.startswith('__'):
            return Tag.__getattr__(self, methodName)
        else:
            raise AttributeError

    def isSelfClosingTag(self, name):
        """Returns true iff the given string is the name of a
        self-closing tag according to this parser."""
        return self.SELF_CLOSING_TAGS.has_key(name) \
               or self.instanceSelfClosingTags.has_key(name)

    def reset(self):
        Tag.__init__(self, self, self.ROOT_TAG_NAME)
        self.hidden = 1
        SGMLParser.reset(self)
        self.currentData = []
        self.currentTag = None
        self.tagStack = []
        self.quoteStack = []
        self.pushTag(self)

    def popTag(self):
        tag = self.tagStack.pop()

        #print "Pop", tag.name
        if self.tagStack:
            self.currentTag = self.tagStack[-1]
        return self.currentTag

    def pushTag(self, tag):
        #print "Push", tag.name
        if self.currentTag:
            self.currentTag.contents.append(tag)
        self.tagStack.append(tag)
        self.currentTag = self.tagStack[-1]

    def endData(self, containerClass=NavigableString):
        if self.currentData:
            currentData = u''.join(self.currentData)
            if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
                not set([tag.name for tag in self.tagStack]).intersection(
                    self.PRESERVE_WHITESPACE_TAGS)):
                if '\n' in currentData:
                    currentData = '\n'
                else:
                    currentData = ' '
            self.currentData = []
            if self.parseOnlyThese and len(self.tagStack) <= 1 and \
                   (not self.parseOnlyThese.text or \
                    not self.parseOnlyThese.search(currentData)):
                return
            o = containerClass(currentData)
            o.setup(self.currentTag, self.previous)
            if self.previous:
                self.previous.next = o
            self.previous = o
            self.currentTag.contents.append(o)


    def _popToTag(self, name, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag.
        If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
        #print "Popping to %s" % name
        if name == self.ROOT_TAG_NAME:
            return

        numPops = 0
        mostRecentTag = None
        for i in range(len(self.tagStack)-1, 0, -1):
            if name == self.tagStack[i].name:
                numPops = len(self.tagStack)-i
                break
        if not inclusivePop:
            numPops = numPops - 1

        for i in range(0, numPops):
            mostRecentTag = self.popTag()
        return mostRecentTag

    def _smartPop(self, name):

        """We need to pop up to the previous tag of this type, unless
        one of this tag's nesting reset triggers comes between this
        tag and the previous tag of this type, OR unless this tag is a
        generic nesting trigger and another generic nesting trigger
        comes between this tag and the previous tag of this type.

        Examples:
         <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
         <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
         <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

         <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
         <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
         <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
        """

        nestingResetTriggers = self.NESTABLE_TAGS.get(name)
        isNestable = nestingResetTriggers != None
        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
        popTo = None
        inclusive = True
        for i in range(len(self.tagStack)-1, 0, -1):
            p = self.tagStack[i]
            if (not p or p.name == name) and not isNestable:
                #Non-nestable tags get popped to the top or to their
                #last occurrence.
                popTo = name
                break
            if (nestingResetTriggers is not None
                and p.name in nestingResetTriggers) \
                or (nestingResetTriggers is None and isResetNesting
                    and self.RESET_NESTING_TAGS.has_key(p.name)):

                #If we encounter one of the nesting reset triggers
                #peculiar to this tag, or we encounter another tag
                #that causes nesting to reset, pop up to but not
                #including that tag.
                popTo = p.name
                inclusive = False
                break
            p = p.parent
        if popTo:
            self._popToTag(popTo, inclusive)

    def unknown_starttag(self, name, attrs, selfClosing=0):
        #print "Start tag %s: %s" % (name, attrs)
        if self.quoteStack:
            #This is not a real tag.
            #print "<%s> is not real!" % name
            attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
            self.handle_data('<%s%s>' % (name, attrs))
            return
        self.endData()

        if not self.isSelfClosingTag(name) and not selfClosing:
            self._smartPop(name)

        if self.parseOnlyThese and len(self.tagStack) <= 1 \
               and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
            return

        tag = Tag(self, name, attrs, self.currentTag, self.previous)
        if self.previous:
            self.previous.next = tag
        self.previous = tag
        self.pushTag(tag)
        if selfClosing or self.isSelfClosingTag(name):
            self.popTag()
        if name in self.QUOTE_TAGS:
            #print "Beginning quote (%s)" % name
            self.quoteStack.append(name)
            self.literal = 1
        return tag

    def unknown_endtag(self, name):
        #print "End tag %s" % name
        if self.quoteStack and self.quoteStack[-1] != name:
            #This is not a real end tag.
            #print "</%s> is not real!" % name
            self.handle_data('</%s>' % name)
            return
        self.endData()
        self._popToTag(name)
        if self.quoteStack and self.quoteStack[-1] == name:
            self.quoteStack.pop()
            self.literal = (len(self.quoteStack) > 0)

    def handle_data(self, data):
        self.currentData.append(data)

    def _toStringSubclass(self, text, subclass):
        """Adds a certain piece of text to the tree as a NavigableString
        subclass."""
        self.endData()
        self.handle_data(text)
        self.endData(subclass)

    def handle_pi(self, text):
        """Handle a processing instruction as a ProcessingInstruction
        object, possibly one with a %SOUP-ENCODING% slot into which an
        encoding will be plugged later."""
        if text[:3] == "xml":
            text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
        self._toStringSubclass(text, ProcessingInstruction)

    def handle_comment(self, text):
        "Handle comments as Comment objects."
        self._toStringSubclass(text, Comment)

    def handle_charref(self, ref):
        "Handle character references as data."
        if self.convertEntities:
            data = unichr(int(ref))
        else:
            data = '&#%s;' % ref
        self.handle_data(data)

    def handle_entityref(self, ref):
        """Handle entity references as data, possibly converting known
        HTML and/or XML entity references to the corresponding Unicode
        characters."""
        data = None
        if self.convertHTMLEntities:
            try:
                data = unichr(name2codepoint[ref])
            except KeyError:
                pass

        if not data and self.convertXMLEntities:
            data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)

        if not data and self.convertHTMLEntities and \
            not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
            # TODO: We've got a problem here. We're told this is
            # an entity reference, but it's not an XML entity
            # reference or an HTML entity reference. Nonetheless,
            # the logical thing to do is to pass it through as an
            # unrecognized entity reference.
            #
            # Except: when the input is "&carol;" this function
            # will be called with input "carol".
            # When the input is
            # "AT&T", this function will be called with input
            # "T". We have no way of knowing whether a semicolon
            # was present originally, so we don't know whether
            # this is an unknown entity or just a misplaced
            # ampersand.
            #
            # The more common case is a misplaced ampersand, so I
            # escape the ampersand and omit the trailing semicolon.
            data = "&%s" % ref
        if not data:
            # This case is different from the one above, because we
            # haven't already gone through a supposedly comprehensive
            # mapping of entities to Unicode characters. We might not
            # have gone through any mapping at all. So the chances are
            # very high that this is a real entity, and not a
            # misplaced ampersand.
            data = "&%s;" % ref
        self.handle_data(data)

    def handle_decl(self, data):
        "Handle DOCTYPEs and the like as Declaration objects."
        self._toStringSubclass(data, Declaration)

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k+3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j

class BeautifulSoup(BeautifulStoneSoup):

    """This parser knows the following facts about HTML:

    * Some tags have no closing tag and should be interpreted as being
      closed as soon as they are encountered.

    * The text inside some tags (ie. 'script') may contain tags which
      are not really part of the document and which should be parsed
      as text, not tags. If you want to parse the text as tags, you can
      always fetch it and parse it explicitly.

    * Tag nesting rules:

      Most tags can't be nested at all. For instance, the occurrence of
      a <p> tag should implicitly close the previous <p> tag.

       <p>Para1<p>Para2
        should be transformed into:
       <p>Para1</p><p>Para2

      Some tags can be nested arbitrarily. For instance, the occurrence
      of a <blockquote> tag should _not_ implicitly close the previous
      <blockquote> tag.

       Alice said: <blockquote>Bob said: <blockquote>Blah
        should NOT be transformed into:
       Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

      Some tags can be nested, but the nesting is reset by the
      interposition of other tags. For instance, a <tr> tag should
      implicitly close the previous <tr> tag within the same <table>,
      but not close a <tr> tag in another table.

       <table><tr>Blah<tr>Blah
        should be transformed into:
       <table><tr>Blah</tr><tr>Blah
       but,
        <tr>Blah<table><tr>Blah
        should NOT be transformed into
       <tr>Blah<table></tr><tr>Blah

    Differing assumptions about tag nesting rules are a major source
    of problems with the BeautifulSoup class. If BeautifulSoup is not
    treating as nestable a tag your page author treats as nestable,
    try ICantBelieveItsBeautifulSoup, MinimalSoup, or
    BeautifulStoneSoup before writing your own subclass."""

    def __init__(self, *args, **kwargs):
        if not kwargs.has_key('smartQuotesTo'):
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)

    SELF_CLOSING_TAGS = buildTagMap(None,
                                    ('br' , 'hr', 'input', 'img', 'meta',
                                    'spacer', 'link', 'frame', 'base', 'col'))

    PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])

    QUOTE_TAGS = {'script' : None, 'textarea' : None}

    #According to the HTML standard, each of these inline tags can
    #contain another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                            'center')

    #According to the HTML standard, these block tags can contain
    #another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')

    #Lists can contain other lists, but there are restrictions.
    NESTABLE_LIST_TAGS = { 'ol' : [],
                           'ul' : [],
                           'li' : ['ul', 'ol'],
                           'dl' : [],
                           'dd' : ['dl'],
                           'dt' : ['dl'] }

    #Tables can contain other tables, but there are restrictions.
    NESTABLE_TABLE_TAGS = {'table' : [],
                           'tr' : ['table', 'tbody', 'tfoot', 'thead'],
                           'td' : ['tr'],
                           'th' : ['tr'],
                           'thead' : ['table'],
                           'tbody' : ['table'],
                           'tfoot' : ['table'],
                           }

    NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')

    #If one of these tags is encountered, all tags up to the next tag of
    #this type are popped.
    RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
                                     NON_NESTABLE_BLOCK_TAGS,
                                     NESTABLE_LIST_TAGS,
                                     NESTABLE_TABLE_TAGS)

    NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                                NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

    # Used to detect the charset in a META tag; see start_meta
    CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)

    def start_meta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
            match = self.CHARSET_RE.search(contentType)
            if match:
                if (self.declaredHTMLEncoding is not None or
                    self.originalEncoding == self.fromEncoding):
                    # An HTML encoding was sniffed while converting
                    # the document to Unicode, or an HTML encoding was
                    # sniffed during a previous pass through the
                    # document, or an encoding was specified
                    # explicitly and it worked. Rewrite the meta tag.
                    def rewrite(match):
                        return match.group(1) + "%SOUP-ENCODING%"
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                               newAttr)
                    tagNeedsEncodingSubstitution = True
                else:
                    # This is our first pass through the document.
                    # Go through it again with the encoding information.
                    newCharset = match.group(3)
                    if newCharset and newCharset != self.originalEncoding:
                        self.declaredHTMLEncoding = newCharset
                        self._feed(self.declaredHTMLEncoding)
                        raise StopParsing
                pass
        tag = self.unknown_starttag("meta", attrs)
        if tag and tagNeedsEncodingSubstitution:
            tag.containsSubstitutions = True

class StopParsing(Exception):
    pass

class ICantBelieveItsBeautifulSoup(BeautifulSoup):

    """The BeautifulSoup class is oriented towards skipping over
    common HTML errors like unclosed tags. However, sometimes it makes
    errors of its own. For instance, consider this fragment:

     <b>Foo<b>Bar</b></b>

    This is perfectly valid (if bizarre) HTML. However, the
    BeautifulSoup class will implicitly close the first b tag when it
    encounters the second 'b'. It will think the author wrote
    "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
    there's no real-world reason to bold something that's already
    bold. When it encounters '</b></b>' it will close two more 'b'
    tags, for a grand total of three tags closed instead of two. This
    can throw off the rest of your document structure. The same is
    true of a number of other tags, listed below.
    It's much more common for someone to forget to close a 'b' tag
    than to actually use nested 'b' tags, and the BeautifulSoup class
    handles the common case. This class handles the not-so-common
    case: where you can't believe someone wrote what they did, but
    it's valid HTML and BeautifulSoup screwed up by assuming it
    wouldn't be."""

    I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
     ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
      'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b',
      'big')

    I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',)

    NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)

class MinimalSoup(BeautifulSoup):
    """The MinimalSoup class is for parsing HTML that contains
    pathologically bad markup. It makes no assumptions about tag
    nesting, but it does know which tags are self-closing, that
    <script> tags contain Javascript and should not be parsed, that
    META tags may contain encoding information, and so on.

    This also makes it better for subclassing than BeautifulStoneSoup
    or BeautifulSoup."""

    RESET_NESTING_TAGS = buildTagMap('noscript')
    NESTABLE_TAGS = {}

class BeautifulSOAP(BeautifulStoneSoup):
    """This class will push a tag with only a single string child into
    the tag's parent as an attribute. The attribute's name is the tag
    name, and the value is the string child. An example should give
    the flavor of the change:

     <foo><bar>baz</bar></foo>
      =>
     <foo bar="baz"><bar>baz</bar></foo>

    You can then access fooTag['bar'] instead of fooTag.barTag.string.

    This is, of course, useful for scraping structures that tend to
    use subelements instead of attributes, such as SOAP messages. Note
    that it modifies its input, so don't print the modified version
    out.

    I'm not sure how many people really want to use this class; let me
    know if you do. Mainly I like the name."""

    def popTag(self):
        if len(self.tagStack) > 1:
            tag = self.tagStack[-1]
            parent = self.tagStack[-2]
            parent._getAttrMap()
            if (isinstance(tag, Tag) and len(tag.contents) == 1 and
                isinstance(tag.contents[0], NavigableString) and
                not parent.attrMap.has_key(tag.name)):
                parent[tag.name] = tag.contents[0]
        BeautifulStoneSoup.popTag(self)

#Enterprise class names! It has come to our attention that some people
#think the names of the Beautiful Soup parser classes are too silly
#and "unprofessional" for use in enterprise screen-scraping. We feel
#your pain! For such-minded folk, the Beautiful Soup Consortium And
#All-Night Kosher Bakery recommends renaming this file to
#"RobustParser.py" (or, in cases of extreme enterprisiness,
#"RobustParserBeanInterface.class") and using the following
#enterprise-friendly class aliases:
class RobustXMLParser(BeautifulStoneSoup):
    pass
class RobustHTMLParser(BeautifulSoup):
    pass
class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
    pass
class RobustInsanelyWackAssHTMLParser(MinimalSoup):
    pass
class SimplifyingSOAPParser(BeautifulSOAP):
    pass

######################################################
#
# Bonus library: Unicode, Dammit
#
# This class forces XML data into a standard format (usually to UTF-8
# or Unicode). It is heavily based on code from Mark Pilgrim's
# Universal Feed Parser. It does not rewrite the XML or HTML to
# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
# (XML) and BeautifulSoup.start_meta (HTML).

# Autodetects character encodings.
# Download from http://chardet.feedparser.org/
try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except ImportError:
    chardet = None

# cjkcodecs and iconv_codec make Python know about more character encodings.
# Both are available from http://cjkpython.i18n.org/
# They're built in if you use Python 2.4.
try:
    import cjkcodecs.aliases
except ImportError:
    pass
try:
    import iconv_codec
except ImportError:
    pass

class UnicodeDammit:
    """A class for detecting the encoding of a *ML document and
    converting it to a Unicode string. If the source encoding is
    windows-1252, can replace MS smart quotes with their HTML or XML
    equivalents."""

    # This dictionary maps commonly seen values for "charset" in HTML
    # meta tags to the corresponding Python codec names. It only covers
    # values that aren't in Python's aliases and can't be determined
    # by the heuristics in find_codec.
    CHARSET_ALIASES = { "macintosh" : "mac-roman",
                        "x-sjis" : "shift-jis" }

    def __init__(self, markup, overrideEncodings=[],
                 smartQuotesTo='xml', isHTML=False):
        self.declaredHTMLEncoding = None
        self.markup, documentEncoding, sniffedEncoding = \
                     self._detectEncoding(markup, isHTML)
        self.smartQuotesTo = smartQuotesTo
        self.triedEncodings = []
        if markup == '' or isinstance(markup, unicode):
            self.originalEncoding = None
            self.unicode = unicode(markup)
            return

        u = None
        for proposedEncoding in overrideEncodings:
            u = self._convertFrom(proposedEncoding)
            if u: break
        if not u:
            for proposedEncoding in (documentEncoding, sniffedEncoding):
                u = self._convertFrom(proposedEncoding)
                if u: break

        # If no luck and we have auto-detection library, try that:
        if not u and chardet and not isinstance(self.markup, unicode):
            u = self._convertFrom(chardet.detect(self.markup)['encoding'])

        # As a last resort, try utf-8 and windows-1252:
        if not u:
            for proposed_encoding in ("utf-8", "windows-1252"):
                u = self._convertFrom(proposed_encoding)
                if u: break

        self.unicode = u
        if not u: self.originalEncoding = None

    def _subMSChar(self, orig):
        """Changes a MS smart quote character to an XML or HTML
        entity."""
        sub = self.MS_CHARS.get(orig)
        if isinstance(sub, tuple):
            if self.smartQuotesTo == 'xml':
                sub = '&#x%s;' % sub[1]
            else:
                sub = '&%s;' % sub[0]
        return sub

    def _convertFrom(self, proposed):
        proposed = self.find_codec(proposed)
        if not proposed or proposed in self.triedEncodings:
            return None
        self.triedEncodings.append(proposed)
        markup = self.markup

        # Convert smart quotes to HTML if coming from an encoding
        # that might have them.
        if self.smartQuotesTo and proposed.lower() in("windows-1252",
                                                      "iso-8859-1",
                                                      "iso-8859-2"):
            markup = re.compile("([\x80-\x9f])").sub \
                     (lambda(x): self._subMSChar(x.group(1)),
                      markup)

        try:
            # print "Trying to convert document to %s" % proposed
            u = self._toUnicode(markup, proposed)
            self.markup = u
            self.originalEncoding = proposed
        except Exception, e:
            # print "That didn't work!"
            # print e
            return None
        #print "Correct encoding: %s" % proposed
        return self.markup

    def _toUnicode(self, data, encoding):
        '''Given a string and its encoding, decodes the string into Unicode.
        %encoding is a string recognized by encodings.aliases'''

        # strip Byte Order Mark (if present)
        if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
               and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16be'
            data = data[2:]
        elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
                 and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16le'
            data = data[2:]
        elif data[:3] == '\xef\xbb\xbf':
            encoding = 'utf-8'
            data = data[3:]
        elif data[:4] == '\x00\x00\xfe\xff':
            encoding = 'utf-32be'
            data = data[4:]
        elif data[:4] == '\xff\xfe\x00\x00':
            encoding = 'utf-32le'
            data = data[4:]
        newdata = unicode(data, encoding)
        return newdata

    def _detectEncoding(self, xml_data, isHTML=False):
        """Given a document, tries to detect its XML encoding."""
        xml_encoding = sniffed_xml_encoding = None
        try:
            if xml_data[:4] == '\x4c\x6f\xa7\x94':
                # EBCDIC
                xml_data = self._ebcdic_to_ascii(xml_data)
            elif xml_data[:4] == '\x00\x3c\x00\x3f':
                # UTF-16BE
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
                     and (xml_data[2:4] != '\x00\x00'):
                # UTF-16BE with BOM
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x3f\x00':
                # UTF-16LE
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
                     (xml_data[2:4] != '\x00\x00'):
                # UTF-16LE with BOM
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\x00\x3c':
                # UTF-32BE
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x00\x00':
                # UTF-32LE
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\xfe\xff':
                # UTF-32BE with BOM
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\xff\xfe\x00\x00':
                # UTF-32LE with BOM
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
            elif xml_data[:3] == '\xef\xbb\xbf':
                # UTF-8 with BOM
                sniffed_xml_encoding = 'utf-8'
                xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
            else:
                sniffed_xml_encoding = 'ascii'
                pass
        except:
            xml_encoding_match = None
        xml_encoding_match = re.compile(
            '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
        if not xml_encoding_match and isHTML:
            regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
            xml_encoding_match = regexp.search(xml_data)
        if xml_encoding_match is not None:
            xml_encoding = xml_encoding_match.groups()[0].lower()
            if isHTML:
                self.declaredHTMLEncoding = xml_encoding
        if sniffed_xml_encoding and \
           (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
                             'iso-10646-ucs-4', 'ucs-4', 'csucs4',
                             'utf-16', 'utf-32', 'utf_16', 'utf_32',
                             'utf16', 'u16')):
            xml_encoding = sniffed_xml_encoding
        return xml_data, xml_encoding, sniffed_xml_encoding


    def find_codec(self, charset):
        return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
               or (charset and self._codec(charset.replace("-", ""))) \
               or (charset and self._codec(charset.replace("-", "_"))) \
               or charset

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec

    EBCDIC_TO_ASCII_MAP = None
    def _ebcdic_to_ascii(self, s):
        c = self.__class__
        if not c.EBCDIC_TO_ASCII_MAP:
            emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
                    16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
                    128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
                    144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
                    32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
                    38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
                    45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
                    186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
                    195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
                    201,202,106,107,108,109,110,111,112,113,114,203,204,205,
                    206,207,208,209,126,115,116,117,118,119,120,121,122,210,
                    211,212,213,214,215,216,217,218,219,220,221,222,223,224,
                    225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
                    73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
                    82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
                    90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
                    250,251,252,253,254,255)
            import string
            c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
            ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
        return s.translate(c.EBCDIC_TO_ASCII_MAP)

    MS_CHARS = { '\x80' : ('euro', '20AC'),
                 '\x81' : ' ',
                 '\x82' : ('sbquo', '201A'),
                 '\x83' : ('fnof', '192'),
                 '\x84' : ('bdquo', '201E'),
                 '\x85' : ('hellip', '2026'),
'\x86' : ('dagger', '2020'), - '\x87' : ('Dagger', '2021'), - '\x88' : ('circ', '2C6'), - '\x89' : ('permil', '2030'), - '\x8A' : ('Scaron', '160'), - '\x8B' : ('lsaquo', '2039'), - '\x8C' : ('OElig', '152'), - '\x8D' : '?', - '\x8E' : ('#x17D', '17D'), - '\x8F' : '?', - '\x90' : '?', - '\x91' : ('lsquo', '2018'), - '\x92' : ('rsquo', '2019'), - '\x93' : ('ldquo', '201C'), - '\x94' : ('rdquo', '201D'), - '\x95' : ('bull', '2022'), - '\x96' : ('ndash', '2013'), - '\x97' : ('mdash', '2014'), - '\x98' : ('tilde', '2DC'), - '\x99' : ('trade', '2122'), - '\x9a' : ('scaron', '161'), - '\x9b' : ('rsaquo', '203A'), - '\x9c' : ('oelig', '153'), - '\x9d' : '?', - '\x9e' : ('#x17E', '17E'), - '\x9f' : ('Yuml', ''),} - -####################################################################### - - -#By default, act as an HTML pretty-printer. -if __name__ == '__main__': - import sys - soup = BeautifulSoup(sys.stdin) - print soup.prettify()
--- a/README	Fri Jul 26 12:51:13 2013 -0400
+++ b/README	Fri Sep 20 13:25:27 2013 -0400
@@ -1,18 +1,3 @@
-Source code for the executables needed by these tools can be found in
-the genome_diversity directory.
-
-Additionally, you'll need the following python modules:
-  matplotlib (we used version 1.1.0)  http://pypi.python.org/packages/source/m/matplotlib/
-  mechanize (we used version 0.2.5)  http://pypi.python.org/packages/source/m/mechanize/
-  networkx (we used version 1.6)  http://pypi.python.org/packages/source/n/networkx/
-  fisher (we used version 0.1.4)  http://pypi.python.org/packages/source/f/fisher/
-
-And the following software:
+The Genome Diversity tools require the following software:
   ADMIXTURE (we used version 1.22)  http://www.genetics.ucla.edu/software/admixture/
-  EIGENSOFT (we used version 3.0)  http://genepath.med.harvard.edu/~reich/Software.htm
-  PHAST (we used version 1.2.1)  http://compgen.bscb.cornell.edu/phast/
-  QuickTree (we used version 1.1)  http://www.sanger.ac.uk/resources/software/quicktree/
-
-Images used in the tools' documentation are located in the static/images
-directory.  Please copy these to the static/images directory in your
-Galaxy installation.
+  KING (we used version 1.5)  http://people.virginia.edu/~wc9c/KING/
--- a/add_fst_column.xml	Fri Jul 26 12:51:13 2013 -0400
+++ b/add_fst_column.xml	Fri Sep 20 13:25:27 2013 -0400
@@ -77,6 +77,10 @@
     <data name="output" format="input" format_source="input" metadata_source="input" />
   </outputs>
 
+  <requirements>
+    <requirement type="package" version="0.1">gd_c_tools</requirement>
+  </requirements>
+
   <tests>
     <test>
       <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" />
--- a/aggregate_gd_indivs.xml	Fri Jul 26 12:51:13 2013 -0400
+++ b/aggregate_gd_indivs.xml	Fri Sep 20 13:25:27 2013 -0400
@@ -43,6 +43,10 @@
     <data name="output" format="input" format_source="input" metadata_source="input" />
   </outputs>
 
+  <requirements>
+    <requirement type="package" version="0.1">gd_c_tools</requirement>
+  </requirements>
+
   <tests>
     <test>
       <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" />
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/assignment_of_optimal_breeding_pairs.py Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,185 @@ +#!/usr/bin/env python2.6 + +import sys +import munkres +import random + +class Vertex(object): + def __init__(self, name): + self.name = name + self.neighbors = {} + self.color = 0 + self.explored = False + + def add_neighbor(self, neighbor, weight=0.0): + if neighbor in self.neighbors: + if self.neighbors[neighbor] != weight: + die('multiple edges not supported') + else: + self.neighbors[neighbor] = weight + +class Graph(object): + def __init__(self): + self.vertex_list = {} + self.vertices = 0 + self.max_weight = 0.0 + + def add_vertex(self, name): + if name not in self.vertex_list: + self.vertex_list[name] = Vertex(name) + self.vertices += 1 + return self.vertex_list[name] + + def add_edge(self, name1, name2, weight): + vertex1 = self.add_vertex(name1) + vertex2 = self.add_vertex(name2) + vertex1.add_neighbor(vertex2, weight) + vertex2.add_neighbor(vertex1, weight) + self.max_weight = max(self.max_weight, weight) + + def from_edge_file(self, filename): + fh = try_open(filename) + line_number = 0 + for line in fh: + line_number += 1 + line = line.rstrip('\r\n') + elems = line.split() + if len(elems) < 3: + die('too few columns on line {0} of {1}:\n{2}'.format(line_number, filename, line)) + name1 = elems[0] + name2 = elems[1] + weight = float_value(elems[2]) + if weight is None: + die('invalid weight on line {0} of {1}:\n{2}'.format(line_number, filename, line)) + self.add_edge(name1, name2, weight) + fh.close() + + def bipartite_partition(self): + vertices_left = self.vertex_list.values() + + while vertices_left: + fifo = [vertices_left[0]] + while fifo: + vertex = fifo.pop(0) + if not vertex.explored: + vertex.explored = True + vertices_left.remove(vertex) + + if vertex.color == 0: + vertex.color = 1 + neighbor_color = 2 + elif vertex.color == 1: + neighbor_color = 2 + elif vertex.color == 2: + neighbor_color = 1 + + for 
neighbor in vertex.neighbors: + if neighbor.color == 0: + neighbor.color = neighbor_color + elif neighbor.color != neighbor_color: + return None, None + fifo.append(neighbor) + + c1 = [] + c2 = [] + + for vertex in self.vertex_list.values(): + if vertex.color == 1: + c1.append(vertex) + elif vertex.color == 2: + c2.append(vertex) + + return c1, c2 + +def try_open(*args): + try: + return open(*args) + except IOError: + die('Failed opening file: {0}'.format(args[0])) + +def float_value(token): + try: + return float(token) + except ValueError: + return None + +def die(message): + print >> sys.stderr, message + sys.exit(1) + +def main(input, randomizations, output): + graph = Graph() + graph.from_edge_file(input) + c1, c2 = graph.bipartite_partition() + + if c1 is None: + die('Graph is not bipartite') + + if len(c1) + len(c2) != graph.vertices: + die('Bipartite partition failed: {0} + {1} != {2}'.format(len(c1), len(c2), graph.vertices)) + + with open(output, 'w') as ofh: + a1 = optimal_assignment(c1, c2, graph.max_weight) + optimal_total_weight = 0.0 + for a in a1: + optimal_total_weight += a[0].neighbors[a[1]] + + print >> ofh, 'optimal average {0:.3f}'.format(optimal_total_weight / len(a1)) + + if randomizations > 0: + random_total_count = 0 + random_total_weight = 0.0 + for i in range(randomizations): + a2 = random_assignment(c1, c2) + random_total_count += len(a2) + for a in a2: + random_total_weight += a[0].neighbors[a[1]] + print >> ofh, 'random average {0:.3f}'.format(random_total_weight / random_total_count) + + + for a in a1: + print >> ofh, '\t'.join([a[0].name, a[1].name]) + +def optimal_assignment(c1, c2, max_weight): + matrix = [] + assignment = [] + + for v1 in c1: + row = [] + for v2 in c2: + row.append(max_weight + 1.0 - v1.neighbors[v2]) + matrix.append(row) + + m = munkres.Munkres() + indexes = m.compute(matrix) + for row, column in indexes: + assignment.append([c1[row], c2[column]]) + + return assignment + +def random_assignment(c1, c2): + 
assignment = [] + + ## note, this assumes that graph is complete bipartite + ## this needs to be fixed + c1_len = len(c1) + c2_len = len(c2) + idx_list = list(range(max(c1_len, c2_len))) + random.shuffle(idx_list) + + if c1_len <= c2_len: + for i, v1 in enumerate(c1): + assignment.append([v1, c2[idx_list[i]]]) + else: + for i, v1 in enumerate(c2): + assignment.append([v1, c1[idx_list[i]]]) + + return assignment + +################################################################################ + +if len(sys.argv) != 4: + die('Usage') + +input, randomizations, output = sys.argv[1:] +main(input, int(randomizations), output)
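The `bipartite_partition` method in the script above discovers the two breeding classes by BFS two-coloring. A minimal stand-alone sketch of that check (illustrative names only; the script itself also tracks edge weights for the later assignment step):

```python
from collections import deque

def two_color(adj):
    """BFS two-coloring, as in Graph.bipartite_partition(): return the
    two vertex classes, or None if some edge joins two vertices of the
    same color (i.e., the graph is not bipartite)."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 3 - color[u]  # alternate 1 <-> 2
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle: not bipartite
    return ([v for v in adj if color[v] == 1],
            [v for v in adj if color[v] == 2])

# A complete bipartite pairs graph (two sires, two dams) two-colors cleanly:
classes = two_color({'a': ['x', 'y'], 'b': ['x'], 'x': ['a', 'b'], 'y': ['a']})
```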
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/assignment_of_optimal_breeding_pairs.xml	Fri Sep 20 13:25:27 2013 -0400
@@ -0,0 +1,54 @@
+<tool id="gd_assignment_of_optimal_breeding_pairs" name="Matings" version="1.0.0">
+  <description>: Assignment of optimal breeding pairs</description>
+
+  <command interpreter="python">
+    assignment_of_optimal_breeding_pairs.py '$input' '$randomizations' '$output'
+  </command>
+
+  <inputs>
+    <param name="input" type="data" format="txt" label="Pairs dataset" />
+    <param name="randomizations" type="integer" min="0" value="0" label="Randomizations" />
+  </inputs>
+
+  <outputs>
+    <data name="output" format="txt" />
+  </outputs>
+
+  <requirements>
+    <requirement type="package" version="1.0.5.4">munkres</requirement>
+  </requirements>
+
+  <!--
+  <tests>
+  </tests>
+  -->
+
+  <help>
+
+**Dataset formats**
+
+The input and output datasets are in text_ format.
+
+.. _text: ./static/formatHelp.html#text
+
+The pairs dataset consists of lines of the form::
+
+  name1 name2 prob
+
+as generated by either of the "Offspring estimated heterozygosity" tools.
+
+-----
+
+**What it does**
+
+The user supplies the offspring estimated heterozygosity for every
+potential breeding pair, i.e., the expected fraction of autosomal SNPs
+for which an offspring is heterozygous.  The tool assigns breeding
+pairs to maximize the average estimated heterozygosity of the offspring.
+Optionally, the user can specify a number of randomly assigned pairings,
+for which the program reports the average estimated heterozygosity
+of the offspring; this gives a comparison of the optimal and average
+heterozygosity resulting from an assignment of breeding pairs.
+
+  </help>
+</tool>
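The "What it does" text above describes a maximum-weight assignment problem. The tool solves it with the Hungarian algorithm (the `munkres` package), after converting heterozygosity weights into costs; for small groups the same answer can be sketched by brute force over permutations. This is illustrative code, not the tool's implementation:

```python
from itertools import permutations

def optimal_pairing(weights):
    """Exhaustively search pairings of row-individuals with
    column-individuals, maximizing the average expected offspring
    heterozygosity.  weights[i][j] is the expected heterozygosity of an
    offspring of row individual i and column individual j.  Returns the
    (row, column) pairs and the optimal average."""
    n = len(weights)
    best, best_score = None, -1.0
    for perm in permutations(range(n)):
        score = sum(weights[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return list(enumerate(best)), best_score / n

# Two candidate sires x two candidate dams:
het = [[0.30, 0.25],
       [0.28, 0.35]]
pairs, avg = optimal_pairing(het)  # pairs 0-0 and 1-1 maximize the average
```

The production script instead builds a cost matrix `max_weight + 1.0 - w` and calls `munkres.Munkres().compute(matrix)`, which scales to large herds where brute force would be infeasible.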
--- a/average_fst.xml	Fri Jul 26 12:51:13 2013 -0400
+++ b/average_fst.xml	Fri Sep 20 13:25:27 2013 -0400
@@ -85,6 +85,10 @@
     <data name="output" format="txt" />
   </outputs>
 
+  <requirements>
+    <requirement type="package" version="0.1">gd_c_tools</requirement>
+  </requirements>
+
   <tests>
     <test>
       <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" />
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/cluster_kegg.xml	Fri Sep 20 13:25:27 2013 -0400
@@ -0,0 +1,46 @@
+<tool id="gd_cluster_kegg" name="Cluster KEGG" version="1.0.0">
+  <description>: Group gene categories connected by shared genes</description>
+
+  <command interpreter="python">
+    #set $ensembltcolmn_arg = int(str($ensembltcolmn)) - 1
+    cluster_onConnctdComps.py '--input=$input' '--input_columns=${input.dataset.metadata.columns}' '--outfile=$output' '--threshold=$threshold' '--ENSEMBLTcolmn=$ensembltcolmn_arg' '--classClmns=$classclmns'
+  </command>
+
+  <inputs>
+    <param name="input" type="data" format="tabular" label="Input dataset" />
+    <param name="ensembltcolmn" type="data_column" data_ref="input" numerical="false" label="Column with the ENSEMBL code in the Input dataset" />
+    <param name="threshold" type="float" value="90" min="0" max="100" label="Threshold to disconnect the nodes" />
+    <param name="classclmns" size="10" type="text" value="c1,c2" label="Gene category columns" />
+  </inputs>
+
+  <outputs>
+    <data name="output" format="tabular" />
+  </outputs>
+
+  <requirements>
+    <requirement type="package" version="1.8.1">networkx</requirement>
+  </requirements>
+
+  <help>
+**What it does**
+
+The program builds a network of gene categories connected by shared
+genes.  The edges of this network are weighted by the number of genes
+that each pair of nodes shares.  The clustering coefficient, c\ :sub:`u`\ ,
+is then calculated for each node using the formula:
+
+.. image:: $PATH_TO_IMAGES/cluster_kegg_formula.png
+
+|
+
+where deg(u) is the degree of u and the edge weights, w\ :sub:`uv`\ ,
+are normalized by the maximum weight in the network.  The clustering
+coefficients are then filtered against a threshold (either a percentile
+or a value chosen by the user), and all nodes with a clustering
+coefficient below this threshold are deleted from the network.
+Finally, the program reports each connected component as a cluster of
+gene classifications.  This yields fewer gene categories, and the
+results are easier to interpret because genes that appear in many gene
+groups are excluded.
+  </help>
+</tool>
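The last two steps of the help text -- drop nodes whose clustering coefficient falls below the threshold, then report each connected component as a cluster -- can be sketched without networkx as follows (the tool itself uses `networkx.clustering(G, weight='weight')` and `networkx.connected_components`; the names here are illustrative):

```python
from collections import deque

def clusters_after_filter(edges, coeff, threshold):
    """Keep only nodes with clustering coefficient >= threshold, then
    return the connected components of the surviving subgraph, each
    component being one cluster of gene categories."""
    keep = {n for n, c in coeff.items() if c >= threshold}
    adj = {n: set() for n in keep}
    for a, b in edges:
        if a in keep and b in keep:
            adj[a].add(b)
            adj[b].add(a)
    seen, comps = set(), []
    for start in adj:            # BFS from every unvisited node
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            n = queue.popleft()
            comp.append(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
        comps.append(sorted(comp))
    return comps

# Category C has a low coefficient, so removing it splits the chain:
comps = clusters_after_filter(
    [('A', 'B'), ('B', 'C'), ('C', 'D')],
    {'A': 0.9, 'B': 0.8, 'C': 0.1, 'D': 0.7},
    0.5)
```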
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/cluster_onConnctdComps.py Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,223 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# +# Cluster_GOKEGG.py +# +# Copyright 2013 Oscar Reina <oscar@niska.bx.psu.edu> +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, +# MA 02110-1301, USA. + +import argparse +import os +from networkx import connected_components,Graph,clustering +from numpy import percentile +from decimal import Decimal,getcontext +from itertools import permutations,combinations +import sys + +def rtrnClustrsOnCltrCoff(dNodesWeightMin,threshold,perctile=True): + """ + From a file with three columns: nodeA, nodeB and a score, it returns + the strong and weak connected components produced when the edges + below the percentage threshold (or value) are excluded. 
+ """ + #~ + Gmin = Graph() + for nodeA,nodeB in dNodesWeightMin: + wMin=dNodesWeightMin[nodeA,nodeB] + Gmin.add_edge(nodeA,nodeB,weight=wMin) + #~ + clstrCoffcMin=clustering(Gmin,weight='weight') + #~ + if perctile: + umbralMin=percentile(clstrCoffcMin.values(),threshold) + else: + umbralMin=threshold + #~ + GminNdsRmv=[x for x in clstrCoffcMin if clstrCoffcMin[x]<umbralMin] + #~ + Gmin.remove_nodes_from(GminNdsRmv) + #~ + dTermCmptNumbWkMin=rtrndata(Gmin) + #~ + salelClustr=[] + srtdterms=sorted(dTermCmptNumbWkMin.keys()) + for echTerm in srtdterms: + try: + MinT=dTermCmptNumbWkMin[echTerm] + except: + MinT='-' + salelClustr.append('\t'.join([echTerm,MinT])) + #~ + return salelClustr + +def rtrndata(G): + """ + returna list of terms and its clustering, as well as clusters from + a networkx formatted file. + """ + #~ + cntCompnts=0 + dTermCmptNumbWk={} + for echCompnt in connected_components(G): + cntCompnts+=1 + #print '.'.join(echCompnt) + for echTerm in echCompnt: + dTermCmptNumbWk[echTerm]=str(cntCompnts) + #~ + return dTermCmptNumbWk + +def rtrnCATcENSEMBLc(inCATfile,classClmns,ENSEMBLTcolmn,nonHdr=True): + """ + return a dictionary of all the categories in an input file with + a set of genes. Takes as input a file with categories an genes. 
+ """ + dCAT={} + dENSEMBLTCAT={} + for eachl in open(inCATfile,'r'): + if nonHdr and eachl.strip(): + ENSEMBLT=eachl.splitlines()[0].split('\t')[ENSEMBLTcolmn] + sCAT=set() + for CATcolmn in classClmns: + sCAT.update(set([x for x in eachl.splitlines()[0].split('\t')[CATcolmn].split('.')])) + sCAT=sCAT.difference(set(['','U','N'])) + if len(sCAT)>0: + dENSEMBLTCAT[ENSEMBLT]=sCAT + for CAT in sCAT: + try: + dCAT[CAT].add(ENSEMBLT) + except: + dCAT[CAT]=set([ENSEMBLT]) + nonHdr=True + #~ + dCAT=dict([(x,len(dCAT[x])) for x in dCAT.keys()]) + #~ + return dCAT,dENSEMBLTCAT + + +def calcDistance(sCAT1,sCAT2): + """ + takes as input two set of genesin different categories and returns + a value 1-percentage of gene shared cat1->cat2, and cat2->cat1. + """ + getcontext().prec=5 + lgensS1=Decimal(len(sCAT1)) + lgensS2=Decimal(len(sCAT2)) + shrdGns=sCAT1.intersection(sCAT2) + lenshrdGns=len(shrdGns) + #~ + dC1C2=1-(lenshrdGns/lgensS1) + dC2C1=1-(lenshrdGns/lgensS2) + #~ + return dC1C2,dC2C1 + +def rtnPrwsdtncs(dCAT,dENSEMBLTCAT): + """ + return a mcl formated pairwise distances from a list of categories + """ + #~ + getcontext().prec=5 + dCATdst={} + lENSEMBL=dENSEMBLTCAT.keys() + l=len(lENSEMBL) + c=0 + for ENSEMBL in lENSEMBL: + c+=1 + lCAT=dENSEMBLTCAT.pop(ENSEMBL) + for CAT1,CAT2 in combinations(lCAT, 2): + try: + dCATdst[CAT1,CAT2]+=1 + except: + dCATdst[CAT1,CAT2]=1 + try: + dCATdst[CAT2,CAT1]+=1 + except: + dCATdst[CAT2,CAT1]=1 + #~ + dNodesWeightMin={} + l=len(dCATdst) + for CAT1,CAT2 in dCATdst.keys(): + shrdGns=dCATdst.pop((CAT1,CAT2)) + dC1C2=float(shrdGns) + nodeA,nodeB=sorted([CAT1,CAT2]) + try: + cscor=dNodesWeightMin[nodeA,nodeB] + if cscor>=dC1C2: + dNodesWeightMin[nodeA,nodeB]=dC1C2 + except: + dNodesWeightMin[nodeA,nodeB]=dC1C2 + # + return dNodesWeightMin + +def parse_class_columns(val, max_col): + int_list = [] + + for elem in [x.strip() for x in val.split(',')]: + if elem[0].lower() != 'c': + print >> sys.stderr, "bad column format:", elem + sys.exit(1) 
+ + int_val = as_int(elem[1:]) + + if int_val is None: + print >> sys.stderr, "bad column format:", elem + sys.exit(1) + elif not 1 <= int_val <= max_col: + print >> sys.stderr, "column out of range:", elem + sys.exit(1) + + int_list.append(int_val - 1) + + return int_list + +def as_int(val): + try: + return int(val) + except ValueError: + return None + else: + raise + +def main(): + """ + """ + #~ bpython cluster_onConnctdComps.py --input=../conctFinal_CEU.tsv --outfile=../borrar.txt --threshold=90 --ENSEMBLTcolmn=1 --classClmns='20 22' + parser = argparse.ArgumentParser(description='Returns the count of genes in ...') + parser.add_argument('--input',metavar='input TXT file',type=str,help='the input file with the table in txt format.',required=True) + parser.add_argument('--input_columns',metavar='input INT value',type=int,help='the number of columns in the input file.',required=True) + parser.add_argument('--outfile',metavar='input TXT file',type=str,help='the output file with the connected components.',required=True) + parser.add_argument('--threshold',metavar='input FLOAT value',type=float,help='the threshold to disconnect the nodes.',required=True) + parser.add_argument('--ENSEMBLTcolmn',metavar='input INT file',type=int,help='the column with the ENSEMBLE code in the input.',required=True) + parser.add_argument('--classClmns',metavar='input STR value',type=str,help='the list of columns with the gene categories separated by space.',required=True) + args = parser.parse_args() + infile = args.input + threshold = args.threshold + outfile = args.outfile + ENSEMBLTcolmn = args.ENSEMBLTcolmn + classClmns = parse_class_columns(args.classClmns, args.input_columns) + #~ + dCAT,dENSEMBLTCAT=rtrnCATcENSEMBLc(infile,classClmns,ENSEMBLTcolmn) + dNodesWeightMin=rtnPrwsdtncs(dCAT,dENSEMBLTCAT) + salelClustr=rtrnClustrsOnCltrCoff(dNodesWeightMin,threshold) + #~ + with open(outfile, 'w') as salef: + print >> salef, '\n'.join(salelClustr) + #~ + #~ + +if __name__ == '__main__': 
+ main() +
--- a/coverage_distributions.xml	Fri Jul 26 12:51:13 2013 -0400
+++ b/coverage_distributions.xml	Fri Sep 20 13:25:27 2013 -0400
@@ -57,6 +57,10 @@
     <data name="output" format="html" />
   </outputs>
 
+  <requirements>
+    <requirement type="package" version="0.1">gd_c_tools</requirement>
+  </requirements>
+
   <tests>
     <test>
       <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" />
@@ -121,7 +125,7 @@
 
 graphical output:
 
-.. image:: ${static_path}/images/gd_coverage.png
+.. image:: $PATH_TO_IMAGES/gd_coverage.png
 
 </help>
 </tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/discover_familial_relationships.py Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,100 @@ +#!/usr/bin/env python + +import sys +import gd_util + +from Population import Population + +################################################################################ + +if len(sys.argv) != 6: + gd_util.die('Usage') + +input, input_type, ind_arg, pop_input, output = sys.argv[1:] + +p_total = Population() +p_total.from_wrapped_dict(ind_arg) + +p1 = Population() +p1.from_population_file(pop_input) +if not p_total.is_superset(p1): + gd_util.die('There is an individual in the population that is not in the SNP table') + +################################################################################ + +prog = 'kinship_prep' + +args = [ prog ] +args.append(input) # a Galaxy SNP table +args.append(0) # required number of reads for each individual to use a SNP +args.append(0) # required genotype quality for each individual to use a SNP +args.append(0) # minimum spacing between SNPs on the same scaffold + +for tag in p1.tag_list(): + if input_type == 'gd_genotype': + column, name = tag.split(':') + tag = '{0}:{1}'.format(int(column) - 2, name) + args.append(tag) + +gd_util.run_program(prog, args) + +# kinship.map +# kinship.ped +# kinship.dat + +################################################################################ + +prog = 'king' + +args = [ prog ] +args.append('-d') +args.append('kinship.dat') +args.append('-p') +args.append('kinship.ped') +args.append('-m') +args.append('kinship.map') +args.append('--kinship') + +gd_util.run_program(prog, args) + +# king.kin + +################################################################################ + +valid_header = 'FID\tID1\tID2\tN_SNP\tZ0\tPhi\tHetHet\tIBS0\tKinship\tError\n' + +with open('king.kin') as fh: + header = fh.readline() + if header != valid_header: + gd_util.die('crap') + + with open(output, 'w') as ofh: + + for line in fh: + elems = line.split('\t') + if 
len(elems) != 10: + gd_util.die('crap') + + x = elems[1] + y = elems[2] + z = elems[8] + + f = float(z) + + message = '' + + if f > 0.354: + message = 'duplicate or MZ twin' + elif f >= 0.177: + message = '1st degree relatives' + elif f >= 0.0884: + message = '2nd degree relatives' + elif f >= 0.0442: + message = '3rd degree relatives' + + print >> ofh, '\t'.join([x, y, z, message]) + +################################################################################ + +sys.exit(0) +
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/discover_familial_relationships.xml	Fri Sep 20 13:25:27 2013 -0400
@@ -0,0 +1,67 @@
+<tool id="gd_discover_familial_relationships" name="Close relatives" version="1.0.0">
+  <description>: Discover familial relationships</description>
+
+  <command interpreter="python">
+    #import json
+    #import base64
+    #import zlib
+    #set $ind_names = $input.dataset.metadata.individual_names
+    #set $ind_colms = $input.dataset.metadata.individual_columns
+    #set $ind_dict = dict(zip($ind_names, $ind_colms))
+    #set $ind_json = json.dumps($ind_dict, separators=(',',':'))
+    #set $ind_comp = zlib.compress($ind_json, 9)
+    #set $ind_arg = base64.b64encode($ind_comp)
+    discover_familial_relationships.py '$input' '$input.ext' '$ind_arg' '$pop_input' '$output'
+  </command>
+
+  <inputs>
+    <param name="input" type="data" format="gd_snp,gd_genotype" label="Input dataset" />
+    <param name="pop_input" type="data" format="gd_indivs" label="Individuals dataset" />
+  </inputs>
+
+  <outputs>
+    <data name="output" format="tabular" />
+  </outputs>
+
+  <requirements>
+    <requirement type="package" version="0.1">gd_c_tools</requirement>
+  </requirements>
+
+  <!--
+  <tests>
+  </tests>
+  -->
+
+  <help>
+
+**Dataset formats**
+
+The input datasets are in gd_snp_, gd_genotype_, and gd_indivs_ formats.
+The output dataset is in tabular_ format.
+
+.. _gd_snp: ./static/formatHelp.html#gd_snp
+.. _gd_genotype: ./static/formatHelp.html#gd_genotype
+.. _gd_indivs: ./static/formatHelp.html#gd_indivs
+.. _tabular: ./static/formatHelp.html#tab
+
+-----
+
+**What it does**
+
+The user specifies a SNP table (in either gd_snp or gd_genotype format)
+and a set of individuals.  The tool runs the KING program (Manichaikul
+et al., 2010) to look for pairs of distinct individuals in the specified
+set that have a close family relationship.  Putatively related pairs
+are classified into five categories:
+
+  1. duplicate or MZ twin
+  2. 1st degree relatives -- siblings (other than identical twins) or a parent-child pair
+  3. 2nd degree relatives -- e.g., half-siblings, a grandparent-grandchild pair, or an individual-uncle/aunt pair
+  4. 3rd degree relatives -- e.g., first cousins
+  5. unrelated
+
+Reference:
+
+Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010)
+Robust relationship inference in genome-wide association studies.
+Bioinformatics 26:2867-2873.
+  </help>
+</tool>
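The relationship categories above correspond to the fixed kinship-coefficient cutoffs used in `discover_familial_relationships.py` when it parses `king.kin` (0.354, 0.177, 0.0884, 0.0442, approximately 2^-1.5 through 2^-4.5). A sketch of that classification step:

```python
def classify_kinship(phi):
    """Map a KING kinship coefficient to a relationship category, using
    the same cutoffs as discover_familial_relationships.py.  Pairs below
    the lowest cutoff get an empty label (treated as unrelated)."""
    if phi > 0.354:
        return 'duplicate or MZ twin'
    elif phi >= 0.177:
        return '1st degree relatives'
    elif phi >= 0.0884:
        return '2nd degree relatives'
    elif phi >= 0.0442:
        return '3rd degree relatives'
    return ''

label = classify_kinship(0.25)  # a parent-child pair has kinship ~0.25
```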
--- a/diversity_pi.py Fri Jul 26 12:51:13 2013 -0400 +++ b/diversity_pi.py Fri Sep 20 13:25:27 2013 -0400 @@ -6,30 +6,55 @@ ################################################################################ -if len(sys.argv) != 7: - gd_util.die('Usage') - -snp_input, coverage_input, indiv_input, min_coverage, output, ind_arg = sys.argv[1:] +def load_pop(file, wrapped_dict): + if file == '/dev/null': + pop = None + else: + pop = Population() + pop.from_wrapped_dict(wrapped_dict) + return pop -p_total = Population() -p_total.from_wrapped_dict(ind_arg) - -p1 = Population() -p1.from_population_file(indiv_input) -if not p_total.is_superset(p1): - gd_util.die('There is an individual in the population individuals that is not in the SNP table') +def append_tags(the_list, p, p_type, val): + if p is None: + return + for tag in p.tag_list(): + column, name = tag.split(':') + if p_type == 'gd_genotype': + column = int(column) - 2 + the_list.append('{0}:{1}:{2}'.format(val, column, name)) ################################################################################ -prog = 'mt_pi' +if len(sys.argv) != 11: + gd_util.die('Usage') + +snp_input, snp_ext, snp_arg, cov_input, cov_ext, cov_arg, indiv_input, min_coverage, req_thresh, output = sys.argv[1:] + +p_snp = load_pop(snp_input, snp_arg) +p_cov = load_pop(cov_input, cov_arg) + +p_ind = Population() +p_ind.from_population_file(indiv_input) + +if not p_snp.is_superset(p_ind): + gd_util.die('There is an individual in the population individuals that is not in the SNP/Genotype table') + +if p_cov is not None and (not p_cov.is_superset(p_ind)): + gd_util.die('There is an individual in the population individuals that is not in the Coverage table') + +################################################################################ + +prog = 'mito_pi' args = [ prog ] args.append(snp_input) -args.append(coverage_input) +args.append(cov_input) args.append(min_coverage) +args.append(req_thresh) -for tag in p1.tag_list(): - args.append(tag) 
+append_tags(args, p_ind, 'gd_indivs', 0) +append_tags(args, p_snp, snp_ext, 1) +append_tags(args, p_cov, cov_ext, 2) with open(output, 'w') as fh: gd_util.run_program(prog, args, stdout=fh)
--- a/diversity_pi.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/diversity_pi.xml Fri Sep 20 13:25:27 2013 -0400 @@ -1,30 +1,73 @@ -<tool id="gd_diversity_pi" name="Diversity" version="1.0.0"> - <description>&pi;</description> +<tool id="gd_diversity_pi" name="Diversity" version="1.1.0"> + <description>: pi, allowing for unsequenced intervals</description> <command interpreter="python"> #import json #import base64 #import zlib - #set $ind_names = $input.dataset.metadata.individual_names - #set $ind_colms = $input.dataset.metadata.individual_columns - #set $ind_dict = dict(zip($ind_names, $ind_colms)) - #set $ind_json = json.dumps($ind_dict, separators=(',',':')) - #set $ind_comp = zlib.compress($ind_json, 9) - #set $ind_arg = base64.b64encode($ind_comp) - diversity_pi.py '$input' '$coverage_input' '$indiv_input' '$min_coverage' '$output' '$ind_arg' + #set $snp_names = $input.dataset.metadata.individual_names + #set $snp_colms = $input.dataset.metadata.individual_columns + #set $snp_dict = dict(zip($snp_names, $snp_colms)) + #set $snp_json = json.dumps($snp_dict, separators=(',',':')) + #set $snp_comp = zlib.compress($snp_json, 9) + #set $snp_arg = base64.b64encode($snp_comp) + #if $use_cov.choice == '1' + #set $cov_file = $use_cov.cov_input + #set $cov_ext = $use_cov.cov_input.ext + #set $cov_names = $use_cov.cov_input.dataset.metadata.individual_names + #set $cov_colms = $use_cov.cov_input.dataset.metadata.individual_columns + #set $cov_dict = dict(zip($cov_names, $cov_colms)) + #set $cov_json = json.dumps($cov_dict, separators=(',',':')) + #set $cov_comp = zlib.compress($cov_json, 9) + #set $cov_arg = base64.b64encode($cov_comp) + #set $cov_min = $use_cov.min_coverage + #set $cov_req = $use_cov.req_thresh + #else + #set $cov_file = '/dev/null' + #set $cov_ext = '' + #set $cov_arg = '' + #set $cov_min = 0 + #set $cov_req = 0 + #end if + diversity_pi.py '$input' '$input.ext' '$snp_arg' '$cov_file' '$cov_ext' '$cov_arg' '$indiv_input' '$cov_min' '$cov_req' '$output' 
</command> <inputs> - <param name="input" type="data" format="gd_snp" label="SNP dataset" /> - <param name="coverage_input" type="data" format="interval" label="Coverage dataset" /> + <param name="input" type="data" format="gd_snp,gd_genotype" label="SNP/Genotype dataset" /> + <conditional name="use_cov"> + <param name="choice" type="select" format="integer" label="Include Coverage dataset"> + <option value="1" selected="true">yes</option> + <option value="0">no</option> + </param> + <when value="0" /> + <when value="1"> + <param name="cov_input" type="data" format="gd_snp,gd_genotype" label="Coverage dataset" /> + <param name="min_coverage" type="integer" min="1" value="1" label="Minimum coverage" /> + <param name="req_thresh" type="integer" min="1" value="1" label="Lower bound for shared well-covered bp" /> + </when> + </conditional> <param name="indiv_input" type="data" format="gd_indivs" label="Population Individuals" /> - <param name="min_coverage" type="integer" min="1" value="1" label="Minimum coverage" /> </inputs> <outputs> <data name="output" format="txt" metadata_source="input" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <help> +**What it does** + +The user supplies the following: + + 1. A file in gd_genotype or gd_snp format giving the mitochondrial SNPs. + 2. An optional gd_genotype file gives the sequence coverage for each individual at each mitochondrial position. + 3. A set of individuals specified with the "Specify individuals" tool. + 4. The minimum depth of sequence coverage. Positions where an individual has less coverage are ignored. + 5. The number of adequately covered positions that must be shared by two individuals before their diversity is included in the reported average. + +For each pair of individual (with adequate shared coverage), the program divides the number of nucleotide difference between the individuals in those intervals by the intervals' total length. 
Those ratios are averaged over the relevant pairs of individuals. </help> </tool>
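The averaging procedure described in the help text above can be sketched in Python as follows. This is a minimal illustration, not the gd_c_tools C implementation; the function name and data layout (per-individual allele and depth lists) are hypothetical.

```python
from itertools import combinations

def average_pairwise_diversity(genotypes, coverage, min_cov=1, min_shared=1):
    """Average per-site diversity over pairs of individuals (sketch).

    genotypes: {individual: sequence of alleles, one per position}
    coverage:  {individual: read depth at each position}
    Positions where either individual is below min_cov are skipped; pairs
    sharing fewer than min_shared well-covered positions are excluded.
    """
    ratios = []
    for a, b in combinations(genotypes, 2):
        shared = diffs = 0
        for i, (x, y) in enumerate(zip(genotypes[a], genotypes[b])):
            if coverage[a][i] >= min_cov and coverage[b][i] >= min_cov:
                shared += 1
                if x != y:
                    diffs += 1
        if shared >= min_shared:
            # per-pair ratio: differences over length of well-covered region
            ratios.append(diffs / shared)
    return sum(ratios) / len(ratios) if ratios else 0.0
```

With three individuals where one has a low-coverage position, only positions covered in both members of each pair contribute to that pair's ratio, and the reported value is the mean over qualifying pairs.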
--- a/dpmix.py Fri Jul 26 12:51:13 2013 -0400 +++ b/dpmix.py Fri Sep 20 13:25:27 2013 -0400 @@ -8,6 +8,20 @@ from dpmix_plot import make_dpmix_plot from LocationFile import LocationFile +def load_and_check_pop(name, file, total_pop): + p = Population(name=name) + p.from_population_file(file) + if not total_pop.is_superset(p): + gd_util.die('There is an individual in {0} that is not in the SNP table'.format(name)) + return p + +def append_pop_tags(the_list, p, input_type, number): + for tag in p.tag_list(): + column, name = tag.split(':') + if input_type == 'gd_genotype': + column = int(column) - 2 + the_list.append('{0}:{1}:{2}'.format(column, number, name)) + ################################################################################ if len(sys.argv) != 22: @@ -16,6 +30,11 @@ input, input_type, data_source, switch_penalty, ap1_input, ap1_name, ap2_input, ap2_name, ap3_input, ap3_name, p_input, output, output2, output2_dir, dbkey, ref_column, galaxy_data_index_dir, heterochromatin_loc_file, ind_arg, het_arg, add_logs = sys.argv[1:] +if ap1_input == '/dev/null': + use_reference = True +else: + use_reference = False + if ap3_input == '/dev/null': populations = 2 else: @@ -39,30 +58,19 @@ p_total = Population() p_total.from_wrapped_dict(ind_arg) -ap1 = Population(name='Ancestral population 1') -ap1.from_population_file(ap1_input) -population_list.append(ap1) -if not p_total.is_superset(ap1): - gd_util.die('There is an individual in ancestral population 1 that is not in the SNP table') +if not use_reference: + ap1 = load_and_check_pop('Ancestral population 1', ap1_input, p_total) + population_list.append(ap1) -ap2 = Population(name='Ancestral population 2') -ap2.from_population_file(ap2_input) +ap2 = load_and_check_pop('Ancestral population 2', ap2_input, p_total) population_list.append(ap2) -if not p_total.is_superset(ap2): - gd_util.die('There is an individual in ancestral population 2 that is not in the SNP table') if populations == 3: - ap3 = 
Population(name='Ancestral population 3') - ap3.from_population_file(ap3_input) + ap3 = load_and_check_pop('Ancestral population 3', ap3_input, p_total) population_list.append(ap3) - if not p_total.is_superset(ap3): - gd_util.die('There is an individual in ancestral population 3 that is not in the SNP table') -p = Population(name='Potentially admixed') -p.from_population_file(p_input) +p = load_and_check_pop('Potentially admixed', p_input, p_total) population_list.append(p) -if not p_total.is_superset(p): - gd_util.die('There is an individual in the population that is not in the SNP table') gd_util.mkdir_p(output2_dir) @@ -84,42 +92,17 @@ args.append(heterochrom_path) args.append(misc_file) -columns = ap1.column_list() -for column in columns: - col = int(column) - name = ap1.individual_with_column(column).name - first_token = name.split()[0] - if input_type == 'gd_genotype': - col -= 2 - args.append('{0}:1:{1}'.format(col, first_token)) +if use_reference: + args.append('0:1:reference') +else: + append_pop_tags(args, ap1, input_type, 1) -columns = ap2.column_list() -for column in columns: - col = int(column) - name = ap2.individual_with_column(column).name - first_token = name.split()[0] - if input_type == 'gd_genotype': - col -= 2 - args.append('{0}:2:{1}'.format(col, first_token)) +append_pop_tags(args, ap2, input_type, 2) if populations == 3: - columns = ap3.column_list() - for column in columns: - col = int(column) - name = ap3.individual_with_column(column).name - first_token = name.split()[0] - if input_type == 'gd_genotype': - col -= 2 - args.append('{0}:3:{1}'.format(col, first_token)) + append_pop_tags(args, ap3, input_type, 3) -columns = p.column_list() -for column in columns: - col = int(column) - name = p.individual_with_column(column).name - first_token = name.split()[0] - if input_type == 'gd_genotype': - col -= 2 - args.append('{0}:0:{1}'.format(col, first_token)) +append_pop_tags(args, p, input_type, 0) with open(output, 'w') as fh: 
gd_util.run_program(prog, args, stdout=fh)
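The dpmix.py refactor above folds four near-identical loops into `append_pop_tags`, which emits `column:population-number:name` argument strings and shifts the column index left by 2 for gd_genotype input (whose tables lack the two per-individual allele-count columns of gd_snp). A self-contained sketch, with a minimal stand-in for the real `Population` class (its `tag_list` here is assumed to return `'column:name'` strings, matching how the diff splits them):

```python
class Population(object):
    """Hypothetical stand-in for the Galaxy genome_diversity Population class."""
    def __init__(self, tags):
        self._tags = tags  # e.g. ['6:PB1', '10:PB2'] as 'column:name'

    def tag_list(self):
        return self._tags

def append_pop_tags(the_list, p, input_type, number):
    """Append 'column:number:name' args for each individual in population p."""
    for tag in p.tag_list():
        column, name = tag.split(':')
        if input_type == 'gd_genotype':
            # gd_genotype tables omit the two count columns, so shift left by 2
            column = int(column) - 2
        the_list.append('{0}:{1}:{2}'.format(column, number, name))
```

The population number (1, 2, 3 for source populations, 0 for the potentially admixed one) mirrors the calls in the refactored script.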
--- a/dpmix.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/dpmix.xml Fri Sep 20 13:25:27 2013 -0400 @@ -1,4 +1,4 @@ -<tool id="gd_dpmix" name="Admixture" version="1.1.0"> +<tool id="gd_dpmix" name="Admixture" version="1.2.0"> <description>: Map genomic intervals resembling specified source populations</description> <command interpreter="python"> @@ -31,7 +31,13 @@ #else if $user_het.choice == '2' #set $het_arg = 'use_none' #end if - '$switch_penalty' '$ap1_input' '$ap1_input.name' '$ap2_input' '$ap2_input.name' '$ap3_arg' '$ap3_name_arg' '$p_input' '$output' '$output2' '$output2.files_path' '$input.dataset.metadata.dbkey' '$input.dataset.metadata.ref' '$GALAXY_DATA_INDEX_DIR' 'gd.heterochromatic.loc' '$ind_arg' '$het_arg' '1' + '$switch_penalty' + #if $use_reference.choice == '0' + '$ap1_input' '$ap1_input.name' + #else if $use_reference.choice == '1' + '/dev/null' 'reference' + #end if + '$ap2_input' '$ap2_input.name' '$ap3_arg' '$ap3_name_arg' '$p_input' '$output' '$output2' '$output2.files_path' '$input.dataset.metadata.dbkey' '$input.dataset.metadata.ref' '$GALAXY_DATA_INDEX_DIR' 'gd.heterochromatic.loc' '$ind_arg' '$het_arg' '$add_logs' </command> <inputs> @@ -57,7 +63,17 @@ </when> </conditional> - <param name="ap1_input" type="data" format="gd_indivs" label="Source population 1 individuals" /> + <conditional name="use_reference"> + <param name="choice" type="select" format="integer" label="History item or Reference sequence"> + <option value="0" selected="true">History item</option> + <option value="1">Reference sequence</option> + </param> + <when value="0"> + <param name="ap1_input" type="data" format="gd_indivs" label="Source population 1 individuals" /> + </when> + <when value="1" /> + </conditional> + <param name="ap2_input" type="data" format="gd_indivs" label="Source population 2 individuals" /> <conditional name="third_pop"> @@ -87,12 +103,10 @@ </when> </conditional> - <!-- <param name="add_logs" type="select" format="integer" label="Probabilities"> 
<option value="1" selected="true">add logs of probabilities</option> <option value="0">add probabilities</option> </param> - --> </inputs> @@ -101,6 +115,11 @@ <data name="output2" format="html" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + <requirement type="package" version="1.2.1">matplotlib</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" />
--- a/dpmix_plot.py Fri Jul 26 12:51:13 2013 -0400 +++ b/dpmix_plot.py Fri Sep 20 13:25:27 2013 -0400 @@ -3,8 +3,10 @@ import os import sys import math + import matplotlib as mpl mpl.use('PDF') +from matplotlib.backends.backend_pdf import PdfPages import matplotlib.pyplot as plt from matplotlib.path import Path import matplotlib.patches as patches @@ -226,18 +228,49 @@ return vals, labels ################################################################################ +################################################################################ +################################################################################ +################################################################################ + +def space_for_legend(plot_params): + space = 0.0 + + legend_states = plot_params['legend_states'] + if legend_states: + ind_space = plot_params['ind_space'] + ind_height = plot_params['ind_height'] + space += len(legend_states) * (ind_space + ind_height) - ind_space + + return space + +################################################################################ + +def space_for_chroms(plot_params, chroms, individuals, data): + space_dict = {} + + chrom_height = plot_params['chrom_height'] + ind_space = plot_params['ind_space'] + ind_height = plot_params['ind_height'] + + for chrom in chroms: + space_dict[chrom] = chrom_height + + individual_count = 0 + for individual in individuals: + if individual in data[chrom]: + individual_count += 1 + + space_dict[chrom] += individual_count * (ind_space + ind_height) + + return space_dict + +################################################################################ def make_dpmix_plot(input_dbkey, input_file, output_file, galaxy_data_index_dir, state2name=None, populations=3): fs_chrom_len = build_chrom_len_dict(input_dbkey, galaxy_data_index_dir) chroms, individuals, data, chrom_len, used_states = parse_input_file(input_file) - if populations == 3: - make_state_rectangle = make_state_rectangle_3pop - elif 
populations == 2: - make_state_rectangle = make_state_rectangle_2pop - else: - pass - + ## populate chrom_len for chrom in chrom_len.keys(): if chrom in fs_chrom_len: chrom_len[chrom] = fs_chrom_len[chrom] @@ -245,135 +278,177 @@ #check_chroms(chroms, chrom_len, input_dbkey) check_data(data, chrom_len, input_dbkey) - ## units below are inches - top_space = 0.10 - chrom_space = 0.25 - chrom_height = 0.25 - ind_space = 0.10 - ind_height = 0.25 + ## plot parameters + plot_params = { + 'plot_dpi': 300, + 'page_width': 8.50, + 'page_height': 11.00, + 'top_margin': 0.10, + 'bottom_margin': 0.10, + 'chrom_space': 0.25, + 'chrom_height': 0.25, + 'ind_space': 0.10, + 'ind_height': 0.25, + 'legend_space': 0.10 + } - total_height = 0.0 - - ## make a legend - ## only print out states that are + ## in the legend, only print out states that are ## 1) in the data ## - AND - ## 2) in the state2name map - ## here, we only calculate the space needed legend_states = [] if state2name is not None: for state in used_states: if state in state2name: legend_states.append(state) - if legend_states: - total_height += len(legend_states) * (ind_space + ind_height) - total_height += (top_space - ind_space) - at_top = False - else: - at_top = True + plot_params['legend_states'] = legend_states + + ## choose the correct make_state_rectangle method + if populations == 3: + plot_params['rectangle_method'] = make_state_rectangle_3pop + elif populations == 2: + plot_params['rectangle_method'] = make_state_rectangle_2pop + + pdf_pages = PdfPages(output_file) + + ## generate a list of chroms for each page + + needed_for_legend = space_for_legend(plot_params) + needed_for_chroms = space_for_chroms(plot_params, chroms, individuals, data) - for chrom in chroms: - if at_top: - total_height += (top_space + chrom_height) - at_top = False + chrom_space_per_page = plot_params['page_height'] + chrom_space_per_page -= plot_params['top_margin'] + plot_params['bottom_margin'] + chrom_space_per_page -= 
needed_for_legend + plot_params['legend_space'] + chrom_space_per_page -= plot_params['chrom_space'] + + chroms_left = chroms[:] + pages = [] + + space_left = chrom_space_per_page + chrom_list = [] + + while chroms_left: + chrom = chroms_left.pop(0) + space_needed = needed_for_chroms[chrom] + plot_params['chrom_space'] + if (space_needed > chrom_space_per_page): + print >> sys.stderr, 'Multipage chroms not yet supported' + sys.exit(1) + + ## sometimes 1.9 - 1.9 < 0 (-4.4408920985e-16) + ## so, we make sure it's not more than a millimeter over + if space_left - space_needed > -0.04: + chrom_list.append(chrom) + space_left -= space_needed else: - total_height += (top_space + chrom_space + chrom_height) - - individual_count = 0 - for individual in individuals: - if individual in data[chrom]: - individual_count += 1 - total_height += individual_count * (ind_space + ind_height) + pages.append(chrom_list[:]) + chrom_list = [] + chroms_left.insert(0, chrom) + space_left = chrom_space_per_page - width = 7.5 - height = math.ceil(total_height) - - bottom = 1.0 - - fig = plt.figure(figsize=(width, height)) + ############################################################################ - if legend_states: - at_top = True - for state in sorted(legend_states): - if at_top: - bottom -= (top_space + ind_height)/height - at_top = False - else: - bottom -= (ind_space + ind_height)/height - ## add code here to draw legend - # [left, bottom, width, height] - ax1 = fig.add_axes([0.0, bottom, 0.09, ind_height/height]) - plt.axis('off') - ax1.set_xlim(0, 1) - ax1.set_ylim(0, 1) - for patch in make_state_rectangle(0, 1, state, 'legend', state2name[state]): - ax1.add_patch(patch) + plot_dpi = plot_params['plot_dpi'] + page_width = plot_params['page_width'] + page_height = plot_params['page_height'] + top_margin = plot_params['top_margin'] + ind_space = plot_params['ind_space'] + ind_height = plot_params['ind_height'] + make_state_rectangle = plot_params['rectangle_method'] + legend_space = 
plot_params['legend_space'] + chrom_space = plot_params['chrom_space'] + chrom_height = plot_params['chrom_height'] + + for page in pages: + fig = plt.figure(figsize=(page_width, page_height), dpi=plot_dpi) + bottom = 1.0 - (top_margin/page_height) + + # print legend + if legend_states: + top = True + for state in sorted(legend_states): + if top: + bottom -= ind_height/page_height + top = False + else: + bottom -= (ind_space + ind_height)/page_height - ax2 = fig.add_axes([0.10, bottom, 0.88, ind_height/height], frame_on=False) - plt.axis('off') - plt.text(0.0, 0.5, state2name[state], fontsize=10, ha='left', va='center') - else: - at_top = True + ax1 = fig.add_axes([0.0, bottom, 0.09, ind_height/page_height]) + plt.axis('off') + ax1.set_xlim(0, 1) + ax1.set_ylim(0, 1) + for patch in make_state_rectangle(0, 1, state, 'legend', state2name[state]): + ax1.add_patch(patch) - for_webb = False + ax2 = fig.add_axes([0.10, bottom, 0.88, ind_height/page_height], frame_on=False) + plt.axis('off') + plt.text(0.0, 0.5, state2name[state], fontsize=10, ha='left', va='center') + + bottom -= legend_space/page_height - for chrom in chroms: - length = chrom_len[chrom] - vals, labels = tick_foo(0, length) + # print chroms + top = True + for chrom in page: + length = chrom_len[chrom] + vals, labels = tick_foo(0, length) - if at_top: - bottom -= (top_space + chrom_height)/height - at_top = False - else: - bottom -= (top_space + chrom_space + chrom_height)/height + if top: + bottom -= chrom_height/page_height + top = False + else: + bottom -= (chrom_space + chrom_height)/page_height - if not for_webb: - ax = fig.add_axes([0.0, bottom, 1.0, chrom_height/height]) + ax = fig.add_axes([0.0, bottom, 1.0, chrom_height/page_height]) plt.axis('off') plt.text(0.5, 0.5, chrom, fontsize=14, ha='center') - individual_count = 0 - for individual in individuals: - if individual in data[chrom]: - individual_count += 1 + individual_count = 0 + for individual in individuals: + if individual in data[chrom]: 
+ individual_count += 1 - i = 0 - for individual in individuals: - if individual in data[chrom]: - i += 1 + i = 0 + for individual in individuals: + if individual in data[chrom]: + i += 1 + bottom -= (ind_space + ind_height)/page_height - bottom -= (ind_space + ind_height)/height - if not for_webb: - # [left, bottom, width, height] - ax1 = fig.add_axes([0.0, bottom, 0.09, ind_height/height]) + ax1 = fig.add_axes([0.0, bottom, 0.09, ind_height/page_height]) plt.axis('off') plt.text(1.0, 0.5, individual, fontsize=10, ha='right', va='center') - # [left, bottom, width, height] - ax2 = fig.add_axes([0.10, bottom, 0.88, ind_height/height], frame_on=False) - ax2.set_xlim(0, length) - ax2.set_ylim(0, 1) - if i != individual_count: - plt.axis('off') - else: - if not for_webb: + + ax2 = fig.add_axes([0.10, bottom, 0.88, ind_height/page_height], frame_on=False) + ax2.set_xlim(0, length) + ax2.set_ylim(0, 1) + + if i != individual_count: + plt.axis('off') + else: ax2.tick_params(top=False, left=False, right=False, labelleft=False) ax2.set_xticks(vals) ax2.set_xticklabels(labels) - else: - plt.axis('off') - for p1, p2, state in sorted(data[chrom][individual]): - #for patch in make_state_rectangle(p1, p2, state, chrom, individual): - for patch in make_state_rectangle(p1, p2, state, chrom, individual): - ax2.add_patch(patch) + + for p1, p2, state in sorted(data[chrom][individual]): + for patch in make_state_rectangle(p1, p2, state, chrom, individual): + ax2.add_patch(patch) - plt.savefig(output_file) + # extend last state to end of chrom + if p2 < length: + for patch in make_state_rectangle(p2, length, state, chrom, individual): + ax2.add_patch(patch) + + + pdf_pages.savefig(fig) + plt.close(fig) + + pdf_pages.close() ################################################################################ if __name__ == '__main__': - input_dbkey, input_file, output_file, galaxy_data_index_dir = sys.argv[1:5] - make_dpmix_plot(input_dbkey, input_file, output_file, galaxy_data_index_dir) + 
make_dpmix_plot('loxAfr3', 'output.dat', 'output2_files/picture.pdf', '/scratch/galaxy/home/oocyte/galaxy_oocyte/tool-data', state2name={0: 'heterochromatin', 1: 'reference', 2: 'asian'}, populations=2) +# input_dbkey, input_file, output_file, galaxy_data_index_dir = sys.argv[1:5] +# make_dpmix_plot(input_dbkey, input_file, output_file, galaxy_data_index_dir) sys.exit(0) ## notes
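The paging loop added to dpmix_plot.py above is a greedy first-fit-in-order packer: chromosomes are consumed in order, a chromosome that no longer fits closes the current page, and a small tolerance absorbs floating-point round-off (the diff's own comment notes that `1.9 - 1.9` can come out as `-4.4e-16`). A condensed sketch of that logic, with hypothetical names:

```python
def paginate(chroms, heights, page_space, gap=0.25, tol=0.04):
    """Greedy first-fit-in-order packing of chromosome panels onto pages.

    heights: {chrom: vertical inches needed}; page_space: usable inches per
    page after margins and legend; gap: inter-chromosome spacing.
    """
    pages, current, left = [], [], page_space
    for chrom in chroms:
        needed = heights[chrom] + gap
        if needed > page_space:
            # mirrors the diff: multipage chromosomes are not supported
            raise ValueError('chromosome taller than a page')
        if left - needed > -tol:     # tolerance for float round-off
            current.append(chrom)
            left -= needed
        else:
            pages.append(current)    # close the page, start a new one
            current = [chrom]
            left = page_space - needed
    if current:
        pages.append(current)
    return pages
```

Each returned sublist corresponds to one `PdfPages` figure in the rewritten `make_dpmix_plot`.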
--- a/draw_variants.py Fri Jul 26 12:51:13 2013 -0400 +++ b/draw_variants.py Fri Sep 20 13:25:27 2013 -0400 @@ -6,35 +6,81 @@ ################################################################################ -if len(sys.argv) != 10: - gd_util.die('Usage') - -snp_input, indel_input, coverage_input, annotation_input, indiv_input, ref_name, min_coverage, output, ind_arg = sys.argv[1:] +def load_pop(file, wrapped_dict): + if file == '/dev/null': + pop = None + else: + pop = Population() + pop.from_wrapped_dict(wrapped_dict) + return pop -p_total = Population() -p_total.from_wrapped_dict(ind_arg) - -p1 = Population() -p1.from_population_file(indiv_input) -if not p_total.is_superset(p1): - gd_util.die('There is an individual in the population individuals that is not in the SNP table') +def append_tags(the_list, p, p_type, val): + if p is None: + return + for tag in p.tag_list(): + column, name = tag.split(':') + if p_type == 'gd_genotype': + column = int(column) - 2 + the_list.append('{0}:{1}:{2}'.format(val, column, name)) ################################################################################ -prog = 'mk_Ji' +if len(sys.argv) != 11: + gd_util.die('Usage') + + +snp_file, snp_ext, snp_arg, indiv_input, annotation_input, cov_file, cov_ext, cov_arg, min_coverage, output = sys.argv[1:] + +p_snp = load_pop(snp_file, snp_arg) +p_cov = load_pop(cov_file, cov_arg) + +if indiv_input == '/dev/null': + if p_snp is not None: + p_ind = p_snp + elif p_cov is not None: + p_ind = p_cov + else: + p_ind = None + order_p_ind = True +else: + p_ind = Population() + p_ind.from_population_file(indiv_input) + order_p_ind = False + +## p ind must be from either p_snp or p_cov +if p_snp is not None and p_cov is not None: + if not (p_snp.is_superset(p_ind) or p_cov.is_superset(p_ind)): + gd_util.die('There is an individual in the population individuals that is not in the SNP/Genotype or Coverage table') +elif p_snp is not None: + if not p_snp.is_superset(p_ind): + gd_util.die('There is an 
individual in the population individuals that is not in the SNP/Genotype table') +elif p_cov is not None: + if not p_cov.is_superset(p_ind): + gd_util.die('There is an individual in the population individuals that is not in the Coverage table') + + +################################################################################ + +prog = 'mito_draw' args = [ prog ] -args.append(snp_input) -args.append(indel_input) -args.append(coverage_input) +args.append(snp_file) +args.append(cov_file) args.append(annotation_input) args.append(min_coverage) -args.append(ref_name) -for tag in p1.tag_list(): - args.append(tag) +if order_p_ind: + for column in sorted(p_ind.column_list()): + individual = p_ind.individual_with_column(column) + name = individual.name.split()[0] + args.append('{0}:{1}:{2}'.format(0, column, name)) +else: + append_tags(args, p_ind, 'gd_indivs', 0) -with open('mk_Ji.out', 'w') as fh: +append_tags(args, p_snp, snp_ext, 1) +append_tags(args, p_cov, cov_ext, 2) + +with open('Ji.spec', 'w') as fh: gd_util.run_program(prog, args, stdout=fh) ################################################################################ @@ -48,9 +94,20 @@ args.append(0.3) args.append('-g') args.append(0.2) -args.append('mk_Ji.out') +args.append('Ji.spec') -with open(output, 'w') as fh: +with open('Ji.svg', 'w') as fh: gd_util.run_program(prog, args, stdout=fh) +################################################################################ + +prog = 'convert' + +args = [ prog ] +args.append('-density') +args.append(100) +args.append('Ji.svg') +args.append('tiff:{0}'.format(output)) + +gd_util.run_program(prog, args) sys.exit(0)
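The rewritten draw_variants.py above chains three external programs (`mito_draw` writing Ji.spec, `varplot` writing Ji.svg, then ImageMagick's `convert` producing the TIFF), each launched through `gd_util.run_program` with an optional redirected stdout. A minimal stand-in for that helper, sketched with `subprocess` (the real gd_util may differ):

```python
import subprocess

def run_program(prog, args, stdout=None):
    """Run an external program, aborting on a nonzero exit code.

    args is the full argv list (argv[0] == prog), matching how the
    genome_diversity scripts build their argument lists; non-string
    arguments such as densities are stringified.
    """
    argv = [str(a) for a in args]
    rc = subprocess.call(argv, stdout=stdout)
    if rc != 0:
        raise SystemExit('{0} failed with exit code {1}'.format(prog, rc))
```

Used as in the diff: open the intermediate file, pass it as `stdout`, and let the next stage read it.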
--- a/draw_variants.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/draw_variants.xml Fri Sep 20 13:25:27 2013 -0400 @@ -1,41 +1,102 @@ -<tool id="gd_draw_variants" name="Draw" version="1.0.0"> - <description>variants</description> +<tool id="gd_draw_variants" name="Draw variants" version="1.1.0"> + <description>: show positions of SNVs and unsequenced intervals</description> <command interpreter="python"> #import json #import base64 #import zlib - #set $ind_names = $input.dataset.metadata.individual_names - #set $ind_colms = $input.dataset.metadata.individual_columns - #set $ind_dict = dict(zip($ind_names, $ind_colms)) - #set $ind_json = json.dumps($ind_dict, separators=(',',':')) - #set $ind_comp = zlib.compress($ind_json, 9) - #set $ind_arg = base64.b64encode($ind_comp) - draw_variants.py '$input' '$indel_input' '$coverage_input' '$annotation_input' '$indiv_input' '$ref_name' '$min_coverage' '$output' '$ind_arg' + #if $use_snp.choice == '1' + #set $snp_file = $use_snp.snp_input + #set $snp_ext = $use_snp.snp_input.ext + #set $snp_names = $use_snp.snp_input.dataset.metadata.individual_names + #set $snp_colms = $use_snp.snp_input.dataset.metadata.individual_columns + #set $snp_dict = dict(zip($snp_names, $snp_colms)) + #set $snp_json = json.dumps($snp_dict, separators=(',',':')) + #set $snp_comp = zlib.compress($snp_json, 9) + #set $snp_arg = base64.b64encode($snp_comp) + #else + #set $snp_file = '/dev/null' + #set $snp_ext = '' + #set $snp_arg = '' + #end if + #if $use_cov.choice == '1' + #set $cov_file = $use_cov.cov_input + #set $cov_ext = $use_cov.cov_input.ext + #set $cov_names = $use_cov.cov_input.dataset.metadata.individual_names + #set $cov_colms = $use_cov.cov_input.dataset.metadata.individual_columns + #set $cov_dict = dict(zip($cov_names, $cov_colms)) + #set $cov_json = json.dumps($cov_dict, separators=(',',':')) + #set $cov_comp = zlib.compress($cov_json, 9) + #set $cov_arg = base64.b64encode($cov_comp) + #set $cov_min = $use_cov.min_coverage + #else + #set 
$cov_file = '/dev/null' + #set $cov_ext = '' + #set $cov_arg = '' + #set $cov_min = 0 + #end if + #if $use_indiv.choice == '1' + #set $ind_arg = $use_indiv.indiv_input + #else + #set $ind_arg = '/dev/null' + #end if + draw_variants.py '$snp_file' '$snp_ext' '$snp_arg' '$ind_arg' '$annotation_input' '$cov_file' '$cov_ext' '$cov_arg' '$cov_min' '$output' </command> <inputs> - <param name="input" type="data" format="gd_snp" label="SNP dataset" /> - <param name="indel_input" type="data" format="gd_snp" label="Indel dataset" /> - <param name="coverage_input" type="data" format="interval" label="Coverage dataset" /> + <conditional name="use_snp"> + <param name="choice" type="select" format="integer" label="Include SNP/Genotype dataset"> + <option value="1" selected="true">yes</option> + <option value="0">no</option> + </param> + <when value="0" /> + <when value="1"> + <param name="snp_input" type="data" format="gd_snp,gd_genotype" label="SNP/Genotype dataset" /> + </when> + </conditional> + <conditional name="use_cov"> + <param name="choice" type="select" format="integer" label="Include Coverage dataset"> + <option value="1" selected="true">yes</option> + <option value="0">no</option> + </param> + <when value="0" /> + <when value="1"> + <param name="cov_input" type="data" format="gd_snp,gd_genotype" label="Coverage dataset" /> + <param name="min_coverage" type="integer" min="1" value="1" label="Minimum coverage" /> + </when> + </conditional> + <conditional name="use_indiv"> + <param name="choice" type="select" label="Compute for"> + <option value="0" selected="true">All individuals</option> + <option value="1">Individuals in a population</option> + </param> + <when value="0" /> + <when value="1"> + <param name="indiv_input" type="data" format="gd_indivs" label="Population Individuals" /> + </when> + </conditional> <param name="annotation_input" type="data" format="interval" label="Annotation dataset" /> - <param name="indiv_input" type="data" format="gd_indivs" 
label="Population Individuals" /> - - <param name="ref_name" type="select" label="Ref name"> - <options from_dataset="indiv_input"> - <column name="name" index="1"/> - <column name="value" index="1"/> - <filter type="add_value" name="default" value="default" index="0" /> - </options> - </param> - - <param name="min_coverage" type="integer" min="1" value="1" label="Minimum coverage" /> </inputs> <outputs> - <data name="output" format="svg" /> + <data name="output" format="tiff" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <help> +**What it does** + +The user supplies the following: + + 1. An optional file in gd_genotype or gd_snp format giving the mitochondrial SNPs. + 2. An optional gd_genotype file giving the sequence coverage for each individual at each mitochondrial position. + 3. The minimum depth of sequence coverage. Positions where an individual has less coverage are ignored. + 4. A set of individuals specified with the "Specify individuals" tool. + 5. A file of annotation for the reference mitochondrial sequence. + +The program draws a picture indicating the locations of SNPs and the inadequately covered intervals. </help> </tool>
--- a/filter_gd_snp.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/filter_gd_snp.xml Fri Sep 20 13:25:27 2013 -0400 @@ -67,6 +67,10 @@ <data name="output" format="input" format_source="input" metadata_source="input" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" />
--- a/find_intervals.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/find_intervals.xml Fri Sep 20 13:25:27 2013 -0400 @@ -69,6 +69,10 @@ </data> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" />
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/gd_snp2vcf.pl Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,222 @@ +#!/usr/bin/perl -w +use strict; + +#convert from gd_snp file to vcf file (with dbSNP fields) + +#gd_snp table format: +#1. chr +#2. position (0 based) +#3. ref allele +#4. second allele +#5. overall quality +#foreach individual (6-9, 10-13, ...) +#a. count of allele in 3 +#b. count of allele in 4 +#c. genotype call (-1, or count of ref allele) +#d. quality of genotype call (quality of non-ref allele from masterVar) + +if (!@ARGV) { + print "usage: gd_snp2vcf.pl file.gd_snp[.gz|.bz2] -geno=8[,12:16,20...] -handle=HANDLE -batch=BATCHNAME -ref=REFERENCEID [-bioproj=XYZ -biosamp=ABC -population=POPID[,POPID2...] -chrCol=9 -posCol=9 ] > snpsForSubmission.vcf\n"; + exit; +} + +my $in = shift @ARGV; +my $genoCols = ''; +my $handle; +my $batch; +my $bioproj; +my $biosamp; +my $ref; +my $pop; +my $cr = 0; #allow to use alternate reference? +my $cp = 1; +my $meta; +my $offset = 0; #offset for genotype column, gd_snp vs gd_genotype indivs file +foreach (@ARGV) { + if (/-geno=([0-9,]+)/) { $genoCols .= "$1:"; } + elsif (/-geno=(.*)/) { $genoCols .= readGeno($1); } + elsif (/-off=([0-9])/) { $offset = $1; } + elsif (/-handle=(.*)/) { $handle = $1; } + elsif (/-batch=(.*)/) { $batch = $1; } + elsif (/-bioproj=(.*)/) { $bioproj = $1; } + elsif (/-biosamp=(.*)/) { $biosamp = $1; } + elsif (/-ref=(.*)/) { $ref = $1; } + elsif (/-population=(\S+)/) { $pop = $1; } + elsif (/-chrCol=(\d+)/) { $cr = $1 - 1; } + elsif (/-posCol=(\d+)/) { $cp = $1 - 1; } + elsif (/-metaOut=(.*)/) { $meta = $1; } +} +if ($cr < 0 or $cp < 0) { die "ERROR the column numbers should be 1 based.\n"; } + +#remove trailing delimiters +$genoCols =~ s/,:/:/g; +$genoCols =~ s/[,:]$//; + +my @gnc = split(/,|:/, $genoCols); + +if ($in =~ /.gz$/) { + open(FH, "zcat $in |") or die "Couldn't open $in, $!\n"; +}elsif ($in =~ /.bz2$/) { + open(FH, "bzcat $in |") or die "Couldn't open $in, $!\n"; +}else { + 
open(FH, $in) or die "Couldn't open $in, $!\n"; +} +my @head = prepHeader(); +if (@head) { + print join("\n", @head), "\n"; + #now column headers + print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"; + if (defined $pop) { + $pop =~ s/,$//; + my $t = $pop; + $t =~ s/,/\t/g; + print "\tFORMAT\t$t"; + } + print "\n"; +} +while (<FH>) { + chomp; + if (/^#/) { next; } + if (/^\s*$/) { next; } + my @f = split(/\t/); + #vcf columns: chrom pos id ref alt qual filter info + # info must have VRT=[0-9] 1==SNV 2=indel 6=NoVariation 8=MNV ... + my $vrt = 1; + if ($f[2] !~ /^[ACTG]$/ or $f[3] !~ /^[ACTG]$/) { + die "Sorry this can only do SNV's at this time\n"; + } + if (scalar @gnc == 1 && !defined $pop) { #single genotype column + if (!defined $f[4] or $f[4] == -1) { $f[4] = '.'; } + if ($f[$gnc[0]-1] == 2) { $vrt = 6; } #reference match + if ($f[$gnc[0]-1] == -1) { next; } #no data, don't use + print "$f[$cr]\t$f[$cp]\t$f[$cr];$f[$cp]\t$f[2]\t$f[3]\t$f[4]\t.\tVRT=$vrt\n"; + #TODO? put read counts in comment? 
+ }elsif ($pop) { #do as population + my @cols; + foreach my $gp (split(/:/,$genoCols)) { #foreach population + my @g = split(/,/, $gp); + my $totChrom = 2*(scalar @g); + my $totRef = 0; + foreach my $i (@g) { if (!defined $f[$i-1] or $f[$i-1] == -1) { $totChrom -= 2; next; } $totRef += $f[$i-1]; } + if ($totChrom == $totRef) { $vrt = 6; } + if ($totRef > $totChrom) { die "ERROR likely the wrong column was chosen for genotype\n"; } + my $altCnt = $totChrom - $totRef; + push(@cols, "$totChrom:$altCnt"); + } + print "$f[$cr]\t$f[$cp]\t$f[$cr];$f[$cp]\t$f[2]\t$f[3]\t$f[4]\t.\tVRT=$vrt\tNA:AC\t", join("\t", @cols), "\n"; + }else { #leave allele counts off + my $totChrom = 2*(scalar @gnc); + my $totRef = 0; + foreach my $i (@gnc) { if ($f[$i-1] == -1) { $totChrom -= 2; next; } $totRef += $f[$i-1]; } + if ($totChrom == $totRef) { $vrt = 6; } + print "$f[$cr]\t$f[$cp]\t$f[$cr];$f[$cp]\t$f[2]\t$f[3]\t$f[4]\t.\tVRT=$vrt\n"; + } +} +close FH or die "Couldn't close $in, $!\n"; + +if ($meta) { + open(FH, ">", $meta) or die "Couldn't open $meta, $!\n"; + print FH "TYPE: CONT\n", + "HANDLE: $handle\n", + "NAME: \n", + "FAX: \n", + "TEL: \n", + "EMAIL: \n", + "LAB: \n", + "INST: \n", + "ADDR: \n", + "||\n", + "TYPE: METHOD\n", + "HANDLE: $handle\n", + "ID: \n", + "METHOD_CLASS: Sequence\n", + "TEMPLATE_TYPE: \n", + "METHOD:\n", + "||\n"; + if ($pop) { + my @p = split(/,/, $pop); + foreach my $t (@p) { + print FH + "TYPE: POPULATION\n", + "HANDLE: $handle\n", + "ID: $t\n", + "POPULATION: \n", + "||\n"; + } + } + print FH "TYPE: SNPASSAY\n", + "HANDLE: $handle\n", + "BATCH: $batch\n", + "MOLTYPE: \n", + "METHOD: \n", + "ORGANISM: \n", + "||\n", + "TYPE: SNPPOPUSE | SNPINDUSE\n", + "HANDLE: $handle\n", + "BATCH: \n", + "METHOD: \n", + "||\n"; + + close FH or die "Couldn't close $meta, $!\n"; +} + +exit 0; + +#parse old header and add or create new +sub prepHeader { + my @h; + $h[0] = '##fileformat=VCFv4.1'; + my ($day, $mo, $yr) = (localtime)[3,4,5]; + $mo++; + $yr+=1900; + $h[1] = 
'##fileDate=' . "$yr$mo$day"; + $h[2] = "##handle=$handle"; + $h[3] = "##batch=$batch"; + my $i = 4; + if ($bioproj) { $h[$i] = "##bioproject_id=$bioproj"; $i++; } + if ($biosamp) { $h[$i] = "##biosample_id=$biosamp"; $i++; } + $h[$i] = "##reference=$ref"; ##reference=GCF_999999.99 + #$i++; + #$h[$i] = '##INFO=<ID=LID, Number=1,Type=string, Description="Unique local variation ID or name for display. The LID provided here combined with the handle must be unique for a particular submitter.">' + $i++; + $h[$i] = '##INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type,1 - SNV: single nucleotide variation,2 - DIV: deletion/insertion variation,3 - HETEROZYGOUS: variable, but undefined at nucleotide level,4 - STR: short tandem repeat (microsatellite) variation, 5 - NAMED: insertion/deletion variation of named repetitive element,6 - NO VARIATION: sequence scanned for variation, but none observed,7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used) ,8 - MNV: multiple nucleotide variation with alleles of common length greater than 1,9 - Exception">'; + #sometimes have allele freqs? + if (defined $pop) { + $i++; + $h[$i] = "##FORMAT=<ID=NA,Number=1,Type=Integer,Description=\"Number of alleles for the population.\">"; + $i++; + $h[$i] = '##FORMAT=<ID=AC,Number=.,Type=Integer,Description="Allele count for each alternate allele.">'; + my @p = split(/,/, $pop); + foreach my $t (@p) { + $i++; + $h[$i] = "##population_id=$t"; + } + } + #PMID? 
+##INFO=<ID=PMID,Number=.,Type=Integer,Description="PubMed ID linked to variation if available."> + + return @h; +} +####End + +#read genotype columns from a file +sub readGeno { + my $list = shift @_; + my @files = split(/,/, $list); + my $cols=''; + foreach my $file (@files) { + open(FH, $file) or die "Couldn't read $file, $!\n"; + while (<FH>) { + chomp; + my @f = split(/\s+/); + if ($f[0] =~/\D/) { die "ERROR expect an integer for the column\n"; } + $f[0] += $offset; + $cols .= "$f[0],"; + } + close FH; + $cols .= ":"; + } + $cols =~ s/,:$//; + return $cols; +} +####End
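The per-population branch of the script above packs the NA:AC arithmetic into a single dense Perl loop. As a sanity check, here is a hypothetical Python sketch of the same counting rule (the function name `population_counts` is illustrative, not part of the tool): each individual contributes two chromosomes, a missing genotype (-1) drops both of that individual's chromosomes, and the alternate-allele count is whatever remains after subtracting the summed reference-allele genotypes.

```python
# Hypothetical sketch (not part of gd_snp2vcf.pl) of the per-population
# NA:AC computation: genotypes are counts of the reference allele per
# individual (0, 1, 2), with -1 meaning "missing".
def population_counts(genotypes):
    tot_chrom = 2 * len(genotypes)   # two chromosomes per individual
    tot_ref = 0
    for g in genotypes:
        if g == -1:                  # missing genotype: drop both chromosomes
            tot_chrom -= 2
            continue
        tot_ref += g
    if tot_ref > tot_chrom:
        raise ValueError("likely the wrong column was chosen for genotype")
    return tot_chrom, tot_chrom - tot_ref   # (NA, AC)

print(population_counts([2, 2, 2, -1]))   # -> (6, 0)
```

When the two values are equal to each other's complement (AC of 0, i.e. all reference), the Perl script marks the site VRT=6 ("no variation"), matching the `$totChrom == $totRef` test above.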
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/gd_snp2vcf.xml Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,155 @@ +<tool id="gd_gd_snp2vcf" name="gd_snp to VCF" version="1.1.0" force_history_refresh="True"> + <description>: Convert from gd_snp or gd_genotype to VCF format, for submission to dbSNP</description> + + <command interpreter="perl"> + gd_snp2vcf.pl "$input" -handle=$hand -batch=$batch -ref=$ref -metaOut=$output2 + #if $individuals.choice == '0': + #set $geno = '' + #for $individual_col in $input.dataset.metadata.individual_columns + ##need to check to number of cols per individual + #if $input.ext == "gd_snp": + #set $t = $individual_col + 2 + #else if $input.ext == "gd_genotype": + #set $t = $individual_col + #else: + #set $t = $individual_col + #end if + #set $geno += "%d," % ($t) + #end for + #if $individuals.pall_id != '': + -population=$individuals.pall_id + #end if + #else if $individuals.choice == '1': + #set $geno = '' + #set $pop = '' + #if $input.ext == "gd_snp": + -off=2 + #else if $input.ext == "gd_genotype": + -off=0 + #else: + -off=2 + #end if + #for $population in $individuals.populations + #set $geno += "%s," % ($population.p1_input) + #set $pop += "%s," % ($population.p1_id) + #end for + -population=$pop + #else if $individuals.choice == '2': + #set $geno = $individuals.geno + #end if + -geno=$geno + #if $bioproj.value != '': + -bioproj=$bioproj + #end if + #if $biosamp.value != '': + -biosamp=$biosamp + #end if + > $output + </command> + + <inputs> + <param name="input" type="data" format="gd_snp,gd_genotype" label="SNP dataset" /> + <conditional name="individuals"> + <param name="choice" type="select" label="Generate dataset for"> + <option value="0" selected="true">All individuals</option> + <option value="1">Individuals in populations</option> + <option value="2">A single individual</option> + </param> + <when value="0"> + <param name="pall_id" type="text" size="20" label="ID for this population" help="Leaving this blank will omit 
allele counts from the output" /> + </when> + <when value="1"> + <repeat name="populations" title="Population" min="1"> + <param name="p1_input" type="data" format="gd_indivs" label="Population individuals" /> + <param name="p1_id" type="text" size="20" label="ID for this population" help="Leaving this blank will omit allele counts from the output" /> + </repeat> + </when> + <when value="2"> + <param name="geno" type="data_column" data_ref="input" label="Column containing genotype" value="8" /> + </when> + </conditional> + <param name="hand" type="text" size="20" label="dbSNP handle" help="If you do not have a handle, request one at http://www.ncbi.nlm.nih.gov/projects/SNP/handle.html" /> + <param name="batch" type="text" size="20" label="Batch ID" help="ID used to tie dbSNP metadata to the VCF submission" /> + <param name="ref" type="text" size="20" label="Reference sequence ID" help="The RefSeq assembly accession.version on which the SNP positions are based (see http://www.ncbi.nlm.nih.gov/assembly/)" /> + <param name="bioproj" type="text" size="20" label="Optional: Registered BioProject ID" /> + <param name="biosamp" type="text" size="20" label="Optional: Comma-separated list of registered BioSample IDs" /> + </inputs> + + <outputs> + <data name="output" format="vcf" /> + <data name="output2" format="text" /> + </outputs> + + <tests> + <test> + <param name="input" value="sample.gd_snp" ftype="gd_snp" /> + <param name="choice" value="2" /> + <param name="geno" value="11" /> + <param name="hand" value="MyHandle" /> + <param name="batch" value="Test1" /> + <param name="ref" value="pb_000001.1" /> + <output name="output" file="snpsForSubmission.vcf" ftype="vcf" compare="diff" /> + <output name="output2" file="snpsForSubmission.text" ftype="text" compare="diff" /> + </test> + </tests> + + <help> + +**Dataset formats** + +The input dataset is in gd_snp_ or gd_genotype_ format. 
+The output consists of two datasets needed for submitting SNPs: +a VCF_ file in the specific format required by dbSNP, and a partially +completed text_ file for the associated dbSNP metadata. +(`Dataset missing?`_) + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_genotype: ./static/formatHelp.html#gd_genotype +.. _VCF: ./static/formatHelp.html#vcf +.. _text: ./static/formatHelp.html#text +.. _Dataset missing?: ./static/formatHelp.html + +----- + +**What it does** + +This tool converts a dataset in gd_snp or gd_genotype format to a VCF file formatted +for submission to the dbSNP database at NCBI. It also creates a partially +filled-in template to assist you in preparing the required "metadata" file +describing the SNP submission. + +----- + +**Example** + +- input:: + + #{"column_names":["scaf","pos","A","B","qual","ref","rpos","rnuc","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q","4A","4B","4G","4Q","5A","5B","5G","5Q","6A","6B","6G","6Q","pair","dist", + #"prim","rflp"],"dbkey":"canFam2","individuals":[["PB1",9],["PB2",13],["PB3",17],["PB4",21],["PB6",25],["PB8",29]],"pos":2,"rPos":7,"ref":6,"scaffold":1,"species":"bear"} + Contig161 115 C T 73.5 chr1 4641382 C 6 0 2 45 8 0 2 51 15 0 2 72 5 0 2 42 6 0 2 45 10 0 2 57 Y 54 0.323 0 + Contig48 11 A G 94.3 chr1 10150264 A 1 0 2 30 1 0 2 30 1 0 2 30 3 0 2 36 1 0 2 30 1 0 2 30 Y 22 +99. 0 + Contig20 66 C T 54.0 chr1 21313534 C 4 0 2 39 4 0 2 39 5 0 2 42 4 0 2 39 4 0 2 39 5 0 2 42 N 1 +99. 0 + etc. + +- VCF output (for all individuals, and giving a population ID):: + + #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PB + Contig161 115 Contig161;115 C T 73.5 . VRT=6 NA:AC 8:0 + Contig48 11 Contig48;11 A G 94.3 . VRT=6 NA:AC 8:0 + Contig20 66 Contig20;66 C T 54.0 . VRT=6 NA:AC 8:0 + etc. + +Note: This excerpt from the output does not show all of the headers. Also, +if the population ID had not been given, then the last two columns would not +appear in the output.
+ +----- + +**Reference** + +Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. +dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 +Jan 1;29(1):308-11. + + </help> +</tool>
--- a/genome_diversity/Makefile Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,8 +0,0 @@ -all: - cd src && make - -clean: - cd src && make clean - -install: - cd src && make install
--- a/genome_diversity/bin/gd_ploteig Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,172 +0,0 @@ -#!/usr/bin/env perl - -### ploteig -i eigfile -p pops -c a:b [-t title] [-s stem] [-o outfile] [-x] [-k] [-y] [-z sep] -use Getopt::Std ; -use File::Basename ; -use warnings ; - -## pops : separated -x = make postscript and pdf -z use another separator -## -k keep intermediate files -## NEW if pops is a file names are read one per line - -getopts('i:o:p:c:s:d:z:t:xky',\%opts) ; -$postscmode = $opts{"x"} ; -$oldkeystyle = $opts{"y"} ; -$kflag = $opts{"k"} ; -$keepflag = 1 if ($kflag) ; -$keepflag = 1 unless ($postscmode) ; - -$zsep = ":" ; -if (defined $opts{"z"}) { - $zsep = $opts{"z"} ; - $zsep = "\+" if ($zsep eq "+") ; -} - -$title = "" ; -if (defined $opts{"t"}) { - $title = $opts{"t"} ; -} -if (defined $opts{"i"}) { - $infile = $opts{"i"} ; -} -else { - usage() ; - exit 0 ; -} -open (FF, $infile) || die "can't open $infile\n" ; -@L = (<FF>) ; -chomp @L ; -$nf = 0 ; -foreach $line (@L) { - next if ($line =~ /^\s+#/) ; - @Z = split " ", $line ; - $x = @Z ; - $nf = $x if ($nf < $x) ; -} -printf "## number of fields: %d\n", $nf ; -$popcol = $nf-1 ; - - -if (defined $opts{"p"}) { - $pops = $opts{"p"} ; -} -else { - die "p parameter compulsory\n" ; -} - -$popsname = setpops ($pops) ; -print "$popsname\n" ; - -$c1 = 1; $c2 =2 ; -if (defined $opts{"c"}) { - $cols = $opts{"c"} ; - ($c1, $c2) = split ":", $cols ; - die "bad c param: $cols\n" unless (defined $cols) ; -} - -$stem = "$infile.$c1:$c2" ; -if (defined $opts{"s"}) { - $stem = $opts{"s"} ; -} -$gnfile = "$stem.$popsname.xtxt" ; - -if (defined $opts{"o"}) { - $gnfile = $opts{"o"} ; -} - -@T = () ; ## trash -open (GG, ">$gnfile") || die "can't open $gnfile\n" ; -print GG "## " unless ($postscmode) ; -print GG "set terminal postscript color\n" ; -print GG "set style line 2 lc rgbcolor \"#376600\"\n"; -print GG "set style line 11 lc rgbcolor \"#376600\"\n"; -print GG "set style line 
20 lc rgbcolor \"#376600\"\n"; -print GG "set style line 29 lc rgbcolor \"#376600\"\n"; -print GG "set style line 6 lc rgbcolor \"#FFCC00\"\n"; -print GG "set style line 15 lc rgbcolor \"#FFCC00\"\n"; -print GG "set style line 24 lc rgbcolor \"#FFCC00\"\n"; -print GG "set style increment user\n"; -print GG "set title \"$title\" \n" ; -print GG "set key outside\n" unless ($oldkeystyle) ; -print GG "set xlabel \"eigenvector $c1\" \n" ; -print GG "set ylabel \"eigenvector $c2\" \n" ; -print GG "plot " ; -$np = @P ; -$lastpop = $P[$np-1] ; -$d1 = $c1+1 ; -$d2 = $c2+1 ; -foreach $pop (@P) { - $dfile = "$stem:$pop" ; - push @T, $dfile ; - print GG " \"$dfile\" using $d1:$d2 title \"$pop\" " ; - print GG ", \\\n" unless ($pop eq $lastpop) ; - open (YY, ">$dfile") || die "can't open $dfile\n" ; - foreach $line (@L) { - next if ($line =~ /^\s+#/) ; - @Z = split " ", $line ; - next unless (defined $Z[$popcol]) ; - next unless ($Z[$popcol] eq $pop) ; - print YY "$line\n" ; - } - close YY ; -} -print GG "\n" ; -print GG "## " if ($postscmode) ; -print GG "pause 9999\n" ; -close GG ; - -if ($postscmode) { -$psfile = "$stem.ps" ; - - if ($gnfile =~ /xtxt/) { - $psfile = $gnfile ; - $psfile =~ s/xtxt/ps/ ; - } -system "gnuplot < $gnfile > $psfile" ; -#system "fixgreen $psfile" ; -system "ps2pdf $psfile " ; -} -unlink (@T) unless $keepflag ; - -sub usage { - -print "ploteig -i eigfile -p pops -c a:b [-t title] [-s stem] [-o outfile] [-x] [-k]\n" ; -print "-i eigfile input file first col indiv-id last col population\n" ; -print "## as output by smartpca in outputvecs \n" ; -print "-c a:b a, b columns to plot. 1:2 would be common and leading 2 eigenvectors\n" ; -print "-p pops Populations to plot. : delimited. eg -p Bantu:San:French\n" ; -print "## pops can also be a filename. List populations 1 per line\n" ; -print "[-s stem] stem will start various output files\n" ; -print "[-o ofile] ofile will be gnuplot control file. 
Should have xtxt suffix\n"; -print "[-x] make ps and pdf files\n" ; -print "[-k] keep various intermediate files although -x set\n" ; -print "## necessary if .xtxt file is to be hand edited\n" ; -print "[-y] put key at top right inside box (old mode)\n" ; -print "[-t] title (legend)\n" ; - -print "The xtxt file is a gnuplot file and can be easily hand edited. Intermediate files -needed if you want to make your own plot\n" ; - -} -sub setpops { - my ($pops) = @_ ; - local (@a, $d, $b, $e) ; - - if (-e $pops) { - open (FF1, $pops) || die "can't open $pops\n" ; - @P = () ; - foreach $line (<FF1>) { - ($a) = split " ", $line ; - next unless (defined $a) ; - next if ($a =~ /\#/) ; - push @P, $a ; - } - $out = join ":", @P ; - print "## pops: $out\n" ; - ($b, $d , $e) = fileparse($pops) ; - return $b ; - } - @P = split $zsep, $pops ; - return $pops ; - -}
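The core of gd_ploteig's output is a gnuplot control file with one data series per population, plotting eigenvector columns c1 and c2 of the smartpca output. A hypothetical Python sketch of how that `plot` command line is assembled (the helper name `make_plot_command` is illustrative only; gnuplot's `using` columns are 1-based and the first data column is the individual ID, so eigenvector c1 comes from data column c1+1):

```python
# Hypothetical sketch of gd_ploteig's gnuplot "plot" line: one series per
# population, each reading from a per-population data file "<stem>:<pop>".
def make_plot_command(stem, pops, c1, c2):
    d1, d2 = c1 + 1, c2 + 1   # shift past the indiv-id column
    series = [f'"{stem}:{pop}" using {d1}:{d2} title "{pop}"' for pop in pops]
    # gnuplot continues a command across lines with a trailing backslash
    return "plot " + ", \\\n".join(series)

cmd = make_plot_command("eigs.1:2", ["Bantu", "San"], 1, 2)
print(cmd)
```

The real script additionally writes the per-population data files by filtering rows of the eigenvector file on the population label in the last column.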
--- a/genome_diversity/bin/varplot Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,464 +0,0 @@ -#!/usr/bin/env python - -""" -Take a specification file and draw the Miller plot. - -The specification should have a header where each line starts with a "@". "@" -should be followed by a tag and value separated by a "=". Currently the only -defined tag is GL which is the genome length of the genome under consideration. - -The lines after the header should be one for each genome. The first column -should be the name of the individual/genome followed by the space separated -positions which need to be marked. - -An example spec file will be as follows: - -@GN=IR_3 -@GL=14574 -@GA=0:64::tRNA -@GA=64:1035:nad2:gene -@GA=1035:1100::tRNA -@GA=1092:1153::tRNA -@GA=1160:1226::tRNA -@GA=1218:2757:cox1:gene -@GA=2764:3440:cox2:gene -@GA=3440:3509::tRNA -@GA=3508:3574::tRNA -@GA=3574:3730:atp8:gene -@GA=3723:4389:atp6:gene -@GA=4395:5173:cox3:gene -@GA=5173:5236::tRNA -@GA=5236:5572:nad3:gene -@GA=5572:5633::tRNA -@GA=5632:5696::tRNA -@GA=5695:5763::tRNA -@GA=5765:5820::tRNA -@GA=5820:5885::tRNA -@GA=5883:5948::tRNA -@GA=5948:7617:nad5:gene -@GA=7617:7678::tRNA -@GA=7680:8997:nad4:gene -@GA=8990:9266:nad4L:gene -@GA=9268:9330::tRNA -@GA=9330:9395::tRNA -@GA=9397:9826:nad6:gene -@GA=9829:10910:cob:gene -@GA=10910:10976::tRNA -@GA=10993:11912:nad1:gene -@GA=11912:11978::tRNA -@GA=11992:12053::tRNA -@GA=12034:13289:rrnL:gene -@GA=12034:13289:16S:rRNA -@GA=13289:13351::tRNA -@GA=13351:14069:rrnS:gene -@GA=13351:14069:12S:rRNA -@GA=14423:14492::tRNA -@GA=14499:14569::tRNA -@CL=rRNA:#2B83BA -@CL=tRNA:#FFFFBF -@CL=gene:#D7191C -@CL=special:#000000 -@CL=indel:#FDAE61 -@CL=missing:#ABDDA4 -IR_65 2618 3267 3752 7768 8523 special=10177 10848 12790 13157 indel=3500:3560 -missing=4000:6000 -IR_66 2618 3267 3752 7768 8523 special=10177 10848 12790 13157 missing=4000:6000 -IR_63 2618 3267 3752 4883 8523 9798 10848 13157 missing=1:1000 -""" - -from sys import 
argv, stderr, exit -from getopt import getopt, GetoptError -from commands import getstatusoutput - -__author__ = "Aakrosh Ratan" -__email__ = "ratan@bx.psu.edu" - -# do we want the debug information to be printed? -debug_flag = False - -varstrokewidth = 1 - -# print the header for the image.. Inputs are in cm -def printheader(imagewidth, imageheight): - print "<?xml version=\"1.0\" standalone=\"no\"?>" - print "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"" - print "\"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">" - - print "<svg width=\"%dcm\"" % imagewidth - print "\theight=\"%dcm\"" % imageheight - print "\tviewBox=\"0 0 %d %d\"" % (100*imagewidth, 100*imageheight) - print "\txmlns=\"http://www.w3.org/2000/svg\" version=\"1.1\">" - -# print the footer for the svg image.. Inputs are in cm -def printfooter(): - print "</svg>" - -# print a rectangle -def printrectangle(x, y, w, h, - sw = 2, - so = 1.0, - rx = None, - cp = None, - fl = "none", - fo = 1.0): -# print >> stderr, "Rectangle: %d %d %d %d" % (x, y, w, h) - print "<rect x=\"%d\"" % x - print "\ty=\"%d\"" % y - print "\twidth=\"%d\"" % w - print "\theight=\"%d\"" % h - if rx != None: print "\trx=\"%d\"" % rx - print "\tstroke=\"black\"" - if cp != None: print "\tclip-path=\"url(#g%d)\"" % cp - print "\tstroke-width=\"%d\"" % sw - print "\tstroke-opacity=\"%2.2f\"" % so - print "\tfill-opacity=\"%2.2f\"" % fo - print "\tfill=\"%s\" />" % fl - -def printtext(x, y, text, - fontfamily = "Times", - fontweight = "bold", - fontsize = "0.9cm", - fontvariant = "normal", - fill = "black", - dx = 0): -# print >> stderr, "Text: %d %d %s" % (x, y, text) - print "<text x=\"%d\"" % x - print "\tdx=\"%d\"" % dx - print "\ty=\"%d\"" % y - print "\tfill=\"%s\"" % fill - print "\tfont-family=\"%s\"" % fontfamily - print "\tfont-weight=\"%s\"" % fontweight - print "\tfont-size=\"%s\"" % fontsize - print "\tfont-variant=\"%s\">" % fontvariant - print "\t%s" % text - print "</text>" - -def printline(x1, y1, x2, y2, sw = 
1, cp = None, sc = "black"): -# print >> stderr, "Line: %d %d %d %d" % (x1, y1, x2, y2) - print "<line x1=\"%d\"" % x1 - print "\ty1=\"%d\"" % y1 - print "\tx2=\"%d\"" % x2 - print "\ty2=\"%d\"" % y2 - print "\tstroke=\"%s\"" % sc - if cp != None: print "\tclip-path=\"url(#g%d)\"" % cp - print "\tstroke-width=\"%d\"/>" % sw - -def main(filename, cm2bp, eachplotheight, vgap): - file = open(filename, "r") - - # the genome length - genomelength = None - # the name of the genome - genomename = None - # the attributes for the genome - attributes = [] - # the colors of the various attributes - colors = {} - # how much of the box should be filled - filltypes = {} - - # the variants that should be marked - variants = {} - # the order in which I want to display the names - names = [] - - # let's read the spec file and make sure the format is correct - for line in file: - if line.startswith("\n"): continue - - if line.startswith("#"): continue - - if line.startswith("@"): - tag,value = line.strip().split("=") - if tag[1:] == "GL": - genomelength = int(value) - elif tag[1:] == "GN": - genomename = value - elif tag[1:] == "GA": - tokens = value.split(":") - assert len(tokens) == 4 - attributes.append(tokens) - elif tag[1:] == "CL": - tokens = value.split(":") - filltype = "C" - if len(tokens) == 2: - name,color = tokens - elif len(tokens) == 3: - name,color,filltype = tokens - else: - print >> stderr, "Incorrect specification for colors" - exit(4) - colors[name] = color - assert filltype in ["C","L","U"], "color can include C,L,U attributes for the fillsizes" - filltypes[name] = filltype - else: - print >> stderr, "Undefined tag: %s" % tag - exit(5) - continue - - tokens = line.strip().split() - - name = tokens[0] - - positions = [] - specialpositions = [] - intervals = [] - for token in tokens[1:]: - if token.find("=") == -1: - positions.append(int(token)) - else: - type,position = token.split("=") - if position.find(":") == -1: - specialpositions.append((type, int(position)))
- else: - s,e = position.split(":") - intervals.append((type, int(s), int(e))) - - variants[name] = (positions, specialpositions, intervals) - names.append(name) - file.close() - - # if genomename or genomelength is not specified, tell the user that it is - # necessary to do so - if genomename == None or genomelength == None: - print >> stderr, \ - "Please specify tags for genomename (@GN) and genomelength (@GL)" - exit(6) - - # how much space would I need for the name - namelengths = [len(x) for x in names] - namelengths.append(len(genomename)) - namelength = max(namelengths) - - # gap between the name and the bar plots themselves (in cm) - hgap = 0.3 - - # the padding on the left side (in cm) - lpad = 0.5 - # the padding on the right (in cm) - rpad = 1.0 - # the padding on the top (in cm) - tpad = 0.5 - # the padding on the bottom (in cm) - bpad = 0.5 - - # convert cm into pt - cm2pt = 100 - - # so the width of the image is going to be : - # lpad + namelength + hgap + (genomelength/cm2bp) + rpad - # the image will be 1 cm wide for each cm2bp genome locations - # mf is the multiplication factor to convert a character count into cm.
- mf = 0.20 - imagewidth = lpad + (namelength*mf) + hgap + (genomelength/cm2bp) + rpad - - # the height of the image is going to be - if len(attributes) == 0: - imageheight = tpad + (len(names) * (eachplotheight + vgap)) + bpad - else: - imageheight = tpad + ((len(names)+3) * (eachplotheight + vgap)) + bpad - - # start the image - printheader(imagewidth, imageheight) - - if debug_flag == True: - printrectangle(0, 0, imagewidth * cm2pt, imageheight * cm2pt) - printrectangle(0, 0, lpad * cm2pt, imageheight * cm2pt) - printrectangle(lpad*cm2pt, 0, namelength*mf*cm2pt, imageheight*cm2pt) - printrectangle((lpad+(namelength*mf))*cm2pt, 0, hgap*cm2pt,imageheight*cm2pt) - printrectangle((lpad+(namelength*mf)+hgap)*cm2pt, 0, genomelength*cm2pt/cm2bp, imageheight*cm2pt) - - # set up persistent variables in the documents - docstart = lpad * cm2pt - figstart = (lpad * cm2pt) + (namelength * mf * cm2pt) + (hgap * cm2pt) - figlen = genomelength * cm2pt / cm2bp - - # print the details for all the individuals - y = tpad * cm2pt - for index,name in enumerate(names): - if debug_flag == True: - printrectangle(0, y, imagewidth * cm2pt, eachplotheight * cm2pt) - - printtext(docstart, y + (eachplotheight * 85), name) - printrectangle(figstart, y, figlen, eachplotheight * cm2pt) - - # print vertical lines for the variants - positions = variants[name][0] - for position in positions: - x = figstart + (position * cm2pt / cm2bp) - printline(x, y, x, y + (eachplotheight * 100), sw = varstrokewidth) - - # print colored lines for special variants - specialpositions = variants[name][1] - for type,position in specialpositions: - x = figstart + (position * cm2pt / cm2bp) - h = eachplotheight * 100 - if filltypes[type] == "C": - printline(x, y, x, y + h, sc = colors[type], sw = 4) - elif filltypes[type] == "L": - printline(x, y + h/2, x, y + h, sc = colors[type], sw = 4) - elif filltypes[type] == "U": - printline(x, y, x, y + h/2, sc = colors[type], sw = 4) - - else: - print >> stderr, "Incorrect 
fillsize type specified" - exit(7) - - # print translucent rectangles for the missing regions and indels - intervals = variants[name][2] - for type,s,e in intervals: - s = int(s) - e = int(e) - x = figstart + (s * cm2pt / cm2bp) - w = (e - s) * cm2pt / cm2bp - h = eachplotheight * 100 - if filltypes[type] == "C": - printrectangle(x, y, w, h, sw=1, so=0.1, fl=colors[type]) - elif filltypes[type] == "L": - printrectangle(x, y + h/2, w, h/2, sw=1, so=0.1,fl=colors[type]) - elif filltypes[type] == "U": - printrectangle(x, y, w, h/2, sw=1, so=0.1, fl=colors[type]) - else: - print >> stderr, "Incorrect fillsize type specified" - exit(8) - - y += ((eachplotheight + vgap) * cm2pt) - - # print the attributes if we have any - if len(attributes) > 0: - if debug_flag == True: - printrectangle(0, y, imagewidth * cm2pt, eachplotheight * cm2pt) - - printtext(docstart, y + (eachplotheight * 85), genomename) - printrectangle(figstart, y, figlen, eachplotheight * cm2pt) - - # lets color the attributes as specified by the user - for index,(start,stop,name,group) in enumerate(attributes): - start = int(start) - stop = int(stop) - - x = figstart + (start * cm2pt / cm2bp) - printrectangle(x, - y, - (stop - start) * cm2pt / cm2bp, - eachplotheight * cm2pt, - fl = colors.get(group, "White")) - - # can I fit the name of the gene/feature in the colored area? - wordlen = len(name) * mf - if (wordlen * cm2pt) < ((stop - start) * cm2pt / cm2bp): - if group in colors: - color = "White" - else: - color = "Black" - - printtext(x, - y + (eachplotheight * 85), - name, - fill = color, - dx = (((stop-start)*cm2pt/cm2bp)-(wordlen*cm2pt))/2) - else: - # will this name fit at all, even at the bottom? Where is the - # next text label that I need to write? 
- tmpidx = index + 1 - while tmpidx < len(attributes) and \ - len(attributes[tmpidx][2]) == 0: - tmpidx += 1 - - if tmpidx < len(attributes): - nextstart = int(attributes[tmpidx][0]) - if ((wordlen*cm2pt) < ((nextstart-start) * cm2pt/cm2bp)): - printtext(x, - y + (eachplotheight + vgap) * cm2pt, - name, - colors.get(group, "Black")) - - y += ((eachplotheight + vgap) * cm2pt) - - # print the coordinates on a line - y += vgap * cm2pt - if debug_flag == True: - printrectangle(0, y, imagewidth * cm2pt, eachplotheight * cm2pt) - - printline(figstart, y, figstart + figlen, y) - - x = figstart - ticlength = 15 - for i in range(0, genomelength, 2000): - printline(x, y, x, y + ticlength) - printtext(x, y + ticlength + vgap * cm2pt, str(i), fontweight="normal") - x += (2000 * cm2pt / cm2bp) - printline(figstart + figlen, y, figstart + figlen, y + ticlength) - - # print the legend if there were attributes - if len(attributes) > 0: - if debug_flag == True: - printrectangle(0, y, imagewidth * cm2pt, eachplotheight * cm2pt) - - y += ((eachplotheight + 2 * vgap) * cm2pt) - x = figstart - - for name,color in colors.items(): - printtext(x, y, name, fontsize = "0.9cm") - x += ((len(name) + 1) * mf * cm2pt) - printrectangle(x, - y - eachplotheight * cm2pt + 10, - 100, - eachplotheight * cm2pt * 3 / 4, - fl = color) - x += 125 - - # end of the image - printfooter() - -def usage(): - f = stderr - print >> f, "usage:" - print >> f, "varplot [options] specfile" - print >> f, "where the options are:" - print >> f, "-h,--help : print usage and quit" - print >> f, "-d,--debug: print debug information" - print >> f, "-w,--strokewidth: stroke width for normal variants [1]" - print >> f, "-s,--eachplotheight : height of the plot for an individual (in cm) [0.4]" - print >> f, "-g,--eachplotgap : vertical gap between plots of different individuals (in cm) [0.4]" - -if __name__ == "__main__": - try: - opts, args = getopt(argv[1:],"hdw:s:g:",["help","debug","strokewidth=", "eachplotheight=",
"eachplotgap="]) - except GetoptError, err: - print str(err) - usage() - exit(2) - - # number of bases to be drawn in 1 cm - cm2bp = 1000 - - # the strokewidth used to show simple SNPs - varstrokewidth = 1 - - # the height of the plot for each individual (in cm) - eachplotheight = 0.4 - - # vertical gap between plots of different individuals (in cm) - vgap = 0.4 - - for o, a in opts: - if o in ("-h", "--help"): - usage() - exit() - elif o in ("-d", "--debug"): - debug_flag = True - elif o in ("-w", "--strokewidth"): - varstrokewidth = int(a) - elif o in ("-s", "--eachplotheight"): - eachplotheight = float(a) - elif o in ("-g", "--eachplotgap"): - vgap = float(a) - else: - assert False, "unhandled option" - - if len(args) != 1: - usage() - exit(3) - - main(args[0], cm2bp, eachplotheight, vgap)
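varplot's docstring defines the spec-file header: "@TAG=value" lines, where GA values are the colon-separated fields start:stop:name:group and CL values map a group name to a color. A hypothetical stand-alone Python 3 mini-parser for just that header, written to mirror the parsing loop in the script (the name `parse_header` and the returned dict layout are illustrative only):

```python
# Hypothetical parser for the varplot spec-file header described above.
def parse_header(lines):
    spec = {"GL": None, "GN": None, "GA": [], "CL": {}}
    for line in lines:
        if not line.startswith("@"):   # header ends at the first genome row
            break
        tag, value = line.strip().split("=", 1)
        tag = tag[1:]                  # drop the leading "@"
        if tag == "GL":
            spec["GL"] = int(value)    # genome length
        elif tag == "GN":
            spec["GN"] = value         # genome name
        elif tag == "GA":              # annotation: start:stop:name:group
            start, stop, name, group = value.split(":")
            spec["GA"].append((int(start), int(stop), name, group))
        elif tag == "CL":              # color: group:hexcolor[:filltype]
            name, color = value.split(":")[:2]
            spec["CL"][name] = color
    return spec

spec = parse_header(["@GN=IR_3", "@GL=14574",
                     "@GA=64:1035:nad2:gene", "@CL=gene:#D7191C"])
print(spec["GL"], spec["CL"]["gene"])
```

The real script additionally tracks the optional third CL field (C/L/U fill type) and rejects unknown tags with an error.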
--- a/genome_diversity/src/Fst_ave.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,361 +0,0 @@ -/* Fst_ave -- determine four FST values between two specified populations, -* and optionally between several pairs of random populations -* -* argv[1] = a Galaxy SNP table. For each of several individuals, the table -* has four columns (#A, #B, genotype, quality). -* argv[2] = 1 if FST is estimated from SAMtools genotypes; 0 means use -* read-coverage data. -* argv[3] = lower bound, for individual quality value if argv[2] = 1, -* or for total number of reads per population if argv[2] = 0. -* SNPs not satisfying these lower bounds are ignored. -* argv[4] = 1 to discard SNPs that appear fixed in the two populations -* argv[5] = k says report the maximum and average FST over k randomly -* chosen splits into two populations of the two original sizes -* argv[6], argv[7], ..., have the form "13:1", "13:2" or "13:0", meaning -* that the 13th and 14th columns (base 1) give the allele counts -* (and column 15 gives the genotype) for an individual that is in -* population 1, in population 2, or in neither population. - -What it does on Galaxy - -The user specifies a SNP table and two "populations" of individuals, both previously defined using the Galaxy tool to specify individuals from a SNP table. No individual can be in both populations. Other choices are as follows. - -Data source. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. - -After specifying the data source, the user sets lower bounds on the amount of data required at a SNP. For estimating the FST using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimates based on genotypes, the bound is the minimum reported genotype quality per individual. 
SNPs not meeting these lower bounds are ignored. - -The user specifies whether SNPs where both populations appear to be fixed for the same allele should be retained or discarded. - -Finally, the user decides whether to use randomizations. If so, then the user specifies how many randomly generated population pairs (retaining the numbers of individuals of the originals) to generate, as well as the "population" of additional individuals (not in the first two populations) that can be used in the randomization process. - -The program prints the following measures of FST for the two populations. -1. The formulation by Sewall Wright (average over FSTs for all SNPs). -2. The Weir-Cockerham estimator (average over FSTs for all SNPs). -3. The Reich-Patterson estimator (average over FSTs for all SNPs). -4. The population-based Reich-Patterson estimator. - -If randomizations were requested, it prints a summary for each of the four definitions of FST that includes the maximum and average value, and the highest-scoring population pair (if any scored higher than the two user-specified populations). - -References: - -Sewall Wright (1951) The genetical structure of populations. Ann Eugen 15:323-354. - -B. S. Weir and C. Clark Cockerham (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. - -B. S. Weir (1996) Population substructure. Genetic data analysis II, pp. 161-173. Sinauer Associates, Sunderland, MA. - -David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh (2009) Reconstructing Indian population history. Nature 461:489-494, especially Supplement 2. - -Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the following paper. - -Eva-Maria Willing, Christine Dreyer, Cock van Oosterhout (2012) Estimates of genetic differentiation measured by FST do not necessarily require large sample sizes when using many SNP markers. PLoS One 7:e42649. 
- -*/ - -#include "lib.h" -#include "Fst_lib.h" - -// maximum length of a line from the table -#define MOST 50000 - -// information about the specified individuals -// x is an array of nI values 0, 1, or 2; -// shuffling x creates random "populations" -int col[MOST], x[MOST]; -int nI, lower_bound, discard, genotypes, nsnp, nfail; -double F_wright, F_weir, F_reich, N_reich, D_reich; - -// each SNP has an array of counts -struct count { - int A, B; -}; - -// linked list summarizes the Galaxy table -struct snp { - struct count *c; - struct snp *next; -} *start, *last; - -/* For each of wright, weir and reich, we observe allele counts A1 and A2 -* for one allele in the two populations, and B1 and B2 for the other allele. -*/ - -// given the two populations specified by x[], compute four corresponding FSTs -void pop_Fst() { - double N, D; - struct snp *s; - int i, A1, B1, A2, B2, too_few; - - - // scan the SNPs - F_wright = F_weir = F_reich = N_reich = D_reich = 0.0; - nsnp = nfail = 0; - for (s = start; s != NULL; s = s->next) { - // get counts for the two populations at this SNP - for (A1 = B1 = A2 = B2 = i = 0; i < nI; ++i) { - if (s->c[i].A < 0) // no genotypes - continue; - if (x[i] == 1) { - A1 += s->c[i].A; - B1 += s->c[i].B; - } else if (x[i] == 2) { - A2 += s->c[i].A; - B2 += s->c[i].B; - } - } - if (discard && ((A1 == 0 && A2 == 0) || (B1 == 0 && B2 == 0))) - continue; // fixed in these two populations - too_few = (genotypes ? 1 : lower_bound); - if (A1+B1 >= too_few && A2+B2 >= too_few) { - ++nsnp; - wright(A1, A2, B1, B2, &N, &D); - if (D != 0.0) - F_wright += N/D; - else - ++nfail; - weir(A1, A2, B1, B2, &N, &D); - if (D != 0.0) - F_weir += N/D; - else - ++nfail; - reich(A1, A2, B1, B2, &N, &D); - N_reich += N; - D_reich += D; - if (D != 0.0) - F_reich += N/D; - else - ++nfail; - } - } - F_wright /= nsnp; - F_weir /= nsnp; - N_reich /= nsnp; - D_reich /= nsnp; - F_reich /= nsnp; -} - -/* shuffle the values x[0], x[1], ... 
, x[nI-1]; -* Uses Algorithm P on page 125 of "The Art of Computer Programming (Vol II) -* Seminumerical Algorithms", by Donald Knuth, Addison-Wesley, 1971. -*/ -void shuffle() { - int i, j, temp; - - for (i = nI - 1; i > 0; --i) { - // swap what's in location i with location j, where 0 <= j <= i - j = random() % (i+1); - temp = x[i]; - x[i] = x[j]; - x[j] = temp; - } -} - -int main(int argc, char **argv) { - FILE *fp; - char *p, *z = "\t\n", buf[MOST]; - int X[MOST], nshuff, n, i, j, k, saw[3], larger0, larger1, larger2, - larger3, best_x0[MOST], best_x1[MOST], best_x2[MOST], best_x3[MOST]; - struct snp *new; - double F, F0, F1, F2, F3, tot_F0, tot_F1, tot_F2, tot_F3, - largest_F0, largest_F1, largest_F2, largest_F3; - - if (argc < 7) - fatal("args: table data-source lower_bound discard? #shuffles n:1 m:2 ..."); - - // handle command-line arguments - genotypes = atoi(argv[2]); - lower_bound = atoi(argv[3]); - if (!genotypes && lower_bound <= 0) - fatal("minimum coverage should exceed 0"); - discard = atoi(argv[4]); - nshuff = atoi(argv[5]); - saw[0] = saw[1] = saw[2] = 0; - // populations 1 and 2 must be disjoint - for (i = 6; i < argc; ++i) { - if (sscanf(argv[i], "%d:%d", &j, &k) != 2) - fatalf("not like 13:2 : %s", argv[i]); - if (k < 0 || k > 2) - fatalf("not population 0, 1 or 2: %s", argv[i]); - saw[k] = 1; - // seen this individual (i.e., column) before?? 
- for (n = 0; n < nI && col[n] != j; ++n) - ; - if (n < nI) { // OK if one of the populations is 0 - if (k > 0) { - if (x[n] > 0 && x[n] != k) - fatalf("column %d is in both populations", j); - x[n] = k; - } - } else { - col[nI] = j; - x[nI] = k; - ++nI; - } - } - if (saw[1] == 0) - fatal("population 1 is empty"); - if (saw[2] == 0) - fatal("population 2 is empty"); - - // read the table of SNPs and store the essential allele counts - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - new = ckalloc(sizeof(*new)); - new->next = NULL; - new->c = ckalloc(nI*sizeof(struct count)); - // set X[i] = atoi(i-th word of buf), i is base 1 - for (i = 1, p = strtok(buf, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = 0; i < nI; ++i) { - n = col[i]; - if (genotypes) { - k = X[n+2]; - if (k == -1) - new->c[i].A = new->c[i].B = -1; - else { - new->c[i].A = k; - new->c[i].B = 2 - k; - } - } else { - new->c[i].A = X[n]; - new->c[i].B = X[n+1]; - } - } - if (start == NULL) - start = new; - else - last->next = new; - last = new; - } - fclose(fp); - - pop_Fst(); - printf("Using %d SNPs, we compute:\n", nsnp); - printf("Average Reich-Patterson FST is %5.5f.\n", F2 = F_reich); - printf("The population-based Reich-Patterson Fst is %5.5f.\n", - F3 = N_reich/D_reich); - printf("Average Weir-Cockerham FST is %5.5f.\n", F1 = F_weir); - printf("Average Wright FST is %5.5f.\n", F0 = F_wright); - if (nfail > 0) - printf("WARNING: %d of %d FSTs could not be computed\n", - nfail, 3*nsnp); - if (nshuff == 0) - return 0; - - // do the following only if randomization is requested - for (j = 0; j < nI; ++j) - best_x0[j] = best_x1[j] = best_x2[j] = best_x3[j] = x[j]; - tot_F0 = tot_F1 = tot_F2 = tot_F3 = - largest_F0 = largest_F1 = largest_F2 = largest_F3 = 0.0; - larger0 = larger1 = larger2 = larger3 = 0; - for (i = 0; i < nshuff; ++i) { - shuffle(); - pop_Fst(); - - // Wright - if ((F = F_wright) > F0) - ++larger0; - if (F > 
largest_F0) { - largest_F0 = F; - for (j = 0; j < nI; ++j) - best_x0[j] = x[j]; - } - tot_F0 += F; -/* - if (all) // make this optional? - printf("%d: %f\n", i+1, F); -*/ - - // Weir - if ((F = F_weir) > F1) - ++larger1; - if (F > largest_F1) { - largest_F1 = F; - for (j = 0; j < nI; ++j) - best_x1[j] = x[j]; - } - tot_F1 += F; - - // Reich average - if ((F = F_reich) > F2) - ++larger2; - if (F > largest_F2) { - largest_F2 = F; - for (j = 0; j < nI; ++j) - best_x2[j] = x[j]; - } - tot_F2 += F; - - // Reich population - if ((F = (N_reich/D_reich)) > F3) - ++larger3; - if (F > largest_F3) { - largest_F3 = F; - for (j = 0; j < nI; ++j) - best_x3[j] = x[j]; - } - tot_F3 += F; - } - printf("\nOf %d random groupings:\n", nshuff); - printf("%d had a larger average Wright FST (max %5.5f, mean %5.5f)\n", - larger0, largest_F0, tot_F0/nshuff); - if (largest_F0 > F0) { - printf("first columns for the best two populations:\n"); - for (i = 0; i < nI; ++i) - if (best_x0[i] == 1) - printf("%d ", col[i]); - printf("and\n"); - for (i = 0; i < nI; ++i) - if (best_x0[i] == 2) - printf("%d ", col[i]); - putchar('\n'); - putchar('\n'); - } - printf("%d had a larger average Weir-Cockerham FST (max %5.5f, mean %5.5f)\n", - larger1, largest_F1, tot_F1/nshuff); - if (largest_F1 > F1) { - printf("first columns for the best two populations:\n"); - for (i = 0; i < nI; ++i) - if (best_x1[i] == 1) - printf("%d ", col[i]); - printf("and\n"); - for (i = 0; i < nI; ++i) - if (best_x1[i] == 2) - printf("%d ", col[i]); - putchar('\n'); - putchar('\n'); - } - printf("%d had a larger average Reich-Patterson FST (max %5.5f, mean %5.5f)\n", - larger2, largest_F2, tot_F2/nshuff); - if (largest_F2 > F2) { - printf("first columns for the best two populations:\n"); - for (i = 0; i < nI; ++i) - if (best_x2[i] == 1) - printf("%d ", col[i]); - printf("and\n"); - for (i = 0; i < nI; ++i) - if (best_x2[i] == 2) - printf("%d ", col[i]); - putchar('\n'); - putchar('\n'); - } - printf("%d had a larger 
Reich-Patterson population FST (max %5.5f, mean %5.5f)\n", - larger3, largest_F3, tot_F3/nshuff); - if (largest_F3 > F3) { - printf("first columns for the best two populations:\n"); - for (i = 0; i < nI; ++i) - if (best_x3[i] == 1) - printf("%d ", col[i]); - printf("and\n"); - for (i = 0; i < nI; ++i) - if (best_x3[i] == 2) - printf("%d ", col[i]); - putchar('\n'); - putchar('\n'); - } - - return 0; -}
--- a/genome_diversity/src/Fst_column.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,162 +0,0 @@ -/* Fst_column -- add an Fst column to a Galaxy table -* -* argv[1] = a Galaxy SNP table. For each of several individuals, the table -* has four columns (#A, #B, genotype, quality). -* argv[2] = 1 if Fst is estimated from SAMtools genotypes; 0 means use -* read-coverage data. -* argv[3] = lower bound for total number of reads per population -* argv[4] = lower bound for individual quality value -* argv[5] = 1 to retain SNPs that fail to satisfy the lower bound and set -* Fst = -1; delete them if argv[5] = 0. -* argv[6] = 1 to discard SNPs that appear fixed in the two populations -* argv[7] = 0 for the original Wright form, 1 for Weir, 2 for Reich -* argv[8], argv[9], ..., have the form "13:1" or "13:2", meaning that -* the 13th, 14th, and 15th columns (base 1) give the allele counts -* and genotype for an individual that is in population 1 or -* population 2, respectively. - -What It Does on Galaxy - -The user specifies a SNP table and two "populations" of individuals, both previously defined using the Galaxy tool to specify individuals from a SNP table. No individual can be in both populations. Other choices are as follows. - -Data source. The allele frequencies of a SNP in the two populations can be estimated either from the total number of reads of each allele, or by adding the allele frequencies inferred from the genotypes of individuals in the populations. - -After specifying the data source, the user sets lower bounds on the amount of data required at a SNP. For estimates based on read counts, the bound is the minimum count of reads of the two alleles in a population. For estimates based on genotypes, the bound is the minimum reported genotype quality per individual. - -The user specifies whether the SNPs that violate the lower bound should be ignored or have their Fst set to -1. 
- -The user specifies whether SNPs where both populations appear to be fixed for the same allele should be retained or discarded. - -Finally, the user chooses which definition of Fst to use: Wright's original definition, the Weir-Cockerham unbiased estimator, or the Reich-Patterson estimator. - -A column is appended to the SNP table giving the Fst for each retained SNP. - -References: - -Sewall Wright (1951) The genetical structure of populations. Ann Eugen 15:323-354. - -B. S. Weir and C. Clark Cockerham (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. - -B. S. Weir (1996) Population substructure. In: Genetic data analysis II, pp. 161-173. Sinauer Associates, Sunderland, MA. - -David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh (2009) Reconstructing Indian population history. Nature 461:489-494, especially Supplement 2. - -The effectiveness of these estimators when there are many SNPs but few individuals is discussed in the following paper. - -Eva-Maria Willing, Christine Dreyer, Cock van Oosterhout (2012) Estimates of genetic differentiation measured by FST do not necessarily require large sample sizes when using many SNP markers. PLoS One 7:e42649. - -*/ - -#include "lib.h" -#include "Fst_lib.h" - -// most characters allowed in a row of the table -#define MOST 50000 - -// column and population for the relevant individuals/groups -int col[MOST], pop[MOST]; -int nI; - -int main(int argc, char **argv) { - FILE *fp; - char *p, *z = "\t\n", buf[MOST], trash[MOST]; - int X[MOST], min_cov, min_qual, retain, discard, unbiased, genotypes, - n, i, g, A1, B1, A2, B2, saw[3], x1, y1, x2, y2; - double F, N, D; - - if (argc < 9) - fatal("args: table data-source min-cov min-qual retain? discard? unbiased? 
n:1 m:2 ..."); - genotypes = atoi(argv[2]); - min_cov = atoi(argv[3]); - min_qual = atoi(argv[4]); - retain = atoi(argv[5]); - discard = atoi(argv[6]); - unbiased = atoi(argv[7]); - saw[1] = saw[2] = 0; - for (i = 8; i < argc; ++i, ++nI) { - if (sscanf(argv[i], "%d:%d", &(col[nI]), &(pop[nI])) != 2) - fatalf("not like 13:2 : %s", argv[i]); - if (pop[nI] < 1 || pop[nI] > 2) - fatalf("not population 1 or 2: %s", argv[i]); - saw[pop[nI]] = 1; - // seen this individual before? - for (n = 0; n < nI && col[n] != col[nI]; ++n) - ; - if (n < nI) - fatalf("individual at column %d is mentioned twice", - col[n]); - } - if (saw[1] == 0) - fatal("population 1 is empty"); - if (saw[2] == 0) - fatal("population 2 is empty"); - - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - strcpy(trash, buf); - // set X[i] = atoi(i-th word of s), i is base 0 - for (i = 1, p = strtok(trash, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = A1 = B1 = A2 = B2 = x1 = y1 = x2 = y2 = 0; - i < nI; ++i) { - n = col[i]; - g = X[n+2]; // save genotype - - if (genotypes) { - if (g == -1) - continue; - } else if (X[n+3] < min_qual) - continue; - if (pop[i] == 1) { - // column n (base 1) corresponds to entry X[n] - x1 += X[n]; - y1 += X[n+1]; - if (genotypes) { - A1 += g; - B1 += (2 - g); - } else { - A1 += X[n]; - B1 += X[n+1]; - } - } else if (pop[i] == 2) { - x2 += X[n]; - y2 += X[n+1]; - if (genotypes) { - A2 += g; - B2 += (2 - g); - } else { - A2 += X[n]; - B2 += X[n+1]; - } - } - } - if (discard && ((A1 == 0 && A2 == 0) || (B1 == 0 && B2 == 0))) - continue; // not variable in the two populations - if (!genotypes && (x1+y1 < min_cov || x2+y2 < min_cov)) - F = -1.0; - else { - if (unbiased == 0) - wright(A1, A2, B1, B2, &N, &D); - else if (unbiased == 1) - weir(A1, A2, B1, B2, &N, &D); - else if (unbiased == 2) - reich(A1, A2, B1, B2, &N, &D); - else - fatal("impossible value of 'unbiased'"); - if (D == 0.0) - continue; // ignore 
these SNPs - else - F = N/D; - } - if (F == -1.0 && !retain) - continue; - if ((p = strchr(buf, '\n')) != NULL) - *p = '\0'; - printf("%s\t%5.4f\n", buf, F); - } - - return 0; -}
--- a/genome_diversity/src/Fst_lib.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,73 +0,0 @@ -// This file contains three procedures for computing different variants of Fst. - -#include "Fst_lib.h" - -/* For each of wright, weir and reich, args 1 and 2 are counts for one -* allele in the two populations; args 3 and 4 are counts for the other allele. -* The numerator and denominator of the computed Fst are returned through -* args 5 and 6. -*/ - -void wright(int A1, int A2, int B1, int B2, double *N, double *D) { - double a1 = A1, a2 = A2, b1 = B1, b2 = B2, n1, n2, p1, p2; - - double - p, // frequency in the pooled population - H_ave, // average of HWE heterozygosity in the two populations - H_all; // HWE heterozygosity in the pooled populations - - n1 = a1+b1; - n2 = a2+b2; - if (n1 == 0.0 || n2 == 0.0) { - // let the calling program handle it - *N = *D = 0.0; - return; - } - p1 = a1/n1; - p2 = a2/n2; - H_ave = p1*(1.0 - p1) + p2*(1.0 - p2); - p = (p1 + p2)/2.0; - H_all = 2.0*p*(1.0 - p); - *N = H_all - H_ave; - *D = H_all; -} - -void weir(int A1, int A2, int B1, int B2, double *N, double *D) { - double a1 = A1, a2 = A2, b1 = B1, b2 = B2, n1, n2, p1, p2, - n_tot, p_bar, nc, MSP, MSG; - - n1 = a1+b1; - n2 = a2+b2; - if (n1 == 0.0 || n2 == 0.0) { - // let the calling program handle it - *N = *D = 0.0; - return; - } - n_tot = n1 + n2; - p1 = a1/n1; - p2 = a2/n2; - - MSG = (n1*p1*(1.0-p1) + n2*p2*(1.0-p2))/(n_tot-2.0); - p_bar = (n1*p1 + n2*p2)/n_tot; - MSP = n1*(p1-p_bar)*(p1-p_bar) + n2*(p2-p_bar)*(p2-p_bar); - nc = n_tot - (n1*n1 + n2*n2)/n_tot; - *N = MSP - MSG; - *D = MSP + (nc-1)*MSG; -} - -void reich(int A1, int A2, int B1, int B2, double *N, double *D) { - double a1 = A1, a2 = A2, b1 = B1, b2 = B2, n1, n2, h1, h2, x; - - n1 = a1+b1; - n2 = a2+b2; - if (n1<=1 || n2<=1) { - // let the calling program handle it - *N = *D = 0.0; - return; - } - h1 = (a1*(n1-a1)) / (n1*(n1-1)); - h2 = (a2*(n2-a2)) / (n2*(n2-1)); - x = a1/n1 - a2/n2; 
- *N = x*x - h1/n1 - h2/n2; - *D = *N + h1 + h2; -}
--- a/genome_diversity/src/Fst_lib.h Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,11 +0,0 @@ -// Fst_lib.h - -/* For each of wright, weir and reich, args 1 and 2 are counts for one -* allele in the two populations; args 3 and 4 are counts for the other allele. -* The numerator and denominator of the computed Fst are returned through -* args 5 and 6. -*/ - -void wright(int, int , int , int , double * , double * ); -void weir(int, int , int , int , double * , double * ); -void reich(int, int , int , int , double * , double * );
--- a/genome_diversity/src/Huang.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,44 +0,0 @@ -// Find highest scoring intervals, as discussed in Huang.h. - -#include "lib.h" -#include "Huang.h" - -void Huang(double x[], int n) { - double Score, oldScore; - int v, L, i; - - top = 0; // don't use location 0, so as to follow Fig. 6 - for (Score = 0.0, v = 0; v < n; ++v) { - oldScore = Score; - Score += x[v]; - if (x[v] < 0) - continue; - if (top > 0 && R[top].Rpos == v-1) { - // add edge to top subpath - R[top].Rpos = v; - R[top].Rscore = Score; - } else { - // create a one-edge subpath - ++top; - if (top >= MAX_R) - fatal("In Huang(), top is too big"); - R[top].Lpos = v-1; - R[top].Lscore = oldScore; - R[top].Rpos = v; - R[top].Rscore = Score; - R[top].Lower = top-1; - while ((L = R[top].Lower) > 0 && - R[L].Lscore > R[top].Lscore) - R[top].Lower = R[L].Lower; - } - // merge subpaths - while (top > 1 && (L = R[top].Lower) > 0 && - R[L].Rscore <= R[top].Rscore) { - R[L].Rpos = R[top].Rpos; - R[L].Rscore = R[top].Rscore; - top = L; - } - } - for (i = 1; i <= top; ++i) - R[i].Score = R[i].Rscore - R[i].Lscore; -}
--- a/genome_diversity/src/Huang.h Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,20 +0,0 @@ -/* Find intervals of highest total score, i.e., such that adding positions to -* either end will decrease the total. We use the method of Fig. 6 of the paper: -* Xiaoqiu Huang, Pavel Pevzner, Webb Miller (1994) Parametric recomputing in -* alignment graphs. Combinatorial Pattern Matching (Springer Lecture Notes in -* Computer Science, 807), 87-101. -* -* The input scores are in x[0], x[1], ..., x[n-1], but the output regions -* are in R[1], R[2], ..., R[top]. R[i].Score is the total score of the i-th -* (in order of position) positive-scoring interval of x, which consists of -* x[R[i].Lpos + 1] to x[R[i].Rpos]. -*/ -#define MAX_R 5000000 - -struct region { // a consecutive (relative to the reference) run of SNPs - double Lscore, Rscore, Score; - int Lpos, Rpos, Lower; -} R[MAX_R]; -int top; - -void Huang(double *x, int n);
--- a/genome_diversity/src/Makefile Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,61 +0,0 @@ -CC = gcc -COPT = -O2 -CWARN = -W -Wall -CFLAGS = $(COPT) $(CWARN) -INSTALL_DIR = ../bin - -TARGETS = admix_prep aggregate coords2admix coverage dist_mat dpmix \ - eval2pct filter_snps Fst_ave Fst_column get_pi mk_Ji mt_pi sweep - -all: $(TARGETS) - -install: $(TARGETS) - if [ ! -d "$(INSTALL_DIR)" ]; then mkdir -p "$(INSTALL_DIR)"; fi - cp $(TARGETS) $(INSTALL_DIR) - -admix_prep: admix_prep.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -aggregate: aggregate.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -coords2admix: coords2admix.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -coverage: coverage.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -dist_mat: dist_mat.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -dpmix: dpmix.c lib.c - $(CC) $(CFLAGS) $^ -lm -o $@ - -eval2pct: eval2pct.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -filter_snps: filter_snps.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -Fst_ave: Fst_ave.c Fst_lib.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -Fst_column: Fst_column.c Fst_lib.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -get_pi: get_pi.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -mk_Ji: mk_Ji.c lib.c mito_lib.c - $(CC) $(CWARN) $^ -o $@ - -mt_pi: mt_pi.c lib.c mito_lib.c - $(CC) $(CWARN) $^ -o $@ - -sweep: sweep.c lib.c Huang.c - $(CC) $(CFLAGS) $^ -o $@ - -.PHONY: clean - -clean: - rm -f $(TARGETS)
--- a/genome_diversity/src/admix_prep.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,119 +0,0 @@ -/* admix_prep -- prepare the ".ped" and ".map" files (PLINK format) for input to -* the "admixture" program. -* -* argv[1] -- a Galaxy SNP table -* argv[2] -- required number of reads for each individual to use a SNP -* argv[3] -- required genotype quality for each individual to use a SNP -* argv[4] -- minimum spacing between SNPs on the same scaffold -* argv[k] for k > 4 have the form "13:fred", meaning that the 13th and 14th -* columns (base 0) give the allele counts for the individual or group named -* "fred". - -What it does on Galaxy -The tool converts a SNP table into two tables, called "admix.map" and "admix.ped", needed for estimating the population structure. The user can read or download those files, or simply pass this tool's output on to other programs. The user imposes conditions on which SNPs to consider, such as the minimum coverage and/or quality value for every individual, or the distance to the closest SNP in the same contig (as named in the first column of the SNP table). A useful piece of information produced by the tool is the number of SNPs meeting those conditions, which can be found by clicking on the "eye" after the program runs. 
- -*/ - -#include "lib.h" - -// bounds line length for a line of the Galaxy table -#define MOST 50000 -struct individual { - int column; - char *name; -} I[MOST/8]; // each individual has 4 columns and 4 tab characters -int nI; // number of individuals -int X[MOST]; // integer values in a row of the SNP table - -// bounds the number of SNPs that can be kept -#define MAX_KEEP 10000000 -char *S[MAX_KEEP]; // S[i] is a row of 2*nI alleles -int nK; - -int main(int argc, char **argv) { - FILE *fp, *ped, *map; - char *p, *z = " \t\n", buf[MOST], trash[MOST], name[100], *s, - scaf[100], prev_scaf[100]; - int i, j, m, min_coverage, min_quality, min_space, nsnp, genotype, - pos, prev_pos; - - if (argc < 5) - fatal("args: Galaxy-table min-cov min-qual min-space 13:fred 16:mary ..."); - min_coverage = atoi(argv[2]); - min_quality = atoi(argv[3]); - min_space = atoi(argv[4]); - - for (i = 5; i < argc; ++i, ++nI) { - if (nI >= MOST/8) - fatal("Too many individuals"); - if (sscanf(argv[i], "%d:%s", &(I[nI].column), name) != 2) - fatalf("bad arg: %s", argv[i]); - I[nI].name = copy_string(name); - } - - map = ckopen("admix.map", "w"); - - fp = ckopen(argv[1], "r"); - prev_scaf[0] = '\0'; - prev_pos = 0; - for (nsnp = 0; fgets(buf, MOST, fp); ) { - if (buf[0] == '#') - continue; - ++nsnp; - if (sscanf(buf, "%s %d", scaf, &pos) != 2) - fatalf("choke: %s", buf); - if (same_string(scaf, prev_scaf)) { - if (pos < prev_pos + min_space) - continue; - } else { - strcpy(prev_scaf, scaf); - prev_pos = -min_space; - } - - // X[i] = atoi(i-th word base-1) - strcpy(trash, buf); - for (i = 1, p = strtok(trash, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = 0; i < nI; ++i) { - m = I[i].column; - if (X[m] + X[m+1] < min_coverage || X[m+3] < min_quality) - break; - } - if (i < nI) - continue; - prev_pos = pos; - - if (nK >= MAX_KEEP) - fatal("Too many SNPs"); - fprintf(map, "1 snp%d 0 %d\n", nsnp, nsnp+1); - s = S[nK++] = ckalloc(2*nI*sizeof(char)); - for (i = j = 0; i < 
nI; ++i, j += 2) { - genotype = X[I[i].column+2]; - if (genotype == 2) - s[j] = s[j+1] = '1'; - else if (genotype == 0) - s[j] = s[j+1] = '2'; - else if (genotype == 1) { - s[j] = '1'; - s[j+1] = '2'; - } else // undefined genotype - s[j] = s[j+1] = '0'; - } - } - - fclose(map); - - ped = ckopen("admix.ped", "w"); - for (i = 0; i < nI; ++i) { - fprintf(ped, "%s 1 0 0 1 1", I[i].name); - for (j = 0; j < nK; ++j) - fprintf(ped, " %c %c", S[j][2*i], S[j][2*i+1]); - putc('\n', ped); - } - - printf("Using %d of %d SNPs\n", nK, nsnp); - fclose(ped); - - return 0; -}
--- a/genome_diversity/src/aggregate.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,69 +0,0 @@ -/* aggregate -- add four columns (allele counts, genotype, maximum quality) for -* a specified population to a Galaxy SNP table -* -* argv[1] = file containing a Galaxy table -* argv[2] = 0 for a gd_genotype file, 1 for a gd_snp file -* argv[3] ... are the starting columns (base-1) for the chosen individuals - -What it does on Galaxy -The user specifies that some of the individuals in a gd_snp or gd_genotype dataset form a "population", by supplying a list that has been previously created using the Specify Individuals tool. The program appends a new "entity" (set of four columns for a gd_snp table, or one column for a gd_genotype table), analogous to the column(s) for an individual but containing summary data for the population as a group. For a gd_snp table, these four columns give the total counts for the two alleles, the "genotype" for the population, and the maximum quality value, taken over all individuals in the population. If all defined genotypes in the population are 2 (agree with the reference), then the population's genotype is 2, and similarly for 0; otherwise the genotype is 1 (unless all individuals have undefined genotype, in which case it is -1). For a gd_genotype file, only the aggregate genotype is appended. 
-*/ - -#include "lib.h" - -// most characters allowed in a row of the table -#define MOST 50000 - -// column for the relevant individuals/groups -int col[MOST]; -int nI; - -int main(int argc, char **argv) { - FILE *fp; - char *p, *z = "\t\n", buf[MOST], trash[MOST]; - int X[MOST], m, i, A, B, G, Q, g, gd_snp; - - if (argc < 3) - fatalf("args: SNP-table typedef individual1 ..."); - - gd_snp = atoi(argv[2]); - for (i = 3, nI = 0; i < argc; ++i, ++nI) - col[nI] = atoi(argv[i]); - - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - strcpy(trash, buf); - // set X[i] = atoi(i-th word of s), i is base 0 - for (i = 1, p = strtok(trash, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = A = B = Q = 0, G = -1; i < nI; ++i) { - m = col[i]; - if (gd_snp) { - A += X[m]; - B += X[m+1]; - Q = MAX(Q, X[m+3]); - } - g = X[m+2]; - if (g != -1) { - if (G == -1) // first time - G = g; - else if (G != g) - G = 1; - } - } - if (i < nI) // check bounds on the population's individuals - continue; - // add columns - if ((p = strchr(buf, '\n')) != NULL) - *p = '\0'; - if (gd_snp) - printf("%s\t%d\t%d\t%d\t%d\n", buf, A, B, G, Q); - else - printf("%s\t%d\n", buf, G); - } - - return 0; -}
--- a/genome_diversity/src/coords2admix.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,91 +0,0 @@ -// coords2admix -- add projections onto chords to information about -// coordinates in PCA plots - -#include "lib.h" - -#define MAX_POP 1000 -struct pop { - char *name; - float x, y; -} P[MAX_POP]; -int nP; - -int main(int argc, char **argv) { - FILE *fp; - char buf[500], x[100], y[100], z[100], cur_pop[100]; - int ncur, i, j, k; - float eig1, eig2, tot_x = 0.0, tot_y = 0.0, x1, y1, x2, y2, a, b, c, d; - - if (argc == 1) - fp = stdin; - else if (argc == 2) - fp = ckopen(argv[1], "r"); - else - fatal("optional arg: smartpca coordinates"); - - if (!fgets(buf, 500, fp)) - fatal("empty set of coordinates"); - if (sscanf(buf, "%s %s %s", x, y, z) != 3 || - !same_string(x, "#eigvals:")) - fatalf("cannot find eigenvalues: %s", buf); - printf("%s", buf); - eig1 = atof(y); - eig2 = atof(z); - //printf("eig1 = %f, eig2 = %f\n", eig1, eig2); - - strcpy(cur_pop, ""); - ncur = 0; - while (fgets(buf, 500, fp)) { - if (sscanf(buf, "%*s %s %s %s", x, y, z) != 3) - fatalf("gag: %s", buf); - printf("%s", buf); - if (!same_string(cur_pop, z)) { - if (ncur > 0) { - P[nP].name = copy_string(cur_pop); - P[nP].x = tot_x/ncur; - P[nP].y = tot_y/ncur; - ++nP; - } - ncur = 1; - strcpy(cur_pop, z); - tot_x = atof(x); - tot_y = atof(y); - } else { - ++ncur; - tot_x += atof(x); - tot_y += atof(y); - } - } - P[nP].name = copy_string(cur_pop); - P[nP].x = tot_x/ncur; - P[nP].y = tot_y/ncur; - ++nP; - -/* -for (i = 0; i < nP; ++i) -printf("%s %f %f\n", P[i].name, P[i].x, P[i].y); -*/ - - // loop over pairs of populations - for (i = 0; i < nP; ++i) { - x1 = eig1*P[i].x; - y1 = eig2*P[i].y; - for (j = i+1; j < nP; ++j) { - printf("\nprojection along chord %s -> %s\n", - P[i].name, P[j].name); - x2 = eig1*P[j].x; - y2 = eig2*P[j].y; - c = (x1-x2)*(x1-x2) + (y1-y2)*(y1-y2); - for (k = 0; k < nP; ++k) - if (k != i && k != j) { - a = eig1*P[k].x; - b = eig2*P[k].y; - d 
= (x2-x1)*(a-x1) + (y2-y1)*(b-y1); - printf(" %s: %f\n", P[k].name, d/c); - } - } - } - - return 0; -} -
--- a/genome_diversity/src/coverage.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,155 +0,0 @@ -/* coverage -- report distributions of SNP coverage or quality for individuals, -* or coverage for populations -* -* argv[1] -- a Galaxy SNP table. For each individual, the table has four -* columns (count of each allele, genotype, quality). -* argv[2] -- 0 = sequence coverage, 1 = genotype quality -* argv[3] -- file name for the text version of output (input for producing -* the graphical summary goes to stdout) -* argv[4], argv[5], ..., have the form "13:fred", meaning that the 13th, -* 14th, and 16th columns (base 1) give the two allele counts -* and the quality for "fred", where "fred" can be the name of -* a population with several individuals (all named "fred") -What it does on Galaxy -The tool reports distributions of SNP reliability indicators for individuals or populations. The reliability can be measured by either the sequence coverage or the SAMtools quality value, though the notion of a population-level quality is not supported. Textual and graphical reports are generated, where the text output gives the cumulative distributions. -*/ - -#include "lib.h" - -// maximum length of a line from the table -#define MOST 50000 - -// the largest coverage or quality value being considered -#define MAX_VAL 5000 - -FILE *gp; // for text output - -// a population is the set of all individuals with the same name -// (perhaps just a single individual) -struct pop { - int cov, n[MAX_VAL+1]; - long long sum, tot; - char *name; -} P[MOST/4]; -int nP; // number of populations - -// maps column to population -struct individual { - int col, pop; -} I[MOST/4]; -int nI; - -/* Report the distribution for each individual. P[i].n[k] is the number of SNPs -* of value (coverage or quality) k in population i, for k < MAX_VAL; -* P[i].n[MAX_VAL] is the number of SNPs of value k >= MAX_VAL. 
-* We print the percentages, p, of SNPs with value <= k, ending when all -* populations have reached a p >= 98%. -*/ -void print_cov() { - int i, j, k, last_j; - long long sum; - - // find where to stop printing - for (last_j = i = 0; i < nP; ++i) { - for (sum = j = 0; j <= MAX_VAL; ++j) - sum += P[i].n[j]; - P[i].tot = sum; - for (sum = j = 0; j <= MAX_VAL; ++j) { - sum += P[i].n[j]; - if (sum >= 0.98*P[i].tot) - break; - } - last_j = MAX(last_j, j); - } - - - ++last_j; - // print to stdout the output for graphing; not broken into short lines - for (j = 0; j < last_j; ++j) - printf("\t%3d", j); - putchar('\n'); - for (i = 0; i < nP; ++i) { - printf("%s", P[i].name); - for (sum = j = 0; j < last_j; ++j) { - sum += P[i].n[j]; - printf("\t%4.2f", 100.0*(float)sum/(float)P[i].tot); - } - putchar('\n'); - } - - // print a user-friendly version to the named file - // <= 20 numbers per row - for (j = 0; j < last_j; j += 20) { - fprintf(gp, "\n "); - for (k = j; k < MIN(j+20, last_j); ++k) - fprintf(gp, "%3d", k); - for (i = 0; i < nP; ++i) { - fprintf(gp, "\n%10s", P[i].name); - for (k = j; k < MIN(j+20, last_j); ++k) { - P[i].sum += P[i].n[k]; - fprintf(gp, "%3lld", - MIN(99, 100*P[i].sum/P[i].tot)); - } - } - fprintf(gp,"\n\n"); - } -} - -int main(int argc, char **argv) { - FILE *fp; - char buf[MOST], *z = " \t\n", *p; - int X[MOST], i, j, cov, m, quality, is_pop; - - if (argc < 5) - fatal("args: SNP-file quality-value? out-name 13:fred ... 
"); - quality = atoi(argv[2]); - gp = ckopen(argv[3], "w"); - // record the individuals and populations - is_pop = 0; - for (nI = 0, i = 4; i < argc; ++i, ++nI) { - if (nI >= MOST/4) - fatal("Too many individuals"); - // allow spaces in names - if ((p = strchr(argv[i], ':')) == NULL) - fatalf("no colon: %s", argv[i]); - I[nI].col = atoi(argv[i]); - for (j = 0; j < nP && !same_string(p+1, P[j].name); ++j) - ; - if (j == nP) // new population name - P[nP++].name = copy_string(p+1); - else // repeated name: a multi-individual population - is_pop = 1; - I[nI].pop = j; - } - if (is_pop && quality) - fatal("quality values for a population are not supported."); - - // Record the number of SNPs with coverage 0, 1, ..., MAX_VAL-1, - // or >= MAX_VAL for each individual. - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - // P[i].cov is the total coverage for all individuals in pop i - for (i = 0; i < nP; ++i) - P[i].cov = 0; - // X[i] = atoi(i-th word base-1) - for (i = 1, p = strtok(buf, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = 0; i < nI; ++i) { - m = I[i].col; - if (quality) - cov = X[m+3]; - else - cov = X[m] + X[m+1]; - P[I[i].pop].cov += cov; - } - for (i = 0; i < nP; ++i) - P[i].n[MIN(P[i].cov, MAX_VAL)]++; - } - - // Print the distributions. - print_cov(); - - return 0; -}
--- a/genome_diversity/src/dist_mat.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,196 +0,0 @@ -/* dist_mat -- create a distance matrix in PHYLIP format for pairs of -* specified individuals, including by default the reference sequence -* -* argv[1] -- a Galaxy SNP table -* argv[2] -- min coverage -* argv[3] -- min quality -* argv[4] -- name of reference species (or "none") -* argv[5] -- 0 = distance from coverage; 1 = distance from genotype -* argv[6] -- name of file for the numbers of informative SNPs -* argv[7] -- name of file to write the Mega-format distance matrix -* argv[k] for k > 7 have the form "13:fred", meaning that the 13th and 14th -* columns (base 0) give the allele counts for the individual or group named -* "fred". - -What it does on Galaxy -This tool uses the selected SNP table to determine a "genetic distance" between each pair of selected individuals; the table of pairwise distances can be used by the Neighbor-Joining method to construct a tree that depicts how the individuals are related. For a given pair of individuals, we find all SNP positions where both individuals have at least a minimum number of sequence "reads"; the individuals' distance at that SNP is defined as the absolute value of the difference in the frequency of the first allele (equivalently: the second allele). For instance, if the first individual has 5 reads of each allele and the second individual has respectively 3 and 6 reads, then the frequencies are 1/2 and 1/3, giving a distance of 1/6 at that SNP (provided that the minimum read total is at most 9). The output includes a report of the number of SNPs passing that threshold for each pair of individuals. 
- -*/ - -#include "lib.h" - -// bounds line length for a line of the Galaxy table - -#define MOST 50000 -#define MIN_SNPS 3 - -struct argument { - int column; - char *name; -} A[MOST]; -int nA; // number of individuals or groups + 1 (for the reference species) - -#define MOST_INDIVIDUALS 1000 -#define SIZ (1+MOST_INDIVIDUALS) // includes the reference - -double tot_diff[SIZ][SIZ]; -int ndiff[SIZ][SIZ], X[MOST]; - -int main(int argc, char **argv) { - FILE *fp, *gp, *mega; - char *p, *z = "\t\n", buf[MOST], name[100], B[100], C[100], D[100], - *nucs = "ACGT"; - int i, j, m, n, min_coverage, too_few, ref_allele = -1, has_ref, - min_quality, genotype; - double fi, fj, dist; - - if (argc < 8) - fatal("args: Galaxy-table min-cov min-qual ref-name genotype snp-count-out mega-out 13:fred 16:mary ..."); - min_coverage = atoi(argv[2]); - min_quality = atoi(argv[3]); - genotype = atoi(argv[5]); - if (!genotype && min_coverage <= 0 && min_quality <= 0) - fatal("coverage and/or quality of SNPs should be constrained"); - - if (same_string(argv[4], "none")) - has_ref = 0; - else { - has_ref = 1; - A[0].name = copy_string(argv[4]); - } - gp = ckopen(argv[6], "w"); - mega = ckopen(argv[7], "w"); - fprintf(mega, "#mega\n!Title: Galaxy;\n"); - - for (nA = has_ref, i = 8; i < argc; ++i, ++nA) { - if (nA >= SIZ) - fatal("Too many individuals"); - if (sscanf(argv[i], "%d:%s", &(A[nA].column), name) != 2) - fatalf("bad arg: %s", argv[i]); - A[nA].name = copy_string(name); - } - fprintf(mega, - "!Format DataType=Distance DataFormat=LowerLeft NTaxa=%d;\n\n", - nA); - for (i = 0; i < nA; ++i) - fprintf(mega, "[%d] #%s\n", i+1, A[i].name); - fprintf(mega, "\n\n\n["); - for (i = 1; i <= nA; ++i) - fprintf(mega, "%4d", i); - fprintf(mega, " ]\n"); - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - if (has_ref) { - // get the reference allele - if (sscanf(buf, "%*s %*s %s %s %*s %*s %*s %s", B, C, D) - != 3) - fatalf("3 fields: %s", buf); - if 
(strchr(nucs, B[0]) == NULL || - strchr(nucs, C[0]) == NULL) - fatalf("not nucs : %s %s", B, C); - if (D[0] == B[0]) - ref_allele = 1; - else if (D[0] == C[0]) - ref_allele = 2; - else if (strchr(nucs, D[0]) != NULL) - ref_allele = 3; - else { - if (D[0] != '-' && D[0] != 'N') - fatalf("what is this: %s", D); - ref_allele = -1; - } - } - - // X[i] = atoi(i-th word base-1) - for (i = 1, p = strtok(buf, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = has_ref; i < nA; ++i) { - m = A[i].column; - if (X[m] + X[m+1] < min_coverage || - X[m+3] < min_quality) - continue; - - // frequency of the second allele - if (genotype) { - if (X[m+2] == -1) - continue; // no genotype - fi = (double)X[m+2]; - } else - fi = (double)X[m+1] / (double)(X[m]+X[m+1]); - if (has_ref && ref_allele > 0) { - ndiff[0][i]++; - // reference allele might be different from both - if (ref_allele == 1) - tot_diff[0][i] += fi; - else if (ref_allele == 2) - tot_diff[0][i] += (1.0 - fi); - else - tot_diff[0][i] += 1.0; - } - for (j = i+1; j < nA; ++j) { - n = A[j].column; - if (X[n] + X[n+1] < min_coverage || - X[n+3] < min_quality) - continue; - if (genotype && X[n+2] == -1) - continue; - ndiff[i][j]++; - if (genotype) - fj = (double)X[n+2]; - else - fj = (double)X[n+1] / - (double)(X[n] + X[n+1]); - fj -= fi; - // add abs. value of difference in frequencies - tot_diff[i][j] += (fj >= 0.0 ? 
fj : -fj); - } - - } - } - for (i = too_few = 0; i < nA; ++i) - for (j = i+1; j < nA; ++j) - if (ndiff[i][j] < MIN_SNPS) { - too_few = 1; - fprintf(stderr, - "%s and %s have only %d informative SNPs\n", - A[i].name, A[j].name, ndiff[i][j]); - } - if (too_few) - fatal("remove individuals or relax constraints"); - - // print distances - printf("%d\n", nA); - for (i = 0; i < nA; ++i) { - printf("%9s", A[i].name); - fprintf(mega, "[%d] ", i+1); - for (j = 0; j < i; ++j) { - dist = tot_diff[j][i]/(double)ndiff[j][i]; - printf(" %6.4f", dist); - fprintf(mega, " %6.4f", dist); - } - fprintf(mega, " \n"); - printf(" 0.0000"); - for (j = i+1; j < nA; ++j) - printf(" %6.4f", - tot_diff[i][j]/(double)ndiff[i][j]); - putchar('\n'); - } - fprintf(mega, "\n\n\n\n\n"); - fclose(mega); - - // print numbers of SNPs - for (i = 0; i < nA; ++i) { - fprintf(gp, "%9s", A[i].name); - for (j = 0; j < i; ++j) - fprintf(gp, " %8d", ndiff[j][i]); - fprintf(gp, " 0"); - for (j = i+1; j < nA; ++j) - fprintf(gp," %8d", ndiff[i][j]); - putc('\n', gp); - } - - return 0; -}
--- a/genome_diversity/src/dpmix.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,641 +0,0 @@ -/* dpmix -- admixture using dynamic programming; 2 or 3 source populations -* -* argv[1] = a Galaxy SNP table. For each individual, the table may have -* four columns (#A, #B, genotype, quality), or only one -* (genotype). SNPs on the same chromosome must appear together, -* and in order of position -* argv[2] = column with the chromosome name (position is the next column) -* argv[3] = "all" or e.g., "chr20" -* argv[4] = 1 if source-pop allele frequencies are estimated from genotypes; -* 0 means use read-coverage data. -* argv[5] = 1 to add logarithms of probabilities; -* 0 to simply add probabilities -* argv[6] = switch penalty (>= 0) -* argv[7] = file giving heterochromatic intervals ('-' = no file is given) -* argv[8] = file name for additional output -* argv[9], argv[10], ... = like "13:1:Peter", "13:2:Paul", "13:3:Sam" -* or "13:0:Mary", meaning that the 13th and 14th columns (base 1) -* give the allele counts for an individual that is in source -* population 1, source population 2, source population 3, -* or is a potentially admixed individual, resp. Even when only the -* genotype is given, it is found in column 15. -* optional last arguments = "debug" - -What it does on Galaxy -The user specifies two or three source populations (i.e., sources for chromosomes) and a set of potentially admixed individuals, and chooses between the sequence coverage or the estimated genotypes to measure the similarity of genomic intervals in admixed individuals to the source chromosomes. The user also specifies a "switch penalty", controlling the strength of evidence needed to switch between source populations as the program scans along a chromosome. - -Choice of an appropriate switch penalty depends on the number of SNPs and the time since the admixture event(s), and it is not always easy to determine a good setting ahead of time.
One approach is to try, say, 10.0, and look at the resulting picture. If the admixture blocks are shorter than anticipated, based on expectations about the number of generations since the admixture event(s) (i.e., the implied admixture events are too ancient), then try a larger value. Conversely, if the blocks are longer than anticipated (i.e., the implied timing of admixture is too recent), then lower the penalty. In any case, it is prudent to base conclusions on results consistently obtained with a range of switch penalties. - -If there are 3 source populations, then the program divides each potentially admixed genome into six "genotypes": - -(1) homozygous for the first source population (i.e., both chromosomes from that population), -(2) homozygous for the second source population, -(3) homozygous for the third source population, -(4) heterozygous for the first and second populations (i.e., one chromosome from each), -(5) heterozygous for the first and third populations, or -(6) heterozygous for the second and third populations. - -Parts of a reference chromosome that are labeled as "heterochromatic" are given the "non-genotype" 0. With two source populations, there are only three possible "genotypes". - -Two Galaxy history items are generated. The first is a tabular dataset with chromosome, start, stop, and pairs of columns containing the "genotypes" as described above and the label of the admixed individual. It is used to draw a picture and can be consulted for details. The second item is a composite of (1) a link to a pdf which graphically shows the inferred source population for each potentially admixed individual along each chromosome, and (2) a link to a text file that describes the run and summarizes the extent of predicted admixture in each individual.
- -For certain genome assemblies, Galaxy stores a "heterochromatin file", which specifies the heterochromatic intervals on each chromosome, as well as giving the chromosome lengths. For other genome assemblies, the user can supply this information. Here are the first few lines of a heterochromatin file for human assembly hg19: -chr1 121485434 142535434 -chr1 249250621 249250621 -chr2 90545103 95326171 -chr2 243199373 243199373 -This gives the centromeres and lengths for chromosomes 1 and 2. Chromosomal addresses begin at 0, and the stated end position is 1 beyond the last address. For instance, the interval described by the second line has length 0; it tells Galaxy that chromosome 1 is 249,250,621 bp in length. The entries for an acrocentric chromosome look like the following: -chr22 0 16050000 -chr22 51304566 51304566 -The file must be ordered by chromosome name (starting with autosomes), and by position within a chromosome. -*/ - -#include "lib.h" -#include <math.h> - -// maximum length of a line from the table -#define MOST 50000 - -// we create a linked list of "events" on a chromosome -- mostly SNPs, but -// also ends of heterochromatic intervals -struct snp { - double F1, F2, F3; // ref. allele frequencies in the three populations - int pos, *g, // position; genotypes of admixed individuals - type; // 0 = SNP, 1 = start of het.
interval, 2 = end - struct snp *prev; // we keep the list in order of decreasing pos -} *last; - -// array of potentially admixed individuals -struct admixed { - char *name; - int gcol; - double x[7]; // number of reference bp in each state -} A[MOST]; - -// information about source-population individuals -struct ances { - int col, pop; - char *name; -} C[MOST]; - -// heterochromatic intervals -struct het { - char *chr; - int b, e; -} H[MOST]; - -#define MAX_CHR_NAME 1000000 - -// global variables -int *B[7], // backpointer to state at the previous SNP (or event) - *P; // chromosome position -int nH, nI, nG, genotypes, nsnp, debug, chr_col, logs, last_snp_pos, pop3, - nchr_name; -char this_chr[100], *chr_name[MAX_CHR_NAME]; -double switch_penalty; -char buf[MOST], *status; -FILE *fp, *out; - -#define LOG_ZERO -4.0 -// probability of producing genotype g in admixture state s = 1 to 6 -// given source-population ref. allele frequencies f1, f2, and f3 -double score (double f1, double f2, double f3, int g, int s) { - double p; - - if (g < 0 || g > 2) - fatalf("bad genotype %d", g); - if (s == 1) { // homozygous for the first source population - if (g == 2) - p = f1*f1; - else if (g == 0) - p = (1.0-f1)*(1.0-f1); - else - p = 2.0*f1*(1.0-f1); - } else if (s == 2) { // homozygous for the second source population - if (g == 2) - p = f2*f2; - else if (g == 0) - p = (1.0-f2)*(1.0-f2); - else - p = 2.0*f2*(1.0-f2); - } else if (s == 3) { // homozygous for the third source population - if (g == 2) - p = f3*f3; - else if (g == 0) - p = (1.0-f3)*(1.0-f3); - else - p = 2.0*f3*(1.0-f3); - } else if (s == 4) { // one chromosome each from source pops 1 and 2 - if (g == 2) - p = f1*f2; - else if (g == 0) - p = (1.0-f1)*(1.0-f2); - else - p = f1*(1.0-f2) + (1.0-f1)*f2; - } else if (s == 5) { // one chromosome each from source pops 1 and 3 - if (g == 2) - p = f1*f3; - else if (g == 0) - p = (1.0-f1)*(1.0-f3); - else - p = f1*(1.0-f3) + (1.0-f1)*f3; - } else if (s == 6) { // one 
chromosome each from source pops 2 and 3 - if (g == 2) - p = f2*f3; - else if (g == 0) - p = (1.0-f2)*(1.0-f3); - else - p = f2*(1.0-f3) + (1.0-f2)*f3; - } else - fatalf("bad state %d", s); - - if (p < 0.0) - fatalf("%f %f %f %d %d => %f", f1, f2, f3, g, s, p); - if (!logs) - return p; - if (p == 0.0) - return LOG_ZERO; - p = MAX(log(p), LOG_ZERO); - return p; - fatal("dpmix: cannot happen"); -} - -char *get_chr_name() { - static char tmp[MOST]; - char *s, *z = "\t\n"; - int i = chr_col; - static int autosome_warning = 0; - - strcpy(tmp, buf); - s = strtok(tmp, z); - while (--i > 0) - s = strtok(NULL, z); - if (!autosome_warning && strncmp(s, "chr", 3) == 0 && - !isdigit(s[3])) { - fprintf(out, - "WARNING: results assume diploid (non-sex) chromosomes\n\n"); - autosome_warning = 1; - } - return s; -} - -/* Process the a-th potentially admixed individual. -* We think of a graph with nodes (event, state) for each event (SNP or -* end-point of a heterochromatic interval on the current chromosome) and state -* = 0 through 7 (corresponding to genotypes 1 to 6, plus 0 = -* heterochromatin); for events where state != 0, there are 7 edges from -* each (event, state) to (event+1, k) for 0 <= k <= 6. An edge (event, j) to -* (event+1, k) has penalty 0 if j = k and penalty switch_penalty otherwise. -* The bonus at SNP node (event, state) for 1 <= state <= 6 is the probability -* of generating the genotype observed in the a-th potentially admixed -* individual given the allele frequences in the source populations and -* the assumed admixture state in this region of the chromosome. The score of a -* path is the sum of the node bonuses minus the sum of the edge penalties. -* -* Working backwards through the events, we compute the maximum path score, -* from[state], from (event,state) back to the closest heterochromatin interval. 
-* To force paths to reach state 0 at an event signalling the start of a -* heterochromatic interval (type = 1), but to avoid state 0 at other events, -* we assign huge but arbitrary negative scores (see "avoid", below). -* At (event,state), B[event][state] is the backpointer to the state at -* event+1 on an optimal path. Finally, we follow backpointers to partition -* the chromosome into admixture states. -*/ -void one_admix(int a) { - int i, j, m, state, prev_pos, b; - double from[7], f[7], ff[7], avoid = -1000000.0; - struct snp *p; - - // from[i] = highest score of a path from the current event - // (usually a SNP) to the next (to the right) heterochromatic interval - // or the end of the chromosome. The score of the path is the sum of - // SNP scores minus (switch_penalty times number of state switches). - // We assume that the last two events on the chromosome are the start - // and end of a heterochromatic interval (possibly of length 0) - for (i = 0; i < 7; ++i) - from[i] = 0; - for (i = nsnp-1, p = last; i >= 0 && p != NULL; --i, p = p->prev) { - if (p->type == 0 && p->g[a] == -1) { // no genotype - // no state change at this SNP - for (state = 0; state < 7; ++state) - B[state][i] = state; - continue; - } - - for (state = 0; state < 7; ++state) { - // find highest path-score from this event onward - for (m = j = 0; j < 7; ++j) { - f[j] = from[j]; - if (j != state) - f[j] -= switch_penalty; - if (f[j] > f[m]) - m = j; - } - B[state][i] = m; - ff[state] = f[m]; - if (state > 0 && p->type == 0) - ff[state] += - score(p->F1, p->F2, p->F3, p->g[a], state); - } - if (p->type == 1) { - // start of heterochomatic interval. 
Force paths - // reaching this point to go through state 0 - from[0] = 0; - for (j = 1; j < 7; ++j) - from[j] = avoid; - } else { - for (j = 1; j < 7; ++j) - from[j] = ff[j]; - from[0] = avoid; - } - if (debug) { - fprintf(stderr, "%d:", i); - for (j = 0; j < 7; ++j) { - if (pop3 || j == 3 || j == 5 || j == 6) - fprintf(stderr, " %f(%d)", from[j], B[j][i]); - } - putc('\n', stderr); - } - } - - // find the best initial state - for (state = 0, j = 1; j < 7; ++j) - if (from[j] > from[state]) - state = j; - - // trace back to find the switch points - // A[a].x[state] records the total length of intervals in each state - for (prev_pos = i = 0; i < nsnp; ++i) { - if ((b = B[state][i]) != state) { - if (prev_pos < P[i+1]-1) - printf("%s\t%d\t%d\t%d\t%s\n", - this_chr, prev_pos, P[i+1], - (state==4 && !pop3 ? 3 : state), A[a].name); - A[a].x[state] += (double)(P[i+1]-prev_pos); - prev_pos = P[i+1]; - state = b; - } - } -} - -// Add a heterochromatic interval to the SNP list, where type = 1 signifies -// the start of the interval, 2 signifies the end. -void add_het(int b, int type) { - struct snp *new = ckalloc(sizeof(struct snp)); - int i; - - new->F1 = new->F2 = new->F3 = 0.0; - new->pos = b; - new->type = type; - new->g = ckalloc(nG*sizeof(int)); - for (i = 0; i < nG; ++i) - new->g[i] = 0; - new->prev = last; - last = new; -} - -/* Process one chromosome. Read the SNPs on the chromosome (the first one is -* already in the buf). Boil each SNP down to the contents of a SNP entry -* (pos, F1, F2, g[]) and put it in the linked list. Also, intersperse the -* "events" corresponding to the start and end of a heterochromatic interval. -* Then call the dynamic-programming routine for each potentially admixed -* individual. 
-*/ -void one_chr() { - char *s, *z = "\t\n"; - int X[MOST], n, i, g, A1, B1, A2, B2, A3, B3, a, do_read, p, pos, het, - old_pos; - struct snp *new; - double F1, F2, F3; - - strcpy(this_chr, get_chr_name()); - if (nchr_name == 0) - chr_name[nchr_name++] = copy_string(this_chr); - old_pos = nsnp = 0; - last = NULL; - // advance to this chromosome in the list of heterochromatic intervals - for (het = 0; het < nH && !same_string(this_chr, H[het].chr); ++het) - ; - // loop over the SNPs on the current chromosome - for (do_read = 0; ; do_read = 1) { - if (do_read && (status = fgets(buf, MOST, fp)) == NULL) - break; - if (!same_string(s = get_chr_name(), this_chr)) { - if (nchr_name >= MAX_CHR_NAME) - fatal("Too many chromosome names"); - for (i = 0; - i < nchr_name && !same_string(s, chr_name[i]); ++i) - ; - if (i < nchr_name) - fatalf("SNVs on %s aren't together", s); - chr_name[nchr_name++] = copy_string(s); - break; - } - - // set X[i] = atoi(i-th word of buf), i is base 1 - for (i = 1, s = strtok(buf, z); s != NULL; - ++i, s = strtok(NULL, z)) - X[i] = atoi(s); - - // insert events (pseudo-SNPs) for heterochomatin intervals - // coming before the SNP - pos = X[chr_col+1]; - if (pos <= old_pos) - fatalf("SNP at %s %d is out of order", this_chr, pos); - old_pos = pos; - while (het < nH && same_string(this_chr, H[het].chr) && - H[het].b < pos) { - add_het(H[het].b, 1); - add_het(H[het].e, 2); - nsnp+= 2; - ++het; - } - - // should we discard this SNP? 
- if (pos == -1) // SNP not mapped to the reference - continue; - - // add SNP to a "backward pointing" linked list, recording the - // major allele frequencies in the source populations and - // genotypes in the potential admixed individuals - for (i = A1 = B1 = A2 = B2 = A3 = B3 = 0; i < nI; ++i) { - n = C[i].col; - p = C[i].pop; - if (genotypes) { - g = X[n+2]; - if (g == -1) - continue; - if (g < 0 || g > 2) - fatalf("invalid genotype %d in column %d, pos %d", g, n+2, X[2]); - if (p == 1) { - A1 += g; - B1 += (2 - g); - } else if (p == 2) { - A2 += g; - B2 += (2 - g); - } else if (p == 3) { - A3 += g; - B3 += (2 - g); - } - } else { // use read counts - if (p == 1) { - A1 += X[n]; - B1 += X[n+1]; - } else if (p == 2) { - A2 += X[n]; - B2 += X[n+1]; - } - } - } - if (A1+B1 == 0 || A2+B2 == 0) - continue; - ++nsnp; - new = ckalloc(sizeof(struct snp)); - new->pos = X[chr_col+1]; - new->F1 = F1 = (double)A1/(double)(A1+B1); - new->F2 = F2 = (double)A2/(double)(A2+B2); - new->F3 = F3 = (double)A3/(double)(A3+B3); - new->type = 0; - new->g = ckalloc(nG*sizeof(int)); - for (i = 0; i < nG; ++i) - g = new->g[i] = X[A[i].gcol]; - if (F1 < 0.0 || F1 > 1.0) - fatalf("F1 = %f (A1 = %d, B1 = %d) at snp %d", - F1, A1, B1, nsnp); - if (F2 < 0.0 || F2 > 1.0) - fatalf("F2 = %f (A2 = %d, B2 = %d) at snp %d", - F2, A2, B2, nsnp); - if (F3 < 0.0 || F3 > 1.0) - fatalf("F3 = %f (A2 = %d, B2 = %d) at snp %d", - F3, A3, B3, nsnp); - new->prev = last; - last = new; - } - - // insert heterochomatin intervals that follow all SNPs - while (het < nH && same_string(this_chr, H[het].chr)) { - add_het(H[het].b, 1); - add_het(H[het].e, 2); - nsnp += 2; - ++het; - } - // make sure the picture is drawn to at least the last SNP - if (last->type == 0) { - i = last->pos + 1; - add_het(i, 1); - add_het(i, 2); - nsnp += 2; - } - - // allocate arrays for the DP analysis - P = ckalloc(nsnp*sizeof(int)); // position of each event - for (i = nsnp-1, new = last; i >= 0 && new != NULL; - --i, new = 
new->prev) - P[i] = new->pos; - - for (i = 0; i < 7; ++i) { // space for back-pointers - B[i] = ckalloc((nsnp+1)*sizeof(int)); - B[i][nsnp] = 0; - } - - // loop over possibly admixed individuals - for (a = 0; a < nG; ++a) - one_admix(a); - - // free the allocated storage - while (last != NULL) { - new = last; - last = last->prev; - free(new->g); - free(new); - } - free(P); - for (i = 0; i < 7; ++i) - free(B[i]); -} - -int main(int argc, char **argv) { - int n, i, j, k, saw[4]; - long long het_len, ref_len; - double N, tot[7], keep[7], xx, yy; - char nam[100], *chr; - - if (argc < 9) - fatal("args: table chr-col chr data-source logs switch heterochrom outfile n:1:name1 m:2:name2 ..."); - if (same_string(argv[argc-1], "debug")) { - debug = 1; - --argc; - } - - // handle command-line arguments - chr_col = atoi(argv[2]); - chr = argv[3]; - genotypes = atoi(argv[4]); - - logs = atoi(argv[5]); - //if (logs) switch_penalty = log(switch_penalty); - - switch_penalty = atof(argv[6]); - if (switch_penalty < 0.0) - fatal("negative switch penalty"); - out = ckopen(argv[8], "w"); - - het_len = ref_len = 0; - if (!same_string(argv[7], "-")) { - fp = ckopen(argv[7], "r"); - while (fgets(buf, MOST, fp)) { - if (nH >= MOST) - fatal("Too many heterochromatic intervals"); - if (sscanf(buf, "%s %d %d", nam, &i, &j) != 3) - fatalf("gagging: %s", buf); - if (nH > 0 && !same_string(nam, H[nH-1].chr)) - ref_len += H[nH-1].e; - H[nH].chr = copy_string(nam); - H[nH].b = i; - H[nH].e = j; - // assumes last event per chrom. is a het. interval - het_len += (j - i); - ++nH; - } - fclose(fp); - } - ref_len += H[nH-1].e; - - // populations must be disjoint - saw[0] = saw[1] = saw[2] = saw[3] = 0; - for (i = 9; i < argc; ++i) { - if (sscanf(argv[i], "%d:%d:%s", &j, &k, nam) != 3) - fatalf("not like 13:2:fred : %s", argv[i]); - if (k < 0 || k > 3) - fatalf("not population 0, 1, 2 or 3: %s", argv[i]); - saw[k] = 1; - - // seen this individual (i.e., column) before?? 
- for (n = 0; n < nI && C[n].col != j; ++n) - ; - if (n < nI) - fatal("populations are not disjoint"); - if (k == 0) { // admixed individual - if (nG >= MOST) - fatal("Too many admixed individuals"); - A[nG].name = copy_string(nam); - A[nG++].gcol = j+2; - } else { // in a source population - if (nI >= MOST) - fatal("Too many ancestral individuals"); - C[nI].col = j; - C[nI].pop = k; - C[nI++].name = copy_string(nam); - } - } - if (saw[0] == 0) - fatal("no admixed individual is specified"); - if (saw[1] == 0) - fatal("first reference population is empty"); - if (saw[2] == 0) - fatal("second reference population is empty"); - pop3 = saw[3]; - - // start the output file of text - for (k = 1; k <= 3; ++k) { - if (k == 3 && !pop3) - break; - fprintf(out, "source population %d (state %d):", k, k); - for (i = 0; i < nI; ++i) - if (C[i].pop == k) - fprintf(out, " %s", C[i].name); - fprintf(out, "\n\n"); - } - if (pop3) { - fprintf(out, "state 4 is heterozygous for populations 1 and 2\n"); - fprintf(out, - "state 5 is heterozygous for populations 1 and 3\n"); - fprintf(out, - "state 6 is heterozygous for populations 2 and 3\n"); - } else - fprintf(out, "state 3 is heterozygous for populations 1 and 2\n"); - fprintf(out, "\nswitch penalty = %2.2f\n", switch_penalty); - putc('\n', out); - - fp = ckopen(argv[1], "r"); - while ((status = fgets(buf, MOST, fp)) != NULL && buf[0] == '#') - ; - if (same_string(chr, "all")) - while (status != NULL) - one_chr(); - else { // skip to the specified chromosome - while (!same_string(chr, get_chr_name()) && - (status = fgets(buf, MOST, fp)) != NULL) - ; - if (status != NULL) - one_chr(); - } - - if (ref_len) - fprintf(out, - "%lld of %lld reference bp (%1.1f%%) are heterochromatin\n\n", - het_len, ref_len, 100.0*(float)het_len/(float)ref_len); - - // write fractions in each state to the output text file - for (j = 0; j < 7; ++j) - tot[j] = 0.0; - fprintf(out, "individual:"); - fprintf(out, "\tstate 1\tstate 2\tstate 3"); - if (pop3) - 
fprintf(out, "\tstate 4\tstate 5\tstate 6"); - fprintf(out, "\t pop 1\t pop 2"); - if (pop3) - fprintf(out, "\t pop 3"); - putc('\n', out); - for (i = 0; i < nG; ++i) { - N = A[i].x[1] + A[i].x[2] + A[i].x[4]; - if (pop3) - N += A[i].x[3] + A[i].x[5] + A[i].x[6]; - N /= 100.0; - fprintf(out, "%s:", A[i].name); - if (strlen(A[i].name) < 7) - putc('\t', out); - for (j = 1; j < 7; ++j) - if (pop3 || j == 1 || j == 2 || j == 4) { - tot[j] += (keep[j] = A[i].x[j]); - fprintf(out, "\t %5.1f%%", keep[j]/N); - } - keep[1] += 0.5*keep[4]; - keep[2] += 0.5*keep[4]; - if (pop3) { - keep[1] += 0.5*keep[5]; - keep[2] += 0.5*keep[6]; - keep[3] += 0.5*(keep[5]+keep[6]); - } - - fprintf(out, "\t %5.1f%%\t %5.1f%%", keep[1]/N, keep[2]/N); - if (pop3) - fprintf(out, "\t %5.1f%%", keep[3]/N); - - putc('\n', out); - } - if (nG > 1) { - fprintf(out, "\naverage: "); - N = tot[1] + tot[2] + tot[4]; - if (pop3) - N += (tot[3] + tot[5] + tot[6]); - N /= 100.0; - for (j = 1; j < 7; ++j) { - if (pop3 || j == 1 || j == 2 || j == 4) - fprintf(out, "\t %5.1f%%", tot[j]/N); - } - xx = tot[1] + 0.5*tot[4]; - yy = tot[2] + 0.5*tot[4]; - if (pop3) { - xx += 0.5*tot[5]; - yy += 0.5*tot[6]; - } - fprintf(out, "\t %5.1f%%\t %5.1f%%", xx/N, yy/N); - if (pop3) - fprintf(out, "\t %5.1f%%", - (tot[3] + 0.5*tot[5] + 0.5*tot[6])/N); - putc('\n', out); - } - - return 0; -}
--- a/genome_diversity/src/eval2pct.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,26 +0,0 @@ -#include "lib.h" - -#define MAX_EVAL 1000 - -float E[MAX_EVAL]; -int nE; - -int main (int argc, char **argv) { - FILE *fp; - char buf[500]; - int i; - float tot; - - fp = (argc== 1 ? stdin : ckopen(argv[1], "r")); - while (fgets(buf, 500, fp)) { - if (nE >= MAX_EVAL) - fatal("Too many eigenvalues"); - E[nE++] = atof(buf); - } - for (tot = 0.0, i = 0; i < nE; ++i) - tot += E[i]; - printf("Percentage explained by eigenvectors:\n"); - for (i = 0 ; i < nE && E[i] > 0.0; ++i) - printf("%d: %1.1f%%\n", i+1, 100.0*(float)E[i]/tot); - return 0; -}
--- a/genome_diversity/src/filter_snps.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,165 +0,0 @@ -/* filter_snps -- enforce constraints on a file of SNPs. -* -* argv[1] = file containing a Galaxy table -* argv[2] = 1 for a gd_snp file, 0 for gd_genotype -* argv[3] = lower bound on total coverage (< 0 means interpret as percentage) -* argv[4] = upper bound on total coverage (< 0 means interpret as percentage; -* 0 means no bound) -* argv[5] = lower bound on individual coverage -* argv[6] = lower bound on individual quality value -* argv[7] = lower bound on the number of defined genotypes -* argv[8] = lower bound on the spacing between SNPs -* argv[9] = reference-chromosome column (base-1); ref position in next column -* If argv[9] == 0, then argv[8] must be 0. -* argv[10] ... are the starting columns (base-1) for the chosen individuals. -* If argc == 10, then only the lower bound on spacing between SNPs is -* enforced. - -What it does on Galaxy -For a gd_snp dataset, the user specifies that some of the individuals form a "population", by supplying a list that has been previously created using the Specify Individuals tool. SNPs are then discarded if their total coverage for the population is too low or too high, or if their coverage or quality score for any individual in the population is too low. - -The upper and lower bounds on total population coverage can be specified either as read counts or as percentiles (e.g. "5%", with no decimal places). For percentile bounds the SNPs are ranked by read count, so for example, a lower bound of "10%" means that the least-covered 10% of the SNPs will be discarded, while an upper bound of, say, "80%" will discard all SNPs above the 80% mark, i.e. the top 20%. The threshold for the lower bound on individual coverage can only be specified as a plain read count.
- -For either a gd_snp or gd_genotype dataset, the user can specify a minimum number of defined genotypes (i.e., not -1) and/or a minimum spacing relative to the reference sequence. An error is reported if the user requests a minimum spacing but no reference sequence is available. - -*/ - -#include "lib.h" - -// most characters allowed in a row of the table -#define MOST 50000 - -// larger than any possible coverage -#define INFINITE_COVERAGE 100000000 - -char buf[MOST], chr[100], old_chr[100]; - -// column for the relevant individuals/groups -int col[MOST], *T; -int nI, lo, hi, min_space, min_geno, chr_col, pos, old_pos, gd_snp, X[MOST]; - -void get_X() { - char *p, *z = "\t\n", trash[MOST]; - int i; - - strcpy(trash, buf); - // set X[i] = atoi(i-th word of s), i is base 0 - for (i = 1, p = strtok(trash, z); p != NULL; - ++i, p = strtok(NULL, z)) { - if (chr_col && i == chr_col) - strcpy(chr, p); - X[i] = atoi(p); - if (chr_col && i == chr_col+1) - pos = X[i]; - } -} - -int compar(int *a, int *b) { - if (*a < *b) - return -1; - if (*a > *b) - return 1; - return 0; -} - -void find_lo(char *filename) { - FILE *fp = ckopen(filename, "r"); - int n, m, i, k; - - for (n = 0; fgets(buf, MOST, fp); ++n) - ; - T = ckalloc(n*sizeof(int)); - rewind(fp); - for (k = 0; fgets(buf, MOST, fp); ++k) { - get_X(); - for (i = T[k] = 0; i < nI; ++i) { - m = col[i]; - T[k] += (X[m]+X[m+1]); - } - } - qsort((void *)T, (size_t)n, sizeof(int), (const void *)compar); - if (lo < 0) { - lo = -lo; - if (lo > 100) - fatal("cannot have low > 100%"); - lo = T[(n*lo)/100]; - } - if (hi < 0) { - hi = -hi; - if (hi > 100) - fatal("cannot have high > 100%"); - hi = T[(n*hi)/100]; - } - free(T); - fclose(fp); -} - -void OK() { - if (chr_col == 0 || !same_string(chr, old_chr) || - pos >= old_pos + min_space) { - printf("%s", buf); - if (chr_col) { - strcpy(old_chr, chr); - old_pos = pos; - } - } -} -int main(int argc, char **argv) { - FILE *fp; - int m, i, cov, tot_cov, indiv, qual, ngeno; - - if 
(argc < 10) - fatalf("args: SNP-table gd_snp low-tot high-tot low-cov low-qual low-genotype low-space chr-col col1 col2 ..."); - - for (i = 10, nI = 0; i < argc; ++i, ++nI) - col[nI] = atoi(argv[i]); - gd_snp = atoi(argv[2]); - lo = atoi(argv[3]); - hi = atoi(argv[4]); - if (hi == 0) - hi = INFINITE_COVERAGE; - if (lo < 0 || hi < 0) - find_lo(argv[1]); - indiv = atoi(argv[5]); - qual = atoi(argv[6]); - min_geno = atoi(argv[7]); - min_space = atoi(argv[8]); - chr_col = atoi(argv[9]); - - // reality checks - if (!gd_snp && - (lo != 0 || hi != INFINITE_COVERAGE || indiv != 0 || qual != 0)) - fatal("cannot bound coverage or quality in gd_genotype file"); - if (chr_col == 0 && min_space != 0) - fatalf("Specification of minimum spacing requires a reference sequence"); - if (indiv < 0 || qual < 0) - fatalf("percentiles not implemented for individuals"); - - // scan the SNPs - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - get_X(); - for (i = tot_cov = ngeno = 0; i < nI; ++i) { - m = col[i]; - if (gd_snp) { - cov = (X[m]+X[m+1]); - if (cov < indiv || X[m+3] < qual) - break; - tot_cov += cov; - } - if (X[m+2] != -1) - ++ngeno; - } - if (i < nI) // check bounds on the population's individuals - continue; - if (gd_snp && (tot_cov < lo || tot_cov > hi)) - continue; - if (ngeno >= min_geno) - // OK, except possibly for lower bound on spacing - OK(); - } - - return 0; -}
--- a/genome_diversity/src/get_pi.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,282 +0,0 @@ -/* get_pi -- compute piNon, piSyn, thetaNon and thetaSyn -* -* argv[1] -- SAPs file -* argv[2] -- SNPs file -* argv[3] -- covered intervals file -* argv[4], argv[5], ... -- starting columns in the SNP file for the chosen -* individuals -* -* We assume that lines of argv[1], argv[2] and argv[3] start with a scaffold -* name and a scaffold position, and that they are sorted on those two fields. -* The 4th entry in an interval line gives the reference chromosome. We ignore -* unnumbered chromosomes, e.g., "chrX". -* -* Output: -* the number of nonsyn SNPs, total number of nonsynon sites, piNon, -* the number of synon SNPs, total number of synon sites, piSyn, plus -* total length of covered intervals, thetaNon, thetaSyn -* -* What it does on Galaxy -The tool computes values that estimate the basic nucleotide-diversity parameters piNon, piSyn, thetaNon, and thetaSyn. -*/ - -#include "lib.h" - -// END_FILE should be larger than any real scaffold number -#define END_FILE 1000000000 // scaffold "number" signifying end-of-file -#define BUF_SIZE 50000 // maximum length of a SNP-file row - -int col[10000], nC; // columns containing the chosen genotypes - -// scaffold numbers and positions of the current SAP, SNP, and interval -int nbr_SAP, nbr_SNP, nbr_interv, pos_SAP, pos_SNP, beg, end, columns, debug; - -// current SNP row, the variant amino acids of the SAP, interval's reference chr -char snp[BUF_SIZE], A[100], B[100], chr[100]; - -// number of nonsynon and synon sites in the current interval -float all_non, all_syn; - -// return the number of chromosome pairs that differ at a SNP -int diff_pair() { - int i, j, X[1000]; - char *p, *z = "\t\n"; - - // set X[i] = atoi(i-th word of SNP-file row), base 1 - for (i = 1, p = strtok(snp, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - // add genotypes to count the reference allele - for (j = i = 0; i < nC; ++i) - j += X[col[i]]; - // of the
2*nC chromosomes, j have the reference nucleotide - if (debug) - printf("get_pi: j = %d, return %d\n", j, j*(2*nC-j)); - return j*(2*nC-j); -} - -// translate e.g. "scaffold234" to the integer 234 -int name2nbr(char *s) { - if (same_string(s, "chrX")) - return 1000; - if (same_string(s, "chrY")) - return 1001; - while (isalpha(*s)) - ++s; - return atoi(s); -} - -// does one scaffold-position pair strictly precede another -int before(int nbra, int posa, int nbrb, int posb) { - return (nbra < nbrb || (nbra == nbrb && posa < posb)); -} - -// get a SAP; set A and B; set nbr_SAP = END_FILE for end-of-file -void get_SAP(FILE *fp) { - char buf[500], scaf_name[100]; - int old_nbr = nbr_SAP, old_pos = pos_SAP; - - if (nbr_SAP >= END_FILE) - return; - if (!fgets(buf, 500, fp)) { - nbr_SAP = END_FILE; - return; - } - while (buf[0] == '#') - if (!fgets(buf, 500, fp)) { - nbr_SAP = END_FILE; - return; - } - if (columns == 8) { - if (sscanf(buf, "%s %d %*s %*s %*s %*s %s %*s %s", - scaf_name, &pos_SAP, A, B) != 4) - fatalf("bad SAP : %s", buf); - } else if (columns == 5) { - if (sscanf(buf, "%s %d %*s %*s %s %*s %s", - scaf_name, &pos_SAP, A, B) != 4) - fatalf("bad SAP : %s", buf); - } else - fatalf("get_SAP: columns = %d", columns); - nbr_SAP = name2nbr(scaf_name); - if (before(nbr_SAP, pos_SAP, old_nbr, old_pos)) - fatalf("SAP at scaf%d %d is out of order", nbr_SAP, pos_SAP); - if (debug) - printf("SAP: scaf%d %d\n", nbr_SAP, pos_SAP); -} - -// get a SNP -void get_SNP(FILE *fp) { - char scaf_name[100]; - int old_nbr = nbr_SNP, old_pos = pos_SNP; - - if (nbr_SNP >= END_FILE) - return; - if (!fgets(snp, BUF_SIZE, fp)) { - nbr_SNP = END_FILE+1; - return; - } - while (snp[0] == '#') - if (!fgets(snp, 500, fp)) { - nbr_SNP = END_FILE+1; - return; - } - if (sscanf(snp, "%s %d", scaf_name, &pos_SNP) != 2) - fatalf("bad SNP : %s", snp); - nbr_SNP = name2nbr(scaf_name); - if (before(nbr_SNP, pos_SNP, old_nbr, old_pos)) { - fprintf(stderr, "seq%d %d before seq%d %d\n", - nbr_SNP, 
pos_SNP, old_nbr, old_pos); - fatalf("SNP at sequence %d %d is out of order", nbr_SNP, pos_SNP); - } - if (debug) - printf("SNP: scaf%d %d\n", nbr_SNP, pos_SNP); -} - -// expand fractions .333 and .666 to full double-precision accuracy -double grow(float x) { - int chop = x; - float remain; - double d, third = (double)1/(double)3; - - d = (double)chop; - remain = x - (float)chop; - if (0.1 < remain) - d += third; - if (0.5 < remain) - d += third; - return d; -} - -// read an interval; update tot_non and tot_syn -int get_interval(FILE *fp) { - char buf[500], scaf_name[100], tmp[500], *t, *z = " \t\n"; - int old_nbr = nbr_interv, old_end = end; - - if (!fgets(buf, 500, fp)) - return 0; - while (buf[0] == '#') - if (!fgets(buf, 500, fp)) - return 0; - if (columns == 0) { - strcpy(tmp, buf); - for (columns = 0, t = strtok(tmp, z); t != NULL; - ++columns, t = strtok(NULL, z)) - ; - } - if (columns != 5 && columns != 8) - fatalf("interval table has %d columns", columns); - if (columns == 8 && sscanf(buf, "%s %d %d %s %*s %*s %f %f", - scaf_name, &beg, &end, chr, &all_non, &all_syn) != 6) - fatalf("bad interval : %s", buf); - if (columns == 5) { - if (sscanf(buf, "%s %d %d %f %f", - scaf_name, &beg, &end, &all_non, &all_syn) != 5) - fatalf("bad interval : %s", buf); - strcpy(chr, scaf_name); - } - nbr_interv = name2nbr(scaf_name); - if (before(nbr_interv, beg, old_nbr, old_end)) - fatalf("interval at %s %d is out of order", scaf_name, beg); - if (debug) - printf("int: scaf%d %d-%d\n", nbr_interv, beg, end); - - return 1; -} - -int main(int argc, char **argv) { - FILE *fp1, *fp2, *fp3; - int i, nint, nsap, no_sap, no_snp, no_chr, nsyn, nnon, d, tot_len; - float non, syn, x; - double tot_non = 0.0, tot_syn = 0.0, // total sites in the intervals - factor; - - // process command-line arguments - if (same_string(argv[argc-1], "debug")) { - debug = 1; - --argc; - } - if (argc < 5) - fatal("args: SAPs SNPs intervals individual1 ... 
[debug]"); - fp1 = ckopen(argv[1], "r"); - fp2 = ckopen(argv[2], "r"); - fp3 = ckopen(argv[3], "r"); - for (i = 4; i < argc; ++i) - col[i-4] = atoi(argv[i]) + 2; - nC = argc - 4; - - // loop over the intervals - tot_len = no_sap = nsap = no_snp = no_chr = nsyn = nnon = 0; - non = syn = 0.0; - for (nint = 0; get_interval(fp3); ++nint) { - if (strncmp(chr, "chr", 3) == 0 && !isdigit(chr[3])) { - ++no_chr; - continue; - } - tot_len += (end - beg); - // expand e.g. .333 to .3333333.. - tot_non += grow(all_non); - tot_syn += grow(all_syn); - - // skip SAPs coming before this interval - while (before(nbr_SAP, pos_SAP, nbr_interv, beg)) - get_SAP(fp1); - // loop over SAPs in this inteval - while (before(nbr_SAP, pos_SAP, nbr_interv, end)) { - ++nsap; - - // look for this SAP in the SNP file - while (before(nbr_SNP, pos_SNP, nbr_SAP, pos_SAP)) { - if (nbr_SNP == nbr_interv && pos_SNP >= beg) - ++no_sap; - get_SNP(fp2); - } - - // is this the SAP we looked for? - if (nbr_SAP == nbr_SNP && pos_SAP == pos_SNP) { - d = diff_pair(); - if (A[0] == B[0]) { - ++nsyn; - syn += (float)d; - } else { - ++nnon; - non += (float)d; - } - get_SNP(fp2); - } else - ++no_snp; - get_SAP(fp1); - } - // process SNPs in the interval but not in the SAP file - while (before(nbr_SNP, pos_SNP, nbr_interv, end)) { - if (nbr_SNP == nbr_interv && pos_SNP >= beg) - ++no_sap; - get_SNP(fp2); - } - } - - // there are x = (2*nC choose 2) pairs of chromosomes - x = (float)(nC*(2*nC-1)); - non /= x; - syn /= x; - printf("%d intervals\n", nint); - if (no_chr) - printf("ignored %d interval%s on unnumbered chromosomes, like chrX\n", - no_chr, no_chr == 1 ? 
"" : "s"); - printf("%d SNPs, %d nonsyn, %d synon\n", nsap, nnon, nsyn); - if (no_sap) - printf("%d SNPs in an interval are not in the SAP table\n", - no_sap); - if (no_snp) - printf("%d SNPs in an interval are not in the SNP table\n", - no_snp); - printf("nonsynon: %4.3f/%4.3f = %6.5f%%\n", - non, tot_non, 100.0*non/tot_non); - printf("synon: %4.3f/%4.3f = %6.5f%%\n", - syn, tot_syn, 100.0*syn/tot_syn); - for (factor = 0.0, i = 1; i < 2*nC; ++i) - factor += (1.0/(double)i); - factor *= (double)tot_len/100.0; - printf("%d covered bp, thetaNon = %6.5f%%, thetaSyn = %6.5f%%\n", - tot_len, (double)nnon/factor, (double)nsyn/factor); - return 0; -}
--- a/genome_diversity/src/lib.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,71 +0,0 @@ -// lib.c -- a little library of C procudures - -#include "lib.h" - -char *argv0; - -/* print_argv0 ---------------------------------------- print name of program */ -void print_argv0(void) -{ - if (argv0) { - char *p = strrchr(argv0, '/'); - (void)fprintf(stderr, "%s: ", p ? p+1 : argv0); - } -} - -/* fatal ---------------------------------------------- print message and die */ -void fatal(const char *msg) -{ - fatalf("%s", msg); -} - -/* fatalf --------------------------------- format message, print it, and die */ -void fatalf(const char *fmt, ...) -{ - va_list ap; - va_start(ap, fmt); - fflush(stdout); - print_argv0(); - (void)vfprintf(stderr, fmt, ap); - (void)fputc('\n', stderr); - va_end(ap); - exit(1); -} - -/* ckopen -------------------------------------- open file; check for success */ -FILE *ckopen(const char *name, const char *mode) -{ - FILE *fp; - - if ((fp = fopen(name, mode)) == NULL) - fatalf("Cannot open %s.", name); - return fp; -} - -/* ckalloc -------------------------------- allocate space; check for success */ -void *ckalloc(size_t amount) -{ - void *p; - - if ((long)amount < 0) /* was "<= 0" -CR */ - fatal("ckalloc: request for negative space."); - if (amount == 0) - amount = 1; /* ANSI portability hack */ - if ((p = malloc(amount)) == NULL) - fatalf("Ran out of memory trying to allocate %lu.", - (unsigned long)amount); - return p; -} - -/* same_string ------------------ determine whether two strings are identical */ -bool same_string(const char *s, const char *t) -{ - return (strcmp(s, t) == 0); -} - -/* copy_string ---------------------- save string s somewhere; return address */ -char *copy_string(const char *s) -{ - char *p = ckalloc(strlen(s)+1); /* +1 to hold '\0' */ - return strcpy(p, s); -}
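The estimators that the removed get_pi.c reports reduce to two textbook formulas: pi is the average number of pairwise differences per surveyed site (diff_pair()'s `j*(2*nC - j)` counts differing chromosome pairs at one SNP), and theta is Watterson's estimator, matching the `factor` loop at the end of main(). A minimal Python sketch under those assumptions; the function and argument names are illustrative, not the tool's interface:

```python
def pairwise_diffs(j, n_chrom):
    """Chromosome pairs differing at a biallelic site where j of the
    n_chrom chromosomes carry the reference allele -- diff_pair()'s
    j*(2*nC - j), with n_chrom = 2*nC."""
    return j * (n_chrom - j)

def pi_percent(ref_counts, n_chrom, tot_sites):
    """Average pairwise differences per site, as a percentage:
    summed per-SNP differences / C(n_chrom, 2) / sites surveyed."""
    pairs = n_chrom * (n_chrom - 1) // 2
    diffs = sum(pairwise_diffs(j, n_chrom) for j in ref_counts)
    return 100.0 * diffs / pairs / tot_sites

def watterson_percent(n_snps, n_chrom, tot_len):
    """Watterson's theta per site (in %): segregating sites divided by
    a = sum_{i=1}^{n-1} 1/i times the covered length."""
    a = sum(1.0 / i for i in range(1, n_chrom))
    return 100.0 * n_snps / (a * tot_len)
```

With four chromosomes and two reference alleles at a SNP, `pairwise_diffs(2, 4)` gives 4 differing pairs out of the 6 possible, matching the C code's `x = nC*(2*nC-1)` denominator.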
--- a/genome_diversity/src/lib.h Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,33 +0,0 @@ -// lib.h -- header file for some useful procedures - -#include <stdio.h> -#include <stdlib.h> -#include <string.h> -#include <ctype.h> -#include <limits.h> /* INT_MAX, INT_MIN, LONG_MAX, LONG_MIN, etc. */ -#include <stdarg.h> - -typedef unsigned char uchar; -typedef int bool; - -extern char *argv0; - -void print_argv0(void); -#ifdef __GNUC__ /* avoid some "foo might be used uninitialized" warnings */ - void fatal(const char *msg) __attribute__ ((noreturn)); - void fatalf(const char *fmt, ...) __attribute__ ((noreturn)); - void fatalfr(const char *fmt, ...) __attribute__ ((noreturn)); -#else - void fatal(const char *msg); - void fatalf(const char *fmt, ...); - void fatalfr(const char *fmt, ...); -#endif -FILE *ckopen(const char *name, const char *mode); -void *ckalloc(size_t amount); -bool same_string(const char *s, const char *t); -char *copy_string(const char *s); - -#undef MAX -#define MAX(x,y) ((x) > (y) ? (x) : (y)) -#undef MIN -#define MIN(x,y) ((x) < (y) ? (x) : (y))
--- a/genome_diversity/src/mito_lib.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,98 +0,0 @@ -// mito_data.c -- shared procedures to read SNP and coverage file for -// mitogenomes - -#include "lib.h" -#include "mito_lib.h" - -#define ADHOC - -// get the adequately covered intervals for each specified individual; -// merge adjacent intervals -void get_intervals(char *filename) { - FILE *fp = ckopen(filename, "r"); - char buf[500], name[100]; - struct interv *p, *new; - int i, b, e, cov; - - while (fgets(buf, 500, fp)) { - if (sscanf(buf, "%*s %d %d %d %s", &b, &e, &cov, name) != 4) - fatalf("interval: %s", buf); - if (cov < min_cov) - continue; - for (i = 0; i < nM && !same_string(M[i].name, name); ++i) - ; - if (i == nM) - continue; - if (M[i].last != NULL && M[i].last->e == b) { - // merge with adjacent interval - M[i].last->e = e; - continue; - } - new = ckalloc(sizeof(*new)); - new->b = b; - new->e = e; - new->next = NULL; - if ((p = M[i].last) == NULL) - M[i].intervals = new; - else - p->next = new; - M[i].last = new; - } - fclose(fp); -/* - for (i = 0; i < nM; ++i) { - printf("%s:", M[i].name); - for (p = M[i].intervals; p; p = p->next) - printf(" %d-%d", p->b, p->e); - putchar('\n'); - } -*/ -} - -// get the SNPs; for each SNP set the array of (first characters from) -// genotypes of the specified samples (individuals) -int get_variants(char *filename, struct snp *S, int refcol) { - FILE *fp = ckopen(filename, "r"); - char buf[5000], *s, *f[101], *z = " \t\n\0"; - int i, n, c; - - for (i = 0; i <= 100; ++i) - f[i] = NULL; - for (n = 0; fgets(buf, 500, fp); ++n) { - if (buf[0] == '#') { - --n; - continue; - } - if (n >= MAX_SNP) - fatal("too many SNPs"); - if (sscanf(buf, "%*s %d", &(S[n].pos)) != 1) - fatalf("pos : %s", buf); - S[n].g = ckalloc((nM+1)*(sizeof(char))); - S[n].g[nM] = '\0'; - for (i = 0; i <= 100; ++i) - if (f[i] != NULL) - free(f[i]); - for (i = 1, s = strtok(buf, z); s; s = strtok(NULL, z), ++i) - f[i] = 
copy_string(s); - for (i = 0; i < nM; ++i) { - // genotype is 2 columns past the individual's 1st - // column - // AD HOC RULE: IF THERE IS ONE READ OF EACH ALLELE, - // THEN IGNORE THE SNP. - c = M[i].col; - if (refcol == 0) - S[n].g[i] = f[c+2][0]; - else if (same_string(f[refcol+2], f[c+2])) - S[n].g[i] = '2'; - else - S[n].g[i] = '0'; -#ifdef ADHOC - if (same_string(f[c], "1") && - same_string(f[c+1], "1")) - S[n].g[i] = '-'; -#endif - } - } - fclose(fp); - return n; -}
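get_intervals() above keeps only intervals meeting the coverage threshold and merges an interval into its predecessor when the two abut exactly (half-open coordinates). A sketch of that filter-and-merge step, assuming a simple `(start, end, coverage)` tuple format rather than the tool's file layout:

```python
def filtered_merged(intervals, min_cov):
    """Keep (start, end, coverage) rows with coverage >= min_cov and
    merge an interval into the previous one when it starts exactly
    where that one ends, as get_intervals() does."""
    out = []
    for b, e, cov in intervals:
        if cov < min_cov:
            continue
        if out and out[-1][1] == b:
            out[-1][1] = e          # extend the previous interval
        else:
            out.append([b, e])
    return [tuple(iv) for iv in out]
```

For example, `filtered_merged([(0, 10, 5), (10, 20, 5), (25, 30, 2), (30, 40, 5)], 3)` drops the low-coverage row and joins the first two into `(0, 20)`.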
--- a/genome_diversity/src/mito_lib.h Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -// mito_data.h -- header file for shared procedures to read SNPs and intervals -// for mitogenomes - -#define MAX_SNP 20000 -#define MAX_SAMPLE 100 -struct snp { - int pos; - char *g; // genotypes - one character per specified mitogenome -} S[MAX_SNP], I[MAX_SNP]; - -// intervals associated with each specified mitogenome -struct interv { - int b, e; - struct interv *next; -}; -int nM, min_cov, debug; - -// mitogenomes -struct mito { - char *name; - int col; // first column in the SNP table - struct interv *intervals, *last; -} M[MAX_SAMPLE]; - -// get the adequately covered intervals for each specified individual; -// merge adjacent intervals -void get_intervals(char *filename); - -// get the SNPs; for each SNP set the array of (first characters from) -// genotypes of the specified samples (individuals) -int get_variants(char *filename, struct snp *S, int refcol);
--- a/genome_diversity/src/mk_Ji.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,147 +0,0 @@ -/* mk_Ji -- prepare data for drawing a picture of mitogenome data -* -* argv[1] -- SNP table for the mitogenome -* -* argv[2] -- indel table for the mitogenome -* -* argv[3] -- coverage table: file of intervals with lines like: - -P.ingens-mt 175 194 6 9-M-352 - -* giving genome name, start postion (base-0), end position (half open), -* coverage and sample name. -* -* argv[4] -- annotation file like - -P.ingens-mt 0 70 tRNA + tRNA-Phe -P.ingens-mt 70 1030 rRNA + 12S -P.ingens-mt 1030 1100 tRNA + tRNA-Val -P.ingens-mt 1101 2680 CDS + rRNA -P.ingens-mt 1101 2680 rRNA + 16S -P.ingens-mt 2680 2755 tRNA + tRNA-Leu -P.ingens-mt 2758 3713 CDS + ND1 -... -P.ingens-mt 15484 16910 D-loop + D-loop - -* argv[5] -- the minimum coverage. Intervals of lower coverage are ignored. -* -* argv[6] -- either the string "default" or the name of an individual -* -* argv[7], argv[8], ... column:name pairs like "9:sam". -* -* Also, if the last argument is "debug", then much output sent to stderr. -*/ - -#include "lib.h" -#include "mito_lib.h" - -int ref_len; - -// read gene annotation, change "CDS" to "gene", print for the drawing tool, -// print lines showing the genome name and length (last annotated position). -void get_genes(char *filename) { - FILE *fp = ckopen(filename, "r"); - char buf[500], ref[100], type[100], name[100], *t; - int b, e; - - while (fgets(buf, 500, fp)) { - if (sscanf(buf, "%s %d %d %s %*s %s", - ref, &b, &e, type, name) != 5) - fatalf("gag: %s", buf); - t = (same_string(type, "CDS") ? 
"gene" : type); - // print the Genome Annotation line - printf("@GA=%d:%d:%s:%s\n", b, e, name, t); - } - printf("@GL=%d\n", ref_len = e); - printf("@GN=%s\n", ref); -} - -// print items that are adequately covered -void visible(int i, struct snp *S, int nS, char *s) { - struct interv *t; - int j; - - for (j = 0, t = M[i].intervals; j < nS; ++j) { - while (t && t->e <= S[j].pos) - t = t->next; - if (t && t->b <= S[j].pos && S[j].g[i] == '0') - printf(" %s%d", s, S[j].pos); - } -} - -int main(int argc, char **argv) { - struct interv *t; - int i, nS, nI, last_e, refcol; - char *a, *s; - - if (argc > 7 && same_string(argv[argc-1], "debug")) { - --argc; - debug = 1; - } - - if (argc < 7) - fatal("args: snps indels intervals genes min_cov ref 9:sam 13:judy ... "); - - // store sample names and start positions (argv[6], argv[7], ...) - for (nM = 0, i = 7; i < argc; ++nM, ++i) { - if (nM >= MAX_SAMPLE) - fatalf("Too many mitogenomes"); - if ((s = strchr(a = argv[i], ':')) == NULL) - fatalf("colon: %s", a); - M[nM].col = atoi(a); - M[nM].name = copy_string(s+1); - } - if (same_string(argv[6], "default")) - refcol = 0; - else { - for (i = 0; i < nM && !same_string(argv[6], M[i].name); ++i) - ; - if (i == nM) - fatalf("improper reference name '%s'", argv[6]); - refcol = M[i].col; - // fprintf(stderr, "refcol = %d\n", refcol); - } - - // read annotation and annotate in the file for drawing - get_genes(argv[4]); - - // record color information - printf("@CL=rRNA:#EF8A62\n@CL=tRNA:#31A354\n@CL=gene:#B2182B\n"); - printf("@CL=missing:#67A9CF:L\n@CL=indel:#2166AC\n@CL=special:#000000\n"); - - min_cov = atoi(argv[5]); - - // store the coverage data - get_intervals(argv[3]); - - if (debug) { - for (i = 0; i < nM; ++i) { - fprintf(stderr, ">%s\n", M[i].name); - for (t = M[i].intervals; t; t = t->next) - fprintf(stderr, "%d\t%d\n", t->b, t->e); - } - putc('\n', stderr); - } - - // record the variants - nS = get_variants(argv[1], S, refcol); - nI = get_variants(argv[2], I, refcol); - 
- // report the information for each sample - for (i = 0; i < nM; ++i) { - printf("%s", M[i].name); - visible(i, S, nS, ""); - visible(i, I, nI, "indel="); - last_e = 0; - for (t = M[i].intervals; t; t = t->next) { - if (last_e < t->b) - printf(" missing=%d:%d", last_e, t->b); - last_e = t->e; - } - if (last_e < ref_len) - printf(" missing=%d:%d", last_e, ref_len); - putchar('\n'); - } - - return 0; -}
--- a/genome_diversity/src/mt_pi.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,164 +0,0 @@ -/* mt_pi -- compute the diversity measure pi for mitochondrial genomes -* [SHOULD I OPTIONALLY INCLUDE INDELS?] -* -* argv[1] -- SNP table for the mitogenome -* -* argv[2] -- file of intervals with lines like: - -P.ingens-mt 175 194 6 9-M-352 - -* giving genome name, start postion (base-0), end position (half open), -* coverage and sample name. -* -* argv[3] -- the minimum coverage. Intervals of lower coverage are ignored. -* -* argv[4], argv[5], ... column:name pairs like "9:sam". -* -* Also, if the last argument is "debug", then much output sent to stderr, if it -* is "debug2", then the coverage and difference-count for each mitogenome-pair -* is sent to stderr. -*/ - -#include "lib.h" -#include "mito_lib.h" - -int debug2; - -// for a pair of samples, determine how much of the reference is in the -// adequately covered segments for both, and count the number of SNPs in those -// shared regions where they differ -// PUTATIVE HETEROPLASMIES ARE IGNORED -float pair(int i, int j, int nS) { - int covered, B, E, diffs, k; - struct interv *p = M[i].intervals, *q = M[j].intervals; - char x, y; - - // k scans the SNPs - covered = diffs = k = 0; - while (p && q) { - if (debug) - fprintf(stderr, "trying %d-%d and %d-%d\n", - p->b, p->e, q->b, q->e); - // take the intersection of the two well-covered intervals - B = MAX(p->b, q->b); - E = MIN(p->e, q->e); - if (B < E) { - if (debug) - fprintf(stderr, " covered %d\n", E-B); - covered += (E - B); - for ( ; k < nS && S[k].pos < E; ++k) { - if (S[k].pos >= B) { - x = S[k].g[i]; - y = S[k].g[j]; - if (debug) - fprintf(stderr, - " SNP %c %c at %d\n", - x, y, S[k].pos); -/* - if (x == '-' || y == '-') - fatalf("SNP at %d missing geno", - S[k].pos); -*/ -/* - if (x == '1' || y == '1') - continue; -*/ - if (x != y) { - ++diffs; - if (debug) - fprintf(stderr, - "\tdiff at %d\n", - S[k].pos); - } - } - } - } 
- // go to next adequately covered interval(s) - if (p->e < q->e) - p = p->next; - else if (p->e > q->e) - q = q->next; - else { - p = p->next; - q = q->next; - } - } - if (debug2) - fprintf(stderr, "cov(%s,%s) = %d, diffs = %d\n", - M[i].name, M[j].name, covered, diffs); -/* - if (covered == 0) - fatalf("coverage threshold is too high for %s and %s", - M/[i].name, M[j].name); -*/ - if (covered == 0) - return -1.0; - return (float)diffs/(float)covered; -} - -int main(int argc, char **argv) { - struct interv *t; - int i, j, nS, good_pairs, bad_pairs; - char *a, *s; - float tot, pr; - - if (argc > 4 && same_string(argv[argc-1], "debug")) { - --argc; - debug = debug2 = 1; - } else if (argc > 4 && same_string(argv[argc-1], "debug2")) { - --argc; - debug2 = 1; - } - - if (argc < 5) - fatal("args: snps intervals min_cov 9:sam 13:judy ... "); - // store sample names and start positions (argv[4], argv[5], ...) - for (nM = 0, i = 4; i < argc; ++nM, ++i) { - if (nM >= MAX_SAMPLE) - fatalf("Too many mitogenomes"); - if ((s = strchr(a = argv[i], ':')) == NULL) - fatalf("colon: %s", a); - M[nM].col = atoi(a); - M[nM].name = copy_string(s+1); - } - min_cov = atoi(argv[3]); - get_intervals(argv[2]); - - if (debug) { - for (i = 0; i < nM; ++i) { - fprintf(stderr, ">%s\n", M[i].name); - for (t = M[i].intervals; t; t = t->next) - fprintf(stderr, "%d\t%d\n", t->b, t->e); - } - putc('\n', stderr); - } - - // record the SNPs - nS = get_variants(argv[1], S, 0); - - if (debug) { - for (i = 0; i < nS; ++i) - fprintf(stderr, "%d %s\n", S[i].pos, S[i].g); - putc('\n', stderr); - } - - // record the total rate of diversity, over all pairs of individuals - // having overlapping well-covered intervals - good_pairs = bad_pairs = 0; - for (i = 0, tot = 0.0; i < nM; ++i) { - for (j = i+1; j < nM; ++j) { - pr = pair(i, j, nS); - if (pr >= 0.0) { - ++good_pairs; - tot += pr; - } else - ++bad_pairs; - } - } - printf("pi = %5.5f\n", tot/(float)good_pairs); - if (bad_pairs > 0) - printf("%d of %d 
pairs had no sequenced regions in common\n", - bad_pairs, bad_pairs + good_pairs); - - return 0; -}
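The per-pair computation in mt_pi.c's pair() is: intersect the two samples' well-covered interval lists with a two-pointer scan, count differing SNPs inside that shared region, and divide by the shared length. A sketch with illustrative data shapes (sorted half-open intervals; SNPs as `(pos, genotype_a, genotype_b)` triples):

```python
def shared_coverage(a, b):
    """Total overlap of two sorted lists of half-open intervals,
    mirroring the two-pointer scan in pair()."""
    i = j = covered = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            covered += hi - lo
        if a[i][1] <= b[j][1]:      # advance whichever interval ends first
            i += 1
        else:
            j += 1
    return covered

def pair_pi(ivs_a, ivs_b, snps):
    """Differences per shared base pair, or None when the pair shares
    no covered sequence (the C code returns -1.0 and skips the pair)."""
    inside = lambda pos, ivs: any(b <= pos < e for b, e in ivs)
    covered = shared_coverage(ivs_a, ivs_b)
    if covered == 0:
        return None
    diffs = sum(x != y for pos, x, y in snps
                if inside(pos, ivs_a) and inside(pos, ivs_b))
    return diffs / covered
```

The overall pi is then the mean of `pair_pi` over all pairs with a non-`None` result, as in the good_pairs/bad_pairs accounting in main().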
--- a/genome_diversity/src/sweep.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,302 +0,0 @@ -/* sweep -- find regions of the genome with high scores (e.g., Fst scores). -* -* argv[1] -- file containing a Galaxy table -* argv[2] -- column number for the chromosome name (column numbers base-1) -* argv[3] -- column number for the chromosomal position -* argv[4] -- column number for a score for the position -* argv[5] -- a percentage, such as "95", or a raw score, such as "=0.9". -* argv[6] -- the number of randomizations (shuffles) of the scores -* argv[7] -- [optional] if present and non-zero, report SNPs -* -* The program first determines a threshold such that the stated percentage -* of the scores are below that threshold (or uses the provided number if -* argv[5] starts with "="). The program subtracts the threshold -* from each score, then looks for maximal-scoring runs of SNPs, i.e., where -* adding or subtracting SNPs from an end of then run always decreases the -* total score. These regions are printed in order of descreasing total score. -* To determine a cutoff for the printed regions, the programs takes the maximum -* score over all regions observed in a specified number of shuffles of the -* list of scores. If argv[6] = 0, then all maximal-scoring runs of at least -* 4 table entries are printed. - -What it does on Galaxy -The user selects a SNP table and specifies the columns containing (1) chromosome, (2) position, (3) scores (such as an Fst-value for the SNP), (4) a percentage or raw score for the "cutoff" and (5) the number of times the data should be radomized (only intervals with score exceeding the maximum for the randomized data are reported). If a percentage (e.g. 95%) is specified for #3, then that percentile of the scores is used as the cutoff; this may not work well if many SNPs have the same score. 
The program subtracts the cutoff from every score, then finds genomic intervals (i.e., consecutive runs of SNPs) whose total score cannot be increased by adding or subtracting one or more SNPs at the ends of the interval. -*/ - -#include "lib.h" -#include "Huang.h" - -// maximum number of rows in any processed table -#define MANY 20000000 -#define BUF_SIZE 50000 -#define MAX_WINDOW 1000000 - -double X[MANY]; // holds all scores -int nX; - -// position-score pairs for a single chromosome -struct score { - int pos; - double x; // original score, then shifted score -} S[MANY]; -int nS; - -struct snp { - int pos; - double x; - struct snp *next; -}; - -// structure to hold the maximum-scoring chromosomal intervals -struct sweep { - float score; - char *chr; - int b, e; - struct snp *snps; -} W[MAX_WINDOW]; -int nW; - -#define MAX_CHR_NAME 1000000 -char *chr_name[MAX_CHR_NAME]; -int nchr_name; - -// return the linked list of SNPs in positions b to e -struct snp *add_snps(int b, int e) { - struct snp *first = NULL, *last = NULL, *new; - int i; - for (i = b; i <= e; ++i) - if (S[i].pos >= 0) { - new = ckalloc(sizeof(*new)); - new->pos = S[i].pos; - new->x = S[i].x; - new->next = NULL; - if (first == NULL) - first = new; - else - last->next = new; - last = new; - } - return first; -} - -// given a table row, return a pointer to the item in a particular column -char *get_col(char *buf, int col) { - static char temp[BUF_SIZE], *p; - int i; - char *z = " \t\n"; - - strcpy(temp, buf); - for (p = strtok(temp, z), i = 1; *p && i < col; - p = strtok(NULL, z), ++i) - ; - if (p == NULL) - fatalf("no column %d in %s", col, buf); - return p; -} - -// fill S[] with position-score pairs for the next chromosome -// return 0 for EOF -int get_chr(FILE *fp, int chr_col, int pos_col, int score_col, char *chr) { - static char buf[BUF_SIZE]; - static int init = 1; - int old_pos = 0, p, i; - char *status, *s; - - if (init) { - while ((status = fgets(buf, BUF_SIZE, fp)) != NULL && - buf[0] == 
'#') - ; - if (status == NULL) - fatal("empty table"); - init = 0; - } - if (buf[0] == '\0') - return 0; - - if (buf[0] == '#') - fatal("cannot happen"); - strcpy(chr, get_col(buf, chr_col)); - if (nchr_name == 0) - chr_name[nchr_name++] = copy_string(chr); - - S[0].pos = atoi(get_col(buf, pos_col)); - if (S[0].pos < 0) - fatalf("remove unmapped SNPs (address = -1)"); - S[0].x = atof(get_col(buf, score_col)); - for (nS = 1; ; ++nS) { - if (!fgets(buf, BUF_SIZE, fp)) { - buf[0] = '\0'; - return 1; - } - if (!same_string(chr, s = get_col(buf, chr_col))) - break; - S[nS].pos = p = atoi(get_col(buf, pos_col)); - if (p <= old_pos) - fatalf("SNV at %s %d is out of order", chr, p); - old_pos = p; - if (S[nS].pos < 0) - fatalf("remove unmapped SNPs (address = -1)"); - S[nS].x = atof(get_col(buf, score_col)); - } - if (nchr_name >= MAX_CHR_NAME) - fatal("Too many chromosome names"); - for (i = 0; i < nchr_name && !same_string(s, chr_name[i]); ++i) - ; - if (i < nchr_name) - fatalf("SNVs on %s aren't together", s); - chr_name[nchr_name++] = copy_string(s); - - return 1; -} - -// for sorting genomic intervals by *decreasing* score -int Wcompar(struct sweep *a, struct sweep *b) { - float y = a->score, z = b->score; - - if (y > z) - return -1; - if (y < z) - return 1; - return 0; -} - -// for sorting an array of scores into increasing order -int fcompar(double *a, double *b) { - if (*a < *b) - return -1; - if (*a > *b) - return 1; - return 0; -} - -/* shuffle the values S[0], S[1], ... , S[nscores-1]; -* Uses Algorithm P in page 125 of "The Art of Computer Programming (Vol II) -* Seminumerical Programming", by Donald Knuth, Addison-Wesley, 1971. 
-*/ -void shuffle_scores() { - int i, j; - double temp; - - for (i = nX-1; i > 0; --i) { - // swap what's in location i with location j, where 0 <= j <= i - j = random() % (i+1); - temp = X[i]; - X[i] = X[j]; - X[j] = temp; - } -} - -// return the best interval score (R[i] is the struct operated by Huang()) -double best() { - int i; - double bestScore; - - Huang(X, nX); - - for (bestScore = 0.0, i = 1; i <= top; ++i) - bestScore = MAX(R[i].Score, bestScore); - return bestScore; -} - -int main(int argc, char **argv) { - FILE *fp; - char buf[BUF_SIZE], chr[100], *a; - double shift = 0.0, cutoff; - int i, b, e, chr_col, pos_col, score_col, nshuffle, snps = 0; - struct snp *s; - - if (argc != 7 && argc != 8) - fatal("args: table chr_col pos_col score_col threshold randomizations [SNPs]"); - - // process command-line arguments - chr_col = atoi(argv[2]); - pos_col = atoi(argv[3]); - score_col = atoi(argv[4]); - a = argv[5]; - fp = ckopen(argv[1], "r"); - if (argc == 8) - snps = atoi(argv[7]); - if (isdigit(a[0])) { - for (nX = 0; nX < MANY && fgets(buf, BUF_SIZE, fp); ) { - if (buf[0] == '#') - continue; - X[nX++] = atof(get_col(buf, score_col)); - } - if (nX == MANY) - fatal("Too many rows"); - qsort((void *)X, (size_t)nX, sizeof(double), - (const void *)fcompar); - shift = X[atoi(a)*nX/100]; - rewind(fp); - } else if (a[0] == '=') - shift = atof(a+1); - -//fprintf(stderr, "shift = %4.3f\n", shift); - nshuffle = atoi(argv[6]); - if (nshuffle == 0) - cutoff = 0; - else { - for (nX = 0; nX < MANY && fgets(buf, BUF_SIZE, fp); ) { - if (buf[0] == '#') - continue; - X[nX++] = atof(get_col(buf, score_col)) - shift; - } - if (nX == MANY) - fatal("Too many rows"); - for (cutoff = 0.0, i = 0; i < nshuffle; ++i) { - shuffle_scores(); - cutoff = MAX(cutoff, best()); - } - rewind(fp); - } -//fprintf(stderr, "cutoff = %4.3f\n", cutoff); - - // loop over chromosomes; - // start by getting the chromosome's scores - while (get_chr(fp, chr_col, pos_col, score_col, chr)) { - // subtract 
shift from the scores - for (i = 0; i < nS; ++i) - X[i] = S[i].x - shift; - - // find the maximum=scoring regions - Huang(X, nS); - - // save any regions with >= 4 points and score >= cutoff - for (i = 0; i <= top; ++i) { - if (nW >= MAX_WINDOW) - fatalf("too many windows"); - - // get indices of the first and last SNP in the interval - b = R[i].Lpos + 1; - e = R[i].Rpos; - - // remove unmapped SNP position from intervals' ends - while (b < e && S[b].pos == -1) - ++b; - while (e > b && S[e].pos == -1) - --e; - - // record intervals - if (e - b < 3 || R[i].Score < cutoff) - continue; - W[nW].score = R[i].Score; - W[nW].chr = copy_string(chr); - W[nW].b = S[b].pos; - W[nW].e = S[e].pos+1; // Ws are half-open - if (snps) - W[nW].snps = add_snps(b, e); - ++nW; - } - } - - // sort by decreasing score - qsort((void *)W, (size_t)nW, sizeof(W[0]), (const void *)Wcompar); - - for (i = 0; i < nW; ++i) { - printf("%s\t%d\t%d\t%4.4f\n", - W[i].chr, W[i].b, W[i].e, W[i].score); - for (s = W[i].snps; s; s = s->next) - printf(" %d %3.2f\n", s->pos, s->x); - } - return 0; -}
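The significance cutoff in sweep.c comes from a permutation test: shuffle the shifted scores (Fisher-Yates, as in shuffle_scores()) a given number of times and keep the best interval score seen. The sketch below substitutes Kadane's maximum-subarray scan for the Huang() all-maximal-runs routine, since only the single best run matters for the cutoff; it is an illustration of the idea, not the tool's code:

```python
import random

def best_run_score(xs):
    """Highest-scoring contiguous run of shifted scores (Kadane);
    sweep.c's best() instead takes the max over all runs from Huang()."""
    best = run = 0.0
    for x in xs:
        run = max(0.0, run + x)
        best = max(best, run)
    return best

def permutation_cutoff(scores, shift, n_shuffles, rng=random):
    """Max run score over n_shuffles random permutations of the shifted
    scores; reported intervals must beat this value."""
    xs = [s - shift for s in scores]
    cutoff = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(xs)
        cutoff = max(cutoff, best_run_score(xs))
    return cutoff
```

Because every permutation contains each positive score on its own, the cutoff is bounded below by the largest single shifted score and above by the sum of all positive shifted scores.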
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/inbreeding_and_kinship.py Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,29 @@ +#!/usr/bin/env python + +import sys +import gd_util + +################################################################################ + +if len(sys.argv) != 6: + gd_util.die('Usage') + +ped_input, ind_input, computed_value, output, kinship_input = sys.argv[1:] + +################################################################################ + +prog = 'inbreed' + +args = [ prog ] +args.append(ped_input) # pedigree +args.append(ind_input) # specified individuals (e.g., potential breeding population) +args.append(kinship_input) # kinships of founders +args.append(computed_value) # 0 = inbreeding coefficients, 1 = kinships, 2 = mean kinships + +with open(output, 'w') as fh: + gd_util.run_program(prog, args, stdout=fh) + +################################################################################ + +sys.exit(0) +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/inbreeding_and_kinship.xml Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,133 @@ +<tool id="gd_inbreeding_and_kinship" name="Inbreeding and kinship" version="1.0.0"> + <description>: Analyze the pedigree without genomic data</description> + + <command interpreter="python"> + inbreeding_and_kinship.py '$ped_input' '$ind_input' '$computed_value' '$output' + #if $kinship_dataset.choice == '0' + '/dev/null' + #else if $kinship_dataset.choice == '1' + '$kinship_input' + #end if + </command> + + <inputs> + <param name="ped_input" type="data" format="txt" label="Pedigree dataset" /> + <param name="ind_input" type="data" format="txt" label="Individuals dataset" /> + <conditional name="kinship_dataset"> + <param name="choice" type="select" format="integer" label="Kinship dataset"> + <option value="0" selected="true">no kinship dataset</option> + <option value="1">select kinship dataset</option> + </param> + <when value="0" /> + <when value="1"> + <param name="kinship_input" type="data" format="txt" label="Kinship dataset" /> + </when> + </conditional> + <param name="computed_value" type="select" format="integer" label="Computed value"> + <option value="0" selected="true">inbreeding coefficients</option> + <option value="1">kinships</option> + <option value="2">mean kinships</option> + </param> + </inputs> + + <outputs> + <data name="output" format="txt" /> + </outputs> + + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <!-- + <tests> + </tests> + --> + + <help> + +**Dataset formats** + +The input datasets are in text_ format. +The output dataset is in text_ format. + +.. _text: ./static/formatHelp.html#text + +----- + +**What it does** + +The user specifies a pedigree. 
This is done with a Galaxy table with one +row per individual, containing (1) the individual's name, (2) the name of +one of the individual's parents, which must have occurred at the start +of a previous line, and (3) the name of the individual's other parent, +which occurred at the start of a previous line. For a pedigree founder, +each parent name is replaced by "-". + +The user also provides a file that specifies a set of names of individuals +(specifically, the first word on each line, one line per individual); +any subsequent information on a line is ignored. + +The user can optionally provide a file giving kinship information for +each pair of distinct individuals from the founder set. + +Finally, the user picks from among the options: + + 1. inbreeding coefficients for each specified individual + 2. the kinship for each pair of distinct specified individuals + 3. the mean kinship for each specified individual, i.e., the average kinship value for that individual and every specified individual + +The command reports the requested values. + +----- + +**Example** + +- input:: + + A - - + B - - + C - - + D - - + E - - + F A B + G A B + Thelma A F + Louise F G + +Rows can have more than three columns (such as the individual's sex), +but only the first three columns affect this command. + +Suppose on the other hand that we select an alternative +"founder" set, {A, F, G}. (We require a founder set to have a +member on any ancestral path from Thelma or Louise.) The above pedigree +file is then replaced by:: + + A - - + F - - + G - - + Thelma A F + Louise F G + +The user then also provides a file giving kinship information for each +pair of distinct individuals from the founder set; for the current +example, the kinship file is as follows:: + + A F 0.25 + A G 0.25 + F G 0.25 + +since parent-child pairs and siblings both have kinship 0.25. 
The +advantage is that this capability can be used in cases where the kinships +of the founders are not initially known, but instead are computationally +predicted, e.g., with the Galaxy "Discover" tool. + </help> +</tool> + + + + + + + + +
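The recursion behind these quantities is compact enough to sketch. The following is an illustrative Python sketch, not the tool's actual implementation (all function names are hypothetical), of how inbreeding and kinship coefficients follow from such a pedigree file, assuming founders are unrelated unless a kinship file says otherwise:

```python
# Illustrative sketch (hypothetical helpers, not the tool's code) of the
# standard pedigree kinship recursion used by such an analysis.

def load_pedigree(rows):
    """Parse pedigree rows: name, parent1, parent2 ('-' marks a founder)."""
    parents, order = {}, {}
    for i, row in enumerate(rows):
        name, p1, p2 = row.split()[:3]  # extra columns (e.g. sex) are ignored
        parents[name] = (p1, p2)
        order[name] = i                 # parents always precede children
    return parents, order

def kinship(a, b, parents, order, memo):
    """Kinship coefficient K(a, b); K(x, x) = (1 + inbreeding(x)) / 2."""
    key = tuple(sorted((a, b)))
    if key in memo:
        return memo[key]
    if a == b:
        p1, p2 = parents[a]
        f = 0.0 if p1 == '-' else kinship(p1, p2, parents, order, memo)
        k = 0.5 * (1.0 + f)
    else:
        if order[a] < order[b]:         # recurse on the later-listed individual,
            a, b = b, a                 # whose parents occur on earlier lines
        p1, p2 = parents[a]
        if p1 == '-':
            k = 0.0                     # unrelated founders (no kinship file)
        else:
            k = 0.5 * (kinship(p1, b, parents, order, memo) +
                       kinship(p2, b, parents, order, memo))
    memo[key] = k
    return k

def inbreeding(x, parents, order, memo):
    """Inbreeding coefficient = kinship of the two parents."""
    p1, p2 = parents[x]
    return 0.0 if p1 == '-' else kinship(p1, p2, parents, order, memo)
```

With the example pedigree above, this yields K(A, F) = K(A, G) = K(F, G) = 0.25, matching the kinship file, and an inbreeding coefficient of 0.25 for Louise (the kinship of her parents F and G).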
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/make_phylip.py Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,511 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# +# mkFastas.py +# +# Copyright 2013 Oscar Reina <oscar@niska.bx.psu.edu> +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, +# MA 02110-1301, USA. + +import argparse +import errno +import os +import shutil + +def mkdir_p(path): + try: + os.makedirs(path) + except OSError, e: + if e.errno <> errno.EEXIST: + raise + +def revseq(seq): + seq=list(seq) + seq.reverse() + seq=''.join(seq) + return seq + +def revComp(allPop): + dAllCompAll={'A':'T','T':'A','C':'G','G':'C','N':'N','M':'K','K':'M','R':'Y','Y':'R','W':'W','S':'S'} + allPopsComp=dAllCompAll[allPop] + return allPopsComp + +def rtrnCons(ntA,ntB): + srtdPairs=''.join(sorted([ntA,ntB])) + dpairsCons={'AC':'M', 'AG':'R', 'AT':'W', 'CG':'S', 'CT':'Y', 'GT':'K', 'AN':'A', 'CN':'C', 'GN':'G', 'NT':'T'} + cons=dpairsCons[srtdPairs] + return cons + +def rtrnFxdChrPos(inSNPf,dPopsinSNPfPos,pxchrx,pxpos,pxntA,pxntB,fulldChrdPosdPopsAlllsInit=False,cvrgTreshold=False,indvlsPrctTrshld=False): + """ + """ + dChrdPosdPopsAlllsInit={} + seqref=[] + for eachl in open(inSNPf,'r'): + if eachl.strip() and eachl[0]!='#': + fllInfoSplt=eachl.splitlines()[0].split('\t') + chrx=fllInfoSplt[pxchrx] + pos=int(fllInfoSplt[pxpos]) + 
ntA=fllInfoSplt[pxntA] + ntB=fllInfoSplt[pxntB] + seqref.append([pos,ntA]) + dPopsAllls={} + if fulldChrdPosdPopsAlllsInit: + #~ + cntIndv=0 + # + try: + fulldPopsAllls=fulldChrdPosdPopsAlllsInit[chrx][pos] + except: + fulldPopsAllls=dict([(echPop,ntA) for echPop in dPopsinSNPfPos]) + # + for eachPop in dPopsinSNPfPos: + clmnCvrg=dPopsinSNPfPos[eachPop] + if clmnCvrg: + eachPopCvrg=int(fllInfoSplt[clmnCvrg]) + else: + #~ eachPopCvrg=0 + eachPopCvrg=cvrgTreshold + if eachPopCvrg>=cvrgTreshold: + dPopsAllls[eachPop]=fulldPopsAllls[eachPop] + cntIndv+=1 + else: + dPopsAllls[eachPop]='N' + #~ + if indvlsPrctTrshld>(cntIndv/float(len(dPopsinSNPfPos))): + dPopsAllls=dict([(echPop,'N') for echPop in dPopsinSNPfPos]) + #~ + else: + for eachPop in dPopsinSNPfPos: + if dPopsinSNPfPos[eachPop]: + eachPopAll=int(fllInfoSplt[dPopsinSNPfPos[eachPop]]) + if eachPopAll==0: + dPopsAllls[eachPop]=ntB + elif eachPopAll==2: + dPopsAllls[eachPop]=ntA + elif eachPopAll==1: + dPopsAllls[eachPop]=rtrnCons(ntA,ntB) + else: + dPopsAllls[eachPop]='N' + else: + dPopsAllls[eachPop]=ntA + try: + dChrdPosdPopsAlllsInit[chrx][pos]=dPopsAllls + except: + dChrdPosdPopsAlllsInit[chrx]={pos:dPopsAllls} + #~ + seqref.sort() + startExs=[seqref[0][0]] + endExs=[seqref[-1][0]+1] + seqref=''.join(x[1] for x in seqref) + #~ + return dChrdPosdPopsAlllsInit,seqref,chrx,startExs,endExs + + +def rtrndENSEMBLTseq(inCDSfile,inUCSCfile,fchrClmn,txStartClmn,txEndClmn,strandClmn,geneNameClmn,startExsClmn,endExsClmn,cdsStartClmn,cdsEndClmn): + """ + """ + dENSEMBLTchrxStEndEx={} + dChrdStrtEndENSEMBLT={} + for eachl in open(inUCSCfile,'r'): + if eachl.strip(): + rvrse=False + allVls=eachl.split('\t') + txStart=allVls[txStartClmn] + txEnd=allVls[txEndClmn] + ENSEMBLT=allVls[geneNameClmn] + strand=allVls[strandClmn] + chrx=allVls[fchrClmn] + if cdsStartClmn and cdsEndClmn: + cdsStart=allVls[cdsStartClmn] + cdsEnd=allVls[cdsEndClmn] + if startExsClmn and endExsClmn: + startExs=allVls[startExsClmn] + 
endExs=allVls[endExsClmn] + if strand=='-': + rvrse=True + try: + dChrdStrtEndENSEMBLT[chrx][int(txStart),int(txEnd)]=ENSEMBLT + except: + try: + dChrdStrtEndENSEMBLT[chrx]={(int(txStart),int(txEnd)):ENSEMBLT} + except: + dChrdStrtEndENSEMBLT={chrx:{(int(txStart),int(txEnd)):ENSEMBLT}} + #~ + if cdsStartClmn and cdsEndClmn and startExsClmn and endExsClmn: + startExs,endExs=rtrnExnStarEndCorrc(startExs,endExs,cdsStart,cdsEnd) + else: + startExs,endExs=[int(txStart)],[int(txEnd)] + dENSEMBLTchrxStEndEx[ENSEMBLT]=(chrx,startExs,endExs,rvrse) + #~ + dENSEMBLTseq={} + ENSEMBLTseqs=[(x.splitlines()[0],''.join(x.splitlines()[1:])) for x in open(inCDSfile).read().split('>') if x.strip()] + for ENSEMBLT,seq in ENSEMBLTseqs: + dENSEMBLTseq[ENSEMBLT]=seq + #~ + dENSEMBLTseqChrStEnEx={} + for ENSEMBLT in dENSEMBLTchrxStEndEx: + chrx,startExs,endExs,rvrse=dENSEMBLTchrxStEndEx[ENSEMBLT] + addEseqChrStEnEx=True + try: + seq=dENSEMBLTseq[ENSEMBLT] + if rvrse: + seq=revseq(seq) + except: + addEseqChrStEnEx=False + if addEseqChrStEnEx: + dENSEMBLTseqChrStEnEx[ENSEMBLT]=(seq,chrx,startExs,endExs,rvrse) + return dENSEMBLTseqChrStEnEx,dChrdStrtEndENSEMBLT + + +def rtrnFxdChrPosinCodReg(dChrdStrtEndENSEMBLT,dChrdPosdPopsAlllsInit): + """ + """ + dENSEMBLTChrPosdAlls={} + dChrPosdPopsAllls={} + todel=set(dChrdPosdPopsAlllsInit.keys()).difference(set(dChrdStrtEndENSEMBLT.keys())) + for x in todel: + x=dChrdPosdPopsAlllsInit.pop(x) + #--- + while len(dChrdPosdPopsAlllsInit)>0: + chrx=dChrdPosdPopsAlllsInit.keys()[0] + dStrtEndENSEMBLT=dChrdStrtEndENSEMBLT.pop(chrx) + dPosdPopsAllls=dChrdPosdPopsAlllsInit.pop(chrx) + #~ + srtdStrtEndENSEMBLT=sorted(dStrtEndENSEMBLT.keys()) + srtdPosdPopsAllls=sorted(dPosdPopsAllls.keys()) + #~ + pos=srtdPosdPopsAllls.pop(0) + strt,end=srtdStrtEndENSEMBLT.pop(0) + ENSEMBLT=dStrtEndENSEMBLT[strt,end] + dPopsAllls=dPosdPopsAllls[pos] + keePloop=True + #~ + while keePloop: + if strt<=pos<=end: + for tmpstrt,tmpend in [(strt,end)]+srtdStrtEndENSEMBLT: + if 
tmpstrt<=pos<=tmpend: + dPopsAllls=dPosdPopsAllls[pos] + dChrPosdPopsAllls[chrx,pos]=dPopsAllls + try: + dENSEMBLTChrPosdAlls[ENSEMBLT][chrx,pos]=dPopsAllls + except: + dENSEMBLTChrPosdAlls[ENSEMBLT]={(chrx,pos):dPopsAllls} + else: + continue + if len(srtdPosdPopsAllls)>0: + pos=srtdPosdPopsAllls.pop(0) + dPopsAllls=dPosdPopsAllls[pos] + else: + keePloop=False + #~ + elif pos<=strt: + if len(srtdPosdPopsAllls)>0: + pos=srtdPosdPopsAllls.pop(0) + dPopsAllls=dPosdPopsAllls[pos] + else: + keePloop=False + else: + if len(srtdStrtEndENSEMBLT)>0: + strt,end=srtdStrtEndENSEMBLT.pop(0) + ENSEMBLT=dStrtEndENSEMBLT[strt,end] + else: + keePloop=False + return dENSEMBLTChrPosdAlls,dChrPosdPopsAllls + +def rtrnExnStarEndCorrc(startExs,endExs,cdsStart,cdsEnd): + """ + """ + cdsStart,cdsEnd=int(cdsStart),int(cdsEnd) + crrctdstartExs=set([int(x) for x in startExs.split(',') if x.strip()]) + crrctdendExs=set([int(x) for x in endExs.split(',') if x.strip()]) + crrctdstartExs.add(cdsStart) + crrctdendExs.add(cdsEnd) + sStartDel=set() + sEndDel=set() + #~ + for echvl in crrctdstartExs: + if echvl<cdsStart or echvl>cdsEnd: + sStartDel.add(echvl) + #~ + for echvl in crrctdendExs: + if echvl<cdsStart or echvl>cdsEnd: + sEndDel.add(echvl) + #~ + return sorted(crrctdstartExs.difference(sStartDel)),sorted(crrctdendExs.difference(sEndDel)) + +def rtrndPopsFasta(seq,chrx,startExs,endExs,rvrse,dChrPosdPopsAllls,ENSEMBLT): + """ + """ + exnIntrvl=zip(startExs,endExs) + CDSinitPos=exnIntrvl[0][0] + dexnIntrvlSeq={} + for exStart,exEnd in exnIntrvl: + lenEx=exEnd-exStart + dexnIntrvlSeq[exStart,exEnd]=seq[:lenEx] + seq=seq[lenEx:] + + ldexnIntrvlSeq=len(dexnIntrvlSeq) + #~ + dPopsFasta={} + #~ + strePos=set() + dStrePosAbsPos={} + tmpAcmltdPos=0 + #~ + exStart,exEnd=sorted(dexnIntrvlSeq.keys())[0] + seq=dexnIntrvlSeq.pop((exStart,exEnd)) + chrx,pos=sorted(dChrPosdPopsAllls.keys())[0] + dPopsAllls=dChrPosdPopsAllls.pop((chrx,pos)) + tmpdPopsFasta=dict([(x,list(seq)) for x in dPopsAllls]) + 
cntExns=0 + while True: + if exStart<=pos<=exEnd-1: + relPos=tmpAcmltdPos+pos-exStart + strePos.add(relPos) + dStrePosAbsPos[relPos]=pos + for echPop in tmpdPopsFasta: + allPop=dPopsAllls[echPop] + if rvrse: + allPop=revComp(allPop) + tmpdPopsFasta[echPop][pos-exStart]=allPop + if len(dChrPosdPopsAllls)>0: + chrx,pos=sorted(dChrPosdPopsAllls.keys())[0] + dPopsAllls=dChrPosdPopsAllls.pop((chrx,pos)) + else: + pos=endExs[-1]+100#max pos of exns + elif pos<exStart: + if len(dChrPosdPopsAllls)>0: + chrx,pos=sorted(dChrPosdPopsAllls.keys())[0] + dPopsAllls=dChrPosdPopsAllls.pop((chrx,pos)) + else: + pos=endExs[-1]+100#max pos of exns + elif pos>exEnd-1:# or len(dChrPosdPopsAllls)==0: + for echPop in tmpdPopsFasta: + try: + dPopsFasta[echPop]+=''.join(tmpdPopsFasta[echPop]) + except: + dPopsFasta[echPop]=''.join(tmpdPopsFasta[echPop]) + cntExns+=1 + tmpAcmltdPos+=len(seq) + if len(dexnIntrvlSeq)>0: + exStart,exEnd=sorted(dexnIntrvlSeq.keys())[0] + seq=dexnIntrvlSeq.pop((exStart,exEnd)) + tmpdPopsFasta=dict([(x,list(seq)) for x in dPopsAllls]) + else: + break + if ldexnIntrvlSeq!=cntExns: + for echPop in tmpdPopsFasta: + dPopsFasta[echPop]+=''.join(tmpdPopsFasta[echPop]) + #~ + lchrStartexEndpos=[] + if rvrse: + dPopsFasta=dict([(echPop,revseq(dPopsFasta[echPop])) for echPop in dPopsFasta])#[echPop]+=''.join(tmpdPopsFasta[echPop]) + for ePos in strePos: + lchrStartexEndpos.append('\t'.join([ENSEMBLT,chrx,str(tmpAcmltdPos-ePos-1),str(dStrePosAbsPos[ePos])])) + else: + for ePos in strePos: + lchrStartexEndpos.append('\t'.join([ENSEMBLT,chrx,str(ePos),str(dStrePosAbsPos[ePos])])) + #~ + return dPopsFasta,lchrStartexEndpos + +def rtrnSeqVars(dENSEMBLTseqChrStEnEx,dENSEMBLTChrPosdAlls): + """ + """ + dENSEMBLTPopsFasta={} + lchrStartexEndposAll=[] + #~ + sENSEMBLTcmmn=set(dENSEMBLTChrPosdAlls.keys()).intersection(set(dENSEMBLTseqChrStEnEx.keys()))#sENSEMBLTcmmn between UCSC and ENSEMBLE + #~ + for ENSEMBLT in sENSEMBLTcmmn: + 
seq,chrx,startExs,endExs,rvrse=dENSEMBLTseqChrStEnEx[ENSEMBLT] + dChrPosdPopsAllls=dENSEMBLTChrPosdAlls[ENSEMBLT] + if len(startExs)>0 and len(endExs)>0: + dPopsFasta,lchrStartexEndpos=rtrndPopsFasta(seq,chrx,startExs,endExs,rvrse,dChrPosdPopsAllls,ENSEMBLT) + lchrStartexEndposAll.extend(lchrStartexEndpos) + if dPopsFasta:#to correct a bug of the input table, in cases in which endExons<startExn (!). See ENSCAFT00000000145 (MC4R) in canFam2 for example. + dENSEMBLTPopsFasta[ENSEMBLT]=dPopsFasta + return dENSEMBLTPopsFasta,lchrStartexEndposAll + + + +def rtrnPhy(dPopsFasta,ENSEMBLT): + """ + """ + dPopsFormPhy={} + for eachPop in dPopsFasta: + hader='%s'%eachPop + #~ hader='>%s'%eachPop + seq=dPopsFasta[eachPop] + formtd='\t'.join([hader,seq]) + #~ formtd='\n'.join([hader,seq]) + dPopsFormPhy[eachPop]=formtd + #~ + return dPopsFormPhy,len(seq) + +def wrapSeqsFasta(dENSEMBLTPopsFasta,outFastaFold,sPopsIntrst): + """ + """ + ENSEMBLTKaKs=[] + nonHeader=True + cnt=0 + lENSEMBLT=len(dENSEMBLTPopsFasta) + #~ + for ENSEMBLT in sorted(dENSEMBLTPopsFasta.keys()): + cnt+=1 + dPopsFasta=dENSEMBLTPopsFasta[ENSEMBLT] + dPopsFormPhy,lenseq=rtrnPhy(dPopsFasta,ENSEMBLT) + #~ + seqPMLformat=['%s %s'%(len(dPopsFormPhy),lenseq)]#generate new PHYML sequence + #~ seqPMLformat=[]#generate new PHYML sequence + for namex in sorted(sPopsIntrst): + seqPMLformat.append(dPopsFormPhy[namex]) + #~ + mkdir_p(outFastaFold) + outFastaf=os.path.join(outFastaFold,'%s.phy'%ENSEMBLT) + outFastaf=open(outFastaf,'w') + outFastaf.write('\n'.join(seqPMLformat)) + outFastaf.close() + #~ + return 0 + +def main(): + #~ + #~bpython mkPhyl.py --input=colugo_mt_Galaxy_genotypes.txt --chrClmn=0 --posClmn=1 --refClmn=2 --altrClmn=3 --output=out.d --gd_indivs=genotypes.gd_indivs --inputCover=colugo_mt_Galaxy_coverage.txt --gd_indivs_cover=coverage.gd_indivs --cvrgTreshold=0 --chrClmnCvrg=0 --posClmnCvrg=1 --refClmnCvrg=2 --altrClmnCvrg=3 --indvlsPrctTrshld=0 + parser = argparse.ArgumentParser(description='Returns 
phylip formatted files for phylogenetic analysis, built from a table in gd_snp/gd_genotype format and either per-individual coverage information or gene annotations.') + parser.add_argument('--input',metavar='input gd_snp file',type=str,help='the input file with the table in gd_snp/gd_genotype format.',required=True) + parser.add_argument('--chrClmn',metavar='int',type=int,help='the column with the chromosome.',required=True) + parser.add_argument('--posClmn',metavar='int',type=int,help='the column with the SNPs position.',required=True) + parser.add_argument('--refClmn',metavar='int',type=int,help='the column with the reference nucleotide.',required=True) + parser.add_argument('--altrClmn',metavar='int',type=int,help='the column with the derived nucleotide.',required=True) + parser.add_argument('--output',metavar='output',type=str,help='the output',required=True) + parser.add_argument('--output_id',metavar='int',type=int,help='the output id',required=True) + parser.add_argument('--output_dir',metavar='output folder sequences',type=str,help='the output folder with the sequences.',required=True) + parser.add_argument('--gd_indivs',metavar='input gd_indivs file',type=str,help='the input reference species columns in the input file.',required=True) + #~ + parser.add_argument('--inputCover',metavar='input gd_snp cover file',type=str,help='the input file with the table in gd_snp/gd_genotype cover format.',required=False,default=False) + parser.add_argument('--gd_indivs_cover',metavar='input gd_indivs file',type=str,help='the input reference species columns in the input cover file.',required=False,default=False) + parser.add_argument('--cvrgTreshold',metavar='input coverage threshold',type=int,help='the coverage threshold above which nucleotides are included, else "N".',required=False,default=False) + parser.add_argument('--chrClmnCvrg',metavar='int',type=int,help='the column with the chromosome in the input coverage file.',required=False,default=False) + 
parser.add_argument('--posClmnCvrg',metavar='int',type=int,help='the column with the SNPs position in the input coverage file.',required=False,default=False) + parser.add_argument('--refClmnCvrg',metavar='int',type=int,help='the column with the reference nucleotide in the input coverage file.',required=False,default=False) + parser.add_argument('--altrClmnCvrg',metavar='int',type=int,help='the column with the derived nucleotide in the input coverage file.',required=False,default=False) + parser.add_argument('--indvlsPrctTrshld',metavar='float',type=float,help='the percentage of individuals above which nucleotides are included, else "N".',required=False,default=False) + #~ + parser.add_argument('--sequence',metavar='input fasta file',type=str,help='the input file with the sequence whose SNPs are in the input.',required=False,default=False) + parser.add_argument('--gene_info',metavar='input interval file',type=str,help='the input interval file with the information on the genes.',required=False,default=False) + parser.add_argument('--fchrClmn',metavar='int',type=int,help='the column with the chromosome in the gene_info file.',required=False,default=False) + parser.add_argument('--txStartClmn',metavar='int',type=int,help='the column with the transcript start column in the gene_info file.',required=False,default=False) + parser.add_argument('--txEndClmn',metavar='int',type=int,help='the column with the transcript end column in the gene_info file.',required=False,default=False) + parser.add_argument('--strandClmn',metavar='int',type=int,help='the column with the strand column in the gene_info file.',required=False,default=False) + parser.add_argument('--geneNameClmn',metavar='int',type=int,help='the column with the gene name column in the gene_info file.',required=False,default=False) + parser.add_argument('--cdsStartClmn',metavar='int',type=int,help='the column with the coding start column in the gene_info file.',required=False,default=False) + 
parser.add_argument('--cdsEndClmn',metavar='int',type=int,help='the column with the coding end column in the gene_info file.',required=False,default=False) + parser.add_argument('--startExsClmn',metavar='int',type=int,help='the column with the exon start positions column in the gene_info file.',required=False,default=False) + parser.add_argument('--endExsClmn',metavar='int',type=int,help='the column with the exon end positions column in the gene_info file.',required=False,default=False) + + args = parser.parse_args() + + inSNPf = args.input + outfile = args.output + outfile_id = args.output_id + outFastaFold = './out' + files_dir = args.output_dir + gd_indivs = args.gd_indivs + pxchrx = args.chrClmn + pxpos = args.posClmn + pxntA = args.refClmn + pxntB = args.altrClmn + + + inCDSfile = args.sequence + inUCSCfile = args.gene_info + fchrClmn = args.fchrClmn#chromosome column + txStartClmn = args.txStartClmn#transcript start column + txEndClmn = args.txEndClmn#transcript end column + strandClmn = args.strandClmn#strand column + geneNameClmn = args.geneNameClmn#gene name column + cdsStartClmn = args.cdsStartClmn#coding sequence start column + cdsEndClmn = args.cdsEndClmn#coding sequence end column + startExsClmn = args.startExsClmn#exons start column + endExsClmn = args.endExsClmn#exons end column + + inputCover = args.inputCover + gd_indivs_cover = args.gd_indivs_cover + cvrgTreshold = args.cvrgTreshold + pxchrxCov = args.chrClmnCvrg + pxposCov = args.posClmnCvrg + pxntACov = args.refClmnCvrg + pxntBCov = args.altrClmnCvrg + indvlsPrctTrshld = args.indvlsPrctTrshld + + #print inputCover, gd_indivs_cover, cvrgTreshold + + assert ((inputCover and gd_indivs_cover and cvrgTreshold>=0 and indvlsPrctTrshld>=0) or (inCDSfile and inUCSCfile)) + + #~ + dPopsinSNPfPos=dict([(x.split()[1],int(x.split()[0])-1) for x in open(gd_indivs).read().splitlines() if x.strip()]) + #~ dPopsinSNPfPos.update({'ref':False}) + #~ + sPopsIntrst=set(dPopsinSNPfPos.keys()) + 
dChrdPosdPopsAlllsInit,seqref,chrx,startExs,endExs=rtrnFxdChrPos(inSNPf,dPopsinSNPfPos,pxchrx,pxpos,pxntA,pxntB)#~ print '1. Getting fixed alleles information...' + #~ dENSEMBLTseqChrStEnEx,dChrdStrtEndENSEMBLT=rtrndENSEMBLTseq(inCDSfile,inUCSCfile) + #~ + if inputCover and gd_indivs_cover and cvrgTreshold>=0: + dPopsinSNPfPos_cover=dict([(eachPop,False) for eachPop in dPopsinSNPfPos.keys()]) + dPopsinSNPfPos_cover.update(dict([(x.split()[1],int(x.split()[0])-1) for x in open(gd_indivs_cover).read().splitlines() if x.strip()])) + dChrdPosdPopsAlllsInit,seqref,chrx,startExs,endExs=rtrnFxdChrPos(inputCover,dPopsinSNPfPos_cover,pxchrxCov,pxposCov,pxntACov,pxntBCov,dChrdPosdPopsAlllsInit,cvrgTreshold,indvlsPrctTrshld) + rvrse=False + dENSEMBLTseqChrStEnEx={'tmp':(seqref,chrx,startExs,endExs,rvrse)} + dChrdStrtEndENSEMBLT={chrx:{(startExs[0],endExs[0]):'tmp'}} + #~ + elif inCDSfile and inUCSCfile: + dENSEMBLTseqChrStEnEx,dChrdStrtEndENSEMBLT=rtrndENSEMBLTseq(inCDSfile,inUCSCfile,fchrClmn,txStartClmn,txEndClmn,strandClmn,geneNameClmn,startExsClmn,endExsClmn,cdsStartClmn,cdsEndClmn)#~ print '2. Getting transcripts and exons information...' + #~ + dENSEMBLTChrPosdAlls,dChrPosdPopsAllls=rtrnFxdChrPosinCodReg(dChrdStrtEndENSEMBLT,dChrdPosdPopsAlllsInit)#~ print '3. Getting fixed alleles in exons...' + #~ + dENSEMBLTPopsFasta,lchrStartexEndposAll=rtrnSeqVars(dENSEMBLTseqChrStEnEx,dENSEMBLTChrPosdAlls)#~ print '4. Getting fasta sequences of populations...' + #~ + wrapSeqsFasta(dENSEMBLTPopsFasta,outFastaFold,sPopsIntrst) + #~ + + + ## get a list of output files + files = [] + for dirpath, dirnames, filenames in os.walk(outFastaFold): + for file in filenames: + if file.endswith('.phy'): + files.append( os.path.join(dirpath, file) ) + del dirnames[:] + + if len(files) == 0: + with open(outfile, 'w') as ofh: + print >> ofh, 'No output.' 
+ else: + ## the first file becomes the output + file = files.pop(0) + shutil.move(file, outfile) + + ## rename/move the rest of the files + for i, file in enumerate(files): + new_filename = 'primary_{0}_output{1}_visible_txt_?'.format(outfile_id, i+2) + new_pathname = os.path.join(files_dir, new_filename) + shutil.move(file, new_pathname) + + return 0 + +if __name__ == '__main__': + main()
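The heart of `rtrnFxdChrPos` above is the rule that turns one individual's genotype code into a nucleotide: 2 means homozygous reference, 0 homozygous alternate (derived), 1 heterozygous, and anything else unknown. Restated as a minimal sketch (`genotype_to_base` is a hypothetical helper; the IUPAC ambiguity table follows the script's `rtrnCons`):

```python
# Sketch of the per-individual allele resolution used in rtrnFxdChrPos,
# with the IUPAC ambiguity table from rtrnCons in make_phylip.py.
IUPAC = {'AC': 'M', 'AG': 'R', 'AT': 'W', 'CG': 'S', 'CT': 'Y', 'GT': 'K',
         'AN': 'A', 'CN': 'C', 'GN': 'G', 'NT': 'T'}

def genotype_to_base(code, ref, alt):
    if code == 2:
        return ref                                # homozygous reference
    if code == 0:
        return alt                                # homozygous alternate (derived)
    if code == 1:
        return IUPAC[''.join(sorted(ref + alt))]  # heterozygous -> ambiguity code
    return 'N'                                    # undefined (e.g. -1)
```

Individuals whose coverage falls below the threshold, or sites where too few individuals pass it, are likewise forced to 'N' before the sequences are written.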
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/make_phylip.xml Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,178 @@ +<tool id="gd_make_phylip" name="Phylip" version="1.0.0" force_history_refresh="True"> + <description>: prepare data for phylogenetic analysis</description> + + <command interpreter="python"> + #set $zero_based = 1 + #set $gen_chrClmn = int($input.metadata.scaffold) - $zero_based + #set $gen_posClmn = int($input.metadata.pos) - $zero_based + #set $gen_refClmn = int($input.metadata.pos) - $zero_based + 1 + #set $gen_altrClmn = int($input.metadata.pos) - $zero_based + 2 + make_phylip.py '--altrClmn=$gen_altrClmn' '--chrClmn=$gen_chrClmn' '--gd_indivs=$indivs_input' '--input=$input' '--output=$output1' '--output_id=$output1.id' '--output_dir=$__new_file_path__' '--posClmn=$gen_posClmn' '--refClmn=$gen_refClmn' + #if $input_type.choice == '0' + #set $cov_chrClmn = int($input_type.coverage_input.metadata.scaffold) - $zero_based + #set $cov_posClmn = int($input_type.coverage_input.metadata.pos) - $zero_based + #set $cov_refClmn = int($input_type.coverage_input.metadata.pos) - $zero_based + 1 + #set $cov_altrClmn = int($input_type.coverage_input.metadata.pos) - $zero_based + 2 + '--altrClmnCvrg=$cov_altrClmn' '--chrClmnCvrg=$cov_chrClmn' '--cvrgTreshold=$input_type.coverage_threshold' '--gd_indivs_cover=$indivs_input' '--indvlsPrctTrshld=$input_type.indivs_threshold' '--inputCover=$input_type.coverage_input' '--posClmnCvrg=$cov_posClmn' '--refClmnCvrg=$cov_refClmn' + #else if $input_type.choice == '1' + #set $fchrClmn = int($input_type.annotation_input.metadata.chromCol) - $zero_based + #set $strandClmn = int($input_type.annotation_input.metadata.strandCol) - $zero_based + #set $geneNameClmn = int($input_type.annotation_input.metadata.nameCol) - $zero_based + #set $txStartClmn = int(str($input_type.tx_start_col)) - $zero_based + #set $txEndClmn = int(str($input_type.tx_end_col)) - $zero_based + #set $cdsStartClmn = int(str($input_type.cds_start_col)) - 
$zero_based + #set $cdsEndClmn = int(str($input_type.cds_end_col)) - $zero_based + #set $startExsClmn = int(str($input_type.exs_start_col)) - $zero_based + #set $endExsClmn = int(str($input_type.exs_end_col)) - $zero_based + '--cdsEndClmn=$cdsEndClmn' '--cdsStartClmn=$cdsStartClmn' '--endExsClmn=$endExsClmn' '--fchrClmn=$fchrClmn' '--geneNameClmn=$geneNameClmn' '--gene_info=$input_type.annotation_input' '--sequence=$input_type.fasta_input' '--startExsClmn=$startExsClmn' '--strandClmn=$strandClmn' '--txEndClmn=$txEndClmn' '--txStartClmn=$txStartClmn' + #end if + </command> + + <inputs> + <param name="input" type="data" format="gd_genotype,gd_snp" label="Genotype/SNP dataset"> + <validator type="metadata" check="scaffold" message="scaffold missing" /> + <validator type="metadata" check="pos" message="pos missing" /> + </param> + <param name="indivs_input" type="data" format="gd_indivs" label="Individuals dataset" /> + <conditional name="input_type"> + <param name="choice" type="select" format="integer" label="Input type"> + <option value="0" selected="true">Coverage</option> + <option value="1">Genes</option> + </param> + <when value="0"> + <param name="coverage_input" type="data" format="gd_genotype,gd_snp" label="Coverage dataset"> + <validator type="metadata" check="scaffold" message="scaffold missing" /> + <validator type="metadata" check="pos" message="pos missing" /> + </param> + <param name="coverage_threshold" type="integer" min="1" value="1" label="Coverage threshold" /> + <param name="indivs_threshold" type="float" value="0.5" min="0.0" max="1.0" label="Individuals genotype percentage threshold" /> + </when> + <when value="1"> + <param name="annotation_input" type="data" format="interval" label="Genes dataset"> + <validator type="metadata" check="chromCol" message="chromCol missing" /> + <validator type="metadata" check="strandCol" message="strandCol missing" /> + <validator type="metadata" check="nameCol" message="nameCol missing" /> + </param> + <param 
name="tx_start_col" type="data_column" data_ref="annotation_input" label="Genes transcript start column" /> + <param name="tx_end_col" type="data_column" data_ref="annotation_input" label="Genes transcript end column" /> + <param name="cds_start_col" type="data_column" data_ref="annotation_input" label="Genes coding sequence start column" /> + <param name="cds_end_col" type="data_column" data_ref="annotation_input" label="Genes coding sequence end column" /> + <param name="exs_start_col" type="data_column" data_ref="annotation_input" label="Genes exon starts column" /> + <param name="exs_end_col" type="data_column" data_ref="annotation_input" label="Genes exon ends column" /> + <param name="fasta_input" type="data" format="fasta" label="FASTA dataset" /> + </when> + </conditional> + </inputs> + + <outputs> + <data name="output1" format="txt" /> + </outputs> + + <help> +**What it does** + +This tool creates phylip formatted files from two different input types: +coverage and genes. + +If the coverage option is selected, the inputs for the program are: + + 1. a gd_indivs table + 2. a gd_genotype file with the coverage information for individuals in the gd_indivs table + 3. a gd_genotype file with the genotype information for individuals in the gd_indivs table + 4. a coverage threshold (optional) + 5. a percentage of individuals (threshold). + +The program produces a phylip formatted file using the sequence in the +genotype file as a template. In this sequence, nucleotides below the +coverage threshold, and positions where the percentage of individuals +falls below the selected value, are replaced by "N". + +If the genes option is selected, the inputs for the program are: + + 1. a gd_indivs table + 2. a gene dataset table with a gene name in the first column + 3. the column with transcript start in the gene dataset table + 4. the column with transcript end in the gene dataset table + 5. the column with coding start in the gene dataset table + 6. the column with coding end in the gene dataset table + 7. 
the column with exon starts (comma-separated) in the gene dataset table + 8. the column with exon ends (comma-separated) in the gene dataset table + 9. a FASTA formatted file for all the genes of interest with their names as headers (NOTE: these names should be the same in the input gene dataset table). + +The program produces as output one phylip formatted file for each gene +in the gene dataset table. + +----- + +**Example** + +In a case where the coverage option is selected, for the inputs: + +- gd_indivs:: + + 7 W_Java + 10 E_Java + 16 Pen_Ma + ... + +- Genotype table:: + + chrM 15 T C -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 0 -1 -1 + chrM 18 G A -1 -1 0 -1 -1 0 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1 + chrM 20 C T -1 -1 0 -1 -1 2 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1 + ... + +- Coverage table:: + + chrM 0 G G 0 0 0 0 0 0 0 0 0 0 0 0 0 + chrM 1 T T 0 0 3 0 0 50 0 0 0 0 0 2 0 + chrM 2 T T 0 0 5 0 0 50 0 0 0 0 0 2 0 + ... + +- Coverage threshold = 0 + +- Percentage of individuals = 0.0 + +- The output is:: + + 4 19 15428 + W_Java GTTCATCATGTTCATCGAAT + E_Java GTTCATCATGTTCATCGAAC + Pen_Ma GTTCATCATGTTCATCGAAT + +In a case where the genes option is selected, with the inputs: + +- Gene dataset table input:: + + 1 ENSLAFT00000017123 chrM + 1002 1061 1002 1061 1 1002, 1061, 0 ENSLAFG00000017122 cmpl incmpl 0, BTRC ENSLAFT00000017123 ENSLAFP00000014355 + 1 ENSLAFT00000037164 chrM - 1058 1092 1062 1073 1 1062,1068 1065,1073 0 ENSLAFG00000007680 cmpl cmpl 0, MYOF ENSLAFT00000037164 ENSLAFP00000025175 26509 + 1 ENSLAFT00000008925 chrM + 990 1000 990 1000 1 990, 1000, 0 ENSLAFG00000008924 incmpl incmpl 0, PRKG1 ENSLAFT00000008925 ENSLAFP00000007492 + ... + +In this table: + + column with transcript start = 5 + column with transcript end = 6 + column with coding start = 7 + column with coding end = 8 + column with exon starts = 10 + column with exon ends = 11 + +- gd_indivs:: + + 7 W_Java + 10 E_Java + 16 Pen_Ma + ... 
+ +- Genotype table:: + + chrM 1005 T C -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 0 -1 -1 + chrM 1060 G A -1 -1 0 -1 -1 0 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1 + chrM 991 C T -1 -1 0 -1 -1 2 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1 + ... + +The output is one phylip formatted file for each gene in the input gene +dataset table (as long as the gene is included in the input FASTA file). + </help> +</tool>
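The output layout shown in the example — a header line with the number of sequences and the alignment length, then one tab-separated name/sequence pair per taxon — can be sketched as follows (`write_phylip` is a hypothetical helper mirroring what `rtrnPhy` and `wrapSeqsFasta` emit in make_phylip.py):

```python
def write_phylip(path, seqs):
    """Write a tab-delimited PHYLIP-style file: an 'ntaxa nchars' header,
    then one 'name<TAB>sequence' line per taxon, sorted by name, matching
    the layout make_phylip.py produces. seqs maps name -> sequence."""
    lengths = set(len(s) for s in seqs.values())
    if len(lengths) != 1:
        raise ValueError('all sequences must have equal length')
    with open(path, 'w') as fh:
        fh.write('%d %d\n' % (len(seqs), lengths.pop()))
        for name in sorted(seqs):
            fh.write('%s\t%s\n' % (name, seqs[name]))
```

For example, three 5-base sequences produce a `3 5` header followed by three taxon lines.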
--- a/nucleotide_diversity_pi.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/nucleotide_diversity_pi.xml Fri Sep 20 13:25:27 2013 -0400 @@ -25,6 +25,10 @@ <data name="output" format="txt" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <help> **What it does**
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/offspring_heterozygosity.py Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,59 @@ +#!/usr/bin/env python + +import sys +import gd_util + +from Population import Population + +################################################################################ + +if len(sys.argv) != 7: + gd_util.die('Usage') + +input, input_type, ind_arg, p1_input, p2_input, output = sys.argv[1:] + +p_total = Population() +p_total.from_wrapped_dict(ind_arg) + +p1 = Population() +p1.from_population_file(p1_input) +if not p_total.is_superset(p1): + gd_util.die('There is an individual in the first population that is not in the SNP table') + +p2 = Population() +p2.from_population_file(p2_input) +if not p_total.is_superset(p2): + gd_util.die('There is an individual in the second population that is not in the SNP table') + +################################################################################ + +prog = 'offspring_heterozygosity' + +args = [ prog ] +args.append(input) # a Galaxy SNP table + +for tag in p1.tag_list(): + column, name = tag.split(':') + + if input_type == 'gd_genotype': + column = int(column) - 2 + + tag = '{0}:{1}:{2}'.format(column, 0, name) + args.append(tag) + +for tag in p2.tag_list(): + column, name = tag.split(':') + + if input_type == 'gd_genotype': + column = int(column) - 2 + + tag = '{0}:{1}:{2}'.format(column, 1, name) + args.append(tag) + +with open(output, 'w') as fh: + gd_util.run_program(prog, args, stdout=fh) + +################################################################################ + +sys.exit(0) +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/offspring_heterozygosity.xml Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,68 @@ +<tool id="gd_offspring_heterozygosity" name="Pairs sequenced" version="1.0.0"> + <description>: Offspring estimated heterozygosity of sequenced pairs</description> + + <command interpreter="python"> + #import json + #import base64 + #import zlib + #set $ind_names = $input.dataset.metadata.individual_names + #set $ind_colms = $input.dataset.metadata.individual_columns + #set $ind_dict = dict(zip($ind_names, $ind_colms)) + #set $ind_json = json.dumps($ind_dict, separators=(',',':')) + #set $ind_comp = zlib.compress($ind_json, 9) + #set $ind_arg = base64.b64encode($ind_comp) + offspring_heterozygosity.py '$input' '$input.ext' '$ind_arg' '$p1_input' '$p2_input' '$output' + </command> + + <inputs> + <param name="input" type="data" format="gd_snp,gd_genotype" label="SNP dataset" /> + <param name="p1_input" type="data" format="gd_indivs" label="First individuals dataset" /> + <param name="p2_input" type="data" format="gd_indivs" label="Second individuals dataset" /> + </inputs> + + <outputs> + <data name="output" format="txt" /> + </outputs> + + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <!-- + <tests> + </tests> + --> + + <help> + +**Dataset formats** + +The input datasets are in gd_snp_, gd_genotype_, and gd_indivs_ formats. +The output dataset is in text_ format. + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_genotype: ./static/formatHelp.html#gd_genotype +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. _text: ./static/formatHelp.html#text + +----- + +**What it does** + +For each pair of individuals, one from each specified set, the program +computes the expected heterozygosity of any offspring of the pair, i.e., +the probability that the offspring has distinct nucleotides at a randomly +chosen autosomal SNP. 
In other words, we add the following numbers for +each autosomal SNP where both genotypes are defined, then divide by the +number of those SNPs: + +0 if the individuals are homozygous for the same nucleotide + +1 if the individuals are homozygous for different nucleotides + +1/2 otherwise (i.e., if one or both individuals are heterozygous) + +A SNP is ignored if one or both individuals have an undefined genotype +(designated as -1). + </help> +</tool>
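The per-SNP scoring described in the help can be sketched as follows. The actual computation is done by the C program `offspring_heterozygosity`; this Python sketch only illustrates the arithmetic, and it assumes a genotype coding of 0/2 = the two homozygotes, 1 = heterozygous, -1 = undefined, mirroring the -1 convention stated above:

```python
def expected_offspring_heterozygosity(geno1, geno2):
    """Average the per-SNP scores over autosomal SNPs where both
    genotypes are defined (-1 marks an undefined genotype)."""
    total, n = 0.0, 0
    for g1, g2 in zip(geno1, geno2):
        if g1 == -1 or g2 == -1:
            continue                          # SNP is ignored
        if g1 in (0, 2) and g2 in (0, 2):     # both homozygous
            total += 0.0 if g1 == g2 else 1.0 # same vs. different nucleotide
        else:
            total += 0.5                      # one or both heterozygous
        n += 1
    return total / n if n else 0.0
```

For example, two individuals homozygous for different nucleotides at one SNP, identical homozygotes at a second, and both heterozygous at a third score (1 + 0 + 1/2) / 3 = 0.5.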
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/offspring_heterozygosity_pedigree.py Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,58 @@ +#!/usr/bin/env python + +import sys +import gd_util + +from Population import Population + +def load_and_check_pop(file, total_pop, name): + p = Population() + p.from_population_file(file) + if not total_pop.is_superset(p): + gd_util.die('There is an individual in the {0} that is not in the SNP table'.format(name)) + return p + +def append_breeders_from_file(the_list, filename, kind): + with open(filename) as fh: + for line in fh: + elems = line.split() + breeder = elems[0].rstrip('\r\n') + the_list.append('{0}:{1}'.format(kind, breeder)) + +################################################################################ + +if len(sys.argv) != 9: + gd_util.die('Usage') + +input, input_type, pedigree, ind_arg, founders, b1_input, b2_input, output = sys.argv[1:] + +p_total = Population() +p_total.from_wrapped_dict(ind_arg) + +f1 = load_and_check_pop(founders, p_total, 'founders') + +################################################################################ + +prog = 'offspring_heterozygosity2' + +args = [ prog ] +args.append(input) # a Galaxy SNP table +args.append(pedigree) # a pedigree, where the SNP table is for the founders + +for tag in f1.tag_list(): + column, name = tag.split(':') + if input_type == 'gd_genotype': + column = int(column) - 2 + tag = 'founder:{0}:{1}'.format(column, name) + args.append(tag) + +append_breeders_from_file(args, b1_input, 0) +append_breeders_from_file(args, b2_input, 1) + +with open(output, 'w') as fh: + gd_util.run_program(prog, args, stdout=fh) + +################################################################################ + +sys.exit(0) +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/offspring_heterozygosity_pedigree.xml Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,81 @@ +<tool id="gd_offspring_heterozygosity_pedigree" name="Founders sequenced" version="1.0.0"> + <description>: Offspring estimated heterozygosity from a pedigree with sequenced founders</description> + + <command interpreter="python"> + #import json + #import base64 + #import zlib + #set $ind_names = $input.dataset.metadata.individual_names + #set $ind_colms = $input.dataset.metadata.individual_columns + #set $ind_dict = dict(zip($ind_names, $ind_colms)) + #set $ind_json = json.dumps($ind_dict, separators=(',',':')) + #set $ind_comp = zlib.compress($ind_json, 9) + #set $ind_arg = base64.b64encode($ind_comp) + offspring_heterozygosity_pedigree.py '$input' '$input.ext' '$pedigree' '$ind_arg' '$founders' '$b1_input' '$b2_input' '$output' + </command> + + <inputs> + <param name="input" type="data" format="gd_snp,gd_genotype" label="SNP dataset" /> + <param name="pedigree" type="data" format="txt" label="Pedigree dataset" /> + <param name="founders" type="data" format="gd_indivs" label="Founders dataset" /> + <param name="b1_input" type="data" format="txt" label="First breeders dataset" /> + <param name="b2_input" type="data" format="txt" label="Second breeders dataset" /> + </inputs> + + <outputs> + <data name="output" format="txt" /> + </outputs> + + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <!-- + <tests> + </tests> + --> + + <help> + +**Dataset formats** + +The input datasets are in gd_snp_, gd_genotype_, text_, and gd_indivs_ formats. +The output dataset is in text_ format. + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_genotype: ./static/formatHelp.html#gd_genotype +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. 
_text: ./static/formatHelp.html#text + +----- + +**What it does** + +The user provides a Galaxy SNP table (gd_snp or gd_genotype format) that +includes the founders of a pedigree, as well as two sets of individuals. +The pedigree is specified by a text file with one row per individual, +containing (1) the individual's name, (2) the name of one of the +individual's parents, which must have occurred at the start of a previous +line, and (3) the name of the individual's other parent, which must also +have occurred at the start of a previous line. For a pedigree founder, +both parent names are replaced by "-". The founders are specified by a +table in gd_indivs format, e.g., as produced by the "Specify individuals" +tool. Every founder must have genotypes supplied in the SNP table, +and both parents need to be given as "-" in the pedigree. +Conversely, every pedigree individual whose parents are "-" +must be named as a founder. + +The user also provides two files, each specifying a set of +individuals. The first word on each line names an individual (one +line per individual); any subsequent information on the line is ignored. +The name of each individual must appear at the start of a line in the +pedigree. + +For each pair of individuals, one from each specified set, the program +computes the expected heterozygosity of any offspring of the pair, +i.e., the probability that the offspring has distinct nucleotides at +a randomly chosen autosomal SNP. A SNP is ignored if one or both +potential parents have an ancestor with undefined genotype (designated +as -1 in the SNP table). + </help> +</tool>
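The pedigree-file constraints described above (each parent must be named at the start of an earlier line; founders carry "-" for both parents) can be checked with a short sketch. `check_pedigree` is a hypothetical helper for illustration, not part of the shipped tool:

```python
def check_pedigree(lines):
    """Validate 'name parent1 parent2' rows: parents must be defined
    on earlier lines, and founders use '-' for both parents.
    Returns the set of founder names."""
    seen, founders = set(), set()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            raise ValueError('expected: name parent1 parent2')
        name, p1, p2 = fields[:3]
        for parent in (p1, p2):
            if parent != '-' and parent not in seen:
                raise ValueError('parent {0} not defined on an earlier line'.format(parent))
        if p1 == '-' and p2 == '-':
            founders.add(name)
        seen.add(name)
    return founders
```

Under this check, the founders returned are exactly the individuals that must appear in the gd_indivs founders dataset and carry genotypes in the SNP table.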
--- a/pathway_image.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/pathway_image.xml Fri Sep 20 13:25:27 2013 -0400 @@ -27,6 +27,10 @@ <data name="output" format="png" /> </outputs> + <requirements> + <requirement type="package" version="0.2.5">mechanize</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_sap" ftype="gd_sap" /> @@ -77,7 +81,7 @@ output showing pathway cfa05214: -.. image:: ${static_path}/images/gd_pathway_image.png +.. image:: $PATH_TO_IMAGES/gd_pathway_image.png </help> </tool>
--- a/pca.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/pca.xml Fri Sep 20 13:25:27 2013 -0400 @@ -13,6 +13,12 @@ <data name="output" format="html" /> </outputs> + <requirements> + <requirement type="package" version="5.0.1">eigensoft</requirement> + <requirement type="package" version="0.1">gd_c_tools</requirement> + <requirement type="package" version="3.2.1">beautifulsoup</requirement> + </requirements> + <!-- <tests> <test>
--- a/phylogenetic_tree.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/phylogenetic_tree.xml Fri Sep 20 13:25:27 2013 -0400 @@ -136,6 +136,13 @@ </test> </tests> + <requirements> + <requirement type="package" version="1.3">phast</requirement> + <requirement type="package" version="1.1">quicktree</requirement> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <help> **Dataset formats**
--- a/population_structure.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/population_structure.xml Fri Sep 20 13:25:27 2013 -0400 @@ -14,6 +14,10 @@ <data name="output" format="html" /> </outputs> + <requirements> + <requirement type="package" version="3.2.1">beautifulsoup</requirement> + </requirements> + <!-- <tests> <test>
--- a/prepare_population_structure.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/prepare_population_structure.xml Fri Sep 20 13:25:27 2013 -0400 @@ -71,6 +71,10 @@ </data> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" />
--- a/rank_pathways.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/rank_pathways.xml Fri Sep 20 13:25:27 2013 -0400 @@ -53,6 +53,13 @@ <data name="output" format="tabular" /> </outputs> + <requirements> + <requirement type="package" version="0.2.5">mechanize</requirement> + <requirement type="package" version="1.8.1">networkx</requirement> + <requirement type="package" version="0.1.4">fisher</requirement> + </requirements> + + <tests> <test> </test> @@ -64,7 +71,7 @@ The query dataset has a column containing ENSEMBL transcript codes for the gene set of interest, while the background dataset has one column -with ENSEMBL transcript codes and another with GO terms, for some larger +with ENSEMBL transcript codes and another with KEGG pathways, for some larger universe of genes. All of the input and output datasets are in tabular_ format. The input
--- a/rank_terms.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/rank_terms.xml Fri Sep 20 13:25:27 2013 -0400 @@ -25,6 +25,10 @@ <data name="output" format="tabular" /> </outputs> + <requirements> + <requirement type="package" version="0.1.4">fisher</requirement> + </requirements> + <help> **Dataset formats**
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/raxml.py Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,51 @@ +#!/usr/bin/env python + +import random +import sys +import shutil +import gd_util + +################################################################################ + +if len(sys.argv) != 3: + gd_util.die('Usage') + +input, output = sys.argv[1:] +random.seed() + +################################################################################ + +prog = 'raxmlHPC' + +args = [ prog ] + +## required: -s sequenceFileName -n outputFileName -m substitutionModel +## we supply -s and -n (the user may not set them) + +args.append('-s') # name of the alignment data file in PHYLIP format +args.append(input) + +args.append('-n') # name of the output file +args.append('fake') + +## default options +args.append('-m') # substitutionModel +args.append('GTRGAMMA') # GTR + Optimization of substitution rates + GAMMA model of rate + # heterogeneity (alpha parameter will be estimated) + +args.append('-N') # number of alternative runs on distinct starting trees +args.append('1000') + +args.append('-f') # select algorithm +args.append('a') # rapid Bootstrap analysis and search for + # best-scoring ML tree in one program run + +args.append('-x') # integer random seed and turn on rapid bootstrapping +args.append(str(random.randint(0, 100000000000000))) + +args.append('-p') # random seed for parsimony inferences +args.append(str(random.randint(0, 100000000000000))) + +gd_util.run_program(prog, args) +shutil.copy2('RAxML_bipartitions.fake', output) +sys.exit(0)
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/raxml.xml Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,37 @@ +<tool id="gd_raxml" name="RAxML" version="1.0.0"> + <description>: construct a maximum-likelihood phylogenetic tree</description> + + <command interpreter="python"> + raxml.py '$input' '$output' + </command> + + <inputs> + <param name="input" type="data" format="txt" label="PHYLIP dataset" /> + </inputs> + + <outputs> + <data name="output" format="nhx" /> + </outputs> + + + <requirements> + <requirement type="package" version="7.7.6">raxml</requirement> + </requirements> + + <help> +**What it does** + +This tool runs RAxML on a PHYLIP-formatted file and returns a maximum +likelihood phylogram supported by a desired number of bootstraps. + +This program takes as input a PHYLIP-formatted file and optionally a +number of parameters (for further information consult the manual_), +and returns a Newick-formatted tree that can be explored with Phyloviz. + +By default the program runs 1,000 rapid bootstraps on the best likelihood +tree constructed with the GTR + GAMMA model. + +.. _manual: http://sco.h-its.org/exelixis/oldPage/RAxML-Manual.7.0.4.pdf + + </help> +</tool>
--- a/reorder.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/reorder.xml Fri Sep 20 13:25:27 2013 -0400 @@ -1,5 +1,5 @@ -<tool id="gd_reorder" name="Reorder" version="1.0.0"> - <description>individuals</description> +<tool id="gd_reorder" name="Reorder individuals" version="1.0.0"> + <description>: exchange rows in the above picture</description> <command interpreter="python"> reorder.py '$input' '$output' '$order' @@ -15,5 +15,64 @@ </outputs> <help> +**Dataset formats** + +The input and output datasets are in gd_indivs_ format. + +.. _gd_indivs: ./static/formatHelp.html#gd_indivs + +----- + +**What it does** + +The user picks a gd_indivs dataset from their history and specifies +a new ordering. This tool creates a new gd_indivs dataset with the +individuals reordered as specified by the user. + +The new ordering is a list of comma-separated ranges (e.g. **5,6-12,20**). +Ranges can be either a single number (e.g. **3**) or two dash-separated +numbers (e.g. **3-5**). The numbers refer to line numbers in the +gd_indivs dataset. Line numbers that are not listed will appear in the +output after the specified line numbers, in their same relative ordering. + +----- + +**Example** + +Input dataset (six rows):: + + 18 McClintock + 22 Peltonen-Palotie + 26 Sager + 30 Franklin + 34 Auerbach + 38 Stevens + +new ordering "**1,3-4**" will return:: + + 18 McClintock + 26 Sager + 30 Franklin + 22 Peltonen-Palotie + 34 Auerbach + 38 Stevens + +new ordering "**3,5,1,6**" will return:: + + 26 Sager + 34 Auerbach + 18 McClintock + 38 Stevens + 22 Peltonen-Palotie + 30 Franklin + +new ordering "**3-1,6,4-5**" will return:: + + 26 Sager + 22 Peltonen-Palotie + 18 McClintock + 38 Stevens + 30 Franklin + 34 Auerbach </help> </tool>
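reorder.py itself is not part of this changeset, but the ordering rules the help describes (comma-separated ranges, possibly descending, with unlisted lines appended in their original relative order) can be sketched as follows; `reorder` is a hypothetical helper, not the shipped implementation:

```python
def reorder(lines, spec):
    """Reorder lines per a spec like '3-1,6,4-5' of 1-based line
    numbers; ranges may run high-to-low. Unlisted lines follow,
    keeping their original relative order."""
    order = []
    for part in spec.split(','):
        if '-' in part:
            lo, hi = (int(x) for x in part.split('-'))
            step = 1 if lo <= hi else -1
            order.extend(range(lo, hi + step, step))  # inclusive range
        else:
            order.append(int(part))
    listed = set(order)
    rest = [i for i in range(1, len(lines) + 1) if i not in listed]
    return [lines[i - 1] for i in order + rest]
```

Run against the six-row example above, the spec "3-1,6,4-5" reproduces the fourth listing in the help.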
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool_dependencies.xml Fri Sep 20 13:25:27 2013 -0400 @@ -0,0 +1,36 @@ +<?xml version="1.0"?> +<tool_dependency> + <package name="beautifulsoup" version="3.2.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_beautifulsoup_3_2_1" changeset_revision="83c21b81ee9d" /> + </package> + <package name="eigensoft" version="5.0.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_eigensoft_5_0_1" changeset_revision="02f04f3579b5" /> + </package> + <package name="fisher" version="0.1.4"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_fisher_0_1_4" changeset_revision="c84c287b81a4" /> + </package> + <package name="gd_c_tools" version="0.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_gd_c_tools_0_1" changeset_revision="7361ee4b5f40" /> + </package> + <package name="matplotlib" version="1.2.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="iuc" name="package_matplotlib_1_2" changeset_revision="9d164359606b" /> + </package> + <package name="mechanize" version="0.2.5"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_mechanize_0_2_5" changeset_revision="59801857421b" /> + </package> + <package name="munkres" version="1.0.5.4"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_munkres_1_0_5_4" changeset_revision="613b89b28767" /> + </package> + <package name="networkx" version="1.8.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_networkx_1_8_1" 
changeset_revision="43c20433f2d6" /> + </package> + <package name="phast" version="1.3"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_phast_1_3" changeset_revision="f633177177b9" /> + </package> + <package name="quicktree" version="1.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_quicktree_1_1" changeset_revision="dae77031fa2f" /> + </package> + <package name="raxml" version="7.7.6"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_raxml_7_7_6" changeset_revision="77f73a8c45be" /> + </package> +</tool_dependency>