annotate BeautifulSoup.py @ 32:03c22b722882

remove BeautifulSoup dependency
author Richard Burhans <burhans@bx.psu.edu>
date Fri, 20 Sep 2013 13:54:23 -0400
"""Beautiful Soup
Elixir and Tonic
"The Screen-Scraper's Friend"
http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup parses a (possibly invalid) XML or HTML document into a
tree representation. It provides methods and Pythonic idioms that make
it easy to navigate, search, and modify the tree.

A well-formed XML/HTML document yields a well-formed data
structure. An ill-formed XML/HTML document yields a correspondingly
ill-formed data structure. If your document is only locally
well-formed, you can use this library to find and process the
well-formed part of it.

Beautiful Soup works with Python 2.2 and up. It has no external
dependencies, but you'll have more success at converting data to UTF-8
if you also install these three packages:

* chardet, for auto-detecting character encodings
  http://chardet.feedparser.org/
* cjkcodecs and iconv_codec, which add more encodings to the ones supported
  by stock Python.
  http://cjkpython.i18n.org/

Beautiful Soup defines classes for two main parsing strategies:

 * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
   language that kind of looks like XML.

 * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
   or invalid. This class has web browser-like heuristics for
   obtaining a sensible parse tree in the face of common HTML errors.

Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
the encoding of an HTML or XML document, and converting it to
Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.

For more than you ever wanted to know about Beautiful Soup, see the
documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html

Here, have some legalese:

Copyright (c) 2004-2010, Leonard Richardson

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials provided
    with the distribution.

  * Neither the name of the the Beautiful Soup Consortium and All
    Night Kosher Bakery nor the names of its contributors may be
    used to endorse or promote products derived from this software
    without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.

"""
from __future__ import generators

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "3.2.1"
__copyright__ = "Copyright (c) 2004-2012 Leonard Richardson"
__license__ = "New-style BSD"

from sgmllib import SGMLParser, SGMLParseError
import codecs
import markupbase
import types
import re
import sgmllib
try:
    from htmlentitydefs import name2codepoint
except ImportError:
    name2codepoint = {}
try:
    set
except NameError:
    from sets import Set as set

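# Illustration only, not part of Beautiful Soup: a minimal sketch of what
# the name2codepoint table imported above provides. It maps HTML entity
# names to Unicode code points. The fallback to html.entities assumes
# Python 3, where htmlentitydefs was renamed; entity_to_char is a
# hypothetical helper introduced just for this example.

```python
# Portable import: `htmlentitydefs` on Python 2, `html.entities` on Python 3.
try:
    from htmlentitydefs import name2codepoint
except ImportError:
    from html.entities import name2codepoint


def entity_to_char(name):
    """Return the character for a named HTML entity, or None if unknown.

    Hypothetical helper for illustration; Beautiful Soup does not define it.
    """
    codepoint = name2codepoint.get(name)
    if codepoint is None:
        return None
    try:
        return unichr(codepoint)  # Python 2
    except NameError:
        return chr(codepoint)     # Python 3


if __name__ == "__main__":
    for entity in ("amp", "eacute", "nosuchentity"):
        print("%s -> %r" % (entity, entity_to_char(entity)))
```

# With the listing's empty-dict fallback (name2codepoint = {}), every
# lookup like this would simply come back as unknown.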
#These hacks make Beautiful Soup able to parse XML with namespaces
sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match

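# Illustration only: a small sketch of why the colon in the pattern above
# matters for namespaced XML. The first pattern is copied from the listing;
# the colon-free pattern is a hypothetical contrast written for this
# example, and tag_name is a made-up helper, not a Beautiful Soup function.

```python
import re

# Same pattern the listing assigns to sgmllib.tagfind.
tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')

# Hypothetical contrast: the same character class without ':'.
tagfind_no_colon = re.compile('[a-zA-Z][-_.a-zA-Z0-9]*')


def tag_name(pattern, markup):
    """Return the name matched at the start of `markup`, or None."""
    match = pattern.match(markup)
    if match is None:
        return None
    return match.group(0)


if __name__ == "__main__":
    markup = "dc:creator>Leonard</dc:creator>"
    print(tag_name(tagfind, markup))           # prefix and local name together
    print(tag_name(tagfind_no_colon, markup))  # stops at the colon
```

# With ':' allowed, a prefixed name such as dc:creator is read as a single
# tag name instead of being cut off at the namespace prefix.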
DEFAULT_OUTPUT_ENCODING = "utf-8"

def _match_css_class(str):
    """Build a RE to match the given CSS class."""
    return re.compile(r"(^|.*\s)%s($|\s)" % str)

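# Illustration only: a sketch of how the regular expression built by
# _match_css_class behaves on typical class attribute values. The pattern
# is copied from the listing; the parameter is renamed to `name` here and
# has_class is a hypothetical helper added for the example.

```python
import re


def _match_css_class(name):
    """Build a RE to match the given CSS class (same pattern as the listing)."""
    return re.compile(r"(^|.*\s)%s($|\s)" % name)


def has_class(attr_value, name):
    """Return True if the whitespace-separated `attr_value` contains `name`.

    Hypothetical helper; it anchors at the start of the value with match().
    """
    return _match_css_class(name).match(attr_value) is not None


if __name__ == "__main__":
    for value in ("nav", "main nav", "nav main", "navbar", "sub-nav"):
        print("%-10r %s" % (value, has_class(value, "nav")))
```

# The class name is interpolated into the pattern without escaping, so
# regex metacharacters in it are interpreted: under this sketch,
# has_class("axb", "a.b") is true. Whether callers rely on that is not
# stated in the listing.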
# First, the classes that represent markup elements.

class PageElement(object):
    """Contains the navigational information for some part of the page
    (either a tag or a piece of text)"""

    def _invert(h):
        "Cheap function to invert a hash."
        i = {}
        for k,v in h.items():
            i[v] = k
        return i

    XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
                                      "quot" : '"',
                                      "amp" : "&",
                                      "lt" : "<",
                                      "gt" : ">" }

    XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)

    def setup(self, parent=None, previous=None):
        """Sets up the initial relations between this element and
        other elements."""
        self.parent = parent
        self.previous = previous
        self.next = None
        self.previousSibling = None
        self.nextSibling = None
        if self.parent and self.parent.contents:
            self.previousSibling = self.parent.contents[-1]
            self.previousSibling.nextSibling = self

    def replaceWith(self, replaceWith):
        """Replaces this element with replaceWith, at this element's
        position in its parent."""
        oldParent = self.parent
        myIndex = self.parent.index(self)
        if hasattr(replaceWith, "parent")\
                  and replaceWith.parent is self.parent:
            # We're replacing this element with one of its siblings.
            index = replaceWith.parent.index(replaceWith)
            # A sibling at index 0 also precedes this element, so
            # compare the indices directly.
            if index < myIndex:
                # Furthermore, it comes before this element. That
                # means that when we extract it, the index of this
                # element will change.
                myIndex = myIndex - 1
        self.extract()
        oldParent.insert(myIndex, replaceWith)

    def replaceWithChildren(self):
        """Replaces this element with its own children, keeping
        them in their original order."""
        myParent = self.parent
        myIndex = self.parent.index(self)
        self.extract()
        # Insert in reverse at a fixed index so the children end up
        # in their original order.
        reversedChildren = list(self.contents)
        reversedChildren.reverse()
        for child in reversedChildren:
            myParent.insert(myIndex, child)

    def extract(self):
        """Destructively rips this element out of the tree."""
        if self.parent:
            try:
                del self.parent.contents[self.parent.index(self)]
            except ValueError:
                pass

        #Find the two elements that would be next to each other if
        #this element (and any children) hadn't been parsed. Connect
        #the two.
        lastChild = self._lastRecursiveChild()
        nextElement = lastChild.next

        if self.previous:
            self.previous.next = nextElement
        if nextElement:
            nextElement.previous = self.previous
        self.previous = None
        lastChild.next = None

        self.parent = None
        if self.previousSibling:
            self.previousSibling.nextSibling = self.nextSibling
        if self.nextSibling:
            self.nextSibling.previousSibling = self.previousSibling
        self.previousSibling = self.nextSibling = None
        return self

    def _lastRecursiveChild(self):
        "Finds the last element beneath this object to be parsed."
        lastChild = self
        while hasattr(lastChild, 'contents') and lastChild.contents:
            lastChild = lastChild.contents[-1]
        return lastChild

    def insert(self, position, newChild):
        if isinstance(newChild, basestring) \
            and not isinstance(newChild, NavigableString):
            newChild = NavigableString(newChild)

        position = min(position, len(self.contents))
        if hasattr(newChild, 'parent') and newChild.parent is not None:
            # We're 'inserting' an element that's already one
            # of this object's children.
            if newChild.parent is self:
                index = self.index(newChild)
                if index > position:
                    # Furthermore we're moving it further down the
                    # list of this object's children. That means that
                    # when we extract this element, our target index
                    # will jump down one.
                    position = position - 1
            newChild.extract()

        newChild.parent = self
        previousChild = None
        if position == 0:
            newChild.previousSibling = None
            newChild.previous = self
        else:
            previousChild = self.contents[position-1]
            newChild.previousSibling = previousChild
            newChild.previousSibling.nextSibling = newChild
            newChild.previous = previousChild._lastRecursiveChild()
        if newChild.previous:
            newChild.previous.next = newChild

        newChildsLastElement = newChild._lastRecursiveChild()

        if position >= len(self.contents):
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
239 newChild.nextSibling = None
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
240
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
241 parent = self
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
242 parentsNextSibling = None
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
243 while not parentsNextSibling:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
244 parentsNextSibling = parent.nextSibling
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
245 parent = parent.parent
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
246 if not parent: # This is the last element in the document.
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
247 break
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
248 if parentsNextSibling:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
249 newChildsLastElement.next = parentsNextSibling
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
250 else:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
251 newChildsLastElement.next = None
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
252 else:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
253 nextChild = self.contents[position]
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
254 newChild.nextSibling = nextChild
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
255 if newChild.nextSibling:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
256 newChild.nextSibling.previousSibling = newChild
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
257 newChildsLastElement.next = nextChild
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
258
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
259 if newChildsLastElement.next:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
260 newChildsLastElement.next.previous = newChildsLastElement
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
261 self.contents.insert(position, newChild)
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
262
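The index bookkeeping described in the comments of insert() can be illustrated on a plain list. This is only a sketch under stated assumptions: `move_within` is a hypothetical helper, not part of Beautiful Soup, and it compares items by equality, whereas the library works with element identity and also relinks sibling and document-order pointers.

```python
def move_within(children, item, position):
    """Move `item`, already present in `children`, so that it lands just
    before the element that currently sits at `position`."""
    position = min(position, len(children))
    index = children.index(item)
    if index < position:
        # Removing the item shifts every later element one slot to the
        # left, so the target slot shifts left as well.
        position -= 1
    children.remove(item)
    children.insert(position, item)
    return children


print(move_within(list("abcd"), "a", 2))  # 'a' ends up just before 'c'
print(move_within(list("abcd"), "d", 1))  # 'd' ends up just before 'b'
```

Moving an item towards the front needs no adjustment, because removing it does not disturb the slots in front of it.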
    def append(self, tag):
        """Appends the given tag to the contents of this tag."""
        self.insert(len(self.contents), tag)

    def findNext(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears after this Tag in the document."""
        return self._findOne(self.findAllNext, name, attrs, text, **kwargs)

    def findAllNext(self, name=None, attrs={}, text=None, limit=None,
                    **kwargs):
        """Returns all items that match the given criteria and appear
        after this Tag in the document."""
        return self._findAll(name, attrs, text, limit, self.nextGenerator,
                             **kwargs)

    def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears after this Tag in the document."""
        return self._findOne(self.findNextSiblings, name, attrs, text,
                             **kwargs)

    def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
                         **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear after this Tag in the document."""
        return self._findAll(name, attrs, text, limit,
                             self.nextSiblingGenerator, **kwargs)
    fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x

    def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears before this Tag in the document."""
        return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)

    def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
                        **kwargs):
        """Returns all items that match the given criteria and appear
        before this Tag in the document."""
        return self._findAll(name, attrs, text, limit, self.previousGenerator,
                             **kwargs)
    fetchPrevious = findAllPrevious # Compatibility with pre-3.x

    def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears before this Tag in the document."""
        return self._findOne(self.findPreviousSiblings, name, attrs, text,
                             **kwargs)

    def findPreviousSiblings(self, name=None, attrs={}, text=None,
                             limit=None, **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear before this Tag in the document."""
        return self._findAll(name, attrs, text, limit,
                             self.previousSiblingGenerator, **kwargs)
    fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x

    def findParent(self, name=None, attrs={}, **kwargs):
        """Returns the closest parent of this Tag that matches the given
        criteria."""
        # NOTE: We can't use _findOne because findParents takes a different
        # set of arguments (it has no 'text' parameter).
        r = None
        # Forward keyword criteria so they are not silently dropped.
        l = self.findParents(name, attrs, 1, **kwargs)
        if l:
            r = l[0]
        return r

    def findParents(self, name=None, attrs={}, limit=None, **kwargs):
        """Returns the parents of this Tag that match the given
        criteria."""

        return self._findAll(name, attrs, None, limit, self.parentGenerator,
                             **kwargs)
    fetchParents = findParents # Compatibility with pre-3.x

    # These methods do the real heavy lifting.

    def _findOne(self, method, name, attrs, text, **kwargs):
        r = None
        l = method(name, attrs, text, 1, **kwargs)
        if l:
            r = l[0]
        return r

    def _findAll(self, name, attrs, text, limit, generator, **kwargs):
        "Iterates over a generator looking for things that match."

        if isinstance(name, SoupStrainer):
            strainer = name
        # (Possibly) special case some findAll*(...) searches
        elif text is None and not limit and not attrs and not kwargs:
            # findAll*(True)
            if name is True:
                return [element for element in generator()
                        if isinstance(element, Tag)]
            # findAll*('tag-name')
            elif isinstance(name, basestring):
                return [element for element in generator()
                        if isinstance(element, Tag) and
                        element.name == name]
            else:
                strainer = SoupStrainer(name, attrs, text, **kwargs)
        # Build a SoupStrainer
        else:
            strainer = SoupStrainer(name, attrs, text, **kwargs)
        results = ResultSet(strainer)
        g = generator()
        while True:
            try:
                i = g.next()
            except StopIteration:
                break
            if i:
                found = strainer.search(i)
                if found:
                    results.append(found)
                    if limit and len(results) >= limit:
                        break
        return results

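The relationship between _findOne and _findAll (a "find one" is a "find all" with a limit of 1) can be sketched without any parser machinery. The names `find_all`, `find_one` and `matches` below are hypothetical stand-ins written for Python 3; the real methods match through a SoupStrainer and skip any falsy item rather than only None.

```python
def find_all(generator, matches, limit=None):
    """Collect items produced by `generator()` that satisfy `matches`,
    stopping early once `limit` results have been found."""
    results = []
    for item in generator():
        if item is None:
            # The navigation generators may yield a trailing None.
            continue
        if matches(item):
            results.append(item)
            if limit and len(results) >= limit:
                break
    return results


def find_one(generator, matches):
    """Return the first match, or None when nothing matches."""
    found = find_all(generator, matches, limit=1)
    return found[0] if found else None


def numbers():
    return iter([1, 2, None, 3, 4, 6])


def is_even(n):
    return n % 2 == 0


print(find_all(numbers, is_even))
print(find_one(numbers, is_even))
```

Because the loop breaks as soon as the limit is reached, a single-result search does not consume the rest of the generator.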
    # These Generators can be used to navigate starting from both
    # NavigableStrings and Tags.
    def nextGenerator(self):
        i = self
        while i is not None:
            i = i.next
            yield i

    def nextSiblingGenerator(self):
        i = self
        while i is not None:
            i = i.nextSibling
            yield i

    def previousGenerator(self):
        i = self
        while i is not None:
            i = i.previous
            yield i

    def previousSiblingGenerator(self):
        i = self
        while i is not None:
            i = i.previousSibling
            yield i

    def parentGenerator(self):
        i = self
        while i is not None:
            i = i.parent
            yield i

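All five generators share one shape: follow a single link attribute until it runs out. A minimal Python 3 sketch, assuming a hypothetical `Node` class with only a `parent` link, shows that this shape never yields the starting element and yields a final None, which is why the consumer has to filter empty items.

```python
class Node:
    """Hypothetical stand-in for a page element with a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def parent_generator(node):
    i = node
    while i is not None:
        i = i.parent
        yield i  # the last value yielded is None


root = Node("html")
body = Node("body", parent=root)
para = Node("p", parent=body)

ancestors = list(parent_generator(para))
print([n.name if n is not None else None for n in ancestors])
```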
    # Utility methods
    def substituteEncoding(self, str, encoding=None):
        encoding = encoding or "utf-8"
        return str.replace("%SOUP-ENCODING%", encoding)

    def toEncoding(self, s, encoding=None):
        """Encodes an object to a string in some encoding, or to Unicode."""
        if isinstance(s, unicode):
            if encoding:
                s = s.encode(encoding)
        elif isinstance(s, str):
            if encoding:
                s = s.encode(encoding)
            else:
                s = unicode(s)
        else:
            if encoding:
                s = self.toEncoding(str(s), encoding)
            else:
                s = unicode(s)
        return s

    BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
                                           + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
                                           + ")")

    def _sub_entity(self, x):
        """Used with a regular expression to substitute the
        appropriate XML entity for an XML special character."""
        return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"


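How the pattern and _sub_entity work together can be tried in isolation. The sketch below targets Python 3 and makes two assumptions: `escape_bare` is a hypothetical wrapper, and `SPECIAL_CHARS_TO_ENTITIES` is a hand-written stand-in for the class attribute XML_SPECIAL_CHARS_TO_ENTITIES, which is defined elsewhere in the file.

```python
import re

# Same pattern as above, written as a single raw string.
BARE_AMPERSAND_OR_BRACKET = re.compile(
    r"([<>]|&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;))")

# Assumed mapping; the real table lives on the class.
SPECIAL_CHARS_TO_ENTITIES = {"<": "lt", ">": "gt", "&": "amp"}


def escape_bare(text):
    """Escape angle brackets and any ampersand that does not already
    start a numeric, hexadecimal, or named entity reference."""
    def sub_entity(match):
        return "&" + SPECIAL_CHARS_TO_ENTITIES[match.group(0)[0]] + ";"
    return BARE_AMPERSAND_OR_BRACKET.sub(sub_entity, text)


print(escape_bare("AT&T <b> &amp; &#38; &#x26;"))
```

The negative lookahead only checks that the ampersand is followed by something shaped like an entity, so text such as `Q&A;` is left untouched even though `A` is not a defined entity name.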
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
class NavigableString(unicode, PageElement):

    def __new__(cls, value):
        """Create a new NavigableString.

        When unpickling a NavigableString, this method is called with
        the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
        passed in to the superclass's __new__ or the superclass won't know
        how to handle non-ASCII characters.
        """
        if isinstance(value, unicode):
            return unicode.__new__(cls, value)
        return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)

    def __getnewargs__(self):
        return (NavigableString.__str__(self),)

    def __getattr__(self, attr):
        """text.string gives you text. This is for backwards
        compatibility for Navigable*String, but for CData* it lets you
        get the string without the CData wrapper."""
        if attr == 'string':
            return self
        else:
            raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)

    def __unicode__(self):
        return str(self).decode(DEFAULT_OUTPUT_ENCODING)

    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        # Substitute outgoing XML entities.
        data = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, self)
        if encoding:
            return data.encode(encoding)
        else:
            return data

class CData(NavigableString):

    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding)

class ProcessingInstruction(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        output = self
        if "%SOUP-ENCODING%" in output:
            output = self.substituteEncoding(output, encoding)
        return "<?%s?>" % self.toEncoding(output, encoding)

class Comment(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<!--%s-->" % NavigableString.__str__(self, encoding)

class Declaration(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<!%s>" % NavigableString.__str__(self, encoding)

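The four `NavigableString` subclasses differ only in the delimiters they wrap around the string when it is rendered. A simplified Python 3 sketch of that pattern follows. It is an illustration under stated assumptions, not the library's code: the `Text` base class, the `PREFIX`/`SUFFIX` attributes, and the `render` method are hypothetical, and the entity escaping and output encoding done by `NavigableString.__str__` are deliberately left out.

```python
class Text(str):
    """Hypothetical stand-in for NavigableString: a str that can render itself."""

    PREFIX = ""
    SUFFIX = ""

    def render(self):
        # Wrap the raw string in the delimiters declared by the subclass.
        return self.PREFIX + str(self) + self.SUFFIX


class CData(Text):
    PREFIX, SUFFIX = "<![CDATA[", "]]>"


class Comment(Text):
    PREFIX, SUFFIX = "<!--", "-->"


class Declaration(Text):
    PREFIX, SUFFIX = "<!", ">"
```

Since each class is still a string, the unwrapped content remains available through ordinary string operations.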
class Tag(PageElement):

    """Represents a found HTML tag with its attributes and contents."""

    def _convertEntities(self, match):
        """Used in a call to re.sub to replace HTML, XML, and numeric
        entities with the appropriate Unicode characters. If HTML
        entities are being converted, any unrecognized entities are
        escaped."""
        x = match.group(1)
        if self.convertHTMLEntities and x in name2codepoint:
            return unichr(name2codepoint[x])
        elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
            if self.convertXMLEntities:
                return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
            else:
                return u'&%s;' % x
        elif len(x) > 0 and x[0] == '#':
            # Handle numeric entities
            if len(x) > 1 and x[1] == 'x':
                return unichr(int(x[2:], 16))
            else:
                return unichr(int(x[1:]))

        elif self.escapeUnrecognizedEntities:
            return u'&amp;%s;' % x
        else:
            return u'&%s;' % x

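The branch order in `_convertEntities` decides what happens to each entity reference: named HTML entities first, then the five XML entities, then numeric references, and finally unrecognized names. A standalone Python 3 sketch of the same decision chain is given below. The function name `convert_entities`, its keyword flags, and the contents of the XML entity table are assumptions for illustration; `html.entities.name2codepoint` is the Python 3 location of the HTML 4 entity table.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points

# Assumed table of the five predefined XML entities.
XML_ENTITIES_TO_SPECIAL_CHARS = {
    "apos": "'", "quot": '"', "amp": "&", "lt": "<", "gt": ">",
}

ENTITY_RE = re.compile(r"&(#\d+|#x[0-9a-fA-F]+|\w+);")


def convert_entities(text, convert_html=False, convert_xml=False,
                     escape_unrecognized=False):
    """Replace entity references in text according to the three flags."""

    def replace(match):
        x = match.group(1)
        if convert_html and x in name2codepoint:
            return chr(name2codepoint[x])
        if x in XML_ENTITIES_TO_SPECIAL_CHARS:
            if convert_xml:
                return XML_ENTITIES_TO_SPECIAL_CHARS[x]
            return "&%s;" % x
        if x.startswith("#"):
            # Numeric references are converted regardless of the flags.
            if x.startswith("#x"):
                return chr(int(x[2:], 16))
            return chr(int(x[1:]))
        if escape_unrecognized:
            return "&amp;%s;" % x
        return "&%s;" % x

    return ENTITY_RE.sub(replace, text)
```

Note that numeric references are always decoded, while an unknown named entity is either kept verbatim or has its ampersand escaped.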
    def __init__(self, parser, name, attrs=None, parent=None,
                 previous=None):
        "Basic constructor."

        # We don't actually store the parser object: that lets extracted
        # chunks be garbage-collected
        self.parserClass = parser.__class__
        self.isSelfClosing = parser.isSelfClosingTag(name)
        self.name = name
        if attrs is None:
            attrs = []
        elif isinstance(attrs, dict):
            attrs = attrs.items()
        self.attrs = attrs
        self.contents = []
        self.setup(parent, previous)
        self.hidden = False
        self.containsSubstitutions = False
        self.convertHTMLEntities = parser.convertHTMLEntities
        self.convertXMLEntities = parser.convertXMLEntities
        self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities

        # Convert any HTML, XML, or numeric entities in the attribute values.
        convert = lambda(k, val): (k,
                                   re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
                                          self._convertEntities,
                                          val))
        self.attrs = map(convert, self.attrs)

    def getString(self):
        if (len(self.contents) == 1
            and isinstance(self.contents[0], NavigableString)):
            return self.contents[0]

    def setString(self, string):
        """Replace the contents of the tag with a string"""
        self.clear()
        self.append(string)

    string = property(getString, setString)

    def getText(self, separator=u""):
        if not len(self.contents):
            return u""
        stopNode = self._lastRecursiveChild().next
        strings = []
        current = self.contents[0]
        while current is not stopNode:
            if isinstance(current, NavigableString):
                strings.append(current.strip())
            current = current.next
        return separator.join(strings)

    text = property(getText)

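`getText` visits every node under the tag in document order, strips each string it meets, and joins the results with `separator`. A rough Python 3 sketch of that behaviour follows, assuming a simplified tree in which a tag is a plain list and a string is a `str`. The real method follows the `.next` links between nodes instead of recursing, and `get_text` is a hypothetical helper.

```python
def get_text(contents, separator=""):
    """Join the stripped strings found under contents, in document order."""
    strings = []

    def walk(items):
        for item in items:
            if isinstance(item, str):
                # Whitespace-only strings become "" but are still joined.
                strings.append(item.strip())
            else:
                walk(item)  # a nested list stands in for a child tag

    walk(contents)
    return separator.join(strings)
```

One consequence of stripping before joining is that whitespace-only strings still contribute an empty segment, so a non-empty separator can appear twice in a row.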
    def get(self, key, default=None):
        """Returns the value of the 'key' attribute for the tag, or
        the value given for 'default' if it doesn't have that
        attribute."""
        return self._getAttrMap().get(key, default)

    def clear(self):
        """Extract all children."""
        for child in self.contents[:]:
            child.extract()

    def index(self, element):
        for i, child in enumerate(self.contents):
            if child is element:
                return i
        raise ValueError("Tag.index: element not in tag")

    def has_key(self, key):
        return self._getAttrMap().has_key(key)

    def __getitem__(self, key):
        """tag[key] returns the value of the 'key' attribute for the tag,
        and throws an exception if it's not there."""
        return self._getAttrMap()[key]

    def __iter__(self):
        "Iterating over a tag iterates over its contents."
        return iter(self.contents)

    def __len__(self):
        "The length of a tag is the length of its list of contents."
        return len(self.contents)

    def __contains__(self, x):
        return x in self.contents

    def __nonzero__(self):
        "A tag is non-None even if it has no contents."
        return True

    def __setitem__(self, key, value):
        """Setting tag[key] sets the value of the 'key' attribute for the
        tag."""
        self._getAttrMap()
        self.attrMap[key] = value
        found = False
        for i in range(0, len(self.attrs)):
            if self.attrs[i][0] == key:
                self.attrs[i] = (key, value)
                found = True
        if not found:
            self.attrs.append((key, value))
        self._getAttrMap()[key] = value

    def __delitem__(self, key):
        "Deleting tag[key] deletes all 'key' attributes for the tag."
        for item in self.attrs:
            if item[0] == key:
                self.attrs.remove(item)
                #We don't break because bad HTML can define the same
                #attribute multiple times.
            self._getAttrMap()
            if self.attrMap.has_key(key):
                del self.attrMap[key]

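The loop in `__delitem__` calls `remove` on `self.attrs` while iterating over that same list. In Python this pattern can skip the element that slides into the freed slot, so two adjacent attributes with the same name may not both be deleted. The sketch below, written in Python 3 with hypothetical function names, contrasts the pattern with rebuilding the list in one pass. It illustrates the pitfall only and is not a patch taken from the library.

```python
def delete_in_place(attrs, key):
    """Remove matching pairs while iterating over the same list."""
    for item in attrs:
        if item[0] == key:
            attrs.remove(item)  # shifts later items left; the next one is skipped
    return attrs


def delete_by_filter(attrs, key):
    """Keep every pair whose name differs from key, mutating attrs in place."""
    attrs[:] = [item for item in attrs if item[0] != key]
    return attrs
```

Iterating over a copy, as `clear` does with `self.contents[:]`, would be another way to avoid the skipped element.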
    def __call__(self, *args, **kwargs):
        """Calling a tag like a function is the same as calling its
        findAll() method. Eg. tag('a') returns a list of all the A tags
        found within this tag."""
        return apply(self.findAll, args, kwargs)

    def __getattr__(self, tag):
        #print "Getattr %s.%s" % (self.__class__, tag)
        if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3:
            return self.find(tag[:-3])
        elif tag.find('__') != 0:
            return self.find(tag)
        raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag)

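`Tag.__getattr__` turns an unknown attribute access into a search: `tag.fooTag` and `tag.foo` both look for the first `foo` element, while a name that starts with a double underscore raises `AttributeError`. A minimal Python 3 sketch of that dispatch rule is shown here. The function `resolve_attribute` is hypothetical and returns the tag name that would be searched for instead of calling `find`.

```python
def resolve_attribute(attr):
    """Return the tag name an attribute access would search for."""
    if len(attr) > 3 and attr.endswith("Tag"):
        return attr[:-3]           # "bTag" searches for "b"
    if not attr.startswith("__"):
        return attr                # "title" searches for "title"
    raise AttributeError("no attribute %r" % attr)
```

Rejecting double-underscore names keeps protocol lookups such as `__deepcopy__` from being mistaken for tag searches.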
    def __eq__(self, other):
        """Returns true iff this tag has the same name, the same attributes,
        and the same contents (recursively) as the given tag.

        NOTE: right now this will return false if two tags have the
        same attributes in a different order. Should this be fixed?"""
        if other is self:
            return True
        if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other):
            return False
        for i in range(0, len(self.contents)):
            if self.contents[i] != other.contents[i]:
                return False
        return True

    def __ne__(self, other):
        """Returns true iff this tag is not identical to the other tag,
        as defined in __eq__."""
        return not self == other

689 def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
690 """Renders this tag as a string."""
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
691 return self.__str__(encoding)
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
692
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
693 def __unicode__(self):
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
694 return self.__str__(None)
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
695
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING,
                prettyPrint=False, indentLevel=0):
        """Returns a string or Unicode representation of this tag and
        its contents. To get Unicode, pass None for encoding.

        NOTE: since Python's HTML parser consumes whitespace, this
        method is not certain to reproduce the whitespace present in
        the original string."""

        encodedName = self.toEncoding(self.name, encoding)

        attrs = []
        if self.attrs:
            for key, val in self.attrs:
                fmt = '%s="%s"'
                if isinstance(val, basestring):
                    if self.containsSubstitutions and '%SOUP-ENCODING%' in val:
                        val = self.substituteEncoding(val, encoding)

                    # The attribute value either:
                    #
                    # * Contains no embedded double quotes or single quotes.
                    #   No problem: we enclose it in double quotes.
                    # * Contains embedded single quotes. No problem:
                    #   double quotes work here too.
                    # * Contains embedded double quotes. No problem:
                    #   we enclose it in single quotes.
                    # * Embeds both single _and_ double quotes. This
                    #   can't happen naturally, but it can happen if
                    #   you modify an attribute value after parsing
                    #   the document. Now we have a bit of a
                    #   problem. We solve it by enclosing the
                    #   attribute in single quotes, and escaping any
                    #   embedded single quotes to XML entities.
                    if '"' in val:
                        fmt = "%s='%s'"
                        if "'" in val:
                            # "&squot;" is not a defined entity, so use
                            # the numeric character reference for an
                            # apostrophe, which HTML and XML both accept.
                            val = val.replace("'", "&#39;")

                    # Now we're okay w/r/t quotes. But the attribute
                    # value might also contain angle brackets, or
                    # ampersands that aren't part of entities. We need
                    # to escape those to XML entities too.
                    val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)

                attrs.append(fmt % (self.toEncoding(key, encoding),
                                    self.toEncoding(val, encoding)))
        close = ''
        closeTag = ''
        if self.isSelfClosing:
            close = ' /'
        else:
            closeTag = '</%s>' % encodedName

        indentTag, indentContents = 0, 0
        if prettyPrint:
            indentTag = indentLevel
            space = (' ' * (indentTag-1))
            indentContents = indentTag + 1
        contents = self.renderContents(encoding, prettyPrint, indentContents)
        if self.hidden:
            s = contents
        else:
            s = []
            attributeString = ''
            if attrs:
                attributeString = ' ' + ' '.join(attrs)
            if prettyPrint:
                s.append(space)
            s.append('<%s%s%s>' % (encodedName, attributeString, close))
            if prettyPrint:
                s.append("\n")
            s.append(contents)
            if prettyPrint and contents and contents[-1] != "\n":
                s.append("\n")
            if prettyPrint and closeTag:
                s.append(space)
            s.append(closeTag)
            if prettyPrint and closeTag and self.nextSibling:
                s.append("\n")
            s = ''.join(s)
        return s

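# Editorial sketch, not part of BeautifulSoup: the quote-selection rules that
# __str__ applies to attribute values can be tried in isolation with the
# standalone Python 3 function below. The name `format_attribute` is
# hypothetical, the numeric reference `&#39;` is a choice made for this
# example, and the encoding and ampersand/angle-bracket escaping steps of the
# real method are deliberately left out.

```python
def format_attribute(key, val):
    """Render key/value as an attribute string, choosing the quote style.

    Hypothetical helper for illustration only.
    """
    fmt = '%s="%s"'  # default: enclose the value in double quotes
    if '"' in val:
        fmt = "%s='%s'"  # embedded double quotes: switch to single quotes
        if "'" in val:
            # Both quote kinds are present: escape the single quotes.
            val = val.replace("'", "&#39;")
    return fmt % (key, val)


print(format_attribute("title", "plain"))        # title="plain"
print(format_attribute("title", "it's"))         # title="it's"
print(format_attribute("title", 'say "hi"'))     # title='say "hi"'
print(format_attribute("title", 'it\'s "hi"'))   # title='it&#39;s "hi"'
```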
    def decompose(self):
        """Recursively destroys the contents of this tree."""
        self.extract()
        if len(self.contents) == 0:
            return
        current = self.contents[0]
        while current is not None:
            next = current.next
            if isinstance(current, Tag):
                del current.contents[:]
            current.parent = None
            current.previous = None
            current.previousSibling = None
            current.next = None
            current.nextSibling = None
            current = next

    def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
        """Renders this tag as a pretty-printed (indented) string."""
        return self.__str__(encoding, True)

    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        """Renders the contents of this tag as a string in the given
        encoding. If encoding is None, returns a Unicode string."""
        s = []
        for c in self:
            text = None
            if isinstance(c, NavigableString):
                text = c.__str__(encoding)
            elif isinstance(c, Tag):
                s.append(c.__str__(encoding, prettyPrint, indentLevel))
            if text and prettyPrint:
                text = text.strip()
            if text:
                if prettyPrint:
                    s.append(" " * (indentLevel-1))
                s.append(text)
                if prettyPrint:
                    s.append("\n")
        return ''.join(s)

    #Soup methods

    def find(self, name=None, attrs={}, recursive=True, text=None,
             **kwargs):
        """Return only the first child of this Tag matching the given
        criteria."""
        r = None
        l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
        if l:
            r = l[0]
        return r
    findChild = find

    def findAll(self, name=None, attrs={}, recursive=True, text=None,
                limit=None, **kwargs):
        """Extracts a list of Tag objects that match the given
        criteria. You can specify the name of the Tag and any
        attributes you want the Tag to have.

        The value of a key-value pair in the 'attrs' map can be a
        string, a list of strings, a regular expression object, or a
        callable that takes a string and returns whether or not the
        string matches for some custom definition of 'matches'. The
        same is true of the tag name."""
        generator = self.recursiveChildGenerator
        if not recursive:
            generator = self.childGenerator
        return self._findAll(name, attrs, text, limit, generator, **kwargs)
    findChildren = findAll

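# Editorial sketch, not part of BeautifulSoup: the findAll docstring lists four
# kinds of criterion (a string, a list of strings, a regular expression object,
# or a callable). The standalone Python 3 function below shows one way such a
# dispatch could look. The name `matches` and its exact rules are assumptions
# made for illustration; the library's own matching logic handles more cases.

```python
import re


def matches(value, match_against):
    """Return True if the string `value` satisfies `match_against`.

    Hypothetical, simplified helper for illustration only.
    """
    if callable(match_against):
        # Custom predicate: the caller defines what "matches" means.
        return bool(match_against(value))
    if hasattr(match_against, "search"):
        # Compiled regular expression object.
        return match_against.search(value) is not None
    if isinstance(match_against, (list, tuple)):
        # Any one of several acceptable strings.
        return value in match_against
    # Plain string: exact comparison.
    return value == match_against


print(matches("sister", "sister"))                # True
print(matches("sister", ["brother", "sister"]))   # True
print(matches("sister", re.compile("^sis")))      # True
print(matches("sister", lambda v: len(v) == 3))   # False
```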
    # Pre-3.x compatibility methods
    first = find
    fetch = findAll

    def fetchText(self, text=None, recursive=True, limit=None):
        return self.findAll(text=text, recursive=recursive, limit=limit)

    def firstText(self, text=None, recursive=True):
        return self.find(text=text, recursive=recursive)

    #Private methods

    def _getAttrMap(self):
        """Initializes a map representation of this tag's attributes,
        if not already initialized."""
        if not getattr(self, 'attrMap'):
            self.attrMap = {}
            for (key, value) in self.attrs:
                self.attrMap[key] = value
        return self.attrMap

    #Generator methods
    def childGenerator(self):
        # Just use the iterator from the contents
        return iter(self.contents)

    def recursiveChildGenerator(self):
        if not len(self.contents):
            # A bare return ends the generator. Raising StopIteration
            # here would become a RuntimeError under PEP 479.
            return
        stopNode = self._lastRecursiveChild().next
        current = self.contents[0]
        while current is not stopNode:
            yield current
            current = current.next


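# Editorial sketch, not part of BeautifulSoup: recursiveChildGenerator walks
# the `next` pointers from the first child until it reaches the node that
# follows the subtree's last descendant. The standalone Python 3 example below
# reproduces that idea on a toy tree. `Node`, `link`, `last_descendant` and
# `descendants` are hypothetical names, and `last_descendant` is only an
# assumption about what `_lastRecursiveChild` returns.

```python
class Node:
    """Hypothetical stand-in for a parse-tree element."""

    def __init__(self, name, children=()):
        self.name = name
        self.contents = list(children)
        self.next = None  # next node in document order, set by link()


def link(root):
    """Set every node's `next` pointer using a pre-order walk."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(node.contents))
    for a, b in zip(order, order[1:]):
        a.next = b
    return order


def last_descendant(node):
    """Return the deepest, right-most node under `node` (or `node` itself)."""
    while node.contents:
        node = node.contents[-1]
    return node


def descendants(node):
    """Yield every node below `node` in document order."""
    if not node.contents:
        return
    stop = last_descendant(node).next  # first node outside the subtree
    current = node.contents[0]
    while current is not stop:
        yield current
        current = current.next


body = Node("body", [Node("p", [Node("b"), Node("i")]), Node("hr")])
root = Node("html", [Node("head"), body, Node("footer")])
link(root)
print([n.name for n in descendants(body)])  # ['p', 'b', 'i', 'hr']
```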
# Next, a couple classes to represent queries and their results.
class SoupStrainer:
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""

    def __init__(self, name=None, attrs={}, text=None, **kwargs):
        self.name = name
        if isinstance(attrs, basestring):
            kwargs['class'] = _match_css_class(attrs)
            attrs = None
        if kwargs:
            if attrs:
                attrs = attrs.copy()
                attrs.update(kwargs)
            else:
                attrs = kwargs
        self.attrs = attrs
        self.text = text

    def __str__(self):
        if self.text:
            return self.text
        else:
            return "%s|%s" % (self.name, self.attrs)

    def searchTag(self, markupName=None, markupAttrs={}):
        found = None
        markup = None
        if isinstance(markupName, Tag):
            markup = markupName
            markupAttrs = markup
        callFunctionWithTagData = callable(self.name) \
                                and not isinstance(markupName, Tag)

        if (not self.name) \
               or callFunctionWithTagData \
               or (markup and self._matches(markup, self.name)) \
               or (not markup and self._matches(markupName, self.name)):
            if callFunctionWithTagData:
                match = self.name(markupName, markupAttrs)
            else:
                match = True
                markupAttrMap = None
                for attr, matchAgainst in self.attrs.items():
                    if not markupAttrMap:
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
933 if hasattr(markupAttrs, 'get'):
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
934 markupAttrMap = markupAttrs
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
935 else:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
936 markupAttrMap = {}
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
937 for k,v in markupAttrs:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
938 markupAttrMap[k] = v
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
939 attrValue = markupAttrMap.get(attr)
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
940 if not self._matches(attrValue, matchAgainst):
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
941 match = False
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
942 break
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
943 if match:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
944 if markup:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
945 found = markup
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
946 else:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
947 found = markupName
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
948 return found

    def search(self, markup):
        #print 'looking for %s in %s' % (self, markup)
        found = None
        # If given a list of items, scan it for a text element that
        # matches.
        if hasattr(markup, "__iter__") \
                and not isinstance(markup, Tag):
            for element in markup:
                if isinstance(element, NavigableString) \
                       and self.search(element):
                    found = element
                    break
        # If it's a Tag, make sure its name or attributes match.
        # Don't bother with Tags if we're searching for text.
        elif isinstance(markup, Tag):
            if not self.text:
                found = self.searchTag(markup)
        # If it's text, make sure the text matches.
        elif isinstance(markup, NavigableString) or \
                 isinstance(markup, basestring):
            if self._matches(markup, self.text):
                found = markup
        else:
            raise Exception, "I don't know how to match against a %s" \
                  % markup.__class__
        return found

    def _matches(self, markup, matchAgainst):
        #print "Matching %s against %s" % (markup, matchAgainst)
        result = False
        if matchAgainst is True:
            result = markup is not None
        elif callable(matchAgainst):
            result = matchAgainst(markup)
        else:
            #Custom match methods take the tag as an argument, but all
            #other ways of matching match the tag name as a string.
            if isinstance(markup, Tag):
                markup = markup.name
            if markup and not isinstance(markup, basestring):
                markup = unicode(markup)
            #Now we know that markup is either a string, or None.
            if hasattr(matchAgainst, 'match'):
                # It's a regexp object.
                result = markup and matchAgainst.search(markup)
            elif hasattr(matchAgainst, '__iter__'): # list-like
                result = markup in matchAgainst
            elif hasattr(matchAgainst, 'items'):
                # Map-like: look the string up among the map's keys.
                # (markup is a string here and has no has_key method.)
                result = matchAgainst.has_key(markup)
            elif matchAgainst and isinstance(markup, basestring):
                if isinstance(markup, unicode):
                    matchAgainst = unicode(matchAgainst)
                else:
                    matchAgainst = str(matchAgainst)

            if not result:
                result = matchAgainst == markup
        return result
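
The matching rules that _matches applies can be tried outside the library. What follows is a minimal Python 3 sketch, not the module's own code: `matches` is a hypothetical standalone function, and it tests container types explicitly because Python 3 strings define `__iter__`, which would send every plain string down the list-like branch.

```python
import re


def matches(markup, match_against):
    """Hypothetical Python 3 restatement of the rules in SoupStrainer._matches."""
    if match_against is True:
        # True matches anything that is present at all.
        return markup is not None
    if callable(match_against):
        # Custom predicates receive the value unchanged.
        return bool(match_against(markup))
    if markup and not isinstance(markup, str):
        markup = str(markup)
    if hasattr(match_against, "search"):
        # Compiled regular expression: a search, not an anchored match.
        return bool(markup and match_against.search(markup))
    if isinstance(match_against, (list, tuple, set, frozenset, dict)):
        # List-like or map-like: membership test (keys, for a dict).
        return markup in match_against
    if match_against and isinstance(markup, str):
        match_against = str(match_against)
    # Fallback: plain equality, so None matches None.
    return match_against == markup


print(matches("href", re.compile("^h")))   # regex search on the string
print(matches("b", ["a", "b"]))            # membership in a list
print(matches(None, True))                 # True requires a value
```

The fallback equality test is what lets a strainer ask for an attribute to be absent: a missing attribute arrives as None and compares equal to a None criterion.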

class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source):
        # Initialize this list instance, not a throwaway empty list.
        list.__init__(self)
        self.source = source

# Now, some helper functions.

def buildTagMap(default, *args):
    """Turns a list of maps, lists, or scalars into a single map.
    Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
    RESET_NESTING_TAGS maps out of lists and partial maps."""
    built = {}
    for portion in args:
        if hasattr(portion, 'items'):
            #It's a map. Merge it.
            for k,v in portion.items():
                built[k] = v
        elif hasattr(portion, '__iter__'): # is a list
            #It's a list. Map each item to the default.
            for k in portion:
                built[k] = default
        else:
            #It's a scalar. Map it to the default.
            built[portion] = default
    return built
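
As an illustration of how buildTagMap merges its arguments, here is a minimal Python 3 sketch. The name `build_tag_map` and the sample tag names are assumptions made for the example. It is not the function above: it checks types explicitly, since a Python 3 string has `__iter__` and would otherwise be split into characters.

```python
def build_tag_map(default, *args):
    """Hypothetical Python 3 restatement of buildTagMap."""
    built = {}
    for portion in args:
        if isinstance(portion, dict):
            # A map: merge its entries as they are.
            built.update(portion)
        elif isinstance(portion, (list, tuple, set, frozenset)):
            # A list: map each item to the default.
            for key in portion:
                built[key] = default
        else:
            # A scalar: map it to the default.
            built[portion] = default
    return built


tag_map = build_tag_map(None, ["br", "hr"], {"p": ["p"]}, "img")
print(tag_map)
```

Under these rules a call such as build_tag_map(None, None) treats None as a scalar and yields a map with the single key None.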

# Now, the parser classes.

class BeautifulStoneSoup(Tag, SGMLParser):

    """This class contains the basic parser and search code. It defines
    a parser that knows nothing about tag behavior except for the
    following:

      You can't close a tag without closing all the tags it encloses.
      That is, "<foo><bar></foo>" actually means
      "<foo><bar></bar></foo>".

    [Another possible explanation is "<foo><bar /></foo>", but since
    this class defines no SELF_CLOSING_TAGS, it will never use that
    explanation.]

    This class is useful for parsing XML or made-up markup languages,
    or when BeautifulSoup makes an assumption counter to what you were
    expecting."""

    SELF_CLOSING_TAGS = {}
    NESTABLE_TAGS = {}
    RESET_NESTING_TAGS = {}
    QUOTE_TAGS = {}
    PRESERVE_WHITESPACE_TAGS = []

    MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'),
                       lambda x: x.group(1) + ' />'),
                      (re.compile('<!\s+([^<>]*)>'),
                       lambda x: '<!' + x.group(1) + '>')
                      ]

    ROOT_TAG_NAME = u'[document]'

    HTML_ENTITIES = "html"
    XML_ENTITIES = "xml"
    XHTML_ENTITIES = "xhtml"
    # TODO: This only exists for backwards-compatibility
    ALL_ENTITIES = XHTML_ENTITIES

    # Used when determining whether a text node is all whitespace and
    # can be replaced with a single space. A text node that contains
    # fancy Unicode spaces (usually non-breaking) should be left
    # alone.
    STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }

    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
                 convertEntities=None, selfClosingTags=None, isHTML=False):
        """The Soup object is initialized as the 'root tag', and the
        provided markup (which can be a string or a file-like object)
        is fed into the underlying parser.

        sgmllib will process most bad HTML, and the BeautifulSoup
        class has some tricks for dealing with some HTML that kills
        sgmllib, but Beautiful Soup can nonetheless choke or lose data
        if your data uses self-closing tags or declarations
        incorrectly.

        By default, Beautiful Soup uses regexes to sanitize input,
        avoiding the vast majority of these problems. If the problems
        don't apply to you, pass in False for markupMassage, and
        you'll get better performance.

        The default parser massage techniques fix the two most common
        instances of invalid HTML that choke sgmllib:

         <br/> (No space between name of closing tag and tag close)
         <! --Comment--> (Extraneous whitespace in declaration)

        You can pass in a custom list of (RE object, replace method)
        tuples to get Beautiful Soup to scrub your input the way you
        want."""

        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding
        self.smartQuotesTo = smartQuotesTo
        self.convertEntities = convertEntities
        # Set the rules for how we'll deal with the entities we
        # encounter
        if self.convertEntities:
            # It doesn't make sense to convert encoded characters to
            # entities even while you're converting entities to Unicode.
            # Just convert it all to Unicode.
            self.smartQuotesTo = None
            if convertEntities == self.HTML_ENTITIES:
                self.convertXMLEntities = False
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = True
            elif convertEntities == self.XHTML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = False
            elif convertEntities == self.XML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = False
                self.escapeUnrecognizedEntities = False
        else:
            self.convertXMLEntities = False
            self.convertHTMLEntities = False
            self.escapeUnrecognizedEntities = False

        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        SGMLParser.__init__(self)

        if hasattr(markup, 'read'):        # It's a file-type object.
            markup = markup.read()
        self.markup = markup
        self.markupMassage = markupMassage
        try:
            self._feed(isHTML=isHTML)
        except StopParsing:
            pass
        self.markup = None                 # The markup can now be GCed
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1151
    def convert_charref(self, name):
        """This method fixes a bug in Python's SGMLParser."""
        try:
            n = int(name)
        except ValueError:
            return
        if not 0 <= n <= 127:  # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n)

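# Illustration only, not part of the module: a minimal standalone sketch
# (written for Python 3) of the rule convert_charref applies. Only decimal
# references in the ASCII range are converted; anything else yields None so
# the caller can handle it. The name convert_charref_sketch is hypothetical,
# and chr() stands in here for the parser's convert_codepoint().

```python
def convert_charref_sketch(name):
    """Return the ASCII character for a decimal reference, else None."""
    try:
        n = int(name)
    except ValueError:
        # Not a decimal number (for example a hex reference like "x41").
        return None
    if not 0 <= n <= 127:
        # Outside ASCII: leave it for the caller to deal with.
        return None
    return chr(n)
```
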
    def _feed(self, inDocumentEncoding=None, isHTML=False):
        # Convert the document to Unicode.
        markup = self.markup
        if isinstance(markup, unicode):
            if not hasattr(self, 'originalEncoding'):
                self.originalEncoding = None
        else:
            dammit = UnicodeDammit(
                markup, [self.fromEncoding, inDocumentEncoding],
                smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
            markup = dammit.unicode
            self.originalEncoding = dammit.originalEncoding
            self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
        if markup:
            if self.markupMassage:
                if not hasattr(self.markupMassage, "__iter__"):
                    self.markupMassage = self.MARKUP_MASSAGE
                for fix, m in self.markupMassage:
                    markup = fix.sub(m, markup)
                # TODO: We get rid of markupMassage so that the
                # soup object can be deepcopied later on. Some
                # Python installations can't copy regexes. If anyone
                # was relying on the existence of markupMassage, this
                # might cause problems.
                del self.markupMassage
        self.reset()

        SGMLParser.feed(self, markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
        while self.currentTag.name != self.ROOT_TAG_NAME:
            self.popTag()

    def __getattr__(self, methodName):
        """This method routes method call requests to either the SGMLParser
        superclass or the Tag superclass, depending on the method name."""
        #print "__getattr__ called on %s.%s" % (self.__class__, methodName)

        if methodName.startswith('start_') or methodName.startswith('end_') \
               or methodName.startswith('do_'):
            return SGMLParser.__getattr__(self, methodName)
        elif not methodName.startswith('__'):
            return Tag.__getattr__(self, methodName)
        else:
            raise AttributeError

    def isSelfClosingTag(self, name):
        """Returns true iff the given string is the name of a
        self-closing tag according to this parser."""
        return self.SELF_CLOSING_TAGS.has_key(name) \
               or self.instanceSelfClosingTags.has_key(name)

    def reset(self):
        Tag.__init__(self, self, self.ROOT_TAG_NAME)
        self.hidden = 1
        SGMLParser.reset(self)
        self.currentData = []
        self.currentTag = None
        self.tagStack = []
        self.quoteStack = []
        self.pushTag(self)

    def popTag(self):
        tag = self.tagStack.pop()

        #print "Pop", tag.name
        if self.tagStack:
            self.currentTag = self.tagStack[-1]
        return self.currentTag

    def pushTag(self, tag):
        #print "Push", tag.name
        if self.currentTag:
            self.currentTag.contents.append(tag)
        self.tagStack.append(tag)
        self.currentTag = self.tagStack[-1]

    def endData(self, containerClass=NavigableString):
        if self.currentData:
            currentData = u''.join(self.currentData)
            if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
                not set([tag.name for tag in self.tagStack]).intersection(
                    self.PRESERVE_WHITESPACE_TAGS)):
                if '\n' in currentData:
                    currentData = '\n'
                else:
                    currentData = ' '
            self.currentData = []
            if self.parseOnlyThese and len(self.tagStack) <= 1 and \
                   (not self.parseOnlyThese.text or \
                    not self.parseOnlyThese.search(currentData)):
                return
            o = containerClass(currentData)
            o.setup(self.currentTag, self.previous)
            if self.previous:
                self.previous.next = o
            self.previous = o
            self.currentTag.contents.append(o)


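# Illustration only, not part of the module: a standalone sketch (written
# for Python 3) of the whitespace rule endData applies before it creates a
# string node. A run made up only of ASCII whitespace collapses to a single
# newline or a single space, unless an open tag asks for whitespace to be
# preserved. The function name, the contents of the translation table, and
# the default preserve_tags value are assumptions made for this example;
# the real values live in class attributes defined elsewhere in the file.

```python
# Assumed table: delete tab, LF, FF, CR and space when translating.
STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None}


def collapse_whitespace(chunks, open_tag_names, preserve_tags=("pre", "textarea")):
    """Join buffered text chunks and collapse whitespace-only runs."""
    text = "".join(chunks)
    is_blank = text.translate(STRIP_ASCII_SPACES) == ""
    if is_blank and not set(open_tag_names).intersection(preserve_tags):
        # Keep one newline if the run contained any, otherwise one space.
        return "\n" if "\n" in text else " "
    return text
```
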
    def _popToTag(self, name, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag. If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
        #print "Popping to %s" % name
        if name == self.ROOT_TAG_NAME:
            return

        numPops = 0
        mostRecentTag = None
        for i in range(len(self.tagStack)-1, 0, -1):
            if name == self.tagStack[i].name:
                numPops = len(self.tagStack)-i
                break
        if not inclusivePop:
            numPops = numPops - 1

        for i in range(0, numPops):
            mostRecentTag = self.popTag()
        return mostRecentTag

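# Illustration only, not part of the module: a standalone sketch (written
# for Python 3) of the pop count computed by _popToTag, using a plain list
# of tag names in place of Tag objects. Index 0 plays the role of the root
# tag and is never popped. The name pop_to_tag and the list-of-strings
# stack are assumptions made for this example.

```python
def pop_to_tag(stack, name, inclusive=True):
    """Pop `stack` back to the most recent `name`; return the stack."""
    num_pops = 0
    # Walk from the top of the stack down to, but not including, the root.
    for i in range(len(stack) - 1, 0, -1):
        if stack[i] == name:
            num_pops = len(stack) - i
            break
    if not inclusive:
        # Leave the matching tag itself on the stack.
        num_pops -= 1
    for _ in range(num_pops):
        stack.pop()
    return stack
```
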
    def _smartPop(self, name):

        """We need to pop up to the previous tag of this type, unless
        one of this tag's nesting reset triggers comes between this
        tag and the previous tag of this type, OR unless this tag is a
        generic nesting trigger and another generic nesting trigger
        comes between this tag and the previous tag of this type.

        Examples:
         <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
         <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
         <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

         <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
         <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
         <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
        """

        nestingResetTriggers = self.NESTABLE_TAGS.get(name)
        isNestable = nestingResetTriggers is not None
        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
        popTo = None
        inclusive = True
        for i in range(len(self.tagStack)-1, 0, -1):
            p = self.tagStack[i]
            if (not p or p.name == name) and not isNestable:
                #Non-nestable tags get popped to the top or to their
                #last occurrence.
                popTo = name
                break
            if (nestingResetTriggers is not None
                and p.name in nestingResetTriggers) \
                or (nestingResetTriggers is None and isResetNesting
                    and self.RESET_NESTING_TAGS.has_key(p.name)):

                #If we encounter one of the nesting reset triggers
                #peculiar to this tag, or we encounter another tag
                #that causes nesting to reset, pop up to but not
                #including that tag.
                popTo = p.name
                inclusive = False
                break
            p = p.parent
        if popTo:
            self._popToTag(popTo, inclusive)

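# Illustration only, not part of the module: a standalone sketch (written
# for Python 3) of the decision _smartPop makes, checked against the
# examples in its docstring. It works on a list of tag names and returns
# the (popTo, inclusive) pair instead of popping anything. The NESTABLE and
# RESET_NESTING tables below are small hypothetical subsets chosen for this
# example; the real tables are class attributes defined elsewhere in the
# file. The sketch also drops the original's "not p" check, since plain
# strings stand in for Tag objects here.

```python
# Hypothetical subsets: tag -> tags that reset its nesting.
NESTABLE = {"li": ["ul", "ol"], "ul": [], "table": [], "tr": ["table"], "td": ["tr"]}
RESET_NESTING = {"p", "li", "ul", "table", "tr", "td"}


def smart_pop_target(stack, name):
    """Return (tag name to pop to, inclusive flag) for a new start tag."""
    triggers = NESTABLE.get(name)
    is_nestable = triggers is not None
    is_reset = name in RESET_NESTING
    # Scan open tags from innermost to outermost, skipping the root.
    for open_name in reversed(stack[1:]):
        if open_name == name and not is_nestable:
            # Non-nestable tag: close its previous occurrence as well.
            return name, True
        if (triggers is not None and open_name in triggers) or (
            triggers is None and is_reset and open_name in RESET_NESTING
        ):
            # A reset trigger intervenes: pop up to it, but keep it open.
            return open_name, False
    return None, True
```
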
    def unknown_starttag(self, name, attrs, selfClosing=0):
        #print "Start tag %s: %s" % (name, attrs)
        if self.quoteStack:
            #This is not a real tag.
            #print "<%s> is not real!" % name
            attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
            self.handle_data('<%s%s>' % (name, attrs))
            return
        self.endData()

        if not self.isSelfClosingTag(name) and not selfClosing:
            self._smartPop(name)

        if self.parseOnlyThese and len(self.tagStack) <= 1 \
               and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
            return

        tag = Tag(self, name, attrs, self.currentTag, self.previous)
        if self.previous:
            self.previous.next = tag
        self.previous = tag
        self.pushTag(tag)
        if selfClosing or self.isSelfClosingTag(name):
            self.popTag()
        if name in self.QUOTE_TAGS:
            #print "Beginning quote (%s)" % name
            self.quoteStack.append(name)
            self.literal = 1
        return tag

    def unknown_endtag(self, name):
        #print "End tag %s" % name
        if self.quoteStack and self.quoteStack[-1] != name:
            #This is not a real end tag.
            #print "</%s> is not real!" % name
            self.handle_data('</%s>' % name)
            return
        self.endData()
        self._popToTag(name)
        if self.quoteStack and self.quoteStack[-1] == name:
            self.quoteStack.pop()
            self.literal = (len(self.quoteStack) > 0)

    def handle_data(self, data):
        self.currentData.append(data)

    def _toStringSubclass(self, text, subclass):
        """Adds a certain piece of text to the tree as a NavigableString
        subclass."""
        self.endData()
        self.handle_data(text)
        self.endData(subclass)

    def handle_pi(self, text):
        """Handle a processing instruction as a ProcessingInstruction
        object, possibly one with a %SOUP-ENCODING% slot into which an
        encoding will be plugged later."""
        if text[:3] == "xml":
            text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
        self._toStringSubclass(text, ProcessingInstruction)

    def handle_comment(self, text):
        "Handle comments as Comment objects."
        self._toStringSubclass(text, Comment)

    def handle_charref(self, ref):
        "Handle character references as data."
        if self.convertEntities:
            data = unichr(int(ref))
        else:
            data = '&#%s;' % ref
        self.handle_data(data)

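# A minimal standalone sketch (Python 3, not part of this file) of the
# character-reference conversion above. The name decode_charref and the
# hexadecimal branch are assumptions added for illustration: the method in
# the listing calls int(ref) directly, which only accepts decimal digits.

```python
def decode_charref(ref, convert=True):
    """Turn the body of a numeric character reference into text.

    `ref` is the part between '&#' and ';', for example '233' or 'xE9'.
    """
    if not convert:
        # Pass the reference through unchanged, as the listing does.
        return "&#%s;" % ref
    if ref[:1] in ("x", "X"):
        # Hexadecimal form; this branch is not in the listing.
        return chr(int(ref[1:], 16))
    return chr(int(ref))
```

# With these assumptions, decode_charref("233") and decode_charref("xE9")
# both yield the same single character.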
    def handle_entityref(self, ref):
        """Handle entity references as data, possibly converting known
        HTML and/or XML entity references to the corresponding Unicode
        characters."""
        data = None
        if self.convertHTMLEntities:
            try:
                data = unichr(name2codepoint[ref])
            except KeyError:
                pass

        if not data and self.convertXMLEntities:
            data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)

        if not data and self.convertHTMLEntities and \
            not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
            # TODO: We've got a problem here. We're told this is
            # an entity reference, but it's not an XML entity
            # reference or an HTML entity reference. Nonetheless,
            # the logical thing to do is to pass it through as an
            # unrecognized entity reference.
            #
            # Except: when the input is "&carol;" this function
            # will be called with input "carol". When the input is
            # "AT&T", this function will be called with input
            # "T". We have no way of knowing whether a semicolon
            # was present originally, so we don't know whether
            # this is an unknown entity or just a misplaced
            # ampersand.
            #
            # The more common case is a misplaced ampersand, so I
            # escape the ampersand and omit the trailing semicolon.
            data = "&amp;%s" % ref
        if not data:
            # This case is different from the one above, because we
            # haven't already gone through a supposedly comprehensive
            # mapping of entities to Unicode characters. We might not
            # have gone through any mapping at all. So the chances are
            # very high that this is a real entity, and not a
            # misplaced ampersand.
            data = "&%s;" % ref
        self.handle_data(data)

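# The fallback order in handle_entityref can be sketched as a standalone
# function (Python 3, not part of this file). The name resolve_entity, its
# keyword flags, and the contents of XML_ENTITIES are assumptions for
# illustration; the real table is defined elsewhere in the file.

```python
from html.entities import name2codepoint

# Assumed contents: the five predefined XML entities.
XML_ENTITIES = {"apos": "'", "quot": '"', "amp": "&", "lt": "<", "gt": ">"}


def resolve_entity(ref, convert_html=False, convert_xml=False):
    """Return the text that stands in for the entity reference `ref`."""
    data = None
    if convert_html:
        # 1. Known HTML entity: convert to the Unicode character.
        codepoint = name2codepoint.get(ref)
        if codepoint is not None:
            data = chr(codepoint)
    if not data and convert_xml:
        # 2. Known XML entity: convert to the special character.
        data = XML_ENTITIES.get(ref)
    if not data and convert_html and ref not in XML_ENTITIES:
        # 3. Unknown name after an HTML lookup: most likely a stray
        #    ampersand, so escape it and omit the semicolon.
        data = "&amp;%s" % ref
    if not data:
        # 4. No conversion applied: pass the reference through intact.
        data = "&%s;" % ref
    return data
```

# Under these assumptions, resolve_entity("T", convert_html=True) gives
# "&amp;T" (the "AT&T" case), while resolve_entity("carol") gives "&carol;".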
    def handle_decl(self, data):
        "Handle DOCTYPEs and the like as Declaration objects."
        self._toStringSubclass(data, Declaration)

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k+3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j

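# The CDATA branch of parse_declaration boils down to index arithmetic on
# the raw input. A minimal standalone sketch (Python 3, not part of this
# file); extract_cdata is a hypothetical helper name, and raising ValueError
# for a missing marker is an assumption made so the sketch stands alone.

```python
def extract_cdata(rawdata, i):
    """Return (content, next_index) for a CDATA section starting at `i`."""
    marker = "<![CDATA["
    if rawdata[i:i + len(marker)] != marker:
        raise ValueError("no CDATA section at index %d" % i)
    k = rawdata.find("]]>", i)
    if k == -1:
        # Unterminated section: treat the rest of the input as content.
        k = len(rawdata)
    # Skip the three characters of the closing ']]>' marker.
    return rawdata[i + len(marker):k], k + 3
```

# As in the listing, the returned index may point past the end of the input
# when the section is unterminated.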
class BeautifulSoup(BeautifulStoneSoup):

    """This parser knows the following facts about HTML:

    * Some tags have no closing tag and should be interpreted as being
      closed as soon as they are encountered.

    * The text inside some tags (e.g. 'script') may contain tags which
      are not really part of the document and which should be parsed
      as text, not tags. If you want to parse the text as tags, you can
      always fetch it and parse it explicitly.

    * Tag nesting rules:

      Most tags can't be nested at all. For instance, the occurrence of
      a <p> tag should implicitly close the previous <p> tag.

       <p>Para1<p>Para2
        should be transformed into:
       <p>Para1</p><p>Para2

      Some tags can be nested arbitrarily. For instance, the occurrence
      of a <blockquote> tag should _not_ implicitly close the previous
      <blockquote> tag.

       Alice said: <blockquote>Bob said: <blockquote>Blah
        should NOT be transformed into:
       Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

      Some tags can be nested, but the nesting is reset by the
      interposition of other tags. For instance, a <tr> tag should
      implicitly close the previous <tr> tag within the same <table>,
      but not close a <tr> tag in another table.

       <table><tr>Blah<tr>Blah
        should be transformed into:
       <table><tr>Blah</tr><tr>Blah
        but,
       <tr>Blah<table><tr>Blah
        should NOT be transformed into
       <tr>Blah<table></tr><tr>Blah

    Differing assumptions about tag nesting rules are a major source
    of problems with the BeautifulSoup class. If BeautifulSoup is not
    treating as nestable a tag your page author treats as nestable,
    try ICantBelieveItsBeautifulSoup, MinimalSoup, or
    BeautifulStoneSoup before writing your own subclass."""

    def __init__(self, *args, **kwargs):
        if not kwargs.has_key('smartQuotesTo'):
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)

    SELF_CLOSING_TAGS = buildTagMap(None,
                                    ('br' , 'hr', 'input', 'img', 'meta',
                                     'spacer', 'link', 'frame', 'base', 'col'))

    PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])

    QUOTE_TAGS = {'script' : None, 'textarea' : None}

    #According to the HTML standard, each of these inline tags can
    #contain another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                            'center')

    #According to the HTML standard, these block tags can contain
    #another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')

    #Lists can contain other lists, but there are restrictions.
    NESTABLE_LIST_TAGS = { 'ol' : [],
                           'ul' : [],
                           'li' : ['ul', 'ol'],
                           'dl' : [],
                           'dd' : ['dl'],
                           'dt' : ['dl'] }

    #Tables can contain other tables, but there are restrictions.
    NESTABLE_TABLE_TAGS = {'table' : [],
                           'tr' : ['table', 'tbody', 'tfoot', 'thead'],
                           'td' : ['tr'],
                           'th' : ['tr'],
                           'thead' : ['table'],
                           'tbody' : ['table'],
                           'tfoot' : ['table'],
                           }

    NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')

    #If one of these tags is encountered, all tags up to the next tag of
    #this type are popped.
    RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
                                     NON_NESTABLE_BLOCK_TAGS,
                                     NESTABLE_LIST_TAGS,
                                     NESTABLE_TABLE_TAGS)

    NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                                NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

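# buildTagMap is defined elsewhere in this file and is not shown in this
# part of the listing. Judging only from the calls above, it appears to
# merge tuples, single tag names and dictionaries into one dictionary. The
# sketch below (Python 3, hypothetical name build_tag_map) is a guess at
# that behaviour, not the file's actual implementation.

```python
def build_tag_map(default, *portions):
    """Merge tag collections into one dict keyed by tag name.

    Dictionaries keep their own values; tag names that arrive in a
    sequence or as a bare string are mapped to `default`.
    """
    built = {}
    for portion in portions:
        if isinstance(portion, dict):
            built.update(portion)
        elif isinstance(portion, (list, tuple, set)):
            for name in portion:
                built[name] = default
        else:
            # A single tag name such as 'noscript'.
            built[portion] = default
    return built
```

# Under that assumption, a tag such as 'tr' keeps its list of resetting
# parents while plain names such as 'p' map to the default value.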
    # Used to detect the charset in a META tag; see start_meta
    CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)

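# A short standalone illustration (Python 3, not part of this file) of what
# CHARSET_RE captures and of how a charset value can be swapped for the
# %SOUP-ENCODING% placeholder. The helper name mask_charset and the sample
# content string are assumptions for illustration.

```python
import re

# Same pattern as the listing, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def mask_charset(content):
    """Replace the declared charset with the %SOUP-ENCODING% placeholder."""
    return CHARSET_RE.sub(lambda m: m.group(1) + "%SOUP-ENCODING%", content)


content = "text/html; charset=ISO-8859-1"
match = CHARSET_RE.search(content)
prefix = match.group(1)    # everything up to and including 'charset='
declared = match.group(3)  # the declared encoding name
masked = mask_charset(content)
```

# Group 1 is kept so that the separator and the 'charset=' prefix survive
# the substitution; only group 3, the encoding name, is replaced.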
    def start_meta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
            match = self.CHARSET_RE.search(contentType)
            if match:
                if (self.declaredHTMLEncoding is not None or
                    self.originalEncoding == self.fromEncoding):
                    # An HTML encoding was sniffed while converting
                    # the document to Unicode, or an HTML encoding was
                    # sniffed during a previous pass through the
                    # document, or an encoding was specified
                    # explicitly and it worked. Rewrite the meta tag.
                    def rewrite(match):
                        return match.group(1) + "%SOUP-ENCODING%"
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                               newAttr)
                    tagNeedsEncodingSubstitution = True
                else:
                    # This is our first pass through the document.
parents:
diff changeset
1612 # Go through it again with the encoding information.
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1613 newCharset = match.group(3)
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1614 if newCharset and newCharset != self.originalEncoding:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1615 self.declaredHTMLEncoding = newCharset
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1616 self._feed(self.declaredHTMLEncoding)
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1617 raise StopParsing
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1618 pass
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1619 tag = self.unknown_starttag("meta", attrs)
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1620 if tag and tagNeedsEncodingSubstitution:
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1621 tag.containsSubstitutions = True
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
1622
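# Illustrative sketch, not part of Beautiful Soup: the META handling above
# relies on a CHARSET_RE pattern whose group(1) is the text up to and
# including "charset=" and whose group(3) is the charset name. The pattern
# below is an assumption chosen to match that usage (the real definition
# lives elsewhere in this file), and rewrite_charset / declared_charset
# are hypothetical helper names.

```python
import re

# Assumed pattern: group(1) = prefix ending in "charset=", group(3) = charset name.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def rewrite_charset(content, placeholder="%SOUP-ENCODING%"):
    """Replace the charset value in a META content string with a placeholder."""
    return CHARSET_RE.sub(lambda m: m.group(1) + placeholder, content)


def declared_charset(content):
    """Return the charset named in a META content string, or None if absent."""
    match = CHARSET_RE.search(content)
    return match.group(3) if match else None
```

# The placeholder form is what lets the tag be written out later in
# whatever encoding the document is finally serialized to.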
class StopParsing(Exception):
    pass

class ICantBelieveItsBeautifulSoup(BeautifulSoup):

    """The BeautifulSoup class is oriented towards skipping over
    common HTML errors like unclosed tags. However, sometimes it makes
    errors of its own. For instance, consider this fragment:

     <b>Foo<b>Bar</b></b>

    This is perfectly valid (if bizarre) HTML. However, the
    BeautifulSoup class will implicitly close the first b tag when it
    encounters the second 'b'. It will think the author wrote
    "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
    there's no real-world reason to bold something that's already
    bold. When it encounters '</b></b>' it will close two more 'b'
    tags, for a grand total of three tags closed instead of two. This
    can throw off the rest of your document structure. The same is
    true of a number of other tags, listed below.

    It's much more common for someone to forget to close a 'b' tag
    than to actually use nested 'b' tags, and the BeautifulSoup class
    handles the common case. This class handles the not-so-common
    case: where you can't believe someone wrote what they did, but
    it's valid HTML and BeautifulSoup screwed up by assuming it
    wouldn't be."""

    # Duplicate entries ('strong', 'big') dropped; buildTagMap builds a
    # dictionary, so the resulting map is unchanged.
    I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
     ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
      'cite', 'code', 'dfn', 'kbd', 'samp', 'var', 'b')

    I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',)

    NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)

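# Illustrative sketch, not part of Beautiful Soup: a deliberately
# simplified tag-stack model of the behaviour the docstring above
# describes. build_tree and its token format are hypothetical; the real
# parser also handles reset-nesting tags, attributes and more. A start
# tag whose name is not in `nestable` implicitly closes the most recent
# open tag of the same name.

```python
def build_tree(tokens, nestable=()):
    """Build a nested-list tree from a flat token list using a tag stack."""
    root = ["root"]
    stack = [root]

    def pop_to(name):
        # Close open tags up to and including the most recent `name`, if any.
        for i in range(len(stack) - 1, 0, -1):
            if stack[i][0] == name:
                del stack[i:]
                return

    for tok in tokens:
        if tok.startswith("</"):
            pop_to(tok[2:-1])
        elif tok.startswith("<"):
            name = tok[1:-1]
            if name not in nestable:
                pop_to(name)  # implicit close of the previous same-name tag
            node = [name]
            stack[-1].append(node)
            stack.append(node)
        else:
            stack[-1].append(tok)  # plain text goes into the innermost open tag
    return root


TOKENS = ["<b>", "Foo", "<b>", "Bar", "</b>", "</b>", "tail"]
flat = build_tree(TOKENS)                     # 'b' treated as non-nestable
nested = build_tree(TOKENS, nestable=("b",))  # 'b' treated as nestable
```

# In this model `flat` holds the two 'b' tags as siblings, while `nested`
# keeps the second 'b' inside the first, which is the distinction between
# the two parser classes.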
class MinimalSoup(BeautifulSoup):
    """The MinimalSoup class is for parsing HTML that contains
    pathologically bad markup. It makes no assumptions about tag
    nesting, but it does know which tags are self-closing, that
    <script> tags contain Javascript and should not be parsed, that
    META tags may contain encoding information, and so on.

    This also makes it better for subclassing than BeautifulStoneSoup
    or BeautifulSoup."""

    RESET_NESTING_TAGS = buildTagMap('noscript')
    NESTABLE_TAGS = {}

class BeautifulSOAP(BeautifulStoneSoup):
    """This class will push a tag with only a single string child into
    the tag's parent as an attribute. The attribute's name is the tag
    name, and the value is the string child. An example should give
    the flavor of the change:

    <foo><bar>baz</bar></foo>
     =>
    <foo bar="baz"><bar>baz</bar></foo>

    You can then access fooTag['bar'] instead of fooTag.barTag.string.

    This is, of course, useful for scraping structures that tend to
    use subelements instead of attributes, such as SOAP messages. Note
    that it modifies its input, so don't print the modified version
    out.

    I'm not sure how many people really want to use this class; let me
    know if you do. Mainly I like the name."""

    def popTag(self):
        if len(self.tagStack) > 1:
            tag = self.tagStack[-1]
            parent = self.tagStack[-2]
            parent._getAttrMap()
            # Promote only a tag whose sole child is a string, and never
            # overwrite an attribute the parent already has.
            if (isinstance(tag, Tag) and len(tag.contents) == 1 and
                isinstance(tag.contents[0], NavigableString) and
                tag.name not in parent.attrMap):
                parent[tag.name] = tag.contents[0]
        BeautifulStoneSoup.popTag(self)

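# Illustrative sketch, not part of Beautiful Soup: the same
# "promote a text-only child to a parent attribute" idea, shown with the
# standard library's xml.etree.ElementTree instead of the tag stack used
# by popTag above. promote_text_children is a hypothetical helper name,
# and it works on an already-parsed tree rather than during parsing.

```python
import xml.etree.ElementTree as ET


def promote_text_children(parent):
    """Copy each text-only child into an attribute on its parent, recursively."""
    for child in parent:
        promote_text_children(child)
        has_only_text = len(child) == 0 and child.text is not None
        # Like popTag, never overwrite an attribute the parent already has.
        if has_only_text and child.tag not in parent.attrib:
            parent.set(child.tag, child.text)
    return parent


root = promote_text_children(ET.fromstring("<foo><bar>baz</bar></foo>"))
kept = promote_text_children(ET.fromstring('<foo bar="x"><bar>baz</bar></foo>'))
```

# After the call, root.get("bar") gives the child's text directly while
# the <bar> element itself stays in the tree, as in the docstring example.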
#Enterprise class names! It has come to our attention that some people
#think the names of the Beautiful Soup parser classes are too silly
#and "unprofessional" for use in enterprise screen-scraping. We feel
#your pain! For such-minded folk, the Beautiful Soup Consortium And
#All-Night Kosher Bakery recommends renaming this file to
#"RobustParser.py" (or, in cases of extreme enterprisiness,
#"RobustParserBeanInterface.class") and using the following
#enterprise-friendly class aliases:
class RobustXMLParser(BeautifulStoneSoup):
    pass
class RobustHTMLParser(BeautifulSoup):
    pass
class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
    pass
class RobustInsanelyWackAssHTMLParser(MinimalSoup):
    pass
class SimplifyingSOAPParser(BeautifulSOAP):
    pass

######################################################
#
# Bonus library: Unicode, Dammit
#
# This class forces XML data into a standard format (usually to UTF-8
# or Unicode).  It is heavily based on code from Mark Pilgrim's
# Universal Feed Parser. It does not rewrite the XML or HTML to
# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
# (XML) and BeautifulSoup.start_meta (HTML).

# Autodetects character encodings.
# Download from http://chardet.feedparser.org/
try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except ImportError:
    chardet = None

# cjkcodecs and iconv_codec make Python know about more character encodings.
# Both are available from http://cjkpython.i18n.org/
# They're built in if you use Python 2.4.
try:
    import cjkcodecs.aliases
except ImportError:
    pass
try:
    import iconv_codec
except ImportError:
    pass

class UnicodeDammit:
    """A class for detecting the encoding of a *ML document and
    converting it to a Unicode string. If the source encoding is
    windows-1252, it can replace MS smart quotes with their HTML or
    XML equivalents."""

    # This dictionary maps commonly seen values for "charset" in HTML
    # meta tags to the corresponding Python codec names. It only covers
    # values that aren't in Python's aliases and can't be determined
    # by the heuristics in find_codec.
    CHARSET_ALIASES = { "macintosh" : "mac-roman",
                        "x-sjis" : "shift-jis" }

    def __init__(self, markup, overrideEncodings=[],
                 smartQuotesTo='xml', isHTML=False):
        self.declaredHTMLEncoding = None
        self.markup, documentEncoding, sniffedEncoding = \
                     self._detectEncoding(markup, isHTML)
        self.smartQuotesTo = smartQuotesTo
        self.triedEncodings = []
        if markup == '' or isinstance(markup, unicode):
            self.originalEncoding = None
            self.unicode = unicode(markup)
            return

        # Try the caller's encodings first, then the encoding declared
        # in the document, then the one sniffed from its first bytes.
        u = None
        for proposedEncoding in overrideEncodings:
            u = self._convertFrom(proposedEncoding)
            if u: break
        if not u:
            for proposedEncoding in (documentEncoding, sniffedEncoding):
                u = self._convertFrom(proposedEncoding)
                if u: break

        # If no luck and we have auto-detection library, try that:
        if not u and chardet and not isinstance(self.markup, unicode):
            u = self._convertFrom(chardet.detect(self.markup)['encoding'])

        # As a last resort, try utf-8 and windows-1252:
        if not u:
            for proposed_encoding in ("utf-8", "windows-1252"):
                u = self._convertFrom(proposed_encoding)
                if u: break

        self.unicode = u
        if not u: self.originalEncoding = None

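# Illustrative sketch, not part of Beautiful Soup: the constructor above
# tries candidate encodings in a fixed order and keeps the first one that
# works. The standalone function below shows that cascade in isolation.
# first_successful_decode and its parameters are hypothetical names, the
# chardet step is left out, and unlike the original it accepts any
# decode that does not raise (including an empty result).

```python
def first_successful_decode(data, override_encodings=(), declared=None,
                            sniffed=None):
    """Return (text, encoding) for the first candidate that decodes `data`."""
    tried = []
    candidates = list(override_encodings) + [declared, sniffed,
                                             "utf-8", "windows-1252"]
    for encoding in candidates:
        # Skip missing candidates and ones that were already attempted.
        if not encoding or encoding.lower() in tried:
            continue
        tried.append(encoding.lower())
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: fall through to the next
    return None, None


latin_bytes = u"caf\u00e9".encode("windows-1252")
utf8_bytes = u"caf\u00e9".encode("utf-8")
```

# Ordering matters here: windows-1252 accepts almost any byte string, so
# it only makes sense as the last resort after stricter encodings fail.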
    def _subMSChar(self, orig):
        """Changes a MS smart quote character to an XML or HTML
        entity."""
        sub = self.MS_CHARS.get(orig)
        if isinstance(sub, tuple):
            if self.smartQuotesTo == 'xml':
                sub = '&#x%s;' % sub[1]
            else:
                sub = '&%s;' % sub[0]
        return sub

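# Illustrative sketch, not part of Beautiful Soup: how a byte in the
# \x80-\x9f range can be turned into either a numeric XML reference or a
# named HTML entity, as _subMSChar does. MS_CHARS_SAMPLE is a small
# assumed subset for the four curly quotes only (the class's full
# MS_CHARS table is defined elsewhere), and sub_ms_chars is a
# hypothetical helper name.

```python
import re

# Assumed sample: byte -> (HTML entity name, hex code point).
MS_CHARS_SAMPLE = {
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}


def sub_ms_chars(data, smart_quotes_to="xml"):
    """Replace windows-1252 smart-quote bytes with XML or HTML entities."""
    def replace(match):
        name, codepoint = MS_CHARS_SAMPLE.get(match.group(1), (None, None))
        if name is None:
            return match.group(1)  # not in the sample table: leave as-is
        if smart_quotes_to == "xml":
            return ("&#x%s;" % codepoint).encode("ascii")
        return ("&%s;" % name).encode("ascii")

    return re.sub(br"([\x80-\x9f])", replace, data)
```

# Doing the substitution on the raw bytes, before decoding, is what lets
# the result survive a later decode as iso-8859-1 or similar.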
03c22b722882 remove BeautifulSoup dependency
Richard Burhans <burhans@bx.psu.edu>
parents:
diff changeset
    def _convertFrom(self, proposed):
        proposed = self.find_codec(proposed)
        if not proposed or proposed in self.triedEncodings:
            return None
        self.triedEncodings.append(proposed)
        markup = self.markup

        # Convert smart quotes to XML or HTML entities if coming from an
        # encoding that might have them.
        if self.smartQuotesTo and proposed.lower() in ("windows-1252",
                                                       "iso-8859-1",
                                                       "iso-8859-2"):
            markup = re.compile("([\x80-\x9f])").sub(
                lambda x: self._subMSChar(x.group(1)), markup)

        try:
            # print "Trying to convert document to %s" % proposed
            u = self._toUnicode(markup, proposed)
            self.markup = u
            self.originalEncoding = proposed
        except Exception as e:
            # print "That didn't work!"
            # print e
            return None
        #print "Correct encoding: %s" % proposed
        return self.markup

    def _toUnicode(self, data, encoding):
        '''Given a string and its encoding, decodes the string into Unicode.
        `encoding` is a codec name recognized by encodings.aliases.'''

        # Strip the Byte Order Mark (if present). The UTF-32LE BOM
        # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the
        # UTF-16 branches require the next two bytes to be non-zero.
        if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
               and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16be'
            data = data[2:]
        elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
                 and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16le'
            data = data[2:]
        elif data[:3] == '\xef\xbb\xbf':
            encoding = 'utf-8'
            data = data[3:]
        elif data[:4] == '\x00\x00\xfe\xff':
            encoding = 'utf-32be'
            data = data[4:]
        elif data[:4] == '\xff\xfe\x00\x00':
            encoding = 'utf-32le'
            data = data[4:]
        newdata = unicode(data, encoding)
        return newdata

    def _detectEncoding(self, xml_data, isHTML=False):
        """Given a document, tries to detect its XML encoding."""
        xml_encoding = sniffed_xml_encoding = None
        try:
            if xml_data[:4] == '\x4c\x6f\xa7\x94':
                # EBCDIC
                xml_data = self._ebcdic_to_ascii(xml_data)
            elif xml_data[:4] == '\x00\x3c\x00\x3f':
                # UTF-16BE
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
                     and (xml_data[2:4] != '\x00\x00'):
                # UTF-16BE with BOM
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x3f\x00':
                # UTF-16LE
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
                     (xml_data[2:4] != '\x00\x00'):
                # UTF-16LE with BOM
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\x00\x3c':
                # UTF-32BE
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x00\x00':
                # UTF-32LE
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\xfe\xff':
                # UTF-32BE with BOM
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\xff\xfe\x00\x00':
                # UTF-32LE with BOM
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
            elif xml_data[:3] == '\xef\xbb\xbf':
                # UTF-8 with BOM
                sniffed_xml_encoding = 'utf-8'
                xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
            else:
                sniffed_xml_encoding = 'ascii'
        except Exception:
            # Transcoding failed; fall through and inspect xml_data as-is.
            pass
        # Look for an encoding declared in the XML prolog.
        xml_encoding_match = re.compile(
            r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
        if not xml_encoding_match and isHTML:
            # Fall back to a charset declared in an HTML <meta> tag.
            regexp = re.compile(r'<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
            xml_encoding_match = regexp.search(xml_data)
        if xml_encoding_match is not None:
            xml_encoding = xml_encoding_match.groups()[0].lower()
            if isHTML:
                self.declaredHTMLEncoding = xml_encoding
            # A generic multi-byte declaration says nothing about byte
            # order, so prefer the encoding sniffed from the first bytes.
            if sniffed_xml_encoding and \
               (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
                                 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
                                 'utf-16', 'utf-32', 'utf_16', 'utf_32',
                                 'utf16', 'u16')):
                xml_encoding = sniffed_xml_encoding
        return xml_data, xml_encoding, sniffed_xml_encoding


    def find_codec(self, charset):
        return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
               or (charset and self._codec(charset.replace("-", ""))) \
               or (charset and self._codec(charset.replace("-", "_"))) \
               or charset

    def _codec(self, charset):
        if not charset:
            return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec

    EBCDIC_TO_ASCII_MAP = None
    def _ebcdic_to_ascii(self, s):
        c = self.__class__
        if not c.EBCDIC_TO_ASCII_MAP:
            emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
                    16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
                    128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
                    144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
                    32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
                    38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
                    45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
                    186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
                    195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
                    201,202,106,107,108,109,110,111,112,113,114,203,204,205,
                    206,207,208,209,126,115,116,117,118,119,120,121,122,210,
                    211,212,213,214,215,216,217,218,219,220,221,222,223,224,
                    225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
                    73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
                    82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
                    90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
                    250,251,252,253,254,255)
            import string
            c.EBCDIC_TO_ASCII_MAP = string.maketrans(
                ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
        return s.translate(c.EBCDIC_TO_ASCII_MAP)

    MS_CHARS = { '\x80' : ('euro', '20AC'),
                 '\x81' : ' ',
                 '\x82' : ('sbquo', '201A'),
                 '\x83' : ('fnof', '192'),
                 '\x84' : ('bdquo', '201E'),
                 '\x85' : ('hellip', '2026'),
                 '\x86' : ('dagger', '2020'),
                 '\x87' : ('Dagger', '2021'),
                 '\x88' : ('circ', '2C6'),
                 '\x89' : ('permil', '2030'),
                 '\x8A' : ('Scaron', '160'),
                 '\x8B' : ('lsaquo', '2039'),
                 '\x8C' : ('OElig', '152'),
                 '\x8D' : '?',
                 '\x8E' : ('#x17D', '17D'),
                 '\x8F' : '?',
                 '\x90' : '?',
                 '\x91' : ('lsquo', '2018'),
                 '\x92' : ('rsquo', '2019'),
                 '\x93' : ('ldquo', '201C'),
                 '\x94' : ('rdquo', '201D'),
                 '\x95' : ('bull', '2022'),
                 '\x96' : ('ndash', '2013'),
                 '\x97' : ('mdash', '2014'),
                 '\x98' : ('tilde', '2DC'),
                 '\x99' : ('trade', '2122'),
                 '\x9a' : ('scaron', '161'),
                 '\x9b' : ('rsaquo', '203A'),
                 '\x9c' : ('oelig', '153'),
                 '\x9d' : '?',
                 '\x9e' : ('#x17E', '17E'),
                 '\x9f' : ('Yuml', '178'),}

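# A minimal Python 3 sketch of the substitution that _subMSChar and MS_CHARS
# perform above. The names sub_ms_chars and MS_CHARS_SAMPLE are hypothetical,
# the table is trimmed to a few entries, and (unlike the original) bytes that
# are missing from the table are left untouched rather than looked up as None.

```python
import re

# Hypothetical excerpt: Windows-1252 byte -> (HTML entity name, hex code point),
# or a literal replacement for bytes that Windows-1252 leaves undefined.
MS_CHARS_SAMPLE = {
    b"\x81": b" ",
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}

# Bytes 0x80-0x9f are where Windows-1252 keeps its "smart" punctuation.
_C1_RANGE = re.compile(rb"([\x80-\x9f])")


def sub_ms_chars(markup, smart_quotes_to="html"):
    """Replace Windows-1252 bytes in 0x80-0x9f with XML or HTML entities."""

    def _sub(match):
        orig = match.group(1)
        sub = MS_CHARS_SAMPLE.get(orig, orig)
        if isinstance(sub, tuple):
            name, hex_code = sub
            if smart_quotes_to == "xml":
                # Numeric character reference, valid in any XML document.
                return ("&#x%s;" % hex_code).encode("ascii")
            # Named entity, only meaningful to HTML consumers.
            return ("&%s;" % name).encode("ascii")
        return sub

    return _C1_RANGE.sub(_sub, markup)
```

# Usage: sub_ms_chars(b"\x93hi\x94") yields named HTML entities, while
# passing "xml" as the second argument yields numeric references instead.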
#######################################################################


# By default, act as an HTML pretty-printer.
if __name__ == '__main__':
    import sys
    soup = BeautifulSoup(sys.stdin)
    print(soup.prettify())
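
# A rough Python 3 sketch of the BOM handling in _toUnicode above, using the
# BOM constants from the standard codecs module. The name decode_with_bom is
# hypothetical. Instead of the original's "next two bytes are non-zero" guard,
# this variant checks the longer UTF-32 marks first, which is a different way
# of keeping FF FE 00 00 from being read as a UTF-16LE mark.

```python
import codecs

# UTF-32 marks come first because the UTF-32LE BOM starts with the
# same two bytes as the UTF-16LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def decode_with_bom(data, fallback_encoding):
    """Decode bytes, letting a leading BOM override fallback_encoding.

    Returns a (text, encoding_used) pair.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Drop the mark itself so it does not end up in the text.
            return data[len(bom):].decode(encoding), encoding
    return data.decode(fallback_encoding), fallback_encoding
```

# Usage: decode_with_bom(raw_bytes, "windows-1252") returns the decoded text
# together with the encoding that was actually applied.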