
Advanced Topics

That does it for the basic usage of Beautiful Soup. But HTML and XML are tricky, and in the real world they're even trickier. So Beautiful Soup keeps some extra tricks of its own up its sleeve.

Generators

The search methods described above are driven by generator methods. You can use these methods yourself: they're called nextGenerator, previousGenerator, nextSiblingGenerator, previousSiblingGenerator, and parentGenerator. Tag and parser objects also have childGenerator and recursiveChildGenerator available.
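
To make "search methods driven by generators" more concrete, here is a minimal sketch in plain Python. The Node class and the parent_generator and find_parents helpers are hypothetical stand-ins invented for illustration, not Beautiful Soup's own code; they only show how a search can be written as a filter over a generator.

```python
class Node(object):
    """Hypothetical stand-in for a parse-tree element (not a Beautiful Soup class)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def parent_generator(node):
    """Yield each ancestor of `node`, nearest first."""
    current = node.parent
    while current is not None:
        yield current
        current = current.parent


def find_parents(node, name):
    """A search expressed as a filter over the generator."""
    return [p for p in parent_generator(node) if p.name == name]


# Build a small chain: html > body > div > div > #text
html = Node("html")
body = Node("body", parent=html)
div = Node("div", parent=body)
inner = Node("div", parent=inner_parent) if False else Node("div", parent=div)
text = Node("#text", parent=inner)

ancestors = [p.name for p in parent_generator(text)]
# ancestors == ['div', 'div', 'body', 'html']

matches = find_parents(text, "div")
# matches == [inner, div]
```

Because the generator is lazy, a caller that needs only the first match can stop iterating early instead of building the full list.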

Here's a simple example that strips HTML tags out of a document by iterating over the document and collecting all the strings.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""<div>You <i>bet</i>
<a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>
rocks!</div>""")

# Keep only the text nodes (unicode strings); Tag objects are skipped.
''.join([e for e in soup.recursiveChildGenerator()
         if isinstance(e,unicode)])
# u'You bet\nBeautifulSoup\nrocks!'

Here's a more complex example that uses recursiveChildGenerator to iterate over the elements of a document, printing each one as it gets it.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("1<a>2<b>3")
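g = soup.recursiveChildGenerator()
# Step through the generator by hand; next() raises StopIteration
# once every element has been visited.
while True:
    try:
        print g.next()
    except StopIteration:
        break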
# 1
# <a>2<b>3</b></a>
# 2
# <b>3</b>
# 3
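
The order of that output, each tag first and then its contents, is a depth-first, pre-order walk. A minimal sketch of such a generator follows. It assumes a hypothetical Tag class and a recursive_children function written for this illustration; it is not how Beautiful Soup implements recursiveChildGenerator.

```python
class Tag(object):
    """Hypothetical minimal tag: a name plus a list of children (tags or strings)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def __repr__(self):
        inner = "".join(c if isinstance(c, str) else repr(c)
                        for c in self.children)
        return "<%s>%s</%s>" % (self.name, inner, self.name)


def recursive_children(tag):
    """Yield every descendant of `tag` in document order (pre-order)."""
    for child in tag.children:
        yield child
        if isinstance(child, Tag):
            for descendant in recursive_children(child):
                yield descendant


# Mirrors the parsed form of "1<a>2<b>3":
# the root holds "1" and <a>; <a> holds "2" and <b>; <b> holds "3".
b = Tag("b", ["3"])
a = Tag("a", ["2", b])
root = Tag("[document]", ["1", a])

walked = [str(e) for e in recursive_children(root)]
# walked == ['1', '<a>2<b>3</b></a>', '2', '<b>3</b>', '3']

# The tag-stripping idea from the first example: keep only the strings.
text_only = "".join(e for e in recursive_children(root) if isinstance(e, str))
# text_only == '123'
```

Yielding the child before descending into it is what makes each tag appear ahead of its own contents in the walk.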