That does it for the basic usage of Beautiful Soup. But HTML and
XML are tricky, and in the real world they're even trickier. So
Beautiful Soup keeps some extra tricks of its own up its sleeve.
The search methods described above are driven by generator
methods. You can use these methods yourself: they're called
nextGenerator, previousGenerator, nextSiblingGenerator,
previousSiblingGenerator, and parentGenerator. Tag and parser
objects also have childGenerator and recursiveChildGenerator
available.
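To make "driven by generator methods" concrete, here is a minimal sketch of the idea using a toy tree rather than Beautiful Soup itself. Node, recursive_child_generator, find_all, and find are hypothetical names chosen for illustration; this is not how the library is actually written, only one way a search can be layered on top of a generator.

```python
# Toy tree, NOT Beautiful Soup's classes: every name here is hypothetical.
class Node(object):
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def recursive_child_generator(self):
        """Yield every descendant in document order (depth-first, pre-order)."""
        for child in self.children:
            yield child
            for descendant in child.recursive_child_generator():
                yield descendant


def find_all(root, name):
    """A search is a filter applied to whatever the generator yields."""
    return [n for n in root.recursive_child_generator() if n.name == name]


def find(root, name):
    """Return the first match; the generator is not advanced any further."""
    for n in root.recursive_child_generator():
        if n.name == name:
            return n
    return None


tree = Node("div", [Node("i"), Node("a", [Node("b")]), Node("b")])
order = [n.name for n in tree.recursive_child_generator()]
matches = find_all(tree, "b")
```

Because the generator produces nodes lazily, the first-match search can return early without walking the rest of the tree, while the collect-everything search simply drains it.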
Here's a simple example that strips HTML tags out of a document by
iterating over the document and collecting all the strings.

    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup("""<div>You <i>bet</i>
    <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>
    rocks!</div>""")

    # Keep only the text nodes; Tag objects are skipped.
    ''.join([e for e in soup.recursiveChildGenerator()
             if isinstance(e,unicode)])
    # u'You bet\nBeautifulSoup\nrocks!'
Here's a more complex example that uses recursiveChildGenerator
to iterate over the elements of a document, printing each one as it
gets it.

    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup("1<a>2<b>3")
    g = soup.recursiveChildGenerator()
    while True:
        try:
            print g.next()
        except StopIteration:
            break
    # 1
    # <a>2<b>3</b></a>
    # 2
    # <b>3</b>
    # 3
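The explicit next()/StopIteration loop above spells out what a Python for loop does on your behalf. The sketch below shows the two forms side by side on a stand-in generator; countdown is a hypothetical helper, not part of Beautiful Soup, and the same equivalence should hold for any generator object.

```python
def countdown(n):
    """A stand-in generator; any generator behaves the same way."""
    while n > 0:
        yield n
        n -= 1


# Explicit form: pull items until the generator signals exhaustion.
explicit = []
g = countdown(3)
while True:
    try:
        explicit.append(next(g))
    except StopIteration:
        break

# Implicit form: the for loop handles next() and StopIteration itself.
implicit = []
for item in countdown(3):
    implicit.append(item)
```

The explicit form is mainly useful when you want to interleave other work between items or stop partway and resume later; otherwise the for loop is the more idiomatic choice.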