
Improving Performance by Parsing Only Part of the Document

Beautiful Soup turns every element of a document into a Python object and links it to many other Python objects. If you need only a subset of the document, building the whole tree is slow and wasteful. Instead, you can pass a SoupStrainer as the parseOnlyThese argument to the soup constructor. Beautiful Soup checks each element against the SoupStrainer, and only a matching element is turned into a Tag or NavigableString and added to the tree.

If an element is added to the tree, then so are its children, even if they wouldn't have matched the SoupStrainer on their own. This lets you parse only the chunks of a document that contain the data you want.

Here's a pretty varied document:

doc = '''Bob reports <a href="http://www.bob.com/">success</a>
with his plasma breeding <a
href="http://www.bob.com/plasma">experiments</a>. <i>Don't get any on
us, Bob!</i>

<br><br>Ever hear of annular fusion? The folks at <a
href="http://www.boogabooga.net/">BoogaBooga</a> sure seem obsessed
with it. Secret project, or <b>WEB MADNESS?</b> You decide!'''

Here are several ways of parsing the document into soup, depending on which parts you want. Each is faster and uses less memory than parsing the whole document and then using the same SoupStrainer to pick out the parts you want.

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

links = SoupStrainer('a')
[tag for tag in BeautifulSoup(doc, parseOnlyThese=links)]
# [<a href="http://www.bob.com/">success</a>, 
#  <a href="http://www.bob.com/plasma">experiments</a>, 
#  <a href="http://www.boogabooga.net/">BoogaBooga</a>]

linksToBob = SoupStrainer('a', href=re.compile(r'bob\.com/'))
[tag for tag in BeautifulSoup(doc, parseOnlyThese=linksToBob)]
# [<a href="http://www.bob.com/">success</a>, 
#  <a href="http://www.bob.com/plasma">experiments</a>]

mentionsOfBob = SoupStrainer(text=re.compile("Bob"))
[text for text in BeautifulSoup(doc, parseOnlyThese=mentionsOfBob)]
# [u'Bob reports ', u"Don't get any on\nus, Bob!"]

allCaps = SoupStrainer(text=lambda t: t.upper() == t)
[text for text in BeautifulSoup(doc, parseOnlyThese=allCaps)]
# [u'. ', u'\n', u'WEB MADNESS?']
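
The all-caps result includes '. ' and '\n' because a string with no letters is unchanged by upper(). The sketch below, plain Python 3 with no Beautiful Soup involved, reproduces that behavior and shows one possible stricter test; has_uppercase_letters is a hypothetical helper that is not part of the original example.

```python
def all_caps(text):
    # The test used in the example: true whenever upper-casing changes nothing,
    # which includes strings that contain no letters at all.
    return text.upper() == text


def has_uppercase_letters(text):
    # Stricter alternative (an assumption, not from the original example):
    # str.isupper() requires at least one cased character.
    return text.isupper()


samples = ['. ', '\n', 'WEB MADNESS?', 'Bob reports ']
print([s for s in samples if all_caps(s)])               # ['. ', '\n', 'WEB MADNESS?']
print([s for s in samples if has_uppercase_letters(s)])  # ['WEB MADNESS?']
```

Either function could be passed as the text argument, depending on whether punctuation-only strings should count as matches.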

There is one major difference between the SoupStrainer you pass into a search method and the one you pass into a soup constructor. Recall that the name argument of a search method can be a function that takes a Tag object. That does not work for a SoupStrainer passed to the constructor, because the SoupStrainer is what decides whether a Tag object should be created in the first place. You can still pass in a function as the SoupStrainer's name, but it can't take a Tag object: it receives only the tag name and a map of its attributes.

shortWithNoAttrs = SoupStrainer(lambda name, attrs:
                                len(name) == 1 and not attrs)
[tag for tag in BeautifulSoup(doc, parseOnlyThese=shortWithNoAttrs)]
# [<i>Don't get any on us, Bob!</i>, 
#  <b>WEB MADNESS?</b>]
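
To see why such a function can only receive a name and attributes, it may help to look at what a streaming parser knows when it meets a start tag. This is a minimal sketch using Python 3's standard html.parser rather than Beautiful Soup; StartTagFilter and the sample markup are hypothetical, and converting the attributes to a dict is a choice made for this illustration.

```python
from html.parser import HTMLParser


def short_with_no_attrs(name, attrs):
    """Same test as the lambda above: one-letter tag name and no attributes."""
    return len(name) == 1 and not attrs


class StartTagFilter(HTMLParser):
    """Hypothetical helper: records the names of start tags a predicate accepts."""

    def __init__(self, predicate):
        super().__init__()
        self.predicate = predicate
        self.accepted = []

    def handle_starttag(self, tag, attrs):
        # At this point only the tag name and its attributes are known;
        # the element's contents have not been read yet, so there is no
        # finished element object to hand to the predicate.
        if self.predicate(tag, dict(attrs)):
            self.accepted.append(tag)


markup = ('<a href="http://www.bob.com/">success</a> <i>Careful!</i>'
          '<br><b>WEB MADNESS?</b>')
tag_filter = StartTagFilter(short_with_no_attrs)
tag_filter.feed(markup)
tag_filter.close()
print(tag_filter.accepted)  # ['i', 'b']
```

The <a> tag is rejected because it has an href attribute, and <br> is rejected because its name is two characters long.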