Beautiful Soup turns every element of a document into a Python
object and links it to the objects around it. If you only need a
subset of the document, building the whole tree wastes time and
memory. Instead, you can pass a SoupStrainer as the parseOnlyThese
argument to the soup constructor. Beautiful Soup checks each element
against the SoupStrainer, and only if it matches is the element
turned into a Tag or NavigableString and added to the tree.
If an element is added to the tree, then so are its children, even
if they wouldn't have matched the SoupStrainer on their own. This
lets you parse only the chunks of a document that contain the data
you want.
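
The filtering rule described above can be sketched without Beautiful
Soup at all. What follows is a minimal illustration, not Beautiful
Soup's implementation: it uses Python 3's standard html.parser module,
the names StrainingParser and strain are made up for this example, and
it assumes well-nested markup with no void elements such as <br>. The
parser discards everything until it meets a start tag with the wanted
name, then keeps that element and everything nested inside it, whether
or not the nested tags would have matched on their own.

```python
from html.parser import HTMLParser


class StrainingParser(HTMLParser):
    """Keep elements named `wanted`, plus everything nested inside them.

    Hypothetical helper for illustration; assumes well-nested markup
    without void elements such as <br>.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0    # greater than 0 while inside a kept element
        self.kept = []    # one list of markup pieces per kept element

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            if tag != self.wanted:
                return    # outside any match: nothing is stored
            self.kept.append([])
        # Either the matching tag itself or one of its descendants.
        self.depth += 1
        self.kept[-1].append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.kept[-1].append("</%s>" % tag)
        self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.kept[-1].append(data)


def strain(html, wanted):
    """Return the markup of each kept element as a string."""
    parser = StrainingParser(wanted)
    parser.feed(html)
    parser.close()
    return ["".join(pieces) for pieces in parser.kept]


sample = "<p>Keep <b>this</b> too.</p><b>Skip this.</b>"
print(strain(sample, "p"))
# ['<p>Keep <b>this</b> too.</p>']
```

The <b> nested inside the <p> survives because its parent matched,
while the standalone <b> is never stored.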
Here's a fairly varied document:

doc = '''Bob reports <a href="http://www.bob.com/">success</a> with his plasma breeding <a href="http://www.bob.com/plasma">experiments</a>. <i>Don't get any on
us, Bob!</i>
<br><br>Ever hear of annular fusion? The folks at <a href="http://www.boogabooga.net/">BoogaBooga</a> sure seem obsessed with it. Secret project, or <b>WEB MADNESS?</b> You decide!'''

Here are several different ways of parsing the document into soup,
depending on which parts you want. All of these are faster and use
less memory than parsing the whole document and then using the same
SoupStrainer to pick out the parts you want.

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

# Keep only <a> tags.
links = SoupStrainer('a')
[tag for tag in BeautifulSoup(doc, parseOnlyThese=links)]
# [<a href="http://www.bob.com/">success</a>,
#  <a href="http://www.bob.com/plasma">experiments</a>,
#  <a href="http://www.boogabooga.net/">BoogaBooga</a>]

# Keep only <a> tags whose href contains "bob.com/".
linksToBob = SoupStrainer('a', href=re.compile(r'bob\.com/'))
[tag for tag in BeautifulSoup(doc, parseOnlyThese=linksToBob)]
# [<a href="http://www.bob.com/">success</a>,
#  <a href="http://www.bob.com/plasma">experiments</a>]

# Keep only strings that mention "Bob".
mentionsOfBob = SoupStrainer(text=re.compile("Bob"))
[text for text in BeautifulSoup(doc, parseOnlyThese=mentionsOfBob)]
# [u'Bob reports ', u"Don't get any on\nus, Bob!"]

# Keep only strings that contain no lowercase letters.
allCaps = SoupStrainer(text=lambda t: t.upper() == t)
[text for text in BeautifulSoup(doc, parseOnlyThese=allCaps)]
# [u'. ', u'\n', u'WEB MADNESS?']

There is one major difference between the SoupStrainer you pass
into a search method and the one you pass into a soup constructor.
Recall that the name argument of a search method can take a function
whose argument is a Tag object. You can't do this for the name of a
SoupStrainer used during parsing, because the SoupStrainer is what
decides whether a Tag object should be created in the first place.
You can still pass in a function for a SoupStrainer's name, but it
can't take a Tag object: it can only take the tag name and a map of
the tag's attributes.

# The function receives the tag name and its attributes, not a Tag.
shortWithNoAttrs = SoupStrainer(lambda name, attrs:
                                len(name) == 1 and not attrs)
[tag for tag in BeautifulSoup(doc, parseOnlyThese=shortWithNoAttrs)]
# [<i>Don't get any on
#  us, Bob!</i>,
#  <b>WEB MADNESS?</b>]
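
One way to see why the function gets only a name and a map of
attributes: when a parser meets a start tag, that is all the
information it has, since the element's contents have not been read
yet. The sketch below illustrates this with Python 3's standard
html.parser module rather than Beautiful Soup. StartTagFilter and
matching_tag_names are hypothetical names, and the sample string is a
shortened stand-in for the document above.

```python
from html.parser import HTMLParser


class StartTagFilter(HTMLParser):
    """Record the names of start tags accepted by predicate(name, attrs).

    Hypothetical helper for illustration, not part of Beautiful Soup.
    """

    def __init__(self, predicate):
        super().__init__()
        self.predicate = predicate
        self.matched = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs. No element
        # object exists yet, so the decision rests on these two values.
        if self.predicate(tag, dict(attrs)):
            self.matched.append(tag)


def matching_tag_names(html, predicate):
    """Return the names of the start tags that the predicate accepts."""
    parser = StartTagFilter(predicate)
    parser.feed(html)
    parser.close()
    return parser.matched


sample = ('<a href="http://www.bob.com/">success</a> '
          "<i>Don't get any on us, Bob!</i> <br> <b>WEB MADNESS?</b>")


def short_with_no_attrs(name, attrs):
    # Same test as the SoupStrainer example: one-letter name, no attributes.
    return len(name) == 1 and not attrs


print(matching_tag_names(sample, short_with_no_attrs))
# ['i', 'b']
```

The <a> tag is rejected because it has an href attribute, and <br> is
rejected because its name is two characters long.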