Sanitizing Bad Data with Regular Expressions

Beautiful Soup copes well with bad markup when "bad markup" means tags in the wrong places. Sometimes, though, the markup is malformed in a way the underlying parser can't handle. For that reason, Beautiful Soup runs regular expressions against an input document before trying to parse it.

By default, Beautiful Soup uses regular expressions and replacement functions to do search-and-replace on input documents. It finds self-closing tags that look like <BR/> and changes them to look like <BR /> (note the added space before the slash). It finds declarations that have extraneous whitespace, like <! --Comment-->, and removes the whitespace: <!--Comment-->.
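The two default fixes described above can be approximated with the standard `re` module alone. This is only a sketch: the patterns are written for illustration and are assumed to resemble, not reproduce, the library's built-in ones, and the names `fix_self_closing`, `fix_declaration` and `default_fixes` are hypothetical.

```python
import re

# Assumed approximation of the self-closing fix: insert a space before "/>".
fix_self_closing = re.compile(r'(<[^<>]*)/>')
# Assumed approximation of the declaration fix: drop whitespace after "<!".
fix_declaration = re.compile(r'<!\s+([^<>]*)>')

def default_fixes(markup):
    """Apply both illustrative fixes to a markup string and return the result."""
    markup = fix_self_closing.sub(lambda m: m.group(1) + ' />', markup)
    markup = fix_declaration.sub(lambda m: '<!' + m.group(1) + '>', markup)
    return markup

print(default_fixes('Bar<BR/>Baz'))      # Bar<BR />Baz
print(default_fixes('<! --Comment-->'))  # <!--Comment-->
```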

If you have bad markup that needs fixing in some other way, you can pass your own list of (regular expression, replacement function) tuples into the soup constructor as the markupMassage argument.
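To make the shape of that list concrete, here is a minimal sketch of how a list of (regular expression, replacement function) tuples could be applied to a string before parsing. It uses only the standard `re` module. `apply_massage` is a hypothetical helper, not part of Beautiful Soup, and the rule in `my_massage` is invented for illustration.

```python
import re

def apply_massage(markup, massage):
    """Run each (pattern, function) pair over the markup, in list order."""
    for pattern, replacement in massage:
        # The replacement function receives a match object and returns a string.
        markup = pattern.sub(replacement, markup)
    return markup

# Hypothetical rule: collapse a run of '<' characters into a single '<'.
my_massage = [(re.compile(r'<<+'), lambda match: '<')]

print(apply_massage('<<b>bold</b>', my_massage))  # <b>bold</b>
```

Because the pairs run in list order, a later rule sees the output of the earlier ones.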

Let's take an example: a page that has a malformed comment. The underlying SGML parser can't cope with this, and ignores the comment and everything after it:

from BeautifulSoup import BeautifulSoup
badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"
BeautifulSoup(badString)
# Foo

Let's fix it up with a regular expression and a function:

import re
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
BeautifulSoup(badString, markupMassage=myMassage)
# Foo<!--This comment is malformed.-->Bar

Oops, we're still missing the <BR> tag. Our markupMassage overrides the parser's default massage, so the default search-and-replace functions don't get run. The parser makes it past the comment, but it dies at the malformed self-closing tag. Let's add our new massage function to a copy of the default list, so that all the functions run.

import copy
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz

Now we've got it all.
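The difference between overriding and extending the default list can be reproduced without the parser. The sketch below uses only `re` and `copy`. `DEFAULT_MASSAGE` is a one-rule stand-in assumed for illustration, not the library's actual list, and `run` is a hypothetical helper.

```python
import copy
import re

# Stand-in for the default list (assumption: a single rule that spaces out "/>").
DEFAULT_MASSAGE = [(re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />')]
# The comment fix from the example above.
my_massage = [(re.compile('<!-([^-])'), lambda m: '<!--' + m.group(1))]

def run(markup, massage):
    """Apply every (pattern, function) pair in order."""
    for pattern, replacement in massage:
        markup = pattern.sub(replacement, markup)
    return markup

bad = "Foo<!-This comment is malformed.-->Bar<br/>Baz"

# Custom list alone: the comment is repaired, but <br/> is left as it was.
only_mine = run(bad, my_massage)

# Copy the defaults first so the shared list is not modified, then extend.
combined = copy.copy(DEFAULT_MASSAGE)
combined.extend(my_massage)
both = run(bad, combined)

print(only_mine)  # Foo<!--This comment is malformed.-->Bar<br/>Baz
print(both)       # Foo<!--This comment is malformed.-->Bar<br />Baz
```

Copying before extending matters: calling `extend` directly on the shared default list would change it for every later caller.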

If you know for a fact that your markup doesn't need any regular expressions run on it, you can get a faster startup time by passing in False for markupMassage.