When the built-in parser classes won't do the job, you need to
customize. This usually means customizing the lists of nestable and
self-closing tags. You can customize the list of self-closing tags by
passing a selfClosingTags argument into the soup constructor. To
customize the lists of nestable tags, though, you'll have to subclass.
The most useful classes to subclass are MinimalSoup (for HTML) and
BeautifulStoneSoup (for XML). I'm going to show you how to override
RESET_NESTING_TAGS and NESTABLE_TAGS in a subclass. This is the most
complicated part of Beautiful Soup and I'm not going to explain it
very well here, but I'll get something written and then I can improve
it with feedback.
When Beautiful Soup is parsing a document, it keeps a stack of open
tags. Whenever it sees a new start tag, it tosses that tag on top of
the stack. But before it does, it might close some of the open tags
and remove them from the stack. Which tags it closes depends on the
qualities of the tag it just found, and the qualities of the tags in the
stack.
The best way to explain it is through example. Let's say the stack
looks like ['html', 'p', 'b'], and Beautiful Soup encounters a <P>
tag. If it just tossed another 'p' onto the stack, this would imply
that the second <P> tag is within the first <P> tag, not to mention
the open <B> tag. But that's not the way <P> tags work. You can't
stick a <P> tag inside another <P> tag. A <P> tag isn't "nestable" at
all.
So when Beautiful Soup encounters a <P> tag, it closes and pops all
the tags up to and including the previously encountered tag of the
same type. This is the default behavior, and this is how
BeautifulStoneSoup treats every tag. It's what you get when a tag is
not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also
what you get when a tag shows up in RESET_NESTING_TAGS but has no
entry in NESTABLE_TAGS, the way the <P> tag does.
    from BeautifulSoup import BeautifulSoup
    BeautifulSoup.RESET_NESTING_TAGS['p'] == None
    # True
    BeautifulSoup.NESTABLE_TAGS.has_key('p')
    # False

    print BeautifulSoup("<html><p>Para<b>one<p>Para two")
    # <html><p>Para<b>one</b></p><p>Para two</p></html>
    #                    ^---^--The second <p> tag made those two tags get closed
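The default rule can be sketched without Beautiful Soup at all. What
follows is a toy model written from the description above, not the
library's actual code. The function name open_tag_default and the use
of a plain list of tag names as the stack are assumptions made for
illustration.

```python
def open_tag_default(stack, name):
    """Toy model of the default rule: close every open tag up to and
    including the most recent tag called `name`, then push `name`.

    `stack` is a plain list of open tag names, outermost first.
    A new list is returned; the input list is left untouched.
    """
    # Search from the top of the stack (the end of the list) downwards.
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == name:
            # Drop the earlier tag and everything opened after it.
            return stack[:i] + [name]
    # No earlier tag of this type is open, so nothing gets closed.
    return stack + [name]


new_stack = open_tag_default(['html', 'p', 'b'], 'p')
print(new_stack)
# ['html', 'p']
```

The <B> and the first <P> are gone from the stack, which corresponds
to the two closing tags marked in the output above.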
Let's say the stack looks like ['html', 'span', 'b'], and Beautiful
Soup encounters a <SPAN> tag. Now, <SPAN> tags can contain other
<SPAN> tags without limit, so there's no need to pop up to the
previous <SPAN> tag when you encounter one. This is represented by
mapping the tag name to an empty list in NESTABLE_TAGS. This kind of
tag should not be mentioned in RESET_NESTING_TAGS: there are no
circumstances when encountering a <SPAN> tag would cause any tags to
be popped.
    from BeautifulSoup import BeautifulSoup
    BeautifulSoup.NESTABLE_TAGS['span']
    # []
    BeautifulSoup.RESET_NESTING_TAGS.has_key('span')
    # False

    print BeautifulSoup("<html><span>Span<b>one<span>Span two")
    # <html><span>Span<b>one<span>Span two</span></b></span></html>
Third example: suppose the stack looks like ['ol','li','ul']:
that is, we've got an ordered list, the first element of which
contains an unordered list. Now suppose Beautiful Soup encounters a
<LI> tag. It shouldn't pop up to the first <LI> tag, because this new
<LI> tag is part of the unordered sublist. It's okay for an <LI> tag
to be inside another <LI> tag, so long as there's a <UL> or <OL> tag
in the way.
    from BeautifulSoup import BeautifulSoup
    print BeautifulSoup("<ol><li>1<ul><li>A").prettify()
    # <ol>
    #  <li>
    #   1
    #   <ul>
    #    <li>
    #     A
    #    </li>
    #   </ul>
    #  </li>
    # </ol>
But if there is no intervening <UL> or <OL>, then one <LI> tag
can't be underneath another:
    print BeautifulSoup("<ol><li>1<li>A").prettify()
    # <ol>
    #  <li>
    #   1
    #  </li>
    #  <li>
    #   A
    #  </li>
    # </ol>
We tell Beautiful Soup to treat <LI> tags this way by putting "li"
in RESET_NESTING_TAGS, and by giving "li" a NESTABLE_TAGS entry that
lists the tags under which it can nest.
    BeautifulSoup.RESET_NESTING_TAGS.has_key('li')
    # True
    BeautifulSoup.NESTABLE_TAGS['li']
    # ['ul', 'ol']
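Here is a rough sketch of what a NESTABLE_TAGS entry means for the
stack. It is a toy model built from the prose above, not Beautiful
Soup's real implementation. The function open_tag_nestable and its
triggers argument (standing for the list stored under the tag's name
in NESTABLE_TAGS) are hypothetical names.

```python
def open_tag_nestable(stack, name, triggers):
    """Toy model of a tag that has a NESTABLE_TAGS entry.

    Walk down from the top of the stack. If one of `triggers` is found,
    close everything above it (the trigger itself stays open) and push
    `name` directly on top of it. If no trigger is open, close nothing.
    """
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] in triggers:
            return stack[:i + 1] + [name]
    return stack + [name]


li_triggers = ['ul', 'ol']   # same value as NESTABLE_TAGS['li']

# A <ul> sits between the two <li> tags, so the new <li> nests.
print(open_tag_nestable(['ol', 'li', 'ul'], 'li', li_triggers))
# ['ol', 'li', 'ul', 'li']

# No intervening list: the first <li> is closed before the second opens.
print(open_tag_nestable(['ol', 'li'], 'li', li_triggers))
# ['ol', 'li']

# An empty list, as for <span>, means nothing is ever closed.
print(open_tag_nestable(['html', 'span', 'b'], 'span', []))
# ['html', 'span', 'b', 'span']
```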
This is also how we handle the nesting of table tags:
    BeautifulSoup.NESTABLE_TAGS['td']
    # ['tr']
    BeautifulSoup.NESTABLE_TAGS['tr']
    # ['table', 'tbody', 'tfoot', 'thead']
    BeautifulSoup.NESTABLE_TAGS['tbody']
    # ['table']
    BeautifulSoup.NESTABLE_TAGS['thead']
    # ['table']
    BeautifulSoup.NESTABLE_TAGS['tfoot']
    # ['table']
    BeautifulSoup.NESTABLE_TAGS['table']
    # []
That is: <TD> tags can be nested within <TR> tags. <TR> tags can be
nested within <TABLE>, <TBODY>, <TFOOT>, and <THEAD> tags. <TBODY>,
<TFOOT>, and <THEAD> tags can be nested in <TABLE> tags, and <TABLE>
tags can be nested in other <TABLE> tags. If you know about HTML
tables, these rules should already make sense to you.
One more example. Say the stack looks like ['html', 'p', 'table']
and Beautiful Soup encounters a <P> tag.
At first glance, this looks just like the example where the stack
is ['html', 'p', 'b']
and Beautiful Soup encounters a <P> tag. In
that example, we closed the <B> and <P> tags, because you can't have
one paragraph inside another.
Except... you can have a paragraph that contains a table, and then
the table contains a paragraph. So the right thing to do is to not
close any of these tags. Beautiful Soup does the right thing:
    from BeautifulSoup import BeautifulSoup
    print BeautifulSoup("<p>Para 1<b><p>Para 2").prettify()
    # <p>
    #  Para 1
    #  <b>
    #  </b>
    # </p>
    # <p>
    #  Para 2
    # </p>

    print BeautifulSoup("<p>Para 1<table><p>Para 2").prettify()
    # <p>
    #  Para 1
    #  <table>
    #   <p>
    #    Para 2
    #   </p>
    #  </table>
    # </p>
What's the difference? The difference is that <TABLE> is in
RESET_NESTING_TAGS and <B> is not. A tag that's in RESET_NESTING_TAGS
doesn't get popped off the stack as easily as a tag that's not.
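To see how the two tables interact, the rules described so far can be
combined into one small function. This is a simplified model based on
this section's description and should not be read as Beautiful Soup's
own algorithm. The name open_tag, the cut-down NESTABLE and RESET
tables, and the list-of-names stack are all assumptions made for the
sketch.

```python
def open_tag(stack, name, nestable_tags, reset_nesting_tags):
    """Toy model combining the nesting rules.

    `nestable_tags` maps a tag name to the list of tags it may nest
    under; `reset_nesting_tags` is any container of tag names.
    Returns the new stack after `name` has been opened.
    """
    triggers = nestable_tags.get(name)
    for i in range(len(stack) - 1, -1, -1):
        open_name = stack[i]
        if triggers is None:
            if open_name == name:
                # Not nestable: close up to and including the earlier tag.
                return stack[:i] + [name]
            if name in reset_nesting_tags and open_name in reset_nesting_tags:
                # Another nesting-reset tag shields everything below it.
                return stack[:i + 1] + [name]
        elif open_name in triggers:
            # Nestable under this tag: close only what sits above it.
            return stack[:i + 1] + [name]
    return stack + [name]


# Cut-down tables holding just the entries these two examples need.
NESTABLE = {'table': []}
RESET = {'p': None, 'table': None}

# <B> is not in RESET, so the search reaches the first <P> and closes it.
print(open_tag(['html', 'p', 'b'], 'p', NESTABLE, RESET))
# ['html', 'p']

# <TABLE> is in RESET, so the search stops there and nothing is closed.
print(open_tag(['html', 'p', 'table'], 'p', NESTABLE, RESET))
# ['html', 'p', 'table', 'p']
```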
Okay, hopefully you get the idea. Here's the NESTABLE_TAGS for the
BeautifulSoup class. Correlate this with what you know about HTML, and
you should be able to create your own NESTABLE_TAGS for bizarre HTML
documents that don't follow the normal rules, and for other XML
dialects that have different nesting rules.
    from BeautifulSoup import BeautifulSoup
    nestKeys = BeautifulSoup.NESTABLE_TAGS.keys()
    nestKeys.sort()
    for key in nestKeys:
        print "%s: %s" % (key, BeautifulSoup.NESTABLE_TAGS[key])
    # bdo: []
    # blockquote: []
    # center: []
    # dd: ['dl']
    # del: []
    # div: []
    # dl: []
    # dt: ['dl']
    # fieldset: []
    # font: []
    # ins: []
    # li: ['ul', 'ol']
    # object: []
    # ol: []
    # q: []
    # span: []
    # sub: []
    # sup: []
    # table: []
    # tbody: ['table']
    # td: ['tr']
    # tfoot: ['table']
    # th: ['tr']
    # thead: ['table']
    # tr: ['table', 'tbody', 'tfoot', 'thead']
    # ul: []
And here's BeautifulSoup's RESET_NESTING_TAGS. Only the keys are
important: RESET_NESTING_TAGS is actually a list, put into the form of
a dictionary for quick random access.
    from BeautifulSoup import BeautifulSoup
    resetKeys = BeautifulSoup.RESET_NESTING_TAGS.keys()
    resetKeys.sort()
    resetKeys
    # ['address', 'blockquote', 'dd', 'del', 'div', 'dl', 'dt', 'fieldset',
    #  'form', 'ins', 'li', 'noscript', 'ol', 'p', 'pre', 'table', 'tbody',
    #  'td', 'tfoot', 'th', 'thead', 'tr', 'ul']
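The "list in the form of a dictionary" idiom is easy to reproduce when
you build your own table. A minimal sketch, assuming every value is
None (which matches the RESET_NESTING_TAGS['p'] == None check shown
earlier); the tag names are made up for illustration.

```python
# Hypothetical tag names for an imaginary XML dialect.
block_tags = ['section', 'para', 'note']

# dict.fromkeys maps every name to None; only the keys carry meaning.
reset_nesting_tags = dict.fromkeys(block_tags)

# Membership tests are hash lookups rather than list scans.
print('para' in reset_nesting_tags)
# True
print(reset_nesting_tags['para'])
# None
print(sorted(reset_nesting_tags))
# ['note', 'para', 'section']
```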
Since you're subclassing anyway, you might as well override
SELF_CLOSING_TAGS while you're at it. It's a dictionary that maps
self-closing tag names to any values at all (like RESET_NESTING_TAGS,
it's actually a list in the form of a dictionary). Then you won't have
to pass that list in to the constructor (as selfClosingTags) every
time you instantiate your subclass.
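Putting the three tables together, a subclass might look roughly like
the sketch below. Everything about the dialect is invented: RecipeSoup
and the tag names photo, step, steps, note and title are hypothetical.
The try/except fallback defines a bare stand-in class only so the
sketch runs where Beautiful Soup 3 isn't installed; with the library
present, the real BeautifulStoneSoup is used.

```python
try:
    # Beautiful Soup 3 module layout, as used throughout this document.
    from BeautifulSoup import BeautifulStoneSoup
except ImportError:
    # Stand-in so the sketch still runs where the library is missing.
    class BeautifulStoneSoup(object):
        SELF_CLOSING_TAGS = {}
        NESTABLE_TAGS = {}
        RESET_NESTING_TAGS = {}


class RecipeSoup(BeautifulStoneSoup):
    """Hypothetical parser for an imaginary recipe XML dialect."""

    # <photo> never has a closing tag in this made-up dialect.
    SELF_CLOSING_TAGS = dict.fromkeys(['photo'])

    # <step> may sit inside another <step> only with a <steps> tag in
    # between (the <LI> pattern); <steps> and <note> nest freely.
    NESTABLE_TAGS = {
        'step': ['steps'],
        'steps': [],
        'note': [],
    }

    # Following the <LI> and <P> patterns above: 'step' and 'steps'
    # also have NESTABLE_TAGS entries, 'title' does not.
    RESET_NESTING_TAGS = dict.fromkeys(['step', 'steps', 'title'])


print(sorted(RecipeSoup.NESTABLE_TAGS))
# ['note', 'step', 'steps']
print('photo' in RecipeSoup.SELF_CLOSING_TAGS)
# True
```

Because the tables are class attributes, every instance of the
subclass picks them up without any extra constructor arguments.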