Beautiful Soup comes with three parser classes besides BeautifulSoup and BeautifulStoneSoup:
MinimalSoup is a subclass of BeautifulSoup. It knows most facts about HTML, such as which tags are self-closing, the special behavior of the <SCRIPT> tag, and the possibility of an encoding mentioned in a <META> tag. But it has no nesting heuristics at all, so it doesn't know that <LI> tags go underneath <UL> tags and not the other way around. It's useful for parsing pathologically bad markup, and for subclassing.
ICantBelieveItsBeautifulSoup is also a subclass of BeautifulSoup. Its HTML heuristics conform more closely to the HTML standard, but ignore how HTML is used in the real world. For instance, it's valid HTML to nest <B> tags, but in the real world a nested <B> tag almost always means that the author forgot to close the first <B> tag. If you run into someone who actually nests <B> tags, then you can use ICantBelieveItsBeautifulSoup.
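The difference between the two nesting policies can be illustrated without Beautiful Soup itself. The sketch below is only an approximation built on the standard library's html.parser; the DepthTracker class, the max_b_depth function, and the allow_nested_b flag are hypothetical names invented for this illustration and are not part of Beautiful Soup. It assumes the "real world" heuristic amounts to closing an open <B> tag when a second one starts.

```python
from html.parser import HTMLParser


class DepthTracker(HTMLParser):
    """Tracks open tags and records the deepest <b> nesting seen.

    Hypothetical helper for illustration; not Beautiful Soup code.
    """

    def __init__(self, allow_nested_b):
        super().__init__()
        self.allow_nested_b = allow_nested_b
        self.open_tags = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "b" and not self.allow_nested_b and "b" in self.open_tags:
            # Real-world policy: a second <b> implies the first was never closed.
            self._close("b")
        self.open_tags.append(tag)
        self.max_depth = max(self.max_depth, self.open_tags.count("b"))

    def handle_endtag(self, tag):
        # Ignore stray end tags that match nothing currently open.
        if tag in self.open_tags:
            self._close(tag)

    def _close(self, tag):
        # Pop everything up to and including the most recent matching tag.
        while self.open_tags:
            if self.open_tags.pop() == tag:
                break


def max_b_depth(markup, allow_nested_b):
    """Return the deepest <b> nesting under the chosen policy."""
    parser = DepthTracker(allow_nested_b)
    parser.feed(markup)
    parser.close()
    return parser.max_depth


markup = "<b>Foo<b>Bar</b></b>"
print(max_b_depth(markup, allow_nested_b=False))  # 1: second <b> closes the first
print(max_b_depth(markup, allow_nested_b=True))   # 2: nesting is kept as written
```

The first call mimics the forgiving behavior described for BeautifulSoup, and the second mimics the standards-oriented behavior described for ICantBelieveItsBeautifulSoup.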
BeautifulSOAP is a subclass of BeautifulStoneSoup. It's useful for parsing documents like SOAP messages, which use a subelement when they could just use an attribute of the parent element. Here's an example:
from BeautifulSoup import BeautifulStoneSoup, BeautifulSOAP
xml = "<doc><tag>subelement</tag></doc>"
print BeautifulStoneSoup(xml)
# <doc><tag>subelement</tag></doc>
print BeautifulSOAP(xml)
# <doc tag="subelement"><tag>subelement</tag></doc>
With BeautifulSOAP you can access the contents of the <TAG> tag without descending into the tag, because they are also available as the tag attribute of the parent <DOC> tag.
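The same idea can be sketched with the standard library alone. The example below uses xml.etree.ElementTree rather than Beautiful Soup, and promote_text_children is a hypothetical helper written for this illustration. It assumes the transformation is "copy each text-only child into an attribute of its parent, unless the parent already has an attribute of that name", which matches the output shown above but is not claimed to be BeautifulSOAP's exact implementation.

```python
import xml.etree.ElementTree as ET


def promote_text_children(element):
    """Copy each text-only child into an attribute on its parent.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    for child in element:
        promote_text_children(child)
        is_text_only = len(child) == 0 and bool(child.text)
        if is_text_only and child.tag not in element.attrib:
            element.set(child.tag, child.text)
    return element


root = ET.fromstring("<doc><tag>subelement</tag></doc>")
promote_text_children(root)

# The child's text is now reachable from the parent without descending.
print(root.get("tag"))
# subelement
print(ET.tostring(root, encoding="unicode"))
# <doc tag="subelement"><tag>subelement</tag></doc>
```

The child element is left in place, so code that does descend into the tag keeps working.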