Other Built-In Parsers

Beautiful Soup comes with three parser classes besides BeautifulSoup and BeautifulStoneSoup:

  • MinimalSoup is a subclass of BeautifulSoup. It knows most facts about HTML like which tags are self-closing, the special behavior of the <SCRIPT> tag, the possibility of an encoding mentioned in a <META> tag, etc. But it has no nesting heuristics at all. So it doesn't know that <LI> tags go underneath <UL> tags and not the other way around. It's useful for parsing pathologically bad markup, and for subclassing.

  • ICantBelieveItsBeautifulSoup is also a subclass of BeautifulSoup. It has HTML heuristics that conform more closely to the HTML standard, but ignore how HTML is used in the real world. For instance, it's valid HTML to nest <B> tags, but in the real world a nested <B> tag almost always means that the author forgot to close the first <B> tag. If you run into someone who actually nests <B> tags, then you can use ICantBelieveItsBeautifulSoup.

  • BeautifulSOAP is a subclass of BeautifulStoneSoup. It's useful for parsing documents like SOAP messages, which use a subelement when they could just use an attribute of the parent element. Here's an example:

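    # Beautiful Soup 3 runs on Python 2, hence the print statements.
    from BeautifulSoup import BeautifulStoneSoup, BeautifulSOAP
    xml = "<doc><tag>subelement</tag></doc>"
    # The plain XML parser leaves the document unchanged.
    print BeautifulStoneSoup(xml)
    # <doc><tag>subelement</tag></doc>
    # BeautifulSOAP also copies the subelement's text into an attribute of its parent.
    print BeautifulSOAP(xml)
    # <doc tag="subelement"><tag>subelement</tag></doc>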

    With BeautifulSOAP you can access the contents of the <TAG> tag without descending into the tag.
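To make the BeautifulSOAP transformation concrete, here is a minimal sketch of the same idea using only Python 3's standard library. It is not Beautiful Soup's implementation. The helper name promote_text_children is hypothetical, and the rule it applies (copy a child's text into the parent only when the child has no subelements and the parent has no attribute of that name) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def promote_text_children(root):
    """Copy each text-only child into an attribute of its parent, in place.

    Hypothetical helper: it imitates the effect described above and is
    not part of Beautiful Soup.
    """
    for parent in root.iter():
        for child in parent:
            text = (child.text or "").strip()
            is_text_only = len(child) == 0  # child has no subelements
            # Assumption: never overwrite an attribute the parent already has.
            if text and is_text_only and child.tag not in parent.attrib:
                parent.set(child.tag, text)
    return root


doc = ET.fromstring("<doc><tag>subelement</tag></doc>")
promote_text_children(doc)

# The subelement's text can now be read directly from the parent.
print(doc.get("tag"))
# subelement
print(ET.tostring(doc, encoding="unicode"))
# <doc tag="subelement"><tag>subelement</tag></doc>
```

As in the BeautifulSOAP output above, the original subelement stays in the tree; the attribute is simply a shortcut to its text.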