Beautiful Soup loses the data I fed it!

Beautiful Soup can handle poorly-structured SGML, but it sometimes loses data when it is given input that isn't SGML at all. This is much less common than poorly-structured markup, but if you're building a web crawler or something similar, you will eventually run into it.

The only solution is to sanitize the data ahead of time with a regular expression. Here are some examples that I and other Beautiful Soup users have discovered:

  • Beautiful Soup treats ill-formed XML definitions as data. However, it silently drops definitions that are well-formed but don't actually exist, such as <!FOO>:

    from BeautifulSoup import BeautifulSoup
    BeautifulSoup("< ! FOO @=>")
    # < ! FOO @=>
    BeautifulSoup("<b><!FOO>!</b>")
    # <b>!</b>
        
  • If your document starts a declaration and never finishes it, Beautiful Soup assumes the rest of your document is part of the declaration. If the document ends in the middle of the declaration, Beautiful Soup ignores the declaration entirely. A couple of examples:

    from BeautifulSoup import BeautifulSoup
    
    BeautifulSoup("foo<!bar") 
    # foo 
    
    soup = BeautifulSoup("<html>foo<!bar</html>") 
    print soup.prettify()
    # <html>
    #  foo<!bar</html>
    # </html>
        

    There are a couple of ways to fix this; one is detailed here.

    Beautiful Soup also ignores an entity reference that isn't finished by the end of the document:

    BeautifulSoup("&lt;foo&gt")
    # &lt;foo
        

    I've never seen this in real web pages, but it's probably out there somewhere.

  • A malformed comment will make Beautiful Soup ignore the rest of the document. This is covered as the example in Sanitizing Bad Data with Regexps.
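
As a rough illustration of sanitizing markup before parsing, the sketch below uses only Python's re module to patch three of the cases listed above: an unfinished declaration, an entity reference left open at the end of the document, and one kind of malformed comment. The helper names (fix_comment_openers, escape_unfinished_declarations, close_trailing_entity, sanitize) are hypothetical and not part of Beautiful Soup. The assumption that a "malformed comment" means an opener written with a single hyphen (<!-) is mine, and the patterns are heuristics that will not cope with every document (a DOCTYPE with an internal subset, for example).

```python
import re

try:
    from html.entities import name2codepoint  # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# Assumption: the malformed comment opens with "<!-" instead of "<!--".
SINGLE_HYPHEN_COMMENT = re.compile(r"<!-([^-])")

# "<!" that is not a comment or CDATA opener and is not closed by ">"
# before the next "<" or the end of the document.
UNFINISHED_DECLARATION = re.compile(r"<!(?!--|\[CDATA\[)(?![^<>]*>)")

# An entity reference at the very end of the document with no ";".
TRAILING_ENTITY = re.compile(
    r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)\Z")


def fix_comment_openers(markup):
    """Turn a single-hyphen comment opener into a proper "<!--"."""
    return SINGLE_HYPHEN_COMMENT.sub(r"<!--\1", markup)


def escape_unfinished_declarations(markup):
    """Escape the "<" of an unclosed declaration so it reads as text."""
    return UNFINISHED_DECLARATION.sub("&lt;!", markup)


def close_trailing_entity(markup):
    """Append ";" if the document ends inside a known entity reference."""
    match = TRAILING_ENTITY.search(markup)
    if match is None:
        return markup
    name = match.group(1)
    # Only close numeric references and names that are real HTML
    # entities, so that text such as "AT&T" is left alone.
    if name.startswith("#") or name in name2codepoint:
        return markup + ";"
    return markup


def sanitize(markup):
    """Apply all three fixes; the result can then be handed to a parser."""
    markup = fix_comment_openers(markup)
    markup = escape_unfinished_declarations(markup)
    return close_trailing_entity(markup)


print(sanitize("foo<!bar"))               # foo&lt;!bar
print(sanitize("<html>foo<!bar</html>"))  # <html>foo&lt;!bar</html>
print(sanitize("&lt;foo&gt"))             # &lt;foo&gt;
```

The cleaned string would then be passed to the BeautifulSoup constructor in place of the raw input. Escaping the stray "<!" keeps the text visible in the parsed document; deleting it would be an equally reasonable choice, depending on what the crawler needs.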

The parse tree built by the BeautifulSoup class offends my senses!

To get your markup parsed differently, check out the other built-in parsers, or else build a custom parser.