Entity conversion

When you parse a document, you can convert HTML or XML entity references to the corresponding Unicode characters. This code converts the HTML entity "&eacute;" to the Unicode character LATIN SMALL LETTER E WITH ACUTE, and the numeric entity "&#101;" to the Unicode character LATIN SMALL LETTER E.

from BeautifulSoup import BeautifulStoneSoup
# The input contains a named HTML entity (&eacute;) and a numeric entity (&#101;).
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'
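For comparison, the same kind of conversion can be sketched with the Python 3 standard library alone. This is not how Beautiful Soup does it internally; it only assumes that `html.unescape` is an acceptable stand-in for converting both named HTML entities and numeric entities to Unicode characters.

```python
import html

# Input with a named HTML entity (&eacute;) and a numeric entity (&#101;).
raw = "Sacr&eacute; bl&#101;u!"

# html.unescape replaces both kinds of reference with Unicode characters.
converted = html.unescape(raw)

print(converted)
```

After the call, `converted` contains U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in place of "&eacute;" and a plain "e" in place of "&#101;".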

That's if you use HTML_ENTITIES (which is just the string "html"). If you use XML_ENTITIES (or the string "xml"), then only numeric entities and the five XML entities ("&quot;", "&apos;", "&gt;", "&lt;", and "&amp;") get converted. If you use ALL_ENTITIES (or the list ["xml", "html"]), then both kinds of entities will be converted. This last one is necessary because &apos; is an XML entity but not an HTML entity.

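# With XML_ENTITIES the numeric entity &#101; is converted, but &eacute; is
# not one of the five XML entities, so it is left as-is.
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.XML_ENTITIES)
# Sacr&eacute; bleu!

from BeautifulSoup import BeautifulStoneSoup
# &lt; and &gt; are XML entities, so they are converted to < and >.
BeautifulStoneSoup("Il a dit, &lt;&lt;Sacr&eacute; bl&#101;u!&gt;&gt;", 
                   convertEntities=BeautifulStoneSoup.XML_ENTITIES)
# Il a dit, <<Sacr&eacute; bleu!>>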
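The "xml" rule described above (convert numeric entities and the five XML entities, leave every other named entity alone) can be illustrated with a small standalone Python 3 function. The names `XML_NAMED`, `ENTITY_RE`, and `convert_xml_entities` are hypothetical helpers made up for this sketch; they are not part of Beautiful Soup, and the library's own implementation may differ.

```python
import re

# The five predefined XML entities (hypothetical helper table for this sketch).
XML_NAMED = {"quot": '"', "apos": "'", "gt": ">", "lt": "<", "amp": "&"}

# Matches decimal (&#101;), hexadecimal (&#x65;), and named (&lt;) references.
ENTITY_RE = re.compile(r"&(#\d+|#[xX][0-9a-fA-F]+|\w+);")


def convert_xml_entities(text):
    """Convert numeric and predefined XML entities; leave other names alone."""
    def replace(match):
        ref = match.group(1)
        if ref.startswith(("#x", "#X")):
            return chr(int(ref[2:], 16))   # hexadecimal numeric entity
        if ref.startswith("#"):
            return chr(int(ref[1:]))       # decimal numeric entity
        # Named entity: convert only the five XML ones, keep the rest verbatim.
        return XML_NAMED.get(ref, match.group(0))
    return ENTITY_RE.sub(replace, text)


result = convert_xml_entities("Il a dit, &lt;&lt;Sacr&eacute; bl&#101;u!&gt;&gt;")
print(result)
```

Because "eacute" is not in the table of XML entities, "&eacute;" passes through unchanged, while "&lt;", "&gt;", and "&#101;" are converted.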

If you tell Beautiful Soup to convert XML or HTML entities into the corresponding Unicode characters, then Windows-1252 characters (like Microsoft smart quotes) also get transformed into Unicode characters. This happens even if you also told Beautiful Soup, through smartQuotesTo, to convert those characters to entities.

from BeautifulSoup import BeautifulStoneSoup
smartQuotesAndEntities = "Il a dit, \x8BSacr&eacute; bl&#101;u!\x9b"

BeautifulStoneSoup(smartQuotesAndEntities, smartQuotesTo="html").contents[0]
# u'Il a dit, &lsaquo;Sacr&eacute; bl&#101;u!&rsaquo;'

BeautifulStoneSoup(smartQuotesAndEntities, convertEntities="html", 
                   smartQuotesTo="html").contents[0]
# u'Il a dit, \u2039Sacr\xe9 bleu!\u203a'

BeautifulStoneSoup(smartQuotesAndEntities, convertEntities="xml", 
                   smartQuotesTo="xml").contents[0]
# u'Il a dit, \u2039Sacr&eacute; bleu!\u203a'

It doesn't make sense to create new HTML/XML entities while you're busy turning all the existing entities into Unicode characters.
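The combined effect (Windows-1252 characters and entities both ending up as Unicode characters) can be approximated in Python 3 with the standard library. This is only a sketch under two assumptions: the input is known to be Windows-1252 bytes, and `html.unescape` is an acceptable stand-in for the entity conversion. It is not Beautiful Soup's implementation.

```python
import html

# Bytes 0x8B and 0x9B are the single angle quotation marks in Windows-1252.
raw_bytes = b"Il a dit, \x8bSacr&eacute; bl&#101;u!\x9b"

# Step 1: decode the Windows-1252 bytes into Unicode text.
text = raw_bytes.decode("cp1252")

# Step 2: convert the entity references into Unicode characters.
converted = html.unescape(text)

print(ascii(converted))
```

Here the two Windows-1252 quote bytes become U+2039 and U+203A, and no entity references remain in `converted`.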