When you parse a document, you can convert HTML or XML entity
references to the corresponding Unicode characters. This code converts
the HTML entity "&eacute;" to the Unicode character LATIN SMALL
LETTER E WITH ACUTE, and the numeric entity "&#101;" to the Unicode
character LATIN SMALL LETTER E.
from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!",
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'
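For readers who want to see the same conversion without Beautiful Soup installed, here is a minimal sketch that uses only the Python 3 standard library. It relies on html.unescape, which is not part of Beautiful Soup and is shown only to illustrate what "converting entity references to Unicode characters" means for this input.

```python
import html

# Stand-alone illustration with the Python 3 standard library;
# this is not Beautiful Soup's own convertEntities machinery.
text = "Sacr&eacute; bl&#101;u!"

# html.unescape resolves both named and numeric character references.
converted = html.unescape(text)
print(ascii(converted))  # 'Sacr\xe9 bleu!'
```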
That's what happens if you use HTML_ENTITIES (which is just the string
"html"). If you use XML_ENTITIES (or the string "xml"), then only
numeric entities and the five XML entities ("&quot;", "&apos;", "&gt;",
"&lt;", and "&amp;") get converted. If you use ALL_ENTITIES (or the list
["xml", "html"]), then both kinds of entities will be converted. This
last option is necessary because &apos; is an XML entity but not an HTML
entity.
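from BeautifulSoup import BeautifulStoneSoup

# With XML_ENTITIES, the numeric entity is converted but &eacute; is left alone.
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!",
                   convertEntities=BeautifulStoneSoup.XML_ENTITIES)
# Sacr&eacute; bleu!

# The XML entities &lt; and &gt; are converted as well.
BeautifulStoneSoup("Il a dit, &lt;&lt;Sacr&eacute; bl&#101;u!&gt;&gt;",
                   convertEntities=BeautifulStoneSoup.XML_ENTITIES)
# Il a dit, <<Sacr&eacute; bleu!>>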
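The difference between the three settings can be sketched in plain Python 3. The helper below, convert_entities, is hypothetical and is not Beautiful Soup's implementation; it assumes that numeric references are converted in every mode and that the mode only selects which named entities are recognized, which is how the text above describes the behavior.

```python
import re
from html.entities import name2codepoint

# The five predefined XML entities.
XML_NAMED = {"quot": '"', "apos": "'", "gt": ">", "lt": "<", "amp": "&"}
# HTML 4 named entities from the standard library table.
HTML_NAMED = {name: chr(cp) for name, cp in name2codepoint.items()}

# Matches decimal (&#101;), hexadecimal (&#x65;), and named (&eacute;) references.
ENTITY_RE = re.compile(r"&(#\d+|#[xX][0-9a-fA-F]+|\w+);")


def convert_entities(text, mode):
    """Hypothetical helper: mode is "xml", "html", or "all"."""
    named = {}
    if mode in ("html", "all"):
        named.update(HTML_NAMED)
    if mode in ("xml", "all"):
        named.update(XML_NAMED)

    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            # Numeric references are converted regardless of mode.
            body = ref[1:]
            code = int(body[1:], 16) if body[0] in "xX" else int(body)
            return chr(code)
        # Unknown names are left untouched, entity syntax and all.
        return named.get(ref, match.group(0))

    return ENTITY_RE.sub(replace, text)


sample = "Sacr&eacute; bl&#101;u! &apos;"
print(ascii(convert_entities(sample, "xml")))
print(ascii(convert_entities(sample, "html")))
print(ascii(convert_entities(sample, "all")))
```

In this sketch only the "all" mode resolves both &eacute; and &apos;, which mirrors the reason the text gives for ALL_ENTITIES.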
If you tell Beautiful Soup to convert XML or HTML entities into the
corresponding Unicode characters, then Windows-1252 characters (like
Microsoft smart quotes) also get transformed into Unicode
characters. This happens even if you also told Beautiful Soup (with
smartQuotesTo) to convert those characters to entities.
from BeautifulSoup import BeautifulStoneSoup
smartQuotesAndEntities = "Il a dit, \x8BSacr&eacute; bl&#101;u!\x9b"

# smartQuotesTo alone: smart quotes become entities, existing entities are kept.
BeautifulStoneSoup(smartQuotesAndEntities, smartQuotesTo="html").contents[0]
# u'Il a dit, &lsaquo;Sacr&eacute; bl&#101;u!&rsaquo;'

# With convertEntities, smart quotes become Unicode despite smartQuotesTo.
BeautifulStoneSoup(smartQuotesAndEntities, convertEntities="html",
                   smartQuotesTo="html").contents[0]
# u'Il a dit, \u2039Sacr\xe9 bleu!\u203a'

BeautifulStoneSoup(smartQuotesAndEntities, convertEntities="xml",
                   smartQuotesTo="xml").contents[0]
# u'Il a dit, \u2039Sacr&eacute; bleu!\u203a'
It doesn't make sense to create new HTML/XML entities while you're
busy turning all the existing entities into Unicode
characters.
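To make the smart-quote mapping concrete, here is a small Python 3 sketch using only the standard library. It is an illustration of the underlying character mapping, not of Beautiful Soup's internals: the bytes 0x8B and 0x9B are decoded as Windows-1252, and the resulting code point is then looked up in the HTML entity table to show the entity that a setting like smartQuotesTo="html" would otherwise produce.

```python
from html.entities import codepoint2name

# Raw bytes as a Windows-1252 document might contain them.
raw = b"Il a dit, \x8bSacr\xe9 bleu!\x9b"

# Decoding as Windows-1252 turns the smart quotes into Unicode characters.
text = raw.decode("windows-1252")
print(ascii(text))  # 'Il a dit, \u2039Sacr\xe9 bleu!\u203a'

# The same character expressed as an HTML entity reference instead.
entity = "&%s;" % codepoint2name[ord("\u2039")]
print(entity)  # &lsaquo;
```

Producing &lsaquo; only to convert it straight back to U+2039 would be a round trip with no effect, which is why the Unicode form wins when both options are given.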