Printing a Document

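You can turn a Beautiful Soup document (or any subset of it) into a string with the str function, or with the prettify or renderContents methods. You can also use the unicode function to get the document as a Unicode string.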

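The prettify method adds newlines and spaces so that the structure of the document is easier to see. It also strips out text nodes that contain only whitespace, which might change the meaning of an XML document. The str and unicode functions do not strip out these nodes, and they do not add any whitespace either.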

Consider this example:

from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)

str(soup)
# '<html><h1>Heading</h1><p>Text</p></html>'
soup.renderContents()
# '<html><h1>Heading</h1><p>Text</p></html>'
soup.__str__()
# '<html><h1>Heading</h1><p>Text</p></html>'
unicode(soup)
# u'<html><h1>Heading</h1><p>Text</p></html>'

soup.prettify()
# '<html>\n <h1>\n  Heading\n </h1>\n <p>\n  Text\n </p>\n</html>'

print soup.prettify()
# <html>
#  <h1>
#   Heading
#  </h1>
#  <p>
#   Text
#  </p>
# </html>
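
For readers without Beautiful Soup at hand, the difference between compact and pretty-printed output can be sketched with the Python 3 standard library. This is only an analogy: it assumes xml.dom.minidom (not Beautiful Soup), a well-formed input string, and the illustrative names source, compact, and pretty. The exact layout produced by toprettyxml differs from what prettify produces.

```python
from xml.dom import minidom

# Assumption: a well-formed document, since minidom does not repair markup
# the way Beautiful Soup does.
source = "<html><h1>Heading</h1><p>Text</p></html>"
root = minidom.parseString(source).documentElement

# Compact serialization: no whitespace is added (comparable to str(soup)).
compact = root.toxml()

# Pretty serialization: newlines and indentation are added
# (comparable in spirit to soup.prettify()).
pretty = root.toprettyxml(indent=" ")

print(compact)
print(pretty)
```

The compact form round-trips the input unchanged, while the pretty form is meant for human readers and contains whitespace that was not in the original.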

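Note that str and renderContents give different results when called on a tag within the document: str returns the tag together with its contents, while renderContents returns only the contents.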

heading = soup.h1
str(heading)
# '<h1>Heading</h1>'
heading.renderContents()
# 'Heading'
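
The same distinction between a tag's full markup and just its contents can be sketched with the Python 3 standard library. This is an analogy only, assuming xml.dom.minidom rather than Beautiful Soup; the names outer and inner are illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString("<html><h1>Heading <b>one</b></h1><p>Text</p></html>")
heading = doc.getElementsByTagName("h1")[0]

# The tag together with its contents (comparable to str(heading)).
outer = heading.toxml()

# Only the contents: serialize each child node and join the pieces
# (comparable to heading.renderContents()).
inner = "".join(child.toxml() for child in heading.childNodes)

print(outer)
print(inner)
```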

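When you call __str__, prettify, or renderContents, you can specify an output encoding. The default encoding (the one used by str) is UTF-8. Here is an example that parses an ISO-8859-1 string and then outputs the same string in different encodings: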

from BeautifulSoup import BeautifulSoup
doc = "Sacr\xe9 bleu!"
soup = BeautifulSoup(doc)
str(soup)
# 'Sacr\xc3\xa9 bleu!'                          # UTF-8
soup.__str__("ISO-8859-1")
# 'Sacr\xe9 bleu!'
soup.__str__("UTF-16")
# '\xff\xfeS\x00a\x00c\x00r\x00\xe9\x00 \x00b\x00l\x00e\x00u\x00!\x00'
soup.__str__("EUC-JP")
# 'Sacr\x8f\xab\xb1 bleu!'
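
The byte sequences above follow from ordinary character encoding, which can be checked without Beautiful Soup. The sketch below assumes Python 3 and uses only built-in bytes and str methods; the variable names are illustrative. It uses "utf-16-le" so that the result has no byte order mark and does not depend on the platform.

```python
# The input bytes are ISO-8859-1: 0xE9 is the letter e with an acute accent.
raw = b"Sacr\xe9 bleu!"

# Decode once to text, then encode to whatever output encoding is wanted.
text = raw.decode("iso-8859-1")

as_utf8 = text.encode("utf-8")         # the accented letter becomes two bytes
as_latin1 = text.encode("iso-8859-1")  # round-trips to the original bytes
as_utf16 = text.encode("utf-16-le")    # two bytes per character, no BOM

print(as_utf8)
print(as_latin1)
print(as_utf16)
```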

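If the original document contains an encoding declaration, Beautiful Soup rewrites the declaration to name the new encoding when it converts the document back to a string. This means that if you load an HTML document into BeautifulSoup and print it back out, not only is the HTML cleaned up, it is also visibly converted to UTF-8.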

Here is an HTML example:

from BeautifulSoup import BeautifulSoup
doc = """<html>
<meta http-equiv="Content-type" content="text/html; charset=ISO-Latin-1" >
Sacr\xe9 bleu!
</html>"""

print BeautifulSoup(doc).prettify()
# <html>
#  <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
#  Sacré bleu!
# </html>

Here is an XML example:

from BeautifulSoup import BeautifulStoneSoup
doc = """<?xml version="1.0" encoding="ISO-Latin-1">Sacr\xe9 bleu!"""

print BeautifulStoneSoup(doc).prettify()
# <?xml version='1.0' encoding='utf-8'>
# Sacré bleu!
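
The idea behind rewriting the declaration can be sketched in a few lines of Python 3. This is not how Beautiful Soup implements it: reencode_html is a hypothetical helper, the regular expression is a simplification that only handles a charset= value inside a meta tag, and the codec name iso-8859-1 is used here because it is a name Python's codec registry is known to accept.

```python
import re


def reencode_html(raw, source_encoding, target_encoding="utf-8"):
    """Hypothetical helper: decode, update the charset declaration, re-encode."""
    text = raw.decode(source_encoding)
    # Replace whatever follows "charset=" so the declaration matches the
    # encoding of the bytes that are about to be produced.
    text = re.sub(
        r"(charset=)[\w-]+",
        r"\g<1>" + target_encoding,
        text,
        flags=re.IGNORECASE,
    )
    return text.encode(target_encoding)


raw = (
    b'<meta http-equiv="Content-type" '
    b'content="text/html; charset=ISO-8859-1">Sacr\xe9 bleu!'
)
converted = reencode_html(raw, "iso-8859-1")
print(converted)
```

Keeping the declaration and the actual byte encoding in sync matters because a consumer that trusts a stale declaration would misread the re-encoded bytes.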