This section covers solutions to some common problems you may run into when using Beautiful Soup.
If you're getting errors that say: "'ascii' codec can't encode character 'x' in position y: ordinal not in range(128)",
the problem is probably with your Python installation rather than with
Beautiful Soup. Try printing out the non-ASCII characters without
running them through Beautiful Soup and you should have the same
problem. For instance, try running code like this:
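(Translator's note: if you already know the document's encoding, you can first decode it to Unicode, then re-encode it as UTF-8, and only then pass it to Beautiful Soup.
For example, if the HTML content htm is encoded as GB2312:
    htm = unicode(htm, 'gb2312', 'ignore').encode('utf-8', 'ignore')
    soup = BeautifulSoup(htm)
If you don't know the encoding, you can use chardet to detect it first. chardet is a separate package that you need to install yourself; it is easy to find online.)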
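For readers on Python 3, where the unicode built-in used in the note above no longer exists, the same decode-then-re-encode step might look like the sketch below. The helper name to_utf8 and the sample text are assumptions made for illustration; they are not part of Beautiful Soup.

```python
def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known encoding, then re-encode them as UTF-8.

    Bytes that are invalid in the source encoding are silently dropped,
    mirroring the 'ignore' error handler used in the note above.
    """
    text = raw_bytes.decode(source_encoding, "ignore")
    return text.encode("utf-8")


# Hypothetical sample: a tiny HTML fragment stored as GB2312 bytes.
# "\u4e2d\u6587" is the two-character word for "Chinese (language)".
sample_text = "<p>\u4e2d\u6587</p>"
gb2312_html = sample_text.encode("gb2312")

utf8_html = to_utf8(gb2312_html, "gb2312")
# utf8_html is now UTF-8 encoded and can be handed to the parser.
```

When the encoding is not known in advance, the value passed as source_encoding would have to come from a detector such as chardet, which is left out here to keep the sketch free of third-party imports.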
To run the test described above, print the non-ASCII characters directly, without Beautiful Soup, using code like this:
    # Python 2: decode a Latin-1 byte string to Unicode, then print it
    latin1word = 'Sacr\xe9 bleu!'
    unicodeword = unicode(latin1word, 'latin-1')
    print unicodeword
If this works but Beautiful Soup doesn't, there's probably a bug in
Beautiful Soup. However, if this doesn't work, the problem's with your
Python setup. Python is playing it safe and not sending non-ASCII
characters to your terminal. There are two ways to override this
behavior.
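Before trying either approach, it can help to confirm that the error really comes from ASCII encoding and not from the parser. The sketch below assumes Python 3 and reproduces the same kind of error message without involving a terminal, by encoding the sample string to ASCII directly.

```python
word = "Sacr\xe9 bleu!"  # the same sample text, as a Python 3 str

error_message = None
failed_at = None
try:
    word.encode("ascii")
except UnicodeEncodeError as exc:
    # The exception records which character failed and where.
    error_message = str(exc)
    failed_at = exc.start

print(error_message)
```

If the message printed here matches the one you see in your own program, the failure is in the encoding step and not in Beautiful Soup.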
The easy way is to remap standard output to a converter that's
not afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.
    import codecs
    import sys
    # The last item returned by codecs.lookup is the codec's StreamWriter
    streamWriter = codecs.lookup('utf-8')[-1]
    sys.stdout = streamWriter(sys.stdout)
codecs.lookup returns a number of bound methods and other objects related to a codec. The last one is a StreamWriter object capable of wrapping an output stream.
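As a rough Python 3 counterpart, the sketch below wraps an in-memory byte stream instead of sys.stdout, since sys.stdout is already a text stream there. Using io.BytesIO as the target is an assumption made so the example can be checked without a terminal.

```python
import codecs
import io

codec_info = codecs.lookup("utf-8")
stream_writer_class = codec_info[-1]  # last item: the StreamWriter class

byte_stream = io.BytesIO()            # stand-in for a binary output stream
writer = stream_writer_class(byte_stream)
writer.write("Sacr\xe9 bleu!")        # text goes in, UTF-8 bytes come out

encoded = byte_stream.getvalue()
```

The same class is also available as codec_info.streamwriter, which reads more clearly than indexing with [-1].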
The hard way is to create a sitecustomize.py file in your Python installation which sets the default encoding to
ISO-Latin-1 or to UTF-8. Then all your Python programs will use that
encoding for standard output, without you having to do something for
each program. In my installation, I have a /usr/lib/python/sitecustomize.py
which looks like this:
    import sys
    sys.setdefaultencoding("utf-8")
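This hook belongs to Python 2. On Python 3, sys.setdefaultencoding is gone and the default string encoding is already UTF-8, so a quick check like the following sketch can tell you which situation you are in. The variable names are illustrative only.

```python
import sys

# On Python 3 this is always 'utf-8'; on Python 2 it is usually 'ascii'.
default_encoding = sys.getdefaultencoding()

# The setter does not exist on Python 3.
has_setter = hasattr(sys, "setdefaultencoding")

# Encoding used when printing; may be missing if stdout has been replaced.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print(default_encoding, has_setter, stdout_encoding)
```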
For more information about Python's Unicode support, look at Unicode for
Programmers or End to End Unicode
Web Applications in Python. Recipes 1.20 and 1.21 in the Python
cookbook are also very helpful.
Remember, even if your terminal display is restricted to ASCII, you
can still use Beautiful Soup to parse, process, and write documents in
UTF-8 and other encodings. You just can't print certain strings with print.
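To illustrate that last point, here is a minimal Python 3 sketch that writes text to a file as UTF-8 and then prints an ASCII-safe rendering of it. The helper name ascii_safe and the temporary file are assumptions for the example; Beautiful Soup itself is not involved.

```python
import os
import tempfile


def ascii_safe(text):
    """Return text with non-ASCII characters replaced by backslash escapes."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


document = "Sacr\xe9 bleu!"

# Writing in UTF-8 works regardless of what the terminal can display.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as handle:
    handle.write(document)
    path = handle.name

# Read the file back to confirm nothing was lost, then clean up.
with open(path, encoding="utf-8") as handle:
    round_tripped = handle.read()
os.remove(path)

# Only the display step needs to be restricted to ASCII.
print(ascii_safe(round_tripped))
```

The escaped form is meant for inspection on a limited terminal; the file on disk still holds the real characters.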