Chinese text in pages scraped with Scrapy turns into Unicode escape strings

Reposted article. 2017-03-28 23:38. Tags: Chinese / encoding / troubleshooting / scrapy / unicode strings

If you are not familiar with character encodings, read this first: http://www.cnblogs.com/jiangtu/p/6245264.html

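While learning and using Scrapy to scrape information from the web, I found that Scrapy writes fields containing Chinese as Unicode escape sequences (for example, \u4e2d\u6587 instead of 中文).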

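The root cause is that Python's JSON serialization escapes every non-ASCII character as \uXXXX when ensure_ascii is enabled, and json.dumps enables it by default (ensure_ascii=True).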
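The default behaviour of json.dumps can be checked with a minimal sketch like the following. The item dictionary is made up for illustration; only the standard library is used.

```python
import json

# Hypothetical scraped item containing Chinese text.
item = {"title": "中文"}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(item)
print(escaped)   # {"title": "\u4e2d\u6587"}

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(item, ensure_ascii=False)
print(readable)  # {"title": "中文"}

# Both strings decode to the same data; only the textual form differs.
assert json.loads(escaped) == json.loads(readable) == item
```

Note that the escaped form is still valid JSON, so the data is not corrupted. It is only harder for a human to read.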

In Scrapy, JsonItemExporter also enables ensure_ascii by default:

class JsonItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        kwargs.setdefault('ensure_ascii', not self.encoding)  # look here
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

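As the sixth line (marked "look here") shows, if no encoding is passed, self.encoding is None, so not self.encoding evaluates to True and ensure_ascii is enabled by default.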
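The effect of that default can be reproduced without Scrapy. The sketch below is only a stand-in: build_encoder is a hypothetical helper, and it uses the standard json.JSONEncoder rather than ScrapyJSONEncoder, but it applies the same setdefault expression.

```python
import json


def build_encoder(encoding=None, **kwargs):
    """Hypothetical stand-in that mirrors the exporter's default logic."""
    # Escape to ASCII only when no encoding was given:
    # not None -> True, not "utf-8" -> False.
    kwargs.setdefault("ensure_ascii", not encoding)
    return json.JSONEncoder(**kwargs)


item = {"title": "中文"}

# No encoding: ensure_ascii defaults to True, output is escaped.
print(build_encoder().encode(item))

# Encoding given: ensure_ascii defaults to False, output stays readable.
print(build_encoder(encoding="utf-8").encode(item))
```

Because setdefault is used, an explicitly passed ensure_ascii value still wins over the value derived from the encoding.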

So all we need to do is pass the encoding when instantiating the exporter in the pipeline:

exporter = MyJsonExporter(fi, encoding='utf-8')

After that, the output contains readable Chinese.

The same applies to json.dumps(): pass ensure_ascii=False.
