Scrapy: fix Chinese written to JSON files as `\uXXXX` unicode escapes by setting FEED_EXPORT_ENCODING


0. Symptom

Scraped item:

2017-10-16 18:17:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.huxiu.com/v2_action/article_list>
{'author': u'\u5546\u4e1a\u8bc4\u8bba\u7cbe\u9009\xa9',
 'cmt': 5,
 'fav': 194,
 'time': u'4\u5929\u524d',
 'title': u'\u96f7\u519b\u8c08\u5c0f\u7c73\u201c\u65b0\u96f6\u552e\u201d\uff1a\u50cfZara\u4e00\u6837\u5f00\u5e97\uff0c\u8981\u505a\u5f97\u6bd4Costco\u66f4\u597d',
 'url': u'/article/217755.html'}

 

Written to the JSON Lines (.jl) file:

{"title": "\u8fd9\u4e00\u5468\uff1a\u8d2b\u7a77\u66b4\u51fb", "url": "/article/217997.html", "author": "\u864e\u55c5", "fav": 8, "time": "2\u5929\u524d", "cmt": 5}
{"title": "\u502a\u840d\u8001\u516c\u7684\u65b0\u620f\u6251\u8857\u4e86\uff0c\u9ec4\u6e24\u6301\u80a1\u7684\u516c\u53f8\u8981\u8d54\u60e8\u4e86", "url": "/article/217977.html", "author": "\u5a31\u4e50\u8d44\u672c\u8bba", "fav": 5, "time": "2\u5929\u524d", "cmt": 3}

 

Each item dict is serialized to a str by json.dumps; with the default ensure_ascii=True, every non-ASCII character is escaped as `\uXXXX`, and each resulting `{...}` record is written to the file on its own line.
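
A quick check in a Python shell shows the difference (a minimal sketch, independent of Scrapy; the sample strings are taken from the output above):

import json

item = {'title': u'贫穷暴击', 'time': u'2天前'}

# Default ensure_ascii=True: every non-ASCII character becomes a \uXXXX escape
print(json.dumps(item))
# {"title": "\u8d2b\u7a77\u66b4\u51fb", "time": "2\u5929\u524d"}

# ensure_ascii=False keeps the characters as they are
print(json.dumps(item, ensure_ascii=False))
# {"title": "贫穷暴击", "time": "2天前"}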

Goal (note: confirm the final result in Chrome or Notepad++; opening the .jl in Firefox may show garbled Chinese unless you specify the encoding manually):

{"title": "這一周:貧窮暴擊", "url": "/article/217997.html", "author": "虎嗅", "fav": 8, "time": "2天前", "cmt": 5}
{"title": "倪萍老公的新戲撲街了,黃渤持股的公司要賠慘了", "url": "/article/217977.html", "author": "娛樂資本論", "fav": 5, "time": "2天前", "cmt": 3}

 

1. References

Scrapy scrapes Chinese but saves it to the JSON file as unicode escapes; how to solve it:

import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        # codecs.open() returns a wrapper that encodes unicode to UTF-8 on write
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()

Outputting and saving Chinese in Scrapy

The Scrapy framework scrapes Chinese as Unicode escapes; how to convert it to UTF-8

lidashuang / imax-spider

 

The materials above are essentially the pipeline example from the official documentation, with ensure_ascii=False additionally specified:

Write items to a JSON file

The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line serialized in JSON format:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # additionally specify ensure_ascii=False here
        self.file.write(line)
        return item

Note

The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

 

2. A better solution:

Scrapy's item exporter writes Chinese to the JSON file as unicode escapes; how can it output Chinese?

http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence mentions that setting the JSONEncoder's ensure_ascii parameter to False is all that is needed.

And the Scrapy item exporter documentation mentions:

The additional constructor arguments are passed to the BaseItemExporter constructor, and the leftover arguments to the JSONEncoder constructor, so you can use any JSONEncoder constructor argument to customize this exporter.

So it is enough to additionally pass ensure_ascii=False when instantiating scrapy.contrib.exporter.JsonItemExporter (scrapy.exporters.JsonItemExporter in current releases).
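
For instance, a minimal pipeline sketch (the file name items.json and the class name are illustrative) that forwards ensure_ascii=False to the underlying JSONEncoder:

from scrapy.exporters import JsonItemExporter

class JsonExportPipeline(object):

    def open_spider(self, spider):
        # the exporter writes bytes, so open the file in binary mode
        self.file = open('items.json', 'wb')
        # leftover kwargs such as ensure_ascii go to the JSONEncoder constructor
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()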

 

3. The direct fix, based on the answers above plus the official docs and source code:

1. Add FEED_EXPORT_ENCODING = 'utf-8' to the project's settings.py, as sketched below, or

2. Pass it on the command line: G:\pydata\pycode\scrapy\huxiu_com>scrapy crawl -o new.jl -s FEED_EXPORT_ENCODING='utf-8' huxiu
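
A minimal settings.py sketch for option 1 (BOT_NAME and the module paths follow this post's huxiu project and are otherwise illustrative):

# settings.py
BOT_NAME = 'huxiu_com'
SPIDER_MODULES = ['huxiu_com.spiders']
NEWSPIDER_MODULE = 'huxiu_com.spiders'

# make feed exports (e.g. scrapy crawl huxiu -o new.jl) write UTF-8
# instead of \uXXXX escapes
FEED_EXPORT_ENCODING = 'utf-8'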

 

https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-encoding

FEED_EXPORT_ENCODING

Default: None

The encoding to be used for the feed.

If unset or set to None (default) it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.

Use utf-8 if you want UTF-8 for JSON too.

 

 

For reference, json.dump's signature and docstring (Python 2 here, from the Anaconda2 install; Python 3 removed the encoding parameter):

In [615]: json.dump?
Signature: json.dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw)
Docstring:
Serialize ``obj`` as a JSON formatted stream to ``fp`` (a
``.write()``-supporting file-like object).



If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only. If ``ensure_ascii`` is
``False``, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
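
A small illustration of that last point (the file name demo.json is illustrative): wrapping the binary file with codecs.getwriter lets the unicode chunks produced by ensure_ascii=False be encoded on write:

import codecs
import json

# codecs.getwriter('utf-8') wraps a byte stream so that unicode chunks
# are encoded to UTF-8 as they are written
writer = codecs.getwriter('utf-8')(open('demo.json', 'wb'))
json.dump({'title': u'贫穷暴击'}, writer, ensure_ascii=False)
writer.close()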

 

 

C:\Program Files\Anaconda2\Lib\site-packages\scrapy\exporters.py (excerpt):

class JsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        # ensure_ascii defaults to False as soon as an encoding is set,
        # so FEED_EXPORT_ENCODING = 'utf-8' disables \uXXXX escaping
        kwargs.setdefault('ensure_ascii', not self.encoding)


class JsonItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        # same default as JsonLinesItemExporter
        kwargs.setdefault('ensure_ascii', not self.encoding)


class XmlItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        # XML output always falls back to UTF-8
        if not self.encoding:
            self.encoding = 'utf-8'
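
To see that default in action, a rough sketch (assuming Scrapy 1.x, where exporters also accept plain dicts; demo.jl is illustrative):

from scrapy.exporters import JsonLinesItemExporter

with open('demo.jl', 'wb') as f:
    # encoding='utf-8' makes ensure_ascii default to False (see above)
    exporter = JsonLinesItemExporter(f, encoding='utf-8')
    exporter.export_item({'title': u'贫穷暴击'})
# demo.jl now contains {"title": "贫穷暴击"} rather than \uXXXX escapes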

 





 