Outputting and saving Chinese in Scrapy


1. Decoding Chinese in a JSON file:

#!/usr/bin/python
#coding=utf-8
#author=dahu
import json
with open('huxiu.json','r') as f:
    data=json.load(f)
print data[0]['title']
for key in data[0]:
    print '\"%s\":\"%s\",'%(key,data[0][key])
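The same read works on Python 3 as well (where print is a function and files should be opened with an explicit encoding). A minimal self-contained sketch, using a small demo file with hypothetical data instead of huxiu.json:

```python
import json

# Build a sample file shaped like huxiu.json (hypothetical data,
# ASCII-escaped the way Scrapy's default feed export writes it)
sample = '[{"title": "\\u62bc\\u91d18000\\u5143", "link": "/article/214877.html"}]'
with open('huxiu_demo.json', 'w', encoding='utf-8') as f:
    f.write(sample)

with open('huxiu_demo.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# json.load decodes the \uXXXX escapes back to real characters
print(data[0]['title'])
for key in data[0]:
    print('"%s":"%s",' % (key, data[0][key]))
```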

Writing Chinese to JSON:

#!/usr/bin/python
#coding=utf-8
#author=dahu
import json
data={
"desc":"女友不是你想租想租就能租",
"link":"/article/214877.html",
"title":"押金8000元,共享女友門檻不低啊"
}
with open('tmp.json','w') as f:
    json.dump(data,f,ensure_ascii=False)        # pass ensure_ascii=False so Chinese is written unescaped
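On Python 3 the same idea needs an explicit file encoding, because json.dump with ensure_ascii=False emits the real Chinese characters rather than ASCII escapes. A minimal sketch (file name and data are illustrative):

```python
import json

data = {
    "link": "/article/214877.html",
    "title": "押金8000元",  # sample Chinese value
}

# ensure_ascii=False writes the characters themselves instead of \uXXXX escapes,
# so the file must be opened with a UTF-8 encoding
with open('tmp_demo.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

with open('tmp_demo.json', encoding='utf-8') as f:
    text = f.read()
print(text)
```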

2. When Scrapy saves a JSON file directly, the Chinese easily comes out garbled (as \uXXXX escapes).

For example:

scrapy crawl huxiu --nolog -o huxiu.json
$ head huxiu.json 
[
{"title": "\u62bc\u91d18000\u5143\uff0c\u5171\u4eab\u5973\u53cb\u95e8\u69db\u4e0d\u4f4e\u554a", "link": "/article/214877.html", "desc": "\u5973\u53cb\u4e0d\u662f\u4f60\u60f3\u79df\u60f3\u79df\u5c31\u80fd\u79df"},
{"title": "\u5f20\u5634\uff0c\u817e\u8baf\u8981\u5582\u4f60\u5403\u836f\u4e86", "link": "/article/214879.html", "desc": "\u201c\u8033\u65c1\u56de\u8361\u7740Pony\u9a6c\u7684\u6559\u8bf2\uff1a\u597d\u597d\u7528\u8111\u5b50\u60f3\u60f3\uff0c\u4e0d\u5145\u94b1\uff0c\u4f60\u4eec\u4f1a\u53d8\u5f3a\u5417\uff1f\u201d"},
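Those \uXXXX sequences are not corruption but plain ASCII-safe JSON escapes, and json.loads recovers the original text. (On newer Scrapy versions, setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py also makes the -o feed output unescaped directly.) A quick check:

```python
import json

# One escaped line, as produced by Scrapy's default feed export
line = ('{"title": "\\u62bc\\u91d18000\\u5143\\uff0c\\u5171\\u4eab\\u5973'
        '\\u53cb\\u95e8\\u69db\\u4e0d\\u4f4e\\u554a", '
        '"link": "/article/214877.html"}')

item = json.loads(line)
print(item['title'])  # prints the decoded Chinese title

# Round trip: dumping with ensure_ascii=False keeps the characters readable
readable = json.dumps(item, ensure_ascii=False)
print(readable)
```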

Combining this with the trick above for saving Chinese to JSON:

In settings.py, uncomment the ITEM_PIPELINES block:

ITEM_PIPELINES = {
   'coolscrapy.pipelines.CoolscrapyPipeline': 300,
}

Change pipelines.py to the following:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
# import codecs

class CoolscrapyPipeline(object):
    # def __init__(self):
        # self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item),ensure_ascii=False) + '\n'
        # self.file.write(line)

        with open('data_cn1.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',\n')
        return item
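One caveat with the append-per-item approach above: the resulting file is a comma-separated list of objects, not a valid JSON document. A hedged sketch of a pipeline that writes one proper JSON array instead, using the standard Scrapy open_spider/close_spider hooks (the class and output file names here are illustrative, not from the original project):

```python
import json

class JsonArrayPipeline(object):
    """Collect items and write one valid JSON array when the spider closes."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        with open('data_cn.json', 'w', encoding='utf-8') as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)

# The hooks can be exercised without Scrapy, since they only need dict-like items:
pipe = JsonArrayPipeline()
pipe.open_spider(None)
pipe.process_item({"title": "押金8000元"}, None)
pipe.close_spider(None)
```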

The commented-out lines show an alternative approach using codecs. The key point is that once the pipeline is enabled in settings, Scrapy calls process_item automatically for every item, so we can save the data in any format we want.

Now run this command in the terminal:

scrapy crawl huxiu --nolog

If you still pass -o file.json, both file.json and the file defined in the pipeline are generated, but the JSON in file.json is still \u-escaped.

3. Going further

From the analysis above we can draw another conclusion: ITEM_PIPELINES in settings controls the pipelines. So what if we enable a few more:

ITEM_PIPELINES = {
   'coolscrapy.pipelines.CoolscrapyPipeline': 300,
   'coolscrapy.pipelines.CoolscrapyPipeline1': 301,   # give each pipeline a distinct value; lower runs first
}
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
# import codecs

class CoolscrapyPipeline(object):
    # def __init__(self):
        # self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item),ensure_ascii=False) + '\n'
        # self.file.write(line)

        with open('data_cn1.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',\n')
        return item
class CoolscrapyPipeline1(object):

    def process_item(self, item, spider):
        with open('data_cn2.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',hehe\n')
        return item

Run:

$ scrapy crawl huxiu --nolog
$ head -n 2 data_cn*
==> data_cn1.json <==
{"title": "押金8000元,共享女友門檻不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},
{"title": "張嘴,騰訊要喂你吃葯了", "link": "/article/214879.html", "desc": "“耳旁回盪着Pony馬的教誨:好好用腦子想想,不充錢,你們會變強嗎?”"},

==> data_cn2.json <==
{"title": "押金8000元,共享女友門檻不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},hehe
{"title": "張嘴,騰訊要喂你吃葯了", "link": "/article/214879.html", "desc": "“耳旁回盪着Pony馬的教誨:好好用腦子想想,不充錢,你們會變強嗎?”"},hehe

As you can see, both files were generated, and in exactly the format we wanted!
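What happens above can be pictured as a chain: Scrapy sorts the enabled pipelines by their ITEM_PIPELINES value (lower runs first) and passes whatever each process_item returns on to the next pipeline. A minimal simulation of that chain, without Scrapy itself (class names and the tag field are illustrative):

```python
class AddTagPipeline(object):
    """First in the chain: mutates the item before passing it on."""
    def process_item(self, item, spider):
        item['tag'] = 'seen-by-first'
        return item

class CollectPipeline(object):
    """Second in the chain: records what it receives."""
    def __init__(self):
        self.collected = []
    def process_item(self, item, spider):
        self.collected.append(dict(item))
        return item

# Emulate ITEM_PIPELINES = {AddTag: 300, Collect: 301}: lower value runs first
pipelines = [AddTagPipeline(), CollectPipeline()]

item = {'title': '押金8000元'}
for pipe in pipelines:
    item = pipe.process_item(item, spider=None)

print(item)  # the second pipeline saw the tag added by the first
```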

 

