Basic Scrapy usage
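1. Create a project:
scrapy startproject <project name>
For example:
scrapy startproject baike
On Windows, open cmd and change into the directory that should contain the project, for example:
d:\pythonCode\spiderProject>scrapy startproject baidubaike
This creates a project named baidubaike.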
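2. Create a spider from the command line:
scrapy genspider <spider name> <domain to crawl>
scrapy genspider baike baike.baidu.com
Note: the spider name must not be the same as the project name.
d:\pythonCode\spiderProject\baidubaike>scrapy genspider baike baike.baidu.com
After the command runs, baike.py is generated under d:\pythonCode\spiderProject\baidubaike\baidubaike\spiders\.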
3. Edit baike.py
import scrapy
from baidubaike.items import BaidubaikeItem
# The next two imports are not used in the code below
from scrapy.http.response.html import HtmlResponse
from scrapy.selector.unified import SelectorList


class BaikeSpider(scrapy.Spider):
    # Spider name, used with `scrapy crawl`
    name = 'baike'
    # Domains the spider is allowed to crawl
    allowed_domains = ['baike.baidu.com']
    # URLs the crawl starts from
    start_urls = ['https://baike.baidu.com/art/%E6%8B%8D%E5%8D%96%E8%B5%84%E8%AE%AF']

    def parse(self, response):
        # Each <ul> inside the list container holds a group of entries
        uls = response.xpath("//div[@class='list-content']/ul")
        for ul in uls:
            lis = ul.xpath(".//li")
            # print(lis)
            for li in lis:
                # .get() returns the first match, or None if nothing matched
                title = li.xpath(".//a/text()").get()
                time = li.xpath(".//span/text()").get()
                item = BaidubaikeItem(title=title, time=time)
                yield item
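The nested ul/li selection in parse() can be tried offline before running a crawl. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in, not Scrapy's own selectors; ElementTree supports only a small XPath subset (no text()), so the text is read through the .text attribute instead. The SAMPLE markup and the extract_rows helper are hypothetical and only mimic the structure the spider expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that mimics the page structure parse() expects.
SAMPLE = """
<html><body>
  <div class="list-content">
    <ul>
      <li><a href="/item/1">First headline</a><span>2019-09-01</span></li>
      <li><a href="/item/2">Second headline</a><span>2019-09-02</span></li>
    </ul>
  </div>
</body></html>
"""


def extract_rows(markup):
    """Return (title, time) pairs using the same nesting as parse()."""
    root = ET.fromstring(markup.strip())
    rows = []
    for ul in root.findall(".//div[@class='list-content']/ul"):
        for li in ul.findall(".//li"):
            a = li.find(".//a")
            span = li.find(".//span")
            # Fall back to None when an element is missing, like .get() does
            title = a.text if a is not None else None
            time_text = span.text if span is not None else None
            rows.append((title, time_text))
    return rows


rows = extract_rows(SAMPLE)
for title, time_text in rows:
    print(title, time_text)
```

Real pages are rarely well-formed XML, so this is only useful for checking the selection logic; inside Scrapy the response.xpath() calls shown above remain the way to go.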
4. Edit items.py
import scrapy


class BaidubaikeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    time = scrapy.Field()
5. Edit settings.py
1) Enable (uncomment) DEFAULT_REQUEST_HEADERS
Change it as follows:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
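2) Change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False
Note:
The generated settings.py sets this to True, which means Scrapy obeys the rules in robots.txt.
Setting it to False makes Scrapy ignore the robots.txt rules.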
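3) Enable ITEM_PIPELINES
ITEM_PIPELINES = {
    'baidubaike.pipelines.BaidubaikePipeline': 300,
}
Here ITEM_PIPELINES is a dict: each key is the path of an item pipeline class to enable, and each value is its priority. Item pipelines are called in priority order; the smaller the value, the higher the priority, so the earlier the pipeline runs.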
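The ordering rule for ITEM_PIPELINES can be illustrated with plain Python. This is only a sketch of the rule as described above, not Scrapy's internal implementation; the two extra pipeline paths and the call_order helper are hypothetical and do not exist in the project.

```python
# Only BaidubaikePipeline comes from this tutorial; the other two class
# paths are made up to show how the values determine the call order.
ITEM_PIPELINES = {
    'baidubaike.pipelines.BaidubaikePipeline': 300,
    'baidubaike.pipelines.DropEmptyTitlePipeline': 100,   # hypothetical
    'baidubaike.pipelines.StripWhitespacePipeline': 200,  # hypothetical
}


def call_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path for path, _priority in ordered]


order = call_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

With these made-up values the pipeline that writes the file runs last, which is the usual arrangement when earlier pipelines clean or drop items.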
6. Edit pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Approach 1: serialize each item by hand, one JSON object per line
# import json
#
# class BaidubaikePipeline(object):
#     def __init__(self):
#         self.fp = open('baike.json', 'w', encoding='utf-8')
#
#     def open_spider(self, spider):
#         print('Spider started...')
#
#     def process_item(self, item, spider):
#         item_json = json.dumps(dict(item), ensure_ascii=False)
#         self.fp.write(item_json + '\n')
#         return item
#
#     def close_spider(self, spider):
#         self.fp.close()
#         print('Spider finished...')
#
# Approach 2: JsonItemExporter writes all items as a single JSON array
# from scrapy.exporters import JsonItemExporter
#
# class BaidubaikePipeline(object):
#     def __init__(self):
#         # The exporter writes bytes, so the file is opened in binary mode
#         self.fp = open('baike.json', 'wb')
#         self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
#         self.exporter.start_exporting()
#
#     def open_spider(self, spider):
#         print('Spider started...')
#
#     def process_item(self, item, spider):
#         self.exporter.export_item(item)
#         return item
#
#     def close_spider(self, spider):
#         self.exporter.finish_exporting()
#         self.fp.close()
#         print('Spider finished...')
#
# Approach 3: JsonLinesItemExporter writes one JSON object per line
from scrapy.exporters import JsonLinesItemExporter


class BaidubaikePipeline(object):
    def __init__(self):
        # The exporter writes bytes, so the file is opened in binary mode
        self.fp = open('baike.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print('Spider started...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished...')
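A pipeline is an ordinary Python class, so its open_spider / process_item / close_spider methods can be exercised by hand without starting a crawl. The sketch below is a hypothetical variant of the first approach (hand-written JSON, one object per line) that opens the file in open_spider instead of __init__; the class name JsonLinesPipeline, the path parameter, and the run_offline helper are assumptions made for illustration, and plain dicts stand in for BaidubaikeItem.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of approach 1 that opens the file in open_spider."""

    def __init__(self, path='baike.json'):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the output file
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()


def run_offline(items):
    """Push items through the pipeline by hand and return the file contents."""
    path = os.path.join(tempfile.mkdtemp(), 'baike.json')
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    try:
        for item in items:
            pipeline.process_item(item, spider=None)
    finally:
        # Close the file even if an item fails to serialize
        pipeline.close_spider(spider=None)
    with open(path, encoding='utf-8') as fp:
        return fp.read()


output = run_offline([{'title': '拍卖资讯', 'time': '2019-09-01'}])
print(output)
```

Opening the file in open_spider keeps file handling tied to the crawl lifecycle; whether that matters for a project this small is a judgment call.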
7. Run the spider
scrapy crawl <spider name>
d:\pythonCode\spiderProject\baidubaike\baidubaike>scrapy crawl baike
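Assuming the third pipeline variant was used, baike.json contains one JSON object per line rather than a single array, so calling json.load on the whole file generally fails once more than one item has been written. A minimal sketch for reading the result back is shown here; the load_json_lines helper is hypothetical, and the sample file is a stand-in for a real crawl result so that the code runs without Scrapy.

```python
import json
import os
import tempfile


def load_json_lines(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    items = []
    with open(path, encoding='utf-8') as fp:
        for line in fp:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Stand-in for the file a real crawl would produce.
sample_path = os.path.join(tempfile.mkdtemp(), 'baike.json')
with open(sample_path, 'w', encoding='utf-8') as fp:
    fp.write('{"title": "First headline", "time": "2019-09-01"}\n')
    fp.write('{"title": "Second headline", "time": "2019-09-02"}\n')

items = load_json_lines(sample_path)
print(len(items), items[0]['title'])
```

For the real output, the call would be load_json_lines('baike.json') from the directory in which the crawl was started.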