1. A Scrapy project with a single spider (the common case)
items setup
class UserInfoItem(scrapy.Item):
    uid = scrapy.Field()      # user ID
    name = scrapy.Field()     # user name
    general = scrapy.Field()  # user gender

    def get_insert_sql(self):
        insert_sql = """insert into userinfo (uid, name, general) values (%s, %s, %s)"""
        params = (self["uid"], self["name"], self["general"])
        return insert_sql, params

    def distinct_data(self):
        query = """select uid from userinfo where uid=%s"""
        params = (self["uid"],)
        return query, params
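The two methods can be exercised without a running crawler. Here is a minimal, Scrapy-free sketch in which a plain dict subclass stands in for scrapy.Item (the table and field names are illustrative):

```python
# Stand-in for the scrapy.Item pattern above; no Scrapy required.
class UserInfoStub(dict):
    def get_insert_sql(self):
        insert_sql = "insert into userinfo (uid, name, general) values (%s, %s, %s)"
        params = (self["uid"], self["name"], self["general"])
        return insert_sql, params

    def distinct_data(self):
        query = "select uid from userinfo where uid=%s"
        params = (self["uid"],)  # note: one-element tuple, not a bare value
        return query, params

item = UserInfoStub(uid=1, name="alice", general="f")
sql, params = item.get_insert_sql()
print(params)  # (1, 'alice', 'f')
```

The pipeline never needs to know which table or columns are involved; it only calls these two methods and passes the results to the cursor.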
pipeline setup
import pymysql
from scrapy.utils.project import get_project_settings

class UserinfoPipline(object):
    def __init__(self):
        self.settings = get_project_settings()
        self.host = self.settings['MYSQL_HOST']
        self.port = self.settings['MYSQL_PORT']
        self.user = self.settings['MYSQL_USER']
        self.passwd = self.settings['MYSQL_PASSWD']
        self.db = self.settings['MYSQL_DBNAME']
        # connect to the database
        self.connect = pymysql.connect(host=self.host, port=self.port, db=self.db,
                                       user=self.user, passwd=self.passwd,
                                       charset='utf8', use_unicode=True)
        # run inserts and queries through a cursor
        self.cursor = self.connect.cursor()
    def process_item(self, item, spider):
        try:
            sql, params = item.distinct_data()
            self.cursor.execute(sql, params)
            data = self.cursor.fetchone()
            if not data:
                # the row is not present yet, so insert it
                sql, params = item.get_insert_sql()
                self.cursor.execute(sql, params)
                self.connect.commit()
        except Exception as e:
            spider.logger.error("insert failed: %s", e)
            self.connect.rollback()
        return item
settings setup
ITEM_PIPELINES = {
'UserInfo.piplines.UserinfoPipline': 300,
}
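The check-then-insert flow in process_item can be tried end to end with sqlite3 from the standard library; sqlite3 uses ? placeholders instead of pymysql's %s, and the table here is illustrative:

```python
import sqlite3

# In-memory database standing in for MySQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table userinfo (uid integer, name text, general text)")

def upsert(item):
    # Same flow as process_item: look the uid up first, insert only if absent.
    cur.execute("select uid from userinfo where uid=?", (item["uid"],))
    if cur.fetchone() is None:
        cur.execute("insert into userinfo (uid, name, general) values (?, ?, ?)",
                    (item["uid"], item["name"], item["general"]))
        conn.commit()

upsert({"uid": 1, "name": "alice", "general": "f"})
upsert({"uid": 1, "name": "alice", "general": "f"})  # duplicate, skipped
cur.execute("select count(*) from userinfo")
print(cur.fetchone()[0])  # 1
```

Note that this per-row SELECT is the simplest dedup strategy; with a unique index on uid, an `insert ignore` (MySQL) would achieve the same in one statement.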
2. Multiple spiders in one project, writing to different tables in the same database
As shown below, when a single Scrapy project contains several spiders, the usual approach is to test the item type inside the pipeline and run the matching database operation for each item:
pipeline setup
    def do_insert(self, cursor, item):
        # run the actual insert:
        # build a different SQL statement for each item type and write it to MySQL
        if isinstance(item, UserInfoItem):
            pass  # insert user info
        elif isinstance(item, FansInfoItem):
            pass  # insert fan info
This method grows ever more bloated as spiders and items multiply. A better approach is to keep each spider's items in their own items module, with every item carrying its own database-insert methods.
items setup
class FansInfoItem(scrapy.Item):
    fan_id = scrapy.Field()
    fan_name = scrapy.Field()
    fan_time = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """insert into fan_table (fan_id, fan_name, fan_time) values (%s, %s, %s)"""
        params = (self["fan_id"], self["fan_name"], self["fan_time"])
        return insert_sql, params

    def distinct_data(self):
        query = """select fan_id from fan_table where fan_id=%s"""
        params = (self["fan_id"],)
        return query, params
This makes the pipeline far more generic.
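With every item supplying its own SQL, the pipeline needs no isinstance chain at all; it just calls the two methods on whatever item arrives. A runnable sketch (the pipeline class name is hypothetical, and the stub item uses sqlite3's ? placeholders instead of pymysql's %s):

```python
import sqlite3

class GenericPipeline:
    """Duck-typed pipeline: any item exposing distinct_data()/get_insert_sql()
    goes through the same code path, regardless of its concrete type."""
    def __init__(self, connection):
        self.connect = connection
        self.cursor = connection.cursor()

    def process_item(self, item, spider=None):
        sql, params = item.distinct_data()
        self.cursor.execute(sql, params)
        if self.cursor.fetchone() is None:
            sql, params = item.get_insert_sql()
            self.cursor.execute(sql, params)
            self.connect.commit()
        return item

# Stub item standing in for FansInfoItem.
class FanStub(dict):
    def get_insert_sql(self):
        return ("insert into fan_table (fan_id, fan_name) values (?, ?)",
                (self["fan_id"], self["fan_name"]))
    def distinct_data(self):
        return ("select fan_id from fan_table where fan_id=?", (self["fan_id"],))

conn = sqlite3.connect(":memory:")
conn.execute("create table fan_table (fan_id integer, fan_name text)")
pipe = GenericPipeline(conn)
pipe.process_item(FanStub(fan_id=7, fan_name="x"))
pipe.process_item(FanStub(fan_id=7, fan_name="x"))  # duplicate, skipped
print(conn.execute("select count(*) from fan_table").fetchone()[0])  # 1
```

Adding a new spider now only means adding a new item class; the pipeline stays untouched.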
3. Multiple spiders in one project, writing to different tables in different databases
Sometimes the data must go into different databases, yet settings.py can only describe one. In that case use the custom_settings attribute to assign each spider its own pipeline; note that this requires Scrapy 1.1 or later.
class Test1(scrapy.Spider):
    name = "test1"
    custom_settings = {
        'ITEM_PIPELINES': {'xxxx.piplines.TestPipeline1': 301},
    }

class Test2(scrapy.Spider):
    name = "test2"
    custom_settings = {
        'ITEM_PIPELINES': {'xxxx.piplines.TestPipeline2': 302},
    }
Configure the pipelines in settings.py:
ITEM_PIPELINES = {
'xxxx.piplines.TestPipeline1': 301,
'xxxx.piplines.TestPipeline2': 302
}
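Each routed pipeline then reads its own connection details. One common approach, with hypothetical key names, is to give every database a prefixed group of keys in settings.py and let each pipeline collect its own:

```python
# Hypothetical per-database settings; the key names are illustrative only.
SETTINGS = {
    "MYSQL1_HOST": "db1.example.com", "MYSQL1_DBNAME": "userinfo_db",
    "MYSQL2_HOST": "db2.example.com", "MYSQL2_DBNAME": "fans_db",
}

def connection_kwargs(settings, prefix):
    """Collect the connection parameters for one database by key prefix."""
    return {
        "host": settings[f"{prefix}_HOST"],
        "db": settings[f"{prefix}_DBNAME"],
    }

# TestPipeline1 would use prefix "MYSQL1", TestPipeline2 "MYSQL2",
# and pass the resulting dict to pymysql.connect(**kwargs).
print(connection_kwargs(SETTINGS, "MYSQL1")["db"])  # userinfo_db
```

Since custom_settings replaces ITEM_PIPELINES per spider, each spider's items flow only through its own pipeline and therefore into its own database.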
