scrapy startproject ZuCai
This automatically creates two nested ZuCai folders.
cd ZuCai
cd ZuCai
This puts you in the innermost ZuCai folder.
scrapy genspider zucai trade.500.com/jczq/
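Start by analyzing the page https://trade.500.com/jczq/
Open the page and press F12 to inspect the HTML. A search shows that all match results sit inside <table class="bet-tb bet-tb-dg">...</table>. Reading further down,
each match occupies one tr. So we can select all the tr elements and then loop over them to pull out the fields we need.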
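The "select every tr, then query inside each one" idea can be tried without Scrapy. The sketch below uses the standard library's ElementTree in place of Scrapy selectors, and the markup is a made-up, simplified fragment (SAMPLE and parse_rows are hypothetical names, and the real page has many more columns), so treat it as an illustration of the loop rather than the spider itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; not copied from the real page.
SAMPLE = """
<table class="bet-tb bet-tb-dg">
  <tr>
    <td class="td td-evt"><a>League A</a></td>
    <td class="td td-endtime">05-13 19:00</td>
  </tr>
  <tr>
    <td class="td td-evt"><a>League B</a></td>
    <td class="td td-endtime">05-13 21:30</td>
  </tr>
</table>
"""


def parse_rows(markup):
    """Return one dict per <tr>, using paths relative to each row."""
    table = ET.fromstring(markup)
    rows = []
    for tr in table.findall(".//tr"):
        rows.append({
            # Paths are relative to the current row, not to the whole table
            "League": tr.findtext("td[@class='td td-evt']/a"),
            "Time": tr.findtext("td[@class='td td-endtime']"),
        })
    return rows


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

The point to carry over to the spider is that each field is looked up relative to the current row, which is what the leading dot in './/td[...]' does in the Scrapy XPath expressions.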
First, define the fields to scrape in items.py
import scrapy


class ZucaiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    League = scrapy.Field()     # competition
    Time = scrapy.Field()       # time
    Home_team = scrapy.Field()  # home team
    Away_team = scrapy.Field()  # away team
    Result = scrapy.Field()     # match result
    Win = scrapy.Field()        # odds for a home win
    Level = scrapy.Field()      # odds for a draw
    Negative = scrapy.Field()   # odds for a home loss
Then write zucai.py
def parse(self, response):
    datas = response.xpath('//div[@class="bet-main bet-main-dg"]/table/tbody/tr')
    for data in datas:
        item = ZucaiItem()
        item['League'] = data.xpath('.//td[@class="td td-evt"]/a/text()').extract()[0]
        item['Time'] = data.xpath('.//td[@class="td td-endtime"]/text()').extract()[0]
        item['Home_team'] = data.xpath('.//span[@class="team-l"]/a/text()').extract()[0]
        item['Result'] = data.xpath('.//i[@class="team-vs team-bf"]/a/text()').extract()[0]
        item['Away_team'] = data.xpath('.//span[@class="team-r"]/a/text()').extract()[0]
        item['Win'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[1]/span/text()').extract()[0]
        item['Level'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[2]/span/text()').extract()[0]
        item['Negative'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[3]/span/text()').extract()[0]
        yield item
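Running this sometimes fails with a "list index out of range" error (IndexError). When that happens, replace the corresponding extract()[0] with extract_first(). The difference between the two is covered later.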
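The error comes from the fact that extract() returns a plain list, which is empty when the XPath matches nothing, while extract_first() returns None (or a default you pass) in that case. The sketch below reproduces both behaviours with ordinary lists, so it runs without Scrapy. first_or_default is a hypothetical helper written only for this illustration, and the sample values are invented.

```python
def first_or_default(values, default=None):
    """Hypothetical helper: behave like extract_first() on a list of matches."""
    return values[0] if values else default


matched = ["League A"]  # shape of extract() when the XPath matches one node
unmatched = []          # shape of extract() when nothing matches

print(first_or_default(matched))
print(first_or_default(unmatched))

try:
    unmatched[0]  # same failure mode as extract()[0] on an empty result
except IndexError as error:
    print("IndexError:", error)
```

In the spider this means a row that lacks one of the cells yields an item with None in that field instead of stopping the parse with an exception.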
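The scraped data now needs to be stored in a MySQL database.
First install MySQL locally, then create a database and a table. The table's columns should match the fields in items.py so that the scraped data can be inserted without problems.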
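One detail worth checking when creating the table: the INSERT in the pipeline does not name its columns, so the table's column order has to follow the order in which the values are inserted (League, Time, Home_team, Result, Away_team, Win, Level, Negative), which is not quite the order in items.py. The sketch below builds a CREATE TABLE statement in that order. The VARCHAR(64) type and the create_table_sql helper are assumptions, and the statement is only sanity-checked against an in-memory SQLite database standing in for MySQL.

```python
import sqlite3

# Column order follows the order of the values in the pipeline's INSERT.
COLUMNS = ["League", "Time", "Home_team", "Result",
           "Away_team", "Win", "Level", "Negative"]


def create_table_sql(table="jcai", column_type="VARCHAR(64)"):
    """Build a CREATE TABLE statement; the column type is an assumption."""
    cols = ", ".join(f"{name} {column_type}" for name in COLUMNS)
    return f"CREATE TABLE {table} ({cols})"


ddl = create_table_sql()
print(ddl)

# Sanity check with in-memory SQLite (a stand-in, not the MySQL server).
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
created = [row[1] for row in conn.execute("PRAGMA table_info(jcai)")]
conn.close()
print(created)
```

The printed statement can be pasted into a MySQL client after adjusting the column types to taste.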
Then write the code that saves items to the database in pipelines.py
import logging

import pymysql


class ZucaiPipeline(object):
    def __init__(self):
        # Connection settings for the local MySQL instance
        self.connect = pymysql.connect(host='localhost', user='root', password='123456', db='douban', port=3306)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            # Pass the values as parameters so the driver handles quoting
            sql = 'insert into jcai values (%s, %s, %s, %s, %s, %s, %s, %s)'
            self.cursor.execute(sql, (item['League'], item['Time'], item['Home_team'], item['Result'], item['Away_team'], item['Win'], item['Level'], item['Negative']))
            self.connect.commit()
        except Exception as error:
            # logging.log() requires a level argument, so use logging.error()
            logging.error(error)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider() automatically when the spider finishes
        self.cursor.close()
        self.connect.close()
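A note on how the values reach the SQL statement: formatting them directly into the string produces invalid SQL as soon as a team or league name contains a quote, whereas parameter binding sends the values separately from the SQL text. The sketch below shows the difference using sqlite3 as a stand-in so it runs without a MySQL server. The two-column table and the item values are invented, and note that sqlite3 uses ? placeholders where PyMySQL uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jcai (League TEXT, Home_team TEXT)")

# Hypothetical item whose League value contains double quotes.
item = {"League": 'Cup "A"', "Home_team": "Team X"}

# Formatting the values into the SQL text breaks on the embedded quotes.
bad_sql = 'insert into jcai values ("{}","{}")'.format(item["League"], item["Home_team"])
try:
    conn.execute(bad_sql)
    formatted_ok = True
except sqlite3.Error:
    formatted_ok = False

# Parameter binding keeps the values out of the SQL text entirely.
conn.execute("insert into jcai values (?, ?)", (item["League"], item["Home_team"]))
stored = conn.execute("SELECT League, Home_team FROM jcai").fetchall()
conn.close()

print(formatted_ok, stored)
```

With PyMySQL the equivalent call is cursor.execute(sql, params) with %s placeholders in sql.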
Finally, edit settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'ZuCai.pipelines.ZucaiPipeline': 300,
}
Changing these three settings is enough. The complete zucai.py looks like this:
import scrapy

from ZuCai.items import ZucaiItem


class ZucaiSpider(scrapy.Spider):
    name = 'zucai'
    # allowed_domains takes bare domain names, without a path
    allowed_domains = ['trade.500.com']
    start_urls = ['https://trade.500.com/jczq/?date=2019-05-13']

    def parse(self, response):
        # One tr per match
        datas = response.xpath('//div[@class="bet-main bet-main-dg"]/table/tbody/tr')
        for data in datas:
            item = ZucaiItem()
            item['League'] = data.xpath('.//td[@class="td td-evt"]/a/text()').extract()[0]
            item['Time'] = data.xpath('.//td[@class="td td-endtime"]/text()').extract()[0]
            item['Home_team'] = data.xpath('.//span[@class="team-l"]/a/text()').extract()[0]
            item['Result'] = data.xpath('.//i[@class="team-vs team-bf"]/a/text()').extract()[0]
            item['Away_team'] = data.xpath('.//span[@class="team-r"]/a/text()').extract()[0]
            item['Win'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[1]/span/text()').extract()[0]
            item['Level'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[2]/span/text()').extract()[0]
            item['Negative'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[3]/span/text()').extract()[0]
            yield item
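That completes scraping the match results from a single page.
Then cd into the ZuCai project folder
and run scrapy crawl zucai. The corresponding table in the database should now contain the scraped data.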