In this article we use scraping historical weather data as an example to briefly introduce two ways of collecting data:
1. For simple or relatively small data needs, scrape with requests (or Selenium) plus BeautifulSoup.
2. When a larger amount of data is needed, the Scrapy framework is recommended: Scrapy issues its requests asynchronously, so crawling is much faster.
Below we walk through both approaches, using the data at http://www.tianqihoubao.com/lishi/ as the example:
1. Collect the weather data with requests + BeautifulSoup and store it in MySQL
Approach:
The weather data we want lives at URLs such as http://www.tianqihoubao.com/lishi/beijing/month/201101.html. Looking at the URL, only two parts change: the city name and the year/month. Since every year contains the same 12 months, we can build the month part with months = list(range(1, 13)) and treat the city name and year as variables; that gives us the full list of URLs to crawl. We then iterate over the list, request each URL, and parse the response to get the data.
That is the overall plan; the first step is to construct the URLs.
def get_url(cityname, start_year, end_year):
    years = list(range(start_year, end_year))
    months = list(range(1, 13))
    suburl = 'http://www.tianqihoubao.com/lishi/'
    urllist = []
    for year in years:
        for month in months:
            if month < 10:
                url = suburl + cityname + '/month/' + str(year) + (str(0) + str(month)) + '.html'
            else:
                url = suburl + cityname + '/month/' + str(year) + str(month) + '.html'
            urllist.append(url.strip())
    return urllist
This function returns the list of URLs to be crawled.
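For example, calling it for Beijing with the years 2016 to 2019 (the end year is exclusive because of range()) yields 36 monthly URLs:

urllist = get_url('beijing', 2016, 2019)
print(len(urllist))    # 36 (three years x 12 months)
print(urllist[0])      # http://www.tianqihoubao.com/lishi/beijing/month/201601.html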
As you can see, the function above takes a cityname parameter: the name of the city whose data we want, which has to be prepared by hand. Suppose we have already built the list of city names and stored it in a MySQL database; we then query the database for the city names, iterate over them, and pass each name together with the start and end year to the function above. The helper below looks up the city id that goes with a given URL in the same city table:
def get_cityid(db_conn, db_cur, url):
    # The city name is the 5th segment of the URL, e.g. 'beijing' in
    # http://www.tianqihoubao.com/lishi/beijing/month/201101.html
    suburl = url.split('/')
    sql = 'select cityid from city where cityname = %s '
    db_cur.execute(sql, suburl[4])
    cityid = db_cur.fetchone()
    idlist = list(cityid)
    return idlist[0]
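The main block at the end of this section also calls a get_cityname helper to read the full list of city names from the same city table. It is not shown in this article (the complete version is in the linked repository), but a minimal sketch, assuming a pymysql connection and a city table with a cityname column, could look like this:

def get_cityname(db_conn, db_cur):
    # Sketch only: read every city name from the city table
    sql = 'select cityname from city'
    db_cur.execute(sql)
    rows = db_cur.fetchall()
    return [row[0] for row in rows]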
With the city id, the start and end years, and the generated URL list in hand, the next step is of course to request each URL and parse the returned HTML. Here we parse the page source with BeautifulSoup; the code is as follows:
def parse_html_bs(db_conn, db_cur, url):
    proxy = get_proxy()
    proxies = {
        'http': 'http://' + proxy,
        'https': 'https://' + proxy,
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Connection': 'close'
    }

    # Fetch the HTML source of the weather page
    weather_data = requests.get(url=url, headers=headers, proxies=proxies).text
    weather_data_new = (weather_data.replace('\n', '').replace('\r', '').replace(' ', ''))
    soup = BeautifulSoup(weather_data_new, 'lxml')
    table = soup.find_all(['td'])
    # Look up the city id
    cityid = get_cityid(db_conn, db_cur, url)
    listall = []
    for t in list(table):
        ts = t.string
        listall.append(ts)
    # Group the flat cell list into rows of four (date, weather, temperature, wind)
    n = 4
    sublist = [listall[i:i + n] for i in range(0, len(listall), n)]
    # Drop the header row
    sublist.remove(sublist[0])
    flist = []
    # Split the high/low temperatures into separate fields for later analysis,
    # and prepend the city id
    for sub in sublist:
        sub2 = sub[2].split('/')
        sub.remove(sub[2])
        sub.insert(2, sub2[0])
        sub.insert(3, sub2[1])
        sub.insert(0, cityid)  # insert the city id
        flist.append(sub)
    return flist
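The get_proxy() helper used above is not shown in this article; it is assumed to return one usable ip:port proxy from a proxy pool. A minimal sketch, assuming the same local proxy-pool service that the Scrapy middleware below queries (http://localhost:5000/random/), might be:

import requests

def get_proxy():
    # Assumption: a proxy-pool service runs locally and returns one 'ip:port' per request
    return requests.get('http://localhost:5000/random/').text.strip()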
Finally, in the main block we iterate over the URL list and store the parsed results in the MySQL database.
if __name__ == '__main__':
    citylist = get_cityname(db_conn, db_cur)
    for city in citylist:
        urllist = get_url(city, 2016, 2019)
        for url in urllist:
            time.sleep(1)
            flist = parse_html_bs(db_conn, db_cur, url)
            for li in flist:
                tool.dyn_insert_sql('weather', tuple(li), db_conn, db_cur)
                time.sleep(1)
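The tool.dyn_insert_sql call above comes from a small helper module in the complete code (linked below) that builds the INSERT statement dynamically; it is not shown here. A minimal sketch of what such a helper could look like, assuming a pymysql connection and that the number of values matches the columns of the weather table, is:

def dyn_insert_sql(table_name, data_tuple, db_conn, db_cur):
    # Hypothetical helper: build an INSERT with one placeholder per value and commit
    placeholders = ', '.join(['%s'] * len(data_tuple))
    sql = 'insert into ' + table_name + ' values (' + placeholders + ')'
    db_cur.execute(sql, data_tuple)
    db_conn.commit()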
That completes the code for scraping historical weather data with requests + BeautifulSoup and storing it in MySQL. The complete code is at: https://gitee.com/liangxinbin/Scrpay/blob/master/weatherData.py
2. Collect the weather data with the Scrapy framework and store it in MongoDB
1) Define the data structure to scrape by editing the framework's items.py file
class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityname = Field()   # city name
    data = Field()       # date
    tq = Field()         # weather
    maxtemp = Field()    # maximum temperature
    mintemp = Field()    # minimum temperature
    fengli = Field()     # wind force
2) Modify the downloader middleware to pick a random user-agent and proxy IP for each request
class RandomUserAgentMiddleware():
    def __init__(self, UA):
        self.user_agents = UA

    @classmethod
    def from_crawler(cls, crawler):
        # MY_USER_AGENT is configured in the settings file and read through this class method
        return cls(UA=crawler.settings.get('MY_USER_AGENT'))

    def process_request(self, request, spider):
        # Pick a random User-Agent for every request
        request.headers['User-Agent'] = random.choice(self.user_agents)

    def process_response(self, request, response, spider):
        return response


class ProxyMiddleware():
    def __init__(self):
        # This address returns a random usable proxy from the proxy pool
        ipproxy = requests.get('http://localhost:5000/random/')
        self.random_ip = 'http://' + ipproxy.text

    def process_request(self, request, spider):
        print(self.random_ip)
        request.meta['proxy'] = self.random_ip

    def process_response(self, request, response, spider):
        return response
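Both middlewares, and the MY_USER_AGENT list that the first one reads, have to be registered in settings.py before they take effect. A possible configuration (the priority numbers, module path, and user-agent strings are only illustrative) is:

# settings.py (excerpt)
MY_USER_AGENT = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
]

DOWNLOADER_MIDDLEWARES = {
    'scrapymodel.middlewares.RandomUserAgentMiddleware': 543,
    'scrapymodel.middlewares.ProxyMiddleware': 544,
}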
3) Modify the pipeline file to process the items returned by the spider
import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_url, mongo_db, collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URL, MONGO_DB, and COLLECTION are configured in the settings file
        # and read through this class method
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # name = item.__class__.collection
        name = self.collection
        self.db[name].insert(dict(item))  # insert the data into MongoDB
        return item

    def close_spider(self, spider):
        self.client.close()
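The pipeline and its MongoDB settings also need to be registered in settings.py; for example (the values shown are placeholders):

# settings.py (excerpt)
ITEM_PIPELINES = {
    'scrapymodel.pipelines.MongoPipeline': 300,
}
MONGO_URL = 'mongodb://localhost:27017'
MONGO_DB = 'weather'
COLLECTION = 'weather'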
4) Finally, and most importantly, write the spider that parses the data. The code is below; the inline comments explain each step.
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from scrapy import Request
from lxml import etree
from scrapymodel.items import WeatherItem


class WeatherSpider(scrapy.Spider):
    name = 'weather'    # the spider's name, which must be unique within the project
    # allowed_domains = ['tianqihoubao']
    start_urls = ['http://www.tianqihoubao.com/lishi/']    # the crawl starts here; its response is passed to parse() by default

    # Parse http://www.tianqihoubao.com/lishi/ and extract links of the form
    # http://www.tianqihoubao.com/lishi/beijing.html
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        citylists = soup.find_all(name='div', class_='citychk')
        for citys in citylists:
            for city in citys.find_all(name='dd'):
                url = 'http://www.tianqihoubao.com' + city.a['href']
                # Yield a Request object; the framework schedules it and hands the
                # response to parse_citylist() for parsing
                yield Request(url=url, callback=self.parse_citylist)

    # Parse http://www.tianqihoubao.com/lishi/beijing.html and extract links of the form
    # http://www.tianqihoubao.com/lishi/tianjin/month/201811.html
    def parse_citylist(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        monthlist = soup.find_all(name='div', class_='wdetail')
        for months in monthlist:
            for month in months.find_all(name='li'):
                if month.text.endswith("季度:"):
                    continue
                else:
                    url = month.a['href']
                    url = 'http://www.tianqihoubao.com' + url
                    # The monthly pages are scheduled and parsed by parse_weather()
                    yield Request(url=url, callback=self.parse_weather)

    # Parse the weather data with XPath and hand the items to the pipeline
    def parse_weather(self, response):
        # The city name is part of the URL
        url = response.url
        cityname = url.split('/')[4]

        weather_html = etree.HTML(response.text)
        table = weather_html.xpath('//table//tr//td//text()')
        # Collect all the date-related cells into a flat list
        listall = []
        for t in table:
            if t.strip() == '':
                continue
            # Strip spaces and \r\n from each cell
            t1 = t.replace(' ', '')
            t2 = t1.replace('\r\n', '')
            listall.append(t2.strip())
        # Group the month's flat cell list into one row per day so it can be written to the database
        n = 4
        sublist = [listall[i:i + n] for i in range(0, len(listall), n)]
        # Drop the header row
        sublist.remove(sublist[0])
        # Split the high/low temperatures into separate fields for later analysis,
        # and prepend the city name
        for sub in sublist:
            sub2 = sub[2].split('/')
            sub.remove(sub[2])
            sub.insert(2, sub2[0])
            sub.insert(3, sub2[1])
            sub.insert(0, cityname)

            Weather = WeatherItem()    # use the data structure defined in items.py

            Weather['cityname'] = sub[0]
            Weather['data'] = sub[1]
            Weather['tq'] = sub[2]
            Weather['maxtemp'] = sub[3]
            Weather['mintemp'] = sub[4]
            Weather['fengli'] = sub[5]
            yield Weather
Run the project and the data will be collected. With that, the weather-data scraping project is complete.
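For reference, the spider is started from the project directory with the Scrapy command-line tool, using the name defined in the spider class:

scrapy crawl weather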
The complete project code:
https://gitee.com/liangxinbin/Scrpay/tree/master/scrapymodel