Scrapy in Practice (Part 5): Scraping Historical Weather Data


 

  In this article we use historical weather data as an example to briefly illustrate two approaches to data scraping:

  1. For simple or relatively small data needs, scrape with requests (or Selenium) plus BeautifulSoup.

  2. When a larger volume of data is needed, the Scrapy framework is recommended. Scrapy issues requests asynchronously, so data collection is far more efficient.

 

  Below we use the site http://www.tianqihoubao.com/lishi/ as an example to walk through both approaches:

 

  1. Collecting weather data with requests + BeautifulSoup and storing it in MySQL

 

  Approach:

  The weather data we want lives at URLs such as http://www.tianqihoubao.com/lishi/beijing/month/201101.html. Looking at the URL, only two parts vary: the city name and the year/month. Every year contains a fixed twelve months, so we can build the month list with months = list(range(1, 13)), treat the city name and the year as variables, and generate the full list of URLs to collect. We then iterate over that list, request each URL, and parse the response to extract the data.

  

  That is the overall plan; the first step is to construct the URL list.

  

def get_url(cityname, start_year, end_year):
    # Build the list of monthly URLs for one city, from start_year (inclusive) to end_year (exclusive)
    years = list(range(start_year, end_year))
    months = list(range(1, 13))
    suburl = 'http://www.tianqihoubao.com/lishi/'
    urllist = []
    for year in years:
        for month in months:
            # Months must be zero-padded to two digits, e.g. 201101
            if month < 10:
                url = suburl + cityname + '/month/' + str(year) + str(0) + str(month) + '.html'
            else:
                url = suburl + cityname + '/month/' + str(year) + str(month) + '.html'
            urllist.append(url.strip())
    return urllist

 

      The function above gives us the full list of URLs to scrape.
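  For example (the city slug beijing and the year range follow the URL pattern shown earlier; the printed URL is what the function produces for January 2016):

urllist = get_url('beijing', 2016, 2019)   # 2016-2018, i.e. 36 monthly pages
print(urllist[0])   # http://www.tianqihoubao.com/lishi/beijing/month/201601.html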

 

  Note that get_url takes a cityname argument: the slug of the city we want to scrape, which we have to build ourselves. Assume we have already built the list of city names and stored it in a MySQL database; we then query the database for the city names, iterate over them, and pass each city name together with the start and end years to the function above.
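  The query helper get_cityname used in the main block later is part of the full source. A minimal sketch of what it might look like, assuming a city table with a cityname column (the same table and column the SQL below uses) and pymysql as the MySQL driver:

def get_cityname(db_conn, db_cur):
    # Hypothetical helper: fetch every city slug stored in the city table
    sql = 'select cityname from city'
    db_cur.execute(sql)
    return [row[0] for row in db_cur.fetchall()]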

  Each parsed data row also carries the city's numeric id rather than its name, so we need a helper that maps a URL back to the city id stored in the database:

def get_cityid(db_conn, db_cur, url):
    # The city slug is the fifth segment of the URL, e.g. .../lishi/beijing/month/201101.html
    suburl = url.split('/')
    sql = 'select cityid from city where cityname = %s'
    db_cur.execute(sql, (suburl[4],))
    cityid = db_cur.fetchone()
    return cityid[0]

 

 

  With the city id, the start and end years, and the generated URL list in hand, the next step is of course to request each URL and parse the returned HTML. Here we parse the page source with BeautifulSoup:

import requests
from bs4 import BeautifulSoup


def parse_html_bs(db_conn, db_cur, url):
    # get_proxy() returns a usable proxy address (helper defined in the full source)
    proxy = get_proxy()
    proxies = {
        'http': 'http://' + proxy,
        'https': 'https://' + proxy,
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Connection': 'close'
    }

    # Fetch the HTML source of the weather page
    weather_data = requests.get(url=url, headers=headers, proxies=proxies).text
    weather_data_new = weather_data.replace('\n', '').replace('\r', '').replace(' ', '')
    soup = BeautifulSoup(weather_data_new, 'lxml')
    table = soup.find_all(['td'])
    # Look up the city id
    cityid = get_cityid(db_conn, db_cur, url)
    listall = []
    for t in list(table):
        ts = t.string
        listall.append(ts)
    # Each day occupies four cells: date, weather, temperature, wind
    n = 4
    sublist = [listall[i:i + n] for i in range(0, len(listall), n)]
    # Drop the table header row
    sublist.remove(sublist[0])
    flist = []
    # Split the combined high/low temperature cell for easier analysis later, and prepend the city id
    for sub in sublist:
        sub2 = sub[2].split('/')
        sub.remove(sub[2])
        sub.insert(2, sub2[0])
        sub.insert(3, sub2[1])
        sub.insert(0, cityid)  # prepend the city id
        flist.append(sub)
    return flist
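  parse_html_bs calls get_proxy(), which is defined in the full source. A minimal sketch of such a helper, assuming it pulls a random proxy from the same local proxy-pool service that the Scrapy ProxyMiddleware in part 2 uses (http://localhost:5000/random/ is that assumed endpoint):

import requests

def get_proxy():
    # Hypothetical helper: fetch a random usable proxy (ip:port) from a local proxy pool
    return requests.get('http://localhost:5000/random/').text.strip()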

 

  Finally, in the main block we loop over the URL list and store the parsed results in MySQL:

import time

if __name__ == '__main__':
    # get_cityname, tool.dyn_insert_sql and the db_conn/db_cur objects come from the full source
    citylist = get_cityname(db_conn, db_cur)
    for city in citylist:
        urllist = get_url(city, 2016, 2019)
        for url in urllist:
            time.sleep(1)   # pause between requests to avoid hammering the site
            flist = parse_html_bs(db_conn, db_cur, url)
            for li in flist:
                tool.dyn_insert_sql('weather', tuple(li), db_conn, db_cur)
                time.sleep(1)
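  tool.dyn_insert_sql is also part of the full source and not shown in the article. A minimal sketch of what such a dynamic insert helper might look like (hypothetical; it builds the placeholder list from the number of values so the same helper works for any table):

def dyn_insert_sql(table_name, values, db_conn, db_cur):
    # Hypothetical helper: insert one row, generating %s placeholders from the value count
    placeholders = ','.join(['%s'] * len(values))
    sql = 'insert into {} values ({})'.format(table_name, placeholders)
    db_cur.execute(sql, values)
    db_conn.commit()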

 

 

 

With that, the requests + BeautifulSoup version of the historical weather scraper with MySQL storage is complete. The full code is at https://gitee.com/liangxinbin/Scrpay/blob/master/weatherData.py

 

  2. Collecting weather data with the Scrapy framework and storing it in MongoDB

 

  1) Define the data structure we want to scrape by editing the project's items.py file:

import scrapy
from scrapy import Field


class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityname = Field()   # city name
    data = Field()       # date (field name kept as in the original source)
    tq = Field()         # weather
    maxtemp = Field()    # maximum temperature
    mintemp = Field()    # minimum temperature
    fengli = Field()     # wind force

 

 

  2) Modify the downloader middlewares to pick a random User-Agent and proxy IP for each request:

import random
import requests


class RandomUserAgentMiddleware():
    def __init__(self, UA):
        self.user_agents = UA

    @classmethod
    def from_crawler(cls, crawler):
        # MY_USER_AGENT is configured in the settings file and fetched via this class method
        return cls(UA=crawler.settings.get('MY_USER_AGENT'))

    def process_request(self, request, spider):
        # pick a random User-Agent for this request
        request.headers['User-Agent'] = random.choice(self.user_agents)

    def process_response(self, request, response, spider):
        return response


class ProxyMiddleware():
    def __init__(self):
        # this address returns a random usable proxy from the proxy pool
        ipproxy = requests.get('http://localhost:5000/random/')
        self.random_ip = 'http://' + ipproxy.text

    def process_request(self, request, spider):
        print(self.random_ip)
        request.meta['proxy'] = self.random_ip

    def process_response(self, request, response, spider):
        return response
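  For these middlewares to take effect they must be enabled in settings.py. A sketch, assuming the project module is scrapymodel (matching the import in the spider below) and the middlewares live in scrapymodel/middlewares.py; the priority numbers and the example User-Agent list are illustrative:

# settings.py (fragment)
MY_USER_AGENT = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
]

DOWNLOADER_MIDDLEWARES = {
    'scrapymodel.middlewares.RandomUserAgentMiddleware': 543,
    'scrapymodel.middlewares.ProxyMiddleware': 544,
}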

 

 

  3) Modify the pipeline file to process the items returned by the spider:

  

import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_url, mongo_db, collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URL, MONGO_DB and COLLECTION are configured in the settings file and fetched via this class method
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # name = item.__class__.collection
        name = self.collection
        self.db[name].insert_one(dict(item))  # insert the item into MongoDB (insert_one replaces the deprecated insert)
        return item

    def close_spider(self, spider):
        self.client.close()
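  Like the middlewares, the pipeline and its MongoDB settings need to be registered in settings.py. A sketch; the setting names MONGO_URL, MONGO_DB and COLLECTION are the ones the pipeline reads, while the module path, priority and concrete values are assumptions:

# settings.py (fragment)
ITEM_PIPELINES = {
    'scrapymodel.pipelines.MongoPipeline': 300,
}

MONGO_URL = 'localhost'
MONGO_DB = 'weather'
COLLECTION = 'weather'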

 

 

  4) Finally, and most importantly, write the spider that parses the data. The code is below; the inline comments explain each step.

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from scrapy import Request
from lxml import etree
from scrapymodel.items import WeatherItem


class WeatherSpider(scrapy.Spider):
    name = 'weather'        # the spider's name; must be unique within the project
    # allowed_domains = ['tianqihoubao']
    start_urls = ['http://www.tianqihoubao.com/lishi/']    # the crawl starts here; the response is passed to parse by default

    # Parse http://www.tianqihoubao.com/lishi/ and extract links of the form http://www.tianqihoubao.com/lishi/beijing.html
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        citylists = soup.find_all(name='div', class_='citychk')
        for citys in citylists:
            for city in citys.find_all(name='dd'):
                url = 'http://www.tianqihoubao.com' + city.a['href']
                yield Request(url=url, callback=self.parse_citylist)    # a new Request scheduled by the framework; its response is parsed by parse_citylist

    # Parse http://www.tianqihoubao.com/lishi/beijing.html and extract links of the form http://www.tianqihoubao.com/lishi/tianjin/month/201811.html
    def parse_citylist(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        monthlist = soup.find_all(name='div', class_='wdetail')
        for months in monthlist:
            for month in months.find_all(name='li'):
                if month.text.endswith("季度:"):
                    continue
                else:
                    url = month.a['href']
                    url = 'http://www.tianqihoubao.com' + url
                    yield Request(url=url, callback=self.parse_weather)  # a new Request scheduled by the framework; its response is parsed by parse_weather

    # Parse the monthly page with XPath and hand the data to the pipeline
    def parse_weather(self, response):
        # Extract the city name from the URL
        url = response.url
        cityname = url.split('/')[4]

        weather_html = etree.HTML(response.text)
        table = weather_html.xpath('//table//tr//td//text()')
        # Collect all date-related cell texts into a list
        listall = []
        for t in table:
            if t.strip() == '':
                continue
            # Strip spaces and \r\n from each cell
            t1 = t.replace(' ', '')
            t2 = t1.replace('\r\n', '')
            listall.append(t2.strip())
        # Split the month's data into per-day chunks of four cells to make database insertion easier
        n = 4
        sublist = [listall[i:i + n] for i in range(0, len(listall), n)]
        # Drop the table header row
        sublist.remove(sublist[0])
        # Split the combined high/low temperature cell for easier analysis later, and prepend the city name

        for sub in sublist:
            sub2 = sub[2].split('/')
            sub.remove(sub[2])
            sub.insert(2, sub2[0])
            sub.insert(3, sub2[1])
            sub.insert(0, cityname)

            Weather = WeatherItem()   # use the data structure defined in items.py

            Weather['cityname'] = sub[0]
            Weather['data'] = sub[1]
            Weather['tq'] = sub[2]
            Weather['maxtemp'] = sub[3]
            Weather['mintemp'] = sub[4]
            Weather['fengli'] = sub[5]
            yield Weather

 

 

 

Run the project (for example with scrapy crawl weather from the project root) to collect the data. With that, the weather-scraping project is complete.

Full project code:

https://gitee.com/liangxinbin/Scrpay/tree/master/scrapymodel

 

