Learning Web Scraping (5): Hands-On Practice, Scraping Weather Information

Reposted article, dated 2020-01-28 15:03.

1. Overall skeleton and fetching the page:

# Data visualization (pyecharts 0.5.x API; from 1.0 on, Bar lives in pyecharts.charts)
from pyecharts import Bar
# HTTP requests
import requests
# HTML parsing
from bs4 import BeautifulSoup

# Stores the scraped records
data = []

def parse_data(url):
    headers = {
        'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
    }
    rest = requests.get(url=url, headers=headers)  # fetch the page with requests.get
    # rest.text is the usual choice, but here it comes back garbled
    text = rest.content.decode('utf-8')  # decode the raw bytes as UTF-8 so the text can be parsed
    soup = BeautifulSoup(text, 'html5lib')  # BeautifulSoup takes the markup and the parser name


def main():
    url = "http://www.weather.com.cn/textFC/hb.shtml"
    parse_data(url)

if __name__ == '__main__':
    main()

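The parse_data function fetches the page and parses the data.

The User-Agent value for headers can be copied from any request shown in the browser's developer tools.

Pitfall: after requests.get returns, the usual way to read the page is the text attribute, but here that produced garbled characters. When the response headers declare a text content type without a charset, requests decodes text as ISO-8859-1 by default. I therefore read the raw bytes from the content attribute and decode them explicitly with decode('utf-8').

I parse the data with the bs4 library. lxml can also be used, but I find it less convenient than bs4. The parser is html5lib (a separate package that has to be installed alongside bs4), which is more tolerant of malformed HTML.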
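The encoding pitfall above can be reproduced without any network access. The following is a minimal sketch; the sample string is made up for illustration and is not real output from the weather site.

```python
# Bytes as a server would send them for a UTF-8 page (made-up sample text).
raw = "北京 晴 5℃".encode("utf-8")

# ISO-8859-1 maps every byte to some character, so decoding never fails,
# but multi-byte UTF-8 sequences turn into unreadable characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the page's actual encoding recovers the original text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

With requests, an alternative to decoding content by hand is to set rest.encoding = 'utf-8' before reading rest.text; both approaches assume the page really is UTF-8.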

2. Parsing the fetched page:

    # Extract the data (this fragment continues inside parse_data)
    cons = soup.find('div', attrs={'class':'conMidtab'})
    tables = cons.find_all('table')
    for table in tables:
        trs = table.find_all('tr')[2:]
        for index,tr in enumerate(trs):
            if index == 0:
                tds = tr.find_all('td')[1]
                qiwen = tr.find_all('td')[4]
            else:
                tds = tr.find_all('td')[0]
                qiwen = tr.find_all('td')[3]
            city = list(tds.stripped_strings)[0]
            wendu = list(qiwen.stripped_strings)[0]
            data.append({'城市':city, '最高氣溫':wendu})

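The usual way to use bs4 is through its find and find_all methods (see the previous post for details).

find is handy because it can look up elements that match given conditions, which are passed through attrs={}. In the code I used attrs={'class':'conMidtab'}; class_='conMidtab' is equivalent.

Inspecting the page source shows that the table holding the information we need can be located through 'class':'conMidtab'.

Further analysis: the page contains several conMidtab divs, and testing shows that they correspond to the weather for today, tomorrow, the day after tomorrow, and so on.

We only want today's weather, so we take the first conMidtab. soup.find("div",class_="conMidtab") returns exactly that first one.

Inside conMidtab, the several elements with class="conMidtab2" hold the weather for the different provinces.

On closer inspection, however, all of the weather information is stored in table elements, so it is enough to collect every table: cons.find_all('table').

For each table, the city rows only start at the third tr, so for each table we take trs = table.find_all("tr")[2:].

Pitfall: the first city of each province sits in the second td of its tr, whereas the other cities of that province sit in the first td, so an if/else is needed.

enumerate yields an index alongside each item, so while iterating over trs we know that index == 0 is the first row.

The td positions are therefore:

City name: second td for the first city of each province, first td for the other cities
Maximum temperature: fifth td for the first city of each province, fourth td for the other cities

Hence:

            if index == 0:
                tds = tr.find_all('td')[1]
                qiwen = tr.find_all('td')[4]
            else:
                tds = tr.find_all('td')[0]
                qiwen = tr.find_all('td')[3]

Finally, stripped_strings extracts the text, which is appended to the data list.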
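The first-row offset described above can be illustrated without bs4 by modelling each tr as a plain list of cell texts. This is only a sketch: the rows below are made-up stand-ins for the real td contents, the extract helper is hypothetical, and the extra leading cell in the first row is assumed to be the province name.

```python
# Hypothetical cell texts for one province table (all values are made up).
rows = [
    ["河北", "石家莊", "晴", "北風", "6", "多雲", "北風", "-4"],  # first row has an extra leading cell
    ["保定", "晴", "北風", "5", "多雲", "北風", "-5"],
    ["唐山", "多雲", "西風", "3", "晴", "西風", "-6"],
]

def extract(rows):
    """Return one record per row, shifting the cell indexes by one on the first row."""
    records = []
    for index, cells in enumerate(rows):
        offset = 1 if index == 0 else 0  # only row 0 carries the extra leading cell
        records.append({"城市": cells[0 + offset], "最高氣溫": cells[3 + offset]})
    return records

records = extract(rows)
for record in records:
    print(record)
```

Expressing the difference as a single offset keeps the two index pairs (1, 4) and (0, 3) from the original if/else in one place.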

3. Fetching data for all cities:

def main():
    urls = [
        "http://www.weather.com.cn/textFC/hb.shtml",
        "http://www.weather.com.cn/textFC/db.shtml",
        "http://www.weather.com.cn/textFC/hd.shtml",
        "http://www.weather.com.cn/textFC/hz.shtml",
        "http://www.weather.com.cn/textFC/hn.shtml",
        "http://www.weather.com.cn/textFC/xb.shtml",
        "http://www.weather.com.cn/textFC/xn.shtml",
        "http://www.weather.com.cn/textFC/gat.shtml"
    ]
    for url in urls:
        parse_data(url)

The main function is modified to fetch data for the whole country.
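Since the eight URLs differ only in their region code, the list could also be generated instead of written out. A minimal sketch, where the names BASE and REGIONS are introduced here for illustration and the codes are the ones from the list above:

```python
# URL template and region codes taken from the eight URLs in main().
BASE = "http://www.weather.com.cn/textFC/{}.shtml"
REGIONS = ["hb", "db", "hd", "hz", "hn", "xb", "xn", "gat"]

# Build the full list of page URLs, one per region.
urls = [BASE.format(code) for code in REGIONS]

for url in urls:
    print(url)
```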

4. Sorting to find the ten hottest cities nationwide:

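# Sort ascending by temperature to find the ten hottest cities
data.sort(key=lambda x:int(x['最高氣溫']))
# After an ascending sort, the last ten entries are the ten hottest cities
data_2 = data[-10:]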

Note that the temperature must be converted to int before sorting; otherwise the values are compared as strings, character by character.
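The difference between the two orderings is easy to see on a small sample. The records below are made up for illustration and only mimic the shape of the scraped data.

```python
# Made-up records in the same shape as the scraped data (temperatures are strings).
data = [
    {"城市": "A", "最高氣溫": "9"},
    {"城市": "B", "最高氣溫": "10"},
    {"城市": "C", "最高氣溫": "-3"},
    {"城市": "D", "最高氣溫": "25"},
]

# Without int(): strings are compared character by character, so "10" sorts before "9".
by_string = sorted(data, key=lambda x: x["最高氣溫"])
# With int(): values are compared numerically.
by_number = sorted(data, key=lambda x: int(x["最高氣溫"]))

string_order = [d["最高氣溫"] for d in by_string]
number_order = [d["最高氣溫"] for d in by_number]
top_two = [d["城市"] for d in by_number[-2:]]  # hottest entries are at the end

print(string_order)
print(number_order)
print(top_two)
```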

5. Data visualization:

citys = list(map(lambda x:x['城市'], data_2))  # x-axis: city names
wendu = list(map(lambda x:x['最高氣溫'], data_2))  # y-axis: maximum temperatures
charts = Bar('中國十大最高溫度城市')
charts.add('', citys, wendu)
charts.render('天氣網.html')

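Using the Bar class (pyecharts 0.5.x API):

　　The Bar constructor sets the chart title

　　add takes (series name shown in the legend, x-axis values, y-axis values)

　　render saves the chart to a local HTML file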

Result (the chart image is not preserved here):

Complete code:

# Data visualization (pyecharts 0.5.x API; from 1.0 on, Bar lives in pyecharts.charts)
from pyecharts import Bar
# HTTP requests
import requests
# HTML parsing
from bs4 import BeautifulSoup

# Stores the scraped records
data = []


def parse_data(url):
    headers = {
        'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
    }
    rest = requests.get(url=url, headers=headers)  # fetch the page with requests.get
    # rest.text is the usual choice, but here it comes back garbled
    text = rest.content.decode('utf-8')  # decode the raw bytes as UTF-8 so the text can be parsed
    soup = BeautifulSoup(text, 'html5lib')  # BeautifulSoup takes the markup and the parser name

    # Extract the data
    cons = soup.find('div', attrs={'class':'conMidtab'})
    tables = cons.find_all('table')
    for table in tables:
        trs = table.find_all('tr')[2:]
        for index,tr in enumerate(trs):
            if index == 0:
                tds = tr.find_all('td')[1]
                qiwen = tr.find_all('td')[4]
            else:
                tds = tr.find_all('td')[0]
                qiwen = tr.find_all('td')[3]
            city = list(tds.stripped_strings)[0]
            wendu = list(qiwen.stripped_strings)[0]
            data.append({'城市':city, '最高氣溫':wendu})

def main():
    urls = [
        "http://www.weather.com.cn/textFC/hb.shtml",
        "http://www.weather.com.cn/textFC/db.shtml",
        "http://www.weather.com.cn/textFC/hd.shtml",
        "http://www.weather.com.cn/textFC/hz.shtml",
        "http://www.weather.com.cn/textFC/hn.shtml",
        "http://www.weather.com.cn/textFC/xb.shtml",
        "http://www.weather.com.cn/textFC/xn.shtml",
        "http://www.weather.com.cn/textFC/gat.shtml"
    ]
    for url in urls:
        parse_data(url)

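    # Sort ascending by temperature to find the ten hottest cities
    data.sort(key=lambda x:int(x['最高氣溫']))
    # After an ascending sort, the last ten entries are the ten hottest cities
    data_2 = data[-10:]

    # Data visualization
    citys = list(map(lambda x:x['城市'], data_2))  # x-axis: city names
    wendu = list(map(lambda x:x['最高氣溫'], data_2))  # y-axis: maximum temperatures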
    charts = Bar('中國十大最高溫度城市')
    charts.add('', citys, wendu)
    charts.render('天氣網.html')

if __name__ == '__main__':
    main()
