Scraping Data with Python (Basics, Starting from Zero)

This article is a repost. Original published 2020-06-15 22:46.

1. Technical overview

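A crawler is an automated program that sends requests to a website and extracts the data it needs from the responses. It usually works in three steps:
(1) Send the request and get the response
(2) Parse the content
(3) Save the data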
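
The three steps above can be sketched end to end as follows. This is a minimal offline sketch, not the code used later in this article: fetch() returns a canned string instead of making a real HTTP request, sqlite3 stands in for MySQL, and the function names, table name, URL, and sample values are all made up for illustration.

```python
import json
import sqlite3


def fetch(url):
    # Step 1: stand-in for a real HTTP request; returns a canned JSON body.
    return '{"province": "Province A", "confirm": 10}'


def parse(body):
    # Step 2: turn the raw response text into the fields we care about.
    record = json.loads(body)
    return record["province"], record["confirm"]


def save(conn, row):
    # Step 3: persist the parsed row (in-memory SQLite here, MySQL in the article).
    conn.execute("CREATE TABLE IF NOT EXISTS stats (province TEXT, confirm INTEGER)")
    conn.execute("INSERT INTO stats VALUES (?, ?)", row)
    conn.commit()


conn = sqlite3.connect(":memory:")
save(conn, parse(fetch("https://example.com/api")))
rows = conn.execute("SELECT province, confirm FROM stats").fetchall()
print(rows)
```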
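I originally learned this technique because I was building a web page about the epidemic and needed accurate epidemic data.
Technical difficulty: you may need to get comfortable with Python dictionaries and lists, which the crawler uses to hold the scraped data.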
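
As a small illustration of the dictionary and list handling mentioned above, the sketch below walks a nested structure shaped like the data scraped later in this article. The key names (areaTree, children, name, total, confirm) come from that later code; the values are made up.

```python
# Made-up sample with the same nesting as the scraped data: a dict that
# contains a list of dicts, each of which contains more dicts.
data = {
    "areaTree": [
        {
            "name": "China",
            "children": [
                {"name": "Province A", "total": {"confirm": 10}},
                {"name": "Province B", "total": {"confirm": 3}},
            ],
        }
    ]
}

# Index into the list with [0], then into the dict with a key.
provinces = data["areaTree"][0]["children"]

# Build one list per province, the same row shape later inserted into the database.
rows = [[p["name"], p["total"]["confirm"]] for p in provinces]

# A dict keyed by province name makes lookups easy.
by_name = {name: confirm for name, confirm in rows}
print(rows)
print(by_name["Province B"])
```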

2. Technical details

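This article only goes as far as scraping the data and storing it in a database. You need to install Python 3.6, MySQL, and Jupyter Notebook (used here as the Python IDE); search online for installation instructions. Then start Jupyter Notebook (basic usage tutorials are also easy to find online).
Sending the request and getting the response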

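Many websites have anti-scraping measures. To avoid being blocked, we can make the crawler look like an ordinary client by sending a browser-style User-Agent header instead of the default one.
Open www.baidu.com and press F12 to open the developer tools (the shortcut may differ between browsers). As shown in the figure, find the User-Agent value in the request headers. This value identifies who is making the request. Note that it is your own browser's User-Agent, not Baidu's; copying it into the script is the key to the disguise.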
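
To check which User-Agent a request will carry without sending anything over the network, you can build the request object and read the header back. This is a minimal sketch using only urllib from the standard library; the URL and User-Agent string are the ones used in this article.

```python
from urllib import request

url = "https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"
}

# Building the Request does not send anything; it only prepares the headers.
req = request.Request(url, headers=header)

# urllib stores header names via str.capitalize(), so the key becomes "User-agent".
user_agent = req.get_header("User-agent")
print(user_agent)
print(req.get_method())  # GET, because no request body was supplied
```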

There are two ways to send the request. The first is urllib, mainly its request.urlopen() method.

from urllib import request

url = "https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5"  # the address you want to scrape

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"
}  # the User-Agent obtained in the previous step; it makes the request look like it comes from a Chrome browser

req = request.Request(url, headers=header)
res = request.urlopen(req)  # visit the URL and get the response

html = res.read()  # the content comes back as bytes
html = html.decode("utf-8")  # decode() returns a new str, so the result must be assigned
print(html)
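
One detail worth spelling out is that res.read() returns bytes and that bytes.decode() returns a new str instead of changing the bytes object in place. The offline sketch below shows this with a made-up byte string standing in for a real response body.

```python
# Bytes similar to what res.read() returns (made-up content, UTF-8 encoded).
raw = "疫情數據 confirm=10".encode("utf-8")

# decode() leaves `raw` untouched and returns a str, so keep the return value.
text = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)
```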

Result of running it:

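The second way is to send the request with requests.
Note that if you have never installed the requests library, you need to install it from the command line first: pip install requests. The main function used is requests.get().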

import requests
url = "https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5"  # the address you want to scrape

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"
}
r1 = requests.get(url, headers=header)  # send the request
r1.encoding = "UTF-8"  # tell requests how to decode the body
html = r1.text
print(html)

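As you can see, the code is not much different from the urllib version, and the output is the same, so no screenshot is included here.

Parsing the content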

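There are also two ways to parse the content. The first is beautifulsoup4, which turns a complex HTML document into a tree in which every node is a Python object; the find(), select(), and find_all() functions retrieve tags from that tree. Install it with: pip install beautifulsoup4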

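import requests
from bs4 import BeautifulSoup  # don't forget to import the library
# The address to scrape. This example is the Fujian Health Commission's April report on
# notifiable infectious diseases; it is an ordinary web page, not a tidy data API.
url = "http://wjw.fujian.gov.cn/xxgk/gsgg/yqgg/202005/t20200520_5270636.htm"

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"
}
r1 = requests.get(url, headers=header)  # send the request
r1.encoding = "UTF-8"
html = r1.text  # the raw HTML, before parsing
# BeautifulSoup tidies the HTML so you can look for the data tag by tag. An easier way is to
# open the target page, press F12, and use the element picker in the top-left corner of the
# developer tools to locate the tag directly on the page.
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid bs4's warning
# soup.find("font").text gives the text inside the font tag, .attrs gives the tag's attributes,
# and soup.find("font") alone gives the whole tag plus its content.
# find() returns the first matching tag; find_all() returns every matching tag.
res = soup.find("font")
print(res)
print(res.text)
print(res.attrs)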

First, the output before parsing (running up to r1.text):

The output after parsing (running up to BeautifulSoup(html)):

The results of soup.find("font"), soup.find("font").text, and soup.find("font").attrs:
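
To see how find(), find_all(), and select() differ, here is a small offline sketch that parses an inline HTML string instead of a downloaded page. The HTML snippet, its id and class names, and its contents are made up for illustration; it assumes beautifulsoup4 is installed as described above.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<div id="report">
  <font color="red">Title</font>
  <p class="item">first</p>
  <p class="item">second</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_font = soup.find("font")               # first matching tag only
all_items = soup.find_all("p")               # every matching tag, as a list
selected = soup.select("div#report p.item")  # CSS selector syntax

print(first_font.text)
print(first_font.attrs)
print([p.text for p in all_items])
print([p.text for p in selected])
```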

The other way is re, which requires some understanding of regular expressions.
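
As a rough idea of what parsing with re looks like, the sketch below pulls name and number pairs out of a table-like HTML fragment. The fragment and the pattern are made up for illustration; a real page would need a pattern written against its actual markup.

```python
import re

# Made-up fragment standing in for part of a downloaded page.
html = "<td>Province A</td><td>12</td><td>Province B</td><td>7</td>"

# First group: a cell with no digits or "<" in it (the name).
# Second group: a cell containing only digits (the count).
pattern = r"<td>([^<\d]+)</td><td>(\d+)</td>"

pairs = re.findall(pattern, html)  # list of (name, count) string tuples
counts = {name: int(count) for name, count in pairs}
print(pairs)
print(counts)
```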

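At the time, I scraped the epidemic data from Tencent's API and persisted it in a local database. The full code follows (only the scraping part is shown here, not the database part):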

import requests
import json
import pymysql
import time
import traceback
def getdata():  # scrape data from the Tencent API and store it in dictionaries and lists
    url = "https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5"
    header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"
    }
    r1 = requests.get(url,headers = header)
    res1 = json.loads(r1.text)
    data_all = json.loads(res1["data"])
    
    url2 = "https://view.inews.qq.com/g2/getOnsInfo?name=disease_other"
    r2 = requests.get(url2,headers = header)
    res2 = json.loads(r2.text)
    odata_all = json.loads(res2["data"])
    
    country_history = {}  # historical data, whole country
    for i in odata_all["chinaDayList"]:
        ds ="2020."+ i["date"]
        tup =time.strptime(ds,"%Y.%m.%d")
        ds =time.strftime("%Y-%m-%d",tup)
        confirm =i["confirm"]
        heal=i["heal"]
        dead=i["dead"]
        country_history[ds] ={"confirm":confirm, "heal":heal, "dead":dead}
    for i in odata_all["chinaDayAddList"]:
        ds ="2020."+ i["date"]
        tup =time.strptime(ds,"%Y.%m.%d")
        ds =time.strftime("%Y-%m-%d",tup)
        confirm =i["confirm"]
        country_history[ds].update({"confirm_add":confirm})
        
    details = []  # per-province data
    update_time=data_all["lastUpdateTime"]
    data_province=data_all["areaTree"][0]["children"]
    for pro_infos in data_province:
        province=pro_infos["name"]
        confirm=pro_infos["total"]["confirm"]
        confirm_add=pro_infos["today"]["confirm"]
        heal=pro_infos["total"]["heal"]
        dead=pro_infos["total"]["dead"]
        details.append([update_time,province,confirm,confirm_add,heal,dead])
    
    country_now = []  # current data, whole country
    update_time =data_all["lastUpdateTime"]
    nowConfirm =data_all["chinaTotal"]["nowConfirm"]
    suspect =data_all["chinaTotal"]["suspect"]
    nowSevere =data_all["chinaTotal"]["nowSevere"]
    confirm =data_all["chinaTotal"]["confirm"]
    heal =data_all["chinaTotal"]["heal"]
    dead =data_all["chinaTotal"]["dead"]    
    nowConfirm_add =data_all["chinaAdd"]["nowConfirm"]
    suspect_add =data_all["chinaAdd"]["suspect"]
    nowSevere_add =data_all["chinaAdd"]["nowSevere"]
    confirm_add =data_all["chinaAdd"]["confirm"]
    heal_add =data_all["chinaAdd"]["heal"]
    dead_add =data_all["chinaAdd"]["dead"]
    country_now.append([update_time,nowConfirm,suspect,nowSevere,confirm,heal,dead,nowConfirm_add,suspect_add,nowSevere_add,confirm_add,heal_add,dead_add])
    
    province_history = []  # historical data, per province
    ds=time.strftime("%Y-%m-%d")
    data_province=data_all["areaTree"][0]["children"]
    for pro_infos in data_province:
        province=pro_infos["name"]
        confirm=pro_infos["total"]["confirm"]
        confirm_add=pro_infos["today"]["confirm"]
        heal=pro_infos["total"]["heal"]
        dead=pro_infos["total"]["dead"]
        province_history.append([ds,province,confirm,confirm_add,heal,dead])
    
    return country_history,details,country_now,province_history
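
Two details in getdata() are easy to miss: json.loads is called twice because the "data" field of the response is itself a JSON string, and each date such as "01.20" is prefixed with the year and reformatted. The offline sketch below reproduces both steps on a made-up sample; the key names come from the code above, while the sample values are invented.

```python
import json
import time

# Made-up sample shaped like the response: "data" holds a JSON *string*, not an object.
inner = {"chinaDayList": [{"date": "01.20", "confirm": 1, "heal": 0, "dead": 0}]}
response_text = json.dumps({"data": json.dumps(inner)})

outer = json.loads(response_text)  # first pass: the outer envelope
data = json.loads(outer["data"])   # second pass: the string stored under "data"

history = {}
for day in data["chinaDayList"]:
    # "01.20" -> "2020.01.20" -> struct_time -> "2020-01-20"
    parsed = time.strptime("2020." + day["date"], "%Y.%m.%d")
    ds = time.strftime("%Y-%m-%d", parsed)
    history[ds] = {"confirm": day["confirm"], "heal": day["heal"], "dead": day["dead"]}

print(history)
```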

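Saving the data
The previous step gave us the scraped data, held temporarily in lists and dictionaries. So how do we store it in the local database?
Below is one example, which updates the details table; the other tables are handled in much the same way.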

def get_conn():  # wrap the common steps in helpers: open and close the database connection
    conn=pymysql.connect(host="localhost",
                     user="root",
                     password="123456",
                     db="covtest",
                     charset="utf8")
    cursor=conn.cursor()
    return conn,cursor

def close_conn(conn,cursor):
    if cursor:
        cursor.close()
    if conn:
        conn.close()
def update_details():  # update the details table; rows are added with INSERT, so readers must treat the rows with the newest update_time as the current data
    cursor=None
    conn=None
    try:
        li=getdata()[1]
        conn,cursor=get_conn()
        sql ="insert into details(update_time,province,confirm,confirm_add,heal,dead) values(%s,%s,%s,%s,%s,%s)"
        sql_query='select %s=(select update_time from details order by id desc limit 1)'
        cursor.execute(sql_query,li[0][0])
        if not cursor.fetchone()[0]:
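            print(f"{time.asctime()} Starting to update the latest data")
            for item in li:
                cursor.execute(sql,item)
            conn.commit()
            print(f"{time.asctime()} Finished updating the latest data")
        else:
            print(f"{time.asctime()} Data is already up to date!")
    except Exception:  # catch ordinary errors without swallowing KeyboardInterrupt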
        traceback.print_exc()
    finally:
        close_conn(conn,cursor)
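
The "insert only if the update time is new" idea can be tried without a MySQL server. The sketch below uses sqlite3 from the standard library as a stand-in, so it uses ? placeholders instead of %s, and the table schema is an assumption, since the article does not show how details was created. The sample rows are made up.

```python
import sqlite3


def update_details(conn, rows):
    """Insert rows unless the newest stored update_time already matches them."""
    latest = conn.execute(
        "SELECT update_time FROM details ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if latest is not None and latest[0] == rows[0][0]:
        return False  # already up to date, nothing inserted
    conn.executemany(
        "INSERT INTO details(update_time, province, confirm, confirm_add, heal, dead) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
# Assumed schema: an auto-incrementing id plus the six columns used in the INSERT.
conn.execute(
    "CREATE TABLE details (id INTEGER PRIMARY KEY AUTOINCREMENT, update_time TEXT, "
    "province TEXT, confirm INTEGER, confirm_add INTEGER, heal INTEGER, dead INTEGER)"
)

rows = [
    ("2020-06-15 10:00:00", "Province A", 10, 1, 8, 0),
    ("2020-06-15 10:00:00", "Province B", 3, 0, 3, 0),
]
first = update_details(conn, rows)   # table is empty, so the rows are inserted
second = update_details(conn, rows)  # same update_time, so nothing is inserted
count = conn.execute("SELECT COUNT(*) FROM details").fetchone()[0]
print(first, second, count)
```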

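3. Problems encountered while using the technique and how they were solved

1. The notebook download timed out

2. As shown in the figure

4. Summary

Steps:

1) Send the request and get the response (urllib, requests)
2) Parse the content (re, beautifulsoup4)
3) Save the data (in a local database or a cloud database)

5. References and reference blogs (title, author, link)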

https://b23.tv/lfdazZ?share_medium=android&share_source=qq&bbid=FA2630F5-082A-47E2-B26C-438479DAD067101599infoc&ts=1592231486457
