Epidemic Monitoring in Python (Scraping + Visualization): Core Technologies (Python + Flask + ECharts)

Reposted article. 2021-03-08 16:26, tag: python

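Project Preparation

Introduction

Build an epidemic monitoring system on Python + Flask + ECharts. The technologies involved are:

Python web scraping
Working with a MySQL database from Python
Building a web project with Flask
Data visualization with ECharts
Deploying the web project and the scraper on Linux

Project architecture

Data acquisition (scraper) >> data persistence (MySQL) >> web back end built with Flask >> data visualization (HTML5 + ECharts)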

Project environment

Python 3.x
MySQL
PyCharm (Python IDE)
Jupyter Notebook (Python IDE)
HBuilder (front-end IDE, https://www.dcloud.io/hbuilderx.html)
Linux host (for deploying the project later on)

Jupyter Notebook

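Jupyter Notebook (formerly known as IPython notebook) is a web-based application for interactive computing, and it is popular in the data science field.

In short, a notebook opens as a web page. You write and run code directly in code cells, and the output appears right below each cell. If you need to write documentation while programming, you can write it in Markdown (md) cells, which makes it easy to explain things as you go.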

Installation

pip install notebook

Launch

jupyter notebook

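Changing the working directory

If you don't like Jupyter Notebook's default directory, you can change the working directory yourself.

1. First create a directory wherever you like (mine is C:\Users\18322\Desktop\note).

2. In cmd, run jupyter notebook --generate-config

Find the path of the generated configuration file (jupyter_notebook_config.py), then open it for editing.

Search for notebook_dir.

Remove the comment marker at the start of the line and set the directory path after it:

## The directory to use for notebooks and kernels.
c.NotebookApp.notebook_dir = r'C:\Users\18322\Desktop\note'

3. Save and exit, then run jupyter notebook from cmd to check that it works. (You can also create a script with a .cmd extension in the note directory created above, with jupyter notebook as its content; from then on, double-clicking it is a quick way to launch.)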

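Basic operations

1. Creating and importing files

2. Cell types: code, md

3. Command mode (blue border) and edit mode (green border)

4. Common shortcuts

Change cell type: Y (code), M (Markdown);

Insert a cell: A (above), B (below);

Enter command mode: Esc

Code completion: Tab

Run a cell: Ctrl+Enter, Shift+Enter or Alt+Enter

Delete a cell: D, D (press D twice)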

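Common Markdown syntax

1. Headings: 1 to 6 # characters followed by a space give a level 1 to level 6 heading

2. Unordered lists: *, - or + followed by a space

3. Ordered lists: a number followed by a period

4. Line breaks: end the line with two or more spaces (a blank line starts a new paragraph)

5. Code: three backticks

6. Horizontal rule: three asterisks or three hyphens

7. Links: [text](URL)

8. Images: ![alt text](image URL "title text")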

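Data Acquisition

Scraper overview

A scraper is an automated program that sends requests to a website and extracts the data it needs from the responses.

Send a request and get the response
- Use an HTTP library to request the target site. This is the same as opening a browser yourself and typing in the address
- Common libraries: urllib, urllib3, requests
- The server returns the requested content, typically HTML, binary files (video, audio), documents, JSON strings and so on
Parse the content
- Find the information you need, that is, extract the target information with regular expressions or other libraries
- Common libraries: re, beautifulsoup4
Save the data
- Persist the parsed data to a file or a database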
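The three stages above can be sketched end to end with the standard library only. This is a minimal illustration, not the project's code: `parse_title`, `save_record` and the file name `crawler_demo.json` are hypothetical names, and a fixed HTML string stands in for the HTTP response so that the example runs without network access.

```python
import json
import re
import tempfile
from pathlib import Path


def parse_title(html):
    """Step 2 (parse): return the <title> text, or None if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else None


def save_record(record, path):
    """Step 3 (save): persist the parsed data as a JSON file."""
    Path(path).write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")


# Step 1 (request) would be an HTTP call; a fixed string stands in for the response.
html = "<html><head><title> Demo page </title></head><body></body></html>"

record = {"title": parse_title(html)}
out_path = Path(tempfile.gettempdir()) / "crawler_demo.json"
save_record(record, out_path)
print(out_path.read_text(encoding="utf-8"))
```

In a real scraper the `html` string would come from one of the HTTP libraries listed above, and the save step would more likely write to a database than to a file.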

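Sending requests with urllib

Jupyter Notebook is used for testing here.

demo

from urllib import request
url="http://www.baidu.com"
res=request.urlopen(url)  # get the response

print(res.info()) # response headers
print(res.getcode()) # status code: 2xx success, 3xx redirect, 4xx problem accessing the resource, 5xx server-side error
print(res.geturl()) # URL the response came from

# get the page's HTML source
html=res.read()
print(html)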

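Fixing Chinese characters not displaying

res.read() returns bytes, so decode them before printing:

# get the page's HTML source
html=res.read()
# print(html)
html=html.decode("utf-8")
print(html)

A simple workaround for a site's anti-scraping checks

For example, if I change the url in the demo above to Dianping (www.dianping.com), I run into this error:
HTTPError: HTTP Error 403: Forbidden

We can disguise the request with a browser's User-Agent (I am using Google Chrome's here):

from urllib import request
url="http://www.dianping.com"
# the most basic countermeasure: add header information
header={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
}
req=request.Request(url,headers=header)
res=request.urlopen(req)  # get the response
# get the page's HTML source
html=res.read()
html=html.decode("utf-8")
print(html)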

ConnectionResetError: [WinError 10054]

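To work around this error, add a close operation after the request:

response.close()

Set the default socket timeout, so that the program can carry on automatically after a read times out:

socket.setdefaulttimeout(t_default)

Call sleep() to wait for a while before continuing with the next operation:

time.sleep(t)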
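The three measures above (a timeout, a pause, and trying again) can be combined into one small helper. The sketch below is only one possible arrangement: `fetch_with_retry` and `flaky` are hypothetical names, the 10-second timeout is an arbitrary choice, and `flaky` simulates a connection that is reset twice so that the example runs without network access.

```python
import socket
import time

socket.setdefaulttimeout(10)  # seconds; applies to sockets created from now on


def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() until it succeeds or the retries run out.

    fetch is any zero-argument callable that returns the page body.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except (ConnectionResetError, socket.timeout):
            if attempt == retries:
                raise  # give up and let the caller see the error
            time.sleep(delay)  # wait before the next attempt


# Stand-in for a real request: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError(10054, "connection reset")
    return b"<html>ok</html>"


body = fetch_with_retry(flaky, retries=3, delay=0)
print(calls["n"], body)
```

With urllib, `fetch` could be a small function that calls request.urlopen(url), reads the body, and closes the response in a finally block.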

Sending requests with requests

demo

1. Install it first: pip install requests

2. requests.get()

import requests

url="http://www.baidu.com"
res=requests.get(url)

print(res.encoding)
print(res.headers)
# res.encoding is derived from the Content-Type response header: if a charset is set, that charset is used;
# for a text/* type with no charset, requests falls back to ISO-8859-1
print(res.url)

Output:
>>>
ISO-8859-1
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 24 Mar 2020 03:58:05 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
http://www.baidu.com/

Viewing the page's HTML source

res.encoding="utf-8" # we saw above that it was ISO-8859-1; switch it here, otherwise the text comes out garbled
print(res.text)
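Why the wrong encoding produces garbled text can be shown without any network request. The sketch below decodes the same UTF-8 bytes in two ways; the sample string is arbitrary and the variable names are illustrative only.

```python
# The same UTF-8 bytes decoded with two different codecs.
raw = "疫情監控".encode("utf-8")  # what a UTF-8 page sends over the wire

wrong = raw.decode("iso-8859-1")  # every byte becomes one Latin-1 character
right = raw.decode("utf-8")       # bytes are grouped back into the original characters

print(len(raw), len(wrong), len(right))
print(right)
```

Decoding as ISO-8859-1 never fails, because every byte value maps to some character; that is why the symptom is garbled text rather than an exception.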

Getting past anti-scraping checks

As before, change the url to Dianping here.

import requests

url="http://www.dianping.com"
res=requests.get(url)

print(res.encoding)
print(res.headers)
print(res.url)
print(res.status_code)    # check the status code: unfortunately it is 403 again

Output:
>>>
UTF-8
{'Date': 'Tue, 24 Mar 2020 04:10:58 GMT', 'Content-Type': 'text/html;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=5', 'Vary': 'Accept-Encoding', 'X-Forbid-Reason': '.', 'M-TraceId': '6958868738621748090', 'Content-Language': 'en-US', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache', 'Server': 'DPweb', 'Content-Encoding': 'gzip'}
http://www.dianping.com/
403

Fix

import requests

url="http://www.dianping.com"
header={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
}
res=requests.get(url,headers=header)

print(res.encoding)
print(res.headers)
print(res.url)
print(res.status_code)    # the status code is now 200, so the page can be accessed normally

Output:
>>>
UTF-8
{'Date': 'Tue, 24 Mar 2020 04:14:23 GMT', 'Content-Type': 'text/html;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=5', 'Vary': 'User-Agent, Accept-Encoding', 'M-TraceId': '-4673120569475554214, 1931600023016859584', 'Set-Cookie': 'cy=1281; Domain=.dianping.com; Expires=Fri, 24-Apr-2020 04:14:23 GMT; Path=/, cye=nanzhao; Domain=.dianping.com; Expires=Fri, 24-Apr-2020 04:14:23 GMT; Path=/', 'Content-Language': 'en-US', 'Content-Encoding': 'gzip', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache', 'Server': 'DPweb'}
http://www.dianping.com/
200

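After that, print(res.text) shows the page's HTML source as normal.

Parsing content with beautifulsoup4

beautifulsoup4 turns a complex HTML document into a tree structure in which every node is a Python object.

Install: pip install beautifulsoup4
BeautifulSoup(html)
- Get nodes: find(), find_all()/select()
- Get attributes: attrs
- Get text: text

demo

Take a page from the Sichuan Provincial Health Commission website (http://wsjkw.sc.gov.cn/scwsjkw/gzbd/fyzt.shtml) as an example.

Here you need the small arrow at the top left of Google Chrome's developer tools.

Example: click the arrow, then move the mouse cursor over a link.

The type of the tag under the cursor is displayed. Here it is an a tag, and the walkthrough below uses this a tag as the demo.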

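from bs4 import BeautifulSoup
import requests

url="http://wsjkw.sc.gov.cn/scwsjkw/gzbd/fyzt.shtml"
res=requests.get(url)
res.encoding="utf-8"
html=res.text
soup=BeautifulSoup(html,"html.parser") # name the parser explicitly to avoid bs4's "no parser specified" warning
soup.find("h2").text # text of the first h2 tag (displayed only when it is the last line of a notebook cell)
a=soup.find("a") # get the first a tag on the page
print(a)
print(a.attrs) # print the tag's attributes
print(a.attrs["href"]) # print the value of href among the tag's attributes

Output: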
>>>
<a href="/scwsjkw/gzbd01/2020/3/24/62ae66867eea419dac169bf6a8684fb8.shtml" target="_blank"><img alt="我省新型冠狀病毒肺炎疫情最新情況（3月..." src="/scwsjkw/gzbd01/2020/3/24/62ae66867eea419dac169bf6a8684fb8/images/a799555b325242f6b0b2924c907f09ad.jpg
"/></a>
{'target': '_blank', 'href': '/scwsjkw/gzbd01/2020/3/24/62ae66867eea419dac169bf6a8684fb8.shtml'}
/scwsjkw/gzbd01/2020/3/24/62ae66867eea419dac169bf6a8684fb8.shtml

Then take the href value from the tag's attributes and join it into a new url

url_new="http://wsjkw.sc.gov.cn"+a.attrs["href"]
res=requests.get(url_new)
res.encoding="utf-8"
BeautifulSoup(res.text,"html.parser")    # parse the HTML text
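String concatenation works here because the href starts with a slash. As a possible alternative, the standard library's urllib.parse.urljoin resolves an href against the page it was found on and also handles relative paths. In the sketch below the first href is the one printed above; `detail.shtml` is a made-up relative link used only for illustration.

```python
from urllib.parse import urljoin

page_url = "http://wsjkw.sc.gov.cn/scwsjkw/gzbd/fyzt.shtml"

# Root-relative href, as found in the a tag above.
href = "/scwsjkw/gzbd01/2020/3/24/62ae66867eea419dac169bf6a8684fb8.shtml"
url_new = urljoin(page_url, href)
print(url_new)

# Hypothetical href relative to the current directory.
url_rel = urljoin(page_url, "detail.shtml")
print(url_rel)
```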

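Scroll down through the output to find the part we are interested in:

Viewing the page with the browser's developer tools shows that this part is a p tag:

So we locate the p tag and pin down the information we need, ready for regex analysis in the next step

soup=BeautifulSoup(res.text,"html.parser")
context=soup.find("p")
print(context)

Output: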
>>>
<p><span style="font-size: 12pt;">    3月23日0-24時，我省新型冠狀病毒肺炎新增2例確診病例（1、黃某某3月17日從英國經上海，於3月18日抵達成都后即接受隔離醫學觀察和動態診療，3月23日確診；2、王某某3月18日從英國經北京，於3月20日抵達成都后即接受隔離醫學觀察和動態診療，3月23日確診），相關密切接觸者正在實施追蹤和集中隔離醫學觀察。無新增治愈出院病例，無新增疑似病例，無新增死亡病例。
</span><br/>
<span style="font-size: 12pt;">   （確診患者&lt;含輸入病例&gt;具體情況由各市&lt;州&gt;衛生健康委進行通報）
</span><br/>
<span style="font-size: 12pt;">    截至3月24日0時，我省累計報告新型冠狀病毒肺炎確診病例545例（其中6例為境外輸入病例），涉及21個市（州）。
</span><br/>
<span style="font-size: 12pt;">    我省183個縣（市、區）全部為低風險區。
</span><br/>
<span style="font-size: 12pt;">    545名確診患者中，正在住院隔離治療6人，已治愈出院536人，死亡3人。
</span><br/>
<span style="font-size: 12pt;">    現有疑似病例0例。
</span><br/>
<span style="font-size: 12pt;">    現有564人正在接受醫學觀察。</span></p>

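Parsing content with re

re is Python's built-in regular expression module; using it requires some grounding in regular expressions
re.search(regex,str)
- 1. Searches str for a substring that matches the pattern; returns None if nothing matches
- 2. The result can be split into groups: add parentheses inside the pattern to separate out the data
  - groups(): returns all groups as a tuple
  - group(index): returns the content of the given group

Reusing the context obtained in the previous demo

import re
pattern=r"新增(\d+)例確診病例" # raw string, so \d reaches the regex engine unchanged
res=re.search(pattern,context.text) # context is a bs4 Tag; re.search needs its text as a str
print(res)

Output: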
>>>
<_sre.SRE_Match object; span=(25, 33), match='新增2例確診病例'>
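The match object above only shows that the pattern was found. To actually pull the numbers out, groups() and group(index) can be used as in the sketch below. The sentence is copied from the page text shown earlier; the pattern and the variable names are my own illustration, not part of the original project.

```python
import re

# One sentence taken from the page text shown earlier.
text = "545名確診患者中，正在住院隔離治療6人，已治愈出院536人，死亡3人。"

# Three capture groups: in hospital, discharged, dead.
pattern = r"治療(\d+)人.*?出院(\d+)人.*?死亡(\d+)人"

match = re.search(pattern, text)
if match is None:
    raise ValueError("pattern did not match")

# groups() returns every captured group as a tuple of strings.
in_hospital, healed, dead = (int(g) for g in match.groups())
print(match.group(0))  # the whole matched substring
print(in_hospital, healed, dead)
```

Checking for None before calling groups() matters in a scraper, because the wording of the page can change from one day to the next.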

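Scraping Tencent's epidemic data

With the scraping basics in place, we could go and scrape data from the health commission websites around the country ourselves, but some of those sites use sophisticated anti-scraping measures that take more specialized techniques to get past
We can also scrape the final figures directly from the major platforms, for example:
- https://voice.baidu.com/act/newpneumonia/newpneumonia/?from=osari_pc_

We can use this data directly

That raises a question: what if the folks at Tencent had not taken the lazy route?

Answer: open the Google Chrome developer tools, Network >> JS (JS and JSON responses are generally found here) >> look for entries starting with get (requests for this kind of data are usually named with a get prefix);

Go through them one by one; it turns out to be this one:

If you are not sure, hover the mouse over it to check:

import requests
url="https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5"
# send the request with requests
res=requests.get(url)
print(res.text)

The response is in JSON format, which we can use directly

Once we have the JSON data, we convert it to a dict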

import json
d=json.loads(res.text)
print(d)

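There are two main keys in it: ret and data

Print data

print(d["data"])

Check what data type data is

print(type(d["data"]))

Output:
>>>
<class 'str'>

Then use the json module again to load the contents of data into a new variable, data_all (that is, parse the str into a dict)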

data_all=json.loads(d["data"])
print(type(data_all))

>>>
<class 'dict'>

Take a look at what this dict-format data contains

data_all.keys()

>>>
dict_keys(['lastUpdateTime', 'chinaTotal', 'chinaAdd', 'isShowAdd', 'showAddSwitch', 'areaTree', 'chinaDayList', 'chinaDayAddList', 'dailyNewAddHistory', 'dailyHistory', 'wuhanDayList', 'articleList'])
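The two-pass decoding above is needed because the data field is itself a JSON document stored as a string. The sketch below reproduces that shape offline; the payload is a made-up miniature with invented values, and only the key names ret, data, lastUpdateTime and chinaTotal come from the response described here.

```python
import json

# Hypothetical miniature of the response: "data" holds JSON text, not an object.
inner = {"lastUpdateTime": "2020-03-24 10:00:00", "chinaTotal": {"confirm": 1}}
body = json.dumps({"ret": 0, "data": json.dumps(inner)})

d = json.loads(body)              # first pass: the outer envelope
print(type(d["data"]).__name__)   # still a str at this point

data_all = json.loads(d["data"])  # second pass: the embedded document
print(type(data_all).__name__)
print(sorted(data_all.keys()))
```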

Print the first few to see what they look like

Complete code for fetching the data

import pymysql
import time
import json
import traceback  # for printing exception tracebacks
import requests

def get_tencent_data():
    """
    :return: the historical data and the current day's detailed data
    """
    url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    }
    r = requests.get(url, headers=headers)  # pass headers by keyword; the second positional argument is params
    res = json.loads(r.text)  # JSON string to dict
    data_all = json.loads(res['data'])

    history = {}  # historical data
    for i in data_all["chinaDayList"]:
        ds = "2020." + i["date"]  # the feed gives only month.day, so the year is hard-coded
        tup = time.strptime(ds, "%Y.%m.%d")
        ds = time.strftime("%Y-%m-%d", tup)  # change the date format, otherwise the insert fails; the column is of type datetime
        confirm = i["confirm"]
        suspect = i["suspect"]
        heal = i["heal"]
        dead = i["dead"]
        history[ds] = {"confirm": confirm, "suspect": suspect, "heal": heal, "dead": dead}
    for i in data_all["chinaDayAddList"]:
        ds = "2020." + i["date"]
        tup = time.strptime(ds, "%Y.%m.%d")
        ds = time.strftime("%Y-%m-%d", tup)
        confirm = i["confirm"]
        suspect = i["suspect"]
        heal = i["heal"]
        dead = i["dead"]
        history[ds].update({"confirm_add": confirm, "suspect_add": suspect, "heal_add": heal, "dead_add": dead})

    details = []  # the current day's detailed data
    update_time = data_all["lastUpdateTime"]
    data_country = data_all["areaTree"]  # list of 25 countries
    data_province = data_country[0]["children"]  # provinces of China
    for pro_infos in data_province:
        province = pro_infos["name"]  # province name
        for city_infos in pro_infos["children"]:
            city = city_infos["name"]
            confirm = city_infos["total"]["confirm"]
            confirm_add = city_infos["today"]["confirm"]
            heal = city_infos["total"]["heal"]
            dead = city_infos["total"]["dead"]
            details.append([update_time, province, city, confirm, confirm_add, heal, dead])
    return history, details
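The date handling and the merging of the two lists in get_tencent_data() can be tried out offline. The sketch below is a reduced version under stated assumptions: `to_db_date` is a hypothetical helper name, the first sample row reuses the 2020-01-13 values quoted in a comment later in this article, and the values in the "add" row are invented.

```python
import time


def to_db_date(month_day, year="2020"):
    """Turn 'MM.DD' into 'YYYY-MM-DD', the same conversion get_tencent_data() does."""
    tup = time.strptime(f"{year}.{month_day}", "%Y.%m.%d")
    return time.strftime("%Y-%m-%d", tup)


FIELDS = ("confirm", "suspect", "heal", "dead")

# Sample rows shaped like chinaDayList / chinaDayAddList entries.
day_list = [{"date": "01.13", "confirm": 41, "suspect": 0, "heal": 0, "dead": 1}]
day_add_list = [{"date": "01.13", "confirm": 0, "suspect": 0, "heal": 0, "dead": 0}]

history = {}
for row in day_list:
    history[to_db_date(row["date"])] = {k: row[k] for k in FIELDS}
for row in day_add_list:
    # setdefault avoids a KeyError if a date appears only in the "add" list
    history.setdefault(to_db_date(row["date"]), {}).update(
        {f"{k}_add": row[k] for k in FIELDS}
    )

print(history)
```

Using setdefault is a small departure from the code above, which indexes history[ds] directly and would raise KeyError for a date missing from chinaDayList.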

Data Storage

Create the database cov, then create two tables

The history table stores the daily totals

CREATE TABLE `history` ( 
`ds` datetime NOT NULL COMMENT 'date', 
`confirm` int(11) DEFAULT NULL COMMENT 'cumulative confirmed', 
`confirm_add` int(11) DEFAULT NULL COMMENT 'new confirmed that day', 
`suspect` int(11) DEFAULT NULL COMMENT 'remaining suspected', 
`suspect_add` int(11) DEFAULT NULL COMMENT 'new suspected that day', 
`heal` int(11) DEFAULT NULL COMMENT 'cumulative recovered', 
`heal_add` int(11) DEFAULT NULL COMMENT 'new recovered that day', 
`dead` int(11) DEFAULT NULL COMMENT 'cumulative deaths', 
`dead_add` int(11) DEFAULT NULL COMMENT 'new deaths that day', 
PRIMARY KEY (`ds`) USING BTREE ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The details table stores the daily detailed data

CREATE TABLE `details` ( 
`id` int(11) NOT NULL AUTO_INCREMENT, 
`update_time` datetime DEFAULT NULL COMMENT 'time the data was last updated', 
`province` varchar(50) DEFAULT NULL COMMENT 'province', 
`city` varchar(50) DEFAULT NULL COMMENT 'city', 
`confirm` int(11) DEFAULT NULL COMMENT 'cumulative confirmed', 
`confirm_add` int(11) DEFAULT NULL COMMENT 'new confirmed', 
`heal` int(11) DEFAULT NULL COMMENT 'cumulative recovered', 
`dead` int(11) DEFAULT NULL COMMENT 'cumulative deaths', 
PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Using the pymysql module to interact with the database

Install: pip install pymysql

① Open a connection ② Create a cursor ③ Execute operations ④ Close the connection
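The four steps follow the standard Python database API, so they can be rehearsed without a MySQL server. The sketch below deliberately swaps in the standard library's sqlite3 module as a stand-in for pymysql; the trimmed-down three-column table is an assumption made for brevity, and note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# 1. open a connection (in-memory database, so nothing is written to disk)
conn = sqlite3.connect(":memory:")
try:
    # 2. create a cursor
    cursor = conn.cursor()

    # 3. execute statements; sqlite3 uses ? placeholders where pymysql uses %s
    cursor.execute("CREATE TABLE history (ds TEXT PRIMARY KEY, confirm INTEGER, dead INTEGER)")
    cursor.execute("INSERT INTO history VALUES (?, ?, ?)", ("2020-01-13", 41, 1))
    conn.commit()  # make the insert permanent

    cursor.execute("SELECT ds, confirm, dead FROM history")
    rows = cursor.fetchall()  # list of tuples, like pymysql's default cursor
    print(rows)
    cursor.close()
finally:
    # 4. close the connection, even if a statement above failed
    conn.close()
```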

pymysql basics & test

Insert an arbitrary row as a test

# basic use of pymysql
import pymysql
import time

# open the connection
conn = pymysql.connect(host="127.0.0.1",
                      user="root",
                      password="123456",
                      db="cov")
# create a cursor; results come back as tuples by default
cursor = conn.cursor()

sql= "insert into history values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
cursor.execute(sql,[time.strftime("%Y-%m-%d"),10,1,2,3,4,5,6,7])
conn.commit() # commit the transaction
# res = cursor.fetchall()
cursor.close()
conn.close()

Storage & operations

def get_conn():
    """
    :return: connection, cursor
    """
    # create the connection
    conn = pymysql.connect(host="127.0.0.1",
                           user="root",
                           password="123456",
                           db="cov",
                           charset="utf8")
    # create the cursor
    cursor = conn.cursor()  # result sets are returned as tuples by default
    return conn, cursor


def close_conn(conn, cursor):
    if cursor:
        cursor.close()
    if conn:
        conn.close()


def update_details():
    """
    Update the details table
    :return:
    """
    cursor = None
    conn = None
    try:
        li = get_tencent_data()[1]  # 0 is the historical data dict, 1 is the list of latest detailed data
        conn, cursor = get_conn()
        sql = "insert into details(update_time,province,city,confirm,confirm_add,heal,dead) values(%s,%s,%s,%s,%s,%s,%s)"
        sql_query = 'select %s=(select update_time from details order by id desc limit 1)'  # compare with the most recent timestamp in the table
        cursor.execute(sql_query,li[0][0])
        if not cursor.fetchone()[0]:
            print(f"{time.asctime()} Started updating the latest data")
            for item in li:
                cursor.execute(sql, item)
            conn.commit()  # commit the transaction (needed for update/delete/insert)
            print(f"{time.asctime()} Finished updating the latest data")
        else:
            print(f"{time.asctime()} Data is already up to date!")
    except Exception:
        traceback.print_exc()
    finally:
        close_conn(conn, cursor)


def insert_history():
    """
    Insert the historical data
    :return:
    """
    cursor = None
    conn = None
    try:
        dic = get_tencent_data()[0]  # 0 is the historical data dict, 1 is the list of latest detailed data
        print(f"{time.asctime()} Started inserting historical data")
        conn, cursor = get_conn()
        sql = "insert into history values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        for k, v in dic.items():
            # item format: {'2020-01-13': {'confirm': 41, 'suspect': 0, 'heal': 0, 'dead': 1}}
            cursor.execute(sql, [k, v.get("confirm"), v.get("confirm_add"), v.get("suspect"),
                                 v.get("suspect_add"), v.get("heal"), v.get("heal_add"),
                                 v.get("dead"), v.get("dead_add")])

        conn.commit()  # commit the transaction (needed for update/delete/insert)
        print(f"{time.asctime()} Finished inserting historical data")
    except Exception:
        traceback.print_exc()
    finally:
        close_conn(conn, cursor)


def update_history():
    """
    Update the historical data
    :return:
    """
    cursor = None
    conn = None
    try:
        dic = get_tencent_data()[0]  # 0 is the historical data dict, 1 is the list of latest detailed data
        print(f"{time.asctime()} Started updating historical data")
        conn, cursor = get_conn()
        sql = "insert into history values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        sql_query = "select confirm from history where ds=%s"
        for k, v in dic.items():
            # item format: {'2020-01-13': {'confirm': 41, 'suspect': 0, 'heal': 0, 'dead': 1}}
            if not cursor.execute(sql_query, k):  # execute() returns the row count; 0 means this date is not stored yet
                cursor.execute(sql, [k, v.get("confirm"), v.get("confirm_add"), v.get("suspect"),
                                     v.get("suspect_add"), v.get("heal"), v.get("heal_add"),
                                     v.get("dead"), v.get("dead_add")])
        conn.commit()  # commit the transaction (needed for update/delete/insert)
        print(f"{time.asctime()} Finished updating historical data")
    except Exception:
        traceback.print_exc()
    finally:
        close_conn(conn, cursor)


# insert the historical data
insert_history()

>>>
Mon Feb 17 01:43:37 2020 Started inserting historical data
Mon Feb 17 01:43:40 2020 Finished inserting historical data
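The idea behind update_history(), inserting only the dates that are not stored yet, can be tried without MySQL. The sketch below again uses the standard library's sqlite3 as a stand-in for pymysql, with a two-column table assumed for brevity. `insert_missing` is a hypothetical name, and the 2020-01-14 figure is invented. Because sqlite3's execute() does not return a row count the way pymysql's does, the check uses fetchone() instead.

```python
import sqlite3


def insert_missing(conn, rows):
    """Insert only the (ds, confirm) rows whose ds is not in the table yet."""
    cursor = conn.cursor()
    inserted = 0
    for ds, confirm in rows:
        cursor.execute("SELECT 1 FROM history WHERE ds = ?", (ds,))
        if cursor.fetchone() is None:  # no row for this date yet
            cursor.execute("INSERT INTO history VALUES (?, ?)", (ds, confirm))
            inserted += 1
    conn.commit()
    cursor.close()
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (ds TEXT PRIMARY KEY, confirm INTEGER)")

first = insert_missing(conn, [("2020-01-13", 41)])
second = insert_missing(conn, [("2020-01-13", 41), ("2020-01-14", 41)])
total = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
conn.close()
print(first, second, total)
```

Running the same batch twice inserts nothing the second time, which is what makes it safe to call such a function on a schedule.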

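Scraping Baidu's trending search data

Baidu's data page is rendered dynamically, so we can use selenium to scrape it

selenium

selenium is a tool for testing web applications; it runs directly in the browser, just as a real user would operate it
Install: pip install selenium
Install a browser (Firefox, Google Chrome, etc.)
Download the browser driver that matches your browser version: http://npm.taobao.org/mirrors/chromedriver/

How to check the version: for example, in Google Chrome it is under Settings >> About Chrome

Find the matching version, download it, and extract it into the note folder created earlier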
