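Purpose of this script: fetch my blog's rank and score, store the capture time, rank, and score in a database, then plot the most recent values to see how the score or rank changes over time.
Overall flow: the script is written in Python 3. It uses selenium to fetch the page, re regular expressions to parse the score and the rank, pymysql to connect to the MySQL database, and finally matplotlib to draw the chart.
First create the database: xiaoshitou
Then create the table blog_rank: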
CREATE TABLE `blog_rank` (
`id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'id',
`rank` varchar(255) NOT NULL DEFAULT '' COMMENT 'rank',
`score` varchar(255) NOT NULL DEFAULT '' COMMENT 'score',
`create_time` varchar(255) NOT NULL DEFAULT '' COMMENT 'time added',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=27 DEFAULT CHARSET=utf8;
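Here is the resulting plot:
And the data stored in the blog_rank table:
Now for the implementation:
1. operation_mysql.py: this file uses pymysql to connect to the database, insert rows, and query them.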
# coding=utf-8
import datetime

import pymysql as MySQLdb

host = '127.0.0.1'
user = 'root'
passwd = '123456'
port = 3306
db = 'xiaoshitou'


class OperationMySQL(object):

    def __init__(self):
        """Connect to the database."""
        try:
            self.conn = MySQLdb.connect(host=host, port=port, user=user,
                                        password=passwd, database=db,
                                        charset='utf8')
            self.cur = self.conn.cursor()
        except Exception as e:
            # str + Exception raises TypeError, so format the message instead
            print('Connect MySQL Database Fail: {0}'.format(e))
            raise

    def _close_connect(self):
        """Close the cursor and the connection."""
        self.cur.close()
        self.conn.close()

    def insert_data(self, data):
        """Insert one row."""
        # `rank` is a reserved word in MySQL 8.0, so it is quoted with backticks.
        # The values are passed as parameters instead of being formatted into the SQL.
        sql = 'insert into blog_rank (`rank`, score, create_time) values (%s, %s, %s)'
        self.cur.execute(sql, (data['rank'], data['score'],
                               str(datetime.datetime.now().timestamp())))
        self.conn.commit()
        self._close_connect()

    def select_data(self, sql=None):
        """Run a query and return the rows as a list of dicts."""
        if sql is None:
            sql = 'select `rank`, score, create_time from blog_rank order by create_time'
        self.cur.execute(sql)
        result = self.cur.fetchall()
        self._close_connect()
        # The headers must match the column order of the query
        headers = ('rank', 'score', 'create_time')
        results = [dict(zip(headers, row)) for row in result]
        return results


if __name__ == '__main__':
    print(OperationMySQL().select_data())
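The select_data method turns each result tuple into a dict with dict(zip(headers, row)). Below is a minimal sketch of that step on its own, with made-up rows standing in for a real fetchall() result. The sample values are assumptions, not data from this post, and UTC is used only to keep the example deterministic.

```python
from datetime import datetime, timezone

# Hypothetical rows shaped like cursor.fetchall() output for the default
# query (every column is varchar, so every value is a string).
rows = (
    ('120000', '3500', '1530000000.0'),
    ('119500', '3560', '1530086400.0'),
)
headers = ('rank', 'score', 'create_time')

# Pair each column name with the value at the same position in the row.
results = [dict(zip(headers, row)) for row in rows]

# create_time is a Unix timestamp stored as text; convert it before plotting.
first_seen = datetime.fromtimestamp(float(results[0]['create_time']),
                                    tz=timezone.utc)

print(results[0])
print(first_seen.isoformat())
```

Because every column is varchar, the numbers come back as strings and need int() or float() before any arithmetic or plotting.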
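2. get_my_blog_score.py: this file fetches the page content, parses the rank and score, saves the captured data to the database, and reads the database back to draw the chart.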
# coding=utf-8
import re
from datetime import datetime
from time import sleep

import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, DayLocator
from selenium import webdriver

from operation_mysql import OperationMySQL


class GetMyBlogScore:
    """Fetch the cnblogs score and rank."""

    def __init__(self):
        self.content = None

    def _get_blog_content(self):
        """Fetch the page source of the blog."""
        url = "http://www.cnblogs.com/xiaoshitoutest"
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            sleep(1)  # give the page a moment to finish rendering
            self.content = driver.page_source
        finally:
            # Always close the browser, even if the page load fails
            driver.quit()

    def _match_content(self, compile_str_args):
        """Return only the digits from the first match of the pattern."""
        result = re.findall(compile_str_args, self.content)
        if not result:
            raise ValueError('No match for pattern: ' + compile_str_args)
        return re.sub(r'\D', '', result[0])

    def _save_database(self, data):
        """Write the result to the database."""
        if isinstance(data, dict):
            OperationMySQL().insert_data(data)
            print('Insert Data Success.')
        else:
            print('The data is invalid.')

    def _show_map(self):
        """Read the stored values, draw the chart, and save it."""
        datas = OperationMySQL().select_data()
        # create_time holds a Unix timestamp as text; score is stored as varchar
        x_ = [datetime.fromtimestamp(float(x['create_time'])) for x in datas]
        score = [int(x['score']) for x in datas]
        fig, ax = plt.subplots()
        ax.xaxis.set_major_locator(DayLocator())
        ax.xaxis.set_major_formatter(DateFormatter('%Y-%m-%d'))
        ax.plot(x_, score, '--')
        ax.set_xlabel('Date')
        ax.set_ylabel('Score')
        ax.set_title('cnblogs rank -- score')
        fig.autofmt_xdate()
        # plt.show()
        plt.savefig('./rank_score.png')

    def run(self):
        score = r'<li.*?class="liScore">([\s\S]*?)</li>'
        rank = r'<li.*?class="liRank">([\s\S]*?)</li>'
        self._get_blog_content()
        scores = self._match_content(score)
        ranks = self._match_content(rank)
        result = dict(zip(['score', 'rank'], [scores, ranks]))
        self._save_database(result)
        self._show_map()


if __name__ == '__main__':
    GetMyBlogScore().run()
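The parsing step can be tried without a browser. The sketch below applies the same two patterns from run() to a small HTML fragment. The fragment is hypothetical and the real cnblogs sidebar markup may differ; extract_number is an illustrative helper name, not part of the script above.

```python
import re

# Hypothetical sidebar fragment; the real page markup may differ.
sample_html = '''
<ul>
  <li class="liScore">Score - 3560</li>
  <li class="liRank">Rank - 119500</li>
</ul>
'''


def extract_number(pattern, html):
    """Return the digits inside the first capture of pattern, or None."""
    match = re.search(pattern, html)
    if match is None:
        return None
    # Drop every non-digit character (labels, dashes, whitespace).
    return re.sub(r'\D', '', match.group(1))


score = extract_number(r'<li.*?class="liScore">([\s\S]*?)</li>', sample_html)
rank = extract_number(r'<li.*?class="liRank">([\s\S]*?)</li>', sample_html)
print(score, rank)
```

Returning None for a missing element lets the caller decide whether to skip the insert or raise, instead of failing with an IndexError when the page layout changes.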
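Run this file directly and it writes rank_score.png to the current directory, which is the chart of how the score changes over time.
The figure at the top is the time vs. score plot. Here is one more, the score vs. rank chart: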
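The post does not include the code for the score vs. rank chart, so the following is only a rough sketch of one way it might be drawn from the same rows. The names to_points and plot_score_vs_rank, the output file score_rank.png, and the sample rows are all assumptions for illustration.

```python
def to_points(rows):
    """Convert varchar score/rank values to sorted (score, rank) int pairs."""
    points = [(int(r['score']), int(r['rank'])) for r in rows]
    return sorted(points)


def plot_score_vs_rank(rows, path='./score_rank.png'):
    """Plot rank against score and save the figure to path."""
    # Imported here so to_points stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use('Agg')  # render to a file without needing a display
    import matplotlib.pyplot as plt

    points = to_points(rows)
    fig, ax = plt.subplots()
    ax.plot([p[0] for p in points], [p[1] for p in points], 'o--')
    ax.invert_yaxis()  # a smaller rank number is a better rank
    ax.set_xlabel('Score')
    ax.set_ylabel('Rank')
    ax.set_title('cnblogs score vs. rank')
    fig.savefig(path)
    plt.close(fig)


# Hypothetical rows shaped like the output of select_data().
sample_rows = [
    {'rank': '119500', 'score': '3560', 'create_time': '1530086400.0'},
    {'rank': '120000', 'score': '3500', 'create_time': '1530000000.0'},
]
print(to_points(sample_rows))
```

With a populated table, the call would presumably be plot_score_vs_rank(OperationMySQL().select_data()).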