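I. Overview
Use a crawler to fetch the posts under the Weibo super topic #我們的叄叄肆#, then crawl each poster's profile page to collect age, region, gender and similar fields, analyze the data, and present it visually.
Note: the profile information mentioned in this article is all public Weibo information and contains no private data. No individual's personal information appears anywhere in the article. The information is for study and analysis only; nobody may use this tutorial for commercial purposes, and violators bear the consequences themselves!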
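II. Technical Approach
Let's roughly break the work into steps and the technology each one uses:
1. Crawl the posts under #我們的叄叄肆#
2. For each post, crawl the poster's basic information
3. Save the information to a CSV file
4. Analyze the age and gender distribution of the users
5. Analyze where the fans are located
6. Use a word cloud to analyze the content of the super-topic posts
To crawl the data we can use the requests library; to save the CSV file we can use the built-in csv library; and for the visual analysis, this time I'd like to introduce a very handy library, pyecharts. With the technology chosen, we can start the implementation!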
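III. Crawling the Super Topic Posts
1. Find the URL that loads the super topic data
Open the #我們的叄叄肆超話# page in Google Chrome, open the developer tools, switch to mobile device mode, filter the requests so that only asynchronous (XHR) requests are shown, inspect the format of the returned data, and find where the post content lives!
2. Simulate the request in code
With the link in hand we can simulate the request. Here we again use the familiar requests library. A few lines are enough to fetch the posts!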
import requests


def spider_topic():
    '''
    Crawl the Weibo super topic.
    :return:
    '''
    url = 'https://m.weibo.cn/api/container/getIndex?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main'
    # HTTP header names cannot contain spaces, so write e.g. 'X-Requested-With'
    kv = {
        'Referer': 'https://m.weibo.cn/p/index?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36',
        'Accept': 'application/json, text/plain, */*',
        'MWeibo-Pwa': '1',
        'Sec-Fetch-Mode': 'cors',
        'X-Requested-With': 'XMLHttpRequest',
        'X-XSRF-TOKEN': '4dd422',
    }
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        print(r.text)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    spider_topic()
Once we understand the structure of the data Weibo returns, we can extract the post content and id!
import json
import re

import requests


def spider_topic():
    '''
    Crawl the Weibo super topic.
    :return:
    '''
    url = 'https://m.weibo.cn/api/container/getIndex?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main'
    # HTTP header names cannot contain spaces, so write e.g. 'X-Requested-With'
    kv = {
        'Referer': 'https://m.weibo.cn/p/index?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36',
        'Accept': 'application/json, text/plain, */*',
        'MWeibo-Pwa': '1',
        'Sec-Fetch-Mode': 'cors',
        'X-Requested-With': 'XMLHttpRequest',
        'X-XSRF-TOKEN': '4dd422',
    }
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        # Parse the data
        r_json = json.loads(r.text)
        cards = r_json['data']['cards']
        # The first response's cards include the topic header; later responses contain only posts
        card_group = cards[2]['card_group'] if len(cards) > 1 else cards[0]['card_group']
        for card in card_group:
            mblog = card['mblog']
            # Strip HTML tags, keeping only the text
            sina_text = re.compile(r'<[^>]+>', re.S).sub(" ", mblog['text'])
            # Remove the topic label at the start (Simplified Chinese, as the site serves it)
            sina_text = sina_text.replace("我们的叁叁肆超话", '').strip()
            print(sina_text)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    spider_topic()
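To see what the tag-stripping step does on its own, here is a minimal offline sketch. The HTML fragment and the label `DemoTopic` are made up for illustration and are not real Weibo output; `clean_weibo_text` is a hypothetical helper name.

```python
import re

# Matches any HTML tag such as <a href="..."> or </span>
TAG_RE = re.compile(r'<[^>]+>', re.S)


def clean_weibo_text(raw_html: str, topic_label: str) -> str:
    """Replace tags with spaces, drop the topic label, collapse whitespace."""
    text = TAG_RE.sub(' ', raw_html)
    text = text.replace(topic_label, '')
    return ' '.join(text.split())


# Hypothetical sample shaped like an mblog['text'] field
sample = ('<a href="/p/xyz"><span class="surl-text">DemoTopic</span></a>'
          ' great show <br />see you in Nanjing')
print(clean_weibo_text(sample, 'DemoTopic'))
```

Collapsing whitespace with `' '.join(text.split())` is an extra step not present in the article's code; it only tidies the gaps left where tags were removed.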
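4. Crawling posts in bulk
Once we can extract posts from one response, we can crawl them in bulk. How? By paging, of course. And how do we page?
Tip for finding the paging parameter: compare the URLs of the first and second requests and see which parameters differ! A text comparison tool such as Beyond Compare is handy for this.
Comparing the two request URLs shows that the second one carries one extra parameter, since_id, and this since_id is the id of a post!
Weibo's paging mechanism: paging is by time. Every post has an id that can serve as since_id, and newer posts have larger ids. When a request passes a since_id, the server returns the posts in the topic whose ids are smaller than it. We then take the smallest id from that response and pass it as the next since_id, and so on; that is how paging works.
With the mechanism understood, our paging strategy is: use the smallest since_id among the posts returned by the previous request as the parameter of the next request. This amounts to paging through the data in reverse chronological order!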
import json
import re

import requests

# Smallest since_id seen so far; passed with the next request
min_since_id = None


def spider_topic():
    '''
    Crawl the Weibo super topic.
    :return:
    '''
    global min_since_id
    url = 'https://m.weibo.cn/api/container/getIndex?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_feed&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main'
    # HTTP header names cannot contain spaces, so write e.g. 'X-Requested-With'
    kv = {
        'Referer': 'https://m.weibo.cn/p/index?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36',
        'Accept': 'application/json, text/plain, */*',
        'MWeibo-Pwa': '1',
        'Sec-Fetch-Mode': 'cors',
        'X-Requested-With': 'XMLHttpRequest',
        'X-XSRF-TOKEN': '4dd422',
    }
    if min_since_id:
        url = url + '&since_id=' + min_since_id
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        # Parse the data
        r_json = json.loads(r.text)
        cards = r_json['data']['cards']
        # The first response's cards include the topic header; later responses contain only posts
        card_group = cards[2]['card_group'] if len(cards) > 1 else cards[0]['card_group']
        for card in card_group:
            mblog = card['mblog']
            r_since_id = mblog['id']
            # Strip HTML tags, keeping only the text
            sina_text = re.compile(r'<[^>]+>', re.S).sub(" ", mblog['text'])
            # Remove the topic label at the start (Simplified Chinese, as the site serves it)
            sina_text = sina_text.replace("我们的叁叁肆超话", '').strip()
            print(sina_text)
            # Append the post text to a file
            with open('sansansi.txt', 'a+', encoding='utf-8') as f:
                f.write(sina_text + '\n')
            # Track the smallest since_id for the next request
            if min_since_id:
                min_since_id = r_since_id if min_since_id > r_since_id else min_since_id
            else:
                min_since_id = r_since_id
    except Exception as e:
        print(e)


if __name__ == '__main__':
    for i in range(1000):
        spider_topic()
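The paging strategy can be tried offline with a stand-in for the HTTP call. In this sketch `crawl_all` and `fake_fetch` are hypothetical names, and ids are compared as integers; the article's code compares them as strings, which gives the same order only while all ids have the same number of digits.

```python
from typing import Callable, List, Optional


def crawl_all(fetch_page: Callable[[Optional[int]], List[int]],
              max_pages: int = 1000) -> List[int]:
    """Walk backwards in time: each request asks for posts older than the smallest id seen."""
    seen: List[int] = []
    min_since_id: Optional[int] = None
    for _ in range(max_pages):
        ids = fetch_page(min_since_id)
        if not ids:  # an empty page means there is nothing older left
            break
        seen.extend(ids)
        page_min = min(ids)
        min_since_id = page_min if min_since_id is None else min(min_since_id, page_min)
    return seen


# Stand-in for the real request: serves ids 10..1, three per page, newest first
ALL_IDS = list(range(10, 0, -1))


def fake_fetch(since_id: Optional[int]) -> List[int]:
    older = [i for i in ALL_IDS if since_id is None or i < since_id]
    return older[:3]


print(crawl_all(fake_fetch))
```

Stopping on an empty page avoids issuing the remaining requests once the topic is exhausted, which the fixed `range(1000)` loop does not do.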
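IV. Crawling User Information
With bulk post crawling done, we can start crawling user information!
First, note that the link to a user's basic information page is: https://weibo.cn/<user id>/info
So once we have a user's id we can get their public basic information!
1. Get the user id
Looking back at the post data format we analyzed earlier, it already contains the user id we need!
So we can extract the user id at the same time as we extract the post content!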
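A minimal sketch of pulling the user id out together with the post, run on made-up JSON. The field names (`data`, `cards`, `card_group`, `mblog`, `user`, `id`, `text`) are the ones the article's code reads; the values and the extra `card_type` entry are assumptions for illustration, and `extract_posts` is a hypothetical helper.

```python
import json

# Made-up response that mirrors only the fields used in this article
sample_response = json.loads('''
{"data": {"cards": [{"card_group": [
    {"mblog": {"id": "1002", "text": "first", "user": {"id": 42}}},
    {"card_type": "other"},
    {"mblog": {"id": "1001", "text": "second", "user": {"id": 43}}}
]}]}}
''')


def extract_posts(response: dict) -> list:
    """Return (user id, post id, post text) for every entry that carries an mblog."""
    posts = []
    for card in response['data']['cards']:
        for item in card.get('card_group', []):
            mblog = item.get('mblog')
            if mblog is None:  # skip entries that are not posts
                continue
            posts.append((mblog['user']['id'], mblog['id'], mblog['text']))
    return posts


print(extract_posts(sample_response))
```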
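2. Simulate login
Once we have the user id, requesting https://weibo.cn/<user id>/info returns the public information. However, viewing another user's profile page requires being logged in, so we first simulate a login in code!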
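import json

import requests

# Smallest since_id seen so far; used for the next request (Weibo's paging mechanism)
min_since_id = ''
# Session object that stores cookies between requests
s = requests.session()


def login_sina():
    """
    Log in to Sina Weibo.
    :return: 0 if the login request failed, otherwise 1
    """
    # Login URL
    login_url = 'https://passport.weibo.cn/sso/login'
    # Request headers
    headers = {'user-agent': 'Mozilla/5.0',
               'Referer': 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=https%3A%2F%2Fm.weibo.cn%2F'}
    # Username and password (replace the placeholders with your own)
    data = {'username': 'your_username',
            'password': 'your_password',
            'savestate': 1,
            'entry': 'mweibo',
            'mainpageflag': 1}
    try:
        r = s.post(login_url, headers=headers, data=data)
        r.raise_for_status()
    except requests.RequestException:
        print('Login request failed')
        return 0
    # Print the result message returned by the server
    print(json.loads(r.text)['msg'])
    return 1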
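For login we use a requests.Session() object, which stores cookies automatically and sends them with subsequent requests!
3. Crawl the user's public information
With the user id in hand and the login done, we can start crawling the user's public information!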
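# Requires `import re` and the logged-in session `s` created above
def spider_user_info(uid) -> list:
    '''
    Crawl a user's information (login required) and return the parsed basic fields.
    :param uid: user id
    :return: ['username', 'gender', 'region', 'birthday'], or [] on failure
    '''
    user_info_url = 'https://weibo.cn/%s/info' % uid
    kv = {
        'user-agent': 'Mozilla/5.0'
    }
    try:
        r = s.get(url=user_info_url, headers=kv)
        r.raise_for_status()
        # Extract the "basic information" block with a regular expression
        # (the label is in Simplified Chinese, as the site serves it)
        basic_info_html = re.findall('<div class="tip">基本信息</div><div class="c">(.*?)</div>', r.text)
        # Extract: username, gender, region, birthday
        basic_infos = get_basic_info_list(basic_info_html)
        return basic_infos
    except Exception as e:
        print(e)
        return []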
Of the public information we only want the username, gender, region and birthday, so we need to extract those fields.
def get_basic_info_list(basic_info_html) -> list:
    '''
    Parse the HTML and extract the fields we need.
    :param basic_info_html: result of re.findall on the info page
    :return: ['username', 'gender', 'region', 'birthday']
    '''
    basic_infos = []
    basic_info_kvs = basic_info_html[0].split('<br/>')
    # Field labels are in Simplified Chinese, as the site serves them
    for basic_info_kv in basic_info_kvs:
        if basic_info_kv.startswith('昵称'):
            basic_infos.append(basic_info_kv.split(':')[1])
        elif basic_info_kv.startswith('性别'):
            basic_infos.append(basic_info_kv.split(":")[1])
        elif basic_info_kv.startswith('地区'):
            area = basic_info_kv.split(':')[1]
            # If the region is "other" or "overseas", store an empty string
            if '其他' in area or '海外' in area:
                basic_infos.append('')
                continue
            # Keep only the province part of "province city"
            if ' ' in area:
                area = area.split(' ')[0]
            basic_infos.append(area)
        elif basic_info_kv.startswith('生日'):
            birthday = basic_info_kv.split(':')[1]
            # Only birthdays that include a year are useful
            if birthday.startswith('19') or birthday.startswith('20'):
                # Keep only the first three digits (the decade)
                basic_infos.append(birthday[:3])
            else:
                basic_infos.append("")
        else:
            pass
    # Some users have no birthday; pad with an empty string
    if len(basic_infos) < 4:
        basic_infos.append("")
    return basic_infos
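As an illustration of the same extraction, here is a dict-based variant run on a made-up info block. It is a sketch, not the article's method: `parse_basic_info` and `to_row` are hypothetical names, the sample values are invented, and the Simplified Chinese labels are an assumption about what the page contains. Looking fields up by name keeps the four columns aligned even when a field other than the birthday is missing.

```python
def parse_basic_info(block: str) -> dict:
    """Split 'label:value<br/>label:value' into a dict, splitting on the first colon only."""
    info = {}
    for item in block.split('<br/>'):
        key, sep, value = item.partition(':')
        if sep:
            info[key.strip()] = value.strip()
    return info


def to_row(info: dict) -> list:
    """Build [username, gender, region, decade], using '' for anything missing."""
    area = info.get('地区', '')
    if '其他' in area or '海外' in area:
        area = ''
    else:
        area = area.split(' ')[0]  # keep the province only
    birthday = info.get('生日', '')
    decade = birthday[:3] if birthday.startswith(('19', '20')) else ''
    return [info.get('昵称', ''), info.get('性别', ''), area, decade]


# Invented sample; the last field shows that extra colons in a value are preserved
sample = '昵称:demo_user<br/>性别:女<br/>地区:江苏 南京<br/>生日:1995-06-01<br/>简介:hello: world'
print(to_row(parse_basic_info(sample)))
```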
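Do not crawl user information too frequently, or requests will start failing (response status code 418). Your IP will not be banned, though. In fact many large companies are reluctant to ban IPs because collateral damage is too easy: a single ban may hit an entire residential community or more!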
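One common way to cope with throttling is to wait and retry with a growing delay. The sketch below is an assumption about how that could look, not something the article implements; `fetch_with_backoff` is a hypothetical helper and the request is replaced by a stub that returns status codes.

```python
import time
from typing import Callable


def fetch_with_backoff(fetch: Callable[[], int], max_tries: int = 4,
                       base_delay: float = 1.0, sleep=time.sleep) -> bool:
    """Call fetch() until it returns a status other than 418, doubling the wait each time."""
    delay = base_delay
    for attempt in range(max_tries):
        status = fetch()
        if status != 418:
            return True
        if attempt < max_tries - 1:  # no point sleeping after the last try
            sleep(delay)
            delay *= 2
    return False


# Stub for the real request: throttled twice, then accepted
responses = iter([418, 418, 200])
waits = []
ok = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(ok, waits)
```

Passing `sleep` as a parameter lets the example record the waits instead of actually pausing; in real use the default `time.sleep` applies.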
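V. Saving to a CSV File
We now have the post data and the user data, so let's save them to make the later analysis easier!
Previously we always saved in txt format because there was only one field. This time there are several (post content, username, region, age, gender, and so on), so we choose the CSV (Comma-Separated Values) format!
We build a list, put the fields into it in order, and write it to the CSV file!
Then we can see the output.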
import csv
import json
import os
import random
import re
import time

import requests

# Smallest since_id seen so far; used for the next request (Weibo's paging mechanism)
min_since_id = ''
# Session object that stores cookies between requests
s = requests.session()
# CSV file that stores the topic data
CSV_FILE_PATH = 'sina_topic.csv'


def login_sina():
    """
    Log in to Sina Weibo.
    :return: 0 if the login request failed, otherwise 1
    """
    # Login URL
    login_url = 'https://passport.weibo.cn/sso/login'
    # Request headers
    headers = {'user-agent': 'Mozilla/5.0',
               'Referer': 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=https%3A%2F%2Fm.weibo.cn%2F'}
    # Username and password (replace the placeholders with your own)
    data = {'username': 'your_username',
            'password': 'your_password',
            'savestate': 1,
            'entry': 'mweibo',
            'mainpageflag': 1}
    try:
        r = s.post(login_url, headers=headers, data=data)
        r.raise_for_status()
    except requests.RequestException:
        print('Login request failed')
        return 0
    # Print the result message returned by the server
    print(json.loads(r.text)['msg'])
    return 1


def spider_topic():
    '''
    Crawl the Weibo super topic.
    :return:
    '''
    global min_since_id
    url = 'https://m.weibo.cn/api/container/getIndex?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_feed&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main'
    # HTTP header names cannot contain spaces, so write e.g. 'X-Requested-With'
    kv = {
        'Referer': 'https://m.weibo.cn/p/index?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36',
        'Accept': 'application/json, text/plain, */*',
        'MWeibo-Pwa': '1',
        'Sec-Fetch-Mode': 'cors',
        'X-Requested-With': 'XMLHttpRequest',
        'X-XSRF-TOKEN': '4dd422',
    }
    if min_since_id:
        url = url + '&since_id=' + min_since_id
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        # Parse the data
        r_json = json.loads(r.text)
        cards = r_json['data']['cards']
        # The first response's cards include the topic header; later responses contain only posts
        card_group = cards[2]['card_group'] if len(cards) > 1 else cards[0]['card_group']
        for card in card_group:
            # Row that will be written to the CSV file
            sina_columns = []
            mblog = card['mblog']
            # Parse the user information
            user = mblog['user']
            # Crawl the user's profile; Weibo has anti-crawling measures
            # and returns 418 when requests come too fast
            try:
                basic_infos = spider_user_info(user['id'])
                sina_columns.append(user['id'])   # user id first
                sina_columns.extend(basic_infos)  # then the profile fields
            except Exception as e:
                print(e)
                continue
            # Parse the post content
            r_since_id = mblog['id']
            # Strip HTML tags, keeping only the text
            sina_text = re.compile(r'<[^>]+>', re.S).sub(" ", mblog['text'])
            # Remove the topic label at the start (Simplified Chinese, as the site serves it)
            sina_text = sina_text.replace("我们的叁叁肆超话", '').strip()
            # Add the post id and content to the row
            sina_columns.append(r_since_id)
            sina_columns.append(sina_text)
            # Check that the row is complete. Row format:
            # ['user id', 'username', 'gender', 'region', 'birthday', 'post id', 'post content']
            if len(sina_columns) < 7:
                print('----- previous row incomplete -----')
                continue
            # Save the row
            save_columns_to_csv(sina_columns)
            # Track the smallest since_id for the next request
            if min_since_id:
                min_since_id = r_since_id if min_since_id > r_since_id else min_since_id
            else:
                min_since_id = r_since_id
            # Pause between requests
            time.sleep(random.randint(3, 6))
    except Exception as e:
        print(e)


def spider_user_info(uid) -> list:
    '''
    Crawl a user's information (login required) and return the parsed basic fields.
    :param uid: user id
    :return: ['username', 'gender', 'region', 'birthday'], or [] on failure
    '''
    user_info_url = 'https://weibo.cn/%s/info' % uid
    kv = {
        'user-agent': 'Mozilla/5.0'
    }
    try:
        r = s.get(url=user_info_url, headers=kv)
        r.raise_for_status()
        # Extract the "basic information" block with a regular expression
        basic_info_html = re.findall('<div class="tip">基本信息</div><div class="c">(.*?)</div>', r.text)
        # Extract: username, gender, region, birthday
        basic_infos = get_basic_info_list(basic_info_html)
        return basic_infos
    except Exception as e:
        print(e)
        return []


def get_basic_info_list(basic_info_html) -> list:
    '''
    Parse the HTML and extract the fields we need.
    :param basic_info_html: result of re.findall on the info page
    :return: ['username', 'gender', 'region', 'birthday']
    '''
    basic_infos = []
    basic_info_kvs = basic_info_html[0].split('<br/>')
    # Field labels are in Simplified Chinese, as the site serves them
    for basic_info_kv in basic_info_kvs:
        if basic_info_kv.startswith('昵称'):
            basic_infos.append(basic_info_kv.split(':')[1])
        elif basic_info_kv.startswith('性别'):
            basic_infos.append(basic_info_kv.split(":")[1])
        elif basic_info_kv.startswith('地区'):
            area = basic_info_kv.split(':')[1]
            # If the region is "other" or "overseas", store an empty string
            if '其他' in area or '海外' in area:
                basic_infos.append('')
                continue
            # Keep only the province part of "province city"
            if ' ' in area:
                area = area.split(' ')[0]
            basic_infos.append(area)
        elif basic_info_kv.startswith('生日'):
            birthday = basic_info_kv.split(':')[1]
            # Only birthdays that include a year are useful
            if birthday.startswith('19') or birthday.startswith('20'):
                # Keep only the first three digits (the decade)
                basic_infos.append(birthday[:3])
            else:
                basic_infos.append("")
        else:
            pass
    # Some users have no birthday; pad with an empty string
    if len(basic_infos) < 4:
        basic_infos.append("")
    return basic_infos


def save_columns_to_csv(columns, encoding='utf-8'):
    # newline='' prevents the csv module from writing blank lines between rows on Windows
    with open(CSV_FILE_PATH, 'a', encoding=encoding, newline='') as f:
        writer = csv.writer(f)
        writer.writerow(columns)


def path_spider_topic():
    # Log in first; do not crawl if login fails
    if not login_sina():
        return
    # Clear data from earlier runs before writing
    if os.path.exists(CSV_FILE_PATH):
        os.remove(CSV_FILE_PATH)
    # Crawl in bulk
    for i in range(25):
        print('Page %d' % (i + 1))
        spider_topic()


if __name__ == '__main__':
    path_spider_topic()
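Take a look at the generated CSV file. Note that opening it in WPS or Excel may show garbled text: we write the file as UTF-8 without a byte order mark, and WPS or Excel then assume a legacy local encoding such as GBK. An ordinary text editor opens it fine, and so does PyCharm!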
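If the file does need to open cleanly in a spreadsheet program, one option is Python's `utf-8-sig` codec, which writes a byte order mark that Excel uses to detect UTF-8. This is a minimal sketch with an invented row and a temporary file; `save_rows_for_excel` is a hypothetical helper, not part of the article's script.

```python
import csv
import os
import tempfile


def save_rows_for_excel(path: str, rows) -> None:
    # 'utf-8-sig' prepends a BOM; newline='' avoids blank lines between rows on Windows
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        csv.writer(f).writerows(rows)


path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
save_rows_for_excel(path, [['1', 'demo_user', '女', '江苏', '199', '100', 'hello']])

# The first three bytes of the file are the UTF-8 byte order mark
with open(path, 'rb') as f:
    head = f.read(3)
print(head)
```

A file written this way should be read back with `encoding='utf-8-sig'` as well, so that the mark does not end up inside the first field.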
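VI. Data Analysis
With the data saved we can start the analysis. First, which data do we want to analyze?
- Turn the gender data into a pie chart, simple and intuitive
- Turn the age data into a bar chart for easy comparison
- Turn the region data into a heat map of China to see where fans are most active
- Finally, turn the post content into a word cloud to see at a glance what people are saying
1. Reading a column from the CSV file
Our saved data consists of many rows in the format 'user id', 'username', 'gender', 'region', 'birthday', 'post id', 'post content', whereas the analysis needs one specific column, for example the gender column. So we wrap the column reading in a function!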
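import collections
import csv


def read_csv_to_dict(index) -> dict:
    """
    Read one column of the CSV data.
    Row format: 'user id', 'username', 'gender', 'region', 'birthday', 'post id', 'post content'
    :param index: column to read, starting from 0
    :return: dict with each value as key and its number of occurrences as value
    """
    with open(CSV_FILE_PATH, 'r', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        # Read inside the with block (the reader is lazy) and skip blank or short rows
        column = [columns[index] for columns in reader if len(columns) > index]
    dic = collections.Counter(column)
    # Drop the empty string
    if '' in dic:
        dic.pop('')
    print(dic)
    return dic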
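The column-counting idea can be checked without a file by feeding the csv reader an in-memory string. The rows below are invented, and `count_column` is a hypothetical stand-in for the function above.

```python
import collections
import csv
import io


def count_column(csv_text: str, index: int) -> collections.Counter:
    """Count the values of one column, skipping blank rows and empty cells."""
    reader = csv.reader(io.StringIO(csv_text))
    values = (row[index] for row in reader if len(row) > index)
    counter = collections.Counter(values)
    counter.pop('', None)  # drop empty cells if present
    return counter


# Invented rows in the article's column order; note the blank line and the empty region
sample = (
    "1,a,女,江苏,199,100,hello\n"
    "2,b,男,北京,198,101,hi\n"
    "\n"
    "3,c,男,,199,102,hey\n"
)
print(count_column(sample, 2))  # gender column
print(count_column(sample, 3))  # region column
```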
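2. The visualization library pyecharts
Before the analysis there is one important thing to do: choose a suitable visualization library! Python has many of them. So far we have used matplotlib for word clouds, and it is very convenient for simple plots. Today, however, we need a nationwide distribution map, so after comparing the options 豬哥 (the author) chose pyecharts, a library developed in China. The reasons: it is open source and free, well documented, rich in chart types, and concise to code with. In a word: a joy to use!
- Website: https://pyecharts.org/#/
- Source: https://github.com/pyecharts/pyecharts
- Install: pip install pyecharts
3. Gender analysis
Now that we have chosen a library, let's use it!
Note: if the generated CSV file contains blank lines between rows, remove them first.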
from pyecharts import options as opts
from pyecharts.charts import Pie


def analysis_gender():
    """
    Analyze gender.
    :return:
    """
    # Read the gender column
    dic = read_csv_to_dict(2)
    # Build a list of [name, count] pairs
    gender_count_list = [list(z) for z in zip(dic.keys(), dic.values())]
    print(gender_count_list)
    pie = (
        Pie()
        .add("", gender_count_list)
        .set_colors(["red", "blue"])
        .set_global_opts(title_opts=opts.TitleOpts(title="性別分析"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    pie.render('gender.html')
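Why is the output an HTML file? Because the chart is interactive: you can click to choose what is displayed, which is very user-friendly! Running the code produces a gender.html file; just open it in a browser!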

[Chart: gender.html, pie chart titled "性別分析" (gender analysis). Data: 女 (female) 111, 男 (male) 160]
The chart shows that there are somewhat fewer female fans than male fans.
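To put a number on the comparison, the shares can be computed from the two counts that appear in the rendered chart data (111 and 160). This is a small arithmetic sketch, not part of the original script.

```python
# Counts taken from the chart data in this article
counts = {'女': 111, '男': 160}

total = sum(counts.values())
# Percentage share of each group, rounded to one decimal place
shares = {name: round(value / total * 100, 1) for name, value in counts.items()}
print(total, shares)
```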
4. Age analysis
This is the part people care about most: the age distribution of the fans.
from pyecharts import options as opts
from pyecharts.charts import Bar, Pie


def analysis_age():
    """
    Analyze age.
    :return:
    """
    # Read the birthday column (first three digits of the birth year)
    dic = read_csv_to_dict(4)
    # Bar chart, with the keys sorted
    sorted_dic = {}
    for key in sorted(dic):
        sorted_dic[key] = dic[key]
    print(sorted_dic)
    bar = (
        Bar()
        .add_xaxis(list(sorted_dic.keys()))
        .add_yaxis("李逼聽眾年齡分析", list(sorted_dic.values()))
        .set_global_opts(
            yaxis_opts=opts.AxisOpts(name="數量"),
            xaxis_opts=opts.AxisOpts(name="年齡"),
        )
    )
    bar.render('age_bar.html')
    # Pie chart
    age_count_list = [list(z) for z in zip(dic.keys(), dic.values())]
    pie = (
        Pie()
        .add("", age_count_list)
        .set_global_opts(title_opts=opts.TitleOpts(title="李逼聽眾年齡分析"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    pie.render('age-pie.html')
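Because only the first three digits of the birth year are kept, the chart keys are prefixes such as 199 rather than ages. The sketch below turns them into decade labels, using the counts that appear in this article's chart data; `decade_label` is a hypothetical helper.

```python
def decade_label(prefix: str) -> str:
    """'199' -> '1990s'; the prefix is the first three digits of a birth year."""
    return prefix + '0s'


# Counts taken from the age chart data in this article
age_counts = {'199': 124, '200': 7, '198': 13, '201': 4, '190': 3, '197': 2}

# Sort by prefix so the decades appear in chronological order
labelled = [(decade_label(key), age_counts[key]) for key in sorted(age_counts)]
print(labelled)
```

Entries such as 190 (the 1900s) are presumably placeholder birthdays rather than real ages, so they may be worth filtering out before charting.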

[Chart: age-pie.html, pie chart titled "李逼聽眾年齡分析" (listener age analysis). Birth-year prefix and count: 199: 124, 200: 7, 198: 13, 201: 4, 190: 3, 197: 2]

[Chart: age_bar.html, bar chart with series "李逼聽眾年齡分析" (listener age analysis). X axis: 年齡 (age, as birth-year prefix); Y axis: 數量 (count). Data: 190: 3, 197: 2, 198: 13, 199: 124, 200: 7, 201: 4]
5. Region analysis
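from pyecharts import options as opts
from pyecharts.charts import Map


def analysis_area():
    """
    Analyze regions.
    :return:
    """
    # Read the region column
    dic = read_csv_to_dict(3)
    area_count_list = [list(z) for z in zip(dic.keys(), dic.values())]
    print(area_count_list)
    # Named area_map to avoid shadowing the built-in map()
    area_map = (
        Map()
        .add("李逼聽眾地區分析", area_count_list, "china")
        .set_global_opts(
            visualmap_opts=opts.VisualMapOpts(max_=200),
        )
    )
    area_map.render('area.html')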
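The code above fixes the color scale at max_=200, while the largest count in the article's data is 38, so every province lands in the lowest part of the color range. A possible refinement, sketched here as an assumption rather than the author's approach, is to derive the upper bound from the data; `visual_max` is a hypothetical helper and the sample counts are a subset of the chart data.

```python
import math


def visual_max(counts: dict, step: int = 10) -> int:
    """Round the largest count up to the next multiple of `step` (at least `step`)."""
    largest = max(counts.values(), default=0)
    return max(step, math.ceil(largest / step) * step)


# A few of the region counts from the chart data in this article
sample_counts = {'江苏': 38, '北京': 27, '山东': 26, '浙江': 21, '天津': 1}
print(visual_max(sample_counts))
```

The result could then be passed as `opts.VisualMapOpts(max_=visual_max(dic))` in place of the fixed 200.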

[Chart: area.html, map of China with series "李逼聽眾地區分析" (listener region analysis), color scale 0 to 200. Counts by region: Shanghai 6, Hunan 11, Shandong 26, Hebei 3, Jiangsu 38, Henan 14, Sichuan 6, Shaanxi 19, Guizhou 2, Gansu 5, Jiangxi 4, Zhejiang 21, Hubei 6, Anhui 2, Beijing 27, Chongqing 6, Tianjin 1, Yunnan 16, Guangxi 2, Shanxi 3, Inner Mongolia 4, Fujian 2, Guangdong 4, Liaoning 7]
6. Content analysis
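import jieba.analyse
from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType

# Stop-word file; the file name is an assumption, the original does not give it
STOP_WORDS_FILE_PATH = 'stop_words.txt'


def analysis_sina_content():
    """
    Analyze the post content.
    :return:
    """
    # Read the post content column
    dic = read_csv_to_dict(6)
    # Data cleaning: drop meaningless words
    jieba.analyse.set_stop_words(STOP_WORDS_FILE_PATH)
    # Keyword weights via TextRank
    words_count_list = jieba.analyse.textrank(' '.join(dic.keys()), topK=50, withWeight=True)
    print(words_count_list)
    # Build the word cloud
    word_cloud = (
        WordCloud()
        .add("", words_count_list, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="叄叄肆超話微博內容分析"))
    )
    word_cloud.render('word_cloud.html')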
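The word cloud takes a list of (word, weight) pairs. To illustrate that shape and the effect of stop words without jieba, here is a plain frequency count over tokens that are already segmented. It is a simplification and not TextRank; the token list is invented and `word_weights` is a hypothetical helper.

```python
from collections import Counter


def word_weights(tokens, stop_words, top_k=50):
    """Return up to top_k (word, weight) pairs, with the top word scaled to 1.0."""
    counts = Counter(t for t in tokens if t and t not in stop_words)
    if not counts:
        return []
    top = counts.most_common(top_k)
    peak = top[0][1]  # count of the most frequent word
    return [(word, count / peak) for word, count in top]


# Invented, pre-segmented tokens; '的' plays the role of a stop word
tokens = ['巡演', '喜欢', '的', '巡演', '南京', '的', '喜欢', '巡演']
result = word_weights(tokens, stop_words={'的'}, top_k=3)
print(result)
```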

[Chart: word_cloud.html, diamond-shaped word cloud titled "叄叄肆超話微博內容分析" (super topic post content analysis), 50 keywords. Highest-weighted words: 巡演 (tour) 1.00, 喜欢 (like) 0.92, 南京 (Nanjing) 0.88, 乐队 (band) 0.76, 没有 (none) 0.64, 现场 (live) 0.56, 歌词 (lyrics) 0.49, 备忘录 (memo) 0.47, 云南 (Yunnan) 0.44, 分享 (share) 0.43]