Generating a Chinese Word Cloud with Python: Crawling Zhihu User Attributes as an Example


 

The code is as follows:

# -*- coding:utf-8 -*-

import requests
import pandas as pd
import time

import matplotlib.pyplot as plt
from wordcloud import WordCloud
import jieba

header={
    'authorization':'Bearer 2|1:0|10:1515395885|4:z_c0|92:Mi4xOFQ0UEF3QUFBQUFBRU1LMElhcTVDeVlBQUFCZ0FsVk5MV2xBV3dDLVZPdEhYeGxaclFVeERfMjZvd3lOXzYzd1FB|39008996817966440159b3a15b5f921f7a22b5125eb5a88b37f58f3f459ff7f8',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36',
    'X-UDID':'ABDCtCGquQuPTtEPSOg35iwD-FA20zJg2ps=',
}

user_data = []
def get_user_data(page):
    for i in range(page):
        url = 'https://www.zhihu.com/api/v4/members/excited-vczh/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset={}&limit=20'.format(i*20)
#response = requests.get(url, headers=header).text
        response = requests.get(url, headers=header).json()['data']  # ['data'] selects only the 'data' node of the JSON response
        user_data.extend(response)
        print('Crawling page %s' % str(i+1))
        time.sleep(1)

if __name__=='__main__':
    get_user_data(10)
    # pandas' DataFrame.from_dict() turns the list of response dicts directly into a DataFrame
    #df = pd.DataFrame.from_dict(user_data)
    #df.to_csv('D:/PythonWorkSpace/TestData/zhihu/user2.csv')
    df = pd.DataFrame.from_dict(user_data).get('headline')
    df.to_csv('D:/PythonWorkSpace/TestData/zhihu/headline.txt')

    text_from_file_with_apath = open('D:/PythonWorkSpace/TestData/zhihu/headline.txt', encoding='utf-8').read()
    wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=True)
    wl_space_split = " ".join(wordlist_after_jieba)

    my_wordcloud = WordCloud().generate(wl_space_split)

    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()
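The crawl loop above pages through Zhihu's followees API 20 records at a time by growing the `offset` query parameter: page `i` starts at offset `i*20`. A minimal sketch of that mapping (the URL below is a shortened, illustrative form of the endpoint; the real one carries the long `include=` field list from the script):

```python
# shortened form of the followees endpoint used above; the real URL
# also carries a long include= parameter listing the fields to return
URL_TMPL = ('https://www.zhihu.com/api/v4/members/excited-vczh/followees'
            '?offset={}&limit=20')

def page_url(page_index, page_size=20):
    # page 0 -> offset 0, page 1 -> offset 20, page 2 -> offset 40, ...
    return URL_TMPL.format(page_index * page_size)

print(page_url(0))
print(page_url(2))
```

With `limit=20` fixed, ten iterations of the loop fetch the first 200 followees.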

Libraries that need to be installed first:

pip install matplotlib
pip install jieba
pip install wordcloud   (this last one failed to install this way)

As an alternative, download the library source from https://github.com/amueller/word_cloud , unzip it, go into the extracted folder, Shift + right-click to open a command window there, and run:

python setup.py install

 然后同樣報錯

 然后我又換了一張安裝方式:
到 http://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud 頁面下載所需的wordcloud模塊的whl文件,下載后進入存儲該文件的路徑,按照方法一,執行“pip install wordcloud-1.3.3-cp36-cp36m-win_amd64.whl”,這樣就會安裝成功。

 

然后生成詞雲的代碼如下:

text_from_file_with_apath = open('D:/Python/zhihu/headline.txt', 'r', encoding='utf-8').read()
wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=True)
wl_space_split = " ".join(wordlist_after_jieba)

my_wordcloud = WordCloud().generate(wl_space_split)

plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
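WordCloud.generate() takes plain text and counts word frequencies from it, which is why jieba's tokens are re-joined with spaces above: the separators let WordCloud see each Chinese word as its own unit. The counting step can be sketched with `collections.Counter` standing in for WordCloud's internal frequency computation (the sample tokens below are made up for illustration):

```python
from collections import Counter

# space-separated tokens, as produced by " ".join(jieba.cut(...))
wl_space_split = '互联网 产品 经理 互联网 工程师'

# approximate what WordCloud does internally: split into tokens
# and tally how often each one occurs
freq = Counter(wl_space_split.split())
print(freq['互联网'])
```

Words with higher counts are drawn larger in the final cloud.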

But Chinese characters did not display, which was a headache: instead of words, the cloud showed colored boxes of various sizes. The reason is that the default FONT_PATH in wordcloud.py points to a font that cannot render Chinese.
After some digging I made the change below, and Chinese finally displayed correctly:

text_from_file_with_apath = open('D:/Python/zhihu/headline.txt', 'r', encoding='utf-8').read()
wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=True)
wl_space_split = " ".join(wordlist_after_jieba)
#FONT_PATH = os.environ.get("FONT_PATH", os.path.join(os.path.dirname(__file__), "simkai.ttf"))
cloud = WordCloud(
    # set a font that covers Chinese; without it the words render as garbled boxes
    font_path="simkai.ttf",
    # background color
    background_color='white',
    # maximum number of words
    max_words=9000,
    # shape mask for the cloud
    #mask=color_mask
    )
# generate the word cloud
word_cloud = cloud.generate(wl_space_split)
word_cloud.to_file('D:/Python/zhihu/headline.jpg')  # save the image to the given file
# or display the image directly (and tweak it interactively):
# plt.imshow(word_cloud)
# plt.axis("off")
# plt.show()

   

 

Pitfall:

When reading files, Python often raises errors like this one (seen here on Python 3.4): UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence

import codecs, sys
f = codecs.open("***.txt", "r", "utf-8")
Explicitly naming the file's encoding when opening it eliminates the error.
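A self-contained sketch of the problem and the fix (written to a temp file so it runs anywhere): reading without an explicit encoding falls back to the platform default, which is gbk on Chinese Windows and cannot decode arbitrary UTF-8 bytes.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'headline.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('写代码的人')  # Chinese text, stored as UTF-8 bytes

# Reading without encoding= uses the locale default (gbk on Chinese
# Windows), which can raise UnicodeDecodeError on these bytes;
# naming the encoding explicitly is always safe.
text = open(path, encoding='utf-8').read()
print(text)
```

`codecs.open("...", "r", "utf-8")` from the snippet above achieves the same thing; on Python 3 the built-in `open(..., encoding='utf-8')` is the idiomatic form.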

