Crawling a website to find its high-frequency keywords

This article is a repost. 2019-12-10 20:31

import requests
from bs4 import BeautifulSoup
import jieba
    
    
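# Fetch the page and parse it into a BeautifulSoup object
def get_html(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # must be called; raises on 4xx/5xx responses
        response.encoding = response.apparent_encoding
        return BeautifulSoup(response.text, 'html.parser')
    except requests.RequestException as err:
        print('Fetch failed:', err)
        return None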


# Count how often each segmented word occurs
def count_word(txt):
    counts = {}
    for word in jieba.cut(txt):
        # Skip single-character tokens such as particles and punctuation
        if len(word) == 1:
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts


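def main():
    url = 'http://www.c114.com.cn/'
    html = get_html(url)
    if html is None:  # get_html returns None when the request fails
        return
    print('get html')
    t = html.get_text('+', strip=True)
    # Keep only characters with code point >= 256; this drops English letters, digits and the '+' separator
    txt = "".join(i for i in t if ord(i) >= 256)
    print('get txt')
    counts = count_word(txt)
    items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    # Slice instead of indexing so pages with fewer than 15 words do not raise IndexError
    for word, count in items[:15]:
        print('{:<15}{:>5}'.format(word, count))


if __name__ == '__main__':
    main()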
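
One thing to note about the `ord(i) >= 256` filter in main: it also removes the '+' separator passed to get_text, so neighbouring text fragments are joined together before segmentation, and full-width punctuation such as "，" is kept. A possible alternative, shown here only as a sketch, is to pull out each run of Chinese characters separately with a regular expression. The helper name chinese_segments and the choice of the U+4E00 to U+9FFF range are assumptions of this example, not part of the original script.

```python
import re

# CJK Unified Ideographs basic block. This range is an assumption about what
# counts as "Chinese text"; it excludes punctuation and the extension blocks.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fff]+')


def chinese_segments(text):
    """Return the runs of consecutive Chinese characters found in text.

    chinese_segments is a hypothetical helper name used for illustration.
    """
    return CHINESE_RUN.findall(text)


if __name__ == "__main__":
    # Hand-written sample imitating get_text('+', strip=True) output
    sample = "5G網絡+China Mobile+運營商，通信"
    print(chinese_segments(sample))
```

Each returned segment could then be passed to jieba.cut on its own, so that words are never formed across fragment boundaries.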

The script was run against two sites as examples: C114通信網 [http://www.c114.com.cn/] and 通信人家園 [http://www.txrjy.com/forum.php].
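
The counting and ranking that count_word and main perform for each site can also be written with collections.Counter from the standard library. The following is a minimal sketch rather than the original author's code: top_words is a hypothetical helper, and the token list is hand-written to stand in for the output of jieba.cut so that the example runs without jieba installed.

```python
from collections import Counter


def top_words(tokens, n=15):
    """Return the n most frequent tokens that are longer than one character.

    tokens is any iterable of strings, for example the generator returned by
    jieba.cut(txt). top_words is a hypothetical helper name.
    """
    counts = Counter(tok for tok in tokens if len(tok) > 1)
    # most_common sorts by count in descending order and simply returns
    # fewer items when n is larger than the number of distinct tokens
    return counts.most_common(n)


if __name__ == "__main__":
    # Hand-written sample tokens standing in for jieba.cut output
    sample = ["通信", "網絡", "通信", "的", "運營商", "網絡", "通信"]
    for word, count in top_words(sample, n=3):
        print('{:<15}{:>5}'.format(word, count))
```

Because most_common never indexes past the end of the list, this version needs no special handling for pages that yield fewer than 15 distinct words.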
