I recently happened to open an English-language website and, on closer inspection, it turned out to be the English edition of China Daily. Reading more English news seemed like a good way to build up a feel for the language, but my English is still pretty weak, so the clever idea I came up with was to scrape the site and pull out the high-frequency vocabulary from its articles. Enough talk, here is how it went:
Scraping the article links from the whole China Daily site
1. Use bs4 to get all <a> tags that have an href attribute

import requests
from bs4 import BeautifulSoup

url = 'http://www.chinadaily.com.cn/'
wbdata = requests.get(url).text
soup = BeautifulSoup(wbdata, 'lxml')  # create a BeautifulSoup object
print soup

linklist = []
for i in soup.find_all('a', href=True):  # only <a> tags that actually carry an href
    link = i['href']
    linklist.append(link)
set(linklist)  # meant to deduplicate, but has no effect here -- it would need linklist = list(set(linklist))
print linklist

Output:
['http://www.chinadaily.com.cn', 'http://usa.chinadaily.com.cn', 'http://europe.chinadaily.com.cn', 'http://africa.chinadaily.com.cn', 'http://www.chinadaily.com.cn/business/2017-09/18/content_32146768.htm', 'http://www.chinadaily.com.cn/world/2017-09/18/content_32146327.htm', 'http://www.chinadaily.com.cn/world/2017-09/18/content_32146327.htm', 'javascript:void(0)', 'javascript:void(0)', 'javascript:void(0)', 'javascript:void(0)', 'china/2017-09/18/content_32158742.htm', 'business/2017-09/18/content_32157059.htm', 'china/2017-09/18/content_32157984.htm', 'business/2017cnruimf/2017-09/18/content_32157756.htm'...]
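Many of the href values are relative paths such as 'china/2017-09/18/content_32158742.htm', so they need the site's base URL prepended before they can be requested. The complete script in step 3 does this with a string check; an equivalent sketch using the standard library (urlparse in Python 2, urllib.parse in Python 3) would be:

from urlparse import urljoin  # Python 2; on Python 3 use "from urllib.parse import urljoin"

base = 'http://www.chinadaily.com.cn/'
full_links = [urljoin(base, link) for link in linklist]  # absolute URLs pass through unchanged
print full_links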
2. Use a regular expression to extract the links that match the article pattern

import re

linklist = ['china/2017-09/18/content_32143681.htm',
            'http://www.chinadaily.com.cn/culture/2017-09/18/content_32158002.htm',
            'javascript:void(0)']
s = ",".join(linklist)
print s
links = re.findall(r'.*?/2017-09/18/content_\d+.htm', s)
print links

Output:
china/2017-09/18/content_32143681.htm,http://www.chinadaily.com.cn/culture/2017-09/18/content_32158002.htm,javascript:void(0)
['china/2017-09/18/content_32143681.htm', ',http://www.chinadaily.com.cn/culture/2017-09/18/content_32158002.htm']
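Joining the list with commas before matching lets the lazy .*? run straight across the separator, which is why the second match above starts with a stray comma. Applying the regex to each link on its own avoids that, and is essentially what the complete script in step 3 does; a minimal sketch:

import re

links = []
for link in linklist:
    m = re.findall(r'.*?/2017-09/18/content_\d+.htm', link)
    if m:
        links.append(m[0])
print links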
3. The complete source code:

import requests
import re
from bs4 import BeautifulSoup

url = 'http://www.chinadaily.com.cn/'
wbdata = requests.get(url).text
soup = BeautifulSoup(wbdata, 'lxml')  # create a BeautifulSoup object
#print soup

linklist = []
for i in soup.find_all('a', href=True):
    link = i['href']
    #print link
    linklist.append(link)
#set(linklist)  # deduplicate
#print linklist
print len(linklist)

# keep only links that look like article pages
newlist = []
for i in linklist:
    link = re.findall(r'.*?/content_\d+.htm', i)
    if link != []:
        newlist.append(link[0])
print len(newlist)

# prepend the base URL to relative links
for i in range(0, len(newlist)):
    if newlist[i].find('http://www.chinadaily.com.cn/') == -1:
        newlist[i] = 'http://www.chinadaily.com.cn/' + newlist[i]
print newlist

Output:

Fetching the content of the linked articles

for url in newlist:
    print url
    wbdata = requests.get(url).content
    soup = BeautifulSoup(wbdata, 'lxml')
    # remove newline characters
    text = str(soup).replace('\n', '').replace('\r', '')
    # strip <script> tags and their contents
    text = re.sub(r'\<script.*?\>.*?\</script\>', ' ', text)
    # strip the remaining HTML tags
    text = re.sub(r'\<.*?\>', " ", text)
    # keep letters only
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # convert to lower case
    text = text.lower()
    text = text.split()
    # drop single characters and the site's own name
    text = [i for i in text if len(i) > 1 and i != 'chinadaily' and i != 'com' and i != 'cn']
    text = ' '.join(text)
    print text
    with open(r"C:\Users\HP\Desktop\codes\DATA\words.txt", 'a+') as file:
        file.write(text + ' ')
    print("written successfully")

Output:

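As a side note, the regex-based tag stripping above could also be handed off to BeautifulSoup itself. A minimal sketch of the same cleaning step using get_text() (just an alternative, not the approach used in this post):

soup = BeautifulSoup(wbdata, 'lxml')
for tag in soup(['script', 'style']):  # remove script/style blocks before extracting text
    tag.decompose()
text = soup.get_text(' ')
text = re.sub(r'[^a-zA-Z]', ' ', text).lower()
words = [w for w in text.split() if len(w) > 1]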
High-frequency word analysis
A quick reference for the basic syntax used below:
from nltk.corpus import PlaintextCorpusReader
corpus_root = ''
wordlists = PlaintextCorpusReader(corpus_root, 'datafile.txt') # load the file as a corpus
wordlists.words() # all the words in the corpus
from nltk.corpus import stopwords
stop = stopwords.words('english') # get the English stop-word list
from nltk.probability import *
fdist = FreqDist(swords) # create a frequency distribution over the words
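Putting those pieces together on a tiny word list (a minimal sketch; it assumes the NLTK stopwords corpus has already been downloaded with nltk.download('stopwords')):

from nltk.corpus import stopwords
from nltk.probability import FreqDist

# nltk.download('stopwords')  # run once if the corpus is missing
words = ['china', 'daily', 'the', 'mooncake', 'mooncake', 'is', 'a', 'mooncake']
stop = stopwords.words('english')
swords = [w for w in words if w not in stop]  # drop English stop words
fdist = FreqDist(swords)                      # frequency distribution over the remaining words
print fdist.most_common(3)                    # e.g. [('mooncake', 3), ('china', 1), ('daily', 1)]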
The source code:

from nltk.corpus import stopwords
from nltk.probability import *
from nltk.corpus import PlaintextCorpusReader
import nltk

corpus_root = 'C:/Users/HP/Desktop/codes/DATA'
wordlists = PlaintextCorpusReader(corpus_root, '.*')  # load the files as a corpus
text = nltk.Text(wordlists.words('words.txt'))  # all the words in the corpus
#text = 'gold and silver mooncakes look more alluring than real ones home business biz photos gold and silver mooncakes look more alluring than real ones by tan xinyu tan xinyu gold and silver mooncakes look more alluring than real ones mid autumn festival mooncake jewelry biz photos webnews enpproperty the jade mooncake left looks similar to the real one right in guangzhou south china guangdong province sept photo ic as the mid autumn festival is around the corner the traditional chinese delicacy mooncake has already hit shops and stores across the country and mooncake makers also are doing their best to attract customers with various flavors and stuffing the family reunion festival falls on oct this year or the th day of the th chinese lunar month and eating mooncakes with families on the day is one of important ways for chinese people to celebrate it sadly these mooncakes displayed in jewelry outlet cannot be eaten as they are made of gold silver or jade although they look more real than real previous next previous next photo on the move kazak herdsmen head to winter pastures salt lake in shanxi looks like double flavor hot pot special subway train for wuhan open ready to serve world top financial centers rostov on don makes preparations for world cup new chinese embassy sign of stronger china panama ties back to the top home china world business lifestyle culture travel sports opinion regional forum newspaper china daily pdf china daily paper mobile copyright all rights reserved the content including but not limited to text photo multimedia information etc published in this site belongs to china daily information co cdic without written authorization from cdic such content shall not be republished or used in any form note browsers with or higher resolution are suggested for this site license for publishing multimedia online registration number about china daily advertise on site contact us job offer expat employment follow us'

print "total number of words:", len(text)
stop = stopwords.words('english')  # English stop words
print stop
swords = [i for i in text if i not in stop]
print "number of words after removing stop words:", len(swords)
fdist = FreqDist(swords)
print fdist
#print fdist.items()  # list of (word, count) tuples
#print fdist.most_common(10)
newd = sorted(fdist.items(), key=lambda item: item[1], reverse=True)  # sort by count so frequencies run from high to low
print newd[:50]
Note: type(text) is <class 'nltk.text.Text'>; if text were a plain string instead, you would end up with the frequency of individual letters rather than words.
Output:

Word cloud of the high-frequency words

from pyecharts import WordCloud

data = dict(newd[:50])
wordcloud = WordCloud('High-frequency word cloud', width=600, height=400)
wordcloud.add('ryana', data.keys(), data.values(), word_size_range=[20, 80])
wordcloud

Output:

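The snippet above uses the old pyecharts 0.x interface. On pyecharts 1.x and later the WordCloud class lives in pyecharts.charts and takes (word, count) pairs instead of separate key and value lists; a rough equivalent under that assumption:

from pyecharts.charts import WordCloud
from pyecharts import options as opts

wc = WordCloud()
wc.add('ryana', newd[:50], word_size_range=[20, 80])  # newd[:50] is already a list of (word, count) pairs
wc.set_global_opts(title_opts=opts.TitleOpts(title='High-frequency word cloud'))
wc.render('wordcloud.html')  # writes an HTML file that can be opened in a browser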
The most frequent words in China Daily articles turn out to be 'china' and 'daily'. Speechless. Clearly filtering out just 'chinadaily' is not enough...
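A quick follow-up fix (a sketch, not part of the run above) is to treat the site's own name words as extra stop words before building the frequency distribution:

# extend the stop-word list so the site's own name no longer dominates the counts
stop = stopwords.words('english') + ['china', 'daily', 'chinadaily']
swords = [i for i in text if i not in stop]
fdist = FreqDist(swords)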
