Day 74: The Python newspaper framework

Reposted article, 2020-05-31 15:52. Tag: python

by 程序員野客

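1 Introduction

newspaper is a third-party Python library (installed separately, not part of the standard library) for crawling news sites and for extracting and analyzing article content.

Its main features are:

Concise API
Fast
Multithreaded downloading
Multiple languages supported

GitHub: https://github.com/codelucas/newspaper

Install: pip3 install newspaper3k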

2 Basic usage

2.1 Listing supported languages

import newspaper

# languages() prints the supported language codes itself and returns None
newspaper.languages()

2.2 Fetching news

We use Huanqiu (環球網) as the example:

import newspaper

hq_paper = newspaper.build("https://tech.huanqiu.com/", language="zh", memoize_articles=False)

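By default, newspaper caches every article it has already extracted and leaves those articles out of later builds of the same source. Pass memoize_articles=False to opt out of this behavior.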
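
The idea behind this caching can be illustrated with a minimal sketch. It is a conceptual illustration only, not newspaper's actual implementation: the helper filter_new_urls, the in-memory seen set and the example.com URLs are all hypothetical.

```python
def filter_new_urls(urls, seen):
    """Return the URLs that were not seen before and record them as seen."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# The same "seen" set is shared across runs, so repeated URLs are dropped.
seen = set()
first_run = filter_new_urls(["http://a.example.com/1", "http://a.example.com/2"], seen)
second_run = filter_new_urls(["http://a.example.com/2", "http://a.example.com/3"], seen)
print(first_run)
print(second_run)
```

With memoization switched off, every build behaves like the first run and returns all articles it finds.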

2.3 Getting article URLs

>>> import newspaper

>>> hq_paper = newspaper.build("https://tech.huanqiu.com/", language="zh", memoize_articles=False)
>>> for article in hq_paper.articles:
...     print(article.url)

http://world.huanqiu.com/gallery/9CaKrnQhXvy
http://mil.huanqiu.com/gallery/7RFBDCOiXNC
http://world.huanqiu.com/gallery/9CaKrnQhXvz
http://world.huanqiu.com/gallery/9CaKrnQhXvw
...
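
The sample output lists URLs under world.huanqiu.com and mil.huanqiu.com even though the source was built from tech.huanqiu.com. If only one section is wanted, the URLs can be filtered by host name. Below is a minimal sketch using only the standard library; urls_on_host is a hypothetical helper, not part of newspaper.

```python
from urllib.parse import urlparse


def urls_on_host(urls, host):
    """Keep only the URLs whose host name equals `host`."""
    return [u for u in urls if urlparse(u).hostname == host]


sample = [
    "http://world.huanqiu.com/gallery/9CaKrnQhXvy",
    "http://mil.huanqiu.com/gallery/7RFBDCOiXNC",
    "http://world.huanqiu.com/gallery/9CaKrnQhXvz",
]
print(urls_on_host(sample, "mil.huanqiu.com"))
```

With a real source, the input list could be built as [a.url for a in hq_paper.articles].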

2.4 Getting categories

>>> import newspaper

>>> hq_paper = newspaper.build("https://tech.huanqiu.com/", language="zh", memoize_articles=False)
>>> for category in hq_paper.category_urls():
...     print(category)

http://www.huanqiu.com
http://tech.huanqiu.com
http://smart.huanqiu.com
https://tech.huanqiu.com/
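
Note that the list above contains both http://tech.huanqiu.com and https://tech.huanqiu.com/, which point at the same section. Here is a minimal sketch for dropping such near-duplicates, assuming that two URLs are equivalent when they differ only by scheme or a trailing slash; dedupe_categories is a hypothetical helper, not part of newspaper.

```python
from urllib.parse import urlparse


def dedupe_categories(urls):
    """Drop URLs that differ only by scheme or a trailing slash."""
    seen = set()
    unique = []
    for url in urls:
        parts = urlparse(url)
        # Compare on host name and path only, ignoring http vs https.
        key = (parts.hostname, parts.path.rstrip("/"))
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


categories = [
    "http://www.huanqiu.com",
    "http://tech.huanqiu.com",
    "http://smart.huanqiu.com",
    "https://tech.huanqiu.com/",
]
print(dedupe_categories(categories))
```

The first occurrence of each category is kept, so the order of the original list is preserved.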

2.5 Getting the brand and description

>>> import newspaper

>>> hq_paper = newspaper.build("https://tech.huanqiu.com/", language="zh", memoize_articles=False)
>>> print(hq_paper.brand)
>>> print(hq_paper.description)

huanqiu
環球網科技，不一樣的IT視角！以“成為全球科技界的一面鏡子”為出發點，向關注國際科技類資訊的網民，提供國際科技資訊的傳播與服務。

2.6 Downloading and parsing

We pick one of the articles as an example:

>>> import newspaper

>>> hq_paper = newspaper.build("https://tech.huanqiu.com/", language="zh", memoize_articles=False)
>>> article = hq_paper.articles[4]
>>> # Download the page
>>> article.download()
>>> # Parse it
>>> article.parse()
>>> # Article title
>>> print("title=", article.title)
>>> # Publication date
>>> print("publish_date=", article.publish_date)
>>> # Authors
>>> print("author=", article.authors)
>>> # URL of the top image
>>> print("top_image=", article.top_image)
>>> # Video links
>>> print("movies=", article.movies)
>>> # Summary (stays empty until article.nlp() has been called)
>>> print("summary=", article.summary)
>>> # Body text
>>> print("text=", article.text)

title= “美麗山”的美麗傳奇
publish_date= 2019-11-15 00:00:00
...
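
Downloads can fail, for example when a page has been removed or the network is down, so when we process many articles it helps to skip the failures instead of stopping at the first one. Below is a minimal sketch under our own assumptions: parse_all is a hypothetical helper that accepts any objects with download() and parse() methods, and it catches Exception broadly only to stay independent of the library's own exception types.

```python
def parse_all(articles):
    """Download and parse each article, skipping the ones that fail.

    Returns (parsed, failed), where failed holds (article, error) pairs.
    """
    parsed, failed = [], []
    for article in articles:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: this is only a sketch
            failed.append((article, exc))
        else:
            parsed.append(article)
    return parsed, failed
```

It could be called as parsed, failed = parse_all(hq_paper.articles[:5]), after which failed shows which articles to retry or ignore.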

2.7 Using the Article class

from newspaper import Article

article = Article('https://money.163.com/19/1130/08/EV7HD86300258105.html')
article.download()
article.parse()
print("title=", article.title)
print("author=", article.authors)
print("publish_date=", article.publish_date)
print("top_image=", article.top_image)
print("movies=", article.movies)
print("text=", article.text)
# summary stays empty until article.nlp() has been called
print("summary=", article.summary)

2.8 Parsing HTML

We fetch the article's HTML with the requests library and let newspaper parse it:

import requests
from newspaper import fulltext

html = requests.get('https://money.163.com/19/1130/08/EV7HD86300258105.html').text
print('Raw HTML -->', html)
text = fulltext(html, language='zh')
print('Extracted text -->', text)

2.9 nlp (natural language processing)

Let's compare the keywords of a news article before and after nlp processing:

>>> from newspaper import Article

>>> article = Article('https://money.163.com/19/1130/08/EV7HD86300258105.html')
>>> article.download()
>>> article.parse()
>>> print('before -->', article.keywords)
>>> # Run nlp processing
>>> article.nlp()
>>> print('after -->', article.keywords)

before --> []
after --> ['亞洲最大水秀項目成擺設', '至今拖欠百萬設計費']

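Judging from this result, newspaper's nlp processing works reasonably well: keywords is empty before the call and holds two key phrases afterwards.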
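
To give a feel for how keywords can be picked automatically, here is a minimal sketch of one common approach: count how often each token occurs, ignore stop words, and keep the most frequent ones. It is a simplified illustration under our own assumptions, not newspaper's actual nlp() algorithm; top_keywords and the sample tokens are hypothetical.

```python
from collections import Counter


def top_keywords(tokens, stop_words, n=3):
    """Return the n most frequent tokens that are not stop words."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(n)]


tokens = ["water", "show", "the", "water", "project", "the", "show", "water"]
print(top_keywords(tokens, stop_words={"the"}, n=2))
```

For Chinese text the tokens would first have to be produced by a word segmenter, since words are not separated by spaces.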

2.10 Multithreaded downloads

When we need to collect news from several sources, the downloads can run in multiple threads:

import newspaper
from newspaper import news_pool

hq_paper = newspaper.build('https://www.huanqiu.com', language="zh")
sh_paper = newspaper.build('http://news.sohu.com', language="zh")
sn_paper = newspaper.build('https://news.sina.com.cn', language="zh")

papers = [hq_paper, sh_paper, sn_paper]
# Total threads: 3 sources * 2 threads per source = 6
news_pool.set(papers, threads_per_source=2)
news_pool.join()
print(hq_paper.articles[0].html)

Because a lot of content is fetched, the code above may take a while to run, so some patience is needed.
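
The thread arithmetic in the code comment (3 sources * 2 threads per source = 6) can be mimicked with the standard library. This is a conceptual sketch only, not how news_pool is implemented: fake_download, the source names and the URLs are placeholders, and no network access takes place.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Stand-in for a real download; just returns a marker string."""
    return "html of " + url


sources = {
    "hq": ["hq/1", "hq/2", "hq/3"],
    "sh": ["sh/1", "sh/2"],
    "sn": ["sn/1"],
}
threads_per_source = 2
total_threads = len(sources) * threads_per_source

# Flatten all article URLs and download them on one shared pool.
all_urls = [url for urls in sources.values() for url in urls]
with ThreadPoolExecutor(max_workers=total_threads) as pool:
    pages = list(pool.map(fake_download, all_urls))

print(total_threads, len(pages))
```

pool.map returns results in the order of the input URLs, so pages[i] always belongs to all_urls[i] even though the downloads run concurrently.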

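3 Building a word cloud

Now let's see how to build a simple word cloud.

Required libraries

import newspaper
# Word-frequency counting (standard library)
import collections
# NumPy
import numpy as np
# jieba Chinese word segmentation
import jieba
# Word cloud rendering
import wordcloud
# Image handling (Pillow)
from PIL import Image
# Plotting
import matplotlib.pyplot as plt

The third-party libraries can be installed with pip install, for example: pip install wordcloud.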

Fetching and processing the article

# Create the article
article = newspaper.Article('https://news.sina.com.cn/o/2019-11-28/doc-iihnzahi3991780.shtml')
# Download the article
article.download()
# Parse the article
article.parse()
# Run nlp processing on the article
article.nlp()
# Concatenate the keywords produced by nlp
article_words = "".join(article.keywords)
# Segment in accurate mode (the default)
seg_list_exact = jieba.cut(article_words, cut_all=False)
# Collected tokens
object_list = []
# Words to drop
rm_words = ['迎', '以來', '將']
# Keep every token that is not in the drop list
for word in seg_list_exact:
    if word not in rm_words:
        object_list.append(word)
# Count word frequencies
word_counts = collections.Counter(object_list)
# Take the 10 most frequent words
word_top10 = word_counts.most_common(10)
# Print each word with its count
for w, c in word_top10:
    print(w, c)

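Generating the word cloud

# Load the background image used as the mask (shape) of the cloud
mask = np.array(Image.open('bg.jpg'))
wc = wordcloud.WordCloud(
    # Font file; it must contain Chinese glyphs (this path is Windows-specific)
    font_path='C:/Windows/Fonts/simhei.ttf',
    # Mask image
    mask=mask,
    # Maximum number of words shown
    max_words=100,
    # Maximum font size
    max_font_size=120
)
# Build the word cloud from the frequency dictionary
wc.generate_from_frequencies(word_counts)
# Build a color scheme from the background image and apply it
image_colors = wordcloud.ImageColorGenerator(mask)
wc.recolor(color_func=image_colors)
# Draw the word cloud
plt.imshow(wc)
# Hide the axes
plt.axis('off')
# Save the figure to a file
plt.savefig('wc.jpg')
# Show the figure
plt.show()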
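
The font_path used for the word cloud only exists on Windows. On other systems a font file that contains Chinese glyphs has to be supplied, otherwise the words may not render correctly. Below is a minimal sketch that picks the first font file that exists; first_existing is a hypothetical helper, and the candidate paths other than the Windows one are assumptions that may not match your machine.

```python
import os


def first_existing(paths):
    """Return the first path that is an existing file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


# Candidate locations are guesses; adjust them for your machine.
candidates = [
    "C:/Windows/Fonts/simhei.ttf",
    "/System/Library/Fonts/PingFang.ttc",
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",
]
font_path = first_existing(candidates)
print(font_path)
```

If font_path comes back as None, none of the candidates exist and a suitable font has to be installed or its path added to the list before it is passed to WordCloud.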

The resulting word cloud is shown in the figure:

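Summary

This article introduced newspaper, a Python crawler framework, covering enough of its basics to get started. In practice we will find that newspaper still has some bugs, so for real work it should be evaluated carefully before being relied on.

Example code: Python-100-days-day074

Reference: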

https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#performing-nlp-on-an-article
