News Crawling Library: Newspaper

Reposted article, 2021-02-12 15:06. Tags: weekly/monthly/yearly summaries, python, internet finance

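newspaper is a Python scraping library for extracting and analyzing news content, and it is well suited to scraping news pages. It is easy to learn, even for beginners with no prior scraping experience. For basic use you do not have to deal with HTTP headers or IP proxies, nor with page parsing or the structure of the page source.

We use https://www.wired.com/ as the example site.

Fetching news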

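import newspaper
from newspaper import Article
from newspaper import fulltext
url = 'https://www.wired.com/'
# Build a Source for the site; memoize_articles=False disables article caching
paper = newspaper.build(url, language="en", memoize_articles=False)
print(paper)

Printing the Source object gives: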

<newspaper.source.Source object at 0x7fe82c98c1d0>

By default, newspaper caches every article it has already extracted and leaves those articles out of later builds. Pass memoize_articles=False to opt out of this behavior.
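To make the caching behavior concrete, here is a minimal sketch of the idea behind memoization: remember which URLs were already seen and return only the new ones. This is not newspaper's actual implementation; the function name `filter_new_urls`, the in-memory set, and the example.com URLs are made up for illustration.

```python
def filter_new_urls(urls, seen):
    """Return the URLs not yet in `seen`, then record them as seen."""
    # dict.fromkeys drops duplicates while keeping the original order
    fresh = [u for u in dict.fromkeys(urls) if u not in seen]
    seen.update(fresh)
    return fresh


seen_urls = set()  # stands in for a persistent cache (an assumption of this sketch)
first_run = filter_new_urls(
    ["https://example.com/a", "https://example.com/b"], seen_urls
)
second_run = filter_new_urls(
    ["https://example.com/b", "https://example.com/c"], seen_urls
)
print(first_run)   # both URLs are new on the first run
print(second_run)  # only the URL that was not seen before
```

Setting memoize_articles=False corresponds, roughly, to skipping this filtering step so that every build returns the full list again.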

Extracting article URLs

Extract the article URLs found on the site's pages:

import newspaper
from newspaper import Article
from newspaper import fulltext
url = 'https://www.wired.com/'
paper = newspaper.build(url, language="en", memoize_articles=False)
for article in paper.articles:
    print(article.url)

Output: one article URL per line.
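The URL list can include pages you do not want, so it is often useful to filter it. The sketch below filters URLs by path prefix using only the standard library. The helper name `keep_by_path`, the default `/story/` prefix (borrowed from the Wired article URL used in this post), and the second sample URL are assumptions, so adjust them for the target site.

```python
from urllib.parse import urlparse


def keep_by_path(urls, prefix="/story/"):
    """Keep only the URLs whose path starts with `prefix`."""
    return [u for u in urls if urlparse(u).path.startswith(prefix)]


# Sample input; the second URL is illustrative only
sample_urls = [
    "https://www.wired.com/story/preterm-babies-lonely-terror-of-a-pandemic-nicu/",
    "https://www.wired.com/category/science/",
]
story_urls = keep_by_path(sample_urls)
print(story_urls)
```

With a real source you would pass `[a.url for a in paper.articles]` instead of the sample list.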

Extracting news categories

The category URLs of a site can be extracted as well:

for category in paper.category_urls():
    print(category)

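Extracting article content: Article

An Article object is the abstraction of a single news article. For example, the news Source would be Wired, while an Article is one Wired story on that site. From an Article you can extract the title, authors, top image, body text, and so on.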

article = Article('https://www.wired.com/story/preterm-babies-lonely-terror-of-a-pandemic-nicu/')
article.download()  # fetch the HTML
article.parse()     # extract title, authors, text, etc.
print("title=", article.title)
print("authors=", article.authors)
print("publish_date=", article.publish_date)
print("top_image=", article.top_image)
print("movies=", article.movies)
print("text=", article.text)
# summary stays empty until article.nlp() has been called
print("summary=", article.summary)
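Once an article is parsed, its fields are usually saved somewhere. The following sketch copies the main fields into a JSON-friendly dict. The helper name `article_to_record` is hypothetical, and the `SimpleNamespace` object with made-up values only stands in for a parsed Article so the sketch runs without network access.

```python
import json
from datetime import datetime
from types import SimpleNamespace


def article_to_record(article):
    """Copy the main fields of a parsed article into a JSON-friendly dict."""
    published = article.publish_date
    return {
        "url": article.url,
        "title": article.title,
        "authors": list(article.authors),
        # publish_date may be missing when no date could be detected
        "publish_date": published.isoformat() if published else None,
        "top_image": article.top_image,
        "text": article.text,
    }


# Stand-in object with made-up values (not real article data)
fake_article = SimpleNamespace(
    url="https://example.com/story/demo/",
    title="Demo title",
    authors=["Jane Doe"],
    publish_date=datetime(2021, 2, 12),
    top_image="https://example.com/demo.jpg",
    text="Demo body text.",
)
record = article_to_record(fake_article)
print(json.dumps(record, indent=2))
```

Converting the date with isoformat() keeps the record serializable, since json.dumps cannot encode datetime objects directly.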

Download and parse

Taking one article from the source as an example:

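# paper.articles holds Article objects, not plain URLs
first_article = paper.articles[0]
first_article.download()
first_article.parse()
print(first_article.title)
print(first_article.publish_date)
print(first_article.authors)
print(first_article.top_image)
print(first_article.summary)  # empty until nlp() has been called
print(first_article.movies)
print(first_article.text)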

Parsing HTML

Fetch the article's HTML with the requests library, then let newspaper parse it:

import requests
html = requests.get('https://www.wired.com/story/preterm-babies-lonely-terror-of-a-pandemic-nicu/').text
print('Raw HTML -->', html)
text = fulltext(html, language='en')
print('Parsed text -->', text)

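Using NLP

Calling nlp() extracts natural-language properties, such as keywords and a summary, from the text. It must be called after download() and parse(), and it relies on NLTK's punkt tokenizer data being installed.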

first_article = paper.articles[1]
first_article.download()
first_article.parse()
first_article.nlp()
print(first_article.summary)
print(first_article.keywords)
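As a rough intuition for what keyword extraction does, the sketch below ranks words by frequency after dropping a few stopwords. It is a simplified illustration and not newspaper's algorithm; the function name `top_keywords`, the tiny stopword set, and the sample sentence are all assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stopword set, for illustration only
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for"}


def top_keywords(text, n=3):
    """Return the n most frequent non-stopword words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]


sample = "The nurse and the baby: a nurse cares for the baby in the NICU. Nurse!"
print(top_keywords(sample, n=2))
```

In this sample "nurse" appears three times and "baby" twice, so those two words rank first.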

Multithreading

When you need to collect news from several sources, you can download them concurrently:

import newspaper
from newspaper import news_pool
lr_paper = newspaper.build('https://lifehacker.com/', language="en")
wd_paper = newspaper.build('https://www.wired.com/', language="en")
ct_paper = newspaper.build('https://www.cnet.com/news/', language="en")
papers = [lr_paper, wd_paper, ct_paper]
# Total threads: 3 sources * 2 threads each = 6
news_pool.set(papers, threads_per_source=2)
news_pool.join()
print(lr_paper.articles[0].html)
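news_pool only downloads the articles; each one still needs parse() before its title or text is available. A minimal sketch of that follow-up step is below. The helper `parse_downloaded` and the `FakeArticle` and `FakeSource` stand-ins are hypothetical and exist only so the sketch runs without network access.

```python
class FakeArticle:
    """Stand-in for newspaper's Article with just html, title, and parse()."""

    def __init__(self, html):
        self.html = html
        self.title = ""

    def parse(self):
        # Pretend parsing: use the raw HTML string as the title
        self.title = self.html.strip()


class FakeSource:
    """Stand-in for newspaper's Source: only holds a list of articles."""

    def __init__(self, articles):
        self.articles = articles


def parse_downloaded(papers):
    """Parse every article that has HTML and return the resulting titles."""
    titles = []
    for paper in papers:
        for article in paper.articles:
            if not article.html:  # download failed or was skipped
                continue
            article.parse()
            titles.append(article.title)
    return titles


demo_papers = [
    FakeSource([FakeArticle("First story"), FakeArticle("")]),
    FakeSource([FakeArticle("Second story")]),
]
titles = parse_downloaded(demo_papers)
print(titles)
```

With real sources you would call `parse_downloaded(papers)` after `news_pool.join()`; skipping articles with empty html avoids parsing pages whose download failed.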

Other

hot() returns a list of terms currently trending on Google.

popular_urls() returns a list of popular news source URLs.

newspaper.hot()
newspaper.popular_urls()
