Quickly Scraping Article Information with the Newspaper3k Framework

Reposted article · 2019-10-15 09:38 · Tags: Crawler / Python

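I. About the framework

Newspaper is a Python 3 library. It is not well suited to production-grade news crawling: the framework is unstable, and a crawl can hit various bugs, such as failing to retrieve URLs or article content. For anyone who just wants to collect some news text as a corpus, though, it is worth a try. It is simple, convenient, easy to pick up, and does not require much specialist knowledge of web crawling.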

Newspaper on GitHub:

https://github.com/codelucas/newspaper

Newspaper documentation:

https://newspaper.readthedocs.io/en/latest/

Newspaper quickstart guide:

https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html

Installation:

pip3 install newspaper3k

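II. Features

The main features are:

Multi-threaded article download framework
News URL identification
Text extraction from HTML
Top image extraction from HTML
All image extraction from HTML
Keyword extraction from text
Summary extraction from text
Author extraction from text
Google trending terms extraction
Works in 10+ languages (English, Chinese, German, Arabic, ...)

Walkthrough: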

1. Build a news source

import newspaper
web_paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/", language="zh", memoize_articles=False)

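Note on article caching: by default, newspaper caches all previously extracted articles and leaves out any article it has already extracted. This prevents duplicate articles and speeds up extraction. Pass memoize_articles=False to opt out of this behaviour.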
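
To make the caching note concrete, here is a minimal sketch of the idea: remember which URLs have already been seen and return only the new ones on later runs. It illustrates the concept only and is not newspaper's actual implementation; the `SeenCache` class and its method name are hypothetical.

```python
class SeenCache:
    """Toy URL memo: remembers URLs across calls and filters out repeats."""

    def __init__(self):
        self._seen = set()

    def filter_new(self, urls):
        """Return the URLs not seen before (order preserved) and record them."""
        fresh = []
        for url in urls:
            if url not in self._seen:
                self._seen.add(url)
                fresh.append(url)
        return fresh


cache = SeenCache()
first = cache.filter_new(["http://example.com/a.html", "http://example.com/b.html"])
second = cache.filter_new(["http://example.com/b.html", "http://example.com/c.html"])
print(first)   # both URLs are new on the first run
print(second)  # only c.html is new on the second run
```

With memoize_articles=False, as in the build call above, this kind of filtering is switched off, so every run returns all the articles it finds.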

2. Extract the article URLs

for article in web_paper.articles:
    print(article.url)
output:
http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019101119998.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100919989.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100819980.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919940.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919933.html
....
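
The URL list may contain duplicates or links that fall outside the section you are interested in. A minimal sketch, assuming you only want `.html` pages on the same host and under the same path as the section URL; the helper name `filter_section_urls` is illustrative, and the last sample URL is made up:

```python
from urllib.parse import urlparse


def filter_section_urls(urls, section_url):
    """Keep unique .html URLs on the same host and under the same path as section_url."""
    section = urlparse(section_url)
    kept, seen = [], set()
    for url in urls:
        parts = urlparse(url)
        same_host = parts.netloc == section.netloc
        in_section = parts.path.startswith(section.path)
        if same_host and in_section and parts.path.endswith(".html") and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


urls = [
    "http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html",
    "http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html",  # duplicate
    "http://www.sxdi.gov.cn/gzdt/jlsc/",                     # section index, not an article
    "http://www.example.com/other/1.html",                   # different host (made up)
]
section_urls = filter_section_urls(urls, "http://www.sxdi.gov.cn/gzdt/jlsc/")
print(section_urls)
```

In the walkthrough, the input list would be `[article.url for article in web_paper.articles]`.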

3. Extract the source's categories

for category in web_paper.category_urls():
    print(category)
output:
http://www.sxdi.gov.cn/gzdt/jlsc/....

4. Extract the source's feeds

for feed_url in web_paper.feed_urls():
    print(feed_url)

5. Extract the source's brand and description

print(web_paper.brand)  # brand
print(web_paper.description)  # description
print("Collected %s articles in total" % web_paper.size())  # number of articles

6. Download an article

from newspaper import Article
article = Article("http://www.sol.com.cn/", language='zh')  # Chinese
article.download()

7. Parse the article and extract the information you want

article.parse()  # parse the downloaded page
print("title=", article.title)  # article title
print("author=", article.authors)  # article authors
print("publish_date=", article.publish_date)  # publication date
print("top_image=", article.top_image)  # URL of the article's top image
print("movies=", article.movies)  # video links in the article
print("text=", article.text, "\n")  # article body text
article.nlp()  # run keyword and summary extraction
print('keywords=', article.keywords)  # keywords extracted from the text
print("summary=", article.summary)  # article summary
print("images=", article.images)  # all images extracted from the HTML
print("imgs=", article.imgs)
print("html=", article.html)  # raw HTML
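
Because downloads can fail, as noted in the introduction, it can help to wrap the download and parse steps and collect the fields into a plain dict that is easy to save. A minimal sketch, assuming any object that offers `download()`, `parse()` and the attributes used above; `extract_record` and `FakeArticle` are hypothetical names, and the fake class exists only so the snippet runs without network access:

```python
import json
from datetime import datetime


def extract_record(article):
    """Download and parse an article-like object; return a dict, or None on failure."""
    try:
        article.download()
        article.parse()
    except Exception as exc:  # broad on purpose: this is only a sketch
        print("skipped:", exc)
        return None
    date = article.publish_date
    return {
        "title": article.title,
        "authors": list(article.authors),
        # publish_date is expected to be empty or a datetime; store an ISO string
        "publish_date": date.isoformat() if date else None,
        "text": article.text,
    }


class FakeArticle:
    """Stand-in for newspaper's Article so the example needs no network."""

    def __init__(self, fail=False):
        self.fail = fail
        self.title, self.authors = "", []
        self.publish_date, self.text = None, ""

    def download(self):
        if self.fail:
            raise RuntimeError("download failed")

    def parse(self):
        self.title = "示例標題"
        self.authors = ["someone"]
        self.publish_date = datetime(2019, 10, 12)
        self.text = "body"


record = extract_record(FakeArticle())
failed = extract_record(FakeArticle(fail=True))
# ensure_ascii=False keeps Chinese text readable in the JSON output
print(json.dumps(record, ensure_ascii=False))
print(failed)
```

With the real library, `Article(url, language='zh')` would take the place of `FakeArticle()`.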

A simple example:

import newspaper
from newspaper import Article


def spider_newspaper_url(url):
    """
    Build a news source from url and process every article it finds.

    By default, newspaper caches all previously extracted articles and leaves
    out any article it has already extracted. memoize_articles=False opts out.
    """
    web_paper = newspaper.build(url, language="zh", memoize_articles=False)
    print("Extracting article URLs from the source")
    for article in web_paper.articles:
        # URL of the news page
        print("article url:", article.url)
        # Call spider_newspaper_information to fetch the page's data
        spider_newspaper_information(article.url)

    print("Collected %s articles in total" % web_paper.size())  # number of articles


def spider_newspaper_information(url):
    """Download and parse one article, then print its fields."""
    # Create the article object and download the page
    article = Article(url, language='zh')  # Chinese
    article.download()
    article.parse()

    # Print the extracted information
    print("title=", article.title)  # article title
    print("author=", article.authors)  # article authors
    print("publish_date=", article.publish_date)  # publication date
    # print("top_image=", article.top_image)  # URL of the article's top image
    # print("movies=", article.movies)  # video links in the article
    print("text=", article.text, "\n")  # article body text
    article.nlp()  # summary stays empty until nlp() has run
    print("summary=", article.summary)  # article summary


if __name__ == "__main__":
    web_lists = ["http://www.sxdi.gov.cn/gzdt/jlsc/", "http://www.people.com.cn/GB/59476/"]
    for web_list in web_lists:
        spider_newspaper_url(web_list)
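
The example above processes articles one after another, while the feature list mentions multi-threaded downloading. As a generic illustration of that idea, using the standard library's `concurrent.futures` rather than newspaper's own mechanism, the URLs can be handed to a small thread pool. The `fetch_all` helper and the `fake_fetch` stand-in are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL in a thread pool; return {url: result or None}."""

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as exc:  # one bad page should not stop the others
            print("failed:", url, exc)
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url):
    """Stand-in for a real download function; fails on one made-up URL."""
    if url.endswith("bad.html"):
        raise ValueError("cannot download")
    return "content of " + url


pages = fetch_all(["http://example.com/1.html", "http://example.com/bad.html"], fake_fetch)
print(pages)
```

In the crawler above, `fetch` could be a function that creates an `Article`, calls `download()` and `parse()`, and returns the title and text. Keeping `max_workers` small avoids putting too much load on a single site.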

