Scraping website articles and news reports with Python and saving them as text

This article is a repost. 2021-11-11 13:56 · python

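　　Recently a classmate asked me to help scrape some engineering accident case reports from the target site http://www.mkaq.org/sggl/shigual/. As a Java programmer I am not yet very familiar with Python, but it is easy to pick up: it is mostly a matter of learning to use the right libraries for what you need. Below I record the whole process, from setting up the environment to running the code: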

1. Install the Python environment

　　To install Python I followed this very detailed article: "python3 environment installation".

2. Install the required Python libraries

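See this tutorial on installing commonly used Python libraries and install the ones listed below by following it. Note that the script in this article imports only requests and bs4 (beautifulsoup4); the other entries are not imported by it: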

　　　　requests
　　　　selenium
　　　　chromedriver
　　　　lxml
　　　　beautifulsoup4 (note: Python 2 used the beautifulsoup package, while Python 3 uses beautifulsoup4, so mind the version difference; Python 3.8 is used here)
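
After installing, it can help to confirm that the libraries are visible to the interpreter you will actually run the script with. The following is a minimal sketch using the standard library's importlib; the helper name missing_modules is hypothetical, and the list uses import names rather than pip package names (beautifulsoup4 is imported as bs4). chromedriver is left out because it is a browser driver program rather than something the script imports.

```python
import importlib.util

# Import names, not pip package names: beautifulsoup4 is imported as bs4.
MODULES = ['requests', 'bs4', 'lxml', 'selenium']

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(MODULES)
if missing:
    print('Not installed:', ', '.join(missing))
else:
    print('All libraries are available.')
```

Anything reported as not installed can then be installed by following the tutorial mentioned above.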

3. Write the code

　　Once the environment is ready, start writing the code:

import os
import io
import sys
import requests

from bs4 import BeautifulSoup

# Change the default encoding of standard output (gb18030 covers the Chinese text printed below)
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

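def urlBS(url):
    # Fetch a page and return it parsed as a BeautifulSoup object
    headers = {
        # Send a browser-like User-Agent with the request
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    # Decode the raw response bytes as UTF-8
    html = resp.content.decode('utf-8')
    soup = BeautifulSoup(html, 'html.parser')
    return soup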

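def main(url):
    soup = urlBS(url)
    # Directory where the articles are saved, under the current working
    # directory returned by os.getcwd(); the folder name means "crawled articles"
    path = os.path.join(os.getcwd(), '爬取的文章')

    # Create the folder if it does not exist yet
    if not os.path.isdir(path):
        os.mkdir(path)

    # for new in soup.select('.news-box'):
    for new in soup.select('.imgr'):

        if len(new.select('h2')) > 0:
            # Get the article link
            # article_list_url = 'https:' + new.select('a')[0]['href']
            # On the first page the links need no address prefix
            # article_list_url = new.select('a')[0]['href']

            # From the second page onward the site address must be prepended
            article_list_url = 'http://www.mkaq.org' + new.select('a')[0]['href']

            # Print the article link
            # print(article_list_url)

            # Get the article title with all whitespace removed
            # title = ''.join(new.select('h4')[0].text.split())
            title = ''.join(new.select('h2')[0].text.split())

            # Print the article title
            print(title)

            # Request the article page
            result = urlBS(article_list_url)

            article = []

            # Collect the text of every <p> inside .article_content, dropping
            # the last <p>, which holds the "responsible editor" line
            for v in result.select('.article_content p')[:-1]:
                # Strip surrounding whitespace and add the paragraph to the list
                article.append(v.text.strip())

            # Join the paragraphs with newlines
            author_info = '\n'.join(article)

            # Print the article text
            # print(author_info)

            # Save as a .txt file named after the title
            filename = os.path.join(path, title + '.txt')

            # Print the save path
            print(filename)

            # Write the title, a blank line, then the article text;
            # the with block closes the file automatically
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(title + '\n\n')
                f.write(author_info)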

if __name__ == '__main__':
    # Target URL (first page)
    # firsurl = 'http://www.mkaq.org/sggl/shigual/'

    # Fetch page 3; article_list_url in main() must be switched accordingly
    firsurl = 'http://www.mkaq.org/sggl/shigual/index_3.shtml'

    main(firsurl)
    print('Done!')
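
The script switches between two versions of article_list_url depending on whether the links on a list page already carry the site address. As a possible alternative, the standard library's urllib.parse.urljoin resolves a link against the page it was found on and handles both cases with the same call. This is only a sketch: the helper name absolute_link and the sample href values are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

def absolute_link(page_url, href):
    """Resolve href against the page it was found on (hypothetical helper)."""
    return urljoin(page_url, href)

list_page = 'http://www.mkaq.org/sggl/shigual/index_3.shtml'

# A site-relative href picks up the scheme and host of the list page.
relative = absolute_link(list_page, '/html/example.shtml')
# An href that is already absolute is returned unchanged.
absolute = absolute_link(list_page, 'http://www.mkaq.org/html/example.shtml')

print(relative)
print(absolute)
```

With a helper like this, the line that builds article_list_url would not need to be edited when moving from the first list page to later ones.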

4. Run results (shown in the screenshot below)

　　When fetching data from some pages, individual articles may raise an error; the author simply ignored these.
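
As written, the script has no error handling, so one failing article would stop the run. One way to skip such articles and carry on is to wrap the per-article work in try/except and record what failed. The sketch below shows only that pattern: process_all and fake_save are hypothetical names, and fake_save stands in for "fetch one article and write it to disk".

```python
def process_all(items, handler):
    """Call handler(item) for every item; collect failures instead of stopping."""
    failed = []
    for item in items:
        try:
            handler(item)
        except Exception as exc:  # broad on purpose: any single failure is skipped
            failed.append((item, repr(exc)))
    return failed

def fake_save(url):
    # Stand-in for fetching and saving one article; fails for one input.
    if 'bad' in url:
        raise ValueError('no article body found')

failed = process_all(['page-1', 'bad-page', 'page-2'], fake_save)
for url, reason in failed:
    print('skipped:', url, reason)
```

Keeping the list of failures makes it possible to retry or inspect those articles afterwards.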

　　To fetch content from another site, inspect the tags in that site's page source and adjust the selectors in the code accordingly.
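
One way to make such adjustments easier is to keep the selectors in one place instead of scattering them through the code. The following is a rough sketch under assumptions: SITE_RULES and list_titles are hypothetical names, the 'mkaq' entry mirrors the selectors used in this article, the 'example' entry reuses the commented-out '.news-box' and 'h4' selectors with an invented body selector, and the sample HTML is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical per-site settings: list item, title tag, article body paragraphs.
SITE_RULES = {
    'mkaq': {'item': '.imgr', 'title': 'h2', 'body': '.article_content p'},
    'example': {'item': '.news-box', 'title': 'h4', 'body': '.content p'},
}

def list_titles(html, rules):
    """Return the whitespace-stripped title of every list item that has one."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for item in soup.select(rules['item']):
        heading = item.select_one(rules['title'])
        if heading is not None:
            titles.append(''.join(heading.text.split()))
    return titles

# Made-up list page with one titled item and one item without a heading.
sample_html = '''
<div class="news-box"><h4><a href="/a/1.html">First  report</a></h4></div>
<div class="news-box"><p>no heading here</p></div>
'''
titles = list_titles(sample_html, SITE_RULES['example'])
print(titles)
```

Switching sites then means adding one entry to the dictionary rather than editing several lines of the crawler.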
