Study notes (web scraping): scraping a classical poetry website, fetching every poem, and saving it locally

Reposted article. Published 2020-04-09 19:21. Category: Web scraping / Code section

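1. Target website

Target website: https://so.gushiwen.org/shiwen/default.aspx?

2. Purpose of the scraper

Scrape the text of the target website, namely each poem's content, author, and dynasty, and save it to a local file.

3. The scraper program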

# -*- coding:utf-8 -*-
# Scrape a classical poetry website
import re

import requests

# Request headers that make the request look like it comes from a browser
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}


# Append data to the local file
def write_data(data):
    # Open the file explicitly as UTF-8. With the platform default encoding
    # (gbk on Chinese-language Windows), writing fails with
    # "'gbk' codec can't encode character" for characters such as U+4729.
    with open('詩詞.txt', 'a', encoding='utf-8') as f:
        f.write(data)


# range(1, 10) covers pages 1 to 9
for i in range(1, 10):
    # Target URL for page i
    url = "https://so.gushiwen.org/shiwen/default.aspx?page={}".format(i)
    html = requests.get(url, headers=headers).content.decode('utf-8')
    # Extract the titles
    p_title = '<p><a style="font-size:18px; line-height:22px; height:22px;" href=".*?" target="_blank"><b>(.*?)</b></a></p>'
    title = re.findall(p_title, html)
    # Extract the poem bodies
    p_context = '<div class="contson" id=".*?">(.*?)</div>'
    context = re.findall(p_context, html, re.S)
    # Extract the dynasties
    p_years = '<p class="source"><a href=".*?">(.*?)</a>'
    years = re.findall(p_years, html, re.S)
    # Extract the authors
    p_author = '<p class="source"><a href=".*?">.*?</a><span>：</span><.*?>(.*?)</a>'
    author = re.findall(p_author, html)
    # zip() stops at the shortest list, so a missing field cannot raise IndexError
    for t, y, a, c in zip(title, years, author, context):
        # Strip the HTML tags left inside the poem body
        c = re.sub('<.*?>', '', c)
        # Write the record
        write_data(t)
        write_data('\n' + y)
        write_data(' :' + a)
        write_data(c)
    print('Page {} downloaded successfully'.format(i))
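
The extraction step can be tried out without any network access. The following is a minimal sketch, assuming the four patterns from the program above; parse_poems and SAMPLE_HTML are hypothetical names introduced here, and the HTML fragment is hand-written to fit the patterns rather than copied from the site.

```python
import re

# Patterns taken from the scraper program
P_TITLE = r'<p><a style="font-size:18px; line-height:22px; height:22px;" href=".*?" target="_blank"><b>(.*?)</b></a></p>'
P_CONTEXT = r'<div class="contson" id=".*?">(.*?)</div>'
P_YEARS = r'<p class="source"><a href=".*?">(.*?)</a>'
P_AUTHOR = r'<p class="source"><a href=".*?">.*?</a><span>：</span><.*?>(.*?)</a>'


def parse_poems(html):
    """Return a list of (title, dynasty, author, body) tuples found in html."""
    titles = re.findall(P_TITLE, html)
    bodies = re.findall(P_CONTEXT, html, re.S)
    dynasties = re.findall(P_YEARS, html, re.S)
    authors = re.findall(P_AUTHOR, html)
    poems = []
    # zip() pairs the fields by position and stops at the shortest list
    for title, dynasty, author, body in zip(titles, dynasties, authors, bodies):
        # Remove leftover tags such as <br /> and surrounding whitespace
        body = re.sub(r'<.*?>', '', body).strip()
        poems.append((title, dynasty, author, body))
    return poems


# Hypothetical fragment shaped to fit the patterns (an assumption, not real site output)
SAMPLE_HTML = '''
<p><a style="font-size:18px; line-height:22px; height:22px;" href="/a.aspx" target="_blank"><b>靜夜思</b></a></p>
<p class="source"><a href="/d.aspx">唐代</a><span>：</span><a href="/w.aspx">李白</a></p>
<div class="contson" id="contson1">
床前明月光，疑是地上霜。<br />舉頭望明月，低頭思故鄉。
</div>
'''

poems = parse_poems(SAMPLE_HTML)
print(poems)
```

Keeping the parsing in its own function makes it possible to test the patterns on a saved page before running the full download loop.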

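4. Difficulties and reflections

The difficult part of this scraper is the use of regular expressions, for example to match the poem body, the author, and the title. To write such an expression, you need to find the text that comes immediately before and immediately after the content you want (the leading and trailing context), so that the target can be located precisely. For example, to match the poem body: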

    # Extract the poem body
    p_context = '<div class="contson" id=".*?">(.*?)</div>'
    context = re.findall(p_context, html, re.S)

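The content to capture is whatever the parentheses match; the leading context is <div class="contson" id=".*?"> and the trailing context is </div>. Two points need attention here. First, id=".*?" must be non-greedy, that is, it needs the ?. Without it, .* keeps consuming text up to the last "> it can reach, so the match runs past the current poem and the content we want is not captured. Second, (.*?) is also non-greedy, so the capture stops at the first </div> that follows instead of running on to a later one. The re.S flag matters too: it lets . match newlines, since a poem body may span several lines in the HTML. Matching the title, the dynasty, and the author name works the same way, so they are not covered one by one here.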
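
The effect of the ? and of re.S can be seen on a small input. This is a minimal sketch on a hypothetical two-poem fragment (the html string is an assumption made up for illustration); it runs the pattern from the text alongside three variants that each drop one safeguard.

```python
import re

# Hypothetical fragment with two poem blocks; the first body contains a newline
html = (
    '<div class="contson" id="c1">first\npoem</div>'
    '<p>between</p>'
    '<div class="contson" id="c2">second poem</div>'
)

# Pattern from the text: both quantifiers non-greedy, with re.S
non_greedy = re.findall(r'<div class="contson" id=".*?">(.*?)</div>', html, re.S)

# Variant 1: greedy capture group runs on to the last </div>
greedy_body = re.findall(r'<div class="contson" id=".*?">(.*)</div>', html, re.S)

# Variant 2: greedy id runs on to the last "> and skips the first poem
greedy_id = re.findall(r'<div class="contson" id=".*">(.*?)</div>', html, re.S)

# Variant 3: without re.S, "." cannot cross the newline in the first poem
no_dotall = re.findall(r'<div class="contson" id=".*?">(.*?)</div>', html)

print(non_greedy)   # one entry per poem
print(greedy_body)  # a single entry spanning both poems
print(greedy_id)    # only the second poem
print(no_dotall)    # only the second poem
```

Only the first call returns one clean entry per poem, which is why the scraper combines non-greedy quantifiers with re.S.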
