Python Web Scraping: Scraping the Douban Books Top 250

Reposted article. 2019-05-13 14:44 | Python

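Douban is a welcoming site for beginners learning to scrape: as long as you keep the request rate low, you are unlikely to have your IP banned. Even so, avoid crawling too frequently.

Topics covered: requests, html, xpath, csv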

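1. Preparation

Install the requests and lxml libraries. The csv module is part of the Python standard library and needs no installation.

Target page: https://book.douban.com/top250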

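2. Analyzing the page source

Open the URL, press F12, locate the book title in the element inspector, then right-click it and choose Copy ==> Copy XPath.

Taking the title "追風箏的人" (The Kite Runner) as an example, the XPath copied for the title is: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a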

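Note that an XPath copied from the browser is only a starting point. Browsers often insert an extra tbody tag into the DOM they display, and that tag is not in the HTML source that requests downloads. Remove it by hand, which gives //*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a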
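The manual cleanup described above can also be done in code. The following is a minimal sketch, not part of the original article: strip_tbody is a hypothetical helper name, and it assumes the page source really has no tbody element and that no XPath predicate contains a "/" character (the split below would break on one).

```python
def strip_tbody(xpath):
    """Remove every tbody step from an XPath copied out of browser dev tools."""
    steps = xpath.split('/')
    # Keep every step except a plain 'tbody' or an indexed one such as 'tbody[1]'.
    kept = [s for s in steps if s != 'tbody' and not s.startswith('tbody[')]
    return '/'.join(kept)


copied = '//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a'
print(strip_tbody(copied))
```

If the downloaded HTML does contain a tbody element, the copied XPath should be left as it is.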

The XPaths for the book's rating, number of ratings, and brief description are obtained the same way:

//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[2]

//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[3]

//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/p[1]

Initial code

import requests
from lxml import etree

html = requests.get('https://book.douban.com/top250').text
res = etree.HTML(html)
# To get the title text, append /text() to the XPath expression.
# res.xpath returns a list; here it holds a single element, so take it with [0].
name = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a/text()')[0].strip()
score = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[2]/text()')[0].strip()
comment = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[3]/text()')[0].strip()
info = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/p[1]/text()')[0].strip()
print(name,score,comment,info)

Output: [screenshot omitted]

This retrieves only the first book. Let's fetch the second and third as well.

Using their XPaths:

import requests
from lxml import etree

html = requests.get('https://book.douban.com/top250').text
res = etree.HTML(html)
name1 = res.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a/text()')[0].strip()
name2 = res.xpath('//*[@id="content"]/div/div[1]/div/table[2]/tr/td[2]/div[1]/a/text()')[0].strip()
name3 = res.xpath('//*[@id="content"]/div/div[1]/div/table[3]/tr/td[2]/div[1]/a/text()')[0].strip()
print(name1,name2,name3)

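Output: [screenshot omitted]

Comparing the three XPaths, only the table index differs. Dropping the index therefore gives an XPath that matches every book title on the page: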

import requests
from lxml import etree

html = requests.get('https://book.douban.com/top250').text
res = etree.HTML(html)
names = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/text()')
for name in names:
    print(name.strip())

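Output: too long to show in full; only part was shown.

The rating, number of ratings, and description are obtained in the same way.

With these patterns identified, here is the code that extracts the book information from the first page: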

import requests
from lxml import etree

html = requests.get('https://book.douban.com/top250').text
res = etree.HTML(html)
trs = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr')
for tr in trs:
    name = tr.xpath('./td[2]/div[1]/a/text()')[0].strip()
    score = tr.xpath('./td[2]/div[2]/span[2]/text()')[0].strip()
    comment = tr.xpath('./td[2]/div[2]/span[3]/text()')[0].strip()
    info = tr.xpath('./td[2]/p[1]/text()')[0].strip()
    print(name,score,comment,info)

Output: long, so only the first part was shown.
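The loop above indexes each XPath result with [0], which raises IndexError if any row lacks one of the fields. As a hedged sketch (first_text is a hypothetical helper, not something the original code uses), a small function can return a default value instead:

```python
def first_text(results, default=''):
    """Return the first XPath text result, stripped, or `default` if the list is empty."""
    if not results:
        return default
    return str(results[0]).strip()


print(first_text(['  9.4 ']))   # a normal one-element result
print(first_text([], 'N/A'))    # an empty result falls back to the default
```

Inside the loop it would be used as, for example, name = first_text(tr.xpath('./td[2]/div[1]/a/text()')).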

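So far we have only extracted the first page. Next, we build the links for every page and loop over them.

3. Getting all page URLs

Inspect the page source behind the pagination links. As the loop below shows, the page URLs differ only in the start query parameter, which advances by 25 per page across 10 pages.

In code: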

for i in range(10):
    url = 'https://book.douban.com/top250?start={}'.format(i * 25)
    print(url)

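Output: the ten page URLs, as expected.

We now have both the full list of page links and the extraction logic for a single page. The last step is to combine and tidy the code.

Complete code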

#-*- coding:utf-8 -*-
"""
-------------------------------------------------
   File Name：     DoubanBookTop250
   Author :        zww
   Date：          2019/5/13
   Change Activity:2019/5/13
-------------------------------------------------
"""
import requests
from lxml import etree

# Build the URL of each page and process it
def getUrl():
    for i in range(10):
        url = 'https://book.douban.com/top250?start={}'.format(i*25)
        urlData(url)

# Extract the data from one page
def urlData(url):
    html = requests.get(url).text
    res = etree.HTML(html)
    trs = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr')
    for tr in trs:
        name = tr.xpath('./td[2]/div/a/text()')[0].strip()
        score = tr.xpath('./td[2]/div/span[2]/text()')[0].strip()
        # Strip the parentheses around the number of ratings
        comment = tr.xpath('./td[2]/div/span[3]/text()')[0].replace('(','').replace(')','').strip()
        info = tr.xpath('./td[2]/p[1]/text()')[0].strip()
        print("《{}》 -- score {} -- {} -- {}".format(name,score,comment,info))

if __name__ == '__main__':
    getUrl()

Output: all 250 books, none missing. There is too much data to show in full, so only the first part was shown.
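The comment field above is still a string even after the parentheses are removed. If a number is wanted, a regular expression can pull it out. This is a sketch under an assumption: the sample input below is only a guess at what the raw text looks like, and parse_rating_count is a hypothetical name.

```python
import re


def parse_rating_count(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Assumed shape of the raw span text; the real page may differ.
sample = '(\n    123456人评价\n  )'
print(parse_rating_count(sample))
```

Returning None rather than 0 keeps "no number found" distinguishable from a genuine count of zero.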

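Saving the scraped data to a CSV file

import csv

def write_to_file(content):
    # 'a' opens the file in append mode; the 'utf_8_sig' encoding keeps the
    # exported CSV from showing garbled characters (for example in Excel)
    with open('DoubanBookTop250.csv','a',encoding='utf_8_sig',newline='') as f:
        fieldnames = ['name','score','comment','info']
        # csv.DictWriter writes dict-shaped rows to the CSV file
        w = csv.DictWriter(f,fieldnames=fieldnames)
        w.writerow(content)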
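Because the function above appends without a header row, running the script twice leaves duplicate rows in the file, and the columns are unlabeled. One possible alternative, sketched here with the hypothetical names save_rows and FIELDNAMES, is to collect the rows first and write them in a single pass with a header:

```python
import csv

FIELDNAMES = ['name', 'score', 'comment', 'info']


def save_rows(rows, path='DoubanBookTop250.csv'):
    """Write all rows to `path` with one header row, replacing any earlier file."""
    # 'w' truncates the file, so a rerun does not duplicate earlier rows.
    with open(path, 'w', encoding='utf_8_sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

The trade-off is that nothing is saved until all rows are collected, whereas the append-per-row version keeps partial results if the crawl fails midway.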

Complete code

#-*- coding:utf-8 -*-
"""
-------------------------------------------------
   File Name：     DoubanBookTop250
   Author :        zww
   Date：          2019/5/13
   Change Activity:2019/5/13
-------------------------------------------------
"""
import csv
import requests
from lxml import etree

# Build the URL of each page, extract its rows, and save them
def getUrl():
    for i in range(10):
        url = 'https://book.douban.com/top250?start={}'.format(i*25)
        for item in urlData(url):
            write_to_file(item)
        print('Saved page {} of the Douban Books Top 250!'.format(i+1))

# Append one row to the CSV file
def write_to_file(content):
    # 'a' opens the file in append mode; the 'utf_8_sig' encoding keeps the
    # exported CSV from showing garbled characters (for example in Excel)
    with open('DoubanBookTop250.csv','a',encoding='utf_8_sig',newline='') as f:
        fieldnames = ['name','score','comment','info']
        # csv.DictWriter writes dict-shaped rows to the CSV file
        w = csv.DictWriter(f,fieldnames=fieldnames)
        w.writerow(content)

# Extract the data from one page, yielding one dict per book
def urlData(url):
    html = requests.get(url).text
    res = etree.HTML(html)
    trs = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr')
    for tr in trs:
        yield {
            'name':tr.xpath('./td[2]/div/a/text()')[0].strip(),
            'score':tr.xpath('./td[2]/div/span[2]/text()')[0].strip(),
            'comment':tr.xpath('./td[2]/div/span[3]/text()')[0].replace('(','').replace(')','').strip(),
            'info':tr.xpath('./td[2]/p[1]/text()')[0].strip()
        }

if __name__ == '__main__':
    getUrl()

The resulting CSV is too long to show in full; only the first part was shown.
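The advice at the start of the article, to keep the crawl rate low, is not reflected in the code, which requests all ten pages back to back. The sketch below shows one way to add a pause between requests. The names crawl_pages and fetch_with_requests, the two-second delay, the timeout, and the User-Agent string are all assumptions for illustration; the site may also reject requests that lack a browser-like User-Agent, which is why one is set here.

```python
import time


def crawl_pages(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay_seconds` between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        pages.append(fetch(url))
    return pages


def fetch_with_requests(url):
    """Download one page; the header value and timeout are illustrative choices."""
    import requests  # imported here so crawl_pages can be used without it
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; tutorial-scraper)'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP error status codes
    return response.text
```

A call such as crawl_pages(urls, fetch_with_requests), with urls built as in getUrl, would return the HTML of each page for parsing with etree.HTML. Passing fetch and sleep as parameters also makes the pacing logic easy to check without touching the network.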
