Scraping the Douban Books TOP250 List

Reposted article · 2019-12-31 21:13 · Tags: crawlers, regular expressions

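This post scrapes the Douban Books TOP250 list. For each book it collects the title, the link to the book's page, the author, the publisher and publication date, the price, the rating, and the one-line review, then saves the scraped data to a local file.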

Reference URL: https://book.douban.com/top250

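Note: when writing the regular expressions, do not read the page source from the Elements tab. That markup may have been changed by JavaScript rendering and can differ from what the original request returns. Look at the response source in the Network tab instead.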
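
The script below relies on two regex behaviours: non-greedy matching (`.*?`) and the `re.S` flag. Here is a minimal sketch of both. The `sample` string is a made-up fragment, only loosely shaped like a row of the list page, so the real markup may differ.

```python
import re

# Hypothetical fragment for illustration only; not copied from the real page.
sample = '''<a href="https://example.com/subject/1/" title="Book One">Book One</a>
<span class="rating_nums">9.1</span>
<a href="https://example.com/subject/2/" title="Book Two">Book Two</a>
<span class="rating_nums">8.7</span>'''

# Non-greedy: each group stops at the first closing quote it can.
links = re.findall(r'href="(.*?)".*?title="(.*?)"', sample)

# Greedy with re.S: .* runs on to the last quote in the whole text,
# so everything collapses into a single oversized match.
greedy = re.findall(r'href="(.*)"', sample, re.S)

# Without re.S, "." does not cross newlines, so a pattern that has to
# span two lines finds nothing; with re.S it pairs title and rating.
pattern = r'title="(.*?)".*?nums">(.*?)</span>'
no_flag = re.findall(pattern, sample)
with_flag = re.findall(pattern, sample, re.S)

print(links)
print(len(greedy))
print(no_flag)
print(with_flag)
```

To check the patterns against the real page, run them on the text returned by the request (what the Network tab shows) rather than on markup copied from the Elements tab.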

import re
import json
import time
import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one page and return its HTML, or None if the request fails."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) '
            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }
        # A timeout keeps the script from hanging on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None
    
    
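def parse_one_page(html, start):
    # Nothing to parse if the request failed and get_one_page returned None
    if not html:
        return
    # .*? is a non-greedy match; raw strings avoid invalid-escape warnings
    items1 = re.findall(r'href="(.*?)".*?title="(.*?)"', html)
    # The optional groups (...)? cover books that do not list an author
    items2 = re.findall(r'pl">(.*?/)?(.*?/)?(.*?)/(.*?)/(.*?)</p>', html)
    # re.S makes . match every character, including newlines;
    # the second group is empty for books without a one-line review
    items3 = re.findall(r'nums">(.*?)</span>.*?</div>(.*?)</td>', html, re.S)
    # Stop at the shortest list so a missing match cannot raise IndexError
    count = min(25, len(items1), len(items2), len(items3))
    for i in range(count):
        yield {
            'page': start // 25 + 1,
            'ranking': start + i + 1,
            'book': items1[i][1],
            'link': items1[i][0],

            'author': items2[i][0].replace('/', '').strip(),
            'press': items2[i][2].strip(),
            'time': items2[i][3].strip(),
            'price': items2[i][4].strip(),

            'grade': items3[i][0],
            # If there is a review, strip the HTML tags wrapped around it
            'evaluation': re.sub(r'<[^>]+>', '', items3[i][1]).strip(),
        }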

def write_to_file(content):
    with open('doubanBookTop250.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(start):
    url = 'https://book.douban.com/top250?start=' + str(start)
    html = get_one_page(url)
    for item in parse_one_page(html, start):
        print(item)
        # write_to_file(item)  # uncomment to append each record to doubanBookTop250.txt


if __name__ == '__main__':
    for i in range(10):
        main(start=i * 25)
        time.sleep(1)
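
Because `write_to_file` appends one JSON object per line, the output file can be read back line by line. The sketch below shows one way to do that. `read_records` is a hypothetical helper that is not part of the original script, and the demo records are made up so the example runs without any network access.

```python
import json
import os
import tempfile


def read_records(path):
    """Load one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Made-up records written in the same one-object-per-line format.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
demo_rows = [
    {'ranking': 1, 'book': '示例書', 'grade': '9.4'},
    {'ranking': 2, 'book': 'Sample', 'grade': '9.0'},
]
with open(demo_path, 'a', encoding='utf-8') as f:
    for row in demo_rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')

records = read_records(demo_path)
# 'grade' is stored as a string, so convert it before comparing.
best = max(records, key=lambda r: float(r['grade']))
print(len(records), best['book'])
```

With the real output, pass 'doubanBookTop250.txt' to `read_records` instead of the temporary demo file.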
