Crawling the entire Jianshu site with Scrapy + Selenium

Environment

Ubuntu 18.04
Python 3.8
Scrapy 2.1

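What to crawl

Article title
Author
Author avatar
Publish date
Content
Article link
Article ID

Approach

Analyze the URL pattern of Jianshu articles
Request pages with Selenium
Extract the required data with XPath
Store the data in MySQL asynchronously (to improve storage throughput)

Implementation

Preparation:

Create the Scrapy project
Create the CrawlSpider spider file
Enable the pipelines and the downloader middleware in the settings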
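
Enabling the pipeline and middleware happens in the project's settings.py. A minimal sketch, assuming the project package is called jianshu_crawl (a hypothetical name; use whatever package `scrapy startproject` created for you) and using the class names defined later in this post. The priority numbers are the ones Scrapy's project template suggests.

```python
# settings.py (fragment)
# "jianshu_crawl" is an assumed package name, adjust it to your project.

# Route every request through the Selenium-based downloader middleware
DOWNLOADER_MIDDLEWARES = {
    "jianshu_crawl.middlewares.SeleniumDownloadMiddleware": 543,
}

# Send every scraped item to the asynchronous MySQL pipeline
ITEM_PIPELINES = {
    "jianshu_crawl.pipelines.JinshuAsyncPipeline": 300,
}
```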

Step 1: Analyze the URLs of Jianshu articles

Article URLs follow the pattern jianshu.com/p/<article ID>, so set a matching URL rule in the CrawlSpider:

class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']
    rules = (
        Rule(LinkExtractor(allow=r'.+/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )
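
To see what the rule accepts, the same regular expression can be tried outside Scrapy. This is only a sketch: `is_article_url` is a hypothetical helper and the article IDs below are made up for illustration.

```python
import re

# Same pattern as the LinkExtractor rule above
ARTICLE_URL = re.compile(r'.+/p/[0-9a-z]{12}.*')


def is_article_url(url):
    """Return True if the URL looks like a Jianshu article page."""
    return ARTICLE_URL.search(url) is not None


print(is_article_url("https://www.jianshu.com/p/0123456789ab"))        # True
print(is_article_url("https://www.jianshu.com/p/0123456789ab?utm=x"))  # True
print(is_article_url("https://www.jianshu.com/u/0123456789ab"))        # False
```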

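Step 2: Request pages with Selenium

Set up the downloader middleware

Data such as the author and publish date is loaded via Ajax, so Selenium is used to fetch the rendered page source, which XPath can then parse.
Sometimes a request hangs on a page that never finishes loading, so a page-load timeout is needed.
Likewise, the Ajax content may not have finished loading yet, so an explicit wait is needed until it is present.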

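from scrapy.http.response.html import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumDownloadMiddleware(object):
    # Upper bound on retries (an assumed value) so that a page which never
    # shows the awaited element cannot loop forever
    max_attempts = 3

    def __init__(self):
        self.driver = webdriver.Chrome()
        # Abort any page load that takes longer than 1 second
        self.driver.set_page_load_timeout(1)

    def process_request(self, request, spider):
        for _ in range(self.max_attempts):
            try:
                self.driver.get(request.url)
            except TimeoutException:
                # Page load timed out; still check below whether the content arrived
                pass
            try:
                # Wait for the Ajax-loaded element; on timeout, request the page again
                WebDriverWait(self.driver, 1).until(
                    expected_conditions.presence_of_element_located((By.CLASS_NAME, 'rEsl9f'))
                )
                break
            except TimeoutException:
                continue
        # Hand the rendered page back to Scrapy in place of a normal download
        url = self.driver.current_url
        source = self.driver.page_source
        response = HtmlResponse(url=url, body=source, request=request, encoding='utf-8')
        return response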

Note: put chromedriver in /usr/bin beforehand, or specify its executable path yourself. On Windows, you can add its folder to the PATH environment variable.
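
Before starting the crawl, it can help to confirm that the driver is actually discoverable. A small sketch using only the standard library; `find_chromedriver` is a hypothetical helper name, not part of Selenium or Scrapy.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver not found on PATH")
else:
    print("using chromedriver at", path)
```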

Step 3: Extract the required data with XPath

Define the item

import scrapy


class JianshuCrawlItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()
    origin_url = scrapy.Field()
    article_id = scrapy.Field()

Work out the XPath of each field, extract the data, and hand the item over to the pipelines

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import JianshuCrawlItem as Jitem


class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']
    rules = (
        Rule(LinkExtractor(allow=r'.+/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        # Extract the data with XPath
        title = response.xpath("//h1[@class='_2zeTMs']/text()").get()
        author = response.xpath("//a[@class='_1OhGeD']/text()").get()
        avatar = response.xpath("//img[@class='_13D2Eh']/@src").get()
        pub_time = response.xpath("//div[@class='s-dsoj']/time/text()").get()
        content = response.xpath("//article[@class='_2rhmJa']").get()
        origin_url = response.url
        article_id = origin_url.split("?")[0].split("/")[-1]
        print(title)  # show which article was just crawled
        item = Jitem(
            title=title,
            author=author,
            avatar=avatar,
            pub_time=pub_time,
            origin_url=origin_url,
            article_id=article_id,
            content=content,
        )
        yield item
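
The article ID is taken from the last path segment of the URL. As an alternative sketch (not the author's code), the same extraction can be written with urllib.parse, which also ignores a fragment or a trailing slash; `extract_article_id` is a hypothetical helper and the ID in the example is made up.

```python
from urllib.parse import urlsplit


def extract_article_id(url):
    """Return the last path segment of the URL, ignoring query string and fragment."""
    path = urlsplit(url).path
    return path.rstrip("/").split("/")[-1]


print(extract_article_id("https://www.jianshu.com/p/0123456789ab?utm_source=desktop"))  # 0123456789ab
```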

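Step 4: Store the data in the database

I use MySQL here; other databases work the same way. The package used to talk to the database is pymysql.

There are two ways to write the data: sequential storage and asynchronous storage.

Scrapy crawls asynchronously, so sequential storage, where each insert blocks until it completes, is comparatively slow. Asynchronous storage is recommended.

Sequential storage: simple to implement, inefficient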

import pymysql


class JianshuCrawlPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'debian-sys-maint',
            'password': 'lD3wteQ2BEPs5i2u',
            'database': 'jianshu',
            'charset': 'utf8mb4',
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        self.cursor.execute(self.sql, (item['title'], item['content'], item['author'],
                                       item['avatar'], item['pub_time'],
                                       item['origin_url'], item['article_id']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
            insert into article(id,title,content,author,avatar,pub_time,origin_url,article_id)\
            values(null,%s,%s,%s,%s,%s,%s,%s)'''
        return self._sql
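
The insert statement expects an article table that this post does not show. The following is a sketch of one possible schema: the column names come from the insert statement, but every column type and length is an assumption, and `create_article_table` is a hypothetical helper that works with any open DB-API connection such as the one returned by pymysql.connect.

```python
# Column types and lengths are assumptions; only the column names
# are taken from the insert statement used by the pipelines.
CREATE_ARTICLE_TABLE = """
CREATE TABLE IF NOT EXISTS article (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    content LONGTEXT,
    author VARCHAR(255),
    avatar VARCHAR(255),
    pub_time VARCHAR(64),
    origin_url VARCHAR(255),
    article_id VARCHAR(20)
) DEFAULT CHARSET = utf8mb4
"""


def create_article_table(conn):
    """Run the DDL on an open DB-API connection and commit it."""
    cursor = conn.cursor()
    cursor.execute(CREATE_ARTICLE_TABLE)
    conn.commit()
```

Because id is an auto-increment column, inserting null for it, as the pipelines do, lets MySQL assign the next value.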

Asynchronous storage: more complex, more efficient

import pymysql
from twisted.enterprise import adbapi


class JinshuAsyncPipeline(object):
    '''
    Store the crawled data asynchronously
    '''

    def __init__(self):
        # Connect to the local MySQL server
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'debian-sys-maint',
            'password': 'lD3wteQ2BEPs5i2u',
            'database': 'jianshu',
            'charset': 'utf8mb4',
            'cursorclass': pymysql.cursors.DictCursor
        }
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)
        self._sql = None

    @property
    def sql(self):
        # Build the SQL statement on first access
        if not self._sql:
            self._sql = '''
                  insert into article(id,title,content,author,avatar,pub_time,origin_url,article_id)\
                  values(null,%s,%s,%s,%s,%s,%s,%s)'''
        return self._sql

    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self.insert_item, item)  # submit the insert to the connection pool
        defer.addErrback(self.handle_error, item, spider)  # error handling
        return item  # pass the item on to any later pipeline

    def insert_item(self, cursor, item):
        # Execute the SQL statement
        cursor.execute(self.sql, (item['title'], item['content'], item['author'],
                                  item['avatar'],
                                  item['pub_time'],
                                  item['origin_url'], item['article_id']))

    def handle_error(self, failure, item, spider):
        # Twisted passes the Failure first, then the extra addErrback arguments
        print('Error!', failure)

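Summary

Sites like Jianshu that load content via Ajax are easy to crawl with Selenium. It is much slower than calling the underlying API endpoints directly, but it is simple to implement; if you do not need much data, analyzing those endpoints is not worth the effort.
Pages requested through Selenium often stall while loading, so use a page-load timeout and explicit waits to avoid wasting time.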

GitHub: https://github.com/aduner/jianshu-crawl

Blog: https://www.cnblogs.com/aduner/p/12852616.html
