Scrapy: Using the Scrapy shell

Reposted article. 2019-09-04 22:19. Tags: crawler / Scrapy / Python

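When developing a crawler, the Scrapy shell helps you locate the resources you want to scrape.

Starting the Scrapy shell

Enter the following in a terminal to start the Scrapy shell. Here url is the page to scrape; it is optional.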

scrapy shell <url>

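The Scrapy shell also supports local files. To scrape a local copy of a web page, use one of the forms below. When using a relative file path, make sure it starts with "./" or "../", or use a "file://" URI. Running scrapy shell index.html directly will fail.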

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html
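
If you would rather always pass a file URI, the conversion can be sketched with the standard library. This is a minimal illustration only; the helper name to_file_uri and the file name index.html are hypothetical and do not come from Scrapy.

```python
from pathlib import Path


def to_file_uri(path_str: str) -> str:
    """Return a file:// URI for a local path.

    Relative paths are resolved against the current working directory.
    """
    return Path(path_str).resolve().as_uri()


# "index.html" is a placeholder file name; it does not need to exist.
uri = to_file_uri("index.html")
print(uri)
```

The printed URI can then be passed to scrapy shell in place of the bare file name.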

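Using the shell

Available shortcuts

shelp(): print the available objects and shortcuts
fetch(url[, redirect=True]): fetch a new response from the given URL and update all related objects
fetch(request): fetch a new response from the given request and update all related objects
view(response): open the given response in your local browser. This creates a temporary file on your computer, and that file is not removed automatically

Available Scrapy objects

The Scrapy shell automatically creates some objects from the downloaded page, such as the Response object and Selector objects. These objects are: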

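crawler: the current Crawler object
spider: the Spider that handles the URL, or a generic Spider object if no spider is found for it
request: the Request object of the last fetched page; you can modify it with replace() or fetch a new request with fetch()
response: the Response object of the last fetched page
settings: the current Scrapy settings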

A simple example

fetch('https://scrapy.org')

response.xpath('//title/text()').get()
# Output
# 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

from pprint import pprint
pprint(response.headers)

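Invoking the Scrapy shell from a spider to inspect responses

Sometimes you want to inspect the response being processed at a specific point in your spider, just to check that the response you expect is actually arriving there.

This can be done with the scrapy.shell.inspect_response function.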

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

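Once the spider is running and the shell opens, we can start inspecting. Note that fetch() cannot be used here, because the Scrapy engine is blocked by the shell.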

response.xpath('//h1[@class="fn"]')

Finally, press Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume the crawl.

Worked example

Scraping the official Scrapy documentation

fetch("https://docs.scrapy.org/en/latest/index.html")

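Looking at the page markup, the headings sit in h1 and h2 tags according to their level.

Taking the second-level headings as an example, we can locate these elements with XPath.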

response.xpath('//h2')

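At this point the result is still a SelectorList rather than plain strings; call extract() to get the matched HTML as a list of strings.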

response.xpath('//h2').extract()

The main body of the document sits in div tags whose class is "section". To scrape the document content, you can do this:

response.xpath("//div[@class='section']").extract()

Then use a regular expression to pull out the part we need.

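import re
data = response.xpath("//div[@class='section']").extract()  # a list of HTML strings
# The response contains markup like <h2>Second-level heading<a class=…, so match the title text in between.
# The non-greedy .*? stops at the first "<a" after each <h2>.
pattern = re.compile(r"(?<=<h2>).*?(?=<a)")
title = re.findall(pattern, data[0])
print(title)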
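
The lookaround pattern can be tried on its own with the standard library. The sketch below uses a made-up, single-line HTML fragment shaped like the markup described above, and compares a greedy .* with a non-greedy .*? between the same lookbehind and lookahead.

```python
import re

# Hypothetical fragment: two h2 headings on one line, each followed by an anchor.
sample = (
    '<h2>Getting help<a class="headerlink" href="#getting-help">¶</a></h2>'
    '<h2>First steps<a class="headerlink" href="#first-steps">¶</a></h2>'
)

# Greedy: runs to the LAST "<a" on the line, swallowing the markup in between.
greedy = re.findall(r"(?<=<h2>).*(?=<a)", sample)

# Non-greedy: stops at the FIRST "<a" after each <h2>.
non_greedy = re.findall(r"(?<=<h2>).*?(?=<a)", sample)

print(len(greedy))
print(non_greedy)
```

When several headings share a line, only the non-greedy form yields one clean title per heading.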
