Console commands

scrapy startproject <project_name>

scrapy crawl XX  # XX is the spider's name

scrapy crawl quotes -o quotes.json

scrapy crawl quotes -o quotes.jl

scrapy shell http://www.scrapyd.cn

scrapy genspider example example.com  # create a spider named example
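The two -o commands above differ only in output format: a .json feed is a single JSON array, while a .jl (JSON Lines) feed has one JSON object per line and can be read record by record. A minimal sketch of reading both with the standard library; the sample items and the helper names load_json_feed and load_jl_feed are made up for illustration:

```python
import json

# Hypothetical sample items, shaped like the quotes scraped later in this post.
items = [
    {"text": "Quote one", "author": "Author A"},
    {"text": "Quote two", "author": "Author B"},
]

# A .json feed holds one JSON array containing every item.
json_feed = json.dumps(items)

# A .jl (JSON Lines) feed holds one JSON object per line.
jl_feed = "\n".join(json.dumps(item) for item in items)


def load_json_feed(text):
    """Parse a whole .json feed at once."""
    return json.loads(text)


def load_jl_feed(text):
    """Parse a .jl feed line by line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Both feeds carry the same records; only the layout differs.
print(load_json_feed(json_feed) == load_jl_feed(jl_feed))
```

Because each line of a .jl file is a complete record, it can be processed as a stream without loading the whole file into memory.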

Scrapy selectors (response.selector)

.extract_first()  .extract()  .get()  .getall()  # .get() is equivalent to .extract_first(), .getall() to .extract()

css

.intro  # class="intro"

#firstname  # id="firstname"

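tag::attr(attribute)  # e.g. "a::attr(href)", "img::attr(src)"; .attrib['href'] reads the same attribute from a selector

tag::text  # e.g. "a::text"; "a *::text" selects all text inside the a tag, including text in nested elements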

div, p  # selects all <div> elements and all <p> elements    div p  # selects all <p> elements inside <div> elements

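div > p  # selects all <p> elements whose parent is a <div>    div + p  # selects each <p> element that immediately follows a <div>

[target]  # selects all elements with a target attribute; also [target=blank] (exact value), [target~=blank] (whitespace-separated word), [target|=blank] (value is "blank" or starts with "blank-")

string()  # extracts the whole text as one concatenated string (an XPath function, used through .xpath())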

xpath

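/  # selects from the root node    //  # selects matching nodes anywhere below the current node, regardless of their position

@  # selects attributes, e.g. "//@href"; tag[@attribute=value] filters by attribute value

//text()  # text nodes (the text content of tags)

//a[contains(@href, "image")]  # matches when the attribute value contains the given substring (fuzzy matching)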

Regular expressions

https://docs.python.org/3/library/re.html
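Scrapy selectors also expose .re() and .re_first(), which apply a regular expression to the selected text. The patterns follow the re module linked above, so they can be tested on plain strings first. A small sketch; the input string is a hypothetical value of the kind a ::text selector might return:

```python
import re

# Hypothetical text, as might be returned by a ::text selector.
born = "Born: March 14, 1879 in Ulm, Germany"

# Named groups keep the pattern readable and the result self-describing.
pattern = re.compile(r"(?P<month>[A-Z][a-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})")

match = pattern.search(born)
if match:
    print(match.group("month"), match.group("day"), match.group("year"))

# findall returns every non-overlapping match; here, each run of digits.
numbers = re.findall(r"\d+", born)
print(numbers)
```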

Example (template)

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']
    # Alternative callbacks that follow author links.
    # They are disabled by wrapping them in a string literal.
    """
    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
    """

    def parse(self, response):
        # yield one item per quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # follow the pagination link, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
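The href taken from li.next is usually relative, which is why parse passes it through response.urljoin before building the next request. The standard library's urljoin follows the same resolution rules and can illustrate them without a running spider. This is a sketch: resolve_next is a hypothetical helper and the page URLs are examples only.

```python
from urllib.parse import urljoin


def resolve_next(page_url, href):
    """Return the absolute URL of the next page, or None when there is no link."""
    if href is None:
        return None
    return urljoin(page_url, href)


page = "http://quotes.toscrape.com/page/1/"

# A root-relative href replaces the whole path of the current page.
print(resolve_next(page, "/page/2/"))

# A path-relative href is appended to the current directory instead.
print(resolve_next(page, "page/2/"))

# No pagination link (the last page): nothing left to request.
print(resolve_next(page, None))
```

The None check mirrors the "if next_page is not None" guard in parse, which stops the crawl on the last page.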

