Console commands
scrapy startproject <project_name>
scrapy crawl <spider_name>
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.jl
scrapy shell http://www.scrapyd.cn
scrapy genspider example example.com  # create a spider named "example"
Scrapy selectors (response.selector)
.get() .getall()  # preferred; .extract_first() / .extract() are the older aliases for the same calls
-
css
.intro            # class="intro"
#firstname        # id="firstname"
tag::attr(name)   # "a::attr(href)", "img::attr(src)"; or the .attrib['href'] shortcut
tag::text         # "a::text"; "a *::text" selects all text inside the <a> tag
div, p            # all <div> elements and all <p> elements
div p             # all <p> elements inside a <div>
div > p           # all <p> elements whose direct parent is a <div>
div + p           # all <p> elements immediately preceded by a <div>
[target]          # elements with a target attribute; also [target=blank], [target~=blank], [target|=blank]
string()          # extract a node's whole text in one joined piece; this is an XPath function, chain it via .xpath('string()')
-
xpath
/                 # select from the root node
//                # select matching nodes anywhere in the document, regardless of position
@                 # select an attribute: "//@href"; tag[@attr=value]
//text()          # the text content of a tag
//a[contains(@href, "image")]   # fuzzy match: attribute value contains a given substring
-
Regex
https://docs.python.org/3/library/re.html
Examples (templates)

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)
        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

An alternative parse that yields quote items and follows pagination with an explicit Request:

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)