1. References
*Web Scraping with Python* — 2.2 Three ways to scrape a page: re / lxml / BeautifulSoup
Note that internally, lxml actually converts CSS selectors into equivalent XPath selectors.
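The cssselect package (which lxml relies on for its CSS support) exposes that translation directly; a quick illustration, assuming cssselect is installed:

```python
from cssselect import GenericTranslator

# the CSS selector is compiled into an equivalent XPath expression
print(GenericTranslator().css_to_xpath('div.quote'))
# descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]
```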
The results show that, when scraping our example page, Beautiful Soup was more than six times slower than the other two approaches. This is expected: lxml and the regular-expression module are written in C, while BeautifulSoup is pure Python. An interesting point is that lxml performed almost as well as the regular expressions. Because lxml must parse the input into its internal format before it can search for elements, it incurs extra overhead; but when scraping several features from the same page, that initial parsing cost is amortized and lxml becomes much more competitive. It really is an impressive module.
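A rough, self-contained benchmark sketch of the three approaches (the HTML snippet, regex, and iteration count below are illustrative stand-ins, not the book's actual test harness):

```python
import re
import timeit

from bs4 import BeautifulSoup        # pure Python
from lxml import html as lxml_html   # C extension; CSS support needs the cssselect package

DOC = '<table><tr><td class="w2p_fw">United Kingdom</td></tr></table>' * 50

def scrape_re(doc):
    return re.findall(r'<td class="w2p_fw">(.*?)</td>', doc)

def scrape_lxml(doc):
    tree = lxml_html.fromstring(doc)   # parsing into lxml's internal format is the main overhead
    return [td.text for td in tree.cssselect('td.w2p_fw')]

def scrape_bs(doc):
    soup = BeautifulSoup(doc, 'html.parser')
    return [td.get_text() for td in soup.find_all('td', class_='w2p_fw')]

for fn in (scrape_re, scrape_lxml, scrape_bs):
    print(fn.__name__, timeit.timeit(lambda: fn(DOC), number=200))
```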
2. Scrapy Selectors
https://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors
- BeautifulSoup drawback: slow
- lxml: built on ElementTree
- Scrapy selectors: based on the parsel library, which is itself built on top of lxml, so the two are very close in speed and parsing accuracy.

.css() and .xpath() return a SelectorList, i.e. a list of new selectors, so calls can be chained.
.extract() and .re() extract (and, for .re(), filter) the tag data; a small chaining sketch follows this list.
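A minimal sketch of that chaining behaviour, using parsel directly (the markup below is made up for illustration):

```python
from parsel import Selector

sel = Selector(text='<div class="quote"><small class="author">Albert Einstein</small></div>')

# .css() returns a SelectorList; every selector inside it supports .css()/.xpath() again
authors = sel.css('div.quote').xpath('.//small[@class="author"]/text()')

print(authors.extract())           # ['Albert Einstein']
print(authors.re(r'(\w+) \w+'))    # ['Albert']  -- .re() extracts and filters in one step
```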
```python
import scrapy
# resolves to C:\Program Files\Anaconda2\Lib\site-packages\scrapy\__init__.py

from scrapy.selector import Selector
# C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\__init__.py contains:
#     from scrapy.selector.unified import *
# C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\unified.py contains:
#     from parsel import Selector as _ParselSelector
#     class Selector(_ParselSelector, object_ref):
```
```
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
```
When Selector is imported this way (from scrapy), the first positional argument for instantiating it is an HtmlResponse instance; to build a Selector from a plain str, you have to pass it as a keyword: sel = Selector(text=doc).
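A minimal sketch of both ways of constructing a Scrapy Selector (the URL and markup are made up for illustration):

```python
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = b'<html><body><span class="author">Albert Einstein</span></body></html>'
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')

sel_from_response = Selector(response)               # first positional argument: a Response object
sel_from_text = Selector(text=body.decode('utf-8'))  # from a plain str, pass it as text=

print(sel_from_response.css('span.author::text').extract_first())  # Albert Einstein
print(sel_from_text.css('span.author::text').extract_first())      # Albert Einstein
```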
```
In [926]: from parsel import Selector

In [927]: Selector?
Init signature: Selector(self, text=None, type=None, namespaces=None, root=None, base_url=None, _expr=None)
Docstring:
:class:`Selector` allows you to select parts of an XML or HTML text using
CSS or XPath expressions and extract data from it.

``text`` is a ``unicode`` object in Python 2 or a ``str`` object in Python 3

``type`` defines the selector type, it can be ``"html"``, ``"xml"`` or ``None`` (default).
If ``type`` is ``None``, the selector defaults to ``"html"``.
File:           c:\program files\anaconda2\lib\site-packages\parsel\selector.py
Type:           type
```
doc=u""" <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>>> <span>by <small class="author" itemprop="author">Thomas A. Edison</small> <a href="/author/Thomas-A-Edison">(about)</a> """ sel = Selector(doc) sel.css('div.quote') [<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>]
3. Debugging with scrapy shell
https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data
```
G:\pydata\pycode\scrapy\splash_cnblogs>scrapy shell "http://quotes.toscrape.com/page/1/"
```
3.1 XPath vs CSS
Comparison
| | CSS | XPath | Notes |
| --- | --- | --- | --- |
| Has an attribute | response.css('div[class]') | response.xpath('//div[@class]') | class selectors can be abbreviated as div.classname or just .classname; div#abc or #abc corresponds to id="abc" |
| Match an attribute value | response.css('div[class="quote"]') | response.xpath('//div[@class="quote"]') | response.xpath('//small[text()="Albert Einstein"]') |
| Match part of an attribute value | response.css('div[class*="quo"]') | response.xpath('//div[contains(@class,"quo")]') | response.xpath('//small[contains(text(),"Einstein")]') |
| Extract an attribute value | response.css('small::attr(class)') | response.xpath('//small/@class') | CSS treats text separately from attributes, so it has no equivalent of the two text()-based filters above |
| Extract text | response.css('small::text') | response.xpath('//small/text()') | |
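A small self-contained check of that last point, using parsel with made-up markup: XPath can filter elements by their text content, whereas the CSS ::text / ::attr() pseudo-elements only extract.

```python
from parsel import Selector

sel = Selector(text='<div><small class="author">Albert Einstein</small>'
                    '<small class="author">Jane Austen</small></div>')

# XPath: keep only the element whose text equals "Albert Einstein"
print(sel.xpath('//small[text()="Albert Einstein"]').extract())
# ['<small class="author">Albert Einstein</small>']

# CSS: ::text extracts text from whatever was already selected, but cannot filter on it
print(sel.css('small.author::text').extract())
# ['Albert Einstein', 'Jane Austen']
```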
Usage
```
In [135]: response.xpath('//small[@class="author"]').extract_first()
In [122]: response.css('small.author').extract_first()
Out[122]: u'<small class="author" itemprop="author">Albert Einstein</small>'

In [136]: response.xpath('//small[@class="author"]/text()').extract_first()
In [123]: response.css('small.author::text').extract_first()
Out[123]: u'Albert Einstein'

In [137]: response.xpath('//small[@class="author"]/@class').extract_first()   # class is itself just an attribute
In [124]: response.css('small.author::attr(class)').extract_first()
Out[124]: u'author'

In [138]: response.xpath('//small[@class="author"]/@itemprop').extract_first()
In [125]: response.css('small.author::attr(itemprop)').extract_first()
Out[125]: u'author'
```
class is a special attribute that can hold multiple values, e.g. class="row header-box".

```
# match one of the values in a multi-valued class
In [228]: response.css('div.row')
Out[228]:
[<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row header-box">\n '>,
 <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row">\n <div class="col-md'>]

In [232]: response.css('div.ro')
Out[232]: []

# exact match against the entire class attribute value
In [226]: response.css('div[class="row"]')
Out[226]: [<Selector xpath=u"descendant-or-self::div[@class = 'row']" data=u'<div class="row">\n <div class="col-md'>]

In [240]: response.xpath('//div[@class="row header-box"]')
Out[240]: [<Selector xpath='//div[@class="row header-box"]' data=u'<div class="row header-box">\n '>]

# substring match against the entire class attribute value
In [229]: response.css('div[class*="row"]')
Out[229]:
[<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row header-box">\n '>,
 <Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row">\n <div class="col-md'>]

In [230]: response.xpath('//div[contains(@class,"row")]')
Out[230]:
[<Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row header-box">\n '>,
 <Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row">\n <div class="col-md'>]

In [234]: response.css('div[class*="w h"]')
Out[234]: [<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'w h')]" data=u'<div class="row header-box">\n '>]

In [235]: response.xpath('//div[contains(@class,"w h")]')
Out[235]: [<Selector xpath='//div[contains(@class,"w h")]' data=u'<div class="row header-box">\n '>]
```
3.2 Extracting data
- Extracting data
  SelectorList / Selector: .extract(), .extract_first()
  selector.extract() returns a single str; selector.extract_first() raises an error (the method exists only on SelectorList).
  selectorList.extract() calls selector.extract() on every selector and returns a list of str; selectorList.extract_first() takes the first element of that list.
- Extracting data with a regex filter
  SelectorList / Selector: .re(r'xxx'), .re_first(r'xxx')
  selector.re() returns a list; selector.re_first() returns the first matching str.
  selectorList.re() calls selector.re() on every selector and merges the per-selector lists (note that not every selector necessarily matches) into one list; selectorList.re_first() takes the first str of that merged list.
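A self-contained sketch of these behaviours, using parsel with made-up markup:

```python
from parsel import Selector

sel = Selector(text='<p class="a">one</p><p class="a">two</p><p class="b">No. 3</p>')
plist = sel.css('p')                 # SelectorList with three selectors

print(plist.extract())               # list of the three serialized <p> strings
print(plist.extract_first())         # '<p class="a">one</p>'
print(plist[0].extract())            # a single Selector: returns one str
# plist[0].extract_first()           # AttributeError: extract_first exists only on SelectorList

print(plist.re(r'\d+'))              # ['3']  -- per-selector matches merged; selectors with no match add nothing
print(plist.re_first(r'\d+'))        # '3'
```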
Using extract
```
In [21]: response.css('.author')    # converted to XPath internally; returns a SelectorList instance
Out[21]:
[<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>]

In [22]: response.css('.author').extract()    # extract the data part above: calls extract() on every Selector in the SelectorList
Out[22]:
[u'<small class="author" itemprop="author">Albert Einstein</small>',
 u'<small class="author" itemprop="author">J.K. Rowling</small>',
 u'<small class="author" itemprop="author">Albert Einstein</small>',
 u'<small class="author" itemprop="author">Jane Austen</small>',
 u'<small class="author" itemprop="author">Marilyn Monroe</small>',
 u'<small class="author" itemprop="author">Albert Einstein</small>',
 u'<small class="author" itemprop="author">Andr\xe9 Gide</small>',
 u'<small class="author" itemprop="author">Thomas A. Edison</small>',
 u'<small class="author" itemprop="author">Eleanor Roosevelt</small>',
 u'<small class="author" itemprop="author">Steve Martin</small>']

In [23]: response.css('.author').extract_first()    # take only the first; may return None, whereas response.css('.author')[0].extract() may raise an error
Out[23]: u'<small class="author" itemprop="author">Albert Einstein</small>'

In [24]: response.css('.author::text').extract_first()    # target the text inside the tag
Out[24]: u'Albert Einstein'
```
Using re
```
In [46]: response.css('.author::text')[0]
Out[46]: <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]/text()" data=u'Albert Einstein'>

In [47]: response.css('.author::text')[0].re(r'\w+')
Out[47]: [u'Albert', u'Einstein']

In [48]: response.css('.author::text')[0].re_first(r'\w+')
Out[48]: u'Albert'

In [49]: response.css('.author::text')[0].re(r'((\w+)\s(\w+))')    # groups are returned in order of their opening parentheses
Out[49]: [u'Albert Einstein', u'Albert', u'Einstein']
```