1. References
*Web Scraping with Python* — 2.2 Three ways to scrape a page: re / lxml / BeautifulSoup
Note that internally, lxml actually converts CSS selectors into equivalent XPath selectors.
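The cssselect package (which lxml relies on for its CSS support) exposes that translation directly; a quick illustration, assuming cssselect is installed:

```python
from cssselect import GenericTranslator

# the CSS selector is compiled into an equivalent XPath expression
print(GenericTranslator().css_to_xpath('div.quote'))
# descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]
```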
The results show that, when scraping our example page, Beautiful Soup was more than six times slower than the other two approaches. This is expected: lxml and the regular-expression module are written in C, while BeautifulSoup is pure Python. An interesting point is that lxml performed almost as well as the regular expressions. Because lxml must parse the input into its internal format before it can search for elements, it incurs extra overhead; but when scraping several features from the same page, that initial parsing cost is amortized and lxml becomes much more competitive. It really is an impressive module.
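A rough, self-contained benchmark sketch of the three approaches (the HTML snippet, regex, and iteration count below are illustrative stand-ins, not the book's actual test harness):

```python
import re
import timeit

from bs4 import BeautifulSoup        # pure Python
from lxml import html as lxml_html   # C extension; CSS support needs the cssselect package

DOC = '<table><tr><td class="w2p_fw">United Kingdom</td></tr></table>' * 50

def scrape_re(doc):
    return re.findall(r'<td class="w2p_fw">(.*?)</td>', doc)

def scrape_lxml(doc):
    tree = lxml_html.fromstring(doc)   # parsing into lxml's internal format is the main overhead
    return [td.text for td in tree.cssselect('td.w2p_fw')]

def scrape_bs(doc):
    soup = BeautifulSoup(doc, 'html.parser')
    return [td.get_text() for td in soup.find_all('td', class_='w2p_fw')]

for fn in (scrape_re, scrape_lxml, scrape_bs):
    print(fn.__name__, timeit.timeit(lambda: fn(DOC), number=200))
```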
2. Scrapy Selectors
https://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors
- BeautifulSoup drawback: slow
- lxml: built on ElementTree
- Scrapy selectors: based on the parsel library, which is itself built on top of lxml, so the two are very close in speed and parsing accuracy.

.css() and .xpath() return a SelectorList, i.e. a list of new selectors, so calls can be chained.
.extract() and .re() extract (and, for .re(), filter) the tag data; a small chaining sketch follows this list.
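A minimal sketch of that chaining behaviour, using parsel directly (the markup below is made up for illustration):

```python
from parsel import Selector

sel = Selector(text='<div class="quote"><small class="author">Albert Einstein</small></div>')

# .css() returns a SelectorList; every selector inside it supports .css()/.xpath() again
authors = sel.css('div.quote').xpath('.//small[@class="author"]/text()')

print(authors.extract())           # ['Albert Einstein']
print(authors.re(r'(\w+) \w+'))    # ['Albert']  -- .re() extracts and filters in one step
```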
```python
import scrapy
# resolves to C:\Program Files\Anaconda2\Lib\site-packages\scrapy\__init__.py

from scrapy.selector import Selector
# C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\__init__.py contains:
#     from scrapy.selector.unified import *
# C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\unified.py contains:
#     from parsel import Selector as _ParselSelector
#     class Selector(_ParselSelector, object_ref):
```
```
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
```
When Selector is imported this way (from scrapy), the first positional argument for instantiating it is an HtmlResponse instance; to build a Selector from a plain str, you have to pass it as a keyword: sel = Selector(text=doc).
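A minimal sketch of both ways of constructing a Scrapy Selector (the URL and markup are made up for illustration):

```python
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = b'<html><body><span class="author">Albert Einstein</span></body></html>'
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')

sel_from_response = Selector(response)               # first positional argument: a Response object
sel_from_text = Selector(text=body.decode('utf-8'))  # from a plain str, pass it as text=

print(sel_from_response.css('span.author::text').extract_first())  # Albert Einstein
print(sel_from_text.css('span.author::text').extract_first())      # Albert Einstein
```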
```
In [926]: from parsel import Selector

In [927]: Selector?
Init signature: Selector(self, text=None, type=None, namespaces=None, root=None, base_url=None, _expr=None)
Docstring:
:class:`Selector` allows you to select parts of an XML or HTML text using
CSS or XPath expressions and extract data from it.

``text`` is a ``unicode`` object in Python 2 or a ``str`` object in Python 3

``type`` defines the selector type, it can be ``"html"``, ``"xml"`` or ``None`` (default).
If ``type`` is ``None``, the selector defaults to ``"html"``.
File:           c:\program files\anaconda2\lib\site-packages\parsel\selector.py
Type:           type
```
doc=u""" <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>>> <span>by <small class="author" itemprop="author">Thomas A. Edison</small> <a href="/author/Thomas-A-Edison">(about)</a> """ sel = Selector(doc) sel.css('div.quote') [<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>]
3. Debugging with scrapy shell
https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data
```
G:\pydata\pycode\scrapy\splash_cnblogs>scrapy shell "http://quotes.toscrape.com/page/1/"
```
3.1 XPath vs CSS
Comparison
| | CSS | XPath | Notes |
| --- | --- | --- | --- |
| Has an attribute | response.css('div[class]') | response.xpath('//div[@class]') | class selectors can be abbreviated as div.classname or just .classname; div#abc or #abc corresponds to id="abc" |
| Match an attribute value | response.css('div[class="quote"]') | response.xpath('//div[@class="quote"]') | response.xpath('//small[text()="Albert Einstein"]') |
| Match part of an attribute value | response.css('div[class*="quo"]') | response.xpath('//div[contains(@class,"quo")]') | response.xpath('//small[contains(text(),"Einstein")]') |
| Extract an attribute value | response.css('small::attr(class)') | response.xpath('//small/@class') | CSS treats text separately from attributes, so it has no equivalent of the two text()-based filters above |
| Extract text | response.css('small::text') | response.xpath('//small/text()') | |
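A small self-contained check of that last point, using parsel with made-up markup: XPath can filter elements by their text content, whereas the CSS ::text / ::attr() pseudo-elements only extract.

```python
from parsel import Selector

sel = Selector(text='<div><small class="author">Albert Einstein</small>'
                    '<small class="author">Jane Austen</small></div>')

# XPath: keep only the element whose text equals "Albert Einstein"
print(sel.xpath('//small[text()="Albert Einstein"]').extract())
# ['<small class="author">Albert Einstein</small>']

# CSS: ::text extracts text from whatever was already selected, but cannot filter on it
print(sel.css('small.author::text').extract())
# ['Albert Einstein', 'Jane Austen']
```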
Usage
```
In [135]: response.xpath('//small[@class="author"]').extract_first()
In [122]: response.css('small.author').extract_first()
Out[122]: u'<small class="author" itemprop="author">Albert Einstein</small>'

In [136]: response.xpath('//small[@class="author"]/text()').extract_first()
In [123]: response.css('small.author::text').extract_first()
Out[123]: u'Albert Einstein'

In [137]: response.xpath('//small[@class="author"]/@class').extract_first()   # class is itself just an attribute
In [124]: response.css('small.author::attr(class)').extract_first()
Out[124]: u'author'

In [138]: response.xpath('//small[@class="author"]/@itemprop').extract_first()
In [125]: response.css('small.author::attr(itemprop)').extract_first()
Out[125]: u'author'
```
class is a special attribute that can hold multiple values, e.g. class="row header-box".

```
# match one of the values in a multi-valued class
In [228]: response.css('div.row')
Out[228]:
[<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row header-box">\n '>,
 <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row">\n <div class="col-md'>]

In [232]: response.css('div.ro')
Out[232]: []

# exact match against the entire class attribute value
In [226]: response.css('div[class="row"]')
Out[226]: [<Selector xpath=u"descendant-or-self::div[@class = 'row']" data=u'<div class="row">\n <div class="col-md'>]

In [240]: response.xpath('//div[@class="row header-box"]')
Out[240]: [<Selector xpath='//div[@class="row header-box"]' data=u'<div class="row header-box">\n '>]

# substring match against the entire class attribute value
In [229]: response.css('div[class*="row"]')
Out[229]:
[<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row header-box">\n '>,
 <Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row">\n <div class="col-md'>]

In [230]: response.xpath('//div[contains(@class,"row")]')
Out[230]:
[<Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row header-box">\n '>,
 <Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row">\n <div class="col-md'>]

In [234]: response.css('div[class*="w h"]')
Out[234]: [<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'w h')]" data=u'<div class="row header-box">\n '>]

In [235]: response.xpath('//div[contains(@class,"w h")]')
Out[235]: [<Selector xpath='//div[contains(@class,"w h")]' data=u'<div class="row header-box">\n '>]
```
3.2 Extracting data
- Extracting data
  SelectorList / Selector: .extract(), .extract_first()
  selector.extract() returns a single str; selector.extract_first() raises an error (the method exists only on SelectorList).
  selectorList.extract() calls selector.extract() on every selector and returns a list of str; selectorList.extract_first() takes the first element of that list.
- Extracting data with a regex filter
  SelectorList / Selector: .re(r'xxx'), .re_first(r'xxx')
  selector.re() returns a list; selector.re_first() returns the first matching str.
  selectorList.re() calls selector.re() on every selector and merges the per-selector lists (note that not every selector necessarily matches) into one list; selectorList.re_first() takes the first str of that merged list.
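A self-contained sketch of these behaviours, using parsel with made-up markup:

```python
from parsel import Selector

sel = Selector(text='<p class="a">one</p><p class="a">two</p><p class="b">No. 3</p>')
plist = sel.css('p')                 # SelectorList with three selectors

print(plist.extract())               # list of the three serialized <p> strings
print(plist.extract_first())         # '<p class="a">one</p>'
print(plist[0].extract())            # a single Selector: returns one str
# plist[0].extract_first()           # AttributeError: extract_first exists only on SelectorList

print(plist.re(r'\d+'))              # ['3']  -- per-selector matches merged; selectors with no match add nothing
print(plist.re_first(r'\d+'))        # '3'
```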
Using extract
```
In [21]: response.css('.author')    # converted to XPath internally; returns a SelectorList instance
Out[21]:
[<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>]

In [22]: response.css('.author').extract()    # extract the data part above: calls extract() on every Selector in the SelectorList
Out[22]:
[u'<small class="author" itemprop="author">Albert Einstein</small>',
 u'<small class="author" itemprop="author">J.K. Rowling</small>',
 u'<small class="author" itemprop="author">Albert Einstein</small>',
 u'<small class="author" itemprop="author">Jane Austen</small>',
 u'<small class="author" itemprop="author">Marilyn Monroe</small>',
 u'<small class="author" itemprop="author">Albert Einstein</small>',
 u'<small class="author" itemprop="author">Andr\xe9 Gide</small>',
 u'<small class="author" itemprop="author">Thomas A. Edison</small>',
 u'<small class="author" itemprop="author">Eleanor Roosevelt</small>',
 u'<small class="author" itemprop="author">Steve Martin</small>']

In [23]: response.css('.author').extract_first()    # take only the first; may return None, whereas response.css('.author')[0].extract() may raise an error
Out[23]: u'<small class="author" itemprop="author">Albert Einstein</small>'

In [24]: response.css('.author::text').extract_first()    # target the text inside the tag
Out[24]: u'Albert Einstein'
```
Using re
```
In [46]: response.css('.author::text')[0]
Out[46]: <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]/text()" data=u'Albert Einstein'>

In [47]: response.css('.author::text')[0].re(r'\w+')
Out[47]: [u'Albert', u'Einstein']

In [48]: response.css('.author::text')[0].re_first(r'\w+')
Out[48]: u'Albert'

In [49]: response.css('.author::text')[0].re(r'((\w+)\s(\w+))')    # groups are returned in order of their opening parentheses
Out[49]: [u'Albert Einstein', u'Albert', u'Einstein']
```