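I. Test environment
1. Windows 7 x64 SP1
2. Anaconda3 + Python 3.7.3 (bundled with Anaconda; no separate installation needed)
3. Scrapy 1.6.0
II. Usage examples
1. Start the Scrapy shell by entering the following command at the command line: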
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Once the page has been fetched, the shell exposes it as the response object used in the steps below.
2. Extract the a nodes
- XPath usage
result = response.xpath('//a')
Result:
[<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>, <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]
- CSS usage
result = response.css('a')
Result:
[<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image 3 <'>, <Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image 5 <'>]
3. Check the type of result
type(result)
Result:
scrapy.selector.unified.SelectorList
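Note: result is a list of Selector objects, of type SelectorList. Both Selector and SelectorList support further calls to xpath(), css(), and similar methods to extract data.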
4. View the full content of the data extracted into result with the extract() method
result.extract()
Result:
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>', '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>', '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>', '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>', '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
5. Extract the text of a node
- XPath usage, with the text() node test
response.xpath('//a/text()')
Result:
[<Selector xpath='//a/text()' data='Name: My image 1 '>, <Selector xpath='//a/text()' data='Name: My image 2 '>, <Selector xpath='//a/text()' data='Name: My image 3 '>, <Selector xpath='//a/text()' data='Name: My image 4 '>, <Selector xpath='//a/text()' data='Name: My image 5 '>]
Get the text content as strings:
response.xpath('//a/text()').extract()
Result:
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
- CSS usage, with the ::text pseudo-element
response.css('a::text').extract()
Result:
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
6. Extract attribute values
- XPath usage, with /@attribute-name (for example /@href)
response.xpath('//a/@href').extract()
Result:
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
- CSS usage, with ::attr(attribute-name)
response.css('a::attr("href")').extract()
Result:
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
7. Extract the child nodes inside a node
- XPath usage, with /child-name
response.xpath('//a/img').extract()
Result:
['<img src="image1_thumb.jpg">', '<img src="image2_thumb.jpg">', '<img src="image3_thumb.jpg">', '<img src="image4_thumb.jpg">', '<img src="image5_thumb.jpg">']
- CSS usage (a img matches img at any depth below a; a > img is the strict child equivalent of //a/img)
response.css('a img').extract()
Result:
['<img src="image1_thumb.jpg">', '<img src="image2_thumb.jpg">', '<img src="image3_thumb.jpg">', '<img src="image4_thumb.jpg">', '<img src="image5_thumb.jpg">']
Then extract the src attribute values, the same way as in step 6
- XPath usage
response.xpath('//a/img/@src').extract()
- CSS usage
response.css('a img::attr("src")').extract()
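8. Common methods
- extract_first() # returns the first matched element, or None if nothing matched
- extract_first('default value') # same as above, but returns the given default if nothing matched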