Using Selector to extract data in Scrapy

Reposted article · 2018-08-01 17:39 · 3155 · Category: crawlers

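I. Basic concepts

1. Selector is a module that can be used on its own. Construct a selector object with the Selector class, then call its methods such as xpath() and css() to extract data, as follows: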

from scrapy import Selector

body = '<html><head><title>Hello World</title></head><body></body></html>'
selector = Selector(text=body)
# extract_first() returns the first match as a string
title = selector.xpath('//title/text()').extract_first()
print(title)



Output:
Hello World

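2. scrapy shell is mainly used to test whether the commands used in a Scrapy project work as expected. It can be run directly from bash.

Here we use scrapy shell to try out selectors on a real page. On Linux, run the following command in bash:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

This starts the interactive scrapy shell.

Source of the test page above: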

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

II. scrapy shell provides a built-in selector, response.selector, for extracting information from the page. A few examples follow.

1. Basic usage of xpath and css

# Get the text of <title>; the leading .selector can be omitted, as in the later examples
response.selector.xpath('//title/text()').extract_first()
response.selector.css('title::text').extract_first()


# Get the href attribute value of every a tag
response.xpath('//a/@href').extract()
response.css('a::attr(href)').extract() 


# Find all a tags whose href value contains "image" and get their href
response.xpath('//a[contains(@href, "image")]/@href').extract()
response.css('a[href*=image]::attr(href)').extract()


# For the a tags whose href value contains "image", get the src attribute of the img element beneath them
response.xpath('//a[contains(@href, "image")]/img/@src').extract()
response.css('a[href*=image] img::attr(src)').extract()


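# Combine with regular expressions to extract just the part you need
response.css('a::text').re(r'Name:(.*)')  # returns what (.*) captured, for every match
response.css('a::text').re_first(r'Name:(.*)').strip()  # first capture only; strip() removes leading and trailing whitespace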
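
To see why strip() is needed, it helps to look at what the pattern captures. The sketch below uses plain re from the standard library rather than Scrapy, on two strings written out by hand to match the link text of the sample page. The names texts, pattern, and tight are illustrative, and the second pattern is an alternative suggested here, not something from the original examples.

```python
import re

# Link text as it appears in the sample page: there is a space after
# "Name:" and another one before the <br /> tag.
texts = ['Name: My image 1 ', 'Name: My image 2 ']

pattern = re.compile(r'Name:(.*)')

# The capture group keeps the surrounding spaces...
captured = [m.group(1) for m in map(pattern.search, texts) if m]
# ...which strip() then removes.
cleaned = [c.strip() for c in captured]

# Alternative: let the regex itself skip the whitespace.
tight = re.compile(r'Name:\s*(.*\S)')
tight_captured = [m.group(1) for m in map(tight.search, texts) if m]

print(captured)        # [' My image 1 ', ' My image 2 ']
print(cleaned)         # ['My image 1', 'My image 2']
print(tight_captured)  # ['My image 1', 'My image 2']
```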

2. xpath and css can also be chained

# First select the src attribute of the img tags inside the div
response.xpath('//div[@id="images"]').css('img::attr(src)')
# Then extract the data
response.xpath('//div[@id="images"]').css('img::attr(src)').extract()  # a list of strings
response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first()  # a single string
response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first(default='')  # returns the default if nothing matched

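Notes:

1. extract() converts the selectors into plain data: it returns the matched content as a list of strings.

2. [@id="images"] restricts the match by attribute: only div tags whose id attribute equals images are selected. The value images must be quoted inside []. Because the XPath expression here is written as a single-quoted Python string, double quotes are the convenient choice. XPath itself also accepts single quotes, provided they do not clash with the delimiters of the Python string.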
