A pitfall when using XPath selectors in Scrapy

Reposted (see the original) 2019-04-21 13:19 481 Python crawler

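The scenario is as follows:

A web page contains a ul, and this ul has 125 li tags. Each li holds the url field (every url is unique) and the price field we want. We need to visit the url under each li and carry that li's price along with the request we generate.

This clearly calls for passing data through meta in a Scrapy project, so the plan might look like this: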

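1) start_requests visits the initial page

2) Define a parse method that selects all li tags with an XPath selector, iterates over them, extracts the url and price fields from each li, builds a scrapy.Request whose target is url, packs price into the Request's meta, and yields each of these scrapy.Request objects

3) Process the new response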

The problem appears in step 2):

We may find that the url and price obtained while iterating over the li tags are the same every time, for example:

In [20]: url_item = response.xpath('//ul[contains(@class, "house-list")]/li')

In [21]: for item in url_item:
    ...:     url = item.xpath('//h2[@class="title"]/a/@href').extract_first()
    ...:     print(url)
    ...: 
https://bj.58.com/ershoufang/37822311633030x.shtml
https://bj.58.com/ershoufang/37822311633030x.shtml
https://bj.58.com/ershoufang/37822311633030x.shtml
...... (122 more identical URLs omitted)


My guess was that when a scrapy.selector.unified.SelectorList is iterated, an operation on a child element does not actually act on that child but still acts on the SelectorList itself:

In [24]: for item in url_item:
    ...:     url = url_item.xpath('//h2[@class="title"]/a/@href').extract_first()
    ...:     print(url)
    ...: 
https://bj.58.com/ershoufang/37822311633030x.shtml
https://bj.58.com/ershoufang/37822311633030x.shtml
https://bj.58.com/ershoufang/37822311633030x.shtml
...... (122 more identical URLs omitted)


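The two results are identical. To test my guess, this time I used extract() instead of extract_first() inside the loop, calling it on a single element (url_item[2]). The result: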

In [34]: for item in url_item:
    ...:     urls = url_item[2].xpath('//h2[@class="title"]/a/@href').extract()
    ...:     print(urls)
    ...: 
['https://bj.58.com/ershoufang/37822311633030x.shtml', 'https://bj.58.com/ershoufang/37834554715403x.shtml', 'https://bj.58.com/ershoufang/37769196098828x.shtml',
..... (121 more distinct URLs omitted)
'https://bj.58.com/ershoufang/37398992001320x.shtml']
...... (123 more lists identical to the one above omitted)
['https://bj.58.com/ershoufang/37822311633030x.shtml', 'https://bj.58.com/ershoufang/37834554715403x.shtml', 'https://bj.58.com/ershoufang/37769196098828x.shtml',
.....
'https://bj.58.com/ershoufang/37398992001320x.shtml']


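Analysis: each item in the loop contains only its own unique url, so even with extract() it should print a list holding just that url, and the list printed for each item should be different.

In practice, however, the list printed for each item contains every url, the lists are identical across items, and this list matches the url list obtained from url_item, the SelectorList the items come from: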

In [37]: response.xpath('//ul[contains(@class, "house-list")]/li//h2[@class="title"]/a/@href').extract()
Out[37]: 
['https://bj.58.com/ershoufang/37822311633030x.shtml',
 'https://bj.58.com/ershoufang/37834554715403x.shtml',
 'https://bj.58.com/ershoufang/37769196098828x.shtml',
...... (121 more distinct URLs omitted)
 'https://bj.58.com/ershoufang/37398992001320x.shtml']


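The result is consistent with my guess, and this is the Scrapy XPath selector pitfall I mean. The underlying reason is that an XPath expression starting with // is absolute: it is evaluated from the document root no matter which Selector it is called on, so every li returns the matches for the whole page. So, back to the scenario I started with, how do we solve it?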

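Here are two approaches:

1) Do not use chained parsing with Scrapy's XPath selectors. After obtaining the scrapy.selector.unified.SelectorList, do not iterate over it and chain further queries. Instead, extract the list of HTML strings, iterate over that list, build a new scrapy.selector.unified.Selector from each string, and then extract the data with xpath, as follows: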

In [52]: url_item = response.xpath('//ul[contains(@class, "house-list")]/li')

In [53]: items = url_item.extract()

In [55]: for item in items:
    ...:     sele_obj = scrapy.Selector(text=item)
    ...:     url = sele_obj.xpath('//h2[@class="title"]/a/@href').extract_first()
    ...:     print(url)
    ...: 
https://bj.58.com/ershoufang/37822311633030x.shtml
https://bj.58.com/ershoufang/37834554715403x.shtml
https://bj.58.com/ershoufang/37769196098828x.shtml
...... (121 more distinct URLs omitted)
https://bj.58.com/ershoufang/37398992001320x.shtml


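This successfully retrieves the url under each li.

2) Keep the chained parsing, but switch from XPath to CSS selectors both when obtaining the scrapy.selector.unified.SelectorList and when querying each item: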

In [60]: url_item = response.css('ul[class *= "house-list"]>li')

In [61]: for item in url_item:
    ...:     url = item.css('h2.title>a::attr(href)').extract_first()
    ...:     print(url)
    ...: 
https://bj.58.com/ershoufang/37822311633030x.shtml
https://bj.58.com/ershoufang/37834554715403x.shtml
https://bj.58.com/ershoufang/37769196098828x.shtml
...... (121 more distinct URLs omitted)
https://bj.58.com/ershoufang/37398992001320x.shtml

