# On the command line, run `scrapy shell <url>`. Scrapy requests the URL, binds the result to `response`, and opens an interactive shell.
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

In [1]: response  # the response object bound by the shell
Out[1]: <200 https://doc.scrapy.org/en/latest/_static/selectors-sample1.html>

In [2]: response.text  # the source of the response body

# The page source has the following structure:
response.text = '''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
'''

# 1: Query through the selector with response.selector.xpath() / response.selector.css()
In [5]: response.selector.xpath('//title/text()').extract_first()
Out[5]: 'Example website'

In [6]: response.selector.css('title::text').extract_first()
Out[6]: 'Example website'

# 2: The same calls can be shortened to response.xpath() / response.css()
In [9]: response.css('title::text')
Out[9]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

In [10]: response.xpath('//title/text()')
Out[10]: [<Selector xpath='//title/text()' data='Example website'>]

# 3: As shown above, .xpath() and .css() return a list of Selector objects, not the data itself.
# To get the data out, use extract() for all matches or extract_first() for the first match.
# (Newer Scrapy versions also provide getall() and get() as equivalents.)
In [7]: response.xpath('//title/text()').extract_first()
Out[7]: 'Example website'

In [8]: response.css('title::text').extract_first()
Out[8]: 'Example website'

# 4: Selections can be chained
# Select the div whose id is 'images', then look for img tags inside it and extract their src attribute.
# In XPath, a condition on a node goes in square brackets [], and a step from parent to child uses a slash /
In [14]: response.xpath("//div[@id='images']").css('img::attr(src)').extract()
Out[14]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

# extract_first() also accepts a default argument; if nothing matches, it returns that value instead of None
In [16]: response.xpath("//div[@id='images']").css('img::attr(src)').extract_first(default='')
Out[16]: 'image1_thumb.jpg'

# Extract the href attribute of every a tag
In [18]: response.xpath('//a/@href').extract()
Out[18]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [19]: response.css('a::attr(href)').extract()
Out[19]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

# 5: Get the text of a tag
In [20]: response.xpath('//a/text()').extract()
Out[20]:
['Name: My image 1 ',
 'Name: My image 2 ',
 'Name: My image 3 ',
 'Name: My image 4 ',
 'Name: My image 5 ']

In [21]: response.css('a::text').extract()
Out[21]:
['Name: My image 1 ',
 'Name: My image 2 ',
 'Name: My image 3 ',
 'Name: My image 4 ',
 'Name: My image 5 ']

# 6: Get the attributes of a tag
In [34]: response.css('a::attr(href)').extract()
Out[34]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [39]: response.xpath('//a/@href').extract()
Out[39]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

# Select a tags whose href attribute contains "image", and extract that href
In [24]: response.xpath('//a[contains(@href,"image")]/@href').extract()
Out[24]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [25]: response.css('a[href*=image]::attr(href)').extract()
Out[25]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

# Select a tags whose href contains "image", then extract the src attribute of the img tags inside them
In [27]: response.xpath('//a[contains(@href,"image")]/img/@src').extract()
Out[27]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

In [28]: response.css('a[href*=image] img::attr(src)').extract()
Out[28]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

# 7: Selectors can be combined with regular expressions; re() returns every match, re_first() returns only the first one
# (the colon needs no escaping, and a raw string avoids Python's invalid-escape warning for '\:')
In [30]: response.css('a::text').re(r'Name:(.*)')
Out[30]:
[' My image 1 ',
 ' My image 2 ',
 ' My image 3 ',
 ' My image 4 ',
 ' My image 5 ']

In [31]: response.css('a::text').re_first(r'Name:(.*)')
Out[31]: ' My image 1 '

In [32]: response.css('a::text').re_first(r'Name:(.*)').strip()  # strip the surrounding whitespace
Out[32]: 'My image 1'
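
# To check what the pattern in step 7 captures without opening a Scrapy shell, the same expression can be run
# with Python's standard `re` module on the link texts from Out[20]. This is only a rough sketch: the helpers
# `re_all` and `re_first` below are hypothetical stand-ins written for illustration, not Scrapy's implementation,
# and they assume a pattern with at most one capture group.

```python
import re

# Link texts as returned by response.css('a::text').extract() in the session above.
link_texts = [
    'Name: My image 1 ',
    'Name: My image 2 ',
    'Name: My image 3 ',
    'Name: My image 4 ',
    'Name: My image 5 ',
]


def re_all(strings, pattern):
    """Hypothetical helper: apply the pattern to every string and collect
    all matches in one flat list (roughly what .re() does)."""
    regex = re.compile(pattern)
    matches = []
    for text in strings:
        # With a single capture group, findall returns the captured text.
        matches.extend(regex.findall(text))
    return matches


def re_first(strings, pattern, default=None):
    """Hypothetical helper: return the first match, or `default` if the
    pattern matches nothing (roughly what .re_first() does)."""
    for match in re_all(strings, pattern):
        return match
    return default


print(re_all(link_texts, r'Name:(.*)'))
print(re_first(link_texts, r'Name:(.*)').strip())
print(repr(re_first(link_texts, r'Price:(.*)', default='')))
```

# The capture group keeps the space after the colon and the trailing space before <br />, which is why the
# session finishes with .strip(). A pattern such as r'Name:\s*(.*?)\s*$' would trim both inside the regex instead.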