Two ways to extract links from a web page in a Scrapy crawler, and how to construct an HtmlResponse object

This article is a repost. 2020-02-12 22:00

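Some notes on the Response object:

The Response object describes an HTTP response. Response is only a base class; depending on the kind of response, it has the following subclasses:

TextResponse, HtmlResponse, XmlResponse

Only HtmlResponse is used as the example here. HtmlResponse (itself a subclass of TextResponse, as is XmlResponse) provides many methods on top of the base class Response, such as xpath and css.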

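1. Using Selector

Links are also data in the page, so they can be extracted with the same methods used to extract any other data. When analysing a page, you can build a Selector object in jupyter notebook and explore the page with it (a Selector object has xpath and css methods).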

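    import requests
    from scrapy.selector import Selector

    res = requests.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

    # Selector's response argument is meant for Scrapy response objects,
    # so pass the page text from requests through the text argument instead.
    selector = Selector(text=res.text)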

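2. Using the linkextractors module of the Scrapy framework

See the relevant documentation for usage.

1. In le.extract_links(response), where le is a LinkExtractor instance, response refers to an HtmlResponse.

2. Constructing an HtmlResponse: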

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
import requests

# First fetch the page with requests, then build an HtmlResponse from the result,
# so that it can be passed to the link extractor.

url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
ResStack = requests.get(url)

# A str body is encoded with the given encoding when the response is built.
res = HtmlResponse(url=url, body=ResStack.text, encoding="utf-8")

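Notes: 1. HtmlResponse accepts a number of constructor parameters (for example url, status, headers, body and encoding); consult a reference for how to use each of them.

2. HtmlResponse also offers several methods and attributes, such as the css and xpath methods and the text attribute, so it can likewise be used to analyse pages in jupyter notebook. It can also be passed to a LinkExtractor to extract links, which is more convenient.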
