Scrapy + Selenium in Practice: Scraping Mafengwo


  • I had just started learning to scrape dynamic pages with Selenium and wanted a hands-on project, so I set out to scrape every tourist attraction in Chongqing from the Mafengwo travel site. I expected it to be fairly easy, but plenty of problems came up along the way: rewriting the downloader middleware, attaching cookies, Selenium still returning the previous page's data after navigating to the next page, choosing an element-location strategy, following links to detail pages, Selenium action chains, and more. It was frustrating, but I stuck with it and finished the project in three days. Below is a record of the problems I hit and how I solved them.
  • Dynamic page loading
    • First, determine whether the page's data is loaded dynamically: right-click the page and view the page source, then press Ctrl+F to search for the data you want to scrape. If it is not in the source, the page is dynamically loaded and you need Selenium to scrape it. In that case, initialize a webdriver object in the spider's __init__ constructor.
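    • The manual Ctrl+F check above can be expressed as a tiny helper: if the text you see in the browser is absent from the raw (pre-JavaScript) HTML, the page is almost certainly rendered dynamically. A minimal sketch; the sample HTML and the function name are illustrative, not from the original project:

      ```python
      def looks_dynamic(raw_html: str, expected_text: str) -> bool:
          """True if expected_text is absent from the raw (pre-JavaScript) HTML."""
          return expected_text not in raw_html

      # An empty app container with no visible data: typical of a dynamically loaded page.
      raw = "<html><body><div id='app'></div></body></html>"
      print(looks_dynamic(raw, "洪崖洞"))  # True -> you need Selenium
      ```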
    • from selenium import webdriver
      from selenium.webdriver import ChromeOptions  # helps avoid automation detection
      from selenium.webdriver.common.by import By

      # set up the browser
      option = ChromeOptions()
      option.add_argument('--ignore-certificate-errors')
      option.add_argument('--ignore-ssl-errors')
      option.add_experimental_option('excludeSwitches', ['enable-automation', 'enable-logging'])
      bro = webdriver.Chrome(executable_path=r'path/to/chromedriver.exe', options=option)  # right-click chromedriver.exe -> Copy Path
      bro.get(url)
      # locate elements
      bro.find_element(By.<strategy>, value)
      # e.g. bro.find_element(By.LINK_TEXT, 'text')
      # actions
      bro.find_element(By.<strategy>, value).click()
      # close the browser
      bro.quit()
      
          def __init__(self):
              # Set cookies so the account does not get blocked. COOKIES_ENABLED = True must be set
              # in settings, and process_request in the downloader middleware must be overridden to
              # attach the cookie, or this will not take effect.
              self.headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
                            'Cookie': 'SECKEY_ABVK=hUAgUzjagDt7tRAoeBixuHARq3o5gtYSbMcKcAkM2Ho%3D; BMAP_SECKEY=adS1Ht6D0s1kWECRhDaf4vSf6OhvVYklxDSAiZ_3W0fIGZJ8rWr9TbzVPPYVaIW5ObgotD3EzPQrdL2XdiXldciYniNJWqvUHZ8Wk_ri0IuuKOY9h0aB4i09OHC30d-kbWCSrrEQe40grf1Gj9izw6SGB5cmzIjIenxaZzpq8lmEDDU5Kvl7gAMUQauc7TUC; mfw_uuid=62357780-7ac5-d4dc-9a8e-5a02aa298353; _r=baidu; _rp=a%3A2%3A%7Bs%3A1%3A%22p%22%3Bs%3A18%3A%22www.baidu.com%2Flink%22%3Bs%3A1%3A%22t%22%3Bi%3A1647671168%3B%7D; oad_n=a%3A5%3A%7Bs%3A5%3A%22refer%22%3Bs%3A21%3A%22https%3A%2F%2Fwww.baidu.com%22%3Bs%3A2%3A%22hp%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A3%3A%22oid%22%3Bi%3A1026%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222022-03-19+14%3A26%3A08%22%3B%7D; __jsluid_h=01784e2b1c452421aa25034fbbde3ed9; __mfwothchid=referrer%7Cwww.baidu.com; __omc_chl=; __mfwc=referrer%7Cwww.baidu.com; uva=s%3A307%3A%22a%3A4%3A%7Bs%3A13%3A%22host_pre_time%22%3Bs%3A10%3A%222022-03-19%22%3Bs%3A2%3A%22lt%22%3Bi%3A1647671169%3Bs%3A10%3A%22last_refer%22%3Bs%3A180%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DKZtwUSmw3x4cyZcTJdfrzYa8Pr4pEgDbvJU1Pv7yOxRPpeRIeoKj_rydoZuVdCf0_IXBx40vQyB-xiuXsf_AyQ1y3t3mO4En4c5USvOZ_ya%26wd%3D%26eqid%3Df0f5303d000ee843000000036235777b%22%3Bs%3A5%3A%22rhost%22%3Bs%3A13%3A%22www.baidu.com%22%3B%7D%22%3B; __mfwurd=a%3A3%3A%7Bs%3A6%3A%22f_time%22%3Bi%3A1647671169%3Bs%3A9%3A%22f_rdomain%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A6%3A%22f_host%22%3Bs%3A3%3A%22www%22%3B%7D; __mfwuuid=62357780-7ac5-d4dc-9a8e-5a02aa298353; UM_distinctid=17fa0dad264324-08a77f4b1aeaeb-9771539-e1000-17fa0dad26543a; __omc_r=; PHPSESSID=cdbbvncvrd5rqepai636p6pos7; Hm_lvt_8288b2ed37e5bc9b4c9f7008798d2de0=1647743622,1647851451,1647908302,1648001811; bottom_ad_status=0; __jsl_clearance=1648007349.914|0|yoFqmnWY6O7Msv1j5KemUKE3POE%3D; __mfwa=1647671168836.14813.17.1648001810109.1648007353609; CNZZDATA30065558=cnzz_eid%3D2058254581-1647670157-null%26ntime%3D1648003704; __mfwb=b20cf490195f.2.direct; 
__mfwlv=1648009103; __mfwvn=13; __mfwlt=1648009103; Hm_lpvt_8288b2ed37e5bc9b4c9f7008798d2de0=1648009104; ariaDefaultTheme=undefined'}
      
              option = ChromeOptions()
              option.add_argument('--ignore-certificate-errors')
              option.add_argument('--ignore-ssl-errors')
              option.add_experimental_option('excludeSwitches', ['enable-automation', 'enable-logging'])
              self.bro = webdriver.Chrome(executable_path=r'E:/爬蟲/vocation/vocation/spiders/chromedriver.exe',options=option)
              self.item = VocationItem()

       

  • Following links to detail pages
    • This project scrapes the following fields for each attraction: name, description, phone, visit duration, ticket price, opening hours, and transportation. Every field except the name requires clicking the title on the main page and jumping to a detail page, so a callback function is needed. The main page's data is dynamically loaded while the detail pages' data is not, so the downloader middleware must be overridden to process the responses of these two kinds of requests differently.
    • On the main page, grab each attraction's title and the href of its detail page; once you have the href, invoke the callback to scrape the detail page. Be sure to add a wait, or you will act on the page before its data has loaded and get errors.
    • For element location, avoid XPath here: the page refreshes after every pagination click and some tags change with it, causing errors such as selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element. Even without an error, the extracted data may fail to update after the page does, so you keep scraping the first page. Instead, locate by CLASS_NAME to get a WebElement object and work with its attributes.
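    • The stale-data problem above boils down to caching a lookup across a page refresh. The stand-in class below is purely hypothetical (no Selenium involved); it only illustrates why elements must be re-located after every "next page" click rather than reused:

      ```python
      class FakePaginatedPage:
          """Hypothetical stand-in for a page that re-renders on each pagination click."""
          def __init__(self):
              self.page = 1

          def find_titles(self):
              # A fresh lookup always reflects the page currently shown.
              return [f"spot-{self.page}-{i}" for i in range(3)]

          def click_next(self):
              self.page += 1

      page = FakePaginatedPage()
      cached = page.find_titles()   # looked up on page 1
      page.click_next()             # the page refreshes
      fresh = page.find_titles()    # re-located: now page 2 data
      print(cached[0], fresh[0])    # spot-1-0 spot-2-0
      ```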
    • WebElement object attributes
      # import the modules
      from selenium import webdriver
      from selenium.webdriver.common.by import By
      # create a browser object
      driver = webdriver.Firefox()
      # visit the URL
      url = "https://www.douban.com/"
      driver.get(url)
      # locate Douban's search box by name
      elem = driver.find_element(By.NAME, "q")
      # print the element object
      print(elem)
      # print its tag name
      print(elem.tag_name)
      # print the element's parent
      print(elem.parent)
      # print one of the element's attribute values
      print(elem.get_attribute('type'))
      # locate the "豆瓣7.0" text by XPath
      elem_1 = driver.find_element('xpath', '//p[@class="app-title"]')
      # print the element's text content
      print(elem_1.text)
      # quit the browser
      driver.quit()
      
      tag_name returns the element's tag name
      parent returns the WebDriver instance the element was found from (not the parent DOM node)
      get_attribute('type') returns the value of the named attribute; the name in the parentheses can be changed
      text returns the element's text content
    • WebElement object operations
      from selenium import webdriver
      from selenium.webdriver.common.by import By
      import time

      def test_start_selenium():
          # open the browser driver and visit Baidu
          driver = webdriver.Firefox()
          url = "https://www.baidu.com/"
          driver.get(url)
          input_el = driver.find_element(By.ID, "kw")
          time.sleep(3)
          # type some text
          input_el.send_keys("老友記")
          # click the "百度一下" (search) button
          input_e2 = driver.find_element('xpath', '//input[@type="submit"]')
          input_e2.click()
          time.sleep(3)
          # clear the text typed into the search box
          input_el.clear()
          time.sleep(3)
          input_el.send_keys("西游記")
          time.sleep(3)
          # submit
          input_el.submit()
          driver.quit()

      test_start_selenium()
      
      send_keys(""): type text into the element
      click(): click the element
      clear(): clear the input
      submit(): submit the form
       
  • Writing the callback function
    • A 521 error in the callback (<521.......>HTTP status code is not handled or not allowed) means the request was sent without cookies, or the cookies have expired and must be refreshed. To get the page's cookie: open the browser's dev tools on the target page (Fn+F12), go to Network -> Doc -> Cookie, then copy the value into headers in the __init__ constructor.
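    • The Cookie header copied from dev tools is one long '; '-separated string, but Scrapy's request.cookies expects a dict. A small helper (the function name is mine) for the conversion; each pair is split on its first '=' only, so values that themselves contain '=' survive:

      ```python
      def cookie_header_to_dict(cookie_header: str) -> dict:
          # Split pairs on '; ', then each pair on its first '=' only.
          return dict(pair.split('=', 1) for pair in cookie_header.split('; '))

      # A shortened, made-up sample in the same shape as the real header.
      raw = "mfw_uuid=62357780; PHPSESSID=cdbbvncv; _rp=a%3A2%3A%7Bs%3D"
      print(cookie_header_to_dict(raw)['PHPSESSID'])  # cdbbvncv
      ```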
    • def parse(self, response):
          for i in range(20):
              title_list = str(self.bro.find_element(By.CLASS_NAME, 'scenic-list').text)
              title = title_list.split('\n')
              for j in range(15):
                  self.item['title'] = title[j]
                  detail_url = self.bro.find_element(By.LINK_TEXT, title[j]).get_attribute('href')
                  # the item must also be passed along as a parameter
                  yield scrapy.Request(str(detail_url), callback=self.detail_parse, meta=self.item, headers=self.headers)
              # click "next page" only after all 15 spots on the current page are handled
              self.bro.find_element(By.CLASS_NAME, 'pg-next').click()
              sleep(2)
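    • parse() above reads the whole list container's .text in one go and splits it on newlines to recover the individual titles. Offline, with an invented stand-in for what .text returns, the slicing looks like this:

      ```python
      # Stand-in for self.bro.find_element(By.CLASS_NAME, 'scenic-list').text;
      # the spot names are examples, not scraped data.
      list_text = "洪崖洞\n磁器口古鎮\n武隆喀斯特"
      titles = list_text.split('\n')
      print(titles)     # ['洪崖洞', '磁器口古鎮', '武隆喀斯特']
      print(titles[0])  # 洪崖洞
      ```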

       

  • Overriding the downloader middleware
    • In this project the main page's data is loaded dynamically, so its response must be processed: render the data into the page with Selenium and return the rendered result. That means overriding process_response in the downloader middleware. process_request must also be overridden to attach the request's cookie.
    • After overriding the middleware, update the corresponding settings and uncomment them:
      COOKIES_ENABLED = True
      

      DOWNLOADER_MIDDLEWARES = {
         'vocation.middlewares.VocationDownloaderMiddleware': 543,
      }
      

      # needs: from scrapy.http import HtmlResponse  and  from time import sleep
      class VocationDownloaderMiddleware(object):
          def process_request(self, request, spider):
              # attach the cookie copied from the browser to every outgoing request
              Cookie='SECKEY_ABVK=hUAgUzjagDt7tRAoeBixuHARq3o5gtYSbMcKcAkM2Ho%3D; BMAP_SECKEY=adS1Ht6D0s1kWECRhDaf4vSf6OhvVYklxDSAiZ_3W0fIGZJ8rWr9TbzVPPYVaIW5ObgotD3EzPQrdL2XdiXldciYniNJWqvUHZ8Wk_ri0IuuKOY9h0aB4i09OHC30d-kbWCSrrEQe40grf1Gj9izw6SGB5cmzIjIenxaZzpq8lmEDDU5Kvl7gAMUQauc7TUC; mfw_uuid=62357780-7ac5-d4dc-9a8e-5a02aa298353; _r=baidu; _rp=a%3A2%3A%7Bs%3A1%3A%22p%22%3Bs%3A18%3A%22www.baidu.com%2Flink%22%3Bs%3A1%3A%22t%22%3Bi%3A1647671168%3B%7D; oad_n=a%3A5%3A%7Bs%3A5%3A%22refer%22%3Bs%3A21%3A%22https%3A%2F%2Fwww.baidu.com%22%3Bs%3A2%3A%22hp%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A3%3A%22oid%22%3Bi%3A1026%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222022-03-19+14%3A26%3A08%22%3B%7D; __jsluid_h=01784e2b1c452421aa25034fbbde3ed9; __mfwothchid=referrer%7Cwww.baidu.com; __omc_chl=; __mfwc=referrer%7Cwww.baidu.com; uva=s%3A307%3A%22a%3A4%3A%7Bs%3A13%3A%22host_pre_time%22%3Bs%3A10%3A%222022-03-19%22%3Bs%3A2%3A%22lt%22%3Bi%3A1647671169%3Bs%3A10%3A%22last_refer%22%3Bs%3A180%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DKZtwUSmw3x4cyZcTJdfrzYa8Pr4pEgDbvJU1Pv7yOxRPpeRIeoKj_rydoZuVdCf0_IXBx40vQyB-xiuXsf_AyQ1y3t3mO4En4c5USvOZ_ya%26wd%3D%26eqid%3Df0f5303d000ee843000000036235777b%22%3Bs%3A5%3A%22rhost%22%3Bs%3A13%3A%22www.baidu.com%22%3B%7D%22%3B; __mfwurd=a%3A3%3A%7Bs%3A6%3A%22f_time%22%3Bi%3A1647671169%3Bs%3A9%3A%22f_rdomain%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A6%3A%22f_host%22%3Bs%3A3%3A%22www%22%3B%7D; __mfwuuid=62357780-7ac5-d4dc-9a8e-5a02aa298353; UM_distinctid=17fa0dad264324-08a77f4b1aeaeb-9771539-e1000-17fa0dad26543a; __omc_r=; PHPSESSID=cdbbvncvrd5rqepai636p6pos7; Hm_lvt_8288b2ed37e5bc9b4c9f7008798d2de0=1647743622,1647851451,1647908302,1648001811; bottom_ad_status=0; __jsl_clearance=1648007349.914|0|yoFqmnWY6O7Msv1j5KemUKE3POE%3D; __mfwa=1647671168836.14813.17.1648001810109.1648007353609; CNZZDATA30065558=cnzz_eid%3D2058254581-1647670157-null%26ntime%3D1648003704; __mfwb=b20cf490195f.2.direct; __mfwlv=1648009103; 
__mfwvn=13; __mfwlt=1648009103; Hm_lpvt_8288b2ed37e5bc9b4c9f7008798d2de0=1648009104; ariaDefaultTheme=undefined'
              # split on '; ' and on the first '=' only, since cookie values may contain '='
              cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in Cookie.split('; ')}
              request.cookies = cookies
              return None
      
      
          def process_response(self, request, response, spider):
              bro = spider.bro
              # only the dynamically loaded main pages need a Selenium-rendered response
              if request.url in spider.start_urls:
                  bro.get(request.url)
                  sleep(2)  # wait for the page to finish rendering
                  page_text = bro.page_source
                  # wrap the rendered source in a new HtmlResponse for the spider
                  new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
                  return new_response
              else:
                  return response
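    • The branching in process_response can be summarized independently of Scrapy: only requests for the dynamically loaded start URLs get a freshly rendered body, everything else passes through untouched. A toy sketch of that routing (all names and URLs below are invented):

      ```python
      def route_response(request_url, start_urls, render, original_body):
          """Return a rendered body for start URLs, the original body otherwise."""
          if request_url in start_urls:
              return render(request_url)  # e.g. Selenium's page_source after waiting
          return original_body

      start_urls = ["https://www.mafengwo.cn/example-start-page"]
      fake_render = lambda url: "<html>rendered by selenium</html>"
      print(route_response(start_urls[0], start_urls, fake_render, "raw"))  # <html>rendered by selenium</html>
      print(route_response("https://www.mafengwo.cn/example-detail", start_urls, fake_render, "raw"))  # raw
      ```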
      

       

  • Writing the detail-page handler
    • Some fields may be empty, which shifts the extracted values out of place, so prefer locating elements by tag attribute values rather than by position.

      def detail_parse(self, response):
          item = response.meta
          item['introduction'] = response.xpath('/html/body/div[2]/div[3]/div[2]/div[1]/text()').get().strip()
          item['phone'] = response.xpath('/html/body/div[2]/div[3]/div[2]/ul/li[@class="tel"]/div[2]/text()').get()
          # prefer attribute-based predicates over purely positional ones
          item['time'] = response.xpath('/html/body/div[2]/div[3]/div[2]/ul/li[@class="item-time"]/div[@class="content"]/text()').get()
          item['traffic'] = response.xpath('/html/body/div[2]/div[3]/div[2]/dl[1]/dd/text()').get()
          item['ticket'] = response.xpath('/html/body/div[2]/div[3]/div[2]/dl[2]/dd/div/text()').get()
          item['open_time'] = response.xpath('/html/body/div[2]/div[3]/div[2]/dl[3]/dd/text()').get()
          yield item
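    • The "prefer attribute predicates" advice is easy to check with the standard library's limited XPath support. In the invented snippet below, which mimics the detail page's field list, the phone row is still addressed correctly by its class even though the row order differs:

      ```python
      import xml.etree.ElementTree as ET

      # Invented markup in the shape of the detail page's field list.
      doc = ET.fromstring(
          "<ul>"
          "<li class='item-time'><div class='content'>2 hours</div></li>"
          "<li class='tel'><div>Phone</div><div>023-12345678</div></li>"
          "</ul>"
      )
      # Attribute-based predicates survive reordered or missing rows.
      phone = doc.find(".//li[@class='tel']/div[2]").text
      duration = doc.find(".//li[@class='item-time']/div[@class='content']").text
      print(phone, duration)  # 023-12345678 2 hours
      ```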

       

  • Closing the browser

      def closed(self, reason):
          # quit the Selenium browser when the spider closes
          self.bro.quit()
      

       

  • That wraps up the whole experiment. Looking back, the problems do not seem like much, but it is genuinely demoralizing to be stuck on an issue and unable to find a solution anywhere online.

