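In the previous post I crawled WeChat articles through a proxy pool maintained in Redis. It ran well at first, but later runs kept returning 501 errors, even though the same pages opened normally in a browser. The error persisted after many rounds of debugging, so I switched approach: use Selenium to simulate a browser and fetch the Sogou WeChat search pages. Once a page has been fetched, it is still parsed with the pyquery library, which yields the URLs of the WeChat articles. Those URLs are then used to fetch the articles themselves.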
The code is as follows:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from weixin.weixin.weixin_article import WeixinArticle
from requests.exceptions import ConnectionError
from pyquery import PyQuery as pq

CHROMEDRIVER_PATH = "C:/codeapp/seleniumDriver/chrome/chromedriver.exe"


class SeleniumWeixinArticle(WeixinArticle):
    """Fetch Sogou WeChat search pages with a Selenium-driven browser. Inherits from WeixinArticle."""

    proxy = None

    def __init__(self):
        """Start the browser and set up the explicit wait."""
        self.browser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH)
        self.wait = WebDriverWait(self.browser, 10)
        super(SeleniumWeixinArticle, self).__init__()

    def use_proxy(self, proxy):
        """Restart Chrome so that its traffic goes through the given proxy."""
        self.proxy = proxy
        chrome_options = webdriver.ChromeOptions()
        # Switch IP: the proxy is a launch argument, so a new browser instance is needed
        chrome_options.add_argument('--proxy-server=http://' + proxy)
        self.browser.quit()
        self.browser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH,
                                        chrome_options=chrome_options)
        self.wait = WebDriverWait(self.browser, 10)

    def get_html(self, url, count=1):
        """Override WeixinArticle.get_html: fetch the Sogou WeChat search page with Selenium."""
        if not url:
            return None
        # Recurse at most max_count times to prevent infinite recursion
        if count >= self.max_count:
            print("tried too many times")
            return None
        print('crawling url ', url)
        print('crawling count ', count)
        try:
            # get() returns None; read the HTML from browser.page_source
            self.browser.get(url)
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#sogou_next")))
            if self.browser.current_url == url:
                return self.browser.page_source
            # The browser ended up on a different URL: switch to another proxy and retry
            print("must change ip proxy")
            proxy = self.get_proxy(self.proxy_pool_url)
            if proxy:
                self.use_proxy(proxy)
                return self.get_html(url, count + 1)
            print("failed to get a proxy")
            return None
        except (ConnectionError, WebDriverException):
            # Selenium errors (including wait timeouts) derive from WebDriverException
            proxy = self.get_proxy(self.proxy_pool_url)
            if proxy:
                self.use_proxy(proxy)
            return self.get_html(url, count + 1)


if __name__ == "__main__":
    weixin_article = SeleniumWeixinArticle()
    weixin_article.run()
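The program is fairly simple. It mainly overrides the get_html method of WeixinArticle and leaves the rest of the logic unchanged, which is one of the benefits of object-oriented programming.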
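To make the override idea concrete, here is a minimal sketch of the pattern, not the actual WeixinArticle code. The class names BaseArticle and BrowserArticle, the parse method, and the returned strings are all hypothetical stand-ins chosen for illustration. The point is only that the workflow in run calls self.get_html, so a subclass can replace the fetch step without touching anything else.

```python
class BaseArticle:
    """Hypothetical stand-in for WeixinArticle: owns the crawl workflow."""

    def get_html(self, url):
        # Default fetch strategy (in the earlier post: requests plus a proxy pool).
        return "requests:" + url

    def parse(self, html):
        # Placeholder for the pyquery parsing step.
        return {"source": html}

    def run(self, url):
        # The workflow only calls self.get_html, so subclasses can swap the fetch step.
        return self.parse(self.get_html(url))


class BrowserArticle(BaseArticle):
    """Hypothetical stand-in for SeleniumWeixinArticle: overrides only the fetch step."""

    def get_html(self, url):
        # A real implementation would drive a browser here.
        return "browser:" + url


if __name__ == "__main__":
    print(BaseArticle().run("page"))
    print(BrowserArticle().run("page"))
```

Because run and parse are inherited unchanged, switching from the requests-based fetch to the browser-based one is only a matter of which class gets instantiated.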
The structure of the program is as follows:

