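When scraping a website, we can often fetch some pages or APIs without logging in, because they have no login restriction. For large-scale scraping, however, crawling without logging in has drawbacks. Pages that require login cannot be fetched at all, and for pages or APIs that do not require login, frequent requests from one IP will trigger the site's anti-crawling measures. A proxy pool avoids bans by rotating IP addresses, but some sites no longer ban IPs and ban accounts instead. A single account may get banned, so for a large-scale crawler, can we rotate between several accounts the same way a proxy pool rotates IPs? That means maintaining multiple accounts, and this is where a cookies pool comes in: log in as each account, take the cookies obtained from the simulated login, save them to a database, and check periodically that they are still valid.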
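1. Tools
Install the Redis database, the redis-py, requests, selenium and Flask libraries, and the Google Chrome browser together with ChromeDriver. You also need accounts for the site you want to crawl. I bought throwaway Weibo accounts (sellers can be found with a Baidu search; some of these sites are unstable and tend to return 404, and it is best to buy accounts that can log in without a captcha). Here we build a cookies pool for Weibo.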
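2. Implementing the cookies pool
The pool needs the following modules (shown in the figure below):
The storage module stores each account's username and password as well as the cookies belonging to each account, and provides methods for reading and writing that data.
The generator module obtains the cookies after login. It reads usernames and passwords from the database and simulates a login on the target page; if the login succeeds, it saves the resulting cookies to the database.
The tester module periodically checks whether the cookies in the database are still valid. It requests a URL with the cookies: if the response shows a logged-in state the cookies are valid, otherwise they have expired and are deleted, after which the generator module logs in again and produces new cookies.
The API module exposes the pool to other programs through a web API. The more cookies the pool holds the better: each account is used less often, so it is less likely to be detected and banned.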
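Next, implement the storage module. Two things need to be stored: 1. account and password, 2. account and cookies. Both are one-to-one mappings, so a Redis hash is a good fit: a hash stores key-value pairs, which is exactly what we need. This gives two mappings, account -> password and account -> cookies, and in both the key is the account name.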
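To make the two-hash layout concrete, here is a minimal sketch that uses plain dictionaries in place of Redis. The helper name hash_name, the account names and the cookie values are made up for illustration; only the "type:website" naming follows the article.

```python
# Plain dicts stand in for the two Redis hashes, so the layout can be
# inspected without a running Redis server. All values are invented.

def hash_name(kind, website):
    """Build the hash name, e.g. 'accounts:weibo' or 'cookies:weibo'."""
    return "{kind}:{website}".format(kind=kind, website=website)


store = {
    hash_name("accounts", "weibo"): {},   # username -> password
    hash_name("cookies", "weibo"): {},    # username -> cookies (JSON string)
}

store["accounts:weibo"]["user_a"] = "password_a"
store["accounts:weibo"]["user_b"] = "password_b"
store["cookies:weibo"]["user_a"] = '{"SUB": "example-value"}'

# Accounts that still need a login: present in accounts, absent in cookies.
pending = [u for u in store["accounts:weibo"] if u not in store["cookies:weibo"]]
print(pending)
```

Because both hashes share the username as key, finding accounts that still need cookies is a simple key comparison, which is what the generator module relies on.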
import random
import redis

# Redis host
REDIS_HOST = 'localhost'
# Redis port
REDIS_PORT = 6379
# Redis password; use None if there is none
REDIS_PASSWORD = None
class RedisClient(object):
    def __init__(self, type, website, host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD):
        """
        Initialize the Redis connection
        :param type: kind of data stored, 'accounts' or 'cookies'
        :param website: site name, e.g. 'weibo'
        :param host: host
        :param port: port
        :param password: password
        """
        self.db = redis.StrictRedis(host=host, port=port, password=password, decode_responses=True)
        self.type = type
        self.website = website

    def name(self):
        """
        Get the name of the hash
        :return: hash name
        """
        return "{type}:{website}".format(type=self.type, website=self.website)

    def set(self, username, value):
        """
        Set a key-value pair
        :param username: username
        :param value: password or cookies
        :return:
        """
        return self.db.hset(self.name(), username, value)

    def get(self, username):
        """
        Get the value for a key
        :param username: username
        :return:
        """
        return self.db.hget(self.name(), username)

    def delete(self, username):
        """
        Delete a key-value pair by key
        :param username: username
        :return: result of the deletion
        """
        return self.db.hdel(self.name(), username)

    def count(self):
        """
        Get the number of entries
        :return: count
        """
        return self.db.hlen(self.name())

    def random(self):
        """
        Get a random value; used to fetch random cookies
        :return: random cookies
        """
        return random.choice(self.db.hvals(self.name()))

    def usernames(self):
        """
        Get all account names
        :return: all usernames
        """
        return self.db.hkeys(self.name())

    def all(self):
        """
        Get all key-value pairs
        :return: mapping of username to password or cookies
        """
        return self.db.hgetall(self.name())


if __name__ == '__main__':
    conn = RedisClient('accounts', 'weibo')
    result = conn.set('wert', 'ssdsdsf')
    print(result)
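Note the name() method: its return value is the name of the hash that holds the key-value pairs. For example, accounts:weibo stores the accounts and passwords. Calling set() on such a client, as in the __main__ block, is how an account and its password are added to the database.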
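Purchased accounts usually arrive as a text list, so a small importer is convenient. The sketch below is only an illustration: the "username----password" line format, the function names and the FakeClient stand-in are assumptions, not part of the article. Any object with a set(username, value) method, such as RedisClient('accounts', 'weibo'), could be passed instead.

```python
def parse_accounts(text, sep="----"):
    """Parse 'username<sep>password' lines into a dict, skipping bad lines."""
    accounts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or sep not in line:
            continue
        username, password = line.split(sep, 1)
        accounts[username.strip()] = password.strip()
    return accounts


def import_accounts(text, client, sep="----"):
    """Store every parsed account through client.set(); return how many."""
    accounts = parse_accounts(text, sep)
    for username, password in accounts.items():
        client.set(username, password)
    return len(accounts)


class FakeClient:
    """In-memory stand-in for the storage client (hypothetical)."""

    def __init__(self):
        self.data = {}

    def set(self, username, value):
        self.data[username] = value


raw = """
user_a----password_a
broken line without separator
user_b----password_b
"""
client = FakeClient()
imported = import_accounts(raw, client)
print(imported, client.data)
```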
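Implementing the generator module
To obtain the cookies of a logged-in Weibo session we have to log in to Weibo, but the login may require a captcha or phone verification, which is complicated. Fortunately Weibo has three login sites: 1. https://weibo.cn 2. https://m.weibo.cn 3. https://weibo.com
We choose the second site. Its interface resembles the mobile client, and logging in does not require a captcha (provided the purchased account is a captcha-free one). The login page looks like this: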
import json
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from cookiespool.db import RedisClient
from login.weibo.cookies import WeiboCookies

# Browser used by the generator.
# 'PhantomJS' needs an older Selenium release; Selenium 4 dropped PhantomJS, so use 'Chrome' there.
BROWSER_TYPE = 'PhantomJS'


class CookiesGenerator(object):
    def __init__(self, website='default'):
        """
        Base class; initializes the shared objects
        :param website: site name
        """
        self.website = website
        self.cookies_db = RedisClient('cookies', self.website)
        self.accounts_db = RedisClient('accounts', self.website)
        self.init_browser()

    def __del__(self):
        self.close()

    def init_browser(self):
        """
        Initialize the browser used for the simulated login, according to BROWSER_TYPE
        :return:
        """
        if BROWSER_TYPE == 'PhantomJS':
            # Copy so the shared class-level capabilities dict is not modified
            caps = DesiredCapabilities.PHANTOMJS.copy()
            caps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
            self.browser = webdriver.PhantomJS(desired_capabilities=caps)
            self.browser.set_window_size(1400, 500)
        elif BROWSER_TYPE == 'Chrome':
            self.browser = webdriver.Chrome()

    def new_cookies(self, username, password):
        """
        Generate new cookies; subclasses must override this
        :param username: username
        :param password: password
        :return:
        """
        raise NotImplementedError

    def process_cookies(self, cookies):
        """
        Convert the browser's cookie list into a name -> value dict
        :param cookies: list of cookie dicts returned by the browser
        :return: dict of cookie names to values
        """
        result = {}
        for cookie in cookies:
            result[cookie['name']] = cookie['value']
        return result

    def run(self):
        """
        Fetch all accounts, then log in with each one that has no cookies yet
        :return:
        """
        accounts_usernames = self.accounts_db.usernames()
        cookies_usernames = self.cookies_db.usernames()

        for username in accounts_usernames:
            if username not in cookies_usernames:
                password = self.accounts_db.get(username)
                print('Generating cookies', 'account', username, 'password', password)
                result = self.new_cookies(username, password)
                # Login succeeded
                if result.get('status') == 1:
                    cookies = self.process_cookies(result.get('content'))
                    print('Got cookies', cookies)
                    if self.cookies_db.set(username, json.dumps(cookies)):
                        print('Cookies saved')
                # Wrong password: remove the account
                elif result.get('status') == 2:
                    print(result.get('content'))
                    if self.accounts_db.delete(username):
                        print('Account deleted')
                else:
                    print(result.get('content'))
        else:
            # for/else without break: this runs once every account has been processed
            print('All accounts have been processed')

    def close(self):
        """
        Close the browser
        :return:
        """
        try:
            print('Closing Browser')
            self.browser.close()
            del self.browser
        except (TypeError, AttributeError):
            # AttributeError: the browser was never opened or is already closed
            print('Browser not opened')


class WeiboCookiesGenerator(CookiesGenerator):
    def __init__(self, website='weibo'):
        """
        Initialization
        :param website: site name
        """
        CookiesGenerator.__init__(self, website)
        self.website = website

    def new_cookies(self, username, password):
        """
        Generate cookies
        :param username: username
        :param password: password
        :return: dict with the login status and the cookies
        """
        return WeiboCookies(username, password, self.browser).main()
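The conversion done by process_cookies can be tried without a browser. Selenium's get_cookies() returns a list of dicts with at least 'name' and 'value' keys; the sample list below is invented (the cookie names and values are placeholders), and cookies_to_dict is a hypothetical standalone version of the method.

```python
import json


def cookies_to_dict(cookies):
    """Reduce a Selenium-style cookie list to a name -> value dict."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


# Invented sample in the shape returned by browser.get_cookies()
sample = [
    {"name": "SUB", "value": "abc", "domain": ".weibo.cn", "path": "/"},
    {"name": "SSOLoginState", "value": "1500000000", "domain": ".weibo.cn", "path": "/"},
]

stored = json.dumps(cookies_to_dict(sample))   # the string written to Redis
restored = json.loads(stored)                  # what a crawler reads back
print(restored)
```

Only the names and values are kept because that is the shape the requests library accepts for its cookies argument; domain, path and expiry are dropped.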
This part performs the login and determines whether it succeeded:
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class WeiboCookies():
    def __init__(self, username, password, browser):
        self.url = 'https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/'
        self.browser = browser
        self.wait = WebDriverWait(self.browser, 20)
        self.username = username
        self.password = password

    def open(self):
        """
        Open the page, enter the username and password, and click the login button
        :return: None
        """
        self.browser.delete_all_cookies()
        self.browser.get(self.url)
        username = self.wait.until(EC.presence_of_element_located((By.ID, 'loginName')))
        password = self.wait.until(EC.presence_of_element_located((By.ID, 'loginPassword')))
        submit = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginAction')))
        username.send_keys(self.username)
        password.send_keys(self.password)
        time.sleep(1)
        submit.click()

    def password_error(self):
        """
        Check whether the password was rejected
        :return:
        """
        try:
            # The text must match the message shown on the page, which is in Simplified Chinese
            return WebDriverWait(self.browser, 5).until(
                EC.text_to_be_present_in_element((By.ID, 'errorMsg'), '用户名或密码错误'))
        except TimeoutException:
            return False

    def login_successfully(self):
        """
        Check whether the login succeeded
        :return:
        """
        try:
            return bool(
                WebDriverWait(self.browser, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'lite-iconf-profile'))))
        except TimeoutException:
            return False

    def get_cookies(self):
        """
        Get the cookies
        :return:
        """
        return self.browser.get_cookies()

    def main(self):
        """
        Entry point
        :return: dict with 'status' (1 success, 2 wrong password, 3 other failure) and 'content'
        """
        self.open()
        if self.password_error():
            return {
                'status': 2,
                'content': 'Wrong username or password'
            }
        # Logged in directly, no captcha required
        if self.login_successfully():
            cookies = self.get_cookies()
            return {
                'status': 1,
                'content': cookies
            }
        # Neither case matched: return a result so the caller never receives None
        return {
            'status': 3,
            'content': 'Login failed'
        }
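The login result is a small protocol between the login class and the generator: status 1 means cookies were obtained, status 2 means the password was rejected, anything else means the attempt should simply be retried later. The sketch below restates that dispatch over two plain dicts; handle_login_result and the sample data are hypothetical, and it also tolerates a missing result.

```python
def handle_login_result(username, result, accounts, cookies):
    """Apply one login result to dict-like stores and report the action taken."""
    if not result:
        # No result at all (for example, the login routine returned None)
        return "retry"
    status = result.get("status")
    if status == 1:
        # Success: keep only cookie names and values
        cookies[username] = {c["name"]: c["value"] for c in result["content"]}
        return "saved"
    if status == 2:
        # Wrong password: the account is useless, drop it
        accounts.pop(username, None)
        return "removed"
    return "retry"


accounts = {"user_a": "pw_a", "user_b": "pw_b", "user_c": "pw_c"}
cookies = {}

ok = {"status": 1, "content": [{"name": "SUB", "value": "abc"}]}
bad = {"status": 2, "content": "Wrong username or password"}

actions = [
    handle_login_result("user_a", ok, accounts, cookies),
    handle_login_result("user_b", bad, accounts, cookies),
    handle_login_result("user_c", None, accounts, cookies),
]
print(actions)
```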
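Tester module: once the cookies have been obtained, their validity still has to be checked. We request a test URL with the cookies and judge validity from the status code of the response.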
import json
import requests
from requests.exceptions import ConnectionError
from cookiespool.db import *
# TEST_URL_MAP (site name -> test URL) must also be available here; it is part of the project configuration


class ValidTester(object):
    def __init__(self, website='default'):
        self.website = website
        self.cookies_db = RedisClient('cookies', self.website)
        self.accounts_db = RedisClient('accounts', self.website)

    def test(self, username, cookies):
        raise NotImplementedError

    def run(self):
        cookies_groups = self.cookies_db.all()
        for username, cookies in cookies_groups.items():
            self.test(username, cookies)


class WeiboValidTester(ValidTester):
    def __init__(self, website='weibo'):
        ValidTester.__init__(self, website)

    def test(self, username, cookies):
        print('Testing cookies', 'username', username)
        try:
            cookies = json.loads(cookies)
        except (TypeError, ValueError):
            # TypeError: value is not a string; ValueError: string is not valid JSON
            print('Invalid cookies', username)
            self.cookies_db.delete(username)
            print('Deleted cookies', username)
            return
        try:
            test_url = TEST_URL_MAP[self.website]
            # Redirects are disabled so that a redirect to the login page is not reported as 200
            response = requests.get(test_url, cookies=cookies, timeout=5, allow_redirects=False)
            if response.status_code == 200:
                print('Cookies valid', username)
            else:
                print(response.status_code, response.headers)
                print('Cookies expired', username)
                self.cookies_db.delete(username)
                print('Deleted cookies', username)
        except ConnectionError as e:
            print('Exception occurred', e.args)
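The decision rule of the tester can be isolated from the network. In this sketch, check_cookies and fake_fetch are hypothetical names: fetch stands for "send a request with these cookies, redirects disabled, and return the status code", which in the real tester is done with requests.get. The rule that only 200 counts as valid follows the article's code.

```python
import json


def check_cookies(raw, fetch):
    """Classify one stored cookies string as 'malformed', 'valid' or 'expired'."""
    try:
        cookies = json.loads(raw)
    except (TypeError, ValueError):
        return "malformed"
    if not isinstance(cookies, dict):
        return "malformed"
    # Only a direct 200 counts; a redirect usually leads to the login page
    return "valid" if fetch(cookies) == 200 else "expired"


def fake_fetch(cookies):
    """Stand-in for a real HTTP request: accepts one made-up session value."""
    return 200 if cookies.get("SUB") == "good" else 302


results = {
    "fresh": check_cookies('{"SUB": "good"}', fake_fetch),
    "stale": check_cookies('{"SUB": "old"}', fake_fetch),
    "broken": check_cookies("not json", fake_fetch),
    "missing": check_cookies(None, fake_fetch),
}
print(results)
```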
The scheduler module drives the other modules:
import time
from multiprocessing import Process
from cookiespool.api import app
from cookiespool.generator import *
from cookiespool.tester import *

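# Generator classes; to support other sites, register them here
GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator'
}
# Tester classes; to support other sites, register them here
TESTER_MAP = {
    'weibo': 'WeiboValidTester'
}
TEST_URL_MAP = {
    'weibo': 'https://m.weibo.cn/'
}
# Cycle of the generator and the tester, in seconds
CYCLE = 120
# Generator switch: simulate logins and add cookies
GENERATOR_PROCESS = True
# Tester switch: repeatedly check the cookies in the database and delete unusable ones
VALID_PROCESS = True
# API service switch
API_PROCESS = True
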
class Scheduler(object):
    @staticmethod
    def valid_cookie(cycle=CYCLE):
        while True:
            print('Cookies tester process started')
            try:
                for website, cls in TESTER_MAP.items():
                    # Instantiate the class registered under this name
                    tester = eval(cls + '(website="' + website + '")')
                    tester.run()
                    print('Cookies check finished')
                    del tester
                    time.sleep(cycle)
            except Exception as e:
                print(e.args)

    @staticmethod
    def generate_cookie(cycle=CYCLE):
        while True:
            print('Cookies generator process started')
            try:
                for website, cls in GENERATOR_MAP.items():
                    generator = eval(cls + '(website="' + website + '")')
                    generator.run()
                    print('Cookies generation finished')
                    generator.close()
                    time.sleep(cycle)
            except Exception as e:
                print(e.args)

    @staticmethod
    def api():
        print('API service started')
        # API_HOST and API_PORT are the values defined with the API module
        app.run(host=API_HOST, port=API_PORT)

    def run(self):
        # Each enabled component runs in its own process
        if API_PROCESS:
            api_process = Process(target=Scheduler.api)
            api_process.start()

        if GENERATOR_PROCESS:
            generate_process = Process(target=Scheduler.generate_cookie)
            generate_process.start()

        if VALID_PROCESS:
            valid_process = Process(target=Scheduler.valid_cookie)
            valid_process.start()
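The scheduler looks classes up by name and instantiates them with eval. A common alternative, shown here only as a sketch and not as the article's method, is to map site names directly to class objects so that no string is evaluated. DemoTester, TESTER_CLASSES and run_all are hypothetical names.

```python
class DemoTester:
    """Stand-in for a real tester class such as a Weibo cookies tester."""

    def __init__(self, website):
        self.website = website

    def run(self):
        return "tested " + self.website


# Site name -> class object (not a class-name string)
TESTER_CLASSES = {
    "weibo": DemoTester,
}


def run_all(registry):
    """Instantiate and run the class registered for every site."""
    results = []
    for website, cls in registry.items():
        instance = cls(website=website)
        results.append(instance.run())
    return results


outcome = run_all(TESTER_CLASSES)
print(outcome)
```

With class objects in the mapping, a misspelled class name fails at import time rather than in the middle of a scheduling loop.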
API module
import json
from flask import Flask, g
from cookiespool.db import *

# API host and port
API_HOST = '127.0.0.1'
API_PORT = 5000

__all__ = ['app']

app = Flask(__name__)


@app.route('/')
def index():
    return '<h2>Welcome to Cookie Pool System</h2>'


def get_conn():
    """
    Get the Redis clients, creating them on first use within the request
    :return: the Flask g object carrying the clients
    """
    # GENERATOR_MAP is the site mapping from the configuration and must be available in this module
    for website in GENERATOR_MAP:
        if not hasattr(g, website + '_cookies'):
            setattr(g, website + '_cookies', RedisClient('cookies', website))
            setattr(g, website + '_accounts', RedisClient('accounts', website))
    return g


@app.route('/<website>/random')
def random(website):
    """
    Get random cookies, e.g. /weibo/random
    :return: random cookies
    """
    g = get_conn()
    cookies = getattr(g, website + '_cookies').random()
    return cookies


@app.route('/<website>/add/<username>/<password>')
def add(website, username, password):
    """
    Add an account, e.g. /weibo/add/user/password
    :param website: site
    :param username: username
    :param password: password
    :return:
    """
    g = get_conn()
    print(username, password)
    getattr(g, website + '_accounts').set(username, password)
    return json.dumps({'status': '1'})


@app.route('/<website>/count')
def count(website):
    """
    Get the total number of cookies
    """
    g = get_conn()
    count = getattr(g, website + '_cookies').count()
    return json.dumps({'status': '1', 'count': count})


if __name__ == '__main__':
    app.run(host='0.0.0.0')
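A crawler consumes the pool by calling the /<website>/random route and decoding the returned JSON string. The following client is a sketch under assumptions: random_cookies and fake_fetch are hypothetical names, and the base URL simply combines the API_HOST and API_PORT values above. The fetch parameter exists so the function can be exercised without a running server.

```python
import json
from urllib.request import urlopen


def random_cookies(base_url, website, fetch=None):
    """Request one random cookies entry from the pool and decode it to a dict."""
    url = "{}/{}/random".format(base_url.rstrip("/"), website)
    if fetch is None:
        def fetch(target):
            # Default: a real HTTP GET against the running pool API
            with urlopen(target, timeout=5) as resp:
                return resp.read().decode("utf-8")
    return json.loads(fetch(url))


requested = []


def fake_fetch(url):
    """Stand-in for the HTTP call; records the URL and returns canned JSON."""
    requested.append(url)
    return '{"SUB": "example-value"}'


cookies = random_cookies("http://127.0.0.1:5000/", "weibo", fetch=fake_fetch)
print(requested[0], cookies)

# Against a running pool, the returned dict could be passed on directly, e.g.
# requests.get('https://m.weibo.cn/', cookies=random_cookies('http://127.0.0.1:5000', 'weibo'))
```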
The result of running it is shown below.