Building a Cookie Pool


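When scraping a website, we can often fetch some pages or APIs without logging in. For large-scale crawling, however, crawling without a login has drawbacks. Pages that require a login cannot be fetched at all, and for pages or APIs that do not require one, frequent requests from the same IP will trigger the site's anti-crawling measures. A proxy pool avoids bans by rotating IP addresses, but some sites no longer ban IPs and ban accounts instead. So what do we do for a large-scale crawler when any single account may be banned? Just as a proxy pool supplies different IPs, rotating requests across multiple accounts lowers the chance that any one of them gets banned. That means maintaining multiple accounts, and this is where a Cookies pool comes in: log in with each account, save the resulting Cookies to a database, and check their validity periodically.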

1. Required tools

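 Install the Redis database and the redis-py, requests, selenium and Flask libraries, plus Google Chrome with a matching ChromeDriver. You also need accounts for the site you want to crawl; I bought some throwaway Weibo accounts (search Baidu for account sellers; these sites come and go and may return 404, and it is best to buy accounts that can log in without a CAPTCHA). Here we build a Cookies pool for Weibo.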

2. Implementing the Cookies pool

 The pool needs the following modules:

 

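 The storage module stores each account's username and password, as well as the Cookies that belong to each account, and provides methods for reading and writing this data.

 The generation module obtains Cookies after login. It reads usernames and passwords from the database and simulates a login to the target page; if the login succeeds, it saves the resulting Cookies to the database.

 The detection module periodically checks whether the Cookies in the database are still valid. It sends a request with the Cookies to a test URL; if the response shows a logged-in state the Cookies are valid, otherwise they are deleted and the generation module logs in again to regenerate them.

 The API module exposes the pool to crawlers through a web API. The more Cookies the pool holds, the less often each account is used, so the less likely it is to be detected and banned.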
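
The way the storage, generation and detection modules cooperate can be sketched without Redis or a browser. This is only an illustration under stated assumptions: the two dicts stand in for the Redis hashes, and fake_login and the sample accounts are hypothetical, not part of the project.

```python
import json

# In-memory stand-ins for the two Redis hashes (hypothetical sample data).
accounts = {"user_a": "pw_a", "user_b": "pw_b"}  # username -> password
cookies = {}                                      # username -> Cookies as JSON


def fake_login(username, password):
    """Hypothetical login: pretend every account logs in successfully."""
    return {"session": "token-for-" + username}


def generate():
    """Generation step: log in every account that has no Cookies yet."""
    for username, password in accounts.items():
        if username not in cookies:
            cookies[username] = json.dumps(fake_login(username, password))


def validate(is_valid):
    """Detection step: delete Cookies that no longer pass the check."""
    for username in list(cookies):
        if not is_valid(json.loads(cookies[username])):
            del cookies[username]


generate()
# Pretend the session of user_a has expired on the server side.
validate(lambda c: c["session"] != "token-for-user_a")
remaining_after_check = sorted(cookies)
generate()  # the next generation pass logs user_a in again
remaining_after_regen = sorted(cookies)
print(remaining_after_check, remaining_after_regen)
```

Deleting invalid Cookies is what makes the generation step pick the account up again, since it only logs in accounts that have no entry in the Cookies mapping.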

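Next we implement the storage module. It stores two things: (1) usernames and passwords, and (2) usernames and Cookies. Both are one-to-one mappings keyed by the username, so a Redis hash fits well, because a hash stores field-value pairs. We therefore keep two hashes, one mapping username to password and one mapping username to Cookies.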

import random
import redis

# Redis host
REDIS_HOST = 'localhost'

# Redis port
REDIS_PORT = 6379

# Redis password; use None if there is none
REDIS_PASSWORD = None

class RedisClient(object):
    def __init__(self, type, website, host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD):
        """
        Initialize the Redis connection
        :param type: what the hash stores, 'accounts' or 'cookies'
        :param website: site name, e.g. 'weibo'
        :param host: host
        :param port: port
        :param password: password
        """
        self.db = redis.StrictRedis(host=host, port=port, password=password, decode_responses=True)
        self.type = type
        self.website = website

    def name(self):
        """
        Get the name of the hash
        :return: hash name
        """
        return "{type}:{website}".format(type=self.type, website=self.website)

    def set(self, username, value):
        """
        Set a field-value pair
        :param username: username
        :param value: password or Cookies
        :return:
        """
        return self.db.hset(self.name(), username, value)

    def get(self, username):
        """
        Get the value stored for a username
        :param username: username
        :return:
        """
        return self.db.hget(self.name(), username)

    def delete(self, username):
        """
        Delete the entry for a username
        :param username: username
        :return: result of the deletion
        """
        return self.db.hdel(self.name(), username)

    def count(self):
        """
        Get the number of entries
        :return: number of entries
        """
        return self.db.hlen(self.name())

    def random(self):
        """
        Get a random value; used to fetch random Cookies
        :return: random Cookies
        """
        return random.choice(self.db.hvals(self.name()))

    def usernames(self):
        """
        Get all usernames
        :return: all usernames
        """
        return self.db.hkeys(self.name())

    def all(self):
        """
        Get all field-value pairs
        :return: mapping of username to password or Cookies
        """
        return self.db.hgetall(self.name())


if __name__ == '__main__':
    conn = RedisClient('accounts', 'weibo')
    result = conn.set('wert', 'ssdsdsf')
    print(result)

 

 

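The name() method returns the name of the hash that holds the key-value pairs. For example, accounts:weibo stores usernames and passwords, and the snippet under if __name__ == '__main__' adds one account to the database this way.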
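
To make the two-hash layout concrete, here is a small sketch that mimics it with plain dicts, so it runs without a Redis server. The helper names hash_name, hset and random_value are made up for this illustration, and the cookie value is sample data; the real class talks to Redis through redis-py.

```python
import random

# hash name -> {field: value}; a stand-in for Redis, not the real client
store = {}


def hash_name(kind, website):
    """Build the hash name the same way name() does, e.g. 'accounts:weibo'."""
    return "{type}:{website}".format(type=kind, website=website)


def hset(name, field, value):
    """Store one field-value pair in the named hash."""
    store.setdefault(name, {})[field] = value


def random_value(name):
    """Pick a random value from the hash, or None when it is empty."""
    values = list(store.get(name, {}).values())
    return random.choice(values) if values else None


# The same username is the field in both hashes.
hset(hash_name("accounts", "weibo"), "wert", "ssdsdsf")
hset(hash_name("cookies", "weibo"), "wert", '{"sid": "abc"}')

print(sorted(store))
print(random_value(hash_name("cookies", "weibo")))
print(random_value(hash_name("cookies", "other")))
```

Note that random_value guards against an empty hash. The random() method in the class above calls random.choice on the result of hvals directly, which raises IndexError when the hash holds no entries, so callers may want a similar guard.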

Implementing the generation module

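To get the Cookies of a logged-in Weibo session we have to log in to Weibo, but the login flow may require a CAPTCHA or phone verification, which is complicated. Fortunately Weibo has three login sites: 1. https://weibo.cn  2. https://m.weibo.com  3. https://weibo.com. We choose the second one: it resembles the mobile client, and logging in does not require a CAPTCHA (provided you bought CAPTCHA-free accounts). The login code below opens the sign-in page on passport.weibo.cn, which redirects to https://m.weibo.cn/ after login.

 [Figure: the login page]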

 

import json
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities

from cookiespool.db import RedisClient
from login.weibo.cookies import WeiboCookies

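# Browser used by the generator.
# Note: PhantomJS support was removed in Selenium 4; with newer versions use 'Chrome'.
BROWSER_TYPE = 'PhantomJS'

class CookiesGenerator(object):
    def __init__(self, website='default'):
        """
        Base class; initializes the shared objects
        :param website: site name
        """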
        self.website = website
        self.cookies_db = RedisClient('cookies', self.website)
        self.accounts_db = RedisClient('accounts', self.website)
        self.init_browser()

    def __del__(self):
        self.close()
    
    def init_browser(self):
        """
        Initialize the shared browser used for simulated login, based on BROWSER_TYPE
        :return:
        """
        if BROWSER_TYPE == 'PhantomJS':
            caps = DesiredCapabilities.PHANTOMJS
            caps[
                "phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
            self.browser = webdriver.PhantomJS(desired_capabilities=caps)
            self.browser.set_window_size(1400, 500)
        elif BROWSER_TYPE == 'Chrome':
            self.browser = webdriver.Chrome()
    
    def new_cookies(self, username, password):
        """
        Generate new Cookies; subclasses must override this
        :param username: username
        :param password: password
        :return:
        """
        raise NotImplementedError

    def process_cookies(self, cookies):
        """
        Convert the browser's cookie list into a name -> value dict
        :param cookies: list of cookie dicts returned by the browser
        :return: dict mapping cookie name to value
        """
        result = {}
        for cookie in cookies:
            result[cookie['name']] = cookie['value']
        return result

    def run(self):
        """
        Run: fetch all accounts, then log in one by one for those without Cookies
        :return:
        """
        accounts_usernames = self.accounts_db.usernames()
        cookies_usernames = self.cookies_db.usernames()

        for username in accounts_usernames:
            if username not in cookies_usernames:
                password = self.accounts_db.get(username)
                print('Generating Cookies', 'account', username, 'password', password)
                # Guard against new_cookies returning None when the login outcome is unknown
                result = self.new_cookies(username, password) or {}
                # Login succeeded
                if result.get('status') == 1:
                    cookies = self.process_cookies(result.get('content'))
                    print('Got Cookies', cookies)
                    if self.cookies_db.set(username, json.dumps(cookies)):
                        print('Cookies saved')
                # Wrong password: remove the account
                elif result.get('status') == 2:
                    print(result.get('content'))
                    if self.accounts_db.delete(username):
                        print('Account deleted')
                else:
                    print(result.get('content'))
        # Printed once after the loop, whether or not any login succeeded
        print('Finished processing all accounts')

    def close(self):
        """
        Close the browser
        :return:
        """
        try:
            print('Closing Browser')
            self.browser.close()
            del self.browser
        except (AttributeError, TypeError):
            # AttributeError: the browser was never opened or has already been closed
            print('Browser not opened')


class WeiboCookiesGenerator(CookiesGenerator):
    def __init__(self, website='weibo'):
        """
        Initialization
        :param website: site name
        """
        CookiesGenerator.__init__(self, website)
        self.website = website

    def new_cookies(self, username, password):
        """
        Generate Cookies
        :param username: username
        :param password: password
        :return: result dict with 'status' and 'content'
        """
        return WeiboCookies(username, password, self.browser).main()
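
As an illustration of what process_cookies does before the result is stored: Selenium's get_cookies() returns a list of dicts, one per cookie, and only the name and value of each are kept. The cookie names and values below are made-up sample data, and to_name_value_dict is a hypothetical stand-in for the method.

```python
import json

# Sample data in the shape returned by Selenium's get_cookies():
# a list of dicts, each with at least 'name' and 'value'.
selenium_cookies = [
    {"name": "sid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "token", "value": "xyz789", "domain": ".example.com", "path": "/"},
]


def to_name_value_dict(cookies):
    """Keep only name and value, the form that requests accepts as cookies=..."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


flat = to_name_value_dict(selenium_cookies)
serialized = json.dumps(flat)  # this string is what gets written to the hash
print(serialized)
```

Storing the flattened dict as a JSON string keeps the Redis value a plain string, and a consumer can turn it back into a dict with json.loads.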

 This part performs the simulated login and determines whether it succeeded:

import time
from io import BytesIO
from PIL import Image
from selenium.common.exceptions import TimeoutException
#from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from os import listdir
from os.path import abspath, dirname

TEMPLATES_FOLDER = dirname(abspath(__file__)) + '/templates/'


class WeiboCookies():
    def __init__(self, username, password, browser):
        self.url = 'https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/'
        self.browser = browser
        self.wait = WebDriverWait(self.browser, 20)
        self.username = username
        self.password = password
    
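    def open(self):
        """
        Open the login page, fill in the username and password, and click submit
        :return: None
        """
        self.browser.delete_all_cookies()
        self.browser.get(self.url)
        username = self.wait.until(EC.presence_of_element_located((By.ID, 'loginName')))
        password = self.wait.until(EC.presence_of_element_located((By.ID, 'loginPassword')))
        submit = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginAction')))
        username.send_keys(self.username)
        password.send_keys(self.password)
        time.sleep(1)
        submit.click()

    def password_error(self):
        """
        Check whether the page reports a wrong username or password
        :return: True if the error message appears, otherwise False
        """
        try:
            # The literal must match the error text that the page actually displays
            return WebDriverWait(self.browser, 5).until(
                EC.text_to_be_present_in_element((By.ID, 'errorMsg'), '用戶名或密碼錯誤'))
        except TimeoutException:
            return False

    def login_successfully(self):
        """
        Check whether the login succeeded
        :return: True if the profile icon appears, otherwise False
        """
        try:
            return bool(
                WebDriverWait(self.browser, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'lite-iconf-profile'))))
        except TimeoutException:
            return False

    def get_cookies(self):
        """
        Get the Cookies from the browser
        :return: list of cookie dicts
        """
        return self.browser.get_cookies()

    def main(self):
        """
        Entry point: log in and report the outcome
        :return: dict with 'status' (1 success, 2 wrong password, 3 other failure) and 'content'
        """
        self.open()
        if self.password_error():
            return {
                'status': 2,
                'content': 'Wrong username or password'
            }
        # Logged in directly, no CAPTCHA needed
        if self.login_successfully():
            cookies = self.get_cookies()
            return {
                'status': 1,
                'content': cookies
            }
        # Neither check matched: return a generic failure instead of None
        return {
            'status': 3,
            'content': 'Login failed'
        }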

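 Detection module: after obtaining the Cookies we still need to check that they are valid. This is done by sending a request that carries the Cookies and judging validity from the status code of the response.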

 

import json
import requests
from requests.exceptions import ConnectionError
from cookiespool.db import *


class ValidTester(object):
    def __init__(self, website='default'):
        self.website = website
        self.cookies_db = RedisClient('cookies', self.website)
        self.accounts_db = RedisClient('accounts', self.website)
    
    def test(self, username, cookies):
        raise NotImplementedError
    
    def run(self):
        cookies_groups = self.cookies_db.all()
        for username, cookies in cookies_groups.items():
            self.test(username, cookies)


class WeiboValidTester(ValidTester):
    def __init__(self, website='weibo'):
        ValidTester.__init__(self, website)
    
    def test(self, username, cookies):
        print('Testing Cookies', 'username', username)
        try:
            cookies = json.loads(cookies)
        except (TypeError, ValueError):
            # json.loads raises ValueError (JSONDecodeError) on malformed text
            print('Malformed Cookies', username)
            self.cookies_db.delete(username)
            print('Deleted Cookies', username)
            return
        try:
            # TEST_URL_MAP maps a site name to its test URL (see the scheduler settings)
            test_url = TEST_URL_MAP[self.website]
            # Redirects are not followed, so a redirect (for example to a login page) counts as invalid
            response = requests.get(test_url, cookies=cookies, timeout=5, allow_redirects=False)
            if response.status_code == 200:
                print('Cookies valid', username)
            else:
                print(response.status_code, response.headers)
                print('Cookies expired', username)
                self.cookies_db.delete(username)
                print('Deleted Cookies', username)
        except ConnectionError as e:
            print('Exception occurred', e.args)
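
The decision logic of the tester can be isolated from the network by injecting the function that performs the request. The sketch below is an illustration, not the project's code: check_cookies and fetch_status are hypothetical names, and the stub lambdas replace a real requests.get call.

```python
import json


def check_cookies(cookies_json, fetch_status):
    """Classify stored Cookies as 'malformed', 'valid' or 'expired'.

    fetch_status is any callable that takes a cookies dict and returns
    the HTTP status code of a test request made with those cookies.
    """
    try:
        cookies = json.loads(cookies_json)
    except (TypeError, ValueError):
        # None raises TypeError; text that is not JSON raises ValueError
        return "malformed"
    return "valid" if fetch_status(cookies) == 200 else "expired"


ok = check_cookies('{"sid": "abc"}', lambda cookies: 200)
redirected = check_cookies('{"sid": "abc"}', lambda cookies: 302)
broken = check_cookies("not json", lambda cookies: 200)
missing = check_cookies(None, lambda cookies: 200)
print(ok, redirected, broken, missing)
```

In a real tester, fetch_status could wrap requests.get(test_url, cookies=cookies, timeout=5, allow_redirects=False).status_code, matching the call used above.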

 

 The scheduler module drives the other modules:

import time
from multiprocessing import Process

from cookiespool.api import app
from cookiespool.generator import *
from cookiespool.tester import *

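# Generator classes; to support another site, register it here
GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator'
}

# Tester classes; to support another site, register it here
TESTER_MAP = {
    'weibo': 'WeiboValidTester'
}

# URL requested to test whether Cookies are still valid
TEST_URL_MAP = {
    'weibo': 'https://m.weibo.cn/'
}

# Interval between generator and tester runs, in seconds
CYCLE = 120

# Generator switch: simulate login and add Cookies
GENERATOR_PROCESS = True
# Tester switch: periodically check the Cookies in the database and delete invalid ones
VALID_PROCESS = True
# API server switch
API_PROCESS = True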

class Scheduler(object):
    @staticmethod
    def valid_cookie(cycle=CYCLE):
        while True:
            print('Cookies validation process started')
            try:
                for website, cls in TESTER_MAP.items():
                    # Look up the tester class by name and instantiate it
                    tester = eval(cls + '(website="' + website + '")')
                    tester.run()
                    print('Cookies validation finished')
                    del tester
                    time.sleep(cycle)
            except Exception as e:
                print(e.args)

    @staticmethod
    def generate_cookie(cycle=CYCLE):
        while True:
            print('Cookies generation process started')
            try:
                for website, cls in GENERATOR_MAP.items():
                    # Look up the generator class by name and instantiate it
                    generator = eval(cls + '(website="' + website + '")')
                    generator.run()
                    print('Cookies generation finished')
                    generator.close()
                    time.sleep(cycle)
            except Exception as e:
                print(e.args)

    @staticmethod
    def api():
        print('API server started')
        # API_HOST and API_PORT are the settings shown with the API code
        app.run(host=API_HOST, port=API_PORT)
    
    def run(self):
        if API_PROCESS:
            api_process = Process(target=Scheduler.api)
            api_process.start()
        
        if GENERATOR_PROCESS:
            generate_process = Process(target=Scheduler.generate_cookie)
            generate_process.start()
        
        if VALID_PROCESS:
            valid_process = Process(target=Scheduler.valid_cookie)
            valid_process.start()
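
The scheduler builds its testers and generators with eval on class names stored as strings. One possible alternative, shown here only as a sketch, is to store the classes themselves in the mapping and call them directly. DemoTester and TESTER_CLASSES are hypothetical names for this illustration; in the project the mapping would hold WeiboValidTester.

```python
class DemoTester:
    """Hypothetical stand-in for a tester class such as WeiboValidTester."""

    def __init__(self, website):
        self.website = website

    def run(self):
        return "tested " + self.website


# Map site names to classes instead of to class-name strings.
TESTER_CLASSES = {"weibo": DemoTester}


def build_all(registry):
    """Instantiate one object per registered site, without eval."""
    return [cls(website=website) for website, cls in registry.items()]


results = [tester.run() for tester in build_all(TESTER_CLASSES)]
print(results)
```

With classes in the mapping, a misspelled class name fails at import time rather than inside the loop, and no string is ever evaluated as code.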

 The API module

import json
from flask import Flask, g

from cookiespool.db import *

# API host and port
API_HOST = '127.0.0.1'
API_PORT = 5000

__all__ = ['app']

app = Flask(__name__)

@app.route('/')
def index():
    return '<h2>Welcome to Cookie Pool System</h2>'


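def get_conn():
    """
    Attach a Redis client for each site's cookies and accounts to flask.g
    :return: g
    """
    # GENERATOR_MAP is the site-to-generator mapping from the scheduler settings
    for website in GENERATOR_MAP:
        print(website)
        # Check the attribute that is actually set below, so clients are created only once per context
        if not hasattr(g, website + '_cookies'):
            setattr(g, website + '_cookies', RedisClient('cookies', website))
            setattr(g, website + '_accounts', RedisClient('accounts', website))
    return g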


@app.route('/<website>/random')
def random(website):
    """
    Get random Cookies; example URL: /weibo/random
    :return: random Cookies
    """
    g = get_conn()
    cookies = getattr(g, website + '_cookies').random()
    return cookies


@app.route('/<website>/add/<username>/<password>')
def add(website, username, password):
    """
    Add an account; example URL: /weibo/add/user/password
    :param website: site name
    :param username: username
    :param password: password
    :return: JSON status
    """
    g = get_conn()
    print(username, password)
    getattr(g, website + '_accounts').set(username, password)
    return json.dumps({'status': '1'})


@app.route('/<website>/count')
def count(website):
    """
    Get the total number of Cookies
    """
    g = get_conn()
    count = getattr(g, website + '_cookies').count()
    return json.dumps({'status': '1', 'count': count})


if __name__ == '__main__':
    app.run(host='0.0.0.0')

 [Figure: output of a run]

github:https://github.com/jzxsWZY/CookiesPool
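
A crawler could consume the pool roughly as follows: request /weibo/random, parse the returned JSON string into a dict, and send it along with the real request. This is a sketch under assumptions: fetch_with_pool is a hypothetical helper, the HTTP function is injected so the example runs offline, and the stub response is sample data. In practice http_get would be requests.get and the pool URL would point at the running API.

```python
import json


def fetch_with_pool(http_get, pool_url, target_url):
    """Fetch target_url using Cookies taken from the pool's random endpoint."""
    cookies = json.loads(http_get(pool_url).text)
    return http_get(target_url, cookies=cookies)


class FakeResponse:
    """Minimal stand-in for a response object; only .text is used."""

    def __init__(self, text):
        self.text = text


calls = []


def fake_get(url, cookies=None):
    """Offline stub that records each call and returns sample Cookies."""
    calls.append((url, cookies))
    return FakeResponse('{"sid": "abc"}')


fetch_with_pool(fake_get, "http://127.0.0.1:5000/weibo/random", "https://m.weibo.cn/")
print(calls)
```

Because each call to the random endpoint may return a different account's Cookies, requests are spread across the accounts in the pool.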

