Crawling Free Proxies from the Web in Practice: The Crawler Module

Reposted article, 2021-08-02 17:43. Tags: python, proxy pool

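1. About the crawler module

The crawler module finds free-proxy websites and scrapes the handful of free proxies they expose publicly. Very few of the scraped proxies actually work. If you scrape 20 free proxies from a site's home page, perhaps 1 or 2 survive testing, because free proxies are short-lived and nowhere near as stable as paid ones.

Since a single site yields only a handful of usable proxies, the way forward is volume: crawl more sites. If each site gives about 20 free proxies and 10% of them are usable, that is 2 per site, so crawling 10 sites gives about 20 usable proxies.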
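The back-of-the-envelope estimate above can be written down as a tiny function. The 20 proxies per site and the 10% usable rate are the assumed figures from the text, not measurements, and expected_usable is just an illustrative name.

```python
def expected_usable(sites, proxies_per_site=20, usable_rate=0.10):
    """Rough estimate of how many crawled proxies survive testing."""
    return round(sites * proxies_per_site * usable_rate)


one_site = expected_usable(1)     # 2
ten_sites = expected_usable(10)   # 20
```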

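It sounds easy: add a few dozen proxy sites and the number of usable proxies becomes respectable. In practice there is more to consider. First, you have to find that many proxy sites, they have to expose some free proxies, and those pages have to be crawlable; you also pay the cost of writing crawler code for each site. Second, the sites themselves have to be stable, or you may finish the code only to find that the site has suddenly gone down.

This is just one approach to crawling free proxies that I know of. However many usable proxies it yields in the end, it is only one of several strategies for getting around anti-crawling measures. What matters is whether it meets our needs.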

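2. Implementation approach

To crawl several proxy sites we have to maintain several pieces of site-specific code, one independent unit per site.

After writing crawlers for two or three sites, I found, to my annoyance, that they already shared a lot of duplicated code. The requests call is one example: every site's crawler had its own copy.

So the question became whether the request step could be pulled out on its own, which would cut the duplication and make it easier to add new proxy sites later.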
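The idea can be sketched in a few lines: a base class owns the request step, and each site only supplies its own parse method. This is a simplified illustration rather than the project's code. BaseCrawler, ExampleSite, fake_fetch and the URL are made-up names, and the fetch function is injected so the sketch runs without a network.

```python
from typing import Callable, Iterator


class BaseCrawler:
    """Shared request logic lives here; subclasses only describe one site."""

    url = ""

    def __init__(self, fetch: Callable[[str], str]):
        # `fetch` is injected so the sketch runs offline; in a real crawler
        # this role would be played by an HTTP request.
        self.fetch = fetch

    def parse(self, html: str) -> Iterator[str]:
        raise NotImplementedError

    def crawl(self) -> Iterator[str]:
        html = self.fetch(self.url)
        yield from self.parse(html)


class ExampleSite(BaseCrawler):
    url = "https://proxy-site.example/"  # hypothetical URL

    def parse(self, html):
        # Site-specific extraction: here, one proxy per non-empty line
        for line in html.splitlines():
            line = line.strip()
            if line:
                yield line


def fake_fetch(url):
    """Stand-in for a network request; returns canned page content."""
    return "1.2.3.4:8080\n5.6.7.8:3128\n"


proxies = list(ExampleSite(fake_fetch).crawl())
```

Adding another site then means writing one more small subclass with a url and a parse method.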

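With that idea in mind, let's get to the point.

The directory structure below covers mainly the crawler module, which holds all of the crawling logic.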

proxypool                          # project name
 │  context.py                     # import-path setup, for running from the Windows command line
 │  getter.py                      # crawler module (entry point)                        # 3.3
 │  __init__.py
 ├─crawler                         # everything that crawls proxy sites
 │  │  base.py                     # generic request/crawl class                         # 3.1
 │  │  __init__.py
 │  │
 │  ├─proxysite                    # proxy sites; each .py file here maintains one site
 │  │  │  proxy_89ip.py
 │  │  │  proxy_ip3366.py
 │  │  │  proxy_ipihuan.py
 │  │  │  proxy_seofangfa.py
 │  │  │  proxy_shenjidaili.py
 │  │  │  __init__.py              # exposes this directory's absolute path for pkgutil  # 3.4.3
 ├─untils                          # other modules
 │  │  parse.py                    # proxy validation helpers
 │  │  loggings.py                 # logging wrapper class                               # 3.4.1
 │  │  __init__.py
 ├─...

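The # numbers at the end of the comments are the section numbers in this article.

3. Implementation

Environment: Python 3.9.1, Redis 3.5.3

Third-party packages: requests, fake_headers, retrying, loguru, pyquery

3.1 Generic request/crawl base class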

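#proxypool/crawler/base.py
import requests
from fake_headers import Headers
from retrying import retry
from proxypool.untils.parse import is_valid_proxy
from requests.exceptions import ConnectionError, Timeout

try:
    from proxypool.untils.loggings import Logging
    logging = Logging()
except ImportError:
    # Fall back to loguru's default logger if the wrapper is unavailable
    from loguru import logger as logging

# Errors treated as "no response": website_response returns None for them.
# requests' Timeout is listed explicitly because it does not derive from
# the built-in TimeoutError.
Exceptions = (
    TimeoutError,
    AssertionError,
    ConnectionError,
    Timeout,
)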

class Base(object):
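    """
    Generic class for requesting and crawling a proxy site.

    Class attributes:
    - :url:              # URL to crawl, i.e. the proxy site
    - :proxies:          # proxy to send the request through (can be set per site in the subclass)
    - :isvalid = True    # whether the site is enabled; if False, the site is skipped the next time the program starts

    decorator:
    - @retry(...):       # retries the request to the proxy site
    :param: retry_on_result, retry condition: fires when website_response returns None
    :param: stop_max_attempt_number, stop after 2 attempts in total
    :param: wait_exponential_multiplier, multiplier for the exponential wait, in milliseconds
    :param: wait_exponential_max, upper bound on the wait, in milliseconds; see the retrying docs for the exact formula
    """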
    url = ""
    proxies = None
    isvalid = True

    def __init__(self):
        # Suppress urllib3's insecure-request warnings (requests below use verify=False)
        requests.packages.urllib3.disable_warnings()
        self.logger = logging

    @retry(stop_max_attempt_number=2, retry_on_result=lambda x: x is None,
           wait_exponential_multiplier=1000,
           wait_exponential_max=10000)
    def website_response(self, url, **kwargs):
        """
        Generic request method.

        Args:
        - :url:             # address of the proxy site to crawl
        - kwargs:           # extra options passed through to requests.get

        Other variables:
        - headers           # disguises the crawler as a browser; if the fake_headers package cannot be
                            installed (it may be blocked in mainland China), build the headers by hand.
                            Example:
                            headers = {'Accept': '*/*', 'Connection': 'keep-alive',
                            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4;
                            rv:52.7.3) Gecko/20100101 Firefox/52.7.3', 'DNT': '1',
                            'Referer': 'https://google.com', 'Pragma': 'no-cache'}

        - proxies           # enables a proxy, e.g. a local one:
                            proxies = {
                                'http': 'http://127.0.0.1:1080',
                                'https': 'https://127.0.0.1:1080',
                            }
        """
        try:
            headers = Headers(headers=True).generate()
            kwargs.setdefault('timeout', 10)
            kwargs.setdefault('verify', False)  # requests defaults to verify=True
            kwargs.setdefault('headers', headers)
            # Send the request through a proxy if the subclass configured one
            if self.proxies is not None:
                kwargs.setdefault('proxies', self.proxies)
            res = requests.get(url, **kwargs)
            # Treat the site as healthy only on HTTP 200
            if res.status_code == 200:
                res.encoding = 'utf-8'
                return res.text
            return None  # any other status triggers a retry
        except Exceptions:
            return None

    @logging.catch
    def crawl(self):
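        """
        Crawl one proxy site.
            1. Call self.website_response and assign the returned response text to html.
            2. Call the subclass's parse method, which holds the extraction logic for that site.
            3. Check each proxy with is_valid_proxy; malformed ones (None) are skipped.
            4. Yield each well-formed proxy.
        """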
        url = self.url
        self.logger.info(f'Request URL:{url}')
        html = self.website_response(url)
        for proxy in self.parse(html):
            proxy = is_valid_proxy(proxy)
            if proxy is not None:
                self.logger.info(f"Fetching proxy: {proxy} from {url}")
                yield proxy
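
About the wait_exponential_* parameters: my understanding of the retrying library is that the wait before the next try is multiplier * 2^n milliseconds, where n is the number of attempts made so far, capped at the maximum. The sketch below only reproduces that arithmetic with the values used in the decorator. It is my reading of the library, not something taken from its documentation, so check the retrying docs before relying on it; exponential_wait_ms is a made-up helper name.

```python
def exponential_wait_ms(attempts_so_far, multiplier=1000, maximum=10000):
    """Assumed wait (in ms) before the next try, after `attempts_so_far` tries."""
    return min(multiplier * 2 ** attempts_so_far, maximum)


# Waits after the 1st, 2nd, 3rd and 4th attempt
schedule = [exponential_wait_ms(n) for n in range(1, 5)]
```

If that reading is right, then with stop_max_attempt_number=2 only the first value matters: one failed attempt, a wait of about 2 seconds, then the second and final attempt.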

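3.2 Proxy site classes

A few of the proxy sites are shown below; new ones can be added later by following the same template.

Base.crawl is a generator, so calling it without iterating over it runs nothing. To see results, iterate over it (for example: for proxy in test.crawl(): print(proxy)), or as a quick hack comment out the final yield in Base.crawl.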
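
The reason a bare call shows nothing is general Python behaviour: calling a generator function only creates a generator object, and the body runs once something iterates over it. A minimal standalone sketch, where the crawl function and its value are placeholders rather than the project's code:

```python
calls = []


def crawl():
    calls.append("started")  # runs only once iteration begins
    yield "1.2.3.4:8080"


gen = crawl()             # creates the generator; the body has not run yet
before = list(calls)      # snapshot taken before iterating
result = list(gen)        # iterating drives the body
```

Here before is still empty, while calls and result are filled in only after the list(gen) call.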

3.2.1 www.89ip.cn

#proxypool/crawler/proxysite/proxy_89ip.py	
from pyquery import PyQuery as pq
from proxypool.crawler.base import Base


class proxy_89ip(Base):
    url = 'https://www.89ip.cn/index_1.html'

    def parse(self, html):
        doc = pq(html)
        hosts = doc('.layui-table td:nth-child(1)').text().split(' ')
        ports = doc('.layui-table td:nth-child(2)').text().split(' ')
        for host, port in zip(hosts, ports):
            yield f'{host.strip()}:{port.strip()}'


if __name__ == '__main__':
    test = proxy_89ip()
    # crawl() is a generator, so iterate over it to actually run it
    for proxy in test.crawl():
        print(proxy)
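
The pairing step inside parse can be tried without pyquery or a network connection. The sketch below uses made-up sample strings in place of the column text, which parse expects to be separated by spaces. pair_hosts_ports is a hypothetical helper, not part of the project.

```python
def pair_hosts_ports(hosts_text, ports_text):
    """Pair whitespace-separated hosts and ports into host:port strings."""
    hosts = hosts_text.split()
    ports = ports_text.split()
    # zip stops at the shorter list, so an unmatched host or port is dropped
    return [f"{host}:{port}" for host, port in zip(hosts, ports)]


# Sample strings standing in for the text of the two table columns
pairs = pair_hosts_ports("1.2.3.4 5.6.7.8 9.9.9.9", "8080 3128")
```

Note that the third host has no matching port and is silently dropped, which is worth keeping in mind if a site's table ever has an empty cell.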

3.2.2 www.ip3366.net

#proxypool/crawler/proxysite/proxy_ip3366.py
from pyquery import PyQuery as pq
from proxypool.crawler.base import Base


class proxy_ip3366(Base):
    url = 'http://www.ip3366.net/?stype=1&page=1'

    def parse(self, html):
        doc = pq(html)
        hosts = doc('.table td:nth-child(1)').text().split(' ')
        ports = doc('.table td:nth-child(2)').text().split(' ')
        for host, port in zip(hosts, ports):
            yield f'{host.strip()}:{port.strip()}'


if __name__ == '__main__':
    test = proxy_ip3366()
    # crawl() is a generator, so iterate over it to actually run it
    for proxy in test.crawl():
        print(proxy)

3.2.3 ip.ihuan.me

#proxypool/crawler/proxysite/proxy_ipihuan.py
from pyquery import PyQuery as pq
from proxypool.crawler.base import Base


class proxy_ipihuan(Base):
    url = 'https://ip.ihuan.me/'
    isvalid = False

    def parse(self, html):
        doc = pq(html)
        hosts = doc('.table td:nth-child(1)').text().split(' ')
        ports = doc('.table td:nth-child(2)').text().split(' ')
        for host, port in zip(hosts, ports):
            yield f'{host.strip()}:{port.strip()}'


if __name__ == '__main__':
    test = proxy_ipihuan()
    # crawl() is a generator, so iterate over it to actually run it
    for proxy in test.crawl():
        print(proxy)

3.2.4 proxy.seofangfa.com

#proxypool/crawler/proxysite/proxy_seofangfa.py
from pyquery import PyQuery as pq
from proxypool.crawler.base import Base


class proxy_seofangfa(Base):
    url = 'https://proxy.seofangfa.com/'

    # proxies = {
    #     'http': 'http://127.0.0.1:1080',
    #     'https': 'https://127.0.0.1:1080',
    # }

    def parse(self, html):
        doc = pq(html)
        hosts = doc('.table td:nth-child(1)').text().split(' ')
        ports = doc('.table td:nth-child(2)').text().split(' ')
        for host, port in zip(hosts, ports):
            yield f'{host.strip()}:{port.strip()}'

if __name__ == '__main__':
    test = proxy_seofangfa()
    # crawl() is a generator, so iterate over it to actually run it
    for proxy in test.crawl():
        print(proxy)

3.2.5 shenjidaili.com

#proxypool/crawler/proxysite/proxy_shenjidaili.py
from pyquery import PyQuery as pq
from proxypool.crawler.base import Base


class proxy_shenjidaili(Base):
    url = 'http://www.shenjidaili.com/product/open/'

    isvalid = False

    def parse(self, html):
        doc = pq(html)
        proxies = doc('.table td:nth-child(1)').text().split(' ')
        for proxy in proxies:
            yield f'{proxy}'


if __name__ == '__main__':
    test = proxy_shenjidaili()
    # crawl() is a generator, so iterate over it to actually run it
    for proxy in test.crawl():
        print(proxy)

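When debugging the code in 3.2.1, remember that Base.crawl is a generator: if it is called without being iterated over, running the file returns nothing. Either iterate over the result or comment out the final yield in Base.crawl.

Output of running proxy_89ip.py: (screenshot not preserved)

3.3 Crawler module (entry point)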

#proxypool/getter.py
import context  # adjusts sys.path; must come before the proxypool imports
import importlib.util
import inspect
import pkgutil
from proxypool.crawler.base import Base
from proxypool.crawler.proxysite import crawlerPath
from loguru import logger


def get_classes():
    """
    Load the crawler classes defined in the proxysite package.

    return: all enabled Base subclasses (class objects) found under proxysite
    """
    classes = []
    for finder, name, is_pkg in pkgutil.walk_packages([crawlerPath]):
        # Import the module from its spec (find_module/load_module are deprecated)
        spec = finder.find_spec(name)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for _class, _class_object in inspect.getmembers(module, inspect.isclass):
            # Keep only enabled subclasses of Base
            if issubclass(_class_object, Base) \
                    and _class_object is not Base and _class_object.isvalid:
                classes.append(_class_object)
    return classes


classes = get_classes()


class Getter(object):

    def __init__(self):
        self.classes = [cls() for cls in classes]
        self.in_storage_count = 0
        self.logger = logger

    @logger.catch
    def run(self):
        if len(self.classes):
            for cls in self.classes:
                self.logger.info(f'Get the proxy instance object: {cls}')
                for proxy in cls.crawl():
                    # .... write code
                    # .... save proxy to local or redis 
                    # .........
                    self.logger.info(f"Got proxy: {proxy}")


if __name__ == '__main__':
    test = Getter()
    test.run()
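
The discovery step can be tried in isolation. The sketch below is self-contained: it writes two throwaway modules into a temporary directory, loads them through importlib specs (find_spec and exec_module), and keeps only the enabled subclasses. All names in it (demo_base, demo_sites, SiteA, SiteB, load_crawler_classes) are made up for the demonstration and are not part of the project.

```python
import importlib
import importlib.util
import inspect
import pkgutil
import sys
import tempfile
from pathlib import Path


def build_demo_tree(root):
    """Write a tiny base module and two site modules under `root`."""
    (root / "demo_base.py").write_text("class Base:\n    isvalid = True\n")
    sites = root / "demo_sites"
    sites.mkdir()
    (sites / "site_a.py").write_text(
        "from demo_base import Base\n\nclass SiteA(Base):\n    pass\n")
    (sites / "site_b.py").write_text(
        "from demo_base import Base\n\nclass SiteB(Base):\n    isvalid = False\n")
    return sites


def load_crawler_classes(directory, base):
    """Import every module in `directory`; return enabled subclasses of `base`."""
    found = []
    for finder, name, _is_pkg in pkgutil.iter_modules([str(directory)]):
        spec = finder.find_spec(name)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base) and obj is not base and obj.isvalid:
                found.append(obj)
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sites_dir = build_demo_tree(root)
    sys.path.insert(0, str(root))  # lets the site modules import demo_base
    try:
        base_cls = importlib.import_module("demo_base").Base
        names = [cls.__name__ for cls in load_crawler_classes(sites_dir, base_cls)]
    finally:
        sys.path.remove(str(root))
```

SiteB is skipped because its isvalid flag is False, which mirrors how a disabled proxy site is left out of the run.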

Output of running getter.py: (screenshot not preserved)

3.4 Other modules

3.4.1 Logging wrapper class

#proxypool/untils/loggings.py

import sys
import time
from loguru import logger
from pathlib import Path

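# Whether to write logs to a file
OPEN_LOG = True

class Logging(object):
    """
    Logging wrapper (singleton)
    """
    _instance = None
    _log = OPEN_LOG
    _configured = False  # ensures the file sink is added only once

    def __new__(cls, *arg, **kwargs):
        if cls._instance is None:
            # object.__new__ accepts no extra arguments
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # __init__ runs on every Logging() call, even though the instance is shared
        if self._log and not Logging._configured:
            Logging._configured = True
            self.log()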

    def info(self, msg):
        return logger.info(msg)

    def debug(self, msg):
        return logger.debug(msg)

    def error(self, msg):
        return logger.error(msg)

    def exception(self, msg):
        return logger.exception(msg)

    @classmethod
    def catch(cls, func):
        # Delegate to loguru so that generator functions such as Base.crawl are handled too
        return logger.catch(func)

    def log(self):
        """
        Write log files to a log/ directory next to the script being run
        """
        if self._log:
            t = time.strftime('%Y_%m_%d')
            present_path = sys.path[0]
            p = Path(present_path).resolve()
            log_path = p.joinpath('log')
            logger.add(f'{log_path}/crawl_{t}.log',
                       level='ERROR',   # record ERROR and above only
                       enqueue=True,
                       rotation='00:00',
                       retention='1 months',
                       compression='tar.gz',
                       encoding='utf-8',
                       backtrace=True)
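
One thing to watch with a singleton built on __new__: Python still calls __init__ every time the class is called, so any setup done there (such as adding a log sink) would repeat unless it is guarded. A minimal standalone sketch of that behaviour follows. The Singleton class, its counter and the sinks list are illustrative stand-ins, not the project's code.

```python
class Singleton:
    _instance = None
    _configured = False
    init_calls = 0

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        type(self).init_calls += 1      # counts every construction call
        if type(self)._configured:
            return                      # setup already done, skip it
        type(self)._configured = True
        self.sinks = ["file"]           # stands in for a one-time setup step


a = Singleton()
b = Singleton()
```

Both names refer to the same object and __init__ has run twice, but the guarded setup step happened only once.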

3.4.2 Validating the proxy format

#proxypool/untils/parse.py
try:
    from proxypool.untils.loggings import Logging
    logging = Logging()

except ImportError:
    from loguru import logger as logging


Exceptions = (
    ValueError,
    AssertionError
)

def bytes_convert_string(data):
    """
    Convert bytes to str
    Example: b'123' ---> '123'
    """
    if data is None:
        return None
    elif isinstance(data, bytes):
        return data.decode('utf8')
    return data  # already a str


def is_valid_proxy(ip_port):
    """
    Validate the proxy format
    :param: ip_port, {ip}:{port}
    Examples:
        well-formed proxy: 27.191.60.60:3256
        malformed proxies: 299.299.299.299:123 or 1.2.4.8:66666
    :return: ip_port if it is well formed, otherwise None
    """
    if ip_port is None:
        return
    elif isinstance(ip_port, str):
        try:
            ip_port_list = ip_port.split(':')
            if len(ip_port_list) == 2:
                ip_str, port = ip_port_list
                if not port.isdigit():
                    return
                assert 1 <= int(port) <= 65535
                li = ip_str.split('.')
                if len(li) == 4:
                    # every octet must be in 0..255 (so 10.0.0.1 is accepted)
                    _ip = [int(s) for s in li if 0 <= int(s) <= 255]
                    if len(_ip) == 4:
                        return ip_port
        except Exceptions:  # int(x), x = 'a' --> ValueError
            logging.error(f'ip not valid -- {ip_port}')


if __name__ == '__main__':
    by = b'27.191.60.60:325611'
    ip = bytes_convert_string(by)
    # Port 325611 is out of range, so this logs an error and prints None
    print(is_valid_proxy(ip))
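
As an alternative sketch, the IPv4 part of the check can be handed to the standard library's ipaddress module instead of splitting on dots by hand. validate_proxy is a hypothetical name chosen to keep it apart from the project's is_valid_proxy, and unlike the project's function it does not log anything.

```python
import ipaddress


def validate_proxy(ip_port):
    """Return ip_port if it looks like IPv4:port, otherwise None."""
    if not isinstance(ip_port, str):
        return None
    host, sep, port = ip_port.rpartition(":")
    if not sep or not port.isdigit():
        return None
    try:
        if not 1 <= int(port) <= 65535:
            return None
        ipaddress.IPv4Address(host)  # raises a ValueError subclass if malformed
    except ValueError:
        return None
    return ip_port


good = validate_proxy("27.191.60.60:3256")
bad_ip = validate_proxy("299.299.299.299:123")
bad_port = validate_proxy("1.2.4.8:66666")
```

The three sample inputs are the ones from the docstring above: the first is returned unchanged, the other two give None.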

3.4.3 __init__.py

#proxypool/crawler/proxysite/__init__.py
import os.path

# Absolute path of the proxysite directory, passed to pkgutil as the search path
crawlerPath = os.path.dirname(__file__)
__all__ = ["crawlerPath"]

3.4.4 context.py

#proxypool/context.py
import sys
from pathlib import Path
sys.path.insert(0, str(Path(sys.path[0], '..').resolve()))

Full code: https://github.com/rosaany/proxypool
