How to Crawl Usable IP Proxies

Reposted article, 2017-07-23 16:41. Tags: python, crawl

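The previous post mentioned that one key way to deal with anti-crawler measures is to use IP proxies. So how do we get usable IP proxies? Here are some notes from my last couple of days of crawling for them.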

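1 Steps

　　1. Find a few websites that offer free IP proxies and use them as the data source.

　　2. For each proxy, check whether the exit IP seen through it matches the local machine's exit IP, and keep the proxies whose exit IP differs.

　　3. Measure each proxy's response time against your own use case, sort the results, and pick the best ones.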
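
Steps 2 and 3 amount to a filter followed by a sort. The following is a minimal sketch of that selection logic, assuming each proxy has already been probed. The `ProxyCheck` record and the `select_proxies` function are hypothetical names, not part of the script in section 3, and the addresses are documentation placeholders.

```python
from typing import List, NamedTuple


class ProxyCheck(NamedTuple):
    """Result of probing one proxy (hypothetical record for illustration)."""
    proxy: str            # "ip:port"
    exit_ip: str          # exit IP reported when requesting through the proxy
    response_time: float  # seconds


def select_proxies(checks: List[ProxyCheck], local_ip: str) -> List[ProxyCheck]:
    """Keep proxies whose exit IP differs from ours, fastest first."""
    working = [c for c in checks if c.exit_ip != local_ip]
    return sorted(working, key=lambda c: c.response_time)


checks = [
    ProxyCheck("198.51.100.7:8080", "198.51.100.7", 1.20),
    ProxyCheck("203.0.113.5:3128", "192.0.2.10", 0.30),  # exit IP unchanged, so rejected
    ProxyCheck("198.51.100.9:80", "198.51.100.9", 0.45),
]
best = select_proxies(checks, local_ip="192.0.2.10")
print([c.proxy for c in best])
```

Note that the fastest entry is dropped here because its exit IP equals the local one, which is exactly the case step 2 is meant to catch.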

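2 Details

　　1. Search online; there are plenty of such sites, for example Xici (西刺) and Kuaidaili (快代理).

　　2. The exit IP can be verified with an IP lookup page; the code below uses the address stored in URL_CHECK.

　　3. This depends on what your crawler needs to do, such as downloading files or something else; test the speed further against that workload.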
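
Since "fast" depends on the workload, one option for point 3 is to time the request the crawler will really make, rather than only the IP check. The sketch below is an assumption about how that could look: `time_task` is a hypothetical helper that times any zero-argument callable, and taking the median of a few attempts is a design choice of this example, not something the original describes.

```python
import time
from statistics import median
from typing import Callable, Optional


def time_task(task: Callable[[], object], attempts: int = 3) -> Optional[float]:
    """Run `task` several times and return the median duration in seconds.

    Returns None as soon as one attempt raises, on the assumption that
    a proxy failing intermittently is not worth keeping.
    """
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            task()
        except Exception:
            return None
        durations.append(time.perf_counter() - start)
    return median(durations)
```

With requests, the task could be something like `lambda: requests.get(target_url, proxies={'http': 'http://' + ip}, timeout=2)`, where `target_url` stands for whatever page or file your crawler fetches.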

3 Code

# -*- coding: utf-8 -*-
# Python 3; requires the requests and beautifulsoup4 packages.
import time

import requests
from bs4 import BeautifulSoup

# Page that echoes the caller's IP; used to check whether a proxy works.
URL_CHECK = 'http://1212.ip138.com/ic.asp'
RESPONSE_TIME = 2  # request timeout, in seconds
IP_LOCAL = '120.236.174.144'  # exit IP of the local machine; replace with your own
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36')

# Proxy list pages look like "http://www.ip181.com/daili/1.html"
# (open one in a browser to check). The program crawls the pages in
# [start_page, end_page]; for example, 1 and 2 crawls page 1 and page 2.
start_page = input('Please input your start page to crawl: ')
end_page = input('Please input your end page to crawl: ')

s = requests.Session()


def check_a_ip(ip):
    """Return (exit IP differs from IP_LOCAL, elapsed seconds as a string)."""
    start = time.time()
    try:
        connection = s.get(URL_CHECK, headers={
            'Host': '1212.ip138.com',
            'Referer': 'http://www.ip138.com/',
            'User-Agent': USER_AGENT,
        }, proxies={'http': 'http://' + ip}, timeout=RESPONSE_TIME)
        soup = BeautifulSoup(connection.content, 'html.parser')
        # Take the text between '[' and ']' in the first <center> element.
        ip_return = soup.find_all('center')[0].text.split('[')[1].split(']')[0]
        return ip_return != IP_LOCAL, '%.6f' % (time.time() - start)
    except Exception:
        # Timeouts, refused connections and unparsable pages all count as unusable.
        return False, '-1'


url = 'http://www.ip181.com/daili/%s.html'
with open('proxy.txt', 'w') as ip_proxy_file:
    ip_proxy_file.write('ip_port,response_time\n')

for i in range(int(start_page), int(end_page) + 1):
    connection_crawl = s.get(url % str(i), headers={'User-Agent': USER_AGENT})
    soup_crawl = BeautifulSoup(connection_crawl.content, 'html.parser')

    # Parse each page and keep the proxies that pass the check.
    trs = soup_crawl.find_all('tr')
    with open('proxy.txt', 'a') as ip_proxy_file:
        for tr in trs[1:len(trs)]:  # skip the header row
            tds = tr.find_all('td')
            if len(tds) < 2:
                continue  # not a proxy row
            ip = tds[0].get_text(strip=True) + ':' + tds[1].get_text(strip=True)
            is_good, res_time = check_a_ip(ip)
            if is_good:
                ip_proxy_file.write(ip + ',' + res_time + '\n')

    print('%s : Finished crawling page %d.  %s' % (
        time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()), i, url % str(i)))


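A few notes on this code:

· check_a_ip(ip): the proxy check function. It returns two values: whether the request actually went out through the proxy, and the response time of the check.

· start_page, end_page: the page numbers to crawl for proxies, entered by hand. Set these according to the specific website.

· for i in range(int(start_page), int(end_page) + 1): the main loop, which walks through the pages in the given range.

· for tr in trs[1:len(trs)]: loops over one page, parses out every proxy on it, and checks whether each one is usable.

· ip_proxy_file: the text output. All results end up in proxy.txt.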
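
Once proxy.txt exists, a natural follow-up is to read it back and use the fastest entry. The sketch below assumes the `ip_port,response_time` header that the script writes; `rank_proxies` and `as_requests_proxies` are hypothetical helper names, and the sample rows are placeholders rather than real results.

```python
import csv
from typing import Dict, Iterable, List, Tuple


def rank_proxies(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse 'ip_port,response_time' CSV lines and return rows, fastest first."""
    rows = [(r["ip_port"], float(r["response_time"]))
            for r in csv.DictReader(lines)]
    return sorted(rows, key=lambda row: row[1])


def as_requests_proxies(ip_port: str) -> Dict[str, str]:
    """Build the proxies mapping in the form the script passes to requests."""
    return {"http": "http://" + ip_port}


# Placeholder rows in the same format as proxy.txt.
sample = [
    "ip_port,response_time\n",
    "198.51.100.7:8080,0.912345\n",
    "198.51.100.9:80,0.301200\n",
]
ranked = rank_proxies(sample)
print(as_requests_proxies(ranked[0][0]))
```

To run it against the real file, pass an open file object instead of the sample list, for example `with open('proxy.txt', newline='') as f: ranked = rank_proxies(f)`.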

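4 Extensions

The crawling or the checking could be done with multiple threads, which should speed things up considerably. Give it a try if you have time.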
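
One way to try this is a thread pool from the standard library, since the check is dominated by waiting on the network. This is only a sketch: `check_many` is a hypothetical wrapper, the worker count of 20 is an arbitrary assumption, and `check` is expected to behave like `check_a_ip`, returning an `(is_good, response_time)` pair. A fake checker stands in here so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple

CheckResult = Tuple[bool, str]  # same shape as check_a_ip's return value


def check_many(proxies: Iterable[str],
               check: Callable[[str], CheckResult],
               max_workers: int = 20) -> List[Tuple[str, str]]:
    """Check proxies concurrently; return (ip_port, response_time) for good ones."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `proxies`.
        results = list(pool.map(check, proxies))
    return [(p, t) for p, (ok, t) in zip(proxies, results) if ok]


def fake_check(ip_port: str) -> CheckResult:
    """Stand-in for check_a_ip: pretends only port-80 proxies work."""
    return ip_port.endswith(":80"), "0.100000"


good = check_many(["198.51.100.7:8080", "198.51.100.9:80"], fake_check)
print(good)
```

In the real script, `check_many(candidates, check_a_ip)` would replace the inner per-row loop. As far as I know, requests does not document `Session` as thread-safe, so using a separate session per thread, or plain `requests.get`, is the more cautious choice.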
