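0. Table of Contents
1. Approach
2. Installing on Windows
3. Command-line reference
4. Basic configuration and first use
5. Question: does Squid support HTTPS?
6. Question: multiple proxy entries with the same IP but different ports cause an error
7. Question: telling HTTP from HTTPS proxy requests and selecting the matching proxy entry
8. Question: proxy IP types: elite (high-anonymity) / anonymous / transparent
9. Question: forward / reverse / transparent proxies
10. Updating the configuration with a Python script
11. Logging
12. References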
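1. Approach
- Poll the proxy source sites on a schedule (every 30 minutes or every hour is fine), parse out every proxy IP, and store them in a database.
- Load all proxies from the database, request a fixed test site through each one, and for those that succeed update the "available" flag and the response time in the database.
- Load all available proxies from the database and, using some algorithm, compute a usage weight and a maximum use count from each proxy's response time.
- Write the proxies to the configuration file in Squid's cache_peer format.
- Reload the Squid configuration file to refresh Squid's proxy list.
- Point the crawler at Squid's service IP and port so that it does nothing but crawl.
That is all it takes to build a complete proxy service that emits high-quality proxies on a schedule. The crawler does not need to care about collecting or testing proxies; it simply crawls through Squid's single service entry point.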
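The weighting step in the list above is left open ("some algorithm"). As a minimal sketch only, assuming a simple inverse-latency rule, response times could be mapped to cache_peer options as follows. The function names, the constants, and the choice to express the "maximum use count" through max-conn= are illustrative assumptions, not the method actually used here.

```python
def peer_options(response_seconds, max_weight=10, max_conn_cap=20):
    """Map a measured response time to hypothetical (weight, max_conn) values.

    Faster proxies get a larger weight and a higher connection cap.
    The formula is an illustrative assumption, not part of Squid.
    """
    if response_seconds <= 0:
        raise ValueError("response time must be positive")
    # One second or faster earns the maximum weight; slower proxies decay as 1/t.
    scaled = max_weight / max(response_seconds, 1.0)
    weight = max(1, min(max_weight, int(round(scaled))))
    # Allow twice as many concurrent connections as the weight, up to the cap.
    max_conn = min(max_conn_cap, weight * 2)
    return weight, max_conn


def cache_peer_line(ip, port, response_seconds, name):
    """Render one cache_peer line using the options computed above."""
    weight, max_conn = peer_options(response_seconds)
    return ("cache_peer {ip} parent {port} 0 no-query weighted-round-robin "
            "weight={w} connect-fail-limit=2 allow-miss max-conn={c} name={n}"
            ).format(ip=ip, port=port, w=weight, c=max_conn, n=name)
```

Any monotonic rule would do; the point is only that the database's response time feeds the weight= and max-conn= options written in the next step.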
2. Installing on Windows
http://www.squid-cache.org/Versions/
In some cases, you may want (or be forced) to download a binary package of Squid. They are available for a variety of platforms, including Windows.
https://wiki.squid-cache.org/SquidFaq/BinaryPackages
https://wiki.squid-cache.org/KnowledgeBase/Windows
MSI installer packages for Windows are at:
- 64-bit: http://squid.diladele.com/
Download the MSI directly. Recommended installation directory: C:\Squid\
Installing on CentOS:
https://wiki.squid-cache.org/SquidFaq/BinaryPackages

CentOS

Squid bundles with CentOS. However there is apparently no publicly available information about where to find the packages or who is bundling them. EPEL, DAG and RPMforge repositories appear to no longer contain any files. Other sources imply that CentOS is an alias for RHEL (we know otherwise). Although, yes, the RHEL packages should work on CentOS.

Maintainer: unknown
Bug Reporting: http://bugs.centos.org/search.php?category=squid&sortby=last_updated&hide_status_id=-2
Eliezer: 25/Apr/2017 - I have tested CentOS 7 RPMs for squid 3.5.25 on a small scale and it seems to be stable enough for 200-300 users as a forward proxy and basic features.

Stable Repository Package (like epel-release)
To install run the command:
yum install http://ngtech.co.il/repo/centos/7/squid-repo-1-1.el7.centos.noarch.rpm -y
or
rpm -i http://ngtech.co.il/repo/centos/7/squid-repo-1-1.el7.centos.noarch.rpm
and then install squid using the command:
yum install squid
3. Command-line reference
Help output:

C:\Squid\bin>squid -h
Usage: squid [-cdhvzCFNRVYX] [-n name] [-s | -l facility] [-f config-file] [-[au] port] [-k signal]
    -a port   Specify HTTP port number (default: 3128).
    -d level  Write debugging to stderr also.
    -f file   Use given config-file instead of /etc/squid/squid.conf
    -h        Print help message.
    -k reconfigure|rotate|shutdown|restart|interrupt|kill|debug|check|parse
              Parse configuration file, then send signal to running copy (except -k parse) and exit.
    -n name   Specify service name to use for service operations default is: squid.
    -s | -l facility
              Enable logging to syslog.
    -u port   Specify ICP port number (default: 3130), disable with 0.
    -v        Print version.
    -z        Create missing swap directories and then exit.
    -C        Do not catch fatal signals.
    -D        OBSOLETE. Scheduled for removal.
    -F        Don't serve any requests until store is rebuilt.
    -N        No daemon mode.
    -R        Do not set REUSEADDR on port.
    -S        Double-check swap during rebuild.
    -X        Force full debugging.
    -Y        Only return UDP_HIT or UDP_MISS_NOFETCH during fast reload.

Starting and stopping the service: in Computer Management, find "Squid for Windows"; right-click > Properties shows that the service name is squidsrv.

C:\Squid\bin>net start squidsrv
The requested service has already been started.
More help is available by typing NET HELPMSG 2182.

C:\Squid\bin>net stop squidsrv
The Squid for Windows service is stopping.
The Squid for Windows service was stopped successfully.

C:\Squid\bin>net start squidsrv
The Squid for Windows service is starting..
The Squid for Windows service was started successfully.

Reload the configuration:
C:\Squid\bin>squid -k reconfigure
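Since the help output lists both -k parse and -k reconfigure, a script can check the configuration before signalling the running instance. The following is a minimal sketch, assuming the install path recommended in section 2 and assuming that squid -k parse exits with a non-zero status when the configuration is invalid; squid_command and safe_reconfigure are hypothetical helper names.

```python
import subprocess

SQUID_BIN = "C:/Squid/bin/squid"  # assumption: the install directory suggested above


def squid_command(action, squid_bin=SQUID_BIN):
    """Build the argument list for `squid -k <action>`."""
    allowed = {"reconfigure", "rotate", "shutdown", "restart",
               "interrupt", "kill", "debug", "check", "parse"}
    if action not in allowed:
        raise ValueError("unsupported -k action: {}".format(action))
    return [squid_bin, "-k", action]


def safe_reconfigure(run=subprocess.call):
    """Parse the config first; signal the running Squid only if parsing succeeds."""
    if run(squid_command("parse")) != 0:
        return False  # leave the running instance on its current configuration
    return run(squid_command("reconfigure")) == 0
```

The run parameter exists only so the logic can be exercised without Squid installed; in normal use safe_reconfigure() is called with no arguments.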
4. Basic configuration and first use
Save a copy of C:\Squid\etc\squid\squid.conf as C:\Squid\etc\squid\squid_backup.conf for later use.
Confirm the default listening port:
# Squid normally listens to port 3128
http_access allow all
http_port 3128
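Leave the existing configuration unchanged and append only the following two lines at the end; see section 12. References, item (1).
Search for a free proxy IP yourself: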
cache_peer 58.22.61.211 parent 3128 0 no-query
never_direct allow all
Confirm with requests that the proxy is in effect:

In [7]: os.system('c:/Squid/bin/squid -k reconfigure')
Out[7]: 0

In [8]: import requests

In [9]: s = requests.Session()

In [10]: s.proxies = {'http': 'http://127.0.0.1:3128', 'https': 'https://127.0.0.1:3128'}

In [11]: s.get('http://httpbin.org/ip', timeout=10).text
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1
DEBUG:urllib3.connectionpool:http://127.0.0.1:3128 "GET http://httpbin.org/ip HTTP/1.1" 200 58
Out[11]: u'{\n "origin": "127.0.0.1, 163.125.31.126, 58.22.61.211"\n}\n'

In [12]: s.get('https://httpbin.org/ip', timeout=10).text
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): httpbin.org
DEBUG:urllib3.connectionpool:https://httpbin.org:443 "GET /ip HTTP/1.1" 200 31
Out[12]: u'{\n "origin": "58.22.61.211"\n}\n'
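The check above can be automated by parsing the httpbin reply rather than reading it by eye. This is a minimal sketch; origin_chain and went_through are hypothetical helper names, and PROXIES assumes Squid is listening locally on its default port. It uses an http:// proxy URL for both keys, because newer requests/urllib3 releases may treat an https:// proxy URL as a request to speak TLS to the proxy itself, which a plain http_port does not offer.

```python
import json

# Assumption: Squid listens locally on its default port.
# Both keys use an http:// proxy URL: the client talks plain HTTP to Squid's
# http_port and tunnels HTTPS through it with CONNECT.
PROXIES = {"http": "http://127.0.0.1:3128", "https": "http://127.0.0.1:3128"}


def origin_chain(body):
    """Return the list of IPs in the "origin" field of an httpbin.org/ip reply."""
    origin = json.loads(body)["origin"]
    return [ip.strip() for ip in origin.split(",")]


def went_through(body, parent_ip):
    """True if the last hop seen by the site is the expected parent proxy."""
    return origin_chain(body)[-1] == parent_ip
```

With requests, body would be s.get('http://httpbin.org/ip', timeout=10).text after setting s.proxies = PROXIES, and parent_ip would be the address from the cache_peer line.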
Official documentation:
http://www.squid-cache.org/Doc/config/never_direct/
Default Value: Allow DNS results to be used for this request.

Usage: never_direct allow|deny [!]aclname ...

never_direct is the opposite of always_direct. Please read the description for always_direct if you have not already.

With 'never_direct' you can use ACL elements to specify requests which should NEVER be forwarded directly to origin servers. For example, to force the use of a proxy for all requests, except those in your local domain use something like:

acl local-servers dstdomain .foo.net
never_direct deny local-servers
never_direct allow all

or if Squid is inside a firewall and there are local intranet servers inside the firewall use something like:

acl local-intranet dstdomain .foo.net
acl local-external dstdomain external.foo.net
always_direct deny local-external
always_direct allow local-intranet
never_direct allow all

This clause supports both fast and slow acl types. See http://wiki.squid-cache.org/SquidFaq/SquidAcl for details.
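5. Question: does Squid support HTTPS?
Note that when Squid acts as a forward proxy (its default configuration), the http_port 3128 port can also handle HTTPS proxy requests. As a forward proxy Squid does not take part in the SSL encryption or decryption; it only has to open a TCP connection from the user to port 443 of the website and then blindly relay the encrypted data between the two. The https_port directive, which gives Squid a certificate, is needed only when Squid is used as a reverse proxy.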
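The tunnelling described above is done with the HTTP CONNECT method: the client asks the proxy to open a TCP connection to host:443, and everything after the proxy's 2xx reply is opaque encrypted traffic. As a minimal illustration of what goes over the wire (the helper names are hypothetical, and a real client library does this for you):

```python
def build_connect_request(host, port=443):
    """Return the bytes a client sends to a forward proxy to open a tunnel."""
    target = "{}:{}".format(host, port)
    lines = [
        "CONNECT {} HTTP/1.1".format(target),
        "Host: {}".format(target),
        "",  # the two empty strings produce the blank line
        "",  # that terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


def tunnel_established(status_line):
    """True if the proxy's reply status line carries a 2xx code (tunnel open)."""
    parts = status_line.split()
    return len(parts) >= 2 and parts[1].startswith("2")
```

This matches the CONNECT httpbin.org:443 entry that appears in the access log in section 11.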
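6. Question: multiple proxy entries with the same IP but different ports cause an error
Proxies that share an IP but use different ports can occur, and Squid rejects them with:
FATAL: ERROR: cache_peer 42.227.87.205 specified twice
The fix is to append a name for the proxy at the end of each line: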
cache_peer 120.xx.xx.32 parent 80 0 no-query weighted-round-robin weight=2 connect-fail-limit=2 allow-miss max-conn=5 name=proxy-90
http://www.squid-cache.org/Doc/config/cache_peer/
name=xxx Unique name for the peer. Required if you have multiple peers on the same host but different ports. This name can be used in cache_peer_access and similar directives to identify the peer. Can be used by outgoing access controls through the peername ACL type.
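When the peer list is generated by a script, a unique name= can be derived from the entry's position. This is a minimal sketch; named_peer_lines and the "proxy-<index>" naming scheme are illustrative assumptions, and the other cache_peer options are left out for brevity.

```python
def named_peer_lines(proxies):
    """Render cache_peer lines with a unique name= per (ip, port) pair.

    `proxies` is an iterable of (ip, port) tuples; exact duplicates are dropped.
    The "proxy-<index>" naming scheme is an illustrative assumption.
    """
    lines = []
    seen = set()
    for ip, port in proxies:
        if (ip, port) in seen:
            continue  # an identical peer listed twice would be redundant
        seen.add((ip, port))
        lines.append("cache_peer {} parent {} 0 no-query name=proxy-{}".format(
            ip, port, len(lines)))
    return lines
```

Two entries for the same IP on ports 80 and 8080 then come out as name=proxy-0 and name=proxy-1, which avoids the "specified twice" error.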
7. Question: telling HTTP from HTTPS proxy requests and selecting the matching proxy entry
http://www.squid-cache.org/Doc/config/cache_peer/
http://www.squid-cache.org/Doc/config/cache_peer_access/
http://www.squid-cache.org/Doc/config/acl/
http://www.squid-cache.org/Doc/config/access_log/
This is done with ACLs. I have not yet been able to confirm with full certainty that it works!
acl acl_deny_http port 80
acl acl_deny_https port 443
cache_peer 219.156.151.20 parent 53281 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5 name=0_HTTP
cache_peer_access 0_HTTP deny acl_deny_https
cache_peer 175.155.248.190 parent 808 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5 name=83_HTTPS
cache_peer_access 83_HTTPS deny acl_deny_http
8. Question: proxy IP types: elite (high-anonymity) / anonymous / transparent
From the official documentation:
http://www.squid-cache.org/Doc/config/forwarded_for/
Default Value: forwarded_for on
If set to "on", Squid will append your client's IP address
in the HTTP requests it forwards. By default it looks like:
X-Forwarded-For: 192.1.2.3
If set to "off", it will appear as
X-Forwarded-For: unknown
If set to "transparent", Squid will not alter the
X-Forwarded-For header in any way.
If set to "delete", Squid will delete the entire
X-Forwarded-For header.
If set to "truncate", Squid will remove all existing
X-Forwarded-For entries, and place the client IP as the sole entry.
http://www.squid-cache.org/Doc/config/via/
Default Value: via on
If set (default), Squid will include a Via header in requests and replies as required by RFC2616.
http://www.squid-cache.org/Doc/config/request_header_access/
Default Value: No limits.
For example, to achieve the same behavior as the old
'http_anonymizer standard' option, you should use:
request_header_access From deny all
request_header_access Referer deny all
request_header_access User-Agent deny all
Drawing on the references, append the following to the end of squid.conf:
forwarded_for off
via off
forwarded_for transparent
request_header_access Via deny all
request_header_access X-Forwarded-For deny all
request_header_access From deny all
Before and after:

In [104]: from bs4 import BeautifulSoup as BS

In [106]: os.system('c:/Squid/bin/squid -k reconfigure')
     ...: r=s.get('http://www.iprivacytools.com/proxy-checker-anonymity-test/', timeout=10)
     ...: soup=BS(r.text, 'lxml')
     ...: print soup.select('div.content')[1].text
     ...:
DEBUG:urllib3.connectionpool:Starting new HTTP connection (16): 127.0.0.1
DEBUG:urllib3.connectionpool:http://127.0.0.1:3128 "GET http://www.iprivacytools.com/proxy-checker-anonymity-test/ HTTP/1.1" 200 2777
Your IP address and hostname: 58.22.61.211 (58.22.61.211)
Here are your headers that could reveal a proxy:
HTTP_VIA: 1.1 win7-PC (squid/3.5.26), 1.1 RD2:3128 (squid/2.7.STABLE7)
HTTP_X_FORWARDED_FOR: 127.0.0.1, 163.125.31.83
HTTP_FORWARDED_FOR: anonymous / none
HTTP_X_FORWARDED: anonymous / none
HTTP_FORWARDED: anonymous / none
HTTP_CLIENT_IP: anonymous / none
HTTP_FORWARDED_FOR_IP: anonymous / none
VIA: anonymous / none
X_FORWARDED_FOR: anonymous / none
FORWARDED_FOR: anonymous / none
X_FORWARDED: anonymous / none
FORWARDED: anonymous / none
CLIENT_IP: anonymous / none
FORWARDED_FOR_IP: anonymous / none
HTTP_PROXY_CONNECTION: anonymous / none
Proxy detected? YES
Here's how we know:
Your HTTP_VIA header shows: 1.1 win7-PC (squid/3.5.26), 1.1 RD2:3128 (squid/2.7.STABLE7)
Your HTTP_X_FORWARDED_FOR header shows: 127.0.0.1, 163.125.31.83
Again, please remember that this should not be considered a fullproof test of your anonymous surfing level, as it is only analyzing your browser headers. To surf via proxies with greater confidence, we highly suggest using a firewall and disabling all browser plugins and script support.

# Append the following to the end of squid.conf:
# forwarded_for off
# via off
# forwarded_for transparent
# request_header_access Via deny all
# request_header_access X-Forwarded-For deny all
# request_header_access From deny all
# Result for comparison:
In [107]: os.system('c:/Squid/bin/squid -k reconfigure')
     ...: r=s.get('http://www.iprivacytools.com/proxy-checker-anonymity-test/', timeout=10)
     ...: soup=BS(r.text, 'lxml')
     ...: print soup.select('div.content')[1].text
     ...:
DEBUG:urllib3.connectionpool:Resetting dropped connection: 127.0.0.1
DEBUG:urllib3.connectionpool:http://127.0.0.1:3128 "GET http://www.iprivacytools.com/proxy-checker-anonymity-test/ HTTP/1.1" 200 2749
Your IP address and hostname: 58.22.61.211 (58.22.61.211)
Here are your headers that could reveal a proxy:
HTTP_VIA: 1.1 RD2:3128 (squid/2.7.STABLE7)
HTTP_X_FORWARDED_FOR: 163.125.31.93
HTTP_FORWARDED_FOR: anonymous / none
HTTP_X_FORWARDED: anonymous / none
HTTP_FORWARDED: anonymous / none
HTTP_CLIENT_IP: anonymous / none
HTTP_FORWARDED_FOR_IP: anonymous / none
VIA: anonymous / none
X_FORWARDED_FOR: anonymous / none
FORWARDED_FOR: anonymous / none
X_FORWARDED: anonymous / none
FORWARDED: anonymous / none
CLIENT_IP: anonymous / none
FORWARDED_FOR_IP: anonymous / none
HTTP_PROXY_CONNECTION: anonymous / none
Proxy detected? YES
Here's how we know:
Your HTTP_VIA header shows: 1.1 RD2:3128 (squid/2.7.STABLE7)
Your HTTP_X_FORWARDED_FOR header shows: 163.125.31.93
Again, please remember that this should not be considered a fullproof test of your anonymous surfing level, as it is only analyzing your browser headers. To surf via proxies with greater confidence, we highly suggest using a firewall and disabling all browser plugins and script support.

The local machine's information is now hidden: the win7-PC entry in Via and the 127.0.0.1 entry in X-Forwarded-For are gone. Before the change, the two headers read:
Your HTTP_VIA header shows: 1.1 win7-PC (squid/3.5.26), 1.1 RD2:3128 (squid/2.7.STABLE7)
Your HTTP_X_FORWARDED_FOR header shows: 127.0.0.1, 163.125.31.83
9. Question: forward / reverse / transparent proxies
xxx
10. Updating the configuration with a Python script
Obtain the list of usable proxy IPs in the format shown for ip_port_type_tuple_list below:

import os
import time

ip_port_type_tuple_list = [('1.1.1.1', '80', 'HTTP'), ('1.1.1.2', '1080', 'HTTPS'), ('1.1.1.3', '3128', 'both')]

def update_squid_conf():
    bk_path = 'C:/Squid/etc/squid/squid_backup.conf'
    conf_path = 'C:/Squid/etc/squid/squid.conf'
    fmt = ('cache_peer {ip} parent {port} 0 no-query weighted-round-robin weight=1 '
           'connect-fail-limit=2 allow-miss max-conn=5 name={name}')
    pre_lines = ['\n#\n#\n#\nhttp_access allow all',
                 'read_timeout 30 seconds',
                 'request_timeout 30 seconds',
                 'acl acl_deny_http port 80',
                 'acl acl_deny_https port 443']
    post_lines = ['never_direct allow all',
                  'forwarded_for off',
                  'via off',
                  'forwarded_for transparent',
                  'request_header_access Via deny all',
                  'request_header_access X-Forwarded-For deny all',
                  'request_header_access From deny all']
    # De-duplicate the entries, then group them by type.
    merge = sorted(set(ip_port_type_tuple_list), key=lambda x: x[-1])
    count = 0
    # Rebuild squid.conf from the pristine backup on every run.
    with open(bk_path, 'r') as bk_file, open(conf_path, 'w') as conf_file:
        conf_file.write(bk_file.read() + '\n')
        conf_file.write('\n'.join(pre_lines) + '\n')
        for index, (ip, port, _type) in enumerate(merge):
            name = '{}_{}'.format(index, _type)
            item = fmt.format(ip=ip, port=port, name=name)
            # HTTP-only peers refuse port 443; HTTPS-only peers refuse port 80.
            if _type == 'HTTP':
                item += '\ncache_peer_access %s deny acl_deny_https' % name
            elif _type == 'HTTPS':
                item += '\ncache_peer_access %s deny acl_deny_http' % name
            conf_file.write(item + '\n')
            count += 1
        conf_file.write('\n'.join(post_lines) + '\n')
    # Reload only after the file has been closed and flushed.
    assert os.system('c:/Squid/bin/squid -k reconfigure') == 0, 'update fail'
    print('{} {}/{}'.format(time.ctime(), count, len(merge)))
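11. Logging
# access_log configures the access log. "daemon" means the log is written to /var/log/squid/access.log in the background;
# combined is a predefined logformat; a custom logformat can be used instead
access_log daemon:/var/log/squid/access.log combined
# debug_options sets the log level of cache.log
# ALL means every section at log level 1; 28 is the ACL section at level 5; 29 is the user authentication section at level 9
debug_options ALL,1 28,5 29,9
You can also simply add: access_log daemon:c:/Squid/var/log/squid/temp.log squid
Check the log to confirm which parent proxy was used. HTTPS requests show up as TCP_TUNNEL: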
1503895567.104 5567 127.0.0.1 TCP_MISS/200 510 GET http://httpbin.org/ip - FIRSTUP_PARENT/58.22.61.211 application/json
1503895643.345 67037 127.0.0.1 TCP_TUNNEL/200 3377 CONNECT httpbin.org:443 - FIRSTUP_PARENT/58.22.61.211 -
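To confirm the parent proxy across many requests, the log can be parsed instead of read by eye. This is a minimal sketch, assuming the native squid logformat shown in the two lines above (ten whitespace-separated fields); parse_access_line and the dictionary keys are illustrative names.

```python
def parse_access_line(line):
    """Split one line of Squid's native access.log format into named fields."""
    fields = line.split()
    if len(fields) < 10:
        raise ValueError("not a native-format access.log line")
    # Field 4 is "<result code>/<HTTP status>", field 9 is "<hierarchy code>/<peer>".
    result_code, _, http_status = fields[3].partition("/")
    hierarchy, _, peer = fields[8].partition("/")
    return {
        "timestamp": float(fields[0]),
        "elapsed_ms": int(fields[1]),
        "client": fields[2],
        "result_code": result_code,
        "http_status": int(http_status or 0),
        "bytes": int(fields[4]),
        "method": fields[5],
        "url": fields[6],
        "hierarchy": hierarchy,
        "peer": peer,
        "content_type": fields[9],
    }
```

Counting the "peer" values over a whole log then shows how requests were spread across the configured cache_peer entries.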
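12. References
Official site: http://www.squid-cache.org/Doc/config/cache_peer/
Chinese documentation: http://zyan.cc/book/squid/index.html
Search for cache_peer:
cache_peer hostname type proxy-port icp-port
Enter a parent proxy here (for example, if you want to use your ISP's proxy). For hostname, enter the name or IP address of the proxy to use; for type, enter parent. For proxy-port, enter the port number that the parent's operator also specifies for use in browsers (usually 8080). If the parent's ICP port is unknown and its use is irrelevant to the provider, set icp-port to 7 or 0. In addition, specify default and no-query after the port numbers to prohibit use of the ICP protocol. With the provider's proxy in place, Squid then behaves like an ordinary browser.
never_direct allow acl_name
To prevent Squid from taking requests directly from the Internet, use the command above to force a connection to another proxy. That proxy must already have been entered in cache_peer. If acl_name is set to all, every request is forwarded straight to the parent. This can be necessary, for example, when your provider strictly requires the use of its proxies or denies direct Internet access through its firewall.
forwarded_for on
If this entry is set to off, Squid removes the client's IP address and system name from HTTP requests. Otherwise it adds the following line to the header: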
X-Forwarded-For: 192.168.0.1
(2) Building a proxy server with Squid
See section 5 for this article's note on HTTPS handling in forward-proxy mode (http_port versus https_port).
# Deny all requests; the final catch-all rule
http_access deny all
Note: Squid evaluates http_access rules in the order in which they appear in the configuration file. The first http_access rule (allow or deny) whose conditions match is applied immediately, and no later http_access rule is checked.
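The first-match behaviour can be pictured as a loop that stops at the first rule whose condition holds. The following is a toy model only, not Squid's implementation: evaluate_http_access, RULES, and the request dictionary are illustrative assumptions, and the sketch deliberately does not model what Squid does when no rule matches.

```python
def evaluate_http_access(rules, request):
    """Return the action of the first rule whose predicate matches.

    `rules` is an ordered list of (action, predicate) pairs, mirroring the
    order of http_access lines in squid.conf. Returns None when nothing
    matches; this sketch does not model Squid's default for that case.
    """
    for action, predicate in rules:
        if predicate(request):
            return action  # first match wins; later rules are never consulted
    return None


# Hypothetical rule set: allow localhost, then deny everything else.
RULES = [
    ("allow", lambda req: req["src"] == "127.0.0.1"),
    ("deny", lambda req: True),   # plays the role of "http_access deny all"
    ("allow", lambda req: True),  # never reached: shadowed by the rule above
]
```

This is why an http_access allow line appended after an existing http_access deny all has no effect.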
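Visit http://www.hawu.me through the proxy, open the Network panel of the browser's developer tools, and inspect the request. Remote Address is the proxy we configured, and the Response Headers include the proxy server name I defined, "funway.aliyun.proxy", which shows that the response came back through our proxy server.
Squid makes it easy to set up an HTTP proxy server, but as the blocked case above shows, a Squid proxy outside the firewall is not enough by itself to get past the firewall. In that case you need an stunnel between the user inside the firewall and the Squid outside it, to encrypt the requests sent to Squid. For details, see the next article: http://www.hawu.me/operation/886
Anonymous proxies:
Three pieces of information in an HTTP request let a server identify the user: remote_addr, http_via, and http_x_forwarded_for.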

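When a user accesses a website directly, without a proxy, the request carries:
remote_addr = the user's real IP
http_via = absent
http_x_forwarded_for = absent

When a user goes through an ordinary proxy, the server knows that a proxy is in use and knows the user's real IP. The request carries:
remote_addr = the proxy server's IP
http_via = the proxy server's hostname (Squid's visible_hostname)
http_x_forwarded_for = the user's real IP (with several layers of proxies, this should be the whole IP chain except the last hop)

When a user goes through an anonymous proxy, the server knows that a proxy is in use but does not know the user's real IP. The request carries:
remote_addr = the proxy server's IP
http_via = the proxy server's hostname
http_x_forwarded_for = the proxy server's IP

When a user goes through a high-anonymity (elite) proxy, the server knows neither that a proxy is in use nor the user's real IP. The request carries:
remote_addr = the proxy server's IP
http_via = absent
http_x_forwarded_for = absent

By default Squid acts as an ordinary proxy: Via is enabled and http_x_forwarded_for is written. To make it an anonymous proxy, change just these two settings:
# Turn off Via
via off
# Leave http_x_forwarded_for unmodified
forwarded_for transparent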
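The four cases above translate directly into a small classifier, which is handy when grading proxies in bulk. This is a minimal sketch following only the rules stated in this section; classify_proxy and its return labels are illustrative names, and real detection sites look at more headers than these three.

```python
def classify_proxy(remote_addr, http_via, http_x_forwarded_for, real_ip):
    """Classify what the target server can infer, per the four cases above.

    Pass None for a header that is absent. Returns one of
    "direct", "ordinary", "anonymous", or "elite".
    """
    if remote_addr == real_ip:
        return "direct"  # the server sees the user's own address
    forwarded = [ip.strip()
                 for ip in (http_x_forwarded_for or "").split(",")
                 if ip.strip()]
    if real_ip in forwarded:
        return "ordinary"  # the real IP leaks through X-Forwarded-For
    if http_via or forwarded:
        return "anonymous"  # proxy use is visible, the real IP is not
    return "elite"  # no header reveals a proxy
```

The values for the three header arguments would come from a header-echo service such as the proxy checker page used in section 8.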
http_access allow all
http_port 64441
read_timeout 10 seconds
request_timeout 10 seconds
cache_peer ec2-52-197-85-24.ap-northeast-1.compute.amazonaws.com parent 64441 0 no-query round-robin
never_direct allow all
(4) Building your own hundred-million-scale crawler IP proxy pool
cache_peer IP parent PORT 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5
# 3. Reload the configuration file
os.system('squid -k reconfigure')
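Usage
- Set up and run a high-anonymity Squid server following the methods described in "Setting up a forward proxy server with Squid" and "Configuring Squid as a high-anonymity proxy".
Documentation reference:
Simply append the following configuration to the end of the configuration file /etc/squid/squid.conf.
request_header_access Via deny all
request_header_access X-Forwarded-For deny all
request_header_access From deny all
Visit http://httpbin.org/ip; if only the Squid server's IP is returned, high anonymity is in effect.
Alternatively, visit Proxy Checker, which shows detailed proxy-detection information. If the top of the page shows NO PROXY DETECTED, the high-anonymity proxy has been set up successfully.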
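(5) Squid: The Definitive Guide (Chinese edition)

10.11 How do I ...?
Squid newcomers often ask the same or similar questions about how to make Squid forward requests correctly. Here I show how to configure Squid for the common situations.

10.11.1 Send all requests through another proxy?
Simply define a parent cache and tell Squid that it is not allowed to connect directly to origin servers. For example:
cache_peer parent.host.name parent 3128 0
acl All src 0/0
never_direct allow All
The drawback of this configuration is that if the parent cache goes down, Squid cannot forward cache misses. When that happens, users receive a "cannot forward" error message.

10.11.2 Send all requests through another proxy unless it is down?
Try this configuration:
nonhierarchical_direct off
prefer_direct off
cache_peer parent.host.name parent 3128 0 default no-query
Or, if you would like to use ICP with the other proxy:
nonhierarchical_direct off
prefer_direct off
cache_peer parent.host.name parent 3128 3130 default
With this configuration, Squid forwards all cache misses to the parent for as long as the parent is alive. Using ICP lets Squid detect a dead parent quickly, but in some situations it may also wrongly declare the parent dead.

10.11.3 Make sure Squid does not use neighbor caches for certain requests?
Define an ACL that matches the special requests:
cache_peer parent.host.name parent 3128 0
acl Special dstdomain special.server.name
always_direct allow Special
In this case, cache misses for requests to the special.server.name domain are always sent to the origin server. Other requests may or may not go through the parent cache.

10.11.4 Send some requests through a parent cache to bypass a local filter?
Some ISPs (or other organizations) have upstream providers that force HTTP traffic through a filtering proxy (perhaps using HTTP interception). If you can use a different proxy outside their network, you can bypass their filter. Here is how to send only the special requests to the distant proxy:
cache_peer far-away-parent.host.name parent 3128 0
acl BlockedSites dstdomain www.censored.com
cache_peer_access far-away-parent.host.name allow BlockedSites
never_direct allow BlockedSites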
(6) Squid configuration: cache_peer and cache_peer_domain explained
Restart the machine, or run "net start squid" on the command line, to start the service.