Nginx Anti-Crawler Optimization

Reposted article · 2019-09-03 22:09 · Nginx

Repost summary:

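Method 1: Create a robots.txt text file that lists the paths search engines should not visit, then upload it to the site's root directory. When a search engine spider indexes a site, it first checks whether robots.txt exists in the root directory. Note that robots.txt is only advisory: well-behaved crawlers honor it, but nothing forces a crawler to obey.
# Excerpt from JD.com's robots.txt
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat <<'EOF' > robots.txt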
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
EOF
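# Excerpt from Taobao's robots.txt (an alternative example; running it overwrites the robots.txt written above)
cat <<'EOF' > robots.txt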
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Allow: /$
Disallow: /product/
Disallow: /

User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /

User-agent: Bingbot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /

User-Agent: 360Spider
Allow: /article
Allow: /oshtml
Allow: /ershou
Disallow: /

User-Agent: Yisouspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Disallow: /

User-Agent: Sogouspider
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /ershou
Disallow: /

User-Agent: Yahoo! Slurp
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /

User-Agent: *
Disallow: /
EOF
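
Once robots.txt is in the site root, nginx serves it like any other static file. The fragment below is a minimal sketch of a dedicated location for it. The host name example.com and the root /var/www/html are assumptions for illustration and do not come from the original article.

```nginx
server {
    listen 80;
    server_name example.com;   # hypothetical host name
    root /var/www/html;        # hypothetical site root; robots.txt is uploaded here

    # Exact-match location, so requests for /robots.txt skip regex location checks.
    location = /robots.txt {
        access_log off;        # optional: keep crawler polling out of the access log
        log_not_found off;     # optional: no error-log entry if the file is missing
    }
}
```

Both logging directives are optional. The file is served correctly without this location as long as it sits under the server's root.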

Method 2: Use the client's User-Agent header to block specific crawlers from the site.

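1. Block download agents with the following configuration:
## Block download agents ##
if ($http_user_agent ~* "LWP::Simple|BBBike|wget")
{
    return 403;
}
# Note: if the client's User-Agent matches one of the patterns after "if" (for example wget), nginx returns 403.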
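
The article does not say where this block belongs. The sketch below shows one possible placement, directly inside a server block, so that it applies to every request that server handles. The host name and root path are hypothetical. The pattern is quoted and written without spaces because nginx splits unquoted arguments on whitespace.

```nginx
server {
    listen 80;
    server_name example.com;   # hypothetical host name
    root /var/www/html;        # hypothetical site root

    # Checked for every request to this server, before any location is chosen.
    if ($http_user_agent ~* "LWP::Simple|BBBike|wget") {
        return 403;
    }

    location / {
        index index.html;
    }
}
```

A request such as `curl -A wget http://example.com/` should then receive 403, while an ordinary browser User-Agent passes through.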

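2. Read the client agent from $http_user_agent, then decide whether to allow the request or return a specific status code.
The following configuration keeps a long list of crawler agents away from the site:
# The crawler agents are separated by "|"; add or remove entries to suit your needs.
if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot")
{
    return 403;
}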

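3. Test blocking specific browsers
if ($http_user_agent ~* "Firefox|MSIE")
{
    # The leading "/" is matched outside the capture so the target URL does not contain "//".
    rewrite ^/(.*) http://www.wk.com/$1 permanent;
}
# Requests from Firefox or IE are redirected to http://www.wk.com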
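
The same redirect can also be written with `return`, which needs no regex capture. This is an alternative sketch and not the form used in the original article. `$request_uri` is the original request path plus query string and always begins with "/".

```nginx
if ($http_user_agent ~* "Firefox|MSIE") {
    # 301 is the same permanent redirect that "rewrite ... permanent" issues.
    return 301 http://www.wk.com$request_uri;
}
```

If www.wk.com is served by this same server block, either form would redirect in a loop, so the check is only meaningful when the target is a different host.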

4. Restrict request methods
# Only allow these request methods
if ($request_method !~ "^(GET|HEAD|POST)$")
{
    return 501;
}
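
nginx also has a dedicated directive for this, `limit_except`, which works inside a location. The sketch below is an alternative to the `if` form rather than the article's method, and it behaves slightly differently: rejected methods get 403 instead of 501.

```nginx
location / {
    # Allowing GET also allows HEAD; every method other than GET, HEAD and POST is denied.
    limit_except GET POST {
        deny all;
    }
}
```

Keep the `if` form when the status code must stay 501 or when the rule should apply to the whole server rather than to a single location.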

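Common junk User-Agents seen on the web

User-Agent                 Typical use
FeedDemon                  content scraping
BOT/0.1 (BOT for JCE)      SQL injection
CrawlDaddy                 SQL injection
Java                       content scraping
Jullo                      content scraping
Feedly                     content scraping
UniversalFeedParser        content scraping
ApacheBench                CC (HTTP flood) attack tool
Swiftbot                   useless crawler
YandexBot                  useless crawler
AhrefsBot                  useless crawler
YisouSpider                useless crawler
jikeSpider                 useless crawler
MJ12bot                    useless crawler
ZmEu                       phpMyAdmin vulnerability scanning
WinHttp                    scraping, CC attacks
EasouSpider                useless crawler
HttpClient                 TCP attacks
Microsoft URL Control      scanning
YYSpider                   useless crawler
jaunty                     WordPress brute-force scanner
oBot                       useless crawler
Python-urllib              content scraping
Indy Library               scanning
FlightDeckReports Bot      useless crawler
Linguee Bot                useless crawler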
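
A list this long is awkward to maintain as a single regex inside `if`. One possible arrangement, sketched below, moves the entries into a `map`. The variable name `$is_junk_ua`, the host name, and the choice of entries are assumptions for illustration. The fragment assumes it is loaded inside the http block, as files under conf.d/ commonly are.

```nginx
# http context: evaluate the User-Agent once and expose the result as a flag.
map $http_user_agent $is_junk_ua {   # $is_junk_ua is a hypothetical name
    default                     0;
    "~*FeedDemon"               1;   # ~* is a case-insensitive regex match
    "~*ApacheBench"             1;
    "~*AhrefsBot"               1;
    "~*MJ12bot"                 1;
    "~*ZmEu"                    1;
    "~*Python-urllib"           1;
    "~*Microsoft URL Control"   1;   # quotes are required because of the spaces
    "~oBot"                     1;   # ~ is case-sensitive, so "robot" does not match
}

server {
    listen 80;
    server_name example.com;         # hypothetical host name

    # "if" treats an empty string or "0" as false, anything else as true.
    if ($is_junk_ua) {
        return 403;
    }
}
```

The patterns match substrings, so short tokens from the list such as Java or oBot can catch unrelated clients. Matching them case-sensitively, or leaving them out, reduces false positives.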
