用nginx屏蔽爬蟲的方法


用nginx屏蔽爬蟲的方法

1. 使用"robots.txt"規范

在網站根目錄新建空白文件,命名為"robots.txt",將下面內容保存即可。

User-agent: BaiduSpider
Disallow:
User-agent: YisouSpider
Disallow:
User-agent: 360Spider
Disallow:
User-agent: Sosospider
Disallow:
User-agent: SogouSpider
Disallow:
User-agent: YodaoBot
Disallow:
User-agent: Googlebot
Disallow:
User-agent: bingbot
Disallow:
User-agent: *
Disallow: /


2. nginx

在http字段下加入一個map做匹配引導

map $http_user_agent $limit_bots {
default 0;
~*(baiduspider|qqspider|google|soso|bing|yandex|sogou|bingbot|yahoo|sohu-search|yodao|YoudaoBot|robozilla|msnbot|MJ12bot|NHN|Twiceler) 1;
~*(AltaVista|Googlebot|Slurp|BlackWidow|Bot|ChinaClaw|Custo|DISCo|Download|Demon|eCatch|EirGrabber|EmailSiphon|EmailWolf|SuperHTTP|Surfbot|WebWhacker) 1;
~*(Express|WebPictures|ExtractorPro|EyeNetIE|FlashGet|GetRight|GetWeb!|Go!Zilla|Go-Ahead-Got-It|GrabNet|Grafula|HMView|Go!Zilla|Go-Ahead-Got-It) 1;
~*(rafula|HMView|HTTrack|Stripper|Sucker|Indy|InterGET|Ninja|JetCar|Spider|larbin|LeechFTP|Downloader|tool|Navroad|NearSite|NetAnts|tAkeOut|WWWOFFLE) 1;
~*(GrabNet|NetSpider|Vampire|NetZIP|Octopus|Offline|PageGrabber|Foto|pavuk|pcBrowser|RealDownload|ReGet|SiteSnagger|SmartDownload|SuperBot|WebSpider) 1;
~*(Teleport|VoidEYE|Collector|WebAuto|WebCopier|WebFetch|WebGo|WebLeacher|WebReaper|WebSauger|eXtractor|Quester|WebStripper|WebZIP|Wget|Widow|Zeus) 1;
~*(Twengabot|htmlparser|libwww|Python|perl|urllib|scan|Curl|email|PycURL|Pyth|PyQ|WebCollector|WebCopy|webcraw) 1;
}
再到server字段或者是location字段下加入if判斷
if ($limit_bots = 1) {return 403;}

 

 

 

 


將下面代碼添加到"location / { }" 段里面,比如偽靜態規則里面。

#禁止Scrapy等工具的抓取
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}

#禁止指定UA及UA為空的訪問
if ($http_user_agent ~ "BaiduSpider|JiKeSpider|YandexBot|Bytespider|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadW
ebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bo
t|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Ezooms|^$" ) {
return 404;
}

#禁止非GET|HEAD|POST方式的抓取, ~ 為模糊匹配 ~* 為模糊匹配不區分大小寫
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}

if ($http_user_agent ~ "Mozilla/4.0\ \(compatible;\ MSIE\ 6.0;\ Windows\ NT\ 5.1;\ SV1;\ .NET\ CLR\ 1.1.4322;\ .NET\ CLR\ 2.0.50727\)" ) {
return 404;
}
if ($http_user_agent ~ "Mozilla/5.0+(compatible;+Baiduspider/2.0;++http://www.baidu.com/search/spider.html)") {
return 404;
}

if ($http_user_agent ~ "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X)(compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)") {
return 404;
}
if ($http_user_agent ~ "Mozilla/5.0 (Linux; Android 10; VCE-AL00 Build/HUAWEIVCE-AL00; wv)") {
return 404;
}


測試一下:

curl -I -A "Mozilla/5.0Macintosh;IntelMacOSX10_12_0AppleWebKit/537.36KHTML,likeGeckoChrome/60.0.6967.1704Safari/537.36;YandexBot" http://www.xxxxx.com

返回 403 表示設置成功!


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM