用nginx屏蔽爬蟲的方法
1. 使用"robots.txt"規范
在網站根目錄新建空白文件,命名為"robots.txt",將下面內容保存即可。
User-agent: BaiduSpider
Disallow:
User-agent: YisouSpider
Disallow:
User-agent: 360Spider
Disallow:
User-agent: Sosospider
Disallow:
User-agent: SogouSpider
Disallow:
User-agent: YodaoBot
Disallow:
User-agent: Googlebot
Disallow:
User-agent: bingbot
Disallow:
User-agent: *
Disallow: /
2. nginx
在http字段下加入一個map做匹配引導
map $http_user_agent $limit_bots {
default 0;
~*(baiduspider|qqspider|google|soso|bing|yandex|sogou|bingbot|yahoo|sohu-search|yodao|YoudaoBot|robozilla|msnbot|MJ12bot|NHN|Twiceler) 1;
~*(AltaVista|Googlebot|Slurp|BlackWidow|Bot|ChinaClaw|Custo|DISCo|Download|Demon|eCatch|EirGrabber|EmailSiphon|EmailWolf|SuperHTTP|Surfbot|WebWhacker) 1;
~*(Express|WebPictures|ExtractorPro|EyeNetIE|FlashGet|GetRight|GetWeb!|Go!Zilla|Go-Ahead-Got-It|GrabNet|Grafula|HMView|Go!Zilla|Go-Ahead-Got-It) 1;
~*(rafula|HMView|HTTrack|Stripper|Sucker|Indy|InterGET|Ninja|JetCar|Spider|larbin|LeechFTP|Downloader|tool|Navroad|NearSite|NetAnts|tAkeOut|WWWOFFLE) 1;
~*(GrabNet|NetSpider|Vampire|NetZIP|Octopus|Offline|PageGrabber|Foto|pavuk|pcBrowser|RealDownload|ReGet|SiteSnagger|SmartDownload|SuperBot|WebSpider) 1;
~*(Teleport|VoidEYE|Collector|WebAuto|WebCopier|WebFetch|WebGo|WebLeacher|WebReaper|WebSauger|eXtractor|Quester|WebStripper|WebZIP|Wget|Widow|Zeus) 1;
~*(Twengabot|htmlparser|libwww|Python|perl|urllib|scan|Curl|email|PycURL|Pyth|PyQ|WebCollector|WebCopy|webcraw) 1;
}
再到server字段或者是location字段下加入if判斷
if ($limit_bots = 1) {return 403;}
將下面代碼添加到"location / { }" 段里面,比如偽靜態規則里面。
#禁止Scrapy等工具的抓取
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
#禁止指定UA及UA為空的訪問
if ($http_user_agent ~ "BaiduSpider|JiKeSpider|YandexBot|Bytespider|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadW
ebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bo
t|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Ezooms|^$" ) {
return 404;
}
#禁止非GET|HEAD|POST方式的抓取, ~ 為模糊匹配 ~* 為模糊匹配不區分大小寫
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}
if ($http_user_agent ~ "Mozilla/4.0\ \(compatible;\ MSIE\ 6.0;\ Windows\ NT\ 5.1;\ SV1;\ .NET\ CLR\ 1.1.4322;\ .NET\ CLR\ 2.0.50727\)" ) {
return 404;
}
if ($http_user_agent ~ "Mozilla/5.0+(compatible;+Baiduspider/2.0;++http://www.baidu.com/search/spider.html)") {
return 404;
}
if ($http_user_agent ~ "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X)(compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)") {
return 404;
}
if ($http_user_agent ~ "Mozilla/5.0 (Linux; Android 10; VCE-AL00 Build/HUAWEIVCE-AL00; wv)") {
return 404;
}
測試一下:
curl -I -A "Mozilla/5.0Macintosh;IntelMacOSX10_12_0AppleWebKit/537.36KHTML,likeGeckoChrome/60.0.6967.1704Safari/537.36;YandexBot" http://www.xxxxx.com
返回 403 表示設置成功!