I've recently been building a very large crawler: close to 10,000 HTTP requests in total, all sent to the same URL, differing only in the request body.
Testing showed frequent errors like request error: read ECONNRESET and request error: socket hang up. The logs showed that the failing requests had different bodies each time, and fetching the same resource the normal way (through the mobile app) never failed. Even though I had already swapped forEach for the for ... of + await pattern to cap the number of concurrent async requests (a sketch combining everything appears at the end of this post), four to five hundred of the 10,000 requests would still fail like this on every run. My guess was that the server has some protection mechanism that rejects too many requests from a single IP. After some searching, I found the solution in the following link:
https://www.gregjs.com/javascript/2015/how-to-scrape-the-web-gently-with-node-js/
Limiting maximum concurrent sockets in Node
That is, cap the maximum number of concurrent sockets:
var http = require('http');
var https = require('https');
http.globalAgent.maxSockets = 1;
https.globalAgent.maxSockets = 1;
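With the global agent capped like this, any plain http/https request is queued behind the limit. A minimal sketch of the effect, assuming example.com as a stand-in host:

var http = require('http');
http.globalAgent.maxSockets = 1;

// Fire several requests at once; the global agent opens at most one
// socket to the host, so they actually go out one after another.
for (var i = 0; i < 5; i++) {
  http.get('http://example.com/', function (res) {
    console.log('status:', res.statusCode);
    res.resume(); // drain the body so the socket is freed for the next request
  });
}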
Since I'm using the request package, I checked the npm docs (https://www.npmjs.com/package/request):
pool - an object describing which agents to use for the request. If this option is omitted the request will use the global agent (as long as your options allow for it). Otherwise, request will search the pool for your custom agent. If no custom agent is found, a new agent will be created and added to the pool. Note: pool is used only when the agent option is not specified.
- A maxSockets property can also be provided on the pool object to set the max number of sockets for all agents created (ex: pool: {maxSockets: Infinity}).
- Note that if you are sending multiple requests in a loop and creating multiple new pool objects, maxSockets will not work as intended. To work around this, either use request.defaults with your pool options or create the pool object with the maxSockets property outside of the loop.
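Following that note, one option is to bake the pool into a defaulted instance via request.defaults. A minimal sketch, where the URL and the limit of 10 are placeholders:

var request = require('request');

// Every request made through limitedRequest shares one pool,
// so maxSockets is enforced across the whole crawl.
var limitedRequest = request.defaults({
  pool: { maxSockets: 10 }
});

limitedRequest.post({ url: 'http://example.com/api', body: 'some body' },
  function (err, res, body) {
    if (err) return console.error('request error:', err.message);
    console.log('status:', res.statusCode);
  });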
So the problem is solved by adding a pool option to the request options, for example pool: {maxSockets: 10}.
Of course, finding the right maximum takes repeated testing: set it too low and the crawl takes far too long; too high and the earlier errors come back.
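Putting the pieces together, here is a minimal sketch of the crawler's final shape. It assumes Node 8+ for util.promisify; the URL, the bodies, and the limit of 10 are placeholders to be tuned as described above:

var request = require('request');
var util = require('util');

var post = util.promisify(request.post);

// The pool object is created once, outside the loop, so every
// request shares it and maxSockets works as intended.
var pool = { maxSockets: 10 };

async function crawl(bodies) {
  // for ... of + await sends the requests one at a time instead of
  // firing them all at once the way forEach would.
  for (var body of bodies) {
    try {
      var res = await post({
        url: 'http://example.com/api', // placeholder for the real endpoint
        pool: pool,
        body: body
      });
      console.log('status:', res.statusCode);
    } catch (err) {
      // e.g. read ECONNRESET / socket hang up when the server pushes back
      console.error('request error:', err.message);
    }
  }
}

crawl(['body-1', 'body-2', 'body-3']);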
