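Background:
I needed to scrape data from a website that shows 10 records per page across many pages (laid out like a table). I used HttpClient to fetch the data row by row and page by page, but after many loop iterations an error was always thrown at some unpredictable point.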

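After checking my own code logic and finding nothing, I started searching Baidu frantically. The explanation given online:
The server closed the Connection for some reason while the client was still reading or writing data.
The suggested fixes were:
- Make the client and the server agree on using either long-lived (keep-alive) TCP connections or short-lived ones.
- Check whether the client closed the connection itself. I checked my code and it never closes it.
Neither of these resolved the problem, so I decided to read the source code at the point where the error is thrown: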
    int read(byte b[], int off, int length, int timeout) throws IOException {
        int n;

        // EOF already encountered
        if (eof) {
            return -1;
        }

        // connection reset
        if (impl.isConnectionReset()) {
            throw new SocketException("Connection reset");
        }

        // bounds check
        if (length <= 0 || off < 0 || length > b.length - off) {
            if (length == 0) {
                return 0;
            }
            throw new ArrayIndexOutOfBoundsException("length == " + length
                    + " off == " + off + " buffer length == " + b.length);
        }

        boolean gotReset = false;

        // acquire file descriptor and do the read
        FileDescriptor fd = impl.acquireFD();
        try {
            n = socketRead(fd, b, off, length, timeout);
            if (n > 0) {
                return n;
            }
        } catch (ConnectionResetException rstExc) {
            gotReset = true;
        } finally {
            impl.releaseFD();
        }

        /*
         * We receive a "connection reset" but there may be bytes still
         * buffered on the socket
         */
        if (gotReset) {
            impl.setConnectionResetPending();
            impl.acquireFD();
            try {
                n = socketRead(fd, b, off, length, timeout);
                if (n > 0) {
                    return n;
                }
            } catch (ConnectionResetException rstExc) {
            } finally {
                impl.releaseFD();
            }
        }

        /*
         * If we get here we are at EOF, the socket has been closed,
         * or the connection has been reset.
         */
        if (impl.isClosedOrPending()) {
            throw new SocketException("Socket closed");
        }
        if (impl.isConnectionResetPending()) {
            impl.setConnectionReset();
        }
        if (impl.isConnectionReset()) {
            throw new SocketException("Connection reset");
        }
        eof = true;
        return -1;
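According to the error message in the screenshot, the exception comes from the fourth line from the end of this excerpt, the final throw new SocketException("Connection reset"). Reading backwards from there, that statement is only reached when the read did not return n > 0, and n comes from
n = socketRead(fd, b, off, length, timeout);
I took this for a timeout problem, so I added a timeout setting to my code.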
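The exact snippet is not shown here. Purely as an illustration, and assuming the JDK 11+ `java.net.http.HttpClient` (the post does not say which HttpClient library was in use), timeouts could be set roughly like this. The class name `TimeoutConfig`, its method names, and the URL in the usage line are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical sketch: connect and response timeouts with the JDK 11+ HTTP client. */
final class TimeoutConfig {

    /** Builds a client that gives up if the TCP connection is not established in time. */
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .build();
    }

    /** Builds a GET request that times out if no response arrives within responseTimeout. */
    static HttpRequest newPageRequest(String url, Duration responseTimeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(responseTimeout)
                .GET()
                .build();
    }
}
```

Usage would be along the lines of `TimeoutConfig.newPageRequest("http://example.invalid/list?page=1", Duration.ofSeconds(10))`. With this client, an expired timeout surfaces as `HttpConnectTimeoutException` or `HttpTimeoutException`, not as `SocketException: Connection reset`, so settings like these bound how long a request may hang but are unlikely to make a reset go away.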


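That still did not fix the problem. In the end I caught SocketException manually and, whenever it occurred, requested the same record again, which got the job done.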
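A minimal sketch of that workaround, assuming the per-record request can be wrapped in a small callable. `RetryOnReset`, `Fetch`, and `withRetry` are hypothetical names and not part of HttpClient. The attempt limit and the backoff are additions for illustration, so that a server that stays down cannot cause an endless loop.

```java
import java.io.IOException;
import java.net.SocketException;

/** Hypothetical helper: re-runs one fetch when the connection is reset. */
final class RetryOnReset {

    /** A fetch that may fail with an I/O error (hypothetical interface). */
    interface Fetch<T> {
        T run() throws IOException;
    }

    /**
     * Runs the fetch, retrying on SocketException until maxAttempts is reached.
     * Note that SocketException also covers subclasses such as ConnectException.
     * Any other IOException propagates immediately without a retry.
     */
    static <T> T withRetry(Fetch<T> fetch, int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.run();
            } catch (SocketException e) {
                last = e; // e.g. "Connection reset": remember it and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // simple linear backoff
                }
            }
        }
        throw last; // every attempt failed with a SocketException
    }
}
```

A call might look like `String row = RetryOnReset.withRetry(() -> fetchRecord(page, i), 3, 500);`, where `fetchRecord` stands in for whatever code requests a single record. Retrying is only safe here because re-reading a record has no side effects on the server.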

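The problem is worked around, but I still do not really understand what causes the error. If anyone knows, I would appreciate some pointers.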
