Crawling Ajax-generated pages with HtmlUnit and calling page JavaScript functions automatically


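From the HtmlUnit website:

HtmlUnit is a Java-based browser program without a graphical interface. It models HTML documents and provides an API that lets developers fetch page content, fill in forms, click links and so on, much as they would in a normal browser.

It has good JavaScript support that is still being improved, can work with quite complex AJAX libraries, and can simulate Chrome, Firefox or IE depending on the configuration.

This article uses a scraping example against a football lottery site to get familiar with HtmlUnit.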

       WebClient wc = new WebClient(BrowserVersion.FIREFOX_38);
       wc.getOptions().setJavaScriptEnabled(true); // enable the JS interpreter (true by default)
       wc.setJavaScriptTimeout(100000); // timeout for JS execution, in milliseconds
       wc.getOptions().setCssEnabled(false); // disable CSS support
       wc.getOptions().setThrowExceptionOnScriptError(false); // do not throw when a script fails
       wc.getOptions().setTimeout(10000); // connection timeout: 10 s here; 0 means wait indefinitely
       wc.setAjaxController(new NicelyResynchronizingAjaxController()); // enable AJAX support
       wc.setWebConnection(
           new WebConnectionWrapper(wc) {
               @Override
               public WebResponse getResponse(WebRequest request) throws IOException {
                   ......
               }
           }
       );
       HtmlPage page = wc.getPage("http://XXXX.com/");
       FileWriter fileWriter = new FileWriter("D:\\text.html");
       // get the page as XML
       String str = page.asXml();
       fileWriter.write(str);
       // close the WebClient
       wc.close();
       fileWriter.close();
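
Since this article is largely about encoding problems, it is worth noting that FileWriter encodes with the JVM's default charset, which on older JDKs depends on the operating system. The following is a minimal sketch of writing the page text with an explicit UTF-8 encoding instead. PageSaver and saveUtf8 are hypothetical names used only for illustration and are not part of HtmlUnit; in the crawler the text would come from page.asXml().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of HtmlUnit): saves text with a fixed encoding. */
final class PageSaver {
    private PageSaver() {}

    /** Writes the text to the target file as UTF-8, replacing any existing content. */
    static void saveUtf8(Path target, String text) {
        try {
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("page", ".html");
        // Stand-in for page.asXml(); the escapes are two Chinese characters.
        saveUtf8(out, "<html><body>\u6BD4\u5206</body></html>");
        System.out.println("wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

With this in place, the saved file should decode the same way on every machine, as long as the reader also assumes UTF-8.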

Fixing garbled (mis-encoded) data

The site's data is loaded dynamically by JavaScript, and the scripts come in two encodings:

<script language="javascript" src="XXX.js" charset="gb2312"></script>
<script language="javascript" src="XXX.js" charset="utf-8"></script>

You can change what is returned by overriding the getResponse method of WebConnectionWrapper.

For example, to modify the response for bfdata.js:

wc.setWebConnection(
    new WebConnectionWrapper(wc) {
        @Override
        public WebResponse getResponse(WebRequest request) throws IOException {
            WebResponse response = super.getResponse(request);
            if (request.getUrl().toExternalForm().contains("bfdata.js")) {
                // decode the body as GBK, then re-encode it as UTF-8
                String content = response.getContentAsString("GBK");
                WebResponseData data = new WebResponseData(content.getBytes("UTF-8"),
                        response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
                response = new WebResponse(data, request, response.getLoadTime());
            }
            return response;
        }
    }
);
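
The core of this fix is a charset round trip: decode the bytes with the charset they were really written in, then encode them again in the charset the consumer expects. The sketch below shows that step in isolation with the JDK only. TranscodeDemo and transcode are hypothetical names for illustration, and the sample text is an assumption, not data from the site.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo (JDK only): why GBK bytes must be decoded as GBK before re-encoding. */
final class TranscodeDemo {
    private TranscodeDemo() {}

    /** Decodes the input with the source charset and re-encodes it with the target charset. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u6BD4\u5206"; // two Chinese characters used as sample data
        byte[] gbkBytes = original.getBytes(gbk);

        // Reading GBK bytes as if they were UTF-8 produces garbled text.
        String garbled = new String(gbkBytes, StandardCharsets.UTF_8);
        // Transcoding first keeps the text intact.
        String fixed = new String(transcode(gbkBytes, gbk, StandardCharsets.UTF_8), StandardCharsets.UTF_8);

        System.out.println("garbled matches original: " + garbled.equals(original));
        System.out.println("fixed matches original:   " + fixed.equals(original));
    }
}
```

The wrapper above applies the same idea to the HTTP response body, so that the script is handed to the JavaScript engine as UTF-8 text.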

Fixing "Content is not allowed in prolog"

Error output:

六月 21, 2016 4:15:06 下午 com.gargoylesoftware.htmlunit.xml.XmlPage <init>
警告: Failed parsing XML document http://XXX/vbsxml/goalBf3.xml?r=0071466496906000: Content is not allowed in prolog.
六月 21, 2016 4:15:06 下午 com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine handleJavaScriptException
信息: Caught script exception
======= EXCEPTION START ========
EcmaError: lineNumber=[41] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[http://XXX/common2.js] message=[TypeError: Cannot read property "childNodes" from null (http://XXX/common2.js#41)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "childNodes" from null (http://XXX/common2.js#41)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:865)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:747)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1032)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:395)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:276)

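The warning "Content is not allowed in prolog" is what causes the error that follows it, and the warning itself appears because the content being parsed contains a BOM (byte order mark). The mark is invisible when you look at the text, but it is present in the stream.

So you can use the following code to cut out just the content you need: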
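
The effect of a leading BOM can be reproduced with the JDK's own XML parser, without HtmlUnit. The sketch below is an illustration under that assumption: BomPrologDemo, stripBom and parses are hypothetical names, and the XML snippet is made up. It feeds the parser text that starts with the BOM character, then the same text with the BOM removed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo (JDK only): a BOM character in front of XML text breaks parsing. */
final class BomPrologDemo {
    static final char BOM = '\uFEFF';

    private BomPrologDemo() {}

    /** Returns the text without a leading BOM character, or unchanged if there is none. */
    static String stripBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == BOM) ? text.substring(1) : text;
    }

    /** Returns true if the text parses as XML, false if the parser rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String withBom = BOM + "<root><item/></root>";
        System.out.println("with BOM parses:    " + parses(withBom));
        System.out.println("without BOM parses: " + parses(stripBom(withBom)));
    }
}
```

The BOM character is not XML whitespace, so the parser treats it as content before the root element, which is what the prolog error message complains about.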

wc.setWebConnection(
    new WebConnectionWrapper(wc) {
        @Override
        public WebResponse getResponse(WebRequest request) throws IOException {
            WebResponse response = super.getResponse(request);
            if (request.getUrl().toExternalForm().contains("goalBf3.xml")) {
                String content = response.getContentAsString("UTF-8");
                if (content != null && !content.isEmpty()) {
                    // keep only the text from the first '<' to the last '>', which drops a leading BOM
                    int start = content.indexOf("<");
                    int end = content.lastIndexOf(">");
                    if (start != -1 && end > start) {
                        content = content.substring(start, end + 1);
                    }
                    // rebuild the response only when there is content to rebuild it from
                    WebResponseData data = new WebResponseData(content.getBytes("UTF-8"),
                            response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
                    response = new WebResponse(data, request, response.getLoadTime());
                }
            }
            return response;
        }
    }
);
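
The trimming step can also be pulled out into a small pure function, which makes it easy to check against sample inputs without a network connection. This is only a sketch: MarkupTrimmer and trimToMarkup are hypothetical names, and returning an empty string for null input is a choice made here for illustration.

```java
/** Hypothetical helper (JDK only): keeps the text between the first '<' and the last '>'. */
final class MarkupTrimmer {
    private MarkupTrimmer() {}

    /**
     * Drops anything before the first '<' and after the last '>', such as a BOM
     * or stray whitespace. Text without such a pair is returned unchanged.
     */
    static String trimToMarkup(String content) {
        if (content == null) {
            return "";
        }
        int start = content.indexOf('<');
        int end = content.lastIndexOf('>');
        return (start >= 0 && end > start) ? content.substring(start, end + 1) : content;
    }

    public static void main(String[] args) {
        // A BOM character in front and a newline behind the markup are both removed.
        String cleaned = trimToMarkup("\uFEFF<c><h>1</h></c>\n");
        System.out.println(cleaned);
        System.out.println("length: " + cleaned.length());
    }
}
```

Note that this approach removes any leading or trailing text, not just a BOM, so it is only suitable when the response is expected to be a single markup document.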

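Calling JavaScript functions on the page

Some of the site's data only appears when the mouse hovers over an element.

We can run JavaScript on the page with page.executeJavaScript.

For example: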

HtmlPage page = wc.getPage("http://xxx.com/");
wc.waitForBackgroundJavaScript(30 * 1000); /* wait up to 30 s for background JavaScript to finish */
ScriptResult result = page.executeJavaScript("document.getElementById('pk_1248827').onmouseover(window.event)");
HtmlPage jspage = (HtmlPage) result.getNewPage(); 

