Java web crawler: scraping dynamic AJAX-loaded pages with PhantomJS and jsoup

Reposted article, 2020-03-16 11:34. Tag: Java crawler

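Scraping AJAX-loaded dynamic pages from Java on Windows needs a helper tool. The tool used in this article is PhantomJS, a headless browser (introductions to PhantomJS are easy to find online).

First, download PhantomJS from https://phantomjs.org/download.html

After downloading, unzip the archive. To make later steps easier, I suggest keeping it in a folder of its own; for example, I put it in a separate folder named phantomjs on the F: drive. Then go into phantomjs\bin and run phantomjs.exe. The following window appears: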

[Figure: PhantomJS running]

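This means PhantomJS can run JavaScript code normally. (If you use it often, I suggest adding its bin folder to the PATH environment variable.)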
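Once PhantomJS is on PATH, Java code no longer needs a hard-coded install location. The following is a minimal sketch of how the executable could be located from Java by scanning the PATH entries. The class name `ExecutableLocator` and both method names are hypothetical helpers introduced for illustration; they are not part of PhantomJS or of the code in this article.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper: finds an executable by scanning PATH-style directory lists.
final class ExecutableLocator {
    private ExecutableLocator() {}

    /**
     * Searches each directory in pathValue (directories separated by the
     * platform's path separator) and returns the first regular file whose
     * name matches one of candidateNames.
     */
    static Optional<Path> findOnPath(String pathValue, String... candidateNames) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : candidateNames) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform
                }
            }
        }
        return Optional.empty();
    }

    /** Looks for phantomjs.exe (Windows) or phantomjs on the real PATH. */
    static Optional<Path> findPhantomJs() {
        return findOnPath(System.getenv("PATH"), "phantomjs.exe", "phantomjs");
    }
}
```

If `ExecutableLocator.findPhantomJs()` returns an empty `Optional`, PATH probably has not been updated yet, or the current process was started before the change was made.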

The next step is to scrape the page.

First, write a JavaScript file (for example, parser.js):

    // Usage: phantomjs parser.js <url>
    var system = require('system');
    var page = require('webpage').create();
    var url = system.args[1];

    page.settings.resourceTimeout = 1000 * 10; // 10 seconds per resource

    // If a resource times out, print what has been rendered so far and exit with an error code
    page.onResourceTimeout = function (e) {
        console.log(page.content);
        phantom.exit(1);
    };

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to load ' + url);
            phantom.exit(1);
        } else {
            // page.content is the document as it stands after the page's scripts have run
            console.log(page.content);
            phantom.exit();
        }
    });

Then the Java code (my parser.js is on the F: drive):

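    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Read a dynamic page: run PhantomJS on the URL and return the rendered HTML
    public static String dynamicHtml(String url) {
        StringBuilder html = new StringBuilder();
        try {
            // Pass the arguments separately so a URL containing spaces is not split
            Process process = new ProcessBuilder(
                    "F:\\phantomjs\\bin\\phantomjs.exe", "F:/parser.js", url)
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr out of the HTML
                    .start();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // readLine() strips the line break, so put it back
                    html.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html.toString();
    }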
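One thing to be aware of: `dynamicHtml` reads until PhantomJS exits, so a page that never finishes loading would block the caller indefinitely. Below is a rough sketch of one way to put an upper bound on the wait, assuming Java 8 or later. `TimedCommand` and its `run` method are hypothetical names introduced for illustration. Sending the output to a temporary file is a design choice made here so that the process cannot stall on a full pipe while the caller is waiting.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs an external command with an overall time limit.
final class TimedCommand {
    private TimedCommand() {}

    /**
     * Runs the command, waits at most timeoutSeconds for it to finish, and
     * returns its standard output decoded as UTF-8. Throws
     * IllegalStateException on timeout or on a non-zero exit code.
     */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        File out = File.createTempFile("cmd-out", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out)                            // stdout goes to the temp file
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // stderr goes to our console
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IllegalStateException(
                        "Timed out after " + timeoutSeconds + "s: " + command);
            }
            if (process.exitValue() != 0) {
                throw new IllegalStateException(
                        "Exit code " + process.exitValue() + ": " + command);
            }
            return new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        } finally {
            out.delete();
        }
    }
}
```

With the paths used in this article, the call would look like `TimedCommand.run(Arrays.asList("F:\\phantomjs\\bin\\phantomjs.exe", "F:/parser.js", url), 30)`, where the 30-second limit is an arbitrary example value.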

Processing logic (parsing with jsoup):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements; // needed once the commented-out lines are enabled

    public static void ReadAjaxDynamicHtml(String htmlUrl) {
        String imageHtml = dynamicHtml(htmlUrl);
        Document imageDoc = Jsoup.parse(imageHtml);
        // To select only the elements with a given class, use:
        // Elements childrenImg = imageDoc.select(".class");
        // System.err.println(childrenImg.html());
        // System.err.println(childrenImg.text());
        // To select only a given tag, for example img:
        // Elements childrenImg = imageDoc.select("img");
        System.err.println(imageDoc);
        /* further processing goes here */
        // ...
    }

Example call from the main method:

    public static void main(String[] args) {
        String htmlUrl = "http://www.baidu.com";
        ReadAjaxDynamicHtml(htmlUrl);
    }

[Figure: partial screenshot of the printed result]

Maven dependency for reference:

    <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>

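That completes the test. Scraping pages often also involves saving text and images, so here is sample code for writing text to a file and downloading an image to the local disk: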

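    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    /**
     * Writes text to a new UTF-8 file.
     *
     * @param text     the text to write
     * @param fileName file name prefix; a timestamp and ".txt" are appended
     * @throws IOException if the file cannot be created or written
     */
    public static void Writer(String text, String fileName) throws IOException {
        // Path of the generated file
        String path = "F:\\" + fileName + System.currentTimeMillis() + ".txt";
        File file = new File(path);
        File parent = file.getParentFile();
        if (parent != null && !parent.exists()) {
            parent.mkdirs();
        }
        // try-with-resources flushes and closes the writer even if write() throws
        try (BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            bw.write(text);
        }
    }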

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;

    /**
     * Downloads one image to a local file.
     *
     * @param imageUrl address of the image
     * @param path     local path to save it to
     */
    private static void downloadPicture(String imageUrl, String path) {
        File file = new File(path);
        File parent = file.getParentFile();
        if (parent != null && !parent.exists()) {
            parent.mkdirs();
        }
        // Both streams are closed automatically, even if the copy fails midway
        try (InputStream in = new URL(imageUrl).openStream();
             OutputStream out = new FileOutputStream(file)) {
            byte[] buffer = new byte[1024];
            int length;
            // read() returns -1 at end of stream
            while ((length = in.read(buffer)) != -1) {
                out.write(buffer, 0, length);
            }
        } catch (IOException e) {
            // MalformedURLException is a subclass of IOException
            e.printStackTrace();
        }
    }

Of course, the method parameters can be adapted to your needs.
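The `src` values found on a scraped page are often relative (for example `img/a.png` or `/static/b.jpg`), while the download method needs an absolute URL and a local file name. Here is a small sketch of how both could be derived with `java.net.URI`. `ImageUrls`, `resolve` and `fileName` are hypothetical names introduced for illustration, and the sketch assumes the inputs are already valid URIs (no unescaped spaces).

```java
import java.net.URI;

// Hypothetical helper: turns scraped src values into absolute URLs and file names.
final class ImageUrls {
    private ImageUrls() {}

    /** Resolves a possibly relative src value against the URL of the page it came from. */
    static String resolve(String pageUrl, String src) {
        URI base = URI.create(pageUrl);
        // A base such as "http://host" has an empty path; give it "/" first
        // so that relative references are joined correctly.
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        return base.resolve(src.trim()).toString();
    }

    /** Returns the last path segment of the URL, or fallback if there is none. */
    static String fileName(String imageUrl, String fallback) {
        String path = URI.create(imageUrl).getPath();
        if (path == null) {
            return fallback;
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? fallback : name;
    }
}
```

For instance, `ImageUrls.resolve("http://www.example.com/news/index.html", "img/a.png")` gives `http://www.example.com/news/img/a.png`, and `ImageUrls.fileName` of that result gives `a.png`, which can be appended to the target folder before calling the download method.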
