Java: From Zero to Crawler Master (Part 1)

Reposted article. 2016-08-18 17:00. Tags: crawler / java

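After a little over 3 days of learning Java and picking up some basic syntax,

I started on Java crawlers, and within a day I was seeing real results.

I began with the simplest possible crawler logic.

The simplest fetch-and-parse step really does look like this: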

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.io.IOException;

    public class Test {
        // Fetch the page at url and parse it into a jsoup Document
        public static void Get_Url(String url) {
            try {
                Document doc = Jsoup.connect(url)
                        // optional request settings:
                        //.data("query", "Java")
                        //.userAgent("your User-Agent string")
                        //.cookie("auth", "token")
                        //.timeout(3000)
                        //.post()
                        .get();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

That is only a function, though.

So add this inside the class, below it:

        // main function
        public static void main(String[] args) {
            String url = "...";
            Get_Url(url);
        }

Haha, done.

And that is a crawler.

Magic.

But all it gets is the page's raw HTML,

and nothing has been filtered yet.

So let's filter it.

    // Additionally needs: import org.jsoup.nodes.Element; import org.jsoup.select.Elements;
    public static void Get_Url(String url) {
        try {
            Document doc = Jsoup.connect(url)
                    //.data("query", "Java")
                    //.userAgent("your User-Agent string")
                    //.cookie("auth", "token")
                    //.timeout(3000)
                    //.post()
                    .get();

            // The element whose id is "content"
            Element content = doc.getElementById("content");
            if (content == null) {
                // This page has no id="content" element, so there is nothing to filter
                return;
            }
            // Every <a>...</a> element inside it
            Elements links = content.getElementsByTag("a");
            //Elements links = doc.select("a[href]");
            // Images whose src ends in .png
            Elements pngs = doc.select("img[src$=.png]");
            // The first div with class "masthead"
            Element masthead = doc.select("div.masthead").first();

            for (Element link : links) {
                // The URL in the <a> tag's href attribute
                String linkHref = link.attr("href");
                // The text between <a> and </a>
                String linkText = link.text();
                System.out.println(linkText);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

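Let's use the code above to parse my cnblogs (博客園) blog.

What gets extracted is the text between <a> and </a>.

Looks pretty good, and it is.

-------------------------------happy divider line-------------------------------

There is actually another way to crawl that works even better.

It can fetch pages in bulk and save them locally.

Save them locally first, then filter out what you want with regex or whatever.

In my experience this is much faster than the approach above.

Much.

Here is the code!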

    // Needs: java.io.BufferedInputStream, java.io.BufferedOutputStream, java.io.File,
    //        java.io.FileOutputStream, java.io.IOException, java.net.URL
    // Download the page and save it locally as an HTML file
    public static void Save_Html(String url) {
        File dest = new File("src/temp_html/" + "saved_page_name.html");
        // Make sure the target folder exists
        dest.getParentFile().mkdirs();

        // try-with-resources closes both streams even when an error occurs
        try (BufferedInputStream bis = new BufferedInputStream(new URL(url).openStream());
             BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(dest))) {

            int length;
            byte[] bytes = new byte[1024 * 20];
            while ((length = bis.read(bytes, 0, bytes.length)) != -1) {
                // Write through the buffered stream, not the raw FileOutputStream
                bos.write(bytes, 0, length);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

This method saves the HTML straight into the folder src/temp_html/.
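The method above always writes to the same file name, so a bulk run would keep overwriting one file. One possible way around that is to derive the file name from the URL. The sketch below is only an illustration: the class `PageFileNames` and the method `fileNameFor` are hypothetical names that do not appear in the original code, and two different URLs can still map to the same name.

```java
public class PageFileNames {
    // Turns a URL into a file-system-friendly name, e.g. for files under src/temp_html/.
    // Hypothetical helper, not part of the original article's code.
    public static String fileNameFor(String url) {
        String name = url
                .replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "") // drop the scheme (http://, https://)
                .replaceAll("[^A-Za-z0-9._-]+", "_")             // every run of unsafe characters becomes _
                .replaceAll("^_+|_+$", "");                      // trim leading and trailing _
        if (name.isEmpty()) {
            name = "index";
        }
        return name + ".html";
    }

    public static void main(String[] args) {
        // prints www.cnblogs.com_TTyb.html
        System.out.println(fileNameFor("http://www.cnblogs.com/TTyb/"));
    }
}
```

The result could then be used as `new File("src/temp_html/" + PageFileNames.fileNameFor(url))`.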

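When crawling pages in bulk,

fetch everything first and save it as HTML or JSON,

then run regex or whatever over it and load the results into a database.

Once the data is local, you can do whatever you like with it.

Anti-crawler measures are no longer my problem.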
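The "regex or whatever" step on locally saved HTML could look roughly like the sketch below. It is a minimal illustration under assumptions: the class `LocalLinkFilter` and the method `extractHrefs` are hypothetical names, and the pattern is deliberately simple (it only handles quoted `href` values), so an HTML parser is the safer choice for messy pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocalLinkFilter {
    // Matches href="..." or href='...' inside an <a ...> tag (simple on purpose).
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns every quoted href value found in the given HTML text, in document order.
    public static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(2)); // group 2 is the text between the quotes
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><a href=\"/p/1.html\">First</a>"
                + "<a class='x' href='/p/2.html'>Second</a></div>";
        // prints [/p/1.html, /p/2.html]
        System.out.println(extractHrefs(html));
    }
}
```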

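Both methods above run into the same problem: the site rejects the request.

That error means

this way of crawling is too crude,

and most sites block it.

So you need to add a header,

namely the User-Agent (UA).

In method one, set it right where the commented-out userAgent line is: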

    .userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")

Method two adds it indirectly, through a URLConnection:

    // Needs: import java.net.URLConnection;
    URL temp = new URL(url);
    URLConnection uc = temp.openConnection();
    uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");
    // Read from uc; temp.openStream() would open a new connection without the header
    is = uc.getInputStream();

With the header added, this copes with most sites.
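One detail worth spelling out for method two: a request header is stored on the `URLConnection` object it was set on, so the page has to be read from that same object. The sketch below only illustrates this and makes no network request; the class `UserAgentSetup` and the method `prepare` are hypothetical names, not part of the original code.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class UserAgentSetup {
    // Creates a URLConnection (not yet connected) that carries the given User-Agent.
    public static URLConnection prepare(String url, String userAgent) throws IOException {
        URLConnection uc = new URL(url).openConnection();
        uc.setRequestProperty("User-Agent", userAgent);
        return uc;
    }

    public static void main(String[] args) throws IOException {
        URLConnection uc = prepare("http://www.cnblogs.com/TTyb/",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)");
        // Nothing has been sent yet; the header is simply stored on uc.
        System.out.println(uc.getRequestProperty("User-Agent"));
        // To download the page, read from this same object: uc.getInputStream()
    }
}
```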

-------------------------------happy divider line-------------------------------

Once the HTML has been downloaded, it still needs parsing.

Here is how to parse it:

    // Parse the HTML files saved locally
    public static void Get_Localhtml(String path) {

        // The folder that holds the saved HTML files
        File file = new File(path);
        // Every entry in that folder
        File[] array = file.listFiles();
        if (array == null) {
            // listFiles() returns null when path is missing or is not a folder
            System.out.println("Cannot list folder: " + path);
            return;
        }

        // Loop over the entries
        for (int i = 0; i < array.length; i++) {
            try {
                if (array[i].isFile()) {
                    // File name
                    System.out.println("Parsing page: " + array[i].getName());

                    // Parse the local HTML file
                    Document doc = Jsoup.parse(array[i], "UTF-8");
                    // The element whose id is "content"; null if the page has none,
                    // in which case the catch below reports the failure
                    Element content = doc.getElementById("content");
                    // Every <a>...</a> element inside it
                    Elements links = content.getElementsByTag("a");
                    //Elements links = doc.select("a[href]");
                    // Images whose src ends in .png
                    Elements pngs = doc.select("img[src$=.png]");
                    // The first div with class "masthead"
                    Element masthead = doc.select("div.masthead").first();

                    for (Element link : links) {
                        // The URL in the <a> tag's href attribute
                        String linkHref = link.attr("href");
                        // The text between <a> and </a>
                        String linkText = link.text();
                        System.out.println(linkText);
                    }
                }
            } catch (Exception e) {
                System.out.println("Page: " + array[i].getName() + " failed to parse");
                e.printStackTrace();
            }
        }
    }

The text lines up nicely.

And that is it parsed.

Add this to the main function:

    // main function
    public static void main(String[] args) {
        String url = "http://www.cnblogs.com/TTyb/";
        String path = "src/temp_html/";
        Get_Localhtml(path);
    }

With that, every HTML file in this folder gets parsed.
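If the folder might also hold other kinds of files, or might not exist yet, the listing step can be made a little more defensive. The sketch below is one possible way to do it, assuming a hypothetical class `HtmlFileLister` with a method `listHtmlNames` (neither is in the original code); it uses only `java.io.File`.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HtmlFileLister {
    // Returns the names of the regular files ending in .html directly inside dir, sorted by path.
    // An empty list comes back when dir is missing or is not a directory.
    public static List<String> listHtmlNames(File dir) {
        File[] files = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".html"));
        List<String> names = new ArrayList<>();
        if (files == null) { // listFiles returns null when the path cannot be listed
            return names;
        }
        Arrays.sort(files);
        for (File f : files) {
            if (f.isFile()) { // skip sub-folders that happen to end in .html
                names.add(f.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(listHtmlNames(new File("src/temp_html/")));
    }
}
```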

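All right.

That is what 3 days of Java and 1 day of crawlers gets you.

-------------------------------happy divider line-------------------------------

Actually, these two ways of fetching HTML work best when combined.

I tested both:

method two is not stable enough,

and method one is not fast enough.

So I fixed it myself:

put method one inside method two's catch block.

When method two fails, method one takes over.

And if method one fails as well, just skip the page.

Combined, it looks like this: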

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.StandardCharsets;

    public class JavaSpider {

        // Download the page and save it locally as an HTML file
        public static void Save_Html(String url) {
            File dest = new File("src/temp_html/" + "my_page.html");
            // Make sure the target folder exists
            dest.getParentFile().mkdirs();

            try {
                URL temp = new URL(url);
                URLConnection uc = temp.openConnection();
                uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");

                // Read from uc so the User-Agent header is actually sent;
                // try-with-resources closes both streams even when an error occurs
                try (BufferedInputStream bis = new BufferedInputStream(uc.getInputStream());
                     BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(dest))) {

                    int length;
                    byte[] bytes = new byte[1024 * 20];
                    while ((length = bis.read(bytes, 0, bytes.length)) != -1) {
                        // Write through the buffered stream, not the raw FileOutputStream
                        bos.write(bytes, 0, length);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
                System.out.println("URLConnection stream failed, switching to jsoup get()");
                // If the approach above fails,
                // fall back to the jsoup approach below
                try {
                    Document doc = Jsoup.connect(url)
                            .userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")
                            .timeout(3000)
                            .get();

                    // Overwrite (not append to) the destination file
                    try (FileOutputStream out = new FileOutputStream(dest, false)) {
                        out.write(doc.outerHtml().getBytes(StandardCharsets.UTF_8));
                    }
                } catch (IOException e2) {
                    e2.printStackTrace();
                    System.out.println("jsoup get() failed too, please check that the URL is correct");
                }
            }
        }

        // Parse the HTML files saved locally
        public static void Get_Localhtml(String path) {

            // The folder that holds the saved HTML files
            File file = new File(path);
            // Every entry in that folder
            File[] array = file.listFiles();
            if (array == null) {
                // listFiles() returns null when path is missing or is not a folder
                System.out.println("Cannot list folder: " + path);
                return;
            }

            // Loop over the entries
            for (int i = 0; i < array.length; i++) {
                try {
                    if (array[i].isFile()) {
                        // File name
                        System.out.println("Parsing page: " + array[i].getName());
                        // Folder path plus file name
                        //System.out.println("#####" + array[i]);
                        // The same folder path plus file name
                        //System.out.println("*****" + array[i].getPath());

                        // Parse the local HTML file
                        Document doc = Jsoup.parse(array[i], "UTF-8");
                        // The element whose id is "content"; null if the page has none,
                        // in which case the catch below reports the failure
                        Element content = doc.getElementById("content");
                        // Every <a>...</a> element inside it
                        Elements links = content.getElementsByTag("a");
                        //Elements links = doc.select("a[href]");
                        // Images whose src ends in .png
                        Elements pngs = doc.select("img[src$=.png]");
                        // The first div with class "masthead"
                        Element masthead = doc.select("div.masthead").first();

                        for (Element link : links) {
                            // The URL in the <a> tag's href attribute
                            String linkHref = link.attr("href");
                            // The text between <a> and </a>
                            String linkText = link.text();
                            System.out.println(linkText);
                        }
                    }
                } catch (Exception e) {
                    System.out.println("Page: " + array[i].getName() + " failed to parse");
                    e.printStackTrace();
                }
            }
        }

        // main function
        public static void main(String[] args) {
            String url = "http://www.cnblogs.com/TTyb/";
            String path = "src/temp_html/";
            // Save the page locally
            Save_Html(url);
            // Parse the locally saved pages
            Get_Localhtml(path);
        }
    }
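The "try one way, fall back to the other, skip if both fail" idea in Save_Html can also be looked at on its own, without any networking. The sketch below is only an illustration of that control flow: the class `FallbackFetch`, the interface `Fetcher`, and the method `fetchWithFallback` are hypothetical names, and the two fetchers in `main` are stand-ins rather than real downloads.

```java
import java.io.IOException;
import java.util.Optional;

public class FallbackFetch {
    // Hypothetical stand-in for "one way of downloading a page".
    @FunctionalInterface
    public interface Fetcher {
        String fetch(String url) throws IOException;
    }

    // Tries primary first; on IOException tries fallback; returns empty if both fail.
    public static Optional<String> fetchWithFallback(String url, Fetcher primary, Fetcher fallback) {
        try {
            return Optional.of(primary.fetch(url));
        } catch (IOException first) {
            System.out.println("primary fetch failed, trying fallback: " + first.getMessage());
            try {
                return Optional.of(fallback.fetch(url));
            } catch (IOException second) {
                System.out.println("fallback failed too, skipping " + url);
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        Fetcher broken = u -> { throw new IOException("simulated failure"); };
        Fetcher working = u -> "<html>from fallback</html>";
        // The primary fails, so the page comes from the fallback
        System.out.println(fetchWithFallback("http://www.cnblogs.com/TTyb/", broken, working)
                .orElse("(skipped)"));
    }
}
```

Returning an `Optional` makes the "skip it" case explicit for the caller, which is handy when looping over many URLs.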

All in all,

Java seems to offer far more ways to write a crawler than Python does.

Java's libraries are insanely powerful.
