Java Web Crawler

Reposted 2019-04-16 14:28.

Source: cnblogs (博客園), author: 三目鳥

https://www.cnblogs.com/sanmubird/p/7857474.html

The material in this article comes from Luo Gang's (羅剛) book 《自己動手寫網絡爬蟲》 (Write Your Own Web Crawler).

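This article covers two things: 1: what a web crawler does; 2: writing a simple web crawler by hand.

1: What does a web crawler do? Its main job is to send a request to a given URL, receive the response, and parse that response. From the response it extracts the data it is looking for, and it also extracts new URLs, which it then visits and parses in turn, again collecting data and discovering further URLs.

That cycle is the core of a web crawler's work. The original post illustrates it with a flowchart, which is not reproduced here.

With that cycle in mind, a simple web crawler is straightforward to design.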
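
The fetch, parse, and enqueue cycle described above can be sketched without any networking. The example below is only an illustration under stated assumptions: the "web" is a hypothetical in-memory map from each URL to the links found on that page, and the names `CrawlLoopSketch`, `crawl`, and `maxPages` are invented for this sketch rather than taken from the book or the code that follows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Breadth-first crawl over a hypothetical in-memory "web" (URL -> outgoing links). */
class CrawlLoopSketch {

    static List<String> crawl(Map<String, List<String>> web, String seed, int maxPages) {
        Deque<String> unvisited = new ArrayDeque<>();  // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();            // every URL that was ever queued
        List<String> visitOrder = new ArrayList<>();   // stands in for the stored data

        unvisited.add(seed);
        seen.add(seed);
        while (!unvisited.isEmpty() && visitOrder.size() < maxPages) {
            String url = unvisited.removeFirst();
            // "Send the request and get the response": here just a map lookup.
            List<String> links = web.getOrDefault(url, List.of());
            visitOrder.add(url);
            // "Parse out new URLs": queue only the ones not seen before.
            for (String link : links) {
                if (seen.add(link)) {
                    unvisited.addLast(link);
                }
            }
        }
        return visitOrder;
    }
}
```

Calling `CrawlLoopSketch.crawl(web, "a", 10)` on a small map returns the URLs in the order they were visited. A real crawler replaces the map lookup with an HTTP request and an HTML parser, which is what the classes below do.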

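A simple crawler needs the following capabilities:

1: sending requests and receiving responses;

2: parsing responses;

3: storing the data filtered out of the responses;

4: handling the URLs parsed out of the responses.

Package structure (shown as an image in the original post; the package names here are taken from the code that follows): com.etoak.crawl.page, com.etoak.crawl.link, com.etoak.crawl.util, com.etoak.crawl.main.

Now the code.

RequestAndResponseTool class. Main method: sends a request, receives the response, and wraps the response in a Page object.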

package com.etoak.crawl.page;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

import java.io.IOException;

public class RequestAndResponseTool {


    public static Page  sendRequstAndGetResponse(String url) {
        Page page = null;
        // 1. Create the HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // HTTP connection timeout: 5 s
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        // 2. Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Socket (read) timeout for the GET request: 5 s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Retry handler for failed requests
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
        // 3. Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code; a non-200 response is logged but still wrapped in a Page
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }
            // 4. Process the HTTP response body
            byte[] responseBody = getMethod.getResponseBody(); // read the body as a byte array
            // The Content-Type header may be missing, so guard against null
            Header contentTypeHeader = getMethod.getResponseHeader("Content-Type");
            String contentType = contentTypeHeader != null ? contentTypeHeader.getValue() : "";
            page = new Page(responseBody, url, contentType); // wrap the response in a Page
        } catch (HttpException e) {
            // Fatal error: the protocol may be wrong or the returned content may be malformed
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network error
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        // null when the request failed with an exception
        return page;
    }
}
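
The `org.apache.commons.httpclient` 3.x API used above has been retired by Apache in favour of HttpComponents. As a hedged alternative, the same request step can be sketched with the `java.net.http` client that ships with the JDK since Java 11. This is a swapped-in technique, not the article's method; the class name `JdkHttpFetchSketch` and its helpers are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class JdkHttpFetchSketch {

    // Connect timeout of 5 s, mirroring the setting in RequestAndResponseTool.
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Builds a GET request that gives up if no response arrives within 5 s. */
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }

    /** Returns the response body as bytes, or null when the status is not 200. */
    static byte[] fetch(String url) throws IOException, InterruptedException {
        HttpResponse<byte[]> response =
                CLIENT.send(buildGet(url), HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 200) {
            return null;
        }
        return response.body();
    }
}
```

The content type is available as `response.headers().firstValue("Content-Type")`, which returns an `Optional` and so makes the missing-header case explicit.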

Page class. Purpose: holds the response content and exposes accessor methods.

package com.etoak.crawl.page;


import com.etoak.crawl.util.CharsetDetector;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.UnsupportedEncodingException;

/*
* Page
*   1: holds the content of the response that was fetched;
* */
public class Page {

    private byte[] content ;
    private String html ;  // page source as a string
    private Document doc  ;// page DOM document
    private String charset ;// character encoding
    private String url ;// URL of the page
    private String contentType ;// content type


    public Page(byte[] content , String url , String contentType){
        this.content = content ;
        this.url = url ;
        this.contentType = contentType ;
    }

    public String getCharset() {
        return charset;
    }
    public String getUrl(){return url ;}
    public String getContentType(){ return contentType ;}
    public byte[] getContent(){ return content ;}

    /**
     * Returns the page source as a string
     *
     * @return the page source as a string, or null if there is no content
     */
    public String getHtml() {
        if (html != null) {
            return html;
        }
        if (content == null) {
            return null;
        }
        if(charset==null){
            charset = CharsetDetector.guessEncoding(content); // guess the character encoding from the content
        }
        try {
            this.html = new String(content, charset);
            return html;
        } catch (UnsupportedEncodingException ex) {
            ex.printStackTrace();
            return null;
        }
    }

    /*
    *  Get the parsed document
    * */
    public Document getDoc(){
        if (doc != null) {
            return doc;
        }
        try {
            this.doc = Jsoup.parse(getHtml(), url);
            return doc;
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }


}

PageParserTool class. Purpose: provides methods for selecting elements, attributes, and so on by CSS selector.

package com.etoak.crawl.page;

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class PageParserTool {


    /* Select page elements with a CSS selector */
    public static Elements select(Page page , String cssSelector) {
        return page.getDoc().select(cssSelector);
    }

    /*
     *  Get a single element matching the CSS selector;
     *  a negative index counts from the end of the matches
     *  */
    public static Element select(Page page , String cssSelector, int index) {
        Elements eles = select(page , cssSelector);
        int realIndex = index;
        if (index < 0) {
            realIndex = eles.size() + index;
        }
        return eles.get(realIndex);
    }


    /**
     * Collects the links of the elements matched by the selector. cssSelector must
     * target the link elements themselves. For example, to extract every hyperlink
     * inside the div whose id is content, cssSelector should be div[id=content] a.
     * The links are collected in a Set to avoid duplicates.
     * @param page the page to search
     * @param cssSelector selector that matches the link elements
     * @return absolute URLs taken from the href or src attributes
     */
    public static  Set<String> getLinks(Page page ,String cssSelector) {
        Set<String> links  = new HashSet<String>() ;
        Elements es = select(page , cssSelector);
        for (Element element : es) {
            if ( element.hasAttr("href") ) {
                links.add(element.attr("abs:href"));
            }else if( element.hasAttr("src") ){
                links.add(element.attr("abs:src"));
            }
        }
        return links;
    }



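    /**
     * Collects the given attribute from every element matched by the CSS selector.
     * For example, getAttrs(page, "img[src]", "abs:src") returns the links of all images on the page.
     * @param page the page to search
     * @param cssSelector selector that matches the elements
     * @param attrName name of the attribute to read
     * @return the attribute values, in document order
     */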
    public static ArrayList<String> getAttrs(Page page , String cssSelector, String attrName) {
        ArrayList<String> result = new ArrayList<String>();
        Elements eles = select(page ,cssSelector);
        for (Element ele : eles) {
            if (ele.hasAttr(attrName)) {
                result.add(ele.attr(attrName));
            }
        }
        return result;
    }
}
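
The `abs:href` and `abs:src` keys used in getLinks ask jsoup for an absolute URL, resolved against the base URL that Page passes to `Jsoup.parse`. To show what that resolution means, here is a minimal sketch using only `java.net.URI` from the JDK. It approximates the idea rather than reproducing jsoup's implementation, and `AbsoluteLinkSketch` and `toAbsolute` are hypothetical names.

```java
import java.net.URI;

class AbsoluteLinkSketch {

    /** Resolves a possibly relative link against the URL of the page it appeared on. */
    static String toAbsolute(String pageUrl, String href) {
        // URI.create throws IllegalArgumentException for malformed input (e.g. raw spaces).
        return URI.create(pageUrl).resolve(href).toString();
    }
}
```

For a page at `http://example.com/a/b.html`, the link `c.html` resolves to `http://example.com/a/c.html`, and `/img/x.png` resolves to `http://example.com/img/x.png`. A link that is already absolute is returned unchanged.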

link package

Links class. Two fields: a set holding the URLs already visited, and a queue holding the URLs still to be visited.

 
package com.etoak.crawl.link;

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

/*
 *  Main job of Links:
 *  1: store the URLs already visited and the URLs waiting to be visited.
 */
public class Links {

    // Visited URLs. They must never repeat, so a Set is used.
    private static Set<String> visitedUrlSet = new HashSet<String>();

    // URLs waiting to be visited. Two concerns: 1: a defined visiting order; 2: no duplicate pending URLs.
    private static LinkedList<String> unVisitedUrlQueue = new LinkedList<String>();

    // Number of URLs visited so far
    public static int getVisitedUrlNum() {
        return visitedUrlSet.size();
    }

    // Add a URL to the visited set
    public static void addVisitedUrlSet(String url) {
        visitedUrlSet.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrlSet(String url) {
        visitedUrlSet.remove(url);
    }

    // Get the queue of URLs waiting to be visited
    public static LinkedList<String> getUnVisitedUrlQueue() {
        return unVisitedUrlQueue;
    }

    // Add a URL to the pending queue, making sure each URL is visited only once
    public static void addUnvisitedUrlQueue(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrlSet.contains(url) && !unVisitedUrlQueue.contains(url)) {
            unVisitedUrlQueue.add(url);
        }
    }

    // Remove and return the first pending URL
    public static String removeHeadOfUnVisitedUrlQueue() {
        return unVisitedUrlQueue.removeFirst();
    }

    // Whether the pending queue is empty
    public static boolean unVisitedUrlQueueIsEmpty() {
        return unVisitedUrlQueue.isEmpty();
    }

}
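
One cost in the Links class is the `unVisitedUrlQueue.contains(url)` check, which scans the whole LinkedList every time a URL is added. A possible variation, sketched below under assumptions, keeps a single set of every URL that has ever been queued so that the duplicate check is a hash lookup. `FrontierSketch` and its method names are hypothetical and are not part of the article's code; the sketch also uses instance fields instead of static ones.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class FrontierSketch {

    private final Set<String> known = new HashSet<>();       // visited or currently queued
    private final Deque<String> queue = new ArrayDeque<>();  // pending URLs in FIFO order

    /** Queues the URL unless it is blank or already known; returns true if it was queued. */
    boolean offer(String url) {
        if (url == null || url.trim().isEmpty()) {
            return false;
        }
        if (!known.add(url)) {
            return false;  // seen before, either visited or still waiting
        }
        queue.addLast(url);
        return true;
    }

    /** Returns the next URL to visit, or null when nothing is pending. */
    String next() {
        return queue.pollFirst();
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

Because a URL stays in `known` after it has been handed out by `next()`, it is never queued a second time, which gives the same "visit each URL once" guarantee as the two-collection design.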

LinkFilter interface: can be used to filter links.

package com.etoak.crawl.link;

public interface LinkFilter {
    public boolean accept(String url);
}
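
Since LinkFilter has a single method, a filter can be written as a lambda and applied to a batch of extracted links. The sketch below is illustrative only: it declares its own local copy of the interface so that it compiles on its own, and `LinkFilterSketch` and `keepAccepted` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class LinkFilterSketch {

    /** Local copy of the single-method interface so this sketch is self-contained. */
    interface LinkFilter {
        boolean accept(String url);
    }

    /** Returns only the URLs that the filter accepts, keeping their order. */
    static List<String> keepAccepted(List<String> urls, LinkFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (filter.accept(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

For example, `keepAccepted(links, url -> url.startsWith("http://www.baidu.com"))` keeps only the links on that site, so a crawler would queue nothing outside it.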

util package (utility classes)

CharsetDetector class: detects the character encoding.

/*
 * Copyright (C) 2014 hu
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */
package com.etoak.crawl.util;

import org.mozilla.universalchardet.UniversalDetector;

import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Automatic charset detection
 *
 * @author hu
 */
public class CharsetDetector {

    // Page-encoding detection code adapted from Nutch
    private static final int CHUNK_SIZE = 2000;

    private static Pattern metaPattern = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
            Pattern.CASE_INSENSITIVE);
    private static Pattern charsetPattern = Pattern.compile(
            "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);
    private static Pattern charsetPatternHTML5 = Pattern.compile(
            "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Page-encoding detection code adapted from Nutch
    private static String guessEncodingByNutch(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);

        String str = "";
        try {
            // Only the first CHUNK_SIZE bytes are scanned for a <meta> charset declaration
            str = new String(content, 0, length, "ascii");
        } catch (UnsupportedEncodingException e) {
            return null;
        }

        Matcher metaMatcher = metaPattern.matcher(str);
        String encoding = null;
        if (metaMatcher.find()) {
            Matcher charsetMatcher = charsetPattern.matcher(metaMatcher.group(1));
            if (charsetMatcher.find()) {
                encoding = new String(charsetMatcher.group(1));
            }
        }
        if (encoding == null) {
            metaMatcher = charsetPatternHTML5.matcher(str);
            if (metaMatcher.find()) {
                encoding = new String(metaMatcher.group(1));
            }
        }
        if (encoding == null) {
            if (length >= 3 && content[0] == (byte) 0xEF
                    && content[1] == (byte) 0xBB && content[2] == (byte) 0xBF) {
                encoding = "UTF-8";
            } else if (length >= 2) {
                if (content[0] == (byte) 0xFF && content[1] == (byte) 0xFE) {
                    encoding = "UTF-16LE";
                } else if (content[0] == (byte) 0xFE
                        && content[1] == (byte) 0xFF) {
                    encoding = "UTF-16BE";
                }
            }
        }

        return encoding;
    }

    /**
     * Guesses the charset of a byte array; returns UTF-8 if detection fails
     *
     * @param bytes the byte array to inspect
     * @return the likely charset, or UTF-8 if detection fails
     */
    public static String guessEncodingByMozilla(byte[] bytes) {
        String DEFAULT_ENCODING = "UTF-8";
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        return encoding;
    }

    /**
     * Guesses the charset of a byte array; returns UTF-8 if detection fails
     * @param content the byte array to inspect
     * @return the likely charset, or UTF-8 if detection fails
     */
    public static String guessEncoding(byte[] content) {
        String encoding;
        try {
            encoding = guessEncodingByNutch(content);
        } catch (Exception ex) {
            return guessEncodingByMozilla(content);
        }

        if (encoding == null) {
            encoding = guessEncodingByMozilla(content);
            return encoding;
        } else {
            return encoding;
        }
    }
}
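
A charset name read from a `<meta>` tag is whatever the page author typed, so it may name an encoding the JVM does not support, in which case `new String(content, charset)` in Page throws UnsupportedEncodingException. One way to guard against that, sketched here as an assumption rather than as part of the original code, is to validate the name with `java.nio.charset.Charset` and fall back to UTF-8. `SafeDecodeSketch` and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

class SafeDecodeSketch {

    /** Maps a detected charset name to a Charset, falling back to UTF-8 if it is unusable. */
    static Charset resolve(String name) {
        if (name == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : StandardCharsets.UTF_8;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not allowed in a charset name.
            return StandardCharsets.UTF_8;
        }
    }

    /** Decodes the bytes with the resolved charset; never throws for a bad charset name. */
    static String decode(byte[] content, String charsetName) {
        return new String(content, resolve(charsetName));
    }
}
```

With this helper, a page that declares a misspelled or exotic charset is still decoded, possibly with some wrong characters, instead of producing no HTML at all.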

FileTool: file download class.

package com.etoak.crawl.util;



import com.etoak.crawl.page.Page;

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/*  This class saves the pages that have been visited to local files */
public class FileTool {

    private static String dirPath;


    /**
     * Builds the file name for a page from its URL and content type
     * (the value of getMethod.getResponseHeader("Content-Type").getValue()),
     * replacing characters that are not valid in file names
     */
    private static String getFileNameByUrl(String url, String contentType) {
        // Strip the leading http:// or https://
        url = url.replaceFirst("^https?://", "");
        // text/html
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        }
        // Other types, e.g. application/pdf
        else {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "." +
                    contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

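    /*
    *  Create the output directory (a "temp" folder under the classpath root)
    * */
    private static void mkdir() {
        if (dirPath == null) {
            // File.separator keeps the path valid on non-Windows systems as well
            dirPath = FileTool.class.getResource("/").getPath() + "temp" + File.separator;
        }
        File fileDir = new File(dirPath);
        if (!fileDir.exists()) {
            fileDir.mkdirs();
        }
    }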

    /**
     * Saves the page's bytes to a local file under dirPath
     */

    public static void saveToLocal(Page page) {
        mkdir();
        String fileName =  getFileNameByUrl(page.getUrl(), page.getContentType()) ;
        String filePath = dirPath + fileName ;
        byte[] data = page.getContent();
        // try-with-resources closes the stream even if the write fails
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)))) {
            out.write(data);
            out.flush();
            System.out.println("File " + fileName + " has been saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
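
The file-naming and saving steps of FileTool can also be expressed with `java.nio.file`, which creates the directory and writes the bytes in two calls. The sketch below is a hedged variation, not the article's code: `SaveSketch`, `fileNameFor`, and `save` are hypothetical names, and the caller supplies the extension instead of deriving it from the Content-Type header.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SaveSketch {

    /** Drops the URL scheme and replaces characters that are unsafe in file names. */
    static String fileNameFor(String url, String extension) {
        String withoutScheme = url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");
        return withoutScheme.replaceAll("[\\\\?/:*|<>\"]", "_") + "." + extension;
    }

    /** Writes the bytes to dir/<derived file name>, creating dir if needed. */
    static Path save(Path dir, String url, String extension, byte[] data) throws IOException {
        Files.createDirectories(dir);
        return Files.write(dir.resolve(fileNameFor(url, extension)), data);
    }
}
```

For instance, `fileNameFor("http://www.baidu.com", "html")` yields `www.baidu.com.html`. Passing the target directory as a `Path` avoids depending on the classpath root, which may not be writable when the program runs from a jar.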

RegexRule: regular-expression rule class.

/*
 * Copyright (C) 2014 hu
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */
package com.etoak.crawl.util;


import java.util.ArrayList;
import java.util.regex.Pattern;

/**
 *
 * @author hu
 */
public class RegexRule {
    
    public RegexRule(){
        
    }
    public RegexRule(String rule){
        addRule(rule);
    }
    
    public RegexRule(ArrayList<String> rules){
        for (String rule : rules) {
            addRule(rule);
        }
    }
    
    public boolean isEmpty(){
        return positive.isEmpty();
    }

    private ArrayList<String> positive = new ArrayList<String>();
    private ArrayList<String> negative = new ArrayList<String>();

  
    
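    /**
     * Adds a regex rule. There are two kinds of rule: positive and negative.
     * A URL satisfies the rule set when 1. it matches at least one positive regex and 2. it matches no negative regex.
     * Positive example: +a.*c is a positive rule whose regex is a.*c; the leading plus sign marks it as positive.
     * Negative example: -a.*c is a negative rule whose regex is a.*c; the leading minus sign marks it as negative.
     * If a rule starts with neither a plus nor a minus sign, it is a positive rule and the whole string is the regex.
     * For example, a.*c is a positive rule whose regex is a.*c.
     * @param rule the regex rule
     * @return this object
     */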
    public RegexRule addRule(String rule) {
        if (rule.length() == 0) {
            return this;
        }
        char pn = rule.charAt(0);
        String realrule = rule.substring(1);
        if (pn == '+') {
            addPositive(realrule);
        } else if (pn == '-') {
            addNegative(realrule);
        } else {
            addPositive(rule);
        }
        return this;
    }

   
    
    /**
     * Adds a positive regex rule
     * @param positiveregex the regex to add
     * @return this object
     */
    public RegexRule addPositive(String positiveregex) {
        positive.add(positiveregex);
        return this;
    }

  
    /**
     * Adds a negative regex rule
     * @param negativeregex the regex to add
     * @return this object
     */
    public RegexRule addNegative(String negativeregex) {
        negative.add(negativeregex);
        return this;
    }

   
    /**
     * Checks whether the input string satisfies the rules
     * @param str the input string
     * @return whether the input string satisfies the rules
     */
    public boolean satisfy(String str) {

        // Any match against a negative regex rejects the string
        for (String nregex : negative) {
            if (Pattern.matches(nregex, str)) {
                return false;
            }
        }

        // Otherwise at least one positive regex must match
        for (String pregex : positive) {
            if (Pattern.matches(pregex, str)) {
                return true;
            }
        }
        return false;

    }
}
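
RegexRule stores its rules as strings, so every call to `Pattern.matches` compiles the regex again. When the same rules are checked against many URLs, compiling each rule once may be worthwhile. The following is a compact sketch of that idea with the same positive/negative semantics; `CompiledRuleSketch` is a hypothetical name and is not part of WebCollector or of the article's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CompiledRuleSketch {

    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    /** "+regex" or plain "regex" adds a positive rule; "-regex" adds a negative rule. */
    CompiledRuleSketch add(String rule) {
        if (rule.isEmpty()) {
            return this;
        }
        char sign = rule.charAt(0);
        if (sign == '-') {
            negative.add(Pattern.compile(rule.substring(1)));
        } else if (sign == '+') {
            positive.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
        return this;
    }

    /** True when no negative rule matches and at least one positive rule matches. */
    boolean satisfy(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

A rule set built as `new CompiledRuleSketch().add("+http://www\\.baidu\\.com/.*").add("-.*\\.(jpg|png)")` accepts pages on that site while rejecting image URLs.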

Main class:

MyCrawler:

 
package com.etoak.crawl.main;

import com.etoak.crawl.link.LinkFilter;
import com.etoak.crawl.link.Links;
import com.etoak.crawl.page.Page;
import com.etoak.crawl.page.PageParserTool;
import com.etoak.crawl.page.RequestAndResponseTool;
import com.etoak.crawl.util.FileTool;
import org.jsoup.select.Elements;

import java.util.Set;

public class MyCrawler {

    /**
     * Initializes the URL queue with the seeds
     *
     * @param seeds seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            Links.addUnvisitedUrlQueue(seeds[i]);
        }
    }

    /**
     * The crawl loop
     *
     * @param seeds seed URLs
     */
    public void crawling(String[] seeds) {

        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);

        // Filter that accepts links starting with http://www.baidu.com.
        // Note: it is defined here, but the loop below never applies it.
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                return url.startsWith("http://www.baidu.com");
            }
        };

        // Loop while URLs remain to be crawled and fewer than 1000 pages have been visited
        while (!Links.unVisitedUrlQueueIsEmpty() && Links.getVisitedUrlNum() < 1000) {

            // Take the first URL from the pending queue
            String visitUrl = (String) Links.removeHeadOfUnVisitedUrlQueue();
            if (visitUrl == null) {
                continue;
            }

            // Fetch the page for this URL
            Page page = RequestAndResponseTool.sendRequstAndGetResponse(visitUrl);

            // The request failed: record the URL as visited so it is not retried, then move on
            if (page == null) {
                Links.addVisitedUrlSet(visitUrl);
                continue;
            }

            // Process the page: select a tag from the DOM
            Elements es = PageParserTool.select(page, "a");
            if (!es.isEmpty()) {
                System.out.println("All <a> tags on the page: ");
                System.out.println(es);
            }

            // Save the page to a file
            FileTool.saveToLocal(page);

            // Record the URL as visited
            Links.addVisitedUrlSet(visitUrl);

            // Collect new links (the selector is "img", so these are image URLs)
            Set<String> links = PageParserTool.getLinks(page, "img");
            for (String link : links) {
                Links.addUnvisitedUrlQueue(link);
                System.out.println("New URL queued for crawling: " + link);
            }
        }
    }

    // main entry point
    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[]{"http://www.baidu.com"});
    }

}

Run results (the image from the original post is not reproduced here):

Source code download link: https://pan.baidu.com/s/1ge7Nkzx   download password: mz5b

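Main references: 1: 自己動手寫網絡爬蟲 (the book cited at the top of this article);
2: https://github.com/CrawlScript/WebCollector
WebCollector is a Java crawler framework (kernel) that needs no configuration and is easy to build on. It offers a compact API, so a capable crawler can be written with only a small amount of code. WebCollector-Hadoop is the Hadoop version of WebCollector and supports distributed crawling.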
