ZeroCrawler V0.1: A Multithreaded Crawler


      ZeroCrawler V0.1 is a simple multithreaded crawler built from three parts: a Scheduler, a URL Queue, and a pool of crawler threads.

      The program runs like this: the Scheduler repeatedly takes a URL from the Queue, and whenever an idle crawler (a free thread) is available, it hands the URL to that crawler. The crawler downloads the page, extracts its URLs, saves the page, and then returns to the Scheduler (becoming an idle thread again). When the Queue has no URLs left to crawl and every crawler is idle, the program stops.

      The Scheduler's main job is to build the thread pool, take URLs from the Queue, and hand them out to threads. The part that is easy to get wrong is the exit condition. Exiting as soon as the Queue is empty is not enough, because a crawler may still be running and may yet add newly extracted URLs to the Queue. The correct exit condition is that the Queue is empty and every thread in the pool is idle. The Scheduler is implemented as follows:

    public static void Crawl(String url, String savePath) {        
        int cnt = 0;    //number of URLs dispatched so far
        long startTime = System.currentTimeMillis();
        AtomicInteger numberOfThreads = new AtomicInteger();    //number of crawlers currently busy
        ThreadPoolExecutor executor = new ThreadPoolExecutor(m_maxThreads, m_maxThreads, 
                3, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());    //build the thread pool
        
        Queue.Add(UrlUtility.Encode(UrlUtility.Normalizer(url)));    //add the seed URL to the Queue
        try {
            while ((url = Queue.Fetch()) != null) {
                executor.execute(new PageCrawler(url, savePath, numberOfThreads));    //hand the URL to a crawler
                cnt++;
                        
                while( (Queue.Size() == 0 && numberOfThreads.get() != 0) 
                        || (numberOfThreads.get() == m_maxThreads) ) {    //prevent a premature exit
                    sleep();
                }
                
                if( Queue.Size() == 0 && numberOfThreads.get() == 0 ) break;
            }            
        } finally {
            executor.shutdown();
        }
        
        long useTime = System.currentTimeMillis() - startTime;
        System.out.println("used " + Utility.ToStandardTime((int)(useTime / 1000)) + " to finish " + cnt + " links");    
        System.out.println("remaining urls: " + Queue.Size());
    }
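
      The Scheduler refers to two things this post never shows: the PageCrawler worker and the sleep() helper. The sketch below is my reconstruction from the call sites above and the helper functions introduced later, not the author's original code; the qualifier class names Downloader and Saver and the 100 ms sleep interval are assumptions, and for brevity the sketch decodes via the ICU4J overload of GetContent rather than the declared-charset path.

    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.http.HttpEntity;
    import org.apache.http.util.EntityUtils;

    //Sketch only: a plausible PageCrawler, reconstructed from the Scheduler code above.
    public class PageCrawler implements Runnable {
        private final String m_url;
        private final String m_savePath;
        private final AtomicInteger m_numberOfThreads;

        public PageCrawler(String url, String savePath, AtomicInteger numberOfThreads) {
            m_url = url;
            m_savePath = savePath;
            m_numberOfThreads = numberOfThreads;
            //increment here, on the Scheduler's thread, so the busy count rises
            //before the Scheduler re-checks its exit condition
            m_numberOfThreads.incrementAndGet();
        }

        @Override
        public void run() {
            try {
                HttpEntity entity = Downloader.GetEntity(m_url);        //download (shown below)
                if( entity != null ) {
                    byte[] bytes = EntityUtils.toByteArray(entity);     //keep the raw bytes for saving
                    String content = Downloader.GetContent(bytes);      //decode (ICU4J overload, shown below)
                    if( content != null ) {
                        UrlUtility.ExtractURL(m_url, content);          //push newly found links onto the Queue
                        Saver.SavePage(bytes, content, m_savePath);     //save the raw bytes to disk
                    }
                }
            } catch (Exception e) {
                //one bad page must not kill the worker thread
            } finally {
                m_numberOfThreads.decrementAndGet();                    //back to idle
            }
        }
    }

    //Sketch of the sleep() helper used by Crawl() (it belongs in the Scheduler class); the interval is a guess.
    private static void sleep() {
        try { Thread.sleep(100); } catch (InterruptedException e) { }
    }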

     

     The Queue stores the URLs and decides whether a URL has been seen before. The current scheme first uses a HashSet to check whether a URL has already been recorded, and then stores the complete URL in a list. URLs are fetched from the Queue breadth-first.

public class Queue {
    private static HashSet<String> m_appear = new HashSet<String>();      //every URL ever seen (for de-duplication)
    private static LinkedList<String> m_queue = new LinkedList<String>(); //URLs waiting to be crawled (FIFO = BFS)
        
    public synchronized static void Add(String url) {
        if( !m_appear.contains(url) ) {
            m_appear.add(url);
            m_queue.addLast(url);
        }
    }
    
    public synchronized static String Fetch() {
        if( !m_queue.isEmpty() ) {
            return m_queue.poll();
        }
        
        return null;
    }
    
    //synchronized so Size() never reads the list while another thread is modifying it
    public synchronized static int Size() {
        return m_queue.size();
    }
}
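
      To illustrate the Queue's contract (URLs made up): duplicates are swallowed by the HashSet, and Fetch() returns URLs first-in, first-out, which is what makes the crawl breadth-first.

    Queue.Add("http://example.com/a");
    Queue.Add("http://example.com/b");
    Queue.Add("http://example.com/a");    //duplicate, silently ignored
    System.out.println(Queue.Fetch());    //http://example.com/a
    System.out.println(Queue.Fetch());    //http://example.com/b
    System.out.println(Queue.Fetch());    //null: queue is empty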

 

      Next come the crawler's most important functions, one by one, starting with fetching a page. Fetching has two parts: downloading the page, and correctly decoding the byte stream. Downloading is done with httpclient-4.2.2, as follows:

    //user-agent strings used for disguise
    private static String[] m_agent = {"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)", 
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)", 
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", };
    
    private static Logger m_debug = LogManager.getLogger("Debuglogger");
    
    //fetch the entity at "url"
    public static HttpEntity GetEntity(String url) {
        HttpClient client = new DefaultHttpClient();        
        HttpGet getMethod = new HttpGet(UrlUtility.Encode(url));
        getMethod.getParams().setParameter("http.protocol.cookie-policy", CookiePolicy.BROWSER_COMPATIBILITY);
        
        //disguise the user-agent with a random choice
        java.util.Random r = new java.util.Random(); 
        getMethod.setHeader("User-Agent", m_agent[r.nextInt(m_agent.length)]);
                    
        HttpResponse response = null;
        try {
            response = client.execute(getMethod);
        } catch (Exception e) {
            m_debug.debug("can't get response from " + url);
            m_debug.debug("reason is : " + e.getMessage());
            return null;
        }
                
        int statusCode = response.getStatusLine().getStatusCode();
        if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY)
                    || (statusCode == HttpStatus.SC_MOVED_TEMPORARILY)
                    || (statusCode == HttpStatus.SC_SEE_OTHER)
                    || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {    //follow the redirect
            Header location = response.getLastHeader("Location");
            if( location == null ) return null;    //malformed redirect without a Location header
            return GetEntity(location.getValue());
        }
        else if( statusCode == HttpStatus.SC_NOT_FOUND ) { //page not found
            m_debug.debug(url + " : page not found");
            response = null;
        }
        
        if( response != null ) return response.getEntity();
        else                   return null;    
    }

     

      Once the entity returned by the site is in hand, the next task is to decode the byte stream correctly to obtain the page content. In most cases the downloaded page states clearly in its header which charset it uses. But only in most cases: when it does not, we have to detect the encoding ourselves. Charset detection is no simple job, so the ready-made ICU4J library is used. The implementation:

    //decode the page content from "entity"
    public static String GetContent(HttpEntity entity) {        
        if( entity != null ) {
            byte[] bytes;
            try {
                bytes = EntityUtils.toByteArray(entity);                
            } catch (IOException e) {
                m_debug.debug("can't get bytes from entity. Reason are: " + e.getMessage());
                return null;
            }
            
            String charSet = EntityUtils.getContentCharSet(entity); //charset declared by the page                                    
            if( charSet != null ) {  //the page states its own encoding            
                try {
                    return new String(bytes, charSet);
                } catch (UnsupportedEncodingException e) {
                    m_debug.debug("unsupported charset " + charSet);
                    return null;
                }
            }
            else {    
                return GetContent(bytes);
            }            
        }    
        
        return null;
    }
    
    //detect the charset with ICU4J and return the decoded page content
    public static String GetContent(byte[] bytes) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        CharsetMatch match = detector.detect();
        
        try {
            return match.getString();
        } catch (Exception e) {
            m_debug.debug("can't get content. Reason are: " + e.getMessage());
            return null;
        }            
    }
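
      Putting the two halves together, a typical fetch might look like this (the URL is made up, and the class qualifier for GetEntity/GetContent is omitted because the post does not name the enclosing class):

    HttpEntity entity = GetEntity("http://example.com/index.html");    //download
    String content = GetContent(entity);    //declared charset if present, otherwise ICU4J detection
    if( content != null ) {
        UrlUtility.ExtractURL("http://example.com/index.html", content);    //queue the links it contains
    }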

       

      The second important function is obtaining URLs, which breaks down into three steps: extracting URLs, joining URLs, and encoding URLs. Extraction uses a regular expression. If an extracted URL is already complete, all the better; if not, it must be joined. Joining splits a URL into three parts, scheme, host, and path, and fills in whichever of these the extracted relative URL is missing from the base URL. Finally, if the URL contains illegal characters such as Chinese characters or spaces, those characters are encoded as UTF-8.

public class UrlUtility {
    
    private static String m_urlPatternString = "(?i)(?s)<\\s*?a.*?href=\"(.*?)\".*?>";
    private static Pattern m_urlPattern = Pattern.compile(m_urlPatternString);
    
    private static Logger m_debug = LogManager.getLogger("Debuglogger");
    
    //extract every <a href="..."> in "content", refine it against "baseUrl", and add it to the Queue
    public static void ExtractURL(String baseUrl, String content) {
        Matcher matcher = m_urlPattern.matcher(content);
        while( matcher.find() ) {
            String anchor = matcher.group();
                       
            String url = Utility.GetSubString(anchor, "href=\"", "\"");
            if( (url = UrlUtility.Refine(baseUrl, url)) != null ) {    
                Queue.Add(url);
            }
        }
    }
    
    //encode "url" into a legal URL (characters outside ":/.?&#=" are UTF-8 percent-encoded)
    public static String Encode(String url) {
        String res = "";
        for(char c : url.toCharArray()) {
            if( !":/.?&#=".contains("" + c) ) {
                try {
                    res += URLEncoder.encode("" + c, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    m_debug.debug("This JVM has no UTF-8 charset. It's strange");
                }
            } else {
                res += c;
            }
        }

        return res;
    }
    
    //decode HTML-escaped "&amp;" and strip a trailing slash
    public static String Normalizer(String url) {
        url = url.replaceAll("&amp;", "&");
        if( url.endsWith("/") ) {
            url = url.substring(0, url.length() - 1);
        }
        
        return url;
    }
    
    //join a relative URL onto its base URL
    public static String Refine(String baseUrl, String relative) {
        if( baseUrl == null || relative == null ) {
            return null;
        }
        
        final Url base = Parse(baseUrl), url = Parse(relative);        
        if( base == null || url == null ) {
            return null;
        }
        
        if( url.scheme == null ) {
            url.scheme = base.scheme;
            if( url.host == null ) {
                url.host = base.host;
            }
        }
        
        if( url.path.startsWith("../") ) {    //resolve a single leading "../" against the base path
            String prefix = "";
            int idx = base.path.lastIndexOf('/');
            if( (idx = base.path.lastIndexOf('/', idx - 1)) > 0 ) prefix = base.path.substring(0, idx + 1);
            url.path = prefix + url.path.substring(3);            
        }
                                                
        return Normalizer(url.ToUrl());
    }
    
    //split a URL into scheme, host, and path
    private static Url Parse(String link) {
        int idx, endIndex;
        final Url url = new Url();    
        
        if( (idx = link.indexOf("#")) >= 0 ) {    //ignore fragment
            if( idx == 0 ) return null;
            else           link = link.substring(0, idx - 1);
        }
        
//        if( (idx = link.indexOf("?")) > 0 ) {    //ignore query information
//            link = link.substring(0, idx);
//        }
        
        if( (idx = link.indexOf(":")) > 0 ) {
            url.scheme = link.substring(0, idx).trim();
            if( IsLegalScheme(url.scheme) ) {
                link = link.substring(idx + 1);
            }
            else {
                return null;
            }
        }
        
        if( link.startsWith("//") ) {
            if( (endIndex = link.indexOf('/', 2)) > 0 ) {
                url.host = link.substring(2, endIndex).trim();
                link = link.substring(endIndex + 1);
            }
            else {
                url.host = link.substring(2).trim();
                link = null;
            }
        }
        
        if( link != null ) url.path = link.trim();
        else               url.path = "";
        
        return url;        
    }
    
    //check whether the scheme is one the crawler handles
    private static boolean IsLegalScheme(String scheme) {
        if( scheme.equals("http") || scheme.equals("https") || scheme.equals("ftp") ) return true;
        else                                                                          return false;
    }
    
    private static class Url {
        public Url() {}
                
        public String ToUrl() {
            String prefix = null;
            if( path.startsWith("/") ) prefix =  scheme + "://" + host;
            else                       prefix =  scheme + "://" + host + "/";
            
            return prefix + path;
        }
        
        public String scheme;
        public String host;
        public String path;
    }
}
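
      A few illustrative calls (all URLs made up) showing what the rules above yield:

    UrlUtility.Refine("http://example.com/a/b.html", "http://other.com/x");    //"http://other.com/x" (already absolute)
    UrlUtility.Refine("http://example.com/a/b.html", "/c/d.html");             //"http://example.com/c/d.html"
    UrlUtility.Refine("http://example.com/a/b/c.html", "../d.html");           //"http://example.com/a/d.html"
    UrlUtility.Normalizer("http://example.com/?x=1&amp;y=2/");                 //"http://example.com/?x=1&y=2"
    UrlUtility.Encode("http://example.com/你好.html");                          //"http://example.com/%E4%BD%A0%E5%A5%BD.html"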

     

      The last important function is saving the page. One point deserves attention here: if you decode the byte stream and then save the result as an HTML file (that is, with an .html extension), you must specify a charset when saving, and it must be the charset given in the page header. Otherwise the file will come out garbled when opened later. The reason is that the system decodes such a file according to the charset declared in its header; when a String is saved without an explicit charset, the platform default is used, and if that default differs from the declared charset, the system still decodes by the declared one, producing garbled text. It is therefore better to save the original byte stream directly.

    //save the page
    public static boolean SavePage(byte[] bytes, String content, String savePath) {        
        String name = Utility.GetSubString(content, "<title>", "</title>");    //use the page title as the file name
        if( name != null ) name = name.trim() + ".html";
        else               return false;
                
        name = FixFileName(name);                
        try {        
            FileOutputStream fos = new FileOutputStream(new File(savePath, name));
            try {
                fos.write(bytes);
            } finally {
                fos.close();    //close even if the write fails
            }
        }
        catch(FileNotFoundException e) {
            m_debug.debug("無法建立文件名為\"" + name + "\"的文件");
            return false;
        } catch (IOException e) {
            m_debug.debug(e.getMessage());
            return false;
        }        
        
        return true;
    }
    
    //replace characters that are illegal in file names
    public static String FixFileName(String name) {
        String res = "";
        for(char c : name.toCharArray()) {
            if( "/\\:*?\"<>|".contains("" + c) ) {
                res += " ";
            } else {
                res += c;
            }
        }        
        return res;
    }
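
      For comparison, if you did want to save the decoded String rather than the raw bytes, the write would have to force the declared charset, as in this minimal sketch (SavePageAsText is a hypothetical helper; charSet is assumed to be the value read from the entity, as in GetContent above):

    //Sketch only: save decoded text using the page's declared charset, so the bytes on
    //disk match the charset the browser will trust when it reads the file's header.
    public static void SavePageAsText(String content, String charSet, File file) throws IOException {
        Writer writer = new OutputStreamWriter(new FileOutputStream(file), charSet);
        try {
            writer.write(content);
        } finally {
            writer.close();    //close even if the write fails
        }
    }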

 

      That covers the main parts of ZeroCrawler V0.1. The complete code can be downloaded from [1], and the libraries required to run it from [2].

[1] http://ishare.iask.sina.com.cn/f/34836546.html

[2] http://ishare.iask.sina.com.cn/f/34836710.html
