ZeroCrawler V0.1 is a simple multi-threaded crawler. Its basic architecture is as follows:
The whole program works like this: the Scheduler keeps taking URLs out of the Queue, and whenever an idle crawler (a free thread) is available, it hands a URL to that crawler. The crawler downloads the page, extracts URLs from it, saves the page, and then returns to the Scheduler (becomes an idle thread again). When the Queue has no URLs left to crawl and all crawlers are idle, the program stops.
The Scheduler's main job is to build the thread pool, take URLs from the Queue, and assign them to threads. The part that is easy to get wrong is the exit condition. Exiting as soon as the Queue is empty is not enough, because a crawler may still be working and may add newly extracted URLs to the Queue. The exit condition should therefore be: the Queue is empty and all threads in the pool are idle. The Scheduler is implemented as follows:

public static void Crawl(String url, String savePath) {
    int cnt = 1;
    long startTime = System.currentTimeMillis();
    AtomicInteger numberOfThreads = new AtomicInteger();  // number of crawler threads currently in use
    ThreadPoolExecutor executor = new ThreadPoolExecutor(m_maxThreads, m_maxThreads,
            3, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());  // build the thread pool
    Queue.Add(UrlUtility.Encode(UrlUtility.Normalizer(url)));  // add the seed URL to the Queue
    try {
        while ((url = Queue.Fetch()) != null) {
            executor.execute(new PageCrawler(url, savePath, numberOfThreads));  // hand the URL to a crawler
            cnt++;
            // wait while crawlers are still working (they may add new URLs) or the pool is saturated,
            // to avoid exiting too early
            while ((Queue.Size() == 0 && numberOfThreads.get() != 0)
                    || (numberOfThreads.get() == m_maxThreads)) {
                sleep();
            }
            // if (cnt > 1000) break;  // optional cap on the number of crawled links
            if (Queue.Size() == 0 && numberOfThreads.get() == 0) break;
        }
    } finally {
        executor.shutdown();
    }
    long useTime = System.currentTimeMillis() - startTime;
    System.out.println("used " + Utility.ToStandardTime((int) (useTime / 1000)) + " to finish " + cnt + " links");
    System.out.println("remaining urls: " + Queue.Size());
}
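The PageCrawler class referenced above is not listed in this article. What follows is a minimal, hypothetical sketch of what such a Runnable could look like, wired to the helpers shown later (GetEntity, GetContent, ExtractURL, SavePage); the enclosing class names Web and Saver are assumed purely for illustration. The key detail is that the shared AtomicInteger is incremented when a crawler is created and decremented when it finishes, which is what the Scheduler's exit condition relies on.

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.http.HttpEntity;
import org.apache.http.util.EntityUtils;

// Hypothetical sketch of the PageCrawler Runnable used by the Scheduler above.
public class PageCrawler implements Runnable {
    private final String m_url;
    private final String m_savePath;
    private final AtomicInteger m_numberOfThreads;

    public PageCrawler(String url, String savePath, AtomicInteger numberOfThreads) {
        m_url = url;
        m_savePath = savePath;
        m_numberOfThreads = numberOfThreads;
        m_numberOfThreads.incrementAndGet();  // mark this crawler as busy as soon as it is submitted
    }

    @Override
    public void run() {
        try {
            HttpEntity entity = Web.GetEntity(m_url);              // download the page
            if (entity != null) {
                byte[] bytes = EntityUtils.toByteArray(entity);    // raw bytes, kept for saving
                String content = Web.GetContent(bytes);            // decode (ICU4J detection overload)
                if (content != null) {
                    UrlUtility.ExtractURL(m_url, content);         // feed newly found URLs to the Queue
                    Saver.SavePage(bytes, content, m_savePath);    // save the raw byte stream
                }
            }
        } catch (Exception e) {
            // swallow and fall through; the thread must still be released below
        } finally {
            m_numberOfThreads.decrementAndGet();                   // mark this crawler as idle again
        }
    }
}

Incrementing the counter in the constructor rather than in run() closes the window in which a task has been submitted but not yet started; otherwise the Scheduler could see an empty Queue and zero busy crawlers and exit too early.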
The Queue is responsible for storing URLs and for detecting duplicates. The current approach is to first use a HashSet to check whether a URL has already been seen, and then store the complete URL in a list. URLs are fetched from the Queue in breadth-first order.

public class Queue {
    private static HashSet<String> m_appear = new HashSet<String>();       // URLs already seen
    private static LinkedList<String> m_queue = new LinkedList<String>();  // URLs waiting to be crawled (FIFO)

    public synchronized static void Add(String url) {
        if (!m_appear.contains(url)) {
            m_appear.add(url);
            m_queue.addLast(url);
        }
    }

    public synchronized static String Fetch() {
        if (!m_queue.isEmpty()) {
            return m_queue.poll();
        }
        return null;
    }

    public synchronized static int Size() {
        return m_queue.size();
    }
}
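A quick, hypothetical illustration of the deduplication and breadth-first (FIFO) behaviour:

Queue.Add("http://example.com/a");
Queue.Add("http://example.com/b");
Queue.Add("http://example.com/a");   // duplicate, silently ignored
System.out.println(Queue.Size());    // 2
System.out.println(Queue.Fetch());   // http://example.com/a
System.out.println(Queue.Fetch());   // http://example.com/b
System.out.println(Queue.Fetch());   // null: the queue is empty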
Next, the crawler's most important functions are introduced one by one, starting with fetching pages. Fetching a page has two parts: downloading the page, and correctly decoding the byte stream. Downloading is done with httpclient-4.2.2, as follows:

// User-Agent strings used to disguise the crawler
private static String[] m_agent = {
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
};
private static Logger m_debug = LogManager.getLogger("Debuglogger");

// fetch the entity referred to by "url"
public static HttpEntity GetEntity(String url) {
    HttpClient client = new DefaultHttpClient();
    HttpGet getMethod = new HttpGet(UrlUtility.Encode(url));
    getMethod.getParams().setParameter("http.protocol.cookie-policy",
            CookiePolicy.BROWSER_COMPATIBILITY);
    // pick a random User-Agent for disguise
    java.util.Random r = new java.util.Random();
    getMethod.setHeader("User-Agent", m_agent[r.nextInt(m_agent.length)]);
    HttpResponse response = null;
    try {
        response = client.execute(getMethod);
    } catch (Exception e) {
        m_debug.debug("can't get response from " + url);
        m_debug.debug("reason is : " + e.getMessage());
        return null;
    }
    int statusCode = response.getStatusLine().getStatusCode();
    if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY)
            || (statusCode == HttpStatus.SC_MOVED_TEMPORARILY)
            || (statusCode == HttpStatus.SC_SEE_OTHER)
            || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
        // follow the redirect and fetch the new location
        return GetEntity(response.getLastHeader("Location").getValue());
    } else if (statusCode == HttpStatus.SC_NOT_FOUND) {
        // page not found
        m_debug.debug(url + " : page was not found");
        response = null;
    }
    if (response != null)
        return response.getEntity();
    else
        return null;
}
After getting the entity returned by the site, the next step is to decode the byte stream correctly to obtain the page content. Usually a downloaded page declares its charset in the header, but when it does not, we have to detect the encoding ourselves. Charset detection is not a trivial job, so the ready-made ICU4J library is used. The implementation is as follows:

// get the page content from "entity"
public static String GetContent(HttpEntity entity) {
    if (entity != null) {
        byte[] bytes;
        try {
            bytes = EntityUtils.toByteArray(entity);
        } catch (IOException e) {
            m_debug.debug("can't get bytes from entity. Reason is: " + e.getMessage());
            return null;
        }
        String charSet = EntityUtils.getContentCharSet(entity);  // charset declared by the page
        if (charSet != null) {  // the page declares its own encoding
            try {
                return new String(bytes, charSet);
            } catch (UnsupportedEncodingException e) {
                m_debug.debug("unsupported charset " + charSet);
                return null;
            }
        } else {
            return GetContent(bytes);
        }
    }
    return null;
}

// detect the encoding with ICU4J and return the decoded page content
public static String GetContent(byte[] bytes) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    CharsetMatch match = detector.detect();
    try {
        return match.getString();
    } catch (Exception e) {
        m_debug.debug("can't get content. Reason is: " + e.getMessage());
        return null;
    }
}
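Putting the two pieces together, fetching and decoding one page might look like the following sketch (assuming GetEntity and GetContent live in the same class; the method name Download is made up for illustration):

// Hypothetical convenience method: download a page and return its decoded content, or null on failure.
public static String Download(String url) {
    HttpEntity entity = GetEntity(url);   // follows redirects, returns null on error or 404
    if (entity == null) {
        return null;
    }
    return GetContent(entity);            // decode via the declared charset, or ICU4J detection
}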
The second important function is obtaining URLs. It consists of three steps: extracting URLs, joining URLs, and encoding URLs. Extraction is done with a regular expression. If an extracted URL is already absolute, nothing more is needed; otherwise it has to be joined with the base URL. Joining works by splitting a URL into three parts, scheme, host, and path, and then filling in whichever parts the extracted relative URL is missing from the base URL. Finally, if a URL contains illegal characters such as Chinese characters or spaces, those characters are percent-encoded as UTF-8.

public class UrlUtility {
    private static String m_urlPatternString = "(?i)(?s)<\\s*?a.*?href=\"(.*?)\".*?>";  // matches <a href="..."> tags
    private static Pattern m_urlPattern = Pattern.compile(m_urlPatternString);
    private static Logger m_debug = LogManager.getLogger("Debuglogger");

    // extract URLs from "content" and add them to the Queue
    public static void ExtractURL(String baseUrl, String content) {
        Matcher matcher = m_urlPattern.matcher(content);
        while (matcher.find()) {
            String anchor = matcher.group();
            String url = Utility.GetSubString(anchor, "href=\"", "\"");
            if ((url = UrlUtility.Refine(baseUrl, url)) != null) {
                Queue.Add(url);
            }
        }
    }

    // percent-encode the illegal characters in "url" so it becomes a legal URL
    public static String Encode(String url) {
        String res = "";
        for (char c : url.toCharArray()) {
            if (!":/.?&#=".contains("" + c)) {
                try {
                    res += URLEncoder.encode("" + c, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    m_debug.debug("This JVM has no UTF-8 charset. It's strange");
                }
            } else {
                res += c;
            }
        }
        return res;
    }

    public static String Normalizer(String url) {
        url = url.replaceAll("&amp;", "&");  // unescape HTML-encoded ampersands
        if (url.endsWith("/")) {
            url = url.substring(0, url.length() - 1);
        }
        return url;
    }

    // join a relative URL with its base URL
    public static String Refine(String baseUrl, String relative) {
        if (baseUrl == null || relative == null) {
            return null;
        }
        final Url base = Parse(baseUrl), url = Parse(relative);
        if (base == null || url == null) {
            return null;
        }
        if (url.scheme == null) {
            url.scheme = base.scheme;
            if (url.host == null) {
                url.host = base.host;
            }
        }
        if (url.path.startsWith("../")) {
            String prefix = "";
            int idx = base.path.lastIndexOf('/');
            if ((idx = base.path.lastIndexOf('/', idx - 1)) > 0)
                prefix = base.path.substring(0, idx + 1);  // keep the parent directory, including its trailing '/'
            url.path = prefix + url.path.substring(3);
        }
        return Normalizer(url.ToUrl());
    }

    // split a URL into scheme, host, and path
    private static Url Parse(String link) {
        int idx, endIndex;
        final Url url = new Url();
        if ((idx = link.indexOf("#")) >= 0) {  // ignore the fragment
            if (idx == 0)
                return null;
            else
                link = link.substring(0, idx);
        }
        // if ((idx = link.indexOf("?")) > 0) {  // ignore query information
        //     link = link.substring(0, idx);
        // }
        if ((idx = link.indexOf(":")) > 0) {
            url.scheme = link.substring(0, idx).trim();
            if (IsLegalScheme(url.scheme)) {
                link = link.substring(idx + 1);
            } else {
                return null;
            }
        }
        if (link.startsWith("//")) {
            if ((endIndex = link.indexOf('/', 2)) > 0) {
                url.host = link.substring(2, endIndex).trim();
                link = link.substring(endIndex + 1);
            } else {
                url.host = link.substring(2).trim();
                link = null;
            }
        }
        if (link != null)
            url.path = link.trim();
        else
            url.path = "";
        return url;
    }

    // check whether the scheme is one we handle
    private static boolean IsLegalScheme(String scheme) {
        return scheme.equals("http") || scheme.equals("https") || scheme.equals("ftp");
    }

    private static class Url {
        public Url() {}

        public String ToUrl() {
            String prefix = null;
            if (path.startsWith("/"))
                prefix = scheme + "://" + host;
            else
                prefix = scheme + "://" + host + "/";
            return prefix + path;
        }

        public String scheme;
        public String host;
        public String path;
    }
}
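To make the joining behaviour concrete, here are a few hypothetical inputs and the URLs the code above would produce:

String base = "http://example.com/a/b/page.html";
System.out.println(UrlUtility.Refine(base, "http://other.org/x.html"));  // http://other.org/x.html (already absolute)
System.out.println(UrlUtility.Refine(base, "/img/logo.png"));            // http://example.com/img/logo.png
System.out.println(UrlUtility.Refine(base, "../other.html"));            // http://example.com/a/other.html
System.out.println(UrlUtility.Refine(base, "javascript:void(0)"));       // null: the scheme is not handled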
The last important function is saving the page. The point worth noting here is: if you decode the byte stream into a String and then save it as an HTML file (that is, with an .html extension), you must specify a charset when saving, and it must be the charset declared in the page header. Otherwise the file will show garbled text when opened later. The reason is that when the system opens such a file, it decodes it according to the charset declared in the file's header. If you save the String without specifying a charset, it is written with the platform's default charset; when that default differs from the charset in the header, the system still decodes according to the header, and the text comes out garbled. It is therefore recommended to save the original byte stream directly.

// save the page
public static boolean SavePage(byte[] bytes, String content, String savePath) {
    // use the page title as the file name
    String name = Utility.GetSubString(content, "<title>", "</title>");
    if (name != null)
        name = name.trim() + ".html";
    else
        return false;
    name = FixFileName(name);
    try {
        FileOutputStream fos = new FileOutputStream(new File(savePath, name));
        fos.write(bytes);
        fos.close();
    } catch (FileNotFoundException e) {
        m_debug.debug("can't create a file named \"" + name + "\"");
        return false;
    } catch (IOException e) {
        m_debug.debug(e.getMessage());
        return false;
    }
    return true;
}

// replace illegal characters in the file name
public static String FixFileName(String name) {
    String res = "";
    for (char c : name.toCharArray()) {
        if ("/\\:*?\"<>|".contains("" + c)) {
            res += " ";
        } else {
            res += c;
        }
    }
    return res;
}
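For contrast with saving the raw bytes, here is a minimal sketch of the alternative discussed above: writing the decoded String with an explicit charset (assumed here to be the charset taken from the page header), so that the bytes on disk match what the header declares. The method signature is hypothetical.

// Hypothetical alternative: save the decoded String, forcing the charset declared in the page header.
public static boolean SavePage(String content, String charSet, String savePath, String name) {
    try {
        OutputStreamWriter writer = new OutputStreamWriter(
                new FileOutputStream(new File(savePath, name)), charSet);  // explicit charset, not the platform default
        writer.write(content);
        writer.close();
        return true;
    } catch (IOException e) {
        return false;
    }
}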
This covers the main parts of ZeroCrawler V0.1. The complete code can be downloaded from [1], and the libraries needed to run it from [2].
[1]http://ishare.iask.sina.com.cn/f/34836546.html
[2]http://ishare.iask.sina.com.cn/f/34836710.html