1. Introduction
In the previous post, we walked through the basic workflow of a crawler, implemented a simple one, and briefly analyzed its remaining problems at the end. This post addresses those problems and improves the crawler in the following areas.
2. Improvements
First, we use a Bloom filter to improve the visitedSet in UrlQueue.
In the previous post we used visitedSet (a HashSet) to store the urls that had already been visited. We chose a HashSet because we constantly insert urls into visitedSet and also need to check frequently whether a given url is already in it; with a hash table, all dictionary operations complete in O(1) expected time (for the analysis, see the earlier post on hash tables, Introduction to Algorithms (13)). The drawback is the large amount of memory required to maintain the hash table. Can we reduce its space cost?
Consider what visitedSet is actually for: it only answers whether a given url is contained in it, nothing more. So there is no need to store each url in full; a fingerprint is enough. The familiar md5 and sha1 digest algorithms come to mind, but even though they compress each url, we would still have to store the compressed digests. Can we do better? This is where the Bloom filter comes in (for an introduction and implementation, see the last section of the hash table post, Introduction to Algorithms (13)).
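To make the idea concrete, here is a minimal Bloom filter sketch using only the JDK. It is not the BloomFilter class from the referenced post (whose parameters differ); the class name, sizing, and double-hashing scheme here are illustrative assumptions.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes over a single BitSet.
// Answers "definitely not seen" or "probably seen" in O(k) time,
// using a few bits per element instead of storing each url.
public class SimpleBloomFilter {

    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of hash probes k

    public SimpleBloomFilter(int size, int hashCount) {
        this.size = size;
        this.hashCount = hashCount;
        this.bits = new BitSet(size);
    }

    // Derive the i-th probe index from two base hashes (double hashing).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // cheap second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(item, i));
        }
    }

    // May return a false positive, but never a false negative.
    public boolean contains(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter f = new SimpleBloomFilter(1 << 16, 4);
        f.add("http://example.com/a");
        System.out.println(f.contains("http://example.com/a")); // prints true
    }
}
```

The trade-off is exactly the one the post describes: a small, fixed bit array replaces per-url storage, at the price of a tunable false-positive rate (a url may be wrongly reported as visited, but a visited url is never reported as new).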
Next, we use Berkeley DB to improve the unvisitedList in UrlQueue.
Berkeley DB is an embedded database system that is simple, small, and fast (simple and small for sure; as for performance, I haven't verified it myself). For downloads and documentation, see its official site: http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html
With Berkeley DB in place, urls parsed out of a page are written directly to the DB, and unvisitedList becomes merely a buffer for urls read back out of the DB. That is, a dedicated thread periodically reads a batch of urls from the DB into unvisitedList, while the threads making page requests still take their urls from unvisitedList.
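The buffering scheme can be sketched with the JDK alone. Here a key-ordered TreeMap stands in for Berkeley DB (which the implementation below stores urls in under a sequence key), and a small in-memory queue is refilled from it in batches; all names are illustrative, not the actual classes used later.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.TreeMap;

// Sketch of the DB-backed buffer: urls go straight to durable storage,
// and a bounded in-memory queue is topped up from it on demand.
public class UrlBuffer {

    private final TreeMap<Long, String> db = new TreeMap<>(); // stand-in for Berkeley DB
    private final Queue<String> unvisitedList = new ArrayDeque<>();
    private long nextKey = 1; // next sequence key assigned on insert
    private long currId = 1;  // next key the refill thread will read

    // Fetchers write newly parsed urls straight to the "DB".
    public void saveToDb(String url) {
        db.put(nextKey++, url);
    }

    // Feeder: pull urls from the "DB" into the buffer until it holds batchSize.
    public void refill(int batchSize) {
        while (unvisitedList.size() < batchSize) {
            String url = db.get(currId);
            if (url == null) {
                break; // nothing more in the DB for now
            }
            unvisitedList.add(url);
            currId++;
        }
    }

    // Fetchers still take their work from the small in-memory buffer.
    public String nextUrl() {
        return unvisitedList.poll();
    }

    public static void main(String[] args) {
        UrlBuffer b = new UrlBuffer();
        b.saveToDb("http://a");
        b.saveToDb("http://b");
        b.refill(1);
        System.out.println(b.nextUrl()); // prints http://a
    }
}
```

The point of the design is that memory usage is bounded by the buffer size, while the frontier of unvisited urls can grow arbitrarily large on disk.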
Finally, we introduce multithreading to improve the crawler's throughput.
The keys to multithreading are synchronization and communication between threads; if you are unfamiliar with these topics, look them up on your own.
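The coordination pattern the crawler relies on can be shown in a few lines: one producer (the Feeder) hands urls to several consumers (the Fetchers) through a bounded BlockingQueue, which handles all the locking and wake-ups internally. This is a generic sketch, not the CrawlerEngine code itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Producer-consumer demo: the same pattern BloomQueue builds on
// via LinkedBlockingQueue.
public class ProducerConsumerDemo {

    static int runDemo() throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10);
        List<String> fetched = Collections.synchronizedList(new ArrayList<>());

        Thread feeder = new Thread(() -> {
            for (int i = 0; i < 20; i++) {
                try {
                    queue.put("url-" + i); // blocks while the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Runnable fetcher = () -> {
            while (true) {
                try {
                    // poll with a timeout so the thread can exit once work dries up
                    String url = queue.poll(500, TimeUnit.MILLISECONDS);
                    if (url == null) {
                        return;
                    }
                    fetched.add(url);
                } catch (InterruptedException e) {
                    return;
                }
            }
        };

        feeder.start();
        Thread f1 = new Thread(fetcher);
        Thread f2 = new Thread(fetcher);
        f1.start();
        f2.start();
        feeder.join();
        f1.join();
        f2.join();
        return fetched.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo()); // prints 20
    }
}
```

Because the queue is bounded, a fast producer cannot outrun slow consumers, and because poll uses a timeout, consumers shut down cleanly instead of blocking forever, which is the same strategy BloomQueue's enqueue/dequeue use below.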
The overall structure after these improvements looks like this:
3. Implementation
Here is the improved code:
① First, the improved UrlQueue.java, renamed BloomQueue.java (the BloomFilter class it uses can be found in the hash table post, Introduction to Algorithms (13)):
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BloomQueue<T> {

    private BloomFilter<T> bloomFilter;
    private LinkedBlockingQueue<T> visitedList;
    private AtomicInteger flowedCount;
    private int queueCapacity;

    public BloomQueue() {
        this(0.000001, 10000000, 500);
    }

    public BloomQueue(double falsePositiveProbability, int filterCapacity, int queueCapacity) {
        this.queueCapacity = queueCapacity;
        bloomFilter = new BloomFilter<>(falsePositiveProbability, filterCapacity);
        visitedList = new LinkedBlockingQueue<>(queueCapacity);
        flowedCount = new AtomicInteger(0);
    }

    /**
     * Enqueue (blocks for up to 3 seconds by default when the queue is full)
     */
    public boolean enqueue(T t) {
        return enqueue(t, 3000);
    }

    /**
     * Enqueue
     *
     * @param timeout in milliseconds
     */
    public boolean enqueue(T t, long timeout) {
        try {
            boolean result = visitedList.offer(t, timeout, TimeUnit.MILLISECONDS);
            if (result) {
                bloomFilter.add(t);
                flowedCount.getAndIncrement();
            }
            return result;
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Dequeue (blocks for up to 3 seconds by default when the queue is empty)
     */
    public T dequeue() {
        return dequeue(3000);
    }

    /**
     * Dequeue
     *
     * @param timeout in milliseconds
     */
    public T dequeue(long timeout) {
        try {
            return visitedList.poll(timeout, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
        }
        return null;
    }

    /**
     * Is the element currently in the queue?
     */
    public boolean contains(T t) {
        return visitedList.contains(t);
    }

    /**
     * Has the element ever passed through the queue?
     */
    public boolean contained(T t) {
        return bloomFilter.contains(t);
    }

    public boolean isEmpty() {
        return visitedList.isEmpty();
    }

    public boolean isFull() {
        return visitedList.size() == queueCapacity;
    }

    public int size() {
        return visitedList.size();
    }

    public int flowedCount() {
        return flowedCount.get();
    }

    @Override
    public String toString() {
        return visitedList.toString();
    }
}
② Next, a thin wrapper around Berkeley DB to make it easier to use:
import java.io.File;

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.persist.EntityCursor;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;
import com.sleepycat.persist.StoreConfig;

public class DBHelper<T> {

    public static final String DEFAULT_DB_DIR = "C:/Users/Administrator/Desktop/db/";
    public static final String DEFAULT_Entity_Store = "EntityStore";

    public Environment myEnv;
    public EntityStore store;
    public PrimaryIndex<Long, T> primaryIndex;

    public DBHelper(Class<T> clazz) {
        this(clazz, DEFAULT_DB_DIR, DEFAULT_Entity_Store, false);
    }

    public DBHelper(Class<T> clazz, String dbDir, String storeName, boolean isRead) {
        File dir = new File(dbDir);
        if (!dir.exists()) {
            dir.mkdirs();
        }
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(!isRead);
        // Environment
        myEnv = new Environment(dir, envConfig);
        // StoreConfig
        StoreConfig storeConfig = new StoreConfig();
        storeConfig.setAllowCreate(!isRead);
        // store
        store = new EntityStore(myEnv, storeName, storeConfig);
        // PrimaryIndex
        primaryIndex = store.getPrimaryIndex(Long.class, clazz);
    }

    public void put(T t) {
        primaryIndex.put(t);
        store.sync();
        myEnv.sync();
    }

    public EntityCursor<T> entities() {
        return primaryIndex.entities();
    }

    public T get(long key) {
        return primaryIndex.get(key);
    }

    public void close() {
        if (store != null) {
            store.close();
        }
        if (myEnv != null) {
            myEnv.cleanLog();
            myEnv.close();
        }
    }
}
③ Then a Url entity class, for convenient storage:
import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;

@Entity
public class Url {

    @PrimaryKey(sequence = "Sequence_Namespace")
    private long id;
    private String url;

    public Url() {
    }

    public Url(String url) {
        super();
        this.url = url;
    }

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    @Override
    public String toString() {
        return url;
    }

    public boolean isEmpty() {
        return url == null || url.isEmpty();
    }
}
④ Finally, the core class, CrawlerEngine. It has two inner classes: Feeder and Fetcher. The Feeder adds urls from the DB to urlQueue; the Fetcher takes urls from urlQueue, requests the pages, and parses them. The JsoupDownloader class it uses is the same as in the previous post, unchanged.
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.jsoup.nodes.Document;

public class CrawlerEngine {

    public static final String DEFAULT_SAVE_DIR = "C:/Users/Administrator/Desktop/html/";
    private static final long FEEDER_SLEEP_TIME = 10;
    // maximum time the feeder waits when no url can be read from the DB
    // (if the DB is still empty after this long, the feeder stops)
    private static final long FEEDER_MAX_WAIT_TIME = 3 * 1000;
    // the maximum wait time above, expressed as a number of sleep cycles
    private static final int FEEDER_MAX_WAIT_COUNT = (int) (FEEDER_MAX_WAIT_TIME / FEEDER_SLEEP_TIME);
    private static final boolean LOG = false;

    private BloomQueue<Url> urlQueue;
    private ExecutorService fetcherPool;
    private int fetcherCount;
    private boolean running;
    private DBHelper<Url> dbHelper;
    private JsoupDownloader downloader;
    private String parseRegex;
    private String saveRegex;
    private String saveDir;
    private String saveName;
    private long maxCount = 1000;
    private long startTime;
    private long endTime = Long.MAX_VALUE;

    public CrawlerEngine() {
        this(20, DEFAULT_SAVE_DIR, null);
    }

    public CrawlerEngine(int fetcherCount, String saveDir, String saveName) {
        this.fetcherCount = fetcherCount;
        urlQueue = new BloomQueue<>();
        fetcherPool = Executors.newFixedThreadPool(fetcherCount);
        dbHelper = new DBHelper<>(Url.class);
        downloader = JsoupDownloader.getInstance();
        this.saveDir = saveDir;
        this.saveName = saveName;
    }

    public void startUp(String[] seeds) {
        if (running) {
            return;
        }
        running = true;
        startTime = System.currentTimeMillis();
        for (String seed : seeds) {
            Url url = new Url(seed);
            urlQueue.enqueue(url);
        }
        for (int i = 0; i < fetcherCount; i++) {
            fetcherPool.execute(new Fetcher());
        }
        new Feeder().start();
    }

    public void shutdownNow() {
        running = false;
        fetcherPool.shutdown();
    }

    public void shutdownAtTime(long time) {
        if (time > startTime) {
            endTime = time;
        }
    }

    public void shutdownDelayed(long delayed) {
        shutdownAtTime(startTime + delayed);
    }

    public void shutdownAtCount(long count) {
        maxCount = count;
    }

    private boolean isEnd() {
        return urlQueue.flowedCount() > maxCount || System.currentTimeMillis() > endTime;
    }

    private long currId = 1;
    private int currWaitCount;

    /**
     * Feeder
     * <p>
     * Reads batches of urls from the DB into the queue.
     * </p>
     *
     * @author D.K
     */
    private class Feeder extends Thread {
        @Override
        public void run() {
            while (!isEnd() && running && currWaitCount != FEEDER_MAX_WAIT_COUNT) {
                try {
                    sleep(FEEDER_SLEEP_TIME);
                    if (urlQueue.isFull()) {
                        log("Feeder", "queue is full");
                        continue;
                    }
                    Url url = dbHelper.get(currId);
                    if (url == null) {
                        currWaitCount++;
                        log("Feeder", "url is null, currWaitCount = " + currWaitCount);
                    } else {
                        // skip urls that have already passed through the queue;
                        // the null check guards against the DB running out mid-scan
                        while (url != null && urlQueue.contained(url)) {
                            currId++;
                            url = dbHelper.get(currId);
                        }
                        if (url != null) {
                            log("Feeder", "about to enqueue url");
                            urlQueue.enqueue(url);
                            currId++;
                            log("Feeder", "url enqueued, currId = " + currId);
                            currWaitCount = 0;
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            log("Feeder", "feeder finished...");
            while (true) {
                try {
                    sleep(100);
                    log("Feeder", "waiting for fetchers to finish...");
                } catch (InterruptedException e) {
                }
                if (urlQueue.isEmpty()) {
                    shutdownNow();
                    System.out.println(">>>>>>>>>>>> Crawl finished: requested " + urlQueue.flowedCount() + " pages in " + (System.currentTimeMillis() - startTime) + " ms <<<<<<<<<<<<");
                    return;
                }
            }
        }
    }

    /**
     * Fetcher
     * <p>
     * Takes urls from the queue, downloads and parses the pages, and adds the newly parsed urls to the DB.
     * </p>
     *
     * @author D.K
     */
    private class Fetcher implements Runnable {
        @Override
        public void run() {
            while (!isEnd() && (running || !urlQueue.isEmpty())) {
                log("Fetcher", "fetching url from queue, size = " + urlQueue.size());
                Url url = urlQueue.dequeue();
                if (url == null) {
                    log("Fetcher", "url is null");
                    continue;
                }
                log("Fetcher", "got a url");
                Document doc = downloader.downloadPage(url.getUrl());
                Set<String> urlSet = downloader.parsePage(doc, parseRegex);
                downloader.savePage(doc, saveDir, saveName, saveRegex);
                for (String str : urlSet) {
                    Url u = new Url(str);
                    if (!urlQueue.contained(u)) {
                        dbHelper.put(u);
                    }
                }
            }
        }
    }

    private void log(String talker, String content) {
        if (LOG) {
            System.out.println("[" + talker + "] " + content);
        }
    }

    public String getParseRegex() {
        return parseRegex;
    }

    public void setParseRegex(String parseRegex) {
        this.parseRegex = parseRegex;
    }

    public String getSaveRegex() {
        return saveRegex;
    }

    public void setSaveRegex(String saveRegex) {
        this.saveRegex = saveRegex;
    }

    public void setSavePath(String saveDir, String saveName) {
        this.saveDir = saveDir;
        this.saveName = saveName;
    }
}
We run the same test as in the previous post to verify the effect of our optimizations. Here is the test code:
public class Test {

    public static void main(String[] args) throws InterruptedException {
        CrawlerEngine crawlerEngine = new CrawlerEngine();
        crawlerEngine.setParseRegex("(http://www.cnblogs.com/artech/p|http://www.cnblogs.com/artech/default|http://www.cnblogs.com/artech/archive/\\d{4}/\\d{2}/\\d{2}/).*");
        crawlerEngine.setSaveRegex("(http://www.cnblogs.com/artech/p|http://www.cnblogs.com/artech/archive/\\d{4}/\\d{2}/\\d{2}/).*");
        crawlerEngine.startUp(new String[] { "http://www.cnblogs.com/artech/" });
        crawlerEngine.shutdownAtCount(1000);
    }
}
Here is the result of the run:
4. Summary
Compared with the 61 s the same test took in the previous post, the improved crawler finishes in 14 s, a clear gain in efficiency.
In the next post, we will make further small optimizations and polish the details, such as handling HTTP status codes and extracting interfaces to reduce coupling between components and increase flexibility.