[Undergraduate Thesis] The Research and Implementation of Distributed Web Crawler


The Research and Implementation of Distributed Web Crawler

Abstract

With the rapid development of the Internet, search engines play an increasingly important role in Internet search services. The web crawler is a vital component of a search engine system: it collects pages from the Internet, and these pages are then indexed to support the search engine. Faced with today's rapidly expanding volume of online information, a centralized single-machine crawler can no longer cope with the scale of the Internet, so high-performance distributed web crawler systems have become a focus of research in the field of information collection.

This thesis studies the principles of web crawlers, distributed architecture design, and the key modules, bottleneck problems, and solutions in a web crawler system. The main work is as follows:

1. Introduced a consistent hashing algorithm to handle the URL task partitioning strategy, load balancing among crawler hosts, and single-node hot-spot problems, ensuring that the distributed crawler system has good scalability, balance, and fault tolerance.

2. Designed and implemented a URL queue based on the Mercator model to satisfy the crawler's politeness and priority requirements.

3. Gave solutions to key bottleneck problems such as large-scale URL deduplication, DNS resolution, and page fetching and parsing.

4. Designed and implemented a thread pool model for efficient multi-threaded parallel page collection.

5. Proposed a file-based page storage scheme that stores and manages downloaded pages effectively through index files and data files.
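The consistent hashing in item 1 can be illustrated with a minimal ring sketch in Python (class and node names are hypothetical, not from the thesis). Each crawler node is hashed to several virtual points on a ring; a URL is assigned to the first node point clockwise from its own hash, so adding or removing a node only remaps the URLs that belonged to that node:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring for assigning URLs to crawler nodes.

    Each node is mapped to `replicas` virtual points on the ring so that
    load stays balanced and node changes only remap a small key fraction.
    """

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, url):
        """Return the crawler node responsible for this URL."""
        if not self._ring:
            raise ValueError("empty ring")
        h = self._hash(url)
        i = bisect.bisect(self._ring, (h, ""))  # first point clockwise
        if i == len(self._ring):
            i = 0  # wrap around the ring
        return self._ring[i][1]
```

In a real deployment the crawler would typically hash the URL's host rather than the full URL, so that one node owns each site and politeness limits stay local to that node.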

Building on the above work, this thesis designs and implements a high-performance distributed web crawler prototype system. Experiments show that the crawler not only fetches pages efficiently, is highly configurable, and runs stably, but also exhibits good distributed properties: scalability, fault tolerance, and load balancing.

Keywords: web crawler; distributed; consistent hashing; information collection; thread pool

 

The Research and Implementation of Distributed Web Crawler

Abstract

With the rapid development of the Internet, search engines, as the main entrance to the Internet, play a more and more important role. The web crawler is a very important part of a search engine; it is responsible for collecting web pages from the Internet. These pages are used to build the index that supports the search engine. Because of the great expansion of Internet information, a centralized, stand-alone web crawler has long been unable to keep up with the scale of the Internet, so high-performance distributed web crawler systems are becoming the focus of current research in the field of information collection.

This paper researches and demonstrates the principles, distributed architecture design, key modules, bottleneck problems, and solutions of a web crawler system. The main work is as follows:

1. This paper introduces consistent hashing, which is used to implement the URL partitioning strategy, mitigate hot spots, and balance load between crawler nodes, ensuring that the distributed crawler has good scalability, balance, and fault tolerance.

2. To meet the politeness and priority requirements of the web crawler, this paper designs and implements a URL queue based on the Mercator model.

3. Solutions are given to large-scale URL deduplication, DNS resolution, page fetching and parsing, and other key problems.

4. This paper designs and implements a thread pool model for efficient multi-threaded page collection.

5. A storage scheme for downloaded pages is given, which creates index files and data files to store and manage the downloaded data.
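The thread pool model in item 4 can be sketched as a small worker pool over a shared task queue, here using only Python's standard library (all names are hypothetical illustrations, not the thesis's actual classes). Worker threads repeatedly take a URL, fetch it, and pass the page to a callback; a `None` "poison pill" per worker shuts the pool down:

```python
import queue
import threading

class FetchThreadPool:
    """Minimal sketch: N worker threads pull URLs from a shared queue,
    fetch each one, and hand the result to a callback."""

    def __init__(self, fetch, on_page, workers=8):
        self.fetch = fetch        # callable: url -> page content
        self.on_page = on_page    # callable: (url, content) -> None
        self.tasks = queue.Queue()
        self.threads = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(workers)
        ]
        for t in self.threads:
            t.start()

    def submit(self, url):
        self.tasks.put(url)

    def _worker(self):
        while True:
            url = self.tasks.get()
            if url is None:        # poison pill: stop this worker
                self.tasks.task_done()
                return
            try:
                self.on_page(url, self.fetch(url))
            except Exception:
                pass               # a real crawler would log and retry
            finally:
                self.tasks.task_done()

    def shutdown(self):
        for _ in self.threads:     # one pill per worker
            self.tasks.put(None)
        for t in self.threads:
            t.join()
```

Because the queue is FIFO, all submitted URLs are processed before the pills are consumed, so `shutdown()` drains the pool cleanly.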

On the basis of the above work, this paper designs and implements a high-performance distributed web crawler prototype system. The experiments at the end of this paper show that the crawler not only fetches pages efficiently, is highly configurable, and runs stably, but also has good distributed properties such as scalability, fault tolerance, and load balancing.

Keywords: Web Crawler; Distributed; Consistent Hashing; Information Collection; Thread Pool

Original thesis: The Research and Implementation of Distributed Web Crawler

PS: This is an undergraduate thesis, so the treatment is fairly shallow, but it analyzes and implements some of the core concepts and functional modules of a web crawler.





 