Nutch currently comes in two versions:
- 1.6 - uses the Hadoop Distributed File System (HDFS) as its storage layer; stable and reliable.
- 2.1 - abstracts the storage layer through Gora, so the data can be kept in any of HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore or AvroStore, although some of these backends are not yet mature.
Setting up Nutch on Linux (CentOS):
- Install svn
yum install subversion
- Install ant
yum install ant
- Check out Nutch (go to http://nutch.apache.org; the svn address is listed under Version Control.)
svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
- Build Nutch with ant
cd release-1.6/
ant
After the ant build finishes, two directories appear under release-1.6: build and runtime. Inside runtime there are two subdirectories, deploy and local, corresponding to the two ways Nutch can run:
- deploy - run on Hadoop
- local - run on the local file system, limited to a single map and a single reduce task.
local/bin/nutch: working through this script is the key first step. It shows how the nutch script ties Nutch to Hadoop, submitting apache-nutch-1.6.job to Hadoop's JobTracker, and which Java class each command name is mapped to.

```bash
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# The Nutch command script
#
# Environment Variables
#
#   NUTCH_JAVA_HOME  The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE   The maximum amount of heap to use, in MB.
#                    Default is 1000.
#
#   NUTCH_OPTS       Extra Java runtime options.
#

cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac

# resolve links - $0 may be a softlink
THIS="$0"
while [ -h "$THIS" ]; do
  ls=`ls -ld "$THIS"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    THIS="$link"
  else
    THIS=`dirname "$THIS"`/"$link"
  fi
done

# if no args specified, show usage
if [ $# = 0 ]; then
  echo "Usage: nutch COMMAND"
  echo "where COMMAND is one of:"
  echo "  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)"
  echo "  readdb            read / dump crawl db"
  echo "  mergedb           merge crawldb-s, with optional filtering"
  echo "  readlinkdb        read / dump link db"
  echo "  inject            inject new urls into the database"
  echo "  generate          generate new segments to fetch from crawl db"
  echo "  freegen           generate new segments to fetch from text files"
  echo "  fetch             fetch a segment's pages"
  echo "  parse             parse a segment's pages"
  echo "  readseg           read / dump segment data"
  echo "  mergesegs         merge several segments, with optional filtering and slicing"
  echo "  updatedb          update crawl db from segments after fetching"
  echo "  invertlinks       create a linkdb from parsed segments"
  echo "  mergelinkdb       merge linkdb-s, with optional filtering"
  echo "  solrindex         run the solr indexer on parsed segments and linkdb"
  echo "  solrdedup         remove duplicates from solr"
  echo "  solrclean         remove HTTP 301 and 404 documents from solr"
  echo "  parsechecker      check the parser for a given url"
  echo "  indexchecker      check the indexing filters for a given url"
  echo "  domainstats       calculate domain statistics from crawldb"
  echo "  webgraph          generate a web graph from existing segments"
  echo "  linkrank          run a link analysis program on the generated web graph"
  echo "  scoreupdater      updates the crawldb with linkrank scores"
  echo "  nodedumper        dumps the web graph's node scores"
  echo "  plugin            load a plugin and run one of its classes main()"
  echo "  junit             runs the given JUnit test"
  echo " or"
  echo "  CLASSNAME         run the class named CLASSNAME"
  echo "Most commands print help when invoked w/o parameters."
  exit 1
fi

# get arguments
COMMAND=$1
shift

# some directories
THIS_DIR=`dirname "$THIS"`
NUTCH_HOME=`cd "$THIS_DIR/.." ; pwd`

# some Java parameters
if [ "$NUTCH_JAVA_HOME" != "" ]; then
  #echo "run java in $NUTCH_JAVA_HOME"
  JAVA_HOME=$NUTCH_JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

local=true

# NUTCH_JOB
if [ -f ${NUTCH_HOME}/*nutch*.job ]; then
  local=false
  for f in $NUTCH_HOME/*nutch*.job; do
    NUTCH_JOB=$f;
  done
fi

# cygwin path translation
if $cygwin; then
  NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $NUTCH_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi

# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

# so that filenames w/ spaces are handled correctly in loops below
IFS=

# add libs to CLASSPATH
if $local; then
  for f in $NUTCH_HOME/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done
  # local runtime
  # add plugins to classpath
  if [ -d "$NUTCH_HOME/plugins" ]; then
    CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
  fi
fi

# cygwin path translation
if $cygwin; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi

# setup 'java.library.path' for native-hadoop code if necessary
# used only in local mode
JAVA_LIBRARY_PATH=''
if [ -d "${NUTCH_HOME}/lib/native" ]; then
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
  if [ -d "${NUTCH_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi

if [ $cygwin = true -a "X${JAVA_LIBRARY_PATH}" != "X" ]; then
  JAVA_LIBRARY_PATH=`cygpath -p -w "$JAVA_LIBRARY_PATH"`
fi

# restore ordinary behaviour
unset IFS

# default log directory & file
if [ "$NUTCH_LOG_DIR" = "" ]; then
  NUTCH_LOG_DIR="$NUTCH_HOME/logs"
fi
if [ "$NUTCH_LOGFILE" = "" ]; then
  NUTCH_LOGFILE='hadoop.log'
fi

# Fix log path under cygwin
if $cygwin; then
  NUTCH_LOG_DIR=`cygpath -p -w "$NUTCH_LOG_DIR"`
fi

NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi

# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawl
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
  CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "mergedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
  CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexer
elif [ "$COMMAND" = "solrdedup" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "solrclean" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrClean
elif [ "$COMMAND" = "parsechecker" ] ; then
  CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "domainstats" ] ; then
  CLASS=org.apache.nutch.util.domain.DomainStatistics
elif [ "$COMMAND" = "webgraph" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.WebGraph
elif [ "$COMMAND" = "linkrank" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.LinkRank
elif [ "$COMMAND" = "scoreupdater" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.ScoreUpdater
elif [ "$COMMAND" = "nodedumper" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.NodeDumper
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "junit" ] ; then
  CLASSPATH=$CLASSPATH:$NUTCH_HOME/test/classes/
  CLASS=junit.textui.TestRunner
else
  CLASS=$COMMAND
fi

# distributed mode
EXEC_CALL="hadoop jar $NUTCH_JOB"

if $local; then
  EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
  # check that hadoop can be found on the path
  if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable.  Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
  fi
fi

# run it
exec $EXEC_CALL $CLASS "$@"
```
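To see how the two modes differ, it helps to expand the final `exec` by hand. The sketch below is only an approximation (the real classpath is assembled jar by jar by the script, and the argument values are placeholders): in local mode a command such as `inject` runs the mapped class directly in one JVM, while in deploy mode the same class is submitted to Hadoop inside apache-nutch-1.6.job.

```bash
# local mode (runtime/local): exec $JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH CLASS args
java -Xmx1000m -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log \
     -classpath "conf:lib/*" org.apache.nutch.crawl.Injector crawldb urls

# deploy mode (runtime/deploy): exec hadoop jar $NUTCH_JOB CLASS args
hadoop jar apache-nutch-1.6.job org.apache.nutch.crawl.Injector crawldb urls
```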
Running bin/nutch with no arguments prints the full list of commands:
```
[root@localhost local]# bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
```

```
[root@localhost local]# bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
```
The crawl parameters mean the following (a sample invocation follows the list):
- urlDir - directory containing the seed URLs
- -solr - <solrURL> is the address of the Solr server (leave it out if you do not have one)
- -dir - directory in which the crawled data is saved
- -threads - number of fetch threads (default 10)
- -depth - crawl depth (default 5)
- -topN - breadth of the crawl, i.e. the maximum number of pages fetched at each depth (default Long.MAX_VALUE)
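For example, a crawl that reads seeds from urls, stores its output under data, goes three levels deep, fetches at most 1000 pages per level and indexes into a local Solr instance could be started as below; the Solr URL and the numbers are illustrative placeholders, not values from this walkthrough:

```bash
bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir data -threads 20 -depth 3 -topN 1000
```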
Configure local/conf/nutch-site.xml
Getting deeper into Nutch means working through what every property in nutch-default.xml actually does, ideally alongside the source code. Open local/conf/nutch-default.xml and find:
```xml
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
```
Copy this property into the <configuration></configuration> element of nutch-site.xml and fill in a value for http.agent.name. The value plays the role of a browser User-Agent: a header string that lets a server identify the client's operating system and version, CPU type, browser and version, rendering engine, language, plugins and so on, for example Opera/9.80 (Windows NT 5.1; Edition IBIS) Presto/2.12.388 Version/12.15. Nutch needs this identifier so that it can announce itself while honouring the robots exclusion protocol, which is why the property must not be left empty.
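A minimal nutch-site.xml, written here with a shell here-document for convenience, might look like the sketch below; the agent name MyNutchSpider is only a placeholder, so pick a single word that identifies your organization:

```bash
# run from the runtime/local directory; replaces the empty conf/nutch-site.xml
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder value: use a single word related to your organization -->
    <value>MyNutchSpider</value>
  </property>
</configuration>
EOF
```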
Add seed URLs
Create a directory under local, for example urls, and inside it a file, for example url, containing the entry URLs of the sites you want to crawl, such as http://www.163.com/ (see the commands below).
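In shell form, using the directory and file names from the text:

```bash
# run from the runtime/local directory
mkdir -p urls
echo "http://www.163.com/" > urls/url
```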
Configure local/conf/regex-urlfilter.txt
Open local/conf/regex-urlfilter.txt, comment out the catch-all rule on the last line, and add a rule for the domain you want to crawl:
```
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# +.

+^http://([a-z0-9]*\.)*163\.com/
```
Now every page under 163.com can be crawled. Create a directory named data under local to hold the crawl output, choose suitable parameters, and start the crawl:
nohup bin/nutch crawl urls -dir data &
nohup appends the console output to nohup.out; while nutch runs, the crawler's log is written to local/logs/hadoop.log.
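A convenient way to follow the crawl while it runs in the background is to tail those files (standard shell usage, not part of the original walkthrough):

```bash
tail -f logs/hadoop.log   # the crawler's own log
tail -f nohup.out         # console output captured by nohup
```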
When the crawl finishes, the data directory contains three subdirectories, crawldb, linkdb and segments:
- crawldb - all of the hyperlinks the crawler has to fetch
- linkdb - every hyperlink together with the URLs that link to it and their anchor text
- segments - the fetched pages, one directory per crawl round, named after the time it was generated; there are at most as many segments as the crawl depth. Nutch crawls breadth-first: each level of URLs produces one segment, until no new URLs are found.
Each segment contains six subdirectories:
- crawl_generate - names a set of urls to be fetched
- crawl_fetch - contains the status of fetching each url
- content - contains the content of each url
- parse_text - contains the parsed text of each url
- parse_data - contains outlinks and metadata parsed from each url
- crawl_parse - contains the outlink urls, used to update the crawldb
These directories are not directly readable; they are stored in formats designed for efficient access and for the later indexing stage. To see their contents, use the read commands that Nutch provides:
1. Inspecting the CrawlDb (readdb)
```
[root@localhost local]# bin/nutch readdb
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
    <crawldb>   directory name where crawldb is located
    -stats [-sort]  print overall statistics to System.out
        [-sort] list status sorted by host
    -dump <out_dir> [-format normal|csv|crawldb]    dump the whole db to a text file in <out_dir>
        [-format csv]   dump in Csv format
        [-format normal]    dump in standard format (default option)
        [-format crawldb]   dump as CrawlDB
        [-regex <expr>] filter records with expression
        [-status <status>]  filter records by CrawlDatum status
    -url <url>  print information on <url> to System.out
    -topN <nnnn> <out_dir> [<min>]  dump top <nnnn> urls sorted by score to <out_dir>
        [<min>] skip records with scores below this value.
            This can significantly improve performance.
```
Show the total number of URLs together with their status and scores:
```
[root@localhost local]# bin/nutch readdb data/crawldb/ -stats
CrawlDb statistics start: data/crawldb/
Statistics for CrawlDb: data/crawldb/
TOTAL urls: 10635
retry 0:    10615
retry 1:    20
min score:  0.0
avg score:  2.6920545E-4
max score:  1.123
status 1 (db_unfetched):    9614
status 2 (db_fetched):      934
status 3 (db_gone):         2
status 4 (db_redir_temp):   81
status 5 (db_redir_perm):   4
CrawlDb statistics: done
```
Dump the full record of every URL: bin/nutch readdb data/crawldb/ -dump crawldb (where crawldb is the output directory).
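The dump is written as plain text inside the output directory, so it can be read with ordinary tools; in the sketch below the output directory name and the part-00000 file name (Hadoop's default naming for job output) are assumptions, not values from the original text:

```bash
bin/nutch readdb data/crawldb/ -dump crawldb_dump
less crawldb_dump/part-00000
```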
2. Inspecting the linkdb
Show the inlinks of a given URL: bin/nutch readlinkdb data/linkdb/ -url http://www.163.com/
Dump the whole linkdb: bin/nutch readlinkdb data/linkdb/ -dump linkdb (where linkdb is the output directory)
3. Inspecting the segments
bin/nutch readseg -list -dir data/segments/ shows, for every segment, its name, the number of URLs generated, the fetch start and end times, and the numbers of fetched and parsed pages.
```
[root@localhost local]# bin/nutch readseg -list -dir data/segments/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20130427150144  53         2013-04-27T15:01:52  2013-04-27T15:05:15  53       51
20130427150553  1036       2013-04-27T15:06:01  2013-04-27T15:58:09  1094     921
20130427150102  1          2013-04-27T15:01:10  2013-04-27T15:01:10  1        1
```
Dump a segment: bin/nutch readseg -dump data/segments/20130427150144 segdb
Here data/segments/20130427150144 is one segment directory and segdb is the directory that receives the converted, readable output.
This last command is probably the most useful one for getting at page content, and it is usually combined with a few options:
bin/nutch readseg -dump data/segments/20130427150144/ data_oscar/segments -nofetch -nogenerate -noparse -noparsedata -nocontent
The resulting dump then contains only the plain text of the pages, without any markup.
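The extracted text can then be read directly; the file name below is what SegmentReader normally writes into the output directory, but treat the exact path as an assumption:

```bash
# read the dumped page text produced by the readseg -dump command above
less data_oscar/segments/dump
```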
Credits: http://yangshangchuan.iteye.com