Apache Nutch (Part 1)


Nutch currently has two release lines:

  • 1.6 - Nutch 1.6 uses the Hadoop Distributed File System (HDFS) for storage; it is stable and reliable.
  • 2.1 - The storage layer is abstracted through Gora, so data can be stored in HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, or AvroStore, although some of these backends are not yet mature.

Setting up Nutch on Linux (CentOS):

  1. Install svn
    yum install subversion
  2. Install ant
    yum install ant
  3. Check out Nutch (visit http://nutch.apache.org; the svn URL is listed in the Version Control section.)
    svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
  4. Build Nutch with ant
    cd release-1.6/
    ant

After the ant build finishes, two directories appear under release-1.6: build and runtime. Inside runtime there are two subdirectories, deploy and local, corresponding to the two ways of running Nutch (a quick sanity check follows the list below):

  • deploy - runs on Hadoop
  • local - runs on the local filesystem, with only a single map and a single reduce task.
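
A quick sanity check of the build output (paths assume the release-1.6 checkout used above; the exact file lists may vary slightly):

ls runtime
#   deploy  local
ls runtime/deploy
#   apache-nutch-1.6.job  bin
ls runtime/local
#   bin  conf  lib  plugins  test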

local/bin/nutch: reading the nutch script is the key entry point. It shows how the script ties Hadoop and Nutch together by submitting apache-nutch-1.6.job to Hadoop's JobTracker, and which Java class each command is mapped to.

The nutch script
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# 
# The Nutch command script
#
# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB. 
#                   Default is 1000.
#
#   NUTCH_OPTS      Extra Java runtime options.
#
cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac

# resolve links - $0 may be a softlink
THIS="$0"
while [ -h "$THIS" ]; do
  ls=`ls -ld "$THIS"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    THIS="$link"
  else
    THIS=`dirname "$THIS"`/"$link"
  fi
done

# if no args specified, show usage
if [ $# = 0 ]; then
  echo "Usage: nutch COMMAND"
  echo "where COMMAND is one of:"
  echo "  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)"
  echo "  readdb            read / dump crawl db"
  echo "  mergedb           merge crawldb-s, with optional filtering"
  echo "  readlinkdb        read / dump link db"
  echo "  inject            inject new urls into the database"
  echo "  generate          generate new segments to fetch from crawl db"
  echo "  freegen           generate new segments to fetch from text files"
  echo "  fetch             fetch a segment's pages"
  echo "  parse             parse a segment's pages"
  echo "  readseg           read / dump segment data"
  echo "  mergesegs         merge several segments, with optional filtering and slicing"
  echo "  updatedb          update crawl db from segments after fetching"
  echo "  invertlinks       create a linkdb from parsed segments"
  echo "  mergelinkdb       merge linkdb-s, with optional filtering"
  echo "  solrindex         run the solr indexer on parsed segments and linkdb"
  echo "  solrdedup         remove duplicates from solr"
  echo "  solrclean         remove HTTP 301 and 404 documents from solr"
  echo "  parsechecker      check the parser for a given url"
  echo "  indexchecker      check the indexing filters for a given url"
  echo "  domainstats       calculate domain statistics from crawldb"
  echo "  webgraph          generate a web graph from existing segments"
  echo "  linkrank          run a link analysis program on the generated web graph"
  echo "  scoreupdater      updates the crawldb with linkrank scores"
  echo "  nodedumper        dumps the web graph's node scores"
  echo "  plugin            load a plugin and run one of its classes main()"
  echo "  junit             runs the given JUnit test"
  echo " or"
  echo "  CLASSNAME         run the class named CLASSNAME"
  echo "Most commands print help when invoked w/o parameters."
  exit 1
fi

# get arguments
COMMAND=$1
shift

# some directories
THIS_DIR=`dirname "$THIS"`
NUTCH_HOME=`cd "$THIS_DIR/.." ; pwd`

# some Java parameters
if [ "$NUTCH_JAVA_HOME" != "" ]; then
  #echo "run java in $NUTCH_JAVA_HOME"
  JAVA_HOME=$NUTCH_JAVA_HOME
fi
  
if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

local=true

# NUTCH_JOB 
if [ -f ${NUTCH_HOME}/*nutch*.job ]; then
    local=false
  for f in $NUTCH_HOME/*nutch*.job; do
    NUTCH_JOB=$f;
  done
fi

# cygwin path translation
if $cygwin; then
  NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m 

# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $NUTCH_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi

# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

# so that filenames w/ spaces are handled correctly in loops below
IFS=

# add libs to CLASSPATH
if $local; then
  for f in $NUTCH_HOME/lib/*.jar; do
   CLASSPATH=${CLASSPATH}:$f;
  done
  # local runtime
  # add plugins to classpath
  if [ -d "$NUTCH_HOME/plugins" ]; then
     CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
  fi
fi

# cygwin path translation
if $cygwin; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi

# setup 'java.library.path' for native-hadoop code if necessary
# used only in local mode 
JAVA_LIBRARY_PATH=''
if [ -d "${NUTCH_HOME}/lib/native" ]; then
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
  
  if [ -d "${NUTCH_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi

if [ $cygwin = true -a "X${JAVA_LIBRARY_PATH}" != "X" ]; then
  JAVA_LIBRARY_PATH=`cygpath -p -w "$JAVA_LIBRARY_PATH"`
fi

# restore ordinary behaviour
unset IFS

# default log directory & file
if [ "$NUTCH_LOG_DIR" = "" ]; then
  NUTCH_LOG_DIR="$NUTCH_HOME/logs"
fi
if [ "$NUTCH_LOGFILE" = "" ]; then
  NUTCH_LOGFILE='hadoop.log'
fi

#Fix log path under cygwin
if $cygwin; then
  NUTCH_LOG_DIR=`cygpath -p -w "$NUTCH_LOG_DIR"`
fi

NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi

# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawl
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
  CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "mergedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
  CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexer
elif [ "$COMMAND" = "solrdedup" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "solrclean" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrClean
elif [ "$COMMAND" = "parsechecker" ] ; then
  CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "domainstats" ] ; then 
  CLASS=org.apache.nutch.util.domain.DomainStatistics
elif [ "$COMMAND" = "webgraph" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.WebGraph
elif [ "$COMMAND" = "linkrank" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.LinkRank
elif [ "$COMMAND" = "scoreupdater" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.ScoreUpdater
elif [ "$COMMAND" = "nodedumper" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.NodeDumper
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "junit" ] ; then
  CLASSPATH=$CLASSPATH:$NUTCH_HOME/test/classes/
  CLASS=junit.textui.TestRunner
else
  CLASS=$COMMAND
fi

# distributed mode
EXEC_CALL="hadoop jar $NUTCH_JOB"

if $local; then
 EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
 # check that hadoop can be found on the path
 if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
 fi
fi

# run it
exec $EXEC_CALL $CLASS "$@"

All nutch commands

[root@localhost local]# bin/nutch 
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
[root@localhost local]# bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]

Meaning of the crawl arguments (an example invocation follows the list):

  • urlDir - directory containing the seed URLs
  • -solr - <solrURL> is the address of the Solr server (omit or leave empty if none)
  • -dir - directory in which the crawl data is saved
  • -threads - number of fetch threads (default 10)
  • -depth - crawl depth (default 5)
  • -topN - maximum number of URLs fetched per round, i.e. the breadth of the crawl (default Long.MAX_VALUE)
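
For example, a crawl with every parameter spelled out might look like this (the Solr URL is only a placeholder; drop -solr if you are not indexing into Solr):

bin/nutch crawl urls -solr http://localhost:8983/solr -dir data -threads 20 -depth 3 -topN 1000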

Configure local/conf/nutch-site.xml

Mastering Nutch comes from studying what each property in nutch-default.xml actually means, ideally alongside the source code. Open local/conf/nutch-default.xml and find:

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

Copy the property above into the <configuration></configuration> section of nutch-site.xml and give http.agent.name a value. The value plays the role of a browser User-Agent: a request header that lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, language, plugins, and so on, for example Opera/9.80 (Windows NT 5.1; Edition IBIS) Presto/2.12.388 Version/12.15. Nutch honours the robots protocol and identifies itself with this agent name, so it must not be left empty.
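
A minimal sketch of the resulting nutch-site.xml, written here as a shell heredoc; the agent name MyNutchSpider is only a placeholder, and the command overwrites the (essentially empty) nutch-site.xml shipped with the release:

cat > local/conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
</configuration>
EOF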

Add seed URLs

Under the local directory, create a folder such as urls, and inside it a file such as url containing the entry URLs of the sites you want to crawl, e.g. http://www.163.com/
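
For example (run from the local directory; the URL is only a sample seed):

mkdir -p urls
echo "http://www.163.com/" > urls/url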

Configure local/conf/regex-urlfilter.txt

Open local/conf/regex-urlfilter.txt, comment out the last line (+.), and add a pattern for the domain of the site you want to crawl:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# +.
+^http://([a-z0-9]*\.)*163\.com/

Now Nutch can crawl all of 163's pages. Create a data folder under local to hold the crawl output and start the crawl with suitable parameters:

nohup bin/nutch crawl urls -dir data &

nohup appends console output to nohup.out, and while nutch runs the crawler log is written to local/logs/hadoop.log.
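
To follow the crawl while it runs (both paths are relative to the local directory):

tail -f logs/hadoop.log   # detailed crawler log
tail -f nohup.out         # console output captured by nohup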

After the crawl finishes, the data folder contains three subdirectories, crawldb, linkdb, and segments:

  • crawldb - all the URLs Nutch knows about, together with their fetch status and score
  • linkdb - all links, with the inlink source URLs and anchor text for each target URL
  • segments - the fetched pages, in folders named after the fetch time. There are at most as many segments as the crawl depth: Nutch crawls breadth-first, generating one segment per round until no new URLs remain.

Each segment contains six subdirectories (a quick way to inspect them is shown after the list):

  • crawl_generate - names the set of URLs to be fetched
  • crawl_fetch - contains the fetch status of each URL
  • content - contains the raw content retrieved from each URL
  • parse_text - contains the parsed text of each URL
  • parse_data - contains the outlinks and metadata parsed from each URL
  • crawl_parse - contains the outlink URLs, used to update the crawldb
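
A quick way to see this layout (the timestamped segment names below are taken from the run shown later):

ls data/segments/
#   20130427150102  20130427150144  20130427150553
ls data/segments/20130427150144/
#   content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text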

These directories are stored in Hadoop's binary (sequence/map file) format for efficient access and higher-level retrieval, so they are not directly human-readable. To see what is inside them, use the read commands that Nutch provides:

1. Inspecting the CrawlDB (readdb)

[root@localhost local]# bin/nutch readdb
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
    <crawldb>    directory name where crawldb is located
    -stats [-sort]     print overall statistics to System.out
        [-sort]    list status sorted by host
    -dump <out_dir> [-format normal|csv|crawldb]    dump the whole db to a text file in <out_dir>
        [-format csv]    dump in Csv format
        [-format normal]    dump in standard format (default option)
        [-format crawldb]    dump as CrawlDB
        [-regex <expr>]    filter records with expression
        [-status <status>]    filter records by CrawlDatum status
    -url <url>    print information on <url> to System.out
    -topN <nnnn> <out_dir> [<min>]    dump top <nnnn> urls sorted by score to <out_dir>
        [<min>]    skip records with scores below this value.
            This can significantly improve performance.

View the total number of URLs along with their status and score:

[root@localhost local]# bin/nutch readdb data/crawldb/ -stats
CrawlDb statistics start: data/crawldb/
Statistics for CrawlDb: data/crawldb/
TOTAL urls:    10635
retry 0:    10615
retry 1:    20
min score:    0.0
avg score:    2.6920545E-4
max score:    1.123
status 1 (db_unfetched):    9614
status 2 (db_fetched):    934
status 3 (db_gone):    2
status 4 (db_redir_temp):    81
status 5 (db_redir_perm):    4
CrawlDb statistics: done

Dump the details of every URL: bin/nutch readdb data/crawldb/ -dump crawldb (where crawldb is the output directory)
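
The dump is written as plain-text part files in the output directory (typically named part-00000 and so on for a local run), so it can be inspected with ordinary shell tools:

less crawldb/part-00000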

2. Inspecting the linkdb

View the inlinks of a URL: bin/nutch readlinkdb data/linkdb/ -url http://www.163.com/
Dump the whole linkdb: bin/nutch readlinkdb data/linkdb/ -dump linkdb (where linkdb is the output directory)

3. Inspecting the segments

bin/nutch readseg -list -dir data/segments/ shows, for each segment, its name, the number of URLs generated, the fetch start and end times, and the numbers of pages fetched and parsed.

[root@localhost local]# bin/nutch readseg -list -dir data/segments/
NAME              GENERATED    FETCHER START          FETCHER END            FETCHED    PARSED
20130427150144    53           2013-04-27T15:01:52    2013-04-27T15:05:15    53         51
20130427150553    1036         2013-04-27T15:06:01    2013-04-27T15:58:09    1094       921
20130427150102    1            2013-04-27T15:01:10    2013-04-27T15:01:10    1          1

Dump a segment: bin/nutch readseg -dump data/segments/20130427150144 segdb
Here data/segments/20130427150144 is one segment folder and segdb is the folder that will hold the converted content.

This last command is probably the most useful one for getting at the page content; it is usually run with several options to limit what is dumped:
bin/nutch readseg -dump data/segments/20130427150144/ data_oscar/segments -nofetch -nogenerate -noparse -noparsedata -nocontent
With all of those parts excluded, the resulting dump contains only the parsed text of the pages, with no markup.
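
The converted output is plain text; in this version SegmentReader should write it to a file named dump inside the output directory, which can then be paged through directly:

less data_oscar/segments/dump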

 

 

Thanks to: http://yangshangchuan.iteye.com

 

