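I recently took a look at Nutch. The latest release at the moment, Nutch 2.3.1, supports storage backends such as HBase and MongoDB, but while setting it up and testing it I found that its MySQL support seems to have some problems.
I therefore switched to Nutch 2.2.1. The steps for configuring a Nutch 2.2.1 + MySQL environment are as follows: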
1. Download the Nutch 2.2.1 source code. SVN: https://svn.apache.org/repos/asf/nutch/branches/branch-2.2.1
2. Edit ivy/ivysettings.xml in the Nutch 2.2.1 source tree
- Add a repository:
<property name="org.restlet"
value="http://maven.restlet.org"
override="false"/>
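- Add the part that was highlighted in red in the original post to the default chain (the highlighting is lost here; presumably it is the restlet resolver line):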
<chain name="default" dual="true">
<resolver ref="local"/>
<resolver ref="maven2"/>
<resolver ref="apache-snapshot"/>
<resolver ref="sonatype"/>
<resolver ref="restlet"/>
</chain>
In my tests, some packages could not be downloaded without this addition; this may be related to the network.
3. Edit ivy/ivy.xml
Enable the following two dependencies:
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
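4. Open a command line and change to the Nutch directory
Run:
ant eclipse -verbose
Because of limited network bandwidth, the whole process took about half an hour.
When it finishes, the result is as shown in the figure below.

The build folder now contains much more than it did before.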
5. Open Eclipse
Use Import to import the Nutch project.


6. Configure conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>YourNutchSpider</value>
  </property>
  <property>
    <name>http.accept.language</name>
    <value>ja-jp, en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
    <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.</description>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
    <description>The character encoding to fall back to when no other information is available</description>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>src/plugin</value>
    <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
    <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: ….</description>
  </property>
  <property>
    <name>generate.batch.id</name>
    <value>*</value>
  </property>
</configuration>
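Before launching a crawl it can help to confirm that the file is well-formed and that the overrides read as intended. The following is only a minimal sketch using the JDK's built-in XML parser; it is not how Nutch or Hadoop load their configuration, and the class name NutchSiteReader and the inline sample are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Nutch): lists the name/value pairs
// found in a Hadoop-style configuration file such as nutch-site.xml.
class NutchSiteReader {

    /** Parses the XML text and returns name -> value in document order. */
    static Map<String, String> readProperties(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            // Skip property elements that lack a name or a value.
            if (names.getLength() == 0 || values.getLength() == 0) {
                continue;
            }
            result.put(names.item(0).getTextContent().trim(),
                       values.item(0).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample; in practice, read conf/nutch-site.xml from disk.
        String sample = "<?xml version=\"1.0\"?>"
                + "<configuration>"
                + "<property><name>http.agent.name</name><value>YourNutchSpider</value></property>"
                + "<property><name>storage.data.store.class</name>"
                + "<value>org.apache.gora.sql.store.SqlStore</value></property>"
                + "</configuration>";
        for (Map.Entry<String, String> e : readProperties(sample).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

If parsing throws, the file is not well-formed XML; if http.agent.name or storage.data.store.class is missing from the printed list, the corresponding property block was probably mistyped.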
7. Configure gora.properties
gora.datastore.default=org.apache.gora.sql.store.SqlStore
gora.datastore.autocreateschema=true
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=
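Each setting in gora.properties has to sit on its own line, and a missing key tends to surface only later as a storage error. As a quick sanity check, here is a minimal sketch that loads the properties text with java.util.Properties and reports which of the keys used above are absent or blank. The class name GoraPropsCheck and the choice of "required" keys are my own assumptions, not something Gora defines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of Gora or Nutch).
class GoraPropsCheck {

    // Keys taken from the gora.properties listing in this post;
    // treating them as mandatory is an assumption of this sketch.
    static final String[] REQUIRED_KEYS = {
        "gora.datastore.default",
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user"
    };

    /** Returns the required keys that are absent or blank in the given text. */
    static List<String> missingKeys(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Shortened sample; in practice, read conf/gora.properties from disk.
        String sample = String.join("\n",
            "gora.datastore.default=org.apache.gora.sql.store.SqlStore",
            "gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver",
            "gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true",
            "gora.sqlstore.jdbc.user=root",
            "gora.sqlstore.jdbc.password=");
        System.out.println("Missing keys: " + missingKeys(sample));
    }
}
```

The empty password is deliberately left out of the required list, since the configuration above uses a blank root password.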
8. Create the MySQL database and table structure
CREATE TABLE webpage (
id varchar(256) NOT NULL,
headers blob,
text longtext DEFAULT NULL,
status int(11) DEFAULT NULL,
markers blob,
parseStatus blob,
modifiedTime bigint(20) DEFAULT NULL,
prevModifiedTime bigint(20) DEFAULT NULL,
score float DEFAULT NULL,
typ varchar(32) CHARACTER SET latin1 DEFAULT NULL,
batchId varchar(32) CHARACTER SET latin1 DEFAULT NULL,
baseUrl varchar(256) DEFAULT NULL,
content longblob,
title text DEFAULT NULL,
reprUrl varchar(256) DEFAULT NULL,
fetchInterval int(11) DEFAULT NULL,
prevFetchTime bigint(20) DEFAULT NULL,
inlinks mediumblob,
prevSignature blob,
outlinks mediumblob,
fetchTime bigint(20) DEFAULT NULL,
retriesSinceFetch int(11) DEFAULT NULL,
protocolStatus blob,
signature blob,
metadata blob,
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
9. Configure the run arguments for Crawler.java

10. Modify Hadoop's FileUtil.java
Because of a Windows platform issue, FileUtil.java has to be modified: comment out the part that was highlighted in red in the original post (the body of checkReturnValue, shown commented out here). Otherwise the crawl fails with a Hadoop path-permission error.
private static void checkReturnValue(boolean rv, File p, FsPermission permission)
    throws IOException
{
    //if (!rv)
    //    throw new IOException(new StringBuilder().append("Failed to set permissions of path: ").append(p).append(" to ").append(String.format("%04o", new Object[] { Short.valueOf(permission.toShort()) })).toString());
}
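11. Create a urls folder in the project directory, and create a seed.txt file inside it
Add the URLs of the sites to crawl, for example: http://www.cnblogs.com/
Note: this urls folder corresponds to the urls argument in the Crawler run arguments.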
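The folder and file can of course be created by hand. If you prefer to script it, the step can be sketched as follows; the class name SeedFileWriter is made up, and the one-URL-per-line layout is what I assume the seed file should look like.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Nutch): creates urls/seed.txt
// under the given project directory, one URL per line.
class SeedFileWriter {

    /** Creates <projectDir>/urls/seed.txt and returns its path. */
    static Path writeSeeds(Path projectDir, List<String> urls) throws IOException {
        // createDirectories does nothing if the folder already exists.
        Path urlsDir = Files.createDirectories(projectDir.resolve("urls"));
        Path seedFile = urlsDir.resolve("seed.txt");
        // Overwrites any existing seed.txt with the given URLs.
        Files.write(seedFile, urls, StandardCharsets.UTF_8);
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory; pass the project path as the first argument.
        Path projectDir = Paths.get(args.length > 0 ? args[0] : ".");
        Path seed = writeSeeds(projectDir, Arrays.asList("http://www.cnblogs.com/"));
        System.out.println("Wrote " + seed.toAbsolutePath());
    }
}
```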
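12. Run Crawler.java and inspect the data in MySQL
13. In most cases, a site may restrict crawlers through robots.txt
Nutch honors that protocol, but the check can be bypassed by modifying the Nutch source.
Simply comment out the following code in the FetcherReducer class: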
/*
if (!rules.isAllowed(fit.u.toString())) {
// unblock
fetchQueues.finishFetchItem(fit, true);
if (LOG.isDebugEnabled()) {
LOG.debug("Denied by robots.txt: " + fit.url);
}
output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
CrawlStatus.STATUS_GONE);
continue;
}
*/
