1. Introduction
Mahout is an open-source project under the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, with the goal of helping developers build intelligent applications more quickly and easily. The Apache Mahout project is now in its third year and has had three public releases. Mahout ships implementations of clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, by building on the Apache Hadoop libraries, Mahout scales effectively into the cloud.
2. Download and Preparation
Software downloads
Download Hadoop from http://labs.renren.com/apache-mirror/hadoop/common/ and choose a suitable release (this article uses the stable release hadoop-0.20.203.0rc1.tar.gz).
Download Mahout from http://labs.renren.com/apache-mirror/mahout/
(this article uses mahout-distribution-0.5.tar.gz).
For additional functionality you may also need to download Maven and mahout-collections.
Data download
Data source: http://kdd.ics.uci.edu/databases/ hosts a large number of classic datasets for download.
(This article uses the synthetic_control dataset, synthetic_control.tar.gz.)
3. Installation and Deployment
To avoid polluting the Linux root environment, this article installs everything under the personal home directory, with $HOME/local as the program directory.
The archives have already been downloaded to $HOME/Downloads; unpack them with tar:
tar zxvf hadoop-0.20.203.0rc1.tar.gz -C ~/local/
cd ~/local
mv hadoop-0.20.203.0 hadoop
tar zxvf mahout-distribution-0.5.tar.gz -C ~/local/
cd ~/local
mv mahout-distribution-0.5 mahout
Edit .bash_profile / .bashrc:
export HADOOP_HOME=$HOME/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export MAHOUT_HOME=$HOME/local/mahout
To make the commands easier to use, add each program's bin directory to $PATH, or simply define aliases (a PATH sketch follows the alias block below).
#Alias for apps
alias mahout='$HOME/local/mahout/bin/mahout'
alias hdp='$HOME/local/hadoop/bin/hadoop'
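If you prefer the $PATH approach mentioned above instead of aliases, a minimal sketch for .bashrc is shown below (it assumes the $HOME/local layout used in this article):
# Put the hadoop and mahout launchers on the PATH instead of defining aliases
export PATH=$PATH:$HOME/local/hadoop/bin:$HOME/local/mahout/bin
# With this you invoke the real command names (hadoop, mahout) rather than the hdp alias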
Testing
Run the command: mahout
Expected output:
Running on hadoop, using HADOOP_HOME=/home/username/local/hadoop
HADOOP_CONF_DIR=/home/username/local/hadoop/conf
An example program must be given as the first argument.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
dirichlet: : Dirichlet Clustering
eigencuts: : Eigencuts spectral clustering
evaluateFactorization: : compute RMSE of a rating matrix factorization against probes in memory
evaluateFactorizationParallel: : compute RMSE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
fpg: : Frequent Pattern Growth
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lda: : Latent Dirchlet Allocation
ldatopics: : LDA Print Topics
lucene.vector: : Generate Vectors from a Lucene index
matrixmult: : Take the product of two matrices
meanshift: : Mean Shift clustering
parallelALS: : ALS-WR factorization of a rating matrix
predictFromFactorization: : predict preferences from a factorization of a rating matrix
prepare20newsgroups: : Reformat 20 newsgroups data
recommenditembased: : Compute recommendations using item-based collaborative filtering
rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
runlogistic: : Run a logistic regression model against CSV data
seq2sparse: : Sparse Vector generation from Text sequence files
seqdirectory: : Generate sequence files (of Text) from a directory
seqdumper: : Generic Sequence File dumper
seqwiki: : Wikipedia xml dump to sequence file
spectralkmeans: : Spectral k-means clustering
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
svd: : Lanczos Singular Value Decomposition
testclassifier: : Test Bayes Classifier
trainclassifier: : Train Bayes Classifier
trainlogistic: : Train a logistic regression using stochastic gradient descent
transpose: : Take the transpose of a matrix
vectordump: : Dump vectors from a sequence file to text
wikipediaDataSetCreator: : Splits data set of wikipedia wrt feature like country
wikipediaXMLSplitter: : Reads wikipedia data and creates chunks
Run the command: hdp
Expected output:
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
datanode run a DFS datanode
dfsadmin run a DFS admin client
mradmin run a Map-Reduce admin client
fsck run a DFS filesystem checking utility
fs run a generic filesystem user client
balancer run a cluster balancing utility
fetchdt fetch a delegation token from the NameNode
jobtracker run the MapReduce job Tracker node
pipes run a Pipes job
tasktracker run a MapReduce task Tracker node
historyserver run job history servers as a standalone daemon
job manipulate MapReduce jobs
queue get information regarding JobQueues
version print the version
jar <jar> run a jar file
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
Hadoop jar and the required libraries
daemonlog get/set the log level for each daemon
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
4. Running
Step 1:
The following command shows which algorithms Mahout provides and how to use them:
mahout --help
For example, a k-means run looks roughly like this (in Mahout 0.5, --clusters expects an HDFS path holding the initial clusters, -k sets the number of clusters, and --maxIter caps the iterations; the initial-clusters path here is a placeholder):
mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters /user/hive/warehouse/tmp_data/initial-clusters -k 5 --maxIter 10 --output /home/hadoopuser/1.txt
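Each Mahout driver also prints its own usage message, which is the quickest way to confirm the exact flags your installed Mahout version expects. For example:
# Show the options accepted by the kmeans driver of the installed Mahout version
mahout kmeans --help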
Files processed by Mahout must be in SequenceFile format, so plain text files first need to be converted to SequenceFiles. SequenceFile is a Hadoop class that lets us write binary key-value pairs to a file; for a detailed introduction see
the post by eyjian at http://www.hadoopor.com/viewthread.php?tid=144&highlight=sequencefile
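To get a feel for what a SequenceFile contains, you can often render one as readable key-value lines with hadoop fs -text (the path below is a placeholder):
# Print a SequenceFile as text; fs -text detects SequenceFiles and prints their keys and values
hdp fs -text /path/to/some/sequencefile | head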
Mahout provides a way to convert the files under a given directory into SequenceFiles.
(You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.)
Usage is as follows:
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
For example:
mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
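To verify the conversion, you can dump one of the generated files back to text with Mahout's seqdumper. The chunk-0 file name and the --seqFile flag below are what Mahout 0.5 used, but treat them as assumptions and list the directory (or run mahout seqdumper --help) first if anything differs:
# List the seqdirectory output, then dump the first chunk as text (chunk-0 is an assumed file name)
hdp fs -ls /mahout/seq/
mahout seqdumper --seqFile /mahout/seq/chunk-0 | head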
Step 2:
Run a simple k-means example:
1. Put the sample dataset into HDFS; it should go under the testdata directory:
$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
For example:
hdp fs -put ~/dataset/synthetic_control/test/synthetic_control.data testdata
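A quick check that the upload landed where the example jobs expect it:
# List the HDFS testdata directory that the synthetic control example jobs read from
hdp fs -ls testdata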
2. Run the k-means algorithm:
hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
For example:
hdp jar ~/local/mahout/mahout-examples-0.5-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
3. Run the canopy algorithm:
hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
For example:
hdp jar ~/local/mahout/mahout-examples-0.5-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job
4. Run the Dirichlet algorithm:
hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
5. Run the mean-shift algorithm:
hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
6. View the results:
mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000
This prints the result directly to the console.
You can also look around in HDFS to see what the data looks like.
Most of the examples above read their input from the testdata directory and write their results under output.
Use hdp fs -lsr to list all of the output files.
The k-means output is under output/points.
The canopy and mean-shift results are placed in output/clustered-points.
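For a more readable summary of the clusters themselves, clusterdump can print the cluster centers (and optionally the points assigned to them). The directory names below follow the layout described above and may differ between Mahout versions, so treat them as placeholders and check hdp fs -lsr output first:
# Dump the final clusters to a local text file; adjust clusters-10 and the points directory to whatever your run produced
mahout clusterdump --seqFileDir output/clusters-10 --pointsDir output/clustered-points --output clusteranalyze.txt
cat clusteranalyze.txt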