1. Introduction
Mahout is an open-source project under the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, with the goal of helping developers build intelligent applications more quickly and easily. The Apache Mahout project is now in its third year and has had three public releases. Mahout ships implementations of clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, by building on the Apache Hadoop libraries, Mahout scales effectively into the cloud.
2. Download and Preparation
Software downloads
Download Hadoop from http://labs.renren.com/apache-mirror/hadoop/common/ and choose a suitable release (this article uses the stable release hadoop-0.20.203.0rc1.tar.gz).
Download Mahout from http://labs.renren.com/apache-mirror/mahout/
(this article uses mahout-distribution-0.5.tar.gz).
For additional functionality you may also need to download Maven and mahout-collections.
Data download
Data source: http://kdd.ics.uci.edu/databases/ hosts a large number of classic datasets for download.
(This article uses the synthetic_control dataset, synthetic_control.tar.gz.)
3. Installation and Deployment
To avoid polluting the Linux root environment, this article installs everything under the personal home directory, with $HOME/local as the program directory.
The archives have already been downloaded to $HOME/Downloads; unpack them with tar:
tar zxvf hadoop-0.20.203.0rc1.tar.gz -C ~/local/
cd ~/local
mv hadoop-0.20.203.0 hadoop
tar zxvf mahout-distribution-0.5.tar.gz -C ~/local/
cd ~/local
mv mahout-distribution-0.5 mahout
Edit .bash_profile / .bashrc:
export HADOOP_HOME=$HOME/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export MAHOUT_HOME=$HOME/local/mahout
To make the commands easier to use, add each program's bin directory to $PATH, or simply define aliases (a PATH sketch follows the alias block below).
#Alias for apps
alias mahout='$HOME/local/mahout/bin/mahout'
alias hdp='$HOME/local/hadoop/bin/hadoop'
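If you prefer the $PATH approach mentioned above instead of aliases, a minimal sketch for .bashrc is shown below (it assumes the $HOME/local layout used in this article):
# Put the hadoop and mahout launchers on the PATH instead of defining aliases
export PATH=$PATH:$HOME/local/hadoop/bin:$HOME/local/mahout/bin
# With this you invoke the real command names (hadoop, mahout) rather than the hdp alias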
Testing
Run the command: mahout
Expected output:
Running on hadoop, using HADOOP_HOME=/home/username/local/hadoop
HADOOP_CONF_DIR=/home/username/local/hadoop/conf
An example program must be given as the first argument.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
dirichlet: : Dirichlet Clustering
eigencuts: : Eigencuts spectral clustering
evaluateFactorization: : compute RMSE of a rating matrix factorization against probes in memory
evaluateFactorizationParallel: : compute RMSE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
fpg: : Frequent Pattern Growth
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lda: : Latent Dirchlet Allocation
ldatopics: : LDA Print Topics
lucene.vector: : Generate Vectors from a Lucene index
matrixmult: : Take the product of two matrices
meanshift: : Mean Shift clustering
parallelALS: : ALS-WR factorization of a rating matrix
predictFromFactorization: : predict preferences from a factorization of a rating matrix
prepare20newsgroups: : Reformat 20 newsgroups data
recommenditembased: : Compute recommendations using item-based collaborative filtering
rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
runlogistic: : Run a logistic regression model against CSV data
seq2sparse: : Sparse Vector generation from Text sequence files
seqdirectory: : Generate sequence files (of Text) from a directory
seqdumper: : Generic Sequence File dumper
seqwiki: : Wikipedia xml dump to sequence file
spectralkmeans: : Spectral k-means clustering
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
svd: : Lanczos Singular Value Decomposition
testclassifier: : Test Bayes Classifier
trainclassifier: : Train Bayes Classifier
trainlogistic: : Train a logistic regression using stochastic gradient descent
transpose: : Take the transpose of a matrix
vectordump: : Dump vectors from a sequence file to text
wikipediaDataSetCreator: : Splits data set of wikipedia wrt feature like country
wikipediaXMLSplitter: : Reads wikipedia data and creates chunks
Run the command: hdp
Expected output:
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
datanode run a DFS datanode
dfsadmin run a DFS admin client
mradmin run a Map-Reduce admin client
fsck run a DFS filesystem checking utility
fs run a generic filesystem user client
balancer run a cluster balancing utility
fetchdt fetch a delegation token from the NameNode
jobtracker run the MapReduce job Tracker node
pipes run a Pipes job
tasktracker run a MapReduce task Tracker node
historyserver run job history servers as a standalone daemon
job manipulate MapReduce jobs
queue get information regarding JobQueues
version print the version
jar <jar> run a jar file
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
Hadoop jar and the required libraries
daemonlog get/set the log level for each daemon
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
4. Running
Step 1:
The following command shows which algorithms Mahout provides and how to use them:
mahout --help
For example, a k-means run looks roughly like this (in Mahout 0.5, --clusters expects an HDFS path holding the initial clusters, -k sets the number of clusters, and --maxIter caps the iterations; the initial-clusters path here is a placeholder):
mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters /user/hive/warehouse/tmp_data/initial-clusters -k 5 --maxIter 10 --output /home/hadoopuser/1.txt
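Each Mahout driver also prints its own usage message, which is the quickest way to confirm the exact flags your installed Mahout version expects. For example:
# Show the options accepted by the kmeans driver of the installed Mahout version
mahout kmeans --help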
Files processed by Mahout must be in SequenceFile format, so plain text files first need to be converted to SequenceFiles. SequenceFile is a Hadoop class that lets us write binary key-value pairs to a file; for a detailed introduction see
the post by eyjian at http://www.hadoopor.com/viewthread.php?tid=144&highlight=sequencefile
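To get a feel for what a SequenceFile contains, you can often render one as readable key-value lines with hadoop fs -text (the path below is a placeholder):
# Print a SequenceFile as text; fs -text detects SequenceFiles and prints their keys and values
hdp fs -text /path/to/some/sequencefile | head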
Mahout provides a way to convert the files under a given directory into SequenceFiles.
(You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.)
Usage is as follows:
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
For example:
mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
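To verify the conversion, you can dump one of the generated files back to text with Mahout's seqdumper. The chunk-0 file name and the --seqFile flag below are what Mahout 0.5 used, but treat them as assumptions and list the directory (or run mahout seqdumper --help) first if anything differs:
# List the seqdirectory output, then dump the first chunk as text (chunk-0 is an assumed file name)
hdp fs -ls /mahout/seq/
mahout seqdumper --seqFile /mahout/seq/chunk-0 | head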
Step 2:
Run a simple k-means example:
1. Put the sample dataset into HDFS; it should go under the testdata directory:
$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
For example:
hdp fs -put ~/dataset/synthetic_control/test/synthetic_control.data testdata
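A quick check that the upload landed where the example jobs expect it:
# List the HDFS testdata directory that the synthetic control example jobs read from
hdp fs -ls testdata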
2. Run the k-means algorithm:
hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
For example:
hdp jar ~/local/mahout/mahout-examples-0.5-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
3. Run the canopy algorithm:
hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
For example:
hdp jar ~/local/mahout/mahout-examples-0.5-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job
4. Run the Dirichlet algorithm:
hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
5. Run the mean-shift algorithm:
hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
6. View the results:
mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000
This prints the result directly to the console.
You can also look around in HDFS to see what the data looks like.
Most of the examples above read their input from the testdata directory and write their results under output.
Use hdp fs -lsr to list all of the output files.
The k-means output is under output/points.
The canopy and mean-shift results are placed in output/clustered-points.
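For a more readable summary of the clusters themselves, clusterdump can print the cluster centers (and optionally the points assigned to them). The directory names below follow the layout described above and may differ between Mahout versions, so treat them as placeholders and check hdp fs -lsr output first:
# Dump the final clusters to a local text file; adjust clusters-10 and the points directory to whatever your run produced
mahout clusterdump --seqFileDir output/clusters-10 --pointsDir output/clustered-points --output clusteranalyze.txt
cat clusteranalyze.txt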