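3.11 Introduction
Mahout is an open-source machine learning library from Apache. It mainly implements three families of algorithms: recommenders (collaborative filtering), clustering, and classification. It is scalable and written in Java, and it implements some of its data mining algorithms on MapReduce, which addresses the problem of mining in parallel.
For data analysts, Mahout lowers the barrier to working with big data; for algorithm engineers, it provides a library of basic algorithms; for Hadoop developers, it provides a standard for data modeling.
Source: Zhang Dan (Conan), http://blog.fens.me/hadoop-mahout-roadmap/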
3.12 History of Mahout
Mahout began life in 2008 as a project of Apache's Lucene project. Lucene provides advanced implementations of search, text mining, and information-retrieval techniques. In the universe of computer science, these concepts are adjacent to machine learning techniques like clustering and, to an extent, classification. As a result, some of the work of the Lucene committers that fell more into these machine learning areas was spun off into its own subproject. Soon after, Mahout absorbed the Taste open source collaborative filtering project. As of April 2010, Mahout became a top-level Apache project in its own right, and got a brand-new elephant rider logo to boot.
Source: Mahout in Action
25 April 2014 - Goodbye MapReduce
The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them.
We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
Source: http://mahout.apache.org/
3.13 Mahout in the Hadoop family: structure diagram

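Main algorithms:
| Category | Algorithm | Chinese name / notes |
| --- | --- | --- |
| Classification | Logistic Regression | 邏輯回歸 |
| Classification | Bayesian | 貝葉斯 |
| Classification | SVM | 支持向量機 |
| Classification | Perceptron | 感知器算法 |
| Classification | Neural Network | 神經網絡 |
| Classification | Random Forests | 隨機森林 |
| Classification | Restricted Boltzmann Machines | 有限波爾茲曼機 |
| Clustering | Canopy Clustering | Canopy聚類 |
| Clustering | K-means Clustering | K均值算法 |
| Clustering | Fuzzy K-means | 模糊K均值 |
| Clustering | Expectation Maximization | EM聚類(期望最大化聚類) |
| Clustering | Mean Shift Clustering | 均值漂移聚類 |
| Clustering | Hierarchical Clustering | 層次聚類 |
| Clustering | Dirichlet Process Clustering | 狄里克雷過程聚類 |
| Clustering | Latent Dirichlet Allocation | LDA聚類 |
| Clustering | Spectral Clustering | 譜聚類 |
| Association rule mining | Parallel FP Growth Algorithm | 並行FP Growth算法 |
| Regression | Locally Weighted Linear Regression | 局部加權線性回歸 |
| Dimensionality reduction | Singular Value Decomposition | 奇異值分解 |
| Dimensionality reduction | Principal Components Analysis | 主成分分析 |
| Dimensionality reduction | Independent Component Analysis | 獨立成分分析 |
| Dimensionality reduction | Gaussian Discriminative Analysis | 高斯判別分析 |
| Evolutionary algorithms | Parallelized Watchmaker framework | |
| Recommendation / collaborative filtering | Non-distributed recommenders | Taste (UserCF, ItemCF, SlopeOne) |
| Recommendation / collaborative filtering | Distributed recommenders | ItemCF |
| Vector similarity | RowSimilarityJob | Computes similarity between rows |
| Vector similarity | VectorDistanceJob | Computes distances between vectors |
| Non-MapReduce algorithms | Hidden Markov Models | 隱馬爾科夫模型 |
| Collections extensions | Collections | Extends Java's Collections classes |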
3.14 Installing Mahout on a Hadoop platform
1. Download Mahout: http://archive.apache.org/dist/mahout/
2. Install Maven; on Ubuntu this can usually be done directly with sudo apt-get install maven
3. Extract the Mahout archive and move the resulting folder to /usr/mahout
sudo tar -zxvf mahout-distribution-0.9.tar.gz
sudo mv mahout-distribution-0.9 /usr/mahout
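The mv command in step 3 only works if the archive unpacks into a single folder with the expected name. A minimal sketch for checking that before extracting, using tar's listing mode; the function name archive_top_dirs is made up for illustration:

```shell
# Hypothetical helper: print the distinct top-level entries of a .tar.gz
# archive without extracting it.
archive_top_dirs() {
    tar -tzf "$1" | cut -d/ -f1 | sort -u
}

# Example (only meaningful once the archive has been downloaded):
# archive_top_dirs mahout-distribution-0.9.tar.gz
```

If the function prints exactly one name, that is the folder to pass to mv.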
4. Create a script that sets up the Mahout environment, with the following contents
export JAVA_HOME=/usr/lib/jvm/jdk8/
export MAHOUT_HOME=/usr/mahout
export MAHOUT_CONF_DIR=/usr/mahout/conf
export PATH=$MAHOUT_HOME/bin:$MAHOUT_HOME/conf:$PATH
export HADOOP_HOME=/usr/hadoop
export HADOOP_CONF_DIR=/usr/hadoop/conf
export PATH=$PATH:$HADOOP_HOME/bin
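A wrong path in this script tends to surface later as a confusing error from mahout or hadoop. As a rough sketch, the check below reports whether each variable is set and points to an existing directory. The function name check_env_dirs is hypothetical, and the paths it tests are whatever your own script exports.

```shell
# Hypothetical helper: report whether each named variable is set and points
# to an existing directory. Returns non-zero if any check fails.
check_env_dirs() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory"
            status=1
        else
            echo "$name=$value ok"
        fi
    done
    return $status
}

# Example, once the variables from step 4 are present in the shell:
# check_env_dirs JAVA_HOME MAHOUT_HOME MAHOUT_CONF_DIR HADOOP_HOME HADOOP_CONF_DIR
```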
5. Source the script so that the exports take effect in the current shell, then run the mahout command
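How the script in step 5 is run matters: exports made by a script that is executed as a child process do not reach the shell that launched it. The demonstration below uses a throwaway script and a made-up variable name, DEMO_MAHOUT_HOME, rather than the real environment script.

```shell
# Demonstration with a throwaway script (not the real environment script).
demo_dir=$(mktemp -d)
printf 'export DEMO_MAHOUT_HOME=/usr/mahout\n' > "$demo_dir/env.sh"

unset DEMO_MAHOUT_HOME

# Executing the script runs it in a child process; the export is lost.
sh "$demo_dir/env.sh"
after_exec=${DEMO_MAHOUT_HOME:-unset}

# Sourcing it (". file", or "source file" in bash) runs it in this shell.
. "$demo_dir/env.sh"
after_source=${DEMO_MAHOUT_HOME:-unset}

echo "after executing: $after_exec"
echo "after sourcing:  $after_source"

rm -rf "$demo_dir"
```

The first line reports the variable as unset and the second shows /usr/mahout, which is why the environment script should be sourced before calling mahout.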
6. If the mahout command runs, the installation has succeeded. Next, download a test data set to try out Mahout's k-means clustering.
sudo wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
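Before uploading, it can be worth confirming that the download is intact. The UCI description of this data set lists 600 rows of 60 values each; treat those figures as an assumption to verify against the data set page. A small sketch using awk, with a hypothetical function name:

```shell
# Hypothetical helper: print "<rows> <min fields> <max fields>" for a
# whitespace-separated data file, skipping blank lines.
data_shape() {
    awk 'NF > 0 {
             rows++
             if (min == 0 || NF < min) min = NF
             if (NF > max) max = NF
         }
         END { print rows + 0, min + 0, max + 0 }' "$1"
}

# Example: data_shape /usr/synthetic_control.data
# An intact copy should report the same value for min and max fields.
```

A truncated download usually shows up as a short row count or as a minimum field count smaller than the maximum.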

7. Upload the data to HDFS
hadoop fs -mkdir testdata
hadoop fs -put /usr/synthetic_control.data ./testdata
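If the upload failed silently, the clustering job in the next step will fail on missing input. One way to confirm the file arrived, sketched with the Hadoop shell's -test -e option (exit status 0 when the path exists); the function name hdfs_path_exists is hypothetical:

```shell
# Hypothetical helper: succeed only if the given HDFS path exists.
# Uses "hadoop fs -test -e", which exits 0 when the path is present.
hdfs_path_exists() {
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop not found on PATH" >&2
        return 2
    fi
    hadoop fs -test -e "$1"
}

# Example, after step 7:
# hdfs_path_exists testdata/synthetic_control.data && echo "upload ok"
```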
8. Run the k-means clustering algorithm
mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
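Once the job finishes, its results are in HDFS. The sketch below assumes that the example job writes to a directory named output and that the last iteration is stored in a folder matching clusters-N-final. Both names are assumptions about Mahout 0.9's behaviour and should be checked against your own run. The function name is hypothetical.

```shell
# Hypothetical helper: list an HDFS directory and print the name of the
# final-iteration cluster folder, if any (pattern clusters-<N>-final).
final_clusters_dir() {
    hadoop fs -ls "$1" |
        sed -n 's/.*\(clusters-[0-9][0-9]*-final\).*/\1/p' |
        sort -u
}

# Example, after step 8 (the directory name "output" is an assumption):
# final_clusters_dir output
```

An empty result suggests either that the job did not converge to a final iteration or that it wrote its results somewhere else.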

