Introduction to Mahout


3.11 Overview

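Mahout is an open-source machine learning library from Apache. It mainly implements three families of algorithms: recommenders (collaborative filtering), clustering, and classification. It is scalable, written in Java, and implements part of its data mining algorithms on MapReduce, which addresses the problem of mining data in parallel.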

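Mahout lowers the barrier to big data for data analysts, gives algorithm engineers a library of baseline algorithms, and gives Hadoop developers a standard for data modeling.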

——張丹(Conan)  http://blog.fens.me/hadoop-mahout-roadmap/

 

3.12 History of Mahout

Mahout began life in 2008 as a subproject of Apache's Lucene project. Lucene provides advanced implementations of search, text mining, and information-retrieval techniques. In the universe of computer science, these concepts are adjacent to machine learning techniques like clustering and, to an extent, classification. As a result, some of the work of the Lucene committers that fell more into these machine learning areas was spun off into its own subproject. Soon after, Mahout absorbed the Taste open source collaborative filtering project. As of April 2010, Mahout became a top-level Apache project in its own right, and got a brand-new elephant rider logo to boot.

——Mahout in Action

 

25 April 2014 - Goodbye MapReduce

The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them.

 

We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

——http://mahout.apache.org/

 

3.13 Structure Diagram of Mahout within the Hadoop Family

 

 

Main algorithms:

Category                                Algorithm                           Notes
Classification                          Logistic Regression
                                        Bayesian
                                        SVM                                 Support vector machine
                                        Perceptron
                                        Neural Network
                                        Random Forests
                                        Restricted Boltzmann Machines
Clustering                              Canopy Clustering
                                        K-means Clustering
                                        Fuzzy K-means
                                        Expectation Maximization            EM clustering
                                        Mean Shift Clustering
                                        Hierarchical Clustering
                                        Dirichlet Process Clustering
                                        Latent Dirichlet Allocation         LDA clustering
                                        Spectral Clustering
Association rule mining                 Parallel FP Growth Algorithm
Regression                              Locally Weighted Linear Regression
Dimensionality reduction                Singular Value Decomposition
                                        Principal Components Analysis
                                        Independent Component Analysis
                                        Gaussian Discriminative Analysis
Evolutionary algorithms                 Parallelized Watchmaker framework
Recommenders / collaborative filtering  Non-distributed recommenders        Taste (UserCF, ItemCF, SlopeOne)
                                        Distributed Recommenders            ItemCF
Vector similarity                       RowSimilarityJob                    Computes pairwise similarity between rows
                                        VectorDistanceJob                   Computes distances between vectors
Non-MapReduce algorithms                Hidden Markov Models
Collection extensions                   Collections                         Extends Java's Collections classes

——http://blog.csdn.net

 

3.14 Installing Mahout on a Hadoop Platform

1. Download Mahout from http://archive.apache.org/dist/mahout/

2. Install Maven. On Ubuntu this is usually just: sudo apt-get install maven

3. Extract the Mahout archive and move the resulting folder to /usr/mahout:

sudo tar -zxvf mahout-distribution-0.9.tar.gz

sudo mv mahout-distribution-0.9 /usr/mahout

4. Create a script that sets up the Mahout environment, with the following contents:

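# Java installation used by both Hadoop and Mahout
export JAVA_HOME=/usr/lib/jvm/jdk8/

# Mahout was moved to /usr/mahout in step 3
export MAHOUT_HOME=/usr/mahout
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/bin:$MAHOUT_HOME/conf:$PATH

# Hadoop installation that Mahout submits its jobs to
export HADOOP_HOME=/usr/hadoop
export HADOOP_CONF_DIR=/usr/hadoop/conf
export PATH=$PATH:$HADOOP_HOME/bin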

 

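5. Source the script in the current shell (so that the exported variables stay in effect), then run the mahout command.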
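A quick sanity check after sourcing the script can save some confusion later. The sketch below only verifies that the variables from step 4 are set and point at existing directories, and that a launcher exists under bin. The function name check_mahout_env is hypothetical and not part of Mahout.

```shell
# Hypothetical helper (not part of Mahout): verify the variables
# exported by the step-4 script. Returns 0 if all checks pass, 1 otherwise.
check_mahout_env() {
    rc=0
    for var in JAVA_HOME MAHOUT_HOME HADOOP_HOME; do
        # Look up the value of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set" >&2
            rc=1
        elif [ ! -d "$value" ]; then
            echo "$var points at a missing directory: $value" >&2
            rc=1
        fi
    done
    # The launcher script must exist and be executable.
    if [ -n "${MAHOUT_HOME:-}" ] && [ ! -x "$MAHOUT_HOME/bin/mahout" ]; then
        echo "no executable found at $MAHOUT_HOME/bin/mahout" >&2
        rc=1
    fi
    return "$rc"
}
```

Assuming the step-4 script was saved as mahout-env.sh (a made-up file name), a typical use would be `. ./mahout-env.sh && check_mahout_env && mahout`. Running the script with `sh mahout-env.sh` instead would set the variables only in a child process, and they would be gone by the time mahout is invoked.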

6. At this point the installation is complete. Next, download a test dataset into /usr to try out Mahout's k-means clustering:

sudo wget -P /usr http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
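Before uploading, it may be worth confirming that the download is intact. The UCI synthetic control dataset is described as 600 time series of 60 values each, so a shape check catches truncated files or an HTML error page saved under the data file's name. The sketch below assumes whitespace-separated values; the function name data_shape is hypothetical.

```shell
# Hypothetical helper: print "<rows> <min_cols> <max_cols>" for a
# whitespace-separated data file, ignoring empty lines.
data_shape() {
    awk 'NF > 0 {
             rows++
             if (min == "" || NF < min) min = NF
             if (NF > max) max = NF
         }
         END { print rows + 0, min + 0, max + 0 }' "$1"
}
```

Called as `data_shape /usr/synthetic_control.data`, it should report 600 rows with a minimum and maximum of 60 columns if the file matches the dataset description. Any other result suggests the download should be repeated.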

 

7. Upload the data to HDFS:

hadoop fs -mkdir testdata

hadoop fs -put /usr/synthetic_control.data testdata
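The directory name matters here: when the synthetic control example job is started without arguments, it looks for its input in a directory named testdata under the user's HDFS home. The upload can be wrapped in a small function that also lists the result. This is a sketch under assumptions: upload_testdata and the HADOOP_CMD override are hypothetical names, and the override exists only so the function can be dry-run without a cluster.

```shell
# Hypothetical wrapper around the two upload commands from step 7.
# HADOOP_CMD defaults to "hadoop"; overriding it is meant for dry runs only.
upload_testdata() {
    local_file=$1
    hdfs_dir=${2:-testdata}
    hadoop_cmd=${HADOOP_CMD:-hadoop}

    if [ ! -r "$local_file" ]; then
        echo "cannot read $local_file" >&2
        return 1
    fi
    # Create the target directory, then copy the file into it.
    "$hadoop_cmd" fs -mkdir "$hdfs_dir" || return 1
    "$hadoop_cmd" fs -put "$local_file" "$hdfs_dir" || return 1
    # List the directory so the upload can be confirmed by eye.
    "$hadoop_cmd" fs -ls "$hdfs_dir"
}
```

For example, `upload_testdata /usr/synthetic_control.data` runs the same two commands as above and then lists testdata. The function stops at the first failing command, which includes the case where testdata already exists from an earlier attempt.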

8. Run the k-means clustering algorithm:

mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
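Once the job finishes, its results should be on HDFS. In the Mahout 0.9 example, a run without arguments writes to a directory named output next to testdata; treat that default as an assumption for other versions. The sketch below runs the job and lists that directory. The names run_kmeans_example, MAHOUT_CMD and HADOOP_CMD are hypothetical, and the two overrides are there only to allow a dry run.

```shell
# Hypothetical wrapper: run the synthetic control k-means example and
# list its result directory. The job itself is started without arguments;
# the optional first parameter only selects which HDFS directory to list.
run_kmeans_example() {
    mahout_cmd=${MAHOUT_CMD:-mahout}
    hadoop_cmd=${HADOOP_CMD:-hadoop}
    out_dir=${1:-output}

    "$mahout_cmd" -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job || return 1
    # Assumed default output location; adjust if the job wrote elsewhere.
    "$hadoop_cmd" fs -ls "$out_dir"
}
```

If the listing shows a series of clusters-N subdirectories, the iterations ran. An empty or missing directory usually means the job failed earlier, and the console log from the mahout command is the place to look.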

 

