Introduction to Mahout

This article is a repost. 2015-03-24 11:38

3.11 Introduction

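Mahout is an open-source machine learning library from Apache. It mainly implements three families of algorithms: recommenders (collaborative filtering), clustering, and classification. It is scalable and written in Java, and it implements some of its data mining algorithms on MapReduce, which addresses the problem of mining data in parallel.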

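For data analysts, Mahout lowers the barrier to working with big data; for algorithm engineers, it provides a library of basic algorithms; for Hadoop developers, it provides a standard for data modeling.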

-- Zhang Dan (Conan), http://blog.fens.me/hadoop-mahout-roadmap/

3.12 History of Mahout

Mahout began life in 2008 as a project of Apache's Lucene project. Lucene provides advanced implementations of search, text mining, and information-retrieval techniques. In the universe of computer science, these concepts are adjacent to machine learning techniques like clustering and, to an extent, classification. As a result, some of the work of the Lucene committers that fell more into these machine learning areas was spun off into its own subproject. Soon after, Mahout absorbed the Taste open source collaborative filtering project. As of April 2010, Mahout became a top-level Apache project in its own right, and got a brand-new elephant rider logo to boot.

-- Mahout in Action

25 April 2014 - Goodbye MapReduce

The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them.

We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

-- http://mahout.apache.org/

3.13 Mahout's place in the Hadoop family (structure diagram)

Main algorithms:

Category	Algorithm	Notes
Classification	Logistic Regression
	Bayesian
	SVM	Support vector machine
	Perceptron
	Neural Network
	Random Forests
	Restricted Boltzmann Machines
Clustering	Canopy Clustering
	K-means Clustering
	Fuzzy K-means
	Expectation Maximization	EM clustering
	Mean Shift Clustering
	Hierarchical Clustering
	Dirichlet Process Clustering
	Latent Dirichlet Allocation	LDA clustering
	Spectral Clustering
Association rule mining	Parallel FP Growth Algorithm
Regression	Locally Weighted Linear Regression
Dimensionality reduction	Singular Value Decomposition
	Principal Components Analysis
	Independent Component Analysis
	Gaussian Discriminant Analysis
Evolutionary algorithms	Parallelized Watchmaker framework
Recommenders / collaborative filtering	Non-distributed recommenders	Taste (UserCF, ItemCF, SlopeOne)
Recommenders / collaborative filtering	Distributed Recommenders	ItemCF
Vector similarity	RowSimilarityJob	Computes pairwise similarity between matrix rows
Vector similarity	VectorDistanceJob	Computes distances between vectors
Non-MapReduce algorithms	Hidden Markov Models
Collections extensions	Collections	Extends Java's Collections classes
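
To make the RowSimilarityJob entry in the table a little more concrete, here is a minimal sketch of pairwise row similarity. It is plain Java, not Mahout code: the class name RowSimilaritySketch, its methods, and the sample matrix are all made up for illustration, and cosine similarity is assumed as the measure because the table does not name one. Mahout's job runs as a distributed MapReduce computation, which this in-memory sketch does not try to reproduce.

```java
import java.util.Arrays;

// Illustration only: a hypothetical plain-Java class, NOT Mahout's RowSimilarityJob API.
final class RowSimilaritySketch {

    // Cosine similarity of two equal-length vectors.
    // Returns 0.0 when either vector is all zeros (assumed convention).
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Builds the symmetric matrix of similarities between every pair of rows.
    static double[][] pairwise(double[][] rows) {
        int n = rows.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(rows[i], rows[j]);
                sim[i][j] = s;
                sim[j][i] = s; // similarity is symmetric, so fill both halves
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Made-up sample data: each row could be one item's ratings from three users.
        double[][] ratings = {
            {5, 3, 0},
            {4, 0, 0},
            {0, 0, 2}
        };
        for (double[] row : pairwise(ratings)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In the sample matrix the third row shares no non-zero column with the other two, so its similarity to both is 0, while the first two rows overlap in the first column and get a similarity between 0 and 1. A recommender such as ItemCF could use a matrix like this to find the items most similar to the ones a user has already rated.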