(Repost) A Practical Mahout Tutorial


Mahout SVN repository: http://svn.apache.org/repos/asf/mahout/trunk

MovieLens dataset: http://www.grouplens.org/system/files/ml-100k.zip

1.    Introduction to Mahout

The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries.

Classification

    Logistic Regression (SGD)
    Bayesian
    Support Vector Machines (SVM)
    Perceptron and Winnow
    Neural Network
    Random Forests
    Restricted Boltzmann Machines
    Online Passive Aggressive
    Boosting
    Hidden Markov Models

Clustering

    Canopy Clustering
    K-Means Clustering
    Fuzzy K-Means
    Expectation Maximization (EM)
    Mean Shift Clustering
    Hierarchical Clustering
    Dirichlet Process Clustering
    Latent Dirichlet Allocation
    Spectral Clustering
    Minhash Clustering
    Top Down Clustering

Pattern Mining

    Parallel FP Growth Algorithm

Dimension reduction

    Singular Value Decomposition and other Dimension Reduction Techniques
    Stochastic Singular Value Decomposition with PCA workflow
    Principal Components Analysis
    Independent Component Analysis
    Gaussian Discriminative Analysis

Recommenders / Collaborative Filtering

    Non-distributed recommenders ("Taste")
    Distributed Item-Based Collaborative Filtering
    Collaborative Filtering using a parallel matrix factorization

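2.    Application to recommender systems (item-based/user-based/Slope One)

2.1 Small sites can integrate the library directly (user-based/item-based)

Collaborative filtering has three main stages: compute similarities → predict ratings → generate recommendations.

PreferenceInferrer (replaced by a capper in version 0.8) estimates a user's missing rating values.

2.1.1 User-based implementation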

                                                                        

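Predicted rating of user u for the current item: u is the current user, i indexes the i-th neighboring user v_i, and p(v_i) is the i-th user's rating of the current item.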
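The formula image did not survive in this copy. Judging from the code in this section, which accumulates similarity-weighted ratings and the sum of similarities, the prediction is presumably the similarity-weighted average below. The set name N(u), meaning the neighbors that rated the item and have a defined similarity to u, is notation assumed here.

```latex
\hat{p}(u) = \frac{\sum_{i \in N(u)} sim(u, v_i)\, p(v_i)}{\sum_{i \in N(u)} sim(u, v_i)}
```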

 

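protected float doEstimatePreference(long theUserID, long[] theNeighborhood, long itemID) throws TasteException {
    // No neighbors means there is nothing to base an estimate on
    if (theNeighborhood.length == 0) {
      return Float.NaN;
    }
    DataModel dataModel = getDataModel();
    double preference = 0.0;       // running sum of similarity * neighbor rating
    double totalSimilarity = 0.0;  // running sum of similarities (the normalizer)
    int count = 0;                 // number of neighbors that contributed
    for (long userID : theNeighborhood) {
      if (userID != theUserID) {
        // See GenericItemBasedRecommender.doEstimatePreference() too
        Float pref = dataModel.getPreferenceValue(userID, itemID);
        if (pref != null) {
          double theSimilarity = similarity.userSimilarity(theUserID, userID);
          if (!Double.isNaN(theSimilarity)) {
            preference += theSimilarity * pref;
            totalSimilarity += theSimilarity;
            count++;
          }
        }
      }
    }
    // The excerpt in the original post stops after the loop. The remainder is
    // reconstructed as an assumption: return the weighted average, capped if a capper is set.
    if (count <= 1) {
      return Float.NaN;
    }
    float estimate = (float) (preference / totalSimilarity);
    if (capper != null) {
      estimate = capper.capEstimate(estimate);
    }
    return estimate;
}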

 

2.1.2 Item-based implementation

                                                

 

Predicted rating of item i by user u: sim(i, j) is the similarity between item i and item j, and p(v_j, u) is user u's rating of item j.
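The formula image is missing here as well. Based on the symbols just defined and on the accumulation done in the code of this section, it is presumably the weighted average below, with j ranging over the items that user u has already rated. This is a reconstruction, not a copy of the original figure.

```latex
\hat{p}(u, i) = \frac{\sum_{j} sim(i, j)\, p(v_j, u)}{\sum_{j} sim(i, j)}
```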

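protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)
    throws TasteException {
    double preference = 0.0;       // running sum of similarity * user's rating
    double totalSimilarity = 0.0;  // running sum of similarities (the normalizer)
    int count = 0;                 // number of rated items that contributed
    // Similarity between the target item and every item the user has rated
    double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());
    for (int i = 0; i < similarities.length; i++) {
      double theSimilarity = similarities[i];
      if (!Double.isNaN(theSimilarity)) {
        // Weights can be negative!
        preference += theSimilarity * preferencesFromUser.getValue(i);
        totalSimilarity += theSimilarity;
        count++;
      }
    }
    // The excerpt in the original post stops after the loop. The remainder is
    // reconstructed as an assumption: return the weighted average, capped if a capper is set.
    if (count <= 1) {
      return Float.NaN;
    }
    float estimate = (float) (preference / totalSimilarity);
    if (capper != null) {
      estimate = capper.capEstimate(estimate);
    }
    return estimate;
}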

 

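Similarity measures implemented in Mahout

PearsonCorrelationSimilarity: Pearson correlation

The Pearson correlation coefficient equals the covariance of two variables divided by the product of their standard deviations.

Drawback: it does not take into account how the number of items both users have rated affects the similarity.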
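To make the definition concrete, here is a minimal standalone sketch of the Pearson correlation over two rating arrays. It does not use Mahout; the class and method names are made up for illustration, and the arrays are assumed to hold the two users' ratings for the same items in the same order.

```java
final class PearsonSketch {
    /** Pearson correlation of two equal-length arrays; NaN when it is undefined. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            return Double.NaN;
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        // The 1/n factors of covariance and variance cancel, so plain sums suffice.
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        double denom = Math.sqrt(varX * varY);
        return denom == 0.0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
    }
}
```

Note that two users who share only two rated items get a correlation of exactly +1 or -1 (or an undefined one), which is the overlap problem described above.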

 

 

                                               

EuclideanDistanceSimilarity: Euclidean distance
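The formula for this measure is missing from this copy. As a rough illustration only, the sketch below computes the Euclidean distance between two rating vectors and maps it to a similarity with 1 / (1 + d), a common convention that is assumed here. Mahout's own class may normalize differently, and the names below are hypothetical.

```java
final class EuclideanSketch {
    /** Euclidean distance between two equal-length rating vectors. */
    static double distance(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Maps distance to (0, 1]: identical vectors give 1, far apart vectors approach 0. */
    static double similarity(double[] x, double[] y) {
        return 1.0 / (1.0 + distance(x, y));
    }

    public static void main(String[] args) {
        System.out.println(similarity(new double[] {1, 2}, new double[] {4, 6}));
    }
}
```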

                                                                              

                                                                                   

 

 

 

Drawbacks:

CosineMeasureSimilarity: cosine similarity (renamed UncenteredCosineSimilarity in 0.7)

                                                                                 

 

                                                                     

 

 

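Standard cosine similarity is sensitive to direction but not to magnitude. For example, on a 5-point scale, suppose users X and Y rate two items (1, 2) and (4, 5) respectively. Cosine similarity gives 0.98, so the two users look almost identical, yet the ratings suggest that X dislikes both items while Y likes them. Adjusted cosine similarity corrects this by subtracting a mean from the value in every dimension, and Mahout provides an implementation of it.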
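The example can be checked with a few lines of standalone Java. This sketch assumes the subtracted mean is the average of all four ratings, which is 3; other centering conventions exist, and the names are illustrative rather than Mahout's API. With that assumption the plain cosine comes out near 0.98 and the adjusted one near -0.8.

```java
final class CosineSketch {
    /** Plain (uncentered) cosine similarity; NaN if either vector is all zeros. */
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / Math.sqrt(normX * normY);
    }

    /** Cosine similarity after subtracting the same mean from every component. */
    static double centeredCosine(double[] x, double[] y, double mean) {
        double[] cx = new double[x.length];
        double[] cy = new double[y.length];
        for (int i = 0; i < x.length; i++) {
            cx[i] = x[i] - mean;
        }
        for (int i = 0; i < y.length; i++) {
            cy[i] = y[i] - mean;
        }
        return cosine(cx, cy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2};
        double[] y = {4, 5};
        System.out.println(cosine(x, y));
        System.out.println(centeredCosine(x, y, 3.0));
    }
}
```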

SpearmanCorrelationSimilarity: Spearman rank correlation

TanimotoCoefficientSimilarity: Tanimoto coefficient

LogLikelihoodSimilarity: log-likelihood similarity

CityBlockSimilarity: based on Manhattan (city-block) distance
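Of the measures listed above, the Tanimoto coefficient is the easiest to show in a few lines, since it ignores rating values and only compares which items two users have interacted with: the size of the intersection divided by the size of the union. The sketch below is standalone, with hypothetical names, and is not Mahout's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class TanimotoSketch {
    /** Convenience helper for building a set of item IDs. */
    static Set<Long> setOf(Long... ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    /** |A intersect B| / |A union B|; NaN when both sets are empty. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return Double.NaN;
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        System.out.println(tanimoto(setOf(1L, 2L, 3L), setOf(2L, 3L, 4L)));
    }
}
```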

2.2 Offline computation, building on the intermediate data (item-based/Slope One)

2.2.1 Mahout source code structure

                                                              

 

 

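Item-based and Slope One each have both a Hadoop implementation and a single-machine implementation. User-based has only the single-machine implementation.

Command for the item-based recommender:

mahout org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -i input -o output --maxPrefsPerUser 100 --numRecommendations 20 -s SIMILARITY_COSINE

Computing item-item similarities: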

mahout org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input user-item --output similarity --similarityClassname SIMILARITY_PEARSON_CORRELATION --maxSimilaritiesPerItem 120 --maxPrefsPerUser 1200  --minPrefsPerUser 2 
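Both jobs read preference data as text, and the commonly documented layout is one userID,itemID,preference triple per line. The u.data file in ml-100k is, as far as I know, tab-separated with a trailing timestamp column. The sketch below is a hypothetical helper, not part of Mahout, that keeps the first three fields of such a line. The jobs may well accept the tab-separated file directly, so treat this as optional cleanup.

```java
final class MovieLensToCsv {
    /** Converts "user<TAB>item<TAB>rating<TAB>timestamp" into "user,item,rating". */
    static String toCsvLine(String line) {
        String[] fields = line.trim().split("\t");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected at least 3 tab-separated fields: " + line);
        }
        return fields[0] + "," + fields[1] + "," + fields[2];
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine("196\t242\t3\t881250949"));
    }
}
```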

 

3.    Application to machine learning (Bayes/pattern mining/clustering, etc.)

3.1 Rapid modeling/model evaluation

                                                                     

 

$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

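Put the input data in the testdata directory; the algorithm writes its output to the output directory.

Use mahout clusterdump to inspect the results; it can also write them to a local file.

Evaluating a recommender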

RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
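As its name suggests, this evaluator scores a recommender by the average absolute difference between estimated and actual preferences, so lower is better. The standalone sketch below shows only that metric on two arrays. The splitting into training and test data that the evaluator performs is not shown, and the names are illustrative.

```java
final class MaeSketch {
    /** Mean absolute error between actual and predicted ratings. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        double[] actual = {3.0, 4.0, 5.0};
        double[] predicted = {2.5, 4.0, 4.0};
        System.out.println(meanAbsoluteError(actual, predicted));
    }
}
```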

 

                                                     

For evaluating clustering models, see:

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

 

Create sequence files with seqdirectory

./bin/mahout seqdirectory \
    -i ${WORK_DIR}/20news-all \
    -o ${WORK_DIR}/20news-seq

Convert the sequence files to vectors

./bin/mahout seq2sparse \
    -i ${WORK_DIR}/20news-seq \
    -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
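The -wt tfidf option asks for TF-IDF term weights. As a reminder of the idea, here is a sketch of the textbook formula tf * ln(N / df). This is an assumption for illustration only: Mahout's actual weighting may use a different variant (for example damped term frequency or smoothed IDF), and the names below are hypothetical.

```java
final class TfIdfSketch {
    /**
     * Textbook TF-IDF weight.
     *
     * @param termFreq occurrences of the term in the document
     * @param docFreq  number of documents containing the term
     * @param numDocs  total number of documents in the corpus
     */
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        if (termFreq <= 0 || docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term appearing in every document carries no weight
        System.out.println(tfIdf(3, 10, 10));
        // A rare term is weighted up
        System.out.println(tfIdf(2, 1, 10));
    }
}
```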

3.2 Example: PFPGrowth

mahout fpg -i pfp/order_01.txt -o pfp/patterns/output.txt -k 50 -method mapreduce -regex '[\ ]' -s 2

 

PFPGrowth paper: http://infolab.stanford.edu/~echang/recsys08-69.pdf

Sample output:

Key: 0: Value: ([0],14), ([368, 0],7), ([0, 53],5), ([368, 0, 53],4), ([950, 0],4), ([682, 826, 523, 950, 277, 475, 0],3), ([682, 826, 523, 950, 475, 0],3), ([183, 0],3), ([168, 0],3), ([682, 826, 523, 168, 950, 277, 475, 0],2), ([368, 684, 401, 428, 0, 53],2), ([368, 871, 239, 0, 257],2), ([368, 766, 183, 0, 831],2), ([368, 684, 401, 428, 0],2), ([937, 57, 450, 0],2), ([710, 173, 0, 731],2), ([368, 871, 239, 0],2), ([368, 766, 183, 0],2), ([710, 173, 0],2), ([419, 581, 0],2), ([368, 4, 0],2), ([368, 242, 0],2), ([183, 366, 0],2), ([676, 0],2), ([460, 0],2), ([35, 0],2), ([298, 0],2), ([171, 0],2), ([10, 0],2)
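Each pair in the output above is an itemset followed by its support, that is, the number of transactions containing every item in the set: item 0 alone occurs in 14 transactions, and items 368 and 0 occur together in 7. The sketch below counts support by brute force on made-up toy transactions (not the data behind the output above) to show what the number means. It is not the FP-Growth algorithm itself, and the names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SupportCountSketch {
    /** Convenience helper for building an itemset. */
    static Set<Integer> setOf(Integer... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Number of transactions that contain every item of the pattern. */
    static int support(List<Set<Integer>> transactions, Set<Integer> pattern) {
        int count = 0;
        for (Set<Integer> transaction : transactions) {
            if (transaction.containsAll(pattern)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy transactions, invented for this illustration
        List<Set<Integer>> transactions = Arrays.asList(
            setOf(0, 368, 53), setOf(0, 368), setOf(0, 53), setOf(1, 2));
        System.out.println(support(transactions, setOf(0, 368)));
    }
}
```

With a minimum support of 2, as set by -s 2 in the command, a pattern is kept only if this count is at least 2.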

Original article: http://blog.csdn.net/comaple/article/details/8947640

