Mahout SVN repository: http://svn.apache.org/repos/asf/mahout/trunk
MovieLens data: http://www.grouplens.org/system/files/ml-100k.zip
1. Introduction to Mahout
The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries.
Classification
Logistic Regression (SGD)
Support Vector Machines (SVM)
Clustering
Pattern Mining
Dimension reduction
Singular Value Decomposition and other Dimension Reduction Techniques
Stochastic Singular Value Decomposition with PCA workflow
Independent Component Analysis
Gaussian Discriminative Analysis
Recommenders / Collaborative Filtering
Non-distributed recommenders ("Taste")
Distributed Item-Based Collaborative Filtering
Collaborative Filtering using a parallel matrix factorization
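2. Application to recommender systems (item-based/user-based/slopone)
2.1 Small sites can integrate it directly (user-based/item-based)
Collaborative filtering has three main steps: compute similarity → predict ratings → generate recommendations
PreferenceInferrer, whose role is to estimate a user's missing rating values, became a capper in version 0.8.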
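2.1.1 User-based implementation
Predicted rating of user u for an item: u is the current user, v_i is the i-th neighboring user, and p(v_i) is the i-th user's rating for the current item.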
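The formula this description refers to did not survive in the text. A reconstruction consistent with the accumulation loop in doEstimatePreference is sketched here. It assumes the method ends by dividing the weighted sum by the total similarity, a step the excerpt does not show, and the symbol N(u) for the neighborhood of u is introduced only for illustration:

```latex
\hat{p}(u) = \frac{\sum_{v_i \in N(u)} \mathrm{sim}(u, v_i)\, p(v_i)}
                  {\sum_{v_i \in N(u)} \mathrm{sim}(u, v_i)}
```

Neighbors that have not rated the item, or whose similarity to u is undefined, are left out of both sums.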
protected float doEstimatePreference(long theUserID, long[] theNeighborhood, long itemID) throws TasteException {
  if (theNeighborhood.length == 0) {
    return Float.NaN;
  }
  DataModel dataModel = getDataModel();
  double preference = 0.0;
  double totalSimilarity = 0.0;
  int count = 0;
  for (long userID : theNeighborhood) {
    if (userID != theUserID) {
      // See GenericItemBasedRecommender.doEstimatePreference() too
      Float pref = dataModel.getPreferenceValue(userID, itemID);
      if (pref != null) {
        double theSimilarity = similarity.userSimilarity(theUserID, userID);
        if (!Double.isNaN(theSimilarity)) {
          // Weight each neighbor's rating by its similarity to the current user
          preference += theSimilarity * pref;
          totalSimilarity += theSimilarity;
          count++;
        }
      }
    }
  }
  // ... (the excerpt stops here; the rest of the method is not shown)
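To try the weighting outside Mahout, the loop above can be mirrored with plain arrays. This is a minimal sketch, not Mahout code: the class name, the method signature, and the final division by totalSimilarity are assumptions, since the excerpt ends before the method returns.

```java
// Hypothetical stand-alone sketch; none of these names come from Mahout.
class UserBasedEstimateSketch {

    /**
     * similarities[k] is the similarity between the current user and neighbor k
     * (NaN when undefined); ratings[k] is neighbor k's rating for the item,
     * or null when that neighbor has not rated it.
     */
    static float estimate(double[] similarities, Float[] ratings) {
        double preference = 0.0;
        double totalSimilarity = 0.0;
        int count = 0;
        for (int k = 0; k < similarities.length; k++) {
            if (ratings[k] != null && !Double.isNaN(similarities[k])) {
                preference += similarities[k] * ratings[k];
                totalSimilarity += similarities[k];
                count++;
            }
        }
        // Assumption: with no usable neighbor there is nothing to estimate.
        if (count == 0) {
            return Float.NaN;
        }
        // Assumption: the estimate is the similarity-weighted average.
        return (float) (preference / totalSimilarity);
    }

    public static void main(String[] args) {
        double[] sims = {0.9, 0.5, Double.NaN, 0.1};
        Float[] ratings = {4.0f, 2.0f, 5.0f, null};
        // Only the first two neighbors contribute: (0.9*4 + 0.5*2) / (0.9 + 0.5)
        System.out.println(estimate(sims, ratings));
    }
}
```

The third neighbor is skipped because its similarity is NaN and the fourth because it has no rating, which matches the two guards in the loop above.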
2.1.2 Item-based implementation
Predicted rating of item i by user u: sim(i, j) is the similarity between item i and item j, and p(v_j, u) is user u's rating for item j.
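The formula itself is missing from the text here as well. A reconstruction consistent with the item-based doEstimatePreference loop is sketched here. It assumes the method finishes by dividing the weighted sum by the total similarity, which the excerpt does not show, and I(u) is a symbol introduced only for illustration to denote the items user u has rated:

```latex
\hat{p}(u, i) = \frac{\sum_{j \in I(u)} \mathrm{sim}(i, j)\, p(v_j, u)}
                     {\sum_{j \in I(u)} \mathrm{sim}(i, j)}
```

Because similarities can be negative, as the code comment notes, the denominator can be small or negative.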
protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)
    throws TasteException {
  double preference = 0.0;
  double totalSimilarity = 0.0;
  int count = 0;
  // Similarity of the target item to every item the user has already rated
  double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());
  for (int i = 0; i < similarities.length; i++) {
    double theSimilarity = similarities[i];
    if (!Double.isNaN(theSimilarity)) {
      // Weights can be negative!
      preference += theSimilarity * preferencesFromUser.getValue(i);
      totalSimilarity += theSimilarity;
      count++;
    }
  }
  // ... (the excerpt stops here; the rest of the method is not shown)
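Similarity measures implemented in Mahout
PearsonCorrelationSimilarity: Pearson correlation
The Pearson correlation coefficient equals the covariance of the two variables divided by the product of their standard deviations.
Drawback: it does not take into account how the number of overlapping rated items between users affects the similarity.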
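The overlap drawback is easy to see numerically. The sketch below computes the textbook Pearson coefficient on plain arrays. It is not Mahout's PearsonCorrelationSimilarity, and the class and method names are made up for illustration.

```java
// Hypothetical stand-alone sketch of the textbook Pearson coefficient.
class PearsonSketch {

    /** x[k] and y[k] are two users' ratings for the k-th item both have rated. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int k = 0; k < n; k++) {
            meanX += x[k];
            meanY += y[k];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int k = 0; k < n; k++) {
            double dx = x[k] - meanX;
            double dy = y[k] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // The 1/n factors cancel; the result is NaN if either user has zero variance.
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Two co-rated items: the coefficient is +1 or -1 whenever it is defined.
        System.out.println(pearson(new double[] {1, 2}, new double[] {4, 5}));
        System.out.println(pearson(new double[] {1, 5}, new double[] {4, 5}));
        // Five co-rated items give a more informative value.
        System.out.println(pearson(new double[] {1, 2, 3, 4, 5}, new double[] {2, 1, 4, 3, 5}));
    }
}
```

Both two-item pairs come out as 1.0 even though the ratings differ a lot, while the five-item pair gives 0.8. The coefficient alone says nothing about how many items the comparison rests on.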
EuclideanDistanceSimilarity: Euclidean distance
Drawback:
CosineMeasureSimilarity: cosine similarity (became UncenteredCosineSimilarity in 0.7)
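Standard cosine similarity is sensitive to direction but not to magnitude. For example, on a 5-point scale, suppose users X and Y rate two items (1, 2) and (4, 5). Cosine similarity gives 0.98, so the two users look extremely similar, yet judging by the ratings X seems to dislike both items while Y rather likes them. Adjusted cosine similarity was introduced to correct this: a mean is subtracted from the value in every dimension. Mahout provides an implementation of adjusted cosine similarity.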
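The numbers in this example can be checked with a few lines of Java. The sketch is stand-alone and does not use Mahout classes. It assumes the subtracted mean is 3, the average of the four ratings, because the text does not say which mean is meant.

```java
// Hypothetical stand-alone sketch; not Mahout's implementation.
class CosineSketch {

    static double cosine(double[] x, double[] y) {
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int k = 0; k < x.length; k++) {
            dot += x[k] * y[k];
            normX += x[k] * x[k];
            normY += y[k] * y[k];
        }
        return dot / Math.sqrt(normX * normY);
    }

    /** Subtracts the same mean from every dimension before taking the cosine. */
    static double adjustedCosine(double[] x, double[] y, double mean) {
        double[] cx = new double[x.length];
        double[] cy = new double[y.length];
        for (int k = 0; k < x.length; k++) {
            cx[k] = x[k] - mean;
            cy[k] = y[k] - mean;
        }
        return cosine(cx, cy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2};
        double[] y = {4, 5};
        System.out.println(cosine(x, y));
        // Assumption: the mean is 3, the average of the four ratings above.
        System.out.println(adjustedCosine(x, y, 3.0));
    }
}
```

The plain cosine is about 0.98, as stated above. After centering on 3 the vectors become (-2, -1) and (1, 2), and the similarity drops to -0.8, which better reflects that X and Y disagree.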
SpearmanCorrelationSimilarity: Spearman rank correlation
TanimotoCoefficientSimilarity: Tanimoto coefficient
LogLikelihoodSimilarity: log-likelihood similarity
CityBlockSimilarity: based on Manhattan distance
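Of the measures listed, the Tanimoto coefficient is the simplest to write down: it ignores rating values and compares only which items two users have touched. A minimal sketch of the standard definition, intersection size over union size, follows. The class and method names are hypothetical, and this is not Mahout's TanimotoCoefficientSimilarity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-alone sketch of the standard Tanimoto (Jaccard) coefficient.
class TanimotoSketch {

    /** a and b are the sets of item IDs each user has expressed a preference for. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        // Undefined when neither user has any items.
        return unionSize == 0 ? Double.NaN : (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Long> userA = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> userB = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        // Two shared items out of four distinct items
        System.out.println(tanimoto(userA, userB));
    }
}
```

This kind of measure suits data with no rating values, such as click or purchase logs.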
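2.2 Offline computation, with further development on top of the intermediate data (item-based/slopone)
2.2.1 Mahout source code structure
Item-based and slopone each have both a Hadoop implementation and a single-machine implementation. User-based has no Hadoop implementation.
Item-based recommender command: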
mahout org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -i input -o output --maxPrefsPerUser 100 --numRecommendations 20 -s SIMILARITY_COSINE
Item-item similar items:
mahout org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input user-item --output similarity --similarityClassname SIMILARITY_PEARSON_CORRELATION --maxSimilaritiesPerItem 120 --maxPrefsPerUser 1200 --minPrefsPerUser 2
3. Application to machine learning (Bayes/pattern mining/clustering, etc.)
3.1 Rapid modeling/model evaluation
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
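Place the data in the testdata directory; the algorithm writes its output to the output directory.
Use mahout clusterdump to inspect the results; it can also write them to a local file.
Recommender evaluation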
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
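This evaluator scores a recommender by the average absolute difference between estimated and actual preferences. The metric itself can be sketched without Mahout as follows. The class and method names are hypothetical, and how Mahout splits the data into training and test sets is not covered.

```java
// Hypothetical stand-alone sketch of the metric; not Mahout's evaluator.
class AverageAbsoluteDifferenceSketch {

    /** estimated[k] and actual[k] are the predicted and true ratings for the k-th held-out preference. */
    static double averageAbsoluteDifference(double[] estimated, double[] actual) {
        if (estimated.length != actual.length || estimated.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int k = 0; k < estimated.length; k++) {
            total += Math.abs(estimated[k] - actual[k]);
        }
        return total / estimated.length;
    }

    public static void main(String[] args) {
        double[] estimated = {2.5, 4.0, 4.0};
        double[] actual = {3.0, 4.0, 5.0};
        // (0.5 + 0.0 + 1.0) / 3
        System.out.println(averageAbsoluteDifference(estimated, actual));
    }
}
```

Lower is better; a score of 0.5 on a 5-point scale means the predictions are off by half a point on average.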
For evaluating clustering models, see:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
Create sequence files with seqdirectory
./bin/mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq
Convert the seqdirectory output to vectors
./bin/mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
3.2 Example: PFPGrowth
mahout fpg -i pfp/order_01.txt -o pfp/patterns/output.txt -k 50 -method mapreduce -regex '[\ ]' -s 2
PFPGrowth paper: http://infolab.stanford.edu/~echang/recsys08-69.pdf
Sample result:
Key: 0: Value: ([0],14), ([368, 0],7), ([0, 53],5), ([368, 0, 53],4), ([950, 0],4), ([682, 826, 523, 950, 277, 475, 0],3), ([682, 826, 523, 950, 475, 0],3), ([183, 0],3), ([168, 0],3), ([682, 826, 523, 168, 950, 277, 475, 0],2), ([368, 684, 401, 428, 0, 53],2), ([368, 871, 239, 0, 257],2), ([368, 766, 183, 0, 831],2), ([368, 684, 401, 428, 0],2), ([937, 57, 450, 0],2), ([710, 173, 0, 731],2), ([368, 871, 239, 0],2), ([368, 766, 183, 0],2), ([710, 173, 0],2), ([419, 581, 0],2), ([368, 4, 0],2), ([368, 242, 0],2), ([183, 366, 0],2), ([676, 0],2), ([460, 0],2), ([35, 0],2), ([298, 0],2), ([171, 0],2), ([10, 0],2)
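For further development on top of this output, each Value line can be parsed into (itemset, support) pairs. The sketch below reads each parenthesized entry as an item list followed by its support count, and assumes item IDs are numeric as in the sample. The class name and regular expression are illustrative and not part of Mahout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the textual pattern list shown above.
class PatternLineParser {

    // Matches one "([item, item, ...],support)" entry.
    private static final Pattern ENTRY = Pattern.compile("\\(\\[([^\\]]*)\\],\\s*(\\d+)\\)");

    /** Returns itemset -> support, in the order the entries appear. */
    static Map<List<Long>, Long> parse(String value) {
        Map<List<Long>, Long> patterns = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(value);
        while (matcher.find()) {
            List<Long> items = new ArrayList<>();
            for (String token : matcher.group(1).split(",")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {
                    items.add(Long.parseLong(trimmed)); // assumes numeric item IDs
                }
            }
            patterns.put(items, Long.parseLong(matcher.group(2)));
        }
        return patterns;
    }

    public static void main(String[] args) {
        String value = "([0],14), ([368, 0],7), ([0, 53],5)";
        for (Map.Entry<List<Long>, Long> entry : parse(value).entrySet()) {
            System.out.println(entry.getKey() + " appears " + entry.getValue() + " times");
        }
    }
}
```

In the sample above, item 0 appears in 14 transactions and the pair (368, 0) appears together in 7 of them.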