Slope one—個性化推薦中最簡潔的協同過濾算法

本文轉載自查看原文 2018-09-25 13:45 871

Slope One 是一系列應用於協同過濾的算法的統稱。由 Daniel Lemire和Anna Maclachlan於2005年發表的論文中提出。 [1]有爭議的是，該算法堪稱基於項目評價的non-trivial 協同過濾算法最簡潔的形式。該系列算法的簡潔特性使它們的實現簡單而高效，而且其精確度與其它復雜費時的算法相比也不相上下。 [2]. 該系列算法也被用來改進其它算法。[3][4].

目錄 [隱藏]
1 協同過濾簡介及其主要優缺點
2 Item-based協同過濾和過適
3 電子商務中的Item-based協同過濾
4 Slope One 協同過濾
5 Slope One 的java/c#實現
6 Slope One 的算法復雜度
7 應用Slope One的推薦系統
8 腳注
協同過濾簡介及其主要優缺點[編輯]
協同過濾推薦（Collaborative Filtering recommendation）是在信息過濾和信息系統中正迅速成為一項很受歡迎的技術。與傳統的基於內容過濾直接分析內容進行推薦不同，協同過濾分析用戶興趣，在用戶群中找到指定用戶的相似（興趣）用戶，綜合這些相似用戶對某一信息的評價，形成系統對該指定用戶對此信息的喜好程度預測。
與傳統文本過濾相比，協同過濾有下列優點:
1 能夠過濾難以進行機器自動基於內容分析的信息。如藝術品、音樂；
2 能夠基於一些復雜的，難以表達的概念（信息質量、品位)進行過濾；
3 推薦的新穎性。
盡管協同過濾技術在個性化推薦系統中獲得了極大的成功，但隨着站點結構、內容的復雜度和用戶人數的不斷增加，協同過濾技術的一些缺點逐漸暴露出來。
主要有以下三點:
1 稀疏性(sparsity)：在許多推薦系統中，每個用戶涉及的信息量相當有限，在一些大的系統如亞馬遜網站中，用戶最多不過就評估了上百萬本書的1%~2%。造成評估矩陣數據相當稀疏，難以找到相似用戶集，導致推薦效果大大降低。
2 擴展性(scalability)：“最近鄰居”算法的計算量隨着用戶和項的增加而大大增加，對於上百萬之巨的數目，通常的算法將遭遇到嚴重的擴展性問題。
3 精確性(accuracy)：通過尋找相近用戶來產生推薦集，在數量較大的情況下，推薦的可信度隨之降低。

Item-based協同過濾和過適[編輯]
當可以對一些項目評分的時候，比如人們可以對一些東西給出1到5星的評價的時候，協同過濾意圖基於一個個體過去對某些項目的評分和（龐大的）由其他用戶的評價構成的數據庫，來預測該用戶對未評價項目的評分。例如: 如果一個人給披頭士的評分為5（總分5）的話，我們能否預測他對席琳狄翁新專輯的評分呢？

這種情形下, item-based 協同過濾系統[5][6] 根據其它項目的評分來預測某項目的分值，一般方法為線性回歸 (). 於是，需要列出x^2個線性回歸方程和2x^2個回歸量，例如：當有1000個項目時，需要列多達1,000,000個線性回歸方程，以及多達2,000,000個回歸量。除非我們只選擇某些用戶共同評價過的項目對，否則協同過濾會遇到過適[2](過擬合) 問題。

另外一種更好的方法是使用更簡單一些的式子，比如：實驗證明當使用一半的回歸量的時候，該式子（稱為Slope One）的表現有時優於[2] 線性回歸方程。該簡化方法也不需要那么多存儲空間和延遲。

Item-based 協同過濾只是協同過濾的一種形式.其它還有像 user-based 協同過濾一樣研究用戶間的聯系的過濾系統。但是，考慮到其他用戶數量龐大，item-based協同過濾更可行一些。

電子商務中的Item-based協同過濾[編輯]
人們並不總是能給出評分，當用戶只提供二進制數據（購買與否）的時候，就無法應用Slope One 和其它基於評分的算法。二進制 item-based協同過濾應用的例子之一就是Amazon的 item-to-item 專利算法[7] ，該算法中用二進制向量表示用戶-項目購買關系的矩陣，並計算二進制向量間的cosine相關系數。

有人認為Item-to-Item 算法甚至比Slope One 還簡單，例如：

購買統計樣本
顧客項目 1 項目 2 項目 3
John 買過沒買過買過
Mark 沒買過買過買過
Lucy 沒買過買過沒買過
在本例當中，項目1和項目2間的cosine相關系數為：

項目1和項目3間的cosine相關系數為：

而項目2和項目3的cosine相關系數為：

於是，瀏覽項目1的顧客會被推薦買項目3(兩者相關系數最大),而瀏覽項目2的顧客會被推薦買項目3,瀏覽了項目3的會首先被推薦買項目1（再然后是項目2,因為2和3的相關系數小於1和3）。該模型只使用了每對項目間的一個參數（cosine相關系數）來產生推薦。因此，如果有n個項目，則需要計算和存儲 n（n-1）/2 個cosine相關系數。

Slope One 協同過濾[編輯]
為了大大減少過適(過擬合)的發生，提升算法簡化實現， Slope One 系列易實現的Item-based協同過濾算法被提了出來。本質上，該方法運用更簡單形式的回歸表達式() 和單一的自由參數，而不是一個項目評分和另一個項目評分間的線性回歸 ()。該自由參數只不過就是兩個項目評分間的平均差值。甚至在某些實例當中，它比線性回歸的方法更准確[2]，而且該算法只需要一半（甚至更少）的存儲量。

例:

User A 對 Item I 評分為1 對Item J.評分為1.5
User B 對 Item I 評分為2.
你認為 User B 會給 Item J 打幾分?
Slope One 的答案是：2.5 (1.5-1+2=2.5).

舉個更實際的例子，考慮下表：

評分數據庫樣本
顧客項目 1 項目 2 項目 3
John 5 3 2
Mark 3 4 未評分
Lucy 未評分 2 5
在本例中，項目2和1之間的平均評分差值為 (2+(-1))/2=0.5. 因此，item1的評分平均比item2高0.5。同樣的，項目3和1之間的平均評分差值為3。因此，如果我們試圖根據Lucy 對項目2的評分來預測她對項目1的評分的時候，我們可以得到 2+0.5 = 2.5。同樣，如果我們想要根據她對項目3的評分來預測她對項目1的評分的話，我們得到 5+3=8.

如果一個用戶已經評價了一些項目，可以這樣做出預測：簡單地把各個項目的預測通過加權平均值結合起來。當用戶兩個項目都評價過的時候，權值就高。在上面的例子中，項目1和項目2都評價了的用戶數為2,項目1和項目3 都評價了的用戶數為1,因此權重分別為2和1. 我們可以這樣預測Lucy對項目1的評價：

於是，對“n”個項目，想要實現 Slope One，只需要計算並存儲“n”對評分間的平均差值和評價數目即可。

Slope One 的java/c#實現[編輯]
java實現

package test;

import java.util.*;

/**

* Daniel Lemire A simple implementation of the weighted slope one algorithm in
* Java for item-based collaborative filtering. Assumes Java 1.5.
*
* See main function for example.
*
* June 1st 2006. Revised by Marco Ponzi on March 29th 2007
*/
public class SlopeOne {

public static void main(String args[]) {
// this is my data base
Map<UserId, Map<ItemId, Float>> data = new HashMap<UserId, Map<ItemId, Float>>();
// items
ItemId item1 = new ItemId(" candy");
ItemId item2 = new ItemId(" dog");
ItemId item3 = new ItemId(" cat");
ItemId item4 = new ItemId(" war");
ItemId item5 = new ItemId("strange food");
mAllItems = new ItemId[] { item1, item2, item3, item4, item5 };
// I'm going to fill it in
HashMap<ItemId, Float> user1 = new HashMap<ItemId, Float>();
HashMap<ItemId, Float> user2 = new HashMap<ItemId, Float>();
HashMap<ItemId, Float> user3 = new HashMap<ItemId, Float>();
HashMap<ItemId, Float> user4 = new HashMap<ItemId, Float>();
user1.put(item1, 1.0f);
user1.put(item2, 0.5f);
user1.put(item4, 0.1f);
data.put(new UserId("Bob"), user1);
user2.put(item1, 1.0f);
user2.put(item3, 0.5f);
user2.put(item4, 0.2f);
data.put(new UserId("Jane"), user2);
user3.put(item1, 0.9f);
user3.put(item2, 0.4f);
user3.put(item3, 0.5f);
user3.put(item4, 0.1f);
data.put(new UserId("Jo"), user3);
user4.put(item1, 0.1f);
// user4.put(item2,0.4f);
// user4.put(item3,0.5f);
user4.put(item4, 1.0f);
user4.put(item5, 0.4f);
data.put(new UserId("StrangeJo"), user4);
// next, I create my predictor engine
SlopeOne so = new SlopeOne(data);
System.out.println("Here's the data I have accumulated...");
so.printData();
// then, I'm going to test it out...
HashMap<ItemId, Float> user = new HashMap<ItemId, Float>();
System.out.println("Ok, now we predict...");
user.put(item5, 0.4f);
System.out.println("Inputting...");
SlopeOne.print(user);
System.out.println("Getting...");
SlopeOne.print(so.predict(user));
//
user.put(item4, 0.2f);
System.out.println("Inputting...");
SlopeOne.print(user);
System.out.println("Getting...");
SlopeOne.print(so.predict(user));
}
Map<UserId, Map<ItemId, Float>> mData;
Map<ItemId, Map<ItemId, Float>> mDiffMatrix;
Map<ItemId, Map<ItemId, Integer>> mFreqMatrix;
static ItemId[] mAllItems;
public SlopeOne(Map<UserId, Map<ItemId, Float>> data) {
mData = data;
buildDiffMatrix();
}
/**
* Based on existing data, and using weights, try to predict all missing
* ratings. The trick to make this more scalable is to consider only
* mDiffMatrix entries having a large (>1) mFreqMatrix entry.
*
* It will output the prediction 0 when no prediction is possible.
*/
public Map<ItemId, Float> predict(Map<ItemId, Float> user) {
HashMap<ItemId, Float> predictions = new HashMap<ItemId, Float>();
HashMap<ItemId, Integer> frequencies = new HashMap<ItemId, Integer>();
for (ItemId j : mDiffMatrix.keySet()) {
frequencies.put(j, 0);
predictions.put(j, 0.0f);
}
for (ItemId j : user.keySet()) {
for (ItemId k : mDiffMatrix.keySet()) {
try {
float newval = (mDiffMatrix.get(k).get(j).floatValue() + user.get(j)
.floatValue()) * mFreqMatrix.get(k).get(j).intValue();
predictions.put(k, predictions.get(k) + newval);
frequencies.put(k, frequencies.get(k)
+ mFreqMatrix.get(k).get(j).intValue());
} catch (NullPointerException e) {
}
}
}
HashMap<ItemId, Float> cleanpredictions = new HashMap<ItemId, Float>();
for (ItemId j : predictions.keySet()) {
if (frequencies.get(j) > 0) {
cleanpredictions.put(j, predictions.get(j).floatValue()
/ frequencies.get(j).intValue());
}
}
for (ItemId j : user.keySet()) {
cleanpredictions.put(j, user.get(j));
}
return cleanpredictions;
}
/**
* Based on existing data, and not using weights, try to predict all missing
* ratings. The trick to make this more scalable is to consider only
* mDiffMatrix entries having a large (>1) mFreqMatrix entry.
*/
public Map<ItemId, Float> weightlesspredict(Map<ItemId, Float> user) {
HashMap<ItemId, Float> predictions = new HashMap<ItemId, Float>();
HashMap<ItemId, Integer> frequencies = new HashMap<ItemId, Integer>();
for (ItemId j : mDiffMatrix.keySet()) {
predictions.put(j, 0.0f);
frequencies.put(j, 0);
}
for (ItemId j : user.keySet()) {
for (ItemId k : mDiffMatrix.keySet()) {
// System.out.println("Average diff between "+j+" and "+ k +
// " is "+mDiffMatrix.get(k).get(j).floatValue()+" with n ="+mFreqMatrix.get(k).get(j).floatValue());
float newval = (mDiffMatrix.get(k).get(j).floatValue() + user.get(j)
.floatValue());
predictions.put(k, predictions.get(k) + newval);
}
}
for (ItemId j : predictions.keySet()) {
predictions.put(j, predictions.get(j).floatValue() / user.size());
}
for (ItemId j : user.keySet()) {
predictions.put(j, user.get(j));
}
return predictions;
}
public void printData() {
for (UserId user : mData.keySet()) {
System.out.println(user);
print(mData.get(user));
}
for (int i = 0; i < mAllItems.length; i++) {
System.out.print("\n" + mAllItems[i] + ":");
printMatrixes(mDiffMatrix.get(mAllItems[i]),
mFreqMatrix.get(mAllItems[i]));
}
}
private void printMatrixes(Map<ItemId, Float> ratings,
Map<ItemId, Integer> frequencies) {
for (int j = 0; j < mAllItems.length; j++) {
System.out.format("%10.3f", ratings.get(mAllItems[j]));
System.out.print(" ");
System.out.format("%10d", frequencies.get(mAllItems[j]));
}
System.out.println();
}
public static void print(Map<ItemId, Float> user) {
for (ItemId j : user.keySet()) {
System.out.println(" " + j + " --> " + user.get(j).floatValue());
}
}
public void buildDiffMatrix() {
mDiffMatrix = new HashMap<ItemId, Map<ItemId, Float>>();
mFreqMatrix = new HashMap<ItemId, Map<ItemId, Integer>>();
// first iterate through users
for (Map<ItemId, Float> user : mData.values()) {
// then iterate through user data
for (Map.Entry<ItemId, Float> entry : user.entrySet()) {
if (!mDiffMatrix.containsKey(entry.getKey())) {
mDiffMatrix.put(entry.getKey(), new HashMap<ItemId, Float>());
mFreqMatrix.put(entry.getKey(), new HashMap<ItemId, Integer>());
}
for (Map.Entry<ItemId, Float> entry2 : user.entrySet()) {
int oldcount = 0;
if (mFreqMatrix.get(entry.getKey()).containsKey(entry2.getKey()))
oldcount = mFreqMatrix.get(entry.getKey()).get(entry2.getKey())
.intValue();
float olddiff = 0.0f;
if (mDiffMatrix.get(entry.getKey()).containsKey(entry2.getKey()))
olddiff = mDiffMatrix.get(entry.getKey()).get(entry2.getKey())
.floatValue();
float observeddiff = entry.getValue() - entry2.getValue();
mFreqMatrix.get(entry.getKey()).put(entry2.getKey(), oldcount + 1);
mDiffMatrix.get(entry.getKey()).put(entry2.getKey(),
olddiff + observeddiff);
}
}
}
for (ItemId j : mDiffMatrix.keySet()) {
for (ItemId i : mDiffMatrix.get(j).keySet()) {
float oldvalue = mDiffMatrix.get(j).get(i).floatValue();
int count = mFreqMatrix.get(j).get(i).intValue();
mDiffMatrix.get(j).put(i, oldvalue / count);
}
}
}
}

class UserId {

String content;
public UserId(String s) {
content = s;
}
public int hashCode() {
return content.hashCode();
}
public String toString() {
return content;
}
}

class ItemId {

String content;
public ItemId(String s) {
content = s;
}
public int hashCode() {
return content.hashCode();
}
public String toString() {
return content;
}
}

C#實現： using System; using System.Collections.Generic; using System.Linq; using System.Text; namespace SlopeOne {

public class Rating
{
public float Value { get; set; }
public int Freq { get; set; }
public float AverageValue
{
get { return Value / Freq; }
}
}
public class RatingDifferenceCollection : Dictionary<string, Rating>
{
private string GetKey(int Item1Id, int Item2Id)
{
return (Item1Id < Item2Id) ? Item1Id "/" Item2Id : Item2Id "/" Item1Id ;
}
public bool Contains(int Item1Id, int Item2Id)
{
return this.Keys.Contains<string>(GetKey(Item1Id, Item2Id));
}
public Rating this[int Item1Id, int Item2Id]
{
get {
return this[this.GetKey(Item1Id, Item2Id)];
}
set { this[this.GetKey(Item1Id, Item2Id)] = value; }
}
}
public class SlopeOne
{
public RatingDifferenceCollection _DiffMarix = new RatingDifferenceCollection(); // The dictionary to keep the diff matrix
public HashSet<int> _Items = new HashSet<int>(); // Tracking how many items totally
public void AddUserRatings(IDictionary<int, float> userRatings)
{
foreach (var item1 in userRatings)
{
int item1Id = item1.Key;
float item1Rating = item1.Value;
_Items.Add(item1.Key);
foreach (var item2 in userRatings)
{
if (item2.Key <= item1Id) continue; // Eliminate redundancy
int item2Id = item2.Key;
float item2Rating = item2.Value;
Rating ratingDiff;
if (_DiffMarix.Contains(item1Id, item2Id))
{
ratingDiff = _DiffMarix[item1Id, item2Id];
}
else
{
ratingDiff = new Rating();
_DiffMarix[item1Id, item2Id] = ratingDiff;
}
ratingDiff.Value = item1Rating - item2Rating;
ratingDiff.Freq = 1;
}
}
}
// Input ratings of all users
public void AddUerRatings(IList<IDictionary<int, float>> Ratings)
{
foreach(var userRatings in Ratings)
{
AddUserRatings(userRatings);
}
}
public IDictionary<int, float> Predict(IDictionary<int, float> userRatings)
{
Dictionary<int, float> Predictions = new Dictionary<int, float>();
foreach (var itemId in this._Items)
{
if (userRatings.Keys.Contains(itemId)) continue; // User has rated this item, just skip it
Rating itemRating = new Rating();
foreach (var userRating in userRatings)
{
if (userRating.Key == itemId) continue;
int inputItemId = userRating.Key;
if (_DiffMarix.Contains(itemId, inputItemId))
{
Rating diff = _DiffMarix[itemId, inputItemId];
itemRating.Value = diff.Freq * (userRating.Value diff.AverageValue * ((itemId < inputItemId) ? 1 : -1));
itemRating.Freq = diff.Freq;
}
}
Predictions.Add(itemId, itemRating.AverageValue);
}
return Predictions;
}
public static void Test()
{
SlopeOne test = new SlopeOne();
Dictionary<int, float> userRating = new Dictionary<int, float>();
userRating.Add(1, 5);
userRating.Add(2, 4);
userRating.Add(3, 4);
test.AddUserRatings(userRating);
userRating = new Dictionary<int, float>();
userRating.Add(1, 4);
userRating.Add(2, 5);
userRating.Add(3, 3);
userRating.Add(4, 5);
test.AddUserRatings(userRating);
userRating = new Dictionary<int, float>();
userRating.Add(1, 4);
userRating.Add(2, 4);
userRating.Add(4, 5);
test.AddUserRatings(userRating);
userRating = new Dictionary<int, float>();
userRating.Add(1, 5);
userRating.Add(3, 4);
IDictionary<int, float> Predictions = test.Predict(userRating);
foreach (var rating in Predictions)
{
Console.WriteLine("Item " rating.Key " Rating: " rating.Value);
}
}
}
}

Slope One 的算法復雜度[編輯]
設有“n”個項目，“m”個用戶，“N”個評分。計算每對評分之間的差值需要n(n-1)/2 單位的存儲空間，最多需要 m n2步. 計算量也有可能挺悲觀的:假設用戶已經評價了最多 y 個項目, 那么計算不超過n2+my2個項目間計算差值是可能的。 . 如果一個用戶已經評價過“x”個項目，預測單一的項目評分需要“x“步，而對其所有未評分項目做出評分預測需要最多 (n-x)x 步. 當一個用戶已經評價過“x”個項目時，當該用戶新增一個評價時，更新數據庫需要 x步.

可以通過分割數據（參照分割和稀疏存儲（沒有共同評價項目的用戶可以被忽略）來降低存儲要求，

應用Slope One的推薦系統[編輯]
hitflip DVD推薦系統
How Happy
inDiscover MP3推薦系統
RACOFI Composer
Value Investing News 股票新聞網站
AllTheBests 購物推薦引擎

---------------------

本文來自 Tech_Hog 的CSDN 博客，全文地址請點擊：https://blog.csdn.net/whaoxysh/article/details/19038453?utm_source=copy

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。