Personalized Recall Algorithms in Practice (4): The ContentBased Algorithm


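The idea behind the ContentBased (CB) algorithm is simple: given the items a user liked in the past, recommend other items that are similar to them. The key is how item similarity is measured; that measurement is the core of applying the algorithm.
A CB pipeline generally consists of the following three steps:
Item Representation: extract a set of features (the item's content) to represent each item;
Profile Learning: use the features of the items a user liked (and disliked) in the past to learn that user's preference profile;
Recommendation Generation: compare the user profile from the previous step with the features of candidate items, and recommend the items with the highest relevance to that user.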
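
The three steps above can be sketched on toy data. This is only a minimal illustration, not the implementation used later in this article: the `ITEM_FEATURES` table, the `learn_profile` and `recommend` helpers, and the overlap-based scoring are all assumptions made for the example.

```python
from collections import defaultdict

# Step 1 (item representation): hypothetical items described by a set of genres.
ITEM_FEATURES = {
    "m1": {"Comedy", "Romance"},
    "m2": {"Comedy"},
    "m3": {"Horror"},
    "m4": {"Romance", "Drama"},
}

def learn_profile(liked_items):
    """Step 2: sum per-feature weights over liked items, then normalise to 1."""
    weights = defaultdict(float)
    for item in liked_items:
        feats = ITEM_FEATURES[item]
        for feat in feats:
            # Each item spreads one unit of weight evenly over its features.
            weights[feat] += 1.0 / len(feats)
    total = sum(weights.values())
    return {feat: w / total for feat, w in weights.items()}

def recommend(profile, seen, k=2):
    """Step 3: score unseen items by their average profile weight, keep the top k."""
    scores = {}
    for item, feats in ITEM_FEATURES.items():
        if item in seen:
            continue
        scores[item] = sum(profile.get(f, 0.0) for f in feats) / len(feats)
    # Sort by score descending, breaking ties by item id for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]

liked = ["m1", "m2"]
profile = learn_profile(liked)
recs = recommend(profile, seen=set(liked))
print(profile, recs)
```

Here the user who liked two comedies ends up with a profile dominated by Comedy, and the unseen romance film ranks above the horror film because it shares a feature with the profile.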

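In the code, initialization proceeds as follows:
1. Build moviesDF with three columns (movie_id, title, genres) and ratingsDF with four columns (user_id, movie_id, rating, timestamp).
2. Build item_cate and cate_item. item_cate holds each item's weight per category (split evenly across the item's genres); cate_item is an inverted index that lists, for each category, its items sorted by average rating in descending order.
3. Build self.up, of the form userid: [(category, ratio), (category1, ratio1)], which represents each user's preference score for each category.
The key methods are: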
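
To make the shapes of item_cate and cate_item concrete, here is a small sketch in plain Python. The `movies` and `ratings` toy inputs are hypothetical stand-ins for moviesDF and ratingsDF, and the sketch skips pandas entirely; it only mirrors the structures described above.

```python
# Hypothetical toy inputs standing in for moviesDF and ratingsDF.
movies = {1: "Comedy|Romance", 2: "Comedy", 3: "Horror"}   # movie_id -> genres
ratings = [(10, 1, 5), (11, 1, 4), (10, 2, 3), (11, 3, 4)]  # (user_id, movie_id, rating)

# Average rating per item.
totals = {}
for _, movie_id, rating in ratings:
    s, n = totals.get(movie_id, (0, 0))
    totals[movie_id] = (s + rating, n + 1)
item_avg = {m: s / n for m, (s, n) in totals.items()}

# item_cate: movie_id -> {genre: weight}, each genre weighted 1 / number_of_genres.
item_cate = {}
for movie_id, genres in movies.items():
    genre_list = genres.split("|")
    item_cate[movie_id] = {g: 1 / len(genre_list) for g in genre_list}

# cate_item: genre -> movie_ids sorted by average rating, best first.
cate_item = {}
for movie_id, weights in item_cate.items():
    for genre in weights:
        cate_item.setdefault(genre, []).append(movie_id)
for genre in cate_item:
    cate_item[genre].sort(key=lambda m: item_avg.get(m, 0), reverse=True)

print(item_cate)
print(cate_item)
```

Movie 1 belongs to two genres, so each gets a weight of 0.5, and it leads the Comedy list because its average rating (4.5) beats movie 2's (3.0).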

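  • get_up(self,score_thr=4.0,topK=5)
    Select the items rated above score_thr to represent the user's preferences, and weight each rating by time to obtain time_score: \(time\_score=round(\frac{1}{1+(max\_ts-ts)/(24*60*60*100)},3)\), where ts is the timestamp of the rating and max_ts is the latest timestamp among the selected ratings, so the more recent a rating is, the larger its time weight. Combining the user's rating of the item, the time weight of that rating, and the item's per-category weight yields each user's topK category scores, which are then normalized to sum to 1.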

  • recommend(self, userID, K=10)
    Using the user's category scores, take the top-ranked items under each of those categories as the recommendations for the user.
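
A small worked example of the time weighting and profile normalization described for get_up. The timestamps, genres and ratings below are made up for illustration; `time_score` follows the formula given above.

```python
DAY_SECONDS = 24 * 60 * 60

def time_score(ts, max_ts):
    """Decay weight in (0, 1]; it drops to 0.5 once a rating is 100 days older than max_ts."""
    delta = (max_ts - ts) / DAY_SECONDS / 100
    return round(1 / (1 + delta), 3)

max_ts = 1_000_000_000  # hypothetical latest rating timestamp

# (genres, rating, timestamp) for one hypothetical user.
liked = [
    (["Comedy", "Romance"], 5, max_ts),                   # rated just now
    (["Horror"], 5, max_ts - 300 * DAY_SECONDS),          # rated 300 days earlier
]

# Accumulate rating * time weight * per-genre weight for every genre.
raw = {}
for genres, rating, ts in liked:
    for g in genres:
        raw[g] = raw.get(g, 0) + rating * time_score(ts, max_ts) * (1 / len(genres))

# Normalise so the category scores sum to 1, highest first.
total = sum(raw.values())
ranked = sorted(raw.items(), key=lambda x: x[1], reverse=True)
profile = [(g, round(s / total, 3)) for g, s in ranked]
print(profile)
```

Both films received five stars, but the horror film was rated 300 days earlier, so its time weight is 0.25 and Horror ends up with half the share of Comedy or Romance in the profile.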

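In effect, movie genres serve as the item feature data here and are also used to express the user's preference profile; items are then recommended to the user according to that profile and the candidate items' scores within each feature.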
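
The way a profile is turned into a recommendation list can be sketched as below. The `allocate` helper and the toy `profile` and `cate_item` values are hypothetical; the per-category quota `int(k * ratio) + 1` is the one used by recommend in the code of this article. The final de-duplication step is an addition for illustration and is not part of the original code.

```python
def allocate(profile, cate_item, k):
    """Take int(k * ratio) + 1 items from the head of each category's ranked list."""
    picked = []
    for cate, ratio in profile:
        num = int(k * ratio) + 1
        picked += cate_item.get(cate, [])[:num]
    return picked

# Hypothetical normalised profile and per-category ranked item lists.
profile = [("Comedy", 0.5), ("Drama", 0.25), ("Horror", 0.25)]
cate_item = {
    "Comedy": [1, 2, 3, 4, 5, 6, 7],
    "Drama": [3, 8, 9, 10],
    "Horror": [11, 12],
}

picked = allocate(profile, cate_item, k=10)
# Order-preserving de-duplication (an extra step, not in the original code).
unique = list(dict.fromkeys(picked))
print(picked)
print(unique)
```

Two properties are worth noticing: because of the `+ 1` per category the list can be longer than K, and an item that belongs to several of the user's categories (item 3 here) can appear more than once unless duplicates are removed.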

The full code is as follows:

#-*-coding:utf-8-*-
"""
author:jamest
date:20190405
content based function
"""
import pandas as pd
import numpy as np
import time
import os

class contentBased:
    def __init__(self,rating_file,item_file):
        # Fail loudly: returning early here would leave a half-initialised object.
        if not os.path.exists(rating_file) or not os.path.exists(item_file):
            raise FileNotFoundError('rating_file or item_file does not exist')
        # A multi-character separator such as '::' needs the python parsing engine.
        self.moviesDF = pd.read_csv(item_file, index_col=None, sep='::', header=None, names=['movie_id', 'title', 'genres'], engine='python')
        self.ratingsDF = pd.read_csv(rating_file, index_col=None, sep='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')
        self.item_cate, self.cate_item = self.get_item_cate()
        self.up = self.get_up()

    def get_item_cate(self,topK = 10):
        """
         Args:
             topK:nums of items in cate_item
         Returns:
             item_cate:a dict,key:itemid ,value:{genre:ratio}
             cate_item:a dict,key:cate ,value:[item1,item2,item3]
         """
        # Average rating per movie (the dict-renaming form of agg() raises an error in pandas 1.0 and later)
        movie_rating_avg = self.ratingsDF.groupby('movie_id')['rating'].mean().reset_index(name='item_ratings_mean')
        items = movie_rating_avg['movie_id'].values
        scores = movie_rating_avg['item_ratings_mean'].values

        # Map each item to its average rating
        item_score_avg = {}
        for item, score in zip(items, scores):
            item_score_avg[item] = score

        # Weight of each genre within an item: split evenly across the item's genres
        item_cate = {}
        items = self.moviesDF['movie_id'].values
        genres = self.moviesDF['genres'].apply(lambda x: x.split('|')).values
        for item, genres_lis in zip(items, genres):
            ratio = 1 / len(genres_lis)
            item_cate[item] = {}
            for genre in genres_lis:
                item_cate[item][genre] = ratio

        # genre -> {item: average rating}; items without ratings get 0
        record = {}
        for item in item_cate:
            for genre in item_cate[item]:
                if genre not in record:
                    record[genre] = {}
                record[genre][item] = item_score_avg.get(item, 0)

        # Inverted index: for each genre, the topK items by average rating
        cate_item = {}
        for cate in record:
            if cate not in cate_item:
                cate_item[cate] = []
            for pair in sorted(record[cate].items(), key=lambda x: x[1], reverse=True)[:topK]:
                cate_item[cate].append(pair[0])

        return item_cate, cate_item


    def get_time_score(self,timestamp,fix_time_stamp):
        """
         Args:
             timestamp:the timestamp of user-item
             fix_time_stamp:the max timestamp of the timestamps
         Returns:
             a time_score:fixed range in (0,1]
         """
        total_sec = 24*60*60
        delta = (fix_time_stamp-timestamp)/total_sec/100
        return round(1/(1+delta),3)

    def get_up(self,score_thr=4.0,topK=5):
        """
         Args:
             score_thr:keep only the rows of ratingsDF with rating>score_thr
             topK:the number of categories kept per user in up
         Returns:
             a dict,key:userid ,value:[(category,ratio),(category1,ratio1)]
         """
        # .copy() so that adding a column does not write into a slice of self.ratingsDF
        ratingsDF = self.ratingsDF[self.ratingsDF['rating'] > score_thr].copy()
        fix_time_stamp = ratingsDF['timestamp'].max()
        ratingsDF['time_score'] = ratingsDF['timestamp'].apply(lambda x: self.get_time_score(x,fix_time_stamp))

        users = ratingsDF['user_id'].values
        items = ratingsDF['movie_id'].values
        ratings = ratingsDF['rating'].values
        scores = ratingsDF['time_score'].values

        record = {}
        up = {}
        for userid, itemid, rating, time_score in zip(users, items, ratings, scores):
            if userid not in record:
                record[userid] = {}

            # .get() skips rated items that are missing from the movies file
            for cate, weight in self.item_cate.get(itemid, {}).items():
                if cate not in record[userid]:
                    record[userid][cate] = 0
                record[userid][cate] += rating * time_score * weight
        for userid in record:
            if userid not in up:
                up[userid] = []
            total_score = 0
            for pair in sorted(record[userid].items(), key=lambda x: x[1], reverse=True)[:topK]:
                up[userid].append((pair[0], pair[1]))
                total_score += pair[1]
            # Normalise the kept category scores so that they sum to 1
            for index in range(len(up[userid])):
                up[userid][index] = (up[userid][index][0], round(up[userid][index][1] / total_score, 3))
        return up


    def recommend(self, userID, K=10):
        """
         Args:
             userID: the user to recommend for
             K: the approximate number of items to recommend
         Returns:
             a dict,key:userID ,value:list of recommended itemids;
             None if the user has no profile
         """
        if userID not in self.up:
            return None
        recom_res = {userID: []}

        for cate, ratio in self.up[userID]:
            if cate not in self.cate_item:
                continue
            # Each category contributes int(K*ratio)+1 items, so the total can exceed K
            num = int(K * ratio) + 1
            recom_res[userID] += self.cate_item[cate][:num]
        return recom_res

if __name__ == '__main__':
    moviesPath = '../data/ml-1m/movies.dat'
    ratingsPath = '../data/ml-1m/ratings.dat'
    usersPath = '../data/ml-1m/users.dat'
    recom_res = contentBased(ratingsPath,moviesPath).recommend(userID=1,K=30)
    print('content based result',recom_res)
