Personalized Recall Algorithms in Practice (4): The ContentBased Algorithm

Reprinted article, 2019-10-30 11:43, Algorithm practice

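The idea behind the ContentBased (CB) algorithm is simple: based on the items a user liked in the past (called "items" throughout this article), recommend items similar to them. The key is how item similarity is measured; that is the core of applying the algorithm.
A CB pipeline generally consists of the following three steps:
1. Item Representation: extract features from each item (that is, the item's content) to represent it;
2. Profile Learning: use the features of the items a user liked (and disliked) in the past to learn that user's preference profile;
3. Recommendation Generation: compare the user profile from the previous step with the features of the candidate items, and recommend the most relevant items to the user.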
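
The three steps above can be sketched on toy data as follows. This is only an illustrative sketch, not the code used later in the article: the item ids, the genres, and the helper names `learn_profile`, `score_item` and `recommend` are all hypothetical, and the relevance measure (average profile weight over an item's genres) is one simple choice among many.

```python
from collections import defaultdict

# Step 1 (item representation): hypothetical items described by genre features.
item_features = {
    "m1": ["Action", "Comedy"],
    "m2": ["Action"],
    "m3": ["Drama"],
    "m4": ["Comedy"],
}
# Items a hypothetical user liked in the past.
liked_items = ["m1", "m2"]


def learn_profile(liked, features):
    """Step 2: accumulate genre weights over liked items, then normalize."""
    profile = defaultdict(float)
    for item in liked:
        genres = features[item]
        for genre in genres:
            # Each item spreads a total weight of 1 evenly over its genres.
            profile[genre] += 1.0 / len(genres)
    total = sum(profile.values())
    return {genre: weight / total for genre, weight in profile.items()}


def score_item(profile, genres):
    """Relevance of a candidate: average profile weight over its genres."""
    return sum(profile.get(g, 0.0) for g in genres) / len(genres)


def recommend(profile, features, seen, k=2):
    """Step 3: rank the unseen items by relevance to the profile."""
    candidates = [i for i in features if i not in seen]
    ranked = sorted(candidates,
                    key=lambda i: score_item(profile, features[i]),
                    reverse=True)
    return ranked[:k]


profile = learn_profile(liked_items, item_features)
print(profile)                                               # {'Action': 0.75, 'Comedy': 0.25}
print(recommend(profile, item_features, set(liked_items)))   # ['m4', 'm3']
```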

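In the code, initialization proceeds as follows:
1. Load moviesDF, with the columns movie_id, title and genres, and ratingsDF, with the columns user_id, movie_id, rating and timestamp.
2. Build item_cate and cate_item. item_cate maps each item to the weight of each of its categories (the weight is split evenly across the item's genres); cate_item is an inverted index that lists, for each category, its items sorted by average rating in descending order.
3. Build self.up, of the form userid:[(category,ratio),(category1,ratio1)], which holds each user's preference score for each category.
The key methods are: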
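
To make the two structures in step 2 concrete, here is a minimal sketch on toy data. The input dictionaries and the function names `build_item_cate` and `build_cate_item` are hypothetical stand-ins for the DataFrame-based code in the full listing; they only assume that genres are separated by `|`.

```python
# Hypothetical toy input: movie_id -> genres string, and movie_id -> mean rating.
movie_genres = {1: "Animation|Comedy", 2: "Comedy", 3: "Drama"}
mean_rating = {1: 4.2, 2: 3.5, 3: 4.8}


def build_item_cate(movie_genres):
    """item_cate: movie_id -> {genre: weight}, weight split evenly over genres."""
    item_cate = {}
    for movie_id, genres in movie_genres.items():
        parts = genres.split("|")
        item_cate[movie_id] = {g: 1 / len(parts) for g in parts}
    return item_cate


def build_cate_item(item_cate, mean_rating, top_k=10):
    """cate_item: genre -> movie ids sorted by mean rating, best first."""
    by_genre = {}
    for movie_id, weights in item_cate.items():
        for genre in weights:
            by_genre.setdefault(genre, []).append(movie_id)
    return {
        genre: sorted(ids, key=lambda m: mean_rating.get(m, 0), reverse=True)[:top_k]
        for genre, ids in by_genre.items()
    }


item_cate = build_item_cate(movie_genres)
cate_item = build_cate_item(item_cate, mean_rating)
print(item_cate[1])         # {'Animation': 0.5, 'Comedy': 0.5}
print(cate_item["Comedy"])  # [1, 2]
```
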

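get_up(self,score_thr=4.0,topK=5)
Selects the items rated above score_thr as representative of the user's preferences and weights them by time to obtain time_score: \(time\_score=round\left(\frac{1}{1+(max\_ts-ts)/(24\times 60\times 60\times 100)},3\right)\), so the more recently an item was rated, the larger its time weight. Combining the user's rating of the item, the time weight of that rating and the item's category weights yields each user's topK category scores, which are then normalized.
recommend(self, userID, K=10)
Takes the top items under each category, in proportion to the user's category scores, as the recommendations for the user.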
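
As a quick check of the time-weight formula, the sketch below evaluates it for a few rating ages. The function mirrors the formula above; the value of `max_ts` and the chosen ages are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def time_score(ts, max_ts):
    """Time weight in (0, 1]: 1.0 for the newest rating, halved at 100 days."""
    delta = (max_ts - ts) / SECONDS_PER_DAY / 100
    return round(1 / (1 + delta), 3)


max_ts = 1_000_000_000  # hypothetical newest timestamp in the data set
for days_ago in (0, 100, 300):
    ts = max_ts - days_ago * SECONDS_PER_DAY
    print(days_ago, time_score(ts, max_ts))
# 0 1.0
# 100 0.5
# 300 0.25
```

The constant 100 sets the scale of the decay: a rating that is 100 days older than the newest one counts half as much.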

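In effect, movie genres serve here as the item features, and the user's preference profile is expressed over those same genres. Based on the user's profile and the candidate items' scores under each genre, the user is recommended the set of most relevant items.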
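
How a profile turns into a recommendation list can be sketched as follows. The profile, the per-genre item lists and the function name `allocate` are hypothetical; the quota rule `int(K * ratio) + 1` is the one used by recommend in the full listing.

```python
def allocate(profile, cate_item, k=10):
    """Give each category a quota of int(k * ratio) + 1 of its top items."""
    recs = []
    for cate, ratio in profile:
        num = int(k * ratio) + 1
        recs += cate_item.get(cate, [])[:num]
    return recs


# Hypothetical normalized profile and per-genre item lists (best item first).
profile = [("Comedy", 0.6), ("Drama", 0.4)]
cate_item = {"Comedy": [11, 12, 13, 14], "Drama": [21, 22, 23]}
print(allocate(profile, cate_item, k=4))  # [11, 12, 13, 21, 22]
```

Because every category receives one extra slot, the result can be slightly longer than K (five items for K=4 here).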

The full code is shown below:

#-*-coding:utf-8-*-
"""
author:jamest
date:20190405
content based function
"""
import os

import pandas as pd


class contentBased:
    def __init__(self, rating_file, item_file):
        # Fail loudly instead of returning a half-initialized object.
        if not os.path.exists(rating_file) or not os.path.exists(item_file):
            raise FileNotFoundError('rating_file or item_file does not exist')
        # A multi-character separator such as '::' needs the python parsing engine.
        # Assumption: ml-1m's movies.dat is Latin-1 encoded; change the encoding
        # if your copy of the file is UTF-8.
        self.moviesDF = pd.read_csv(item_file, index_col=None, sep='::', header=None, engine='python',
                                    encoding='latin-1', names=['movie_id', 'title', 'genres'])
        self.ratingsDF = pd.read_csv(rating_file, index_col=None, sep='::', header=None, engine='python',
                                     names=['user_id', 'movie_id', 'rating', 'timestamp'])
        self.item_cate, self.cate_item = self.get_item_cate()
        self.up = self.get_up()

    def get_item_cate(self, topK=10):
        """
         Args:
             topK: number of items kept per category in cate_item
         Returns:
             item_cate: a dict, key: itemid, value: {category: ratio}
             cate_item: a dict, key: category, value: [item1, item2, item3]
         """
        # Average rating of each item
        item_score_avg = self.ratingsDF.groupby('movie_id')['rating'].mean().to_dict()

        # Weight of each category within an item (split evenly across its genres)
        item_cate = {}
        items = self.moviesDF['movie_id'].values
        genres = self.moviesDF['genres'].apply(lambda x: x.split('|')).values
        for item, genres_lis in zip(items, genres):
            ratio = 1 / len(genres_lis)
            item_cate[item] = {}
            for genre in genres_lis:
                item_cate[item][genre] = ratio

        # Average rating of every item under each category (0 for unrated items)
        recode = {}
        for item in item_cate:
            for genre in item_cate[item]:
                if genre not in recode:
                    recode[genre] = {}
                recode[genre][item] = item_score_avg.get(item, 0)

        # Inverted index: the topK highest-rated items of each category
        cate_item = {}
        for cate in recode:
            if cate not in cate_item:
                cate_item[cate] = []
            for zuhe in sorted(recode[cate].items(), key=lambda x: x[1], reverse=True)[:topK]:
                cate_item[cate].append(zuhe[0])

        return item_cate, cate_item


    def get_time_score(self,timestamp,fix_time_stamp):
        """
         Args:
             timestamp:the timestamp of user-item
             fix_time_stamp:the max timestamp of the timestamps
         Returns:
             a time_score:fixed range in (0,1]
         """
        total_sec = 24*60*60
        delta = (fix_time_stamp-timestamp)/total_sec/100
        return round(1/(1+delta),3)

    def get_up(self, score_thr=4.0, topK=5):
        """
         Args:
             score_thr: keep only the ratings > score_thr from ratingsDF
             topK: the number of categories kept per user in up
         Returns:
             a dict, key: userid, value: [(category, ratio), (category1, ratio1)]
         """
        # .copy() so that adding a column does not write into a slice of self.ratingsDF
        ratingsDF = self.ratingsDF[self.ratingsDF['rating'] > score_thr].copy()
        fix_time_stamp = ratingsDF['timestamp'].max()
        ratingsDF['time_score'] = ratingsDF['timestamp'].apply(lambda x: self.get_time_score(x, fix_time_stamp))

        users = ratingsDF['user_id'].values
        items = ratingsDF['movie_id'].values
        ratings = ratingsDF['rating'].values
        scores = ratingsDF['time_score'].values

        # Accumulate rating * time weight * category weight per user and category
        recode = {}
        up = {}
        for userid, itemid, rating, time_score in zip(users, items, ratings, scores):
            if userid not in recode:
                recode[userid] = {}

            # Rated items that are missing from the movie file contribute nothing
            for cate, weight in self.item_cate.get(itemid, {}).items():
                if cate not in recode[userid]:
                    recode[userid][cate] = 0
                recode[userid][cate] += rating * time_score * weight

        # Keep each user's topK categories and normalize their scores
        for userid in recode:
            top_cates = sorted(recode[userid].items(), key=lambda x: x[1], reverse=True)[:topK]
            total_score = sum(score for _, score in top_cates)
            if total_score == 0:
                continue
            up[userid] = [(cate, round(score / total_score, 3)) for cate, score in top_cates]
        return up


    def recommend(self, userID, K=10):
        """
         Args:
             userID: the user to recommend for
             K: the target number of recommended items
         Returns:
             a dict, key: userID, value: list of recommended itemids
             (None if the user has no profile)
         """
        if userID not in self.up:
            return None
        recom_res = {userID: []}

        for cate, ratio in self.up[userID]:
            # Each category gets a share of K proportional to its ratio, plus one,
            # so the list can be slightly longer than K and may contain duplicates.
            num = int(K * ratio) + 1
            if cate not in self.cate_item:
                continue
            recom_res[userID] += self.cate_item[cate][:num]
        return recom_res

if __name__ == '__main__':
    moviesPath = '../data/ml-1m/movies.dat'
    ratingsPath = '../data/ml-1m/ratings.dat'
    recom_res = contentBased(ratingsPath, moviesPath).recommend(userID=1, K=30)
    print('content based result', recom_res)
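
Because a movie can belong to several genres, the same item id may appear more than once in the list built by recommend. The listing above does not deduplicate; the sketch below shows one possible post-processing step, where the helper name `dedupe_and_trim` and the sample ids are hypothetical.

```python
def dedupe_and_trim(item_ids, k):
    """Drop repeated ids while keeping first-seen order, then keep at most k."""
    return list(dict.fromkeys(item_ids))[:k]


# Hypothetical raw output in which ids 11 and 12 appear under two genres.
raw = [11, 12, 13, 12, 21, 11, 22]
print(dedupe_and_trim(raw, 4))  # [11, 12, 13, 21]
```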

