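This article walks through a movie recommendation system built on PyTorch.
The code is ported from https://github.com/chengstone/movie_recommender.
The original author implemented this MovieLens-based recommender in TensorFlow 1.0; the version described here is a port to PyTorch 0.4.
GitHub repository for the model implemented in this article: https://github.com/Holy-Shine/movie_recommend_system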
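1. Overall structure
The project directory contains the following files:
- Params: holds the saved model parameters, plus the user and movie feature vectors obtained after training
- data.p: the training and test data
- dataset.py: subclasses PyTorch's Dataset class and serves the samples that are grouped into batches
- model.py: the PyTorch implementation of the recommendation model
- main.py: the main training procedure
- recInterface.py: once training is finished, the model's intermediate outputs are used as movie and user feature vectors; this interface produces targeted recommendations from the spatial relationships between those vectors
- test.py: not part of the system; only used to check that the input dimensions match the model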
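2. Dataset interface: dataset.py
dataset.py loads data.p into memory and serves samples that are grouped into batches of the requested batch_size and fed to the model for training. First, a look at what data.p contains.
data.p is a pickle file holding the input data. Once loaded, it is a pandas (>=0.22.0) DataFrame (shown as a screenshot in the original post).
The following code loads the dataset for inspection (a Jupyter notebook is recommended):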
import pickle as pkl
data = pkl.load(open('data.p','rb'))
data
Next, the implementation of the data-loading class:
import pickle as pkl

import torch
from torch.utils.data import Dataset


class MovieRankDataset(Dataset):
    def __init__(self, pkl_file):
        # data.p is a pickled pandas DataFrame
        with open(pkl_file, 'rb') as f:
            self.dataFrame = pkl.load(f)

    def __len__(self):
        return len(self.dataFrame)

    def __getitem__(self, idx):
        # The original used .ix, which has been removed from pandas. .iloc selects
        # the row by position (equivalent when the index is the default 0..N-1).
        row = self.dataFrame.iloc[idx]

        # user data
        uid = row['user_id']
        gender = row['user_gender']
        age = row['user_age']
        job = row['user_job']

        # movie data
        mid = row['movie_id']
        mtype = row['movie_type']
        mtext = row['movie_title']

        # target
        rank = torch.FloatTensor([row['rank']])

        user_inputs = {
            'uid': torch.LongTensor([uid]).view(1, -1),
            'gender': torch.LongTensor([gender]).view(1, -1),
            'age': torch.LongTensor([age]).view(1, -1),
            'job': torch.LongTensor([job]).view(1, -1)
        }
        movie_inputs = {
            'mid': torch.LongTensor([mid]).view(1, -1),
            'mtype': torch.LongTensor(mtype),
            'mtext': torch.LongTensor(mtext)
        }
        sample = {
            'user_inputs': user_inputs,
            'movie_inputs': movie_inputs,
            'target': rank
        }
        return sample
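PyTorch expects a custom dataset class to implement three methods:
- __init__(): performs any initialization
- __len__(): returns the number of samples in the whole dataset
- __getitem__(idx): returns the sample at index idx
The key one is __getitem__(idx). It reads each attribute with a row lookup on the DataFrame (dataFrame.ix[idx]['user_id'] in the original code; .ix has since been removed from pandas, where .iloc is the positional equivalent). Because the model takes two input channels, user and movie, the extracted attributes are packed into two dicts, which are then combined into a single sample and returned. The sample is unpacked again during training. (While packing, the values are converted up front into PyTorch tensors with torch.LongTensor() and torch.FloatTensor().)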
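To make the batching step concrete, here is a minimal sketch of how a DataLoader stacks nested-dict samples like the ones returned above. ToyRankDataset and its made-up rows are hypothetical stand-ins for MovieRankDataset and data.p, and only a few of the real fields are included.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyRankDataset(Dataset):
    """Hypothetical stand-in for MovieRankDataset, backed by a list of dicts."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        user_inputs = {'uid': torch.LongTensor([row['user_id']]).view(1, -1)}
        movie_inputs = {
            'mid': torch.LongTensor([row['movie_id']]).view(1, -1),
            'mtype': torch.LongTensor(row['movie_type']),
        }
        target = torch.FloatTensor([row['rank']])
        return {'user_inputs': user_inputs, 'movie_inputs': movie_inputs, 'target': target}


# Made-up rows; the real data.p has more columns and longer lists.
toy_rows = [
    {'user_id': 1, 'movie_id': 10, 'movie_type': [1, 0, 0], 'rank': 5},
    {'user_id': 2, 'movie_id': 20, 'movie_type': [0, 2, 0], 'rank': 3},
    {'user_id': 3, 'movie_id': 30, 'movie_type': [3, 4, 0], 'rank': 4},
    {'user_id': 4, 'movie_id': 40, 'movie_type': [5, 0, 0], 'rank': 1},
]

loader = DataLoader(ToyRankDataset(toy_rows), batch_size=2, shuffle=False)
batch = next(iter(loader))

# The default collate function walks the nested dicts and stacks each leaf tensor.
print(batch['user_inputs']['uid'].shape)     # per-sample 1 x 1 becomes B x 1 x 1
print(batch['movie_inputs']['mtype'].shape)  # per-sample length 3 becomes B x 3
print(batch['target'].shape)                 # per-sample length 1 becomes B x 1
```

The extra singleton dimensions created by view(1, -1) are what later let the model treat each embedded attribute as a B x 1 x 1 x D tensor.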
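3. Recommendation model: model.py
First, a diagram of the model to be implemented:
(Note: the figure comes from the original author's repository.)
As before, PyTorch expects a custom model class to implement at least two methods: __init__() and forward(). __init__() sets up the layers (PyTorch linear layers, convolutional layers, embedding layers and so on), while forward() defines the forward pass. The error gradients for back-propagation are not written by hand; autograd derives them from the forward computation when loss.backward() is called.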
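As a small illustration of that contract, the sketch below defines a toy module with only __init__() and forward(), then lets autograd produce the gradients. TinyScorer and its sizes are hypothetical and unrelated to the layers of the actual recommendation model.

```python
import torch
import torch.nn as nn


class TinyScorer(nn.Module):
    """Hypothetical two-layer module used only to show the __init__/forward split."""

    def __init__(self, num_ids=10, embed_dim=4):
        super().__init__()
        # Layers are created once here and registered as submodules.
        self.embedding = nn.Embedding(num_ids, embed_dim)
        self.fc = nn.Linear(embed_dim, 1)

    def forward(self, ids):
        # Only the forward computation is written by hand.
        return self.fc(self.embedding(ids)).squeeze(-1)


torch.manual_seed(0)
tiny = TinyScorer()
ids = torch.LongTensor([1, 2, 3])

scores = tiny(ids)  # calling the module dispatches to forward()
loss = ((scores - torch.ones(3)) ** 2).mean()
loss.backward()     # autograd fills in .grad for every registered parameter

print(scores.shape)
print(tiny.fc.weight.grad.shape)
```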
Let's go through these two functions in model.py in turn:
3.1 Initialization function
def __init__(self, user_max_dict, movie_max_dict, convParams, embed_dim=32, fc_size=200):
    '''
    Args:
        user_max_dict: embedding table size of each user attribute. {'uid': xx, 'gender': xx, 'age': xx, 'job': xx}
        movie_max_dict: embedding table size of each movie attribute. {'mid': xx, 'mtype': 18, 'mword': xx}
        convParams: text CNN hyperparameters. {'kernel_sizes': [...]}
        embed_dim: size of the embedding layers. normally 32
        fc_size: size of the last fully connected layer of each channel. normally 200
    '''
    super(rec_model, self).__init__()

    # --------------------------------- user channel ----------------------------------------------------------------
    # user embeddings
    self.embedding_uid = nn.Embedding(user_max_dict['uid'], embed_dim)
    self.embedding_gender = nn.Embedding(user_max_dict['gender'], embed_dim // 2)
    self.embedding_age = nn.Embedding(user_max_dict['age'], embed_dim // 2)
    self.embedding_job = nn.Embedding(user_max_dict['job'], embed_dim // 2)

    # user embedding to fc: the first dense layer
    self.fc_uid = nn.Linear(embed_dim, embed_dim)
    self.fc_gender = nn.Linear(embed_dim // 2, embed_dim)
    self.fc_age = nn.Linear(embed_dim // 2, embed_dim)
    self.fc_job = nn.Linear(embed_dim // 2, embed_dim)

    # concat embeddings to fc: the second dense layer
    self.fc_user_combine = nn.Linear(4 * embed_dim, fc_size)

    # --------------------------------- movie channel -----------------------------------------------------------------
    # movie embeddings
    self.embedding_mid = nn.Embedding(movie_max_dict['mid'], embed_dim)  # normally 32
    self.embedding_mtype_sum = nn.EmbeddingBag(movie_max_dict['mtype'], embed_dim, mode='sum')

    self.fc_mid = nn.Linear(embed_dim, embed_dim)
    self.fc_mtype = nn.Linear(embed_dim, embed_dim)

    # movie embedding to fc
    self.fc_mid_mtype = nn.Linear(embed_dim * 2, fc_size)

    # text convolutional part
    # wordlist to embedding matrix B x L x D, L=15 (15 words per title)
    self.embedding_mwords = nn.Embedding(movie_max_dict['mword'], embed_dim)

    # input word vector matrix is B x 15 x 32
    # load text_CNN params
    kernel_sizes = convParams['kernel_sizes']
    # 8 kernels, stride=1, padding=0, kernel_sizes=[2x32, 3x32, 4x32, 5x32]
    # The original kept these branches in a plain Python list and moved each one with
    # .to(device). A plain list is not registered with the module, so its weights are
    # missing from model.parameters() (never optimized) and from state_dict() (never
    # saved). nn.ModuleList registers them; note that parameter files saved from the
    # plain-list version do not contain these keys.
    self.Convs_text = nn.ModuleList([nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=(k, embed_dim)),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=(15 - k + 1, 1), stride=(1, 1))
    ) for k in kernel_sizes])

    # movie channel concat
    self.fc_movie_combine = nn.Linear(embed_dim * 2 + 8 * len(kernel_sizes), fc_size)  # tanh

    # BatchNorm layer
    self.BN = nn.BatchNorm2d(1)
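__init__() takes five parameters:
- user_max_dict / movie_max_dict: the user and movie dictionaries. Each entry bounds the values an attribute can take, and so determines the number of rows in the corresponding embedding table:
    user_max_dict = {
        'uid': 6041,     # 6040 users
        'gender': 2,
        'age': 7,
        'job': 21
    }
    movie_max_dict = {
        'mid': 3953,     # 3952 movies
        'mtype': 18,
        'mword': 5215    # 5215 words
    }
  In this model the dictionaries are passed in as fixed parameters.
- convParams: hyperparameters of the text convolutional network, listing the kernel sizes (one parallel convolution branch per size):
    convParams = {
        'kernel_sizes': [2, 3, 4, 5]
    }
- embed_dim: the global embedding size, i.e. the dimensionality of the feature space.
- fc_size: the number of units in the final fully connected layer.
Finally, the function defines the fully connected layers, embedding layers and text convolution layers for the user and movie channels (the title text has already been converted to integer word indices and stored in the dataset).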
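The layer sizes in __init__() can be checked with a little arithmetic. The sketch below is plain Python with no PyTorch calls; conv_output_len is a hypothetical helper, and the constants (15 words per title, 8 filters, embed_dim of 32) are taken from the code and its comments.

```python
def conv_output_len(seq_len, kernel_height):
    """Output length of a stride-1, zero-padding convolution along the word axis."""
    return seq_len - kernel_height + 1


seq_len = 15                 # words per title
num_filters = 8              # output channels of each Conv2d
kernel_sizes = [2, 3, 4, 5]  # convParams['kernel_sizes']
embed_dim = 32

for k in kernel_sizes:
    after_conv = conv_output_len(seq_len, k)
    # MaxPool2d uses a window of height (15 - k + 1), i.e. the whole remaining length,
    # so exactly one value per filter survives.
    after_pool = after_conv - (seq_len - k + 1) + 1
    print(f"k={k}: conv gives {num_filters} x {after_conv} x 1, "
          f"pool gives {num_filters} x {after_pool} x 1")

# Each branch contributes num_filters values, concatenated across branches.
text_features = num_filters * len(kernel_sizes)
# fc_movie_combine sees the mid feature, the mtype feature and the text features.
movie_combine_in = embed_dim * 2 + text_features
print(text_features, movie_combine_in)
```

This is why the kernel width equals embed_dim (each filter spans whole word vectors) and why fc_movie_combine is declared with embed_dim * 2 + 8 * len(kernel_sizes) inputs.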
3.2 Forward pass
Straight to the code:
def forward(self, user_input, movie_input):
    # unpack the input dicts
    uid = user_input['uid']
    gender = user_input['gender']
    age = user_input['age']
    job = user_input['job']

    mid = movie_input['mid']
    mtype = movie_input['mtype']
    mtext = movie_input['mtext']
    if torch.cuda.is_available():
        uid, gender, age, job, mid, mtype, mtext = \
            uid.to(device), gender.to(device), age.to(device), job.to(device), \
            mid.to(device), mtype.to(device), mtext.to(device)

    # user channel: each input is B x 1 x 1, so each feature is B x 1 x 1 x 32
    feature_uid = self.BN(F.relu(self.fc_uid(self.embedding_uid(uid))))
    feature_gender = self.BN(F.relu(self.fc_gender(self.embedding_gender(gender))))
    feature_age = self.BN(F.relu(self.fc_age(self.embedding_age(age))))
    feature_job = self.BN(F.relu(self.fc_job(self.embedding_job(job))))

    # feature_user: B x 1 x 200
    # torch.tanh replaces the deprecated F.tanh
    feature_user = torch.tanh(self.fc_user_combine(
        torch.cat([feature_uid, feature_gender, feature_age, feature_job], 3)
    )).view(-1, 1, 200)

    # movie channel
    feature_mid = self.BN(F.relu(self.fc_mid(self.embedding_mid(mid))))
    feature_mtype = self.BN(F.relu(self.fc_mtype(self.embedding_mtype_sum(mtype)).view(-1, 1, 1, 32)))
    # feature_mid_mtype = torch.cat([feature_mid, feature_mtype], 2)

    # text cnn part
    feature_img = self.embedding_mwords(mtext)  # to matrix B x 15 x 32
    flattern_tensors = []
    for conv in self.Convs_text:
        # each branch output: B x 8 x 1 x 1, reshaped to B x 1 x 8
        flattern_tensors.append(conv(feature_img.view(-1, 1, 15, 32)).view(-1, 1, 8))

    # concatenated text feature: B x 1 x 32
    # training=self.training switches dropout off in eval mode (F.dropout defaults to always on)
    feature_flattern_dropout = F.dropout(torch.cat(flattern_tensors, 2), p=0.5, training=self.training)

    # feature_movie: B x 1 x 200
    feature_movie = torch.tanh(self.fc_movie_combine(
        torch.cat([feature_mid.view(-1, 1, 32), feature_mtype.view(-1, 1, 32), feature_flattern_dropout], 2)
    ))

    # dot product of the two channel features: B x 1, the predicted rating
    output = torch.sum(feature_user * feature_movie, 2)
    return output, feature_user, feature_movie
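It works in two steps:
- Unpack the data: pull the individual tensors out of the sample using the keys of the user and movie dicts.
- Forward pass: nothing special here; the tensors are simply passed through the layers defined in __init__().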
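The last line of the forward pass combines the two channels with a per-sample dot product. A minimal sketch of just that step follows, using random tensors as stand-ins for the real channel outputs; the batch size of 4 is arbitrary, and the torch.bmm cross-check is added here for illustration and is not part of the model.

```python
import torch

torch.manual_seed(0)
batch_size, fc_size = 4, 200

# Random stand-ins for the two channel outputs; tanh keeps every entry in (-1, 1).
feature_user = torch.tanh(torch.randn(batch_size, 1, fc_size))
feature_movie = torch.tanh(torch.randn(batch_size, 1, fc_size))

# Element-wise product summed over the feature axis: one dot product per sample.
output = torch.sum(feature_user * feature_movie, 2)

# The same values computed as a batched matrix product, as a cross-check.
reference = torch.bmm(feature_user, feature_movie.transpose(1, 2)).view(batch_size, 1)

print(output.shape)
print(torch.allclose(output, reference, atol=1e-4))
```

Because the score is a plain dot product of the two 200-dimensional vectors, the same vectors can be reused after training to score any user against any movie without running the network again.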
4. Main program: main.py
Again, the code first:
def train(model, num_epochs=5, lr=0.0001):
    loss_function = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    datasets = MovieRankDataset(pkl_file='data.p')
    dataloader = DataLoader(datasets, batch_size=256, shuffle=True)

    writer = SummaryWriter()
    for epoch in range(num_epochs):
        loss_all = 0.0
        for i_batch, sample_batch in enumerate(dataloader):
            user_inputs = sample_batch['user_inputs']
            movie_inputs = sample_batch['movie_inputs']
            target = sample_batch['target'].to(device)

            model.zero_grad()

            tag_rank, _, _ = model(user_inputs, movie_inputs)

            loss = loss_function(tag_rank, target)
            if i_batch % 20 == 0:
                # use a global step so the curve stays continuous across epochs
                writer.add_scalar('data/loss', loss.item(), epoch * len(dataloader) + i_batch)
                print(loss.item())

            # .item() accumulates a Python float instead of keeping every batch's graph alive
            loss_all += loss.item()
            loss.backward()
            optimizer.step()
        print('Epoch {}:\t loss:{}'.format(epoch, loss_all))
    # export_scalars_to_json is provided by tensorboardX's SummaryWriter
    writer.export_scalars_to_json("./test.json")
    writer.close()


if __name__ == '__main__':
    model = rec_model(user_max_dict=user_max_dict, movie_max_dict=movie_max_dict, convParams=convParams)
    model = model.to(device)

    # train model
    # train(model=model, num_epochs=1)
    # torch.save(model.state_dict(), 'Params/model_params.pkl')

    # get user and movie feature
    # model.load_state_dict(torch.load('Params/model_params.pkl'))
    # from recInterface import saveMovieAndUserFeature
    # saveMovieAndUserFeature(model=model)

    # test recsys
    from recInterface import getKNNitem, getUserMostLike
    print(getKNNitem(itemID=100, K=10))
    print(getUserMostLike(uid=100))
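The flow is roughly as follows:
- Call model.py to build the recommendation model.
- Train the model with train(model, num_epochs=5, lr=0.0001):
  - choose the loss function
  - choose the Adam optimizer
  - build the dataloader
  - run training: back-propagate and update the parameters
- Save the model parameters.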
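The steps above can be sketched end to end on synthetic data. Everything in this example is a stand-in: toy_model is a single linear layer rather than the recommendation model, the data is random, and the parameters are saved to an in-memory buffer instead of a file under Params/.

```python
import io

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data standing in for (features, rating) pairs.
inputs = torch.randn(64, 8)
targets = inputs.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=16, shuffle=True)

toy_model = nn.Linear(8, 1)  # hypothetical model
loss_function = nn.MSELoss()
optimizer = optim.Adam(toy_model.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(20):
    running = 0.0
    for batch_inputs, batch_targets in loader:
        optimizer.zero_grad()    # clear gradients left over from the previous step
        loss = loss_function(toy_model(batch_inputs), batch_targets)
        loss.backward()          # back-propagate
        optimizer.step()         # update the parameters
        running += loss.item()   # accumulate a plain float, not a graph-holding tensor
    epoch_losses.append(running / len(loader))

print(epoch_losses[0], epoch_losses[-1])

# Save and restore the parameters via state_dict, here through an in-memory buffer.
buffer = io.BytesIO()
torch.save(toy_model.state_dict(), buffer)
buffer.seek(0)
restored = nn.Linear(8, 1)
restored.load_state_dict(torch.load(buffer))
```

Saving the state_dict rather than the whole model object means the model class has to be constructed first and the parameters loaded into it, which is the pattern the commented-out lines in main.py follow.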
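5. Recommendation interface: recInterface.py
Once training is finished, we can obtain feature vectors for movies and users. In the network diagram, the outputs of the two channels just before the final layer that joins them are the user and movie features; after training, the model returns them and we save them.
saveMovieAndUserFeature(model) in recInterface.py saves these two sets of features to Params/feature_data.pkl. It also saves dictionaries of users and movies, which are used to look up the attributes of a particular user or movie. Taking a user as an example, the format is {'uid': uid, 'gender': gender, 'age': age, 'job': job}.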
def saveMovieAndUserFeature(model):
    '''
    Save movie and user features to disk
    '''
    batch_size = 256

    datasets = MovieRankDataset(pkl_file='data.p')
    dataloader = DataLoader(datasets, batch_size=batch_size, shuffle=False, num_workers=4)

    # format: {id(int) : feature(numpy array)}
    user_feature_dict = {}
    movie_feature_dict = {}
    movies = {}
    users = {}
    with torch.no_grad():
        for i_batch, sample_batch in enumerate(dataloader):
            user_inputs = sample_batch['user_inputs']
            movie_inputs = sample_batch['movie_inputs']

            # B x 1 x 200 = 256 x 1 x 200
            _, feature_user, feature_movie = model(user_inputs, movie_inputs)

            # B x 1 x 200 = 256 x 1 x 200
            feature_user = feature_user.cpu().numpy()
            feature_movie = feature_movie.cpu().numpy()

            for i in range(user_inputs['uid'].shape[0]):
                uid = user_inputs['uid'][i]  # uid
                gender = user_inputs['gender'][i]
                age = user_inputs['age'][i]
                job = user_inputs['job'][i]

                mid = movie_inputs['mid'][i]  # mid
                mtype = movie_inputs['mtype'][i]
                mtext = movie_inputs['mtext'][i]

                # Use plain ints as keys everywhere: tensors hash by identity, so a
                # tensor key would never match a repeated id.
                uid_key, mid_key = uid.item(), mid.item()

                if uid_key not in users:
                    users[uid_key] = {'uid': uid, 'gender': gender, 'age': age, 'job': job}
                if mid_key not in movies:
                    movies[mid_key] = {'mid': mid, 'mtype': mtype, 'mtext': mtext}
                if uid_key not in user_feature_dict:
                    user_feature_dict[uid_key] = feature_user[i]
                if mid_key not in movie_feature_dict:
                    movie_feature_dict[mid_key] = feature_movie[i]

            print('Processed: {} samples'.format((i_batch + 1) * batch_size))

    feature_data = {'feature_user': user_feature_dict, 'feature_movie': movie_feature_dict}
    dict_user_movie = {'user': users, 'movie': movies}
    with open('Params/feature_data.pkl', 'wb') as f:
        pkl.dump(feature_data, f)
    with open('Params/user_movie_dict.pkl', 'wb') as f:
        pkl.dump(dict_user_movie, f)
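recInterface.py also provides other helper functions:
- getKNNitem(itemID, itemName='movie', K=1): returns the K nearest neighbours of the item with the given id; with itemName='user' it returns the K nearest users instead. The logic is simple:
  - load the locally saved user or movie feature set selected by itemName
  - look up the feature vector of the target item by itemID
  - compute the cosine similarity between that vector and the vectors of all other users/movies
  - sort and return the top K users/movies
- getUserMostLike(uid): returns the favourite movie of the user whose id is uid. The process is just as easy to follow:
  - take the dot product of the user feature for uid with every movie feature in turn
  - treat each dot product as the user's rating of that movie, and sort the ratings
  - return the highest-rated movie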
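Both lookups can be sketched in a few lines of plain Python. This is an illustration of the two procedures described above, not the code in recInterface.py: the function names k_nearest and most_liked, the choice to exclude the query item from its own neighbours, and the 2-dimensional toy vectors are all assumptions made for the example (the saved features are 200-dimensional).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def k_nearest(item_id, features, k=1):
    """Return the k ids whose vectors are most cosine-similar to item_id's vector."""
    target = features[item_id]
    scored = [(cosine_similarity(target, vec), other)
              for other, vec in features.items() if other != item_id]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]


def most_liked(user_vec, movie_features, top_n=1):
    """Rank movies by their dot product with the user vector, highest first."""
    scored = [(sum(u * m for u, m in zip(user_vec, vec)), mid)
              for mid, vec in movie_features.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mid for _, mid in scored[:top_n]]


# Made-up feature vectors keyed by movie id.
toy_movies = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [-1.0, 0.0]}
toy_user = [0.2, 0.8]

neighbours = k_nearest(1, toy_movies, k=2)
favourites = most_liked(toy_user, toy_movies, top_n=2)
print(neighbours)
print(favourites)
```

Cosine similarity ignores vector length, which suits "find similar items", while the raw dot product matches the rating the model was trained to predict, which suits "find what this user would rate highest".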
6. Notes
If you have any questions, feel free to raise them in the issues of the GitHub repository.