Comments are welcome. Please credit the source when reposting.
This post shows how to use the gensim toolkit to learn doc2vec vector representations for documents that carry one or more labels.
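Use case: sometimes a document is described not only by its text but also by additional label information. In product recommendation, for example, each product can be treated as a document. Product vectors can be learned from the product descriptions alone, but we often also want to take information such as the product category into account. The simplest approach is to learn the product vector from the text and then append one extra dimension that encodes the category, but a single dimension represents category information poorly. gensim's doc2vec offers a more reasonable alternative: the product labels (such as the category) take part in training the product vectors, by way of gensim's LabeledSentence.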
Input file format for LabeledSentence: each line has the form <labels, words>. There can be several labels, separated by tabs, and the words are separated by spaces, e.g. <id category I like my cat demon>.
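The line format described above can be parsed with a few lines of plain Python. This is only a sketch: the function name parse_labeled_line is hypothetical, and it assumes that the last tab-separated field holds the words and every field before it is a label.

```python
def parse_labeled_line(line):
    """Split one `label<TAB>label<TAB>...<TAB>words` line into (labels, words).

    Assumption: the last tab-separated field holds the space-separated words
    and every field before it is a label.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least one label and a words field")
    labels = fields[:-1]
    words = fields[-1].split()
    return labels, words


labels, words = parse_labeled_line("id\tcategory\tI like my cat demon\n")
print(labels)  # ['id', 'category']
print(words)   # ['I', 'like', 'my', 'cat', 'demon']
```

Stripping the trailing newline before splitting keeps it from sticking to the last word or label.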
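The output is a vector for every entry in the vocabulary. The vectors of a product's labels (its id and its category) can then be concatenated and used as the product's vector representation.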
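The concatenation step can be sketched as follows. The helper product_vector, the label names, and the three-dimensional toy vectors are all made up for illustration; with a trained model the vectors would come from the model instead of a hand-written dict.

```python
def product_vector(label_vectors, product_id, category):
    """Concatenate the id vector and the category vector of one product."""
    return list(label_vectors[product_id]) + list(label_vectors[category])


# Toy vectors, invented for this example.
label_vectors = {
    "B000123": [0.1, 0.2, 0.3],
    "Books": [0.4, 0.5, 0.6],
}

vec = product_vector(label_vectors, "B000123", "Books")
print(vec)       # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(len(vec))  # 6
```

The resulting vector has twice the training dimension, since both label vectors come from the same model.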
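I wrote an example, for reference only. Training must be run with min_count=1, otherwise the vocabulary is incomplete; this small problem cost me a whole day: Doc2Vec(sentences, size = 100, window = 5, min_count=1)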
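Why min_count matters here can be illustrated without gensim. gensim's min_count drops tokens whose total frequency is below the threshold, and its default is 5. The sketch below is a simplified stand-in for that pruning, not gensim's code, and it assumes (consistent with the observation above) that labels are counted like ordinary words in the pre-update API. A product id usually occurs on one line only, so any threshold above 1 removes it.

```python
from collections import Counter


def build_vocab(token_lists, min_count):
    """Keep only tokens that occur at least min_count times in total."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token for token, n in counts.items() if n >= min_count}


# Each list holds the labels (id, category) followed by the words of one line.
corpus = [
    ["B000123", "Books", "i", "like", "my", "cat"],
    ["B000456", "Books", "my", "cat", "sleeps"],
]

print(sorted(build_vocab(corpus, min_count=2)))       # ['Books', 'cat', 'my']
print("B000123" in build_vocab(corpus, min_count=1))  # True
```

With min_count=2 both product ids disappear from the vocabulary, which matches the "incomplete vocabulary" symptom.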
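Note: the first example below uses the API from before gensim was updated. After the update the labels attribute no longer exists and is replaced by tags, and the target vectors are stored in docvecs instead of vocab. Example 2 shows the usage after the update.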
Example 1: before the gensim update.
# -*- coding: UTF-8 -*-
import logging

from gensim.models.doc2vec import Doc2Vec, LabeledSentence

# Labels seen while reading the corpus; used later to pick the label vectors out of the vocab.
asin = set()
category = set()


class LabeledLineSentence(object):
    """Yield one LabeledSentence per line of a tab-separated file: id, category, text."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'r') as infile:
            data = infile.readlines()
        # print("length: %d" % len(data))
        for line in data:
            fields = line.rstrip('\n').split('\t')
            asin.add(fields[0])
            category.add(fields[1])
            # Pre-update API: the keyword is labels.
            yield LabeledSentence(words=fields[2].split(), labels=[fields[0], fields[1]])


print('success')
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = LabeledLineSentence('product_bpr_train.txt')
# min_count=1 is required, otherwise the vocabulary is incomplete.
model = Doc2Vec(sentences, size=100, window=5, min_count=1)
model.save('product_bpr_model.txt')
print('success1')

# for token in model.vocab:
#     print(token)
print(len(model.vocab))

# Pre-update API: the label vectors sit in model.vocab next to the word vectors.
outid = open('product_bpr_id_vector.txt', 'w')
outcate = open('product_bpr_cate_vector.txt', 'w')
for token in model.vocab:
    if token in asin:
        outid.write(token + '\t')
        for lv in model[token]:
            outid.write(str(lv) + " ")
        outid.write('\n')
    if token in category:
        outcate.write(token + '\t')
        for lv in model[token]:
            outcate.write(str(lv) + " ")
        outcate.write('\n')
outid.close()
outcate.close()
Example 2: after the gensim update.
# -*- coding: UTF-8 -*-
import logging

from gensim.models.doc2vec import Doc2Vec, LabeledSentence

# Tags seen while reading the corpus; used later to look the vectors up in model.docvecs.
asin = set()
category = set()


class LabeledLineSentence(object):
    """Yield one LabeledSentence per line of a tab-separated file: id, category, text."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'r') as infile:
            data = infile.readlines()
        print("length: %d" % len(data))
        for line in data:
            # Strip the newline first so it does not stick to the last word.
            fields = line.rstrip('\n').split('\t')
            asin.add(fields[0])
            category.add(fields[1])
            # Updated API: the keyword is tags, not labels.
            yield LabeledSentence(words=fields[2].split(), tags=[fields[0], fields[1]])


print('success')
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = LabeledLineSentence('product_bpr_test_train.txt')
model = Doc2Vec(sentences, size=50, window=5, min_count=1)
model.save('product_bpr_model50.txt')
print('success1')
print("doc2vecs length: %d" % len(model.docvecs))

# Updated API: the tag vectors are stored in model.docvecs, not in model.vocab.
outid = open('product_bpr_id_vector50.txt', 'w')
outcate = open('product_bpr_cate_vector50.txt', 'w')
for product_id in asin:
    outid.write(product_id + "\t")
    for lv in model.docvecs[product_id]:
        outid.write(str(lv) + " ")
    outid.write("\n")
for cate in category:
    outcate.write(cate + '\t')
    for lv in model.docvecs[cate]:
        outcate.write(str(lv) + " ")
    outcate.write('\n')
outid.close()
outcate.close()
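The vector files written by the examples hold one label per line, then a tab, then the space-separated components. A minimal sketch for reading such a file back into a dict follows. The function name load_vectors is hypothetical, and the demo writes a small temporary file with invented values instead of using real training output.

```python
import os
import tempfile


def load_vectors(path):
    """Read `label<TAB>v1 v2 ...` lines into a dict of label -> list of floats."""
    vectors = {}
    with open(path, "r") as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            label, _, values = line.partition("\t")
            vectors[label] = [float(v) for v in values.split()]
    return vectors


# Demo file in the same layout as the output files (values are made up).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("B000123\t0.1 0.2 0.3 \n")
    tmp.write("Books\t0.4 0.5 0.6 \n")
    demo_path = tmp.name

vectors = load_vectors(demo_path)
os.remove(demo_path)
print(vectors["Books"])  # [0.4, 0.5, 0.6]
```

Calling split() with no argument ignores the trailing space that the writer loop leaves at the end of each line.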
References:
http://rare-technologies.com/doc2vec-tutorial/
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
http://radimrehurek.com/gensim/models/doc2vec.html#blog