Building a Recommendation Engine with Mahout and Elasticsearch



Contents

  • Software
  • Steps
  • Controlling Relevance
  • Summary
  • References

This article shows how to build a recommendation engine with Apache Mahout, the MapR Sandbox for Hadoop, and Elasticsearch, using very little code.

This tutorial will give step-by-step instructions on how to:

  • Use the movie rating data from http://grouplens.org/datasets/movielens/
  • Build and train a machine-learning model using Apache Mahout's collaborative filtering
  • Use Elasticsearch's search capabilities to simplify development of the recommender


Software


This tutorial runs on the MapR Sandbox for Hadoop. It also requires that Elasticsearch and Mahout are installed on the Sandbox.

Steps


Step 1: Index the movie metadata into Elasticsearch


In Elasticsearch, all of a document's fields are indexed by default. The simplest document has a single level of JSON structure. Documents live in an index, and a document's type tells Elasticsearch how to interpret the document's fields.

You can think of an Elasticsearch index as a database instance in a relational database, a type as a table, fields as the column definitions (although a field means rather more than that in Elasticsearch), and a document as a row in the table.

In this example the document type is film, with the following fields: the movie ID (id), title (title), release year (year), genres (genre), indicators (indicators), and the number of entries in the indicators array (numFields):

{
 "id": "65006",
 "title": "Impulse",
 "year": "2008",
 "genre": ["Mystery","Thriller"],
 "indicators": ["154","272",”154","308", "535", "583", "593", "668", "670", "680", "702", "745"],
 "numFields": 12
}

You communicate with Elasticsearch through its RESTful API on port 9200, for example with the curl command. See the Elasticsearch REST interface documentation and the Elasticsearch 101 tutorial:

curl -X<VERB> 'http://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'

You can define a document type with the put mapping command of Elasticsearch's REST API. The request below creates a mapping called film in the bigmovie index. The mapping defines a numFields field of type integer. By default all fields are stored and indexed, and integers are no exception.

curl -XPUT 'http://localhost:9200/bigmovie' -d '
{
  "mappings": {
    "film" : {
      "properties" : {
        "numFields" : { "type" :   "integer" }
      }
    }
  }
}'

The movie information is in the movies.dat file. Each line of the file describes one movie, with the following fields:

MovieID::Title::Genres

For example:

65006::Impulse (2008)::Mystery|Thriller

Figure 1: The film Impulse (2008), genres Mystery/Thriller

The Python script below converts the movies.dat data into the JSON format needed to import it into Elasticsearch:

import re
import json

with open('movies.dat', 'rb') as csv_file:
    content = csv_file.readlines()
    for line in content:
        # Split "MovieID::Title (Year)::Genre1|Genre2" into its three fields
        fixed = re.sub("::", "\t", line).rstrip().split("\t")
        if len(fixed) == 3:
            # Strip quotes and the trailing " (Year)" from the title
            title = re.sub(" \(.*\)$", "", re.sub('"', '', fixed[1]))
            genre = fixed[2].split('|')
            # Emit a bulk-API action line followed by the document itself
            print '{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fixed[0]
            print '{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }' % (fixed[0], title, fixed[1][-5:-1], json.dumps(genre))

Run the Python script and write the result to index.json:

$ python index.py > index.json

This produces the format Elasticsearch expects:

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }

For each movie, one line specifies the index, type, and id to create, and the next line carries the movie data. This is the format of an Elasticsearch bulk import.

The Elasticsearch bulk API performs many index operations in a single request and can mix different actions (such as create, index, update, and delete). The following command bulk loads the contents of index.json into Elasticsearch:

curl -s -XPOST localhost:9200/_bulk --data-binary @index.json; echo
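As an optional sanity check (not part of the original walkthrough), you can count how many film documents were indexed using the standard _count API:

curl 'http://localhost:9200/bigmovie/film/_count?pretty'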

After the movie metadata is loaded, you can query it with the REST API. You can also issue queries with Sense, a Chrome plugin for Elasticsearch (also bundled with Kibana 4).

For example, here is how to retrieve the film with id 1237:
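A minimal equivalent request with curl (a sketch, assuming the bigmovie index built above; Elasticsearch returns the document's stored _source):

curl -XGET 'http://localhost:9200/bigmovie/film/1237?pretty'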

Step 2: Use Mahout to create movie indicators from the user rating data


The ratings are in the ratings.dat file. Each line records one user's rating of one movie, in the following format:

UserID::MovieID::Rating::Timestamp

For example:

71567::2294::5::912577968
71567::2338::2::912578016

The ratings.dat file uses "::" as the field separator, which must be converted to tabs before Mahout can read it. You can replace each "::" with a tab using sed:

sed -i 's/::/\t/g' ratings.dat

This command opens the file, replaces each "::" with "\t", and saves the result in place. Updates are only supported with MapR NFS, so this command probably won't work on other NFS-on-Hadoop implementations. MapR Direct Access NFS allows files to be modified (it supports random reads and writes) and accessed by mounting the Hadoop cluster over NFS.
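On distributions without such NFS write support, one workaround (a sketch, assuming the file lives under /user/user01/mlinput as in the Mahout command below) is to copy the file to the local filesystem, edit it there, and copy it back:

hadoop fs -get /user/user01/mlinput/ratings.dat .
sed -i 's/::/\t/g' ratings.dat
hadoop fs -rm /user/user01/mlinput/ratings.dat
hadoop fs -put ratings.dat /user/user01/mlinput/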

The sed command produces the following format, which Mahout can take as input:

71567    2294    5    912580553
71567    2338    2    912580553

The general input format is userID itemID rating timestamp, i.e. "user, item, rating"; the timestamp is not used in this example.

Launch the Mahout item similarity (itemsimilarity) job with the following command:

 mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir /user/user01/temp

The argument "-s SIMILARITY_LOGLIKELIHOOD" tells the recommender to use the Log Likelihood Ratio (LLR) method for determining which items co-occur anomalously often, and thus which co-occurrences can be used as indicators of preference. The similarity threshold defaults to 0.9; it can be adjusted for the use case with the --threshold parameter, which discards item pairs with lower similarity (the default is a fine choice). Mahout computes the recommendations by running a series of Hadoop MapReduce jobs, whose final output lands in the /user/user01/mloutput directory. The output files have the following format:

64957   64997   0.9604835425701245
64957   65126   0.919355104432831
64957   65133   0.9580439772229588

The general format is itemID1 itemID2 similarity, i.e. "item 1, item 2, their similarity".
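If you want fewer but stronger indicator pairs, you can rerun the job with an explicit threshold, as discussed above. A sketch (0.95 is an arbitrary illustrative value; --threshold is a standard option of the itemsimilarity job):

 mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --threshold 0.95 \
  --tempDir /user/user01/temp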

Step 3: Add the movie indicators to the film documents in Elasticsearch


Next, we take the indicators from the output file above and add them to the film documents in Elasticsearch. For example, a film's indicators go into its indicators field:

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery","Thriller"],
  "indicators": ["1076", "1936", "2057", "2204"],
  "numFields": 4
}

The table on the left in Figure 2 shows which indicators each document contains; the table on the right shows which documents contain a given indicator:

Figure 2: Documents and indicators

In this example, a search for films whose indicators contain both 1237 AND 551 would return the film with id 8298. A search for indicator 1237 OR 551 would return the films with ids 8298, 3, and 64418.
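A match query treats its terms as an OR by default; requiring both indicators (the AND case above) takes an explicit operator. A minimal sketch of both variants against the bigmovie index:

# OR semantics (the default): films 8298, 3, and 64418 in this example
curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{ "query": { "match": { "indicators": { "query": "1237 551", "operator": "or" } } } }'

# AND semantics: only film 8298 in this example
curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{ "query": { "match": { "indicators": { "query": "1237 551", "operator": "and" } } } }'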

The script below reads Mahout's output file part-r-00000, builds an indicator array for each film, and prints a JSON file that updates the indicators field of the film type in the bigmovie index.

import csv
import json

### Read the output from Mahout and collect each film's indicators ###
with open('/user/user01/mloutput/part-r-00000','rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    old_id = ""
    indicators = []
    update = {"update" : {"_id":""}}
    doc = {"doc" : {"indicators":[], "numFields":0}}
    for row in csv_reader:
        id = row[0]
        if (id != old_id and old_id != ""):
            # A new film starts: emit the bulk update action for the previous one
            update["update"]["_id"] = old_id
            doc["doc"]["indicators"] = indicators
            doc["doc"]["numFields"] = len(indicators)
            print(json.dumps(update))
            print(json.dumps(doc))
            indicators = [row[1]]
        else:
            indicators.append(row[1])
        old_id = id
    # Emit the update for the last film in the file
    if old_id != "":
        update["update"]["_id"] = old_id
        doc["doc"]["indicators"] = indicators
        doc["doc"]["numFields"] = len(indicators)
        print(json.dumps(update))
        print(json.dumps(doc))

Run the update.py script and write its output to update.json:

$ python update.py > update.json

The script creates a file with contents like the following:

{"update": {"_id": "1"}}
{"doc": {"indicators": ["75", "118", "494", "512", "609", "626", "631", "634", "648", "711", "761", "810", "837", "881", "910", "1022", "1030", "1064", "1301", "1373", "1390", "1588", "1806", "2053", "2083", "2090", "2096", "2102", "2286", "2375", "2378", "2641", "2857", "2947", "3147", "3429", "3438", "3440", "3471", "3483", "3712", "3799", "3836", "4016", "4149", "4544", "4545", "4720", "4732", "4901", "5004", "5159", "5309", "5313", "5323", "5419", "5574", "5803", "5841", "5902", "5940", "6156", "6208", "6250", "6383", "6618", "6713", "6889", "6890", "6909", "6944", "7046", "7099", "7281", "7367", "7374", "7439", "7451", "7980", "8387", "8666", "8780", "8819", "8875", "8974", "9009", "25947", "27721", "31660", "32300", "33646", "40339", "42725", "45517", "46322", "46559", "46972", "47384", "48150", "49272", "55668", "63808"], "numFields": 102}}
{"update": {"_id": "2"}}
{"doc": {"indicators": ["15", "62", "153", "163", "181", "231", "239", "280", "333", "355", "374", "436", "473", "485", "489", "502", "505", "544", "546", "742", "829", "1021", "1474", "1562", "1588", "1590", "1713", "1920", "1967", "2002", "2012", "2045", "2115", "2116", "2139", "2143", "2162", "2296", "2338", "2399", "2408", "2447", "2616", "2793", "2798", "2822", "3157", "3243", "3327", "3438", "3440", "3477", "3591", "3614", "3668", "3802", "3869", "3968", "3972", "4090", "4103", "4247", "4370", "4467", "4677", "4686", "4846", "4967", "4980", "5283", "5313", "5810", "5843", "5970", "6095", "6383", "6385", "6550", "6764", "6863", "6881", "6888", "6952", "7317", "8424", "8536", "8633", "8641", "26870", "27772", "31658", "32954", "33004", "34334", "34437", "39419", "40278", "42011", "45210", "45447", "45720", "48142", "50347", "53464", "55553", "57528"], "numFields": 106}}

On the command line, use curl to send an Elasticsearch REST bulk request with update.json as input; this updates the indicators fields:

$ curl -s -XPOST localhost:9200/bigmovie/film/_bulk --data-binary @update.json; echo

Step 4: Query the indicators field of the film index to make recommendations


Now you can make recommendations by querying the indicators field of the film documents. For example, if someone likes films 1237 and 551 and you want to recommend similar films, you can run the Elasticsearch query below. It returns films whose indicators array contains 1237 or 551 (1237 = The Seventh Seal, 551 = The Nightmare Before Christmas):

curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{
  "query": {
    "function_score": {
      "query": {
         "bool": {
           "must": [ { "match": { "indicators":"1237 551"} } ],
           "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
         }
      },
      "functions":[ {"random_score": {"seed":"48" } } ],
      "score_mode":"sum"
    }
  },
  "fields":["_id","title","genre"],
  "size":"8"
}'

The query above asks for films with indicator 1237 or 551 that are not themselves films 1237 or 551. Running the query in the Sense plugin returns recommendations including "A Man Named Pearl" (a documentary) and "Used People".

Controlling Relevance


Full-text search engines order results by relevance; Elasticsearch reports a document's relevance score in the _score field. function_score lets you modify that score at query time, and random_score generates a score by hashing a given seed. In the query shown below, the random_score function injects variability into the results in order to implement dithering:

  "query": {
    "function_score": {
      "query": {
         "bool": {
           "must": [ { "match": { "indicators":"1237 551"} } ],
           "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
         }
      },
      "functions":[ {"random_score": {"seed":"48" } } ],
      "score_mode":"sum"
    }
  }

Relevance dithering deliberately includes some lower-ranked, less relevant results in order to broaden the training data fed back to the recommendation engine. Without dithering, tomorrow's training data only teaches the model what it already knew today. With dithering, if the model's answers are already close to good ones, the added variation helps it find the genuinely right ones. Effective dithering sacrifices a little accuracy today to improve tomorrow's training data, and with it future performance (the accuracy of the algorithm counts as performance here). In other words, to make future recommendations more accurate, you reduce the degree to which the past constrains the future.
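To make the mechanics concrete, here is a minimal dithering sketch in Python (an illustration under stated assumptions, not part of the original tutorial): each result is scored by its log rank plus Gaussian noise, so top results usually stay near the top while deeper results occasionally surface.

import math
import random

def dither(ranked_ids, epsilon=0.7, seed=None):
    """Reorder a ranked result list by noisy log-rank.

    A lower original rank keeps a better (smaller) expected score,
    but the Gaussian noise occasionally promotes deeper results.
    """
    rng = random.Random(seed)  # a fixed seed per session keeps paging stable
    scored = [(math.log(rank + 1) + rng.gauss(0, epsilon), movie_id)
              for rank, movie_id in enumerate(ranked_ids)]
    return [movie_id for _, movie_id in sorted(scored)]

# Example: the same result list dithered with two different seeds
results = ["8298", "3", "64418", "1270", "2011"]
print(dither(results, seed=48))
print(dither(results, seed=49))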

Summary



We showed in this tutorial how to use Apache Mahout and Elasticsearch with the MapR Sandbox to build a basic recommendation engine. You can go beyond a basic recommender and get even better results with a few simple additions to the design to add cross recommendation of items, which leverages a variety of interactions and items for making recommendations. You can find more information about these technologies here:

References



To learn more about the components and logic of a recommendation engine, see "An Inside Look at the Components of a Recommendation Engine", which describes the recommender architecture, collaborative filtering with Mahout, and the Elasticsearch search engine in detail.


