Building a Recommendation Engine with Mahout and Elasticsearch



Contents

  • Software
  • Steps
  • Controlling Relevance
  • Summary
  • References

This article shows how to build a recommendation engine with Apache Mahout, the MapR Sandbox for Hadoop, and Elasticsearch, using very little code.

This tutorial will give step-by-step instructions on how to:

  • Use the movie rating data from http://grouplens.org/datasets/movielens/
  • Build and train a machine-learning model using Apache Mahout's collaborative filtering
  • Use Elasticsearch's search capabilities to simplify development of the recommender


Software


This tutorial runs on the MapR Sandbox for Hadoop. It also requires that Elasticsearch and Mahout are installed on the Sandbox.

Steps


Step 1: Index the movie metadata into Elasticsearch


In Elasticsearch, all of a document's fields are indexed by default. The simplest document has a single level of JSON structure. Documents live in an index, and a document's type tells Elasticsearch how to interpret the document's fields.

You can think of an Elasticsearch index as a database instance in a relational database, a type as a table, fields as the column definitions (although a field means rather more than that in Elasticsearch), and a document as a row in the table.

In this example the document type is film, with the following fields: the movie ID (id), title (title), release year (year), genres (genre), indicators (indicators), and the number of entries in the indicators array (numFields):

{
 "id": "65006",
 "title": "Impulse",
 "year": "2008",
 "genre": ["Mystery","Thriller"],
 "indicators": ["154","272",”154","308", "535", "583", "593", "668", "670", "680", "702", "745"],
 "numFields": 12
}

You communicate with Elasticsearch through its RESTful API on port 9200, for example with the curl command. See the Elasticsearch REST interface documentation and the Elasticsearch 101 tutorial:

curl -X<VERB> 'http://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'

You can define a document type with the put mapping command of Elasticsearch's REST API. The request below creates a mapping called film in the bigmovie index. The mapping defines a numFields field of type integer. By default all fields are stored and indexed, and integers are no exception.

curl -XPUT 'http://localhost:9200/bigmovie' -d '
{
  "mappings": {
    "film" : {
      "properties" : {
        "numFields" : { "type" :   "integer" }
      }
    }
  }
}'

The movie information is in the movies.dat file. Each line of the file describes one movie, with the following fields:

MovieID::Title::Genres

For example:

65006::Impulse (2008)::Mystery|Thriller

Figure 1: The film Impulse (2008), genres Mystery/Thriller

The Python script below converts the movies.dat data into the JSON format needed to import it into Elasticsearch:

import re
import json

with open('movies.dat', 'rb') as csv_file:
    content = csv_file.readlines()
    for line in content:
        # Split "MovieID::Title (Year)::Genre1|Genre2" into its three fields
        fixed = re.sub("::", "\t", line).rstrip().split("\t")
        if len(fixed) == 3:
            # Strip quotes and the trailing " (Year)" from the title
            title = re.sub(" \(.*\)$", "", re.sub('"', '', fixed[1]))
            genre = fixed[2].split('|')
            # Emit a bulk-API action line followed by the document itself
            print '{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fixed[0]
            print '{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }' % (fixed[0], title, fixed[1][-5:-1], json.dumps(genre))

Run the Python script and write the result to index.json:

$ python index.py > index.json

This produces the format Elasticsearch expects:

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }

For each movie, one line specifies the index, type, and id to create, and the next line carries the movie data. This is the format of an Elasticsearch bulk import.

The Elasticsearch bulk API performs many index operations in a single request and can mix different actions (such as create, index, update, and delete). The following command bulk loads the contents of index.json into Elasticsearch:

curl -s -XPOST localhost:9200/_bulk --data-binary @index.json; echo
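As an optional sanity check (not part of the original walkthrough), you can count how many film documents were indexed using the standard _count API:

curl 'http://localhost:9200/bigmovie/film/_count?pretty'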

After the movie metadata is loaded, you can query it with the REST API. You can also issue queries with Sense, a Chrome plugin for Elasticsearch (also bundled with Kibana 4).

For example, here is how to retrieve the film with id 1237:
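A minimal equivalent request with curl (a sketch, assuming the bigmovie index built above; Elasticsearch returns the document's stored _source):

curl -XGET 'http://localhost:9200/bigmovie/film/1237?pretty'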

Step 2: Use Mahout to create movie indicators from the user rating data


The ratings are in the ratings.dat file. Each line records one user's rating of one movie, in the following format:

UserID::MovieID::Rating::Timestamp

For example:

71567::2294::5::912577968
71567::2338::2::912578016

The ratings.dat file uses "::" as the field separator, which must be converted to tabs before Mahout can read it. You can replace each "::" with a tab using sed:

sed -i 's/::/\t/g' ratings.dat

This command opens the file, replaces each "::" with "\t", and saves the result in place. Updates are only supported with MapR NFS, so this command probably won't work on other NFS-on-Hadoop implementations. MapR Direct Access NFS allows files to be modified (it supports random reads and writes) and accessed by mounting the Hadoop cluster over NFS.
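On distributions without such NFS write support, one workaround (a sketch, assuming the file lives under /user/user01/mlinput as in the Mahout command below) is to copy the file to the local filesystem, edit it there, and copy it back:

hadoop fs -get /user/user01/mlinput/ratings.dat .
sed -i 's/::/\t/g' ratings.dat
hadoop fs -rm /user/user01/mlinput/ratings.dat
hadoop fs -put ratings.dat /user/user01/mlinput/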

The sed command produces the following format, which Mahout can take as input:

71567    2294    5    912580553
71567    2338    2    912580553

The general input format is userID itemID rating timestamp, i.e. "user, item, rating"; the timestamp is not used in this example.

Launch the Mahout item similarity (itemsimilarity) job with the following command:

 mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir /user/user01/temp

The argument "-s SIMILARITY_LOGLIKELIHOOD" tells the recommender to use the Log Likelihood Ratio (LLR) method for determining which items co-occur anomalously often, and thus which co-occurrences can be used as indicators of preference. The similarity threshold defaults to 0.9; it can be adjusted for the use case with the --threshold parameter, which discards item pairs with lower similarity (the default is a fine choice). Mahout computes the recommendations by running a series of Hadoop MapReduce jobs, whose final output lands in the /user/user01/mloutput directory. The output files have the following format:

64957   64997   0.9604835425701245
64957   65126   0.919355104432831
64957   65133   0.9580439772229588

The general format is itemID1 itemID2 similarity, i.e. "item 1, item 2, their similarity".
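If you want fewer but stronger indicator pairs, you can rerun the job with an explicit threshold, as discussed above. A sketch (0.95 is an arbitrary illustrative value; --threshold is a standard option of the itemsimilarity job):

 mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --threshold 0.95 \
  --tempDir /user/user01/temp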

Step 3: Add the movie indicators to the film documents in Elasticsearch


Next, we take the indicators from the output file above and add them to the film documents in Elasticsearch. For example, a film's indicators go into its indicators field:

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery","Thriller"],
  "indicators": ["1076", "1936", "2057", "2204"],
  "numFields": 4
}

The table on the left in Figure 2 shows which indicators each document contains; the table on the right shows which documents contain a given indicator:

Figure 2: Documents and indicators

In this example, a search for films whose indicators contain both 1237 AND 551 would return the film with id 8298. A search for indicator 1237 OR 551 would return the films with ids 8298, 3, and 64418.
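A match query treats its terms as an OR by default; requiring both indicators (the AND case above) takes an explicit operator. A minimal sketch of both variants against the bigmovie index:

# OR semantics (the default): films 8298, 3, and 64418 in this example
curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{ "query": { "match": { "indicators": { "query": "1237 551", "operator": "or" } } } }'

# AND semantics: only film 8298 in this example
curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{ "query": { "match": { "indicators": { "query": "1237 551", "operator": "and" } } } }'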

The script below reads Mahout's output file part-r-00000, builds an indicator array for each film, and prints a JSON file that updates the indicators field of the film type in the bigmovie index.

import csv
import json

### Read the output from Mahout and collect each film's indicators ###
with open('/user/user01/mloutput/part-r-00000','rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    old_id = ""
    indicators = []
    update = {"update" : {"_id":""}}
    doc = {"doc" : {"indicators":[], "numFields":0}}
    for row in csv_reader:
        id = row[0]
        if (id != old_id and old_id != ""):
            # A new film starts: emit the bulk update action for the previous one
            update["update"]["_id"] = old_id
            doc["doc"]["indicators"] = indicators
            doc["doc"]["numFields"] = len(indicators)
            print(json.dumps(update))
            print(json.dumps(doc))
            indicators = [row[1]]
        else:
            indicators.append(row[1])
        old_id = id
    # Emit the update for the last film in the file
    if old_id != "":
        update["update"]["_id"] = old_id
        doc["doc"]["indicators"] = indicators
        doc["doc"]["numFields"] = len(indicators)
        print(json.dumps(update))
        print(json.dumps(doc))

Run the update.py script and write its output to update.json:

$ python update.py > update.json

The script creates a file with contents like the following:

{"update": {"_id": "1"}}
{"doc": {"indicators": ["75", "118", "494", "512", "609", "626", "631", "634", "648", "711", "761", "810", "837", "881", "910", "1022", "1030", "1064", "1301", "1373", "1390", "1588", "1806", "2053", "2083", "2090", "2096", "2102", "2286", "2375", "2378", "2641", "2857", "2947", "3147", "3429", "3438", "3440", "3471", "3483", "3712", "3799", "3836", "4016", "4149", "4544", "4545", "4720", "4732", "4901", "5004", "5159", "5309", "5313", "5323", "5419", "5574", "5803", "5841", "5902", "5940", "6156", "6208", "6250", "6383", "6618", "6713", "6889", "6890", "6909", "6944", "7046", "7099", "7281", "7367", "7374", "7439", "7451", "7980", "8387", "8666", "8780", "8819", "8875", "8974", "9009", "25947", "27721", "31660", "32300", "33646", "40339", "42725", "45517", "46322", "46559", "46972", "47384", "48150", "49272", "55668", "63808"], "numFields": 102}}
{"update": {"_id": "2"}}
{"doc": {"indicators": ["15", "62", "153", "163", "181", "231", "239", "280", "333", "355", "374", "436", "473", "485", "489", "502", "505", "544", "546", "742", "829", "1021", "1474", "1562", "1588", "1590", "1713", "1920", "1967", "2002", "2012", "2045", "2115", "2116", "2139", "2143", "2162", "2296", "2338", "2399", "2408", "2447", "2616", "2793", "2798", "2822", "3157", "3243", "3327", "3438", "3440", "3477", "3591", "3614", "3668", "3802", "3869", "3968", "3972", "4090", "4103", "4247", "4370", "4467", "4677", "4686", "4846", "4967", "4980", "5283", "5313", "5810", "5843", "5970", "6095", "6383", "6385", "6550", "6764", "6863", "6881", "6888", "6952", "7317", "8424", "8536", "8633", "8641", "26870", "27772", "31658", "32954", "33004", "34334", "34437", "39419", "40278", "42011", "45210", "45447", "45720", "48142", "50347", "53464", "55553", "57528"], "numFields": 106}}

On the command line, use curl to send an Elasticsearch REST bulk request with update.json as input; this updates the indicators fields:

$ curl -s -XPOST localhost:9200/bigmovie/film/_bulk --data-binary @update.json; echo

Step 4: Query the indicators field of the film index to make recommendations


Now you can make recommendations by querying the indicators field of the film documents. For example, if someone likes films 1237 and 551 and you want to recommend similar films, you can run the Elasticsearch query below. It returns films whose indicators array contains 1237 or 551 (1237 = The Seventh Seal, 551 = The Nightmare Before Christmas):

curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{
  "query": {
    "function_score": {
      "query": {
         "bool": {
           "must": [ { "match": { "indicators":"1237 551"} } ],
           "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
         }
      },
      "functions":[ {"random_score": {"seed":"48" } } ],
      "score_mode":"sum"
    }
  },
  "fields":["_id","title","genre"],
  "size":"8"
}'

The query above asks for films with indicator 1237 or 551 that are not themselves films 1237 or 551. Running the query in the Sense plugin returns recommendations including "A Man Named Pearl" (a documentary) and "Used People".

Controlling Relevance


Full-text search engines order results by relevance; Elasticsearch reports a document's relevance score in the _score field. function_score lets you modify that score at query time, and random_score generates a score by hashing a given seed. In the query shown below, the random_score function injects variability into the results in order to implement dithering:

  "query": {
    "function_score": {
      "query": {
         "bool": {
           "must": [ { "match": { "indicators":"1237 551"} } ],
           "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
         }
      },
      "functions":[ {"random_score": {"seed":"48" } } ],
      "score_mode":"sum"
    }
  }

Relevance dithering deliberately includes some lower-ranked, less relevant results in order to broaden the training data fed back to the recommendation engine. Without dithering, tomorrow's training data only teaches the model what it already knew today. With dithering, if the model's answers are already close to good ones, the added variation helps it find the genuinely right ones. Effective dithering sacrifices a little accuracy today to improve tomorrow's training data, and with it future performance (the accuracy of the algorithm counts as performance here). In other words, to make future recommendations more accurate, you reduce the degree to which the past constrains the future.
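To make the mechanics concrete, here is a minimal dithering sketch in Python (an illustration under stated assumptions, not part of the original tutorial): each result is scored by its log rank plus Gaussian noise, so top results usually stay near the top while deeper results occasionally surface.

import math
import random

def dither(ranked_ids, epsilon=0.7, seed=None):
    """Reorder a ranked result list by noisy log-rank.

    A lower original rank keeps a better (smaller) expected score,
    but the Gaussian noise occasionally promotes deeper results.
    """
    rng = random.Random(seed)  # a fixed seed per session keeps paging stable
    scored = [(math.log(rank + 1) + rng.gauss(0, epsilon), movie_id)
              for rank, movie_id in enumerate(ranked_ids)]
    return [movie_id for _, movie_id in sorted(scored)]

# Example: the same result list dithered with two different seeds
results = ["8298", "3", "64418", "1270", "2011"]
print(dither(results, seed=48))
print(dither(results, seed=49))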

Summary



We showed in this tutorial how to use Apache Mahout and Elasticsearch with the MapR Sandbox to build a basic recommendation engine. You can go beyond a basic recommender and get even better results with a few simple additions to the design to add cross recommendation of items, which leverages a variety of interactions and items for making recommendations. You can find more information about these technologies here:

References



To learn more about the components and logic of a recommendation engine, see "An Inside Look at the Components of a Recommendation Engine", which describes the recommender architecture, collaborative filtering with Mahout, and the Elasticsearch search engine in detail.


