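Most of this article is taken from an English-language blog post, Finding association rules with Mahout Frequent Pattern Mining. I wrote it up for three reasons: first, the original is in English; second, that blog appears to be blocked by the firewall (I could only read it through goagent); third, I simplified its experiment and simply use numbers to represent items.
The experimental environment: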
- jdk >= 1.6
- maven
- hadoop (>1.0.0)
- mahout >= 0.7
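I won't go into setting up the environment. The one thing to note is that installing mahout by following the guide on the official site works without problems; if you get errors after installation, your hadoop version may be the cause, so try a different hadoop. The error I kept running into was:
Exception in thread "main" java.lang.NoClassDefFoundError:classpath
The data I use is retail.dat, which is provided on the mahout website. Which dataset you use does not matter; mahout fpgrowth expects the data in the following format: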
[item id1], [item id2], [item id3]
0, 2, 6, ...
0, 1, 6, ...
4, 5, ...
...
The separator can be something else; retail.dat uses spaces. Just remember to specify the separator on the command line (the -regex option).
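To make the format concrete, here is a minimal Java sketch of how one transaction line could be split into item ids with a separator regex. The class name TransactionParser and the method parse are hypothetical names made up for this illustration; this is not Mahout's own parsing code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Mahout.
public class TransactionParser {

    // Splits one transaction line into item ids using the given separator regex.
    public static List<String> parse(String line, String separatorRegex) {
        List<String> items = new ArrayList<String>();
        for (String token : Pattern.compile(separatorRegex).split(line.trim())) {
            if (!token.isEmpty()) {
                items.add(token);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Space-separated line, as in retail.dat.
        System.out.println(parse("0 2 6", "[\\ ]"));
        // Comma-and-space-separated line, as in the format sketch above.
        System.out.println(parse("0, 1, 6", "[ ,]+"));
    }
}
```

The same regex string that splits your file correctly here is what you would pass to -regex.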
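MAHOUT_LOCAL is left unset here so that mahout runs on hadoop, so first use the hadoop command to put the data on hdfs. In a terminal, enter:
hadoop fs -put retail.dat retail.dat
Then run mahout with the following command (the path given to -i must be the hdfs path the data was just uploaded to):
mahout fpg -i retail.dat -o patterns -k 10 -method mapreduce -regex '[\ ]' -s 10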
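The options are documented in detail on the mahout website. In brief: -i is the input, -o is the output, -k 10 asks for the top ten frequent patterns related to each item, -method mapreduce runs the job with mapreduce, -regex '[\ ]' says that items within a transaction are separated by a space, and -s 10 keeps only items that occur at least 10 times.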
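As a rough illustration of what the -s threshold does, the sketch below counts in how many transactions each item appears and drops the items below a minimum count. It is an in-memory toy under my own assumptions (the class MinSupportFilter and its method are hypothetical), not the way Mahout implements the step on mapreduce.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a minimum-support filter; not Mahout code.
public class MinSupportFilter {

    // Counts the transactions containing each item and keeps items with count >= minSupport.
    public static Map<String, Integer> frequentItems(List<List<String>> transactions, int minSupport) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (List<String> transaction : transactions) {
            // A set ensures an item repeated inside one transaction is counted once.
            for (String item : new HashSet<String>(transaction)) {
                Integer old = counts.get(item);
                counts.put(item, old == null ? 1 : old + 1);
            }
        }
        Map<String, Integer> frequent = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minSupport) {
                frequent.put(entry.getKey(), entry.getValue());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<List<String>> transactions = Arrays.asList(
                Arrays.asList("0", "2", "6"),
                Arrays.asList("0", "1", "6"),
                Arrays.asList("4", "5"));
        // With a threshold of 2, only items 0 and 6 survive: prints {0=2, 6=2}
        System.out.println(frequentItems(transactions, 2));
    }
}
```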
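After a successful run, four files or folders appear in the patterns folder:
- fList: a sequence file recording how many times each item occurs
- frequentpatterns: a sequence file recording the frequent patterns that contain each item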
- fpGrowth
- parallelcounting
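These results live on hdfs, of course. You can inspect them with a mahout command; in a terminal, enter:
mahout seqdumper -i patterns/frequentpatterns/part-r-00000
The first line of the output shows the top ten frequent patterns related to item 7671, sorted by number of occurrences. ([7671],80) means that item 7671 appears in 80 transactions, and ([39, 7671],57) means that items 39 and 7671 appear together in 57 transactions. Association rules can be derived from the following measures: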
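If you want to work with the dumped text directly, entries such as ([39, 7671],57) are easy to pick apart with a regular expression. The sketch below assumes only the entry shape quoted above; the rest of the seqdumper line layout is not assumed, and PatternLineParser and FrequentPattern are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for entries shaped like "([39, 7671],57)".
public class PatternLineParser {

    // One frequent pattern: its items and the number of transactions containing all of them.
    public static final class FrequentPattern {
        public final List<String> items;
        public final long count;

        FrequentPattern(List<String> items, long count) {
            this.items = items;
            this.count = count;
        }
    }

    // Matches "([a, b, ...],n)": group 1 is the item list, group 2 the count.
    private static final Pattern ENTRY =
            Pattern.compile("\\(\\[([^\\]]*)\\],\\s*(\\d+)\\)");

    // Extracts every "([items],count)" entry found in the text, in order.
    public static List<FrequentPattern> parse(String text) {
        List<FrequentPattern> result = new ArrayList<FrequentPattern>();
        Matcher matcher = ENTRY.matcher(text);
        while (matcher.find()) {
            List<String> items = new ArrayList<String>();
            for (String raw : matcher.group(1).split(",")) {
                String item = raw.trim();
                if (!item.isEmpty()) {
                    items.add(item);
                }
            }
            result.add(new FrequentPattern(items, Long.parseLong(matcher.group(2))));
        }
        return result;
    }

    public static void main(String[] args) {
        for (FrequentPattern p : parse("([7671],80), ([39, 7671],57)")) {
            System.out.println(p.items + " occurs in " + p.count + " transactions");
        }
    }
}
```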
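- support
The fraction of all transactions that contain the itemset X: supp(X) = (number of transactions containing X) / (total number of transactions)
- confidence
Among the transactions that contain X, the proportion that also contain Y: conf(X => Y) = supp(X ∪ Y) / supp(X)
- lift measures how far X and Y are from being independent: lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y)); a value of 1 means X and Y are independent
- conviction also measures the dependence between X and Y, and the larger it is the better: conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y))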
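The four measures can be computed directly from raw counts. The sketch below uses the standard definitions and takes the counts 80 and 57 from the example above and 88162 as the total number of transactions (the value used by the program in this post). The count for item 39 is not given in the text, so the 50000 used here is a made-up placeholder, and RuleMetrics is a hypothetical class name.

```java
// Hypothetical helper computing rule measures from raw transaction counts.
public class RuleMetrics {

    // supp(X ∪ Y): fraction of all n transactions containing both X and Y.
    public static double support(long countXY, long n) {
        return (double) countXY / n;
    }

    // conf(X => Y) = count(X ∪ Y) / count(X).
    public static double confidence(long countXY, long countX) {
        return (double) countXY / countX;
    }

    // lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y)) = countXY * n / (countX * countY).
    public static double lift(long countXY, long countX, long countY, long n) {
        return ((double) countXY * n) / ((double) countX * countY);
    }

    // conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y)); infinite when the confidence is 1.
    public static double conviction(long countXY, long countX, long countY, long n) {
        return (1.0 - (double) countY / n) / (1.0 - confidence(countXY, countX));
    }

    public static void main(String[] args) {
        long n = 88162;        // total number of transactions
        long count7671 = 80;   // from ([7671],80)
        long countBoth = 57;   // from ([39, 7671],57)
        long count39 = 50000;  // placeholder value, NOT taken from the data

        System.out.printf("supp=%.5f conf=%.4f lift=%.4f conviction=%.4f%n",
                support(countBoth, n),
                confidence(countBoth, count7671),
                lift(countBoth, count7671, count39, n),
                conviction(countBoth, count7671, count39, n));
    }
}
```

Note that the lift formula written with counts carries a factor of n in the numerator; leaving it out gives values so small that they print as 0.000.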
Next we derive the association rules with a program. First copy the files from hdfs to the local machine:
hadoop fs -getmerge patterns/frequentpatterns frequentpatterns.seq
hadoop fs -get patterns/fList fList.seq
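The code is Java, and you can set up the project any way you like. I used eclipse + maven, because that way the required mahout packages are downloaded automatically. Copy the two sequence files into a data directory under the project root, which is where the code looks for them. The code is as follows: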
package heyong;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns;

public class ResultReaderS {

    // Reads fList.seq into a map: item id -> number of occurrences.
    public static Map<Integer, Long> readFrequency(Configuration configuration, String fileName) throws Exception {
        FileSystem fs = FileSystem.get(configuration);
        Reader frequencyReader = new SequenceFile.Reader(fs, new Path(fileName), configuration);
        Map<Integer, Long> frequency = new HashMap<Integer, Long>();
        Text key = new Text();
        LongWritable value = new LongWritable();
        while (frequencyReader.next(key, value)) {
            frequency.put(Integer.parseInt(key.toString()), value.get());
        }
        frequencyReader.close();
        return frequency;
    }

    // Reads frequentpatterns.seq and prints every rule that passes both thresholds.
    public static void readFrequentPatterns(
            Configuration configuration, String fileName, int transactionCount,
            Map<Integer, Long> frequency, double minSupport, double minConfidence) throws Exception {
        FileSystem fs = FileSystem.get(configuration);
        Reader frequentPatternsReader = new SequenceFile.Reader(fs, new Path(fileName), configuration);
        Text key = new Text();
        TopKStringPatterns value = new TopKStringPatterns();

        while (frequentPatternsReader.next(key, value)) {
            long firstFrequencyItem = -1;
            String firstItemId = null;
            List<Pair<List<String>, Long>> patterns = value.getPatterns();
            int i = 0;
            for (Pair<List<String>, Long> pair : patterns) {
                List<String> itemList = pair.getFirst();
                Long occurrence = pair.getSecond();
                if (i == 0) {
                    // The first pattern is treated as the key item on its own, with its count.
                    firstFrequencyItem = occurrence;
                    firstItemId = itemList.get(0);
                } else {
                    double support = (double) occurrence / transactionCount;
                    // occurrence / count(first item): the confidence of "first item => other items".
                    double confidence = (double) occurrence / firstFrequencyItem;
                    if (support > minSupport && confidence > minConfidence) {
                        List<String> listWithoutFirstItem = new ArrayList<String>();
                        for (String itemId : itemList) {
                            if (!itemId.equals(firstItemId)) {
                                listWithoutFirstItem.add(itemId);
                            }
                        }
                        // Printed as "other items => first item"; conf and conviction use the
                        // first item (the right-hand side) as the antecedent.
                        System.out.printf("%s => %s: supp=%.3f, conf=%.3f",
                                listWithoutFirstItem, firstItemId, support, confidence);

                        if (itemList.size() == 2) {
                            // we can easily compute the lift and the conviction for set of
                            // size 2, so do it
                            int otherItemId = Integer.parseInt(listWithoutFirstItem.get(0));
                            long otherItemOccurrence = frequency.get(otherItemId);
                            // lift = supp(X ∪ Y) / (supp(X) * supp(Y)), which in counts is
                            // occurrence * N / (count(X) * count(Y)). Without the factor N
                            // the value is tiny and prints as 0.000.
                            double lift = (double) occurrence * transactionCount
                                    / ((double) firstFrequencyItem * otherItemOccurrence);
                            double conviction = (1.0 - (double) otherItemOccurrence / transactionCount)
                                    / (1.0 - confidence);
                            System.out.printf(", lift=%.3f, conviction=%.3f", lift, conviction);
                        }
                        System.out.printf("\n");
                    }
                }
                i++;
            }
        }
        frequentPatternsReader.close();
    }

    public static void main(String args[]) throws Exception {
        int transactionCount = 88162; // total number of transactions
        String frequencyFilename = "data/fList.seq";
        String frequentPatternsFilename = "data/frequentpatterns.seq";
        double minSupport = 0.001;    // support threshold
        double minConfidence = 0.3;   // confidence threshold

        Configuration configuration = new Configuration();
        Map<Integer, Long> frequency = readFrequency(configuration, frequencyFilename);
        readFrequentPatterns(configuration, frequentPatternsFilename,
                transactionCount, frequency, minSupport, minConfidence);
    }
}
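Running the program gave the results below. In this run lift was computed as occurrence / (count(X) * count(Y)), without the factor of the total transaction count, which is why it prints as 0.000. Also note that conf and conviction are computed with the item on the right-hand side of each printed rule as the antecedent.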
[39] => 3361: supp=0.003, conf=0.565, lift=0.000, conviction=0.977
[48] => 3361: supp=0.003, conf=0.560, lift=0.000, conviction=1.186
[39, 48] => 3361: supp=0.002, conf=0.396
[48] => 337: supp=0.001, conf=0.589, lift=0.000, conviction=1.271
[39] => 337: supp=0.001, conf=0.554, lift=0.000, conviction=0.952
[48] => 338: supp=0.009, conf=0.611, lift=0.000, conviction=1.344
[39] => 338: supp=0.008, conf=0.582, lift=0.000, conviction=1.018
[39, 48] => 338: supp=0.006, conf=0.405
[48] => 340: supp=0.005, conf=0.633, lift=0.000, conviction=1.422
………………
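Adjusting the support and confidence thresholds lets you tune the results to something more satisfying. That completes an introductory experiment in deriving rules with mahout fpgrowth; used flexibly, this algorithm can come in handy in many places.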