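Association analysis, also called association mining, searches transaction data, relational data, or other information stores for frequent patterns among sets of items or objects. In short, association analysis discovers relationships between different items in a database.
Unlike regression and classification, association algorithms are not used for prediction; they are used for unsupervised knowledge discovery, looking for associations in the data. Because they need no labeled data, they are easy to apply. The drawback is that, apart from judging the usefulness of the results qualitatively, there is no simple way to measure their performance objectively.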
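Basic concepts:
1. Support of A→B: the probability that A and B occur together, support(A→B) = P(AB)
2. Confidence of A→B: the probability that B occurs given that A has occurred, confidence(A→B) = P(B|A) = P(AB)/P(A) = support(A→B)/support(A)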
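3. Lift of A→B: lift(A→B) = confidence(A→B)/support(B) = P(AB)/(P(A)P(B)). A lift above 1 means A and B occur together more often than they would if they were independent.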
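4. K-itemset: an itemset A that contains K items is called a K-itemset; a K-itemset whose support meets the minimum support threshold is called a frequent K-itemset.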
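As a small illustration of these definitions, the sketch below computes support, confidence, and lift in base R for five toy transactions. The data and the helper names (`support`, `confidence`, `lift`) are made up for this example and are not part of arules.

```r
# Toy transactions (hypothetical data, not the groceries data set)
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk"),
  c("milk", "bread", "eggs")
)

# support(X): fraction of transactions that contain every item in X
support <- function(items, trans) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

# confidence(A -> B) = support(A and B together) / support(A)
confidence <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / support(lhs, trans)
}

# lift(A -> B) = confidence(A -> B) / support(B)
lift <- function(lhs, rhs, trans) {
  confidence(lhs, rhs, trans) / support(rhs, trans)
}

s <- support(c("milk", "bread"), transactions)
conf <- confidence("milk", "bread", transactions)
l <- lift("milk", "bread", transactions)
print(c(support = s, confidence = conf, lift = l))
```

In this toy data milk and bread appear together in 3 of 5 transactions, and the lift of milk→bread comes out slightly below 1, so buying milk does not make bread more likely than its baseline frequency.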
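Main algorithm: Apriori
Idea: Apriori relies on a simple prior belief about frequent itemsets: every subset of a frequent itemset must itself be frequent. The contrapositive: if an itemset is infrequent, then all of its supersets are infrequent too.
How it works: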
1. Compute the support of every single item and keep the items whose support meets the minimum support we require.
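2. Increase the number of items per combination one at a time; every combination whose support meets the minimum support is added to the frequent itemsets. Repeat until no new frequent itemsets are found.
3. Generate association rules from the frequent itemsets selected by support, keeping the rules whose confidence meets the minimum confidence.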
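The three steps above can be sketched in base R as follows. This is a minimal, unoptimized illustration under stated assumptions: the toy transactions, the function names (`find_frequent_itemsets`, `generate_rules`), and the restriction to single-item right-hand sides are all choices made for this example, and it is not how arules implements Apriori.

```r
# Toy transactions (hypothetical data)
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk"),
  c("milk", "bread", "eggs")
)

# Fraction of transactions containing every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

find_frequent_itemsets <- function(trans, min_support) {
  # Step 1: frequent 1-itemsets
  items <- sort(unique(unlist(trans)))
  current <- Filter(function(s) support(s, trans) >= min_support, as.list(items))
  frequent <- list()
  k <- 1
  while (length(current) > 0) {
    frequent <- c(frequent, current)
    k <- k + 1
    # Step 2: build size-k candidates only from items that are still frequent
    pool <- sort(unique(unlist(current)))
    if (length(pool) < k) break
    candidates <- combn(pool, k, simplify = FALSE)
    # Apriori pruning: every (k-1)-subset of a candidate must be frequent
    keys <- vapply(current, paste, character(1), collapse = ",")
    candidates <- Filter(function(cand) {
      subs <- combn(cand, k - 1, simplify = FALSE)
      all(vapply(subs, paste, character(1), collapse = ",") %in% keys)
    }, candidates)
    current <- Filter(function(s) support(s, trans) >= min_support, candidates)
  }
  frequent
}

# Step 3: rules with a single item on the right-hand side
generate_rules <- function(frequent, trans, min_confidence) {
  rules <- list()
  for (itemset in frequent) {
    if (length(itemset) < 2) next
    for (rhs in itemset) {
      lhs <- setdiff(itemset, rhs)
      conf <- support(itemset, trans) / support(lhs, trans)
      if (conf >= min_confidence) {
        rules[[length(rules) + 1]] <- data.frame(
          rule = paste0("{", paste(lhs, collapse = ", "), "} => {", rhs, "}"),
          support = support(itemset, trans),
          confidence = conf
        )
      }
    }
  }
  do.call(rbind, rules)
}

frequent <- find_frequent_itemsets(transactions, min_support = 0.3)
rules <- generate_rules(frequent, transactions, min_confidence = 0.7)
print(rules)
```

The pruning step is where the Apriori property pays off: a candidate such as {bread, butter, milk} is discarded without counting its support, because its subset {butter, milk} is already known to be infrequent.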
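Workflow:
- Collect data: use any method
- Prepare data: any data type works, because only sets are stored
- Analyze data: use any method
- Train the algorithm: use Apriori to find the frequent itemsets
- Test the algorithm: no testing step is needed
- Use the algorithm: discover frequent itemsets and association rules between items
An example: the data set is a supermarket's purchase orders, i.e. customer A bought these products, customer B bought those, and so on. The data is loaded as a sparse matrix.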
> library(arules)
Loading required package: Matrix

Attaching package: ‘arules’

The following objects are masked from ‘package:base’:

    abbreviate, write

> groceries <- read.transactions('groceries.csv', sep = ',')
> groceries
transactions in sparse format with
 9835 transactions (rows) and    # 9835 orders
 169 items (columns)             # 169 distinct products
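Share of non-zero cells: a density of 0.02609146
Most frequently purchased items: most frequent items
Distribution of the number of items per transaction: element (itemset/transaction) length distribution, i.e. how many orders contain how many products; for example, 2159 orders contain exactly 1 product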
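To make the density figure concrete, the short base-R calculation below uses only the numbers reported above (9835 transactions, 169 items, density 0.02609146). The variable names are chosen for this illustration.

```r
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

# Total number of cells in the transactions-by-items matrix
cells <- n_transactions * n_items

# Approximate number of non-zero cells, i.e. (order, product) pairs
items_purchased <- cells * density

# Average number of distinct products per order
avg_basket <- items_purchased / n_transactions

print(c(cells = cells,
        items_purchased = round(items_purchased),
        avg_basket = avg_basket))
```

So a density of about 2.6% corresponds to roughly 43,367 non-zero cells, or about 4.4 distinct products per order, which is why a sparse representation is a sensible choice here.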
Data exploration
> inspect(groceries[1:5,])  # look at the data format
    items
[1] {citrus fruit, margarine, ready soups, semi-finished bread}
[2] {coffee, tropical fruit, yogurt}
[3] {whole milk}
[4] {cream cheese, meat spreads, pip fruit, yogurt}
[5] {condensed milk, long life bakery product, other vegetables, whole milk}
> item <- itemFrequency(groceries)  # support of each item
> item[order(item, decreasing = TRUE)][1:10]
      whole milk other vegetables       rolls/buns             soda
      0.25551601       0.19349263       0.18393493       0.17437722
          yogurt    bottled water  root vegetables   tropical fruit
      0.13950178       0.11052364       0.10899847       0.10493137
   shopping bags          sausage
      0.09852567       0.09395018
> itemFrequencyPlot(groceries, support = 0.1)  # plot the items with support of at least 0.1
> itemFrequencyPlot(groceries, topN = 20)  # the 20 best-selling items
Training the model
> apriori(groceries)  # default parameters
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime
        0.8    0.1    1 none FALSE            TRUE       5
 support minlen maxlen target   ext
     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 983

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
set of 0 rules
With the default parameters no rules are found, so lower the support and confidence thresholds.
> apriori(groceries, parameter = list(support = 0.01, confidence = 0.2, minlen = 2))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime
        0.2    0.1    1 none FALSE            TRUE       5
 support minlen maxlen target   ext
    0.01      2     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 98

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.02s].
sorting and recoding items ... [88 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [231 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
set of 231 rules

> groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen
       0.25    0.1    1 none FALSE            TRUE       5   0.006      2     10
 target   ext
  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 59

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [109 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [463 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
> groceryrules
set of 463 rules
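The "Absolute minimum support count" lines in the logs show what each relative support threshold means in terms of orders. The base-R check below reproduces the three logged counts from the 9835 transactions. Using `floor()` here is an assumption that happens to match the logged values, not a statement about how arules computes the count internally.

```r
n_transactions <- 9835
min_supports <- c(0.1, 0.01, 0.006)

# Smallest number of orders an itemset must appear in at each threshold.
# floor() is an assumption that matches the counts shown in the logs.
min_counts <- floor(min_supports * n_transactions)

print(data.frame(support = min_supports, min_count = min_counts))
```

Read this way, a support of 0.006 asks for an itemset to appear in about 59 orders, which is a much easier bar than the default 0.1 (about 983 orders) and explains why far more rules are found.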
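Rule extraction (here we assume the last rule set is the one to use, purely as an example; in production you could estimate the likely sales uplift of each candidate rule set and output the one with the highest estimated sales, or validate it with a staged gray-release test)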
> summary(groceryrules)
set of 463 rules

rule length distribution (lhs + rhs):sizes
  2   3   4
150 297  16

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   3.000   2.711   3.000   4.000

summary of quality measures:
    support           confidence          lift
 Min.   :0.006101   Min.   :0.2500   Min.   :0.9932
 1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229
 Median :0.008744   Median :0.3554   Median :1.9332
 Mean   :0.011539   Mean   :0.3786   Mean   :2.0351
 3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565
 Max.   :0.074835   Max.   :0.6600   Max.   :3.9565

mining info:
      data ntransactions support confidence
 groceries          9835   0.006       0.25
> inspect(sort(groceryrules, by = "lift")[1:5])
    lhs                                               rhs                  support     confidence lift
[1] {herbs}                                        => {root vegetables}    0.007015760 0.4312500  3.956477
[2] {berries}                                      => {whipped/sour cream} 0.009049314 0.2721713  3.796886
[3] {other vegetables, tropical fruit, whole milk} => {root vegetables}    0.007015760 0.4107143  3.768074
[4] {beef, other vegetables}                       => {root vegetables}    0.007930859 0.4020619  3.688692
[5] {other vegetables, tropical fruit}             => {pip fruit}          0.009456024 0.2634561  3.482649
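As a sanity check on how to read the lift column, the top rule can be recomputed by hand: lift is the rule's confidence divided by the support of its right-hand side. The sketch below uses only values printed earlier (the confidence of {herbs} => {root vegetables} and the item frequency of root vegetables); the variable names are chosen for this illustration.

```r
# Confidence of {herbs} => {root vegetables}, from the inspect() output
confidence_herbs_root <- 0.4312500

# Support (item frequency) of root vegetables, from itemFrequency()
support_root <- 0.10899847

# lift(A -> B) = confidence(A -> B) / support(B)
lift_herbs_root <- confidence_herbs_root / support_root
print(lift_herbs_root)
```

The result agrees with the reported lift of about 3.96: customers who buy herbs are almost four times as likely to buy root vegetables as the average customer.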
Extracting specific association rules
> berryrules <- subset(groceryrules, items %in% "berries")
> inspect(berryrules)
    lhs          rhs                  support     confidence lift
[1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886
[2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848
[3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280
[4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328
Exporting all rules
> groceryrules_df <- as(groceryrules, "data.frame")
> head(groceryrules_df)
                                  rules     support confidence     lift
1     {potted plants} => {whole milk} 0.006914082  0.4000000 1.565460
2             {pasta} => {whole milk} 0.006100661  0.4054054 1.586614
3        {herbs} => {root vegetables} 0.007015760  0.4312500 3.956477
4       {herbs} => {other vegetables} 0.007727504  0.4750000 2.454874
5             {herbs} => {whole milk} 0.007727504  0.4750000 1.858983
6  {processed cheese} => {whole milk} 0.007015760  0.4233129 1.656698
> write(groceryrules,
+       file = 'groceryrules.csv',
+       sep = ',',
+       row.names = FALSE,
+       quote = TRUE)