Association Analysis (Market Basket Analysis): An R Exercise


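Association analysis, also called association mining, searches transaction data, relational data, or other information stores for frequent patterns among sets of items or objects. In short, it discovers relationships between different items in a database.

Unlike regression and classification, association algorithms do not make predictions; they are used for unsupervised knowledge discovery, finding associations within the data. Because they need no labeled data, they are easy to apply. Apart from judging the resulting rules qualitatively, however, there is no simple way to measure their performance objectively.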

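Basic concepts:

1. Support of A→B: the probability that A and B occur together, support = P(AB)

2. Confidence of A→B: the probability that B occurs given that A has occurred, confidence = P(B|A) = P(AB)/P(A) = support(A→B)/support(A)

3. Lift: confidence(A→B)/support(B), which equals P(AB)/(P(A)P(B)). A lift above 1 means A and B occur together more often than they would if they were independent.

4. K-itemset: an itemset containing K items. A K-itemset whose support meets the minimum support threshold is called a frequent K-itemset.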
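The three measures can be computed by hand on a small example. The sketch below uses base R only; the five toy transactions and the helper names `support`, `confidence`, and `lift` are made up for illustration and are not part of the groceries data or of arules.

```r
# Toy transactions (hypothetical data, not from groceries.csv)
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "bread", "jam")
)

# support(X): fraction of transactions that contain every item in X
support <- function(items, trans) {
  mean(vapply(trans, function(tx) all(items %in% tx), logical(1)))
}

# confidence(A -> B) = support(A and B together) / support(A)
confidence <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / support(lhs, trans)
}

# lift(A -> B) = confidence(A -> B) / support(B)
lift <- function(lhs, rhs, trans) {
  confidence(lhs, rhs, trans) / support(rhs, trans)
}

supp_mb <- support(c("milk", "bread"), transactions)  # 3 of 5 transactions
conf_mb <- confidence("milk", "bread", transactions)  # 0.6 / 0.8
lift_mb <- lift("milk", "bread", transactions)        # 0.75 / 0.8

print(c(support = supp_mb, confidence = conf_mb, lift = lift_mb))
```

Here the lift is slightly below 1, so in this toy data buying milk does not make bread more likely than its baseline frequency.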

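Main algorithm: Apriori

Idea: Apriori exploits a simple prior belief about frequent itemsets: every subset of a frequent itemset must itself be frequent. The contrapositive: if an itemset is infrequent, then all of its supersets are infrequent too.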

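How the algorithm works:

1. Compute the support of each single item and keep the items whose support meets the required minimum.

2. Grow the itemsets one item at a time: any combination whose support meets the minimum is added to the frequent itemsets, and the process repeats level by level until no new frequent itemsets appear.

3. Generate association rules from the frequent itemsets selected by support, keeping the rules whose confidence meets the required minimum.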
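Steps 1 and 2 can be sketched in a few lines of base R. This is a minimal illustration of the level-wise search with subset pruning, not how arules implements it; the function name `apriori_sketch` and the toy transactions are assumptions made for the example, and rule generation (step 3) is left out.

```r
# Toy transactions (hypothetical data)
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "bread", "jam")
)

# Level-wise search for frequent itemsets (illustrative only)
apriori_sketch <- function(trans, min_support) {
  support <- function(items) {
    mean(vapply(trans, function(tx) all(items %in% tx), logical(1)))
  }
  # Step 1: frequent 1-itemsets
  items <- sort(unique(unlist(trans)))
  current <- Filter(function(s) support(s) >= min_support, as.list(items))
  frequent <- list()
  k <- 1
  while (length(current) > 0) {
    frequent <- c(frequent, current)
    k <- k + 1
    # Step 2: build size-k candidates by joining frequent (k-1)-itemsets
    candidates <- list()
    n <- length(current)
    if (n >= 2) {
      for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
          cand <- sort(union(current[[i]], current[[j]]))
          if (length(cand) == k) {
            candidates[[paste(cand, collapse = ",")]] <- cand
          }
        }
      }
    }
    # Apriori pruning: every (k-1)-subset of a candidate must be frequent
    prev_keys <- vapply(current, paste, character(1), collapse = ",")
    candidates <- Filter(function(cand) {
      subs <- combn(cand, k - 1, simplify = FALSE)
      all(vapply(subs, paste, character(1), collapse = ",") %in% prev_keys)
    }, candidates)
    # Keep only the candidates that meet the support threshold
    current <- unname(Filter(function(s) support(s) >= min_support, candidates))
  }
  frequent
}

freq <- apriori_sketch(transactions, min_support = 0.5)
print(freq)
```

With a minimum support of 0.5, butter and jam are dropped in the first pass, so no itemset containing them is ever counted. That is the saving the Apriori property provides.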

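Workflow:

  • Collect data: any method
  • Prepare data: any data type works, since only sets are stored
  • Analyze data: any method
  • Train: use the Apriori algorithm to find frequent itemsets
  • Test: no testing step is needed
  • Use: discover frequent itemsets and association rules between items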

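An example: the dataset is a supermarket's purchase orders (customer A bought these products, customer B bought those, and so on). The data is loaded as a sparse matrix.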

> library(arules)
Loading required package: Matrix

Attaching package: ‘arules’
The following objects are masked from ‘package:base’:

    abbreviate, write
> groceries <- read.transactions('groceries.csv', sep = ',')
> groceries
transactions in sparse format with
 9835 transactions (rows) and #9835 orders
 169 items (columns) #169 product types

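The output of summary(groceries) also reports:

Proportion of non-zero cells: a density of 0.02609146

Most frequently purchased items: most frequent items

Distribution of the number of items per transaction: element (itemset/transaction) length distribution, i.e. how many orders contain how many products; 2159 orders contain a single product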
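The density figure can be turned into more tangible numbers. The short calculation below uses only the three values printed above; the variable names are made up for the example.

```r
n_transactions <- 9835       # rows of the sparse matrix
n_items        <- 169        # columns of the sparse matrix
density        <- 0.02609146 # share of non-zero cells

# Each non-zero cell is one distinct item in one order
total_items <- n_transactions * n_items * density

# Average number of distinct items per order
avg_items <- total_items / n_transactions

print(round(total_items))
print(round(avg_items, 2))
```

So roughly 43 thousand item entries are spread over 9835 orders, a little over four items per order on average.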

Data exploration

> inspect(groceries[1:5,]) #view the data format
    items                     
[1] {citrus fruit,            
     margarine,               
     ready soups,             
     semi-finished bread}     
[2] {coffee,                  
     tropical fruit,          
     yogurt}                  
[3] {whole milk}              
[4] {cream cheese,            
     meat spreads,            
     pip fruit,               
     yogurt}                  
[5] {condensed milk,          
     long life bakery product,
     other vegetables,        
     whole milk}              
> item <- itemFrequency(groceries) #support of each item
> item[order(item, decreasing = T)][1:10]
      whole milk other vegetables       rolls/buns             soda 
      0.25551601       0.19349263       0.18393493       0.17437722 
          yogurt    bottled water  root vegetables   tropical fruit 
      0.13950178       0.11052364       0.10899847       0.10493137 
   shopping bags          sausage 
      0.09852567       0.09395018 
> itemFrequencyPlot(groceries, support = 0.1) #plot the items with support of at least 0.1

> itemFrequencyPlot(groceries, topN = 20) #the 20 best-selling items

Training the model

> apriori(groceries) #default parameters
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime
        0.8    0.1    1 none FALSE            TRUE       5
 support minlen maxlen target   ext
     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 983 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
set of 0 rules 

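With the default parameters (support 0.1, confidence 0.8) no rules are found, so lower the support and confidence thresholds.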
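The "Absolute minimum support count" lines in the logs show how demanding each support threshold is. The printed counts (983, 98, 59) are consistent with rounding support × number of transactions down; the check below assumes that floor is the rule, which is an inference from the printed numbers rather than something taken from the arules documentation.

```r
n_transactions <- 9835
supports <- c(0.1, 0.01, 0.006)  # the three thresholds used in this post

# Minimum number of orders an itemset must appear in (assumed: floor)
min_counts <- floor(supports * n_transactions)

print(data.frame(support = supports, min_count = min_counts))
```

At support 0.1 an item must appear in 983 orders, and the log shows only 8 items passing that bar, which is why no rule reaches the default confidence of 0.8.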

 

> apriori(groceries,parameter = list(support=0.01,confidence=0.2,minlen=2))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime
        0.2    0.1    1 none FALSE            TRUE       5
 support minlen maxlen target   ext
    0.01      2     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 98 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.02s].
sorting and recoding items ... [88 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [231 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
set of 231 rules 
> groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
> groceryrules
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen
       0.25    0.1    1 none FALSE            TRUE       5   0.006      2     10
 target   ext
  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 59 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [109 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [463 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
set of 463 rules 

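Extracting rules (the last rule set is used here purely as an example; in production you could estimate the likely sales growth from each rule set and output the one with the highest estimated sales, or run a staged-rollout test)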

> summary(groceryrules)
set of 463 rules

rule length distribution (lhs + rhs):sizes
  2   3   4 
150 297  16 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.000   3.000   2.711   3.000   4.000 

summary of quality measures:
    support           confidence          lift       
 Min.   :0.006101   Min.   :0.2500   Min.   :0.9932  
 1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229  
 Median :0.008744   Median :0.3554   Median :1.9332  
 Mean   :0.011539   Mean   :0.3786   Mean   :2.0351  
 3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565  
 Max.   :0.074835   Max.   :0.6600   Max.   :3.9565  

mining info:
      data ntransactions support confidence
 groceries          9835   0.006       0.25
> inspect(sort(groceryrules, by = "lift")[1:5])
    lhs                   rhs                      support confidence     lift
[1] {herbs}            => {root vegetables}    0.007015760  0.4312500 3.956477
[2] {berries}          => {whipped/sour cream} 0.009049314  0.2721713 3.796886
[3] {other vegetables,                                                        
     tropical fruit,                                                          
     whole milk}       => {root vegetables}    0.007015760  0.4107143 3.768074
[4] {beef,                                                                    
     other vegetables} => {root vegetables}    0.007930859  0.4020619 3.688692
[5] {other vegetables,                                                        
     tropical fruit}   => {pip fruit}          0.009456024  0.2634561 3.482649

Extracting specific rules

> berryrules <- subset(groceryrules, items %in% "berries")
> inspect(berryrules)
    lhs          rhs                  support     confidence lift    
[1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886
[2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848
[3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280
[4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328
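
The lift column in the outputs above can be cross-checked against the definition lift = confidence(A→B)/support(B). The values below are copied from the console output in this post (confidence from the rule listings, support of the right-hand-side item from the itemFrequency listing); the variable names are made up for the example.

```r
# {herbs} => {root vegetables}: confidence 0.4312500, support(root vegetables) 0.10899847
lift_herbs_root <- 0.4312500 / 0.10899847

# {berries} => {yogurt}: confidence 0.3180428, support(yogurt) 0.13950178
lift_berries_yogurt <- 0.3180428 / 0.13950178

# {berries} => {whole milk}: confidence 0.3547401, support(whole milk) 0.25551601
lift_berries_milk <- 0.3547401 / 0.25551601

print(round(c(lift_herbs_root, lift_berries_yogurt, lift_berries_milk), 4))
```

The results agree with the reported lifts (3.956477, 2.279848, 1.388328) to the precision of the printed inputs. Whole milk is so common on its own that the berries rule for it has the lowest lift of the three despite the highest confidence.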

Exporting all rules

> groceryrules_df <- as(groceryrules, "data.frame")
> head(groceryrules_df)
                               rules     support confidence     lift
1    {potted plants} => {whole milk} 0.006914082  0.4000000 1.565460
2            {pasta} => {whole milk} 0.006100661  0.4054054 1.586614
3       {herbs} => {root vegetables} 0.007015760  0.4312500 3.956477
4      {herbs} => {other vegetables} 0.007727504  0.4750000 2.454874
5            {herbs} => {whole milk} 0.007727504  0.4750000 1.858983
6 {processed cheese} => {whole milk} 0.007015760  0.4233129 1.656698

> write(groceryrules, file = 'groceryrules.csv', sep = ',', row.names = F, quote = T)

 

