Correlation Matrix Analysis and Visualization in R

Reposted from the original article, 2019-09-10 17:10

Original link: http://tecdat.cn/?p=6488

Data preparation

library(dplyr)   # provides %>% and select()

# Select columns of interest
mydata <- mtcars %>% 
  select(mpg, disp, hp, drat, wt, qsec)
# Add some missing values
mydata$hp[3] <- NA
# Inspect the data
head(mydata, 3)

##                mpg disp  hp drat   wt qsec
## Mazda RX4     21.0  160 110 3.90 2.62 16.5
## Mazda RX4 Wag 21.0  160 110 3.90 2.88 17.0
## Datsun 710    22.8  108  NA 3.85 2.32 18.6

Computing the correlation matrix

library(corrr)   # provides correlate() and the helpers used below
res.cor <- correlate(mydata)
res.cor

## # A tibble: 6 x 7
##   rowname     mpg    disp      hp     drat      wt     qsec
##   <chr>     <dbl>   <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
## 1 mpg      NA      -0.848  -0.775   0.681   -0.868   0.419 
## 2 disp     -0.848  NA       0.786  -0.710    0.888  -0.434 
## 3 hp       -0.775   0.786  NA      -0.443    0.651  -0.706 
## 4 drat      0.681  -0.710  -0.443  NA       -0.712   0.0912
## 5 wt       -0.868   0.888   0.651  -0.712   NA      -0.175 
## 6 qsec      0.419  -0.434  -0.706   0.0912  -0.175  NA

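Other arguments of correlate() include:

method: a character string indicating which correlation coefficient (or covariance) to compute. One of "pearson" (default), "kendall", or "spearman"; it can be abbreviated.
diagonal: the value to set the diagonal to (typically numeric or NA).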
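As a minimal sketch of these two arguments, the call below computes Spearman rank correlations and fills the diagonal with 1 instead of NA. The object name res.spearman is only illustrative, and quiet = TRUE is used merely to suppress the informational message.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)

# Spearman rank correlations, with the diagonal set to 1 instead of NA
res.spearman <- correlate(mydata, method = "spearman", diagonal = 1, quiet = TRUE)
res.spearman
```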

Exploring the correlation matrix

Filter for correlations whose absolute value is above 0.8:

## # A tibble: 6 x 3
##   rowname colname    cor
##   <chr>   <chr>    <dbl>
## 1 disp    mpg     -0.848
## 2 wt      mpg     -0.868
## 3 mpg     disp    -0.848
## 4 wt      disp     0.888
## 5 mpg     wt      -0.868
## 6 disp    wt       0.888
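The code behind this table is not shown. One way to get a comparable result, sketched here under the assumption that a long-format table is acceptable, is to stretch the matrix and filter on the absolute value. Note that stretch() names its columns x, y and r, not rowname, colname and cor as in the output above; high.cor is an illustrative name.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

high.cor <- res.cor %>%
  stretch() %>%            # long format: one row per variable pair
  filter(abs(r) > 0.8)     # keep strong correlations of either sign
high.cor
```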

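Specific columns/rows

The focus() function lets you focus on specific columns and rows. It works much like dplyr's select(), but it also removes the selected columns from the rows.

Select the results for the columns of interest. The selected columns are excluded from the rows: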

## # A tibble: 3 x 4
##   rowname    mpg   disp     hp
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 drat     0.681 -0.710 -0.443
## 2 wt      -0.868  0.888  0.651
## 3 qsec     0.419 -0.434 -0.706

Selected columns only (also kept as rows):

## # A tibble: 3 x 4
##   rowname     mpg    disp      hp
##   <chr>     <dbl>   <dbl>   <dbl>
## 1 mpg      NA      -0.848  -0.775
## 2 disp     -0.848  NA       0.786
## 3 hp       -0.775   0.786  NA

Remove unwanted columns:

## # A tibble: 3 x 4
##   rowname   drat     wt   qsec
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 mpg      0.681 -0.868  0.419
## 2 disp    -0.710  0.888 -0.434
## 3 hp      -0.443  0.651 -0.706
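The calls that produced the three tables above are not shown. The sketch below is one plausible reconstruction using focus() and its mirror argument; the object names are illustrative. In recent corrr releases the first column is called term, not rowname.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

# Selected columns; they are removed from the rows
cols.only <- res.cor %>% focus(mpg, disp, hp)

# Selected columns, mirrored so the same variables form the rows
cols.mirror <- res.cor %>% focus(mpg, disp, hp, mirror = TRUE)

# Drop columns; the dropped variables remain as rows
cols.dropped <- res.cor %>% focus(-mpg, -disp, -hp)
```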

Select columns by regular expression:

## # A tibble: 4 x 3
##   rowname   disp    drat
##   <chr>    <dbl>   <dbl>
## 1 mpg     -0.848  0.681 
## 2 hp       0.786 -0.443 
## 3 wt       0.888 -0.712 
## 4 qsec    -0.434  0.0912
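A table like this one could come from passing a tidyselect helper to focus(). The pattern "^d" (columns whose names start with d) is an assumption inferred from the output, not taken from the original code.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

# Keep only the columns whose names start with "d"
d.cols <- res.cor %>% focus(matches("^d"))
d.cols
```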

Select correlations above 0.8:

## # A tibble: 2 x 3
##   rowname   disp     wt
##   <chr>    <dbl>  <dbl>
## 1 disp    NA      0.888
## 2 wt       0.888 NA
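Selecting columns by a condition on their values can be sketched with focus_if(), which keeps the columns for which a predicate returns TRUE. The helper any_above is a hypothetical name introduced for this example.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

# Hypothetical predicate: does the column contain any correlation above 0.8?
any_above <- function(x) any(x > 0.8, na.rm = TRUE)

strong <- res.cor %>% focus_if(any_above, mirror = TRUE)
strong
```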

Focus on the correlation of one variable with all the others:

# Extract the correlation

## # A tibble: 5 x 2
##   rowname    mpg
##   <chr>    <dbl>
## 1 disp    -0.848
## 2 hp      -0.775
## 3 drat     0.681
## 4 wt      -0.868
## 5 qsec     0.419

# Plot the correlation between mpg and all others
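The code for these two steps is missing. A minimal sketch, assuming ggplot2 is available, extracts the mpg column with focus() and draws a bar chart. The first column is renamed to variable (an illustrative name) so the code does not depend on whether the installed corrr calls it rowname or term.

```r
library(dplyr)
library(corrr)
library(ggplot2)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

# Extract the correlations of mpg with every other variable
mpg.cor <- res.cor %>% focus(mpg)
names(mpg.cor)[1] <- "variable"

# Bar chart, ordered by the size of the correlation
p <- ggplot(mpg.cor, aes(x = reorder(variable, mpg), y = mpg)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Correlation with mpg")
p
```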

Reordering the correlation matrix

## # A tibble: 6 x 7
##   rowname      wt     drat    disp     mpg      hp     qsec
##   <chr>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
## 1 wt       NA      -0.712    0.888  -0.868   0.651  -0.175 
## 2 drat     -0.712  NA       -0.710   0.681  -0.443   0.0912
## 3 disp      0.888  -0.710   NA      -0.848   0.786  -0.434 
## 4 mpg      -0.868   0.681   -0.848  NA      -0.775   0.419 
## 5 hp        0.651  -0.443    0.786  -0.775  NA      -0.706 
## 6 qsec     -0.175   0.0912  -0.434   0.419  -0.706  NA
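Reordering of this kind can be sketched with rearrange(), which groups highly correlated variables together. The arguments used for the output above are not shown, so the exact order you get may differ depending on the method and package versions.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

# absolute = FALSE orders by signed correlations, not by their magnitude
res.ordered <- res.cor %>% rearrange(absolute = FALSE)
res.ordered
```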

Upper/lower triangle

Set the upper or lower triangle to missing values:

res.cor %>% shave()

## # A tibble: 6 x 7
##   rowname     mpg    disp      hp     drat      wt  qsec
##   <chr>     <dbl>   <dbl>   <dbl>    <dbl>   <dbl> <dbl>
## 1 mpg      NA      NA      NA      NA       NA        NA
## 2 disp     -0.848  NA      NA      NA       NA        NA
## 3 hp       -0.775   0.786  NA      NA       NA        NA
## 4 drat      0.681  -0.710  -0.443  NA       NA        NA
## 5 wt       -0.868   0.888   0.651  -0.712   NA        NA
## 6 qsec      0.419  -0.434  -0.706   0.0912  -0.175    NA
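By default shave() blanks the upper triangle, as in the output above. The sketch below shows the upper argument, which switches to the lower triangle; the object names are illustrative.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

upper.na <- res.cor %>% shave()               # upper triangle set to NA
lower.na <- res.cor %>% shave(upper = FALSE)  # lower triangle set to NA
lower.na
```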

Stretch the data into long format

res.cor %>% stretch()

## # A tibble: 36 x 3
##   x     y           r
##   <chr> <chr>   <dbl>
## 1 mpg   mpg    NA    
## 2 mpg   disp   -0.848
## 3 mpg   hp     -0.775
## 4 mpg   drat    0.681
## 5 mpg   wt     -0.868
## 6 mpg   qsec    0.419
## # … with 30 more rows
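The long table above lists every pair twice and includes the NA diagonal. If each pair should appear only once, one option is to shave the matrix first and drop the missing values while stretching, as in this sketch (unique.pairs is an illustrative name).

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

unique.pairs <- res.cor %>%
  shave() %>%               # blank the upper triangle
  stretch(na.rm = TRUE)     # long format without the NA entries
unique.pairs
```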

Working with correlations using the tidyverse and corrr packages

Visualize the distribution of the correlation coefficients:
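The plotting code is not shown. A minimal sketch, assuming ggplot2 and an arbitrary choice of 10 bins, keeps each pair once and draws a histogram of the coefficients.

```r
library(dplyr)
library(corrr)
library(ggplot2)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

p <- res.cor %>%
  shave() %>%
  stretch(na.rm = TRUE) %>%        # one row per unique pair
  ggplot(aes(x = r)) +
  geom_histogram(bins = 10)        # bin count chosen for illustration
p
```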

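Rearrange and filter the correlation matrix (the output below omits the hp column and the drat row, which contain only missing values):

res.cor %>%
  focus(mpg:drat, mirror = TRUE) %>%
  rearrange() %>%
  shave(upper = FALSE)
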
## # A tibble: 3 x 4
##   rowname     mpg    disp   drat
##   <chr>     <dbl>   <dbl>  <dbl>
## 1 hp       -0.775   0.786 -0.443
## 2 mpg      NA      -0.848  0.681
## 3 disp     NA      NA     -0.710

Interpreting correlations

##   rowname  mpg disp   hp drat   wt qsec
## 1     mpg      -.85 -.77  .68 -.87  .42
## 2    disp -.85       .79 -.71  .89 -.43
## 3      hp -.77  .79      -.44  .65 -.71
## 4    drat  .68 -.71 -.44      -.71  .09
## 5      wt -.87  .89  .65 -.71      -.17
## 6    qsec  .42 -.43 -.71  .09 -.17
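Output in this style (two decimals, no leading zeros, blanks for missing values) is what fashion() produces. The sketch below shows the call on the full matrix; the name pretty.cor is illustrative.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

# Format the correlations for reading; the result is for display only
pretty.cor <- res.cor %>% fashion()
pretty.cor
```

Because fashion() returns formatted text, it is best applied as the last step of a pipeline.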

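res.cor %>%
  focus(mpg:drat, mirror = TRUE) %>%
  rearrange() %>% shave(upper = FALSE) %>%
  fashion()  # the output below omits the empty hp column and drat row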

##   rowname  mpg disp drat
## 1      hp -.77  .79 -.44
## 2     mpg      -.85  .68
## 3    disp           -.71

Make a correlation plot:
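The plotting call is missing here. A minimal sketch uses rplot() with its default settings, which draws one point per variable pair, sized and colored by the correlation.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

p <- rplot(res.cor)   # returns a ggplot object
p
```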

Rearrange, then plot the lower triangle:
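This step can be sketched by chaining rearrange(), shave() and rplot(). The arguments of the original call are unknown, so defaults are used apart from absolute = FALSE.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

p <- res.cor %>%
  rearrange(absolute = FALSE) %>%  # group related variables together
  shave() %>%                      # keep only the lower triangle
  rplot()
p
```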

Make a network plot
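A network view can be sketched with network_plot(), where variables that are more strongly correlated appear closer together and are joined by stronger paths. The threshold min_cor = 0.6 is an assumption chosen for illustration; only correlations at least that strong are drawn.

```r
library(dplyr)
library(corrr)

mydata <- mtcars %>% select(mpg, disp, hp, drat, wt, qsec)
mydata$hp[3] <- NA
res.cor <- correlate(mydata, quiet = TRUE)

# Draw only the paths whose absolute correlation is at least 0.6
p <- network_plot(res.cor, min_cor = 0.6)
p
```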

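Correlating data in databases

Using an SQLite database:

# Open an in-memory SQLite database (requires the DBI, RSQLite and dbplyr packages)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
# Copy mtcars into the database and get a remote table reference
db_mtcars <- copy_to(con, mtcars)
class(db_mtcars)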

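correlate() detects the database backend, uses tidyeval to compute the correlations inside the database, and returns the resulting correlation data frame.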

db_mtcars %>% correlate(use = "complete.obs")

Using Spark:

sc <- sparklyr::spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)
correlate(mtcars_tbl, use = "complete.obs")
