Data analysis example -- how to classify spam email with R

Reposted article. 2017-06-04 11:46. Data Analysis / R

A data analysis example from Coursera -- how to classify spam email with R

Structure of a Data Analysis

Steps of a data analysis

- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/model
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code

A sample

1) The question

Can I automatically detect emails that are SPAM or not?

2) Make the question concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

3) Obtain the data

http://search.r-project.org/library/kernlab/html/spam.html

4) Subsample

# If the kernlab package isn't installed, install it first with install.packages("kernlab")

library(kernlab)

data(spam)

# Perform the subsampling

set.seed(3435)

# One fair coin flip for each of the 4601 rows: 1 = training set, 0 = test set
trainIndicator = rbinom(4601, size = 1, prob = 0.5)

table(trainIndicator)

trainSpam = spam[trainIndicator == 1, ]

testSpam = spam[trainIndicator == 0, ]
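The coin-flip split above can be checked on a small stand-in table. This is only a sketch: the data frame `toy` and its columns are made up for illustration and are not part of the spam data. It shows that every row lands in exactly one of the two sets, and that the two sets are roughly, not exactly, equal in size.

```r
# Hypothetical stand-in for the spam data frame (10 rows, 2 columns)
toy <- data.frame(id = 1:10, value = (1:10)^2)

set.seed(3435)
# One fair coin flip per row: 1 = training set, 0 = test set
indicator <- rbinom(nrow(toy), size = 1, prob = 0.5)

trainToy <- toy[indicator == 1, ]
testToy  <- toy[indicator == 0, ]

# The two subsets are disjoint and together cover all rows
nrow(trainToy) + nrow(testToy) == nrow(toy)
length(intersect(trainToy$id, testToy$id)) == 0
```

If an exact half-and-half split were needed, drawing row indices with sample() would be one alternative.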

5) Exploratory analysis

a) Names: view the column names

names(trainSpam)

b) Head: view the first six rows

head(trainSpam)

c) Summaries: count the emails of each type

table(trainSpam$type)

d) Plots: plot the distribution for spam and non-spam emails

plot(trainSpam$capitalAve ~ trainSpam$type)

The two distributions are hard to tell apart in the plot above, so we take the logarithm and look again

plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
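A small sketch of what the log10(x + 1) transform does, using made-up values rather than the spam data: it compresses a heavily skewed range onto a scale of a few units, and the + 1 keeps zero entries finite.

```r
# Hypothetical skewed values spanning several orders of magnitude
x <- c(0, 1, 9, 99, 999)

log10(x)       # the zero entry becomes -Inf
log10(x + 1)   # finite everywhere; each power of ten adds one unit
```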

e) Look for relationships between the predictors

plot(log10(trainSpam[, 1:4] + 1))

f) Try hierarchical clustering

hCluster = hclust(dist(t(trainSpam[, 1:57])))

plot(hCluster)

Too messy; nothing stands out. As before, take the log and look again

hClusterUpdated = hclust(dist(t(log10(trainSpam[, 1:55] + 1))))

plot(hClusterUpdated)
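The pattern hclust(dist(t(...))) clusters the columns (variables) rather than the rows (emails), because t() turns each column into a row before dist() is applied. A minimal sketch on simulated data, where all variable names and values are hypothetical:

```r
set.seed(1)
base <- rnorm(50)

# Two pairs of variables: a1/a2 follow base, b1/b2 follow -base
toy <- data.frame(
  a1 =  base + rnorm(50, sd = 0.1),
  a2 =  base + rnorm(50, sd = 0.1),
  b1 = -base + rnorm(50, sd = 0.1),
  b2 = -base + rnorm(50, sd = 0.1)
)

# Transpose so that dist() measures distances between variables
hc <- hclust(dist(t(toy)))

# Cut the tree into two groups; similar variables should share a group
groups <- cutree(hc, k = 2)
groups
```

Calling plot(hc) would draw the dendrogram in the same way as in the listings above.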

6) Statistical prediction and modeling

trainSpam$numType = as.numeric(trainSpam$type) - 1

costFunction = function(x, y) sum(x != (y > 0.5))

cvError = rep(NA, 55)

library(boot)

for (i in 1:55) {
    ## Fit a logistic regression on predictor i alone
    lmFormula = reformulate(names(trainSpam)[i], response = "numType")
    glmFit = glm(lmFormula, family = "binomial", data = trainSpam)
    ## 2-fold cross-validated cost for that predictor
    cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}

## Which predictor has minimum cross-validated error?

names(trainSpam)[which.min(cvError)]
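To see what costFunction measures, here is a small sketch on made-up labels and probabilities. It counts the cases where the 0/1 label disagrees with the prediction thresholded at 0.5. The second function, costRate, is a hypothetical variant that is not in the original; it reports the same mistakes as a rate.

```r
# Same cost function as in the listing above: number of misclassified cases
costFunction <- function(x, y) sum(x != (y > 0.5))

# Hypothetical variant (not in the original): the misclassification rate
costRate <- function(x, y) mean(x != (y > 0.5))

observed  <- c(0, 0, 1, 1, 1)            # true labels, 1 = spam
predicted <- c(0.1, 0.7, 0.8, 0.4, 0.9)  # made-up predicted probabilities

costFunction(observed, predicted)  # 2
costRate(observed, predicted)      # 0.4
```

Either form ranks the predictors by how often they misclassify, since cv.glm passes the observed responses and the predicted probabilities to the cost function.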

7) Test

## Use the best model from the group

predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)

## Get predictions on the test set

## type = "response" returns probabilities rather than log-odds
predictionTest = predict(predictionModel, testSpam, type = "response")

predictedSpam = rep("nonspam", dim(testSpam)[1])

## Classify as `spam' for those with prob > 0.5
predictedSpam[predictionTest > 0.5] = "spam"

## Classification table: view the classification results

table(predictedSpam, testSpam$type)

Classification error rate: (61 + 458)/(1346 + 458 + 61 + 449) = 0.2243
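The arithmetic can be reproduced from the four reported counts. This sketch assumes the table layout: rows are predictions, columns are true types, 1346 and 449 are the correct cells, and 61 is non-spam flagged as spam. The source only confirms that 61 and 458 are the two error cells.

```r
# Reported counts, filled column by column (layout is an assumption)
confusion <- matrix(
  c(1346, 61, 458, 449), nrow = 2,
  dimnames = list(predicted = c("nonspam", "spam"),
                  actual    = c("nonspam", "spam"))
)

# Error rate = off-diagonal counts / all counts
errorRate <- 1 - sum(diag(confusion)) / sum(confusion)
round(errorRate, 4)  # 0.2243
```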

8) Interpret results

The fraction of characters that are dollar signs can be used to predict if an email is Spam

Anything with more than 6.6% dollar signs is classified as Spam

More dollar signs always means more Spam under our prediction

Our test set error rate was 22.4%
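A single-predictor logistic regression cut at probability 0.5 amounts to one cutoff on the predictor. The fitted probability is 0.5 exactly where intercept + slope * x = 0, that is at x = -intercept/slope, and with a positive slope everything above that value is classified as spam. The sketch below illustrates this on simulated data; the predictor x, the labels y and the true coefficients are all hypothetical, and it does not reproduce the cutoff quoted above.

```r
set.seed(42)
x <- runif(500, 0, 1)                    # hypothetical single predictor
y <- rbinom(500, 1, plogis(-2 + 5 * x))  # hypothetical 0/1 labels

fit <- glm(y ~ x, family = "binomial")
b <- coef(fit)

# Predictor value at which the fitted probability equals 0.5
threshold <- -b[["(Intercept)"]] / b[["x"]]

# Probability at the threshold, as a check
pAtThreshold <- predict(fit, data.frame(x = threshold), type = "response")

# With a positive slope, "probability > 0.5" and "x > threshold" agree
agree <- all((fitted(fit) > 0.5) == (x > threshold))
```

The same calculation applied to the coefficients of predictionModel would give the charDollar cutoff for the spam model.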

9) Challenge results

10) Synthesize/write up results

11) Create reproducible code
