A simple example!
Environment: CentOS 6.5
Hadoop cluster, Hive, R, RHive; see the documentation elsewhere on this blog for installation and setup details.
Definitions:
Prior probability: the probability derived from previously analyzed data.
Posterior probability: the probability re-estimated after new information arrives. Bayesian classification works with posterior probabilities.
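As a quick numerical illustration of how Bayes' theorem turns a prior into a posterior (the numbers below are made up for illustration and are not from this post):

```r
# Hypothetical numbers: a screening test for a condition with 1% prevalence
prior       <- 0.01   # P(condition) - the prior probability
sensitivity <- 0.95   # P(positive | condition)
false.pos   <- 0.05   # P(positive | no condition)

# Total probability of a positive result (law of total probability)
p.positive <- sensitivity * prior + false.pos * (1 - prior)

# Posterior via Bayes' theorem: P(condition | positive)
posterior <- sensitivity * prior / p.positive
posterior
```

Even with a sensitive test, the posterior stays modest (about 0.16) because the prior is small; this is exactly the prior-to-posterior update that Bayesian classification performs per class.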
Steps of the naive Bayes classification algorithm:
Step 1: Preparation
This stage does the groundwork for naive Bayes classification. The main tasks are to choose feature attributes appropriate to the problem, partition each feature attribute suitably, and manually label a subset of the items to be classified to form the training set.
The input of this stage is all items to be classified; the output is the feature attributes and the training set. The quality of the classifier depends heavily on the choice and partitioning of the feature attributes and on the quality of the training set.
Step 2: Classifier training
The main work is computing the frequency of each class in the training set and the conditional probability estimate of each feature-attribute partition for each class. The input is the feature attributes and the training set; the output is the classifier.
Step 3: Application
This stage uses the classifier to classify new items. The input is the classifier and the items to be classified; the output is the mapping from items to classes.
Note in particular: the core of naive Bayes is its assumption that all components of the feature vector are independent of one another (conditioned on the class).
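A tiny sketch (on made-up data, not the tennis set below) of what the independence assumption means: the factored estimate P(f_1|class) * P(f_2|class) generally differs from the true joint conditional P(f_1, f_2|class), which is the approximation naive Bayes accepts in exchange for only needing per-feature counts:

```r
# Two features and one class column, four toy observations
x1 <- c("a", "a", "b", "b")
x2 <- c("p", "q", "p", "q")
y  <- c("yes", "yes", "yes", "no")

# True joint conditional: P(x1 = "a", x2 = "p" | y = "yes") = 1/3
joint <- sum(x1 == "a" & x2 == "p" & y == "yes") / sum(y == "yes")

# Naive factorization: P(x1 = "a" | yes) * P(x2 = "p" | yes) = (2/3) * (2/3)
naive <- (sum(x1 == "a" & y == "yes") / sum(y == "yes")) *
         (sum(x2 == "p" & y == "yes") / sum(y == "yes"))

c(joint = joint, naive = naive)
```

The two values differ (1/3 vs. 4/9); in practice the factored estimate is often close enough for the arg-max over classes to still pick the right label.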
Example: the R script:

#!/usr/bin/Rscript

# Build the training set
data <- matrix(c("sunny",    "hot",  "high",   "weak",   "no",
                 "sunny",    "hot",  "high",   "strong", "no",
                 "overcast", "hot",  "high",   "weak",   "yes",
                 "rain",     "mild", "high",   "weak",   "yes",
                 "rain",     "cool", "normal", "weak",   "yes",
                 "rain",     "cool", "normal", "strong", "no",
                 "overcast", "cool", "normal", "strong", "yes",
                 "sunny",    "mild", "high",   "weak",   "no",
                 "sunny",    "cool", "normal", "weak",   "yes",
                 "rain",     "mild", "normal", "weak",   "yes",
                 "sunny",    "mild", "normal", "strong", "yes",
                 "overcast", "mild", "high",   "strong", "yes",
                 "overcast", "hot",  "normal", "weak",   "yes",
                 "rain",     "mild", "high",   "strong", "no"),
               byrow = TRUE,
               dimnames = list(day = c(),
                               condition = c("outlook", "temperature",
                                             "humidity", "wind", "playtennis")),
               nrow = 14, ncol = 5);

# Compute the prior probabilities
prior.yes = sum(data[, 5] == "yes") / length(data[, 5]);
prior.no  = sum(data[, 5] == "no")  / length(data[, 5]);

# Naive Bayes model
naive.bayes.prediction <- function(condition.vec) {
  # Calculate the unnormalized posterior probability for playtennis = yes.
  playtennis.yes <-
    sum((data[, 1] == condition.vec[1]) & (data[, 5] == "yes")) / sum(data[, 5] == "yes") *  # P(outlook = f_1 | playtennis = yes)
    sum((data[, 2] == condition.vec[2]) & (data[, 5] == "yes")) / sum(data[, 5] == "yes") *  # P(temperature = f_2 | playtennis = yes)
    sum((data[, 3] == condition.vec[3]) & (data[, 5] == "yes")) / sum(data[, 5] == "yes") *  # P(humidity = f_3 | playtennis = yes)
    sum((data[, 4] == condition.vec[4]) & (data[, 5] == "yes")) / sum(data[, 5] == "yes") *  # P(wind = f_4 | playtennis = yes)
    prior.yes;                                                                               # P(playtennis = yes)

  # Calculate the unnormalized posterior probability for playtennis = no.
  playtennis.no <-
    sum((data[, 1] == condition.vec[1]) & (data[, 5] == "no")) / sum(data[, 5] == "no") *  # P(outlook = f_1 | playtennis = no)
    sum((data[, 2] == condition.vec[2]) & (data[, 5] == "no")) / sum(data[, 5] == "no") *  # P(temperature = f_2 | playtennis = no)
    sum((data[, 3] == condition.vec[3]) & (data[, 5] == "no")) / sum(data[, 5] == "no") *  # P(humidity = f_3 | playtennis = no)
    sum((data[, 4] == condition.vec[4]) & (data[, 5] == "no")) / sum(data[, 5] == "no") *  # P(wind = f_4 | playtennis = no)
    prior.no;                                                                              # P(playtennis = no)

  return(list(post.pr.yes = playtennis.yes,
              post.pr.no  = playtennis.no,
              prediction  = ifelse(playtennis.yes >= playtennis.no, "yes", "no")));
}

# Predict
naive.bayes.prediction(c("overcast", "mild", "normal", "weak"));
Result:

$post.pr.yes
[1] 0.05643739

$post.pr.no
[1] 0

$prediction
[1] "yes"
The predicted result is: yes
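One caveat worth noting: post.pr.no comes out exactly 0 here because "overcast" never co-occurs with playtennis = "no" in the training set, so a single unseen feature value zeroes out the entire product for that class. A common remedy, not used in the script above, is Laplace (add-one) smoothing; a minimal sketch of the idea:

```r
# Laplace (add-one) smoothed conditional probability:
#   P(feature = v | class = c) ~ (count(v, c) + 1) / (count(c) + k)
# where k is the number of distinct values the feature can take.
smoothed.cond.prob <- function(feature.col, value, class.col, class.value) {
  k <- length(unique(feature.col))
  (sum(feature.col == value & class.col == class.value) + 1) /
    (sum(class.col == class.value) + k)
}

# The outlook and playtennis columns of the training set above
outlook    <- c("sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
                "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain")
playtennis <- c("no", "no", "yes", "yes", "yes", "no", "yes",
                "no", "yes", "yes", "yes", "yes", "yes", "no")

# Unsmoothed estimate is 0/5; smoothed estimate is (0 + 1) / (5 + 3) = 0.125
smoothed.cond.prob(outlook, "overcast", playtennis, "no")
```

With smoothing, the "no" posterior stays small but nonzero, so the comparison between classes is driven by all four features rather than collapsing on one unseen count.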