Implementing Naive Bayes in Python


What is Naive Bayes?

Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence among features. For a given training set, it first learns the joint probability distribution of the input and output under the feature conditional independence assumption; then, for a given input x, it uses Bayes' theorem to find the output y with the largest posterior probability.

What is Bayes' rule?

In Bayes' rule, each term has a conventional name:
Pr(A) is the prior (or marginal) probability of A. It is called the "prior" because it does not take any information about B into account.
Pr(A|B) is the conditional probability of A given that B has occurred; because it is derived from the value of B, it is also called the posterior probability of A.
Pr(B|A) is the conditional probability of B given that A has occurred; because it is derived from the value of A, it is also called the posterior probability of B.
Pr(B) is the prior (or marginal) probability of B, and serves as the normalizing constant.
In these terms, Bayes' rule can be stated as:
posterior = (likelihood * prior) / normalizing constant, i.e. Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B). In other words, the posterior probability is proportional to the product of the prior probability and the likelihood.
The ratio Pr(B|A)/Pr(B) is also sometimes called the standardised likelihood, so Bayes' rule can equivalently be stated as:
posterior = standardised likelihood * prior.
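To make the rule concrete, here is a minimal Python sketch with made-up numbers (the spam/word probabilities below are hypothetical, chosen only for illustration):

# Bayes' rule on hypothetical numbers:
# A = "email is spam", B = "email contains the word 'free'".
prior_A = 0.2               # Pr(A): prior probability of spam (assumed)
likelihood_B_given_A = 0.6  # Pr(B|A): probability 'free' appears in spam (assumed)
prior_B = 0.25              # Pr(B): marginal probability of 'free' (assumed)

# posterior = likelihood * prior / normalizing constant
posterior_A_given_B = likelihood_B_given_A * prior_A / prior_B
print(posterior_A_given_B)  # 0.48

# Equivalently: posterior = standardised likelihood * prior
standardised_likelihood = likelihood_B_given_A / prior_B  # Pr(B|A)/Pr(B)
print(standardised_likelihood * prior_A)                  # 0.48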

What is the conditional independence assumption?

It says that once event Z has occurred, whether X occurs carries no information about Y, and whether Y occurs carries no information about X; formally, Pr(X, Y | Z) = Pr(X | Z) * Pr(Y | Z). In Naive Bayes, Z is the class label and the features play the roles of X and Y.
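As a minimal sketch of what the assumption buys us (the per-feature probabilities below are hypothetical): conditional independence lets a joint class-conditional probability over many features collapse into a simple product.

import numpy as np

# Hypothetical per-feature conditional probabilities Pr(x_j | Z) for
# d = 4 features, given that the conditioning event Z has occurred.
p_feature_given_z = np.array([0.7, 0.4, 0.9, 0.2])

# Under conditional independence:
# Pr(x_1, ..., x_d | Z) = Pr(x_1 | Z) * ... * Pr(x_d | Z)
p_joint_given_z = np.prod(p_feature_given_z)
print(p_joint_given_z)  # 0.0504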

What is a joint probability distribution?

The joint probability distribution P(X, Y) assigns a probability to each combination of input X and label Y occurring together; by the chain rule it factors as P(X, Y) = P(Y) * P(X | Y).

How do we get the Naive Bayes model from the joint probability model?

Factor the joint distribution as P(X, Y) = P(Y) * P(X | Y), apply the conditional independence assumption so that P(X | Y) splits into a product of per-feature conditionals, and classify an input x as the class with the largest resulting posterior:

y = argmax over ck of P(Y=ck) * P(X^(1)=x^(1) | Y=ck) * ... * P(X^(n)=x^(n) | Y=ck)

Naive Bayes parameter estimation: maximum likelihood estimation

Under MLE, the prior P(Y=ck) is estimated as the fraction of training samples with label ck, and each conditional P(X^(j)=a | Y=ck) as the fraction of class-ck samples whose j-th feature takes the value a.
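A minimal sketch of these MLE counts on a hypothetical discrete toy dataset (the arrays below are made up for illustration):

import numpy as np

# Hypothetical training data: 6 samples, 2 binary features, 2 classes.
X = np.array([[1, 0], [1, 1], [0, 1],
              [1, 0], [0, 0], [0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

c = 0                     # class of interest
X_c = X[y == c]           # rows whose label equals c

prior = np.mean(y == c)   # MLE prior P(Y=c) = 3/6 = 0.5

# MLE conditionals: fraction of class-c samples with a given feature value
p_x1_eq_1 = np.mean(X_c[:, 0] == 1)  # P(X^(1)=1 | Y=c) = 2/3
p_x2_eq_1 = np.mean(X_c[:, 1] == 1)  # P(X^(2)=1 | Y=c) = 2/3
print(prior, p_x1_eq_1, p_x2_eq_1)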

Description of the Naive Bayes algorithm:

(1) Estimate the prior P(Y=ck) and the conditionals P(X^(j)=a | Y=ck) from the training data (for example by MLE, as above).
(2) For a given input x, compute P(Y=ck) * P(X^(1)=x^(1) | Y=ck) * ... * P(X^(n)=x^(n) | Y=ck) for every class ck.
(3) Output the class with the largest value as the prediction.

A concrete example:
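As a concrete (if made-up) example, the sketch below runs a full classification on the same hypothetical toy data as above: estimate the counts for each class, multiply prior and per-feature conditionals, and take the argmax.

import numpy as np

# Same hypothetical toy data as above.
X = np.array([[1, 0], [1, 1], [0, 1],
              [1, 0], [0, 0], [0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([1, 0])   # query point to classify

best_class, best_score = None, -1.0
for c in np.unique(y):
    X_c = X[y == c]
    prior = np.mean(y == c)                          # P(Y=c)
    # Naive independence: multiply the per-feature conditionals
    likelihood = np.prod([np.mean(X_c[:, j] == x_new[j])
                          for j in range(X.shape[1])])
    score = prior * likelihood                       # proportional to P(Y=c | x_new)
    print("class", c, "unnormalized posterior:", score)
    if score > best_score:
        best_class, best_score = c, score

print("predicted class:", best_class)                # class 1 here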

What problem does maximum likelihood estimation have?

If some feature value never co-occurs with a class in the training data, its MLE conditional probability is 0, and because the per-feature likelihoods are multiplied together, that single zero wipes out the posterior of the entire class no matter what the other features say.

Solving the above problem with Bayesian estimation?

Bayesian estimation adds a pseudo-count λ > 0 to every count: P_λ(X^(j)=a | Y=ck) = (count of samples with X^(j)=a and Y=ck, plus λ) / (Nck + Sj*λ), where Nck is the number of class-ck samples and Sj is the number of distinct values feature j can take; λ = 1 is Laplace smoothing. The smoothed estimates are strictly positive and still sum to 1, so they remain a valid probability distribution; the prior is smoothed analogously as P_λ(Y=ck) = (Nck + λ) / (N + K*λ) with K classes.
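A minimal smoothing sketch: in the toy data above, the value X^(2)=1 never occurs with class 1, so its MLE conditional is exactly 0; with λ = 1 (Laplace smoothing) the estimate stays positive.

# Hypothetical counts taken from the toy data above: among N_c = 3
# class-1 samples, the value X^(2)=1 occurs count = 0 times, and the
# feature has S_j = 2 distinct values (0 and 1).
count, N_c, S_j = 0, 3, 2
lam = 1.0   # lambda = 1 gives Laplace smoothing

p_mle = count / N_c                           # 0.0 -- zeroes out the class
p_bayes = (count + lam) / (N_c + S_j * lam)   # (0+1)/(3+2) = 0.2 > 0
print(p_mle, p_bayes)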

Pros and cons of Naive Bayes?

Pros:
    (1) The Naive Bayes model originates in classical mathematical theory and has stable classification performance.
    (2) It performs well on small datasets, can handle multi-class tasks, and is well suited to incremental training: when the data does not fit in memory, we can train on it batch by batch.
    (3) It is not very sensitive to missing data, the algorithm is simple, and it is commonly used for text classification.
Cons:
    (1) In theory, Naive Bayes has the smallest possible error rate compared with other classifiers. In practice this is often not the case, because the model assumes that attributes are mutually independent given the class, an assumption that rarely holds; when there are many attributes or the attributes are strongly correlated, classification suffers, whereas Naive Bayes performs best when attribute correlations are small. Semi-naive Bayes variants mitigate this by modelling part of the dependence structure.
    (2) The prior probabilities must be known, and they often depend on modelling assumptions; a poorly chosen prior model can therefore hurt predictive performance.
    (3) Since the posterior that drives the classification decision is determined by the prior and the data, the decision carries some error rate.
    (4) It is sensitive to the representation of the input data.

Code implementation:

from __future__ import division, print_function
import numpy as np
import math
from mlfromscratch.utils import train_test_split, normalize
from mlfromscratch.utils import Plot, accuracy_score

class NaiveBayes():
    """The Gaussian Naive Bayes classifier. """
    def fit(self, X, y):
        self.X, self.y = X, y
        self.classes = np.unique(y)
        self.parameters = []
        # Calculate the mean and variance of each feature for each class
        for i, c in enumerate(self.classes):
            # Only select the rows where the label equals the given class
            X_where_c = X[np.where(y == c)]
            self.parameters.append([])
            # Add the mean and variance for each feature (column)
            for col in X_where_c.T:
                parameters = {"mean": col.mean(), "var": col.var()}
                self.parameters[i].append(parameters)

    def _calculate_likelihood(self, mean, var, x):
        """ Gaussian likelihood of the data x given mean and var """
        eps = 1e-4 # Added in denominator to prevent division by zero
        coeff = 1.0 / math.sqrt(2.0 * math.pi * var + eps)
        exponent = math.exp(-(math.pow(x - mean, 2) / (2 * var + eps)))
        return coeff * exponent

    def _calculate_prior(self, c):
        """ Calculate the prior of class c
        (samples where class == c / total number of samples)"""
        frequency = np.mean(self.y == c)
        return frequency

    def _classify(self, sample):
        """ Classification using Bayes Rule P(Y|X) = P(X|Y)*P(Y)/P(X),
            or Posterior = Likelihood * Prior / Scaling Factor

        P(Y|X) - The posterior is the probability that sample x is of class y given the
                 feature values of x being distributed according to distribution of y and the prior.
        P(X|Y) - Likelihood of data X given class distribution Y.
                 Gaussian distribution (given by _calculate_likelihood)
        P(Y)   - Prior (given by _calculate_prior)
        P(X)   - Scales the posterior to make it a proper probability distribution.
                 This term is ignored in this implementation since it doesn't affect
                 which class distribution the sample is most likely to belong to.

        Classifies the sample as the class that results in the largest P(Y|X) (posterior)
        """
        posteriors = []
        # Go through list of classes
        for i, c in enumerate(self.classes):
            # Initialize posterior as prior
            posterior = self._calculate_prior(c)
            # Naive assumption (independence):
            # P(x1,x2,x3|Y) = P(x1|Y)*P(x2|Y)*P(x3|Y)
            # Posterior is product of prior and likelihoods (ignoring scaling factor)
            for feature_value, params in zip(sample, self.parameters[i]):
                # Likelihood of feature value given distribution of feature values given y
                likelihood = self._calculate_likelihood(params["mean"], params["var"], feature_value)
                posterior *= likelihood
            posteriors.append(posterior)
        # Return the class with the largest posterior probability
        return self.classes[np.argmax(posteriors)]

    def predict(self, X):
        """ Predict the class labels of the samples in X """
        y_pred = [self._classify(sample) for sample in X]
        return y_pred

This implements Gaussian Naive Bayes; for the underlying theory see: https://zhuanlan.zhihu.com/p/64498790
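One caveat about _classify above: it multiplies one Gaussian density per feature, and with many features (the digits data has 64) the running product can underflow to zero. A common remedy is to sum log-probabilities instead; below is a minimal sketch of such a variant (the method name _classify_log is my own, not part of the original repo):

import numpy as np

def _classify_log(self, sample):
    """ Log-space variant of _classify: sums log-likelihoods instead of
    multiplying likelihoods, avoiding floating-point underflow.
    The argmax is unchanged because log is monotonic. """
    log_posteriors = []
    for i, c in enumerate(self.classes):
        # log P(Y) + sum_j log P(x_j | Y)
        log_posterior = np.log(self._calculate_prior(c))
        for feature_value, params in zip(sample, self.parameters[i]):
            likelihood = self._calculate_likelihood(params["mean"], params["var"], feature_value)
            log_posterior += np.log(likelihood + 1e-300)  # guard against log(0)
        log_posteriors.append(log_posterior)
    return self.classes[np.argmax(log_posteriors)]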

Next comes the main script:

from __future__ import division, print_function
from sklearn import datasets
import numpy as np
import sys
sys.path.append("/content/drive/My Drive/learn/ML-From-Scratch/")
from mlfromscratch.utils import train_test_split, normalize, accuracy_score, Plot
from mlfromscratch.supervised_learning import NaiveBayes

def main():
    data = datasets.load_digits()
    X = normalize(data.data)
    y = data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

    clf = NaiveBayes()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)

    print ("Accuracy:", accuracy)

    # Reduce dimension to two using PCA and plot the results
    Plot().plot_in_2d(X_test, y_pred, title="Naive Bayes", accuracy=accuracy, legend_labels=data.target_names)

if __name__ == "__main__":
    main()

Output:

Accuracy: 0.9122562674094707

Code source: https://github.com/eriklindernoren/ML-From-Scratch

References:

Baidu Baike

https://blog.csdn.net/qiu_zhi_liao/article/details/90671932

統計學習方法 (Statistical Learning Methods), Li Hang