Naive Bayes Text Classification in C#


Several write-ups on this topic have already been published on cnblogs.

Of these, the most practical is the Pymining version: Pymining can build a reusable model, and its write-up is fairly clear, so the original article is worth reading if you are interested.

Pymining is written in Python. As a C# devotee, I decided to write a C# classifier based on it; so far the naive Bayes classification part has been ported.

Here is a usage example:

var loadModel = ClassiferSetting.LoadExistModel;
//loadModel = true;
Text2Matrix text2Matrix = new Text2Matrix(loadModel);
ChiSquareFilter chiSquareFilter = new ChiSquareFilter(loadModel);
NaiveBayes bayes = new NaiveBayes(loadModel);

if (!loadModel)
{
    Console.WriteLine("Starting model training...");

    //var matrix = text2Matrix.CreateTrainMatrix(new SogouRawTextSource(@"E:\語料下載程序\新聞下載\BaiduCrawl\Code\HtmlTest\Jade.Util\Classifier\SogouC.reduced.20061127\SogouC.reduced\Reduced"));
    var matrix = text2Matrix.CreateTrainMatrix(new TuangouTextSource());

    Console.WriteLine("Running chi-square feature selection...");
    chiSquareFilter.TrainFilter(matrix);

    Console.WriteLine("Training the model...");
    bayes.Train(matrix);
}

var totalCount = 0;
var correctCount = 0;

var tuangouTest = new TuangouTextSource(@"E:\語料下載程序\新聞下載\BaiduCrawl\Code\HtmlTest\Jade.Util\Classifier\test.txt");

while (!tuangouTest.IsEnd)
{
    totalCount++;
    var raw = tuangouTest.GetNextRawText();
    Console.WriteLine("Text: " + raw.Text);
    Console.WriteLine("Labeled category: " + raw.Category);
    var category = GetCategory(raw.Text, bayes, chiSquareFilter, text2Matrix);
    Console.WriteLine("Predicted category: " + category);
    if (raw.Category == category)
    {
        correctCount++;
    }
}

Console.WriteLine("Accuracy: " + correctCount * 100 / totalCount + "%");

Console.ReadLine();

Result: (screenshot of the console output omitted)

 

To make things easier to follow, the main modules and the overall flow are introduced below.

Flowchart (image omitted)

 

Text classification generally proceeds by extracting features from the training set; for text that means word segmentation. Segmentation usually produces far too many terms to use them all as features, so the feature set must first be reduced. A classification algorithm (such as naive Bayes) is then used to build a model, and the model is used to predict the category of new text.
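The steps above can be sketched end to end. The Python snippet below is purely illustrative (none of these names exist in the C# project) and skips feature reduction; it just shows the tokenize → train → predict shape of the pipeline:

```python
# Toy sketch of the flow described above; every name here is
# illustrative, not the actual C# API.
from collections import Counter

def tokenize(text):
    return text.split()

def train(corpus):
    # corpus: list of (text, label); count word frequencies per label
    model = {}
    for text, label in corpus:
        model.setdefault(label, Counter()).update(tokenize(text))
    return model

def predict(model, text):
    # score each label by its word-count overlap with the input
    words = tokenize(text)
    return max(model, key=lambda lab: sum(model[lab][w] for w in words))
```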

Program Structure

The classifier consists of a configuration module, a segmentation module, a feature-selection module, and a classification module. Each is introduced below:

Configuration Module

The Python version stores its configuration in an XML file; the C# version reuses the same format.

<?xml version="1.0" encoding="utf-8" ?>
<config>
<__global__>
<term_to_id>model/term_to_id</term_to_id>
<id_to_term>model/id_to_term</id_to_term>
<id_to_doc_count>model/id_to_doc_count</id_to_doc_count>
<class_to_doc_count>model/class_to_doc_count</class_to_doc_count>
<id_to_idf>model/id_to_idf</id_to_idf>
<newid_to_id>model/newid_to_id</newid_to_id>
<class_to_id>model/class_to_id</class_to_id>
<id_to_class>model/id_to_class</id_to_class>
</__global__>

<__filter__>
<rate>0.3</rate>
<method>max</method>
<log_path>model/filter.log</log_path>
<model_path>model/filter.model</model_path>
</__filter__>

<naive_bayes>
<model_path>model/naive_bayes.model</model_path>
<log_path>model/naive_bayes.log</log_path>
</naive_bayes>

<twc_naive_bayes>
<model_path>model/naive_bayes.model</model_path>
<log_path>model/naive_bayes.log</log_path>
</twc_naive_bayes>

</config>

The configuration mainly stores the file paths of the model files.

Reading the XML is straightforward; for convenience, we define a few classes:

/// <summary>
/// Global configuration
/// </summary>
public class GlobalSetting
{
    public string TermToId { get; set; }
    public string IdToTerm { get; set; }
    public string IdToDocCount { get; set; }
    public string ClassToDocCount { get; set; }
    public string IdToIdf { get; set; }
    public string NewidToId { get; set; }
    public string ClassToId { get; set; }
    public string IdToClass { get; set; }
}

/// <summary>
/// Chi-square filter settings
/// </summary>
public class FilterSetting : TrainModelSetting
{
    /// <summary>
    /// Fraction of features to keep
    /// </summary>
    public double Rate { get; set; }

    /// <summary>
    /// "avg" or "max"
    /// </summary>
    public string Method { get; set; }
}

public class TrainModelSetting
{
    /// <summary>
    /// Log file path
    /// </summary>
    public string LogPath { get; set; }

    /// <summary>
    /// Model file path
    /// </summary>
    public string ModelPath { get; set; }
}

/// <summary>
/// Naive Bayes settings
/// </summary>
public class NaiveBayesSetting : TrainModelSetting
{
}

In addition, a utility class, ClassiferSetting, gives the rest of the program access to the configuration.


Word Segmentation

Feature extraction starts with word segmentation. In C#, the Pangu segmenter can be used directly, with a thin wrapper around it:

public class PanguSegment : ISegment
{
    static PanguSegment()
    {
        PanGu.Segment.Init();
    }

    public List<string> DoSegment(string text)
    {
        PanGu.Segment segment = new PanGu.Segment();
        ICollection<WordInfo> words = segment.DoSegment(text);
        return words.Where(w => w.OriginalWordType != WordType.Numeric)
                    .Select(w => w.Word)
                    .ToList();
    }
}

 

A stop-word filter, StopWordsHandler, can also be added:

public class StopWordsHandler
{
    // Note: most entries of the original stop-word list (single-character
    // words) were lost when the post was extracted; only these survive.
    private static string[] stopWordsList = { " ", "我們", "自己" };

    public static bool IsStopWord(string word)
    {
        // Exact match: the original used IndexOf, which is a substring test
        // and would also reject any word merely containing a stop word.
        return stopWordsList.Contains(word);
    }

    public static void RemoveStopWord(List<string> words)
    {
        words.RemoveAll(word => word.Trim() == string.Empty || stopWordsList.Contains(word));
    }
}

 

Reading the Training Set

Classification is not done out of thin air; it relies on prior knowledge, i.e. probabilities computed from a training set.

To keep things generic, we define a RawText class to represent a raw corpus entry:

public class RawText
{
    public string Text { get; set; }
    public string Category { get; set; }
}


Then the IRawTextSource interface represents a training set; the IsEnd property should make its usage obvious:

public interface IRawTextSource
{
    bool IsEnd { get; }
    RawText GetNextRawText();
}

For the Sogou corpus (download link in the original post), a SogouRawTextSource reader can be used; its code is omitted here.

 

Similarly, a reader class for the training-set format used by the Python version can be written; its code is omitted here.

Building the Matrix

Before introducing the matrix itself, one more object is needed: GlobalInfo, which stores data recorded while building the matrix, such as the term-to-id mapping.

Unlike the Python version, the C# GlobalInfo uses the singleton pattern for easier access (its code is omitted here).

 

From here on we enter the core part.

This part builds an m × n matrix representing the data: each row is a document, each column a feature (term).

Categories inside the matrix is an m × 1 matrix holding each document's category id.

Unlike the Python version, to save myself some effort the matrix object also carries the per-document categories (mea culpa); a FeatureWords property was also added to make inspecting the feature terms easier.

public class Matrix
{
    /// <summary>
    /// Row count, i.e. the number of samples
    /// </summary>
    public int RowsCount { get; private set; }

    /// <summary>
    /// Column count, i.e. the number of terms (features)
    /// </summary>
    public int ColsCount { get; private set; }

    /// <summary>
    /// Prefix sums of per-document term counts:
    /// [0] = 0, [1] = [0] + count(1), [2] = [1] + count(2), ...
    /// </summary>
    public List<int> Rows;

    /// <summary>
    /// Term ids; together with Rows this separates the documents
    /// </summary>
    public List<int> Cols;

    /// <summary>
    /// One-to-one with Cols: the term's frequency within that document
    /// </summary>
    public List<int> Vals;

    /// <summary>
    /// Each document's category, aligned with Rows
    /// </summary>
    public List<int> Categories;

    public Matrix(List<int> rows, List<int> cols, List<int> vals, List<int> categories)
    {
        this.Rows = rows;
        this.Cols = cols;
        this.Vals = vals;
        this.Categories = categories;
        if (rows != null && rows.Count > 0)
            this.RowsCount = rows.Count - 1;
        if (cols != null && cols.Count > 0)
            this.ColsCount = cols.Max() + 1;
    }

    private List<string> featureWords;
    public List<string> FeatureWords
    {
        get
        {
            if (Cols != null)
            {
                featureWords = new List<string>();
                Cols.ForEach(col => featureWords.Add(GlobalInfo.Instance.IdToTerm[col]));
            }
            return featureWords;
        }
    }
}
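The Rows/Cols/Vals triple is essentially compressed sparse row (CSR) storage. A small Python illustration (toy data, not the C# code) of how the three lists encode per-document term frequencies:

```python
# Two documents: doc 0 contains term 0 twice and term 3 once;
# doc 1 contains term 1 once and term 3 four times.
rows = [0, 2, 4]     # prefix sums: doc i owns entries rows[i] .. rows[i+1]-1
cols = [0, 3, 1, 3]  # term ids, sorted within each document
vals = [2, 1, 1, 4]  # term frequencies, aligned with cols

def doc_terms(i):
    """Return {termId: frequency} for document i."""
    return dict(zip(cols[rows[i]:rows[i + 1]], vals[rows[i]:rows[i + 1]]))
```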

 

Make sure you understand what Rows and Cols represent in Matrix. Now let's see how the matrix is generated; the code is fairly long:

public Matrix CreateTrainMatrix(IRawTextSource textSource)
{
    var rows = new List<int>();
    rows.Add(0);
    var cols = new List<int>();
    var vals = new List<int>();
    var categories = new List<int>();
    // Pangu segmenter
    var segment = new PanguSegment();

    while (!textSource.IsEnd)
    {
        var rawText = textSource.GetNextRawText();

        if (rawText != null)
        {
            int classId;

            // handle the category
            if (GlobalInfo.Instance.ClassToId.ContainsKey(rawText.Category))
            {
                classId = GlobalInfo.Instance.ClassToId[rawText.Category];
                GlobalInfo.Instance.ClassToDocCount[classId] += 1;
            }
            else
            {
                classId = GlobalInfo.Instance.ClassToId.Count;
                GlobalInfo.Instance.ClassToId.Add(rawText.Category, classId);
                GlobalInfo.Instance.IdToClass.Add(classId, rawText.Category);
                GlobalInfo.Instance.ClassToDocCount.Add(classId, 1);
            }

            categories.Add(classId);

            var text = rawText.Text;

            // segment
            var wordList = segment.DoSegment(text);

            // remove stop words
            StopWordsHandler.RemoveStopWord(wordList);
            var partCols = new List<int>();
            var termFres = new Dictionary<int, int>();
            wordList.ForEach(word =>
            {
                int termId;
                if (!GlobalInfo.Instance.TermToId.ContainsKey(word))
                {
                    termId = GlobalInfo.Instance.IdToTerm.Count;
                    GlobalInfo.Instance.TermToId.Add(word, termId);
                    GlobalInfo.Instance.IdToTerm.Add(termId, word);
                }
                else
                {
                    termId = GlobalInfo.Instance.TermToId[word];
                }

                // partCols records the term ids
                if (!partCols.Contains(termId))
                {
                    partCols.Add(termId);
                }

                // termFres records how often each term id occurs
                if (!termFres.ContainsKey(termId))
                {
                    termFres[termId] = 1;
                }
                else
                {
                    termFres[termId] += 1;
                }
            });

            partCols.Sort();
            partCols.ForEach(col =>
            {
                cols.Add(col);
                vals.Add(termFres[col]);
                if (!GlobalInfo.Instance.IdToDocCount.ContainsKey(col))
                {
                    GlobalInfo.Instance.IdToDocCount.Add(col, 1);
                }
                else
                {
                    GlobalInfo.Instance.IdToDocCount[col] += 1;
                }
            });
            // fill rows: rows records the cumulative term count of the first n documents
            rows.Add(rows[rows.Count - 1] + partCols.Count);
        }
    }

    // fill GlobalInfo's IdToIdf. A term's IDF is the logarithm of the total
    // document count divided by the number of documents containing the term.
    // (Cast to double first; the original divided two ints, truncating the quotient.)
    foreach (var termId in GlobalInfo.Instance.TermToId.Values)
    {
        GlobalInfo.Instance.IdToIdf[termId] =
            Math.Log((double)(rows.Count - 1) / (GlobalInfo.Instance.IdToDocCount[termId] + 1));
    }

    this.Save();

    this.IsTrain = true;

    return new Matrix(rows, cols, vals, categories);
}
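The IDF filled in at the end of training follows the usual log(N / (df + 1)) form, where N is the total document count and df the number of documents containing the term. A minimal sketch of just that formula (illustrative only):

```python
import math

# IDF of a term appearing in doc_freq of n_docs documents, matching the
# formula used at the end of training (note the +1 in the denominator).
def idf(n_docs, doc_freq):
    return math.log(n_docs / (doc_freq + 1))
```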

 

Feature Dimensionality Reduction

Choosing suitable features matters a great deal for classification accuracy; the C# version uses chi-square feature selection.

Chi-square formula:
t: term
c: category

                  N * (AD - CB)^2
X^2(t, c) = --------------------------
             (A+C)(B+D)(A+B)(C+D)

A, B, C, D are document counts:
A: belongs to c, includes t
B: does not belong to c, includes t
C: belongs to c, does not include t
D: does not belong to c, does not include t

B = t's doc-count - A
C = c's doc-count - A
D = N - A - B - C

The score of t can then be computed in one of two ways:
X^2(t) = sigma_i p(ci) X^2(t, ci)   (avg)
X^2(t) = max_c { X^2(t, c) }        (max)
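As a sanity check, the statistic can be computed directly from the four counts. This small Python helper mirrors the formula (counts are made up for illustration):

```python
# Chi-square of a (term, category) pair from the four document counts
# defined above; N = A + B + C + D.
def chi_square(a, b, c, d):
    n = a + b + c + d
    num = n * (a * d - c * b) ** 2
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return num / den
```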

Here is the corresponding code. Once it finishes, the selected feature words are written to the log file:

/// <summary>
/// Training.
/// Chi-square formula:
/// t: term
/// c: category
///
///                   N * (AD - CB)^2
/// X^2(t, c) = --------------------------
///              (A+C)(B+D)(A+B)(C+D)
///
/// A, B, C, D are document counts:
/// A: belongs to c, includes t
/// B: does not belong to c, includes t
/// C: belongs to c, does not include t
/// D: does not belong to c, does not include t
///
/// B = t's doc-count - A
/// C = c's doc-count - A
/// D = N - A - B - C
/// The score of t can be calculated by either of:
/// X^2(t) = sigma_i p(ci) X^2(t, ci)   (avg)
/// X^2(t) = max_c { X^2(t, c) }        (max)
/// </summary>
/// <param name="matrix"></param>
public void TrainFilter(Matrix matrix)
{
    if (matrix.RowsCount != matrix.Categories.Count)
    {
        throw new Exception("ERROR! matrix.RowsCount should equal matrix.Categories.Count");
    }

    var distinctCategories = matrix.Categories.Distinct().ToList();
    distinctCategories.Sort();

    // create a table storing X^2(t, c) and a table storing
    // A (belongs to c and includes t): two 2-D arrays
    ChiTable = new List<List<double>>();
    for (var i = 0; i < distinctCategories.Count; i++)
    {
        var data = new List<double>();
        for (var j = 0; j < matrix.ColsCount; j++)
        {
            data.Add(0);
        }
        ChiTable.Add(data);
    }

    // ATable[category][term] = count
    // (Build fresh inner lists: the original shared ChiTable's inner lists
    // by reference, so filling ATable silently overwrote ChiTable too.)
    ATable = ChiTable.Select(row => new List<double>(row)).ToList();

    for (var row = 0; row < matrix.RowsCount; row++)
    {
        for (var col = matrix.Rows[row]; col < matrix.Rows[row + 1]; col++)
        {
            var categoryId = matrix.Categories[row];
            var termId = matrix.Cols[col];
            ATable[categoryId][termId] += 1;
        }
    }

    // total document count
    var n = matrix.RowsCount;

    // compute chi-square
    for (var t = 0; t < matrix.ColsCount; t++)
    {
        for (var cc = 0; cc < distinctCategories.Count; cc++)
        {
            // docs in category cc that contain term t. Index by t directly:
            // the original used matrix.Cols[t], confusing sparse-array
            // positions with term ids.
            var a = ATable[distinctCategories[cc]][t];
            var b = GlobalInfo.Instance.IdToDocCount[t] - a;                         // contain t, not in cc
            var c = GlobalInfo.Instance.ClassToDocCount[distinctCategories[cc]] - a; // in cc, without t
            var d = n - a - b - c;                                                   // neither in cc nor contain t
            // get X^2(t, c); the +1 terms avoid division by zero
            var numerator = n * (a * d - c * b) * (a * d - c * b) + 1;
            var denominator = (a + c) * (b + d) * (a + b) * (c + d) + 1;
            ChiTable[distinctCategories[cc]][t] = numerator / denominator;
        }
    }

    // chiScore[t][0] = score, chiScore[t][1] = column index
    var chiScore = new List<List<double>>();
    for (var i = 0; i < matrix.ColsCount; i++)
    {
        var c = new List<double>();
        for (var j = 0; j < 2; j++)
        {
            c.Add(0);
        }
        chiScore.Add(c);
    }

    // with the "avg" method, the final score is X^2(t) = sigma p(ci) X^2(t, ci),
    // where p(ci) is the class prior
    if (this.Method == "avg")
    {
        // build the class priors: priorC[category] = categoryCount / n
        var priorC = new double[distinctCategories.Count + 1];
        for (var i = 0; i < distinctCategories.Count; i++)
        {
            priorC[distinctCategories[i]] = (double)GlobalInfo.Instance.ClassToDocCount[distinctCategories[i]] / n;
        }

        // compute the scores
        for (var t = 0; t < matrix.ColsCount; t++)
        {
            chiScore[t][1] = t;
            for (var c = 0; c < distinctCategories.Count; c++)
            {
                chiScore[t][0] += priorC[distinctCategories[c]] * ChiTable[distinctCategories[c]][t];
            }
        }
    }
    else
    {
        // method == "max": the score of t is its maximum over all categories
        for (var t = 0; t < matrix.ColsCount; t++)
        {
            chiScore[t][1] = t;
            for (var c = 0; c < distinctCategories.Count; c++)
            {
                if (chiScore[t][0] < ChiTable[distinctCategories[c]][t])
                    chiScore[t][0] = ChiTable[distinctCategories[c]][t];
            }
        }
    }

    // sort by score, highest first
    chiScore.Sort(new ScoreCompare());
    chiScore.Reverse();

    #region build the id map
    var idMap = new int[matrix.ColsCount];

    // mark un-selected feature ids
    for (var i = (int)(ClassiferSetting.FilterSetting.Rate * chiScore.Count); i < chiScore.Count; i++)
    {
        // unselected terms are marked with -1
        var termId = chiScore[i][1];
        idMap[(int)termId] = -1;
    }
    var offset = 0;
    for (var t = 0; t < matrix.ColsCount; t++)
    {
        if (idMap[t] < 0)
        {
            offset += 1;
        }
        else
        {
            idMap[t] = t - offset;
            GlobalInfo.Instance.NewIdToId[t - offset] = t;
        }
    }

    this.IdMap = new List<int>(idMap);
    #endregion

    StringBuilder stringBuilder = new StringBuilder();
    stringBuilder.AppendLine("chiSquare info:");
    stringBuilder.AppendLine("=======selected========");
    for (var i = 0; i < chiScore.Count; i++)
    {
        if (i == (int)(ClassiferSetting.FilterSetting.Rate * chiScore.Count))
        {
            stringBuilder.AppendLine("========unselected=======");
        }
        var term = GlobalInfo.Instance.IdToTerm[(int)chiScore[i][1]];
        var score = chiScore[i][0];
        stringBuilder.AppendLine(string.Format("{0} {1}", term, score));
    }
    File.WriteAllText(ClassiferSetting.FilterSetting.LogPath, stringBuilder.ToString());

    GlobalInfo.Instance.Save();

    this.Save();

    this.IsTrain = true;
}

 

The Bayes Algorithm

See the articles recommended at the beginning for the details; for our purposes, knowing P(C|X) = P(X|C)P(C)/P(X) is enough.
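A quick numeric illustration of that rule (toy numbers, not part of the port): the posterior of each class is the product of its likelihood and prior, normalized by the evidence P(X).

```python
# Bayes' rule: P(C_i|X) = P(X|C_i) P(C_i) / P(X),
# where P(X) = sum_i P(X|C_i) P(C_i).
def posterior(likelihoods, priors):
    """likelihoods[i] = P(X|C_i), priors[i] = P(C_i)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(X)
    return [j / evidence for j in joint]
```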

Here is the implementation:

public List<List<double>> vTable { get; set; }

public List<double> Prior { get; set; }

public void Train(Matrix matrix)
{
    if (matrix.RowsCount != matrix.Categories.Count)
    {
        throw new Exception("ERROR! matrix.RowsCount should equal matrix.Categories.Count");
    }

    // calculate the prior of each class
    // 1. init cPrior
    var distinctCategories = matrix.Categories.Distinct().ToList();
    distinctCategories.Sort();
    var cPrior = new double[distinctCategories.Count + 1];

    // 2. fill cPrior (raw document counts for now)
    matrix.Categories.ForEach(classid => cPrior[classid] += 1);

    // calculate the likelihood of each term
    // 1. init vTable: vTable[termId][category]
    vTable = new List<List<double>>();
    for (var i = 0; i < matrix.ColsCount; i++)
    {
        var data = cPrior.Select(t => 0d).ToList();
        vTable.Add(data);
    }

    // 2. fill vTable
    for (var i = 0; i < matrix.RowsCount; i++)
    {
        for (var j = matrix.Rows[i]; j < matrix.Rows[i + 1]; j++)
        {
            vTable[matrix.Cols[j]][matrix.Categories[i]] += 1;
        }
    }

    // normalize vTable: P(x|c) = term count / class document count
    for (var i = 0; i < matrix.ColsCount; i++)
    {
        for (var j = 0; j < cPrior.Length; j++)
        {
            if (cPrior[j] > 1e-10)
                vTable[i][j] /= cPrior[j];
        }
    }

    // normalize cPrior: P(C) = class count / total count
    for (var i = 0; i < cPrior.Length; i++)
    {
        cPrior[i] /= matrix.Categories.Count;
    }

    this.Prior = new List<double>(cPrior);

    this.IsTrain = true;

    this.Save();
}

 

Prediction

Quoting the original author:

PyMining's training and testing can run independently: you can train a model first and test later whenever needed, so some data produced during training (for example the chi-square filter's blacklist) is saved to files. To run the test program on its own, refer to the code below. After calling NaiveBayes.Test, the returned resultY is an m × 1 matrix (m being the number of test documents) giving the label (0, 1, 2, 3, ...) the model assigns to each test document, and precision is the test accuracy.

 

Prediction first builds a matrix, much as during training:

public Matrix CreatePredictSample(string text)
{
    if (!this.IsTrain)
    {
        throw new Exception("Please train the model first");
    }

    // Pangu segmenter
    var segment = new PanguSegment();
    // segment
    var wordList = segment.DoSegment(text);

    // remove stop words
    StopWordsHandler.RemoveStopWord(wordList);
    var cols = new List<int>();
    var vals = new List<int>();
    var partCols = new List<int>();
    var termFres = new Dictionary<int, int>();
    wordList.ForEach(word =>
    {
        int termId;
        if (GlobalInfo.Instance.TermToId.ContainsKey(word))
        {
            termId = GlobalInfo.Instance.TermToId[word];

            if (!partCols.Contains(termId))
                partCols.Add(termId);

            // termFres records how often each term id occurs
            if (!termFres.ContainsKey(termId))
            {
                termFres[termId] = 1;
            }
            else
            {
                termFres[termId] += 1;
            }
        }
    });

    partCols.Sort();
    partCols.ForEach(col =>
    {
        cols.Add(col);
        vals.Add(termFres[col]);
    });

    return new Matrix(null, cols, vals, null);
}



The matrix is then reduced, keeping only the terms selected by the chi-square filter as features:

public void SampleFilter(Matrix matrix)
{
    if (!this.IsTrain)
    {
        throw new Exception("Please train the model first");
    }

    // filter the sample
    var newCols = new List<int>();
    var newVals = new List<int>();
    for (var c = 0; c < matrix.Cols.Count; c++)
    {
        if (IdMap[matrix.Cols[c]] >= 0)
        {
            newCols.Add(matrix.Cols[c]);
            newVals.Add(matrix.Vals[c]);
        }
    }
    matrix.Vals = newVals;
    matrix.Cols = newCols;
}


Finally, the selected features are handed to the Bayes algorithm, and the highest-scoring class is taken as the result:

/// <summary>
/// Test a sample
/// </summary>
/// <param name="matrix"></param>
/// <returns></returns>
public string TestSample(Matrix matrix)
{
    var targetP = new List<double>();
    var maxP = -1000000000d;
    var best = -1;
    // find the class maximizing P(C) * P(X|C)
    for (var target = 0; target < this.Prior.Count; target++)
    {
        var curP = 100D; // scale up by 100 to delay underflow
        curP *= this.Prior[target];

        for (var c = 0; c < matrix.Cols.Count; c++)
        {
            if (this.vTable[matrix.Cols[c]][target] == 0)
            {
                curP *= 1e-7; // smooth terms unseen in this class
            }
            else
            {
                curP *= vTable[matrix.Cols[c]][target];
            }
        }
        targetP.Add(curP);
        if (curP > maxP)
        {
            best = target;
            maxP = curP;
        }
    }

    return GlobalInfo.Instance.IdToClass[best];
}
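One caveat: multiplying many small probabilities risks floating-point underflow even with the ×100 scaling and the 1e-7 smoothing. A common alternative, not used in this port, is to sum log-probabilities instead, which cannot underflow and preserves the argmax:

```python
import math

# Log-space scoring sketch: log P(C) + sum of log P(t|C) over the
# observed terms, with a floor for terms unseen in class C.
def log_score(prior, term_probs, floor=1e-7):
    """prior = P(C); term_probs = P(t|C) for each observed term (0 allowed)."""
    s = math.log(prior)
    for p in term_probs:
        s += math.log(p if p > 0 else floor)
    return s
```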


Saving the Model

Computing the model can take quite a while, especially on a large training set, so it pays to save the trained model.

Here is the code that saves the Bayes model:

/// <summary>
/// Naive Bayes model
/// </summary>
[Serializable]
public class NaiveBayesModel
{
    public List<List<double>> vTable { get; set; }
    public List<double> Prior { get; set; }
}

public override void Save()
{
    try
    {
        var model = new NaiveBayesModel { vTable = this.vTable, Prior = this.Prior };
        SerializeHelper helper = new SerializeHelper();
        helper.ToBinaryFile(model, ClassiferSetting.NaiveBayesSetting.ModelPath);
    }
    catch
    {
        Console.WriteLine("Failed to save the Naive Bayes model");
    }
}

public override void Load()
{
    try
    {
        Console.WriteLine("Loading the Naive Bayes model...");
        SerializeHelper helper = new SerializeHelper();
        var model = (NaiveBayesModel)helper.FromBinaryFile<NaiveBayesModel>(ClassiferSetting.NaiveBayesSetting.ModelPath);
        this.vTable = model.vTable;
        this.Prior = model.Prior;
    }
    catch
    {
        Console.WriteLine("Failed to load the Naive Bayes model");
    }
}

 

Source code download (link in the original post); please supply your own data.

 

Comments and questions are welcome.

