Sentiment Analysis with ML.NET [Beginner's Edition]: A Follow-up


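After I finished "Sentiment Analysis with ML.NET [Beginner's Edition]", a helpful reader asked why the example did not use Chinese, since what people really need to know is how to preprocess a Chinese dataset. That is a fair point, so I adjusted the code slightly to serve as a demonstration.

First, we need a good word-segmentation library, so use NuGet to add a reference to the JiebaNet.Analyser package, which is a version that supports .NET Core.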

 

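Next, find some Chinese text for the training and validation datasets and make sure it is correctly labelled (0 for negative, 1 for positive); the more data, the better. The content looks something like this: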

最差的是三文魚生魚片。 0
我在這里吃了一頓非常可口的早餐。 1
這是拉斯維加斯最好的食物之一。 1
但即使是天才也無法挽救這一點。 0
我認為湯姆漢克斯是一位出色的演員。 1
...
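
Each line in this format is a sentence followed by whitespace and a 0/1 label. As a minimal sketch of reading one such line, the helper below splits at the last whitespace character so that a sentence containing spaces is still kept whole. The names `LabelledLine` and `TryParse` are hypothetical and are not part of the original project; accepting only the labels 0 and 1 is an assumption based on the sample above.

```csharp
using System;

public static class LabelledLine
{
    // Splits "sentence<whitespace>label" at the LAST space or tab, so a
    // sentence that itself contains spaces is not broken apart.
    // Returns false for blank lines, lines without a label, or labels
    // other than 0 and 1 (an assumption based on the sample data).
    public static bool TryParse(string line, out string text, out int label)
    {
        text = null;
        label = -1;
        if (string.IsNullOrWhiteSpace(line)) return false;

        line = line.Trim();
        int cut = line.LastIndexOfAny(new[] { ' ', '\t' });
        if (cut <= 0) return false;

        string rawLabel = line.Substring(cut + 1);
        if (rawLabel != "0" && rawLabel != "1") return false;

        text = line.Substring(0, cut).TrimEnd();
        label = rawLabel == "1" ? 1 : 0;
        return true;
    }
}
```

A line such as the "..." placeholder above has no label and is rejected, which makes it easy to count and report malformed lines rather than silently skipping them.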

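Add a word-segmentation function:

// Requires: using System; using System.IO; using JiebaNet.Segmenter;
public static void Segment(string source, string result)
{
    var segmenter = new JiebaSegmenter();
    using (var reader = new StreamReader(source))
    using (var writer = new StreamWriter(result))
    {
        string line;
        // Read until end of file; ReadLine returns null only at EOF.
        while ((line = reader.ReadLine()) != null)
        {
            // Skip blank lines instead of stopping at the first one.
            if (string.IsNullOrWhiteSpace(line)) continue;
            // Each line is expected to be "<sentence> <label>".
            var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length != 2) continue;
            // Cut the sentence into words and write "word word ...<TAB>label".
            var segments = segmenter.Cut(parts[0]);
            writer.WriteLine("{0}\t{1}", string.Join(" ", segments), parts[1]);
        }
    }
}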

The original file paths need to be adjusted as follows, adding two output paths for the segmented files:

const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";
const string _dataTrainPath = @".\data\sentiment labelled sentences\imdb_labelled_result.txt";
const string _testTargetPath = @".\data\sentiment labelled sentences\yelp_labelled_result.txt";

Call the function from Main:

Segment(_dataPath, _dataTrainPath);
Segment(_testDataPath, _testTargetPath);

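Change the prediction data to:

IEnumerable<SentimentData> sentiments = new[]
{
    new SentimentData
    {
        // "Today's task is not easy at all"
        SentimentText = "今天的任務並不輕松",
        Sentiment = 0 // placeholder; the label is not used for prediction
    },
    new SentimentData
    {
        // "I really want to see you"
        SentimentText = "我非常想見到你",
        Sentiment = 0
    },
    new SentimentData
    {
        // "I really did not see it clearly"
        SentimentText = "實在是我沒有看清楚",
        Sentiment = 0
    }
};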
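
One step this snippet does not show: the model is trained on space-separated tokens, so the prediction text needs the same segmentation before it is passed to the model (the complete listing at the end does this with JiebaSegmenter inside Predict). Here is a minimal sketch of that step, with the tokenizer passed in as a delegate so the example runs without the NuGet package. The names `InputPreparation` and `ToSpaceSeparated` are hypothetical and not part of the original project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InputPreparation
{
    // Turns raw text into the space-separated token form used for training.
    // `cut` is whatever tokenizer produced the training data; it is injected
    // so this helper does not depend on a particular segmentation library.
    public static string ToSpaceSeparated(string text, Func<string, IEnumerable<string>> cut)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (cut == null) throw new ArgumentNullException(nameof(cut));

        // Drop empty or whitespace-only tokens so the result never
        // contains doubled spaces.
        var tokens = cut(text).Where(t => !string.IsNullOrWhiteSpace(t));
        return string.Join(" ", tokens);
    }
}
```

With the segmenter from this article, the call would look like `InputPreparation.ToSpaceSeparated(item.SentimentText, s => segmenter.Cut(s))`. Using one helper for both training and prediction input keeps the two code paths from drifting apart.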

With everything in place, the run produces the following result:

[Screenshot of the console output]

Not bad, is it? :)

Not long ago the .NET Blog also published an article about ML.NET, "Introducing ML.NET: Cross-platform, Proven and Open Source Machine Learning Framework". Here is the part about the roadmap:

The Road Ahead

There are many capabilities we aspire to add to ML.NET, but we would love to understand what will best fit your needs. The current areas we are exploring are:

  • Additional ML Tasks and Scenarios
  • Deep Learning with TensorFlow & CNTK
  • ONNX support
  • Scale-out on Azure
  • Better GUI to simplify ML tasks
  • Integration with VS Tools for AI
  • Language Innovation for .NET

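As the list shows, with ONNX support, models from more machine learning frameworks such as TensorFlow, CNTK, and even PyTorch can be shared. Together with the steadily growing set of supported scenarios, ML.NET will become more and more practical, and machine learning services already built in other languages should be able to move smoothly to .NET Core for integration. Something to look forward to!

As usual, here are the project structure and the complete code.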

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using JiebaNet.Segmenter;
using System.IO;

namespace SentimentAnalysis
{
    class Program
    {
        const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
        const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";
        const string _dataTrainPath = @".\data\sentiment labelled sentences\imdb_labelled_result.txt";
        const string _testTargetPath = @".\data\sentiment labelled sentences\yelp_labelled_result.txt";

        public class SentimentData
        {
            [Column(ordinal: "0")]
            public string SentimentText;
            [Column(ordinal: "1", name: "Label")]
            public float Sentiment;
        }

        public class SentimentPrediction
        {
            [ColumnName("PredictedLabel")]
            public bool Sentiment;
        }

        public static PredictionModel<SentimentData, SentimentPrediction> Train()
        {
            var pipeline = new LearningPipeline();
            pipeline.Add(new TextLoader<SentimentData>(_dataTrainPath, useHeader: false, separator: "tab"));
            pipeline.Add(new TextFeaturizer("Features", "SentimentText"));

            var featureSelector = new FeatureSelectorByCount() { Column = new[] { "Features" } };
            pipeline.Add(featureSelector);

            pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

            PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();
            return model;
        }

        public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            var testData = new TextLoader<SentimentData>(_testTargetPath, useHeader: false, separator: "tab");
            var evaluator = new BinaryClassificationEvaluator();
            BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);
            Console.WriteLine();
            Console.WriteLine("PredictionModel quality metrics evaluation");
            Console.WriteLine("------------------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
        }

        public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            IEnumerable<SentimentData> sentiments = new[]
            {
                new SentimentData
                {
                    SentimentText = "今天的任務並不輕松",
                    Sentiment = 0
                },
                new SentimentData
                {
                    SentimentText = "我非常想見到你",
                    Sentiment = 0
                },
                new SentimentData
                {
                    SentimentText = "實在是我沒有看清楚",
                    Sentiment = 0
                }
            };

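            // Segment the input the same way as the training data, so the
            // featurizer sees space-separated tokens.
            var segmenter = new JiebaSegmenter();
            foreach (var item in sentiments)
            {
                item.SentimentText = string.Join(" ", segmenter.Cut(item.SentimentText));
            }
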
            IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);
            Console.WriteLine();
            Console.WriteLine("Sentiment Predictions");
            Console.WriteLine("---------------------");

            var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
            foreach (var item in sentimentsAndPredictions)
            {
                Console.WriteLine($"Sentiment: {item.sentiment.SentimentText.Replace(" ", string.Empty)} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
            }
            Console.WriteLine();
        }

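        public static void Segment(string source, string result)
        {
            var segmenter = new JiebaSegmenter();
            using (var reader = new StreamReader(source))
            using (var writer = new StreamWriter(result))
            {
                string line;
                // Read until end of file; ReadLine returns null only at EOF.
                while ((line = reader.ReadLine()) != null)
                {
                    // Skip blank lines instead of stopping at the first one.
                    if (string.IsNullOrWhiteSpace(line)) continue;
                    // Each line is expected to be "<sentence> <label>".
                    var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
                    if (parts.Length != 2) continue;
                    // Cut the sentence into words and write "word word ...<TAB>label".
                    var segments = segmenter.Cut(parts[0]);
                    writer.WriteLine("{0}\t{1}", string.Join(" ", segments), parts[1]);
                }
            }
        }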

        static void Main(string[] args)
        {
            Segment(_dataPath, _dataTrainPath);
            Segment(_testDataPath, _testTargetPath);
            var model = Train();
            Evaluate(model);
            Predict(model);
        }
    }
}

 

