Sentiment Analysis with ML.NET [Beginner's Guide]: A Follow-up

Reposted article. Published 2018-05-12 16:44. Tags: .NET Core, Chinese, ML.NET, machine learning, beginner, regression

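After I finished "Sentiment Analysis with ML.NET [Beginner's Guide]", a helpful reader asked why the example did not use Chinese, since what people really need to know is how to preprocess a Chinese dataset. That is a fair point, so I adjusted the code slightly to serve as a demonstration.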

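First, we need a good word segmentation library, because Chinese is written without spaces between words. Use NuGet to add a reference to the JiebaNet.Analyser package, which is a version that supports .NET Core.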

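Next, find Chinese content for the training and validation datasets and make sure it is labeled correctly; the more data, the better. Each line holds one sentence, a space, and a label (0 for negative, 1 for positive), like this: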

最差的是三文魚生魚片。 0
我在這里吃了一頓非常可口的早餐。 1
這是拉斯維加斯最好的食物之一。 1
但即使是天才也無法挽救這一點。 0
我認為湯姆漢克斯是一位出色的演員。 1
...
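
To make the line format concrete, here is a minimal parsing sketch. It is not from the original post: `LabelledLine` and `TryParse` are hypothetical names introduced for illustration. It assumes the label is always the last field and is either 0 or 1, and it splits at the last whitespace character, so a sentence that itself contains spaces still parses.

```csharp
using System;

// Hypothetical helper (not part of the original post): splits one dataset
// line into its sentence and its 0/1 label at the LAST whitespace character.
public static class LabelledLine
{
    public static bool TryParse(string line, out string text, out int label)
    {
        text = null;
        label = -1;
        if (string.IsNullOrWhiteSpace(line)) return false;

        var trimmed = line.Trim();

        // Everything after the last space or tab is treated as the label.
        int cut = trimmed.LastIndexOfAny(new[] { ' ', '\t' });
        if (cut <= 0) return false;

        var labelPart = trimmed.Substring(cut + 1);
        if (labelPart != "0" && labelPart != "1") return false;

        var sentence = trimmed.Substring(0, cut).TrimEnd();
        if (sentence.Length == 0) return false;

        text = sentence;
        label = labelPart == "1" ? 1 : 0;
        return true;
    }
}
```

A line that does not end in 0 or 1 is rejected rather than guessed at, which keeps mislabeled rows out of the training file.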

Add a word segmentation function:

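// Reads "sentence label" lines from source, segments each sentence with jieba,
// and writes "token token ...<TAB>label" lines to result.
public static void Segment(string source, string result)
{
    var segmenter = new JiebaSegmenter();
    using (var reader = new StreamReader(source))
    using (var writer = new StreamWriter(result))
    {
        string line;
        // ReadLine returns null at end of file; blank lines are skipped
        // instead of ending the loop early.
        while ((line = reader.ReadLine()) != null)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            // Expect exactly two space-separated fields: sentence and label.
            // Lines whose sentence itself contains spaces are skipped.
            var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length != 2) continue;
            var segments = segmenter.Cut(parts[0]);
            // Tokens are joined with spaces; a tab separates text from label.
            writer.WriteLine("{0}\t{1}", string.Join(" ", segments), parts[1]);
        }
    }
}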

Adjust the original file paths so that there are two raw input files and two segmented output files:

const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";
const string _dataTrainPath = @".\data\sentiment labelled sentences\imdb_labelled_result.txt";
const string _testTargetPath = @".\data\sentiment labelled sentences\yelp_labelled_result.txt";

Add the calls in the Main function, before training:

Segment(_dataPath, _dataTrainPath);
Segment(_testDataPath, _testTargetPath);

Change the data used for prediction to:

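// Sentiment = 0 is only a placeholder here; the model supplies the prediction.
IEnumerable<SentimentData> sentiments = new[]
{
    new SentimentData
    {
        SentimentText = "今天的任務並不輕松", // "Today's task is not easy"
        Sentiment = 0
    },
    new SentimentData
    {
        SentimentText = "我非常想見到你", // "I really want to see you"
        Sentiment = 0
    },
    new SentimentData
    {
        SentimentText = "實在是我沒有看清楚", // "I really did not see it clearly"
        Sentiment = 0
    }
};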
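
The full listing at the end of this post segments these prediction inputs with the same segmenter before calling Predict, since the model was trained on space-separated tokens. The sketch below isolates that step. `PredictionPrep`, `ToTokens`, and `ForDisplay` are hypothetical names, and the `cut` delegate is an assumption that stands in for a segmenter such as `JiebaSegmenter.Cut`, so the block compiles without the NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the original post).
public static class PredictionPrep
{
    // Joins the segmenter's tokens with single spaces, dropping empty tokens,
    // so prediction text has the same shape as the segmented training text.
    public static string ToTokens(string text, Func<string, IEnumerable<string>> cut)
    {
        return string.Join(" ", cut(text).Where(t => !string.IsNullOrWhiteSpace(t)));
    }

    // Removes the inserted spaces again when printing the sentence.
    public static string ForDisplay(string tokens)
    {
        return tokens.Replace(" ", string.Empty);
    }
}
```

With the real segmenter, the call would look something like `PredictionPrep.ToTokens(text, s => segmenter.Cut(s))`. Note that `ForDisplay` strips every space, which suits Chinese text but would also remove genuine spaces from mixed-language input.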

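With everything in place, run the program. (The original post showed a screenshot of the output here.)

Doesn't look bad, does it? :)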

Not long ago the .NET Blog also published an article about ML.NET, "Introducing ML.NET: Cross-platform, Proven and Open Source Machine Learning Framework". Here is the part that covers the roadmap.

The Road Ahead

There are many capabilities we aspire to add to ML.NET, but we would love to understand what will best fit your needs. The current areas we are exploring are:

Additional ML Tasks and Scenarios
Deep Learning with TensorFlow & CNTK
ONNX support
Scale-out on Azure
Better GUI to simplify ML tasks
Integration with VS Tools for AI
Language Innovation for .NET

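As you can see, once ONNX is supported, models from more machine learning frameworks such as TensorFlow, CNTK, and even PyTorch can be shared. Combined with the steadily growing list of supported scenarios, ML.NET will become more and more practical, and machine learning services already built in other languages can be integrated smoothly into .NET Core. It is something to look forward to!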

As usual, I close with the complete code (the original post also showed the project structure).

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using JiebaNet.Segmenter;
using System.IO;

namespace SentimentAnalysis
{
    class Program
    {
        const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
        const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";
        const string _dataTrainPath = @".\data\sentiment labelled sentences\imdb_labelled_result.txt";
        const string _testTargetPath = @".\data\sentiment labelled sentences\yelp_labelled_result.txt";

        public class SentimentData
        {
            [Column(ordinal: "0")]
            public string SentimentText;
            [Column(ordinal: "1", name: "Label")]
            public float Sentiment;
        }

        public class SentimentPrediction
        {
            [ColumnName("PredictedLabel")]
            public bool Sentiment;
        }

        public static PredictionModel<SentimentData, SentimentPrediction> Train()
        {
            var pipeline = new LearningPipeline();
            // Load the segmented, tab-separated training file written by Segment.
            pipeline.Add(new TextLoader<SentimentData>(_dataTrainPath, useHeader: false, separator: "tab"));
            // Convert the space-separated tokens into a numeric feature vector.
            pipeline.Add(new TextFeaturizer("Features", "SentimentText"));

            // Count-based feature selection on the Features column.
            var featureSelector = new FeatureSelectorByCount() { Column = new[] { "Features" } };
            pipeline.Add(featureSelector);

            // A deliberately small boosted-tree binary classifier.
            pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

            PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();
            return model;
        }

        public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            var testData = new TextLoader<SentimentData>(_testTargetPath, useHeader: false, separator: "tab");
            var evaluator = new BinaryClassificationEvaluator();
            BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);
            Console.WriteLine();
            Console.WriteLine("PredictionModel quality metrics evaluation");
            Console.WriteLine("------------------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
        }

        public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            IEnumerable<SentimentData> sentiments = new[]
            {
                new SentimentData
                {
                    SentimentText = "今天的任務並不輕松",
                    Sentiment = 0
                },
                new SentimentData
                {
                    SentimentText = "我非常想見到你",
                    Sentiment = 0
                },
                new SentimentData
                {
                    SentimentText = "實在是我沒有看清楚",
                    Sentiment = 0
                }
            };

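            // Segment the inputs the same way as the training data,
            // so the model sees space-separated tokens.
            var segmenter = new JiebaSegmenter();
            foreach (var item in sentiments)
            {
                item.SentimentText = string.Join(" ", segmenter.Cut(item.SentimentText));
            }
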
            IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);
            Console.WriteLine();
            Console.WriteLine("Sentiment Predictions");
            Console.WriteLine("---------------------");

            var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
            foreach (var item in sentimentsAndPredictions)
            {
                Console.WriteLine($"Sentiment: {item.sentiment.SentimentText.Replace(" ", string.Empty)} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
            }
            Console.WriteLine();
        }

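        // Reads "sentence label" lines from source, segments each sentence with jieba,
        // and writes "token token ...<TAB>label" lines to result.
        public static void Segment(string source, string result)
        {
            var segmenter = new JiebaSegmenter();
            using (var reader = new StreamReader(source))
            using (var writer = new StreamWriter(result))
            {
                string line;
                // ReadLine returns null at end of file; blank lines are skipped
                // instead of ending the loop early.
                while ((line = reader.ReadLine()) != null)
                {
                    if (string.IsNullOrWhiteSpace(line)) continue;
                    // Expect exactly two space-separated fields: sentence and label.
                    // Lines whose sentence itself contains spaces are skipped.
                    var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
                    if (parts.Length != 2) continue;
                    var segments = segmenter.Cut(parts[0]);
                    // Tokens are joined with spaces; a tab separates text from label.
                    writer.WriteLine("{0}\t{1}", string.Join(" ", segments), parts[1]);
                }
            }
        }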

        static void Main(string[] args)
        {
            Segment(_dataPath, _dataTrainPath);
            Segment(_testDataPath, _testTargetPath);
            var model = Train();
            Evaluate(model);
            Predict(model);
        }
    }
}
