ML.NET v0.5 release notes

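In this 0.5 release, we are adding TensorFlow model scoring as a transform in ML.NET. This makes it possible to use an existing TensorFlow model within an ML.NET experiment. The issues and feedback raised by the community can be found here.

As part of the upcoming road for ML.NET, we are developing a new ML.NET API that improves flexibility and ease of use. When the new API is ready and good enough, we plan to deprecate the current LearningPipeline API. Because this will be a significant change, at the end of this post we share our proposals for the API options, along with comparisons.

This blog post provides details on the following topics in ML.NET:

TensorFlow model scoring transform (TensorFlowTransform), added in ML.NET v0.5
Proposal for the new ML.NET API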

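TensorFlow model scoring transform (TensorFlowTransform)

TensorFlow is a popular deep learning and machine learning toolkit that can train deep neural networks (and perform general numeric computation).

Deep learning is a subset of artificial intelligence and machine learning that teaches programs to do what comes naturally to humans: learn by example.
Its main difference from traditional machine learning is that a deep learning model can learn to perform object detection and classification tasks directly from images, sound, or text, and can even deliver tasks such as speech recognition and language translation, whereas traditional ML approaches rely heavily on feature engineering and data processing.
Deep learning models need to be trained with large sets of labeled data and with neural networks that contain multiple layers. Its current popularity has several causes. First, it performs better on some tasks, such as computer vision. Second, it can take advantage of the huge amounts of data that are now becoming available (and it requires that volume in order to perform well).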

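With ML.NET 0.5, we are starting to add support for deep learning in ML.NET. Today we are introducing the first level of integration with TensorFlow in ML.NET through the new TensorFlowTransform, which lets you take an existing TensorFlow model, either trained by you or downloaded from somewhere else, and get scores from that TensorFlow model in ML.NET.

This new TensorFlow scoring capability does not require you to have a working knowledge of TensorFlow internals. In the longer term, we will work on making the experience of doing deep learning with ML.NET even easier.

The implementation of this transform is based on code from TensorFlowSharp.

As shown in the following diagram, you simply add a reference to the ML.NET NuGet packages in your .NET Core or .NET Framework application. Under the covers, ML.NET includes and references the native TensorFlow library, which allows you to write code that loads an existing trained TensorFlow model file for scoring.

The following code snippet shows how to use the TensorFlow transform in an ML.NET pipeline: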

// ... Additional transformations in the pipeline code

pipeline.Add(new TensorFlowScorer()
{
    ModelFile = "model/tensorflow_inception_graph.pb",   // Example using the Inception v3 TensorFlow model
    InputColumns = new[] { "input" },                    // Name of input in the TensorFlow model
    OutputColumn = "softmax2_pre_activation"             // Name of output in the TensorFlow model
});

// ... Additional code specifying a learner and training process for the ML.NET model

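You can find the complete code example related to the snippet above here. It uses TensorFlowTransform, the TensorFlow Inception v3 model, and the existing LearningPipeline API.

The code example above uses a pre-trained TensorFlow model named Inception v3, which you can download from here. Inception v3 is a very popular image recognition model trained on the ImageNet dataset, where the TensorFlow model tries to classify entire images into a thousand classes, such as "Umbrella", "Jersey", and "Kitchen".

The Inception v3 model can be classified as a deep convolutional neural network. It can achieve reasonable performance on hard visual recognition tasks, matching or exceeding human performance in some domains. The model/algorithm was developed by multiple researchers and is based on the original paper: "Rethinking the Inception Architecture for Computer Vision" by Szegedy et al.

In the next ML.NET releases, we will add functionality for identifying the expected inputs and outputs of TensorFlow models. For now, use the TensorFlow APIs or a tool such as Netron to explore the TensorFlow model.

If you open the previous example's TensorFlow model file (tensorflow_inception_graph.pb) with Netron and explore the model's graph, you can see how InputColumns correlates with the input node at the beginning of the graph:

And how OutputColumn correlates with the output of the softmax2_pre_activation node, almost at the end of the graph.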
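The output node's name suggests that its values are raw scores taken before a final softmax activation. As a rough illustration of what such a score vector can mean, the sketch below turns raw scores into probabilities and picks the most likely class. It is plain C#, not ML.NET or TensorFlow API; the class name `ScoreDemo`, its helpers, and the three sample scores are all made up for this example.

```csharp
using System;
using System.Linq;

public static class ScoreDemo
{
    // Numerically stable softmax: subtract the maximum before exponentiating.
    public static double[] Softmax(double[] scores)
    {
        double max = scores.Max();
        double[] exps = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }

    // Index of the largest value, i.e. the most likely class.
    public static int ArgMax(double[] values)
    {
        int best = 0;
        for (int i = 1; i < values.Length; i++)
        {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }

    public static void Main()
    {
        // Hypothetical pre-activation scores for three classes (not real model output).
        double[] scores = { 1.0, 3.0, 0.5 };
        double[] probabilities = Softmax(scores);
        Console.WriteLine("Most likely class index: " + ArgMax(probabilities));
    }
}
```

In the sample discussed in this post, the scores are not read this way; they are passed on as a numeric feature vector to an ML.NET learner.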

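Limitations: We are currently updating the ML.NET APIs for improved flexibility, because there are a few limitations to using TensorFlow in ML.NET today. For now (when using the LearningPipeline API), these scores can only be used within a LearningPipeline as inputs (numeric vectors) to a learner, such as a classifier learner. With the upcoming new ML.NET API, however, the TensorFlow model scores will be directly accessible, so you can score with the TensorFlow model without having to add an additional learner and its related training process, as this sample currently does. The sample creates a multi-class classification ML.NET model based on StochasticDualCoordinateAscentClassifier, using a label (the object name) paired with the numeric vector feature that the TensorFlow model generates/scores for each image file.

Take into account that the TensorFlow code examples mentioned here use the current LearningPipeline API available in v0.5. Moving forward, the ML.NET API for using TensorFlow will be slightly different and will not be based on the "pipeline". This is related to the next section of this blog post, which focuses on the upcoming new API for ML.NET.

Finally, we also want to highlight that the ML.NET framework is currently surfacing TensorFlow, but in the future we might look into integrating additional deep learning libraries, such as Torch and CNTK.

You can find additional code examples/tests that use TensorFlowTransform with the existing LearningPipeline API here.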

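Explore the upcoming new ML.NET API (after 0.5) and provide feedback

As mentioned at the beginning of this post, we are really looking forward to getting your feedback as we create the new ML.NET API while crafting ML.NET. This evolution of ML.NET offers more flexible capabilities than the current LearningPipeline API provides. The LearningPipeline API will be deprecated when this new API is ready and good enough.

The following links point to some example feedback we received, in the form of GitHub issues, about limitations when using the LearningPipeline API:

Therefore, based on the feedback on the LearningPipeline API, a few weeks ago we decided to switch to a new ML.NET API that addresses most of the limitations the LearningPipeline API currently has.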

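Design principles for this new ML.NET API

We are designing this new API based on the following principles:

Use terminology parallel to that of other well-known frameworks such as Scikit-Learn, TensorFlow, and Spark. We will try to be consistent in naming and concepts, making it easier for developers to understand and learn ML.NET.
Keep simple ML scenarios, such as simple train and predict, simple and concise.
Allow advanced ML scenarios (not possible with the current LearningPipeline API, as explained in the next section).

We have also explored API approaches such as fluent, declarative, and imperative APIs.
For a more in-depth discussion of the principles and required scenarios, see this issue on GitHub.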

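Why is ML.NET switching from the `LearningPipeline` API to a new API?

As part of the preview crafting process (remember that ML.NET is still in early previews), we have been getting feedback on the LearningPipeline API and have discovered quite a few limitations that we need to address by creating a more flexible API.

Specifically, the new ML.NET API offers attractive features that are not possible with the current LearningPipeline API:

Strongly-typed API: The new strongly-typed API takes advantage of C# capabilities, so errors can be discovered at compile time, and Intellisense in editors improves as well.
Better flexibility: This API provides a decomposable train and predict process, eliminating rigid, linear pipeline execution. With the new API, you can execute a certain code path and then fork the execution so that multiple paths re-use the initial common execution. For example, you can share a given transform's execution and its transformed data with multiple learners and trainers, or decompose pipelines and add multiple learners.

This new API is based on concepts such as Estimators, Transforms, and DataView, as shown in the code later in this blog post.

Improved usability: You call the APIs directly from your code. There is no more scaffolding or insulation layer creating an obscure separation between what the user/developer writes and the internal APIs. Entry points are no longer mandatory.
Ability to simply score with TensorFlow models: Thanks to the flexibility mentioned above, you can also simply load a TensorFlow model and score with it, without adding any additional learner and training process, as explained in the "Limitations" topic of the TensorFlow section earlier.
Better visibility of the transformed data: You get better visibility of the data while applying transformers.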
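To make the idea of a decomposable train and predict process more concrete, here is a minimal sketch in plain C#. It does not use ML.NET; the interfaces `IEstimatorSketch` and `ITransformerSketch` and the mean-centering classes are hypothetical names invented for this illustration, loosely following the Estimator and Transform concepts mentioned above. Fitting yields a separate object that can then be applied to any data.

```csharp
using System;
using System.Linq;

// Hypothetical interfaces named after the concepts in this post; not the real ML.NET types.
public interface ITransformerSketch
{
    double[] Transform(double[] data);
}

public interface IEstimatorSketch
{
    ITransformerSketch Fit(double[] data);
}

// Estimator: learns the mean of the training data.
public sealed class MeanCenteringEstimator : IEstimatorSketch
{
    public ITransformerSketch Fit(double[] data)
    {
        return new MeanCenteringTransformer(data.Average());
    }
}

// Fitted result: subtracts the learned mean from any data it is given.
public sealed class MeanCenteringTransformer : ITransformerSketch
{
    private readonly double _mean;

    public MeanCenteringTransformer(double mean)
    {
        _mean = mean;
    }

    public double[] Transform(double[] data)
    {
        return data.Select(x => x - _mean).ToArray();
    }
}

public static class DecomposableDemo
{
    public static void Main()
    {
        double[] train = { 1.0, 2.0, 3.0 };
        double[] test = { 4.0, 5.0 };

        // Training step: produces a reusable fitted object (the "model").
        ITransformerSketch model = new MeanCenteringEstimator().Fit(train);

        // Prediction step: runs separately, on any data, as often as needed.
        double[] centered = model.Transform(test);
        Console.WriteLine(string.Join(", ", centered));
    }
}
```

Because fitting and transforming are separate calls, the fitted object can be inspected, reused, or handed to several downstream steps, which is the kind of forking a single linear pipeline does not allow.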

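Strongly-typed API compared with the `LearningPipeline` API

Another important comparison relates to the strongly-typed API feature of the new API.
As an example of the problems you can run into without a strongly-typed API, the LearningPipeline API (as shown in the code below) provides access to data columns by specifying the column names as strings. If you make a typo (for example, writing "Descrption" without the 'i' instead of "Description", as in the sample code), you get a run-time exception: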

pipeline.Add(new TextFeaturizer("Description", "Descrption"));

With the new ML.NET API, however, the code is strongly typed, so a typo is caught at compile time, and you also get Intellisense in the editor.

var estimator = reader.MakeEstimator()
                .Append(row => (
                    description: row.description.FeaturizeText()));
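The difference between string-based and strongly-typed column access can be sketched without ML.NET at all. The following plain C# example is only an analogy under stated assumptions: a dictionary stands in for string-named columns and a named tuple stands in for typed columns. The class `ColumnAccessDemo` and its helper `FailsAtRunTime` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class ColumnAccessDemo
{
    // String-keyed lookup: returns true when the column name is missing,
    // which can only be discovered while the program is running.
    public static bool FailsAtRunTime(Dictionary<string, string> row, string column)
    {
        try
        {
            _ = row[column];
            return false;
        }
        catch (KeyNotFoundException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var row = new Dictionary<string, string>
        {
            ["Description"] = "Connection failures when running queries"
        };

        // The misspelled name compiles fine and only fails at run time.
        Console.WriteLine("Typo fails at run time: " + FailsAtRunTime(row, "Descrption"));

        // Typed access: a named tuple exposes the columns as members.
        var typedRow = (title: "Sample issue", description: "Connection failures when running queries");
        Console.WriteLine(typedRow.description);
        // Writing typedRow.descrption here would not compile,
        // so the same typo is caught at build time.
    }
}
```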

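Details on the decomposable train and predict API

The following code snippet shows how the transforms and training process of the "GitHub issue labeler" sample app can be implemented with the new API in ML.NET.

This is our current proposal. Based on your feedback, this API will probably evolve accordingly.

New ML.NET API code example: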

public static async Task BuildAndTrainModelToClassifyGithubIssues()
{
    var env = new MLEnvironment();

    string trainDataPath = @"Data\issues_train.tsv";

    // Create reader
    var reader = TextLoader.CreateReader(env, ctx =>
                                    (area: ctx.LoadText(1),
                                    title: ctx.LoadText(2),
                                    description: ctx.LoadText(3)),
                                    new MultiFileSource(trainDataPath), 
                                    hasHeader : true);

    var loss = new HingeLoss(new HingeLoss.Arguments() { Margin = 1 });

    var estimator = reader.MakeNewEstimator()
        .Append(row => (
            // Convert string label to key. 
            label: row.area.ToKey(),
            // Featurize 'description'
            description: row.description.FeaturizeText(),
            // Featurize 'title'
            title: row.title.FeaturizeText()))
        .Append(row => (
            // Concatenate the two features into a vector and normalize.
            features: row.description.ConcatWith(row.title).Normalize(),
            // Preserve the label - otherwise it will be dropped
            label: row.label))
        .Append(row => (
            // Preserve the label (for evaluation)
            row.label,
            // Train the linear predictor (SDCA)
            score: row.label.PredictSdcaClassification(row.features, loss: loss)))
        .Append(row => (
            // Want the prediction, as well as label and score which are needed for evaluation
            predictedLabel: row.score.predictedLabel.ToValue(),
            row.label,
            row.score));

    // Read the data
    var data = reader.Read(new MultiFileSource(trainDataPath));

    // Fit the data to get a model
    var model = estimator.Fit(data);

    // Use the model to get predictions on the test dataset and evaluate the accuracy of the model
    var scores = model.Transform(reader.Read(new MultiFileSource(@"Data\issues_test.tsv")));
    var metrics = MultiClassClassifierEvaluator.Evaluate(scores, r => r.label, r => r.score);

    Console.WriteLine("Micro-accuracy is: " + metrics.AccuracyMicro);

    // Save the ML.NET model into a .ZIP file
    await model.WriteAsync("github-Model.zip");
}

public static async Task PredictLabelForGithubIssueAsync()
{
    // Create the environment needed to build the prediction function
    var env = new MLEnvironment();

    // Read model from an ML.NET .ZIP model file
    var model = await PredictionModel.ReadAsync("github-Model.zip");

    // Create a prediction function that can be used to score incoming issues
    var predictor = model.AsDynamic.MakePredictionFunction<GitHubIssue, IssuePrediction>(env);

    // This prediction will classify this particular issue in a type such as "EF and Database access"
    var prediction = predictor.Predict(new GitHubIssue
    {
        title = "Sample issue related to Entity Framework",
        description = @"When using Entity Framework Core I'm experiencing database connection failures when running queries or transactions. Looks like it could be related to transient faults in network communication against the Azure SQL Database."
    });

    Console.WriteLine("Predicted label is: " + prediction.predictedLabel);
}

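Compare that with the following code snippet using the old LearningPipeline API, which lacks flexibility because the pipeline execution is linear rather than decomposable:

Old LearningPipeline API code example: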

public static async Task BuildAndTrainModelToClassifyGithubIssuesAsync()
{
    // Create the pipeline
    var pipeline = new LearningPipeline();

    // Read the data
    pipeline.Add(new TextLoader(DataPath).CreateFrom<GitHubIssue>(useHeader: true));

    // Dictionarize the "Area" column
    pipeline.Add(new Dictionarizer(("Area", "Label")));

    // Featurize the "Title" column
    pipeline.Add(new TextFeaturizer("Title", "Title"));

    // Featurize the "Description" column
    pipeline.Add(new TextFeaturizer("Description", "Description"));
    
    // Concatenate the provided columns
    pipeline.Add(new ColumnConcatenator("Features", "Title", "Description"));

    // Set the algorithm/learner to use when training
    pipeline.Add(new StochasticDualCoordinateAscentClassifier());

    // Specify the column to predict when scoring
    pipeline.Add(new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" });

    Console.WriteLine("=============== Training model ===============");

    // Train the model
    var model = pipeline.Train<GitHubIssue, GitHubIssuePrediction>();

    // Save the model to a .zip file
    await model.WriteAsync(ModelPath);

    Console.WriteLine("=============== End training ===============");
    Console.WriteLine("The model is saved to {0}", ModelPath);
}

public static async Task<string> PredictLabelForGitHubIssueAsync()
{
    // Read model from an ML.NET .ZIP model file
    _model = await PredictionModel.ReadAsync<GitHubIssue, GitHubIssuePrediction>(ModelPath);
    
    // This prediction will classify this particular issue in a type such as "EF and Database access"
    var prediction = _model.Predict(new GitHubIssue
        {
            Title = "Sample issue related to Entity Framework", 
            Description = "When using Entity Framework Core I'm experiencing database connection failures when running queries or transactions. Looks like it could be related to transient faults in network communication against the Azure SQL Database..."
        });

    return prediction.Area;
}

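The old LearningPipeline API is a fully linear code path, so you cannot decompose it into multiple pieces.
For instance, the BikeSharing ML.NET sample (available in the machine learning samples GitHub repo) uses the current LearningPipeline API.

This sample uses the evaluators API to compare the accuracy of regression learners by:

Performing multiple data transforms on the original dataset
Training and creating seven different ML.NET models based on seven different regression trainers/algorithms (such as FastTreeRegressor, FastTreeTweedieRegressor, StochasticDualCoordinateAscentRegressor, etc.)

The intent is to help you compare the regression learners for a given problem.

Since the data transformations are the same for all of those models, you might want to re-use the code execution related to the transforms. However, because the LearningPipeline API only provides a single linear execution, you need to run the same data transformation steps for every model you create/train, as shown in the following code excerpt from the BikeSharing ML.NET sample.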

var fastTreeModel = new ModelBuilder(trainingDataLocation, new FastTreeRegressor()).BuildAndTrain();
var fastTreeMetrics = modelEvaluator.Evaluate(fastTreeModel, testDataLocation);
PrintMetrics("Fast Tree", fastTreeMetrics);

var fastForestModel = new ModelBuilder(trainingDataLocation, new FastForestRegressor()).BuildAndTrain();
var fastForestMetrics = modelEvaluator.Evaluate(fastForestModel, testDataLocation);
PrintMetrics("Fast Forest", fastForestMetrics);

var poissonModel = new ModelBuilder(trainingDataLocation, new PoissonRegressor()).BuildAndTrain();
var poissonMetrics = modelEvaluator.Evaluate(poissonModel, testDataLocation);
PrintMetrics("Poisson", poissonMetrics);

//Other learners/algorithms
//...

The BuildAndTrain() method needs to contain both the data transforms and the algorithm that differs in each case, as shown in the following code:

public PredictionModel<BikeSharingDemandSample, BikeSharingDemandPrediction> BuildAndTrain()
{
    var pipeline = new LearningPipeline();
    pipeline.Add(new TextLoader(_trainingDataLocation).CreateFrom<BikeSharingDemandSample>(useHeader: true, separator: ','));
    pipeline.Add(new ColumnCopier(("Count", "Label")));
    pipeline.Add(new ColumnConcatenator("Features", 
                                        "Season", 
                                        "Year", 
                                        "Month", 
                                        "Hour", 
                                        "Weekday", 
                                        "Weather", 
                                        "Temperature", 
                                        "NormalizedTemperature",
                                        "Humidity",
                                        "Windspeed"));
    pipeline.Add(_algorythm);

    return pipeline.Train<BikeSharingDemandSample, BikeSharingDemandPrediction>();
}

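With the old LearningPipeline API, for every training run that uses a different algorithm you need to run the same process again, performing the following steps over and over:

Loading the dataset from file
Performing the column transformations (concat, copy, or additional featurizers or dictionarizers, if needed)

With the new ML.NET API based on Estimators and DataView, however, you will be able to re-use parts of the execution. In this case, you can re-use the data transforms execution as the base for multiple models that use different algorithms.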
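The re-use described above can be illustrated with a small plain C# sketch. It is not ML.NET code: `LoadAndTransform`, the two toy learners, and the hard-coded data are hypothetical stand-ins, and the counter only exists to show that the shared steps run once while two different models are trained from the same result.

```csharp
using System;
using System.Linq;

public static class SharedTransformDemo
{
    // Counts how many times the shared load + transform steps were executed.
    public static int TransformRuns = 0;

    // Stand-in for loading the dataset and applying the column transforms.
    public static (double[] x, double[] y) LoadAndTransform()
    {
        TransformRuns++;
        double[] x = { 1, 2, 3, 4 };
        double[] y = { 2, 4, 6, 8 };
        return (x, y);
    }

    // Toy learner 1: always predicts the mean of the labels.
    public static Func<double, double> TrainMeanPredictor((double[] x, double[] y) data)
    {
        double mean = data.y.Average();
        return input => mean;
    }

    // Toy learner 2: least-squares line through the origin, y = w * x.
    public static Func<double, double> TrainLinePredictor((double[] x, double[] y) data)
    {
        double w = data.x.Zip(data.y, (a, b) => a * b).Sum() / data.x.Sum(a => a * a);
        return input => w * input;
    }

    public static void Main()
    {
        // Run the shared steps once...
        var data = LoadAndTransform();

        // ...then fork: each learner trains on the same transformed data.
        Func<double, double> meanModel = TrainMeanPredictor(data);
        Func<double, double> lineModel = TrainLinePredictor(data);

        Console.WriteLine("Mean model predicts " + meanModel(5));
        Console.WriteLine("Line model predicts " + lineModel(5));
        Console.WriteLine("Shared steps ran " + TransformRuns + " time(s)");
    }
}
```

With a strictly linear pipeline, the equivalent of `LoadAndTransform` would run once per learner, which is what the BuildAndTrain() approach shown earlier does.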

You can also explore other code examples that use the new API here.