Reposted article (posted 2018-12-15 21:23). Tags: AI, Python

Reference: https://www.kaggle.com/dansbecker/data-leakage

There are two main types of leakage: leaky predictors and leaky validation strategies.

Leaky Predictors

This occurs when your predictors include data that will not be available at the time you make predictions.

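In other words, the model uses features or data that are not available before the prediction is made. Validation accuracy then looks high, but accuracy drops once the model is deployed in a real environment, because that data cannot be obtained there.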

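For example, when predicting pneumonia, using "took antibiotics" as a feature is a case of this. Patients usually take antibiotics because they have already caught pneumonia, so a pneumonia-prediction model should not use the "took antibiotics" feature.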
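The pneumonia example can be illustrated with a small sketch. The records, the column order, and the helper `agreement_with_target` are all hypothetical, made up for illustration. Measuring how often a feature matches the target is only a rough screening heuristic, not a method prescribed by the referenced tutorial.

```python
# Hypothetical toy records: (age, took_antibiotics, got_pneumonia).
# took_antibiotics is recorded AFTER diagnosis, so it closely tracks the target.
records = [
    (34, 0, 0), (61, 1, 1), (45, 0, 0), (70, 1, 1),
    (29, 0, 0), (52, 1, 1), (38, 0, 0), (66, 1, 1),
    (41, 1, 0), (57, 0, 1),
]

def agreement_with_target(rows, feature_idx, target_idx=2):
    """Fraction of rows where a binary feature equals the binary target."""
    matches = sum(1 for row in rows if row[feature_idx] == row[target_idx])
    return matches / len(rows)

# A suspiciously high agreement is a hint to ask when the feature is recorded.
leak_score = agreement_with_target(records, feature_idx=1)
print(f"took_antibiotics agrees with target on {leak_score:.0%} of rows")

# Drop the leaky column so the model only sees data available at prediction time.
safe_features = [(age,) for age, _took_antibiotics, _target in records]
targets = [target for _age, _took_antibiotics, target in records]
```

A high score alone does not prove leakage. The deciding question is still whether the value would be known at the moment the prediction is made.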

Leaky Validation Strategies

This occurs when validation data is allowed to influence the model's parameters during the modeling process.

For example, this happens if you run preprocessing (like fitting the Imputer for missing values) before calling train_test_split.

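In other words, if preprocessing (such as the Imputer) runs before train_test_split is called, the preprocessing step is fit on data that includes the rows that become validation data after the split.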
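The ordering problem can be sketched without any libraries. The helpers `split_train_valid`, `fit_mean`, and `impute` below are hypothetical stand-ins for train_test_split and a mean imputer, and the data column is invented. The sketch is meant only to show where the fill value comes from in each case.

```python
import random
from statistics import mean

def split_train_valid(values, valid_fraction=0.25, seed=0):
    """Shuffle and split a list; a simplified stand-in for train_test_split."""
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

def fit_mean(values):
    """'Fit' a mean imputer: learn the fill value from observed entries only."""
    return mean(v for v in values if v is not None)

def impute(values, fill_value):
    """Replace missing entries (None) with the learned fill value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical single-feature column with missing entries (None).
column = [1.0, 2.0, None, 4.0, 100.0, None, 3.0, 5.0]

# Leaky: the fill value is computed on ALL rows, including future validation rows.
leaky_fill = fit_mean(column)

# Safe: split first, fit on training rows only, then apply to both parts.
train, valid = split_train_valid(column)
safe_fill = fit_mean(train)
train_imputed = impute(train, safe_fill)
valid_imputed = impute(valid, safe_fill)
```

In scikit-learn, placing the imputer and the model together in a `Pipeline` gives the same guarantee, because the imputer is then fit only on the training portion each time the pipeline is fit.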
