refer to: https://www.kaggle.com/dansbecker/data-leakage
There are two main types of leakage: Leaky Predictors and Leaky Validation Strategies.
Leaky Predictors
This occurs when your predictors include data that will not be available at the time you make predictions.
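A model trained on such features/data scores high accuracy in validation but low accuracy once deployed in a real environment, because that data cannot be obtained at prediction time.
For example, when predicting pneumonia, using "takes antibiotics" as a feature is exactly this case: people generally take antibiotics because they already have pneumonia, so a pneumonia prediction model should not use the "takes antibiotics" feature.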
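A minimal sketch of the pneumonia example, using only the Python standard library. The data is synthetic and the names (rows, agreement, took_antibiotics, got_pneumonia, the 0.95/0.05 rates) are made up for illustration; they are not from the referenced tutorial. It shows how a feature recorded after the outcome tracks the target almost perfectly, which is a warning sign, and how to drop it before training.

```python
import random

random.seed(0)

# Hypothetical synthetic patients: antibiotics are prescribed AFTER diagnosis,
# so the feature is a consequence of the target, not a cause of it.
rows = []
for _ in range(1000):
    got_pneumonia = random.random() < 0.3
    age = random.randint(20, 80)
    # Assumed rates: most pneumonia patients take antibiotics, few others do.
    took_antibiotics = random.random() < (0.95 if got_pneumonia else 0.05)
    rows.append({
        "age": age,
        "took_antibiotics": took_antibiotics,
        "got_pneumonia": got_pneumonia,
    })


def agreement(rows, feature, target):
    """Fraction of rows where a boolean feature equals the boolean target."""
    return sum(r[feature] == r[target] for r in rows) / len(rows)


# A suspiciously high agreement suggests the feature may leak the target.
leak_score = agreement(rows, "took_antibiotics", "got_pneumonia")
print(f"took_antibiotics matches the target on {leak_score:.0%} of rows")

# The value is not known at prediction time, so exclude it from training data.
safe_rows = [
    {k: v for k, v in r.items() if k != "took_antibiotics"} for r in rows
]
```

A high score alone does not prove leakage; the deciding question is still whether the value would be known at the moment a prediction is made.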
Leaky Validation Strategies
This occurs when validation data is allowed to influence the model's parameters during the modeling process.
For example, this happens if you run preprocessing (like fitting the Imputer for missing values) before calling train_test_split.
In that case, the data used to fit the preprocessing step includes the rows that become validation data after the split.
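One common way to avoid this is to split first and keep the preprocessing inside a scikit-learn pipeline, so the imputer is fitted on training rows only. The sketch below assumes a recent scikit-learn, where SimpleImputer from sklearn.impute takes the place of the older Imputer class; the synthetic data, the 10% missing rate, and the choice of LogisticRegression are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data (illustrative only): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Split FIRST, so validation rows never reach the imputer's fit step.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the imputer on X_train only, then reuses those
# training-set means when transforming X_valid inside score().
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
model.fit(X_train, y_train)
valid_accuracy = model.score(X_valid, y_valid)

# The learned fill values come from the training rows alone.
train_means = model.named_steps["simpleimputer"].statistics_
print(train_means, valid_accuracy)
```

Keeping the imputer inside the pipeline also carries over to cross-validation: helpers such as cross_val_score refit the whole pipeline on each training fold, so the same guarantee holds per fold.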