I. Source of the problem

The error in the title was raised while training an XGBoost classifier.

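II. Cause

The data was split with train_test_split, so the labels in the training set do not necessarily cover every label in the test set. A label that appears only in the test set triggers the error above.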
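A quick way to confirm this cause is to list the labels that occur in the test split but not in the training split. The snippet below is a minimal sketch on made-up labels; y_test_code is a hypothetical name chosen to mirror the y_train_code variable used later in this post.

```python
# Toy labels, invented for illustration only.
y_train_code = ["a", "b", "a", "c"]
y_test_code = ["a", "d", "b", "e"]

# Labels that the model never sees during training.
unseen = sorted(set(y_test_code) - set(y_train_code))
print(unseen)  # ['d', 'e']
```

If the list is not empty, the split is the likely culprit.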

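1. Option 1: make sure that at least one sample of every class ends up in train.

2. Option 2: skip the test set and put all the data into train. The future is unpredictable, though, and only historical data is available. The main drawback is that the validation and test scores no longer reflect real performance.

3. Option 3: add an "unknown" class. Append one feature/label pair for it to the training set, then match the test labels against the training labels and treat every label that is missing from the training set as "unknown".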
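Option 1 can be sketched as follows. This is not the code used in this post, only one possible way to do it: reserve one sample of each class for train, then split the remaining samples at random. The function name split_keep_all_classes and its parameters are hypothetical. scikit-learn's train_test_split also accepts a stratify argument that serves a similar purpose, but it needs at least two samples per class.

```python
import random
from collections import defaultdict

def split_keep_all_classes(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that every label occurs in train."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, pool = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        train_idx.append(indices[0])  # reserve one sample per class for train
        pool.extend(indices[1:])      # the rest may go to either split

    # Draw the test set from the unreserved samples only.
    rng.shuffle(pool)
    n_test = min(len(pool), int(round(test_fraction * len(labels))))
    test_idx = pool[:n_test]
    train_idx.extend(pool[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels, invented for illustration; class "c" has a single sample.
labels = ["a", "a", "a", "b", "b", "c", "a", "b", "a", "a"]
train_idx, test_idx = split_keep_all_classes(labels, test_fraction=0.3)
train_labels = {labels[i] for i in train_idx}
test_labels = {labels[i] for i in test_idx}
```

With this split, test_labels is always a subset of train_labels, so a class with a single sample (such as "c" here) can never land only in the test set.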

I used option 2 before and now use option 3:

import numpy as np

X_train = np.vstack((X_train, np.random.rand(1, 128)))  # add one row for the "unknown" class (labels in the test set but not in the train set); my samples have shape (1, 128)

y_train_code = np.append(y_train_code, "unknown")  # the matching label must be appended too, because features and labels form pairs

vocab_code = sorted(set(y_train_code))  # sorted so that the label indices are the same on every run

idx_to_code = list(vocab_code)

code_to_idx = {code: i for i, code in enumerate(idx_to_code)}

y_train_idx = np.array([code_to_idx[t] for t in y_train_code])

y_val_idx = np.array([code_to_idx.get(t, code_to_idx["unknown"]) for t in y_val_code])  # labels that never occur in train are mapped to the "unknown" class

# For labels that are in the vocabulary, code_to_idx.get(t) and code_to_idx[t] return the same value.
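The mapping can also be wrapped in two small helpers and checked on toy labels. This is only a sketch of option 3 in plain Python: build_label_index and encode are hypothetical names, and the labels are made up. The returned lists can be passed to np.array before training.

```python
def build_label_index(train_labels, unknown="unknown"):
    """Map each training label, plus the unknown class, to an integer index."""
    vocab = sorted(set(train_labels) | {unknown})
    return {code: i for i, code in enumerate(vocab)}

def encode(labels, code_to_idx, unknown="unknown"):
    """Encode labels; anything outside the vocabulary becomes the unknown class."""
    fallback = code_to_idx[unknown]
    return [code_to_idx.get(label, fallback) for label in labels]

# Toy labels, invented for illustration; "bird" never occurs in train.
y_train_code = ["cat", "dog", "cat", "unknown"]
y_val_code = ["dog", "bird", "cat"]

code_to_idx = build_label_index(y_train_code)
y_train_idx = encode(y_train_code, code_to_idx)
y_val_idx = encode(y_val_code, code_to_idx)
print(y_val_idx)  # [1, 2, 0]
```

Here "bird" is encoded as 2, the index of "unknown", so the validation labels stay inside the range of classes seen during training.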

Think more and experiment more: I could not find an answer to this problem online.
