1. Comparing Models
This is the recommended first step in the workflow of any supervised experiment. The function trains all models in the model library using default hyperparameters and evaluates performance metrics using cross-validation. It returns the trained model object. The evaluation metrics used are:
Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
The output of the function is a grid showing the average cross-validated score of every model across the folds. The number of folds can be defined with the fold parameter of compare_models; by default it is set to 10. The grid is sorted (highest to lowest) by the metric of choice, which can be defined with the sort parameter. By default, the grid is sorted by Accuracy for classification experiments and by R2 for regression experiments. Certain models are excluded from the comparison because of their longer runtimes; to bypass this precaution, set the turbo parameter to False.
This function is only available in the pycaret.classification and pycaret.regression modules.
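Conceptually, compare_models is a loop of cross-validated evaluations followed by a sort. The following is a minimal sketch of that idea using scikit-learn directly; the dataset and the three candidate estimators here are illustrative stand-ins, not part of the PyCaret API:

```python
# Sketch of the compare_models idea: cross-validate each candidate,
# average a metric over the folds, and sort from high to low.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "lr": LogisticRegression(max_iter=5000),
    "dt": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
}

# 10-fold CV accuracy per model, mirroring the default fold = 10
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Sort descending, like the comparison grid sorted by Accuracy
leaderboard = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in leaderboard:
    print(f"{name}: {acc:.4f}")
```

PyCaret adds the rest on top of this skeleton: a full model library, multiple metrics per model, and the turbo runtime guard.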
(1) Classification example:
# Importing dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# return best model
best = compare_models()

# return top 3 models based on 'Accuracy'
top3 = compare_models(n_select = 3)

# return best model based on AUC
best = compare_models(sort = 'AUC') # default is 'Accuracy'

# compare specific models
best_specific = compare_models(whitelist = ['dt','rf','xgboost'])

# blacklist certain models
best_specific = compare_models(blacklist = ['catboost', 'svm'])
(2) Regression example:
# Importing dataset
from pycaret.datasets import get_data
boston = get_data('boston')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = boston, target = 'medv')

# return best model
best = compare_models()

# return top 3 models based on 'R2'
top3 = compare_models(n_select = 3)

# return best model based on MAPE
best = compare_models(sort = 'MAPE') # default is 'R2'

# compare specific models
best_specific = compare_models(whitelist = ['dt','rf','xgboost'])

# blacklist certain models
best_specific = compare_models(blacklist = ['catboost', 'svm'])
2. Creating Models
Creating a model in any module is as simple as writing create_model. It takes only one parameter: the model ID as a string. For supervised modules (classification and regression), the function returns a grid with k-fold cross-validated performance metrics along with the trained model object. For the unsupervised clustering module, it returns performance metrics along with the trained model object, while for the remaining unsupervised modules (anomaly detection, natural language processing, and association rule mining) it returns only the trained model object. The evaluation metrics used are:
Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
The number of folds can be defined with the fold parameter of create_model; by default it is set to 10. By default, all metrics are rounded to 4 decimal places, which can be changed with the round parameter of create_model. Although there is a separate function for ensembling a trained model, there is a quick shortcut: a model can be ensembled at creation time via the ensemble and method parameters of create_model.
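The ensemble-at-creation shortcut amounts to wrapping the base estimator in an ensembling meta-estimator before cross-validation. Here is a minimal sketch of the idea with scikit-learn (a hypothetical stand-in, not PyCaret code), using bagging around a decision tree:

```python
# Roughly what create_model('dt', ensemble = True, method = 'Bagging')
# does under the hood: bag the base estimator, then cross-validate it.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Base estimator passed positionally (parameter name varies across
# scikit-learn versions: base_estimator vs estimator)
bagged_dt = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=10,
    random_state=0,
)

# 10-fold CV accuracy of the bagged model, mirroring fold = 10
score = cross_val_score(bagged_dt, X, y, cv=10, scoring="accuracy").mean()
print(f"{score:.4f}")
```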
Classification models:
ID | Name |
‘lr’ | Logistic Regression |
‘knn’ | K Neighbors Classifier |
‘nb’ | Naive Bayes |
‘dt’ | Decision Tree Classifier |
‘svm’ | SVM – Linear Kernel |
‘rbfsvm’ | SVM – Radial Kernel |
‘gpc’ | Gaussian Process Classifier |
‘mlp’ | Multi-Layer Perceptron |
‘ridge’ | Ridge Classifier |
‘rf’ | Random Forest Classifier |
‘qda’ | Quadratic Discriminant Analysis |
‘ada’ | Ada Boost Classifier |
‘gbc’ | Gradient Boosting Classifier |
‘lda’ | Linear Discriminant Analysis |
‘et’ | Extra Trees Classifier |
‘xgboost’ | Extreme Gradient Boosting |
‘lightgbm’ | Light Gradient Boosting |
‘catboost’ | CatBoost Classifier |
Regression models:
ID | Name |
‘lr’ | Linear Regression |
‘lasso’ | Lasso Regression |
‘ridge’ | Ridge Regression |
‘en’ | Elastic Net |
‘lar’ | Least Angle Regression |
‘llar’ | Lasso Least Angle Regression |
‘omp’ | Orthogonal Matching Pursuit |
‘br’ | Bayesian Ridge |
‘ard’ | Automatic Relevance Determination |
‘par’ | Passive Aggressive Regressor |
‘ransac’ | Random Sample Consensus |
‘tr’ | TheilSen Regressor |
‘huber’ | Huber Regressor |
‘kr’ | Kernel Ridge |
‘svm’ | Support Vector Machine |
‘knn’ | K Neighbors Regressor |
‘dt’ | Decision Tree |
‘rf’ | Random Forest |
‘et’ | Extra Trees Regressor |
‘ada’ | AdaBoost Regressor |
‘gbr’ | Gradient Boosting Regressor |
‘mlp’ | Multi-Layer Perceptron |
‘xgboost’ | Extreme Gradient Boosting |
‘lightgbm’ | Light Gradient Boosting |
‘catboost’ | CatBoost Regressor |
Clustering models:
ID | Name |
‘kmeans’ | K-Means Clustering |
‘ap’ | Affinity Propagation |
‘meanshift’ | Mean shift Clustering |
‘sc’ | Spectral Clustering |
‘hclust’ | Agglomerative Clustering |
‘dbscan’ | Density-Based Spatial Clustering |
‘optics’ | OPTICS Clustering |
‘birch’ | Birch Clustering |
‘kmodes’ | K-Modes Clustering |
Anomaly detection models:
ID | Name |
‘abod’ | Angle-base Outlier Detection |
‘iforest’ | Isolation Forest |
‘cluster’ | Clustering-Based Local Outlier |
‘cof’ | Connectivity-Based Outlier Factor |
‘histogram’ | Histogram-based Outlier Detection |
‘knn’ | k-Nearest Neighbors Detector |
‘lof’ | Local Outlier Factor |
‘svm’ | One-class SVM detector |
‘pca’ | Principal Component Analysis |
‘mcd’ | Minimum Covariance Determinant |
‘sod’ | Subspace Outlier Detection |
‘sos’ | Stochastic Outlier Selection |
Natural language processing models:
ID | Name |
‘lda’ | Latent Dirichlet Allocation |
‘lsi’ | Latent Semantic Indexing |
‘hdp’ | Hierarchical Dirichlet Process |
‘rp’ | Random Projections |
‘nmf’ | Non-Negative Matrix Factorization |
Classification example:
# Importing dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# train logistic regression model
lr = create_model('lr') # 'lr' is the id of the model

# check the model library to see all models
models()

# train rf model using 5 fold CV
rf = create_model('rf', fold = 5)

# train svm model without CV
svm = create_model('svm', cross_validation = False)

# train xgboost model with max_depth = 10
xgboost = create_model('xgboost', max_depth = 10)

# train xgboost model on gpu
xgboost_gpu = create_model('xgboost', tree_method = 'gpu_hist', gpu_id = 0) # 0 is gpu-id

# train multiple lightgbm models with n learning_rate
import numpy as np
lgbms = [create_model('lightgbm', learning_rate = i) for i in np.arange(0.1, 1, 0.1)]

# train custom model
from gplearn.genetic import SymbolicClassifier
symclf = SymbolicClassifier(generations = 50)
sc = create_model(symclf)
Regression example:
# Importing dataset
from pycaret.datasets import get_data
boston = get_data('boston')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = boston, target = 'medv')

# train linear regression model
lr = create_model('lr') # 'lr' is the id of the model

# check the model library to see all models
models()

# train rf model using 5 fold CV
rf = create_model('rf', fold = 5)

# train svm model without CV
svm = create_model('svm', cross_validation = False)

# train xgboost model with max_depth = 10
xgboost = create_model('xgboost', max_depth = 10)

# train xgboost model on gpu
xgboost_gpu = create_model('xgboost', tree_method = 'gpu_hist', gpu_id = 0) # 0 is gpu-id

# train multiple lightgbm models with n learning_rate
import numpy as np
lgbms = [create_model('lightgbm', learning_rate = i) for i in np.arange(0.1, 1, 0.1)]

# train custom model
from gplearn.genetic import SymbolicRegressor
symreg = SymbolicRegressor(generations = 50)
sc = create_model(symreg)
Clustering example:
# Importing dataset
from pycaret.datasets import get_data
jewellery = get_data('jewellery')

# Importing module and initializing setup
from pycaret.clustering import *
clu1 = setup(data = jewellery)

# check the model library to see all models
models()

# training kmeans model
kmeans = create_model('kmeans')

# training kmodes model
kmodes = create_model('kmodes')
Anomaly detection example:
# Importing dataset
from pycaret.datasets import get_data
anomalies = get_data('anomalies')

# Importing module and initializing setup
from pycaret.anomaly import *
ano1 = setup(data = anomalies)

# check the model library to see all models
models()

# training Isolation Forest
iforest = create_model('iforest')

# training KNN model
knn = create_model('knn')
Natural language processing example:
# Importing dataset
from pycaret.datasets import get_data
kiva = get_data('kiva')

# Importing module and initializing setup
from pycaret.nlp import *
nlp1 = setup(data = kiva, target = 'en')

# check the model library to see all models
models()

# training LDA model
lda = create_model('lda')

# training NMF model
nmf = create_model('nmf')
Association rules example:
# Importing dataset
from pycaret.datasets import get_data
france = get_data('france')

# Importing module and initializing setup
from pycaret.arules import *
arule1 = setup(data = france, transaction_id = 'InvoiceNo', item_id = 'Description')

# creating Association Rule model
mod1 = create_model(metric = 'confidence')
3. Tuning Models
Tuning the hyperparameters of a machine learning model in any module is as simple as writing tune_model. It tunes the hyperparameters of the model passed as an estimator using a random grid search over a predefined, fully customizable grid. Optimizing hyperparameters requires an objective function, which is automatically linked to the target variable in supervised experiments such as classification and regression. However, for unsupervised experiments such as clustering, anomaly detection, and natural language processing, PyCaret lets you define a custom objective function by specifying a supervised target variable with the supervised_target parameter of tune_model (see the examples below). For supervised learning, the function returns a grid with k-fold cross-validated scores of common evaluation metrics along with the trained model object. For unsupervised learning, the function returns only the trained model object. The evaluation metrics used for supervised learning are:
Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
The number of folds can be defined with the fold parameter of tune_model; by default it is set to 10. By default, all metrics are rounded to 4 decimal places, which can be changed with the round parameter. The tune_model function in PyCaret performs a random grid search over a predefined search space, so its result depends on the number of iterations over that space. By default, the function performs 10 random iterations, which can be changed with the n_iter parameter of tune_model. Increasing n_iter may increase training time but often yields a more highly optimized model. The metric to be optimized can be defined with the optimize parameter; by default, regression tasks optimize R2 and classification tasks optimize Accuracy.
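The mechanics of this random grid search can be sketched with scikit-learn's RandomizedSearchCV; the search space below is illustrative, not PyCaret's predefined grid:

```python
# Sketch of tune_model's random grid search: n_iter random draws from
# the search space, each scored with k-fold CV on the chosen metric.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

space = {
    "max_depth": list(range(1, 11)),
    "min_samples_leaf": [2, 3, 4, 5, 6],
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=space,
    n_iter=10,           # mirrors tune_model's default of 10 iterations
    cv=10,               # mirrors the default fold = 10
    scoring="accuracy",  # mirrors optimize = 'Accuracy'
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

Raising n_iter simply draws more candidate configurations, which is why it trades training time for a better-optimized model.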
Classification example:
# Importing dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# train a decision tree model
dt = create_model('dt')

# tune hyperparameters of decision tree
tuned_dt = tune_model(dt)

# tune hyperparameters with increased n_iter
tuned_dt = tune_model(dt, n_iter = 50)

# tune hyperparameters to optimize AUC
tuned_dt = tune_model(dt, optimize = 'AUC') # default is 'Accuracy'

# tune hyperparameters with custom_grid
import numpy as np
params = {"max_depth": np.random.randint(1, int(len(diabetes.columns) * .85), 20),
          "max_features": np.random.randint(1, len(diabetes.columns), 20),
          "min_samples_leaf": [2, 3, 4, 5, 6],
          "criterion": ["gini", "entropy"]
          }
tuned_dt_custom = tune_model(dt, custom_grid = params)

# tune multiple models dynamically
top3 = compare_models(n_select = 3)
tuned_top3 = [tune_model(i) for i in top3]
Regression example:
# Importing dataset
from pycaret.datasets import get_data
boston = get_data('boston')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = boston, target = 'medv')

# train a decision tree model
dt = create_model('dt')

# tune hyperparameters of decision tree
tuned_dt = tune_model(dt)

# tune hyperparameters with increased n_iter
tuned_dt = tune_model(dt, n_iter = 50)

# tune hyperparameters to optimize MAE
tuned_dt = tune_model(dt, optimize = 'MAE') # default is 'R2'

# tune hyperparameters with custom_grid
import numpy as np
params = {"max_depth": np.random.randint(1, int(len(boston.columns) * .85), 20),
          "max_features": np.random.randint(1, len(boston.columns), 20),
          "min_samples_leaf": [2, 3, 4, 5, 6],
          "criterion": ["mse", "mae"] # regression trees use different criteria than classifiers
          }
tuned_dt_custom = tune_model(dt, custom_grid = params)

# tune multiple models dynamically
top3 = compare_models(n_select = 3)
tuned_top3 = [tune_model(i) for i in top3]
Clustering example:
# Importing dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# Importing module and initializing setup
from pycaret.clustering import *
clu1 = setup(data = diabetes)

# Tuning K-Modes Model
tuned_kmodes = tune_model('kmodes', supervised_target = 'Class variable')
Anomaly detection example:
# Importing dataset
from pycaret.datasets import get_data
boston = get_data('boston')

# Importing module and initializing setup
from pycaret.anomaly import *
ano1 = setup(data = boston)

# Tuning Isolation Forest Model
tuned_iforest = tune_model('iforest', supervised_target = 'medv')
Natural language processing example:
# Importing dataset
from pycaret.datasets import get_data
kiva = get_data('kiva')

# Importing module and initializing setup
from pycaret.nlp import *
nlp1 = setup(data = kiva, target = 'en')

# Tuning LDA Model
tuned_lda = tune_model('lda', supervised_target = 'status')