sklearn Logistic Regression (LR) Tuning Guide



Official sklearn reference for tuning logistic regression:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

sklearn.linear_model.LogisticRegression

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

Logistic Regression (aka logit, MaxEnt) classifier.

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)

This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers. Note that regularization is applied by default. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).

The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation, or no regularization. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual formulation only for the L2 penalty. The Elastic-Net regularization is only supported by the ‘saga’ solver.

Read more in the User Guide.

Parameters:
penalty  str, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, optional (default=’l2’)

Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties. ‘elasticnet’ is only supported by the ‘saga’ solver. If ‘none’ (not supported by the liblinear solver), no regularization is applied.

New in version 0.19: l1 penalty with SAGA solver (allowing ‘multinomial’ + L1)

dual  bool, optional (default=False)

Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.

tol  float, optional (default=1e-4)

Tolerance for stopping criteria.

C  float, optional (default=1.0)

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

fit_intercept  bool, optional (default=True)

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

intercept_scaling  float, optional (default=1)

Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.

Note: the synthetic feature weight is subject to L1/L2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.

class_weight  dict or ‘balanced’, optional (default=None)

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

New in version 0.17: class_weight=’balanced’

random_state  int, RandomState instance or None, optional (default=None)

The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when solver == ‘sag’ or ‘liblinear’.

solver  str, {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, optional (default=’liblinear’).

Algorithm to use in the optimization problem.

  • For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
  • For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
  • ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
  • ‘liblinear’ and ‘saga’ also handle L1 penalty
  • ‘saga’ also supports ‘elasticnet’ penalty
  • ‘liblinear’ does not handle no penalty

Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.

New in version 0.17: Stochastic Average Gradient descent solver.

New in version 0.19: SAGA solver.

Changed in version 0.20: Default will change from ‘liblinear’ to ‘lbfgs’ in 0.22.

max_iter  int, optional (default=100)

Maximum number of iterations taken for the solvers to converge.

multi_class  str, {‘ovr’, ‘multinomial’, ‘auto’}, optional (default=’ovr’)

If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

New in version 0.18: Stochastic Average Gradient descent solver for ‘multinomial’ case.

Changed in version 0.20: Default will change from ‘ovr’ to ‘auto’ in 0.22.

verbose  int, optional (default=0)

For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.

warm_start  bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See the Glossary.

New in version 0.17: warm_start to support lbfgs, newton-cg, sag, saga solvers.

n_jobs  int or None, optional (default=None)

Number of CPU cores used when parallelizing over classes if multi_class=’ovr’. This parameter is ignored when the solver is set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

l1_ratio  float or None, optional (default=None)

The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

Attributes:
classes_  array, shape (n_classes, )

A list of class labels known to the classifier.

coef_  array, shape (1, n_features) or (n_classes, n_features)

Coefficient of the features in the decision function.

coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).

intercept_  array, shape (1,) or (n_classes,)

Intercept (a.k.a. bias) added to the decision function.

If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class='multinomial', intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).

n_iter_  array, shape (n_classes,) or (1, )

Actual number of iterations for all classes. If binary or multinomial, it returns only 1 element. For liblinear solver, only the maximum number of iteration across all classes is given.

Changed in version 0.20: In SciPy <= 1.0.0 the number of lbfgs iterations may exceed max_iter. n_iter_ will now report at most max_iter.

 

See also

SGDClassifier
incrementally trained logistic regression (when given the parameter  loss="log").
LogisticRegressionCV
Logistic regression with built-in cross validation
Examples

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0, solver='lbfgs',
...                          multi_class='multinomial').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
       [9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...

 

1. Overview

    In scikit-learn, three classes are the main ones related to logistic regression: LogisticRegression, LogisticRegressionCV, and logistic_regression_path. The main difference between LogisticRegression and LogisticRegressionCV is that LogisticRegressionCV uses cross-validation to select the regularization coefficient C, whereas with LogisticRegression you have to specify a regularization coefficient yourself each time. Apart from cross-validation and the selection of C, the two classes are used in essentially the same way.
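    As a minimal illustration of the difference, here is a hedged sketch that fits both classes on a synthetic dataset; the candidate values in Cs and the dataset itself are arbitrary choices for illustration, not recommendations:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# LogisticRegression: the regularization coefficient C is fixed by hand.
clf = LogisticRegression(C=1.0, solver='lbfgs').fit(X, y)

# LogisticRegressionCV: C is chosen by 5-fold cross-validation over Cs.
clf_cv = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0], cv=5,
                              solver='lbfgs').fit(X, y)
print(clf_cv.C_)  # the C value selected for each class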

    The logistic_regression_path class is special: after fitting the data it cannot be used directly for prediction; it only selects appropriate logistic regression coefficients and regularization values for the fitted data. It is mainly used for model selection. It is rarely needed in practice, so logistic_regression_path is not discussed further below.

    In addition, scikit-learn has a class with an easily misunderstood name, RandomizedLogisticRegression. Although its name contains "logistic regression", it mainly uses L1-regularized logistic regression to perform feature selection; it belongs to the dimensionality-reduction family of algorithms, not to classification algorithms in the usual sense.

    The rest of this guide revolves around choosing the important parameters of LogisticRegression and LogisticRegressionCV; these parameters have the same meaning in both classes.

2. Regularization parameter: penalty

    LogisticRegression and LogisticRegressionCV apply a regularization term by default. The penalty parameter accepts the values "l1" and "l2", corresponding to L1 and L2 regularization; the default is L2.

    When tuning, if the main goal is simply to curb overfitting, choosing L2 regularization for penalty is usually enough. But if the model still overfits with L2, i.e. prediction performance remains poor, you can consider L1 regularization. L1 is also an option when the model has very many features and you want the coefficients of unimportant features driven to zero, making the model sparse.

    The choice of penalty constrains the choice of the loss-optimization algorithm, i.e. the solver parameter. With L2 regularization, all four algorithms {'newton-cg', 'lbfgs', 'liblinear', 'sag'} are available. With L1 regularization, only 'liblinear' can be chosen (in newer versions also 'saga', as the official docs above note). The reason is that the L1-regularized loss function is not continuously differentiable, and {'newton-cg', 'lbfgs', 'sag'} all require first- or second-order continuous derivatives of the loss; 'liblinear' has no such requirement.
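    A minimal sketch of the penalty/solver pairing (the synthetic dataset and its parameters are made up for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# L2 penalty: any of liblinear, newton-cg, lbfgs, sag can be used.
l2_clf = LogisticRegression(penalty='l2', solver='lbfgs',
                            max_iter=1000).fit(X, y)

# L1 penalty: liblinear's coordinate descent does not need a smooth loss.
l1_clf = LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)

# L1 drives unimportant coefficients exactly to zero (sparse model).
print((l1_clf.coef_ == 0).sum(), "coefficients zeroed by L1")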

    What exactly distinguishes these four algorithms, and what the implications are, is covered in the next section.

3. Optimization algorithm parameter: solver

    The solver parameter determines the method used to optimize the logistic regression loss function. Four algorithms are available:

    a) liblinear: implemented using the open-source liblinear library; internally it uses coordinate descent to iteratively optimize the loss function.

    b) lbfgs: a quasi-Newton method; it iteratively optimizes the loss function using (an approximation to) the Hessian, the matrix of second derivatives of the loss.

    c) newton-cg: another member of the Newton family; it also uses the Hessian matrix of the loss function to iterate.

    d) sag: stochastic average gradient descent, a variant of gradient descent. It differs from plain gradient descent in that each iteration uses only a subset of the samples to compute the gradient, which makes it suitable when there are many samples. SAG is a linearly convergent algorithm, far faster than SGD. For background on SAG, see the blog post on the linearly convergent stochastic optimization algorithms SAG and SVRG.

 

    As the descriptions above show, newton-cg, lbfgs and sag all require first- or second-order continuous derivatives of the loss function, so they cannot be used with L1 regularization, whose loss lacks continuous derivatives; they work only with L2 regularization. liblinear, by contrast, handles both L1 and L2.

    Also, since sag uses only part of the samples for each gradient step, do not choose it when the sample size is small; but when the sample size is very large, say over 100,000, sag is the first choice. sag, however, cannot be used with L1 regularization, so with a large sample that also needs L1 you have to make a trade-off: either subsample to reduce the data size, or fall back to L2 regularization (or, in newer versions, use the 'saga' solver, which the official docs above list as supporting L1).
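    Here is a hedged sketch of using sag on a larger dataset. Per the note in the official docs above, sag's fast convergence is only guaranteed when features have approximately the same scale, so the data is standardized first; the dataset size here is arbitrary:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A large-ish synthetic dataset; sag shines as n_samples grows.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

# StandardScaler puts all features on the same scale, which sag needs
# for its fast (linear) convergence guarantee.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(solver='sag', max_iter=200))
clf.fit(X, y)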

The official sklearn documentation gives the following guidance on choosing a solver:

 

In a nutshell, one may choose the solver with the following rules:

Case                                 Solver
Small dataset or L1 penalty          "liblinear"
Multinomial loss or large dataset    "lbfgs", "sag" or "newton-cg"
Very large dataset                   "sag"

 

    Given all these restrictions on newton-cg, lbfgs and sag, you might think that unless the sample is large we can simply pick liblinear. Wrong, because liblinear has weaknesses of its own. Logistic regression comes in binary and multiclass forms. For multiclass logistic regression, the common schemes are one-vs-rest (OvR) and many-vs-many (MvM), and MvM is generally somewhat more accurate than OvR. The frustrating part is that liblinear supports only OvR, not MvM, so when a relatively accurate multiclass logistic regression is needed, liblinear cannot be chosen. It also means that a relatively accurate multiclass logistic regression cannot use L1 regularization.

In summary: liblinear supports L1 and L2 but only OvR for multiclass; 'lbfgs', 'sag' and 'newton-cg' support only L2 but handle both OvR and MvM multiclass.

    What exactly distinguishes OvR from MvM is covered in the next section.

4. Multiclass strategy parameter: multi_class

    The multi_class parameter determines how multiclass classification is handled; the available values are ovr and multinomial, with ovr as the default.

    ovr is the one-vs-rest (OvR) scheme mentioned above, while multinomial corresponds to the many-vs-many (MvM) idea. For binary logistic regression there is no difference between ovr and multinomial; the difference matters only in the multiclass case.

    The idea behind OvR is simple: however many classes there are, the problem is reduced to binary logistic regression. Concretely, for the decision on class K, all samples of class K are treated as positives and all remaining samples as negatives, a binary logistic regression is fit on them, and the result is the model for class K. The models for the other classes are obtained in the same way.

    MvM is more involved; here it is illustrated with its special case one-vs-one (OvO). If the model has T classes, each time we pick two of the T classes, call them T1 and T2, gather all samples labelled T1 or T2, treat T1 as positive and T2 as negative, and fit a binary logistic regression to obtain the model parameters. In total T(T-1)/2 classifiers are needed.

    From these descriptions, OvR is the simpler scheme but its classification performance is usually slightly worse (for most sample distributions; under some distributions OvR may actually do better), while MvM is relatively more accurate but slower than OvR.

    If you choose ovr, all four loss optimizers, liblinear, newton-cg, lbfgs and sag, are available. But if you choose multinomial, only newton-cg, lbfgs and sag can be used.
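    A minimal sketch contrasting the two strategies on the three-class iris data (on such an easy dataset any accuracy difference will be small):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# OvR: one binary model per class; works with all four solvers.
ovr = LogisticRegression(multi_class='ovr', solver='liblinear').fit(X, y)

# Multinomial: one joint model over all classes; liblinear is not allowed.
mn = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                        max_iter=1000).fit(X, y)

print("ovr accuracy:        ", ovr.score(X, y))
print("multinomial accuracy:", mn.score(X, y))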

5. Class weight parameter: class_weight

    The class_weight parameter specifies the weights of the classes in the model. It can be omitted, meaning weights are not considered, i.e. all classes have the same weight. If you do supply it, you can either pass balanced and let the library compute the class weights itself, or enter the weight of each class yourself; for example, for a binary 0/1 model, class_weight={0:0.9, 1:0.1} gives class 0 a weight of 90% and class 1 a weight of 10%.

    If class_weight is balanced, the library computes the weights from the training sample counts: the more samples a class has, the lower its weight; the fewer samples, the higher its weight.

According to the official sklearn documentation, when class_weight is balanced the class weights are computed as follows:

n_samples / (n_classes * np.bincount(y)), where n_samples is the number of samples, n_classes the number of classes, and np.bincount(y) the per-class sample counts; for example, if y=[1,0,0,1,1], then np.bincount(y)=[2,3].
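A tiny sketch that reproduces this formula with numpy, using the y from the example above:

import numpy as np

y = np.array([1, 0, 0, 1, 1])
n_samples = len(y)             # 5
n_classes = len(np.unique(y))  # 2
weights = n_samples / (n_classes * np.bincount(y))
print(weights)  # [1.25  0.8333...]; the rarer class 0 gets the larger weight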

    So what is class_weight good for? In classification we frequently run into two kinds of problems:

    First, the cost of misclassification may be high. For example, when classifying users as legitimate or illegitimate, the cost of classifying an illegitimate user as legitimate is very high. We would rather misclassify legitimate users as illegitimate, since they can then be re-screened manually, than let illegitimate users be classified as legitimate. In this case we can raise the weight of the illegitimate class appropriately.

    Second, the samples may be highly imbalanced. Suppose we have 10,000 binary samples of legitimate and illegitimate users, 9,995 of them legitimate and only 5 illegitimate. If we ignore weights, we could predict every test sample as legitimate and reach a nominal accuracy of 99.95%, which is meaningless. Here we can pass balanced and let the library automatically raise the weight of the illegitimate samples.

    Raising the weight of a class means that, compared with no weighting, more samples get classified into the higher-weight class, which addresses both problems above.
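    A hedged sketch of both options on a deliberately imbalanced synthetic dataset (the 0.9/0.1 weights mirror the example above; the dataset itself is made up for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Roughly 99.5% class 0 vs 0.5% class 1, echoing the imbalanced example.
X, y = make_classification(n_samples=10_000, weights=[0.995], random_state=0)

# Manual weights: class 0 carries 90% of the weight, class 1 carries 10%.
manual = LogisticRegression(class_weight={0: 0.9, 1: 0.1},
                            solver='liblinear').fit(X, y)

# 'balanced': weights computed as n_samples / (n_classes * np.bincount(y)).
balanced = LogisticRegression(class_weight='balanced',
                              solver='liblinear').fit(X, y)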

    Of course, for the second, imbalanced case we can also consider the sample weight parameter sample_weight, covered in the next section, instead of class_weight.

6. Sample weight parameter: sample_weight

    The previous section mentioned the sample imbalance problem. When the samples are imbalanced, they are not an unbiased estimate of the population, which can degrade the model's predictive power. In that situation we can try to fix the problem by adjusting the sample weights. There are two ways to do this: the first is to use balanced for class_weight; the second is to pass sample_weight when calling fit and set each sample's weight yourself.

    If both methods are used when fitting a logistic regression in scikit-learn, a sample's effective weight is class_weight * sample_weight.
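    A minimal sketch of passing sample_weight to fit; the per-sample weights here are arbitrary and only show the mechanics:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Up-weight the rare positive class 5x at the sample level (arbitrary factor).
w = np.where(y == 1, 5.0, 1.0)

clf = LogisticRegression(class_weight='balanced', solver='liblinear')
clf.fit(X, y, sample_weight=w)
# Effective per-sample weight = class_weight * sample_weight.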

    This concludes the short summary of tuning the logistic regression classes in scikit-learn. The remaining parameters, such as the regularization parameter C (Cs in the cross-validated class) and the iteration limit max_iter, do not differ much from other algorithm classes, so they are not elaborated here.

 

 
