Variance Inflation Factor (VIF) Explained, with Python Script

This article is a repost. 2019-05-19 15:50. Tags: statistics / sklearn machine learning


https://etav.github.io/python/vif_factor_python.html

Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. To detect collinearity among variables, create a correlation matrix and look for pairs of variables whose correlation has a large absolute value. In R, use the cor function; in Python this can be accomplished with numpy's corrcoef function.
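As a rough illustration of the correlation-matrix check described above, the sketch below builds three synthetic variables and flags the pairs whose absolute correlation exceeds a cutoff. The data, the variable names, and the 0.8 threshold are all assumptions made for the example; they are not taken from the loan dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (an assumption for illustration, not the loan dataset):
# b is a noisy copy of a, while c is independent of both.
n = 1000
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)

# np.corrcoef treats each row as one variable by default, so stack as rows.
corr = np.corrcoef(np.vstack([a, b, c]))

# Flag pairs whose absolute correlation exceeds a threshold.
# 0.8 is an arbitrary illustrative cutoff, not a rule from the article.
threshold = 0.8
names = ["a", "b", "c"]
high_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

print(np.round(corr, 2))
print(high_pairs)
```

Only the pair (a, b) should be flagged here, since b was constructed from a.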

Multicollinearity, on the other hand, is more troublesome to detect because it emerges when three or more variables that are highly correlated are included within a model. To make matters worse, multicollinearity can emerge even when isolated pairs of variables are not collinear.
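The point that multicollinearity can hide from pairwise checks can be seen in a small synthetic experiment. In the sketch below (all data and names are assumptions for illustration), a fifth predictor is almost exactly the sum of four independent ones. No single pair is strongly correlated, yet the fifth predictor is almost perfectly explained by the other four taken together.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Four independent predictors plus a fifth that is almost their sum.
base = rng.normal(size=(n, 4))
x5 = base.sum(axis=1) + 0.05 * rng.normal(size=n)
X = np.column_stack([base, x5])

# Largest pairwise correlation (off-diagonal entries), in absolute value.
corr = np.corrcoef(X, rowvar=False)
off_diag = corr[~np.eye(5, dtype=bool)]
max_pairwise = np.abs(off_diag).max()

# R^2 from regressing x5 on the other four predictors (with an intercept).
A = np.column_stack([np.ones(n), base])
coef, *_ = np.linalg.lstsq(A, x5, rcond=None)
resid = x5 - A @ coef
r_squared = 1 - resid.var() / x5.var()

print("max pairwise |corr|:", round(max_pairwise, 2))
print("R^2 of x5 on the rest:", round(r_squared, 4))
```

The largest pairwise correlation stays moderate (around 0.5 by construction), while the R^2 is close to 1, which is exactly the situation a correlation matrix alone would miss.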

A common R function used for testing regression assumptions, and specifically multicollinearity, is "VIF()". Unlike many statistical concepts, its formula is straightforward:

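$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$

where $R_j^2$ is the coefficient of determination obtained by regressing predictor $j$ on all the other predictors.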

The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. For a given predictor, it is the ratio of the variance of that predictor's coefficient (beta) in the full model to the variance the same coefficient would have if the predictor were fit alone.
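The two descriptions of VIF above (the 1 / (1 - R^2) formula and the ratio of coefficient variances) can be checked against each other numerically. The following is a minimal sketch on synthetic data, not the statsmodels implementation. The function names are hypothetical, and the variance-ratio version assumes ordinary least squares with an intercept and the same error variance in both fits, so that the error variance cancels out of the ratio.

```python
import numpy as np

def vif_from_r2(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        sst = ((target - target.mean()) ** 2).sum()
        r2 = 1 - (resid @ resid) / sst
        out[j] = 1 / (1 - r2)
    return out

def vif_from_variance_ratio(X):
    """Var(beta_j) in the full model divided by Var(beta_j) when x_j is fit alone."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(design.T @ design)
    sst = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    # Full model:  Var(beta_j) = sigma^2 * xtx_inv[j + 1, j + 1]
    # Fit alone:   Var(beta_j) = sigma^2 / sst[j]
    # sigma^2 cancels in the ratio.
    return np.diag(xtx_inv)[1:] * sst

# Synthetic predictors (assumption): x2 is correlated with x1, x3 is unrelated.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

by_r2 = vif_from_r2(X)
by_ratio = vif_from_variance_ratio(X)
print(np.round(by_r2, 2))
print(np.round(by_ratio, 2))
```

Both functions should print the same values up to floating-point error, with elevated VIFs for x1 and x2 and a VIF close to 1 for x3.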

Steps for Implementing VIF

1. Run a multiple regression.
2. Calculate the VIF factors.
3. Inspect the factors for each predictor variable. If the VIF is between 5 and 10 (or higher), multicollinearity is likely present and you should consider dropping the variable.

# Imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('loan.csv')
# Keep numeric columns only. Missing values are dropped after subsetting, below;
# a bare df.dropna() here would have no effect because its result is discarded.
df = df.select_dtypes(include='number')
df.head()

	id	member_id	loan_amnt	funded_amnt	funded_amnt_inv	int_rate	installment	annual_inc	dti	...	total_bal_il	il_util	open_rv_12m	open_rv_24m	max_bal_bc	all_util	total_rev_hi_lim	inq_fi	total_cu_tl	inq_last_12m
0	1077501	1296599	5000.0	5000.0	4975.0	10.65	162.87	24000.0	27.65	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1077430	1314167	2500.0	2500.0	2500.0	15.27	59.83	30000.0	1.00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1077175	1313524	2400.0	2400.0	2400.0	15.96	84.33	12252.0	8.72	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1076863	1277178	10000.0	10000.0	10000.0	13.49	339.31	49200.0	20.00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1075358	1311748	3000.0	3000.0	3000.0	12.69	67.79	80000.0	17.94	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 51 columns

df = df[['annual_inc', 'loan_amnt', 'funded_amnt', 'dti']].dropna()  # subset the dataframe and drop rows with missing values

Step 1: Run a multiple regression

# Gather features: every column except the response variable
features = "+".join(df.columns.difference(["annual_inc"]))

# Get y and X dataframes based on this regression
y, X = dmatrices('annual_inc ~ ' + features, df, return_type='dataframe')

Step 2: Calculate VIF Factors

# For each column of X, calculate the VIF and save it in a dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

Step 3: Inspect VIF Factors

vif.round(1)

	VIF Factor	features
0	5.1	Intercept
1	1.0	dti
2	678.4	funded_amnt
3	678.4	loan_amnt

As expected, the funded amount and the loan amount have very high variance inflation factors because they "explain" the same variance within this dataset. We would need to discard one of these variables before moving on to model building, or risk building a model with high multicollinearity.
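To see what discarding one of the two variables does to the VIFs, the sketch below recomputes them before and after the drop. Because loan.csv is not bundled here, it uses synthetic stand-ins that only borrow the column names; the distributions are assumptions and will not reproduce the 678.4 figure above. It also computes VIF as the diagonal of the inverse correlation matrix, an equivalent shortcut swapped in so the example needs only numpy rather than statsmodels.

```python
import numpy as np

def vif(X):
    """VIF for each column of X, taken from the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

rng = np.random.default_rng(42)
n = 5000

# Synthetic stand-ins (assumption): funded_amnt is almost equal to loan_amnt,
# and dti is unrelated to both.
loan_amnt = rng.uniform(1000, 35000, size=n)
funded_amnt = loan_amnt - rng.exponential(50, size=n)
dti = rng.uniform(0, 40, size=n)

# VIFs with both collinear variables present.
before = dict(zip(
    ["dti", "funded_amnt", "loan_amnt"],
    vif(np.column_stack([dti, funded_amnt, loan_amnt])),
))

# VIFs after discarding funded_amnt.
after = dict(zip(
    ["dti", "loan_amnt"],
    vif(np.column_stack([dti, loan_amnt])),
))

print({k: round(float(v), 1) for k, v in before.items()})
print({k: round(float(v), 1) for k, v in after.items()})
```

With one of the near-duplicate columns removed, the remaining predictors should all have VIFs close to 1.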
