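[Datawhale Session 17 | Introduction to Financial Risk Control: Loan Default Prediction] Task02 check-in: Exploratory Data Analysis [pandas_profiling failed while generating the data report; a separate post will follow once this is resolved]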

Reposted article, 2020-09-18 23:01

## Datawhale check-in

Introduction to Financial Risk Control: Loan Default Prediction, Task02 Exploratory Data Analysis

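Goals of Task02:

Get familiar with the overall state of the dataset (outliers, missing values, and so on) and judge whether it is suitable for the machine learning or deep learning modelling that follows.
Understand the relationships among the variables, and between the variables and the prediction target.
Prepare for feature engineering.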

Preparing the data

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

file_path = 'E:\\阿里雲開發者-天池比賽\\02_零基礎入門金融風控_貸款違約預測\\'
train_file_path = file_path+'train.csv'
testA_file_path = file_path+'testA.csv'
now = datetime.datetime.now().strftime('%Y-%m-%d_%H:%M:%S')

output_path = 'E:\\PycharmProjects\\TianChiProject\\00_山楓葉紛飛\\competitions\\002_financial_risk\\profiling\\'

data_train = pd.read_csv(train_file_path)
data_test_a = pd.read_csv(testA_file_path)
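# print('Train data shape (rows, columns):', data_train.shape)
# print('TestA data shape (rows, columns):', data_test_a.shape)
print('Observations\n'
      'Target column: isDefault\n'
      'testA has two extra columns compared with train: \'n2.2\' \'n2.3\' ')

Output

Observations
Target column: isDefault
testA has two extra columns compared with train: 'n2.2' 'n2.3'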
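
The column difference stated above is hard-coded in the print call. As a minimal sketch, it could instead be computed from the two frames with a set difference. The frames `train_demo` and `test_demo` below are made-up stand-ins (not the competition data) that only mimic the relevant column names.

```python
import pandas as pd

# Tiny stand-in frames; column names mimic the real files, the values are invented.
train_demo = pd.DataFrame({"id": [0, 1], "n2": [1.0, 2.0], "isDefault": [0, 1]})
test_demo = pd.DataFrame(
    {"id": [2, 3], "n2": [3.0, 4.0], "n2.2": [3.0, 4.0], "n2.3": [3.0, 4.0]}
)

# Columns that appear in one frame but not in the other.
only_in_test = sorted(set(test_demo.columns) - set(train_demo.columns))
only_in_train = sorted(set(train_demo.columns) - set(test_demo.columns))

print("only in test :", only_in_test)
print("only in train:", only_in_train)
```

On the real data the same two lines would be applied to `data_test_a.columns` and `data_train.columns`.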

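2.3.0 Use the nrows parameter to read only the first N rows of a file

# data_train_sample = pd.read_csv(testA_file_path, nrows=5)

# b. Chunked reading
# Set the chunksize parameter to control how many rows each iteration returns

# chunker = pd.read_csv(testA_file_path, chunksize=5000)
# for item in chunker:
#     print(type(item)) # <class 'pandas.core.frame.DataFrame'>
#     print(len(item)) # 5000 (the chunksize; a final partial chunk may be shorter)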
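
Since the calls above are commented out, here is a small self-contained sketch of both options. It assumes an in-memory CSV of 12 invented rows (`csv_text`) in place of testA.csv, so that it runs without the competition files.

```python
import io
import pandas as pd

# Build a small in-memory CSV (12 data rows) as a stand-in for testA.csv.
csv_text = "id,loanAmnt\n" + "\n".join(f"{i},{1000 * i}" for i in range(12))

# a. Read only the first 5 rows.
head_demo = pd.read_csv(io.StringIO(csv_text), nrows=5)
print(head_demo.shape)

# b. Read in chunks of 5 rows; each chunk is a DataFrame,
#    and the last chunk holds whatever rows remain.
chunk_lengths = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5)]
print(chunk_lengths)
```

With 12 rows and `chunksize=5`, the chunks have 5, 5 and 2 rows, which is why the length of a chunk equals the chunk size except possibly for the last one.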

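2.3.1 Overall understanding of the data

"""
a. Load the dataset and check its size and the original feature dimensions;
b. Use info() to get familiar with the data types;
c. Take a rough look at the basic statistics of each feature;
"""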

print('data_train.shape', data_train.shape) # (800000, 47)
print('data_train.columns', data_train.columns)
print('data_test_a.shape', data_test_a.shape) # (200000, 48)

Output

data_train.shape (800000, 47)
data_train.columns Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')
data_test_a.shape (200000, 48)

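2.3.2 Missing values and constant (single-valued) features

"""
a. Check the missing values in the data
b. Check for features that take only one value
"""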

# Fraction of missing values per column
fea_dict_with_null_num = (data_train.isnull().sum()/len(data_train)).to_dict()

fea_null_moreThan0point05 = {}
have_null_cnt = 0
have_null_arr =[]
for key,value in fea_dict_with_null_num.items():
    if value > 0.05:
        fea_null_moreThan0point05[key] = value
    if value > 0:
        have_null_cnt += 1
        have_null_arr.append(key)
print('Number of columns with missing values: {}; they are {}'.format(have_null_cnt, have_null_arr))
print('Features with more than 5% missing values =', fea_null_moreThan0point05)

Output

Number of columns with missing values: 22; they are ['employmentTitle', 'employmentLength', 'postCode', 'dti', 'pubRecBankruptcies', 'revolUtil', 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
Features with more than 5% missing values = {'employmentLength': 0.05849875, 'n0': 0.0503375, 'n1': 0.0503375, 'n2': 0.0503375, 'n2.1': 0.0503375, 'n5': 0.0503375, 'n6': 0.0503375, 'n7': 0.0503375, 'n8': 0.05033875, 'n9': 0.0503375, 'n11': 0.08719, 'n12': 0.0503375, 'n13': 0.0503375, 'n14': 0.0503375}
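
The loop above can also be written with boolean indexing on the missing-ratio Series. The sketch below is one possible shorter form, not the author's code; it runs on an invented three-column frame (`demo`), and the 0.3 threshold is chosen only so that the toy data gives a non-trivial result.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],     # 25% missing
    "b": [1.0, 2.0, 3.0, 4.0],        # nothing missing
    "c": [np.nan, np.nan, 3.0, 4.0],  # 50% missing
})

# isnull().mean() gives the fraction of missing values per column,
# which is the same as isnull().sum() / len(demo).
missing_ratio = demo.isnull().mean()

cols_with_null = missing_ratio[missing_ratio > 0].index.tolist()
cols_over_threshold = missing_ratio[missing_ratio > 0.3].to_dict()

print(cols_with_null)
print(cols_over_threshold)
```

On the training set, replacing `demo` with `data_train` and 0.3 with 0.05 should reproduce the two results printed above.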

Visualizing the NaN ratios

missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing = missing.sort_values()
missing.plot.bar()
# Print missing
print('missing type', type(missing))
print('missing :\n', missing)

Output

missing type <class 'pandas.core.series.Series'>
missing :
 employmentTitle       0.000001
postCode              0.000001
title                 0.000001
dti                   0.000299
pubRecBankruptcies    0.000506
revolUtil             0.000664
n10                   0.041549
n4                    0.041549
n12                   0.050338
n9                    0.050338
n7                    0.050338
n6                    0.050338
n2.1                  0.050338
n13                   0.050338
n2                    0.050338
n1                    0.050338
n0                    0.050338
n5                    0.050338
n14                   0.050338
n8                    0.050339
employmentLength      0.058499
n11                   0.087190
dtype: float64

2.3.2.1 Find features that take only one value in the training and test sets

one_value_fea = []
for col in data_train.columns:
    if data_train[col].nunique() <= 1:
        one_value_fea.append(col)
print('Training set one_value_fea=', one_value_fea)
one_value_fea_test = []
for col in data_test_a.columns:
    if data_test_a[col].nunique() <= 1:
        one_value_fea_test.append(col)
print('Test set one_value_fea_test=', one_value_fea_test)

Output

Training set one_value_fea= ['policyCode']
Test set one_value_fea_test= ['policyCode']
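
One detail worth keeping in mind with this check: `nunique()` ignores NaN by default, so a column holding a single value plus missing entries is also reported as constant. The sketch below illustrates the difference on an invented frame; the `flag` and `grade` columns and their values are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "policyCode": [1.0, 1.0, 1.0],  # truly constant
    "flag": [1.0, np.nan, 1.0],     # one value plus a missing entry
    "grade": ["A", "B", "A"],       # two distinct values
})

# Default behaviour: NaN is not counted as a value.
constant_ignoring_nan = [c for c in demo.columns if demo[c].nunique() <= 1]

# dropna=False counts NaN as its own value.
constant_counting_nan = [c for c in demo.columns if demo[c].nunique(dropna=False) <= 1]

print(constant_ignoring_nan)
print(constant_counting_nan)
```

Which variant is appropriate depends on whether missingness itself is considered informative for the feature.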

2.3.3 A closer look at the data: checking the data types

"""
a. Categorical data
b. Numerical data
    Discrete numerical data
    Continuous numerical data
"""
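print('data_train.head():\n', data_train.head())
print('data_train.tail():\n', data_train.tail())
# info() prints its report directly and returns None, so it is not wrapped in print()
data_train.info()
print('Rough overview of the basic statistics of each feature:\n',
      data_train.describe())
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
print('First 5 and last 5 rows combined:\n', pd.concat([data_train.head(5), data_train.tail(5)]))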

Output

data_train.head():
    id  loanAmnt  term  interestRate  installment grade subGrade  \
0   0   35000.0     5         19.52       917.97     E       E2   
1   1   18000.0     5         18.49       461.90     D       D2   
2   2   12000.0     5         16.99       298.17     D       D3   
3   3   11000.0     3          7.26       340.96     A       A4   
4   4    3000.0     3         12.99       101.07     C       C2   

   employmentTitle employmentLength  homeOwnership  ...    n5    n6    n7  \
0            320.0          2 years              2  ...   9.0   8.0   4.0   
1         219843.0          5 years              0  ...   NaN   NaN   NaN   
2          31698.0          8 years              0  ...   0.0  21.0   4.0   
3          46854.0        10+ years              1  ...  16.0   4.0   7.0   
4             54.0              NaN              1  ...   4.0   9.0  10.0   

     n8   n9   n10  n11  n12  n13  n14  
0  12.0  2.0   7.0  0.0  0.0  0.0  2.0  
1   NaN  NaN  13.0  NaN  NaN  NaN  NaN  
2   5.0  3.0  11.0  0.0  0.0  0.0  4.0  
3  21.0  6.0   9.0  0.0  0.0  0.0  1.0  
4  15.0  7.0  12.0  0.0  0.0  0.0  4.0  

[5 rows x 47 columns]
data_train.tail():
             id  loanAmnt  term  interestRate  installment grade subGrade  \
799995  799995   25000.0     3         14.49       860.41     C       C4   
799996  799996   17000.0     3          7.90       531.94     A       A4   
799997  799997    6000.0     3         13.33       203.12     C       C3   
799998  799998   19200.0     3          6.92       592.14     A       A4   
799999  799999    9000.0     3         11.06       294.91     B       B3   

        employmentTitle employmentLength  homeOwnership  ...    n5    n6  \
799995           2659.0          7 years              1  ...   6.0   2.0   
799996          29205.0        10+ years              0  ...  15.0  16.0   
799997           2582.0        10+ years              1  ...   4.0  26.0   
799998            151.0        10+ years              0  ...  10.0   6.0   
799999             13.0          5 years              0  ...   3.0   4.0   

          n7    n8    n9   n10  n11  n12  n13  n14  
799995  12.0  13.0  10.0  14.0  0.0  0.0  0.0  3.0  
799996   2.0  19.0   2.0   7.0  0.0  0.0  0.0  0.0  
799997   4.0  10.0   4.0   5.0  0.0  0.0  1.0  4.0  
799998  12.0  22.0   8.0  16.0  0.0  0.0  0.0  5.0  
799999   4.0   8.0   3.0   7.0  0.0  0.0  0.0  2.0  

[5 rows x 47 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n2.1                759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 271.6+ MB
...
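
The split into categorical, discrete numerical and continuous numerical features listed at the start of 2.3.3 could be sketched as follows. This is an illustration on an invented frame, not the author's code: the cutoff `DISCRETE_MAX_UNIQUE` is a hypothetical threshold picked for the toy data, and a sensible value for the real training set would have to be chosen by inspection.

```python
import pandas as pd

demo = pd.DataFrame({
    "grade": ["A", "B", "C", "A", "B", "C"],
    "term": [3, 5, 3, 3, 5, 3],
    "interestRate": [7.26, 12.99, 16.99, 18.49, 19.52, 14.49],
})

# Numerical columns by dtype; everything else is treated as categorical.
numerical_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in demo.columns if c not in numerical_cols]

# Assumed rule of thumb: few distinct values means discrete.
DISCRETE_MAX_UNIQUE = 3
discrete_cols = [c for c in numerical_cols if demo[c].nunique() <= DISCRETE_MAX_UNIQUE]
continuous_cols = [c for c in numerical_cols if c not in discrete_cols]

print("categorical:", categorical_cols)
print("discrete   :", discrete_cols)
print("continuous :", continuous_cols)
```

In the `info()` output above, the five `object` columns (grade, subGrade, employmentLength, issueDate, earliesCreditLine) are the ones such a split would treat as categorical.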
