(Original) (1) Machine Learning Notes: Data Exploration

Reposted article. 2017-10-25 00:37. Category: Machine Learning

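General steps in machine learning

1. Determine the features
(1) Data exploration
(2) Data preprocessing
2. Determine the model
(1) Choose the objective function
3. Train the model
(1) Choose an optimization algorithm and estimate the model parameters
4. Model selection
Choose among the models obtained under different parameter settings.
5. Model evaluation
Evaluate the selected model: estimate its performance on unseen data (its generalization ability).

These five steps are iterated until the best model and its parameters are found.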
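The five steps can be sketched end to end on toy data. This is only an illustration under stated assumptions: the data are synthetic, the model is a closed-form ridge regression chosen for brevity, and the names `fit_ridge`, `mse`, and `candidates` are hypothetical helpers, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (features): synthetic data standing in for a real, explored dataset.
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Hold out validation data (model selection) and test data (model evaluation).
X_train, y_train = X[:120], y[:120]
X_val, y_val = X[120:160], y[120:160]
X_test, y_test = X[160:], y[160:]


def fit_ridge(X, y, alpha):
    """Steps 2-3: minimize ||Xw - y||^2 + alpha * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


# Step 4: pick the regularization strength with the lowest validation error.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
models = {alpha: fit_ridge(X_train, y_train, alpha) for alpha in candidates}
best_alpha = min(candidates, key=lambda a: mse(X_val, y_val, models[a]))

# Step 5: estimate generalization error on data never used above.
test_error = mse(X_test, y_test, models[best_alpha])
print(best_alpha, test_error)
```

Keeping the test split out of both training and selection is what makes the last number an estimate of performance on unseen data.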

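Below, using Boston house price prediction as the example, we start with a brief look at data exploration.
When we first get a dataset we usually do not know what regularities it contains. To understand how the features behave (their probability distributions), the first step is data exploration:
examining the patterns or distributions of the features to lay the groundwork for model selection.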

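Data exploration

Data exploration covers:
(1) The size of the data.
(2) The data types, to decide whether further encoding is needed.
(3) Whether there are missing values; if so, impute them.
(4) The data distributions and whether there are anomalous points; if so, handle the outliers.
(5) Whether dimensionality reduction is needed: examine the pairwise relationships between features to see whether the data are correlated (redundant).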
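As a rough sketch, the checklist above maps onto a handful of pandas calls. The table below is made up for illustration (its column names are hypothetical), and both the median imputation and the 1.5 * IQR outlier rule are assumptions picked as common defaults, not methods prescribed by these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "rooms": [6.5, 6.4, 7.2, np.nan, 7.1, 6.0],
    "area": [65.0, 64.0, 72.5, 70.0, 71.0, 60.5],
    "price": [24.0, 21.6, 34.7, 33.4, 36.2, 500.0],
    "river": ["no", "no", "yes", "no", "no", "yes"],
})

# (1) Size of the data: (rows, columns).
print(df.shape)

# (2) Data types: non-numeric columns usually need encoding before modeling.
needs_encoding = df.select_dtypes(exclude="number").columns.tolist()

# (3) Missing values per column; here they are filled with the column median.
missing = df.isnull().sum()
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

# (4) Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

# (5) Redundancy: absolute pairwise correlation between numeric columns.
corr = df[["rooms", "area", "price"]].corr().abs()
print(needs_encoding, outliers.index.tolist())
```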

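1. Import the required packages

The data-processing packages are NumPy, SciPy, and pandas; SciPy and pandas are built on top of NumPy.
The visualization packages are Matplotlib and Seaborn; Seaborn is a higher-level wrapper around Matplotlib.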

In [38]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Embed matplotlib figures directly in the notebook
%matplotlib inline

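%matplotlib inline:

Embeds matplotlib figures directly in the notebook.
IPython provides many magic commands. They start with % or %%: those starting with % are called line magics, and those starting with %% are called cell magics.
A line magic applies only to the line it is on, whereas a cell magic must appear on the first line of a cell and processes the whole cell.
Running %magic shows documentation for all the commands, and appending ? to a command shows detailed help for that command:
%matplotlib?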

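2. Read the structured data

read_csv and to_csv are a pair of input/output tools: read_csv returns a pandas.DataFrame directly, and to_csv writes the file as soon as it is called.
header: the row number that holds the column names. If the names are on row 0, pass 0; reading the data then starts after that row. If the file has no column names, pass None.
names: the list of column names to use as the final column names.
encoding: the character encoding of the dataset. Data files are commonly stored as UTF-8 to make them easy to exchange.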
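A small sketch of how header and names interact. The CSV text here is made up and held in memory with io.StringIO so the example runs without a file; reading from a real path works the same way.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file path works the same way.
with_header = "CRIM,RM,MEDV\n0.00632,6.575,24.0\n0.02731,6.421,21.6\n"
without_header = "0.00632,6.575,24.0\n0.02731,6.421,21.6\n"

# header=0: row 0 holds the column names and is not read as data.
df_a = pd.read_csv(io.StringIO(with_header), header=0)

# header=None plus names: the file has no header row, so names are supplied.
df_b = pd.read_csv(io.StringIO(without_header), header=None,
                   names=["CRIM", "RM", "MEDV"])

# to_csv round trip; index=False leaves the row index out of the output.
buffer = io.StringIO()
df_b.to_csv(buffer, index=False)
df_c = pd.read_csv(io.StringIO(buffer.getvalue()))

print(df_a.equals(df_b), df_b.equals(df_c))
```

The encoding argument only matters when reading from a path or a byte stream, for example pd.read_csv(path, encoding="utf-8"); an in-memory string is already decoded.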

In [39]:

dpath = './data/'
data = pd.read_csv(dpath + "boston_housing.csv")  # returns a DataFrame

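3. Data overview

data.head(n): show the first n rows of the DataFrame (default 5)
data.tail(n): show the last n rows of the DataFrame (default 5)
data.info(): show the index, data types, and memory usage
data.isnull(): check for null values; returns a Boolean DataFrame of the same shape
data.describe(): show summary statistics for the numeric columns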

In [40]:

data.head(5)  # show the first 5 rows of the DataFrame

Out[40]:

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18	2.31	0.538	6.575	65.2	4.0900	1	296	15	396.90	4.98	24.0
1	0.02731	0	7.07	0.469	6.421	78.9	4.9671	2	242	17	396.90	9.14	21.6
2	0.02729	0	7.07	0.469	7.185	61.1	4.9671	2	242	17	392.83	4.03	34.7
3	0.03237	0	2.18	0.458	6.998	45.8	6.0622	3	222	18	394.63	2.94	33.4
4	0.06905	0	2.18	0.458	7.147	54.2	6.0622	3	222	18	396.90	5.33	36.2

In [41]:

data.tail(5)  # show the last 5 rows of the DataFrame

Out[41]:

	CRIM	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
501	0.06263	11.93	0.573	6.593	69.1	2.4786	1	273	21	391.99	9.67	22.4
502	0.04527	11.93	0.573	6.120	76.7	2.2875	1	273	21	396.90	9.08	20.6
503	0.06076	11.93	0.573	6.976	91.0	2.1675	1	273	21	396.90	5.64	23.9
504	0.10959	11.93	0.573	6.794	89.3	2.3889	1	273	21	393.45	6.48	22.0
505	0.04741	11.93	0.573	6.030	80.8	2.5050	1	273	21	396.90	7.88	11.9

In [42]:

data.info()  # show the index, data types, and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null int64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null int64
PTRATIO    506 non-null int64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(9), int64(5)
memory usage: 55.4 KB

In [43]:

data.isnull().sum()  # count the null values in each column

Out[43]:

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [44]:

data.describe()  # summary statistics for the numeric columns

Out[44]:

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.347826	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.083004	356.674032	12.653063	22.532806
std	8.601545	23.310593	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.280574	91.294864	7.141062	9.197104
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.000000	0.320000	1.730000	5.000000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.000000	375.377500	6.950000	17.025000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.000000	391.440000	11.360000	21.200000
75%	3.677082	12.000000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.000000	396.225000	16.955000	25.000000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000	50.000000

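This gives, for each attribute, the sample count, mean, standard deviation, minimum, first quartile (25%), median (50%), third quartile (75%), and maximum,
which provides a first impression of each feature's distribution.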
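To make the rows of describe() concrete, the sketch below compares each entry with the corresponding Series method. The five values are made up so the quartiles are easy to read off; they are not taken from the housing data.

```python
import pandas as pd

# Hypothetical values, chosen only to keep the arithmetic easy to follow.
s = pd.Series([5.0, 17.0, 21.0, 25.0, 50.0], name="MEDV")

summary = s.describe()

# Each entry of describe() matches the corresponding Series method.
same_as_methods = bool(
    summary["count"] == s.count()
    and summary["mean"] == s.mean()
    and summary["std"] == s.std()        # sample standard deviation (ddof=1)
    and summary["min"] == s.min()
    and summary["25%"] == s.quantile(0.25)
    and summary["50%"] == s.median()
    and summary["75%"] == s.quantile(0.75)
    and summary["max"] == s.max()
)
print(summary)
print(same_as_methods)
```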

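4. Inspect the statistical characteristics of each attribute through visualization

(1) Histogram: distplot

seaborn.distplot(a, bins=None, ..., kde=True, ...)

a: the data; bins: the number of bins; kde: whether to draw a kernel density estimate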
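What bins controls can be seen without plotting anything. The sketch below uses numpy.histogram on synthetic values; it illustrates the binning idea on its own and makes no claim about how seaborn computes its bars internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a numeric column such as data.MEDV.values.
values = rng.normal(loc=22.5, scale=9.0, size=506)

# bins=30 splits [min, max] into 30 equal-width intervals.
counts, edges = np.histogram(values, bins=30)

# density=True rescales bar heights so the total bar area is 1,
# which puts the histogram on the same scale as a density curve.
density, _ = np.histogram(values, bins=30, density=True)
area = float(np.sum(density * np.diff(edges)))

print(len(counts), len(edges), counts.sum(), area)
```

Thirty bins need thirty-one edges, and every sample falls into exactly one bin, so the counts add up to the sample size.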

In [45]:

from matplotlib.font_manager import FontProperties  # font support for the Chinese labels
font_set = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=12)

# Histogram of the house price (the target)
sns.distplot(data.MEDV.values, bins=30, kde=True)
plt.xlabel(u"房價分布", fontproperties=font_set)  # label text: "House price distribution"
plt.show()

In [46]:

# Histogram of the crime rate
sns.distplot(data.CRIM.values, bins=30, kde=False)
plt.xlabel(u"犯罪率直方圖", fontproperties=font_set)  # label text: "Crime rate histogram"
plt.show()

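The crime rate feature has a long-tailed distribution, fairly close to an exponential distribution. Most towns have a very low crime rate, and only a few samples have a high one.
Judging by common sense, these values are probably accurate and can be left unprocessed.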
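A long right tail also shows up in plain numbers: the mean sits well above the median (3.61 versus 0.26 for CRIM in the describe() output), and the skewness is positive. The sketch below checks both on a synthetic exponential sample; the sample is an assumption used for illustration and is not the CRIM column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic long-tailed sample; it is not the CRIM column itself.
crime_like = pd.Series(rng.exponential(scale=3.6, size=506))

# In a right-skewed distribution the mean is pulled above the median.
mean_value = crime_like.mean()
median_value = crime_like.median()

# Sample skewness: about 0 for symmetric data, positive for a long right tail.
skewness = crime_like.skew()

print(mean_value, median_value, skewness)
```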

(2) Scatter plot: scatter

In [47]:

# Scatter plot of the house price (the target)
plt.scatter(range(data.shape[0]), data.MEDV.values, color='purple')
plt.title(u"房價的散點分布", fontproperties=font_set)  # title text: "Scatter of house prices"

Out[47]:

<matplotlib.text.Text at 0x10afe7b8>

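The data are mostly concentrated around the mean, fairly close to a normal distribution. However, there are noticeably many samples at the maximum value of 50; the original data may have capped every value above 50 at 50 (a guess).
When training the model, the samples with y equal to 50 can also be treated as outliers and removed.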
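A minimal sketch of that check and removal, assuming the target column is named MEDV as in the notebook. The prices below are made up so the result is easy to verify by eye.

```python
import pandas as pd

# Hypothetical prices; 50.0 plays the role of the suspected cap.
df = pd.DataFrame({"MEDV": [24.0, 21.6, 50.0, 33.4, 50.0, 11.9, 50.0]})

cap = df["MEDV"].max()

# A pile-up of samples exactly at the maximum hints at capping.
n_at_cap = int((df["MEDV"] == cap).sum())

# Drop the capped samples before training.
df_trimmed = df[df["MEDV"] < cap].reset_index(drop=True)

print(cap, n_at_cap, len(df_trimmed))
```

Whether dropping these rows helps is an empirical question; comparing validation error with and without them is one way to decide.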

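(3) Count plot: countplot

A count plot can be thought of as a histogram applied to a categorical variable; it shows the number of observations in each category:

seaborn.countplot(x=None, y=None, hue=None, data=None, order=None, ...)

order: the order in which the categories are plotted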
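The numbers behind a count plot can be obtained with value_counts(). The sketch below uses a made-up column, and the reindex call is only an analogy for what order does (fixing the category order); it is not seaborn's implementation.

```python
import pandas as pd

# Hypothetical categorical column (a small stand-in for something like RAD).
rad = pd.Series([1, 2, 2, 3, 3, 3, 24, 24], name="RAD")

# value_counts() returns the per-category counts, most frequent first.
counts = rad.value_counts()

# reindex() mimics the role of order=: a fixed category order,
# with 0 for categories that never occur.
ordered = counts.reindex([1, 2, 3, 4, 24], fill_value=0)

print(ordered.to_dict())
```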

In [30]:

#sns.countplot(x=data.ZN)
#sns.countplot(x=data.CHAS, order=[0, 1])  # equivalent to the call below
sns.countplot(x=data.CHAS)

Out[30]:

<matplotlib.axes._subplots.AxesSubplot at 0xc5dd780>

In [48]:

sns.countplot(x=data.RAD)

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x11be3898>

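5. Pairwise correlation between features

a. We want the features to be strongly correlated with the label.
b. Strong correlation between two features means the information is redundant. Common ways to handle it:
1) Keep only one of the two features.
2) Reduce dimensionality with principal component analysis (PCA) or a similar method.
3) Use L1 regularization in the model's regularization term.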
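Option 2) can be sketched with nothing but NumPy: PCA computed from the SVD of the standardized data. The three features are synthetic, with the second deliberately built as a near copy of the first; in practice a library implementation would normally be used instead of this hand-rolled version.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the second feature is almost a copy of the first (redundant).
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.05, size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

# Standardize each feature, then take the SVD of the standardized matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Fraction of the total variance explained by each principal component.
explained = S**2 / np.sum(S**2)

# Keep the first two components: 3 correlated features -> 2 uncorrelated ones.
Z_reduced = Z @ Vt[:2].T

print(explained.round(3), Z_reduced.shape)
```

Because two of the three features carry almost the same information, two components retain nearly all of the variance.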

(1) Heat map: heatmap

seaborn.heatmap(data, vmin=None, vmax=None, cmap=None, center=None, robust=False, annot=None, fmt='.2g',
annot_kws=None, linewidths=0, linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False, ax=None,
xticklabels=True, yticklabels=True, mask=None, **kwargs)

data: a 2-D dataset. It can be a NumPy array; if it is a pandas DataFrame, the index and column information is used to label the rows and columns of the heat map.
linewidths: the width of the lines that separate the cells.
vmin, vmax: the values that anchor the two ends of the colormap; if omitted, they are inferred from the data.
annot: bool or rectangular dataset, optional. If True, write the data value in each cell.
mask: boolean array or DataFrame, optional. If passed, data will not be shown in cells where mask is True.
cbar: boolean, optional. Whether to draw a colorbar.
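A correlation matrix is symmetric, so half of a heat map repeats the other half. The sketch below builds a Boolean mask for the upper triangle with NumPy; the table is random and its column names are placeholders. Only the mask construction is shown as runnable code, so that it does not depend on a plotting backend.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical numeric table standing in for the housing data.
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df.corr().abs()

# True above the diagonal (k=1): those cells duplicate the lower triangle.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Cells a heat map would still show once the masked ones are hidden.
visible = corr.mask(mask)

print(visible.round(2))
```

Passing the array as sns.heatmap(corr, mask=mask, annot=True) hides the cells where the mask is True, as described for the mask parameter above.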

In [49]:

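plt.subplots(figsize=(13, 9))  # figure size, in inches

data_corr = data.corr().abs()  # absolute correlation coefficients between columns

# Plot the correlation matrix, annotate each cell with its value, and draw the colorbar
sns.heatmap(data_corr, annot=True)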

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x111510b8>

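(2) Print the pairwise correlation coefficients between features

sorted(list, key=lambda x: -abs(x[0]))

The key argument is a function that is called on each element to produce its sort key, and lambda defines an anonymous function inline. Here each element x is sorted by the value -abs(x[0]), that is, in descending order of the absolute value of its first item.
The name x is arbitrary; it stands for one element of the list, not the list itself.

DataFrame.iloc[i, j]: select data by integer position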
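Both idioms on a tiny example. The triples and the 2x2 table are made up; the comparison with reverse=True and with loc is added for contrast and is not part of the original notes.

```python
import pandas as pd

# Hypothetical [value, row, column] triples.
triples = [[0.53, 0, 1], [-0.91, 2, 3], [0.77, 1, 2]]

# Negating the key sorts from largest to smallest absolute value.
by_negated_key = sorted(triples, key=lambda t: -abs(t[0]))

# reverse=True with the un-negated key gives the same order.
by_reverse = sorted(triples, key=lambda t: abs(t[0]), reverse=True)

# iloc selects by integer position; loc selects by label.
df = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]],
                  index=["RM", "MEDV"], columns=["RM", "MEDV"])
by_position = df.iloc[0, 1]
by_label = df.loc["RM", "MEDV"]

print(by_negated_key, by_position, by_label)
```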

In [35]:

cols = data.columns  # column names
corr_list = []
size = data.shape[1]

# Collect every pair of columns whose absolute correlation is at least 0.5
for i in range(0, size):
    for j in range(i + 1, size):
        if abs(data_corr.iloc[i, j]) >= 0.5:
            corr_list.append([data_corr.iloc[i, j], i, j])  # iloc selects by position

# Sort by absolute correlation, largest first
sorted_corr_list = sorted(corr_list, key=lambda xx: -abs(xx[0]))

for v, i, j in sorted_corr_list:
    print("%s and %s = %.2f" % (cols[i], cols[j], v))  # cols: column names

RAD and TAX = 0.91
NOX and DIS = 0.77
INDUS and NOX = 0.76
AGE and DIS = 0.75
LSTAT and MEDV = 0.74
NOX and AGE = 0.73
INDUS and TAX = 0.72
INDUS and DIS = 0.71
RM and MEDV = 0.70
NOX and TAX = 0.67
ZN and DIS = 0.66
INDUS and AGE = 0.64
CRIM and RAD = 0.63
RM and LSTAT = 0.61
NOX and RAD = 0.61
INDUS and LSTAT = 0.60
AGE and LSTAT = 0.60
INDUS and RAD = 0.60
NOX and LSTAT = 0.59
CRIM and TAX = 0.58
ZN and AGE = 0.57
TAX and LSTAT = 0.54
DIS and TAX = 0.53
ZN and INDUS = 0.53
ZN and NOX = 0.52
AGE and TAX = 0.51
PTRATIO and MEDV = 0.51

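A correlation coefficient above 0.5 is commonly treated as strong correlation. Many feature pairs here exceed 0.5, so the data could be reduced in dimension, for example by dropping one feature of a pair,
or handled with principal component analysis (PCA), L1 regularization, and so on.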
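Dropping one feature of each strongly correlated pair could look like the sketch below. Everything in it is an assumption for illustration: the features are synthetic, their names are placeholders, and the 0.9 threshold is arbitrary. On real data this would be applied to the feature columns only, with the label (MEDV here) left out.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
# Synthetic features: "tax_like" nearly duplicates "rad_like".
features = pd.DataFrame({
    "rad_like": base,
    "tax_like": base + rng.normal(scale=0.2, size=300),
    "rm_like": rng.normal(size=300),
})

threshold = 0.9  # assumed cut-off; tune it for the data at hand
corr = features.corr().abs()

# Keep only the cells above the diagonal so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print(to_drop, list(reduced.columns))
```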

(3) Show pairwise relationships in the dataset: pairplot

seaborn.pairplot(data, hue=None, hue_order=None, palette=None, vars=None, x_vars=None,
y_vars=None, kind='scatter', diag_kind='hist', markers=None, size=2.5, aspect=1, dropna=True,
plot_kws=None, diag_kws=None, grid_kws=None)

Selecting the data:
vars: the variables within data to use; otherwise every numeric column of data is used. Type: list of numeric variable names.
{x, y}_vars: the variables within data to use for the columns (x) and the rows (y) of the figure separately. Type: lists of numeric variable names.
dropna: whether to drop missing values. Type: boolean, optional.

Plot kinds:
kind: {'scatter', 'reg'}, optional. Kind of plot for the non-identity relationships.
diag_kind: {'hist', 'kde'}, optional. Kind of plot for the diagonal subplots.

Basic parameters:
size: default 2.5; the height of each facet, in inches. Type: numeric.
hue: the variable in data used to color the plot by category. Type: string (variable name).
hue_order: list of strings. Order for the levels of the hue variable in the palette.
palette: the color palette.
markers: the marker shapes to use. Type: list.
aspect: scalar, optional. Aspect * size gives the width (in inches) of each facet.
{plot, diag, grid}_kws: additional keyword arguments. Type: dict.
Returns:
A PairGrid object.

In [50]:

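for v, i, j in sorted_corr_list:
    # Plot the x_vars column against the y_vars column to check for a linear relationship
    # (seaborn 0.9 and later call the size parameter height)
    sns.pairplot(data, size=6, x_vars=[cols[i]], y_vars=[cols[j]])
    plt.show()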
