Machine Learning Modeling and Analysis: The Algerian Forest Fires Dataset

Reposted article, 2022-02-15 21:05, category: Machine Learning

0. Import packages

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier,plot_tree
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

1. Data analysis and processing

1.1 Loading and inspecting the data

data = pd.read_csv('Algerian_forest_fires_dataset_UPDATE.csv')
data


	DAY	MONTH	YEAR	TEMPERATURE	RH	WS	RAIN	FFMC	DMC	DC	ISI	BUI	FWI	CLASSES
0	1	6	2012	29	57	18	0	65.7	3.4	7.6	1.3	3.4	0.5	not fire
1	2	6	2012	29	61	13	1.3	64.4	4.1	7.6	1	3.9	0.4	not fire
2	3	6	2012	26	82	22	13.1	47.1	2.5	7.1	0.3	2.7	0.1	not fire
3	4	6	2012	25	89	13	2.5	28.6	1.3	6.9	0	1.7	0	not fire
4	5	6	2012	27	77	16	0	64.8	3	14.2	1.2	3.9	0.5	not fire
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
242	26	9	2012	30	65	14	0	85.4	16	44.5	4.5	16.9	6.5	fire
243	27	9	2012	28	87	15	4.4	41.1	6.5	8	0.1	6.2	0	not fire
244	28	9	2012	27	87	29	0.5	45.9	3.5	7.9	0.4	3.4	0.2	not fire
245	29	9	2012	24	54	18	0.1	79.7	4.3	15.2	1.7	5.1	0.7	not fire
246	30	9	2012	24	64	15	0.2	67.3	3.8	16.5	1.2	4.8	0.5	not fire

247 rows × 14 columns


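The Algerian Forest Fires dataset contains 244 instances that combine observations from two regions of Algeria: the Bejaia region in the northeast and the Sidi Bel-abbes region in the northwest, with 122 instances per region. The dataset has the following features:

1. day: day of the month
2. month: month (June to September)
3. year: year (2012)
4. Temperature: noon temperature (daily maximum) in degrees Celsius, 22 to 42
5. RH: relative humidity in %, 21 to 90
6. Ws: wind speed in km/h, 6 to 29
7. Rain: total rainfall for the day in mm, 0 to 16.8
8. FFMC: Fine Fuel Moisture Code index from the FWI system, 28.6 to 92.5
9. DMC: Duff Moisture Code index from the FWI system, 1.1 to 65.9
10. DC: Drought Code index from the FWI system, 7 to 220.4
11. ISI: Initial Spread Index from the FWI system, 0 to 18.5
12. BUI: Buildup Index from the FWI system, 1.1 to 68
13. FWI: Fire Weather Index, 0 to 31.1

The target column is Classes, which records whether there was a fire (fire or not fire).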

1.2 Data preprocessing

Preprocessing is needed to make sure the data is complete, consistent and accurate.

(1) Tidy the dataset

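# Data integration: the file is split into two regions, so we merge them into one table; the larger sample makes the analysis more reliable.
# To do that, drop the two region-header rows and the one blank row that sit between the two blocks
data1 = data.iloc[0:122,:]  # first region
data1
data2 = data.iloc[125:247,:]  # second region
data2
data = pd.concat([data1,data2])  # merge into one
# data is now the tidied dataset
data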


	DAY	MONTH	YEAR	TEMPERATURE	RH	WS	RAIN	FFMC	DMC	DC	ISI	BUI	FWI	CLASSES
0	1	6	2012	29	57	18	0	65.7	3.4	7.6	1.3	3.4	0.5	not fire
1	2	6	2012	29	61	13	1.3	64.4	4.1	7.6	1	3.9	0.4	not fire
2	3	6	2012	26	82	22	13.1	47.1	2.5	7.1	0.3	2.7	0.1	not fire
3	4	6	2012	25	89	13	2.5	28.6	1.3	6.9	0	1.7	0	not fire
4	5	6	2012	27	77	16	0	64.8	3	14.2	1.2	3.9	0.5	not fire
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
242	26	9	2012	30	65	14	0	85.4	16	44.5	4.5	16.9	6.5	fire
243	27	9	2012	28	87	15	4.4	41.1	6.5	8	0.1	6.2	0	not fire
244	28	9	2012	27	87	29	0.5	45.9	3.5	7.9	0.4	3.4	0.2	not fire
245	29	9	2012	24	54	18	0.1	79.7	4.3	15.2	1.7	5.1	0.7	not fire
246	30	9	2012	24	64	15	0.2	67.3	3.8	16.5	1.2	4.8	0.5	not fire

244 rows × 14 columns
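The slicing step above drops the region information once the two blocks are merged. A minimal sketch of one way to keep it, assuming a tiny made-up frame in place of the real CSV; the `Region` column, the `raw` frame and the sizes `n_per_region` and `gap` are illustrative names that do not appear in the original notebook:

```python
import pandas as pd

# Tiny stand-in for the raw file: two regional blocks separated by three non-data rows
raw = pd.DataFrame({
    "day": [1, 2, 3, None, None, None, 1, 2, 3],
    "Temperature": [29, 29, 26, None, None, None, 32, 30, 29],
})

n_per_region = 3   # 122 in the real dataset
gap = 3            # rows between the blocks (one blank row plus two region-header rows)

# Slice each block by position and tag it with its region before merging
bejaia = raw.iloc[:n_per_region].assign(Region="Bejaia")
sidi = raw.iloc[n_per_region + gap:].assign(Region="Sidi Bel-abbes")
combined = pd.concat([bejaia, sidi], ignore_index=True)

print(combined)
```

Keeping the region as a column makes it possible to check later whether a model behaves differently in the two areas; whether that is useful for this analysis is a judgment call.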


# Every sample has the same year, so the column carries no information and can be dropped
del data['year']
data

print(data.isnull())  # no missing values found, so nothing to handle

       day  month  Temperature     RH     Ws  Rain    FFMC    DMC     DC  \
0   False False       False False False False False False False   
1   False False       False False False False False False False   
2   False False       False False False False False False False   
3   False False       False False False False False False False   
4   False False       False False False False False False False   
..     ...   ...         ...   ...   ...   ...   ...   ...   ...   
242 False False       False False False False False False False   
243 False False       False False False False False False False   
244 False False       False False False False False False False   
245 False False       False False False False False False False   
246 False False       False False False False False False False   

       ISI   BUI   FWI Classes    
0   False False False     False  
1   False False False     False  
2   False False False     False  
3   False False False     False  
4   False False False     False  
..     ...   ...   ...       ...  
242 False False False     False  
243 False False False     False  
244 False False False     False  
245 False False False     False  
246 False False False     False  

[244 rows x 13 columns]

(2) Separate the feature columns as X and the target column as Y

X = data.iloc[:,0:12]
Y = data.iloc[:,-1]

# Clean up the target column: remove the stray spaces from the labels
s=Y.values
for i in range(len(s)):
    s[i] = s[i].replace(' ','')
print(s)

['notfire' 'notfire' 'notfire' 'notfire' 'notfire' 'fire' 'fire' 'fire'
 'notfire' 'notfire' 'fire' 'fire' 'notfire' 'notfire' 'notfire' 'notfire'
 'notfire' 'notfire' 'notfire' 'notfire' 'fire' 'notfire' 'fire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'notfire' 'fire' 'notfire' 'notfire'
 'notfire' 'notfire' 'fire' 'fire' 'notfire' 'fire' 'notfire' 'notfire'
 'notfire' 'notfire' 'notfire' 'notfire' 'notfire' 'notfire' 'fire' 'fire'
 'fire' 'fire' 'fire' 'notfire' 'notfire' 'notfire' 'fire' 'fire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'notfire' 'notfire' 'notfire' 'fire' 'fire'
 'fire' 'fire' 'notfire' 'fire' 'fire' 'fire' 'notfire' 'fire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'fire' 'notfire' 'notfire' 'notfire'
 'notfire' 'notfire' 'notfire' 'fire' 'notfire' 'notfire' 'notfire'
 'notfire' 'notfire' 'notfire' 'notfire' 'notfire' 'notfire' 'notfire'
 'notfire' 'fire' 'fire' 'fire' 'fire' 'fire' 'notfire' 'notfire'
 'notfire' 'notfire' 'notfire' 'fire' 'notfire' 'notfire' 'notfire'
 'notfire' 'notfire' 'notfire' 'notfire' 'notfire' 'fire' 'fire' 'notfire'
 'notfire' 'fire' 'fire' 'fire' 'notfire' 'notfire' 'notfire' 'notfire'
 'notfire' 'notfire' 'notfire' 'notfire' 'fire' 'notfire' 'notfire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'fire' 'notfire' 'notfire' 'fire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'notfire' 'notfire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire'
 'fire' 'notfire' 'notfire' 'notfire' 'notfire' 'fire' 'fire' 'fire'
 'fire' 'notfire' 'fire' 'fire' 'fire' 'fire' 'notfire' 'notfire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'fire' 'notfire'
 'fire' 'fire' 'fire' 'notfire' 'notfire' 'fire' 'notfire' 'notfire'
 'notfire' 'fire' 'fire' 'fire' 'notfire' 'notfire' 'fire' 'fire' 'fire'
 'fire' 'fire' 'fire' 'fire' 'fire' 'notfire' 'fire' 'fire' 'fire'
 'notfire' 'notfire' 'fire' 'notfire' 'notfire' 'notfire' 'notfire']
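The label cleanup above loops over the values one at a time. A vectorized alternative with the pandas string accessor might look like the following sketch; the `labels` series is a made-up sample, not the real Classes column:

```python
import pandas as pd

# Made-up labels with the kind of stray spaces seen in the raw Classes column
labels = pd.Series(["not fire   ", "fire   ", "not fire", "fire "])

# Remove every space in one pass; regex=False treats " " as a literal string
cleaned = labels.str.replace(" ", "", regex=False)

print(cleaned.tolist())  # ['notfire', 'fire', 'notfire', 'fire']
```

This returns a new Series rather than modifying the values in place, which avoids changing the original frame through a shared array.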

# Encode the labels with LabelEncoder: fire becomes 0 and notfire becomes 1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y=le.fit_transform(s)            
y

array([1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1,
       1, 1])
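Because the encoded values decide which class later counts as "positive", it helps to confirm the mapping explicitly. A small sketch on a made-up label array (`le_demo` and `codes` are illustrative names): LabelEncoder sorts the class names, so `fire` gets 0 and `notfire` gets 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(np.array(["notfire", "fire", "fire", "notfire"]))

# classes_ holds the sorted class names; the position of a name is its code
mapping = {str(name): code for code, name in enumerate(le_demo.classes_)}
print(mapping)         # {'fire': 0, 'notfire': 1}
print(codes.tolist())  # [1, 0, 0, 1]

# inverse_transform turns the codes back into the original names
restored = le_demo.inverse_transform(codes)
```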

y= pd.DataFrame(y)

# Randomly split the data into training and test sets (70% and 30%) with train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1)
#x_train,x_test,y_train,y_test

# Check the sizes of the training and test sets
x_train.shape, x_test.shape, y_train.shape,y_test.shape

((170, 12), (74, 12), (170, 1), (74, 1))

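(3) Standardization

Standardization puts features measured on different scales on a comparable footing, which can improve the accuracy of scale-sensitive classifiers.

# Standardize the features: fit the scaler on the training set, then transform it
scaler = StandardScaler().fit(x_train)
x_train = pd.DataFrame(scaler.transform(x_train))
x_train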


	0	1	2	3	4	5	6	7	8	9	10	11
0	1.693550	-0.377923	-0.338241	1.164128	-0.203369	-0.414083	0.508256	1.099453	1.786169	-0.022055	1.423974	0.490624
1	-0.549241	1.406713	-0.622900	1.027785	2.027127	0.621124	-1.406602	-1.043440	-0.895966	-0.921917	-1.024626	-0.941239
2	1.221384	-1.270241	1.085051	0.005213	0.168381	-0.414083	0.679098	0.132734	-0.347347	0.552856	-0.045186	0.325409
3	-0.667282	1.406713	-0.907558	0.823271	-0.203369	0.218544	-1.335418	-0.817872	-0.900170	-0.996906	-0.870714	-0.941239
4	1.457467	0.514395	0.515734	0.141556	0.168381	-0.414083	0.792993	2.847602	3.350048	0.627845	3.207954	1.757272
...	...	...	...	...	...	...	...	...	...	...	...	...
165	0.040967	-1.270241	-0.907558	1.709500	-0.203369	-0.184037	-2.196748	-0.858152	-0.904374	-1.146882	-0.905694	-0.968775
166	-0.431199	0.514395	0.800393	-0.744673	-0.946867	-0.241548	0.216400	0.060230	0.506060	-0.571971	0.255642	-0.404291
167	0.395092	-1.270241	-0.053583	0.346070	-0.575118	2.173935	-0.972378	-0.842040	-0.900170	-0.946913	-0.898698	-0.927471
168	0.749217	1.406713	0.231076	0.141556	-0.946867	-0.414083	0.757401	0.906109	1.161879	0.577852	1.074174	0.903661
169	-0.903366	-0.377923	0.231076	0.414242	1.283628	-0.414083	0.522493	-0.189505	-0.025743	0.302895	-0.115146	0.118890

170 rows × 12 columns


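# Note: this cell fits a second scaler on the test set, and the results shown below were produced that way.
# The usual practice is to reuse the scaler fitted on x_train (scaler.transform(x_test)) so both sets share one scale.
scaler = StandardScaler().fit(x_test)
x_test = pd.DataFrame(scaler.transform(x_test))
x_test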


	0	1	2	3	4	5	6	7	8	9	10	11
0	-0.943092	0.304017	-0.034882	0.460196	0.195027	-0.343850	0.637408	0.101718	0.070002	0.167530	0.101785	0.151281
1	1.470656	1.241402	-2.099869	0.132749	-0.132973	-0.263328	-0.655240	-0.859719	-0.610651	-0.782701	-0.792830	-0.818248
2	0.840982	0.304017	0.739489	-1.569978	-0.132973	-0.343850	1.014430	3.043550	2.171333	1.648122	2.821703	2.594495
3	-1.572765	-1.570753	-0.034882	0.591175	-1.116970	-0.062024	-1.341959	-0.966545	-0.786645	-0.915292	-0.937123	-0.857029
4	1.365710	0.304017	0.739489	-0.915083	0.851025	-0.343850	0.879779	3.281855	3.712874	1.714418	3.723532	3.021087
...	...	...	...	...	...	...	...	...	...	...	...	...
69	-0.943092	-0.633368	1.513859	-1.242530	-0.788971	-0.142546	0.536420	-0.103717	-0.209892	-0.141848	-0.150727	-0.210676
70	-0.838146	1.241402	-0.551129	0.722154	0.523026	0.018498	-1.012064	-0.958327	-0.782404	-0.804800	-0.922694	-0.831175
71	0.631091	-0.633368	-0.034882	-0.915083	0.851025	-0.343850	0.974035	2.460114	0.949974	1.869106	2.035307	2.400589
72	0.945928	-0.633368	1.771983	0.132749	-2.100968	0.139281	-0.217625	-0.210543	-0.623373	-0.804800	-0.316664	-0.792393
73	-1.257928	1.241402	-0.809252	0.722154	0.523026	-0.303589	-0.581182	-1.015849	-0.627614	-0.738505	-0.929908	-0.818248

74 rows × 12 columns
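Standardization computes z = (x - mean) / std per feature. The usual convention is to estimate the mean and standard deviation on the training set only and apply those same values to the test set. A minimal sketch on synthetic data (the arrays and the `_demo` names are assumptions for illustration, not the notebook's variables):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_train_demo = rng.normal(loc=30.0, scale=4.0, size=(100, 3))
x_test_demo = rng.normal(loc=30.0, scale=4.0, size=(40, 3))

# Estimate mean and standard deviation from the training data only
scaler_demo = StandardScaler().fit(x_train_demo)
x_train_std = scaler_demo.transform(x_train_demo)
x_test_std = scaler_demo.transform(x_test_demo)   # reuse the training statistics

# The same result by hand: z = (x - mean_train) / std_train
manual = (x_test_demo - x_train_demo.mean(axis=0)) / x_train_demo.std(axis=0)
print(np.allclose(x_test_std, manual))  # True
```

With this convention the test set is treated like genuinely unseen data, so its own statistics never influence the transformation.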


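2. Build a logistic regression model

Logistic regression is easy to understand and implement and is efficient to train. It performs well when the classes are close to linearly separable and tends to be reasonably accurate on small datasets.

from sklearn.linear_model import LogisticRegression
model_logic = LogisticRegression(max_iter=10000).fit(x_train, y_train)
# Accuracy on the training set
print(model_logic.score(x_train,y_train))
# Accuracy on the test set
print(model_logic.score(x_test,y_test))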

0.9705882352941176
0.9054054054054054

D:\Anoconda\lib\site-packages\sklearn\utils\validation.py:63: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(*args, **kwargs)
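The DataConversionWarning above appears because the labels are passed as a one-column DataFrame of shape (n_samples, 1) rather than a 1-D array. A short sketch of the fix the warning itself suggests, on synthetic data (`X_demo`, `y_demo` and `clf_demo` are made-up names):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 2))
# Labels stored as a one-column DataFrame, shape (60, 1), like y in the notebook
y_demo = pd.DataFrame((X_demo[:, 0] + X_demo[:, 1] > 0).astype(int))

# ravel() flattens the column to shape (60,), which is what fit() expects
y_flat = y_demo.values.ravel()
clf_demo = LogisticRegression(max_iter=10000).fit(X_demo, y_flat)

print(y_demo.shape, y_flat.shape)  # (60, 1) (60,)
```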

Predictions and classification accuracy

# Predictions on the test set
y_pred=model_logic.predict(x_test)
y_pred
y_pred= pd.DataFrame(y_pred)

# True labels
y_true=y_test.reset_index(drop=True)
y_true


	0
0	0
1	1
2	0
3	1
4	0
...	...
69	0
70	1
71	0
72	1
73	1

74 rows × 1 columns


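# train_accu marks, for each test sample, whether the prediction matches the true label; t counts the matches
train_accu = np.equal(y_pred,y_true)
t=np.sum(train_accu!=0)
t[0]

# Classification accuracy of the model on the test set
t[0]/len(y_test)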

0.9054054054054054
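The manual count above is equivalent to taking the mean of the element-wise matches, which is also what `sklearn.metrics.accuracy_score` computes. A small sketch on made-up label vectors:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 5 of the 6 predictions are correct
y_true_demo = np.array([0, 1, 0, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0])

manual_acc = np.mean(y_true_demo == y_pred_demo)         # fraction of matches
library_acc = accuracy_score(y_true_demo, y_pred_demo)   # same quantity from scikit-learn

print(manual_acc, library_acc)
```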

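3. Build decision tree models

3.1 Build the decision tree models

3.1.1 Build and train decision trees with the "gini" (Gini impurity) and "entropy" (information gain) splitting criteria, with the maximum depth set to 5, and check whether the two criteria give different models

# Decision tree with the Gini criterion
dt_gini = DecisionTreeClassifier(criterion='gini',max_depth=5,random_state=0)
dt_gini = dt_gini.fit(x_train,y_train)   # fit() takes two arguments: the training features and their labels
# Decision tree with the entropy criterion
dt_entropy = DecisionTreeClassifier(criterion='entropy',max_depth=5,random_state=0)
dt_entropy = dt_entropy.fit(x_train,y_train)
# Inspect the parameters of both models
dt_gini,dt_entropy
# Feature importances: the larger the value, the more the feature contributes to the splits
print(dt_gini.feature_importances_)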

[0.         0.         0.         0.00382647 0.         0.
 0.97248796 0.         0.         0.02368557 0.         0.        ]

3.1.2 Predictions on the training set

# Gini criterion
answer = dt_gini.predict(x_train)
score_ = dt_gini.score(x_train,y_train)
print(answer)   # print the predictions and the true labels for comparison
print(y_train.values.ravel())   # ravel() flattens the (170, 1) label column so it prints in the same layout
print(score_)
# entropy criterion
answer = dt_entropy.predict(x_train)
score_ = dt_entropy.score(x_train,y_train)
print(answer)   # print the predictions and the true labels for comparison
print(y_train.values.ravel())
print(score_)

[0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0
 1 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0
 0 1 0 1 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 0 1 1 0 0 0 1 1 0
 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0
 0 1 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 0 0]
[0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0
 1 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0
 0 1 0 1 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 0 1 1 0 0 0 1 1 0
 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0
 0 1 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 0 0]
1.0
[0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0
 1 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0
 0 1 0 1 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 0 1 1 0 0 0 1 1 0
 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0
 0 1 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 0 0]
[0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0
 1 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0
 0 1 0 1 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 0 1 1 0 0 0 1 1 0
 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0
 0 1 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 0 0]
1.0

Result: with both the Gini and the entropy criterion, the model scores 1.0 on the data it was trained on, that is, 100% training accuracy.

3.1.3 Predictions on the test set

# Gini criterion
y_gini_pre = dt_gini.predict(x_test)  # predict() takes one argument: the test features
score_gini = dt_gini.score(x_test,y_test)   # score() takes the test features and their labels and returns the accuracy
# output
print("Gini test predictions:",y_gini_pre)
print("True labels:",y_test.values.ravel())   # ravel() flattens the (74, 1) label column
print("Accuracy:",score_gini)

# entropy criterion
y_entropy_pre = dt_entropy.predict(x_test)  # predict() takes one argument: the test features
score_entropy = dt_entropy.score(x_test,y_test)   # score() takes the test features and their labels and returns the accuracy
# output
print("entropy test predictions:",y_entropy_pre)
print("True labels:",y_test.values.ravel())
print("Accuracy:",score_entropy)

Gini test predictions: [0 1 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 1
 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1]
True labels: [0 1 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1
 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1]
Accuracy: 0.9324324324324325
entropy test predictions: [0 1 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 1 0 0 0 1
 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1]
True labels: [0 1 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1
 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1]
Accuracy: 0.8918918918918919

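Result: the output shows a test accuracy of about 93% for the Gini criterion and about 89% for the entropy criterion, so the Gini model generalizes somewhat better on this split. Both trees fit the training set perfectly but predict slightly less well on unseen data, with a test accuracy of roughly 0.9.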

3.2 Visualization

# Collect the feature names for the tree plot
feature_names = list(X.columns)
feature_names
# plot_tree expects class names in ascending order of the encoded labels (0 = fire, 1 = notfire),
# so take them from the fitted LabelEncoder rather than from the order of first appearance in Y
target_name = [str(c) for c in le.classes_]
print(target_name)
plt.figure(figsize=(25,30))   # figure size: 25 inches wide, 30 inches high
# Gini model
plot_tree(dt_gini,filled = True,feature_names = feature_names, class_names=target_name)    # filled=True colors each node by its majority class

['fire', 'notfire']

[Text(398.57142857142856, 1467.72, 'FFMC <= 0.131\ngini = 0.482\nsamples = 170\nvalue = [101, 69]\nclass = fire'),
 Text(199.28571428571428, 1141.56, 'gini = 0.0\nsamples = 67\nvalue = [0, 67]\nclass = notfire'),
 Text(597.8571428571429, 1141.56, 'ISI <= -0.559\ngini = 0.038\nsamples = 103\nvalue = [101, 2]\nclass = fire'),
 Text(398.57142857142856, 815.4000000000001, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]\nclass = notfire'),
 Text(797.1428571428571, 815.4000000000001, ' RH <= 1.266\ngini = 0.019\nsamples = 102\nvalue = [101, 1]\nclass = fire'),
 Text(597.8571428571429, 489.24, 'gini = 0.0\nsamples = 96\nvalue = [96, 0]\nclass = fire'),
 Text(996.4285714285713, 489.24, 'FFMC <= 0.277\ngini = 0.278\nsamples = 6\nvalue = [5, 1]\nclass = fire'),
 Text(797.1428571428571, 163.08000000000015, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]\nclass = notfire'),
 Text(1195.7142857142858, 163.08000000000015, 'gini = 0.0\nsamples = 5\nvalue = [5, 0]\nclass = fire')]

[Figure: the fitted Gini decision tree drawn by plot_tree]

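# Result: the plot shows that the root node has the highest Gini impurity (0.482); splitting continues until the impurity
# reaches 0, at which point the node is a leaf. The samples attribute counts the training samples that reach a node,
# shrinking from 170 at the root. The value attribute gives the number of samples of each class at the node.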
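The gini values in the tree output can be reproduced from the value counts of each node, since the Gini impurity is 1 minus the sum of squared class proportions. A short sketch; the helper names `gini` and `entropy` are illustrative, and the counts are the ones printed for the nodes above:

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(round(gini([101, 69]), 3))  # 0.482, the root node
print(round(gini([101, 2]), 3))   # 0.038, the node split on ISI
print(round(gini([5, 1]), 3))     # 0.278, the second FFMC split
print(gini([96, 0]))              # 0.0, a pure leaf
```

The `entropy` helper is the corresponding impurity measure used when `criterion='entropy'` is selected; a perfectly balanced two-class node has an entropy of 1 bit.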

3.3 Plot the learning curves of the decision tree models

# Plot training accuracy against tree depth (1 to 30) for the Gini and entropy models
train_acc_gini = []      # training accuracy of the Gini model at each depth
train_acc_entropy = []   # training accuracy of the entropy model at each depth
for i in range(30):
    clf_gini = DecisionTreeClassifier(max_depth=i+1,criterion='gini',random_state=30,splitter='random')
    clf_entropy = DecisionTreeClassifier(max_depth=i+1,criterion='entropy',random_state=30,splitter='random')
    clf_gini = clf_gini.fit(x_train,y_train)   # train the models
    clf_entropy = clf_entropy.fit(x_train,y_train)
    score1 = clf_gini.score(x_train,y_train)   # accuracy on the training set (not the test set)
    score2 = clf_entropy.score(x_train,y_train)
    train_acc_gini.append(score1)
    train_acc_entropy.append(score2)

# Draw both curves on the same axes
plt.title('Gini vs. entropy')                                        # one title; a second plt.title call would overwrite the first
plt.plot(range(1,31),train_acc_gini,color='red',label='Gini')        # Gini curve in red, depths 1 to 30
plt.plot(range(1,31),train_acc_entropy,color='blue',label='entropy') # entropy curve in blue, depths 1 to 30
plt.xlabel('max_depth')                                              # x-axis label
plt.ylabel('training accuracy')                                      # y-axis label

plt.legend()                                                         # show the legend
plt.show()                                                           # display the figure

[Figure: training accuracy versus max_depth for the Gini and entropy models]

Result: the Gini model does slightly better than the entropy model.
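The curve above tracks training accuracy only, which rises with depth and says little about generalization. A hedged sketch of how training and held-out accuracy could be compared per depth, using synthetic data from `make_classification` in place of the fire dataset (all names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 12 features, like the feature matrix in this article
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=1)

depths = range(1, 11)
train_acc, test_acc = [], []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc.append(tree.score(X_tr, y_tr))   # accuracy on data the tree has seen
    test_acc.append(tree.score(X_te, y_te))    # accuracy on held-out data

# Depth with the highest held-out accuracy (first one if there are ties)
best_depth = depths[test_acc.index(max(test_acc))]
print(best_depth, max(test_acc))
```

A widening gap between the two lists as depth grows is the usual sign that the tree is starting to overfit.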

4. Model evaluation and tuning

4.1 Model metrics

(1) Logistic regression metrics

cm=confusion_matrix(y_test, y_pred,labels=[1,0])
cm

array([[30,  7],
       [ 0, 37]], dtype=int64)

from sklearn.metrics import confusion_matrix
# With labels=[0,1], the positive class is label 1 (notfire)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred,labels=[0,1]).ravel()
print(tn,fp,fn,tp)
accuracy = (tn+tp)/(tn+tp+fn+fp)
recall = tp/(tp+fn)       # recall (true positive rate)
fpr = fp/(tn+fp)          # false positive rate
precision = tp/(tp+fp)    # precision
print("Accuracy: {:.2f}%".format(accuracy*100))
print("Recall: {:.2f}%".format(recall*100))
print("False positive rate: {:.2f}%".format(fpr*100))
print("Precision: {:.2f}%".format(precision*100))

37 0 7 30
Accuracy: 90.54%
Recall: 81.08%
False positive rate: 0.00%
Precision: 100.00%

import seaborn as sn
df_cm = pd.DataFrame(cm)
df_cm
# annot=True shows the counts in the cells; the fmt setting avoids scientific notation
ax = sn.heatmap(df_cm,annot=True,fmt='.20g')
ax.set_title('LogisticRegression') # title
ax.set_xlabel('predict') # x-axis
ax.set_ylabel('true') # y-axis

Text(33.0, 0.5, 'true')

[Figure: confusion matrix heatmap for the logistic regression model]

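Result: accuracy reaches about 90%. Precision is 100% and the false positive rate is 0, so every sample predicted as positive is truly positive. Recall is about 81%, which means 7 of the 37 actual positives were missed.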
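scikit-learn also provides these metrics directly. The sketch below rebuilds label vectors that reproduce the counts printed above (tn=37, fp=0, fn=7, tp=30); the vectors are synthetic stand-ins and do not follow the order of the real test set:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# 37 actual negatives (label 0) followed by 37 actual positives (label 1)
y_true_demo = np.array([0] * 37 + [1] * 37)
# All negatives predicted correctly; 7 positives missed, 30 found
y_pred_demo = np.array([0] * 37 + [0] * 7 + [1] * 30)

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 37 0 7 30

acc = accuracy_score(y_true_demo, y_pred_demo)     # (tp + tn) / total
rec = recall_score(y_true_demo, y_pred_demo)       # tp / (tp + fn)
prec = precision_score(y_true_demo, y_pred_demo)   # tp / (tp + fp)
print(round(acc, 4), round(rec, 4), prec)
```

Both `recall_score` and `precision_score` treat label 1 as the positive class by default, which matches the convention used with `labels=[0,1]`.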

(2) Metrics for the Gini decision tree

# Predictions on the test set
gini_pre = dt_gini.predict(x_test)
gini_pre
# Confusion matrix
cm = confusion_matrix(y_test,gini_pre,labels=[0,1])
print(cm)
tn,fp,fn,tp = cm.ravel()
print(tp,tn,fp,fn)
# Accuracy: share of all samples classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Recall (true positive rate): share of actual positives that are predicted positive
recall = tp / (tp + fn)
# False positive rate: share of actual negatives that are predicted positive
fpr = fp / (fp + tn)
# Precision: share of predicted positives that are actually positive
precision = tp / (tp + fp)
print("Accuracy: {:.2f}%".format(accuracy*100))
print("Recall: {:.2f}%".format(recall*100))
print("False positive rate: {:.2f}%".format(fpr*100))
print("Precision: {:.2f}%".format(precision*100))

[[37  0]
 [ 5 32]]
32 37 0 5
Accuracy: 93.24%
Recall: 86.49%
False positive rate: 0.00%
Precision: 100.00%

(3) Metrics for the entropy decision tree

# Predictions on the test set
entropy_pre = dt_entropy.predict(x_test)
entropy_pre
# Confusion matrix
cm = confusion_matrix(y_test,entropy_pre,labels=[0,1])
print(cm)

tn,fp,fn,tp = cm.ravel()
print(tp,tn,fp,fn)
# Accuracy: share of all samples classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Recall (true positive rate): share of actual positives that are predicted positive
recall = tp / (tp + fn)
# False positive rate: share of actual negatives that are predicted positive
fpr = fp / (fp + tn)
# Precision: share of predicted positives that are actually positive
precision = tp / (tp + fp)
print("Accuracy: {:.2f}%".format(accuracy*100))
print("Recall: {:.2f}%".format(recall*100))
print("False positive rate: {:.2f}%".format(fpr*100))
print("Precision: {:.2f}%".format(precision*100))

[[37  0]
 [ 8 29]]
29 37 0 8
Accuracy: 89.19%
Recall: 78.38%
False positive rate: 0.00%
Precision: 100.00%

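Result: computing accuracy, recall and the other metrics from the confusion matrix shows that the Gini tree classifies about 93% of the test samples correctly and the entropy tree about 89%, so both models predict reasonably well.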

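(4) Evaluate the logistic regression and decision tree models with 10-fold cross-validation

Why cross-validation? It estimates how a trained model performs on data it has not seen by averaging over several train/validation splits. This makes the estimate less dependent on one particular split and helps reveal overfitting.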

# The hold-out split was already made above; cross-validation here runs on the full X and Y
result1 = cross_val_score(dt_gini,X,Y,cv = 10)    # Gini tree, feature set X, label set Y, 10 folds
print("10-fold cross-validation scores, Gini decision tree:",result1)
result2 = cross_val_score(dt_entropy,X,Y,cv = 10)    # entropy tree, same data, 10 folds
print("10-fold cross-validation scores, entropy decision tree:",result2)
result3 = cross_val_score(model_logic,X,Y,cv = 10)    # logistic regression, same data, 10 folds
print("10-fold cross-validation scores, logistic regression:",result3)

# Plot the 10 fold scores for each model
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
plt.title('Gini')
plt.plot(range(10),result1,color='red')      # Gini tree scores in red, folds 0 to 9

plt.subplot(1,3,2)
plt.title('entropy')
plt.plot(range(10),result2,color='blue')  # entropy tree scores in blue, folds 0 to 9

plt.subplot(1,3,3)
plt.title('model_logic')
plt.plot(range(10),result3,color='green')  # logistic regression scores (result3) in green; the panel titles identify the curves, so no legend is needed

plt.show()

10-fold cross-validation scores, Gini decision tree: [0.96       0.96       0.92       0.92       1.         1.
 1.         1.         1.         0.95833333]
10-fold cross-validation scores, entropy decision tree: [0.96       0.96       0.92       1.         1.         0.95833333
 1.         1.         0.95833333 0.95833333]
10-fold cross-validation scores, logistic regression: [0.96       1.         0.96       0.96       1.         0.875
 1.         1.         0.91666667 1.        ]

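[Figure: 10-fold cross-validation scores per fold for the Gini tree, the entropy tree and the logistic regression model]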

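Result: the scores vary from fold to fold for every model. For both decision trees (Gini and entropy) the fold accuracy ranges from 0.92 to 1.0, while for logistic regression it ranges from 0.875 to 1.0, a wider spread. On this dataset the decision trees are therefore slightly more stable across folds than logistic regression.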
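When a model needs scaled features, the scaling step can be placed inside the cross-validation loop so that each fold fits its scaler on its own training part only. A sketch with a scikit-learn Pipeline on synthetic data; `X_demo`, `y_demo` and `pipe` are illustrative names, and the data is not the fire dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the fire dataset (244 x 12)
X_demo, y_demo = make_classification(n_samples=244, n_features=12,
                                     n_informative=4, random_state=0)

# The scaler is re-fitted on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=10)

# Report the mean and the spread across folds rather than single fold values
print(scores.mean(), scores.std())
```

Summarizing the folds by mean and standard deviation gives a compact way to compare the stability of different models.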

4.2 Tune the regularization parameters of the logistic regression classifier

param = {'penalty':['l2','l1'],'C': [0.001, 0.01, 0.1],
        'class_weight':['balanced',None],'multi_class':['ovr'],'solver':['liblinear']}
gc = GridSearchCV(model_logic, param_grid=param, cv=10)
gc.fit(x_train, y_train.values.ravel())
print("Accuracy (score) on the test set:",gc.score(x_test,y_test))
print("Best cross-validation score:",gc.best_score_)
print("Best parameter combination:",gc.best_params_)

Accuracy (score) on the test set: 0.8918918918918919
Best cross-validation score: 0.9647058823529413
Best parameter combination: {'C': 0.1, 'class_weight': None, 'multi_class': 'ovr', 'penalty': 'l1', 'solver': 'liblinear'}

4.3 Tune the decision tree classifier with an exhaustive grid search

# Build a decision tree with default settings (no splitting criterion specified)
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)

param1 = {'criterion':['gini','entropy'],'max_features': [0.1,0.2,0.5,0.8,1],
        'splitter':['best','random'],'min_samples_split':[2,4,6,8]}
clf1 = GridSearchCV(dtc, param1, cv=10)
clf1.fit(x_train, y_train)
print("Accuracy (score) on the test set:",clf1.score(x_test,y_test))
print("Best cross-validation score:",clf1.best_score_)
print("Best parameter combination:",clf1.best_params_)

Accuracy (score) on the test set: 0.8918918918918919
Best cross-validation score: 0.9882352941176471
Best parameter combination: {'criterion': 'gini', 'max_features': 0.5, 'min_samples_split': 8, 'splitter': 'best'}
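Beyond the single best combination, a fitted GridSearchCV keeps the score of every candidate in `cv_results_`, which shows how close the runners-up are. A sketch on synthetic data with a smaller grid than the one above (the grid, the data and the names are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=12,
                                     n_informative=4, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6]},
                    cv=5)
grid.fit(X_demo, y_demo)

# One row per parameter combination, with mean and spread of the fold scores
results = pd.DataFrame(grid.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))

# The best model, refitted on all of the data passed to fit()
best_tree = grid.best_estimator_
```

If several combinations score within one standard deviation of each other, the simpler one is often a reasonable choice.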
