The decision tree section introduces quite a few pandas functions that were new to me, so I am pulling them out here to work through these pandas processing steps in detail and get a better feel for how pandas handles data under the hood.
Decision tree program
Data loading
pd.read_csv() can actually take a URL directly... ...
DataFrame.head() shows the first few rows of the data, 5 by default
DataFrame.info() shows summary statistics about the data
'''Data loading'''
import pandas as pd

titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
print(titanic.head(), '\n\n', '___*___'*15)  # DataFrame.head(n=5) returns the first n rows
print(titanic.info())
The .info() method summarizes each column, including the non-null count (a count below the total number of rows means the column has gaps that need filling) and the dtype (numeric columns can be used directly; object columns have to be converted to numbers before modeling). A quick programmatic check of the incomplete columns is sketched after the output below.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names 1313 non-null int64
pclass 1313 non-null object
survived 1313 non-null int64
name 1313 non-null object
age 633 non-null float64
embarked 821 non-null object
home.dest 754 non-null object
room 77 non-null object
ticket 69 non-null object
boat 347 non-null object
sex 1313 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 112.9+ KB
None
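It can be handy to list only the columns that actually have gaps. A minimal sketch, assuming the titanic DataFrame loaded above:

missing = titanic.isnull().sum()                          # NaN count per column
print(missing[missing > 0].sort_values(ascending=False))  # incomplete columns only, worst first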
Data preprocessing
When there are many features we have to choose some by hand. This is the Titanic passenger data, and since 'pclass', 'age' and 'sex' seem most likely to be related to survival, those three features were selected.
DataFrame['column'].fillna() fills in missing values
The first argument is the value to fill with
The second argument, inplace, means fill in place; it defaults to False, i.e. the original DataFrame is left untouched and a modified copy is returned
When using it, note that each column usually has a different meaning, so each column should be filled separately, which is why I added ['column'] (see the toy sketch below).
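Before applying it to the Titanic data, here is a toy sketch of the two calling styles (a made-up Series, not part of the program):

import pandas as pd

s = pd.Series([1.0, None, 3.0])
filled = s.fillna(s.mean())       # default inplace=False: returns a filled copy
print(filled.isnull().sum(), s.isnull().sum())   # 0 1 -> the original still has the gap
s.fillna(s.mean(), inplace=True)  # inplace=True: modifies s itself and returns None
print(s.isnull().sum())           # 0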
'''Data preprocessing'''
# Select features: many are available, but these three are probably most related to survival
X = titanic[['pclass', 'age', 'sex']]
# Select the label
y = titanic['survived']
# Inspect the features
print(X.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1313 entries, 0 to 1312
# Data columns (total 3 columns):
# pclass    1313 non-null object
# age       633 non-null float64
# sex       1313 non-null object
# dtypes: float64(1), object(2)
# memory usage: 30.9+ KB
# None
# To do:
# 1. age clearly has missing values
# 2. pclass and sex are not numeric and need to be converted
X['age'].fillna(X['age'].mean(), inplace=True)  # fillna returns a new object; inplace=True fills in place
print(X.info())
Dataset split
'''Dataset split'''
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
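A quick sanity check of the resulting sizes (1313 rows in total, 25% held out for testing); this line is a small addition, not in the original script:

print(X_train.shape, X_test.shape)  # (984, 3) (329, 3)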
Feature extractor
'''Feature extractor'''
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
print(X_train.to_dict(orient='records'))
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
print(X_train)
print(vec.feature_names_)
X_test = vec.transform(X_test.to_dict(orient='records'))
This involves two operations,
- converting the DataFrame to dicts
- vectorizing the dicts
1. DataFrame to dicts
import numpy as np
import pandas as pd

index = ['x', 'y']
columns = ['a', 'b', 'c']
dtype = [('a', 'int32'), ('b', 'float32'), ('c', 'float32')]
values = np.zeros(2, dtype=dtype)
df = pd.DataFrame(values, index=index)
df.to_dict(orient='records')
# roughly: [{'a': 0, 'b': 0.0, 'c': 0.0}, {'a': 0, 'b': 0.0, 'c': 0.0}]
2. Dict vectorization
DictVectorizer converts a list of dicts into a numpy array; the fitted object has the attribute vec.feature_names_, which shows the names of the extracted features.
The effect looks like this,
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[ 2.,  0.,  1.],
       [ 0.,  1.,  3.]])
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[ 0.,  0.,  4.]])
Numeric feature values are kept as they are, samples that lack a feature get 0 for it, and features never seen during fitting are ignored.
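The example above only uses numeric values. When a dict value is a string, DictVectorizer one-hot encodes it as a key=value feature, which is exactly what happens to pclass and sex in this program. A small sketch with made-up data:

from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
D = [{'age': 20.0, 'sex': 'male'}, {'age': 30.0, 'sex': 'female'}]
print(v.fit_transform(D))
# [[20.  0.  1.]
#  [30.  1.  0.]]
print(v.feature_names_)
# ['age', 'sex=female', 'sex=male']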
Applied to this program,
print(X_train.to_dict(orient='records')):
[{'sex': 'male', 'pclass': '3rd', 'age': 31.19418104265403},
...... ....... ....... ......
{'sex': 'female', 'pclass': '1st', 'age': 31.19418104265403}]
Extracting the features,
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
print(X_train):
[[ 31.19418104 0. 0. 1. 0. 1. ]
[ 31.19418104 1. 0. 0. 1. 0. ]
[ 31.19418104 0. 0. 1. 0. 1. ]
...,
[ 12. 0. 1. 0. 1. 0. ]
[ 18. 0. 1. 0. 0. 1. ]
[ 31.19418104 0. 0. 1. 1. 0. ]]
The numeric age values are unchanged, while the other object features have become one-hot encoded columns; the meaning of each column can be looked up,
print(vec.feature_names_):
['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
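To read the transformed matrix column by column, the dense array can be wrapped back into a DataFrame with these names. A minimal sketch, assuming vec and the transformed X_train from above:

import pandas as pd

X_train_named = pd.DataFrame(X_train, columns=vec.feature_names_)
print(X_train_named.head())  # the same numbers as above, now with column labels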
Decision tree
'''Decision tree'''
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_predict = dtc.predict(X_test)

'''Model evaluation'''
from sklearn.metrics import classification_report

print(dtc.score(X_test, y_test))
# note: classification_report's signature is (y_true, y_pred); with the prediction
# passed first, the support column counts predicted labels rather than true ones
print(classification_report(y_predict, y_test, target_names=['died', 'survived']))
0.781155015198
             precision    recall  f1-score   support
       died       0.91      0.78      0.84       236
   survived       0.58      0.80      0.67        93
avg / total       0.81      0.78      0.79       329
Python learning
Pandas
pd.read_csv() can actually take a URL directly... ...
DataFrame.head() shows the first few rows of the data, 5 by default
DataFrame.info() shows summary statistics about the data
DataFrame.to_dict() converts a DataFrame to dicts; with orient='records' it produces a native Python structure, a list of dicts [dict1, dict2, ... ...], one dict per row
Ensemble model program
This program runs three methods on the data above: a decision tree, a random forest, and gradient boosted decision trees. The program and a comparison of the results follow,
'''Ensemble models'''
import pandas as pd

titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
X = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']
X['age'].fillna(X['age'].mean(), inplace=True)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))

'''Decision tree'''
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_predict = dtc.predict(X_test)

'''Random forest'''
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_predict = rfc.predict(X_test)

'''Gradient boosted decision trees'''
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_predict = gbc.predict(X_test)

'''Model evaluation'''
from sklearn.metrics import classification_report
print(dtc.score(X_test, y_test))
print(classification_report(dtc_y_predict, y_test))
print(rfc.score(X_test, y_test))
print(classification_report(rfc_y_predict, y_test))
print(gbc.score(X_test, y_test))
print(classification_report(gbc_y_predict, y_test))
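If you just want the three accuracies side by side, the estimators fitted above can also be compared in one small loop; this is a convenience sketch, not part of the original program:

# compare the three fitted models on the same test set
for name, model in [('DecisionTree', dtc), ('RandomForest', rfc), ('GradientBoosting', gbc)]:
    print(name, model.score(X_test, y_test))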
Among ensemble models, the random forest is often used as a baseline algorithm for comparison.
Quite a few classifiers have been introduced by now, so a later post will briefly compare the strengths and weaknesses of each one, to make it easier to pick the right tool.