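Regression with a Random Forest (data preprocessing, MAPE evaluation, visualization, feature importance, and a plot of predicted vs. actual values)
These are notes from Tang Yudi's (唐宇迪) machine learning course: regression with a random forest.
The task is to predict the daily maximum temperature.
The usual approach is to train a random forest. Because this is a regression task, each tree predicts the mean of the training samples in the leaf that a sample falls into, and the forest averages the predictions of its trees.
(For a classification task, a leaf would predict the majority class instead.)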
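To make the averaging concrete, here is a minimal sketch in plain Python. The leaf contents and per-tree predictions are made-up numbers for illustration, not values from temps.csv.

```python
from statistics import mean, mode

# Hypothetical training targets that ended up in one leaf of a regression tree
leaf_temperatures = [61, 63, 66, 70]

# Regression: the leaf predicts the mean of its samples
leaf_prediction = mean(leaf_temperatures)

# Classification: the leaf predicts the most common class among its samples
leaf_classes = ["warm", "cold", "warm", "warm"]
leaf_class = mode(leaf_classes)

# A random forest regressor then averages the predictions of its trees
tree_predictions = [65.0, 62.0, 68.0]  # hypothetical outputs of three trees
forest_prediction = mean(tree_predictions)

print(leaf_prediction, leaf_class, forest_prediction)
```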
First, read in the data and inspect it for missing values and outliers.
Datasets:
temps.csv
Link: https://pan.baidu.com/s/1afKQjExLGHUJxpwZdnUGUA Extraction code: xpad
Extended dataset: temps_extended.csv
Link: https://pan.baidu.com/s/1Vr01IUV7Mnn3EqvT80ZDNQ Extraction code: 9r51
import pandas as pd

# Read in data as pandas dataframe and display first 5 rows
features = pd.read_csv('data/temps.csv')
features.head(5)

print('The shape of our features is:', features.shape)

# Descriptive statistics for each column
features.describe()

The shape of our features is: (348, 9)
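Judging from the output above, the data looks clean: no missing values or obvious outliers.
Looking at the columns, year, month, and day can be combined into a single date feature.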
import datetime

# Get the year, month, and day columns
years = features['year']
months = features['month']
days = features['day']

# Build 'YYYY-M-D' strings, then parse them into datetime objects
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

dates[:5]
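The string round-trip above works, but the same dates can also be built directly from the integer parts. A minimal sketch, with made-up values standing in for the year, month, and day columns of the dataset:

```python
import datetime

# Hypothetical year/month/day columns, standing in for features['year'] etc.
years = [2016, 2016, 2016]
months = [1, 1, 12]
days = [1, 2, 31]

# Build datetime objects directly, without going through strings
dates = [datetime.datetime(int(y), int(m), int(d))
         for y, m, d in zip(years, months, days)]

print(dates[0])
```

Both approaches should give the same datetime objects; the direct constructor just skips the formatting and parsing step.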
With the data checked and the date columns combined, we can plot the data to see how it is distributed.
# Import matplotlib for plotting and use magic command for Jupyter Notebooks
import matplotlib.pyplot as plt

%matplotlib inline

# Set the style
plt.style.use('fivethirtyeight')

# Set up the plotting layout
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
fig.autofmt_xdate(rotation=45)

# Actual max temperature measurement
ax1.plot(dates, features['actual'])
ax1.set_xlabel(''); ax1.set_ylabel('Temperature'); ax1.set_title('Max Temp')

# Temperature from 1 day ago
ax2.plot(dates, features['temp_1'])
ax2.set_xlabel(''); ax2.set_ylabel('Temperature'); ax2.set_title('Previous Max Temp')

# Temperature from 2 days ago
ax3.plot(dates, features['temp_2'])
ax3.set_xlabel('Date'); ax3.set_ylabel('Temperature'); ax3.set_title('Two Days Prior Max Temp')

# Friend Estimate
ax4.plot(dates, features['friend'])
ax4.set_xlabel('Date'); ax4.set_ylabel('Temperature'); ax4.set_title('Friend Estimate')

plt.tight_layout(pad=2)
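The code above simply picks a few features and plots them; it can be reused whenever a similar grid of plots is needed.
The Friend Estimate feature is much "coarser" (noisier) than the other three, so its values are probably less accurate and it is likely to be less important than those three.
The dataset has nine columns, one of which is week, the day of the week, with values such as Mon. (For now, assume it may affect the result.)
These English strings cannot be used directly; they have to be converted into a numeric representation the model can work with, so some preprocessing is needed.
This calls for one-hot encoding, which works as follows:
Each day of the week becomes its own feature column, set to 1 if the row falls on that day and 0 otherwise.
Code: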
# One-hot encode categorical features
features = pd.get_dummies(features)
features.head(5)

print('Shape of features after one-hot encoding:', features.shape)

(Only the week column is non-numeric, so it is the only column that gets encoded.)
Shape of features after one-hot encoding: (348, 15)
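To see what one-hot encoding does to a column like week, here is a hand-rolled sketch in plain Python. The rows are made up for illustration, and the week_ prefix mirrors how pd.get_dummies names its output columns; this is not how pandas implements it internally.

```python
# Hypothetical rows: each has a 'week' value plus one numeric column
rows = [
    {"week": "Mon", "temp_1": 45},
    {"week": "Tues", "temp_1": 44},
    {"week": "Mon", "temp_1": 41},
]

# Collect the distinct categories in a stable (sorted) order
categories = sorted({row["week"] for row in rows})

encoded = []
for row in rows:
    # Keep every column except the categorical one
    new_row = {k: v for k, v in row.items() if k != "week"}
    # One indicator column per category: 1 for the row's own day, 0 otherwise
    for cat in categories:
        new_row["week_" + cat] = 1 if row["week"] == cat else 0
    encoded.append(new_row)

print(encoded[0])
```

The same logic explains the shape change reported above: the single week column is replaced by seven indicator columns, so 9 - 1 + 7 = 15.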
After this basic preprocessing, we extract the label, actual, which is the value to predict (this is a regression task!).
# Use numpy to convert to arrays
import numpy as np

# Labels are the values we want to predict
labels = np.array(features['actual'])

# Remove the labels from the features
# axis 1 refers to the columns
features = features.drop('actual', axis=1)

# Saving feature names for later use
feature_list = list(features.columns)

# Convert to numpy array
features = np.array(features)
Next, split the dataset into a training set and a test set.
# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (261, 14)
Training Labels Shape: (261,)
Testing Features Shape: (87, 14)
Testing Labels Shape: (87,)
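The 261/87 split follows directly from test_size = 0.25 applied to 348 rows. The sketch below reproduces the arithmetic and a seeded shuffle in plain Python. It illustrates the idea only: the seed 42 here drives Python's own random module, so the rows selected will not match the ones train_test_split picks.

```python
import math
import random

n_samples = 348      # number of rows reported above
test_size = 0.25

# The test share is rounded up; the rest goes to training
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# Shuffle row indices with a fixed seed so the split is repeatable
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)
```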
This shows the shapes of the training and test sets after the split.
Now we can train the random forest.
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

# Instantiate model
rf = RandomForestRegressor(n_estimators=1000, random_state=42)

# Train the model on training data
rf.fit(train_features, train_labels);
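The random forest here uses 1000 trees (n_estimators=1000). Each tree is fit on a bootstrap sample of the training data, and the forest's prediction is the average over all trees.
Testing: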
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')

Mean Absolute Error: 3.83 degrees.
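The test above reports the mean absolute error (MAE): how far, on average, the predictions are from the actual values, in degrees. Next we compute the MAPE (mean absolute percentage error), which expresses that error relative to the actual values, to see how well the model does.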
# Per-sample absolute percentage errors; their mean is the MAPE
mape = 100 * (errors / test_labels)

# Calculate and display accuracy (100 minus the MAPE)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 93.99 %.
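The relationship between MAE, MAPE, and the "accuracy" figure is easy to check by hand. Here is a small worked sketch in plain Python; the three temperature pairs are made up and are not taken from temps.csv.

```python
# Hypothetical actual and predicted temperatures (not from temps.csv)
actual = [50.0, 60.0, 80.0]
predicted = [55.0, 57.0, 76.0]

# Absolute error per sample
errors = [abs(p - a) for p, a in zip(predicted, actual)]

# MAE: mean of the absolute errors, in degrees
mae = sum(errors) / len(errors)

# MAPE: mean of the absolute errors relative to the actual value, in percent
mape = 100 * sum(e / a for e, a in zip(errors, actual)) / len(actual)

# The "accuracy" used in this post is simply 100 minus the MAPE
accuracy = 100 - mape

print(round(mae, 2), round(mape, 2), round(accuracy, 2))
```

Note that MAPE divides by the actual value, so it is undefined when an actual value is 0 and gets inflated when actual values are close to 0.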
We can also visualize a tree. As an example, the code below trains a small forest on the same data and draws one of its trees.
# Tools for exporting and drawing the tree (pydot needs Graphviz installed)
from sklearn.tree import export_graphviz
import pydot

# Limit depth of tree to 3 levels
rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
rf_small.fit(train_features, train_labels)

# Extract the small tree
tree_small = rf_small.estimators_[5]

# Export the tree to a .dot file
export_graphviz(tree_small, out_file='small_tree.dot', feature_names=feature_list, rounded=True, precision=1)

# Load the .dot file and save the tree as a png image
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png');
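The tree figure in the original post is annotated with some explanations.
As we know, when a random forest is built it prefers to split on the most valuable features (those with high importance, such as temp_1 above), and the trained forest can report each feature's importance.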
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# Print out the feature and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

# list of x locations for plotting
x_values = list(range(len(importances)))

# Make a bar chart
plt.bar(x_values, importances, orientation='vertical')

# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
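The importances that scikit-learn reports are normalized so that they sum to 1, which is why they can be read as shares. The sketch below shows that normalization and the ranking step in plain Python. The raw scores are made-up numbers, not output from the model above, and the feature names are borrowed from this dataset only for readability.

```python
# Hypothetical raw importance scores per feature (made-up numbers)
raw_scores = {"temp_1": 14.0, "temp_2": 4.0, "friend": 1.0, "day": 1.0}

# Normalize so the importances sum to 1
total = sum(raw_scores.values())
importances = {name: score / total for name, score in raw_scores.items()}

# Rank features from most to least important
ranked = sorted(importances.items(), key=lambda pair: pair[1], reverse=True)

for name, importance in ranked:
    print("Variable: {:20} Importance: {:.2f}".format(name, importance))
```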
The chart makes it clear which features matter.
Finally, we can plot the predictions against the actual values to see how they differ.
# Dates of all values (training and test)
months = features[:, feature_list.index('month')]
days = features[:, feature_list.index('day')]
years = features[:, feature_list.index('year')]

# List and then convert to datetime object
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

# Dataframe with true values and dates
true_data = pd.DataFrame(data={'date': dates, 'actual': labels})

# Dates of predictions
months = test_features[:, feature_list.index('month')]
days = test_features[:, feature_list.index('day')]
years = test_features[:, feature_list.index('year')]

# Column of dates
test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]

# Convert to datetime objects
test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]

# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data={'date': test_dates, 'prediction': predictions})

# Plot the actual values
plt.plot(true_data['date'], true_data['actual'], 'b-', label='actual')

# Plot the predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label='prediction')
plt.xticks(rotation=60);
plt.legend()

# Graph labels
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual and Predicted Values');
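Besides reading the gap off the plot, the difference can be computed per date. The sketch below lines up predictions with actual values by date and finds the largest miss, in plain Python. All dates and temperatures are made up for illustration.

```python
import datetime

# Hypothetical actual values keyed by date (made-up numbers)
actual_by_date = {
    datetime.date(2016, 1, 1): 45.0,
    datetime.date(2016, 1, 2): 44.0,
    datetime.date(2016, 1, 3): 41.0,
}

# Hypothetical predictions for the dates that fell in the test set
predicted_by_date = {
    datetime.date(2016, 1, 1): 47.5,
    datetime.date(2016, 1, 3): 40.0,
}

# Difference (prediction minus actual) for every predicted date
differences = {
    day: predicted - actual_by_date[day]
    for day, predicted in predicted_by_date.items()
}

# Date with the largest absolute miss
worst_day = max(differences, key=lambda day: abs(differences[day]))

print(worst_day, differences[worst_day])
```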
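This is another plot, with time on the x-axis, showing how the predicted temperatures compare with the actual ones.
Later posts use this same example for comparisons, such as how data volume, feature selection, and the random forest's parameters affect the results.
Effect of data and features on a random forest (feature comparison, dimensionality reduction, cost-effectiveness)
https://blog.csdn.net/qq_40229367/article/details/88528421
Random forest parameter selection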