Regression with a Random Forest (data preprocessing, MAPE evaluation, visualization, feature importance, and a predicted-vs-actual plot)

Reposted 2019-11-03 13:45. Category: Machine Learning

2019-03-13 10:55:04, PanDawson

Category: Machine Learning

Original link: https://blog.csdn.net/qq_40229367/article/details/88526749

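These are my notes from Tang Yudi's machine learning course: regression with a random forest.

The task is to predict the daily maximum temperature.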

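The usual approach is to train a random forest. Because this is a regression task, each tree predicts the mean target value of the training samples in the leaf that a sample falls into, and the forest averages the predictions of its trees.

(For a classification task, the prediction would be the most common class, i.e. the mode.)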
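The difference between the two prediction rules can be shown with a tiny sketch. The leaf contents and per-tree predictions below are made-up values for illustration only, not numbers from the temperature dataset.

```python
from statistics import mean, mode

# Hypothetical targets of the training samples that ended up in one leaf
leaf_temperatures = [61.0, 64.0, 58.0, 65.0]     # regression targets
leaf_classes = ["warm", "cold", "warm", "warm"]  # classification targets

# Regression: the leaf predicts the mean of its training targets
regression_prediction = mean(leaf_temperatures)

# Classification: the leaf predicts the most common class (the mode)
classification_prediction = mode(leaf_classes)

# A regression forest then averages the predictions of its trees
# (hypothetical predictions from three trees for the same sample)
tree_predictions = [62.0, 60.5, 63.5]
forest_prediction = mean(tree_predictions)

print(regression_prediction)      # 62.0
print(classification_prediction)  # warm
print(forest_prediction)          # 62.0
```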

First read in the data and inspect it for missing values and outliers.

Datasets:

temps.csv

Link: https://pan.baidu.com/s/1afKQjExLGHUJxpwZdnUGUA Extraction code: xpad

Extended dataset: temps_extended.csv

Link: https://pan.baidu.com/s/1Vr01IUV7Mnn3EqvT80ZDNQ Extraction code: 9r51

import pandas as pd
# Read in data as pandas dataframe and display first 5 rows
features = pd.read_csv('data/temps.csv')
features.head(5)
print('The shape of our features is:', features.shape)
# Descriptive statistics for each column
features.describe()

The shape of our features is: (348, 9)

The summary above shows no obvious problems with the data.
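Besides reading the describe() summary, missing values and implausible readings can be checked explicitly. The following is a minimal sketch on a small made-up frame standing in for temps.csv; the plausibility bounds (-40 to 130 degrees Fahrenheit) are an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for temps.csv
sample = pd.DataFrame({
    "temp_1": [45.0, 44.0, np.nan, 41.0],   # one missing value on purpose
    "actual": [44.0, 41.0, 40.0, 140.0],    # 140 is deliberately implausible
})

# Count missing values per column
missing_per_column = sample.isnull().sum()

# Flag rows whose target lies outside an assumed plausible range
out_of_range = sample[(sample["actual"] < -40) | (sample["actual"] > 130)]

print(missing_per_column)
print(out_of_range)
```

On the real data the same two checks would be run on features before any encoding.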

Looking at the columns, year, month, and day can be combined into a single date.

import datetime
# Get years, months, and days
years = features['year']
months = features['month']
days = features['day']
# List and then convert to datetime object
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
dates[:5]
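As an alternative to building strings by hand, pandas can assemble dates directly from columns named year, month, and day. This is a sketch on a small made-up frame, not a change to the workflow above.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
sample = pd.DataFrame({"year": [2016, 2016], "month": [1, 12], "day": [1, 31]})

# pd.to_datetime accepts a frame with year/month/day columns
sample_dates = pd.to_datetime(sample[["year", "month", "day"]])

print(sample_dates.tolist())
```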

With no anomalies found and the date columns combined, we can plot the data to see how it is distributed.

# Import matplotlib for plotting and use magic command for Jupyter Notebooks
import matplotlib.pyplot as plt
%matplotlib inline
# Set the style
plt.style.use('fivethirtyeight')
# Set up the plotting layout
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
fig.autofmt_xdate(rotation=45)
# Actual max temperature measurement
ax1.plot(dates, features['actual'])
ax1.set_xlabel(''); ax1.set_ylabel('Temperature'); ax1.set_title('Max Temp')
# Temperature from 1 day ago
ax2.plot(dates, features['temp_1'])
ax2.set_xlabel(''); ax2.set_ylabel('Temperature'); ax2.set_title('Previous Max Temp')
# Temperature from 2 days ago
ax3.plot(dates, features['temp_2'])
ax3.set_xlabel('Date'); ax3.set_ylabel('Temperature'); ax3.set_title('Two Days Prior Max Temp')
# Friend Estimate
ax4.plot(dates, features['friend'])
ax4.set_xlabel('Date'); ax4.set_ylabel('Temperature'); ax4.set_title('Friend Estimate')
plt.tight_layout(pad=2)

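The code above plots a few selected features (it can be reused whenever a similar grid of plots is needed).

The Friend Estimate curve is much "coarser" (noisier) than the other three, so its values are probably less accurate and it is probably less important than those three features.

Besides these, the raw data has nine columns, one of which is week, the day of the week, with values such as Mon. (For now, assume it may affect the result.)

Text values like these cannot be used directly. They must be converted into a numeric representation, so some preprocessing is needed.

This calls for one-hot encoding, which works as follows:

Each distinct value becomes its own feature column: the column matching a row's day of the week is 1, and the others are 0.

Code: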

# One-hot encode categorical features
features = pd.get_dummies(features)
features.head(5)
print('Shape of features after one-hot encoding:', features.shape)

(week is the only column whose values are not numeric.)

Shape of features after one-hot encoding: (348, 15)
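To see what the encoding does on a small scale, here is a sketch on a three-row made-up frame. Passing dtype=int is a choice made for this example so the output shows 0/1 rather than booleans, which newer pandas versions produce by default.

```python
import pandas as pd

# Made-up rows: one text column and one numeric column
sample = pd.DataFrame({"week": ["Mon", "Tue", "Mon"], "temp_1": [45, 44, 41]})

# Only the text column is expanded; numeric columns pass through unchanged
encoded = pd.get_dummies(sample, dtype=int)

print(encoded)
```

On the full dataset the single week column expands into seven columns, which is why the shape grows from 9 to 15 columns.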

After this basic preprocessing we need to separate out the label, the actual column. (Remember this is a regression task.)

# Use numpy to convert to arrays
import numpy as np
# Labels are the values we want to predict
labels = np.array(features['actual'])
# Remove the labels from the features
# axis 1 refers to the columns
features = features.drop('actual', axis=1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)

Next, split the data into training and test sets.

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (261, 14)
Training Labels Shape: (261,)
Testing Features Shape: (87, 14)
Testing Labels Shape: (87,)

This shows the sizes of the training and test sets after the split.
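The row counts follow directly from test_size = 0.25 applied to 348 samples. The sketch below reproduces the arithmetic; rounding the test count up is, to my understanding, how scikit-learn handles a fractional test_size, though here 348 * 0.25 is exactly 87 so no rounding occurs.

```python
import math

n_samples = 348   # rows in temps.csv
test_size = 0.25  # fraction passed to train_test_split

# Test rows: fraction of the data (rounded up if not a whole number)
n_test = math.ceil(n_samples * test_size)
# Training rows: everything else
n_train = n_samples - n_test

print(n_train, n_test)  # 261 87
```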

Now we can train the random forest.

# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
# Train the model on training data
rf.fit(train_features, train_labels)

The random forest here is built from 1000 trees, and its prediction is the average over all of them.
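That a forest's regression output is the average of its trees can be checked directly. This is a sketch on synthetic data (the data-generating formula and the small n_estimators are assumptions made to keep the example fast), not a rerun of the temperature model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends mostly on the first column
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 0.8 * X[:, 0] + rng.normal(0, 2, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X, y)

# Predict with every individual tree, then average by hand
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare with the forest's own prediction
forest_prediction = forest.predict(X[:5])
print(np.allclose(manual_average, forest_prediction))
```

More trees generally make this average more stable, at the cost of training time.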

Test it:

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')

Mean Absolute Error: 3.83 degrees.

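The test above measures how far the predictions are from the actual values on average (the mean absolute error). We can also look at MAPE, the mean absolute percentage error, which expresses each error as a percentage of the actual value, to see how well the model does.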

# Per-sample absolute percentage errors; their mean is the MAPE
mape = 100 * (errors / test_labels)
# Calculate and display accuracy (defined here as 100 minus the MAPE)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 93.99 %.
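The two metrics can be wrapped in a small helper to make the definitions explicit. This is a minimal sketch; the function name mae_and_mape and the three sample values are made up for illustration, and MAPE is undefined whenever an actual value is zero.

```python
import numpy as np

def mae_and_mape(y_true, y_pred):
    """Return (mean absolute error, mean absolute percentage error in %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_errors = np.abs(y_pred - y_true)
    mae = abs_errors.mean()
    # Each error is scaled by the actual value; breaks down if y_true has zeros
    mape = 100 * np.mean(abs_errors / np.abs(y_true))
    return mae, mape

# Made-up actual and predicted temperatures
mae, mape = mae_and_mape([50, 60, 80], [45, 63, 80])
print(round(mae, 2), round(mape, 2), round(100 - mape, 2))
```

Here the errors are 5, 3, and 0 degrees, i.e. 10%, 5%, and 0% of the actual values, so the MAPE is 5% and the "accuracy" as defined in this post is 95%.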

We can also visualize a tree. The example below trains a small forest on the same data and draws one of its trees.

# Imports needed for exporting and rendering the tree
# (writing a PNG with pydot also requires the Graphviz binaries to be installed)
from sklearn.tree import export_graphviz
import pydot
# Limit depth of each tree to 3 levels
rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
rf_small.fit(train_features, train_labels)
# Extract one small tree from the forest
tree_small = rf_small.estimators_[5]
# Export the tree to a dot file, then save it as a png image
export_graphviz(tree_small, out_file='small_tree.dot', feature_names=feature_list, rounded=True, precision=1)
(graph,) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')

The rendered tree image was annotated with some explanations of the tree.
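If Graphviz is not available, a tree can also be inspected as plain text. The sketch below uses scikit-learn's export_text on a single small tree trained on synthetic data; the data and the feature names passed in are stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: the target tracks the first column closely
rng = np.random.default_rng(0)
X = rng.uniform(30, 100, size=(100, 2))
y = X[:, 0] + rng.normal(0, 1, size=100)

tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Text rendering: one line per split, leaves show the predicted value
tree_text = export_text(tree, feature_names=["temp_1", "temp_2"])
print(tree_text)
```

Each indented line is a split condition, and each leaf line shows the value that the leaf predicts.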

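A random forest tends to split first on the most valuable features (those with high importance, such as temp_1 above), and the trained forest can tell us how important each feature is.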

# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
# Print out the feature and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

# List of x locations for plotting
x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation='vertical')
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances')

The chart shows clearly which features are important.
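How importances behave is easiest to see on data where the answer is known. The following sketch uses synthetic data in which only the first column drives the target; the column names and the data-generating formula are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first column carries signal
rng = np.random.default_rng(0)
names = ["informative", "noise_a", "noise_b"]
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
importances = forest.feature_importances_

# Rank features from most to least important
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:12} {importance:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.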

Finally, we can plot the predictions against the true values to see how they differ.

# Dates for all samples (features is now a NumPy array, so columns are looked up by index)
months = features[:, feature_list.index('month')]
days = features[:, feature_list.index('day')]
years = features[:, feature_list.index('year')]
# List and then convert to datetime object
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
# Dataframe with true values and dates
true_data = pd.DataFrame(data={'date': dates, 'actual': labels})
# Dates of predictions (test set only)
months = test_features[:, feature_list.index('month')]
days = test_features[:, feature_list.index('day')]
years = test_features[:, feature_list.index('year')]
# Column of dates
test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
# Convert to datetime objects
test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]
# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data={'date': test_dates, 'prediction': predictions})
# Plot the actual values
plt.plot(true_data['date'], true_data['actual'], 'b-', label='actual')
# Plot the predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label='prediction')
# Rotation must be a number, not the string '60'
plt.xticks(rotation=60)
plt.legend()
# Graph labels
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual and Predicted Values')

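This is another plot, this time with the date on the x-axis, showing the actual temperatures as a line and the test-set predictions as points.

Later posts use this example as a basis for comparisons, such as the effect of data volume, feature selection, and random forest parameters on the results.

Effect of data and features on random forests (feature comparison, feature dimensionality reduction, cost-effectiveness)

https://blog.csdn.net/qq_40229367/article/details/88528421

Random forest parameter selection

https://blog.csdn.net/qq_40229367/article/details/88532093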
