Regression with Random Forest (Data Preprocessing, MAPE Evaluation, Visualization, Feature Importance, Plotting Predicted vs. Actual Values)


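Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/qq_40229367/article/details/88526749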

Notes from Tang Yudi's machine learning course: regression with random forest

The task is to predict the daily maximum temperature.

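The general idea is to train a random forest. Because this is a regression task, each tree predicts the mean of the training samples that fall into the same leaf node, and the forest averages the predictions of its trees.

(For a classification task, the prediction would be the most common class instead.)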
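To make the averaging concrete, here is a minimal sketch on synthetic data (the data and variable names are assumptions for illustration, not part of the temperature dataset). It checks that averaging the individual trees of a scikit-learn `RandomForestRegressor` by hand reproduces the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only (not the temperature dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

X_new = X[:5]
# One row of predictions per tree: shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
# Averaging over the trees should match the forest's prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_new)
print(np.allclose(manual_average, forest_prediction))
```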

First, read in the data and check it for missing values and outliers.

Datasets:

temps.csv

Link: https://pan.baidu.com/s/1afKQjExLGHUJxpwZdnUGUA Extraction code: xpad

Extended dataset: temps_extended.csv

Link: https://pan.baidu.com/s/1Vr01IUV7Mnn3EqvT80ZDNQ Extraction code: 9r51

 

    import pandas as pd

    # Read in data as a pandas DataFrame and display the first 5 rows
    features = pd.read_csv('data/temps.csv')
    print(features.head(5))

    print('The shape of our features is:', features.shape)

    # Count missing values per column
    print(features.isnull().sum())

    # Descriptive statistics for each column (useful for spotting outliers)
    print(features.describe())

The shape of our features is: (348, 9)

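As the output above shows, there is nothing wrong with the data.

We also notice that the year, month, and day columns can be combined into a single date.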

    import datetime

    # Get years, months, and days
    years = features['year']
    months = features['month']
    days = features['day']

    # Build date strings, then convert them to datetime objects
    dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day))
             for year, month, day in zip(years, months, days)]
    dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

    dates[:5]
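As an alternative to building date strings by hand, pandas can assemble datetimes directly from columns named `year`, `month`, and `day`. The sketch below uses a tiny made-up frame (the values are assumptions, chosen only to mirror the column names in temps.csv).

```python
import pandas as pd

# Tiny hypothetical frame with the same column names as temps.csv
sample = pd.DataFrame({"year": [2016, 2016], "month": [1, 12], "day": [1, 31]})

# pd.to_datetime assembles one datetime per row from year/month/day columns
sample_dates = pd.to_datetime(sample[["year", "month", "day"]])
print(sample_dates.tolist())
```

The result is a pandas Series of timestamps, which matplotlib can plot on the x-axis just like a list of `datetime` objects.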

With the data free of anomalies and the date columns combined, we can look at how the data is distributed by plotting it.

    # Import matplotlib for plotting and use the magic command for Jupyter Notebooks
    import matplotlib.pyplot as plt

    %matplotlib inline

    # Set the style
    plt.style.use('fivethirtyeight')

    # Set up the plotting layout
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
    fig.autofmt_xdate(rotation=45)

    # Actual max temperature measurement
    ax1.plot(dates, features['actual'])
    ax1.set_xlabel(''); ax1.set_ylabel('Temperature'); ax1.set_title('Max Temp')

    # Temperature from 1 day ago
    ax2.plot(dates, features['temp_1'])
    ax2.set_xlabel(''); ax2.set_ylabel('Temperature'); ax2.set_title('Previous Max Temp')

    # Temperature from 2 days ago
    ax3.plot(dates, features['temp_2'])
    ax3.set_xlabel('Date'); ax3.set_ylabel('Temperature'); ax3.set_title('Two Days Prior Max Temp')

    # Friend estimate
    ax4.plot(dates, features['friend'])
    ax4.set_xlabel('Date'); ax4.set_ylabel('Temperature'); ax4.set_title('Friend Estimate')

    plt.tight_layout(pad=2)

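The code above picks a few features and plots them (it can be reused directly whenever a similar figure is needed).

The Friend Estimate curve is far noisier than the other three, so its values are probably less accurate, and the feature is likely to be less important than those three.

In addition, the data originally has nine columns, one of which is the day of the week (week), with values such as Mon. (Assume for now that it has an effect.)

These values cannot be used as raw English strings. They have to be converted into a representation the model can work with, so some preprocessing is needed.

This calls for one-hot encoding, which works as follows:

Each day of the week becomes its own feature column, set to 1 for rows that fall on that day and 0 otherwise.

Code: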

    # One-hot encode categorical features
    features = pd.get_dummies(features)
    print(features.head(5))

    print('Shape of features after one-hot encoding:', features.shape)

(week is the only column whose values are not numeric.)

Shape of features after one-hot encoding: (348, 15)
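The column count goes from 9 to 15 because the single week column is replaced by seven indicator columns (9 - 1 + 7 = 15). A small sketch on a made-up frame (the values are assumptions for illustration) shows what `pd.get_dummies` does to a string column while leaving numeric columns alone.

```python
import pandas as pd

# Hypothetical three-row frame: one string column, one numeric column
toy = pd.DataFrame({"week": ["Mon", "Tues", "Mon"], "temp_1": [45, 44, 41]})

# String columns are expanded into one indicator column per distinct value
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns may be shown as 0/1 integers or as True/False booleans; both work as model inputs.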

With the basic preprocessing done, we need to extract the label, which is the actual column (this is a regression task).

    # Use numpy to convert to arrays
    import numpy as np

    # Labels are the values we want to predict
    labels = np.array(features['actual'])

    # Remove the labels from the features
    # axis 1 refers to the columns
    features = features.drop('actual', axis=1)

    # Save feature names for later use
    feature_list = list(features.columns)

    # Convert to numpy array
    features = np.array(features)

Next, split the dataset into training and test sets.

    # Using scikit-learn to split data into training and testing sets
    from sklearn.model_selection import train_test_split

    # Split the data into training and testing sets
    train_features, test_features, train_labels, test_labels = train_test_split(
        features, labels, test_size=0.25, random_state=42)

    print('Training Features Shape:', train_features.shape)
    print('Training Labels Shape:', train_labels.shape)
    print('Testing Features Shape:', test_features.shape)
    print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (261, 14)
Training Labels Shape: (261,)
Testing Features Shape: (87, 14)
Testing Labels Shape: (87,)

This shows the sizes of the training and test sets after the split.
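The shapes follow from the arguments: with 348 rows and `test_size=0.25`, 87 rows go to the test set and the remaining 261 to the training set. The sketch below reproduces just that arithmetic on placeholder arrays of zeros (an assumption for illustration, standing in for the real features and labels).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shapes as the real features and labels
X = np.zeros((348, 14))
y = np.zeros(348)

# Same split settings as in the article
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs put the same rows in each set.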

Now the random forest can be trained.

    # Import the model we are using
    from sklearn.ensemble import RandomForestRegressor

    # Instantiate the model
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)

    # Train the model on the training data
    rf.fit(train_features, train_labels)

Here the forest is built from 1000 trees (n_estimators=1000), and its prediction is the average over all of them.

Now test it:

    # Use the forest's predict method on the test data
    predictions = rf.predict(test_features)

    # Calculate the absolute errors
    errors = np.abs(predictions - test_labels)

    # Print out the mean absolute error (MAE)
    print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')

Mean Absolute Error: 3.83 degrees.

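The step above scores the model by how far the predictions are from the actual values on average, which is the mean absolute error (MAE). Next we compute the MAPE (mean absolute percentage error), which expresses each error as a percentage of the actual value, and report 100 - MAPE as an accuracy figure to see how well the model does.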

    # Calculate the absolute percentage error for each test sample
    mape = 100 * (errors / test_labels)

    # Calculate and display accuracy (100 minus the mean percentage error)
    accuracy = 100 - np.mean(mape)
    print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 93.99 %.
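A worked example on three made-up temperatures (the numbers are assumptions chosen for easy arithmetic) shows how the two metrics relate. The absolute errors are 5, 3, and 0 degrees, giving an MAE of about 2.67 degrees; the percentage errors are 10%, 5%, and 0%, giving a MAPE of 5% and an accuracy of 95%.

```python
import numpy as np

# Hypothetical actual and predicted temperatures
actual = np.array([50.0, 60.0, 80.0])
predicted = np.array([45.0, 63.0, 80.0])

# Absolute error per sample: 5, 3, 0
abs_errors = np.abs(predicted - actual)

# MAE: mean of the absolute errors, in degrees
mae = abs_errors.mean()

# MAPE: mean of the absolute errors relative to the actual values, in percent
mape = 100 * np.mean(abs_errors / actual)

# "Accuracy" as used in the article: 100 minus MAPE
accuracy = 100 - mape
print(round(mae, 2), round(mape, 2), round(accuracy, 2))
```

Note that MAPE divides by the actual values, so it is undefined when an actual value is zero and becomes unstable when actual values are close to zero.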

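We can also visualize one of the trees. As an example, the code below trains a small forest on the same data and exports a single tree as an image.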

    # Tools for exporting and rendering a tree
    # (pydot needs the Graphviz binaries installed to write the png)
    from sklearn.tree import export_graphviz
    import pydot

    # Limit depth of tree to 3 levels
    rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
    rf_small.fit(train_features, train_labels)

    # Extract one tree from the small forest
    tree_small = rf_small.estimators_[5]

    # Export the tree to a dot file
    export_graphviz(tree_small, out_file='small_tree.dot',
                    feature_names=feature_list, rounded=True, precision=1)

    # Load the dot file and save the tree as a png image
    (graph, ) = pydot.graph_from_dot_file('small_tree.dot')
    graph.write_png('small_tree.png')

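The original post annotates the resulting tree image with explanations of its nodes.

As we know, when a random forest is built, its splits favor the more valuable features (those with higher importance, such as temp_1 in the tree above), and the trained forest can tell us how important each feature is.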

    # Get numerical feature importances
    importances = list(rf.feature_importances_)

    # List of tuples with variable and importance
    feature_importances = [(feature, round(importance, 2))
                           for feature, importance in zip(feature_list, importances)]

    # Sort the feature importances by most important first
    feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

    # Print out the features and importances
    for pair in feature_importances:
        print('Variable: {:20} Importance: {}'.format(*pair))

    # List of x locations for plotting
    x_values = list(range(len(importances)))

    # Make a bar chart
    plt.bar(x_values, importances, orientation='vertical')

    # Tick labels for x axis
    plt.xticks(x_values, feature_list, rotation='vertical')

    # Axis labels and title
    plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances')

The chart makes it clear which features matter most.
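To see how `feature_importances_` behaves when the answer is known, here is a minimal sketch on synthetic data (the data and the feature names `signal`, `noise_a`, `noise_b` are assumptions for illustration). The target depends only on the first column, so that column should dominate the ranking, and the importances sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first column carries information about y
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)
names = ["signal", "noise_a", "noise_b"]  # hypothetical feature names

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Pair each name with its importance and sort, most important first
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:10} {importance:.3f}")
```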

Finally, we can plot the predictions against the true values to see how they differ.

    # Dates of all values (full dataset)
    months = features[:, feature_list.index('month')]
    days = features[:, feature_list.index('day')]
    years = features[:, feature_list.index('year')]

    # Build date strings, then convert them to datetime objects
    dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day))
             for year, month, day in zip(years, months, days)]
    dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

    # Dataframe with true values and dates
    true_data = pd.DataFrame(data={'date': dates, 'actual': labels})

    # Dates of predictions
    months = test_features[:, feature_list.index('month')]
    days = test_features[:, feature_list.index('day')]
    years = test_features[:, feature_list.index('year')]

    # Column of dates
    test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day))
                  for year, month, day in zip(years, months, days)]

    # Convert to datetime objects
    test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]

    # Dataframe with predictions and dates
    predictions_data = pd.DataFrame(data={'date': test_dates, 'prediction': predictions})

    # Plot the actual values
    plt.plot(true_data['date'], true_data['actual'], 'b-', label='actual')

    # Plot the predicted values
    plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label='prediction')
    plt.xticks(rotation=60)
    plt.legend()

    # Graph labels
    plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual and Predicted Values')

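This is another plot, this time with time on the x-axis, showing the actual temperatures alongside the predictions.

In later posts we will use this example for comparisons, such as how the amount of data, feature selection, and the random forest parameters affect the results.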

 

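The effect of data and features on random forests (feature comparison, dimensionality reduction, cost-effectiveness)

https://blog.csdn.net/qq_40229367/article/details/88528421

Random forest parameter selection

https://blog.csdn.net/qq_40229367/article/details/88532093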

