天池題目：工業蒸汽預測（一）- 數據探索

本文轉載自查看原文 2021-01-05 22:54 556 天池

1. 題目介紹

火力發電的基本原理是：燃料在燃燒時加熱水生成蒸汽，蒸汽壓力推動汽輪機旋轉，然后汽輪機帶動發電機旋轉，產生電能。在這一系列的能量轉化中，影響發電效率的核心是鍋爐的燃燒效率，即燃料燃燒加熱水產生高溫高壓蒸汽。鍋爐的燃燒效率的影響因素很多，包括鍋爐的可調參數，如燃燒給量，一二次風，引風，返料風，給水水量；以及鍋爐的工況，比如鍋爐床溫、床壓，爐膛溫度、壓力，過熱器的溫度等。

數據為：經脫敏后的鍋爐傳感器采集的數據（采集頻率是分鍾級別），根據鍋爐的工況，預測產生的蒸汽量。

題目地址：

https://tianchi.aliyun.com/competition/entrance/231693/introduction

2. 數據探索

2.1. 初步探索

先簡單查看一下數據：

import pandas as pd
import s3fs
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

plt.style.use('seaborn')
%matplotlib inline

train_raw = pd.read_csv(train_data_uri, sep='\t', encoding='utf-8')
test_raw = pd.read_csv(test_data_uri, sep='\t', encoding='utf-8')

train_raw.head()

train_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 2   V2      2888 non-null   float64
 …
 37  V37     2888 non-null   float64
 38  target  2888 non-null   float64
dtypes: float64(39)
memory usage: 880.1 KB

從訓練集 info 信息我們可以知道，在訓練集中：

一共有2888 個樣本， 38個字段（V0 - V37），1個 target
所有特征均為連續型特征
Label（也就是target）為連續型，所以我們需要回歸函數進行預測
所有特征均沒有空置

測試集 info()：

test_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      1925 non-null   float64
 1   V1      1925 non-null   float64
 2   V2      1925 non-null   float64
 …
 36  V36     1925 non-null   float64
 37  V37     1925 non-null   float64
dtypes: float64(38)
memory usage: 571.6 KB

從測試集info() 我們可以了解到，在測試集中：

一共有1925個樣本，38個字段（V0 - V37）
所有特征均為連續型
沒有缺失值

若是進一步對df 做 describe()，則會有 39 個字段的describe數據，從觀察數據的角度來看，比較復雜，所以下一步我們對數據進行可視化。

2.2. 數據可視化

可視化的主要目標為：探索數據特征以及數據分布

2.2.3. 盒圖

盒圖是一種流行的分布的直觀表示。盒圖體現了五數概括：

盒的端點一般在四分位數上，使得盒的長度是四分位數極差IQR
中位數用盒內的線標記
盒外的兩條線（稱作胡須）延伸到最小（Minimum）和最大（Maximum）觀測值

當處理數量適中的觀測值時，盒圖中對離群點的表示為：僅當最高和最低觀測值超過四分位數不到1.5 × IQR 時，胡須擴展到它們。否則胡須在出現在四分位數1.5 × IQR 之內的最極端的觀測值處終止，剩下的情況個別地繪出。

先看看第一個特征的盒圖：

可以看到有較多的離群點。

繼續對所有特征畫出盒圖：

columns = train_raw.columns[:-1]

fig = plt.figure(figsize=(100, 80), dpi=75)
for i in range(len(columns)):
    plt.subplot(7, 6, i+1)
    sns.boxplot(train_raw[columns[i]], orient='v', width=0.5)
    plt.ylabel(columns[i], fontsize=36)
plt.show()

2.2.4. 使用模型預測識別離群點

下面是使用嶺回歸來預測離群點：

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# train liner Ridge model
X_train = train_raw.iloc[:, 0:-1]
y_train = train_raw.iloc[:, -1]

model = Ridge()
model.fit(X_train, y_train)

y_pred = model.predict(X_train)
y_pred = pd.Series(y_pred, index=y_train.index)

# calculate residual
residual = y_pred - y_train
resid_mean = residual.mean()
resid_std = residual.std()

# calculate z score
sigma = 3
z = (residual - resid_mean) / resid_std

# get outliers
outliers = y_train[abs(z) > sigma].index

# plot outliers
plt.figure(figsize=(15, 5))

plt.subplot(131)
plt.plot(y_train, y_pred, '.')
plt.plot(y_train.loc[outliers], y_pred.loc[outliners], 'ro')
plt.xlabel('y_train')
plt.ylabel('y_pred')
plt.legend(['Accepted', 'Outlier'])

plt.subplot(132)
plt.plot(y_train, y_pred-y_train, '.')
plt.plot(y_train[outliers], y_train.loc[outliers] - y_pred.loc[outliers], 'ro')
plt.xlabel('y_train')
plt.ylabel('y_pred - y_train')
plt.legend(['Accepted', 'Outlier'])

ax3 = plt.subplot(133)
z.plot.hist(bins=50, ax=ax3)
z.loc[outliers].plot.hist(bins=50, ax=ax3, color='red')
plt.legend(['Accepted', 'Outlier'])
plt.xlabel('z')

2.2.5. 直方圖與Q-Q圖

Q-Q圖的定義：The quantitle-quantile(q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

也就是指數據的分位數和正態分布的分位數對比參照的圖，如果數據符合正態分布，則所有的點都會落在直線上。

它主要是直觀的表示觀測與預測值之間的差異。一般我們所取得數量性狀數據都為正態分布數據。預測的線是一條從原點出發的45度角的虛線，事假觀測值是實心點。

偏離線越大，則兩個數據集來自具有不同分布的群體的結論的證據就越大。

首先，繪制特征V0的直方圖查看其在訓練集中的統計分布，並繪制Q-Q圖查看V0的分布是否近似於正態分布：

plt.figure(figsize=(10, 5))

ax1 = plt.subplot(121)
sns.distplot(train_raw['V0'], fit=stats.norm)

ax2 = plt.subplot(122)
res = stats.probplot(train_raw['V0'], plot=plt)

可以看到訓練集中V0 特征並非為正態分布。

接下來我們繪制所有特征的Q-Q圖：

import warnings

warnings.filterwarnings("ignore")

# 38 * 2 = 76 = 4 * 19
plt.figure(figsize=(80, 190))

columns = train_raw.columns.tolist()[:-1]

# subplot position
ax_index = 1

for i in range(len(columns)):
    ax = plt.subplot(19, 4, ax_index)
    sns.distplot(train_raw[columns[i]], fit=stats.norm)
    ax_index += 1
    
    ax = plt.subplot(19, 4, ax_index)
    res = stats.probplot(train_raw[columns[i]], plot=plt)
    ax_index += 1
    
#plt.tight_layout()

部分結果如下：

可以看到其中有的特征符合正態分布，但大部分並不符合，數據並不跟隨對角線分布。對此，后續可以使用數據變換對其進行處理。

2.2.6. KDE分布圖

KDE（Kernel Density Estimation，核密度估計）可以理解為是對直方圖的加窗平滑。它在概率論中用來估計未知的概率密度函數，屬於非參數檢驗方法之一。通過核密度估計圖可以比較直觀的看出數據樣本本身的分布特征。

通過繪制KDE圖，可以查看並對比訓練集和測試集中特征變量的分布情況，發現兩個數據集中分布不一致的特征變量。

先對同一特征V0，分別在訓練集與測試集中的分布情況，並觀察分布是否一致：

plt.figure(figsize=(10, 8))
ax = sns.kdeplot(train_raw['V0'], color="Red", shade=True)
ax = sns.kdeplot(test_raw['V0'], color="Blue", shade=True)
ax.set_xlabel("V0")
ax.set_ylabel("Frequency")
ax.legend(['train', 'test'])

可以看到 V0 在兩個數據集中的分布基本一致。

然后對所有特征畫出訓練集與測試集中的KDE分布：

# plot all features' kde plots
# 38 < 4*10
columns = train_raw.columns.tolist()[:-1]

plt.figure(figsize=(40, 100))
ax_index = 1

for i in range(len(columns)):
    ax = plt.subplot(10, 4, ax_index)
    ax = sns.kdeplot(train_raw[columns[i]], color="Red", shade=True)
    ax = sns.kdeplot(test_raw[columns[i]], color="Blue", shade=True)
    ax.set_xlabel(columns[i])
    ax.set_ylabel("Frequency")
    ax.legend(['train', 'test'])
ax_index += 1

可以看到大部分特征的分布在訓練集與測試集中基本一致，但仍有幾個特征的分布在兩個數據集中不一致（主要為V5、V9、V11、V17、V22、V28），這樣會導致模型的泛化能力變差，需要刪除此類特征。

2.2.7. 線性回歸關系圖

線性回歸關系圖主要用於分析變量之間的線性回歸關系。

首先看V0與target 之間的線性回歸關系（sns.regplot 和 sns.distplot 可以同時用，同時查看特征分布以及特征和變量關系）：

plt.figure(figsize=(10, 8))

ax = plt.subplot(121)
sns.regplot(x='V0', y='target', data=train_raw, ax=ax, 
            scatter_kws={'marker':'.', 's':4, 'alpha':0.3},
            line_kws={'color':'g'})
plt.xlabel('V0')
plt.ylabel('target')

ax = plt.subplot(122)
sns.distplot(train_raw['V0'].dropna())
plt.xlabel('V0')

plt.show()

從圖像結果來看，V0 與 target 存在一定程度的線性關系。

然后查看所有特征變量與target 變量的線性回歸關系：

fcols = 6
frows = len(columns)

plt.figure(figsize=(5*fcols, 4*frows), dpi=150)
ax_index = 1

for i in range(len(columns)):
    ax = plt.subplot(19, 4, ax_index)
    sns.regplot(x=columns[i], y='target', data=train_raw, ax=ax,
                scatter_kws={'marker':'.', 's':3, 'alpha':0.3},
                line_kws={'color':'g'})
    ax.set_xlabel(columns[i])
    ax.set_ylabel('target')
    ax_index += 1

    ax = plt.subplot(19, 4, ax_index)
    sns.distplot(train_raw[columns[i]].dropna())
    ax_index += 1

plt.show()

部分結果如下：

通過查看可視化的結果，我們可以明顯看到一些特征與target 完全沒有線性關系（例如V9、V17、V18、V22、V23、V24、V25、V28 等等……）

2.2.8. 特征變量的相關性

特征變量的相關性主要是通過計算協方差矩陣獲取，首先我們刪除訓練集與測試集中分布不一致的特征變量（V5、V9、V11、V17、V21、V22、V23、V28）：

為了方便查看，我們可以使用熱力圖顯示結果：

plt.figure(figsize=(20, 16))
sns.heatmap(train_corr, square=True, annot=True)

從圖中我們可以清晰地看到各個特征與 target 的相關系數程度。

有了相關系數后，繼而我們可以通過相關系數來選擇特征。

首先尋找K個與target變量最相關的特征：

# find top K most relevant features
K = 10
cols = train_corr.nlargest(10, 'target')['target'].index

plt.figure(figsize=(10, 8))
sns.heatmap(train_raw[cols].corr(), annot=True, square=True)

找到其中與target 變量相關性大於 0.5 的特征：

threshold = 0.5

corrmat = train_raw.corr()
most_relevants = corrmat[abs(corrmat['target']) > threshold].index

plt.figure(figsize=(10, 8))
sns.heatmap(train_raw[most_relevants].corr(), square=True, annot=True)

相關性選擇主要用於判斷線性相關，若是target 變量存在更復雜的函數形式的影響，則建議使用樹模型的特征重要性去選擇。

Box-Cox 變換

由於線性回歸是基於正態分布的，因此在進行統計分析時，需要將數據轉換使其符合正態分布。Box-Cox變換是統計建模中常用的一種數據轉換方法，通過自動計算最優的 λ，使得變換后的樣本（及總體）正態性最好。

在做Box-Cox變換之前，需要對數據做歸一化處理。在歸一化時，對數據進行合並操作可以使得訓練數據與測試數據一致：

# normalization
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()

columns = data_all.columns

normalized_data_all = min_max_scaler.fit_transform(data_all)
normalized_data_all = pd.DataFrame(normalized_data_all, index=data_all.index, columns=columns)
normalized_data_all.describe()

對特征做了歸一化后，繼續做Box-Cox 變換，顯示特征變量與target變量的線性關系：

# restore normalized training & test set 
train_size = train_raw.shape[0]

processed_train = normalized_data_all.iloc[:train_size]
processed_test = normalized_data_all.iloc[train_size:]

processed_train = pd.concat([processed_train, train_raw['target']], axis=1)

# box-cox transform and plot
def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min()) 

columns = processed_train.columns[:4]

plt.figure(figsize=(12, 100))
ax_index = 1
for i in range(len(columns)):
    # original distribution plot
    ax = plt.subplot(36, 3, ax_index)
    sns.distplot(processed_train[columns[i]], fit=stats.norm)
    ax.set_title('Original: ' + columns[i])
    ax_index += 1
    
    # prob plot
    ax = plt.subplot(36, 3, ax_index)
    stats.probplot(processed_train[columns[i]], plot=ax)
    ax.set_title('skew={:.4f}'.format(stats.skew(processed_train[columns[i]])))
    ax_index += 1
    
    # reg plot
    ax = plt.subplot(36, 3, ax_index)
    ax.plot(processed_train[columns[i]], processed_train['target'], '.', alpha=0.5)
    ax.set_title('corr={:.4f}'.format( np.corrcoef(processed_train[columns[i]], processed_train['target'])[0][1] ))
    ax_index += 1

    # box-cox transformed
    ax = plt.subplot(36, 3, ax_index)
    trans_var, lambda_var = stats.boxcox(processed_train[columns[i]].dropna() + 1)
    trans_var = scale_minmax(trans_var)
    
    # plot transformed
    sns.distplot(trans_var, fit=stats.norm)
    ax.set_title('Transformed: ' + columns[i])
    ax_index += 1
    
    # prob plot
    ax = plt.subplot(36, 3, ax_index)
    stats.probplot(trans_var, plot=ax)
    ax.set_title('skew={:.4f}'.format(stats.skew(trans_var)))
    ax_index += 1
    
    # reg plot
    ax = plt.subplot(36, 3, ax_index)
    ax.plot(trans_var, processed_train['target'], '.', alpha=0.5)
    ax.set_title('corr={:.4f}'.format( np.corrcoef(trans_var, processed_train['target'])[0][1]))
    ax_index += 1

    plt.tight_layout()

從結果圖可以看到，在執行了box-cox 變換后，變量分布更接近正態分布，且與target字段的線性關系更明顯。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 【阿里雲-天池數據分析競賽】工業蒸汽量預測項目總結用戶貸款風險預測—數據探索天池_二手車交易價格預測數據分析數據挖掘實戰 - 天池新人賽o2o優惠券使用預測【天池大數據賽題解析】資金流入流出預測（附Top4答辯ppt）【項目實戰】天貓重復購買預測數據探索數據分析項目之：北京地區短租數據集分析及價格建模預測（天池大數據競賽）天池新人賽-天池新人實戰賽o2o優惠券使用預測（一）天池新人實戰賽之[離線賽]題目與思路基於XGBoost模型的幸福度預測——阿里天池學習賽