[TensorFlow 2.0] Feature Columns (feature_column)


Feature columns are typically used for feature engineering on structured data; image and text data generally do not need them.

I. Overview of Feature Columns

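Feature columns can convert categorical features into one-hot encoded features, turn continuous features into bucketized features, build crossed features from several features, and so on.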
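
The first two transformations can be sketched in plain Python. The helper names `one_hot` and `bucketize` are hypothetical and only imitate, in simplified form, what the corresponding feature columns compute; they are not TensorFlow's implementation. The age boundaries are the ones used in the example later in this article.

```python
from bisect import bisect_right

def one_hot(value, vocabulary):
    """Return a one-hot list for value; all zeros if value is not in vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

def bucketize(value, boundaries):
    """Return the bucket index of value for sorted boundaries.

    len(boundaries) + 1 buckets are produced; a value equal to a boundary
    falls into the bucket on its right, e.g. 18 goes to [18, 25).
    """
    return bisect_right(boundaries, value)

age_boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
num_buckets = len(age_boundaries) + 1          # 11 buckets

sex_vector = one_hot("female", ["male", "female"])
age_bucket = bucketize(22, age_boundaries)
age_vector = one_hot(age_bucket, range(num_buckets))

print(sex_vector)   # [0, 1]
print(age_bucket)   # 1
print(age_vector)   # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

A single numeric age thus becomes eleven binary features, one per bucket.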

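To create feature columns, call functions from the tf.feature_column module. The nine most commonly used functions in this module are listed below. All nine return either a CategoricalColumn or a DenseColumn object, except bucketized_column, which inherits from both classes.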

Note: every Categorical Column must ultimately be converted to a Dense Column, through indicator_column or embedding_column, before it can be passed to the model!
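
To illustrate the note: a categorical column only yields an integer category id, which has to be turned into a fixed-length vector of floats before a dense layer can consume it. The sketch below uses hypothetical helpers (`indicator` and `embedding`) and a randomly initialised table in place of learned weights; it shows the idea, not TensorFlow's implementation.

```python
import random

def indicator(category_id, num_categories):
    """Dense one-hot vector for a category id (what an indicator column conveys)."""
    return [1.0 if i == category_id else 0.0 for i in range(num_categories)]

def embedding(category_id, table):
    """Dense vector looked up from a table with one row per category."""
    return table[category_id]

num_categories, dimension = 5, 2
rng = random.Random(0)
# In a real model these weights are trainable; here they are just random numbers.
table = [[rng.uniform(-1, 1) for _ in range(dimension)]
         for _ in range(num_categories)]

dense_indicator = indicator(3, num_categories)
dense_embedding = embedding(3, table)

print(dense_indicator)        # [0.0, 0.0, 0.0, 1.0, 0.0]
print(len(dense_embedding))   # 2
```

The indicator vector grows with the number of categories, whereas the embedding keeps a fixed, smaller width, which is one reason to prefer an embedding when there are many categories.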

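  • numeric_column: numeric column, the most commonly used.
  • bucketized_column: bucketized column, built from a numeric column; turns one numeric column into several one-hot encoded features.
  • categorical_column_with_identity: categorical identity column, one-hot encoded; equivalent to a bucketized column in which every bucket holds exactly one integer.
  • categorical_column_with_vocabulary_list: categorical vocabulary column, one-hot encoded; the vocabulary is given as a list.
  • categorical_column_with_vocabulary_file: categorical vocabulary column; the vocabulary is given in a file.
  • categorical_column_with_hash_bucket: hashed column, used when the range of integers or the vocabulary is large.
  • indicator_column: indicator column, built from a Categorical Column; one-hot encoded.
  • embedding_column: embedding column, built from a Categorical Column; the embedding vectors are parameters that must be learned. A suggested embedding dimension is the fourth root of the number of categories.
  • crossed_column: crossed column, which can be built from any categorical columns except categorical_column_with_hash_bucket.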
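
Three of the items above involve a little arithmetic: hashing into a fixed number of buckets, crossing categorical features, and choosing an embedding dimension. A minimal sketch follows. `hash_bucket`, `cross` and `suggested_embedding_dim` are hypothetical helpers, the ticket string is only an illustrative value, and MD5 is used purely as a deterministic stand-in for the hash function TensorFlow uses internally, so the bucket ids will not match TensorFlow's.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map any value to a bucket id in [0, hash_bucket_size)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

def cross(values, hash_bucket_size):
    """Cross several categorical values by hashing their combination."""
    return hash_bucket("_X_".join(str(v) for v in values), hash_bucket_size)

def suggested_embedding_dim(num_categories):
    """Rule of thumb: the fourth root of the number of categories."""
    return max(1, round(num_categories ** 0.25))

ticket_id = hash_bucket("A/5 21171", 3)     # a ticket string hashed into 3 buckets
crossed_id = cross([1, 3], 15)              # (age bucket 1, pclass 3) into 15 buckets
cabin_dim = suggested_embedding_dim(32)     # 32 ** 0.25 is about 2.38

print(ticket_id, crossed_id, cabin_dim)
```

Different values can collide in the same bucket; a larger bucket count reduces collisions at the cost of a wider feature. With 32 hashed cabin buckets the rule of thumb gives a dimension of 2, which matches the embedding size used in the example below.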

II. Feature Column Example

The following is a complete example that uses feature columns to solve the Titanic survival problem.

import datetime
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers,models
 
 
# Print a timestamped log message
def printlog(info):
    nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print("\n"+"=========="*8 + "%s"%nowtime)
    print(info+'...\n\n')
 
 
 
# ================================================================================
# Step 1: Build the data pipeline
# ================================================================================
printlog("step1: prepare dataset...")
 
 
dftrain_raw = pd.read_csv("./data/titanic/train.csv")
dftest_raw = pd.read_csv("./data/titanic/test.csv")
 
dfraw = pd.concat([dftrain_raw,dftest_raw])
 
def prepare_dfdata(dfraw):
    dfdata = dfraw.copy()
    dfdata.columns = [x.lower() for x in dfdata.columns]
    dfdata = dfdata.rename(columns={'survived':'label'})
    dfdata = dfdata.drop(['passengerid','name'],axis = 1)
    for col,dtype in dict(dfdata.dtypes).items():
        # Check whether the column contains missing values
        if dfdata[col].hasnans:
            # Add an indicator column marking which rows were missing
            dfdata[col + '_nan'] = pd.isna(dfdata[col]).astype('int32')
            # Fill numeric columns with the mean and all other columns with ''
            # (np.object, np.str and np.unicode no longer exist in recent NumPy releases)
            if pd.api.types.is_numeric_dtype(dtype):
                dfdata[col] = dfdata[col].fillna(dfdata[col].mean())
            else:
                dfdata[col] = dfdata[col].fillna('')
    return dfdata
 
dfdata = prepare_dfdata(dfraw)
dftrain = dfdata.iloc[0:len(dftrain_raw),:]
dftest = dfdata.iloc[len(dftrain_raw):,:]
 
 
 
# Load data from a DataFrame into a tf.data.Dataset
def df_to_dataset(df, shuffle=True, batch_size=32):
    dfdata = df.copy()
    if 'label' not in dfdata.columns:
        ds = tf.data.Dataset.from_tensor_slices(dfdata.to_dict(orient = 'list'))
    else: 
        labels = dfdata.pop('label')
        ds = tf.data.Dataset.from_tensor_slices((dfdata.to_dict(orient = 'list'), labels))  
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dfdata))
    ds = ds.batch(batch_size)
    return ds
 
ds_train = df_to_dataset(dftrain)
ds_test = df_to_dataset(dftest)
# ================================================================================
# Step 2: Define the feature columns
# ================================================================================
printlog("step2: make feature columns...")
 
feature_columns = []
 
# Numeric columns
for col in ['age','fare','parch','sibsp'] + [
    c for c in dfdata.columns if c.endswith('_nan')]:
    feature_columns.append(tf.feature_column.numeric_column(col))
 
# Bucketized column
age = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(age, 
             boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)
 
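# Categorical columns
# Note: every Categorical Column must ultimately be converted to a Dense Column
# (via indicator_column or embedding_column) before it can be passed to the model!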
sex = tf.feature_column.indicator_column(
      tf.feature_column.categorical_column_with_vocabulary_list(
      key='sex',vocabulary_list=["male", "female"]))
feature_columns.append(sex)
 
pclass = tf.feature_column.indicator_column(
      tf.feature_column.categorical_column_with_vocabulary_list(
      key='pclass',vocabulary_list=[1,2,3]))
feature_columns.append(pclass)
 
ticket = tf.feature_column.indicator_column(
     tf.feature_column.categorical_column_with_hash_bucket('ticket',3))
feature_columns.append(ticket)
 
embarked = tf.feature_column.indicator_column(
      tf.feature_column.categorical_column_with_vocabulary_list(
      key='embarked',vocabulary_list=['S','C','Q']))  # Titanic embarkation ports are S, C and Q
feature_columns.append(embarked)
 
# Embedding column
cabin = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket('cabin',32),2)
feature_columns.append(cabin)
 
# Crossed column
pclass_cate = tf.feature_column.categorical_column_with_vocabulary_list(
          key='pclass',vocabulary_list=[1,2,3])
 
crossed_feature = tf.feature_column.indicator_column(
    tf.feature_column.crossed_column([age_buckets, pclass_cate],hash_bucket_size=15))
 
feature_columns.append(crossed_feature)
 
# ================================================================================
# Step 3: Define the model
# ================================================================================
printlog("step3: define model...")
 
tf.keras.backend.clear_session()
model = tf.keras.Sequential([
  layers.DenseFeatures(feature_columns), # pass the feature columns to tf.keras.layers.DenseFeatures
  layers.Dense(64, activation='relu'),
  layers.Dense(64, activation='relu'),
  layers.Dense(1, activation='sigmoid')
])
 
# ================================================================================
# Step 4: Train the model
# ================================================================================
printlog("step4: train model...")
 
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
 
history = model.fit(ds_train,
          validation_data=ds_test,
          epochs=10)
# ================================================================================
# Step 5: Evaluate the model
# ================================================================================
printlog("step5: eval model...")
 
model.summary()
 
 
# IPython/Jupyter magics; drop these two lines when running as a plain script
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
 
def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_'+metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation '+ metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_"+metric, 'val_'+metric])
    plt.show()
 
plot_metric(history,"accuracy")

Output:

================================================================================2020-04-13 02:29:07
step1: prepare dataset......



================================================================================2020-04-13 02:29:08
step2: make feature columns......



================================================================================2020-04-13 02:29:08
step3: define model......



================================================================================2020-04-13 02:29:08
step4: train model......


Epoch 1/10
23/23 [==============================] - 0s 21ms/step - loss: 0.7117 - accuracy: 0.6615 - val_loss: 0.5706 - val_accuracy: 0.7039
Epoch 2/10
23/23 [==============================] - 0s 3ms/step - loss: 0.5920 - accuracy: 0.7022 - val_loss: 0.6129 - val_accuracy: 0.6648
Epoch 3/10
23/23 [==============================] - 0s 3ms/step - loss: 0.6388 - accuracy: 0.7079 - val_loss: 0.5196 - val_accuracy: 0.7374
Epoch 4/10
23/23 [==============================] - 0s 3ms/step - loss: 0.5950 - accuracy: 0.7219 - val_loss: 0.5028 - val_accuracy: 0.7318
Epoch 5/10
23/23 [==============================] - 0s 3ms/step - loss: 0.5166 - accuracy: 0.7486 - val_loss: 0.4975 - val_accuracy: 0.7318
Epoch 6/10
23/23 [==============================] - 0s 3ms/step - loss: 0.5260 - accuracy: 0.7612 - val_loss: 0.5045 - val_accuracy: 0.7821
Epoch 7/10
23/23 [==============================] - 0s 3ms/step - loss: 0.4957 - accuracy: 0.7697 - val_loss: 0.4756 - val_accuracy: 0.7709
Epoch 8/10
23/23 [==============================] - 0s 3ms/step - loss: 0.4848 - accuracy: 0.7837 - val_loss: 0.4532 - val_accuracy: 0.8045
Epoch 9/10
23/23 [==============================] - 0s 3ms/step - loss: 0.4636 - accuracy: 0.8006 - val_loss: 0.4561 - val_accuracy: 0.7989
Epoch 10/10
23/23 [==============================] - 0s 3ms/step - loss: 0.4784 - accuracy: 0.7907 - val_loss: 0.4722 - val_accuracy: 0.7821

================================================================================2020-04-13 02:29:11
step5: eval model......


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_features (DenseFeature multiple                  64        
_________________________________________________________________
dense (Dense)                multiple                  3008      
_________________________________________________________________
dense_1 (Dense)              multiple                  4160      
_________________________________________________________________
dense_2 (Dense)              multiple                  65        
=================================================================
Total params: 7,297
Trainable params: 7,297
Non-trainable params: 0
_________________________________________________________________
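
The parameter counts in the summary can be checked by hand. The sketch below assumes that three `_nan` indicator columns were created; the 3,008 parameters reported for the first Dense layer imply 46 input features, which is consistent with that assumption. The other widths follow from the feature column definitions above.

```python
# Width contributed by each feature column to the DenseFeatures output.
widths = {
    "numeric (age, fare, parch, sibsp)": 4,
    "numeric (*_nan indicators, assumed 3)": 3,
    "age_buckets (10 boundaries -> 11 buckets)": 11,
    "sex": 2,
    "pclass": 3,
    "ticket (3 hash buckets)": 3,
    "embarked": 3,
    "cabin embedding (dimension 2)": 2,
    "age_buckets x pclass (15 hash buckets)": 15,
}
input_dim = sum(widths.values())

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

params = {
    "dense_features": 32 * 2,                 # embedding table: 32 buckets x 2
    "dense": dense_params(input_dim, 64),
    "dense_1": dense_params(64, 64),
    "dense_2": dense_params(64, 1),
}
total = sum(params.values())

print(input_dim)  # 46
print(params)     # {'dense_features': 64, 'dense': 3008, 'dense_1': 4160, 'dense_2': 65}
print(total)      # 7297
```

The only trainable weights inside DenseFeatures are the cabin embedding table; every other feature column is a fixed transformation.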

 

References:

Open-source e-book: https://lyhue1991.github.io/eat_tensorflow2_in_30_days/

GitHub project: https://github.com/lyhue1991/eat_tensorflow2_in_30_days

 

