Automated Feature Engineering: Featuretools

Reposted article. 2019-07-08 15:52. Category: Machine Learning Practice / Tools and Tips

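Featuretools is a Python library for automated feature engineering. Given several data tables and the relationships between them, it automatically generates new features through transformation and aggregation operations. A transformation operates on one or more columns of a single table (for example, taking the absolute value of a column or computing the difference of two columns). An aggregation operates on two tables in a parent-child (one-to-many) relationship: the rows of the child table are grouped (groupby) by the parent table's key, and statistics of a child column are computed for each parent row. A few simple examples follow; for real-world use cases of Featuretools, see its GitHub repository.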
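To make the two operation types concrete, here is a minimal sketch in plain pandas, without Featuretools. The tables and the `fee` column are hypothetical and exist only for illustration; this shows the idea behind the two operations, not how Featuretools implements them internally.

```python
import pandas as pd

# Hypothetical toy tables: customers is the parent, transactions is the child.
customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [3.0, -8.0, 6.0, 4.0, -9.0],
    "fee": [1.0, 1.0, 2.0, 1.0, 3.0],  # illustrative column, not in the article
})

# Transformation: works row by row on columns of a single table.
transactions["abs_amount"] = transactions["amount"].abs()
transactions["amount_minus_fee"] = transactions["amount"] - transactions["fee"]

# Aggregation: group the child rows by the parent key and compute statistics.
agg = (
    transactions.groupby("customer_id")["abs_amount"]
    .agg(["sum", "max"])
    .add_prefix("abs_amount_")
    .reset_index()
)
customer_features = customers.merge(agg, on="customer_id", how="left")
print(customer_features)
```

Each row of `customer_features` is one parent row enriched with statistics of its child rows, which is the shape of output that deep feature synthesis produces for a target table.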

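1. Customer transaction records: each transaction belongs to one customer and may be paid in several installments (the prediction problem is about customers)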

Build the data

import featuretools as ft
import pandas as pd
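# Note: this article is written against the pre-1.0 Featuretools API;
# releases from 1.0 on renamed calls such as entity_from_dataframe and target_entity.
# Build simple data tables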
customers = pd.DataFrame({'customer_id':[1,2],})
transactions = pd.DataFrame({'transaction_id':[1,2,3,4,5], 'customer_id':[1,1,1,2,2], \
                             'amount':[3.,8.,6.,4.,9.]})
payments = pd.DataFrame({'payment_id':[1,2,3,4,5,6,7,8], 'transaction_id':[1,1,2,3,3,4,4,5], \
                         'money':[3,7,6,5,8,2,4,7]})
# Define the relationships between the tables (parent column first, child column second)
es = ft.EntitySet('example1')
es.entity_from_dataframe(dataframe=payments, entity_id='payments', index='payment_id')
es.entity_from_dataframe(dataframe=transactions, entity_id='transactions', index='transaction_id')
es.entity_from_dataframe(dataframe=customers, entity_id='customers', index='customer_id')
r1 = ft.Relationship(es['customers']['customer_id'], es['transactions']['customer_id'])
r2 = ft.Relationship(es['transactions']['transaction_id'], es['payments']['transaction_id'])
es = es.add_relationship(r1)
es = es.add_relationship(r2)
print(es)


Generate new features

# Define custom primitives
# Featuretools ships with many common built-in primitives; custom ones are defined here only to show more of its features
def plusOne(column): return column+1 
plus_one = ft.primitives.make_trans_primitive(function=plusOne, input_types=[ft.variable_types.Numeric],\
                                              return_type=ft.variable_types.Numeric)
def maximum(column): return max(column)
Maximum = ft.primitives.make_agg_primitive(function=maximum, input_types=[ft.variable_types.Numeric], \
                                           return_type=ft.variable_types.Numeric)
# max_depth limits how many transformations and aggregations can be stacked
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers", trans_primitives=[plus_one], \
                                      agg_primitives=["sum", Maximum], max_depth=3)
print(feature_defs)


Take the feature SUM(transactions.PLUSONE(MAXIMUM(payments.money))) as an example. The figure in the original post shows how this feature is computed for the customer with customer_id=1.
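The same computation can be traced by hand in plain pandas. The sketch below rebuilds the example tables and applies the three stacked steps from the inside out; it is an illustration of what the feature name means, assuming the primitives behave as their names suggest, and not the code Featuretools runs.

```python
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id": [1, 1, 1, 2, 2],
})
payments = pd.DataFrame({
    "payment_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "transaction_id": [1, 1, 2, 3, 3, 4, 4, 5],
    "money": [3, 7, 6, 5, 8, 2, 4, 7],
})

# Step 1, MAXIMUM(payments.money): largest payment of each transaction.
max_money = payments.groupby("transaction_id")["money"].max()

# Step 2, PLUSONE(...): add one to each per-transaction value.
plus_one = max_money + 1

# Step 3, SUM(transactions. ...): sum the per-transaction values per customer.
per_transaction = transactions.set_index("transaction_id")[["customer_id"]].copy()
per_transaction["value"] = plus_one  # aligned on the transaction_id index
feature = per_transaction.groupby("customer_id")["value"].sum()
print(feature)
```

For customer 1 the per-transaction maxima are 7, 6 and 8, so the feature value is (7+1) + (6+1) + (8+1) = 24; for customer 2 it is (4+1) + (7+1) = 13.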

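2. Customer transaction records: each transaction belongs to one customer and may be paid in several installments (the prediction problem is about transactions)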

Same as above; only one line of code changes:

feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="transactions", trans_primitives=[plus_one], \
                                      agg_primitives=["sum", Maximum], max_depth=3)

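Take the feature customers.PLUSONE(SUM(payments.money)) as an example. The figure in the original post shows how this feature is computed for the transaction with transaction_id=1.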
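A hand computation in plain pandas may help here as well. This sketch rests on an assumption about the feature name: that SUM(payments.money), evaluated on the customers table, means the sum of all payments belonging to that customer's transactions, and that the `customers.` prefix copies the parent's value down to each of its transactions. If the library resolves the name differently, the numbers below will not match its output.

```python
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id": [1, 1, 1, 2, 2],
})
payments = pd.DataFrame({
    "payment_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "transaction_id": [1, 1, 2, 3, 3, 4, 4, 5],
    "money": [3, 7, 6, 5, 8, 2, 4, 7],
})

# SUM(payments.money) at the customer level (assumed meaning, see lead-in):
# attach each payment to its customer, then sum per customer.
pay = payments.merge(transactions, on="transaction_id")
customer_sum = pay.groupby("customer_id")["money"].sum()

# PLUSONE(...): add one to the per-customer sum.
customer_plus_one = customer_sum + 1

# customers.<feature>: every transaction inherits its customer's value.
transactions["feature"] = transactions["customer_id"].map(customer_plus_one)
print(transactions)
```

Under that assumption, transaction 1 belongs to customer 1, whose payments sum to 3+7+6+5+8 = 29, so the feature value is 30.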

3. Customer transaction records: each transaction belongs to one customer and one product (the prediction problem is about customers)

Build the data

customers = pd.DataFrame({'customer_id':[1,2],})
transactions = pd.DataFrame({'transaction_id':[1,2,3,4,5], 'customer_id':[1,1,1,2,2], \
                             'amount':[3.,8.,6.,4.,9.], 'product_id':[1,2,3,1,2]})
products = pd.DataFrame({'product_id':[1,2,3]})
# Define the relationships between the tables (parent column first, child column second)
es = ft.EntitySet('example')
es.entity_from_dataframe(dataframe=products, entity_id='products', index='product_id')
es.entity_from_dataframe(dataframe=transactions, entity_id='transactions', index='transaction_id')
es.entity_from_dataframe(dataframe=customers, entity_id='customers', index='customer_id')
r1 = ft.Relationship(es['customers']['customer_id'], es['transactions']['customer_id'])
r2 = ft.Relationship(es['products']['product_id'], es['transactions']['product_id'])
es = es.add_relationship(r1)
es = es.add_relationship(r2)
print(es)


Generate new features

def plusOne(column): return column+1 
plus_one = ft.primitives.make_trans_primitive(function=plusOne, input_types=[ft.variable_types.Numeric],\
                                              return_type=ft.variable_types.Numeric)
def maximum(column): return max(column)
Maximum = ft.primitives.make_agg_primitive(function=maximum, input_types=[ft.variable_types.Numeric], \
                                           return_type=ft.variable_types.Numeric)
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers", trans_primitives=[plus_one], \
                                      agg_primitives=["sum", Maximum], max_depth=3)
print(feature_defs)


Take the feature SUM(transactions.products.MAXIMUM(transactions.amount)) as an example. The figure in the original post shows how this feature is computed for the customer with customer_id=1.
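Read from the inside out, this feature first takes the largest transaction amount per product, then hands that value back to every transaction of the product, and finally sums it per customer. The following plain-pandas sketch traces those steps on the example data; it illustrates the meaning of the feature name and is not Featuretools' own implementation.

```python
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [3.0, 8.0, 6.0, 4.0, 9.0],
    "product_id": [1, 2, 3, 1, 2],
})

# products.MAXIMUM(transactions.amount): largest amount ever paid for each product.
product_max = transactions.groupby("product_id")["amount"].max()

# transactions.products.<...>: each transaction inherits its product's maximum.
transactions["product_max"] = transactions["product_id"].map(product_max)

# SUM(transactions. ...): sum the inherited values over each customer's transactions.
feature = transactions.groupby("customer_id")["product_max"].sum()
print(feature)
```

Customer 1 bought products 1, 2 and 3, whose maxima are 4, 9 and 6, so the feature value is 19. Note that the maximum for product 1 comes from a transaction made by customer 2: the feature mixes in information from other customers through the shared product table.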

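An important feature of Featuretools is that it can automatically take time into account when building features, which prevents data leakage. The following simple example illustrates this. As above, the data are customer transaction records in which each transaction belongs to one customer and one product, but now the prediction problem concerns each customer's situation at a specific point in time.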

Build the data

import featuretools as ft
import pandas as pd
# Build the transactions table
transactions = pd.DataFrame({'transaction_id':[1,2,3,4,5,6], 'customer_id':[1,1,1,2,3,3], 'product_id':[1,2,1,1,2,2], \
                             'time':[pd.Timestamp('1/1/2019')+pd.Timedelta(x,'h') for x in [1,2,3,4,5,6]], \
                             'amount':[3., 8., 10., 4., 12., 9]}) # a transaction time column is added
products = pd.DataFrame({'product_id':[1,2]})
# For each customer, define the time at which the prediction is made (the cutoff time)
cutoff_times = pd.DataFrame({'customer_id':[1,2,3],'time':[pd.Timestamp('1/1/2019')+pd.Timedelta(x,'h') for x in [2,4,6]]})
# Derive new tables from the original table and set up the relationships
es = ft.EntitySet('example')
es.entity_from_dataframe(dataframe=transactions, entity_id='transactions', index='transaction_id', time_index='time')
es.normalize_entity(base_entity_id='transactions', new_entity_id='customers',index='customer_id')
es.normalize_entity(base_entity_id='transactions', new_entity_id='products',index='product_id')
print(es)


Generate new features

feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers", agg_primitives=["max","sum"], \
                                      max_depth=3, cutoff_time=cutoff_times) # the cutoff_time argument is added
print(feature_defs)


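The figure in the original post uses the feature SUM(transactions.products.MAX(transactions.amount)) as an example to show how time is taken into account when the feature is built.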
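The effect of the cutoff time can be reproduced by hand: for each customer, only transactions made up to that customer's cutoff are visible, both for the customer's own rows and for the per-product maximum. The sketch below does this in plain pandas. It assumes that a row stamped exactly at the cutoff is still visible, which is believed to be the Featuretools default but should be treated as an assumption; `feature_at` is a hypothetical helper written for this illustration.

```python
import pandas as pd

start = pd.Timestamp("2019-01-01")
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5, 6],
    "customer_id": [1, 1, 1, 2, 3, 3],
    "product_id": [1, 2, 1, 1, 2, 2],
    "time": [start + pd.Timedelta(hours=h) for h in [1, 2, 3, 4, 5, 6]],
    "amount": [3.0, 8.0, 10.0, 4.0, 12.0, 9.0],
})
cutoff_times = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time": [start + pd.Timedelta(hours=h) for h in [2, 4, 6]],
})

def feature_at(customer_id, cutoff):
    """SUM(transactions.products.MAX(transactions.amount)) using only rows up to cutoff."""
    visible = transactions[transactions["time"] <= cutoff]  # assumption: cutoff is inclusive
    product_max = visible.groupby("product_id")["amount"].max()
    own = visible[visible["customer_id"] == customer_id]
    return own["product_id"].map(product_max).sum()

with_cutoff = {
    row.customer_id: feature_at(row.customer_id, row.time)
    for row in cutoff_times.itertuples(index=False)
}
# Same feature for customer 1 if every transaction were visible (no cutoff).
no_cutoff_customer_1 = feature_at(1, transactions["time"].max())
print(with_cutoff, no_cutoff_customer_1)
```

With the cutoff at 02:00, customer 1 sees only transactions 1 and 2, so the value is 3 + 8 = 11. Without a cutoff the same feature would be 10 + 12 + 10 = 32, because later transactions (including those of other customers) would leak into the per-product maxima.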
