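Preface:
This is a hands-on beginner competition on Tianchi. The original page is https://tianchi.aliyun.com/getStart/information.htm?spm=5176.100067.5678.2.e1321db7ydQmSB&raceId=231593 . The sections below cover the following steps.
1. Data preprocessing
2. Feature selection
3. Algorithm description
4. Result analysis
5. Miscellaneous
Part 1: Data preprocessing
The raw data can be downloaded from the link above. It comes as .csv files, which can be processed with pandas.
For example: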
import pandas as pd

dfoff = pd.read_csv('ccf_offline_stage1_train.csv', keep_default_na=False)
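The keep_default_na parameter defaults to True. When it is True, the string 'null' in the file is read as NaN, so a check such as dfoff['Date'] != 'null' no longer works. Setting keep_default_na=False here keeps 'null' as a plain string, so it can be compared with == and !=.
We need to relate coupons to purchases in order to derive the Label.
There are the following 4 combinations:
With coupon, purchased: number of rows
Without coupon, purchased: number of rows
With coupon, not purchased: number of rows
Without coupon, not purchased: number of rows
The code is as follows: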
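The effect of keep_default_na can be checked on a tiny in-memory file. The sketch below is only an illustration: the CSV text and the User_id column are made up, and only the column names Date_received and Date come from the data described here.

```python
import io
import pandas as pd

# Toy CSV that mimics two of the competition columns; the values are invented.
csv_text = "User_id,Date_received,Date\n1,20160101,null\n2,null,20160105\n"

# Default behaviour: the string 'null' is parsed as NaN.
df_default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: 'null' stays a plain string.
df_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

nan_count = int(df_default['Date'].isna().sum())
null_str_count = int((df_raw['Date'] == 'null').sum())

# With the default, no cell equals the string 'null' any more,
# so the != 'null' filter keeps every row, including the missing one.
all_differ = bool((df_default['Date'] != 'null').all())

print(nan_count, null_str_count, all_differ)
```

With the default setting the != 'null' filter silently keeps the missing row, which is why the string comparison approach needs keep_default_na=False.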
print('With coupon, purchased:', dfoff[(dfoff['Date_received'] != 'null') & (dfoff['Date'] != 'null')].shape[0])
print('Without coupon, purchased:', dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] != 'null')].shape[0])
print('With coupon, not purchased:', dfoff[(dfoff['Date_received'] != 'null') & (dfoff['Date'] == 'null')].shape[0])
print('Without coupon, not purchased:', dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] == 'null')].shape[0])
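The text does not spell out how the four combinations become a Label, so the following is only one possible sketch. It assumes a labelling rule in which a coupon counts as used only if the purchase happens within a fixed window after receipt. The function names make_label and parse_day and the 15-day default are assumptions for illustration, not something stated in this article.

```python
from datetime import date

def parse_day(s):
    """Parse a 'YYYYMMDD' string into a date object."""
    return date(int(s[0:4]), int(s[4:6]), int(s[6:8]))

def make_label(date_received, date_bought, window_days=15):
    """Hypothetical labelling rule; window_days=15 is an assumption."""
    if date_received == 'null':
        return -1  # no coupon: not a sample for coupon-use prediction
    if date_bought == 'null':
        return 0   # coupon received but never used
    gap = (parse_day(date_bought) - parse_day(date_received)).days
    return 1 if gap <= window_days else 0

# One made-up row per situation.
labels = [
    make_label('null', '20160105'),      # purchase without coupon
    make_label('20160101', 'null'),      # coupon, no purchase
    make_label('20160101', '20160110'),  # coupon used after 9 days
    make_label('20160101', '20160201'),  # coupon used after 31 days
]
print(labels)
```

Applied to a DataFrame, such a function would typically be called row by row, for example with dfoff.apply(lambda r: make_label(r['Date_received'], r['Date']), axis=1).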
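Some discounts in the file are given in "spend X, get Y off" form and need to be converted to a discount rate; the distance to the store needs to be converted to a number, and so on.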
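from datetime import date

def convertRate(row):
    # 'null' means no coupon, so the price is unchanged (rate 1.0).
    if row == 'null':
        return 1.0
    elif ':' in row:
        # 'X:Y' means "spend X, get Y off", so the rate is 1 - Y/X.
        rows = row.split(':')
        return 1.0 - float(rows[1]) / float(rows[0])
    else:
        # Already a rate such as '0.95'.
        return float(row)

def getDiscountMan(row):
    # Spending threshold X of an 'X:Y' coupon, 0 otherwise.
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0

def getDiscountJian(row):
    # Reduction Y of an 'X:Y' coupon, 0 otherwise.
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0

def getWeekday(row):
    # Map 'YYYYMMDD' to 1 (Monday) .. 7 (Sunday); 'null' is passed through.
    if row == 'null':
        return row
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1

def processData(df):
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    # getDiscountType is not defined in this post; it must be defined before processData is called.
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    print(df['discount_rate'].unique())
    # Missing distances are encoded as -1.
    df['distance'] = df['Distance'].replace('null', -1).astype(int)
    return df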
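Calling dfoff = processData(dfoff) formats the information above.
Note the apply() function in the code. apply() is one of the most flexible functions in pandas. The calls above use the Series version, which passes one cell value at a time to the function. The DataFrame version has the following signature (as documented in older pandas releases; the broadcast and reduce parameters were removed in later versions):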
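As a worked example of the conversion, a coupon '150:20' ("spend 150, get 20 off") becomes the rate 1 - 20/150, which is about 0.8667. The sketch below repeats convertRate so that it runs on its own and also includes a getDiscountType, which processData calls but which is not defined in this article; the version shown is a guess at its intent (1 for "spend X, get Y off" coupons, 0 for direct rates, 'null' for no coupon) and should be treated as hypothetical.

```python
def convertRate(row):
    # Same logic as in the article: 'X:Y' -> 1 - Y/X, 'null' -> 1.0.
    if row == 'null':
        return 1.0
    elif ':' in row:
        rows = row.split(':')
        return 1.0 - float(rows[1]) / float(rows[0])
    else:
        return float(row)

def getDiscountType(row):
    # Hypothetical definition, not taken from the article.
    if row == 'null':
        return 'null'
    return 1 if ':' in row else 0

samples = ['150:20', '0.95', 'null']  # made-up Discount_rate values
rates = [round(convertRate(s), 4) for s in samples]
types = [getDiscountType(s) for s in samples]
print(rates)
print(types)
```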
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
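The difference between Series.apply and DataFrame.apply is easiest to see on a toy frame. The sketch below uses made-up values; only the column names Discount_rate and Distance come from the data set, and the two lambda functions are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Discount_rate': ['150:20', '0.9', 'null'],
    'Distance': ['1', 'null', '10'],
})

# Series.apply: the function receives one cell value at a time.
has_threshold = df['Discount_rate'].apply(lambda v: ':' in v)

# DataFrame.apply with axis=1: the function receives one row (a Series) at a time.
both_known = df.apply(
    lambda r: r['Discount_rate'] != 'null' and r['Distance'] != 'null',
    axis=1,
)

print(has_threshold.tolist())
print(both_known.tolist())
```

With axis=0, the default, the function would instead receive one column at a time.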
Processing the dates on which coupons were received:
date_received = dfoff['Date_received'].unique()  # .unique() drops duplicate values
date_received = sorted(date_received[date_received != 'null'])  # sort after dropping 'null'
print('Coupons were received from', date_received[0], 'to', date_received[-1])  # earliest and latest date
The purchase dates are handled in the same way:
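date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[date_buy != 'null'])
# Equivalent alternative that sorts all non-null dates directly (it overwrites the result above)
date_buy = sorted(dfoff[dfoff['Date'] != 'null']['Date'])
print('Purchases were made from', date_buy[0], 'to', date_buy[-1])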
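Sorting the dates as strings works because the 'YYYYMMDD' format is fixed-width and ordered from most to least significant field, so lexicographic order matches chronological order. A minimal check on made-up dates:

```python
from datetime import datetime

dates = ['20160615', '20160101', '20160229', '20160101']  # invented sample

# Sort as plain strings, as the article does.
as_strings = sorted(set(dates))

# Sort by the parsed calendar date for comparison.
as_dates = sorted(set(dates), key=lambda s: datetime.strptime(s, '%Y%m%d'))

same_order = as_strings == as_dates
print(as_strings[0], as_strings[-1], same_order)
```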
Plot the coupons issued against the coupons used:
import matplotlib.pyplot as plt
import seaborn as sns

# Number of coupons received per day
couponbydate = dfoff[dfoff['Date_received'] != 'null'][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received','count']
# Number of received coupons that were later used, per day of receipt
buybydate = dfoff[(dfoff['Date'] != 'null') & (dfoff['Date_received'] != 'null')][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
buybydate.columns = ['Date_received','count']
sns.set_style('ticks')
sns.set_context("notebook", font_scale= 1.4)
plt.figure(figsize = (12,8))
date_received_dt = pd.to_datetime(date_received, format='%Y%m%d')
# Upper panel: received vs. used counts on a log scale
plt.subplot(211)
plt.bar(date_received_dt, couponbydate['count'], label = 'number of coupon received' )
plt.bar(date_received_dt, buybydate['count'], label = 'number of coupon used')
plt.yscale('log')
plt.ylabel('Count')
plt.legend()
# Lower panel: usage ratio per day of receipt
plt.subplot(212)
plt.bar(date_received_dt, buybydate['count']/couponbydate['count'])
plt.ylabel('Ratio(coupon used/coupon received)')
plt.tight_layout()
plt.show()
This produces a figure:
Part 2: Feature selection
Part 3: Algorithm description
Part 4: Result analysis
Part 5: Miscellaneous
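The per-date counts behind the plot can be checked on a toy frame. The plotting code assumes that every receipt date also has at least one used coupon, so that the two count tables line up row by row. The sketch below is an alternative under that caveat: it aligns the two tables explicitly with reindex, which may not be needed for the real data. All values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Date_received': ['20160101', '20160101', '20160102', '20160103', 'null'],
    'Date':          ['20160105', 'null',     'null',     '20160104', '20160106'],
})

received = df[df['Date_received'] != 'null']
used = received[received['Date'] != 'null']

# Coupons received per date.
received_count = received.groupby('Date_received').size()

# Coupons used per receipt date; dates without any used coupon get a count of 0.
used_count = (
    used.groupby('Date_received').size()
    .reindex(received_count.index, fill_value=0)
)

ratio = used_count / received_count
print(ratio.to_dict())
```

Here 20160102 has one coupon received and none used, so its ratio is 0 rather than being dropped from the table.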
