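This was my first time entering a Tianchi newcomer competition. My main goals were to check what I had learned in machine learning so far and to put it into practice end to end: to see how well I had mastered the material and whether I could carry an analysis through from start to finish, as a foundation for further study.
My score was not good this time; I mainly got familiar with the overall machine learning workflow. My takeaways are as follows:
Step 1: Understand the problem statement and analyze the competition data.
Step 2: Design the feature engineering. This part is very important; good feature engineering can greatly improve model accuracy.
Step 3: Train the algorithm, keeping the training set and the test set separate.
Step 4: Test the model and see how well it performs.
The problem is available on the official Tianchi site, which includes the problem description, the competition data, and so on: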
https://tianchi.aliyun.com/getStart/introduction.htm?spm=5176.11165418.333.1.3c2e613cd1CCDk&raceId=231593
The code follows:
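import numpy as np
import pandas as pd

# Load the data.
# keep_default_na=False keeps the literal string 'null' instead of converting it
# to NaN, because the code below compares missing values against 'null'.
train_online = pd.read_csv('ccf_online_stage1_train.csv', keep_default_na=False)
train_offline = pd.read_csv('ccf_offline_stage1_train.csv', keep_default_na=False)
test = pd.read_csv('ccf_offline_stage1_test_revised.csv', keep_default_na=False)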
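# Concatenate the training and test sets so they can be processed together.
# Both are offline data. ignore_index=True gives the result a clean 0..n-1 index.
all_offline = pd.concat([train_offline, test], ignore_index=True)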
# Count the missing values in each column
all_offline.isnull().sum()

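# Fill missing Date values with 'null' so that all missing values share one format
all_offline['Date'] = all_offline['Date'].fillna('null')

# Merge the online and offline data
pd.merge(all_offline, train_online, on=['Merchant_id', 'User_id'])

# The merge shows that the two have no overlap, and the task requires predicting
# from offline data only, so the online data is dropped and only offline data is used.

# Label the positive and negative samples as the task requires:
#  1 = coupon received and used, -1 = coupon received but not used, 0 = no coupon
def is_used(column):
    if column['Date'] != 'null' and column['Coupon_id'] != 'null':
        return 1
    elif column['Date'] == 'null' and column['Coupon_id'] != 'null':
        return -1
    else:
        return 0

all_offline['is_used'] = all_offline.apply(is_used, axis=1)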
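The labeling rule above can also be written without a row-wise `apply`. The following is only a sketch on a made-up three-row frame (the `toy` data is hypothetical, not from the competition files), using `numpy.select` to express the same three cases:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; 'null' marks a missing value, as in the code above.
toy = pd.DataFrame({
    "Coupon_id": ["1001", "1002", "null"],
    "Date": ["20160520", "null", "20160521"],
})

has_coupon = toy["Coupon_id"] != "null"
has_purchase = toy["Date"] != "null"

# 1: coupon used, -1: coupon received but not used, 0: no coupon
toy["is_used"] = np.select(
    [has_coupon & has_purchase, has_coupon & ~has_purchase],
    [1, -1],
    default=0,
)
print(toy)
```

On a frame with many rows, the vectorized form is usually much faster than calling a Python function once per row.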
# The specific value of Coupon_id carries little meaning,
# so convert it into a flag: does the row have a coupon or not
def has_coup(x):
    if x['Coupon_id'] != 'null':
        return 1
    else:
        return 0

all_offline['has_coup'] = all_offline.apply(has_coup, axis=1)

# Discount_rate has a special format such as "150:20", which an algorithm cannot use directly.
# In practice the size of the discount affects how often a coupon is used,
# so Discount_rate needs to be converted into a numeric discount percentage.
import re
regex = re.compile(r'^\d+:\d+$')

def discount_percent(y):
    if y['Discount_rate'] == 'null' and y['Date_received'] == 'null':
        return 'null'
    elif re.match(regex, y['Discount_rate']):
        # "x:y" means spend x, get y off; return y / x
        num_min, num_max = y['Discount_rate'].split(':')
        return float(num_max) / float(num_min)
    else:
        # Plain rates such as "0.95" are passed through unchanged
        return y['Discount_rate']

all_offline['discount_percent'] = all_offline.apply(discount_percent, axis=1)
# Thinking further: if the size of the discount affects the chance a coupon is used,
# then for "x:y" (spend x, get y off) coupons the threshold x must matter as well.
# Extract the spending threshold x.
def discount_limit(y):
    if y['Discount_rate'] == 'null' and y['Date_received'] == 'null':
        return 'null'
    elif re.match(regex, y['Discount_rate']):
        num_min, num_max = y['Discount_rate'].split(':')
        return num_min
    else:
        return 0

all_offline['discount_limit'] = all_offline.apply(discount_limit, axis=1)
all_offline.head(10)
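The two functions above return `y / x` for "x:y" coupons but leave plain rates such as "0.95" untouched, so the two kinds of value are not on the same scale. As a possible alternative (not what the code above does), here is a sketch that puts both on one scale, assuming a plain rate like 0.95 means "pay 95% of the price". The helper name `parse_discount_rate` is my own invention for illustration:

```python
def parse_discount_rate(raw):
    """Return (threshold, rate) for one Discount_rate string.

    Assumption: a plain value such as "0.95" means the buyer pays 95% of
    the price, so "x:y" is converted to the comparable rate 1 - y / x.
    """
    if raw == "null":
        return None, None
    if ":" in raw:
        threshold, reduction = (float(part) for part in raw.split(":"))
        return threshold, 1.0 - reduction / threshold
    # A plain rate has no spending threshold
    return 0.0, float(raw)


print(parse_discount_rate("150:20"))
print(parse_discount_rate("0.95"))
print(parse_discount_rate("null"))
```

With this convention a smaller rate always means a bigger discount, whichever format the coupon used.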
# The task asks for the probability that a coupon is used within 15 days of being received.
# So, building on is_used, compare the receive date Date_received with the
# purchase date Date to decide whether the coupon was used within 15 days.
import datetime

# Flag coupons that were used within 15 days
def used_in_15days(z):
    if z['is_used'] == 1 and z['Date'] != 'null' and z['Date_received'] != 'null':
        days = (datetime.datetime.strptime(z['Date'], "%Y%m%d")
                - datetime.datetime.strptime(z['Date_received'], "%Y%m%d"))
        if days.days < 15:
            return 1
        else:
            return 0
    else:
        return 0

all_offline['used_in_15days'] = all_offline.apply(used_in_15days, axis=1)
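To make the date comparison concrete, here is a small standalone sketch of the same `strptime` subtraction on two example dates. The helper `days_between` and the sample dates are made up for illustration:

```python
from datetime import datetime


def days_between(received, used, fmt="%Y%m%d"):
    """Number of whole days from the receive date to the purchase date."""
    return (datetime.strptime(used, fmt) - datetime.strptime(received, fmt)).days


gap = days_between("20160528", "20160613")
print(gap)       # 16
print(gap < 15)  # False: this coupon would not count as used within 15 days
```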
# Now look at how the two features discount_percent and discount_limit are distributed
all_offline['discount_percent'].value_counts()
all_offline['discount_limit'].value_counts()
# Bin discount_percent
def discount_percent_layer(columns):
    if columns['discount_percent'] == 'null':
        return 'null'

    # Use a local variable instead of writing back into the row
    percent = float(columns['discount_percent'])
    if percent <= 0.1:
        return 0.1
    elif percent <= 0.2:
        return 0.2
    elif percent <= 0.3:
        return 0.3
    elif percent <= 0.4:
        return 0.4
    else:
        return 0.5

all_offline['discount_percent_layer'] = all_offline.apply(discount_percent_layer, axis=1)
all_offline['discount_percent_layer'].value_counts()
# Bin discount_limit
def discount_limit_layer(columns):
    if columns == 'null':
        return 'null'

    columns = int(columns)
    if columns <= 10:
        return 10
    elif columns <= 20:
        return 20
    elif columns <= 30:
        return 30
    elif columns <= 50:
        return 50
    elif columns <= 100:
        return 100
    elif columns <= 200:
        return 200
    else:
        return 300

all_offline['discount_limit_layer'] = all_offline['discount_limit'].apply(discount_limit_layer)
all_offline['discount_limit_layer'].value_counts()
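The if/elif chains above can probably be replaced by `pandas.cut`. This is a sketch on made-up numeric values; it assumes the 'null' rows have already been filtered out or are handled separately, since `pd.cut` only works on numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical spending thresholds, already converted to numbers
limits = pd.Series([5, 10, 11, 150, 500])

# Same bin edges as discount_limit_layer; each bin is (lower, upper]
edges = [-np.inf, 10, 20, 30, 50, 100, 200, np.inf]
labels = [10, 20, 30, 50, 100, 200, 300]

layers = pd.cut(limits, bins=edges, labels=labels).astype(int)
print(layers.tolist())  # [10, 10, 20, 200, 300]
```

Keeping the edges and labels in two lists makes it easy to try different bin boundaries without rewriting the function.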
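Summary:
At this point Coupon_id has been turned into has_coup (1 means a coupon was received, 0 means no coupon was received).
Date and Date_received have been turned into used_in_15days, which indicates whether the coupon was used within 15 days.
Discount_rate has been turned into discount_percent (the discount percentage) and discount_limit (the spending threshold).
Merchant_id and User_id are identifier values and need no processing.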
# Distance is the only column left; look at its distribution
all_offline['Distance'].value_counts()
# Save the data so it is easy to reuse later
import os
os.makedirs('output', exist_ok=True)  # make sure the output folder exists

train_finall, test_finall = all_offline[:train_offline.shape[0]], all_offline[train_offline.shape[0]:]
all_offline.to_csv('output/all_offline.csv')
train_finall.to_csv('output/train_finall.csv')
test_finall.to_csv('output/test_finall.csv')
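# One-hot encoding
# Caution: is_used is kept as a feature here. It is derived from Date, which the
# test set does not have, so it is -1 for every test row and leaks the label in training.
all_offline_new = all_offline.drop(
    ['Coupon_id', 'Date', 'Date_received', 'Discount_rate', 'Merchant_id',
     'User_id', 'discount_percent', 'discount_limit'], axis=1)
all_offline_new = pd.get_dummies(all_offline_new)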
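To show what `pd.get_dummies` does in this step, here is a sketch on a made-up frame (the `toy` data is hypothetical). Text columns are expanded into one indicator column per distinct value, including the 'null' marker, while numeric columns pass through unchanged:

```python
import pandas as pd

# Hypothetical toy frame: one text column and one numeric column
toy = pd.DataFrame({
    "Distance": ["0", "1", "null"],
    "is_used": [1, -1, 0],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
# ['is_used', 'Distance_0', 'Distance_1', 'Distance_null']
```

Encoding the training and test rows together, as the code above does, means both end up with the same set of indicator columns.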
# Split the data back into the training set and the test set
train01, test01 = all_offline_new[:len(train_offline)], all_offline_new[len(train_offline):]

# Drop the rows where no coupon was received
train02 = train01[train01['has_coup'] == 1]

# Every remaining row has a coupon, so the has_coup column is no longer needed
train02 = train02.drop(['has_coup'], axis=1)
test01 = test01.drop(['has_coup'], axis=1)

x_train = train02.drop(['used_in_15days'], axis=1)
y_train = train02['used_in_15days']
x_text = test01.drop(['used_in_15days'], axis=1)
# Build the model
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(x_train, y_train)

# Predict with the model
predict = clf.predict(x_text)

result = pd.read_csv('ccf_offline_stage1_test_revised.csv')
# ravel() flattens the predictions to 1-D whether y_train was a Series or a DataFrame
result['probability'] = predict.ravel()

result = result.drop(['Merchant_id', 'Discount_rate', 'Distance'], axis=1)

# Some of the final predictions are negative; set them to 0
result['probability'] = result['probability'].clip(lower=0)

result.to_csv(r'output/sample_submission.csv', index=False)
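Linear regression can return values below 0 (and above 1), which is why the negative predictions had to be reset by hand. One alternative I could have tried is a classifier that outputs probabilities directly. The sketch below swaps in scikit-learn's `LogisticRegression` on synthetic data; the random features and labels are made up and are not the competition data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 rows, 3 features, binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Column 1 of predict_proba is the probability of the positive class
proba = model.predict_proba(X)[:, 1]
print(proba.min(), proba.max())
```

Because `predict_proba` always stays within [0, 1], no manual clipping is needed before writing the submission file.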