[Sina Weibo Interaction Prediction Contest] Notes on the Season 2 Competition

Reposted article. 2015-11-05 19:43. Tags: algorithms / RF / prediction / Sina Weibo / big data / Alibaba

[Sina Weibo Interaction Prediction Contest] competition link: http://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100071.5678.1.7cNsYr&raceId=5

Season 1 and Season 2 are different, so I will only talk about Season 2.

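Driving lessons, job interviews, and so on took up a lot of my time, so I did not put many hours into Season 2. I finished 36th. Here is a rough summary.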

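The Season 2 task is nominally to predict the total number of interactions a post receives one day after it is published, but it is really a classification problem: a prediction counts as correct as long as it falls in the same bucket as the true value. For example, if the true value is 25, predictions of 11 and 49 both count as correct, because all three values belong to the third bucket.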
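As a minimal sketch of that bucketing, the snippet below maps an interaction total to a class from 1 to 5. The bucket boundaries are an assumption: the post only tells us that 11, 25, and 49 share the third bucket, so `ASSUMED_UPPER_BOUNDS` and the function name `interaction_bucket` are illustrative and should be checked against the official competition rules.

```python
from bisect import bisect_left

# ASSUMPTION: inclusive upper bounds of buckets 1 to 4. These values are not
# given in the post; they are only consistent with its example, in which
# 11, 25 and 49 all fall into bucket 3.
ASSUMED_UPPER_BOUNDS = [5, 10, 50, 100]


def interaction_bucket(total_interactions: int) -> int:
    """Map an interaction total to a bucket number from 1 to 5."""
    # bisect_left returns how many upper bounds lie strictly below the value,
    # so a value equal to a bound stays in the lower bucket.
    return bisect_left(ASSUMED_UPPER_BOUNDS, total_interactions) + 1


# The example from the text: the true value and both predictions share a bucket.
same_bucket = interaction_bucket(25) == interaction_bucket(11) == interaction_bucket(49)
print(same_bucket)
```

Under these assumed boundaries, a prediction is judged by comparing `interaction_bucket(predicted)` with `interaction_bucket(actual)` rather than the raw numbers.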

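The model uses plain accuracy as its score during training, and because class 1 is enormous, it tends to predict class 1 for any given post. The competition, however, assigns a weight to each class: one class-2 post is worth ten class-1 posts. To deal with this mismatch, you either change the evaluation function or resample the training set.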
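To make the mismatch concrete, here is a small sketch that scores a trivial predictor which labels every post as class 1. It uses the per-class counts from the table in this post and the class weights from the scoring query (1, 10, 50, 100, 200); the function name is only illustrative.

```python
# Per-class counts (February and March) and class weights, as given in this post.
CLASS_COUNTS = [42659513, 561360, 594222, 123765, 160248]
CLASS_WEIGHTS = [1, 10, 50, 100, 200]


def always_first_class_scores(counts, weights):
    """Return (plain accuracy, weighted score) when every post is predicted as class 1."""
    accuracy = counts[0] / sum(counts)
    weighted_total = sum(c * w for c, w in zip(counts, weights))
    weighted_score = counts[0] * weights[0] / weighted_total
    return accuracy, weighted_score


accuracy, weighted_score = always_first_class_scores(CLASS_COUNTS, CLASS_WEIGHTS)
print(f"accuracy={accuracy:.3f} weighted_score={weighted_score:.3f}")
# prints: accuracy=0.967 weighted_score=0.348
```

Plain accuracy rewards this predictor with about 97%, while the competition metric gives it only about 35%, which is why the training set has to be rebalanced.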

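Choosing the sampling rates also takes some care. We counted the posts in each class for February and March, multiplied each count by the class weight, and computed each class's share of the weighted total. We then fixed the sample size at 300,000 and worked out how many samples each class needed.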

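	Class 1	Class 2	Class 3	Class 4	Class 5
Count per class	42659513	561360	594222	123765	160248
Weighted count per class	42659513	5613600	29711100	12376500	32049600
Share of weighted total	0.348496	0.045859	0.242717	0.101107	0.261821
Samples needed per class	104549	13758	72815	30332	78546
Sampling rate per class	0.005	0.047	0.227	0.450	0.908
Tuned sampling rate	0.003	0.03	0.17	0.4	1
Samples drawn after tuning	103584	14333	78233	33759	70091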

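The platform's weighted sampling did not return the per-class counts we had designed for, so I tuned the sampling rates until the number sampled from each class came close to the target. The tuned rates are in the table above (any blank cells are ones I have not filled in yet; the data is back in my dorm). In practice this improved the score.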
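The arithmetic behind the weighted counts, shares, and per-class targets in the table can be reproduced with a few lines. This is a sketch of the calculation as described in the text, not the author's actual script; the function name `target_counts` is illustrative.

```python
# Per-class counts and weights as reported in this post (classes 1 to 5).
CLASS_COUNTS = [42659513, 561360, 594222, 123765, 160248]
CLASS_WEIGHTS = [1, 10, 50, 100, 200]
SAMPLE_SIZE = 300000


def target_counts(counts, weights, sample_size):
    """Split sample_size across classes in proportion to count * weight."""
    weighted = [c * w for c, w in zip(counts, weights)]
    total = sum(weighted)
    shares = [w / total for w in weighted]
    targets = [round(s * sample_size) for s in shares]
    return weighted, shares, targets


weighted, shares, targets = target_counts(CLASS_COUNTS, CLASS_WEIGHTS, SAMPLE_SIZE)
for label, (w, s, t) in enumerate(zip(weighted, shares, targets), start=1):
    print(f"class {label}: weighted={w} share={s:.6f} target={t}")
```

The sketch stops at the target counts. The sampling rates in the table depend on how the platform's sampler behaves and were tuned by hand, so they are not derived here.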

With the sample fixed, the next step was choosing features. The features I extracted fall into user features and post features.

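The user features are the user's monthly mean and median interaction counts, number of posts, and number of followers. The post features are time features, post length, the average number of times each word occurs in the post, and the numbers of #, @, and URLs in the post. The code below lists these features: uid_avg_xx(_yy) is the mean of the combined repost, like, and comment count over the user's posts in month xx (through month yy); word_avg_times is the average number of times each word occurs in the post; blog_time_week is the day of the week; blog_time_hour is the time slot, with the day split into 8 slots of 3 hours each; word_length is the number of words in the post text; blog_jinghao, blog_at, and blog_http are the numbers of occurrences of #, @, and URLs containing http, respectively.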

-- Join the post, user-statistics, text-feature, and follower-count tables into one feature table.
select
-- a (${t1}): post-level columns and time features
a.uid,a.mid,a.blog_time,a.blog,a.mid_count,a.ratio,a.blog_time_week,a.blog_time_hour,a.blog_weekdays,a.blog_weekends,a.blog_time_1,a.blog_time_2,a.blog_time_3,a.blog_time_4,a.blog_time_5,a.blog_time_6,a.blog_time_7,a.blog_time_8,
-- b (${t2}): per-user mean, median, and post count over several time windows
b.uid_avg_11_2,b.uid_median_11_2,b.uid_count_11_2,b.uid_avg_1_2,b.uid_median_1_2,b.uid_count_1_2,b.uid_avg_11_12,b.uid_median_11_12,b.uid_count_11_12,b.uid_avg_2,b.uid_median_2,b.uid_count_2,b.uid_avg_1,b.uid_median_1,b.uid_count_1,b.uid_avg_12,b.uid_median_12,b.uid_count_12,b.uid_avg_11,b.uid_median_11,b.uid_count_11,b.uid_avg_15days,b.uid_median_15days,b.uid_count_15days,b.uid_avg_7days,b.uid_median_7days,b.uid_count_7days,
-- c (${t3}): text features of the post; d (${t4}): follower count; real_label is the target class
c.word_length,c.word_avg_times,c.blog_jinghao,c.blog_at,c.blog_http,c.blog_wenhao,c.blog_hongbao,c.blog_shishi,c.blog_dianji,c.blog_shouqi,c.blog_daijinquan,c.blog_zhuanfa,c.blog_p1,c.blog_p2,c.blog_p3,c.blog_p4,c.blog_p5,c.blog_p6,d.fans_count,a.real_label
from ${t1} a
left outer join ${t2} b on a.uid=b.uid
left outer join ${t3} c on a.mid=c.mid
left outer join ${t4} d on a.uid=d.uid;
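For readers who want to see how a few of the text and time features described above might be computed, here is a rough sketch in Python. It is not the author's extraction code, which ran on the contest platform and is not shown. Several choices are assumptions: whitespace tokenization (real Chinese text would need a word segmenter), `word_avg_times` read as total tokens divided by distinct tokens, weekdays numbered 1 to 7 starting on Monday, and time slots numbered 1 to 8.

```python
from collections import Counter
from datetime import datetime


def blog_text_features(blog: str, blog_time: datetime) -> dict:
    """Compute a handful of the post features named in the text."""
    # ASSUMPTION: tokens are separated by whitespace.
    words = blog.split()
    distinct = Counter(words)
    return {
        "word_length": len(words),
        # ASSUMPTION: average occurrences per word = tokens / distinct tokens.
        "word_avg_times": len(words) / len(distinct) if distinct else 0.0,
        "blog_jinghao": blog.count("#"),
        "blog_at": blog.count("@"),
        # Counts every occurrence of "http", which also matches "https".
        "blog_http": blog.count("http"),
        # ASSUMPTION: 1 = Monday ... 7 = Sunday.
        "blog_time_week": blog_time.isoweekday(),
        # Eight 3-hour slots per day; ASSUMPTION: numbered 1 to 8.
        "blog_time_hour": blog_time.hour // 3 + 1,
    }


features = blog_text_features(
    "#sale# go go @tom http://t.cn/abc", datetime(2015, 3, 1, 14, 30)
)
print(features)
```

Keyword flags such as blog_hongbao or blog_zhuanfa would presumably follow the same pattern with a substring count, but their exact definitions are not given in the post.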

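For the model I used the platform's RF model. One weak point of my entry was feature selection: the platform has no feature-selection tool, so I mostly tried by hand which features score well together. Time was limited, so I tested only a few combinations. The best-scoring one so far uses 17 features (the ones ticked in the figure below), even though I extracted more than 50.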

I called the platform's RF (random forest) algorithm to train the RF model.

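To evaluate the model, I predict every post from March and then compute the score. The SQL statement that computes the score is as follows.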

select sum(case
when prediction_result=real_label and real_label=1 then 1
when prediction_result=real_label and real_label=2 then 10
when prediction_result=real_label and real_label=3 then 50
when prediction_result=real_label and real_label=4 then 100
when prediction_result=real_label and real_label=5 then 200
else 0 end)/sum(case
when real_label=1 then 1
when real_label=2 then 10
when real_label=3 then 50
when real_label=4 then 100
when real_label=5 then 200
else 0 end) as score,
sum(case when prediction_result=real_label and real_label=1 then 1
else 0 end) as right_1,
sum(case when real_label=1 then 1
else 0 end) as total_1,
sum(case when prediction_result=real_label and real_label=1 then 1
else 0 end)/sum(case when real_label=1 then 1
else 0 end) as score_1,

sum(case when prediction_result=real_label and real_label=2 then 10
else 0 end) as right_2,
sum(case when real_label=2 then 10
else 0 end) as total_2,
sum(case when prediction_result=real_label and real_label=2 then 10
else 0 end)/sum(case when real_label=2 then 10
else 0 end) as score_2,

sum(case when prediction_result=real_label and real_label=3 then 50
else 0 end) as right_3,
sum(case when real_label=3 then 50
else 0 end) as total_3,
sum(case when prediction_result=real_label and real_label=3 then 50
else 0 end)/sum(case when real_label=3 then 50
else 0 end) as score_3,

sum(case when prediction_result=real_label and real_label=4 then 100
else 0 end) as right_4,
sum(case when real_label=4 then 100
else 0 end) as total_4,
sum(case when prediction_result=real_label and real_label=4 then 100
else 0 end)/sum(case when real_label=4 then 100
else 0 end) as score_4,

sum(case when prediction_result=real_label and real_label=5 then 200
else 0 end) as right_5,
sum(case when real_label=5 then 200
else 0 end) as total_5,
sum(case when prediction_result=real_label and real_label=5 then 200
else 0 end)/sum(case when real_label=5 then 200
else 0 end) as score_5
from ${t1}

It also gives you the score for each class.
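For checking predictions offline, the same weighted score can be sketched in Python. The weights mirror the SQL above; the function name `weighted_scores` and the choice to return `None` for an empty class are assumptions of this sketch, not part of the platform.

```python
# Class weights, taken from the scoring SQL above.
CLASS_WEIGHTS = {1: 1, 2: 10, 3: 50, 4: 100, 5: 200}


def weighted_scores(real_labels, predicted_labels):
    """Return (overall weighted score, {class: per-class score})."""
    total = {label: 0 for label in CLASS_WEIGHTS}
    right = {label: 0 for label in CLASS_WEIGHTS}
    for real, predicted in zip(real_labels, predicted_labels):
        weight = CLASS_WEIGHTS.get(real, 0)  # labels outside 1..5 count as 0
        if weight == 0:
            continue
        total[real] += weight
        if predicted == real:
            right[real] += weight
    grand_total = sum(total.values())
    overall = sum(right.values()) / grand_total if grand_total else None
    per_class = {
        label: (right[label] / total[label] if total[label] else None)
        for label in CLASS_WEIGHTS
    }
    return overall, per_class


overall, per_class = weighted_scores([1, 1, 2, 3, 5], [1, 2, 2, 1, 5])
print(overall, per_class)
```

In this toy input the weighted total is 262 and the correctly predicted weight is 211, so a single missed class-3 post costs far more than a missed class-1 post.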

Then comes parameter tuning. The two main RF parameters are the number of trees, which I set to 500, and the tree depth, which I set to 50.

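One last thing: doing a competition alone is exhausting, and toward the end I ran out of motivation, so it is best to team up if you can.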

