2016-11-18
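Our crawler was written the lazy way, so the data it brings back is still stored as raw JSON. It therefore needs a further processing step: pulling the required values out of the JSON in the info field and exposing them as new columns.
MySQL has supported JSON natively since version 5.7, but most production servers are still on 5.6, so for now the JSON has to be picked apart by hand with string functions.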
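For comparison, here is a minimal sketch of what the same extraction could look like on MySQL 5.7 or later, using the built-in JSON_EXTRACT and JSON_UNQUOTE functions. It assumes the info column of the message table holds valid JSON text with the keys title, url and datetime; it was not part of the original processing and will not run on 5.6.

```sql
-- Sketch for MySQL 5.7+ only: read the three keys directly by JSON path.
-- JSON_EXTRACT returns a JSON value (strings keep their quotes),
-- so JSON_UNQUOTE is applied to get plain text.
SELECT
    JSON_UNQUOTE(JSON_EXTRACT(info, '$.title'))    AS post_title,
    JSON_UNQUOTE(JSON_EXTRACT(info, '$.url'))      AS post_url,
    JSON_UNQUOTE(JSON_EXTRACT(info, '$.datetime')) AS post_date
FROM message;
```

Unlike the string-splitting approach, this does not depend on the order of the keys or on the number of double quotes in the record.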
Sample data from the raw table:

Take one record from the info field as an example:
{"title":"你不知道就out了,奧迪Coupe車型的第一次出現!","url":"http://mp.weixin.qq.com/s?__biz=MjM5NjM4NTAwMQ==&mid=2651782672&idx=1&sn=154dae8ec1c02403bab41c6f2c1ad679&chksm=bd1047968a67ce8086b9d8e2111ca33f49e74a44a24c9f05afac48c75e51147268573957eef9&scene=4#wechat_redirect","datetime":"2016-10-20 17:36"}
The title, url and datetime values need to be extracted from it. The SQL used for this is:
SELECT
    CURDATE() AS sys_date,
    b.b_id AS id,
    c.brand AS brand,
    b.b_name AS brand_cn,
    a.wxid AS wechat_id,
    b.`name` AS wechat_cn,
    -- info is a flat JSON object: the value of the Nth key ends just before the (4N)th double quote
    SUBSTRING_INDEX(SUBSTRING_INDEX(a.info, '"', 12), '"', -1) AS post_date,
    SUBSTRING_INDEX(SUBSTRING_INDEX(a.info, '"', 4), '"', -1) AS post_title,
    -- the article id is the mid parameter inside the url
    SUBSTRING_INDEX(SUBSTRING_INDEX(a.info, '&mid=', -1), '&idx', 1) AS post_id,
    a.sn AS post_sn,
    SUBSTRING_INDEX(SUBSTRING_INDEX(a.info, '"', 8), '"', -1) AS post_url,
    -- page_views and likes are the last two colon-separated values in ext
    SUBSTRING_INDEX(SUBSTRING_INDEX(a.ext, ':', -2), ',', 1) AS page_views,
    SUBSTRING_INDEX(SUBSTRING_INDEX(a.ext, ':', -1), '}', 1) AS likes,
    -- empty placeholder column
    NULL AS notes
FROM 11_cmnc.message a
JOIN 11_cmnc.subscription b ON a.wxid = b.wxid
JOIN 11_cmnc.wechat_conf c ON b.b_id = c.id
GROUP BY b.b_id, a.sn
ORDER BY id, post_date;
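The nested SUBSTRING_INDEX calls are easier to follow on a single literal value. The sketch below assigns a shortened record to a user variable and applies the same expressions to it; the title "hello" and the example.com URL are made-up placeholders, not real data.

```sql
-- Hypothetical, shortened record with the same key order as the info field.
SET @info = '{"title":"hello","url":"http://example.com/s?mid=2651782672&idx=1","datetime":"2016-10-20 17:36"}';

SELECT
    -- keep everything before the 4th double quote, then take the part after the last remaining quote
    SUBSTRING_INDEX(SUBSTRING_INDEX(@info, '"', 4), '"', -1)  AS post_title,  -- hello
    SUBSTRING_INDEX(SUBSTRING_INDEX(@info, '"', 8), '"', -1)  AS post_url,    -- http://example.com/s?mid=2651782672&idx=1
    SUBSTRING_INDEX(SUBSTRING_INDEX(@info, '"', 12), '"', -1) AS post_date,   -- 2016-10-20 17:36
    -- take the text after the last '&mid=', then cut it at the first '&idx'
    SUBSTRING_INDEX(SUBSTRING_INDEX(@info, '&mid=', -1), '&idx', 1) AS post_id;  -- 2651782672
```

This works only because the keys always appear in the same order and the values contain no double quotes of their own; a title with an embedded quote would shift every count.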
The resulting data looks like this:

At this point, title, url and datetime have been extracted from the info field of the message table.
