Consider a multi-dimensional data analysis task:
- weekly UV of the logs;
- APP collection volume and labeled volume, TOP 20 APPs (by weekly UV), TOP 20 APP label categories (by weekly UV);
- phone model collection volume and labeled volume, TOP 20 models (by weekly UV), TOP 20 phone vendors (by weekly UV).
The initial solution: read the log data with Spark, then run a separate map / distinct / reduceByKey pipeline for each required statistic. This approach has one serious drawback, however: the data source gets scanned over and over, once per statistic.
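For reference, the initial plan would look roughly like this (a minimal sketch, assuming logRdd: RDD[(String, CaseClasses.LogFact)] as used later in this post; every action triggers its own scan of the source):

// naive version: each statistic is a separate pass over logRdd
val weeklyUv = logRdd.map(_._1).distinct().count()                    // scan 1: overall weekly UV
val appUv = logRdd.map(e => (e._2.appName, e._1)).distinct()          // scan 2: per-APP weekly UV
  .mapValues(_ => 1).reduceByKey(_ + _)
val modelUv = logRdd.map(e => (e._2.dvcModelLabel, e._1)).distinct()  // scan 3: per-model weekly UV
  .mapValues(_ => 1).reduceByKey(_ + _)
// ... and so on for tags, vendors and OS types: yet more scans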
1. Pig
Pig provides the cube keyword for OLAP-style aggregation, and divides dimensions into two kinds:
- normal, handled by the cube operation: \(n\) such dimensions produce \(2^n\) combinations;
- hierarchical ordering, handled by the rollup operation: \(n\) such dimensions produce \(n+1\) combinations.
For instance, CUBE(product, year) yields the \(2^2 = 4\) groupings {(product, year), (product), (year), ()}, while ROLLUP(region, state, city) yields only the \(3+1 = 4\) prefix groupings along the hierarchy.
An example from the official doc:
salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') AS
    (product:chararray, year:int, region:chararray, state:chararray, city:chararray, sales:long);

-- CUBE over two normal dimensions: 2^2 = 4 combinations
cubedinp = CUBE salesinp BY CUBE(product,year);
result = FOREACH cubedinp GENERATE FLATTEN(group), SUM(cube.sales) AS totalsales;

-- ROLLUP over the location hierarchy: 3 + 1 = 4 combinations
rolledup = CUBE salesinp BY ROLLUP(region,state,city);
result = FOREACH rolledup GENERATE FLATTEN(group), SUM(cube.sales) AS totalsales;
In this example, the CUBE operation expands each record into its dimension combinations and groups by those Dimensions; together with the following FOREACH statement, this yields output in the Dimensions + Measure format.
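To make the output shape concrete: each input tuple, e.g. (car,2012,midwest,ohio,columbus,4000), is expanded into its \(2^2 = 4\) (product, year) combinations; after grouping and summing, the result contains rows of the following shape (values illustrative; an empty slot is a NULL dimension meaning "all"):

(car,2012,4000)
(car,,4000)
(,2012,4000)
(,,4000)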
2. Spark
Naive multi-dimensional analysis
The Pig OLAP approach above suggests the idea: for the multi-dimensional analysis task from the opening, we can likewise expand each record according to the Dimensions + Measure rule:
/**
 * @param e (uid, LogFact)
 * @return Array[((dimension order No, dimension), measure)]
 */
def flatAppDvc(e: (String, CaseClasses.LogFact)): Array[((String, String), String)] = {
  val source = (("00", e._2.source), e._1)          // per-source weekly UV
  val appName = (("11", e._2.appName), e._1)        // per-APP weekly UV (TOP 20 APPs)
  val appTag = (("12", e._2.appTag), e._1)          // per-tag weekly UV (TOP 20 label categories)
  val appAll = (("13", "a"), e._1)                  // overall weekly UV
  val appCollect = (("14", "a"), e._2.appName)      // APP collection volume (distinct APPs)
  val appLabel = e._2.appTag match {                // APP labeled volume (distinct labeled APPs)
    case "EMPTY" => (("15", "a"), "useless")        // unlabeled: dropped by the filter below
    case _ => (("15", "a"), e._2.appName)
  }
  val dvcModel = (("21", e._2.dvcModelLabel), e._1) // per-model weekly UV (TOP 20 models)
  val vendor = (("22", e._2.vendor), e._1)          // per-vendor weekly UV (TOP 20 vendors)
  val (osAll, osCollect) =
    ((("23", e._2.osType), e._1),                   // per-OS weekly UV
     (("24", e._2.osType), e._2.dvcModel))          // per-OS collection volume (distinct models)
  val osLabel = e._2.dvcModelLabel match {          // per-OS labeled volume (distinct labeled models)
    case "EMPTY" => (("25", e._2.osType), "useless")
    case _ => (("25", e._2.osType), e._2.dvcModel)
  }
  Array(source, appName, appTag, appAll, appCollect, appLabel, dvcModel, vendor,
    osAll, osCollect, osLabel).filter(_._2 != "useless") // drop the unlabeled placeholders
}
To tell the different dimension combinations apart, the code takes a rather inelegant approach: each combination gets a sequence number. Spark's flatMap API expands one row into many, which fits the dimension expansion perfectly; a single group by key + distinct count then produces the result:
val flatRdd = logRdd.flatMap(flatAppDvc) // one scan expands every record into all dimensions
val result = flatRdd.distinct()          // dedup (dimension, measure) pairs
  .mapValues(_ => 1)
  .reduceByKey(_ + _)                    // distinct count per dimension
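On top of result, the TOP 20 lists from the opening requirements can be read off by filtering on the dimension number and taking the 20 largest counts; a minimal sketch for the TOP 20 APPs by weekly UV ("11" is the APP-name dimension defined in flatAppDvc):

// keep only the APP-name dimension, then take the 20 largest distinct counts
val top20Apps = result
  .filter { case ((no, _), _) => no == "11" }
  .top(20)(Ordering.by { case (_, uv) => uv })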
Multiple measures
The requirements so far were simple: every measure was a distinct count, so there was no need to align Dimensions + Measure. However, for a more complex set of requirements:
- (overall) ad creative collection volume, labeled volume, PV;
- number of ad creatives, UV, PV per 2nd-level label category;
- number of ad creatives, UV, PV per 1st-level label category;
the measures now mix distinct count (UV) and plain count (PV), which requires aligning Dimensions + Measure: every dimension must emit a measure tuple of the same shape. The dimension flatMap is as follows (see the worked expansion after the code):
/**
 * @param e ((adid, 2nd ad-category, 1st ad-category), uid)
 * @return Array[((dimension order No, dimension), measure: (adid, uid or adid, 1))]
 */
def flatAd(e: ((String, String, String), String)): Array[((String, String), (String, String, Int))] = {
  val all = e._1._2 match {
    // unlabeled creative: counts toward collection, placeholder in the label slot, no PV
    case "EMPTY" => (("0", "all"), (e._1._1, "non", 0))
    // labeled creative: (collected adid, labeled adid, 1 page view)
    case _ => (("0", "all"), (e._1._1, e._1._1, 1))
  }
  val adCate = (("1", e._1._2), (e._1._1, e._2, 1))   // 2nd-level category: (adid, uid, PV)
  val adParent = (("2", e._1._3), (e._1._1, e._2, 1)) // 1st-level category: (adid, uid, PV)
  Array(all, adCate, adParent)
}
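For concreteness, this is what the expansion produces for one hypothetical labeled record (all values made up for illustration):

flatAd((("ad42", "sports/news", "sports"), "uidA"))
// => Array(
//      (("0", "all"),         ("ad42", "ad42", 1)),   // overall: collected, labeled, PV
//      (("1", "sports/news"), ("ad42", "uidA", 1)),   // 2nd-level category: creatives, UV, PV
//      (("2", "sports"),      ("ad42", "uidA", 1)))   // 1st-level category: creatives, UV, PV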
Next, compute each dimension's measures (distinct counts use the stream-lib implementation of the HyperLogLogPlus algorithm):
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.rdd.RDD

val createHLL = (v: String) => {
  val hll = new HyperLogLogPlus(14, 0) // precision p = 14 => relative SD ≈ 0.01
  hll.offer(v)
  hll
}

def computeAdDimention(rdd: RDD[((String, String), (String, String, Int))]) = {
  rdd.combineByKey[(HyperLogLogPlus, HyperLogLogPlus, Int)](
    // createCombiner: seed the two HLL sketches and the PV counter from the first value
    // (note: use v._3 rather than a literal 1, since unlabeled records carry PV 0)
    (v: (String, String, Int)) => (createHLL(v._1), createHLL(v._2), v._3),
    // mergeValue: fold one more record into the partition-local accumulator
    (m: (HyperLogLogPlus, HyperLogLogPlus, Int), v: (String, String, Int)) => {
      m._1.offer(v._1)
      m._2.offer(v._2)
      (m._1, m._2, m._3 + v._3)
    },
    // mergeCombiners: merge accumulators coming from different partitions
    (m1: (HyperLogLogPlus, HyperLogLogPlus, Int),
     m2: (HyperLogLogPlus, HyperLogLogPlus, Int)) => {
      m1._1.addAll(m2._1)
      m1._2.addAll(m2._2)
      (m1._1, m1._2, m1._3 + m2._3)
    }
  )
  // turn the sketches into numbers: (distinct count, distinct count, PV)
  .mapValues(t => (t._1.cardinality().toInt, t._2.cardinality().toInt, t._3))
}
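Putting the two pieces together, the ad analysis likewise needs only a single scan of the source; a sketch, assuming a hypothetical adRdd that already has the ((adid, 2nd category, 1st category), uid) shape flatAd expects:

// one scan: expand into aligned Dimensions + Measure, then aggregate per dimension
val adResult = computeAdDimention(adRdd.flatMap(flatAd))
// adResult: RDD[((String, String), (Int, Int, Int))]
//   key   = (dimension order No, dimension value)
//   value = (distinct creatives, distinct labeled creatives or UV, PV)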
To be honest, this post's title oversells it a bit: it merely borrows the shell of OLAP to do multi-dimensional analysis, and is still a long way from real OLAP...