Databases often need a Distinct Count operation, for example, to count the number of students enrolled in each course:
select course, count(distinct sid)
from stu_table
group by course;
Hive
In big-data scenarios, one of the most important report metrics is the UV (Unique Visitor) count, i.e., the number of distinct users within a time window. For example, to see the per-app user distribution over one week, the HiveQL is:
select app, count(distinct uid) as uv
from log_table
where week_cal = '2016-03-27'
group by app;
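Note that count(distinct) de-duplicates within a single reducer per group, so a heavily skewed app can become a bottleneck; a common Hive rewrite is to group by app, uid in a subquery first and then count per app.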
Pig
The Pig version is similar:
-- all users (total UV)
define DISTINCT_COUNT(A, a) returns dist {
    B = foreach $A generate $a;    -- project the target column
    unique_B = distinct B;         -- de-duplicate
    C = group unique_B all;        -- collapse everything into a single group
    $dist = foreach C generate SIZE(unique_B);
};
A = load '/path/to/data' using PigStorage() as (app, uid);
B = DISTINCT_COUNT(A, uid);
-- <app, users>
A = load '/path/to/data' using PigStorage() as (app, uid);
B = distinct A;
C = group B by app;
D = foreach C generate group as app, COUNT($1) as uv;
-- SIZE also works, but it is not algebraic (no combiner), so it only suits small per-app cardinalities
D = foreach C generate group as app, SIZE($1) as uv;
DataFu provides a cardinality-estimation UDF for Pig, datafu.pig.stats.HyperLogLogPlusPlus, which uses the HyperLogLog++ algorithm to compute an approximate Distinct Count much faster:
-- the DataFu jar must be registered first (the path is installation-specific)
register /path/to/datafu.jar;
define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
A = load '/path/to/data' using PigStorage() as (app, uid);
B = group A by app;
C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
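Like all HyperLogLog-family estimators, the result is approximate: a small, bounded error is traded for constant memory per group, which is usually acceptable for UV-style metrics.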
Spark
In Spark, after loading the data, Distinct Count can be done with a chain of RDD transformations such as map, distinct, and reduceByKey:
rdd.map { row => (row.app, row.uid) }
.distinct()
.map { line => (line._1, 1) }
.reduceByKey(_ + _)
// or
rdd.map { row => (row.app, row.uid) }
.distinct()
.mapValues{ _ => 1 }
.reduceByKey(_ + _)
// or
rdd.map { row => (row.app, row.uid) }
.distinct()
.map(_._1)
.countByValue()
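All three variants are exact, but distinct forces a full shuffle to de-duplicate; when the uid cardinality is huge, an approximate answer is often good enough.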
For such cases, Spark provides an approximate Distinct Count API (the argument is the relative accuracy of the estimate; smaller values cost more memory):
rdd.map { row => (row.app, row.uid) }
.countApproxDistinctByKey(0.001)
Its implementation is based on the HyperLogLog algorithm; the API documentation notes:
The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
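For intuition, here is a minimal sketch of the same idea written directly against streamlib's HyperLogLogPlus; the precision value 14 and the aggregateByKey wiring are illustrative assumptions, not Spark's actual internals:
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

// one sketch per app; precision p = 14 gives roughly 0.8% standard error
val approxUV = rdd.map { row => (row.app, row.uid) }
  .aggregateByKey(new HyperLogLogPlus(14))(
    (hll, uid) => { hll.offer(uid); hll },   // fold a uid into the sketch
    (h1, h2)   => { h1.addAll(h2); h1 }      // merge per-partition sketches
  )
  .mapValues(_.cardinality())                // estimated distinct uid count per app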
Alternatively, convert the schema-carrying RDD to a DataFrame, register it as a temporary table with registerTempTable, and run the SQL directly:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // enables rdd.toDF()
val df = rdd.toDF()
df.registerTempTable("app_table")
val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")
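The DataFrame API also exposes the approximate version as an aggregate function. A minimal sketch, assuming the Spark 1.x org.apache.spark.sql.functions API and the df defined above:
import org.apache.spark.sql.functions.approxCountDistinct

val approxAppUsers = df.groupBy("app")
  .agg(approxCountDistinct("uid", 0.001).as("uv"))   // 0.001 = relative accuracy, as before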