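1. The concept of standard deviation
Standard deviation (Std Dev) is a term from statistics. It measures how dispersed a data distribution is, that is, how far the data values deviate from the arithmetic mean. The smaller the standard deviation, the less the values deviate from the mean, and vice versa. How large a given standard deviation is can be judged relative to the mean, by looking at the ratio of the standard deviation to the mean.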
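For example, groups A and B each have 6 students who take the same Chinese language test. Group A scores 95, 85, 75, 65, 55, 45 and group B scores 73, 72, 71, 69, 68, 67. Both groups have a mean of 70, but the population standard deviation of group A is 17.078 points while that of group B is 2.160 points, which shows that the gap between students in group A is much larger than the gap between students in group B.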
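The figures in this example can be checked with a few lines of Python. This is a minimal sketch that assumes the quoted values are population standard deviations (divisor N); the variable names are illustrative only.

```python
from statistics import mean, pstdev

# Test scores of the two groups from the example above
group_a = [95, 85, 75, 65, 55, 45]
group_b = [73, 72, 71, 69, 68, 67]

for name, scores in (("A", group_a), ("B", group_b)):
    # pstdev divides the sum of squared deviations by N (population form)
    print(f"group {name}: mean={mean(scores)}, std={pstdev(scores):.3f}")
```

Both groups share the same mean, so the difference between them only becomes visible through the standard deviation.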
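Standard deviation comes in two forms: population standard deviation and sample standard deviation.
Population standard deviation: the deviation computed over the data of the whole population, so the sum of squared deviations is averaged over all N values (divided by N).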

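Sample standard deviation, also called experimental standard deviation: used when a sample is drawn from the population and the sample is used to estimate the population's deviation. To bring the result closer to the population level, the computed value has to be enlarged slightly, that is, the sum of squared deviations is divided by n-1 rather than n. For the sample 200, 50, 100, 200:
mean = (200+50+100+200)/4 = 550/4 = 137.5
S^2 = [(200-137.5)^2+(50-137.5)^2+(100-137.5)^2+(200-137.5)^2]/(4-1) = 16875/3 = 5625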

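2. Standard deviation formulas:
Sample standard deviation
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}$$
where $\bar{X}$ is the mean of the sample $X_1, X_2, \ldots, X_n$.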


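Population standard deviation
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_i-\mu\right)^2}$$
where $\mu$ is the mean of the population X and N is the number of values in it.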
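The two formulas differ only in the divisor, which a direct implementation makes easy to see. The following is a minimal sketch. The function names `population_std` and `sample_std` and the data set are made up for illustration and do not come from the original text.

```python
import math

def population_std(values):
    """Population form: divide the sum of squared deviations by N."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

def sample_std(values):
    """Sample form: divide by n - 1 (needs at least two values)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample standard deviation needs at least two values")
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

# Made-up data, chosen only so that the population result is a round number
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(data))
print(sample_std(data))
```

For the same data the sample form is always the larger of the two, because the same sum is divided by a smaller number.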


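Example: given the numbers 200, 50, 100, 200, find their sample standard deviation.
The mean is 137.5 and the sample variance is S^2 = 16875/3 = 5625, so the sample standard deviation is S = Sqrt(S^2) = 75.
Note: the standard deviation taught in section 21.2, "Dispersion of data", of the grade 8 (volume 2) textbook from Shanghai Scientific & Technical Publishers is the population standard deviation.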
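As a cross-check of this example, Python's standard `statistics` module offers both forms: `stdev` uses the n-1 divisor and `pstdev` uses the n divisor. This is only a sketch for verifying the arithmetic, and the variable names are illustrative.

```python
from statistics import pstdev, stdev

values = [200, 50, 100, 200]

sample_sd = stdev(values)        # divides by n - 1
population_sd = pstdev(values)   # divides by n

print(sample_sd)
print(round(population_sd, 2))
```

The sample form reproduces the value 75 from the example, while the population form for the same four numbers comes out smaller, at roughly 64.95.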
3. Standard deviation functions in Hive
stddev_pop(), stddev_samp(), stddev()
stddev_pop() computes the population standard deviation; stddev_samp() computes the sample standard deviation.
(1) Computing standard deviation with the Hive engine
select col,
       stddev_pop(num),
       stddev_samp(num),
       stddev(num) as stddev_col
from (
    select 'A' as col, '1' as num union all
    select 'A' as col, '2' as num union all
    select 'A' as col, '3' as num union all
    select 'B' as col, '1' as num union all
    select 'B' as col, '2' as num
) as a
group by col;
Query result:
(2) Computing standard deviation with the Spark engine
select col,
       stddev_pop(num),
       stddev_samp(num),
       stddev(num) as stddev_col
from (
    select 'A' as col, '1' as num union all
    select 'A' as col, '2' as num union all
    select 'A' as col, '3' as num union all
    select 'B' as col, '1' as num union all
    select 'B' as col, '2' as num
) as a
group by col;
Query result:
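As the results above show, stddev() in Hive computes the population standard deviation by default, whereas stddev() in Spark computes the sample standard deviation by default.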
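To make the difference concrete, the sketch below computes both forms in Python for the same rows as the inline subquery. These are the mathematically expected values, not captured engine output, and the names `rows`, `groups` and `expected` are illustrative.

```python
from statistics import pstdev, stdev

# Same rows as the inline subquery: (col, num)
rows = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)]

groups = {}
for col, num in rows:
    groups.setdefault(col, []).append(num)

# expected[col] = (population std, sample std)
expected = {col: (pstdev(nums), stdev(nums)) for col, nums in groups.items()}

for col, (pop, samp) in sorted(expected.items()):
    print(col, round(pop, 4), round(samp, 4))
```

Given the behaviour described above, Hive's stddev() should line up with the first number in each row and Spark's stddev() with the second. Calling stddev_pop() or stddev_samp() explicitly avoids depending on either default when the same SQL runs on both engines.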
4. stddev() can also be used as a window function
select col,
       stddev(num) over(partition by col) as stddev_col
from (
    select 'A' as col, '1' as num union all
    select 'A' as col, '2' as num union all
    select 'A' as col, '3' as num union all
    select 'B' as col, '1' as num union all
    select 'B' as col, '2' as num
) as a
Query result:
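The difference from group by is that a window keeps every input row and attaches the statistic of its partition to each one. The Python sketch below imitates that shape for the same five rows. It is an illustration rather than engine output, and it carries both forms side by side because which one stddev() returns depends on the engine.

```python
from statistics import pstdev, stdev

rows = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)]

# Collect the values of each partition (partition by col)
partitions = {}
for col, num in rows:
    partitions.setdefault(col, []).append(num)

# Unlike group by, a window produces one output row per input row:
# (col, num, population std of the partition, sample std of the partition)
windowed = [
    (col, num, pstdev(partitions[col]), stdev(partitions[col]))
    for col, num in rows
]

for row in windowed:
    print(row)
```

All rows of a partition carry the same standard deviation, so the output has five rows rather than two.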
5. Results from Hive and Spark when each group contains only one input row
(1) Hive
select col,
       stddev_pop(num),
       stddev_samp(num),
       stddev(num) as stddev_col
from (
    select 'A' as col, '1' as num union all
    select 'B' as col, '2' as num
) as a
group by col;
Query result:
(2) Spark
select col,
       stddev_pop(num),
       stddev_samp(num),
       stddev(num) as stddev_col
from (
    select 'A' as col, '1' as num union all
    select 'B' as col, '2' as num
) as a
group by col;
Query result:
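The single-row case is special for a mathematical reason that can be shown outside either engine. With one value the population form is 0, while the sample form divides by n-1 = 0 and therefore has no value. The sketch below uses Python's `statistics` module, which reports that case with an exception. SQL engines have to pick their own convention here, and that convention may differ between engines and versions, so it is worth running the queries above on your own cluster.

```python
from statistics import StatisticsError, pstdev, stdev

single = [1]  # one row in the group, as in the queries above

# Population form: the only value equals the mean, so the deviation is 0
print(pstdev(single))

try:
    stdev(single)
    sample_defined = True
except StatisticsError:
    # The n - 1 divisor is zero, so the sample form has no value
    sample_defined = False

print(sample_defined)
```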