Two ways to compute the median in Hive

Reposted 2022-02-07 19:18, tagged Hive

Requirements

Fields: shop (shop_id), sales (sale), commodity ID (commodity_id). For each shop, find the median of its commodity sales.

Data preparation

use default;
create table temp_shop_info
(
    shop_id      string,
    commodity_id string,
    sale         int
) row format delimited
    fields terminated by '\t';

insert into temp_shop_info
values ('110', '1', 10),
       ('110', '2', 20),
       ('110', '3', 30),
       ('110', '4', 50),
       ('110', '5', 60),
       ('110', '6', 20),
       ('110', '7', 80),
       ('111', '1', 90),
       ('111', '2', 80),
       ('111', '3', 50),
       ('111', '4', 70),
       ('111', '5', 20),
       ('111', '6', 10);
select * from temp_shop_info;

Sample data: the 13 rows inserted above (7 commodities for shop 110, 6 for shop 111).

Approach 1: formula

abs(rn - (cnt+1)/2) < 1

Here rn is the 1-based position of a row in the sorted sequence and cnt is the total number of rows in that sequence.

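Two cases illustrate the formula:

When cnt is even: for a sequence of length 6, the median lies at positions 3 and 4, that is, the positions on either side of (6 + 1) / 2 = 3.5.

When cnt is odd: for a sequence of length 5, the median lies at position 3, that is, (5 + 1) / 2 = 3.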

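Conclusion: the median sits either on the two positions adjacent to (cnt + 1) / 2 or exactly at position (cnt + 1) / 2, and which one applies depends only on the parity of the sequence length. If cnt is even, (cnt + 1) / 2 evaluates to X.5, and the median rows are those where abs(rn - (cnt + 1) / 2) = 0.5. If cnt is odd, the median row is the one where abs(rn - (cnt + 1) / 2) = 0. The absolute difference is therefore either 0.5 or 0, and because rn runs over consecutive integers the two cases combine into a single condition. This relies on / being floating-point division, as it is in Hive.

abs(rn - (cnt+1)/2) < 1 or, equivalently, abs(rn - (cnt+1)/2) <= 1/2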

Step 1

select shop_id,
       sale,
       row_number() over (partition by shop_id order by sale) rn, -- number the rows within each shop_id, ordered by sale
       count(*) over (partition by shop_id)                   cnt -- number of rows within each shop_id
from temp_shop_info;
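
To see the formula at work before aggregating, the step 1 query can be wrapped so that every row shows its distance from the middle position. This is only a sketch for inspection; the column aliases dist and is_median are illustrative names, not part of the original solution.

```sql
-- Show, for every row, how far its position rn is from (cnt + 1) / 2.
select shop_id,
       sale,
       rn,
       cnt,
       abs(rn - (cnt + 1) / 2)        as dist,      -- 0 for the middle row (odd cnt), 0.5 for the two middle rows (even cnt)
       abs(rn - (cnt + 1) / 2) <= 0.5 as is_median  -- true only for the middle row(s)
from (select shop_id,
             sale,
             row_number() over (partition by shop_id order by sale) as rn,
             count(*) over (partition by shop_id)                   as cnt
      from temp_shop_info) t
order by shop_id, rn;
```

With the sample data, shop 110 has cnt = 7, so only the row with rn = 4 (sale = 30) is flagged. Shop 111 has cnt = 6, so the rows with rn = 3 and rn = 4 (sale = 50 and 70) are flagged.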

Step 2

select shop_id,
       avg(sale) as median -- odd cnt: the single middle value; even cnt: the mean of the two middle values
from (select shop_id,
             sale,
             row_number() over (partition by shop_id order by sale) as rn, -- consecutive position of each commodity within its shop, ordered by sales
             count(1) over (partition by shop_id)                   as cnt -- length of the sequence
      from temp_shop_info) t
where abs(rn - (cnt + 1) / 2) <= 0.5 -- apply the formula
group by shop_id;

Result (shop_id, median):

111 60
110 30
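
The same filter can also be written with integer bounds instead of a comparison against 0.5. The variant below is a sketch of an equivalent condition, not the original author's query: floor and ceil of (cnt + 1) / 2 coincide when cnt is odd and bracket the two middle positions when cnt is even.

```sql
-- Equivalent filter: keep rows whose position lies between
-- floor((cnt + 1) / 2) and ceil((cnt + 1) / 2), inclusive.
select shop_id,
       avg(sale) as median
from (select shop_id,
             sale,
             row_number() over (partition by shop_id order by sale) as rn,
             count(*) over (partition by shop_id)                   as cnt
      from temp_shop_info) t
where rn >= floor((cnt + 1) / 2)
  and rn <= ceil((cnt + 1) / 2)
group by shop_id;
```

This form may be easier to read for anyone wary of comparing floating-point values, although 0.5 is exactly representable and the original condition is safe as written.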

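Approach 2: built-in function

Hive ships with a built-in percentile function, which gives a simpler way to meet the same requirement: the median is the 0.5 percentile.

percentile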

select shop_id, percentile(sale, 0.5)
from temp_shop_info
group by shop_id;

Result: the same medians as approach 1, namely 30 for shop 110 and 60 for shop 111.
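
One caveat worth keeping in mind: percentile computes an exact percentile and is documented to work on integer columns only. For a floating-point column, or for data too large for an exact computation, Hive also provides percentile_approx. The sketch below casts sale to double purely to illustrate the call on the sample table; the alias approx_median is an illustrative name.

```sql
-- Approximate median; intended for floating-point or very large columns.
select shop_id,
       percentile_approx(cast(sale as double), 0.5) as approx_median
from temp_shop_info
group by shop_id;
```

Because percentile_approx is an approximation, its result is not guaranteed to match percentile in every case, so it is worth comparing the two on a sample before relying on it.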
