Greenplum: Checking a Table's Data Distribution to Tune Its Distribution Key (DK)

Reposted article, dated 2013-01-11 20:00. Category: greenplum & postgresql / greenplum

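I have recently been analyzing the logs of our ETL back-end system: looking at the long-running TASKs, finding the slow JOBs within them, and optimizing at both the logic level and the database level. This article covers only the database-level work, namely adjusting SQL statements and adjusting the distribution key (DK) of Greenplum tables. For one JOB that took about 30 minutes, I located its source table and analyzed it as follows: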

dw=# select gp_segment_id, count(*) from tb_name group by gp_segment_id order by count(*) desc;

gp_segment_id   count
----------------------
    65          16655

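Note: gp_segment_id is a hidden column on every Greenplum table that records which segment a row is stored on. The output shows that all 16655 rows of this table sit on a single segment, 65 (see gp_segment_configuration for segment details), while my cluster has 96 segments in total. The table gets no benefit from Greenplum's multi-segment parallelism, so its DK is badly chosen. I therefore reset the DK with alter table tb_name set distributed by (col1,...) and re-ran the query above, checking two things: the number of segments returned (does every segment hold rows?) and the row count per segment (is the distribution balanced?). Once both checks look roughly right, run vacuum full and vacuum analyze on the table to reclaim space and collect fresh statistics.

I pulled out the source tables of each long-running JOB and analyzed them one by one, which cut the total TASK run time from 3 hours to about 2 hours (mostly because the tables had been designed so poorly that there was this much room for improvement). Further gains have to come from optimizing the logic and the SQL and from raising concurrency, which is where the real payoff lies.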
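The re-distribution workflow described above could be sketched as follows. This is a minimal sketch, not the author's exact script: tb_name and col1 are the placeholders used in the text, and the filter on gp_segment_configuration (role = 'p' and content >= 0) is an assumption about how to count primary segments while leaving out the master.

```sql
-- How many primary segments does the cluster have?
-- (Assumption: content = -1 marks the master, role = 'p' marks primaries.)
select count(*) as primary_segments
from gp_segment_configuration
where role = 'p' and content >= 0;

-- Redistribute the table on a new key. col1 is a placeholder for a
-- high-cardinality column (or column list) chosen for your table.
alter table tb_name set distributed by (col1);

-- Re-check: how many segments hold rows, and how rows spread across them.
select count(*) as segments_with_rows,
       min(num) as min_rows,
       max(num) as max_rows,
       avg(num) as avg_rows
from (select gp_segment_id, count(*) as num
      from tb_name
      group by gp_segment_id) t;

-- Reclaim space and refresh planner statistics, as the text recommends.
vacuum full tb_name;
vacuum analyze tb_name;
```

If segments_with_rows matches primary_segments and max_rows stays close to avg_rows, the new key is spreading the data evenly.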

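To make this analysis easier, I created the two tables and the function below. They collect the distribution of every table in a schema and show which tables need a new DK.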

-- The two tables
CREATE TABLE "public"."table_segment_statistics" (
"table_name" varchar(200) DEFAULT NULL,
"segment_count" int4 DEFAULT NULL,
"table_rows" int8 DEFAULT NULL
);

CREATE TABLE "public"."table_segment_statistics_balance" (
"table_name" varchar(200) DEFAULT NULL,
"segment_id" int4 DEFAULT NULL,
"segment_count" int8 DEFAULT NULL
);

--function
CREATE OR REPLACE FUNCTION "public"."analyze_table_dk_balance"(v_schemaname varchar)
  RETURNS "pg_catalog"."int4" AS $BODY$
DECLARE
    v_tb varchar(200);
    -- All tables in the given schema, skipping system schemas and
    -- partition child tables (names containing 'prt').
    -- quote_ident keeps mixed-case or unusual names valid in the dynamic SQL.
    v_cur_tb cursor for
        select quote_ident(schemaname)||'.'||quote_ident(tablename)
        from pg_tables
        where schemaname<>'information_schema' and schemaname<>'pg_catalog'
          and schemaname<>'gp_toolkit' and tablename not like '%prt%' and schemaname=v_schemaname;
BEGIN
    -- Start from empty statistics tables on every run.
    truncate table public.table_segment_statistics;
    truncate table public.table_segment_statistics_balance;
    open v_cur_tb;
    loop
        fetch v_cur_tb into v_tb;
        if not found THEN
            exit;
        end if;
        -- Per table: number of segments holding rows, and total row count.
        execute 'insert into public.table_segment_statistics select '||quote_literal(v_tb)||' as table_name,count(*) as segment_count,sum(num) as table_rows from (select gp_segment_id,count(*) num from '||v_tb||' group by gp_segment_id) t';
        -- Per table and segment: number of rows stored on that segment.
        execute 'insert into public.table_segment_statistics_balance select '||quote_literal(v_tb)||' as table_name,gp_segment_id as segment_id,count(*) as segment_count from '||v_tb||' group by gp_segment_id order by gp_segment_id';
    end loop;
    close v_cur_tb;
    RETURN 0;
end;
$BODY$
  LANGUAGE 'plpgsql' VOLATILE;
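One possible way to run the function and look at what it collected is sketched here. The schema name 'public' and the table name tb_name are example values, not something the original specifies.

```sql
-- Collect distribution statistics for every table in one schema
-- ('public' is an example; pass the schema you want to inspect).
select public.analyze_table_dk_balance('public');

-- Per-table summary: segments holding rows and total row count.
select * from public.table_segment_statistics order by table_rows desc;

-- Per-segment detail for one table (tb_name is a placeholder).
select segment_id, segment_count
from public.table_segment_statistics_balance
where table_name = 'public.tb_name'
order by segment_count desc;
```

Because the function truncates both statistics tables at the start of each call, the results always describe only the schema analyzed most recently.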

The analysis queries are as follows:

-- 96 is the number of segments in my Greenplum cluster
select * from public.table_segment_statistics 
where table_rows is not null and segment_count<96 and table_rows>10000
order by table_rows desc;
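Instead of hard-coding 96, the segment count could be read from the catalog. This variant is a sketch under one assumption: that counting rows in gp_segment_configuration with role = 'p' and content >= 0 gives the number of primary segments in your cluster.

```sql
-- Tables over 10,000 rows that do not occupy every primary segment.
select s.*
from public.table_segment_statistics s
where s.table_rows is not null
  and s.table_rows > 10000
  and s.segment_count < (select count(*)
                         from gp_segment_configuration
                         where role = 'p' and content >= 0)
order by s.table_rows desc;
```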

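-- Find segments whose row count deviates from the table's per-segment average by more than 10%; adjust the threshold as needed.
-- Only tables with more than 10,000 rows are included; small tables are not worth analyzing.
-- The casts to numeric avoid integer division, which would truncate the ratio to 0 for any deviation below 100%.
select a."table_name",b.segment_id,a.table_rows/a.segment_count as reldk,b.segment_count
from 
"public".table_segment_statistics a
inner join 
"public".table_segment_statistics_balance b
on a."table_name" = b."table_name"
where a."table_name" is not null and a.table_rows > 10000
and abs(a.table_rows::numeric/a.segment_count-b.segment_count)/(a.table_rows::numeric/a.segment_count)>0.1;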
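For a one-row-per-table view of the same data, the heaviest segment can be compared with the average. This is an illustrative sketch built only on the balance table defined earlier; the column aliases are hypothetical names chosen here.

```sql
-- One row per table: the heaviest segment versus the average over
-- segments that hold rows. A ratio near 1.0 means balanced data.
select b.table_name,
       count(*)                                              as segments_with_rows,
       max(b.segment_count)                                  as max_rows,
       round(avg(b.segment_count), 1)                        as avg_rows,
       round(max(b.segment_count) / avg(b.segment_count), 2) as max_to_avg
from public.table_segment_statistics_balance b
group by b.table_name
having sum(b.segment_count) > 10000
order by max_to_avg desc;
```

Segments that hold no rows never appear in the balance table, so a table stored entirely on one segment still shows a ratio of 1.0. Read max_to_avg together with segments_with_rows to catch that case.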
