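Grouping with GROUP BY and its enhancement (ROLLUP)
Grouping and the use of group functions are among the most basic concepts for anyone who writes SQL.
Let's look at the following example.
Here we use the employee table EMP: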
scott@DB01> select * from emp;
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
---------- ---------- --------- ---------- ------------------- ---------- ---------- ----------
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600 300 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250 500 30
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975 20
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250 1400 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850 30
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450 10
7788 SCOTT ANALYST 7566 1987-04-19 00:00:00 3000 20
7839 KING PRESIDENT 1981-11-17 00:00:00 5000 10
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500 0 30
7876 ADAMS CLERK 7788 1987-05-23 00:00:00 1100 20
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950 30
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000 20
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300 10
14 rows selected.
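The employee table has 14 rows, i.e. 14 employees, and these employees belong to three departments (10, 20, 30). Suppose we want the total salary of the employees in each department of EMP: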
scott@DB01> select deptno,sum(sal) tsal
2 from emp
3 group by deptno;
DEPTNO TSAL
---------- ----------
30 9400
20 10875
10 8750
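One thing to note here: any plain column in the select list must also appear in the GROUP BY clause, written as the column itself and not as its alias. In the Oracle versions discussed here, GROUP BY is strict about what it accepts: column aliases and subqueries are rejected, and a position number is not treated as a reference to a select-list column. Expressions and function calls on columns are accepted, provided the select list uses the same expression. Getting the result is the easy part. Looking a little deeper, grouping costs the database resources such as CPU and memory.
Through Oracle 9i, grouping was implemented internally mainly by sorting; starting with 10g (HASH GROUP BY appeared in 10g Release 2), a hash algorithm can be used instead. Let's look at the execution plan of the statement above under 10g: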
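As a small illustration of what GROUP BY accepts, the sketch below groups EMP by hire year. It assumes the same SCOTT.EMP table and a 10g-era database; the alias hire_year is made up for this example. The first statement groups by the expression itself and should run. The second groups by the alias and is expected to fail with an invalid-identifier error (ORA-00904), because the alias is not visible to GROUP BY in that release.

```sql
-- Grouping by an expression: the same expression appears in the
-- select list and in GROUP BY, so this is accepted.
select to_char(hiredate, 'yyyy') hire_year, sum(sal) tsal
from emp
group by to_char(hiredate, 'yyyy');

-- Grouping by the alias (hypothetical name hire_year): rejected in
-- the 10g environment assumed here, since GROUP BY cannot see
-- select-list aliases.
select to_char(hiredate, 'yyyy') hire_year, sum(sal) tsal
from emp
group by hire_year;
```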
scott@DB01> set autotrace trace exp
scott@DB01> /
Execution Plan
----------------------------------------------------------
Plan hash value: 4067220884
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 14 | 364 | 4 (25)| 00:00:01 |
| 1 | HASH GROUP BY | | 14 | 364 | 4 (25)| 00:00:01 |
| 2 | TABLE ACCESS FULL| EMP | 14 | 364 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement
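In some cases we can get the grouped result without any hash or sort operation at all, for example through an index, provided a suitable index exists.
Consider the following demonstration: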
scott@DB01> set autotrace off
scott@DB01> create table s_test(id number,name varchar2(10),sal number);
Table created.
scott@DB01> begin
2 for i in 1..20000 loop
3 insert into s_test values(i,i||'name',i*10);
4 end loop;
5 commit;
6 end;
7 /
PL/SQL procedure successfully completed.
scott@DB01> /
PL/SQL procedure successfully completed.
scott@DB01> /
PL/SQL procedure successfully completed.
scott@DB01> select count(*) from s_test;
COUNT(*)
----------
60000
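Here I created a table s_test and inserted the values 1 to 20000 into it three times, so every id appears three times. Now I want to find the rows with id between 100 and 120, together with the number of times each one occurs: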
scott@DB01> select id,name,count(*) from s_test where id>=100 and id<=120 group by id,name;
ID NAME COUNT(*)
---------- ---------- ----------
115 115name 3
101 101name 3
103 103name 3
106 106name 3
109 109name 3
118 118name 3
105 105name 3
114 114name 3
102 102name 3
104 104name 3
112 112name 3
116 116name 3
100 100name 3
110 110name 3
113 113name 3
117 117name 3
119 119name 3
107 107name 3
108 108name 3
111 111name 3
120 120name 3
21 rows selected.
Let's look at the execution plan of this statement:
scott@DB01> set autotrace trace exp
scott@DB01> /
Execution Plan
----------------------------------------------------------
Plan hash value: 752916570
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 163 | 3260 | 58 (6)| 00:00:01 |
| 1 | HASH GROUP BY | | 163 | 3260 | 58 (6)| 00:00:01 |
|* 2 | TABLE ACCESS FULL| S_TEST | 163 | 3260 | 57 (4)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("ID">=100 AND "ID"<=120)
Note
-----
- dynamic sampling used for this statement
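The plan shows a cost of 58, part of which is CPU. In the operation with Id 1, Oracle performs a HASH GROUP BY to implement the grouping. Next, let's create a composite index and see what changes: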
scott@DB01> create index s_id_n_idx on s_test(id,name);
Index created.
scott@DB01> select id,name,count(*) from s_test where id>=100 and id<=120 group by id,name;
Execution Plan
----------------------------------------------------------
Plan hash value: 826362002
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 63 | 1260 | 2 (0)| 00:00:01 |
| 1 | SORT GROUP BY NOSORT| | 63 | 1260 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | S_ID_N_IDX | 63 | 1260 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ID">=100 AND "ID"<=120)
filter("ID">=100 AND "ID"<=120)
Note
-----
- dynamic sampling used for this statement
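In the operation with Id 1, the sort that would normally be required is skipped (SORT GROUP BY NOSORT), because the index range scan already returns the rows in (id, name) order. This saves CPU.
This example also shows something important: the cost of the statement drops sharply, from 58 to 2. That comes from the index changing how the data is accessed. We will expand on this when we get to discussing indexes.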
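Next, consider this requirement: group by deptno and job to get the total salary for each job, then produce a subtotal at the department level and a grand total for the whole table.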
scott@DB01> select deptno,job,empno,ename,sal from emp order by deptno,job;
DEPTNO JOB EMPNO ENAME SAL
---------- --------- ---------- ---------- ----------
10 CLERK 7934 MILLER 1300
10 MANAGER 7782 CLARK 2450
10 PRESIDENT 7839 KING 5000
20 ANALYST 7788 SCOTT 3000
20 ANALYST 7902 FORD 3000
20 CLERK 7876 ADAMS 1100
20 CLERK 7369 SMITH 800
20 MANAGER 7566 JONES 2975
30 CLERK 7900 JAMES 950
30 MANAGER 7698 BLAKE 2850
30 SALESMAN 7654 MARTIN 1250
30 SALESMAN 7521 WARD 1250
30 SALESMAN 7499 ALLEN 1600
30 SALESMAN 7844 TURNER 1500
The requirement itself is simple. If all we want is a working result, the set operator UNION will do, although it is quite inefficient here.
scott@DB01> select deptno,job,sum(sal) tsal from emp group by deptno,job
2 union
3 select deptno,to_char(null),sum(sal) from emp group by deptno
4 union
5 select to_number(null),to_char(null),sum(sal) from emp;
DEPTNO JOB TSAL
---------- --------- ----------
10 CLERK 1300
10 MANAGER 2450
10 PRESIDENT 5000
10 8750
20 ANALYST 6000
20 CLERK 1900
20 MANAGER 2975
20 10875
30 CLERK 950
30 MANAGER 2850
30 SALESMAN 5600
30 9400
29025
13 rows selected.
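Part of the cost of the UNION version comes from duplicate elimination. As a hedged variation, and assuming EMP has no NULL values in deptno or job (true for the sample data shown here), the three branches can never return the same row, so UNION ALL returns the same 13 rows without the duplicate-removal step. The row order is not guaranteed, and EMP is still read three times.

```sql
-- Same three branches, combined with UNION ALL instead of UNION.
-- Assumes deptno and job are never NULL in EMP, so the branches
-- cannot produce duplicate rows.
select deptno, job, sum(sal) tsal from emp group by deptno, job
union all
select deptno, to_char(null), sum(sal) from emp group by deptno
union all
select to_number(null), to_char(null), sum(sal) from emp;
```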
To get more efficient SQL we can use ROLLUP, Oracle's extension to GROUP BY, which produces the same result:
scott@DB01> select deptno,job,sum(sal) tsal from emp group by rollup(deptno,job);
DEPTNO JOB TSAL
---------- --------- ----------
10 CLERK 1300
10 MANAGER 2450
10 PRESIDENT 5000
10 8750
20 CLERK 1900
20 ANALYST 6000
20 MANAGER 2975
20 10875
30 CLERK 950
30 MANAGER 2850
30 SALESMAN 5600
30 9400
29025
13 rows selected.
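The first thing we notice is that the ROLLUP statement is much simpler than grouping and then combining with UNION. More importantly, EMP is accessed only once.
For a closer comparison, let's look at the execution plans of the two statements: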
scott@DB01> set autotrace trace exp
scott@DB01> select deptno,job,sum(sal) tsal from emp group by deptno,job
2 union
3 select deptno,to_char(null),sum(sal) from emp group by deptno
4 union
5 select to_number(null),to_char(null),sum(sal) from emp;
Execution Plan
----------------------------------------------------------
Plan hash value: 3412076862
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 29 | 825 | 14 (79)| 00:00:01 |
| 1 | SORT UNIQUE | | 29 | 825 | 14 (79)| 00:00:01 |
| 2 | UNION-ALL | | | | | |
| 3 | HASH GROUP BY | | 14 | 448 | 5 (40)| 00:00:01 |
| 4 | TABLE ACCESS FULL| EMP | 14 | 448 | 3 (0)| 00:00:01 |
| 5 | HASH GROUP BY | | 14 | 364 | 5 (40)| 00:00:01 |
| 6 | TABLE ACCESS FULL| EMP | 14 | 364 | 3 (0)| 00:00:01 |
| 7 | SORT AGGREGATE | | 1 | 13 | 4 (25)| 00:00:01 |
| 8 | TABLE ACCESS FULL| EMP | 14 | 182 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement
scott@DB01> select deptno,job,sum(sal) tsal from emp group by rollup(deptno,job);
Execution Plan
----------------------------------------------------------
Plan hash value: 52302870
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 14 | 448 | 4 (25)| 00:00:01 |
| 1 | SORT GROUP BY ROLLUP| | 14 | 448 | 4 (25)| 00:00:01 |
| 2 | TABLE ACCESS FULL | EMP | 14 | 448 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement
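The comparison shows a large difference in cost between the two statements: 14 vs. 4. So for requirements like the one above, consider using ROLLUP.
Note: ROLLUP syntax
select a,b,group_function
from table_name
group by rollup(a,b);
This syntax is equivalent to combining three queries: one with group by a,b, one with group by a, and one with no GROUP BY at all (the grand total), joined with union.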
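In the ROLLUP output, subtotal and grand-total rows show NULL in the rolled-up columns, which is hard to tell apart from a real NULL in the data. One common way to label them is Oracle's GROUPING function, which returns 1 when a column has been rolled up in that row and 0 otherwise. The following is only a sketch against the same EMP table; the aliases dept_label and job_label and the text 'ALL DEPTS' / 'ALL JOBS' are made up for illustration.

```sql
-- Label subtotal and grand-total rows instead of showing NULL.
-- GROUPING(col) = 1 means col was rolled up in this row.
select decode(grouping(deptno), 1, 'ALL DEPTS', to_char(deptno)) dept_label,
       decode(grouping(job),    1, 'ALL JOBS',  job)             job_label,
       sum(sal) tsal
from emp
group by rollup(deptno, job)
-- NULLs sort last in ascending order by default, so each subtotal
-- follows its detail rows and the grand total comes at the end.
order by deptno, job;
```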