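The CBO (cost-based optimizer) calculates cardinality from an estimate of the fraction of rows in the current data set that are expected to pass a given test. That fraction is a number called the selectivity. Once the selectivity is known, multiplying it by the number of input rows gives the cardinality.
Before looking at selectivity, you need some familiarity with the user_tab_col_statistics view: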
- SQL> desc user_tab_col_statistics
- Name Null? Type
- ----------------------------------------- -------- ----------------------------
- TABLE_NAME VARCHAR2(30) Table name
- COLUMN_NAME VARCHAR2(30) Column name
- NUM_DISTINCT NUMBER Number of distinct values in the column
- LOW_VALUE RAW(32) Lowest value in the column
- HIGH_VALUE RAW(32) Highest value in the column
- DENSITY NUMBER Once a histogram is created on the column, this no longer equals 1/NUM_DISTINCT
- NUM_NULLS NUMBER Number of NULLs in the column
- NUM_BUCKETS NUMBER Number of buckets in histogram for the column
- LAST_ANALYZED DATE Time of the most recent analysis
- SAMPLE_SIZE NUMBER Sample size used for the analysis
- GLOBAL_STATS VARCHAR2(3) NO if the statistics were gathered by sampling partitions, otherwise YES
- USER_STATS VARCHAR2(3) YES if the statistics were imported by the user, otherwise NO
- AVG_COL_LEN NUMBER Average column length in bytes
- HISTOGRAM VARCHAR2(15) Indicates existence/type of histogram: NONE, FREQUENCY, HEIGHT BALANCED
Next, create a test table and gather statistics:
- SQL> create table audience as
- 2 select
- 3 trunc(dbms_random.value(1,13)) month_no
- 4 from
- 5 all_objects
- 6 where
- 7 rownum <= 1200
- 8 ;
- Table created.
- SQL> begin
- 2 dbms_stats.gather_table_stats(
- 3 user,
- 4 'audience',
- 5 cascade => true,
- 6 estimate_percent => null,
- 7 method_opt => 'for all columns size 1'
- 8 );
- 9 end;
- 10 /
- PL/SQL procedure successfully completed.
First, look at the table and column statistics gathered above:
- SQL> select t.TABLE_NAME, t.NUM_ROWS, t.BLOCKS, t.SAMPLE_SIZE
- 2 from user_tables t;
- TABLE_NAME NUM_ROWS BLOCKS SAMPLE_SIZE
- ---------- ---------- ---------- -----------
- AUDIENCE 1200 5 1200
- SQL> select s.table_name,
- 2 s.column_name,
- 3 s.num_distinct,
- 4 s.low_value,
- 5 s.high_value,
- 6 s.density,
- 7 s.num_nulls,
- 8 s.sample_size,
- 9 s.avg_col_len
- 10 from user_tab_col_statistics s;
- TABLE_NAME COLUMN_NAM NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY NUM_NULLS SAMPLE_SIZE AVG_COL_LEN
- ---------- ---------- ------------ ---------- ---------- ---------- ---------- ----------- -----------
- AUDIENCE MONTH_NO 12 C102 C10D .083333333 0 1200 3
- SQL> select rawtohex(1), rawtohex(12) from dual;
- RAWT RAWT
- ---- ----
- C102 C10D
- SQL> select dump(1,16),dump(12,16) from dual;
- DUMP(1,16) DUMP(12,16)
- ----------------- -----------------
- Typ=2 Len=2: c1,2 Typ=2 Len=2: c1,d
- SQL> select utl_raw.cast_to_number('c102'),utl_raw.cast_to_number('c10d') from dual;
- UTL_RAW.CAST_TO_NUMBER('C102') UTL_RAW.CAST_TO_NUMBER('C10D')
- ------------------------------ ------------------------------
- 1 12 --LOW_VALUE and HIGH_VALUE above decode to 1 and 12 respectively.
- SQL> select count(a.month_no) from AUDIENCE a;
- COUNT(A.MONTH_NO)
- -----------------
- 1200 --This matches NUM_ROWS.
- SQL> select count(distinct a.month_no) from AUDIENCE a;
- COUNT(DISTINCTA.MONTH_NO)
- -------------------------
- 12 --This matches NUM_DISTINCT.
- SQL> select 1/12 from dual;
- 1/12
- ----------
- .083333333 --This matches DENSITY, which is computed as 1/NUM_DISTINCT.
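The RAW values C102 and C10D can also be decoded by hand. The following is a minimal Python sketch of that decoding. It assumes the usual internal layout of a positive Oracle NUMBER (first byte is 0xC1 plus a base-100 exponent, each following byte is a base-100 digit plus one) and handles positive integers only. The function name is illustrative and not part of any Oracle API.

```python
def decode_positive_oracle_number(raw_hex: str) -> int:
    """Decode the RAW form of a positive integer stored as an Oracle NUMBER.

    Assumption: positive integers only. Zero, negative numbers and
    fractions follow additional rules that this sketch does not cover.
    """
    data = bytes.fromhex(raw_hex)
    exponent = data[0] - 0xC1  # base-100 exponent of the first digit
    value = 0
    for position, byte in enumerate(data[1:]):
        digit = byte - 1  # each base-100 digit is stored plus one
        value += digit * 100 ** (exponent - position)
    return value


print(decode_positive_oracle_number("C102"))  # LOW_VALUE of MONTH_NO
print(decode_positive_oracle_number("C10D"))  # HIGH_VALUE of MONTH_NO
```

In the database itself, utl_raw.cast_to_number as used above remains the simpler route.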
1. Suppose the table created above holds 1200 people. How do we determine how many of them have a birthday in December?
- SQL> select count(*) from AUDIENCE where month_no=12;
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 100 | 300 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"=12)
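The CBO estimates that 100 of the 1200 people were born in December (the Rows column of the line with Id = 2). This matches intuition: there are only 12 months, so the expected number of people born in any one month out of 1200 is 100, and the CBO reasons the same way.
Calculation: DENSITY * NUM_ROWS = 1 / 12 * 1200 = 100.
2. Now suppose 10% of the people do not remember their birthday. How does the CBO calculate in that case?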
- SQL> drop table audience purge;
- Table dropped.
- SQL> create table audience as
- 2 select
- 3 rownum id,
- 4 trunc(dbms_random.value(1,13)) month_no
- 5 from
- 6 all_objects
- 7 where
- 8 rownum <= 1200;
- Table created.
- SQL> update
- 2 audience
- 3 set month_no = null
- 4 where mod(id,10) = 0; --10% of the people do not remember their birthday.
- 120 rows updated.
- SQL> commit;
- Commit complete.
- SQL> begin
- 2 dbms_stats.gather_table_stats(
- 3 user,
- 4 'audience',
- 5 cascade => true,
- 6 estimate_percent => null,
- 7 method_opt => 'for all columns size 1'
- 8 );
- 9 end;
- 10 /
- PL/SQL procedure successfully completed.
- SQL> select t.TABLE_NAME, t.NUM_ROWS, t.BLOCKS, t.SAMPLE_SIZE from user_tables t;
- TABLE_NAME NUM_ROWS BLOCKS SAMPLE_SIZE
- ---------- ---------- ---------- -----------
- AUDIENCE 1200 5 1200
- SQL> select s.table_name,
- 2 s.column_name,
- 3 s.num_distinct,
- 4 s.low_value,
- 5 s.high_value,
- 6 s.density,
- 7 s.num_nulls,
- 8 s.sample_size,
- 9 s.avg_col_len
- 10 from user_tab_col_statistics s;
- TABLE_NAME COLUMN_NAM NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY NUM_NULLS SAMPLE_SIZE AVG_COL_LEN
- ---------- ---------- ------------ ---------- ---------- ---------- ---------- ----------- -----------
- AUDIENCE MONTH_NO 12 C102 C10D .083333333 120 1080 3 --NUM_NULLS is indeed 120.
- AUDIENCE ID 1200 C102 C20D .000833333 0 1200 4
- SQL> select count(*) from AUDIENCE where month_no=12;
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 90 | 270 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"=12)
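Adjusted selectivity: DENSITY * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS) = 1 / 12 * ((1200 - 120) / 1200) = 0.075.
Rows returned (Rows): adjusted selectivity * NUM_ROWS = 0.075 * 1200 = 90.
Another way to think about it: from the previous example, 100 people would be expected in December if every birthday were recorded. 10% of the 1200 people, that is 120, do not remember their birthday. Spread evenly over the 12 months, that removes 10 people from December, so 100 - 10 = 90.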
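The equality estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic above, not of the optimizer's actual code. The function and parameter names are illustrative, and it assumes no histogram, so DENSITY equals 1/NUM_DISTINCT.

```python
def equality_cardinality(num_rows: int, num_nulls: int, num_distinct: int) -> int:
    """Estimated rows for `column = constant` without a histogram."""
    density = 1 / num_distinct
    non_null_fraction = (num_rows - num_nulls) / num_rows
    selectivity = density * non_null_fraction
    return round(selectivity * num_rows)


print(equality_cardinality(1200, 0, 12))    # no NULLs in MONTH_NO
print(equality_cardinality(1200, 120, 12))  # 10% NULLs in MONTH_NO
```

The two calls correspond to the first table (no NULLs) and the second table (120 NULLs).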
3. Now, how many people have a birthday in June, July, or August?
- SQL> select count(*) from AUDIENCE where month_no in(6,7,8);
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 270 | 810 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"=6 OR "MONTH_NO"=7 OR "MONTH_NO"=8)
Selectivity for June, July, and August: June selectivity + July selectivity + August selectivity = 0.075 * 3 = 0.225.
Rows returned (Rows): selectivity for June, July, and August * NUM_ROWS = 0.225 * 1200 = 270.
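As a rough Python sketch of the IN-list arithmetic: the per-value selectivity is the null-adjusted equality selectivity, multiplied by the number of listed values. The names are illustrative, and the sketch assumes the listed values are distinct and that there is no histogram.

```python
def in_list_cardinality(num_rows: int, num_nulls: int,
                        num_distinct: int, list_size: int) -> int:
    """Estimated rows for `column IN (v1, ..., vn)` with n distinct values."""
    per_value = (1 / num_distinct) * (num_rows - num_nulls) / num_rows
    selectivity = per_value * list_size
    return round(selectivity * num_rows)


print(in_list_cardinality(1200, 120, 12, 3))  # month_no in (6,7,8)
```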
4. Now a slightly more complicated case: how many people do not have a birthday in June, July, or August?
- SQL> select count(*) from AUDIENCE where month_no not in(6,7,8);
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 674 | 2022 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"<>6 AND "MONTH_NO"<>7 AND "MONTH_NO"<>8)
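Selectivity: 1 - selectivity for June, July, and August = 1 - 0.075 * 3 = 0.775.
Rows returned: (1 - 0.075 * 3) * 1200 = 930.
By this arithmetic, the cardinality of month_no in {specific list} plus the cardinality of month_no not in {specific list} equals NUM_ROWS. The plan above, however, shows 674 rather than 930, so the two estimates do not add up in the database. Keep this in mind!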
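One reading that reproduces the 674 in the plan, offered here as an assumption rather than something the original text states: the filter is rewritten as three `<>` predicates joined by AND, each gets the null-adjusted selectivity (1 - 1/NUM_DISTINCT) * (NUM_ROWS - NUM_NULLS) / NUM_ROWS = 0.825, and the three are multiplied together. The Python sketch below compares that with the hand calculation. All names are illustrative.

```python
def not_in_by_complement(num_rows: int, num_nulls: int,
                         num_distinct: int, list_size: int) -> int:
    """Hand calculation from the text: 1 minus the IN-list selectivity."""
    per_value = (1 / num_distinct) * (num_rows - num_nulls) / num_rows
    return round((1 - per_value * list_size) * num_rows)


def not_in_by_product(num_rows: int, num_nulls: int,
                      num_distinct: int, list_size: int) -> int:
    """Assumed optimizer behaviour: product of independent `<>` predicates."""
    not_equal = (1 - 1 / num_distinct) * (num_rows - num_nulls) / num_rows
    return round(not_equal ** list_size * num_rows)


print(not_in_by_complement(1200, 120, 12, 3))  # hand calculation
print(not_in_by_product(1200, 120, 12, 3))     # matches the plan's Rows value
```

Under this reading the NULL adjustment is applied once per listed value, which would explain why the estimate falls so far below the hand calculation.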
5. Now find the people born after August, excluding August.
- SQL> select count(*) from AUDIENCE where month_no>8;
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 393 | 1179 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO">8)
Selectivity: ((HIGH_VALUE - LIMIT) / (HIGH_VALUE - LOW_VALUE)) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
Rows returned: selectivity * NUM_ROWS = ((HIGH_VALUE - LIMIT) / (HIGH_VALUE - LOW_VALUE)) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS) * NUM_ROWS = round(((12-8)/(12-1))*((1200-120)/1200)*1200) = 393.
If August is included instead:
- SQL> select count(*) from AUDIENCE where month_no>=8;
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 483 | 1449 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO">=8)
Selectivity: ((HIGH_VALUE - LIMIT) / (HIGH_VALUE - LOW_VALUE) + 1 / NUM_DISTINCT) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
Rows returned: selectivity * NUM_ROWS = ((HIGH_VALUE - LIMIT) / (HIGH_VALUE - LOW_VALUE) + 1 / NUM_DISTINCT) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS) * NUM_ROWS = round(((12-8)/(12-1)+1/12)*((1200-120)/1200)*1200) = 483.
For <8, the selectivity is: ((LIMIT - LOW_VALUE) / (HIGH_VALUE - LOW_VALUE)) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
For <=8, the selectivity is: ((LIMIT - LOW_VALUE) / (HIGH_VALUE - LOW_VALUE) + 1 / NUM_DISTINCT) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
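The four open-ended range cases can be collected into one small Python function. This is a sketch of the arithmetic under the same assumptions as the text (no histogram, the limit lies between LOW_VALUE and HIGH_VALUE). The function name and signature are illustrative.

```python
def range_cardinality(op: str, limit: float, low: float, high: float,
                      num_distinct: int, num_rows: int, num_nulls: int) -> int:
    """Estimated rows for `column <op> limit`, op in >, >=, <, <=."""
    width = high - low
    if op in (">", ">="):
        selectivity = (high - limit) / width
    elif op in ("<", "<="):
        selectivity = (limit - low) / width
    else:
        raise ValueError(f"unsupported operator: {op}")
    if op in (">=", "<="):
        # A closed bound also matches the limit itself: add 1/NUM_DISTINCT.
        selectivity += 1 / num_distinct
    selectivity *= (num_rows - num_nulls) / num_rows  # NULL adjustment
    return round(selectivity * num_rows)


print(range_cardinality(">", 8, 1, 12, 12, 1200, 120))   # month_no > 8
print(range_cardinality(">=", 8, 1, 12, 12, 1200, 120))  # month_no >= 8
```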
6. Now, how many people were born from June through August?
- SQL> select count(*) from AUDIENCE where month_no>=6 and month_no<=8;
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 376 | 1128 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO">=6 AND "MONTH_NO"<=8)
Selectivity: ((HIGH_LIMIT - LOW_LIMIT) / (HIGH_VALUE - LOW_VALUE) + 1 / NUM_DISTINCT + 1 / NUM_DISTINCT) * ((NUM_ROWS - NUM_NULLS) / NUM_ROWS)
Rows returned: round(((8-6)/(12-1)+1/12+1/12)*((1200-120)/1200)*1200) = 376.
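A Python sketch of the bounded-range arithmetic follows. Each closed bound contributes one extra 1/NUM_DISTINCT term, so the flags below switch those terms on or off. The names and the flag-based signature are illustrative choices, not anything taken from Oracle.

```python
def bounded_range_cardinality(low_limit: float, high_limit: float,
                              low: float, high: float,
                              num_distinct: int, num_rows: int, num_nulls: int,
                              low_closed: bool = True,
                              high_closed: bool = True) -> int:
    """Estimated rows for a range with both a lower and an upper limit."""
    selectivity = (high_limit - low_limit) / (high - low)
    # Each closed bound (>= or <=) adds the selectivity of an equality.
    closed_bounds = int(low_closed) + int(high_closed)
    selectivity += closed_bounds / num_distinct
    selectivity *= (num_rows - num_nulls) / num_rows  # NULL adjustment
    return round(selectivity * num_rows)


# month_no >= 6 and month_no <= 8
print(bounded_range_cardinality(6, 8, 1, 12, 12, 1200, 120))
```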
7. Next, see how the CBO calculates selectivity when there are two predicates.
- SQL> drop table audience purge;
- Table dropped.
- SQL> create table audience as
- 2 select
- 3 rownum id,
- 4 trunc(dbms_random.value(1,13))month_no,
- 5 trunc(dbms_random.value(1,16))eu_country
- 6 from
- 7 all_objects
- 8 where
- 9 rownum <= 1200;
- Table created.
- SQL> begin
- 2 dbms_stats.gather_table_stats(
- 3 user,
- 4 'audience',
- 5 cascade => true,
- 6 estimate_percent => null,
- 7 method_opt => 'for all columns size 1'
- 8 );
- 9 end;
- 10 /
- PL/SQL procedure successfully completed.
- SQL> select t.TABLE_NAME, t.NUM_ROWS, t.BLOCKS, t.SAMPLE_SIZE from user_tables t;
- TABLE_NAME NUM_ROWS BLOCKS SAMPLE_SIZE
- ---------- ---------- ---------- -----------
- AUDIENCE 1200 6 1200
- SQL> select s.table_name,
- 2 s.column_name,
- 3 s.num_distinct,
- 4 s.low_value,
- 5 s.high_value,
- 6 s.density,
- 7 s.num_nulls,
- 8 s.sample_size,
- 9 s.avg_col_len
- 10 from user_tab_col_statistics s;
- TABLE_NAME COLUMN_NAM NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY NUM_NULLS SAMPLE_SIZE AVG_COL_LEN
- ---------- ---------- ------------ ---------- ---------- ---------- ---------- ----------- -----------
- AUDIENCE EU_COUNTRY 15 C102 C110 .066666667 0 1200 3
- AUDIENCE MONTH_NO 12 C102 C10D .083333333 0 1200 3
- AUDIENCE ID 1200 C102 C20D .000833333 0 1200 4
- SQL> select count(*) from audience where month_no=12 and eu_country=8;
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 6 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 6 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 7 | 42 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("EU_COUNTRY"=8 AND "MONTH_NO"=12)
Selectivity: month_no selectivity * eu_country selectivity = 1/12 * 1/15
Rows returned: round(1/12*1/15*1200) = 7.
- SQL> select count(*) from audience where month_no=12 or eu_country=8;
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 6 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 6 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 173 | 1038 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"=12 OR "EU_COUNTRY"=8)
Selectivity: month_no selectivity + eu_country selectivity - month_no selectivity * eu_country selectivity = 1/12+1/15-1/12*1/15
Rows returned: round((1/12+1/15-1/12*1/15)*1200) = 173.
- SQL> select count(*) from audience where month_no<>12;
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3337892515
- -------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------
- | 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
- | 1 | SORT AGGREGATE | | 1 | 3 | | |
- |* 2 | TABLE ACCESS FULL| AUDIENCE | 1100 | 3300 | 3 (0)| 00:00:01 |
- -------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter("MONTH_NO"<>12)
Selectivity: 1 - month_no selectivity = 1 - 1/12
Rows returned: (1-1/12)*1200 = 1100.
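The three combination rules used in this section can be written as small Python helpers. This is a sketch that treats the predicates as independent, as the calculations above do. The helper names are illustrative.

```python
def and_selectivity(a: float, b: float) -> float:
    """Selectivity of `A AND B` for independent predicates."""
    return a * b


def or_selectivity(a: float, b: float) -> float:
    """Selectivity of `A OR B`: add both, subtract the overlap."""
    return a + b - a * b


def not_selectivity(a: float) -> float:
    """Selectivity of `NOT A` when the column has no NULLs."""
    return 1 - a


NUM_ROWS = 1200
month_sel = 1 / 12    # month_no = 12
country_sel = 1 / 15  # eu_country = 8

print(round(and_selectivity(month_sel, country_sel) * NUM_ROWS))
print(round(or_selectivity(month_sel, country_sel) * NUM_ROWS))
print(round(not_selectivity(month_sel) * NUM_ROWS))
```

The three printed values correspond to the Rows column of the AND, OR, and `<>` plans above.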
8. Summary:
- Single-predicate filters:
- = cardinality: 1/num_distinct*(num_rows-num_nulls); if there is a histogram, cardinality = (num_rows-num_nulls)*density
- > cardinality: (high_value-limit)/(high_value-low_value)*(num_rows-num_nulls)
- >= cardinality: ((high_value-limit)/(high_value-low_value)+1/num_distinct)*(num_rows-num_nulls). Because the predicate includes =, the selectivity of =, which is 1/num_distinct, is added.
- < cardinality: (limit-low_value)/(high_value-low_value)*(num_rows-num_nulls)
- <= cardinality: ((limit-low_value)/(high_value-low_value)+1/num_distinct)*(num_rows-num_nulls)
- between ... and ... is equivalent to xxx<=high_limit and xxx>=low_limit
- cardinality: ((high_limit-low_limit)/(high_value-low_value)+2/num_distinct)*(num_rows-num_nulls)
- low_limit<xxx and xxx<high_limit cardinality: (high_limit-low_limit)/(high_value-low_value)*(num_rows-num_nulls)
- low_limit<=xxx and xxx<high_limit cardinality: ((high_limit-low_limit)/(high_value-low_value)+1/num_distinct)*(num_rows-num_nulls)
- Two or more predicates:
- A AND B selectivity = selectivity of A * selectivity of B
- A OR B selectivity = A + B - (A AND B)
- NOT A selectivity = 1 - selectivity of A

