PostgreSQL GIN Indexes

Reposted article, 2021-06-03 01:19, category: Databases

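Introduction to GIN:

GIN stands for Generalized Inverted Index, that is, an inverted index. It handles data types whose values are not atomic but consist of elements; we call these compound types. What gets indexed is the individual elements rather than the whole values, and each element points to the values in which it occurs. For example, the entry ('hank', '15:3 21:4') means that hank occurs at the two positions 15:3 and 21:4. The examples below make this more concrete.

Full-text search

The main application area of GIN is speeding up full-text search, so full-text search is used here to introduce the GIN index.

Create a table as follows. The doc_tsv column has the text search type tsvector, which automatically sorts its elements and removes duplicates: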

postgres=# create table ts(doc text, doc_tsv tsvector);

postgres=# insert into ts(doc) values
  ('Can a sheet slitter slit sheets?'), 
  ('How many sheets could a sheet slitter slit?'),
  ('I slit a sheet, a sheet I slit.'),
  ('Upon a slitted sheet I sit.'), 
  ('Whoever slit the sheets is a good sheet slitter.'), 
  ('I am a sheet slitter.'),
  ('I slit sheets.'),
  ('I am the sleekest sheet slitter that ever slit sheets.'),
  ('She slits the sheet she sits on.');

postgres=# update ts set doc_tsv = to_tsvector(doc);

postgres=# create index on ts using gin(doc_tsv);

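PostgreSQL tsvector documentation: http://www.postgres.cn/docs/9.6/datatype-textsearch.html

[Figure: structure of this GIN index.] In the figure, black squares are TIDs and white ones are words. Note that the pages here are linked as a singly linked list, unlike the doubly linked list of a regular B-tree.

PostgreSQL TID and ctid reference links: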

https://blog.csdn.net/weixin_34372728/article/details/90591262

https://help.aliyun.com/document_detail/181315.html

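Physically, a GIN index contains the following:

1. Entry: an element stored in the GIN index; it can be thought of as a lexeme, or as a key

2. Entry tree: a B-tree built over the entries

3. Posting list: a list of the physical locations (heap ctid, the row address in the heap table) where an entry occurs

4. Posting tree: a B-tree built over the list of physical locations (heap ctid) where an entry occurs; the key of a posting tree is therefore the ctid, while the key of the entry tree is the value of the indexed element

5. Pending list: a temporary list of index tuples, used for inserts in fastupdate mode

Reference: https://www.cnblogs.com/flying-tiger/p/6704931.html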

hank=# select ctid,doc, doc_tsv from ts;          
  ctid  |                          doc                           |                         doc_tsv                         
--------+--------------------------------------------------------+---------------------------------------------------------
 (0,1) | Can a sheet slitter slit sheets?                       | 'sheet':3,6 'slit':5 'slitter':4
 (0,2) | How many sheets could a sheet slitter slit?            | 'could':4 'mani':2 'sheet':3,6 'slit':8 'slitter':7
 (0,3) | I slit a sheet, a sheet I slit.                        | 'sheet':4,6 'slit':2,8
 (1,1) | Upon a slitted sheet I sit.                            | 'sheet':4 'sit':6 'slit':3 'upon':1
 (1,2) | Whoever slit the sheets is a good sheet slitter.       | 'good':7 'sheet':4,8 'slit':2 'slitter':9 'whoever':1
 (1,3) | I am a sheet slitter.                                  | 'sheet':4 'slitter':5
 (2,1) | I slit sheets.                                         | 'sheet':3 'slit':2
 (2,2) | I am the sleekest sheet slitter that ever slit sheets. | 'ever':8 'sheet':5,10 'sleekest':4 'slit':9 'slitter':6
 (2,3) | She slits the sheet she sits on.                       | 'sheet':4 'sit':6 'slit':2
(9 rows)

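As shown above, sheet, slit and slitter occur in many rows, so each of them has many TIDs. A list of TIDs is built for each such entry, and a large list is stored as a separate B-tree (the posting tree).

The following statement shows in how many rows each lexeme occurs.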

hank=# select (unnest(doc_tsv)).lexeme, count(*) from ts
group by 1 order by 2 desc;
  lexeme  | count 
----------+-------
 sheet    |     9
 slit     |     8
 slitter  |     5
 sit      |     2
 upon     |     1
 mani     |     1
 whoever  |     1
 sleekest |     1
 good     |     1
 could    |     1
 ever     |     1
(11 rows)
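The idea of mapping each lexeme to the rows that contain it can be sketched in a few lines of Python. This is only a toy model of an inverted index, not PostgreSQL's implementation: the ROWS dictionary is copied by hand from the doc_tsv output above (positions dropped), and the function name build_inverted_index is made up for illustration.

```python
from collections import defaultdict

# Lexemes per row, copied from the doc_tsv column above (positions dropped).
ROWS = {
    (0, 1): ["sheet", "slit", "slitter"],
    (0, 2): ["could", "mani", "sheet", "slit", "slitter"],
    (0, 3): ["sheet", "slit"],
    (1, 1): ["sheet", "sit", "slit", "upon"],
    (1, 2): ["good", "sheet", "slit", "slitter", "whoever"],
    (1, 3): ["sheet", "slitter"],
    (2, 1): ["sheet", "slit"],
    (2, 2): ["ever", "sheet", "sleekest", "slit", "slitter"],
    (2, 3): ["sheet", "sit", "slit"],
}

def build_inverted_index(rows):
    """Map each lexeme (entry) to the sorted list of TIDs that contain it."""
    index = defaultdict(list)
    for tid in sorted(rows):          # visiting TIDs in order keeps lists sorted
        for lexeme in rows[tid]:
            index[lexeme].append(tid)
    return dict(index)

index = build_inverted_index(ROWS)
counts = {lexeme: len(tids) for lexeme, tids in index.items()}

print(index["slitter"])
print(counts["sheet"], counts["slit"], counts["slitter"])
```

The per-lexeme counts computed this way agree with the output of the unnest query above.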

So the following query can use the GIN index:

-- the data set is small, so disable sequential scans to make the planner use the index
hank=# set enable_seqscan TO off;
SET
hank=# explain(costs off)                                 
select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                             QUERY PLAN                              
---------------------------------------------------------------------
 Bitmap Heap Scan on ts
   Recheck Cond: (doc_tsv @@ to_tsquery('many & slitter'::text))
   ->  Bitmap Index Scan on ts_doc_tsv_idx
         Index Cond: (doc_tsv @@ to_tsquery('many & slitter'::text))
(4 rows)

hank=# select amop.amopopr::regoperator, amop.amopstrategy
from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop
where opc.opcname = 'tsvector_ops'
and opf.oid = opc.opcfamily
and am.oid = opf.opfmethod
and amop.amopfamily = opc.opcfamily
and am.amname = 'gin'
and amop.amoplefttype = opc.opcintype;
        amopopr        | amopstrategy 
-----------------------+--------------
 @@(tsvector,tsquery)  |            1  matching search query
 @@@(tsvector,tsquery) |            2  synonym for @@ (for backward compatibility)
(2 rows)

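Index scan methods reference: https://blog.csdn.net/qq_35260875/article/details/106084392?utm_medium=distribute.pc_relevant.none-task-blog-baidujs_title-6&spm=1001.2101.3001.4242

Looking up mani and slitter in the index gives:
mani: (0,2).
slitter: (0,1), (0,2), (1,2), (1,3), (2,2).

Finally, the consistency function checks the query condition for each TID found. Because the condition is AND, only (0,2) is returned.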

       |      |         |  consistency
       |      |         |    function
  TID  | mani | slitter | mani & slitter
-------+------+---------+----------------
 (0,1) |    f |       T |              f 
 (0,2) |    T |       T |              T
 (1,2) |    f |       T |              f
 (1,3) |    f |       T |              f
 (2,2) |    f |       T |              f
 
postgres=# select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                     doc                     
---------------------------------------------
 How many sheets could a sheet slitter slit?
(1 row)
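The table above can be reproduced with a small Python sketch. It is a simplification under stated assumptions: the posting lists are typed in from the lookup results above, and consistent_and and search_and are hypothetical names standing in for GIN's consistency function and scan loop.

```python
# Posting lists for the two query lexemes, as found in the index above.
postings = {
    "mani": [(0, 2)],
    "slitter": [(0, 1), (0, 2), (1, 2), (1, 3), (2, 2)],
}

def consistent_and(present):
    """Toy consistency function for 'a & b': every lexeme must be present."""
    return all(present.values())

def search_and(postings, lexemes):
    # Every TID that appears in at least one posting list is a candidate.
    candidates = sorted(set().union(*(postings[l] for l in lexemes)))
    matches = []
    for tid in candidates:
        # One boolean per lexeme, like one row of the table above.
        present = {l: tid in postings[l] for l in lexemes}
        if consistent_and(present):
            matches.append(tid)
    return matches

result = search_and(postings, ["mani", "slitter"])
print(result)
```

Only the TID that is present in both lists passes the check, which corresponds to the single row returned by the query.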

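Text search operators documentation: http://www.postgres.cn/docs/9.6/functions-textsearch.html

Slow updates

Inserting or updating data in a GIN index is quite slow, because each row usually contains many elements to be indexed. So when a single row is added or updated, the index tree has to be updated in many places.
On the other hand, if several rows are updated at once, some of their elements may be the same, so the total cost is lower than when updating the rows one at a time.

A GIN index has the fastupdate storage parameter, which can be specified when the index is created and changed later: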

postgres=# create index on ts using gin(doc_tsv) with (fastupdate = true);

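fastupdate reference: https://www.cnblogs.com/flying-tiger/p/6704931.html

When this parameter is enabled, updates accumulate in a separate unordered list (the pending list). When this list gets large enough, or during vacuum, all accumulated updates are applied to the index in one go. What counts as "large enough" is determined by the gin_pending_list_limit configuration parameter, or by the storage parameter of the same name set on the index.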
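The pending-list behaviour can be modelled roughly as follows. This is a toy sketch, not PostgreSQL's code: the class ToyGin and its methods are invented for illustration, and the limit here counts pending rows, whereas the real gin_pending_list_limit is a size limit.

```python
from collections import defaultdict

class ToyGin:
    """Toy model: a main inverted index plus an unordered pending list."""

    def __init__(self, pending_limit):
        self.main = defaultdict(set)     # lexeme -> set of TIDs
        self.pending = []                # unordered (tid, lexemes) tuples
        self.pending_limit = pending_limit   # assumption: counted in rows
        self.flushes = 0

    def insert(self, tid, lexemes):
        # Fast path: just append, without touching the main structure.
        self.pending.append((tid, list(lexemes)))
        if len(self.pending) > self.pending_limit:
            self.flush()

    def flush(self):
        # Move all accumulated rows into the main index in one batch.
        for tid, lexemes in self.pending:
            for lexeme in lexemes:
                self.main[lexeme].add(tid)
        self.pending.clear()
        self.flushes += 1

    def search(self, lexeme):
        # A search has to look at the main index and at the pending list.
        found = set(self.main.get(lexeme, ()))
        found.update(tid for tid, lexemes in self.pending if lexeme in lexemes)
        return sorted(found)

gin = ToyGin(pending_limit=2)
gin.insert((0, 1), ["sheet", "slit", "slitter"])
gin.insert((0, 2), ["could", "mani", "sheet", "slit", "slitter"])
before = (gin.search("mani"), gin.flushes)   # still served from the pending list
gin.insert((0, 3), ["sheet", "slit"])        # exceeds the limit, triggers a flush
after = (gin.search("sheet"), gin.flushes)
print(before, after)
```

The sketch also shows the trade-off of this design: inserts become cheap, but every search has to scan the unordered pending list in addition to the main index.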

Partial-match search

Find the documents that contain a lexeme starting with slit:

hank=# select doc from ts where doc_tsv @@ to_tsquery('slit:*');
                          doc                           
--------------------------------------------------------
 Can a sheet slitter slit sheets?
 How many sheets could a sheet slitter slit?
 I slit a sheet, a sheet I slit.
 Upon a slitted sheet I sit.
 Whoever slit the sheets is a good sheet slitter.
 I am a sheet slitter.
 I slit sheets.
 I am the sleekest sheet slitter that ever slit sheets.
 She slits the sheet she sits on.
(9 rows)

This search can also be sped up by the index:

postgres=# explain (costs off)
select doc from ts where doc_tsv @@ to_tsquery('slit:*');
                         QUERY PLAN                          
-------------------------------------------------------------
 Bitmap Heap Scan on ts
   Recheck Cond: (doc_tsv @@ to_tsquery('slit:*'::text))
   ->  Bitmap Index Scan on ts_doc_tsv_idx
         Index Cond: (doc_tsv @@ to_tsquery('slit:*'::text))
(4 rows)
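Because the entries are kept in sorted order, a prefix query can be answered by a range scan over the keys. The Python sketch below illustrates that idea only; the sorted key list stands in for the entry tree, the index dictionary is a hand-copied subset of the example data, and prefix_search is a hypothetical name.

```python
from bisect import bisect_left

# A subset of the entries from the example table, with their TID lists.
index = {
    "sit": [(1, 1), (2, 3)],
    "sleekest": [(2, 2)],
    "slit": [(0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (2, 1), (2, 2), (2, 3)],
    "slitter": [(0, 1), (0, 2), (1, 2), (1, 3), (2, 2)],
    "upon": [(1, 1)],
}

def prefix_search(index, prefix):
    keys = sorted(index)                  # stands in for the ordered entry tree
    start = bisect_left(keys, prefix)     # first key >= prefix
    matched, tids = [], set()
    for key in keys[start:]:
        if not key.startswith(prefix):
            break                         # keys are sorted, so the range has ended
        matched.append(key)
        tids.update(index[key])           # union of the posting lists
    return matched, sorted(tids)

matched, tids = prefix_search(index, "slit")
print(matched, len(tids))
```

Both slit and slitter fall into the scanned range, and the union of their TID lists covers all nine rows, in line with the query result above.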

Frequent and infrequent lexemes

To work with a larger data set, download it from: http://oc.postgrespro.ru/index.php/s/fRxTZ0sVfPZzbmd/download

fts=# alter table mail_messages add column tsv tsvector;
fts=# update mail_messages set tsv = to_tsvector(body_plain);
fts=# create index on mail_messages using gin(tsv);

-- unnest is not used here to count the rows containing each word because the data set is large; the ts_stat function is used instead
fts=# select word, ndoc
from ts_stat('select tsv from mail_messages')
order by ndoc desc limit 3;
 word  |  ndoc  
-------+--------
 re    | 322141
 wrote | 231174
 use   | 176917
(3 rows)

Now let's look for a word that rarely occurs in the messages, such as "tattoo":

fts=# select word, ndoc from ts_stat('select tsv from mail_messages') where word = 'tattoo';
  word  | ndoc 
--------+------
 tattoo |    2
(1 row)

Now count the rows in which both words occur together; "wrote" and "tattoo" appear together in only one row:

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote & tattoo');
 count 
-------
     1
(1 row)

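Let's look at how this query is executed. If we fetched the full TID lists for both lexemes as described above, the search would clearly be inefficient: we would have to go through more than 200 thousand values only to keep one of them. However, using statistics, the algorithm knows that "wrote" occurs frequently while "tattoo" occurs rarely. So the search is performed for the infrequent lexeme, and the two rows retrieved are then checked for the presence of "wrote". This gives the query result quickly: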

fts=# \timing on

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote & tattoo');
 count 
-------
     1
(1 row)
Time: 0,959 ms

Searching for "wrote" alone takes much longer:

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote');
 count  
--------
 231174
(1 row)
Time: 2875,543 ms (00:02,876)

This optimization, of course, works not only for searches on two lexemes but for more complex searches as well.
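The effect of starting from the rare lexeme can be illustrated with a small Python sketch. The data is made up (plain integer row ids instead of TIDs, with a "wrote" set of the same size as in the output above), and search_and_rare_first is a hypothetical helper, not the algorithm PostgreSQL actually runs.

```python
def search_and_rare_first(postings, lexemes):
    """Evaluate 'a & b & ...' by scanning only the shortest posting list."""
    ordered = sorted(lexemes, key=lambda l: len(postings[l]))
    rare, rest = ordered[0], ordered[1:]
    probes = 0
    matches = []
    for row in sorted(postings[rare]):
        ok = True
        for other in rest:
            probes += 1                     # one membership test in a longer list
            if row not in postings[other]:
                ok = False
                break
        if ok:
            matches.append(row)
    return matches, probes

# Hypothetical row ids: a very frequent lexeme and a very rare one.
postings = {
    "wrote": set(range(231174)),
    "tattoo": {5, 400000},
}

matches, probes = search_and_rare_first(postings, ["wrote", "tattoo"])
print(matches, probes)
```

Only two membership probes are needed instead of a pass over the whole frequent list, which is the intuition behind the timing difference shown above.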

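Limiting the query result

A feature of GIN is that the result is always returned as a bitmap: this access method cannot return matching TIDs one at a time. That is why all query plans in this article use a bitmap scan.

Consequently, limiting the result of an index scan with a LIMIT clause is not very efficient. Note the estimated cost of the operation (the "cost" field of the "Limit" node):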

fts=# explain
select * from mail_messages where tsv @@ to_tsquery('wrote') limit 1;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Limit  (cost=1283.61..1285.13 rows=1)
   ->  Bitmap Heap Scan on mail_messages  (cost=1283.61..209975.49 rows=137207)
         Recheck Cond: (tsv @@ to_tsquery('wrote'::text))
         ->  Bitmap Index Scan on mail_messages_tsv_idx  (cost=0.00..1249.30 rows=137207)
               Index Cond: (tsv @@ to_tsquery('wrote'::text))
(5 rows)

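The cost is estimated at 1285.13, only slightly more than the cost of building the whole bitmap, 1249.30 (the "cost" field of the "Bitmap Index Scan" node).

For this reason the index has a special capability to limit the number of results. The threshold is set by the gin_fuzzy_search_limit configuration parameter and is zero by default (no limit). But we can set a threshold: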

fts=# set gin_fuzzy_search_limit = 1000;

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote');
 count 
-------
  5746
(1 row)
fts=# set gin_fuzzy_search_limit = 10000;

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote');
 count 
-------
 14726
(1 row)

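As we can see, the number of rows returned by the query differs for different values of the parameter (when index access is used). The limit is not strict: more rows than specified can be returned, which justifies the "fuzzy" part of the parameter name.

GIN indexes are fairly compact and do not take up much space. First, if the same word occurs in many rows, it is stored in the index only once. Second, TIDs are stored in the index in sorted order, which allows a simple compression: each next TID in the list is actually stored as its difference from the previous one. This number is usually small and needs far fewer bits than a full six-byte TID.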
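The delta-based compression can be sketched in Python as below. This shows the general technique (differences between sorted values, written with a variable-length byte encoding) and is not PostgreSQL's exact on-disk format; in particular, packing a TID as (block << 16) | offset is an assumption made for this example, and all function names are invented.

```python
def tid_to_int(tid):
    block, offset = tid
    return (block << 16) | offset       # hypothetical packing, for illustration

def int_to_tid(value):
    return (value >> 16, value & 0xFFFF)

def encode_varbyte(n):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)     # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def compress(tids):
    out = bytearray()
    prev = 0
    for tid in tids:                    # tids must be sorted in ascending order
        value = tid_to_int(tid)
        out += encode_varbyte(value - prev)   # store only the difference
        prev = value
    return bytes(out)

def decompress(data):
    tids, prev, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                  # continuation byte, keep accumulating
        else:
            prev += value               # add the delta to the previous TID
            tids.append(int_to_tid(prev))
            value, shift = 0, 0
    return tids

tids = [(239, 44), (239, 47), (239, 48), (239, 50), (239, 52), (240, 3)]
packed = compress(tids)
print(len(packed), "bytes instead of", 6 * len(tids))
```

Neighbouring TIDs on the same page differ by a small number, so most of them take a single byte in this scheme, compared with six bytes for an uncompressed TID.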

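To get an idea of the size, let's build a B-tree on the message text. The comparison is not exactly like for like:

GIN is built on a different data type ("tsvector" rather than "text"), which is smaller.
At the same time, the messages have to be truncated to about 2 KB for the B-tree.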

fts=# create index mail_messages_btree on mail_messages(substring(body_plain for 2048));

Create a GiST index as well:

fts=# create index mail_messages_gist on mail_messages using gist(tsv);

Compare the sizes of the GIN, GiST and B-tree indexes:

fts=# select pg_size_pretty(pg_relation_size('mail_messages_tsv_idx')) as gin,
             pg_size_pretty(pg_relation_size('mail_messages_gist')) as gist,
             pg_size_pretty(pg_relation_size('mail_messages_btree')) as btree;
  gin   |  gist  | btree  
--------+--------+--------
 179 MB | 125 MB | 546 MB
(1 row)

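Because GIN indexes are space-efficient, they can be used as a replacement for bitmap indexes when migrating from Oracle to PostgreSQL. Bitmap indexes are typically used for columns with few unique values, a case that GIN also handles very well. Moreover, PostgreSQL can build a bitmap on the fly from any index, including GIN.

GiST or GIN?
In general, GIN beats GiST in both accuracy and search speed. If the data is updated infrequently and fast search is needed, GIN is a good choice.
On the other hand, if the data is updated intensively, the overhead of updating GIN may be too high. In that case the two index types have to be compared, choosing the one whose characteristics fit better.

Arrays

Another example of using GIN is indexing arrays. In this case the array elements go into the index, which speeds up a number of array operations: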

postgres=# select amop.amopopr::regoperator, amop.amopstrategy
from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop
where opc.opcname = 'array_ops'
and opf.oid = opc.opcfamily
and am.oid = opf.opfmethod
and amop.amopfamily = opc.opcfamily
and am.amname = 'gin'
and amop.amoplefttype = opc.opcintype;
        amopopr        | amopstrategy 
-----------------------+--------------
 &&(anyarray,anyarray) |            1  intersection
 @>(anyarray,anyarray) |            2  contains array
 <@(anyarray,anyarray) |            3  contained in array
 =(anyarray,anyarray)  |            4  equality
(4 rows)

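Array operators documentation: http://postgres.cn/docs/9.6/functions-array.html

The flights demo database is used again as the example. (Reposter's note: I do not know where the original author's flights database can be found.)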

demo=# select departure_airport_name, arrival_airport_name, days_of_week
from routes
where flight_no = 'PG0049';
 departure_airport_name | arrival_airport_name | days_of_week 
------------------------+----------------------+--------------
 Vnukovo                | Gelendzhik           | {2,4,7}
(1 row)

Create a new table and build an index on it:

demo=# create table routes_t as select * from routes;

demo=# create index on routes_t using gin(days_of_week);

Now we can use the index to find all flights that depart on Tuesdays, Thursdays and Sundays:

demo=# explain (costs off) select * from routes_t where days_of_week = ARRAY[2,4,7];
                        QUERY PLAN                         
-----------------------------------------------------------
 Bitmap Heap Scan on routes_t
   Recheck Cond: (days_of_week = '{2,4,7}'::integer[])
   ->  Bitmap Index Scan on routes_t_days_of_week_idx
         Index Cond: (days_of_week = '{2,4,7}'::integer[])
(4 rows)

Six flights are found:

demo=# select flight_no, departure_airport_name, arrival_airport_name, days_of_week from routes_t where days_of_week = ARRAY[2,4,7];
 flight_no | departure_airport_name | arrival_airport_name | days_of_week 
-----------+------------------------+----------------------+--------------
 PG0005    | Domodedovo             | Pskov                | {2,4,7}
 PG0049    | Vnukovo                | Gelendzhik           | {2,4,7}
 PG0113    | Naryan-Mar             | Domodedovo           | {2,4,7}
 PG0249    | Domodedovo             | Gelendzhik           | {2,4,7}
 PG0449    | Stavropol              | Vnukovo              | {2,4,7}
 PG0540    | Barnaul                | Vnukovo              | {2,4,7}
(6 rows)

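The query is executed in the following steps:

1. The elements 2, 4 and 7 are extracted from the array (the search query); they serve as the search keys.
2. The extracted keys are found in the tree of elements, and the TID list is selected for each key.
3. Among the TIDs found, those that match the operator are selected. For the = operator, only TIDs that appear in all three lists match (in other words, the stored array must contain all the elements). But this is not sufficient: the array must also contain no other values, and this condition cannot be checked with the index. So in this case the access method asks the indexing engine to recheck all returned TIDs against the table.

Some strategies (for example, "contained in array") cannot check anything with the index at all and have to recheck against the table every TID found. For <@, a row reached through one matching element may still contain other elements that are not in the query array, so the index alone cannot confirm the match. (Reposter's note: I did not understand this "for example" in the original post.)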
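The steps above, including the recheck, can be sketched in Python. This is a toy model: the heap dictionary holds made-up rows keyed by a plain row id, and build_index, search_equal and search_contained_by are hypothetical names. Query arrays are assumed to be non-empty, and empty stored arrays are not modelled.

```python
# Made-up heap rows: row id -> days_of_week array.
heap = {
    1: [2, 4, 7],
    2: [1, 2, 4, 7],
    3: [2, 4],
    4: [2, 4, 7],
    5: [7],
}

def build_index(heap):
    index = {}
    for row_id, array in heap.items():
        for element in array:
            index.setdefault(element, set()).add(row_id)
    return index

def search_equal(heap, index, query):
    # Rows that appear in the posting list of every query element.
    candidates = set.intersection(*(index.get(k, set()) for k in set(query)))
    # Recheck against the heap: the array must contain nothing else.
    matches = sorted(r for r in candidates if heap[r] == query)
    return sorted(candidates), matches

def search_contained_by(heap, index, query):
    # The index can only say "shares at least one element with the query".
    candidates = set.union(*(index.get(k, set()) for k in set(query)))
    # Every candidate has to be rechecked against the heap.
    matches = sorted(r for r in candidates if set(heap[r]) <= set(query))
    return sorted(candidates), matches

index = build_index(heap)
eq_candidates, eq_matches = search_equal(heap, index, [2, 4, 7])
cb_candidates, cb_matches = search_contained_by(heap, index, [2, 4, 7])
print(eq_candidates, eq_matches)
print(cb_candidates, cb_matches)
```

In the equality case the index already narrows the result to rows containing all three elements, and the recheck only removes row 2, which has an extra value. In the "contained in" case the index returns every row sharing any element, so the recheck does all the real filtering.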

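But what if we need the flights that depart from Moscow on Tuesdays, Thursdays and Sundays? The index does not support the additional condition, which goes into the "Filter" line of the plan.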

demo=# explain (costs off)
select * from routes_t where days_of_week = ARRAY[2,4,7] and departure_city = 'Moscow';
                        QUERY PLAN                         
-----------------------------------------------------------
 Bitmap Heap Scan on routes_t
   Recheck Cond: (days_of_week = '{2,4,7}'::integer[])
   Filter: (departure_city = 'Moscow'::text)
   ->  Bitmap Index Scan on routes_t_days_of_week_idx
         Index Cond: (days_of_week = '{2,4,7}'::integer[])
(5 rows)

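Here this is acceptable (the index selects only six rows anyway), but in cases where the additional condition makes the search more selective, we would like the index to support it too. However, we cannot simply create a multicolumn index: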

demo=# create index on routes_t using gin(days_of_week,departure_city);

ERROR:  data type text has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

The btree_gin extension helps here: it adds GIN operator classes that emulate the behaviour of a regular B-tree.

demo=# create extension btree_gin;

demo=# create index on routes_t using gin(days_of_week,departure_city);

demo=# explain (costs off)
select * from routes_t where days_of_week = ARRAY[2,4,7] and departure_city = 'Moscow';
                             QUERY PLAN
---------------------------------------------------------------------
 Bitmap Heap Scan on routes_t
   Recheck Cond: ((days_of_week = '{2,4,7}'::integer[]) AND
                  (departure_city = 'Moscow'::text))
   ->  Bitmap Index Scan on routes_t_days_of_week_departure_city_idx
         Index Cond: ((days_of_week = '{2,4,7}'::integer[]) AND
                      (departure_city = 'Moscow'::text))
(4 rows)

JSONB

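Another example of a compound data type with built-in GIN support is JSON (the jsonb type). PostgreSQL defines a number of operators and functions for working with JSON values, and some of them can be sped up with an index: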

postgres=# select opc.opcname, amop.amopopr::regoperator, amop.amopstrategy as str
from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop
where opc.opcname in ('jsonb_ops','jsonb_path_ops')
and opf.oid = opc.opcfamily
and am.oid = opf.opfmethod
and amop.amopfamily = opc.opcfamily
and am.amname = 'gin'
and amop.amoplefttype = opc.opcintype;
    opcname     |     amopopr      | str
----------------+------------------+-----
 jsonb_ops      | ?(jsonb,text)    |   9  top-level key exists
 jsonb_ops      | ?|(jsonb,text[]) |  10  some top-level key exists
 jsonb_ops      | ?&(jsonb,text[]) |  11  all top-level keys exist
 jsonb_ops      | @>(jsonb,jsonb)  |   7  JSON value is at top level
 jsonb_path_ops | @>(jsonb,jsonb)  |   7
(5 rows)

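As we can see, there are two operator classes: jsonb_ops and jsonb_path_ops. The first one, jsonb_ops, is used by default. All keys, values and array elements of the JSON document get into the index as elements. An attribute is added to each element to indicate whether it is a key (the "exists" strategies need this, because they distinguish keys from values).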

demo=# create table routes_jsonb as
  select to_jsonb(t) route 
  from (
      select departure_airport_name, arrival_airport_name, days_of_week
      from routes 
      order by flight_no limit 4
  ) t;

demo=# select ctid, jsonb_pretty(route) from routes_jsonb;
 ctid  |                 jsonb_pretty                  
-------+-------------------------------------------------
 (0,1) | {                                              +
       |     "days_of_week": [                          +
       |         1                                      +
       |     ],                                         +
       |     "arrival_airport_name": "Surgut",          +
       |     "departure_airport_name": "Ust-Ilimsk"     +
       | }
 (0,2) | {                                              +
       |     "days_of_week": [                          +
       |         2                                      +
       |     ],                                         +
       |     "arrival_airport_name": "Ust-Ilimsk",      +
       |     "departure_airport_name": "Surgut"         +
       | }
 (0,3) | {                                              +
       |     "days_of_week": [                          +
       |         1,                                     +
       |         4                                      +
       |     ],                                         +
       |     "arrival_airport_name": "Sochi",           +
       |     "departure_airport_name": "Ivanovo-Yuzhnyi"+
       | }
 (0,4) | {                                              +
       |     "days_of_week": [                          +
       |         2,                                     +
       |         5                                      +
       |     ],                                         +
       |     "arrival_airport_name": "Ivanovo-Yuzhnyi", +
       |     "departure_airport_name": "Sochi"          +
       | }
(4 rows)

demo=# create index on routes_jsonb using gin(route);

[Figure: structure of the GIN index on routes_jsonb]

The following query can use the index:

demo=# explain (costs off) 
select jsonb_pretty(route) 
from routes_jsonb 
where route @> '{"days_of_week": [5]}';
                          QUERY PLAN                           
---------------------------------------------------------------
 Bitmap Heap Scan on routes_jsonb
   Recheck Cond: (route @> '{"days_of_week": [5]}'::jsonb)
   ->  Bitmap Index Scan on routes_jsonb_route_idx
         Index Cond: (route @> '{"days_of_week": [5]}'::jsonb)
(4 rows)

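Starting from the root of the JSON document, the @> operator checks whether the specified path ("days_of_week": [5]) occurs.

jsonb operators documentation: http://postgres.cn/docs/9.6/functions-json.html

The following query returns one row: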

demo=# select jsonb_pretty(route) from routes_jsonb where route @> '{"days_of_week": [5]}';
                 jsonb_pretty                 
------------------------------------------------
 {                                             +
     "days_of_week": [                         +
         2,                                    +
         5                                     +
     ],                                        +
     "arrival_airport_name": "Ivanovo-Yuzhnyi",+
     "departure_airport_name": "Sochi"         +
 }
(1 row)

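The query is executed as follows:

1. The elements (search keys) «days_of_week» and «5» are extracted from the search query ("days_of_week": [5]).
2. The extracted keys are found in the tree of elements, and the TID list is selected for each key: (0,4) for 5, and (0,1), (0,2), (0,3), (0,4) for days_of_week.
3. Among the TIDs found, the consistency function selects those that match the operator of the query. For the @> operator, documents that do not contain all elements of the search query certainly do not match, so only (0,4) is kept. The remaining TID still has to be rechecked against the table, because the index does not show in what order the found elements occur in the JSON document.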
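These steps can be modelled in Python as a rough sketch. It is a simplification and not how jsonb_ops is implemented: elements are stored as (is_key, value) pairs, the documents are copied from the routes_jsonb output above, the contains function implements only a reduced form of the @> semantics, and all function names are invented.

```python
# Documents copied from the routes_jsonb table above, keyed by TID.
docs = {
    (0, 1): {"days_of_week": [1], "arrival_airport_name": "Surgut",
             "departure_airport_name": "Ust-Ilimsk"},
    (0, 2): {"days_of_week": [2], "arrival_airport_name": "Ust-Ilimsk",
             "departure_airport_name": "Surgut"},
    (0, 3): {"days_of_week": [1, 4], "arrival_airport_name": "Sochi",
             "departure_airport_name": "Ivanovo-Yuzhnyi"},
    (0, 4): {"days_of_week": [2, 5], "arrival_airport_name": "Ivanovo-Yuzhnyi",
             "departure_airport_name": "Sochi"},
}

def extract(value, out):
    """Collect (is_key, element) pairs from a JSON value, recursively."""
    if isinstance(value, dict):
        for key, item in value.items():
            out.add((True, key))          # flag marks the element as a key
            extract(item, out)
    elif isinstance(value, list):
        for item in value:
            extract(item, out)
    else:
        out.add((False, value))
    return out

def build_index(docs):
    index = {}
    for tid, doc in docs.items():
        for element in extract(doc, set()):
            index.setdefault(element, set()).add(tid)
    return index

def contains(doc, query):
    """Recheck step: simplified @> for objects, arrays and scalars."""
    if isinstance(query, dict):
        return isinstance(doc, dict) and all(
            k in doc and contains(doc[k], v) for k, v in query.items())
    if isinstance(query, list):
        return isinstance(doc, list) and all(
            any(contains(d, q) for d in doc) for q in query)
    return doc == query

def search_contains(docs, index, query):
    keys = extract(query, set())
    # Candidates must appear in the TID list of every extracted key.
    candidates = set.intersection(*(index.get(k, set()) for k in keys))
    matches = sorted(t for t in candidates if contains(docs[t], query))
    return sorted(candidates), matches

index = build_index(docs)
print(search_contains(docs, index, {"days_of_week": [5]}))
print(search_contains(docs, index, {"arrival_airport_name": "Sochi"}))
```

The second query shows why the recheck matters: both (0,3) and (0,4) contain the key arrival_airport_name and the value "Sochi" somewhere in the document, but only in (0,3) does the value belong to that key.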

Internals

The pageinspect extension lets us look inside the index:

fts=# create extension pageinspect;

The meta page shows general statistics:

fts=# select * from gin_metapage_info(get_raw_page('mail_messages_tsv_idx',0));

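The page structure provides a special area where access methods store their own information. This area is "opaque" to ordinary programs such as vacuum. The gin_page_opaque_info function shows this data for GIN. For example, we can find out what kinds of pages the index consists of: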

fts=# select flags, count(*)
from generate_series(1,22967) as g(id), -- n_total_pages
     gin_page_opaque_info(get_raw_page('mail_messages_tsv_idx',g.id))
group by flags;
         flags          | count 
------------------------+-------
 {meta}                 |     1  meta page
 {}                     |   133  internal page of element B-tree
 {leaf}                 | 13618  leaf page of element B-tree
 {data}                 |  1497  internal page of TID B-tree
 {data,leaf,compressed} |  7719  leaf page of TID B-tree
(5 rows)

The gin_leafpage_items function shows information about the TIDs stored on {data,leaf,compressed} pages:

fts=# select * from gin_leafpage_items(get_raw_page('mail_messages_tsv_idx',2672));
-[ RECORD 1 ]---------------------------------------------------------------------
first_tid | (239,44)
nbytes    | 248
tids      | {"(239,44)","(239,47)","(239,48)","(239,50)","(239,52)","(240,3)",...
-[ RECORD 2 ]---------------------------------------------------------------------
first_tid | (247,40)
nbytes    | 248
tids      | {"(247,40)","(247,41)","(247,44)","(247,45)","(247,46)","(248,2)",...
...

Properties of GIN

The properties of the GIN access method are as follows:

-- multicolumn indexes can be created
 amname |     name      | pg_indexam_has_property 
--------+---------------+-------------------------
 gin    | can_order     | f
 gin    | can_unique    | f
 gin    | can_multi_col | t   
 gin    | can_exclude   | f

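As shown here, GIN supports multicolumn indexes. Unlike a regular B-tree, however, a multicolumn GIN index still stores individual elements rather than composite keys, and the column number is recorded with each element.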
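A rough Python sketch of this layout is given below. It is a toy model under stated assumptions: the rows are made up, the index key is a (column number, element) pair, and build_multicolumn_index is a hypothetical name.

```python
# Made-up rows: row id -> (days_of_week array, departure_city).
rows = {
    1: ([2, 4, 7], "Moscow"),
    2: ([2, 4, 7], "Barnaul"),
    3: ([1, 3], "Moscow"),
}

def build_multicolumn_index(rows):
    index = {}
    for row_id, (days, city) in rows.items():
        for day in days:
            # Column 1 contributes one entry per array element.
            index.setdefault((1, day), set()).add(row_id)
        # Column 2 is a scalar, so it contributes a single entry.
        index.setdefault((2, city), set()).add(row_id)
    return index

index = build_multicolumn_index(rows)

# Conditions on both columns are answered by intersecting posting lists.
candidates = (index[(1, 2)] & index[(1, 4)] & index[(1, 7)]
              & index[(2, "Moscow")])
print(sorted(candidates))
```

Because every entry carries its column number, a value in one column is never confused with the same value in another, and the conditions on different columns can be combined inside a single index scan.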

Index-level properties:

-- bitmap scan is supported
     name      | pg_index_has_property 
---------------+-----------------------
 clusterable   | f
 index_scan    | f
 bitmap_scan   | t
 backward_scan | f

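Note that returning results one TID at a time (a plain index scan) is not supported; GIN can only perform bitmap scans. (Reposter's note: I did not fully understand this sentence.)

The following are the column-level properties: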

        name        | pg_index_column_has_property 
--------------------+------------------------------
 asc                | f
 desc               | f
 nulls_first        | f
 nulls_last         | f
 orderable          | f
 distance_orderable | f
 returnable         | f
 search_array       | f
 search_nulls       | f

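Some other extensions also add GIN-related functionality:

pg_trgm enables fuzzy matching. It supports a variety of operators, including comparison with LIKE and with regular expressions, and can be used together with GIN.
btree_gin, introduced above, makes it possible to create multicolumn GIN indexes that include columns of regular data types.

Reference: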

https://blog.csdn.net/dazuiba008/article/details/103985791
