Introduction to Full-Text Search in KingbaseES

This article is a repost. 2021-07-21 18:32. Tags: extension plugins / PostgreSQL / KINGBASE / full-text search

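The default parser built into KingbaseES tokenizes on whitespace. Because Chinese words are not separated by spaces, this approach does not work for Chinese. Full-text search over Chinese text requires an additional Chinese word-segmentation plugin: zhparser or sys_jieba. zhparser supports the GBK and UTF8 character sets; sys_jieba supports UTF8.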

I. Default whitespace tokenization

1. tsvector

test=# SELECT to_tsvector('English','Try not to become a man of success, but rather try to become a man of value');
                             to_tsvector                              
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector                                                     
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector                                                     
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

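As the results show, the english configuration normalizes words by stemming them (and drops stop words), whereas simple only converts them to lowercase. The default in this session is simple: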

test=# show default_text_search_config;
 default_text_search_config 
----------------------------
 pg_catalog.simple
(1 row)
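
To get stemming without naming a configuration on every call, the session default can be changed. The following is a minimal sketch, assuming KingbaseES accepts the standard PostgreSQL SET and RESET syntax for this parameter (the session above only demonstrates SHOW):

```sql
-- Switch the session default from simple to english (affects this session only).
SET default_text_search_config = 'pg_catalog.english';

-- With no explicit configuration argument, to_tsvector now uses english.
SELECT to_tsvector('Try not to become a man of success');

-- Return to the server-configured default.
RESET default_text_search_config;
```

Passing the configuration explicitly, as in to_tsvector('english', ...), avoids depending on the session setting at all.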

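2. Normalization

Normalization performs the following steps:

Uppercase letters are always converted to lowercase.
Suffixes (such as s, es, and ing in English) are often removed, so that one search matches every variant of a word without having to type out all of them.
The numbers give each lexeme's position in the original string; for example, "man" appears at positions 6 and 15.
With the english configuration, to_tsvector also drops English stop words (such as am, is, are, a, an). Note that the default configuration in this session is simple, which keeps them.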
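
One way to watch these steps token by token is ts_debug, a standard PostgreSQL function. The sketch below assumes KingbaseES exposes it unchanged; the sample sentence is just a shortened version of the one used above.

```sql
-- For each token: its type alias, the raw token, and the lexemes it becomes.
SELECT alias, token, lexemes
FROM ts_debug('english', 'Try to become a man of value');

-- Same input under simple, for comparison: lowercasing only.
SELECT alias, token, lexemes
FROM ts_debug('simple', 'Try to become a man of value');
```

In the english result, stop words appear with an empty lexemes array, while the remaining words show their stemmed form.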

3. Searching a tsvector

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'become';
 ?column? 
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'becom';   
 ?column? 
----------
 f
(1 row)

test=# select 'become'::tsquery,to_tsquery('become'),to_tsquery('english','become');
 tsquery  | to_tsquery | to_tsquery 
----------+------------+------------
 'become' | 'become'   | 'becom'
(1 row)

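to_tsquery also normalizes its input. Build search terms with to_tsquery, using the same configuration as the to_tsvector call, so that normalization does not cause matching data to be missed.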
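
A minimal sketch of keeping both sides on the same configuration. plainto_tsquery is a standard PostgreSQL function that is assumed here to be available in KingbaseES as well.

```sql
-- Both sides are normalized by english, so the stored lexeme
-- and the query term go through the same stemmer.
SELECT to_tsvector('english', 'Try not to become a man of success')
       @@ to_tsquery('english', 'become');

-- plainto_tsquery normalizes free text and ANDs the remaining terms,
-- so user input does not need tsquery operator syntax.
SELECT to_tsvector('english', 'Try not to become a man of success')
       @@ plainto_tsquery('english', 'becoming successful');
```

A plain cast such as 'become'::tsquery skips normalization entirely, which is why it only matches when the tsvector side was built with simple.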

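4. Logical operators

Note: the results below are consistent with default_text_search_config being set to english for this part of the session. Under simple, the term 'tri' would not match the unstemmed lexeme 'try'.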

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('become');
 ?column? 
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('!become'); 
 ?column? 
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('tri & become');
 ?column? 
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try & !becom');
 ?column? 
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try | !become');
 ?column? 
----------
 t
(1 row)
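
The operators can also be grouped with parentheses, and adjacency can be required with the phrase operator. This is a sketch with the configuration named explicitly so the result does not depend on the session default; it assumes a KingbaseES build derived from PostgreSQL 9.6 or later, where <-> is available.

```sql
-- Parentheses group sub-expressions: (success OR value) AND NOT failure.
SELECT to_tsvector('simple', 'Try not to become a man of success')
       @@ to_tsquery('simple', '(success | value) & !failure');

-- <-> requires the lexemes to be adjacent and in this order.
SELECT to_tsvector('simple', 'Try not to become a man of success')
       @@ to_tsquery('simple', 'become <-> a <-> man');
```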

5. Prefix matching with :*

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('bec:*');
 ?column? 
----------
 t
(1 row)
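
Prefix terms combine with the logical operators like any other term. A small sketch, again naming the configuration explicitly:

```sql
-- Both prefixes must match some lexeme in the document.
SELECT to_tsvector('simple', 'Try not to become a man of success')
       @@ to_tsquery('simple', 'suc:* & bec:*');
```

With a stemming configuration such as english, the prefix itself is normalized before matching, so results can differ from those under simple.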

6. Support for other languages

test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector                                                     
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value') ;
                             to_tsvector                              
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('french','Try not to become a man of success, but rather try to become a man of value') ;
                                                   to_tsvector                                                   
-----------------------------------------------------------------------------------------------------------------
 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('french'::regconfig,'Try not to become a man of success, but rather try to become a man of value') ;
                                                   to_tsvector                                                   
-----------------------------------------------------------------------------------------------------------------
 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)

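The simple configuration does not drop stop words, and it does not try to find the stem of a word. With simple, every whitespace-delimited group of characters becomes a lexeme, and the only transformation applied is lowercasing. For general data, the simple text search configuration is quite practical.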
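
To compare configurations side by side in one result set, the configuration name can be supplied as data and cast to regconfig. This is a sketch; the sample phrase is an arbitrary choice, not taken from the session above.

```sql
-- One row per configuration, showing how the same text is normalized.
SELECT cfg,
       to_tsvector(cfg::regconfig, 'Values and successes') AS lexemes
FROM (VALUES ('simple'), ('english'), ('french')) AS t(cfg);
```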

II. Chinese full-text search

Before introducing Chinese search, let's look at an example:

test=# select to_tsvector('人大金倉致力於提供高可靠的數據庫產品');
               to_tsvector                
------------------------------------------
 '人大金倉致力於提供高可靠的數據庫產品':1

Because the built-in parser splits on whitespace and Chinese text contains no spaces, the entire sentence is treated as a single token.

1. Installing the Chinese search plugin

create extension zhparser;
create text search configuration zhongwen_parser (parser = zhparser);
alter text search configuration zhongwen_parser add mapping for n,v,a,i,e,l,j with simple;

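The letters after FOR are token types. The mapping above covers only seven of them: noun (n), verb (v), adjective (a), idiom (i), exclamation (e), abbreviation (j), and idiomatic phrase (l). Tokens of every other type are discarded. The dictionary used is the built-in simple dictionary. The full list of token types is: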

test=# select ts_token_type('zhparser');
     ts_token_type      
------------------------
 (97,a,adjective)
 (98,b,differentiation)
 (99,c,conjunction)
 (100,d,adverb)
 (101,e,exclamation)
 (102,f,position)
 (103,g,root)
 (104,h,head)
 (105,i,idiom)
 (106,j,abbreviation)
 (107,k,tail)
 (108,l,tmp)
 (109,m,numeral)
 (110,n,noun)
 (111,o,onomatopoeia)
 (112,p,prepositional)
 (113,q,quantity)
 (114,r,pronoun)
 (115,s,space)
 (116,t,time)
 (117,u,auxiliary)
 (118,v,verb)
 (119,w,punctuation)
 (120,x,unknown)
 (121,y,modal)
 (122,z,status)
(26 rows)
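
To see which of these token types the parser assigns to each word of a given sentence, the raw parser output can be joined against the list above. This sketch uses ts_parse, a standard PostgreSQL function assumed to behave the same way in KingbaseES, and requires the zhparser extension to have been created.

```sql
-- ts_parse returns (tokid, token) pairs produced by the named parser.
-- Joining on ts_token_type turns the numeric id into its alias and description.
SELECT p.token, t.alias, t.description
FROM ts_parse('zhparser', '人大金倉致力於提供高可靠的數據庫產品') AS p
JOIN ts_token_type('zhparser') AS t ON t.tokid = p.tokid;
```

Any token whose alias is not covered by the ADD MAPPING statement is dropped when to_tsvector runs with that configuration.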

2. Inspecting pg_ts_config

After the text search configuration has been created, it appears in pg_ts_config:

test=# select * from pg_ts_config;
  oid  |     cfgname     | cfgnamespace | cfgowner | cfgparser 
-------+-----------------+--------------+----------+-----------
  3748 | simple          |           11 |       10 |      3722
 13265 | arabic          |           11 |       10 |      3722
 13267 | danish          |           11 |       10 |      3722
 13269 | dutch           |           11 |       10 |      3722
 13271 | english         |           11 |       10 |      3722
 13273 | finnish         |           11 |       10 |      3722
 13275 | french          |           11 |       10 |      3722
 13277 | german          |           11 |       10 |      3722
 13279 | hungarian       |           11 |       10 |      3722
 13281 | indonesian      |           11 |       10 |      3722
 13283 | irish           |           11 |       10 |      3722
 13285 | italian         |           11 |       10 |      3722
 13287 | lithuanian      |           11 |       10 |      3722
 13289 | nepali          |           11 |       10 |      3722
 13291 | norwegian       |           11 |       10 |      3722
 13293 | portuguese      |           11 |       10 |      3722
 13295 | romanian        |           11 |       10 |      3722
 13297 | russian         |           11 |       10 |      3722
 13299 | spanish         |           11 |       10 |      3722
 13301 | swedish         |           11 |       10 |      3722
 13303 | tamil           |           11 |       10 |      3722
 13305 | turkish         |           11 |       10 |      3722
 16390 | parser_name     |         2200 |       10 |     16389
 24587 | zhongwen_parser |         2200 |       10 |     16389

3. Using Chinese word segmentation

test=# select to_tsvector('zhongwen_parser','人大金倉致力於提供高可靠的數據庫產品');
                           to_tsvector                            
------------------------------------------------------------------
 '產品':7 '人大':1 '可靠':5 '提供':3 '數據庫':6 '致力於':2 '高':4
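
In practice the segmented vector is usually indexed rather than computed per query. The following is a minimal sketch, assuming KingbaseES supports GIN expression indexes the way PostgreSQL does; the table docs and its columns are hypothetical names used only for illustration.

```sql
-- Hypothetical table for illustration.
CREATE TABLE docs (
    id      serial PRIMARY KEY,
    content text NOT NULL
);

-- Expression index over the segmented text. The form of to_tsvector that
-- takes an explicit configuration is immutable, so it can be indexed.
CREATE INDEX docs_content_fts_idx
    ON docs USING gin (to_tsvector('zhongwen_parser', content));

-- The query repeats the same expression so the planner can use the index.
SELECT id, content
FROM docs
WHERE to_tsvector('zhongwen_parser', content)
      @@ to_tsquery('zhongwen_parser', '數據庫');
```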

4. The contains function

test=# \df+ contains
                                                                                           List of functions
 Schema |   Name   | Result data type | Argument data types | Type | Volatility | Parallel | Owner  | Security | Access privileges | Language |               Source code        
        | Description 
--------+----------+------------------+---------------------+------+------------+----------+--------+----------+-------------------+----------+------------------------------------------+-------------
 sys    | contains | boolean          | text, text          | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) | 
 sys    | contains | boolean          | text, text, integer | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) | 
 sys    | contains | boolean          | text, tsquery       | func | immutable  | safe     | system | invoker  |                   | sql      | select $1::tsvector @@ $2                | 
 sys    | contains | boolean          | tsvector, text      | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2::tsquery                 | 
 sys    | contains | boolean          | tsvector, tsquery   | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2                          |

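The text overloads of contains call to_tsvector and to_tsquery without a configuration argument, so they use the default whitespace-based configuration and cannot be used to match Chinese text: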

test=# select contains('人大金倉致力於提供高可靠的數據庫產品','產品');
 contains 
----------
 f
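
One possible workaround is to name the Chinese configuration explicitly instead of relying on the default. The sketch below assumes the zhongwen_parser configuration created earlier and uses the contains(tsvector, tsquery) overload that appears in the \df+ listing.

```sql
-- Direct operator form, with the configuration named on both sides.
SELECT to_tsvector('zhongwen_parser', '人大金倉致力於提供高可靠的數據庫產品')
       @@ to_tsquery('zhongwen_parser', '產品');

-- The same check through the contains(tsvector, tsquery) overload.
SELECT contains(
    to_tsvector('zhongwen_parser', '人大金倉致力於提供高可靠的數據庫產品'),
    to_tsquery('zhongwen_parser', '產品')
);
```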
