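The default text search parser built into KingbaseES splits text on whitespace. Chinese words are not separated by spaces, so this approach does not work for Chinese. Chinese full-text search requires an additional Chinese word-segmentation plugin: zhparser or sys_jieba. zhparser supports the GBK and UTF8 character sets; sys_jieba supports UTF8.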
I. Default whitespace tokenization
1. tsvector
test=# SELECT to_tsvector('English','Try not to become a man of success, but rather try to become a man of value');
                             to_tsvector
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)
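As shown, the english configuration stems words as part of normalization, whereas simple only converts them to lowercase. When no configuration is passed, to_tsvector uses default_text_search_config, which is simple here: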
test=# show default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.simple
(1 row)
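The default can also be changed for the current session. The following is a minimal sketch, assuming KingbaseES accepts the same SET syntax for this parameter as PostgreSQL does; the shortened sample sentence is only for illustration.

```sql
-- Assumption: default_text_search_config can be set per session, as in PostgreSQL.
SET default_text_search_config = 'pg_catalog.english';

-- With no explicit configuration, to_tsvector now follows the english rules
-- (stemming and stop-word removal).
SELECT to_tsvector('Try not to become a man of success');

-- Restore the value reported by SHOW above.
SET default_text_search_config = 'pg_catalog.simple';
```

Passing the configuration explicitly, as in to_tsvector('english', ...), avoids depending on the session setting at all.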
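2. Normalization
Normalization does the following:
- It always converts uppercase letters to lowercase.
- It often strips suffixes (such as s, es, and ing in English), so a search matches the variants of a word without your having to type every one of them.
- The numbers in the output are the positions of each lexeme in the original string; for example, "man" appears at positions 6 and 15.
- With the english configuration, to_tsvector drops English stop words (such as am, is, are, a, an). Suffix stripping and stop-word removal depend on the configuration: simple, the default in this session, does neither.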
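To see how a configuration treats each token, PostgreSQL provides ts_debug. The sketch below assumes KingbaseES exposes the same function with the same output columns; the sample sentence is only for illustration.

```sql
-- Assumption: ts_debug(config, text) is available as in PostgreSQL.
-- For each token it reports the token type (alias), the dictionaries consulted,
-- and the resulting lexemes. A stop word is recognized by a dictionary but
-- yields an empty lexeme list.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'a man of value');

-- The same text under simple, for comparison: no stemming, no stop words.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('simple', 'a man of value');
```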
3. Searching a tsvector
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'become';
 ?column?
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'becom';
 ?column?
----------
 f
(1 row)

test=# select 'become'::tsquery,to_tsquery('become'),to_tsquery('english','become');
 tsquery  | to_tsquery | to_tsquery
----------+------------+------------
 'become' | 'become'   | 'becom'
(1 row)
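to_tsquery normalizes its input as well. Build the query with to_tsquery, using the same configuration as the to_tsvector call, so that normalization does not cause matching rows to be missed.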
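A short sketch of why the configurations on both sides should agree. It uses only functions that appear in this article; the column aliases are illustrative.

```sql
-- Both sides normalized with english: the vector holds 'becom' and the query
-- becomes 'becom', so the match is expected to succeed.
SELECT to_tsvector('english', 'Try not to become a man of success')
       @@ to_tsquery('english', 'become') AS same_config;

-- A plain cast to tsquery is not normalized: the query stays 'become',
-- while the english vector holds 'becom', so no match is expected.
SELECT to_tsvector('english', 'Try not to become a man of success')
       @@ 'become'::tsquery AS raw_cast;
```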
4. Logical operators
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('become');
 ?column?
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('!become');
 ?column?
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('tri & become');
 ?column?
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try & !becom');
 ?column?
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try | !become');
 ?column?
----------
 t
(1 row)

Note: the third and fourth results match what the english configuration produces, where 'tri' is the stem of 'try'. Under simple the vector contains 'try', so 'tri & become' would not match. These queries appear to have been run with default_text_search_config set to english.
5. Use :* to match words that start with a given prefix
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('bec:*');
 ?column?
----------
 t
(1 row)
6. Support for other languages
test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value') ;
                             to_tsvector
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('french','Try not to become a man of success, but rather try to become a man of value') ;
                                                   to_tsvector
-----------------------------------------------------------------------------------------------------------------
 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('french'::regconfig,'Try not to become a man of success, but rather try to become a man of value') ;
                                                   to_tsvector
-----------------------------------------------------------------------------------------------------------------
 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)
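simple does not remove stop words, and it does not try to find word stems. With simple, each whitespace-separated group of characters becomes a lexeme, and the only change applied is conversion to lowercase. For data values (as opposed to natural-language prose), the simple configuration is a practical choice.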
II. Chinese search
Before introducing Chinese search, consider an example:
test=# select to_tsvector('人大金倉致力於提供高可靠的數據庫產品');
               to_tsvector
------------------------------------------
 '人大金倉致力於提供高可靠的數據庫產品':1
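Because the built-in parser splits on whitespace and Chinese text contains none, the whole sentence is treated as a single token.
1. Create the Chinese search extension and configuration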
create extension zhparser;
create text search configuration zhongwen_parser (parser = zhparser);
alter text search configuration zhongwen_parser add mapping for n,v,a,i,e,l,j with simple;
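The letters after "for" are the parser's token types. The mapping above covers only seven of them: noun (n), verb (v), adjective (a), idiom (i), exclamation (e), abbreviation (j), and idiomatic phrase (l). Tokens of any other type are discarded. The dictionary used is the built-in simple dictionary. The full list of token types is: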
test=# select ts_token_type('zhparser');
     ts_token_type
------------------------
 (97,a,adjective)
 (98,b,differentiation)
 (99,c,conjunction)
 (100,d,adverb)
 (101,e,exclamation)
 (102,f,position)
 (103,g,root)
 (104,h,head)
 (105,i,idiom)
 (106,j,abbreviation)
 (107,k,tail)
 (108,l,tmp)
 (109,m,numeral)
 (110,n,noun)
 (111,o,onomatopoeia)
 (112,p,prepositional)
 (113,q,quantity)
 (114,r,pronoun)
 (115,s,space)
 (116,t,time)
 (117,u,auxiliary)
 (118,v,verb)
 (119,w,punctuation)
 (120,x,unknown)
 (121,y,modal)
 (122,z,status)
(26 rows)
2. Inspect pg_ts_config
After the text search configuration is created, it appears in the pg_ts_config view:
test=# select * from pg_ts_config;
  oid  |     cfgname     | cfgnamespace | cfgowner | cfgparser
-------+-----------------+--------------+----------+-----------
  3748 | simple          |           11 |       10 |      3722
 13265 | arabic          |           11 |       10 |      3722
 13267 | danish          |           11 |       10 |      3722
 13269 | dutch           |           11 |       10 |      3722
 13271 | english         |           11 |       10 |      3722
 13273 | finnish         |           11 |       10 |      3722
 13275 | french          |           11 |       10 |      3722
 13277 | german          |           11 |       10 |      3722
 13279 | hungarian       |           11 |       10 |      3722
 13281 | indonesian      |           11 |       10 |      3722
 13283 | irish           |           11 |       10 |      3722
 13285 | italian         |           11 |       10 |      3722
 13287 | lithuanian      |           11 |       10 |      3722
 13289 | nepali          |           11 |       10 |      3722
 13291 | norwegian       |           11 |       10 |      3722
 13293 | portuguese      |           11 |       10 |      3722
 13295 | romanian        |           11 |       10 |      3722
 13297 | russian         |           11 |       10 |      3722
 13299 | spanish         |           11 |       10 |      3722
 13301 | swedish         |           11 |       10 |      3722
 13303 | tamil           |           11 |       10 |      3722
 13305 | turkish         |           11 |       10 |      3722
 16390 | parser_name     |         2200 |       10 |     16389
 24587 | zhongwen_parser |         2200 |       10 |     16389
3. Use Chinese word segmentation
test=# select to_tsvector('zhongwen_parser','人大金倉致力於提供高可靠的數據庫產品');
                           to_tsvector
------------------------------------------------------------------
 '產品':7 '人大':1 '可靠':5 '提供':3 '數據庫':6 '致力於':2 '高':4
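In practice the segmented vector is usually indexed rather than computed for every query. The following is a minimal sketch, assuming KingbaseES supports GIN expression indexes on tsvector the way PostgreSQL does; the table docs, its columns, and the index name are hypothetical and used only for illustration.

```sql
-- Hypothetical table for illustration.
CREATE TABLE docs (
    id   serial PRIMARY KEY,
    body text
);

INSERT INTO docs (body) VALUES ('人大金倉致力於提供高可靠的數據庫產品');

-- Assumption: a GIN expression index on to_tsvector(config, text) is supported.
-- The configuration is named explicitly so the indexed expression does not
-- depend on default_text_search_config.
CREATE INDEX docs_body_fts ON docs
    USING gin (to_tsvector('zhongwen_parser', body));

-- The WHERE clause repeats the indexed expression so the planner can use the index.
SELECT id, body
FROM docs
WHERE to_tsvector('zhongwen_parser', body) @@ to_tsquery('zhongwen_parser', '數據庫');
```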
4. The contains function
test=# \df+ contains
                                                                                        List of functions
 Schema |   Name   | Result data type | Argument data types | Type | Volatility | Parallel | Owner  | Security | Access privileges | Language |               Source code                | Description
--------+----------+------------------+---------------------+------+------------+----------+--------+----------+-------------------+----------+------------------------------------------+-------------
 sys    | contains | boolean          | text, text          | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) |
 sys    | contains | boolean          | text, text, integer | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) |
 sys    | contains | boolean          | text, tsquery       | func | immutable  | safe     | system | invoker  |                   | sql      | select $1::tsvector @@ $2                |
 sys    | contains | boolean          | tsvector, text      | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2::tsquery                 |
 sys    | contains | boolean          | tsvector, tsquery   | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2                          |
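The text overloads of contains call to_tsvector and to_tsquery without a configuration argument, so they use the whitespace-based default parser and cannot be used to match Chinese text: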
test=# select contains('人大金倉致力於提供高可靠的數據庫產品','產品');
 contains
----------
 f
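One way around this is to name the Chinese configuration explicitly. The sketch below relies only on the zhongwen_parser configuration created earlier and on the contains(tsvector, tsquery) overload listed in the \df+ output; whether the isolated word is segmented the same way as inside the sentence depends on the zhparser dictionary, so no result is shown.

```sql
-- Match directly with the @@ operator, using the same configuration on both sides.
SELECT to_tsvector('zhongwen_parser', '人大金倉致力於提供高可靠的數據庫產品')
       @@ to_tsquery('zhongwen_parser', '產品');

-- Or keep using contains by passing an already-built tsvector and tsquery,
-- which selects the (tsvector, tsquery) overload.
SELECT contains(
    to_tsvector('zhongwen_parser', '人大金倉致力於提供高可靠的數據庫產品'),
    to_tsquery('zhongwen_parser', '產品')
);
```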