ClickHouse String Functions

Reposted article, 2021-09-06 15:45. Part of the series "ClickHouse: a column-oriented database that is outrageously fast".

Preface

Let's go through the functions ClickHouse provides for working with strings.

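empty: checks whether a string is empty; returns 1 if it is empty and 0 otherwise

notEmpty: checks whether a string is non-empty; returns 1 if it is non-empty and 0 otherwise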

SELECT empty(''), empty('satori');
/*
┌─empty('')─┬─empty('satori')─┐
│         1 │               0 │
└───────────┴─────────────────┘
*/

SELECT notEmpty(''), notEmpty('satori');
/*
┌─notEmpty('')─┬─notEmpty('satori')─┐
│            0 │                  1 │
└──────────────┴────────────────────┘
*/

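length: returns the length of a string in bytes

char_length: returns the length of a string in characters (Unicode code points, assuming the string is valid UTF-8)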

WITH 'satori' AS s1, '古明地覺' AS s2
SELECT length(s1), length(s2), char_length(s1), char_length(s2)
/*
┌─length(s1)─┬─length(s2)─┬─CHAR_LENGTH(s1)─┬─CHAR_LENGTH(s2)─┐
│          6 │         12 │               6 │               4 │
└────────────┴────────────┴─────────────────┴─────────────────┘
*/

toString: converts values such as integers and dates to strings

SELECT toString(3), cast(3 AS String);
/*
┌─toString(3)─┬─CAST(3, 'String')─┐
│ 3           │ 3                 │
└─────────────┴───────────────────┘
*/

Besides cast, every data type has a built-in conversion function named to + type, such as toInt8, toUInt32, toFloat64, and toDecimal64.
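As a small illustration of the to + type naming pattern, the sketch below converts a few string literals. The literals are made up for the example, and toTypeName is used only to display the resulting type.

```sql
SELECT toInt8('12')              AS i,   -- 12
       toFloat64('3.14')         AS f,   -- 3.14
       toDate('2021-09-06')      AS d,   -- 2021-09-06
       toTypeName(toUInt32('7')) AS t;   -- UInt32
```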

lower, lcase: converts a string to lowercase

upper, ucase: converts a string to uppercase

SELECT lower('SAtoRI'), upper('SAtoRI');
/*
┌─lower('SAtoRI')─┬─upper('SAtoRI')─┐
│ satori          │ SATORI          │
└─────────────────┴─────────────────┘
*/

repeat: repeats a string n times

SELECT repeat('abc', 3);
/*
┌─repeat('abc', 3)─┐
│ abcabcabc        │
└──────────────────┘
*/

reverse: reverses a string

SELECT reverse('satori');
/*
┌─reverse('satori')─┐
│ irotas            │
└───────────────────┘
*/

Note: reverse works byte by byte, so it corrupts multi-byte text such as Chinese. To reverse such strings, use reverseUTF8 instead.
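A minimal sketch of the character-aware variant, reusing the sample string that appears elsewhere in this article:

```sql
-- '古明地覺' is 4 characters but 12 bytes in UTF-8
SELECT reverseUTF8('古明地覺');   -- 覺地明古
```

Running plain reverse on the same input reorders the individual bytes, so the result is no longer valid UTF-8.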

format: formats a string by substituting arguments into a template

SELECT format('{}--{}', 'hello', 'world');
/*
┌─format('{}--{}', 'hello', 'world')─┐
│ hello--world                       │
└────────────────────────────────────┘
*/

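-- The number of {} placeholders must match the number of arguments,
-- unless the placeholders are numbered, in which case an argument can be reused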
SELECT format('{0}--{1}--{0}', 'hello', 'world');
/*
┌─format('{0}--{1}--{0}', 'hello', 'world')─┐
│ hello--world--hello                       │
└───────────────────────────────────────────┘
*/

concat: concatenates strings

SELECT concat('a', 'b', 'c');
/*
┌─concat('a', 'b', 'c')─┐
│ abc                   │
└───────────────────────┘
*/

Strings can also be concatenated with the || operator:

SELECT 'a' || 'b' || 'c';
/*
┌─concat('a', 'b', 'c')─┐
│ abc                   │
└───────────────────────┘
*/

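substring: extracts part of a string; also available as mid and substr. It is used like substring in standard SQL, with one difference

-- Start at position 2 and take 3 bytes. This is the difference: offset and length are counted in bytes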
SELECT substring('abcdefg', 2, 3);
/*
┌─substring('abcdefg', 2, 3)─┐
│ bcd                        │
└────────────────────────────┘
*/

-- To extract by character instead, use substringUTF8
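The difference is easiest to see on multi-byte text. The sketch below assumes UTF-8 input, where each of these Chinese characters occupies 3 bytes:

```sql
-- Character-based: start at character 2, take 2 characters
SELECT substringUTF8('古明地覺', 2, 2);   -- 明地

-- Byte-based: the same slice is bytes 4 through 9
SELECT substring('古明地覺', 4, 6);       -- 明地
```

With substring, an offset or length that falls inside a character cuts it in half and yields invalid UTF-8.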

appendTrailingCharIfAbsent: if the non-empty string s does not end with the character c, appends c to the end of s

SELECT appendTrailingCharIfAbsent('satori', 'i'), 
       appendTrailingCharIfAbsent('sator', 'i');
/*
┌─appendTrailingCharIfAbsent('satori', 'i')─┬─appendTrailingCharIfAbsent('sator', 'i')─┐
│ satori                                    │ satori                                   │
└───────────────────────────────────────────┴──────────────────────────────────────────┘
*/

convertCharset: converts a string from one character set to another

SELECT convertCharset('satori', 'ascii', 'utf8');
/*
┌─convertCharset('satori', 'ascii', 'utf8')─┐
│ satori                                    │
└───────────────────────────────────────────┘
*/

base64Encode: encodes a string as base64

base64Decode: decodes a base64-encoded string

SELECT base64Encode('satori') s1, base64Decode(s1);
/*
┌─s1───────┬─base64Decode(base64Encode('satori'))─┐
│ c2F0b3Jp │ satori                               │
└──────────┴──────────────────────────────────────┘
*/

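There is also tryBase64Decode, which behaves like base64Decode but returns an empty string when decoding fails. The ClickHouse documentation states that base64Decode raises an exception on invalid input, so do not rely on it to return anything usable for a string that is not base64-encoded.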
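A short sketch of the forgiving variant. The second input is an arbitrary string chosen because it contains characters outside the base64 alphabet:

```sql
SELECT tryBase64Decode('c2F0b3Jp')           AS ok,               -- satori
       empty(tryBase64Decode('not base64!')) AS failed_is_empty;  -- should be 1
```

Wrapping the call in empty makes the failure case visible as a number rather than as a blank cell.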

startsWith, endsWith: checks whether a string starts or ends with a given substring; returns 1 if it does and 0 otherwise

SELECT startsWith('古明地覺', '古明') v1, endsWith('古明地覺', '古明') v2;
/*
┌─v1─┬─v2─┐
│  1 │  0 │
└────┴────┘
*/

trim: removes characters from both ends of a string

SELECT trim('   satori    ') s, length(s);
/*
┌─s──────┬─length(trimBoth('   satori    '))─┐
│ satori │                                 6 │
└────────┴───────────────────────────────────┘
*/

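-- Whitespace is removed by default, but other characters can be removed as well.
-- In that case you must say whether to trim from the left, the right, or both ends:
-- LEADING is the left, TRAILING is the right, BOTH is both ends.
-- As the results show, 'ab' is treated as a set of characters, not as a substring.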
SELECT trim(BOTH 'ab' FROM 'abxxxxxxbaaa') s1,
       trim(LEADING 'ab' FROM 'abxxxxxxbaaa') s2,
       trim(TRAILING 'ab' FROM 'abxxxxxxbaaa') s3;
/*
┌─s1─────┬─s2─────────┬─s3───────┐
│ xxxxxx │ xxxxxxbaaa │ abxxxxxx │
└────────┴────────────┴──────────┘
*/

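When trim receives only a plain string, it removes whitespace from both ends. trimLeft and trimRight likewise take a plain string and remove whitespace from the left or right end only. trimLeft can also be written ltrim, and trimRight can also be written rtrim.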
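A quick sketch of the one-sided variants. The brackets added with concat are only there to make the remaining whitespace visible:

```sql
SELECT concat('[', trimLeft('  satori  '), ']')  AS l,    -- [satori  ]
       concat('[', trimRight('  satori  '), ']') AS r,    -- [  satori]
       concat('[', ltrim('  satori  '), ']')     AS l2;   -- [satori  ]
```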

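CRC32: returns the CRC32 checksum of a string, using the CRC-32-IEEE 802.3 polynomial with an initial value of 0xFFFFFFFF

CRC32IEEE: returns the CRC32 checksum of a string, using the CRC-32-IEEE 802.3 polynomial

CRC64: returns the CRC64 checksum of a string, using the CRC-64-ECMA polynomial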

SELECT CRC32('satori'), CRC32IEEE('satori'), CRC64('satori');
/*
┌─CRC32('satori')─┬─CRC32IEEE('satori')─┬─────CRC64('satori')─┐
│       379058543 │          2807388364 │ 1445885890712067336 │
└─────────────────┴─────────────────────┴─────────────────────┘
*/

encodeXMLComponent: escapes the five characters <, &, >, ", and ' in a string

decodeXMLComponent: reverses that escaping for the same five characters

SELECT encodeXMLComponent('<name>');
/*
┌─encodeXMLComponent('<name>')─┐
│ &lt;name&gt;                 │
└──────────────────────────────┘
*/

SELECT decodeXMLComponent('&lt;name&gt;');
/*
┌─decodeXMLComponent('&lt;name&gt;')─┐
│ <name>                             │
└────────────────────────────────────┘
*/

position: finds the position of a substring within a string

SELECT position('abcdefg', 'de');
/*
┌─position('abcdefg', 'de')─┐
│                         4 │
└───────────────────────────┘
*/

-- The search can also start from a given position
SELECT position('hello world', 'o', 1), position('hello world', 'o', 7);
/*
┌─position('hello world', 'o', 1)─┬─position('hello world', 'o', 7)─┐
│                               5 │                               8 │
└─────────────────────────────────┴─────────────────────────────────┘
*/

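This function is case-sensitive; for a case-insensitive search, use positionCaseInsensitive. Also note that positions are counted in bytes.

position('古明地覺A', 'A') returns 13, because each Chinese character takes 3 bytes in UTF-8

If the string contains Chinese and you want positions counted in characters, use positionUTF8.

positionUTF8('古明地覺A', 'A') returns 5

If the substring is not found, the result is 0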
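The three behaviors can be compared in one query. The 'Hello World' input is made up for the illustration:

```sql
SELECT position('Hello World', 'world')                AS p1,   -- 0: case-sensitive, not found
       positionCaseInsensitive('Hello World', 'world') AS p2,   -- 7
       positionUTF8('古明地覺A', 'A')                   AS p3;   -- 5: counted in characters
```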

multiSearchAllPositions: finds the positions of several substrings within a string; the substrings are passed as an array

SELECT multiSearchAllPositions('satori', ['sa', 'to', 'ri', 'xxx']);
/*
┌─multiSearchAllPositions('satori', ['sa', 'to', 'ri', 'xxx'])─┐
│ [1,3,5,0]                                                    │
└──────────────────────────────────────────────────────────────┘
*/

For a case-insensitive search, use multiSearchAllPositionsCaseInsensitive. Like position, this function searches the byte sequence without regard to character encoding; for non-ASCII text, use multiSearchAllPositionsUTF8.

match: regular-expression matching; returns 1 if the string matches the pattern and 0 otherwise

-- The string goes on the left, the pattern on the right
SELECT match('123', '\\d{1,3}'), match('abcd', '\\d{1,3}');
/*
┌─match('123', '\\d{1,3}')─┬─match('abcd', '\\d{1,3}')─┐
│                        1 │                         0 │
└──────────────────────────┴───────────────────────────┘
*/

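A backslash is itself the escape character in string literals, so to express \d in a pattern you have to write \\d. By the same logic, to test whether a string contains a backslash: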

SELECT match(s, '\\\\');

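The string literal is unescaped first, which turns the four backslashes into two. The backslash is also special in regular expressions, so the regex engine then reads those two as a single literal backslash.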
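Here is the same check with concrete inputs in place of the column s, so the two levels of unescaping can be observed directly:

```sql
-- 'a\\b' is the three-character string a\b once the literal is unescaped
SELECT match('a\\b', '\\\\') AS has_backslash,   -- 1
       match('ab', '\\\\')   AS no_backslash;    -- 0
```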

multiMatchAny: regular-expression matching against several patterns; returns 1 if at least one pattern matches and 0 if none does

SELECT match('satori', 'xx'), match('satori', 'satori');
/*
┌─match('satori', 'xx')─┬─match('satori', 'satori')─┐
│                     0 │                         1 │
└───────────────────────┴───────────────────────────┘
*/

SELECT multiMatchAny('satori', ['xx', 'satori']);
/*
┌─multiMatchAny('satori', ['xx', 'satori'])─┐
│                                         1 │
└───────────────────────────────────────────┘
*/

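multiMatchAnyIndex: regular-expression matching against several patterns; returns the index of a pattern that matches (the ClickHouse documentation promises any matching index, not necessarily the first)

-- 'satori' is the only pattern that matches, and its index is 3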
SELECT multiMatchAnyIndex('satori', ['yy', 'xx', 'satori']);
/*
┌─multiMatchAnyIndex('satori', ['yy', 'xx', 'satori'])─┐
│                                                    3 │
└──────────────────────────────────────────────────────┘
*/

If no pattern matches, the result is 0. Indexes start at 1, so 0 is free to mean "no match"; in programming languages with 0-based indexes, the equivalent sentinel is usually -1.
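The no-match case looks like this. The sketch assumes a ClickHouse build in which the multiMatch functions are available:

```sql
SELECT multiMatchAnyIndex('satori', ['yy', 'xx']) AS idx,   -- 0: no pattern matches
       multiMatchAny('satori', ['yy', 'xx'])      AS hit;   -- 0
```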

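multiMatchAllIndices: regular-expression matching against several patterns; returns the indexes of all patterns that match

-- Patterns 2 and 3 both match, but multiMatchAnyIndex returns only one index (here 2)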
SELECT multiMatchAnyIndex('satori', ['yy', 'sa', 'satori']);
/*
┌─multiMatchAnyIndex('satori', ['yy', 'sa', 'satori'])─┐
│                                                    2 │
└──────────────────────────────────────────────────────┘
*/


-- Return every matching index
SELECT multiMatchAllIndices('satori', ['yy', 'sa', 'satori']);
/*
┌─multiMatchAllIndices('satori', ['yy', 'sa', 'satori'])─┐
│ [2,3]                                                  │
└────────────────────────────────────────────────────────┘
*/

extract: returns the first fragment of a string that matches a regular expression

-- Matching is greedy by default
SELECT extract('satori', '\\w{1,3}');
/*
┌─extract('satori', '\\w{1,3}')─┐
│ sat                           │
└───────────────────────────────┘
*/

-- Non-greedy matching
SELECT extract('satori', '\\w{1,3}?');
/*
┌─extract('satori', '\\w{1,3}?')─┐
│ s                              │
└────────────────────────────────┘
*/

If nothing matches, an empty string is returned.

extractAll: extract returns only the first matching fragment, whereas extractAll returns all of them

SELECT extract('abc abd abe', 'ab.'), extractAll('abc abd abe', 'ab.');
/*
┌─extract('abc abd abe', 'ab.')─┬─extractAll('abc abd abe', 'ab.')─┐
│ abc                           │ ['abc','abd','abe']              │
└───────────────────────────────┴──────────────────────────────────┘
*/

extractAllGroupsHorizontal, extractAllGroupsVertical: extract capture groups; an example is the clearest explanation

SELECT extractAllGroupsHorizontal('2020-01-05 2020-02-21 2020-11-13', 
                                  '(\\d{4})-(\\d{2})-(\\d{2})');
/*
┌─extractAllGroupsHorizontal('2020-01-05 2020-02-21 2020-11-13', '(\\d{4})-(\\d{2})-(\\d{2})')─┐
│ [['2020','2020','2020'],['01','02','11'],['05','21','13']]                                   │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
*/

SELECT extractAllGroupsVertical('2020-01-05 2020-02-21 2020-11-13', 
                                '(\\d{4})-(\\d{2})-(\\d{2})');
/*
┌─extractAllGroupsVertical('2020-01-05 2020-02-21 2020-11-13', '(\\d{4})-(\\d{2})-(\\d{2})')─┐
│ [['2020','01','05'],['2020','02','21'],['2020','11','13']]                                 │
└────────────────────────────────────────────────────────────────────────────────────────────┘
*/

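ClickHouse offers two layouts for group matching. Group matching in most programming languages returns the second, per-match layout. extractAllGroupsVertical is also somewhat faster than extractAllGroupsHorizontal.

When nothing matches, the result contains no values: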

SELECT extractAllGroupsHorizontal('2020-01-05 2020-02-21 2020-11-13', 
                                  '(\\d{10})-(\\d{20})-(\\d{20})');
/*
┌─extractAllGroupsHorizontal('2020-01-05 2020-02-21 2020-11-13', '(\\d{10})-(\\d{20})-(\\d{20})')─┐
│ [[],[],[]]                                                                                      │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
*/

SELECT extractAllGroupsVertical('2020-01-05 2020-02-21 2020-11-13', 
                                '(\\d{10})-(\\d{20})-(\\d{20})');
/*
┌─extractAllGroupsVertical('2020-01-05 2020-02-21 2020-11-13', '(\\d{10})-(\\d{20})-(\\d{20})')─┐
│ []                                                                                            │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
*/

extractAllGroupsHorizontal collects the values of each group, in order, into one array per group. The pattern has three groups, so the result holds three empty arrays.

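like: the WHERE clause has the LIKE operator, but like is also available as a function, and both follow the same rules

-- % matches any number of any characters; _ matches exactly one character
-- \ is the escape character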
SELECT like('satori', 'sa%'), like('satori', 'sa_');

Besides like, there is notLike, as well as the case-insensitive ilike.
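A compact comparison of the three functions and the operator form. The uppercase input is made up to exercise ilike:

```sql
SELECT like('satori', 'sa%')    AS a,   -- 1
       like('satori', 'sa_')    AS b,   -- 0: _ matches exactly one character
       notLike('satori', 'sa%') AS c,   -- 0
       ilike('SATORI', 'sa%')   AS d,   -- 1
       'satori' LIKE 'sa%'      AS e;   -- 1: operator form, same rules
```

The pattern has to cover the whole string, which is why 'sa_' does not match 'satori'.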

ngramDistance: measures how different two strings are, as a value from 0 to 1; the more similar the strings, the closer the result is to 0

SELECT ngramDistance('satori', 'satori')
/*
┌─ngramDistance('satori', 'satori')─┐
│                                 0 │
└───────────────────────────────────┘
*/

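Note: if a non-constant argument is longer than 32 KB, the result is simply 1 and no distance is computed (the ClickHouse documentation adds that a constant argument over 32 KB raises an exception). The comparison is case-sensitive; to ignore case, use ngramDistanceCaseInsensitive. For Chinese and other non-ASCII text, use ngramDistanceUTF8 or ngramDistanceCaseInsensitiveUTF8.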
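The effect of case sensitivity can be sketched by comparing a string with its uppercase form. The exact value of the first column is not shown because it depends on the n-gram computation:

```sql
SELECT ngramDistance('satori', 'SATORI')                AS cs,   -- greater than 0
       ngramDistanceCaseInsensitive('satori', 'SATORI') AS ci;   -- 0
```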

countSubstrings: counts how many times a substring occurs in a string (occurrences do not overlap)

SELECT countSubstrings('aaaa', 'aa'), countSubstrings('abc_abc', 'abc');
/*
┌─countSubstrings('aaaa', 'aa')─┬─countSubstrings('abc_abc', 'abc')─┐
│                             2 │                                 2 │
└───────────────────────────────┴───────────────────────────────────┘
*/

-- Start searching from a given position
SELECT countSubstrings('aabbaa', 'aa'), countSubstrings('aabbaa', 'aa', 3);
/*
┌─countSubstrings('aabbaa', 'aa')─┬─countSubstrings('aabbaa', 'aa', 3)─┐
│                               2 │                                  1 │
└─────────────────────────────────┴────────────────────────────────────┘
*/

countSubstrings is case-sensitive. For a case-insensitive count, use countSubstringsCaseInsensitive; for Chinese and other non-ASCII text, use countSubstringsCaseInsensitiveUTF8.
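A mixed-case input (made up for the illustration) shows the difference between the two variants:

```sql
SELECT countSubstrings('aAbBaa', 'aa')                AS cs,   -- 1: only the trailing 'aa'
       countSubstringsCaseInsensitive('aAbBaa', 'aa') AS ci;   -- 2: 'aA' and 'aa'
```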

countMatches: counts how many times a regular-expression pattern matches in a string

SELECT countSubstrings('aaabbaa', 'aa'), countMatches('aaabbaa', 'a.');
/*
┌─countSubstrings('aaabbaa', 'aa')─┬─countMatches('aaabbaa', 'a.')─┐
│                                2 │                             3 │
└──────────────────────────────────┴───────────────────────────────┘
*/

replaceOne: replaces a given substring of a string, but only its first occurrence

SELECT replaceOne('hello cruel world, cruel', 'cruel', 'beautiful');
/*
┌─replaceOne('hello cruel world, cruel', 'cruel', 'beautiful')─┐
│ hello beautiful world, cruel                                 │
└──────────────────────────────────────────────────────────────┘
*/

To replace every occurrence, use replaceAll:

SELECT replaceAll('hello cruel world, cruel', 'cruel', 'beautiful');
/*
┌─replaceAll('hello cruel world, cruel', 'cruel', 'beautiful')─┐
│ hello beautiful world, beautiful                             │
└──────────────────────────────────────────────────────────────┘
*/

replaceRegexpOne: like replaceOne, replaces only the first match, but the part to replace is given as a regular expression

SELECT replaceRegexpOne('hello cruel world, cruel', 'cru..', 'beautiful');
/*
┌─replaceRegexpOne('hello cruel world, cruel', 'cru..', 'beautiful')─┐
│ hello beautiful world, cruel                                       │
└────────────────────────────────────────────────────────────────────┘
*/

To replace every match, use replaceRegexpAll:

SELECT replaceRegexpAll('hello cruel world, cruel', 'cru..', 'beautiful');
/*
┌─replaceRegexpAll('hello cruel world, cruel', 'cru..', 'beautiful')─┐
│ hello beautiful world, beautiful                                   │
└────────────────────────────────────────────────────────────────────┘
*/
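The regular-expression variants can also reuse capture groups in the replacement string. As I understand the ClickHouse documentation, \1 through \9 refer to the groups and \0 to the whole match. The date below is a made-up input:

```sql
-- Rearrange YYYY-MM-DD into MM/DD/YYYY using the three captured groups
SELECT replaceRegexpOne('2020-01-05', '(\\d{4})-(\\d{2})-(\\d{2})', '\\2/\\3/\\1');
-- 01/05/2020
```

The backslashes are doubled for the same reason as in match: the string literal is unescaped before the replacement template is interpreted.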

splitByChar: splits a string on a given character and returns an array

-- The separator must be a single character
SELECT splitByChar('_', 'ABC_def_fgh');
/*
┌─splitByChar('_', 'ABC_def_fgh')─┐
│ ['ABC','def','fgh']             │
└─────────────────────────────────┘
*/

splitByString: splits a string on a given character or string and returns an array

-- The separator may be a single character or a longer string
SELECT splitByString('_', 'ABC_def_fgh'), splitByString('__', 'ABC__def__fgh');
/*
┌─splitByString('_', 'ABC_def_fgh')─┬─splitByString('__', 'ABC__def__fgh')─┐
│ ['ABC','def','fgh']               │ ['ABC','def','fgh']                  │
└───────────────────────────────────┴──────────────────────────────────────┘
*/

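This shows that splitByString can fully replace splitByChar: it splits on a single character as well as on a longer string, and a single character is just a string in ClickHouse anyway. Since ClickHouse provides both functions, though, my suggestion is to keep using splitByChar when the separator is a single character.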

splitByRegexp: splits a string on a regular-expression pattern and returns an array

SELECT splitByRegexp('\\d+', 'a12bc23de345f');
/*
┌─splitByRegexp('\\d+', 'a12bc23de345f')─┐
│ ['a','bc','de','f']                    │
└────────────────────────────────────────┘
*/

arrayStringConcat: joins the strings in an array, with an optional separator

SELECT arrayStringConcat(['a', 'b', 'c', 'd'], '--');
/*
┌─arrayStringConcat(['a', 'b', 'c', 'd'], '--')─┐
│ a--b--c--d                                    │
└───────────────────────────────────────────────┘
*/
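Splitting and joining compose naturally. This sketch swaps one separator for another by going through an array:

```sql
SELECT arrayStringConcat(splitByChar('_', 'ABC_def_fgh'), '-');   -- ABC-def-fgh
```

For a plain separator swap replaceAll would be simpler, but the array form leaves room to filter or transform the pieces before joining them again.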

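Summary

Strings are one of the most frequently used data types, so there are naturally many functions for working with them, but none of them is hard to use.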
