ClickHouse使用
ClickHouse是一個面向列存儲的OLAP分析數據庫,以其強大的分析速度而聞名。有關ClickHouse的介紹可以參考其官網說明[1]。本文主要介紹它的基本使用。
1. 安裝
使用的環境為2台 AWS EC2,操作系統為Amazon Linux2。使用的ClickHouse為最新的stable版本v21.2.5.5-stable [2]。
export LATEST_VERSION=21.2.5.5 curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-common-static-$LATEST_VERSION.tgz curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-common-static-dbg-$LATEST_VERSION.tgz curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-server-$LATEST_VERSION.tgz curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-client-$LATEST_VERSION.tgz tar -xzvf clickhouse-common-static-$LATEST_VERSION.tgz sudo clickhouse-common-static-$LATEST_VERSION/install/doinst.sh tar -xzvf clickhouse-common-static-dbg-$LATEST_VERSION.tgz sudo clickhouse-common-static-dbg-$LATEST_VERSION/install/doinst.sh tar -xzvf clickhouse-server-$LATEST_VERSION.tgz sudo clickhouse-server-$LATEST_VERSION/install/doinst.sh sudo /etc/init.d/clickhouse-server start tar -xzvf clickhouse-client-$LATEST_VERSION.tgz sudo clickhouse-client-$LATEST_VERSION/install/doinst.sh
2. 初次使用
2.1. 數據
使用官網提供的數據:Yandex.Metrica的匿名數據。它是在ClickHouse成為開源之前作為生產環境運行的第一個服務:
curl https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv curl https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv # 上傳到s3 aws s3 sync ./ s3://xxx-clickhouse/data/
2.2. 建表
與其他數據庫一樣,clickhouse也自帶一個default數據庫。這里先創建一個tutorial數據庫:
ip-10-0-4-69.cn-north-1.compute.internal :) create database if not exists tutorial > CREATE DATABASE IF NOT EXISTS tutorial
建表語句必須指定3個關鍵事情:
- 表名
- 表結構:列名以及對應數據類型
- 表引擎及其設置:決定了對此表的查詢操作是如何在物理層執行的所有細節
Yandex.Metrica 是一個網絡分析服務,樣本數據集不包括其全部功能,因此只有2個表可以創建:
- hits表:包含所有用戶在服務所涵蓋的所有網站上完成的每個操作
- visits表:包含預先構建的會話,而不是單個操作
建表語句:
CREATE TABLE tutorial.hits_v1 ( `WatchID` UInt64, `JavaEnable` UInt8, `Title` String, `GoodEvent` Int16, `EventTime` DateTime, `EventDate` Date, `CounterID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RegionID` UInt32, `UserID` UInt64, `CounterClass` Int8, `OS` UInt8, `UserAgent` UInt8, `URL` String, `Referer` String, `URLDomain` String, `RefererDomain` String, `Refresh` UInt8, `IsRobot` UInt8, `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32), `ResolutionWidth` UInt16, `ResolutionHeight` UInt16, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8, `FlashMinor2` String, `NetMajor` UInt8, `NetMinor` UInt8, `UserAgentMajor` UInt16, `UserAgentMinor` FixedString(2), `CookieEnable` UInt8, `JavascriptEnable` UInt8, `IsMobile` UInt8, `MobilePhone` UInt8, `MobilePhoneModel` String, `Params` String, `IPNetworkID` UInt32, `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `IsArtifical` UInt8, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16, `ClientTimeZone` Int16, `ClientEventTime` DateTime, `SilverlightVersion1` UInt8, `SilverlightVersion2` UInt8, `SilverlightVersion3` UInt32, `SilverlightVersion4` UInt16, `PageCharset` String, `CodeVersion` UInt32, `IsLink` UInt8, `IsDownload` UInt8, `IsNotBounce` UInt8, `FUniqID` UInt64, `HID` UInt32, `IsOldCounter` UInt8, `IsEvent` UInt8, `IsParameter` UInt8, `DontCountHits` UInt8, `WithHash` UInt8, `HitColor` FixedString(1), `UTCEventTime` DateTime, `Age` UInt8, `Sex` UInt8, `Income` UInt8, `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16), `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `WindowName` Int32, `OpenerName` Int32, `HistoryLength` Int16, `BrowserLanguage` FixedString(2), `BrowserCountry` FixedString(2), `SocialNetwork` String, `SocialAction` String, `HTTPError` UInt16, `SendTiming` Int32, `DNSTiming` Int32, `ConnectTiming` Int32, `ResponseStartTiming` Int32, `ResponseEndTiming` Int32, `FetchTiming` Int32, `RedirectTiming` Int32, `DOMInteractiveTiming` Int32, `DOMContentLoadedTiming` Int32, `DOMCompleteTiming` Int32, `LoadEventStartTiming` Int32, `LoadEventEndTiming` Int32, `NSToDOMContentLoadedTiming` Int32, `FirstPaintTiming` Int32, `RedirectCount` Int8, `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `ParamPrice` Int64, `ParamOrderID` String, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16, `GoalsReached` Array(UInt32), `OpenstatServiceName` String, `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String, `UTMSource` String, `UTMMedium` String, `UTMCampaign` String, `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8, `RefererHash` UInt64, `URLHash` UInt64, `CLID` UInt32, `YCLID` UInt64, `ShareService` String, `ShareURL` String, `ShareTitle` String, `ParsedParams` Nested( Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), `IslandID` FixedString(16), `RequestNum` UInt32, `RequestTry` UInt8 ) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) CREATE TABLE tutorial.visits_v1 ( `CounterID` UInt32, `StartDate` Date, `Sign` Int8, `IsNew` UInt8, `VisitID` UInt64, `UserID` UInt64, `StartTime` DateTime, `Duration` UInt32, `UTCStartTime` DateTime, `PageViews` Int32, `Hits` Int32, `IsBounce` UInt8, `Referer` String, `StartURL` String, `RefererDomain` String, `StartURLDomain` String, `EndURL` String, `LinkURL` String, `IsDownload` UInt8, `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `PlaceID` Int32, `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32), `IsYandex` UInt8, `GoalReachesDepth` Int32, `GoalReachesURL` Int32, `GoalReachesAny` Int32, `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `MobilePhoneModel` String, `ClientEventTime` DateTime, `RegionID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `IPNetworkID` UInt32, `SilverlightVersion3` UInt32, `CodeVersion` UInt32, `ResolutionWidth` UInt16, `ResolutionHeight` UInt16, `UserAgentMajor` UInt16, `UserAgentMinor` UInt16, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16, `SilverlightVersion2` UInt8, `SilverlightVersion4` UInt16, `FlashVersion3` UInt16, `FlashVersion4` UInt16, `ClientTimeZone` Int16, `OS` UInt8, `UserAgent` UInt8, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8, `NetMajor` UInt8, `NetMinor` UInt8, `MobilePhone` UInt8, `SilverlightVersion1` UInt8, `Age` UInt8, `Sex` UInt8, `Income` UInt8, `JavaEnable` UInt8, `CookieEnable` UInt8, `JavascriptEnable` UInt8, `IsMobile` UInt8, `BrowserLanguage` UInt16, `BrowserCountry` UInt16, `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16), `Params` Array(String), `Goals` Nested( ID UInt32, Serial UInt32, EventTime DateTime, Price Int64, OrderID String, CurrencyID UInt32), `WatchIDs` Array(UInt64), `ParamSumPrice` Int64, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16, `ClickLogID` UInt64, `ClickEventID` Int32, `ClickGoodEvent` Int32, `ClickEventTime` DateTime, `ClickPriorityID` Int32, `ClickPhraseID` Int32, `ClickPageID` Int32, `ClickPlaceID` Int32, `ClickTypeID` Int32, `ClickResourceID` Int32, `ClickCost` UInt32, `ClickClientIP` UInt32, `ClickDomainID` UInt32, `ClickURL` String, `ClickAttempt` UInt8, `ClickOrderID` UInt32, `ClickBannerID` UInt32, `ClickMarketCategoryID` UInt32, `ClickMarketPP` UInt32, `ClickMarketCategoryName` String, `ClickMarketPPName` String, `ClickAWAPSCampaignName` String, `ClickPageName` String, `ClickTargetType` UInt16, `ClickTargetPhraseID` UInt64, `ClickContextType` UInt8, `ClickSelectType` Int8, `ClickOptions` String, `ClickGroupBannerID` Int32, `OpenstatServiceName` String, `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String, `UTMSource` String, `UTMMedium` String, `UTMCampaign` String, `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8, `FirstVisit` DateTime, `PredLastVisit` Date, `LastVisit` Date, `TotalVisits` UInt32, `TraficSource` Nested( ID Int8, SearchEngineID UInt16, AdvEngineID UInt8, PlaceID UInt16, SocialSourceNetworkID UInt8, Domain String, SearchPhrase String, SocialSourcePage String), `Attendance` FixedString(16), `CLID` UInt32, `YCLID` UInt64, `NormalizedRefererHash` UInt64, `SearchPhraseHash` UInt64, `RefererDomainHash` UInt64, `NormalizedStartURLHash` UInt64, `StartURLDomainHash` UInt64, `NormalizedEndURLHash` UInt64, `TopLevelDomain` UInt64, `URLScheme` UInt64, `OpenstatServiceNameHash` UInt64, `OpenstatCampaignIDHash` UInt64, `OpenstatAdIDHash` UInt64, `OpenstatSourceIDHash` UInt64, `UTMSourceHash` UInt64, `UTMMediumHash` UInt64, `UTMCampaignHash` UInt64, `UTMContentHash` UInt64, `UTMTermHash` UInt64, `FromHash` UInt64, `WebVisorEnabled` UInt8, `WebVisorActivity` UInt32, `ParsedParams` Nested( Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), `Market` Nested( Type UInt8, GoalID UInt32, OrderID String, OrderPrice Int64, PP UInt32, DirectPlaceID UInt32, DirectOrderID UInt32, DirectBannerID UInt32, GoodID String, GoodName String, GoodQuantity Int32, GoodPrice Int64), `IslandID` FixedString(16) ) ENGINE = CollapsingMergeTree(Sign) PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID) SAMPLE BY intHash32(UserID)
可以看到,hits_v1 使用的是MergeTree引擎;visits_v1使用的是Collapsing引擎。兩者的partition格式均為toYYYYMM(EventDate)。
2.3. 導入數據並查詢
導入本地數據:
clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv clickhouse-client --query "INSERT INTO tutorial.visits_v1 FORMAT TSV" --max_insert_block_size=100000 < visits_v1.tsv
優化表:
OPTIMIZE TABLE tutorial.hits_v1 FINAL
OPTIMIZE TABLE tutorial.visits_v1 FINAL
示例查詢:
SELECT StartURL AS URL, AVG(Duration) AS AvgDuration FROM tutorial.visits_v1 WHERE (StartDate >= '2014-03-23') AND (StartDate <= '2014-03-30') GROUP BY URL ORDER BY AvgDuration DESC LIMIT 10 10 rows in set. Elapsed: 0.088 sec. Processed 1.45 million rows, 114.85 MB (16.56 million rows/s., 1.31 GB/s.)
SELECT sum(Sign) AS visits, sumIf(Sign, has(Goals.ID, 1105530)) AS goal_visits, (100. * goal_visits) / visits AS goal_percent FROM tutorial.visits_v1 WHERE (CounterID = 912887) AND (toYYYYMM(StartDate) = 201403) AND (domain(StartURL) = 'yandex.ru') 1 rows in set. Elapsed: 0.012 sec. Processed 13.05 thousand rows, 2.88 MB (1.10 million rows/s., 242.38 MB/s.)
從返回速度來看,基本上是立即返回,處理時間僅用 0.088 和 0.012 秒。
3. 為什么ClickHouse如此快
從各種公開文檔來看,ClickHouse如此之快的原因主要有2點:
- 列式存儲數據庫
- 使用向量化引擎
3.1. 列式存儲
列式存儲與行式存儲的區別已經有大量公開文檔進行詳細說明,在此不再贅述。簡單來說,列式存儲的優勢在於:
- 只提取所需要的列的信息,避免了掃描不需要的其他列信息
- 對數據壓縮的友好型:因為同一列擁有同樣的數據類型和現實語義,重復項的可能性更高
這兩點優勢提供的是:
- 減少了數據掃描范圍:有效減少了所需掃描的數據量
- 減少了數據傳輸的大小:數據壓縮率越高,則數據體量越小,在網絡中傳輸的數據量更少,所以對網絡帶寬和磁盤IO的壓力也就越小,速度也就越快。
3.2. 向量化執行
向量化執行是合理利用CPU指令集的方式,它的必備條件是CPU支持SIMD(Single Instruction Multiple Data)指令,此指令的作用是:單條指定一次性操作多條數據。
在Stack Overflow[3] 上對此有一個較為具體的說明:
許多CPU都有“vector”或“SIMD”指令集,可以將同一個操作同時應用到2條、4條或是更多的數據條目上。向量化(vectorization)就是重寫循環的操作,在一個循環中(例如while循環),一個長度為N的數組需要循環N次才能處理完。若是使用向量化操作,假設它一次能夠處理4條數據,則對於長度為N的數據,僅需要N/4的時間即能處理完畢。
更具體的例子是,假設有以下循環語句:
for (int i=0; i<16; ++i) C[i] = A[i] + B[i];
傳統處理方式是:一次循環處理A[i] 與 B[i] 相加,並賦值給C[i]。
此循環可以繼續展開為:
for (int i=0; i<16; i+=4) { C[i] = A[i] + B[i]; C[i+1] = A[i+1] + B[i+1]; C[i+2] = A[i+2] + B[i+2]; C[i+3] = A[i+3] + B[i+3]; }
對此使用向量化操作,則可以表示為:
for (int i=0; i<16; i+=4) addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);
此處addFourThingsAtOnceAndStoreResult() 為一個向量化操作,可以在一次循環中,同時處理4條數據。若是大家有了解過python 中 numpy的向量化操作,相信對此會有更深的了解。
3.3. 持續優化
根據朱凱[4] 在其書中提到的觀點,ClickHouse如此之快的原因還包含:
- 開發人員注意到各種影響性能的細節,並進行優化,一點一滴的積累,使得性能越來越好
- 針對不同場景使用了最優的算法,使性能最優化
- 若是出現了更合適、更快的算法,開發人員會立即進行驗證,若是效果理想則保留使用,否則將其拋棄
- ClickHouse更新迭代非常頻繁,開發人員一直在對此進行不斷的改進,追求更佳的性能
References
[1] https://clickhouse.tech/docs/zh/
[2] https://github.com/ClickHouse/ClickHouse/releases/tag/v21.1.8.30-stable
[3] https://stackoverflow.com/questions/1422149/what-is-vectorization
[4] 朱凱,ClickHouse原理解析與應用實踐,機器工業出版社,2020年