Step by Step: Migrating a ClickHouse Cluster to the Cloud

Reposted article, originally published 2020-09-29 18:08. Tags: clickhouse, big data

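Preface

As ClickHouse cloud services mature, more and more users are migrating their self-hosted ClickHouse deployments to the cloud. We choose a different approach depending on the data size:

For small tables, typically under 10 GB, export the data as CSV and write it back into the cloud cluster;
For larger tables, use clickhouse-copier, the tool bundled with the ClickHouse distribution.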
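
For the small-table path, the export and re-import can be sketched with clickhouse-client as below. This is a minimal sketch, not a definitive procedure: the helper names export_table_csv and import_table_csv, the CH_CLIENT variable, and the addresses in the example are assumptions made for illustration, and the destination table is assumed to exist already with a matching schema.

```shell
#!/usr/bin/env bash
# Client binary; overridable so the helpers can be tried without a server.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# export_table_csv HOST DB.TABLE OUTFILE
# Dump the whole table from HOST into a CSV file.
export_table_csv() {
    local host="$1" table="$2" outfile="$3"
    "$CH_CLIENT" --host "$host" --query "SELECT * FROM ${table} FORMAT CSV" > "$outfile"
}

# import_table_csv HOST DB.TABLE INFILE
# Load a CSV file into an existing table on HOST.
import_table_csv() {
    local host="$1" table="$2" infile="$3"
    "$CH_CLIENT" --host "$host" --query "INSERT INTO ${table} FORMAT CSV" < "$infile"
}

# Example (hypothetical addresses):
#   export_table_csv 172.16.0.72  default.lineorder /tmp/lineorder.csv
#   import_table_csv 172.16.0.115 default.lineorder /tmp/lineorder.csv
```

Inserting through a single host writes only to that host's local table. For a sharded destination, inserting into a Distributed table is one way to spread the rows across shards.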

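This article explains in detail how to use clickhouse-copier to migrate data between ClickHouse clusters (the same approach also works for moving data between tables inside a single cluster).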

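1. Preparing the ZooKeeper Cluster

If you already have a ZooKeeper cluster, skip this section.

clickhouse-copier stores its migration task information in ZooKeeper, so a ZooKeeper cluster has to be deployed.

Make sure the network between the ZooKeeper cluster and both the source and destination ClickHouse clusters is working.

In this article we deploy a single-node ZooKeeper cluster.

Step 1: Prepare the binaries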

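# From ZooKeeper 3.5.5 on, the pre-built release is the -bin tarball; the plain tarball contains only the sources.
$ wget http://apache.is.co.za/zookeeper/zookeeper-3.6.1/apache-zookeeper-3.6.1-bin.tar.gz
$ tar -xvf apache-zookeeper-3.6.1-bin.tar.gz
$ chown -R hadoop:hadoop apache-zookeeper-3.6.1-bin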

Step 2: Switch to the hadoop account

su hadoop

Step 3: Prepare the configuration file conf/zoo.cfg and fill in the settings, for example:

tickTime=2000
dataDir=/var/data/zookeeper
clientPort=2181

Step 4: Add the myid file

echo 1 > /var/data/zookeeper/myid
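
Steps 3 and 4 can also be scripted together. The sketch below writes the same three settings and the myid file, and creates the data directory first, since the redirect in step 4 fails if that directory does not exist yet. The function name write_zk_config and its arguments are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# write_zk_config CONF_DIR DATA_DIR [MYID]
# Writes CONF_DIR/zoo.cfg with the settings from step 3
# and DATA_DIR/myid from step 4 (MYID defaults to 1).
write_zk_config() {
    local conf_dir="$1" data_dir="$2" myid="${3:-1}"
    mkdir -p "$conf_dir" "$data_dir" || return 1
    cat > "$conf_dir/zoo.cfg" <<EOF
tickTime=2000
dataDir=$data_dir
clientPort=2181
EOF
    echo "$myid" > "$data_dir/myid"
}

# Example, run from the ZooKeeper install directory:
#   write_zk_config conf /var/data/zookeeper 1
```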

Step 5: Start the ZooKeeper process

$ bin/zkServer.sh start

From here on, we can use this ZooKeeper instance to store the migration task information.

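2. Defining the Migration Task

Before any data is migrated, the migration task has to be defined. The task is described in an XML file that contains the following information:

The source cluster, including its shard layout
The destination cluster, including its shard layout
The number of workers that execute the migration task
The tables to migrate, specified in the tables field, including:

- Source cluster name, specified by cluster_pull
- Source database name, specified by database_pull
- Source table name, specified by table_pull
- Destination cluster name, specified by cluster_push
- Destination database name, specified by database_push
- Destination table name, specified by table_push
- Destination table engine definition, specified by engine
- List of partitions to migrate, specified by enabled_partitions. If it is not specified, the whole table is migrated

If the destination database does not exist, it is not created automatically, so make sure it exists on the destination cluster before migrating. The source and destination tables must have the same schema; their table engines may differ.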
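
Since the destination database is not created automatically, one way to prepare it is to run CREATE DATABASE IF NOT EXISTS on every destination host before starting the copy. The helper below is a small sketch of that; the name ensure_database and the CH_CLIENT variable are assumptions, and the addresses in the example are the destination replicas from this article's task definition.

```shell
#!/usr/bin/env bash
# Client binary; overridable so the helper can be tried without a server.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# ensure_database DATABASE HOST...
# Run CREATE DATABASE IF NOT EXISTS on every listed host.
# Stops and returns non-zero at the first host that fails.
ensure_database() {
    local db="$1" host
    shift
    for host in "$@"; do
        "$CH_CLIENT" --host "$host" \
            --query "CREATE DATABASE IF NOT EXISTS ${db}" || return 1
    done
}

# Example:
#   ensure_database default 172.16.0.115 172.16.0.47 172.16.0.138 172.16.0.49
```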

An example task definition:

<yandex>
    <!-- Configuration of clusters as in an ordinary server config -->
    <remote_servers>
        <source_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                    <replica>
                        <host>172.16.0.72</host>
                        <port>9000</port>
                    </replica>
            </shard>
        </source_cluster>

        <destination_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                    <replica>
                        <host>172.16.0.115</host>
                        <port>9000</port>
                    </replica>
                    <replica>
                        <host>172.16.0.47</host>
                        <port>9000</port>
                    </replica>
            </shard>
            <shard>
                <internal_replication>false</internal_replication>
                    <replica>
                        <host>172.16.0.138</host>
                        <port>9000</port>
                    </replica>
                    <replica>
                        <host>172.16.0.49</host>
                        <port>9000</port>
                    </replica>
            </shard>
        </destination_cluster>
    </remote_servers>

    <!-- How many simultaneously active workers are possible. If you run more workers superfluous workers will sleep. -->
    <max_workers>8</max_workers>

    <!-- Setting used to fetch (pull) data from source cluster tables -->
    <settings_pull>
        <readonly>1</readonly>
    </settings_pull>

    <!-- Setting used to insert (push) data to destination cluster tables -->
    <settings_push>
        <readonly>0</readonly>
    </settings_push>

    <settings>
        <connect_timeout>300</connect_timeout>
        <!-- Sync insert is set forcibly, leave it here just in case. -->
        <insert_distributed_sync>1</insert_distributed_sync>
    </settings>

    <tables>
        <!-- A table task, copies one table. -->
        <table_lineorder>
            <!-- Source cluster name (from <remote_servers/> section) and tables in it that should be copied -->
            <cluster_pull>source_cluster</cluster_pull>
            <database_pull>default</database_pull>
            <table_pull>lineorder</table_pull>

            <!-- Destination cluster name and tables in which the data should be inserted -->
            <cluster_push>destination_cluster</cluster_push>
            <database_push>default</database_push>
            <table_push>lineorder_7</table_push>

            <engine>
            ENGINE=ReplicatedMergeTree('/clickhouse/tables/{shard}/lineorder_7','{replica}')
            PARTITION BY toYear(LO_ORDERDATE)
            ORDER BY (LO_ORDERDATE, LO_ORDERKEY)
            </engine>

            <!-- Sharding key used to insert data to destination cluster -->
            <sharding_key>rand()</sharding_key>

            <!-- Optional expression that filter data while pull them from source servers -->
            <!-- <where_condition></where_condition> -->
           <!--
            <enabled_partitions>
            </enabled_partitions>
           -->
        </table_lineorder>
    </tables>
</yandex>
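
The enabled_partitions block is commented out in the task above, so the whole table is copied. To restrict the copy to some partitions, it helps to first list the partitions that exist on the source table. One way, sketched below, is to query the system.parts table with clickhouse-client; the helper name list_partitions and the CH_CLIENT variable are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Client binary; overridable so the helper can be tried without a server.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# list_partitions HOST DATABASE TABLE
# Print the distinct partitions that have active parts on HOST.
list_partitions() {
    local host="$1" db="$2" table="$3"
    "$CH_CLIENT" --host "$host" --query "
        SELECT DISTINCT partition
        FROM system.parts
        WHERE database = '${db}' AND table = '${table}' AND active
        ORDER BY partition"
}

# Example with the source host from the task above:
#   list_partitions 172.16.0.72 default lineorder
```

Each returned value can then be listed as its own partition element inside enabled_partitions. As far as we know, the copier expects the names in the same format as the partition column of system.parts.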

Once the task file is ready, create the path in ZooKeeper and upload the task definition to it. Assuming the file is named task.xml, run the following commands:

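# Parent znodes must exist before their children; ignore the error from the first command if /clickhouse already exists.
$ bin/zkCli.sh create /clickhouse ""
$ bin/zkCli.sh create /clickhouse/copytasks ""
$ bin/zkCli.sh create /clickhouse/copytasks/task ""
$ bin/zkCli.sh create /clickhouse/copytasks/task/description "$(cat ./task.xml)"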
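
Because ZooKeeper's create does not create missing parent znodes, every level of the task path has to be created in order. For a deeper or differently named task path, the sequence of znodes can be derived from the path itself. The helper below is a small sketch with a hypothetical name, zk_path_prefixes.

```shell
#!/usr/bin/env bash
# zk_path_prefixes /a/b/c
# Print /a, /a/b and /a/b/c, one per line: the order in which
# the znodes have to be created.
zk_path_prefixes() {
    local path="${1#/}" prefix="" part
    local IFS='/'
    for part in $path; do
        prefix="${prefix}/${part}"
        printf '%s\n' "$prefix"
    done
}

# Example: create each level with empty data, then the description
# node with the task file as its content.
#   for p in $(zk_path_prefixes /clickhouse/copytasks/task); do
#       bin/zkCli.sh create "$p" ""
#   done
#   bin/zkCli.sh create /clickhouse/copytasks/task/description "$(cat ./task.xml)"
```

Creating a znode that already exists only reports an error for that level, so the loop can be re-run safely.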

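3. Starting the Task

With the migration task defined, clickhouse-copier can be started to migrate the data. Before that, prepare a configuration file that describes the ZooKeeper address and the logging settings. For example: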

<yandex>
    <logger>
        <level>trace</level>
        <size>100M</size>
        <count>3</count>
    </logger>

    <zookeeper>
        <node index="1">
            <host>172.16.0.139</host>
            <port>2181</port>
        </node>
    </zookeeper>
</yandex>

Assume this file is named config.xml.

Start clickhouse-copier with the following command:

$ clickhouse-copier \
  --config ./config.xml \
  --task-path /clickhouse/copytasks/task \
  --base-dir ./clickhouse

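Here, --task-path is the ZooKeeper path of the migration task, that is, the path created in Section 2. Note that this path must contain a description node.

If there is a lot of data, several clickhouse-copier instances can be deployed to run the migration task concurrently.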
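
One way to run several instances is to launch them as background processes that share the same task path. The sketch below does this on a single machine; the function name start_copiers, the COPIER variable, and the per-worker base directories are assumptions made for illustration, while the three flags are the ones used in the command above.

```shell
#!/usr/bin/env bash
# Copier binary; overridable so the helper can be tried without ClickHouse.
COPIER="${COPIER:-clickhouse-copier}"

# start_copiers COUNT CONFIG TASK_PATH BASE_DIR
# Launch COUNT copier processes in the background, each with its own
# base directory, and wait for all of them.
# Returns non-zero if any worker exited with an error.
start_copiers() {
    local count="$1" config="$2" task_path="$3" base_dir="$4"
    local i=1 pid pids="" status=0
    while [ "$i" -le "$count" ]; do
        mkdir -p "${base_dir}/worker-${i}"
        "$COPIER" --config "$config" \
                  --task-path "$task_path" \
                  --base-dir "${base_dir}/worker-${i}" &
        pids="$pids $!"
        i=$((i + 1))
    done
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Example:
#   start_copiers 4 ./config.xml /clickhouse/copytasks/task ./clickhouse
```

The max_workers setting in the task definition caps how many workers are active at once, so starting more instances than that value only leaves the extra ones sleeping.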

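Summary

clickhouse-copier ships with the ClickHouse distribution, which gives it a solid footing in terms of stability and reliability. Points to watch when using it:

Writes to the source tables must be stopped while the migration is running;
The migration consumes network bandwidth on both the source and destination clusters, so assess the impact carefully;
clickhouse-copier offers a good deal of flexibility, including the choice of sharding key and the selection of which partitions of a table to migrate.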
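
Because writes to the source table are stopped during the migration, a simple sanity check afterwards is to compare row counts on both sides. The sketch below sums count() over a list of hosts; the helper name sum_row_counts and the CH_CLIENT variable are assumptions, and the example uses the table names and one replica per shard from this article's task definition.

```shell
#!/usr/bin/env bash
# Client binary; overridable so the helper can be tried without a server.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# sum_row_counts DB.TABLE HOST...
# Sum count() of the local table over the given hosts.
# Pass one replica per shard so replicated rows are not counted twice.
sum_row_counts() {
    local table="$1" host n total=0
    shift
    for host in "$@"; do
        n=$("$CH_CLIENT" --host "$host" \
            --query "SELECT count() FROM ${table}") || return 1
        total=$((total + n))
    done
    printf '%s\n' "$total"
}

# Example:
#   src=$(sum_row_counts default.lineorder   172.16.0.72)
#   dst=$(sum_row_counts default.lineorder_7 172.16.0.115 172.16.0.138)
#   [ "$src" = "$dst" ] && echo "row counts match"
```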