Symptom

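While using Debezium to extract a large table with on the order of ten million rows, the snapshot phase synced only about 10,000 rows every 2 seconds (roughly 5,000 rows/s), and young GC messages were being printed at the same time.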

Root cause analysis

Network

First, rule out network latency: pinging the target host showed a latency of about 0.1 ms, so the network is not the bottleneck.

Debezium itself

Next, look at the SnapshotReader source code:

// Scan the rows in the table ...
long start = clock.currentTimeInMillis();
logger.info("Step {}: - scanning table '{}' ({} of {} tables)", step, tableId, ++counter, capturedTableIds.size());

Map<TableId, String> selectOverrides = context.getConnectorConfig().getSnapshotSelectOverridesByTable();

String selectStatement = selectOverrides.getOrDefault(tableId, "SELECT * FROM " + quote(tableId));
logger.info("For table '{}' using select statement: '{}'", tableId, selectStatement);
sql.set(selectStatement);

try {
    int stepNum = step;
    mysql.query(sql.get(), statementFactory, rs -> {
        try {
            // The table is included in the connector's filters, so process all of the table records
            // ...
            final Table table = schema.tableFor(tableId);
            final int numColumns = table.columns().size();
            final Object[] row = new Object[numColumns];
            while (rs.next()) {
                for (int i = 0, j = 1; i != numColumns; ++i, ++j) {
                    Column actualColumn = table.columns().get(i);
                    row[i] = readField(rs, j, actualColumn, table);
                }
                recorder.recordRow(recordMaker, row, clock.currentTimeAsInstant()); // has no row number!
                rowNum.incrementAndGet();
                if (rowNum.get() % 100 == 0 && !isRunning()) {
                    // We've stopped running ...
                    break;
                }
                if (rowNum.get() % 10_000 == 0) {
                    if (logger.isInfoEnabled()) {
                        long stop = clock.currentTimeInMillis();
                        logger.info("Step {}: - {} of {} rows scanned from table '{}' after {}",
                                stepNum, rowNum, rowCountStr, tableId, Strings.duration(stop - start));
                    }
                    metrics.rowsScanned(tableId, rowNum.get());
                }
            }

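So by default it runs a SELECT * against the table, counts the whole table in memory, and only then iterates over the rows and sends the data. Holding that much data in memory would also be consistent with the young GC messages.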
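The difference between buffering a result set and streaming it can be sketched as below. This is a minimal illustration, not Debezium's actual code: the class SnapshotFetchPlan and its methods are hypothetical, and the threshold rule (0 means always stream, otherwise stream only when the estimated row count exceeds the threshold) is an assumption based on the connector documentation. The one real convention relied on is MySQL Connector/J's: a forward-only, read-only statement with fetch size Integer.MIN_VALUE streams rows one at a time, while a default statement reads the complete result set into memory.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, for illustration only; not part of Debezium.
final class SnapshotFetchPlan {

    // Assumed rule: a threshold of 0 means "always stream"; otherwise stream
    // only when the estimated row count exceeds the threshold.
    static boolean shouldStream(long estimatedRows, long threshold) {
        return threshold == 0 || estimatedRows > threshold;
    }

    // Default statement: Connector/J buffers the complete result set in memory.
    static Statement bufferedStatement(Connection conn) throws SQLException {
        return conn.createStatement();
    }

    // Streaming statement: forward-only, read-only, fetch size Integer.MIN_VALUE
    // tells Connector/J to hand rows over one at a time.
    static Statement streamingStatement(Connection conn) throws SQLException {
        Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                              ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);
        return stmt;
    }

    // Pick the statement type for one table based on its estimated size.
    static Statement statementFor(Connection conn, long estimatedRows, long threshold)
            throws SQLException {
        return shouldStream(estimatedRows, threshold)
                ? streamingStatement(conn)
                : bufferedStatement(conn);
    }
}
```

With a streaming statement the heap only has to hold the rows currently in flight, which is why it suits very large tables.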

Solution

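Searching the official documentation for snapshot-related settings turned up the parameter min.row.count.to.stream.results. According to the documentation, it sets the minimum number of rows a table must contain before the connector streams the results instead of pulling them all into memory; setting it to 0 skips the table size check and always streams.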

https://debezium.io/documentation/reference/1.4/connectors/mysql.html#mysql-property-min-row-count-to-stream-results
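As a rough sketch, the property could be added to the connector configuration like this. The property names are the documented Debezium MySQL connector ones; the class name SnapshotConfigExample, the hostname, and the value "0" are assumptions made for illustration, since the value actually used is not given here.

```java
import java.util.Properties;

// Hypothetical example class; only the property keys come from the Debezium docs.
final class SnapshotConfigExample {

    static Properties connectorConfig() {
        Properties props = new Properties();
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        // Placeholder host and port, not taken from the original setup.
        props.setProperty("database.hostname", "mysql.example.internal");
        props.setProperty("database.port", "3306");
        // Assumed value: per the documentation, 0 skips the table size check
        // and always streams snapshot results.
        props.setProperty("min.row.count.to.stream.results", "0");
        return props;
    }

    public static void main(String[] args) {
        // Print the configuration as key=value pairs.
        connectorConfig().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

When the connector is registered through the Kafka Connect REST API, the same keys would go into the "config" object of the JSON request instead.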

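After setting this parameter, the log no longer showed the SELECT * step; rows were sent directly in batches of 10,000, and the speed improved 20-fold.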
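To put the numbers in perspective, here is a small back-of-the-envelope calculation. The 10,000 rows per 2 seconds and the 20-fold speedup come from the observations above; the table size of exactly 10,000,000 rows is an assumption standing in for "on the order of ten million", and the class name SnapshotThroughput is made up for this sketch.

```java
// Hypothetical helper for rough snapshot-time estimates.
final class SnapshotThroughput {

    // Rows processed per second for a given batch size and batch duration.
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    // Seconds needed to scan totalRows at the given rate.
    static double secondsFor(long totalRows, double rowsPerSecond) {
        return totalRows / rowsPerSecond;
    }

    public static void main(String[] args) {
        long totalRows = 10_000_000L;               // assumed table size
        double before = rowsPerSecond(10_000, 2.0); // observed: 10,000 rows per 2 s
        double after = before * 20;                 // reported 20-fold improvement

        System.out.println("before: " + before + " rows/s, "
                + secondsFor(totalRows, before) + " s for the snapshot");
        System.out.println("after:  " + after + " rows/s, "
                + secondsFor(totalRows, after) + " s for the snapshot");
    }
}
```

Under these assumptions the snapshot drops from about 2,000 seconds (roughly 33 minutes) at 5,000 rows/s to about 100 seconds at 100,000 rows/s.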
