Optimizing Debezium snapshots of large tables


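Symptom

  While Debezium was snapshotting a table with tens of millions of rows, the snapshot advanced at only about 10,000 rows every 2 s, and the log also showed young GC messages.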
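To put the observed rate in perspective, here is a rough back-of-the-envelope sketch. The table size of 10,000,000 rows is an assumed round figure for "tens of millions"; the class and method names are hypothetical.

```java
public class SnapshotEstimate {
    // Rows per second, given the number of rows logged per interval.
    public static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    // Estimated seconds needed to scan a table at a steady rate.
    public static double secondsToScan(long totalRows, double rowsPerSecond) {
        return totalRows / rowsPerSecond;
    }

    public static void main(String[] args) {
        double rate = rowsPerSecond(10_000, 2.0);   // observed: 10,000 rows every 2 s
        long totalRows = 10_000_000L;               // assumed table size
        double minutes = secondsToScan(totalRows, rate) / 60.0;
        System.out.println("rate = " + rate + " rows/s");
        System.out.println("estimated snapshot time = " + minutes + " min");
    }
}
```

At 5,000 rows/s, a 10-million-row table would take roughly 33 minutes to scan, and proportionally longer for larger tables.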

Root cause analysis

Network

  First, rule out network latency: pinging the target host showed a latency of about 0.1 ms.

Debezium itself

  Looking at the SnapshotReader source:

// Scan the rows in the table ...
                            long start = clock.currentTimeInMillis();
                            logger.info("Step {}: - scanning table '{}' ({} of {} tables)", step, tableId, ++counter, capturedTableIds.size());

                            Map<TableId, String> selectOverrides = context.getConnectorConfig().getSnapshotSelectOverridesByTable();

                            String selectStatement = selectOverrides.getOrDefault(tableId, "SELECT * FROM " + quote(tableId));
                            logger.info("For table '{}' using select statement: '{}'", tableId, selectStatement);
                            sql.set(selectStatement);

                            try {
                                int stepNum = step;
                                mysql.query(sql.get(), statementFactory, rs -> {
                                    try {
                                        // The table is included in the connector's filters, so process all of the table records
                                        // ...
                                        final Table table = schema.tableFor(tableId);
                                        final int numColumns = table.columns().size();
                                        final Object[] row = new Object[numColumns];
                                        while (rs.next()) {
                                            for (int i = 0, j = 1; i != numColumns; ++i, ++j) {
                                                Column actualColumn = table.columns().get(i);
                                                row[i] = readField(rs, j, actualColumn, table);
                                            }
                                            recorder.recordRow(recordMaker, row, clock.currentTimeAsInstant()); // has no row number!
                                            rowNum.incrementAndGet();
                                            if (rowNum.get() % 100 == 0 && !isRunning()) {
                                                // We've stopped running ...
                                                break;
                                            }
                                            if (rowNum.get() % 10_000 == 0) {
                                                if (logger.isInfoEnabled()) {
                                                    long stop = clock.currentTimeInMillis();
                                                    logger.info("Step {}: - {} of {} rows scanned from table '{}' after {}",
                                                            stepNum, rowNum, rowCountStr, tableId, Strings.duration(stop - start));
                                                }
                                                metrics.rowsScanned(tableId, rowNum.get());
                                            }
                                        }

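So by default the reader issues a SELECT * against the table, performs a count over the whole table in memory, and then iterates over the rows to send them.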
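The Debezium documentation describes a row-count threshold that decides whether snapshot query results are held in memory or streamed. The sketch below illustrates that decision only; it is not the connector's actual code. The class, enum, and method names are hypothetical, and the exact boundary comparison is an assumption.

```java
public class StreamDecision {
    public enum Mode { BUFFER_IN_MEMORY, STREAM }

    // Hypothetical helper: pick a read mode from an estimated row count and
    // the configured threshold (min.row.count.to.stream.results).
    public static Mode chooseMode(long estimatedRows, long minRowCountToStream) {
        // Per the documentation, a threshold of 0 skips all size checks
        // and always streams.
        if (minRowCountToStream <= 0) {
            return Mode.STREAM;
        }
        // Assumption: tables at or below the threshold are buffered in memory.
        return estimatedRows <= minRowCountToStream
                ? Mode.BUFFER_IN_MEMORY
                : Mode.STREAM;
    }

    public static void main(String[] args) {
        // The threshold values here are illustrative only.
        System.out.println(chooseMode(500, 1_000));
        System.out.println(chooseMode(10_000_000, 1_000));
        System.out.println(chooseMode(500, 0));
    }
}
```

With MySQL Connector/J, streaming a result set row by row is requested with a forward-only, read-only statement whose fetch size is Integer.MIN_VALUE. Otherwise the driver loads the full result set into memory, which would be consistent with the GC activity seen in the log.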

 

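Solution

Searching the snapshot settings in the official documentation turns up the parameter min.row.count.to.stream.results, which sets the minimum number of rows a table must contain before the connector streams its results instead of loading them into memory: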

https://debezium.io/documentation/reference/1.4/connectors/mysql.html#mysql-property-min-row-count-to-stream-results

 

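  After setting this parameter, the log no longer showed the SELECT *; rows were sent directly in batches of 10,000, and throughput improved roughly 20-fold.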
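A minimal sketch of how the property might be set when building the connector configuration in Java. The post does not say which value was used; 0 (which the documentation describes as "always stream") is an assumption here, and the class and method names are hypothetical. The remaining connector settings (host, credentials, server name, and so on) are omitted.

```java
import java.util.Properties;

public class ConnectorConfigSketch {
    // Builds only the snapshot-related part of a MySQL connector configuration.
    public static Properties snapshotProps() {
        Properties props = new Properties();
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        // Assumed value: 0 skips table size checks and always streams results.
        props.setProperty("min.row.count.to.stream.results", "0");
        return props;
    }

    public static void main(String[] args) {
        snapshotProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

When the connector is registered through the Kafka Connect REST API instead, the same key and value go into the "config" object of the JSON payload.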

