Switching Kylin's build engine from MR to Spark

Reposted article, 2018-03-12 08:45

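Background:

As the number of Kylin cubes behind our production workloads keeps growing, and the data volume grows with time, builds drag on longer and longer (the more jobs run at once, the longer each MR job takes, so we capped the number of concurrent MR jobs).

This delays the point at which data becomes available. The current requirement is to see data from within the last hour, rather than the T-1 freshness we had early on.

To address this we made four optimizations: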

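First, we changed the automatic build script: the first run of the day is a build, and each later run rebuilds the current day (so that it includes the day's latest data).

Second, the first build of the day also rebuilds the previous day (in case the last few tens of minutes of data were not picked up by yesterday's final build).

Third, we set the build interval to between 10 and 30 minutes; the shorter interval makes data visible sooner.

Fourth, we switched Kylin's build engine from MR to Spark, which noticeably sped up builds and made data visible sooner.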
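
The scheduling rule in the first two points can be sketched as a small shell helper. This is only an illustration under assumptions: the function names `day_start_ms` and `build_plan` are hypothetical, segment boundaries are taken to be UTC days expressed in epoch milliseconds, and the caller is assumed to know how many builds have already run today. Each output line gives a build type (`BUILD` or `REFRESH`, the `buildType` values accepted by Kylin's REST build request) with a start and end time; actually submitting the jobs is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the build/rebuild rule; not the author's script.

DAY_MS=86400000

# day_start_ms NOW_MS -> start of the UTC day that contains NOW_MS
day_start_ms() {
  local now_ms=$1
  echo $(( now_ms - now_ms % DAY_MS ))
}

# build_plan NOW_MS BUILDS_DONE_TODAY
# Prints one line per job to submit: BUILD_TYPE START_MS END_MS
build_plan() {
  local now_ms=$1 builds_done_today=$2
  local today_start
  today_start=$(day_start_ms "$now_ms")
  local today_end=$(( today_start + DAY_MS ))
  if [ "$builds_done_today" -eq 0 ]; then
    # First run of the day: rebuild yesterday so its last minutes are included,
    # then build today for the first time.
    echo "REFRESH $(( today_start - DAY_MS )) $today_start"
    echo "BUILD $today_start $today_end"
  else
    # Later runs: rebuild today so it picks up the newest data.
    echo "REFRESH $today_start $today_end"
  fi
}
```

For example, `build_plan 1520844300000 0` (08:45 UTC on 2018-03-12, no builds yet that day) prints a `REFRESH` line for the previous day followed by a `BUILD` line for the current day.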

This post is mainly about how to configure the fourth point.

1. Create a hadoop-conf folder under the Kylin directory

2. Symlink the cluster's configuration files into that Kylin folder
ln -s /etc/hadoop/conf/hdfs-site.xml $KYLIN_HOME/hadoop-conf/hdfs-site.xml
ln -s /etc/hadoop/conf/yarn-site.xml $KYLIN_HOME/hadoop-conf/yarn-site.xml
ln -s /etc/hadoop/conf/core-site.xml $KYLIN_HOME/hadoop-conf/core-site.xml
ln -s /etc/hbase/conf/hbase-site.xml $KYLIN_HOME/hadoop-conf/hbase-site.xml
ln -s /etc/hive/conf/hive-site.xml $KYLIN_HOME/hadoop-conf/hive-site.xml
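
The folder creation and the five `ln -s` commands can also be wrapped in a small helper that creates the folder first and stops if a source file is missing. A sketch, assuming bash; `link_conf` is a hypothetical name, and the source paths in the usage comment are the ones listed above, which may differ on your distribution.

```shell
#!/usr/bin/env bash
# link_conf DEST_DIR FILE...
# Creates DEST_DIR and symlinks each FILE into it under its own file name.
link_conf() {
  local dest=$1
  shift
  mkdir -p "$dest"
  local src
  for src in "$@"; do
    if [ ! -f "$src" ]; then
      echo "missing source file: $src" >&2
      return 1
    fi
    # -f replaces a stale link from an earlier run; plain "ln -s" would fail instead.
    ln -sfn "$src" "$dest/$(basename "$src")"
  done
}

# Example (not run here):
#   link_conf "$KYLIN_HOME/hadoop-conf" \
#     /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/yarn-site.xml \
#     /etc/hadoop/conf/core-site.xml /etc/hbase/conf/hbase-site.xml \
#     /etc/hive/conf/hive-site.xml
```

Using `-f` is a design choice for scripts that may be re-run; drop it if you would rather be told that a link already exists.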

2.1. Edit the Kylin configuration file so that it points at that folder
##kylin.properties:

kylin.env.hadoop-conf-dir=/usr/local/apache-kylin-2.1.0-bin-hbase1x/hadoop-conf
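
Before running a build it can help to confirm that the configured folder really contains all five site files. A minimal sketch in bash; `get_prop` and `check_conf_dir` are hypothetical helpers, the `kylin.properties` location in the usage comment is an assumption, and `get_prop` only handles plain `key=value` lines rather than the full Java properties syntax.

```shell
#!/usr/bin/env bash
# get_prop FILE KEY -> value of the last uncommented "KEY=value" line in FILE
get_prop() {
  local file=$1 key=$2
  awk -F= -v k="$key" '
    $1 == k { sub(/^[^=]*=/, ""); v = $0 }
    END     { if (v != "") print v }
  ' "$file"
}

# check_conf_dir DIR -> lists any of the five site files that are absent.
# A dangling symlink also counts as missing, because -e follows links.
check_conf_dir() {
  local dir=$1 status=0 f
  for f in core-site.xml hdfs-site.xml yarn-site.xml hive-site.xml hbase-site.xml; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $dir/$f"
      status=1
    fi
  done
  return "$status"
}

# Example (not run here):
#   check_conf_dir "$(get_prop "$KYLIN_HOME/conf/kylin.properties" kylin.env.hadoop-conf-dir)"
```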

3. Upload the JARs that Spark depends on to HDFS (so they do not have to be uploaded again on every run)
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
hadoop fs -mkdir -p /kylin/spark/
hadoop fs -put spark-libs.jar /kylin/spark/
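
If the upload is scripted, a check for an existing archive avoids repeating it. A hedged sketch: `upload_spark_libs` is a hypothetical wrapper around the same three commands, with `hadoop fs -test -e` used to detect an archive that is already in place.

```shell
#!/usr/bin/env bash
# upload_spark_libs JARS_DIR HDFS_DIR
# Packs JARS_DIR into spark-libs.jar (in the current directory) and puts it in
# HDFS_DIR, unless an archive with that name is already there.
upload_spark_libs() {
  local jars_dir=$1 hdfs_dir=$2 archive=spark-libs.jar
  if hadoop fs -test -e "$hdfs_dir/$archive"; then
    echo "already present: $hdfs_dir/$archive"
    return 0
  fi
  jar cv0f "$archive" -C "$jars_dir" . &&
    hadoop fs -mkdir -p "$hdfs_dir" &&
    hadoop fs -put "$archive" "$hdfs_dir/"
}

# Example (not run here):
#   upload_spark_libs "$KYLIN_HOME/spark/jars" /kylin/spark
```

Because an existing archive is left untouched, it has to be deleted from HDFS by hand whenever the JARs under `$KYLIN_HOME/spark/jars` change.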

3.1. The corresponding setting, which also appears in the section 4 configuration, is:
##kylin.properties: after doing that, the config in kylin.properties will be:
#kylin.engine.spark-conf.spark.yarn.archive=hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar
kylin.engine.spark-conf.spark.yarn.archive=hdfs://nameservice1:8020/kylin/spark/spark-libs.jar

4. The Spark engine section of kylin.properties (including tuning parameters)

#### SPARK ENGINE CONFIGS ###
#
## Hadoop conf folder, will export this as "HADOOP_CONF_DIR" to run spark-submit
## This must contain site xmls of core, yarn, hive, and hbase in one folder
##kylin.env.hadoop-conf-dir=/etc/hadoop/conf
kylin.env.hadoop-conf-dir=/usr/local/apps/apache-kylin-2.2.0-bin/hadoop-conf
#
## Estimate the RDD partition numbers
#kylin.engine.spark.rdd-partition-cut-mb=10
kylin.engine.spark.rdd-partition-cut-mb=100
#
## Minimal partition numbers of rdd
#kylin.engine.spark.min-partition=1
#
## Max partition numbers of rdd
#kylin.engine.spark.max-partition=5000
#
## Spark conf (default is in spark/conf/spark-defaults.conf)
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
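## Headroom reserved because executors on YARN sometimes exceed their memory limit.
## (Kept on its own line: a trailing "# ..." in a properties file becomes part of the value.)
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024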
kylin.engine.spark-conf.spark.yarn.driver.memoryOverhead=256
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.executor.memory=9G
#kylin.engine.spark-conf.spark.executor.cores=2
kylin.engine.spark-conf.spark.executor.cores=2
kylin.engine.spark-conf.spark.executor.instances=9
kylin.engine.spark-conf.spark.storage.memoryFraction=0.5
#kylin.engine.spark-conf.spark.shuffle.memoryFraction=0.3
#kylin.engine.spark-conf.spark.default.parallelism=9
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
#kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
#
## Manually uploading the Spark jar archive to HDFS and setting this property avoids re-uploading the jars at runtime
kylin.engine.spark-conf.spark.yarn.archive=hdfs://nameservice1:8020/kylin/spark/spark-libs.jar
##kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
#
## uncomment for HDP
##kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
##kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
##kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
#
#
#### QUERY PUSH DOWN ###
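
To see what the settings above ask of the cluster, the arithmetic can be written out. A rough sketch: `yarn_footprint_mb` and `estimate_partitions` are hypothetical helper names, the footprint ignores the driver and YARN's rounding of container sizes, and the partition estimate is a simplified model of how the cut size and the min/max bounds interact, not Kylin's actual code.

```shell
#!/usr/bin/env bash
# Memory YARN must grant to the executors: instances x (heap + overhead), in MB.
yarn_footprint_mb() {
  local instances=$1 executor_mb=$2 overhead_mb=$3
  echo $(( instances * (executor_mb + overhead_mb) ))
}

# Partition count for a given data size: size / cut, clamped to [min, max].
estimate_partitions() {
  local size_mb=$1 cut_mb=$2 min=$3 max=$4
  local n=$(( size_mb / cut_mb ))
  if [ "$n" -lt "$min" ]; then n=$min; fi
  if [ "$n" -gt "$max" ]; then n=$max; fi
  echo "$n"
}

yarn_footprint_mb 9 9216 1024         # 9 executors, 9G heap, 1024 MB overhead
estimate_partitions 2500 100 1 5000   # 2500 MB of data with the 100 MB cut
estimate_partitions 2500 10 1 5000    # same data with the commented-out cut of 10 MB
```

With the values in this post the executors alone request 92160 MB (90 GB) from YARN, and raising the cut from 10 to 100 lowers the estimated partition count for the same data from 250 to 25.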

5. For tuning the Spark parameters set in the configuration above, see the two references below and adjust the values to your own cluster and workload.

References
http://blog.csdn.net/u010936936/article/details/78095165

http://kylin.apache.org/docs21/tutorial/cube_spark.html

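6. Results in practice

For cubes built incrementally, per day or every few hours, builds became roughly three times faster.

For full builds, the speedup was small.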
