Preface
Spark's web UI already exposes plenty of information: users can learn about shuffles, task execution, and so on. But the runtime state of the executor JVMs is a black box to users.
When an application fails for lack of memory, a novice user may not know whether it was the driver or an executor that ran out of memory, and therefore cannot adjust the parameters correctly.
Spark's metrics system already provides the relevant data; all we need to do is collect and display it.
Install graphite_exporter
Note: it only needs to be installed on the Spark master node.
-
Upload and extract
Download the graphite_exporter-0.6.2.linux-amd64.tar package from https://github.com/prometheus/graphite_exporter, upload it to the server, and extract it into the /usr/local directory:
tar -xvf graphite_exporter-0.6.2.linux-amd64.tar
cd graphite_exporter-0.6.2.linux-amd64/
-
Configure
Upload the graphite_exporter_mapping configuration file into the ../graphite_exporter-0.6.2.linux-amd64/ directory.
Contents of graphite_exporter_mapping:
mappings:
- match: '*.*.executor.filesystem.*.*'
  name: spark_app_filesystem_usage
  labels:
    application: $1
    executor_id: $2
    fs_type: $3
    qty: $4
- match: '*.*.jvm.*.*'
  name: spark_app_jvm_memory_usage
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4
- match: '*.*.executor.jvmGCTime.count'
  name: spark_app_jvm_gcTime_count
  labels:
    application: $1
    executor_id: $2
- match: '*.*.jvm.pools.*.*'
  name: spark_app_jvm_memory_pools
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4
- match: '*.*.executor.threadpool.*'
  name: spark_app_executor_tasks
  labels:
    application: $1
    executor_id: $2
    qty: $3
- match: '*.*.BlockManager.*.*'
  name: spark_app_block_manager
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.DAGScheduler.*.*'
  name: spark_app_dag_scheduler
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.CodeGenerator.*.*'
  name: spark_app_code_generator
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.HiveExternalCatalog.*.*'
  name: spark_app_hive_external_catalog
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.*.StreamingMetrics.*.*'
  name: spark_app_streaming_metrics
  labels:
    application: $1
    executor_id: $2
    app_name: $3
    type: $4
    qty: $5
The file above rewrites the raw Graphite names into labeled Prometheus metrics. For example, the '*.*.jvm.*.*' rule produces the metric name spark_app_jvm_memory_usage with the labels application, executor_id, mem_type, and qty:
application_1533838659288_1030.driver.jvm.heap.usage -> spark_app_jvm_memory_usage{application="application_1533838659288_1030",executor_id="driver",mem_type="heap",qty="usage"}
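The glob-style match rules simply bind each `*` to one dot-separated path segment. As a minimal illustration of how the `'*.*.jvm.*.*'` rule behaves (not graphite_exporter's actual implementation; `map_jvm_metric` is a made-up name):

```python
def map_jvm_metric(graphite_path):
    """Hypothetical re-implementation of the '*.*.jvm.*.*' mapping rule:
    each '*' captures exactly one dot-separated segment, so the rule only
    matches five-segment paths whose third segment is the literal 'jvm'."""
    parts = graphite_path.split(".")
    if len(parts) == 5 and parts[2] == "jvm":
        application, executor_id, _, mem_type, qty = parts
        return "spark_app_jvm_memory_usage", {
            "application": application,
            "executor_id": executor_id,
            "mem_type": mem_type,
            "qty": qty,
        }
    return None  # no match; graphite_exporter would try the next rule

print(map_jvm_metric("application_1533838659288_1030.driver.jvm.heap.usage"))
```

Note that the six-segment `jvm.pools` paths do not match this five-segment rule, which is why the mapping file needs a separate `'*.*.jvm.pools.*.*'` entry.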
-
Start
Change into the install directory and run the following commands:
cd /usr/local/graphite_exporter-0.6.2.linux-amd64/
# load the mapping file when starting graphite_exporter
nohup ./graphite_exporter --graphite.mapping-config=graphite_exporter_mapping &
tail -1000f nohup.out

Spark configuration
-
Configure
Note: every node in the Spark cluster must be configured as follows.
Go to the $SPARK_HOME/conf/ directory and edit the metrics.properties configuration file:
cp metrics.properties.template metrics.properties
vi metrics.properties

# Enable JvmSource for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.protocol=tcp
# address of the host running graphite_exporter
*.sink.graphite.host=172.16.10.91
# graphite_exporter's default Graphite ingestion port is 9109
*.sink.graphite.port=9109
*.sink.graphite.period=60
*.sink.graphite.unit=seconds
-
Start the Spark cluster
Log in to Spark's master node and start the cluster:
cd /usr/local/spark-2.3.3-bin-hadoop2.7/sbin
./start-all.sh
- Start an application
spark-submit --class org.apache.spark.examples.SparkPi \
  --name SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1G \
  --executor-cores 1 \
  --num-executors 1 \
  /usr/hdp/2.6.2.0-205/spark2/examples/jars/spark-examples_2.11-2.1.1.2.6.2.0-205.jar 1000
Once the application is running, open the graphite_exporter web endpoint to inspect the collected metrics (the exporter ingests Graphite data on port 9109 and serves the converted metrics on its web port, 9108 by default):
http://172.xx.xx.91:9108/metrics
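The page is served in the plain-text Prometheus exposition format, one sample per line. A small sketch of how such a line breaks down (the sample line and its value are illustrative, and `parse_metric_line` is a made-up helper, not part of any tool used here):

```python
import re

# Illustrative sample of one line the /metrics endpoint might return
sample = ('spark_app_jvm_memory_usage{application="application_1533838659288_1030",'
          'executor_id="driver",mem_type="heap",qty="usage"} 0.42')

def parse_metric_line(line):
    """Split one Prometheus text-format sample into (name, labels, value)."""
    name, rest = line.split("{", 1)
    label_str, value = rest.rsplit("} ", 1)
    labels = dict(re.findall(r'(\w+)="([^"]*)"', label_str))
    return name, labels, float(value)

name, labels, value = parse_metric_line(sample)
print(name, labels["executor_id"], labels["mem_type"], value)
```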

Prometheus configuration
-
Configure
Edit the Prometheus component's prometheus.yml to add the Spark monitoring job:
vi /usr/local/prometheus-2.15.1/prometheus.yml
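The exact scrape job is not reproduced in the original text; a minimal sketch of what could be appended under `scrape_configs` (the job name is an assumption, and the target must point at graphite_exporter's web port, not the Graphite ingestion port):

```yaml
scrape_configs:
  # ...existing jobs...
  - job_name: 'spark'
    static_configs:
      - targets: ['172.16.10.91:9108']  # graphite_exporter web/metrics port
```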

-
Start and verify
Kill the Prometheus process first, restart it with the following commands, then check the Targets page:
cd /usr/local/prometheus-2.15.1
nohup ./prometheus --config.file=prometheus.yml &

Note: State=UP means the target is being scraped successfully.
Grafana configuration
-
Import the dashboard template
Import the template file provided in the attachment (Spark-dashboard.json).

-
Alert rules

| No. | Alert name | Alert rule | Description |
| --- | --- | --- | --- |
| 1 | Worker node count alert | Fires when the number of Worker nodes in the cluster reaches the threshold [<2] | |
| 2 | App count alert | Fires when the number of apps in the cluster reaches the threshold [<1] | |
| 3 | Driver memory alert | Fires when memory usage reaches the threshold [>80%] | |
| 4 | Executor memory alert | Fires when memory usage reaches the threshold [>80%] | |
| 5 | Executor GC count alert | Fires when GCs per second reach the threshold [>5] | |
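As a concrete example, rules 3 and 4 above could be expressed as one Prometheus alerting rule over the metric produced by the mapping earlier in this article. This is a sketch under assumptions: the JvmSource reports jvm.heap.usage as a 0-1 ratio, the group and alert names are made up, and a series with executor_id="driver" covers the driver case:

```yaml
groups:
  - name: spark-alerts
    rules:
      - alert: SparkJvmHeapUsageHigh
        # jvm.heap.usage is a used/max ratio, so 0.8 corresponds to 80%
        expr: spark_app_jvm_memory_usage{mem_type="heap", qty="usage"} > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM {{ $labels.executor_id }} of {{ $labels.application }} heap above 80%"
```

Such a rules file would then be referenced from prometheus.yml under `rule_files`.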
