Spark on K8S: Deployment Details
time: 2020-1-3
This article is based on an Alibaba Cloud ACK managed K8S cluster and is divided into the following parts:
- Installing spark-operator on ACK
- A Spark wordcount job that reads and writes OSS
- Installing the Spark history server on ACK
Installing the Spark operator
Prepare the kubectl and Helm clients
- Configure a kubectl client on a local or intranet machine.
- Install Helm.
When working from the CloudShell provided by Aliyun, files are not persisted by default and the connection easily times out, which makes the spark operator installation fail; reinstalling then requires manually deleting the operator's various leftover resources.
To install Helm:
mkdir -pv helm && cd helm
wget https://storage.googleapis.com/kubernetes-helm/helm-v2.9.1-linux-amd64.tar.gz
tar xf helm-v2.9.1-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin
rm -rf linux-amd64
# Check the version; the server version is not shown because the server side has not been installed yet
helm version
Install the spark operator
helm install incubator/sparkoperator \
--namespace spark-operator \
--set sparkJobNamespace=default \
--set operatorImageName=registry-vpc.us-east-1.aliyuncs.com/eci_open/spark-operator \
--set operatorVersion=v1beta2-1.0.1-2.4.4 \
--set enableWebhook=true \
--set ingressUrlFormat='{{$appName}}.<ACK test domain>' \
--set enableBatchScheduler=true
Note:
- operatorImageName: change the region here to the one your K8S cluster is in. The default Google-hosted image cannot be pulled, so we use the image provided by Aliyun; the registry-vpc prefix means the image is pulled from the registry over the internal network.
- ingressUrlFormat: Alibaba Cloud K8S clusters come with a test domain; replace it with your own.
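The ingressUrlFormat value is a Go-template-style pattern in which {{$appName}} is substituted with the SparkApplication's name to form the Web UI ingress host. A toy sketch of that substitution, purely for intuition (Python; the domain below is a made-up placeholder):

```python
def render_ingress_url(fmt: str, app_name: str) -> str:
    # The operator renders a Go template; a plain string replace
    # is enough to illustrate what the pattern produces.
    return fmt.replace("{{$appName}}", app_name)

# A job named "wordcount" gets a per-application ingress host:
print(render_ingress_url("{{$appName}}.c1234.us-east-1.alicontainer.com", "wordcount"))
# wordcount.c1234.us-east-1.alicontainer.com
```

This matches the per-job Web UI Ingress Address that shows up later in kubectl describe output.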
Once the installation is done, we need to manually create a service account so that the Spark jobs submitted later have permission to create the driver/executor pods, ConfigMaps, and other resources.
The following creates the default:spark service account and binds the required permissions.
Create spark-rbac.yaml and run kubectl apply -f spark-rbac.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
name: spark
namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: default
name: spark-role
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["*"]
- apiGroups: [""]
resources: ["services"]
verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: spark-role-binding
namespace: default
subjects:
- kind: ServiceAccount
name: spark
namespace: default
roleRef:
kind: Role
name: spark-role
apiGroup: rbac.authorization.k8s.io
Spark wordcount reading and writing OSS
This involves the following steps:
- Prepare the jar dependencies for OSS
- Prepare a core-site.xml that supports the OSS file system
- Build a Spark container image that can read and write OSS
- Prepare the wordcount job
Prepare the jar dependencies for OSS
Reference: https://help.aliyun.com/document_detail/146237.html?spm=a2c4g.11186623.2.16.4dce2e14IGuHEv
The following can be run directly to download the OSS dependency jars:
wget -O hadoop-oss-hdp-2.6.1.0-129.tar.gz "http://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-hdp-2.6.1.0-129.tar.gz?spm=a2c4g.11186623.2.11.54b56c18VGGAzb&file=hadoop-oss-hdp-2.6.1.0-129.tar.gz"
tar -xvf hadoop-oss-hdp-2.6.1.0-129.tar.gz
hadoop-oss-hdp-2.6.1.0-129/
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-ram-3.0.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-core-3.4.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-ecs-4.2.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-sts-3.0.0.jar
hadoop-oss-hdp-2.6.1.0-129/jdom-1.1.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-sdk-oss-3.4.1.jar
hadoop-oss-hdp-2.6.1.0-129/hadoop-aliyun-2.7.3.2.6.1.0-129.jar
Prepare core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. See accompanying LICENSE file. -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- OSS configuration -->
<property>
<name>fs.oss.impl</name>
<value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
<property>
<name>fs.oss.endpoint</name>
<value>oss-cn-hangzhou-internal.aliyuncs.com</value>
</property>
<property>
<name>fs.oss.accessKeyId</name>
<value>{temporary AccessKey ID}</value>
</property>
<property>
<name>fs.oss.accessKeySecret</name>
<value>{temporary AccessKey Secret}</value>
</property>
<property>
<name>fs.oss.buffer.dir</name>
<value>/tmp/oss</value>
</property>
<property>
<name>fs.oss.connection.secure.enabled</name>
<value>false</value>
</property>
<property>
<name>fs.oss.connection.maximum</name>
<value>2048</value>
</property>
</configuration>
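A mistake in core-site.xml only surfaces at job runtime, so it can be worth sanity-checking the file up front. A minimal sketch in Python (the helper names and the required-key list are this example's assumptions, derived from the properties above):

```python
import io
import xml.etree.ElementTree as ET

# Keys the OSS file system cannot work without (per the core-site.xml above).
REQUIRED_KEYS = ("fs.oss.impl", "fs.oss.endpoint",
                 "fs.oss.accessKeyId", "fs.oss.accessKeySecret")

def load_hadoop_conf(source):
    """Parse a Hadoop-style configuration file (path or file object)
    into a dict of property name -> value."""
    root = ET.parse(source).getroot()
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.iter("property")}

def missing_keys(conf):
    """Return the required OSS keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not conf.get(k)]

# Demo on an inline, deliberately incomplete config: the access keys are missing.
sample = io.StringIO("""<configuration>
  <property><name>fs.oss.impl</name>
    <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value></property>
  <property><name>fs.oss.endpoint</name>
    <value>oss-cn-hangzhou-internal.aliyuncs.com</value></property>
</configuration>""")
print(missing_keys(load_hadoop_conf(sample)))
# ['fs.oss.accessKeyId', 'fs.oss.accessKeySecret']
```

Running this against conf/core-site.xml before baking the image catches an empty AccessKey placeholder early.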
Build an image that can read and write OSS
Download and unpack the Spark distribution:
wget http://apache.communilink.net/spark/spark-3.0.0-preview/spark-3.0.0-preview-bin-hadoop2.7.tgz
tar -xzvf spark-3.0.0-preview-bin-hadoop2.7.tgz
Build and publish the image
Before building, you need a docker registry; it can be Docker Hub or the remote registry service provided by Aliyun. Here we use Aliyun's Container Registry.
- Log in to the registry with docker:
docker login --username=lanrish@1416336129779449 registry.us-east-1.aliyuncs.com
Note:
- Log in with docker configured for sudo-less use; if you log in with sudo docker login, the current user will not be able to build images afterwards.
- registry.us-east-1.aliyuncs.com depends on the region you chose and is accessed over the public network by default. If you create the K8S cluster and the registry in the same region (i.e. on the same VPC), you can append -vpc to registry, i.e. registry-vpc.us-east-1.aliyuncs.com, so that K8S pulls container images quickly over the internal network.
- Build the Spark image
Enter the unpacked Spark directory: cd spark-3.0.0-preview-bin-hadoop2.7
- Copy the OSS dependency jars into the jars directory.
- Put the OSS-enabled core-site.xml into the conf directory.
- Modify kubernetes/dockerfiles/spark/Dockerfile
The modified file is shown below; the key changes are creating /opt/hadoop/conf, the COPY conf/core-site.xml /opt/hadoop/conf line, and the HADOOP_HOME / HADOOP_CONF_DIR environment variables, which let Spark load core-site.xml automatically via HADOOP_CONF_DIR. We go to this trouble instead of using a ConfigMap because Spark 3.0 currently has a bug there; see: https://www.jianshu.com/p/d051aa95b241
FROM openjdk:8-jdk-slim

ARG spark_uid=185

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    mkdir -p /opt/hadoop/conf && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*

COPY jars /opt/spark/jars
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY examples /opt/spark/examples
COPY kubernetes/tests /opt/spark/tests
COPY data /opt/spark/data
COPY conf/core-site.xml /opt/hadoop/conf

ENV SPARK_HOME /opt/spark
ENV HADOOP_HOME /opt/hadoop
ENV HADOOP_CONF_DIR /opt/hadoop/conf

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
USER ${spark_uid}
- Build and push the image
# Build the image
./bin/docker-image-tool.sh -r registry.us-east-1.aliyuncs.com/engineplus -t 3.0.0 build
# Push the image
docker push registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0
If extra dependencies need to be installed in the image, use the following approach instead:
build a custom image via a Dockerfile from the current Spark directory spark-3.0.0-preview-bin-hadoop2.7:
docker build -t registry.us-east-1.aliyuncs.com/spark:3.0.0 -f kubernetes/dockerfiles/spark/Dockerfile .
Define your custom dependencies in kubernetes/dockerfiles/spark/Dockerfile.
Prepare the wordcount job
The wordcount job can be cloned from: https://github.com/i-mine/spark_k8s_wordcount
After downloading, just run mvn clean package
to get the wordcount jar: target/spark_k8s_wordcount-1.0-SNAPSHOT.jar
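The job itself is a standard Scala wordcount. For intuition, here is the same flatMap → reduceByKey logic expressed as plain Python (the function name and sample lines are illustrative, not taken from the repo):

```python
from collections import Counter

def word_count(lines):
    """Split each line on whitespace and count every word,
    mirroring the classic flatMap -> map -> reduceByKey pipeline."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return dict(counter)

if __name__ == "__main__":
    sample = ["hello spark", "hello k8s", "spark on k8s"]
    print(word_count(sample))
    # {'hello': 2, 'spark': 2, 'k8s': 2, 'on': 1}
```

In the Spark version the input lines come from OSS (via the core-site.xml configuration baked into the image) and the counts are written back to OSS.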
1. Submitting with spark-submit
Note: this submission method can upload a local jar, but the local submission environment must already have the Hadoop OSS configuration in place.
bin/spark-submit \
--master k8s://https://192.168.17.175:6443 \
--deploy-mode cluster \
--name com.mobvista.dataplatform.WordCount \
--class com.mobvista.dataplatform.WordCount \
--conf spark.kubernetes.file.upload.path=oss://mob-emr-test/lei.du/tmp \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0-oss \
/home/hadoop/dulei/spark-3.0.0-preview2-bin-hadoop2.7/spark_k8s_wordcount-1.0-SNAPSHOT.jar
2. Submitting with the spark operator
Note: with this method, the jars a Spark job depends on must either already be in the image or be reachable remotely; local jars cannot be uploaded automatically, so you have to upload them to OSS or S3 yourself, and the Spark image must already contain the OSS/S3 access configuration and dependency jars.
Write the spark operator word-count.yaml; this requires the jar to be baked into the image beforehand or uploaded to cloud storage:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: wordcount
namespace: default
spec:
type: Scala
mode: cluster
image: "registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0-oss"
imagePullPolicy: IfNotPresent
mainClass: com.mobvista.dataplatform.WordCount
mainApplicationFile: "oss://mob-emr-test/lei.du/lib/spark_k8s_wordcount-1.0-SNAPSHOT.jar"
sparkVersion: "3.0.0"
restartPolicy:
type: OnFailure
onFailureRetries: 2
onFailureRetryInterval: 5
onSubmissionFailureRetries: 2
onSubmissionFailureRetryInterval: 10
timeToLiveSeconds: 3600
sparkConf:
"spark.kubernetes.allocation.batch.size": "10"
"spark.eventLog.enabled": "true"
"spark.eventLog.dir": "oss://mob-emr-test/lei.du/tmp/logs"
hadoopConfigMap: oss-hadoop-dir
driver:
cores: 1
memory: "1024m"
labels:
version: 3.0.0
spark-app: spark-wordcount
role: driver
annotations:
k8s.aliyun.com/eci-image-cache: "true"
serviceAccount: spark
executor:
cores: 1
instances: 1
memory: "1024m"
labels:
version: 3.0.0
role: executor
annotations:
k8s.aliyun.com/eci-image-cache: "true"
While the job is running we can fetch the ingress-url and use it to open the Web UI and check the job's status; once the job finishes, the UI is no longer accessible:
$ kubectl describe sparkapplication
Name:         wordcount
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sparkoperator.k8s.io/v1beta2","kind":"SparkApplication","metadata":{"annotations":{},"name":"wordcount","namespace":"defaul...
API Version:  sparkoperator.k8s.io/v1beta2
Kind:         SparkApplication
Metadata:
  Creation Timestamp:  2020-01-03T08:18:58Z
  Generation:          2
  Resource Version:    53192098
  Self Link:           /apis/sparkoperator.k8s.io/v1beta2/namespaces/default/sparkapplications/wordcount
  UID:                 b0b1ff99-2e01-11ea-bf95-7e8505108e63
Spec:
  Driver:
    Annotations:
      k8s.aliyun.com/eci-image-cache:  true
    Cores:  1
    Labels:
      Role:         driver
      Spark - App:  spark-wordcount
      Version:      3.0.0
    Memory:           1024m
    Service Account:  spark
  Executor:
    Annotations:
      k8s.aliyun.com/eci-image-cache:  true
    Cores:      1
    Instances:  1
    Labels:
      Role:     executor
      Version:  3.0.0
    Memory:  1024m
  Image:                  registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0-oss-wordcount
  Image Pull Policy:      IfNotPresent
  Main Application File:  /opt/spark/jars/spark_k8s_wordcount-1.0-SNAPSHOT.jar
  Main Class:             WordCount
  Mode:                   cluster
  Restart Policy:
    On Failure Retries:                   2
    On Failure Retry Interval:            5
    On Submission Failure Retries:        2
    On Submission Failure Retry Interval: 10
    Type:                                 OnFailure
  Spark Conf:
    spark.kubernetes.allocation.batch.size:  10
  Spark Version:         3.0.0
  Time To Live Seconds:  3600
  Type:                  Scala
Status:
  Application State:
    Error Message:  driver pod failed with ExitCode: 1, Reason: Error
    State:          FAILED
  Driver Info:
    Pod Name:                wordcount-driver
    Web UI Address:          172.21.14.219:4040
    Web UI Ingress Address:  wordcount.cac1e2ca4865f4164b9ce6dd46c769d59.us-east-1.alicontainer.com
    Web UI Ingress Name:     wordcount-ui-ingress
    Web UI Port:             4040
    Web UI Service Name:     wordcount-ui-svc
  Execution Attempts:            3
  Last Submission Attempt Time:  2020-01-03T08:21:51Z
  Spark Application Id:          spark-4c66cd4e3e094571844bbc355a1b6a16
  Submission Attempts:           1
  Submission ID:                 e4ce0cb8-7719-4c6f-ade1-4c13e137de77
  Termination Time:              2020-01-03T08:22:01Z
Events:
  Type     Reason                               Age                    From            Message
  ----     ------                               ----                   ----            -------
  Normal   SparkApplicationAdded                7m20s                  spark-operator  SparkApplication wordcount was added, enqueuing it for submission
  Warning  SparkApplicationFailed               6m20s                  spark-operator  SparkApplication wordcount failed: driver pod failed with ExitCode: 101, Reason: Error
  Normal   SparkApplicationSpecUpdateProcessed  5m43s                  spark-operator  Successfully processed spec update for SparkApplication wordcount
  Warning  SparkDriverFailed                    4m47s (x5 over 7m10s)  spark-operator  Driver wordcount-driver failed
  Warning  SparkApplicationPendingRerun         4m32s (x5 over 7m2s)   spark-operator  SparkApplication wordcount is pending rerun
  Normal   SparkApplicationSubmitted            4m27s (x6 over 7m16s)  spark-operator  SparkApplication wordcount was submitted successfully
  Normal   SparkDriverRunning                   4m24s (x6 over 7m14s)  spark-operator  Driver wordcount-driver is running
Installing the Spark History Server on K8S
Here we use the Spark History Server provided as a Helm chart.
GitHub: https://github.com/SnappyDataInc/spark-on-k8s/tree/master/charts/spark-hs?spm=5176.2020520152.0.0.2d5916ddP2xqfh
For convenience, install it directly from Aliyun's app marketplace:
App page: https://cs.console.aliyun.com/#/k8s/catalog/detail/incubator_ack-spark-history-server
Before creating it, fill in the OSS-related configuration, then create it:

After installation, inspect the corresponding K8S service to get the Spark history server's access address.

Once it is created, add two configuration entries when submitting jobs:
"spark.eventLog.enabled": "true"
"spark.eventLog.dir": "oss://mob-emr-test/lei.du/tmp/logs"
The submitted jobs' event logs will then be stored in OSS, where the history server can read them.
