Spark on K8S (Kubernetes Native)




Ways of running Spark on K8S

Start Minikube

sudo minikube start --driver=none --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers

If the start fails, try deleting the cluster first with minikube delete.
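
Once the cluster is up, a quick sanity check with the standard minikube/kubectl commands:

sudo minikube status
sudo kubectl get nodes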

Spark on K8S official documentation

https://spark.apache.org/docs/latest/running-on-kubernetes.html

The page above does not mention compatibility between Spark and K8S versions, but it does matter (more on this later).

Download Spark

https://archive.apache.org/dist/spark/

Spark is fairly tightly coupled with Hadoop, so it is better to download the build that bundles Hadoop: that way the Hadoop jars are available, and you avoid missing-package/missing-class errors that can show up even when Hadoop is not actually used.
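
For example, to fetch the 2.4.6 build bundled with Hadoop 2.7 (adjust the file name to whichever version you pick):

wget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
tar -xzf spark-2.4.6-bin-hadoop2.7.tgz
cd spark-2.4.6-bin-hadoop2.7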

Build Spark Image

Spark ships a bin/docker-image-tool.sh script for building the images.

The script picks up the Dockerfiles under kubernetes/dockerfiles and, following them, builds the needed Spark commands, tools, libraries, jars, Java runtime, examples, entrypoint.sh, etc. into the image.

Spark 2.3 only supports Java/Scala; Python and R are supported from 2.4 on, which has three Dockerfiles and builds three images, with the Python and R images layered on top of the Java/Scala one.

sudo ./bin/docker-image-tool.sh -t my_spark_2.4_hadoop_2.7 build

You may hit an error like the following:

WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz: temporary error (try again later)
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz: temporary error (try again later)
ERROR: unsatisfiable constraints:
  bash (missing):
    required by: world[bash]

This is a network problem. You can edit ./bin/docker-image-tool.sh and add --network=host to its docker build commands so the build containers use the host network (make sure the host network itself works).
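
A sketch of the kind of change, assuming the build commands in the script look roughly like the 2.4.x version (the exact lines differ between Spark releases, so adapt it to what you find in the file):

# inside bin/docker-image-tool.sh, add --network=host to each docker build invocation, e.g.
docker build --network=host "${BUILD_ARGS[@]}" \
    -t $(image_ref spark) \
    -f "$BASEDOCKERFILE" .

Once the build succeeds, sudo docker images | grep spark should list the spark, spark-py and spark-r images tagged my_spark_2.4_hadoop_2.7.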

Submitting a Job from the host

bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar

Note that local:///path/to/examples.jar here refers to the container's file system, not the file system of the machine running spark-submit. As the official docs put it: "Note that using application dependencies from the submission client's local file system is currently not yet supported."

Instead of local you can also use schemes like HTTP or HDFS; if no scheme is given, local is assumed.

Because I initially used a Spark build without the bundled Hadoop jars, spark-submit reported classNotFound.
I then tried specifying --jars, or adding the following to conf/spark-env.sh on the host:

export SPARK_DIST_CLASSPATH=$(/home/lin/Hadoop/hadoop-2.8.3/bin/hadoop classpath)

That got spark-submit past the error, but the container still reported classNotFound after it started.
It turns out the driver container runs spark-submit again itself, just with some parameters changed, e.g. cluster mode switched to client mode.
After switching to the Spark build that bundles Hadoop, the problem went away.
So presumably the jars specified with --jars also need to exist inside the container.

Getting the K8S API Server address

sudo kubectl cluster-info

Suppose it returns

https://192.168.0.107:8443

Then the spark-submit command is:

# --master points at the k8s api server
# --conf spark.kubernetes.container.image specifies the image built with docker-image-tool.sh
# the first wordcount.py is the program to run
# the second wordcount.py is its argument, i.e. count the words in the wordcount.py file itself
bin/spark-submit \
    --master k8s://https://192.168.0.107:8443 \
    --deploy-mode cluster \
    --name spark-test \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.container.image=spark-py:my_spark_2.4_hadoop_2.7 \
    /opt/spark/examples/src/main/python/wordcount.py \
    /opt/spark/examples/src/main/python/wordcount.py

This will likely fail with a certificate error and the Pod won't start, so certificates may need to be configured.
The Spark on K8S docs have a spark.kubernetes.authenticate.submission.caCertFile option for this, though I did not try it.
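
If you do want to go the certificate route, the conf would presumably point at the cluster's CA cert, e.g. for minikube (the path below is an assumption, check where your minikube keeps its CA):

--conf spark.kubernetes.authenticate.submission.caCertFile=$HOME/.minikube/ca.crt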
In a test environment you can instead use the command below to run a proxy, which gives an address that does not require certificate authentication:

kubectl proxy

The spark-submit command then becomes:

# the API Server address becomes http://127.0.0.1:8001
bin/spark-submit \
    --master k8s://http://127.0.0.1:8001 \
    --deploy-mode cluster \
    --name spark-test \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.container.image=spark-py:my_spark_2.4_hadoop_2.7 \
    /opt/spark/examples/src/main/python/wordcount.py \
    /opt/spark/examples/src/main/python/wordcount.py

This still fails, on the host or inside the container, with a permission error; a user with the necessary permissions needs to be configured in K8S.

Prepare a role.yaml file:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io

For reference, see https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/manifest/spark-rbac.yaml

Apply it:

sudo kubectl apply -f role.yaml

Check the configuration:

sudo kubectl get role
sudo kubectl get role spark-role -o yaml
sudo kubectl get rolebinding
sudo kubectl get rolebinding spark-role-binding -o yaml
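
The service account itself can be checked the same way (names as defined in role.yaml above):

sudo kubectl get serviceaccount spark -o yaml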

Resubmit:

# added --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
bin/spark-submit \
    --master k8s://http://127.0.0.1:8001 \
    --deploy-mode cluster \
    --name spark-test \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=spark-py:my_spark_2.4_hadoop_2.7 \
    /opt/spark/examples/src/main/python/wordcount.py \
    /opt/spark/examples/src/main/python/wordcount.py

No more permission errors, but there may still be other errors:

20/07/09 06:32:23 INFO SparkContext: Successfully stopped SparkContext
Traceback (most recent call last):
  File "/opt/spark/examples/src/main/python/wordcount.py", line 33, in <module>
    .appName("PythonWordCount")\
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 173, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 367, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 136, in __init__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 198, in _do_init
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 306, in _initialize_context
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: External scheduler cannot be instantiated
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [spark-test-1594276334218-driver]  in namespace: [default]  failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:237)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:170)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
        at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
        ... 13 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)

This Broken pipe is likely caused by the code and jars Spark uses being incompatible with the K8S version.
I tried replacing the k8s jars under Spark's jars directory with:

https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar
https://gitee.com/everworking/kubernetes-client

But other problems showed up, which again looked like version incompatibilities.
Looking at Spark 2.4.6's jars directory shows the K8S jars it uses:

kubernetes-client-4.6.1.jar
kubernetes-model-4.6.1.jar
kubernetes-model-common-4.6.1.jar

Checking the Kubernetes Client compatibility matrix
https://github.com/fabric8io/kubernetes-client#compatibility-matrix
shows that the highest Kubernetes version 4.6.1 supports is 1.15 (the Spark docs never make the K8S version compatibility clear),
while the latest Minikube at the time installs Kubernetes 1.18 by default.

Delete the cluster and start a new one running Kubernetes 1.15:

sudo minikube stop
sudo minikube delete

sudo rm -rf ~/.kube
sudo rm -rf ~/.minikube

sudo minikube start --driver=none \
                    --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers \
                    --kubernetes-version="v1.15.3"

Also download the 1.15 version of kubectl:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.15.3/bin/linux/amd64/kubectl
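
Then make it executable and put it on the PATH (standard install steps; the target directory is up to you):

chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl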

Check the versions to make sure everything is on 1.15:

sudo kubectl version --client

Resubmit with the same command:

# added --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
bin/spark-submit \
    --master k8s://http://127.0.0.1:8001 \
    --deploy-mode cluster \
    --name spark-test \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=spark-py:my_spark_2.4_hadoop_2.7 \
    /opt/spark/examples/src/main/python/wordcount.py \
    /opt/spark/examples/src/main/python/wordcount.py

This time it succeeds, and you can see the Driver and Executor Pods have started.
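
The listings in the rest of this section come from the usual kubectl/docker commands:

sudo kubectl get pods
sudo kubectl get services
sudo docker ps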

NAME                                   READY   STATUS    RESTARTS   AGE
pythonwordcount-1595818025111-exec-1   1/1     Running   0          12s
pythonwordcount-1595818025401-exec-2   1/1     Running   0          12s
pythonwordcount-1595818025443-exec-3   0/1     Pending   0          12s
spark-test-1595818015819-driver        1/1     Running   0          20s

The corresponding Service:

NAME                                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
kubernetes                            ClusterIP   10.96.0.1    <none>        443/TCP             27m
spark-test-1595818015819-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP   22s

The corresponding containers, as seen via the docker command:

CONTAINER ID        IMAGE                                                                     COMMAND                  CREATED             STATUS              PORTS               NAMES
a55500cdd00f        9cdc285a4fbb                                                              "/opt/entrypoint.sh …"   9 seconds ago       Up 8 seconds                            k8s_executor_pythonwordcount-1595818025401-exec-2_default_0624eb2d-aeab-454e-bce5-15c38b46f970_0
37ddc67f3527        9cdc285a4fbb                                                              "/opt/entrypoint.sh …"   9 seconds ago       Up 8 seconds                            k8s_executor_pythonwordcount-1595818025111-exec-1_default_0d8fa5ac-07dc-41ea-a7fb-a75d1f5dfdf9_0
5d9d9c5517e4        registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1             "/pause"                 10 seconds ago      Up 8 seconds                            k8s_POD_pythonwordcount-1595818025401-exec-2_default_0624eb2d-aeab-454e-bce5-15c38b46f970_0
210ebc82c274        registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1             "/pause"                 10 seconds ago      Up 8 seconds                            k8s_POD_pythonwordcount-1595818025111-exec-1_default_0d8fa5ac-07dc-41ea-a7fb-a75d1f5dfdf9_0
400f155d78f2        9cdc285a4fbb                                                              "/opt/entrypoint.sh …"   15 seconds ago      Up 14 seconds                           k8s_spark-kubernetes-driver_spark-test-1595818015819-driver_default_3198bca3-fcb7-4a5c-8821-c5fe7ef02dfa_0
d4c7f82d90de        registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1             "/pause"                 16 seconds ago      Up 14 seconds                           k8s_POD_spark-test-1595818015819-driver_default_3198bca3-fcb7-4a5c-8821-c5fe7ef02dfa_0

There are only two Executor containers here, even though 3 were requested at submission.
The Pod listing above also shows one Pod stuck in Pending.
Describe the Pending Pod:

sudo kubectl describe pod pythonwordcount-1595818025443-exec-3

This returns a lot of information; at the very end you can see:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  23s (x2 over 23s)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu.

So the Pod is Pending because there is not enough CPU.
This does not prevent the Job from running normally, though.

After the Spark Job finishes, both the Executor and Driver containers move to the Exited state.
The Executor containers disappear shortly after exiting, and their Pods are deleted as well.
The Driver container stays around, and its Pod ends up in the Completed state:

NAME                              READY   STATUS      RESTARTS   AGE
spark-test-1595818015819-driver   0/1     Completed   0          32s

If something had gone wrong, the status would be Error instead.
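
Since the driver Pod is kept around, the job output (the word counts printed by wordcount.py) can still be read from its logs, e.g.:

sudo kubectl logs spark-test-1595818015819-driver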

Submitting a Job from inside a container

Define a Deployment; note that serviceAccountName is set to the spark service account created earlier:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark
      component: client
  template:
    metadata:
      labels:
        app: spark
        component: client
    spec:
      containers:
      - name: sparkclient
        image: spark-py:2.4.6
        workingDir: /opt/spark
        command: ["/bin/bash", "-c", "while true;do echo hello;sleep 6000;done"]
      serviceAccountName: spark

Deploy it:

sudo kubectl create -f client-deployment.yaml

Find the Pod and exec into it:

sudo kubectl exec -t -i spark-client-6479b76776-l5bzw /bin/bash

The env command shows that the container has the Kubernetes API Server address defined:

KUBERNETES_SERVICE_HOST=10.96.0.1
KUBERNETES_SERVICE_PORT_HTTPS=443

The container also carries the corresponding token and certificate, which can be used to access the API Server:

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
     -H "Authorization: Bearer $TOKEN" \
     -s https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT_HTTPS}/api/v1/namespaces/default/pods

Submitting a Job with spark-submit from here failed, though, complaining about missing permission to get configMaps, so the required permissions differ from submitting on the host.
Change the spark role to allow operating on all resources, then re-apply it (kubectl apply -f role.yaml, or delete and recreate):

- apiGroups: [""]
  resources: ["*"]
  verbs: ["*"]

Resubmit the Job, and it now starts and runs successfully:

# the second wordcount.py is passed as the argument
bin/spark-submit \
    --master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT_HTTPS} \
    --deploy-mode cluster \
    --name spark-test \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=spark-py:2.4.6 \
    /opt/spark/examples/src/main/python/wordcount.py \
    /opt/spark/examples/src/main/python/wordcount.py


