There is not much material online about running Hadoop on Kubernetes, so I tried building it myself; this post records the process and the problems I ran into.
I. Choosing an image
First, pick a reasonably popular image from Docker Hub. I chose the bde2020 series, because the documentation on its GitHub repository is fairly complete: https://github.com/big-data-europe/docker-hadoop
II. Testing with docker-compose
The repository shows how to run these Hadoop images with docker-compose; just follow the instructions there.
docker-compose is Docker's companion container-orchestration tool and is simple to use: download docker-compose.yml and hadoop.env, then start everything with docker-compose up. To stop the services, run docker-compose down.
III. Writing Kubernetes YAML files for each component
The docker-compose example above is simple, but it offers limited functionality and runs everything on a single machine. Our task is to translate the docker-compose YAML into Kubernetes YAML.
1. Create the ConfigMap
Configuration can be injected through a ConfigMap. Based on hadoop.env, write configmap.yaml as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-config
data:
  CORE_CONF_fs_defaultFS: "hdfs://namenode:8020"
  CORE_CONF_hadoop_http_staticuser_user: "root"
  CORE_CONF_hadoop_proxyuser_hue_hosts: "*"
  CORE_CONF_hadoop_proxyuser_hue_groups: "*"
  HDFS_CONF_dfs_webhdfs_enabled: "true"
  HDFS_CONF_dfs_permissions_enabled: "false"
  YARN_CONF_yarn_log___aggregation___enable: "true"
  YARN_CONF_yarn_resourcemanager_recovery_enabled: "true"
  YARN_CONF_yarn_resourcemanager_store_class: "org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore"
  YARN_CONF_yarn_resourcemanager_fs_state___store_uri: "/rmstate"
  YARN_CONF_yarn_nodemanager_remote___app___log___dir: "/app-logs"
  YARN_CONF_yarn_log_server_url: "http://historyserver:8188/applicationhistory/logs/"
  YARN_CONF_yarn_timeline___service_enabled: "true"
  YARN_CONF_yarn_timeline___service_generic___application___history_enabled: "true"
  YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled: "true"
  YARN_CONF_yarn_resourcemanager_hostname: "resourcemanager"
  YARN_CONF_yarn_timeline___service_hostname: "historyserver"
  YARN_CONF_yarn_resourcemanager_address: "resourcemanager:8032"
  YARN_CONF_yarn_resourcemanager_scheduler_address: "resourcemanager:8030"
  YARN_CONF_yarn_resourcemanager_resource___tracker_address: "resourcemanager:8031"
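For reference, the bde2020 image entrypoint turns each of these variables into a property in the matching Hadoop config file: the prefix selects the file, `___` decodes to `-`, and remaining `_` decode to `.`. A minimal sketch of that decoding (the function name is mine; the entrypoint itself does this with sed, and the `__` case is omitted here since none of the variables above use it):

```python
# Sketch of the bde2020 naming convention: the prefix picks the target
# file, "___" decodes to "-", remaining "_" decode to ".".
# (env_to_property is an illustrative name, not part of the image.)
PREFIXES = {
    "CORE_CONF": "core-site.xml",
    "HDFS_CONF": "hdfs-site.xml",
    "YARN_CONF": "yarn-site.xml",
}

def env_to_property(env_name):
    for prefix, target_file in PREFIXES.items():
        if env_name.startswith(prefix + "_"):
            key = env_name[len(prefix) + 1:]
            # Order matters: decode "___" before the single underscores.
            return target_file, key.replace("___", "-").replace("_", ".")
    raise ValueError("unrecognized prefix: " + env_name)

print(env_to_property("YARN_CONF_yarn_log___aggregation___enable"))
# ('yarn-site.xml', 'yarn.log-aggregation-enable')
```

So, for example, CORE_CONF_fs_defaultFS above ends up as fs.defaultFS in core-site.xml inside every container that loads the ConfigMap.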
2. Create the namenode
Hadoop nodes communicate with each other by hostname, but when a pod is created the system assigns it a random hostname and writes it into the pod's own /etc/hosts. This breaks inter-node communication, producing errors such as UnresolvedAddressException. This tripped me up for a long time, and it took a lot of digging to find the cause.
The fix is to set clusterIP to None in the service and to set hostname in the deployment to the same value as the service name. To avoid confusion, the service name, container name, hostname, and so on below are all set to the same value.
Note: clusterIP in the service must be set to None, otherwise YARN will throw errors when running MapReduce jobs!
The namenode needs a volume mounted, so first write pvc.yaml (a StorageClass must already exist; see my earlier post https://www.cnblogs.com/00986014w/p/9406962.html for details):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hadoop-namenode-pvc
spec:
  storageClassName: nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
Write the namenode service and deployment in namenode.yaml as follows (every port that might be needed is exposed here; in practice far fewer are required):
apiVersion: v1
kind: Service
metadata:
  name: namenode
  labels:
    name: namenode
spec:
  ports:
    - {port: 50070, name: http}
    - {port: 8020, name: hdfs}
    - {port: 50075, name: hdfs1}
    - {port: 50010, name: hdfs2}
    - {port: 50020, name: hdfs3}
    - {port: 9000, name: hdfs4}
    - {port: 50090, name: hdfs5}
    - {port: 31010, name: hdfs6}
    - {port: 8030, name: yarn1}
    - {port: 8031, name: yarn2}
    - {port: 8032, name: yarn3}
    - {port: 8033, name: yarn4}
    - {port: 8040, name: yarn5}
    - {port: 8042, name: yarn6}
    - {port: 8088, name: yarn7}
    - {port: 8188, name: historyserver}
  selector:
    name: namenode
  clusterIP: None
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: namenode
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: namenode
    spec:
      hostname: namenode
      containers:
        - name: namenode
          image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8
          imagePullPolicy: IfNotPresent
          ports:
            - {containerPort: 50070, name: http}
            - {containerPort: 8020, name: hdfs}
            - {containerPort: 50075, name: hdfs1}
            - {containerPort: 50010, name: hdfs2}
            - {containerPort: 50020, name: hdfs3}
            - {containerPort: 9000, name: hdfs4}
            - {containerPort: 50090, name: hdfs5}
            - {containerPort: 31010, name: hdfs6}
            - {containerPort: 8030, name: yarn1}
            - {containerPort: 8031, name: yarn2}
            - {containerPort: 8032, name: yarn3}
            - {containerPort: 8033, name: yarn4}
            - {containerPort: 8040, name: yarn5}
            - {containerPort: 8042, name: yarn6}
            - {containerPort: 8088, name: yarn7}
            - {containerPort: 8188, name: historyserver}
          env:
            - name: CLUSTER_NAME
              value: test
          envFrom:
            - configMapRef:
                name: hadoop-config
          volumeMounts:
            - name: hadoop-namenode
              mountPath: /hadoop/dfs/name
      volumes:
        - name: hadoop-namenode
          persistentVolumeClaim:
            claimName: hadoop-namenode-pvc
3. datanode
Create three datanodes. Taking datanode1 as the example, write datanode.yaml as follows (its PVC is analogous to the namenode's, so it is omitted here):
apiVersion: v1
kind: Service
metadata:
  name: datanode1
  labels:
    name: datanode1
spec:
  ports:
    - {port: 50070, name: http}
    - {port: 8020, name: hdfs}
    - {port: 50075, name: hdfs1}
    - {port: 50010, name: hdfs2}
    - {port: 50020, name: hdfs3}
    - {port: 9000, name: hdfs4}
    - {port: 50090, name: hdfs5}
    - {port: 31010, name: hdfs6}
    - {port: 8030, name: yarn1}
    - {port: 8031, name: yarn2}
    - {port: 8032, name: yarn3}
    - {port: 8033, name: yarn4}
    - {port: 8040, name: yarn5}
    - {port: 8042, name: yarn6}
    - {port: 8088, name: yarn7}
    - {port: 8188, name: historyserver}
  selector:
    name: datanode1
  clusterIP: None
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: datanode1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: datanode1
    spec:
      hostname: datanode1
      containers:
        - name: datanode1
          image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
          imagePullPolicy: IfNotPresent
          ports:
            - {containerPort: 50070, name: http}
            - {containerPort: 8020, name: hdfs}
            - {containerPort: 50075, name: hdfs1}
            - {containerPort: 50010, name: hdfs2}
            - {containerPort: 50020, name: hdfs3}
            - {containerPort: 9000, name: hdfs4}
            - {containerPort: 50090, name: hdfs5}
            - {containerPort: 31010, name: hdfs6}
            - {containerPort: 8030, name: yarn1}
            - {containerPort: 8031, name: yarn2}
            - {containerPort: 8032, name: yarn3}
            - {containerPort: 8033, name: yarn4}
            - {containerPort: 8040, name: yarn5}
            - {containerPort: 8042, name: yarn6}
            - {containerPort: 8088, name: yarn7}
            - {containerPort: 8188, name: historyserver}
          envFrom:
            - configMapRef:
                name: hadoop-config
          volumeMounts:
            - name: hadoop-datanode1
              mountPath: /hadoop/dfs/data
      volumes:
        - name: hadoop-datanode1
          persistentVolumeClaim:
            claimName: hadoop-datanode1-pvc
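The datanode2 and datanode3 manifests (and their PVCs) differ from the one above only in the numbered name, so they can be stamped out mechanically. A minimal sketch of that idea (the helper is hypothetical, not part of any tool used here):

```python
# Hypothetical helper: derive the datanode2/datanode3 manifests from the
# datanode1 YAML by replacing the numbered name everywhere it appears
# (service name, labels, hostname, container name, volume, PVC claim).
def stamp_datanode(template, index):
    return template.replace("datanode1", "datanode%d" % index)

template = "metadata:\n  name: datanode1\n"
print(stamp_datanode(template, 2))
# metadata:
#   name: datanode2
```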
After creating them, be sure to check the logs with kubectl logs and confirm there are no errors before moving on to the next step.
4. resourcemanager
Write resourcemanager.yaml as follows:
apiVersion: v1
kind: Service
metadata:
  name: resourcemanager
  labels:
    name: resourcemanager
spec:
  ports:
    - {port: 50070, name: http}
    - {port: 8020, name: hdfs}
    - {port: 50075, name: hdfs1}
    - {port: 50010, name: hdfs2}
    - {port: 50020, name: hdfs3}
    - {port: 9000, name: hdfs4}
    - {port: 50090, name: hdfs5}
    - {port: 31010, name: hdfs6}
    - {port: 8030, name: yarn1}
    - {port: 8031, name: yarn2}
    - {port: 8032, name: yarn3}
    - {port: 8033, name: yarn4}
    - {port: 8040, name: yarn5}
    - {port: 8042, name: yarn6}
    - {port: 8088, name: yarn7}
    - {port: 8188, name: historyserver}
  selector:
    name: resourcemanager
  clusterIP: None
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: resourcemanager
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: resourcemanager
    spec:
      hostname: resourcemanager
      containers:
        - name: resourcemanager
          image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.7.1-java8
          imagePullPolicy: IfNotPresent
          ports:
            - {containerPort: 50070, name: http}
            - {containerPort: 8020, name: hdfs}
            - {containerPort: 50075, name: hdfs1}
            - {containerPort: 50010, name: hdfs2}
            - {containerPort: 50020, name: hdfs3}
            - {containerPort: 9000, name: hdfs4}
            - {containerPort: 50090, name: hdfs5}
            - {containerPort: 31010, name: hdfs6}
            - {containerPort: 8030, name: yarn1}
            - {containerPort: 8031, name: yarn2}
            - {containerPort: 8032, name: yarn3}
            - {containerPort: 8033, name: yarn4}
            - {containerPort: 8040, name: yarn5}
            - {containerPort: 8042, name: yarn6}
            - {containerPort: 8088, name: yarn7}
            - {containerPort: 8188, name: historyserver}
          envFrom:
            - configMapRef:
                name: hadoop-config
5. nodemanager
Write nodemanager.yaml as follows:
apiVersion: v1
kind: Service
metadata:
  name: nodemanager1
  labels:
    name: nodemanager1
spec:
  ports:
    - {port: 50070, name: http}
    - {port: 8020, name: hdfs}
    - {port: 50075, name: hdfs1}
    - {port: 50010, name: hdfs2}
    - {port: 50020, name: hdfs3}
    - {port: 9000, name: hdfs4}
    - {port: 50090, name: hdfs5}
    - {port: 31010, name: hdfs6}
    - {port: 8030, name: yarn1}
    - {port: 8031, name: yarn2}
    - {port: 8032, name: yarn3}
    - {port: 8033, name: yarn4}
    - {port: 8040, name: yarn5}
    - {port: 8042, name: yarn6}
    - {port: 8088, name: yarn7}
    - {port: 8188, name: historyserver}
  selector:
    name: nodemanager1
  clusterIP: None
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nodemanager1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: nodemanager1
    spec:
      hostname: nodemanager1
      containers:
        - name: nodemanager1
          image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.7.1-java8
          imagePullPolicy: IfNotPresent
          ports:
            - {containerPort: 50070, name: http}
            - {containerPort: 8020, name: hdfs}
            - {containerPort: 50075, name: hdfs1}
            - {containerPort: 50010, name: hdfs2}
            - {containerPort: 50020, name: hdfs3}
            - {containerPort: 9000, name: hdfs4}
            - {containerPort: 50090, name: hdfs5}
            - {containerPort: 31010, name: hdfs6}
            - {containerPort: 8030, name: yarn1}
            - {containerPort: 8031, name: yarn2}
            - {containerPort: 8032, name: yarn3}
            - {containerPort: 8033, name: yarn4}
            - {containerPort: 8040, name: yarn5}
            - {containerPort: 8042, name: yarn6}
            - {containerPort: 8088, name: yarn7}
            - {containerPort: 8188}
          envFrom:
            - configMapRef:
                name: hadoop-config
6. historyserver
The PVC is analogous to the earlier ones. Write historyserver.yaml as follows:
apiVersion: v1
kind: Service
metadata:
  name: historyserver
  labels:
    name: historyserver
spec:
  ports:
    - {port: 50070, name: http}
    - {port: 8020, name: hdfs}
    - {port: 50075, name: hdfs1}
    - {port: 50010, name: hdfs2}
    - {port: 50020, name: hdfs3}
    - {port: 9000, name: hdfs4}
    - {port: 50090, name: hdfs5}
    - {port: 31010, name: hdfs6}
    - {port: 8030, name: yarn1}
    - {port: 8031, name: yarn2}
    - {port: 8032, name: yarn3}
    - {port: 8033, name: yarn4}
    - {port: 8040, name: yarn5}
    - {port: 8042, name: yarn6}
    - {port: 8088, name: yarn7}
    - {port: 8188, name: historyserver}
  selector:
    name: historyserver
  clusterIP: None
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: historyserver
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: historyserver
    spec:
      hostname: historyserver
      containers:
        - name: historyserver
          image: bde2020/hadoop-historyserver:1.1.0-hadoop2.7.1-java8
          imagePullPolicy: IfNotPresent
          ports:
            - {containerPort: 50070, name: http}
            - {containerPort: 8020, name: hdfs}
            - {containerPort: 50075, name: hdfs1}
            - {containerPort: 50010, name: hdfs2}
            - {containerPort: 50020, name: hdfs3}
            - {containerPort: 9000, name: hdfs4}
            - {containerPort: 50090, name: hdfs5}
            - {containerPort: 31010, name: hdfs6}
            - {containerPort: 8030, name: yarn1}
            - {containerPort: 8031, name: yarn2}
            - {containerPort: 8032, name: yarn3}
            - {containerPort: 8033, name: yarn4}
            - {containerPort: 8040, name: yarn5}
            - {containerPort: 8042, name: yarn6}
            - {containerPort: 8088, name: yarn7}
            - {containerPort: 8188}
          envFrom:
            - configMapRef:
                name: hadoop-config
          volumeMounts:
            - name: hadoop-historyserver
              mountPath: /hadoop/yarn/timeline
      volumes:
        - name: hadoop-historyserver
          persistentVolumeClaim:
            claimName: hadoop-historyserver-pvc
After creating all of the above with kubectl create, follow the GitHub README and open each of the five components' endpoints, with the corresponding ports, in a browser (this must be done from a machine inside the cluster). If the Hadoop web pages render correctly, the cluster is up!
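For reference, these are the default Hadoop 2.7.x web-UI ports for the components deployed above. Since the services are headless (clusterIP: None), each one resolves in-cluster as `<service>.<namespace>.svc.cluster.local`. A small sketch that builds the URLs to check, assuming everything runs in the `default` namespace:

```python
# Default Hadoop 2.7.x web-UI ports for the components deployed above.
WEB_UI_PORTS = {
    "namenode": 50070,
    "datanode1": 50075,
    "resourcemanager": 8088,
    "nodemanager1": 8042,
    "historyserver": 8188,
}

def ui_urls(namespace="default"):
    # Headless services resolve via cluster DNS as
    # <service>.<namespace>.svc.cluster.local
    return {svc: "http://%s.%s.svc.cluster.local:%d" % (svc, namespace, port)
            for svc, port in WEB_UI_PORTS.items()}

print(ui_urls()["namenode"])
# http://namenode.default.svc.cluster.local:50070
```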
7. Test HDFS
A quick check that the nodes can communicate properly.
Enter the namenode with kubectl exec -it namenode /bin/bash, then run hdfs dfs -put /etc/issue / and see whether the upload succeeds.
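Since dfs.webhdfs.enabled is set to true in the ConfigMap, the same kind of upload can also be driven over the WebHDFS REST API instead of the hdfs CLI. A sketch that only constructs the CREATE request URL (actually issuing it requires in-cluster access; user.name=root matches the static user set in the ConfigMap):

```python
# Build a WebHDFS CREATE URL (Hadoop 2.7 WebHDFS REST API):
#   PUT http://<namenode>:50070/webhdfs/v1/<path>?op=CREATE&user.name=<user>
# The namenode answers with a redirect to a datanode that accepts the data.
def webhdfs_create_url(host, path, user="root", port=50070):
    return "http://%s:%d/webhdfs/v1%s?op=CREATE&user.name=%s" % (
        host, port, path, user)

print(webhdfs_create_url("namenode", "/issue"))
# http://namenode:50070/webhdfs/v1/issue?op=CREATE&user.name=root
```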
8. Test YARN
Enter the namenode container and run a test job following https://www.cnblogs.com/ccskun/p/7820977.html. Check whether the job completes normally and whether it shows up as finished in the resourcemanager web UI.