Using TensorFlow with Kubernetes and Docker in Practice


TensorFlow

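TensorFlow is Google's second-generation machine learning system, developed as the successor to DistBelief. Its name reflects how it works: a tensor is an N-dimensional array, and flow refers to computation expressed as a dataflow graph, so the name describes tensors flowing from one end of the graph to the other as the computation proceeds. In short, TensorFlow is a system that feeds complex data structures into neural networks for analysis and processing.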

TensorFlow runs on devices ranging from a single smartphone to thousands of data-center servers. This article looks at one way to run TensorFlow on containers at scale.

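As a deep learning framework, TensorFlow handles data in several phases: training, validation, testing, and serving. Broadly, training fits the model; validation checks the trained model for correctness and overfitting; testing evaluates the model on held-out (black-box) data to measure its accuracy; and serving uses the finished model to answer requests. To keep things simple, this article collapses these phases into two categories: training and serving.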

Training takes a training program and a training dataset and produces a trained model. The trained model can be saved to disk as checkpoint files.

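Validation, testing, and serving are all treated as serving here, since each follows the same basic flow: load an existing model and run it over a dataset.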

TensorFlow training in Kubernetes

Kubernetes can support TensorFlow training by creating multiple pods. Distributed TensorFlow is set up by designating parameter servers (ps) and worker servers.

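The ps is the parameter server for the whole training cluster and holds the model's Variables. Workers are the nodes that compute gradients; the gradient vectors they produce are handed to the ps, which updates the model.

In-graph and between-graph replication are two alternatives, and both can be used for synchronous or asynchronous training. In-graph means a single client builds the graph for the whole cluster and submits it, while the other workers only carry out gradient computation. Between-graph means several workers in a cluster each build their own graph; because the workers run the same code, the graphs are identical, and because the parameters are all stored on the same ps, every worker trains the same model. Since each worker can build the graph and read training data itself, this mode suits large datasets.

Synchronous and asynchronous training differ in how updates are applied: synchronous training blocks on every gradient update until all workers have reported their results, while asynchronous training never blocks and therefore achieves higher training throughput. Asynchronous training is the usual choice in large-data, distributed settings. (Quoted from "TensorFlow深度學習".)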
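The difference between the two update rules can be illustrated with a toy model in plain Python. This is only a sketch and does not use TensorFlow: the function names, the one-parameter quadratic loss, and the shard values are all made up for illustration, and the asynchronous variant simply applies updates one worker at a time without modelling stale gradients.

```python
# Toy illustration only: plain Python, no TensorFlow involved.

def gradient(w, shard):
    """Gradient of the loss 0.5 * (w - x)^2, averaged over one worker's data shard."""
    return sum(w - x for x in shard) / len(shard)


def train_sync(shards, steps, lr):
    """Synchronous rule: the ps waits for every worker, averages, and updates once per step."""
    w = 0.0
    for _ in range(steps):
        grads = [gradient(w, shard) for shard in shards]  # all workers read the same w
        w -= lr * sum(grads) / len(grads)
    return w


def train_async(shards, steps, lr):
    """Asynchronous rule: the ps applies each worker's gradient as soon as it arrives."""
    w = 0.0
    for _ in range(steps):
        for shard in shards:              # workers report one after another
            w -= lr * gradient(w, shard)  # each update is applied immediately
    return w


if __name__ == "__main__":
    shards = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical data held by two workers
    print("sync :", train_sync(shards, steps=200, lr=0.1))
    print("async:", train_async(shards, steps=200, lr=0.1))
```

In this toy setup the synchronous rule settles on the mean of the data, while the asynchronous rule ends up close to it but depends on the order in which workers report.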

I used ReplicationControllers (rc) to create multiple ps and worker servers.

The gcr.io/tensorflow/tensorflow:latest image is the official image published by TensorFlow, and it computes on the CPU. The GPU version is covered later in this article.

[root@A01-R06-I184-22 yaml]# cat ps.yaml 
apiVersion: v1
kind: ReplicationController
metadata:
  name: tensorflow-ps-rc
spec:
  replicas: 2
  selector:
    name: tensorflow-ps
  template:
    metadata:
      labels:
        name: tensorflow-ps
        role: ps
    spec:
      containers:
        - name: ps
          image: gcr.io/tensorflow/tensorflow:latest
          ports:
           - containerPort: 2222

[root@A01-R06-I184-22 yaml]# cat worker.yaml 
apiVersion: v1
kind: ReplicationController
metadata:
  name: tensorflow-worker-rc
spec:
  replicas: 2
  selector:
    name: tensorflow-worker
  template:
    metadata:
      labels:
        name: tensorflow-worker
        role: worker
    spec:
      containers:
        - name: worker
          image: gcr.io/tensorflow/tensorflow:latest
          ports:
           - containerPort: 2222
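The two manifests above differ only in the role name, so they could also be generated from one template. The following is a minimal sketch, assuming a hypothetical helper named build_rc; it emits JSON, which kubectl accepts as well as YAML.

```python
import json


def build_rc(role, replicas, image="gcr.io/tensorflow/tensorflow:latest", port=2222):
    """Return a ReplicationController manifest (as a dict) for the given role.

    build_rc is a hypothetical helper; names and labels mirror ps.yaml and worker.yaml.
    """
    name = "tensorflow-%s" % role
    return {
        "apiVersion": "v1",
        "kind": "ReplicationController",
        "metadata": {"name": name + "-rc"},
        "spec": {
            "replicas": replicas,
            "selector": {"name": name},
            "template": {
                "metadata": {"labels": {"name": name, "role": role}},
                "spec": {
                    "containers": [{
                        "name": role,
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }


if __name__ == "__main__":
    # Print one manifest per role; each can be saved to a file and passed to kubectl create -f.
    for role in ("ps", "worker"):
        print(json.dumps(build_rc(role, 2), indent=2))
```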

Next, create one service for the ps pods and another for the worker pods.

[root@A01-R06-I184-22 yaml]# cat ps-srv.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    name: tensorflow-ps
    role: service
  name: tensorflow-ps-service
spec:
  ports:
    - port: 2222
      targetPort: 2222
  selector:
    name: tensorflow-ps
    
[root@A01-R06-I184-22 yaml]# cat worker-srv.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    name: tensorflow-worker
    role: service
  name: tensorflow-wk-service
spec:
  ports:
    - port: 2222
      targetPort: 2222
  selector:
    name: tensorflow-worker

Describing each service shows the IPs of the containers behind it.

[root@A01-R06-I184-22 yaml]# kubectl describe service tensorflow-ps-service 
Name:			tensorflow-ps-service
Namespace:		default
Labels:			name=tensorflow-ps,role=service
Selector:		name=tensorflow-ps
Type:			ClusterIP
IP:			10.254.170.61
Port:			<unset>	2222/TCP
Endpoints:		4.0.84.3:2222,4.0.84.4:2222
Session Affinity:	None
No events.

[root@A01-R06-I184-22 yaml]# kubectl describe service tensorflow-wk-service 
Name:			tensorflow-wk-service
Namespace:		default
Labels:			name=tensorflow-worker,role=service
Selector:		name=tensorflow-worker
Type:			ClusterIP
IP:			10.254.70.9
Port:			<unset>	2222/TCP
Endpoints:		4.0.84.5:2222,4.0.84.6:2222
Session Affinity:	None
No events.
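The Endpoints lines above are what later go into the --ps_hosts and --worker_hosts flags. As a small sketch, the host lists could be pulled out of the describe output like this; parse_endpoints is a hypothetical helper, and it assumes the output format shown above.

```python
def parse_endpoints(describe_output):
    """Extract the 'ip:port' entries from the Endpoints line of `kubectl describe service`."""
    for line in describe_output.splitlines():
        if line.startswith("Endpoints:"):
            value = line.split(":", 1)[1].strip()  # text after the "Endpoints:" label
            return [ep.strip() for ep in value.split(",") if ep.strip()]
    return []  # no Endpoints line found


if __name__ == "__main__":
    # Sample text copied from the ps service output above.
    sample = "Port:\t\t\t<unset>\t2222/TCP\nEndpoints:\t\t4.0.84.3:2222,4.0.84.4:2222\n"
    print(",".join(parse_endpoints(sample)))
```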

I used deep_recommend_system for the distributed experiment.

First download the deep_recommend_system code inside each pod.

curl https://codeload.github.com/tobegit3hub/deep_recommend_system/zip/master -o drs.zip
unzip drs.zip
cd deep_recommend_system-master/distributed/

In one of the ps containers (4.0.84.3), start the two ps server tasks:

root@tensorflow-ps-rc-b5d6g:/notebooks/deep_recommend_system-master/distributed# nohup python cancer_classifier.py --ps_hosts=4.0.84.3:2222,4.0.84.3:2223 --worker_hosts=4.0.84.5:2222,4.0.84.6:2222 --job_name=ps --task_index=0 >log1 &
[1] 502
root@tensorflow-ps-rc-b5d6g:/notebooks/deep_recommend_system-master/distributed# nohup: ignoring input and redirecting stderr to stdout
 
root@tensorflow-ps-rc-b5d6g:/notebooks/deep_recommend_system-master/distributed# nohup python cancer_classifier.py --ps_hosts=4.0.84.3:2222,4.0.84.3:2223 --worker_hosts=4.0.84.5:2222,4.0.84.6:2222 --job_name=ps --task_index=1 >log2 &
[2] 603
root@tensorflow-ps-rc-b5d6g:/notebooks/deep_recommend_system-master/distributed# nohup: ignoring input and redirecting stderr to stdout

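I first tried running the two ps servers in two separate pods, but that kept failing with a core dump. Similar errors have been reported on the official site and I could not resolve this one; my guess is that the two pods, which were on the same host, ended up sharing some device. Running both ps servers inside a single pod works without problems.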

Then run the following in the two worker containers, one command in each:

root@tensorflow-worker-rc-vznvt:/notebooks/deep_recommend_system-master/distributed# nohup python cancer_classifier.py --ps_hosts=4.0.84.3:2222,4.0.84.3:2223 --worker_hosts=4.0.84.5:2222,4.0.84.6:2222 --job_name=worker --task_index=0 >log &

***********************

root@tensorflow-worker-rc-cpnt7:/notebooks/deep_recommend_system-master/distributed# nohup python cancer_classifier.py --ps_hosts=4.0.84.3:2222,4.0.84.3:2223 --worker_hosts=4.0.84.5:2222,4.0.84.6:2222 --job_name=worker --task_index=1 >log &
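Every task is started with the same cluster description and differs only in --job_name and --task_index. The sketch below generates those command lines from the two host lists; launch_commands is a hypothetical helper, and only the flags shown in the commands above are used.

```python
def launch_commands(ps_hosts, worker_hosts, script="cancer_classifier.py"):
    """Build one command line per task; the cluster flags are identical for every task."""
    cluster = "--ps_hosts=%s --worker_hosts=%s" % (
        ",".join(ps_hosts), ",".join(worker_hosts))
    commands = []
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_index in range(len(hosts)):
            commands.append("python %s %s --job_name=%s --task_index=%d"
                            % (script, cluster, job_name, task_index))
    return commands


if __name__ == "__main__":
    # Host lists taken from the commands above (both ps tasks share one pod).
    for cmd in launch_commands(["4.0.84.3:2222", "4.0.84.3:2223"],
                               ["4.0.84.5:2222", "4.0.84.6:2222"]):
        print(cmd)
```

Generating the commands this way keeps the host lists consistent across tasks, which matters because every process in the cluster has to agree on the same ps and worker addresses.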

After that, the intermediate saved results of the model can be found in the checkpoint directory on the worker:

root@tensorflow-worker-rc-vznvt:/notebooks/deep_recommend_system-master/distributed# ll checkpoint/
total 840
drwxr-xr-x 2 root root   4096 Oct 10 15:45 ./
drwxr-xr-x 3 root root     76 Oct 10 15:18 ../
-rw-r--r-- 1 root root      0 Sep 23 14:27 .gitkeeper
-rw-r--r-- 1 root root    270 Oct 10 15:45 checkpoint
-rw-r--r-- 1 root root  86469 Oct 10 15:45 events.out.tfevents.1476113854.tensorflow-worker-rc-vznvt
-rw-r--r-- 1 root root 248875 Oct 10 15:37 graph.pbtxt
-rw-r--r-- 1 root root   2229 Oct 10 15:42 model.ckpt-1172
-rw-r--r-- 1 root root  94464 Oct 10 15:42 model.ckpt-1172.meta
-rw-r--r-- 1 root root   2229 Oct 10 15:43 model.ckpt-1422
-rw-r--r-- 1 root root  94464 Oct 10 15:43 model.ckpt-1422.meta
-rw-r--r-- 1 root root   2229 Oct 10 15:44 model.ckpt-1670
-rw-r--r-- 1 root root  94464 Oct 10 15:44 model.ckpt-1670.meta
-rw-r--r-- 1 root root   2229 Oct 10 15:45 model.ckpt-1921
-rw-r--r-- 1 root root  94464 Oct 10 15:45 model.ckpt-1921.meta
-rw-r--r-- 1 root root   2229 Oct 10 15:41 model.ckpt-921
-rw-r--r-- 1 root root  94464 Oct 10 15:41 model.ckpt-921.meta
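Note that the listing is sorted alphabetically, so model.ckpt-921 appears last even though model.ckpt-1921 is the newest. TensorFlow records the most recent checkpoint in the checkpoint file itself; the sketch below only illustrates picking the highest step number from file names, using a hypothetical helper named latest_checkpoint.

```python
import re

# Matches the data files such as model.ckpt-1921, but not the .meta files.
CKPT_RE = re.compile(r"^model\.ckpt-(\d+)$")


def latest_checkpoint(filenames):
    """Return the checkpoint file name with the largest step number, or None."""
    best = None
    for name in filenames:
        match = CKPT_RE.match(name)
        if match:
            step = int(match.group(1))  # compare numerically, not as strings
            if best is None or step > best[0]:
                best = (step, name)
    return best[1] if best else None


if __name__ == "__main__":
    names = ["checkpoint", "graph.pbtxt", "model.ckpt-1172", "model.ckpt-1172.meta",
             "model.ckpt-1921", "model.ckpt-1921.meta", "model.ckpt-921"]
    print(latest_checkpoint(names))
```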

TensorFlow GPU support

TensorFlow GPU in Docker

Docker can expose GPU devices to a container. NVIDIA officially provides nvidia-docker, which replaces the docker command line with the nvidia-docker command line in order to use the GPU.

nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu

This approach is fairly intrusive to Docker, so NVIDIA also provides nvidia-docker-plugin. It is used as follows.

First start nvidia-docker-plugin on the host:

[root@A01-R06-I184-22 nvidia-docker]# ./nvidia-docker-plugin 
./nvidia-docker-plugin | 2016/10/10 00:01:12 Loading NVIDIA unified memory
./nvidia-docker-plugin | 2016/10/10 00:01:12 Loading NVIDIA management library
./nvidia-docker-plugin | 2016/10/10 00:01:17 Discovering GPU devices
./nvidia-docker-plugin | 2016/10/10 00:01:18 Provisioning volumes at /var/lib/nvidia-docker/volumes
./nvidia-docker-plugin | 2016/10/10 00:01:18 Serving plugin API at /run/docker/plugins
./nvidia-docker-plugin | 2016/10/10 00:01:18 Serving remote API at localhost:3476

The log shows nvidia-docker-plugin listening on port 3476. Next, on the host, run docker run -ti `curl -s http://localhost:3476/v1.0/docker/cli` -p 8890:8888 gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash to create the TensorFlow GPU container, then check inside the container that import tensorflow works.

[root@A01-R06-I184-22 ~]# docker run -ti `curl -s http://localhost:3476/v1.0/docker/cli` -p 8890:8888 gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash
root@7087e1f99062:/notebooks# python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
>>> 

As the output shows, TensorFlow now loads correctly, including the CUDA libraries.

deep_recommend_system is used for testing here as well. First download its code inside the container.

curl https://codeload.github.com/tobegit3hub/deep_recommend_system/zip/master -o drs.zip
unzip drs.zip
cd deep_recommend_system-master/

Then run the computation on GPUs 0 and 1.

root@7087e1f99062:/notebooks/deep_recommend_system-master# export CUDA_VISIBLE_DEVICES='0,1'    # selects which GPU indices to use
root@7087e1f99062:/notebooks/deep_recommend_system-master# python cancer_classifier.py 
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Use the model: wide_and_deep
Use the optimizer: adagrad
Use the model: wide_and_deep
Use the model: wide_and_deep
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: Tesla K20c
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:02:00.0
Total memory: 4.69GiB
Free memory: 4.61GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x24402e0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties: 
name: Tesla K20c
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:04:00.0
Total memory: 4.69GiB
Free memory: 4.61GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1:   Y Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20c, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K20c, pci bus id: 0000:04:00.0)
[0:00:34.437041] Step: 100, loss: 2.97578692436, accuracy: 0.77734375, auc: 0.763736724854
[0:00:32.162310] Step: 200, loss: 1.81753754616, accuracy: 0.7890625, auc: 0.788772583008
[0:00:37.559177] Step: 300, loss: 1.26066374779, accuracy: 0.865234375, auc: 0.811861813068
[0:00:36.082163] Step: 400, loss: 0.920016527176, accuracy: 0.8359375, auc: 0.820605039597
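The run above selects its GPUs through the CUDA_VISIBLE_DEVICES environment variable. When the training script is launched from another Python program, the same variable can be set on the child process's environment. This is a minimal sketch; gpu_env is a hypothetical helper.

```python
import os


def gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set to the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    env = gpu_env([0, 1])
    print(env["CUDA_VISIBLE_DEVICES"])
    # The dict can then be passed to a child process, for example:
    # subprocess.call(["python", "cancer_classifier.py"], env=env)
```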

Running nvidia-smi on the host likewise shows the GPU usage:

[root@A01-R06-I184-22 ~]# nvidia-smi 
Tue Oct 11 00:10:28 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:02:00.0     Off |                    0 |
| 30%   26C    P0    48W / 225W |   4540MiB /  4799MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:04:00.0     Off |                    0 |
| 30%   31C    P0    48W / 225W |   4499MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          Off  | 0000:83:00.0     Off |                    0 |
| 30%   25C    P8    26W / 225W |     11MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20c          Off  | 0000:84:00.0     Off |                    0 |
| 30%   24C    P8    25W / 225W |     11MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    132460    C   python                                        4524MiB |
|    1    132460    C   python                                        4484MiB |
+-----------------------------------------------------------------------------+

nvidia-docker-plugin works by exposing an API:

[root@A01-R06-I184-22 ~]# curl -s http://localhost:3476/v1.0/docker/cli
--volume-driver=nvidia-docker --volume=nvidia_driver_352.39:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/nvidia1 --device=/dev/nvidia2 --device=/dev/nvidia3

As the output shows, curl -s http://localhost:3476/v1.0/docker/cli simply returns the arguments that docker run needs. They include the options that map the GPU devices into the container (--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/nvidia1 --device=/dev/nvidia2 --device=/dev/nvidia3) and the option that mounts the nvidia_driver_352.39 volume into the container.
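To make the structure of that argument string explicit, the sketch below splits it into devices, volumes, and the volume driver. parse_docker_cli is a hypothetical helper, and the sample string is the one returned by the plugin above.

```python
import shlex


def parse_docker_cli(args_line):
    """Group the docker run options returned by the plugin by option name."""
    parsed = {"devices": [], "volumes": [], "volume_driver": None}
    for token in shlex.split(args_line):
        key, _, value = token.partition("=")  # options have the form --name=value
        if key == "--device":
            parsed["devices"].append(value)
        elif key == "--volume":
            parsed["volumes"].append(value)
        elif key == "--volume-driver":
            parsed["volume_driver"] = value
    return parsed


if __name__ == "__main__":
    line = ("--volume-driver=nvidia-docker "
            "--volume=nvidia_driver_352.39:/usr/local/nvidia:ro "
            "--device=/dev/nvidiactl --device=/dev/nvidia-uvm "
            "--device=/dev/nvidia0 --device=/dev/nvidia1 "
            "--device=/dev/nvidia2 --device=/dev/nvidia3")
    print(parse_docker_cli(line))
```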

Next, let us look at the nvidia_driver_352.39 volume:

[root@A01-R06-I184-22 ~]# docker volume ls
DRIVER              VOLUME NAME
nvidia-docker       nvidia_driver_352.39
[root@A01-R06-I184-22 ~]# docker volume inspect nvidia_driver_352.39 
[
    {
        "Name": "nvidia_driver_352.39",
        "Driver": "nvidia-docker",
        "Mountpoint": "/var/lib/nvidia-docker/volumes/nvidia_driver/352.39"
    }
]

The volume is really just a directory. Here is what /var/lib/nvidia-docker/volumes/nvidia_driver/352.39/ contains:

[root@A01-R06-I184-22 ~]# tree -L 3 /var/lib/nvidia-docker/volumes/nvidia_driver/352.39/
/var/lib/nvidia-docker/volumes/nvidia_driver/352.39/
├── bin
│   ├── nvidia-cuda-mps-control
│   ├── nvidia-cuda-mps-server
│   ├── nvidia-debugdump
│   ├── nvidia-persistenced
│   └── nvidia-smi
├── lib
│   ├── libcuda.so -> libcuda.so.352.39
│   ├── libcuda.so.1 -> libcuda.so.352.39
│   ├── libcuda.so.352.39
│   ├── libGL.so.1 -> libGL.so.352.39
│   ├── libGL.so.352.39
│   ├── libnvcuvid.so.1 -> libnvcuvid.so.352.39
│   ├── libnvcuvid.so.352.39
│   ├── libnvidia-compiler.so.352.39
│   ├── libnvidia-eglcore.so.352.39
│   ├── libnvidia-encode.so.1 -> libnvidia-encode.so.352.39
│   ├── libnvidia-encode.so.352.39
│   ├── libnvidia-fbc.so.1 -> libnvidia-fbc.so.352.39
│   ├── libnvidia-fbc.so.352.39
│   ├── libnvidia-glcore.so.352.39
│   ├── libnvidia-glsi.so.352.39
│   ├── libnvidia-ifr.so.1 -> libnvidia-ifr.so.352.39
│   ├── libnvidia-ifr.so.352.39
│   ├── libnvidia-ml.so.1 -> libnvidia-ml.so.352.39
│   ├── libnvidia-ml.so.352.39
│   ├── libnvidia-opencl.so.1 -> libnvidia-opencl.so.352.39
│   ├── libnvidia-opencl.so.352.39
│   ├── libvdpau_nvidia.so.1 -> libvdpau_nvidia.so.352.39
│   └── libvdpau_nvidia.so.352.39
└── lib64
    ├── libcuda.so -> libcuda.so.352.39
    ├── libcuda.so.1 -> libcuda.so.352.39
    ├── libcuda.so.352.39
    ├── libGL.so.1 -> libGL.so.352.39
    ├── libGL.so.352.39
    ├── libnvcuvid.so.1 -> libnvcuvid.so.352.39
    ├── libnvcuvid.so.352.39
    ├── libnvidia-compiler.so.352.39
    ├── libnvidia-eglcore.so.352.39
    ├── libnvidia-encode.so.1 -> libnvidia-encode.so.352.39
    ├── libnvidia-encode.so.352.39
    ├── libnvidia-fbc.so.1 -> libnvidia-fbc.so.352.39
    ├── libnvidia-fbc.so.352.39
    ├── libnvidia-glcore.so.352.39
    ├── libnvidia-glsi.so.352.39
    ├── libnvidia-ifr.so.1 -> libnvidia-ifr.so.352.39
    ├── libnvidia-ifr.so.352.39
    ├── libnvidia-ml.so.1 -> libnvidia-ml.so.352.39
    ├── libnvidia-ml.so.352.39
    ├── libnvidia-opencl.so.1 -> libnvidia-opencl.so.352.39
    ├── libnvidia-opencl.so.352.39
    ├── libnvidia-tls.so.352.39
    ├── libvdpau_nvidia.so.1 -> libvdpau_nvidia.so.352.39
    └── libvdpau_nvidia.so.352.39

3 directories, 52 files

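The directory mainly holds the GPU driver libraries plus a few required executables. nvidia-docker-plugin collects these files from the host and copies them into the directory so that they can be supplied to containers, which makes it easy for containers to use the GPU.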

Kubernetes and GPUs

Kubernetes 1.3 introduced GPU scheduling support, but it is currently experimental.

TensorFlow serving

The article "Serving Inception Model with TensorFlow Serving and Kubernetes" describes how to combine TensorFlow serving with Kubernetes.

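The basic approach is to first package an already-trained model into an image, inception_serving, that can serve requests, then create an rc from that image and a matching service.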

$ kubectl get rc
CONTROLLER             CONTAINER(S)          IMAGE(S)                              SELECTOR               REPLICAS   AGE
inception-controller   inception-container   gcr.io/tensorflow-serving/inception   worker=inception-pod   3          20s

$ kubectl get svc
NAME                CLUSTER_IP      EXTERNAL_IP      PORT(S)    SELECTOR               AGE
inception-service   10.15.242.244   146.148.88.232   9000/TCP   worker=inception-pod   3m

$ kubectl describe svc inception-service
Name:     inception-service
Namespace:    default
Labels:     <none>
Selector:   worker=inception-pod
Type:     LoadBalancer
IP:     10.15.242.244
LoadBalancer Ingress: 146.148.88.232
Port:     <unnamed> 9000/TCP
NodePort:   <unnamed> 32006/TCP
Endpoints:    10.12.2.4:9000,10.12.4.4:9000,10.12.4.5:9000
Session Affinity: None
Events:
  FirstSeen LastSeen  Count From      SubobjectPath Reason      Message
  ───────── ────────  ───── ────      ───────────── ──────      ───────
  4m    3m    2 {service-controller }     CreatingLoadBalancer  Creating load balancer
  3m    2m    2 {service-controller }     CreatedLoadBalancer   Created load balancer

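Users reach the service directly through the EXTERNAL_IP (146.148.88.232:9000). When a request arrives, Kubernetes dispatches it to one of the pods at 10.12.2.4:9000, 10.12.4.4:9000, or 10.12.4.5:9000; that pod does the actual work and returns the result.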

In essence this is no different from serving a web application (for example on Tomcat), and Kubernetes handles it well.
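The dispatch step can be pictured with a toy simulation. This is not how Kubernetes is implemented: the actual selection policy depends on how kube-proxy is configured, and the uniform random choice below, like the dispatch helper itself, is an assumption made for illustration.

```python
import random
from collections import Counter


def dispatch(endpoints, num_requests, seed=0):
    """Toy model: send each request to a randomly chosen endpoint and count the hits."""
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    return Counter(rng.choice(endpoints) for _ in range(num_requests))


if __name__ == "__main__":
    # Endpoints taken from the service description above.
    pods = ["10.12.2.4:9000", "10.12.4.4:9000", "10.12.4.5:9000"]
    for endpoint, hits in sorted(dispatch(pods, 3000).items()):
        print(endpoint, hits)
```

Over many requests each pod receives roughly a third of the traffic, which is why adding replicas to the rc scales the serving capacity.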
