Kubernetes Fluentd + Elasticsearch + Kibana Unified Log Management Platform: A Pitfall-Avoidance Guide


 

With the Kubernetes cluster initially in place, adding a few monitoring components already lets us:

  • Graphically monitor the status and resource usage of every node and pod
  • Scale ReplicaSets out and in with kubectl scale
  • View each Pod's runtime logs with kubectl logs or the dashboard

In a distributed deployment, however, the number of nodes can be large; a typical production environment may have dozens or even hundreds of minion nodes, which calls for a centralized log collection and management system. My first thought was to mount a volume onto shared storage and have WebLogic write its logs there, but that approach has problems:

  • We scale the WebLogic Docker containers from a single domain, which means every replica writes its log to the same path and file name inside the container (/u01/oracle/user_projects/domains/base_domain/servers/AdminServer/logs/AdminServer.log). Mapping that path to shared storage with a volumeMount leads to file conflicts.
  • Pod and container information cannot be captured.
  • Runtime information from other nodes in the cluster cannot be collected.

So a platform-level solution is still needed. The official Kubernetes documentation, https://kubernetes.io/docs/concepts/cluster-administration/logging/

describes several logging approaches and gives a reference architecture for cluster-level logging:

In short, the container processes inside our Pods stream their logs to the minion host; a logging-agent pod running on the same host picks the logs up and forwards them to a backend. The backend can be one of several implementations, for example elasticsearch-logging, with Kibana on top for visualization.

 

Kubernetes recommends this node-level logging-agent pattern and ships two options: Stackdriver Logging for Google Cloud Platform, and Elasticsearch. Both use fluentd as the agent running on each node:

Using a node-level logging agent is the most common and encouraged approach for a Kubernetes cluster, because it creates only one agent per node, and it doesn’t require any changes to the applications running on the node. However, node-level logging only works for applications’ standard output and standard error.

Kubernetes doesn’t specify a logging agent, but two optional logging agents are packaged with the Kubernetes release: Stackdriver Logging for use with Google Cloud Platform, and Elasticsearch. You can find more information and instructions in the dedicated documents. Both use fluentd with custom configuration as an agent on the node.

With that background, here is the pitfall-by-pitfall guide.

1. Preparation

  • The environment

Operating system: CentOS 7.3

Kubernetes version: 1.5.2

[root@k8s-master fluentd-elasticsearch]# kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"a55267932d501b9fbd6d73e5ded47d79b5763ce5", GitTreeState:"clean", BuildDate:"2017-04-14T13:36:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"a55267932d501b9fbd6d73e5ded47d79b5763ce5", GitTreeState:"clean", BuildDate:"2017-04-14T13:36:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

  • Clone the Kubernetes source from GitHub onto the master node.
git clone https://github.com/kubernetes/kubernetes
  • Configure a ServiceAccount. The fluentd image downloaded later connects to the API Server over SSL, so unless you plan to modify and rebuild the image this needs to be in place. Setup guide:
http://www.cnblogs.com/ericnie/p/6894688.html
  • Configure DNS. The kibana component finds the elasticsearch-logging Service through DNS; without DNS you would have to change the address in kibana-controller.yaml to the fixed service IP. Setup guide:
http://www.cnblogs.com/ericnie/p/6897142.html

 

  • Download the images

Go into /root/kubernetes/cluster/addons/fluentd-elasticsearch, which contains all the yaml files.

fluentd-es-ds.yaml builds the fluentd DaemonSet that runs on every node and plays the logging-agent role; es-controller.yaml and es-service.yaml build elasticsearch-logging, the backend that aggregates the logs; kibana-controller.yaml and kibana-service.yaml provide the UI.

Pull the images referenced in the controller .yaml files on every minion node (a small loop to do this across nodes is sketched below):

docker pull gcr.io/google_containers/elasticsearch:v2.4.1-2
docker pull gcr.io/google_containers/fluentd-elasticsearch:1.22
docker pull gcr.io/google_containers/kibana:v4.6.1-1
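
Since every minion needs these images locally, a small loop can pre-pull them on all nodes. This is only a sketch: it assumes you can ssh as root to the minions, and the node names in NODES are placeholders for your own cluster.

# Pre-pull the addon images on every minion node (node names are placeholders).
NODES="k8s-node-1 k8s-node-2"
IMAGES="gcr.io/google_containers/elasticsearch:v2.4.1-2 gcr.io/google_containers/fluentd-elasticsearch:1.22 gcr.io/google_containers/kibana:v4.6.1-1"
for node in $NODES; do
  for image in $IMAGES; do
    echo "pulling $image on $node"
    ssh root@"$node" docker pull "$image"
  done
done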

 

2. Start the fluentd DaemonSet

  • Label the nodes (Pitfall 1)

Fluentd is meant to run on every minion node. Create the DaemonSet:

# kubectl create -f fluentd-es-ds.yaml 
daemonset "fluentd-es-v1.22" created

Then, on a minion node, I tried tail -f /var/log/fluentd.log to watch the log, only to find there is no fluentd.log on the minion at all!

A check with

kubectl get pods -n kube-system

showed no fluentd-related pod running, or even pending! :(

Inspecting the DaemonSet itself with

kubectl get -f fluentd-es-ds.yaml
NAME               DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR                               AGE
fluentd-es-v1.22   0         0         0         0            0           beta.kubernetes.io/fluentd-ds-ready=true    2m

shows a NODE-SELECTOR, which should be beta.kubernetes.io/fluentd-ds-ready=true.

 

Then running

kubectl describe nodes k8s-node-1

reveals that my minion node does not carry this label at all, so add it:

kubectl label node k8s-node-1 beta.kubernetes.io/fluentd-ds-ready=true

After re-creating the DaemonSet, /var/log/fluentd.log now exists on k8s-node-1.
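
To confirm the label took effect and the DaemonSet is now scheduling pods, a quick check like the following helps (the node and DaemonSet names are the ones from my cluster; substitute your own):

# Nodes that carry the label the DaemonSet selects on:
kubectl get nodes -l beta.kubernetes.io/fluentd-ds-ready=true
# All labels on a specific node:
kubectl get node k8s-node-1 --show-labels
# DESIRED/CURRENT should now equal the number of labeled nodes:
kubectl get ds fluentd-es-v1.22 -n kube-system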

  • Create the ConfigMap (Pitfall 2)
# tail -f /var/log/fluentd.log

2017-03-02 02:27:01 +0000 [info]: reading config file path="/etc/td-agent/td-agent.conf"
2017-03-02 02:27:01 +0000 [info]: starting fluentd-0.12.31
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-mixin-config-placeholders' version '0.4.0'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-mixin-plaintextformatter' version '0.2.6'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-docker_metadata_filter' version '0.1.3'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '1.5.0'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-kafka' version '0.4.1'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '0.24.0'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-mongo' version '0.7.16'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '1.5.5'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-s3' version '0.8.0'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-scribe' version '0.10.14'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-td' version '0.10.29'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-td-monitoring' version '0.2.2'
2017-03-02 02:27:01 +0000 [info]: gem 'fluent-plugin-webhdfs' version '0.4.2'
2017-03-02 02:27:01 +0000 [info]: gem 'fluentd' version '0.12.31'
2017-03-02 02:27:01 +0000 [info]: adding match pattern="fluent.**" type="null"
2017-03-02 02:27:01 +0000 [info]: adding filter pattern="kubernetes.**" type="kubernetes_metadata"
2017-03-02 02:27:02 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error="Invalid Kubernetes API v1 endpoint https://192.168.0.105:443/api: 401 Unauthorized"
2017-03-02 02:27:02 +0000 [info]: process finished code=256
2017-03-02 02:27:02 +0000 [warn]: process died within 1 second. exit.

The fluentd image connects to my API Server over port 443, and the API Server has its security mechanisms enabled, so keys such as ca_file, client_cert, and client_key must be configured. To avoid rebuilding the image, Kubernetes offers the powerful ConfigMap: we can package the updated td-agent.conf as a ConfigMap resource and mount it at the corresponding path in the fluentd pod, replacing the default td-agent.conf inside the image.

td-agent.conf lives in

/root/kubernetes/cluster/addons/fluentd-elasticsearch/fluentd-es-image

After adding the ca and client entries it looks like this:

// td-agent.conf
... ...
<filter kubernetes.**>
  type kubernetes_metadata
  ca_file /srv/kubernetes/ca.crt
  client_cert /srv/kubernetes/kubecfg.crt
  client_key /srv/kubernetes/kubecfg.key
</filter>
... ...

Two things to note:

  • Before creating the ConfigMap resource from td-agent.conf, delete all the comment lines in the file; it is a good idea to keep a backup first (it proves useful later).
  • The fluentd pods are created in kube-system, so the ConfigMap must also be created in the kube-system namespace, otherwise the DaemonSet cannot find the corresponding ConfigMap.
# kubectl create configmap td-agent-config --from-file=./td-agent.conf -n kube-system
configmap "td-agent-config" created

# kubectl get configmaps td-agent-config -o yaml
apiVersion: v1
data:
  td-agent.conf: |
    <match fluent.**>
      type null
    </match>

    <source>
      type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      time_format %Y-%m-%dT%H:%M:%S.%NZ
      tag kubernetes.*
      format json
      read_from_head true
    </source>
... ...
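
To make sure the ConfigMap really overrides the td-agent.conf baked into the image, it is worth exec-ing into one of the fluentd pods once the DaemonSet is re-created. A rough check; the pod name below is only an example, take the real one from kubectl get pods -n kube-system:

# Confirm the ConfigMap exists in kube-system:
kubectl get configmap td-agent-config -n kube-system
# Look at what the pod actually reads (pod name is an example):
kubectl exec fluentd-es-v1.22-xxxxx -n kube-system -- head -n 20 /etc/td-agent/td-agent.conf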

fluentd-es-ds.yaml needs matching changes, mainly two additional mounts:

one mounting the td-agent-config ConfigMap created above,

and one mounting hostPath /srv/kubernetes so the pod can read the client certificates:

[root@k8s-master fluentd-elasticsearch]# cat fluentd-es-ds.yaml 
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd-es-v1.22
  namespace: kube-system
  labels:
    k8s-app: fluentd-es
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.22
spec:
  template:
    metadata:
      labels:
        k8s-app: fluentd-es
        kubernetes.io/cluster-service: "true"
        version: v1.22
      # This annotation ensures that fluentd does not get evicted if the node
      # supports critical pod annotation based priority scheme.
      # Note that this does not guarantee admission on the nodes (#40573).
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      containers:
      - name: fluentd-es
        image: gcr.io/google_containers/fluentd-elasticsearch:1.22
        command:
          - '/bin/sh'
          - '-c'
          - '/usr/sbin/td-agent 2>&1 >> /var/log/fluentd.log'
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: td-agent-config
          mountPath: /etc/td-agent
        - name: tls-files
          mountPath: /srv/kubernetes
      nodeSelector:
        beta.kubernetes.io/fluentd-ds-ready: "true"
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: td-agent-config
        configMap:
          name: td-agent-config
      - name: tls-files
        hostPath:
          path: /srv/kubernetes
[root@k8s-master fluentd-elasticsearch]# 

 

Re-create fluentd-es-ds.yaml, then check /var/log/fluentd.log on the minion:

......
    client_cert /srv/kubernetes/kubecfg.crt
    client_key /srv/kubernetes/kubecfg.key
  </filter>
  <match **>
    type elasticsearch
    log_level info
    include_tag_key true
    host elasticsearch-logging
    port 9200
    logstash_format true
    buffer_chunk_limit 2M
    buffer_queue_limit 32
    flush_interval 5s
    max_retry_wait 30
    disable_retry_limit 
    num_threads 8
  </match>
</ROOT>

 

Seeing this basically counts as success. It looks fine, but there is actually another pitfall ahead; for now, move on to configuring elasticsearch-logging.

 

3. Configure Elasticsearch

Create the Elasticsearch resources:

# kubectl create -f es-controller.yaml
replicationcontroller "elasticsearch-logging-v1" created

# kubectl create -f es-service.yaml
service "elasticsearch-logging" created

get pods:

kube-system                  elasticsearch-logging-v1-3bzt6          1/1       Running    0          7s        172.16.57.8    10.46.181.146
kube-system                  elasticsearch-logging-v1-nvbe1          1/1       Running    0          7s        172.16.99.10   10.47.136.60

 

Check the logs:

# kubectl logs -f elasticsearch-logging-v1-3bzt6 -n kube-system
F0302 03:59:41.036697       8 elasticsearch_logging_discovery.go:60] kube-system namespace doesn't exist: the server has asked for the client to provide credentials (get namespaces kube-system)
goroutine 1 [running]:
k8s.io/kubernetes/vendor/github.com/golang/glog.stacks(0x19a8100, 0xc400000000, 0xc2, 0x186)
... ...
main.main()
    elasticsearch_logging_discovery.go:60 +0xb53

[2017-03-02 03:59:42,587][INFO ][node                     ] [elasticsearch-logging-v1-3bzt6] version[2.4.1], pid[16], build[c67dc32/2016-09-27T18:57:55Z]
[2017-03-02 03:59:42,588][INFO ][node                     ] [elasticsearch-logging-v1-3bzt6] initializing ...
[2017-03-02 03:59:44,396][INFO ][plugins                  ] [elasticsearch-logging-v1-3bzt6] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
... ...
[2017-03-02 03:59:44,441][INFO ][env                      ] [elasticsearch-logging-v1-3bzt6] heap size [1007.3mb], compressed ordinary object pointers [true]
[2017-03-02 03:59:48,355][INFO ][node                     ] [elasticsearch-logging-v1-3bzt6] initialized
[2017-03-02 03:59:48,355][INFO ][node                     ] [elasticsearch-logging-v1-3bzt6] starting ...
[2017-03-02 03:59:48,507][INFO ][transport                ] [elasticsearch-logging-v1-3bzt6] publish_address {172.16.57.8:9300}, bound_addresses {[::]:9300}
[2017-03-02 03:59:48,547][INFO ][discovery                ] [elasticsearch-logging-v1-3bzt6] kubernetes-logging/7_f_M2TKRZWOw4NhBc4EqA
[2017-03-02 04:00:18,552][WARN ][discovery                ] [elasticsearch-logging-v1-3bzt6] waited for 30s and no initial state was set by the discovery
[2017-03-02 04:00:18,562][INFO ][http                     ] [elasticsearch-logging-v1-3bzt6] publish_address {172.16.57.8:9200}, bound_addresses {[::]:9200}
[2017-03-02 04:00:18,562][INFO ][node                     ] [elasticsearch-logging-v1-3bzt6] started

There is an error: the pod cannot provide valid credentials. Following Tony Bai's write-up online, this turns out to be a problem with the default ServiceAccount; the underlying mechanism still deserves further study.

To get things running first, the workaround is as follows:

Create a new ServiceAccount in the kube-system namespace:

# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: k8s-efk

# kubectl create -f serviceaccount.yaml -n kube-system
serviceaccount "k8s-efk" created

# kubectl get serviceaccount -n kube-system
NAME      SECRETS   AGE
default   1         139d
k8s-efk   1         17s
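
It can also help to check that the new service account got its token secret, since that is what the pod mounts to authenticate against the API Server. The secret name suffix is generated, so it will differ in your cluster:

# Show the service account and its auto-generated token secret:
kubectl describe serviceaccount k8s-efk -n kube-system
kubectl get secrets -n kube-system | grep k8s-efk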

 

Modify es-controller.yaml to use the service account "k8s-efk":

[root@k8s-master fluentd-elasticsearch]# cat es-controller.yaml 
apiVersion: v1
kind: ReplicationController
metadata:
  name: elasticsearch-logging-v1
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
    version: v1
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 2
  selector:
    k8s-app: elasticsearch-logging
    version: v1
  template:
    metadata:
      labels:
        k8s-app: elasticsearch-logging
        version: v1
        kubernetes.io/cluster-service: "true"
    spec:
      serviceAccount: k8s-efk
      containers:
      - image: gcr.io/google_containers/elasticsearch:v2.4.1-2
        name: elasticsearch-logging
        resources:
          # need more cpu upon initialization, therefore burstable class
          limits:
            cpu: 1000m
          requests:
            cpu: 100m
        ports:
        - containerPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        volumeMounts:
        - name: es-persistent-storage
          mountPath: /data
        env:
        - name: "NAMESPACE"
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      volumes:
      - name: es-persistent-storage
        emptyDir: {}

 

After re-creating the elasticsearch-logging service, the elasticsearch-logging pod logs look OK. In fact this is another pitfall, which I will come back to shortly:

[2017-05-22 06:09:12,155][INFO ][node                     ] [elasticsearch-logging-v1-9jjf1] version[2.4.1], pid[1], build[c67dc32/2016-09-27T18:57:55Z]
[2017-05-22 06:09:12,156][INFO ][node                     ] [elasticsearch-logging-v1-9jjf1] initializing ...
[2017-05-22 06:09:13,657][INFO ][plugins                  ] [elasticsearch-logging-v1-9jjf1] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2017-05-22 06:09:13,733][INFO ][env                      ] [elasticsearch-logging-v1-9jjf1] using [1] data paths, mounts [[/data (/dev/mapper/cl-root)]], net usable_space [25gb], net total_space [37.2gb], spins? [possibly], types [xfs]
[2017-05-22 06:09:13,738][INFO ][env                      ] [elasticsearch-logging-v1-9jjf1] heap size [1015.6mb], compressed ordinary object pointers [true]
[2017-05-22 06:09:21,946][INFO ][node                     ] [elasticsearch-logging-v1-9jjf1] initialized
[2017-05-22 06:09:21,980][INFO ][node                     ] [elasticsearch-logging-v1-9jjf1] starting ...
[2017-05-22 06:09:22,442][INFO ][transport                ] [elasticsearch-logging-v1-9jjf1] publish_address {192.168.10.6:9300}, bound_addresses {[::]:9300}
[2017-05-22 06:09:22,560][INFO ][discovery                ] [elasticsearch-logging-v1-9jjf1] kubernetes-logging/RY_IOcwSSSeuJNtC2E0W7A
[2017-05-22 06:09:30,446][INFO ][cluster.service          ] [elasticsearch-logging-v1-9jjf1] detected_master {elasticsearch-logging-v1-sbcgt}{9--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:9300}{master=true}, added {{elasticsearch-logging-v1-sbcgt}{9--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:9300}{master=true},}, reason: zen-disco-receive(from master [{elasticsearch-logging-v1-sbcgt}{9--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:9300}{master=true}])
[2017-05-22 06:09:30,453][INFO ][http                     ] [elasticsearch-logging-v1-9jjf1] publish_address {192.168.10.6:9200}, bound_addresses {[::]:9200}
[2017-05-22 06:09:30,465][INFO ][node                     ] [elasticsearch-logging-v1-9jjf1] started

 

OK, onwards...

 

4. Configure Kibana

 

Following earlier write-ups, explicitly assign the newly created serviceaccount k8s-efk in kibana-controller.yaml as well:

[root@k8s-master fluentd-elasticsearch]# cat kibana-controller.yaml 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kibana-logging
  namespace: kube-system
  labels:
    k8s-app: kibana-logging
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: kibana-logging
  template:
    metadata:
      labels:
        k8s-app: kibana-logging
    spec:
      serviceAccount: k8s-efk
      containers:
      - name: kibana-logging
        image: gcr.io/google_containers/kibana:v4.6.1-1
        resources:
          # keep request = limit to keep this container in guaranteed class
          limits:
            cpu: 100m
          requests:
            cpu: 100m
        env:
          - name: "ELASTICSEARCH_URL"
            value: "http://elasticsearch-logging:9200"
          - name: "KIBANA_BASE_URL"
            value: "/api/v1/proxy/namespaces/kube-system/services/kibana-logging"
        ports:
        - containerPort: 5601
          name: ui
          protocol: TCP
[root@k8s-master fluentd-elasticsearch]# 

Start Kibana and watch the pod logs:

# kubectl logs -f kibana-logging-3604961973-jby53 -n kube-system
ELASTICSEARCH_URL=http://elasticsearch-logging:9200
server.basePath: /api/v1/proxy/namespaces/kube-system/services/kibana-logging
{"type":"log","@timestamp":"2017-03-02T08:30:15Z","tags":["info","optimize"],"pid":6,"message":"Optimizing and caching bundles for kibana and statusPage. This may take a few minutes"}

Kibana takes ten-plus minutes to start (apologies: this whole setup runs in VirtualBox VMs on an 8 GB laptop). After that you will see logs like the following:

# kubectl logs -f kibana-logging-3604961973-jby53 -n kube-system
ELASTICSEARCH_URL=http://elasticsearch-logging:9200
server.basePath: /api/v1/proxy/namespaces/kube-system/services/kibana-logging
{"type":"log","@timestamp":"2017-03-02T08:30:15Z","tags":["info","optimize"],"pid":6,"message":"Optimizing and caching bundles for kibana and statusPage. This may take a few minutes"}
{"type":"log","@timestamp":"2017-03-02T08:40:04Z","tags":["info","optimize"],"pid":6,"message":"Optimization of bundles for kibana and statusPage complete in 588.60 seconds"}
{"type":"log","@timestamp":"2017-03-02T08:40:04Z","tags":["status","plugin:kibana@1.0.0","info"],"pid":6,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:05Z","tags":["status","plugin:elasticsearch@1.0.0","info"],"pid":6,"state":"yellow","message":"Status changed from uninitialized to yellow - Waiting for Elasticsearch","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:05Z","tags":["status","plugin:kbn_vislib_vis_types@1.0.0","info"],"pid":6,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:05Z","tags":["status","plugin:markdown_vis@1.0.0","info"],"pid":6,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:05Z","tags":["status","plugin:metric_vis@1.0.0","info"],"pid":6,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:06Z","tags":["status","plugin:spyModes@1.0.0","info"],"pid":6,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:06Z","tags":["status","plugin:statusPage@1.0.0","info"],"pid":6,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:06Z","tags":["status","plugin:table_vis@1.0.0","info"],"pid":6,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:06Z","tags":["listening","info"],"pid":6,"message":"Server running at http://0.0.0.0:5601"}
{"type":"log","@timestamp":"2017-03-02T08:40:11Z","tags":["status","plugin:elasticsearch@1.0.0","info"],"pid":6,"state":"yellow","message":"Status changed from yellow to yellow - No existing Kibana index found","prevState":"yellow","prevMsg":"Waiting for Elasticsearch"}
{"type":"log","@timestamp":"2017-03-02T08:40:14Z","tags":["status","plugin:elasticsearch@1.0.0","info"],"pid":6,"state":"green","message":"Status changed from yellow to green - Kibana index ready","prevState":"yellow","prevMsg":"No existing Kibana index found"}

Note (these are pitfalls too):

  • DNS must be configured, otherwise Kibana cannot reach http://elasticsearch-logging:9200 (a quick DNS check is sketched after this list)
  • Without DNS, the only alternative is to edit the controller file and replace elasticsearch-logging with the actual cluster IP of the elasticsearch-logging service
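
A rough way to verify DNS before starting Kibana is to ask kube-dns directly for the service record. This assumes the standard cluster.local domain and that nslookup (bind-utils) is available on the machine you run it from:

# Ask the kube-dns service for the elasticsearch-logging record:
DNS_IP=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
nslookup elasticsearch-logging.kube-system.svc.cluster.local "$DNS_IP"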

Running

kubectl cluster-info

gives the Kibana service address, which is essentially:

https://{API Server external IP}:{API Server secure port}/api/v1/proxy/namespaces/kube-system/services/kibana-logging/app/kibana#/settings/indices/

On the screen below, however, no matter what I tried the Create button would not activate and no index could be added. Typing * directly did allow a create, but after that there was no pod information at all. Big problem!

 

 

 

5. Tracking down the problem

I went carefully back over Tony Bai's guide and countless other posts, and even considered starting over on CentOS 6.5, but since I could not get a Kubernetes cluster installed on 6.5 either, that idea was abandoned.

Comparing the logs, the problem most likely lies in one of these places:

  • There are no logs at all
  • The fluentd service is not running properly
  • elasticsearch-logging is not collecting the logs

 

  • Change the elasticsearch-logging image (a pitfall)

Looking closely at the fluentd log /var/log/fluentd.log, there was simply no log output being shipped at all, which rules out fluentd merely failing to connect to elasticsearch-logging:9200.

It felt like a problem inside elasticsearch-logging itself. Comparing against Tony Bai's Elasticsearch logs, mine only showed

[2017-05-22 06:09:30,446][INFO ][cluster.service          ] [elasticsearch-logging-v1-9jjf1] detected_master {elasticsearch-logging-v1-sbcgt}{9--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:9300}{master=true}, added {{elasticsearch-logging-v1-sbcgt}{9--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:9300}{master=true},}, reason: zen-disco-receive(from master [{elasticsearch-logging-v1-sbcgt}{9--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:9300}{master=true}])
[2017-05-22 06:09:30,453][INFO ][http                     ] [elasticsearch-logging-v1-9jjf1] publish_address {192.168.10.6:9200}, bound_addresses {[::]:9200}
[2017-05-22 06:09:30,465][INFO ][node                     ] [elasticsearch-logging-v1-9jjf1] started

and then stopped, whereas Tony Bai's continued with

[2017-03-02 08:26:56,955][INFO ][http                     ] [elasticsearch-logging-v1-dklui] publish_address {172.16.57.8:9200}, bound_addresses {[::]:9200}
[2017-03-02 08:26:56,956][INFO ][node                     ] [elasticsearch-logging-v1-dklui] started
[2017-03-02 08:26:57,157][INFO ][gateway                  ] [elasticsearch-logging-v1-dklui] recovered [0] indices into cluster_state
[2017-03-02 08:27:05,378][INFO ][cluster.metadata         ] [elasticsearch-logging-v1-dklui] [logstash-2017.03.02] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2017-03-02 08:27:06,360][INFO ][cluster.metadata         ] [elasticsearch-logging-v1-dklui] [logstash-2017.03.01] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2017-03-02 08:27:07,163][INFO ][cluster.routing.allocation] [elasticsearch-logging-v1-dklui] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[logstash-2017.03.01][3], [logstash-2017.03.01][3]] ...]).
[2017-03-02 08:27:07,354][INFO ][cluster.metadata         ] [elasticsearch-logging-v1-dklui] [logstash-2017.03.02] create_mapping [fluentd]
[2017-03-02 08:27:07,988][INFO ][cluster.metadata         ] [elasticsearch-logging-v1-dklui] [logstash-2017.03.01] create_mapping [fluentd]
[2017-03-02 08:27:09,578][INFO ][cluster.routing.allocation] [elasticsearch-logging-v1-dklui] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[logstash-2017.03.02][4]] ...]).

The differences:

  • messages about recovering the cluster state
  • messages about logstash indices

This looked very much like an image problem, so I changed the image in es-controller.yaml from the official v2.4.1-2 to the one Tony Bai used:

bigwhite/elasticsearch:v2.4.1-1

After restarting, the cluster recovery messages did appear. Still no logstash messages, though. (The indices can also be checked directly against the Elasticsearch API, as sketched below.)
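
Whether Elasticsearch is actually receiving documents can be checked directly against its REST API. A sketch, using the service's cluster IP obtained via jsonpath output:

# Query the elasticsearch-logging service for its indices:
ES_IP=$(kubectl get svc elasticsearch-logging -n kube-system -o jsonpath='{.spec.clusterIP}')
curl "http://${ES_IP}:9200/_cat/indices?v"
# If fluentd is shipping logs, logstash-YYYY.MM.DD indices should appear here.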

 

  • Tracking down the missing logs (another big pitfall)

Back to the first possibility: kubectl logs clearly shows screenful after screenful of output

[root@k8s-master fluentd-elasticsearch]# kubectl logs helloworld-service-4d72j
.
.
JAVA Memory arguments: -Djava.security.egd=file:/dev/./urandom
.
CLASSPATH=/u01/oracle/wlserver/../oracle_common/modules/javax.persistence_2.1.jar:/u01/oracle/wlserver/../wlserver/modules/com.oracle.weblogic.jpa21support_1.0.0.0_2-1.jar:/usr/java/jdk1.8.0_101/lib/tools.jar:/u01/oracle/wlserver/server/lib/weblogic_sp.jar:/u01/oracle/wlserver/server/lib/weblogic.jar:/u01/oracle/wlserver/../oracle_common/modules/net.sf.antcontrib_1.1.0.0_1-0b3/lib/ant-contrib.jar:/u01/oracle/wlserver/modules/features/oracle.wls.common.nodemanager_2.0.0.0.jar:/u01/oracle/wlserver/../oracle_common/modules/com.oracle.cie.config-wls-online_8.1.0.0.jar:/u01/oracle/wlserver/common/derby/lib/derbyclient.jar:/u01/oracle/wlserver/common/derby/lib/derby.jar:/u01/oracle/wlserver/server/lib/xqrl.jar
.
PATH=/u01/oracle/wlserver/server/bin:/u01/oracle/wlserver/../oracle_common/modules/org.apache.ant_1.9.2/bin:/usr/java/jdk1.8.0_101/jre/bin:/usr/java/jdk1.8.0_101/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/java/default/bin:/u01/oracle/oracle_common/common/bin:/u01/oracle/oracle_common/common/bin:/u01/oracle/wlserver/common/bin:/u01/oracle/user_projects/domains/base_domain/bin:/u01/oracle
.
***************************************************
*  To start WebLogic Server, use a username and   *
*  password assigned to an admin-level user.  For *
*  server administration, use the WebLogic Server *
*  console at http://hostname:port/console        *
***************************************************
starting weblogic with Java version:
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
Starting WLS with line:
/usr/java/jdk1.8.0_101/bin/java -server   -Djava.security.egd=file:/dev/./urandom -Dweblogic.Name=AdminServer -Djava.security.policy=/u01/oracle/wlserver/server/lib/weblogic.policy  -Dweblogic.ProductionModeEnabled=true   -Djava.endorsed.dirs=/usr/java/jdk1.8.0_101/jre/lib/endorsed:/u01/oracle/wlserver/../oracle_common/modules/endorsed  -da -Dwls.home=/u01/oracle/wlserver/server -Dweblogic.home=/u01/oracle/wlserver/server     -Dweblogic.utils.cmm.lowertier.ServiceDisabled=true  weblogic.Server
<May 24, 2017 2:27:39 AM GMT> <Info> <Security> <BEA-090905> <Disabling the CryptoJ JCE Provider self-integrity check for better startup performance. To enable this check, specify -Dweblogic.security.allowCryptoJDefaultJCEVerification=true.> 
<May 24, 2017 2:27:41 AM GMT> <Info> <Security> <BEA-090906> <Changing the default Random Number Generator in RSA CryptoJ from ECDRBG128 to FIPS186PRNG. To disable this change, specify -Dweblogic.security.allowCryptoJDefaultPRNG=true.> 
<May 24, 2017 2:27:44 AM GMT> <Info> <WebLogicServer> <BEA-000377> <Starting WebLogic Server with Java HotSpot(TM) 64-Bit Server VM Version 25.101-b13 from Oracle Corporation.> 
<May 24, 2017 2:27:47 AM GMT> <Info> <Management> <BEA-141107> <Version: WebLogic Server 12.1.3.0.0  Wed May 21 18:53:34 PDT 2014 1604337 > 
<May 24, 2017 2:27:59 AM GMT> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to STARTING.> 
<May 24, 2017 2:28:00 AM GMT> <Info> <WorkManager> <BEA-002900> <Initializing self-tuning thread pool.> 
<May 24, 2017 2:28:00 AM GMT> <Info> <WorkManager> <BEA-002942> <CMM memory level becomes 0. Setting standby thread pool size to 256.> 
<May 24, 2017 2:28:02 AM GMT> <Notice> <Log Management> <BEA-170019> <The server log file /u01/oracle/user_projects/domains/base_domain/servers/AdminServer/logs/AdminServer.log is opened. All server side log events will be written to this file.> 
<May 24, 2017 2:28:18 AM GMT> <Notice> <Security> <BEA-090082> <Security initializing using security realm myrealm.> 
<May 24, 2017 2:28:31 AM GMT> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to STANDBY.> 
<May 24, 2017 2:28:31 AM GMT> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to STARTING.> 
May 24, 2017 2:28:31 AM weblogic.wsee.WseeCoreMessages logWseeServiceStarting
INFO: The Wsee Service is starting
<May 24, 2017 2:28:34 AM GMT> <Warning> <Munger> <BEA-2156203> <A version attribute was not found in element "web-app" in the deployment descriptor /u01/oracle/user_projects/domains/base_domain/servers/AdminServer/upload/HelloWorld.war/WEB-INF/web.xml. A version attribute is required, but this version of the WebLogic Server will assume that the latest version is used. Future versions of WebLogic Server will reject descriptors that do not specify the Java EE version. To eliminate this warning, add an appropriate "version=" to element "web-app" in the deployment descriptor.> 
<May 24, 2017 2:28:39 AM GMT> <Notice> <Log Management> <BEA-170027> <The server has successfully established a connection with the Domain level Diagnostic Service.> 
<May 24, 2017 2:28:41 AM GMT> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to ADMIN.> 
<May 24, 2017 2:28:41 AM GMT> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to RESUMING.> 
<May 24, 2017 2:28:41 AM GMT> <Notice> <Server> <BEA-002613> <Channel "Default[3]" is now listening on 127.0.0.1:7001 for protocols iiop, t3, ldap, snmp, http.> 
<May 24, 2017 2:28:41 AM GMT> <Notice> <Server> <BEA-002613> <Channel "Default" is now listening on 192.168.53.3:7001 for protocols iiop, t3, ldap, snmp, http.> 
<May 24, 2017 2:28:41 AM GMT> <Notice> <Server> <BEA-002613> <Channel "Default[2]" is now listening on 0:0:0:0:0:0:0:1:7001 for protocols iiop, t3, ldap, snmp, http.> 
<May 24, 2017 2:28:41 AM GMT> <Notice> <WebLogicServer> <BEA-000329> <Started the WebLogic Server Administration Server "AdminServer" for domain "base_domain" running in production mode.> 
<May 24, 2017 2:28:41 AM GMT> <Error> <Server> <BEA-002606> <The server is unable to create a server socket for listening on channel "Default[1]". The address fe80:0:0:0:42:c0ff:fea8:3503 might be incorrect or another process is using port 7001: java.net.BindException: Cannot assign requested address> 
<May 24, 2017 2:28:41 AM GMT> <Warning> <Server> <BEA-002611> <The hostname "localhost", maps to multiple IP addresses: 127.0.0.1, 0:0:0:0:0:0:0:1.> 
<May 24, 2017 2:28:41 AM GMT> <Notice> <WebLogicServer> <BEA-000360> <The server started in RUNNING mode.> 
<May 24, 2017 2:28:41 AM GMT> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to RUNNING.> 
[root@k8s-master fluentd-elasticsearch]# 

 

and docker logs <container id> on the minion machine is just as full of logs!

[root@k8s-node-1 ~]# docker logs bec3e02b2490
.
.
JAVA Memory arguments: -Djava.security.egd=file:/dev/./urandom
.
CLASSPATH=/u01/oracle/wlserver/../oracle_common/modules/javax.persistence_2.1.jar:/u01/oracle/wlserver/../wlserver/modules/com.oracle.weblogic.jpa21support_1.0.0.0_2-1.jar:/usr/java/jdk1.8.0_101/lib/tools.jar:/u01/oracle/wlserver/server/lib/weblogic_sp.jar:/u01/oracle/wlserver/server/lib/weblogic.jar:/u01/oracle/wlserver/../oracle_common/modules/net.sf.antcontrib_1.1.0.0_1-0b3/lib/ant-contrib.jar:/u01/oracle/wlserver/modules/features/oracle.wls.common.nodemanager_2.0.0.0.jar:/u01/oracle/wlserver/../oracle_common/modules/com.oracle.cie.config-wls-online_8.1.0.0.jar:/u01/oracle/wlserver/common/derby/lib/derbyclient.jar:/u01/oracle/wlserver/common/derby/lib/derby.jar:/u01/oracle/wlserver/server/lib/xqrl.jar
.
PATH=/u01/oracle/wlserver/server/bin:/u01/oracle/wlserver/../oracle_common/modules/org.apache.ant_1.9.2/bin:/usr/java/jdk1.8.0_101/jre/bin:/usr/java/jdk1.8.0_101/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/java/default/bin:/u01/oracle/oracle_common/common/bin:/u01/oracle/oracle_common/common/bin:/u01/oracle/wlserver/common/bin:/u01/oracle/user_projects/domains/base_domain/bin:/u01/oracle
.
***************************************************
*  To start WebLogic Server, use a username and   *
*  password assigned to an admin-level user.  For *
*  server administration, use the WebLogic Server *
*  console at http://hostname:port/console        *
***************************************************
starting weblogic with Java version:
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
Starting WLS with line:
/usr/java/jdk1.8.0_101/bin/java -server   -Djava.security.egd=file:/dev/./urandom -Dweblogic.Name=AdminServer -Djava.security.policy=/u01/oracle/wlserver/server/lib/weblogic.policy  -Dweblogic.ProductionModeEnabled=true   -Djava.endorsed.dirs=/usr/java/jdk1.8.0_101/jre/lib/endorsed:/u01/oracle/wlserver/../oracle_common/modules/endorsed  -da -Dwls.home=/u01/oracle/wlserver/server -Dweblogic.home=/u01/oracle/wlserver/server     -Dweblogic.utils.cmm.lowertier.ServiceDisabled=true  weblogic.Server
...........

 

Because /var/log/fluentd.log kept sitting at the configuration dump without moving, I suspected the ConfigMap, so I dug out the backed-up td-agent.conf and looked:

# The Kubernetes fluentd plugin is used to write the Kubernetes metadata to the log
# record & add labels to the log record if properly configured. This enables users
# to filter & search logs on any metadata.
# For example a Docker container's logs might be in the directory:
#
#  /var/lib/docker/containers/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b
#
# and in the file:
#
#  997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b-json.log
#
# where 997599971ee6... is the Docker ID of the running container.
# The Kubernetes kubelet makes a symbolic link to this file on the host machine
# in the /var/log/containers directory which includes the pod name and the Kubernetes
# container name:
#
#    synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log

Finally, the problem: fluentd looks for logs under the /var/lib/docker/containers/ directory, yet my containers had no log files there at all.

Digging into Docker, it turned out all container logs were being journaled to the system log /var/log/messages. Why? Because of the common advice that container logs grow too fast, they had all been routed through the system journald for unified handling.

Edit the /etc/sysconfig/docker configuration file and switch the log driver from journald back to json-file:

#OPTIONS='--selinux-enabled --log-driver=journald --signature-verification=false'
OPTIONS='--selinux-enabled --log-driver=json-file --signature-verification=false'

After the change (and restarting Docker, see the sketch below), the containers directory has plenty of log files again.
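
Docker has to be restarted for the new log driver to take effect, and only containers started afterwards will write json logs. A sketch of the restart-and-verify step on a minion (note that restarting Docker restarts every container on that node):

# Apply the new log driver and confirm it is active:
systemctl restart docker
docker info | grep -i 'logging driver'      # should report json-file
# Containers started after the restart should have json log files:
ls /var/lib/docker/containers/*/*-json.log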

Back in /var/log/fluentd.log, the log is finally scrolling and output looks normal:

 

2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/fluentd-es-v1.22-351lz_kube-system_POD-aca728523bc307598917d78b2526e718e6c7fdbb38b70c05900d2439399efa10.log
2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/helloworld-service-n5f0s_default_POD-ca013e9ab31b825cd4b85ab4700fad2fcaafd5f39c572778d10d438012ea4435.log
2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/fluentd-es-v1.22-351lz_kube-system_POD-2eb78ece8c2b5c222313ab4cfb53ea6ec32f54e1b7616f729daf48b01d393b65.log
2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/helloworld-service-4d72j_default_POD-1dcbbc2ef71f7f542018069a1043a122117a97378c19f03ddb95b8a71dab4637.log
2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/helloworld-service-n5f0s_default_weblogichelloworld-d7229e5c23c6bf7582ed6559417ba24d99e33e44a68a6079159b4792fe05a673.log
2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/helloworld-service-4d72j_default_weblogichelloworld-71d1d7252dd7504fd45351d714d21c3c615facc5e2650553c68c0bf359e8434a.log
2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/kube-dns-v11-x0vr3_kube-system_kube2sky-c77121c354459f22712b0a99623eff1590f4fdb1a5d3ad2db09db000755f9c2c.log
2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/kube-dns-v11-x0vr3_kube-system_skydns-f3c0fbf4ea5cd840c968a807a40569042c90de08f7722e7344282845d5782a20.log
2017-05-24 05:44:17 +0000 [info]: following tail of /var/log/containers/fluentd-es-v1.22-351lz_kube-system_fluentd-es-93795904ff4870758441dd7288972d4967ffac18f2f25272b12e99ea6b692d44.log
2017-05-24 05:45:03 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-05-24 05:44:24 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"elasticsearch-logging\", :port=>9200, :scheme=>\"http\"})!" plugin_id="object:3f986d0a5150"
  2017-05-24 05:45:03 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-1.5.0/lib/fluent/plugin/out_elasticsearch.rb:122:in `client'

 

Start all the components:

[root@k8s-master fluentd-elasticsearch]# kubectl get pods -n kube-system
NAME                              READY     STATUS    RESTARTS   AGE
elasticsearch-logging-v1-1xwnq    1/1       Running   0          29s
elasticsearch-logging-v1-gx6lc    1/1       Running   0          29s
fluentd-es-v1.22-351lz            1/1       Running   1          3h
kibana-logging-3659310023-gcwrn   1/1       Running   0          15s
kube-dns-v11-x0vr3                4/4       Running   28         1d

 

Access Kibana:

[root@k8s-master fluentd-elasticsearch]# kubectl cluster-info
Kubernetes master is running at http://localhost:8080
Elasticsearch is running at http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/elasticsearch-logging
Kibana is running at http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/kibana-logging
KubeDNS is running at http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/kube-dns

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

 

The green Create button finally appears.

After the index is created, go into the Discover view.

 

Expand kubernetes.pod_name and you can see the logs of every pod in the cluster.

If we only care about helloworld-service, click the + next to it, and every line of WebLogic log output becomes visible.
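
The same filter can also be run straight against Elasticsearch, which is a handy way to confirm the metadata fields fluentd attached. A sketch; the field name kubernetes.pod_name follows the kubernetes_metadata plugin and may differ if your configuration does:

# Search the logstash indices for entries from the helloworld-service pods:
ES_IP=$(kubectl get svc elasticsearch-logging -n kube-system -o jsonpath='{.spec.clusterIP}')
curl "http://${ES_IP}:9200/logstash-*/_search?q=kubernetes.pod_name:helloworld*&size=3&pretty"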

And with that, the setup is complete.

 

6. Summary of pitfalls:

  • Setting up EFK meant configuring DNS and ServiceAccounts end to end, which is a gain in itself.
  • It also gives a deeper understanding of how the Kubernetes cluster works internally, which should gradually help us solve problems faster.

 

Finally, sincere thanks for the help of these authors I have never met:

http://tonybai.com/2017/03/03/implement-kubernetes-cluster-level-logging-with-fluentd-and-elasticsearch-stack/

http://rootsongjc.github.io/blogs/kubernetes-fluentd-elasticsearch-installation/

http://www.tothenew.com/blog/how-to-install-kubernetes-on-centos/

http://blog.csdn.net/wenwst/article/details/53908144

 

