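SkyWalking is an open-source distributed tracing system that originated in China. So how do distributed tracing, monitoring systems, and logging systems differ? In essence, tracing is also a form of monitoring, and both tracing and monitoring are ultimately built on logs.
SkyWalking documentation (Chinese translation): https://skyapm.github.io/document-cn-translation-of-skywalking/zh/8.0.0/
Unlike day-to-day monitoring, tracing lets us act on the results more proactively. Take Prometheus as an example: Prometheus collects the data, Grafana displays it, and alerts fire according to the rules we define. But we rarely study the Prometheus graphs on our own initiative, conclude that something is about to break, and fix it in advance. Usually an alert fires, we look at the situation, then check the logs, and rely on experience to handle the problem and optimize afterwards. In routine operations this is reactive behavior, the proverbial "mending the pen after the sheep are lost".
Once tracing software is enabled, we can see which call chains are used most frequently, which functions or methods run slowly, and which connections (to XXX, say) have high latency. We can then schedule the optimizations that give the best return, while the business is not yet failing and is perhaps only a little slow. Of course, sometimes we only start analyzing because a business flow is already slow in use, and that is ordinary reactive monitoring. In routine operations, though, what we expect from tracing is the former: proactive behavior, "repairing the roof before it rains".
What about the logging system? A logging system collects large volumes of logs, whereas monitoring and tracing collect and aggregate only the logs they need, derive the values and targets they care about, and present them in different ways. The logging system is therefore the lowest layer. When a monitoring alert fires, looking at the graph alone is not enough; I have to read the logs from that moment to find out why the system or the business fluctuated. Tracing is the same: when a function runs slowly, I have to look at its logic and at what its processing flow went through before I can tune it.
Among current APM tools, SkyWalking and Pinpoint require no changes to application code at all, which suits operations engineers. Zipkin-style tools do intrude on the code, and that means taking on responsibility for the risk; blame for a runtime failure of the business is not something we should accept lightly. For a detailed comparison, see this article: https://www.jianshu.com/p/626cae6c0522
We install SkyWalking by running it inside Kubernetes. The official guide uses Helm; here I have exported the YAML and adjusted it.
elasticsearch: SkyWalking supports many storage backends: https://skyapm.github.io/document-cn-translation-of-skywalking/zh/8.0.0/setup/backend/backend-storage.html . Your Elasticsearch does not have to run in a container, so this step is optional; if it does run in a container, remember to allocate storage for persistence. With only one node, the manifest below fails to come back up after a restart, because the cluster cannot reach green status and therefore fails the health check (the readiness probe). For a standalone test, simply comment out the health check section.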

apiVersion: v1
kind: Service
metadata:
  name: skywalking-elasticsearch
  namespace: default
  labels:
    app: skywalking-elasticsearch
spec:
  ports:
    - name: http
      port: 9200
      protocol: TCP
      targetPort: 9200
    - name: transport
      port: 9300
      protocol: TCP
      targetPort: 9300
  selector:
    app: skywalking-elasticsearch
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-elasticsearch-headless
  namespace: default
  labels:
    app: skywalking-elasticsearch
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  ports:
    - name: http
      port: 9200
      protocol: TCP
      targetPort: 9200
    - name: transport
      port: 9300
      protocol: TCP
      targetPort: 9300
  selector:
    app: skywalking-elasticsearch
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: skywalking-elasticsearch
  namespace: default
  labels:
    app: skywalking-elasticsearch
spec:
  replicas: 1
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: skywalking-elasticsearch
  serviceName: skywalking-elasticsearch-headless
  template:
    metadata:
      name: skywalking-elasticsearch
      labels:
        app: skywalking-elasticsearch
    spec:
      # affinity:
      #   podAntiAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       - labelSelector:
      #           matchExpressions:
      #             - key: app
      #               operator: In
      #               values:
      #                 - skywalking-elasticsearch
      #         topologyKey: kubernetes.io/hostname
      initContainers:
        - command:
            - sysctl
            - -w
            - vm.max_map_count=262144
          image: docker.elastic.co/elasticsearch/elasticsearch:7.5.1
          imagePullPolicy: IfNotPresent
          name: configure-sysctl
          resources: {}
          securityContext:
            privileged: true
            runAsUser: 0
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      containers:
        - env:
            - name: node.name
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: cluster.initial_master_nodes
              value: skywalking-elasticsearch-0
            - name: discovery.seed_hosts
              value: skywalking-elasticsearch-headless
            - name: cluster.name
              value: skywalking-elasticsearch
            - name: network.host
              value: 0.0.0.0
            - name: ES_JAVA_OPTS
              value: -Xmx1g -Xms1g
            - name: node.data
              value: "true"
            - name: node.ingest
              value: "true"
            - name: node.master
              value: "true"
          name: skywalking-elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:7.5.1
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 9200
              name: http
              protocol: TCP
            - containerPort: 9300
              name: transport
              protocol: TCP
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 2Gi
          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - |
                  #!/usr/bin/env bash -e
                  # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
                  # Once it has started only check that the node itself is responding
                  START_FILE=/tmp/.es_start_file
                  http () {
                    local path="${1}"
                    if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                      BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                    else
                      BASIC_AUTH=''
                    fi
                    curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
                  }
                  if [ -f "${START_FILE}" ]; then
                    echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
                    http "/_cluster/health?timeout=0s"
                  else
                    echo 'Waiting for elasticsearch cluster to become cluster to be ready (request params: "wait_for_status=green&timeout=1s" )'
                    if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                      touch ${START_FILE}
                      exit 0
                    else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                      exit 1
                    fi
                  fi
            failureThreshold: 3
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 3
            timeoutSeconds: 5
          securityContext:
            capabilities:
              drop:
                - ALL
            runAsNonRoot: true
            runAsUser: 1000
          volumeMounts:
            - name: skywalking-elasticsearch
              mountPath: /usr/share/elasticsearch/data
      terminationGracePeriodSeconds: 120
  volumeClaimTemplates:
    - metadata:
        name: skywalking-elasticsearch
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: yizhuang-nfs
        resources:
          requests:
            storage: 100Gi
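As an alternative to commenting out the health check on a single-node test cluster, the probe could be relaxed to accept yellow status. The fragment below is only a sketch under assumptions: it assumes yellow is acceptable in your test environment, that Elasticsearch runs without authentication (as in the manifest above), and that Elasticsearch 7.x answers a timed-out `wait_for_status` request with a non-2xx status, which is the same behavior the original script relies on through `curl --fail`. It would replace the `readinessProbe` of the Elasticsearch container.

```yaml
# Sketch only: a relaxed readiness probe for a single-node test cluster.
# Assumption: yellow status is good enough for testing; do not use this in production.
readinessProbe:
  httpGet:
    # Same health endpoint as the original script, but waiting for yellow instead of green
    path: /_cluster/health?wait_for_status=yellow&timeout=1s
    port: 9200
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

Unlike the original script, this variant has no start-file logic, so it waits for yellow on every probe instead of only during startup.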
job: initializes the data structures in Elasticsearch. If Elasticsearch has already been initialized, there is no need to run it again.
apiVersion: batch/v1
kind: Job
metadata:
  name: skywalking-job
  namespace: default
  labels:
    app: skywalking-job
spec:
  template:
    metadata:
      name: skywalking-job
      labels:
        app: skywalking-job
    spec:
      initContainers:
        - command:
            - sh
            - -c
            - for i in $(seq 1 60); do nc -z -w3 skywalking-elasticsearch 9200 && exit 0 || sleep 5; done; exit 1
          image: busybox:1.30
          imagePullPolicy: IfNotPresent
          name: wait-for-elasticsearch
      containers:
        - env:
            - name: JAVA_OPTS
              value: -Xmx2g -Xms2g -Dmode=init  # -Dmode=init initializes the data structures in the Elasticsearch cluster
            - name: SW_STORAGE
              value: elasticsearch7
            - name: SW_STORAGE_ES_CLUSTER_NODES
              value: skywalking-elasticsearch:9200
          name: skywalking-job
          image: apache/skywalking-oap-server:8.1.0
          imagePullPolicy: IfNotPresent
      restartPolicy: Never  # a Job's restartPolicy must be Never or OnFailure
oap: the SkyWalking server itself.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: skywalking-oap
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: skywalking-oap
  namespace: default
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - configmaps
    verbs:
      - get
      - watch
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: skywalking-oap
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: skywalking-oap
subjects:
  - kind: ServiceAccount
    name: skywalking-oap
    namespace: default
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap
  namespace: default
  labels:
    app: skywalking-oap
spec:
  ports:
    - name: rest
      port: 12800
      protocol: TCP
      targetPort: 12800
    - name: grpc
      port: 11800
      protocol: TCP
      targetPort: 11800
  selector:
    app: skywalking-oap
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap
  namespace: default
  labels:
    app: skywalking-oap
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-oap
  template:
    metadata:
      labels:
        app: skywalking-oap
    spec:
      serviceAccount: skywalking-oap
      serviceAccountName: skywalking-oap
      # affinity:
      #   podAntiAffinity:
      #     preferredDuringSchedulingIgnoredDuringExecution:
      #       - podAffinityTerm:
      #           labelSelector:
      #             matchLabels:
      #               app: skywalking-oap
      #           topologyKey: kubernetes.io/hostname
      #         weight: 1
      initContainers:
        - command:
            - sh
            - -c
            - for i in $(seq 1 60); do nc -z -w3 skywalking-elasticsearch 9200 && exit 0 || sleep 5; done; exit 1
          image: busybox:1.30
          imagePullPolicy: IfNotPresent
          name: wait-for-elasticsearch
      containers:
        - env:
            - name: JAVA_OPTS
              value: -Dmode=no-init -Xmx2g -Xms512m
            - name: SW_CLUSTER  # cluster type: running inside Kubernetes
              value: kubernetes
            - name: SW_CLUSTER_K8S_NAMESPACE
              value: default
            - name: SW_CLUSTER_K8S_LABEL
              value: app=skywalking-oap
            - name: SKYWALKING_COLLECTOR_UID
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.uid
            - name: SW_STORAGE
              value: elasticsearch7
            - name: SW_STORAGE_ES_CLUSTER_NODES
              value: skywalking-elasticsearch:9200
            - name: SW_STORAGE_DAY_STEP  # how many days of data each ES index holds
              value: "1"
            - name: SW_STORAGE_ES_FLUSH_INTERVAL
              value: "60"
            - name: SW_CORE_RECORD_DATA_TTL  # record data expiry; note that to keep 30 days of data, TTL must be DAY_STEP + 30 = 31
              value: "4"
            - name: SW_CORE_METRICS_DATA_TTL  # metrics data expiry, same rule as above
              value: "4"
            - name: SW_TRACE_SAMPLE_RATE  # sampling rate; 10000 means 100%, lower it in production
              value: "10000"
          name: skywalking-oap
          image: apache/skywalking-oap-server:8.1.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 11800
              name: grpc
              protocol: TCP
            - containerPort: 12800
              name: rest
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 15
            periodSeconds: 20
            successThreshold: 1
            tcpSocket:
              port: 12800
            timeoutSeconds: 1
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 15
            periodSeconds: 20
            successThreshold: 1
            tcpSocket:
              port: 12800
            timeoutSeconds: 1
          resources:
            requests:
              memory: 512Mi
              cpu: 30m
            limits:
              memory: 2Gi
              cpu: 500m
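To make the TTL rule from the comments concrete, here is a sketch of how the retention and sampling entries might look for a production-like setup. The figures are illustrative assumptions, not recommendations: 30 days of retention with a day step of 1, and 10% sampling. Only the environment variable names come from the manifest above.

```yaml
# Sketch only: env entries for the skywalking-oap container, with assumed target values.
- name: SW_STORAGE_DAY_STEP
  value: "1"      # one day of data per ES index
- name: SW_CORE_RECORD_DATA_TTL
  value: "31"     # DAY_STEP (1) + 30 days of desired retention
- name: SW_CORE_METRICS_DATA_TTL
  value: "31"     # same rule for metrics data
- name: SW_TRACE_SAMPLE_RATE
  value: "1000"   # 1000 out of 10000 = 10% of traces
```

With a larger day step the same rule applies: a day step of 2 and 30 days of retention would give a TTL of 32.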
ui: renders the dashboards and charts.
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: skywalking-dev-xxx-com
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - skywalking-dev.xxx.com
      port:
        number: 80
        name: http
        protocol: HTTP
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: skywalking-dev-xxx-com
  namespace: default
spec:
  hosts:
    - skywalking-dev.xxx.com
  gateways:
    - skywalking-dev-xxx-com
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: skywalking-ui
            port:
              number: 80
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-ui
  namespace: default
  labels:
    app: skywalking-ui
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080
  selector:
    app: skywalking-ui
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-ui
  namespace: default
  labels:
    app: skywalking-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      imagePullSecrets:
        - name: aliyun-registry
      containers:
        - env:
            - name: SW_OAP_ADDRESS
              value: skywalking-oap:12800
          image: apache/skywalking-ui:8.1.0
          imagePullPolicy: IfNotPresent
          name: skywalking-ui
          ports:
            - containerPort: 8080
              name: page
              protocol: TCP
          resources:
            requests:
              memory: 512Mi
              cpu: 30m
            limits:
              memory: 1Gi
              cpu: 500m
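We can now open the UI. No client has been connected yet, so there is no data, but the server side is complete.
Next comes connecting the clients. SkyWalking supports many kinds of clients, but the most common case is a Java application. We only need to download the corresponding package from http://skywalking.apache.org/downloads/ . The client version should match the server version. My server is 8.1.0, so the download link is https://downloads.apache.org/skywalking/8.1.0/apache-skywalking-apm-8.1.0.tar.gz . After downloading and extracting it, the agent directory looks like this: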
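The UI manifest exposes the page through an Istio Gateway and VirtualService. For a test cluster without Istio, one possible substitute is a NodePort Service in front of the same pods. This is a sketch under assumptions: the Service name `skywalking-ui-nodeport` and the port `30080` are hypothetical and are not part of the original setup.

```yaml
# Sketch only: expose the UI without Istio. The name and nodePort are assumed values.
apiVersion: v1
kind: Service
metadata:
  name: skywalking-ui-nodeport   # hypothetical name, chosen to avoid clashing with skywalking-ui
  namespace: default
  labels:
    app: skywalking-ui
spec:
  type: NodePort
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080           # container port of the skywalking-ui pod
      nodePort: 30080            # assumed; must fall within the cluster's NodePort range
  selector:
    app: skywalking-ui
```

The page would then be reachable on port 30080 of any node, which is convenient for a quick check but usually not how a production UI is published.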
agent/
├── activations
│   ├── apm-toolkit-log4j-1.x-activation-8.1.0.jar
│   ├── apm-toolkit-log4j-2.x-activation-8.1.0.jar
│   ├── apm-toolkit-logback-1.x-activation-8.1.0.jar
│   ├── apm-toolkit-meter-activation-8.1.0.jar
│   ├── apm-toolkit-opentracing-activation-8.1.0.jar
│   └── apm-toolkit-trace-activation-8.1.0.jar
├── bootstrap-plugins
│   ├── apm-jdk-http-plugin-8.1.0.jar
│   └── apm-jdk-threading-plugin-8.1.0.jar
├── config
│   └── agent.config    # agent configuration file; we need to change a few settings here
├── logs
├── optional-plugins
│   ├── apm-customize-enhance-plugin-8.1.0.jar
│   ├── apm-gson-2.x-plugin-8.1.0.jar
│   ├── apm-kotlin-coroutine-plugin-8.1.0.jar
│   ├── apm-spring-annotation-plugin-8.1.0.jar
│   ├── apm-spring-cloud-gateway-2.0.x-plugin-8.1.0.jar
│   ├── apm-spring-cloud-gateway-2.1.x-plugin-8.1.0.jar
│   ├── apm-spring-tx-plugin-8.1.0.jar
│   ├── apm-trace-ignore-plugin-8.1.0.jar
│   └── apm-zookeeper-3.4.x-plugin-8.1.0.jar
├── optional-reporter-plugins
│   └── kafka-reporter-plugin-8.1.0.jar
├── plugins
│   ├── apm-activemq-5.x-plugin-8.1.0.jar
│   ├── apm-armeria-0.84.x-plugin-8.1.0.jar
│   ├── apm-armeria-0.85.x-plugin-8.1.0.jar
│   ├── apm-avro-plugin-8.1.0.jar
│   ├── apm-canal-1.x-plugin-8.1.0.jar
│   ├── apm-cassandra-java-driver-3.x-plugin-8.1.0.jar
│   ├── apm-dubbo-2.7.x-plugin-8.1.0.jar
│   ├── apm-dubbo-plugin-8.1.0.jar
│   ├── apm-ehcache-2.x-plugin-8.1.0.jar
│   ├── apm-elastic-job-2.x-plugin-8.1.0.jar
│   ├── apm-elasticsearch-5.x-plugin-8.1.0.jar
│   ├── apm-elasticsearch-6.x-plugin-8.1.0.jar
│   ├── apm-feign-default-http-9.x-plugin-8.1.0.jar
│   ├── apm-finagle-6.25.x-plugin-8.1.0.jar
│   ├── apm-grpc-1.x-plugin-8.1.0.jar
│   ├── apm-h2-1.x-plugin-8.1.0.jar
│   ├── apm-httpasyncclient-4.x-plugin-8.1.0.jar
│   ├── apm-httpclient-3.x-plugin-8.1.0.jar
│   ├── apm-httpClient-4.x-plugin-8.1.0.jar
│   ├── apm-hystrix-1.x-plugin-8.1.0.jar
│   ├── apm-influxdb-2.x-plugin-8.1.0.jar
│   ├── apm-jdbc-commons-8.1.0.jar
│   ├── apm-jedis-2.x-plugin-8.1.0.jar
│   ├── apm-jetty-client-9.0-plugin-8.1.0.jar
│   ├── apm-jetty-client-9.x-plugin-8.1.0.jar
│   ├── apm-jetty-server-9.x-plugin-8.1.0.jar
│   ├── apm-kafka-plugin-8.1.0.jar
│   ├── apm-lettuce-5.x-plugin-8.1.0.jar
│   ├── apm-light4j-plugin-8.1.0.jar
│   ├── apm-mariadb-2.x-plugin-8.1.0.jar
│   ├── apm-mongodb-2.x-plugin-8.1.0.jar
│   ├── apm-mongodb-3.x-plugin-8.1.0.jar
│   ├── apm-mysql-5.x-plugin-8.1.0.jar
│   ├── apm-mysql-6.x-plugin-8.1.0.jar
│   ├── apm-mysql-8.x-plugin-8.1.0.jar
│   ├── apm-mysql-commons-8.1.0.jar
│   ├── apm-netty-socketio-plugin-8.1.0.jar
│   ├── apm-nutz-http-1.x-plugin-8.1.0.jar
│   ├── apm-nutz-mvc-annotation-1.x-plugin-8.1.0.jar
│   ├── apm-okhttp-3.x-plugin-8.1.0.jar
│   ├── apm-play-2.x-plugin-8.1.0.jar
│   ├── apm-postgresql-8.x-plugin-8.1.0.jar
│   ├── apm-pulsar-plugin-8.1.0.jar
│   ├── apm-quasar-plugin-8.1.0.jar
│   ├── apm-rabbitmq-5.x-plugin-8.1.0.jar
│   ├── apm-redisson-3.x-plugin-8.1.0.jar
│   ├── apm-resttemplate-4.3.x-plugin-8.1.0.jar
│   ├── apm-rocketmq-3.x-plugin-8.1.0.jar
│   ├── apm-rocketmq-4.x-plugin-8.1.0.jar
│   ├── apm-servicecomb-java-chassis-0.x-plugin-8.1.0.jar
│   ├── apm-servicecomb-java-chassis-1.x-plugin-8.1.0.jar
│   ├── apm-sharding-jdbc-1.5.x-plugin-8.1.0.jar
│   ├── apm-sharding-sphere-3.x-plugin-8.1.0.jar
│   ├── apm-shardingsphere-4.0.x-plugin-8.1.0.jar
│   ├── apm-sharding-sphere-4.1.0-plugin-8.1.0.jar
│   ├── apm-sharding-sphere-4.x-plugin-8.1.0.jar
│   ├── apm-sharding-sphere-4.x-rc3-plugin-8.1.0.jar
│   ├── apm-solrj-7.x-plugin-8.1.0.jar
│   ├── apm-spring-async-annotation-plugin-8.1.0.jar
│   ├── apm-spring-cloud-feign-1.x-plugin-8.1.0.jar
│   ├── apm-spring-cloud-feign-2.x-plugin-8.1.0.jar
│   ├── apm-spring-concurrent-util-4.x-plugin-8.1.0.jar
│   ├── apm-spring-core-patch-8.1.0.jar
│   ├── apm-springmvc-annotation-3.x-plugin-8.1.0.jar
│   ├── apm-springmvc-annotation-4.x-plugin-8.1.0.jar
│   ├── apm-springmvc-annotation-5.x-plugin-8.1.0.jar
│   ├── apm-springmvc-annotation-commons-8.1.0.jar
│   ├── apm-spring-webflux-5.x-plugin-8.1.0.jar
│   ├── apm-spymemcached-2.x-plugin-8.1.0.jar
│   ├── apm-struts2-2.x-plugin-8.1.0.jar
│   ├── apm-undertow-2.x-plugin-8.1.0.jar
│   ├── apm-vertx-core-3.x-plugin-8.1.0.jar
│   ├── apm-xmemcached-2.x-plugin-8.1.0.jar
│   ├── baidu-brpc-plugin-8.1.0.jar
│   ├── dubbo-2.7.x-conflict-patch-8.1.0.jar
│   ├── dubbo-conflict-patch-8.1.0.jar
│   ├── graphql-12.x-plugin-8.1.0.jar
│   ├── graphql-8.x-plugin-8.1.0.jar
│   ├── graphql-9.x-plugin-8.1.0.jar
│   ├── motan-plugin-8.1.0.jar
│   ├── resteasy-server-3.x-plugin-8.1.0.jar
│   ├── sofa-rpc-plugin-8.1.0.jar
│   ├── spring-commons-8.1.0.jar
│   └── tomcat-7.x-8.x-plugin-8.1.0.jar
└── skywalking-agent.jar    # the agent (probe) jar for this version
We edit agent/config/agent.config; the result is as follows (the trailing # notes are annotations, not part of the file):
[root@devops-bj-yz-dx1 conf.d]# grep '^[a-z]' agent/config/agent.config
agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}    # everything we run is containerized and built into images, so leave this unchanged
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:skywalking-oap.default:11800}    # address and port of the server; very important. Per our k8s YAML, the server's Service is skywalking-oap in the default namespace, port 11800
logging.file_name=${SW_LOGGING_FILE_NAME:skywalking-api.log}    # log file name; a matter of personal preference
logging.level=${SW_LOGGING_LEVEL:ERROR}    # log level; the default is INFO
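What remains is to put the agent into the image when it is built. Adding COPY agent /root/agent to the Dockerfile places the directory under /root/ in the container. Then we start the Java pods. A service usually runs as several pods that all represent the same service, so pods of the same kind should share one ApplicationName. SkyWalking then aggregates the data collected under the same application name, and you can still look up the details of an individual pod. For example, the pods www-baidu-com-xxxxx-xxxxx and www-baidu-com-yyyy-yyyy should both use the same name, www-baidu-com or baidu, depending on your company's naming convention.
The last step is to define the Java startup parameters as an env entry in the Java service's Kubernetes YAML:
env:
  - name: JAVA_OPTS
    # The -javaagent path must match the path we copied the agent to in the Dockerfile; the ApplicationName at the end is the name of this project, i.e. the www-baidu-com from the example above
    value: "-server -Xms123m -Xmx456m -Xss789k -XX:+UseG1GC -Dfile.encoding=UTF-8 -Dserver.port=6666 -javaagent:/root/agent/skywalking-agent.jar -Dskywalking.agent.service_name=ApplicationName"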
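Because agent.config reads its values from the placeholders `${SW_AGENT_NAME:...}` and `${SW_AGENT_COLLECTOR_BACKEND_SERVICES:...}`, the service name can probably also be supplied through environment variables instead of a `-D` flag. The fragment below is a sketch under assumptions: it assumes the agent.config in the image still contains those placeholders, and that the image's entrypoint passes `$JAVA_OPTS` to the `java` command. The service name `www-baidu-com` is the example name used earlier, and the JVM flags are shortened for illustration.

```yaml
# Sketch only: set the agent's service name and backend through environment variables.
env:
  - name: SW_AGENT_NAME
    value: www-baidu-com                 # example service name; shared by all pods of this service
  - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
    value: skywalking-oap.default:11800  # same address as the default written into agent.config
  - name: JAVA_OPTS
    # -javaagent must still point at the path used by COPY in the Dockerfile
    value: "-server -XX:+UseG1GC -Dfile.encoding=UTF-8 -javaagent:/root/agent/skywalking-agent.jar"
```

This keeps the service name out of the JVM flags, so the same JAVA_OPTS value can be reused across services and only SW_AGENT_NAME changes.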
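With this in place, the Java pod starts sending data to the server once it is up, and after a short wait the data appears in the UI. Note that if the server misbehaves or goes down, the business itself is not affected; the agent only logs errors about failing to send SkyWalking data, and everything returns to normal once the server recovers. Also make sure the time range in the lower-right corner of the page is set correctly, otherwise you may see no data.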