Monitoring an Oracle Database with Prometheus


Background

This article briefly describes how Prometheus can monitor an Oracle database through an exporter, and which metrics deserve attention.

oracledb_exporter

oracledb_exporter is an application that connects to an Oracle database and exposes Prometheus metrics.

Setup

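This section shows how to install and configure oracledb_exporter so that Prometheus can monitor an Oracle database. Here, oracledb_exporter is deployed in a Kubernetes (k8s) cluster.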

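In Kubernetes, deploy oracledb_exporter with a Deployment and add the following annotations so that Prometheus automatically discovers the oracledb_exporter endpoint and collects its metrics: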

spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9161"
        prometheus.io/path: "/metrics"
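
The prometheus.io/* annotations are a convention, not something Prometheus interprets by itself: they only take effect if the Prometheus scrape configuration contains relabeling rules that read them. A minimal sketch of such a job is shown below, assuming Prometheus runs inside the cluster and is allowed to list pods. The job name kubernetes-pods is an arbitrary choice and does not come from this article.

```yaml
# Sketch of a Prometheus scrape job that honors the prometheus.io/* pod annotations.
# Assumption: Prometheus runs in-cluster with RBAC permission to list pods.
scrape_configs:
  - job_name: kubernetes-pods        # hypothetical name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when it is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'
      # Rewrite the scrape address to use the port from prometheus.io/port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
```

Many Prometheus installations for Kubernetes already ship an equivalent job, in which case nothing needs to be added.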

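oracledb_exporter needs the Oracle connection information to access the database and generate metrics. It is passed to the exporter as an environment variable. Because the connection string contains the user name and password used to access the database, we store it in a Kubernetes Secret.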

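To create the Secret that holds the connection string for the Oracle database, run the following command. The Secret must live in the same namespace as the Deployment (database-namespace in the full manifest below), otherwise the pod cannot reference it:

kubectl create secret generic oracledb-exporter-secret \
    --namespace database-namespace \
    --from-literal=datasource='YOUR_CONNECTION_STRING'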
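
The same Secret can also be declared as a manifest, which may be easier to keep alongside the Deployment. This is only a sketch: the namespace is taken from the Deployment shown later in this article, and the placeholder has to be replaced with a real connection string. A manifest that contains a real password should not be committed to version control.

```yaml
# Declarative equivalent of the kubectl command above (sketch).
apiVersion: v1
kind: Secret
metadata:
  name: oracledb-exporter-secret
  namespace: database-namespace   # must match the Deployment's namespace
type: Opaque
stringData:
  # stringData accepts plain text; Kubernetes stores it base64-encoded under data.
  datasource: "YOUR_CONNECTION_STRING"
```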

In the Deployment, configure the environment variable as follows:

       env:
       - name: DATA_SOURCE_NAME
         valueFrom:
           secretKeyRef:
             name: oracledb-exporter-secret
             key: datasource

Make sure the connection string is correct. It has the following format:

system/password@//database_url:1521/database_name.your.domain.com

You can test it with a sqlplus Docker image:

docker run --net='host' --rm --interactive guywithnose/sqlplus sqlplus system/password@//database_url:1521/database_name.your.domain.com

Next, we add some custom metrics, covering slow queries (slow_queries) and big queries (big_queries).
To use custom metrics:

  • In the Deployment, add another environment variable that holds the path to the file defining the new metrics.
  • Mount this file as a volume from a ConfigMap.

The complete configuration is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oracledb-exporter
  namespace: database-namespace
spec:
  selector:
    matchLabels:
      app: oracledb-exporter
  replicas: 1
  template:
    metadata:
      labels:
        app: oracledb-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9161"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: oracledb-exporter
        ports:
        - containerPort: 9161
        image: iamseth/oracledb_exporter
        env:
        - name: DATA_SOURCE_NAME
          valueFrom:
            secretKeyRef:
              name: oracledb-exporter-secret
              key: datasource
        - name: CUSTOM_METRICS
          value: /tmp/custom-metrics.toml
        volumeMounts:
          - name: custom-metrics
            mountPath: /tmp/custom-metrics.toml
            subPath: custom-metrics.toml
      volumes:
        - name: custom-metrics
          configMap:
            defaultMode: 420
            name: custom-metrics

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-metrics
  namespace: database-namespace
data:
  custom-metrics.toml: |
    [[metric]]
    context = "slow_queries"
    metricsdesc = { p95_time_usecs= "Gauge metric with percentile 95 of elapsed time.", p99_time_usecs= "Gauge metric with percentile 99 of elapsed time." }
    request = "select  percentile_disc(0.95)  within group (order by elapsed_time) as p95_time_usecs, percentile_disc(0.99)  within group (order by elapsed_time) as p99_time_usecs from v$sql where last_active_time >= sysdate - 5/(24*60)"
    [[metric]]
    context = "big_queries"
    metricsdesc = { p95_rows= "Gauge metric with percentile 95 of returned rows.", p99_rows= "Gauge metric with percentile 99 of returned rows." }
    request = "select  percentile_disc(0.95)  within group (order by rows_processed) as p95_rows, percentile_disc(0.99)  within group (order by rows_processed) as p99_rows from v$sql where last_active_time >= sysdate - 5/(24*60)"
    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {table_bytes="Gauge metric with the size of the tables in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_bytes from user_segments where segment_type='TABLE' group by segment_name) order by table_bytes DESC FETCH NEXT 100 ROWS ONLY"
    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {table_partition_bytes="Gauge metric with the size of the table partition in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_partition_bytes from user_segments where segment_type='TABLE PARTITION' group by segment_name) order by table_partition_bytes DESC FETCH NEXT 100 ROWS ONLY"
    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {cluster_bytes="Gauge metric with the size of the cluster in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as cluster_bytes from user_segments where segment_type='CLUSTER' group by segment_name) order by cluster_bytes DESC FETCH NEXT 100 ROWS ONLY"
    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {table_bytes="Gauge metric with the size of the tables in DBA segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_bytes from dba_segments where segment_type='TABLE' group by segment_name) order by table_bytes DESC FETCH NEXT 100 ROWS ONLY"
    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {table_partition_bytes="Gauge metric with the size of the table partition in DBA segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_partition_bytes from dba_segments where segment_type='TABLE PARTITION' group by segment_name) order by table_partition_bytes DESC FETCH NEXT 100 ROWS ONLY"
    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {cluster_bytes="Gauge metric with the size of the cluster in DBA segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as cluster_bytes from dba_segments where segment_type='CLUSTER' group by segment_name) order by cluster_bytes DESC FETCH NEXT 100 ROWS ONLY"

Once the Secret and the ConfigMap have been created, apply the Deployment and check that the exporter is serving metrics from the Oracle database on port 9161.

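If everything works, Prometheus automatically discovers the annotated exporter pod and starts scraping its metrics within a few minutes. You can verify this in the Targets section of the Prometheus web interface, and then look for any metric whose name starts with oracledb_.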

What to monitor

Performance metrics

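Wait time: the exporter provides a series of metrics for the time the Oracle database spends waiting in different kinds of activity. They all start with the oracledb_wait_time_ prefix and help assess where the database spends most of its time, for example in I/O, network, commits, or concurrency. This makes it possible to identify the bottlenecks in the system that may affect the overall performance of the Oracle database.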
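
Because the wait-time metrics share only a name prefix, comparing them on one graph is easier if the wait class becomes a label. The recording rule below is a rough sketch of that idea. The rule name oracledb:wait_time:by_class and the label wait_class are made up for this example, and the exact set of oracledb_wait_time_ metrics depends on the exporter version.

```yaml
# Prometheus rule file (sketch). Names marked hypothetical are not part of the exporter.
groups:
  - name: oracledb-wait-time
    rules:
      # Collect every oracledb_wait_time_* series into a single metric and
      # copy the suffix of the original metric name into a wait_class label.
      - record: oracledb:wait_time:by_class      # hypothetical rule name
        expr: |
          label_replace(
            {__name__=~"oracledb_wait_time_.+"},
            "wait_class", "$1",
            "__name__", "oracledb_wait_time_(.+)"
          )
```

With such a rule in place, a query like topk(3, oracledb:wait_time:by_class) would show the wait classes with the highest values per scrape.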

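Slow queries: some queries take longer than others to return results. If this time exceeds the response timeout configured in the application, the application treats it as a timeout error from the database and retries the query. This behavior can overload the system and degrade overall performance.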

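The configuration shown above defines two custom metrics that report the 95th and 99th percentiles of the response time of queries executed in the last 5 minutes. These metrics are: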

  • oracledb_slow_queries_p95_time_usecs
  • oracledb_slow_queries_p99_time_usecs
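
These two metrics lend themselves to a simple alert. The rule below is a sketch, not a recommendation: the 5-second threshold (5e6 microseconds), the 10-minute duration, the alert name, and the severity label are all assumptions that should be adapted to the timeouts actually configured in the application.

```yaml
# Prometheus alerting rule (sketch with an assumed threshold).
groups:
  - name: oracledb-slow-queries
    rules:
      - alert: OracleSlowQueriesP99High          # hypothetical alert name
        # The metric is reported in microseconds; 5e6 usecs = 5 seconds (assumed limit).
        expr: oracledb_slow_queries_p99_time_usecs > 5e6
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 query elapsed time above 5 s on {{ $labels.instance }}"
```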

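Active sessions: it is important to monitor the active sessions in the Oracle database. If the configured limit is exceeded, the database rejects new connections, which causes application errors. The metric that provides this information is oracledb_sessions_value, and its status label gives more detail.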
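
An alert on this metric could look like the following sketch. It assumes that the status label carries the value ACTIVE as reported by Oracle, and it uses an arbitrary limit of 100 sessions. Both should be checked against the exporter output and the session limit configured for the database.

```yaml
# Prometheus alerting rule (sketch with an assumed label value and threshold).
groups:
  - name: oracledb-sessions
    rules:
      - alert: OracleActiveSessionsHigh          # hypothetical alert name
        # Sum over any other labels so that one value per instance is compared.
        expr: sum by (instance) (oracledb_sessions_value{status="ACTIVE"}) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 100 active Oracle sessions on {{ $labels.instance }}"
```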

Activity: it is also important to monitor the operations the database performs. For this we can rely on the following metrics:

  • oracledb_activity_execute_count
  • oracledb_activity_parse_count_total
  • oracledb_activity_user_commits
  • oracledb_activity_user_rollbacks
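
The activity metrics are usually more useful as rates than as raw values. The recording rules below are a sketch that assumes the values are cumulative totals since instance startup. If the exporter version in use reports them differently, rate() is not appropriate. The rule names are hypothetical.

```yaml
# Prometheus recording rules (sketch; assumes cumulative values).
groups:
  - name: oracledb-activity
    rules:
      # Statement executions per second, averaged over 5 minutes.
      - record: oracledb:activity_execute:rate5m             # hypothetical rule name
        expr: rate(oracledb_activity_execute_count[5m])
      # Fraction of finished transactions that were rolled back.
      # The result is NaN while there are no commits or rollbacks.
      - record: oracledb:activity_rollback_ratio:rate5m      # hypothetical rule name
        expr: |
          rate(oracledb_activity_user_rollbacks[5m])
          /
          (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m]))
```

A rollback ratio that rises steadily can hint at application errors or lock contention before they show up elsewhere.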

