Background
This article gives a brief introduction to how Prometheus can monitor an Oracle database through exporters, and which metrics deserve attention.
oracledb_exporter
oracledb_exporter is an application that connects to an Oracle database and exposes Prometheus metrics.
Setup
This section shows how to install and configure oracledb_exporter so that Prometheus can monitor an Oracle database. The exporter is deployed in a Kubernetes cluster.
Deploy oracledb_exporter in Kubernetes with a Deployment, and add annotations so that Prometheus automatically discovers the exporter endpoint and scrapes its metrics:
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9161"
        prometheus.io/path: "/metrics"
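These annotations only take effect if the Prometheus server is configured to read them, which is common but not enabled by default. A minimal sketch of such a pod-discovery scrape job (the job name and this particular relabel layout are illustrative, adapt them to your own Prometheus configuration):

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # use prometheus.io/path as the metrics path, if set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # use prometheus.io/port as the scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: "$1:$2"
        target_label: __address__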
oracledb_exporter needs the Oracle connection information to access the database and generate metrics. This is passed to the exporter as an environment variable. Since the connection information contains the user and password used to access the database, we will store it in a Kubernetes Secret.
To create the Secret with the connection string to the Oracle database, you can use the following command:
kubectl create secret generic oracledb-exporter-secret \
--from-literal=datasource='YOUR_CONNECTION_STRING'
In the Deployment, configure the environment variable like this:
env:
  - name: DATA_SOURCE_NAME
    valueFrom:
      secretKeyRef:
        name: oracledb-exporter-secret
        key: datasource
Make sure the connection information is correct; it should follow this format:
system/password@//database_url:1521/database_name.your.domain.com
You can verify it with a sqlplus Docker image:
docker run --net='host' --rm --interactive guywithnose/sqlplus sqlplus system/password@//database_url:1521/database_name.my.domain.com
Next, let's add some custom metrics, including slow queries and big queries (queries that return many rows).
To use the custom metrics:
- In the Deployment, add another environment variable with the path to the file containing the new metrics.
- Mount this new file from a ConfigMap as a volume.
The complete configuration looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oracledb-exporter
  namespace: database-namespace
spec:
  selector:
    matchLabels:
      app: oracledb-exporter
  replicas: 1
  template:
    metadata:
      labels:
        app: oracledb-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9161"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: oracledb-exporter
          ports:
            - containerPort: 9161
          image: iamseth/oracledb_exporter
          env:
            - name: DATA_SOURCE_NAME
              valueFrom:
                secretKeyRef:
                  name: oracledb-exporter-secret
                  key: datasource
            - name: CUSTOM_METRICS
              value: /tmp/custom-metrics.toml
          volumeMounts:
            - name: custom-metrics
              mountPath: /tmp/custom-metrics.toml
              subPath: custom-metrics.toml
      volumes:
        - name: custom-metrics
          configMap:
            defaultMode: 420
            name: custom-metrics
ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-metrics
  namespace: database-namespace
data:
  custom-metrics.toml: |
    [[metric]]
    context = "slow_queries"
    metricsdesc = { p95_time_usecs= "Gauge metric with percentile 95 of elapsed time.", p99_time_usecs= "Gauge metric with percentile 99 of elapsed time." }
    request = "select percentile_disc(0.95) within group (order by elapsed_time) as p95_time_usecs, percentile_disc(0.99) within group (order by elapsed_time) as p99_time_usecs from v$sql where last_active_time >= sysdate - 5/(24*60)"

    [[metric]]
    context = "big_queries"
    metricsdesc = { p95_rows= "Gauge metric with percentile 95 of returned rows.", p99_rows= "Gauge metric with percentile 99 of returned rows." }
    request = "select percentile_disc(0.95) within group (order by rownum) as p95_rows, percentile_disc(0.99) within group (order by rownum) as p99_rows from v$sql where last_active_time >= sysdate - 5/(24*60)"

    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {table_bytes="Gauge metric with the size of the tables in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_bytes from user_segments where segment_type='TABLE' group by segment_name) order by table_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {table_partition_bytes="Gauge metric with the size of the table partition in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_partition_bytes from user_segments where segment_type='TABLE PARTITION' group by segment_name) order by table_partition_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {cluster_bytes="Gauge metric with the size of the cluster in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as cluster_bytes from user_segments where segment_type='CLUSTER' group by segment_name) order by cluster_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {table_bytes="Gauge metric with the size of the tables in dba segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_bytes from dba_segments where segment_type='TABLE' group by segment_name) order by table_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {table_partition_bytes="Gauge metric with the size of the table partition in dba segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_partition_bytes from dba_segments where segment_type='TABLE PARTITION' group by segment_name) order by table_partition_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {cluster_bytes="Gauge metric with the size of the cluster in dba segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as cluster_bytes from dba_segments where segment_type='CLUSTER' group by segment_name) order by cluster_bytes DESC FETCH NEXT 100 ROWS ONLY"
After creating the Secret and the ConfigMap, you can apply the Deployment and check that the exporter is exposing metrics from the Oracle database on port 9161.
If everything is working, Prometheus will automatically discover the annotated exporter pod and start scraping its metrics within a few minutes. You can check this in the Targets section of the Prometheus web UI, and look for any metrics starting with oracledb_.
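You can also check the exporter directly, without waiting for Prometheus; a quick manual check, using the Deployment name and namespace from the manifests above:

# forward the exporter port to your machine
kubectl -n database-namespace port-forward deploy/oracledb-exporter 9161:9161

# in another terminal, fetch the metrics and look for the oracledb_ prefix
curl -s http://localhost:9161/metrics | grep '^oracledb_'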
What to monitor
Performance metrics
Wait time: the exporter provides a set of metrics for the time the Oracle database spends waiting on different activities. They all start with the oracledb_wait_time_ prefix, and they help evaluate where the database is spending most of its time: I/O, network, commits, concurrency, and so on. This makes it possible to identify bottlenecks that may affect the overall performance of the Oracle database.
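As a sketch, a PromQL query like the one below surfaces the largest wait-time classes at a glance (the exact metric names under the oracledb_wait_time_ prefix depend on the exporter version):

topk(5, {__name__=~"oracledb_wait_time_.*"})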
Slow queries: some queries take longer than others to return results. If this time exceeds the response timeout configured in the application, the application will treat it as a timeout error from the database and retry the query. This behavior can overload the system and hurt overall performance.
In the configuration shown above, two custom metrics report the 95th and 99th percentile of response time for queries executed in the last 5 minutes (an example alert rule based on them follows the list). These metrics are:
- oracledb_slow_queries_p95_time_usecs
- oracledb_slow_queries_p99_time_usecs
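As a sketch of how these could be used, the alert rule below fires when the p99 query time stays above 5 seconds for 10 minutes; the threshold, duration, and rule group name are arbitrary examples, not values from the original setup:

groups:
  - name: oracledb-custom-alerts
    rules:
      - alert: OracleSlowQueriesP99High
        # 5e6 microseconds = 5 seconds; example threshold only
        expr: oracledb_slow_queries_p99_time_usecs > 5e6
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Oracle p99 query time above 5s on {{ $labels.instance }}"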
Active sessions: it is important to monitor the active sessions in the Oracle database. If the configured limit is exceeded, the database will reject new connections, causing application errors. The metric that provides this information is oracledb_sessions_value, and its status label gives more detail.
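A simple way to watch this is to break the session count down by the status label mentioned above; for example, in PromQL:

sum by (status) (oracledb_sessions_value)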
Activity: it is also important to monitor the operations performed by the database. For this we can rely on the following metrics (a sketch of how to query them follows the list):
- oracledb_activity_execute_count
- oracledb_activity_parse_count_total
- oracledb_activity_user_commits
- oracledb_activity_user_rollbacks
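These are cumulative counters, so the per-second rate is usually what is graphed or alerted on. A sketch in PromQL, with an arbitrary 5-minute window:

rate(oracledb_activity_execute_count[5m])
rate(oracledb_activity_user_commits[5m])
rate(oracledb_activity_user_rollbacks[5m])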