8.1 Understanding Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
Ambari predefines a set of alerts that monitor the cluster components and hosts. Each alert is defined by an alert definition, which
specifies the alert type and the interval and thresholds of the check. When a cluster is created or modified, Ambari reads the alert
definitions and creates alert instances for the appropriate items to monitor. For example, if the cluster includes the Hadoop Distributed
File System (HDFS), there is an alert definition for monitoring the "DataNode Process". An instance of that alert definition is created
for each DataNode in the cluster.
Using Ambari Web, you can browse the list of alert definitions for the cluster by clicking the Alerts tab. You can search and filter
alert definitions by current status and last status change, as well as by the service with which the alert definition is associated.
You can click an alert definition name to view details about that alert, to modify the alert properties (such as the check interval and
thresholds), and to see the list of alert instances associated with the alert definition.
Each alert instance reports an alert status, defined by severity. The most common severity levels are OK, WARNING, and CRITICAL, but
there are also severities for UNKNOWN and NONE. Alert notifications are sent when the alert status changes (for example, the status
changes from OK to CRITICAL).
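The alert definitions and instances described above can also be inspected programmatically through the Ambari REST API. A minimal sketch (the server address, cluster name MyCluster, definition id 57, and the admin credentials are all placeholders for your environment):

```shell
# List every alert definition in the cluster (name, id, associated service).
# Host, port, cluster name, and credentials are placeholders.
curl -s -u admin:admin -H 'X-Requested-By: ambari' \
  'http://ambari-server:8080/api/v1/clusters/MyCluster/alert_definitions'

# Drill into one definition (id 57 is an example) to see its type,
# check interval, and thresholds.
curl -s -u admin:admin -H 'X-Requested-By: ambari' \
  'http://ambari-server:8080/api/v1/clusters/MyCluster/alert_definitions/57'
```

The `X-Requested-By` header is required by Ambari for non-GET requests and is harmless here; including it keeps the same invocation reusable for the PUT calls shown later.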
8.1.1 Alert Types
-----------------------------------------------------------------------------------------------------------------------------------------
The thresholds of an alert, and the units of those thresholds, depend on the alert type. The following describes the alert types, their
possible statuses, and what units can be configured for the thresholds, if the thresholds are configurable.
WEB Alert Type : WEB alerts watch a web URL on a given component, and the alert status is determined by the HTTP response code. Therefore,
      you cannot change the HTTP response codes that determine the thresholds of a WEB alert. You can customize the response text for
      each threshold and the overall web connection timeout. A connection timeout is considered a CRITICAL alert. Threshold units are
      based on seconds.
      The response codes and corresponding statuses for WEB alerts are as follows:
      ● OK status : if the web URL responds with a code below 400.
      ● WARNING status : if the web URL responds with a code equal to or above 400.
      ● CRITICAL status : if Ambari cannot connect to the web URL.
PORT Alert Type : PORT alerts check the response time to connect to a given port; threshold units are based on seconds.
METRIC Alert Type : METRIC alerts check the value of a single metric, or of multiple metrics if a calculation is performed. Metrics are
      accessed from a URL endpoint available on the given component. A connection timeout is considered a CRITICAL alert.
      The thresholds are adjustable, and the unit of each threshold depends on the metric. For example, in the case of CPU utilization
      alerts the unit is percent, and in the case of RPC latency alerts the unit is milliseconds.
AGGREGATE Alert Type : AGGREGATE alerts aggregate the count of an alert status as a percentage of the total number of alerts affected.
      For example, the Percent DataNode Process alert aggregates the DataNode Process alert.
SCRIPT Alert Type : SCRIPT alerts execute a script to determine the status, such as OK, WARNING, or CRITICAL. You can customize the
      response text and the values of the properties and thresholds of a SCRIPT alert.
SERVER Alert Type : SERVER alerts execute a server-side runnable class to determine the alert status, such as OK, WARNING, or CRITICAL.
RECOVERY Alert Type : RECOVERY alerts are handled by the Ambari Agents and monitor for process restarts. The alert statuses OK, WARNING,
      and CRITICAL are based on the number of times a process is restarted automatically. This is useful to know when processes are
      terminating and being automatically restarted by Ambari.
8.2 Modifying Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
General properties of an alert include the name, description, check interval, and thresholds.
The check interval defines the frequency with which Ambari checks the alert status. For example, "1 minute" means that Ambari checks
the alert status every minute.
The configuration options for the thresholds depend on the alert type.
To modify the general properties of an alert:
① In Ambari Web, browse to the Alerts section
② Find the alert definition and click it to view the definition details
③ Click Edit to modify the name, description, check interval, and thresholds (as applicable)
④ Click Save
⑤ Changes take effect on all alert instances at the next check interval
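The same property edit can be scripted against the Ambari REST API. A hedged sketch (server address, cluster name MyCluster, definition id 57, and credentials are placeholders; the interval is given in minutes):

```shell
# Change the check interval of alert definition 57 to 5 minutes.
# Only the fields included in the body are updated.
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"AlertDefinition": {"interval": 5}}' \
  'http://ambari-server:8080/api/v1/clusters/MyCluster/alert_definitions/57'
```

As with the UI edit, the new interval takes effect on all instances of the definition at the next check.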
8.3 Modifying Alert Check Counts
-----------------------------------------------------------------------------------------------------------------------------------------
Ambari allows you to set the number of alert checks to perform before dispatching a notification. If the alert status changes during a
check, Ambari attempts to check the condition a specified number of times (the check count) before dispatching a notification.
Alert check counts are not applicable to AGGREGATE alert types. A state change for an AGGREGATE alert results in a notification being
dispatched.
If your environment frequently experiences transient issues that result in false alerts, you can increase the check count. In this case,
alert status changes are still recorded, but as SOFT status changes. If the alert condition is still triggered after the specified
number of checks, the status change is considered HARD, and notifications are sent.
You typically set the check count globally for all alerts, but you can also override the globally set value for individual alerts if
one or more of them experience transient issues.
To modify the global alert check count:
① In Ambari Web, browse to the Alerts section
② In the Actions menu, click Manage Alert Settings
③ Update the Check Count value
④ Click Save
Changes to the global alert check count may require a few seconds to appear in the Ambari UI for individual alerts.
To override the global alert check count for individual alerts:
① In Ambari Web, browse to the Alerts section
② Select the alert for which you want to set a specific Check Count value
③ On the right side, click the Edit icon next to the Check Count property
④ Update the Check Count value
⑤ Click Save
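The per-alert override is also reachable through the REST API, where the check count surfaces as the repeat-tolerance fields of the alert definition. A sketch under the assumption of an Ambari 2.x API (field names `repeat_tolerance_enabled` and `repeat_tolerance`, plus the usual placeholder server, cluster, id, and credentials):

```shell
# Override the global check count for one alert definition:
# enable the override and require 3 consecutive failed checks (SOFT -> HARD).
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"AlertDefinition": {"repeat_tolerance_enabled": true, "repeat_tolerance": 3}}' \
  'http://ambari-server:8080/api/v1/clusters/MyCluster/alert_definitions/57'
```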
8.4 Disabling and Re-enabling Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
You can disable alerts. When an alert is disabled, no alert instances are in effect and Ambari no longer performs the checks for the
alert. Therefore, no alert status changes are recorded and no notifications are sent.
① In Ambari Web, browse to the Alerts section
② Find the alert definition and click the Enabled or Disabled text to enable/disable the alert
③ Alternatively, click the alert to view the definition details, and then click Enabled or Disabled to enable/disable the alert
④ You are prompted to confirm the enable/disable operation
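Disabling can likewise be scripted; a sketch against the Ambari REST API (server address, cluster name, definition id, and credentials are placeholders as before):

```shell
# Disable alert definition 57; set "enabled": true to re-enable it later.
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"AlertDefinition": {"enabled": false}}' \
  'http://ambari-server:8080/api/v1/clusters/MyCluster/alert_definitions/57'
```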
8.5 Tables of Predefined Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
8.5.1 HDFS Service Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert name : NameNode Blocks Health
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical
                        threshold.
    Potential causes  : Some DataNodes are down and the replicas that are missing blocks are only on those DataNodes.
                        The corrupt or missing blocks are from files with a replication factor of 1. New replicas cannot be created because
                        the only replica of the block is missing.
    Possible remedies : For critical data, use a replication factor of 3.
                        Bring up the failed DataNodes with missing or corrupt blocks.
                        Identify the files associated with the missing or corrupt blocks by running the Hadoop fsck command.
                        Delete the corrupt files and recover them from backup, if one exists.
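The fsck-based remedy above can be carried out with the standard HDFS tooling; /path/to/file below is a placeholder for an affected file:

```shell
# List the files that currently have corrupt blocks (read-only check).
hdfs fsck / -list-corruptfileblocks

# Inspect one affected file: which blocks it has and where the replicas live.
hdfs fsck /path/to/file -files -blocks -locations

# Only after recovering what you can from backup: remove the corrupt files.
hdfs fsck / -delete
```

Run the read-only variants first; `-delete` is irreversible and should be the last step.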
□ Alert name : NFS Gateway Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : PORT
    Description       : This host-level alert is triggered if the NFS Gateway process cannot be confirmed as active.
    Potential causes  : NFS Gateway is down.
    Possible remedies : Check for a non-operating NFS Gateway in Ambari Web.
□ Alert name : DataNode Storage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the
                        DataNode JMX Servlet for the Capacity and Remaining properties.
    Potential causes  : Cluster storage is full.
                        If cluster storage is not full, the DataNode is full.
    Possible remedies : If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.
                        If the cluster is full, delete unnecessary data or add additional storage by adding either more DataNodes or more
                        or larger disks to the DataNodes. After adding more storage, run the load balancer.
□ Alert name : DataNode Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : PORT
    Description       : This host-level alert is triggered if the individual DataNode processes cannot be established to be up and
                        listening on the network for the configured critical threshold, given in seconds.
    Potential causes  : The DataNode process is down or not responding.
                        The DataNode is not down but is not listening to the correct network port/address.
    Possible remedies : Check for non-operating DataNodes in Ambari Web.
                        Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary.
                        Run the netstat -tulpn command to check if the DataNode process is bound to the correct network port.
□ Alert name : DataNode Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This host-level alert is triggered if the DataNode web UI is unreachable.
    Potential causes  : The DataNode process is not running.
    Possible remedies : Check whether the DataNode process is running.
□ Alert name : NameNode Host CPU Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning,
                        250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is available
                        only if you are running JDK 1.7.
    Potential causes  : Unusually high CPU utilization might be caused by a very unusual job or query workload, but this is generally the
                        sign of an issue in the daemon.
    Possible remedies : Use the top command to determine which processes are consuming excess CPU.
                        Reset the offending process.
□ Alert name : NameNode Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This host-level alert is triggered if the NameNode web UI is unreachable.
    Potential causes  : The NameNode process is not running.
    Possible remedies : Check whether the NameNode process is running.
□ Alert name : Percent DataNodes with Available Space
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : AGGREGATE
    Description       : This service-level alert is triggered if the storage is full on a certain percentage of DataNodes (10% warn,
                        30% critical).
    Potential causes  : Cluster storage is full.
                        If cluster storage is not full, the DataNode is full.
    Possible remedies : If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.
                        If the cluster is full, delete unnecessary data or increase storage by adding either more DataNodes or more or
                        larger disks to the DataNodes. After adding more storage, run the load balancer.
□ Alert name : Percent DataNodes Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : AGGREGATE
    Description       : This alert is triggered if the number of non-operating DataNodes in the cluster is greater than the configured
                        critical threshold. It aggregates the DataNode Process alert.
    Potential causes  : DataNodes are down.
                        DataNodes are not down but are not listening to the correct network port/address.
    Possible remedies : Check for non-operating DataNodes in Ambari Web.
                        Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.
                        Run the netstat -tulpn command to check if the DataNode process is bound to the correct network port.
□ Alert name : NameNode RPC Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical
                        threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average
                        queue wait time to increase for NameNode operations.
    Potential causes  : A job or an application is performing too many NameNode operations.
    Possible remedies : Review the job or the application for potential bugs causing it to perform too many NameNode operations.
□ Alert name : NameNode Last Checkpoint
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This alert triggers if the last time the NameNode performed a checkpoint was too long ago, or if the number of
                        uncommitted transactions is beyond a certain threshold.
    Potential causes  : Too much time has elapsed since the last NameNode checkpoint.
                        Uncommitted transactions beyond the threshold.
    Possible remedies : Force a NameNode checkpoint.
                        Review the threshold for uncommitted transactions.
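The remedy of forcing a NameNode checkpoint can be carried out from the command line with standard `hdfs dfsadmin` subcommands (requires HDFS administrator privileges; safe mode briefly blocks writes to the namespace):

```shell
# Enter safe mode so the namespace is quiescent, save a checkpoint, then leave.
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
```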
□ Alert name : Secondary NameNode Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the
                        network. This alert is not applicable when NameNode HA is configured.
    Potential causes  : The Secondary NameNode is not running.
    Possible remedies : Check that the Secondary NameNode process is running.
□ Alert name : NameNode Directory Status
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This alert checks if the NameNode NameDirStatus metric reports a failed directory.
    Potential causes  : One or more of the directories are reporting as not healthy.
    Possible remedies : Check the NameNode UI for information about unhealthy directories.
□ Alert name : HDFS Capacity Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This service-level alert is triggered if the HDFS capacity utilization exceeds the configured critical threshold
                        (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties.
    Potential causes  : Cluster storage is full.
    Possible remedies : Delete unnecessary data.
                        Archive unused data.
                        Add more DataNodes.
                        Add more or larger disks to the DataNodes.
                        After adding more storage, run the load balancer.
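The "run the load balancer" step above maps to the standard HDFS balancer tool:

```shell
# Rebalance block placement after adding DataNodes or disks.
# -threshold is the allowed deviation, in percent, of each DataNode's
# utilization from the cluster average (10 is the default).
hdfs balancer -threshold 10
```

The balancer runs until the cluster is within the threshold and can be interrupted and re-run safely.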
□ Alert name : DataNode Health Summary
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This service-level alert is triggered if there are unhealthy DataNodes.
    Potential causes  : A DataNode is in an unhealthy state.
    Possible remedies : Check the NameNode UI for the list of non-operating DataNodes.
□ Alert name : HDFS Pending Deletion Blocks
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured
                        warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlocks property.
    Potential causes  : A large number of blocks are pending deletion.
    Possible remedies :
□ Alert name : HDFS Upgrade Finalized State
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if HDFS is not in the finalized state.
    Potential causes  : The HDFS upgrade is not finalized.
    Possible remedies : Finalize any upgrade you have in process.
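Finalizing the upgrade is done with standard `hdfs dfsadmin` subcommands; note that finalizing discards the pre-upgrade state, so it cannot be rolled back afterwards:

```shell
# Finalize a pending (non-rolling) HDFS upgrade.
hdfs dfsadmin -finalizeUpgrade

# For rolling upgrades, the equivalent is:
hdfs dfsadmin -rollingUpgrade finalize
```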
□ Alert name : DataNode Unmounted Data Dir
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This host-level alert is triggered if one of the data directories on a host was previously on a mount point and
                        became unmounted.
    Potential causes  : If the mount history file does not exist, an error is reported when a host has one or more mounted data
                        directories as well as one or more unmounted data directories on the root partition. This may indicate that a
                        data directory is writing to the root partition, which is undesirable.
    Possible remedies : Check the data directories to confirm they are mounted as expected.
□ Alert name : DataNode Heap Usage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode
                        JMX Servlet for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are percentages.
    Potential causes  :
    Possible remedies :
□ Alert name : NameNode Client RPC Queue Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if the deviation of RPC queue latency on the client port has grown beyond
                        the specified threshold within a given period. This alert monitors Hourly and Daily periods.
    Potential causes  :
    Possible remedies :
□ Alert name : NameNode Client RPC Processing Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if the deviation of RPC processing latency on the client port has grown
                        beyond the specified threshold within a given period. This alert monitors Hourly and Daily periods.
    Potential causes  :
    Possible remedies :
□ Alert name : NameNode Service RPC Queue Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if the deviation of RPC queue latency on the DataNode port has grown beyond
                        the specified threshold within a given period. This alert monitors Hourly and Daily periods.
    Potential causes  :
    Possible remedies :
□ Alert name : NameNode Service RPC Processing Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if the deviation of RPC processing latency on the DataNode port has grown
                        beyond the specified threshold within a given period. This alert monitors Hourly and Daily periods.
    Potential causes  :
    Possible remedies :
□ Alert name : HDFS Storage Capacity Usage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the
                        specified threshold within a given period. This alert monitors Daily and Weekly periods.
    Potential causes  :
    Possible remedies :
□ Alert name : NameNode Heap Usage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified
                        threshold within a given period. This alert monitors Daily and Weekly periods.
    Potential causes  :
    Possible remedies :
8.5.2 HDFS HA Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert name : JournalNode Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This host-level alert is triggered if the individual JournalNode process cannot be established to be up and
                        listening on the network for the configured critical threshold, given in seconds.
    Potential causes  : The JournalNode process is down or not responding.
                        The JournalNode is not down but is not listening to the correct network port/address.
    Possible remedies :
□ Alert name : NameNode High Availability Health
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if either the Active NameNode or Standby NameNode is not running.
    Potential causes  : The Active, Standby, or both NameNode processes are down.
    Possible remedies : On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode
                        host/process using Ambari Web.
                        On each host running NameNode, run the netstat -tulpn command to check if the NameNode process is bound to the
                        correct network port.
□ Alert name : Percent JournalNodes Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : AGGREGATE
    Description       : This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the
                        configured critical threshold (33% warn, 50% crit). It aggregates the results of JournalNode process checks.
    Potential causes  : JournalNodes are down.
                        JournalNodes are not down but are not listening to the correct network port/address.
    Possible remedies : Check for dead JournalNodes in Ambari Web.
□ Alert name : ZooKeeper Failover Controller Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : PORT
    Description       : This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening
                        on the network.
    Potential causes  : The ZKFC process is down or not responding.
    Possible remedies : Check if the ZKFC process is running.
8.5.3 NameNode HA Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert name : JournalNode Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This host-level alert is triggered if the individual JournalNode process cannot be established to be up and
                        listening on the network for the configured critical threshold, given in seconds.
    Potential causes  : The JournalNode process is down or not responding.
                        The JournalNode is not down but is not listening to the correct network port/address.
    Possible remedies : Check if the JournalNode process is running.
□ Alert name : NameNode High Availability Health
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This service-level alert is triggered if either the Active NameNode or Standby NameNode is not running.
    Potential causes  : The Active, Standby, or both NameNode processes are down.
    Possible remedies : On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode
                        host/process using Ambari Web.
                        On each host running NameNode, run the netstat -tulpn command to check if the NameNode process is bound to the
                        correct network port.
□ Alert name : Percent JournalNodes Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : AGGREGATE
    Description       : This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the
                        configured critical threshold (33% warn, 50% crit). It aggregates the results of JournalNode process checks.
    Potential causes  : JournalNodes are down.
                        JournalNodes are not down but are not listening to the correct network port/address.
    Possible remedies : Check for non-operating JournalNodes in Ambari Web.
□ Alert name : ZooKeeper Failover Controller Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : PORT
    Description       : This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening
                        on the network.
    Potential causes  : The ZKFC process is down or not responding.
    Possible remedies : Check if the ZKFC process is running.
8.5.4 YARN Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert name : App Timeline Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This host-level alert is triggered if the App Timeline Server Web UI is unreachable.
    Potential causes  : The App Timeline Server is down.
                        The App Timeline Server is not down but is not listening to the correct network port/address.
    Possible remedies : Check for a non-operating App Timeline Server in Ambari Web.
□ Alert name : Percent NodeManagers Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : AGGREGATE
    Description       : This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical
                        threshold. It aggregates the results of NodeManager process alert checks.
    Potential causes  : NodeManagers are down.
                        NodeManagers are not down but are not listening to the correct network port/address.
    Possible remedies : Check for non-operating NodeManagers.
                        Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes,
                        as necessary.
                        Run the netstat -tulpn command to check if the NodeManager process is bound to the correct network port.
□ Alert name : ResourceManager Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This host-level alert is triggered if the ResourceManager Web UI is unreachable.
    Potential causes  : The ResourceManager process is not running.
    Possible remedies : Check if the ResourceManager process is running.
□ Alert name : ResourceManager RPC Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical
                        threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average
                        queue wait time to increase for ResourceManager operations.
    Potential causes  : A job or an application is performing too many ResourceManager operations.
    Possible remedies : Review the job or the application for potential bugs causing it to perform too many ResourceManager operations.
□ Alert name : ResourceManager CPU Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain thresholds
                        (200% warning, 250% critical). It checks the ResourceManager JMX Servlet for the SystemCPULoad property. This
                        information is only available if you are running JDK 1.7.
    Potential causes  : Unusually high CPU utilization can be caused by a very unusual job/query workload, but this is generally the
                        sign of an issue in the daemon.
    Possible remedies : Use the top command to determine which processes are consuming excess CPU.
                        Reset the offending process.
□ Alert name : NodeManager Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This host-level alert is triggered if the NodeManager process cannot be established to be up and listening on
                        the network for the configured critical threshold, given in seconds.
    Potential causes  : The NodeManager process is down or not responding.
                        The NodeManager is not down but is not listening to the correct network port/address.
    Possible remedies : Check if the NodeManager is running.
                        Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager, if necessary.
□ Alert name : NodeManager Health Summary
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This host-level alert checks the node health property available from the NodeManager component.
    Potential causes  : The NodeManager health check script reports issues or is not configured.
    Possible remedies : Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if
                        necessary.
                        Check in the ResourceManager UI logs (/var/log/hadoop/yarn) for health check errors.
□ Alert name : NodeManager Health
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : SCRIPT
    Description       : This host-level alert checks the nodeHealthy property available from the NodeManager component.
    Potential causes  : The NodeManager process is down or not responding.
    Possible remedies : Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if
                        necessary.
8.5.5 MapReduce2 Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert name : History Server Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : WEB
    Description       : This host-level alert is triggered if the HistoryServer Web UI is unreachable.
    Potential causes  : The HistoryServer process is not running.
    Possible remedies : Check if the HistoryServer process is running.
□ Alert name : History Server RPC Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This host-level alert is triggered if the HistoryServer operations RPC latency exceeds the configured critical
                        threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average
                        queue wait time to increase for HistoryServer operations.
    Potential causes  : A job or an application is performing too many HistoryServer operations.
    Possible remedies : Review the job or the application for potential bugs causing it to perform too many HistoryServer operations.
□ Alert name : History Server CPU Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : METRIC
    Description       : This host-level alert is triggered if the percent of CPU utilization on the HistoryServer exceeds the configured
                        critical threshold.
    Potential causes  : Unusually high CPU utilization can be caused by a very unusual job/query workload, but this is generally the
                        sign of an issue in the daemon.
    Possible remedies : Use the top command to determine which processes are consuming excess CPU.
                        Reset the offending process.
□ Alert name : History Server Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type        : PORT
    Description       : This host-level alert is triggered if the HistoryServer process cannot be established to be up and listening on
                        the network for the configured critical threshold, given in seconds.
    Potential causes  : The HistoryServer process is down or not responding.
                        The HistoryServer is not down but is not listening to the correct network port/address.
    Possible remedies : Check whether the HistoryServer is running.
                        Check for any errors in the HistoryServer logs (/var/log/hadoop/mapred) and restart the HistoryServer, if
                        necessary.
8.5.6 HBase 服務警報 (HBase Service Alerts)
-----------------------------------------------------------------------------------------------------------------------------------------
□ 警報名稱: Percent RegionServers Available
-------------------------------------------------------------------------------------------------------------------------------------
警報類型 :
描述 : This service-level alert is triggered if the configured percentage of Region Server processes cannot be determined to be
up and listening on the network for the configured critical threshold. The default setting is 10% to produce a WARN alert
and 30% to produce a CRITICAL alert. It aggregates the results of RegionServer process down checks.
潛在原因 : Misconfiguration or less-thanideal configuration caused the RegionServers to crash.
Cascading failures brought on by some workload caused the RegionServers to crash.
The RegionServers shut themselves own because there were problems in the dependent services, ZooKeeper or HDFS.
GC paused the RegionServer for too long and the RegionServers lost contact with Zookeeper.
解決方法 : Check the dependent services to make sure they are operating correctly.
Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information.
If the failure was associated with a particular workload, try to understand the workload better.
Restart the RegionServers.
□ 警報名稱: HBase Master Process
-------------------------------------------------------------------------------------------------------------------------------------
警報類型 :
描述 : This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for
the configured critical threshold, given in seconds.
潛在原因 : The HBase master process is down.
The HBase master has shut itself down because there were problems in the dependent services, ZooKeeper or HDFS.
解決方法 : Check the dependent services.
Look at the master log files (usually /var/log/hbase/*.log) for further information.
Look at the configuration files (/etc/hbase/conf).
Restart the master.
□ 警報名稱: HBase Master CPU Utilization
-------------------------------------------------------------------------------------------------------------------------------------
描述 : This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain thresholds (200% warning,
250% critical). It checks the HBase Master JMX Servlet for the SystemCPULoad property. This information is only available
if you are running JDK 1.7.
潛在原因 : Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but this is generally the sign of
an issue in the daemon.
解決方法 : Use the top command to determine which processes are consuming excess CPU
Reset the offending process.
□ 警報名稱: RegionServers Health Summary
-------------------------------------------------------------------------------------------------------------------------------------
描述 : This service-level alert is triggered if there are unhealthy RegionServers
潛在原因 : The RegionServer process is down on the host.
The RegionServer process is up and running but not listening on the correct network port (default 60030).
解決方法 : Check for dead RegionServer in Ambari Web.
□ Alert Name: HBase RegionServer Process
-------------------------------------------------------------------------------------------------------------------------------------
Description : This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the
              network for the configured critical threshold, given in seconds.
Potential Causes : The RegionServer process is down on the host.
              The RegionServer process is up and running but not listening on the correct network port (default 60030).
Possible Remedies : Check for any errors in the logs (/var/log/hbase/) and restart the RegionServer process using Ambari Web.
              Run the netstat -tulpn command to check if the RegionServer process is bound to the correct network port.
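As an alternative to netstat, the PORT-style check that Ambari performs can be sketched as a plain TCP connection attempt. A minimal sketch, assuming the default RegionServer info port 60030 (adjust the host and port for your cluster):

```python
import socket

def check_port(host: str, port: int, timeout: float = 5.0) -> str:
    """PORT-style check: OK if a TCP connection succeeds within the
    timeout, CRITICAL otherwise (connection refused or timed out)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except OSError:
        return "CRITICAL"

# Probe the default RegionServer info port on the local host.
print(check_port("localhost", 60030))
```

This mirrors the PORT alert semantics from section 8.1.1: only connection success matters, and the timeout maps to the CRITICAL threshold in seconds.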
8.5.7 Hive Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert Name: HiveServer2 Process
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type :
Description : This host-level alert is triggered if the HiveServer2 cannot be determined to be up and responding to client requests.
Potential Causes : The HiveServer2 process is not running.
              The HiveServer2 process is not responding.
Possible Remedies : Using Ambari Web, check the status of the HiveServer2 component. Stop and then restart it.
□ Alert Name: Hive Metastore Process
-------------------------------------------------------------------------------------------------------------------------------------
Description : This host-level alert is triggered if the Hive Metastore process cannot be determined to be up and listening on the
              network for the configured critical threshold, given in seconds.
Potential Causes : The Hive Metastore service is down.
              The database used by the Hive Metastore is down.
              The Hive Metastore host is not reachable over the network.
Possible Remedies : Using Ambari Web, stop the Hive service and then restart it.
□ Alert Name: WebHCat Server Status
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type :
Description : This host-level alert is triggered if the WebHCat server cannot be determined to be up and responding to client
              requests.
Potential Causes : The WebHCat server is down.
              The WebHCat server is hung and not responding.
              The WebHCat server is not reachable over the network.
Possible Remedies : Restart the WebHCat server using Ambari Web.
8.5.8 Oozie Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert Name: Oozie Server Web UI
-------------------------------------------------------------------------------------------------------------------------------------
Description : This host-level alert is triggered if the Oozie server Web UI is unreachable.
Potential Causes : The Oozie server is down.
              The Oozie server is not down but is not listening on the correct network port/address.
Possible Remedies : Check for a dead Oozie Server in Ambari Web.
□ Alert Name: Oozie Server Status
-------------------------------------------------------------------------------------------------------------------------------------
Description : This host-level alert is triggered if the Oozie server cannot be determined to be up and responding to client requests.
Potential Causes : The Oozie server is down.
              The Oozie server is hung and not responding.
              The Oozie server is not reachable over the network.
Possible Remedies : Restart the Oozie service using Ambari Web.
8.5.9 ZooKeeper Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert Name: Percent ZooKeeper Servers Available
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type : AGGREGATE
Description : This service-level alert is triggered if the configured percentage of ZooKeeper processes cannot be determined to be
              up and listening on the network for the configured critical threshold, given in seconds. It aggregates the results of
              ZooKeeper process checks.
Potential Causes : The majority of your ZooKeeper servers are down and not responding.
Possible Remedies : Check the dependent services to make sure they are operating correctly.
              Check the ZooKeeper logs (/var/log/hadoop/zookeeper.log) for further information.
              If the failure was associated with a particular workload, try to understand the workload better.
              Restart the ZooKeeper servers from the Ambari UI.
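The rollup that an AGGREGATE alert performs can be sketched as a percentage calculation over the child alert states. A minimal illustration; the percentage thresholds below (10% WARNING, 30% CRITICAL) are placeholders for this sketch, not Ambari's defaults:

```python
def aggregate_state(child_states, warn_pct=10.0, crit_pct=30.0):
    """Roll up child alert states into one AGGREGATE state based on
    the percentage of children that are not OK."""
    if not child_states:
        return "UNKNOWN"
    affected = sum(1 for s in child_states if s != "OK") / len(child_states) * 100
    if affected >= crit_pct:
        return "CRITICAL"
    if affected >= warn_pct:
        return "WARNING"
    return "OK"

print(aggregate_state(["OK", "OK", "CRITICAL"]))  # 1 of 3 affected -> 33% -> CRITICAL
```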
□ Alert Name: ZooKeeper Server Process
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type : PORT
Description : This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and listening on the
              network for the configured critical threshold, given in seconds.
Potential Causes : The ZooKeeper server process is down on the host.
              The ZooKeeper server process is up and running but not listening on the correct network port (default 2181).
Possible Remedies : Check for any errors in the ZooKeeper logs (/var/log/hadoop/zookeeper.log) and restart the ZooKeeper process
              using Ambari Web.
              Run the netstat -tulpn command to check if the ZooKeeper server process is bound to the correct network port.
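Beyond a plain port check, ZooKeeper itself answers the "ruok" four-letter command with "imok" when healthy, which gives a slightly stronger liveness probe. A minimal sketch (note that on ZooKeeper 3.5 and later, "ruok" must be allowed via the 4lw.commands.whitelist setting):

```python
import socket

def zk_ruok(host: str, port: int = 2181, timeout: float = 5.0) -> bool:
    """Send ZooKeeper's 'ruok' four-letter command; a healthy server
    replies 'imok'. Returns False on any connection error."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            return s.recv(4) == b"imok"
    except OSError:
        return False
```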
8.5.10 Ambari Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert Name: Host Disk Usage
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type : SCRIPT
Description : This host-level alert is triggered if the amount of disk space used on a host goes above specific thresholds
              (50% warning, 80% critical).
Potential Causes : The amount of free disk space left is low.
Possible Remedies : Check the host for disk space to free, or add more storage.
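A SCRIPT-type check like this one can be sketched with the standard library; the thresholds below mirror the defaults named in the description (50% WARNING, 80% CRITICAL):

```python
import shutil

def disk_usage_state(path: str = "/", warn_pct: float = 50.0, crit_pct: float = 80.0) -> str:
    """Classify disk usage of the filesystem containing `path` against
    the alert's default thresholds."""
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100
    if used_pct >= crit_pct:
        return "CRITICAL"
    if used_pct >= warn_pct:
        return "WARNING"
    return "OK"

print(disk_usage_state("/"))
```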
□ Alert Name: Ambari Agent Heartbeat
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type : SERVER
Description : This alert is triggered if the server has lost contact with an agent.
Potential Causes : The Ambari Server host is unreachable from the Agent host.
              The Ambari Agent is not running.
Possible Remedies : Check the connection from the Agent host to the Ambari Server.
              Check that the Agent is running.
□ Alert Name: Ambari Server Alerts
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type : SERVER
Description : This alert is triggered if the server detects that there are alerts which have not run in a timely manner.
Potential Causes : Agents are not reporting alert status.
              Agents are not running.
Possible Remedies : Check that all Agents are running and heartbeating.
8.5.11 Ambari Metrics Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert Name: Metrics Collector Process
-------------------------------------------------------------------------------------------------------------------------------------
Description : This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port
              for the number of seconds equal to the configured threshold.
Potential Causes : The Metrics Collector process is not running.
Possible Remedies : Check that the Metrics Collector is running.
□ Alert Name: Metrics Collector – ZooKeeper Server Process
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type :
Description : This host-level alert is triggered if the Metrics Collector ZooKeeper Server Process cannot be determined to be up
              and listening on the network.
Potential Causes : The Metrics Collector process is not running.
Possible Remedies : Check that the Metrics Collector is running.
□ Alert Name: Metrics Collector – HBase Master Process
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type :
Description : This alert is triggered if the Metrics Collector HBase Master Processes cannot be confirmed to be up and listening
              on the network for the configured critical threshold, given in seconds.
Potential Causes : The Metrics Collector process is not running.
Possible Remedies : Check that the Metrics Collector is running.
□ Alert Name: Metrics Collector – HBase Master CPU Utilization
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type :
Description : This host-level alert is triggered if CPU utilization of the Metrics Collector exceeds certain thresholds.
Potential Causes : Unusually high CPU utilization is generally a sign of an issue in the daemon configuration.
Possible Remedies : Tune the Ambari Metrics Collector.
□ Alert Name: Metrics Monitor Status
-------------------------------------------------------------------------------------------------------------------------------------
Alert Type :
Description : This host-level alert is triggered if the Metrics Monitor process cannot be confirmed to be up and running.
Potential Causes : The Metrics Monitor is down.
Possible Remedies : Check whether the Metrics Monitor is running on the given host.
□ Alert Name: Percent Metrics Monitors Available
-------------------------------------------------------------------------------------------------------------------------------------
Description : This is an AGGREGATE alert of the Metrics Monitor Status.
Potential Causes : Metrics Monitors are down.
Possible Remedies : Check that the Metrics Monitors are running.
□ Alert Name: Metrics Collector – Auto-Restart Status
-------------------------------------------------------------------------------------------------------------------------------------
Description : This alert is triggered if the Metrics Collector has been auto-started a number of times equal to the start threshold
              within a 1-hour timeframe. By default, if it is restarted 2 times in an hour, you will receive a WARNING alert; if it
              is restarted 4 or more times in an hour, you will receive a CRITICAL alert.
Potential Causes : The Metrics Collector is running but is unstable and causing restarts. This could be due to improper tuning.
Possible Remedies : Tune the Ambari Metrics Collector.
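The restart-count logic described above can be sketched as a count over a sliding 1-hour window; `restart_times` is a hypothetical list of restart timestamps in epoch seconds, and the thresholds mirror the stated defaults (2 restarts WARNING, 4 or more CRITICAL):

```python
import time

def restart_state(restart_times, warn_count=2, crit_count=4, window_s=3600, now=None):
    """Count auto-restarts inside a sliding 1-hour window and map the
    count onto the alert's default thresholds."""
    now = time.time() if now is None else now
    recent = [t for t in restart_times if now - t <= window_s]
    if len(recent) >= crit_count:
        return "CRITICAL"
    if len(recent) >= warn_count:
        return "WARNING"
    return "OK"
```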
□ Alert Name: Grafana Web UI
-------------------------------------------------------------------------------------------------------------------------------------
Description : This host-level alert is triggered if the AMS Grafana Web UI is unreachable.
Potential Causes : The Grafana process is not running.
Possible Remedies : Check whether the Grafana process is running. Restart it if it has gone down.
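A WEB-type probe like this one follows the response-code rules from section 8.1.1 (OK below 400, WARNING at 400 and above, CRITICAL when unreachable). A minimal sketch; the commented-out Grafana URL is a placeholder for your cluster's actual host and port (Grafana defaults to 3000):

```python
import urllib.error
import urllib.request

def web_alert_state(url: str, timeout: float = 5.0) -> str:
    """WEB-type check: OK for response codes below 400, WARNING for
    codes of 400 or above, CRITICAL when the URL cannot be reached."""
    try:
        code = urllib.request.urlopen(url, timeout=timeout).getcode()
    except urllib.error.HTTPError as e:
        code = e.code
    except (urllib.error.URLError, OSError):
        return "CRITICAL"
    return "OK" if code < 400 else "WARNING"

# Placeholder; substitute your Grafana address:
# print(web_alert_state("http://grafana.example.com:3000/"))
```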
8.5.12 SmartSense Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
□ Alert Name: SmartSense Server Process
-------------------------------------------------------------------------------------------------------------------------------------
Description : This alert is triggered if the HST server process cannot be confirmed to be up and listening on the network for the
              configured critical threshold, given in seconds.
Potential Causes : The HST server is not running.
Possible Remedies : Start the HST server process. If startup fails, check hst-server.log.
□ Alert Name: SmartSense Bundle Capture Failure
-------------------------------------------------------------------------------------------------------------------------------------
Description : This alert is triggered if the last triggered SmartSense bundle failed or timed out.
Potential Causes : Some nodes time out during capture, or fail during data capture. It could also be that the upload to Hortonworks
              fails.
Possible Remedies : From the "Bundles" page, check the status of the bundle. Next, check which agents have failed or timed out, and
              review their logs. You can also initiate a new capture.
□ Alert Name: SmartSense Long Running Bundle
-------------------------------------------------------------------------------------------------------------------------------------
Description : This alert is triggered if an in-progress SmartSense bundle may not complete successfully on time.
Potential Causes : Service components being collected may not be running, or some agents may be timing out during data
              collection/upload.
Possible Remedies : Restart the services that are not running. Force-complete the bundle and start a new capture.
□ Alert Name: SmartSense Gateway Status
-------------------------------------------------------------------------------------------------------------------------------------
Description : This alert is triggered if the SmartSense Gateway server process is enabled but unreachable.
Potential Causes : The SmartSense Gateway is not running.
Possible Remedies : Start the gateway. If the gateway fails to start, review hst-gateway.log.
————————————————
Copyright notice: This article is an original work by CSDN blogger "devalone", licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/devalone/article/details/80826036
