Understanding OpenStack Swift (3): Monitoring and Some Factors That Affect Performance [Monitoring and Performance]

Reposted article, 2015-11-17 10:16. Tags: Monitoring / Swift / Performance

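This series of articles studies OpenStack Swift in depth, covering environment setup, principles, architecture, monitoring, and performance.

(1) Installing and configuring OpenStack + a three-node Swift cluster + HAProxy + UCARP

(2) Principles, architecture, and performance

(3) Monitoring

Monitoring a Swift cluster is essential, especially when the cluster is large.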

1. Monitoring targets

The main monitoring targets are:

Hardware failures
Operating system failures
Swift cluster health
Swift cluster status

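2. Tools provided by Swift

2.1 The audit tools that ship with Swift

2.1.1 Disk monitoring tool: swift-drive-audit

This tool parses /var/log/kern.log and uses predefined regular expressions to detect disk errors reported by the kernel. It is usually run periodically from cron and reads a configuration file such as /etc/swift/drive-audit.conf. If the script finds a problem with a disk, it automatically unmounts that disk and comments it out in /etc/fstab. The backend replication process then creates a new replica by copying from the other replicas.

To use it, first create a configuration file named driver-audit.conf, for example: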

[drive-audit] 
device_dir = /srv/node 
log_facility = LOG_LOCAL0 
log_level = INFO 
minutes = 60 
error_limit = 1 
log_file_pattern = /var/log/kern* 
regex_pattern_1 = \berror\b.*\b(dm-[0-9]{1,2}\d?)\b

Then run the tool with swift-drive-audit driver-audit.conf, and the actions described above take effect.
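
The matching step can be illustrated with a minimal Python sketch. It applies the regex_pattern_1 value from the configuration above to a few kernel log lines and counts hits per device. The log lines are made up for illustration, the function names are hypothetical, and comparing the count against error_limit with >= is an assumption; the real tool also restricts itself to the last `minutes` of the log, which is left out here.

```python
import re
from collections import Counter

# Pattern copied from regex_pattern_1 in the configuration above.
ERROR_PATTERN = re.compile(r"\berror\b.*\b(dm-[0-9]{1,2}\d?)\b")


def count_device_errors(log_lines):
    """Count kernel error lines per device name captured by the pattern."""
    errors = Counter()
    for line in log_lines:
        match = ERROR_PATTERN.search(line)
        if match:
            errors[match.group(1)] += 1
    return errors


def devices_over_limit(errors, error_limit=1):
    """Return devices whose error count reaches error_limit (assumed >=)."""
    return sorted(dev for dev, count in errors.items() if count >= error_limit)


# Hypothetical kernel log lines, made up for illustration.
sample_log = [
    "Nov 14 16:00:01 swift1 kernel: end_request: I/O error, dev dm-3, sector 100",
    "Nov 14 16:00:02 swift1 kernel: EXT4-fs (dm-1): mounted filesystem",
    "Nov 14 16:00:05 swift1 kernel: Buffer I/O error on device dm-3, logical block 7",
]
errors = count_device_errors(sample_log)
print(devices_over_limit(errors, error_limit=1))
```

Only the devices returned by devices_over_limit would be candidates for the unmount step described above.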

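2.1.2 The account, container, and object auditors

What they do:

swift-account-auditor opens the SQLite databases of an account server, runs SQL queries to make sure each database is valid, and reports whether an account has missing replicas or incorrect objects.
swift-container-auditor does the same for containers.
swift-object-auditor opens every object on an object server and makes sure that its metadata is correct and that its size and MD5 match.

Each auditor must be configured in the [account-auditor], [container-auditor], or [object-auditor] section of its service's configuration file. The auditors then run periodically and write their results to the log.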
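
The object check can be pictured with a small Python sketch, which is not the auditor's actual code. It assumes the recorded size and MD5 are available in a plain dict under the keys Content-Length and ETag; the function name is hypothetical, and the real auditor reads this metadata from disk and quarantines objects that fail.

```python
import hashlib


def audit_object(data, metadata):
    """Return a list of problems found; an empty list means the object passes.

    `metadata` is a hypothetical dict holding the recorded 'Content-Length'
    and 'ETag' (hex MD5) of the object.
    """
    problems = []
    if len(data) != int(metadata["Content-Length"]):
        problems.append("size mismatch")
    if hashlib.md5(data).hexdigest() != metadata["ETag"]:
        problems.append("MD5 mismatch")
    return problems


body = b"hello swift"
good_meta = {
    "Content-Length": str(len(body)),
    "ETag": hashlib.md5(body).hexdigest(),
}
print(audit_object(body, good_meta))         # intact copy
print(audit_object(body + b"!", good_meta))  # corrupted copy
```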

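2.2 Cluster health tools: swift-dispersion-populate and swift-dispersion-report

These two tools are used to obtain and report the overall health of the cluster. They need access to the Swift cluster and to the ring files, and they need their own configuration file, which is usually placed in /etc/swift on the proxy server.

(1) Create the configuration file /etc/swift/dispersion.conf: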

[dispersion]

auth_url = http://controller:35357/v3
auth_user = service:swift
auth_version = 3
auth_key = 1111
swift_dir = /etc/swift
concurrency = 25
retries = 5

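See the Ubuntu documentation for a detailed description of this configuration file. Note that in a Kilo-release Swift environment the auth_version option must be added.

(2) Run swift-dispersion-populate to create the sample containers and objects that the health report is later based on. It uses the same configuration file as swift-dispersion-report. See the Ubuntu documentation for details.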

root@swift1:/etc/swift# swift-dispersion-populate dispersion.conf
Created 10 containers for dispersion reporting, 0s, 0 retries
Created 10 objects for dispersion reporting, 0s, 0 retries

(3) Run swift-dispersion-report to get the monitoring report. See the Ubuntu documentation for details.

root@swift1:/etc/swift# swift-dispersion-report
Queried 11 containers for dispersion reporting, 0s, 0 retries
100.00% of container copies found (33 of 33)
Sample represents 1.07% of the container partition space
Queried 10 objects for dispersion reporting, 0s, 0 retries
There were 10 partitions missing 0 copy.
100.00% of object copies found (30 of 30)
Sample represents 0.98% of the object partition space
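
For automated monitoring, the "copies found" lines of this report can be parsed and checked. The following Python sketch is one possible way to do that, using the report text shown above as input; the function names are hypothetical and the regular expression only covers the line format seen in this output.

```python
import re

# Matches lines such as "100.00% of object copies found (30 of 30)".
COPIES_LINE = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)% of (?P<kind>container|object) copies found "
    r"\((?P<found>\d+) of (?P<total>\d+)\)"
)


def parse_dispersion_report(text):
    """Return {kind: (found, total, percent)} from swift-dispersion-report text."""
    result = {}
    for match in COPIES_LINE.finditer(text):
        result[match.group("kind")] = (
            int(match.group("found")),
            int(match.group("total")),
            float(match.group("pct")),
        )
    return result


def missing_copies(parsed):
    """Return the kinds for which at least one copy was not found."""
    return [kind for kind, (found, total, _) in parsed.items() if found < total]


report = """\
Queried 11 containers for dispersion reporting, 0s, 0 retries
100.00% of container copies found (33 of 33)
Sample represents 1.07% of the container partition space
Queried 10 objects for dispersion reporting, 0s, 0 retries
There were 10 partitions missing 0 copy.
100.00% of object copies found (30 of 30)
Sample represents 0.98% of the object partition space
"""
parsed = parse_dispersion_report(report)
print(parsed)
print(missing_copies(parsed))
```

A cron job could raise an alert whenever missing_copies returns a non-empty list.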

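2.3 swift-recon: middleware for gathering performance data

2.3.1 What it is

Swift Recon is middleware installed in the object server pipeline. It has one required option: a local cache directory. It can report: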

How many unmounted (failed) drives there are in the cluster, and on which servers those are located
How many async pendings are present
Drive usage and balance
Load Average (for easy access later on*)
Memory Usage (for easy access later on*)
Checking ring md5sum’s
Logged replication stats
Connection stats (tbd)
Quarantine Statistics (a new pending addition)

To use it, first edit the object-server configuration file and add the middleware:

[pipeline:main]
pipeline = healthcheck recon object-server

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift
recon_lock_path = /var/lock

2.3.2 How to use the command-line tool

swift-recon   <server_type>  [-v] [--suppress] [-a] [-r] [-u] [-d] [-l]  [--md5] [--auditor] [--updater] [--expirer] [--sockstat]

Options:

-a, --async: Get async stats
--auditor: Get auditor stats
--updater: Get updater stats
--expirer: Get expirer stats
-r, --replication: Get replication stats
-u, --unmounted: Check cluster for unmounted devices
-d, --diskusage: Get disk usage stats
-l, --loadstats: Get cluster load average stats
-q, --quarantined: Get cluster quarantine stats
--md5: Get md5sum of servers ring and compare to local copy
--all: Perform all checks. Equivalent to -arudlq --md5
-z ZONE, --zone=ZONE: Only query servers in specified zone
--swiftdir=PATH: Default = /etc/swift

(1) Get the status of the updater, auditor, and expirer across all zones

root@swift1:/etc/swift# swift-recon --auditor --updater --expirer
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-11-14 16:52:00] Checking auditor stats
[ALL_audit_time_last_path] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
[ALL_quarantined_last_path] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
[ALL_errors_last_path] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
[ALL_passes_last_path] low: 1, high: 1, avg: 1.0, total: 3, Failed: 0.0%, no_result: 0, reported: 3
[ALL_bytes_processed_last_path] low: 5, high: 5, avg: 5.0, total: 15, Failed: 0.0%, no_result: 0, reported: 3
[ZBF_audit_time_last_path] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
[ZBF_quarantined_last_path] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
[ZBF_errors_last_path] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
[ZBF_bytes_processed_last_path] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
===============================================================================
[2015-11-14 16:52:00] Checking updater times
[updater_last_sweep] low: 0, high: 0, avg: 0.1, total: 0, Failed: 0.0%, no_result: 0, reported: 3
===============================================================================
[2015-11-14 16:52:00] Checking on expirers
[object_expiration_pass] - No hosts returned valid data.
[expired_last_pass] - No hosts returned valid data.
===============================================================================

(2) Disk usage

root@swift1:/etc/swift# swift-recon -d
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-11-14 16:53:38] Checking disk usage now
Distribution Graph:
  2%    2 **********************************************
  3%    3 *********************************************************************
  4%    1 ***********************
Disk usage: space used: 5343854592 of 160982630400
Disk usage: space free: 155638775808 of 160982630400
Disk usage: lowest: 2.59%, highest: 4.56%, avg: 3.31952247191%
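
The "lowest / highest / avg" line can be reproduced from per-device figures. The Python sketch below assumes each device is described by a dict with used, size, and mounted keys, and that the average is the mean of the per-device percentages; both the input shape and the averaging rule are assumptions made for illustration, and the numbers are made up.

```python
def summarize_disk_usage(devices):
    """Summarize usage over mounted devices.

    `devices` is a list of dicts with 'used', 'size' and 'mounted' keys;
    this shape is an assumption made for the sketch.
    """
    mounted = [d for d in devices if d["mounted"]]
    percents = [100.0 * d["used"] / d["size"] for d in mounted]
    return {
        "used": sum(d["used"] for d in mounted),
        "size": sum(d["size"] for d in mounted),
        "lowest": round(min(percents), 2),
        "highest": round(max(percents), 2),
        "avg": sum(percents) / len(percents),
    }


# Made-up numbers: three equally sized mounted devices and one unmounted one.
sample = [
    {"device": "sdb1", "used": 20, "size": 1000, "mounted": True},
    {"device": "sdc1", "used": 30, "size": 1000, "mounted": True},
    {"device": "sdd1", "used": 40, "size": 1000, "mounted": True},
    {"device": "sde1", "used": 0, "size": 0, "mounted": False},
]
summary = summarize_disk_usage(sample)
print(summary)
```

A wide gap between lowest and highest is the signal to look for, since it suggests the ring is not balancing data evenly across drives.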

(3) System load

root@swift1:/etc/swift# swift-recon -l
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-11-14 16:54:32] Checking load averages
[5m_load_avg] low: 0, high: 0, avg: 0.1, total: 0, Failed: 0.0%, no_result: 0, reported: 3
[15m_load_avg] low: 0, high: 0, avg: 0.1, total: 0, Failed: 0.0%, no_result: 0, reported: 3
[1m_load_avg] low: 0, high: 0, avg: 0.1, total: 0, Failed: 0.0%, no_result: 0, reported: 3

(4) Check whether the ring files and swift.conf are identical on all nodes

root@swift1:/etc/swift# swift-recon --md5
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-11-14 17:08:38] Checking ring md5sums
3/3 hosts matched, 0 error[s] while checking hosts.
===============================================================================
[2015-11-14 17:08:38] Checking swift.conf md5sum
3/3 hosts matched, 0 error[s] while checking hosts

2.3.3 Using the REST API

List of URLs and their functions:
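
The recon middleware answers HTTP GET requests under the /recon/ path of the storage node, for example /recon/load or /recon/diskusage, and returns JSON. A minimal Python sketch of a client follows; the host name, the port 6000, and the function names are assumptions for illustration, so adjust them to the bind_port of your own object server.

```python
import json
from urllib.request import urlopen


def recon_url(host, port, check):
    """Build the URL for one recon check, e.g. 'load' or 'diskusage'."""
    return "http://{}:{}/recon/{}".format(host, port, check)


def fetch_recon(host, port, check, opener=urlopen, timeout=5):
    """Fetch one recon check and decode the JSON body.

    `opener` is injectable so the function can be exercised without a cluster.
    """
    with opener(recon_url(host, port, check), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical storage node address; nothing is contacted here.
print(recon_url("swift1", 6000, "load"))
```

Calling fetch_recon("swift1", 6000, "load") against a live node would return the decoded JSON for that check.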

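3. Other monitoring tools

The monitoring tools that ship with Swift are fairly rich in functionality but not very convenient to use, so many commercial and open-source monitoring tools have appeared. These tools either implement a new middleware, as swift-recon does, or call the interfaces provided by Swift's own tools. The text of parts (1) to (4) below is quoted from the article OpenStack Object Storage Monitoring: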

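(1) Swift-Informant

The Swift-Informant middleware, developed by Florian Hines, gives real-time visibility into Object Storage client requests. It sits in the proxy server pipeline and, after each request to the proxy server, sends three kinds of metrics to a StatsD server:

A counter increment for a metric such as obj.GET.200 or cont.PUT.404
The time taken to process the request
The amount of data transferred

This gives a feel for the quality of service that clients are experiencing, as well as for the volume of each combination of server type, command, and response code. Swift-Informant also requires no change to core Swift code, because it is implemented as middleware. For the same reason, however, it gives no insight into what happens behind the proxy server. If the responsiveness of one storage node degrades, you see only that some requests go bad, with either high latency or error status codes. You cannot tell exactly why, or where the request was trying to go. Perhaps the container server in question was on a healthy node while the object server was on a different, poorly performing node.
We therefore need deeper visibility behind the proxy server, into the operation of the cluster itself.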
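
To make the three metrics concrete, here is a minimal Python sketch that builds them in the StatsD line format (name:value|type). It is not Swift-Informant's code: the function name is hypothetical, and reusing the counter name for the timer and prefixing the transfer counter with "tfer." are assumptions made for illustration.

```python
def informant_metrics(server_type, method, status, duration_ms, bytes_transferred):
    """Build three StatsD lines for one proxy request."""
    name = "{}.{}.{}".format(server_type, method, status)
    return [
        "{}:1|c".format(name),                           # counter increment
        "{}:{}|ms".format(name, duration_ms),            # request duration
        "tfer.{}:{}|c".format(name, bytes_transferred),  # bytes transferred
    ]


for line in informant_metrics("obj", "GET", 200, 12, 4096):
    print(line)
```

Each line would be sent to the StatsD server as one UDP datagram.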

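(2) Statsdlog
Florian's statsdlog project increments StatsD counters based on logged events. Like Swift-Informant, it is non-intrusive, but statsdlog can track events from all of the Swift daemons, not just the proxy server. The daemon listens to a UDP stream of syslog messages, and a StatsD counter is incremented whenever a log line matches a regular expression. Metric names are mapped to regex patterns in a JSON file, which allows flexible configuration of which metrics are extracted from the log stream.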
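
The pattern-to-counter idea can be sketched in a few lines of Python. This is not statsdlog's code: the JSON layout, the metric names, and the log lines are hypothetical, and the sketch counts in memory instead of sending to StatsD.

```python
import json
import re
from collections import Counter

# Hypothetical mapping of metric names to regex patterns, as it might
# appear in the JSON file.
PATTERNS_JSON = """
{
    "object.errors.timeout": "ERROR .* Timeout",
    "object.replication.failure": "replication .* failed"
}
"""


def load_patterns(text):
    """Compile the metric-name -> regex mapping from JSON text."""
    return {name: re.compile(rx) for name, rx in json.loads(text).items()}


def count_matches(log_lines, patterns):
    """Increment one counter for every (line, pattern) match."""
    counters = Counter()
    for line in log_lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                counters[name] += 1
    return counters


# Made-up syslog lines for illustration.
sample_lines = [
    "object-server: ERROR with Timeout talking to 10.0.0.2",
    "object-replicator: replication of sdb1 failed",
    "proxy-server: GET /v1/AUTH_test 200",
    "container-server: ERROR __call__ Timeout (10s)",
]
counters = count_matches(sample_lines, load_patterns(PATTERNS_JSON))
print(dict(counters))
```

Because the mapping lives in data rather than code, a new metric needs only a new JSON entry.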

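(3) Swift StatsD Logging
StatsD was designed for application code to be deeply instrumented: metrics are sent in real time by the code that has just noticed or done something. The overhead of sending a metric is very low: a single sendto of one UDP packet. If even that overhead is too high, the StatsD client library can send only a random fraction of the samples, and StatsD estimates the actual numbers when flushing metrics upstream. To avoid the drawbacks of middleware-based monitoring and after-the-fact log processing, the sending of StatsD metrics has been integrated into Swift itself. The change set submitted so far reports 124 metrics across 15 Swift daemons and the tempauth middleware. Details of the metrics can be found at https://review.openstack.org/#patch,sidebyside,6058,2,doc/source/admin_guide.rst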
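
The sampling idea can be sketched as follows. This is an illustrative Python client, not Swift's own: a metric is sent only with probability sample_rate, and the "|@rate" suffix of the StatsD line format tells the server how to scale the value back up. The function name, the metric name, and the RecordingSocket stand-in are hypothetical; in real use a socket created with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) would be passed in.

```python
import random


def send_sampled(sock, address, name, value, kind, sample_rate=1.0, rng=random.random):
    """Send one StatsD metric over UDP, keeping only a random share of samples.

    Returns True if a packet was sent, False if the sample was dropped.
    """
    if sample_rate < 1.0 and rng() >= sample_rate:
        return False
    payload = "{}:{}|{}".format(name, value, kind)
    if sample_rate < 1.0:
        payload += "|@{}".format(sample_rate)
    sock.sendto(payload.encode("utf-8"), address)
    return True


class RecordingSocket:
    """Stand-in for a UDP socket that records what would be sent."""

    def __init__(self):
        self.sent = []

    def sendto(self, data, address):
        self.sent.append((data, address))


demo = RecordingSocket()
send_sampled(demo, ("127.0.0.1", 8125), "object-server.PUT.timing", 12, "ms")
print(demo.sent)
```

With sample_rate=0.1, roughly one call in ten produces a packet, which keeps the per-request cost low on busy nodes.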

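(4) Summary

We believe the best way to monitor a Swift cluster is to combine a general-purpose server monitoring system, a mechanism for polling Swift-specific gauge metrics, and deep StatsD instrumentation of Swift for its internal counter and timing metrics. For polling Swift-specific gauge metrics, a general-purpose collector plugin works best. This plugin can either read data from swift-recon or collect the information directly itself. At SwiftStack, we use collectd together with some Python plugin code for server monitoring. We also embed a StatsD server in collectd, so that each node has a single process pushing stats data to a Graphite (http://graphite.wikidot.com/) cluster. With this setup, we have full coverage of all the concerns described above: general purpose monitoring, Swift-specific gauge monitoring, and real-time counter and timing data directly from Swift. Beyond graphing, you can also implement anomaly detection, trigger alerts, maintain a real-time view of the health of each entity, and avoid unpleasant surprises.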

(5) Screenshots of the SwiftStack monitoring tools:

(6) Another example monitoring environment

(7) Benchmarking tools

Intel has open-sourced COSBench, a benchmarking tool for object storage: https://github.com/intel-cloud/cosbench

4. Some factors that affect performance

Note: the following content is taken from "Leveraging open source tools to gain insight into OpenStack Swift" by Michael Factor and Dmitry Sotnikov (dmitrys@il.ibm.com). Their test environment:

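4.1 Number of containers and Swift version

(PUT operations)

Both the Swift version and the number of containers make a large difference in performance. Swift 2.2 or later and multiple containers are recommended.

4.2 Number of client workers

Performance does not increase linearly as more client workers are added.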

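4.3 Number of objects in a container

4.4 Frontend and backend network bandwidth

This shows that the bandwidth load on the backend network is at least 3 times that on the frontend network.

4.5 Relationship between storage-node disk I/O and frontend network I/O (12x)

4.6 Object size

This shows that Swift is not well suited to handling large numbers of small files.

4.7 IOPS does not scale linearly with the number of storage nodes

In other words, in this environment, once there are 7 storage nodes they are no longer the IOPS bottleneck.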

References:

http://www.cnblogs.com/Clisa/p/3461701.html
https://swiftstack.com/blog/2012/04/11/swift-monitoring-with-statsd/
https://platform.swiftstack.com/docs/admin/monitoring/cluster-monitoring.html
http://blog.chmouel.com/2012/02/01/audit-a-swift-cluster/
