Percona has started experimenting with distributed MySQL clusters built on top of Ceph, using the snapshot, backup and HA features Ceph provides to solve the underlying storage problem of distributed databases.


This article was translated by QiYu of the Ceph China Community.

English source: Using Ceph with MySQL


Over the last year, the Ceph world drew me in. Partly because of my taste for distributed systems, but also because I think Ceph represents a great opportunity for MySQL specifically and databases in general. The shift from local storage to distributed storage is similar to the shift from bare-disk host configurations to LVM-managed disk configurations.


Most of the work I’ve done with Ceph was in collaboration with folks from Red Hat (mainly Brent Compton and Kyle Bader). This work resulted in a number of talks presented at the Percona Live conference in April and the Red Hat Summit San Francisco at the end of June. I could write a lot about using Ceph with databases, and I hope this post is the first in a long series on Ceph. Before starting with use cases, setup configurations and performance benchmarks, I think I should quickly review the architecture and principles behind Ceph.


Introduction to Ceph
Inktank created Ceph a few years ago as a spin-off of the hosting company DreamHost. Red Hat acquired Inktank in 2014 and now offers it as a storage solution. OpenStack uses Ceph as its dominant storage backend. This blog, however, focuses on a more general review and isn’t restricted to a virtual environment.


A simplistic way of describing Ceph is to say it is an object store, just like S3 or Swift. This is a true statement, but only up to a certain point. There are, at a minimum, two types of nodes in a Ceph cluster: monitors and object storage daemons (OSDs). The monitor nodes are responsible for maintaining a map of the cluster or, if you prefer, the Ceph cluster metadata. Without access to the information provided by the monitor nodes, the cluster is useless. Redundancy and quorum at the monitor level are important.
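To make the monitors’ role concrete, here is a minimal sketch (not part of the original post) using the python-rados bindings: the client only needs the monitor addresses and a keyring, read here from /etc/ceph/ceph.conf, to obtain the cluster map and some basic statistics. The configuration path is an assumption and the exact statistics keys can vary between Ceph releases.

```python
# Minimal sketch: a librados client bootstraps itself from the monitors.
# Assumes the python3-rados package is installed and /etc/ceph/ceph.conf lists the monitors.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # reads monitor addresses and keyring
cluster.connect()                                      # contacts the monitors, fetches the cluster map
try:
    stats = cluster.get_cluster_stats()                # cluster-wide totals as reported by the monitors
    print("raw capacity (kB):", stats['kb'], "used (kB):", stats['kb_used'])
finally:
    cluster.shutdown()
```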


Any non-trivial Ceph setup has at least three monitors. The monitors are fairly lightweight processes and can be co-hosted on OSD nodes (the other node type needed in a minimal setup). The OSD nodes store the data on disk, and a single physical server can host many OSD nodes – though it would make little sense for it to host more than one monitor node. The OSD nodes are listed in the cluster metadata (the “crushmap”) in a hierarchy that can span data centers, racks, servers, etc. It is also possible to organize the OSDs by disk types to store some objects on SSD disks and other objects on rotating disks.
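To see that hierarchy, the sketch below asks the monitors for the OSD tree through the generic mon_command interface of python-rados; the JSON command mirrors the ceph osd tree CLI, and the fields printed are only the ones I would expect in recent releases.

```python
# Sketch: dump the CRUSH hierarchy (roots, racks, hosts, OSDs) as the monitors publish it.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({"prefix": "osd tree", "format": "json"}), b'')
    tree = json.loads(outbuf)
    for node in tree.get("nodes", []):
        print(node.get("type"), node.get("name"), node.get("status", ""))
finally:
    cluster.shutdown()
```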


With the information provided by the monitors’ crushmap, any client can access data based on a predetermined hash algorithm. There’s no need for a relaying proxy. This becomes a big scalability factor since these proxies can be performance bottlenecks. Architecture-wise, it is somewhat similar to the NDB API, where – given a cluster map provided by the NDB management node – clients can directly access the data on data nodes.
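To illustrate the absence of a relaying proxy, here is a hedged librados sketch: once connected, the client computes the placement itself and writes straight to the OSDs. The pool name 'test' and the object name are made up for the example.

```python
# Sketch: the client places objects via CRUSH and talks to the OSDs directly; no proxy in the data path.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('test')                      # assumes a pool named 'test' exists
try:
    ioctx.write_full('hello_object', b'hello from librados')
    print(ioctx.read('hello_object'))
finally:
    ioctx.close()
    cluster.shutdown()
```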


Ceph stores data in a logical container called a pool. With the pool definition comes a number of placement groups. The placement groups are shards of data across the pool. For example, on a four-node Ceph cluster, if a pool is defined with 256 placement groups (pg), then each OSD will have 64 pgs for that pool. You can view the pgs as a level of indirection to smooth out the data distribution across the nodes. At the pool level, you define the replication factor (“size” in Ceph terminology).
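The arithmetic behind that example is easy to reproduce. The helper below is only a sketch of the rule of thumb: it ignores uneven CRUSH weights, and the size parameter anticipates the replication factor discussed next, since every extra replica is one more pg copy to place.

```python
# Sketch: rough number of placement-group copies each OSD holds for one pool.
def pgs_per_osd(pg_num: int, num_osds: int, size: int = 1) -> float:
    """pg_num: placement groups in the pool, num_osds: OSDs in the cluster,
    size: replication factor of the pool."""
    return pg_num * size / num_osds

print(pgs_per_osd(256, 4))          # 64.0  -> the four-node example above
print(pgs_per_osd(256, 4, size=3))  # 192.0 -> with three replicas, each OSD carries more pg copies
```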


The recommended values are a replication factor of three for spinners and two for SSD/Flash. I often use a size of one for ephemeral test VM images. A replication factor greater than one associates each pg with one or more pgs on the other OSD nodes. As the data is modified, it is replicated synchronously to the other associated pgs so that the data it contains is still available in case an OSD node crashes.


So far, I have just discussed the basics of an object store. But the ability to update objects atomically in place makes Ceph different and better (in my opinion) than other object stores. The underlying object access protocol, rados, updates an arbitrary number of bytes in an object at an arbitrary offset, exactly as if it were a regular file. That update capability allows for much fancier usage of the object store – for things like the support of block devices (rbd devices) and even a network file system, cephfs.
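That in-place update capability is easy to demonstrate with python-rados; the pool and object names below are assumptions for the sketch.

```python
# Sketch: overwrite a few bytes at an arbitrary offset inside an existing object,
# much like writing into the middle of a regular file.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('test')                      # assumed pool name
try:
    ioctx.write_full('datafile', b'0' * 4096)           # create a 4 KB object
    ioctx.write('datafile', b'MYSQL', offset=1024)      # update 5 bytes in place at offset 1024
    print(ioctx.read('datafile', length=16, offset=1020))
finally:
    ioctx.close()
    cluster.shutdown()
```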


When using MySQL on Ceph, the rbd disk block device feature is extremely interesting. A Ceph rbd disk is basically the concatenation of a series of objects (4MB objects by default) that are presented as a block device by the Linux kernel rbd module. Functionally it is pretty similar to an iSCSI device as it can be mounted on any host that has access to the storage network and it is dependent upon the performance of the network.
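To show what such a device looks like from the API side, here is a hedged python-rbd sketch that creates an image striped over 4MB objects (order 22, the default object size). The pool and image names are invented; on a database host you would then map the image with the kernel rbd module and put the MySQL datadir on it.

```python
# Sketch: create an RBD image backed by 4 MB objects (2**22 bytes).
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                       # assumed pool for block devices
try:
    rbd.RBD().create(ioctx, 'mysql-data', 10 * 1024**3, order=22)  # 10 GiB image
finally:
    ioctx.close()
    cluster.shutdown()
```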


The benefits of using Ceph

Agility

In a world striving for virtualization and containers, Ceph makes it easy to move database resources between hosts.


IO scalability

On a single host, you have access only to the IO capabilities of that host. With Ceph, you basically put in parallel all the IO capabilities of all the hosts. If each host can do 1000 iops, a four-node cluster could reach up to 4000 iops.


High availability

Ceph replicates data at the storage level, and provides resiliency to storage node crash. A kind of DRBD on steroids…


Backups

Ceph rbd block devices support snapshots, which are quick to make and have no performance impacts. Snapshots are an ideal way of performing MySQL backups.
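As an illustration only, here is what a snapshot step could look like with python-rbd. The pool, image and snapshot names are made up, and the comment about quiescing MySQL is my own addition on top of the original text, not something it prescribes.

```python
# Sketch: take a point-in-time snapshot of the RBD image holding the MySQL datadir.
import datetime
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                       # assumed pool
image = rbd.Image(ioctx, 'mysql-data')                  # assumed image name
try:
    snap_name = 'backup-' + datetime.datetime.utcnow().strftime('%Y%m%d%H%M%S')
    # For a consistent backup you would typically quiesce MySQL first
    # (e.g. FLUSH TABLES WITH READ LOCK) and release the lock right after the snapshot.
    image.create_snap(snap_name)
    print('created snapshot', snap_name)
finally:
    image.close()
    ioctx.close()
    cluster.shutdown()
```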


Thin provisioning

You can clone and mount Ceph snapshots as block devices. This is a useful feature to provision new database servers for replication, either with asynchronous replication or with Galera replication.
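A hedged sketch of that clone workflow with python-rbd: the snapshot must be protected before it can be cloned, and the clone is a thin, copy-on-write device that can then be mapped on the new replica host. All pool, image and snapshot names are assumptions, and the feature flag reflects the layering requirement for clones.

```python
# Sketch: provision a new MySQL server from a protected snapshot (thin, copy-on-write clone).
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                       # assumed pool
parent = rbd.Image(ioctx, 'mysql-data')                 # assumed parent image
try:
    parent.protect_snap('backup-20160701')              # assumed existing snapshot
    rbd.RBD().clone(ioctx, 'mysql-data', 'backup-20160701',
                    ioctx, 'mysql-replica-1',
                    features=rbd.RBD_FEATURE_LAYERING)  # clones require the layering feature
finally:
    parent.close()
    ioctx.close()
    cluster.shutdown()
```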


The caveats of using Ceph

Of course, nothing is free. Ceph use comes with some caveats.


Ceph reaction to a missing OSD

If an OSD goes down, the Ceph cluster starts copying data with fewer copies than specified. Although good for high availability, the copying process significantly impacts performance. This implies that you cannot run a Ceph cluster with nearly full storage: you must have enough free disk space to handle the loss of one node.

The “noout” OSD attribute mitigates this, and prevents Ceph from reacting automatically to a failure (but you are then on your own). When using the “noout” attribute, you must monitor and detect that you are running in degraded mode and take action. This resembles a failed disk in a RAID set. You can choose this behavior as the default with the mon_osd_auto_mark_auto_out_in setting.
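As an illustration of that workflow, the sketch below sets and clears the flag through the monitors with the generic mon_command interface of python-rados; the JSON mirrors the ceph osd set noout / ceph osd unset noout CLI, and error handling is omitted.

```python
# Sketch: toggle the 'noout' flag so a down OSD does not trigger automatic re-replication.
import json
import rados

def osd_flag(cluster, prefix, flag):
    """prefix: 'osd set' or 'osd unset'; flag: e.g. 'noout'."""
    return cluster.mon_command(json.dumps({"prefix": prefix, "key": flag}), b'')

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    print(osd_flag(cluster, 'osd set', 'noout'))    # before planned maintenance on an OSD node
    # ... replace the disk or reboot the node, then:
    print(osd_flag(cluster, 'osd unset', 'noout'))
finally:
    cluster.shutdown()
```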


Scrubbing

Every day and every week (deep), Ceph runs scrub operations that, although they are throttled, can still impact performance. You can modify the intervals and the hours that control the scrub action. Once per day and once per week are likely fine, but you need to set osd_scrub_begin_hour and osd_scrub_end_hour to restrict the scrubbing to off-peak hours. Also, scrubbing throttles itself so as not to put too much load on the nodes; the osd_scrub_load_threshold variable sets the threshold.
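One possible way to apply those scrub settings cluster-wide is shown below, using the centralized config set monitor command from python-rados. This assumes a release recent enough to support centralized configuration; on older clusters the same options would instead go in the [osd] section of ceph.conf, and the hour and load values here are only examples.

```python
# Sketch: push scrub scheduling options to all OSDs via the monitors.
import json
import rados

SCRUB_OPTIONS = {
    "osd_scrub_begin_hour": "22",       # only start scrubs after 22:00
    "osd_scrub_end_hour": "5",          # ...and stop scheduling them after 05:00
    "osd_scrub_load_threshold": "0.5",  # skip scrubbing when the node load is above this
}

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    for name, value in SCRUB_OPTIONS.items():
        cluster.mon_command(json.dumps(
            {"prefix": "config set", "who": "osd", "name": name, "value": value}), b'')
finally:
    cluster.shutdown()
```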


Tuning

Ceph has many parameters, so tuning it can be complex and confusing. Since distributed systems push the hardware, properly tuning Ceph might require things like distributing interrupt load among cores, pinning threads to cores and handling NUMA zones – especially if you use high-speed NVMe devices.


Conclusion

Hopefully, this post provided a good introduction to Ceph. I’ve discussed the architecture, the benefits and the caveats of Ceph. In future posts, I’ll present use cases with MySQL. These cases include performing Percona XtraDB Cluster SST operations using Ceph snapshots, provisioning async slaves and building HA setups. I also hope to provide guidelines on how to build and configure an efficient Ceph cluster.


Finally, a note for the ones who think cost and complexity put building a Ceph cluster out of reach. The picture below shows my home cluster (which I use quite heavily). The cluster comprises four ARM-based nodes (Odroid-XU4), each with a two TB portable USB-3 hard disk, a 16 GB EMMC flash disk and a gigabit Ethernet port.

I won’t claim record breaking performance (although it’s decent), but cost-wise it is pretty hard to beat (at around $600)!


