[Translation] A Brief Overview of Cassandra's Architecture


This translation is based mainly on the DataStax Cassandra 1.2 documentation (http://www.datastax.com/documentation/cassandra/1.2/index.html), with some additional material drawn from related official blog posts.

This translation is part of the study material for the Laud big-data group of the ISE lab and is intended for readers who already have some familiarity with Cassandra.

Please do not repost without the author's permission.


A brief introduction to Cassandra. (Below, Cassandra is sometimes abbreviated as C.)

Cassandra is designed to handle big-data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. Cassandra addresses the problem of failure through a peer-to-peer distributed system in which all nodes are identical and data is distributed among all nodes in the cluster. Each node exchanges information with the other nodes every second. A commit log on each node captures write activity to ensure data durability. Data is also written to an in-memory structure called a memtable and, once that structure is full, flushed to a disk file called an SSTable. All writes are automatically partitioned and replicated.

Cassandra is a row-oriented database. Its architecture allows any authorized user to connect to any node in any data center and access data using CQL. For ease of use, CQL uses a syntax similar to SQL. From the CQL point of view, a database consists of tables. Typically, a cluster has one keyspace per application. Developers can run CQL through cqlsh or through other drivers.
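For illustration only (the keyspace, table, and column names are hypothetical), typical CQL 3 statements issued through cqlsh might look like this:

CREATE KEYSPACE demo WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE demo.users (
    user_id text PRIMARY KEY,
    age int,
    car text,
    gender text
);

SELECT age, gender FROM demo.users WHERE user_id = 'jim';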

Client read or write requests can be sent to any node in the cluster. When a client connects to a node and issues a request, that node acts as the coordinator for that particular client operation. The coordinator serves as a proxy between the client application and the nodes that own the data being requested, and it determines which nodes in the ring should handle the request. (For more information, see the section on client requests below.)

Key components for configuring Cassandra:

  1. Gossip: A peer-to-peer communication protocol used to discover and share location and state information about the other nodes in the cluster.
    Gossip information is also persisted locally by each node so that it can be used immediately when the node restarts. You may occasionally want to clear a node's gossip history, for example when its IP address has changed. (Translator's note: this corresponds roughly to the system.local table.)
  2. Partitioner: A partitioner determines how data is distributed across the nodes. Choosing a partitioner determines which node stores the first replica of a piece of data.
    You must set the partitioner type and assign each node a num_tokens value. If you are not using virtual nodes, set initial_token instead. (Translator's note: virtual nodes are new in 1.2.)
  3. Replica placement strategy: Cassandra stores copies (replicas) of data on multiple nodes to ensure availability and fault tolerance. A replication strategy determines which nodes hold the replicas. The first copy of a row is simply the first replica; it is not unique in any sense. When you create a keyspace, you must specify the replica placement strategy and the number of replicas you want.
  4. Snitch: A snitch defines the topology information that replica placement and request routing rely on. You configure a snitch when you create a cluster. The snitch is responsible for knowing where nodes sit in your network topology and for distributing replicas by grouping machines into data centers and racks.
  5. cassandra.yaml: Cassandra's main configuration file. In this file you set the initialization properties of the cluster, caching parameters for tables, resource usage properties, timeouts, client connections, backups, and security settings.
  6. Cassandra stores most of these properties in the system keyspace. Storage configuration is set per keyspace or per column family (for example, using CQL).
    By default, a node stores the data it manages under /var/lib/cassandra. In a production environment, you should put the commit log directory on a different disk from the data file directories, as sketched below.
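A sketch of the relevant cassandra.yaml entries, assuming a hypothetical second disk mounted at /disk2:

data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /disk2/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches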



About internode communications (gossip)

Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about the other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. Because nodes exchange information about themselves and about the nodes they have already gossiped about, all nodes quickly learn about every other node in the cluster. A gossip message has a version associated with it, so during a gossip exchange older information is overwritten by the most current state.

To prevent partitions in gossip communications, use the same seed list on all nodes in the cluster (translator's note: the seeds setting in cassandra.yaml). By default, a node remembers which other nodes it has gossiped with across restarts.

Note: seed nodes exist only to bootstrap the gossip process for nodes joining the cluster. They are not a single point of failure, and they serve no other special purpose in cluster operations besides bootstrapping.

Configuring gossip settings

Task:

When a node first starts up, it reads its cassandra.yaml configuration file to determine the name of the Cassandra cluster, which seed nodes to contact for information about the other nodes, and other parameters such as ports and ranges.

Property         Description
cluster_name     The name of the cluster.
listen_address   The IP address used to connect to the other nodes.
seed_provider    The addresses of the seed nodes contacted at startup.
storage_port     The internode communication port (default 7000); must be the same on every node.
initial_token    In 1.1 and earlier, determines the range of data the node is responsible for.
num_tokens       In 1.2 and later, determines the range of data the node is responsible for.
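For illustration, these properties might appear in cassandra.yaml roughly as follows (the cluster name and addresses are hypothetical):

cluster_name: 'Test Cluster'
num_tokens: 256
listen_address: 192.168.1.10
storage_port: 7000
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.10,192.168.1.11"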

 

Clearing gossip state:

-Dcassandra.load_ring_state=false
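One common way to apply the flag is to add it temporarily to conf/cassandra-env.sh before restarting the node, for example (remove it again after the node has rejoined):

# in conf/cassandra-env.sh
JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"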
 
About failure detection and recovery

Cassandra uses gossip information to avoid routing client requests to failed nodes (and, through the dynamic snitch, it can also avoid routing requests to nodes that are alive but performing poorly).

Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, and other conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from the other nodes in the cluster. Configuring the phi_convict_threshold property adjusts the sensitivity of the failure detector. Use the default value for most situations, but increase it to 12 on Amazon EC2 because of the network congestion frequently experienced there.
(Translator's note: the accrual failure detection algorithm comes from a 2004 paper:)

1. Hayashibara, N., Défago, X., Yared, R. & Katayama, T. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS 2004), 66-78 (2004). doi:10.1109/RELDIS.2004.1353004
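The corresponding cassandra.yaml line might look like the following; 8 is the usual default, and 12 is the value suggested above for EC2:

phi_convict_threshold: 12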

A node outage rarely signifies a permanent departure from the cluster, so it does not automatically result in permanent removal of the node from the ring. Other nodes periodically try to contact the failed node to see whether it has come back up. To permanently change a node's membership in a cluster, administrators must explicitly add or remove nodes from a Cassandra cluster using the nodetool utility.
When a node comes back online, it may have missed writes for the replica data it maintains. Once the failure detector has marked a node as down, missed writes are stored by other replicas for a period of time, a mechanism known as hinted handoff. If a node stays down for longer than max_hint_window_in_ms (3 hours by default), hints are no longer saved; in that case you should run a repair on the node after it comes back up.
In addition, you should routinely run nodetool repair on all nodes to make sure their data stays consistent.
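For example, a routine repair of a single keyspace could be run like this (the host address and keyspace name are hypothetical):

nodetool -h 192.168.1.10 repair demo_keyspace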
For more explanation about recovery, see Modern hinted handoff.


Data distribution and replication

In Cassandra, data distribution and replication go together. This is because Cassandra is designed as a peer-to-peer system that makes copies of data and distributes those copies among a group of nodes. Data is organized into tables and identified by a primary key. The primary key determines which node the data is stored on. Copies of rows are called replicas; when data is first written, that copy is also considered a replica.

When you create a cluster, you must specify the following:

  • Virtual nodes: assigns data ownership to physical machines.
  • Partitioner: partitions the data across the cluster.
  • Replication strategy: determines the replicas for each row of data.
  • Snitch: defines the topology information that the replication strategy uses to place replicas.

Consistent hashing

For example, suppose we have the following data:

jim      age: 36   car: camaro   gender: M
carol    age: 37   car: bmw      gender: F
johnny   age: 12                 gender: M
suzy     age: 10                 gender: F

Cassandra assigns a hash value to each primary key:

Primary key   Murmur3 hash value
jim           -2245462676723223822
carol         7723358927203680754
johnny        -6723372854036780875
suzy          1168604627387940318

Each node in the cluster is responsible for a portion of the data, based on these hash ranges:

Node   Murmur3 start range     Murmur3 end range
A      -9223372036854775808    -4611686018427387903
B      -4611686018427387904    -1
C      0                       4611686018427387903
D      4611686018427387904     9223372036854775807

Cassandra places the data on each node according to the value of the primary key and the range that the node is responsible for. For example, in a four node cluster, the data in this example is distributed as follows:

Node   Start range             End range               Primary key   Hash value
A      -9223372036854775808    -4611686018427387903    johnny        -6723372854036780875
B      -4611686018427387904    -1                      jim           -2245462676723223822
C      0                       4611686018427387903     suzy          1168604627387940318
D      4611686018427387904     9223372036854775807     carol         7723358927203680754



Virtual nodes

1. Rebalancing the cluster is no longer necessary when nodes are added or removed. When a node joins the cluster, it assumes responsibility for an even portion of the data from the other nodes. When a node fails, the load is spread evenly across the remaining nodes.

2. Rebuilding a dead node is faster, because the rebuild involves every other node in the cluster and data is sent to the replacement node incrementally instead of waiting until the end of the validation phase.

3. Heterogeneous clusters are easier to balance: you can assign a proportionally larger number of virtual nodes to more powerful machines and fewer to smaller ones.

 

Vnodes change this paradigm from one token or range per node, to many per node. Within a cluster these can be randomly selected and be non-contiguous, giving us many smaller ranges that belong to each node.

The top portion of the graphic shows a cluster without virtual nodes. In this paradigm, each node is assigned a single token that represents a location in the ring. Each node stores data determined by mapping the row key to a token value within a range from the previous node to its assigned value. Each node also contains copies of each row from other nodes in the cluster. For example, range E replicates to nodes 5, 6, and 1. Notice that a node owns exactly one contiguous range in the ring space.

The bottom portion of the graphic shows a ring with virtual nodes. Within a cluster, virtual nodes are randomly selected and non-contiguous. The placement of a row is determined by the hash of the row key within many smaller ranges belonging to each node.

Suppose we have 30 nodes and a replication factor of 3. A node dies completely and we need to bring in a replacement. The replacement node has to obtain replicas of three different ranges to rebuild its data: not only the range for which it is the first replica (translator's note: the data that the hash assigns to it), but also the ranges for which it holds the second and third replicas (though do recall that no replica has 'priority' over another in Cassandra).

Because the replication factor is 3, each missing replica still has two other copies. A single node is responsible for three ranges, so at most six nodes are involved. In the current implementation, Cassandra recovers each range from only one replica, so we only need to stream from three nodes to rebuild the data, as shown in the figure below:

We want to minimize the time this operation takes, because if another node fails during this period we may be left with only a single replica for some ranges, and any request at a consistency level greater than ONE for those ranges would then fail. Even if we used all 6 possible replica nodes, we would only be using 20% of our cluster, however.

With virtual nodes, the amount of data that must be copied for recovery is the same, but the process is much faster:

 

 

Repair has two phases: first a validation compaction that iterates over all the data and generates a Merkle tree, and then streaming, when the data that is actually needed is sent. The validation phase might take an hour, while the streaming only takes a few minutes, meaning your replaced disk sits empty for at least an hour.

With vnodes you gain two distinct advantages in this situation. The first is that since the ranges are smaller, data will be sent to the damaged node in a more incremental fashion instead of waiting until the end of a large validation phase. The second is that the validation phase will be parallelized across more machines, causing it to complete faster.

 

When mixing old and new machines during an upgrade, you will certainly want the more powerful machines to manage more of the data.

If you have vnodes it becomes much simpler, you just assign a proportional number of vnodes to the larger machines. If you started your older machines with 64 vnodes per node and the new machines are twice as powerful, simply give them 128 vnodes each and the cluster remains balanced even during transition.

To configure this:

Set the number of tokens on each node in your cluster with the num_tokens parameter in the cassandra.yaml file.

Generally when all nodes have equal hardware capability, they should have the same number of virtual nodes. If the hardware capabilities vary among the nodes in your cluster, assign a proportional number of virtual nodes to the larger machines. For example, you could designate your older machines to use 128 virtual nodes and your new machines (that are twice as powerful) with 256 virtual nodes.

The recommended value for num_tokens is 256. Do not set the initial_token parameter.
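In cassandra.yaml this amounts to something like:

num_tokens: 256
# initial_token:    (leave unset when using virtual nodes)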



Data replication

The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node.

Two replication strategies are available:

  • SimpleStrategy: Use for a single data center only. If you ever intend more than one data center, use the NetworkTopologyStrategy.
  • NetworkTopologyStrategy: Highly recommended for most deployments because it is much easier to expand to multiple data centers when required by future expansion.

SimpleStrategy:

The first replica is placed according to the hash ring (the partitioner); the second and third replicas are placed on the next nodes clockwise around the ring, without regard to topology.

NetworkTopologyStrategy:

Use NetworkTopologyStrategy when you have (or plan to have) your cluster deployed across multiple data centers. This strategy lets you specify how many replicas you want in each data center.

NetworkTopologyStrategy places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.

The two primary considerations when deciding how many replicas to configure in each data center are:

1. Being able to satisfy reads locally, without incurring cross-data-center latency.

2. Failure scenarios.

Two common configurations are:

1. Two replicas in each data center. This configuration tolerates the failure of a single node per replication group while still permitting local reads at a consistency level of ONE.

2. Three replicas in each data center. This configuration tolerates the failure of a single node per replication group while still permitting local reads at a consistency level of LOCAL_QUORUM.

Asymmetrical replication groupings are also possible. For example, you can have three replicas per data center to serve real-time application requests and use a single replica for running analytics.

To set the replication strategy for a keyspace, see CREATE KEYSPACE.

When you use NetworkTopologyStrategy, during creation of the keyspace strategy_options you use the data center names defined for the snitch used by the cluster. To place replicas in the correct location, Cassandra requires a keyspace definition that uses the snitch-aware data center names. For example, if the cluster uses the PropertyFileSnitch, create the keyspace using the user-defined data center and rack names in the cassandra-topology.properties file. If the cluster uses the EC2Snitch, create the keyspace using EC2 data center and rack names.
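For illustration, keyspaces using each strategy might be created like this in CQL 3 (the keyspace names are hypothetical, and 'DC1'/'DC2' must match the data center names your snitch reports):

-- Single data center
CREATE KEYSPACE demo_simple
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Multiple data centers
CREATE KEYSPACE demo_multi_dc
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};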



About partitioners

A partitioner determines how data and replicas are distributed across the nodes in the cluster. In its most basic form, a partitioner is a hash function that computes a token from a row key.

Both the Murmur3Partitioner and the RandomPartitioner use tokens to help assign equal portions of data to each node, even when tables use very different row keys such as usernames or timestamps. In addition, read and write requests are evenly distributed and load balancing comes naturally, because each part of the hash ring receives an equal number of rows.

Cassandra offers the following partitioners:

  • Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.
  • RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.
  • ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes
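The partitioner is set cluster-wide in cassandra.yaml; the 1.2 default looks like this:

partitioner: org.apache.cassandra.dht.Murmur3Partitioner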

About the Murmur3Partitioner:

The Murmur3Partitioner provides a faster hash function than the RandomPartitioner. Once a partitioner has been chosen for a cluster, it cannot be changed (so when upgrading a cluster with existing data to 1.2, you must keep the partitioner set to RandomPartitioner rather than switching). The Murmur3Partitioner uses the MurmurHash function, which produces hash values in the range -2^63 to +2^63.

 

About the RandomPartitioner:

Still available. It computes hashes with MD5, and the possible values range from 0 to 2^127 - 1.

 

About the ByteOrderedPartitioner:

An ordered partitioner. It orders rows lexically by key bytes, treating the row key as its raw byte (hexadecimal) representation, for example 'A' = 0x41. This makes ordered scans by primary key possible: if username is the key, you can scan for users whose names fall between Jake and Joe. Such queries are not possible with the two random partitioners.

Although the ability to do range scans makes an ordered partitioner sound attractive, in practice the same effect can be achieved using table indexes.

Using an ordered partitioner is not recommended, for the following reasons:

Difficult load balancing
More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges (formerly token ranges) based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Sequential writes can cause hot spots
If your application tends to write or update a sequential block of rows at a time, then the writes are not distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data.
Uneven load balancing for multiple tables
If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.



About snitches

A snitch determines which data centers and racks are written to and read from. It informs Cassandra about the network topology so that requests are routed efficiently, and it allows Cassandra to distribute replicas by grouping machines into data centers and racks.

Cassandra does its best to avoid having more than one replica on the same rack.

Note: if you change the snitch after data has already been inserted into the cluster, you must run a full repair, because the snitch affects where replicas are placed.

SimpleSnitch

The default snitch. It does not recognize data center or rack information, so it is usable only for single-data-center deployments. When using this snitch, the only keyspace strategy option you need to set is the replication factor.

RackInferringSnitch

It assumes that each node's IP address encodes its topology and uses that to determine the node's position by data center and rack (the second octet identifies the data center, the third octet the rack, and the fourth octet the node). This snitch is commonly used as an example when writing a custom snitch.

 

PropertyFileSnitch

This snitch determines the location of nodes from a user-supplied description of the network, given in the cassandra-topology.properties file. Use it when your node IPs are not uniform or when you have complex replica placement requirements. With this snitch you can define your own data center names; just make sure the data center names you define match the data center names used in your keyspace strategy_options.

Every node in the cluster should be described in this file, and the file should be identical on every node in the cluster. Note that the location of this configuration file depends on the distribution; see Locations of the configuration files or DataStax Enterprise File Locations.

Example:

If you had non-uniform IPs and two physical data centers with two racks in each, and a third logical data center for replicating analytics data, the cassandra-topology.properties file might look like this:

Example of a manually configured snitch for multiple data centers:
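A sketch of what such a cassandra-topology.properties file could contain (all IP addresses and the data center and rack names below are hypothetical):

# Data Center One
175.56.12.105=DC1:RAC1
120.53.24.101=DC1:RAC2
# Data Center Two
110.56.12.120=DC2:RAC1
110.54.83.12=DC2:RAC2
# Analytics replication group
172.106.12.120=DC3:RAC1
# Default for unknown nodes
default=DC1:RAC1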

GossipingPropertyFileSnitch

The GossipingPropertyFileSnitch defines a local node's data center and rack; it uses gossip for propagating this information to other nodes. The conf/cassandra-rackdc.properties file defines the default data center and rack used by this snitch:

dc=DC1
rack=RAC1

The location of the conf directory depends on the type of installation; see Locations of the configuration files or DataStax Enterprise File Locations.

To migrate from the PropertyFileSnitch to the GossipingPropertyFileSnitch, update one node at a time to allow gossip time to propagate. The PropertyFileSnitch is used as a fallback when cassandra-topology.properties is present.

EC2Snitch

Use the EC2Snitch for a cluster deployed on Amazon EC2 where all nodes are in a single region (translator's note: EC2 lets you choose the physical location of your virtual machines). The region is treated as the data center and the availability zones are treated as racks within the data center.

For example, if a node is in us-east-1a, us-east is the data center name and 1a is the rack location.

When defining your keyspace strategy option, use the EC2 region name (for example, us-east) as your data center name.

EC2MultiRegionSnitch

Similar to the EC2Snitch, but for clusters spanning multiple regions. As with the EC2Snitch, regions are treated as data centers and availability zones are treated as racks within a data center. For example, if a node is in us-east-1a, us-east is the data center name and 1a is the rack location.

The configuration details on EC2 are as follows:

This snitch uses public IPs as broadcast_address to allow cross-region connectivity. This means that you must configure each Cassandra node so that the listen_address is set to the private IP address of the node, and the broadcast_address is set to the public IP address of the node. This allows Cassandra nodes in one EC2 region to bind to nodes in another region, thus enabling multiple data center support. (For intra-region traffic, Cassandra switches to the private IP after establishing a connection.)

Additionally, you must set the addresses of the seed nodes in the cassandra.yaml file to that of the public IPs because private IPs are not routable between networks. For example:

seeds: 50.34.16.33, 60.247.70.52

To find the public IP address, run this command from each of the seed nodes in EC2:

curl http://instance-data/latest/meta-data/public-ipv4

Finally, be sure that the storage_port or ssl_storage_port is open on the public IP firewall.

When defining your keyspace strategy option, use the EC2 region name, such as us-east, as your data center names.

Dynamic snitching

Dynamic snitching monitors the read performance of replicas and chooses the best replica based on that history.

By default, all snitches are wrapped in a dynamic snitch layer that monitors read latency and, when routing requests, steers around poorly performing nodes. The dynamic snitch is enabled by default and is recommended for use in most production deployments.

Configure dynamic snitch thresholds for each node in the cassandra.yaml configuration file.
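The relevant cassandra.yaml properties, shown with what are commonly the default values (treat the exact numbers as indicative):

dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1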

Dynamic snitching has been part of Cassandra since version 0.6.5, but several aspects of it have remained hard to pin down. This blog post aims to clear up everything you might want to know about it.

First, what is the dynamic snitch? Recall what a snitch does: a snitch's job is to determine which data centers and racks are written to and read from. So why would that be "dynamic"?

On the write path, Cassandra sends the mutation to the relevant nodes and then blocks until enough of them have acknowledged it to satisfy the consistency level, so there are no statistics to gather and no choice to make. On the read path, however, Cassandra asks only one node for the actual data and, depending on the consistency level and read repair, asks the remaining replicas only for checksums. The original English is quoted below for clarity:

So, why would that be ‘dynamic?’ This comes into play on the read side only (there’s nothing to be done for writes since we send them all and then block to until the consistency level is achieved.) When doing reads however, Cassandra only asks one node for the actual data, and, depending on consistency level and read repair chance, it asks the remaining replicas for checksums only. This means that it has a choice of however many replicas exist to ask for the actual data, and this is where the dynamic snitch goes to work.

Since only one replica will send us the full data we need, we want to choose the best possible replica to ask. The dynamic snitch handles this task by monitoring the performance of reads from the various replicas and choosing the best one based on this history.

In modern Cassandra, read repair is actually rare, because the hints mechanism is now quite reliable. So, when reading at consistency level ONE, can we maximize the effectiveness of our caches? We can, and this is the reason the dynamic snitch has a badness threshold. This parameter is a percentage that defines how much worse the natural first replica must perform, in order to switch to a different one. So given replicas X, Y, and Z, replica X will remain the preferred one until its performance falls below that of Y and Z by the badness threshold. This means that when all nodes are healthy, cache capacity across the nodes is maximized, but if things go bad, and in particular worse than badness_threshold, Cassandra keeps the data available by switching to the other replicas.

The dynamic snitch does not determine where replicas are placed; that is the job of the snitch you configured. The dynamic snitch simply wraps that snitch and adds adaptivity on the read path.

How is this accomplished? Originally, the dynamic snitch was modeled on the failure detector, since failure detection is also adaptive. To save CPU, it does its work in two parts: a cheap one (receiving latency updates) and an expensive one (calculating a score for each host).

By default, scores are recalculated every 100 ms. The updates are capped at a maximum of 10,000 per scoring interval, but this introduced some new problems. First, if we decide a node is performing poorly and stop reading from it, how do we know when it has recovered? Since no new information arrives to judge it by, we had to add another rule: the score table is reset every ten minutes. After a reset we then have to sample some reads in a way that is fair to every replica.

Second, is 10,000 a good number? We do not really know, because it depends on how powerful the machines running Cassandra are. Finally, is latency the only, and the best, signal for deciding where to read from?

The next release addresses the latter two problems: (1) it uses statistically significant random sampling rather than a fixed number of updates, and (2) it considers factors other than latency, such as whether a node is currently compacting. The original text:

In the next release of Cassandra, these latter two problems have been addressed. Instead of sampling a fixed amount of updates, we now use a statistically significant random sample, and weight more recent information heavier than past information. But wait, there’s more! Instead of relying purely on latency information from reads, we now also consider other factors, like whether or not a node is currently doing a compaction, since that can often penalize reads.

One final problem: the dynamic snitch cannot work when it has no information. Obviously we would have to wait longer than rpc_timeout to know for sure that a read against a node has failed, and we do not want to wait that long. What can we do? In actuality, we do have a signal we can respond to, and that is time itself. Thus, if a node suddenly becomes a black hole, we will only throw reads at it for one scoring interval, and when the next score is calculated we consider latency, the node's state (called a severity factor), and how long it has been since the node last replied, penalizing it so that we stop trying to read from it (badness_threshold permitting).



About client requests

This chapter is fairly straightforward; skim the original if you wish. Only the important points are excerpted below.

A write is considered successful once the data has been written to the commit log and the change has been applied to the memtable.

About multiple data center write requests

In a deployment with multiple data centers, Cassandra selects one coordinator node in each remote data center to handle the replica requests within that data center. The coordinator connected to the client then only needs to forward the write request to the coordinator in each of the other data centers.

About read requests

The coordinator sends two kinds of read requests to the replicas:

1. A direct read request.

2. A background read repair request.

For the direct read request, the number of replicas contacted is determined by the consistency level specified by the client. Background read repair requests are sent to any remaining replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas.

Thus, the coordinator first contacts the replicas specified by the consistency level, sending the requests to the replicas that are currently responding fastest. The contacted nodes return the requested data. If multiple nodes are contacted, the rows from each replica are compared in memory to check whether they are consistent. If they are not, the replica that has the most recent data (based on the timestamp) is the one the coordinator uses to return the result to the client.

To ensure that all replicas have the most recent version of frequently read data, the coordinator also contacts, in the background, all the remaining replicas that own the row and compares their data. If the replicas are inconsistent, the coordinator issues writes to the out-of-date replicas to bring the row up to the most recent version. This is read repair.

For example: in a cluster with a replication factor of 3 and a read consistency level of QUORUM, 2 of the 3 replicas of the requested row must be contacted to satisfy the read. If the contacted replicas hold different versions of the row, the replica with the most recent version returns the result. In the background, the third replica is checked for consistency with the first two, and, if needed, the most recent version is written to any out-of-date replicas.
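As a small illustration (the keyspace and table names are hypothetical, and the CONSISTENCY command may vary between cqlsh versions), a per-session read consistency level can be set in cqlsh before querying:

CONSISTENCY QUORUM;
SELECT * FROM demo.users WHERE user_id = 'jim';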



Planning a cluster deployment

Selecting hardware for enterprise implementations

The more memory a node has, the more data Cassandra can cache and the more data the memtables can hold.

Recommendations:

Enterprise environments: 16 GB to 64 GB, with 8 GB as a minimum.

Virtual machines: 8 GB to 16 GB, with 4 GB as a minimum.

Testing environments: 256 MB.

CPU:

For applications with heavy write loads, Cassandra tends to become CPU-bound before it becomes memory-bound. Cassandra is a highly concurrent application and will use as many CPU cores as are available:

Dedicated hardware: 8-core processors are a good choice.

Virtualized environments: consider using a provider that allows CPU bursting, such as Rackspace Cloud Servers.

Disks:

First, recall the relevant parts of the architecture. When writing data, Cassandra (1) appends the write to the commit log for durability, (2) flushes memtables to SSTables on disk for persistence, and (3) periodically compacts SSTables. Compaction improves performance by merging and rewriting data and removing stale data; however, it can make heavy use of disk I/O and of space in the data directories. You should therefore leave enough free disk space: at least 50% for SizeTieredCompactionStrategy and large compactions, and 10% for LeveledCompactionStrategy.

For more on compaction, see the next article on data modeling.

Recommendations:

Capacity and I/O
When choosing disks, consider both capacity (how much data you plan to store) and I/O (the write/read throughput rate). Some workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with more RAM).
Solid-state drives
SSDs are the recommended choice for Cassandra. Cassandra's sequential, streaming write patterns minimize the undesirable effects of write amplification associated with SSDs. This means that Cassandra deployments can take advantage of inexpensive consumer-grade SSDs. Enterprise level SSDs are not necessary because Cassandra's SSD access wears out consumer-grade SSDs in the same time frame as more expensive enterprise SSDs.
Number of disks - SATA
Ideally Cassandra needs at least two disks, one for the commit log and the other for the data directories. At a minimum the commit log should be on its own partition.
Commit log disk - SATA
The disk need not be large, but it should be fast enough to receive all of your writes as appends (sequential I/O).
Data disks
Use one or more disks and make sure they are large enough for the data volume and fast enough to both satisfy reads that are not cached in memory and to keep up with compaction.
RAID on data disks
It is generally not necessary to use RAID for the following reasons:
  • Data is replicated across the cluster based on the replication factor you've chosen.
  • Starting in version 1.2, Cassandra takes care of disk management with the JBOD (just a bunch of disks) support feature. Because Cassandra properly reacts to a disk failure, based on your availability/consistency requirements, either by stopping the affected node or by blacklisting the failed drive, you can deploy Cassandra nodes with large disk arrays without the overhead of RAID 10.
RAID on the commit log disk
Generally RAID is not needed for the commit log disk. Replication adequately prevents data loss. If you need the extra redundancy, use RAID 1.
Extended file systems
DataStax recommends deploying Cassandra on XFS. On ext2 or ext3, the maximum file size is 2TB even using a 64-bit kernel. On ext4 it is 16TB.

Because Cassandra can use almost half your disk space for a single file, use XFS when using large disks, particularly if using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and essentially unlimited on 64-bit.

In short: RAID is not needed; use at least two disks (one for the commit log and one for the SSTables); SSDs are a good choice; XFS is the recommended file system, and ext4 also appears to be acceptable.

Number of nodes:

Before 1.2, the recommendation was roughly 300 GB to 500 GB of data per node. From 1.2 onwards, JBOD support, virtual nodes, off-heap Bloom filters, parallel SSD compaction, and other improvements allow you to use fewer machines with more data per machine.

Network:

Be sure to choose reliable, redundant network interfaces, and make sure your network can handle internode traffic without bottlenecks.

1. 1 Gbit/s or faster bandwidth is recommended.

2. Bind the Thrift interface (listen_address) to a specific NIC (network interface card).

3. Bind the RPC server interface (rpc_address) to another NIC.
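A minimal cassandra.yaml sketch, with hypothetical addresses on two separate interfaces:

listen_address: 192.168.1.10    # internode (gossip/storage) traffic
rpc_address: 10.0.0.10          # client-facing RPC traffic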

Planning an Amazon EC2 cluster

//TODO

Calculating usable disk capacity

//TODO

Calculating user data size

//TODO

Anti-patterns in Cassandra

//TODO



Installing DataStax Community in various environments // TODO


Upgrading Cassandra // TODO


Initializing a cluster // TODO


Security // TODO


One more important topic remains, namely how data is organized on disk; see the next translation, Managing Data.
