1. Some systems appear under more than one heading because their functionality overlaps. Redis, for example, is a KV database, but also works as a caching system and as a message-distribution system. How to merge such entries into more accurate categories is left for later.
2. An index will be added eventually; for now there is so much material that the document is hard to navigate.
[Cluster Management]
mesos
Program against your datacenter like it’s a single pool of resources
Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
What is Mesos?
A distributed systems kernel
Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elastic Search) with APIs for resource management and scheduling across entire datacenter and cloud environments.
Mesos Getting Started
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications and frameworks; it can run Hadoop, MPI, Hypertable, and Spark.
Features:
- Fault-tolerant replicated master using ZooKeeper
- Scalability to 10,000s of nodes
- Isolation between tasks with Linux Containers
- Multi-resource scheduling (memory and CPU aware)
- Java, Python and C++ APIs for developing new parallel applications
- Web UI for viewing cluster state
Books
Demystifying Mesos (深入淺出Mesos)
Demystifying Mesos (1): An operating system born for the software-defined data center
Demystifying Mesos (2): Mesos architecture and workflow
Demystifying Mesos (3): Persistent storage and fault tolerance
Demystifying Mesos (4): Resource allocation in Mesos
Demystifying Mesos (5): A successful open-source community
Demystifying Mesos (6): Experiencing Apache Mesos first-hand
Apple rebuilt Siri's backend services with Apache Mesos
Singularity: a service deployment and job scheduling platform built on Apache Mesos
Autodesk's scalable event system built on Mesos
Project Myriad: Mesos and YARN working together
[RPC]
hprose : github
High Performance Remote Object Service Engine
Hprose is an advanced, lightweight, cross-language, cross-platform, non-invasive, high-performance engine for dynamic remote object invocation. It is simple to use yet powerful, and is well suited to building distributed application systems.
protobuf
Protocol Buffers - Google's data interchange format
Related links
https://github.com/google/protobuf
https://developers.google.com/protocol-buffers/
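As a concrete sketch of the workflow: assume a hypothetical person.proto compiled with protoc's Go plugin into a package exposing a Person message (the package path and message are assumptions, not from this document). Serialization with the Go protobuf runtime then looks roughly like this:
```go
// Minimal sketch: encode and decode a protobuf message in Go.
// The pb package and its Person type are hypothetical generated code.
package main

import (
	"fmt"
	"log"

	"github.com/golang/protobuf/proto"
	pb "example.com/demo/person" // hypothetical output of `protoc --go_out`
)

func main() {
	p := &pb.Person{Name: "alice", Id: 1}

	// Marshal produces the compact binary wire format.
	data, err := proto.Marshal(p)
	if err != nil {
		log.Fatal(err)
	}

	// Unmarshal decodes it back; any supported language can read the same bytes.
	decoded := &pb.Person{}
	if err := proto.Unmarshal(data, decoded); err != nil {
		log.Fatal(err)
	}
	fmt.Println(decoded.GetName()) // "alice"
}
```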
grpc : github
Overview
Remote Procedure Calls (RPCs) provide a useful abstraction for building distributed applications and services. The libraries in this repository provide a concrete implementation of the gRPC protocol, layered over HTTP/2. These libraries enable communication between clients and servers using any combination of the supported languages.
The Go implementation of gRPC: A high performance, open source, general RPC framework that puts mobile and HTTP/2 first. For more information see the gRPC Quick Start guide.
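To make the client/server model concrete, here is a minimal Go server sketch. The Greeter service, the HelloRequest/HelloReply types, and pb.RegisterGreeterServer are hypothetical generated code (what protoc's Go gRPC plugin would emit for a small greeter.proto); grpc.NewServer and Serve are real API calls.
```go
// Minimal gRPC server sketch; the pb package is hypothetical generated code.
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	pb "example.com/demo/helloworld" // hypothetical generated package
)

// server implements the hypothetical Greeter service interface.
type server struct{}

func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
	return &pb.HelloReply{Message: "Hello " + in.Name}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":50051") // gRPC is layered over HTTP/2 on this listener
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	pb.RegisterGreeterServer(s, &server{}) // hypothetical generated registration helper
	log.Fatal(s.Serve(lis))
}
```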
thrift
The Apache Thrift software framework, for scalable cross-language services development,
combines a software stack with a code generation engine
to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
Document
Tutorial
Thrift is a software framework (a remote procedure call framework) for scalable, cross-language service development. It combines a powerful software stack with a code-generation engine to build services that work efficiently and seamlessly across C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml.
Thrift was originally developed at Facebook, open-sourced in April 2007, entered the Apache Incubator in May 2008, and is now an Apache top-level project.
Thrift lets you define data types and service interfaces in a simple definition file; taking that file as input, the compiler generates code for building RPC clients and servers that communicate seamlessly across programming languages.
The well-known key-value store Cassandra uses Thrift for its client API.
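A sketch of what the generated code feels like from Go, one of Thrift's supported languages: assume a hypothetical calculator.thrift defining an Add method, compiled with `thrift --gen go`. The client composes a transport with a protocol and then calls the generated stub.
```go
// Minimal Thrift client sketch. The calculator package and its
// NewCalculatorClientFactory/Add are hypothetical generated code;
// the thrift.* calls are from the Apache Thrift Go runtime.
package main

import (
	"fmt"
	"log"

	"git.apache.org/thrift.git/lib/go/thrift"
	"example.com/gen-go/calculator" // hypothetical `thrift --gen go` output
)

func main() {
	sock, err := thrift.NewTSocket("localhost:9090")
	if err != nil {
		log.Fatal(err)
	}
	// Framed transport + binary protocol; the server must be configured to match.
	transport := thrift.NewTFramedTransport(sock)
	protocol := thrift.NewTBinaryProtocolFactoryDefault()

	client := calculator.NewCalculatorClientFactory(transport, protocol)
	if err := transport.Open(); err != nil {
		log.Fatal(err)
	}
	defer transport.Close()

	sum, err := client.Add(1, 2) // a method defined in the hypothetical IDL
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(sum) // 3
}
```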
[Messaging Systems / Distributed Messaging]
Kafka
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
- Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
- Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization.
It can be elastically and transparently expanded without downtime.
Data streams are partitioned and spread over a cluster of machines to allow data streams larger than
the capability of any single machine and to allow clusters of co-ordinated consumers.
- Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
- Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Kafka is a high-throughput distributed publish-subscribe messaging system with the following features:
- Message persistence through an O(1) disk data structure that stays fast and stable even with terabytes of stored messages.
- High throughput: even on very ordinary hardware, Kafka can handle hundreds of thousands of messages per second.
- Partitioning of messages across Kafka servers and clusters of consumer machines.
- Parallel data loading into Hadoop.
Kafka aims to provide a publish-subscribe solution that can handle all the activity-stream data of a consumer-scale website.
This kind of activity (page views, searches, and other user actions) is a key ingredient of many social features on the modern web.
Because of its throughput requirements, such data is usually handled through log processing and log aggregation.
Kafka is a viable solution for systems that, like Hadoop, deal with log data and offline analytics, yet also need real-time processing.
Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and to provide real-time consumption across a cluster of machines.
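A minimal producer sketch using the community Go client Sarama (github.com/Shopify/sarama); the broker address and topic name are assumptions for illustration:
```go
// Send one message to a Kafka topic with a synchronous producer.
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	config.Producer.Return.Successes = true // required by SyncProducer

	// Connect to a single local broker; production configs would list several.
	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Messages with the same key land in the same partition, preserving order.
	msg := &sarama.ProducerMessage{
		Topic: "page-views",
		Key:   sarama.StringEncoder("user-42"),
		Value: sarama.StringEncoder(`{"url": "/home"}`),
	}
	partition, offset, err := producer.SendMessage(msg)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stored in partition %d at offset %d", partition, offset)
}
```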
NATS
NATS is an open-source, high-performance, lightweight cloud native messaging system
gnatsd Github: A High Performance NATS Server written in Go.
cnats Github: A C client for the NATS messaging system.
NATS Github: Golang client for NATS, the cloud native messaging system
Cloud Native Infrastructure. Open Source. Performant. Simple. Scalable.
NATS acts as a central nervous system for distributed systems at scale, such as mobile devices, IoT networks,
and cloud native infrastructure. **Written in Go**,
NATS powers some of the largest cloud platforms in production today.
Unlike traditional enterprise messaging systems,
NATS has an always-on dial tone that does whatever it takes to remain available.
NATS was created by Derek Collison,
Founder/CEO of Apcera who has spent 20+ years designing, building,
and using publish-subscribe messaging systems.
documentation
NATS is a Docker Official Image
NATS is the most Performant Cloud Native messaging platform available
With gnatsd (Golang-based server), NATS can send up to 6 MILLION MESSAGES PER SECOND.
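A minimal publish/subscribe sketch with the Go client listed above; the subject name is illustrative and the URL is the client's default (nats://localhost:4222).
```go
// Publish and receive one message over NATS.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Subscribers receive messages for a subject asynchronously.
	nc.Subscribe("updates", func(m *nats.Msg) {
		fmt.Printf("received: %s\n", string(m.Data))
	})

	// Fire-and-forget publish; the NATS core favors an always-on, at-most-once model.
	nc.Publish("updates", []byte("hello"))
	nc.Flush()                         // ensure the message reached the server
	time.Sleep(100 * time.Millisecond) // give the handler a moment to run
}
```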
[Caching Servers, Proxy Servers, Load Balancing]
memcached
memcached is a high-performance distributed memory cache server. It is typically used to cache database query results, reducing the number of database hits in order to speed up dynamic web applications and improve their scalability.
What is Memcached?
Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
Memcached is simple yet powerful. Its simple design promotes quick deployment, ease of development, and solves many problems facing large data caches. Its API is available for most popular languages.
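The typical cache-aside pattern described above, sketched with the community Go client github.com/bradfitz/gomemcache; the server address and the loadUserFromDB helper are assumptions for illustration.
```go
// Cache-aside: try memcached first, fall back to the database on a miss.
package main

import (
	"log"

	"github.com/bradfitz/gomemcache/memcache"
)

// loadUserFromDB stands in for a hypothetical expensive database lookup.
func loadUserFromDB(id string) []byte {
	return []byte(`{"id": "` + id + `", "name": "alice"}`)
}

func main() {
	mc := memcache.New("localhost:11211")

	key := "user:42"
	item, err := mc.Get(key)
	if err == memcache.ErrCacheMiss {
		// On a miss, hit the database and populate the cache with a 5-minute TTL.
		value := loadUserFromDB("42")
		mc.Set(&memcache.Item{Key: key, Value: value, Expiration: 300})
		item = &memcache.Item{Key: key, Value: value}
	} else if err != nil {
		log.Fatal(err)
	}
	log.Printf("user: %s", item.Value)
}
```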
nginx
nginx [engine x] is an HTTP and reverse proxy server, a mail proxy server, and a generic TCP proxy server, originally written by Igor Sysoev. For a long time, it has been running on many heavily loaded Russian sites including Yandex, Mail.Ru, VK, and Rambler. According to Netcraft, nginx served or proxied 23.36% of the busiest sites in September 2015. Here are some of the success stories: Netflix, Wordpress.com, FastMail.FM.
The sources and documentation are distributed under the 2-clause BSD-like license.
Document
Now with support for HTTP/2, massive performance and security enhancements,
greater visibility into application health, and more.
redis
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker.
It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries.
Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
try redis
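A small sketch of two of these roles, caching with a TTL and a sorted set with a range query, using the community Go client Redigo; the address and key names are assumptions for illustration.
```go
// Exercise Redis as a key-value cache and as a sorted-set store.
package main

import (
	"fmt"
	"log"

	"github.com/garyburd/redigo/redis"
)

func main() {
	conn, err := redis.Dial("tcp", "localhost:6379")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Key-value caching: SET with a 300-second expiry in one command.
	conn.Do("SET", "session:42", "alice", "EX", 300)
	name, _ := redis.String(conn.Do("GET", "session:42"))
	fmt.Println(name) // "alice"

	// Sorted set with a range query, one of Redis's richer data structures.
	conn.Do("ZADD", "scores", 10, "a", 20, "b")
	top, _ := redis.Strings(conn.Do("ZRANGEBYSCORE", "scores", 15, "+inf"))
	fmt.Println(top) // [b]
}
```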
[Distributed Parallel Computing Frameworks]
mapreduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
Conceptually similar approaches have been very well known since 1995 with the Message Passing Interface standard having reduce and scatter operations.
Related links
https://en.wikipedia.org/wiki/MapReduce
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
About MapReduce
MapReduce is the heart of Hadoop®. It is this programming paradigm that allows for massive scalability across
hundreds or thousands of servers in a Hadoop cluster.
The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data
processing solutions.
For people new to this topic, it can be somewhat difficult to grasp, because it’s not typically something people have been exposed to previously.
If you’re new to Hadoop’s MapReduce jobs, don’t worry: we’re going to describe it in a way that gets you up
to speed quickly.
The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.
The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
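A toy, single-process illustration of the two tasks just described, using the classic word-count example (a real Hadoop job distributes both phases across a cluster):
```go
// Word count: the map task emits (word, 1) tuples; the reduce task
// combines tuples that share a key into a smaller set of (word, total) tuples.
package main

import (
	"fmt"
	"strings"
)

type pair struct {
	key   string
	value int
}

// mapTask converts each input record into a set of key/value tuples.
func mapTask(lines []string) []pair {
	var tuples []pair
	for _, line := range lines {
		for _, word := range strings.Fields(line) {
			tuples = append(tuples, pair{word, 1})
		}
	}
	return tuples
}

// reduceTask takes the map output and combines tuples with the same key.
func reduceTask(tuples []pair) map[string]int {
	counts := make(map[string]int)
	for _, t := range tuples {
		counts[t.key] += t.value
	}
	return counts
}

func main() {
	input := []string{"the quick brown fox", "the lazy dog"}
	fmt.Println(reduceTask(mapTask(input)))
	// map[brown:1 dog:1 fox:1 lazy:1 quick:1 the:2]
}
```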
MapReduce Tutorial
spark
Apache Spark™ is a fast and general engine for large-scale data processing.
Document
Programming Guides:
- Quick Start: a quick introduction to the Spark API; start here!
- Spark Programming Guide: detailed overview of Spark in all supported languages (Scala, Java, Python, R)
Deployment Guides:
- Cluster Overview: overview of concepts and components when running on a cluster
- Submitting Applications: packaging and deploying applications
- Deployment modes:
- Amazon EC2: scripts that let you launch a cluster on EC2 in about 5 minutes
- Standalone Deploy Mode: launch a standalone cluster quickly without a third-party cluster manager
- Mesos: deploy a private cluster using Apache Mesos
- YARN: deploy Spark on top of Hadoop NextGen (YARN)
storm
Why use Storm?
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed. Read more in the tutorial.
Document
Storm (event processor)
Apache Storm is a distributed computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz[1] and team at BackType,[2] the project was open sourced after being acquired by Twitter.[3] It uses custom created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011.[4]
A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real-time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.[5]
Storm became an Apache Top-Level Project in September 2014[6] and was previously in incubation since September 2013.[7][8]
Book: Storm Applied
Storm is a distributed, fault-tolerant real-time computation system. It was originally developed at BackType; after Twitter acquired BackType, the project was open-sourced.
hadoop
- Hadoop is an open-source, reliable, scalable framework for distributed parallel computing.
- Its main components are the HDFS distributed file system and the MapReduce execution engine.
What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
- Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
An ad hoc query is one that a user defines on the spot as the need arises: its conditions are not fixed and its format is flexible, giving users a more interactive way to explore data.
Hive is a data-warehouse solution built on Hadoop. Because Hadoop itself offers good scalability and fault tolerance for both storage and computation, a data warehouse built with Hive inherits those properties.
Put simply, Hive layers a SQL interface on top of Hadoop: it translates SQL into MapReduce jobs that run on the cluster, so data engineers and analysts can conveniently run statistics and analysis over massive data sets in plain SQL, without the trouble of writing MapReduce programs in a general-purpose language.
- Mahout™: A Scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
- ZooKeeper™: A high-performance coordination service for distributed applications.
##### [Getting Started]
##### [Learn about Hadoop by reading the documentation.](http://hadoop.apache.org/docs/current/)
In short, Hadoop provides a reliable system for shared storage and analysis: the storage is provided by HDFS and the analysis by MapReduce. Hadoop has other capabilities, but these two are its core.
1.3.1 Relational Database Management Systems
Why can't we just use databases with lots of disks to do large-scale batch analysis? Why do we need MapReduce?
The answer comes from another trend in disk drives: seek time is improving far more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place to read or write data; it characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
If a data access pattern is dominated by seeks, it will inevitably take far longer to read or write large portions of a dataset than it would by streaming through it.
On the other hand, for updating a small proportion of records, a traditional B-tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well.
For updating the majority of a database, however, a B-tree is less efficient than MapReduce, which uses sort/merge to rebuild the database.
In many cases, MapReduce can be seen as a complement to an RDBMS (relational database management system); the differences between the two systems are shown in Table 1-1.
MapReduce is a good fit for problems that need to analyze a whole dataset in a batch fashion, particularly for ad hoc analysis.
An RDBMS is good for point queries and updates, where the dataset has been indexed to deliver low-latency retrieval and short update times for a small amount of data.
MapReduce suits applications where data is written once and read many times, whereas a relational database is better for datasets that are continually updated.
Table 1-1: Comparison of a traditional RDBMS and MapReduce
| | Traditional RDBMS | MapReduce |
|---|---|---|
| Data size | Gigabytes | Petabytes |
| Access | Interactive and batch | Batch |
| Updates | Read and write many times | Write once, read many times |
| Structure | Static schema | Dynamic schema |
| Integrity | High | Low |
| Scaling | Nonlinear | Linear |
Another difference between MapReduce and an RDBMS is the amount of structure in the datasets they operate on. Structured data is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS.
Semi-structured data, on the other hand, is looser: there may be a schema, but it is often ignored, so it serves only as a guide to the structure of the data. For example, a spreadsheet is structured as a grid of cells, although the cells themselves may hold any form of data.
Unstructured data has no particular internal structure: plain text or image data, for example. MapReduce works well on unstructured or semi-structured data because it is designed to interpret the data at processing time.
In other words, the keys and values MapReduce takes as input are not intrinsic properties of the data; they are chosen by the person analyzing the data.
Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce because it makes reading a record a non-local operation, and one of the central assumptions MapReduce makes is that it can perform (high-speed) streaming reads and writes.
MapReduce is a linearly scalable programming model. The programmer writes two functions, map() and reduce(), each of which defines a mapping from one set of key-value pairs to another.
These functions are oblivious to the size of the data or the cluster they are operating on, so they can be used unchanged on a small dataset or on a very large one.
More importantly, if you double the size of the input data, a job will take twice as long to run; but if you also double the size of the cluster, the job will run just as fast as the original one. This is not generally true of SQL queries.
Over time, the differences between relational databases and MapReduce are likely to blur. Relational databases have started to incorporate some of the ideas of MapReduce (such as Aster Data's and Greenplum's databases),
and, from the other direction, higher-level query languages built on MapReduce (such as Pig and Hive) are making MapReduce systems more approachable for traditional database programmers.
[NoSQL Databases + Key-Value Databases]
ScyllaDB
NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
http://blog.jobbole.com/93027/
ScyllaDB: Cassandra rewritten in C++, with ten times the performance
Its two core technologies: Intel's DPDK driver framework and the Seastar networking framework.
cassandra
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.
GettingStarted
About Apache Cassandra
This guide provides information for developers and administrators on installing, configuring, and using the features and capabilities of Cassandra.
What is Apache Cassandra?
Apache Cassandra™ is a massively scalable open source NoSQL database. Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times.
How does Cassandra work?
Cassandra’s built-for-scale architecture means that it is capable of handling petabytes of information and thousands of concurrent users/operations per second.
http://www.ibm.com/developerworks/cn/opensource/os-cn-cassandra/index.html
Apache Cassandra is an open-source distributed key-value storage system. It was originally developed at Facebook for storing very large amounts of data. Cassandra is not a conventional database; it is a hybrid non-relational database, similar to Google's Bigtable.
The article introduces Cassandra in the following areas: Cassandra's data model, installing and configuring Cassandra, storing data in Cassandra from common programming languages, and setting up a Cassandra cluster.
http://docs.datastax.com/en/cassandra/2.0/cassandra/gettingStartedCassandraIntro.html
etcd
etcd is a high-performance key-value store for shared configuration and service discovery.
A highly-available key value store for shared configuration and service discovery
Overview
etcd is a distributed key value store that provides a reliable way to store data across a cluster of machines. It’s open-source and available on GitHub. etcd gracefully handles master elections during network partitions and will tolerate machine failure, including the master.
Your applications can read and write data into etcd. A simple use-case is to store database connection details or feature flags in etcd as key value pairs. These values can be watched, allowing your app to reconfigure itself when they change.
Advanced uses take advantage of the consistency guarantees to implement database master elections or do distributed locking across a cluster of workers.
Getting Started with etcd
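A sketch of the feature-flag use case described above, using the etcd v3 Go client; the endpoint, key, and value are assumptions for illustration.
```go
// Store a feature flag in etcd, read it back, and watch for changes.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Store a feature flag as an ordinary key-value pair.
	if _, err := cli.Put(ctx, "/flags/new-ui", "on"); err != nil {
		log.Fatal(err)
	}

	resp, err := cli.Get(ctx, "/flags/new-ui")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}

	// Watch the key so the app can reconfigure itself when the flag changes.
	// (This loop blocks forever; a real app would run it in a goroutine.)
	for wresp := range cli.Watch(ctx, "/flags/new-ui") {
		for _, ev := range wresp.Events {
			fmt.Printf("flag changed: %s = %s\n", ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```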
ceph
Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability.
- Object Storage
Ceph provides seamless access to objects using native language bindings or radosgw, a REST interface that’s compatible with applications written for S3 and Swift.
- Block Storage
Ceph’s RADOS Block Device (RBD) provides access to block device images that are striped and replicated across the entire storage cluster.
- File System
Ceph provides a POSIX-compliant network file system that aims for high performance, large data storage, and maximum compatibility with legacy applications.
#### [Document](http://docs.ceph.com/docs/v0.80.5/)
Ceph uniquely delivers object, block, and file storage in one unified system.
#### [Intro to Ceph](http://docs.ceph.com/docs/v0.80.5/start/intro/)
Whether you want to provide Ceph Object Storage and/or Ceph Block Device services to Cloud Platforms,
deploy a Ceph Filesystem or use Ceph for another purpose,all Ceph Storage Cluster deployments begin with setting up each Ceph Node, your network and the Ceph Storage Cluster.
A Ceph Storage Cluster requires at least one Ceph Monitor and at least two Ceph OSD Daemons.
The Ceph Metadata Server is essential when running Ceph Filesystem clients.
Ceph's main goal is to be a POSIX-based distributed file system with no single point of failure, replicating data fault-tolerantly and seamlessly. In March 2010, Linus Torvalds merged the Ceph client into kernel 2.6.34. An article on IBM developerWorks explores Ceph's architecture, its fault-tolerance implementation, and its features for simplifying the management of massive amounts of data.
[Networking Frameworks]
seastar
High-performance server-side application framework, written in C++; it is the networking framework behind [scylla](https://github.com/scylladb/scylla)
SeaStar is an event-driven framework allowing you to write non-blocking, asynchronous code in a relatively straightforward manner (once understood). It is based on futures.
POCO : github
POCO C++ Libraries-Cross-platform C++ libraries with a network/internet focus.
POrtable COmponents C++ Libraries are:
- A collection of C++ class libraries, conceptually similar to the Java Class Library, the .NET Framework or Apple’s Cocoa.
- Focused on solutions to frequently-encountered practical problems.
- Focused on ‘internet-age’ network-centric applications.
- Written in efficient, modern, 100% ANSI/ISO Standard C++.
- Based on and complementing the C++ Standard Library/STL.
- Highly portable and available on many different platforms.
- Open Source, licensed under the Boost Software License.
With C++11's STL supporting threads and strings supporting UTF-8, cross-platform C++ is no longer a dream. I am optimistic about this one.
[Distributed File Systems + Storage]
hbase
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store
When Would I Use Apache HBase?
Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
ceph
Ceph is a scalable distributed storage system
Ceph is a distributed object, block, and file storage platform
Ceph's main goal is to be a POSIX-based distributed file system with no single point of failure, replicating data fault-tolerantly and seamlessly.
In March 2010, Linus Torvalds merged the Ceph client into kernel 2.6.34.
An article on IBM developerWorks explores Ceph's architecture, its fault-tolerance implementation, and its features for simplifying the management of massive amounts of data.
gcsfuse
A user-space file system for interacting with Google Cloud Storage.
Written in Go; a File system built on the [Google Cloud Storage](https://cloud.google.com/storage/) API.
Currently a beta release: bugs may lurk, and interface changes may break backward compatibility.
[Seafile](#Seafile)
Written in C; a cloud storage platform.
Seafile is an open source cloud storage system with features on privacy protection and teamwork.
Goofys
Goofys is written in Go; a Filey system built on the [S3](https://aws.amazon.com/s3/) API.
Goofys lets you mount an S3 bucket as a Filey system. Why a Filey system instead of a File system? Because goofys puts performance ahead of POSIX compliance.
[Other]
HDFS vs. KFS
Both are open-source implementations of GFS. HDFS is a subproject of Hadoop, implemented in Java, and provides Hadoop's upper-layer applications with high-throughput, scalable storage for large files.
Kosmos filesystem (KFS) is a high performance distributed filesystem for web-scale applications such as
storing log data, Map/Reduce data etc.
It builds upon ideas from Google's well known Google Filesystem project. KFS is implemented in C++.
TFS: even Taobao itself no longer uses it; updates stopped in 2011.
FastDFS is an open source high performance distributed file system (DFS).
Its major functions include file storing, file syncing, and file accessing, and it is designed for high capacity and load balance.
FastDFS is an open-source, GFS-like distributed file system implemented in pure C; it supports Linux, FreeBSD, AIX, and other UNIX systems.
Files can only be stored and accessed through its proprietary API; it has no POSIX interface and cannot be mounted.
Strictly speaking, Google FS and the GFS-like systems (FastDFS, MogileFS, HDFS, TFS) are not system-level distributed file systems
but application-level distributed file storage services.
FastDFS is an open-source, lightweight distributed file system that manages files,
providing file storage, file synchronization, and file access (upload and download), and solving the problems of high-capacity storage and load balancing.
It is particularly suited to online services built around files, such as photo-album sites and video sites.
gcsfuse
gcsfuse is a user-space file system for interacting with Google Cloud Storage.
GCS Fuse
GCS Fuse is an open source Fuse adapter that allows you to **mount Google Cloud Storage buckets as file systems on Linux or OS X systems**.
GCS Fuse can be run anywhere with connectivity to Google Cloud Storage (GCS) including Google Compute Engine VMs or on-premises systems.
GCS Fuse provides another means to access Google Cloud Storage objects in addition to the XML API,
JSON API, and the gsutil command line,
allowing even more applications to use Google Cloud Storage and take advantage of its immense scale, high availability, rock-solid durability,
exemplary performance, and low overall cost. GCS Fuse is a Google-developed and community-supported open-source tool, written in Go and hosted on GitHub.
GCS Fuse is open-source software, released under the Apache License.
It is distributed as-is, without warranties or conditions of any kind.
Best effort community support is available on Server Fault with the google-cloud-platform
and gcsfuse
tags.
Check the previous questions and answers to see if your issue is already answered. For bugs and feature requests, file an issue.
Technical Overview
GCS Fuse works by translating object storage names into a file and directory system, interpreting the “/” character in object names as a directory separator so that objects with the same common prefix are treated as files in the same directory. Applications can interact with the mounted bucket like any other file system, providing virtually limitless file storage running in the cloud, but accessed through a traditional POSIX interface.
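A toy illustration of that name translation (not gcsfuse's actual implementation): interpreting "/" in flat object names as a directory separator, so objects sharing a common prefix list as files in the same directory.
```go
// Map a flat object namespace onto a directory listing.
package main

import (
	"fmt"
	"strings"
)

// listDir returns the immediate children of dir, given flat object names.
func listDir(objects []string, dir string) []string {
	seen := make(map[string]bool)
	var entries []string
	for _, name := range objects {
		if !strings.HasPrefix(name, dir) {
			continue
		}
		rest := strings.TrimPrefix(name, dir)
		if i := strings.Index(rest, "/"); i >= 0 {
			rest = rest[:i+1] // a subdirectory entry, e.g. "logs/"
		}
		if !seen[rest] {
			seen[rest] = true
			entries = append(entries, rest)
		}
	}
	return entries
}

func main() {
	objects := []string{"a.txt", "logs/app.log", "logs/sys.log", "img/x.png"}
	fmt.Println(listDir(objects, ""))      // [a.txt logs/ img/]
	fmt.Println(listDir(objects, "logs/")) // [app.log sys.log]
}
```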
While GCS Fuse has a file system interface, it is not like an NFS or CIFS file system on the backend.
GCS Fuse retains the same fundamental characteristics of Google Cloud Storage, preserving the scalability of Google Cloud Storage in terms of size and aggregate performance while maintaining the same latency and single object performance. As with the other access methods, Google Cloud Storage does not support concurrency and locking. For example, if multiple GCS Fuse clients are writing to the same file, the last flush wins.
For more information about using GCS Fuse or to file an issue, go to the Google Cloud Platform GitHub repository.
In the repository, we recommend you review README, semantics, installing, and mounting.
When to use GCS Fuse
GCS Fuse is a utility that helps you make better and quicker use of Google Cloud Storage by allowing file-based applications to use Google Cloud Storage without need for rewriting their I/O code. It is ideal for use cases where Google Cloud Storage has the right performance and scalability characteristics for an application and only the POSIX semantics are missing.
For example, GCS Fuse will work well for genomics and biotech applications, some media/visual effects/rendering applications, financial services modeling applications, web serving content, FTP backends, and applications storing log files (presuming they do not flush too frequently).
support
GCS Fuse is supported in Linux kernel version 3.10 and newer. To check your kernel version, you can use uname -a.
Current status
Please treat gcsfuse as beta-quality software. Use it for whatever you like, but be aware that bugs may lurk, and that we reserve the right to make small backwards-incompatible changes.
The careful user should be sure to read semantics.md for information on how gcsfuse maps file system operations to GCS operations, and especially on surprising behaviors. The list of open issues may also be of interest.
Goofys
Goofys is a Filey-System interface to [S3](https://aws.amazon.com/s3/)
Overview
Goofys allows you to mount an S3 bucket as a filey system.
It's a Filey System instead of a File System because goofys strives for performance first and POSIX second. In particular, operations that are difficult to support on S3, or that would translate into more than one round trip, either fail (random writes) or are faked (no per-file permissions). Goofys does not have an on-disk data cache, and its consistency model is close-to-open.
Seafile : github
Seafile is an open source cloud storage system with privacy protection and teamwork features. Collections of files are called libraries, and each library can be synced separately. A library can also be encrypted with a user-chosen password. Seafile also allows users to create groups and easily share files within groups.
Feature Summary
Seafile has the following features:
File syncing
- Selective synchronization of file libraries. Each library can be synced separately.
- Correct handling of file conflicts based on history instead of timestamp.
- Only contents not already on the server are transferred, and incomplete transfers can be resumed.
- Sync with two or more servers.
- Sync with existing folders.
- Sync a sub-folder.
File sharing and collaboration
- Sharing libraries between users or into groups.
- Sharing sub-folders between users or into groups.
- Download links with password protection
- Upload links
- Version control with configurable revision number.
- Restoring deleted files from trash, history or snapshots.
Privacy protection
- Library encryption with a user chosen password.
- Client side encryption when using the desktop syncing.
Internal
Seafile's version control model is based on Git, but simplified for automatic synchronization, and it does not need Git installed to run Seafile. Each Seafile library behaves like a Git repository. It has its own unique history, which consists of a list of commits. A commit points to the root of a file system snapshot. The snapshot consists of directories and files. Files are further divided into blocks for more efficient network transfer and storage usage.
Differences from Git:
- Automatic synchronization.
- Clients do not store file history, thus they avoid the overhead of storing data twice. Git is not efficient for larger files such as images.
- Files are further divided into blocks for more efficient network transfer and storage usage.
- File transfer can be paused and resumed.
- Support for different storage backends on the server side.
- Support for downloading from multiple block servers to accelerate file transfer.
- More user-friendly file conflict handling. (Seafile adds the user's name as a suffix to conflicting files.)
- Graceful handling of files the user modifies while auto-sync is running. Git is not designed to work in these cases.
Article: Three Frameworks for Streaming Big Data Processing: Storm, Spark, and Samza
Many distributed computing systems can process big data streams in real time or near-real time.
The article gives a brief introduction to each of the three Apache frameworks, then attempts a quick, high-level overview of their similarities and differences.
Cloudera will release Kudu, a new open-source storage engine: the big-data company is developing it to store and serve large volumes of many kinds of unstructured data.