Alluxio 1.8.1

Official site: http://www.alluxio.org/
I Introduction
Open Source Memory Speed Virtual Distributed Storage
Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed.
In short: Alluxio (formerly Tachyon) is open-source, memory-speed virtual distributed storage that lets applications access data in any storage system as if it were in memory.
1 Advantages
Storage Unification and Abstraction
Alluxio unifies data access to different systems, and seamlessly bridges computation frameworks and underlying storage.

Remote Data Acceleration
Decouple compute and storage without any loss in performance.
That is, compute and storage are decoupled without sacrificing performance.

2 Deployment architecture
Alluxio can be divided into three components: masters, workers, and clients. A typical setup consists of a single leading master, multiple standby masters, and multiple workers. The master and worker processes constitute the Alluxio servers, which are the components a system administrator would maintain. The clients are used to communicate with the Alluxio servers by applications such as Spark or MapReduce jobs, Alluxio command-line, or the FUSE layer.
In short: Alluxio consists of masters and workers; when there are multiple masters, only one is the leading master and the others are standby masters.

3 Roles
Master
The Alluxio master service can be deployed as one leading master and several standby masters for fault tolerance. When the leading master goes down, a standby master is elected to become the new leading master.

1) Leading Master
Only one master process can be the leading master in an Alluxio cluster. The leading master is responsible for managing the global metadata of the system. This includes file system metadata (e.g. the file system inode tree), block metadata (e.g. block locations), and worker capacity metadata (free and used space). Alluxio clients interact with the leading master to read or modify this metadata. All workers periodically send heartbeat information to the leading master to maintain their participation in the cluster. The leading master does not initiate communication with other components; it only responds to requests via RPC services. The leading master records all file system transactions to a distributed persistent storage to allow for recovery of master state information; the set of records is referred to as the journal.
In short: a cluster has only one leading master; it manages all metadata (file system, block, and worker metadata), workers send it periodic heartbeats, and it records every file system transaction to the journal.
2) Standby Masters
Standby masters read journals written by the leading master to keep their own copies of the master state up-to-date. They also write journal checkpoints for faster recovery in the future. They do not process any requests from other Alluxio components.
In short: standby masters keep their state up to date by reading the journal written by the leading master.
Worker
Alluxio workers are responsible for managing user-configurable local resources allocated to Alluxio (e.g. memory, SSDs, HDDs). Alluxio workers store data as blocks and serve client requests that read or write data by reading or creating new blocks within their local resources. Workers are only responsible for managing blocks; the actual mapping from files to blocks is only stored by the master.
In short: workers manage local resources such as memory and SSDs, store data as blocks, and serve client read/write requests; the actual file-to-block mapping lives only on the master.

Because RAM usually offers limited capacity, blocks in a worker can be evicted when space is full. Workers employ eviction policies to decide which data to keep in the Alluxio space.
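For reference, a worker's storage tier, capacity, and eviction policy are all set in conf/alluxio-site.properties. A minimal single-tier sketch (values are illustrative; LRUEvictor is already the 1.8 default, so this mostly makes the defaults explicit):
alluxio.worker.memory.size=8GB
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.evictor.class=alluxio.worker.block.evictor.LRUEvictor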
Client
The Alluxio client provides users a gateway to interact with the Alluxio servers. It initiates communication with the leading master to carry out metadata operations and with workers to read and write data that is stored in Alluxio.
In short: the client first asks the leading master for metadata, then sends read/write requests to the workers.
II Installation
1 Download
$ wget http://downloads.alluxio.org/downloads/files//1.8.1/alluxio-1.8.1-hadoop-2.6-bin.tar.gz
$ tar xvf alluxio-1.8.1-hadoop-2.6-bin.tar.gz
$ cd alluxio-1.8.1-hadoop-2.6
2 Configure passwordless ssh to localhost
That is, ssh localhost should work without a password.
For details see: https://www.cnblogs.com/barneywill/p/10271679.html
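If passwordless ssh is not set up yet, a minimal sketch (assumes a local sshd and an RSA key without a passphrase):
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost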
3 Configuration
$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
$ vi conf/alluxio-site.properties
alluxio.master.hostname=localhost
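Only alluxio.master.hostname is required for a local deployment. Other properties can go in the same file; for example, to cap the worker ramdisk size (illustrative value; the same property is used in the HA section below):
alluxio.worker.memory.size=1GB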
4 Initialization
$ ./bin/alluxio validateEnv local
$ ./bin/alluxio format
5 Start
$ ./bin/alluxio-start.sh local SudoMount
If the start fails with an error like:
Formatting RamFS: /mnt/ramdisk (44849277610)
ERROR: mkdir /mnt/ramdisk failed
you need to grant sudo privileges for the ramdisk mount commands:
# visudo -f /etc/sudoers
$user ALL=(ALL) NOPASSWD: /bin/mount * /mnt/ramdisk, /bin/umount * /mnt/ramdisk, /bin/mkdir * /mnt/ramdisk, /bin/chmod * /mnt/ramdisk
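After a successful start, the deployment can be sanity-checked; a quick sketch (runTests and the 19999 web port come from the 1.8 docs, and the process names are what jps typically shows):
$ jps        # expect AlluxioMaster, AlluxioWorker, AlluxioProxy
$ ./bin/alluxio runTests
The web UI should then be reachable at http://localhost:19999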
III Usage
1 Command line
File system operations
$ ./bin/alluxio fs
$ ./bin/alluxio fs ls /
$ ./bin/alluxio fs copyFromLocal LICENSE /LICENSE
$ ./bin/alluxio fs cat /LICENSE
These look very similar to HDFS commands.
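A few more commonly used subcommands, for reference (running ./bin/alluxio fs with no arguments prints the full list):
$ ./bin/alluxio fs mkdir /tmp
$ ./bin/alluxio fs free /LICENSE      # free the cached blocks from Alluxio space (does not touch the UFS)
$ ./bin/alluxio fs rm /LICENSE        # delete the file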
Admin operations
$ bin/alluxio fsadmin report
Alluxio cluster summary:
Master Address: localhost/127.0.0.1:19998
Web Port: 19999
Rpc Port: 19998
Started: 01-24-2019 10:28:59:433
Uptime: 0 day(s), 1 hour(s), 24 minute(s), and 42 second(s)
Version: 1.8.1
Safe Mode: false
Zookeeper Enabled: false
Live Workers: 1
Lost Workers: 0
Total Capacity: 10.00GB
Tier: MEM Size: 10.00GB
Used Capacity: 9.36GB
Tier: MEM Size: 9.36GB
Free Capacity: 651.55MB
View metrics
$ curl http://$master:19999/metrics/json
2 UFS(Under File Storage)
UFS=LocalFileSystem
1 Default configuration
$ cat conf/alluxio-site.properties
alluxio.underfs.address=${alluxio.work.dir}/underFSStorage
2 Command examples
$ ls ./underFSStorage/
$ ./bin/alluxio fs persist /LICENSE
$ ls ./underFSStorage
LICENSE
With the default configuration, Alluxio uses the local file system as its under file storage (UFS). The default path for the UFS is ./underFSStorage.
Alluxio is currently writing data only into Alluxio space, not to the UFS. Configure Alluxio to persist the file from Alluxio space to the UFS by using the persist command.
In short: by default Alluxio uses the local file system as the UFS, and a file is only persisted to the UFS after the persist command is run.
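If writes should reach the UFS automatically rather than by running persist manually, the default client write type can be changed; a sketch for conf/alluxio-site.properties (MUST_CACHE is the 1.8 default, CACHE_THROUGH writes synchronously to both Alluxio space and the UFS):
alluxio.user.file.writetype.default=CACHE_THROUGH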
UFS=HDFS
1 Configuration
$ cat conf/alluxio-site.properties
alluxio.underfs.address=hdfs://<NAMENODE>:<PORT>/alluxio/data
If you want to accelerate all of the data on HDFS while keeping paths unchanged, point this at the HDFS root directory.
2 Configure Hadoop
1) Symlink
$ ln -s $HADOOP_CONF_DIR/core-site.xml conf/core-site.xml
$ ln -s $HADOOP_CONF_DIR/hdfs-site.xml conf/hdfs-site.xml
Copy or make symbolic links from hdfs-site.xml and core-site.xml from your Hadoop installation into ${ALLUXIO_HOME}/conf
2) Or configure the paths directly
alluxio.underfs.hdfs.configuration=/path/to/hdfs/conf/core-site.xml:/path/to/hdfs/conf/hdfs-site.xml
3 Commands
$ bin/alluxio fs ls /
Now you can see all the directories on HDFS.
4 File mapping
At this point, accessing
alluxio://$alluxio_server:19998/test.log
accesses the underlying storage path
hdfs://$namenode_server/alluxio/data/test.log
Note: the $alluxio_server host and port must be specified here, which is a single point of failure; this is resolved later by the HA deployment.
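Besides replacing the root UFS, an existing HDFS path can also be mounted into a sub-directory of the Alluxio namespace with the mount command; a sketch with hypothetical paths:
$ ./bin/alluxio fs mount /mnt/hdfs hdfs://<NAMENODE>:<PORT>/data
$ ./bin/alluxio fs ls /mnt/hdfs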
3 Access from Spark
1 Preparation (choose one of the two):
1) Configuration
spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar
This Alluxio client jar file can be found at /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar
2) Copy the jar
$ cp client/alluxio-1.8.1-client.jar $SPARK_HOME/jars/
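Alternatively (not one of the two options above, but handy for a quick test), the client jar can be attached when launching the shell; a sketch:
$ spark-shell --jars /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar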
2 Access
$ spark-shell
scala> val s = sc.textFile("alluxio://localhost:19998/derby.log")
s: org.apache.spark.rdd.RDD[String] = alluxio://localhost:19998/derby.log MapPartitionsRDD[1] at textFile at <console>:24
scala> s.foreach(println)
----------------------------------------------------------------
Thu Jan 10 11:05:45 CST 2019:
Reference: http://www.alluxio.org/docs/1.8/en/compute/Spark.html
4 Access from Hive
Copy the jar
$ cp client/alluxio-1.8.1-client.jar $HIVE_HOME/lib/
$ cp client/alluxio-1.8.1-client.jar $HADOOP_HOME/share/hadoop/common/lib/
Restart the metastore and hiveserver2.
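After the restart, Hive tables can point at alluxio:// paths directly; a minimal sketch with a hypothetical table and path:
$ hive -e "CREATE EXTERNAL TABLE alluxio_test (line STRING) LOCATION 'alluxio://localhost:19998/alluxio_test'"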
5 Deployment modes
1 HA cluster deployment
i.e. multiple workers + multiple masters + ZooKeeper
1 Configure ssh connectivity between all cluster servers
Same as above.
2 Configuration
$ cat conf/alluxio-site.properties
#alluxio.master.hostname=<MASTER_HOSTNAME>
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=<ZOOKEEPER_ADDRESS>
alluxio.master.journal.folder=hdfs://$namenode_server/alluxio/journal/
alluxio.worker.memory.size=20GB
Sync the configuration to all servers in the cluster (one way is shown after the masters/workers lists below).
3 Configure masters and workers
$ cat conf/masters
$master1
$master2
$ cat conf/workers
$worker1
$worker2
$worker3
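Once conf/masters and conf/workers are filled in, the configuration directory can be pushed to the listed worker nodes with the copyDir helper (one way to do the sync mentioned above); a sketch:
$ ./bin/alluxio copyDir conf/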
4 Start
$ ./bin/alluxio-start.sh all SudoMount
5 Access
alluxio://zkHost1:2181;zkHost2:2181;zkHost3:2181/path
If the client JVM is started with the following system properties
-Dalluxio.zookeeper.address=zkHost1:2181,zkHost2:2181,zkHost3:2181 -Dalluxio.zookeeper.enabled=true
then paths can be accessed directly as
alluxio:///path
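For example, a Spark client can carry these properties in spark-defaults.conf via extraJavaOptions (a sketch; the ZooKeeper hosts are placeholders):
spark.driver.extraJavaOptions -Dalluxio.zookeeper.address=zkHost1:2181,zkHost2:2181,zkHost3:2181 -Dalluxio.zookeeper.enabled=true
spark.executor.extraJavaOptions -Dalluxio.zookeeper.address=zkHost1:2181,zkHost2:2181,zkHost3:2181 -Dalluxio.zookeeper.enabled=true
after which paths like alluxio:///path resolve through ZooKeeper.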
6 Access from the HDFS client
Copy the jar
$ cp client/alluxio-1.8.1-client.jar $HADOOP_HOME/share/hadoop/common/lib/
Add the following configuration to $HADOOP_CONF_DIR/core-site.xml (the ZooKeeper values should match conf/alluxio-site.properties):
<property>
  <name>alluxio.zookeeper.enabled</name>
  <value>true</value>
</property>
<property>
  <name>alluxio.zookeeper.address</name>
  <value><ZOOKEEPER_ADDRESS></value>
</property>
<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>
Then Alluxio can be accessed through the HDFS client:
$ hadoop fs -ls alluxio:///directory
Reference: http://www.alluxio.org/docs/1.8/en/deploy/Running-Alluxio-On-a-Cluster.html
2 Alluxio on YARN deployment
Alluxio supports several other deployment modes; one of them is Alluxio on YARN. For users who already run workloads like Spark on YARN, this makes it very easy to use Alluxio to accelerate Spark.
For details see:
http://www.alluxio.org/docs/1.8/en/deploy/Running-Alluxio-On-Yarn.html
