Designing a New xlator for GlusterFS (Build and Call-Flow Analysis)

Reposted on 2018-09-09 12:17. Tags: glusterfs, distributed

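GlusterFS is an open-source networked distributed file system. A while ago I read some of the GlusterFS (Gluster) source code and modified part of it, specifically by adding a custom xlator. These are brief notes on that work.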

Gluster and xlators

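As computing technology has developed, data in every field has grown explosively, which gave rise to big-data processing and storage technologies. A single machine can hardly meet the storage needs of large volumes of offline (text) data any more, so networked distributed file systems are receiving more and more attention. There are many open-source distributed file systems, including GlusterFS, Lustre, Ceph, HDFS, and FastDFS. For how these systems are classified and how they differ, see the links in the references; I find the split into block, file, and object storage the most sensible. I originally did research on storage for high-performance computing and ended up working with Gluster more or less by accident; I will not go into the reasons here.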

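Gluster is a user-space file system built on FUSE, which means that building and installing Gluster does not require touching the kernel. Gluster's use of FUSE is fairly blunt: it reads the /dev/fuse device (a character device, not a block device) in an endless loop and hands each request to the client stack or the network. The principles behind FUSE are still worth studying, and I only half understand them myself. Gluster is POSIX compatible, which means that standard I/O functions such as read and write work on Gluster without modification.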

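The usual pattern for a networked distributed file system is that several server machines form a unified namespace, and files are placed on different servers according to some distribution scheme. The client, however, sees a single whole, such as one directory, and the volume can be mounted on many different nodes, so you can reach your data from anywhere. Each file system introduces its own concepts to implement this, but the essence is the same. Gluster has a few basic concepts: a brick is an exported directory on a storage node; a volume is a collection of bricks that represents a functional subset; a translator (xlator) connects sub-volumes, and an xlator is itself the concrete implementation of some volume.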

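Gluster supports several data distribution modes:

Distributed (the default)
Each file is stored on a single brick, and different files may land on different bricks. There is no fault tolerance. The unit of placement is the file.
Replicated
A copy of every file is stored on every brick; the replica count can be set in the configuration file. The unit of placement is the file.
Distributed Replicated volume
A combination of the previous two. The number of bricks is a multiple of the replica count: with N bricks and a replica count of 2, there are N/2 distribute subvolumes, and adjacent pairs of bricks mirror each other. Files are distributed first, then replicated.
Striped Volume
A file is split into fixed-size chunks that are placed on different servers in round-robin (RR) order. The unit of placement is the file chunk.
Distributed Striped volume
Unlike plain striping, a file is striped only across a particular set of bricks; in effect, distribute first, then stripe.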

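The xlator is the essence of Gluster's design. Every piece of functionality can be implemented as an xlator: each distribution rule corresponds to one xlator, and other features, such as file encryption, can each be packaged in an xlator of their own. xlators can be mixed and nested freely in the configuration file. Each xlator compiles to a shared library that is loaded on demand at run time.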

As an example, here is the configuration file for the fifth mode above, Distributed Striped Volume:

### Add client feature and attach to remote subvolume

##  client 1
volume client1
  type protocol/client
  option transport-type tcp/client
  option remote-host  10.0.0.1 # IP address of the remote brick
  option remote-port 6996 # default server port is 6996
  option remote-subvolume brick1 # name of the remote volume
end-volume

## client 2
volume client2
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.2
  option remote-port 6996
  option remote-subvolume brick2
end-volume

## client 3
volume client3
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.3
  option remote-port 6996
  option remote-subvolume brick3
end-volume

## client 4
volume client4
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.4
  option remote-port 6996
  option remote-subvolume brick4
end-volume

#stripe, subvolume is clients
volume stripe1
  type cluster/stripe
  subvolumes client1 client2
end-volume

volume stripe2
  type cluster/stripe
  subvolumes client3 client4
end-volume

#distribute, subvolume is stripes
volume dht
  type cluster/distribute
  subvolumes stripe1 stripe2
end-volume

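The configuration file describes a tree of xlators whose root is fuse_xlator_t. When the configuration is initialized, the tree is initialized depth-first from the root toward the leaves. The order in which the configuration file is written is the reverse of the order in which Gluster reads it; for example, dht is read first. The client-side configuration is more complex than the server side, where it is enough to specify which directory to export.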

[Figure: xlator-tree]

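As the figure shows, the left side is the nesting of xlators on the client. After fuse is initialized it initializes its subvolume, that is, it calls dht's init function, and so on down the tree until initialization is complete. Likewise, when FUSE receives an I/O request, it wraps the request and passes it to dht; dht works out where the file lives and passes the request to one of its subvolumes; and so on, until the client xlator sends the request over the network to the corresponding server. The server processes the request through the same kind of nesting and finally delivers it to the posix xlator, which wraps the underlying system calls such as read and write. That is the execution flow of the whole Gluster system.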

[Figure: xlator-type]

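The figure above, taken from the official documentation, lists all the xlator types; each can be found under the xlators/ directory of the source tree.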

Calls (STACK_WIND) and callbacks (STACK_UNWIND) in xlators

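Communication between xlators at different levels is somewhat like recursion and relies on two macros in the code, STACK_WIND and STACK_UNWIND. Each operation in an xlator comes as a pair: the write function, for example, has a matching write_cbk function, and the two functions are used together with the two macros.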

[Figure: stack_wind_unwind]

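Let us simplify the xlator hierarchy to three layers, FUSE, DHT, and POSIX, related as shown on the left of the figure above. Suppose the system reads a write request from /dev/fuse. It hands the request to the FUSE xlator, which calls fuse_write; fuse_write uses STACK_WIND to pass the request to its subvolume by calling the subvolume's write function, dht_write. dht_write in turn uses STACK_WIND to call posix_write. In the figure, posix is the bottom xlator, so posix_write does not call STACK_WIND. Instead it calls STACK_UNWIND to return the result to the parent volume, which invokes the parent's _cbk function, dht_write_cbk. After doing its own processing, that function calls STACK_UNWIND again to return to fuse_write_cbk, and only then is the write complete.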

Calling (STACK_WIND)

Next, let us look at how STACK_WIND works. Here is its macro definition.

/* make a call */
#define STACK_WIND(frame, rfn, obj, fn, params ...)                        \
        do {                                                                \
                call_frame_t *_new = NULL;                                \
                xlator_t     *old_THIS = NULL;                          \
                                                                        \
                _new = CALLOC (1, sizeof (call_frame_t));                \
                ERR_ABORT (_new);                                        \
                typeof(fn##_cbk) tmp_cbk = rfn;                                \
                _new->root = frame->root;                                \
                _new->next = frame->root->frames.next;                        \
                _new->prev = &frame->root->frames;                        \
                if (frame->root->frames.next)                                \
                        frame->root->frames.next->prev = _new;                \
                frame->root->frames.next = _new;                        \
                _new->this = obj;                                        \
                _new->ret = (ret_fn_t) tmp_cbk;                                \
                _new->parent = frame;                                        \
                _new->cookie = _new;                                        \
                LOCK_INIT (&_new->lock);                                \
                frame->ref_count++;                                        \
                                                                        \
                old_THIS = THIS;                                        \
                THIS = obj;                                             \
                fn (_new, obj, params);                                        \
                THIS = old_THIS;                                        \
        } while (0)

In the dht xlator, dht_writev calls STACK_WIND like this:

STACK_WIND (frame, dht_writev_cbk,
                    subvol, subvol->fops->writev,
                    fd, vector, count, off, iobref);

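Substituting the arguments into the macro, the call can be read as the following code:

// argument bindings after substitution
obj = subvol;
fn = subvol->fops->writev;
params = fd, vector, count, off, iobref;
// the new frame belongs to the subvolume, and its parent is dht's frame
_new->this = subvol;
_new->parent = frame;
// remember the callback to run on unwind
typeof(fn##_cbk) tmp_cbk = dht_writev_cbk;
_new->ret = (ret_fn_t) tmp_cbk;
// call subvol->fops->writev with the new frame
fn (_new, obj, params);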

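STACK_WIND does three small things. First, it carries the context between calls, which in the code is the frame data structure. Second, it records the callback for the current function; this is usually the matching cbk function, but there are exceptions, such as the lookup operation in the dht xlator, whose logic is more involved (an analysis of the call flow in dht, Gluster's core xlator, is linked in the references). Third, it calls the corresponding function of the subvolume, passing the operation downward.

So in our simplified hierarchy, where dht's subvolume is posix, the STACK_WIND above calls posix_writev: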

posix_writev (call_frame_t *frame, xlator_t *this,
              fd_t *fd, struct iovec *vector, int32_t count,
              off_t offset, struct iobref *iobref)

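Callback (STACK_UNWIND)

Now for how STACK_UNWIND works. Gluster has two unwind macros, STACK_UNWIND and STACK_UNWIND_STRICT. They differ only in the first argument; the principle is the same. The xlator sources usually use STACK_UNWIND_STRICT because, as the comment on the macro definition says, it is type safe. Here is the definition of STACK_UNWIND_STRICT.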

/* return from function in type-safe way */
#define STACK_UNWIND_STRICT(op, frame, params ...)                      \
        do {                                                                \
                fop_##op##_cbk_t      fn = NULL;                        \
                call_frame_t *_parent = NULL;                           \
                xlator_t     *old_THIS = NULL;                          \
                                                                        \
                fn = (fop_##op##_cbk_t )frame->ret;                     \
                _parent = frame->parent;                                \
                _parent->ref_count--;                                        \
                old_THIS = THIS;                                        \
                THIS = _parent->this;                                   \
                frame->complete = _gf_true;                             \
                fn (_parent, frame->cookie, _parent->this, params);        \
                THIS = old_THIS;                                        \
        } while (0)

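As mentioned earlier, if your xlator is at the bottom of the stack (the client xlator on the client side, the posix xlator on the server side), it should not contain any xxx_cbk functions; instead it calls STACK_UNWIND or STACK_UNWIND_STRICT before the operation returns. The posix xlator uses STACK_UNWIND_STRICT to return to the parent volume. The first statement of STACK_UNWIND_STRICT pastes the first argument into the name of the callback type (fop_##op##_cbk_t). Here is the call in posix: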

STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, &preop, &postop);

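Substituting the arguments into the macro, the call can be read as the following code:

// frame is the _new frame that STACK_WIND created in dht_writev,
// so frame->ret holds dht_writev_cbk
fn = (fop_writev_cbk_t)frame->ret;
// _parent is dht's frame
_parent = frame->parent;
// so the call below is effectively
// dht_writev_cbk(dht_frame, posix_frame, dht, params);
// the second argument (the cookie) is ignored by dht_writev_cbk
fn (_parent, frame->cookie, _parent->this, params);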

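After substitution we can see that the macro first switches to the correct context, the parent volume's frame, and the last statement then calls the parent volume's cbk function, dht_writev_cbk. dht_writev_cbk goes on to call STACK_UNWIND_STRICT itself, and in this way the result travels back to the root xlator.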

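Building the project: configure and make

Adding an xlator to Gluster changes the structure of the source tree, so to get the xlator working you need to know a little about the automated build. First, the layout of the Gluster source: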

[Figure: code-tree]

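Gluster provides a default xlator template called defaults, in libglusterfs/src/defaults.c. It defines the basic file system operations, including the cbk functions, but does no real work: each function contains only the basic STACK_WIND and STACK_UNWIND calls. Suppose we want to add a new distribution rule called dadada under cluster. We create our own directory and source files in the corresponding location, as shown: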

[Figure: dadada]

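Next, the new xlator has to be added to the build. For this I read up on how large C projects are built, though only to the point of being able to modify an existing setup. The basic approach is to run make against a Makefile, and GNU provides a set of tools that generate the Makefile for us. The process is shown below (original figure):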

[Figure: configure]

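First, autoscan scans the source code and produces configure.scan, a template that is renamed to configure.ac (or configure.in). We usually do not need this step, since open-source projects generally ship a configure.ac. Next, aclocal generates aclocal.m4 from configure.ac, autoconf generates the configure script, and automake -a generates Makefile.in. Gluster's autogen.sh script runs these three steps for us. The generated configure script is executable; fix its permissions if necessary. Running configure turns each Makefile.in into the Makefile we need.

So for the build we normally supply only two kinds of files, configure.ac (configure.in) and Makefile.am; everything else is generated by the tools. For dadada we therefore make the following changes:

1 Edit configure.ac and add cluster/dadada and cluster/dadada/src, following the pattern of the existing entries. Note that more than the one place shown in the figure below needs changing.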

[Figure: configure]

2 Edit the Makefile.am files: cluster/Makefile.am, cluster/dadada/Makefile.am, and cluster/dadada/src/Makefile.am. This is mostly a matter of changing names.

# cluster/Makefile.am
SUBDIRS = stripe afr dht ec dadada

CLEANFILES = 

# cluster/dadada/Makefile.am
SUBDIRS = src

# cluster/dadada/src/Makefile.am  (adjust to match your source layout)

xlator_LTLIBRARIES = dadada.la
xlatordir = $(libdir)/glusterfs/$(PACKAGE_VERSION)/xlator/cluster

dadada_la_SOURCES = dadada.c 

dadada_la_LDFLAGS = -module -avoid-version
dadada_la_LIBADD = $(top_builddir)/libglusterfs/src/libglusterfs.la

noinst_HEADERS = dadada.h

AM_CFLAGS = -fPIC -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -Wall -fno-strict-aliasing -D$(GF_HOST_OS) \
	-I$(top_srcdir)/liblwfs/src $(GF_CFLAGS) -shared -nostartfiles

CLEANFILES = 

uninstall-local:
	rm -f $(DESTDIR)$(xlatordir)/dadada.so

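3 Run autogen.sh, which uses aclocal and autoconf to generate configure, and automake to generate Makefile.in
4 Run configure, which generates the Makefiles from the Makefile.in files
5 make && make install

Configuration file and other notes

Once the build is done you can write your own configuration file for testing. A sample configuration: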

##  client 1
volume client1
  type protocol/client
  option transport-type tcp/client
  option remote-host  10.0.0.1 # IP address of the remote brick
  option remote-port 6996 # default server port is 6996
  option remote-subvolume brick1 # name of the remote volume
end-volume

## client 2
volume client2
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.2
  option remote-port 6996
  option remote-subvolume brick2
end-volume

#dadada, subvolume is clients
volume dadada
  type cluster/dadada
  subvolumes client1 client2
end-volume

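That is all the background on adding an xlator that I can recall. Writing an xlator is of course very hard, and there is a great deal to understand and study. One piece of advice: put operations that involve logic in the cbk functions where possible, and write nested calls before STACK_UNWIND.

Gluster is also not widely used in high-performance computing. Every file system has trade-offs. One of Gluster's distinguishing features is that it has no central metadata server: a file is located by computing a hash of its name, so Gluster avoids the metadata bottleneck common in distributed file systems. Queries that do not name a specific file, such as ls, perform poorly, however, because every brick has to be scanned. For a discussion of Gluster performance I recommend the blog series linked in the references, and for debugging during Gluster development see the references as well.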

That is all.

References
https://www.gluster.org/
http://lustre.org/
http://ceph.com/
https://hadoop.apache.org/docs/r1.0.4/cn/hdfs_design.html
https://github.com/happyfish100/fastdfs
https://turodj.gitbooks.io/those-things-about-architecture/content/cun_chu.html
http://gluster.readthedocs.org/en/latest/Administrator Guide/glossary/
http://lidawn.github.io/2015/04/30/parallel-io-basic/
http://blog.csdn.net/liuhong1123/article/details/8118258
https://www.ibm.com/developerworks/cn/linux/l-makefile/
http://blog.csdn.net/liuaigui/article/details/6284551
http://pl.atyp.us/hekafs.org/index.php/2011/11/

Original article: https://lidawn.github.io/2016/11/28/glusterfs-xlator/
