BlueStore Research Notes

Reposted article; original published 2019-12-11 11:14. Tag: ceph

References:

1. BlueStore OSD Creation and Startup Explained (深入淺出BlueStore的OSD創建與啟動)

https://mp.weixin.qq.com/s?src=11&timestamp=1578463611&ver=2083&signature=riJ9hFMjhhUN2eFr0kzFFuVglj9KMXD*Qr1NjFLx4N9OumHp-4IYiqu6Q7G2RVdgeliVG68mbkKwfjWlMoTNGa6ISEHe2WHyz78uQqxHwAw9YxC8QyTo33sX8t-IdzH1&new=1

2. Ceph BlueStore Deployment in Practice (Ceph Bluestore 部署實踐)

https://mp.weixin.qq.com/s?src=11&timestamp=1578463592&ver=2083&signature=tvnRK5Ur-Be3JMDodTIUJJAfbiOfZ24P2cxE5KV-8Dm-xiGKmHcRYiLvJfYcxiw6K8CA5QHNekJzWDf6GQX-vswx-fpW7Vpu*GcuxROL7g1aSapph2sCR6s3B7RqIB7F&new=1

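The read_meta and write_meta calls that appear throughout the code actually read from and write to the block file under /var/lib/ceph/osd/ceph-x/.

/var/lib/ceph/osd/ceph-0/ contains the files block, block.db, block.wal, ceph_fsid, fsid, keyring, ready, type, and whoami.

Of these, block, block.db, and block.wal are symbolic links: block points to the data space, block.db points to the space where RocksDB stores its database, and block.wal points to the space where the log is stored.

Open, read/write, and close operations on block, block.db, and block.wal are actually carried out through the underlying xfs.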

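With BlueFS introduced, BlueStore divides storage space into three tiers:

1. Slow space: stores object data; it can be built from ordinary hard disk drives and is managed by BlueStore.

2. Fast (DB) space: built from ordinary SSDs and used for the metadata BlueStore generates internally. BlueStore keeps this metadata in RocksDB, and RocksDB ultimately stores its data in BlueFS, so this space is managed by BlueFS.

3. Ultra-fast (WAL) space: stores the .log files RocksDB generates internally, on NVMe or another device with lower latency than an ordinary SSD. The .log files are also stored in BlueFS, so this space is managed by BlueFS as well.

When the free space available to BlueFS falls below a certain ratio of the free space available to BlueStore, BlueFS is given a share of BlueStore's space; by the same logic, that space can also be reclaimed.

BlueFS is itself a journaling file system and therefore produces its own log data. Its data and .log files prefer the WAL space; when WAL runs out they use the DB space, and when DB runs out they use the slow space. The slow space is managed by BlueStore, which records the portions handed to BlueFS in the bluefs_extents structure. bluefs_extents is a set in which each member corresponds to one range of the slow space; every update (an update of the Allocator) is stored as BlueStore metadata. On a later startup this information is read back so that the Allocator is initialized correctly.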
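The WAL, then DB, then slow fallback order can be sketched as below. This is only an illustration under simplifying assumptions, not BlueFS code: the `Device` struct, its `free_bytes` counter, and `allocate_with_fallback` are hypothetical names, and a real allocator tracks extents rather than a single byte count. Only the BDEV_WAL / BDEV_DB / BDEV_SLOW indices come from BlueFS.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Device indices as used by BlueFS: WAL = 0, DB = 1, SLOW = 2.
enum : int { BDEV_WAL = 0, BDEV_DB = 1, BDEV_SLOW = 2, MAX_BDEV = 3 };

// Hypothetical per-device free-space counter (an assumption for this sketch).
struct Device {
  uint64_t free_bytes = 0;
};

// Try the preferred device first, then each slower tier in turn.
// Returns the index of the device that satisfied the request, or -1 if none could.
inline int allocate_with_fallback(std::array<Device, MAX_BDEV>& devs,
                                  int preferred, uint64_t want) {
  assert(preferred >= BDEV_WAL && preferred < MAX_BDEV);
  for (int id = preferred; id < MAX_BDEV; ++id) {
    if (devs[id].free_bytes >= want) {
      devs[id].free_bytes -= want;  // take the space from this tier
      return id;
    }
  }
  return -1;  // no tier has enough free space
}
```

A request that prefers BDEV_WAL lands on BDEV_DB once the WAL device is full, and on BDEV_SLOW once both are full.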

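Reserved space: 0~8192

0~4096: reserved for the label

osd_uuid	OSD the block device belongs to
size	size of the block device
btime	time the label was created
description	description of the label

4096~8192: reserved for BlueFS to store its own superblock

The BlueFS superblock is stored in the second 4K range of its DB space and contains:

uuid	UUID associated with BlueFS (fsid?)
osduuid	OSD associated with BlueFS
block_size	block size of the device backing DB/WAL
log_fnode	fnode of the log file
version	current version of the superblock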
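The reserved layout can be written down as a few constants. The sketch below only restates the two 4K ranges listed above; the constant names, the `Region` enum, and `classify` are hypothetical and are not identifiers taken from the Ceph source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the two reserved 4K ranges.
constexpr uint64_t kLabelOffset = 0;          // label occupies [0, 4096)
constexpr uint64_t kSuperblockOffset = 4096;  // BlueFS superblock occupies [4096, 8192)
constexpr uint64_t kReservedEnd = 8192;       // end of the reserved area

enum class Region { Label, BlueFSSuperblock, Unreserved };

// Classify a byte offset on the device according to the reserved layout.
constexpr Region classify(uint64_t off) {
  return off < kSuperblockOffset ? Region::Label
       : off < kReservedEnd      ? Region::BlueFSSuperblock
                                 : Region::Unreserved;
}

static_assert(kSuperblockOffset - kLabelOffset == 4096, "label range is 4K");
static_assert(kReservedEnd - kSuperblockOffset == 4096, "superblock range is 4K");
```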

BlueFS management:

* There are up to 3 block devices:

* BDEV_DB db/ - the primary db device

* BDEV_WAL db.wal/ - a small, fast device, specifically for the WAL

* BDEV_SLOW db.slow/ - a big, slow device, to spill over to as BDEV_DB fills

vector<BlockDevice*> bdev; ///< block devices we can use. One entry per device type, indexed as BDEV_WAL 0, BDEV_DB 1, BDEV_SLOW 2 (at most 3 entries); records the devices BlueFS manages

vector<IOContext*> ioc; ///< IOContexts for bdevs. Per-device-type I/O context, used together with bdev

vector<interval_set<uint64_t> > block_all; ///< extents in bdev we own. All block ranges owned on each device

vector<uint64_t> block_total; ///< sum of block_all. Total capacity per device

vector<Allocator*> alloc; ///< allocators for bdevs. The space allocator for each device

vector<uint64_t> alloc_size; ///< alloc size for each device. The allocation unit size for each device

vector<interval_set<uint64_t>> pending_release; ///< extents to release

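Block device management:

1. mkfs: persists certain configuration options to disk. The reason is that different BlueStore configuration options produce different on-disk data layouts (for example, data is organized differently for SSD and NVMe). To keep data from being corrupted by starting up with a different configuration than before, the configuration is persisted and read back from disk on the next startup so that it stays consistent.

os_type	ObjectStore type: filestore or bluestore
fsid	uniquely identifies a BlueStore instance
freelist_type	type of the FreelistManager. BlueStore persists the entire free list in the kv store, so allowing the FreelistManager type to change dynamically would make it impossible to read the free list back from the kv DB on startup
kv_backend	which kv store to use; currently leveldb or rocksdb
bluefs	if the kv store is rocksdb, use BlueFS in place of the local file system interface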

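2. mount: when the OSD process starts, mount performs the pre-start checks and preparation:

2.1 Check the ObjectStore type: it was written to disk at mkfs time, so mount reads it back and verifies that it matches.

2.2 fsck: check for corruption.

2.3 Load and lock the fsid.

2.4 Load the main block device.

2.5 Load the database and read the metadata:

nid_max	the smallest nid not yet allocated in this BlueStore; new objects are assigned nids starting from the current nid_max
blobid_max	globally unique; its exact role is not yet clear to me
freelist_type	type of the FreelistManager
min_alloc_size	minimum space allocation unit configured by BlueStore
bluefs_extents	extra space shared from the main device to BlueFS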
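The "persist at mkfs, verify at mount" idea can be sketched with an in-memory map standing in for the on-disk metadata. This is an assumption-laden illustration: `PersistedMeta`, `mkfs_persist`, and `mount_check` are hypothetical names, and real BlueStore stores these values on the device and in the kv store rather than in a `std::map`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the settings that mkfs writes to disk.
using PersistedMeta = std::map<std::string, std::string>;

// mkfs: record the settings that must never change for this store.
inline PersistedMeta mkfs_persist(const std::string& type,
                                  const std::string& fsid,
                                  const std::string& freelist_type) {
  assert(!fsid.empty());  // a store without an fsid cannot be identified later
  return {{"type", type}, {"fsid", fsid}, {"freelist_type", freelist_type}};
}

// mount: a setting passes only if it was persisted and matches what is expected.
inline bool mount_check(const PersistedMeta& meta,
                        const std::string& key,
                        const std::string& expected) {
  auto it = meta.find(key);
  return it != meta.end() && it->second == expected;
}
```

A mount that expects type "bluestore" passes against metadata written with that type and fails if the persisted value differs or is missing.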

Data structures

Basic data structures:

/// an in-memory object: data, extended attributes, omap header, omap entries
  struct Onode {
    MEMPOOL_CLASS_HELPERS();
    // Not persisted and updated on cache insertion/removal
    OnodeCacheShard *s;
    bool pinned = false; // Only to be used by the onode cache shard

    std::atomic_int nref;  ///< reference count
    Collection *c;       // PG information
    ghobject_t oid;

    /// key under PREFIX_OBJ where we are stored
    mempool::bluestore_cache_other::string key;

    boost::intrusive::list_member_hook<> lru_item, pin_item;
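    // on-disk onode structure
    bluestore_onode_t onode;  ///< metadata stored as value in kv store
    bool exists;              ///< true if object logically exists
    // ordered set of logical Extents, persisted in RocksDB; lextent --> blob
    // sparse writes are supported, so extents in the extent map may be non-contiguous, i.e. holes exist where one extent ends before the next one begins
    // too many extents in one object make the ExtentMap very large and badly hurt RocksDB access efficiency, so the map is sharded (shard_info) and adjacent small extents are merged
    // the benefit is that shards can be loaded on demand, reducing memory usage
    // space management: holds multiple extents; each extent manages one logical range of the object and references a Blob, and a Blob holds multiple pextents that finally map the object's data onto disk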
    ExtentMap extent_map;

    // track txc's that have not been committed to kv store (and whose
    // effects cannot be read via the kvdb read methods)
    std::atomic<int> flushing_count = {0};
    std::atomic<int> waiting_count = {0};
    /// protect flush_txns
    ceph::mutex flush_lock = ceph::make_mutex("BlueStore::Onode::flush_lock");
    ceph::condition_variable flush_cond;   ///< wait here for uncommitted txns
    ......

  };

// ExtentMap structure

/// a sharded extent map, mapping offsets to lextents to blobs
struct ExtentMap {
  Onode *onode;
  extent_map_t extent_map; ///< map of Extents to Blobs
  blob_map_t spanning_blob_map; ///< blobs that span shards
  typedef boost::intrusive_ptr<Onode> OnodeRef;

  ......

};

// Extent structure: essentially a logical offset, a blob offset, and a length

struct Extent : public ExtentBase {
  MEMPOOL_CLASS_HELPERS();

  uint32_t logical_offset = 0; ///< logical offset
  uint32_t blob_offset = 0; ///< blob offset
  uint32_t length = 0; ///< length
  BlobRef blob; ///< the blob with our data

  ......

};

// On-disk data structures
/// onode: per-object metadata
struct bluestore_onode_t {
  // logical ID, unique within a single BlueStore
  uint64_t nid = 0;                    ///< numeric id (locally unique)
  // size of the object
  uint64_t size = 0;                   ///< object size
  // extended attributes of the object
  map<mempool::bluestore_cache_other::string, bufferptr> attrs;        ///< attrs
  ......

};

/// pextent: physical extent
// offset: physical offset on disk, block-size aligned
// length: length of the data range, block-size aligned
struct bluestore_pextent_t : public bluestore_interval_t<uint64_t, uint32_t> 
{
  bluestore_pextent_t() {}
  bluestore_pextent_t(uint64_t o, uint64_t l) : bluestore_interval_t(o, l) {}
  bluestore_pextent_t(const bluestore_interval_t &ext) :
    bluestore_interval_t(ext.offset, ext.length) {}

  DENC(bluestore_pextent_t, v, p) {
    denc_lba(v.offset, p);
    denc_varint_lowz(v.length, p);
  }

  void dump(Formatter *f) const;
  static void generate_test_instances(list<bluestore_pextent_t*>& ls);
};
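To make the lextent to blob to pextent chain concrete, here is a simplified sketch of how a logical offset inside an object could be resolved to a disk offset. It is an illustration only: `PExtent`, `Blob`, `LExtent`, `SimpleExtentMap`, and `logical_to_disk` are hypothetical, stripped-down stand-ins, and the real ExtentMap adds sharding, reference counting, compression, and more.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct PExtent { uint64_t offset; uint32_t length; };  // physical range on disk
struct Blob { std::vector<PExtent> pextents; };        // a blob is a list of pextents
struct LExtent {                                       // logical range inside the object
  uint32_t logical_offset;
  uint32_t blob_offset;
  uint32_t length;
  const Blob* blob;
};

// Simplified extent map keyed by logical_offset; gaps between entries are holes.
using SimpleExtentMap = std::map<uint32_t, LExtent>;

// Returns true and sets *disk_off when the offset is mapped; false for a hole.
inline bool logical_to_disk(const SimpleExtentMap& em, uint32_t off,
                            uint64_t* disk_off) {
  assert(disk_off != nullptr);
  auto it = em.upper_bound(off);       // first extent that starts after off
  if (it == em.begin()) return false;  // off lies before every extent
  --it;                                // last extent starting at or before off
  const LExtent& e = it->second;
  const uint64_t end = static_cast<uint64_t>(e.logical_offset) + e.length;
  if (off >= end) return false;        // off falls into a hole
  uint64_t in_blob = e.blob_offset + (off - e.logical_offset);
  for (const PExtent& p : e.blob->pextents) {
    if (in_blob < p.length) {          // the offset lands inside this pextent
      *disk_off = p.offset + in_blob;
      return true;
    }
    in_blob -= p.length;               // skip past this pextent
  }
  return false;                        // blob is shorter than the extent claims
}
```

The lookup walks from the object's logical address space into the blob's address space and then into the physical extents, which is the same direction the comments on extent_map describe.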

CollectionHandle& ch

Collection *c = static_cast<Collection*>(ch.get());

OpSequencer *osr = c->osr.get();

 /**
   * a collection also orders transactions
   *
   * Any transactions queued under a given collection will be applied in
   * sequence. Transactions queued under different collections may run
   * in parallel.
   *
   * ObjectStore users may get collection handles with open_collection() (or,
   * for bootstrapping a new collection, create_new_collection()).
   */
  struct CollectionImpl : public RefCountedObject {
    const coll_t cid;

    /// wait for any queued transactions to apply
    // block until any previous transactions are visible.  specifically,
    // collection_list and collection_empty need to reflect prior operations.
    virtual void flush() = 0;

    /**
     * Async flush_commit
     *
     * There are two cases:
     * 1) collection is currently idle: the method returns true.  c is
     *    not touched.
     * 2) collection is not idle: the method returns false and c is
     *    called asynchronously with a value of 0 once all transactions
     *    queued on this collection prior to the call have been applied
     *    and committed.
     */
    virtual bool flush_commit(Context *c) = 0;

    const coll_t &get_cid() {
      return cid;
    }
  protected:
    CollectionImpl() = delete;
    CollectionImpl(CephContext* cct, const coll_t& c) : RefCountedObject(cct), cid(c) {}
    ~CollectionImpl() = default;
  };

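In the STATE_PREPARE stage, the state machine checks whether there are still unsubmitted IOs. If there are, it sets the state to STATE_AIO_WAIT, calls _txc_aio_submit to submit the IO, and exits the state machine; when the aio later completes, the callback txc_aio_finish re-enters the state machine. Otherwise it moves straight on to the STATE_AIO_WAIT state.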

case TransContext::STATE_PREPARE:
      throttle.log_state_latency(*txc, logger, l_bluestore_state_prepare_lat);
      if (txc->ioc.has_pending_aios()) {   // if there are unsubmitted IOs, set the state to STATE_AIO_WAIT and submit them
        txc->state = TransContext::STATE_AIO_WAIT;
        txc->had_ios = true;     // update txc
        _txc_aio_submit(txc);    // submit the IO
        return;
      }

void BlueStore::_txc_aio_submit(TransContext *txc)
{
  dout(10) << __func__ << " txc " << txc << dendl;
  bdev->aio_submit(&txc->ioc);
}

bdev = BlockDevice::create(cct, p, aio_cb, static_cast<void*>(this), discard_cb, static_cast<void*>(this));

static void aio_cb(void *priv, void *priv2)
{
  BlueStore *store = static_cast<BlueStore*>(priv);
  BlueStore::AioContext *c = static_cast<BlueStore::AioContext*>(priv2);
  c->aio_finish(store);
}

void aio_finish(BlueStore *store) override {
      store->txc_aio_finish(this);
}

void txc_aio_finish(void *p) {
    _txc_state_proc(static_cast<TransContext*>(p));
}
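The callback chain above (device, then aio_cb, then aio_finish, then txc_aio_finish) relies on two opaque pointers registered with the block device. The sketch below reproduces only that shape with hypothetical stand-ins: `Store`, `Txc`, and `FakeBlockDevice` are not Ceph classes, and the counter replaces the real call into `_txc_state_proc`.

```cpp
#include <cassert>

struct Store;  // stand-in for BlueStore

// Stand-in for BlueStore::AioContext: something notified when its aio completes.
struct AioContext {
  virtual void aio_finish(Store* store) = 0;
  virtual ~AioContext() = default;
};

struct Store {
  int state_proc_calls = 0;  // counts re-entries into the state machine
  void txc_aio_finish(void*) { ++state_proc_calls; }
};

// Stand-in for TransContext.
struct Txc : AioContext {
  void aio_finish(Store* store) override { store->txc_aio_finish(this); }
};

// C-style callback taking two opaque pointers, the same shape as aio_cb.
using aio_callback_t = void (*)(void* priv, void* priv2);

inline void aio_cb(void* priv, void* priv2) {
  Store* store = static_cast<Store*>(priv);
  AioContext* c = static_cast<AioContext*>(priv2);
  c->aio_finish(store);
}

// Hypothetical device that remembers the callback given at creation time.
struct FakeBlockDevice {
  aio_callback_t cb;
  void* cb_priv;
  // Invoked when an aio finishes; ioc_priv must point to an AioContext.
  void complete(void* ioc_priv) {
    assert(cb != nullptr);
    cb(cb_priv, ioc_priv);
  }
};
```

The device knows nothing about transactions; it only hands the two pointers back, and the virtual aio_finish decides how to re-enter the state machine.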

STATE_AIO_WAIT stage:

 case TransContext::STATE_AIO_WAIT:  // preserve IO ordering; wait for the AIO to complete
      {
        mono_clock::duration lat = throttle.log_state_latency(
          *txc, logger, l_bluestore_state_aio_wait_lat);
        if (ceph::to_seconds<double>(lat) >= cct->_conf->bluestore_log_op_age) {
              dout(0) << __func__ << " slow aio_wait, txc = " << txc
                  << ", latency = " << lat
                  << dendl;
            }
      }

      _txc_finish_io(txc);  // may trigger blocked txc's too
      return;

void BlueStore::_txc_finish_io(TransContext *txc)
{
  dout(20) << __func__ << " " << txc << dendl;

  /*
   * we need to preserve the order of kv transactions,
   * even though aio will complete in any order.
   */

  OpSequencer *osr = txc->osr.get();
  std::lock_guard l(osr->qlock);
  txc->state = TransContext::STATE_IO_DONE;  // update the state
  txc->ioc.release_running_aios();
  OpSequencer::q_list_t::iterator p = osr->q.iterator_to(*txc);
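  while (p != osr->q.begin()) {
    --p;
    // make sure every earlier txc in q has finished its IO; this enforces in-order completion, because each of them also calls _txc_finish_io when it finishes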
    if (p->state < TransContext::STATE_IO_DONE) {
      dout(20) << __func__ << " " << txc << " blocked by " << &*p << " "
           << p->get_state_name() << dendl;
      return;
    }
    if (p->state > TransContext::STATE_IO_DONE) {
      ++p;
      break;
    }
  }
  do {
    _txc_state_proc(&*p++);  // re-enter the state machine
  } while (p != osr->q.end() &&
       p->state == TransContext::STATE_IO_DONE);

  if (osr->kv_submitted_waiters) {
    osr->qcond.notify_all();
  }
}
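The ordering guarantee of _txc_finish_io can be tried out in isolation. The sketch below mirrors the backward walk and the forward do/while loop on a plain `std::list`; it is a simplified model under stated assumptions, not BlueStore code: `Txc`, `Sequencer`, and `kv_order` are hypothetical, there is no locking, and setting the state to `STATE_KV_QUEUED` stands in for the call to `_txc_state_proc`.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Only the relative order of these three states matters for the sketch.
enum State { STATE_AIO_WAIT, STATE_IO_DONE, STATE_KV_QUEUED };

struct Txc { int id; State state; };

struct Sequencer {
  std::list<Txc> q;           // transactions in submission order
  std::vector<int> kv_order;  // ids in the order they advance past IO_DONE

  // Called when the aio of *it completes, in whatever order the device finishes.
  void finish_io(std::list<Txc>::iterator it) {
    assert(it != q.end());
    it->state = STATE_IO_DONE;
    auto p = it;
    while (p != q.begin()) {
      --p;
      if (p->state < STATE_IO_DONE) return;          // an earlier txc still waits on IO
      if (p->state > STATE_IO_DONE) { ++p; break; }  // earlier txcs already advanced
    }
    // Advance this txc plus any later ones that were blocked behind it.
    do {
      p->state = STATE_KV_QUEUED;
      kv_order.push_back(p->id);
      ++p;
    } while (p != q.end() && p->state == STATE_IO_DONE);
  }
};
```

If three transactions are queued as 1, 2, 3 and their aios complete in the order 3, 2, 1, nothing advances until transaction 1 finishes, at which point all three move on in submission order.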

