References:
1. 深入淺出BlueStore的OSD創建與啟動 (BlueStore in Depth: OSD Creation and Startup)
2. Ceph Bluestore 部署實踐 (Ceph BlueStore Deployment in Practice)
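The read_meta and write_meta calls that appear throughout the code actually read from and write to the block file under /var/lib/ceph/osd/ceph-x/.
/var/lib/ceph/osd/ceph-0/ contains the files block, block.db, block.wal, ceph_fsid, fsid, keyring, ready, type, and whoami.
Of these, block, block.db, and block.wal are symbolic links: block points to the data space, block.db points to the space where RocksDB stores its database, and block.wal points to the space where the log is stored.
Opening, reading, writing, and closing block, block.db, and block.wal are ultimately carried out through the underlying XFS file system.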
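With BlueFS in place, BlueStore divides its space into three tiers:
1. Slow space: stores object data; it can be built from ordinary hard disk drives and is managed by BlueStore.
2. High-speed (DB) space: built from ordinary SSDs and used to store the metadata that BlueStore generates internally. BlueStore keeps this metadata in RocksDB, and RocksDB ultimately stores its data in BlueFS, so this space is managed by BlueFS.
3. Ultra-high-speed (WAL) space: stores the .log files that RocksDB generates internally, on NVMe or another device with lower latency than an ordinary SSD. The .log files are also stored through BlueFS, so this space is managed by BlueFS as well.
When BlueFS's free space falls below a certain fraction of BlueStore's free space, BlueStore shares part of its own space with BlueFS; in the same way, that space can later be reclaimed.
BlueFS is itself a journaling file system, so it produces journal data of its own. Its journal and the .log files are placed in the WAL space first; when WAL is exhausted they go to the DB space, and when DB is exhausted they go to the slow space. The slow space is managed by BlueStore, which records the portions handed to BlueFS in the bluefs_extents structure. bluefs_extents is a set in which each member describes one range of the slow space; every update (that is, every Allocator update) is persisted as BlueStore metadata. On the next startup this information is read back so that the Allocator can be initialized correctly.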
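The WAL, then DB, then slow fallback order described above can be sketched as follows. This is only an illustration: `Tier`, `TierSpace`, and `allocate_with_fallback` are hypothetical names, and a plain byte counter per tier stands in for the real allocators.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical device tiers, ordered from fastest to slowest.
enum class Tier { WAL = 0, DB = 1, SLOW = 2 };

// Hypothetical free-space tracker: one byte counter per tier (indexed by Tier).
struct TierSpace {
  std::array<uint64_t, 3> free_bytes{};
};

// Try to reserve `want` bytes starting at the preferred tier and fall back to
// the next slower tier whenever the current one cannot hold the whole request.
// Returns the tier that served the request, or std::nullopt if none could.
inline std::optional<Tier> allocate_with_fallback(TierSpace& s, Tier preferred,
                                                  uint64_t want) {
  for (int t = static_cast<int>(preferred); t <= static_cast<int>(Tier::SLOW); ++t) {
    if (s.free_bytes[t] >= want) {
      s.free_bytes[t] -= want;
      return static_cast<Tier>(t);
    }
  }
  return std::nullopt;
}
```

A request that prefers WAL but no longer fits there ends up on DB, and only reaches the slow tier when DB is exhausted too.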
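Reserved space: 0~8192
0~4096: reserved for the label
| Field | Description |
| --- | --- |
| osd_uuid | the OSD associated with the block device |
| size | size of the block device |
| btime | time at which the label was created |
| description | descriptive text for the label |
4096~8192: reserved for BlueFS to store its own superblock
The BlueFS superblock is stored in the second 4K range of its DB space and contains:
| Field | Description |
| --- | --- |
| uuid | the UUID associated with BlueFS (fsid?) |
| osduuid | the OSD associated with BlueFS |
| block_size | block size of the device backing DB/WAL |
| log_fnode | the fnode of the journal file |
| version | current version of the superblock |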
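As a small illustration of the reserved layout above, the sketch below classifies a byte offset into the label range, the BlueFS superblock range, or the usable space. The constant and function names are hypothetical; only the 0~4096 and 4096~8192 boundaries come from the notes.

```cpp
#include <cassert>
#include <cstdint>

// Boundaries taken from the notes; the names themselves are illustrative.
constexpr uint64_t kSuperblockOffset = 4096;  // label occupies [0, 4096)
constexpr uint64_t kReservedEnd = 8192;       // superblock occupies [4096, 8192)

enum class Region { Label, BlueFSSuper, Usable };

// Map a byte offset on the device to the region it falls into.
constexpr Region classify(uint64_t off) {
  if (off < kSuperblockOffset) return Region::Label;
  if (off < kReservedEnd) return Region::BlueFSSuper;
  return Region::Usable;
}
```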
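BlueFS management:
Block device management:
1. mkfs: persists certain configuration options to disk. Reason: different BlueStore configuration options lead to different on-disk data organization (for example, data is organized differently for SSD and NVMe). To prevent data corruption caused by running with different configuration before and after a restart, the configuration is persisted and read back from disk at the next startup, so that it stays consistent.
| Field | Description |
| --- | --- |
| os_type | object store type: either filestore or bluestore |
| fsid | uniquely identifies a BlueStore instance |
| freelist_type | the type of FreelistManager. BlueStore persists all free-list information in the kv store, so if the FreelistManager type were allowed to change dynamically, the free list could not be read back correctly from the kv DB at startup |
| kv_backend | which kind of kv store is used; currently leveldb or rocksdb |
| bluefs | if the kv store is RocksDB, use BlueFS in place of the local file system interface |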
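2. mount: when the OSD process starts, a mount operation performs the pre-start checks and preparation:
2.1 Check the ObjectStore type: it was written to disk during mkfs, so mount reads it back and verifies that it matches;
2.2 fsck: check for corruption;
2.3 Load and lock the fsid;
2.4 Load the main block device;
2.5 Open the database and load the metadata:
| Field | Description |
| --- | --- |
| nid_max | the smallest unallocated nid in BlueStore; new objects are assigned nids starting from the current nid_max |
| blobid_max | globally unique; the details are not yet clear to me |
| freelist_type | the type of FreelistManager |
| min_alloc_size | the minimum space allocation unit configured by BlueStore |
| bluefs_extents | extra space shared from the main device to BlueFS |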
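The relationship between mkfs and mount described above (persist at mkfs, verify and reload at mount) can be sketched like this. It is a simplification under stated assumptions: `MetaStore`, `mkfs_persist`, `mount_load`, and `MountedConfig` are hypothetical names, and an in-memory map stands in for the on-disk metadata.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the metadata that mkfs writes to disk.
using MetaStore = std::map<std::string, std::string>;

// Values that mount takes from disk rather than from the running configuration.
struct MountedConfig {
  std::string freelist_type;
  std::string kv_backend;
};

// mkfs: persist the options that determine the on-disk layout.
inline void mkfs_persist(MetaStore& disk, const std::string& os_type,
                         const std::string& freelist_type,
                         const std::string& kv_backend) {
  disk["os_type"] = os_type;
  disk["freelist_type"] = freelist_type;
  disk["kv_backend"] = kv_backend;
}

// mount: refuse to proceed if the persisted type does not match, otherwise
// return the persisted options so the same layout is used after a restart.
inline std::optional<MountedConfig> mount_load(const MetaStore& disk,
                                               const std::string& expected_os_type) {
  auto type = disk.find("os_type");
  if (type == disk.end() || type->second != expected_os_type) return std::nullopt;
  auto fl = disk.find("freelist_type");
  auto kv = disk.find("kv_backend");
  if (fl == disk.end() || kv == disk.end()) return std::nullopt;
  return MountedConfig{fl->second, kv->second};
}
```

The point of the design is that mount never trusts the current configuration for layout-defining options; whatever mkfs wrote wins.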
Data structures
Basic data structures:
/// an in-memory object: data, extended attributes, omap header, omap entries
struct Onode {
  MEMPOOL_CLASS_HELPERS();
  // Not persisted and updated on cache insertion/removal
  OnodeCacheShard *s;
  bool pinned = false; // Only to be used by the onode cache shard

  std::atomic_int nref;  ///< reference count
  Collection *c;         // PG information
  ghobject_t oid;

  /// key under PREFIX_OBJ where we are stored
  mempool::bluestore_cache_other::string key;

  boost::intrusive::list_member_hook<> lru_item, pin_item;

  // on-disk onode data structure
  bluestore_onode_t onode;  ///< metadata stored as value in kv store
  bool exists;              ///< true if object logically exists

  // Ordered set of logical extents, persisted in RocksDB: lextent --> blob.
  // Sparse writes are supported, so the extents in the extent map may be
  // non-contiguous, i.e. there can be holes where the end of one extent is
  // below the start of the next.
  // Too many extents in one object make the extent map very large and
  // seriously hurt RocksDB access efficiency, so sharding (shard_info) was
  // introduced, and adjacent small extents are merged as well.
  // The benefit is that shards can be loaded on demand, reducing memory usage.
  // Space management: holds multiple extents; each extent manages one logical
  // range of the object and references a Blob; a Blob holds multiple pextents,
  // which finally map the object's data onto the disk.
  ExtentMap extent_map;

  // track txc's that have not been committed to kv store (and whose
  // effects cannot be read via the kvdb read methods)
  std::atomic<int> flushing_count = {0};
  std::atomic<int> waiting_count = {0};
  /// protect flush_txns
  ceph::mutex flush_lock = ceph::make_mutex("BlueStore::Onode::flush_lock");
  ceph::condition_variable flush_cond;   ///< wait here for uncommitted txns
......
}
// ExtentMap structure
 /// a sharded extent map, mapping offsets to lextents to blobs
 struct ExtentMap {
   Onode *onode;
   extent_map_t extent_map; ///< map of Extents to Blobs
   blob_map_t spanning_blob_map; ///< blobs that span shards
   typedef boost::intrusive_ptr<Onode> OnodeRef;
......
}
// Extent data structure: essentially a collection of offsets and a length
 struct Extent : public ExtentBase {
   MEMPOOL_CLASS_HELPERS();
   uint32_t logical_offset = 0; ///< logical offset
   uint32_t blob_offset = 0; ///< blob offset
   uint32_t length = 0; ///< length
   BlobRef blob; ///< the blob with our data
......
}
// on-disk data structures
/// onode: per-object metadata
struct bluestore_onode_t {
  // logical ID, unique within a single BlueStore instance
  uint64_t nid = 0;   ///< numeric id (locally unique)
  // size of the object
  uint64_t size = 0;  ///< object size
  // extended attributes of the object
  map<mempool::bluestore_cache_other::string, bufferptr> attrs;  ///< attrs
  ......
}
/// pextent: physical extent
// offset: physical offset on disk, aligned to the block size
// length: length of the data segment, aligned to the block size
struct bluestore_pextent_t : public bluestore_interval_t<uint64_t, uint32_t> {
  bluestore_pextent_t() {}
  bluestore_pextent_t(uint64_t o, uint64_t l) : bluestore_interval_t(o, l) {}
  bluestore_pextent_t(const bluestore_interval_t &ext)
    : bluestore_interval_t(ext.offset, ext.length) {}

  DENC(bluestore_pextent_t, v, p) {
    denc_lba(v.offset, p);
    denc_varint_lowz(v.length, p);
  }

  void dump(Formatter *f) const;
  static void generate_test_instances(list<bluestore_pextent_t*>& ls);
};
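To make the extent, blob, pextent chain concrete, here is a minimal sketch of resolving a logical object offset to a physical disk offset. The types below are deliberately simplified, hypothetical stand-ins (`PExtent`, `Blob`, `Extent`, `ExtentMap`, `logical_to_physical`), not the real BlueStore classes; they keep only the fields named in the structures above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <vector>

struct PExtent { uint64_t offset; uint32_t length; };  // physical range on disk
struct Blob { std::vector<PExtent> pextents; };        // ordered physical pieces
struct Extent {
  uint32_t logical_offset;     // offset within the object
  uint32_t blob_offset;        // where this extent starts inside the blob
  uint32_t length;
  std::shared_ptr<Blob> blob;  // the blob holding the data
};

// Extents keyed by logical_offset; gaps between entries are holes.
using ExtentMap = std::map<uint32_t, Extent>;

// Resolve a logical offset to a disk offset, or std::nullopt for a hole.
inline std::optional<uint64_t> logical_to_physical(const ExtentMap& em, uint32_t loff) {
  auto it = em.upper_bound(loff);             // first extent starting after loff
  if (it == em.begin()) return std::nullopt;  // loff lies before every extent
  --it;                                       // last extent starting at or before loff
  const Extent& e = it->second;
  const uint32_t delta = loff - e.logical_offset;
  if (delta >= e.length) return std::nullopt; // falls into a hole
  uint64_t boff = uint64_t{e.blob_offset} + delta;  // offset inside the blob
  for (const PExtent& p : e.blob->pextents) {
    if (boff < p.length) return p.offset + boff;
    boff -= p.length;                         // skip this physical piece
  }
  return std::nullopt;                        // blob is shorter than the extent claims
}
```

Two extents may reference the same blob at different blob offsets, which is why the lookup walks the blob's pextents instead of assuming a single contiguous range.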
CollectionHandle& ch
Collection *c = static_cast<Collection*>(ch.get());
OpSequencer *osr = c->osr.get();
/**
 * a collection also orders transactions
 *
 * Any transactions queued under a given collection will be applied in
 * sequence. Transactions queued under different collections may run
 * in parallel.
 *
 * ObjectStore users may get collection handles with open_collection() (or,
 * for bootstrapping a new collection, create_new_collection()).
 */
struct CollectionImpl : public RefCountedObject {
  const coll_t cid;

  /// wait for any queued transactions to apply
  // block until any previous transactions are visible. specifically,
  // collection_list and collection_empty need to reflect prior operations.
  virtual void flush() = 0;

  /**
   * Async flush_commit
   *
   * There are two cases:
   * 1) collection is currently idle: the method returns true. c is
   *    not touched.
   * 2) collection is not idle: the method returns false and c is
   *    called asynchronously with a value of 0 once all transactions
   *    queued on this collection prior to the call have been applied
   *    and committed.
   */
  virtual bool flush_commit(Context *c) = 0;

  const coll_t &get_cid() {
    return cid;
  }
protected:
  CollectionImpl() = delete;
  CollectionImpl(CephContext* cct, const coll_t& c) : RefCountedObject(cct), cid(c) {}
  ~CollectionImpl() = default;
};
void BlueStore::_txc_finish_io(TransContext *txc)
{
  dout(20) << __func__ << " " << txc << dendl;

  /*
   * we need to preserve the order of kv transactions,
   * even though aio will complete in any order.
   */

  OpSequencer *osr = txc->osr.get();
  std::lock_guard l(osr->qlock);
  txc->state = TransContext::STATE_IO_DONE;  // update the state
  txc->ioc.release_running_aios();
  OpSequencer::q_list_t::iterator p = osr->q.iterator_to(*txc);
  while (p != osr->q.begin()) {
    --p;
    if (p->state < TransContext::STATE_IO_DONE) {
      dout(20) << __func__ << " " << txc << " blocked by " << &*p << " "
               << p->get_state_name() << dendl;
      return;
    }
    if (p->state > TransContext::STATE_IO_DONE) {
      ++p;
      break;
    }
  }
  do {
    _txc_state_proc(&*p++);  // re-enter the state machine
  } while (p != osr->q.end() &&
           p->state == TransContext::STATE_IO_DONE);

  if (osr->kv_submitted_waiters) {
    osr->qcond.notify_all();
  }
}
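The state machine then checks whether any IO is still unsubmitted. If so, it sets state to STATE_AIO_WAIT, calls _txc_aio_submit to submit the IO, and leaves the state machine; when the aio completes, the callback txc_aio_finish re-enters the state machine. Otherwise it proceeds directly to the STATE_AIO_WAIT stage.
case TransContext::STATE_PREPARE:
  throttle.log_state_latency(*txc, logger, l_bluestore_state_prepare_lat);
  if (txc->ioc.has_pending_aios()) {
    // there is still unsubmitted IO: set the state to STATE_AIO_WAIT and submit it
    txc->state = TransContext::STATE_AIO_WAIT;
    txc->had_ios = true;   // update txc
    _txc_aio_submit(txc);  // submit the IO
    return;
  }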
void BlueStore::_txc_aio_submit(TransContext *txc)
{
  dout(10) << __func__ << " txc " << txc << dendl;
  bdev->aio_submit(&txc->ioc);
}
bdev = BlockDevice::create(cct, p, aio_cb, static_cast<void*>(this),
                           discard_cb, static_cast<void*>(this));
static void aio_cb(void *priv, void *priv2)
{
  BlueStore *store = static_cast<BlueStore*>(priv);
  BlueStore::AioContext *c = static_cast<BlueStore::AioContext*>(priv2);
  c->aio_finish(store);
}
void aio_finish(BlueStore *store) override {
  store->txc_aio_finish(this);
}
void txc_aio_finish(void *p) {
  _txc_state_proc(static_cast<TransContext*>(p));
}
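The callback chain above (device completes IO, a static trampoline recovers the typed objects, the context re-enters the store) can be reproduced in isolation. The sketch below is a toy model: `Store`, `Txc`, and `Device` are hypothetical stand-ins for BlueStore, TransContext, and the block device, and only the two-opaque-pointer callback shape is taken from the excerpts.

```cpp
#include <cassert>

// Hypothetical stand-in for BlueStore: counts re-entries into the state machine.
struct Store {
  int reentered = 0;
  void txc_aio_finish(void* /*txc*/) { ++reentered; }
};

// Interface for anything that wants to be notified when its aio completes.
struct AioContext {
  virtual void aio_finish(Store* store) = 0;
  virtual ~AioContext() = default;
};

// Hypothetical stand-in for TransContext.
struct Txc : AioContext {
  void aio_finish(Store* store) override { store->txc_aio_finish(this); }
};

// C-style callback signature: two opaque pointers.
using aio_callback_t = void (*)(void* priv, void* priv2);

// Trampoline: recover the typed objects from the opaque pointers.
inline void aio_cb(void* priv, void* priv2) {
  Store* store = static_cast<Store*>(priv);
  AioContext* c = static_cast<AioContext*>(priv2);
  c->aio_finish(store);
}

// Hypothetical stand-in for the block device: stores the callback plus the
// first opaque pointer, and supplies the per-IO context on completion.
struct Device {
  aio_callback_t cb;
  void* cb_priv;
  // Taking AioContext* here guarantees the void* handed to the callback was
  // produced from the same static type that the trampoline casts back to.
  void complete(AioContext* ioc) { cb(cb_priv, ioc); }
};
```

The device knows nothing about the store or the transaction types; it only carries two pointers, which is what lets one block device layer serve several kinds of callers.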
STATE_AIO_WAIT stage:
case TransContext::STATE_AIO_WAIT:
  // ordered IO handling: wait for the AIO to complete
  {
    mono_clock::duration lat = throttle.log_state_latency(
      *txc, logger, l_bluestore_state_aio_wait_lat);
    if (ceph::to_seconds<double>(lat) >= cct->_conf->bluestore_log_op_age) {
      dout(0) << __func__ << " slow aio_wait, txc = " << txc
              << ", latency = " << lat
              << dendl;
    }
  }

  _txc_finish_io(txc);  // may trigger blocked txc's too
  return;
void BlueStore::_txc_finish_io(TransContext *txc)
{
  dout(20) << __func__ << " " << txc << dendl;

  /*
   * we need to preserve the order of kv transactions,
   * even though aio will complete in any order.
   */

  OpSequencer *osr = txc->osr.get();
  std::lock_guard l(osr->qlock);
  txc->state = TransContext::STATE_IO_DONE;  // update the state
  txc->ioc.release_running_aios();
  OpSequencer::q_list_t::iterator p = osr->q.iterator_to(*txc);
  while (p != osr->q.begin()) {
    --p;
    // make sure every txc queued ahead of this one has finished its IO; this
    // is what enforces in-order completion, because each txc enters
    // _txc_finish_io again when its own IO completes
    if (p->state < TransContext::STATE_IO_DONE) {
      dout(20) << __func__ << " " << txc << " blocked by " << &*p << " "
               << p->get_state_name() << dendl;
      return;
    }
    if (p->state > TransContext::STATE_IO_DONE) {
      ++p;
      break;
    }
  }
  do {
    _txc_state_proc(&*p++);  // re-enter the state machine
  } while (p != osr->q.end() &&
           p->state == TransContext::STATE_IO_DONE);

  if (osr->kv_submitted_waiters) {
    osr->qcond.notify_all();
  }
}
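The ordering logic of _txc_finish_io can be tried out on a toy model. The sketch below mirrors the backward walk and the forward advance, but everything in it is a simplified assumption: `Sequencer`, `Txc`, and the three-value `State` enum are hypothetical, a vector replaces the intrusive list, there is no locking, and "advancing" a transaction just records its id.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified states; the enumerator order matters for the < and > comparisons.
enum class State { AIO_WAIT, IO_DONE, KV_QUEUED };

struct Txc {
  int id;
  State state;
};

struct Sequencer {
  std::vector<Txc> q;         // transactions in submission order
  std::vector<int> advanced;  // ids in the order they moved past IO_DONE

  // Called when the aio of q[idx] completes (completions may arrive in any order).
  void finish_io(std::size_t idx) {
    q[idx].state = State::IO_DONE;
    std::size_t p = idx;
    // Walk backwards: an earlier txc still waiting for IO blocks this one.
    while (p > 0) {
      --p;
      if (q[p].state < State::IO_DONE) return;
      if (q[p].state > State::IO_DONE) { ++p; break; }
    }
    // Advance every consecutive IO_DONE txc, starting from the oldest one.
    do {
      q[p].state = State::KV_QUEUED;
      advanced.push_back(q[p].id);
      ++p;
    } while (p < q.size() && q[p].state == State::IO_DONE);
  }
};
```

If the IO of transactions 0, 1, 2 completes in the order 2, 0, 1, the first completion advances nothing, the second advances only 0, and the third advances 1 and then the already finished 2, so the kv stage still sees them in submission order.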
_txc_finish_io
