Overview
RGW has three basic concepts: user, bucket and object. By analysing the RGW data layout we can see clearly how these three object-storage concepts are implemented inside RGW.
RGW data falls into three types:
- data: each RGW object is stored in one or more RADOS objects
- metadata: user, bucket, bucket.instance
- bucket index: a special kind of metadata, kept separately
At the RADOS layer, RGW data exists in three forms: the RADOS object data itself, the RADOS object extended attributes (xattrs), and the RADOS object omap.
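As a quick way to poke at these three forms, here is a minimal sketch using the python-rados binding (pool, namespace and object name are placeholders; the same information is obtained with the rados CLI throughout this article):

import rados

# Placeholder names -- substitute a real pool/namespace/object from your cluster.
POOL, NAMESPACE, OID = "default.rgw.buckets.index", "", ".dir.<bucket-id>"

cluster = rados.Rados(conffile="ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    ioctx.set_namespace(NAMESPACE)

    # 1) object data
    data = ioctx.read(OID, length=4096, offset=0)
    print("first bytes:", data[:16])

    # 2) extended attributes (xattrs)
    for name, value in ioctx.get_xattrs(OID):
        print("xattr:", name, len(value), "bytes")

    # 3) omap keys
    with rados.ReadOpCtx() as op:
        it, ret = ioctx.get_omap_keys(op, "", 128)  # start_after="", max_return=128
        ioctx.operate_read_op(op, OID)
        for key, _ in it:
            print("omap key:", key)
    ioctx.close()
finally:
    cluster.shutdown()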
metadata
RGW metadata is divided into:
- user: stores user information
- bucket: maintains the mapping from bucket name to bucket instance id
- bucket.instance: stores bucket instance information
- otp: new in the N (Nautilus) release; stores one-time password (OTP) information. A password is generated by a time-synchronized OTP algorithm backed by a virtual or hardware Multi-Factor Authentication (MFA) device.
metadata list shows exactly the four types listed above:
[root@stor14 build]# bin/radosgw-admin metadata list -c ceph.conf
[
"bucket",
"bucket.instance",
"otp",
"user"
]
Look at what is under user:
[root@stor14 build]# bin/radosgw-admin metadata list user -c ceph.conf
[
"56789abcdef0123456789abcdef0123456789abcdef0123456789abcdef01234",
"bl_deliver",
"testx$9876543210abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
"0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
"user1",
"test",
"testid",
"ms_sync"
]
Now list bucket.instance; normally each bucket corresponds to one bucket.instance:
[root@stor14 build]# bin/radosgw-admin metadata list bucket.instance -c ceph.conf
[
"bltest1:e34456f0-d371-4384-9007-70c60563fb0b.4281.17",
"test1:e34456f0-d371-4384-9007-70c60563fb0b.4281.15"
]
Next, look at the bucket.instance bltest1:e34456f0-d371-4384-9007-70c60563fb0b.4281.17:
[root@stor14 build]# bin/radosgw-admin metadata get bucket.instance:bltest1:e34456f0-d371-4384-9007-70c60563fb0b.4281.17 -c ceph.conf
{
"key": "bucket.instance:bltest1:e34456f0-d371-4384-9007-70c60563fb0b.4281.17",
"ver": {
"tag": "_UHAkf8NRdh_eXm7CqYsFNA8",
"ver": 1
},
"mtime": "2019-11-20 07:53:39.493618Z",
"data": {
"bucket_info": {
"bucket": {
"name": "bltest1",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4281.17",
"bucket_id": "e34456f0-d371-4384-9007-70c60563fb0b.4281.17",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"creation_time": "2019-11-20 07:53:39.399284Z",
"owner": "user1",
"flags": 0,
"zonegroup": "513b96de-2450-4292-a86a-314abbe29766",
"placement_rule": "default-placement",
"has_instance_obj": "true",
"quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"num_shards": 0,
"bi_shard_hash_type": 0,
"requester_pays": "false",
"has_website": "false",
"swift_versioning": "false",
"swift_ver_location": "",
"index_type": 0,
"mdsearch_config": [],
"reshard_status": 0,
"new_bucket_instance_id": ""
},
"attrs": [
{
"key": "user.rgw.acl",
"val": "AgKBAAAAAwISAAAABQAAAHVzZXIxBQAAAHVzZXIxBANjAAAAAQEAAAAFAAAAdXNlcjEPAAAAAQAAAAUAAAB1c2VyMQUDNgAAAAICBAAAAAAAAAAFAAAAdXNlcjEAAAAAAAAAAAICBAAAAA8AAAAFAAAAdXNlcjEAAAAAAAAAAAAAAAAAAAAA"
}
]
}
}
Fetch the bucket entry for bltest1:
[root@stor14 build]# bin/radosgw-admin metadata get bucket:bltest1 -c ceph.conf
{
"key": "bucket:bltest1",
"ver": {
"tag": "_Sia7A02mjQlrvVkymKa_y2G",
"ver": 1
},
"mtime": "2019-11-20 07:53:39.560897Z",
"data": {
"bucket": {
"name": "bltest1",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4281.17",
"bucket_id": "e34456f0-d371-4384-9007-70c60563fb0b.4281.17",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"owner": "user1",
"creation_time": "2019-11-20 07:53:39.399284Z",
"linked": "true",
"has_bucket_info": "false"
}
}
Since no MFA has been configured, the otp list is currently empty:
[root@stor14 build]# bin/radosgw-admin metadata list otp -c ceph.conf
[]
bucket index
The bucket index is a special kind of metadata; it is what allows us to list all RGW objects in a given bucket.
Bucket index objects are stored in the index pool (default.rgw.buckets.index in the default zone).
The bucket index maintains a key-value map:
- The map itself lives in the omap attached to the index RADOS object(s). Without sharding, one bucket corresponds to a single RADOS object; with sharding, one bucket may correspond to multiple index RADOS objects.
- The omap keys are the RGW object names; the omap values hold some basic metadata of those RGW objects, such as what is shown when listing a bucket.
- Each omap also has a header, which stores bucket statistics such as object count and total size.
List the bucket index objects (RADOS objects):
[root@stor14 build]# bin/rados -c ceph.conf ls -p default.rgw.buckets.index
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.15
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.17
We can see two index objects; at this point the default zone has only two buckets:
[root@stor14 build]# bin/radosgw-admin bucket list -c ceph.conf
[
"test1",
"bltest1"
]
Here .dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.17 corresponds to bucket bltest1 and .dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.15 to test1. Check the stats of test1; the marker field shows which index object it maps to:
[root@stor14 build]# bin/radosgw-admin bucket stats --bucket test1 -c ceph.conf
{
"bucket": "test1",
"tenant": "",
"zonegroup": "513b96de-2450-4292-a86a-314abbe29766",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "e34456f0-d371-4384-9007-70c60563fb0b.4281.15",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4281.15", # 由marker可以看到對應index 對象 .dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.15
"index_type": "Normal",
"owner": "user1",
"ver": "0#3",
"master_ver": "0#0",
"mtime": "2019-11-20 07:53:38.898208Z",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size": 10279,
"size_actual": 16384,
"size_utilized": 10279,
"size_kb": 11,
"size_kb_actual": 16,
"size_kb_utilized": 11,
"num_objects": 2
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
Check with listomapkeys:
[root@stor14 build]# bin/rados -c ceph.conf listomapkeys .dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.15 -p default.rgw.buckets.index
dcj1
dcj2
We can see the omap keys dcj1 and dcj2. listomapvals then shows the encoded index entries for these two RGW objects:
[root@stor14 build]# bin/rados -c ceph.conf listomapvals .dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.15 -p default.rgw.buckets.index
dcj1
value (218 bytes) :
00000000 08 03 d4 00 00 00 04 00 00 00 64 63 6a 31 01 00 |..........dcj1..|
00000010 00 00 00 00 00 00 01 07 03 6e 00 00 00 01 ca 13 |.........n......|
00000020 00 00 00 00 00 00 06 f1 d4 5d f1 9c e0 0e 20 00 |.........].... .|
00000030 00 00 38 30 64 34 36 33 66 36 32 35 65 39 32 37 |..80d463f625e927|
00000040 39 31 39 35 65 30 37 34 65 35 34 65 65 63 64 31 |9195e074e54eecd1|
00000050 34 39 05 00 00 00 75 73 65 72 31 05 00 00 00 75 |49....user1....u|
00000060 73 65 72 31 0a 00 00 00 74 65 78 74 2f 70 6c 61 |ser1....text/pla|
00000070 69 6e ca 13 00 00 00 00 00 00 00 00 00 00 08 00 |in..............|
00000080 00 00 53 54 41 4e 44 41 52 44 00 00 00 00 00 00 |..STANDARD......|
00000090 00 00 00 01 01 02 00 00 00 08 01 01 2c 00 00 00 |............,...|
000000a0 65 33 34 34 35 36 66 30 2d 64 33 37 31 2d 34 33 |e34456f0-d371-43|
000000b0 38 34 2d 39 30 30 37 2d 37 30 63 36 30 35 36 33 |84-9007-70c60563|
000000c0 66 62 30 62 2e 34 32 36 37 2e 35 34 00 00 00 00 |fb0b.4267.54....|
000000d0 00 00 00 00 00 00 00 00 00 00 |..........|
000000da
dcj2
value (220 bytes) :
00000000 08 03 d6 00 00 00 04 00 00 00 64 63 6a 32 01 00 |..........dcj2..|
00000010 00 00 00 00 00 00 01 07 03 6e 00 00 00 01 5d 14 |.........n....].|
00000020 00 00 00 00 00 00 54 2e d5 5d 49 86 f8 2a 20 00 |......T..]I..* .|
00000030 00 00 38 39 36 39 39 37 63 37 63 37 62 64 64 61 |..896997c7c7bdda|
00000040 30 38 61 62 30 39 62 38 35 34 36 32 33 61 30 30 |08ab09b854623a00|
00000050 30 30 05 00 00 00 75 73 65 72 31 05 00 00 00 75 |00....user1....u|
00000060 73 65 72 31 0a 00 00 00 74 65 78 74 2f 70 6c 61 |ser1....text/pla|
00000070 69 6e 5d 14 00 00 00 00 00 00 00 00 00 00 08 00 |in].............|
00000080 00 00 53 54 41 4e 44 41 52 44 00 00 00 00 00 00 |..STANDARD......|
00000090 00 00 00 01 01 02 00 00 00 08 01 02 2e 00 00 00 |................|
000000a0 65 33 34 34 35 36 66 30 2d 64 33 37 31 2d 34 33 |e34456f0-d371-43|
000000b0 38 34 2d 39 30 30 37 2d 37 30 63 36 30 35 36 33 |84-9007-70c60563|
000000c0 66 62 30 62 2e 34 35 36 34 2e 34 30 30 36 00 00 |fb0b.4564.4006..|
000000d0 00 00 00 00 00 00 00 00 00 00 00 00 |............|
000000dc
Normally one bucket corresponds to a single index RADOS object. Since the H (Hammer) release introduced bucket index sharding, a bucket may correspond to multiple RADOS objects, and the N (Nautilus) release can even reshard automatically. Here we set rgw_override_bucket_index_max_shards = 5 so that newly created buckets start with 5 shards (a sketch of the relevant ceph.conf section follows).
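For reference, a sketch of where this option could be set in ceph.conf (the section name below is a placeholder and depends on how your rgw daemon is named in your deployment):

[client.rgw.stor14]                        # placeholder section name
rgw_override_bucket_index_max_shards = 5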
Create a new bucket test2:
[root@stor14 build]# s3cmd mb s3://test2
Bucket 's3://test2/' created
max_marker now lists 5 shards: 0#, 1#, 2#, 3#, 4#:
[root@stor14 build]# bin/radosgw-admin bucket stats --bucket test2 -c ceph.conf
{
"bucket": "test2",
"tenant": "",
"zonegroup": "513b96de-2450-4292-a86a-314abbe29766",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"index_type": "Normal",
"owner": "user1",
"ver": "0#1,1#1,2#1,3#1,4#1",
"master_ver": "0#0,1#0,2#0,3#0,4#0",
"mtime": "2019-11-20 12:26:38.164727Z",
"max_marker": "0#,1#,2#,3#,4#",
"usage": {},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
Besides the previous 2 index objects, 5 new objects now exist, named .dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.0 through .4:
[root@stor14 build]# bin/rados -c ceph.conf ls -p default.rgw.buckets.index
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.1
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.3
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.2
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.4
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.15
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.0
.dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.17
Upload an object jcd1 into the new bucket test2:
[root@stor14 build]# s3cmd ls s3://test2
[root@stor14 build]# s3cmd put ./ceph.conf s3://test2/jcd1
upload: './ceph.conf' -> 's3://test2/jcd1' [1 of 1]
5255 of 5255 100% in 0s 26.82 kB/s done
[root@stor14 build]# bin/rados -c ceph.conf listomapkeys .dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.0 -p default.rgw.buckets.index
[root@stor14 build]# bin/rados -c ceph.conf listomapkeys .dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.1 -p default.rgw.buckets.index
[root@stor14 build]# bin/rados -c ceph.conf listomapkeys .dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.2 -p default.rgw.buckets.index
jcd1
[root@stor14 build]# bin/rados -c ceph.conf listomapkeys .dir.e34456f0-d371-4384-9007-70c60563fb0b.4848.1.3 -p default.rgw.buckets.index
The uploaded object landed in shard 2.
By default a bucket index object carries no xattrs:
[root@stor14 build]# bin/rados -c ceph.conf listxattr .dir.e34456f0-d371-4384-9007-70c60563fb0b.4281.15 -p default.rgw.buckets.index
As a rule of thumb, a single shard should hold roughly 100k-150k objects. More shards are not always better: with too many shards, operations such as listing a bucket consume a lot of underlying storage IO and some requests can take far too long. For example, a bucket expected to hold about 3 million objects would need on the order of 20-30 shards (see the sketch below).
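As a rough illustration of this sizing guideline (a sketch, not an official formula):

import math

def recommended_shard_count(expected_objects, objects_per_shard=100_000):
    """Rough shard-count estimate based on the ~100k-150k objects/shard guideline above."""
    return max(1, math.ceil(expected_objects / objects_per_shard))

print(recommended_shard_count(3_000_000))   # -> 30 shards for ~3 million objects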
data
Each RGW object is stored in one or more RADOS objects. The layout of an RGW object is described in detail below.
RGW Pools
RGW uses dedicated pools for specific types of data; the pools used by a zone follow the naming rule {zone}.rgw.{function}.
The pools configured for the default zone:
[root@stor14 build]# bin/radosgw-admin zone get -c ceph.conf
{
"id": "e34456f0-d371-4384-9007-70c60563fb0b",
"name": "default",
"domain_root": "default.rgw.meta:root",
"control_pool": "default.rgw.control",
"gc_pool": "default.rgw.log:gc",
"lc_pool": "default.rgw.log:lc",
"bl_pool": "default.rgw.log:bl",
"log_pool": "default.rgw.log",
"intent_log_pool": "default.rgw.log:intent",
"usage_log_pool": "default.rgw.log:usage",
"reshard_pool": "default.rgw.log:reshard",
"user_keys_pool": "default.rgw.meta:users.keys",
"user_email_pool": "default.rgw.meta:users.email",
"user_swift_pool": "default.rgw.meta:users.swift",
"user_uid_pool": "default.rgw.meta:users.uid",
"otp_pool": "default.rgw.otp", # metadata 中提到的One-Time Password 用的存儲池
"system_key": {
"access_key": "ms_sync",
"secret_key": "ms_sync"
},
"bl_deliver_key": {
"access_key": "bl_deliver",
"secret_key": "bl_deliver"
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "default.rgw.buckets.index",
"storage_classes": {
"STANDARD": {
"data_pool": "default.rgw.buckets.data"
}
},
"data_extra_pool": "default.rgw.buckets.non-ec",
"index_type": 0
}
}
],
"metadata_heap": "",
"realm_id": "5f47bd7c-94a1-48f5-8c51-5bb8f7075425"
}
Important RGW pools:
- .rgw.root - stores information about zones, zonegroups, realms and periods
- {zone}.rgw.control - objects (e.g. notify.*) used by the librados watch-notify mechanism to keep RGW caches consistent
- {zone}.rgw.log - records various logs, under the namespaces usage, gc, lc, bl
  - gc: garbage collection
  - lc: lifecycle
  - bl: bucket logging
- {zone}.rgw.buckets.data - the data pool
- {zone}.rgw.buckets.extra - now called {zone}.rgw.buckets.non-ec; temporary data written during multipart uploads goes here
- {zone}.rgw.buckets.index - stores the bucket index objects ".dir.<marker>"
- {zone}.rgw.meta - the object-store metadata pool, split into namespaces (root, users.uid, users.keys, ...)
  - root: per-bucket entrypoint objects named "<bucket>", plus bucket instance objects named ".bucket.meta.<bucket>:<marker>" (see put_bucket_instance_info())
  - users.uid: contains both per-user information (RGWUserInfo) in "<user>" objects and the per-user list of buckets in the omap of "<user>.buckets" objects; "<user>" may include the tenant if non-empty
  - users.keys: lets RGW look up the user id from an access key during authentication (a sketch for enumerating these namespaces follows)
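A minimal python-rados sketch that enumerates the objects in these namespaces (namespace names as listed above; the rados CLI commands below show the same thing for root and users.uid):

import rados

cluster = rados.Rados(conffile="ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("default.rgw.meta")
for ns in ("root", "users.uid", "users.keys", "users.email", "users.swift"):
    ioctx.set_namespace(ns)                     # restrict listing to one namespace
    names = [o.key for o in ioctx.list_objects()]
    print(ns, "->", names)
ioctx.close()
cluster.shutdown()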
Look at the default.rgw.meta pool, root namespace:
[root@stor14 build]# bin/rados -c ceph.conf ls -p default.rgw.meta -N root
test1
.bucket.meta.test2:e34456f0-d371-4384-9007-70c60563fb0b.4848.1
.bucket.meta.bltest1:e34456f0-d371-4384-9007-70c60563fb0b.4281.17
bltest1
.bucket.meta.test1:e34456f0-d371-4384-9007-70c60563fb0b.4281.15
test2
[root@stor14 build]# bin/rados -c ceph.conf listxattr test1 -p default.rgw.meta -N root
ceph.objclass.version
[root@stor14 build]# bin/rados -c ceph.conf listxattr .bucket.meta.test1:e34456f0-d371-4384-9007-70c60563fb0b.4281.15 -p default.rgw.meta -N root
ceph.objclass.version
user.rgw.acl
Now the users.uid namespace:
[root@stor14 build]# bin/rados -c ceph.conf ls -p default.rgw.meta -N users.uid
56789abcdef0123456789abcdef0123456789abcdef0123456789abcdef01234
bl_deliver
testx$9876543210abcdef0123456789abcdef0123456789abcdef0123456789abcdef
0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
user1
test
testid
ms_sync
user1.buckets            # <-- note this extra object
[root@stor14 build]# bin/radosgw-admin user list -c ceph.conf
[
"56789abcdef0123456789abcdef0123456789abcdef0123456789abcdef01234",
"bl_deliver",
"testx$9876543210abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
"0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
"user1",
"test",
"testid",
"ms_sync"
]
Note that users.uid contains not only the object-store users but also a user1.buckets object. As quoted above, this namespace "contains both per-user information (RGWUserInfo) in "<user>" objects and per-user lists of buckets in omaps of "<user>.buckets" objects".
This user1.buckets object is actually very important. The bucket index objects let us list the RGW objects inside a bucket, but how do we list the buckets owned by a given user? The answer is this omap:
[root@stor14 build]# bin/rados -c ceph.conf listomapkeys user1.buckets -p default.rgw.meta -N users.uid
bltest1
test1
test2
[root@stor14 build]# bin/rados -c ceph.conf listomapvals user1.buckets -p default.rgw.meta -N users.uid
bltest1
value (172 bytes) :
00000000 09 05 a6 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000010 00 00 03 f1 d4 5d 00 00 00 00 00 00 00 00 07 03 |.....]..........|
00000020 77 00 00 00 07 00 00 00 62 6c 74 65 73 74 31 00 |w.......bltest1.|
00000030 00 00 00 2c 00 00 00 65 33 34 34 35 36 66 30 2d |...,...e34456f0-|
00000040 64 33 37 31 2d 34 33 38 34 2d 39 30 30 37 2d 37 |d371-4384-9007-7|
00000050 30 63 36 30 35 36 33 66 62 30 62 2e 34 32 38 31 |0c60563fb0b.4281|
00000060 2e 31 37 2c 00 00 00 65 33 34 34 35 36 66 30 2d |.17,...e34456f0-|
00000070 64 33 37 31 2d 34 33 38 34 2d 39 30 30 37 2d 37 |d371-4384-9007-7|
00000080 30 63 36 30 35 36 33 66 62 30 62 2e 34 32 38 31 |0c60563fb0b.4281|
00000090 2e 31 37 00 00 00 00 00 00 00 00 00 00 00 00 00 |.17.............|
000000a0 00 00 00 01 03 f1 d4 5d 26 99 cc 17 |.......]&...|
000000ac
test1
value (170 bytes) :
00000000 09 05 a4 00 00 00 00 00 00 00 27 28 00 00 00 00 |..........'(....|
00000010 00 00 02 f1 d4 5d 02 00 00 00 00 00 00 00 07 03 |.....]..........|
00000020 75 00 00 00 05 00 00 00 74 65 73 74 31 00 00 00 |u.......test1...|
00000030 00 2c 00 00 00 65 33 34 34 35 36 66 30 2d 64 33 |.,...e34456f0-d3|
00000040 37 31 2d 34 33 38 34 2d 39 30 30 37 2d 37 30 63 |71-4384-9007-70c|
00000050 36 30 35 36 33 66 62 30 62 2e 34 32 38 31 2e 31 |60563fb0b.4281.1|
00000060 35 2c 00 00 00 65 33 34 34 35 36 66 30 2d 64 33 |5,...e34456f0-d3|
00000070 37 31 2d 34 33 38 34 2d 39 30 30 37 2d 37 30 63 |71-4384-9007-70c|
00000080 36 30 35 36 33 66 62 30 62 2e 34 32 38 31 2e 31 |60563fb0b.4281.1|
00000090 35 00 00 00 00 00 00 00 00 00 40 00 00 00 00 00 |5.........@.....|
000000a0 00 01 02 f1 d4 5d 74 a9 42 32 |.....]t.B2|
000000aa
test2
value (168 bytes) :
00000000 09 05 a2 00 00 00 00 00 00 00 87 14 20 02 00 00 |............ ...|
00000010 00 00 fd 30 d5 5d 05 00 00 00 00 00 00 00 07 03 |...0.]..........|
00000020 73 00 00 00 05 00 00 00 74 65 73 74 32 00 00 00 |s.......test2...|
00000030 00 2b 00 00 00 65 33 34 34 35 36 66 30 2d 64 33 |.+...e34456f0-d3|
00000040 37 31 2d 34 33 38 34 2d 39 30 30 37 2d 37 30 63 |71-4384-9007-70c|
00000050 36 30 35 36 33 66 62 30 62 2e 34 38 34 38 2e 31 |60563fb0b.4848.1|
00000060 2b 00 00 00 65 33 34 34 35 36 66 30 2d 64 33 37 |+...e34456f0-d37|
00000070 31 2d 34 33 38 34 2d 39 30 30 37 2d 37 30 63 36 |1-4384-9007-70c6|
00000080 30 35 36 33 66 62 30 62 2e 34 38 34 38 2e 31 00 |0563fb0b.4848.1.|
00000090 00 00 00 00 00 00 00 00 20 20 02 00 00 00 00 01 |........ ......|
000000a0 fd 30 d5 5d 57 b6 f2 3a |.0.]W..:|
000000a8
RGW Object
An RGW object is made up of one or more RADOS objects. Logically an RGW object is divided into a head and a tail.
The head (object logical head, olh) keeps the metadata in extended attributes (xattrs):
- the head is stored in a single RADOS object
- the head has a fixed maximum size (rgw_max_chunk_size, default 4M)
- its content is mutable and holds the RGW object's metadata:
- ACLs
- Manifest
- Content-Type
- ETag
- User-defined metadata (stored in xattrs)
The tail part:
- its content is immutable
- RGW objects smaller than rgw_max_chunk_size have no tail
- for multipart-uploaded objects the tail consists of multipart objects (the first RADOS object of each part) and shadow objects
- the tail is split into parts; each part stores a contiguous range of the RGW object's data and defaults to 15M
- an ordinary RGW object has a single part (which may still contain many stripes); a multipart-uploaded RGW object has multiple parts (the last part may be smaller). Note that a part is still a logical notion
- every part is striped with a default stripe size; each stripe maps to one RADOS object, so stripes and RADOS objects correspond one to one (a sketch of this layout follows the list)
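Based on the rules above, a small illustrative sketch (not RGW code; it only models the defaults just mentioned, 4M head and 4M stripes) that predicts how a non-multipart RGW object of a given size is cut into RADOS objects:

def simple_object_layout(obj_size, head_max=4 * 1024 * 1024, stripe=4 * 1024 * 1024):
    """Return (kind, offset, length) tuples for a non-multipart RGW object."""
    layout = []
    head_len = min(obj_size, head_max)
    layout.append(("head", 0, head_len))            # head rados object: xattrs + first chunk
    ofs = head_len
    stripe_id = 1
    while ofs < obj_size:
        length = min(stripe, obj_size - ofs)
        layout.append((f"shadow_<prefix>{stripe_id}", ofs, length))
        ofs += length
        stripe_id += 1
    return layout

print(simple_object_layout(5 * 1024 * 1024))
# [('head', 0, 4194304), ('shadow_<prefix>1', 4194304, 1048576)]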
Let's verify:
[root@stor14 build]# s3cmd put ./4MB s3://test2
upload: './4MB' -> 's3://test2/4MB' [1 of 1]
4194304 of 4194304 100% in 0s 8.14 MB/s done
[root@stor14 build]# bin/rados -c ceph.conf ls -p default.rgw.buckets.data
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_jcd1
e34456f0-d371-4384-9007-70c60563fb0b.4281.15_dcj2
e34456f0-d371-4384-9007-70c60563fb0b.4281.15_dcj1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_4MB
[root@stor14 build]# dd if=/dev/zero of=./5MB count=5 bs=1024k
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.00357833 s, 1.5 GB/s
[root@stor14 build]# s3cmd put ./5MB s3://test2
upload: './5MB' -> 's3://test2/5MB' [1 of 1]
5242880 of 5242880 100% in 0s 7.18 MB/s done
[root@stor14 build]# bin/rados -c ceph.conf ls -p default.rgw.buckets.data
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_jcd1
e34456f0-d371-4384-9007-70c60563fb0b.4281.15_dcj2
e34456f0-d371-4384-9007-70c60563fb0b.4281.15_dcj1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_4MB
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_5MB
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_.lWnArC6BcKPkRjbGf-B0q23BSpxjql1_1
We uploaded two objects in turn, one of 4MB and one of 5MB (their sizes match their names). Once the size exceeds 4MB, the RGW object is split into 2 RADOS objects:
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_5MB is the head, and e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_.lWnArC6BcKPkRjbGf-B0q23BSpxjql1_1 is the tail.
Check the sizes of the two RADOS objects:
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1_5MB
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1_5MB mtime 2019-11-20 20:49:11.000000, size 4194304
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_.lWnArC6BcKPkRjbGf-B0q23BSpxjql1_1
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_.lWnArC6BcKPkRjbGf-B0q23BSpxjql1_1 mtime 2019-11-20 20:49:11.000000, size 1048576
The head object size is 4194304 (one stripe), and together with the tail object size of 1048576 it equals exactly the RGW object size of 5242880.
We can list the metadata on the head object:
[root@stor14 build]# bin/rados -c ceph.conf listxattr e34456f0-d371-4384-9007-70c60563fb0b.4848.1_5MB -p default.rgw.buckets.data
user.rgw.acl
user.rgw.content_type
user.rgw.etag
user.rgw.idtag
user.rgw.manifest             # the manifest records the RGW object's layout; think of it as the map from the RGW object to its RADOS objects
user.rgw.pg_ver
user.rgw.source_zone
user.rgw.storage_class
user.rgw.tail_tag
user.rgw.x-amz-content-sha256
user.rgw.x-amz-date
user.rgw.x-amz-meta-s3cmd-attrs
Look at the manifest:
[root@stor14 build]# bin/rados -c ceph.conf getxattr e34456f0-d371-4384-9007-70c60563fb0b.4848.1_5MB -p default.rgw.buckets.data user.rgw.manifest > /tmp/m
Decode it into readable form:
[root@stor14 build]# bin/ceph-dencoder type RGWObjManifest import /tmp/m decode dump_json
{
"objs": [],
"obj_size": 5242880,
"explicit_objs": "false",
"head_size": 4194304, # head 大小
"max_head_size": 4194304,
"prefix": ".lWnArC6BcKPkRjbGf-B0q23BSpxjql1_", # 可以在tail obj 上看到,關聯各tail objects
"rules": [
{
"key": 0,
"val": {
"start_part_num": 0,
"start_ofs": 4194304,
"part_size": 0,
"stripe_max_size": 4194304,
"override_prefix": ""
}
}
],
"tail_instance": "",
"tail_placement": {
"bucket": {
"name": "test2",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"bucket_id": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"placement_rule": "default-placement"
},
"begin_iter": {
"part_ofs": 0,
"stripe_ofs": 0,
"ofs": 0,
"stripe_size": 4194304,
"cur_part_id": 0,
"cur_stripe": 0,
"cur_override_prefix": "",
"location": {
"placement_rule": "default-placement",
"obj": {
"bucket": {
"name": "test2",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"bucket_id": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"key": {
"name": "5MB",
"instance": "",
"ns": ""
}
},
"raw_obj": {
"pool": "",
"oid": "",
"loc": ""
},
"is_raw": false
}
},
"end_iter": {
"part_ofs": 4194304,
"stripe_ofs": 4194304,
"ofs": 5242880,
"stripe_size": 1048576,
"cur_part_id": 0,
"cur_stripe": 1,
"cur_override_prefix": "",
"location": {
"placement_rule": "default-placement",
"obj": {
"bucket": {
"name": "test2",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"bucket_id": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"key": {
"name": ".lWnArC6BcKPkRjbGf-B0q23BSpxjql1_1", # 當前shadow tail obj的key:manifest.prefix+stripe_id
"instance": "",
"ns": "shadow" # tail 類型為shadow
}
},
"raw_obj": {
"pool": "",
"oid": "",
"loc": ""
},
"is_raw": false
}
}
}
The begin_iter/end_iter offsets together with cur_part_id and cur_stripe determine the data layout at the RADOS layer: this 5MB RGW object is not split into parts and consists of 2 stripes, the first stripe being 4194304 bytes (4MB) and the remaining 1048576 bytes forming the second stripe, with the end offset at 5242880. start_part_num=0 in rules indicates the object was not uploaded in parts.
Now look at the manifest of a multipart-uploaded object:
[root@stor14 build]# bin/rados -c ceph.conf getxattr e34456f0-d371-4384-9007-70c60563fb0b.4848.1_20MB -p default.rgw.buckets.data user.rgw.manifest > /tmp/m
[root@stor14 build]# bin/ceph-dencoder type RGWObjManifest import /tmp/m decode dump_json
{
"objs": [],
"obj_size": 20971520,
"explicit_objs": "false",
"head_size": 0,
"max_head_size": 0,
"prefix": "20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx",
"rules": [
{
"key": 0,
"val": {
"start_part_num": 1, # "When uploading a part, in addition to the upload ID, you must specify a part number. You can choose any part number between 1 and 10,000. A part number uniquely identifies a part and its position in the object you are uploading. "
"start_ofs": 0,
"part_size": 15728640,
"stripe_max_size": 4194304,
"override_prefix": ""
}
},
{
"key": 15728640,
"val": {
"start_part_num": 2,
"start_ofs": 15728640,
"part_size": 5242880,
"stripe_max_size": 4194304,
"override_prefix": ""
}
}
],
"tail_instance": "",
"tail_placement": {
"bucket": {
"name": "test2",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"bucket_id": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"placement_rule": "default-placement"
},
"begin_iter": {
"part_ofs": 0,
"stripe_ofs": 0,
"ofs": 0,
"stripe_size": 4194304,
"cur_part_id": 1,
"cur_stripe": 0,
"cur_override_prefix": "",
"location": {
"placement_rule": "default-placement",
"obj": {
"bucket": {
"name": "test2",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"bucket_id": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"key": {
"name": "20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1", # first part key
"instance": "",
"ns": "multipart" # tail 類型為multipart
}
},
"raw_obj": {
"pool": "",
"oid": "",
"loc": ""
},
"is_raw": false
}
},
"end_iter": {
"part_ofs": 20971520,
"stripe_ofs": 20971520,
"ofs": 20971520,
"stripe_size": 4194304,
"cur_part_id": 3, ###??
"cur_stripe": 0,
"cur_override_prefix": "",
"location": {
"placement_rule": "default-placement",
"obj": {
"bucket": {
"name": "test2",
"marker": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"bucket_id": "e34456f0-d371-4384-9007-70c60563fb0b.4848.1",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"key": {
"name": "20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.3", # last part key ,3 ??
"instance": "",
"ns": "multipart" # tail 類型為multipart
}
},
"raw_obj": {
"pool": "",
"oid": "",
"loc": ""
},
"is_raw": false
}
}
}
The 20MB object was uploaded in 2 parts. Looking at manifest.rules:
- the rule with key=0 describes the first part: start_part_num=1, part_size=15728640;
- the rule with key=15728640 describes the second part: start_part_num=2, part_size=5242880, i.e. the remaining size.
AWS S3 requires multipart part numbers to be between 1 and 10,000; 0 cannot be used.
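For reference, this is roughly how a client drives such a multipart upload through the S3 API. A minimal boto3 sketch (endpoint and credentials are placeholders; the 15MB part size mirrors s3cmd's default used below, and the 2~... manifest prefix seen above is derived by RGW from the returned UploadId):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://stor14:8000",           # placeholder RGW endpoint
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

upload = s3.create_multipart_upload(Bucket="test2", Key="20MB")
parts = []
with open("./20MB", "rb") as f:
    part_number = 1                               # part numbers must be in 1..10000
    while True:
        chunk = f.read(15 * 1024 * 1024)          # 15MB parts
        if not chunk:
            break
        resp = s3.upload_part(Bucket="test2", Key="20MB", UploadId=upload["UploadId"],
                              PartNumber=part_number, Body=chunk)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

s3.complete_multipart_upload(Bucket="test2", Key="20MB", UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})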
ACL
[root@stor14 build]# bin/rados -c ceph.conf getxattr e34456f0-d371-4384-9007-70c60563fb0b.4848.1_5MB -p default.rgw.buckets.data user.rgw.acl > /tmp/m
[root@stor14 build]# bin/ceph-dencoder type RGWAccessControlPolicy import /tmp/m decode dump_json
{
"acl": {
"acl_user_map": [
{
"user": "user1",
"acl": 15
}
],
"acl_group_map": [],
"grant_map": [
{
"id": "user1",
"grant": {
"type": {
"type": 0
},
"id": "user1",
"email": "",
"permission": {
"flags": 15
},
"name": "user1",
"group": 0,
"url_spec": ""
}
}
]
},
"owner": {
"id": "user1",
"display_name": "user1"
}
}
If an RGW object is larger than 15MB, s3cmd automatically switches to multipart upload (15MB is s3cmd's default multipart chunk size, adjustable with --multipart-chunk-size-mb).
[root@stor14 build]# s3cmd put ./20MB s3://test2
upload: './20MB' -> 's3://test2/20MB' [part 1 of 2, 15MB] [1 of 1]
15728640 of 15728640 100% in 1s 11.65 MB/s done
upload: './20MB' -> 's3://test2/20MB' [part 2 of 2, 5MB] [1 of 1]
5242880 of 5242880 100% in 0s 8.21 MB/s done
[root@stor14 build]# bin/rados -c ceph.conf ls -p default.rgw.buckets.data
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_2
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_jcd1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2_1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_1
e34456f0-d371-4384-9007-70c60563fb0b.4281.15_dcj2
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_20MB
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_3
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2
e34456f0-d371-4384-9007-70c60563fb0b.4281.15_dcj1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_4MB
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_5MB
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_.lWnArC6BcKPkRjbGf-B0q23BSpxjql1_1
[root@stor14 build]# bin/rados -c ceph.conf ls -p default.rgw.buckets.data|grep 20MB
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_2
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2_1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1_20MB
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_3
e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2
This 20MB RGW object is split into 7 RADOS objects:
- head object: e34456f0-d371-4384-9007-70c60563fb0b.4848.1_20MB
- tail objects:
  - part 1 multipart obj: e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1
  - part 1 shadow objs: e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_1,
    e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_2,
    e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_3
  - part 2 multipart obj: e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2
  - part 2 shadow obj: e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2_1
A multipart-uploaded object therefore has one head; each part defaults to 15M, the first tail object of each part is a multipart object, and the remaining stripes are shadow objects.
Check the sizes of these RADOS objects:
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1_20MB
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1_20MB mtime 2019-11-20 20:58:01.000000, size 0
The head object size is 0: for multipart uploads the head stores only metadata (xattrs).
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1 mtime 2019-11-20 20:58:00.000000, size 4194304
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_1
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_1 mtime 2019-11-20 20:58:00.000000, size 4194304
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_2
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_2 mtime 2019-11-20 20:58:00.000000, size 4194304
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_3
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_3 mtime 2019-11-20 20:58:00.000000, size 3145728
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2 mtime 2019-11-20 20:58:01.000000, size 4194304
[root@stor14 build]# bin/rados -c ceph.conf -p default.rgw.buckets.data stat e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2_1
default.rgw.buckets.data/e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2_1 mtime 2019-11-20 20:58:01.000000, size 1048576
0 + 4194304 + 4194304 + 4194304 + 3145728 + 4194304 + 1048576 = 20971520, exactly the size of the RGW object "20MB".
Ordinary tail objects (shadow objects) carry no metadata, but multipart objects do carry some:
[root@stor14 build]# bin/rados -c ceph.conf listxattr e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_2 -p default.rgw.buckets.data
[root@stor14 build]# bin/rados -c ceph.conf listxattr e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1 -p default.rgw.buckets.data
user.rgw.acl
user.rgw.etag
user.rgw.pg_ver
user.rgw.source_zone
user.rgw.x-amz-content-sha256
user.rgw.x-amz-date
Note there is no manifest among them.
We can fetch the same ACL metadata from a multipart tail object as from the head object:
[root@stor14 build]# bin/rados -c ceph.conf getxattr e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.2 -p default.rgw.buckets.data user.rgw.acl > /tmp/m
[root@stor14 build]# bin/ceph-dencoder type RGWAccessControlPolicy import /tmp/m decode dump_json
{
"acl": {
"acl_user_map": [
{
"user": "user1",
"acl": 15
}
],
"acl_group_map": [],
"grant_map": [
{
"id": "user1",
"grant": {
"type": {
"type": 0
},
"id": "user1",
"email": "",
"permission": {
"flags": 15
},
"name": "user1",
"group": 0,
"url_spec": ""
}
}
]
},
"owner": {
"id": "user1",
"display_name": "user1"
}
}
With versioning disabled, the RADOS objects backing an RGW object are named as follows:
- head: <marker>_<rgw_obj_name>, e.g. e34456f0-d371-4384-9007-70c60563fb0b.4848.1_20MB
- multipart tail (the first RADOS object of each uploaded part): <marker>__multipart_<manifest_prefix>.<part_id>, e.g. e34456f0-d371-4384-9007-70c60563fb0b.4848.1__multipart_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1 (for multipart uploads the manifest prefix itself starts with the RGW object name followed by an upload id)
- shadow tail:
  - multipart upload: <marker>__shadow_<manifest_prefix>.<part_id>_<stripe_id>, e.g. e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx.1_3
  - non-multipart upload: <marker>__shadow_<manifest_prefix><stripe_id>, e.g. e34456f0-d371-4384-9007-70c60563fb0b.4848.1__shadow_.lWnArC6BcKPkRjbGf-B0q23BSpxjql1_1
Here <marker> is the marker shown in bucket stats, which is also the bucket id.
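To tie the naming rules together, here is a small illustrative sketch (not RGW code) that rebuilds the RADOS object names from the bucket marker, the manifest prefix, part ids and stripe ids; it reproduces the names observed earlier:

def head_name(marker, obj_name):
    return f"{marker}_{obj_name}"

def multipart_tail_name(marker, prefix, part_id):
    # first rados object of each uploaded part
    return f"{marker}__multipart_{prefix}.{part_id}"

def multipart_shadow_name(marker, prefix, part_id, stripe_id):
    # remaining stripes of a part
    return f"{marker}__shadow_{prefix}.{part_id}_{stripe_id}"

def simple_shadow_name(marker, prefix, stripe_id):
    # tail stripes of a non-multipart upload; the prefix already ends with "_"
    return f"{marker}__shadow_{prefix}{stripe_id}"

marker = "e34456f0-d371-4384-9007-70c60563fb0b.4848.1"
print(head_name(marker, "20MB"))
print(multipart_tail_name(marker, "20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx", 1))
print(multipart_shadow_name(marker, "20MB.2~PDSpB4F7hIgKhjVwf6NqmJtBuKxm-mx", 1, 3))
print(simple_shadow_name(marker, ".lWnArC6BcKPkRjbGf-B0q23BSpxjql1_", 1))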
With versioning enabled, an object instance id is added to the names as well; that is not discussed in detail here.