OCI and runC

Reposted article. 2020-12-23 19:45. Tag: docker

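I. OCI

The OCI (Open Container Initiative) is a standards body whose main goal is to standardize container technology by defining container standards precisely. It was created to resolve the confusion caused by competing container formats: without a unified standard, the industry could not build container tooling against a common target. In 2015, Docker therefore took the lead, together with other companies, in founding OCI and drafting the corresponding container standards.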

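II. The OCI standards

OCI currently comprises two standards, runtime-spec and image-spec, which define the container runtime standard and the container image standard respectively.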

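III. runC

runC is the container runtime that Docker contributed to OCI, and it is one of the most widely used container runtimes. Docker itself currently runs containers through runc. To run a container, runc needs an OCI bundle, which can be prepared as follows: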

# create the top most bundle directory
mkdir /mycontainer
cd /mycontainer

# create the rootfs directory
mkdir rootfs

# export busybox via Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -

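This step unpacks the filesystem into the bundle. Running runc spec in the bundle directory then generates config.json automatically. Together these steps produce an OCI runtime bundle. config.json defines everything needed to run the container.

The rootfs directory underneath holds the root filesystem and its contents. The main parameters defined in config.json are: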

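ociVersion: the version of the OCI runtime specification the bundle complies with.
process: the container process, including the command, environment variables, and working directory. The rootfs path and mount information are defined in the separate root and mounts fields.
hooks: scripts or programs to run at specific points in the container's lifecycle.

There are other parameters as well; see the OCI specification for details.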
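To make the shape of config.json more concrete, here is a minimal sketch in Go that decodes a handful of its fields. The bundleConfig type and parseBundleConfig function are hypothetical names for illustration only; the real schema is defined by the OCI runtime-spec and contains many more fields than shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bundleConfig is a hypothetical, trimmed-down view of config.json.
// Only a few runtime-spec fields are declared; everything else is ignored.
type bundleConfig struct {
	OCIVersion string `json:"ociVersion"`
	Process    struct {
		Args []string `json:"args"`
		Env  []string `json:"env"`
		Cwd  string   `json:"cwd"`
	} `json:"process"`
	Root struct {
		Path     string `json:"path"`
		Readonly bool   `json:"readonly"`
	} `json:"root"`
}

// parseBundleConfig decodes the subset of fields declared above.
func parseBundleConfig(data []byte) (*bundleConfig, error) {
	var cfg bundleConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	// A hand-written sample, not the output of runc spec.
	sample := []byte(`{
		"ociVersion": "1.0.2",
		"process": {"args": ["sh"], "env": ["PATH=/bin"], "cwd": "/"},
		"root": {"path": "rootfs", "readonly": true}
	}`)
	cfg, err := parseBundleConfig(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(cfg.OCIVersion, cfg.Process.Args, cfg.Root.Path)
}
```

Note that the rootfs location is read from root.path rather than from process, which matches the layout of the bundle directory created earlier.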

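IV. How runC is implemented

1. runc and libcontainer

runc and libcontainer are closely related: runc is essentially a further layer of packaging on top of libcontainer. The runc command creates a new container, while the low-level interaction with the operating system is still done by libcontainer. In other words, runc is Docker's own low-level container code, libcontainer, repackaged and contributed to the community.

runc does differ from the original libcontainer, though. The main difference is that runc follows the OCI standard, including support for hooks.

2. The runc startup flow

Container startup in runc begins with the main function. main() (runc/main.go) registers a number of commands, which are runc's main features. Container projects commonly use github.com/urfave/cli as their command-line library to parse and execute commands, and runc does too.

The command to focus on here is createCommand, which creates a container. It calls startContainer(context, spec, CT_ACT_CREATE, nil), which in turn calls createContainer. createContainer builds a logical container: it exists only in memory and is not actually running yet. The logical container is created through a factory, whose interface is: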

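package libcontainer

import (
	"github.com/opencontainers/runc/libcontainer/configs"
)

// Factory creates and loads containers for a specific platform.
type Factory interface {
	// Create builds a new container with the given id and configuration.
	Create(id string, config *configs.Config) (Container, error)
	// Load restores an existing container by id.
	Load(id string) (Container, error)
	// StartInitialization is called in the re-executed init process.
	StartInitialization() error
	// Type returns the factory type.
	Type() string
}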

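The factory pattern is used mainly because containers can be implemented on many platforms: Linux, or possibly Windows. linux_factory implements the interface for Linux and returns a linuxContainer. Starting the logical container is then handed to the runner.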
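The factory idea can be sketched independently of runc. In the example below every type and function (container, factory, linuxFactory, newFactory) is a simplified, hypothetical stand-in and not libcontainer's real API; it only shows how callers depend on an interface while the platform decides which concrete type is returned.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// container and factory are hypothetical, simplified interfaces.
type container interface {
	ID() string
	Platform() string
}

type factory interface {
	Create(id string) (container, error)
	Type() string
}

// linuxContainer is the Linux-specific product of the factory.
type linuxContainer struct{ id string }

func (c *linuxContainer) ID() string       { return c.id }
func (c *linuxContainer) Platform() string { return "linux" }

// linuxFactory builds linuxContainer values.
type linuxFactory struct{}

func (linuxFactory) Create(id string) (container, error) {
	if id == "" {
		return nil, errors.New("empty container id")
	}
	// The container is only a value in memory; nothing is running yet.
	return &linuxContainer{id: id}, nil
}

func (linuxFactory) Type() string { return "sketch-linux" }

// newFactory picks an implementation for the given operating system name.
func newFactory(goos string) (factory, error) {
	switch goos {
	case "linux":
		return linuxFactory{}, nil
	default:
		return nil, fmt.Errorf("no factory for platform %q", goos)
	}
}

func main() {
	f, err := newFactory(runtime.GOOS)
	if err != nil {
		fmt.Println(err)
		return
	}
	c, err := f.Create("demo")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(f.Type(), c.ID(), c.Platform())
}
```

Supporting another platform would mean adding one more case to newFactory and one more pair of types, without touching the callers.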

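The most important part of runner is its run method, which wraps the process section of config.json into a libcontainer.Process value and returns it. This Process is also only a logical process and is not running yet. The container is what runs the process.

The container is started by calling linuxContainer's start method. During startup, it first calls newParentProcess to build the parent process. This is an important method. It begins by creating a socket pair with NewSockPair("init"), which is used mainly for communication between the parent and child processes.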

func (c *linuxContainer) newParentProcess(p *Process) (parentProcess, error) {
	parentInitPipe, childInitPipe, err := utils.NewSockPair("init")
	if err != nil {
		return nil, newSystemErrorWithCause(err, "creating new init pipe")
	}
	messageSockPair := filePair{parentInitPipe, childInitPipe}

	parentLogPipe, childLogPipe, err := os.Pipe()
	if err != nil {
		return nil, fmt.Errorf("Unable to create the log pipe:  %s", err)
	}
	logFilePair := filePair{parentLogPipe, childLogPipe}

	cmd := c.commandTemplate(p, childInitPipe, childLogPipe)
	if !p.Init {
		return c.newSetnsProcess(p, cmd, messageSockPair, logFilePair)
	}
	if err := c.includeExecFifo(cmd); err != nil {
		return nil, newSystemErrorWithCause(err, "including execfifo in cmd.Exec setup")
	}
	return c.newInitProcess(p, cmd, messageSockPair, logFilePair)
}

The end result is an initProcess:

type initProcess struct {
	cmd             *exec.Cmd
	messageSockPair filePair
	logFilePair     filePair
	config          *initConfig
	manager         cgroups.Manager
	intelRdtManager intelrdt.Manager
	container       *linuxContainer
	fds             []string
	process         *Process
	bootstrapData   io.Reader
	sharePidns      bool
}

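cmd is the prepared parent-process command, which executes runc init. Once cmd is started, the child process (the user's container process) is running, but it does not yet have the command it should run; the parent process passes that command to it. messageSockPair is used for communication between parent and child. The function that finally gets called is the start method of initProcess.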

func (p *initProcess) start() (retErr error) {
	defer p.messageSockPair.parent.Close()
	// Start the prepared cmd, which launches a separate child process: the container process.
	err := p.cmd.Start()
	p.process.ops = p
	// close the write-side of the pipes (controlled by child)
	p.messageSockPair.child.Close()
	p.logFilePair.child.Close()
	if err != nil {
		p.process.ops = nil
		return newSystemErrorWithCause(err, "starting init process command")
	}
	defer func() {
		if retErr != nil {
			// terminate the process to ensure we can remove cgroups
			if err := ignoreTerminateErrors(p.terminate()); err != nil {
				logrus.WithError(err).Warn("unable to terminate initProcess")
			}

			p.manager.Destroy()
			if p.intelRdtManager != nil {
				p.intelRdtManager.Destroy()
			}
		}
	}()
	if err := p.manager.Apply(p.pid()); err != nil {
		return newSystemErrorWithCause(err, "applying cgroup configuration for process")
	}
	if p.intelRdtManager != nil {
		if err := p.intelRdtManager.Apply(p.pid()); err != nil {
			return newSystemErrorWithCause(err, "applying Intel RDT configuration for process")
		}
	}
	// Write the bootstrap data into the pipe; the child reads it and carries out the next steps.
	if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
		return newSystemErrorWithCause(err, "copying bootstrap data to pipe")
	}
	childPid, err := p.getChildPid()
	if err != nil {
		return newSystemErrorWithCause(err, "getting the final child's pid from pipe")
	}

	fds, err := getPipeFds(childPid)
	if err != nil {
		return newSystemErrorWithCausef(err, "getting pipe fds for pid %d", childPid)
	}
	p.setExternalDescriptors(fds)
	if p.config.Config.Namespaces.Contains(configs.NEWCGROUP) && p.config.Config.Namespaces.PathOf(configs.NEWCGROUP) == "" {
		if _, err := p.messageSockPair.parent.Write([]byte{createCgroupns}); err != nil {
			return newSystemErrorWithCause(err, "sending synchronization value to init process")
		}
	}

	// Wait for our first child to exit
	if err := p.waitForChildExit(childPid); err != nil {
		return newSystemErrorWithCause(err, "waiting for our first child to exit")
	}

	if err := p.createNetworkInterfaces(); err != nil {
		return newSystemErrorWithCause(err, "creating network interfaces")
	}
	if err := p.updateSpecState(); err != nil {
		return newSystemErrorWithCause(err, "updating the spec state")
	}
	if err := p.sendConfig(); err != nil {
		return newSystemErrorWithCause(err, "sending config to init process")
	}
	var (
		sentRun    bool
		sentResume bool
	)

	ierr := parseSync(p.messageSockPair.parent, func(sync *syncT) error {
		switch sync.Type {
		case procReady:
			.......
		case procHooks:
			.......
		default:
			return newSystemError(errors.New("invalid JSON payload from child"))
		}
		return nil
	})

	if !sentRun {
		return newSystemErrorWithCause(ierr, "container init")
	}
	if p.config.Config.Namespaces.Contains(configs.NEWNS) && !sentResume {
		return newSystemError(errors.New("could not synchronise after executing prestart and CreateRuntime hooks with container process"))
	}
	if err := unix.Shutdown(int(p.messageSockPair.parent.Fd()), unix.SHUT_WR); err != nil {
		return newSystemErrorWithCause(err, "shutting down init pipe")
	}

	// Must be done after Shutdown so the child will exit and we can wait for it.
	if ierr != nil {
		p.wait()
		return ierr
	}
	return nil
}

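This is the core method. It does the following:

It runs cmd, starting a separate process. What that process does is exactly what initCommand does; that code is analyzed later.
It copies bootstrapData into the pipe so that the child process can read its configuration from the pipe.
It then calls parseSync() to synchronize with the container's init process over the init pipe and, once initialization has completed, runs callbacks such as the prestart hooks. Finally it closes the init pipe, and container creation is complete.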

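V. Interaction between the child and parent processes

The child process is the container process and the parent process is the runc process. As the analysis above shows, the runc process starts a separate container process. Let's now look at how the container child process starts.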

var initCommand = cli.Command{
	Name:  "init",
	Usage: `initialize the namespaces and launch the process (do not call it outside of runc)`,
	Action: func(context *cli.Context) error {
		factory, _ := libcontainer.New("")
		if err := factory.StartInitialization(); err != nil {
			// as the error is sent back to the parent there is no need to log
			// or write it to stderr because the parent process will handle this
			os.Exit(1)
		}
		panic("libcontainer: container init failed to exec")
	},
}

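libcontainer.New() creates a new linux_factory, and StartInitialization is called on it. StartInitialization reads the configuration and container state through the file descriptors inherited from the parent process and calls newContainerInit.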

func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd int) (initer, error) {
	var config *initConfig
	if err := json.NewDecoder(pipe).Decode(&config); err != nil {
		return nil, err
	}
	if err := populateProcessEnvironment(config.Env); err != nil {
		return nil, err
	}
	switch t {
	case initSetns:
		return &linuxSetnsInit{
			pipe:          pipe,
			consoleSocket: consoleSocket,
			config:        config,
		}, nil
	case initStandard:
		return &linuxStandardInit{
			pipe:          pipe,
			consoleSocket: consoleSocket,
			parentPid:     unix.Getppid(),
			config:        config,
			fifoFd:        fifoFd,
		}, nil
	}
	return nil, fmt.Errorf("unknown init type %q", t)
}

newContainerInit returns a different Linux init implementation depending on the type.

type linuxStandardInit struct {
	pipe          *os.File
	consoleSocket *os.File
	parentPid     int
	fifoFd        int
	config        *initConfig
}

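Finally, the Init method of linuxStandardInit is called. It performs the following operations:

setupNetwork: configures the container's network, calling the third-party netlink.LinkSetUp.
setupRoute: configures the container's static routes, calling the third-party netlink.RouteAdd.
label.Init: checks whether SELinux is enabled and stores the result in a global variable.
finalizeNamespace: adds the required privileged capabilities to the whitelist according to the config, sets up the user namespace, and closes file descriptors that are no longer needed.
unix.Openat: opens the FIFO write-only and writes 0. It blocks until the other end of the FIFO is opened for reading and the data is read.
syscall.Exec: the system call that executes the user-specified program to run in the container.

It also configures hostname, apparmor, processLabel, sysctl, readonlyPath, and maskPath. Although create does not execute the command, it does check the command path, so an error there is returned during create.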
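The blocking behaviour of the FIFO step is what separates create from start: the init process stays parked on the FIFO until someone opens the other end. The following sketch reproduces that behaviour on a Unix-like system with a goroutine playing the init side. fifoHandshake and demoFifo are hypothetical helpers, the FIFO lives in a temporary directory, and the sketch opens the FIFO by path rather than through an inherited file descriptor.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// fifoHandshake creates a FIFO at path, lets a "writer" block on it, and
// then releases the writer by opening the FIFO for reading.
func fifoHandshake(path string) (string, error) {
	if err := syscall.Mkfifo(path, 0o622); err != nil {
		return "", err
	}
	writeErr := make(chan error, 1)
	go func() {
		// "init" side: a write-only open blocks until a reader shows up.
		w, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err != nil {
			writeErr <- err
			return
		}
		_, err = w.Write([]byte("0"))
		w.Close()
		writeErr <- err
	}()
	// "start" side: opening for reading unblocks the writer.
	r, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer r.Close()
	data, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	if err := <-writeErr; err != nil {
		return "", err
	}
	return string(data), nil
}

// demoFifo runs the handshake inside a throwaway temporary directory.
func demoFifo() (string, error) {
	dir, err := os.MkdirTemp("", "execfifo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	return fifoHandshake(filepath.Join(dir, "exec.fifo"))
}

func main() {
	got, err := demoFifo()
	fmt.Printf("read %q, err=%v\n", got, err)
}
```

In the flow described above, the step that follows the unblocked write is syscall.Exec, which replaces the init process with the user's program.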

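Summary:

runC is the low-level implementation of containers and works mainly by invoking system calls provided by Linux. As the code analysis shows, container technology is essentially a combination of Linux technologies such as namespaces, cgroups, chroot, filesystems, and aufs. This combination solves the problem of an application's production runtime environment. rootfs in particular removes the inconsistency between development and production environments, making applications easier to install and deploy.

Namespaces provide container isolation. In simplified form, the implementation comes down to code like the following: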

cmd := exec.Command(initCmd, "init")
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS |
	syscall.CLONE_NEWNET | syscall.CLONE_NEWIPC,
}

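The code above means that when the process is forked, new UTS, PID, mount (NS), network, and IPC namespaces are cloned for it. This isolates a separate execution environment.

cgroups limit a process's resources, such as CPU, memory, and blkio. In simplified form, the implementation looks like this: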

	cgroupManager := cgroups.NewCgroupManager(containerID)
	defer cgroupManager.Destroy()
	cgroupManager.Set(res)
	cgroupManager.Apply(parent.Process.Pid)

In other words, the code above places the process's pid in the tasks file under the cgroup directory, so that limits can be applied to it.
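Stripped of the manager abstraction, joining a cgroup is a file write. The sketch below shows only that write, against a temporary directory that stands in for a cgroup directory; on a real cgroup v1 system the directory would be somewhere under the cgroup mount point and the write would need suitable privileges. addToGroup and demoTasksFile are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// addToGroup appends pid to the "tasks" file inside groupDir.
// O_CREATE is only here so the sketch works on an ordinary directory.
func addToGroup(groupDir string, pid int) error {
	f, err := os.OpenFile(filepath.Join(groupDir, "tasks"),
		os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	if _, err := f.WriteString(strconv.Itoa(pid) + "\n"); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// demoTasksFile writes pid into a stand-in group directory and returns
// the resulting contents of its tasks file.
func demoTasksFile(pid int) (string, error) {
	dir, err := os.MkdirTemp("", "cgroup-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := addToGroup(dir, pid); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(dir, "tasks"))
	return string(data), err
}

func main() {
	content, err := demoTasksFile(os.Getpid())
	fmt.Printf("tasks=%q err=%v\n", content, err)
}
```

Resource limits themselves are set the same way, by writing values into other files of the same directory, which is roughly what the Set call above hides.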

References:

https://github.com/opencontainers/runc

https://cizixs.com/2017/11/05/oci-and-runc/
