CS:APP3e (Computer Systems: A Programmer's Perspective, 3rd edition) ShellLab (tsh) Lab

Reposted article (original published 2017-12-26 23:44). Category: Computer Systems

**The detailed lab requirements and resources are available at http://csapp.cs.cmu.edu/3e/labs.html or http://www.cs.cmu.edu/~./213/schedule.html.**

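Preparation

Notes

The tsh prompt is "tsh> ".
The user's input consists of a name followed by arguments, separated by one or more spaces. If name is a tsh built-in command, tsh should handle it immediately and then wait for the next input. Otherwise, tsh should assume that name is the path of an executable file and run that file in a child process (this is also called a job).
tsh does not need to support pipes or redirection.
If the user types ctrl-c (ctrl-z), SIGINT (SIGTSTP) should be sent to every process in the foreground process group. If there is no foreground job, the two signals should have no effect.
If a command ends with "&", tsh should run it in the background; otherwise it should run it in the foreground (and wait for it to finish).
Each job is identified by a PID or a job ID (JID), both positive integers. A JID is written with a "%" prefix: for example, "%5" denotes the job with JID 5, while "5" denotes the process with PID 5.
tsh should support the following built-in commands: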

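quit: exits the shell

jobs: lists all background jobs

bg <job>: sends SIGCONT to the job denoted by <job> and runs it in the background. <job> can be either a PID or a JID.

fg <job>: sends SIGCONT to the job denoted by <job> and runs it in the foreground. <job> can be either a PID or a JID.

tsh should reap all of its zombie children. If a job terminates because it received a signal that it did not catch (that is, it installed no handler for it), tsh should print the job's PID and a description of the signal.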
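The reaping requirement above comes down to inspecting the status word that waitpid fills in. The following is a minimal sketch of that decoding step, assuming hypothetical helper names (describe_status and demo_uncaught_signal are not part of the lab's starter code); it uses only the standard status macros from <sys/wait.h>.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: turn a waitpid() status into a short message.
 * Returns whatever snprintf() returns. */
int describe_status(pid_t pid, int status, char *buf, size_t len)
{
    if (WIFEXITED(status))
        return snprintf(buf, len, "(%d) exited with status %d",
                        (int)pid, WEXITSTATUS(status));
    if (WIFSIGNALED(status))    /* killed by a signal it did not catch */
        return snprintf(buf, len, "(%d) terminated by signal %d",
                        (int)pid, WTERMSIG(status));
    if (WIFSTOPPED(status))     /* only reported when WUNTRACED is used */
        return snprintf(buf, len, "(%d) stopped by signal %d",
                        (int)pid, WSTOPSIG(status));
    return snprintf(buf, len, "(%d) unknown status change", (int)pid);
}

/* Hypothetical demo: the child kills itself with SIGKILL (which can never
 * be caught) and the parent describes what happened.
 * Returns the terminating signal number, 0 if the child was not killed
 * by a signal, or -1 on error. */
int demo_uncaught_signal(char *buf, size_t len)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        raise(SIGKILL);
        _exit(0);               /* not reached */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    describe_status(pid, status, buf, len);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

In tsh itself the message would also carry the JID, as in the reference output "Job [1] (6241) terminated by signal 2".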

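Hints

Build tsh incrementally with the trace files, starting with trace01.txt.
The WUNTRACED and WNOHANG options of waitpid are useful (see Preparation).
When parsing a command and creating a child process, block SIGCHLD with sigprocmask before calling fork, call addjob to add the newly created job to the job list, and only then unblock the signal (the lecture slides discuss the race this avoids). Also, because the child inherits the set of blocked signals, the child must remember to unblock SIGCHLD.
Some programs that expect a terminal environment will try to read from or write to their parent, for example /bin/sh, and programs such as more, less, vi, and emacs do "strange things" to the terminal settings. For this lab, testing with simple text-mode programs such as /bin/ls and /bin/echo is enough.
When we run tsh from a real shell (such as bash), tsh itself is placed in the foreground process group, and its children are in that group too, as shown below: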

               +----------+
               |   Bash   |
               +----+-----+
                    |
+-----------------------------------------+
|                   v                     |
|              +----+-----+   foreground  |
|              |   tsh    |   group       |
|              +----+-----+               |
|                   |                     |
|         +--------------------+          |
|         |         |          |          |
|         v         v          v          |
|       /bin/ls    /bin/sleep  xxxxx      |
|                                         |
|                                         |
+-----------------------------------------+

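So when we type ctrl-c (ctrl-z) in the terminal, SIGINT (SIGTSTP) is sent to every process in the foreground process group, including the programs that tsh considers background jobs. One solution is for the child to call setpgid(0, 0) after fork and before execve, which puts it in a new process group whose pgid equals the child's pid. When tsh receives SIGINT or SIGTSTP, it should forward the signal to the process group that tsh itself regards as the foreground job (including every process in that group).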

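Approach and Implementation

I'll start by listing the six guidelines for writing safe signal handlers from the book (section 8.5.5; see the book for detailed explanations). Keep them in mind while coding:

Keep signal handlers as simple as possible, for example by only setting a flag.
Call only async-signal-safe functions inside a signal handler (man 7 signal has the complete list; on newer Linux systems it is in man 7 signal-safety).
Save errno on entry to the signal handler and restore it on exit (see also: Thread-local storage).
Temporarily block all signals while accessing shared global data structures, then restore the previous mask.
Declare global variables volatile.
Declare flags as sig_atomic_t.

Below are a few points to watch for in each of the 7 functions the lab asks us to implement; the comments in the code explain some of them as well: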

/* Here are the functions that you will implement */
void eval(char *cmdline);
int builtin_cmd(char **argv, char *cmdline);
void do_bgfg(char **argv, char *cmdline);
void waitfg(pid_t pid);

void sigchld_handler(int sig);
void sigtstp_handler(int sig);
void sigint_handler(int sig);

1.void eval(char *cmdline)

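After parseline has parsed the input, we first determine whether it is a built-in command (implemented by the shell) or a program (a file on disk). If it is a built-in command, builtin_cmd(argv, cmdline) handles it; otherwise we create a child process and add it to the job list. Note that access is called before fork to check that the file exists, so that we never fork a child or add a job for a command that cannot be found. There is also a race to watch out for: after fork the shell adds the job to the job list, and the signal handler sigchld_handler deletes the job from the list once it has reaped the child. If the signal arrives early, the deletion can happen before the addition, and the job will then never disappear from the list (a stale entry, effectively a leak). So we block SIGCHLD before fork and restore the mask only after the job has been added.

Update: if an error occurs in the child after fork, the child should exit with _exit rather than exit (note that unix_error also calls exit). See: What is the difference between using _exit() & exit() in a conventional Linux fork-exec?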
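One visible difference between the two calls is what happens to stdio buffers that the child inherited from the parent. The sketch below is a small experiment, not code from the lab, and bytes_written_after_fork is a hypothetical name: it leaves one byte in a buffered temporary file, forks, and measures the file afterwards.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Writes one byte into a buffered temporary file, forks, and lets the
 * child leave with either _exit() or exit(). Returns the final size of
 * the file in bytes, or -1 on error. */
long bytes_written_after_fork(int child_uses_underscore_exit)
{
    struct stat st;
    long size;
    pid_t pid;
    FILE *f = tmpfile();

    if (f == NULL)
        return -1;
    fputc('x', f);              /* stays in the stdio buffer for now */

    pid = fork();
    if (pid < 0) {
        fclose(f);
        return -1;
    }
    if (pid == 0) {
        if (child_uses_underscore_exit)
            _exit(0);           /* no stdio flush, no atexit handlers */
        exit(0);                /* flushes the child's copy of the buffer */
    }
    waitpid(pid, NULL, 0);

    fflush(f);                  /* flush the parent's own copy of the byte */
    size = (fstat(fileno(f), &st) == 0) ? (long)st.st_size : -1;
    fclose(f);
    return size;
}
```

With _exit the file ends up with the single byte the parent wrote; with exit the child flushes its inherited copy as well, so the byte appears twice. In a shell, the same effect can duplicate buffered output after a failed execve.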

/*
 * eval - Evaluate the command line that the user has just typed in
 *
 * If the user has requested a built-in command (quit, jobs, bg or fg)
 * then execute it immediately. Otherwise, fork a child process and
 * run the job in the context of the child. If the job is running in
 * the foreground, wait for it to terminate and then return.  Note:
 * each child process must have a unique process group ID so that our
 * background children don't receive SIGINT (SIGTSTP) from the kernel
 * when we type ctrl-c (ctrl-z) at the keyboard.
 */
void eval(char *cmdline)
{
    char *argv[MAXARGS];
    int bg_flag;

    bg_flag = parseline(cmdline, argv); /* true if the user has requested a BG job, false if the user has requested a FG job. */

    if (builtin_cmd(argv, cmdline)) /* built-in command */
    {
        return;
    }
    else /* program (file) */
    {
        if (access(argv[0], F_OK)) /* file not found: do not fork or add a job */
        {
            fprintf(stderr, "%s: Command not found\n", argv[0]);
            return;
        }

        pid_t pid;
        sigset_t mask, prev;
        sigemptyset(&mask);
        sigaddset(&mask, SIGCHLD);
        sigprocmask(SIG_BLOCK, &mask, &prev); /* block SIGCHLD */

        if ((pid=fork()) == 0) /* child */
        {
            sigprocmask(SIG_SETMASK, &prev, NULL); /* unblock SIG_CHLD */

            if (!setpgid(0, 0))
            {
                if (execve(argv[0], argv, environ))
                {
                    fprintf(stderr, "%s: Failed to execve\n", argv[0]);
                    _exit(1);
                }
                /* context changed */
            }
            else
            {
                fprintf(stderr, "Failed to invoke setpgid(0, 0)\n");
                _exit(1);
            }
        }
        else if (pid > 0)/* tsh */
        {
            if (!bg_flag) /* exec foreground */
            {
                fg_pid = pid;
                fg_pid_reap = 0;
                addjob(jobs, pid, FG, cmdline);
                sigprocmask(SIG_SETMASK, &prev, NULL); /* unblock SIG_CHLD */
                waitfg(pid);
            }
            else /* exec background */
            {
                addjob(jobs, pid, BG, cmdline);
                sigprocmask(SIG_SETMASK, &prev, NULL); /* unblock SIG_CHLD */
                printf("[%d] (%d) %s", maxjid(jobs), pid, cmdline);
            }
            return;
        }
        else
        {
            unix_error("Failed to fork child");
        }
    }
    return;
}

2.int builtin_cmd(char **argv, char *cmdline)

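This function checks which built-in command was entered. Note that if the user just presses Enter, the first element of argv after parsing is a null pointer, and passing that null pointer to strcmp causes a segmentation fault.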

/*
 * builtin_cmd - If the user has typed a built-in command then execute
 *    it immediately.
 */
int builtin_cmd(char **argv, char *cmdline)
{
    char *first_arg = argv[0];

    if (first_arg == NULL) /* empty input (just '\n'): argv[0] is NULL, and
                              passing NULL to strcmp would cause a segfault */
    {
        return 1;
    }

    if (!strcmp(first_arg, "quit"))
    {
        exit(0);
    }
    else if (!strcmp(first_arg, "jobs"))
    {
        listjobs(jobs);
        return 1;
    }
    else if (!strcmp(first_arg, "bg") || !strcmp(first_arg, "fg"))
    {
        do_bgfg(argv, cmdline);
        return 1;
    }

    return 0;
}

3.void do_bgfg(char **argv, char *cmdline)

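This function handles the two built-in commands bg and fg. Note that fg has two cases: 1. The background job is stopped, in which case we set the related variables and then send it the continue signal. 2. The job is already running, in which case we only need to change the job's state and set the related variables. We then enter waitfg and wait for the new foreground job to finish.

Writing this, I also ran into a compatibility issue that cost me several hours of debugging:

In man 7 signal, SIGCHLD is described as follows: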

SIGCHLD   20,17,18    Ign     Child stopped or terminated

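In other words, when a child terminates or stops, this signal is sent to the parent, and the parent enters the sigchld_handler signal handler to reap the child or print a message. On my machine, however, I found that the signal is also sent to the parent when a child goes from stopped to running (on receiving SIGCONT). This causes a problem: when we resume a stopped background job, a SIGCHLD is sent to the parent (the shell), so the parent enters sigchld_handler and tries to reap the child. But the child has neither terminated nor stopped, so the handler's waitpid call blocks until the child finishes, and the shell appears to hang.

After a long debugging session, I found the following definition of SIGCHLD, attributed to one of the POSIX standards: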

SIGCHLD

The SIGCHLD signal is sent to a process when a child process terminates, is interrupted, or resumes after being interrupted. One common usage of the signal is to instruct the operating system to clean up the resources used by a child process after its termination without an explicit call to the wait system call.

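"or resumes after being interrupted." When I read that part I nearly coughed up blood...

To confirm my suspicion, I looked it up in the manual on FreeBSD 11.1:

It says "changed", so it seems my machine follows one of the POSIX standards here.

My workaround is a global pid_t variable, stopped_resume_child, that records the stopped process we are about to resume. On entry, the signal handler first checks whether this variable is greater than zero, and if so it returns immediately without doing anything. (There is actually a race with other processes here; I was short on time, so I left it as is.)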

/*
 * do_bgfg - Execute the builtin bg and fg commands
 */
void do_bgfg(char **argv, char *cmdline)
{
    char *first_arg = argv[0];
    if (!strcmp(first_arg, "bg"))
    {
        if (argv[1] == NULL)
        {
            fprintf(stderr, "bg command requires PID or %%jobid argument\n");
            return;
        }

        if (argv[1][0] == '%') /* JID */
        {
            int jid = atoi(argv[1] + 1);
            if (jid)
            {
                struct job_t *job_tmp = getjobjid(jobs, jid);
                if (job_tmp != NULL)
                {
                    job_tmp->state = BG;
                    printf("[%d] (%d) %s", jid, job_tmp->pid, job_tmp->cmdline);
                    stopped_resume_child = job_tmp->pid;
                    killpg(job_tmp->pid, SIGCONT);

                    return;
                }
                else
                {
                    fprintf(stderr, "%%%s: No such job\n", argv[1] + 1);
                }
            }
            else
            {
                fprintf(stderr, "%%%s: No such job\n", argv[1] + 1);
            }
        }
        else /* PID */
        {
            pid_t pid = atoi(argv[1]);
            if(pid)
            {
                struct job_t *job_tmp = getjobpid(jobs, pid);
                if (job_tmp != NULL)
                {
                    job_tmp->state = BG;
                    printf("[%d] (%d) %s", job_tmp->jid, pid, job_tmp->cmdline);
                    stopped_resume_child = job_tmp->pid;
                    killpg(pid, SIGCONT);

                    return;
                }
                else
                {
                    fprintf(stderr, "(%s): No such process\n", argv[1]);
                }
            }
            else
            {
                fprintf(stderr, "bg: argument must be a PID or %%jobid\n");
            }
        }
    }
    else
    {
        /* there are two cases when using fg:
           1. the job is stopped
           2. the job is running
        */

        if (argv[1] == NULL)
        {
            fprintf(stderr, "fg command requires PID or %%jobid argument\n");
            return;
        }
        if (argv[1][0] == '%') /* JID */
        {
            int jid = atoi(argv[1] + 1);
            if (jid)
            {
                struct job_t *job_tmp = getjobjid(jobs, jid);
                if (job_tmp != NULL)
                {
                    int state = job_tmp->state;
                    fg_pid = job_tmp->pid; /* this is the new foreground process */
                    fg_pid_reap = 0;

                    job_tmp->state = FG;

                    if (state == ST)
                    {
                        stopped_resume_child = job_tmp->pid; /* make the SIGCHLD handler skip the SIGCHLD caused by SIGCONT */
                        killpg(job_tmp->pid, SIGCONT);
                    }

                    waitfg(job_tmp->pid); /* wait until the foreground job terminates or stops */
                    return;
                }
                else
                {
                    fprintf(stderr, "%%%s: No such job\n", argv[1] + 1);
                }
            }
            else
            {
                fprintf(stderr, "%%%s: No such job\n", argv[1] + 1);
            }
        }
        else /* PID */
        {
            pid_t pid = atoi(argv[1]);
            if(pid)
            {
                struct job_t *job_tmp = getjobpid(jobs, pid);
                if (job_tmp != NULL)
                {
                    int state = job_tmp->state;
                    fg_pid = job_tmp->pid; /* this is the new foreground process */
                    fg_pid_reap = 0;

                    job_tmp->state = FG;

                    if (state == ST)
                    {
                        stopped_resume_child = job_tmp->pid; /* make the SIGCHLD handler skip the SIGCHLD caused by SIGCONT */
                        killpg(pid, SIGCONT);
                    }

                    waitfg(job_tmp->pid); /* wait until the foreground job terminates or stops */
                    return;
                }
                else
                {
                    fprintf(stderr, "(%s): No such process\n", argv[1]);
                }
            }
            else
            {
                fprintf(stderr, "fg: argument must be a PID or %%jobid\n");
            }
        }
    }
    return;
}
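The bg and fg branches above each parse the argument twice, once for "%JID" and once for a bare PID, and atoi silently accepts input such as "12x". A possible way to factor that out is sketched below; parse_job_spec is a hypothetical helper, not something the lab provides, and it relies only on strtol from the C standard library.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a bg/fg argument.
 * "%5" means JID 5 and "5" means PID 5.
 * On success stores the number in *id, sets *is_jid to 1 or 0, and
 * returns 0. Returns -1 for anything that is not a positive decimal
 * integer (optionally prefixed by '%'). */
int parse_job_spec(const char *arg, int *is_jid, long *id)
{
    char *end;
    long value;

    if (arg == NULL)
        return -1;

    *is_jid = (arg[0] == '%');
    if (*is_jid)
        arg++;                              /* skip the '%' prefix */

    if (arg[0] < '0' || arg[0] > '9')       /* rejects "", "-3", "+3", " 3" */
        return -1;

    errno = 0;
    value = strtol(arg, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0)
        return -1;                          /* overflow, trailing junk, or zero */

    *id = value;
    return 0;
}
```

With such a helper, do_bgfg could look the job up with getjobjid or getjobpid depending on is_jid and share the rest of the logic between the two branches.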

4.void waitfg(pid_t pid)

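Earlier I declared a global variable fg_pid_reap of type volatile sig_atomic_t. As soon as the signal handler reaps the foreground process it sets fg_pid_reap to 1, so waitfg returns and the shell goes on to read the user's next input. The busy loop around sleep adds some latency, but the lab handout asks for this implementation, so there is not much I can do ;)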

/*
 * waitfg - Block until process pid is no longer the foreground process
 */
void waitfg(pid_t pid)
{
    while (!fg_pid_reap)
    {
        sleep(1);
    }
    fg_pid_reap = 0;
    return;
}

5.void sigchld_handler(int sig)

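Remember to save and restore errno.

Note that we cannot use a while loop to reap children here, because waitpid is called without WNOHANG: if other background processes are still running, the next waitpid call would block until one of them finishes. Of course, reaping only once with if can lose notifications, for example when several background jobs finish at the same time, because pending signals of the same type are not queued. The lab handout asks for this implementation, so there is not much I can do ;)

Note that if the child was stopped (SIGTSTP, ctrl-z), we neither reap it nor delete its entry from the job list.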

/*
 * sigchld_handler - The kernel sends a SIGCHLD to the shell whenever
 *     a child job terminates (becomes a zombie), or stops because it
 *     received a SIGSTOP or SIGTSTP signal. The handler reaps all
 *     available zombie children, but doesn't wait for any other
 *     currently running children to terminate.
 */
void sigchld_handler(int sig) /* When a child process stops or terminates, SIGCHLD is sent to the parent process. */
{
    int olderrno = errno;
    int status;
    pid_t pid;

    if (stopped_resume_child) /* this SIGCHLD was caused by resuming a stopped child */
    {
        stopped_resume_child = 0;
        return;
    }

    if ((pid = waitpid(-1, &status, WUNTRACED)) > 0) /* don't use while! */
    {
        if (pid == fg_pid)
        {
            fg_pid_reap = 1;
        }

        if (WIFEXITED(status)) /* the child terminated normally */
        {
            deletejob(jobs, pid);
        }
        else if (WIFSIGNALED(status)) /* the child was terminated by a signal */
        {
            printf("Job [%d] (%d) terminated by signal %d\n", pid2jid(pid), pid, WTERMSIG(status));
            deletejob(jobs, pid);
        }
        else if (WIFSTOPPED(status)) /* the child was stopped, e.g. by SIGTSTP */
        {
            /* don't delete the job, just mark it as stopped */
            struct job_t *p = getjobpid(jobs, pid);
            if (p != NULL)
            {
                p->state = ST;
            }
            printf("Job [%d] (%d) stopped by signal %d\n", pid2jid(pid), pid, WSTOPSIG(status));
        }
    }

    errno = olderrno;
    return;
}

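6.void sigint_handler(int sig)

Note that the signal must be delivered to the whole process group, i.e. with killpg; sending it to a single process is not enough.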

/*
 * sigint_handler - The kernel sends a SIGINT to the shell whenever the
 *    user types ctrl-c at the keyboard.  Catch it and send it along
 *    to the foreground job.
 */
void sigint_handler(int sig)
{
    int olderrno = errno;

    pid_t pgid = fgpid(jobs);
    if (pgid)
    {
        killpg(pgid, SIGINT);
    }

    errno = olderrno;
    return;
}

7.void sigtstp_handler(int sig)

No explanation needed; it mirrors sigint_handler.

/*
 * sigtstp_handler - The kernel sends a SIGTSTP to the shell whenever
 *     the user types ctrl-z at the keyboard. Catch it and suspend the
 *     foreground job by sending it a SIGTSTP.
 */
void sigtstp_handler(int sig)
{
    int olderrno = errno;

    pid_t pgid = fgpid(jobs);
    if (pgid)
    {
        killpg(pgid, SIGTSTP);
    }
    
    errno = olderrno;
    return;
}

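Results

To make checking easier, I wrote a bash script that compares the output of my tsh with that of tshref, the reference solution shipped with the lab (the test cases are trace01.txt through trace16.txt):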

frank@under:~/tmp/shlab-handout$ cat test.sh 
#!/bin/bash

for file in trace*.txt
do
	./sdriver.pl -t "$file" -s ./tshref > "tshref_$file"
	./sdriver.pl -t "$file" -s ./tsh > "tsh_$file"
done

for file in trace*.txt
do
	diff "tsh_$file" "tshref_$file" > "diff_$file"
done

for file in diff_trace*
do
	echo "$file" " :"
	cat "$file"
	echo -e "-------------------------------------\n"
done

The full output is too long to show, so here are the last few:

frank@under:~/tmp/shlab-handout$ ./test.sh 

#.............................
#.............................
#.............................

diff_trace13.txt  :
5c5
< tsh> Job [1] (6173) stopped by signal 20
---
> tsh> Job [1] (6162) stopped by signal 20
7c7
< tsh> [1] (6173) Stopped ./mysplit 4 
---
> tsh> [1] (6162) Stopped ./mysplit 4 
20,24c20,24
<  6170 pts/5    S+     0:00 /usr/bin/perl ./sdriver.pl -t trace13.txt -s ./tsh
<  6171 pts/5    S+     0:00 ./tsh
<  6173 pts/5    T      0:00 ./mysplit 4
<  6174 pts/5    T      0:00 ./mysplit 4
<  6177 pts/5    R      0:00 /bin/ps a
---
>  6159 pts/5    S+     0:00 /usr/bin/perl ./sdriver.pl -t trace13.txt -s ./tshref
>  6160 pts/5    S+     0:00 ./tshref
>  6162 pts/5    T      0:00 ./mysplit 4
>  6163 pts/5    T      0:00 ./mysplit 4
>  6166 pts/5    R      0:00 /bin/ps a
41c41
<  1303 tty7     Ssl+  21:49 /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
---
>  1303 tty7     Ssl+  21:48 /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
51,53c51,53
<  6170 pts/5    S+     0:00 /usr/bin/perl ./sdriver.pl -t trace13.txt -s ./tsh
<  6171 pts/5    S+     0:00 ./tsh
<  6182 pts/5    R      0:00 /bin/ps a
---
>  6159 pts/5    S+     0:00 /usr/bin/perl ./sdriver.pl -t trace13.txt -s ./tshref
>  6160 pts/5    S+     0:00 ./tshref
>  6169 pts/5    R      0:00 /bin/ps a
-------------------------------------

diff_trace14.txt  :
7c7
< tsh> [1] (6207) ./myspin 4 &
---
> tsh> [1] (6188) ./myspin 4 &
23c23
< tsh> Job [1] (6207) stopped by signal 20
---
> tsh> Job [1] (6188) stopped by signal 20
27c27
< tsh> [1] (6207) ./myspin 4 &
---
> tsh> [1] (6188) ./myspin 4 &
29c29
< tsh> [1] (6207) Running ./myspin 4 &
---
> tsh> [1] (6188) Running ./myspin 4 &
-------------------------------------

diff_trace15.txt  :
7c7
< tsh> Job [1] (6241) terminated by signal 2
---
> tsh> Job [1] (6224) terminated by signal 2
9c9
< tsh> [1] (6244) ./myspin 3 &
---
> tsh> [1] (6226) ./myspin 3 &

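As you can see, apart from the PIDs (and a timestamp in the ps output) everything is identical, which suggests that the tsh implementation is correct.

[Full project code](https://files.cnblogs.com/files/liqiuhao/tsh.7z)

Reflections

The biggest lesson from this lab is not to trust documentation blindly; implementing and verifying things yourself matters too. I also gained some understanding of the race conditions that concurrency produces.

Another interesting thing: before starting the lab, I read this in the lab handout: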

– In waitfg, use a busy loop around the sleep function.
– In sigchld handler, use exactly one call to waitpid.

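At the time I wondered whether using sleep and reaping only once with waitpid was unsafe or just silly. Then I looked on GitHub and found not only that everyone does it this way, but also that their code is very unsafe (it ignores all six of the safety guidelines above, and return values and errors go unchecked everywhere). So I figured my version would surely be much better than theirs.

The result... Taking all of these safety issues seriously turned out to be a lot of trouble, and time was limited, so I implemented the ones that were easy and left two undone: "block signals before accessing global data structures" and "call only async-signal-safe functions in signal handlers".

Finally, to sum up this lab by adapting a line from the author of the Mutt e-mail client: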

All code about this ShellLab on github sucks. This one just sucks less 😉
