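Theorem statement
For a binary classification problem, when the hypothesis space is a finite set of functions \(\mathcal{F}=\{f_1,f_2,\cdots,f_d\}\), the following inequality holds for every \(f\in\mathcal{F}\) with probability at least \(1-\delta\):
\(R(f)\leq\hat{R}(f)+\epsilon(d,N,\delta)\)
where
\(\epsilon(d,N,\delta)=\sqrt{\frac{1}{2N}(\log d+\log\frac{1}{\delta})}\)
Here \(R(f)\) is the generalization error (the expected loss), \(\hat{R}(f)\) is the training error (the average loss over \(N\) samples), and \(0<\delta<1\).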
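To get a feel for the size of the gap term, the formula for \(\epsilon(d,N,\delta)\) can be evaluated directly. The following is a minimal sketch; the function name `generalization_gap` and the sample values of \(d\), \(N\), and \(\delta\) are illustrative assumptions, not part of the theorem.

```python
import math


def generalization_gap(d: int, n_samples: int, delta: float) -> float:
    """Evaluate epsilon(d, N, delta) = sqrt((log d + log(1/delta)) / (2N)).

    The name and argument order are hypothetical; only the formula comes
    from the theorem statement.
    """
    if d < 1 or n_samples < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need d >= 1, N >= 1 and 0 < delta < 1")
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * n_samples))


if __name__ == "__main__":
    # Illustrative values: 100 hypotheses, 95% confidence, growing sample size.
    for n in (100, 1_000, 10_000):
        print(n, generalization_gap(100, n, 0.05))
```

The gap shrinks at the rate \(1/\sqrt{N}\) and depends on \(d\) only through \(\log d\), so even a large finite hypothesis space adds relatively little to the bound.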
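The proof of this bound relies on Hoeffding's inequality.
Hoeffding's inequality
Suppose \(X_1,X_2,\cdots,X_n\) are independent random variables with \(P(X_i\in[a_i,b_i])=1\) for \(1\leq i\leq n\), and let \(S_n=\sum_{i=1}^{n}X_i\). Then for any \(t>0\) the following inequalities hold: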
\[\begin{align} P(S_n-E[S_n]\geq t)\leq &\exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2})\\ P(E[S_n]-S_n\geq t)\leq &\exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2}) \end{align} \]
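Proof of Hoeffding's inequality
We first prove a lemma that the proof of Hoeffding's inequality depends on:
Lemma
Let \(X\) be a random variable with \(P(X\in [a,b])=1\) and \(E(X)=0\). Then for any \(s>0\):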
\[\begin{align} E(e^{sX})\leq e^{\frac{1}{8}s^{2}(b-a)^{2}} \end{align} \]
Proof:
- First, if \(a=b=0\), then \(P(X=0)=1\) by assumption, so:
\(E(e^{sX})=e^{s\cdot 0}=1\leq e^{\frac{1}{8}s^2(0-0)^2}=1\)
- If \(a=b\neq0\), then \(P(X=a)=1\) and \(E(X)=a\neq0\), which contradicts the assumption, so \(a=b\neq0\) cannot occur.
- If \(a=0<b\), then \(X\geq 0\) with probability 1. Writing the expectation informally with a density \(p(x)\) plus a possible point mass at 0, we have \(E(X)=0\cdot P(X=0)+\int_{0^+}^{b}x\cdot p(x)dx\).
The second term satisfies \(\int_{0^+}^{b}x\cdot p(x)dx\geq 0\) because \(x>0\) and \(p(x)\geq 0\), and it is strictly positive unless \(P(X>0)=0\).
Since \(E(X)=0\) by assumption, we must have \(P(X=0)=1\) and \(P(X\neq 0)=0\), hence:
\[\begin{align} E(e^{sX})&=P(X=0)\cdot e^{s\cdot 0}+\int_{0^+}^{b}p(x)\cdot e^{sx}dx\\ &=1\cdot e^{0}+0=1\\ &\leq e^{\frac{1}{8}s^2b^2}=e^{\frac{1}{8}s^2(b-a)^2} \end{align} \]
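- Consider the remaining case, \(a<b\) with \(a\neq 0\). Since \(P(X\in[a,b])=1\) and \(E(X)=0\), we have \(a\leq 0\leq b\), and therefore \(a<0,b\geq 0\).
Note that \(e^{sX}\) is a convex function of \(X\), and that \(sX\) is a convex combination of \(sa\) and \(sb\) with weights \(\frac{b-X}{b-a}\) and \(\frac{X-a}{b-a}\). By Jensen's inequality: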
\[\begin{align} e^{(\frac{b-X}{b-a}sa+\frac{X-a}{b-a}sb)}&=e^{sX}\\ &\leq \frac{b-X}{b-a}e^{sa}+\frac{X-a}{b-a}e^{sb} \end{align} \]
Taking expectations on both sides and substituting \(E(X)=0\) gives:
\[\begin{align} E(e^{sX})&\leq \frac{b-E(X)}{b-a}e^{sa}+\frac{E(X)-a}{b-a}e^{sb}\\ &=\frac{b}{b-a}e^{sa}-\frac{a}{b-a}e^{sb}\\ &=(-\frac{a}{b-a})e^{sa}(e^{sb-sa}-\frac{b}{a}) \end{align} \]
Let \(\theta=-\frac{a}{b-a}>0\), so that \(a=-\theta(b-a)\) and \(-\frac{b}{a}=\frac{1}{\theta}-1\). The right-hand side above becomes:
\[\begin{align} \theta e^{-s\theta(b-a)}(\frac{1}{\theta}-1+e^{s(b-a)})&=(1-\theta+\theta e^{s(b-a)})e^{-s\theta (b-a)}\\ &=e^{\log[1-\theta+\theta e^{s(b-a)}]}\cdot e^{-s\theta(b-a)}\\ &=e^{-s\theta(b-a)+\log[1-\theta+\theta e^{s(b-a)}]} \end{align} \]
Let \(u=s(b-a)\) and define \(\varphi\):
\[\begin{align} \left\{ \begin{array}{l} \varphi : R\rightarrow R\\ \varphi(u)=-\theta u+\log(1-\theta+\theta e^{u}) \end{array} \right. \end{align} \]
Since \(e^{u}>0,a<0,b\geq0,\theta>0\), we have \(1-\theta+\theta e^{u}=\theta(\frac{1}{\theta}-1+e^{u})=\theta(-\frac{b}{a}+e^{u})>0\), so \(\varphi\) is well defined.
Writing the bound on \(E(e^{sX})\) in terms of \(\varphi\) gives:
\[\begin{align} E(e^{sX})\leq e^{\varphi(u)} \end{align} \]
Expanding \(\varphi\) by Taylor's theorem with the Lagrange remainder, there exists a \(v\in[0,u]\) such that:
\[\begin{align} \varphi(u)=\varphi(0)+u\varphi^{\prime}(0)+\frac{u^{2}}{2!}\varphi^{\prime\prime}(v) \end{align} \]
A direct computation gives:
\[\begin{align} \varphi(0)&=0\\ \varphi^{\prime}(0)&=-\theta+\frac{\theta e^{u}}{1-\theta+\theta e^{u}}|_{u=0}=0\\ \varphi^{\prime\prime}(v)&=\frac{\theta e^{u}(1-\theta+\theta e^{u})-\theta e^{u}\theta e^{u}}{(1-\theta+\theta e^{u})^{2}}|_{u=v}\\ &=\frac{(1-\theta)\theta e^{v}}{(1-\theta+\theta e^{v})^{2}}\\ &=\frac{1-\theta}{1-\theta+\theta e^{v}}\cdot\frac{\theta e^{v}}{1-\theta+\theta e^{v}}\\ &=t(1-t)\leq\frac{1}{4} \end{align} \]
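where \(t=\frac{1-\theta}{1-\theta+\theta e^{v}}\), and \(t(1-t)\leq\frac{1}{4}\) holds for every real \(t\).
Therefore:
\[\begin{align} \varphi(u)\leq0+0+\frac{1}{2}u^{2}\cdot\frac{1}{4}=\frac{1}{8}u^{2}=\frac{1}{8}s^{2}(b-a)^{2} \end{align} \]
Combined with \(E(e^{sX})\leq e^{\varphi(u)}\), this gives \(E(e^{sX})\leq e^{\frac{1}{8}s^{2}(b-a)^{2}}\).
The lemma is proved!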
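As a sanity check on the lemma, one can evaluate both sides for a zero-mean two-point distribution on \(\{a,b\}\), for which the Jensen step in the proof holds with equality. This is a sketch under that assumption; the helper names `two_point_mgf`, `lemma_bound`, and `lemma_holds` are hypothetical.

```python
import math


def two_point_mgf(a: float, b: float, s: float) -> float:
    """E[exp(sX)] for X in {a, b} with P(X=a) = b/(b-a) and P(X=b) = -a/(b-a).

    These weights make E[X] = 0, so X satisfies the lemma's assumptions.
    """
    if not a < 0.0 <= b:
        raise ValueError("need a < 0 <= b")
    return (b * math.exp(s * a) - a * math.exp(s * b)) / (b - a)


def lemma_bound(a: float, b: float, s: float) -> float:
    """Right-hand side of the lemma: exp(s^2 (b - a)^2 / 8)."""
    return math.exp(s * s * (b - a) ** 2 / 8.0)


def lemma_holds(a: float, b: float, s: float, rel_tol: float = 1e-12) -> bool:
    """Check E[exp(sX)] <= exp(s^2 (b-a)^2 / 8), allowing for rounding error."""
    return two_point_mgf(a, b, s) <= lemma_bound(a, b, s) * (1.0 + rel_tol)


if __name__ == "__main__":
    # Symmetric case a = -1, b = 1, s = 1 compares cosh(1) with exp(1/2).
    print(two_point_mgf(-1.0, 1.0, 1.0), lemma_bound(-1.0, 1.0, 1.0))
```

A numerical check like this does not replace the proof, but it is a quick way to catch a sign or constant error when re-deriving the bound.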
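Markov's inequality
The next step uses Markov's inequality, a standard result covered in any course on probability and mathematical statistics, so most readers will have seen it. The statement and a proof are reproduced here for completeness.
Let \(X\) be a nonnegative random variable whose expectation \(E(X)\) exists. Then for any \(t>0\):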
\[\begin{align} P(X\geq t)\leq\frac{E(X)}{t} \end{align} \]
The proof is as follows:
Assume \(X\in[a,b]\) with \(a\geq 0\) and density \(p(x)\) (a simplifying assumption; the inequality itself holds for any nonnegative \(X\)). It is easy to see that:
\[\begin{align} a=a\cdot\int_a^b p(x)dx\leq E(X)=\int_a^b x\cdot p(x)dx\leq b\cdot\int_a^b p(x)dx=b \end{align} \]
That is, \(a\leq E(X)\leq b\).
If \(t\leq a\), then \(P(X\geq t)=1=\frac{t}{t}\leq\frac{a}{t}\leq\frac{E(X)}{t}\).
If \(t\geq b\), then \(P(X\geq t)=0=\frac{0}{t}\leq\frac{a}{t}\leq\frac{E(X)}{t}\).
If \(t\in(a,b)\), then:
\[\begin{align} E(X)&=\int_a^b x\cdot p(x)dx\\ &=\int_a^t x\cdot p(x)dx + \int_t^b x\cdot p(x)dx\\ &\geq \int_t^b x\cdot p(x)dx\\ &\geq t\cdot \int_t^b p(x)dx\\ &=t\cdot P(X\geq t) \end{align} \]
Dividing both sides by \(t>0\) gives the claim in this case as well.
Markov's inequality is proved!
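Markov's inequality can also be observed on data: for any list of nonnegative numbers, the fraction of values at least \(t\) never exceeds the sample mean divided by \(t\). The sketch below illustrates this; the function name `markov_check` and the choice of exponential samples are assumptions made for the example.

```python
import random


def markov_check(samples: list[float], t: float) -> tuple[float, float]:
    """Return (empirical P(X >= t), empirical E(X) / t) for nonnegative samples."""
    if t <= 0.0:
        raise ValueError("t must be positive")
    if not samples or any(x < 0.0 for x in samples):
        raise ValueError("samples must be a non-empty list of nonnegative numbers")
    n = len(samples)
    tail = sum(1 for x in samples if x >= t) / n
    bound = (sum(samples) / n) / t
    return tail, bound


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    data = [rng.expovariate(1.0) for _ in range(10_000)]
    for threshold in (0.5, 1.0, 2.0, 4.0):
        tail, bound = markov_check(data, threshold)
        print(threshold, tail, bound)
```

The bound is usually loose, which is why the proof of Hoeffding's inequality applies it to \(e^{s(S_n-E[S_n])}\) rather than to the sum itself.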
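Next, we prove Hoeffding's inequality
Let \(X_1,X_2,\cdots,X_n\) be \(n\) independent random variables with \(P(X_i\in[a_i,b_i])=1\) for \(1\leq i\leq n\), and let \(S_n=\sum_{i=1}^{n}X_i\). For any \(s>0\) the map \(x\mapsto e^{sx}\) is increasing, so by Markov's inequality and, in the last step, independence: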
\[\begin{align} P(S_n-E[S_n]\geq t)&=P(e^{s(S_n-E[S_n])}\geq e^{st})\\ &\leq e^{-st}E[e^{s(S_n-E[S_n])}]\\ &=e^{-st}E[e^{s(\sum_{i=1}^{n}X_i-E[\sum_{i=1}^{n}X_i])}]\\ &=e^{-st}E[e^{s(\sum_{i=1}^{n}(X_i-E(X_i)))}]\\ &=e^{-st}E[\prod_{i=1}^{n}e^{s(X_i-E(X_i))}]\\ &=e^{-st}\prod_{i=1}^nE[e^{s(X_i-E(X_i))}]\\ \end{align} \]
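Let \(Y_i=X_i-E(X_i)\). Then:
\[\begin{align} E(Y_i)=E(X_i-E(X_i))&=\int_{a_i}^{b_i} p(x_i)\cdot (x_i-E(X_i))dx_i\\ &=\int_{a_i}^{b_i}p(x_i)\cdot x_i dx_i - \int_{a_i}^{b_i}p(x_i)\cdot E(X_i)dx_i\\ &=E(X_i)-E(X_i)\int_{a_i}^{b_i}p(x_i)dx_i\\ &=E(X_i)-E(X_i)=0 \end{align} \]
Moreover, \(P(Y_i\in[a_i-E(X_i),b_i-E(X_i)])=1\), and this interval has the same length \(b_i-a_i\). Each \(Y_i\) therefore satisfies the conditions of the lemma, and the bound above becomes: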
\[\begin{align} P(S_n-E[S_n]\geq t)&\leq e^{-st}\prod_{i=1}^nE[e^{s(Y_i)}]\\ &\leq e^{-st}\prod_{i=1}^{n}e^{\frac{1}{8}s^2(b_i-a_i)^2}\\ &=\exp(-st+\frac{1}{8}s^2\sum_{i=1}^n(b_i-a_i)^2)\\ \end{align} \]
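The derivation above holds for every \(s>0\). Define:
\[\begin{align} \left\{ \begin{array}{l} g:R_+\rightarrow R \\ g(s)=-st+\frac{1}{8}s^2\sum_{i=1}^n(b_i-a_i)^2 \end{array} \right. \end{align} \]
\(g(s)\) is the familiar upward-opening parabola. Since the inequality above holds for every \(s>0\), it holds in particular at the minimizer of \(g\), which gives the tightest bound. Solving \(g^\prime(s)=0\) gives \(s=\frac{4t}{\sum_{i=1}^n(b_i-a_i)^2}>0\), where \(g(s)=\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2}\). Substituting into the inequality yields: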
\[\begin{align} P(S_n-E[S_n]\geq t)\leq \exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2}) \end{align} \]
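The optimization over \(s\) is easy to verify numerically. In the sketch below, `c` stands for \(\sum_{i=1}^n(b_i-a_i)^2\); the function names and the example values of `t` and `c` are illustrative assumptions.

```python
def g(s: float, t: float, c: float) -> float:
    """Exponent g(s) = -s*t + s^2 * c / 8, with c = sum_i (b_i - a_i)^2."""
    return -s * t + s * s * c / 8.0


def optimal_s(t: float, c: float) -> float:
    """Root of g'(s) = -t + s*c/4, i.e. s = 4t / c."""
    if t <= 0.0 or c <= 0.0:
        raise ValueError("need t > 0 and c > 0")
    return 4.0 * t / c


if __name__ == "__main__":
    t, c = 1.5, 10.0  # illustrative values only
    s_star = optimal_s(t, c)
    # The minimum value should agree with -2 t^2 / c from the text.
    print(s_star, g(s_star, t, c), -2.0 * t * t / c)
```

Any other choice of \(s>0\) still gives a valid bound, only a weaker one.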
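Applying the same argument to \(-X_1,\cdots,-X_n\), which satisfy \(P(-X_i\in[-b_i,-a_i])=1\) with the same interval lengths (that is, replacing \(S_n\) by \(-S_n\)), gives: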
\[\begin{align} P(E[S_n]-S_n\geq t)\leq \exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2}) \end{align} \]
Hoeffding's inequality is proved!
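A small Monte Carlo experiment can illustrate the upper-tail bound. The sketch below assumes each \(X_i\) is uniform on \([a_i,b_i]\), which is only one of many distributions the inequality covers; the function names, seed, and parameters are hypothetical.

```python
import math
import random


def hoeffding_bound(t: float, intervals: list[tuple[float, float]]) -> float:
    """exp(-2 t^2 / sum_i (b_i - a_i)^2), the right-hand side of the inequality."""
    denom = sum((b - a) ** 2 for a, b in intervals)
    return math.exp(-2.0 * t * t / denom)


def empirical_upper_tail(
    t: float, intervals: list[tuple[float, float]], trials: int, seed: int = 0
) -> float:
    """Estimate P(S_n - E[S_n] >= t) when X_i is uniform on [a_i, b_i]."""
    rng = random.Random(seed)
    mean_sum = sum((a + b) / 2.0 for a, b in intervals)  # E[S_n] for uniform X_i
    hits = 0
    for _ in range(trials):
        s_n = sum(rng.uniform(a, b) for a, b in intervals)
        if s_n - mean_sum >= t:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    boxes = [(0.0, 1.0)] * 20
    for t in (1.0, 2.0, 4.0):
        print(t, empirical_upper_tail(t, boxes, 20_000), hoeffding_bound(t, boxes))
```

The estimated tail probability should sit below the bound, typically by a wide margin, because Hoeffding's inequality uses only the ranges of the variables and not their variances.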
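Proof of the generalization error bound
For any function \(f\in\mathcal{F}\), \(\hat{R}(f)\) is the sample mean of the \(N\) independent random variables \(L(Y_i,f(X_i))\), and \(R(f)\) is the expectation of the random variable \(L(Y,f(X))\). If the loss function takes values in \([0,1]\), that is, \([a_i,b_i]=[0,1]\) for all \(i\), then Hoeffding's inequality applies. For \(\epsilon>0\), first rewrite the event of interest: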
\[\begin{align} P(R(f)-\hat{R}(f)\geq\epsilon)&=P(E(L(Y,f(X)))-\frac{1}{N}\sum_{i=1}^{N}L(Y_i,f(X_i))\geq\epsilon)\\ &=P(N\cdot E(L(Y,f(X)))-\sum_{i=1}^{N}L(Y_i,f(X_i))\geq N\cdot\epsilon)\\ &=P(N\cdot E(L(Y,f(X)))-S_N\geq N\cdot\epsilon) \end{align} \]
where \(S_N=\sum_{i=1}^{N}L(Y_i,f(X_i))\). Because the samples \((X_i,Y_i)\) are identically distributed, we have:
\[\begin{align} E(S_N)&=E[\sum_{i=1}^{N}L(Y_i,f(X_i))]\\ &=\sum_{i=1}^{N}E[L(Y_i,f(X_i))]\\ &=N\cdot E(L(Y,f(X))) \end{align} \]
The inequality therefore becomes:
\[\begin{align} P(R(f)-\hat{R}(f)\geq\epsilon)&=P(N\cdot E(L(Y,f(X)))-S_N\geq N\cdot\epsilon)\\ &=P(E(S_N)-S_N\geq N\cdot\epsilon)\\ &\leq\exp(\frac{-2(N\cdot\epsilon)^2}{\sum_{i=1}^{N}(b_i-a_i)^2})\\ &=\exp(\frac{-2(N\cdot\epsilon)^2}{N\cdot 1})=\exp(-2N\epsilon^2) \end{align} \]
The bound above holds for an arbitrary fixed \(f\in\mathcal{F}\). For \(\mathcal{F}=\{f_1,f_2,\cdots,f_d\}\), the event that some \(f\in\mathcal{F}\) satisfies \(R(f)-\hat{R}(f)\geq\epsilon\) is the union of the \(d\) events for the individual \(f\), so its probability is at most the sum of their probabilities (the union bound):
\[\begin{align} P(\exists f\in\mathcal{F}:R(f)-\hat{R}(f)\geq\epsilon)&=P(\bigcup_{f\in\mathcal{F}}\{ R(f)-\hat{R}(f)\geq\epsilon \})\\ &\leq \sum_{f\in\mathcal{F}}P(R(f)-\hat{R}(f)\geq\epsilon)\\ &\leq d\cdot \exp(-2N\epsilon^2) \end{align} \]
Taking the complementary event, this is equivalent to:
\[\begin{align} P(\forall f\in\mathcal{F}:R(f)-\hat{R}(f)<\epsilon)\geq1-d\exp(-2N\epsilon^2) \end{align} \]
Let \(\delta=d\exp(-2N\epsilon^2)\). Then:
\[\begin{align} P(\forall f\in\mathcal{F}:R(f)<\hat{R}(f)+\epsilon)\geq1-\delta \end{align} \]
That is, with probability at least \(1-\delta\), \(R(f)<\hat{R}(f)+\epsilon\) holds for all \(f\in\mathcal{F}\) simultaneously. Solving \(\delta=d\exp(-2N\epsilon^2)\) for \(\epsilon\) gives:
\[\begin{align} \epsilon=\sqrt{\frac{1}{2N}(\log d-\log\delta)}=\epsilon(d,N,\delta) \end{align} \]
This establishes the generalization error bound:
\[\begin{align} R(f)\leq\hat{R}(f)+\epsilon(d,N,\delta) \end{align} \]
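The final bound can be illustrated with a simulation under the 0-1 loss, where the losses of a fixed \(f\) are Bernoulli with mean \(R(f)\). The sketch below makes two simplifying assumptions that are not in the proof: the true risks are chosen by hand, and each hypothesis gets its own independent sample (the union bound does not need this). All names are hypothetical.

```python
import math
import random


def epsilon(d: int, n_samples: int, delta: float) -> float:
    """epsilon(d, N, delta) = sqrt((log d + log(1/delta)) / (2N))."""
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * n_samples))


def failure_rate(
    true_risks: list[float], n_samples: int, delta: float, trials: int, seed: int = 0
) -> float:
    """Fraction of trials in which some f has R(f) - R_hat(f) >= epsilon."""
    rng = random.Random(seed)
    eps = epsilon(len(true_risks), n_samples, delta)
    failures = 0
    for _trial in range(trials):
        for risk in true_risks:
            # Count misclassified samples: each 0-1 loss is Bernoulli(R(f)).
            errors = sum(1 for _i in range(n_samples) if rng.random() < risk)
            if risk - errors / n_samples >= eps:
                failures += 1
                break  # one violating hypothesis is enough for this trial
    return failures / trials


if __name__ == "__main__":
    risks = [0.1, 0.2, 0.3, 0.4, 0.5]  # hand-picked true risks, for illustration
    print(failure_rate(risks, n_samples=100, delta=0.1, trials=1_000))
```

The observed failure rate should stay below \(\delta\), and in practice it is often far below, since both Hoeffding's inequality and the union bound are conservative.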
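Summary
We have now given a complete proof of the generalization error bound for binary classification with a finite hypothesis space. The theorem shows that, for this class of problems, a smaller training error gives a smaller upper bound on the generalization error, and that the gap term \(\epsilon(d,N,\delta)\) shrinks as the sample size \(N\) grows and increases with \(\log d\). In this sense the theorem guarantees that a learned model has predictive ability on unseen data.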