(main) Bayesian Statistics | Bayes' Theorem | Bayesian Inference | Bayesian Linear Regression

Reposted article, 2018-04-05 11:33. Category: Statistics

Updated 2019-08-31

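After reading a paper published in NM, I once again appreciated how important and general Bayesian methods are; combined with deep learning (DL), currently the hottest topic, they can give unexpected results.

Some of my most intuitive takeaways so far:

The core of probability is that the space of possibilities is fixed; in a "Three-Body" world, where no such fixed space exists, there would be no probability
Bayesian reasoning rests on conditional probability, and the essence of conditional probability is a shrinking of the possibility space: acquiring new information is the process of narrowing that space
The core of Bayes' theorem is posterior ∝ prior × likelihood (equality holds once the product is divided by the normalizing constant); there is a diagram that visualizes the theorem perfectly
As long as we can obtain a reliable prior or a reliable likelihood, either one, we can obtain a more reliable posterior probability

Recently I have also been working through a Coursera course, Bayesian Statistics: From Concept to Data Analysis, hoping to study the subject more systematically.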

week1

In this module, we review the basics of probability and Bayes’ theorem.

In Lesson 1, we introduce the different paradigms or definitions of probability and discuss why probability provides a coherent framework for dealing with uncertainty.
In Lesson 2, we review the rules of conditional probability and introduce Bayes’ theorem.
Lesson 3 reviews common probability distributions for discrete and continuous random variables.

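Basic questions:

The different definitions of probability: the Classical framework, the Frequentist framework, and the Bayesian framework
What conditional probability is
Using Bayes' theorem to compute conditional probabilities, with a rare-disease example
Understanding the common probability distributions and being able to write down their expectation, variance, and PDF or PMF; a very valuable summary
Computing probabilities for the common distributions
Understanding the central limit theorem and sampling distributions: with sufficiently large sample sizes, the sample average approximately follows a normal distribution. This establishes the central role of the normal distribution.
The philosophical difference between the Bayesian and Frequentist views: objective versus subjective, determinism versus information theory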
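The central limit theorem item above can be checked with a short simulation. This is only a sketch: the exponential population, the sample size of 50, the number of repetitions, and the seed are illustrative assumptions, not values from the course.

```r
set.seed(42)        # assumed seed, only for reproducibility
n_rep <- 10000      # number of repeated samples (assumption)
n     <- 50         # size of each sample (assumption)

# Draw n_rep samples of size n from Exp(1), a clearly non-normal population,
# and keep the mean of each sample.
sample_means <- replicate(n_rep, mean(rexp(n, rate = 1)))

# CLT prediction for Exp(1): mean 1 and standard deviation 1 / sqrt(n).
c(mean = mean(sample_means), sd = sd(sample_means), clt_sd = 1 / sqrt(n))

# The histogram of the sample means should be roughly bell-shaped around 1.
hist(sample_means, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE)
```

Increasing n should make the histogram hug the normal curve more closely, while a very small n leaves the skew of the exponential visible.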

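Other:

The difference between probability and odds
The superscript c denotes the complement, as in A^c
Expectation and variance of a random variable
Understanding indicator functions, which are used to specify the domain (support)
The continuous version of Bayes' theorem: the sum gets replaced with an integral, taken over all possible values of θ.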
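For reference, the discrete and continuous forms can be written side by side. The notation here (a partition of events A_i, observed data y, parameter θ) is chosen for illustration and may differ from the course's.

```latex
P(A_j \mid B) = \frac{P(B \mid A_j)\,P(A_j)}{\sum_{i} P(B \mid A_i)\,P(A_i)}
\qquad
f(\theta \mid y) = \frac{f(y \mid \theta)\,f(\theta)}{\int f(y \mid \theta)\,f(\theta)\,d\theta}
```

In both forms the denominator does not depend on the quantity being inferred; it only rescales the numerator so that the result sums or integrates to 1.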

week2

This module introduces concepts of statistical inference from both frequentist and Bayesian perspectives.

Lesson 4 takes the frequentist view, demonstrating maximum likelihood estimation and confidence intervals for binomial data.
Lesson 5 introduces the fundamentals of Bayesian inference. Beginning with a binomial likelihood and prior probabilities for simple hypotheses, you will learn how to use Bayes’ theorem to update the prior with data to obtain posterior probabilities. This framework is extended with the continuous version of Bayes theorem to estimate continuous model parameters, and calculate posterior probabilities and credible intervals.

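Basic questions:

What is the likelihood function? I have written about this before. The likelihood and the conditional probability function are the same expression; what differs is which quantity is held fixed: for a probability the parameter is given and the outcome varies, while for a likelihood the observed data are given and the parameter varies. The most visible difference is that conditional probabilities sum to 1 over the outcomes, whereas a likelihood need not sum or integrate to 1 over the parameter. It compares how well different parameter values explain the data and must never be read as the probability of the parameter.
What is the frequentist confidence interval (CI)? Take coin flipping: we have an observation and compute a 95% CI from it. The interpretation is: We're 95% confident that the true probability of getting a head is in this interval. Each time we create a confidence interval in this way based on the data we observe, then on average 95% of the intervals we make will contain the true value of p. Because we assume the result we obtained is not a low-probability event, repeating the procedure many times would show about 95% of the intervals containing the true value. Does this interval contain the true p? What's the probability that this interval contains the true p? Well, we don't know for this particular interval. We cannot answer the question for one specific interval, which is why we need the Bayesian credible interval.
Computing confidence intervals for the Bernoulli and binomial distributions
Applying maximum likelihood estimation (MLE): differentiate the likelihood (or log-likelihood) and find its maximum. The MLE is a point estimate; the central limit theorem can be used to build a CI around it.
The Bayesian posterior interval: the probability that p is in this interval is 95%, based on a random interpretation of an unknown parameter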
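The "95% of the intervals we make will contain the true value" reading can be illustrated by simulation. A minimal sketch, assuming a hypothetical true head probability of 0.3, 100 flips per experiment, and the normal-approximation (Wald) interval; none of these values come from the course.

```r
set.seed(1)         # assumed seed, only for reproducibility
p_true <- 0.3       # hypothetical true probability of heads
n      <- 100       # flips per experiment (assumption)
n_rep  <- 10000     # number of repeated experiments (assumption)

# Simulate the number of heads in each experiment and build a 95% Wald
# interval around each estimate.
y     <- rbinom(n_rep, size = n, prob = p_true)
p_hat <- y / n
se    <- sqrt(p_hat * (1 - p_hat) / n)
lower <- p_hat - qnorm(0.975) * se
upper <- p_hat + qnorm(0.975) * se

# Fraction of intervals that contain the true value; it should be near 0.95,
# although the Wald interval is known to under-cover somewhat.
coverage <- mean(lower <= p_true & p_true <= upper)
coverage
```

Any single interval either contains p_true or it does not; the 95% describes the procedure over many repetitions, not one realized interval.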

Probability distributions belong to no school of thought; they are simply probabilistic descriptions of data

likelihood <- function(n,y,theta){
  return(theta^y*(1-theta)^(n-y))
}
theta <- seq(from=0.01, to=0.99, by=0.01)
plot(theta, likelihood(400,72,theta))

Computing a likelihood value is much like computing an ordinary probability.
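Continuing with the same data as the plot (72 successes in 400 trials), the MLE and a CLT-based confidence interval can be sketched as follows. The use of optimize and the Wald form of the interval are choices made here for illustration, not necessarily what the course does.

```r
n <- 400
y <- 72

# Log-likelihood of a binomial sample, dropping the constant binomial
# coefficient, which does not affect the location of the maximum.
log_lik <- function(theta) y * log(theta) + (n - y) * log(1 - theta)

# Numerical maximization; analytically the maximum is at y / n = 0.18.
fit       <- optimize(log_lik, interval = c(0.001, 0.999), maximum = TRUE)
theta_hat <- fit$maximum

# CLT-based (Wald) 95% confidence interval around the MLE.
se <- sqrt(theta_hat * (1 - theta_hat) / n)
ci <- theta_hat + c(-1, 1) * qnorm(0.975) * se

theta_hat
ci
```

Setting the derivative of log_lik to zero gives y / theta = (n - y) / (1 - theta), so theta = y / n, which is what the numerical search recovers.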

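Other:

Argmax
Cumulative distribution function (CDF): defined for both discrete and continuous distributions as the probability of being less than or equal to a given value; its maximum is 1 and it is non-decreasing, because probabilities cannot be negative.
Probability density function (PDF): exists only for continuous distributions; integrating it gives the CDF
Probability mass function (PMF): exists only for discrete distributions; it is simply the probability at a single point
The distribution functions in R, using the normal as the example: dnorm is the PDF, pnorm the CDF, qnorm the quantile function, and rnorm draws pseudo-random samples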

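R code exercises:

Exercise 1: X ∼ Binomial(5, 0.6); how do we find F(1)? To avoid summing the PMF by hand, use the CDF directly, i.e. the function starting with p: pbinom(1, 5, 0.6) returns P(X ≤ 1) ≈ 0.087. It can be checked with qbinom: qbinom(p=0.087, size=5, prob=0.6) returns the smallest x whose CDF is at least p, which is 1 here.

Exercise 2: Y ∼ Exp(1); find the values of Y at which the CDF equals 0.1 and 0.9. Use qexp directly: qexp(0.1, rate = 1); the 0.1 can be replaced by a vector such as c(0.1, 0.9) to compute both at once.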

1. Let X ∼ Pois(3). Find P(X = 1). (0.149): dpois(1, lambda = 3)
2. Let X ∼ Pois(3). Find P(X ≤ 1). (0.199): ppois(1, lambda = 3)
3. Let X ∼ Pois(3). Find P(X > 1). (0.801): 1 - ppois(1, lambda = 3)
4. Let Y ∼ Gamma(2, 1/3). Find P(0.5 < Y < 1.5). (0.078): pgamma(1.5, shape = 2, rate = 1/3) - pgamma(0.5, shape = 2, rate = 1/3)
5. Let Z ∼ N(0, 1). Find z such that P(Z < z) = 0.975. (1.96): qnorm(0.975, mean = 0, sd = 1)
6. Let Z ∼ N(0, 1). Find P(−1.96 < Z < 1.96). (0.95): pnorm(1.96, mean = 0, sd = 1) - pnorm(-1.96, mean = 0, sd = 1)
7. Let Z ∼ N(0, 1). Find z such that P(−z < Z < z) = 0.90. (1.64): qnorm(0.95, mean = 0, sd = 1)

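Summary:

Discrete: use the d function for the probability at a single point.

Continuous: a single point has zero probability, so the d function returns the height (value) of the PDF at that point.

The p function takes a value and returns the CDF F at that value, i.e. the probability of being less than or equal to it; the q function takes a probability and returns the corresponding x, so it is the inverse of the p function.

The r function draws n random values from the specified distribution.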
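The relationships in this summary can be verified directly. A small sketch using the binomial and normal distributions; the specific values (1.96, 1.3, the seed, the number of draws) are arbitrary choices for illustration.

```r
# d for a discrete distribution is the PMF; summing it reproduces the CDF.
pmf_sum <- sum(dbinom(0:1, size = 5, prob = 0.6))
cdf_val <- pbinom(1, size = 5, prob = 0.6)

# d for a continuous distribution is the PDF height; integrating it
# reproduces the CDF.
pdf_int <- integrate(dnorm, lower = -Inf, upper = 1.96)$value
cdf_cont <- pnorm(1.96)

# q undoes p for a continuous distribution.
round_trip <- qnorm(pnorm(1.3))

# r draws pseudo-random values; their empirical CDF approaches p.
set.seed(7)                      # assumed seed, only for reproducibility
draws <- rnorm(100000)
empirical <- mean(draws <= 1.96)

c(pmf_sum = pmf_sum, cdf_val = cdf_val)
c(pdf_int = pdf_int, cdf_cont = cdf_cont, empirical = empirical)
round_trip
```

For discrete distributions q is only a generalized inverse: it returns the smallest x whose CDF reaches the requested probability, so a round trip through p and q need not return the starting value.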

Reference: Introduction to dnorm, pnorm, qnorm, and rnorm for new biostatisticians

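Bayesian inference

The updating process from prior to posterior
The difference between frequentist and Bayesian statistical inference
Applying Bayesian inference to discrete and continuous data
Why is the denominator of Bayes' theorem called the normalizing constant? Because it originates as f(y) or p(y), which does not depend on θ and represents the probability of the observed data; it is expanded with the law of total probability only to make it computable.
f(θ|x) ∝ f(x|θ)f(θ). The symbol ∝ stands for “is proportional to.”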
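The role of the normalizing constant can be seen in a grid approximation. The sketch below reuses the 72-out-of-400 data from the likelihood plot earlier and assumes a uniform prior; the grid spacing of 0.001 is an arbitrary choice.

```r
n <- 400
y <- 72

# Grid of candidate parameter values.
theta <- seq(0.001, 0.999, by = 0.001)
step  <- theta[2] - theta[1]

prior      <- dunif(theta)                       # uniform prior (assumption)
likelihood <- dbinom(y, size = n, prob = theta)
unnorm     <- likelihood * prior                 # proportional to the posterior

# Approximate f(y) by a Riemann sum, then divide so the result integrates to 1.
evidence  <- sum(unnorm * step)
posterior <- unnorm / evidence

# With a uniform prior the exact posterior is Beta(y + 1, n - y + 1),
# so the grid result can be compared against dbeta.
max_diff <- max(abs(posterior - dbeta(theta, y + 1, n - y + 1)))
max_diff
```

Multiplying unnorm by any positive constant leaves posterior unchanged, which is exactly what the ∝ notation expresses.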

week3

In this module, you will learn methods for selecting prior distributions and building models for discrete data.

Lesson 6 introduces prior selection and predictive distributions as a means of evaluating priors.
Lesson 7 demonstrates Bayesian analysis of Bernoulli data and introduces the computationally convenient concept of conjugate priors.
Lesson 8 builds a conjugate model for Poisson data and discusses strategies for selection of prior hyperparameters.

Basic questions:

Understand the prior as representing information.
Understand the concept of conjugate priors.
Recognize the posterior mean as a weighted average of the prior mean and the data estimates, and understand the concept of an effective sample size of a prior.
Compute posterior probabilities for Bernoulli, binomial, and Poisson likelihoods.

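Learn the derivations: When we use a uniform prior for a Bernoulli likelihood, we get a beta posterior (Lesson 7.1). In fact, the uniform distribution is a Beta(1, 1).

What is a conjugate distribution? If, under a given likelihood, the prior and the posterior belong to the same family of distributions, the prior is said to be conjugate to that likelihood. Conjugacy has very convenient mathematical properties and fits the need for repeated updating, where each posterior becomes the next prior. Any beta distribution is conjugate for the Bernoulli distribution: any beta prior will give a beta posterior.

Understanding the gamma and beta distributions: the gamma function generalizes the factorial, and the beta function is built from gamma functions; these two functions supply the normalizing constants of the gamma and beta distributions.

How do we choose a suitable prior? A Poisson likelihood calls for a gamma prior, which gives a gamma posterior.

Posterior mean and effective sample size: two calculations for the beta distribution, and how the beta posterior is updated. A very useful weighting emerges: posterior mean = prior weight * prior mean + data weight * data mean

This effective sample size also gives you an idea of how much data you would need to make sure that your prior doesn't have much influence on your posterior.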
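The weighted-average identity can be checked numerically. The sketch below borrows the Beta(8, 4) prior and the 33-out-of-40 result from the exam exercise later in these notes; for a Beta(a, b) prior the effective sample size is a + b.

```r
a <- 8     # prior Beta(a, b), taken from the exam exercise
b <- 4
n <- 40    # number of questions
y <- 33    # number answered correctly

prior_ess    <- a + b                       # effective sample size of the prior
prior_weight <- prior_ess / (prior_ess + n)
data_weight  <- n / (prior_ess + n)

prior_mean <- a / (a + b)
data_mean  <- y / n                         # also the MLE

weighted <- prior_weight * prior_mean + data_weight * data_mean
direct   <- (a + y) / (a + b + n)           # mean of Beta(a + y, b + n - y)

c(weighted = weighted, direct = direct)
```

Here the prior counts as 12 pseudo-observations against 40 real ones, so the posterior mean sits much closer to the data mean than to the prior mean.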

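One of the best applications of Bayesian methods: In medical devices, you often have very small sample sizes. But you're only making minor updates to the devices and you're doing new trials. The ability of Bayesian statistics to do easy sequential updates made it very practical and appealing for the medical device testing industry.

Be able to solve Bernoulli, binomial, and beta distribution problems fluently in R:

Two probabilities need to be distinguished carefully here. The first is the success probability θ of the binomial distribution; although it is a probability, it is treated as a random variable. The second is the probability (density) that this random variable takes a particular value.

Understand what the plots here show: not the usual distribution of data, but the density of the parameter treated as a random variable; dbeta gives the value of the PDF at a particular value of that variable. The plots show the densities of the parameter under different priors.

Revisit the earlier point about p functions: a p function evaluates the CDF F at a given value, i.e. the probability of being less than or equal to that value, with minimum 0 and maximum 1. With that, pbeta is easy to understand.

Updating a beta prior with a binomial likelihood is very simple: add the number of successes to the first beta parameter and the number of failures to the second.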
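Because the update is just addition, applying it trial by trial gives the same posterior as applying it once to the pooled data. A minimal sketch; update_beta is a hypothetical helper and the trial counts are made-up numbers.

```r
# Hypothetical helper: conjugate update of a Beta(a, b) prior after observing
# binomial data with the given numbers of successes and failures.
update_beta <- function(prior, successes, failures) {
  c(a = prior[["a"]] + successes, b = prior[["b"]] + failures)
}

prior <- c(a = 1, b = 1)   # uniform prior, Beta(1, 1)

# Sequential updating: the posterior after trial 1 is the prior for trial 2.
after_trial1 <- update_beta(prior, successes = 3, failures = 7)
after_trial2 <- update_beta(after_trial1, successes = 5, failures = 5)

# Batch updating with the pooled data from both trials.
batch <- update_beta(prior, successes = 8, failures = 12)

after_trial2
identical(after_trial2, batch)
```

This is the property that makes sequential trials convenient: only the two current beta parameters have to be carried forward, not the raw data.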

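How do we draw the prior, likelihood, and posterior in one figure? The posterior mean is somewhere in between the maximum likelihood estimate and the prior mean of two-thirds. Strictly speaking, putting them in one figure is not rigorous, and the likelihood has to be rescaled; the key point is to see how the concentration of the distribution changes.

When making chocolate chip cookies, the number of chips per cookie approximately follows a Poisson distribution, for which the gamma distribution is the conjugate prior. The mean and variance of the gamma distribution are worth memorizing.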
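The Poisson-gamma update can be sketched in a few lines. With a Gamma(a, b) prior in the shape-rate parameterization (mean a/b, variance a/b^2) and n Poisson counts, the posterior is Gamma(a + sum of counts, b + n). The prior values and the chip counts below are invented for illustration.

```r
a <- 6      # prior shape (hypothetical)
b <- 0.5    # prior rate (hypothetical); prior mean a / b = 12 chips per cookie

chips <- c(12, 9, 14, 11, 10)   # hypothetical chip counts for 5 cookies

# Conjugate update for a Poisson likelihood.
post_shape <- a + sum(chips)
post_rate  <- b + length(chips)

prior_mean <- a / b
prior_var  <- a / b^2
post_mean  <- post_shape / post_rate

# Equal-tailed 95% credible interval for the Poisson rate.
ci <- qgamma(c(0.025, 0.975), shape = post_shape, rate = post_rate)

c(prior_mean = prior_mean, prior_var = prior_var, post_mean = post_mean)
ci
```

The rate parameter b plays the role of the prior's effective sample size here: the posterior mean weights the prior mean by b / (b + n) and the sample mean by n / (b + n).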

# Suppose we are giving two students a multiple-choice exam with 40 questions, 
# where each question has four choices. We don't know how much the students
# have studied for this exam, but we think that they will do better than just
# guessing randomly. 
# 1) What are the parameters of interest?
# 2) What is our likelihood?
# 3) What prior should we use?
# 4) What is the prior probability P(theta>.25)? P(theta>.5)? P(theta>.8)?
# 5) Suppose the first student gets 33 questions right. What is the posterior
#    distribution for theta1? P(theta1>.25)? P(theta1>.5)? P(theta1>.8)?
#    What is a 95% posterior credible interval for theta1?
# 6) Suppose the second student gets 24 questions right. What is the posterior
#    distribution for theta2? P(theta2>.25)? P(theta2>.5)? P(theta2>.8)?
#    What is a 95% posterior credible interval for theta2?
# 7) What is the posterior probability that theta1>theta2, i.e., that the 
#    first student has a better chance of getting a question right than
#    the second student?

############
# Solutions:

# 1) Parameters of interest are theta1=true probability the first student
#    will answer a question correctly, and theta2=true probability the second
#    student will answer a question correctly.

# 2) Likelihood is Binomial(40, theta), if we assume that each question is 
#    independent and that the probability a student gets each question right 
#    is the same for all questions for that student.

# 3) The conjugate prior is a beta prior. Plot the density with dbeta.
theta=seq(from=0,to=1,by=.01)
plot(theta,dbeta(theta,1,1),type="l")
plot(theta,dbeta(theta,4,2),type="l")
plot(theta,dbeta(theta,8,4),type="l")

# 4) Find probabilities using the pbeta function.
1-pbeta(.25,8,4)
1-pbeta(.5,8,4)
1-pbeta(.8,8,4)

# 5) Posterior is Beta(8+33,4+40-33) = Beta(41,11)
41/(41+11)  # posterior mean
33/40       # MLE

lines(theta,dbeta(theta,41,11))

# plot posterior first to get the right scale on the y-axis
plot(theta,dbeta(theta,41,11),type="l")
lines(theta,dbeta(theta,8,4),lty=2)
# plot likelihood
lines(theta,dbinom(33,size=40,p=theta),lty=3)
# plot scaled likelihood
lines(theta,44*dbinom(33,size=40,p=theta),lty=3)

# posterior probabilities
1-pbeta(.25,41,11)
1-pbeta(.5,41,11)
1-pbeta(.8,41,11)

# equal-tailed 95% credible interval
qbeta(.025,41,11)
qbeta(.975,41,11)

# 6) Posterior is Beta(8+24,4+40-24) = Beta(32,20)
32/(32+20)  # posterior mean
24/40       # MLE

plot(theta,dbeta(theta,32,20),type="l")
lines(theta,dbeta(theta,8,4),lty=2)
lines(theta,44*dbinom(24,size=40,p=theta),lty=3)

1-pbeta(.25,32,20)
1-pbeta(.5,32,20)
1-pbeta(.8,32,20)

qbeta(.025,32,20)
qbeta(.975,32,20)

# 7) Estimate by simulation: draw 1,000 samples from each and see how often 
#    we observe theta1>theta2

theta1=rbeta(1000,41,11)
theta2=rbeta(1000,32,20)
mean(theta1>theta2)


# Note for other distributions:
# dgamma,pgamma,qgamma,rgamma
# dnorm,pnorm,qnorm,rnorm

week4

This module covers conjugate and objective Bayesian analysis for continuous data.

Lesson 9 presents the conjugate model for exponentially distributed data.
Lesson 10 discusses models for normally distributed data, which play a central role in statistics.
In Lesson 11, we return to prior selection and discuss ‘objective’ or ‘non-informative’ priors.
Lesson 12 presents Bayesian linear regression with non-informative priors, which yield results comparable to those of classical regression.

Exponential distribution

For example, suppose you're waiting for a bus that you think comes on average once every ten minutes, but you're not sure exactly how often it comes.

gamma distribution is conjugate for an exponential likelihood. Gammas actually are conjugate for a number of different things.
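For the bus example, the update might look like this. With a Gamma(a, b) prior on the arrival rate (shape-rate parameterization) and n exponential waiting times, the posterior is Gamma(a + n, b + sum of waiting times). The prior values and the observed waits are hypothetical, chosen so that the prior mean rate is 1/10 per minute.

```r
a <- 10     # prior shape (hypothetical)
b <- 100    # prior rate (hypothetical); prior mean a / b = 0.1 buses per minute

waits <- c(12, 15, 8)   # hypothetical observed waiting times in minutes

# Conjugate update for an exponential likelihood.
post_shape <- a + length(waits)
post_rate  <- b + sum(waits)

post_mean <- post_shape / post_rate   # posterior mean arrival rate per minute

# Equal-tailed 95% credible interval for the arrival rate.
ci <- qgamma(c(0.025, 0.975), shape = post_shape, rate = post_rate)

post_mean
ci
```

The observed waits average a little over ten minutes, so the posterior mean rate moves slightly below the prior mean of 0.1.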

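How do we choose a suitable prior?

Everything in one article: Probability concepts explained: Bayesian inference for parameter estimation.

Bayesian methods are not hard; the key is practice, applying them fluently in every aspect of life and in every project.

I recently realized that Bayesian methods are used everywhere in genetics, so I have to review them once more.

"A true expert is someone who lives their life as Bayes' theorem" - its introduction and examples are very good, though it has some problems on closer inspection.

First, understand conditional probability:

What does P(A|B) mean? P(A) can be visualized simply with a Venn diagram as the area of the inner circle; P(A|B) is the probability of A within the restricted space B. For example, suppose that in a large company everyone has some probability of promotion, P(promotion), and I want to know the probability of promotion after getting an MBA, P(promotion|MBA). In theory we could find everyone who holds an MBA and count one by one whether they were promoted; if that were feasible, we would not need Bayes.

In practice, we can only ever estimate from samples. Bayes' formula is symmetric: usually one side is the one we care about but cannot compute directly, so we take the indirect route and compute the other side. There is also the law of total probability, which is easy to see in a Venn diagram: split the whole set into mutually exclusive parts and handle each part separately.

The soul of Bayes is the prior, the posterior, and the adjustment factor; the key is understanding and applying them in real life.

Prior: some knowledge or belief that we already have (commonly known as the prior). It need not be complicated: the prior is the knowledge we already have, usually a marginal probability. P(A) is a prior to me knowing anything about the B. A prior can be a guess and may contain some subjectivity. More formally, P(A) is not a fixed value but a distribution, the prior distribution.

Posterior: the probability of our original hypothesis after we obtain new data, which amounts to an update of the original subjective prior. P(Θ|data) on the left hand side is known as the posterior distribution. This is the distribution representing our belief about the parameter values after we have calculated everything on the right hand side taking the observed data into account.

Key point: Therefore we can calculate the posterior distribution of our parameters using our prior beliefs updated with our likelihood.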

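Conjugacy and transitivity

Bayesian Inference and Its Internet Applications (Part 1): Introduction to the Theorem

As someone without a statistics background, I was tormented by Bayesian thinking for a long time: I could recite the common formulas backwards, yet still could not grasp its core spirit.

Recently, with guidance from an expert and some further reading of my own, I finally gained some understanding of Bayesian thinking.

Basic framework: earlier I summarized the common distributions, and Bayes is no exception; it is all probability theory, whose central concern is the probability that a random event occurs. When dealing with statistics from now on, get used to the precise phrasing "the probability that some event occurs".

Example 1:

Two identical bowls: bowl 1 holds 30 fruit candies and 10 chocolate candies, and bowl 2 holds 20 of each. Pick a bowl at random and draw one candy from it; it turns out to be a fruit candy. What is the probability that this fruit candy came from bowl 1?

The author's description is a little vague, so I amend it slightly here:

Let H1 denote that the candy drawn came from bowl 1 and H2 that it came from bowl 2. Since we assume the two bowls are identical (which fixes the prior probability), P(H1)=P(H2); that is, before the fruit candy is drawn, both bowls are equally likely to be chosen. Hence P(H1)=0.5, which we call the "prior probability": before the experiment, the probability of bowl 1 is 0.5.

Next let E denote that a fruit candy is drawn. The question becomes: given E, what is the probability that it came from bowl 1, i.e. P(H1|E)? We call this the "posterior probability": the revision of P(H1) after event E has occurred.

I will skip the rest of the calculation, which is well-worn; the main point is to formalize the concepts and not mix mathematical language with everyday language.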
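For completeness, the skipped calculation is short enough to sketch in R. The variable names are illustrative; the numbers are those of the bowl example.

```r
# Prior: both bowls are equally likely to be picked.
prior <- c(H1 = 0.5, H2 = 0.5)

# Likelihood of drawing a fruit candy (event E) from each bowl.
likelihood <- c(H1 = 30 / 40, H2 = 20 / 40)

# Law of total probability gives P(E); Bayes' theorem gives P(H | E).
p_e       <- sum(prior * likelihood)
posterior <- prior * likelihood / p_e

posterior
```

The ratio likelihood / p_e for H1 is 0.75 / 0.625 = 1.2; this is the "adjustment factor" that raises the prior of 0.5 to a posterior of 0.6.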

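Example 2:

The incidence of a certain disease is known to be 0.001, i.e. 1 in 1000 people has it. A reagent tests whether a patient has the disease; its accuracy is 0.99, meaning that if the patient really has the disease, it shows positive 99% of the time. Its false-alarm rate is 5%, meaning that if the patient does not have the disease, it still shows positive 5% of the time. A patient tests positive; how likely is it that the patient really has the disease?

Let event A denote having the disease, so P(A) is 0.001. This is the "prior probability": the incidence we expect before any test is done. Let event B denote a positive result, so what we need is P(A|B). This is the "posterior probability": the estimate of the incidence after the test.

In medical statistics, the 99% is in fact not called accuracy but sensitivity.

Nor is the 5% called the false-alarm rate; it is the false positive rate, whose complement is the specificity (95% here).

For the calculation, see the original article.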
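The calculation itself can be sketched as follows, using only the three numbers given in the example; the variable names are illustrative.

```r
prevalence     <- 0.001   # P(A): prior probability of disease
sensitivity    <- 0.99    # P(B | A): positive given disease
false_positive <- 0.05    # P(B | not A): positive given no disease

# Law of total probability: overall chance of a positive test.
p_positive <- sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes' theorem: probability of disease given a positive test.
p_disease_given_positive <- sensitivity * prevalence / p_positive

p_disease_given_positive
```

The result is a little under 2%: because the disease is so rare, false positives from the healthy 99.9% far outnumber true positives.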

Reference: Precision and recall

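Example: spam filtering

A Bayesian filter is a statistical filter built on existing statistics. We therefore have to supply two sets of already classified emails in advance, one of normal mail and one of spam.

We use these two sets to "train" the filter. The larger the sets, the better the training. Paul Graham used 4000 normal emails and 4000 spam emails.

The "training" process is simple. First, parse all the emails and extract every word. Then compute how frequently each word appears in normal mail and in spam. For example, suppose the word "sex" appears in 200 of the 4000 spam emails, a frequency of 5%, while only 2 of the 4000 normal emails contain it, a frequency of 0.05%. (Note: if a word appears only in spam, Paul Graham assumes its frequency in normal mail is 1%, and vice versa. This avoids a probability of 0. As the number of emails grows, the result adjusts automatically.)

With these preliminary statistics, the filter is ready to use.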
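As a rough sketch of how such frequencies turn into a spam probability for a single word, Bayes' theorem can be applied with the "sex" numbers above. The 50% prior for spam is an assumption made here for illustration; the notes above do not state a prior.

```r
p_spam <- 0.5               # assumed prior P(spam); not given in the notes
p_ham  <- 1 - p_spam        # prior for normal mail

p_word_given_spam <- 200 / 4000   # 5% of spam contains the word
p_word_given_ham  <- 2 / 4000     # 0.05% of normal mail contains the word

# Bayes' theorem: probability that a message containing the word is spam.
p_spam_given_word <- (p_word_given_spam * p_spam) /
  (p_word_given_spam * p_spam + p_word_given_ham * p_ham)

p_spam_given_word
```

With equal priors the priors cancel, and the result reduces to 5% / (5% + 0.05%), about 0.99. A real filter combines the evidence from many words rather than relying on one.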

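The code for this is on GitHub and is worth running.

As noted above, Bayes is a way of thinking that can be applied to any statistical model. That is why you hear so many Bayesian terms: Bayesian linear regression, Bayesian generalized linear regression, and so on.

Next, taking the simplest case, Bayesian linear regression, as the example, we look at how Bayesian thinking combines with a traditional statistical model.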
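As one possible illustration, the sketch below draws from the posterior of a simple linear regression under the standard non-informative prior p(β, σ²) ∝ 1/σ² and compares the credible intervals with the classical confidence intervals from lm. The simulated data, the seed, and the number of draws are assumptions made here; this is not code from the course or the references.

```r
set.seed(1)                              # assumed seed, for reproducibility
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)      # simulated data (assumption)

fit      <- lm(y ~ x)
X        <- model.matrix(fit)
beta_hat <- coef(fit)
df       <- fit$df.residual
s2       <- sum(residuals(fit)^2) / df   # residual variance estimate
XtX_inv  <- solve(crossprod(X))

n_draws <- 10000

# Step 1: sigma^2 | y follows a scaled inverse chi-square(df, s2).
sigma2 <- df * s2 / rchisq(n_draws, df)

# Step 2: beta | sigma^2, y follows N(beta_hat, sigma^2 * (X'X)^-1).
# chol() returns R with t(R) %*% R = XtX_inv, so rows of z %*% R have
# covariance XtX_inv before scaling by sqrt(sigma2).
R <- chol(XtX_inv)
z <- matrix(rnorm(n_draws * ncol(X)), nrow = n_draws)
beta_draws <- sweep((z %*% R) * sqrt(sigma2), 2, beta_hat, "+")

# 95% credible intervals (rows: quantiles, columns: coefficients).
credible <- apply(beta_draws, 2, quantile, probs = c(0.025, 0.975))

credible
t(confint(fit))                          # classical 95% confidence intervals
```

The two sets of intervals should agree up to Monte Carlo error, which is the sense in which non-informative priors give results comparable to classical regression; the interpretation of the intervals still differs.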

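Reference: Bayesian Linear Regression

How can the basic principle of Bayesian linear regression be explained in plain terms?

Linear classification and linear regression from a Bayesian viewpoint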
