Some questions:
1. When can my problem be handled with a GLM, and when can it not?
2. What benefits does a GLM actually bring us?
3. How do we judge whether a fitted GLM is good or bad?
Generalized linear regression has been tormenting me for several months now, and I still haven't fully understood it; I'm often completely lost when reading papers and code.
Most common distributions can be written as members of the exponential family, and that is roughly what "generalized" means. Our familiar linear regression and logistic regression are both special cases of the generalized linear model and can be derived from it.
Reference: Linear regression, logistic regression, and generalized linear models — a personal summary of Stanford CS229 machine learning (part 1)
Working through the tutorial above, I wrote out all the formulas by hand and understood maybe 40% of it.
1. Both linear regression and logistic regression have a probabilistic form; they differ only in the underlying distributional assumption. Linear regression assumes y (equivalently, the error term) follows a Gaussian distribution; logistic regression assumes y follows a Bernoulli distribution. Once the distribution is fixed, maximum likelihood immediately gives the optimization objective, and in the Gaussian case this derivation recovers exactly the least-squares formula we use in linear regression (see the sketch after this list).
2. The methods used for optimization: gradient descent, gradient ascent, Newton's method. The core of gradient methods is the derivative: the objective is a function of the parameters, so we differentiate to get the slope, take a step of size alpha, update the parameters, and iterate until we reach a locally optimal parameter value.
3. The core of the GLM is that as long as the distribution of y can be written in exponential-family form, the linear model can be extended to it, which broadens the range of problems the linear model applies to. Logistic regression is exactly such an extension of linear regression.
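A minimal R sketch of points 1–3 on simulated data (all variable names here are illustrative, not from any of the sources below): lm() and glm() with a Gaussian family give the same least-squares answer, logistic regression is just glm() with a binomial family, and a hand-rolled gradient ascent on the logistic log-likelihood ends up (approximately) at the same coefficients that glm() finds with its Newton-type IRLS updates.

set.seed(1)
n <- 500
x <- rnorm(n)

## linear regression: with Gaussian errors, maximum likelihood == least squares
y_gauss <- 1 + 2 * x + rnorm(n)
coef(lm(y_gauss ~ x))                         # least squares
coef(glm(y_gauss ~ x, family = gaussian()))   # ML under a Gaussian model, identical

## logistic regression: Bernoulli response, logit link
y_bin <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
fit_logit <- glm(y_bin ~ x, family = binomial())
coef(fit_logit)                               # fitted internally by IRLS (a Newton-type method)

## hand-rolled gradient ascent on the logistic log-likelihood
X <- cbind(1, x)          # design matrix with an intercept column
beta <- c(0, 0)           # starting values
alpha <- 1                # step size
for (i in 1:5000) {
  mu <- plogis(X %*% beta)                 # current fitted probabilities
  grad <- crossprod(X, y_bin - mu) / n     # gradient of the average log-likelihood
  beta <- beta + alpha * grad              # one gradient-ascent step
}
drop(beta)                                  # close to coef(fit_logit)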
So how do we construct a generalized linear model from an exponential-family distribution?
Ha, there are plenty of write-ups of GLM theory on Baidu (most of them rather rough), but almost no worked examples. Below is an application of the GLM to fMRI in medicine.
Statistical Analysis: The General Linear Model
What does a generalized linear model do? R
The overall summary is: You can first try linear regression. If this is not appropriate for your problem you can then try pre-transforming your y-data (a log-like or logit transform) and seeing if that fits better. However, if you transform your y-data you are using a new error model (in the transformed space such as log(y)-units instead of y-units, this can be better or can be worse depending on your situation). If this error model is not appropriate you can move on to a generalized linear model. However, the generalized linear model does not minimize square error in y-units but maximizes data likelihood under the chosen model. The distinction is mostly technical and maximum likelihood is often a good objective (so you should be willing to give up your original square-loss objective). If you want to go further still you can try a generalized additive model which in addition to re-shaping the y distribution uses splines to learn re-shapings of the x-data.
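A hedged R sketch of that ladder on simulated count data (the variable names and the choice of mgcv for the GAM step are mine, not the article's):

library(mgcv)    # for gam(); assumed to be installed

set.seed(42)
n <- 300
x <- runif(n, 0, 3)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))    # a count response with a log-linear mean

fit_lm   <- lm(y ~ x)                                    # 1) plain linear regression
fit_logy <- lm(log(y + 1) ~ x)                           # 2) pre-transform y (new error model, log(y)-units)
fit_glm  <- glm(y ~ x, family = poisson(link = "log"))   # 3) GLM: maximizes the Poisson likelihood
fit_gam  <- gam(y ~ s(x), family = poisson)              # 4) GAM: spline re-shaping of x

summary(fit_glm)$coefficients    # compare against the true values 0.5 and 0.8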
"It's not that you should dump all the derivations in; in that case you might as well just tell readers to go read a textbook. But writing out the exponential-family form makes many things much clearer, for example why logistic regression usually uses the logit link function (I have seen some painful folk explanations of this), and why the dispersion parameter is a 'nuisance parameter' (maximum likelihood can simply ignore it), and so on. The deeper point is that the generalized linear model is not merely a generalization of the distribution family; it is a different way of thinking. Ordinary regression is centered on additive error, whereas the GLM splits the model into two components: a systematic component (a linear combination of the covariates) and a random component (the probability distribution of the response), joined by a link function. You can say the GLM is a generalization of ordinary regression, but I think that rather understates its breakthrough in statistical modelling thinking."
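Mapped onto R's glm(), these pieces are exactly the three things you specify: the formula is the systematic component, the family is the random component, and the link ties them together. A toy illustration (simulated data, my own variable names):

set.seed(7)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.3 + 1.2 * x))

# systematic component: the linear predictor eta = b0 + b1 * x   (the formula)
# random component:     y ~ Bernoulli(p)                          (the family)
# link function:        g(p) = eta, e.g. logit or probit          (the link)
fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))   # same two components, different link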
When I was learning GLMs I could not find an accessible tutorial online, so here I introduce the GLM through a concrete application.
We used a Bayesian generalized linear model (GLM) to assign every gene to one or more cell populations, as previously described (Zeisel et al., 2015).
In single-cell RNA-seq analysis, a GLM can be used to find marker genes.
Bayesian + generalized + linear regression
Linear regression: the most basic ingredient, which most people already know. Why can the marker-finding problem be cast as a regression problem? For each gene, we treat its measured expression as the response variable and the cell-type membership as a predictor, fit the model, and then use the distributions of the coefficients to identify markers (a minimal sketch follows after this list).
Generalized: the response variable can follow many different distributions (think about why the text below uses a negative binomial distribution);
Bayesian: a different way of thinking, in which every coefficient has its own (posterior) distribution.
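Before the Bayesian/Stan version quoted below, here is a minimal non-Bayesian sketch of the same idea in R, assuming a genes × cells UMI matrix `counts` and a factor `cell_type` of cluster labels (both hypothetical names); MASS::glm.nb stands in here for the Bayesian negative binomial model that the paper actually fits:

library(MASS)    # for glm.nb()

## counts: genes x cells UMI matrix; cell_type: factor of cluster labels (assumed to exist)
baseline <- colSums(counts) / mean(colSums(counts))   # per-cell molecule count, normalized to the average

fit_one_gene <- function(y) {
  # negative binomial GLM: expression ~ baseline + cell-type indicators (log link)
  fit <- glm.nb(y ~ baseline + cell_type)
  coef(fit)
}

## one row of coefficients per gene; large cell-type coefficients suggest candidate markers
## (real data would need error handling for genes where the fit does not converge)
coefs <- t(apply(counts, 1, fit_one_gene))
head(coefs)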
The GLM models the measured gene expression of a cell as realizations of a Negative Binomial probability distribution whose mean is determined by a linear combination of K predictors x_i with coefficients b_i.
For each cell, the outcome and predictors are known and the aim is to determine the posterior probability distributions of the coefficients.
As predictors, we use a continuous Baseline predictor and a categorical Cell Type predictor. The Baseline predictor value is the cell’s molecule count normalized to the average molecule count of all cells and takes account of the fact that we expect every gene to have a baseline expression proportional to the total number of expressed molecules within a particular cell. The Cell Type predictor is set to 1 for the cluster BackSPIN assignation of the cell, and 0 for the other classes. From the definition of the model it follows that the coefficient bk for a Cell Type predictor xk can be interpreted as the additional number of molecules of a particular gene that are present as a result of the cell being of cell type k. A more detailed description of the model, including explanation of the prior probabilities used for the fitting as well as the full source code of the model, is provided elsewhere (Zeisel et al., 2015). The Stan (http://mc-stan.org) source is copied below for completeness:
data {
  int<lower=0> N;              # number of outcomes
  int<lower=0> K;              # number of predictors
  matrix<lower=0>[N, K] x;     # predictor matrix
  int y[N];                    # outcomes
}
parameters {
  vector<lower=1>[K] beta;     # coefficients
  real<lower=0.001> r;         # overdispersion
}
model {
  vector<lower=0.001>[N] mu;
  vector<lower=1.001>[N] rv;
  # priors
  r ~ cauchy(0, 1);
  beta ~ pareto(1, 1.5);
  # vectorize the overdispersion
  for (n in 1:N) {
    rv[n] <- square(r + 1) - 1;
  }
  # regression
  mu <- x * (beta - 1) + 0.001;
  y ~ neg_binomial(mu ./ rv, 1 / rv[1]);
}
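A hedged sketch of how this program might be run from R with the rstan package, assuming `stan_code` holds the model above and `y`/`x_design` are the outcomes and predictor matrix for one gene (hypothetical names). Note that recent Stan versions have dropped the deprecated `<-` assignment, `#` comments, and `int y[N]` array syntax used in the original listing, so it may need small syntactic updates first.

library(rstan)

## y: UMI counts of one gene across N cells
## x_design: N x K predictor matrix (Baseline column plus one 0/1 Cell Type column per cluster)
stan_data <- list(N = length(y), K = ncol(x_design), x = x_design, y = y)

fit <- stan(model_code = stan_code, data = stan_data, iter = 2000, chains = 4)

post <- extract(fit)   # posterior draws
dim(post$beta)         # draws x K: one posterior distribution per coefficient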
To determine which genes are higher than basal expression in each population we compared the posterior probability distributions of the Baseline coefficient and the Cell Type coefficient. A gene was considered as marking a cell population if (1) its cell-type-specific coefficient exceeded the Baseline coefficient with 99.8% (95% for the adult mouse) posterior probability, and (2) the median of its posterior distribution did not fall below a threshold q set to 35% of the median posterior probability of the highest expressing group, and (3) the median of the highest-expressing cell type was greater than 0.4. For every gene this corresponds to a binary pattern (0 if the conditions are not met and 1 if they are), and genes can therefore be grouped according to their binarized expression patterns.
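A sketch of how those three conditions could be checked from the posterior draws of a single gene, assuming `beta_post` is a draws × cell-types matrix of Cell Type coefficient samples and `baseline_post` is the vector of Baseline coefficient samples (hypothetical names; thresholds as in the text above):

## beta_post:     posterior draws, one column per cell type, for a single gene
## baseline_post: posterior draws of that gene's Baseline coefficient

prob_above_baseline <- apply(beta_post, 2, function(b) mean(b > baseline_post))
med   <- apply(beta_post, 2, median)
q_thr <- 0.35 * max(med)                      # 35% of the highest-expressing group's median

marker_pattern <- as.integer(
  prob_above_baseline >= 0.998 &              # (1) exceeds Baseline with 99.8% posterior probability
  med >= q_thr &                              # (2) median not below the threshold q
  max(med) > 0.4                              # (3) highest-expressing cell type above 0.4
)
marker_pattern                                # 0/1 binarized expression pattern across cell types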
We use those binarized patterns to call transcription factor specificity. Our definition of a transcription factor gene was based on annotations provided by the merged annotation of PANTHER GO (Mi et al., 2013) and FANTOM5 (Okazaki et al., 2002); this list was further curated, and missing genes and occasional misannotations were corrected.
The feature selection procedure is based on the largest difference between the observed coefficient of variation (CV) and the predicted CV (estimated by a non-linear noise model learned from the data); see Figure S1C. In particular, Support Vector Regression (SVR; Smola and Vapnik, 1997) was used for this purpose (scikit-learn Python implementation, default parameters with gamma = 0.06; Pedregosa et al., 2011).
Feature selection: look for the largest difference between the observed CV and the CV predicted by the noise model.
SVR: support vector regression.
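The paper used scikit-learn's SVR; a rough R analogue with e1071::svm in regression mode could look like this (the counts matrix and the cutoff of 1,000 genes are hypothetical; gamma = 0.06 as in the text):

library(e1071)    # svm() performs eps-regression when the response is numeric

## counts: genes x cells UMI matrix (hypothetical name); keep genes with non-zero mean
expressed <- counts[rowMeans(counts) > 0, ]
gene_mean <- rowMeans(expressed)
gene_cv   <- apply(expressed, 1, sd) / gene_mean

fit_svr   <- svm(log10(gene_cv) ~ log10(gene_mean), gamma = 0.06)   # noise model: CV predicted from mean
score     <- log10(gene_cv) - fitted(fit_svr)                       # observed minus predicted CV
top_genes <- names(sort(score, decreasing = TRUE))[1:1000]          # e.g. the 1,000 largest differences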
Similarities between clusters within a species were summarized using a Pearson’s correlation coefficient calculated on the binarized matrix (Figures 1C and 1D).
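For completeness, the cluster-similarity step is essentially a single call in R, assuming `binarized` is a genes × clusters 0/1 marker matrix (hypothetical name):

## binarized: genes x clusters 0/1 marker matrix
cluster_similarity <- cor(binarized, method = "pearson")   # clusters x clusters similarity matrix
heatmap(cluster_similarity, symm = TRUE)                    # quick visual check of related clusters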
The negative binomial-Lindley generalized linear model: Characteristics and application using crash data
This paper is well worth a careful read.
A recent Nature paper uses a GLM to do normalization.
However, for Drop-seq, in which the number of UMIs is low per cell compared to the number of genes present, the set of genes detected per cell can be quite different. Hence, we normalize the expression of each gene separately by modelling the UMI counts as coming from a generalized linear model with negative binomial distribution, the mean of which can be dependent on technical factors related to sequencing depth. Specifically, for every gene we model the expected value of UMI counts as a function of the total number of reads assigned to that cell, and the number of UMIs per detected gene (sum of UMI divided by number of unique detected genes).
To solve the regression problem, we use a generalized linear model (glm function of base R package) with a regularized overdispersion parameter theta. Regularizing theta helps us to avoid overfitting which could occur for genes whose variability is mostly driven by biological processes rather than sampling noise and dropout events. To learn a regularized theta for every gene, we perform the following procedure.
1) For every gene, obtain an empirical theta using the maximum likelihood model (theta.ml function of the MASS R package) and the estimated mean vector that is obtained by a generalized linear model with Poisson error distribution.
2) Fit a line (loess, span = 0.33, degree = 2) through the variance–mean UMI count relationship (both log10-transformed) and predict regularized theta using the fit. The relationship between variance, theta and mean is given by variance = mean + mean^2/theta.
Normalized expression is then defined as the Pearson residual of the regression model, which can be interpreted as the number of standard deviations by which an observed UMI count was higher or lower than its expected value. Unless stated otherwise, we clip expression to the range [-30, 30] to prevent outliers from dominating downstream analyses.
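A condensed R sketch of the whole normalization procedure, assuming a UMI matrix `counts` (genes × cells) and a per-cell vector `total_reads` (hypothetical names). glm, MASS::theta.ml and loess are the functions named in the text; the exact form of the predictors, and reading "variance" as the model variance implied by the empirical theta, are my assumptions:

library(MASS)    # for theta.ml()

## per-cell technical covariates
umis_per_gene <- colSums(counts) / colSums(counts > 0)   # sum of UMIs / number of detected genes
log_reads <- log10(total_reads)
log_upg   <- log10(umis_per_gene)

gene_mean <- rowMeans(counts)

## step 1: per-gene Poisson GLM and empirical theta from its fitted means
empirical_theta <- apply(counts, 1, function(y) {
  fit <- glm(y ~ log_reads + log_upg, family = poisson)
  theta.ml(y, mu = fitted(fit))
})

## step 2: regularize theta through a loess fit of log10(variance) on log10(mean);
## since variance = mean + mean^2 / theta, theta = mean^2 / (variance - mean)
model_var <- gene_mean + gene_mean^2 / empirical_theta
lo <- loess(log10(model_var) ~ log10(gene_mean), span = 0.33, degree = 2)
reg_var   <- 10^predict(lo)
reg_theta <- gene_mean^2 / (reg_var - gene_mean)   # (real code would guard against reg_var <= mean)

## normalized expression = Pearson residuals of the NB regression, clipped to [-30, 30]
pearson_resid <- function(y, mu, theta) (y - mu) / sqrt(mu + mu^2 / theta)
norm_expr <- t(sapply(seq_len(nrow(counts)), function(i) {
  fit <- glm(counts[i, ] ~ log_reads + log_upg, family = poisson)
  pmin(pmax(pearson_resid(counts[i, ], fitted(fit), reg_theta[i]), -30), 30)
}))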