edgeR


1)簡介

edgeR作用對象是count文件,rows 代表基因,行代表文庫,count代表的是比對到每個基因的reads數目。它主要關注的是差異表達分析,而不是定量基因表達水平。

edgeR works on a table of integer read counts, with rows corresponding to genes and columns to independent libraries. The counts represent the total number of reads aligning to each gene (or other genomic locus).edgeR is concerned with differential expression analysis rather than with the quantification of expression levels. It is concerned with relative changes in expression levels between conditions,but not directly with estimating absolute expression levels.

edgeR作用的是真實的比對統計,因此不建議用預測的轉錄本

Note that edgeR is designed to work with actual read counts. We not recommend that predicted transcript abundances are input the edgeR in place of actual counts.

歸一化原因:

技術原因影響差異表達分析:

1)Sequencing depth:統計測序深度(即代表的是library size);

2)RNA composition:個別異常高表達基因導致其它基因采樣不足

3)GC content: sample-specific effects for GC-content can be detected

4)sample-specific effects for gene length have been detected

注意:edgeR必須是原始表達量,而不能是rpkm等矯正過的。

Note that normalization in edgeR is model-based, and the original read counts are not themselves transformed. This means that users should not transform the read counts in any way before inputing them to edgeR. For example, users should not enter RPKM or FPKM values to edgeR in place of read counts. Such quantities will prevent edgeR from correctly estimating the mean-variance relationship in the data, which is a crucial to the statistical strategies underlying edgeR.Similarly, users should not add artificial values to the counts before inputing them to edgeR.

2)安裝

if("edgeR" %in% rownames(installed.packages()) == FALSE) {source("http://bioconductor.org/biocLite.R");biocLite("edgeR")}
suppressMessages(library(edgeR))
ls('package:edgeR')

 3)矩陣構建及差異分析

需要構建2個矩陣:1、表達矩陣;2、分組矩陣( 實驗設計);

-------------------------------------------------------表達矩陣-----------------------------------------

3.1、讀取表達矩陣文件(Reading in the data)

#讀取文件
rawdata <- read.delim("E:/software/R/R-3.5.0/library/edgeR/Meta/TableS1.txt", check.names=FALSE, stringsAsFactors=FALSE)
head(rawdata)

 

3.2 、構建DGEList對象

這里因為已經有rawdata的count文件,因此直接用DGEList()函數就行了,否則要用readDGE()函數

 y <- DGEList(counts=rawdata[,4:9], genes=rawdata[,1:3])##構建DGEList對象

 DGEList對象主要有三部分:

1、counts矩陣:包含的是整數counts;

2、samples數據框:包含的是文庫(sample)信息。包含 lib.size列 :for the library size (sequencing depth) for each sample,如果不自定義,  the library sizes will be computed from the column sums of the counts。其中還有一個group列,用於指定每個sample組信息

3、一個可選的數據框genes:gene的注釋信息

3.3)數據注釋( Annotation)

這里主要是因為該文章數據是前好多年的,因此需要過濾,symbol更新等。

1)The study  was undertaken a few years ago, so not all of the RefSeq IDs provided by match RefSeq IDs currently in use. We retain only those transcripts with IDs in the current NCBI annotation, which is provided by the org.HS.eg.db package

2)因為edgeR默認使用NCBI中refSeq的ID,所以通過refseq Id 找到entrezID,然后通過entrezID對symbol更新

#######retain only those transcripts with IDs in the current NCBI annotation provided by the org.HS.eg.db######
library(org.Hs.eg.db)
idfound <- y$genes$RefSeqID %in% mappedRkeys(org.Hs.egREFSEQ)
y <- y[idfound,]
dim(y)  ##15550 6
###################### 在注釋中加入  Entrez Gene IDs #########################
egREFSEQ <- toTable(org.Hs.egREFSEQ)  
m <- match(y$genes$RefSeqID, egREFSEQ$accession)
y$genes$EntrezGene <- egREFSEQ$gene_id[m]
#####################用Entrez Gene IDs更新gene symbols##########################
egSYMBOL <- toTable(org.Hs.egSYMBOL)
m <- match(y$genes$EntrezGene, egSYMBOL$gene_id)
y$genes$Symbol <- egSYMBOL$symbol[m]
head(y$genes)

 

3.4) 過濾和歸一化(Filtering and normalization)

過濾一:Different RefSeq transcripts for the same gene symbol count predominantly the same reads. So we keep one transcript for each gene symbol. We choose the transcript with highest overall count:

o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes$Symbol)
y <- y[!d,]
nrow(y)

 過濾二:Normally we would also filter lowly expressed genes.For this data, all transcripts already have at least 50 reads for all samples of at least one of the tissues types.

y$samples$lib.size <- colSums(y$counts)  #Recompute the library sizes
###############################Use Entrez Gene IDs as row names:#####################
rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
y$genes$EntrezGene <- NULL

 歸一化:TMM normalization is applied to this dataset to account for compositional difference between the libraries.

y <- calcNormFactors(y)
y$samples

 

3.5) 數據的探索(Data exploration)

樣本間關系(samples for outliers and for other relationships)

plotMDS(y)

 

PC1將tumor和nomal組分開,PC2 大略和病號對應。也側面體現了腫瘤組的異質性

--------------------------分組矩陣(根據實驗設計、目的)--------------------------------

Here we want to test for differential expression between tumour and normal tissues within patients, i.e. adjusting for differences between patients.

Patient <- factor(c(8,8,33,33,51,51))
Tissue <- factor(c("N","T","N","T","N","T"))
data.frame(Sample=colnames(y),Patient,Tissue)
design <- model.matrix(~Patient+Tissue)
rownames(design) <- colnames(y)
design

3.4)Estimating the dispersion(estimate the NB dispersion for the dataset.)

y <- estimateDisp(y, design, robust=TRUE)
y$common.dispersion    #0.1594505
plotBCV(y)

 

-----------------------------------差異分析-----------------------------------------

3.5) 差異分析(Differential expression)

fit <- glmFit(y, design)
lrt <- glmLRT(fit)
topTags(lrt)
summary(decideTests(lrt))
plotMD(lrt)
abline(h=c(-1, 1), col="blue")

 

------------------------------- Gene ontology analysis----------------------------------------

 對上調的基因進行BP分析

go <- goana(lrt)
topGO(go, ont="BP", sort="Up", n=30)

 

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM