如何獲得FPKM/RPKM計算需要的基因長度(考慮exon之間的overlap)


版權聲明:本文為博主原創文章,轉載請注明出處

這里我們跟Cufflinks的原理一致,使用總的外顯子長度,並且去除過多的重疊的外顯子的部分。使用R語言,輸入為基因的GTF文件

包的安裝

依賴data.table, IRanges,rtracklayer

install.packages("data.table")
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rtracklayer")
BiocManager::install("IRanges")


代碼

library(data.table)
library("IRanges")
require("rtracklayer")

hg19 <- readGFF("hg19.gencodev27.gtf")
anno <- setDT(hg19)
anno <- anno[type=="exon",]
setnames(anno,c("seqid","start","end","gene_name","exon_number"),c("Chr","ExonStart","ExonEnd","Gene","Exon_number"))
#mkdir bin and mean by bin
Exon_region <- unique(anno[,.(Chr,ExonStart,ExonEnd,Exon_number,Gene)])
Exon_region <- Exon_region[,{x <- IRanges(ExonStart,ExonEnd);y <- reduce(x); list(ExonStart=y@start,ExonEnd=y@start+y@width-1)},by=.(Gene,Chr)]
Exon_region[,Exon_num:=1:.N,by=Gene]
Exon_region <- Exon_region[,.(Chr,ExonStart,ExonEnd,Exon_num,Gene)]
Exon_len <- Exon_region[,.(ExonLen = ExonEnd - ExonStart + 1),by=.(Exon_num,Gene)]
gene_len <- Exon_len[,.(Length = sum(ExonLen)),by=Gene]


# write out
fwrite(Exon_region,file="All_hg19gene_exon.bed", sep = "\t", col.names = T)
fwrite(gene_len, file = "All_hg19gene_len.txt", sep = "\t", col.names = T)
~

結果文件

  1. 基因長度文件 鏈接:https://pan.baidu.com/s/1NtfM_ESyNyaT-kVaKu0MyQ
    提取碼:gy88
    復制這段內容后打開百度網盤手機App,操作更方便哦

  2. 合並后的外顯子區域文件
    鏈接:https://pan.baidu.com/s/1-IpuC_2N88Jx9m2g5fCqmA
    提取碼:cevo
    復制這段內容后打開百度網盤手機App,操作更方便哦

參考資料

https://www.cureffi.org/2013/09/12/counts-vs-fpkms-in-rna-seq/


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM