hg19基因組 | 功能區域 | 位置提取

本文轉載自查看原文 2019-11-19 15:39 512 基因組

如何獲取hg19的CDS、UTR、intergenic、intron等的位置信息？

參考手冊：

Hg19 regions for Intergenic, Promoters, Enhancer, Exon, Intron, 5-UTR, 3-UTR

The coding region of a gene, also known as the CDS (from coding sequence), is that portion of a gene's DNA or RNA that codes for protein. The region usually begins at the 5' end by a start codon and ends at the 3' end with a stop codon.

CDS就是所有exon的組合

有細微的差別：

Exons = gene - introns

CDS = gene - introns - UTRs

therefore also:

CDS = Exons - UTRs

就是UTR也算作了exon了。

也就是exon算是一個比較大的概念，所以文章里用的CDS區域來統計，而不是exon。

獲取數據的方法：

1. UCSC - 最全，最個性化

How to download the most similar annotation file as the author required from UCSC browser directly.

Go to UCSC brower and Tool, Table caterogy. and pick you reference genome and select right version under clade/genome/assembly
Make sure the group is "Genes and Gene Predictions"
Choose your preferred track (RefSeq/RefGene or UCSC gene/KnownGene)
Choose the table that gives gene information (RefSeq or KnownGene)
Select your region or the entire genome to get coordinates for
Select BED format as your output format
Name your output file
Click "get output"
Be careful, the ouput files don't have exon, intron, integenic, 5-UTR, 3-UTR informatics if you save it as a single file. You can save them as separated files so that you know the information for each subset.

需要自己再sort一下：

cat raw.txt/Gene.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Gene.bed
cat raw.txt/UTR3.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > UTR3.bed
cat raw.txt/UTR5.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > UTR5.bed
cat raw.txt/down2K.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Down2K.bed
cat raw.txt/CDS.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > CDS.bed
cat raw.txt/exon.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Exon.bed
cat raw.txt/up2K.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Up2K.bed
cat raw.txt/intron.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Intron.bed

這里得到的CDS是有overlap的，因為是按轉錄本算的，同一個基因有多個轉錄本。

2. genecode - 最自動化，全部用代碼搞定

3. 其他

RSeQC有比較好的bed注釋文件，但是不是完全的明文，不好利用。

附錄：

intergenic的需要自己計算

wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

grep -v "_" hg19.chrlen.bed | egrep  "^chr([0-9]{1,2})" | cut -f1,3 > hg19.chrlen.1_22.bed
# sort -k1,1 hg19.chrlen.1_22.bed > sorted.genome
bedtools complement -i UCSC.anno/Gene.bed -g hg19.chrlen.1_22.bed > intergenic.bed
sort -k1,1 -k2,2n intergenic.bed > sorted.intergenic.bed
bedtools merge -i sorted.intergenic.bed > merged.sorted.intergenic.bed

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 不同的方法從gtf中提取相關基因組信息in R 使用 gffread 提取基因組序列信息參考基因組基因組注釋初步了解hg19注釋文件的內容 | gtf NextDenovo 組裝基因組基因組轉座元件動物基因組與植物基因組基本差異如何到NCBI提交基因組千人基因組計划數據庫下載某段區域SNP