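GTF stands for gene transfer format, and its main purpose is to annotate the genes on a chromosome. How should we understand this? Gene names, gene loci and the like are just names that people later gave to stretches of DNA sequence; back in the cell, what actually exists is a long chromosome (a DNA sequence) inside the nucleus. The main job of a GTF file is to state where each of these so-called genes sits on the chromosome (its coordinates), and to annotate that interval with additional information.
I usually download GTF files from Ensembl; GENCODE works too. Here are the links: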
ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/
http://www.gencodegenes.org/releases/current.html
For an explanation of the file, here is the official description from Ensembl: http://www.ensembl.org/info/website/upload/gff.html
GFF/GTF File Format - Definition and supported options
The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The following documentation is based on the Version 2 specifications.
The GTF (General Transfer Format) is identical to GFF version 2.
Fields
Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'
- seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
- source - name of the program that generated this feature, or the data source (database or project name)
- feature - feature type name, e.g. Gene, Variation, Similarity
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
- attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
Note that where the attributes contain identifiers that link the features together into a larger structure, these will be used by Ensembl to display the features as joined blocks.
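To make the nine-column layout concrete, here is a minimal Python sketch that splits one GTF feature line on tabs and turns the attribute column into a dictionary. The names `GtfFeature`, `parse_attributes` and `parse_gtf_line` are made up for illustration. The attribute parser assumes the `tag "value";` style seen in Ensembl GTF dumps and that values contain no semicolons, and `length` assumes that both start and end are inclusive. It is a sketch, not a complete GFF/GTF parser.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GtfFeature:
    seqname: str
    source: str
    feature: str
    start: int                 # 1-based
    end: int                   # 1-based, assumed inclusive
    score: Optional[float]     # None when the column is '.'
    strand: str
    frame: Optional[int]       # None when the column is '.'
    attributes: Dict[str, str] = field(default_factory=dict)

    @property
    def length(self) -> int:
        # With 1-based inclusive coordinates the span covers end - start + 1 bases.
        return self.end - self.start + 1


def parse_attributes(text: str) -> Dict[str, str]:
    """Parse `tag "value"; tag "value";` into a dict (a repeated tag keeps its last value)."""
    attrs: Dict[str, str] = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def parse_gtf_line(line: str) -> GtfFeature:
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 tab-separated columns, got {len(cols)}")
    # Only the final (attribute) column is allowed to be missing.
    attr_text = cols[8] if len(cols) > 8 else ""
    return GtfFeature(
        seqname=cols[0],
        source=cols[1],
        feature=cols[2],
        start=int(cols[3]),
        end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        frame=None if cols[7] == "." else int(cols[7]),
        attributes=parse_attributes(attr_text),
    )


# The gene record from the Ensembl sample dump, rebuilt with explicit tabs.
line = "\t".join([
    "1", "transcribed_unprocessed_pseudogene", "gene", "11869", "14409",
    ".", "+", ".",
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; '
    'gene_biotype "transcribed_unprocessed_pseudogene";',
])
feat = parse_gtf_line(line)
print(feat.feature, feat.length, feat.attributes["gene_name"])  # gene 2541 DDX11L1
```

Grouping parsed records by `attributes["gene_id"]` is one simple way to rebuild the gene/transcript structure that the identifiers encode.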
Sample GTF output from Ensembl data dump:
1	transcribed_unprocessed_pseudogene	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1	processed_transcript	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
Sample GFF output from Ensembl export:
X	Ensembl	Repeat	2419108	2419128	42	.	.	hid=trf; hstart=1; hend=21
X	Ensembl	Repeat	2419108	2419410	2502	-	.	hid=AluSx; hstart=1; hend=303
X	Ensembl	Repeat	2419108	2419128	0	.	.	hid=dust; hstart=2419108; hend=2419128
X	Ensembl	Pred.trans.	2416676	2418760	450.19	-	2	genscan=GENSCAN00000019335
X	Ensembl	Variation	2413425	2413425	.	+	.
X	Ensembl	Variation	2413805	2413805	.	+	.
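The GFF export differs from the GTF dump in two ways that matter when parsing: attributes are written as `key=value` pairs, and the final column may be missing altogether (as in the Variation records). The sketch below handles both cases. The function name `parse_gff_attributes` is hypothetical, and the rows are a subset of the sample above, rebuilt with explicit tabs.

```python
from collections import Counter
from typing import Dict


def parse_gff_attributes(text: str) -> Dict[str, str]:
    """Parse `key=value; key=value` into a dict; assumes values contain no ';'."""
    attrs: Dict[str, str] = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


# A few records from the sample export, joined with tabs as the format requires.
rows = [
    ("X", "Ensembl", "Repeat", "2419108", "2419410", "2502", "-", ".",
     "hid=AluSx; hstart=1; hend=303"),
    ("X", "Ensembl", "Pred.trans.", "2416676", "2418760", "450.19", "-", "2",
     "genscan=GENSCAN00000019335"),
    ("X", "Ensembl", "Variation", "2413425", "2413425", ".", "+", "."),
]
lines = ["\t".join(row) for row in rows]

feature_counts: Counter = Counter()
parsed = []
for line in lines:
    cols = line.split("\t")
    # The ninth column is optional, so fall back to an empty attribute set.
    attrs = parse_gff_attributes(cols[8]) if len(cols) > 8 else {}
    feature_counts[cols[2]] += 1
    parsed.append((cols[2], int(cols[3]), int(cols[4]), attrs))

for feature, start, end, attrs in parsed:
    print(feature, start, end, attrs)
```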
Track lines
Although not part of the formal GFF specification, Ensembl uses track lines to further configure sets of features (thus maintaining compatibility with UCSC). Track lines should be placed at the beginning of the list of features they are to affect.
The track line consists of the word 'track' followed by space-separated key=value pairs - see the example below. Valid parameters used by Ensembl are:
- name - unique name to identify this track when parsing the file
- description - Label to be displayed under the track in Region in Detail
- priority - integer defining the order in which to display tracks, if multiple tracks are defined.
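As a rough illustration of the track-line syntax, the Python sketch below splits a track line into its key=value pairs. The example line and the function name `parse_track_line` are invented for this sketch, and using `shlex` so that a quoted description may contain spaces is an assumption about how such values are written, not something the text above specifies.

```python
import shlex
from typing import Dict


def parse_track_line(line: str) -> Dict[str, str]:
    """Split 'track key=value ...' into a dict of string parameters."""
    tokens = shlex.split(line)  # honours double quotes around values
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    params: Dict[str, str] = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        params[key] = value
    return params


# Hypothetical track line, for illustration only.
example = 'track name=demo_track description="Demo features" priority=2'
track = parse_track_line(example)
print(track["name"], track["description"], int(track["priority"]))
```

When several tracks are defined, sorting the parsed dictionaries by `int(track["priority"])` gives an ordering by priority.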
More information
For more information about this file format, see the documentation on the GMOD wiki.
References:
生信小碼農 blog: http://www.cnblogs.com/Demo1589/p/6950196.html