What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)
This is a detailed but very practical question: which version should you actually use?
References:
What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)
Results differ when using different ensembl versions
First part options:
- dna_sm - Repeats soft-masked (converts repeat nucleotides to lowercase)
- dna_rm - Repeats hard-masked (converts repeats to N's)
- dna - No masking
Second part options:
- .toplevel - Includes haplotype information (not sure how aligners deal with this)
- .primary_assembly - Single reference base per position
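As a minimal sketch of how the two parts combine, the helper below assembles a file name following the `<species>.<assembly>.<masking>.<scope>.fa.gz` pattern that Ensembl uses for its whole-genome FASTA files. The function `ensembl_fasta_name` is hypothetical (it is not part of any Ensembl tool), and the species and assembly strings are only example values.

```python
# Hypothetical helper: build an Ensembl-style genome FASTA file name
# from the "first part" (masking) and "second part" (scope) options.
MASKING_OPTIONS = {"dna", "dna_sm", "dna_rm"}
SCOPE_OPTIONS = {"toplevel", "primary_assembly"}


def ensembl_fasta_name(species: str, assembly: str, masking: str, scope: str) -> str:
    """Return e.g. 'Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz'."""
    if masking not in MASKING_OPTIONS:
        raise ValueError(f"unknown masking option: {masking!r}")
    if scope not in SCOPE_OPTIONS:
        raise ValueError(f"unknown scope option: {scope!r}")
    return f"{species}.{assembly}.{masking}.{scope}.fa.gz"


if __name__ == "__main__":
    # Example values only; substitute your own species and assembly.
    print(ensembl_fasta_name("Homo_sapiens", "GRCh38", "dna_sm", "primary_assembly"))
```

Validating the two options up front makes a typo such as `dna_soft` fail immediately instead of producing a file name that does not exist on the server.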
Most people recommend the soft-masked version (dna_sm), in which repeats are lowercased rather than replaced with N.
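To make the three masking styles concrete, here is a small sketch on a toy sequence (not real genome data). It assumes the usual convention that lowercase letters mark repeats in a soft-masked file. The helper names `masked_fraction`, `hard_mask` and `unmask` are made up for illustration.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of bases that are lowercase, i.e. soft-masked."""
    bases = [c for c in seq if c.isalpha()]
    if not bases:
        return 0.0
    return sum(c.islower() for c in bases) / len(bases)


def hard_mask(seq: str) -> str:
    """Turn a soft-masked sequence into a hard-masked one (repeats become N)."""
    return "".join("N" if c.islower() else c for c in seq)


def unmask(seq: str) -> str:
    """Turn a soft-masked sequence into an unmasked one (all uppercase)."""
    return seq.upper()


if __name__ == "__main__":
    soft = "ACGTacgtacGGCC"  # toy soft-masked sequence
    print(f"masked fraction: {masked_fraction(soft):.2f}")
    print("dna_rm style:", hard_mask(soft))
    print("dna style:   ", unmask(soft))
```

The point of the sketch is that the soft-masked file loses nothing: both of the other forms can be derived from it, whereas the original bases cannot be recovered from a hard-masked file.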
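Download the hg19 genome: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
Reference: correspondence between the various genome versions
Download the hg19 annotation file from GENCODE: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/ (GENCODE release 27 itself is built on GRCh38; the hg19/GRCh37 versions are the lift-over "lift37" files)
UCSC also offers the annotation, but only as an export from the web page: http://genome.ucsc.edu/cgi-bin/hgTables
Note: GENCODE seemed to have a problem when I tried. On https://www.gencodegenes.org/releases/26lift37.html the EBI links could no longer be downloaded.
Reference: http://www.biotrainee.com/thread-2035-1-1.html
A newer genome is not always better. Look at recent CNS (Cell, Nature, Science) papers: very few of them use the latest genome version. Why? Because the annotation has not caught up, so your results may not match what other people report.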
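Hands-on test
What happens when you use different genome versions?
I ran a transcriptome test using hg19 and GRCh38.
Conclusions:
1. Read alignment to the genome is roughly the same, with essentially no difference;
2. With different annotation files, the gene expression results differ greatly, even though both runs used featureCounts.
GRCh38 results: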
Assigned                        306852
Unassigned_Unmapped             0
Unassigned_MappingQuality       0
Unassigned_Chimera              0
Unassigned_FragmentLength       0
Unassigned_Duplicate            0
Unassigned_MultiMapping         36280
Unassigned_Secondary            0
Unassigned_Nonjunction          0
Unassigned_NoFeatures           56950
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity            19771

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/lizhixin/databases/ensembl/release91/Homo_s ... ||
||    Features : 1199851                                                      ||
||    Meta-features : 58302                                                   ||
||    Chromosomes/contigs : 47                                                ||
||                                                                            ||
|| Process BAM file /home/lizhixin/project/scRNA-seq/reanalyze/first_five ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 419853                                                ||
||    Successfully assigned fragments : 306852 (73.1%)                        ||
||    Running time : 0.05 minutes                                             ||
hg19 results:
Assigned                        586467
Unassigned_Unmapped             0
Unassigned_MappingQuality       0
Unassigned_Chimera              0
Unassigned_FragmentLength       0
Unassigned_Duplicate            0
Unassigned_MultiMapping         66997
Unassigned_Secondary            0
Unassigned_Nonjunction          0
Unassigned_NoFeatures           133437
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity            47278

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/lizhixin/databases/cellranger_ref/refdata-c ... ||
||    Features : 1130716                                                      ||
||    Meta-features : 32738                                                   ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file /home/lizhixin/project/scRNA-seq/reanalyze/first_five ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 834179                                                ||
||    Successfully assigned fragments : 586467 (70.3%)                        ||
||    Running time : 0.05 minutes                                             ||
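For a side-by-side comparison of the two runs, a short sketch like the following can parse the featureCounts summary lines and compute the share of fragments in each category. It assumes the summary is a two-column "status, count" listing with an optional `Status` header line. The function names are made up for illustration, and only the non-zero categories from the outputs above are included.

```python
def parse_summary(text: str) -> dict:
    """Parse 'Status<whitespace>count' lines into a dict of integer counts."""
    counts = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip the header line and anything that is not a simple key/value pair.
        if len(parts) != 2 or parts[0] == "Status":
            continue
        counts[parts[0]] = int(parts[1])
    return counts


def category_shares(counts: dict) -> dict:
    """Return each category as a fraction of all fragments."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


# Non-zero categories copied from the two runs above.
GRCH38_SUMMARY = """
Assigned 306852
Unassigned_MultiMapping 36280
Unassigned_NoFeatures 56950
Unassigned_Ambiguity 19771
"""

HG19_SUMMARY = """
Assigned 586467
Unassigned_MultiMapping 66997
Unassigned_NoFeatures 133437
Unassigned_Ambiguity 47278
"""

if __name__ == "__main__":
    for label, text in (("GRCh38", GRCH38_SUMMARY), ("hg19", HG19_SUMMARY)):
        counts = parse_summary(text)
        shares = category_shares(counts)
        print(label, "total fragments:", sum(counts.values()))
        for name, share in shares.items():
            print(f"  {name}: {share:.1%}")
```

Comparing fractions rather than raw counts matters here because the two runs report different total fragment numbers (419853 versus 834179).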
Whatever you do, do not mix and match annotation files from different versions!!!
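One cheap sanity check before counting is to confirm that the annotation and the genome at least use the same sequence names (UCSC hg19 names chromosomes `chr1`, while Ensembl uses `1`). The sketch below works on lists of lines so that it stays self-contained, and its helper names are hypothetical. It only catches naming mismatches and cannot detect coordinate differences between assemblies.

```python
def gtf_contigs(gtf_lines) -> set:
    """Collect sequence names from column 1 of GTF lines, skipping comments."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def fasta_contigs(fasta_lines) -> set:
    """Collect sequence names from FASTA header lines (text after '>')."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def contigs_missing_from_genome(gtf_lines, fasta_lines) -> set:
    """Names used in the annotation that do not exist in the genome."""
    return gtf_contigs(gtf_lines) - fasta_contigs(fasta_lines)


if __name__ == "__main__":
    # Toy data: Ensembl-style annotation against a UCSC-style genome.
    toy_gtf = [
        "#!toy header",
        '1\ttoy\tgene\t100\t200\t.\t+\t.\tgene_id "toy_gene";',
    ]
    toy_fasta = [">chr1 toy sequence", "ACGTACGT"]
    missing = contigs_missing_from_genome(toy_gtf, toy_fasta)
    if missing:
        print("Annotation contigs not found in genome:", sorted(missing))
```

With real files, the same functions could be fed `open(path)` handles for an uncompressed GTF and FASTA, since both iterate line by line.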