Downloading and Analyzing GEO Data (SRA, SRR, GSM, SRX, SAMN, SRS, SRP and PRJNA Explained)


We often need to download RNA-seq data from GEO (https://www.ncbi.nlm.nih.gov/geo/). A typical download page is https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76381 (search for GSE76381).

Here you will see an overview of the data:

GSM2268339    1772067089_A01
GSM2268340    1772067089_A02
GSM2268341    1772067089_A03
……
Supplementary file    Size    Download    File type/resource
SRP/SRP067/SRP067844        (ftp)    SRA Study
GSE76381_ESMoleculeCounts.cef.txt.gz    5.9 Mb    (ftp)(http)    TXT
GSE76381_EmbryoMoleculeCounts.cef.txt.gz    5.3 Mb    (ftp)(http)    TXT
GSE76381_MouseAdultDAMoleculeCounts.cef.txt.gz    1.0 Mb    (ftp)(http)    TXT
GSE76381_MouseEmbryoMoleculeCounts.cef.txt.gz    6.1 Mb    (ftp)(http)    TXT
GSE76381_iPSMoleculeCounts.cef.txt.gz    1001.2 Kb    (ftp)(http)    TXT
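
The supplementary count tables listed above can also be fetched by script. The sketch below only builds the download URL. The directory layout (`geo/series/GSE76nnn/GSE76381/suppl/`) is an assumption based on GEO's public FTP/HTTPS layout, not something stated in this post, and `geo_suppl_url` is a hypothetical helper name.

```python
def geo_suppl_url(series_acc: str, filename: str) -> str:
    """Build the (assumed) download URL of a GEO Series supplementary file."""
    if not (series_acc.startswith("GSE") and series_acc[3:].isdigit()):
        raise ValueError(f"not a GSE accession: {series_acc!r}")
    digits = series_acc[3:]
    # Assumed layout: series are grouped in blocks of 1000,
    # e.g. GSE76381 lives under GSE76nnn.
    stub = "GSE" + digits[:-3] + "nnn"
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{series_acc}/suppl/{filename}"
    )


url = geo_suppl_url("GSE76381", "GSE76381_ESMoleculeCounts.cef.txt.gz")
print(url)
```

The resulting URL can be passed to any downloader, for example `urllib.request.urlretrieve`.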

We have now downloaded all of the study's SRA data from the FTP site.

Name    Size    Date modified
[parent directory]        
SRR4055063/        2016/8/24 8:00:00 AM
SRR4055064/        2016/8/24 8:00:00 AM
SRR4055065/        2016/8/24 8:00:00 AM
SRR4055066/        2016/8/24 8:00:00 AM
......

Each folder contains one or more .sra files.

For example, SRR4061391.sra is a binary file and has to be converted to FASTQ with the SRA tools.
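
A minimal sketch of that conversion step, assuming the SRA Toolkit's `fastq-dump` is installed and on `PATH`. The post only says "SRA tools", so the choice of `fastq-dump`, the output directory name and the helper names are assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def fastq_dump_cmd(sra_file, outdir="fastq"):
    """Return the fastq-dump command line for one .sra file."""
    return ["fastq-dump", "--outdir", str(outdir), str(sra_file)]


def convert(sra_file, outdir="fastq"):
    """Run fastq-dump on one .sra file (requires the SRA Toolkit)."""
    if shutil.which("fastq-dump") is None:
        raise RuntimeError("fastq-dump not found on PATH")
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(fastq_dump_cmd(sra_file, outdir), check=True)


# Only build and show the command here; call convert() to actually run it.
cmd = fastq_dump_cmd("SRR4061391/SRR4061391.sra")
print(" ".join(cmd))
```

Looping `convert()` over every `SRR*/` folder would process the whole download.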

The converted FASTQ looks like this:

@SRR4061391.sra.1 Run0289_BC69A1ACXX_L7_T1101_C8 length=51
ATTCAAGGGAGTTATAAGCAGAGTCAATAATGAATTTCTTCCTGCGTCTCC
+SRR4061391.sra.1 Run0289_BC69A1ACXX_L7_T1101_C8 length=51
CCCFFFFFHDHFHIJJJJJGJIIEHHIJJJJIIIIJJIIJIJJJIJJJJJJ
@SRR4061391.sra.2 Run0289_BC69A1ACXX_L7_T1101_C18 length=51
TTGATTGGGCACCTAGAAGCCAAGGACTCTCTAAGTCCTAGTCTGTTTGGT
+SRR4061391.sra.2 Run0289_BC69A1ACXX_L7_T1101_C18 length=51
CCCFFFFFHHHHHJJJGIJIIJJJJJJJJJJJJJJIIJJIIIJJJJJJJJF

As you can see, the FASTQ file carries no useful sample information (species, sample name, cell name, tissue).
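
To make that concrete, here is a small sketch that splits a header line into its parts. The regular expression is an assumption fitted to the headers shown above (run name, spot index, an instrument descriptor and the read length); other FASTQ files may use a different layout.

```python
import re

# Pattern fitted to headers such as:
# @SRR4061391.sra.1 Run0289_BC69A1ACXX_L7_T1101_C8 length=51
HEADER_RE = re.compile(
    r"^@(?P<run>\S+?)\.(?P<spot>\d+) (?P<descr>\S+) length=(?P<length>\d+)$"
)


def parse_header(line):
    """Split one FASTQ header line into its fields, or return None."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    run = m.group("run")
    if run.endswith(".sra"):  # drop the file extension kept in the read name
        run = run[: -len(".sra")]
    return {
        "run": run,
        "spot": int(m.group("spot")),
        "descr": m.group("descr"),
        "length": int(m.group("length")),
    }


info = parse_header("@SRR4061391.sra.1 Run0289_BC69A1ACXX_L7_T1101_C8 length=51")
print(info)
```

The only identifier that links back to a database record is the run accession; everything else describes the sequencing run itself.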

At this point the only option is to look for that information in the paper.


The paper contains very little that is actually useful:

The molar concentrations of the libraries was determined with KAPA Library Quant qPCR (Kapa Biosystems) and size distribution was evaluated after PCR (12cycles) using an Agilent BioAnalyzer. Sequencing was performed on an Illumina HiSeq 2000 with C1-P1-PCR2 as read 1 primer and C1-TN5-U as index read primer. Reads of 50 bp as well as 8 bp index reads corresponding to the cell-specific barcodes were generated. Reads were mapped using bowtie and processed as described previously (Zeisel et al., 2015), adding the more strict criteria for UMI counting: we removed all singletons (molecules supported by a single read).

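This is not very clear either. The 8 bp barcode is nowhere to be found in the downloaded reads, which suggests the data have already been split by barcode.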

Reads of 50 bp were generated along with 8 bp index reads corresponding to the cell-specific barcode. Each read was expected to start with a 6 bp unique molecular identifier (UMI), followed by 3-5 guanines, followed by the 5’ end of the transcript.
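
The read layout described in that passage (6 bp UMI, then 3-5 guanines, then the transcript) can be sketched as a regular expression. This is only an illustration of the stated layout, not the authors' pipeline, and the names are hypothetical. The greedy `G{3,5}` may also swallow a G that actually belongs to the transcript.

```python
import re

# 6 bp UMI, 3-5 G linker, then the 5' end of the transcript
READ_RE = re.compile(r"^(?P<umi>[ACGTN]{6})(?P<linker>G{3,5})(?P<cdna>[ACGTN]*)$")


def split_read(seq):
    """Return (umi, linker, cdna), or None if the read does not fit the layout."""
    m = READ_RE.match(seq)
    if m is None:
        return None
    return m.group("umi"), m.group("linker"), m.group("cdna")


# First read from the FASTQ excerpt in this post
read = "ATTCAAGGGAGTTATAAGCAGAGTCAATAATGAATTTCTTCCTGCGTCTCC"
umi, linker, cdna = split_read(read)
print(umi, linker, cdna[:10])
```

Both reads in the excerpt fit this pattern, with a 6 bp prefix followed by `GGG`.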

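After going the long way round, the genuinely useful information turns out to be in a cited reference. Big names these days really do like to show off, forcing everyone to read their earlier papers.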

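Summary: at this point all of the paper's data have been downloaded and converted to FASTQ, and we know how the reads are laid out, but we still do not know which sample each FASTQ file belongs to.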


Back on the starting page, there appears to be sample information:

GSM2268339    1772067089_A01
GSM2268340    1772067089_A02
GSM2268341    1772067089_A03

Expanding the list shows all of the entries.

They are indeed sample records: sample accessions plus species information.

Clicking GSM2268340 reveals more detailed sample information:

Status    Public on Oct 06, 2016
Title    1772067089_A02
Sample type    SRA
     
Source name    ventral midbrain
Organism    Homo sapiens
Characteristics    tissue: ventral midbrain
Sex: pooled male and female
age: 7w
inferred cell type: hRgl2a
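
The same record can be handled as text instead of being read off the web page. The sketch below parses a GEO SOFT-style sample record into a dictionary. The embedded record is hand-written from the fields shown above, not downloaded, and the field names (`Sample_characteristics_ch1` and so on) follow the SOFT format as I understand it. Fetching the real text, for example via `acc.cgi?acc=GSM2268340&targ=self&form=text&view=quick`, is an assumption and is left out here.

```python
from collections import defaultdict

# Hand-written example in SOFT style, built from the fields shown above
SOFT_EXAMPLE = """\
^SAMPLE = GSM2268340
!Sample_title = 1772067089_A02
!Sample_source_name_ch1 = ventral midbrain
!Sample_organism_ch1 = Homo sapiens
!Sample_characteristics_ch1 = tissue: ventral midbrain
!Sample_characteristics_ch1 = Sex: pooled male and female
!Sample_characteristics_ch1 = age: 7w
!Sample_characteristics_ch1 = inferred cell type: hRgl2a
"""


def parse_soft_sample(text):
    """Collect '!key = value' lines; a key may occur more than once."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.startswith("!") or " = " not in line:
            continue
        key, value = line[1:].split(" = ", 1)
        fields[key].append(value)
    return dict(fields)


def characteristics(fields):
    """Turn 'name: value' characteristics into a plain dict."""
    out = {}
    for item in fields.get("Sample_characteristics_ch1", []):
        name, _, value = item.partition(": ")
        out[name] = value
    return out


fields = parse_soft_sample(SOFT_EXAMPLE)
traits = characteristics(fields)
print(fields["Sample_title"][0], traits["inferred cell type"])
```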

Summary: so far we still cannot find sample information for the SRR files; we have only found it for the GSM records.


So how do we find the relationship between SRR and GSM?

Searching GEO directly for SRR4061391 gives the following result:

At last we have the mapping: SRX2050530: GSM2274293: 1772096111_A02; Mus musculus; RNA-Seq
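
The result line above packs several fields into one string. A small sketch for pulling them apart follows. The field order (experiment, GEO sample, title, organism, library strategy) is inferred from this single example, so treat it as an assumption.

```python
def parse_sra_title(line):
    """Split 'SRX: GSM: title; organism; strategy' into named fields."""
    srx, gsm, rest = [part.strip() for part in line.split(":", 2)]
    # rsplit keeps any ';' inside the sample title intact
    title, organism, strategy = [part.strip() for part in rest.rsplit(";", 2)]
    return {
        "experiment": srx,
        "sample": gsm,
        "title": title,
        "organism": organism,
        "strategy": strategy,
    }


record = parse_sra_title(
    "SRX2050530: GSM2274293: 1772096111_A02; Mus musculus; RNA-Seq"
)
print(record)
```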

GSM2274293 contains two SRR files.


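Summary: so far we can manually look up the sample information for a downloaded SRR file. But there are more than 6,000 of them in total, so looking them up by hand is not realistic.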


Some background: About GEO DataSets

Lists the DataSet (GDS), Series (GSE) or Platform (GPL) accession number, followed by title and organism.

lists the Sample accessions numbers (GSM) and titles.

GDS accession: DataSet

GSE accession: Series

GPL accession: Platform

GSM accession: Sample
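
The accession types above, together with the SRA, BioSample and BioProject prefixes named in the title, can be kept in a small lookup table. The descriptions are my own short labels, and `classify` is a hypothetical helper that only checks the prefix followed by digits.

```python
ACCESSION_PREFIXES = {
    "GDS": "GEO DataSet",
    "GSE": "GEO Series",
    "GPL": "GEO Platform",
    "GSM": "GEO Sample",
    "SRP": "SRA study",
    "SRX": "SRA experiment",
    "SRS": "SRA sample",
    "SRR": "SRA run",
    "SAMN": "BioSample",
    "PRJNA": "BioProject",
}


def classify(accession):
    """Return the record type for an accession, judged by its prefix."""
    acc = accession.strip().upper()
    # Try longer prefixes first so a short prefix never shadows a longer one
    for prefix in sorted(ACCESSION_PREFIXES, key=len, reverse=True):
        if acc.startswith(prefix) and acc[len(prefix):].isdigit():
            return ACCESSION_PREFIXES[prefix]
    return "unknown"


for acc in ["GSE76381", "GSM2274293", "SRP067844", "SRX2050530", "SRR4061391"]:
    print(acc, "->", classify(acc))
```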

Schematic overview of GEO data submission

Reference: About GEO DataSets

GEO Overview


After a lot of Googling, Biopython turned out to be the most dependable option. Biopython is in good shape these days, and more people are maintaining it.
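
The post does not show its Biopython code, so the following is only a sketch of one possible approach, not the author's script. It assumes that `Bio.Entrez` can return the SRA "runinfo" CSV (`rettype="runinfo"`) and that this CSV has `Run`, `Experiment`, `SampleName` and `ScientificName` columns, with the GSM accession in `SampleName` for GEO submissions. The inline CSV is hand-made from values in this post so that the parsing step can be tried without network access.

```python
import csv
import io


def fetch_runinfo(term, email, retmax=10000):
    """Fetch SRA runinfo CSV for a search term (needs Biopython and network)."""
    from Bio import Entrez  # imported here so the parser works without Biopython

    Entrez.email = email
    search = Entrez.read(Entrez.esearch(db="sra", term=term, retmax=retmax))
    ids = search["IdList"]
    handle = Entrez.efetch(
        db="sra", id=",".join(ids), rettype="runinfo", retmode="text"
    )
    data = handle.read()
    handle.close()
    return data.decode() if isinstance(data, bytes) else data


def runs_to_samples(runinfo_csv):
    """Map each run accession to its experiment, sample name and organism."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(runinfo_csv)):
        run = row.get("Run")
        if not run or run == "Run":  # skip blanks and repeated header rows
            continue
        mapping[run] = {
            "experiment": row.get("Experiment", ""),
            "sample_name": row.get("SampleName", ""),
            "organism": row.get("ScientificName", ""),
        }
    return mapping


# Hand-made example row built from values in this post (not real output)
EXAMPLE_CSV = (
    "Run,Experiment,SampleName,ScientificName\n"
    "SRR4061391,SRX2050530,GSM2274293,Mus musculus\n"
)
mapping = runs_to_samples(EXAMPLE_CSV)
print(mapping["SRR4061391"]["sample_name"])
```

With network access, something like `runs_to_samples(fetch_runinfo("SRP067844", "you@example.org"))` would be the intended use. The e-mail address is a placeholder, and large studies may need to be fetched in batches.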

 

 

Reference:

Question: From A Geo Gsm Id, How To Obtain The Corresponding Raw File(S) Hosted On Sra?

