前言
在生物信息學數據分析中,許多分析軟件都是基於R開發的。這里介紹一個可以在Python 中進行基因富集分析的Python 軟件 GSEAPY
(Gene Set Enrichment Analysis in Python)
GSEApy is a python wrapper for GESA and Enrichr.
It’s used for convenient GO enrichments and produce publication-quality figures from python.
GSEAPY
安裝
可以通過conda
或 pip
進行安裝
# if you have conda
$ conda install -c conda-forge -c bioconda gseapy
# or use pip to install the latest release
$ pip install gseapy
pip 安裝要是遇到這樣的報錯
data = self.read(amt=amt, decode_content=decode_content)
File "/opt/conda/lib/python3.9/site-packages/pip/_vendor/urllib3/response.py", line 541, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/opt/conda/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/opt/conda/lib/python3.9/site-packages/pip/_vendor/urllib3/response.py", line 443, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
可以使用清華鏡像,進行安裝:
$ pip install gseapy -i https://pypi.tuna.tsinghua.edu.cn/simple
富集分析
背景信息
- gene set, 指一組具有相同特征的基因。如一個GO term 對應的多個基因,一個kegg pathway對應的多個基因
- gene set library,多個相關的gene set 。如所有GO term組成一個gene set library.
- enrichment analysis, gene set library 作為注釋基因集合,已知的先驗知識。對於一個輸入基因集合,富集分析通過計算分析哪些注釋gene set 顯著存在於輸入基因集合中。例如:GO 富集分析中,查看哪些GO terms 顯著存在於輸入基因列表中。
有多種基因集富集分析策略,我們常說的GO/KEGG 富集分析 應該大多數指over represent analysis(ORA)。還有個常用富集方法叫GSEA(Gene Set Enrichment Analysis), 翻譯過來也是基因集富集分析。下文GSEA,特指這種策略。
ORA
測試數據,可以從GSEApy/tests/data下載。
富集的函數是enricher
.
先展示一下,富集的代碼:
gene_list="./gene_list.txt"
gene_sets='KEGG_2016'
gene_sets=['KEGG_2016','KEGG_2013']
enr = gp.enrichr(gene_list=gene_list,
gene_sets=gene_sets,
organism='Human', # don't forget to set organism to the one you desired! e.g. Yeast
description='kegg',
outdir='test/enrichr',
# no_plot=True,
cutoff=0.5 # test dataset, use lower value from range(0,1)
)
運行完后,'test/enrichr'
目錄下存放着會有富集的圖片以及文本。
(base) jovyan@95c3096ad9ae:~$ ll test/enrichr
-rw-r--r-- 1 jovyan users 22003 Dec 26 14:59 KEGG_2013.Human.enrichr.reports.pdf
-rw-r--r-- 1 jovyan users 22130 Dec 26 14:59 KEGG_2013.Human.enrichr.reports.txt
-rw-r--r-- 1 jovyan users 25722 Dec 26 14:59 KEGG_2016.Human.enrichr.reports.pdf
-rw-r--r-- 1 jovyan users 48458 Dec 26 14:59 KEGG_2016.Human.enrichr.reports.txt
查看KEGG_2016.Human.enrichr.reports.pdf,圖片只顯示了前10個,這是由參數top_term=10,所決定的
同時富集也結果保存在enr.results
里,如查看前五個數據
enr.results.head(5)
輸出
Gene_set Term Overlap P-value Adjusted P-value Old P-value Old Adjusted P-value Odds Ratio Combined Score Genes
0 KEGG_2016 Osteoclast differentiation Homo sapiens hsa04380 28/132 3.104504e-13 7.885440e-11 0 0 6.659625 191.802220 LILRA6;ITGB3;LILRA2;LILRA5;PPP3R1;FCGR3B;SIRPA...
1 KEGG_2016 Tuberculosis Homo sapiens hsa05152 31/178 4.288559e-12 5.446470e-10 0 0 5.224941 136.763196 RAB5B;ITGB2;PPP3R1;HLA-DMA;FCGR3B;HLA-DMB;CASP...
2 KEGG_2016 Phagosome Homo sapiens hsa04145 28/154 1.614009e-11 1.366528e-09 0 0 5.490501 136.437381 ATP6V1A;RAB5B;ITGB5;ITGB3;ITGB2;HLA-DMA;FCGR3B...
3 KEGG_2016 Rheumatoid arthritis Homo sapiens hsa05323 19/90 2.197884e-09 1.395656e-07 0 0 6.554453 130.668081 ATP6V1A;ATP6V1G1;ATP6V0B;TGFB1;ITGB2;FOS;ITGAL...
4 KEGG_2016 Leishmaniasis Homo sapiens hsa05140 17/73 3.132614e-09 1.591368e-07 0 0 7.422186 145.336773 TGFB1;IFNGR1;PRKCB;IFNGR2;ITGB2;FOS;MAPK14;HLA...
查看enricher函數幫助文檔
help(gp.enrichr)
Help on function enrichr in module gseapy.enrichr:
enrichr(gene_list, gene_sets, organism='human',
description='', outdir='Enrichr',
background='hsapiens_gene_ensembl',
cutoff=0.05, format='pdf', figsize=(8, 6),
top_term=10, no_plot=False, verbose=False)
......
......
由幫助文檔可知enricher
函數所需參數如下:
- gene_list, 所需查詢gene_list,可以是一個列表,也可為文件(一列,每行一個基因)
- gene_sets, gene set library。該參數,有兩種形式:
- 可以設置enricher自帶的gene set library 詳細列表可見https://maayanlab.cloud/Enrichr/#libraries。可單個'KEGG_2016',或多個['KEGG_2016','KEGG_2013']
- 一種自定義gene set library。可以是gmt文件,或者輸入一個字典
gene_sets={'term_A':['gene1', 'gene2',...],
'term_B':['gene2', 'gene4',...], ...}
- organism,支持(human, mouse, yeast, fly, fish, worm), 自定義gene_set 則無影響。
- description,工作運行描述
- outdir; 輸出目錄
- background: 背景基因
- 可以是一個背景基因列表
- 或者一個背景基因數目
- 又或者Biomart dataset name.
- cutoff; pvalue閾值
- format, 輸出圖片格式('pdf','png','eps'...)
- figsize, 圖片大小, (width,height). Default: (6.5,6).
- no_plot:是否不做圖
繪圖
gseapy 也提供了繪圖函數進行繪制
# simple plotting function
from gseapy.plot import barplot, dotplot
# to save your figure, make sure that ``ofname`` is not None
barplot(enr.res2d, title='KEGG_2013',)
enr.res2d
存儲着最近一次查詢富集的結果, 上面的例子中, enr.res2d
儲存的是'KEGG_2013']富集結果,因為它是list最后一個.
gene_sets=['KEGG_2016','KEGG_2013']
enr.results
有着所有的富集結果,所以我么也可以挑選數據可視化
barplot(enr.results.loc[enr.results["Gene_set"] == "KEGG_2016",], title='KEGG_2016',)
氣泡圖也是有的;
GSEA
Prerank
Prerank
用於已經排好序的數據來做GSEA。如,根據logFC 從大到小排好序后,去做GSEA。
# gsea_data.gsea_data.rnk 是已經排好序的數據
rnk = pd.read_csv("./gsea_data.gsea_data.rnk", header=None, sep="\t")
rnk.head()
0 | 1 |
---|---|
CTLA2B | 2.502482 |
SCARA3 | 2.095578 |
LOC100044683 | 1.116398 |
pre_res = gp.prerank(rnk=rnk, gene_sets='KEGG_2016',
processes=4,
outdir='test/prerank_report_kegg', format='png', seed=6)
pre_res.res2d.head()
繪圖
from gseapy.plot import gseaplot
terms = pre_res.res2d.index
# to save your figure, make sure that ofname is not None
gseaplot(rank_metric=pre_res.ranking, term=terms[0], **pre_res.results[terms[0]])
未完待續...