[GSEAPY] 在Python里進行基因集富集分析

本文轉載自查看原文 2021-12-27 18:21 2524 生信/ 富集分析

前言

在生物信息學數據分析中，許多分析軟件都是基於R開發的。這里介紹一個可以在Python 中進行基因富集分析的Python 軟件 GSEAPY (Gene Set Enrichment Analysis in Python)

GSEApy is a python wrapper for GESA and Enrichr.
It’s used for convenient GO enrichments and produce publication-quality figures from python.

GSEAPY

安裝

可以通過conda 或 pip 進行安裝

# if you have conda
$ conda install -c conda-forge -c bioconda gseapy

# or use pip to install the latest release
$ pip install gseapy

pip 安裝要是遇到這樣的報錯

    data = self.read(amt=amt, decode_content=decode_content)
  File "/opt/conda/lib/python3.9/site-packages/pip/_vendor/urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/opt/conda/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/opt/conda/lib/python3.9/site-packages/pip/_vendor/urllib3/response.py", line 443, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

可以使用清華鏡像,進行安裝：

$ pip install gseapy -i https://pypi.tuna.tsinghua.edu.cn/simple

富集分析

背景信息

gene set, 指一組具有相同特征的基因。如一個GO term 對應的多個基因，一個kegg pathway對應的多個基因
gene set library，多個相關的gene set 。如所有GO term組成一個gene set library.
enrichment analysis， gene set library 作為注釋基因集合，已知的先驗知識。對於一個輸入基因集合，富集分析通過計算分析哪些注釋gene set 顯著存在於輸入基因集合中。例如：GO 富集分析中，查看哪些GO terms 顯著存在於輸入基因列表中。

有多種基因集富集分析策略，我們常說的GO/KEGG 富集分析應該大多數指over represent analysis(ORA)。還有個常用富集方法叫GSEA(Gene Set Enrichment Analysis), 翻譯過來也是基因集富集分析。下文GSEA,特指這種策略。

ORA

測試數據，可以從GSEApy/tests/data下載。
富集的函數是enricher.

先展示一下，富集的代碼：

gene_list="./gene_list.txt"
gene_sets='KEGG_2016'
gene_sets=['KEGG_2016','KEGG_2013']

enr = gp.enrichr(gene_list=gene_list,
                 gene_sets=gene_sets,
                 organism='Human', # don't forget to set organism to the one you desired! e.g. Yeast
                 description='kegg',
                 outdir='test/enrichr',
                 # no_plot=True,
                 cutoff=0.5 # test dataset, use lower value from range(0,1)
                )

運行完后，'test/enrichr'目錄下存放着會有富集的圖片以及文本。

(base) jovyan@95c3096ad9ae:~$ ll test/enrichr
-rw-r--r-- 1 jovyan users  22003 Dec 26 14:59 KEGG_2013.Human.enrichr.reports.pdf
-rw-r--r-- 1 jovyan users  22130 Dec 26 14:59 KEGG_2013.Human.enrichr.reports.txt
-rw-r--r-- 1 jovyan users  25722 Dec 26 14:59 KEGG_2016.Human.enrichr.reports.pdf
-rw-r--r-- 1 jovyan users  48458 Dec 26 14:59 KEGG_2016.Human.enrichr.reports.txt

查看KEGG_2016.Human.enrichr.reports.pdf，圖片只顯示了前10個，這是由參數top_term=10，所決定的

同時富集也結果保存在enr.results里，如查看前五個數據

enr.results.head(5)

輸出

Gene_set	Term	Overlap	P-value	Adjusted P-value	Old P-value	Old Adjusted P-value	Odds Ratio	Combined Score	Genes
0	KEGG_2016	Osteoclast differentiation Homo sapiens hsa04380	28/132	3.104504e-13	7.885440e-11	0	0	6.659625	191.802220	LILRA6;ITGB3;LILRA2;LILRA5;PPP3R1;FCGR3B;SIRPA...
1	KEGG_2016	Tuberculosis Homo sapiens hsa05152	31/178	4.288559e-12	5.446470e-10	0	0	5.224941	136.763196	RAB5B;ITGB2;PPP3R1;HLA-DMA;FCGR3B;HLA-DMB;CASP...
2	KEGG_2016	Phagosome Homo sapiens hsa04145	28/154	1.614009e-11	1.366528e-09	0	0	5.490501	136.437381	ATP6V1A;RAB5B;ITGB5;ITGB3;ITGB2;HLA-DMA;FCGR3B...
3	KEGG_2016	Rheumatoid arthritis Homo sapiens hsa05323	19/90	2.197884e-09	1.395656e-07	0	0	6.554453	130.668081	ATP6V1A;ATP6V1G1;ATP6V0B;TGFB1;ITGB2;FOS;ITGAL...
4	KEGG_2016	Leishmaniasis Homo sapiens hsa05140	17/73	3.132614e-09	1.591368e-07	0	0	7.422186	145.336773	TGFB1;IFNGR1;PRKCB;IFNGR2;ITGB2;FOS;MAPK14;HLA...

查看enricher函數幫助文檔

help(gp.enrichr)

Help on function enrichr in module gseapy.enrichr:

enrichr(gene_list, gene_sets, organism='human', 
            description='', outdir='Enrichr', 
            background='hsapiens_gene_ensembl', 
            cutoff=0.05, format='pdf', figsize=(8, 6), 
            top_term=10, no_plot=False, verbose=False)
......
......

由幫助文檔可知enricher函數所需參數如下：

gene_list, 所需查詢gene_list，可以是一個列表，也可為文件（一列，每行一個基因）
gene_sets， gene set library。該參數，有兩種形式：
- 可以設置enricher自帶的gene set library 詳細列表可見https://maayanlab.cloud/Enrichr/#libraries。可單個'KEGG_2016'，或多個['KEGG_2016','KEGG_2013']
- 一種自定義gene set library。可以是gmt文件，或者輸入一個字典

gene_sets={'term_A':['gene1', 'gene2',...], 
           'term_B':['gene2', 'gene4',...], ...}

organism，支持(human, mouse, yeast, fly, fish, worm)，自定義gene_set 則無影響。
description，工作運行描述
outdir；輸出目錄
background：背景基因
- 可以是一個背景基因列表
- 或者一個背景基因數目
- 又或者Biomart dataset name.
cutoff； pvalue閾值
format，輸出圖片格式('pdf','png','eps'...)
figsize，圖片大小， (width,height). Default: (6.5,6).
no_plot：是否不做圖

繪圖

gseapy 也提供了繪圖函數進行繪制

# simple plotting function
from gseapy.plot import barplot, dotplot

# to save your figure, make sure that ``ofname`` is not None
barplot(enr.res2d, title='KEGG_2013',)

enr.res2d 存儲着最近一次查詢富集的結果，上面的例子中, enr.res2d儲存的是'KEGG_2013']富集結果，因為它是list最后一個.

gene_sets=['KEGG_2016','KEGG_2013']

enr.results有着所有的富集結果，所以我么也可以挑選數據可視化

barplot(enr.results.loc[enr.results["Gene_set"] == "KEGG_2016",], title='KEGG_2016',)

氣泡圖也是有的；

GSEA

Prerank

Prerank 用於已經排好序的數據來做GSEA。如，根據logFC 從大到小排好序后，去做GSEA。

# gsea_data.gsea_data.rnk 是已經排好序的數據
rnk = pd.read_csv("./gsea_data.gsea_data.rnk", header=None, sep="\t")
rnk.head()

0	1
CTLA2B	2.502482
SCARA3	2.095578
LOC100044683	1.116398

pre_res = gp.prerank(rnk=rnk, gene_sets='KEGG_2016',
                     processes=4,
                     outdir='test/prerank_report_kegg', format='png', seed=6)

pre_res.res2d.head()

繪圖

from gseapy.plot import gseaplot

terms = pre_res.res2d.index
# to save your figure, make sure that ofname is not None
gseaplot(rank_metric=pre_res.ranking, term=terms[0], **pre_res.results[terms[0]])

未完待續...

參考

https://gseapy.readthedocs.io/en/latest/introduction.html

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 GSEA-基因集富集分析 GSEA - Gene set enrichment analysis 基因集富集 | ORA - Over-Representation Analysis 分析原理與應用基因探針富集分析（GSEA）& GO & pathway 基因或蛋白富集分析，gsea與ssgsea python scipy包進行GO富集分析p值計算利用GSEA對基因表達數據做富集分析 HOMER | MEME | 轉錄因子的靶基因預測 | motif富集分析 R: 修改鏡像、bioconductor安裝及go基因富集分析網頁工具KOBAS進行KEGG富集分析使用clusterProfiler進行GO富集分析