Deep-learning augmented RNA-seq analysis of transcript splicing | 用深度學習預測可變剪切

本文轉載自查看原文 2019-08-29 21:39 460 深度學習

可變剪切的預測已經很流行了，目前主要有兩個流派：

用DNA序列以及variant來預測可變剪切；GeneSplicer、MaxEntScan、dbscSNV、S-CAP、MMSplice、clinVar、spliceAI
用RNA來預測可變剪切；MISO、rMATS、DARTS

前言廢話

科研圈的熱點扎堆現象是永遠存在的，且一波接一波，大部分不屑於追熱點且不出成果也基本都被圈子給淘汰了。

做純方法開發的其實是很心累的，費時費力費腦，特別是自己的研究領域已經過時的時候，另外還得承受外行的歧視：“你們搞這個有什么用嗎？文章也發不了，最后也沒人用。”

最近這些年最大的一個熱點就是“單細胞”，很多人都趁着這股東風撈了一些文章，最早一批開發方法的也發了不少nature method和NBT，bioinformatics和NAR更多。但大部分后面就銷聲匿跡了，因為門檻越來越低，進入者越來越多，經過幾年的發展，現在已經成了三足鼎立之勢，強者愈強，弱者退場。

寫方法類的文章也有個潛規則，千萬不要寫得過於通俗易懂，大部分審稿人如果一眼就能看懂，就會自然認為你做的研究過於簡單，沒有發表的必要。最好要寫得有理有據，且90%的審稿人沒法一眼看懂，但細細琢磨后有那么點意思。哈哈，當笑話聽就好。

跳到另外一篇用深度學習來預測可變剪切的。

Deep-learning augmented RNA-seq analysis of transcript splicing

PDF

文章里面需要重點了解的基礎知識：

Unlike methods that use cis sequence features to predict exon splicing patterns in specific samples7–10，看看前人是如何根據cis sequence特征來預測exon的剪切模式的

涉及到的文獻：

The human splicing code reveals new insights into the genetic determinants of disease - 2015

Deciphering the splicing code - 2010

Deep learning of the tissue-regulated splicing code - 2014

BRIE: transcriptome-wide splicing quantification in single cells - 2017

哈哈，深度學習在可變剪切上的應用的風2014年就開始刮了，你不可能是第一個吃螃蟹的了。

想了解什么是AS，可以直接看現在開發的工具，里面肯定有圖詳細介紹，同時介紹其算法，一圖勝千言。

MISO (Mixture of Isoforms) software documentation 目前只支持python2版本，用conda的話還需要從文檔中copy一下miso_settings.txt文件。

rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

生物和信息之間存在一個巨大的gap，優秀的人能很快察覺到這個gap，並填補這個gap。

問題：

為什么AS的鑒定依賴測序深度？得了解現在主流的AS檢測算法

怎么理解樣本間的差異可變剪切這個概念？

如何理解cis sequence features，這個文件里都包含了哪些信息？

怎么predict exon-inclusion/skipping levels in bulk tissues or single cells

怎么理解we hypothesized that large-scale RNA-seq resources can be used to construct a deep-learning model of differential alternative splicing.

兩部分：

a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features

a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing

During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN.

During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.

This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deeplearning-augmented splicing analysis.

差不多懂了，第一個BHT就是常規差異剪切分析工具（如MISO 和 rMATS）的升級版，用於制造有lable的訓練數據。BHT的結果用於訓練DNN模型；新的數據可以放進DNN模型里，得到的結果可以作為后面貝葉斯模型的先驗，我們的RNA-seq數據則是用於更新先驗形成后驗，如果先驗足夠准確，則更新時對數據的依賴不搞，這也就是為什么該方法可以彌補RNA-seq測序深度不足的情形。

To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison.

cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions

這個DNN把可變剪切轉換成了一個regression的問題，特征就是上面兩種，因為它們決定了最終的一個特征是否發生了可變剪切。

最終用到的特征：2,926 cis sequence features and 1,498 annotated RBPs

DNN用到的訓練數據具體是什么？

large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium

We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons

The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN

From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT(flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods). 這樣就把每個exon給的特征給提取出來了，lable也有了，就可以用於訓練了。

比較了三個模型：

We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.

關於貝葉斯模型的部分：

incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).

Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data

如果文章看得迷迷糊糊的，就直接跑代碼吧！

第一個功能BHT：

Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/

test_norep_data.txt文件是這樣的：

ID	GeneID	geneSymbol	chr	strand	exonStart_0base	exonEnd	upstreamES	upstreamEE	downstreamES	downstreamEE	ID	IJC_SAMPLE_1	SJC_SAMPLE_1	IJC_SAMPLE_2	SJC_SAMPLE_2	IncFormLen	SkipFormLen
82439	ENSG00000169045.17_1	HNRNPH1	chr5	-	179046269	179046408	179045145	179045324	179047892	179048036	82439	15236	319	6774	834	180	90
21374	ENSG00000131876.16_3	SNRPA1	chr15	-	101826418	101826498	101825930	101826006	101827112	101827215	21374	4105	118	292	54	169	90
32815	ENSG00000141027.20_3	NCOR1	chr17	-	15990485	15990659	15989712	15989756	15995176	15995232	32815	624	564	549	1261	180	90
43143	ENSG00000133731.9_2	IMPA1	chr8	-	82597997	82598198	82593732	82593819	82598486	82598518	43143	155	332	22	341	180	90
111671	ENSG00000100320.22_3	RBFOX2	chr22	-	36232366	36232486	36205826	36206051	36236238	36236460	111671	93	193	35	534	180	90

每一行是一個基因，無冗余，然后就是一些屬性.

跑出來的結果是這樣的：

1 ID      I1      S1      I2      S2      inc_len skp_len psi1    psi2    delta.mle       post_pr
      2 1225    160     0       169     6       180     90      1       0.934   -0.0663 0.4367
      3 15829   52      58      12      41      180     90      0.31    0.128   -0.1819 0.8867
      4 20347   1084    930     371     615     180     90      0.368   0.232   -0.1365 1
      5 21374   4105    118     292     54      169     90      0.949   0.742   -0.2065 1
      6 24817   177     275     263     741     143     90      0.288   0.183   -0.1057 0.974
      7 32815   624     564     549     1261    180     90      0.356   0.179   -0.1774 1
      8 43143   155     332     22      341     180     90      0.189   0.031   -0.158  1
      9 46548   1685    4040    216     1752    180     90      0.173   0.058   -0.1145 1

每一行是對之前條目的預測。　　

第二個功能DNN：

下載model

Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS

預測

Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS

其中的第一個文件是Input feature file (*.h5) or Darts_BHT output (*.txt)

ID      I1      S1      I2      S2      inc_len skp_len mu.mle  delta.mle       post_pr
chr1:-:10002681:10002840:10002738:10002840:9996576:9996685      581     0       462     0       155     99      1       0       0
chr1:-:100176361:100176505:100176389:100176505:100174753:100174815      28      0       49      2       126     99      1       -0.0493827160493827     0.248
chr1:-:109556441:109556547:109556462:109556547:109553537:109554340      2       37      0       81      119     99      0.0430341230167355
      -0.0430341230167355     0.188
chr1:-:11009680:11009871:11009758:11009871:11007699:11008901    11      2       49      4       176     99      0.755725190839695       0.117542135892979       0.329333333333333
chr1:-:11137386:11137500:11137421:11137500:11136898:11137005    80      750     64      738     133     99      0.0735580941766509      -0.0129207126090368     0

第二個文件是Kallisto expression files

thymus  adipose
RPS11   2678.83013      2531.887535
ERAL1   14.350975       13.709394
DDX27   18.2573 14.02368
DEK     32.463558       14.520312
PSMA6   102.332592      77.089475
TRIM56  4.519675        6.14762566667
TRIM71  0.082009        0.0153936666667
UPF2    7.150812        5.23628033333
FARS2   6.332831        7.291382
ALKBH8  3.056208        1.27043633333
ZNF579  5.13265 8.248575

結果文件，第一列是ID，第二列是真實的標簽，第三列是預測的標簽：

ID      Y_true  Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437   1.000000        0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620   1.000000        0.073966
chr3:-:49053236:49053305:49053251:49053305:49052920:49053140    0.947333        0.295664
chr4:-:68358468:68358715:68358586:68358715:68357897:68357993    1.000000        0.304907
chr11:-:124972532:124972705:124972629:124972705:124972027:124972213     0.937333        0.365548
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750   1.000000        0.450762

參考：

The Expanding Landscape of Alternative Splicing Variation in Human Populations.　

這篇是比較純粹的DL應用：

Gene expression inference with deep learning | 基於深度學習的基因表達推測

案例文章：Gene expression inference with deep learning

uci-cbcl/D-GEX - github

深度學習的風已經過了幾年了，目前在醫療影像處理上已經公認非常有效，所以后面想發文章必須數據足夠大足夠靚，方法上想創新太難。

LINCS L1000 data

核心的意思就是這個項目只測了不到一千個基因的表達，卻要通過LR和DL來推測出其他全部的3萬個基因的表達，所以稱那978個基因叫landmark genes。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Salmon---a tool for wicked-fast transcript quantification from RNA-seq data Current best practices in single-cell RNA-seq analysis: a tutorial RNA-seq(9):功能富集分析零基礎照樣做RNA-seq差異分析 32、Differential Gene Expression using RNA-Seq (Workflow) 10x Genomics RNA-seq 原理 Deep Learning（深度學習）學習筆記整理(二）【深度學習Deep Learning】資料大全 2018年最新的single-cell-RNA-seq analysis repositories Deep Learning（深度學習）學習筆記整理系列（一）——背景