可變剪切的預測已經很流行了,目前主要有兩個流派:
- 用DNA序列以及variant來預測可變剪切;GeneSplicer、MaxEntScan、dbscSNV、S-CAP、MMSplice、clinVar、spliceAI
- 用RNA來預測可變剪切;MISO、rMATS、DARTS
前言廢話
科研圈的熱點扎堆現象是永遠存在的,且一波接一波,大部分不屑於追熱點且不出成果也基本都被圈子給淘汰了。
做純方法開發的其實是很心累的,費時費力費腦,特別是自己的研究領域已經過時的時候,另外還得承受外行的歧視:“你們搞這個有什么用嗎?文章也發不了,最后也沒人用。”
最近這些年最大的一個熱點就是“單細胞”,很多人都趁着這股東風撈了一些文章,最早一批開發方法的也發了不少nature method和NBT,bioinformatics和NAR更多。但大部分后面就銷聲匿跡了,因為門檻越來越低,進入者越來越多,經過幾年的發展,現在已經成了三足鼎立之勢,強者愈強,弱者退場。
寫方法類的文章也有個潛規則,千萬不要寫得過於通俗易懂,大部分審稿人如果一眼就能看懂,就會自然認為你做的研究過於簡單,沒有發表的必要。最好要寫得有理有據,且90%的審稿人沒法一眼看懂,但細細琢磨后有那么點意思。哈哈,當笑話聽就好。
跳到另外一篇用深度學習來預測可變剪切的。
Deep-learning augmented RNA-seq analysis of transcript splicing
文章里面需要重點了解的基礎知識:
Unlike methods that use cis sequence features to predict exon splicing patterns in specific samples7–10,看看前人是如何根據cis sequence特征來預測exon的剪切模式的
涉及到的文獻:
The human splicing code reveals new insights into the genetic determinants of disease - 2015
Deciphering the splicing code - 2010
Deep learning of the tissue-regulated splicing code - 2014
BRIE: transcriptome-wide splicing quantification in single cells - 2017
哈哈,深度學習在可變剪切上的應用的風2014年就開始刮了,你不可能是第一個吃螃蟹的了。
想了解什么是AS,可以直接看現在開發的工具,里面肯定有圖詳細介紹,同時介紹其算法,一圖勝千言。
MISO (Mixture of Isoforms) software documentation 目前只支持python2版本,用conda的話還需要從文檔中copy一下miso_settings.txt文件。
生物和信息之間存在一個巨大的gap,優秀的人能很快察覺到這個gap,並填補這個gap。
問題:
為什么AS的鑒定依賴測序深度?得了解現在主流的AS檢測算法
怎么理解樣本間的差異可變剪切這個概念?
如何理解cis sequence features,這個文件里都包含了哪些信息?
怎么predict exon-inclusion/skipping levels in bulk tissues or single cells
怎么理解we hypothesized that large-scale RNA-seq resources can be used to construct a deep-learning model of differential alternative splicing.
兩部分:
a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features
a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing
During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN.
During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.
This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deeplearning-augmented splicing analysis.
差不多懂了,第一個BHT就是常規差異剪切分析工具(如MISO 和 rMATS)的升級版,用於制造有lable的訓練數據。BHT的結果用於訓練DNN模型;新的數據可以放進DNN模型里,得到的結果可以作為后面貝葉斯模型的先驗,我們的RNA-seq數據則是用於更新先驗形成后驗,如果先驗足夠准確,則更新時對數據的依賴不搞,這也就是為什么該方法可以彌補RNA-seq測序深度不足的情形。
To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison.
cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions
這個DNN把可變剪切轉換成了一個regression的問題,特征就是上面兩種,因為它們決定了最終的一個特征是否發生了可變剪切。
最終用到的特征:2,926 cis sequence features and 1,498 annotated RBPs
DNN用到的訓練數據具體是什么?
large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium
We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons
The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN
From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT(flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods). 這樣就把每個exon給的特征給提取出來了,lable也有了,就可以用於訓練了。
比較了三個模型:
We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.
關於貝葉斯模型的部分:
incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).
Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data
如果文章看得迷迷糊糊的,就直接跑代碼吧!
第一個功能BHT:
Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/
test_norep_data.txt文件是這樣的:
ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen 82439 ENSG00000169045.17_1 HNRNPH1 chr5 - 179046269 179046408 179045145 179045324 179047892 179048036 82439 15236 319 6774 834 180 90 21374 ENSG00000131876.16_3 SNRPA1 chr15 - 101826418 101826498 101825930 101826006 101827112 101827215 21374 4105 118 292 54 169 90 32815 ENSG00000141027.20_3 NCOR1 chr17 - 15990485 15990659 15989712 15989756 15995176 15995232 32815 624 564 549 1261 180 90 43143 ENSG00000133731.9_2 IMPA1 chr8 - 82597997 82598198 82593732 82593819 82598486 82598518 43143 155 332 22 341 180 90 111671 ENSG00000100320.22_3 RBFOX2 chr22 - 36232366 36232486 36205826 36206051 36236238 36236460 111671 93 193 35 534 180 90
每一行是一個基因,無冗余,然后就是一些屬性.
跑出來的結果是這樣的:
1 ID I1 S1 I2 S2 inc_len skp_len psi1 psi2 delta.mle post_pr 2 1225 160 0 169 6 180 90 1 0.934 -0.0663 0.4367 3 15829 52 58 12 41 180 90 0.31 0.128 -0.1819 0.8867 4 20347 1084 930 371 615 180 90 0.368 0.232 -0.1365 1 5 21374 4105 118 292 54 169 90 0.949 0.742 -0.2065 1 6 24817 177 275 263 741 143 90 0.288 0.183 -0.1057 0.974 7 32815 624 564 549 1261 180 90 0.356 0.179 -0.1774 1 8 43143 155 332 22 341 180 90 0.189 0.031 -0.158 1 9 46548 1685 4040 216 1752 180 90 0.173 0.058 -0.1145 1
每一行是對之前條目的預測。
第二個功能DNN:
下載model
Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS
預測
Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS
其中的第一個文件是Input feature file (*.h5) or Darts_BHT output (*.txt)
ID I1 S1 I2 S2 inc_len skp_len mu.mle delta.mle post_pr chr1:-:10002681:10002840:10002738:10002840:9996576:9996685 581 0 462 0 155 99 1 0 0 chr1:-:100176361:100176505:100176389:100176505:100174753:100174815 28 0 49 2 126 99 1 -0.0493827160493827 0.248 chr1:-:109556441:109556547:109556462:109556547:109553537:109554340 2 37 0 81 119 99 0.0430341230167355 -0.0430341230167355 0.188 chr1:-:11009680:11009871:11009758:11009871:11007699:11008901 11 2 49 4 176 99 0.755725190839695 0.117542135892979 0.329333333333333 chr1:-:11137386:11137500:11137421:11137500:11136898:11137005 80 750 64 738 133 99 0.0735580941766509 -0.0129207126090368 0
第二個文件是Kallisto expression files
thymus adipose RPS11 2678.83013 2531.887535 ERAL1 14.350975 13.709394 DDX27 18.2573 14.02368 DEK 32.463558 14.520312 PSMA6 102.332592 77.089475 TRIM56 4.519675 6.14762566667 TRIM71 0.082009 0.0153936666667 UPF2 7.150812 5.23628033333 FARS2 6.332831 7.291382 ALKBH8 3.056208 1.27043633333 ZNF579 5.13265 8.248575
結果文件,第一列是ID,第二列是真實的標簽,第三列是預測的標簽:
ID Y_true Y_pred chr22:-:39136893:39137055:39137011:39137055:39136271:39136437 1.000000 0.318161 chr12:-:69326921:69326979:69326949:69326979:69326457:69326620 1.000000 0.073966 chr3:-:49053236:49053305:49053251:49053305:49052920:49053140 0.947333 0.295664 chr4:-:68358468:68358715:68358586:68358715:68357897:68357993 1.000000 0.304907 chr11:-:124972532:124972705:124972629:124972705:124972027:124972213 0.937333 0.365548 chr15:+:43695880:43696040:43695880:43695997:43696610:43696750 1.000000 0.450762
參考:
The Expanding Landscape of Alternative Splicing Variation in Human Populations.
這篇是比較純粹的DL應用:
Gene expression inference with deep learning | 基於深度學習的基因表達推測
案例文章:Gene expression inference with deep learning
uci-cbcl/D-GEX - github
深度學習的風已經過了幾年了,目前在醫療影像處理上已經公認非常有效,所以后面想發文章必須數據足夠大足夠靚,方法上想創新太難。
LINCS L1000 data
核心的意思就是這個項目只測了不到一千個基因的表達,卻要通過LR和DL來推測出其他全部的3萬個基因的表達,所以稱那978個基因叫landmark genes。