miRNA分析--比對(二)


 miRNA分析--數據過濾(一)

 

在比對之前為了減少比對時間,將每一個樣本中的reads進行合並,得到fasta格式,其命名規則如下:

樣本_r數子_x數字

r 中的數字表示reads序號;

x 中的數字表示該條reads重復次數

 

比對分為兩條策略

1、根據本物種已有的miRNA序列進行比對,

已知當miRNA序列從 miRBase 或者 sRNAanno得到

 

(應該將clean reads比對到所研究物種到tRNA, rRNA, snoRNA,mRNA等數據,允許一個錯配,將比對上等reads過濾,也可以比對到參考基因組,將為未比對到到reads過濾掉,但是本次我沒有這么做)

對於第一種情況,我采用bowtie將reads比對到成熟miRNA

1 ##建立索引
2 bowtie-build ref.fa
3 ##比對
4 bowtie -v 0 -m 30 -p 10 -f  ref.fa sample.fa sample.bwt
5 參數解釋
6 -v: 允許0個錯配
7 -p: 10個線程 
8 -m: 當比對超過這個數時,認為時未比對
9 -f: 輸入序列fasta

根據.bwt 文件可以計算出每個已知當miRNA中比對上當reads數量,別忘記乘以 x后面的數

2、直接比對到參考基因組並進行新miRNA鑒定

采用miR-PREFeR進行Novel miRNA鑒定

其githup主頁:https://github.com/hangelwen/miR-PREFeR

  • 需安裝ViennaRNA ( 最好是1.8.5或2.1.2, 2.1.5版本)

1 #我裝的是最新版
2 wget https://www.tbi.univie.ac.at/RNA/download/sourcecode/2_4_x/ViennaRNA-2.4.10.tar.gz 
3 tar zvxf ViennaRNA-2.4.10.tar.gz
4 cd ViennaRNA-2.4.10
5 ./configure --prefix="/user/tools/ViennaRNA/" --without-perl
6 make
7 make install
  • 安裝

 1 git clone https://github.com/hangelwen/miR-PREFeR.git 

  • 數據准備

1⃣️ref.fa

2⃣️miRNA 比對到ref.fa的sam文件 (sam 文件中的reads 必須是collapse reads )

3⃣️gff 文件(可選,記錄需要屏蔽掉的信息,比如重復序列等)

bowtie 比對

 1 bowtie -a -v 0 -m 30 -p 10 -f ref.fa sample.fa -S sample.sam 

准備configure 文件

 1 #example configuration file for the miR-PREFeR pipeline.
 2 #lines start with '#' are comments
 3 
 4 #miR-PREFeR path, please change to your path to the script folder. 
 5 #Absolute path perfered.
 6 PIPELINE_PATH = /miR
 7 
 8 #Genomic sequence file in fasta format.  Absolute path perfered. If a path
 9 #relative if used, it's relatvie to the working directory where you execute
10 #the pipeline.
11 FASTA_FILE =  genome_v1.fa
12 
13 #Small RNA read alignment file in SAM format. The SAM file should contain 
14 #the SAM header. If N samples are used, then N file names are listed here, 
15 #separated by comma. please note that before doing alignment, process the 
16 #reads fasta files using the provided script 'process-reads-fasta.py' to
17 #collapse and rename the reads. Absolute path perfered. If a path
18 #relative if used, it's relatvie to the working directory where you execute
19 #the pipeline.
20 ALIGNMENT_FILE = ./trm_XX-1_L1_I309.R1.fastq_trm_fa.fa.sam, ./trm_XX-2_L1_I310.R1.fastq_trm_fa.fa.sam, ./trm_XX-3_L1_I311.R1.fastq_trm_fa.fa.sam, ./trm_XY-1_L1_I312.R1.fastq_trm_fa.fa.sam, ./trm_XY-2_L1_I313.R1.fastq_trm_fa.fa.sam, ./trm_XY-3_L1_I314.R1.fastq_trm_fa.fa.sam, ./trm_YY-1_L1_I315.R1.fastq_trm_fa.fa.sam, ./trm_YY-2_L1_I316.R1.fastq_trm_fa.fa.sam, ./trm_YY-3_L1_I332.R1.fastq_trm_fa.fa.sam
21 
22 #GFF file which list all existing annotations on genomic sequences FASTA_FILE.
23 #If no GFF file is availble, comment this line out or leave the value blank.
24 #Absolute path perfered. If a path relative if used, it's relatvie to the
25 #working directory where you execute the pipeline.
26 #CAUTION: please only list the CDS regions, not the entire miRNA region, because 
27 #miRNAs could be in introns. This option is mutual exclusive with 'GFF_FILE_INCLUDE'
28 #option.
29 # If you have a GFF file that contains regions in which you want to predict whehter 
30 # they include miRNAs, please use the 'GFF_FILE_INCLUDE' option instead.
31 GFF_FILE_EXCLUDE = CDS.gff
32 
33 # Only predict miRNAs from the regions given in the GFF file. This option is mutual 
34 # exclusive with 'GFF_FILE_EXCLUDE'. Thus, only one of them can be used.
35 #GFF_FILE_INCLUDE = ./TAIR10.chr1.candidate.gff
36 
37 #The max length of a miRNA precursor. The range is from 60 to 3000. The default
38 #is 300. 
39 PRECURSOR_LEN = 300
40 
41 #The first step of the pipeline is to identify candidate regions of the miRNA
42 #loci. If READS_DEPTH_CUTOFF = N, then bases that the mapped depth is smaller
43 #than N is not considered. The value should >=2.
44 READS_DEPTH_CUTOFF = 20
45 
46 #Number of processes for this computation. Using more processes speeds up the computation,
47 #especially if you have a multi-core processor. If you have N cores avalible for the 
48 #computation, it's better to set this value in the range of N to 2*N.
49 #If comment out or leave blank, 1 is used. 
50 NUM_OF_CORE = 4
51 
52 #Outputfolder. If not specified, use the current working directory. Please make sure that 
53 #you have enough disk space for the folder, otherwise the pipeline may fail.
54 OUTFOLDER = spinach-result
55 
56 #Absolute path of the folder that contains intermidate/temperary files during the
57 # run of the pipeline. If not specified, miR-PREFeR uses a folder with suffix "_tmp"
58 #under OUTFOLDER by default. Please make sure that you have enough disk space for the
59 # folder, otherwise the pipeline may fail.
60 #TMPFOLDER = /tmp/exmaple
61 TMPFOLDER = 
62 
63 #prefix for naming the output files. For portability, please DO NOT contain any 
64 #spaces and special characters. The prefered includes 'A-Z', 'a-z', '0-9', and 
65 #underscore '_'.
66 NAME_PREFIX = spinach-example
67 
68 #Maximum gap length between two contigs to form a candidate region. 
69 MAX_GAP = 100
70 
71 # Minimum and maximum length of the mature sequence. Default values are 18 and 24.
72 MIN_MATURE_LEN = 18
73 MAX_MATURE_LEN = 24
74 
75 # If this is 'Y', then the criteria that requries the star sequence must be expressed 
76 # is loosed if the expression pattern is good enough (.e.g. the majority of the reads 
77 # mapped to the region are mapped to the mature position.). There are lots of miRNAs 
78 # which do not have star sequence expression. The default value is Y.
79 ALLOW_NO_STAR_EXPRESSION = Y
80 
81 # In most cases, the mature star duplex has 2nt 3' overhangs. If this is set to 'Y', then
82 # 3nt overhangs are allowed. Default is 'N'.
83 ALLOW_3NT_OVERHANG = N
84 
85 #The pipeline makes a checkpoint after each major step. In addition, because the
86 #folding stage is the most time consuming stage, it makes a checkpiont for each
87 #folding process after folding every CHECKPOINT_SIZE sequences. If the pipeline
88 #is killed for some reason in the middle of folding, it can be restarted using
89 #'recover' command from where it was stopped. The default value is 3000. On my
90 #system this means making a checkpoint about every 5 minutes.
91 CHECKPOINT_SIZE = 3000

運行

 1 python miR_PREFeR.py -L -k pipeline configfile 

pipeline里包含prepare, candidate, fold, predict四步。如果某步中斷了,還可以續跑

 1 python miR_PREFeR.py -L recover configfile 

 

輸出結果

根據example_miRNA.detail.csv 文件 寫一個腳本 提取每個miRNA的reads 數量,進而做差異分析

 

差異分析

差異分析采用DESeq2, 可看我之前寫的miRAN 分析以及mRNA分析

 

 

 

關注下方公眾號可獲得更多精彩

 

ref

1、計算已知miRNA當表達量

2、省心省事的植物miRNA分析軟件miR-PREFeR,值得擁有

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM