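How find_circ identifies circRNA
Basic principle: starting from the Bowtie2 alignment, find_circ takes the reads that did not align to the reference, extracts a 20 nt anchor from each end of every such read, and aligns each anchor pair to the reference again. If the 5' anchor aligns to the reference (start and end positions A3 and A4), the 3' anchor aligns upstream of that site (start and end positions A1 and A2), and a GT-AG splice site is present between A2 and A3 in the reference, the read is kept as a candidate circRNA read. Finally, candidates with a read count of at least 2 are reported as identified circRNAs.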
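The anchor logic can be illustrated with a short Python 3 sketch. It is a simplification, not find_circ's own code: the function names `split_anchors` and `is_backsplice_order` are hypothetical, the requirement that a read hold two full anchors is an assumption, and the ordering test only covers the + strand.

```python
ANCHOR_LEN = 20  # anchor size described above


def split_anchors(read_seq, anchor_len=ANCHOR_LEN):
    """Return the (5' anchor, 3' anchor) of a read, or None if the read is too short.

    Assumption of this sketch: a read must hold two full, non-overlapping anchors.
    """
    if len(read_seq) < 2 * anchor_len:
        return None
    return read_seq[:anchor_len], read_seq[-anchor_len:]


def is_backsplice_order(five_prime_start, three_prime_start):
    """Check the anchor order for a + strand read.

    A linear splice keeps the 5' anchor upstream of the 3' anchor.
    A back-splice reverses the order: the 3' anchor maps upstream.
    """
    return three_prime_start < five_prime_start


# Made-up read: 20 nt 5' anchor, 10 nt middle, 20 nt 3' anchor
read = "A" * 20 + "C" * 10 + "G" * 20
print(split_anchors(read))

# Made-up coordinates: the 3' anchor maps upstream, so the order is circular
print(is_backsplice_order(five_prime_start=5000, three_prime_start=1000))
```

Reads that pass this kind of order test still need the GT-AG check and the read-count threshold described above before they count as circRNA evidence.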
1. Installing find_circ
#find_circ runs on a 64-bit system with Python 2.7 and needs the Python modules numpy and pysam. It relies on bowtie2 and samtools for the genome mapping steps.
wget https://github.com/marvin-jens/find_circ/archive/v1.2.tar.gz
tar -xzvf v1.2.tar.gz
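2. Downloading the reference genome
#Download the latest UCSC reference genome with fetch_ucsc.py
#Usage pattern: pick one value from each slash-separated group; "out" is the output file
fetch_ucsc.py hg19/hg38/mm9/mm10 ref/kg/ens/fa out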
3. Building the bowtie2 index of the reference genome
bowtie2-build hg38.fa hg38
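4. Aligning the RNA-Seq reads to the genome (paired-end mode)
###bowtie2 parameters###
#-p number of threads
#--very-sensitive preset that favors sensitivity over speed; bowtie2 still reports only the best alignment it finds for each read
#--score-min=C,-15,0 function for the minimum alignment score; here a constant threshold of -15
#--mm use memory-mapped I/O to load the index, so several bowtie2 processes can share it
###samtools view parameters###
#-h include the header in the output
#-b output BAM
#-u output uncompressed BAM
#-S input is SAM (only needed by older samtools versions; newer ones detect the format automatically)
bowtie2 -p 16 --very-sensitive --score-min=C,-15,0 --mm -x /path/to/bowtie2_index -q -1 reads1.fq -2 reads2.fq | samtools view -hbuS - | samtools sort - -o output.bam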
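To make `--score-min=C,-15,0` concrete, the sketch below evaluates bowtie2's `F,B,A` function notation (function type, constant term, coefficient) for a given read length. The helper `min_score` is hypothetical and only mirrors the notation described in the bowtie2 manual; it does not call bowtie2.

```python
import math


def min_score(spec, read_len):
    """Evaluate a bowtie2-style function spec "F,B,A" at a given read length.

    F is the function type, B the constant term, A the coefficient.
    """
    kind, const, coef = spec.split(",")
    const, coef = float(const), float(coef)
    if kind == "C":   # constant: the read length plays no role
        return const
    if kind == "L":   # linear
        return const + coef * read_len
    if kind == "S":   # square root
        return const + coef * math.sqrt(read_len)
    if kind == "G":   # natural log
        return const + coef * math.log(read_len)
    raise ValueError("unknown function type: " + kind)


# Setting used in this pipeline: the threshold is -15 for every read length
print(min_score("C,-15,0", 100))

# bowtie2's end-to-end default, shown for comparison: it scales with read length
print(min_score("L,-0.6,-0.6", 100))
```

With a constant threshold, a 100 nt read and a 20 nt anchor must both score at least -15, which is much stricter for long reads than the length-dependent default.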
5. Extracting the unmapped reads and generating anchors
samtools view -hf 4 output.bam | samtools view -Sb - > unmapped.bam
/path/to/unmapped2anchors.py unmapped.bam | gzip > anchors.fq.gz
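The `-f 4` filter keeps records whose SAM flag has bit 0x4 (segment unmapped) set. A minimal Python 3 sketch of that bit test follows; the helper name `is_unmapped` and the example flag values are chosen purely for illustration.

```python
UNMAPPED = 0x4  # SAM flag bit meaning "segment unmapped"


def is_unmapped(flag):
    """Return True if the unmapped bit is set, which is what `samtools view -f 4` selects."""
    return bool(flag & UNMAPPED)


# Example flags of the kind seen in paired-end data
for flag in (77, 99, 141, 147):
    print(flag, is_unmapped(flag))
```

The test looks only at the read's own flag, so in paired-end data a read is kept whenever it is unmapped itself, whatever happened to its mate.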
6. Finding candidate circRNAs from the anchor alignments
###find_circ.py parameters###
bowtie2 -p 16 --reorder --mm --score-min=C,-15,0 -q -x /path/to/bowtie2_index -U anchors.fq.gz | /path/to/find_circ.py --genome=/path/to/hg38.fa --prefix=hsa_ --name=my_test_sample --stats=<run folder>/stats.txt --reads=<run folder>/splice_reads.fa > <run folder>/spliced_sites.bed
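###Filter the results with the following rules
1. Select circular RNAs with the keyword CIRCULAR
2. Remove circular RNAs on the mitochondrial chromosome
3. Keep circular RNAs with at least 2 unique junction reads
4. Remove circular RNAs with an ambiguous breakpoint
5. Remove circRNAs longer than 100 kb. The 100 kb refers to genomic length, computed simply as the end coordinate minus the start coordinate
grep CIRCULAR spliced_sites.bed | grep -v chrM | gawk '$5>=2' | grep UNAMBIGUOUS_BP | grep ANCHOR_UNIQUE | /path/to/maxlength.py 100000 > find_circ.candidates.bed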
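The same filter chain can be written as a single Python 3 function, which makes each rule explicit. This is a sketch under stated assumptions, not a replacement for maxlength.py: `keep_candidate` and `make_line` are hypothetical names, the example lines are invented, the keyword check looks only at column 18 while `grep` matches anywhere on the line, and treating the 100 kb limit as inclusive is a guess.

```python
def keep_candidate(line, max_len=100000, min_reads=2):
    """Apply the filtering rules above to one line of spliced_sites.bed."""
    if line.startswith("#"):          # skip header/comment lines
        return False
    f = line.rstrip("\n").split("\t")
    chrom, start, end = f[0], int(f[1]), int(f[2])
    n_reads, category = int(f[4]), f[17]
    return (
        "CIRCULAR" in category            # rule 1
        and chrom != "chrM"               # rule 2
        and n_reads >= min_reads          # rule 3 (column 5, as in gawk '$5>=2')
        and "UNAMBIGUOUS_BP" in category  # rule 4
        and "ANCHOR_UNIQUE" in category   # extra keyword used in the command
        and end - start <= max_len        # rule 5 (inclusive limit is an assumption)
    )


def make_line(chrom, start, end, n_reads, category):
    """Build an invented 18-column line for demonstration only."""
    fields = [chrom, str(start), str(end), "hsa_circ_000001", str(n_reads), "+"]
    fields += ["."] * 11                  # columns 7-17, unused in this sketch
    fields.append(category)
    return "\t".join(fields)


keywords = "CIRCULAR,UNAMBIGUOUS_BP,ANCHOR_UNIQUE"  # invented keyword string
examples = [
    make_line("chr1", 1000, 2500, 3, keywords),
    make_line("chrM", 1000, 2500, 3, keywords),
    make_line("chr1", 1000, 2500, 1, keywords),
    make_line("chr1", 0, 200000, 3, keywords),
]
for line in examples:
    print(line.split("\t")[0], keep_candidate(line))
```

Checking named columns instead of grepping the whole line avoids accidental matches, for example a junction name that happens to contain a keyword.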
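7. Analyzing multiple samples
#With multiple samples, run find_circ.py on each sample separately and then merge the individual results
/path/to/merge_bed.py sample1.bed sample2.bed [...] >combined.bed
#The first six columns of the output file spliced_sites.bed follow the standard BED format; the remaining 12 columns hold information about the junction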
column | name | description |
---|---|---|
1 | chrom | chromosome/contig name |
2 | start | left splice site (zero-based) |
3 | end | right splice site (zero-based). Always end > start; which site is 5' and which is 3' depends on the strand |
4 | name | (provisional) running number/name assigned to junction |
5 | n_reads | number of reads supporting the junction (BED 'score') |
6 | strand | genomic strand (+ or -) |
7 | n_uniq | number of distinct read sequences supporting the junction |
8 | uniq_bridges | number of reads with both anchors aligning uniquely |
9 | best_qual_left | alignment score margin of the best anchor alignment supporting the left splice junction (max=2 * anchor_length ) |
10 | best_qual_right | same for the right splice site |
11 | tissues | comma-separated, alphabetically sorted list of tissues/samples with this junction |
12 | tiss_counts | comma-separated list of corresponding read-counts |
13 | edits | number of mismatches in the anchor extension process |
14 | anchor_overlap | number of nucleotides the breakpoint resides within one anchor |
15 | breakpoints | number of alternative ways to break the read with flanking GT/AG |
16 | signal | flanking dinucleotide splice signal (normally GT/AG) |
17 | strandmatch | 'MATCH', 'MISMATCH' or 'NA' for non-stranded analysis |
18 | category | list of keywords describing the junction. Useful for quick grep filtering |
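For downstream analysis it can help to read each line into a record keyed by the column names in the table. The Python 3 sketch below assumes tab-separated lines with all 18 columns present; `parse_junction` is a hypothetical helper and the example line is invented.

```python
COLUMNS = [
    "chrom", "start", "end", "name", "n_reads", "strand",
    "n_uniq", "uniq_bridges", "best_qual_left", "best_qual_right",
    "tissues", "tiss_counts", "edits", "anchor_overlap",
    "breakpoints", "signal", "strandmatch", "category",
]
INT_COLUMNS = {"start", "end", "n_reads", "n_uniq"}  # only these are converted here


def parse_junction(line):
    """Turn one tab-separated line of spliced_sites.bed into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(values)))
    record = dict(zip(COLUMNS, values))
    for key in INT_COLUMNS:
        record[key] = int(record[key])
    return record


# Invented example line; real values come from find_circ.py
example = "\t".join([
    "chr1", "1000", "2500", "hsa_circ_000001", "3", "+",
    "2", "2", "40", "40", "my_test_sample", "3",
    "0", "1", "1", "GTAG", "NA", "CIRCULAR",
])
rec = parse_junction(example)
print(rec["name"], rec["n_reads"], rec["end"] - rec["start"])
```

The difference `end - start` is the genomic length used by the 100 kb length filter.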