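How find_circ identifies circRNAs

Basic principle of find_circ: starting from the Bowtie2 alignment, find_circ takes the reads that did not align to the reference and extracts a 20 nt anchor sequence from each end of every such read. Each pair of anchors is then aligned to the reference again. If the 5' anchor aligns to the reference (start and end positions denoted A3 and A4), the 3' anchor aligns upstream of that site (start and end positions denoted A1 and A2), and a GT-AG splice site is present in the reference between A2 and A3, the read is kept as a candidate circRNA read. Finally, candidates with a read count of at least 2 are reported as identified circRNAs.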
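The anchor logic described above can be sketched in a few lines of Python. This is only an illustration, not the code of find_circ itself: the function names `split_anchors` and `classify_anchor_pair` are hypothetical, and the sketch assumes that both anchors align to the plus strand of the same chromosome (on the minus strand the order is mirrored).

```python
ANCHOR_LEN = 20  # anchor length stated in the text


def split_anchors(read_seq, anchor_len=ANCHOR_LEN):
    """Return the (5' anchor, 3' anchor) taken from the two ends of a read."""
    if len(read_seq) < 2 * anchor_len:
        raise ValueError("read too short to yield two non-overlapping anchors")
    return read_seq[:anchor_len], read_seq[-anchor_len:]


def classify_anchor_pair(pos_5p, pos_3p, anchor_len=ANCHOR_LEN):
    """Classify a pair of 0-based plus-strand anchor start positions.

    Assumption of this sketch: both anchors hit the same chromosome and strand.
    """
    if pos_3p + anchor_len <= pos_5p:
        # 3' anchor lies entirely upstream of the 5' anchor: reversed order,
        # the signature of a back-splice (circular) junction.
        return "circular"
    if pos_5p + anchor_len <= pos_3p:
        # Anchors keep their read order: ordinary linear splicing.
        return "linear"
    return "overlapping"


read = "ACGT" * 15                      # 60 nt toy read
five_prime, three_prime = split_anchors(read)
print(len(five_prime), len(three_prime))
print(classify_anchor_pair(pos_5p=270, pos_3p=110))
```

In the real tool the anchor alignments are then extended toward each other to locate the exact breakpoint and to check the flanking GT/AG signal; that step is left out here.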

1. Installing find_circ

# find_circ must run on a 64-bit system with Python 2.7, and it needs the Python modules numpy and pysam. It relies on bowtie2 and samtools for the genome mapping step.

wget https://github.com/marvin-jens/find_circ/archive/v1.2.tar.gz
tar -xzvf v1.2.tar.gz

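2. Downloading the reference genome

# Download the latest UCSC version of the reference genome with fetch_ucsc.py (pick one option from each slash-separated group; out is the output file)

fetch_ucsc.py hg19/hg38/mm9/mm10 ref/kg/ens/fa out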

3. Building the bowtie2 index for the reference genome

bowtie2-build hg38.fa hg38

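4. Aligning the RNA-Seq reads to the genome (paired-end mode)

### bowtie2 parameters ###

# -p: number of threads

# --very-sensitive: preset that searches more thoroughly (slower, more sensitive); by default bowtie2 still reports only the best alignment it finds for each read

# --score-min=C,-15,0: function that sets the minimum alignment score; here a constant of -15

# --mm: use memory-mapped I/O for the index, so that several bowtie2 processes on the same machine can share it

### samtools view parameters ###

# -h: include the header lines in the output

# -b: output in BAM format

# -u: output uncompressed BAM

# -S: input is SAM; kept only for compatibility with older samtools versions, newer versions detect the input format automatically

bowtie2 -p 16 --very-sensitive --score-min=C,-15,0 --mm -x /path/to/bowtie2_index -q -1 reads1.fq -2 reads2.fq | samtools view -hbuS - | samtools sort -o output.bam -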
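To make the `--score-min=C,-15,0` setting concrete, here is a small sketch of how a bowtie2 function specifier of the form `F,B,A` (function type, constant term, coefficient) evaluates to a minimum score. The helper name `min_score` is hypothetical; it follows the function types described in the bowtie2 manual (C constant, L linear, S square root, G natural log).

```python
import math


def min_score(spec, read_len):
    """Evaluate a bowtie2-style function specifier 'F,B,A' for a read length."""
    kind, const, coeff = spec.split(",")
    transforms = {
        "C": lambda x: 0.0,   # constant: the read length is ignored
        "L": float,           # linear in the read length
        "S": math.sqrt,       # square root of the read length
        "G": math.log,        # natural log of the read length
    }
    if kind not in transforms:
        raise ValueError("unknown function type: %r" % kind)
    return float(const) + float(coeff) * transforms[kind](read_len)


# With C,-15,0 the threshold is the same for every read length.
for length in (50, 100, 150):
    print(length, min_score("C,-15,0", length))
```

For comparison, bowtie2's default in end-to-end mode is `L,-0.6,-0.6`, which becomes more permissive as reads get longer; the constant threshold used here keeps the cutoff fixed regardless of read length.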

5. Extracting the unaligned reads and taking a 20 nt anchor from each end

samtools view -hf 4 output.bam | samtools view -Sb - > unmapped.bam
/path/to/unmapped2anchors.py unmapped.bam | gzip > anchors.fq.gz
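The `-f 4` filter in the first command keeps only records whose SAM FLAG has bit 0x4 ("segment unmapped") set. A minimal Python sketch of the same test on plain SAM text follows; the function names are hypothetical and the two records are made-up toy data.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(flag):
    """True if the unmapped bit is set, which is what `samtools view -f 4` requires."""
    return bool(flag & UNMAPPED)


def unmapped_records(sam_lines):
    """Yield (read name, sequence) for every unmapped record in SAM text."""
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if is_unmapped(int(fields[1])):  # column 2 is FLAG
            yield fields[0], fields[9]   # column 1 is QNAME, column 10 is SEQ


toy_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t99\tchr1\t100\t42\t4M\t=\t200\t104\tACGT\tIIII",
]
for name, seq in unmapped_records(toy_sam):
    print(name, seq)
```

Flag 77 (1 + 4 + 8 + 64) has the unmapped bit set, while flag 99 (1 + 2 + 32 + 64) does not, so only the first record passes.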

6. Finding potential circRNAs from the anchor alignments

### find_circ.py parameters ###

# --prefix sets the prefix of the names in column 4 of spliced_sites.bed; the three-letter abbreviation of the species is recommended. Note that spliced_sites.bed contains both circular and linear junctions: circular RNA names are marked with circ, linear ones with norm. Here we use --prefix=hsa_

# --name sets the sample name that is written to the tissues column of spliced_sites.bed

# --reads writes the spliced reads to a FASTA file

# --stats writes summary statistics to a text file

bowtie2 -p 16 --reorder --mm --score-min=C,-15,0 -q -x /path/to/bowtie2_index -U anchors.fq.gz | /path/to/find_circ.py --genome=/path/to/hg38.fa --prefix=hsa_ --name=my_test_sample --stats=<run folder>/stats.txt --reads=<run folder>/splice_reads.fa > <run folder>/spliced_sites.bed

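### Filter the results by the following rules

1. Select circular RNAs with the keyword CIRCULAR

2. Remove circRNAs on the mitochondrial chromosome (chrM)

3. Keep circRNAs supported by at least 2 junction reads (the command tests column 5, n_reads)

4. Remove circRNAs whose breakpoint is ambiguous (keep UNAMBIGUOUS_BP) or whose anchors do not align uniquely (keep ANCHOR_UNIQUE)

5. Discard circRNAs longer than 100 kb; the length here is the genomic span, i.e. the end coordinate minus the start coordinate

grep CIRCULAR spliced_sites.bed | grep -v chrM | gawk '$5>=2' | grep UNAMBIGUOUS_BP | grep ANCHOR_UNIQUE | /path/to/maxlength.py 100000 > find_circ.candidates.bed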
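The same filter chain can be written as one Python predicate, which makes each condition explicit. This is a sketch under assumptions: the function name `keep_candidate` is hypothetical, the length limit is treated as inclusive, and the example record is invented for illustration rather than taken from real find_circ output.

```python
def keep_candidate(bed_line, min_reads=2, max_len=100000):
    """Apply the grep/gawk/maxlength filters to one spliced_sites.bed line."""
    if bed_line.startswith("#"):                 # skip a header line, if any
        return False
    fields = bed_line.rstrip("\n").split("\t")
    start, end, n_reads = int(fields[1]), int(fields[2]), int(fields[4])
    return (
        "CIRCULAR" in bed_line                   # grep CIRCULAR
        and "chrM" not in bed_line               # grep -v chrM
        and n_reads >= min_reads                 # gawk '$5>=2'
        and "UNAMBIGUOUS_BP" in bed_line         # grep UNAMBIGUOUS_BP
        and "ANCHOR_UNIQUE" in bed_line          # grep ANCHOR_UNIQUE
        and end - start <= max_len               # maxlength.py 100000 (assumed inclusive)
    )


# Invented 18-column record, for illustration only.
toy = "\t".join([
    "chr1", "1000", "5000", "hsa_circ_000001", "3", "+", "3", "3", "40", "40",
    "my_test_sample", "3", "0", "0", "1", "GTAG", "NA",
    "CIRCULAR,UNAMBIGUOUS_BP,ANCHOR_UNIQUE",
])
print(keep_candidate(toy))
```

Like `grep`, the keyword tests match anywhere in the line, so a contig whose name merely contains `chrM` would also be dropped.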

7. Analyzing multiple samples

# With multiple samples, run find_circ.py on each sample separately and then merge the individual results

/path/to/merge_bed.py sample1.bed sample2.bed [...] > combined.bed
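To show what merging per-sample results means conceptually, the sketch below groups junctions from several samples by genomic position and strand and records the read count per sample. It is not a reimplementation of merge_bed.py, whose exact output format is not described here; the function name `merge_junctions` and the toy lines are hypothetical.

```python
from collections import defaultdict


def merge_junctions(samples):
    """Combine per-sample BED lines into {(chrom, start, end, strand): {sample: n_reads}}."""
    merged = defaultdict(dict)
    for sample, lines in samples.items():
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue                                  # skip headers and blank lines
            f = line.rstrip("\n").split("\t")
            key = (f[0], int(f[1]), int(f[2]), f[5])      # chrom, start, end, strand
            merged[key][sample] = merged[key].get(sample, 0) + int(f[4])
    return dict(merged)


# Toy input using only the first six BED columns.
samples = {
    "sample1": ["chr1\t1000\t5000\tcirc_1\t3\t+"],
    "sample2": ["chr1\t1000\t5000\tcirc_7\t5\t+", "chr2\t200\t900\tcirc_8\t2\t-"],
}
for key, counts in sorted(merge_junctions(samples).items()):
    print(key, counts)
```

Keying on coordinates rather than on the name column matters because the names in column 4 are provisional running numbers assigned independently in each run.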

8. Format of the output spliced_sites.bed file

# The first six columns of spliced_sites.bed follow the standard BED format; the remaining 12 columns hold information about the junction

column	name	description
1	chrom	chromosome/contig name
2	start	left splice site (zero-based)
3	end	right splice site (zero-based). Always: end > start; which site is 5' and which is 3' depends on strand
4	name	(provisional) running number/name assigned to junction
5	n_reads	number of reads supporting the junction (BED 'score')
6	strand	genomic strand (+ or -)
7	n_uniq	number of distinct read sequences supporting the junction
8	uniq_bridges	number of reads with both anchors aligning uniquely
9	best_qual_left	alignment score margin of the best anchor alignment supporting the left splice junction (`max=2 * anchor_length`)
10	best_qual_right	same for the right splice site
11	tissues	comma-separated, alphabetically sorted list of tissues/samples with this junction
12	tiss_counts	comma-separated list of corresponding read-counts
13	edits	number of mismatches in the anchor extension process
14	anchor_overlap	number of nucleotides the breakpoint resides within one anchor
15	breakpoints	number of alternative ways to break the read with flanking GT/AG
16	signal	flanking dinucleotide splice signal (normally GT/AG)
17	strandmatch	'MATCH', 'MISMATCH' or 'NA' for non-stranded analysis
18	category	list of keywords describing the junction. Useful for quick `grep` filtering
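For downstream scripting, the 18 columns listed above can be mapped onto named fields. The sketch below is one possible way to do this; the names `Junction` and `parse_junction` are hypothetical, only the coordinates and the read count are converted to integers, and the example record is invented for illustration.

```python
from collections import namedtuple

COLUMNS = [
    "chrom", "start", "end", "name", "n_reads", "strand",
    "n_uniq", "uniq_bridges", "best_qual_left", "best_qual_right",
    "tissues", "tiss_counts", "edits", "anchor_overlap",
    "breakpoints", "signal", "strandmatch", "category",
]
Junction = namedtuple("Junction", COLUMNS)


def parse_junction(line):
    """Turn one tab-separated spliced_sites.bed line into a Junction record."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(values)))
    rec = Junction(*values)
    return rec._replace(
        start=int(rec.start),
        end=int(rec.end),
        n_reads=int(rec.n_reads),
        tissues=rec.tissues.split(","),          # comma-separated lists become Python lists
        tiss_counts=rec.tiss_counts.split(","),
        category=rec.category.split(","),
    )


# Invented record, for illustration only.
toy = "\t".join([
    "chr1", "1000", "5000", "hsa_circ_000001", "3", "+", "3", "3", "40", "40",
    "brain,liver", "2,1", "0", "0", "1", "GTAG", "NA",
    "CIRCULAR,UNAMBIGUOUS_BP,ANCHOR_UNIQUE",
])
junction = parse_junction(toy)
print(junction.chrom, junction.end - junction.start, junction.category)
```

Splitting the `category` keywords into a list allows exact membership tests such as `"CIRCULAR" in junction.category`, which is stricter than a substring match on the whole line.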

2019-12-02

22:20:44
