Using Juicer for Hi-C assembly

Reposted from the original article, 2020-03-03 20:54, biosoft

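Juicer has clear advantages over other tools. The paper highlights two: (1) it is packaged as a pipeline, so it is simple to use; (2) it is high-performance, runs in parallel, and copes well with large datasets.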

Under /home/user/juicedir, there should be a folder references that contains the reference fasta file for your genome and the BWA index files. You can soft-link if necessary, or otherwise download the fasta files from UCSC and run bwa index on the fasta file.

Under /home/user/juicedir, you should also create a folder restriction_sites. This should contain your restriction site file. You can create this file using the generate_site_positions.py Python script, or download already created ones from the Juicer AWS mirror.
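The directory layout described above could be prepared roughly like this. This is only a sketch: the function name setup_juicedir and its two arguments are made up for illustration, and the BWA index is built only if bwa happens to be on the PATH. The restriction site file is not generated here; it still has to be created with generate_site_positions.py or downloaded.

```shell
# setup_juicedir JUICEDIR GENOME_FASTA   (hypothetical helper, not part of Juicer)
# Creates references/ and restriction_sites/ and soft-links the FASTA.
setup_juicedir() {
    juicedir=$1
    fasta=$2
    # Resolve the FASTA to an absolute path so the soft link does not dangle.
    fasta_abs="$(cd "$(dirname "$fasta")" && pwd)/$(basename "$fasta")"

    mkdir -p "$juicedir/references" "$juicedir/restriction_sites"
    ln -sf "$fasta_abs" "$juicedir/references/$(basename "$fasta")"

    # Build the BWA index only if bwa is installed and on PATH.
    if command -v bwa >/dev/null 2>&1; then
        bwa index "$juicedir/references/$(basename "$fasta")"
    else
        echo "bwa not found; run 'bwa index' on the reference later" >&2
    fi
}

# Example: setup_juicedir /home/user/juicedir /path/to/genome.fasta
```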

Type screen then launch Juicer: /home/user/juicedir/scripts/juicer.sh [options]

Running without any options will default to the genome of hg19 and the restriction site of MboI.

See Usage for more options; to adjust genome and/or site, use -g <genomeID> and -s <restriction_site>.
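To make the -g and -s options concrete, here is a small dry-run sketch that only prints the command line instead of launching Juicer. The helper name juicer_cmd is made up, and mm10 / HindIII are just example values; check the Usage text for the genome IDs and sites your version accepts.

```shell
# Path follows the example layout in this post; adjust to your own install.
JUICER_SH=/home/user/juicedir/scripts/juicer.sh

# juicer_cmd GENOME_ID SITE: print (not run) the juicer.sh command line.
juicer_cmd() {
    printf '%s -g %s -s %s\n' "$JUICER_SH" "$1" "$2"
}

juicer_cmd hg19 MboI      # the documented defaults, written out explicitly
juicer_cmd mm10 HindIII   # example values for a different genome and enzyme
```

Once the printed command looks right, it can be pasted into the screen session.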

The files will be split if necessary and Juicer will launch.

Results are available in the aligned directory. The Hi-C maps are in inter.hic (for MAPQ > 0) and inter_30.hic (for MAPQ >= 30). The Hi-C maps can be loaded in Juicebox and explored.

When the pipeline has completed successfully, you will see the folders aligned, debug, and splits. The debug folder contains logging information for the pipeline.

The splits folder is a temporary working directory and can be deleted once you are sure the pipeline ran successfully.

The aligned folder contains the results:

inter.hic / inter_30.hic: The .hic files for Hi-C contacts at MAPQ > 0 and at MAPQ >= 30, respectively

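MAPQ (mapping quality) describes how well each read aligned. The higher the MAPQ, the more reliable the alignment, and downstream steps can filter on it as the analysis requires.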

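MAPQ is defined in terms of probability: each read's alignment is only an estimate of its true alignment, so it is a random variable and may be wrong. The probability of error is expressed on the Phred scale. Suppose a read has a MAPQ of $mQ, and $P is the probability that the read is aligned incorrectly.

$P = 10 ** (-$mQ / 10.0);
If $mQ is 30, then $P (the alignment error probability) is 0.1%.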
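The relationship between MAPQ and error probability is easy to check numerically. A minimal sketch, assuming awk is available; the function name mapq_to_error is made up for illustration.

```shell
# mapq_to_error MAPQ: print the probability that the alignment is wrong,
# using P = 10^(-MAPQ/10).
mapq_to_error() {
    awk -v mq="$1" 'BEGIN { printf "%g\n", 10 ^ (-mq / 10) }'
}

for mq in 10 20 30; do
    printf 'MAPQ %s -> P(wrong) = %s\n' "$mq" "$(mapq_to_error "$mq")"
done
```

Every 10 points of MAPQ cut the error probability by a factor of ten, which is why the MAPQ >= 30 maps are the stricter of the two.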

merged_nodups.txt: The Hi-C contacts with duplicates removed. This file is also input to the assembly and diploid pipelines

collisions.txt: Reads that map to more than two places in the genome

inter.txt, inter_hists.m / inter_30.txt, inter_30_hists.m: The statistics and graphs files for Hi-C contacts at MAPQ > 0 and at MAPQ >= 30, respectively.

inter.txt holds the summary statistics for the Hi-C contacts at MAPQ > 0.

These are also stored within the respective .hic files in the header. The .m files can be loaded into Matlab.

The statistics and graphs are displayed under Dataset Metrics when loaded into Juicebox

dups.txt, opt_dups.txt: Duplicates and optical duplicates

abnormal.sam, unmapped.sam: Abnormal chimeric and unmapped reads

merged_sort.txt: This is a combination of merged_nodups / dups / opt_dups and can be deleted once the pipeline has successfully completed

stats_dups.txt / stats_dups_hists.m: Statistics and graphs on the duplicates

After the run finishes, the following directories are created under the sample's directory:

splits
aligned

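The splits directory holds intermediate results. Because Hi-C datasets are large, the raw reads are split into many chunks that are processed in parallel to speed things up. By default each chunk holds 22.5M reads. This can be changed with the -C option, which sets the number of lines per split file (default 90000000). A read takes 4 lines in a FASTQ file, so the value must be a multiple of 4. After splitting, the R1 and R2 reads are each aligned to the genome with bwa, then merged, screened for chimeric reads, and deduplicated to produce the preprocessed result files.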
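The arithmetic behind -C can be sanity-checked before launching a run. A small sketch; chunk_reads is a made-up helper, not something Juicer provides.

```shell
# chunk_reads LINES: print how many reads one chunk of LINES lines holds,
# or fail if LINES is not a multiple of 4 (one FASTQ record = 4 lines).
chunk_reads() {
    lines=$1
    if [ $((lines % 4)) -ne 0 ]; then
        echo "chunk size $lines is not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

chunk_reads 90000000   # the default -C value; prints 22500000
```

A value such as 90000001 is rejected, because it would cut a FASTQ record in half.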

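The aligned directory holds the final results, including the .hic map files that can be loaded into Juicebox: inter.hic and inter_30.hic, where 30 means the contacts were filtered at MAPQ >= 30. The full pipeline also runs downstream steps such as calling TADs and chromatin loops. The HiCCUPS algorithm used for loop calling has to run with GPU acceleration, so an ordinary server without a GPU cannot run this step.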

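As the steps above show, Juicer really is simple to use. However, Hi-C sequencing volumes are very large and the downstream algorithms are complex, so the demand on compute resources is high and only a high-performance server will do. The GPU the software needs is also expensive, around 20,000 yuan per card. These factors have held back the wider adoption of Hi-C to some extent.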

========================3d-dna=======================

# Two files from the Juicer setup are needed:

# merged_nodups.txt and the genome FASTA

# pwd

/home/hujiaxiang/tools_update/3d-dna

# Create a new directory to use as the working directory

mkdir duck

cd duck

# Link merged_nodups.txt

ln -s /home/hujiaxiang/tools_update/juicer/scripts/aligned/merged_nodups.txt .

# Link the genome FASTA

ln -s /home/hujiaxiang/tools_update/juicer/references/genome.fasta .

# Run 3d-dna

run-asm-pipeline.sh genome.fasta merged_nodups.txt

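Note that this software generates very large intermediate temporary files and stores them under /tmp.

If you have a lot of Hi-C data, /tmp may well be too small to hold them, and the run fails with an error.

The workaround is as follows.

Create a new directory to serve as the temporary directory:

mkdir /home/hujiaxiang/tools_update/3d-dna/tmp_duck

Bind this directory onto /tmp (root privileges are needed):

mount --rbind /home/hujiaxiang/tools_update/3d-dna/tmp_duck /tmp

mount -o remount,rw /tmp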
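Before remounting anything, it may be worth checking how much room /tmp actually has. A rough sketch using POSIX df; the helper names and the 100 GiB threshold are arbitrary examples, not figures from 3d-dna.

```shell
# free_kb DIR: print the space available on DIR's filesystem, in KiB.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_room DIR NEEDED_KB: succeed if DIR has at least NEEDED_KB KiB free.
has_room() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

needed_kb=$((100 * 1024 * 1024))   # 100 GiB, an arbitrary example figure
if has_room /tmp "$needed_kb"; then
    echo "/tmp has enough free space"
else
    echo "/tmp is too small; consider the bind mount workaround" >&2
fi
```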

Sources:

https://www.jianshu.com/p/de9400025e1d

https://github.com/aidenlab/juicer/wiki/Running-Juicer-on-a-cluster

http://blog.sina.com.cn/s/blog_4ab0b3390102ygde.html

https://blog.csdn.net/tanzuozhev/java/article/details/89115080
