我們生信技能書有一篇介紹bedtools的文章,可以在微信里搜着看下,非常有用。
http://bedtools.readthedocs.io/en/latest/
gtf轉bed用Linux命令完全可以實現,因為gtf每一行比較規律,不像fasta和fastq。
cat gffcmp.combined.gtf | grep -v exon | cut -f1,4,5,9 | cut -f1 -d";" | awk '{print $1, $2, $3, $5}' | sed -e 's/ /\t/g' | sed -e 's/\"//g' > gffcmp.combined.bed
后面發現,只提取transcript的首尾是不對的,必須要提取exon信息:
cat gffcmp.combined.gtf | grep exon | cut -f1,4,5,9 | cut -f1 -d";" | awk '{print $1, $2, $3, $5}' | sed -e 's/ /\t/g' | sed -e 's/\"//g' > gffcmp.combined.exon.bed
awk輕松計算總的覆蓋度:
cat cov.out.coding.exon.out | awk 'BEGIN{i=1;j=1}{i=i+$5;j=j+$6}END{print i,j}'
1. bed的 過濾/overlap 運算,就是從一個bed里過濾掉與另一個bed有交集的所有區域。
bedtools intersect -v -a gffcmp.combined.bed -b coding.bed -wa > test.bed
2. bed的覆蓋度i計算,比如我有一些轉錄本,想知道它們在基因組上的覆蓋度,怎么辦,transcript之間是有overlap的
bedtools coverage -a Teatree_Assembly.fasta.bed -b gffcmp.combined.bed > cov.out
結果的每一行怎么解讀,請看:鏈接
Sc0000000 1 3505831 977 1824630 3505830 0.5204559 Sc0000001 1 3488299 853 1929980 3488298 0.5532727 Sc0000002 1 2866215 896 1435119 2866214 0.5007020 Sc0000003 1 2774837 512 1285961 2774836 0.4634368 Sc0000004 1 2768176 402 720812 2768175 0.2603925
3. 2的另一種實現方式
bedtools genomecov -i Arabidopsis_thaliana.TAIR10.38.bed -g Arabidopsis_thaliana.TAIR10.dna_sm.bed > Arabidopsis_thaliana.cov
解讀方法不一樣:
chr1 0 980 1000 0.98 chr1 1 20 1000 0.02 chr2 1 500 500 1 genome 0 980 1500 0.653333 genome 1 520 1500 0.346667
chromosome (or entire genome) depth of coverage from features in input file number of bases on chromosome (or genome) with depth equal to column 2. size of chromosome (or entire genome) in base pairs fraction of bases on chromosome (or entire genome) with depth equal to column 2.
fai變bed
cat Arabidopsis_thaliana.TAIR10.dna_sm.fasta.fai | awk '{print $1, 1, $2}' | sed -e's/ /\t/g' > Arabidopsis_thaliana.TAIR10.dna_sm.bed
gff3轉gtf
gffread Arabidopsis_thaliana.TAIR10.38.gff3 -T -o Arabidopsis_thaliana.TAIR10.38.gtf
gtf轉bed
cat Arabidopsis_thaliana.TAIR10.38.gtf | cut -f1,4,5,9 | cut -f1 -d";" | awk '{print $1, $2, $3, $5}' | sed -e 's/ /\t/g' | sed -e 's/\"//g' | sed -e 's/transcript\://g' > Arabidopsis_thaliana.TAIR10.38.bed
coverage方法計算覆蓋度
bedtools coverage -a Arabidopsis_thaliana.TAIR10.dna_sm.bed -b Arabidopsis_thaliana.TAIR10.38.coding.bed > coding.cov.c
最終統計
cat coding.cov.c | awk 'BEGIN{i=1;j=1}{i=i+$5;j=j+$6}END{print i,j}'
我比較了,這兩種方法算出來的覆蓋度是一摸一樣的,但是如果真的只想算基因組覆蓋度,genomecov肯定更快一些。
bedtools,真的神器,你值得擁有。