背景
Kmer是基因組組裝算法中經常接觸到的概念,簡單來說,Kmer就是長度為k的核苷酸序列。一般長短為m的reads可以分成m-k+1個Kmer。Kmer的長度和閾值直接影響到組裝的效果。
Denovo組裝流程:原始數據——數據過濾——糾錯——kmer分析——denovo組裝
。
組裝測序策略:根據基因組大小和具體情況選擇個大概的k值,構建contig所需的數據量以及所需的構建的文庫數量。對於植物基因組一般考慮的是大kmer(>31),動物一般在27左右,具體根據基因組情況調整。需要在短片段數據量達到20X左右的時候進行kmer分析。Kmer分析正常后,繼續加測數據以達到最后期望的數據量。
編碼
import os
import sys
# convert command line arguments to variables
kmer_size = int(sys.argv[1])
count_cutoff = int(sys.argv[2])
# define the function to split dna
def split_dna(dna, kmer_size):
kmers = []
for start in range(0,len(dna)-(kmer_size-1),1):
kmer = dna[start:start+kmer_size]
kmers.append(kmer)
return kmers
# create an empty dictionary to hold the counts
kmer_counts = {}
# process each file with the right name
for file_name in os.listdir("."):
if file_name.endswith(".dna"):
dna_file = open(file_name)
# process each DNA sequence in a file
for line in dna_file:
dna = line.rstrip("\n")
# increase the count for each k-mer that we find
for kmer in split_dna(dna, kmer_size):
current_count = kmer_counts.get(kmer, 0)
new_count = current_count + 1
kmer_counts[kmer] = new_count
# print k-mers whose counts are above the cutoff
for kmer, count in kmer_counts.items():
if count > count_cutoff:
print(kmer + " : " + str(count))