【Python小試】計算目錄下所有DNA序列的Kmer並過濾

本文轉載自查看原文 2020-05-13 15:28 717 Python編程

背景

Kmer是基因組組裝算法中經常接觸到的概念，簡單來說，Kmer就是長度為k的核苷酸序列。一般長短為m的reads可以分成m-k+1個Kmer。Kmer的長度和閾值直接影響到組裝的效果。

Denovo組裝流程：原始數據——數據過濾——糾錯——kmer分析——denovo組裝。

組裝測序策略：根據基因組大小和具體情況選擇個大概的k值，構建contig所需的數據量以及所需的構建的文庫數量。對於植物基因組一般考慮的是大kmer（>31），動物一般在27左右，具體根據基因組情況調整。需要在短片段數據量達到20X左右的時候進行kmer分析。Kmer分析正常后，繼續加測數據以達到最后期望的數據量。

編碼

import os
import sys

# convert command line arguments to variables
kmer_size = int(sys.argv[1])
count_cutoff = int(sys.argv[2])

# define the function to split dna
def split_dna(dna, kmer_size):
    kmers = []
    for start in range(0,len(dna)-(kmer_size-1),1):
        kmer = dna[start:start+kmer_size]
        kmers.append(kmer)
    return kmers

# create an empty dictionary to hold the counts
kmer_counts = {}

# process each file with the right name
for file_name in os.listdir("."):
    if file_name.endswith(".dna"):
        dna_file = open(file_name)

        # process each DNA sequence in a file
        for line in dna_file:
            dna = line.rstrip("\n")

            # increase the count for each k-mer that we find
            for kmer in split_dna(dna, kmer_size):
                current_count = kmer_counts.get(kmer, 0)
                new_count = current_count + 1
                kmer_counts[kmer] = new_count

# print k-mers whose counts are above the cutoff
for kmer, count in kmer_counts.items():
    if count > count_cutoff:
        print(kmer + " : " + str(count))

Ref: https://www.cnblogs.com/leezx/p/5577600.html

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Python遞歸遍歷目錄下所有文件 python 某個目錄下的所有文件列表 Python學習筆記：列出目錄下所有文件或目錄 python 讀取一個目錄下的所有目錄和文件 Python3：遞歸實現輸出目錄下所有的文件 Python獲取指定目錄下所有子目錄、所有文件名利用python求一段DNA序列的互補序列【Python】獲取當前目錄和子目錄下所有文件或指定文件的方法 python獲取某目錄下（含子目錄）所有文件名稱 python獲得目錄下（含子目錄）所有文件的完整路徑