1. 原理
對於DNA序列,一階馬爾科夫鏈可以理解為當前鹼基的類型僅取決於上一位鹼基類型。如圖1所示,一條序列的開端(由B開始)可能是A、T、G、C四種鹼基(且可能性相同,均為0.25),若序列的某一位是A,則下一位鹼基是A、T、G、C的概率分別為0.25、0.20、0.20、0.20,下一位無鹼基(即序列結束,狀態為E)的概率為0.15。
2. 代碼實現
以下代碼運行於Jupyter Notebook (Python 3.7);代碼功能是隨機生成一定數量的DNA序列,統計序列長度並繪制分布圖。若希望顯示隨機生成的序列,將代碼# print(''.join(Seq))
前的#
刪除即可。
import numpy
import random
import matplotlib.pyplot as plt
# 狀態空間
states = ["A","G","C","T","E"]
# 可能的事件序列
transitionName = [["AA","AG","AC","AT","AE"],
["GA","GG","GC","GT","GE"],
["CA","CG","CC","CT","CE"],
["TA","TG","TC","TT","TE"],]
# 概率矩陣(轉移矩陣)
transitionMatrix = [[0.25,0.20,0.20,0.20,0.15],
[0.20,0.25,0.20,0.20,0.15],
[0.20,0.20,0.25,0.20,0.15],
[0.20,0.20,0.20,0.25,0.15]]
def RandomDNAs(Num):
max_len = 0
i = 0
Seq = [] #創建列表(Seq)用於添加鹼基,以組成DNA序列
Len = [] #創建列表(Len)用於記錄每條生成序列的長度
while i != Num:
Base = ["A","G","C","T"]
START = random.choice(Base) #隨機從鹼基中選擇一個作為序列的起始鹼基
Seq.append(START) #將起始鹼基添加至Seq中
while START != "E":
if START == "A":
change = numpy.random.choice(transitionName[0],p=transitionMatrix[0])
#以transitionMatrix矩陣第一行的概率分布隨機抽取transitionName第一行包含的事件
if change == "AA":
START = "A" #如果轉移狀態是AA(即A鹼基接下來的鹼基是A,則將起始鹼基設為A)
elif change == "AG":
START = "G"
elif change == "AC":
START = "C"
elif change == "AT":
START = "T"
elif change == "AE":
START = "E"
elif START == "G":
change = numpy.random.choice(transitionName[1],p=transitionMatrix[1])
if change == "GA":
START = "A"
elif change == "GG":
START = "G"
elif change == "GC":
START = "C"
elif change == "GT":
START = "T"
elif change == "GE":
START = "E"
elif START == "C":
change = numpy.random.choice(transitionName[2],p=transitionMatrix[2])
if change == "CA":
START = "A"
elif change == "CG":
START = "G"
elif change == "CC":
START = "C"
elif change == "CT":
START = "T"
elif change == "CE":
START = "E"
elif START == "T":
change = numpy.random.choice(transitionName[3],p=transitionMatrix[3])
if change == "TA":
START = "A"
elif change == "TG":
START = "G"
elif change == "TC":
START = "C"
elif change == "TT":
START = "T"
elif change == "TE":
START = "E"
if START != "E":
Seq.append(START) #如果狀態轉移后不為End(E),則將轉移后的鹼基加到Seq序列中
i += 1
Len.append(len(Seq))
if len(Seq) > max_len:
max_len = len(Seq)
#print(''.join(Seq))
Seq.clear()
plt.figure(figsize=(8,5), dpi=120)
plt.hist(numpy.array(Len), bins=max_len, edgecolor="white")
# 顯示橫軸標簽
plt.xlabel("DNA Sequence Length")
# 顯示縱軸標簽
plt.ylabel("Frequency")
# 顯示圖標題
plt.title("Frequency Distribution Histogram of 1,000 DNA Sequence Length")
plt.savefig('./DNALengthHistogram.jpg')
plt.show()
print("DNA序列的最大長度為:",max_len)
print("DNA序列長度的眾數為:",max(Len, key=Len.count))
%matplotlib notebook #若未使用Jupyter Notebook,此句不需要
RandomDNAs(1000) #1000表示隨機生成1000條序列
該代碼可以進一步添加保存序列直方圖的功能,此外,該代碼可以進一步簡化,將嵌套的40行if-elif語句精簡為一行的切片語句,代碼如下:
import numpy
import random
import matplotlib.pyplot as plt
states = ["A","G","C","T","E"]
transitionName = [["AA","AG","AC","AT","AE"],
["GA","GG","GC","GT","GE"],
["CA","CG","CC","CT","CE"],
["TA","TG","TC","TT","TE"],]
transitionMatrix = [[0.25,0.20,0.20,0.20,0.15],
[0.20,0.25,0.20,0.20,0.15],
[0.20,0.20,0.25,0.20,0.15],
[0.20,0.20,0.20,0.25,0.15]]
def RandomDNAs(Num):
max_len = 0
i = 0
Seq = []
Len = []
while i != Num:
Base = ["A","G","C","T"]
START = random.choice(Base)
Seq.append(START)
while START != "E":
if START == "A":
change = numpy.random.choice(transitionName[0],p=transitionMatrix[0])
elif START == "G":
change = numpy.random.choice(transitionName[1],p=transitionMatrix[1])
elif START == "C":
change = numpy.random.choice(transitionName[2],p=transitionMatrix[2])
elif START == "T":
change = numpy.random.choice(transitionName[3],p=transitionMatrix[3])
START = change[-1]
if START != "E":
Seq.append(START)
i += 1
Len.append(len(Seq))
if len(Seq) > max_len:
max_len = len(Seq)
f = open('./DNASequence.txt','a')
f.write(''.join(Seq) + "\n")
f.close()
Seq.clear()
plt.figure(figsize=(8,5), dpi=120)
plt.hist(numpy.array(Len), bins=max_len, edgecolor="white")
plt.xlabel("DNA Sequence Length")
plt.ylabel("Frequency")
plt.title("Frequency Distribution Histogram of DNA Sequence Length")
plt.savefig('./DNALengthHistogram.png')
plt.show()
print("DNA序列的最大長度為:",max_len)
print("DNA序列長度的眾數為:",max(Len, key=Len.count))
a = input("隨機生成的DNA數量:")
RandomDNAs(int(a))
input("Prease <enter>") #py文件運行完成時不直接退出
3. 運行結果
從以下4個序列長度分布統計可以看到,隨着隨機生成的序列數量增多,序列長度分布愈發集中,且長度為1bp的序列占比最多且逐漸增加。
10,000條DNA序列的序列中
DNA序列的最大長度為: 65
DNA序列長度的眾數為: 1
100,000條DNA序列的序列中
DNA序列的最大長度為: 71
DNA序列長度的眾數為: 1
4. 擴展
使用R生成隨機DNA序列:https://www.r-bloggers.com/introduction-to-markov-chains-and-modeling-dna-sequences-in-r/
使用Python構建二階馬爾科夫鏈生成隨機DNA序列:https://www.pythonanywhere.com/forums/topic/268/