Bioinfo:學習Python，做生信PartII 學習筆記

本文轉載自查看原文 2018-01-01 19:34 1536

在學習了生信大神孟浩巍的知乎Live “學習Python, 做生信”之后，對第二部分的文件信息處理部分整理了如下的筆記。

一、fasta與fastq格式的轉換

1、首先需要了解FASTA和FASTQ格式的詳解

1）具體的詳解看知乎專欄的這篇文章，寫的很詳細。

https://zhuanlan.zhihu.com/p/20714540

2）關於FASTA

主要分為兩部分：第一行是“>”開始的儲運存的序列描述信息；接下來是序列部分。序列部分可以有空格，按照一般70~80bp。用python進行操作的過程中，需要去掉空格和換行符，把序列讀成一行再處理。（.strip()可以除去空格）
即使是mRNA的序列，為了保證數據的統一性，序列中的U依然是用T來表示。核苷酸序列信息對應列表如下所示。

A  adenosine          C  cytidine             G  guanine
T  thymidine          N  A/G/C/T (any)        U  uridine 
K  G/T (keto)         S  G/C (strong)         Y  T/C (pyrimidine) 
M  A/C (amino)        W  A/T (weak)           R  G/A (purine)        
B  G/T/C              D  G/A/T                H  A/C/T      
V  G/C/A              -  gap of indeterminate length

3）關於FASTQ

剛剛測到的測序數據一般用FASTQ格式進行儲存，因為其中包含了測序質量信息。
四行依次為：信息頭，序列信息，注釋信息，質量信息。質量值的計算方法見上面鏈接里的文章。
序列符號為 ATCGN，N表示無法確定。

2、讀取fastq文件並進行相應的格式轉化

1）主要流程如下：

讀取fastq文件，第1行的header作為fasta的header
讀取fastq文件，第2行的seq作為fasta格式的seq
第3/4行直接忽略
格式化輸出fasta文件

2）以下是沒有設置輸出並儲存為.fa文件的代碼

 1 #-*- coding: UTF-8 -*-
 2 with open( 'E:\genome_file\\test.fastq','r') as input_fastq:
 3     for index, line in enumerate (input_fastq):  
 4         if index % 4 == 0:
 5             print ">" + line.strip()[1:]
 6         elif index % 4 == 1:
 7             for i in range (0,len(line.strip()),40):
 8                 print line.strip()[i:i+40]
 9         elif index % 4 == 2:
10             pass
11         elif index % 4 == 3:
12             pass

3）以下是設置了輸出fasta文件的代碼：

 1 out_file = open（E:\genome_file\\test.fa）
 2 
 3 with open( 'E:\genome_file\\test.fastq','r') as input_fastq:
 4 
 5     for index, line in enumerate (input_fastq):
 6 
 7         if index % 4 == 0:
 8             out_file.write(">" + line.strip()[1:]+"\n" )
 9 
10         elif index % 4 == 1:
11             for i in range (0,len(line.strip()),40):
12                 out_file.write(line.strip()[i:i + 40] +"\n")
13 
14         elif index % 4 == 2:
15             pass
16 
17         elif index % 4 == 3:
18             pass

4）有幾點注意的地方這里提一下：

2）和3）中代碼的主要區別就是在3）在一開始創建了一個output_file，這就是用來輸出用的。所以可以看到2）中的輸出直接用的是print，而3）中則是通過.write寫入到 output_file當中。
讀取第一行的時候要特別注意：去掉原先FASTQ文件中的“@”字符，並加上一個“>”字符。
enumarate是個遍歷函數，能夠輸出相應的索引和對應的值。
print能夠自動換行，但是write不行。所以在寫入的時候需要切片+ \n，切多少呢一般？前面在關於fasta格式文件介紹中有提到。

二、讀取fasta格式文件

1、分析過程

當有>的時候，就為標題行；
當不是>的時候，就是序列信息；
當是序列信息的時候，需要進行序列的拼接；
最終返回序列的header與seq

2、簡化版的主要代碼如下：

input_file = open( 'E:\genome_file\\test.fa','r')
seq = ""
header = input_file.readline().strip()[1:]

for line in input_file:
    line = line.strip()
    if line[0] != ">":
        seq = seq + line
    else:
        print header
        print seq
        header = line[1:]
        seq = ""
#打印最后一行
print header
print seq
input_file.close()

3、值得注意的幾個地方：

整體思路就是通過for循環進行序列信息的累加；
一開始先讀一行header(雖然暫時也還沒搞太明白為什么，反正就是如果不讀的話出來結果是亂七八糟的）
按照上面那個情況，循環中是無法打印最后一行的，所以要在最外面打印一下。
readline和readlines的區別：readline只讀一行，readlines則是一下子讀整個文件。永遠不要用readlines！！

4、下面的代碼是匯總了上面的功能，直接通過函數形式fastq-->fasta格式文件的轉換

 1 def read_fasta(file_path=""):
 2     """
 3     Loading FASTA file and return a iterative object
 4     """
 5 
 6     line = ""
 7 
 8     try:
 9         fasta_handle = open(file_path,"r")
10     except:
11         raise IOError("Your input FASTA file is not right!")
12 
13     # make sure the file is not empty
14     while True:
15         line = fasta_handle.readline()
16         if line == "":
17             return
18         if line[0] == ">":
19             break
20 
21     # when the file is not empty, we try to load FASTA file
22     while True:
23         if line[0] != ">":
24             raise ValueError("Records in Fasta files should start with '>' character")
25         title = line[1:].rstrip()
26         lines = []
27         line = fasta_handle.readline()#這里是讀第二行了
28         while True:#循環只用來加載序列行
29             if not line:
30                 break
31             if line[0] == ">":
32                 break
33             lines.append(line.rstrip())
34             line = fasta_handle.readline()
35 
36         yield title,"".join(lines).replace(" ","").replace("\r","")
37 
38         if not line:
39             return
40 
41     fasta_handle.close()
42     assert False, "Your input FASTA file have format problem."
43 
44 for header,seq in read_fasta(file_path=r"D:\data_file\test.fa"):
45     print header
46     print seq
47 #下面是后邊的練習中以hg19為例進行操作
48 hg19_genome = {}
49 for chr_name , chr_seq in read_fasta(file_path=r"D:/data_file/hg19_only_chromosome.fa"):
50     hg19_genome[chr_name] = chr_seq

其實不太明白的是這里邊最終是返回兩個head和seq兩個值了嗎？為什么后邊直接來兩個參數就能開始for循環？

三、計算人類基因組長度信息

1、下載人類參考基因組信息（UCSC網站）

http://hgdownload.soe.ucsc.edu/downloads.html#human
下載對象為hg19全基因組信息

2、利用讀取fasta的代碼讀取基因組序列

1）代碼如下：

1 #前面已經定義過的read_fasta函數這里不再重復寫。
2 hg19_genome = {}
3 
4 for chr_name , chr_seq in read_fasta(file_path=r"D:/data_file/hg19_only_chromosome.fa"):
5     hg19_genome[chr_name] = chr_seq

2）注意幾點

程序中把下載的.fa文件中的信息輸入到hg19_genome的列表當中
讀取一個全基因組數據需要耗費一定量的時間和內存，如果再pycharm或者通過powershell進行運行，調試的時候每調試一次就要運行一次，很費事。這個時候jupyter就方便多了。使用方法：1、終端輸入conda,回車；2、輸入jupyter notebook,回車；3、在彈出的中選擇計算機上相應的Python程序,hello world。

3、統計每條序列的長度，並輸出；

1)下面這個是一個亂序的輸出

1 for chr_name in sorted(hg19_genome.keys()):
2     print chr_name, len(hg19_genome.get(chr_name))

2）下面的代碼能夠做到按順序輸出

 1 hg19_total_len = 0
 2 for index in range (1,26):
 3     if index <=22:
 4         chr_name = "chr%s" %index
 5     elif index == 23:
 6         chr_name = "chrX"
 7     elif index == 24:
 8         chr_name = "chrY"
 9     elif index == 25:
10         chr_name = "chrM"
11     print chr_name, len(hg19_genome.get(chr_name))
12     hg19_total_len = hg19_total_len + len(hg19_genome.get(chr_name))
13     
14 print "Total chromosome length is %s" %hg19_total_len

注意一點：一般從字典里邊根據Key來獲取內容，用.get()，這樣子在對象不存在的時候就不會報錯。

3、統計人類參考基因組的總長度，並輸出

在上面已經輸出，往回看。

4、統計每條染色體N的長度，並輸出（GC含量同理）

以N為例，其他同理。使用.count指令。

1 hg19_genome['chr22'].count("N")

5、提取基因組特定區域的序列，並輸出（支持+/-鏈）

1）思路分析

通過切片可以截取特定區域的序列
需要進行轉移來過得互補序列
[::-1]能夠實現序列反向的功能

2）上代碼

 1 chr_name = "chr1"
 2 chr_start = 10000
 3 chr_end = 10100
 4 #特定區域截取
 5 hg19_genome[chr_name][chr_start-1:chr_end-1].upper()
 6 #互補鏈
 7 import string
 8 translate_table = string.maketrans("ATCG","TAGC")
 9 a = hg19_genome[chr_name][chr_start-1:chr_end-1].upper()
10 a.translate(translate_table)
11 
12 #反向鏈
13 a.translate(translate_table)[::-1]#反向

這里導入string模塊，設定一個轉譯表
.upper是用來把小寫字母全部轉換成大寫
注意要“-1”

3）定義成函數方便使用

1 def con_rev(input_seq):
2     translate_table = string.maketrans("ATCG","TAGC")
3     output_seq = input_seq.upper().translate(translate_table)[::-1]
4     return output_seq
5 
6 con_rev("aaTTCCGG")

四、計算人類基因組外顯子的長度（CDS區域同理）

1、下載人類參考轉錄組（UCSC table browser）

一般在track一欄均選擇RefSeq Genes

2、整體思路

從轉錄組抓取關鍵信息：exon_count, gene_id, exon_start,exon_end
最后輸出一個字典：{‘基因名，外顯子長度’}
同一基因存在不同轉錄本，本例中按照最長計算，所以需要進行比較

3、上代碼

 1 gene_exon_len_dict = {}
 2 with open(r"E:\genome_file\hg19_refseq_gene_table","r") as input_file:
 3     header = input_file.readline()
 4     for line in input_file:
 5         line = line.strip().split("\t")
 6         
 7         exon_count = int(line[8])
 8         exon_start = line[9].split(",")#還有更好的方法進行index
 9         exon_end = line[10].split(",")
10         exon_total_len = 0
11         gene_id = line[12]
12         for index in range (exon_count):
13             exon_total_len = exon_total_len + int(exon_end[index]) - int(exon_start[index])
14             
15         if gene_exon_len_dict.get(gene_id) is None:#注意：如果直接用dict[gene_id]是會出現key error的
16             gene_exon_len_dict[gene_id] = exon_total_len
17             
18         elif gene_exon_len_dict[gene_id] < exon_total_len:
19             gene_exon_len_dict[gene_id] = exon_total_len
20             
21         else:
22             pass
23             
24         #注意啊調試的時候要加break，這是很重要的！！
25         
26     print gene_exon_len_dict
27

.split()能夠根據括號中的內容把相應的字符串拆分成為列表
exon_start和exon_end都要以整型的格式才能進行相減。除了上面代碼中的處理方法，還可以通過如下方法來實現：

1     exon_start = map(int,line[9].strip(",").split(","))
2     exon_end = map(int,line[10].strip(",").split(","))

特別重要！！數據量大的運算，調試的時候，在后邊加個 break啊！！不然卡的你懷疑人生~~~

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 學習編程用什么做筆記？論文科生學習Python的收獲生信學習--生物數據庫大全微信小程序學習筆記(6) 使用wx.navigateTo做頁面跳轉生信學習-二代測序知乎專欄總結[轉] Python map學習筆記 Python學習筆記（2） python學習筆記一 webdriver（python）學習筆記一 Python 學習筆記（五）