Biopython學習筆記


一、Biopython 學習資料
Biopython教程( 中文版cookbook): http://biopython-cn.readthedocs.io/zh_CN/latest/
更多文檔:
 
 
二、 序列輸入和輸出
1. 通過SeqIO讀取序列文件
主要使用 Bio.SeqIO模塊。
導入:
from Bio import SeqIO
 
Bio.SeqIO.read() 讀取一個record,並返回record對象。如:
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:10638'])
 
Bio.SeqIO.parse() ,它用於讀取序列文件生成 SeqRecord 對象,包含兩個參數:
第一個參數是一個文件名或者一個句柄( handle )。句柄可以是打開的文件,命令行程序的輸出,或者來自下載的數據(請見第 5.3 節)。
第二個參數是一個小寫字母字符串,用於指定序列格式,支持的文件格式請見 http://biopython.org/wiki/SeqIO
一般支持的文件格式有(常見的):fasta、fastq、genbank、embl、swiss、clustal、phylip、sff等。
 
Bio.SeqIO.parse() 用於讀取序列文件並返回 SeqRecord 對象,並通常用在循環中,例如:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(seq_record.description.replace(seq_record.id, ""))
    print(seq_record.seq)
    print(len(seq_record))    # 等同於 len(seq_record.seq)
 
此外, 可通過 SeqIO.to_dict() 函數將SeqRecord對象轉為dict對象,如:
>>> orchid_dict = SeqIO.to_dict(SeqIO.parse("NC_005816.gbf", "genbank"))
>>> orchid_dict
{'NC_005816.1': SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['BioProject:PRJNA224116', 'BioSample:SAMN02602970', 'Assembly:GCF_000007885.1'])}
 
Since this variable orchid_dict is an ordinary Python dictionary, we can look at all of the keys we have available:
>>> len(orchid_dict)
>>> list(orchid_dict.keys())
 
You can leave out the “list(...)“ bit if you are still using Python 2. Under Python 3 the dictionary methods like “.keys()“ and “.values()“ are iterators rather than lists.
 
We can access a single SeqRecord object via the keys and manipulate the object as normal:
>>> seq_record = orchid_dict["Z78475.1"]
>>> print(seq_record.description)
..................................
>>> print(repr(seq_record.seq))
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
 
 
2. 將序列寫入文件
使用 Bio.SeqIO.write() 輸出序列(寫入文件)。該函數需要三個參數:某些 SeqRecord 對象( A list (or iterator) of SeqRecord objects, or (if using Biopython 1.54 or later) a single SeqRecord.),要寫入的句柄或文件名,和序列格式。
例如:
from Bio import SeqIO
SeqIO.write(my_records, "my_example.faa", "fasta")
 
或者使用文件句柄(handle),例如:
from Bio import SeqIO
records = ...
with open("example.faa", "w") as handle:
SeqIO.write(records, handle, "fasta")
 
注意,如果你 Bio.SeqIO.write() 函數要寫入的文件已經存在,舊文件將會被覆寫,並且不會得到任何警告信息。
 
例子1:
records_new = SeqIO.parse(options.input, "fasta")
result = open(options.output, "w")
for rec in records_new:
    seq_id = rec.id
    alist = seq_id.split("|")
    if options.tag != "":
        alist[0] = alist[0] + "|" + options.tag
    seq_id = " ".join(alist)

    seq_desc = rec.description.replace(rec.id, "")
    seq_desc = re.sub(r">", "", seq_desc)
    seq_desc = seq_desc.strip()
    # 修改原先的ID、description
    rec.id = seq_id
    rec.description = seq_desc
    #print(seq_desc)
    #print(rec.id)

    seq_len = len(rec.seq)
    if seq_len >= options.length:
        SeqIO.write(rec, result, "fasta")  # 將record對象寫入輸出文件

result.close()
 
3. 將GenBank 轉為 FASTA
方法1:
通過 Bio.SeqIO.parse() 和 Bio.SeqIO.write() 函數實現函數的轉換。如:
from Bio import SeqIO
records = SeqIO.parse("ls_orchid.gbk", "genbank")
count = SeqIO.write(records, "my_example.fasta", "fasta")
print("Converted %i records" % count)
 
方法2:
使用 SeqIO.convert 函數進行格式轉換。
>>> from Bio import SeqIO
>>> SeqIO.convert("NC_011833.gbk", "genbank", "NC_011833_cov.fna", "fasta")
 
方法3:
import Bio.SeqIO
gbk_file = "NC_011833.gbk"
fna_file = "NC_011833_cov.fna"
input = open(gbk_file, "r")
output = open(fna_file, "w")

for seq in SeqIO.parse(input, "genbank") :
    print("Convert %s from GenBank to FASTA format" % seq.id)
    output.write(">%s %s\n%s\n" % (
        seq.id, seq.description, seq.seq))

input.close()
output.close()
 
 
 
4. SeqRecord 對象內置的方法
Sequence record 或稱之為 SeqRecord 類,該類在 Bio.SeqRecord 模塊中有定義。
導入 SeqRecord 類:
from Bio.SeqRecord import SeqRecord
 
SeqRecord 類包括下列屬性:
.seq
– The sequence itself, typically a Seq object.
.id
– The primary ID used to identify the sequence – a string. In most cases this is something like an accession number.
.name
– A “common” name/id for the sequence – a string. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analogous to the LOCUS id in a GenBank record.
.description
– A human readable description or expressive name for the sequence – a string.
.letter_annotations
– Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores (e.g. Section 20.1.6) or secondary structure information (e.g. from Stockholm/PFAM alignment files).
.annotations
– A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more “unstructured” information to the sequence.
.features
– A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence). The structure of sequence features is described below in Section 4.3.
.dbxrefs
- A list of database cross-references as strings.
 
SeqRecord 類包括下列方法:
.lower() # 將record序列全部變為小寫字母
.upper() # 將record序列全部變為大寫字母
.reverse_complement() # 對record對象進行反向互補,注意此時record的id、name、description等屬性都會變化,如果不指定則為'<unknown>'。
.format() # 對record對象進行格式化,常用於文件格式的轉換。
 
注意,SeqRecord 對象沒有 .len 或者 .length 這樣求序列長度的方法,想獲得序列長度,可以:len(seq_record) 或者 len(seq_record.seq)
 
5. SeqRecord.Seq 對象內置的方法
.count()
- 對Seq對象中的字符計數,類似字符串。
.translate()
- 將核酸序列翻譯為蛋白序列並返回,需提供密碼子表,默認是1.例如:
seq = seq_record.seq
seq.translate(table=set_table)
 
# 注意,SeqRecord對象沒有 translate() 方法,是Seq對象才有。
.complement()
- 獲得核酸序列的互補鏈,例如:seq.complement()
.reverse_complement()
- 獲得核酸序列的反向互補鏈,例如: seq.reverse_complement()
 
針對complement()、reverse_complement(),例子如下:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
 
As mentioned earlier, an easy way to just reverse a Seq object (or a Python string) is slice it with -1 step:
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
 
針對translate(),例子如下:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
 
備注,Bio.Alphabet 在biopython 1.78以上版本已經刪除。詳情見: https://biopython.org/wiki/Alphabet
 
6. 統計序列中的GC含量
1)使用 Seq.count() 方法
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
 
2)使用 Bio.SeqUtils 模塊
Bio.SeqUtils 模塊已經 建立了好幾個GC函數,類如:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
# 返回的是GC百分比
 
注意在使用 Bio.SeqUtils.GC() 函數時會自動處理序列和可代表G或者C的歧意核苷酸 字母S混合的情況。
 
Bio.SeqUtils 模塊中,GC_skew(seq, window=100) Calculate GC skew (G-C)/(G+C) for multiple windows along the sequence.
 
 
7. 在N處切分序列
例如:
def splitScaff_N(scaff_file, out_file):
    output = open(out_file, "w")
    for seq_record in SeqIO.parse(scaff_file, "fasta"):
        num = 1
        seq_seq = str(seq_record.seq)
        compile_mo = re.compile(r"N+", re.I)
        substrs = compile_mo.split(seq_seq)
        seq_id = seq_record.id
        seq_desc = seq_record.description.replace(seq_id, "")
        seq_desc = seq_desc.strip()
        for item in substrs:
            new_id = seq_id + "_" + str(num)
            if seq_desc != "":
                output.write(">%s %s\n%s\n" % (new_id, seq_desc, item))
            else:
                output.write(">%s\n%s\n" % (new_id, item))
            num += 1
    output.close()
 
8. fastq 處理
(1)識別quality
 
(2)切取部分序列(針對fastq)
如果對SeqRecord進行切片,letter_annotations會自動被切片,例如取最后10個基數:
>>> sub_record = record[-10:]
>>> print("%s %s" % (sub_record.id, sub_record.seq))
slxa_0001_1_0001_01 ACGTNNNNNN
>>> print(sub_record.letter_annotations["solexa_quality"])
[4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
 
如果只截取前50bp的fastq序列。
示例:
import sys, os, re
import shutil
import glob
import gzip
from Bio import SeqIO

def cut50fq(fq1):
    # yield
    with gzip.open(fq1, 'rt') as f1:
        for rec in SeqIO.parse(f1, 'fastq'):
            sub50 = rec[0:50]
            yield sub50

qq = sys.argv[1]
out_dir = sys.argv[2]
out_qq = os.path.join(out_dir, os.path.basename(qq))
with gzip.open(out_qq, 'wt') as fout2:
    for rec in cut50fq(qq):
        SeqIO.write(rec, fout2, 'fastq')
 
三、創建SeqRecord對象
簡單地創建一個 SeqRecord 對象,例如:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                       IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print(record)
 
下面我們應用 SeqRecord 對象進行序列處理。
如何將基因核酸序列翻譯為蛋白序列?
例子:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

seqFile = os.path.abspath(sys.argv[1])
outSeq = os.path.abspath(sys.argv[2])
code_table = sys.argv[3]
if not code_table:
    code_table = 11

handle = open(outSeq, "w")
for seq_record in SeqIO.parse(seqFile, "fasta"):
    seq_id = seq_record.id
    seq_seq = seq_record.seq
    protein_seq = seq_seq.translate(table=code_table)
    protein_record = SeqRecord(Seq(str(protein_seq).rstrip("*"), IUPAC.protein), id=seq_id, description="")
    SeqIO.write(protein_record, handle, "fasta")
handle.close()
 
 
根據prodigal預測結果(gff文件)取序列。代碼如下:
import sys, os, re
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

def Prodigal_Parser(fastaf, gfff, outPrefix, codon=11):
    fastaf = os.path.abspath(fastaf)
    gfff = os.path.abspath(gfff)
    
    # read fasta sequences
    seq_Dicts = {}
    for seq_record in SeqIO.parse(fastaf, 'fasta'):
        seq_id = seq_record.id
        #seq_desc = seq_record.description.replace(seq_id, "")
        #seq_desc = seq_desc.strip()
        seq_Dicts[seq_id] = seq_record

    # read the gff file of prodigal result
    number = 1
    nuc_out = open(outPrefix+"_gene.ffn", 'w')
    protein_out = open(outPrefix+'_protein.faa', 'w')
    with open(gfff, 'r') as fh:
        for line in fh:
            if re.search(r"^#|^\s*$", line):
                continue
            if re.search(r"CDS", line):
                fields = line.strip().split("\t")
                scaf_id = fields[0]
                start_site = int(fields[3]) - 1
                end_site = int(fields[4])
                substr = seq_Dicts[scaf_id].seq[start_site:end_site]
                subrecord = SeqRecord(substr, id=scaf_id+"_orf"+str(number), description='')
                if fields[6] == '-':
                    rc_seq = subrecord.seq.reverse_complement()
                    subrecord = SeqRecord(rc_seq, id=subrecord.id, description='')
                # 輸出核酸序列
                SeqIO.write(subrecord, nuc_out, 'fasta')
                # protein
                protein = subrecord.seq.translate(table=codon)
                protein_record = SeqRecord(Seq(str(protein).rstrip("*"), IUPAC.protein), id=subrecord.id, description='')
                SeqIO.write(protein_record, protein_out, 'fasta')
                number += 1
    
    nuc_out.close()
    protein_out.close()

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python3 %s <genome> <prodigal_gff> <out_prefix> [codon_table]" % sys.argv[0], file=sys.stderr)
        sys.exit(1)

    fastaFile = sys.argv[1]
    gffFile = sys.argv[2]
    out_prefix = sys.argv[3]
    if len(sys.argv) >= 5:
        codon_table = sys.argv[4]
    else:
        codon_table = 11
    #Prodigal_Parser(fastaFile, gffFile, 'prodigal')
    Prodigal_Parser(fastaFile, gffFile, out_prefix, codon_table)
 
四、解析Genbank文件(gbk,gbff)
 
4.1 簡單說明
這里以 NC_005816.gbf 為例。
>>> from Bio import SeqIO
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> record = SeqIO.read("NC_005816.gbf", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['BioProject:PRJNA224116', 'BioSample:SAMN02602970', 'Assembly:GCF_000007885.1'])
>>> len(record) # 返回seq長度(bp)
9609
>>> len(record.features) # 返回features個數
27
 
說明:
genbank 的 SeqRecord 類非常簡單,包括下列屬性:
.seq
– 序列自身(即 Seq 對象)。
.id
– 序列主ID(-字符串類型)。通常類同於accession number。
.name
– 序列名/id (-字符串類型)。 可以是accession number, 也可是clone名(類似GenBank record中的LOCUS id)。
.description
– 序列描述(-字符串類型)。
備注,name 源於 LOCUS行, id 附加了版本后綴。description源於DEFINITION 行。
.letter_annotations
– 對照序列的每個字母逐字注釋(per-letter-annotations),以信息名為鍵(keys),信息內容為值(value)所構成的字典。值與序列等長,用Python列表、元組或字符串表示。.letter_annotations可用於質量分數(如第 18.1.6 節) 或二級結構信息 (如 Stockholm/PFAM 比對文件)等數據的存儲。
 
dbxrefs 列表中的數據來自 PROJECT 或DBLINK行:
>>> record.dbxrefs
['BioProject:PRJNA224116', 'BioSample:SAMN02602970', 'Assembly:GCF_000007885.1']
# 返回的是列表
 
多數注釋信息儲存在 annotations 字典中,例如:
>>> record.annotations
{'molecule_type': 'DNA', 'topology': 'circular', 'data_file_division': 'CON', 'date': '10-APR-2017', 'accessions': ['NC_005816'], 'sequence_version': 1, 'keywords': ['RefSeq'], 'source': 'Yersinia pestis biovar Microtus str. 91001', ... }
 
features 列表以 SeqFeature 對象(通過函數 record.features獲得)的形式保存了features table中的所有entries(如genes和CDS等)。
Genbank中feature列表說明見: https://www.insdc.org/documents/feature-table
 
4.2 SeqFeature 對象
SeqFeature 對象包含的屬性:
.type
– 用文字描述的feature類型 (如 ‘CDS’ 或 ‘gene’)。
.location
– SeqFeature 在序列中所處的位置。見第 4.3.2 節。 SeqFeature 設計了眾多針對location對象的功能,包含一系列簡寫的屬性。
.ref
– .location.ref 簡寫 –location對象相關的參考序列。通常為空(None)。
.ref_db
– .location.ref_db 簡寫 – 指定 .ref 相關數據庫名稱。通常為空(None)。
.strand
– .location.strand 簡寫 – 表示feature所處序列的strand。在雙鏈核酸序列中,1表示正鏈, -1表示負鏈, 0 表示strand信息很重要但未知, None表示strand信息未知且不重要。蛋白和單鏈核酸序列為None。
.location.start 起始位置,為ExactPosition(准確位點),用一個數字表示。可通過int()強制性轉為一個整數。 (注意,這里是python計數,從0開始)
.location.end 起始位置。
.location.nofuzzy_start 和 .location.nofuzzy_end 為了兼容舊版Biopython,保留了整數形式的 nofuzzy_start and nofuzzy_end,其跟int(.location.start) 和 int(.location.end) 等同。
.qualifiers
– 存儲feature附加信息(Python字典)。鍵(key)為值(value)所存信息的單字簡要描述,值為實際信息。比如,鍵為 “evidence” ,而值為 “computational (non-experimental)”。 這只是為了提醒人們注意,該feature沒有被實驗所證實(濕實驗)。 Note:為與GenBank/EMBL文件中的feature tables對應,規定.qualifiers 中值為字符串數組(即使只有一個字符串)。
.sub_features
– 只有在描述復雜位置時才使用,如 GenBank/EMBL文件中的 ‘joins’ 位置。 已被 CompoundLocation 對象取代,因此略過不提。
 
>>> for seq_feature in record.features:
... print(seq_feature.qualifiers)
...
OrderedDict([('locus_tag', ['YP_RS22210']), ('old_locus_tag', ['pPCP01', 'YP_pPCP01'])])
...
# seq_feature為字典(dict)對象,每個feature中特征值為字典中元素。
 
示例:
>>> for seq_feature in record.features:
... if seq_feature.type == "CDS":
... print(seq_feature.type)
... print(seq_feature.location)
... print(seq_feature.ref)
... print(seq_feature.ref_db)
... print(seq_feature.strand)
...
CDS
[3102:3337](+)
None
None
1
CDS
[<3358:3442](+)
None
None
1
 
4.3 SeqFeature 對象的location的描述
location的模糊Positions用5個類分別描述:
ExactPosition
– 精確位點,用一個數字表示。從該對象的 position 屬性可得知精確位點信息。
BeforePosition
– 位於某個特定位點前。如 `<13' , 在GenBank/EMBL中代表實際位點位於13之前。從該對象的 position 屬性可得知上邊界信息。
AfterPosition
– 與 BeforePosition 相反,如 `>13' , 在GenBank/EMBL中代表實際位點位於13以后。從該對象的 position 屬性可獲知下邊界信息。
WithinPosition
– 介於兩個特定位點之間,偶爾在GenBank/EMBL locations用到。如 ‘(1.5)’, GenBank/EMBL中代表實際位點位於1到5之間。該對象需要兩個position屬性表示,第一個 position 表示下邊界(本例為1), extension 表示上邊界與下邊界的差值(本例為4)。因此在WithinPosition中, object.position 表示下邊界, object.position + object.extension 表示上邊界。
OneOfPosition
– 表示幾個位點中的一個(GenBank/EMBL文件中偶爾能看到),比如在基因起始位點不明確或者有兩個候選位點的時候可以使用,或者用於明確表示兩個相關基因特征時使用。
UnknownPosition
– 代表未知位點。在GenBank/EMBL文件中沒有使用,對應 UniProt中的 ‘?’ feature坐標。
 
在創建SeqFeature對象時,需要 FeatureLocation 對象作為其location屬性,例如創建一個FeatureLocation對象,則為:my_location = SeqFeature.FeatureLocation(start_pos, end_pos),那么創建SeqFeature對象則為:
example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
 
Location檢測:
可用Python關鍵詞 in 檢驗某個鹼基或氨基酸殘基的父坐標是否位於 feature/location中。
假定你想知道某個SNP位於哪個feature里,並知道該SNP的索引位置是4350(Python 計數)。一個簡單的實現方案是用循環遍歷所有features:
>>> for seq_feature in record.features:
... if my_snp in seq_feature:
... print(seq_feature.type, seq_feature.qualifiers.get('locus_tag'), seq_feature.qualifiers.get('db_xref'))
...
source None ['taxon:229193']
 
根據feature位置信息取子序列:
SeqFeature 或 location object對象並沒有直接包含任何序列,只是可根據儲存的location (見第 4.3.2 節),從父序列中取得。例如:某一短基因位於負鏈5:18位置 (注意,這里是python計數,從0開始),由於GenBank/EMBL文件以1開始計數,那么在GenBank/EMBL文件中表示為 complement(6..18) :
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
# 構建一個SeqFeature對象
 
你可以用切片從父序列截取5:18,然后取反向互補序列。如果是Biopython 1.59或以后版本,可使用如下方法:
>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print feature_seq
AGCCTTTGCCGTC
 
不過在處理復合 features (joins)時,此法相當繁瑣。此時可以使用 SeqFeature 對象的 extract 方法處理:
>>> feature_seq = example_feature.extract(example_parent) # example_parent是一個Seq對象,返回一個Seq對象。
>>> print feature_seq
AGCCTTTGCCGTC
 
SeqFeature 或 location對象的長度等同於所表示序列的長度。
>>> print example_feature.extract(example_parent)
AGCCTTTGCCGTC
>>> print len(example_feature.extract(example_parent))
13
>>> print len(example_feature)
13
>>> print len(example_feature.location)
13
 
4.4 CompoundLocation對象
CompoundLocation 對象用於描述復雜位置,如 GenBank/EMBL文件中的 ‘joins’ 位置。
 
CompoundLocation 對象有以下屬性:
.strand Overall strand of the compound location.
.start Start location - left most (minimum) value, regardless of strand.
.end End location - right most (maximum) value, regardless of strand.
.nofuzzy_start Start position (integer, approximated if fuzzy, read only) (OBSOLETE).
.nofuzzy_end End position (integer, approximated if fuzzy, read only) (OBSOLETE).
.ref Not present in CompoundLocation, dummy method for API compatibility.
.ref_db Not present in CompoundLocation, dummy method for API compatibility.
.operator 一般返回"join"
.parts 不同Location區域的信息
 
如何取CompoundLocation 對象中部分區域Location信息?例如:
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import FeatureLocation
>>> dna = Seq("nnnnnAGCATCCTGCTGTACnnnnnnnnGAGAMTGCCATGCCCCTGGAGTGAnnnnn")
>>> small = FeatureLocation(5, 20, strand=1)
>>> large = FeatureLocation(28, 52, strand=1)
>>> location = small + large
>>> print(small)
[5:20](+)
>>> print(large)
[28:52](+)
>>> print(location)
join{[5:20](+), [28:52](+)}
>>> for part in location.parts:
... print(len(part))
...
15
24
 
如何提取序列?
.extract(self, parent_sequence)
Extract the sequence from supplied parent sequence using the CompoundLocation object.
 
The parent_sequence can be a Seq like object or a string, and will generally return an object of the same type. The exception to this is a MutableSeq as the parent sequence will return a Seq object.
 
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> from Bio.SeqFeature import FeatureLocation, CompoundLocation
>>> seq = Seq("MKQHKAMIVALIVICITAVVAAL", generic_protein)
>>> fl1 = FeatureLocation(2, 8)
>>> fl2 = FeatureLocation(10, 15)
>>> fl3 = CompoundLocation([fl1,fl2])
>>> fl3.extract(seq)
Seq('QHKAMILIVIC', ProteinAlphabet())
 
 
4.5 參考文獻(reference)
Biopython通過 Bio.SeqFeature.Reference 對象來儲存相關的文獻信息。
References屬性儲存了 期刊名(journal) 、 題名(title) 、 作者(authors) 等信息。此外還包括 medline_id 、 pubmed_id 以及 comment 。
文獻對象都以列表儲存在 SeqRecord 對象的 annotations 字典中。 字典的鍵為 “references”。
如:
>>> record = SeqIO.read("NC_005816.gbf", "genbank")
>>> record.annotations['references']
[Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...)]
>>> record.annotations['references'][0] # 取第一個reference信息
Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...)
>>> record.annotations['references'][0].title
'Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus'
>>> record.annotations['references'][0].authors
'Zhou,D., Tong,Z., Song,Y., Han,Y., Pei,D., Pang,X., Zhai,J., Li,M., Cui,B., Qi,Z., Jin,L., Dai,R., Du,Z., Wang,J., Guo,Z., Wang,J., Huang,P. and Yang,R.'
>>> record.annotations['references'][0].journal
'J. Bacteriol. 186 (15), 5147-5152 (2004)'
 
4.6 格式化方法
SeqRecord 類中的 format() 能將字符串轉換成被 Bio.SeqIO 支持的格式,如FASTA:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
             +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
         +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                      +"SSAC", generic_protein),
                   id="gi|14150838|gb|AAK54648.1|AF376133_1",
                   description="chalcone synthase [Cucumis sativus]")
print(record.format("fasta"))
 
format 方法接收單個必選參數,小寫字母字符串是 Bio.SeqIO 模塊支持的輸出格式。但是不適用於包含多條序列的文件格式 (如多序列比對格式)。
 
4.7 輸出genbank文件
首先構建SeqRecord對象,為SeqRecord對象添加SeqFeature屬性,然后利用SeqIO.write()輸出genbank文件。
例子1:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

out_file = "your_file.gbf"
seq = Seq("GATCGATCGATCGATCGATC", IUPAC.ambiguous_dna)
rec = SeqRecord(seq, id="ID1", name="ID1", description="test gbk seq ID1")
qualifiers = {"source": "prediction", "score": 10.0, "other": ["Some", "annotations"],
              "ID": "gene1"}
sub_qualifiers = {"source": "prediction"}
top_feature = SeqFeature(FeatureLocation(0, 20), type="gene", strand=1,
                         qualifiers=qualifiers)
sub_features = SeqFeature(FeatureLocation(0, 5), type="exon", strand=1,qualifiers=sub_qualifiers)
sub_features2 = SeqFeature(FeatureLocation(15, 20), type="exon", strand=1,qualifiers=sub_qualifiers)
rec.features = [top_feature, sub_features, sub_features2]
with open(out_file, "w") as out_handle:
    SeqIO.write(rec, out_handle, 'genbank')
 
 
五、解析GFF文件
目前,Biopython沒有整合解析GFF文件的功能,可以使用 bcbio-gff 和 gffutils 模塊。見:
 
參考:
 
1. 安裝模塊
安裝 gffutils:
pip install gffutils
 
安裝bcbio-gff:
pip3 install bcbio-gff
導入模塊:
from BCBio.GFF import GFFExaminer
 
2. 將其他文件格式轉為gff
例如,將genbank文件轉為gff文件,
>>> from Bio import SeqIO
>>> from BCBio import GFF
>>> in_file = "Escherichia_coli_M1.gbf"
>>> out_file = "test.gff"
>>> in_handle = open(in_file)
>>> out_handle = open(out_file, "w")
>>> GFF.write(SeqIO.parse(in_handle, "genbank"), out_handle)
 
注意了,結果中會將genbank文件中的complete header info 取出作為一行,同時FEATURES中的source info 也被取出作為一行,然而我們往往要忽略這2個信息。所以,代碼可以改為這樣:
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "Escherichia_coli_M1.gbf"
    out_file = "test.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    records = []
    for record in SeqIO.parse(in_handle, "genbank"):
        # delete annotations
        record.annotations = {}
        # loop through features to find the source
        for i in range(0,len(record.features)):
            # if found, delete it and stop (only expect one source)
            if(record.features[i].type == "source"):
                record.features.pop(i)
                break
        records.append(record)

    GFF.write(records, out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()
 
 
 
六、序列格式之間的轉換
 
根據ID或者根據序列去除重復項: http://www.mamicode.com/info-detail-1596125.html
 
 
七、解析比對結果文件
1、提取BSR(BLAST score ratios)值
使用模塊:blast-score-ratio : https://pypi.python.org/pypi/blast-score-ratio/1.0.6
備注,
Requires NCBI BLAST+ command line >=2.2.29 application available here:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
BLAST binaries must be in your computer's path.
Also requires biopython and matplotlib packages.
 
示例:
import blast_score_ratio.bsr as bsr
bsr.test()
 

 

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM