Reading a fixed-size chunk of a formatted file in Python

This article is a repost. Published 2016-03-16 09:38. Tag: Python

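Data exported from the database is formatted as shown below. The lines between two consecutive <REC> markers hold all the fields of one record, such as <TITLE>, <ABSTRACT>, and <SUBJECT_CODE>. A record may have empty fields, and an empty field is simply omitted from the exported text file: record 3 lacks <ABSTRACT>, and record 4 lacks <SUBJECT_CODE>.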

<REC>(record 1)
<TITLE>=Regulation of the protein disulfide proteome by mitochondria in mammalian cells.
<ABSTRACT>=The majority of protein disulfides in cells is considered an important inert structural, rather than a dynamic regulatory, determinant of protein function. 
<SUBJECT_CODE>=A006_8;D050_42;A006_62
<REC>(record 2)
<TITLE>=Selective control of cortical axonal spikes by a slowly inactivating K+ current.
<ABSTRACT>=Neurons are flexible electrophysiological entities in which the distribution and properties of ionic channels control their behaviors.
<SUBJECT_CODE>=E057_6;E062_318;I135_46
<REC>(record 3)
<TITLE>=Coupling of hydrogenic tunneling to active-site motion in the hydrogen radical transfer catalyzed by a coenzyme B12-dependent mutase.
<SUBJECT_CODE>=B016_11;B014_32;B014_54
<REC>(record 4)
<TITLE>=Hyaluronic acid hydrogel for controlled self-renewal and differentiation of human embryonic stem cells.
<ABSTRACT>=Control of self-renewal and differentiation of human ES cells (hESCs) remains a challenge. 
<REC>(record 5)
<TITLE>=Biologically inspired crack trapping for enhanced adhesion.
<ABSTRACT>=We present a synthetic adaptation of the fibrillar adhesion surfaces found in nature. 
<SUBJECT_CODE>=A004_57;B022_73;C034_22
<REC>(record 6)
<TITLE>=Identification of a retroviral receptor used by an envelope protein derived by peptide library screening.
<ABSTRACT>=This study demonstrates the power of a genetic selection to identify a variant virus that uses a new retroviral receptor protein. 
<SUBJECT_CODE>=A006_8;E059_A;E059_5
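
Because a field may be absent, code that consumes these records cannot rely on field positions. Below is a minimal sketch of turning the lines of one record into a dictionary, assuming every field line has the form <NAME>=value as in the sample above. The helper name parse_record and the regular expression are illustrative assumptions, not part of the original script.

```python
import re

# Matches a field line such as "<TITLE>=Some text" (assumed line format)
FIELD_RE = re.compile(r'^<(\w+)>=(.*)$')


def parse_record(lines):
    """Turn the lines of one record into a {field: value} dict (hypothetical helper)."""
    record = {}
    for line in lines:
        match = FIELD_RE.match(line.strip())
        if match:  # the <REC> line and blank lines do not match and are skipped
            record[match.group(1)] = match.group(2)
    return record


record3 = parse_record([
    "<REC>",
    "<TITLE>=Coupling of hydrogenic tunneling to active-site motion in the "
    "hydrogen radical transfer catalyzed by a coenzyme B12-dependent mutase.",
    "<SUBJECT_CODE>=B016_11;B014_32;B014_54",
])
print(record3.get("ABSTRACT"))  # None, because record 3 has no <ABSTRACT> field
```

Looking fields up with dict.get() lets a missing field surface as None rather than as an exception.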

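1. When data is exported from the database, the export files of some tables (plain txt files) take up about 3-4 GB and cannot be read into memory directly.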

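2. When a text file larger than 600 MB is read with the getlines() function of Python's linecache module, the result is sometimes empty, depending on the state of the PC at the time (for example, when memory is insufficient).

Note: linecache.getlines() ultimately calls file.readlines() to read the whole file at once. If the file is too large for the available memory, getlines() catches the resulting MemoryError and returns an empty list instead of the data.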

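3. Reading the text line by line has two drawbacks. First, it is inconvenient for the later stages, which need to process the data record by record rather than line by line. Second, it is slow.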

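It is therefore worth designing, for this kind of formatted file, a reader that reads a given amount of text while guaranteeing that every record in it is complete, so that the last record never contains only part of its fields.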
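
The requirement can be illustrated on an in-memory chunk: everything from the last <REC> line onward may be incomplete, so it is cut off and left for the next read. This is only a rough sketch of the idea, assuming <REC> always starts a line. The name split_at_last_record is hypothetical and used purely for illustration.

```python
REC = b"<REC>"


def split_at_last_record(chunk):
    """Split a chunk into (complete records, possibly incomplete tail)."""
    cut = chunk.rfind(b"\n" + REC)
    if cut == -1:
        # No later <REC> line, so no record can be shown to be complete
        return b"", chunk
    cut += 1  # keep the newline with the complete part
    return chunk[:cut], chunk[cut:]


chunk = b"<REC>\n<TITLE>=A\n<REC>\n<TITLE>=B\n<ABSTR"
complete, tail = split_at_last_record(chunk)
print(complete)  # the first record only
print(tail)      # the second, truncated record, to be re-read later
```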

The implementation is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
from time import time

REC_STR = b'<REC>'  # record marker; bytes, because the file is opened in binary mode


def read_text_in_buffer_multi_line(fd, length, label):
    """Read roughly `length` bytes of whole lines, starting at offset `label`.

    Returns (lines, new_label). Every record in `lines` is complete.
    """
    fd.seek(label, 0)  # move to the position where the previous call stopped
    buffer = fd.readlines(length)  # read whole lines totalling about `length` bytes
    line = fd.readline()  # read one more line to check for end of file
    at_eof = not line
    label = fd.tell()  # current file position

    # If the file has not ended, discard everything from the last <REC> onward,
    # because that record may be incomplete. At end of file, return the buffer as is.
    if not at_eof:
        discarded = []
        while buffer:
            temp = buffer.pop()
            discarded.append(temp)
            if temp.startswith(REC_STR):  # reached the last <REC>; stop
                break
        if not buffer:
            # The chunk does not contain even one complete record
            raise ValueError('length is too small to hold a complete record')
        # Rewind by the discarded bytes plus the extra line read above
        label = label - len(b''.join(discarded)) - len(line)
    return buffer, label


if __name__ == "__main__":
    filename = os.path.join("Data", "SJWD_U.txt")
    readlen = 100000 * 210  # number of bytes to read per call
    label = 0

    begin = time()
    with open(filename, 'rb') as fd, open("out.txt", 'wb') as fout:
        while True:
            buffer_list, label = read_text_in_buffer_multi_line(fd, readlen, label)
            if not buffer_list:  # an empty list means the end of the file
                break
            fout.writelines(buffer_list)
    end = time()
    print("time: %s" % (end - begin))
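
Since each returned list contains only whole records, later stages can work record by record. One possible sketch of that grouping step follows, assuming the lines are bytes as read from a file opened in binary mode. The helper group_records is hypothetical and not part of the original script.

```python
def group_records(lines, marker=b"<REC>"):
    """Group a flat list of lines into one list of lines per record."""
    records = []
    for line in lines:
        if line.startswith(marker):
            records.append([])  # a new record starts here
        if records:  # ignore anything before the first marker
            records[-1].append(line)
    return records


lines = [
    b"<REC>\n", b"<TITLE>=A\n", b"<ABSTRACT>=B\n",
    b"<REC>\n", b"<TITLE>=C\n",
]
for rec in group_records(lines):
    print(len(rec))  # prints 3, then 2
```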
