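Data exported from the database is formatted as shown below. The data between two consecutive <REC> markers holds all the fields of one record, such as <TITLE>, <ABSTRACT>, and <SUBJECT_CODE>. Some fields of a record may be empty,
in which case the field is omitted from the exported text file altogether: record 3 has no <ABSTRACT> field, and record 4 has no <SUBJECT_CODE> field.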
<REC>(record 1)
<TITLE>=Regulation of the protein disulfide proteome by mitochondria in mammalian cells.
<ABSTRACT>=The majority of protein disulfides in cells is considered an important inert structural, rather than a dynamic regulatory, determinant of protein function.
<SUBJECT_CODE>=A006_8;D050_42;A006_62
<REC>(record 2)
<TITLE>=Selective control of cortical axonal spikes by a slowly inactivating K+ current.
<ABSTRACT>=Neurons are flexible electrophysiological entities in which the distribution and properties of ionic channels control their behaviors.
<SUBJECT_CODE>=E057_6;E062_318;I135_46
<REC>(record 3)
<TITLE>=Coupling of hydrogenic tunneling to active-site motion in the hydrogen radical transfer catalyzed by a coenzyme B12-dependent mutase.
<SUBJECT_CODE>=B016_11;B014_32;B014_54
<REC>(record 4)
<TITLE>=Hyaluronic acid hydrogel for controlled self-renewal and differentiation of human embryonic stem cells.
<ABSTRACT>=Control of self-renewal and differentiation of human ES cells (hESCs) remains a challenge.
<REC>(record 5)
<TITLE>=Biologically inspired crack trapping for enhanced adhesion.
<ABSTRACT>=We present a synthetic adaptation of the fibrillar adhesion surfaces found in nature.
<SUBJECT_CODE>=A004_57;B022_73;C034_22
<REC>(record 6)
<TITLE>=Identification of a retroviral receptor used by an envelope protein derived by peptide library screening.
<ABSTRACT>=This study demonstrates the power of a genetic selection to identify a variant virus that uses a new retroviral receptor protein.
<SUBJECT_CODE>=A006_8;E059_A;E059_5
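To make the missing-field behaviour concrete, here is a minimal sketch of turning the lines of one record into a dictionary. The helper name `parse_record` is hypothetical, and the sketch assumes that each field sits on its own line in the form `<NAME>=value`; leaving absent fields out of the dictionary is just one possible convention.

```python
import re

# Matches a field line such as "<TITLE>=Some text"; group 1 is the field name.
FIELD_RE = re.compile(r'^<([A-Z_]+)>=(.*)$')


def parse_record(lines):
    """Turn the lines of one record (including its <REC> line) into a dict.

    Hypothetical helper: fields missing from the export are simply absent
    from the result, so callers should use dict.get() to read them.
    """
    fields = {}
    for line in lines:
        match = FIELD_RE.match(line.rstrip('\r\n'))
        if match:  # the <REC> line has no "=" after the tag and is skipped
            fields[match.group(1)] = match.group(2)
    return fields


# Record 3 from the sample above: it has no <ABSTRACT> line.
record3 = [
    '<REC>\n',
    '<TITLE>=Coupling of hydrogenic tunneling to active-site motion in the '
    'hydrogen radical transfer catalyzed by a coenzyme B12-dependent mutase.\n',
    '<SUBJECT_CODE>=B016_11;B014_32;B014_54\n',
]
parsed = parse_record(record3)
abstract = parsed.get('ABSTRACT')          # None, because the field was not exported
codes = parsed['SUBJECT_CODE'].split(';')  # the individual subject codes
```

Any later processing step therefore has to treat every field as optional rather than index into a fixed number of lines per record.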
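1. When data is exported from the database, the export files (plain .txt files) for some tables are around 3-4 GB in size and cannot be read into memory in one go.
2. When a text file larger than about 600 MB is read with the getlines function of Python's linecache module, the result is sometimes empty, depending on the state of the PC at the time (for example, when memory is short).
Note: linecache.getlines() ultimately calls file.readlines() to read the whole file at once. If the file is too large, the resulting MemoryError is caught inside getlines(), which then returns an empty list.
3. Reading the text line by line has two drawbacks. First, it is inconvenient for the later processing steps, which work on whole records rather than on individual lines. Second, it is slow.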
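The silent failure described in point 2 is easy to miss, because an empty list looks exactly like an empty file. A minimal sketch of a guard follows. `getlines_checked` is a hypothetical name, and comparing against `os.path.getsize` is only one assumed way of telling the two cases apart.

```python
import linecache
import os


def getlines_checked(path):
    """Return linecache.getlines(path), but reject a suspicious empty result.

    Hypothetical helper: linecache.getlines() returns [] when it cannot load
    the file, so an empty list for a non-empty file is treated as an error.
    """
    lines = linecache.getlines(path)
    if not lines and os.path.getsize(path) > 0:
        raise RuntimeError('linecache returned no lines for %s' % path)
    return lines
```

This only detects the problem. It does not make a multi-gigabyte file fit into memory.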
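It is therefore worth designing a reader for this file format that reads a block of roughly a given size while guaranteeing that every record in the block is complete, so that the last record never contains only some of its fields.
The implementation is as follows: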
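#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from time import time

# Record delimiter. The file is opened in binary mode, so compare against bytes.
REC_STR = b'<REC>'


def read_text_in_buffer_multi_line(fd, length, label):
    """Read about `length` bytes starting at offset `label`, keeping only complete records.

    Returns (lines, new_label), where new_label is the offset at which the
    next call should start reading.
    """
    fd.seek(label, 0)  # move to the offset where the previous call stopped
    buffer = fd.readlines(length)  # read whole lines totalling about `length` bytes
    line = fd.readline()  # read one more line to find out whether the file has ended
    label = fd.tell()  # current file position
    if not line:
        return buffer, label  # end of file: everything in the buffer is complete

    # Not at end of file: the last record may be truncated, so discard
    # everything from the last <REC> line onward.
    buffer_post = []
    while buffer:
        temp = buffer.pop()
        buffer_post.append(temp)
        if temp.startswith(REC_STR):  # reached the <REC> line of the last record
            break
    if not buffer:
        # The block held no complete record. Returning an empty list would be
        # mistaken for end of file, so fail loudly instead.
        raise ValueError('length is too small to hold one complete record')

    # Rewind by the discarded bytes plus the extra line read for the end-of-file check.
    label -= sum(len(item) for item in buffer_post) + len(line)
    return buffer, label


if __name__ == "__main__":
    filename = "Data\\SJWD_U.txt"
    readlen = 100000 * 210  # number of bytes to read per call
    label = 0
    begin = time()
    with open(filename, 'rb') as fd, open("out.txt", 'wb') as fout:
        while True:
            buffer_list, label = read_text_in_buffer_multi_line(fd, readlen, label)
            if not buffer_list:
                break
            fout.writelines(buffer_list)
    end = time()
    print("time:", end - begin)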
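Because each returned list ends on a record boundary, it can be regrouped into records without worrying about a truncated tail. The following is a minimal sketch of that step. `group_records` is a hypothetical helper that is not part of the code above, and it assumes byte-string lines as produced by a file opened in 'rb' mode.

```python
REC_STR = b'<REC>'


def group_records(lines):
    """Split a list of lines into records, each a list starting with a <REC> line.

    Hypothetical helper: lines that appear before the first <REC> are ignored.
    """
    records = []
    for line in lines:
        if line.startswith(REC_STR):
            records.append([line])    # a new record begins here
        elif records:
            records[-1].append(line)  # field line of the current record
    return records


# Stand-in for one list returned by read_text_in_buffer_multi_line.
buffer_list = [
    b'<REC>\n', b'<TITLE>=A\n', b'<ABSTRACT>=B\n',
    b'<REC>\n', b'<TITLE>=C\n',
]
records = group_records(buffer_list)
for record in records:
    print(len(record), record[1])
```

Grouping per block rather than per file keeps memory use bounded by the chosen read length.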