Today, while trying to read a very large CSV file, I ran into trouble: Office could not open it at all, and opening it in Python with a plain pandas.read_csv raised a MemoryError.
After digging through the read_csv documentation, I found that the file can be read in chunks.
read_csv has a chunksize parameter: by specifying a chunk size, the file is read piece by piece instead of all at once.
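As a minimal sketch of the two chunked-reading modes (with 'file.csv' as a placeholder name): passing chunksize makes read_csv return a reader that yields DataFrames one chunk at a time, while iterator=True returns the same kind of reader from which chunks are pulled manually with get_chunk.

```
import pandas as pd

# With chunksize, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
reader = pd.read_csv('file.csv', chunksize=1000)
first = next(reader)   # a DataFrame with up to 1000 rows
print(type(first), len(first))

# iterator=True gives the same kind of reader; chunks are then
# pulled explicitly with get_chunk(n).
reader = pd.read_csv('file.csv', iterator=True)
chunk = reader.get_chunk(500)  # the next 500 rows as a DataFrame
print(len(chunk))
```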
1. Counting values chunk by chunk
```
from collections import Counter
import pandas as pd

size = 2 ** 10  # number of rows per chunk
counter = Counter()
for chunk in pd.read_csv('file.csv', header=None, chunksize=size):
    # Tally the values in the first column of each chunk
    counter.update([i[0] for i in chunk.values])
print(counter)
```

The output looks roughly like this:

```
Counter({100: 41, 101: 40, 102: 40, ... 150: 35})
```
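As an aside, the same counts can also be computed with pandas' own value_counts, summing across chunks (a sketch under the same header=None assumption, so the first column is named 0):

```
import pandas as pd

size = 2 ** 10
total = None
for chunk in pd.read_csv('file.csv', header=None, chunksize=size):
    # Count values in the first column (column 0) within this chunk
    counts = chunk[0].value_counts()
    total = counts if total is None else total.add(counts, fill_value=0)
print(total.astype(int).sort_index())
```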
2. Read in chunks into a list of DataFrames, then concat them into one complete DataFrame
```
import pandas as pd

data = pd.read_csv(path + "dika_num_trainall.csv", sep=',', engine='python', iterator=True)
loop = True
chunkSize = 100000
chunks = []
while loop:
    try:
        # Pull the next chunk of up to chunkSize rows
        chunk = data.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
print('Start concatenating')
df_train = pd.concat(chunks, ignore_index=True)
```
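For comparison, a sketch of the same merge without the while/try loop: passing chunksize instead of iterator=True lets you iterate over the reader directly, which stops on its own at the end of the file (`path` is assumed to be defined as in the snippet above).

```
import pandas as pd

chunkSize = 100000
# Iterating over the reader yields each chunk in turn and ends
# automatically, so no StopIteration handling is needed.
chunks = [chunk for chunk in
          pd.read_csv(path + "dika_num_trainall.csv", sep=',',
                      engine='python', chunksize=chunkSize)]
df_train = pd.concat(chunks, ignore_index=True)
```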