[Translation] Reading large Excel files with Pandas

Reposted on 2019-07-08 21:44. Tags: pandas, Python learning

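Last week I took part in a Dataviz Battle on the dataisbeautiful subreddit, where we had to create a visualization from the TSA claims dataset. I like this kind of competition because most of the time you end up learning a lot of useful things.
This time the data was very clean, but it was scattered across several PDF files and Excel files. While extracting the data from the PDFs I learned about a few tools and libraries, and in the end I used tabula-py, a Python wrapper for the Java library tabula. As for the Excel files, I found that a one-liner, a simple pd.read_excel, wasn't enough.
The biggest Excel file was about 7MB and contained a single worksheet with roughly 100k rows. I thought Pandas could read the file in one go without any problem (I have 10GB of RAM on my computer), but apparently I was wrong.
The solution was to read the file in chunks. The pd.read_excel function doesn't have a cursor like pd.read_sql, so I had to implement this logic manually. Here is what I did: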

import os
import pandas as pd


HERE = os.path.abspath(os.path.dirname(__file__))
DATA_DIR = os.path.abspath(os.path.join(HERE, '..', 'data'))


def make_df_from_excel(file_name, nrows):
    """Read from an Excel file in chunks and make a single DataFrame.

    Parameters
    ----------
    file_name : str
    nrows : int
        Number of rows to read at a time. These Excel files are too big,
        so we can't read all rows in one go.
    """
    file_path = os.path.abspath(os.path.join(DATA_DIR, file_name))
    xl = pd.ExcelFile(file_path)

    # In this case, there was only a single Worksheet in the Workbook.
    sheetname = xl.sheet_names[0]

    # Read only the header row (nrows=0 returns the column names and no
    # data), so every chunk can be given the same column labels.
    df_header = pd.read_excel(xl, sheet_name=sheetname, nrows=0)
    print(f"Excel file: {file_name} (worksheet: {sheetname})")

    chunks = []
    i_chunk = 0
    # The first row is the header. We have already read it, so we skip it.
    skiprows = 1
    while True:
        # Reuse the open workbook (xl) instead of reopening the file.
        df_chunk = pd.read_excel(
            xl, sheet_name=sheetname,
            nrows=nrows, skiprows=skiprows, header=None)
        skiprows += nrows
        # When there is no data, we know we can break out of the loop.
        if not df_chunk.shape[0]:
            break
        print(f"  - chunk {i_chunk} ({df_chunk.shape[0]} rows)")
        chunks.append(df_chunk)
        i_chunk += 1

    # ignore_index gives the combined frame one continuous row index.
    df = pd.concat(chunks, ignore_index=True)
    # Chunks were read with header=None, so their columns are 0, 1, 2, ...;
    # map those positions to the names taken from the header row.
    columns = {i: col for i, col in enumerate(df_header.columns.tolist())}
    df.rename(columns=columns, inplace=True)
    return df


if __name__ == '__main__':
    df = make_df_from_excel('claims-2002-2006_0.xls', nrows=10000)
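The skiprows/nrows bookkeeping in make_df_from_excel can also be pulled out into a small generator. What follows is only a minimal sketch of that idea, not part of the original script: iter_chunks, read_rows and fake_reader are hypothetical names, and a plain Python list stands in for the worksheet so the example runs without any Excel file.

```python
from typing import Callable, Iterator, Sequence


def iter_chunks(read_rows: Callable[[int, int], Sequence],
                nrows: int, header_rows: int = 1) -> Iterator[Sequence]:
    """Yield successive chunks until the reader returns an empty one.

    read_rows(skiprows, nrows) is a hypothetical callable that returns
    up to nrows rows after skipping skiprows rows.
    """
    if nrows <= 0:
        raise ValueError("nrows must be positive")
    # Skip the header row(s), which are assumed to be read separately.
    skiprows = header_rows
    while True:
        chunk = read_rows(skiprows, nrows)
        # An empty chunk means we have moved past the last row.
        if len(chunk) == 0:
            break
        yield chunk
        skiprows += nrows


# Stand-in for a worksheet: one header row followed by 25 data rows.
sheet = ["header"] + [f"row {i}" for i in range(25)]


def fake_reader(skiprows, nrows):
    """Mimic a chunked read by slicing the in-memory list."""
    return sheet[skiprows:skiprows + nrows]


sizes = [len(chunk) for chunk in iter_chunks(fake_reader, nrows=10)]
print(sizes)  # [10, 10, 5]
```

With pandas, read_rows could be a small function wrapping pd.read_excel(..., skiprows=skiprows, nrows=nrows, header=None), since len() of a DataFrame is its number of rows.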

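One more thing to keep in mind: when working with Excel files in Python, you may need different packages depending on whether you are reading or writing data, and on whether the files are .xls or .xlsx.
This dataset contained both .xls and .xlsx files, so I had to use xlrd to read them. Note that if your only concern is reading .xlsx files, openpyxl is the way to go, even though xlrd might still be faster.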
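When a dataset mixes both formats, one option is to pick the reader from the file extension and hand it to the engine parameter of pd.read_excel. The helper below is a rough sketch under that assumption: pick_engine and the ENGINE_BY_EXTENSION mapping are hypothetical names, and claims-2014.xlsx is a made-up file name used only for illustration. Keep in mind that xlrd releases from 2.0 onward read only .xls files.

```python
import os

# Assumed mapping from file extension to the pandas `engine` argument.
ENGINE_BY_EXTENSION = {".xls": "xlrd", ".xlsx": "openpyxl"}


def pick_engine(file_name):
    """Return the reader engine name for an Excel file, based on its extension."""
    ext = os.path.splitext(file_name)[1].lower()
    try:
        return ENGINE_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported Excel extension: {ext!r}") from None


print(pick_engine("claims-2002-2006_0.xls"))  # xlrd
print(pick_engine("claims-2014.xlsx"))  # openpyxl (hypothetical file name)
```

The result can then be used as pd.read_excel(file_path, engine=pick_engine(file_path)), provided the corresponding package is installed.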
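I didn't write any Excel files this time, but if you need to, then you want xlsxwriter. I remember using it to create workbooks (i.e. Excel files) with many complex worksheets and cell comments. You can even use it to create worksheets with sparklines and VBA macros!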

Original source: https://www.giacomodebidda.com/reading-large-excel-files-with-pandas/
