Data Preprocessing for Machine Learning: Reading Excel Data with Pandas

Reposted article. 2018-06-30 20:52. Tags: data preprocessing, machine learning, reading Excel, Python, Pandas

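Python has many libraries for reading and writing Excel files; the best known are xlrd, xlwt, xlutils, and openpyxl. xlrd and xlwt are usually used together: xlrd reads Excel files and xlwt writes them. Combining xlutils with xlrd makes it possible to modify an existing Excel file. openpyxl can both read and write Excel files.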
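
As a rough illustration of the last point, the sketch below writes a small workbook with openpyxl and reads it back. The file name demo.xlsx and the cell values are made up for this example, and it assumes openpyxl is installed.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Hypothetical file location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")

# Write: create a workbook and append rows to its active sheet
wb = Workbook()
ws = wb.active
ws.append(["Name", "Value"])
ws.append(["string1", 1])
ws.append(["string2", 2])
wb.save(path)

# Read: reopen the same file and collect each row as a tuple of cell values
wb_in = load_workbook(path)
rows = list(wb_in.active.iter_rows(values_only=True))
print(rows)
```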

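When it comes to data preprocessing, pandas shows its strength. It can read and write many file formats, Excel among them. This article focuses on using pandas to preprocess Excel datasets.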

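Commonly used machine learning models place requirements on their input data. For most algorithms the most basic requirement is that the training data be converted to a numeric format. Some algorithms, such as decision trees, do not need this conversion, but those special cases are not discussed here.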
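
To make the numeric-format requirement concrete, here is a minimal sketch that finds the non-numeric columns of a DataFrame and replaces them with integer codes using pd.factorize. The column names and values are invented for illustration, and integer codes are only one possible encoding (one-hot encoding is another).

```python
import pandas as pd

# Toy data invented for this example: one numeric and one text column
df = pd.DataFrame({
    "height": [1.62, 1.75, 1.80, 1.68],
    "color": ["red", "blue", "red", "green"],
})

# Columns that are not already numeric and therefore need conversion
text_cols = df.select_dtypes(exclude="number").columns.tolist()

encoded = df.copy()
for col in text_cols:
    # factorize assigns 0, 1, 2, ... in order of first appearance
    codes, uniques = pd.factorize(encoded[col])
    encoded[col] = codes

print(text_cols)
print(encoded)
```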

The pandas function for reading Excel files is pandas.read_excel(). Its main parameters include:

io : path or location of the Excel file to read

string, path object (pathlib.Path or py._path.local.LocalPath), file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx

sheet_name : the sheet or sheets of the workbook to read

string, int, mixed list of strings/ints, or None, default 0

Strings are used for sheet names, Integers are used in zero-indexed sheet positions.

Lists of strings/integers are used to request multiple sheets.

Specify None to get all sheets.

str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.

Available Cases

Defaults to 0 -> 1st sheet as a DataFrame

1 -> 2nd sheet as a DataFrame

"Sheet1" -> 1st sheet as a DataFrame

[0, 1, "Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames

None -> All sheets as a dictionary of DataFrames
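
The sheet_name cases above can be sketched as follows. The workbook, its sheet names ("first", "second"), and its contents are made up for this example, and it assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Build a hypothetical two-sheet workbook to read from
path = os.path.join(tempfile.mkdtemp(), "two_sheets.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

# An int (zero-indexed position) or a str (sheet name) returns one DataFrame
by_position = pd.read_excel(path, sheet_name=1)
by_name = pd.read_excel(path, sheet_name="first")

# A list returns a dict keyed by the entries of the list
some_sheets = pd.read_excel(path, sheet_name=[0, "second"])

# None returns every sheet as a dict keyed by sheet name
all_sheets = pd.read_excel(path, sheet_name=None)

print(by_position.columns.tolist(), by_name.columns.tolist())
print(list(some_sheets), list(all_sheets))
```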

header : which row of the Excel file, if any, to use as the column names

int, list of ints, default 0

Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.

names : the name of each column, passed as an array-like

array-like, default None

List of column names to use. If file contains no header row, then you should explicitly pass header=None
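
A minimal sketch of header and names working together on a file without a header row. The file no_header.xlsx and its contents are invented for this example, and it assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical headerless file: the first row is already data
path = os.path.join(tempfile.mkdtemp(), "no_header.xlsx")
pd.DataFrame([["string1", 1], ["string2", 2]]).to_excel(
    path, header=False, index=False
)

# header=None stops pandas from consuming the first data row as labels;
# names supplies the column labels instead
df = pd.read_excel(path, header=None, names=["Name", "Value"])
print(df)
```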

index_col : which column of the Excel file, if any, to use as the row labels

int, list of ints, default None

Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

usecols : the columns to read; useful because a loaded Excel file often contains columns that are not needed

int, str, or list, default None

If None then parse all columns,

If int then indicates last column to be parsed (this integer form was deprecated in pandas 0.24; newer versions expect a list instead)

If list of ints then indicates list of column numbers to be parsed

If string then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides.
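
As a sketch of the usecols forms, the example below selects the same three columns once by Excel column letters and once by zero-indexed positions. The file wide.xlsx and its column names are hypothetical, and it assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical four-column file; in Excel terms A=id, B=note, C=f1, D=f2
path = os.path.join(tempfile.mkdtemp(), "wide.xlsx")
pd.DataFrame({
    "id": [1, 2],
    "note": ["x", "y"],
    "f1": [0.1, 0.2],
    "f2": [3, 4],
}).to_excel(path, index=False)

# String form: column letters and an inclusive range, skipping column B
by_letters = pd.read_excel(path, usecols="A,C:D")

# List form: the same columns by zero-indexed position
by_position = pd.read_excel(path, usecols=[0, 2, 3])

print(by_letters.columns.tolist())
print(by_position.columns.tolist())
```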

Below are some examples of reading Excel data with pandas:

Write a dataset to an Excel file:

    >>> import pandas as pd
    >>> df_out = pd.DataFrame([('string1', 1),
    ...                        ('string2', 2),
    ...                        ('string3', 3)],
    ...                       columns=['Name', 'Value'])
    >>> df_out
          Name  Value
    0  string1      1
    1  string2      2
    2  string3      3
    >>> df_out.to_excel('tmp.xlsx')


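Read the Excel file back. Because to_excel also wrote the DataFrame index as the first column, index_col=0 is passed so that this column becomes the index again rather than an extra unnamed data column:

    >>> pd.read_excel('tmp.xlsx', index_col=0)
          Name  Value
    0  string1      1
    1  string2      2
    2  string3      3
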

Setting both index_col and header to None tells pandas not to use the first row of the Excel file as column names or the first column as the index:

    >>> pd.read_excel('tmp.xlsx', index_col=None, header=None)
         0        1      2
    0  NaN     Name  Value
    1  0.0  string1      1
    2  1.0  string2      2
    3  2.0  string3      3


You can even specify the data type of individual columns:

    >>> pd.read_excel('tmp.xlsx', index_col=0,
    ...               dtype={'Name': str, 'Value': float})
          Name  Value
    0  string1    1.0
    1  string2    2.0
    2  string3    3.0


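The following combined example reads the sheet named sheet1 from text.xlsx and loads only columns D:F. Column F holds the class label, so its text values 類別1 (category 1) and 類別2 (category 2) need to be converted to numbers before the data can be used as input for a machine learning model.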

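import pandas as pd

def reader(path, sheet):
    # Load only Excel columns D through F from the given sheet
    return pd.read_excel(path, sheet_name=sheet, usecols='D:F')

trainrd = reader('text.xlsx', 'sheet1')
print(trainrd.head(5))  # inspect the first 5 rows
trainrd['x'] = 0  # new numeric label column x, defaulting to 0
# Convert the text in the 類別 (category) column to numbers
trainrd.loc[trainrd['類別'] == '類別1', 'x'] = 0
trainrd.loc[trainrd['類別'] == '類別2', 'x'] = 1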
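
The same text-to-number conversion can also be written with Series.map and an explicit dictionary. This is an alternative sketch, not the original author's code. The small DataFrame below stands in for the contents of text.xlsx, which is not available here, and its values are invented. Any label missing from the dictionary becomes NaN, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Stand-in for the data loaded from text.xlsx (values invented for illustration)
trainrd = pd.DataFrame({"類別": ["類別1", "類別2", "類別2", "類別1"]})

# Explicit mapping from text label to numeric class
label_to_number = {"類別1": 0, "類別2": 1}
trainrd["x"] = trainrd["類別"].map(label_to_number)

# Labels absent from the dictionary map to NaN, so count them as a sanity check
unmapped = int(trainrd["x"].isna().sum())
print(trainrd)
print("unmapped labels:", unmapped)
```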
