pandas Data Pre-processing
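Origins of pandas
pandas (the Python Data Analysis Library) is a tool built on top of NumPy and created to handle data analysis tasks. It brings together a large number of libraries and several standard data models, and provides the tools needed to work efficiently with large data sets, along with many functions and methods for processing data quickly and conveniently.
pandas is a Python data analysis package. Development began at AQR Capital Management in April 2008 and the project was open-sourced at the end of 2009. It is now developed and maintained by the PyData development team, which focuses on Python data packages, as part of the PyData project. pandas was originally built as a tool for financial data analysis, so it has strong support for time-series analysis. The name comes from "panel data" and "Python data analysis". Panel data is an economics term for multi-dimensional data sets; early versions of pandas also provided a Panel data type (deprecated in 0.20 and removed in 0.25).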
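Data structures in pandas
Series:
A one-dimensional array, similar to a one-dimensional NumPy array. Both resemble Python's built-in list. The difference is that a list may hold elements of different types, while an array or a Series normally stores a single data type (a mixed Series falls back to the generic object dtype), which uses memory more effectively and speeds up computation.
Time-Series:
A Series indexed by time.
DataFrame:
A two-dimensional tabular data structure. Much of its functionality resembles data.frame in R. A DataFrame can be thought of as a container of Series. The rest of this article focuses mainly on DataFrame.
Panel:
A three-dimensional array that can be thought of as a container of DataFrames (removed in pandas 0.25).
The usual nesting of data structures in pandas is DataFrame -> Series -> ndarray.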
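The DataFrame -> Series -> ndarray nesting can be checked directly. This is a minimal sketch on a made-up two-column table; the column names and values are illustrative assumptions, not data from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only to illustrate the nesting
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, 2.5, 4.0]})

col = df["score"]   # selecting one column of a DataFrame gives a Series
arr = col.values    # the values of a Series are a NumPy ndarray

print(type(df).__name__, type(col).__name__, type(arr).__name__)
```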
Installation:
pip install pandas
2.1 Constants
pass
2.2 Functions
2.2.1 read_csv() function
Function call: info = pd.read_csv(filename)
Purpose: reads the given csv file and builds a DataFrame from its contents
Parameters: filename
filename: str, the name of the file to read
Returns: info
info: DataFrame, built from the file that was read
Similar functions: read_excel / read_json / read_sql / read_html, etc.
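As a small sketch of read_csv, the snippet below parses csv text from an in-memory buffer instead of a file on disk, so it runs without any external file. The csv content is a made-up fragment, not the article's info.csv.

```python
import io

import pandas as pd

# Hypothetical csv text; the second row has an empty Number field
csv_text = "No.,Type,Number\n1001,BUTTER_1,4\n1002,BUTTER_2,\n"

# read_csv accepts a path or any file-like object
info = pd.read_csv(io.StringIO(csv_text))

print(info.shape)             # (rows, columns)
print(info.columns.tolist())  # the first line of the text became the column names
print(info["Number"])         # the empty field is read as NaN
```

The empty field turns the Number column into a float column, because NaN is a floating-point value.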
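2.2.2 isnull() function
Function call: mask = pd.isnull(obj)
Purpose: returns an object of the same shape that marks which values are null
Parameters: obj
obj: DataFrame/Series, the data to check
Returns: mask
mask: DataFrame/Series of booleans, True where the value is null and False otherwise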
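A minimal sketch of isnull on a made-up Series: the returned mask can be used directly for boolean indexing, and inverted with ~ to keep only the valid values.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([4.0, np.nan, 543.0])

mask = pd.isnull(s)   # True where the value is missing
missing = s[mask]     # only the missing entries
valid = s[~mask]      # ~ inverts the mask

print(mask.tolist())
print(valid.tolist())
```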
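2.2.3 to_datetime() function
Function call: date = pd.to_datetime(arg)
Purpose: converts the given data to a datetime type and returns it
Parameters: arg
arg: int/float/string/datetime/list/tuple/1-d array/Series, the value(s) to convert. A one-dimensional array or a Series may be passed; DataFrame and dict-like structures are supported since version 0.18.1
Returns: date
date: the return type depends on the type of the argument
Note: data converted with to_datetime has dtype datetime64[ns]; on a Series holding such data, .dt.year / .dt.month / .dt.day return the corresponding date components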
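The following sketch shows to_datetime together with the .dt accessor mentioned in the note. The date strings are made-up examples. The exact datetime64 resolution may differ between pandas versions, so the code does not print or rely on it.

```python
import pandas as pd

# Hypothetical date strings
raw = pd.Series(["2018-01-15", "2018-02-20"])
dates = pd.to_datetime(raw)

# The .dt accessor exposes the components of each date
print(dates.dt.year.tolist())
print(dates.dt.month.tolist())
print(dates.dt.day.tolist())
```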
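2.3 Classes
2.3.1 DataFrame class
Instantiation: df = pd.DataFrame(data, index=) / pd.read_xxx(file_name)
Purpose: builds a DataFrame
Parameters: data, index / file_name
data: ndarray (two-dimensional; dicts and other DataFrames are also accepted), the data to build the DataFrame from
index: Index or array-like, the row labels to use as the index
file_name: str, the name of the file to read
Returns: df
df: DataFrame, the resulting DataFrame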
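2.3.1.1 dtypes attribute
Attribute access: fmt = df.dtypes
Purpose: returns the data type of each column (the plural dtypes is used because there are several; a single NumPy array has one dtype)
Value: fmt
fmt: Series, with the column names as index and the dtypes as values. The object dtype is what a column of Python strings usually gets, although it can hold any Python object
2.3.1.2 columns attribute
Attribute access: index_name = df.columns
Purpose: returns the name of each column
Value: index_name
index_name: Index, <class 'pandas.core.indexes.base.Index'>, containing the column names
2.3.1.3 shape attribute
Attribute access: shp = df.shape
Purpose: returns the number of rows and columns
Value: shp
shp: tuple, (row, column), the row and column counts
2.3.1.4 loc attribute
Attribute access: index = df.loc
Purpose: returns a label-based indexer object
Value: index
index: object, <class 'pandas.core.indexing._LocIndexer'>, which can be indexed or sliced by label to get data from the DataFrame, e.g. index[0] returns the row labelled 0 and index[3:7] returns the rows labelled 3 through 7 (both ends included)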
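One detail of loc is easy to miss: a label slice includes both endpoints, unlike a positional slice. The sketch below contrasts loc with iloc on a made-up ten-row table.

```python
import pandas as pd

# Hypothetical table with the default 0..9 integer index
df = pd.DataFrame({"v": range(10)})

row0 = df.loc[0]            # a single row, returned as a Series
by_label = df.loc[3:7]      # labels 3..7, both ends included
by_position = df.iloc[3:7]  # positions 3..6, end excluded

print(len(by_label), len(by_position))
```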
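2.3.1.5 head() method
Function call: hdf = df.head(n=5)
Purpose: returns the first n rows of the data
Parameters: n
n: int, the number of rows to return
Returns: hdf
hdf: DataFrame, the first n rows of the original data
2.3.1.6 tail() method
Function call: tdf = df.tail(n=5)
Purpose: returns the last n rows of the data
Parameters: n
n: int, the number of rows to return
Returns: tdf
tdf: DataFrame, the last n rows of the original data
2.3.1.7 describe() method
Function call: ddf = df.describe()
Purpose: returns summary statistics for each numeric column
Parameters: none
Returns: ddf
ddf: DataFrame; for each column it includes the count, the mean, the standard deviation (std), the minimum (min), the first quartile (25%), the median (50%), the third quartile (75%) and the maximum (max)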
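2.3.1.8 sort_values() method
Function call: sdf = df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Purpose: returns the DataFrame sorted by the given values
Parameters: by, axis, ascending, inplace, kind, na_position
by: str or list of str, the column name(s) to sort by (row labels when axis=1)
axis: int, 0 reorders the rows by the values of the given column(s), 1 reorders the columns by the values of the given row(s)
ascending: bool, True sorts in ascending order, False in descending order
inplace: bool, True modifies the original DataFrame, False returns a new DataFrame
kind: str, the sorting algorithm, one of {'quicksort', 'mergesort', 'heapsort'}
na_position: str, where to put NA values, 'first' or 'last'
Returns: sdf
sdf: DataFrame, the sorted DataFrame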
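A short sketch of how ascending and na_position interact, on a made-up column with one missing value. With the default na_position='last', the missing value ends up at the tail for both sort directions.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"mark": ["cc", np.nan, "aa", "dd"]})

asc = df.sort_values("mark")  # ascending, NaN last
desc_na_first = df.sort_values("mark", ascending=False, na_position="first")

print(asc)
print(desc_na_first)
```

The original row labels travel with the rows, which is why a sorted DataFrame is often followed by reset_index.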
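2.3.1.9 mean() method
Function call: mdf = df.mean(axis=0)
Purpose: returns the mean of the non-NaN values along the given axis
Parameters: axis
axis: int, 0 computes the mean of each column, 1 computes the mean of each row
Returns: mdf
mdf: Series, holding the means as float values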
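2.3.1.10 pivot_table() method
Function call: cdf = df.pivot_table(index=, values=, aggfunc=)
Purpose: groups the data by index and applies aggfunc to the values column(s) of each group (values in a column must share a type)
Parameters: index, values, aggfunc
index: str, the name of the column to group by
values: str/list, the name of the column to aggregate; use a list for several columns
aggfunc: function (or function name), the aggregation to apply; the default is the mean
Returns: cdf
cdf: DataFrame, the result of applying the aggregation to each group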
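A minimal sketch of pivot_table on made-up data, showing the default aggregation (mean) next to an explicit one. The aggregation is passed here by name as the string 'sum', which pandas also accepts.

```python
import pandas as pd

# Hypothetical data: two ranks, two rows each
df = pd.DataFrame({
    "Rank": ["A", "B", "A", "B"],
    "Number": [1.0, 2.0, 3.0, 4.0],
})

mean_by_rank = df.pivot_table(index="Rank", values="Number")  # default aggfunc: mean
sum_by_rank = df.pivot_table(index="Rank", values="Number", aggfunc="sum")

print(mean_by_rank)
print(sum_by_rank)
```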
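2.3.1.11 dropna() method
Function call: ddf = df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Purpose: removes rows or columns that contain NaN values, according to the given rules
Parameters: axis, how, thresh, subset, inplace
axis: int/str, 0/'index' drops rows that contain NA values, 1/'columns' drops columns that contain NA values
how: str, 'any' drops the row/column if any NA value is present, 'all' drops it only if every value is NA
thresh: int/None, the minimum number of valid values required (2 means the row/column must hold at least 2 non-NA values to be kept)
subset: str/list, restricts the search for NA values to the given labels (for example, column names when dropping rows)
inplace: bool, whether to operate on the original data; True modifies it, False returns new data
Returns: ddf
ddf: DataFrame, the result after the NA rows/columns have been removed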
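The parameters of dropna are easier to compare side by side. This sketch uses a made-up three-row table; how and thresh are used in separate calls.

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 0 is complete, row 1 misses 'a', row 2 misses 'a' and 'b'
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1, 2, 3],
})

rows_any = df.dropna()                               # drop rows with any NA
cols_any = df.dropna(axis=1)                         # drop columns with any NA
rows_all = df.dropna(how="all", subset=["a", "b"])   # drop rows where a and b are both NA
rows_thresh = df.dropna(thresh=2)                    # keep rows with at least 2 valid values

print(rows_any.index.tolist(), cols_any.columns.tolist())
print(rows_all.index.tolist(), rows_thresh.index.tolist())
```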
2.3.1.12 reset_index() method
Function call: rdf = df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
Purpose: resets the index of a DataFrame (typically one that has been sorted) to the default sequence
Parameters: level, drop, inplace, col_level, col_fill
level: int/str/tuple/list, only remove the given levels from the index. Removes all levels by default
drop: bool, whether to discard the original index; True discards it, False keeps it as a column
inplace: bool, whether to operate on the original data
col_level: int/str, if the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level
col_fill: object, if the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated
Returns: rdf
rdf: DataFrame, the DataFrame with its index reset
2.3.1.13 set_index() method
Function call: sdf = df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Purpose: sets a new index from existing columns
Parameters: keys, drop, append, inplace, verify_integrity
keys: str (or list of str), the name(s) of the column(s) to use as the index
drop: bool, whether to remove the column used as the index from the columns; True removes it, False keeps it
append: bool, whether to append the new index to the existing one (True keeps the existing index and produces a MultiIndex)
inplace: bool, whether to operate on the original data
verify_integrity: bool, check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method
Returns: sdf
sdf: DataFrame, the DataFrame with the new index
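The two methods are often used together. The sketch below, on a made-up table, resets the index after a sort and then uses a column as the index so that rows can be sliced by label.

```python
import pandas as pd

# Hypothetical table
df = pd.DataFrame({"Type": ["x", "y", "z"], "Number": [3, 1, 2]})

sorted_df = df.sort_values("Number")       # rows keep their old labels: 1, 2, 0
reset = sorted_df.reset_index(drop=True)   # labels renumbered 0, 1, 2; old index discarded

by_type = df.set_index("Type")             # 'Type' becomes the index and leaves the columns
first_two = by_type.loc["x":"y"]           # label slice, both ends included

print(sorted_df.index.tolist(), reset.index.tolist())
print(first_two)
```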
2.3.1.14 apply() method
Function call: re = df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Purpose: applies a user-defined function along the given axis of the DataFrame
Parameters: func, axis, broadcast, raw, reduce, args, **kwds
func: function, the function to apply to each row/column
axis: int/str, 0/'index' applies the function to each column, 1/'columns' applies it to each row
broadcast: bool, for aggregation functions, return object of same size with values propagated
raw: bool, if False, convert each row or column into a Series. If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance
reduce: bool/None, try to apply reduction procedures. If the DataFrame is empty, apply will use reduce to determine whether the result should be a Series or a DataFrame. If reduce is None (the default), apply's return value will be guessed by calling func on an empty Series (note: while guessing, exceptions raised by func will be ignored). If reduce is True a Series will always be returned, and if False a DataFrame will always be returned
args: tuple, positional arguments to pass to function in addition to the array/series
**kwds: any remaining keyword arguments are passed on to func
Note: broadcast and reduce were deprecated in pandas 0.23 and removed in 1.0; later versions use result_type instead
Returns: re
re: Series or DataFrame, the result of applying func along the given axis
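The meaning of axis in apply is a common source of confusion, so here is a minimal sketch on a made-up table: axis=0 hands each column to the function, axis=1 hands each row to it.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# axis=0: the function receives each column, one result per column
nulls_per_column = df.apply(lambda col: col.isnull().sum(), axis=0)
# axis=1: the function receives each row, one result per row
nulls_per_row = df.apply(lambda row: row.isnull().sum(), axis=1)

print(nulls_per_column)
print(nulls_per_row)
```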
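2.3.1.15 ix attribute
Attribute access: ix_obj = df.ix
Purpose: returns an indexer object
Value: ix_obj
ix_obj: object, <class 'pandas.core.indexing._IXIndexer'>
Note: ix_obj[rows, cols] then returns a DataFrame or Series, where rows/cols are the row index labels/column names to select. ix was deprecated in pandas 0.20 and removed in 1.0; use loc (labels) or iloc (positions) instead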
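Since ix is not available in current pandas, the same selections can be written with loc or iloc. This is a sketch on made-up data whose index labels deliberately differ from the row positions.

```python
import pandas as pd

# Hypothetical table; labels 10/20/30 are not the same as positions 0/1/2
df = pd.DataFrame({"Rank": ["A", "B", "C"]}, index=[10, 20, 30])

by_label = df.loc[20, "Rank"]   # row label 20, column name 'Rank'
by_position = df.iloc[1, 0]     # second row, first column

print(by_label, by_position)
```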
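2.3.2 Series class
Instantiation: sr = pd.Series(data, index=) / df[column_name]
Purpose: builds a Series
Parameters: data, index / column_name
data: ndarray (one-dimensional), the data to build the Series from
index: Index or array-like, the labels to use as the index
column_name: str, the name of the column to extract as a Series
Returns: sr
sr: Series, the resulting Series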
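2.3.2.1 values attribute
Attribute access: values = sr.values
Purpose: returns all the values of the Series
Value: values
values: ndarray, a one-dimensional ndarray of all the values in the Series
2.3.2.2 tolist() method
Function call: lst = sr.tolist()
Purpose: returns the data of a Series or Index as a list
Parameters: none
Returns: lst
lst: list, the data as a list
2.3.2.3 max()/min() methods
Function call: value = sr.max() / sr.min()
Purpose: returns the largest/smallest value in the Series
Parameters: none
Returns: value
value: int/str, etc., the extreme value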
2.3.2.4 sort_values() method
Function call: ssr = sr.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Purpose: returns the Series sorted by its values
Parameters: axis, ascending, inplace, kind, na_position
axis: int, the axis to sort along; a Series has only one axis, so this is 0
ascending: bool, True sorts in ascending order, False in descending order
inplace: bool, True modifies the original Series, False returns a new Series
kind: str, the sorting algorithm, one of {'quicksort', 'mergesort', 'heapsort'}
na_position: str, where to put NA values, 'first' or 'last'
Returns: ssr
ssr: Series, the sorted Series
2.3.2.5 mean() method
Function call: msr = sr.mean()
Purpose: returns the mean of all non-NaN values
Parameters: none
Returns: msr
msr: float, the mean value
2.3.2.6 reset_index() method
Function call: rsr = sr.reset_index(level=None, drop=False, name=None, inplace=False)
Purpose: resets the index of a Series (typically one that has been sorted) to the default sequence
Parameters: level, drop, name, inplace
level: int/str/tuple/list, only remove the given levels from the index. Removes all levels by default
drop: bool, whether to discard the original index; True discards it, False keeps it as a column
name: object, the name of the column corresponding to the Series values
inplace: bool, whether to operate on the original data
Returns: rsr
rsr: a Series with the index reset when drop=True; a DataFrame when drop=False, because the old index becomes a column
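2.3.2.7 value_counts() method
Function call: csr = sr.value_counts(dropna=True)
Purpose: counts how many times each distinct value occurs in the Series
Parameters: dropna
dropna: bool, whether to leave NA out of the counts; True leaves it out, False counts it
Returns: csr
csr: Series, with the distinct values as index and their counts as values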
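A minimal sketch of value_counts on a made-up Series, showing the effect of dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical ranks with one missing entry
s = pd.Series(["A", "C", "A", np.nan])

counts = s.value_counts()               # NaN is left out by default
with_na = s.value_counts(dropna=False)  # NaN gets its own entry

print(counts)
print(with_na)
```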
First, read the csv file with read_csv. Note that the csv file must be produced with Excel's "Save As" command; it cannot be obtained by simply changing a file's extension.
    import pandas as pd

    # If the default decoding fails, one of these encodings can be tried instead
    # info = pd.read_csv('info.csv', encoding='latin1')
    # info = pd.read_csv('info.csv', encoding='ISO-8859-1')
    # info = pd.read_csv('info.csv', encoding='cp1252')
    info = pd.read_csv('info.csv')
    # Print the whole table, together with its row and column counts
    print(info)
The output is

         No.       Type  Info      Number  Rank  Mark.
    0   1001   BUTTER_1   NaN    4.000000     A     cc
    1   1002   BUTTER_2   NaN         NaN     C     dd
    2   1003   BUTTER_3   NaN         NaN   NaN     ff
    3   1004   BUTTER_4   NaN         NaN   NaN    NaN
    4   1005   BUTTER_5    df  543.000000     F     cx
    5   1006   BUTTER_6    fa  345.000000     A     cc
    6   1007   BUTTER_7   jhf   67.000000     S     dd
    7   1008   BUTTER_8    ad  567.000000     S     ff
    8   1009   BUTTER_9  gdfs   34.000000     C     aa
    9   1010  BUTTER_10  vczx   34.000000     C     cx
    10  1011  BUTTER_11    as   89.000000     E     cc
    11  1012  BUTTER_12    cd   90.000000     D     dd
    12  1013  BUTTER_13   qwe   14.000000     S     ff
    13  1014    WATER_1   asd  186.635198     A     aa
    14  1015    WATER_2    as  222.000000     B     cc
    15  1016    WATER_3    fa  193.026806     A     cc
    16  1017    WATER_4   jhf  196.222611     C     dd
    17  1018    WATER_5    ad  199.418415     B     ff
    18  1019    WATER_6  gdfs  202.614219     D     aa
    19  1020    WATER_7  vczx  205.810023     F     cx
    20  1021    WATER_8    as  209.005827     A     cc
    21  1022    WATER_9    cd  212.201632     S     dd
    22  1023   WATER_10   qwe  215.397436     S     ff
    23  1024   WATER_11   asd  218.593240     C     aa
    24  1025   WATER_12    df  221.789044     C     cx
    25  1026   WATER_13    fa  224.984848     E     cc
    26  1027   WATER_14   jhf  228.180653     D     dd
    27  1028   WATER_15    ad  231.376457     S     ff
    28  1029   WATER_16  gdfs  234.572261     A     aa
    29  1030   WATER_17  vczx  237.768065     B     cx
    ..   ...        ...   ...         ...   ...    ...
    70  1071  CHEESE_11    as  368.796037     E     cc
    71  1072  CHEESE_12    cd  371.991842     D     dd
    72  1073  CHEESE_13   qwe  375.187646     S     ff
    73  1074  CHEESE_14   asd  378.383450     A     aa
    74  1075  CHEESE_15    df  381.579254     B     cx
    75  1076  CHEESE_16    fa  384.775058     A     cc
    76  1077  CHEESE_17   jhf  387.970863     C     dd
    77  1078  CHEESE_18    ad  391.166667     B     ff
    78  1079  CHEESE_19  gdfs  394.362471     D     aa
    79  1080  CHEESE_20  vczx  397.558275     F     cx
    80  1081  CHEESE_21    as  400.754079     A     cc
    81  1082  CHEESE_22    cd  403.949883     S     dd
    82  1083  CHEESE_23   qwe  407.145688     S     ff
    83  1084  CHEESE_24   asd  410.341492     C     aa
    84  1085  CHEESE_25    df  413.537296     C     cx
    85  1086     MILK_1    fa  416.733100     E     cc
    86  1087     MILK_2   jhf  419.928904     D     dd
    87  1088     MILK_3    ad  423.124709     S     ff
    88  1089     MILK_4  gdfs  426.320513     A     aa
    89  1090     MILK_5  vczx  429.516317     B     cx
    90  1091     MILK_6    as  432.712121     A     cc
    91  1092     MILK_7    cd  435.907925     C     dd
    92  1093     MILK_8   qwe  439.103730     B     ff
    93  1094     MILK_9   asd  442.299534     D     aa
    94  1095    MILK_10    df  445.495338     F     cx
    95  1096    MILK_11    fa  448.691142     A     cc
    96  1097    MILK_12   jhf  451.886946     S     dd
    97  1098    MILK_13    ad  455.082751     S     ff
    98  1099    MILK_14  gdfs  458.278555     C     aa
    99  1100    MILK_15  vczx  461.474359     C     cx

    [100 rows x 6 columns]
As shown, pandas has imported the data from the csv file successfully.
Next, check the types of the imported data
    # Get the type of info
    print(type(info))  # <class 'pandas.core.frame.DataFrame'>
    print('-----------')
    # Get the dtype of each column (object is the dtype given to columns of Python strings)
    print(info.dtypes)
    '''
    No.         int64
    Type       object
    Info       object
    Number    float64
    Rank       object
    Mark.      object
    dtype: object
    '''
Finally, the basic functions return the first/last n rows, the column names and a basic description
    # Get the first n rows of the table, default is 5
    print(info.head(7))
    print('-----------')
    # Get the last n rows of the table, default is 5
    print(info.tail(7))
    print('-----------')
    # Get the name of each column
    print(info.columns)
    print('-----------')
    # Get the shape of the table
    print(info.shape)
    print('-----------')
    # Get summary statistics of the table (numeric columns only),
    # such as count, mean, standard deviation, min, 25%, 50%, 75%, max
    print(info.describe())
Output

        No.      Type  Info  Number  Rank  Mark.
    0  1001  BUTTER_1   NaN     4.0     A     cc
    1  1002  BUTTER_2   NaN     NaN     C     dd
    2  1003  BUTTER_3   NaN     NaN   NaN     ff
    3  1004  BUTTER_4   NaN     NaN   NaN    NaN
    4  1005  BUTTER_5    df   543.0     F     cx
    5  1006  BUTTER_6    fa   345.0     A     cc
    6  1007  BUTTER_7   jhf    67.0     S     dd
    -----------
         No.     Type  Info      Number  Rank  Mark.
    93  1094   MILK_9   asd  442.299534     D     aa
    94  1095  MILK_10    df  445.495338     F     cx
    95  1096  MILK_11    fa  448.691142     A     cc
    96  1097  MILK_12   jhf  451.886946     S     dd
    97  1098  MILK_13    ad  455.082751     S     ff
    98  1099  MILK_14  gdfs  458.278555     C     aa
    99  1100  MILK_15  vczx  461.474359     C     cx
    -----------
    Index(['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.'], dtype='object')
    -----------
    (100, 6)
    -----------
                   No.      Number
    count   100.000000   97.000000
    mean   1050.500000  309.401389
    std      29.011492  110.975188
    min    1001.000000    4.000000
    25%    1025.750000  240.963869
    50%    1050.500000  317.663170
    75%    1075.250000  391.166667
    max    1100.000000  567.000000
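Because the basic pandas structures are built on NumPy's ndarray, NumPy's basic computations also work on pandas DataFrame and Series objects.
Below are some examples of basic computations in pandas.
Full code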

    import pandas as pd

    info = pd.read_csv('info.csv')
    # Get a single row, or a range of rows, by index label
    print(info.loc[0])
    print(info.loc[3:7])
    print('----------')
    # Get one column by name, or several columns by a list of names
    print(info['Type'])
    print(info[['Type', 'No.']])
    # Get the column names and save them as a list
    col_names = info.columns.tolist()
    print(col_names)

    # Select the columns whose names end with '.'
    dotList = []
    for n in col_names:
        if n.endswith('.'):
            dotList.append(n)
    newList = info[dotList]
    print(newList)

    # An operation on a column acts on each element, as in numpy
    print(info['Number'] * 10)

    # An operation on two columns of the same shape acts on corresponding elements
    x = info['Number']
    y = info['No.']
    print(x+y)
    # The same works for strings (concatenation)
    x = info['Rank']
    y = info['Mark.']
    print(x+y)

    # Add a column after the last column (its length must match the existing rows)
    print(info.shape)
    info['New'] = x+y
    print(info.shape)
    print('----------')

    # Get the max/min value of a column
    print(info['Number'].max())
    print(info['Number'].min())

    num = info['Number']
    num_null_true = pd.isnull(num)
    # If there is a null value in the Series, the plain sum/len result is NaN
    print(sum(info['Number'])/len(info['Number']))  # nan
    # Invert the boolean mask (~mask, equivalent to mask == False) to keep the valid values
    good_value = info['Number'][~num_null_true]
    print(sum(good_value)/len(good_value))
    print(good_value.mean())
    # The mean method filters out missing data automatically
    print(info['Number'].mean())
    print('---------')
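Step-by-step explanation
First import pandas and the data file, then use loc to fetch rows. Label slices work much like list slices, except that both endpoints are included
    import pandas as pd

    info = pd.read_csv('info.csv')
    # Get a single row, or a range of rows, by index label
    print(info.loc[0])
    print(info.loc[3:7])
    print('----------')
    # Get one column by name, or several columns by a list of names
    print(info['Type'])
    print(info[['Type', 'No.']])
The result is as follows (the output is fairly long)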

    No.           1001
    Type      BUTTER_1
    Info           NaN
    Number           4
    Rank             A
    Mark.           cc
    Name: 0, dtype: object
        No.      Type  Info  Number  Rank  Mark.
    3  1004  BUTTER_4   NaN     NaN   NaN    NaN
    4  1005  BUTTER_5    df   543.0     F     cx
    5  1006  BUTTER_6    fa   345.0     A     cc
    6  1007  BUTTER_7   jhf    67.0     S     dd
    7  1008  BUTTER_8    ad   567.0     S     ff
    ----------
    0    BUTTER_1
    1    BUTTER_2
    2    BUTTER_3
    3    BUTTER_4
    4    BUTTER_5
    5    BUTTER_6
    6    BUTTER_7
    7    BUTTER_8
    8    BUTTER_9
    9   BUTTER_10
    10  BUTTER_11
    11  BUTTER_12
    12  BUTTER_13
    13    WATER_1
    14    WATER_2
    15    WATER_3
    16    WATER_4
    17    WATER_5
    18    WATER_6
    19    WATER_7
    20    WATER_8
    21    WATER_9
    22   WATER_10
    23   WATER_11
    24   WATER_12
    25   WATER_13
    26   WATER_14
    27   WATER_15
    28   WATER_16
    29   WATER_17
              ...
    70  CHEESE_11
    71  CHEESE_12
    72  CHEESE_13
    73  CHEESE_14
    74  CHEESE_15
    75  CHEESE_16
    76  CHEESE_17
    77  CHEESE_18
    78  CHEESE_19
    79  CHEESE_20
    80  CHEESE_21
    81  CHEESE_22
    82  CHEESE_23
    83  CHEESE_24
    84  CHEESE_25
    85     MILK_1
    86     MILK_2
    87     MILK_3
    88     MILK_4
    89     MILK_5
    90     MILK_6
    91     MILK_7
    92     MILK_8
    93     MILK_9
    94    MILK_10
    95    MILK_11
    96    MILK_12
    97    MILK_13
    98    MILK_14
    99    MILK_15
    Name: Type, Length: 100, dtype: object
             Type   No.
    0    BUTTER_1  1001
    1    BUTTER_2  1002
    2    BUTTER_3  1003
    3    BUTTER_4  1004
    4    BUTTER_5  1005
    5    BUTTER_6  1006
    6    BUTTER_7  1007
    7    BUTTER_8  1008
    8    BUTTER_9  1009
    9   BUTTER_10  1010
    10  BUTTER_11  1011
    11  BUTTER_12  1012
    12  BUTTER_13  1013
    13    WATER_1  1014
    14    WATER_2  1015
    15    WATER_3  1016
    16    WATER_4  1017
    17    WATER_5  1018
    18    WATER_6  1019
    19    WATER_7  1020
    20    WATER_8  1021
    21    WATER_9  1022
    22   WATER_10  1023
    23   WATER_11  1024
    24   WATER_12  1025
    25   WATER_13  1026
    26   WATER_14  1027
    27   WATER_15  1028
    28   WATER_16  1029
    29   WATER_17  1030
    ..        ...   ...
    70  CHEESE_11  1071
    71  CHEESE_12  1072
    72  CHEESE_13  1073
    73  CHEESE_14  1074
    74  CHEESE_15  1075
    75  CHEESE_16  1076
    76  CHEESE_17  1077
    77  CHEESE_18  1078
    78  CHEESE_19  1079
    79  CHEESE_20  1080
    80  CHEESE_21  1081
    81  CHEESE_22  1082
    82  CHEESE_23  1083
    83  CHEESE_24  1084
    84  CHEESE_25  1085
    85     MILK_1  1086
    86     MILK_2  1087
    87     MILK_3  1088
    88     MILK_4  1089
    89     MILK_5  1090
    90     MILK_6  1091
    91     MILK_7  1092
    92     MILK_8  1093
    93     MILK_9  1094
    94    MILK_10  1095
    95    MILK_11  1096
    96    MILK_12  1097
    97    MILK_13  1098
    98    MILK_14  1099
    99    MILK_15  1100

    [100 rows x 2 columns]
Get the column names of the DataFrame
    # Get the column names and save them as a list
    col_names = info.columns.tolist()
    print(col_names)
The result is
    ['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.']
Select all the columns whose names end with '.'
    # Select the columns whose names end with '.'
    dotList = []
    for n in col_names:
        if n.endswith('.'):
            dotList.append(n)
    newList = info[dotList]
    print(newList)
Basic arithmetic acts on every value of a pandas Series
    # An operation on a column acts on each element, as in numpy
    print(info['Number'] * 10)
For two Series of the same shape, the operation is applied to each pair of corresponding values
    # An operation on two columns of the same shape acts on corresponding elements
    x = info['Number']
    y = info['No.']
    print(x+y)
    # The same works for strings (concatenation)
    x = info['Rank']
    y = info['Mark.']
    print(x+y)
Create a new column named 'New' whose values are the element-wise sum of two columns (here, the concatenated Rank and Mark. strings)
    # Add a column after the last column (its length must match the existing rows)
    print(info.shape)
    info['New'] = x+y
    print(info.shape)
    print('----------')
Get the largest and smallest values of a Series
    # Get the max/min value of a column
    print(info['Number'].max())
    print(info['Number'].min())
Two ways to compute the mean:
- Sum and divide directly: if the data contains a NaN value, the result is NaN
- Use the mean method, which filters out missing (NaN) data automatically
    num = info['Number']
    num_null_true = pd.isnull(num)
    # If there is a null value in the Series, the plain sum/len result is NaN
    print(sum(info['Number'])/len(info['Number']))  # nan
    # Invert the boolean mask (~mask, equivalent to mask == False) to keep the valid values
    good_value = info['Number'][~num_null_true]
    print(sum(good_value)/len(good_value))
    print(good_value.mean())
    # The mean method filters out missing data automatically
    print(info['Number'].mean())
    print('---------')
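The two ways of computing the mean can be compared on a small made-up Series, which makes the effect of a single NaN easy to see.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([2.0, np.nan, 4.0])

naive = sum(s) / len(s)          # NaN propagates through the built-in sum
skipping = s.mean()              # mean() skips NaN by default
explicit = s[s.notnull()].mean() # the same result after filtering by hand

print(naive, skipping, explicit)
```

Note that mean divides by the number of valid values (2 here), not by the full length of the Series.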
The following covers some basic usage of the pandas Series type.
Full code

    import pandas as pd

    info = pd.read_csv('info.csv')

    # Fetch a Series from the DataFrame
    rank_series = info['Rank']
    print(type(info))  # <class 'pandas.core.frame.DataFrame'>
    print(type(rank_series))  # <class 'pandas.core.series.Series'>
    print(rank_series[0:5])

    # Create a new Series
    from pandas import Series
    # Take the values of the rank Series
    rank = rank_series.values
    print(rank)
    # DataFrame --> Series --> ndarray
    print(type(rank))  # <class 'numpy.ndarray'>
    # Take the values of the type Series
    type_series = info['Type']
    types = type_series.values
    # Build a new Series from the two (type as index, rank as values)
    # Series(values, index=)
    series_custom = Series(rank, index=types)
    print(series_custom)
    # Fetch entries by a list of index labels
    print(series_custom[['MILK_14', 'MILK_15']])
    # Fetch entries by positional slice
    print(series_custom[0:2])

    # sorted() on a Series returns a list of its values in order
    # NaN is dropped first, because str and float (NaN) values cannot be compared
    print(sorted(series_custom.dropna()))

    # Re-order a Series by its index
    original_index = series_custom.index.tolist()
    sorted_index = sorted(original_index)
    sorted_by_index = series_custom.reindex(sorted_index)
    print(sorted_by_index)
    # Series sort methods
    print(series_custom.sort_index())
    print(series_custom.sort_values())

    import numpy as np
    # Adding two Series adds the values row by row (the two must have the same shape)
    print(np.add(series_custom, series_custom))
    # Apply the sin function to each value
    print(np.sin(info['Number']))
    # Return the max value (a single value, not a Series)
    # If the max value occurs more than once, only one is returned
    print(np.max(info['Number']))

    # Filter values in a range
    criteria_one = series_custom > 'C'
    criteria_two = series_custom < 'S'
    print(series_custom[criteria_one & criteria_two])
Step-by-step explanation
Get a Series from the DataFrame by column name
    import pandas as pd

    info = pd.read_csv('info.csv')

    # Fetch a Series from the DataFrame
    rank_series = info['Rank']
    print(type(info))  # <class 'pandas.core.frame.DataFrame'>
    print(type(rank_series))  # <class 'pandas.core.series.Series'>
    print(rank_series[0:5])
To create a new Series, first take one column to serve as the index, then take one column to serve as the values, and pass both to the Series constructor
    # Create a new Series
    from pandas import Series
    # Take the values of the rank Series
    rank = rank_series.values
    print(rank)
    # DataFrame --> Series --> ndarray
    print(type(rank))  # <class 'numpy.ndarray'>
    # Take the values of the type Series
    type_series = info['Type']
    types = type_series.values
    # Build a new Series from the two (type as index, rank as values)
    # Series(values, index=)
    series_custom = Series(rank, index=types)
    print(series_custom)
Get several entries from the Series by a list of index labels or by a positional slice
    # Fetch entries by a list of index labels
    print(series_custom[['MILK_14', 'MILK_15']])
    # Fetch entries by positional slice
    print(series_custom[0:2])
Use the built-in sorted function to order the values of the Series; the return value is a list. The original key function mapped every string to 0 and so did not actually order the values; dropping the NaN entries first lets the strings be compared directly
    # sorted() on a Series returns a list of its values in order
    # NaN is dropped first, because str and float (NaN) values cannot be compared
    print(sorted(series_custom.dropna()))
Two ways to sort a Series
1. Take the index values, sort them, then use reindex to build the reordered Series
    # Re-order a Series by its index
    original_index = series_custom.index.tolist()
    sorted_index = sorted(original_index)
    sorted_by_index = series_custom.reindex(sorted_index)
    print(sorted_by_index)
2. Use the sort_index or sort_values method
    # Series sort methods
    print(series_custom.sort_index())
    print(series_custom.sort_values())
Addition, sine and max on a Series: NumPy functions operate on the corresponding values of the Series. The original max call wrapped a filter object, which NumPy cannot reduce under Python 3, so the maximum is taken from the numeric Number column instead
    import numpy as np
    # Adding two Series adds the values row by row (the two must have the same shape)
    print(np.add(series_custom, series_custom))
    # Apply the sin function to each value
    print(np.sin(info['Number']))
    # Return the max value (a single value, not a Series)
    # If the max value occurs more than once, only one is returned
    print(np.max(info['Number']))
Use boolean (True/False) masks to select the values of a Series that fall within a range
    # Filter values in a range
    criteria_one = series_custom > 'C'
    criteria_two = series_custom < 'S'
    print(series_custom[criteria_one & criteria_two])
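The range filter works because each comparison returns a boolean Series, and & combines the two element by element. A minimal sketch on a made-up Series of rank letters (the index labels are hypothetical):

```python
import pandas as pd

# Hypothetical ranks
s = pd.Series(["A", "C", "D", "S"], index=["w", "x", "y", "z"])

above_c = s > "C"            # string comparison, element by element
below_s = s < "S"
in_range = s[above_c & below_s]

print(above_c.tolist())
print(in_range)
```

Each condition needs its own parentheses when written inline, e.g. s[(s > "C") & (s < "S")], because & binds more tightly than the comparison operators.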
Below are examples of some commonly used pandas functions
Full code

    import pandas as pd
    import numpy as np

    info = pd.read_csv('info.csv')

    # Sort by the values of a column
    # inplace=True sorts the original DataFrame, inplace=False returns a new DataFrame
    new = info.sort_values('Mark.', inplace=False, na_position='last')
    print(new)
    # The default order is ascending (ascending=True)
    # With the default na_position, NaN (missing) values are placed at the tail in both orders
    info.sort_values('Mark.', inplace=True, ascending=False)
    print(info)
    print('---------')
    # Select the rows with a null value
    num = info['Number']
    # isnull returns a boolean Series: True for null, False otherwise
    num_null_true = pd.isnull(num)
    print(num_null_true)
    num_null = num[num_null_true]
    print(num_null)  # the entries whose Number is NaN
    print('---------')

    # pivot_table groups rows that share an attribute and applies a function to each group
    # index tells the method which column to group by
    # values is the column that we want to apply the calculation to
    # aggfunc specifies the calculation we want to perform, default function is mean
    number_sum_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc='sum')
    print(number_sum_by_rank)
    print('---------')
    # Operate on several columns
    sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc='sum')
    print(sum_by_rank)
    print('---------')

    # dropna drops the rows/columns that have null values
    info = pd.read_csv('info.csv')
    # Drop the columns that contain NaN (axis=0 would drop rows)
    drop_na_column = info.dropna(axis=1)
    print(drop_na_column)
    print('---------')
    # Drop the rows that have NaN within the given subset of columns
    # thresh sets how many valid values are required to keep a row
    drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.'])
    print(drop_na_row)
    print('---------')
    # Locate a single value by its row label and column name (in this data, No. = row label + 1001)
    print(info)
    row_77_Rank = info.loc[77, 'Rank']
    print(row_77_Rank)
    row_88_Info = info.loc[88, 'Info']
    print(row_88_Info)
    print('---------')

    # reset_index resets the index of a sorted DataFrame
    new_info = info.sort_values('Rank', ascending=False)
    print(new_info[0:10])
    print('---------')
    # drop=True discards the old index, otherwise it is kept as a column (default False)
    reset_new_info = new_info.reset_index(drop=True)
    print(reset_new_info[0:10])
    print('---------')

    # Define your own function for pandas
    # Use apply to run it
    def hundredth_row(col):
        hundredth_item = col.loc[99]
        return hundredth_item
    hundred_row = info.apply(hundredth_row, axis=0)
    print(hundred_row)
    print('---------')
    # Null count
    # The function receives one column at a time
    def null_count(column):
        column_null = pd.isnull(column)
        null = column[column_null]
        return len(null)
    # axis=0 passes each column to the function and returns one result per column
    # axis=1 passes each row to the function and returns one result per row
    column_null_count = info.apply(null_count, axis=0)
    print(column_null_count)
    print('---------')

    # Example: classify the data by Rank, and calculate the sum for each class
    def rank_sort(row):
        rank = row['Rank']
        if rank == 'S':
            return 'Excellent'
        elif rank == 'A':
            return 'Great'
        elif rank == 'B':
            return 'Good'
        elif rank == 'C':
            return 'Pass'
        else:
            return 'Failed'
    # Build the classified column
    rank_info = info.apply(rank_sort, axis=1)
    print(rank_info)
    print('---------')
    # Add the column to the DataFrame
    info['Rank_Classfied'] = rank_info
    # Calculate the sum of 'Number' according to 'Rank_Classfied'
    new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc='sum')
    print(new_rank_number)

    # set_index returns a new DataFrame indexed by the values of the given column
    # By default (drop=True) that column is removed; drop=False keeps it
    # append stays False: append=True would build a MultiIndex with the old integer index,
    # and the string slices below would no longer match
    index_type = info.set_index('Type', drop=False)
    print(index_type)
    print('---------')

    # Use string index labels to slice the DataFrame
    # Note: the index (key) should be unique
    print(index_type['MILK_1':'MILK_7'])
    print('---------')
    print(index_type.loc['MILK_1':'MILK_7'])
    # Positional slices are available too
    print('---------')
    print(index_type[-15:-8])
    print('---------')

    # Calculate, for each row, the standard deviation across two columns
    cal_list = info[['Number', 'No.']]
    # np.std([x, y]) --> std value
    # The lambda argument x is a Series holding one row
    # cal_list.apply(lambda x: print(type(x)), axis=1)
    print(cal_list.apply(lambda x: np.std(x), axis=1))
Step-by-step explanation
First import the modules, then use sort_values to sort a DataFrame or Series
    import pandas as pd
    import numpy as np

    info = pd.read_csv('info.csv')

    # Sort by the values of a column
    # inplace=True sorts the original DataFrame, inplace=False returns a new DataFrame
    new = info.sort_values('Mark.', inplace=False, na_position='last')
    print(new)
    # The default order is ascending (ascending=True)
    # With the default na_position, NaN (missing) values are placed at the tail in both orders
    info.sort_values('Mark.', inplace=True, ascending=False)
    print(info)
    print('---------')
Use isnull to pick out the null values; the mask returned by isnull can be inverted with mask == False (or, more idiomatically, ~mask)
    # Select the rows with a null value
    num = info['Number']
    # isnull returns a boolean Series: True for null, False otherwise
    num_null_true = pd.isnull(num)
    print(num_null_true)
    num_null = num[num_null_true]
    print(num_null)  # the entries whose Number is NaN
    print('---------')
Use pivot_table to apply a given function to groups of rows that share the same attribute value
    # pivot_table groups rows that share an attribute and applies a function to each group
    # index tells the method which column to group by
    # values is the column that we want to apply the calculation to
    # aggfunc specifies the calculation we want to perform, default function is mean
    number_sum_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc='sum')
    print(number_sum_by_rank)
    print('---------')
    # Operate on several columns
    sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc='sum')
    print(sum_by_rank)
    print('---------')
Use dropna to remove data with missing values
    # dropna drops the rows/columns that have null values
    info = pd.read_csv('info.csv')
    # Drop the columns that contain NaN (axis=0 would drop rows)
    drop_na_column = info.dropna(axis=1)
    print(drop_na_column)
    print('---------')
    # Drop the rows that have NaN within the given subset of columns
    # thresh sets how many valid values are required to keep a row
    drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.'])
    print(drop_na_row)
    print('---------')
Use loc to locate data
    # Locate a single value by its row label and column name (in this data, No. = row label + 1001)
    print(info)
    row_77_Rank = info.loc[77, 'Rank']
    print(row_77_Rank)
    row_88_Info = info.loc[88, 'Info']
    print(row_88_Info)
    print('---------')
Use reset_index to renumber the index
    # reset_index resets the index of a sorted DataFrame
    new_info = info.sort_values('Rank', ascending=False)
    print(new_info[0:10])
    print('---------')
    # drop=True discards the old index, otherwise it is kept as a column (default False)
    reset_new_info = new_info.reset_index(drop=True)
    print(reset_new_info[0:10])
    print('---------')
Use apply to run user-defined functions
    # Define your own function for pandas
    # Use apply to run it
    def hundredth_row(col):
        hundredth_item = col.loc[99]
        return hundredth_item
    hundred_row = info.apply(hundredth_row, axis=0)
    print(hundred_row)
    print('---------')
    # Null count
    # The function receives one column at a time
    def null_count(column):
        column_null = pd.isnull(column)
        null = column[column_null]
        return len(null)
    # axis=0 passes each column to the function and returns one result per column
    # axis=1 passes each row to the function and returns one result per row
    column_null_count = info.apply(null_count, axis=0)
    print(column_null_count)
    print('---------')

    # Example: classify the data by Rank, and calculate the sum for each class
    def rank_sort(row):
        rank = row['Rank']
        if rank == 'S':
            return 'Excellent'
        elif rank == 'A':
            return 'Great'
        elif rank == 'B':
            return 'Good'
        elif rank == 'C':
            return 'Pass'
        else:
            return 'Failed'
    # Build the classified column
    rank_info = info.apply(rank_sort, axis=1)
    print(rank_info)
    print('---------')
Add a column to the DataFrame and aggregate on it
    # Add the column to the DataFrame
    info['Rank_Classfied'] = rank_info
    # Calculate the sum of 'Number' according to 'Rank_Classfied'
    new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc='sum')
    print(new_rank_number)
Use set_index to set a new index and slice by it. When the slice bounds are index label strings, every row between the two labels (both included) is returned
    # set_index returns a new DataFrame indexed by the values of the given column
    # By default (drop=True) that column is removed; drop=False keeps it
    # append stays False: append=True would build a MultiIndex with the old integer index,
    # and the string slices below would no longer match
    index_type = info.set_index('Type', drop=False)
    print(index_type)
    print('---------')

    # Use string index labels to slice the DataFrame
    # Note: the index (key) should be unique
    print(index_type['MILK_1':'MILK_7'])
    print('---------')
    print(index_type.loc['MILK_1':'MILK_7'])
    # Positional slices are available too
    print('---------')
    print(index_type[-15:-8])
    print('---------')
Compute, for each row, the standard deviation across two columns
    # Calculate, for each row, the standard deviation across two columns
    cal_list = info[['Number', 'No.']]
    # np.std([x, y]) --> std value
    # The lambda argument x is a Series holding one row
    # cal_list.apply(lambda x: print(type(x)), axis=1)
    print(cal_list.apply(lambda x: np.std(x), axis=1))
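1. Many pandas functions are built on NumPy; a single pandas function may call several NumPy functions internally;
2. The object dtype is what a column of Python strings usually gets (it can hold any Python object);
3. If no column names are specified, read_csv treats the first row of the file as the column names by default;
4. The basic structures are DataFrame and Series. A DataFrame can be decomposed into Series, since it is made up of a collection of Series: a DataFrame corresponds to a matrix and a Series to one of its rows or columns.
Related reading
1. Using numpy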