

pandas data pre-processing


Contents

  1. About pandas
  2. The pandas library
  3. Basic pandas operations
  4. Calculations with pandas
  5. pandas Series
  6. Common pandas functions
  7. Supplementary material

 

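1 About pandas

Origins of pandas

pandas (the Python Data Analysis Library) is a tool built on NumPy and created for data-analysis tasks. It bundles a number of libraries and standard data models, provides the tools needed to work efficiently with large datasets, and offers many functions and methods for handling data quickly and conveniently.

pandas is a Python data-analysis package. AQR Capital Management began developing it in April 2008 and open-sourced it at the end of 2009. It is now developed and maintained by the PyData development team, which focuses on Python data packages, as part of the PyData project. Because pandas started out as a tool for financial data analysis, it has strong support for time-series analysis. The name comes from "panel data" and "Python data analysis". Panel data is an econometrics term for multidimensional datasets, and early versions of pandas also provided a Panel data type.

Data structures in pandas

Series:

A one-dimensional array, similar to a one-dimensional NumPy array. Both resemble Python's built-in list. The difference is that a list can hold elements of different types, whereas an array or a Series has a single dtype for all of its elements, which makes memory use and computation more efficient. (Mixed values are still possible, but the Series then falls back to the generic object dtype and loses that advantage.)

Time-Series:

A Series indexed by time.

DataFrame:

A two-dimensional, tabular data structure. Many of its features resemble R's data.frame. A DataFrame can be thought of as a container of Series. The rest of this document deals mainly with DataFrame.

Panel:

A three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and removed in 0.25; a DataFrame with a MultiIndex is the usual replacement.

The typical nesting of data structures in pandas is DataFrame -> Series -> ndarray.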
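
The nesting described above can be checked directly. The following is a minimal sketch; the two-row table is made up for illustration and is not the info.csv file used later in this document.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row table, used only to illustrate the nesting
df = pd.DataFrame({"Type": ["BUTTER_1", "WATER_1"], "Number": [4.0, 186.6]})

column = df["Number"]    # selecting one column gives a Series
values = column.values   # the data behind the Series is a NumPy array

print(type(df).__name__, type(column).__name__, type(values).__name__)
print(column.dtype)      # one dtype for the whole column
```
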

2 The pandas library

Installation:

pip install pandas

2.1 Constants

(none listed)

2.2 Functions

2.2.1 read_csv() function

Function call: info = pd.read_csv(filename)

Function purpose: reads the given CSV file and returns a DataFrame holding its data

Parameters: filename

filename: str, the name (path) of the file to read

Returns: info

info: DataFrame, built from the contents of the file

Similar functions: read_excel / read_json / read_sql / read_html
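
As a rough illustration of read_csv, the sketch below reads CSV text from an in-memory buffer instead of a file, so that it runs without any file on disk. The CSV content is invented for the example; read_csv accepts file-like objects as well as file names.

```python
import io

import pandas as pd

# Hypothetical CSV content; the empty field in the second row becomes NaN
csv_text = "No.,Type,Number\n1001,BUTTER_1,4\n1002,BUTTER_2,\n"

info = pd.read_csv(io.StringIO(csv_text))
print(info)
print(info.dtypes)
```

Because one value is missing, the Number column is read as floating point rather than integer.
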

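2.2.2 isnull() function

Function call: mask = pd.isnull(obj)

Function purpose: reports, element by element, whether each value is missing (null)

Parameters: obj

obj: DataFrame/Series, the data to check

Returns: mask

mask: DataFrame/Series of booleans with the same shape as obj; True marks a missing value, False a present one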
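
A small sketch of how the boolean result is typically used for filtering. The Series below is made-up data, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
number = pd.Series([4.0, np.nan, 543.0], name="Number")

mask = pd.isnull(number)   # True where the value is missing
missing = number[mask]     # keep only the missing entries
present = number[~mask]    # ~ inverts the mask, keeping the valid entries

print(mask.tolist())
print(present.tolist())
```
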

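2.2.3 to_datetime() function

Function call: date = pd.to_datetime(arg)

Function purpose: converts the input to datetime data and returns it

Parameters: arg

arg: int/float/string/datetime/list/tuple/1-d array/Series, the value(s) to convert; a one-dimensional array or a Series may be passed, and DataFrame and dict-like inputs are supported since version 0.18.1

Returns: date

date: the return type depends on the type of the input

Note: a Series converted with to_datetime has a datetime64 dtype (displayed as datetime64[ns]); the date components of such a Series are available through .dt.year / .dt.month / .dt.day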
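
The note above can be illustrated with a short sketch. The date strings are invented for the example.

```python
import pandas as pd

# Hypothetical date strings in ISO format
raw = pd.Series(["2018-01-15", "2018-02-20", "2019-03-05"])

dates = pd.to_datetime(raw)   # Series with a datetime64 dtype
years = dates.dt.year         # the .dt accessor exposes the date components
months = dates.dt.month
days = dates.dt.day

print(dates.dtype)
print(years.tolist(), months.tolist(), days.tolist())
```
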

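2.3 Classes

2.3.1 DataFrame

Instantiation: df = pd.DataFrame(data, index=) / pd.read_xxx(file_name)

Class purpose: builds a DataFrame

Parameters: data, index / file_name

data: ndarray (other inputs such as a dict or a list are also accepted), the two-dimensional data to build the DataFrame from

index: Index or array-like, the row labels to use as the index

file_name: str, the name of the file to read

Returns: df

df: DataFrame, the resulting DataFrame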

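2.3.1.1 dtypes attribute

Attribute access: fmt = df.dtypes

Attribute purpose: returns the data type of every column (plural dtypes because there are several columns; a single NumPy array uses dtype)

Attribute value: fmt

fmt: Series, indexed by column name with each column's dtype as the value; object is the dtype pandas typically uses for columns of Python strings

2.3.1.2 columns attribute

Attribute access: index_name = df.columns

Attribute purpose: returns the column names

Attribute value: index_name

index_name: Index, <class 'pandas.core.indexes.base.Index'>, holding the name of every column

2.3.1.3 shape attribute

Attribute access: shp = df.shape

Attribute purpose: returns the number of rows and columns

Attribute value: shp

shp: tuple, (rows, columns)

2.3.1.4 loc attribute

Attribute access: index = df.loc

Attribute purpose: returns a label-based indexer object

Attribute value: index

index: obj, <class 'pandas.core.indexing._LocIndexer'>; indexing it selects data from the DataFrame by label, e.g. index[0] returns the row labelled 0 and index[3:7] returns the rows labelled 3 through 7, with both endpoints included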
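
The inclusive end point is the main difference from ordinary Python slicing. A minimal sketch with a made-up ten-row frame, comparing loc with the position-based iloc indexer:

```python
import pandas as pd

# Hypothetical frame with the default 0..9 index
df = pd.DataFrame({"No.": range(1001, 1011)})

by_label = df.loc[3:7]       # labels 3, 4, 5, 6, 7: the end label is included
by_position = df.iloc[3:7]   # positions 3, 4, 5, 6: the end position is excluded

print(len(by_label), len(by_position))
```
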

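2.3.1.5 head() method

Function call: hdf = df.head(n=5)

Function purpose: returns the first n rows of the DataFrame

Parameters: n

n: int, the number of rows to return (default 5)

Returns: hdf

hdf: DataFrame, the first n rows of the original data

2.3.1.6 tail() method

Function call: tdf = df.tail(n=5)

Function purpose: returns the last n rows of the DataFrame

Parameters: n

n: int, the number of rows to return (default 5)

Returns: tdf

tdf: DataFrame, the last n rows of the original data

2.3.1.7 describe() method

Function call: ddf = df.describe()

Function purpose: returns summary statistics for each numeric column of the DataFrame

Parameters: (none required)

Returns: ddf

ddf: DataFrame, containing for each column the count, the mean, the standard deviation (std), the minimum (min), the first quartile (25%), the median (50%), the third quartile (75%) and the maximum (max)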

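2.3.1.8 sort_values() method

Function call: sdf = df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Function purpose: returns the DataFrame sorted by the given labels

Parameters: by, axis, ascending, inplace, kind, na_position

by: str or list of str, the column name(s) to sort by (row labels when axis=1)

axis: int, 0 (the default) reorders the rows according to the values of the given column(s), 1 reorders the columns according to the values of the given row(s)

ascending: bool, True sorts in ascending order, False in descending order

inplace: bool, True modifies the original DataFrame (and returns None), False returns a new DataFrame

kind: str, the sorting algorithm, one of {'quicksort', 'mergesort', 'heapsort'}

na_position: str, where NaN values are placed, 'first' or 'last'

Returns: sdf

sdf: DataFrame, the sorted DataFrame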
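
A short sketch of how ascending and na_position interact, using a made-up four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({
    "Type": ["a", "b", "c", "d"],
    "Number": [34.0, np.nan, 567.0, 4.0],
})

# Descending order; NaN still goes last because na_position defaults to 'last'
descending = df.sort_values("Number", ascending=False)
# Ascending order with NaN moved to the front
nan_first = df.sort_values("Number", na_position="first")

print(descending["Type"].tolist())
print(nan_first["Type"].tolist())
```

Both calls return new frames and leave df unchanged, since inplace defaults to False.
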

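2.3.1.9 mean() method

Function call: mdf = df.mean(axis=0)

Function purpose: returns the mean of the non-NaN values along the given axis (in pandas 2.x, a frame that also has non-numeric columns needs numeric_only=True)

Parameters: axis

axis: int, 0 computes the mean of each column, 1 the mean of each row

Returns: mdf

mdf: Series of floats holding the means

2.3.1.10 pivot_table() method

Function call: cdf = df.pivot_table(index=, values=, aggfunc=)

Function purpose: groups the rows by index and applies aggfunc to the values columns within each group

Parameters: index, values, aggfunc

index: str, the name of the column to group by

values: str/list, the name(s) of the column(s) to aggregate; use a list for several columns

aggfunc: function or function name, the aggregation to apply (the mean by default)

Returns: cdf

cdf: DataFrame holding the aggregated result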
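
As a minimal sketch of the grouping, assuming a made-up frame with a Rank and a Number column:

```python
import pandas as pd

# Hypothetical data: two rows of rank A, two of rank C, one of rank S
df = pd.DataFrame({
    "Rank": ["A", "C", "A", "C", "S"],
    "Number": [4.0, 34.0, 345.0, 34.0, 67.0],
})

# One output row per distinct Rank, holding the sum of Number in that group
sum_by_rank = df.pivot_table(index="Rank", values="Number", aggfunc="sum")
print(sum_by_rank)
```

Passing the aggregation by name ("sum") is used here instead of a NumPy function; both forms are accepted by pivot_table.
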

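2.3.1.11 dropna() method

Function call: ddf = df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Function purpose: drops the rows or columns that contain NaN values, according to the given rules

Parameters: axis, how, thresh, subset, inplace

axis: int/str, 0/'index' drops rows, 1/'columns' drops columns

how: str, 'any' drops a row/column as soon as it contains one NA value, 'all' drops it only when every value is NA

thresh: int/None, the minimum number of non-NA values a row/column needs in order to be kept (2 means at least two valid values are required)

subset: str/list, the labels along the other axis to which the NA check is restricted (for example column names when dropping rows)

inplace: bool, whether to operate on the original data; True modifies it, False returns new data

Returns: ddf

ddf: DataFrame, what remains after the NA rows/columns have been dropped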
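
The parameters are easier to compare side by side. The following sketch uses a made-up three-row frame; the column names only loosely mirror the sample file used later.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'No.' is complete, the other two columns have gaps
df = pd.DataFrame({
    "No.": [1001, 1002, 1003],
    "Info": [np.nan, np.nan, "df"],
    "Number": [4.0, np.nan, 543.0],
})

# axis=1: drop every column that contains at least one NaN
no_nan_columns = df.dropna(axis=1)
# axis=0 with thresh=1: keep rows with at least one valid value among the subset columns
rows_with_data = df.dropna(axis=0, thresh=1, subset=["Info", "Number"])
# Defaults (axis=0, how='any'): keep only rows without any NaN
complete_rows = df.dropna()

print(no_nan_columns.columns.tolist())
print(rows_with_data["No."].tolist())
print(complete_rows["No."].tolist())
```
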

2.3.1.12 reset_index() method

Function call: rdf = df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')

Function purpose: resets the index of a DataFrame (typically after sorting) to the default 0, 1, 2, ... numbering

Parameters: level, drop, inplace, col_level, col_fill

level: int/str/tuple/list, only remove the given levels from the index; removes all levels by default

drop: bool, whether to discard the old index; True discards it, False keeps it as a new column

inplace: bool, whether to operate on the original data

col_level: int/str, if the columns have multiple levels, determines which level the labels are inserted into; by default they are inserted into the first level

col_fill: obj, if the columns have multiple levels, determines how the other levels are named; if None, the index name is repeated

Returns: rdf

rdf: DataFrame, the DataFrame with its index reset

2.3.1.13 set_index() method

Function call: sdf = df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

Function purpose: sets the index from one or more existing columns

Parameters: keys, drop, append, inplace, verify_integrity

keys: str (or list of str), the name(s) of the column(s) to use as the index

drop: bool, whether to remove the column(s) used as the index from the data; True removes them, False keeps them

append: bool, whether to append the new index to the existing (default numbered) index instead of replacing it, which produces a MultiIndex

inplace: bool, whether to operate on the original data

verify_integrity: bool, check the new index for duplicates; otherwise the check is deferred until necessary. Setting it to False improves the performance of this method

Returns: sdf

sdf: DataFrame, the DataFrame with its new index
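
The two methods are roughly opposites of each other. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical three-row frame with the default 0, 1, 2 index
df = pd.DataFrame({
    "Type": ["MILK_1", "BUTTER_1", "WATER_1"],
    "Number": [416.7, 4.0, 186.6],
})

# Sorting keeps the original index labels attached to their rows
by_number = df.sort_values("Number")
# reset_index(drop=True) renumbers the rows and discards the old labels
renumbered = by_number.reset_index(drop=True)
# set_index turns a column into the index (and removes it from the columns by default)
by_type = df.set_index("Type")

print(by_number.index.tolist())
print(renumbered.index.tolist())
print(by_type.loc["WATER_1", "Number"])
```
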

2.3.1.14 apply() method

Function call: result = df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

Function purpose: applies a user-defined function along the chosen axis of the DataFrame

Parameters: func, axis, broadcast, raw, reduce, args, **kwds

func: function, the function to apply to each row/column

axis: int/str, 0/'index' applies the function to each column, 1/'columns' applies it to each row

broadcast: bool, for aggregation functions, return an object of the same size with the values propagated (deprecated in pandas 0.23 and removed in 1.0; use result_type='broadcast' instead)

raw: bool, if False, each row or column is converted into a Series; if True, the function receives ndarray objects instead. If you are just applying a NumPy reduction function, raw=True gives much better performance

reduce: bool/None, try to apply reduction procedures. If the DataFrame is empty, apply uses reduce to determine whether the result should be a Series or a DataFrame. If reduce is None (the default), the return value is guessed by calling func on an empty Series (note: while guessing, exceptions raised by func are ignored). If reduce is True a Series is always returned, and if False a DataFrame is always returned (deprecated in pandas 0.23 and removed in 1.0; use result_type='reduce' instead)

args: tuple, positional arguments to pass to the function in addition to the array/Series

**kwds: any remaining keyword arguments are passed on to the function

Returns: result

result: Series or DataFrame, depending on what func returns for each row/column
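
The effect of axis is easiest to see with a function that counts missing values. This is a sketch on made-up data; null_count is a helper defined only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column
df = pd.DataFrame({
    "Info": [np.nan, "df", "fa"],
    "Number": [4.0, np.nan, 345.0],
})

def null_count(values):
    """Count the missing entries in the Series that apply passes in."""
    return int(values.isnull().sum())

# axis=0: the function receives each column, so there is one result per column
nulls_per_column = df.apply(null_count, axis=0)
# axis=1: the function receives each row, so there is one result per row
nulls_per_row = df.apply(null_count, axis=1)

print(nulls_per_column)
print(nulls_per_row)
```
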

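2.3.1.15 ix attribute

Attribute access: ix_obj = df.ix

Attribute purpose: returns an indexer object (deprecated in pandas 0.20 and removed in 1.0; use loc for labels and iloc for positions instead)

Attribute value: ix_obj

ix_obj: obj, <class 'pandas.core.indexing._IXIndexer'>

Note: ix_obj[rows, cols] then returns a DataFrame or a Series, where rows/cols are the row index labels and column names to select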
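
Since ix is no longer available in current pandas releases, the same [rows, cols] selections can be written with loc (labels) or iloc (positions). A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical frame with non-default index labels
df = pd.DataFrame(
    {"Type": ["BUTTER_1", "BUTTER_2", "BUTTER_3"], "Rank": ["A", "C", "S"]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "Rank"]          # row label 20, column name 'Rank'
by_position = df.iloc[1, 1]            # second row, second column
subset = df.loc[[10, 30], ["Type"]]    # lists of labels give a DataFrame

print(by_label, by_position)
print(subset)
```
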

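2.3.2 Series

Instantiation: sr = pd.Series(data, index=) / df[column_name]

Class purpose: builds a Series

Parameters: data, index / column_name

data: ndarray (or another array-like), the one-dimensional data to build the Series from

index: Index or array-like, the labels to use as the index

column_name: str, the name of the DataFrame column to fetch as a Series

Returns: sr

sr: Series, the resulting Series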

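2.3.2.1 values attribute

Attribute access: values = sr.values

Attribute purpose: returns all the values of the Series

Attribute value: values

values: ndarray, a one-dimensional ndarray of all the values in the Series

2.3.2.2 tolist() method

Function call: lst = sr.tolist()

Function purpose: returns the data of a Series or an Index as a list

Parameters: (none)

Returns: lst

lst: list, the data as a Python list

2.3.2.3 max() / min() methods

Function call: value = sr.max() / value = sr.min()

Function purpose: returns the largest/smallest value in the Series

Parameters: (none required)

Returns: value

value: int/str or another scalar type, the extreme value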

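2.3.2.4 sort_values() method

Function call: ssr = sr.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Function purpose: returns the Series sorted by its values

Parameters: axis, ascending, inplace, kind, na_position

axis: int, only 0 is meaningful for a Series; the parameter exists for compatibility with DataFrame

ascending: bool, True sorts in ascending order, False in descending order

inplace: bool, True modifies the original Series, False returns a new Series

kind: str, the sorting algorithm, one of {'quicksort', 'mergesort', 'heapsort'}

na_position: str, where NaN values are placed, 'first' or 'last'

Returns: ssr

ssr: Series, the sorted Series

2.3.2.5 mean() method

Function call: msr = sr.mean()

Function purpose: returns the mean of all non-NaN values in the Series

Parameters: (none required)

Returns: msr

msr: float, the mean value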

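2.3.2.6 reset_index() method

Function call: rsr = sr.reset_index(level=None, drop=False, name=None, inplace=False)

Function purpose: resets the index of a Series (typically after sorting) to the default 0, 1, 2, ... numbering

Parameters: level, drop, name, inplace

level: int/str/tuple/list, only remove the given levels from the index; removes all levels by default

drop: bool, whether to discard the old index; True discards it, False keeps it as a column

name: obj, the name of the column corresponding to the Series values

inplace: bool, whether to operate on the original data

Returns: rsr

rsr: a Series with the index reset when drop=True; when drop=False the old index becomes a column and the result is a DataFrame

2.3.2.7 value_counts() method

Function call: csr = sr.value_counts(dropna=True)

Function purpose: counts how often each distinct value occurs in the Series

Parameters: dropna

dropna: bool, whether to leave NA out of the counts; True excludes it, False counts it

Returns: csr

csr: Series, with the distinct values as the index and their counts as the values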
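
A brief sketch of the dropna switch, on a made-up Series of ranks:

```python
import numpy as np
import pandas as pd

# Hypothetical ranks with one missing entry
rank = pd.Series(["A", "C", np.nan, "A", "S", "A"])

counts = rank.value_counts()                       # NaN is left out by default
counts_with_nan = rank.value_counts(dropna=False)  # NaN gets its own entry

print(counts)
print(counts_with_nan)
```

The result is ordered by count, most frequent value first.
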

 

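3 Basic pandas operations

First read the CSV file with read_csv. Note that the file must be genuine CSV text, for example one produced with Excel's "Save As" command; simply renaming a workbook's file extension to .csv does not convert it.

import pandas as pd

# If the file is not UTF-8, pass an explicit encoding, for example:
# info = pd.read_csv('info.csv', encoding='latin1')
# info = pd.read_csv('info.csv', encoding='ISO-8859-1')
# info = pd.read_csv('info.csv', encoding='cp1252')
info = pd.read_csv('info.csv')
# Print the whole table, followed by its row and column counts
print(info)

The output is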

     No.       Type  Info      Number Rank Mark.
0   1001   BUTTER_1   NaN    4.000000    A    cc
1   1002   BUTTER_2   NaN         NaN    C    dd
2   1003   BUTTER_3   NaN         NaN  NaN    ff
3   1004   BUTTER_4   NaN         NaN  NaN   NaN
4   1005   BUTTER_5    df  543.000000    F    cx
5   1006   BUTTER_6    fa  345.000000    A    cc
6   1007   BUTTER_7   jhf   67.000000    S    dd
7   1008   BUTTER_8    ad  567.000000    S    ff
8   1009   BUTTER_9  gdfs   34.000000    C    aa
9   1010  BUTTER_10  vczx   34.000000    C    cx
10  1011  BUTTER_11    as   89.000000    E    cc
11  1012  BUTTER_12    cd   90.000000    D    dd
12  1013  BUTTER_13   qwe   14.000000    S    ff
13  1014    WATER_1   asd  186.635198    A    aa
14  1015    WATER_2    as  222.000000    B    cc
15  1016    WATER_3    fa  193.026806    A    cc
16  1017    WATER_4   jhf  196.222611    C    dd
17  1018    WATER_5    ad  199.418415    B    ff
18  1019    WATER_6  gdfs  202.614219    D    aa
19  1020    WATER_7  vczx  205.810023    F    cx
20  1021    WATER_8    as  209.005827    A    cc
21  1022    WATER_9    cd  212.201632    S    dd
22  1023   WATER_10   qwe  215.397436    S    ff
23  1024   WATER_11   asd  218.593240    C    aa
24  1025   WATER_12    df  221.789044    C    cx
25  1026   WATER_13    fa  224.984848    E    cc
26  1027   WATER_14   jhf  228.180653    D    dd
27  1028   WATER_15    ad  231.376457    S    ff
28  1029   WATER_16  gdfs  234.572261    A    aa
29  1030   WATER_17  vczx  237.768065    B    cx
..   ...        ...   ...         ...  ...   ...
70  1071  CHEESE_11    as  368.796037    E    cc
71  1072  CHEESE_12    cd  371.991842    D    dd
72  1073  CHEESE_13   qwe  375.187646    S    ff
73  1074  CHEESE_14   asd  378.383450    A    aa
74  1075  CHEESE_15    df  381.579254    B    cx
75  1076  CHEESE_16    fa  384.775058    A    cc
76  1077  CHEESE_17   jhf  387.970863    C    dd
77  1078  CHEESE_18    ad  391.166667    B    ff
78  1079  CHEESE_19  gdfs  394.362471    D    aa
79  1080  CHEESE_20  vczx  397.558275    F    cx
80  1081  CHEESE_21    as  400.754079    A    cc
81  1082  CHEESE_22    cd  403.949883    S    dd
82  1083  CHEESE_23   qwe  407.145688    S    ff
83  1084  CHEESE_24   asd  410.341492    C    aa
84  1085  CHEESE_25    df  413.537296    C    cx
85  1086     MILK_1    fa  416.733100    E    cc
86  1087     MILK_2   jhf  419.928904    D    dd
87  1088     MILK_3    ad  423.124709    S    ff
88  1089     MILK_4  gdfs  426.320513    A    aa
89  1090     MILK_5  vczx  429.516317    B    cx
90  1091     MILK_6    as  432.712121    A    cc
91  1092     MILK_7    cd  435.907925    C    dd
92  1093     MILK_8   qwe  439.103730    B    ff
93  1094     MILK_9   asd  442.299534    D    aa
94  1095    MILK_10    df  445.495338    F    cx
95  1096    MILK_11    fa  448.691142    A    cc
96  1097    MILK_12   jhf  451.886946    S    dd
97  1098    MILK_13    ad  455.082751    S    ff
98  1099    MILK_14  gdfs  458.278555    C    aa
99  1100    MILK_15  vczx  461.474359    C    cx

[100 rows x 6 columns]

As shown, pandas has imported the data from the CSV file successfully.

Next, check the types of the imported data.

# Get the type of info
print(type(info))       # <class 'pandas.core.frame.DataFrame'>
print('-----------')
# Get the dtype of each column (object is the dtype pandas uses for Python strings)
print(info.dtypes)
# No.         int64
# Type       object
# Info       object
# Number    float64
# Rank       object
# Mark.      object
# dtype: object

Finally, a few basic methods and attributes return the first/last n rows, the column names, the shape and a statistical summary.

# Get the first n rows of the table (default is 5)
print(info.head(7))
print('-----------')
# Get the last n rows of the table (default is 5)
print(info.tail(7))
print('-----------')
# Get the name of each column
print(info.columns)
print('-----------')
# Get the shape of the table as (rows, columns)
print(info.shape)
print('-----------')
# Get summary statistics for the numeric columns:
# count, mean, standard deviation, min, 25%, 50%, 75%, max
print(info.describe())

Output

    No.      Type Info  Number Rank Mark.
0  1001  BUTTER_1  NaN     4.0    A    cc
1  1002  BUTTER_2  NaN     NaN    C    dd
2  1003  BUTTER_3  NaN     NaN  NaN    ff
3  1004  BUTTER_4  NaN     NaN  NaN   NaN
4  1005  BUTTER_5   df   543.0    F    cx
5  1006  BUTTER_6   fa   345.0    A    cc
6  1007  BUTTER_7  jhf    67.0    S    dd
-----------
     No.     Type  Info      Number Rank Mark.
93  1094   MILK_9   asd  442.299534    D    aa
94  1095  MILK_10    df  445.495338    F    cx
95  1096  MILK_11    fa  448.691142    A    cc
96  1097  MILK_12   jhf  451.886946    S    dd
97  1098  MILK_13    ad  455.082751    S    ff
98  1099  MILK_14  gdfs  458.278555    C    aa
99  1100  MILK_15  vczx  461.474359    C    cx
-----------
Index(['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.'], dtype='object')
-----------
(100, 6)
-----------
               No.      Number
count   100.000000   97.000000
mean   1050.500000  309.401389
std      29.011492  110.975188
min    1001.000000    4.000000
25%    1025.750000  240.963869
50%    1050.500000  317.663170
75%    1075.250000  391.166667
max    1100.000000  567.000000

 

4 Calculations with pandas

Because pandas' basic structures are built on NumPy's ndarray, NumPy's basic computations also work on DataFrame and Series.

Below are examples of some basic pandas calculations.

Full code

import pandas as pd

info = pd.read_csv('info.csv')
# Select rows by index label: a single row, then labels 3 through 7 (inclusive)
print(info.loc[0])
print(info.loc[3:7])
print('----------')
# Select one column by name, or several columns with a list of names
print(info['Type'])
print(info[['Type', 'No.']])
# Get the column names as a list
col_names = info.columns.tolist()
print(col_names)

# Keep only the columns whose names end with '.'
dotList = []
for n in col_names:
    if n.endswith('.'):
        dotList.append(n)
newList = info[dotList]
print(newList)

# Arithmetic on a column is applied to every element, as in NumPy
print(info['Number'] * 10)

# Arithmetic on two columns of the same length works element by element
x = info['Number']
y = info['No.']
print(x + y)
# The same holds for string columns (concatenation)
x = info['Rank']
y = info['Mark.']
print(x + y)

# Assigning to a new name appends a column (its length must match the DataFrame)
print(info.shape)
info['New'] = x + y
print(info.shape)
print('----------')

# Get the max/min value of a column
print(info['Number'].max())
print(info['Number'].min())

num = info['Number']
num_null_true = pd.isnull(num)
# With the built-in sum, a single NaN makes the whole result NaN
print(sum(info['Number']) / len(info['Number']))  # nan
# Invert the boolean mask with ~ (equivalent to mask == False) to keep the valid values
good_value = info['Number'][~num_null_true]
print(sum(good_value) / len(good_value))
print(good_value.mean())
# The mean method skips missing data automatically
print(info['Number'].mean())
print('---------')

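Step-by-step explanation

First import pandas and load the data file, then use loc to fetch rows. Slicing works much like list slicing, except that loc slices by label and includes the end label.

import pandas as pd

info = pd.read_csv('info.csv')
# Select rows by index label: a single row, then labels 3 through 7 (inclusive)
print(info.loc[0])
print(info.loc[3:7])
print('----------')
# Select one column by name, or several columns with a list of names
print(info['Type'])
print(info[['Type', 'No.']])

The result is shown below; it is fairly long.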

No.           1001
Type      BUTTER_1
Info           NaN
Number           4
Rank             A
Mark.           cc
Name: 0, dtype: object
    No.      Type Info  Number Rank Mark.
3  1004  BUTTER_4  NaN     NaN  NaN   NaN
4  1005  BUTTER_5   df   543.0    F    cx
5  1006  BUTTER_6   fa   345.0    A    cc
6  1007  BUTTER_7  jhf    67.0    S    dd
7  1008  BUTTER_8   ad   567.0    S    ff
----------
0      BUTTER_1
1      BUTTER_2
2      BUTTER_3
3      BUTTER_4
4      BUTTER_5
5      BUTTER_6
6      BUTTER_7
7      BUTTER_8
8      BUTTER_9
9     BUTTER_10
10    BUTTER_11
11    BUTTER_12
12    BUTTER_13
13      WATER_1
14      WATER_2
15      WATER_3
16      WATER_4
17      WATER_5
18      WATER_6
19      WATER_7
20      WATER_8
21      WATER_9
22     WATER_10
23     WATER_11
24     WATER_12
25     WATER_13
26     WATER_14
27     WATER_15
28     WATER_16
29     WATER_17
        ...    
70    CHEESE_11
71    CHEESE_12
72    CHEESE_13
73    CHEESE_14
74    CHEESE_15
75    CHEESE_16
76    CHEESE_17
77    CHEESE_18
78    CHEESE_19
79    CHEESE_20
80    CHEESE_21
81    CHEESE_22
82    CHEESE_23
83    CHEESE_24
84    CHEESE_25
85       MILK_1
86       MILK_2
87       MILK_3
88       MILK_4
89       MILK_5
90       MILK_6
91       MILK_7
92       MILK_8
93       MILK_9
94      MILK_10
95      MILK_11
96      MILK_12
97      MILK_13
98      MILK_14
99      MILK_15
Name: Type, Length: 100, dtype: object
         Type   No.
0    BUTTER_1  1001
1    BUTTER_2  1002
2    BUTTER_3  1003
3    BUTTER_4  1004
4    BUTTER_5  1005
5    BUTTER_6  1006
6    BUTTER_7  1007
7    BUTTER_8  1008
8    BUTTER_9  1009
9   BUTTER_10  1010
10  BUTTER_11  1011
11  BUTTER_12  1012
12  BUTTER_13  1013
13    WATER_1  1014
14    WATER_2  1015
15    WATER_3  1016
16    WATER_4  1017
17    WATER_5  1018
18    WATER_6  1019
19    WATER_7  1020
20    WATER_8  1021
21    WATER_9  1022
22   WATER_10  1023
23   WATER_11  1024
24   WATER_12  1025
25   WATER_13  1026
26   WATER_14  1027
27   WATER_15  1028
28   WATER_16  1029
29   WATER_17  1030
..        ...   ...
70  CHEESE_11  1071
71  CHEESE_12  1072
72  CHEESE_13  1073
73  CHEESE_14  1074
74  CHEESE_15  1075
75  CHEESE_16  1076
76  CHEESE_17  1077
77  CHEESE_18  1078
78  CHEESE_19  1079
79  CHEESE_20  1080
80  CHEESE_21  1081
81  CHEESE_22  1082
82  CHEESE_23  1083
83  CHEESE_24  1084
84  CHEESE_25  1085
85     MILK_1  1086
86     MILK_2  1087
87     MILK_3  1088
88     MILK_4  1089
89     MILK_5  1090
90     MILK_6  1091
91     MILK_7  1092
92     MILK_8  1093
93     MILK_9  1094
94    MILK_10  1095
95    MILK_11  1096
96    MILK_12  1097
97    MILK_13  1098
98    MILK_14  1099
99    MILK_15  1100

[100 rows x 2 columns]

Get the column names of the DataFrame

# Get the column names as a list
col_names = info.columns.tolist()
print(col_names)

The result is

['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.']

Select all the columns whose names end with '.'

# Keep only the columns whose names end with '.'
dotList = []
for n in col_names:
    if n.endswith('.'):
        dotList.append(n)
newList = info[dotList]
print(newList)

Basic arithmetic is applied to every value of a Series

# Arithmetic on a column is applied to every element, as in NumPy
print(info['Number'] * 10)

For two Series of the same shape, an operation is applied to each pair of corresponding values

# Arithmetic on two columns of the same length works element by element
x = info['Number']
y = info['No.']
print(x + y)
# The same holds for string columns (concatenation)
x = info['Rank']
y = info['Mark.']
print(x + y)

Create a new column named 'New' whose values are the element-wise sum of two columns (here, the concatenated strings)

# Assigning to a new name appends a column (its length must match the DataFrame)
print(info.shape)
info['New'] = x + y
print(info.shape)
print('----------')

Get the largest and smallest values of a Series

# Get the max/min value of a column
print(info['Number'].max())
print(info['Number'].min())

Two ways to compute a mean:

  1. Sum the values and divide by the count directly; if any value is NaN, the result is NaN
  2. Use the mean method, which skips missing (NaN) data automatically

num = info['Number']
num_null_true = pd.isnull(num)
# With the built-in sum, a single NaN makes the whole result NaN
print(sum(info['Number']) / len(info['Number']))  # nan
# Invert the boolean mask with ~ (equivalent to mask == False) to keep the valid values
good_value = info['Number'][~num_null_true]
print(sum(good_value) / len(good_value))
print(good_value.mean())
# The mean method skips missing data automatically
print(info['Number'].mean())
print('---------')

 

5 pandas Series

This section covers some basic uses of the pandas Series data type.

Full code

import pandas as pd

info = pd.read_csv('info.csv')

# Fetch a Series from the DataFrame
rank_series = info['Rank']
print(type(info))         # <class 'pandas.core.frame.DataFrame'>
print(type(rank_series))  # <class 'pandas.core.series.Series'>
print(rank_series[0:5])

# Build a new Series
from pandas import Series
# Take the values of the rank Series
rank = rank_series.values
print(rank)
# DataFrame --> Series --> ndarray
print(type(rank))  # <class 'numpy.ndarray'>
# Take the values of the type Series to use as index labels
type_series = info['Type']
types = type_series.values
# Build a new Series from the two: Series(values, index=labels)
series_custom = Series(rank, index=types)
print(series_custom)
# Select values by a list of index labels
print(series_custom[['MILK_14', 'MILK_15']])
# Select values by position
print(series_custom.iloc[0:2])

# The built-in sorted() returns a plain list of the sorted values
# (NaN is dropped first because it cannot be ordered against strings)
print(sorted(series_custom.dropna()))

# Re-order a Series by its index
original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
print(sorted_by_index)
# Built-in Series sorting methods
print(series_custom.sort_index())
print(series_custom.sort_values())

import numpy as np
# Adding two Series combines the values that share an index label
print(np.add(series_custom, series_custom))
# Apply the sin function to each value
print(np.sin(info['Number']))
# Return the max value (a single value, not a Series), ignoring NaN
# If the max occurs more than once, it is still returned only once
print(np.nanmax(info['Number'].to_numpy()))

# Filter values in a range
criteria_one = series_custom > 'C'
criteria_two = series_custom < 'S'
print(series_custom[criteria_one & criteria_two])

Step-by-step explanation

Fetch a Series from the DataFrame by column name

import pandas as pd

info = pd.read_csv('info.csv')

# Fetch a Series from the DataFrame
rank_series = info['Rank']
print(type(info))         # <class 'pandas.core.frame.DataFrame'>
print(type(rank_series))  # <class 'pandas.core.series.Series'>
print(rank_series[0:5])

To build a new Series, take one column for the index and another for the values, then pass both to the Series constructor

# Build a new Series
from pandas import Series
# Take the values of the rank Series
rank = rank_series.values
print(rank)
# DataFrame --> Series --> ndarray
print(type(rank))  # <class 'numpy.ndarray'>
# Take the values of the type Series to use as index labels
type_series = info['Type']
types = type_series.values
# Build a new Series from the two: Series(values, index=labels)
series_custom = Series(rank, index=types)
print(series_custom)

Select several values from the Series with a list of index labels or by position

# Select values by a list of index labels
print(series_custom[['MILK_14', 'MILK_15']])
# Select values by position
print(series_custom.iloc[0:2])

Use the built-in sorted function to order the values of the Series; the return value is a list

# The built-in sorted() returns a plain list of the sorted values
# (NaN is dropped first because it cannot be ordered against strings)
print(sorted(series_custom.dropna()))

Two ways to sort a Series

  1. Get the index labels, sort them, then use reindex to obtain a new Series

# Re-order a Series by its index
original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
print(sorted_by_index)

  2. Use the sort_index or sort_values methods

# Built-in Series sorting methods
print(series_custom.sort_index())
print(series_custom.sort_values())

Addition, sine and max on a Series: NumPy functions operate on the corresponding values of the Series

import numpy as np
# Adding two Series combines the values that share an index label
print(np.add(series_custom, series_custom))
# Apply the sin function to each value
print(np.sin(info['Number']))
# Return the max value (a single value, not a Series), ignoring NaN
# If the max occurs more than once, it is still returned only once
print(np.nanmax(info['Number'].to_numpy()))

Use boolean (True/False) Series to select the values that fall within a range

# Filter values in a range
criteria_one = series_custom > 'C'
criteria_two = series_custom < 'S'
print(series_custom[criteria_one & criteria_two])

 

6 Common pandas functions

Below are examples of some commonly used pandas functions.

Full code

import pandas as pd
import numpy as np

info = pd.read_csv('info.csv')

# Sort by a column
# inplace=True sorts the original DataFrame, inplace=False returns a new one
new = info.sort_values('Mark.', inplace=False, na_position='last')
print(new)
# The default order is ascending (ascending=True)
# With the default na_position='last', NaN (missing) values go to the end in either order
info.sort_values('Mark.', inplace=True, ascending=False)
print(info)
print('---------')
# Pick out the entries with a missing value
num = info['Number']
# isnull returns a boolean Series: True for null, False otherwise
num_null_true = pd.isnull(num)
print(num_null_true)
num_null = num[num_null_true]
print(num_null)  # only the NaN entries
print('---------')

# pivot_table groups the rows that share a value and aggregates each group
# index tells the method which column to group by
# values is the column that we want to apply the calculation to
# aggfunc specifies the calculation we want to perform, the default is the mean
number_sum_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc='sum')
print(number_sum_by_rank)
print('---------')
# Aggregate several columns at once
sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc='sum')
print(sum_by_rank)
print('---------')

# dropna drops the rows/columns that contain null values
info = pd.read_csv('info.csv')
# Drop every column that contains NaN (axis=1 for columns, axis=0 for rows)
drop_na_column = info.dropna(axis=1)
print(drop_na_column)
print('---------')
# Drop rows based on the columns listed in subset
# thresh is the minimum number of valid (non-NaN) values a row needs to be kept
drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.'])
print(drop_na_row)
print('---------')
# Locate a single value by its row index label and column name
# (with the default index, label 77 is the row whose 'No.' is 1078)
print(info)
row_77_Rank = info.loc[77, 'Rank']
print(row_77_Rank)
row_88_Info = info.loc[88, 'Info']
print(row_88_Info)
print('---------')

# reset_index renumbers the index of a sorted DataFrame
new_info = info.sort_values('Rank', ascending=False)
print(new_info[0:10])
print('---------')
# drop=True discards the old index, drop=False (the default) keeps it as a column
reset_new_info = new_info.reset_index(drop=True)
print(reset_new_info[0:10])
print('---------')

# Define your own function for pandas
# and use apply to run it on the DataFrame
def hundredth_row(col):
    # Return the item with index label 99 from the column
    hundredth_item = col.loc[99]
    return hundredth_item
hundred_row = info.apply(hundredth_row, axis=0)
print(hundred_row)
print('---------')
# Null count
# apply calls the function once for each column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
# axis=0 passes each column to the function, giving one result per column
# axis=1 passes each row to the function, giving one result per row
column_null_count = info.apply(null_count, axis=0)
print(column_null_count)
print('---------')

# Example: classify the data by Rank, and calculate the sum for each class
def rank_sort(row):
    rank = row['Rank']
    if rank == 'S':
        return 'Excellent'
    elif rank == 'A':
        return 'Great'
    elif rank == 'B':
        return 'Good'
    elif rank == 'C':
        return 'Pass'
    else:
        return 'Failed'
# Build the classification column (one label per row)
rank_info = info.apply(rank_sort, axis=1)
print(rank_info)
print('---------')
# Add the column to the DataFrame
info['Rank_Classfied'] = rank_info
# Calculate the sum of 'Number' according to 'Rank_Classfied'
new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc='sum')
print(new_rank_number)

# set_index returns a new DataFrame that is indexed by the values in the specified column
# The column is removed from the data by default (drop=True)
# It is kept as a regular column as well if drop=False
index_type = info.set_index('Type', drop=False)
print(index_type)
print('---------')

# Use string index labels to slice the DataFrame (both ends are included)
# Note: the index labels should be unique
print(index_type['MILK_1':'MILK_7'])
print('---------')
print(index_type.loc['MILK_1':'MILK_7'])
# Positional slicing is available too
print('---------')
print(index_type[-15:-8])
print('---------')

# Calculate the standard deviation across the two columns for each row
cal_list = info[['Number', 'No.']]
# np.std([x, y]) --> std value
# Each x passed to the lambda is a Series holding one row
# cal_list.apply(lambda x: print(type(x)), axis=1)
print(cal_list.apply(lambda x: np.std(x), axis=1))

分段解釋

首先導入模塊,然后利用sort_values函數對DataFrame或Series進行排序操作

 1 mport pandas as pd
 2 import numpy as np
 3 
 4 info = pd.read_csv('info.csv')
 5 
 6 # Sort value by column  
 7 # inplace is True will sort value base on origin, False will return a new DataFrame
 8 new = info.sort_values('Mark.', inplace=False, na_position='last')
 9 print(new)
10 # Sorted by ascending order in default(ascending=True) 
11 # No matter ascending or descending sort, the NaN(NA, missing value) value will be placed at tail
12 info.sort_values('Mark.', inplace=True, ascending=False)
13 print(info)
14 print('---------')

利用isnull函數對null值的數據進行過濾,可利用Series==False對isnull得到的序列進行反轉

1 # Filter off the null row
2 num = info['Number']
3 # isnull will return a list contains the status of null or not, True for null, False for not
4 num_null_true = pd.isnull(num)
5 print(num_null_true)
6 num_null = num[num_null_true]
7 print(num_null) # 12 NaN
8 print('---------')

Use the pivot_table function to group rows that share the same attribute and apply a given function to each group.

# pivot_table groups rows that share the same attribute and aggregates each group
# index: the column to group by
# values: the column(s) the calculation is applied to
# aggfunc: the calculation to perform; the default is the mean
sum_number_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc='sum')
print(sum_number_by_rank)
print('---------')
# Aggregate several columns at once
sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc='sum')
print(sum_by_rank)
print('---------')
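
As a rough illustration of what pivot_table computes, the sketch below compares it with the equivalent groupby call and shows the default mean aggregation. The `toy` data is hypothetical; only the column names `Rank` and `Number` follow the article.

```python
import pandas as pd

# Hypothetical sample data
toy = pd.DataFrame({'Rank': ['A', 'B', 'A', 'B', 'S'],
                    'Number': [1, 2, 3, 4, 5]})

# Sum of 'Number' within each 'Rank' group, as a DataFrame
by_pivot = toy.pivot_table(index='Rank', values='Number', aggfunc='sum')

# The same aggregation written with groupby, as a Series
by_group = toy.groupby('Rank')['Number'].sum()

# Leaving aggfunc out falls back to the mean
mean_default = toy.pivot_table(index='Rank', values='Number')

print(by_pivot)
print(by_group)
print(mean_default)
```

pivot_table returns a DataFrame even for a single value column, whereas the groupby form above returns a Series.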

Use the dropna function to remove rows or columns that contain null values.

# dropna drops the rows/columns that contain null values
info = pd.read_csv('info.csv')
# Drop the columns that contain NaN (axis=1 for columns, axis=0 for rows)
drop_na_column = info.dropna(axis=1)
print(drop_na_column)
print('---------')
# Drop rows based only on the columns listed in subset
# thresh is the minimum number of non-NaN values (within subset) a row needs to be kept
drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.'])
print(drop_na_row)
print('---------')
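
The effect of `thresh` and `subset` is easier to see on a tiny table. The following is a sketch with invented data; the `Extra` column is hypothetical and exists only to show a column that survives `dropna(axis=1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
toy = pd.DataFrame({'Number': [1.0, np.nan, np.nan],
                    'Info':   ['x', np.nan, 'z'],
                    'Extra':  [7, 8, 9]})

# Columns containing any NaN are removed; only 'Extra' is complete
no_nan_cols = toy.dropna(axis=1)

# Keep rows with at least one non-NaN value among 'Number' and 'Info'
at_least_one = toy.dropna(axis=0, thresh=1, subset=['Number', 'Info'])

# Default behaviour (how='any'): a single NaN in the subset drops the row
both_present = toy.dropna(axis=0, subset=['Number', 'Info'])

print(list(no_nan_cols.columns))
print(at_least_one.index.tolist())
print(both_present.index.tolist())
```

Note that the surviving rows keep their original index labels, which is why a reset_index step often follows.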

Use loc to locate a value in the data.

# loc selects a value by its row label and column name
# With the default integer index, the label is the 0-based row number (the 'No.' value is the label plus 1)
print(info)
row_77_Rank = info.loc[77, 'Rank']
print(row_77_Rank)
row_88_Info = info.loc[88, 'Info']
print(row_88_Info)
print('---------')

Use the reset_index function to renumber the index.

# reset_index renumbers the index of a sorted DataFrame
new_info = info.sort_values('Rank', ascending=False)
print(new_info[0:10])
print('---------')
# drop=True discards the old index; drop=False (the default) keeps it as a new column
reset_new_info = new_info.reset_index(drop=True)
print(reset_new_info[0:10])
print('---------')
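
Sorting keeps each row's original label, which is why loc and positional access can disagree until the index is reset. A minimal sketch, using a made-up three-row `toy` table:

```python
import pandas as pd

# Hypothetical sample data
toy = pd.DataFrame({'Rank': ['B', 'S', 'A']})

sorted_toy = toy.sort_values('Rank', ascending=False)   # order: S, B, A

print(sorted_toy.index.tolist())     # labels travel with their rows
by_label = sorted_toy.loc[0, 'Rank']         # row whose label is 0
by_position = sorted_toy.iloc[0]['Rank']     # first row after sorting

# drop=True throws the old labels away and renumbers from 0
renumbered = sorted_toy.reset_index(drop=True)

# The default drop=False keeps the old labels in a column named 'index'
kept = sorted_toy.reset_index()

print(by_label, by_position, renumbered.loc[0, 'Rank'])
print(kept.columns.tolist())
```

After `reset_index(drop=True)` label 0 and position 0 refer to the same row again.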

Use the apply function to run user-defined functions.

# Define your own function for pandas
# and use apply to run it on every column (or row)
def hundredth_row(col):
    # Return the item labelled 99 (the 100th row under the default index)
    hundredth_item = col.loc[99]
    return hundredth_item
hundred_row = info.apply(hundredth_row, axis=0)
print(hundred_row)
print('---------')
# Null count
# The function is applied to each column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
# axis=0 (the default) passes each column to the function as a Series,
# giving one result per column; axis=1 passes each row instead,
# giving one result per row
column_null_count = info.apply(null_count, axis=0)
print(column_null_count)
print('---------')

# Example: classify each row by its Rank (the sums per class are calculated afterwards)
def rank_sort(row):
    rank = row['Rank']
    if rank == 'S':
        return 'Excellent'
    elif rank == 'A':
        return 'Great'
    elif rank == 'B':
        return 'Good'
    elif rank == 'C':
        return 'Pass'
    else:
        return 'Failed'
# Build the classified column, one label per row
rank_info = info.apply(rank_sort, axis=1)
print(rank_info)
print('---------')
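
For simple lookups like the rank classification, a dictionary passed to Series.map is a common alternative to a row-wise apply. The sketch below assumes the same label mapping as rank_sort; the `toy` data and the `labels` dictionary name are illustrative, not part of the original code.

```python
import pandas as pd

# Hypothetical sample data, including an unknown rank and a missing one
toy = pd.DataFrame({'Rank': ['S', 'A', 'B', 'C', 'D', None]})

labels = {'S': 'Excellent', 'A': 'Great', 'B': 'Good', 'C': 'Pass'}

# map leaves unmatched or missing ranks as NaN; fillna supplies the fallback
by_map = toy['Rank'].map(labels).fillna('Failed')

# Row-wise apply (axis=1) producing the same labels
by_apply = toy.apply(lambda row: labels.get(row['Rank'], 'Failed'), axis=1)

# Column-wise apply (axis=0): one null count per column
null_counts = toy.apply(lambda col: pd.isnull(col).sum(), axis=0)

print(by_map.tolist())
print(by_apply.tolist())
print(null_counts)
```

Both approaches give the same labels here; the map version avoids calling a Python function once per row, which tends to matter on larger tables.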

Add a column to the DataFrame and run a calculation on it.

# Add the column to the DataFrame
info['Rank_Classfied'] = rank_info
# Calculate the sum of 'Number' for each 'Rank_Classfied' group
new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc='sum')
print(new_rank_number)

Use the set_index function to set a new index, then slice by that index. When the slice bounds are index label strings, all rows between the two labels are returned, both ends included.

# set_index returns a new DataFrame indexed by the values of the given column.
# By default (drop=True) that column is removed from the data; drop=False keeps it.
# append=True is left out: the plain string slices below need 'Type' to be the only index level.
index_type = info.set_index('Type', drop=False)
print(index_type)
print('---------')

# Slice the DataFrame with string index labels
# Note: label slicing needs the index labels to be unique (or the index to be sorted)
print(index_type['MILK_1':'MILK_7'])
print('---------')
print(index_type.loc['MILK_1':'MILK_7'])
# Positional (integer) slicing is available too
print('---------')
print(index_type.iloc[-15:-8])
print('---------')
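
The difference between label slicing and positional slicing can be sketched on a small table. The `toy` rows below are invented; only the `Type` column name and the `MILK_n` label pattern are taken from the article.

```python
import pandas as pd

# Hypothetical sample data
toy = pd.DataFrame({'Type': ['MILK_1', 'MILK_2', 'MILK_3', 'TEA_1'],
                    'Number': [10, 20, 30, 40]})

kept_column = toy.set_index('Type', drop=False)   # 'Type' is both index and column
dropped_column = toy.set_index('Type')            # default drop=True removes the column

# Label slices include BOTH end points
label_slice = kept_column.loc['MILK_1':'MILK_3']

# Positional slices exclude the end position, like ordinary Python slices
position_slice = kept_column.iloc[0:2]

print(label_slice['Number'].tolist())
print(position_slice['Number'].tolist())
print('Type' in kept_column.columns, 'Type' in dropped_column.columns)
```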

For each row, calculate the standard deviation of the values in two different columns.

# For each row, calculate the standard deviation of the values in two columns
cal_list = info[['Number', 'No.']]
# np.std([x, y]) --> std value
# With axis=1 the lambda receives each row as a Series
# cal_list.apply(lambda x: print(type(x)), axis=1)
print(cal_list.apply(lambda x: np.std(x), axis=1))
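
One detail worth knowing here is that NumPy and pandas use different defaults for the standard deviation: np.std divides by N (ddof=0), while DataFrame.std divides by N-1 (ddof=1). The sketch below uses made-up numbers and converts each row to a NumPy array explicitly, which is an assumption added for clarity rather than what the original lambda does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
toy = pd.DataFrame({'Number': [1.0, 4.0], 'No.': [3.0, 4.0]})

# Row-wise population standard deviation via NumPy (ddof=0)
row_std_numpy = toy.apply(lambda row: np.std(row.to_numpy()), axis=1)

# The same result without apply, by asking pandas for ddof=0
row_std_population = toy.std(axis=1, ddof=0)

# pandas' own default is the sample standard deviation (ddof=1)
row_std_sample = toy.std(axis=1)

print(row_std_numpy.tolist())
print(row_std_population.tolist())
print(row_std_sample.tolist())
```

With only two values per row the two conventions differ by a factor of sqrt(2), so it matters which one a report is expected to use.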

 

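Supplementary Notes

1. Many pandas functions are built on NumPy; a single pandas function may call several NumPy functions internally.

2. Text columns are usually stored with the object dtype, whose elements are ordinary Python str objects (an object column can hold any Python object).

3. When reading a file, if no column names are specified, pandas takes the first line of the file as the column names.

4. The basic structures are DataFrame and Series. A DataFrame is made up of a collection of Series and can be split back into them: a DataFrame is comparable to a matrix, and a Series to one of its rows or columns.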
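
Notes 3 and 4 can be illustrated with a short sketch. The CSV text below is invented and read from memory with io.StringIO instead of a file, so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content
csv_text = "Name,Number\nmilk,1\ntea,2\n"

# Default: the first line becomes the column names
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first line is treated as data and columns are numbered
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# Both a single column and a single row of a DataFrame are Series
one_column = with_header['Name']
one_row = with_header.iloc[0]

print(with_header.columns.tolist(), len(with_header))
print(no_header.columns.tolist(), len(no_header))
print(type(one_column).__name__, type(one_row).__name__)
```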

 

Related Reading


1. Using numpy

 

