Pandas Quick Start (Part 1): Basic Operations


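This article walks through a set of basic Pandas operations so that you can get productive quickly.

Some NumPy background is helpful. If you are not yet familiar with NumPy, read the earlier article on basic NumPy operations first.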

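Pandas Data Structures

Pandas has two core data structures: Series and DataFrame.

Series

A Series is a one-dimensional array-like object that holds a sequence of values and a matching index.

import numpy as np
import pandas as pd

obj = pd.Series([6, 66, 666, 6666])
obj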
0       6
1      66
2     666
3    6666
dtype: int64

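When no index is given, the default index runs from 0 to N - 1, where N is the number of values. The values and the index of a Series can be accessed separately:

obj.values
obj.index  # similar to range(4)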
array([   6,   66,  666, 6666])
RangeIndex(start=0, stop=4, step=1)

The index can also be given as labels:

obj2 = pd.Series([6, 66, 666, 6666], index=['d', 'b', 'a', 'c'])
obj2
obj2.index
d       6
b      66
a     666
c    6666
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')

Values can be read and assigned through index labels, which feels like a cross between a NumPy array and a dictionary.

obj2['a']
obj2['d'] = 66666
obj2[['c', 'a', 'd']]
666
c     6666
a      666
d    66666
dtype: int64

A Series supports many NumPy-style operations, and the link between index labels and values is preserved:

obj2[obj2 > 100]
obj2 / 2
np.sqrt(obj2)
d    66666
a      666
c     6666
dtype: int64
d    33333.0
b       33.0
a      333.0
c     3333.0
dtype: float64
d    258.197599
b      8.124038
a     25.806976
c     81.645576
dtype: float64

Check whether an index label exists:

'b' in obj2
'e' in obj2
True
False

A Series can be created directly from a dictionary:

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
states = ['Texas', 'California', 'Ohio', 'Oregon']
obj4 = pd.Series(sdata, index=states)  # specify the index labels and their order
obj4
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
Texas         71000.0
California        NaN
Ohio          35000.0
Oregon        16000.0
dtype: float64

isnull and notnull detect missing values. They are available both as top-level Pandas functions and as Series methods:

pd.isnull(obj4)
pd.notnull(obj4)
obj4.isnull()
obj4.notnull()
Texas         False
California     True
Ohio          False
Oregon        False
dtype: bool
Texas          True
California    False
Ohio           True
Oregon         True
dtype: bool
Texas         False
California     True
Ohio          False
Oregon        False
dtype: bool
Texas          True
California    False
Ohio           True
Oregon         True
dtype: bool

In arithmetic, Series are aligned by index label, similar to a join in a database:

obj3
obj4
obj3 + obj4
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
Texas         71000.0
California        NaN
Ohio          35000.0
Oregon        16000.0
dtype: float64
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
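
The alignment behavior above can be reproduced with a smaller pair of Series. This is a minimal sketch with made-up labels and values (the names left, right, and total are illustrative, not from the original article): the result index is the union of both indexes, and a label present on only one side yields NaN.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative data).
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label, not by position.
total = left + right
print(total)

# 'a' and 'd' exist on one side only, so their sums are missing.
print(total.isna())
```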

Both a Series and its index have a name attribute:

obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
Texas         71000.0
California        NaN
Ohio          35000.0
Oregon        16000.0
Name: population, dtype: float64

The index of a Series can be replaced in place by assignment:

obj
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
0       6
1      66
2     666
3    6666
dtype: int64
Bob         6
Steve      66
Jeff      666
Ryan     6666
dtype: int64

DataFrame

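A DataFrame represents a rectangular table of data with an ordered collection of columns, each of which can have a different data type.

A DataFrame has both a row index and a column index. It can be thought of as a dictionary of Series that all share the same row index.

A common way to create one is from a dictionary of equal-length lists (or NumPy arrays):

# every value is a list of length 6
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [2.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)
frame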
	state	year	pop
0	Ohio	2000	2.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

The head method selects the first few rows (five by default):

frame.head()
frame.head(2)
	state	year	pop
0	Ohio	2000	2.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
	state	year	pop
0	Ohio	2000	2.5
1	Ohio	2001	1.7

If a sequence of columns is specified, the columns are arranged in that order:

pd.DataFrame(data, columns=['year', 'state', 'pop'])
	year	state	pop
0	2000	Ohio	2.5
1	2001	Ohio	1.7
2	2002	Ohio	3.6
3	2001	Nevada	2.4
4	2002	Nevada	2.9
5	2003	Nevada	3.2

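The row index can be specified in the same way. A requested column that is not in the data, such as debt here, is filled with missing values:

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2
frame2.columns
frame2.index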
	year	state	pop	debt
one	2000	Ohio	2.5	NaN
two	2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN
six	2003	Nevada	3.2	NaN
Index(['year', 'state', 'pop', 'debt'], dtype='object')
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

A column of a DataFrame can be retrieved by key or by attribute; the result is a Series:

frame2['state']
frame2.year
type(frame2.year)
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
pandas.core.series.Series

Rows can be retrieved by label with the loc attribute:

frame2.loc['three']
type(frame2.loc['three'])
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
pandas.core.series.Series

Assign to the 'debt' column, first with a scalar and then with an array:

frame2['debt'] = 16.5
frame2
frame2['debt'] = np.arange(6.)
frame2
     year	state	pop	debt
one	2000	Ohio	2.5	16.5
two	2001	Ohio	1.7	16.5
three	2002	Ohio	3.6	16.5
four	2001	Nevada	2.4	16.5
five	2002	Nevada	2.9	16.5
six	2003	Nevada	3.2	16.5
     year	state	pop	debt
one	2000	Ohio	2.5	0.0
two	2001	Ohio	1.7	1.0
three	2002	Ohio	3.6	2.0
four	2001	Nevada	2.4	3.0
five	2002	Nevada	2.9	4.0
six	2003	Nevada	3.2	5.0

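A Series can also be assigned to a DataFrame column. Its labels are aligned with the DataFrame's index, and rows without a matching label become NaN:

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2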
	year	state	pop	debt
one	2000	Ohio	2.5	NaN
two	2001	Ohio	1.7	-1.2
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	-1.5
five	2002	Nevada	2.9	-1.7
six	2003	Nevada	3.2	NaN

Assigning to a column that does not exist creates a new column:

frame2['eastern'] = frame2.state == 'Ohio'
frame2
	year	state	pop	debt	eastern
one	2000	Ohio	2.5	NaN	True
two	2001	Ohio	1.7	-1.2	True
three	2002	Ohio	3.6	NaN	True
four	2001	Nevada	2.4	-1.5	False
five	2002	Nevada	2.9	-1.7	False
six	2003	Nevada	3.2	NaN	False

Delete a column:

del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

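In the pandas versions this article was written against, a column selected from a DataFrame is a view on the underlying data, not a copy, so in-place changes to it are reflected in the DataFrame. Releases with Copy-on-Write enabled behave differently, so do not rely on this; when an independent copy is needed, call the copy method explicitly.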
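
The part of this that holds regardless of pandas version is that an explicit copy is independent of the DataFrame. The following is a minimal sketch with made-up data (frame and pop_copy are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# An explicit copy owns its data, independent of the DataFrame.
pop_copy = frame['pop'].copy()
pop_copy.iloc[0] = 99.0

print(pop_copy.tolist())      # the copy was changed
print(frame['pop'].tolist())  # the DataFrame still holds the original values
```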

When a nested dictionary is passed, the outer keys become the columns and the inner keys become the row index:

pop = {
    'Nevada': {
        2001: 2.4,
        2002: 2.9,
    },
    'Ohio': {
        2000: 1.5,
        2001: 1.7,
        2002: 3.6,
    },
}
frame3 = pd.DataFrame(pop)
frame3
	Nevada	Ohio
2000	NaN	1.5
2001	2.4	1.7
2002	2.9	3.6

Transpose a DataFrame:

frame3.T
	2000	2001	2002
Nevada	NaN	2.4	2.9
Ohio	1.5	1.7	3.6

The values attribute returns the data in a DataFrame as an ndarray:

frame3.values
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

Index Objects

Pandas index objects hold the axis labels and other metadata, such as the axis name.

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index.name = 'alpha'
index[1:]
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object', name='alpha')

Index objects are immutable, so their elements cannot be reassigned:

index[1] = 'd'  # raises TypeError
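
The failure can be observed safely by catching the exception. This is a small sketch (the variable names are illustrative); it also shows one way to obtain a changed index, namely building a new Index object rather than mutating the old one.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'          # item assignment is not supported
    mutated = True
except TypeError as exc:
    mutated = False
    print('TypeError:', exc)

# To "change" a label, build a new Index; the original is left untouched.
new_index = index.delete(1).insert(1, 'd')
print(new_index)
print(index)
```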

An index behaves like a fixed-size set:

frame3
frame3.columns
frame3.index
# set-like membership tests
'Nevada' in frame3.columns
2000 in frame3.index
Nevada	Ohio
2000	NaN	1.5
2001	2.4	1.7
2002	2.9	3.6
Index(['Nevada', 'Ohio'], dtype='object')
Int64Index([2000, 2001, 2002], dtype='int64')
True
True

Unlike a set, however, an index can contain duplicate labels:

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
set_labels = set(['foo', 'foo', 'bar', 'bar'])
set_labels
Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
{'bar', 'foo'}

Some commonly used methods and properties of index objects:

  • append
  • difference
  • intersection
  • union
  • isin
  • delete
  • drop
  • insert
  • is_monotonic_increasing (is_monotonic in older pandas versions)
  • is_unique
  • unique
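
A few of the set-like methods listed above can be tried on two small indexes. The labels here are made up for illustration:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['c', 'd'])

print(left.union(right))         # labels found in either index
print(left.intersection(right))  # labels found in both
print(left.difference(right))    # labels found only in left
print(left.isin(['a', 'c']))     # boolean mask, one entry per label

# append concatenates and keeps duplicates, so the result is not unique.
combined = left.append(right)
print(combined, combined.is_unique)
```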

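Basic Pandas Functionality

Reindexing

reindex creates a new object whose data is rearranged to match the new index. Labels that were not present before get NaN.

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2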
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

The ffill method fills values forward, which is useful for filling the gaps that reindexing introduces:

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')
0      blue
2    purple
4    yellow
dtype: object
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

Both the row index and the column index can be reindexed:

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
frame.reindex(['a', 'b', 'c', 'd'])
frame.reindex(columns=['Texas', 'Utah', 'California'])
	Ohio	Texas	California
a	0	1	2
c	3	4	5
d	6	7	8
      Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
d	6.0	7.0	8.0
     Texas	Utah	California
a	1	NaN	2
c	4	NaN	5
d	7	NaN	8

Dropping Entries from an Axis

The drop method removes rows or columns and returns a new object:

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
new_obj = obj.drop('c')
new_obj
obj.drop(['d', 'c'])
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
e    4.0
dtype: float64

For a DataFrame, drop can remove labels from either axis:

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
data.drop(['Colorado', 'Ohio'])
data.drop('two', axis=1)
data.drop(['two', 'four'], axis='columns')
	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15
     one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15
     one	three	four
Ohio	0	2	3
Colorado	4	6	7
Utah	8	10	11
New York	12	14	15
     one	three
Ohio	0	2
Colorado	4	6
Utah	8	10
New York	12	14

Drop in place, modifying the object instead of returning a new one:

obj
obj.drop('c', inplace=True)
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

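Indexing, Selection, and Filtering

Series indexing works like NumPy array indexing, except that index labels can be used in addition to integer positions.

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
obj['b']
obj[1]        # integer position; newer pandas prefers obj.iloc[1]
obj[2:4]
obj[['b', 'a', 'd']]
obj[[1, 3]]   # integer positions; newer pandas prefers obj.iloc[[1, 3]]
obj[obj < 2]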
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
1.0
1.0
c    2.0
d    3.0
dtype: float64
b    1.0
a    0.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64
a    0.0
b    1.0
dtype: float64

Slicing a Series by label differs from ordinary Python slicing: the endpoint is included, whereas Python slices are half-open and exclude the end.

obj['b':'c']
b    1.0
c    2.0
dtype: float64
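
The difference between label slices and positional slices is easiest to see side by side. This sketch rebuilds a Series like the one above and uses loc and iloc to make the intent explicit (by_label and by_position are illustrative names):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']   # includes the endpoint 'c'
by_position = obj.iloc[1:2]   # excludes position 2, like a Python list slice

print(by_label)
print(by_position)
```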

Assign through a slice:

obj['b':'c'] = 5
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Select one or more columns of a DataFrame. A single label returns a Series; a list of labels returns a DataFrame:

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
data['two']
type(data['two'])
data[['three', 'one']]
type(data[['three', 'one']])
     one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
pandas.core.series.Series
     three	one
Ohio	2	0
Colorado	6	4
Utah	10	8
New York	14	12
pandas.core.frame.DataFrame

Select rows of a DataFrame with a slice or a boolean array:

data[:2]
data[data['three'] > 5]
	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
        one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Syntactically a DataFrame resembles a two-dimensional NumPy array, and it can be indexed with a boolean DataFrame:

data < 5
data[data < 5] = 0
data
	one	two	three	four
Ohio	True	True	True	True
Colorado	True	False	False	False
Utah	False	False	False	False
New York	False	False	False	False
        one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Selection with loc and iloc

loc and iloc select rows and columns of a DataFrame, by label and by integer position respectively.

data.loc['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int64

data.iloc[2, [3, 0, 1]]
data.iloc[2]
data.iloc[[1, 2], [3, 0, 1]]
four    11
one      8
two      9
Name: Utah, dtype: int64
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
        four	one	two
Colorado	7	0	5
Utah	11	8	9

Slices work as well:

data.loc[:'Utah', 'two']
data.iloc[:, :3][data.three > 5]
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
        one	two	three
Colorado	0	5	6
Utah	8	9	10
New York	12	13	14

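Common DataFrame indexing options:

  • df[val]: select a column or a sequence of columns
  • df.loc[val]: select rows by label
  • df.loc[:, val]: select columns by label
  • df.loc[val1, val2]: select rows and columns by label
  • df.iloc[where]: select rows by integer position
  • df.iloc[:, where]: select columns by integer position
  • df.iloc[where_i, where_j]: select rows and columns by integer position
  • df.at[label_i, label_j]: select a single scalar by row and column label
  • df.iat[i, j]: select a single scalar by row and column position
  • reindex: select rows or columns by label
  • get_value, set_value: select a single scalar by label (removed in pandas 1.0; use at and iat instead)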
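
Several of the options listed above are shown together in the following sketch, which rebuilds the 4 x 4 example frame so that it runs on its own. The variable names on the left-hand side are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

row_by_label = data.loc['Utah']             # one row, as a Series
row_by_position = data.iloc[2]              # the same row, by position
block = data.loc[['Ohio', 'Utah'], 'two':'three']  # label slices include the end
cell_by_label = data.at['Utah', 'three']    # a single scalar
cell_by_position = data.iat[2, 2]           # the same scalar, by position

print(block)
print(cell_by_label, cell_by_position)
```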

Arithmetic and Data Alignment

Alignment in Series arithmetic was covered earlier. For DataFrames, alignment happens on both the rows and the columns:

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2
df1 + df2
	b	c	d
Ohio	0.0	1.0	2.0
Texas	3.0	4.0	5.0
Colorado	6.0	7.0	8.0
        b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0
        b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	3.0	NaN	6.0	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	9.0	NaN	12.0	NaN
Utah	NaN	NaN	NaN	NaN

When two DataFrames have no column labels in common, the result contains only NaN, even though the row labels match:

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
df2
df1 - df2
A
0	1
1	2
B
0	3
1	4
A	B
0	NaN	NaN
1	NaN	NaN

Arithmetic Methods with Fill Values

Arithmetic between differently indexed objects produces missing values wherever the labels do not overlap:

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
df2
df1 + df2
	a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	5.0	6.0	7.0
2	8.0	9.0	10.0	11.0
        a	b	c	d	e
0	0.0	1.0	2.0	3.0	4.0
1	5.0	NaN	7.0	8.0	9.0
2	10.0	11.0	12.0	13.0	14.0
3	15.0	16.0	17.0	18.0	19.0
        a	b	c	d	e
0	0.0	2.0	4.0	6.0	NaN
1	9.0	NaN	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN

The fill_value argument of the arithmetic methods substitutes a value wherever exactly one of the two operands is missing:

df1.add(df2, fill_value=0)
        a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	5.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0

Commonly used arithmetic methods (all of them accept fill_value):

  • add, radd
  • sub, rsub
  • div, rdiv
  • floordiv, rfloordiv
  • mul, rmul
  • pow, rpow
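
Each method in the list has an r-prefixed counterpart with the operands swapped. The sketch below, on made-up data, checks that df.rdiv(1) matches 1 / df and shows fill_value with sub (df, other, reciprocal, and diff are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list('abc'))

# The r-prefixed methods swap the operands: df.rdiv(1) computes 1 / df.
reciprocal = df.rdiv(1)
print(reciprocal.equals(1 / df))

# fill_value stands in for labels that are missing from one operand,
# here columns 'b' and 'c' of other.
other = pd.DataFrame({'a': [10., 20.]})
diff = df.sub(other, fill_value=0)
print(diff)
```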

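Operations Between DataFrame and Series

First, consider subtracting a one-dimensional NumPy array from a two-dimensional one. The subtraction is broadcast across every row:

arr = np.arange(12.).reshape((3, 4))
arr
arr[0]
arr - arr[0]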
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])
array([0., 1., 2., 3.])
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
series
frame - series
        b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
        b	d	e
Utah	0.0	0.0	0.0
Ohio	3.0	3.0	3.0
Texas	6.0	6.0	6.0
Oregon	9.0	9.0	9.0

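If a label is missing from either the DataFrame's columns or the Series' index, the two are reindexed to the union of the labels, as in an outer join:

series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2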
        b	d	e	f
Utah	0.0	NaN	3.0	NaN
Ohio	3.0	NaN	6.0	NaN
Texas	6.0	NaN	9.0	NaN
Oregon	9.0	NaN	12.0	NaN

To broadcast over the columns instead, matching on the row index, use one of the arithmetic methods introduced above and pass axis=0.

series3 = frame['d']
frame
series3
frame.sub(series3, axis=0)
	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
        b	d	e
Utah	-1.0	0.0	1.0
Ohio	-1.0	0.0	1.0
Texas	-1.0	0.0	1.0
Oregon	-1.0	0.0	1.0
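
A typical use of matching on the row index is centering each row on its own mean. The sketch below rebuilds the same frame; row_means and centered are illustrative names, not from the original article.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = frame.mean(axis=1)            # one value per row label
centered = frame.sub(row_means, axis=0)   # align the Series on the row index

print(centered)
```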

apply and map

NumPy universal functions (element-wise array methods) also work on Pandas objects:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
np.abs(frame)
	b	d	e
Utah	-0.590458	-0.352861	0.820549
Ohio	-1.708280	0.174739	-1.081811
Texas	0.857712	0.246972	0.532208
Oregon	0.812756	1.260538	0.818304
        b	d	e
Utah	0.590458	0.352861	0.820549
Ohio	1.708280	0.174739	1.081811
Texas	0.857712	0.246972	0.532208
Oregon	0.812756	1.260538	0.818304

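Another common operation is applying a function to each column or each row, which apply does. By default the function is called once per column:

f = lambda x: x.max() - x.min()
frame.apply(f)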
b    2.565993
d    1.613398
e    1.902360
dtype: float64

Pass axis=1 to call the function once per row instead:

frame.apply(f, axis=1)
Utah      1.411007
Ohio      1.883019
Texas     0.610741
Oregon    0.447782
dtype: float64

The function passed to apply does not have to return a scalar; it can also return a Series:

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
                b	        d	        e
min	-1.708280	-0.352861	-1.081811
max	0.857712	1.260538	0.820549

Element-wise Python functions can be used as well:

frame
format = lambda x: '%.2f' % x
frame.applymap(format)
        b	        d	        e
Utah	-0.590458	-0.352861	0.820549
Ohio	-1.708280	0.174739	-1.081811
Texas	0.857712	0.246972	0.532208
Oregon	0.812756	1.260538	0.818304
        b	d	e
Utah	-0.59	-0.35	0.82
Ohio	-1.71	0.17	-1.08
Texas	0.86	0.25	0.53
Oregon	0.81	1.26	0.82

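Applying a function element-wise to a DataFrame is done with applymap (renamed to DataFrame.map in pandas 2.1). The name comes from Series, which has a map method for applying a function to each element.

frame['e'].map(format)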
Utah       0.82
Ohio      -1.08
Texas      0.53
Oregon     0.82
Name: e, dtype: object
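
Series.map accepts a dictionary as well as a function. The following sketch uses made-up data, and the region mapping is only an illustration: elements without a matching key become NaN.

```python
import pandas as pd

states = pd.Series(['Ohio', 'Nevada', 'Ohio', 'Utah'])

# With a function, map calls it once per element.
upper = states.map(str.upper)

# With a dict, map looks each element up; missing keys become NaN.
region = states.map({'Ohio': 'Midwest', 'Nevada': 'West'})

print(upper)
print(region)
```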

Sorting and Ranking

Sort by the index labels:

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj
obj.sort_index()
d    0
a    1
b    2
c    3
dtype: int64
a    1
b    2
c    3
d    0
dtype: int64

For a DataFrame, sort_index can sort by the row index (the default) or by the column index (axis=1):

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame
frame.sort_index()
frame.sort_index(axis=1)
	d	a	b	c
three	0	1	2	3
one	4	5	6	7
        d	a	b	c
one	4	5	6	7
three	0	1	2	3
        a	b	c	d
three	1	2	3	0
one	5	6	7	4

Sort in descending order:

frame.sort_index(axis=1, ascending=False)
        d	c	b	a
three	0	3	2	1
one	4	7	6	5

To sort a Series or DataFrame by its values, use the sort_values method. Missing values are placed at the end by default:

obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

For a DataFrame, the by argument names the column or columns to sort on:

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
frame.sort_values(by='b')
frame.sort_values(by=['a', 'b'])
	b	a
0	4	0
1	7	1
2	-3	0
3	2	1
        b	a
2	-3	0
3	2	1
0	4	0
1	7	1
        b	a
2	-3	0
0	4	0
3	2	1
1	7	1

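Ranking assigns ranks from 1 up to the number of valid data points. By default, tied values each receive the mean rank of their group:

obj = pd.Series([7, -5, 7, 4, 2, 4, 0, 4, 4])
obj.rank()
obj.rank(method='first')  # break ties by order of appearance instead of averaging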
0    8.5
1    1.0
2    8.5
3    5.5
4    3.0
5    5.5
6    2.0
7    5.5
8    5.5
dtype: float64
0    8.0
1    1.0
2    9.0
3    4.0
4    3.0
5    5.0
6    2.0
7    6.0
8    7.0
dtype: float64

Rank in descending order, giving tied values the highest rank in their group:

obj.rank(ascending=False, method='max')
0    2.0
1    9.0
2    2.0
3    6.0
4    7.0
5    6.0
6    8.0
7    6.0
8    6.0
dtype: float64
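
The tie-breaking methods are easier to compare side by side. This sketch ranks a short made-up Series with each method and collects the results in one DataFrame (ranks is an illustrative name):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4])

# One column per tie-breaking method; the two 7s are the only tie.
ranks = pd.DataFrame({
    method: obj.rank(method=method)
    for method in ['average', 'min', 'max', 'first', 'dense']
})
print(ranks)
```

The methods differ only on the tied values: average gives both 7s rank 3.5, min and dense give 3, max gives 4, and first gives 3 and 4 in order of appearance.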

With axis=1, a DataFrame computes ranks across the columns within each row:

frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
frame
frame.rank(axis=1)
        b	a	c
0	4.3	0	-2.0
1	7.0	1	5.0
2	-3.0	0	8.0
3	2.0	1	-2.5
        b	a	c
0	3.0	2.0	1.0
1	3.0	1.0	2.0
2	1.0	2.0	3.0
3	3.0	2.0	1.0

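Axis Indexes with Duplicate Labels

A Series with duplicate index labels. Selecting a duplicated label returns a Series, while selecting a unique label returns a scalar:

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
obj.index.is_unique
obj['a']
obj['c']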
a    0
a    1
b    2
b    3
c    4
dtype: int64
False
a    0
a    1
dtype: int64
4

A DataFrame with duplicate index labels:

df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
df.loc['b']
        0	        1	        2
a	-0.975030	2.041130	1.022168
a	0.321428	2.124496	0.037530
b	0.343309	-0.386692	-0.577290
b	0.002090	-0.890841	1.759072
        0	        1	        2
b	0.343309	-0.386692	-0.577290
b	0.002090	-0.890841	1.759072

Summarizing and Computing Descriptive Statistics

sum computes sums along an axis; missing values are skipped by default:

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df
df.sum()
df.sum(axis=1)
df.mean(axis=1, skipna=False)  # do not skip NaN
	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3
one    9.25
two   -5.80
dtype: float64
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
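
In the output above, the all-NaN row c sums to 0.00 because every value in it is skipped. If a missing result is preferred in that case, sum accepts a min_count argument. A minimal sketch on a reduced version of the same data (default_sum and strict_sum are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [np.nan, np.nan]],
                  index=['a', 'c'], columns=['one', 'two'])

default_sum = df.sum(axis=1)               # the all-NaN row sums to 0.0
strict_sum = df.sum(axis=1, min_count=1)   # require at least one valid value

print(default_sum)
print(strict_sum)
```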

Return the index label at which the maximum or minimum occurs:

df.idxmax()
df.idxmin(axis=1)
one    b
two    d
dtype: object
a    one
b    two
c    NaN
d    two
dtype: object

Cumulative sum:

df.cumsum()
	one	two
a	1.40	NaN
b	8.50	-4.5
c	NaN	NaN
d	9.25	-5.8

The describe method produces several summary statistics in one call:

df.describe()
        one	        two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	1.075000	-3.700000
50%	1.400000	-2.900000
75%	4.250000	-2.100000
max	7.100000	-1.300000

For non-numeric data, describe reports a different set of statistics:

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj
obj.describe()
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
count     16
unique     3
top        a
freq       8
dtype: object

Some commonly used statistical methods:

  • count
  • describe
  • min, max
  • argmin, argmax
  • idxmin, idxmax
  • quantile
  • sum
  • mean
  • median
  • mad (removed in pandas 2.0)
  • prod
  • var
  • std
  • skew
  • kurt
  • cumsum
  • cummin, cummax
  • cumprod
  • diff
  • pct_change

Correlation and Covariance

First install a package for reading remote data sets:

pip install pandas-datareader -i https://pypi.douban.com/simple

Download a data set of stock prices and volumes:

import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                       for ticker, data in all_data.items()})

Compute the percent change of the prices:

returns = price.pct_change()
returns.tail()
                AAPL	        IBM	        MSFT	        GOOG
Date				
2019-08-06	0.018930	-0.000213	0.018758	0.015300
2019-08-07	0.010355	-0.011511	0.004380	0.003453
2019-08-08	0.022056	0.018983	0.026685	0.026244
2019-08-09	-0.008240	-0.028337	-0.008496	-0.013936
2019-08-12	-0.002537	-0.014765	-0.013942	-0.011195

The corr method of a Series computes the correlation between two Series, and the cov method computes their covariance.

returns['MSFT'].corr(returns['IBM'])
returns.MSFT.cov(returns['IBM'])
0.48863990166304594
8.714318020797283e-05

The corr and cov methods of a DataFrame return the full correlation and covariance matrices:

returns.corr()
returns.cov()
	AAPL	        IBM	        MSFT	        GOOG
AAPL	1.000000	0.381659	0.453727	0.459663
IBM	0.381659	1.000000	0.488640	0.402751
MSFT	0.453727	0.488640	1.000000	0.535898
GOOG	0.459663	0.402751	0.535898	1.000000
        AAPL	        IBM	        MSFT	        GOOG
AAPL	0.000266	0.000077	0.000107	0.000117
IBM	0.000077	0.000152	0.000087	0.000077
MSFT	0.000107	0.000087	0.000209	0.000120
GOOG	0.000117	0.000077	0.000120	0.000242

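The corrwith method of a DataFrame computes correlations with another Series or DataFrame. Passing a Series gives its correlation with each column; passing a DataFrame gives the correlations between columns with matching names:

returns.corrwith(returns.IBM)
returns.corrwith(volume)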
AAPL    0.381659
IBM     1.000000
MSFT    0.488640
GOOG    0.402751
dtype: float64
AAPL   -0.061924
IBM    -0.151708
MSFT   -0.089946
GOOG   -0.018591
dtype: float64
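
Because the download step depends on an external service, the relationship between corr and cov can also be checked on synthetic data. The sketch below uses randomly generated columns X and Y (made up for illustration, not stock returns) and verifies that the Pearson correlation equals the covariance divided by the product of the two standard deviations.

```python
import numpy as np
import pandas as pd

# Synthetic, reproducible data: two noisy copies of a shared signal.
rng = np.random.default_rng(0)
base = rng.normal(size=250)
returns = pd.DataFrame({
    'X': base + rng.normal(scale=0.5, size=250),
    'Y': base + rng.normal(scale=0.5, size=250),
})

corr = returns['X'].corr(returns['Y'])
cov = returns['X'].cov(returns['Y'])

# Pearson correlation is the covariance scaled by both standard deviations.
manual = cov / (returns['X'].std() * returns['Y'].std())
print(corr, manual)

# The DataFrame method returns the same number inside a matrix.
matrix = returns.corr()
print(matrix)
```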

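Unique Values, Value Counts, and Membership

The unique method returns an array of the unique values in a Series, in order of first appearance:

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj.unique()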
array(['c', 'a', 'd', 'b'], dtype=object)

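The value_counts method returns the frequency of each value, sorted by count in descending order by default:

obj.value_counts()
pd.value_counts(obj.values, sort=False)  # top-level function form; sort=False skips the sorting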
a    3
c    3
b    2
d    1
dtype: int64
c    3
b    2
d    1
a    3
dtype: int64

isin checks whether each value in a Series belongs to a given set of values, which is useful for filtering:

obj
mask = obj.isin(['b', 'c'])
mask
obj[mask]
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
0    c
5    b
6    b
7    c
8    c
dtype: object

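Index.get_indexer converts an array of possibly repeated values into their integer positions in an index of distinct values, which is handy for turning labels into numeric codes:

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)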
array([0, 2, 1, 1, 0, 2])
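
One detail worth knowing is what happens to values that are not in the index: get_indexer marks them with -1 rather than raising. A small sketch with made-up labels (categories, codes, and known are illustrative names):

```python
import pandas as pd

categories = pd.Index(['c', 'b', 'a'])

# 'z' is not in the index, so its position is reported as -1.
codes = categories.get_indexer(['a', 'c', 'z'])
print(codes)

# Positions map back to labels with take; drop the -1 entries first,
# because take would treat -1 as "last element".
known = codes[codes >= 0]
print(categories.take(known))
```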

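Compute a histogram (the value counts) for several columns at once. The row labels of the result are the distinct values that occur in any column:

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data
data.apply(pd.value_counts).fillna(0)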
        Qu1	Qu2	Qu3
0	1	2	1
1	3	3	5
2	4	1	2
3	3	2	4
4	4	3	4
        Qu1	Qu2	Qu3
1	1.0	1.0	1.0
2	0.0	2.0	1.0
3	2.0	2.0	0.0
4	2.0	0.0	2.0
5	0.0	0.0	1.0

 

References

  • 《Python for Data Analysis, 2nd Edition》by Wes McKinney

 

