Pandas in Detail, Part 1

Reposted article. Published 2018-04-28 22:58. Category: Python

Introduction to pandas

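pandas is a tool built on top of NumPy, created to solve data analysis tasks. It incorporates a number of libraries and some standard data models, and provides the tools needed to work with large datasets efficiently, along with a large set of functions and methods for processing data quickly and conveniently.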

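pandas has the following main data structures:

Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both are also close to Python's built-in list. The difference is that a list can hold elements of different types, whereas an array or a Series holds elements of a single dtype, which uses memory more efficiently and speeds up computation.
Time-Series: a Series indexed by time.
DataFrame: a two-dimensional tabular data structure. Many of its features resemble R's data.frame. A DataFrame can be thought of as a container of Series. The rest of this article focuses mainly on DataFrame.
Panel: a three-dimensional array, which can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and removed in pandas 0.25.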

Series

A Series is a one-dimensional array-like object made up of a set of data (of any NumPy dtype) and an associated set of labels (the index).

Creating a Series

In most cases a Series is obtained by slicing it out of a DataFrame, but you can also create one yourself. The syntax is:

s = pd.Series(data, index=index)

data can be any of the following:

dict
ndarray
scalar

index is a list of axis labels. What you pass depends on the kind of data, as described in each case.

From an ndarray

If data is an ndarray, the index must have the same length as the data. If no index is passed, one is created with the values [0, ..., len(data) - 1].

>>> ser = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> ser
a   -0.063364
b    0.907505
c   -0.862125
d   -0.696292
e    0.000751
dtype: float64
>>> ser.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> ser.index[[True,False,True,True,True]]
Index(['a', 'c', 'd', 'e'], dtype='object')
>>> pd.Series(np.random.randn(5))
0   -0.854075
1   -0.152620
2   -0.719542
3   -0.219185
4    1.206630
dtype: float64
>>> np.random.seed(100)
>>> ser=pd.Series(np.random.rand(7))
>>> ser
0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64
>>> import calendar as cal
>>> monthNames=[cal.month_name[i] for i in np.arange(1,6)]
>>> monthNames
['January', 'February', 'March', 'April', 'May']
>>> months=pd.Series(np.arange(1,6),index=monthNames);
>>> months
January     1
February    2
March       3
April       4
May         5
dtype: int32

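From a dict

If data is a dict and an index is passed, the values in the dict that match the labels in the index are pulled out. Otherwise, the index is built from the dict's keys: sorted, where possible, in older pandas versions, and in insertion order with pandas 0.23 or later on Python 3.6 or later.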

>>> d = {'a' : 0., 'b' : 1., 'c' : 2.}
>>> pd.Series(d)
a    0.0
b    1.0
c    2.0
dtype: float64
>>> pd.Series(d, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
>>> stockPrices = {'GOOG':1180.97,'FB':62.57,'TWTR': 64.50, 'AMZN':358.69,'AAPL':500.6}
>>> stockPriceSeries=pd.Series(stockPrices,index=['GOOG','FB','YHOO','TWTR','AMZN','AAPL'],name='stockPrices')
>>> stockPriceSeries
GOOG    1180.97
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

Note: NaN (not a number) is the standard missing-data marker in pandas.

>>> stockPriceSeries.name
'stockPrices'
>>> stockPriceSeries.index
Index(['GOOG', 'FB', 'YHOO', 'TWTR', 'AMZN', 'AAPL'], dtype='object')
>>> dogSeries=pd.Series('chihuahua',index=['breed','countryOfOrigin','name', 'gender'])
>>> dogSeries
breed              chihuahua
countryOfOrigin    chihuahua
name               chihuahua
gender             chihuahua
dtype: object

From a scalar

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

>>> pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In addition to the above, array-like objects such as lists are converted to an ndarray when passed in to create a Series:

>>> ser = pd.Series([5,4,2,-3,True])
>>> ser
0       5
1       4
2       2
3      -3
4    True
dtype: object
>>> ser.values
array([5, 4, 2, -3, True], dtype=object)
>>> ser.index
RangeIndex(start=0, stop=5, step=1)
>>> ser2 = pd.Series([5, 4, 2, -3, True], index=['b', 'e', 'c', 'a', 'd'])
>>> ser2
b       5
e       4
c       2
a      -3
d    True
dtype: object
>>> ser2.index
Index(['b', 'e', 'c', 'a', 'd'], dtype='object')
>>> ser2.values
array([5, 4, 2, -3, True], dtype=object)

Indexing

Series is ndarray-like

A Series behaves much like an ndarray and is a valid argument to most NumPy functions. Indexing operations such as slicing also work.

>>> ser = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> ser
a   -0.231872
b    0.207976
c    0.935808
d    0.179578
e   -0.577162
dtype: float64
>>> ser[0]
-0.2318721969038312
>>> ser[:3]
a   -0.231872
b    0.207976
c    0.935808
dtype: float64
>>> ser[ser >0]
b    0.207976
c    0.935808
d    0.179578
dtype: float64
>>> ser[ser > ser.median()]
b    0.207976
c    0.935808
dtype: float64
>>> ser[ser > ser.median()]=1
>>> ser
a   -0.231872
b    1.000000
c    1.000000
d    0.179578
e   -0.577162
dtype: float64
>>> ser[[4, 3, 1]]
e   -0.577162
d    0.179578
b    1.000000
dtype: float64
>>> np.exp(ser)
a    0.793047
b    2.718282
c    2.718282
d    1.196713
e    0.561490
dtype: float64

Series is dict-like

A Series is also like a fixed-size dict: you can get and set values by index label:

>>> ser['a']
-0.2318721969038312
>>> ser['e'] = 12.
>>> ser
a    -0.231872
b     1.000000
c     1.000000
d     0.179578
e    12.000000
dtype: float64
>>> 'e' in ser
True
>>> 'f' in ser
False

Note: referencing a label that is not in the index raises an exception (a KeyError).
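
The note above can be illustrated with a minimal sketch. The Series and the label 'f' here are made up for the example and are not taken from the session above:

```python
import pandas as pd

# A small illustrative Series with the labels 'a', 'b' and 'c'.
ser = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

try:
    ser['f']                 # 'f' is not one of the index labels
    missing_raised = False
except KeyError as exc:
    missing_raised = True
    print('KeyError:', exc)
```

Catching KeyError this way mirrors how a missing key is handled with a plain dict.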

With the get method, a missing label returns None or a specified default, just as with a dict.

>>> print(ser.get('f'))
None
>>> ser.get('f', np.nan)
nan

Vectorized operations and label alignment

In data analysis there is usually no need to write loops; use vectorized operations instead.

>>> ser + ser
a    -0.463744
b     2.000000
c     2.000000
d     0.359157
e    24.000000
dtype: float64
>>> ser * 2
a    -0.463744
b     2.000000
c     2.000000
d     0.359157
e    24.000000
dtype: float64
>>> np.exp(ser)
a         0.793047
b         2.718282
c         2.718282
d         1.196713
e    162754.791419
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data by label.

>>> ser
a    -0.231872
b     1.000000
c     1.000000
d     0.179578
e    12.000000
dtype: float64
>>> ser[1:] + ser[:-1]
a         NaN
b    2.000000
c    2.000000
d    0.359157
e         NaN
dtype: float64

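The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one of the Series, the result for that label is marked NaN.

Note: by default, operations between differently indexed objects yield the union of the indexes, to avoid losing information. Even though the data is missing, the index labels themselves can be important information for the computation. You can also drop the labels with missing data using the dropna method.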
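
As a rough sketch of the alignment behavior and the dropna option described above, using a small made-up Series (the variable names are illustrative only):

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# The two slices share only the labels 'b' and 'c'.
total = ser.iloc[1:] + ser.iloc[:-1]
# total is indexed by the union a, b, c, d; 'a' and 'd' are NaN.

# Drop the labels whose result is missing.
cleaned = total.dropna()

# Alternative: treat a label missing on one side as 0 instead of producing NaN.
filled = ser.iloc[1:].add(ser.iloc[:-1], fill_value=0)

print(total)
print(cleaned)
print(filled)
```

Whether to drop or fill depends on whether a missing label should count as "no data" or as zero in the analysis.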

Attributes

The name attribute:

>>> s = pd.Series(np.random.randn(5), name='something')
>>> s
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: something, dtype: float64
>>> s.name
'something'

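In many cases the Series name is assigned automatically, for example when taking a 1D slice of a DataFrame (covered later in the DataFrame section). A Series can be given a new name with the rename method: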

>>> s2 = s.rename("different")
>>> s2
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: different, dtype: float64

Note that s and s2 refer to different objects.
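
A quick way to check this, sketched with illustrative values rather than the random data used above:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name='something')
s2 = s.rename('different')   # returns a new Series; s keeps its name

print(s.name)
print(s2.name)
print(s is s2)
```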

Get the index through the index attribute:

>>> s2.index
RangeIndex(start=0, stop=5, step=1)

The Index object also has a name attribute:

>>> s.index.name = "index_name"
>>> s
index_name
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: something, dtype: float64

Get the values through the values attribute:

>>> s.values
array([-0.53337271, -0.22540212, -0.31491934,  0.42299678, -0.43882681])

DataFrame

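A DataFrame is an indexed two-dimensional data structure whose columns can have different types, similar to a SQL table or a dict of Series.

Creating a DataFrame

DataFrame is the most commonly used pandas object. Like Series, the DataFrame constructor accepts many different kinds of input.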

From dict of Series or dicts

>>> d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
...      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df

	one	two
a	1.0	1.0
b	2.0	2.0
c	3.0	3.0
d	NaN	4.0

>>> pd.DataFrame(d, index=['d', 'b', 'a'])

	one	two
d	NaN	4.0
b	2.0	2.0
a	1.0	1.0

>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

	two	three
d	4.0	NaN
b	2.0	NaN
a	1.0	NaN

The row and column labels can be accessed through the index and columns attributes, respectively.

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')

From dict of ndarrays / lists

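The ndarrays must all have the same length. If an index is passed, it must also have the same length as the arrays. If no index is passed, the result is range(n), where n is the array length.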

>>> d = {'one' : [1., 2., 3., 4.],
...      'two' : [4., 3., 2., 1.]}
>>> pd.DataFrame(d)

	one	two
0	1.0	4.0
1	2.0	3.0
2	3.0	2.0
3	4.0	1.0

>>> pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

	one	two
a	1.0	4.0
b	2.0	3.0
c	3.0	2.0
d	4.0	1.0

From structured or record array

This case is handled the same way as a dict of arrays.

NumPy dtype character codes:

'b'     boolean
'i'     (signed) integer
'u'     unsigned integer
'f'     floating-point
'c'     complex-floating point
'm'     timedelta
'M'     datetime
'O'     (Python) objects
'S', 'a'    (byte-)string
'U'     Unicode
'V'     raw data (void)
# Examples:
>>> dt = np.dtype('f8')   # 64-bit float; the 8 counts bytes
>>> dt = np.dtype('c16')  # 128-bit complex
>>> dt = np.dtype("a3, 3u8, (3,4)a10")  # 3-byte string, sub-array of three 64-bit unsigned integers, 3x4 sub-array of 10-byte strings
>>> dt = np.dtype((np.void, 10))  # 10-byte void (raw data)
>>> dt = np.dtype((str, 35))   # 35-character string
>>> dt = np.dtype(('U', 10))   # 10-character unicode string
>>> dt = np.dtype((np.int32, (2,2)))          # 2x2 int sub-array
>>> dt = np.dtype(('S10', 1))                 # 10-byte string
>>> dt = np.dtype(('i4, (2,3)f8, f4', (2,3))) # 2x3 structured sub-array
# Use astype to convert; do not change an object's dtype attribute directly
>>> b = np.array([1., 2., 3., 4.])
>>> b.dtype
dtype('float64')
>>> c = b.astype(int)
>>> c
array([1, 2, 3, 4])
>>> c.shape
(4,)
>>> c.dtype
dtype('int32')

>>> data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
# i4: a 4-byte (4*8 = 32-bit) signed integer; the '<' in the output marks little-endian byte order
>>> data
array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
>>> data.shape
(2,)
>>> data[:] = [(1,2.,'Hello'), (2,3.,"World")]
>>> data
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
>>> pd.DataFrame(data, index=['first', 'second'])

	A	B	C
first	1	2.0	b'Hello'
second	2	3.0	b'World'

>>> pd.DataFrame(data, columns=['C', 'A', 'B'])

	C	A	B
0	b'Hello'	1	2.0
1	b'World'	2	3.0

Note: a DataFrame is not exactly the same as a 2-dimensional NumPy ndarray.

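There are many other constructors besides those above, but the main way to obtain a DataFrame is to read a tabular file, so the other constructors are not listed one by one here.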
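
As a minimal sketch of reading tabular data, the following uses pd.read_csv on an in-memory string so that it runs without a file on disk. The CSV content and column names are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path.
csv_text = "ticker,price\nGOOG,1180.97\nFB,62.57\n"

# Use the 'ticker' column as the row index.
df_csv = pd.read_csv(io.StringIO(csv_text), index_col='ticker')
print(df_csv)
```

Passing a path such as a local .csv file to pd.read_csv works the same way.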

>>> d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.index.values
array(['a', 'b', 'c', 'd'], dtype=object)
>>> df.columns
Index(['one', 'two'], dtype='object')
>>> df.columns.values
array(['one', 'two'], dtype=object)

>>> stockSummaries={
... 'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88],index=['Closing price','EPS','Shares Outstanding(M)','Beta', 'P/E','Market Cap(B)']),
... 'GOOG': pd.Series([1133.43,36.05,335.83,0.87,31.44,380.64],index=['Closing price','EPS','Shares Outstanding(M)','Beta','P/E','Market Cap(B)']),
... 'FB': pd.Series([61.48,0.59,2450,104.93,150.92],index=['Closing price','EPS','Shares Outstanding(M)','P/E', 'Market Cap(B)']),
... 'YHOO': pd.Series([34.90,1.27,1010,27.48,0.66,35.36],index=['Closing price','EPS','Shares Outstanding(M)','P/E','Beta', 'Market Cap(B)']),
... 'TWTR':pd.Series([65.25,-0.3,555.2,36.23],index=['Closing price','EPS','Shares Outstanding(M)','Market Cap(B)']),
... 'AAPL':pd.Series([501.53,40.32,892.45,12.44,447.59,0.84],index=['Closing price','EPS','Shares Outstanding(M)','P/E','Market Cap(B)','Beta'])}
>>> stockDF=pd.DataFrame(stockSummaries)
>>> stockDF

	AAPL	AMZN	FB	GOOG	TWTR	YHOO
Beta	0.84	0.52	NaN	0.87	NaN	0.66
Closing price	501.53	346.15	61.48	1133.43	65.25	34.90
EPS	40.32	0.59	0.59	36.05	-0.30	1.27
Market Cap(B)	447.59	158.88	150.92	380.64	36.23	35.36
P/E	12.44	589.80	104.93	31.44	NaN	27.48
Shares Outstanding(M)	892.45	459.00	2450.00	335.83	555.20	1010.00

>>> stockDF=pd.DataFrame(stockSummaries,index=['Closing price','EPS','Shares Outstanding(M)','P/E', 'Market Cap(B)','Beta'])
>>> stockDF

	AAPL	AMZN	FB	GOOG	TWTR	YHOO
Closing price	501.53	346.15	61.48	1133.43	65.25	34.90
EPS	40.32	0.59	0.59	36.05	-0.30	1.27
Shares Outstanding(M)	892.45	459.00	2450.00	335.83	555.20	1010.00
P/E	12.44	589.80	104.93	31.44	NaN	27.48
Market Cap(B)	447.59	158.88	150.92	380.64	36.23	35.36
Beta	0.84	0.52	NaN	0.87	NaN	0.66

>>> stockDF=pd.DataFrame(stockSummaries,columns=['FB','TWTR','SCNW'])
>>> stockDF

	FB	TWTR	SCNW
Closing price	61.48	65.25	NaN
EPS	0.59	-0.30	NaN
Market Cap(B)	150.92	36.23	NaN
P/E	104.93	NaN	NaN
Shares Outstanding(M)	2450.00	555.20	NaN

DataFrame column operations

Selecting, setting, and deleting DataFrame columns works the same way as the analogous dict operations.

>>> df['one']
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
>>> df

	one	two
a	1.0	1.0
b	2.0	2.0
c	3.0	3.0
d	NaN	4.0

>>> df['three'] = df['one'] * df['two']
>>> df

	one	two	three
a	1.0	1.0	1.0
b	2.0	2.0	4.0
c	3.0	3.0	9.0
d	NaN	4.0	NaN

>>> df['flag'] = df['one'] > 2
>>> df

	one	two	three	flag
a	1.0	1.0	1.0	False
b	2.0	2.0	4.0	False
c	3.0	3.0	9.0	True
d	NaN	4.0	NaN	False

DataFrame columns can be deleted or popped just like dict entries.

>>> del df['two']
>>> df

	one	three	flag
a	1.0	1.0	False
b	2.0	4.0	False
c	3.0	9.0	True
d	NaN	NaN	False

>>> three = df.pop('three')
>>> df

	one	flag
a	1.0	False
b	2.0	False
c	3.0	True
d	NaN	False

When a scalar is assigned, it is automatically broadcast to fill the column.

>>> df['foo'] = 'bar'
>>> df

	one	flag	foo
a	1.0	False	bar
b	2.0	False	bar
c	3.0	True	bar
d	NaN	False	bar

If a Series is assigned and its index does not exactly match the DataFrame's, it is aligned to the DataFrame's index by default.

>>> df['one_trunc'] = df['one'][:2]
>>> df

	one	flag	foo	one_trunc
a	1.0	False	bar	1.0
b	2.0	False	bar	2.0
c	3.0	True	bar	NaN
d	NaN	False	bar	NaN

Raw ndarrays can also be inserted, but their length must match the length of the DataFrame's index.
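
The length requirement can be sketched as follows. The frame and the column names 'arr' and 'bad' are illustrative assumptions, not part of the session above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0]}, index=['a', 'b', 'c', 'd'])

# Length 4 matches the four index labels, so values are placed by position.
df['arr'] = np.arange(4)

try:
    df['bad'] = np.arange(3)   # length 3 does not match the index
    length_error = None
except ValueError as exc:
    length_error = exc
    print('ValueError:', exc)
```

Unlike a Series, a raw ndarray carries no labels, so no alignment happens and the lengths have to agree exactly.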

By default, columns assigned directly are inserted at the end. The insert method can be used to insert a column at a specific position:

>>> df.insert(1, 'bar', df['one'])
>>> df

	one	bar	flag	foo	one_trunc
a	1.0	1.0	False	bar	1.0
b	2.0	2.0	False	bar	2.0
c	3.0	3.0	True	bar	NaN
d	NaN	NaN	False	bar	NaN

Assigning columns with assign

>>> df_sample = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df_sample

	A	B
0	1	-0.501413
1	2	-1.658703
2	3	-1.007577
3	4	-0.508734
4	5	0.781488
5	6	-0.654381
6	7	0.041172
7	8	-0.201917
8	9	-0.870813
9	10	0.228932

>>> df_sample.assign(ln_A = lambda x: np.log(x.A), abs_B = lambda x: np.abs(x.B))

	A	B	abs_B	ln_A
0	1	-0.501413	0.501413	0.000000
1	2	-1.658703	1.658703	0.693147
2	3	-1.007577	1.007577	1.098612
3	4	-0.508734	0.508734	1.386294
4	5	0.781488	0.781488	1.609438
5	6	-0.654381	0.654381	1.791759
6	7	0.041172	0.041172	1.945910
7	8	-0.201917	0.201917	2.079442
8	9	-0.870813	0.870813	2.197225
9	10	0.228932	0.228932	2.302585

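Note that the new columns are passed as keyword arguments, which Python versions before 3.6 collect into an unordered dict, so the resulting column order is not guaranteed there. To be sure of the order, call assign several times.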
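
A minimal sketch of chaining assign to fix the column order. The values in column B are made up so that the result is reproducible:

```python
import numpy as np
import pandas as pd

df_sample = pd.DataFrame({'A': range(1, 6), 'B': [-1.0, 2.0, -3.0, 4.0, -5.0]})

# Each assign call returns a new DataFrame, so ln_A is added before abs_B.
result = (
    df_sample
    .assign(ln_A=lambda x: np.log(x['A']))
    .assign(abs_B=lambda x: np.abs(x['B']))
)
print(result)
```

Because assign returns a copy, df_sample itself keeps only columns A and B.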

>>> newcol = np.log(df_sample['A'])
>>> newcol
0    0.000000
1    0.693147
2    1.098612
3    1.386294
4    1.609438
5    1.791759
6    1.945910
7    2.079442
8    2.197225
9    2.302585
Name: A, dtype: float64
>>> df_sample.assign(ln_A=newcol)

	A	B	ln_A
0	1	-0.501413	0.000000
1	2	-1.658703	0.693147
2	3	-1.007577	1.098612
3	4	-0.508734	1.386294
4	5	0.781488	1.609438
5	6	-0.654381	1.791759
6	7	0.041172	1.945910
7	8	-0.201917	2.079442
8	9	-0.870813	2.197225
9	10	0.228932	2.302585

Indexing

Operation	Syntax	Result
Select column	df[col]	Series
Select row by label	df.loc[label]	Series
Select row by integer location	df.iloc[loc]	Series
Slice rows	df[5:10]	DataFrame
Select rows by boolean vector	df[bool_vec]	DataFrame
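
The rows of the table above can be tried out with a small sketch like the one below. The frame and variable names are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

col = df['one']                 # select column: Series
row_by_label = df.loc['b']      # select row by label: Series
row_by_pos = df.iloc[1]         # select row by integer location: Series
row_slice = df[0:2]             # slice rows: DataFrame
bool_rows = df[df['one'] > 1]   # select rows by boolean vector: DataFrame

for obj in (col, row_by_label, row_by_pos, row_slice, bool_rows):
    print(type(obj).__name__)
```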

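Indexing functions

loc[row_label, [column_names]]: select data by row label
iloc[row_position, column_position]: select data by integer row position
ix: select data by row label or integer position (a hybrid of loc and iloc)

Examples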

>>> df

	one	bar	flag	foo	one_trunc
a	1.0	1.0	False	bar	1.0
b	2.0	2.0	False	bar	2.0
c	3.0	3.0	True	bar	NaN
d	NaN	NaN	False	bar	NaN

>>> df.loc['b']
one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object
>>> df.loc[['a','c'],['one','bar']]
   one  bar
a  1.0  1.0
c  3.0  3.0
>>> df.loc[:,['one','bar']]
   one  bar
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  NaN

iloc examples

# df.iloc[row_position, column_position]
>>> df.iloc[1,1]  # second row, second column; returns a single value
2.0
>>> df.iloc[[0,2],:]  # first and third rows
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
c  3.0  3.0   True  bar        NaN
>>> df.iloc[0:2,:]  # first row up to, but not including, the third row
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
>>> df.iloc[:,1]  # second column of every row; returns a Series
a    1.0
b    2.0
c    3.0
d    NaN
Name: bar, dtype: float64
>>> df.iloc[1,:]  # second row; returns a Series
one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

ix examples

>>> df.ix[0:3,['one','bar']]
   one  bar
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
>>> df.ix[['a','b'],['one','bar']]
   one  bar
a  1.0  1.0
b  2.0  2.0

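Note: .ix has been deprecated since pandas 0.20 and was removed in pandas 1.0. Use label-based indexing with .loc and position-based indexing with .iloc instead.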
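
One possible way to rewrite mixed position-and-label selections without .ix is sketched below. The frame is a simplified stand-in for the df used in this section, and the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0, 4.0], 'bar': [1.0, 2.0, 3.0, 4.0], 'foo': 'bar'},
    index=['a', 'b', 'c', 'd'],
)

# Rows by position, columns by name: convert the names to positions for iloc.
by_position = df.iloc[0:3, df.columns.get_indexer(['one', 'bar'])]

# Or convert the row positions to labels and use loc.
mixed = df.loc[df.index[0:3], ['one', 'bar']]

# Rows and columns both by label.
by_label = df.loc[['a', 'b'], ['one', 'bar']]

print(by_position)
print(by_label)
```

Both conversions give the same rows a to c with the columns one and bar; which one reads better is a matter of taste.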

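Data alignment

Operations between DataFrame objects automatically align on both the columns and the index (row labels). The resulting object has the union of the column and row labels.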

>>> df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
>>> df

	A	B	C	D
0	-0.408040	-0.103925	1.567179	0.497025
1	1.155872	1.838612	1.535727	0.254998
2	-0.844157	-0.982943	-0.306098	0.838501
3	-1.690848	1.151174	-1.029337	-0.510992
4	-2.360271	0.103595	1.738818	1.241876
5	0.132413	0.577794	-1.575906	-1.292794
6	-0.659920	-0.874005	-0.689551	-0.535480
7	1.527953	0.647206	-0.677337	-0.265019
8	0.746106	-3.130785	0.059622	-0.875211
9	1.064878	-0.573153	-0.803278	1.092972

>>> df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
>>> df2

	A	B	C
0	0.651278	2.160525	-0.639130
1	-0.333688	-0.437602	-1.905795
2	-1.229019	0.794899	-1.160508
3	0.546056	1.163258	0.658877
4	0.523689	1.327156	1.112524
5	-1.074630	0.343416	0.985438
6	0.736502	-2.080463	-0.298586

>>> df + df2

	A	B	C	D
0	0.243238	2.056600	0.928049	NaN
1	0.822185	1.401010	-0.370068	NaN
2	-2.073177	-0.188045	-1.466606	NaN
3	-1.144793	2.314432	-0.370460	NaN
4	-1.836581	1.430751	2.851342	NaN
5	-0.942217	0.921210	-0.590468	NaN
6	0.076582	-2.954467	-0.988137	NaN
7	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN
9	NaN	NaN	NaN	NaN

In operations between a DataFrame and a Series, the default behavior is to align the Series index on the DataFrame columns, broadcasting row by row.

>>> df.iloc[0]
A   -0.803278
B    1.092972
C    0.651278
D    2.160525
Name: 0, dtype: float64
>>> df - df.iloc[0]

	A	B	C	D
0	0.000000	0.000000	0.000000	0.000000
1	0.164148	-1.426659	-1.088879	-4.066320
2	-0.425741	-0.298073	-1.811786	-1.614469
3	1.966537	-0.434095	-0.127588	-0.833369
4	1.915803	-2.167601	-0.307861	-1.175087
5	1.539780	-3.173434	-0.949864	-3.889736
6	0.806788	0.841624	-0.433718	-0.349443
7	-1.746116	-1.298589	-0.641914	-4.381350
8	-0.062220	-1.208825	-0.536867	-2.218942
9	1.425466	-2.731549	-0.348400	-0.083648

In some special cases, such as working with time series data, you can also broadcast column-wise:

>>> index = pd.date_range('1/1/2000', periods=8)
>>> df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))
>>> df

	A	B	C
2000-01-01	-1.285993	-0.625756	0.341711
2000-01-02	-0.130773	-1.091650	0.074539
2000-01-03	1.248248	-0.450343	-0.347523
2000-01-04	-0.317490	1.012952	-0.838197
2000-01-05	-1.041325	-0.087286	1.153089
2000-01-06	1.067939	0.570342	-0.272996
2000-01-07	-0.160398	-0.013020	0.621867
2000-01-08	1.374179	0.779654	-1.554635

Operations

Arithmetic

add: method for addition (+)
sub: method for subtraction (-)
div: method for division (/)
mul: method for multiplication (*)

>>> df.sub(df["A"],axis=0)

	B	C
2000-01-01	0.660237	1.627704
2000-01-02	-0.960877	0.205312
2000-01-03	-1.698591	-1.595771
2000-01-04	1.330443	-0.520706
2000-01-05	0.954039	2.194413
2000-01-06	-0.497597	-1.340935
2000-01-07	0.147378	0.782264
2000-01-08	-0.594525	-2.928814

Logical operations

Logical operations work much as they do in NumPy.

>>> df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
>>> df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)
>>> df1

	a	b
0	True	False
1	False	True
2	True	True

>>> df2

	a	b
0	False	True
1	True	True
2	True	False

>>> df1 & df2

	a	b
0	False	False
1	False	True
2	True	False

>>> df1 | df2

	a	b
0	True	True
1	True	True
2	True	True

>>> df1 ^ df2

	a	b
0	True	True
1	True	False
2	False	True

>>> -df1  # equivalent to ~df1

	a	b
0	False	True
1	True	False
2	False	False

if-then: an if-then on one column
if-then-else: add another line with different logic, to do the -else

>>> d={'a':pd.Series([1,2,3,4,5],index=(np.arange(5))),
... 'b':pd.Series([2,3,4,5,6],index=(np.arange(5))),
... 'c':pd.Series([3,4,5,6,7],index=(np.arange(5)))};
>>> df3 = pd.DataFrame(d)
>>> df3
   a  b  c
0  1  2  3
1  2  3  4
2  3  4  5
3  4  5  6
4  5  6  7
# if-then... 
>>> df3.loc[df3.a>3,'c']=30;
>>> df3
   a  b   c
0  1  2   3
1  2  3   4
2  3  4   5
3  4  5  30
4  5  6  30
# if-then-else 
>>> df3['logic'] = np.where(df3['a'] > 3,'high','low')
>>> df3
   a  b   c logic
0  1  2   3   low
1  2  3   4   low
2  3  4   5   low
3  4  5  30  high
4  5  6  30  high

>>> df_mask = pd.DataFrame({'a' : [True] * 5, 'b' : [False] * 5})  # build a boolean mask
>>> df_mask
      a      b
0  True  False
1  True  False
2  True  False
3  True  False
4  True  False
>>> df3.where(df_mask,-100)  # keep values where the mask is True; replace the rest with -100
   a    b    c logic
0  1 -100 -100  -100
1  2 -100 -100  -100
2  3 -100 -100  -100
3  4 -100 -100  -100
4  5 -100 -100  -100

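Statistics

Method	Description
sr.unique()	Unique values of a Series
sr.value_counts()	Frequency counts of a Series, sorted in descending order (DataFrame.value_counts is only available from pandas 1.1)
sr.describe()	Summary statistics and quantiles
df.describe()	Summary statistics and quantiles for each column
df.count()	Number of non-NA values
df.max()	Maximum
df.min()	Minimum
df.sum(axis=0)	Sum of each column
df.mean()	Mean of each column
df.median()	Median
df.var()	Variance
df.std()	Standard deviation
df.mad()	Mean absolute deviation around the mean (removed in pandas 2.0)
df.cumsum()	Cumulative sum
sr1.corr(sr2)	Correlation coefficient
df.cov()	Covariance matrix
df1.corrwith(df2)	Pairwise correlation with another DataFrame
pd.cut(array1, bins)	Bin one-dimensional data into intervals
pd.qcut(array1, 4)	Bin by quantiles; the 4 can be replaced with a custom list of quantiles
df['col1'].groupby(df['col2'])	Group col1 by col2, i.e. col2 is the key
df.groupby('col1')	Group the DataFrame by col1
grouped.aggregate(func)	Aggregate each group with the given function
grouped.aggregate([f1, f2,...])	Aggregate with several functions, one output column per function, named after the function
grouped.aggregate([('f1_name', f1), ('f2_name', f2)])	Rename the aggregated columns
grouped.aggregate({'col1':f1, 'col2':f2,...})	Apply different aggregation functions to different columns; each entry may also list several functions
df.pivot_table(['col1', 'col2'], index=['row1', 'row2'], aggfunc=[np.mean, np.sum], fill_value=0, margins=True)	Group and aggregate col1 and col2 by row1 and row2; several aggregation functions can be given, and ...
pd.crosstab(df['col1'], df['col2'])	Cross tabulation: frequency counts per group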
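
A few of the methods in the table can be sketched together on a tiny made-up frame. The data, the bin edges and the variable names are assumptions chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['x', 'y', 'x', 'y', 'x'],
    'col2': [1, 2, 3, 4, 5],
})

# Frequency of each key, most frequent first.
counts = df['col1'].value_counts()

# Group col2 by col1 and aggregate with two functions, one output column each.
grouped = df.groupby('col1')['col2']
summary = grouped.aggregate(['sum', 'mean'])

# Bin col2 into the intervals (0, 2] and (2, 5].
bins = pd.cut(df['col2'], bins=[0, 2, 5])

# Cross-tabulate the keys against the bins.
table = pd.crosstab(df['col1'], bins)

print(counts)
print(summary)
print(table)
```

Passing function names as strings ('sum', 'mean') is used here instead of NumPy functions; either form is accepted by aggregate.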

Transpose

>>> df[:5].T

	2000-01-01 00:00:00	2000-01-02 00:00:00	2000-01-03 00:00:00	2000-01-04 00:00:00	2000-01-05 00:00:00
A	0.388901	0.159726	1.576600	-0.993827	-1.297079
B	0.232477	-0.904435	-0.628984	1.015665	0.825678
C	-1.254728	-0.195899	0.450605	-0.541170	0.043319

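Sorting

Method	Description
sort_index()	Sort by index, ascending by default
sort_index(ascending=False)	Sort by index in descending order
sort_values()	Sort a Series by value, ascending by default
sort_values(ascending=False)	Sort a Series by value in descending order
df.sort_values(by=[col, col])	Sort by the values of the given columns
df.sort_values(by=[row], axis=1)	Sort by the values of the given rows
df.rank()	Compute ranks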
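
The sorting methods above can be sketched on a small illustrative frame (the data and names are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2], 'b': [9, 8, 7]}, index=['z', 'x', 'y'])

by_index = df.sort_index()                          # rows ordered x, y, z
by_value = df.sort_values(by='a', ascending=False)  # rows ordered by column a, largest first
ranks = df['a'].rank()                              # rank of each value, smallest = 1.0

print(by_index)
print(by_value)
print(ranks)
```

All three calls return new objects and leave df unchanged.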
