Summary statistics functions

| Method | Description |
| --- | --- |
| count | Number of non-NA values |
| describe | Compute a set of summary statistics for a Series or for each DataFrame column |
| min, max | Compute the minimum and maximum values |
| argmin, argmax | Compute the index positions (integers) at which the minimum and maximum are attained |
| idxmin, idxmax | Compute the index labels at which the minimum and maximum are attained |
| quantile | Compute sample quantiles ranging from 0 to 1 |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median (50th percentile) of values |
| mad | Mean absolute deviation from the mean value |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness of values |
| kurt | Sample kurtosis of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum and maximum of values |
| cumprod | Cumulative product of values |
| diff | Compute first differences (useful for time series) |
| pct_change | Compute percentage changes |
import numpy as np
import pandas as pd
from pandas import Series
df = pd.DataFrame([[1.4, np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],
index=['a','b','c','d'],
columns = ['one','two']
)
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
# Sum over each column
df.sum()
one 9.25
two -5.80
dtype: float64
# Sum over each row (axis=1)
df.sum(axis=1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
Options for reduction methods

| Option | Description |
| --- | --- |
| axis | Axis to reduce over; axis=1 computes the statistic across each row, the default (axis=0) computes it down each column |
| skipna | Exclude missing values; True by default |
| level | If the axis is hierarchically indexed (MultiIndex), reduce grouped by level |
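For instance, with skipna disabled any row that contains a NaN produces NaN. A sketch against the df defined above:
# Row means without skipping missing values
df.mean(axis=1, skipna=False)
# a       NaN
# b     1.300
# c       NaN
# d    -0.275
# dtype: float64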
# Return the index label at which the minimum occurs (idxmax for the maximum)
df.idxmin()
one d
two b
dtype: object
# Cumulative sum down each column
df.cumsum()
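For the df above, cumsum accumulates down each column and leaves NaN in place, so the result works out to:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8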
### Summary analysis
df.describe()
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
# For non-numeric data, describe produces a different set of summary statistics
obj = Series(['a','a','b','c']*4)
obj.describe()
count 16
unique 3
top a
freq 8
dtype: object
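On a DataFrame with mixed column types, describe reports only the numeric columns by default; passing include='all' covers both kinds. A minimal sketch (mixed is a made-up frame):
mixed = pd.DataFrame({'num': [1.0, 2.0, 3.0], 'cat': ['a', 'a', 'b']})
mixed.describe()                # numeric column only
mixed.describe(include='all')   # numeric and non-numeric statistics together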
Correlation and covariance
# Requires pandas-datareader: pip install pandas-datareader
import pandas_datareader.data as web
# Fetch stock price data remotely (Yahoo! Finance)
all_data={}
for ticker in ['AAPL','IBM','MSFT']:
all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
all_data
{'AAPL': High Low Open Close Volume Adj Close
Date
1999-12-31 3.674107 3.553571 3.604911 3.671875 40952800.0 2.467498
2000-01-03 4.017857 3.631696 3.745536 3.997768 133949200.0 2.686497
2000-01-04 3.950893 3.613839 3.866071 3.660714 128094400.0 2.459998
... ... ... ... ... ... ...
2009-12-30 30.285715 29.758572 29.832857 30.234285 103021100.0 20.317423
2009-12-31 30.478571 30.080000 30.447144 30.104286 88102700.0 20.230061
[2516 rows x 6 columns],
...}
price = pd.DataFrame({tic:data['Adj Close'] for tic, data in all_data.items()})
volume = pd.DataFrame({tic:data['Volume'] for tic, data in all_data.items()})
# Compute the percentage change of the prices
returns = price.pct_change()
# Look at the last 5 rows
returns.tail()
AAPL IBM MSFT
Date
2009-12-24 0.034340 0.004385 0.002587
2009-12-28 0.012294 0.013326 0.005484
2009-12-29 -0.011861 -0.003477 0.007058
2009-12-30 0.012147 0.005461 -0.013698
2009-12-31 -0.004300 -0.012597 -0.015504
The corr method of Series computes the correlation coefficient of the overlapping, non-NA, index-aligned values of two Series; cov computes the covariance in the same way.
# Correlation between the MSFT and IBM returns
returns.MSFT.corr(returns.IBM)
0.49253706924390156
# Covariance between the MSFT and IBM returns
returns.MSFT.cov(returns.IBM)
0.00021557776646297303
The corr and cov methods of DataFrame return the full correlation and covariance matrices between every pair of columns.
returns.corr()
AAPL IBM MSFT
AAPL 1.000000 0.412392 0.422852
IBM 0.412392 1.000000 0.492537
MSFT 0.422852 0.492537 1.000000
# corrwith with a Series: correlation of each column with the IBM column
returns.corrwith(returns.IBM)
AAPL 0.412392
IBM 1.000000
MSFT 0.492537
dtype: float64
# corrwith with a DataFrame: column-by-column correlations between returns and volume (aligned by label)
returns.corrwith(volume)
AAPL -0.057665
IBM -0.006592
MSFT -0.016101
dtype: float64
returns.cov()
AAPL IBM MSFT
AAPL 0.001030 0.000254 0.000309
IBM 0.000254 0.000369 0.000216
MSFT 0.000309 0.000216 0.000519
Unique values: unique returns an array of the distinct values in a Series, while value_counts returns a Series whose index is the unique values and whose values are their frequencies.
obj = Series(['c','a','d','a','a','b','b','c','c'])
uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
# The returned array can be sorted in place
uniques.sort()
uniques
array(['a', 'b', 'c', 'd'], dtype=object)
# Count how often each value occurs; the result is sorted by frequency in descending order for convenience
obj.value_counts()
a 3
c 3
b 2
d 1
dtype: int64
# pd.value_counts is also available as a top-level function that accepts any one-dimensional array
pd.value_counts(obj.values)
a 3
c 3
b 2
d 1
dtype: int64
Membership: isin computes a boolean array indicating whether each value is contained in the passed set of values.
mask = obj.isin(['b','c'])
# Step 1: test membership, which returns a boolean Series
mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
# Step 2: use the boolean mask to select the matching values
obj[mask]
0 c
5 b
6 b
7 c
8 c
dtype: object
data = pd.DataFrame({'Qu1':[100,300,400,300,400],
'Qu2':[200,300,100,200,300],
'Qu3':[100,500,200,400,400]
})
data
Qu1 Qu2 Qu3
0 100 200 100
1 300 300 500
2 400 100 200
3 300 200 400
4 400 300 400
# Value counts for each column; values that never appear in a column are filled with 0
p = data.apply(pd.value_counts).fillna(0)
p
Qu1 Qu2 Qu3
100 1.0 1.0 1.0
200 0.0 2.0 1.0
300 2.0 2.0 0.0
400 2.0 0.0 2.0
500 0.0 0.0 1.0
Handling missing data

NA handling methods

| Method | Description |
| --- | --- |
| dropna | Filter axis labels based on whether values for each label have missing data, with a threshold to tune how much missing data to tolerate |
| fillna | Fill in missing data with some value or using an interpolation method |
| isnull | Return an object of booleans indicating which values are missing |
| notnull | Negation of isnull |
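isnull and notnull are not demonstrated below, so a quick sketch (s is a made-up Series):
s = pd.Series([1.0, None, 3.5])
s.isnull()     # True where the value is missing: False, True, False
s.notnull()    # element-wise negation of isnull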
Filtering out missing data
from numpy import nan as NA
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
# By default, drop every row that contains any NA
data.dropna()
0 1 2
0 1.0 6.5 3.0
# Passing how='all' drops only the rows that are entirely NA
data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
data[4]=NA
data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
# axis=1: drop the columns that are all NA
data.dropna(axis=1,how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
df = pd.DataFrame(np.random.randn(7, 5))
# .ix is deprecated; use label-based .loc instead
df.loc[:4, 1] = NA
df.loc[:2, 2] = NA
df
0 1 2 3 4
0 0.758567 NaN NaN -0.064858 -0.385678
1 -0.275586 NaN NaN -0.184934 -0.253343
2 -1.872585 NaN NaN -1.539924 0.794054
3 1.092201 NaN 0.250026 -0.654255 -0.016992
4 0.625871 NaN -1.418505 1.141008 3.188408
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.412723 -0.405043 -0.417018 -1.114318
# thresh=5 keeps only the rows that have at least 5 non-NA values
df.dropna(thresh=5)
0 1 2 3 4
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.412723 -0.405043 -0.417018 -1.114318
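dropna can also be restricted to particular columns through its subset argument; a sketch against the same df:
# Drop only the rows whose value in column 1 is missing,
# regardless of NaNs in the other columns (keeps rows 5 and 6 here)
df.dropna(subset=[1])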
#### Filling in missing data (instead of filtering it out)
# Replace every missing value with a constant
df.fillna(0)
# Call fillna with a dict to fill a different value into each column
df.fillna({1: 'col1', 2: 'col2'})
0 1 2 3 4
0 0.758567 col1 col2 -0.064858 -0.385678
1 -0.275586 col1 col2 -0.184934 -0.253343
2 -1.872585 col1 col2 -1.539924 0.794054
3 1.092201 col1 0.250026 -0.654255 -0.016992
4 0.625871 col1 -1.41851 1.141008 3.188408
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.41272 -0.405043 -0.417018 -1.114318
# All of the above return new objects; to modify the original DataFrame in place, pass inplace=True
df.fillna({1: 'col1', 2: 'col2'}, inplace=True)
# Filling forward and backward (ffill, bfill)
df = pd.DataFrame(np.random.randn(6, 3))
df.loc[1:2, 1] = NA   # insert some NAs to match the output shown below
df.loc[3:4, 2] = NA
0 1 2
0 2.040458 -2.276695 -1.038916
1 0.427078 NaN 0.001678
2 1.798042 NaN -0.839205
3 -0.433214 -0.312427 NaN
4 0.041802 1.356883 NaN
5 -0.904390 -1.030643 1.507598
# Forward fill, but fill at most 1 consecutive missing value per column
df.fillna(method='ffill', limit=1)
0 1 2
0 2.040458 -2.276695 -1.038916
1 0.427078 -2.276695 0.001678
2 1.798042 NaN -0.839205
3 -0.433214 -0.312427 -0.839205
4 0.041802 1.356883 NaN
5 -0.904390 -1.030643 1.507598
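bfill works the same way but propagates values backwards; a sketch on the same df, using the values shown above:
# Backward fill: each NaN takes the next valid value below it, so
# column 1 rows 1-2 become -0.312427 and column 2 rows 3-4 become 1.507598
df.fillna(method='bfill')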
# You can also fill with the mean or median of the Series
data = Series([1,NA,3.5])
data.fillna(data.mean())
0 1.00
1 2.25
2 3.50
dtype: float64
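Filling with the median works the same way (here the median of the two non-NA values happens to equal their mean):
data.fillna(data.median())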