Pandas python


 

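Let's play with the pandas (Pandas)!

Pandas is a Python library for data analysis: http://pandas.pydata.org

API quick reference: http://pandas.pydata.org/pandas-docs/stable/api.html

It builds on NumPy and works alongside SciPy, adding a large set of data manipulation features on top of them.

Statistics, grouping, sorting and pivot tables can be combined freely. If you already know relational databases (RDBMS) and Excel well, you will find that Pandas matches them and then some!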

 

0. Hands-on: Why Pandas?

What would an ordinary programmer do when handed a dataset?

In [1]:
 
 
 
 
 
import codecs
import requests
import numpy as np
import scipy as sp
import scipy.stats as spstat
import pandas as pd
import datetime
import json
 
 
In [2]:
 
 
 
 
 
r = requests.get("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
with codecs.open('S1EP3_Iris.txt','w',encoding='utf-8') as f:
    f.write(r.text)
 
 
In [3]:
 
 
 
 
 
with codecs.open('S1EP3_Iris.txt','r',encoding='utf-8') as f:
    lines = f.readlines()
for idx,line in enumerate(lines):
    print line,
    if idx==10:
        break
 
 
 
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
 

This is where Pandas shows its value:

Quickly recognising structured data

In [19]:
 
 
 
 
 
import pandas as pd
irisdata = pd.read_csv('S1EP3_Iris.txt',header = None, encoding='utf-8')
#irisdata
 
 
 

Quickly manipulating metadata

In [22]:
 
 
 
 
 
cnames = ['sepal_length','sepal_width','petal_length','petal_width','class']
irisdata.columns = cnames
#irisdata
 
 
 

Quick filtering

In [11]:
 
 
 
 
 
#irisdata[irisdata['petal_width']==irisdata.petal_width.max()]
irisdata[irisdata['sepal_length']==irisdata.sepal_length.max()]
 
 
Out[11]:
  sepal_length sepal_width petal_length petal_width class
131 7.9 3.8 6.4 2 Iris-virginica
 

Quick slicing

In [23]:
 
 
 
 
 
#irisdata.iloc[::30,:2]
irisdata.iloc[::40,0:3]
 
 
Out[23]:
  sepal_length sepal_width petal_length
0 5.1 3.5 1.4
40 5.0 3.5 1.3
80 5.5 2.4 3.8
120 6.9 3.2 5.7
 

Quick statistics

In [31]:
 
 
 
 
 
#print irisdata['class'].value_counts()
for x in xrange(4):
    s = irisdata.iloc[:,x]
    print '{0:<12}'.format(s.name.upper()), " Statistics: ", \
    '{0:>5}  {1:>5}  {2:>5}  {3:>5}'.format(s.max(), s.min(), round(s.mean(),2),round(s.std(),2))
 
 
 
SEPAL_LENGTH  Statistics:    7.9    4.3   5.84   0.83
SEPAL_WIDTH   Statistics:    4.4    2.0   3.05   0.43
PETAL_LENGTH  Statistics:    6.9    1.0   3.76   1.76
PETAL_WIDTH   Statistics:    2.5    0.1    1.2   0.76
 

Quick "MapReduce"

In [9]:
 
 
 
 
 
slogs = lambda x:sp.log(x)*x
entpy = lambda x:sp.exp((slogs(x.sum())-x.map(slogs).sum())/x.sum())
irisdata.groupby('class').agg(entpy)
 
 
Out[9]:
  sepal_length sepal_width petal_length petal_width
class        
Iris-setosa 49.878745 49.695242 49.654909 45.810069
Iris-versicolor 49.815081 49.680665 49.694505 49.452305
Iris-virginica 49.772059 49.714500 49.761700 49.545918
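
The aggregation above is terse, so here is a hedged sketch of what it appears to compute. With S = sum(x) and p_i = x_i / S, the expression (S log S - sum(x_i log x_i)) / S equals the Shannon entropy H = -sum(p_i log p_i), and exponentiating it gives an "effective number" of equally weighted observations, which can be at most the group size (50 per class in the iris data). The function names and the toy data below are illustrative assumptions, and np.log stands in for sp.log.

```python
import numpy as np
import pandas as pd

def exp_entropy_short(x):
    """Same algebra as the lambda pair above, written with NumPy."""
    s = x.sum()
    return np.exp((s * np.log(s) - (x * np.log(x)).sum()) / s)

def exp_entropy_direct(x):
    """exp(H) with H = -sum(p * log p) and p = x / sum(x)."""
    p = x / x.sum()
    return np.exp(-(p * np.log(p)).sum())

# Toy data (made up): group 'a' is perfectly even, group 'b' is dominated by one value.
toy = pd.DataFrame({'grp': ['a'] * 4 + ['b'] * 4,
                    'val': [2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 9.0]})

result = toy.groupby('grp')['val'].agg(exp_entropy_short)
check = toy.groupby('grp')['val'].agg(exp_entropy_direct)
print(result)
```

An even group reaches its full size (4 here), while an uneven group scores lower, which is consistent with the iris values sitting just below 50.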
 

1. Welcome to the world of pandas

 

The key Pandas data types

  • DataFrame (two-dimensional table)
  • Series (one-dimensional sequence)
  • Index (row index, row-level metadata)
 

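1.1 Series: the spear of pandas (a column or a row of a data table, an observation vector, a one-dimensional array, ...)

In the world of data, a full observation of any one individual, or an observation of one attribute across any group of individuals, can be abstracted as a Series.

Building a Series from values:

It consists of a default index and the values.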

In [10]:
 
 
 
 
 
Series1 = pd.Series(np.random.randn(4))
print Series1,type(Series1)
print Series1.index
print Series1.values
 
 
 
0   -0.909672
1    1.739425
2   -1.163028
3    0.408693
dtype: float64 <class 'pandas.core.series.Series'>
Int64Index([0, 1, 2, 3], dtype='int64')
[-0.90967166  1.73942495 -1.1630284   0.40869312]
 

Series supports filtering on the same principle as NumPy (boolean masks):

In [11]:
 
 
 
 
 
print Series1>0
print Series1[Series1>0]
 
 
 
0    False
1     True
2    False
3     True
dtype: bool
1    1.739425
3    0.408693
dtype: float64
 

Broadcasting is supported too, of course:

In [12]:
 
 
 
 
 
print Series1*2
print Series1+5
 
 
 
0   -1.819343
1    3.478850
2   -2.326057
3    0.817386
dtype: float64
0    4.090328
1    6.739425
2    3.836972
3    5.408693
dtype: float64
 

So are universal functions (ufuncs):

In [13]:
 
 
 
 
 
print np.exp(Series1)
#NumPy Universal Function
f_np = np.frompyfunc(lambda x:np.exp(x*2+5),1,1)
print f_np(Series1)
 
 
 
0    0.402656
1    5.694068
2    0.312538
3    1.504850
dtype: float64
0    24.06255
1    4811.913
2    14.49702
3    336.0924
dtype: object
 

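Putting row labels directly on the Series, instead of creating a two-column table, makes it easy to tell which part is data and which part is metadata: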

In [14]:
 
 
 
 
 
Series2 = pd.Series(Series1.values,index=['norm_'+unicode(i) for i in xrange(4)])
print Series2,type(Series2)
print Series2.index
print type(Series2.index)
print Series2.values
 
 
 
norm_0   -0.909672
norm_1    1.739425
norm_2   -1.163028
norm_3    0.408693
dtype: float64 <class 'pandas.core.series.Series'>
Index([u'norm_0', u'norm_1', u'norm_2', u'norm_3'], dtype='object')
<class 'pandas.core.index.Index'>
[-0.90967166  1.73942495 -1.1630284   0.40869312]
 

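Although the rows are ordered, data can still be accessed through the row-level index:

(It is not quite like an OrderedDict, though, because row labels may even repeat. Duplicate row labels are not recommended, but that does not mean they cannot be used.)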

In [15]:
 
 
 
 
 
print Series2[['norm_0','norm_3']]
 
 
 
norm_0   -0.909672
norm_3    0.408693
dtype: float64
In [16]:
 
 
 
 
 
print 'norm_0' in Series2
print 'norm_6' in Series2
 
 
 
True
False
 

The default row index behaves like row numbers:

In [17]:
 
 
 
 
 
print Series1.index
 
 
 
Int64Index([0, 1, 2, 3], dtype='int64')
 

Defining a Series from an OrderedDict or a dict, whose keys are unique, removes any worry about duplicate row labels:

In [18]:
 
 
 
 
 
Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}
Series3_pdSeries = pd.Series(Series3_Dict)
print Series3_pdSeries
print Series3_pdSeries.values
print Series3_pdSeries.index
 
 
 
China      Beijing
Japan        Tokyo
S.Korea      Seoul
dtype: object
['Beijing' 'Tokyo' 'Seoul']
Index([u'China', u'Japan', u'S.Korea'], dtype='object')
 

Difference from dict #1: a Series is ordered

In [19]:
 
 
 
 
 
Series4_IndexList = ["Japan","China","Singapore","S.Korea"]
Series4_pdSeries = pd.Series( Series3_Dict ,index = Series4_IndexList)
print Series4_pdSeries
print Series4_pdSeries.values
print Series4_pdSeries.index
print Series4_pdSeries.isnull()
print Series4_pdSeries.notnull()
 
 
 
Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
dtype: object
['Tokyo' 'Beijing' nan 'Seoul']
Index([u'Japan', u'China', u'Singapore', u'S.Korea'], dtype='object')
Japan        False
China        False
Singapore     True
S.Korea      False
dtype: bool
Japan         True
China         True
Singapore    False
S.Korea       True
dtype: bool
 

Difference from dict #2: values in the index may repeat, although this is not recommended.

In [20]:
 
 
 
 
 
Series5_IndexList = ['A','B','B','C']
Series5 = pd.Series(Series1.values,index = Series5_IndexList)
print Series5
print Series5[['B','A']]
 
 
 
A   -0.909672
B    1.739425
B   -1.163028
C    0.408693
dtype: float64
B    1.739425
B   -1.163028
A   -0.909672
dtype: float64
 

Metadata for the whole Series: name

Once a Series and its index both have names, later joins between datasets become more convenient!

In [21]:
 
 
 
 
 
print Series4_pdSeries.name
print Series4_pdSeries.index.name
 
 
 
None
None
In [22]:
 
 
 
 
 
Series4_pdSeries.name = "Capital Series"
Series4_pdSeries.index.name = "Nation"
print Series4_pdSeries
pd.DataFrame(Series4_pdSeries)
 
 
 
Nation
Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
Name: Capital Series, dtype: object
Out[22]:
  Capital Series
Nation  
Japan Tokyo
China Beijing
Singapore NaN
S.Korea Seoul
 

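1.2 DataFrame: the war hammer of pandas (data table, two-dimensional array)

An ordered collection of Series, as convenient as R's data.frame.

Think about it: the vast majority of data can be represented as a DataFrame.

Defining one from a NumPy 2-D array, a file, or a database: the data may be good, but do not forget the column names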

In [23]:
 
 
 
 
 
dataNumPy = np.asarray([('Japan','Tokyo',4000),\
                ('S.Korea','Seoul',1300),('China','Beijing',9100)])
DF1 = pd.DataFrame(dataNumPy,columns=['nation','capital','GDP'])
DF1
 
 
Out[23]:
  nation capital GDP
0 Japan Tokyo 4000
1 S.Korea Seoul 1300
2 China Beijing 9100
 

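Equal-length columns stored in a dict (JSON-like): unfortunately, dict keys are unordered in the Python 2 used here, so the column order of the result does not follow the order written in the code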

In [24]:
 
 
 
 
 
dataDict = {'nation':['Japan','S.Korea','China'],\
        'capital':['Tokyo','Seoul','Beijing'],'GDP':[4900,1300,9100]}
DF2 = pd.DataFrame(dataDict)
DF2
 
 
Out[24]:
  GDP capital nation
0 4900 Tokyo Japan
1 1300 Seoul S.Korea
2 9100 Beijing China
 

Defining a DataFrame from another DataFrame: ah, the perfectionist in me kicks in! Passing columns fixes the column order.

In [25]:
 
 
 
 
 
DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])
DF21
 
 
Out[25]:
  nation capital GDP
0 Japan Tokyo 4900
1 S.Korea Seoul 1300
2 China Beijing 9100
In [26]:
 
 
 
 
 
DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])
DF22
 
 
Out[26]:
  nation capital GDP
2 China Beijing 9100
0 Japan Tokyo 4900
1 S.Korea Seoul 1300
 

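Taking a column out of a DataFrame? Two ways (just like property access in JavaScript!)

  • The '.' form can easily clash with reserved keywords and with existing attribute or method names
  • The '[ ]' form is the safest.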
In [27]:
 
 
 
 
 
print DF22.nation
print DF22.capital
print DF22['GDP']
 
 
 
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object
2    Beijing
0      Tokyo
1      Seoul
Name: capital, dtype: object
2    9100
0    4900
1    1300
Name: GDP, dtype: int64
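
To make the clash concrete, here is a small sketch written for Python 3 and a recent pandas (unlike the Python 2 cells in this notebook). The table and its column names are made up for illustration: one column is named like a DataFrame method, another is not a valid Python identifier.

```python
import pandas as pd

# Hypothetical table whose column names clash with attribute access.
df = pd.DataFrame({'nation': ['Japan', 'China'],
                   'count': [3, 5],                   # same name as DataFrame.count()
                   'GDP per capita': [40000, 9000]})  # not a valid identifier

print(type(df.nation))                 # a Series: no clash, so '.' works
print(callable(df.count))              # this is the method, not the column
print(df['count'].tolist())            # '[ ]' always returns the column
print(df['GDP per capita'].tolist())   # only '[ ]' can reach this name
```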
 

Taking rows out of a DataFrame? (At least) two ways:

In [28]:
 
 
 
 
 
print DF22[0:1] # positional slice; the result is actually a DataFrame
print DF22.ix[0] # row selected by its index label
 
 
 
  nation  capital   GDP
2  China  Beijing  9100
nation     Japan
capital    Tokyo
GDP         4900
Name: 0, dtype: object
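
The .ix accessor used above was deprecated and later removed from pandas. As a sketch for readers on a recent version (Python 3), the same distinction between label-based and position-based row access can be written with .loc and .iloc. The small frame below is an assumption that mirrors the row labels of DF22.

```python
import pandas as pd

df = pd.DataFrame({'nation': ['China', 'Japan', 'S.Korea'],
                   'capital': ['Beijing', 'Tokyo', 'Seoul']},
                  index=[2, 0, 1])   # same row labels as DF22

by_label = df.loc[0]       # the row whose index label is 0
by_position = df.iloc[0]   # the first row, whatever its label
first_slice = df[0:1]      # slicing with [] is positional and returns a DataFrame

print(by_label['nation'], by_position['nation'])
print(first_slice.index.tolist())
```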
 

The ultimate move, slicing just like NumPy: iloc

In [29]:
 
 
 
 
 
print DF22.iloc[0,:]
print DF22.iloc[:,0]
 
 
 
nation       China
capital    Beijing
GDP           9100
Name: 2, dtype: object
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object
 

Heard you came from ALTER TABLE hell? The panda smiles.

However, a new column cannot be added dynamically with the "." form; only "[ ]" works.

In [30]:
 
 
 
 
 
DF22['population'] = [1600,130,55]
DF22['region'] = 'East_Asian'
DF22
 
 
Out[30]:
  nation capital GDP population region
2 China Beijing 9100 1600 East_Asian
0 Japan Tokyo 4900 130 East_Asian
1 S.Korea Seoul 1300 55 East_Asian
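
The following sketch (Python 3, recent pandas) shows why the "." form fails for new columns: the assignment only sets a plain Python attribute on the object. The frame names are illustrative, and the warning filter is there because recent pandas versions warn about this pattern.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'nation': ['China', 'Japan', 'S.Korea']})

scratch = df.copy()
with warnings.catch_warnings():
    warnings.simplefilter('ignore')        # recent pandas warns about this pattern
    scratch.population = [1600, 130, 55]   # sets a plain attribute, not a column

attr_made_column = 'population' in scratch.columns

df['population'] = [1600, 130, 55]   # bracket assignment creates the column
df['region'] = 'East_Asian'          # a scalar is broadcast to every row
print(attr_made_column, list(df.columns))
```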

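1.3 Index: the joker card of pandas data manipulation (row-level index)

A row-level index is

  • metadata
  • possibly derived from real data, so it can also be treated as data
  • possibly a MultiIndex, that is, a combination of several columns
  • swappable with column names, and can be stacked and unstacked to get the effect of an Excel pivot table

There are four kinds of Index... oh no, many kinds. Some important index types include

  • pd.Index (plain)
  • Int64Index (integer index)
  • MultiIndex (multi-level index, described in more detail under data manipulation)
  • DatetimeIndex (timestamps as the index)
  • PeriodIndex (time periods as the index)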
 

Defining a plain index directly: it looks just like an ordinary Series

In [31]:
 
 
 
 
 
index_names = ['a','b','c']
Series_for_Index = pd.Series(index_names)
print pd.Index(index_names)
print pd.Index(Series_for_Index)
 
 
 
Index([u'a', u'b', u'c'], dtype='object')
Index([u'a', u'b', u'c'], dtype='object')
 

Unfortunately it is immutable. Remember that!

In [32]:
 
 
 
 
 
index_names = ['a','b','c']
index0 = pd.Index(index_names)
print index0.get_values()
index0[2] = 'd'
 
 
 
['a' 'b' 'c']
 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-f34da0a8623c> in <module>()
      2 index0 = pd.Index(index_names)
      3 print index0.get_values()
----> 4 index0[2] = 'd'

/Users/wangweiyang/anaconda/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in __setitem__(self, key, value)
   1055 
   1056     def __setitem__(self, key, value):
-> 1057         raise TypeError("Indexes does not support mutable operations")
   1058 
   1059     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations
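
Since an Index cannot be changed in place, the usual workaround is to build a new one. The sketch below (Python 3, recent pandas) shows two options; the variable names are illustrative.

```python
import pandas as pd

index0 = pd.Index(['a', 'b', 'c'])

try:
    index0[2] = 'd'     # in-place item assignment is rejected
    mutated = True
except TypeError:
    mutated = False

# delete() and insert() both return new Index objects.
index_new = index0.delete(2).insert(2, 'd')

# To relabel a Series, rename() returns a relabelled copy.
s = pd.Series([1, 2, 3], index=index0)
s_renamed = s.rename({'c': 'd'})

print(mutated, list(index0), list(index_new), list(s_renamed.index))
```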
 

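Throw in a list of tuples and you get a MultiIndex

Unfortunately, if this list comprehension is changed to use parentheses (a generator expression), it no longer works.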

In [33]:
 
 
 
 
 
#print [('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)]
multi1 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)])
multi1.name = ['index1','index2']
print multi1
 
 
 
MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])
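
As a hedged alternative to passing tuples to pd.Index, pandas also offers explicit MultiIndex constructors. The sketch below (Python 3, recent pandas) builds the same 4 x 4 index two ways and attaches level names through the names argument, which is where level names live on a MultiIndex.

```python
import pandas as pd

rows = ['Row_%d' % (i + 1) for i in range(4)]
cols = ['Col_%d' % (j + 1) for j in range(4)]

# Cartesian product of the two label lists, with level names attached.
multi_product = pd.MultiIndex.from_product([rows, cols],
                                           names=['index1', 'index2'])

# Equivalent construction from an explicit list of tuples.
multi_tuples = pd.MultiIndex.from_tuples([(r, c) for r in rows for c in cols],
                                         names=['index1', 'index2'])

print(multi_product.nlevels, len(multi_product), list(multi_product.names))
print(multi_product.equals(multi_tuples))
```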
 

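Give a Series a MultiIndex and the data can transform!

The code below shows:

  • a Series with a two-level MultiIndex can be unstack()ed into a DataFrame
  • a DataFrame can be stack()ed into a Series with a two-level MultiIndex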
In [34]:
 
 
 
 
 
data_for_multi1 = pd.Series(xrange(0,16),index=multi1)
data_for_multi1
 
 
Out[34]:
Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64
In [35]:
 
 
 
 
 
data_for_multi1.unstack()
 
 
Out[35]:
  Col_1 Col_2 Col_3 Col_4
Row_1 0 1 2 3
Row_2 4 5 6 7
Row_3 8 9 10 11
Row_4 12 13 14 15
In [36]:
 
 
 
 
 
data_for_multi1.unstack().stack()
 
 
Out[36]:
Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64
 

Now let's look at an example with unbalanced data:

Row_1,2,3,4 and Col_1,2,3,4 do not occur in every combination.

In [37]:
 
 
 
 
 
multi2 = pd.Index([('Row_'+str(x),'Col_'+str(y+1)) \
                   for x in xrange(5) for y in xrange(x)])
multi2
 
 
Out[37]:
MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],
           labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])
In [38]:
 
 
 
 
 
data_for_multi2 = pd.Series(np.arange(10),index = multi2)
data_for_multi2
 
 
Out[38]:
Row_1  Col_1    0
Row_2  Col_1    1
       Col_2    2
Row_3  Col_1    3
       Col_2    4
       Col_3    5
Row_4  Col_1    6
       Col_2    7
       Col_3    8
       Col_4    9
dtype: int64
In [39]:
 
 
 
 
 
data_for_multi2.unstack()
 
 
Out[39]:
  Col_1 Col_2 Col_3 Col_4
Row_1 0 NaN NaN NaN
Row_2 1 2 NaN NaN
Row_3 3 4 5 NaN
Row_4 6 7 8 9
In [40]:
 
 
 
 
 
data_for_multi2.unstack().stack()
 
 
Out[40]:
Row_1  Col_1    0
Row_2  Col_1    1
       Col_2    2
Row_3  Col_1    3
       Col_2    4
       Col_3    5
Row_4  Col_1    6
       Col_2    7
       Col_3    8
       Col_4    9
dtype: float64
 

The standard-library datetime module is so handy, you deserve to have it

In [41]:
 
 
 
 
 
dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]
pd.DatetimeIndex(dates)
 
 
Out[41]:
DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None, tz=None)
 

If you need not only a uniform time format but also a uniform time frequency:

In [42]:
 
 
 
 
 
periodindex1 = pd.period_range('2015-01','2015-04',freq='M')
print periodindex1
 
 
 
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='int64', freq='M')
 

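How do you convert between monthly and daily precision?

Some companies use the 1st to stand for the month and others use the last day, which makes conversion tedious; asfreq takes care of it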

In [43]:
 
 
 
 
 
print periodindex1.asfreq('D',how='start')
print periodindex1.asfreq('D',how='end')
 
 
 
PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='int64', freq='D')
PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='int64', freq='D')
 

And finally, how do I actually match up data at the two frequencies?

In [44]:
 
 
 
 
 
periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')
periodindex_day = pd.period_range('2015-01-01','2015-03-31',freq='D')
print periodindex_mon
print periodindex_day
 
 
 
PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01'], dtype='int64', freq='D')
PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
             '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
             '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
             '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',
             '2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',
             '2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',
             '2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',
             '2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01',
             '2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',
             '2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09',
             '2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13',
             '2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17',
             '2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21',
             '2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25',
             '2015-02-26', '2015-02-27', '2015-02-28', '2015-03-01',
             '2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',
             '2015-03-06', '2015-03-07', '2015-03-08', '2015-03-09',
             '2015-03-10', '2015-03-11', '2015-03-12', '2015-03-13',
             '2015-03-14', '2015-03-15', '2015-03-16', '2015-03-17',
             '2015-03-18', '2015-03-19', '2015-03-20', '2015-03-21',
             '2015-03-22', '2015-03-23', '2015-03-24', '2015-03-25',
             '2015-03-26', '2015-03-27', '2015-03-28', '2015-03-29',
             '2015-03-30', '2015-03-31'],
            dtype='int64', freq='D')
 

Coarse-grained data + reindex + ffill/bfill

In [45]:
 
 
 
 
 
#print pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day)
full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day)
full_ts
 
 
Out[45]:
2015-01-01    2015-01-01
2015-01-02           NaN
2015-01-03           NaN
2015-01-04           NaN
2015-01-05           NaN
2015-01-06           NaN
2015-01-07           NaN
2015-01-08           NaN
2015-01-09           NaN
2015-01-10           NaN
2015-01-11           NaN
2015-01-12           NaN
2015-01-13           NaN
2015-01-14           NaN
2015-01-15           NaN
2015-01-16           NaN
2015-01-17           NaN
2015-01-18           NaN
2015-01-19           NaN
2015-01-20           NaN
2015-01-21           NaN
2015-01-22           NaN
2015-01-23           NaN
2015-01-24           NaN
2015-01-25           NaN
2015-01-26           NaN
2015-01-27           NaN
2015-01-28           NaN
2015-01-29           NaN
2015-01-30           NaN
                 ...    
2015-03-02           NaN
2015-03-03           NaN
2015-03-04           NaN
2015-03-05           NaN
2015-03-06           NaN
2015-03-07           NaN
2015-03-08           NaN
2015-03-09           NaN
2015-03-10           NaN
2015-03-11           NaN
2015-03-12           NaN
2015-03-13           NaN
2015-03-14           NaN
2015-03-15           NaN
2015-03-16           NaN
2015-03-17           NaN
2015-03-18           NaN
2015-03-19           NaN
2015-03-20           NaN
2015-03-21           NaN
2015-03-22           NaN
2015-03-23           NaN
2015-03-24           NaN
2015-03-25           NaN
2015-03-26           NaN
2015-03-27           NaN
2015-03-28           NaN
2015-03-29           NaN
2015-03-30           NaN
2015-03-31           NaN
Freq: D, dtype: object
In [46]:
 
 
 
 
 
full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')
full_ts
 
 
Out[46]:
2015-01-01    2015-01-01
2015-01-02    2015-01-01
2015-01-03    2015-01-01
2015-01-04    2015-01-01
2015-01-05    2015-01-01
2015-01-06    2015-01-01
2015-01-07    2015-01-01
2015-01-08    2015-01-01
2015-01-09    2015-01-01
2015-01-10    2015-01-01
2015-01-11    2015-01-01
2015-01-12    2015-01-01
2015-01-13    2015-01-01
2015-01-14    2015-01-01
2015-01-15    2015-01-01
2015-01-16    2015-01-01
2015-01-17    2015-01-01
2015-01-18    2015-01-01
2015-01-19    2015-01-01
2015-01-20    2015-01-01
2015-01-21    2015-01-01
2015-01-22    2015-01-01
2015-01-23    2015-01-01
2015-01-24    2015-01-01
2015-01-25    2015-01-01
2015-01-26    2015-01-01
2015-01-27    2015-01-01
2015-01-28    2015-01-01
2015-01-29    2015-01-01
2015-01-30    2015-01-01
                 ...    
2015-03-02    2015-03-01
2015-03-03    2015-03-01
2015-03-04    2015-03-01
2015-03-05    2015-03-01
2015-03-06    2015-03-01
2015-03-07    2015-03-01
2015-03-08    2015-03-01
2015-03-09    2015-03-01
2015-03-10    2015-03-01
2015-03-11    2015-03-01
2015-03-12    2015-03-01
2015-03-13    2015-03-01
2015-03-14    2015-03-01
2015-03-15    2015-03-01
2015-03-16    2015-03-01
2015-03-17    2015-03-01
2015-03-18    2015-03-01
2015-03-19    2015-03-01
2015-03-20    2015-03-01
2015-03-21    2015-03-01
2015-03-22    2015-03-01
2015-03-23    2015-03-01
2015-03-24    2015-03-01
2015-03-25    2015-03-01
2015-03-26    2015-03-01
2015-03-27    2015-03-01
2015-03-28    2015-03-01
2015-03-29    2015-03-01
2015-03-30    2015-03-01
2015-03-31    2015-03-01
Freq: D, dtype: object
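
The same reindex-and-fill idea can be sketched with a DatetimeIndex, which also makes it easy to compare ffill with bfill. This is a Python 3 example for a recent pandas; the monthly figures are made up.

```python
import pandas as pd

# Hypothetical monthly figures stamped on the first day of each month.
monthly = pd.Series([100, 200, 300],
                    index=pd.date_range('2015-01-01', periods=3, freq='MS'))
daily_index = pd.date_range('2015-01-01', '2015-03-31', freq='D')

plain = monthly.reindex(daily_index)                       # gaps become NaN
filled_fwd = monthly.reindex(daily_index, method='ffill')  # carry the last value forward
filled_bwd = monthly.reindex(daily_index, method='bfill')  # pull the next value backward

mid_feb = pd.Timestamp('2015-02-15')
print(filled_fwd.loc[mid_feb], filled_bwd.loc[mid_feb])
```

With bfill, days after the last monthly stamp have no later value to borrow from and stay NaN, so the choice between the two depends on which date a company uses to represent the month.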
 

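What convenient operations does an index offer?

As described earlier, an index is ordered and may contain duplicates, yet to some extent it can still be accessed by key. In other words, a number of set operations are supported.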

In [47]:
 
 
 
 
 
index1 = pd.Index(['A','B','B','C','C'])
index2 = pd.Index(['C','D','E','E','F'])
index3 = pd.Index(['B','C','A'])
print index1.append(index2)
print index1.difference(index2)
print index1.intersection(index2)
print index1.union(index2) # Support unique-value Index well
print index1.isin(index2)
print index1.delete(2)
print index1.insert(0,'K') # Not suggested
print index3.drop('A') # Support unique-value Index well
print index1.is_monotonic,index2.is_monotonic,index3.is_monotonic
print index1.is_unique,index2.is_unique,index3.is_unique
 
 
 
Index([u'A', u'B', u'B', u'C', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')
Index([u'A', u'B'], dtype='object')
Index([u'C', u'C'], dtype='object')
Index([u'A', u'B', u'B', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')
[False False False  True  True]
Index([u'A', u'B', u'C', u'C'], dtype='object')
Index([u'K', u'A', u'B', u'B', u'C', u'C'], dtype='object')
Index([u'B', u'C'], dtype='object')
True True False
False False True
 

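2. Coming and going freely in the panda world: Pandas I/O

As always, starting from the basics, we care about how pandas exchanges data with the outside world.

2.1 Structured data input and output

  • read_csv and to_csv are a matching pair of input/output tools: read_csv returns a pandas.DataFrame directly, and to_csv writes the file as soon as it is called
    • read_table: similar functionality
    • read_fwf: reads fixed-width files
  • read_excel and to_excel make exchanging data with Excel convenient

Remember the example from the very beginning?

  • header: the row that holds the column names; write 0 if they are on row 0, in which case reading the data starts after that row, or None if there is no header row
  • names: the given list of column names is used as the final column names
  • encoding: the character encoding of the dataset; data files are usually stored as utf-8 to make transfer easier

Question: in the example below, if header=4 and names=cnames were used instead, what data would actually be read?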

In [48]:
 
 
 
 
 
print cnames
irisdata = pd.read_csv('S1EP3_Iris.txt',header = None, names = cnames,\
                       encoding='utf-8')
irisdata[::30]
 
 
 
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
Out[48]:
  sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
30 4.8 3.1 1.6 0.2 Iris-setosa
60 5.0 2.0 3.5 1.0 Iris-versicolor
90 5.5 2.6 4.4 1.2 Iris-versicolor
120 6.9 3.2 5.7 2.3 Iris-virginica
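
One way to explore the header question is to try both settings on a few lines held in memory. The sketch below (Python 3, recent pandas) uses the first seven iris lines from the listing earlier; io.StringIO stands in for the file, which is an assumption made to keep the example self-contained.

```python
import io
import pandas as pd

cnames = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
    "4.6,3.1,1.5,0.2,Iris-setosa\n"
    "5.0,3.6,1.4,0.2,Iris-setosa\n"
    "5.4,3.9,1.7,0.4,Iris-setosa\n"
    "4.6,3.4,1.4,0.3,Iris-setosa\n"
)

# header=None: every line is data and names supplies the column labels.
all_rows = pd.read_csv(io.StringIO(sample), header=None, names=cnames)

# header=4: line 4 is treated as the header row and its labels are replaced
# by names, so lines 0-4 never appear as data and reading starts at line 5.
after_header = pd.read_csv(io.StringIO(sample), header=4, names=cnames)

print(len(all_rows), len(after_header))
print(after_header['sepal_length'].iloc[0])
```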
 

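For the full list of parameters, see the API:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

Some commonly used parameters:

Reading:

  • skiprows: skip a given number of rows
  • nrows: read only a given number of rows
  • skipfooter: never read a fixed number of rows at the end
  • skip_blank_lines: skip blank lines

Content handling:

  • sep/delimiter: the separator matters; common ones are the comma, the space and Tab ('\t')
  • na_values: values that should be treated as missing (NA)
  • thousands: thousands separators in numbers are not uniform (both 1.234.567,89 and 1,234,567.89 occur), so converting such strings to numbers requires naming the thousands separator

Finishing touches:

  • index_col: use an actual column (given by position or even by name) as the index
  • squeeze: when only one column is read, return a pandas.Series instead of a pandas.DataFrame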
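
Several of these parameters can be combined in one call. The sketch below (Python 3, recent pandas) reads a small semicolon-separated text held in memory; the file content, city codes and numbers are all made up for illustration.

```python
import io
import pandas as pd

raw = (
    "exported by a hypothetical tool\n"   # junk line to skip
    "id;city;population;area\n"
    "1;A;1,234,567;100\n"
    "2;B;2,345,678;200\n"
    "3;C;-;300\n"                          # '-' marks a missing value
    "4;D;4,567,890;400\n"
)

df = pd.read_csv(io.StringIO(raw),
                 sep=';',          # field separator
                 skiprows=1,       # skip the junk line before the header
                 nrows=3,          # read only the first three data rows
                 na_values=['-'],  # treat '-' as missing
                 thousands=',',    # strip thousands separators from numbers
                 index_col='id')   # use the id column as the row index
print(df)
```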
 

2.1.x Excel ... ?

Highly regular data does not really need to be stored in Excel at all, although Pandas kindly provides an I/O interface for it.

In [49]:
 
 
 
 
 
irisdata.to_excel('S1EP3_irisdata.xls',index = None,encoding='utf-8')
irisdata_from_excel = pd.read_excel('S1EP3_irisdata.xls',header=0, encoding='utf-8')
irisdata_from_excel[::30]
 
 
Out[49]:
  sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
30 4.8 3.1 1.6 0.2 Iris-setosa
60 5.0 2.0 3.5 1.0 Iris-versicolor
90 5.5 2.6 4.4 1.2 Iris-versicolor
120 6.9 3.2 5.7 2.3 Iris-virginica
 

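The one important parameter: sheetname=k means that the k-th sheet of the Excel file is read (counting from 0). Later pandas releases spell this parameter sheet_name.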

 

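2.2 Semi-structured data

JSON: a data format commonly used for network transfer.

Look closely and it has the same character as the heterogeneous data we usually collect:

  • column names do not fully match
  • join keys may not be unique
  • metadata is stored inside the data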
In [50]:
 
 
 
 
 
json_data = [{'name':'Wang','sal':50000,'job':'VP'},\
             {'name':'Zhang','job':'Manager','report':'VP'},\
             {'name':'Li','sal':5000,'report':'Manager'}]
data_employee = pd.read_json(json.dumps(json_data))
data_employee_ri = data_employee.reindex(columns=['name','job','sal','report'])
data_employee_ri
 
 
Out[50]:
  name job sal report
0 Wang VP 50000 NaN
1 Zhang Manager NaN VP
2 Li NaN 5000 Manager
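
On recent pandas versions, passing a literal JSON string to read_json is discouraged, so a hedged alternative is to parse the text with the json module and hand the list of records to the DataFrame constructor. The sketch below (Python 3) reuses the records from the cell above and counts how many fields each record is missing.

```python
import json
import pandas as pd

json_text = json.dumps([
    {'name': 'Wang', 'sal': 50000, 'job': 'VP'},
    {'name': 'Zhang', 'job': 'Manager', 'report': 'VP'},
    {'name': 'Li', 'sal': 5000, 'report': 'Manager'},
])

records = json.loads(json_text)   # a list of dicts with differing keys
df = pd.DataFrame(records).reindex(columns=['name', 'job', 'sal', 'report'])

# Keys absent from a record show up as NaN in that row.
missing_per_row = df.isna().sum(axis=1)
print(df)
print(missing_per_row.tolist())
```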
 

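2.3 Database connection workflow (optional)

Use one of the following packages to create a Connection from the database configuration

  • pymysql
  • pyODBC
  • cx_Oracle

Use pandas.read_sql_query, read_sql_table and to_sql for database operations.

Python offers many ways to talk to a database; from a data analyst's point of view the pandas route is a good fit, and later lectures will cover it together with SQL syntax.

To connect to a database you first need a set of details like these: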

In [ ]:
 
 
 
 
 
IP = '127.0.0.1'
us = 'root'
pw = '123456'
 
 
 

For example, with MySQL:

In [ ]:
 
 
 
 
 
import pymysql
import pymysql.cursors
connection = pymysql.connect(host=IP,\
                             user=us,\
                             password=pw,\
                             charset='utf8mb4',\
                             cursorclass=pymysql.cursors.DictCursor)
#pd.read_sql_query("sql",connection)
#df.to_sql('tablename',connection,flavor='mysql')
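
Because the MySQL cell needs a running server, here is a self-contained sketch of the same round trip that swaps in the standard-library sqlite3 module and an in-memory database instead. It is written for Python 3 and a recent pandas; the table name and the salary figures are made up.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database; nothing is written to disk.
connection = sqlite3.connect(':memory:')

employees = pd.DataFrame({'name': ['Wang', 'Zhang', 'Li'],
                          'sal': [50000, 20000, 5000]})

# Write the DataFrame to a table named 'employee' (name chosen for this sketch).
employees.to_sql('employee', connection, index=False)

# Read back a filtered, ordered subset with plain SQL.
well_paid = pd.read_sql_query(
    "SELECT name, sal FROM employee WHERE sal > 10000 ORDER BY sal DESC",
    connection)
connection.close()
print(well_paid)
```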
 
 
 

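3. Diving deeper into Pandas data manipulation

Building on Part 1, there are more ways to manipulate data:

  • select data by column name and row index, combining ix and iloc to flexibly get a subset (already covered in Part 1)
  • concatenate records (like UNION ALL) or join tables
  • convenient mapping with custom functions
  • sorting
  • missing value handling
  • pivot tables as flexible as Excel's (covered in more detail in Part 4)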
 

3.1 Data integration: convenient and flexible

3.1.1 Row-wise stacking: directly with DataFrame

In [51]:
 
 
 
 
 
pd.DataFrame([np.random.rand(2),np.random.rand(2),np.random.rand(2)],columns=['C1','C2'])
 
 
Out[51]:
  C1 C2
0 0.958384 0.826066
1 0.607771 0.687302
2 0.943502 0.647464
 

3.1.2 Row-wise stacking: concat

In [52]:
 
 
 
 
 
pd.concat([data_employee_ri,data_employee_ri,data_employee_ri])
 
 
Out[52]:
  name job sal report
0 Wang VP 50000 NaN
1 Zhang Manager NaN VP
2 Li NaN 5000 Manager
0 Wang VP 50000 NaN
1 Zhang Manager NaN VP
2 Li NaN 5000 Manager
0 Wang VP 50000 NaN
1 Zhang Manager NaN VP
2 Li NaN 5000 Manager
In [53]:
 
 
 
 
 
pd.concat([data_employee_ri,data_employee_ri,data_employee_ri],ignore_index=True)
 
 
Out[53]:
  name job sal report
0 Wang VP 50000 NaN
1 Zhang Manager NaN VP
2 Li NaN 5000 Manager
3 Wang VP 50000 NaN
4 Zhang Manager NaN VP
5 Li NaN 5000 Manager
6 Wang VP 50000 NaN
7 Zhang Manager NaN VP
8 Li NaN 5000 Manager
 

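3.1.3 Column-wise joining: merge

Join on data columns with the on keyword

  • one or more columns can be specified
  • left_on and right_on can be used when the key columns have different names in the two tables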
In [54]:
 
 
 
 
 
pd.merge(data_employee_ri,data_employee_ri,on='name')
 
 
Out[54]:
  name job_x sal_x report_x job_y sal_y report_y
0 Wang VP 50000 NaN VP 50000 NaN
1 Zhang Manager NaN VP Manager NaN VP
2 Li NaN 5000 Manager NaN 5000 Manager
In [55]:
 
 
 
 
 
pd.merge(data_employee_ri,data_employee_ri,on=['name','job'])
 
 
Out[55]:
  name job sal_x report_x sal_y report_y
0 Wang VP 50000 NaN 50000 NaN
1 Zhang Manager NaN VP NaN VP
2 Li NaN 5000 Manager 5000 Manager
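
The cells above only use on. As a sketch of left_on and right_on (Python 3, recent pandas), the two made-up tables below hold the join key under different column names.

```python
import pandas as pd

staff = pd.DataFrame({'name': ['Wang', 'Zhang', 'Li'],
                      'job': ['VP', 'Manager', 'Engineer']})

# Hypothetical second table whose key column has a different name.
salaries = pd.DataFrame({'employee': ['Wang', 'Li', 'Zhao'],
                         'sal': [50000, 5000, 8000]})

# Default inner join: only keys present in both tables survive.
merged = pd.merge(staff, salaries, left_on='name', right_on='employee')
print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.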
 

To join on the index, use left_index and right_index directly

In [56]:
 
 
 
 
 
data_employee_ri.index.name = 'index1'
pd.merge(data_employee_ri,data_employee_ri,\
         left_index=True,right_index=True)  # these flags are booleans, not index names
 
 
Out[56]:
  name_x job_x sal_x report_x name_y job_y sal_y report_y
index1                
0 Wang VP 50000 NaN Wang VP 50000 NaN
1 Zhang Manager NaN VP Zhang Manager NaN VP
2 Li NaN 5000 Manager Li NaN 5000 Manager
 

TIPS: add the how keyword and set it to one of

  • how = 'inner'
  • how = 'left'
  • how = 'right'
  • how = 'outer'

Combined with how, merge essentially reproduces what SQL joins offer while keeping the code tidy.

In [57]:
 
 
 
 
 
DF31xA = pd.DataFrame({'name':[u'老王',u'老張',u'老李'],'sal':[5000,3000,1000]})
DF31xA
 
 
Out[57]:
  name sal
0 老王 5000
1 老張 3000
2 老李 1000
In [58]:
 
 
 
 
 
DF31xB = pd.DataFrame({'name':[u'老王',u'老劉'],'job':['VP','Manager']})
DF31xB
 
 
Out[58]:
  job name
0 VP 老王
1 Manager 老劉
 

how='left': keep all information from the left table

In [59]:
 
 
 
 
 
pd.merge(DF31xA,DF31xB,on='name',how='left')
 
 
Out[59]:
  name sal job
0 老王 5000 VP
1 老張 3000 NaN
2 老李 1000 NaN
 

how='right': keep all information from the right table

In [60]:
 
 
 
 
 
pd.merge(DF31xA,DF31xB,on='name',how='right')
 
 
Out[60]:
  name sal job
0 老王 5000 VP
1 老劉 NaN Manager
 

how='inner': keep only the intersection of the two tables, which avoids missing values as far as possible

In [61]:
 
 
 
 
 
pd.merge(DF31xA,DF31xB,on='name',how='inner')
 
 
Out[61]:
  name sal job
0 老王 5000 VP
 

how='outer': keep the union of the two tables, which introduces missing values but integrates the available information as fully as possible

In [62]:
 
 
 
 
 
pd.merge(DF31xA,DF31xB,on='name',how='outer')
 
 
Out[62]:
  name sal job
0 老王 5000 VP
1 老張 3000 NaN
2 老李 1000 NaN
3 老劉 NaN Manager
 

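3.2 The three musketeers of data cleaning

The next three functions, map, applymap and apply, are ones that most data analysts cannot avoid in the data cleaning step.

They answer the following questions respectively:

  • I want to make a new column from one column. How? (Series->Series)
  • I want to make a whole new table from a whole table. How? (DataFrame->DataFrame)
  • I want to make one new column from many columns. How? (DataFrame->Series)

Stop writing for loops! Change the way you think, and both coding and execution become more efficient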

In [63]:
 
 
 
 
 
dataNumPy32 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])
DF32 = pd.DataFrame(dataNumPy32,columns=['nation','capital','GDP'])
DF32
 
 
Out[63]:
  nation capital GDP
0 Japan Tokyo 4000
1 S.Korea Seoul 1300
2 China Beijing 9100
 

map: maps a column of data by one rule, that is, runs the same function on every element

In [64]:
 
 
 
 
 
def GDP_Factorize(v):
    fv = np.float64(v)
    if fv > 6000.0:
        return 'High'
    elif fv < 2000.0:
        return 'Low'
    else:
        return 'Medium'
DF32['GDP_Level'] = DF32['GDP'].map(GDP_Factorize)
DF32['NATION'] = DF32.nation.map(str.upper)
DF32
 
 
Out[64]:
  nation capital GDP GDP_Level NATION
0 Japan Tokyo 4000 Medium JAPAN
1 S.Korea Seoul 1300 Low S.KOREA
2 China Beijing 9100 High CHINA
 

applymap is similar: it operates on every element of a DataFrame, the way map does on a Series

In [65]:
 
 
 
 
 
DF32.applymap(lambda x: float(x)*2 if x.isdigit() else x.upper())
 
 
Out[65]:
  nation capital GDP GDP_Level NATION
0 JAPAN TOKYO 8000 MEDIUM JAPAN
1 S.KOREA SEOUL 2600 LOW S.KOREA
2 CHINA BEIJING 18200 HIGH CHINA
 

apply, in turn, operates on a DataFrame and yields a Series

It is a bit like agg, introduced later, but apply can work row by row or column by column, controlled by axis.

In [66]:
 
 
 
 
 
DF32.apply(lambda x:x['nation']+x['capital']+'_'+x['GDP'],axis=1)
 
 
Out[66]:
0      JapanTokyo_4000
1    S.KoreaSeoul_1300
2    ChinaBeijing_9100
dtype: object
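
To illustrate the axis switch, here is a small sketch (Python 3, recent pandas) on a made-up numeric table. With axis=0 the function receives each column, and with axis=1 it receives each row.

```python
import pandas as pd

scores = pd.DataFrame({'math': [90, 60, 75],
                       'physics': [80, 70, 95]},
                      index=['Wang', 'Zhang', 'Li'])

# axis=0 (the default): the function receives each column as a Series.
col_range = scores.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
row_total = scores.apply(lambda row: row['math'] + row['physics'], axis=1)

print(col_range.to_dict())
print(row_total.to_dict())
```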
 

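3.3 Sorting data

  • sort: sort rows by the values of one or more columns (later pandas releases replace it with sort_values)
  • sort_index: sort by the values in the index; axis decides whether rows or columns are reordered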
In [67]:
 
 
 
 
 
dataNumPy33 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])
DF33 = pd.DataFrame(dataNumPy33,columns=['nation','capital','GDP'])
DF33
 
 
Out[67]:
  nation capital GDP
0 Japan Tokyo 4000
1 S.Korea Seoul 1300
2 China Beijing 9100
In [68]:
 
 
 
 
 
DF33.sort(['capital','nation'])
 
 
Out[68]:
  nation capital GDP
2 China Beijing 9100
1 S.Korea Seoul 1300
0 Japan Tokyo 4000
In [69]:
 
 
 
 
 
DF33.sort('GDP',ascending=False)
 
 
Out[69]:
  nation capital GDP
2 China Beijing 9100
0 Japan Tokyo 4000
1 S.Korea Seoul 1300
In [70]:
 
 
 
 
 
DF33.sort('GDP').sort(ascending=False)
 
 
Out[70]:
  nation capital GDP
2 China Beijing 9100
1 S.Korea Seoul 1300
0 Japan Tokyo 4000
In [71]:
 
 
 
 
 
DF33.sort_index(axis=1,ascending=True)
 
 
Out[71]:
  GDP capital nation
0 4000 Tokyo Japan
1 1300 Seoul S.Korea
2 9100 Beijing China
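
DataFrame.sort was removed in later pandas releases. As a sketch for a recent version (Python 3), the same operations are written with sort_values and sort_index below. Note one assumption that differs from the cells above: GDP is stored as integers here, whereas np.asarray on mixed tuples stores every field as a string, so the earlier sort compared text rather than numbers.

```python
import pandas as pd

df = pd.DataFrame({'nation': ['Japan', 'S.Korea', 'China'],
                   'capital': ['Tokyo', 'Seoul', 'Beijing'],
                   'GDP': [4000, 1300, 9100]})   # integers, not strings

by_gdp_desc = df.sort_values('GDP', ascending=False)   # sort rows by one column
by_two_keys = df.sort_values(['capital', 'nation'])    # sort rows by several columns
cols_sorted = df.sort_index(axis=1)                    # reorder columns by label

print(by_gdp_desc['nation'].tolist())
print(by_two_keys.index.tolist())
print(list(cols_sorted.columns))
```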
 

A handy feature: rank

In [72]:
 
 
 
 
 
DF33
 
 
Out[72]:
  nation capital GDP
0 Japan Tokyo 4000
1 S.Korea Seoul 1300
2 China Beijing 9100
In [73]:
 
 
 
 
 
DF33.rank()
 
 
Out[73]:
  nation capital GDP
0 2 3 2
1 3 2 1
2 1 1 3
In [74]:
 
 
 
 
 
DF33.rank(ascending=False)
 
 
Out[74]:
  nation capital GDP
0 2 1 2
1 1 2 3
2 3 3 1
 

Mind how tied data (equal values) is handled:

  • method = 'average'
  • method = 'min'
  • method = 'max'
  • method = 'first'
In [75]:
 
 
 
 
 
DF33x = pd.DataFrame({'name':[u'老王',u'老張',u'老李',u'老劉'],'sal':np.array([5000,3000,5000,9000])})
DF33x
 
 
Out[75]:
  name sal
0 老王 5000
1 老張 3000
2 老李 5000
3 老劉 9000
 

DF33x.rank() uses method='average' by default: when two values are equal, both receive the average of the ranks they span

In [76]:
 
 
 
 
 
DF33x.sal.rank()
 
 
Out[76]:
0    2.5
1    1.0
2    2.5
3    4.0
Name: sal, dtype: float64
 

method='min': tied values all receive the lowest rank in the tie

In [77]:
 
 
 
 
 
DF33x.sal.rank(method='min')
 
 
Out[77]:
0    2
1    1
2    2
3    4
Name: sal, dtype: float64
 

method='max': tied values all receive the highest rank in the tie

In [78]:
 
 
 
 
 
DF33x.sal.rank(method='max')
 
 
Out[78]:
0    3
1    1
2    3
3    4
Name: sal, dtype: float64
 

method='first': among tied values, whichever appears first gets the smaller rank.

In [79]:
 
 
 
 
 
DF33x.sal.rank(method='first')
 
 
Out[79]:
0    2
1    1
2    3
3    4
Name: sal, dtype: float64
 

3.4 Handling missing data

In [80]:
 
 
 
 
 
DF34 = data_for_multi2.unstack()
DF34
 
 
Out[80]:
  Col_1 Col_2 Col_3 Col_4
Row_1 0 NaN NaN NaN
Row_2 1 2 NaN NaN
Row_3 3 4 5 NaN
Row_4 6 7 8 9
 

Ignoring missing values:

In [81]:
 
 
 
 
 
DF34.mean(skipna=True)
 
 
Out[81]:
Col_1    2.500000
Col_2    4.333333
Col_3    6.500000
Col_4    9.000000
dtype: float64
In [82]:
 
 
 
 
 
DF34.mean(skipna=False)
 
 
Out[82]:
Col_1    2.5
Col_2    NaN
Col_3    NaN
Col_4    NaN
dtype: float64
 

If you do not want to ignore missing values, it is time to bring out fillna:

In [83]:
 
 
 
 
 
DF34
 
 
Out[83]:
  Col_1 Col_2 Col_3 Col_4
Row_1 0 NaN NaN NaN
Row_2 1 2 NaN NaN
Row_3 3 4 5 NaN
Row_4 6 7 8 9
In [84]:
 
 
 
 
 
DF34.fillna(0).mean(axis=1,skipna=False)
 
 
Out[84]:
Row_1    0.00
Row_2    0.75
Row_3    3.00
Row_4    7.50
dtype: float64
 

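4. A "group" of pandas: Pandas groupby

groupby works like the SQL GROUP BY keyword:

Split-Apply-Combine

  • Split: divide the data into groups according to a rule
  • Apply: use an agg function that takes a pd.Series as input and returns a single value
  • Combine: collect the results

The flexibility of Pandas groupby:

  • grouping keys can come from the index or from actual data columns
  • the grouping rule can use one column or several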
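
The three kinds of grouping key listed above can be sketched as follows (Python 3, recent pandas). The sales table is made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({'region': ['East', 'East', 'West', 'West', 'West'],
                      'year': [2014, 2015, 2014, 2015, 2015],
                      'amount': [10, 20, 30, 40, 50]})

# Group by one column.
by_region = sales.groupby('region')['amount'].sum()

# Group by several columns; the result carries a MultiIndex.
by_region_year = sales.groupby(['region', 'year'])['amount'].sum()

# Group by an index level instead of a column.
indexed = sales.set_index('region')
by_index = indexed.groupby(level='region')['amount'].sum()

print(by_region.to_dict())
print(by_region_year)
```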
In [85]:
 
 
 
 
 
from IPython.display import Image
Image(filename="S1EP3_group.png")
 
 
Out[85]:
[Image: S1EP3_group.png]

How grouping actually works

In [86]:
 
 
 
 
 
irisdata_group = irisdata.groupby('class')
irisdata_group
 
 
Out[86]:
<pandas.core.groupby.DataFrameGroupBy object at 0x10a543b10>
In [87]:
 
 
 
 
 
for level,subsetDF in irisdata_group:
    print(level)            # the group key, here the class name
    print(subsetDF[::20])   # every 20th row of that group's sub-DataFrame
 
 
 
Iris-setosa
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
20           5.4          3.4           1.7          0.2  Iris-setosa
40           5.0          3.5           1.3          0.3  Iris-setosa
Iris-versicolor
    sepal_length  sepal_width  petal_length  petal_width            class
50           7.0          3.2           4.7          1.4  Iris-versicolor
70           5.9          3.2           4.8          1.8  Iris-versicolor
90           5.5          2.6           4.4          1.2  Iris-versicolor
Iris-virginica
     sepal_length  sepal_width  petal_length  petal_width           class
100           6.3          3.3           6.0          2.5  Iris-virginica
120           6.9          3.2           5.7          2.3  Iris-virginica
140           6.7          3.1           5.6          2.4  Iris-virginica
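
Besides iterating, a grouped object can be inspected directly. A minimal sketch on hypothetical toy data standing in for irisdata; `ngroups`, `size()` and `get_group()` are standard pandas GroupBy members:

```python
import pandas as pd

# Hypothetical toy data standing in for irisdata.
df = pd.DataFrame({'class': ['a', 'b', 'a', 'c', 'b', 'a'],
                   'value': [1, 2, 3, 4, 5, 6]})
grouped = df.groupby('class')

sizes = grouped.size()             # number of rows in each group
subset_a = grouped.get_group('a')  # one group's rows as a DataFrame

print(grouped.ngroups)
print(sizes)
print(subset_a)
```

The sub-DataFrame keeps the original row labels, which is why the iris output above shows indices 50, 70, 90 for the second class.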
 

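Grouping is a quick way to implement MapReduce-style logic

  • Map: name the column label to group by; rows with different values are sent to different groups
  • Reduce: take many values in and return one; this is usually done with agg, which accepts a function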
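
As a small illustration of the Reduce step, here is a sketch with a hypothetical reducer `value_range` (not part of pandas) applied to toy data. Any function that maps one group's pd.Series to a single value can be passed to agg in the same way:

```python
import pandas as pd

# Hypothetical toy data, not from the original notebook.
df = pd.DataFrame({'class': ['a', 'a', 'b', 'b', 'b'],
                   'x': [1.0, 4.0, 2.0, 2.5, 7.0]})

def value_range(s):
    """Reduce step: take one group's pd.Series, return one number."""
    return s.max() - s.min()

# Map: rows are routed by 'class'. Reduce: value_range runs once per group.
result = df.groupby('class')['x'].agg(value_range)
print(result)
```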
In [88]:
 
 
 
 
 
# Hand-written per-class skewness; a group with 2 or fewer rows gives None
irisdata.groupby('class').agg(\
    lambda x:((x-x.mean())**3).sum()*(len(x)-0.0)/\
                (len(x)-1.0)/(len(x)-2.0)/(x.std()*np.sqrt((len(x)-0.0)/(len(x)-1.0)))**3 if len(x)>2 else None)
 
 
Out[88]:
  sepal_length sepal_width petal_length petal_width
class        
Iris-setosa 0.116502 0.103857 0.069702 1.161506
Iris-versicolor 0.102232 -0.352014 -0.588404 -0.030249
Iris-virginica 0.114492 0.355026 0.533044 -0.125612
In [89]:
 
 
 
 
 
irisdata.groupby('class').agg(spstat.skew)
 
 
Out[89]:
  sepal_length sepal_width petal_length petal_width
class        
Iris-setosa 0.116454 0.103814 0.069673 1.161022
Iris-versicolor 0.102190 -0.351867 -0.588159 -0.030236
Iris-virginica 0.114445 0.354878 0.532822 -0.125560
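
Out[88] and Out[89] agree only to about three decimals because the two computations use different small-sample factors. The sketch below, on a hypothetical sample, rebuilds the biased skewness g1 = m3 / m2**1.5 with plain numpy (this is the default definition of scipy.stats.skew, bias=True) and compares it with the hand-written expression from In [88] and with the sample-size adjusted form returned by pandas' Series.skew():

```python
import numpy as np
import pandas as pd

# Hypothetical sample; any non-constant series with more than 2 values works.
x = pd.Series([2.0, 3.0, 5.0, 7.0, 11.0, 13.0])
n = len(x)
d = x - x.mean()

# Biased skewness: central moments divided by n.
g1 = (d**3).mean() / (d**2).mean()**1.5

# Sample-size adjusted skewness, the form Series.skew() returns.
G1 = np.sqrt(n * (n - 1.0)) / (n - 2.0) * g1

# The hand-written expression from In [88], term for term.
manual = (d**3).sum() * n / (n - 1.0) / (n - 2.0) \
         / (x.std() * np.sqrt(n / (n - 1.0)))**3

print(g1, G1, x.skew(), manual)
```

Working through the algebra, the hand-written expression equals g1 multiplied by (n-1)**2 / (n*(n-2)). For the 50 rows per iris class that factor is 2401/2400, which accounts for the difference in the fourth decimal place between the two outputs.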
 

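Broadcasting after aggregation

In OLAP databases, window functions of the form sum() over (partition by ...) were introduced to avoid the two-step group by + join.

In Pandas, transform does this job: it computes the aggregate per group and returns it aligned with the original rows.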

In [90]:
 
 
 
 
 
pd.concat([irisdata,irisdata.groupby('class').transform('mean')],axis=1)[::20]
 
 
Out[90]:
  sepal_length sepal_width petal_length petal_width class sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2 Iris-setosa 5.006 3.418 1.464 0.244
20 5.4 3.4 1.7 0.2 Iris-setosa 5.006 3.418 1.464 0.244
40 5.0 3.5 1.3 0.3 Iris-setosa 5.006 3.418 1.464 0.244
60 5.0 2.0 3.5 1.0 Iris-versicolor 5.936 2.770 4.260 1.326
80 5.5 2.4 3.8 1.1 Iris-versicolor 5.936 2.770 4.260 1.326
100 6.3 3.3 6.0 2.5 Iris-virginica 6.588 2.974 5.552 2.026
120 6.9 3.2 5.7 2.3 Iris-virginica 6.588 2.974 5.552 2.026
140 6.7 3.1 5.6 2.4 Iris-virginica 6.588 2.974 5.552 2.026
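
The window-function analogy can be made concrete. A minimal sketch on hypothetical toy data: transform('sum') plays the role of SUM(sal) OVER (PARTITION BY dept), so each row can be divided by its own group total without a separate join (the column names `dept_total` and `share` are illustrative):

```python
import pandas as pd

# Hypothetical toy data, not from the original notebook.
df = pd.DataFrame({'dept': ['a', 'a', 'b', 'b', 'b'],
                   'sal': [10.0, 30.0, 20.0, 20.0, 60.0]})

# One value per original row, repeated within each group.
df['dept_total'] = df.groupby('dept')['sal'].transform('sum')

# Each row's share of its own group total.
df['share'] = df['sal'] / df['dept_total']
print(df)
```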
 

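Pivot-table operations after producing a MultiIndex (grouping by several columns)

In general, a side effect of grouping by several columns is that after .groupby().agg() the row index has become a hierarchical index with one level per grouping column.

What should we do if we want the effect of an Excel pivot table, moving index levels freely between rows and columns to get the statistics we need?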

In [91]:
 
 
 
 
 
factor1 = np.random.randint(0,3,50)
factor2 = np.random.randint(0,2,50)
factor3 = np.random.randint(0,3,50)
values = np.random.randn(50)
 
 
In [92]:
 
 
 
 
 
hierindexDF = pd.DataFrame({'F1':factor1,'F2':factor2,'F3':factor3,'F4':values})
hierindexDF
 
 
Out[92]:
  F1 F2 F3 F4
0 1 0 1 -0.083928
1 0 0 0 0.949984
2 2 1 0 -0.037692
3 0 1 1 -0.518305
4 1 0 0 0.963678
5 1 1 1 -0.284463
6 0 0 1 0.412449
7 2 0 0 -0.277126
8 0 0 2 -2.488946
9 0 1 0 0.167413
10 0 1 1 1.520003
11 0 1 2 -0.665450
12 1 1 2 0.783906
13 1 1 1 -0.077717
14 2 0 2 -2.018863
15 0 1 1 -0.993884
16 2 0 2 -1.051174
17 0 0 2 -0.723457
18 1 1 0 -0.783521
19 1 0 2 0.355093
20 1 0 0 -0.252415
21 0 0 2 0.600679
22 2 1 2 -0.660498
23 0 0 2 -0.301613
24 0 1 2 -1.270339
25 0 0 0 0.074086
26 1 0 0 -0.043586
27 0 1 2 -2.000161
28 2 1 1 -1.041330
29 2 1 1 0.984101
30 2 0 0 0.160715
31 1 1 2 0.903035
32 1 1 1 -1.105763
33 1 0 0 0.850310
34 1 1 0 -0.062946
35 1 1 0 -0.507763
36 1 0 1 0.112303
37 2 1 2 -0.663782
38 2 1 1 1.147671
39 0 0 2 -1.552327
40 1 1 1 -1.166517
41 0 1 0 -0.622502
42 0 1 2 1.620961
43 1 1 0 -0.354238
44 1 0 2 -0.233783
45 0 0 0 0.131051
46 2 1 1 -1.301164
47 1 0 2 -1.013341
48 2 0 1 -0.508761
49 0 0 1 -1.104457
In [93]:
 
 
 
 
 
hierindexDF_gbsum = hierindexDF.groupby(['F1','F2','F3']).sum()
hierindexDF_gbsum
 
 
Out[93]:
                F4
F1 F2 F3
0  0  0   1.155121
      1  -0.692009
      2  -4.465664
   1  0  -0.455089
      1   0.007813
      2  -2.314989
1  0  0   1.517987
      1   0.028374
      2  -0.892030
   1  0  -1.708468
      1  -2.634460
      2   1.686941
2  0  0  -0.116412
      1  -0.508761
      2  -3.070037
   1  0  -0.037692
      1  -0.210722
      2  -1.324281
 

Inspect the index:

In [94]:
 
 
 
 
 
hierindexDF_gbsum.index
 
 
Out[94]:
MultiIndex(levels=[[0, 1, 2], [0, 1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[u'F1', u'F2', u'F3'])
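
The exact repr of a MultiIndex differs between pandas versions, so it can be more reliable to query the index through its attributes. A minimal sketch on hypothetical toy data that reuses the column names F1, F2 and F4:

```python
import pandas as pd

# Hypothetical toy data with the same column names as hierindexDF.
df = pd.DataFrame({'F1': [0, 0, 1, 1],
                   'F2': [0, 1, 0, 1],
                   'F4': [1.0, 2.0, 3.0, 4.0]})
summed = df.groupby(['F1', 'F2']).sum()

idx = summed.index
print(idx.nlevels)                        # number of index levels
print(list(idx.names))                    # level names
print(list(idx.get_level_values('F2')))   # one level's labels, row by row
print(summed.loc[(1, 0), 'F4'])           # select a row with a tuple of keys
```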
 

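unstack:

  • With no argument, it moves the innermost index level to the columns
  • With an integer argument, it moves the index level at that position to the columns
  • With a list argument, it moves the index levels at those positions to the columns, in the order given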
In [95]:
 
 
 
 
 
hierindexDF_gbsum.unstack()
 
 
Out[95]:
             F4
F3            0         1         2
F1 F2
0  0   1.155121 -0.692009 -4.465664
   1  -0.455089  0.007813 -2.314989
1  0   1.517987  0.028374 -0.892030
   1  -1.708468 -2.634460  1.686941
2  0  -0.116412 -0.508761 -3.070037
   1  -0.037692 -0.210722 -1.324281
In [96]:
 
 
 
 
 
hierindexDF_gbsum.unstack(0)
 
 
Out[96]:
             F4
F1            0         1         2
F2 F3
0  0   1.155121  1.517987 -0.116412
   1  -0.692009  0.028374 -0.508761
   2  -4.465664 -0.892030 -3.070037
1  0  -0.455089 -1.708468 -0.037692
   1   0.007813 -2.634460 -0.210722
   2  -2.314989  1.686941 -1.324281
In [97]:
 
 
 
 
 
hierindexDF_gbsum.unstack(1)
 
 
Out[97]:
             F4
F2            0         1
F1 F3
0  0   1.155121 -0.455089
   1  -0.692009  0.007813
   2  -4.465664 -2.314989
1  0   1.517987 -1.708468
   1   0.028374 -2.634460
   2  -0.892030  1.686941
2  0  -0.116412 -0.037692
   1  -0.508761 -0.210722
   2  -3.070037 -1.324281
In [98]:
 
 
 
 
 
hierindexDF_gbsum.unstack([2,0])
 
 
Out[98]:
          F4
F3         0         1         2         0         1         2         0         1         2
F1         0         0         0         1         1         1         2         2         2
F2
0   1.155121 -0.692009 -4.465664  1.517987  0.028374 -0.892030 -0.116412 -0.508761 -3.070037
1  -0.455089  0.007813 -2.314989 -1.708468 -2.634460  1.686941 -0.037692 -0.210722 -1.324281
 

Going one step further, stack is the counterpart of unstack: it moves levels of a multi-level column index back onto the row index.

In [99]:
 
 
 
 
 
hierindexDF_gbsum.unstack([2,0]).stack([1,2])
 
 
Out[99]:
                F4
F2 F3 F1
0  0  0   1.155121
      1   1.517987
      2  -0.116412
   1  0  -0.692009
      1   0.028374
      2  -0.508761
   2  0  -4.465664
      1  -0.892030
      2  -3.070037
1  0  0  -0.455089
      1  -1.708468
      2  -0.037692
   1  0   0.007813
      1  -2.634460
      2  -0.210722
   2  0  -2.314989
      1   1.686941
      2  -1.324281
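
The Excel-style pivot effect can also be reached in a single call. A minimal sketch, assuming small hand-made data rather than the random hierindexDF: pivot_table with aggfunc='sum' produces the same wide table as groupby().sum().unstack(), and stack() reverses the unstack. The names `summed`, `wide`, `pivot` and `back` are illustrative:

```python
import pandas as pd

# Hypothetical deterministic data; every (F1, F2) combination occurs.
df = pd.DataFrame({'F1': [0, 0, 0, 1, 1, 1],
                   'F2': [0, 1, 1, 0, 0, 1],
                   'F4': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

summed = df.groupby(['F1', 'F2'])['F4'].sum()  # Series with a 2-level index
wide = summed.unstack()                        # F2 moves to the columns
pivot = df.pivot_table(index='F1', columns='F2',
                       values='F4', aggfunc='sum')
back = wide.stack()                            # F2 moves back to the index

print(wide)
print(pivot)
print(back)
```

If some combination of keys were missing from the data, the wide table would contain NaN in that cell, and the round trip through stack() might not restore the original rows exactly.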

