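Let's Play with the Pandas!

Pandas is a Python library for data analysis: http://pandas.pydata.org

API quick reference: http://pandas.pydata.org/pandas-docs/stable/api.html

It builds on NumPy and SciPy and adds a large set of data manipulation features on top of them.

Statistics, grouping, sorting, and pivot tables convert freely into one another. If you already know relational databases (RDBMS) and Excel well, you will find that Pandas does everything they do, and more.

0. Hands-on: Why Pandas?

What does an ordinary programmer do when handed a data file?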

import codecs

r = requests.get("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

with codecs.open('S1EP3_Iris.txt','r',encoding='utf-8') as f:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa

This is where Pandas earns its keep:

Quickly recognizing structured data

import pandas as pd

Quickly working with metadata

cnames = ['sepal_length','sepal_width','petal_length','petal_width','class']

Quick filtering

#irisdata[irisdata['petal_width']==irisdata.petal_width.max()]

Quick slicing

#irisdata.iloc[::30,:2]

Quick statistics

#print irisdata['class'].value_counts()

SEPAL_LENGTH  Statistics:    7.9    4.3   5.84   0.83
SEPAL_WIDTH   Statistics:    4.4    2.0   3.05   0.43
PETAL_LENGTH  Statistics:    6.9    1.0   3.76   1.76
PETAL_WIDTH   Statistics:    2.5    0.1    1.2   0.76

Quick "MapReduce"

slogs = lambda x:sp.log(x)*x
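The quick operations listed in this section can be sketched on a few rows of the iris file. This is a minimal sketch, not the original notebook code: the rows are inlined through io.StringIO so that no download is needed, and the variable names widest, sliced, counts, and summary are illustrative.

```python
import io
import pandas as pd

# First rows of the iris data shown above, inlined instead of downloaded.
raw = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
"""
cnames = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# The file has no header row, so the column names are supplied explicitly.
irisdata = pd.read_csv(io.StringIO(raw), header=None, names=cnames)

# Filter: rows whose petal_width equals the column maximum.
widest = irisdata[irisdata['petal_width'] == irisdata['petal_width'].max()]
# Slice: every second row, first two columns.
sliced = irisdata.iloc[::2, :2]
# Statistics: frequency of each class and a few summary numbers.
counts = irisdata['class'].value_counts()
summary = irisdata['sepal_length'].agg(['max', 'min', 'mean', 'std'])
```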

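1. Welcome to the World of Pandas

The key Pandas data types

DataFrame (two-dimensional table)
Series (one-dimensional sequence)
Index (row index, row-level metadata)

1.1 Series: the spear of pandas (a column or a row of a table, an observation vector, a one-dimensional array, ...)

Every full observation of one individual, and every observation of one attribute across a group of individuals, can be abstracted as a Series.

Building a Series from values:

It consists of a default index and the values.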

Series1 = pd.Series(np.random.randn(4))

0   -0.909672
1    1.739425
2   -1.163028
3    0.408693
dtype: float64 <class 'pandas.core.series.Series'>
Int64Index([0, 1, 2, 3], dtype='int64')
[-0.90967166  1.73942495 -1.1630284   0.40869312]

Series supports filtering on the same principle as NumPy:

print Series1>0

0    False
1     True
2    False
3     True
dtype: bool
1    1.739425
3    0.408693
dtype: float64

Broadcasting is supported as well:

print Series1*2

0   -1.819343
1    3.478850
2   -2.326057
3    0.817386
dtype: float64
0    4.090328
1    6.739425
2    3.836972
3    5.408693
dtype: float64

As are universal functions:

print np.exp(Series1)

0    0.402656
1    5.694068
2    0.312538
3    1.504850
dtype: float64
0    24.06255
1    4811.913
2    14.49702
3    336.0924
dtype: object
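The three NumPy-style behaviours above (boolean filtering, broadcasting, universal functions) can be tried on a small Series. This sketch uses fixed values instead of np.random.randn so that the results are reproducible; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.0, 2.0, -3.0, 4.0])

mask = s > 0           # element-wise comparison gives a boolean Series
positives = s[mask]    # boolean indexing keeps only the rows marked True
doubled = s * 2        # the scalar is broadcast to every element
shifted = s + 5
exps = np.exp(s)       # a NumPy ufunc returns a Series with the same index
```

Note that the filtered Series keeps the original row labels (1 and 3 here) rather than renumbering from 0.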

Using row labels on the sequence itself, instead of building a two-column table, makes it easy to tell the data from the metadata:

Series2 = pd.Series(Series1.values,index=['norm_'+unicode(i) for i in xrange(4)])

norm_0   -0.909672
norm_1    1.739425
norm_2   -1.163028
norm_3    0.408693
dtype: float64 <class 'pandas.core.series.Series'>
Index([u'norm_0', u'norm_1', u'norm_2', u'norm_3'], dtype='object')
<class 'pandas.core.index.Index'>
[-0.90967166  1.73942495 -1.1630284   0.40869312]

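Although the rows are ordered, the data can still be accessed through the row-level index:

(It is not quite an ordered dict, though, because row labels may even repeat. Duplicate labels are discouraged, but they do work.)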

print Series2[['norm_0','norm_3']]

norm_0   -0.909672
norm_3    0.408693
dtype: float64

print 'norm_0' in Series2

True
False

The default row index works like line numbers:

print Series1.index

Int64Index([0, 1, 2, 3], dtype='int64')

When a Series is defined from a dict, or from an ordered dict with unique keys, there is no need to worry about duplicate row labels:

Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}

China      Beijing
Japan        Tokyo
S.Korea      Seoul
dtype: object
['Beijing' 'Tokyo' 'Seoul']
Index([u'China', u'Japan', u'S.Korea'], dtype='object')

Difference from dict, no. 1: ordered

Series4_IndexList = ["Japan","China","Singapore","S.Korea"]

Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
dtype: object
['Tokyo' 'Beijing' nan 'Seoul']
Index([u'Japan', u'China', u'Singapore', u'S.Korea'], dtype='object')
Japan        False
China        False
Singapore     True
S.Korea      False
dtype: bool
Japan         True
China         True
Singapore    False
S.Korea       True
dtype: bool
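The output above can be reproduced with a short sketch: a dict supplies the values, and an explicit index list fixes the order. A label that is missing from the dict ("Singapore") becomes NaN. The variable names here are illustrative.

```python
import pandas as pd

capitals = {"Japan": "Tokyo", "S.Korea": "Seoul", "China": "Beijing"}
order = ["Japan", "China", "Singapore", "S.Korea"]

# The index list decides the row order; unknown labels are filled with NaN.
s = pd.Series(capitals, index=order)

missing = s.isnull()    # True only where the value is NaN
present = s.notnull()   # the complement of isnull()
```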

Difference from dict, no. 2: index values may repeat, although this is not recommended.

Series5_IndexList = ['A','B','B','C']

A   -0.909672
B    1.739425
B   -1.163028
C    0.408693
dtype: float64
B    1.739425
B   -1.163028
A   -0.909672
dtype: float64

Metadata for the whole sequence: name

Once the series and its index have names, later joins become much more convenient.

print Series4_pdSeries.name

None
None

Series4_pdSeries.name = "Capital Series"

Nation
Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
Name: Capital Series, dtype: object

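1.2 DataFrame: the warhammer of pandas (data table, two-dimensional array)

An ordered collection of Series, as convenient as R's data.frame.

On reflection, the vast majority of data can be represented as a DataFrame.

Defined from a NumPy 2-D array, a file, or a database: the data is nice, but do not forget the column names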

dataNumPy = np.asarray([('Japan','Tokyo',4000),\

Equal-length columns stored in a dict (JSON): unfortunately, dict keys are unordered

dataDict = {'nation':['Japan','S.Korea','China'],\

Defining a DataFrame from another DataFrame: ah, the urge to tidy up kicks in!

DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])

DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])

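Taking a column out of a DataFrame? Two ways (exactly like JavaScript!)

The '.' form can clash with reserved names, such as existing attributes and methods
The '[ ]' form is the safest.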

print DF22.nation

2      China
0      Japan
1    S.Korea
Name: nation, dtype: object
2    Beijing
0      Tokyo
1      Seoul
Name: capital, dtype: object
2    9100
0    4900
1    1300
Name: GDP, dtype: int64

Taking a row out of a DataFrame? (At least) two ways:

print DF22[0:1] # this actually returns a DataFrame

  nation  capital   GDP
2  China  Beijing  9100
nation     Japan
capital    Tokyo
GDP         4900
Name: 0, dtype: object

The ultimate move, slicing just like NumPy: iloc

print DF22.iloc[0,:]

nation       China
capital    Beijing
GDP           9100
Name: 2, dtype: object
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object
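The column and row access patterns in this section can be compared side by side. This is a sketch under assumptions: the frame is rebuilt from the values visible in the outputs above, the row order [2, 0, 1] is produced with reindex, and label-based row access uses loc, which is one way to fetch the row labelled 0 in current pandas.

```python
import pandas as pd

df = pd.DataFrame(
    {'nation': ['Japan', 'S.Korea', 'China'],
     'capital': ['Tokyo', 'Seoul', 'Beijing'],
     'GDP': [4900, 1300, 9100]},
    columns=['nation', 'capital', 'GDP'],
)
df = df.reindex([2, 0, 1])        # reorder the rows by label

col_attr = df.nation              # attribute style
col_item = df['nation']           # bracket style, always safe
first_row_frame = df[0:1]         # positional slice, returns a DataFrame
row_by_label = df.loc[0]          # row labelled 0, returns a Series
row_by_position = df.iloc[0, :]   # first row by position, returns a Series
first_column = df.iloc[:, 0]      # first column by position
```

Because the rows were reordered, label 0 (Japan) and position 0 (China) refer to different rows, which is exactly the distinction between label-based and position-based access.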

So you come from ALTER TABLE hell? The panda smiles.

However, a column cannot be added dynamically with "."; only "[ ]" works

DF22['population'] = [1600,130,55]


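1.3 Index: the wild card of pandas data manipulation (row-level index)

A row-level index

is metadata
may be derived from real data, so it can also be treated as data
can be a multi-level index, that is, a combination of several columns
can be swapped with the column names, and can be stacked and unstacked, to get the effect of an Excel pivot table

There are four... no, many kinds of Index. Some important index types are

pd.Index (plain)
Int64Index (numeric index)
MultiIndex (multi-level index, covered in more detail under data manipulation)
DatetimeIndex (timestamps as the index)
PeriodIndex (time periods with a frequency as the index)

Defined directly, a plain index looks just like an ordinary Series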

index_names = ['a','b','c']

Index([u'a', u'b', u'c'], dtype='object')
Index([u'a', u'b', u'c'], dtype='object')

Unfortunately it is immutable. Remember that!

index_names = ['a','b','c']

['a' 'b' 'c']

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-f34da0a8623c> in <module>()
      2 index0 = pd.Index(index_names)
      3 print index0.get_values()
----> 4 index0[2] = 'd'

/Users/wangweiyang/anaconda/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in __setitem__(self, key, value)
   1055
   1056     def __setitem__(self, key, value):
-> 1057         raise TypeError("Indexes does not support mutable operations")
   1058
   1059     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations
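The immutability shown by the traceback can be checked without crashing the session. In this sketch the failed assignment is caught, and a modified index is built as a new object instead; the names mutated and index1 are illustrative.

```python
import pandas as pd

index0 = pd.Index(['a', 'b', 'c'])

try:
    index0[2] = 'd'        # in-place assignment is rejected
    mutated = True
except TypeError:
    mutated = False

# Build a new Index instead of changing the old one.
index1 = index0[:2].append(pd.Index(['d']))
```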

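Pass in a list of tuples and you get a MultiIndex

Unfortunately, if this list comprehension is changed to parentheses (making it a generator expression), it no longer works.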

#print [('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)]

MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

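For a Series, once it has a multi-level index, the data can transform!

The code below shows:

A Series with a two-level MultiIndex can be unstack()ed into a DataFrame
A DataFrame can be stack()ed into a Series with a two-level MultiIndex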

data_for_multi1 = pd.Series(xrange(0,16),index=multi1)

Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64

data_for_multi1.unstack()

data_for_multi1.unstack().stack()

Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64
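The round trip above can be sketched on a smaller, fully populated grid (2 rows by 3 columns instead of 4 by 4). The index is built with pd.MultiIndex.from_tuples here, which is an explicit alternative to passing the list of tuples to pd.Index.

```python
import pandas as pd

pairs = [('Row_' + str(x + 1), 'Col_' + str(y + 1))
         for x in range(2) for y in range(3)]
multi = pd.MultiIndex.from_tuples(pairs)
s = pd.Series(range(6), index=multi)

wide = s.unstack()    # the inner level becomes the columns: a 2 x 3 DataFrame
back = wide.stack()   # the columns return to the index: a Series again
```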

Now an example with unbalanced data:

Row_1,2,3,4 and Col_1,2,3,4 do not appear in every combination.

multi2 = pd.Index([('Row_'+str(x),'Col_'+str(y+1)) \

MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],
           labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])

data_for_multi2 = pd.Series(np.arange(10),index = multi2)

Row_1  Col_1    0
Row_2  Col_1    1
       Col_2    2
Row_3  Col_1    3
       Col_2    4
       Col_3    5
Row_4  Col_1    6
       Col_2    7
       Col_3    8
       Col_4    9
dtype: int64

data_for_multi2.unstack()

data_for_multi2.unstack().stack()

Row_1  Col_1    0
Row_2  Col_1    1
       Col_2    2
Row_3  Col_1    3
       Col_2    4
       Col_3    5
Row_4  Col_1    6
       Col_2    7
       Col_3    8
       Col_4    9
dtype: float64
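The dtype change from int64 to float64 in the round trip above comes from the NaN cells that unstack() has to create for the missing combinations. A reduced sketch (3 rows instead of 4) makes this visible. The explicit dropna() is an addition of this sketch, so that empty cells are removed regardless of how a given pandas version treats NaN in stack().

```python
import pandas as pd

# Row_1 has one column, Row_2 has two, Row_3 has three.
pairs = [('Row_' + str(x), 'Col_' + str(y + 1))
         for x in range(1, 4) for y in range(x)]
s = pd.Series(range(6), index=pd.MultiIndex.from_tuples(pairs))

wide = s.unstack()                    # missing combinations are filled with NaN
round_trip = wide.stack().dropna()    # empty cells removed, values are floats now
```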

The datetime standard library is this handy, you deserve it

dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]

DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None, tz=None)

If you need not only a uniform time format but also a uniform time frequency

periodindex1 = pd.period_range('2015-01','2015-04',freq='M')

PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='int64', freq='M')

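How to convert between monthly and daily precision?

Some companies use the 1st to stand for the whole month, others use the last day. Converting between the two by hand is tedious, and asfreq does it for you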

print periodindex1.asfreq('D',how='start')

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='int64', freq='D')
PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='int64', freq='D')

Finally, how do I actually align data recorded at the two frequencies?

periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01'], dtype='int64', freq='D')
PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
             '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
             '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
             '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',
             '2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',
             '2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',
             '2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',
             '2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01',
             '2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',
             '2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09',
             '2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13',
             '2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17',
             '2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21',
             '2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25',
             '2015-02-26', '2015-02-27', '2015-02-28', '2015-03-01',
             '2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',
             '2015-03-06', '2015-03-07', '2015-03-08', '2015-03-09',
             '2015-03-10', '2015-03-11', '2015-03-12', '2015-03-13',
             '2015-03-14', '2015-03-15', '2015-03-16', '2015-03-17',
             '2015-03-18', '2015-03-19', '2015-03-20', '2015-03-21',
             '2015-03-22', '2015-03-23', '2015-03-24', '2015-03-25',
             '2015-03-26', '2015-03-27', '2015-03-28', '2015-03-29',
             '2015-03-30', '2015-03-31'],
            dtype='int64', freq='D')

Coarse-grained data + reindex + ffill/bfill

#print pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day)

2015-01-01    2015-01-01
2015-01-02           NaN
2015-01-03           NaN
2015-01-04           NaN
2015-01-05           NaN
2015-01-06           NaN
2015-01-07           NaN
2015-01-08           NaN
2015-01-09           NaN
2015-01-10           NaN
2015-01-11           NaN
2015-01-12           NaN
2015-01-13           NaN
2015-01-14           NaN
2015-01-15           NaN
2015-01-16           NaN
2015-01-17           NaN
2015-01-18           NaN
2015-01-19           NaN
2015-01-20           NaN
2015-01-21           NaN
2015-01-22           NaN
2015-01-23           NaN
2015-01-24           NaN
2015-01-25           NaN
2015-01-26           NaN
2015-01-27           NaN
2015-01-28           NaN
2015-01-29           NaN
2015-01-30           NaN
                 ...    
2015-03-02           NaN
2015-03-03           NaN
2015-03-04           NaN
2015-03-05           NaN
2015-03-06           NaN
2015-03-07           NaN
2015-03-08           NaN
2015-03-09           NaN
2015-03-10           NaN
2015-03-11           NaN
2015-03-12           NaN
2015-03-13           NaN
2015-03-14           NaN
2015-03-15           NaN
2015-03-16           NaN
2015-03-17           NaN
2015-03-18           NaN
2015-03-19           NaN
2015-03-20           NaN
2015-03-21           NaN
2015-03-22           NaN
2015-03-23           NaN
2015-03-24           NaN
2015-03-25           NaN
2015-03-26           NaN
2015-03-27           NaN
2015-03-28           NaN
2015-03-29           NaN
2015-03-30           NaN
2015-03-31           NaN
Freq: D, dtype: object

full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')

2015-01-01    2015-01-01
2015-01-02    2015-01-01
2015-01-03    2015-01-01
2015-01-04    2015-01-01
2015-01-05    2015-01-01
2015-01-06    2015-01-01
2015-01-07    2015-01-01
2015-01-08    2015-01-01
2015-01-09    2015-01-01
2015-01-10    2015-01-01
2015-01-11    2015-01-01
2015-01-12    2015-01-01
2015-01-13    2015-01-01
2015-01-14    2015-01-01
2015-01-15    2015-01-01
2015-01-16    2015-01-01
2015-01-17    2015-01-01
2015-01-18    2015-01-01
2015-01-19    2015-01-01
2015-01-20    2015-01-01
2015-01-21    2015-01-01
2015-01-22    2015-01-01
2015-01-23    2015-01-01
2015-01-24    2015-01-01
2015-01-25    2015-01-01
2015-01-26    2015-01-01
2015-01-27    2015-01-01
2015-01-28    2015-01-01
2015-01-29    2015-01-01
2015-01-30    2015-01-01
                 ...    
2015-03-02    2015-03-01
2015-03-03    2015-03-01
2015-03-04    2015-03-01
2015-03-05    2015-03-01
2015-03-06    2015-03-01
2015-03-07    2015-03-01
2015-03-08    2015-03-01
2015-03-09    2015-03-01
2015-03-10    2015-03-01
2015-03-11    2015-03-01
2015-03-12    2015-03-01
2015-03-13    2015-03-01
2015-03-14    2015-03-01
2015-03-15    2015-03-01
2015-03-16    2015-03-01
2015-03-17    2015-03-01
2015-03-18    2015-03-01
2015-03-19    2015-03-01
2015-03-20    2015-03-01
2015-03-21    2015-03-01
2015-03-22    2015-03-01
2015-03-23    2015-03-01
2015-03-24    2015-03-01
2015-03-25    2015-03-01
2015-03-26    2015-03-01
2015-03-27    2015-03-01
2015-03-28    2015-03-01
2015-03-29    2015-03-01
2015-03-30    2015-03-01
2015-03-31    2015-03-01
Freq: D, dtype: object
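The recipe above (coarse-grained data, then reindex, then forward fill) can be sketched with numeric values, which makes the fill easier to see than period-valued data. The monthly values 100, 200, 300 are hypothetical, and the fill is written as reindex(...).ffill(), a two-step equivalent of passing method='ffill' to reindex.

```python
import pandas as pd

months = pd.period_range('2015-01', '2015-03', freq='M')
month_start = months.asfreq('D', how='start')   # first day of each month
month_end = months.asfreq('D', how='end')       # last day of each month
days = pd.period_range('2015-01-01', '2015-03-31', freq='D')

# Hypothetical monthly figures, recorded on the first day of each month.
monthly = pd.Series([100, 200, 300], index=month_start)

sparse = monthly.reindex(days)   # NaN on every day except the 1st of a month
daily = sparse.ffill()           # each value is carried forward to the next record
```

Using bfill() instead would pull each value backward, which suits data recorded on the last day of the month (month_end).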

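What convenient operations does an index offer?

As described earlier, an index is ordered and may contain duplicates, yet it can still be accessed by key to some extent, which means that certain set operations are supported.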

index1 = pd.Index(['A','B','B','C','C'])

Index([u'A', u'B', u'B', u'C', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')
Index([u'A', u'B'], dtype='object')
Index([u'C', u'C'], dtype='object')
Index([u'A', u'B', u'B', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')
[False False False  True  True]
Index([u'A', u'B', u'C', u'C'], dtype='object')
Index([u'K', u'A', u'B', u'B', u'C', u'C'], dtype='object')
Index([u'B', u'C'], dtype='object')
True True False
False False True
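A few of these index operations are sketched below. index1 is taken from the code line above; index2 is an assumption, inferred from the concatenated output. Only operations whose handling of duplicates is unambiguous are shown.

```python
import pandas as pd

index1 = pd.Index(['A', 'B', 'B', 'C', 'C'])
index2 = pd.Index(['C', 'D', 'E', 'E', 'F'])   # assumed second index

appended = index1.append(index2)        # plain concatenation, duplicates kept
only_in_1 = index1.difference(index2)   # labels of index1 absent from index2
membership = index1.isin(['C'])         # boolean mask, one entry per position
inserted = index1.insert(0, 'K')        # returns a new Index with 'K' in front
dropped = index1.drop(['A'])            # removes every occurrence of the label

ordered = index1.is_monotonic_increasing
unique = index1.is_unique
```

Every call returns a new object, which is consistent with the index being immutable.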

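2. Coming and Going in the Panda World: Pandas I/O

It is a well-worn topic, but starting from the basics we still care about how pandas exchanges data with the outside world.

2.1 Structured data input and output

read_csv and to_csv are a matching pair of I/O tools: read_csv returns a pandas.DataFrame directly, and to_csv writes the file as soon as it is called
- read_table: similar functionality
- read_fwf: handles fixed-width files
read_excel and to_excel make it easy to exchange data with Excel

Remember the example from the beginning?

header: which row holds the column names. If they are in row 0, write 0, and the data is read starting after that row; if there is no such row, write None
names: the given list is used as the final column names
encoding: the character encoding of the data set; data meant for file transfer is usually encoded as utf-8

Question: in the example below, with header=4 and names=cnames, what data will actually be read?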

print cnames

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
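How header and names interact is easiest to see on a tiny input. This sketch uses an inline four-line CSV (an assumption, not the iris file) and reads it three ways.

```python
import io
import pandas as pd

raw = "a,b\n1,2\n3,4\n5,6\n"

# Row 0 supplies the column names.
with_header = pd.read_csv(io.StringIO(raw), header=0)
# Row 0 is still consumed as the header, but its names are replaced.
renamed = pd.read_csv(io.StringIO(raw), header=0, names=['x', 'y'])
# No header row at all: row 0 is kept as data.
no_header = pd.read_csv(io.StringIO(raw), header=None, names=['x', 'y'])
```

By the same logic, header=4 together with names=cnames treats row 4 as the header row, discards it along with the rows before it, and labels the remaining rows with cnames.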

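For the full list of parameters, see the API documentation:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

Some commonly used parameters:

Reading:

skiprows: skip a number of rows
nrows: read only a number of rows
skipfooter: never read a fixed number of rows at the end
skip_blank_lines: skip blank lines

Content:

sep/delimiter: the separator matters; common ones are comma, space, and tab ('\t')
na_values: values that should be treated as NA
thousands: thousands separators in numbers are not uniform (both 1.234.567,89 and 1,234,567.89 occur), so the thousands separator has to be specified to turn such strings into numbers

Finishing:

index_col: use a real column (given by position, or even by name) as the index
squeeze: when only one column is read, return a pandas.Series instead of a pandas.DataFrame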
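Several of the parameters above can be combined in one call. The input below is a hypothetical export with a comment line, semicolon separators, and European number formatting. The decimal parameter is not in the list above, but it is a read_csv parameter and is needed here to pair with thousands.

```python
import io
import pandas as pd

# Hypothetical file content, inlined for the example.
raw = """# exported by some tool
id;amount;status
1;1.234,5;ok
2;missing;ok
3;2.000,0;failed
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',                 # field separator
    skiprows=1,              # skip the comment line at the top
    header=0,                # the next line holds the column names
    thousands='.',           # 1.234 means one thousand two hundred thirty-four
    decimal=',',             # decimal comma
    na_values=['missing'],   # treat this marker as NA
    index_col='id',          # use the id column as the row index
    nrows=3,                 # read at most three data rows
)
```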

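2.1.x Excel ... ?

For perfectly regular data there is really no need to store it in Excel, although Pandas kindly provides an I/O interface for it.

irisdata.to_excel('S1EP3_irisdata.xls',index = None,encoding='utf-8')

The only important parameter: sheetname=k, meaning that sheet number k of the Excel file is read (counting from 0).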

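2.2 Semi-structured data

JSON: a data format commonly used in network transfer.

Look closely and this is exactly what heterogeneous data collected in practice looks like:

Column names do not match completely
Join keys may not be unique
Metadata is stored inside the data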

json_data = [{'name':'Wang','sal':50000,'job':'VP'},\
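A list of JSON-like records with mismatched keys can be passed straight to pd.DataFrame. In this sketch the first record comes from the code line above; the other two are reconstructed from the employee table printed in this document's output, so treat them as an assumption.

```python
import pandas as pd

json_data = [
    {'name': 'Wang', 'sal': 50000, 'job': 'VP'},
    {'name': 'Zhang', 'job': 'Manager', 'report': 'VP'},
    {'name': 'Li', 'sal': 5000, 'report': 'Manager'},
]

# Every key that appears in any record becomes a column; gaps become NaN.
data_employee = pd.DataFrame(json_data)
```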

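2.3 Database connection workflow (optional)

Use one of the following packages to create a Connection from the database configuration

pymysql
pyODBC
cx_Oracle

Then work with the database through pandas.read_sql_query, read_sql_table, and to_sql.

There are many ways for Python to talk to a database. From a data analyst's point of view the pandas approach is a good fit, and later handouts will explain it together with SQL syntax.

To connect to a database you first need a set of information like this:

IP = '127.0.0.1'

For example, with MySQL: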

import pymysql
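The same read and write calls can be tried without a database server. This sketch swaps in the standard library's sqlite3 with an in-memory database in place of pymysql; the table name employee and its contents are hypothetical. With MySQL, the connection object would come from the driver instead.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, used here as a stand-in for a MySQL connection.
conn = sqlite3.connect(':memory:')

employees = pd.DataFrame({'name': ['Wang', 'Li'], 'sal': [50000, 5000]})
employees.to_sql('employee', conn, index=False)   # write the frame as a table

# Run a query and get the result back as a DataFrame.
high_paid = pd.read_sql_query(
    'SELECT name, sal FROM employee WHERE sal > 10000', conn)

conn.close()
```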

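3. Going Deeper into Pandas Data Manipulation

Building on part one, there are more ways to manipulate data:

Select data by column name and row index, and combine ix and iloc to take a subset flexibly (covered in part one)
Concatenate records (like UNION ALL) or join them (join)
Map custom functions conveniently
Sort
Handle missing values
Pivot tables as flexible as Excel's (covered in more detail in part four)

3.1 Data integration: convenient and flexible

3.1.1 Row-wise concatenation: build the DataFrame directly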

pd.DataFrame([np.random.rand(2),np.random.rand(2),np.random.rand(2)],columns=['C1','C2'])

3.1.2 Row-wise concatenation: concat

pd.concat([data_employee_ri,data_employee_ri,data_employee_ri])

pd.concat([data_employee_ri,data_employee_ri,data_employee_ri],ignore_index=True)
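The effect of ignore_index is easiest to see on the row labels. A small sketch with a hypothetical two-row frame:

```python
import pandas as pd

part = pd.DataFrame({'name': ['Wang', 'Li'], 'sal': [50000, 5000]})

# Rows are appended like UNION ALL; the original labels are kept and repeat.
stacked = pd.concat([part, part])
# With ignore_index=True the result gets a fresh 0..n-1 index.
renumbered = pd.concat([part, part], ignore_index=True)
```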

3.1.3 Column-wise joining: merge

Join on data columns with the on keyword

One or several columns can be given
left_on and right_on can be used

pd.merge(data_employee_ri,data_employee_ri,on='name')

pd.merge(data_employee_ri,data_employee_ri,on=['name','job'])

To join on the index, use left_index and right_index directly

data_employee_ri.index.name = 'index1'

TIPS: add the how keyword and set it to one of

how = 'inner'
how = 'left'
how = 'right'
how = 'outer'

With how, merge essentially reproduces what SQL joins offer while keeping the code tidy.

DF31xA = pd.DataFrame({'name':[u'老王',u'老張',u'老李'],'sal':[5000,3000,1000]})

DF31xB = pd.DataFrame({'name':[u'老王',u'老劉'],'job':['VP','Manager']})

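how='left': keep the information of the left table

pd.merge(DF31xA,DF31xB,on='name',how='left')

how='right': keep the information of the right table

pd.merge(DF31xA,DF31xB,on='name',how='right')

how='inner': keep only the intersection of the two tables, which avoids missing values as far as possible

pd.merge(DF31xA,DF31xB,on='name',how='inner')

how='outer': keep the union of the two tables, which introduces missing values but integrates as much of the available information as possible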

pd.merge(DF31xA,DF31xB,on='name',how='outer')
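The four join types can be compared by counting the rows each one returns. This sketch mirrors the two tables above, with ASCII stand-ins for the names (an assumption made for readability).

```python
import pandas as pd

left = pd.DataFrame({'name': ['Wang', 'Zhang', 'Li'], 'sal': [5000, 3000, 1000]})
right = pd.DataFrame({'name': ['Wang', 'Liu'], 'job': ['VP', 'Manager']})

# Number of result rows for each join type.
rows = {how: len(pd.merge(left, right, on='name', how=how))
        for how in ['left', 'right', 'inner', 'outer']}

outer = pd.merge(left, right, on='name', how='outer')
```

Only Wang appears in both tables, so the inner join has one row, while the outer join has four rows with NaN wherever one side has no match.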

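3.2 The three musketeers of data cleaning

The next three functions, map, applymap, and apply, are something almost every data analyst goes through in the data cleaning step.

They answer the following questions:

I want to build a new column from one column. How? (Series->Series)
I want to build a new table from a whole table. How? (DataFrame->DataFrame)
I want to build a new column from many columns. How? (DataFrame->Series)

Stop writing for loops! Change the way you think and improve both coding and execution efficiency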

dataNumPy32 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

map: apply the same rule to every value of a column, that is, process each value with the same function

def GDP_Factorize(v):

applymap does something similar: it applies an operation to every element of a DataFrame, the way map does for a column

DF32.applymap(lambda x: float(x)*2 if x.isdigit() else x.upper())

apply, in turn, can operate on a DataFrame and produce a Series

It is somewhat like agg, introduced later, but apply can work by row or by column, controlled with axis.

DF32.apply(lambda x:x['nation']+x['capital']+'_'+x['GDP'],axis=1)

0      JapanTokyo_4000
1    S.KoreaSeoul_1300
2    ChinaBeijing_9100
dtype: object
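The three mappings can be sketched on the same small table. The thresholds in gdp_level are hypothetical, chosen only to yield three levels. The element-wise step is written as apply over columns combined with map, which has the same effect as applymap and does not depend on which pandas version is installed.

```python
import pandas as pd

df = pd.DataFrame({'nation': ['Japan', 'S.Korea', 'China'],
                   'capital': ['Tokyo', 'Seoul', 'Beijing'],
                   'GDP': [4000, 1300, 9100]})

def gdp_level(v):
    # Hypothetical thresholds, for illustration only.
    if v > 8000:
        return 'High'
    if v > 2000:
        return 'Medium'
    return 'Low'

# Series -> Series: one function applied to every value of a column.
df['GDP_Level'] = df['GDP'].map(gdp_level)

# DataFrame -> DataFrame: the same function applied to every element.
upper = df[['nation', 'capital']].apply(lambda col: col.map(str.upper))

# DataFrame -> Series: one value computed from several columns of each row.
labels = df.apply(
    lambda row: row['nation'] + row['capital'] + '_' + str(row['GDP']), axis=1)
```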

3.3 Sorting

sort: sort rows by the values of one or more columns
sort_index: sort by the values in the index; axis decides whether rows or columns are reordered

dataNumPy33 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF33.sort(['capital','nation'])

DF33.sort('GDP',ascending=False)

DF33.sort('GDP').sort(ascending=False)

DF33.sort_index(axis=1,ascending=True)
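In recent pandas versions, sorting by column values is spelled sort_values rather than sort. A sketch of the same four operations under that assumption, on a frame rebuilt from the values above:

```python
import pandas as pd

df = pd.DataFrame({'nation': ['Japan', 'S.Korea', 'China'],
                   'capital': ['Tokyo', 'Seoul', 'Beijing'],
                   'GDP': [4000, 1300, 9100]},
                  columns=['nation', 'capital', 'GDP'])

by_name = df.sort_values(['capital', 'nation'])        # several sort keys
by_gdp_desc = df.sort_values('GDP', ascending=False)   # descending order
rows_desc = df.sort_index(ascending=False)             # sort by row labels
cols_sorted = df.sort_index(axis=1)                    # sort the column labels
```

All four calls return a new DataFrame and leave df unchanged.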

A handy feature: rank

DF33

DF33.rank()

DF33.rank(ascending=False)

Note how tied data (equal values) is handled:

method = 'average'
method = 'min'
method = 'max'
method = 'first'

DF33x = pd.DataFrame({'name':[u'老王',u'老張',u'老李',u'老劉'],'sal':np.array([5000,3000,5000,9000])})

DF33x.rank() uses method='average' by default: when two values are equal, both get the average of the ranks they occupy

DF33x.sal.rank()

0    2.5
1    1.0
2    2.5
3    4.0
Name: sal, dtype: float64

method='min': tied values all get the lowest of the ranks they occupy

DF33x.sal.rank(method='min')

0    2
1    1
2    2
3    4
Name: sal, dtype: float64

method='max': tied values all get the highest of the ranks they occupy

DF33x.sal.rank(method='max')

0    3
1    1
2    3
3    4
Name: sal, dtype: float64

method='first': among tied values, whichever appears first gets the lower rank.

DF33x.sal.rank(method='first')

0    2
1    1
2    3
3    4
Name: sal, dtype: float64
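The four tie-breaking methods can be collected in one pass for comparison. This sketch reuses the salary values from DF33x; the dict name ranks is illustrative.

```python
import pandas as pd

sal = pd.Series([5000, 3000, 5000, 9000])

# One list of ranks per tie-breaking method.
ranks = {m: sal.rank(method=m).tolist()
         for m in ['average', 'min', 'max', 'first']}
```

The two salaries of 5000 occupy ranks 2 and 3, so they get 2.5 under 'average', 2 under 'min', 3 under 'max', and 2 then 3 under 'first'.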

3.4 Handling missing data

DF34 = data_for_multi2.unstack()

Ignoring missing values:

DF34.mean(skipna=True)

Col_1    2.500000
Col_2    4.333333
Col_3    6.500000
Col_4    9.000000
dtype: float64

DF34.mean(skipna=False)

Col_1    2.5
Col_2    NaN
Col_3    NaN
Col_4    NaN
dtype: float64

If you do not want to ignore missing values, it is time to bring out fillna:

DF34

DF34.fillna(0).mean(axis=1,skipna=False)

Row_1    0.00
Row_2    0.75
Row_3    3.00
Row_4    7.50
dtype: float64
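The difference between skipping NaN and filling it first can be checked on a small frame. The values below are hypothetical and smaller than DF34, so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col_1': [0.0, 1.0, 3.0],
                   'Col_2': [np.nan, 2.0, 4.0]},
                  index=['Row_1', 'Row_2', 'Row_3'])

skip = df.mean(skipna=True)     # NaN is left out of both sum and count
keep = df.mean(skipna=False)    # a single NaN makes the whole column mean NaN
filled = df.fillna(0).mean(axis=1, skipna=False)   # NaN counted as 0, per row
```

Filling with 0 changes the denominator: Row_1 averages 0.0 over two cells instead of one.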

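4. A "Group" of Pandas: groupby in Pandas

groupby works like the GROUP BY keyword in SQL:

Split-Apply-Combine

Split: divide the data into groups according to a rule
Apply: use an aggregation function that takes a pd.Series and returns a single value
Combine: collect the results

Flexibility of groupby in Pandas:

The grouping key can come from the index or from real column data
The grouping rule can use one or several columns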

from IPython.display import Image

How grouping actually works

irisdata_group = irisdata.groupby('class')

<pandas.core.groupby.DataFrameGroupBy object at 0x10a543b10>

for level,subsetDF in irisdata_group:

Iris-setosa
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
20           5.4          3.4           1.7          0.2  Iris-setosa
40           5.0          3.5           1.3          0.3  Iris-setosa
Iris-versicolor
    sepal_length  sepal_width  petal_length  petal_width            class
50           7.0          3.2           4.7          1.4  Iris-versicolor
70           5.9          3.2           4.8          1.8  Iris-versicolor
90           5.5          2.6           4.4          1.2  Iris-versicolor
Iris-virginica
     sepal_length  sepal_width  petal_length  petal_width           class
100           6.3          3.3           6.0          2.5  Iris-virginica
120           6.9          3.2           5.7          2.3  Iris-virginica
140           6.7          3.1           5.6          2.4  Iris-virginica

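Grouping is a quick way to implement MapReduce logic

Map: specify the column label to group by; rows with different values are sent to different groups
Reduce: take many values and return one, usually done with agg, which accepts a function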

irisdata.groupby('class').agg(\

irisdata.groupby('class').agg(spstat.skew)
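Split, apply, and combine can each be observed on a toy table. The column names class and x and the values are hypothetical; agg is shown both with built-in function names and with a custom function.

```python
import pandas as pd

df = pd.DataFrame({'class': ['a', 'a', 'b', 'b', 'b'],
                   'x': [1.0, 3.0, 2.0, 4.0, 6.0]})

# Split: iterating over the groupby object yields (key, sub-frame) pairs.
sizes = {level: len(subset) for level, subset in df.groupby('class')}

# Apply + combine: named aggregations, one result row per group.
stats = df.groupby('class')['x'].agg(['mean', 'max'])

# agg also accepts any function that reduces a Series to one value.
spread = df.groupby('class')['x'].agg(lambda s: s.max() - s.min())
```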

Broadcasting after aggregation

In OLAP databases, the window operation sum() over (partition by ...) was introduced to avoid the two-step groupby + join.

In Pandas, transform serves the same purpose.

pd.concat([irisdata,irisdata.groupby('class').transform('mean')],axis=1)[::20]
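Unlike agg, transform returns one value per original row, so the group statistic can be stored next to the raw data directly. A sketch on a hypothetical toy table:

```python
import pandas as pd

df = pd.DataFrame({'class': ['a', 'a', 'b', 'b', 'b'],
                   'x': [1.0, 3.0, 2.0, 4.0, 6.0]})

# Roughly the pandas counterpart of AVG(x) OVER (PARTITION BY class).
df['x_class_mean'] = df.groupby('class')['x'].transform('mean')

# The broadcast statistic can be used row by row right away.
df['x_centered'] = df['x'] - df['x_class_mean']
```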

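Pivot-table operations after a MultiIndex is produced (grouping by several columns)

A common side effect of grouping by several columns is that after .groupby().agg() the row index has become a hierarchical index over the grouping columns.

If we want the effect of an Excel pivot table, freely swapping row and column indexes for statistical purposes, what should we do?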

factor1 = np.random.randint(0,3,50)

hierindexDF = pd.DataFrame({'F1':factor1,'F2':factor2,'F3':factor3,'F4':values})

hierindexDF_gbsum = hierindexDF.groupby(['F1','F2','F3']).sum()

Look at the index:

hierindexDF_gbsum.index

MultiIndex(levels=[[0, 1, 2], [0, 1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[u'F1', u'F2', u'F3'])

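unstack:

With no argument, moves the innermost index level to the columns
With an integer argument, moves the index level at that position to the columns
With a list argument, moves the index levels at those positions to the columns in turn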

hierindexDF_gbsum.unstack()

hierindexDF_gbsum.unstack(0)

hierindexDF_gbsum.unstack(1)

hierindexDF_gbsum.unstack([2,0])

Going further, stack is the counterpart of unstack: it moves column levels back into the index

hierindexDF_gbsum.unstack([2,0]).stack([1,2])
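The pivoting steps can be followed on a table small enough to verify by hand. This sketch uses two grouping factors instead of three and fixed values instead of random ones; both are simplifications of the example above.

```python
import pandas as pd

df = pd.DataFrame({'F1': [0, 0, 0, 1, 1, 1],
                   'F2': [0, 1, 1, 0, 0, 1],
                   'F4': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

gb = df.groupby(['F1', 'F2']).sum()   # rows indexed by the (F1, F2) pairs

last = gb.unstack()      # innermost level (F2) moves to the columns
first = gb.unstack(0)    # level at position 0 (F1) moves to the columns
back = last.stack()      # F2 moves from the columns back into the index
```

After unstack() the columns form a two-level index, so a single cell is addressed with a tuple such as ('F4', 1).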

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	49.878745	49.695242	49.654909	45.810069
Iris-versicolor	49.815081	49.680665	49.694505	49.452305
Iris-virginica	49.772059	49.714500	49.761700	49.545918

	C1	C2
0	0.958384	0.826066
1	0.607771	0.687302
2	0.943502	0.647464

	nation	capital	GDP	GDP_Level	NATION
0	JAPAN	TOKYO	8000	MEDIUM	JAPAN
1	S.KOREA	SEOUL	2600	LOW	S.KOREA
2	CHINA	BEIJING	18200	HIGH	CHINA

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	0.116502	0.103857	0.069702	1.161506
Iris-versicolor	0.102232	-0.352014	-0.588404	-0.030249
Iris-virginica	0.114492	0.355026	0.533044	-0.125612

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	0.116454	0.103814	0.069673	1.161022
Iris-versicolor	0.102190	-0.351867	-0.588159	-0.030236
Iris-virginica	0.114445	0.354878	0.532822	-0.125560

	nation	capital	GDP	population	region
2	China	Beijing	9100	1600	East_Asian
0	Japan	Tokyo	4900	130	East_Asian
1	S.Korea	Seoul	1300	55	East_Asian

	name	job	sal	report
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager

	name_x	job_x	sal_x	report_x	name_y	job_y	sal_y	report_y
index1
0	Wang	VP	50000	NaN	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager	Li	NaN	5000	Manager

	nation	capital	GDP	GDP_Level	NATION
0	Japan	Tokyo	4000	Medium	JAPAN
1	S.Korea	Seoul	1300	Low	S.KOREA
2	China	Beijing	9100	High	CHINA

	sepal_length	sepal_width	petal_length	petal_width	class	sepal_length	sepal_width	petal_length	petal_width
0	5.1	3.5	1.4	0.2	Iris-setosa	5.006	3.418	1.464	0.244
20	5.4	3.4	1.7	0.2	Iris-setosa	5.006	3.418	1.464	0.244
40	5.0	3.5	1.3	0.3	Iris-setosa	5.006	3.418	1.464	0.244
60	5.0	2.0	3.5	1.0	Iris-versicolor	5.936	2.770	4.260	1.326
80	5.5	2.4	3.8	1.1	Iris-versicolor	5.936	2.770	4.260	1.326
100	6.3	3.3	6.0	2.5	Iris-virginica	6.588	2.974	5.552	2.026
120	6.9	3.2	5.7	2.3	Iris-virginica	6.588	2.974	5.552	2.026
140	6.7	3.1	5.6	2.4	Iris-virginica	6.588	2.974	5.552	2.026

	F1	F2	F3	F4
0	1	0	1	-0.083928
1	0	0	0	0.949984
2	2	1	0	-0.037692
3	0	1	1	-0.518305
4	1	0	0	0.963678
5	1	1	1	-0.284463
6	0	0	1	0.412449
7	2	0	0	-0.277126
8	0	0	2	-2.488946
9	0	1	0	0.167413
10	0	1	1	1.520003
11	0	1	2	-0.665450
12	1	1	2	0.783906
13	1	1	1	-0.077717
14	2	0	2	-2.018863
15	0	1	1	-0.993884
16	2	0	2	-1.051174
17	0	0	2	-0.723457
18	1	1	0	-0.783521
19	1	0	2	0.355093
20	1	0	0	-0.252415
21	0	0	2	0.600679
22	2	1	2	-0.660498
23	0	0	2	-0.301613
24	0	1	2	-1.270339
25	0	0	0	0.074086
26	1	0	0	-0.043586
27	0	1	2	-2.000161
28	2	1	1	-1.041330
29	2	1	1	0.984101
30	2	0	0	0.160715
31	1	1	2	0.903035
32	1	1	1	-1.105763
33	1	0	0	0.850310
34	1	1	0	-0.062946
35	1	1	0	-0.507763
36	1	0	1	0.112303
37	2	1	2	-0.663782
38	2	1	1	1.147671
39	0	0	2	-1.552327
40	1	1	1	-1.166517
41	0	1	0	-0.622502
42	0	1	2	1.620961
43	1	1	0	-0.354238
44	1	0	2	-0.233783
45	0	0	0	0.131051
46	2	1	1	-1.301164
47	1	0	2	-1.013341
48	2	0	1	-0.508761
49	0	0	1	-1.104457

			F4
F1	F2	F3
0	0	0	1.155121
		1	-0.692009
		2	-4.465664
	1	0	-0.455089
		1	0.007813
		2	-2.314989
1	0	0	1.517987
		1	0.028374
		2	-0.892030
	1	0	-1.708468
		1	-2.634460
		2	1.686941
2	0	0	-0.116412
		1	-0.508761
		2	-3.070037
	1	0	-0.037692
		1	-0.210722
		2	-1.324281

	name	sal
0	老王	5000
1	老張	3000
2	老李	1000

	name	sal
0	老王	5000
1	老張	3000
2	老李	5000
3	老劉	9000
