Python pandas 0.19.1 Indexing and Selecting Data: Documentation Translation

Reposted article, 2016-11-29 20:59. Tags: pandas, Python, data analysis, numpy, indexing

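I have recently been writing a paper on personalized recommendation and often use Python to process data. The data selection and indexing rules in pandas and numpy kept confusing me, so I translated this official document for my own reference and study. The translation is bound to be imperfect in places, but it should be broadly readable. Corrections are welcome.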

Original document: Python pandas 0.19.1 Indexing and Selecting Data, http://pandas.pydata.org/pandas-docs/stable/indexing.html

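Indexing and Selecting Data

The axis labeling information in pandas objects serves many purposes:

Identifies data (i.e. provides metadata) using known indicators, which is important for analysis, visualization, and interactive console display
Enables automatic and explicit data alignment
Allows intuitive getting and setting of subsets of the data set

In this section we focus on the final point: how to slice, dice, and generally get and set subsets of pandas objects. The primary focus is on Series and DataFrame, as they have received the most development attention in this area. Expect more work to be invested in higher-dimensional data structures (including Panel) in the future, especially in label-based advanced indexing.

Note: The Python and NumPy indexing operator [ ] and attribute operator . provide quick and easy access to pandas data structures. If you already know how to work with Python dictionaries and NumPy arrays, there is little new to learn. However, since the type of the data to be accessed is not known in advance, directly using the standard operators has some optimization limits. For production code, we recommend taking advantage of the optimized pandas data access methods shown in this article.

Warning: Whether a setting operation returns a copy or a reference may depend on the context. This is sometimes called "chained assignment" and should be avoided.

Warning: In version 0.15.0, Index was refactored to no longer subclass ndarray and instead subclass PandasObject, like the rest of the pandas objects. This change has very limited impact.

Different Choices for Indexing

Object selection has had a number of user-requested additions in order to support more explicit location-based indexing. pandas now supports three types of multi-axis indexing.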

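.loc is primarily label based, but may also be used with a boolean array. .loc raises KeyError when an item is not found. Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index)
- A list or array of labels, e.g. ['a', 'b', 'c']
- A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included!)
- A boolean array
- A callable function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing (one of the above)
.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc raises IndexError if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing. Allowed inputs are:
- An integer, e.g. 5
- A list or array of integers, e.g. [3, 0, 4]
- A slice object with ints, e.g. 1:7
- A boolean array
- A callable function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing (one of the above)
.ix supports mixed integer and label based access. It is primarily label based, but falls back to integer positional access unless the corresponding axis is of integer type. .ix is the most general and supports any of the inputs of .loc and .iloc. .ix also supports floating point labels. .ix is especially useful when dealing with mixed positional and label based hierarchical indexes.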

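However, when an axis is integer based, only label based access is supported, not positional access. In such cases it is usually better to be explicit and use .iloc or .loc.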
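
A minimal sketch of the label versus position distinction on an integer-labelled axis. The Series and its values are made up for illustration, and only .loc and .iloc are used because .ix was removed in later pandas releases.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions.
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

by_label = s.loc[0]        # label 0 belongs to the last element
by_position = s.iloc[0]    # position 0 is the first element

label_slice = s.loc[3:1]   # label slice: both endpoints are included
pos_slice = s.iloc[0:2]    # positional slice: the stop is excluded

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```

Spelling out .loc or .iloc keeps the intent visible to the reader even when the labels happen to be integers.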

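.loc, .iloc, .ix and [ ] indexing can also accept a callable as the indexer.

Getting values from an object with multi-axis selection uses the following notation (using .loc as an example, but the same applies to .iloc and .ix). Any of the axis accessors may be the null slice :. Axes left out of the specification are assumed to be :. (For example, p.loc['a'] is equivalent to p.loc['a', :, :].)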

Object Type	Indexers
Series	`s.loc[indexer]`
DataFrame	`df.loc[row_indexer,column_indexer]`
Panel	`p.loc[item_indexer,major_indexer,minor_indexer]`

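Basics

As mentioned when introducing the data structures in the previous section, the primary function of indexing with [ ] (i.e. __getitem__ in Python) is selecting out lower-dimensional slices. Thus,

Object Type	Selection	Return Value Type
Series	`series[label]`	scalar value
DataFrame	`frame[colname]`	`Series` corresponding to colname
Panel	`panel[itemname]`	`DataFrame` corresponding to itemname

Here we construct a simple time series data set to illustrate the indexing functionality: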

In [1]: dates = pd.date_range('1/1/2000', periods=8)

In [2]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [3]: df
Out[3]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In [4]: panel = pd.Panel({'one' : df, 'two' : df - df.mean()})

In [5]: panel
Out[5]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 8 (major_axis) x 4 (minor_axis)
Items axis: one to two
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-08 00:00:00
Minor_axis axis: A to D

Note: Unless stated otherwise, none of the indexing functionality is specific to time series.

Thus, as described above, the most basic indexing uses [ ]:

In [6]: s = df['A']

In [7]: s[dates[5]]
Out[7]: -0.67368970808837059

In [8]: panel['two']
Out[8]: 
                   A         B         C         D
2000-01-01  0.409571  0.113086 -0.610826 -0.936507
2000-01-02  1.152571  0.222735  1.017442 -0.845111
2000-01-03 -0.921390 -1.708620  0.403304  1.270929
2000-01-04  0.662014 -0.310822 -0.141342  0.470985
2000-01-05 -0.484513  0.962970  1.174465 -0.888276
2000-01-06 -0.733231  0.509598 -0.580194  0.724113
2000-01-07  0.345164  0.972995 -0.816769 -0.840143
2000-01-08 -0.430188 -0.761943 -0.446079  1.044010

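You can pass a list of columns to [ ] to select columns in that order. If a column is not contained in the DataFrame, an exception is raised. Multiple columns can also be set in this manner.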

In [9]: df
Out[9]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In [10]: df[['B', 'A']] = df[['A', 'B']]  # swap the values of the two columns

In [11]: df
Out[11]: 
                   A         B         C         D
2000-01-01 -0.282863  0.469112 -1.509059 -1.135632
2000-01-02 -0.173215  1.212112  0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929  1.071804
2000-01-04 -0.706771  0.721555 -1.039575  0.271860
2000-01-05  0.567020 -0.424972  0.276232 -1.087401
2000-01-06  0.113648 -0.673690 -1.478427  0.524988
2000-01-07  0.577046  0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312  0.844885

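You may find this useful for applying a transform in-place to a subset of the columns.

Warning: pandas aligns all axes when setting Series and DataFrame from .loc, .iloc and .ix.

The following does not modify df, because the column alignment happens before the value assignment.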

In [12]: df[['A', 'B']]
Out[12]: 
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

In [13]: df.loc[:,['B', 'A']] = df[['A', 'B']]  # this does not swap the values of columns A and B

In [14]: df[['A', 'B']]
Out[14]: 
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

The correct way is to use the raw values:

In [15]: df.loc[:,['B', 'A']] = df[['A', 'B']].values

In [16]: df[['A', 'B']]
Out[16]: 
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02  1.212112 -0.173215
2000-01-03 -0.861849 -2.104569
2000-01-04  0.721555 -0.706771
2000-01-05 -0.424972  0.567020
2000-01-06 -0.673690  0.113648
2000-01-07  0.404705  0.577046
2000-01-08 -0.370647 -1.157892
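
The alignment rule above can be checked on a tiny frame. This is a sketch with made-up data and variable names; it assumes the alignment behavior described in this section.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Assigning a DataFrame: columns are matched by label first,
# so A receives A and B receives B, and nothing is swapped.
aligned = df.copy()
aligned.loc[:, ['B', 'A']] = aligned[['A', 'B']]

# Assigning the raw array: no labels to align on,
# so values are placed by position and the columns swap.
swapped = df.copy()
swapped.loc[:, ['B', 'A']] = swapped[['A', 'B']].values

print(aligned)
print(swapped)
```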

Attribute Access

You may access an index label on a Series, a column on a DataFrame, and an item on a Panel directly as an attribute.

In [17]: sa = pd.Series([1,2,3],index=list('abc'))

In [18]: dfa = df.copy()

In [19]: sa.b  # access a Series element via its index label as an attribute
Out[19]: 2

In [20]: dfa.A  # access a DataFrame column as an attribute
Out[20]: 
2000-01-01    0.469112
2000-01-02    1.212112
2000-01-03   -0.861849
2000-01-04    0.721555
2000-01-05   -0.424972
2000-01-06   -0.673690
2000-01-07    0.404705
2000-01-08   -0.370647
Freq: D, Name: A, dtype: float64

In [21]: panel.one  # access a Panel item as an attribute
Out[21]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

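You can use attribute access to modify an existing element of a Series or an existing column of a DataFrame, but be careful: if you try to use attribute access to create a new column, it creates a new attribute rather than a new column.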

In [22]: sa.a = 5

In [23]: sa
Out[23]: 
a    5
b    2
c    3
dtype: int64

In [24]: dfa.A = list(range(len(dfa.index)))  # ok if column A already exists

In [25]: dfa
Out[25]: 
            A         B         C         D
2000-01-01  0 -0.282863 -1.509059 -1.135632
2000-01-02  1 -0.173215  0.119209 -1.044236
2000-01-03  2 -2.104569 -0.494929  1.071804
2000-01-04  3 -0.706771 -1.039575  0.271860
2000-01-05  4  0.567020  0.276232 -1.087401
2000-01-06  5  0.113648 -1.478427  0.524988
2000-01-07  6  0.577046 -1.715002 -1.039268
2000-01-08  7 -1.157892 -1.344312  0.844885

In [26]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column

In [27]: dfa
Out[27]: 
            A         B         C         D
2000-01-01  0 -0.282863 -1.509059 -1.135632
2000-01-02  1 -0.173215  0.119209 -1.044236
2000-01-03  2 -2.104569 -0.494929  1.071804
2000-01-04  3 -0.706771 -1.039575  0.271860
2000-01-05  4  0.567020  0.276232 -1.087401
2000-01-06  5  0.113648 -1.478427  0.524988
2000-01-07  6  0.577046 -1.715002 -1.039268
2000-01-08  7 -1.157892 -1.344312  0.844885

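Warning:

You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed.
The attribute is not available if it conflicts with an existing method name, e.g. s.min.
Similarly, the attribute is not available if it conflicts with any of the following names: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing still works, e.g. s['1'], s['min'], and s['index'] access the corresponding element or column.
Attribute access on Series and Panel is available starting in version 0.13.0.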
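
The pitfalls listed in the warning can be reproduced in a few lines. This sketch uses made-up data; the warning filter is there because recent pandas versions emit a UserWarning for the first assignment.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# Attribute assignment to a name that is not yet a column only sets a
# plain Python attribute on the object; no column is created.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.C = [7, 8, 9]
created_by_attribute = 'C' in df.columns    # False

# Item assignment is the reliable way to create the column.
df['C'] = [7, 8, 9]
created_by_item = 'C' in df.columns         # True

# A label that collides with a method name is reachable only via [ ].
s = pd.Series([1, 2], index=['min', 'x'])
value = s['min']      # the element labelled 'min'
method = s.min        # the Series.min method, not the element
```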

If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.

You can also assign a dict to a row of a DataFrame:

In [28]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})

In [29]: x.iloc[1] = dict(x=9, y=99)

In [30]: x
Out[30]: 
   x   y
0  1   3
1  9  99
2  3   5

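Slicing ranges

The most robust and consistent way of slicing ranges along arbitrary axes is the .iloc method, described in detail in the Selection By Position section. For now, we explain the semantics of slicing using the [ ] operator.

With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels: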

In [31]: s[:5]
Out[31]: 
2000-01-01    0.469112
2000-01-02    1.212112
2000-01-03   -0.861849
2000-01-04    0.721555
2000-01-05   -0.424972
Freq: D, Name: A, dtype: float64

In [32]: s[::2]  # step of 2
Out[32]: 
2000-01-01    0.469112
2000-01-03   -0.861849
2000-01-05   -0.424972
2000-01-07    0.404705
Freq: 2D, Name: A, dtype: float64

In [33]: s[::-1]  # reversed order, step of -1
Out[33]: 
2000-01-08   -0.370647
2000-01-07    0.404705
2000-01-06   -0.673690
2000-01-05   -0.424972
2000-01-04    0.721555
2000-01-03   -0.861849
2000-01-02    1.212112
2000-01-01    0.469112
Freq: -1D, Name: A, dtype: float64

Note that setting works as well:

In [34]: s2 = s.copy()

In [35]: s2[:5] = 0

In [36]: s2
Out[36]: 
2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06   -0.673690
2000-01-07    0.404705
2000-01-08   -0.370647
Freq: D, Name: A, dtype: float64

With DataFrame, slicing inside [ ] slices the rows. This is provided largely as a convenience, since it is such a common operation.

In [37]: df[:3]  # the first 3 rows of df
Out[37]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [38]: df[::-1]
Out[38]: 
                   A         B         C         D
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632

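Selection By Label

Warning: Whether a setting operation returns a copy or a reference may depend on the context. This is sometimes called "chained assignment" and should be avoided. See Returning a View versus Copy

Warning: .loc is strict when you present slicers that are not compatible (or convertible) with the index type. For example, using integers in a DatetimeIndex raises a TypeError.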

In [39]: dfl = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'), index=pd.date_range('20130101',periods=5))

In [40]: dfl
Out[40]: 
                   A         B         C         D
2013-01-01  1.075770 -0.109050  1.643563 -1.469388
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05  0.895717  0.805244 -1.206412  2.565646

In [4]: dfl.loc[2:3]  # the index of dfl is a DatetimeIndex, so integer slicing is not allowed
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [2] of <type 'int'>

String-likes used in slicing can be converted to the type of the index, which leads to natural slicing.

In [41]: dfl.loc['20130102':'20130104']  # this works
Out[41]: 
                   A         B         C         D
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
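
A small sketch of both behaviors described above, using deterministic made-up values instead of random ones. It assumes the strictness of .loc described in this section.

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(20).reshape(5, 4),
                   columns=list('ABCD'),
                   index=pd.date_range('20130101', periods=5))

# Date-like strings are converted to timestamps; both ends are included.
subset = dfl.loc['20130102':'20130104']

# Integers cannot be converted to this index type, so .loc rejects them.
try:
    dfl.loc[2:3]
    int_slice_rejected = False
except TypeError:
    int_slice_rejected = True

print(subset)
print(int_slice_rejected)
```

For positional access on such a frame, dfl.iloc[2:3] is the explicit alternative.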

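pandas provides a suite of methods for purely label-based indexing. This is a strict inclusion-based protocol. At least one of the labels you ask for must be in the index, or a KeyError is raised. Unlike regular Python slicing, both the start bound and the stop bound of a slice are included. Integers are valid labels, but they refer to the label and not the position.

The .loc attribute is the primary access method. The following are valid inputs:

A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index)
A list or array of labels, e.g. ['a', 'b', 'c']
A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included!)
A boolean array
A callable function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing (one of the above)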

In [42]: s1 = pd.Series(np.random.randn(6),index=list('abcdef'))

In [43]: s1
Out[43]: 
a    1.431256
b    1.340309
c   -1.170299
d   -0.226169
e    0.410835
f    0.813850
dtype: float64

In [44]: s1.loc['c':]
Out[44]: 
c   -1.170299
d   -0.226169
e    0.410835
f    0.813850
dtype: float64

In [45]: s1.loc['b']
Out[45]: 1.3403088497993827

Note that setting works as well:

In [46]: s1.loc['c':] = 0

In [47]: s1
Out[47]: 
a    1.431256
b    1.340309
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64

With a DataFrame:

In [48]: df1 = pd.DataFrame(np.random.randn(6,4),
   ....:                    index=list('abcdef'),
   ....:                    columns=list('ABCD'))
   ....: 

In [49]: df1
Out[49]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
c  1.024180  0.569605  0.875906 -2.211372
d  0.974466 -2.006747 -0.410001 -0.078638
e  0.545952 -1.219217 -1.226825  0.769804
f -1.281247 -0.727707 -0.121306 -0.097883

In [50]: df1.loc[['a', 'b', 'd'], :]
Out[50]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
d  0.974466 -2.006747 -0.410001 -0.078638

Accessing via label slices:

In [51]: df1.loc['d':, 'A':'C']
Out[51]: 
          A         B         C
d  0.974466 -2.006747 -0.410001
e  0.545952 -1.219217 -1.226825
f -1.281247 -0.727707 -0.121306

Getting a cross section using a label (equivalent to df.xs('a')):

In [52]: df1.loc['a']
Out[52]: 
A    0.132003
B   -0.827317
C   -0.076467
D   -1.187678
Name: a, dtype: float64

Getting values with a boolean array:

In [53]: df1.loc['a'] > 0
Out[53]: 
A     True
B    False
C    False
D    False
Name: a, dtype: bool

In [54]: df1.loc[:, df1.loc['a'] > 0]
Out[54]: 
          A
a  0.132003
b  1.130127
c  1.024180
d  0.974466
e  0.545952
f -1.281247

Getting a value explicitly (equivalent to the deprecated df.get_value('a', 'A')):

# this is also equivalent to ``df1.at['a','A']``
In [55]: df1.loc['a', 'A']
Out[55]: 0.13200317033032932

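Selection By Position

Warning: Whether a setting operation returns a copy or a reference may depend on the context. This is sometimes called "chained assignment" and should be avoided. See Returning a View versus Copy

pandas provides a suite of methods for purely integer-based indexing. The semantics closely follow Python and numpy slicing. Indexing is 0-based. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, raises an IndexError.

The .iloc attribute is the primary access method. The following are valid inputs:

An integer, e.g. 5
A list or array of integers, e.g. [4, 3, 0]
A slice object with ints, e.g. 1:7
A boolean array
A callable function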

In [56]: s1 = pd.Series(np.random.randn(5), index=list(range(0,10,2)))

In [57]: s1
Out[57]: 
0    0.695775
2    0.341734
4    0.959726
6   -1.110336
8   -0.619976
dtype: float64

In [58]: s1.iloc[:3]
Out[58]: 
0    0.695775
2    0.341734
4    0.959726
dtype: float64

In [59]: s1.iloc[3]  # .iloc is positional: 3 means position 3 (the 4th element), whereas s1.loc[2] would select the element whose index label is 2
Out[59]: -1.1103361028911669

Note that setting works as well:

In [60]: s1.iloc[:3] = 0

In [61]: s1
Out[61]: 
0    0.000000
2    0.000000
4    0.000000
6   -1.110336
8   -0.619976
dtype: float64

With a DataFrame:

In [62]: df1 = pd.DataFrame(np.random.randn(6,4),
   ....:                    index=list(range(0,12,2)),
   ....:                    columns=list(range(0,8,2)))
   ....: 

In [63]: df1
Out[63]: 
           0         2         4         6
0   0.149748 -0.732339  0.687738  0.176444
2   0.403310 -0.154951  0.301624 -2.179861
4  -1.369849 -0.954208  1.462696 -1.743161
6  -0.826591 -0.345352  1.314232  0.690579
8   0.995761  2.396780  0.014871  3.357427
10 -0.317441 -1.236269  0.896171 -0.487602

Selecting via integer slicing:

In [64]: df1.iloc[:3]
Out[64]: 
          0         2         4         6
0  0.149748 -0.732339  0.687738  0.176444
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161

In [65]: df1.iloc[1:5, 2:4]
Out[65]: 
          4         6
2  0.301624 -2.179861
4  1.462696 -1.743161
6  1.314232  0.690579
8  0.014871  3.357427

Selecting via integer lists:

In [66]: df1.iloc[[1, 3, 5], [1, 3]]
Out[66]: 
           2         6
2  -0.154951 -2.179861
6  -0.345352  0.690579
10 -1.236269 -0.487602

In [67]: df1.iloc[1:3, :]
Out[67]: 
          0         2         4         6
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161

In [68]: df1.iloc[:, 1:3]
Out[68]: 
           2         4
0  -0.732339  0.687738
2  -0.154951  0.301624
4  -0.954208  1.462696
6  -0.345352  1.314232
8   2.396780  0.014871
10 -1.236269  0.896171

# this is equivalent to df1.iat[1,1]
In [69]: df1.iloc[1, 1]
Out[69]: -0.15495077442490321

Getting a cross section using an integer position (equivalent to df.xs(1)):

In [70]: df1.iloc[1]
Out[70]: 
0    0.403310
2   -0.154951
4    0.301624
6   -2.179861
Name: 2, dtype: float64

Out-of-range slice indexes are handled gracefully, just as in Python and numpy.

# these are allowed in python/numpy
# they only work in pandas starting from v0.14.0
In [71]: x = list('abcdef')

In [72]: x
Out[72]: ['a', 'b', 'c', 'd', 'e', 'f']

In [73]: x[4:10]
Out[73]: ['e', 'f']

In [74]: x[8:10]
Out[74]: []

In [75]: s = pd.Series(x)

In [76]: s
Out[76]: 
0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

In [77]: s.iloc[4:10]
Out[77]: 
4    e
5    f
dtype: object

In [78]: s.iloc[8:10]
Out[78]: Series([], dtype: object)

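Note: Prior to v0.14.0, iloc did not accept out-of-bounds indexers for slices, e.g. a value that exceeds the length of the object being indexed.

Note that this can result in an empty axis (e.g. an empty DataFrame being returned):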

In [79]: dfl = pd.DataFrame(np.random.randn(5,2), columns=list('AB'))

In [80]: dfl
Out[80]: 
          A         B
0 -0.082240 -2.182937
1  0.380396  0.084844
2  0.432390  1.519970
3 -0.493662  0.600178
4  0.274230  0.132885

In [81]: dfl.iloc[:, 2:3]
Out[81]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

In [82]: dfl.iloc[:, 1:3]
Out[82]: 
          B
0 -2.182937
1  0.084844
2  1.519970
3  0.600178
4  0.132885

In [83]: dfl.iloc[4:6]
Out[83]: 
         A         B
4  0.27423  0.132885

A single indexer that is out of bounds raises an IndexError. A list of indexers in which any element is out of bounds also raises an IndexError.

dfl.iloc[[4, 5, 6]]
IndexError: positional indexers are out-of-bounds

dfl.iloc[:, 4]
IndexError: single positional indexer is out-of-bounds
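
The out-of-bounds rules for .iloc can be summarized in a short sketch. The helper raises_index_error is a made-up name used only for this illustration.

```python
import pandas as pd

s = pd.Series(list('abcdef'))

tail = s.iloc[4:10]     # a slice running past the end is truncated
empty = s.iloc[8:10]    # a slice entirely past the end is empty

def raises_index_error(indexer):
    """Return True if s.iloc[indexer] raises IndexError."""
    try:
        s.iloc[indexer]
    except IndexError:
        return True
    return False

single_out_of_bounds = raises_index_error(10)        # single integer
list_out_of_bounds = raises_index_error([4, 5, 6])   # list with one bad entry

print(tail.tolist(), empty.tolist())
print(single_out_of_bounds, list_out_of_bounds)
```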

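Selection By Callable

.loc, .iloc, .ix and [ ] indexing can accept a callable as the indexer. The callable must be a function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing.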

In [84]: df1 = pd.DataFrame(np.random.randn(6, 4),
   ....:                    index=list('abcdef'),
   ....:                    columns=list('ABCD'))
   ....: 

In [85]: df1
Out[85]: 
          A         B         C         D
a -0.023688  2.410179  1.450520  0.206053
b -0.251905 -2.213588  1.063327  1.266143
c  0.299368 -0.863838  0.408204 -1.048089
d -0.025747 -0.988387  0.094055  1.262731
e  1.289997  0.082423 -0.055758  0.536580
f -0.489682  0.369374 -0.034571 -2.484478

In [86]: df1.loc[lambda df: df.A > 0, :]  # rows where column A is greater than 0, all columns
Out[86]: 
          A         B         C         D
c  0.299368 -0.863838  0.408204 -1.048089
e  1.289997  0.082423 -0.055758  0.536580

In [87]: df1.loc[:, lambda df: ['A', 'B']]  # all rows, columns A and B
Out[87]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374

In [88]: df1.iloc[:, lambda df: [0, 1]]  # all rows, columns at positions 0 and 1
Out[88]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374

In [89]: df1[lambda df: df.columns[0]]  # all rows of the first column
Out[89]: 
a   -0.023688
b   -0.251905
c    0.299368
d   -0.025747
e    1.289997
f   -0.489682
Name: A, dtype: float64

You can use callable indexing in Series as well:

In [90]: df1.A.loc[lambda s: s > 0]  # column A, rows where the value is greater than 0
Out[90]: 
c    0.299368
e    1.289997
Name: A, dtype: float64

Using these methods and indexers, you can chain data selection operations without using a temporary variable.

In [91]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [92]: (bb.groupby(['year', 'team']).sum()
   ....:    .loc[lambda df: df.r > 100])
   ....: 
Out[92]: 
           stint    g    ab    r    h  X2b  X3b  hr    rbi    sb   cs   bb  \
year team                                                                    
2007 CIN       6  379   745  101  203   35    2  36  125.0  10.0  1.0  105   
     DET       5  301  1062  162  283   54    4  37  144.0  24.0  7.0   97   
     HOU       4  311   926  109  218   47    6  14   77.0  10.0  4.0   60   
     LAN      11  413  1021  153  293   61    3  36  154.0   7.0  5.0  114   
     NYN      13  622  1854  240  509  101    3  61  243.0  22.0  4.0  174   
     SFN       5  482  1305  198  337   67    6  40  171.0  26.0  7.0  235   
     TEX       2  198   729  115  200   40    4  28  115.0  21.0  4.0   73   
     TOR       4  459  1408  187  378   96    2  58  223.0   4.0  2.0  190   

              so   ibb   hbp    sh    sf  gidp  
year team                                       
2007 CIN   127.0  14.0   1.0   1.0  15.0  18.0  
     DET   176.0   3.0  10.0   4.0   8.0  28.0  
     HOU   212.0   3.0   9.0  16.0   6.0  17.0  
     LAN   141.0   8.0   9.0   3.0   8.0  29.0  
     NYN   310.0  24.0  23.0  18.0  15.0  48.0  
     SFN   188.0  51.0   8.0  16.0   6.0  41.0  
     TEX   140.0   4.0   5.0   2.0   8.0  16.0  
     TOR   265.0  16.0  12.0   4.0  16.0  38.0
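
Since data/baseball.csv may not be at hand, here is the same chaining pattern on a tiny made-up frame. The column names and values are illustrative only.

```python
import pandas as pd

games = pd.DataFrame({'team': ['CIN', 'CIN', 'DET', 'DET', 'HOU'],
                      'r': [60, 41, 90, 72, 30]})

# The callable receives the intermediate groupby result, so no
# temporary variable is needed to filter on the summed column.
high_scoring = (games.groupby('team').sum()
                     .loc[lambda df: df['r'] > 100])

print(high_scoring)
```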

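Selecting Random Samples

Use the sample() method to select a random sample of rows or columns from a Series, DataFrame, or Panel. By default the method samples rows, and it accepts either a specific number of rows/columns to return or a fraction of rows.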

In [93]: s = pd.Series([0,1,2,3,4,5])

# when no arguments are passed, returns 1 row
In [94]: s.sample()
Out[94]: 
4    4
dtype: int64

# you can also pass n to specify the number of rows:
In [95]: s.sample(n=3)
Out[95]: 
0    0
4    4
1    1
dtype: int64

# or a fraction of the rows:
In [96]: s.sample(frac=0.5)
Out[96]: 
5    5
3    3
1    1
dtype: int64

By default, sample returns each row at most once, but you can also sample with replacement using the replace option:

In [97]: s = pd.Series([0,1,2,3,4,5])

# without replacement (default):
In [98]: s.sample(n=6, replace=False)
Out[98]: 
0    0
1    1
5    5
3    3
2    2
4    4
dtype: int64

# with replacement:
In [99]: s.sample(n=6, replace=True)
Out[99]: 
0    0
4    4
3    3
2    2
4    4
4    4
dtype: int64

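By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass sampling weights to sample via the weights keyword. These weights can be a list, a numpy array, or a Series, but they must be the same length as the object you are sampling. Missing values are treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they are re-normalized by dividing each weight by the sum of the weights. For example: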

In [100]: s = pd.Series([0,1,2,3,4,5])

In [101]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [102]: s.sample(n=3, weights=example_weights)
Out[102]: 
5    5
4    4
3    3
dtype: int64

# weights are re-normalized automatically
In [103]: example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [104]: s.sample(n=1, weights=example_weights2)
Out[104]: 
0    0
dtype: int64
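
A short sketch of weighted sampling with weights that do not sum to 1. The weights and the seed are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# These weights sum to 10; pandas rescales them by their total,
# so they behave like [0, 0, 0.2, 0.2, 0.2, 0.4].
raw_weights = [0, 0, 2, 2, 2, 4]
picked = s.sample(n=3, weights=raw_weights, random_state=0)

# Rows with zero weight are never drawn, and the same seed
# reproduces the same draw.
picked_again = s.sample(n=3, weights=raw_weights, random_state=0)

print(picked)
```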

With a DataFrame, you can use one of its columns as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.

In [105]: df2 = pd.DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})

In [106]: df2.sample(n = 3, weights = 'weight_column')
Out[106]: 
   col1  weight_column
1     8            0.4
0     9            0.5
2     7            0.1

sample also allows users to sample columns instead of rows using the axis argument.

In [107]: df3 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [108]: df3.sample(n=1, axis=1)
Out[108]: 
   col1
0     1
1     2
2     3

Finally, you can set a seed for sample's random number generator using the random_state argument, which accepts either an integer or a numpy RandomState object.

In [109]: df4 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

# with a given seed, the sample always draws the same rows
In [110]: df4.sample(n=2, random_state=2)
Out[110]: 
   col1  col2
2     3     4
1     2     3

In [111]: df4.sample(n=2, random_state=2)
Out[111]: 
   col1  col2
2     3     4
1     2     3

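Setting With Enlargement

The .loc/.ix/[ ] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an appending operation.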

In [112]: se = pd.Series([1,2,3])

In [113]: se
Out[113]: 
0    1
1    2
2    3
dtype: int64

In [114]: se[5] = 5.

In [115]: se
Out[115]: 
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

A DataFrame can be enlarged on either axis via .loc:

In [116]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
   .....:                 columns=['A','B'])
   .....: 

In [117]: dfi
Out[117]: 
   A  B
0  0  1
1  2  3
2  4  5

In [118]: dfi.loc[:,'C'] = dfi.loc[:,'A']

In [119]: dfi
Out[119]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

This is like an append operation on the DataFrame.

In [120]: dfi.loc[3] = 5

In [121]: dfi
Out[121]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5
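
The enlargement steps above, condensed into one sketch. It uses .loc throughout, since .ix is not available in later pandas releases.

```python
import pandas as pd

se = pd.Series([1, 2, 3])
se.loc[5] = 5.0     # label 5 does not exist yet, so it is appended

dfi = pd.DataFrame({'A': [0, 2, 4], 'B': [1, 3, 5]})
dfi.loc[:, 'C'] = dfi.loc[:, 'A']   # new column along the column axis
dfi.loc[3] = 5                      # new row along the index axis

print(se)
print(dfi)
```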

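Fast scalar value getting and setting

Since indexing with [ ] must handle many cases (single-label access, slicing, boolean indexing, etc.), it has some overhead in order to figure out what you are asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.

Similarly to loc, at provides label-based scalar lookups, while iat provides integer-based lookups analogously to iloc.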

In [122]: s.iat[5]
Out[122]: 5

In [123]: df.at[dates[5], 'A']
Out[123]: -0.67368970808837059

In [124]: df.iat[3, 0]
Out[124]: 0.72155516224436689

You can also set values using these same indexers.

In [125]: df.at[dates[5], 'E'] = 7

In [126]: df.iat[3, 0] = 7

If the indexer is missing, at may enlarge the object in-place, as above.

In [127]: df.at[dates[-1]+1, 0] = 7

In [128]: df
Out[128]: 
                   A         B         C         D    E    0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632  NaN  NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236  NaN  NaN
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804  NaN  NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860  NaN  NaN
2000-01-05 -0.424972  0.567020  0.276232 -1.087401  NaN  NaN
2000-01-06 -0.673690  0.113648 -1.478427  0.524988  7.0  NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268  NaN  NaN
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885  NaN  NaN
2000-01-09       NaN       NaN       NaN       NaN  NaN  7.0
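
A compact sketch of scalar access with at and iat on a small made-up frame.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['x', 'y'])

by_label = df.at['y', 'A']      # label-based, like .loc
by_position = df.iat[1, 0]      # position-based, like .iloc

df.at['x', 'B'] = 9.0           # setting uses the same indexers
df.iat[0, 0] = 7.0

print(by_label, by_position)
print(df)
```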

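Boolean indexing

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

Using a boolean vector to index a Series works exactly as with a numpy ndarray: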

In [129]: s = pd.Series(range(-3, 4))

In [130]: s
Out[130]: 
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [131]: s[s > 0]
Out[131]: 
4    1
5    2
6    3
dtype: int64

In [132]: s[(s < -1) | (s > 0.5)]
Out[132]: 
0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [133]: s[~(s < 0)]
Out[133]: 
3    0
4    1
5    2
6    3
dtype: int64

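You may select rows from a DataFrame using a boolean vector of the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):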

In [134]: df[df['A'] > 0]
Out[134]: 
                   A         B         C         D   E   0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632 NaN NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236 NaN NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860 NaN NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268 NaN NaN

List comprehensions and the map method of Series can also be used to produce more complex criteria:

In [135]: df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                     'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                     'c' : np.random.randn(7)})
   .....: 

# only want 'two' or 'three'
In [136]: criterion = df2['a'].map(lambda x: x.startswith('t'))

In [137]: df2[criterion]
Out[137]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# equivalent but slower
In [138]: df2[[x.startswith('t') for x in df2['a']]]
Out[138]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# multiple criteria
In [139]: df2[criterion & (df2['b'] == 'x')]
Out[139]: 
       a  b         c
3  three  x  0.361719

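Note that with the selection methods Selection by Label, Selection by Position, and Advanced Indexing, you may select along more than one axis using boolean vectors combined with other indexing expressions.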

In [140]: df2.loc[criterion & (df2['b'] == 'x'),'b':'c']
Out[140]: 
   b         c
3  x  0.361719
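
The parentheses requirement comes from Python's operator precedence: | and & bind more tightly than comparisons. The sketch below shows the working form and, as an assumption about current behavior, that the unparenthesized form fails with a ValueError.

```python
import pandas as pd

s = pd.Series(range(-3, 4))

# Parentheses turn each comparison into a boolean Series before combining.
selected = s[(s < -1) | (s > 0)]
negated = s[~(s < 0)]

# Without parentheses Python reads this as s < (-1 | s) > 0, a chained
# comparison that needs the truth value of a whole Series.
try:
    s[s < -1 | s > 0]
    unparenthesized_failed = False
except ValueError:
    unparenthesized_failed = True

print(selected.tolist(), negated.tolist(), unparenthesized_failed)
```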

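Indexing with isin

Consider the isin method of Series, which returns a boolean vector that is True wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have the values you want.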

In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [142]: s
Out[142]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [143]: s.isin([2, 4, 6])
Out[143]: 
4    False
3    False
2     True
1    False
0     True
dtype: bool

In [144]: s[s.isin([2, 4, 6])]
Out[144]: 
2    2
0    4
dtype: int64

The same method is available for Index objects and is useful when you do not know which of the sought labels are actually present.

In [145]: s[s.index.isin([2, 4, 6])]
Out[145]: 
4    0
2    2
dtype: int64

# compare it to the following
In [146]: s[[2, 4, 6]]
Out[146]: 
2    2.0
4    0.0
6    NaN
dtype: float64

In addition, MultiIndex allows selecting a separate level to use in the membership check:

In [147]: s_mi = pd.Series(np.arange(6),
   .....:                  index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
   .....: 

In [148]: s_mi
Out[148]: 
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64

In [149]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[149]: 
0  c    2
1  a    3
dtype: int64

In [150]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[150]: 
0  a    0
   c    2
1  a    3
   c    5
dtype: int64

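DataFrame also has an isin method. When calling isin, pass a set of values as either an array or a dict. If values is an array, isin returns a DataFrame of booleans with the same shape as the original DataFrame, with True wherever the element is in the sequence of values.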

In [151]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....:                    'ids2': ['a', 'n', 'c', 'n']})
   .....: 

In [152]: values = ['a', 'b', 1, 3]

In [153]: df.isin(values)
Out[153]: 
     ids   ids2   vals
0   True   True   True
1   True  False  False
2  False  False   True
3  False  False  False

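Often you will want to match certain values with certain columns. Just make values a dict where the key is the column and the value is a list of items you want to check for.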

In [154]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [155]: df.isin(values)
Out[155]: 
     ids   ids2   vals
0   True  False   True
1   True  False  False
2  False  False   True
3  False  False  False

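Combine DataFrame's isin with the any() and all() methods to quickly select subsets of your data that meet a given criterion. To select a row where each column meets its own criterion: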

In [156]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [157]: row_mask = df.isin(values).all(1)

In [158]: df[row_mask]
Out[158]: 
  ids ids2  vals
0   a    a     1

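The where() Method and Masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the where method of Series and DataFrame.

To return only the selected rows: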

In [159]: s[s > 0]
Out[159]: 
3    1
2    2
1    3
0    4
dtype: int64

To return a Series of the same shape as the original:

In [160]: s.where(s > 0)
Out[160]: 
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

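Selecting values from a DataFrame with a boolean criterion also preserves the shape of the input data. where is used under the hood as the implementation. The following is equivalent to df.where(df < 0):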

In [161]: df[df < 0]
Out[161]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838

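In addition, where takes an optional other argument that supplies the replacement values, in the returned copy, wherever the condition is False.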

In [162]: df.where(df < 0, -df)
Out[162]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838

You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [163]: s2 = s.copy()

In [164]: s2[s2 < 0] = 0

In [165]: s2
Out[165]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [166]: df2 = df.copy()

In [167]: df2[df2 < 0] = 0

In [168]: df2
Out[168]: 
                   A         B         C         D
2000-01-01  0.000000  0.000000  0.485855  0.245166
2000-01-02  0.000000  0.390389  0.000000  1.655824
2000-01-03  0.000000  0.299674  0.000000  0.281059
2000-01-04  0.846958  0.000000  0.600705  0.000000
2000-01-05  0.669692  0.000000  0.000000  0.342416
2000-01-06  0.868584  0.000000  2.297780  0.000000
2000-01-07  0.000000  0.000000  0.168904  0.000000
2000-01-08  0.801196  1.392071  0.000000  0.000000

By default, where returns a modified copy of the data. There is an optional parameter inplace that modifies the original data directly without creating a copy.

In [169]: df_orig = df.copy()

In [170]: df_orig.where(df > 0, -df, inplace=True);

In [171]: df_orig
Out[171]: 
                   A         B         C         D
2000-01-01  2.104139  1.309525  0.485855  0.245166
2000-01-02  0.352480  0.390389  1.192319  1.655824
2000-01-03  0.864883  0.299674  0.227870  0.281059
2000-01-04  0.846958  1.222082  0.600705  1.233203
2000-01-05  0.669692  0.605656  1.169184  0.342416
2000-01-06  0.868584  0.948458  2.297780  0.684718
2000-01-07  2.670153  0.114722  0.168904  0.048048
2000-01-08  0.801196  1.392071  0.048788  0.808838

Note: The signature of DataFrame.where() differs from that of numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).

In [172]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[172]: 
               A     B     C     D
2000-01-01  True  True  True  True
2000-01-02  True  True  True  True
2000-01-03  True  True  True  True
2000-01-04  True  True  True  True
2000-01-05  True  True  True  True
2000-01-06  True  True  True  True
2000-01-07  True  True  True  True
2000-01-08  True  True  True  True
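
The argument order is the part that is easy to mix up, so here is a small sketch with made-up values that can be checked by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-1.0, 2.0], 'B': [3.0, -4.0]})

kept = df.where(df < 0)                  # same shape, NaN where the condition is False
flipped = df.where(df < 0, -df)          # replacements are taken from `other`
via_numpy = np.where(df < 0, df, -df)    # numpy takes the condition first

print(kept)
print(flipped)
```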

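alignment

Furthermore, where aligns the input boolean condition (ndarray or DataFrame), so that partial selection with setting is possible. This is analogous to partial setting via .ix (but on the contents rather than the axis labels).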

In [173]: df2 = df.copy()

In [174]: df2[ df2[1:4] > 0 ] = 3

In [175]: df2
Out[175]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525  0.485855  0.245166
2000-01-02 -0.352480  3.000000 -1.192319  3.000000
2000-01-03 -0.864883  3.000000 -0.227870  3.000000
2000-01-04  3.000000 -1.222082  3.000000 -1.233203
2000-01-05  0.669692 -0.605656 -1.169184  0.342416
2000-01-06  0.868584 -0.948458  2.297780 -0.684718
2000-01-07 -2.670153 -0.114722  0.168904 -0.048048
2000-01-08  0.801196  1.392071 -0.048788 -0.808838

where can also accept axis and level parameters to align the input when performing the where.

In [176]: df2 = df.copy()

In [177]: df2.where(df2>0,df2['A'],axis='index')
Out[177]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196

This is equivalent to (but faster than) the following.

In [178]: df2 = df.copy()

In [179]: df.apply(lambda x, y: x.where(x>0,y), y=df['A'])
Out[179]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196
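
A minimal sketch of the same comparison on a two-row frame, where the replacement values can be followed by hand. The variable names fast and slow are only labels for this example, not timing results.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# Non-positive cells are replaced by the value of column A in the same row.
fast = df.where(df > 0, df["A"], axis="index")

# The column-by-column spelling: apply passes `other` through to the lambda.
slow = df.apply(lambda col, other: col.where(col > 0, other), other=df["A"])

print(fast)
print(fast.equals(slow))
```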

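where can accept a callable as the condition and as the other argument. The function must take one argument (the calling Series or DataFrame) and return valid output to be used as the condition or other argument.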

In [180]: df3 = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6],
   .....:                     'C': [7, 8, 9]})
   .....: 

In [181]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[181]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9

mask

mask is the inverse boolean operation of where.

In [182]: s.mask(s >= 0)
Out[182]: 
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [183]: df.mask(df >= 0)
Out[183]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
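
In other words, mask(cond) hides the values where cond is True, which is what where does with the negated condition. A short sketch with made-up data:

```python
import pandas as pd

s = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])

masked = s.mask(s >= 0)       # NaN wherever the condition holds
kept = s.where(~(s >= 0))     # NaN wherever the negated condition fails

print(masked)
print(masked.equals(kept))
```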

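The `query()` Method (Experimental)

DataFrame objects have a query() method that allows selection using an expression.

You can get the rows of the frame where the values of column b lie between the values of columns a and c. For example: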

In [184]: n = 10

In [185]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [186]: df
Out[186]: 
          a         b         c
0  0.438921  0.118680  0.863670
1  0.138138  0.577363  0.686602
2  0.595307  0.564592  0.520630
3  0.913052  0.926075  0.616184
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
6  0.792342  0.216974  0.564056
7  0.397890  0.454131  0.915716
8  0.074315  0.437913  0.019794
9  0.559209  0.502065  0.026437

# pure Python
In [187]: df[(df.a < df.b) & (df.b < df.c)]
Out[187]: 
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716

# query
In [188]: df.query('(a < b) & (b < c)')
Out[188]: 
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716

Do the same thing, but fall back on a named index if there is no column with the name a.

In [189]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))

In [190]: df.index.name = 'a'

In [191]: df
Out[191]: 
   b  c
a      
0  0  4
1  0  1
2  3  4
3  4  3
4  1  4
5  0  3
6  0  1
7  3  4
8  2  3
9  1  1

In [192]: df.query('a < b and b < c')
Out[192]: 
   b  c
a      
2  3  4

If you don't want to or cannot name your index, you can use the name index in your query expression:

In [193]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))

In [194]: df
Out[194]: 
   b  c
0  3  1
1  3  0
2  5  6
3  5  2
4  7  4
5  0  1
6  2  5
7  0  1
8  6  0
9  7  9

In [195]: df.query('index < b < c')
Out[195]: 
   b  c
2  5  6

Note: If the name of your index overlaps with a column name, the column name is given precedence. For example,

In [196]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})

In [197]: df.index.name = 'a'

In [198]: df.query('a > 2') # uses the column 'a', not the index
Out[198]: 
   a
a   
1  3
3  3

You can still use the index in a query expression by using the special identifier 'index':

In [199]: df.query('index > 2')
Out[199]: 
   a
a   
3  3
4  2

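If for some reason you have a column named index, then you can refer to the index as ilevel_0 as well, but at this point you should consider renaming your columns to something less ambiguous.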
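
The name resolution rules above can be sketched with fixed data. The frames and values below are invented for illustration, and the example assumes a pandas version in which query() resolves index and ilevel_0 as described.

```python
import pandas as pd

# Index and column share the name 'a': the column wins.
df = pd.DataFrame({"a": [0, 3, 1, 3, 2]})
df.index.name = "a"
by_column = df.query("a > 2")      # compares the column values
by_index = df.query("index > 2")   # compares the index values

# A column literally called 'index' with an unnamed index:
# 'index' now means the column, 'ilevel_0' means the index itself.
df2 = pd.DataFrame({"index": [10, 20, 30]})
col_hits = df2.query("index > 15")
idx_hits = df2.query("ilevel_0 >= 2")

print(by_column.index.tolist(), by_index.index.tolist())
print(col_hits.index.tolist(), idx_hits.index.tolist())
```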

`MultiIndex` `query()` Syntax

You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:

In [200]: n = 10

In [201]: colors = np.random.choice(['red', 'green'], size=n)

In [202]: foods = np.random.choice(['eggs', 'ham'], size=n)

In [203]: colors
Out[203]: 
array(['red', 'red', 'red', 'green', 'green', 'green', 'green', 'green',
       'green', 'green'], 
      dtype='|S5')

In [204]: foods
Out[204]: 
array(['ham', 'ham', 'eggs', 'eggs', 'eggs', 'ham', 'ham', 'eggs', 'eggs',
       'eggs'], 
      dtype='|S4')

In [205]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])

In [206]: df = pd.DataFrame(np.random.randn(n, 2), index=index)

In [207]: df
Out[207]: 
                   0         1
color food                    
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
green eggs -0.748199  1.318931
      eggs -2.029766  0.792652
      ham   0.461007 -0.542749
      ham  -0.305384 -0.479195
      eggs  0.095031 -0.270099
      eggs -0.707140 -0.773882
      eggs  0.229453  0.304418

In [208]: df.query('color == "red"')
Out[208]: 
                   0         1
color food                    
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255

If the levels of the MultiIndex are unnamed, you can refer to them using special names:

In [209]: df.index.names = [None, None]

In [210]: df
Out[210]: 
                   0         1
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
green eggs -0.748199  1.318931
      eggs -2.029766  0.792652
      ham   0.461007 -0.542749
      ham  -0.305384 -0.479195
      eggs  0.095031 -0.270099
      eggs -0.707140 -0.773882
      eggs  0.229453  0.304418

In [211]: df.query('ilevel_0 == "red"')
Out[211]: 
                 0         1
red ham   0.194889 -0.381994
    ham   0.318587  2.089075
    eggs -0.728293 -0.090255

The convention is ilevel_0, which means "index level 0" for the 0th level of the index.
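
A compact sketch of both spellings on a four-row frame. The column name x and the data are assumptions made for this example only.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["red", "red", "green", "green"], ["ham", "eggs", "ham", "eggs"]],
    names=["color", "food"],
)
df = pd.DataFrame({"x": [1, 2, 3, 4]}, index=index)

# Named levels can be used like columns.
named = df.query('color == "red"')

# With unnamed levels, fall back on ilevel_0, ilevel_1, ...
df_unnamed = df.copy()
df_unnamed.index = df.index.set_names([None, None])
unnamed = df_unnamed.query('ilevel_0 == "red"')

print(named["x"].tolist(), unnamed["x"].tolist())
```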

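`query()` Use Cases

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to all of the frames without having to specify which frame you are interested in querying.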

In [212]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [213]: df
Out[213]: 
          a         b         c
0  0.224283  0.736107  0.139168
1  0.302827  0.657803  0.713897
2  0.611185  0.136624  0.984960
3  0.195246  0.123436  0.627712
4  0.618673  0.371660  0.047902
5  0.480088  0.062993  0.185760
6  0.568018  0.483467  0.445289
7  0.309040  0.274580  0.587101
8  0.258993  0.477769  0.370255
9  0.550459  0.840870  0.304611

In [214]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)

In [215]: df2
Out[215]: 
           a         b         c
0   0.357579  0.229800  0.596001
1   0.309059  0.957923  0.965663
2   0.123102  0.336914  0.318616
3   0.526506  0.323321  0.860813
4   0.518736  0.486514  0.384724
5   0.190804  0.505723  0.614533
6   0.891939  0.623977  0.676639
7   0.480559  0.378528  0.460858
8   0.420223  0.136404  0.141295
9   0.732206  0.419540  0.604675
10  0.604466  0.848974  0.896165
11  0.589168  0.920046  0.732716

In [216]: expr = '0.0 <= a <= c <= 0.5'

In [217]: map(lambda frame: frame.query(expr), [df, df2])
Out[217]: 
[          a         b         c
 8  0.258993  0.477769  0.370255,           a         b         c
 2  0.123102  0.336914  0.318616]
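
The session above relies on Python 2, where map returns a list. On Python 3, map returns a lazy iterator, so a list comprehension is a simple way to get the same result. A sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"a": [0.1, 0.6, 0.2], "b": [0.5, 0.5, 0.5], "c": [0.4, 0.7, 0.1]})
df2 = pd.DataFrame({"a": [0.3, 0.05], "b": [0.9, 0.9], "c": [0.2, 0.45]})

expr = "0.0 <= a <= c <= 0.5"

# The same expression string is applied to every frame in the collection.
results = [frame.query(expr) for frame in (df, df2)]

for result in results:
    print(result)
```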

`query()` Python versus pandas Syntax Comparison

Full numpy-like syntax

In [218]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))

In [219]: df
Out[219]: 
   a  b  c
0  7  8  9
1  1  0  7
2  2  7  2
3  6  2  2
4  2  6  3
5  3  8  2
6  1  7  2
7  5  1  5
8  9  8  0
9  1  5  0

In [220]: df.query('(a < b) & (b < c)')
Out[220]: 
   a  b  c
0  7  8  9

In [221]: df[(df.a < df.b) & (df.b < df.c)]
Out[221]: 
   a  b  c
0  7  8  9

Slightly nicer by removing the parentheses (inside query(), comparison operators bind tighter than &/|)

In [222]: df.query('a < b & b < c')
Out[222]: 
   a  b  c
0  7  8  9

Use English instead of symbols

In [223]: df.query('a < b and b < c')
Out[223]: 
   a  b  c
0  7  8  9

Pretty close to how you might write it on paper

In [224]: df.query('a < b < c')
Out[224]: 
   a  b  c
0  7  8  9
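
The four spellings can be checked against the boolean-mask version in one loop. This is a sketch with hand-picked integers; the list name queries is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [7, 1, 2, 6], "b": [8, 0, 7, 2], "c": [9, 7, 2, 2]})

# Reference result using ordinary boolean indexing.
expected = df[(df.a < df.b) & (df.b < df.c)]

queries = [
    "(a < b) & (b < c)",   # numpy-like
    "a < b & b < c",       # no parentheses
    "a < b and b < c",     # English operators
    "a < b < c",           # chained comparison
]
all_equal = all(df.query(q).equals(expected) for q in queries)

print(expected)
print(all_equal)
```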

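The `in` and `not in` operators

query() also supports special use of Python's in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.

# get all rows where columns "a" and "b" have overlapping values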
In [225]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
   .....:                    'c': np.random.randint(5, size=12),
   .....:                    'd': np.random.randint(9, size=12)})
   .....: 

In [226]: df
Out[226]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

In [227]: df.query('a in b')
Out[227]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2

# How you'd do it in pure Python
In [228]: df[df.a.isin(df.b)]
Out[228]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2

In [229]: df.query('a not in b')
Out[229]: 
    a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

# pure Python
In [230]: df[~df.a.isin(df.b)]
Out[230]: 
    a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

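You can combine this with other expressions for very succinct queries:

# rows where cols a and b have overlapping values and col c's values are less than col d's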
In [231]: df.query('a in b and c < d')
Out[231]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2

# pure Python
In [232]: df[df.a.isin(df.b) & (df.c < df.d)]
Out[232]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2

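Note: in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression

df.query('a in b + c + d')

(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.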
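
As a sanity check, the in / not in forms and their isin counterparts can be compared on a small frame. The data here is invented; note that `a in b` corresponds to df.a.isin(df.b), with the column on the left of in being the one tested.

```python
import pandas as pd

df = pd.DataFrame({
    "a": list("aabbccdd"),
    "b": list("aaaabbbb"),
    "c": [2, 4, 1, 2, 3, 0, 3, 2],
    "d": [6, 7, 6, 1, 6, 2, 3, 1],
})

in_rows = df.query("a in b")
not_in_rows = df.query("a not in b")
combined = df.query("a in b and c < d")

# Pure Python equivalents.
in_py = df[df.a.isin(df.b)]
not_in_py = df[~df.a.isin(df.b)]
combined_py = df[df.a.isin(df.b) & (df.c < df.d)]

print(combined)
```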

Special use of the `==` operator with list objects

Comparing a list of values to a column using ==/!= works similarly to in/not in.

In [233]: df.query('b == ["a", "b", "c"]')
Out[233]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

# pure Python
In [234]: df[df.b.isin(["a", "b", "c"])]
Out[234]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

In [235]: df.query('c == [1, 2]')
Out[235]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2

In [236]: df.query('c != [1, 2]')
Out[236]: 
    a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6

# using in/not in
In [237]: df.query('[1, 2] in c')
Out[237]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2

In [238]: df.query('[1, 2] not in c')
Out[238]: 
    a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6

# pure Python
In [239]: df[df.c.isin([1, 2])]
Out[239]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2
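
The three spellings (== with a list, in, and isin) can be compared directly. A minimal sketch on a single made-up column:

```python
import pandas as pd

df = pd.DataFrame({"c": [2, 4, 1, 2, 3, 0]})

eq_rows = df.query("c == [1, 2]")     # membership test, not elementwise equality
ne_rows = df.query("c != [1, 2]")
in_rows = df.query("[1, 2] in c")
isin_rows = df[df.c.isin([1, 2])]     # pure Python

print(eq_rows.index.tolist(), ne_rows.index.tolist())
```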

Boolean Operators

You can negate boolean expressions with the word not or the ~ operator.

In [240]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [241]: df['bools'] = np.random.rand(len(df)) > 0.5

In [242]: df.query('~bools')
Out[242]: 
          a         b         c  bools
2  0.697753  0.212799  0.329209  False
7  0.275396  0.691034  0.826619  False
8  0.190649  0.558748  0.262467  False

In [243]: df.query('not bools')
Out[243]: 
          a         b         c  bools
2  0.697753  0.212799  0.329209  False
7  0.275396  0.691034  0.826619  False
8  0.190649  0.558748  0.262467  False

In [244]: df.query('not bools') == df[~df.bools]
Out[244]: 
      a     b     c bools
2  True  True  True  True
7  True  True  True  True
8  True  True  True  True

Of course, expressions can be arbitrarily complex too:

# short query syntax
In [245]: shorter = df.query('a < b < c and (not bools) or bools > 2')

# equivalent in pure Python
In [246]: longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]

In [247]: shorter
Out[247]: 
          a         b         c  bools
7  0.275396  0.691034  0.826619  False

In [248]: longer
Out[248]: 
          a         b         c  bools
7  0.275396  0.691034  0.826619  False

In [249]: shorter == longer
Out[249]: 
      a     b     c bools
7  True  True  True  True

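Performance of `query()`

DataFrame.query() using numexpr is slightly faster than Python for large frames.

Note: You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows.

The plot in the original documentation (not reproduced here) was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().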

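Duplicate Data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.

duplicated returns a boolean vector whose length is the number of rows, and which indicates whether each row is duplicated.
drop_duplicates removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify which rows to keep.

keep='first' (default): mark / drop duplicates except for the first occurrence.
keep='last': mark / drop duplicates except for the last occurrence.
keep=False: mark / drop all duplicates.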

In [250]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

In [251]: df2
Out[251]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [252]: df2.duplicated('a')
Out[252]: 
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [253]: df2.duplicated('a', keep='last')
Out[253]: 
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [254]: df2.duplicated('a', keep=False)
Out[254]: 
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [255]: df2.drop_duplicates('a')
Out[255]: 
       a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329

In [256]: df2.drop_duplicates('a', keep='last')
Out[256]: 
       a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [257]: df2.drop_duplicates('a', keep=False)
Out[257]: 
       a  b         c
5  three  x -1.964475
6   four  x  1.298329

Also, you can pass a list of columns to identify duplications.

In [258]: df2.duplicated(['a', 'b'])
Out[258]: 
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

In [259]: df2.drop_duplicates(['a', 'b'])
Out[259]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
5  three  x -1.964475
6   four  x  1.298329
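
The relationship between the two methods can be sketched as follows: drop_duplicates keeps exactly the rows that duplicated marks as False. The frame below is a cut-down, hand-written variant of the one above.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two", "three"],
    "b": ["x", "y", "x", "y", "x", "x"],
})

first = df.duplicated("a")                # keep='first' is the default
last = df.duplicated("a", keep="last")
none = df.duplicated("a", keep=False)

# Dropping is the same as filtering on the negated mask.
dropped = df.drop_duplicates("a")
filtered = df[~first]

# Duplicates judged on the pair (a, b).
pairs = df.drop_duplicates(["a", "b"])

print(first.tolist())
print(pairs.index.tolist())
```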

To drop duplicates by index value, use Index.duplicated and then perform slicing. The same options are available in the keep parameter.

In [260]: df3 = pd.DataFrame({'a': np.arange(6),
   .....:                     'b': np.random.randn(6)},
   .....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])
   .....: 

In [261]: df3
Out[261]: 
   a         b
a  0  1.440455
a  1  2.456086
b  2  1.038402
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [262]: df3.index.duplicated()
Out[262]: array([False,  True, False, False,  True,  True], dtype=bool)

In [263]: df3[~df3.index.duplicated()]
Out[263]: 
   a         b
a  0  1.440455
b  2  1.038402
c  3 -0.894409

In [264]: df3[~df3.index.duplicated(keep='last')]
Out[264]: 
   a         b
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [265]: df3[~df3.index.duplicated(keep=False)]
Out[265]: 
   a         b
c  3 -0.894409
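
The same pattern with deterministic values, so the surviving rows can be checked by hand. The column name v is an assumption for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"v": range(6)}, index=list("aabcba"))

keep_first = df[~df.index.duplicated()]             # first 'a', first 'b', 'c'
keep_last = df[~df.index.duplicated(keep="last")]   # last occurrence of each label
keep_none = df[~df.index.duplicated(keep=False)]    # only labels that occur once

print(keep_first["v"].tolist())
print(keep_last["v"].tolist())
print(keep_none["v"].tolist())
```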

Dictionary-like `get()` method

Each of Series, DataFrame, and Panel has a get method which can return a default value.

In [266]: s = pd.Series([1,2,3], index=['a','b','c'])

In [267]: s.get('a')               # equivalent to s['a']
Out[267]: 1

In [268]: s.get('x', default=-1)
Out[268]: -1
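
As with dict.get, a missing key yields None when no default is given, instead of raising KeyError. A short sketch (the DataFrame here is made up for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

present = s.get("a")               # same value as s['a']
missing = s.get("x")               # None, no KeyError
fallback = s.get("x", default=-1)

# For a DataFrame, get looks up a column.
df = pd.DataFrame({"A": [1, 2]})
no_column = df.get("B")

print(present, missing, fallback, no_column)
```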

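The `select()` Method

Another way to extract slices from an object is with the select method of Series, DataFrame, and Panel. This method should be used only when there is no more direct way. select takes a function which operates on the labels along an axis and returns a boolean. For instance: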

In [269]: df.select(lambda x: x == 'A', axis=1)
Out[269]: 
                   A
2000-01-01  0.355794
2000-01-02  1.635763
2000-01-03  0.854409
2000-01-04 -0.216659
2000-01-05  2.414688
2000-01-06 -1.206215
2000-01-07  0.779461
2000-01-08 -0.878999
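
select was deprecated in later pandas releases, so on a newer installation the same label-based filter can be written with .loc and a boolean list built from the labels. This is a sketch of one possible replacement, not the library's prescribed migration path; the frame is invented.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]})

# Apply the predicate to every column label, then select with the boolean list.
keep = [label == "A" for label in df.columns]
selected = df.loc[:, keep]

print(selected)
```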

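The `lookup()` Method

Sometimes you want to extract a set of values given a sequence of row labels and a sequence of column labels. The lookup method allows for this and returns a numpy array. For instance,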

In [270]: dflookup = pd.DataFrame(np.random.rand(20,4), columns = ['A','B','C','D'])

In [271]: dflookup.lookup(list(range(0,10,2)), ['B','C','A','B','D'])
Out[271]: array([ 0.3506,  0.4779,  0.4825,  0.9197,  0.5019])
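
lookup pairs the i-th row label with the i-th column label. Where lookup is not available (it was deprecated in later pandas releases), the same pairing can be sketched with positional indexers and numpy fancy indexing. The frame and labels below are assumptions for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("ABCD"))

rows = [0, 2, 4]
cols = ["B", "C", "A"]

# Translate labels into integer positions, then index the underlying array pairwise.
row_pos = df.index.get_indexer(rows)
col_pos = df.columns.get_indexer(cols)
values = df.to_numpy()[row_pos, col_pos]

print(values)
```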

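Index objects

The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed. However, if you try to convert an Index object with duplicate entries into a set, an exception will be raised.

Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index: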

In [272]: index = pd.Index(['e', 'd', 'a', 'b'])

In [273]: index
Out[273]: Index([u'e', u'd', u'a', u'b'], dtype='object')

In [274]: 'd' in index
Out[274]: True

You can also pass a name to be stored in the index:

In [275]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')

In [276]: index.name
Out[276]: 'something'

The name, if set, will be shown in the console display:

In [277]: index = pd.Index(list(range(5)), name='rows')

In [278]: columns = pd.Index(['A', 'B', 'C'], name='cols')

In [279]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)

In [280]: df
Out[280]: 
cols         A         B         C
rows                              
0     1.295989  0.185778  0.436259
1     0.678101  0.311369 -0.528378
2    -0.674808 -1.103529 -0.656157
3     1.889957  2.076651 -1.102192
4    -1.211795 -0.791746  0.634724

In [281]: df['A']
Out[281]: 
rows
0    1.295989
1    0.678101
2   -0.674808
3    1.889957
4   -1.211795
Name: A, dtype: float64

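Setting metadata

Indexes are "mostly immutable", but it is possible to set and change their metadata, like the index name (or, for MultiIndex, levels and labels).

You can use rename, set_names, set_levels, and set_labels to set these attributes directly. They default to returning a copy; however, you can specify inplace=True to have the data change in place.

See Advanced Indexing for usage of MultiIndexes.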

In [282]: ind = pd.Index([1, 2, 3])

In [283]: ind.rename("apple")
Out[283]: Int64Index([1, 2, 3], dtype='int64', name=u'apple')

In [284]: ind
Out[284]: Int64Index([1, 2, 3], dtype='int64')

In [285]: ind.set_names(["apple"], inplace=True)

In [286]: ind.name = "bob"

In [287]: ind
Out[287]: Int64Index([1, 2, 3], dtype='int64', name=u'bob')

set_names, set_levels, and set_labels also take an optional level argument.

In [288]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

In [289]: index
Out[289]: 
MultiIndex(levels=[[0, 1, 2], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

In [290]: index.levels[1]
Out[290]: Index([u'one', u'two'], dtype='object', name=u'second')

In [291]: index.set_levels(["a", "b"], level=1)
Out[291]: 
MultiIndex(levels=[[0, 1, 2], [u'a', u'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

Set operations on Index objects

Warning: In 0.15.0, the set operations + and - were deprecated in order to provide these for numeric type operations on certain index types. + can be replaced by .union() or |, and - by .difference().

The two main operations are union (|) and intersection (&). These can be directly called as instance methods or used via overloaded operators. Difference is provided via the .difference() method.

In [292]: a = pd.Index(['c', 'b', 'a'])

In [293]: b = pd.Index(['c', 'e', 'd'])

In [294]: a | b
Out[294]: Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

In [295]: a & b
Out[295]: Index([u'c'], dtype='object')

In [296]: a.difference(b)
Out[296]: Index([u'a', u'b'], dtype='object')

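Also available is the symmetric_difference (^) operation, which returns the elements that appear in either idx1 or idx2 but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.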

In [297]: idx1 = pd.Index([1, 2, 3, 4])

In [298]: idx2 = pd.Index([2, 3, 4, 5])

In [299]: idx1.symmetric_difference(idx2)
Out[299]: Int64Index([1, 5], dtype='int64')

In [300]: idx1 ^ idx2
Out[300]: Int64Index([1, 5], dtype='int64')
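
The same operations written with the named methods only, which avoids depending on how a given pandas version interprets the |, & and ^ operators on an Index. A minimal sketch:

```python
import pandas as pd

a = pd.Index(["c", "b", "a"])
b = pd.Index(["c", "e", "d"])

union = a.union(b)
common = a.intersection(b)
only_a = a.difference(b)

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])
sym = idx1.symmetric_difference(idx2)

# The identity stated above, spelled out.
sym_manual = idx1.difference(idx2).union(idx2.difference(idx1))

print(list(union), list(common), list(only_a), list(sym))
```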

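Missing values

Important: Even though Index can hold missing values (NaN), this should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.

Index.fillna fills missing values with the specified scalar value.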

In [301]: idx1 = pd.Index([1, np.nan, 3, 4])

In [302]: idx1
Out[302]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [303]: idx1.fillna(2)
Out[303]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

In [304]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'), pd.NaT, pd.Timestamp('2011-01-03')])

In [305]: idx2
Out[305]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

In [306]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[306]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
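
A quick way to confirm that nothing is left missing after filling is the hasnans attribute. A sketch using the same values as the session above:

```python
import numpy as np
import pandas as pd

idx = pd.Index([1, np.nan, 3, 4])
filled = idx.fillna(2)

dates = pd.DatetimeIndex([pd.Timestamp("2011-01-01"), pd.NaT, pd.Timestamp("2011-01-03")])
filled_dates = dates.fillna(pd.Timestamp("2011-01-02"))

print(idx.hasnans, filled.hasnans)
print(filled_dates.strftime("%Y-%m-%d").tolist())
```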

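Set / Reset Index

Occasionally you will load or create a data set into a DataFrame and want to add an index after you have already done so. There are a couple of different ways.

Set an index

DataFrame has a set_index method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex) to create a new, indexed DataFrame: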

In [307]: data
Out[307]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

In [308]: indexed1 = data.set_index('c')

In [309]: indexed1
Out[309]: 
     a    b    d
c               
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0

In [310]: indexed2 = data.set_index(['a', 'b'])

In [311]: indexed2
Out[311]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [312]: frame = data.set_index('c', drop=False)

In [313]: frame = frame.set_index(['a', 'b'], append=True)

In [314]: frame
Out[314]: 
           c    d
c a   b          
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

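Other options in set_index allow you to keep the index columns in the frame or to add the index in place (without creating a new object):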

In [315]: data.set_index('c', drop=False)
Out[315]: 
     a    b  c    d
c                  
z  bar  one  z  1.0
y  bar  two  y  2.0
x  foo  one  x  3.0
w  foo  two  w  4.0

In [316]: data.set_index(['a', 'b'], inplace=True)

In [317]: data
Out[317]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0
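
The frame named data is not constructed in this part of the document. The sketch below rebuilds a frame with the same values (an assumption based on the printed output) and checks what each set_index variant does to the index names and columns.

```python
import pandas as pd

data = pd.DataFrame({
    "a": ["bar", "bar", "foo", "foo"],
    "b": ["one", "two", "one", "two"],
    "c": ["z", "y", "x", "w"],
    "d": [1.0, 2.0, 3.0, 4.0],
})

indexed1 = data.set_index("c")                       # single column -> regular Index
indexed2 = data.set_index(["a", "b"])                # list of columns -> MultiIndex
kept = data.set_index("c", drop=False)               # column 'c' stays in the frame
appended = kept.set_index(["a", "b"], append=True)   # existing index is kept as level 0

print(list(indexed2.index.names), list(appended.index.names))
```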

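Reset the index

As a convenience, there is a new function on DataFrame called reset_index which transfers the index values into the DataFrame's columns and sets a simple integer index. This is the inverse operation to set_index.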

In [318]: data
Out[318]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

In [319]: data.reset_index()
Out[319]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

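The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.

You can use the level keyword to remove only a portion of the index: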

In [320]: frame
Out[320]: 
           c    d
c a   b          
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

In [321]: frame.reset_index(level=1)
Out[321]: 
         a  c    d
c b               
z one  bar  z  1.0
y two  bar  y  2.0
x one  foo  x  3.0
w two  foo  w  4.0

reset_index takes an optional parameter drop which, if true, simply discards the index instead of putting the index values in the DataFrame's columns.
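
The three behaviours (full reset, partial reset with level, and drop=True) can be sketched on a frame rebuilt from the printed values above; the construction of data is an assumption for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "a": ["bar", "bar", "foo", "foo"],
    "b": ["one", "two", "one", "two"],
    "c": ["z", "y", "x", "w"],
    "d": [1.0, 2.0, 3.0, 4.0],
})
frame = data.set_index(["a", "b"])

restored = frame.reset_index()           # undoes set_index
partial = frame.reset_index(level=1)     # only level 'b' becomes a column
dropped = frame.reset_index(drop=True)   # index values are thrown away

print(restored.equals(data))
print(partial.columns.tolist(), dropped.columns.tolist())
```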

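Note: The reset_index method used to be called delevel in earlier versions.

Adding an ad hoc index

If you create an index yourself, you can just assign it to the index field: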

data.index = index

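Returning a view versus a copy

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.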

In [322]: dfmi = pd.DataFrame([list('abcd'),
   .....:                      list('efgh'),
   .....:                      list('ijkl'),
   .....:                      list('mnop')],
   .....:                     columns=pd.MultiIndex.from_product([['one','two'],
   .....:                                                         ['first','second']]))
   .....: 

In [323]: dfmi
Out[323]: 
    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p

Compare these two access methods:

In [324]: dfmi['one']['second']
Out[324]: 
0    b
1    f
2    j
3    n
Name: second, dtype: object

In [325]: dfmi.loc[:,('one','second')]
Out[325]: 
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

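These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation, dfmi_with_one['second'], selects the Series indexed by 'second'. The intermediate result is written here as the variable dfmi_with_one because pandas sees these operations as separate events, that is, separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

Contrast this with dfmi.loc[:,('one','second')], which passes a nested tuple of (slice(None),('one','second')) to a single call to __getitem__. This allows pandas to deal with it as a single entity. Furthermore, this order of operations can be significantly faster, and it allows one to index both axes if so desired.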

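Why does assignment fail when using chained indexing?

The problem in the previous section is just a performance issue. So what is up with the SettingWithCopy warning? We don't usually throw warnings around when you do something that might cost a few extra milliseconds!

But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code: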

dfmi.loc[:,('one','second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

But this code is handled differently:

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

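See that __getitem__ in there? Outside of simple cases, it is very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That is what SettingWithCopy is warning you about!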
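
The difference in call sequence can be made visible without pandas at all. The Recorder class below is a hypothetical toy container written for this illustration; it only logs which special methods Python invokes for each spelling.

```python
class Recorder:
    """Toy container that logs indexing calls; it is not a pandas object."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # Return a new intermediate object, as chained indexing does.
        return Recorder(f"{self.name}[{key!r}]", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key))


log = []
obj = Recorder("dfmi", log)

obj["one"]["second"] = 0            # chained: two separate calls
chained_calls = list(log)

log.clear()
obj[:, ("one", "second")] = 0       # single call with a nested tuple
single_calls = list(log)

print(chained_calls)
print(single_calls)
```

In the chained form the assignment lands on the intermediate object returned by __getitem__, not on the original one, which is exactly where the view-or-copy uncertainty comes from.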

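Note: You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.

Sometimes a SettingWithCopy warning will arise when there is no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! pandas is probably trying to warn you that you have done this: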

def do_something(df):
   foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
   # ... many lines here ...
   foo['quux'] = value       # We don't know whether this will modify df or not!
   return foo

Yikes!
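
One common way to remove the ambiguity is to take an explicit copy, so that later assignments clearly target a new object. This is a sketch of that idea rather than the only possible fix; the column values and the computed quux column are invented for illustration.

```python
import pandas as pd


def do_something(df):
    # An explicit copy: foo is now an independent frame by construction.
    foo = df[["bar", "baz"]].copy()
    # This assignment can only affect foo, never df.
    foo["quux"] = foo["bar"] + foo["baz"]
    return foo


df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
result = do_something(df)

print(result)
print(df.columns.tolist())
```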

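Evaluation order matters

Furthermore, in chained expressions, the order may determine whether a copy is returned or not. If an expression will set values on a copy of a slice, then a SettingWithCopy exception will be raised (this raise/warn behavior is new starting in 0.13.0).

You can control the action of a chained assignment via the option mode.chained_assignment, which can take the values ['raise','warn',None], where showing a warning is the default.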

In [326]: dfb = pd.DataFrame({'a' : ['one', 'one', 'two',
   .....:                            'three', 'two', 'one', 'six'],
   .....:                     'c' : np.arange(7)})
   .....: 

# This will show the SettingWithCopyWarning
# but the frame values will be set
In [327]: dfb['c'][dfb.a.str.startswith('o')] = 42

This, however, is operating on a copy and will not work.

>>> pd.set_option('mode.chained_assignment','warn')
>>> dfb[dfb.a.str.startswith('o')]['c'] = 42
Traceback (most recent call last)
     ...
SettingWithCopyWarning:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead
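
Following the hint in the warning message, the row condition and the column label can be passed to a single .loc call. A minimal sketch using the same dfb values as above:

```python
import pandas as pd

dfb = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "c": range(7),
})

# Row selector and column label go into one indexing operation.
starts_with_o = dfb["a"].str.startswith("o")
dfb.loc[starts_with_o, "c"] = 42

print(dfb)
```

Because there is only one __setitem__ call on dfb.loc, the assignment is applied to dfb itself.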

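A chained assignment can also crop up when setting values in a mixed dtype frame.

Note: These setting rules apply to all of .loc/.iloc/.ix.

This is the correct access method: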

In [328]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})

In [329]: dfc.loc[0,'A'] = 11

In [330]: dfc
Out[330]: 
     A  B
0   11  1
1  bbb  2
2  ccc  3

This can work at times, but it is not guaranteed to, and so it should be avoided:

In [331]: dfc = dfc.copy()

In [332]: dfc['A'][0] = 111

In [333]: dfc
Out[333]: 
     A  B
0  111  1
1  bbb  2
2  ccc  3

This will not work at all, and so should be avoided:

>>> pd.set_option('mode.chained_assignment','raise')
>>> dfc.loc[0]['A'] = 1111
Traceback (most recent call last)
     ...
SettingWithCopyException:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead

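Warning: The chained assignment warnings / exceptions aim to inform the user of a possibly invalid assignment. There may be false positives, that is, situations where a chained assignment is inadvertently reported.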
