pandas Index objects and reindexing


1. Index

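The pandas Index object stores axis labels and other metadata. An Index is immutable: its elements cannot be modified by assignment. The examples below assume import numpy as np and import pandas as pd.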

In [73]: obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [74]: index = obj.index

In [75]: index
Out[75]: Index(['a', 'b', 'c'], dtype='object')

In [76]: index[1:]
Out[76]: Index(['b', 'c'], dtype='object')

In [77]: index[1] = 'f'  # TypeError

In [8]: index.size
Out[8]: 3

In [9]: index.shape
Out[9]: (3,)

In [10]: index.ndim
Out[10]: 1

In [11]: index.dtype
Out[11]: dtype('O')

Because an Index is immutable, it is safer to share a single Index object among several data structures:

In [78]: labels = pd.Index(np.arange(3))

In [79]: labels
Out[79]: Int64Index([0, 1, 2], dtype='int64')

In [80]: obj2 = pd.Series([2, 3.5, 0], index=labels)

In [81]: obj2
Out[81]:
0    2.0
1    3.5
2    0.0
dtype: float64

In [82]: obj2.index is labels
Out[82]: True

(Int64Index is what older pandas releases print. pandas 2.0 and later show Index([0, 1, 2], dtype='int64') instead.)

An Index is essentially a container, so Python's in operator works on it:

In [84]: f2
Out[84]:
key    year     state  pop  debt
order
a      2000   beijing  1.5   NaN
b      2001   beijing  1.7   NaN
c      2002   beijing  3.6   1.0
d      2001  shanghai  2.4   2.0
e      2002  shanghai  2.9   NaN
f      2003  shanghai  3.2   3.0

In [86]: 'c' in f2.index
Out[86]: True

In [88]: 'pop' in f2.columns
Out[88]: True

Most importantly, a pandas Index can contain duplicate labels:

In [89]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [90]: dup_labels
Out[90]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

So, can a DataFrame have duplicate column labels or duplicate index labels?

Yes, it can, but try to avoid it:

In [91]: f2.index = ['a'] * 6

In [92]: f2
Out[92]:
key  year     state  pop  debt
a    2000   beijing  1.5   NaN
a    2001   beijing  1.7   NaN
a    2002   beijing  3.6   1.0
a    2001  shanghai  2.4   2.0
a    2002  shanghai  2.9   NaN
a    2003  shanghai  3.2   3.0

In [93]: f2.loc['a']
Out[93]:
key  year     state  pop  debt
a    2000   beijing  1.5   NaN
a    2001   beijing  1.7   NaN
a    2002   beijing  3.6   1.0
a    2001  shanghai  2.4   2.0
a    2002  shanghai  2.9   NaN
a    2003  shanghai  3.2   3.0

In [94]: f2.columns = ['year'] * 4

In [95]: f2
Out[95]:
   year      year  year  year
a  2000   beijing   1.5   NaN
a  2001   beijing   1.7   NaN
a  2002   beijing   3.6   1.0
a  2001  shanghai   2.4   2.0
a  2002  shanghai   2.9   NaN
a  2003  shanghai   3.2   3.0

In [96]: f2.index.is_unique  # use this attribute to check whether the index labels are unique
Out[96]: False
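As a minimal sketch of why duplicate labels are awkward, the snippet below shows that label lookup returns a different kind of result depending on whether the label is repeated. The Series s and its values are made up for illustration and are not part of the session above.

```python
import pandas as pd

# Hypothetical data: label 'a' appears twice, label 'b' once.
s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

single = s.loc['b']      # unique label: a scalar value
multiple = s.loc['a']    # duplicated label: a Series with two rows

# Index.duplicated marks every repeat after the first occurrence.
dup_mask = s.index.duplicated(keep='first')

# Keep only the first row for each label to restore uniqueness.
deduped = s[~dup_mask]

print(s.index.is_unique)        # False
print(deduped.index.is_unique)  # True
```

Code that expects s.loc[label] to be a scalar can break once a label is repeated, which is one reason to check is_unique early.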

Index objects also support set operations (intersection, union, difference, and symmetric difference), much like Python's built-in set.
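The set operations can be sketched as follows, using the Index methods intersection, union, difference, and symmetric_difference. The two indexes left and right are illustrative values chosen here, not taken from the session above.

```python
import pandas as pd

# Two illustrative indexes that overlap on 'b' and 'c'.
left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

common = left.intersection(right)              # labels present in both
combined = left.union(right)                   # labels present in either
only_left = left.difference(right)             # labels in left but not in right
exclusive = left.symmetric_difference(right)   # labels in exactly one of the two

print(list(common))      # ['b', 'c']
print(list(combined))    # ['a', 'b', 'c', 'd']
print(list(only_left))   # ['a']
print(list(exclusive))   # ['a', 'd']
```

Each method returns a new Index and leaves both inputs unchanged, which is consistent with Index being immutable.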

2. Reindexing

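The reindex method creates a new object whose data conform to a new index. It does not modify the original in place. Instead, it rearranges the existing data to match the new labels.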

In [96]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [97]: obj
Out[97]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

reindex arranges the data according to the new index, and any label that does not exist in the original gets a missing value:

In [99]: obj2 = obj.reindex(list('abcde'))

In [100]: obj2
Out[100]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

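The method parameter specifies how to fill the gaps that reindexing introduces: ffill fills forward (each gap takes the previous valid value), and bfill fills backward (each gap takes the next valid value):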

In [101]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [102]: obj3
Out[102]:
0      blue
2    purple
4    yellow
dtype: object

In [103]: obj3.reindex(range(6), method='ffill')
Out[103]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
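For comparison, here is a small sketch of the other two ways to handle the gaps: method='bfill', and a constant supplied through the fill_value parameter of reindex. The name colors and the placeholder string 'unknown' are illustrative choices, not part of the original session.

```python
import pandas as pd

# Same data as the ffill example: labels 1, 3 and 5 are absent.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# bfill: each new label takes the value of the next existing label.
# Label 5 has no later label to borrow from, so it stays missing.
backward = colors.reindex(range(6), method='bfill')

# fill_value: every new label gets the same constant instead of NaN.
constant = colors.reindex(range(6), fill_value='unknown')

print(backward)
print(constant)
```

Note that the method argument only works when the original index is monotonically increasing or decreasing. Otherwise pandas raises a ValueError.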

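For a two-dimensional object such as a DataFrame, passing a single list to reindex changes the row index by default. Use the columns keyword argument to reindex the columns instead: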

In [104]: f = pd.DataFrame(np.arange(9).reshape((3, 3)), index=list('acd'), columns=['beijing', 'shanghai', 'guangzhou'])

In [105]: f
Out[105]:
   beijing  shanghai  guangzhou
a        0         1          2
c        3         4          5
d        6         7          8

In [106]: f2 = f.reindex(list('abcd'))

In [107]: f2
Out[107]:
   beijing  shanghai  guangzhou
a      0.0       1.0        2.0
b      NaN       NaN        NaN
c      3.0       4.0        5.0
d      6.0       7.0        8.0

In [112]: f3 = f.reindex(columns=['beijing', 'shanghai', 'xian', 'guangzhou'])

In [113]: f3
Out[113]:
   beijing  shanghai  xian  guangzhou
a        0         1   NaN          2
c        3         4   NaN          5
d        6         7   NaN          8
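Rows and columns can also be reindexed in a single call by passing both the index and columns keywords. The sketch below rebuilds the same frame and also shows fill_value on a DataFrame. The variable names both and filled are illustrative.

```python
import numpy as np
import pandas as pd

f = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=list('acd'),
    columns=['beijing', 'shanghai', 'guangzhou'],
)

# Reindex rows and columns together: row 'b' and column 'xian' are new,
# so every cell in them is filled with NaN.
both = f.reindex(
    index=list('abcd'),
    columns=['beijing', 'shanghai', 'xian', 'guangzhou'],
)

# Use a constant instead of NaN for the newly introduced row.
filled = f.reindex(index=list('abcd'), fill_value=0)

print(both)
print(filled)
print(f.shape)  # the original frame is untouched: (3, 3)
```

Because reindex returns a new object, the result must be assigned to a name to be kept.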

