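Series and DataFrame indexing work on the same principle, so we study indexing mainly through the DataFrame.
- Column indexing: df['column name'] (a Series has no column index)
- Row indexing: df.loc[], df.iloc[]
Topics covered: selecting columns / selecting rows / slicing / Boolean tests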
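As a quick illustration of the points above, here is a minimal sketch that contrasts a Series (one axis) with a DataFrame (two axes). The data values and the variable names (`s`, `by_label`, `col_a`, and so on) are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Illustrative data; the values are arbitrary and only serve this sketch
s = pd.Series([10, 20, 30], index=['one', 'two', 'three'])
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

# A Series has a single axis, so .loc and .iloc both address elements
by_label = s.loc['two']      # label-based lookup
by_position = s.iloc[1]      # position-based lookup of the same element

# A DataFrame has two axes: df['col'] picks a column, .loc/.iloc pick rows
col_a = df['a']              # one column, returned as a Series
row_two = df.loc['two']      # one row by label, returned as a Series
row_1 = df.iloc[1]           # the same row by integer position

print(by_label, by_position)
print(type(col_a).__name__, type(row_two).__name__)
print(row_two.equals(row_1))
```

The same .loc / .iloc accessors work on both objects; only the column step is specific to the DataFrame.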
import numpy as np
import pandas as pd
# import the numpy and pandas modules

# Selecting rows and columns
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print(df)

data1 = df['a']         # index a single column
data2 = df[['a','c']]   # note: selecting several columns takes double brackets: df[['col1','col2',...,'coln']]
print(data1,type(data1))
print(data2,type(data2))
print('-----')
# Selecting by column name: one column returns a Series, several columns return a DataFrame

data3 = df.loc['one']           # index a single row
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))
# Selecting rows by index label: one row returns a Series, several rows return a DataFrame

Output:

               a          b          c          d
one     5.191896  33.756807  55.531059  48.271119
two    73.611065  25.943409  63.896590  10.736052
three  82.450101  45.914238  37.840761  64.896341
one       5.191896
two      73.611065
three    82.450101
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
               a          c
one     5.191896  55.531059
two    73.611065  63.896590
three  82.450101  37.840761 <class 'pandas.core.frame.DataFrame'>
-----
a     5.191896
b    33.756807
c    55.531059
d    48.271119
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
             a          b          c          d
one   5.191896  33.756807  55.531059  48.271119
two  73.611065  25.943409  63.896590  10.736052 <class 'pandas.core.frame.DataFrame'>
2. Selecting / indexing columns
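# df[] - select columns
# Generally used to select columns. It can also select rows, but that is not recommended: use .loc and .iloc for row indexing
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print(df)
print('-----')

data1 = df['a']
data2 = df[['b','c']]   # try data2 = df[['b','c','e']]: 'e' is not a column, so it raises KeyError
print(data1)
print(data2)
# df[] selects columns by default; write the column names inside [] (so columns are usually named explicitly instead of keeping the default integer names, to avoid clashing with the index)
# Selecting one column returns a Series, and print shows it in Series format
# Selecting several columns returns a DataFrame, and print shows it in DataFrame format
# Key point: df[col] is generally used to select columns; write the column name(s) inside []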
Output:
               a          b          c          d
one    32.302368  89.444542  70.904647   3.899547
two    71.309217  63.006986  73.751675  34.063717
three  13.534943  84.102451  48.329891  33.537992
-----
one      32.302368
two      71.309217
three    13.534943
Name: a, dtype: float64
               b          c
one    89.444542  70.904647
two    63.006986  73.751675
three  84.102451  48.329891
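Two behaviours of df[] mentioned in this section can be checked with a short sketch: asking for a column that does not exist, and passing a slice. The data and the names `missing_raised` and `first_two_rows` are illustrative assumptions, and the slice behaviour is described as observed in common pandas versions.

```python
import numpy as np
import pandas as pd

# Illustrative frame with arbitrary integer values
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

# Requesting a column name that does not exist raises KeyError
try:
    df[['b', 'c', 'e']]
    missing_raised = False
except KeyError as err:
    missing_raised = True
    print('KeyError:', err)

# A slice inside df[] is treated as a row slice, not a column selection.
# This is the "can also select rows" case; .iloc[0:2] states the intent more clearly.
first_two_rows = df[0:2]
print(first_two_rows)
```

Because df[] switches meaning depending on what is passed in, keeping it for columns and using .loc / .iloc for rows tends to make code easier to read.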
3. Selecting / indexing rows
# df.loc[] - select rows by index label
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df2)
print('-----')

data1 = df1.loc['one']
data2 = df2.loc[1]
print(data1)
print(data2)
print('single-label indexing\n-----')
# A single label returns a Series

data3 = df1.reindex(['two','three','five'])   # 'five' is not in the index, so its row is filled with NaN
data4 = df2.loc[[3,2,1]]
print(data3)
print(data4)
print('multi-label indexing\n-----')
# Several labels return a DataFrame, and the order can be changed freely
# The original run used df1.loc[['two','three','five']]: that pandas version returned a NaN row for the missing label and emitted
# "FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative."
# See https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
# Since pandas 1.0 that call raises KeyError, so .reindex() is used here to get the same NaN row

data5 = df1.loc['one':'three']   # from the start label to the end label, end included
data6 = df2.loc[1:3]
print(data5)
print(data6)
print('slice indexing')
# Slices are supported
# The end label is included
# Key point: df.loc[label] mainly selects rows by index label, and works with a custom index as well as the default integer index

Output:

               a          b          c          d
one    41.473536  36.036192  61.836041  13.373447
two    83.709165  96.248540  31.266231  84.736594
three  48.617461  82.627569  68.185809  71.803329
four   38.772901  89.275885  84.279757  78.687116
           a          b          c          d
0   1.387796  39.795388  12.439624  20.428982
1  88.289011  47.849035  50.188306  77.745736
2  20.914579  13.127105  28.333499  73.411151
3  27.545903  89.901712  14.438023  81.676334
-----
a    41.473536
b    36.036192
c    61.836041
d    13.373447
Name: one, dtype: float64
a    88.289011
b    47.849035
c    50.188306
d    77.745736
Name: 1, dtype: float64
single-label indexing
-----
               a          b          c          d
two    83.709165  96.248540  31.266231  84.736594
three  48.617461  82.627569  68.185809  71.803329
five         NaN        NaN        NaN        NaN
           a          b          c          d
3  27.545903  89.901712  14.438023  81.676334
2  20.914579  13.127105  28.333499  73.411151
1  88.289011  47.849035  50.188306  77.745736
multi-label indexing
-----
               a          b          c          d
one    41.473536  36.036192  61.836041  13.373447
two    83.709165  96.248540  31.266231  84.736594
three  48.617461  82.627569  68.185809  71.803329
           a          b          c          d
1  88.289011  47.849035  50.188306  77.745736
2  20.914579  13.127105  28.333499  73.411151
3  27.545903  89.901712  14.438023  81.676334
slice indexing
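The handling of a label that is not in the index depends on the pandas version, so the following is a minimal sketch assuming pandas 1.0 or later, where .loc raises KeyError for a missing label in a list. It shows two common alternatives; the data and the names `labels`, `filled`, and `present` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Illustrative frame with arbitrary integer values
df1 = pd.DataFrame(np.arange(16).reshape(4, 4),
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])
labels = ['two', 'three', 'five']   # 'five' is not in the index

# Assumption: pandas >= 1.0, where a missing label in a list raises KeyError
try:
    df1.loc[labels]
    loc_raised = False
except KeyError:
    loc_raised = True

# Option 1: reindex keeps every requested label and fills missing rows with NaN
filled = df1.reindex(labels)

# Option 2: keep only the labels that actually exist, then use .loc
present = df1.loc[[label for label in labels if label in df1.index]]

print(loc_raised)
print(filled)
print(present)
```

Which option fits depends on whether a missing label should show up as an empty row or be dropped silently.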
4. Another way to index rows:
# df.iloc[] - select rows by integer position (from 0 to length-1 along the axis)
# Works like list indexing: the order is the DataFrame's integer position, counted from 0
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['one','two','three','four'],
                  columns = ['a','b','c','d'])
print(df)
print('------')

print(df.iloc[0])    # give the position directly; 0 is the first row
print(df.iloc[-1])   # -1 is the last row
#print(df.iloc[4])   # raises IndexError: position 4 is out of bounds for 4 rows
print('single-position indexing\n-----')
# Single-position indexing
# An integer position beyond the number of rows cannot be indexed

print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])
print('multi-position indexing\n-----')
# Multi-position indexing
# The order can be changed freely

print(df.iloc[1:3])
print(df.iloc[:2])    # like list slicing: the third row (position 2) is excluded, which differs from loc
print(df.iloc[::2])
print('slice indexing')
# Slice indexing
# The end position is excluded

Output:

               a          b          c          d
one    40.344453  97.884228  24.426729  12.624394
two    76.042829  86.362548   2.393513  92.894224
three  57.122758  45.150241  95.613046  63.914110
four   89.905096  63.079797  85.669807   0.008500
------
a    40.344453
b    97.884228
c    24.426729
d    12.624394
Name: one, dtype: float64
a    89.905096
b    63.079797
c    85.669807
d     0.008500
Name: four, dtype: float64
single-position indexing
-----
               a          b          c          d
one    40.344453  97.884228  24.426729  12.624394
three  57.122758  45.150241  95.613046  63.914110
               a          b          c          d
four   89.905096  63.079797  85.669807   0.008500
three  57.122758  45.150241  95.613046  63.914110
two    76.042829  86.362548   2.393513  92.894224
multi-position indexing
-----
               a          b          c          d
two    76.042829  86.362548   2.393513  92.894224
three  57.122758  45.150241  95.613046  63.914110
             a          b          c          d
one  40.344453  97.884228  24.426729  12.624394
two  76.042829  86.362548   2.393513  92.894224
               a          b          c          d
one    40.344453  97.884228  24.426729  12.624394
three  57.122758  45.150241  95.613046  63.914110
slice indexing
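The difference in slice endpoints between .loc and .iloc is easy to mix up, so here is a small side-by-side sketch. The data and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative frame with arbitrary integer values
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

by_label = df.loc['one':'three']   # label slice: the end label 'three' is included
by_pos = df.iloc[0:3]              # position slice: position 3 is excluded
by_pos_short = df.iloc[0:2]        # stops before position 2, so 'three' is left out

# A single out-of-range position raises IndexError ...
try:
    df.iloc[4]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

# ... but an out-of-range slice is simply truncated, as with Python lists
tail = df.iloc[2:10]

print(by_label.equals(by_pos))
print(len(by_pos_short), out_of_bounds, len(tail))
```

To get the same three rows, the label slice names the last row it wants, while the position slice names the first position it does not want.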
5. Boolean indexing
# Boolean indexing means the same thing as Boolean indexing in numpy
# Mostly used to index rows
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['one','two','three','four'],
                  columns = ['a','b','c','d'])
print(df)
print('------')

b1 = df < 20
print(b1,type(b1))
print(df[b1])   # can also be written df[df < 20]: values that meet the condition are kept, the rest become NaN
print('------')
# Without selecting a column first, every value in the data is tested
# The result keeps the full shape: True returns the original value, False returns NaN

b2 = df['a'] > 50   # test column a only; the resulting mask then selects rows
print(b2,type(b2))
print(df[b2])   # can also be written df[df['a'] > 50]
# To filter the rows where column a is greater than 50 and keep only columns b and c:
print(df[df['a']>50][['b','c']],'hahaha')
print('------')
# Test on a single column
# The result keeps the rows where the test is True, including their other columns

# This differs from arrays; compare with a numpy array:
ar = np.random.randn(20,2)*50
print(ar[ar>5],'array array array!!!')
# An array keeps only the elements greater than 5 and drops the rest; it returns a flat 1-D array with no NaN placeholders

b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3])   # can also be written df[df[['a','b']] > 50]
print('------')
# Test on several columns
# The result keeps the full shape: True returns the original value, False returns NaN
# Note: if this raises an error, update pandas → conda update pandas

b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4])   # can also be written df[df.loc[['one','three']] < 50]
print('------')
# Test on several rows
# The result keeps the full shape: True returns the original value, False returns NaN

Output:

               a          b          c          d
one    42.182880  16.944943  97.143421  16.715137
two     3.894318   1.655007  62.291734  73.600681
three  96.052714   3.845297  43.290603  36.172796
four    8.988430  38.483679  51.538006  60.855976
------
           a      b      c      d
one    False   True  False   True
two     True   True  False  False
three  False   True  False  False
four    True  False  False  False <class 'pandas.core.frame.DataFrame'>
              a          b   c          d
one         NaN  16.944943 NaN  16.715137
two    3.894318   1.655007 NaN        NaN
three       NaN   3.845297 NaN        NaN
four   8.988430        NaN NaN        NaN
------
one      False
two      False
three     True
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
               a         b          c          d
three  96.052714  3.845297  43.290603  36.172796
              b          c
three  3.845297  43.290603 hahaha
------
[126.5305168   76.76672929  67.54122606  46.95383418 108.70865373
  77.67833227  17.48275006  19.85031457  25.70929928  28.68636573
  44.54084001  35.11082135  64.24927152  37.96842756  16.79771495
  16.35297097  29.9591603   36.49625972   7.3347084   24.82526937
  36.31873796  21.64895926  36.75066597] array array array!!!
           a      b
one    False  False
two    False  False
three   True  False
four   False  False <class 'pandas.core.frame.DataFrame'>
               a   b   c   d
one          NaN NaN NaN NaN
two          NaN NaN NaN NaN
three  96.052714 NaN NaN NaN
four         NaN NaN NaN NaN
------
           a     b      c     d
one     True  True  False  True
three  False  True   True  True <class 'pandas.core.frame.DataFrame'>
              a          b          c          d
one    42.18288  16.944943        NaN  16.715137
two         NaN        NaN        NaN        NaN
three       NaN   3.845297  43.290603  36.172796
four        NaN        NaN        NaN        NaN
------
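Filtering rows by a condition and keeping only some columns can also be written in a single .loc call, and several conditions can be combined with the element-wise operators `&`, `|` and `~`. The sketch below uses small made-up integers (loosely rounded from the sample above) so that the results are easy to check by hand; all names are illustrative assumptions.

```python
import pandas as pd

# Illustrative frame; values are made up so the masks can be checked by hand
df = pd.DataFrame({'a': [42, 3, 96, 8],
                   'b': [16, 1, 3, 38],
                   'c': [97, 62, 43, 51],
                   'd': [16, 73, 36, 60]},
                  index=['one', 'two', 'three', 'four'])

# Combine conditions with & (and), | (or), ~ (not).
# Each condition needs its own parentheses because & and | bind tighter than > and <.
mask = (df['a'] > 5) & (df['c'] > 50)

# Row mask and column list in one .loc call
selected = df.loc[mask, ['b', 'c']]

either = df[(df['a'] > 50) | (df['d'] > 70)]   # rows meeting at least one condition
negated = df[~(df['a'] > 50)]                  # rows where a is NOT greater than 50

print(selected)
print(either)
print(negated)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the symbolic operators are used here.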
6. Combined indexing
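# Combined indexing: for example, indexing rows and columns together
# Select columns first, then rows: in other words, first pick the fields, then pick how many records to keep
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['one','two','three','four'],
                  columns = ['a','b','c','d'])
print(df)
print('------')

print(df['a'].loc[['one','three']])   # rows one and three of column a
print(df[['b','c','d']].iloc[::2])    # rows one and three of columns b, c, d
print(df[df['a'] < 50].iloc[:2])      # first two rows that satisfy the Boolean test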
Output:
               a          b          c          d
one    48.981007  79.206804  43.775695   5.205462
two    43.786019  15.436499  85.919123  84.083483
three  94.546433  59.227961  97.579354  37.942078
four   11.292684   8.417224  38.782994  17.420902
------
one      48.981007
three    94.546433
Name: a, dtype: float64
               b          c          d
one    79.206804  43.775695   5.205462
three  59.227961  97.579354  37.942078
             a          b          c          d
one  48.981007  79.206804  43.775695   5.205462
two  43.786019  15.436499  85.919123  84.083483
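The column-then-row selections in this section can also be expressed as a single .loc or .iloc call with a row selector and a column selector. The following sketch, with made-up integer data and hypothetical variable names, checks that both spellings return the same result and shows the single-call form being used for assignment.

```python
import pandas as pd

# Illustrative frame; values are made up so the results can be checked by hand
df = pd.DataFrame({'a': [48, 43, 94, 11],
                   'b': [79, 15, 59, 8],
                   'c': [43, 85, 97, 38],
                   'd': [5, 84, 37, 17]},
                  index=['one', 'two', 'three', 'four'])

# Two-step selection (columns first, then rows) ...
chained = df['a'].loc[['one', 'three']]
chained2 = df[['b', 'c', 'd']].iloc[::2]

# ... and the equivalent single call: rows first, columns second
direct = df.loc[['one', 'three'], 'a']
direct2 = df.iloc[::2, 1:4]

print(chained.equals(direct), chained2.equals(direct2))

# For assignment, the single-call form writes to the frame it is called on.
# A copy is used here so the original df stays unchanged.
updated = df.copy()
updated.loc[updated['a'] < 50, 'd'] = 0
print(updated)
```

For reading data the two styles give the same values; for writing, the single .loc call is the safer habit because a two-step selection may operate on a temporary copy.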
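Exercises:
Exercise 1: Create a DataFrame as shown in the figure (4*4, values are random numbers from 0 to 100) and use indexing to get the following
① All values in columns b and c
② The data in the third and fourth rows
③ The values of rows two and one, in that order
④ The values greater than 50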

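import numpy as np
import pandas as pd

# Exercise 1
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index=['one','two','three','four'],
                  columns=['a','b','c','d'])
print(df)
print(df[['b','c']])              # ① all values in columns b and c
print(df.loc[['three','four']])   # ② third and fourth rows, by label
print(df.iloc[2:4])               # ② same rows by position; df.iloc[[2,3]] or df.iloc[2:] also work
print(df.loc[['two','one']])      # ③ rows two and one, in that order
b = df[df>50]                     # ④ values greater than 50; the others become NaN
print(b)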
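Exercise 2: Create a Series with 10 elements, each a uniformly distributed random value from 0 to 100, with index a-j, and select:
① The values labelled b and c
② The 4th to 6th values in the Series
③ The values in the Series that are greater than 50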
# Exercise 2
s = pd.Series(np.random.rand(10)*100,
              index=['a','b','c','d','e','f','g','h','i','j'])
print(s)
print(s.loc[['b','c']])   # ① values labelled b and c
print(s.iloc[3:6])        # ② 4th to 6th values: positions 3, 4, 5
print(s[s>50])            # ③ values greater than 50
