Pandas Indexing and Slicing


Series and DataFrame indexing work on the same principles, so this tutorial focuses on DataFrame indexing.

  • Column indexing: df['column name'] (a Series has no column index)
  • Row indexing: df.loc[], df.iloc[]

Selecting columns / selecting rows / slicing / boolean conditions

import numpy as np
import pandas as pd  
# Import the numpy and pandas modules

# Selecting rows and columns

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)

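data1 = df['a']           # column indexing
data2 = df[['a','c']]     # note: selecting several columns needs double brackets, i.e. a list of names ['col1','col2',...,'coln']
print(data1,type(data1))
print(data2,type(data2))
print('-----')
# Select columns by name: one column returns a Series, several columns return a DataFrame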

data3 = df.loc['one']                  # row indexing
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))
# Select rows by index label: one row returns a Series, several rows return a DataFrame

Output:

               a          b          c          d
one     5.191896  33.756807  55.531059  48.271119
two    73.611065  25.943409  63.896590  10.736052
three  82.450101  45.914238  37.840761  64.896341
one       5.191896
two      73.611065
three    82.450101
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
               a          c
one     5.191896  55.531059
two    73.611065  63.896590
three  82.450101  37.840761 <class 'pandas.core.frame.DataFrame'>
-----
a     5.191896
b    33.756807
c    55.531059
d    48.271119
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
             a          b          c          d
one   5.191896  33.756807  55.531059  48.271119
two  73.611065  25.943409  63.896590  10.736052 <class 'pandas.core.frame.DataFrame'>
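The rule that a single label returns a Series while a list of labels returns a DataFrame also holds when the list has only one entry. The sketch below illustrates this on a small frame with made-up values (the integers from `np.arange` are an assumption for illustration, not the random data used above).

```python
import numpy as np
import pandas as pd

# Illustrative frame: same labels as above, but fixed integer values.
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

col_series = df['a']          # single label      -> Series
col_frame = df[['a']]         # one-element list  -> one-column DataFrame
row_series = df.loc['one']    # single label      -> Series
row_frame = df.loc[['one']]   # one-element list  -> one-row DataFrame

for name, obj in [('df["a"]', col_series), ('df[["a"]]', col_frame),
                  ('df.loc["one"]', row_series), ('df.loc[["one"]]', row_frame)]:
    print(name, type(obj).__name__, obj.shape)
```

So it is the brackets, not the number of labels, that decide whether the result keeps two dimensions.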

2. Selecting / indexing columns

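# df[] - select columns
# Mostly used to select columns; it can also select rows (with a slice), but that is not recommended. Use .loc and .iloc for rows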

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)
print('-----')

data1 = df['a']
data2 = df[['b','c']]  # try data2 = df[['b','c','e']] and see what happens ('e' is not a column)
print(data1)
print(data2)
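# df[] selects columns by default; put column names inside [] (so columns are usually given explicit names rather than the default integer labels, to avoid confusion with the index)
# A single column is a Series and prints in Series format
# Several columns are a DataFrame and print in DataFrame format

# Key note: df[col] is generally used to select columns; put the column name(s) inside []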

Output:

               a          b          c          d
one    32.302368  89.444542  70.904647   3.899547
two    71.309217  63.006986  73.751675  34.063717
three  13.534943  84.102451  48.329891  33.537992
-----
one      32.302368
two      71.309217
three    13.534943
Name: a, dtype: float64
               b          c
one    89.444542  70.904647
two    63.006986  73.751675
three  84.102451  48.329891
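The comments above mention that df[] can also select rows, and suggest trying a column name that does not exist. A minimal sketch of both cases follows, using made-up integer values; the KeyError for a list with a missing column assumes a recent pandas release (older versions behaved differently).

```python
import numpy as np
import pandas as pd

# Illustrative frame with fixed values (not the random data above).
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

# A slice inside df[] is read as a row slice, not a column selection.
first_two = df[0:2]
print(first_two)

# A single label that is not a column raises KeyError.
try:
    df['e']
    missing_raised = False
except KeyError:
    missing_raised = True
print('KeyError for df["e"]:', missing_raised)

# A list containing a missing column: KeyError on recent pandas (assumption).
try:
    df[['b', 'c', 'e']]
    print('no error for the list form on this pandas version')
except KeyError as err:
    print('KeyError for the list form:', err)
```

Because the same brackets mean columns for labels and rows for slices, using .loc or .iloc for rows keeps the intent explicit.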

3.  Selecting / indexing rows

# df.loc[] - select rows by index label

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df2)
print('-----')

data1 = df1.loc['one']
data2 = df2.loc[1]
print(data1)
print(data2)
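print('Single-label indexing\n-----')
# A single label returns a Series

data3 = df1.loc[['two','three','five']]  # 'five' is not in the index, so this triggers a warning
data4 = df2.loc[[3,2,1]]
print(data3)
print(data4)
print('Multi-label indexing\n-----')
# Multiple labels can be passed in any order
# In the pandas version used here, a missing label ('five') returns a row of NaN plus a FutureWarning
# Since pandas 1.0 it raises KeyError instead; df1.reindex([...]) gives the NaN-filled result

data5 = df1.loc['one':'three']    # from the start label to the end label; the end label is included
data6 = df2.loc[1:3]
print(data5)
print(data6)
print('Slice indexing')
# Slices are supported
# The end label is included

# Key note: df.loc[label] selects rows by index label, and works with both the default index and a user-specified one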

Output:

               a          b          c          d
one    41.473536  36.036192  61.836041  13.373447
two    83.709165  96.248540  31.266231  84.736594
three  48.617461  82.627569  68.185809  71.803329
four   38.772901  89.275885  84.279757  78.687116
           a          b          c          d
0   1.387796  39.795388  12.439624  20.428982
1  88.289011  47.849035  50.188306  77.745736
2  20.914579  13.127105  28.333499  73.411151
3  27.545903  89.901712  14.438023  81.676334
-----
a    41.473536
b    36.036192
c    61.836041
d    13.373447
Name: one, dtype: float64
a    88.289011
b    47.849035
c    50.188306
d    77.745736
Name: 1, dtype: float64
Single-label indexing
-----
               a          b          c          d
two    83.709165  96.248540  31.266231  84.736594
three  48.617461  82.627569  68.185809  71.803329
five         NaN        NaN        NaN        NaN
           a          b          c          d
3  27.545903  89.901712  14.438023  81.676334
2  20.914579  13.127105  28.333499  73.411151
1  88.289011  47.849035  50.188306  77.745736
Multi-label indexing
-----
               a          b          c          d
one    41.473536  36.036192  61.836041  13.373447
two    83.709165  96.248540  31.266231  84.736594
three  48.617461  82.627569  68.185809  71.803329
           a          b          c          d
1  88.289011  47.849035  50.188306  77.745736
2  20.914579  13.127105  28.333499  73.411151
3  27.545903  89.901712  14.438023  81.676334
Slice indexing
C:\Users\iHJX_Alienware\Anaconda3\lib\site-packages\ipykernel\__main__.py:19: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
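The FutureWarning above points to .reindex() as the replacement for passing missing labels to .loc. The following sketch shows that alternative on made-up values; filtering the label list first is a second option added here for illustration and is not from the original text.

```python
import numpy as np
import pandas as pd

# Illustrative frame with fixed float values (not the random data above).
df1 = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4),
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

labels = ['two', 'three', 'five']   # 'five' is not in the index

# reindex() returns the requested labels in order; missing ones become NaN rows.
reindexed = df1.reindex(labels)
print(reindexed)

# Alternative: keep only the labels that exist, then use .loc as usual.
existing = [label for label in labels if label in df1.index]
print(df1.loc[existing])
```

Which of the two to use depends on whether the missing label should appear in the result as an empty row or be dropped.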

4. Another way to index rows:

# df.iloc[] - select rows by integer position (from 0 to length-1 along the axis)
# Works like list indexing: the order is the DataFrame's integer position, counting from 0

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

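print(df.iloc[0])     # pass the position directly; 0 is the first row
print(df.iloc[-1])
#print(df.iloc[4])
print('Single-position indexing\n-----')
# Single-position indexing
# A position beyond the number of rows raises IndexError (for example df.iloc[4] here)

print(df.iloc[[0,2]])  
print(df.iloc[[3,2,1]])
print('Multi-position indexing\n-----')
# Multi-position indexing
# Positions can be given in any order

print(df.iloc[1:3])
print(df.iloc[:2])    # like list slicing, the end position is excluded (the third row is left out); this differs from loc
print(df.iloc[::2])
print('Slice indexing')
# Slice indexing
# The end position is excluded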

Output:

               a          b          c          d
one    40.344453  97.884228  24.426729  12.624394
two    76.042829  86.362548   2.393513  92.894224
three  57.122758  45.150241  95.613046  63.914110
four   89.905096  63.079797  85.669807   0.008500
------
a    40.344453
b    97.884228
c    24.426729
d    12.624394
Name: one, dtype: float64
a    89.905096
b    63.079797
c    85.669807
d     0.008500
Name: four, dtype: float64
Single-position indexing
-----
               a          b          c          d
one    40.344453  97.884228  24.426729  12.624394
three  57.122758  45.150241  95.613046  63.914110
               a          b          c          d
four   89.905096  63.079797  85.669807   0.008500
three  57.122758  45.150241  95.613046  63.914110
two    76.042829  86.362548   2.393513  92.894224
Multi-position indexing
-----
               a          b          c          d
two    76.042829  86.362548   2.393513  92.894224
three  57.122758  45.150241  95.613046  63.914110
             a          b          c          d
one  40.344453  97.884228  24.426729  12.624394
two  76.042829  86.362548   2.393513  92.894224
               a          b          c          d
one    40.344453  97.884228  24.426729  12.624394
three  57.122758  45.150241  95.613046  63.914110
Slice indexing
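The difference in slice endpoints between loc and iloc is easy to get wrong, so here is a side-by-side sketch on a small frame with made-up values. It also shows how out-of-range positions behave for a single position versus a slice.

```python
import numpy as np
import pandas as pd

# Illustrative frame with fixed values (not the random data above).
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

by_label = df.loc['one':'three']   # label slice: the end label is included
by_position = df.iloc[0:2]         # position slice: the end position is excluded
print(list(by_label.index))
print(list(by_position.index))

# A single out-of-range position raises IndexError...
try:
    df.iloc[4]
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True
print('IndexError for df.iloc[4]:', out_of_range_raised)

# ...but an out-of-range slice is truncated, as with Python lists.
truncated = df.iloc[2:10]
print(list(truncated.index))
```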

5. Boolean indexing

# Boolean indexing works the same way as boolean indexing in numpy
# Mostly used to select rows
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

b1 = df < 20
print(b1,type(b1))
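print(df[b1])  # equivalent to df[df < 20]; values that meet the condition are kept, the rest become NaN
print('------')
# Without selecting a row or column first, every value in the frame is tested
# The result keeps the full shape: True positions return the original value, False positions return NaN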

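b2 = df['a'] > 50   # True for rows where column a is greater than 50; used to select rows
print(b2,type(b2))
print(df[b2])  # equivalent to df[df['a'] > 50]
# To filter the rows where column a is greater than 50 and keep only columns b and c:
print(df[df['a']>50][['b','c']],'ha ha ha ha')
print('------')
# Condition on a single column
# The result keeps the rows where the condition is True, including all other columns

# This differs from a NumPy array; compare:
ar = np.random.randn(20,2)*50
print(ar[ar>5],'array array array!!!')   # the array keeps only the elements greater than 5 (as a flat 1-D array); the rest are dropped, not replaced with NaN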

b3 = df[['a','b']] > 50
print(b3,type(b3))
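print(df[b3])  # equivalent to df[df[['a','b']] > 50]
print('------')
# Condition on several columns
# The result keeps the full shape: True positions return the original value, everything else (including columns not in the mask) is NaN
# If this raises an error, update pandas → conda update pandas

b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4])  # equivalent to df[df.loc[['one','three']] < 50]
print('------')
# Condition on several rows
# The result keeps the full shape: True positions return the original value, everything else (including rows not in the mask) is NaN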

Output:

               a          b          c          d
one    42.182880  16.944943  97.143421  16.715137
two     3.894318   1.655007  62.291734  73.600681
three  96.052714   3.845297  43.290603  36.172796
four    8.988430  38.483679  51.538006  60.855976
------
           a      b      c      d
one    False   True  False   True
two     True   True  False  False
three  False   True  False  False
four    True  False  False  False <class 'pandas.core.frame.DataFrame'>
              a          b   c          d
one         NaN  16.944943 NaN  16.715137
two    3.894318   1.655007 NaN        NaN
three       NaN   3.845297 NaN        NaN
four   8.988430        NaN NaN        NaN
------
one      False
two      False
three     True
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
               a         b          c          d
three  96.052714  3.845297  43.290603  36.172796
              b          c
three  3.845297  43.290603 ha ha ha ha
------
[126.5305168   76.76672929  67.54122606  46.95383418 108.70865373
  77.67833227  17.48275006  19.85031457  25.70929928  28.68636573
  44.54084001  35.11082135  64.24927152  37.96842756  16.79771495
  16.35297097  29.9591603   36.49625972   7.3347084   24.82526937
  36.31873796  21.64895926  36.75066597] array array array!!!
           a      b
one    False  False
two    False  False
three   True  False
four   False  False <class 'pandas.core.frame.DataFrame'>
               a   b   c   d
one          NaN NaN NaN NaN
two          NaN NaN NaN NaN
three  96.052714 NaN NaN NaN
four         NaN NaN NaN NaN
------
           a     b      c     d
one     True  True  False  True
three  False  True   True  True <class 'pandas.core.frame.DataFrame'>
              a          b          c          d
one    42.18288  16.944943        NaN  16.715137
two         NaN        NaN        NaN        NaN
three       NaN   3.845297  43.290603  36.172796
four        NaN        NaN        NaN        NaN
------
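The filter "column a greater than 50, keep only b and c" shown earlier can also be written as a single .loc call, and several conditions can be combined in one mask. The sketch below uses small integer values chosen for illustration (loosely rounded from the output above, not the actual data).

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame([[42, 16, 97, 16],
                   [3, 1, 62, 73],
                   [96, 3, 43, 36],
                   [8, 38, 51, 60]],
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# Row filter and column selection in one .loc call: [row mask, column list].
picked = df.loc[df['a'] > 50, ['b', 'c']]
print(picked)

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = df[(df['a'] < 50) & (df['d'] > 50)]
print(both)

# A boolean DataFrame used as the key gives the same result as DataFrame.where().
masked = df[df < 20]
same = masked.equals(df.where(df < 20))
print(same)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.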

6. Combined indexing

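# Combined indexing: for example, indexing rows and columns together
# Select columns first, then rows: in other words, pick the fields first, then pick which records to keep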

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

print(df['a'].loc[['one','three']])   # rows one and three of column a
print(df[['b','c','d']].iloc[::2])   # rows one and three of columns b, c, d
print(df[df['a'] < 50].iloc[:2])   # the first two rows that satisfy the condition

Output:

               a          b          c          d
one    48.981007  79.206804  43.775695   5.205462
two    43.786019  15.436499  85.919123  84.083483
three  94.546433  59.227961  97.579354  37.942078
four   11.292684   8.417224  38.782994  17.420902
------
one      48.981007
three    94.546433
Name: a, dtype: float64
               b          c          d
one    79.206804  43.775695   5.205462
three  59.227961  97.579354  37.942078
             a          b          c          d
one  48.981007  79.206804  43.775695   5.205462
two  43.786019  15.436499  85.919123  84.083483
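The chained selections above (columns first, then rows) can also be expressed as one .loc or .iloc call that takes a row selector and a column selector together. This is a sketch of that equivalent form on made-up values, not a replacement prescribed by the original text.

```python
import numpy as np
import pandas as pd

# Illustrative frame with fixed values (not the random data above).
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# Rows and columns in one call: [row selector, column selector].
a_rows = df.loc[['one', 'three'], 'a']   # same as df['a'].loc[['one','three']]
bcd_rows = df.iloc[::2, 1:4]             # same as df[['b','c','d']].iloc[::2]

# A boolean filter followed by a positional slice still needs two steps.
first_two_small = df.loc[df['a'] < 10].iloc[:2]

print(a_rows)
print(bcd_rows)
print(first_two_small)
```

The single-call form matters most when assigning: `df.loc[rows, col] = value` updates df directly, whereas assigning through a chained expression may act on a temporary copy.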

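Exercises:

 Exercise 1: Create a DataFrame (4*4, random values in the range 0-100) and use indexing to get the following:

① All values in columns b and c

② The data in the third and fourth rows

③ The rows two and one, in that order

④ The values greater than 50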

import numpy as np
import pandas as pd
# Exercise 1
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index=['one','two','three','four'],
                  columns=['a','b','c','d'])
print(df)

print(df[['b','c']])
print(df.loc[['three','four']])
print(df.iloc[2:4])  # or print(df.iloc[[2,3]]) / print(df.iloc[2:])

print(df.loc[['two','one']])

b = df[df>50]
print(b)

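Exercise 2: Create a Series with 10 elements, each a uniformly distributed random value in the range 0-100, with index a-j, then select:

① The values with labels b and c

② The 4th to 6th values of the Series

③ The values in the Series greater than 50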

# Exercise 2
df1 = pd.Series(np.random.rand(10)*100,index=['a','b','c','d','e','f','g','h','i','j'])
print(df1)
print(df1.loc[['b','c']])
print(df1.iloc[3:6])  # the 4th to 6th values sit at positions 3, 4 and 5

print(df1[df1>50])

 

