Pandas之Series+DataFrame


Series是帶有標簽的一維數組,可以保存任何數據類型(整數,字符串,浮點數,python對象)

index查看series索引,values查看series值

series相比於ndarray,是一個自帶索引index的數組--> 一維數組 + 對應索引

series和dict相比,series更像是一個有順序的字典

創建方法

1.由字典創建,字典的key就是index,values就是values

dic = {'a':1 ,'b':2 , 'c':3, '4':4, '5':5}
s = pd.Series(dic)
print(s)
4    4
5    5
a    1
b    2
c    3
dtype: int64

2.由數組創建(一維數組)

arr = np.random.randn(5)
s = pd.Series(arr)
print(arr)
print(s)
# 默認index是從0開始,步長為1的數字

s = pd.Series(arr, index = ['a','b','c','d','e'],dtype = np.object)
print(s)
# index參數:設置index,長度保持一致
# dtype參數:設置數值類型
[ 0.11206121  0.1324684   0.59930544  0.34707543 -0.15652941]
0    0.112061
1    0.132468
2    0.599305
3    0.347075
4   -0.156529
dtype: float64
a    0.112061
b    0.132468
c    0.599305
d    0.347075
e   -0.156529
dtype: object

3. 由標量創建

s = pd.Series(10, index = range(4))
print(s)
# 如果data是標量值,則必須提供索引。該值會重復,來匹配索引的長度
0    10
1    10
2    10
3    10
dtype: int64

Pandas

1.由數組/list組成的字典

# 創建方法:pandas.Dataframe()

data1 = {'a':[1,2,3],
        'b':[3,4,5],
        'c':[5,6,7]}
data2 = {'one':np.random.rand(3),
        'two':np.random.rand(3)}   # 這里如果嘗試  'two':np.random.rand(4) 會怎么樣?
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
# 由數組/list組成的字典 創建Dataframe,columns為字典key,index為默認數字標簽
# 字典的值的長度必須保持一致!

df1 = pd.DataFrame(data1, columns = ['b','c','a','d'])
print(df1)
df1 = pd.DataFrame(data1, columns = ['b','c'])
print(df1)
# columns參數:可以重新指定列的順序,格式為list,如果現有數據中沒有該列(比如'd'),則產生NaN值
# 如果columns重新指定時候,列的數量可以少於原數據

df2 = pd.DataFrame(data2, index = ['f1','f2','f3'])  # 這里如果嘗試  index = ['f1','f2','f3','f4'] 會怎么樣?
print(df2)
# index參數:重新定義index,格式為list,長度必須保持一致
{'a': [1, 2, 3], 'c': [5, 6, 7], 'b': [3, 4, 5]}
{'one': array([ 0.00101091,  0.08807153,  0.58345056]), 'two': array([ 0.49774634,  0.16782565,  0.76443489])}
   a  b  c
0  1  3  5
1  2  4  6
2  3  5  7
        one       two
0  0.001011  0.497746
1  0.088072  0.167826
2  0.583451  0.764435
   b  c  a    d
0  3  5  1  NaN
1  4  6  2  NaN
2  5  7  3  NaN
   b  c
0  3  5
1  4  6
2  5  7
         one       two
f1  0.001011  0.497746
f2  0.088072  0.167826
f3  0.583451  0.764435

2.由Series組成的字典

data1 = {'one':pd.Series(np.random.rand(2)),
        'two':pd.Series(np.random.rand(3))}  # 沒有設置index的Series
data2 = {'one':pd.Series(np.random.rand(2), index = ['a','b']),
        'two':pd.Series(np.random.rand(3),index = ['a','b','c'])}  # 設置了index的Series
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
# 由Seris組成的字典 創建Dataframe,columns為字典key,index為Series的標簽(如果Series沒有指定標簽,則是默認數字標簽)
# Series可以長度不一樣,生成的Dataframe會出現NaN值
{'one': 0    0.892580
1    0.834076
dtype: float64, 'two': 0    0.301309
1    0.977709
2    0.489000
dtype: float64}
{'one': a    0.470947
b    0.584577
dtype: float64, 'two': a    0.122659
b    0.136429
c    0.396825
dtype: float64}
        one       two
0  0.892580  0.301309
1  0.834076  0.977709
2       NaN  0.489000
        one       two
a  0.470947  0.122659
b  0.584577  0.136429
c       NaN  0.396825

3.通過二維數組直接創建

import numpy as np
import pandas as pd
ar = np.random.rand(9).reshape(3,3)
print(ar)
df1 = pd.DataFrame(ar)
df2 = pd.DataFrame(ar, index = ['a', 'b', 'c'], columns = ['one','two','three'])  # 可以嘗試一下index或columns長度不等於已有數組的情況
print(df1)
print(df2)
# 通過二維數組直接創建Dataframe,得到一樣形狀的結果數據,如果不指定index和columns,兩者均返回默認數字格式
# index和colunms指定長度與原數組保持一致
[[0.10240097 0.64014438 0.35406434]
 [0.17617253 0.48451747 0.76316397]
 [0.47298642 0.51552541 0.8175865 ]]
          0         1         2
0  0.102401  0.640144  0.354064
1  0.176173  0.484517  0.763164
2  0.472986  0.515525  0.817586
        one       two     three
a  0.102401  0.640144  0.354064
b  0.176173  0.484517  0.763164
c  0.472986  0.515525  0.817586

4.由字典組成的列表

data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
print(data)
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, index = ['a','b'])
df3 = pd.DataFrame(data, columns = ['one','two'])
print(df1)
print(df2)
print(df3)
# 由字典組成的列表創建Dataframe,columns為字典的key,index不做指定則為默認數組標簽
# colunms和index參數分別重新指定相應列及行標簽
[{'one': 1, 'two': 2}, {'one': 5, 'three': 20, 'two': 10}]
   one  three  two
0    1    NaN    2
1    5   20.0   10
   one  three  two
a    1    NaN    2
b    5   20.0   10
   one  two
0    1    2
1    5   10

5.由字典組成的字典

data = {'Jack':{'math':90,'english':89,'art':78},
       'Marry':{'math':82,'english':95,'art':92},
       'Tom':{'math':78,'english':67}}
df1 = pd.DataFrame(data)
print(df1)
# 由字典組成的字典創建Dataframe,columns為字典的key,index為子字典的key

df2 = pd.DataFrame(data, columns = ['Jack','Tom','Bob'])
df3 = pd.DataFrame(data, index = ['a','b','c'])
print(df2)
print(df3)
# columns參數可以增加和減少現有列,如出現新的列,值為NaN
# index在這里和之前不同,並不能改變原有index,如果指向新的標簽,值為NaN (非常重要!)
      Jack  Marry   Tom
art        78     92   NaN
english    89     95  67.0
math       90     82  78.0
         Jack   Tom  Bob
art        78   NaN  NaN
english    89  67.0  NaN
math       90  78.0  NaN
   Jack  Marry  Tom
a   NaN    NaN  NaN
b   NaN    NaN  NaN
c   NaN    NaN  NaN

核心筆記:df.loc[label]主要針對index選擇行,同時支持指定index,及默認數字index

dataframe選擇行

data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data2,type(data3))
print(data3,type(data4))
# 按照index選擇行,只選擇一行輸出Series,選擇多行輸出Dataframe
               a          c
one    72.615321  57.485645
two    46.295674  92.267989
three  14.699591  39.683577 <class 'pandas.core.series.Series'>
a    72.615321
b    49.816987
c    57.485645
d    84.226944
Name: one, dtype: float64 <class 'pandas.core.frame.DataFrame'>

df.loc[:,'baz'] 

# df.iloc[] - 按照整數位置(從軸的0到length-1)選擇行
# 類似list的索引,其順序就是dataframe的整數位置,從0開始計
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

print(df.iloc[0])
print(df.iloc[-1])
#print(df.iloc[4])
print('單位置索引\n-----')
# 單位置索引
# 和loc索引不同,不能索引超出數據行數的整數位置

print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])
print('多位置索引\n-----')
# 多位置索引
# 順序可變

print(df.iloc[1:3])
print(df.iloc[::2])
print('切片索引')
# 切片索引
# 末端不包含
  a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
two    99.092794   0.601173  18.598736  61.166478
three  87.183015  85.973426  48.839267  99.930097
four   75.007726  84.208576  69.445779  75.546038
------
a    21.848926
b     2.482328
c    17.338355
d    73.014166
Name: one, dtype: float64
a    75.007726
b    84.208576
c    69.445779
d    75.546038
Name: four, dtype: float64
單位置索引
-----
               a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
three  87.183015  85.973426  48.839267  99.930097
               a          b          c          d
four   75.007726  84.208576  69.445779  75.546038
three  87.183015  85.973426  48.839267  99.930097
two    99.092794   0.601173  18.598736  61.166478
多位置索引
-----
               a          b          c          d
two    99.092794   0.601173  18.598736  61.166478
three  87.183015  85.973426  48.839267  99.930097
               a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
three  87.183015  85.973426  48.839267  99.930097
切片索引

dataframe選擇列

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)

data1 = df['a']
data2 = df[['a','c']]
print(data1,type(data1))
print(data2,type(data2))
print('-----')
# 按照列名選擇列,只選擇一列輸出Series,選擇多列輸出Dataframe
              a          b          c          d
one    72.615321  49.816987  57.485645  84.226944
two    46.295674  34.480439  92.267989  17.111412
three  14.699591  92.754997  39.683577  93.255880
one      72.615321
two      46.295674
three    14.699591
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
               a          c
one    72.615321  57.485645
two    46.295674  92.267989
three  14.699591  39.683577 <class 'pandas.core.frame.DataFrame'>
# 布爾型索引
# 和Series原理相同

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

b1 = df < 20
print(b1,type(b1))
print(df[b1])  # 也可以書寫為 df[df < 20]
print('------')
# 不做索引則會對數據每個值進行判斷
# 索引結果保留 所有數據:True返回原數據,False返回值為NaN

b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2])  # 也可以書寫為 df[df['a'] > 50]
print('------')
# 單列做判斷
# 索引結果保留 單列判斷為True的行數據,包括其他列

b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3])  # 也可以書寫為 df[df[['a','b']] > 50]
print('------')
# 多列做判斷
# 索引結果保留 所有數據:True返回原數據,False返回值為NaN

b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4])  # 也可以書寫為 df[df.loc[['one','three']] < 50]
print('------')
# 多行做判斷
# 索引結果保留 所有數據:True返回原數據,False返回值為NaN
  a          b          c          d
one    19.185849  20.303217  21.800384  45.189534
two    50.105112  28.478878  93.669529  90.029489
three  35.496053  19.248457  74.811841  20.711431
four   24.604478  57.731456  49.682717  82.132866
------
           a      b      c      d
one     True  False  False  False
two    False  False  False  False
three  False   True  False  False
four   False  False  False  False <class 'pandas.core.frame.DataFrame'>
               a          b   c   d
one    19.185849        NaN NaN NaN
two          NaN        NaN NaN NaN
three        NaN  19.248457 NaN NaN
four         NaN        NaN NaN NaN
------
one      False
two       True
three    False
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
             a          b          c          d
two  50.105112  28.478878  93.669529  90.029489
------
           a      b
one    False  False
two     True  False
three  False  False
four   False   True <class 'pandas.core.frame.DataFrame'>
               a          b   c   d
one          NaN        NaN NaN NaN
two    50.105112        NaN NaN NaN
three        NaN        NaN NaN NaN
four         NaN  57.731456 NaN NaN
------
          a     b      c     d
one    True  True   True  True
three  True  True  False  True <class 'pandas.core.frame.DataFrame'>
               a          b          c          d
one    19.185849  20.303217  21.800384  45.189534
two          NaN        NaN        NaN        NaN
three  35.496053  19.248457        NaN  20.711431
four         NaN        NaN        NaN        NaN
------
# 多重索引:比如同時索引行和列
# 先選擇列再選擇行 —— 相當於對於一個數據,先篩選字段,再選擇數據量

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

print(df['a'].loc[['one','three']])   # 選擇a列的one,three行
print(df[['b','c','d']].iloc[::2])   # 選擇b,c,d列的one,three行
print(df[df['a'] < 50].iloc[:2])   # 選擇滿足判斷索引的前兩行數據
              a          b          c          d
one    50.660904  89.827374  51.096827   3.844736
two    70.699721  78.750014  52.988276  48.833037
three  33.653032  27.225202  24.864712  29.662736
four   21.792339  26.450939   6.122134  52.323963
------
one      50.660904
three    33.653032
Name: a, dtype: float64
               b          c          d
one    89.827374  51.096827   3.844736
three  27.225202  24.864712  29.662736
               a          b          c          d
three  33.653032  27.225202  24.864712  29.662736
four   21.792339  26.450939   6.122134  52.323963

添加與修改

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df)

df['e'] = 10
df.loc[4] = 20
print(df)
# 新增列/行並賦值

df['e'] = 20
df[['a','c']] = 100
print(df)
# 索引后直接修改值
         a          b          c          d
0  17.148791  73.833921  39.069417   5.675815
1  91.572695  66.851601  60.320698  92.071097
2  79.377105  24.314520  44.406357  57.313429
3  84.599206  61.310945   3.916679  30.076458
           a          b          c          d   e
0  17.148791  73.833921  39.069417   5.675815  10
1  91.572695  66.851601  60.320698  92.071097  10
2  79.377105  24.314520  44.406357  57.313429  10
3  84.599206  61.310945   3.916679  30.076458  10
4  20.000000  20.000000  20.000000  20.000000  20
     a          b    c          d   e
0  100  73.833921  100   5.675815  20
1  100  66.851601  100  92.071097  20
2  100  24.314520  100  57.313429  20
3  100  61.310945  100  30.076458  20
4  100  20.000000  100  20.000000  20
# 刪除  del / drop()

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df)

del df['a']
print(df)
print('-----')
# del語句 - 刪除列

print(df.drop(0))
print(df.drop([1,2]))
print(df)
print('-----')
# drop()刪除行,inplace=False → 刪除后生成新的數據,不改變原數據

print(df.drop(['d'], axis = 1))
print(df)
# drop()刪除列,需要加上axis = 1,inplace=False → 刪除后生成新的數據,不改變原數據
  a          b          c          d
0  91.866806  88.753655  18.469852  71.651277
1  64.835568  33.844967   6.391246  54.916094
2  75.930985  19.169862  91.042457  43.648258
3  15.863853  24.788866  10.625684  82.135316
           b          c          d
0  88.753655  18.469852  71.651277
1  33.844967   6.391246  54.916094
2  19.169862  91.042457  43.648258
3  24.788866  10.625684  82.135316
-----
           b          c          d
1  33.844967   6.391246  54.916094
2  19.169862  91.042457  43.648258
3  24.788866  10.625684  82.135316
           b          c          d
0  88.753655  18.469852  71.651277
3  24.788866  10.625684  82.135316
           b          c          d
0  88.753655  18.469852  71.651277
1  33.844967   6.391246  54.916094
2  19.169862  91.042457  43.648258
3  24.788866  10.625684  82.135316
-----
           b          c
0  88.753655  18.469852
1  33.844967   6.391246
2  19.169862  91.042457
3  24.788866  10.625684
           b          c          d
0  88.753655  18.469852  71.651277
1  33.844967   6.391246  54.916094
2  19.169862  91.042457  43.648258
3  24.788866  10.625684  82.135316

刪除重復的值

本來uid列有多個重復的id,drop_duplicates()可以刪除重復的uid,並保留第一個或最后一個uid

last_loan_time.drop_duplicates(subset='uid', keep='last', inplace=True)

agg()對一張表同時做多個操作,並生成多個新列

stat_feat = ['min','mean','max','std','median']
        statistic_df = statistic_df.groupby('uid')['loan_amount'].agg(stat_feat).reset_index()
        statistic_df.columns = ['uid'] + ['loan_' + col for col in stat_feat]

# 對齊

df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print(df1 + df2)
# DataFrame對象之間的數據自動按照列和索引(行標簽)對齊
A         B         C   D
0 -0.281123 -2.529461  1.325663 NaN
1 -0.310514 -0.408225 -0.760986 NaN
2 -0.172169 -2.355042  1.521342 NaN
3  1.113505  0.325933  3.689586 NaN
4  0.107513 -0.503907 -1.010349 NaN
5 -0.845676 -2.410537 -1.406071 NaN
6  1.682854 -0.576620 -0.981622 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN

排序

# 排序1 - 按值排序 .sort_values
# 同樣適用於Series

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df1.sort_values(['a'], ascending = True))  # 升序
print(df1.sort_values(['a'], ascending = False))  # 降序
print('------')
# ascending參數:設置升序降序,默認升序
# 單列排序

df2 = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
                  'b':list(range(8)),
                  'c':list(range(8,0,-1))})
print(df2)
print(df2.sort_values(['a','c']))
# 多列排序,按列順序排序
   a          b          c          d
0  16.519099  19.601879  35.464189  58.866972
1  34.506472  97.106578  96.308244  54.049359
2  87.177828  47.253416  92.098847  19.672678
3  66.673226  51.969534  71.789055  14.504191
           a          b          c          d
0  16.519099  19.601879  35.464189  58.866972
1  34.506472  97.106578  96.308244  54.049359
3  66.673226  51.969534  71.789055  14.504191
2  87.177828  47.253416  92.098847  19.672678
           a          b          c          d
2  87.177828  47.253416  92.098847  19.672678
3  66.673226  51.969534  71.789055  14.504191
1  34.506472  97.106578  96.308244  54.049359
0  16.519099  19.601879  35.464189  58.866972
------
   a  b  c
0  1  0  8
1  1  1  7
2  1  2  6
3  1  3  5
4  2  4  4
5  2  5  3
6  2  6  2
7  2  7  1
   a  b  c
3  1  3  5
2  1  2  6
1  1  1  7
0  1  0  8
7  2  7  1
6  2  6  2
5  2  5  3
4  2  4  4
# 排序2 - 索引排序 .sort_index

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = [5,4,3,2],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['h','s','x','g'],
                   columns = ['a','b','c','d'])
print(df1)
print(df1.sort_index())
print(df2)
print(df2.sort_index())
# 按照index排序
# 默認 ascending=True, inplace=False
  a          b          c          d
5  57.327269  87.623119  93.655538   5.859571
4  69.739134  80.084366  89.005538  56.825475
3  88.148296   6.211556  68.938504  41.542563
2  29.248036  72.005306  57.855365  45.931715
           a          b          c          d
2  29.248036  72.005306  57.855365  45.931715
3  88.148296   6.211556  68.938504  41.542563
4  69.739134  80.084366  89.005538  56.825475
5  57.327269  87.623119  93.655538   5.859571
           a          b          c          d
h  50.579469  80.239138  24.085110  39.443600
s  30.906725  39.175302  11.161542  81.010205
x  19.900056  18.421110   4.995141  12.605395
g  67.760755  72.573568  33.507090  69.854906
           a          b          c          d
g  67.760755  72.573568  33.507090  69.854906
h  50.579469  80.239138  24.085110  39.443600
s  30.906725  39.175302  11.161542  81.010205
x  19.900056  18.421110   4.995141  12.605395


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM