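Series is a labeled one-dimensional array that can hold any data type (integers, strings, floats, Python objects).
Use index to view a Series's index and values to view its values.
Compared with an ndarray, a Series is an array that carries its own index: a one-dimensional array plus a matching index.
Compared with a dict, a Series behaves like an ordered dictionary.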
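As a minimal sketch of the points above (the values and labels are made up for illustration), a Series can be read both like a dict, by label, and like an array, by position:

```python
import pandas as pd

# A small Series with explicit string labels (illustrative values)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

labels = list(s.index)    # the labels carried by the Series
values = s.values         # the underlying NumPy array

by_label = s['b']         # dict-like lookup by label
by_position = s.iloc[1]   # array-like lookup by integer position

print(labels, values, by_label, by_position)
```

Both lookups return the same element here because label 'b' sits at position 1.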
Creation methods
1. From a dict: the dict's keys become the index, and its values become the values
dic = {'a':1 ,'b':2 , 'c':3, '4':4, '5':5}
s = pd.Series(dic)
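print(s)
# Note: the output below comes from an older pandas that sorted the dict keys; pandas 0.23+ on Python 3.6+ keeps insertion order (a, b, c, 4, 5)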
4 4
5 5
a 1
b 2
c 3
dtype: int64
2. From an array (one-dimensional)
arr = np.random.randn(5)
s = pd.Series(arr)
print(arr)
print(s)
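# The default index is integers starting at 0 with a step of 1
s = pd.Series(arr, index = ['a','b','c','d','e'],dtype = object)  # np.object was removed in NumPy 1.24; use the builtin object
print(s)
# index parameter: sets the index; its length must match the data
# dtype parameter: sets the data type of the values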
[ 0.11206121 0.1324684 0.59930544 0.34707543 -0.15652941]
0 0.112061
1 0.132468
2 0.599305
3 0.347075
4 -0.156529
dtype: float64
a 0.112061
b 0.132468
c 0.599305
d 0.347075
e -0.156529
dtype: object
3. From a scalar
s = pd.Series(10, index = range(4))
print(s)
# If data is a scalar, an index must be provided; the value is repeated to match the length of the index
0 10
1 10
2 10
3 10
dtype: int64
Pandas DataFrame
1. From a dict of arrays/lists
# Creation method: pandas.DataFrame()
data1 = {'a':[1,2,3],
'b':[3,4,5],
'c':[5,6,7]}
data2 = {'one':np.random.rand(3),
'two':np.random.rand(3)} # What happens if you try 'two':np.random.rand(4) here?
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
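# Building a DataFrame from a dict of arrays/lists: the dict keys become the columns, and the index defaults to integer labels
# The dict values must all have the same length!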
df1 = pd.DataFrame(data1, columns = ['b','c','a','d'])
print(df1)
df1 = pd.DataFrame(data1, columns = ['b','c'])
print(df1)
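# columns parameter: a list that reorders the columns; a column missing from the data (such as 'd') is filled with NaN
# The list passed to columns may also contain fewer columns than the original data
df2 = pd.DataFrame(data2, index = ['f1','f2','f3']) # What happens if you try index = ['f1','f2','f3','f4'] here?
print(df2)
# index parameter: a list that sets the row labels; its length must match the data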
{'a': [1, 2, 3], 'c': [5, 6, 7], 'b': [3, 4, 5]}
{'one': array([ 0.00101091, 0.08807153, 0.58345056]), 'two': array([ 0.49774634, 0.16782565, 0.76443489])}
a b c
0 1 3 5
1 2 4 6
2 3 5 7
one two
0 0.001011 0.497746
1 0.088072 0.167826
2 0.583451 0.764435
b c a d
0 3 5 1 NaN
1 4 6 2 NaN
2 5 7 3 NaN
b c
0 3 5
1 4 6
2 5 7
one two
f1 0.001011 0.497746
f2 0.088072 0.167826
f3 0.583451 0.764435
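The two "what happens if" questions in the comments above can be checked directly. The following is a small sketch (the `errors` list is just a helper for this illustration); in both cases pandas is expected to raise a ValueError, though the exact message text may vary between versions:

```python
import numpy as np
import pandas as pd

errors = []

# Case 1: the dict values have different lengths (3 vs. 4)
try:
    pd.DataFrame({'one': np.random.rand(3), 'two': np.random.rand(4)})
except ValueError as exc:
    errors.append(('values', str(exc)))

# Case 2: the index has 4 labels but the data has only 3 rows
try:
    pd.DataFrame({'one': np.random.rand(3), 'two': np.random.rand(3)},
                 index=['f1', 'f2', 'f3', 'f4'])
except ValueError as exc:
    errors.append(('index', str(exc)))

for kind, message in errors:
    print(kind, '->', message)
```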
2. From a dict of Series
data1 = {'one':pd.Series(np.random.rand(2)),
'two':pd.Series(np.random.rand(3))} # Series without an explicit index
data2 = {'one':pd.Series(np.random.rand(2), index = ['a','b']),
'two':pd.Series(np.random.rand(3),index = ['a','b','c'])} # Series with an explicit index
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
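# Building a DataFrame from a dict of Series: the dict keys become the columns, and the Series labels become the index (default integer labels if none were set)
# The Series may differ in length; the resulting DataFrame then contains NaN values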
{'one': 0 0.892580
1 0.834076
dtype: float64, 'two': 0 0.301309
1 0.977709
2 0.489000
dtype: float64}
{'one': a 0.470947
b 0.584577
dtype: float64, 'two': a 0.122659
b 0.136429
c 0.396825
dtype: float64}
one two
0 0.892580 0.301309
1 0.834076 0.977709
2 NaN 0.489000
one two
a 0.470947 0.122659
b 0.584577 0.136429
c NaN 0.396825
3. Directly from a two-dimensional array
import numpy as np
import pandas as pd
ar = np.random.rand(9).reshape(3,3)
print(ar)
df1 = pd.DataFrame(ar)
df2 = pd.DataFrame(ar, index = ['a', 'b', 'c'], columns = ['one','two','three']) # Try an index or columns whose length differs from the array
print(df1)
print(df2)
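# Building a DataFrame directly from a 2-D array gives a result of the same shape; without index and columns, both default to integer labels
# The lengths of index and columns must match the array's shape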
[[0.10240097 0.64014438 0.35406434]
[0.17617253 0.48451747 0.76316397]
[0.47298642 0.51552541 0.8175865 ]]
0 1 2
0 0.102401 0.640144 0.354064
1 0.176173 0.484517 0.763164
2 0.472986 0.515525 0.817586
one two three
a 0.102401 0.640144 0.354064
b 0.176173 0.484517 0.763164
c 0.472986 0.515525 0.817586
4. From a list of dicts
data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
print(data)
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, index = ['a','b'])
df3 = pd.DataFrame(data, columns = ['one','two'])
print(df1)
print(df2)
print(df3)
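# Building a DataFrame from a list of dicts: the dict keys become the columns; the index defaults to integer labels if not specified
# The columns and index parameters reassign the column and row labels, respectively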
[{'one': 1, 'two': 2}, {'one': 5, 'three': 20, 'two': 10}]
one three two
0 1 NaN 2
1 5 20.0 10
one three two
a 1 NaN 2
b 5 20.0 10
one two
0 1 2
1 5 10
5. From a dict of dicts
data = {'Jack':{'math':90,'english':89,'art':78},
'Marry':{'math':82,'english':95,'art':92},
'Tom':{'math':78,'english':67}}
df1 = pd.DataFrame(data)
print(df1)
# Building a DataFrame from a dict of dicts: the outer keys become the columns, and the inner keys become the index
df2 = pd.DataFrame(data, columns = ['Jack','Tom','Bob'])
df3 = pd.DataFrame(data, index = ['a','b','c'])
print(df2)
print(df3)
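# The columns parameter can add or drop columns; a new column is filled with NaN
# Unlike the earlier cases, index here cannot relabel the existing index; it selects rows by label, and a label not in the data gives NaN (very important!)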
Jack Marry Tom
art 78 92 NaN
english 89 95 67.0
math 90 82 78.0
Jack Tom Bob
art 78 NaN NaN
english 89 67.0 NaN
math 90 78.0 NaN
Jack Marry Tom
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
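Because index= selects by the inner keys rather than relabeling them, one way to actually rename the rows is to build the frame first and rename afterwards. This is a sketch with a trimmed copy of the data above; using `rename` is one possible approach, not the only one:

```python
import pandas as pd

data = {'Jack': {'math': 90, 'english': 89, 'art': 78},
        'Tom': {'math': 78, 'english': 67}}

# index= picks rows by the existing inner keys; it does not rename them
picked = pd.DataFrame(data, index=['math', 'art'])

# To relabel rows, build the frame first and then rename its index
renamed = pd.DataFrame(data).rename(
    index={'math': 'a', 'english': 'b', 'art': 'c'})

print(picked)
print(renamed)
```

In `picked`, Tom has no 'art' score, so that cell is NaN, while `renamed` keeps every value under its new label.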
Core note: df.loc[label] selects rows by index label; it works with both custom labels and the default integer index
Selecting rows from a DataFrame
# df is the 3x4 DataFrame (index one/two/three, columns a-d) built in the column-selection section below
data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))
# Select rows by index label: a single row returns a Series, multiple rows return a DataFrame
a 72.615321
b 49.816987
c 57.485645
d 84.226944
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
a b c d
one 72.615321 49.816987 57.485645 84.226944
two 46.295674 34.480439 92.267989 17.111412 <class 'pandas.core.frame.DataFrame'>

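df.loc[:,'baz'] # selects one column by label across all rows; 'baz' is a placeholder for a column name that exists in df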
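loc also accepts a row selector and a column selector together. A minimal sketch on a small deterministic frame (the values are invented for illustration):

```python
import pandas as pd

# Small deterministic frame (illustrative values)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c'])

cell = df.loc['two', 'b']                     # a single value
block = df.loc[['one', 'three'], ['a', 'c']]  # a sub-DataFrame
label_slice = df.loc['one':'two']             # label slices include the end label
column = df.loc[:, 'c']                       # a whole column as a Series

print(cell)
print(block)
print(label_slice)
print(column)
```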

# df.iloc[] - select rows by integer position (from 0 to length-1 along the axis)
# Works like list indexing: the order is the DataFrame's integer positions, counted from 0
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print(df)
print('------')
print(df.iloc[0])
print(df.iloc[-1])
#print(df.iloc[4])
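print('Single-position indexing\n-----')
# Single-position indexing
# An integer position beyond the number of rows cannot be indexed: df.iloc[4] raises IndexError here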
print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])
print('Multi-position indexing\n-----')
# Multi-position indexing
# The positions can be given in any order
print(df.iloc[1:3])
print(df.iloc[::2])
print('Slice indexing')
# Slice indexing
# The end position is excluded
a b c d
one 21.848926 2.482328 17.338355 73.014166
two 99.092794 0.601173 18.598736 61.166478
three 87.183015 85.973426 48.839267 99.930097
four 75.007726 84.208576 69.445779 75.546038
------
a 21.848926
b 2.482328
c 17.338355
d 73.014166
Name: one, dtype: float64
a 75.007726
b 84.208576
c 69.445779
d 75.546038
Name: four, dtype: float64
Single-position indexing
-----
a b c d
one 21.848926 2.482328 17.338355 73.014166
three 87.183015 85.973426 48.839267 99.930097
a b c d
four 75.007726 84.208576 69.445779 75.546038
three 87.183015 85.973426 48.839267 99.930097
two 99.092794 0.601173 18.598736 61.166478
Multi-position indexing
-----
a b c d
two 99.092794 0.601173 18.598736 61.166478
three 87.183015 85.973426 48.839267 99.930097
a b c d
one 21.848926 2.482328 17.338355 73.014166
three 87.183015 85.973426 48.839267 99.930097
Slice indexing
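The difference between position slices and label slices, and the out-of-bounds behaviour, can be sketched as follows (a toy frame with invented values):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]},
                  index=['one', 'two', 'three', 'four'])

by_position = df.iloc[1:3]        # positions 1 and 2; position 3 is excluded
by_label = df.loc['two':'four']   # labels 'two' through 'four', end included

try:
    df.iloc[4]                    # only positions 0..3 exist
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

# A slice that runs past the end is clipped instead of raising
clipped = df.iloc[2:10]

print(by_position, by_label, out_of_bounds, clipped, sep='\n')
```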
Selecting columns from a DataFrame
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
index = ['one','two','three'],
columns = ['a','b','c','d'])
print(df)
data1 = df['a']
data2 = df[['a','c']]
print(data1,type(data1))
print(data2,type(data2))
print('-----')
# Select columns by name: a single column returns a Series, multiple columns return a DataFrame
a b c d
one 72.615321 49.816987 57.485645 84.226944
two 46.295674 34.480439 92.267989 17.111412
three 14.699591 92.754997 39.683577 93.255880
one 72.615321
two 46.295674
three 14.699591
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
a c
one 72.615321 57.485645
two 46.295674 92.267989
three 14.699591 39.683577 <class 'pandas.core.frame.DataFrame'>
# Boolean indexing
# Works on the same principle as for Series
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print(df)
print('------')
b1 = df < 20
print(b1,type(b1))
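print(df[b1]) # can also be written as df[df < 20]
print('------')
# Without selecting rows or columns first, every value in the data is tested
# The result keeps the full shape: True keeps the original value, False becomes NaN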
b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2]) # can also be written as df[df['a'] > 50]
print('------')
# Test on a single column
# The result keeps the rows where the column test is True, including their other columns
b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3]) # can also be written as df[df[['a','b']] > 50]
print('------')
# Test on multiple columns
# The result keeps the full shape: True keeps the original value, False becomes NaN
b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4]) # can also be written as df[df.loc[['one','three']] < 50]
print('------')
# Test on multiple rows
# The result keeps the full shape: True keeps the original value, False becomes NaN
a b c d
one 19.185849 20.303217 21.800384 45.189534
two 50.105112 28.478878 93.669529 90.029489
three 35.496053 19.248457 74.811841 20.711431
four 24.604478 57.731456 49.682717 82.132866
------
a b c d
one True False False False
two False False False False
three False True False False
four False False False False <class 'pandas.core.frame.DataFrame'>
a b c d
one 19.185849 NaN NaN NaN
two NaN NaN NaN NaN
three NaN 19.248457 NaN NaN
four NaN NaN NaN NaN
------
one False
two True
three False
four False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
a b c d
two 50.105112 28.478878 93.669529 90.029489
------
a b
one False False
two True False
three False False
four False True <class 'pandas.core.frame.DataFrame'>
a b c d
one NaN NaN NaN NaN
two 50.105112 NaN NaN NaN
three NaN NaN NaN NaN
four NaN 57.731456 NaN NaN
------
a b c d
one True True True True
three True True False True <class 'pandas.core.frame.DataFrame'>
a b c d
one 19.185849 20.303217 21.800384 45.189534
two NaN NaN NaN NaN
three 35.496053 19.248457 NaN 20.711431
four NaN NaN NaN NaN
------
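Single-column tests like the one above can be combined. A short sketch, assuming invented values; note that pandas needs `&`, `|` and `~` with parentheses around each condition rather than Python's `and`, `or`, `not`:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 60, 35, 80], 'b': [20, 25, 90, 70]},
                  index=['one', 'two', 'three', 'four'])

both = df[(df['a'] > 50) & (df['b'] < 50)]     # AND: both tests must hold
either = df[(df['a'] > 50) | (df['b'] > 50)]   # OR: at least one test holds
negated = df[~(df['a'] > 50)]                  # NOT: invert the mask

print(both, either, negated, sep='\n')
```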
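# Combined indexing: for example, selecting rows and columns together
# Select columns first, then rows: that is, first pick the fields, then pick the records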
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print(df)
print('------')
print(df['a'].loc[['one','three']]) # rows one and three of column a
print(df[['b','c','d']].iloc[::2]) # rows one and three of columns b, c, d
print(df[df['a'] < 50].iloc[:2]) # the first two rows that satisfy the boolean test
a b c d
one 50.660904 89.827374 51.096827 3.844736
two 70.699721 78.750014 52.988276 48.833037
three 33.653032 27.225202 24.864712 29.662736
four 21.792339 26.450939 6.122134 52.323963
------
one 50.660904
three 33.653032
Name: a, dtype: float64
b c d
one 89.827374 51.096827 3.844736
three 27.225202 24.864712 29.662736
a b c d
three 33.653032 27.225202 24.864712 29.662736
four 21.792339 26.450939 6.122134 52.323963
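The chained selections above can often be written as a single loc call with a row selector and a column selector. A sketch with rounded, invented values; doing the assignment through one loc call is a common way to make sure the original frame is what gets modified:

```python
import pandas as pd

df = pd.DataFrame({'a': [50, 70, 33, 21],
                   'b': [89, 78, 27, 26],
                   'c': [51, 52, 24, 6]},
                  index=['one', 'two', 'three', 'four'])

# Rows where a < 50, restricted to columns b and c, in one step
subset = df.loc[df['a'] < 50, ['b', 'c']]
subset_rows = list(subset.index)
subset_cols = list(subset.columns)

# Assigning through a single .loc call modifies df itself
df.loc[df['a'] < 50, 'c'] = 0

print(subset_rows, subset_cols)
print(df)
```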
Adding and modifying
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print(df)
df['e'] = 10
df.loc[4] = 20
print(df)
# Add a new column / row and assign values
df['e'] = 20
df[['a','c']] = 100
print(df)
# Select, then assign directly to modify the values
a b c d
0 17.148791 73.833921 39.069417 5.675815
1 91.572695 66.851601 60.320698 92.071097
2 79.377105 24.314520 44.406357 57.313429
3 84.599206 61.310945 3.916679 30.076458
a b c d e
0 17.148791 73.833921 39.069417 5.675815 10
1 91.572695 66.851601 60.320698 92.071097 10
2 79.377105 24.314520 44.406357 57.313429 10
3 84.599206 61.310945 3.916679 30.076458 10
4 20.000000 20.000000 20.000000 20.000000 20
a b c d e
0 100 73.833921 100 5.675815 20
1 100 66.851601 100 92.071097 20
2 100 24.314520 100 57.313429 20
3 100 61.310945 100 30.076458 20
4 100 20.000000 100 20.000000 20
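Besides assigning a scalar, the same syntax can take computed columns, a list for a new row, or a conditional target. A small sketch with invented values (the column name `total` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df['total'] = df['a'] + df['b']   # new column computed from existing ones
df.loc[2] = [5, 6, 11]            # new row at label 2, one value per column
df.loc[df['a'] > 1, 'b'] = 0      # modify only the rows that match a condition

print(df)
```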
# Deleting: del / drop()
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print(df)
del df['a']
print(df)
print('-----')
# del statement - deletes a column
print(df.drop(0))
print(df.drop([1,2]))
print(df)
print('-----')
# drop() deletes rows; with inplace=False (the default) it returns new data and leaves the original unchanged
print(df.drop(['d'], axis = 1))
print(df)
# drop() deletes columns when axis = 1 is passed; with inplace=False (the default) it returns new data and leaves the original unchanged
a b c d
0 91.866806 88.753655 18.469852 71.651277
1 64.835568 33.844967 6.391246 54.916094
2 75.930985 19.169862 91.042457 43.648258
3 15.863853 24.788866 10.625684 82.135316
b c d
0 88.753655 18.469852 71.651277
1 33.844967 6.391246 54.916094
2 19.169862 91.042457 43.648258
3 24.788866 10.625684 82.135316
-----
b c d
1 33.844967 6.391246 54.916094
2 19.169862 91.042457 43.648258
3 24.788866 10.625684 82.135316
b c d
0 88.753655 18.469852 71.651277
3 24.788866 10.625684 82.135316
b c d
0 88.753655 18.469852 71.651277
1 33.844967 6.391246 54.916094
2 19.169862 91.042457 43.648258
3 24.788866 10.625684 82.135316
-----
b c
0 88.753655 18.469852
1 33.844967 6.391246
2 19.169862 91.042457
3 24.788866 10.625684
b c d
0 88.753655 18.469852 71.651277
1 33.844967 6.391246 54.916094
2 19.169862 91.042457 43.648258
3 24.788866 10.625684 82.135316
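drop() also accepts the `index=` and `columns=` keywords as an alternative to `axis`, and with `inplace=True` it modifies the frame and returns None. A brief sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2, 3], 'c': [4, 5, 6], 'd': [7, 8, 9]})

no_d = df.drop(columns=['d'])   # same effect as df.drop(['d'], axis=1)
no_row0 = df.drop(index=[0])    # same effect as df.drop([0])

# inplace=True changes df itself and returns None
result = df.drop(columns=['c'], inplace=True)

print(no_d, no_row0, df, sep='\n')
```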
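Removing duplicate values
The uid column originally contains several repeated ids; drop_duplicates() removes the duplicate uid rows and keeps either the first or the last occurrence of each uid.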
last_loan_time.drop_duplicates(subset='uid', keep='last', inplace=True)
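The original table is not shown in these notes, so the following sketch uses a made-up `last_loan_time` frame with a hypothetical `loan_time` column to show how `keep` changes which row survives:

```python
import pandas as pd

# Hypothetical data: several rows per uid
last_loan_time = pd.DataFrame({
    'uid': [1, 1, 2, 3, 3],
    'loan_time': ['2016-08', '2016-09', '2016-08', '2016-10', '2016-11'],
})

keep_first = last_loan_time.drop_duplicates(subset='uid', keep='first')
keep_last = last_loan_time.drop_duplicates(subset='uid', keep='last')

print(keep_first)
print(keep_last)
```

The surviving rows keep their original index labels; `reset_index(drop=True)` can renumber them if needed.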

agg() applies several aggregations to a table at once and produces several new columns
stat_feat = ['min','mean','max','std','median']
statistic_df = statistic_df.groupby('uid')['loan_amount'].agg(stat_feat).reset_index()
statistic_df.columns = ['uid'] + ['loan_' + col for col in stat_feat]
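To make the snippet above runnable, here is a sketch on a small invented `statistic_df` (the loan amounts are made up; the column names follow the original code):

```python
import pandas as pd

# Hypothetical input: one row per loan
statistic_df = pd.DataFrame({
    'uid': [1, 1, 2, 2, 2],
    'loan_amount': [100.0, 300.0, 50.0, 150.0, 250.0],
})

stat_feat = ['min', 'mean', 'max', 'std', 'median']
out = statistic_df.groupby('uid')['loan_amount'].agg(stat_feat).reset_index()
out.columns = ['uid'] + ['loan_' + col for col in stat_feat]

print(out)
```

Each uid ends up as one row, with one `loan_*` column per aggregation.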

# Alignment
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print(df1 + df2)
# Data in DataFrame objects is aligned automatically on columns and index (row labels)
A B C D
0 -0.281123 -2.529461 1.325663 NaN
1 -0.310514 -0.408225 -0.760986 NaN
2 -0.172169 -2.355042 1.521342 NaN
3 1.113505 0.325933 3.689586 NaN
4 0.107513 -0.503907 -1.010349 NaN
5 -0.845676 -2.410537 -1.406071 NaN
6 1.682854 -0.576620 -0.981622 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
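When the NaN values produced by alignment are not wanted, the method form `add` with `fill_value` treats a value missing on one side as the fill value. A sketch with small invented frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df2 = pd.DataFrame({'A': [100, 200]})

plain = df1 + df2                     # NaN wherever either side is missing
filled = df1.add(df2, fill_value=0)   # a missing side is treated as 0

print(plain)
print(filled)
```

A cell stays NaN under `fill_value` only if it is missing in both frames.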
Sorting
# Sorting 1 - sort by values: .sort_values
# Also works on Series
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print(df1)
print(df1.sort_values(['a'], ascending = True)) # ascending
print(df1.sort_values(['a'], ascending = False)) # descending
print('------')
# ascending parameter: sets ascending or descending order; ascending by default
# Sorting by a single column
df2 = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
'b':list(range(8)),
'c':list(range(8,0,-1))})
print(df2)
print(df2.sort_values(['a','c']))
# Sorting by multiple columns: rows are sorted by the columns in the order given
a b c d
0 16.519099 19.601879 35.464189 58.866972
1 34.506472 97.106578 96.308244 54.049359
2 87.177828 47.253416 92.098847 19.672678
3 66.673226 51.969534 71.789055 14.504191
a b c d
0 16.519099 19.601879 35.464189 58.866972
1 34.506472 97.106578 96.308244 54.049359
3 66.673226 51.969534 71.789055 14.504191
2 87.177828 47.253416 92.098847 19.672678
a b c d
2 87.177828 47.253416 92.098847 19.672678
3 66.673226 51.969534 71.789055 14.504191
1 34.506472 97.106578 96.308244 54.049359
0 16.519099 19.601879 35.464189 58.866972
------
a b c
0 1 0 8
1 1 1 7
2 1 2 6
3 1 3 5
4 2 4 4
5 2 5 3
6 2 6 2
7 2 7 1
a b c
3 1 3 5
2 1 2 6
1 1 1 7
0 1 0 8
7 2 7 1
6 2 6 2
5 2 5 3
4 2 4 4
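For multi-column sorts, `ascending` can also be a list with one flag per column. A minimal sketch on a shortened, invented version of the frame above:

```python
import pandas as pd

df2 = pd.DataFrame({'a': [1, 1, 2, 2], 'c': [8, 7, 4, 3]})

# Sort a in descending order, then c in ascending order within each a
mixed = df2.sort_values(['a', 'c'], ascending=[False, True])

print(mixed)
```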
# Sorting 2 - sort by index: .sort_index
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = [5,4,3,2],
columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['h','s','x','g'],
columns = ['a','b','c','d'])
print(df1)
print(df1.sort_index())
print(df2)
print(df2.sort_index())
# Sort by index
# Defaults: ascending=True, inplace=False
a b c d
5 57.327269 87.623119 93.655538 5.859571
4 69.739134 80.084366 89.005538 56.825475
3 88.148296 6.211556 68.938504 41.542563
2 29.248036 72.005306 57.855365 45.931715
a b c d
2 29.248036 72.005306 57.855365 45.931715
3 88.148296 6.211556 68.938504 41.542563
4 69.739134 80.084366 89.005538 56.825475
5 57.327269 87.623119 93.655538 5.859571
a b c d
h 50.579469 80.239138 24.085110 39.443600
s 30.906725 39.175302 11.161542 81.010205
x 19.900056 18.421110 4.995141 12.605395
g 67.760755 72.573568 33.507090 69.854906
a b c d
g 67.760755 72.573568 33.507090 69.854906
h 50.579469 80.239138 24.085110 39.443600
s 30.906725 39.175302 11.161542 81.010205
x 19.900056 18.421110 4.995141 12.605395
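sort_index can also sort in descending order or sort the column labels via `axis=1`. A short sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  index=['h', 'x', 'g'],
                  columns=['b', 'a'])

rows_desc = df.sort_index(ascending=False)   # row labels in descending order
cols_sorted = df.sort_index(axis=1)          # column labels in ascending order

print(rows_desc)
print(cols_sorted)
```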