02. Pandas 1 | Data Structures: Series and DataFrame

Reposted article, 2018-08-11 19:49. Tags: Python financial data analysis / data analysis toolkits / Python data analysis

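pandas Series and DataFrame

1. The "one-dimensional array": Series

The pandas Series data structure: basic concepts and creation

s.index, s.values

# The Series data structure
# A Series is a labelled one-dimensional array that can hold any data type (integers, strings, floats, Python objects and so on); the axis labels are collectively called the index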
import numpy as np
import pandas as pd
>>> s = pd.Series(np.random.rand(5))
>>> print(s,type(s))
0    0.610318
1    0.235660
2    0.606445
3    0.070794
4    0.217530
dtype: float64 <class 'pandas.core.series.Series'>
>>> print(s.index,type(s.index))
RangeIndex(start=0, stop=5, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
>>> print(s.values, type(s.values))
[0.61031815 0.23566007 0.60644485 0.0707941  0.21753049] <class 'numpy.ndarray'>
>>>
# .index shows the index of the Series; here its type is RangeIndex
# .values shows the values of the Series; their type is ndarray

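# Key point: compared with an ndarray, a Series is an array that carries its own index → one-dimensional array + matching index
# So looking only at the values of a Series gives a plain ndarray
# A Series and an ndarray are quite similar; indexing and slicing work almost the same way
# Compared with a dict, a Series is like an ordered dictionary (dict order was not guaranteed before Python 3.7); lookup works on a similar principle (one uses keys, the other uses index labels)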
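The "array plus index" idea can be checked on a tiny Series. This is a minimal sketch with made-up values; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# A small Series with explicit labels (illustrative values).
s = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

values = s.to_numpy()    # the underlying one-dimensional ndarray
labels = list(s.index)   # the index labels, in order

# Dict-like behaviour: lookup by label and membership test on the labels.
b_value = s['b']
has_c = 'c' in s

print(type(values), labels, b_value, has_c)
```

Note that `in` tests the labels, not the values, which is the same behaviour as a dict.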

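1.1 Creating a Series

From a dict: the dict keys become the index and the dict values become the values

# Series creation method 1: from a dict; the keys become the index and the values become the values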
>>> dic = {'a':1,'b':2,'c':3,'4':4,'5':5}
>>> s = pd.Series(dic)
>>> print(s)
a    1
b    2
c    3
4    4
5    5
dtype: int64
# Note: the keys here are all strings. What happens if the values have more than one type? → dic = {'a':1 ,'b':'hello' , 'c':3, '4':4, '5':5}
>>> dic = {'a':1 ,'b':'hello' , 'c':3, '4':4, '5':5}
>>> s = pd.Series(dic)
>>> print(s)
a        1
b    hello
c        3
4        4
5        5
dtype: object
>>>

# Series creation method 2: from a (one-dimensional) array
>>> arr = np.random.randn(5)
>>> s = pd.Series(arr) # the default index is integers starting at 0 with step 1
>>> print(arr)
[1.08349965 0.52441811 0.76972371 0.35454797 0.39607907]
>>> print(s)
0    1.083500
1    0.524418
2    0.769724
3    0.354548
4    0.396079
dtype: float64
>>>
>>> s = pd.Series(arr,index = ['a','b','c','d','e'],dtype = object)  # np.object was removed in NumPy 1.24; the built-in object works everywhere
>>> print(s)
a    1.083500
b    0.524418
c    0.769724
d    0.354548
e    0.396079
dtype: object
# index parameter: sets the index; its length must match the data
# dtype parameter: sets the value type

# Series creation method 3: from a scalar
>>> s = pd.Series(10,index = range(4))
>>> print(s)
0    10
1    10
2    10
3    10
dtype: int64
# If data is a scalar, supply an index; the value is repeated to match the length of the index
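The three creation methods can be put side by side with fixed data, which makes the results reproducible. This is a small sketch; the values and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Method 1: from a dict, the keys become the index.
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# Method 2: from a one-dimensional array, with an explicit index of the same length.
from_array = pd.Series(np.array([1.5, 2.5]), index=['x', 'y'])

# Method 3: from a scalar, repeated once for every index label.
from_scalar = pd.Series(10, index=range(4))

# Mixed value types fall back to the generic object dtype.
mixed = pd.Series({'a': 1, 'b': 'hello'})

print(from_dict, from_array, from_scalar, mixed, sep='\n')
```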

# The Series name attribute: name
>>> s1 = pd.Series(np.random.randn(5))
>>> print(s1)
0   -0.441627
1   -0.082186
2    0.379461
3    0.163183
4    0.851316
dtype: float64
>>> s2 = pd.Series(np.random.randn(5),name='test')
>>> print(s2)
0   -0.951756
1    0.039272
2    0.618596
3   -0.027975
4    0.409068
Name: test, dtype: float64
>>> print(s1.name,s2.name,type(s2.name))
None test <class 'str'>
# name is a parameter of the Series constructor that gives the Series a name
# The .name attribute returns that name as a str; if no name was set, it is None

>>> s3 = s2.rename('hahaha')
>>> print(s3)
0   -0.951756
1    0.039272
2    0.618596
3   -0.027975
4    0.409068
Name: hahaha, dtype: float64
>>> print(s3.name,s2.name)
hahaha test
>>>
# .rename() returns a new Series with the new name; the original Series is unchanged
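One place where the name matters is when a Series is turned into a DataFrame: the name becomes the column label. A short sketch with illustrative values:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='test')

# rename() with a scalar returns a new Series carrying the new name.
renamed = s.rename('renamed')

# to_frame() uses the Series name as the column label.
frame = s.to_frame()

print(s.name, renamed.name, list(frame.columns))
```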

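1.2 Series: indexing

Positional indexing: s[0]; s[-1] raises an error with the default integer index, because the label -1 does not exist; s[1:4] is half-open (start included, end excluded);

Label indexing: s['b'], s[['a', 'b', 'c']], s['a':'c'] (end label included);

Boolean indexing: s.isnull(), s.notnull(), s[s > 50], s[s.notnull()]

# Positional indexing, similar to a sequence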
>>> s = pd.Series(np.random.rand(5))
>>> print(s)
0    0.233801
1    0.828125
2    0.184925
3    0.297279
4    0.346561
dtype: float64
>>> print(s[0],type(s[0]),s[0].dtype)
0.23380091830372507 <class 'numpy.float64'> float64
>>> print(float(s[0]),type(float(s[0])))
0.23380091830372507 <class 'float'>
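# print(s[-1])
# Positions start at 0
# The result is a numpy.float64 value,
# which float() converts to a built-in Python float
# numpy.float64 and Python float are distinct types, although both store a 64-bit double
# What does s[-1] give? With the default integer index it raises KeyError, because -1 is looked up as a label

# Label indexing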
>>> s = pd.Series(np.random.rand(5),index=['a','b','c','d','e'])
>>> print(s)
a    0.685577
b    0.998041
c    0.451358
d    0.832554
e    0.090653
dtype: float64
>>> print(s['a'],type(s['a']),s['a'].dtype)
0.6855772922411842 <class 'numpy.float64'> float64
# Works like positional indexing: write the label inside []; note that the labels here are strings

>>> sci = s[['a','b','e']]
>>> print(sci,type(sci))
a    0.685577
b    0.998041
e    0.090653
dtype: float64 <class 'pandas.core.series.Series'>
>>>
# To select several labels, use [[]] (a list inside the brackets)
# The result of multi-label selection is a new Series

# Slicing
>>> s1 = pd.Series(np.random.rand(5))
>>> s2 = pd.Series(np.random.rand(5),index=['a','b','c','d','e'])
>>> print(s1,'\n',s2)
0    0.917653
1    0.763179
2    0.837807
3    0.344435
4    0.360922
dtype: float64
 a    0.126537
b    0.699155
c    0.289233
d    0.831209
e    0.273572
dtype: float64
>>>
>>> print(s1[1:4],s1[4])  # half-open: start included, end excluded
1    0.763179
2    0.837807
3    0.344435
dtype: float64 0.36092197040034457
>>> print(s2['a':'c'],s2['c'])  # a slice by index label includes the end label
a    0.126537
b    0.699155
c    0.289233
dtype: float64 0.28923306798234194
>>> print(s2[0:3],s2[3])
a    0.126537
b    0.699155
c    0.289233
dtype: float64 0.8312088483742163
# Note: slicing by index label includes the end


>>> print(s2[:-1])
a    0.126537
b    0.699155
c    0.289233
d    0.831209   ## the last element, e, is not included
dtype: float64
>>> print(s2[::2])
a    0.126537
c    0.289233
e    0.273572
dtype: float64
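# Positional slices are written the same way as list slices
# (for a single position on a labelled Series, newer pandas versions prefer .iloc, e.g. s2.iloc[3])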
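The difference between label slices and positional slices is easier to keep straight with the explicit accessors `.loc` (labels) and `.iloc` (positions). The sketch below uses made-up values and only illustrates the behaviour described above.

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], index=list('abcde'))

by_label = s.loc['a':'c']    # label slice: the end label 'c' is included
by_position = s.iloc[0:3]    # positional slice: position 3 is excluded
last = s.iloc[-1]            # negative positions work with .iloc

# With the default integer index, s[-1] is a label lookup and fails.
t = pd.Series([10, 20, 30])
try:
    t[-1]
    raised = False
except KeyError:
    raised = True

print(by_label.equals(by_position), last, raised, t.iloc[-1])
```

Both slices return the same three elements here, and `t.iloc[-1]` gives the last element without depending on the labels.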

# Boolean indexing
>>> s = pd.Series(np.random.rand(3)*100)
>>> s[4] = None
>>> print(s)
0    19.9515
1    59.9133
2    97.9854
4       None
dtype: object
>>> bs1 = s > 50
>>> bs2 = s.isnull()
>>> bs3 = s.notnull()
>>> print(bs1, type(bs1),bs1.dtype)
0    False
1     True
2     True
4    False
dtype: bool <class 'pandas.core.series.Series'> bool
>>> print(bs2, type(bs2),bs2.dtype)
0    False
1    False
2    False
4     True
dtype: bool <class 'pandas.core.series.Series'> bool
>>> print(bs3, type(bs3),bs3.dtype)
0     True
1     True
2     True
4    False
dtype: bool <class 'pandas.core.series.Series'> bool
>>>
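# A comparison on a Series returns a new Series of boolean values
# .isnull() / .notnull() test for missing values (None means "no value" and NaN is an undefined numeric value; both are treated as missing)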


>>> print(s[s > 50])
1    59.9133
2    97.9854
dtype: object
>>> print(s[bs3])
0    19.9515
1    59.9133
2    97.9854
dtype: object
>>>
# Boolean indexing: write s[condition], where the condition is either an expression or a boolean Series
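The same filtering can be reproduced with fixed numbers. This sketch stores the missing value as NaN in a float Series (an assumption that differs slightly from the None/object example above) so that it behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([19.9, 59.9, 97.9, np.nan], index=[0, 1, 2, 4])

mask = s > 50               # a comparison with NaN evaluates to False
above_50 = s[mask]          # keep only the elements where the mask is True
present = s[s.notnull()]    # drop the missing value

print(above_50, present, sep='\n')
```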

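1.3 Series: basic techniques

Viewing data (.head(), .tail()) / reindexing, which rearranges the data to follow a new list of labels (reindex(list)) / alignment (s1 + s2) / adding values (pd.concat([s1, s2]); s1.append(s2) in pandas before 2.0) / modifying (s['a'] = 10) / deleting (s.drop('a'))

# Viewing data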
>>> s = pd.Series(np.random.rand(50))
>>> print(s.head(10))
0    0.282475
1    0.012153
2    0.642487
3    0.906513
4    0.195709
5    0.828506
6    0.194632
7    0.197138
8    0.503566
9    0.897846
dtype: float64
>>> print(s.tail())
45    0.963916
46    0.642688
47    0.865840
48    0.835746
49    0.905786
dtype: float64
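# .head() shows the first rows
# .tail() shows the last rows
# 5 rows are shown by default

# Reindexing with reindex
# .reindex rearranges the data to match the new index; labels that do not exist yet are added with missing values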
>>> s = pd.Series(np.random.rand(3),index=['a','b','c'])
>>> print(s)
a    0.239126
b    0.862137
c    0.501479
dtype: float64
>>> s1 = s.reindex(['c','b','a','d'])
>>> print(s1)
c    0.501479
b    0.862137
a    0.239126
d         NaN
dtype: float64
# .reindex() also takes a list
# The label 'd' does not exist here, so its value is NaN

>>> s2 = s.reindex(['c','b','a','d'],fill_value=0)  # fill_value: the value used for missing entries

>>> print(s2)
c    0.501479
b    0.862137
a    0.239126
d    0.000000
dtype: float64

# Series alignment
>>> s1 = pd.Series(np.random.rand(3),index=['Jack','Marry','Kris'])
>>> s2 = pd.Series(np.random.rand(3),index=['Wang','Jack','Marry'])
>>> print(s1)
Jack     0.583406
Marry    0.603579
Kris     0.812511
dtype: float64
>>> print(s2)
Wang     0.582852
Jack     0.975184
Marry    0.990203
dtype: float64
>>> print(s1+s2)
Jack     1.558589
Kris          NaN
Marry    1.593783
Wang          NaN
dtype: float64
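# The main difference between a Series and an ndarray is that operations on Series align automatically by label
# The order of the index does not affect the computation; values are matched by label
# Any operation involving a missing value yields a missing value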

# Deleting: .drop
>>> s = pd.Series(np.random.rand(5),index=list('ngjur'))
>>> print(s)
n    0.239752
g    0.643085
j    0.313229
u    0.231923
r    0.836070
dtype: float64
>>> s1 = s.drop('n')
>>> print(s1)
g    0.643085
j    0.313229
u    0.231923
r    0.836070
dtype: float64
>>> s2 = s.drop(['g','j'])
>>> print(s2)
n    0.239752
u    0.231923
r    0.836070
dtype: float64
>>> print(s)
n    0.239752
g    0.643085
j    0.313229
u    0.231923
r    0.836070
dtype: float64
# drop returns a copy with the elements removed (inplace=False by default)

# Adding values
>>> s1 = pd.Series(np.random.rand(5))
>>> s2 = pd.Series(np.random.rand(5),index=list('ngjur'))
>>> print(s1,'\n',s2)
0    0.417249
1    0.226655
2    0.798018
3    0.984398
4    0.304693
dtype: float64
n    0.354443
g    0.609306
j    0.103994
u    0.392755
r    0.302959
dtype: float64
>>> s1[5] = 100
>>> s2['a'] = 100
>>> print(s1,'\n',s2)
0      0.417249
1      0.226655
2      0.798018
3      0.984398
4      0.304693
5    100.000000
dtype: float64
n      0.354443
g      0.609306
j      0.103994
u      0.392755
r      0.302959
a    100.000000
dtype: float64
# Add a value directly by assigning to a new integer label / index label


>>> s3 = pd.concat([s1, s2])   # written as s1.append(s2) in older pandas; Series.append was removed in pandas 2.0
>>> print(s3,'\n',s1)
0      0.417249
1      0.226655
2      0.798018
3      0.984398
4      0.304693
5    100.000000
n      0.354443
g      0.609306
j      0.103994
u      0.392755
r      0.302959
a    100.000000
dtype: float64
 0      0.417249
1      0.226655
2      0.798018
3      0.984398
4      0.304693
5    100.000000
dtype: float64
# pd.concat joins one Series onto another
# It returns a new Series and leaves the original ones unchanged
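Because `Series.append` no longer exists in pandas 2.0 and later, `pd.concat` is the portable way to join Series. The sketch below uses small made-up Series; `ignore_index=True` is shown as an optional way to renumber the result.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=[0, 1])
s2 = pd.Series([3.0, 4.0], index=['n', 'g'])

combined = pd.concat([s1, s2])                        # keeps the original labels
renumbered = pd.concat([s1, s2], ignore_index=True)   # fresh 0..n-1 labels

print(combined, renumbered, sep='\n')
```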

# Modifying
>>> s = pd.Series(np.random.rand(3),index=['a','b','c'])
>>> print(s)
a    0.246992
b    0.349735
c    0.395859
dtype: float64
>>> s['a'] = 100
>>> s[['b','c']] = 200
>>> print(s)
a    100.0
b    200.0
c    200.0
dtype: float64
>>>
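# Modify values directly through the index, much like a sequence

2. The pandas DataFrame data structure

2.1 Basic concepts and creation

The "two-dimensional array", DataFrame: a tabular data structure containing an ordered collection of columns, where each column can hold a different value type (numeric, string, boolean and so on).

The data in a DataFrame is stored as one or more two-dimensional blocks, not as lists, dicts or one-dimensional arrays.

# The DataFrame data structure
# A DataFrame is a tabular data structure: a "labelled two-dimensional array".
# A DataFrame has an index (row labels) and columns (column labels)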
>>> data = {'name':['Jack','Tom','Marry'],
... 'age':[18,19,20],
... 'gender':['m','m','w']}
>>> frame = pd.DataFrame(data)
>>> print(frame)
    name  age gender
0   Jack   18      m
1    Tom   19      m
2  Marry   20      w
>>> print(type(frame))
<class 'pandas.core.frame.DataFrame'>
>>> print(frame.index,'\nIts type is:',type(frame.index))
RangeIndex(start=0, stop=3, step=1)
Its type is: <class 'pandas.core.indexes.range.RangeIndex'>
>>> print(frame.columns,'\nIts type is:',type(frame.columns))
Index(['name', 'age', 'gender'], dtype='object')
Its type is: <class 'pandas.core.indexes.base.Index'>
>>> print(frame.values,'\nIts type is:',type(frame.values))
[['Jack' 18 'm']
 ['Tom' 19 'm']
 ['Marry' 20 'w']]
Its type is: <class 'numpy.ndarray'>
# View the data; its type is DataFrame
# .index shows the row labels
# .columns shows the column labels
# .values shows the values, as an ndarray

# DataFrame creation method 1: from a dict of arrays/lists
# Constructor: pandas.DataFrame()
>>> data1 = {'a':[1,2,3], 'b':[3,4,5], 'c':[5,6,7]}
>>> data2 = {'one':np.random.rand(3),'two':np.random.rand(3)}
# What if 'two' were np.random.rand(4) instead? The dict itself is fine, e.g.
# {'one': array([0.938673  , 0.90796881, 0.8890414 ]), 'two': array([0.37261493, 0.70430298, 0.24494145, 0.3924875 ])},
# but pd.DataFrame() on it raises ValueError (in the pandas version used here: "arrays must all be same length")
>>> print(data1,'\n',data2)
{'a': [1, 2, 3], 'b': [3, 4, 5], 'c': [5, 6, 7]}
 {'one': array([0.76701471, 0.01005053, 0.09453216]), 'two': array([0.58442534, 0.14610703, 0.03588291])}

>>> df1 = pd.DataFrame(data1)
>>> df2 = pd.DataFrame(data2)
>>> print(df1,'\n',df2)
   a  b  c
0  1  3  5
1  2  4  6
2  3  5  7
         one       two
0  0.767015  0.573035
1  0.010051  0.892624
2  0.094532  0.228811
>>>
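# A dict of arrays/lists creates a DataFrame whose columns are the dict keys and whose index is the default integer labels
# The values of the dict must all have the same length!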


>>> df1 = pd.DataFrame(data1,columns=['b','c','a','d'])
>>> print(df1)
   b  c  a    d
0  3  5  1  NaN
1  4  6  2  NaN
2  5  7  3  NaN
>>> df1 = pd.DataFrame(data1,columns=['b','c'])
>>> print(df1)
   b  c
0  3  5
1  4  6
2  5  7
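# columns parameter: a list that sets the column order; a column missing from the data (such as 'd') is filled with NaN
# The columns list may also contain fewer columns than the data

>>> df2 = pd.DataFrame(data2,index=['f1','f2','f3'])  # with index = ['f1','f2','f3','f4'] the lengths differ and an error is raised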
>>> print(df2)
         one       two
f1  0.767015  0.573035
f2  0.010051  0.892624
f3  0.094532  0.228811
>>>
# index parameter: a list that sets the row labels; its length must match the data

# DataFrame creation method 2: from a dict of Series
>>> data1 = {'one':pd.Series(np.random.rand(2)),'two':pd.Series(np.random.rand(3))} # Series without an explicit index
>>> data2 = {'one':pd.Series(np.random.rand(2),index=['a','b']),'two':pd.Series(np.random.rand(3),index=['a','b','c'])} # Series with an explicit index
>>> print(data1,'\n',data2)
{'one': 0    0.682455
1    0.282592
dtype: float64, 'two': 0    0.995054
1    0.781587
2    0.959304
dtype: float64}
 {'one': a    0.940915
b    0.792245
dtype: float64, 'two': a    0.609878
b    0.910182
c    0.245590
dtype: float64}
>>> df1 = pd.DataFrame(data1)
>>> df2 = pd.DataFrame(data2)
>>> print(df1)
        one       two
0  0.682455  0.995054
1  0.282592  0.781587
2       NaN  0.959304
>>> print(df2)
        one       two
a  0.940915  0.609878
b  0.792245  0.910182
c       NaN  0.245590
>>>
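# A dict of Series creates a DataFrame whose columns are the dict keys and whose index is built from the Series labels (the default integer labels if the Series have none)
# The Series may differ in length; the missing positions become NaN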

# DataFrame creation method 3: directly from a two-dimensional array
>>> ar = np.random.rand(9).reshape(3,3)
>>> print(ar)
[[0.43760945 0.3563898  0.16767573]
 [0.26565413 0.61673585 0.54037501]
 [0.95541978 0.05395517 0.02045977]]
>>> df1 = pd.DataFrame(ar)
>>> df2 = pd.DataFrame(ar,index=['a','b','c'],columns=['one','two','three'])
>>> print(df1,'\n',df2)
          0         1         2
0  0.437609  0.356390  0.167676
1  0.265654  0.616736  0.540375
2  0.955420  0.053955  0.020460
         one       two     three
a  0.437609  0.356390  0.167676
b  0.265654  0.616736  0.540375
c  0.955420  0.053955  0.020460
>>>
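# A two-dimensional array gives a DataFrame of the same shape; if index and columns are not given, both default to integer labels
# The lengths of index and columns must match the array

# DataFrame creation method 4: from a list of dicts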
>>> data = [{'one':1,'two':2},{'one':5,'two':10,'three':20}]
>>> print(data)
[{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
>>> df1 = pd.DataFrame(data)
>>> df2 = pd.DataFrame(data,index = ['a','b'])
>>> df3 = pd.DataFrame(data,columns = ['one','two'])
>>> print(df1,'\n',df2,'\n',df3)
   one  three  two
0    1    NaN    2
1    5   20.0   10
    one  three  two
a    1    NaN    2
b    5   20.0   10
    one  two
0    1    2
1    5   10
>>>
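# A list of dicts creates a DataFrame whose columns are the dict keys; without an index argument the rows get default integer labels
# The columns and index parameters set the columns to keep and the row labels respectively

# DataFrame creation method 5: from a dict of dicts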
data = {'Jack':{'math':90,'english':89,'art':78},
       'Marry':{'math':82,'english':95,'art':92},
       'Tom':{'math':78,'english':67}}
df1 = pd.DataFrame(data)
print(df1)
# A dict of dicts creates a DataFrame whose columns are the outer keys and whose index is the inner keys

df2 = pd.DataFrame(data, columns = ['Jack','Tom','Bob'])
df3 = pd.DataFrame(data, index = ['a','b','c'])
print(df2)
print(df3)
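# The columns parameter can add or drop columns; a new column is filled with NaN
# index behaves differently here: it cannot relabel the existing index, and any label that is not one of the inner keys gives a row of NaN (very important!)
# This call raised an error when run in cmd or PyCharm: AttributeError: 'list' object has no attribute 'astype'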

         Jack  Marry   Tom
art        78     92   NaN
english    89     95  67.0
math       90     82  78.0
         Jack   Tom  Bob
art        78   NaN  NaN
english    89  67.0  NaN
math       90  78.0  NaN
   Jack  Marry  Tom
a   NaN    NaN  NaN
b   NaN    NaN  NaN
c   NaN    NaN  NaN
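For a dict of dicts, `index=` selects among the inner keys rather than relabelling them. If new row labels are actually wanted, `rename` is one way to get them. The sketch below assumes a recent pandas version; the short labels 'M', 'E', 'A' are invented for illustration.

```python
import numpy as np
import pandas as pd

data = {'Jack': {'math': 90, 'english': 89, 'art': 78},
        'Tom': {'math': 78, 'english': 67}}

df = pd.DataFrame(data)

# index= picks rows by the existing inner keys; it does not rename them.
subset = pd.DataFrame(data, index=['math', 'art'])

# To relabel rows, map old labels to new ones with rename().
relabelled = df.rename(index={'math': 'M', 'english': 'E', 'art': 'A'})

print(subset, relabelled, sep='\n')
```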

2.2 DataFrame: indexing

A DataFrame has both a row index and a column index, and can be viewed as a dict of Series that share one index.

Selecting columns / selecting rows / slicing / boolean tests

df['a'] and df[['a', 'b']] select columns; df.loc['one'] selects a row by index label

# Select rows with df.loc[] and columns with df[]
>>> df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,index=['one','two','three'],columns=['a','b','c','d'])
>>> print(df)
               a          b          c          d
one    15.715854  86.084608  22.376152  66.760504
two     3.761389  63.610935  85.752549  19.065568
three  77.277233  24.776938  13.159774  46.518796
>>> data1 = df['a']
>>> data2 = df[['a','c']]
>>> print(data1,type(data1))
one      15.715854
two       3.761389
three    77.277233
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
>>> print(data2,type(data2))
               a          c
one    15.715854  22.376152
two     3.761389  85.752549
three  77.277233  13.159774 <class 'pandas.core.frame.DataFrame'>
>>>
# Selecting by column name: one column gives a Series, several columns give a DataFrame

>>> data3 = df.loc['one']
>>> data4 = df.loc[['one','two']]
>>> print(data3,type(data3))
a    15.715854
b    86.084608
c    22.376152
d    66.760504
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
>>> print(data4,type(data4))
             a          b          c          d
one  15.715854  86.084608  22.376152  66.760504
two   3.761389  63.610935  85.752549  19.065568 <class 'pandas.core.frame.DataFrame'>
>>>
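# Selecting rows by index label: one row gives a Series, several rows give a DataFrame

2.2.1 df[ ] - selecting columns

# 1. df[] - selecting columns
# Mostly used to select columns, but can also select rows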
>>> df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,index=['one','two','three'],columns=['a','b','c','d'])
>>> print(df)
               a          b          c          d
one    94.536247  33.478780  10.738060  52.679418
two    37.573186  95.915130   8.529743  11.367094
three  80.758763   0.000355  36.136580  95.739389
>>> data1 = df['a']
>>> data2 = df[['b','c']] # df[['b','c','e']] would raise an error, because there is no column 'e'
>>> print(data1)
one      94.536247
two      37.573186
three    80.758763
Name: a, dtype: float64
>>> print(data2)
               b          c
one    33.478780  10.738060
two    95.915130   8.529743
three   0.000355  36.136580
>>>
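# df[] selects columns by default; write column names inside [] (so columns are usually given explicit names rather than default integers, to avoid confusion with the index)
# A single column is a Series and prints as a Series
# Several columns are a DataFrame and print as a DataFrame

>>> data3 = df[:1]
#data3 = df[0]      # both of these are wrong: neither 0 nor 'one' is a column name
#data3 = df['one']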
>>> print(data3,type(data3))
             a         b         c          d
one  94.536247  33.47878  10.73806  52.679418 <class 'pandas.core.frame.DataFrame'>
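# With integers inside df[], rows are selected, and only by slice; a single position (df[0]) does not work
# The result is a DataFrame even when a single row is selected
# df[] cannot select a row by its index label (df['one'])

# Key point: df[col] is mainly for selecting columns; write column names inside []

2.2.2 df.loc[ ] - selecting rows by index label

# 2. df.loc[] - selecting rows by index label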

>>> df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,index=['one','two','three','four'],columns=['a','b','c','d'])
>>> df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,columns=['a','b','c','d'])
>>> print(df1,'\n',df2)
               a          b          c          d
one    36.881890  13.897714   5.237098  24.676327
two    42.183000  27.146129  49.074872  56.447147
three   6.935006  16.742130   5.955048   2.576066
four   49.843982  64.641184  70.038643  75.103787
            a          b          c          d
0  60.589246  60.305811  90.306763  46.761824
1  59.296330   6.039652  52.296003  97.149954
2  58.255476  13.837192  74.255506  84.082167
3  55.204207  17.340171  25.056553  84.518804

# Single-label indexing returns a Series
>>> data1 = df1.loc['one'] # a single label returns a Series
>>> data2 = df2.loc[1]
>>> print(data1,'\n',data2)
a    36.881890
b    13.897714
c     5.237098
d    24.676327
Name: one, dtype: float64
 a    59.296330
b     6.039652
c    52.296003
d    97.149954
Name: 1, dtype: float64
>>>



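# Multi-label indexing; the order can be changed
# (the label 'five' does not exist: the older pandas used here returns a NaN row, while pandas 1.0 and later raise KeyError)
>>> data3 = df1.loc[['two','three','five']]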
>>> data4 = df2.loc[[3,2,1]]
>>> print(data3)
               a          b          c          d
two    42.183000  27.146129  49.074872  56.447147
three   6.935006  16.742130   5.955048   2.576066
five         NaN        NaN        NaN        NaN    # a missing label gives a NaN row in older pandas (KeyError since pandas 1.0)
>>> print(data4)
           a          b          c          d
3  55.204207  17.340171  25.056553  84.518804
2  58.255476  13.837192  74.255506  84.082167
1  59.296330   6.039652  52.296003  97.149954


# Slice indexing: .loc accepts label slices
>>> data5 = df1.loc['one':'three']   # end label included
>>> data6 = df2.loc[1:3]
>>> print(data5)
               a          b          c          d
one    36.881890  13.897714   5.237098  24.676327
two    42.183000  27.146129  49.074872  56.447147
three   6.935006  16.742130   5.955048   2.576066
>>> print(data6)
           a          b          c          d
1  59.296330   6.039652  52.296003  97.149954
2  58.255476  13.837192  74.255506  84.082167
3  55.204207  17.340171  25.056553  84.518804


# Key point: df.loc[label] selects rows by index label, and works with both explicit labels and the default integer index
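In pandas 1.0 and later, passing a list that contains a missing label to `.loc` raises `KeyError` instead of returning a NaN row. When NaN rows for unknown labels are the desired result, `reindex` gives that behaviour explicitly. A small sketch with illustrative data, assuming a pandas version of at least 1.0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=['one', 'two', 'three'])

# .loc with a list containing an unknown label fails in pandas >= 1.0.
try:
    df.loc[['two', 'five']]
    raised = False
except KeyError:
    raised = True

# reindex() accepts unknown labels and fills their rows with NaN.
safe = df.reindex(['two', 'five'])

print(raised)
print(safe)
```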

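2.2.3 df.iloc[ ] - selecting rows by integer position

# df.iloc[] - select rows by integer position (from 0 to length-1 along the axis)
# Works like list indexing: the order is the DataFrame's integer position, counted from 0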

>>> df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,index=['one','two','three','four'],columns=['a','b','c','d'])
>>> print(df)
               a          b          c          d
one    21.693396  38.203531  85.439983   9.740751
two    28.940287  57.861274  68.467893  60.788056
three  81.871777  57.813973  60.092876   1.637220
four   67.789269  95.648501  62.837383  65.794259

# Single-position indexing; unlike loc, a position beyond the number of rows cannot be used
>>> print(df.iloc[0])
a    21.693396
b    38.203531
c    85.439983
d     9.740751
Name: one, dtype: float64
>>> print(df.iloc[-1])
a    67.789269
b    95.648501
c    62.837383
d    65.794259
Name: four, dtype: float64
>>> print(df.iloc[4]) # the position is beyond the number of rows
IndexError: single positional indexer is out-of-bounds

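# Multi-position indexing; the order can be changed
>>> print(df.iloc[[0,2]])  ## positions start at 0: this selects positions 0 and 2, the first and third rows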
               a          b          c         d
one    21.693396  38.203531  85.439983  9.740751
three  81.871777  57.813973  60.092876  1.637220
>>> print(df.iloc[[3,2,1]])
               a          b          c          d
four   67.789269  95.648501  62.837383  65.794259
three  81.871777  57.813973  60.092876   1.637220
two    28.940287  57.861274  68.467893  60.788056


# Slice indexing
>>> print(df.iloc[1:3]) # end position excluded
               a          b          c          d
two    28.940287  57.861274  68.467893  60.788056
three  81.871777  57.813973  60.092876   1.637220
>>> print(df.iloc[::2])
               a          b          c         d
one    21.693396  38.203531  85.439983  9.740751
three  81.871777  57.813973  60.092876  1.637220
>>>

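2.2.4 Boolean indexing

df < 20 and df[df < 20]; test on one column: df['a'] > 20; test on several columns: df[['a', 'b']] > 20; test on several rows: df.loc[['one', 'three']] < 50

# Boolean indexing
# Works on the same principle as for Series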
>>> df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,index=['one','two','three','four'],columns=['a','b','c','d'])
>>> print(df)
               a          b          c          d
one    17.575951  66.534852  96.774872  94.415801
two    67.485820  11.871447  19.140092   9.634462
three  32.052532   8.891445  63.209949  92.451412
four    6.931403   0.622515  29.972335  24.438536

>>> b1 = df < 20  # a boolean DataFrame; it can also be used inline as df[df < 20]
>>> print(b1,type(b1))
           a      b      c      d
one     True  False  False  False
two    False   True   True   True
three  False   True  False  False
four    True   True  False  False <class 'pandas.core.frame.DataFrame'>
>>> print(df[b1])
               a          b          c         d
one    17.575951        NaN        NaN       NaN
two          NaN  11.871447  19.140092  9.634462
three        NaN   8.891445        NaN       NaN
four    6.931403   0.622515        NaN       NaN
>>>
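# Without selecting a column first, every value in the data is tested
# The result keeps the full shape: where the mask is True the original value is returned, where it is False the value is NaN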


>>> b2 = df['a'] > 50
>>> print(b2,type(b2))
one      False
two       True
three    False
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
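>>> print(df[b2]) # keeps row two, where the mask is True, including its values below 50 in the other columns
            a          b          c         d
two  67.48582  11.871447  19.140092  9.634462
# Test on a single column: the result keeps the rows where that column's test is True, with all their columns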


>>> b3 = df[['a','b']] > 50
>>> print(b3,type(b3))
           a      b
one    False   True
two     True  False
three  False  False
four   False  False <class 'pandas.core.frame.DataFrame'>
>>> print(df[b3])
              a          b   c   d
one         NaN  66.534852 NaN NaN
two    67.48582        NaN NaN NaN
three       NaN        NaN NaN NaN
four        NaN        NaN NaN NaN
# Test on several columns: the result keeps the full shape; True returns the original value, False returns NaN


>>>
>>> b4 = df.loc[['one','three']] < 50
>>> print(b4,type(b4))
          a      b      c      d
one    True  False  False  False
three  True   True  False  False <class 'pandas.core.frame.DataFrame'>
>>> print(df[b4])
               a         b   c   d
one    17.575951       NaN NaN NaN
two          NaN       NaN NaN NaN
three  32.052532  8.891445 NaN NaN
four         NaN       NaN NaN NaN
>>>
# Test on several rows: the result keeps the full shape; True returns the original value, False returns NaN
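The two kinds of mask behave differently: a boolean Series filters whole rows, while a boolean DataFrame masks individual cells. Several row conditions can be combined with `&` and parentheses. The sketch below uses fixed, made-up numbers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [17.0, 67.0, 32.0, 6.0],
                   'b': [66.0, 11.0, 8.0, 0.6]},
                  index=['one', 'two', 'three', 'four'])

rows = df[df['a'] > 50]     # boolean Series: keeps whole rows
cells = df[df > 50]         # boolean DataFrame: same shape, NaN where False

# Combine conditions element-wise with & (and) or | (or); parentheses are required.
both = df[(df['a'] < 50) & (df['b'] < 50)]

print(rows, cells, both, sep='\n')
```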

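2.2.5 Combined indexing: selecting rows and columns together

Select columns first, then rows: df['a'].loc[['a', 'b', 'c']], df[df['a'] < 50].iloc[:2]

# Combined indexing: selecting rows and columns together
# Select columns first, then rows: in other words, pick the fields first and then the records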
>>> df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,index=['one','two','three','four'],columns=['a','b','c','d'])
>>> print(df)
               a          b          c          d
one    12.408141  98.202562  38.715980  62.978631
two    93.980397  39.455335  77.214844  42.495949
three   4.210569  48.999179  10.320513  51.919796
four   73.838276  72.854442  98.555301  27.902682
>>> print(df['a'].loc[['one','three']]) # rows one and three of column a
one      12.408141
three     4.210569
Name: a, dtype: float64
>>> print(df[['b','c','d']].iloc[::2]) # rows one and three of columns b, c, d
               b          c          d
one    98.202562  38.715980  62.978631
three  48.999179  10.320513  51.919796
>>> print(df[df['a'] < 50].iloc[:2]) # the first two rows that satisfy the condition
               a          b          c          d
one    12.408141  98.202562  38.715980  62.978631
three   4.210569  48.999179  10.320513  51.919796
>>>
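The chained form (columns first, then rows) can also be written as a single `.loc` or `.iloc` call that takes a row selector and a column selector together. This is an alternative sketch rather than the method used above; the data is a fixed 4x4 grid so the results are predictable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=list('abcd'))

chained = df['a'].loc[['one', 'three']]       # two steps: column, then rows
single = df.loc[['one', 'three'], 'a']        # one step: rows and column together

block = df.iloc[::2, 1:]                      # every second row, columns b to d
filtered = df.loc[df['a'] < 10, ['b', 'c']]   # boolean row mask plus column list

print(chained.equals(single))
print(block, filtered, sep='\n')
```

The single-call form is also the safer choice when assigning values, since assignment through a chained selection may act on a temporary copy.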

2.3 DataFrame: basic techniques

Viewing data, transposing / adding, modifying, deleting values / alignment / sorting

###### Viewing data (.head(), .tail()) and transposing (.T)
>>> df = pd.DataFrame(np.random.rand(16).reshape(8,2)*100,columns=['a','b'])
>>> print(df)
           a          b
0  41.447858  93.937878
1  29.684415  58.637993
2   2.260561  23.601327
3  79.555013  55.611010
4  64.825361  92.444769
5  53.716091  40.166872
6  19.657354  47.842487
7  22.705715  26.977886
>>>
>>> print(df.head(2))
           a          b
0  41.447858  93.937878
1  29.684415  58.637993
>>> print(df.tail())
           a          b
3  79.555013  55.611010
4  64.825361  92.444769
5  53.716091  40.166872
6  19.657354  47.842487
7  22.705715  26.977886

# .head() shows the first rows
# .tail() shows the last rows
# 5 rows are shown by default

>>> print(df.T)
           0          1          2          3          4          5          6          7
a  41.447858  29.684415   2.260561  79.555013  64.825361  53.716091  19.657354  22.705715
b  93.937878  58.637993  23.601327  55.611010  92.444769  40.166872  47.842487  26.977886
# .T transposes

# Adding and modifying
>>> df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,columns=['a','b','c','d'])
>>> print(df)
           a          b          c          d
0  40.395591  38.023720  64.954712  82.833601
1  69.405393  77.664903  76.566145  11.218753
2  61.793220  95.929196  15.415231  79.368691
3  29.482119  85.228170  94.134330  25.678733
>>> df['e'] = 10
>>> df.loc[4] = 20
>>> print(df)
           a          b          c          d   e
0  40.395591  38.023720  64.954712  82.833601  10
1  69.405393  77.664903  76.566145  11.218753  10
2  61.793220  95.929196  15.415231  79.368691  10
3  29.482119  85.228170  94.134330  25.678733  10
4  20.000000  20.000000  20.000000  20.000000  20
>>>
# Add a new column / row and assign values
>>> df['e'] = 20
>>> df[['a','c']] = 100
>>> print(df)
     a          b    c          d   e
0  100  38.023720  100  82.833601  20
1  100  77.664903  100  11.218753  20
2  100  95.929196  100  79.368691  20
3  100  85.228170  100  25.678733  20
4  100  20.000000  100  20.000000  20
>>>
# Modify values directly after indexing

# Deleting: del / drop(); inplace = False/True; axis = 0 for rows | axis = 1 for columns
>>> df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,columns=['a','b','c','d'])
>>> print(df)
           a          b          c          d
0  76.082974  91.636219  70.831268  82.900443
1  16.328769   9.910538  36.670726  67.187492
2  96.234567  16.699254   0.257354  31.032239
3  16.659137  85.438085  91.993957  33.055454
>>> del df['a']
>>> print(df)
           b          c          d
0  91.636219  70.831268  82.900443
1   9.910538  36.670726  67.187492
2  16.699254   0.257354  31.032239
3  85.438085  91.993957  33.055454
>>>
# del statement - deletes a column


>>> print(df.drop(0))              ## delete a row
           b          c          d
1   9.910538  36.670726  67.187492
2  16.699254   0.257354  31.032239
3  85.438085  91.993957  33.055454
>>> print(df.drop([1,2]))
           b          c          d
0  91.636219  70.831268  82.900443
3  85.438085  91.993957  33.055454
>>> print(df)                     ## the original data is unchanged
           b          c          d
0  91.636219  70.831268  82.900443
1   9.910538  36.670726  67.187492
2  16.699254   0.257354  31.032239
3  85.438085  91.993957  33.055454
>>>
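# drop() deletes rows; with inplace=False it returns new data and leaves the original unchanged

>>> print(df.copy().drop([1], inplace=True)) # default is inplace=False; inplace=True modifies the object itself and returns None (a copy is used here so df stays intact)
None

>>> print(df.drop(['d'],axis = 1)) # axis = 1 deletes columns (pass the names in a list); axis = 0 deletes rows; neither changes the original data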
           b          c
0  91.636219  70.831268
1   9.910538  36.670726
2  16.699254   0.257354
3  85.438085  91.993957
>>> print(df)
           b          c          d
0  91.636219  70.831268  82.900443
1   9.910538  36.670726  67.187492
2  16.699254   0.257354  31.032239
3  85.438085  91.993957  33.055454
>>>
# drop() deletes columns when axis = 1 is given; with inplace=False it returns new data and leaves the original unchanged
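The `index=` and `columns=` keywords of `drop` are an alternative to `axis` that states the intent more directly. The sketch below also shows what `inplace=True` returns; the data and the `work` copy are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2, 3], 'c': [4, 5, 6], 'd': [7, 8, 9]})

no_row = df.drop(index=0)          # same as df.drop(0) or df.drop(0, axis=0)
no_col = df.drop(columns=['d'])    # same as df.drop(['d'], axis=1)

# inplace=True changes the object itself and returns None.
work = df.copy()
result = work.drop(columns=['d'], inplace=True)

print(no_row, no_col, result, sep='\n')
```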

# Alignment with +
>>> df1 = pd.DataFrame(np.random.randn(10,4),columns=['A','B','C','D'])
>>> df2 = pd.DataFrame(np.random.randn(7,3),columns=['A','B','C'])
>>> print(df1)
          A         B         C         D
0 -0.711905  1.102947 -0.203125  0.464160
1 -1.633976 -0.126530  1.437948  1.721049
2  1.323383 -0.277546  0.060134  0.207093
3  1.708294  0.815721 -0.151322  0.522937
4  0.263572 -0.674251 -1.325148 -2.702464
5  1.659823 -0.131172 -1.114735 -2.182527
6 -0.186723 -0.071455 -1.370213  0.513062
7  0.381603  1.265310  0.083247  1.084061
8  0.399770  0.765438 -1.066299  0.626402
9  0.781321 -1.612135 -0.387417 -0.673143
>>> print(df2)
          A         B         C
0  0.012025 -0.488556  0.243515
1 -0.751000  0.277448  0.013675
2  1.008712 -1.231084 -0.523329
3  0.663029 -0.752602 -0.724749
4 -0.755075  0.303930  1.288335
5 -1.233975 -1.241185 -0.414564
6 -0.251519 -1.384259 -0.996120
>>> print(df1+df2)                  # DataFrame objects are aligned automatically on both columns and index (row labels)
          A         B         C   D
0 -0.699879  0.614391  0.040390 NaN
1 -2.384977  0.150917  1.451622 NaN
2  2.332095 -1.508629 -0.463195 NaN
3  2.371323  0.063119 -0.876071 NaN
4 -0.491503 -0.370321 -0.036813 NaN
5  0.425847 -1.372357 -1.529299 NaN
6 -0.438242 -1.455714 -2.366333 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
>>>
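As with Series, `DataFrame.add` with `fill_value` can replace the NaN produced by unmatched rows or columns. A minimal sketch with small invented frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
df2 = pd.DataFrame({'A': [10.0, 20.0]})

plain = df1 + df2                      # unmatched rows and columns give NaN
filled = df1.add(df2, fill_value=0)    # a cell missing on one side counts as 0

print(plain, filled, sep='\n')
```

A cell that is missing in both frames still comes out as NaN, so `fill_value` only helps where at least one side has data.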

# Sorting 1 - by value: .sort_values
# Also works for Series
>>> df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,columns=['a','b','c','d'])
>>> print(df1)
           a          b          c          d
0   3.255012  35.188882  99.290551  67.897580
1  43.221583  36.144081  84.124544  18.844967
2  83.170528   1.550416   7.810286  61.375057
3  47.364524  41.530226  20.800088  22.597198
>>> print(df1.sort_values(['a'],ascending=True)) # ascending; the ascending parameter sets the direction and defaults to True
           a          b          c          d
0   3.255012  35.188882  99.290551  67.897580
1  43.221583  36.144081  84.124544  18.844967
3  47.364524  41.530226  20.800088  22.597198
2  83.170528   1.550416   7.810286  61.375057
>>> print(df1.sort_values(['a'],ascending=False)) # descending
           a          b          c          d
2  83.170528   1.550416   7.810286  61.375057
3  47.364524  41.530226  20.800088  22.597198
1  43.221583  36.144081  84.124544  18.844967
0   3.255012  35.188882  99.290551  67.897580
# Sorting by a single column


>>> df2 = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],'b':list(range(8)),'c':list(range(8,0,-1))})
>>> print(df2)
   a  b  c
0  1  0  8
1  1  1  7
2  1  2  6
3  1  3  5
4  2  4  4
5  2  5  3
6  2  6  2
7  2  7  1
>>> print(df2.sort_values(['a','c']))  # multi-column sort: the columns are applied in the order listed
   a  b  c
3  1  3  5
2  1  2  6
1  1  1  7
0  1  0  8
7  2  7  1
6  2  6  2
5  2  5  3
4  2  4  4
>>>
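In a multi-column sort, `ascending` can also be a list with one direction per column. The sketch below sorts `a` ascending and `c` descending on made-up data, and shows the Series form as well.

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 1, 2, 1], 'c': [3, 7, 4, 8]})

# One direction per sort column: a ascending, then c descending within each a.
mixed = df.sort_values(['a', 'c'], ascending=[True, False])

# sort_values on a Series needs no column argument.
s = pd.Series([3, 1, 2])
s_desc = s.sort_values(ascending=False)

print(mixed, s_desc, sep='\n')
```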

# Sorting 2 - by index: .sort_index
>>> df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,index=[5,4,3,2],columns=['a','b','c','d'])
>>> df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,index=['h','s','x','g'],columns=['a','b','c','d'])
>>> print(df1)
           a          b          c          d
5  70.899006  29.653652  38.273239  99.254931
4  68.173016  27.051275  43.236560  48.573018
3  35.870577  41.990773  78.055733  63.581352
2  20.946046  19.712039  33.906534  89.749668
>>> print(df1.sort_index())
           a          b          c          d
2  20.946046  19.712039  33.906534  89.749668
3  35.870577  41.990773  78.055733  63.581352
4  68.173016  27.051275  43.236560  48.573018
5  70.899006  29.653652  38.273239  99.254931
>>> print(df2)
           a          b          c          d
h  62.234181  32.481881  83.483145  39.145470
s  41.003081  16.515826  19.958257  30.331726
x  60.486728  20.206607  91.149820  31.731089
g  22.132468  61.116998  19.929379  98.976248
>>> print(df2.sort_index())
           a          b          c          d
g  22.132468  61.116998  19.929379  98.976248
h  62.234181  32.481881  83.483145  39.145470
s  41.003081  16.515826  19.958257  30.331726
x  60.486728  20.206607  91.149820  31.731089
>>>
# Sort by index
# Defaults: ascending=True, inplace=False
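`sort_index` also accepts `ascending=False`, and with `axis=1` it sorts the column labels instead of the row labels. A short sketch on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2], 'a': [3, 4]}, index=[5, 2])

by_row = df.sort_index()                       # row labels ascending: 2, 5
by_row_desc = df.sort_index(ascending=False)   # row labels descending: 5, 2
by_col = df.sort_index(axis=1)                 # column labels sorted: a, b

print(by_row, by_row_desc, by_col, sep='\n')
```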
