[Python Data Mining] Part 2: DataFrame Operations

This article is a repost. 2017-07-09 21:38, Python Data Analysis

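I. Indexing

Indexing is mainly used to slice data, that is, to select a subset of the data held in a pandas object.

1. loc: label-based. A label that does not exist raises KeyError.

A single label
A list or array of labels
A slice of labels (by index label; NOT half-open: both endpoints are included)
A boolean array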

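# df.loc[ row selector , column selector ]
# 1. A single label
df.loc['label_name']

# 2. A list or array of labels
df.loc[ [0,1,2] , : ]
df.loc[ [0,1,2] , [ 'color' , 'director_name' ] ]

# 3. A label slice (both endpoints included: rows labelled 0 through 4)
df.loc[ 0:4 , : ]

# 4. A boolean array
df.loc[ [True,False,True] , : ]          # rows marked False are left out; the mask needs one entry per row
df.loc[ df['duration']>=150 , : ]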
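The four kinds of loc input can be tried on a small table. The sketch below uses made-up data; only the column names echo the ones used in this article, and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the article's dataset).
df = pd.DataFrame(
    {
        "color": ["Color", "Color", "Black and White", "Color", "Color"],
        "director_name": ["A", "B", "C", "D", "E"],
        "duration": [178, 169, 148, 164, 132],
    }
)

single_row = df.loc[0]                                      # one label
some_cells = df.loc[[0, 1, 2], ["color", "director_name"]]  # lists of labels
label_slice = df.loc[0:3, :]                                # labels 0..3, end label included
long_films = df.loc[df["duration"] >= 150, :]               # boolean mask

# A label that is not in the index raises KeyError.
try:
    df.loc[99]
    missing_raises = False
except KeyError:
    missing_raises = True
```

Note that `df.loc[0:3, :]` returns four rows, because the end label is part of the result.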

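2. iloc: integer-position based. A single integer outside the valid range raises IndexError (an out-of-range slice is simply truncated).

A single integer
A list or array of integers
An integer slice
A boolean array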

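# 1. A single integer
df.iloc[ 0 , : ]

# 2. A list or array of integers
df.iloc[ [0,1,2,3] , : ]

# 3. An integer slice, half-open (positions 0 through 3)
df.iloc[ 0:4 , : ]

# 4. A boolean array (one entry per row; a boolean Series must be converted first, e.g. with .values)
df.iloc[ [True,False] , : ]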
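To see how iloc differs from loc, a minimal sketch with a string-labelled index helps, since positions and labels can no longer be confused. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical data with a non-integer index.
df = pd.DataFrame({"duration": [178, 169, 148, 164, 132]}, index=list("abcde"))

first_row = df.iloc[0, :]        # position 0, i.e. label 'a'
picked = df.iloc[[0, 1, 3], :]   # list of positions
pos_slice = df.iloc[0:3, :]      # positions 0, 1, 2 (end excluded)
clipped = df.iloc[3:100, :]      # an out-of-range slice is truncated, no error

# iloc rejects a boolean Series, so convert the mask to a NumPy array first.
mask = (df["duration"] >= 150).to_numpy()
masked = df.iloc[mask, :]

# A single out-of-range position raises IndexError.
try:
    df.iloc[10, :]
    out_of_range_raises = False
except IndexError:
    out_of_range_raises = True
```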

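3. ix: mixes loc and iloc. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so new code should use loc or iloc.

df_new.ix[0:4,['color']]      # legacy pandas only
df_new.loc[0:4,['color']]     # equivalent when the index holds integer labels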

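4. Application: conditional filtering

# 1. A filter condition yields a boolean array
df_new['director_facebook_likes'] >= 100

# 2. Combine conditions; wrap each one in parentheses
( bool array ) & ( bool array ) & ( bool array )

# 3. Apply the combined mask
df.loc[ ( bool array ) & ( bool array ) & ( bool array ) ]

# 4. Flag the rows that satisfy the conditions
df.loc[ (...) & (...) & (...) , 'new_column' ] = 1

# 5. Flag the rows that do not: negate the whole condition with ~(...), again minding the parentheses
df.loc[ ~( (...) & (...) & (...) ) , 'new_column' ] = 0

### Option 2: query
df.query("condition & condition")
df.query("aa >= 100")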
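The filtering steps above can be sketched end to end as follows. The data and the flag column name `is_target` are hypothetical; the other column names follow the ones used in this article.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame(
    {
        "duration": [178, 169, 148, 164, 90],
        "director_facebook_likes": [0, 563, 0, 22000, 150],
        "color": ["Color", "Color", "Black and White", "Color", "Color"],
    }
)

# Each comparison is parenthesised because & binds tighter than >= and ==.
cond = (
    (df["duration"] >= 150)
    & (df["director_facebook_likes"] >= 100)
    & (df["color"] == "Color")
)

selected = df.loc[cond]

# Flag matching rows with 1 and all other rows with 0.
df.loc[cond, "is_target"] = 1
df.loc[~cond, "is_target"] = 0

# The same filter written as a query string.
same_rows = df.query(
    "(duration >= 150) & (director_facebook_likes >= 100) & (color == 'Color')"
)
```

Keeping the mask in a variable such as `cond` avoids repeating the whole expression when it is negated.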

5. Other

The isin method
Series.isin( [ '...' , '...' ] )
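isin returns a boolean Series, so it plugs straight into loc like any other mask. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame(
    {
        "country": ["USA", "UK", "Canada", "USA", "France"],
        "duration": [178, 148, 132, 164, 120],
    }
)

wanted = df["country"].isin(["USA", "Canada"])  # True where the value is in the list
subset = df.loc[wanted]
others = df.loc[~wanted]                        # negate the mask for "not in"
```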

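II. Setting the index

1. set_index: the keys are normally columns that already exist in the dataset. An external array is also accepted, provided its length equals the number of rows.

df.set_index([ 'column_name' , '...' ])
df.set_index([ [0,1,2,3] ])      # external array; df must have exactly 4 rows

keys : column label or list of column labels / arrays
drop : default True; once a column becomes the index it is removed from the data columns
append : default False, which replaces the existing index; True appends the new keys to it
inplace : default False, which returns a new dataset and leaves the original unchanged
verify_integrity : boolean, default False
    Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.

2. reset_index: the inverse of set_index

df.reset_index( ['xxx' , 'xxx'] )      # names of the index levels to move back into columns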
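A short sketch of the set_index parameters and of reset_index as the inverse step. The table is invented for illustration; `title_year` and `country` are assumed column names.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame(
    {
        "country": ["USA", "USA", "UK", "Canada"],
        "title_year": [2009, 2012, 2015, 2012],
        "duration": [178, 164, 148, 132],
    }
)

indexed = df.set_index(["country", "title_year"])   # drop=True: both columns leave the data
kept = df.set_index("country", drop=False)           # column stays available as data
appended = df.set_index("country", append=True)      # old index kept as the first level
external = df.set_index(pd.Index([10, 20, 30, 40]))  # external array, same length as df

restored = indexed.reset_index()                     # all levels back to columns
partial = indexed.reset_index(level="title_year")    # only one level back
```

None of these calls modifies `df`, because inplace defaults to False.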

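III. MultiIndex

A MultiIndex suits complex data analysis, especially high-dimensional data.

A MultiIndex lets you group, select and reshape data.

By default loc searches the first index level. Formally, the brackets take only one comma, the one that separates the row indexer from the column indexer.

Shorthand form for a MultiIndex (it can be ambiguous, because a label may be read as a column):

df.loc[ 'level_1_label' , 'level_2_label' , '...' ]

To select specific columns as well, give the row key as a tuple and put the columns after the comma:

df.loc[ ('level_1_label' , 'level_2_label' , '...') , ]

Several values on one level (the DataFrame's index must be sorted first):

df.loc[ ( 'level_1_label' , ['xxx','xxx'] ) , ]

Sorting:

df.sort_index(inplace=True)   # inplace=True modifies the original dataset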

Signature: df.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
Docstring:
Sort object by labels (along an axis)

Parameters
----------
axis : index, columns to direct sorting
level : int or level name or list of ints or list of level names
    if not None, sort on values in specified index level(s)
ascending : boolean, default True
    Sort ascending vs. descending
inplace : bool, default False
    if True, perform operation in-place
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
     Choice of sorting algorithm. See also ndarray.np.sort for more
     information.  `mergesort` is the only stable algorithm. For
     DataFrames, this option is only applied when sorting on a single
     column or label.
na_position : {'first', 'last'}, default 'last'
     `first` puts NaNs at the beginning, `last` puts NaNs at the end
sort_remaining : bool, default True
    if true and sorting by level and index is multilevel, sort by other
    levels too (in order) after sorting by specified level
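The MultiIndex lookups described in this section can be sketched on a small table. The data is invented; the two index levels `title_year` and `country` are assumptions chosen to resemble the article's examples.

```python
import pandas as pd

# Hypothetical data with a two-level index, sorted before any selection.
df = pd.DataFrame(
    {
        "title_year": [2009, 2012, 2015, 2012, 2010, 2014],
        "country": ["USA", "USA", "UK", "Canada", "USA", "Australia"],
        "duration": [178, 164, 148, 132, 100, 120],
    }
).set_index(["title_year", "country"]).sort_index()

is_sorted = df.index.is_monotonic_increasing

one_year = df.loc[2012]                                   # first level only
exact = df.loc[(2012, "USA"), "duration"]                 # tuple row key, then a column
two_countries = df.loc[(2012, ["USA", "Canada"]), :]      # list of values on level 2
```

Here `one_year` is indexed by country alone, since the matched first level is dropped from the result.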


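Range selection with slice() (selects every element within a range):

Every axis must be specified
The index must be sorted
Use the slice function

# slice(None)       selects everything on that level
# slice('a' , 'z')  selects the labels from a through z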

df.loc[ (slice(None) , slice('Australia','Canada')) , ]
df.loc[( slice(None) , slice('Australia','Canada')) , slice('color','duration') ]

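IndexSlice: use idx instead of slice()

df1.loc[(slice(None),slice('Australia','Canada')),]

idx = pd.IndexSlice
df1.loc[ idx[ : , ['Australia','Canada'] ] , ]      # the list picks exactly these two labels; 'Australia':'Canada' would select the whole range
#  : => slice(None), selects everything on that level

df.loc[( slice(None) , ['USA'] , slice(100,200) ),]
df.loc[ idx[: , ['USA'] , 100:200 ] , : ]              # a list ['...' , '...'] means "any of these"; write ranges with :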
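A runnable comparison of slice() and pd.IndexSlice, as a sketch on invented data with a sorted two-level index (`title_year`, `country` are assumed level names):

```python
import pandas as pd

# Hypothetical data; the index is sorted so that label ranges can be sliced.
df = pd.DataFrame(
    {
        "title_year": [2009, 2012, 2015, 2012, 2010, 2014],
        "country": ["USA", "USA", "UK", "Canada", "USA", "Australia"],
        "duration": [178, 164, 148, 132, 100, 120],
    }
).set_index(["title_year", "country"]).sort_index()

# Every year, countries from 'Australia' through 'Canada' (both included).
with_slice = df.loc[(slice(None), slice("Australia", "Canada")), :]

# The same selection with IndexSlice, which allows the a:b notation.
idx = pd.IndexSlice
with_idx = df.loc[idx[:, "Australia":"Canada"], :]

# A list means "any of these labels"; a range uses the colon.
usa_only = df.loc[idx[:, ["USA"]], :]
usa_2010_2012 = df.loc[idx[2010:2012, ["USA"]], "duration"]
```

Both spellings build the same indexer; IndexSlice is mainly easier to read.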

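Cross-sections along the row or column axis (select particular rows or columns)

df.xs  selects by key on a given index level
level : 0 / 1 / 2 identifies the index level; both level names (strings) and integer positions are accepted
axis : 0 / 1 selects the axis; 0 looks the key up in the row index, 1 in the columns
drop_level : default True; the level used for the selection is dropped from the result

df.xs( 1 , axis=0 )
df.xs( 'xx' , level = 'index_level_name' , drop_level = False )    # a single key
df.xs( ('xx' , 'xxxx' ), level = (0,1) , drop_level = False )    # several keys, matched one-to-one with the levels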
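The xs parameters can be illustrated with a small sketch. The data is invented, and `title_year` / `country` are assumed level names.

```python
import pandas as pd

# Hypothetical data with a sorted two-level index.
df = pd.DataFrame(
    {
        "title_year": [2009, 2012, 2015, 2012, 2010, 2014],
        "country": ["USA", "USA", "UK", "Canada", "USA", "Australia"],
        "duration": [178, 164, 148, 132, 100, 120],
    }
).set_index(["title_year", "country"]).sort_index()

usa = df.xs("USA", level="country")                           # 'country' level is dropped
usa_keep = df.xs("USA", level="country", drop_level=False)    # both levels kept
pair = df.xs((2012, "USA"), level=(0, 1), drop_level=False)   # keys paired with levels
col = df.xs("duration", axis=1)                               # cross-section of the columns
```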

IV. Multi-table operations

Example tables for the multi-table operations

import pandas as pd

df1 = pd.DataFrame({
        'A':['A0','A1','A2','A3'],
        'B':['B0','B1','B2','B3'],
        'C':['C0','C1','C2','C3'],
        'D':['D0','D1','D2','D3'],
        },
        index=[0,1,2,3])

df2 = pd.DataFrame({
        'A':['A4','A5','A6','A7'],
        'B':['B4','B5','B6','B7'],
        'C':['C4','C5','C6','C7'],
        'D':['D4','D5','D6','D7'],
        },
        index=[4,5,6,7])


df3 = pd.DataFrame({
        'A':['A8','A9','A10','A11'],
        'B':['B8','B9','B10','B11'],
        'C':['C8','C9','C10','C11'],
        'D':['D8','D9','D10','D11'],
        },
        index=[8,9,10,11])

df4 = pd.DataFrame({
        'B':['B2','B3','B6','B7'],
        'D':['D2','D3','D6','D7'],
        'F':['F2','F3','F6','F7'],
        },
        index=[2,3,6,7])


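1. pd.concat(): returns a new dataset

pd.concat( [ ... , ... , ] )
axis  : default 0, concatenate along the index (stack rows); 1 concatenates along the columns
ignore_index : discard the existing labels on the concatenation axis and relabel it 0 to n-1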

pd.concat([df1,df2,df3])
pd.concat([df1,df2,df3] , axis=0 , ignore_index=False )
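The effect of axis and ignore_index is easiest to see on small frames. The sketch below uses trimmed-down, two-row stand-ins for the article's tables, so the values are assumptions for illustration.

```python
import pandas as pd

# Smaller stand-ins for the example tables.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[2, 3])
df4 = pd.DataFrame({"B": ["B1", "B2"], "F": ["F1", "F2"]}, index=[1, 2])

stacked = pd.concat([df1, df2])                        # rows stacked, labels kept
renumbered = pd.concat([df1, df1], ignore_index=True)  # duplicate labels replaced by 0..n-1
side_by_side = pd.concat([df1, df4], axis=1)           # columns side by side, rows aligned on the index
```

With axis=1 the rows are aligned by label, so labels present in only one frame produce NaN in the other frame's columns.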

2. Combining two tables

left = pd.DataFrame({'key1': ['K0','K0','K1','K2'],
        'key2': ['K0','K1','K0','K1'],
        'A': ['A0','A1','A2','A3'],
        'B': ['B0','B1','B2','B3'],
        })

right = pd.DataFrame({'key1': ['K0','K1','K1','K2'],
        'key2': ['K0','K0','K0','K0'],
        'C': ['C0','C1','C2','C3'],
        'D': ['D0','D1','D2','D3'],
        })


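pd.merge(): combines exactly two tables per call

# pd.merge( left , right )
how :  default inner; left, right and outer are also accepted
suffixes = ('_left','_right')    # replaces the default _x / _y suffixes on overlapping column names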

pd.merge(left,right,how='inner',on='key1')
pd.merge(left,right,how='left',on='key1')
pd.merge(left,right,how='right',on='key1')
pd.merge(left,right , on=['key1','key2'])
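To make the join types concrete, the sketch below repeats the article's `left` and `right` tables and records what each call returns. The result variable names are only illustrative.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})

right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"],
                      "D": ["D0", "D1", "D2", "D3"]})

# Joining on key1 only: every matching pair of rows is emitted,
# and the non-key column key2 appears twice with suffixes.
inner_key1 = pd.merge(left, right, how="inner", on="key1")
renamed = pd.merge(left, right, on="key1", suffixes=("_left", "_right"))

# Joining on both keys: a row survives only if (key1, key2) matches.
inner_both = pd.merge(left, right, on=["key1", "key2"])
outer_both = pd.merge(left, right, how="outer", on=["key1", "key2"])
```

Joining on key1 alone gives 5 rows, because K0 matches 2 x 1 times, K1 1 x 2 times and K2 once. Joining on both keys gives 3 rows for the inner join and 6 for the outer join.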

rename(): renames columns and returns a new dataset

left1 = left.rename(columns = {"key1":"key1_l" , "key2":"key2_l"})

Note:

# left_index = True: use the index as the join key
left_index : boolean, default False
    Use the index from the left DataFrame as the join key(s). If it is a
    MultiIndex, the number of keys in the other DataFrame (either the index
    or a number of columns) must match the number of levels
right_index : boolean, default False
    Use the index from the right DataFrame as the join key. Same caveats as
    left_index
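A minimal sketch of joining on the index. The three tables and their names (`left_idx`, `right_idx`, `orders`) are hypothetical and chosen only to show the two common cases.

```python
import pandas as pd

# Hypothetical tables whose index holds the join key.
left_idx = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right_idx = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

# Index against index (inner join by default): only K0 and K2 appear in both.
by_index = pd.merge(left_idx, right_idx, left_index=True, right_index=True)

# A column on the left against the index on the right.
orders = pd.DataFrame({"key": ["K0", "K2", "K2"], "B": [1, 2, 3]})
mixed = pd.merge(orders, right_idx, left_on="key", right_index=True, how="left")
```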
