[Python Data Mining] Part 2: DataFrame Operations

Reposted; original post dated 2017-07-09 21:38, filed under Python数据分析 (Python data analysis).

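I. Indexing

Indexing is mainly used to slice data, that is, to select a subset of the data held in a pandas object.

1. loc: label-based; raises KeyError if a label does not exist

A single label
A list or array of labels
A slice of labels (based on index labels; both endpoints are included, unlike Python's half-open slices)
A boolean array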

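# df.loc[ row selector , column selector ]
# 1. A single label
df.loc['label']

# 2. A list or array of labels
df.loc[ [0,1,2] , : ]
df.loc[ [0,1,2] , [ 'color' , 'director_name' ] ]

# 3. A slice of labels (the end label 4 is included)
df.loc[ 0:4 , : ]

# 4. A boolean array (must be as long as the axis it filters)
df.loc[ [True,False,True] , : ]          # rows marked False are left out
df.loc[ df['duration']>=150 , : ]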
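The four kinds of loc selector can be tried on a small table. The sketch below is only illustrative: the rows are made-up sample data, and only the column names (color, director_name, duration) are taken from the snippets above.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the article.
df = pd.DataFrame(
    {
        "color": ["Color", "Color", "Black and White", "Color", "Color"],
        "director_name": ["A", "B", "C", "D", "E"],
        "duration": [178, 169, 148, 164, 132],
    }
)

single_row = df.loc[0]                                   # one label -> Series
subset = df.loc[[0, 1, 2], ["color", "director_name"]]   # lists of labels
label_slice = df.loc[0:3, :]                             # labels 0..3, end label included
long_films = df.loc[df["duration"] >= 150, :]            # boolean mask

# A label that is not in the index raises KeyError.
try:
    df.loc[99]
    missing_raises = False
except KeyError:
    missing_raises = True
```

Note that the label slice 0:3 returns four rows here, because loc includes the end label.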

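2. iloc: integer-position-based; raises IndexError if a requested position is out of range (out-of-range slices are truncated instead)

A single integer
A list or array of integers
An integer slice
A boolean array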

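# 1. A single integer
df.iloc[ 0 , : ]

# 2. A list or array of integers
df.iloc[ [0,1,2,3] , : ]

# 3. An integer slice (half-open: positions 0 to 3)
df.iloc[ 0:4 , : ]

# 4. A boolean array (must be as long as the axis it filters)
df.iloc[ [True,False, ] , : ]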
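To make the difference from loc visible, the following sketch uses a string index so that positions and labels cannot be confused. The data and the index labels are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample rows with a non-integer index.
df = pd.DataFrame(
    {
        "color": ["Color", "Color", "Black and White", "Color", "Color"],
        "duration": [178, 169, 148, 164, 132],
    },
    index=["a", "b", "c", "d", "e"],
)

first_row = df.iloc[0, :]                 # position 0 -> Series
picked = df.iloc[[0, 2, 4], :]            # list of positions
position_slice = df.iloc[0:4, :]          # positions 0..3, end position excluded
masked = df.iloc[[True, False, True, False, True], :]  # mask as long as the row axis

# A single position beyond the last row raises IndexError.
try:
    df.iloc[10, :]
    out_of_range_raises = False
except IndexError:
    out_of_range_raises = True
```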

3. ix: mixes loc and iloc (deprecated since pandas 0.20 and removed in pandas 1.0; use loc or iloc in new code)

df_new.ix[0:4,['color']]

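4. Application: conditional filtering

# 1. A filter condition yields a boolean array
df_new['director_facebook_likes'] >= 100

# 2. Combining conditions; mind the parentheses
( boolean array ) & ( boolean array ) & ( boolean array )

# 3. Applying the combined condition
df.loc[ ( boolean array ) & ( boolean array ) & ( boolean array ) ]

# 4. Flag the rows that meet the condition
df.loc[ (...) & (...) & (...) , 'new_column' ] = 1

# 5. Flag the rows that do not: negate the condition with ~(...); mind the parentheses
df.loc[ ~( (...) & (...) & (...) ) , 'new_column' ] = 0

### Second approach:
df.query(" condition & condition ")
df.query(" aa >= 100 ")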
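The filtering steps above can be sketched end to end as follows. The numbers, the second condition on duration, and the column name flag are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample rows; the "flag" column name is made up for this sketch.
df = pd.DataFrame(
    {
        "director_facebook_likes": [0, 563, 120, 22000, 80],
        "duration": [178, 169, 148, 164, 132],
    }
)

# Each comparison needs its own parentheses because & binds tighter than >=.
mask = (df["director_facebook_likes"] >= 100) & (df["duration"] >= 150)

df.loc[mask, "flag"] = 1     # flag the rows that satisfy the condition
df.loc[~mask, "flag"] = 0    # flag the remaining rows

via_loc = df.loc[mask]
# Inside query(), column names are used directly as variables.
via_query = df.query("director_facebook_likes >= 100 & duration >= 150")
```

Both approaches select the same rows; query() tends to read better when the condition involves several columns.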

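5. Other

The isin method tests whether each element of a Series is among the given values and returns a boolean Series.
Series.isin( [ '...' , '...' ] )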
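As a small illustration, isin can replace a chain of == comparisons joined with |. The country column and its values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame(
    {
        "country": ["USA", "UK", "Canada", "USA", "France"],
        "duration": [178, 148, 120, 164, 132],
    }
)

in_north_america = df["country"].isin(["USA", "Canada"])  # boolean Series
north_america = df.loc[in_north_america]
others = df.loc[~in_north_america]                        # negate to exclude
```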

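II. Setting the index

1. set_index: the keys are normally columns already in the dataset; arrays as long as the DataFrame are also accepted

df.set_index([ 'column name' , '...' ])
df.set_index([ [0,1,2,3] ])

keys : column label or list of column labels / arrays
drop : default True; the columns used as the new index are removed from the data
append : default False, which replaces the existing index; True appends the new keys to it
inplace : default False, which returns a new dataset and leaves the original unmodified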
verify_integrity : boolean, default False
    Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method

2. reset_index: the inverse of set_index

df.reset_index( ['xxx' , 'xxx'] )    # moves the listed index levels back into columns
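A short round trip shows how drop, append, and reset_index interact. This is a sketch on made-up data; the column names country and year are assumptions.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame(
    {
        "country": ["USA", "USA", "UK"],
        "year": [2009, 2012, 2012],
        "duration": [178, 164, 148],
    }
)

indexed = df.set_index(["country", "year"])        # drop=True: both columns leave the data
kept = df.set_index("country", drop=False)         # the column stays in the data as well
appended = df.set_index("country", append=True)    # old index kept as the first level
restored = indexed.reset_index()                   # index levels become columns again
```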

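III. MultiIndex

A MultiIndex suits complex data analysis, especially high-dimensional data.

A MultiIndex lets you group, select, and reshape data.

By default loc looks up the first index level. The indexer takes only one top-level comma, which separates the row selector from the column selector.

Selecting through several index levels:

df.loc[ 'first-level label' , 'second-level label' , '...' ]

Selecting specific columns as well (wrap the index labels in a tuple):

df.loc[ ('first-level label' , 'second-level label' , '...') , ]

Several values on one level (the DataFrame must be sorted first):

df.loc[ ( 'first-level label' , ['xxx','xxx'] ) , ]

Sorting:

df.sort_index(inplace=True)   # inplace=True modifies the original dataset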

Signature: df.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
Docstring:
Sort object by labels (along an axis)

Parameters
----------
axis : index, columns to direct sorting
level : int or level name or list of ints or list of level names
    if not None, sort on values in specified index level(s)
ascending : boolean, default True
    Sort ascending vs. descending
inplace : bool, default False
    if True, perform operation in-place
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
     Choice of sorting algorithm. See also ndarray.np.sort for more
     information.  `mergesort` is the only stable algorithm. For
     DataFrames, this option is only applied when sorting on a single
     column or label.
na_position : {'first', 'last'}, default 'last'
     `first` puts NaNs at the beginning, `last` puts NaNs at the end
sort_remaining : bool, default True
    if true and sorting by level and index is multilevel, sort by other
    levels too (in order) after sorting by specified level
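The MultiIndex selections and the sorting step described above can be combined in one small sketch. The two index levels (country, year) and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample rows, deliberately unsorted.
df = pd.DataFrame(
    {
        "country": ["USA", "UK", "Canada", "USA", "Australia", "UK"],
        "year": [2012, 2009, 2010, 2009, 2012, 2012],
        "duration": [164, 148, 120, 178, 132, 110],
    }
)

# Build the two-level index, then sort it so that later selections are well defined.
df = df.set_index(["country", "year"]).sort_index()

usa = df.loc["USA"]                          # first level only -> indexed by year
one_row = df.loc[("USA", 2009), :]           # tuple addresses both levels -> Series
several = df.loc[("UK", [2009, 2012]), :]    # a list selects several second-level values
```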


Range indexing with slice() (selects the elements within a range):

Every axis must be specified
The index must be sorted
Use the slice function

# slice(None)       selects every element
# slice('a' , 'z')  selects the elements from a to z

df.loc[ (slice(None) , slice('Australia','Canada')) , ]
df.loc[( slice(None) , slice('Australia','Canada')) , slice('color','duration') ]

pd.IndexSlice: write idx[...] in place of slice()

df1.loc[(slice(None),slice('Australia','Canada')),]

idx = pd.IndexSlice
df1.loc[ idx[ : , ['Australia','Canada'] ] , ]
#  : => slice(None), selects every element

df.loc[( slice(None) , ['USA'] , slice(100,200) ),]
df.loc[ idx[: , ['USA'] , 100:200 ] , : ]              # a list ['...' , '...'] means "any of these"; write a range with :
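The sketch below compares slice() with pd.IndexSlice on a sorted two-level index, and shows that a label range and a list of labels are not the same selection. The index levels (year, country) and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows indexed by (year, country).
df = (
    pd.DataFrame(
        {
            "year": [2009, 2010, 2012, 2012, 2009],
            "country": ["USA", "Canada", "Australia", "UK", "Brazil"],
            "color": ["Color", "Color", "Color", "Black and White", "Color"],
            "duration": [178, 120, 132, 110, 95],
        }
    )
    .set_index(["year", "country"])
    .sort_index()  # label slicing on a MultiIndex needs a sorted index
)

idx = pd.IndexSlice

# All years, countries from 'Australia' to 'Canada' inclusive (Brazil falls in the range).
by_slice = df.loc[(slice(None), slice("Australia", "Canada")), :]
by_idx = df.loc[idx[:, "Australia":"Canada"], :]

# A list selects exactly the named countries, not the range between them.
by_list = df.loc[idx[:, ["Australia", "Canada"]], :]

# Columns can be sliced by label in the same call.
with_columns = df.loc[idx[:, "Australia":"Canada"], "color":"duration"]
```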

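Returning a cross-section along an axis (selecting particular rows or columns)

df.xs  selects by a given index level
level : 0 / 1 / 2 identifies the index level; accepts level names (strings) as well as integers
axis : 0 / 1 chooses the axis; 0 selects from the row index, 1 from the columns
drop_level : default True, which drops the selected level from the result

df.xs( 1 , axis=0 )
df.xs( 'xx' , level = 'index level name' , drop_level = False )    # a single key
df.xs( ('xx' , 'xxxx' ) , level = (0,1) , drop_level = False )     # several keys, matched one-to-one with the levels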
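A minimal xs sketch on a three-level index follows. The level names (country, year, color) and the data are assumptions; the calls mirror the three patterns listed above.

```python
import pandas as pd

# Hypothetical sample rows indexed by three levels.
df = (
    pd.DataFrame(
        {
            "country": ["USA", "USA", "UK", "UK"],
            "year": [2009, 2012, 2009, 2012],
            "color": ["Color", "Color", "Black and White", "Color"],
            "duration": [178, 164, 148, 110],
        }
    )
    .set_index(["country", "year", "color"])
    .sort_index()
)

# Select by a level name; that level is dropped from the result by default.
y2012 = df.xs(2012, level="year")

# Keep the selected level in the result.
y2012_keep = df.xs(2012, level="year", drop_level=False)

# Two keys matched one-to-one with levels 0 and 1.
uk_2012 = df.xs(("UK", 2012), level=(0, 1), drop_level=False)

# axis=1 takes the cross-section from the columns instead of the rows.
durations = df.xs("duration", axis=1)
```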

IV. Multi-table operations

Example DataFrames for the multi-table operations:

import pandas as pd

df1 = pd.DataFrame({
        'A':['A0','A1','A2','A3'],
        'B':['B0','B1','B2','B3'],
        'C':['C0','C1','C2','C3'],
        'D':['D0','D1','D2','D3'],
        },
        index=[0,1,2,3])

df2 = pd.DataFrame({
        'A':['A4','A5','A6','A7'],
        'B':['B4','B5','B6','B7'],
        'C':['C4','C5','C6','C7'],
        'D':['D4','D5','D6','D7'],
        },
        index=[4,5,6,7])


df3 = pd.DataFrame({
        'A':['A8','A9','A10','A11'],
        'B':['B8','B9','B10','B11'],
        'C':['C8','C9','C10','C11'],
        'D':['D8','D9','D10','D11'],
        },
        index=[8,9,10,11])

df4 = pd.DataFrame({
        'B':['B2','B3','B6','B7'],
        'D':['D2','D3','D6','D7'],
        'F':['F2','F3','F6','F7'],
        },
        index=[2,3,6,7])


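1. pd.concat(): returns a new dataset

pd.concat( [ ... , ... , ] )
axis  : default 0 concatenates along the index (stacks rows); 1 concatenates along the columns
ignore_index : discards the existing labels on the concatenation axis and renumbers them 0 to n-1

pd.concat([df1,df2,df3])
pd.concat([df1,df2,df3] , axis=0 , ignore_index=False )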
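The effect of axis and ignore_index is easiest to see on cut-down tables. The frames below are shortened, hypothetical variants of the df1, df2, and df4 examples, not the originals.

```python
import pandas as pd

# Shortened, hypothetical variants of the example tables.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]}, index=[4, 5])
df4 = pd.DataFrame({"B": ["B1", "B4"], "F": ["F1", "F4"]}, index=[1, 4])

stacked = pd.concat([df1, df2])                        # rows stacked, labels kept
renumbered = pd.concat([df1, df2], ignore_index=True)  # rows relabelled from 0
side_by_side = pd.concat([df1, df4], axis=1)           # aligned on the index, outer join
```

With axis=1 the rows are aligned on their index labels, so labels present in only one table produce NaN in the other table's columns.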

2. Combining two tables

left = pd.DataFrame({'key1': ['K0','K0','K1','K2'],
        'key2': ['K0','K1','K0','K1'],
        'A': ['A0','A1','A2','A3'],
        'B': ['B0','B1','B2','B3'],
        })

right = pd.DataFrame({'key1': ['K0','K1','K1','K2'],
        'key2': ['K0','K0','K0','K0'],
        'C': ['C0','C1','C2','C3'],
        'D': ['D0','D1','D2','D3'],
        })


pd.merge(): combines only two tables per call

# pd.merge( left , right )
how : default 'inner'; also 'left' , 'right' , 'outer'
suffixes = ('_left','_right')    # replaces the default _x / _y suffixes added to overlapping column names

pd.merge(left,right,how='inner',on='key1')
pd.merge(left,right,how='left',on='key1')
pd.merge(left,right,how='right',on='key1')
pd.merge(left,right , on=['key1','key2'])
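To see what the how and suffixes options do, the sketch below rebuilds the left and right tables from this section and compares row counts. The variable names for the results are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})

right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"],
                      "D": ["D0", "D1", "D2", "D3"]})

# Joining on key1 only: every matching pair of rows is produced.
inner_key1 = pd.merge(left, right, how="inner", on="key1")

# Joining on both keys with different join types.
inner_both = pd.merge(left, right, on=["key1", "key2"])
left_both = pd.merge(left, right, how="left", on=["key1", "key2"])
outer_both = pd.merge(left, right, how="outer", on=["key1", "key2"])

# key2 exists in both tables but is not a join key here, so it gets suffixes.
renamed = pd.merge(left, right, on="key1", suffixes=("_left", "_right"))
```

A left join keeps every row of left and fills C and D with NaN where right has no match; an outer join additionally keeps the unmatched rows of right.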

rename(): changes column names and returns a new dataset

left1 = left.rename(columns = {"key1":"key1_l" , "key2":"key2_l"})

Note:

# left_index = True  uses the index as the join key
left_index : boolean, default False
    Use the index from the left DataFrame as the join key(s). If it is a
    MultiIndex, the number of keys in the other DataFrame (either the index
    or a number of columns) must match the number of levels
right_index : boolean, default False
    Use the index from the right DataFrame as the join key. Same caveats as
    left_index
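A minimal sketch of joining on the index, assuming two small made-up tables whose index labels play the role of the key. The second call mixes a key column on one side with the index on the other; the column name key is hypothetical.

```python
import pandas as pd

# Hypothetical tables keyed by their index labels.
left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

# Both indexes act as the join key (inner join by default).
by_index = pd.merge(left, right, left_index=True, right_index=True)

# Key column on the left, index on the right.
left_with_key = left.reset_index().rename(columns={"index": "key"})
mixed = pd.merge(left_with_key, right, left_on="key", right_index=True, how="left")
```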
