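I. Creating the data table
1. First import the pandas library. The numpy library is usually needed as well, so import it up front:
import numpy as np
import pandas as pd
2. Import a CSV or xlsx file:
df=pd.read_csv('name.csv',header=1)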
df=pd.DataFrame(pd.read_excel('name.xlsx'))
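The header argument belongs to read_csv, not to the DataFrame constructor. The sketch below shows what header=1 does; it reads from an in-memory string instead of a file, and the file contents are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: one title line above the real column names.
raw = "monthly export\nid,city,price\n1001,beijing,1000\n1002,wuhan,\n"

# header=1 means the line at index 1 holds the column names;
# the line above it is skipped.
df_csv = pd.read_csv(io.StringIO(raw), header=1)

print(df_csv.columns.tolist())
print(df_csv.shape)
```

The empty price field in the last row is read as NaN.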
3. Create a data table with pandas:
import numpy as np
import pandas as pd
df=pd.DataFrame({'id':[1001,1001,1003,1004,1005,1006],
'date':pd.date_range('20180101',periods=6),
'city':['beijing','shagnhai ','guangzhou','chengdu',' wuhan',' qingdao '],
'age':[22,45,56,33,24,43],
'category':['100-A','101-B','102-C','103-D','104-E','105-F'],
'price':[1000,np.nan,3000,4000,np.nan,6000]
},
columns=['id','date','city','age','category','price'])
print(df)
-------------------Running the code above returns-------------------
id date city age category price
0 1001 2018-01-01 beijing 22 100-A 1000.0
1 1001 2018-01-02 shagnhai 45 101-B NaN
2 1003 2018-01-03 guangzhou 56 102-C 3000.0
3 1004 2018-01-04 chengdu 33 103-D 4000.0
4 1005 2018-01-05 wuhan 24 104-E NaN
5 1006 2018-01-06 qingdao 43 105-F 6000.0
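II. Inspecting the data table
1. View the dimensions:
df.shape
-------------------Running the code above returns-------------------
(6, 6)
2. Basic table information (dimensions, column names, data types, memory usage, and so on):
df.info()
-------------------Running the code above returns-------------------
Basic table information:<class 'pandas.core.frame.DataFrame'>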
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
id 6 non-null int64
date 6 non-null datetime64[ns]
city 6 non-null object
age 6 non-null int64
category 6 non-null object
price 4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 368.0+ bytes
None
3. Data type of every column:
df.dtypes
-------------------Running the code above returns-------------------
Data type of every column:
id int64
date datetime64[ns]
city object
age int64
category object
price float64
dtype: object
4. Data type of a single column:
df['city'].dtype
-------------------Running the code above returns-------------------
Data type of a single column:
object
5. Null values:
df.isnull()
-------------------Running the code above returns-------------------
Null values:
id date city age category price
0 False False False False False False
1 False False False False False True
2 False False False False False False
3 False False False False False False
4 False False False False False True
5 False False False False False False
6. Null values in a single column:
df['age'].isnull()
-------------------Running the code above returns-------------------
0 False
1 False
2 False
3 False
4 False
5 False
Name: age, dtype: bool
7. Unique values of a single column (unique is a method, so it needs the parentheses):
df['id'].unique()
-------------------Running the code above returns-------------------
Unique values of a single column:
[1001 1003 1004 1005 1006]
8. View the values of the data table:
df.values
-------------------Running the code above returns-------------------
Values of the data table:
[[1001 Timestamp('2018-01-01 00:00:00') 'beijing' 22 '100-A' 1000.0]
[1001 Timestamp('2018-01-02 00:00:00') 'shagnhai' 45 '101-B' nan]
[1003 Timestamp('2018-01-03 00:00:00') 'guangzhou' 56 '102-C' 3000.0]
[1004 Timestamp('2018-01-04 00:00:00') 'chengdu' 33 '103-D' 4000.0]
[1005 Timestamp('2018-01-05 00:00:00') 'wuhan' 24 '104-E' nan]
[1006 Timestamp('2018-01-06 00:00:00') 'qingdao' 43 '105-F' 6000.0]]
9. View the column names:
df.columns
-------------------Running the code above returns-------------------
Column names: Index(['id', 'date', 'city', 'age', 'category', 'price'], dtype='object')
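10. View the first and last rows (head and tail return 5 rows by default; pass a number to change that):
df.head(1) #first row
df.tail(1) #last row
-------------------Running the code above returns-------------------
First row:
id date city age category price
0 1001 2018-01-01 beijing 22 100-A 1000.0
Last row:
id date city age category price
5 1006 2018-01-06 qingdao 43 105-F 6000.0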
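The inspection steps above can be condensed into per-column summaries. This is a minimal sketch on a three-row table invented for illustration; isnull().sum() counts the missing values in each column and nunique() counts the distinct values.

```python
import numpy as np
import pandas as pd

# Small illustrative table (not the tutorial's full data).
df = pd.DataFrame({'id': [1001, 1001, 1003],
                   'price': [1000, np.nan, 3000]})

missing_per_column = df.isnull().sum()  # number of NaN values per column
distinct_ids = df['id'].nunique()       # number of distinct id values

print(missing_per_column)
print(distinct_ids)
```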
III. Cleaning the data table
1. Fill null values with the number 0:
df.fillna(value=0)
-------------------Running the code above returns-------------------
Fill null values with 0:
id date city age category price
0 1001 2018-01-01 beijing 22 100-A 1000.0
1 1001 2018-01-02 shagnhai 45 101-B 0.0
2 1003 2018-01-03 guangzhou 56 102-C 3000.0
3 1004 2018-01-04 chengdu 33 103-D 4000.0
4 1005 2018-01-05 wuhan 24 104-E 0.0
5 1006 2018-01-06 qingdao 43 105-F 6000.0
2. Fill NaN values with the mean of the price column:
df['price'].fillna(df['price'].mean())
-------------------Running the code above returns-------------------
Fill NaN values with the mean of price:
0 1000.0
1 3500.0
2 3000.0
3 4000.0
4 3500.0
5 6000.0
3. Strip whitespace from the city field:
df['city'].map(str.strip)
-------------------Running the code above returns-------------------
Strip whitespace from the city field:
0 beijing
1 shagnhai
2 guangzhou
3 chengdu
4 wuhan
5 qingdao
Name: city, dtype: object
4. Case conversion:
df['city'].str.lower()
-------------------Running the code above returns-------------------
Case conversion:
0 beijing
1 shagnhai
2 guangzhou
3 chengdu
4 wuhan
5 qingdao
Name: city, dtype: object
5. Change the data type:
df['price'].astype('float')
-------------------Running the code above returns-------------------
Change the data type:
0 1000.0
1 NaN
2 3000.0
3 4000.0
4 NaN
5 6000.0
Name: price, dtype: float64
6. Rename a column:
df.rename(columns={'category':'category-size'})
-------------------Running the code above returns-------------------
Rename a column:
id date city age category-size price
0 1001 2018-01-01 beijing 22 100-A 1000.0
1 1001 2018-01-02 shagnhai 45 101-B NaN
2 1003 2018-01-03 guangzhou 56 102-C 3000.0
3 1004 2018-01-04 chengdu 33 103-D 4000.0
4 1005 2018-01-05 wuhan 24 104-E NaN
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0
7. Drop duplicates, keeping the first occurrence:
df['id'].drop_duplicates()
-------------------Running the code above returns-------------------
Drop duplicates, keeping the first occurrence:
0 1001
2 1003
3 1004
4 1005
5 1006
Name: id, dtype: int64
8. Drop duplicates, keeping the last occurrence:
df['id'].drop_duplicates(keep='last')
-------------------Running the code above returns-------------------
Drop duplicates, keeping the last occurrence:
1 1001
2 1003
3 1004
4 1005
5 1006
Name: id, dtype: int64
9. Replace values:
df['city'].replace('shagnhai','sh')
-------------------Running the code above returns-------------------
Replace values:
0 beijing
1 sh
2 guangzhou
3 chengdu
4 wuhan
5 QINGDAO
Name: city, dtype: object
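Each cleaning call above returns a new object and leaves df unchanged, so the result has to be assigned back to take effect. The sketch below chains several of the steps on a small made-up table; the name cleaned is only illustrative.

```python
import numpy as np
import pandas as pd

# Small illustrative table with stray spaces, mixed case, a NaN and a duplicate id.
df = pd.DataFrame({'id': [1001, 1002, 1002],
                   'city': [' Beijing', 'shagnhai ', 'guangzhou'],
                   'price': [1000, np.nan, 3000]})

cleaned = df.copy()
# Assign each result back; the methods do not modify the column in place.
cleaned['city'] = cleaned['city'].str.strip().str.lower()
cleaned['price'] = cleaned['price'].fillna(cleaned['price'].mean())
# Keep the first row for every id and drop later repeats.
cleaned = cleaned.drop_duplicates(subset='id', keep='first')

print(cleaned)
```

The original df still holds the untrimmed city names and the NaN, which is a quick way to confirm that nothing was changed in place.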
IV. Data preprocessing
df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],
"gender":['male','female','male','female','male','female','male','female'],
"pay":['Y','N','Y','Y','N','Y','N','Y'],
"m-point":[10,12,20,40,40,40,30,20]})
1. Merging data tables
df_inner=pd.merge(df,df1,how='inner')
df_left=pd.merge(df,df1,how='left')
df_right=pd.merge(df,df1,how='right')
df_outer=pd.merge(df,df1,how='outer')
-------------------Running the code above returns-------------------
Inner join:
id date city age category price gender pay m-point
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
Left join:
id date city age category price gender pay m-point
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
Right join:
df_right
Full outer join:
df_outer
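In the tutorial's data every id in df also appears in df1, so the inner and left joins look identical. The sketch below uses two small made-up tables whose ids only partly overlap. It names the key explicitly with on='id' and adds indicator=True, which records where each row of the outer join came from.

```python
import pandas as pd

# Illustrative tables: ids 1002 and 1003 appear in both.
left = pd.DataFrame({'id': [1001, 1002, 1003],
                     'city': ['beijing', 'wuhan', 'chengdu']})
right = pd.DataFrame({'id': [1002, 1003, 1004],
                      'gender': ['female', 'male', 'female']})

inner = pd.merge(left, right, on='id', how='inner')       # ids in both tables
left_join = pd.merge(left, right, on='id', how='left')    # every id from left
right_join = pd.merge(left, right, on='id', how='right')  # every id from right
# indicator=True adds a _merge column: left_only, right_only or both.
outer = pd.merge(left, right, on='id', how='outer', indicator=True)

print(outer)
```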
2. Set the index column
df_inner.set_index('id')
-------------------Running the code above returns-------------------
Set the index column:
date city age category price gender pay m-point
id
1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
1005 2018-01-05 wuhan 24 104-E NaN male N 40
1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
3. Sort by the values of a specific column:
df_inner.sort_values(by=['age'])
-------------------Running the code above returns-------------------
Sort by the values of a specific column:
id date city age category price gender pay m-point
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
4. Sort by the index column:
df_inner.sort_index()
-------------------Running the code above returns-------------------
Sort by the index column:
id date city age category price gender pay m-point
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
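5. If the value in the price column is >3000, the group column shows high; otherwise it shows low:
df_inner['group']=np.where(df_inner['price']>3000,'high','low')
-------------------Running the code above returns-------------------
If price>3000, the group column shows high; otherwise it shows low: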
id date city age category price gender pay m-point group
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high
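In the output above, rows whose price is NaN are labeled low, because a comparison with NaN is always False. If missing prices need their own label, np.select is one option. The sketch below is an illustration on a standalone Series, and the label unknown is an assumption rather than part of the original tutorial.

```python
import numpy as np
import pandas as pd

price = pd.Series([1000, np.nan, 3000, 4000])

# Two-way split as in the step above: NaN > 3000 is False, so NaN becomes 'low'.
two_way = np.where(price > 3000, 'high', 'low')

# Three-way split: conditions are checked in order and the first match wins.
three_way = np.select(
    [price.isna().to_numpy(), (price > 3000).to_numpy()],
    ['unknown', 'high'],
    default='low',
)

print(list(two_way))
print(list(three_way))
```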
6. Flag the rows that satisfy several conditions at once
df_inner.loc[(df_inner['city']=='beijing')&(df_inner['price']>100),'sign']=1
-------------------Running the code above returns-------------------
Flag the rows that satisfy several conditions at once:
id date city age category price gender pay m-point group sign
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN
7. Split each value of the category field and build a data table from the parts; its index is the index of df_inner and its columns are named category and size
split=pd.DataFrame((x.split('-') for x in df_inner['category']),index=df_inner.index,columns=['category','size'])
-------------------Running the code above returns-------------------
category size
0 100 A
1 101 B
2 102 C
3 103 D
4 104 E
5 105 F
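The same split can also be written with the vectorized string accessor. This is a sketch on a three-row table made up for illustration; the column name category_code is hypothetical, chosen so that it does not clash with the existing category column when the parts are joined back.

```python
import pandas as pd

df = pd.DataFrame({'category': ['100-A', '101-B', '102-C']})

# expand=True returns one column per piece instead of a list per row.
parts = df['category'].str.split('-', expand=True)
parts.columns = ['category_code', 'size']

# join aligns on the index, so each row gets its own pieces back.
result = df.join(parts)
print(result)
```

Because the new column names differ from the existing ones, the joined table has no _x and _y suffixes.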
8. Match the split table back to the original df_inner table
df_inner=pd.merge(df_inner,split,left_index=True,right_index=True)
-------------------Running the code above returns-------------------
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
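V. Data extraction
The two main indexers are loc and iloc: loc extracts by label and iloc extracts by position. Older tutorials also use ix, which accepted labels and positions together, but ix was deprecated in pandas 0.20 and removed in pandas 1.0.
1. Extract a single row by index
df_inner.loc[3]
-------------------Running the code above returns-------------------
Extract a single row by index: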
id 1004
date 2018-01-04 00:00:00
city chengdu
age 33
category_x 103-D
price 4000
gender female
pay Y
m-point 40
group high
sign NaN
category_y 103
size D
Name: 3, dtype: object
2. Extract a range of rows by index
df_inner.iloc[0:5]
-------------------Running the code above returns-------------------
Extract a range of rows by index:
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
3. Reset the index
df_inner.reset_index()
-------------------Running the code above returns-------------------
Reset the index:
index id date city age category_x price gender pay m-point group sign category_y size
0 0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
4 4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
5 5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
4. Set the date as the index
df_inner.set_index('date')
-------------------Running the code above returns-------------------
Set the date as the index:
id city age category_x price gender pay m-point group sign category_y size
date
2018-01-01 1001 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
2018-01-02 1001 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2018-01-03 1003 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
2018-01-04 1004 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
2018-01-05 1005 wuhan 24 104-E NaN male N 40 low NaN 104 E
2018-01-06 1006 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
5. Extract all data before the 4th
df_inner.loc[df_inner['date']<'2018-01-04']
-------------------Running the code above returns-------------------
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
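There are two common ways to select rows by date, and they treat the end point differently. The sketch below uses a small table built for illustration. A boolean comparison on the date column excludes the 4th, while a label slice on a date index includes the end label.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('20180101', periods=6),
                   'id': [1001, 1001, 1003, 1004, 1005, 1006]})

# Boolean mask: strictly before the 4th, integer index preserved.
before_4th = df.loc[df['date'] < '2018-01-04']

# Label slice on a date index: the end label is included.
by_date = df.set_index('date')
through_4th = by_date.loc[:'2018-01-04']

print(len(before_4th), len(through_4th))
```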
6. Use iloc to extract a range by position
df_inner.iloc[0:5]
-------------------Running the code above returns-------------------
Extract data by position:
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
7. Use iloc to extract individual rows and columns by position
df_inner.iloc[[0,2,5],[4,5]]
-------------------Running the code above returns-------------------
category_x price
0 100-A 1000.0
2 102-C 3000.0
5 105-F 6000.0
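8. Combine a row condition with positional column selection (formerly done with ix; chain loc and iloc instead)
df_inner.loc[df_inner['date']<='2018-01-03'].iloc[:,:4]
-------------------Running the code above returns-------------------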
id date city age
0 1001 2018-01-01 beijing 22
1 1001 2018-01-02 shagnhai 45
2 1003 2018-01-03 guangzhou 56
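Because ix is not available in pandas 1.0 and later, mixed label and position selection has to be written with loc and iloc. The sketch below shows two equivalent ways on a small date-indexed table made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('20180101', periods=4),
                   'id': [1001, 1001, 1003, 1004],
                   'city': ['beijing', 'shagnhai', 'guangzhou', 'chengdu'],
                   'age': [22, 45, 56, 33]})
by_date = df.set_index('date')

# Option 1: label slice for the rows, then positional slice for the columns.
subset = by_date.loc[:'2018-01-03'].iloc[:, :2]

# Option 2: translate the column positions to labels and use loc only.
subset_loc = by_date.loc[:'2018-01-03', by_date.columns[:2]]

print(subset)
```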
9. Check whether the value of the city column is beijing
df_inner['city'].isin(['beijing'])
-------------------Running the code above returns-------------------
Check whether the value of the city column is beijing:
0 True
1 False
2 False
3 False
4 False
5 False
10. Check whether the city column contains beijing or shanghai, then extract the matching rows
df_inner.loc[df_inner['city'].isin(['beijing','shanghai'])]
-------------------Running the code above returns-------------------
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
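isin compares whole strings exactly, so values with stray spaces or typos do not match. That is why only the beijing row is returned above. The sketch below, on a standalone Series with the same kind of untidy values, strips the whitespace before matching.

```python
import pandas as pd

city = pd.Series(['beijing', 'shagnhai ', ' wuhan', ' qingdao '])

# Exact match on the raw strings: ' wuhan' is not equal to 'wuhan'.
raw_match = city.isin(['beijing', 'wuhan'])

# Strip the surrounding spaces first, then match.
clean_match = city.str.strip().isin(['beijing', 'wuhan'])

print(raw_match.tolist())
print(clean_match.tolist())
```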
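11. Extract the first three characters and build a data table
pd.DataFrame(df_inner['category_x'].str[:3])
VI. Data filtering
Use the and, or, and not conditions together with greater than, less than, and equal to in order to filter the data, then count and sum the results.
1. Filter with "and"
df_inner.loc[(df_inner['age']>20)&(df_inner['city']=='beijing'),['id','gender']]
-------------------Running the code above returns-------------------
Filter with "and":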
id gender
0 1001 male
2. Filter with "or"
df_inner.loc[(df_inner['age']>20)|(df_inner['city']=='beijing'),['id','gender','age']]
-------------------Running the code above returns-------------------
Filter with "or":
id gender age
0 1001 male 22
1 1001 male 45
2 1003 male 56
3 1004 female 33
4 1005 male 24
5 1006 female 43
3. Filter with a "not" condition
df_inner.loc[(df_inner['city']!='beijing'),['id','gender','city']]
-------------------Running the code above returns-------------------
Filter with a "not" condition:
id gender city
1 1001 male shagnhai
2 1003 male guangzhou
3 1004 female chengdu
4 1005 male wuhan
5 1006 female QINGDAO
4. Count the filtered data by the city column
df_inner.loc[(df_inner['city']=='beijing')].city.count()
-------------------Running the code above returns-------------------
Count the filtered data:
1
5. Filter with the query function
df_inner.query('city==["beijing","shanghai"]')
-------------------Running the code above returns-------------------
Filter with the query function:
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
6. Sum price over the filtered result
df_inner[(df_inner['city']=='beijing')].price.sum()
-------------------Running the code above returns-------------------
Sum price over the filtered result:
1000.0
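The same filters can be written as boolean masks or as a query string. The sketch below compares the two styles on a small made-up table. Each comparison in a mask needs its own parentheses because & and | bind more tightly than > and ==, and ~ negates a whole mask.

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'chengdu', 'wuhan'],
                   'age': [22, 33, 24],
                   'price': [1000.0, 4000.0, 2000.0]})

# "and" written as a boolean mask.
by_mask = df.loc[(df['age'] > 23) & (df['price'] < 3000)]

# The same filter as a query string, which uses the word "and".
by_query = df.query('age > 23 and price < 3000')

# "not" written with ~, which inverts the mask.
not_beijing = df.loc[~(df['city'] == 'beijing')]

print(by_mask)
```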
VII. Data summarization
The main functions are groupby and pivot_table.
1. Count all columns
df_inner.groupby('city').count()
-------------------Running the code above returns-------------------
Count all columns:
id date age category_x price gender pay m-point group sign category_y size
city
QINGDAO 1 1 1 1 1 1 1 1 1 0 1 1
beijing 1 1 1 1 1 1 1 1 1 1 1 1
chengdu 1 1 1 1 1 1 1 1 1 0 1 1
guangzhou 1 1 1 1 1 1 1 1 1 0 1 1
shagnhai 1 1 1 1 0 1 1 1 1 0 1 1
wuhan 1 1 1 1 0 1 1 1 1 0 1 1
2. Count the id field by city
df_inner.groupby('city')['id'].count()
-------------------Running the code above returns-------------------
Count id by city:
city
QINGDAO 1
beijing 1
chengdu 1
guangzhou 1
shagnhai 1
wuhan 1
3. Count by two fields
df_inner.groupby(['city','size'])['id'].count()
-------------------Running the code above returns-------------------
Count by two fields:
city size
QINGDAO F 1
beijing A 1
chengdu D 1
guangzhou C 1
shagnhai B 1
wuhan E 1
Name: id, dtype: int64
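4. Group by the city field and compute the row count, total, and mean of price
df_inner.groupby('city')['price'].agg([len,np.sum,np.mean])
-------------------Running the code above returns-------------------
Group by the city field and compute the row count, total, and mean of price: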
len sum mean
city
QINGDAO 1.0 6000.0 6000.0
beijing 1.0 1000.0 1000.0
chengdu 1.0 4000.0 4000.0
guangzhou 1.0 3000.0 3000.0
shagnhai 1.0 0.0 NaN
wuhan 1.0 0.0 NaN
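This section names pivot_table as a main function but never shows it. The sketch below is a minimal illustration on a made-up table. It builds a city by size table of summed prices and, for comparison, a groupby summary that uses string aggregation names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'beijing', 'wuhan', 'wuhan'],
                   'size': ['A', 'B', 'A', 'A'],
                   'price': [1000.0, 3000.0, np.nan, 4000.0]})

# Rows are cities, columns are sizes, cells hold the summed price;
# combinations that never occur are filled with 0.
table = pd.pivot_table(df, index='city', columns='size',
                       values='price', aggfunc='sum', fill_value=0)

# groupby equivalent for a single key; 'count' ignores NaN values.
summary = df.groupby('city')['price'].agg(['count', 'sum', 'mean'])

print(table)
print(summary)
```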
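VIII. Data statistics
Data sampling, and computing the standard deviation, covariance, and correlation coefficient
1. Simple random sampling
df_inner.sample(n=3)
-------------------Running the code above returns-------------------
Simple random sampling: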
id date city age category_x price gender pay m-point group sign category_y size
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
2. Set the sampling weights manually
weights=[0,0,0,0,0.5,0.5]
df_inner.sample(n=2,weights=weights)
-------------------Running the code above returns-------------------
Set the sampling weights manually:
id date city age category_x price gender pay m-point group sign category_y size
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
3. Sampling without replacement
df_inner.sample(n=6,replace=False)
-------------------Running the code above returns-------------------
Sampling without replacement:
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
4. Sampling with replacement
df_inner.sample(n=6,replace=True)
-------------------Running the code above returns-------------------
Sampling with replacement:
id date city age category_x price gender pay m-point group sign category_y size
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
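The sampling outputs above change on every run. Passing random_state to sample makes a draw repeatable. The sketch below is an illustration on a small made-up table, and the seed values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003, 1004, 1005, 1006]})

# The same seed gives the same rows on every run.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# Rows with weight 0 can never be drawn, so only the last two rows qualify.
weighted = df.sample(n=2, weights=[0, 0, 0, 0, 0.5, 0.5], random_state=0)

print(first.equals(second))
print(sorted(weighted['id']))
```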
5. Descriptive statistics for the data table
df_inner.describe().round(2).T
-------------------Running the code above returns-------------------
Descriptive statistics for the data table:
count mean std min 25% 50% 75% max
id 6.0 1003.33 2.07 1001.0 1001.50 1003.5 1004.75 1006.0
age 6.0 37.17 13.20 22.0 26.25 38.0 44.50 56.0
price 4.0 3500.00 2081.67 1000.0 2500.00 3500.0 4500.00 6000.0
m-point 6.0 26.67 15.06 10.0 12.50 30.0 40.00 40.0
sign 1.0 1.00 NaN 1.0 1.00 1.0 1.00 1.0
6. Compute the standard deviation of a column
df_inner['price'].std()
-------------------Running the code above returns-------------------
Standard deviation of a column:
2081.6659994661327
7. Compute the covariance between two fields
df_inner['price'].cov(df_inner['m-point'])
-------------------Running the code above returns-------------------
Covariance between two fields:
28333.333333333332
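8. Covariance between all numeric fields in the data table (pandas 2.0 and later raise an error on text columns unless numeric_only=True is passed)
df_inner.cov(numeric_only=True)
-------------------Running the code above returns-------------------
Covariance between all fields in the data table: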
id age price m-point sign
id 4.266667 0.333333 4.333333e+03 29.333333 NaN
age 0.333333 174.166667 1.366667e+04 -31.333333 NaN
price 4333.333333 13666.666667 4.333333e+06 28333.333333 NaN
m-point 29.333333 -31.333333 2.833333e+04 226.666667 NaN
sign NaN NaN NaN NaN NaN
9. Correlation between two fields
df_inner['price'].corr(df_inner['m-point'])
-------------------Running the code above returns-------------------
Correlation between two fields:
0.9073928715621604
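10. Correlation across the data table (numeric columns only)
df_inner.corr(numeric_only=True)
-------------------Running the code above returns-------------------
Correlation between all fields in the data table: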
id age price m-point sign
id 1.000000 0.012228 1.000000 0.943242 NaN
age 0.012228 1.000000 0.453406 -0.157699 NaN
price 1.000000 0.453406 1.000000 0.907393 NaN
m-point 0.943242 -0.157699 0.907393 1.000000 NaN
sign NaN NaN NaN NaN NaN
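Covariance and correlation are defined only for numeric columns. One way to apply them to a mixed table that works on both older and newer pandas versions is to select the numeric columns first. The sketch below rebuilds a reduced copy of the tutorial's data for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1001, 1001, 1003, 1004, 1005, 1006],
                   'city': ['beijing', 'shagnhai', 'guangzhou',
                            'chengdu', 'wuhan', 'qingdao'],
                   'price': [1000, np.nan, 3000, 4000, np.nan, 6000],
                   'm-point': [10, 10, 20, 40, 40, 40]})

# Keep only the numeric columns so the text column is left out.
numeric = df.select_dtypes(include='number')
corr_matrix = numeric.corr()

# Rows where either value is NaN are ignored in the pairwise calculation.
cov_value = df['price'].cov(df['m-point'])

print(corr_matrix.round(6))
print(cov_value)
```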
IX. Data output
The analyzed data can be written out in xlsx and csv formats.
1. Write to Excel
df_inner.to_excel('excel_to_python.xlsx',sheet_name='bluewhale_cc')
2. Write to CSV
df_inner.to_csv('excel_to_python.csv')
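By default to_csv also writes the index as an unnamed first column, which returns as an extra column when the file is read back. The sketch below round-trips a small made-up table through an in-memory buffer instead of a file and passes index=False to leave the index out.

```python
import io
import pandas as pd

df = pd.DataFrame({'id': [1001, 1003], 'city': ['beijing', 'guangzhou']})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # write only the data columns
header_line = buffer.getvalue().splitlines()[0]

buffer.seek(0)                  # rewind before reading the text back
restored = pd.read_csv(buffer)

print(header_line)
print(restored)
```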