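In data analysis, data sometimes has to be transformed to fit the task at hand. In pandas, the common ways to transform data are:
- Using a function or a mapping
- A function you define yourself, or one provided by another package, can be applied to a pandas object to modify its values in bulk.
- The applymap and map instance methods
This section uses employee records surveyed from a company as the example data:
number_project: the number of projects the employee works on
left: whether the employee has left the company
salary: salary level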
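As a minimal sketch of how the two methods differ, the snippet below uses a small made-up DataFrame (the names `df`, `a`, and `b` are illustrative and not part of the HR data). `Series.map` works on one column, while `DataFrame.applymap` works on every element of the frame. Since pandas 2.1 `applymap` has been renamed to `DataFrame.map`, so the sketch picks whichever is available.

```python
import pandas as pd

# Hypothetical toy data; not the HR dataset used in this section.
df = pd.DataFrame({"a": ["x", "y"], "b": ["p", "q"]})

# Series.map applies a function to each element of a single column.
upper_a = df["a"].map(str.upper)

# DataFrame.applymap applies a function to every element of the frame.
# pandas 2.1 renamed it to DataFrame.map, so fall back on older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap
upper_all = elementwise(str.upper)

print(upper_a.tolist())            # ['X', 'Y']
print(upper_all.values.tolist())   # [['X', 'P'], ['Y', 'Q']]
```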
    >>> import pandas as pd
    >>> import numpy as np
    >>> data = pd.read_csv('./input/HR.csv', encoding='gbk')
    >>> data = data[['number_project', 'left', 'salary']]
    >>> data.head()
       number_project  left  salary
    0               2     1     low
    1               5     1  medium
    2               7     1  medium
    3               5     1     low
    4               2     1     low
I. map() and replace()
(1) Using a function. Example: convert the salary column so that the first letter of each word is capitalized:
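    >>> # assign the result back so that later outputs show the title-cased values
    >>> data['salary'] = data['salary'].map(str.title)
    >>> data['salary'][:5]
    0       Low
    1    Medium
    2    Medium
    3       Low
    4       Low
    Name: salary, dtype: object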
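(2) Using a dict that defines the mapping. Example: turn left into an indicator label, where 'YES' stands for left=1 and 'NO' stands for left=0. (In real data processing, strings are usually converted to 0, 1, ..., n; the opposite direction is used here only because it is easier to read.)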
    >>> mapper = {0: 'NO', 1: 'YES'}
    >>> data['left'] = data['left'].map(mapper)
    >>> data.head()
       number_project  left  salary
    0               2   YES     Low
    1               5   YES  Medium
    2               7   YES  Medium
    3               5   YES     Low
    4               2   YES     Low
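Note: when map() is given a dict, the dict must cover every value in the data. Any value without an entry in the dict becomes NaN, as the following example shows: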
    >>> s = pd.Series(['A', 'B', 'C'])
    >>> s
    0    A
    1    B
    2    C
    dtype: object
    >>> s.map({'A': 10, 'B': 100})
    0     10.0
    1    100.0
    2      NaN
    dtype: float64
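Two ways of guarding against the NaN behaviour can be sketched as follows. This is only one possible approach, and the variable names `unmapped` and `kept` are made up for illustration: first list the values that the dict does not cover, then, if the original value should survive, fall back to it with `dict.get`.

```python
import pandas as pd

s = pd.Series(["A", "B", "C"])
mapper = {"A": 10, "B": 100}

# Values missing from the dict become NaN, which also forces a float dtype.
mapped = s.map(mapper)

# Option 1: list the values that have no mapping before relying on the result.
unmapped = s[~s.isin(list(mapper))].unique().tolist()

# Option 2: keep the original value wherever the dict has no entry.
kept = s.map(lambda v: mapper.get(v, v))

print(unmapped)        # ['C']
print(kept.tolist())   # [10, 100, 'C']
```

Option 2 leaves a column of mixed types, so it is mainly useful when the unmapped values are meant to pass through unchanged.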
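(3) Renaming an index: the map method can also transform the Index object that holds the row labels or the column names (both the row index and the column names are Index objects). map returns a new Index, so the result has to be assigned back for the rename to take effect.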
    >>> data.columns
    Index(['number_project', 'left', 'salary'], dtype='object')
    >>> data.columns.map(str.upper)
    Index(['NUMBER_PROJECT', 'LEFT', 'SALARY'], dtype='object')
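The call above only returns a new Index. A minimal sketch of actually applying the rename, using a small stand-in DataFrame with the same column names (the rows are made up), could look like this:

```python
import pandas as pd

# Stand-in for the HR data: same column names, made-up rows.
df = pd.DataFrame(
    {"number_project": [2, 5], "left": [1, 1], "salary": ["low", "medium"]}
)

# Index.map returns a new Index; df.columns is unchanged until reassigned.
upper = df.columns.map(str.upper)
before = list(df.columns)

# Assign the result back to rename the columns on df itself...
df.columns = upper

# ...or use rename, which accepts a function and returns a new DataFrame.
lowered = df.rename(columns=str.lower)

print(before)                  # ['number_project', 'left', 'salary']
print(list(df.columns))        # ['NUMBER_PROJECT', 'LEFT', 'SALARY']
print(list(lowered.columns))   # ['number_project', 'left', 'salary']
```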
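(4) Replacing values according to a mapping: use replace(). Several values can be replaced by passing two lists, and a small number of values can be replaced by passing a dict that holds the mapping.
Example: set the number_project values 2, 3, 4 to Less and the values 5, 6, 7 to More.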
    >>> data['number_project'] = data['number_project'].replace([2, 3, 4, 5, 6, 7], ['Less', 'Less', 'Less', 'More', 'More', 'More'])
    >>> data.head()
       number_project  left  salary
    0            Less   YES     Low
    1            More   YES  Medium
    2            More   YES  Medium
    3            More   YES     Low
    4            Less   YES     Low
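The list form and the dict form of replace() can be compared side by side. The sketch below uses a stand-in Series with the first five number_project values from the sample output; the names `projects` and `mapping` are illustrative.

```python
import pandas as pd

projects = pd.Series([2, 5, 7, 5, 2], name="number_project")

# List form: the i-th old value is replaced by the i-th new value.
by_list = projects.replace(
    [2, 3, 4, 5, 6, 7],
    ["Less", "Less", "Less", "More", "More", "More"],
)

# Dict form: old value -> new value, built here from the two groups.
mapping = {n: "Less" for n in (2, 3, 4)}
mapping.update({n: "More" for n in (5, 6, 7)})
by_dict = projects.replace(mapping)

print(by_dict.tolist())   # ['Less', 'More', 'More', 'More', 'Less']
```

Unlike map() with a dict, replace() leaves any value that is not listed untouched instead of turning it into NaN.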
Now consider another dataset, test_loan, shown below:
(index) | user | term | int_rate | grade | loan_status |
---|---|---|---|---|---|
389 | 8 | 36 months | 13.66% | C | Fully Paid |
417 | 9 | 36 months | 11.99% | B | Charged Off |
705 | 6 | 60 months | 15.59% | D | Fully Paid |
921 | 7 | 60 months | 11.44% | B | Fully Paid |
1138 | 4 | 36 months | 13.66% | C | Fully Paid |
1251 | 5 | 36 months | 13.66% | C | Charged Off |
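1) Loans whose loan_status is "Charged Off" carry default risk and are treated as bad loans. Mark them as 1 and mark all other loans as 0. We use replace() to substitute the values: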
    >>> test_loan['loan_status'] = test_loan['loan_status'].replace(["Charged Off", "Fully Paid"], [1, 0])
    >>> test_loan
          user       term  int_rate  grade  loan_status
    389      8  36 months    13.66%      C            0
    417      9  36 months    11.99%      B            1
    705      6  60 months    15.59%      D            0
    921      7  60 months    11.44%      B            0
    1138     4  36 months    13.66%      C            0
    1251     5  36 months    13.66%      C            1
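replace() only changes the values that are listed, so a third status, if one existed, would stay a string. When the rule is literally "Charged Off is 1, everything else is 0", a boolean comparison is an alternative worth considering. The sketch below rebuilds test_loan by hand from the table above; `is_bad` is an illustrative name.

```python
import pandas as pd

# The test_loan rows shown in the table above, rebuilt by hand.
test_loan = pd.DataFrame(
    {
        "user": [8, 9, 6, 7, 4, 5],
        "term": ["36 months", "36 months", "60 months",
                 "60 months", "36 months", "36 months"],
        "int_rate": ["13.66%", "11.99%", "15.59%",
                     "11.44%", "13.66%", "13.66%"],
        "grade": ["C", "B", "D", "B", "C", "C"],
        "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                        "Fully Paid", "Fully Paid", "Charged Off"],
    },
    index=[389, 417, 705, 921, 1138, 1251],
)

# True for bad loans, False for every other status, then cast to 0/1.
is_bad = (test_loan["loan_status"] == "Charged Off").astype(int)

print(is_bad.tolist())   # [0, 1, 0, 0, 0, 1]
```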
2) replace() can also replace different values in different columns with the same new value in a single call:
    >>> test_loan.replace(to_replace={'loan_status': 0, 'grade': 'B'}, value='Good')
          user       term  int_rate  grade  loan_status
    389      8  36 months    13.66%      C         Good
    417      9  36 months    11.99%   Good  Charged Off
    705      6  60 months    15.59%      D         Good
    921      7  60 months    11.44%   Good         Good
    1138     4  36 months    13.66%      C         Good
    1251     5  36 months    13.66%      C  Charged Off
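When each column needs its own new value rather than a shared one, replace() also accepts a nested dict of the form {column: {old: new}}, in which case the value argument is left out. The following is a small sketch on a three-row stand-in for test_loan; the labels 'Good' and 'Bad' are chosen only for illustration.

```python
import pandas as pd

# Three rows taken from the test_loan table, with only two of its columns.
df = pd.DataFrame(
    {
        "grade": ["C", "B", "D"],
        "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
    },
    index=[389, 417, 705],
)

# Nested dict: outer keys are column names, inner dicts map old -> new.
result = df.replace(
    {"grade": {"B": "Good"}, "loan_status": {"Charged Off": "Bad"}}
)

print(result["grade"].tolist())         # ['C', 'Good', 'D']
print(result["loan_status"].tolist())   # ['Fully Paid', 'Bad', 'Fully Paid']
```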
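Note: to_replace is the value to be replaced, and value is the new value that takes its place. replace is a very general value-substitution method: to_replace accepts scalars, lists, dicts, and regular expressions, so it covers many kinds of operations.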
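3) Regular expressions can be used as well: set regex=True, which tells replace() to treat to_replace as a regular expression.
Example: replace every value that starts with D with Bad. Note that the pattern D+.*$ used here is not anchored with ^, so it matches from the first D found anywhere in a string; that is harmless for this data, but ^D.*$ would restrict the match to strings that begin with D.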
    >>> test_loan.replace(to_replace='D+.*$', value='Bad', regex=True)
          user       term  int_rate  grade  loan_status
    389      8  36 months    13.66%      C   Fully Paid
    417      9  36 months    11.99%      B  Charged Off
    705      6  60 months    15.59%    Bad   Fully Paid
    921      7  60 months    11.44%      B   Fully Paid
    1138     4  36 months    13.66%      C   Fully Paid
    1251     5  36 months    13.66%      C  Charged Off
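How much the pattern matches depends on whether it is anchored. The sketch below uses a made-up Series of grade-like strings (`grades`, with the hypothetical values 'D2' and 'AD') to compare the pattern from the example with a variant anchored at the start of the string.

```python
import pandas as pd

# Hypothetical values chosen to show the difference; not from test_loan.
grades = pd.Series(["D", "D2", "AD", "B"])

# Unanchored: replaces from the first 'D' onward, wherever it appears.
unanchored = grades.replace(to_replace=r"D+.*$", value="Bad", regex=True)

# Anchored with ^: only strings that begin with 'D' are replaced.
anchored = grades.replace(to_replace=r"^D.*$", value="Bad", regex=True)

print(unanchored.tolist())   # ['Bad', 'Bad', 'ABad', 'B']
print(anchored.tolist())     # ['Bad', 'Bad', 'AD', 'B']
```

With a string replacement value, only the matched part of each string is substituted, which is why 'AD' becomes 'ABad' under the unanchored pattern.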