The goal is to implement something like SQL's case when in pandas, so that each user can be given a label.
For example:
select tj_month,
       name,
       online_time,
       case when online_time < 12 then '(3,12)'
            when online_time >= 12 and online_time < 24 then '[12,24)'
            when online_time >= 24 and online_time < 36 then '[24,36)'
            when online_time >= 36 and online_time < 48 then '[36,48)'
            when online_time >= 48 and online_time < 60 then '[48,60)'
            else '>=60' end as online_time_cut
from table_name
where tj_month = '202106';
1. Filter directly with pandas.DataFrame.loc
Build a test DataFrame.
import numpy as np
import pandas as pd
data = np.array([[np.nan, 0], [2, 0], [np.nan, 1]])
df = pd.DataFrame(data=data, columns=['a', 'b'])
'''
a b
0 NaN 0.0
1 2.0 0.0
2 NaN 1.0
'''
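Select the rows that match the condition and label them directly.
# Not recommended: chained indexing assigns to a temporary copy, so df is not modified; use loc/iloc instead
df[(df['a'].isnull()) & (df['b'] == 0)]['c'] = 1
# Locate with loc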
df['c'] = 0
df.loc[(df['a'].isnull()) & (df['b'] == 0), 'c'] = 1
'''
a b c
0 NaN 0.0 1
1 2.0 0.0 0
2 NaN 1.0 0
'''
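A case when with several branches can be written as one loc assignment per branch. The following is only a minimal sketch: the column name label and the label strings are made up for this illustration. Later assignments overwrite earlier ones, so the masks here are kept mutually exclusive.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, np.nan], 'b': [0, 0, 1]})

# Start from the ELSE value, then overwrite one branch at a time.
# 'label' and the label strings are hypothetical names for this sketch.
df['label'] = 'other'
df.loc[df['a'].isnull() & (df['b'] == 0), 'label'] = 'a_missing_b0'
df.loc[df['a'].isnull() & (df['b'] == 1), 'label'] = 'a_missing_b1'

labels = df['label'].tolist()
print(labels)  # ['a_missing_b0', 'other', 'a_missing_b1']
```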
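2. Filter with np.where
# Returns x where the condition is True, otherwise y
np.where(condition, x, y)
# Only columns a and b are checked, so the result has two columns
np.where(df[['a', 'b']].isnull(), 100, 5)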
'''
array([[100, 5],
[ 5, 5],
[100, 5]])
'''
# Assign the label
df['c'] = np.where((df['a'].isnull()) & (df['b'] == 0), 1, 0)
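One more example, with nested conditions:
# Assumes df has a numeric 'score' column; string values for inclusive need pandas 1.3 or later
df['class'] = np.where(df['score'].between(0, 60, inclusive='left'), 'fail',
                       np.where(df['score'].between(60, 80, inclusive='both'), 'good', 'excellent'))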
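To try the nested np.where on real values, here is a small sketch. It assumes scores lie between 0 and 100 and that anything below 60 is a fail. It uses plain comparisons instead of between, because the inclusive argument of between changed from booleans to strings across pandas versions. The frame name scores and the sample values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical scores chosen to hit every boundary.
scores = pd.DataFrame({'score': [0, 59, 60, 80, 81, 100]})

# The inner np.where only sees rows that failed the outer condition,
# so it needs just the upper bound of the middle band.
scores['class'] = np.where(scores['score'] < 60, 'fail',
                           np.where(scores['score'] <= 80, 'good', 'excellent'))

classes = scores['class'].tolist()
print(classes)  # ['fail', 'fail', 'good', 'good', 'excellent', 'excellent']
```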
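3. Filter with np.select
np.select picks, for each element, the choice paired with the first condition that is True. The syntax is:
np.select(condition_list, choice_list, default=0)
# list of conditions, list of choices, value used where no condition is True
# Returns an ndarray
In practice: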
df['c'] = np.select([(df['a'].isnull()) & (df['b'] == 0),
(df['a'].isnull()) & (df['b'] == 1),
(df['a'] == 2) & (df['b'] == 0)],
['one', 'two', 'three'],
default = 'XXX')
'''
a b c
0 NaN 0.0 one
1 2.0 0.0 three
2 NaN 1.0 two
'''
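np.select maps closely onto the SQL query at the top of this post. The sketch below makes a few assumptions: the sample values are invented, and the first and last labels are written as '<12' and '>=60' so that they match the conditions. Because np.select takes the first condition that is True, just as case when does, each branch only needs its upper bound.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the column name comes from the SQL query.
users = pd.DataFrame({'online_time': [5, 12, 30, 47, 59, 60, 100]})
t = users['online_time']

# Conditions are checked in order, so earlier branches shadow later ones.
conditions = [t < 12, t < 24, t < 36, t < 48, t < 60]
choices = ['<12', '[12,24)', '[24,36)', '[36,48)', '[48,60)']
users['online_time_cut'] = np.select(conditions, choices, default='>=60')

cuts = users['online_time_cut'].tolist()
print(cuts)
# ['<12', '[12,24)', '[24,36)', '[36,48)', '[48,60)', '>=60', '>=60']
```

Rows where online_time is NaN fail every comparison and therefore get the default label, which mirrors how a NULL falls through to else in SQL.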
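4. Use apply with if statements
apply, called on a DataFrame, runs a function over each row or each column.
- axis=1 applies the function to each row
- the lambda returns 1 when the condition holds and 0 otherwise
df['c'] = df.apply(lambda x: 1 if np.isnan(x['a']) and x['b'] == 0 else 0, axis=1)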
df
'''
a b c
0 NaN 0.0 1
1 2.0 0.0 0
2 NaN 1.0 0
'''
Another simple example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,11,size=(1000000,5)), columns=('a','b','c','d','e'))
def func(a, b, c, d, e):
    if e == 10:
        return c*d
    elif (e < 10) and (e >= 5):
        return c+d
    elif e < 5:
        return a+b
df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
df.head()  # random data, so the values differ from run to run
'''
a b c d e new
0 2 0 5 7 5 12
1 9 3 3 0 2 12
2 2 0 9 10 3 2
3 5 8 3 8 9 11
4 1 10 0 2 0 11
'''
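Row-wise apply calls a Python function once per row, which tends to be slow on a million rows. The same three branches can also be written as whole-column conditions with np.select. This is a sketch rather than a benchmark: it uses 1,000 rows and a seeded generator (both choices of this example) and checks that the two versions agree.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
df = pd.DataFrame(rng.integers(0, 11, size=(1000, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif 5 <= e < 10:
        return c + d
    else:  # e < 5, since e is always in 0..10 here
        return a + b

# Row-wise version, as in the text.
by_apply = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)

# Vectorized version: first matching condition wins, default covers e < 5.
by_select = np.select(
    [df['e'] == 10, df['e'] >= 5],
    [df['c'] * df['d'], df['c'] + df['d']],
    default=df['a'] + df['b'],
)

same = bool((by_apply.to_numpy() == by_select).all())
print(same)  # True
```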
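# Example: data is assumed to be a DataFrame with numeric '語文' (Chinese) and '數學' (math) columns
def function(x):
    # Avoid dividing by zero when the math score is 0
    if x['數學'] != 0:
        s = x['語文'] / x['數學']
    else:
        s = 0
    return s
data['result'] = data.apply(function, axis=1)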
data
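The function above expects data to be a DataFrame with 語文 and 數學 columns, which the text does not construct. A minimal runnable sketch with made-up scores (the values and the helper name ratio are assumptions of this example):

```python
import pandas as pd

# Hypothetical scores; one math score is 0 to exercise the guard.
data = pd.DataFrame({'語文': [90, 80, 70], '數學': [45, 0, 35]})

def ratio(row):
    # Return 0 instead of dividing by zero.
    if row['數學'] != 0:
        return row['語文'] / row['數學']
    return 0

data['result'] = data.apply(ratio, axis=1)

result = data['result'].tolist()
print(result)  # [2.0, 0.0, 2.0]
```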
Reference: Pandas equivalent of the SQL case when statement for creating a new variable