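I. Cutting-Edge Tools
Dask package
Use cases: large data volumes, insufficient memory, complex parallel processing
Features: task graphs, parallel execution, scaling out to distributed nodes, GPU computing
Similar in spirit to how TensorFlow handles neural network models
cuDF package
cuDF provides a pandas-like DataFrame API that runs on the GPU
- Drawback: GPUs are expensive!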
II. Baseline: Plain apply
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,11,size=(1000000,5)), columns=('a','b','c','d','e'))
def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b
%%time
df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
# computed row by row, reading across columns
# Wall time: 25.4 s
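`%%time` is an IPython/Jupyter cell magic and will not work in a plain Python script. A minimal sketch of the same measurement with `time.perf_counter`, assuming a much smaller 10,000-row frame so that it finishes quickly (the row count and the names `small` and `elapsed` are illustrative, not part of the original benchmark):

```python
import time

import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b


# Smaller than the 1,000,000-row original, purely to keep the demo fast.
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(0, 11, size=(10_000, 5)), columns=list("abcde"))

start = time.perf_counter()
small["new"] = small.apply(
    lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1
)
elapsed = time.perf_counter() - start
print(f"row-wise apply on {len(small):,} rows: {elapsed:.3f} s")
```

The absolute number depends on the machine, so only the relative differences between the approaches in these notes are meaningful.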
III. Parallel Speedup with Swifter
Install the swifter package, then run the apply through it.
pip install swifter
%%time
import swifter
df['new'] = df.swifter.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
# Dask Apply: 100%
# 16/16 [00:09<00:00, 1.47it/s]
# Wall time: 12.4 s
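IV. Vectorization
The fastest way to work with Pandas and NumPy is to vectorize the function, so that each operation runs on whole columns at once.
Avoid: for loops, list comprehensions, apply, and similar element-by-element processing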
%%time
df['new'] = df['c'] * df['d']
mask = df['e'] < 10
df.loc[mask, 'new'] = df['c'] + df['d']
mask = df['e'] < 5
df.loc[mask, 'new'] = df['a'] + df['b']
# Wall time: 159 ms
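The three masked assignments above rely on later assignments overwriting earlier ones. As an alternative sketch (not the original author's code), `np.select` states each condition and its result once. The small frame and the comparison against `apply` are only there to check that the two approaches agree:

```python
import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b


rng = np.random.default_rng(0)
df_small = pd.DataFrame(rng.integers(0, 11, size=(1_000, 5)), columns=list("abcde"))

conditions = [
    df_small["e"] == 10,
    df_small["e"] >= 5,  # the first matching condition wins, so this means 5 <= e < 10
    df_small["e"] < 5,
]
choices = [
    df_small["c"] * df_small["d"],
    df_small["c"] + df_small["d"],
    df_small["a"] + df_small["b"],
]
df_small["new_select"] = np.select(conditions, choices)

# Reference result from the row-wise version, for comparison only.
df_small["new_apply"] = df_small.apply(
    lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1
)
print((df_small["new_select"] == df_small["new_apply"]).all())
```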
V. Dtype Conversion + Vectorization
df.dtypes
'''
a int32
b int32
c int32
d int32
e int32
new int32
dtype: object
'''
Convert the columns to int16, then run the same vectorized operations.
for col in ('a','b','c','d','e'):
    df[col] = df[col].astype(np.int16)
%%time
df['new'] = df['c'] * df['d']
mask = df['e'] < 10
df.loc[mask, 'new'] = df['c'] + df['d']
mask = df['e'] < 5
df.loc[mask, 'new'] = df['a'] + df['b']
# Wall time: 133 ms
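Downcasting is only safe when every value, including the computed results, fits in the narrower type. A minimal sketch that checks the range against `np.iinfo` before converting and then compares memory use (the 100,000-row frame `df_demo` is a small stand-in for the original, created as int32 to match the dtypes shown above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    rng.integers(0, 11, size=(100_000, 5), dtype=np.int32), columns=list("abcde")
)

limits = np.iinfo(np.int16)  # int16 holds -32768 .. 32767
values = df_demo.to_numpy()
# The largest possible result is c * d, so the squared maximum must also fit.
assert values.min() >= limits.min
assert int(values.max()) ** 2 <= limits.max

before = df_demo.memory_usage(index=False).sum()
df_demo = df_demo.astype(np.int16)
after = df_demo.memory_usage(index=False).sum()
print(before, after)  # int16 needs half the bytes of int32
```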
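VI. Working on .values
`.values` returns a column's underlying NumPy array, so vectorized operations on it skip some pandas overhead and run somewhat faster.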
%%time
df['new'] = df['c'].values * df['d'].values
mask = df['e'].values < 10
df.loc[mask, 'new'] = df['c'] + df['d']
mask = df['e'].values < 5
df.loc[mask, 'new'] = df['a'] + df['b']
# Wall time: 101 ms
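In the snippet above, only the first line and the masks work on raw arrays; the `.loc` assignments still pass pandas Series on the right-hand side. A sketch that keeps the whole computation in NumPy and writes the result back once, using `to_numpy()` (which the pandas documentation recommends over `.values`). The frame size and variable names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_np = pd.DataFrame(rng.integers(0, 11, size=(100_000, 5)), columns=list("abcde"))

# Pull each column out as a plain NumPy array.
a, b, c, d, e = (df_np[col].to_numpy() for col in "abcde")

# Nested np.where mirrors the if / elif chain: e == 10, then e >= 5, otherwise e < 5.
result = np.where(e == 10, c * d, np.where(e >= 5, c + d, a + b))
df_np["new"] = result
```

Whether this beats the masked `.loc` version on a given machine should be measured rather than assumed.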
VII. Other Notes
1. Check the dimensions
df.shape # (1000000, 6)
2. Basic information
Dimensions, column names, dtypes and non-null counts, memory usage, and so on
df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 1000000 non-null int16
1 b 1000000 non-null int16
2 c 1000000 non-null int16
3 d 1000000 non-null int16
4 e 1000000 non-null int16
5 new 1000000 non-null int16
dtypes: int16(6)
memory usage: 11.4 MB
'''
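The `memory usage: 11.4 MB` line can be checked by hand: 1,000,000 rows × 6 columns × 2 bytes per int16 value, divided by 2**20. A small sketch, assuming an all-zero frame of the same shape (the contents do not affect the size; `df_mem` is an illustrative name):

```python
import numpy as np
import pandas as pd

df_mem = pd.DataFrame(
    np.zeros((1_000_000, 6), dtype=np.int16), columns=list("abcde") + ["new"]
)

# Column data only; the RangeIndex adds a negligible amount on top.
data_bytes = df_mem.memory_usage(index=False).sum()
print(data_bytes)                    # 12000000
print(round(data_bytes / 2**20, 1))  # 11.4
```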
3. Dtype of every column
df.dtypes
'''
a int16
b int16
c int16
d int16
e int16
new int16
dtype: object
'''
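4. Dtype of a single column
float64, int64, object, and so on are the dtypes pandas reports for columns; most of them are NumPy dtypes rather than pandas-specific ones.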
df['new'].dtype
# dtype('int16')
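Most column dtypes are plain NumPy dtypes, while a few, such as `category` and the nullable `Int64`, are pandas extension dtypes. A brief sketch that tells them apart with `isinstance`, using an illustrative three-row frame named `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "i": np.array([1, 2, 3], dtype=np.int16),
    "f": np.array([1.0, 2.5, 3.0]),
})
demo["cat"] = pd.Categorical(["x", "y", "x"])
demo["nullable"] = pd.array([1, None, 3], dtype="Int64")

for name, dtype in demo.dtypes.items():
    # NumPy-backed columns expose an np.dtype; extension columns do not.
    kind = "NumPy" if isinstance(dtype, np.dtype) else "pandas extension"
    print(f"{name}: {dtype} ({kind})")
```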
