Python Study Notes: Tricks for Speeding Up the Pandas apply Function

Reposted article. Posted 2021-08-31 15:26. Category: Python

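I. Cutting-edge tools

The Dask package

Use cases: large data volumes, insufficient memory, complex parallel processing

Features: task graphs, parallel execution, scaling out to distributed nodes, GPU computing

Similar to the way TensorFlow handles neural network models

The cuDF package

cuDF accelerates Pandas-style operations on the GPU

Drawback: GPUs are expensive!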

II. Baseline apply

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,11,size=(1000000,5)), columns=('a','b','c','d','e'))

def func(a,b,c,d,e):
    if e == 10:
        return c*d
    elif (e < 10) and (e >= 5):
        return c+d
    elif e < 5:
        return a+b

%%time
df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
# axis=1: computed row by row, reading across columns
# Wall time: 25.4 s
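
`%%time` is a Jupyter cell magic, so the timing above cannot be reproduced as-is in a plain script. A minimal sketch of the same measurement with `time.perf_counter` follows. The frame name `small`, the 10,000-row size and the fixed seed are assumptions chosen to keep the run short; they are not part of the original benchmark, and absolute timings will differ by machine.

```python
import time

import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    # Same branching logic as the function defined above.
    if e == 10:
        return c * d
    elif 5 <= e < 10:
        return c + d
    elif e < 5:
        return a + b


# Smaller frame than the 1,000,000 rows above (assumption, for a quick run).
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(0, 11, size=(10_000, 5)), columns=list("abcde"))

start = time.perf_counter()
small["new"] = small.apply(
    lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1
)
elapsed = time.perf_counter() - start

print(f"row-wise apply on {len(small):,} rows: {elapsed:.3f} s")
```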

III. Parallel speed-up with Swifter

Install the swifter package, then run apply through it.

pip install swifter

%%time
import swifter
df['new'] = df.swifter.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
# Dask Apply: 100%
# 16/16 [00:09<00:00, 1.47it/s]
# Wall time: 12.4 s

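IV. Vectorization

The fastest approach with Pandas and NumPy is to vectorize the function: operate on whole columns at once instead of row by row.

Avoid: for loops, list processing, apply and other row-wise approaches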

%%time
df['new'] = df['c'] * df['d']
mask = df['e'] < 10
df.loc[mask, 'new'] = df['c'] + df['d']
mask = df['e'] < 5
df.loc[mask, 'new'] = df['a'] + df['b']
# Wall time: 159 ms
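
The same three branches can also be written in one expression with `np.select`, which some readers find easier to compare against `func`. This is a sketch of an alternative, not the method timed above. The name `df_demo`, the 1,000-row size and the `default=-1` sentinel are assumptions; the sentinel is never reached for values in 0..10. The final lines check the result against the row-wise apply.

```python
import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    # Same branching logic as the function defined in the baseline section.
    if e == 10:
        return c * d
    elif 5 <= e < 10:
        return c + d
    elif e < 5:
        return a + b


rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.integers(0, 11, size=(1_000, 5)), columns=list("abcde"))

# Conditions are checked in order; the first one that is True wins.
conditions = [
    df_demo["e"] == 10,
    df_demo["e"] >= 5,
    df_demo["e"] < 5,
]
choices = [
    df_demo["c"] * df_demo["d"],
    df_demo["c"] + df_demo["d"],
    df_demo["a"] + df_demo["b"],
]
df_demo["new"] = np.select(conditions, choices, default=-1)

# Cross-check against the row-wise version on the same data.
expected = df_demo.apply(
    lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1
)
matches = bool((df_demo["new"] == expected).all())
print("vectorized result matches apply:", matches)
```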

V. Dtype conversion + vectorization

df.dtypes
'''
a      int32
b      int32
c      int32
d      int32
e      int32
new    int32
dtype: object
'''

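Convert the columns to int16, then run the same vectorized operations. This is safe here because every value and every result fits comfortably in the int16 range.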

for col in ('a','b','c','d','e'):
    df[col] = df[col].astype(np.int16)

%%time
df['new'] = df['c'] * df['d']
mask = df['e'] < 10
df.loc[mask, 'new'] = df['c'] + df['d']
mask = df['e'] < 5
df.loc[mask, 'new'] = df['a'] + df['b']
# Wall time: 133 ms
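
Downcasting only pays off when the narrower type can hold every intermediate result. The sketch below checks the worst case against `np.iinfo(np.int16)` before converting and then compares memory use. The name `df_demo` and the 100,000-row size are assumptions, and the frame starts as int64 here (the default for `Generator.integers`) rather than the int32 shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.integers(0, 11, size=(100_000, 5)), columns=list("abcde"))

before = int(df_demo.memory_usage(index=False).sum())

# Upper bound on anything the three branches can produce.
col_max = {col: int(df_demo[col].max()) for col in "abcd"}
worst_case = max(
    col_max["c"] * col_max["d"],
    col_max["c"] + col_max["d"],
    col_max["a"] + col_max["b"],
)
assert worst_case <= np.iinfo(np.int16).max, "int16 would overflow"

# Convert all five columns in one call.
df_demo = df_demo.astype(np.int16)
after = int(df_demo.memory_usage(index=False).sum())

print(f"bytes before: {before:,}  after: {after:,}")
```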

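VI. Working on .values

Accessing .values returns the underlying NumPy array, so the vectorized operations skip some of Pandas' index handling and run a little faster. Note that only the first assignment and the masks use .values below; the masked assignments still pass Series so that Pandas can align them by index.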

%%time
df['new'] = df['c'].values * df['d'].values
mask = df['e'].values < 10
df.loc[mask, 'new'] = df['c'] + df['d']
mask = df['e'].values < 5
df.loc[mask, 'new'] = df['a'] + df['b']
# Wall time: 101 ms
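
Going one step further, the whole calculation can be done on NumPy arrays with nested `np.where`, assigning to the frame only once. This is a sketch of a variant, not the code timed above, and no timing is claimed for it. `df_demo` and its size are assumptions; `to_numpy()` is used in place of `.values`, which pandas documents as the recommended accessor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.integers(0, 11, size=(1_000, 5)), columns=list("abcde"))

# Pull each column out as a plain NumPy array.
a, b, c, d, e = (df_demo[col].to_numpy() for col in "abcde")

# e == 10 -> c * d; 5 <= e < 10 -> c + d; otherwise (e < 5) -> a + b
df_demo["new"] = np.where(e == 10, c * d, np.where(e >= 5, c + d, a + b))

print(df_demo.head())
```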

VII. Other notes

1. Check the shape

df.shape # (1000000, 6)

2. Basic information

Row count, column names, dtypes, non-null counts, memory usage, etc.

df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   a       1000000 non-null  int16
 1   b       1000000 non-null  int16
 2   c       1000000 non-null  int16
 3   d       1000000 non-null  int16
 4   e       1000000 non-null  int16
 5   new     1000000 non-null  int16
dtypes: int16(6)
memory usage: 11.4 MB
'''

3. Dtype of every column

df.dtypes
'''
a      int16
b      int16
c      int16
d      int16
e      int16
new    int16
dtype: object
'''

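4. Dtype of a single column

float64, int64, object and so on are the dtypes Pandas uses for columns; most of them are NumPy dtypes rather than Pandas-specific ones.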

df['new'].dtype
# dtype('int16')
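
Beyond printing a dtype, it is often useful to test it in code. The sketch below uses `pd.api.types.is_integer_dtype` and `DataFrame.select_dtypes`; the three-column `df_demo` frame is a made-up example, not data from this post.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer, one float and one string column.
df_demo = pd.DataFrame(
    {
        "a": np.array([1, 2, 3], dtype=np.int16),
        "b": [0.5, 1.5, 2.5],
        "c": ["x", "y", "z"],
    }
)

# Is column "a" any kind of integer?
is_int = pd.api.types.is_integer_dtype(df_demo["a"])

# A column dtype can be compared directly with a NumPy dtype.
same_as_numpy = df_demo["a"].dtype == np.dtype("int16")

# Names of all numeric columns.
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()

print(is_int, same_as_numpy, numeric_cols)
```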

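Reference: "Pandas中Apply函數加速百倍的技巧" (Tricks for speeding up the Pandas apply function a hundredfold).

Reference: "python查看各列數據類型_pandas中查看數據類型的幾種方式" (Several ways to inspect data types in pandas).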
