Iteration and Parallel Processing in Pandas

Reposted article. 2020-09-21 17:43. Category: data analysis.

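Iteration and parallel processing are common operations when working with data in pandas. This article summarizes several styles of iteration and several methods for parallel processing.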

1. Prepare the sample data

import pandas as pd
import numpy as np

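# Random scores from 40 to 99 for five students across ten subjects (s0..s9).
# No seed is set, so the values differ on every run.
df = pd.DataFrame(np.random.randint(40, 100, (5, 10)), columns=[f's{i}' for i in range(10)], index=['john', 'bob', 'mike', 'bill', 'lisa'])
# A student passes when the s9 score is above 60.
df['is_passed'] = df.s9.map(lambda x: True if x > 60 else False)

Output of df (from one run):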

      s0  s1  s2  s3  s4  s5  s6  s7  s8  s9  is_passed
john  56  70  85  91  92  80  63  81  45  57      False
bob   99  93  80  42  91  81  53  75  61  78       True
mike  76  92  76  80  57  98  94  79  87  94       True
bill  81  83  92  91  51  55  40  77  96  90       True
lisa  85  82  56  57  54  56  49  43  99  51      False
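
Because the sample data is generated from unseeded random numbers, rerunning the code will not reproduce the table above. The sketch below rebuilds the same DataFrame from the printed values so that the later outputs can be checked exactly. It is only one way to do this, and it computes is_passed with a direct comparison, which should give the same result as the map() call.

```python
import pandas as pd

# Scores copied from the printed table above.
scores = [
    [56, 70, 85, 91, 92, 80, 63, 81, 45, 57],
    [99, 93, 80, 42, 91, 81, 53, 75, 61, 78],
    [76, 92, 76, 80, 57, 98, 94, 79, 87, 94],
    [81, 83, 92, 91, 51, 55, 40, 77, 96, 90],
    [85, 82, 56, 57, 54, 56, 49, 43, 99, 51],
]
df = pd.DataFrame(
    scores,
    columns=[f"s{i}" for i in range(10)],
    index=["john", "bob", "mike", "bill", "lisa"],
)
# Vectorized comparison: True where the s9 score is above 60.
df["is_passed"] = df["s9"] > 60
print(df)
```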

2. Iteration

pandas provides three methods for iterating over a DataFrame:

2.1. iterrows

Iterates by row, yielding each row of the DataFrame as an (index, Series) pair. Elements are accessed by column label, for example row['s0'] or row.s0.

>>> for index, row in df.iterrows():
...     print(row['s0'])  # row.s0 also works
    
56
99
76
81
85
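
One detail worth knowing about iterrows is that each row comes back as a single Series, so columns of different types are combined into one common dtype. The following is a minimal sketch on a trimmed three-student sample (an assumption made to keep the example short); it also shows that the Series' own name attribute holds the row label.

```python
import pandas as pd

# Trimmed sample: two score columns plus the boolean is_passed column.
df = pd.DataFrame(
    {"s8": [45, 61, 87], "s9": [57, 78, 94]},
    index=["john", "bob", "mike"],
)
df["is_passed"] = df["s9"] > 60

for index, row in df.iterrows():
    # row is a Series labelled by column name; row.name is the row label.
    print(index, row["s9"], row.name, row.dtype)

# Take the first pair to inspect it more closely.
first_label, first_row = next(df.iterrows())
```

Here the integer and boolean columns are mixed, so each row Series has dtype object, while the columns themselves keep their original dtypes.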

2.2. itertuples

Iterates by row, yielding each row of the DataFrame as a named tuple whose fields are accessed as attributes, for example row.s0. It is more efficient than iterrows.

>>> for row in df.itertuples():
...     print(row.s0)
    
56
99
76
81
85
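
By default the named tuple also carries the row label in a field called Index. The sketch below, again on a trimmed sample that is assumed for brevity, shows that field along with the index and name parameters of itertuples; the tuple name "Score" is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"s8": [45, 61, 87], "s9": [57, 78, 94]},
    index=["john", "bob", "mike"],
)

for row in df.itertuples():
    # The row label is available as row.Index.
    print(row.Index, row.s9)

# Drop the index field and give the tuple type a custom name.
rows = list(df.itertuples(index=False, name="Score"))
print(rows[0])
```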

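2.3. iteritems (items)

Iterates by column, yielding each column of the DataFrame as a (column name, Series) pair; elements of the Series are accessed by row label or by position. DataFrame.iteritems was deprecated in pandas 1.5 and removed in 2.0. DataFrame.items is the equivalent method and is available in both old and new versions, so it is used here.

>>> for name, column in df.items():
...     print(column.iloc[0])  # first row's value in each column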
    
56
70
85
91
92
80
63
81
45
57
False
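
In pandas 2.0 and later, iteritems is no longer available and items() is the replacement. The sketch below uses items() on a trimmed sample (assumed for brevity) and collects the first value of each column, using .iloc for positional access.

```python
import pandas as pd

df = pd.DataFrame(
    {"s8": [45, 61, 87], "s9": [57, 78, 94]},
    index=["john", "bob", "mike"],
)
df["is_passed"] = df["s9"] > 60

first_values = {}
for name, column in df.items():
    # name is the column label, column is the whole column as a Series.
    first_values[name] = column.iloc[0]

print(first_values)
```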

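3. Parallel processing

3.1. The map method

Like Python's built-in map(), the pandas map() method takes a function, a dictionary, or another object that accepts a single input value, applies it to every element of a single column, and computes the results one element at a time. map() also has an na_action parameter, similar to na.action in R, that controls how missing values are handled. It accepts None (the default) or 'ignore'; with 'ignore', NaN values are skipped during the computation and returned unchanged.

For example, to replace True with 1 and False with 0 in the is_passed column, any of the following approaches works: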

3.1.1. Dictionary mapping

>>> # Define the mapping dictionary
... score_map = {True: 1, False: 0}

>>> # Use map() to get the mapped version of the is_passed column
... df.is_passed.map(score_map)

john    0
bob     1
mike    1
bill    1
lisa    0
Name: is_passed, dtype: int64

3.1.2. lambda function

>>> # Same approach as when the column was created
... df.is_passed.map(lambda x: 1 if x else 0)

john    0
bob     1
mike    1
bill    1
lisa    0
Name: is_passed, dtype: int64

3.1.3. Regular function

>>> def bool_to_num(x):
...     return 1 if x else 0

>>> df.is_passed.map(bool_to_num)

3.1.4. Special objects

Other objects that accept a single input value and return an output can also be passed to map():

>>> df.is_passed.map('is passed: {}'.format)

john    is passed: False
bob      is passed: True
mike     is passed: True
bill     is passed: True
lisa    is passed: False
Name: is_passed, dtype: object
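
The na_action parameter described at the start of section 3.1 can be illustrated with a small Series that contains a missing value. This is a minimal sketch; the Series below is made up for the example and is not part of the sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([True, np.nan, False], index=["john", "bob", "mike"])

# Default (na_action=None): the function is also called on NaN.
without_ignore = s.map("is passed: {}".format)

# na_action='ignore': NaN is passed through without calling the function.
with_ignore = s.map("is passed: {}".format, na_action="ignore")

print(without_ignore)
print(with_ignore)
```

With the default setting the missing value is formatted into the string, while 'ignore' leaves it as a missing value in the result.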

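3.2. The apply method

apply() is used much like map(): its main argument is likewise a callable that takes an input and returns an output. But whereas map() works on a single Series column, one apply() call can operate on a single column or on several columns, which covers a very wide range of use cases. These are described below.

3.2.1. Single column

Passing a lambda function:

>>> df.is_passed.apply(lambda x: 1 if x else 0)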
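
For a simple conversion like this one, a vectorized cast can stand in for the element-wise function. The sketch below compares the two on a made-up three-element boolean Series; it is offered as an alternative under that assumption, not as the approach the article relies on.

```python
import pandas as pd

# Hypothetical boolean column standing in for df.is_passed.
passed = pd.Series([False, True, True], index=["john", "bob", "mike"], name="is_passed")

# Element-wise: the lambda runs once per value.
via_apply = passed.apply(lambda x: 1 if x else 0)

# Vectorized: cast the whole column at once.
via_astype = passed.astype("int64")

print(via_apply.tolist(), via_astype.tolist())
```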

3.2.2. Multiple input columns

>>> def gen_describe(s9, is_passed):
...     return f"s9's score is {s9}, so {'passed' if is_passed else 'failed'}"

>>> df.apply(lambda r: gen_describe(r['s9'], r['is_passed']), axis=1)

john    s9's score is 57, so failed
bob     s9's score is 78, so passed
mike    s9's score is 94, so passed
bill    s9's score is 90, so passed
lisa    s9's score is 51, so failed
dtype: object

3.2.3. Multiple output columns

>>> df.apply(lambda row: (row['s9'], row['s8']), axis=1)

john    (57, 45)
bob     (78, 61)
mike    (94, 87)
bill    (90, 96)
lisa    (51, 99)
dtype: object
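
The result above is a single Series whose values are tuples. If separate columns are wanted, apply() accepts result_type='expand' when axis=1, which spreads each returned tuple across columns. A minimal sketch on a trimmed sample (assumed for brevity); the column names s9_copy and s8_copy are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"s8": [45, 61, 87], "s9": [57, 78, 94]},
    index=["john", "bob", "mike"],
)
df["is_passed"] = df["s9"] > 60

# Default: one Series of tuples.
tuples = df.apply(lambda row: (row["s9"], row["s8"]), axis=1)

# result_type='expand': each tuple element becomes its own column (0, 1, ...).
expanded = df.apply(lambda row: (row["s9"], row["s8"]), axis=1, result_type="expand")
expanded.columns = ["s9_copy", "s8_copy"]

print(tuples)
print(expanded)
```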

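3.3. The applymap method

applymap is the DataFrame counterpart of the map method. Unlike Series.map, it accepts only a function (not a dictionary); it applies that function to every element of the DataFrame and returns the results in a DataFrame of the same shape. Since pandas 2.1, applymap is deprecated in favor of the equivalent DataFrame.map. For example, to raise every score in df that is below 50 to 50:

>>> def at_least_get_50(x):
...     # bool is a subclass of int, so exclude it to leave is_passed untouched
...     if isinstance(x, int) and not isinstance(x, bool) and x < 50:
...         return 50
...     return x

>>> df.applymap(at_least_get_50)  # df.map(at_least_get_50) on pandas >= 2.1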

      s0  s1  s2  s3  s4  s5  s6  s7  s8  s9  is_passed
john  56  70  85  91  92  80  63  81  50  57      False
bob   99  93  80  50  91  81  53  75  61  78       True
mike  76  92  76  80  57  98  94  79  87  94       True
bill  81  83  92  91  51  55  50  77  96  90       True
lisa  85  82  56  57  54  56  50  50  99  51      False
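
For a purely numeric adjustment like the one above, a vectorized call can replace the element-wise function. The sketch below uses select_dtypes and clip on a trimmed sample (assumed for brevity); it is an alternative to applymap rather than a demonstration of it, and it leaves the boolean column alone because 'number' does not select bool columns.

```python
import pandas as pd

df = pd.DataFrame(
    {"s8": [45, 61, 87], "s9": [57, 78, 94]},
    index=["john", "bob", "mike"],
)
df["is_passed"] = df["s9"] > 60

# Pick only the numeric score columns.
numeric = df.select_dtypes(include="number")

# Raise every score below 50 to 50 in one vectorized step.
floored = df.copy()
floored[numeric.columns] = numeric.clip(lower=50)

print(floored)
```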

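Appendix: adding a progress bar to apply with tqdm

When processing a large amount of data in Jupyter, there is usually nothing to do after starting a run except wait for it to fail or finish. tqdm shows the processing progress in real time. Install the package first with pip install tqdm. Example usage: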

from tqdm import tqdm

def gen_describe(s9, is_passed):
    return f"s9's score is {s9}, so {'passed' if is_passed else 'failed'}"

# Register progress_apply on pandas objects; desc sets the progress bar label
tqdm.pandas(desc='apply')
df.progress_apply(lambda r: gen_describe(r['s9'], r['is_passed']), axis=1)

References

(Data Science Study Notes 69) A detailed look at map, apply, applymap, groupby, and agg in pandas
