Pandas DataFrame: Data Selection and Filtering

This article is a repost. 2018-10-31 12:05

This would allow chaining operations like:

(pd.read_csv('imdb.txt')
   .sort_values(by='year')
   .filter(lambda x: x['year'] > 1990)   # <--- this is missing in Pandas
   .to_csv('filtered.csv'))

For current alternatives see:

http://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining

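You can do this instead:

# DataFrame.sort(columns=...) was removed in pandas 0.20; sort_values is its replacement
df = pd.read_csv('imdb.txt').sort_values(by='year')
df[df['year'] > 1990].to_csv('filtered.csv')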

# However, one could potentially do something like this (proposed syntax, not valid Python):

pd.read_csv('imdb.txt')
  .sort_values(by='year')
  .[lambda x: x['year']>1990]
  .to_csv('filtered.csv')

or, passing a callable to .loc (supported since pandas 0.18.1):

(pd.read_csv('imdb.txt')
   .sort_values(by='year')
   .loc[lambda x: x['year'] > 1990]
   .to_csv('filtered.csv'))
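The chained filter discussed above can be tried without any file on disk. The sketch below is only an illustration: the `movies` frame and its titles are hypothetical stand-ins for the contents of `imdb.txt`, which the article does not show. It filters inside a chain in two ways, with a callable passed to `.loc` and with `query()`.

```python
import pandas as pd

# Hypothetical stand-in for imdb.txt (the real file is not shown in the article)
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'year': [1985, 1999, 1992, 1990],
})

# The callable receives the already-sorted frame, so no intermediate variable is needed
filtered = (
    movies
    .sort_values(by='year')
    .loc[lambda d: d['year'] > 1990]
)

# The same filter written as a query string
filtered_q = movies.sort_values(by='year').query('year > 1990')

print(filtered)
```

Both forms keep the chain unbroken; which one reads better is largely a matter of taste.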

from:https://yangjin795.github.io/pandas_df_selection.html

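Pandas is the Python Data Analysis Library, a Python library built on NumPy and designed for data analysis. It provides many tools and methods that make working with large amounts of data in Python efficient and convenient.

This article focuses on methods and tools for filtering and selecting data in a Pandas DataFrame. The raw data used throughout is as follows: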

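import numpy as np
import pandas as pd

# `dates` is not defined in the original; six consecutive days are inferred from the index shown below
dates = pd.date_range('2017-04-01', periods=6)
# The values are random, so a fresh run will print different numbers
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

    Out[9]:
                       A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
    2017-04-04  1.700309  0.287588 -0.012103  0.525291
    2017-04-05  0.526615 -0.417645  0.405853 -0.835213
    2017-04-06  1.143858 -0.326720  1.425379  0.531037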
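Because the data above comes from `np.random.randn`, rerunning the snippet will not reproduce the printed numbers. To follow along with the exact values, one option is to rebuild the frame from the printed output. This is a sketch under that assumption: the values are the six-decimal roundings shown above, and `dates` is inferred from the index.

```python
import pandas as pd

# Index inferred from the printed output: six consecutive days
dates = pd.date_range('2017-04-01', periods=6)

# Values copied from the printed table (rounded to six decimals)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

print(df)
```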

Selection

Selecting with []

Select one or several columns:

df['A']
Out:
2017-04-01    0.522241
2017-04-02    2.104572
2017-04-03    0.480507
2017-04-04    1.700309
2017-04-05    0.526615
2017-04-06    1.143858

df[['A','B']]
Out:
                   A         B
2017-04-01  0.522241  0.495106
2017-04-02  2.104572 -0.977768
2017-04-03  0.480507  1.215048
2017-04-04  1.700309  0.287588
2017-04-05  0.526615 -0.417645
2017-04-06  1.143858 -0.326720

Select one or several rows:

df['2017-04-01':'2017-04-01']
Out:
                   A         B         C         D
2017-04-01  0.522241  0.495106 -0.268194 -0.035003

df['2017-04-01':'2017-04-03']
Out:
                   A         B         C         D
2017-04-01  0.522241  0.495106 -0.268194 -0.035003
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
2017-04-03  0.480507  1.215048  1.313314 -0.072320
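The `[]` operator behaves differently depending on what it is given: a single label selects a column, a list of labels selects several columns, and a slice selects rows. The sketch below makes that visible by checking the type and shape of each result. The frame is rebuilt from the rounded values printed earlier, which is an assumption made so the block runs on its own.

```python
import pandas as pd

dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

col = df['A']                          # single label -> Series
cols = df[['A', 'B']]                  # list of labels -> DataFrame
rows = df['2017-04-01':'2017-04-03']   # label slice -> rows, both endpoints included

print(type(col).__name__, cols.shape, rows.shape)
```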

loc: selecting data by row label

df.loc['2017-04-01','A']

df.loc['2017-04-01']
Out:
A    0.522241
B    0.495106
C   -0.268194
D   -0.035003

df.loc['2017-04-01':'2017-04-03']
Out:
                   A         B         C         D
2017-04-01  0.522241  0.495106 -0.268194 -0.035003
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
2017-04-03  0.480507  1.215048  1.313314 -0.072320

df.loc['2017-04-01':'2017-04-04',['A','B']]
Out:
                   A         B
2017-04-01  0.522241  0.495106
2017-04-02  2.104572 -0.977768
2017-04-03  0.480507  1.215048
2017-04-04  1.700309  0.287588

df.loc[:,['A','B']]
Out:
                   A         B
2017-04-01  0.522241  0.495106
2017-04-02  2.104572 -0.977768
2017-04-03  0.480507  1.215048
2017-04-04  1.700309  0.287588
2017-04-05  0.526615 -0.417645
2017-04-06  1.143858 -0.326720
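One detail of `loc` worth seeing in code is that label slices include both endpoints, for rows and for columns alike. The following sketch, again using a frame rebuilt from the rounded values printed earlier (an assumption), checks the shapes that result.

```python
import pandas as pd

dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

row = df.loc['2017-04-01']                               # one row -> Series indexed by column
block = df.loc['2017-04-01':'2017-04-04', ['A', 'B']]    # 4 rows: the end label is included
col_range = df.loc[:, 'A':'C']                           # column label slice, 'C' included

print(block.shape, col_range.shape)
```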

iloc: selecting data by integer position (row number)

df.iloc[2]
Out:
A    0.480507
B    1.215048
C    1.313314
D   -0.072320

df.iloc[1:3]
Out:
                   A         B         C         D
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
2017-04-03  0.480507  1.215048  1.313314 -0.072320

df.iloc[1,1]
df.iloc[1:3,1]
df.iloc[1:3,1:2]
df.iloc[[1,3],[2,3]]
Out:
                   C         D
2017-04-02 -0.139632 -0.735926
2017-04-04 -0.012103  0.525291

df.iloc[[1,3],:]
df.iloc[:,[2,3]]
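The `iloc` calls above return different kinds of objects depending on whether each position is a scalar, a slice, or a list. The sketch below spells that out, and also shows that `iloc` slices exclude the end position, unlike `loc` label slices. The frame is rebuilt from the rounded values printed earlier (an assumption, so the block is self-contained).

```python
import pandas as pd

dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

cell = df.iloc[1, 1]              # scalar, scalar -> single value
part = df.iloc[1:3, 1]            # slice, scalar  -> Series with 2 entries
frame = df.iloc[1:3, 1:2]         # slice, slice   -> DataFrame with one column
picked = df.iloc[[1, 3], [2, 3]]  # list, list     -> DataFrame, rows 1 and 3, columns C and D

# Positional slices stop before the end position; label slices include the end label
by_position = df.iloc[1:3]
by_label = df.loc['2017-04-02':'2017-04-04']

print(cell, part.shape, frame.shape, picked.shape)
print(len(by_position), len(by_label))
```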

iat: getting the value of a single cell

df.iat[1,2]
Out:
-0.13963224781812655
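`iat` addresses a cell by integer position; pandas also provides `at`, its label-based counterpart. The sketch below reads the same cell both ways. It uses a frame rebuilt from the rounded values printed earlier (an assumption), so the result is the rounded -0.139632 rather than the full-precision number shown above.

```python
import pandas as pd

dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

by_position = df.iat[1, 2]                             # second row, third column
by_label = df.at[pd.Timestamp('2017-04-02'), 'C']      # same cell, addressed by labels

print(by_position, by_label)
```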

Filtering

Filtering with []

Inside [] is a boolean expression; every row for which it evaluates to True is selected.

df[df.A>1]
Out:
                   A         B         C         D
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
2017-04-04  1.700309  0.287588 -0.012103  0.525291
2017-04-06  1.143858 -0.326720  1.425379  0.531037

df[df>1]
Out:
                   A         B         C    D
2017-04-01       NaN       NaN       NaN  NaN
2017-04-02  2.104572       NaN       NaN  NaN
2017-04-03       NaN  1.215048  1.313314  NaN
2017-04-04  1.700309       NaN       NaN  NaN
2017-04-05       NaN       NaN       NaN  NaN
2017-04-06  1.143858       NaN  1.425379  NaN

df[df.A+df.B>1.5]
Out:
                   A         B         C         D
2017-04-03  0.480507  1.215048  1.313314 -0.072320
2017-04-04  1.700309  0.287588 -0.012103  0.525291
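Boolean filters can also be combined. The sketch below, using a frame rebuilt from the rounded values printed earlier (an assumption), joins two conditions with `&` and negates one with `~`; each condition needs its own parentheses because `&` binds more tightly than the comparison operators. It also shows that an element-wise mask such as `df[df > 1]` keeps the frame's shape and fills non-matching cells with NaN, which matches what `df.where(df > 1)` returns.

```python
import pandas as pd

dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

both = df[(df['A'] > 1) & (df['B'] < 0)]   # rows where A > 1 AND B < 0
not_big = df[~(df['A'] > 1)]               # rows where A is NOT greater than 1

masked = df[df > 1]                        # same shape as df, NaN where the test fails
same_as_where = masked.equals(df.where(df > 1))

print(both)
```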

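Below is a more complex example. It selects the rows whose index lies between '2017-04-01' and '2017-04-04', together with the columns whose sum is greater than 1. Note that df.sum() sums each column over all rows of df, so the boolean mask applies to columns, not to rows: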

df.loc['2017-04-01':'2017-04-04',df.sum()>1]
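Since `df.sum()` adds up each column, the expression above filters columns. If the goal is instead to keep rows whose own values add up to more than 1, the sum has to be taken along `axis=1`. The sketch below shows both variants side by side on a frame rebuilt from the rounded values printed earlier (an assumption); the variable names are illustrative.

```python
import pandas as pd

dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

# Column filter: keep columns whose total over all six rows exceeds 1
by_column_sum = df.loc['2017-04-01':'2017-04-04', df.sum() > 1]

# Row filter: within the date range, keep rows whose own total exceeds 1
in_range = df.loc['2017-04-01':'2017-04-04']
by_row_sum = in_range[in_range.sum(axis=1) > 1]

print(by_column_sum)
print(by_row_sum)
```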

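You can also combine [] with the apply method to build more complex filters, using any function that returns a boolean as the filter condition:

# Column labels are case-sensitive: the frame has 'B' and 'C', not 'b' and 'c'
df[df.apply(lambda x: x['B'] > x['C'], axis=1)]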
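For a comparison this simple, the row-wise `apply` can be replaced by a direct column comparison, which produces the same mask without calling a Python function once per row. The sketch below checks that the two masks agree on a frame rebuilt from the rounded values printed earlier (an assumption). `apply` remains the more flexible choice when the condition cannot be written with column-wise operators.

```python
import pandas as pd

dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

mask_apply = df.apply(lambda row: row['B'] > row['C'], axis=1)  # one Python call per row
mask_vector = df['B'] > df['C']                                 # one vectorized comparison

result = df[mask_vector]
print(result)
```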

Using isin

df['E']=['one', 'one','two','three','four','three']

                   A         B         C         D      E
2017-04-01  0.522241  0.495106 -0.268194 -0.035003    one
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926    one
2017-04-03  0.480507  1.215048  1.313314 -0.072320    two
2017-04-04  1.700309  0.287588 -0.012103  0.525291  three
2017-04-05  0.526615 -0.417645  0.405853 -0.835213   four
2017-04-06  1.143858 -0.326720  1.425379  0.531037  three

df[df.E.isin(['one'])]
Out:
                   A         B         C         D    E
2017-04-01  0.522241  0.495106 -0.268194 -0.035003  one
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926  one
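`isin` accepts any number of candidate values, and its result can be negated with `~` to exclude them instead. A minimal sketch, using a frame rebuilt from the rounded values printed earlier (an assumption):

```python
import pandas as pd

dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))
df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

included = df[df['E'].isin(['one', 'two'])]    # rows whose E is 'one' or 'two'
excluded = df[~df['E'].isin(['one', 'two'])]   # every other row

print(included['E'].tolist(), excluded['E'].tolist())
```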
