pandas cookbook

Reposted article, 2020-03-12 10:57, Pandas study notes

rename()

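Renames the columns of a DataFrame. Pass a dictionary of old-name: new-name pairs.

df.rename(columns={
    "old_name_1": "new_name_1",
    "old_name_2": "new_name_2",
    "old_name_3": "new_name_3",
    # ......
}, inplace=True)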

.sort_values(ascending)

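ascending=True is the default; pass ascending=False for descending order.
Does the same job as arrange in R.
Sorts in ascending or descending order, and can sort by two or more columns at once, each with its own direction: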

homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])

Date handling

pd.to_datetime(..., format=...)
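
The note above only names the function, so here is a minimal sketch of how format= is typically used. The DataFrame and its day/month/year strings are made up for illustration.

```python
import pandas as pd

# Hypothetical data: dates stored as day/month/year strings
df = pd.DataFrame({
    "date": ["03/11/2019", "10/11/2019"],
    "small_sold": [10376832, 10717154],
})

# format= tells pandas exactly how each string is laid out,
# so "03/11/2019" is read as 3 November, not 11 March
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

print(df["date"].dt.month.tolist())
```

Once the column is a datetime, the .dt accessor exposes parts such as year, month and day.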

subset

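Extracting a subset

To extract a single column, use one pair of brackets: df["col"]
To extract several columns, use two pairs of brackets: df[["col1", "col2", ..., "coln"]]
The reason is simple: the inner brackets build a list of column names, and the outer brackets select the columns in that list.
Rows can also be filtered with a logical condition. The condition returns a Series of True/False values, which is then used to pick the rows.
For example, df["height"] > 168 gives the True/False mask, so the rows with height above 168 are df[df["height"] > 168]
Common operators are >, <, ==, & and |. To combine conditions use & and | with each condition in parentheses; Python's and/or do not work element-wise on a Series.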
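
The subsetting rules above can be sketched as follows. The DataFrame and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "height": [160, 172, 181, 168],
    "weight": [55, 70, 90, 62],
})

one_col = df["height"]              # single brackets: a Series
two_cols = df[["name", "height"]]   # double brackets: a DataFrame

mask = df["height"] > 168           # Series of True/False
tall = df[mask]                     # rows where the mask is True

# Combine conditions with & and parentheses
tall_and_light = df[(df["height"] > 168) & (df["weight"] < 80)]

print(tall_and_light)
```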

.isin()

Similar to %in% in R: tests whether each value is in a given list, so it can be used for filtering.
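
A short sketch of filtering with .isin(), using made-up region names:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "region": ["Pacific", "Mountain", "Pacific", "South"],
    "family_members": [10, 20, 30, 40],
})

# True where the region is one of the listed values
mask = df["region"].isin(["Pacific", "South"])
subset = df[mask]

print(subset)
```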

Using operators

Two columns can be combined directly with +, -, *, / and so on:
df["col_a"] + df["col_b"]

agg()

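Computes summary statistics directly on columns.

It works on datetime columns as well as ordinary ones.

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super efficient. In other words, agg accepts user-defined functions.

agg can compute several statistics in one call; it is not limited to one.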
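
As a rough illustration of the points above, the sketch below passes a custom function to .agg(), first alone and then together with a built-in statistic. The data and the value_range helper are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
sales = pd.DataFrame({
    "temperature_c": [5.0, 10.0, 15.0, 20.0, 25.0],
    "unemployment": [8.0, 7.5, 7.0, 8.5, 9.0],
})

def value_range(column):
    """Custom statistic: largest value minus smallest value."""
    return column.max() - column.min()

cols = ["temperature_c", "unemployment"]

# One custom function applied to two columns at once
ranges = sales[cols].agg(value_range)

# Several statistics in one call: one row per statistic
stats = sales[cols].agg([value_range, "median"])

print(stats)
```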

Cumulative statistics
.cumsum()
.cummax()
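
A small sketch of the two cumulative methods on a made-up weekly sales column. Sorting by date first is an assumption about how such data is usually prepared.

```python
import pandas as pd

# Hypothetical data for illustration
sales = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26"]),
    "weekly_sales": [100, 250, 180, 300],
})

# Cumulative statistics depend on row order, so sort first
sales = sales.sort_values("date")

sales["cum_weekly_sales"] = sales["weekly_sales"].cumsum()  # running total
sales["cum_max_sales"] = sales["weekly_sales"].cummax()     # running maximum

print(sales)
```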

counting

drop_duplicates()

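Removes duplicate rows.
Parameters

subset: column name(s), optional, default None, e.g. subset=["col1", "col2", ...]
keep: {'first', 'last', False}, default 'first'
- first: keep the first occurrence of each duplicate and drop the later ones.
- last: keep the last occurrence and drop the earlier ones.
- False: drop every row that has a duplicate.
inplace: boolean, default False. With inplace=True the duplicates are dropped from the original DataFrame; with the default False a new DataFrame is returned and the original is left unchanged.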
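
The three keep options can be compared on a small made-up table:

```python
import pandas as pd

# Hypothetical data: (store, type) repeats for stores 1 and 2
stores = pd.DataFrame({
    "store": [1, 1, 2, 2, 3],
    "type": ["A", "A", "B", "B", "A"],
    "department": [1, 2, 1, 1, 1],
})

cols = ["store", "type"]

first = stores.drop_duplicates(subset=cols)                # keeps rows 0, 2, 4
last = stores.drop_duplicates(subset=cols, keep="last")    # keeps rows 1, 3, 4
unique_only = stores.drop_duplicates(subset=cols, keep=False)  # keeps row 4

print(first, last, unique_only, sep="\n\n")
```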

.value_counts()

Lists the distinct values and counts how many times each one occurs.
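
A quick sketch on a made-up column. normalize=True is a standard option that returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical data for illustration
types = pd.Series(["A", "B", "A", "A", "B", "C"], name="type")

counts = types.value_counts()                 # sorted from most to least frequent
shares = types.value_counts(normalize=True)   # proportions that sum to 1

print(counts)
print(shares)
```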

group

Grouped summary statistics

sales_by_type = sales.groupby("type")["weekly_sales"].sum()

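After grouping, you can select specific columns and compute statistics on them.
You can group by several variables and compute statistics for the chosen columns: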

sales_by_type_is_holiday = sales.groupby(["type", "is_holiday"])["weekly_sales"].sum()

So this is how those summary tables are produced:

unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg([np.min, np.max, np.mean, np.median])

<script.py> output:

                           unemployment                      fuel_price_usd_per_l                     
                 amin   amax   mean median                 amin   amax   mean median
    type                                                                            
    A           3.879  8.992  7.973  8.067                0.664  1.107  0.745  0.735
    B           7.170  9.765  9.279  9.199                0.760  1.108  0.806  0.803

pivot_table()

Pivot tables

print(sales.pivot_table(values="weekly_sales", index="department", columns="type", aggfunc=[np.sum, np.mean], fill_value=0, margins=True))

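values: the column whose data is aggregated.
aggfunc: the function or functions used to aggregate the data.
fill_value: replaces missing values; margins=True adds row and column totals.
columns: works somewhat like a second grouping variable, spread across the columns.
The result is a pivot table.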
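
To make the parameters concrete, here is a minimal sketch with a made-up sales table. It uses a single aggregation, passed as the string "sum", to keep the output small.

```python
import pandas as pd

# Hypothetical data: department 2 has no type "B" rows
sales = pd.DataFrame({
    "type": ["A", "A", "B", "A"],
    "department": [1, 2, 1, 1],
    "weekly_sales": [10.0, 20.0, 5.0, 30.0],
})

table = sales.pivot_table(
    values="weekly_sales",   # what to aggregate
    index="department",      # one row per department
    columns="type",          # one column per type
    aggfunc="sum",
    fill_value=0,            # empty department/type cells become 0
    margins=True,            # adds an "All" row and column
)

print(table)
```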

set_index()

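Signature: DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
keys: a column label, or a list of column labels/arrays, to use as the index
drop: default True; remove the columns that become the new index
append: default False; whether to append the columns to the existing index
inplace: default False; if True, modify the DataFrame in place instead of creating a new object
verify_integrity: default False; check the new index for duplicates. Otherwise the check is deferred until it is needed. Leaving it False makes the method faster.

Setting a multi-level index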

# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country", "city"])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])
<script.py> output:
                                  date  avg_temp_c
    country  city                                 
    Brazil   Rio De Janeiro 2000-01-01      25.974
             Rio De Janeiro 2000-02-01      26.699
             Rio De Janeiro 2000-03-01      26.270
             Rio De Janeiro 2000-04-01      25.750
             Rio De Janeiro 2000-05-01      24.356
    ...                            ...         ...
    Pakistan Lahore         2013-05-01      33.457
             Lahore         2013-06-01      34.456
             Lahore         2013-07-01      33.279
             Lahore         2013-08-01      31.511
             Lahore         2013-09-01         NaN
    
    [330 rows x 2 columns]

sort_index()

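By default sorts rows by their row labels; with axis=1 it sorts columns by their column labels.
Note: in old pandas versions sort_index() could also sort rows by the values of one or more columns, duplicating df.sort_values(). That usage has since been removed, so use df.sort_index() only for sorting by row or column labels and df.sort_values() for everything else.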

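reset_index()

Resets the index, turning the index levels back into ordinary columns (pass drop=True to discard them instead).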
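
The three index methods fit together roughly as in this sketch. The temperatures table here is a tiny made-up stand-in for the one used in the notes.

```python
import pandas as pd

# Hypothetical data for illustration
temperatures = pd.DataFrame({
    "country": ["Pakistan", "Brazil", "Brazil"],
    "city": ["Lahore", "Rio De Janeiro", "Brasilia"],
    "avg_temp_c": [33.5, 26.0, 22.1],
})

# Move two columns into a multi-level index
temperatures_ind = temperatures.set_index(["country", "city"])

# Sort by the index labels, outer level first
temperatures_srt = temperatures_ind.sort_index()

# Turn the index levels back into columns
restored = temperatures_srt.reset_index()

# Or throw the index away entirely
dropped = temperatures_srt.reset_index(drop=True)

print(restored)
```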

loc

# A few common ways to write it
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")])
print(temperatures_srt.loc[:, "date":"avg_temp_c"])

iloc

iloc slices do not include the end position.
To take the first 5 rows, the leading 0 can be omitted:

print(temperatures.iloc[:5, 2:4])
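
One difference worth seeing side by side: label slices with loc include both endpoints, while position slices with iloc exclude the end. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
    index=["w", "x", "y", "z"],
)

by_label = df.loc["w":"y"]     # labels w, x AND y
by_position = df.iloc[0:2]     # positions 0 and 1 only
first_two = df.iloc[:2]        # same thing, leading 0 omitted

print(by_label)
print(by_position)
```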

Visualization

import matplotlib.pyplot as plt

# Histogram of conventional avg_price
avocados[avocados["type"] == "conventional"]["avg_price"].hist()

# Histogram of organic avg_price
avocados[avocados["type"] == "organic"]["avg_price"].hist()

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()

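Missing values

isna().any(): shows which columns contain missing values
isna().sum(): counts the missing values in each column; plot the result as a bar chart to see which columns have the most
.dropna(): removes rows that contain missing values
fillna(): fills in missing values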
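
The four methods can be tried on a small made-up table:

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({
    "small_sold": [1.0, np.nan, 3.0],
    "large_sold": [np.nan, np.nan, 6.0],
    "xl_sold": [7.0, 8.0, 9.0],
})

has_na = df.isna().any()   # per column: is anything missing?
n_na = df.isna().sum()     # per column: how many are missing?
complete = df.dropna()     # keep only rows with no missing values
filled = df.fillna(0)      # replace missing values with 0

print(n_na)
```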

Creating a DataFrame

From a list of dictionaries, [{...}], built row by row:

# Create a list of dictionaries with new data
avocados_list = [
    {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
]

# Convert list into DataFrame
avocados_2019 = pd.DataFrame(avocados_list)

# Print the new DataFrame
print(avocados_2019)

Or from a dictionary of lists, built column by column:

# Create a dictionary of lists with new data
avocados_dict = {
  "date": ["2019-11-17", "2019-12-01"],
  "small_sold": [10859987, 9291631],
  "large_sold": [7674135, 6238096]
}

# Convert dictionary into DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)

# Print the new DataFrame
print(avocados_2019)

to_csv

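df.to_csv() writes a DataFrame to a CSV file. It is a DataFrame method, not a function on pd.

pd.read_csv(filepath_or_buffer, header, parse_dates, index_col)

filepath_or_buffer: the file path
header: the row to use as column names
parse_dates: the columns to parse as dates
index_col: the column(s) to use as the index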
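
A round trip through a CSV file might look like the sketch below. The file name and temporary directory are assumptions made so the example is self-contained.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for illustration
avocados = pd.DataFrame({
    "date": ["2019-11-03", "2019-11-10"],
    "small_sold": [10376832, 10717154],
})

# Write to a temporary file; index=False leaves out the row numbers
path = os.path.join(tempfile.mkdtemp(), "avocados.csv")
avocados.to_csv(path, index=False)

# Read it back, parsing the date column and using it as the index
loaded = pd.read_csv(path, header=0, parse_dates=["date"], index_col="date")

print(loaded)
```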

list comprehension

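This feature is really powerful.

[expr for iter in iterable if cond_expr]

[expr]: the value produced for each element of the result
[for iter in iterable]: the loop; there can be several for clauses, which act as nested loops
[if cond_expr]: an optional filter. An if clause may follow any for clause, not only the last one. The clauses are evaluated left to right, exactly like nested for/if statements written top to bottom.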
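
The three parts can be seen in a few small examples (all values are arbitrary):

```python
# One loop with a filter
evens_squared = [n * n for n in range(6) if n % 2 == 0]

# Two nested loops with a filter at the end
pairs = [(x, y) for x in range(3) for y in range(3) if x < y]

# A filter between the two loops: skip even x before looping over y
odd_x = [(x, y) for x in range(4) if x % 2 == 1 for y in range(2)]

print(evens_squared)
print(pairs)
print(odd_x)
```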

Reading multiple data files

To read several files at once you can use a loop, although that does not take advantage of Python's list comprehensions:

# Import pandas
import pandas as pd

# Create the list of file names: filenames
filenames = ['Gold.csv', 'Silver.csv', 'Bronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

# Print the top 5 rows of one DataFrame in dataframes,
# chosen by position (index 2 is the third file, Bronze.csv)
print(dataframes[2].head())
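
The same loop can be written as a list comprehension. In this sketch the three CSV files are created in a temporary directory with made-up contents so that it runs on its own; with real files only the last two statements are needed.

```python
import os
import tempfile

import pandas as pd

# Create three small hypothetical CSV files to read back
tmp_dir = tempfile.mkdtemp()
filenames = []
for medal in ["Gold", "Silver", "Bronze"]:
    path = os.path.join(tmp_dir, f"{medal}.csv")
    pd.DataFrame({"NOC": ["USA", "URS"], "Total": [3, 2]}).to_csv(path, index=False)
    filenames.append(path)

# One DataFrame per file, built in a single expression
dataframes = [pd.read_csv(filename) for filename in filenames]

print(dataframes[2].head())
```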

Adding a column

df["column_name"] = new_values
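
A short sketch that adds columns computed from existing ones. The column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "height_cm": [160, 180],
    "weight_kg": [50.0, 81.0],
})

# New columns from arithmetic on existing columns
df["height_m"] = df["height_cm"] / 100
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

print(df)
```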

ffill()

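Chaining .ffill() after reindex() first reindexes to produce a new DataFrame and then forward-fills every missing value in it.
This is not quite the same as passing method="ffill" to reindex(), which only fills the labels newly introduced by the reindex and leaves NaN values that were already in the data untouched.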

reindex()

Conforms the data to a new index. Labels that were not in the original index get NaN.
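
The difference between filling after a reindex and filling during it shows up when the original data already contains a NaN. A minimal sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the value at label 2 is already missing
s = pd.Series([1.0, np.nan, 3.0], index=[0, 2, 4])
new_index = [0, 1, 2, 3, 4]

# Reindex first, then forward-fill every NaN in the result
filled_after = s.reindex(new_index).ffill()

# Fill while reindexing: each new label copies the value at the
# previous existing label, even if that value is NaN
filled_during = s.reindex(new_index, method="ffill")

print(filled_after.tolist())
print(filled_during.tolist())
```

Here filled_after has no missing values, while filled_during still has NaN at labels 2 and 3.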

Renaming columns with string methods

temps_c.columns = temps_c.columns.str.replace('F', 'C')

Note that you have to go through the .str accessor first.

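df.pct_change()

DataFrame.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)

Gives the relative change between the current element and the previous one, as a fraction (0.1 means 10%). With periods=n it compares the current element with the one n rows earlier.

The first value (or the first n values) is therefore NaN, because there is no earlier element to compare with.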
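
A quick sketch on a made-up price series, using the default period and periods=2:

```python
import pandas as pd

# Hypothetical data for illustration
prices = pd.Series([100.0, 110.0, 99.0, 120.0])

change_1 = prices.pct_change()            # compared with the previous row
change_2 = prices.pct_change(periods=2)   # compared with two rows earlier

print(change_1)
print(change_2)
```

Multiply the result by 100 to express it as a percentage.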

add()

Adds a value, Series or DataFrame to every column of the DataFrame.

DataFrame.add(other, axis='columns', level=None, fill_value=None)
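
As an example, the sketch below adds a scalar, a per-row Series, a per-column Series and a partially overlapping DataFrame. All values are made up.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 20.0]})

# A scalar is added to every cell
plus_scalar = df.add(5)

# A Series aligned on the row index: one offset per row
row_offsets = pd.Series([100.0, 200.0], index=[0, 1])
plus_rows = df.add(row_offsets, axis="index")

# A Series aligned on the column labels: one offset per column
col_offsets = pd.Series({"a": 1.0, "b": 2.0})
plus_cols = df.add(col_offsets, axis="columns")

# fill_value=0 treats cells missing on one side as 0 instead of NaN
other = pd.DataFrame({"a": [1.0]})
plus_other = df.add(other, fill_value=0)

print(plus_rows)
```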

concat()

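Concatenates DataFrames, either vertically or side by side.
pd.concat()

pd.concat(objs, axis=0, join='outer', ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

(The join_axes argument that appears in older signatures was removed in pandas 1.0.)

axis=0, the default, stacks the DataFrames vertically; axis='columns' places them side by side.
keys: adds an outer index level that identifies which input each piece came from.
join: {'inner', 'outer'}, default 'outer'. How to handle the index on the other axis: outer takes the union, inner the intersection.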
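
The main options can be compared on two small made-up medal tables:

```python
import pandas as pd

# Hypothetical data for illustration
gold = pd.DataFrame({"NOC": ["USA", "URS"], "Total": [3, 2]})
silver = pd.DataFrame({"NOC": ["USA", "GBR"], "Total": [1, 4]})

# Vertical stacking with a fresh 0..n-1 index
stacked = pd.concat([gold, silver], ignore_index=True)

# Vertical stacking with an outer index level naming each source
keyed = pd.concat([gold, silver], keys=["gold", "silver"])

# Side by side, keeping only index labels present in both inputs
side = pd.concat(
    [gold.set_index("NOC"), silver.set_index("NOC")],
    axis="columns",
    join="inner",
    keys=["gold", "silver"],
)

print(side)
```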

merge

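merge works like a join in a relational database: it connects rows from different DataFrames based on one or more keys.

DataFrame1.merge(DataFrame2, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'))

I find this function awkward to use.

It is worth getting clear on when to use join, merge and concat, and how they differ and relate.
The join method is a convenient way to combine the columns of two DataFrames into one, matching rows on the index. Its parameters mean much the same as in merge, except that join defaults to a left join, how='left'.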
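
A small side-by-side of merge and join may help with the comparison. The two tables and the key column are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# merge matches on a column; the default is an inner join
inner = left.merge(right, on="key")

# how="outer" keeps keys from both sides
outer = left.merge(right, on="key", how="outer")

# join matches on the index; the default is a left join
joined = left.set_index("key").join(right.set_index("key"))

print(inner)
print(joined)
```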

merge_ordered:

This function combines time series and other ordered data. In particular, it has an optional fill_method keyword for filling in missing data.
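
A minimal sketch of fill_method, assuming a monthly series merged with a quarterly one. Both tables are made up.

```python
import pandas as pd

# Hypothetical data: monthly rates and quarterly GDP
rates = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
    "rate": [1.5, 1.4, 1.3, 1.2],
})
gdp = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-04-01"]),
    "gdp": [100.0, 98.0],
})

# Ordered outer merge on date; gaps in gdp are forward-filled
merged = pd.merge_ordered(rates, gdp, on="date", fill_method="ffill")

print(merged)
```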

merge_asof

Similar to pd.merge_ordered(), the pd.merge_asof() function will also merge values in order using the on column, but for each row in the left DataFrame it keeps only the nearest row from the right DataFrame whose 'on' value is less than or equal to the left value (strictly less than if allow_exact_matches=False).
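
The matching rule can be sketched with made-up trades and quotes, where each trade picks up the most recent quote:

```python
import pandas as pd

# Hypothetical data; both tables must be sorted by the "on" column
trades = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 09:00:05", "2020-01-01 09:00:10"]),
    "qty": [10, 20],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime([
        "2020-01-01 09:00:00",
        "2020-01-01 09:00:04",
        "2020-01-01 09:00:10",
    ]),
    "price": [100.0, 101.0, 102.0],
})

# Default: the last quote at or before each trade time
matched = pd.merge_asof(trades, quotes, on="time")

# Strictly earlier quotes only
strict = pd.merge_asof(trades, quotes, on="time", allow_exact_matches=False)

print(matched)
print(strict)
```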

Analyzing Police Activity with pandas

A worked example.
