Using cross-tabulations (crosstab) with pandas DataFrames


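1. Use case: a cross-tabulation (crosstab for short) is a widely used summary table for categorical data. It is a special kind of pivot table that counts how often each combination of groups occurs, and its main value is that it shows how two or more variables relate to each other. The variables can be categorical or numeric, but the case where all of them are categorical is the most common.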
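To make the "pivot table of group frequencies" idea concrete, here is a minimal sketch. The `survey` data and its column names are made up for illustration; the point is only that the crosstab holds the same counts you would get by grouping and counting by hand.

```python
import pandas as pd

# Hypothetical survey data: two categorical variables per respondent.
survey = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "answer": ["yes", "no", "yes", "yes", "no"],
})

# Cross-tabulation: one row per gender, one column per answer, cells are counts.
table = pd.crosstab(survey["gender"], survey["answer"])

# The same counts built by hand: group, count the rows in each group,
# then move the inner level ("answer") into the columns.
by_hand = survey.groupby(["gender", "answer"]).size().unstack(fill_value=0)

print(table)
print(by_hand)
```

Both tables should show the same numbers; crosstab is simply the shorter way to write it.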

2. The function in Python (pandas):

pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Docstring: Compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
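In short: it computes a simple cross-tabulation of two (or more) variables (factors). By default the cells hold the frequency of each combination of factor values.
If an array of values and an aggregation function are both passed, the cells hold the aggregated values for each combination instead. See the parameter descriptions for details.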
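
As a small illustration of the default frequency behaviour, the sketch below passes two plain NumPy arrays. The arrays and their contents are invented for this example; `rownames` and `colnames` come from the signature above and are used here because bare arrays have no names of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical observations: position i in each array describes the same item.
shape = np.array(["round", "square", "round", "round"])
color = np.array(["red", "red", "blue", "red"])

# Plain arrays carry no names, so label the two axes explicitly.
counts = pd.crosstab(shape, color, rownames=["shape"], colnames=["color"])
print(counts)
```

A combination that never occurs (here a blue square) appears as a count of 0 rather than being left out.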

Parameters
----------
index : array-like, Series, or list of arrays/Series
    Values to group by in the rows.
columns : array-like, Series, or list of arrays/Series
    Values to group by in the columns.
values : array-like, optional
    Array of values to aggregate according to the factors.
    Requires `aggfunc` be specified.
rownames : sequence, default None
    If passed, must match number of row arrays passed.
colnames : sequence, default None
    If passed, must match number of column arrays passed.
aggfunc : function, optional
    If specified, requires `values` be specified as well.
margins : bool, default False
    Add row/column margins (subtotals).
margins_name : str, default 'All'
    Name of the row/column that will contain the totals
    when margins is True.

    .. versionadded:: 0.21.0

dropna : bool, default True
    Do not include columns whose entries are all NaN.
normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
    Normalize by dividing all values by the sum of values.

    - If passed 'all' or `True`, will normalize over all values.
    - If passed 'index' will normalize over each row.
    - If passed 'columns' will normalize over each column.
    - If margins is `True`, will also normalize margin values.

    .. versionadded:: 0.18.1

Returns
-------
DataFrame
    Cross tabulation of the data.

See Also
--------
DataFrame.pivot : Reshape data based on column values.
pivot_table : Create a pivot table as a DataFrame.

Notes
-----
Any Series passed will have their name attributes used unless row or column
names for the cross-tabulation are specified.

Any input passed containing Categorical data will have **all** of its
categories included in the cross-tabulation, even if the actual data does
not contain any instances of a particular category.

In the event that there aren't overlapping indexes an empty DataFrame will
be returned.

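The note about Categorical data interacts with `dropna`, so it is worth trying on a toy input. The sketch below mirrors the kind of example found in the pandas documentation; the variable names are arbitrary. In the pandas versions I have seen, the unused categories only show up once `dropna=False` is passed, so treat the exact behaviour as version-dependent.

```python
import pandas as pd

# Two Categoricals that each declare one category ("c", "f") never used in the data.
foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])

# Default (dropna=True): rows/columns with no observations are dropped.
observed_only = pd.crosstab(foo, bar)

# dropna=False: every declared category is kept, filled with zero counts.
all_categories = pd.crosstab(foo, bar, dropna=False)

print(observed_only)
print(all_categories)
```

Since neither input has a name, the axes are labelled `row_0` and `col_0` unless `rownames` and `colnames` are given.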
3. Examples
--------
import pandas as pd
import numpy as np

# Small sample frame used by all of the examples below
df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})
print(f'df:\n{df}')

df:
   A  B  C
0  1  6  1
1  8  6  1
2  8  4  2
3  8  4  1
4  1  4  1

# Default behaviour: count how often each (A, B) combination occurs
data1 = pd.crosstab(df['A'], df['B'])
print(f'data1:\n{data1}')

data1:
B  4  6
A
1  1  1
8  2  1

# normalize=True reports each cell as a relative frequency,
# i.e. its share of all observations (the cells sum to 1)
data2 = pd.crosstab(df['A'], df['B'], normalize=True)
print(f'data2:\n{data2}')

data2:
B    4    6
A
1  0.2  0.2
8  0.4  0.2
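
The docstring also lists `normalize='index'` and `normalize='columns'`, which the example above does not exercise. A minimal sketch on the same sample frame, rebuilt here so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

# Each row sums to 1: "given A, how is B distributed?"
by_row = pd.crosstab(df['A'], df['B'], normalize='index')

# Each column sums to 1: "given B, how is A distributed?"
by_col = pd.crosstab(df['A'], df['B'], normalize='columns')

print(by_row)
print(by_col)
```

Which one to use depends on which variable you treat as the condition; the raw counts underneath are identical.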

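# values: the array of values to aggregate within each combination of factors
# aggfunc: how to aggregate them; without values a frequency table is computed,
#          with values the given function is applied to each group.
#          The string 'sum' is used instead of np.sum, which triggers a
#          FutureWarning in recent pandas versions.
data3 = pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='sum')
print(f'data3:\n{data3}')

data3:
B  4  6
A
1  1  1
8  3  1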
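
Since a crosstab with `values` and `aggfunc` is a pivot table in disguise, one way to sanity-check the result is to build the same table with `DataFrame.pivot_table`. This is a sketch on the same sample frame; the `'mean'` variant is added only to show that other aggregations work the same way.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

# Sum of C within each (A, B) combination, written two ways.
via_crosstab = pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='sum')
via_pivot = df.pivot_table(index='A', columns='B', values='C', aggfunc='sum')

# A different aggregation: the mean of C per (A, B) combination.
mean_table = pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='mean')

print(via_crosstab)
print(via_pivot)
print(mean_table)
```

`pivot_table` is the more natural choice when the data already lives in one DataFrame; `crosstab` is handy when the factors are separate arrays or Series.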

# margins: bool, default False. When True, a subtotal row and column are added.
# margins_name sets the label of that row/column (the default label is 'All').
data4 = pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='sum', margins=True)
print(f'data4:\n{data4}')

data4:
B    4  6  All
A
1    1  1    2
8    3  1    4
All  4  2    6
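
The margins can also be renamed and combined with `normalize`. The sketch below uses plain counts on the same sample frame; the label `'Total'` is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

# Frequency table with a subtotal row/column labelled 'Total' instead of 'All'.
totals = pd.crosstab(df['A'], df['B'], margins=True, margins_name='Total')

# With normalize=True the margins are normalized as well,
# so the bottom-right cell is the whole sample (1.0).
shares = pd.crosstab(df['A'], df['B'], margins=True, normalize=True)

print(totals)
print(shares)
```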

# Hierarchical crosstab: index and columns also accept a list of Series,
# which produces a table with a multi-level (hierarchical) index
data5 = pd.crosstab([df['A'], df['B']], df['C'])
print(f'data5:\n{data5}')

data5:
C    1  2
A B
1 4  1  0
  6  1  0
8 4  1  1
  6  1  0
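
A hierarchical crosstab has a MultiIndex on its rows, so reading values back takes slightly different syntax. A minimal sketch of the usual lookups, using the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

nested = pd.crosstab([df['A'], df['B']], df['C'])

# One cell: rows are addressed by an (A, B) tuple, columns by the value of C.
cell = nested.loc[(8, 4), 2]

# All rows for A == 8: the outer level is dropped, B remains as the index.
sub_table = nested.loc[8]

# Turn the index levels back into ordinary columns for export or merging.
flat = nested.reset_index()

print(cell)
print(sub_table)
print(flat)
```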

