Beginner-level calculations
1. Arithmetic mean
# Samples:
S = [s1, s2, s3, …, sn]
# Arithmetic mean:
m = (s1 + s2 + s3 + … + sn)/n
In NumPy:
m = numpy.mean(sample_array)
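As a quick check, the formula and numpy.mean can be compared on a small array. This is a minimal sketch; the sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
samples = np.array([2.0, 4.0, 6.0, 8.0])

# Arithmetic mean written out from the formula: sum of the samples / n
manual_mean = samples.sum() / samples.size

# The same quantity via NumPy
numpy_mean = np.mean(samples)

print(manual_mean, numpy_mean)  # both are 5.0
```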
2. Weighted mean
# Samples:
S = [s1, s2, s3, …, sn]
# Weights:
W = [w1, w2, w3, …, wn]
# Weighted mean:
a = (s1*w1 + s2*w2 + s3*w3 + … + sn*wn)/(w1 + w2 + w3 + … + wn)
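In NumPy
Start with the data: the list of values to average and the matching list of weights (both must be non-empty and of equal length)
elements = []  # values to average
weights = []   # one weight per value
Compute the weighted mean directly with NumPy:
import numpy as np
np.average(elements, weights=weights)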
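Because the two lists are left empty as placeholders, here is a small runnable sketch with made-up numbers. It compares the weighted-mean formula with np.average; the scores and weights are assumptions for illustration only.

```python
import numpy as np

# Hypothetical scores and their weights (illustrative values only)
elements = [80, 90, 70]
weights = [0.2, 0.3, 0.5]

# Weighted mean from the formula: sum(s_i * w_i) / sum(w_i)
manual = sum(e * w for e, w in zip(elements, weights)) / sum(weights)

# The same quantity via np.average
result = np.average(elements, weights=weights)

print(manual, result)  # both are approximately 78.0
```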
Pure Python equivalents (rounded to one decimal place):
# Without NumPy, version 1
round(sum([elements[i]*weights[i] for i in range(len(elements))])/sum(weights), 1)
# Without NumPy, version 2
round(sum([e*w for e, w in zip(elements, weights)])/sum(weights), 1)
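A function that computes the mean of a sequence (it counts items as it goes, so it also works for any iterable):
def average(seq, total=0.0):
    num = 0
    for item in seq:
        total += item
        num += 1
    return total / num
If the sequence is a list or a tuple, the following simpler code is enough:
def average(seq):
    return float(sum(seq)) / len(seq)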
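Both versions divide by zero on an empty input. The sketch below adds an explicit check and compares the result with statistics.fmean from the standard library (assuming Python 3.8 or later). The error message and sample data are illustrative choices, not part of the original code.

```python
from statistics import fmean


def average(seq):
    """Return the arithmetic mean of a non-empty list or tuple."""
    if len(seq) == 0:
        # Fail with a clear message instead of a ZeroDivisionError
        raise ValueError("average() requires at least one value")
    return sum(seq) / len(seq)


values = (3, 5, 10)  # illustrative data

print(average(values))  # 6.0
print(fmean(values))    # 6.0
```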
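3. Maximum and minimum
1. Maximum and minimum of one array
max: returns the largest element of an array
min: returns the smallest element of an array
2. Element-wise maximum and minimum of two arrays
maximum: builds an array holding the larger of each pair of corresponding elements from two arrays
minimum: builds an array holding the smaller of each pair of corresponding elements from two arrays
Example: numpy.maximum(a, b) compares arrays a and b element by element and takes the larger value each time to form a new array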
3. Exercise
import numpy as np
# Maximum and minimum
a = np.random.randint(10, 100, 9).reshape(3, 3)
print(a)
# print('Maximum:', np.max(a), a.max())  # largest element
# print('Minimum:', np.min(a), a.min())  # smallest element
# print('Index of maximum:', np.argmax(a), a.argmax())  # index of the maximum in the flattened array
# maximum and minimum: element-wise comparison of two arrays
b = np.random.randint(10, 100, 9).reshape(3, 3)
print(b)
print('Element-wise maximum:\n', np.maximum(a, b))
print('Element-wise minimum:\n', np.minimum(a, b))
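Since the exercise uses random arrays, its output changes on every run. The sketch below repeats the same calls on fixed arrays so the results can be checked by hand. The array values are made up, and the axis argument is an extra illustration not covered above.

```python
import numpy as np

# Fixed, illustrative arrays so the results are reproducible
a = np.array([[1, 8, 3],
              [7, 2, 9]])
b = np.array([[4, 5, 6],
              [1, 2, 10]])

print(a.max())         # 9: largest element of the whole array
print(a.max(axis=0))   # [7 8 9]: largest element of each column
print(a.argmax())      # 5: position of the maximum in the flattened array

# Element-wise comparison of two arrays of the same shape
elementwise_max = np.maximum(a, b)  # [[4 8 6], [7 2 10]]
elementwise_min = np.minimum(a, b)  # [[1 5 3], [1 2 9]]
print(elementwise_max)
print(elementwise_min)
```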
Advanced study
Example 1
Given a DataFrame df:
             ID    wt  value
Date
01/01/2012  100  0.50     60
01/01/2012  101  0.75     80
01/01/2012  102  1.00    100
01/02/2012  201  0.50    100
01/02/2012  202  1.00     80
The code that builds it:
import numpy as np
import pandas as pd
index = pd.Index(['01/01/2012','01/01/2012','01/01/2012','01/02/2012','01/02/2012'], name='Date')
df = pd.DataFrame({'ID':[100,101,102,201,202],'wt':[.5,.75,1,.5,1],'value':[60,80,100,100,80]},index=index)
The mean of "wt", weighted by "value" and grouped by the index, is:
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
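The text shows this result without the call that produces it. One way to obtain it, offered as a sketch rather than as the author's method, is to group on the index and apply np.average to each group:

```python
import numpy as np
import pandas as pd

index = pd.Index(['01/01/2012', '01/01/2012', '01/01/2012',
                  '01/02/2012', '01/02/2012'], name='Date')
df = pd.DataFrame({'ID': [100, 101, 102, 201, 202],
                   'wt': [.5, .75, 1, .5, 1],
                   'value': [60, 80, 100, 100, 80]}, index=index)

# Group on the index level 'Date'; within each group,
# average 'wt' using 'value' as the weights
result = df.groupby(level='Date').apply(
    lambda g: np.average(g['wt'], weights=g['value'])
)
print(result)
```

For the first date this evaluates (0.5*60 + 0.75*80 + 1*100) / (60 + 80 + 100) = 190/240, which matches the 0.791667 shown.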
Alternatively, define a function:
def grouped_weighted_avg(values, weights, by):
    return (values * weights).groupby(by).sum() / weights.groupby(by).sum()
grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index)
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
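A more involved variant using apply. Note that the roles are swapped here: it averages "value" weighted by "wt", so its output differs from the result above:
grouped = df.groupby('Date')
def wavg(group):
    d = group['value']  # values being averaged
    w = group['wt']     # weights
    return (d * w).sum() / w.sum()
grouped.apply(wavg)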
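To make it explicit which column is averaged and which one supplies the weights, the helper can take the column names as parameters. This is a sketch; the parameter names value_col and weight_col are hypothetical and not from the original code.

```python
import pandas as pd

index = pd.Index(['01/01/2012', '01/01/2012', '01/01/2012',
                  '01/02/2012', '01/02/2012'], name='Date')
df = pd.DataFrame({'ID': [100, 101, 102, 201, 202],
                   'wt': [.5, .75, 1, .5, 1],
                   'value': [60, 80, 100, 100, 80]}, index=index)


def wavg(group, value_col, weight_col):
    """Weighted mean of value_col within one group, weighted by weight_col."""
    values = group[value_col]
    weights = group[weight_col]
    return (values * weights).sum() / weights.sum()


grouped = df.groupby(level='Date')

# Mean of 'wt' weighted by 'value' (0.791667 and 0.722222 for the two dates)
wt_by_value = grouped.apply(lambda g: wavg(g, 'wt', 'value'))

# Mean of 'value' weighted by 'wt' (a different quantity)
value_by_wt = grouped.apply(lambda g: wavg(g, 'value', 'wt'))

print(wt_by_value)
print(value_by_wt)
```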
Example 2
  ind  dist  diff  cas
0  la  10.0  0.54  1.0
1   p   5.0  3.20  2.0
2  la   7.0  8.60  3.0
3  la   8.0  7.20  4.0
4   p   7.0  2.10  5.0
5   g   2.0  1.00  6.0
6   g   5.0  3.50  7.0
7  la   3.0  4.50  8.0
df = pd.DataFrame({'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
                   'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
                   'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
                   'cas': [1., 2., 3., 4., 5., 6., 7., 8.]})
Create a weight column (transform is used to get weights normalized within each group):
df['weight'] = df['dist'] / df.groupby('ind')['dist'].transform('sum')
df
  ind  dist  diff  cas    weight
0  la  10.0  0.54  1.0  0.357143
1   p   5.0  3.20  2.0  0.416667
2  la   7.0  8.60  3.0  0.250000
3  la   8.0  7.20  4.0  0.285714
4   p   7.0  2.10  5.0  0.583333
5   g   2.0  1.00  6.0  0.285714
6   g   5.0  3.50  7.0  0.714286
7  la   3.0  4.50  8.0  0.107143
Multiply the values by these weights and sum within each group:
df['wcas'], df['wdiff'] = (df[n] * df['weight'] for n in ('cas', 'diff'))
df.groupby('ind')[['wcas', 'wdiff']].sum()
wcas wdiff
ind
g 6.714286 2.785714
la 3.107143 4.882143
p 3.750000 2.558333
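As a cross-check, the normalized-weight approach should agree with calling np.average on each group, using "dist" as the weights. The sketch below compares the two for the "cas" column. The names via_transform and via_average are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
                   'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
                   'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
                   'cas': [1., 2., 3., 4., 5., 6., 7., 8.]})

# Approach 1: weights normalized within each group, then a grouped sum
df['weight'] = df['dist'] / df.groupby('ind')['dist'].transform('sum')
via_transform = (df['cas'] * df['weight']).groupby(df['ind']).sum()

# Approach 2: np.average on each group with the raw 'dist' values as weights
via_average = pd.Series({
    key: np.average(group['cas'], weights=group['dist'])
    for key, group in df.groupby('ind')
})

print(via_transform)
print(via_average)
print(np.allclose(via_transform.to_numpy(), via_average.to_numpy()))
```

Normalizing first is convenient when the same weights are reused for several columns, as with "cas" and "diff" here.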
An in-place (mutating) variant:
backup = df.copy()  # keep a backup, because df is modified in place below
cols = ['cas', 'diff']  # select by name; positional slicing depends on the column order
df[cols] = df['weight'].values[:, None] * df[cols]
df.groupby('ind')[cols].sum()
cas diff
ind
g 6.714286 2.785714
la 3.107143 4.882143
p 3.750000 2.558333
Example 4 (more intuitive)
df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
                   ('bird', 'Psittaciformes', 24.0),
                   ('mammal', 'Carnivora', 80.2),
                   ('mammal', 'Primates', np.nan),
                   ('mammal', 'Carnivora', 58)],
                  index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
                  columns=('class', 'order', 'max_speed'))
df:
class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
grouped = df.groupby('class')
grouped.sum(numeric_only=True)  # restrict to numeric columns so the string column 'order' is left out
Out:
max_speed
class
bird 413.0
mammal 138.2
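The mammal total of 138.2 shows that the NaN for the monkey is skipped rather than propagated. A small sketch of the same grouping with several aggregations side by side makes this visible; selecting the max_speed column and using agg is an illustrative choice, not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
                   ('bird', 'Psittaciformes', 24.0),
                   ('mammal', 'Carnivora', 80.2),
                   ('mammal', 'Primates', np.nan),
                   ('mammal', 'Carnivora', 58)],
                  index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
                  columns=['class', 'order', 'max_speed'])

# Aggregate one numeric column per class; NaN values are ignored,
# so 'count' reports 2 mammals with a known speed, not 3
stats = df.groupby('class')['max_speed'].agg(['sum', 'mean', 'count'])
print(stats)
```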
Example 5
df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
'size': list('SSMMMLL'),
'weight': [8, 10, 11, 1, 20, 12, 12],
'adult': [False] * 5 + [True] * 2})
df:
animal size weight adult
0 cat S 8 False
1 dog S 10 False
2 cat M 11 False
3 fish M 1 False
4 dog M 20 False
5 cat L 12 True
6 cat L 12 True
List the size of the animals with the highest weight.
df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out:
animal
cat L
dog M
fish M
dtype: object
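The same answer can be reached without apply: find the row label of the heaviest animal in each group with idxmax, then look those rows up with loc. This is a sketch of an alternative, not the approach used above.

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# Row label of the heaviest animal within each group
# (on ties, idxmax returns the first occurrence)
heaviest_rows = df.groupby('animal')['weight'].idxmax()

# Look up those rows and keep the size, indexed by animal
result = df.loc[heaviest_rows.to_numpy(), ['animal', 'size']].set_index('animal')['size']
print(result)
```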
Other references:
Understanding pandas transform
https://www.jianshu.com/p/20f15354aedd
https://www.jianshu.com/p/509d7b97088c
https://zhuanlan.zhihu.com/p/86350553
http://www.zyiz.net/tech/detail-136539.html
pandas: performance comparison of the apply and transform methods
https://www.cnblogs.com/wkang/p/9794678.html
https://zhuanlan.zhihu.com/p/101284491?utm_source=wechat_session
https://www.cnblogs.com/bjwu/p/8970818.html
https://www.jianshu.com/p/42f1d2909bb6
Examples from the official documentation
https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-grouping
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.transform.html
Getting the weighted mean and standard deviation of several columns in pandas
https://xbuba.com/questions/48307663
Weighted averages in pandas, which I bet you don't know how to use!
https://blog.csdn.net/ddxygq/article/details/101351686