Wilson Confidence Interval

Reposted article, 2020-07-21 13:58. Category: data mining.

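The normal-approximation interval is unreliable for small samples. In 1927 the American mathematician Edwin Bidwell Wilson therefore proposed a corrected formula, known as the "Wilson interval", which is considerably more accurate when the sample is small.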

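By the definitions of the mean and variance of a discrete random variable, for a variable X that equals 1 with probability p and 0 with probability 1 − p:
$$\mu = E(X) = 0 \cdot (1-p) + 1 \cdot p = p$$
$$\sigma^2 = D(X) = (0 - E(X))^2 (1-p) + (1 - E(X))^2 p = p^2(1-p) + (1-p)^2 p = p^2 - p^3 + p^3 - 2p^2 + p = p - p^2 = p(1-p)$$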
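The Wilson interval formula can therefore be written in terms of p alone, because the variance of each observation is p(1 − p).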
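
For reference, here is a sketch of the Wilson score interval in its commonly stated form (for example on the Wikipedia page on binomial proportion confidence intervals). The symbols are notation assumed here rather than taken from the article: $\hat{p}$ is the observed positive ratio, $n$ the sample size, and $z$ the quantile of the standard normal distribution.

$$\frac{\hat{p} + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$

Taking the minus sign gives the lower bound of the interval, which is the quantity the code in this article computes as the ranking score.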

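Code:

import numpy as np

def wilson_score(pos, total, p_z=2.):
    """
    Wilson score: the lower bound of the Wilson confidence interval.
    Reference: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
    :param pos: number of positive samples
    :param total: total number of samples (must be greater than 0)
    :param p_z: quantile of the standard normal distribution (2.0 is roughly a 95% interval)
    :return: Wilson score
    """
    pos_rat = pos * 1. / total  # ratio of positive samples
    score = (pos_rat + (np.square(p_z) / (2. * total))
             - ((p_z / (2. * total)) * np.sqrt(4. * total * (1. - pos_rat) * pos_rat + np.square(p_z)))) / \
            (1. + np.square(p_z) / total)
    return score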
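
To make the small-sample behaviour concrete, the following is a minimal sketch that returns both bounds of the interval using only the standard library. The name `wilson_interval` and the sample vote counts are hypothetical, chosen for illustration. It assumes the standard form of the interval from the Wikipedia page cited in the docstring above.

```python
import math

def wilson_interval(pos, total, z=1.96):
    """Return (lower, upper) bounds of the Wilson score interval.

    Hypothetical helper for illustration; `pos` positives out of `total` trials.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = pos / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2.0 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / total
                               + z * z / (4.0 * total * total)) / denom
    return center - half_width, center + half_width

# One positive vote out of one: the raw ratio is 100%, but the lower bound is low.
single_vote = wilson_interval(1, 1, z=2.0)
# Ninety positive votes out of a hundred: raw ratio 90%, much higher lower bound.
many_votes = wilson_interval(90, 100, z=2.0)
print(single_vote, many_votes)
```

With z = 2 the single vote gets a lower bound of 0.2, while 90 out of 100 gets roughly 0.82, so the item with more evidence ranks higher even though its raw ratio is lower.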

SQL implementation:

# Wilson score: lower bound of the 95% confidence interval (z = 1.96)
SELECT widget_id, ((positive + 1.9208) / (positive + negative) - 
                   1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) / 
                          (positive + negative)) / (1 + 3.8416 / (positive + negative)) 
       AS ci_lower_bound FROM widgets WHERE positive + negative > 0 
       ORDER BY ci_lower_bound DESC;

# Net positive ratings (positive minus negative)
SELECT widget_id, (positive - negative) 
       AS net_positive_ratings FROM widgets ORDER BY net_positive_ratings DESC;

# Average rating (fraction of ratings that are positive)
SELECT widget_id, positive / (positive + negative) 
       AS average_rating FROM widgets WHERE positive + negative > 0 
       ORDER BY average_rating DESC;
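
The literals 1.9208, 0.9604 and 3.8416 in the first query are z²/2, z²/4 and z² for z = 1.96. The sketch below recomputes them and mirrors the SQL expression in Python. The function names `wilson_constants` and `sql_style_lower_bound` are hypothetical and used only for illustration.

```python
import math

def wilson_constants(z=1.96):
    """Return (z^2/2, z^2/4, z^2), the three literals hard-coded in the SQL."""
    z2 = z * z
    return z2 / 2, z2 / 4, z2

def sql_style_lower_bound(positive, negative, z=1.96):
    """Mirror the SQL expression for ci_lower_bound term by term."""
    n = positive + negative
    if n <= 0:
        raise ValueError("need at least one rating")
    half_z2, quarter_z2, z2 = wilson_constants(z)
    return ((positive + half_z2) / n
            - z * math.sqrt(positive * negative / n + quarter_z2) / n) / (1 + z2 / n)

print([round(c, 4) for c in wilson_constants()])  # [1.9208, 0.9604, 3.8416]
print(sql_style_lower_bound(90, 10))
```

Using a different confidence level only requires changing z; the three constants follow from it.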

Excel implementation:

=IFERROR((([@[Up Votes]] + 1.9208) / ([@[Up Votes]] + [@[Down Votes]]) - 1.96 * 
    SQRT(([@[Up Votes]] *  [@[Down Votes]]) / ([@[Up Votes]] +  [@[Down Votes]]) + 0.9604) / 
    ([@[Up Votes]] +  [@[Down Votes]])) / (1 + 3.8416 / ([@[Up Votes]] +  [@[Down Votes]])),0)

Ranking by Star Ratings

Reddit's story-ranking algorithm is called the hot ranking. Its formula and implementation are as follows:

    log10(max(abs(up - down), 1)) + sign(up - down) * seconds / 45000

#Rewritten code from /r2/r2/lib/db/_sorts.pyx

from datetime import datetime, timedelta
from math import log

epoch = datetime(1970, 1, 1)

def epoch_seconds(date):
    """Returns the number of seconds from the epoch to date."""
    td = date - epoch
    return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)

def score(ups, downs):
    return ups - downs

def hot(ups, downs, date):
    """The hot formula. Should match the equivalent function in postgres."""
    s = score(ups, downs)
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = epoch_seconds(date) - 1134028003
    return round(order + sign * seconds / 45000, 7)
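
One way to read the formula: a tenfold increase in net votes is worth exactly 45000 seconds (12.5 hours) of recency. The sketch below checks this with a compact, self-contained re-implementation. The name `hot_score` and the sample dates are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta
from math import log10

REDDIT_OFFSET = 1134028003  # same constant as in the listing above

def hot_score(ups, downs, posted):
    """Compact restatement of the hot formula (illustrative, not Reddit's own code)."""
    s = ups - downs
    order = log10(max(abs(s), 1))
    sign = (s > 0) - (s < 0)  # 1, 0 or -1
    seconds = (posted - datetime(1970, 1, 1)).total_seconds() - REDDIT_OFFSET
    return round(order + sign * seconds / 45000, 7)

older = datetime(2020, 7, 21, 0, 0, 0)
newer = older + timedelta(seconds=45000)  # posted 12.5 hours later

score_older = hot_score(100, 0, older)  # 100 net votes, older post
score_newer = hot_score(10, 0, newer)   # 10 net votes, newer post
print(score_older, score_newer)
```

The two scores agree up to rounding, so a post needs ten times the net votes to keep pace with one submitted 12.5 hours later.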

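The IMDb Top 250 ranks films by a weighted rating (WR) computed as a Bayesian estimate. The formula is:

    weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

which is equivalent to:

    WR = (v / (v+m)) × (R − C) + C

    - WR: the weighted rating.
    - R: the mean of the user votes for the film (rating).
    - v: the number of votes for the film (votes).
    - m: the minimum number of votes required to be listed in the Top 250 (3000 at the time of the original article).
    - C: the mean rating across all films (6.9 at the time of the original article).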
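
The weighted rating can be sketched in a few lines. The function name `weighted_rating` is hypothetical, and the defaults for `m` and `C` are simply the values quoted above, not current IMDb figures.

```python
def weighted_rating(R, v, m=3000, C=6.9):
    """Bayesian weighted rating: pulls R toward the global mean C when v is small.

    R: mean rating of the film, v: its vote count,
    m: vote threshold, C: mean rating across all films.
    """
    if v < 0 or m <= 0:
        raise ValueError("v must be non-negative and m positive")
    return (v / (v + m)) * R + (m / (v + m)) * C

# A film rated 9.0 with few votes stays close to the global mean ...
print(weighted_rating(9.0, 30))
# ... with exactly m votes it sits halfway between R and C ...
print(weighted_rating(9.0, 3000))
# ... and with many votes it approaches its own mean rating.
print(weighted_rating(9.0, 300000))
```

The parameter m acts like m "virtual" votes at the global mean C, so a film needs many real votes before its own average dominates.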

