Zipf's law, Zipfian distribution


Zipf's law (IPA: /ˈzɪf/) is an empirical law published in 1949 by the Harvard University linguist George Kingsley Zipf.

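It can be stated as follows:

In a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table.

So the most frequent word occurs about twice as often as the second most frequent word, which in turn occurs about twice as often as the fourth most frequent word.

The law is used as a reference point for anything related to power-law probability distributions.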

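Examples

The simplest example of Zipf's law is the "1/f function". Given a set of Zipf-distributed frequencies sorted from most common to least common, the second most common item occurs 1/2 as often as the most common one, the third most common occurs 1/3 as often, and the nth most common occurs 1/n as often. This cannot hold exactly, because every item must occur an integer number of times; a word cannot occur 2.5 times.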

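In the Brown corpus, "the", "of", and "and" are the three most frequent words. They occur 69971, 36411, and 28852 times respectively, which is about 7%, 3.6%, and 2.9% of the roughly 1 million words in the corpus, a ratio of approximately 6:3:2. This matches the description in Zipf's law. The 135 most frequent words alone account for half of the Brown corpus.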
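To make the Brown corpus arithmetic concrete, the following minimal sketch compares the three quoted counts with an ideal 1/rank prediction anchored on the count of the top word. The function names `zipf_prediction` and `compare` are hypothetical helpers introduced here for illustration; only the three counts come from the text.

```python
# Counts for the three most frequent Brown corpus words, as quoted in the text.
brown_counts = {"the": 69971, "of": 36411, "and": 28852}


def zipf_prediction(top_count, rank):
    """Ideal 1/rank prediction, anchored on the count of the rank-1 word."""
    return top_count / rank


def compare(counts):
    """Return (word, rank, observed, predicted, observed/predicted) rows."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    top = ordered[0][1]
    rows = []
    for rank, (word, observed) in enumerate(ordered, start=1):
        predicted = zipf_prediction(top, rank)
        rows.append((word, rank, observed, predicted, observed / predicted))
    return rows


rows = compare(brown_counts)
for word, rank, observed, predicted, ratio in rows:
    print(f"{word:>4} rank={rank} observed={observed} "
          f"predicted={predicted:.0f} ratio={ratio:.2f}")
```

A ratio near 1 means the observed count is close to the ideal 1/rank value. The second and third words come out somewhat above the ideal prediction, which is why the 6:3:2 ratio is only approximate.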

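Zipf's law is an empirical law rather than a theoretical one. It can be observed in many rankings outside linguistics, such as city sizes in different countries, company sizes, and income rankings, but its cause is still debated. Zipf's law is easy to see on a scatter plot whose axes are the logarithm of rank and the logarithm of frequency. For example, "the" as described above is the point x = log(1), y = log(69971). If all the points lie close to a straight line, the data follow Zipf's law.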

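Phenomena that follow the law

  • Word frequencies: the law holds not only for a whole corpus but also for a single article
  • Web page visit frequencies
  • City populations
  • Incomes of the top 3% of earners
  • Earthquake magnitudes
  • Fragment sizes when a solid shatters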

====================================

Zipf Distribution

The Zipf distribution, sometimes referred to as the zeta distribution, is a discrete distribution commonly used in linguistics, insurance, and the modelling of rare events. It has probability mass function

 P(x)={\frac {x^{-(\rho +1)}}{\zeta (\rho +1)}},

where \rho is a positive parameter and \zeta (z) is the Riemann zeta function, and distribution function

 D(x)={\frac {H_{x,\rho +1}}{\zeta (\rho +1)}},

where H_{n,r} is a generalized harmonic number.
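The two formulas above can be sketched in plain Python. This is a minimal illustration, not a library-grade implementation: the helper names `zeta`, `zipf_pmf`, and `zipf_cdf` are hypothetical, and the zeta function is approximated by a partial sum plus an Euler-Maclaurin tail correction, which is an assumption of this sketch rather than anything taken from the source.

```python
def zeta(s, terms=10_000):
    """Approximate the Riemann zeta function for s > 1.

    Uses the partial sum over k < terms plus the leading Euler-Maclaurin
    tail terms (integral of the tail and half of the first omitted term).
    """
    if s <= 1:
        raise ValueError("this approximation needs s > 1")
    head = sum(k ** -s for k in range(1, terms))
    tail = terms ** (1 - s) / (s - 1) + 0.5 * terms ** -s
    return head + tail


def zipf_pmf(x, rho):
    """P(x) = x^-(rho+1) / zeta(rho+1) for integer x >= 1."""
    return x ** -(rho + 1) / zeta(rho + 1)


def zipf_cdf(x, rho):
    """D(x) = H_{x,rho+1} / zeta(rho+1), H being a generalized harmonic number."""
    harmonic = sum(k ** -(rho + 1) for k in range(1, x + 1))
    return harmonic / zeta(rho + 1)


if __name__ == "__main__":
    for x in (1, 2, 3, 10):
        print(x, zipf_pmf(x, rho=1), zipf_cdf(x, rho=1))
```

Each call recomputes the normalizing constant for clarity. In real use it would be cached, or the Wolfram Language function named below could be used instead.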

The Zipf distribution is implemented in the Wolfram Language as ZipfDistribution[rho].

The nth raw moment is

 \mu '_{n}={\frac {\zeta (\rho +1-n)}{\zeta (\rho +1)}},

giving the mean and variance as

 \mu ={\frac {\zeta (\rho )}{\zeta (\rho +1)}}

 \sigma ^{2}={\frac {\zeta (\rho -1)}{\zeta (\rho +1)}}-{\frac {[\zeta (\rho )]^{2}}{[\zeta (\rho +1)]^{2}}}.
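As a sanity check on the mean and variance, the sketch below evaluates the closed forms and compares them with brute-force sums over the probability mass function. All function names are hypothetical, and `zeta` is the same assumed Euler-Maclaurin approximation rather than a library routine. The sketch also assumes rho > 1 for the mean and rho > 2 for the variance, since the zeta function in each formula is finite only when its argument exceeds 1.

```python
def zeta(s, terms=10_000):
    """Approximate zeta(s) for s > 1: partial sum plus Euler-Maclaurin tail."""
    if s <= 1:
        raise ValueError("this approximation needs s > 1")
    head = sum(k ** -s for k in range(1, terms))
    return head + terms ** (1 - s) / (s - 1) + 0.5 * terms ** -s


def zipf_mean(rho):
    """mu = zeta(rho) / zeta(rho + 1); finite only for rho > 1."""
    return zeta(rho) / zeta(rho + 1)


def zipf_variance(rho):
    """sigma^2 = zeta(rho-1)/zeta(rho+1) - (zeta(rho)/zeta(rho+1))^2; needs rho > 2."""
    z = zeta(rho + 1)
    return zeta(rho - 1) / z - (zeta(rho) / z) ** 2


def brute_force_moments(rho, cutoff=100_000):
    """Mean and variance from truncated sums over x = 1..cutoff."""
    z = zeta(rho + 1)
    m1 = sum(x * x ** -(rho + 1) for x in range(1, cutoff + 1)) / z
    m2 = sum(x * x * x ** -(rho + 1) for x in range(1, cutoff + 1)) / z
    return m1, m2 - m1 ** 2


if __name__ == "__main__":
    print(zipf_mean(3), zipf_variance(3))
    print(brute_force_moments(3))
```

The brute-force variance converges slowly because its tail decays only like 1/cutoff, so agreement to a few decimal places is all that should be expected.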
 

The distribution has mean deviation

 MD={\frac {2[\zeta (\rho +1)\,\zeta (\rho ,\lfloor \mu \rfloor +1)-\zeta (\rho )\,\zeta (\rho +1,\lfloor \mu \rfloor +1)]}{\zeta ^{2}(\rho +1)}},

where \zeta (z,s) is a Hurwitz zeta function, \lfloor \mu \rfloor is the floor of \mu, and \mu is the mean given above.

SEE ALSO: Zipf's Law

 

CITE THIS AS: Weisstein, Eric W. "Zipf Distribution." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/ZipfDistribution.html

Zipf's Law

In the English language, the probability of encountering the rth most common word is given roughly by P(r)=0.1/r for r up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges. Pierce's (1980, p. 87) statement that \sum P(r)>1 for r=8727 is incorrect. Goetz states the law as follows: The frequency of a word is inversely proportional to its statistical rank r such that

 P(r)\approx {\frac {1}{r\ln(1.78R)}},

where R is the number of different words.
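Both statements above lend themselves to a quick numerical check. The sketch below sums P(r) = 0.1/r to see where the total passes 1, and sums Goetz's form over all R ranks to see how close it comes to a proper probability distribution. The function names are hypothetical, and R = 10000 is an arbitrary value chosen for illustration.

```python
import math


def partial_sum(n, scale=0.1):
    """Sum of P(r) = scale / r for r = 1..n."""
    return sum(scale / r for r in range(1, n + 1))


def first_rank_exceeding_one(scale=0.1):
    """Smallest n for which the cumulative sum of scale / r exceeds 1."""
    cumulative, rank = 0.0, 0
    while cumulative <= 1.0:
        rank += 1
        cumulative += scale / rank
    return rank


def goetz_probability(r, R):
    """Goetz's form: P(r) ~ 1 / (r * ln(1.78 * R)), R = number of distinct words."""
    return 1.0 / (r * math.log(1.78 * R))


def goetz_total(R):
    """Total probability over ranks 1..R under Goetz's form."""
    return sum(goetz_probability(r, R) for r in range(1, R + 1))


if __name__ == "__main__":
    print(partial_sum(8727))           # still below 1 at r = 8727
    print(first_rank_exceeding_one())  # first rank where the sum passes 1
    print(goetz_total(10_000))         # close to 1
```

With P(r) = 0.1/r the cumulative sum is still below 1 at r = 8727 and first exceeds 1 at r = 12367. Goetz's normalization works because 1.78 is close to e^gamma, where gamma is the Euler-Mascheroni constant, so ln(1.78R) approximates the Rth harmonic number.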

Theoretical review

Zipf's law is most easily observed by plotting the data on a log-log graph, with the axes being log (rank order) and log (frequency). For example, the word "the" (as described above) would appear at x = log(1), y = log(69971). It is also possible to plot reciprocal rank against frequency or reciprocal frequency or interword interval against rank.[1] The data conform to Zipf's law to the extent that the plot is linear.
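The log-log check can be sketched without a plotting library by fitting a least-squares line to log(frequency) against log(rank). The function `loglog_fit` is a hypothetical helper, and the 200-point data set is synthetic (exact 1/rank counts anchored at 69971), made up purely for illustration. Only the three Brown corpus counts come from the text.

```python
import math


def loglog_fit(frequencies):
    """Least-squares slope and intercept of log(frequency) against log(rank)."""
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (an assumption of this sketch): ideal 1/rank frequencies.
synthetic = [69971 / rank for rank in range(1, 201)]
synthetic_slope, synthetic_intercept = loglog_fit(synthetic)

# The three Brown corpus counts quoted earlier in the document.
brown_slope, brown_intercept = loglog_fit([69971, 36411, 28852])

print(synthetic_slope, synthetic_intercept)
print(brown_slope, brown_intercept)
```

For ideal data the fitted slope is -1 and the intercept is log(69971), the point where "the" sits on the plot. Three points are far too few for a meaningful fit, so the Brown slope only shows how the function is called.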

Formally, let:

  • N be the number of elements;
  • k be their rank;
  • s be the value of the exponent characterizing the distribution.

Zipf's law then predicts that out of a population of N elements, the frequency of elements of rank k, f(k;s,N), is:

    • f(k;s,N)={\frac {1/k^{s}}{\sum _{n=1}^{N}(1/n^{s})}}
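The formula translates directly into code. The sketch below is a minimal, unoptimized version. The function name `zipf_frequency` is hypothetical, and the values s = 1 and N = 10 in the usage lines are arbitrary choices for illustration.

```python
def zipf_frequency(k, s, N):
    """f(k; s, N) = (1 / k^s) / sum_{n=1..N} (1 / n^s) for rank 1 <= k <= N."""
    if not 1 <= k <= N:
        raise ValueError("rank k must satisfy 1 <= k <= N")
    normaliser = sum(1 / n ** s for n in range(1, N + 1))
    return (1 / k ** s) / normaliser


if __name__ == "__main__":
    N, s = 10, 1
    frequencies = [zipf_frequency(k, s, N) for k in range(1, N + 1)]
    print(frequencies)
    print(sum(frequencies))  # the predicted frequencies sum to 1
```

With s = 1 the rank-1 frequency is twice the rank-2 frequency, which is the inverse-rank relationship stated at the start of the document. Larger s makes the distribution fall off faster.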

