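Zipf's law (pronounced /ˈzɪf/) is an empirical law published in 1949 by George Kingsley Zipf, a linguist at Harvard University.
It can be stated as follows:
In a corpus of natural language, the frequency of a word is inversely proportional to its rank in the frequency table.
Thus the most frequent word occurs about twice as often as the second most frequent word,
which in turn occurs about twice as often as the fourth most frequent word.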
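Example
The simplest example of Zipf's law is the "1/f function". Given a set of Zipf-distributed frequencies sorted from most common to least common, the second most common item occurs 1/2 as often as the most common one, the third most common occurs 1/3 as often, and the nth most common occurs 1/n as often. This cannot hold exactly, however, because every item must occur an integer number of times; a word cannot occur 2.5 times.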
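In the Brown Corpus, "the", "of", and "and" are the three most frequent words, occurring 69971, 36411, and 28852 times respectively. These counts are roughly 7%, 3.6%, and 2.9% of the corpus's approximately one million words, which is close to the 6:3:2 ratio (that is, 1 : 1/2 : 1/3) described by Zipf's law. The 135 most frequent words alone account for half of the Brown Corpus.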
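The arithmetic above can be checked with a short sketch. It compares the three quoted Brown Corpus counts with the idealized prediction (top count divided by rank). The function name `zipf_comparison` is hypothetical, and the corpus size is assumed to be exactly 1,000,000 words, although the text only gives it approximately.

```python
# Counts for the three most frequent Brown Corpus words, as quoted in the text.
brown_counts = [("the", 69971), ("of", 36411), ("and", 28852)]

# Assumption: the "about one million words" figure is treated as exactly 1,000,000.
CORPUS_SIZE = 1_000_000


def zipf_comparison(counts, corpus_size):
    """Return (word, rank, observed, predicted, share_percent) per word.

    `predicted` is the idealized Zipf count: the top count divided by rank.
    """
    top = counts[0][1]
    rows = []
    for rank, (word, observed) in enumerate(counts, start=1):
        predicted = top / rank
        share = 100.0 * observed / corpus_size
        rows.append((word, rank, observed, predicted, share))
    return rows


rows = zipf_comparison(brown_counts, CORPUS_SIZE)
for word, rank, observed, predicted, share in rows:
    print(f"{word:>4} rank={rank} observed={observed} "
          f"predicted={predicted:.1f} share={share:.2f}%")
```

The predicted count for rank 2 is 34985.5, which also illustrates why the 1/n rule can only hold approximately for integer counts.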
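Zipf's law is an empirical law rather than a theoretical one. It can be observed in many rankings outside linguistics, such as the populations of cities in different countries, company sizes, and income rankings, but its cause remains a matter of debate. Zipf's law is easy to observe on a scatter plot whose axes are the logarithm of rank and the logarithm of frequency. For example, "the", as described above, appears at the point x = log(1), y = log(69971). If all the points lie close to a straight line, the data follow Zipf's law.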
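The straight-line check on log-log axes can be sketched numerically instead of graphically. The following is a minimal sketch, assuming an ordinary least-squares fit of log(frequency) against log(rank); the function name `loglog_slope` is hypothetical, and the only real data used are the three Brown Corpus counts quoted earlier.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    `counts` must be sorted from most to least frequent; ranks start at 1.
    At least two counts are required.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Idealized data: frequency exactly proportional to 1/rank (synthetic).
ideal = [1000.0 / rank for rank in range(1, 51)]

# Top three Brown Corpus counts quoted in the text.
brown_top3 = [69971, 36411, 28852]

print(loglog_slope(ideal))
print(loglog_slope(brown_top3))
```

A slope near -1 corresponds to the idealized inverse-proportional form. Three points are far too few for a meaningful fit, so the Brown Corpus call only illustrates the mechanics.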
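Phenomena that follow the law
- Word frequencies: the law applies not only to a corpus as a whole but also to a single article
- Web page access frequencies
- City populations
- Incomes of the top 3% of earners
- Earthquake magnitudes
- Fragment sizes when a solid shatters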
====================================
Zipf Distribution
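The Zipf distribution, sometimes referred to as the zeta distribution, is a discrete distribution commonly used in linguistics, insurance, and the modelling of rare events. It has probability mass function

$$P(k) = \frac{k^{-(\rho+1)}}{\zeta(\rho+1)}, \qquad k = 1, 2, \ldots \tag{1}$$

where $\rho$ is a positive parameter and $\zeta(s)$ is the Riemann zeta function, and distribution function

$$D(k) = \frac{H_{k,\rho+1}}{\zeta(\rho+1)}, \tag{2}$$

where $H_{k,r} = \sum_{j=1}^{k} j^{-r}$ is a generalized harmonic number.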
The Zipf distribution is implemented in the Wolfram Language as ZipfDistribution[rho].
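For readers without the Wolfram Language, the mass function and distribution function can be sketched in plain Python. This is a minimal sketch, assuming the standard zeta-distribution form $P(k) = k^{-(\rho+1)}/\zeta(\rho+1)$; the helper names `riemann_zeta`, `zipf_pmf`, and `zipf_cdf` are hypothetical, and the zeta function is approximated by a partial sum plus an Euler-Maclaurin tail correction rather than taken from a library.

```python
import math


def riemann_zeta(s, terms=100):
    """Approximate zeta(s) for s > 1.

    Sums the first `terms - 1` terms directly and adds an Euler-Maclaurin
    estimate of the remaining tail.
    """
    head = sum(k ** -s for k in range(1, terms))
    n = float(terms)
    tail = n ** (1 - s) / (s - 1) + 0.5 * n ** -s + s * n ** (-s - 1) / 12
    return head + tail


def zipf_pmf(k, rho):
    """P(k) = k^-(rho+1) / zeta(rho+1) for integer k >= 1 and rho > 0."""
    return k ** -(rho + 1) / riemann_zeta(rho + 1)


def zipf_cdf(k, rho):
    """D(k) = H_{k, rho+1} / zeta(rho+1), with H a generalized harmonic number."""
    harmonic = sum(j ** -(rho + 1) for j in range(1, k + 1))
    return harmonic / riemann_zeta(rho + 1)


print(zipf_pmf(1, 1.0), zipf_cdf(10, 1.0))
```

With $\rho = 1$ the normalizing constant is $\zeta(2) = \pi^2/6$, which gives a convenient sanity check on the approximation.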
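The $n$th raw moment is

$$\mu'_n = \frac{\zeta(\rho+1-n)}{\zeta(\rho+1)}, \tag{3}$$

giving the mean and variance as

$$\mu = \frac{\zeta(\rho)}{\zeta(\rho+1)} \tag{4}$$

$$\sigma^2 = \frac{\zeta(\rho+1)\,\zeta(\rho-1) - \zeta(\rho)^2}{\zeta(\rho+1)^2}. \tag{5}$$

The distribution has mean deviation

$$\mathrm{MD} = \frac{2\left[\zeta(\rho, \lceil\mu\rceil) - \mu\,\zeta(\rho+1, \lceil\mu\rceil)\right]}{\zeta(\rho+1)}, \tag{6}$$

where $\zeta(s, a)$ is a Hurwitz zeta function and $\mu$ is the mean as given above in equation (4).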
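The closed forms for the mean, variance, and mean deviation can be cross-checked against direct summation over the mass function. The sketch below assumes the standard expressions $\mu = \zeta(\rho)/\zeta(\rho+1)$, $\sigma^2 = \zeta(\rho-1)/\zeta(\rho+1) - \mu^2$, and $\mathrm{MD} = 2[\zeta(\rho, \lceil\mu\rceil) - \mu\,\zeta(\rho+1, \lceil\mu\rceil)]/\zeta(\rho+1)$. All function names are hypothetical, and the Hurwitz zeta function is approximated with a partial sum plus an Euler-Maclaurin tail rather than a library call.

```python
import math


def hurwitz_zeta(s, a, terms=100):
    """Approximate sum over n >= 0 of (n + a)^-s for s > 1 and a > 0."""
    head = sum((n + a) ** -s for n in range(terms))
    x = float(terms + a)
    tail = x ** (1 - s) / (s - 1) + 0.5 * x ** -s + s * x ** (-s - 1) / 12
    return head + tail


def zeta(s):
    """Riemann zeta as the special case a = 1 of the Hurwitz zeta function."""
    return hurwitz_zeta(s, 1)


def closed_form_stats(rho):
    """Mean, variance, and mean deviation from the closed forms (rho > 2)."""
    z1 = zeta(rho + 1)
    mean = zeta(rho) / z1
    variance = zeta(rho - 1) / z1 - mean ** 2
    m = math.ceil(mean)
    mean_dev = 2 * (hurwitz_zeta(rho, m) - mean * hurwitz_zeta(rho + 1, m)) / z1
    return mean, variance, mean_dev


def brute_force_stats(rho, cutoff=20000):
    """Same statistics by summing over k = 1..cutoff (truncated support)."""
    z1 = zeta(rho + 1)
    pmf = [k ** -(rho + 1) / z1 for k in range(1, cutoff + 1)]
    mean = sum(k * p for k, p in enumerate(pmf, start=1))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf, start=1))
    mean_dev = sum(abs(k - mean) * p for k, p in enumerate(pmf, start=1))
    return mean, variance, mean_dev


closed = closed_form_stats(4)
brute = brute_force_stats(4)
print(closed)
print(brute)
```

The value $\rho = 4$ is chosen only because the truncated sums converge quickly there; the variance formula needs $\rho > 2$, and the truncation error grows as $\rho$ approaches that bound.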
SEE ALSO: Zipf's Law
CITE THIS AS: Weisstein, Eric W. "Zipf Distribution." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/ZipfDistribution.html
Zipf's Law
In the English language, the probability of encountering the $r$th most common word is given roughly by

$$P(r) = \frac{0.1}{r}$$

for $r$ up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges. Pierce's (1980, p. 87) statement that $\sum P(r) > 1$ for $r = 8727$ is incorrect. Goetz states the law as follows: The frequency of a word is inversely proportional to its statistical rank $r$ such that

$$P(r) \approx \frac{1}{r \ln(1.78 R)},$$

where $R$ is the number of different words.
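Both statements can be probed numerically. The sketch below assumes the usual forms of the two laws, $P(r) = 0.1/r$ and $P(r) \approx 1/(r \ln(1.78R))$; the function names are hypothetical. The first function finds the rank at which the cumulative probability under $0.1/r$ first exceeds 1, which is where that form stops being a valid probability assignment. The second sums Goetz's form over all $R$ ranks to see how close the total is to 1.

```python
import math


def first_rank_exceeding_unit_mass(c=0.1):
    """Smallest r such that the sum of c/i for i = 1..r exceeds 1."""
    total = 0.0
    r = 0
    while total <= 1.0:
        r += 1
        total += c / r
    return r


def goetz_total(num_words):
    """Sum of 1 / (r * ln(1.78 * R)) over r = 1..R, with R = num_words."""
    norm = math.log(1.78 * num_words)
    return sum(1.0 / (r * norm) for r in range(1, num_words + 1))


print(first_rank_exceeding_unit_mass())
print(goetz_total(10000))
```

The first call returns 12367, the first index at which the harmonic number exceeds 10. The constant 1.78 is close to $e^{\gamma}$, where $\gamma$ is the Euler-Mascheroni constant, which is why Goetz's total stays near 1 for large $R$.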
Theoretical review
Zipf's law is most easily observed by plotting the data on a log-log graph, with the axes being log (rank order) and log (frequency). For example, the word "the" (as described above) would appear at x = log(1), y = log(69971). It is also possible to plot reciprocal rank against frequency or reciprocal frequency or interword interval against rank.[1] The data conform to Zipf's law to the extent that the plot is linear.
Formally, let:
- N be the number of elements;
- k be their rank;
- s be the value of the exponent characterizing the distribution.
Zipf's law then predicts that out of a population of N elements, the frequency of elements of rank k, f(k;s,N), is:

$$f(k; s, N) = \frac{1/k^s}{\sum_{n=1}^{N} 1/n^s}$$
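As a minimal sketch of this prediction, assuming the usual normalized form in which the frequency of rank k is 1/k^s divided by the sum of 1/n^s over n = 1..N, the function below evaluates f(k;s,N) directly. The name `zipf_frequency` is hypothetical.

```python
def zipf_frequency(k, s, N):
    """Normalized Zipf frequency f(k; s, N) for rank k out of N elements.

    Computes (1 / k**s) divided by the sum of 1 / n**s for n = 1..N.
    """
    if not 1 <= k <= N:
        raise ValueError("rank k must satisfy 1 <= k <= N")
    normalizer = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normalizer


# With s = 1 and N = 3 the frequencies are in the ratio 1 : 1/2 : 1/3.
frequencies = [zipf_frequency(k, 1, 3) for k in (1, 2, 3)]
print(frequencies)
```

Because of the normalizer, the frequencies over all N ranks sum to 1, and with s = 1 the rank-1 frequency is twice the rank-2 frequency, matching the verbal statement of the law.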

