
PHYSICAL REVIEW E VOLUME 54, NUMBER 1 JULY 1996

Zipf’s law, the central limit theorem, and the random division of the unit interval
Richard Perline*
Flexible Logic Software, 34-50 80th Street, Suite 22, Queens, New York 11372
(Received 30 August 1995)

*Electronic address: rkper@acm.org
It is shown that a version of Mandelbrot's monkey-at-the-typewriter model of Zipf's inverse power law is directly related to two classical areas in probability theory: the central limit theorem and the "broken stick" problem, i.e., the random division of the unit interval. The connection to the central limit theorem is proved using a theorem on randomly indexed sums of random variables [A. Gut, Stopped Random Walks: Limit Theorems and Applications (Springer, New York, 1987)]. This reveals an underlying log-normal structure of pseudoword probabilities with an inverse power upper tail that clarifies a point of confusion in Mandelbrot's work. An explicit asymptotic formula for the slope of the log-linear rank-size law in the upper tail of this distribution is also obtained. This formula relates to known asymptotic results concerning the random division of the unit interval that imply a slope value approaching −1 under quite general conditions. The role of size-biased sampling in obscuring the bottom part of the distribution is explained and connections to related work are noted. [S1063-651X(96)01007-0]

PACS number(s): 05.40.+j, 02.50.−r, 87.10.+e

I. INTRODUCTION

Mandelbrot's creative and influential work on Zipf's inverse power law of word frequency distributions contains two potential points of confusion. One of these concerns what Mandelbrot [1] now acknowledges as the "linguistic shallowness" of the law. The possible confusion here stems from the fact that Mandelbrot [2] has shown that an inverse power distribution of pseudoword frequencies can be generated from random text, although in contrast his earlier work [3,4] aimed at information-theoretic models that might shed light on deeper properties of language or thought. A second point of confusion concerns Mandelbrot's [5, see pp. 209–211] extended argument opposing the log-normal distribution as a model for word frequency distributions, although a version of his random text model has a natural underlying log-normal structure. The first issue, linguistic shallowness, was addressed decades ago by the prominent language researchers Miller and Chomsky [6,7]. It is the purpose of this article to address the second issue, log-normal structure, by proving the previously unrecognized—but direct—connection between a version of Mandelbrot's monkey-at-the-typewriter model and a special case of Anscombe's [8] important generalization of the central limit theorem. We also show that the upper tail of this log-normal structure is an approximate inverse power law and discuss asymptotic results from the classical "broken stick" problem that explain why a log-log rank-size plot of the upper tail will tend to have a slope close to −1 under quite general conditions.

The confusion on both points—linguistic shallowness and log-normal structure—is intertwined. Mathematically oriented researchers in psychology and linguistics were at first quite interested in Mandelbrot's use of information theory to derive Zipf's law, but then also realized the significance of his derivation of an inverse power law using nothing but a Markov random text model. Miller [6] and Miller and Chomsky [7] greatly clarified the situation by discussing an illustrative, special case of Mandelbrot's model. Miller [6] showed that one can straightforwardly derive a step function approximating an inverse power law with the simplifying assumptions of equiprobable and independently combined letters. He concludes with the clear words—which are implied, but not stated, in the Appendix to [2]—that Zipf's law can be derived "without appeal to least effort, least cost, maximal information, or any branch of the calculus of variations."

Physical scientists interested in Zipf's law through Mandelbrot's work may be unaware of Miller's clarification, probably because it was published in the psychology literature. For example, Li [9] writes that "Miller did not give a proof of his statement"—referring to a comment by Miller [10] that Zipf's law could be generated by random text—and then gives the same proof of the special equiprobable-letters case given by Miller [6] and Miller and Chomsky [7]. And recently Mantegna et al. [11] have come under criticism (see [12]) for arguing that noncoding DNA sequences may be transmitting biological information, based on an analysis which they assert shows that noncoding DNA pseudoword frequencies conform approximately to a Zipf-like law. They might have been less likely to make such an argument had they been familiar with Miller's clarification of Mandelbrot's results.

II. EXHIBITING LOG-NORMAL STRUCTURE

Unfortunately, Miller's simplified model is degenerate in the sense that it does not reveal the natural log-normal structure of word probabilities that exists for the case of unequal letter probabilities. For this case, Mandelbrot [5, p. 210] did note that the logarithmic probabilities of randomly generated pseudowords of a fixed number n of letters will be approximately normal; however, he overlooked a stronger result, following directly from Anscombe's theorem on random sums [8], which shows that the logarithmic probabilities of all words of n or fewer letters will be approximately normal. Therefore the probabilities themselves will be approximately log-normal.




Mandelbrot's derivation of Zipf's law from his Markov random text model is based on the ensemble of all the word probabilities that can be generated by random sequences of all possible lengths, i.e., an infinite vocabulary. For proving the log-normal structure, our analysis will be based on a scheme assuming (1) unequal and independent letter probabilities (i.e., we drop the Markov assumption) and (2) a maximum word length of n letters, although we study asymptotic behavior as n→∞.

Assume an alphabet consisting of K ≥ 2 nonspace characters L_1, L_2, ..., L_K and the space character L_{K+1}. Let the letter keys be struck independently with probabilities a_1, a_2, ..., a_{K+1}, where Σ_{j=1}^{K+1} a_j = 1 and 0 < a_j < 1 for 1 ≤ j ≤ K+1. For the purposes of proving the log-normal structure of the word probabilities, it is necessary to assume that a_i ≠ a_j for at least one case where i ≠ j and i, j ≤ K. Choose an integer n and define a trial as the outcome after n+1 or fewer letters have been selected. The outcome will result in either (a) a "word" consisting of n or fewer nonspace characters plus the ending space character or (b) a "nonword" consisting of n+1 consecutive nonspace characters with no ending space character. As many trials are run as desired, but each is limited to a maximum of n+1 characters and ends when either outcome (a) or (b) occurs.

A finite vocabulary model introduces the difficulty of the K^{n+1} nonwords, which have a positive aggregate probability of occurrence on each trial of (Σ_{j=1}^{K} a_j)^{n+1} = (1 − a_{K+1})^{n+1}. We exclude the nonwords and analyze the distributional characteristics of the probabilities for the N_n = Σ_{j=0}^{n} K^j "legitimate" words.

It simplifies matters to factor out a_{K+1} and refer to the resulting values as base values. The largest base value is always equal to 1, corresponding to the word of 0 nonspace letters. Let B_j denote the multiset (i.e., a set that can contain repetitions of its elements) of the K^j base values for each of the words of exactly j letters. Let U_n = B_0 ∪ ⋯ ∪ B_n be the multiset of all base values for words from 0 to n letters. A generic element of the multisets B_j and U_n will be written b. Write the ranked values of U_n as b_r, where r represents rank from the top. Define a probability space on U_n by the natural counting measure, i.e., each element b ∈ U_n is assigned an atom of probability equal to 1/N_n. Denote the random variable defined in this way as Y_n.

By this definition, Y_n can be represented as the product of a random number R_n of independent and identically distributed random variables: Y_n = X_1 X_2 ⋯ X_{R_n}, where 0 ≤ R_n ≤ n. Each X_i is a multinomial random variable taking on the values a_j, 1 ≤ j ≤ K, with equal probability. The case R_n = 0, when Y_n = 1, corresponds to the single 0-letter word. Since there are K^j words of exactly j letters, the probability that Y_n will have the representation Y_n = X_1 X_2 ⋯ X_j involving exactly j factors is K^j/N_n. Therefore, letting P{·} be the probability of the expression inside the braces, the probability mass function of R_n is P{R_n = j} = K^j/N_n for j = 0, 1, ..., n.

Take logarithms (base-K logarithms prove convenient; ln will be used for natural logarithms) to obtain log_K Y_n = Σ_{j=1}^{R_n} log_K X_j. So log_K Y_n is the sum of a random number of independent and identically distributed random variables with finite variance. Anscombe's theorem [8] assures the asymptotic normality of log_K Y_n if it can be shown that the ratio R_n/n converges in probability to a constant c > 0 as n→∞. Since P{R_n = n − j} = K^{n−j}/N_n > (K−1)/K^{j+1}, it is not difficult to show that P{|R_n/n − 1| ≤ j/n} > 1 − 1/K^{j+1}. Letting j = [√n], the largest integer in √n, and n→∞ proves that R_n/n converges in probability to c = 1.0. Consequently, log_K Y_n is asymptotically normal and so Y_n will be approximately log-normal for sufficiently large n.
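
As a concrete illustration of this construction, the following minimal Python sketch (not from the paper; the alphabet size, maximum word length, and letter probabilities are arbitrary choices) enumerates the full multiset U_n and computes the skewness of log_K Y_n under the counting measure. A skewness near zero is consistent with the approximate normality just derived.

    import itertools
    import math

    K, n = 4, 6                                # alphabet size, maximum word length (illustrative)
    a = [0.05, 0.15, 0.30, 0.40]               # unequal nonspace letter probabilities, sum < 1

    logs = []                                  # log_K of every base value in U_n
    for j in range(n + 1):                     # words of exactly j nonspace letters
        for word in itertools.product(a, repeat=j):
            logs.append(math.log(math.prod(word), K))   # base value = product of letter probs

    N_n = len(logs)                            # N_n = sum_{j=0}^{n} K^j = 5461 here
    mean = sum(logs) / N_n
    var = sum((x - mean) ** 2 for x in logs) / N_n
    skew = sum((x - mean) ** 3 for x in logs) / (N_n * var ** 1.5)
    print(N_n, mean, var, skew)                # skewness near 0 indicates near-normality
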
FIG. 1. Log-normal probability plot of base values for the random text model. The approximate linearity of the curve shows the approximate log-normality of the values.

The approximate log-normality of Y_n is illustrated in the approximately linear log-normal probability plot of Fig. 1, where we take n = 5, K = 10, and use nonspace letter probabilities a_1 through a_10 with values 0.00204, 0.0305, 0.0575, 0.06, 0.0668, 0.0715, 0.0837, 0.12, 0.144, and 0.148. All the values in U_5 have been generated and plotted. The points (x, y) plotted in the graph are x = log_10 b_r and y = Φ^{−1}[(N_n − r + 1)/(N_n + 1)] for 1 ≤ r ≤ N_n, where Φ^{−1} is the inverse of the standard normal distribution function.
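
The plotted points of Fig. 1 can be produced by continuing the sketch above; scipy is assumed to be available for the standard normal quantile function Φ^{−1}.

    from scipy.stats import norm

    b = sorted((K ** x for x in logs), reverse=True)   # ranked base values b_1 >= b_2 >= ...
    points = [(math.log10(b[r - 1]),                   # x = log_10 b_r
               norm.ppf((N_n - r + 1) / (N_n + 1)))    # y = Phi^{-1}[(N_n - r + 1)/(N_n + 1)]
              for r in range(1, N_n + 1)]
    # An approximately linear (x, y) scatter is the signature of log-normality.
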
III. AN INVERSE POWER UPPER TAIL AND THE "BROKEN STICK" PROBLEM

Significantly, the upper tail of the distribution of base values in U_n conforms approximately to an inverse power law, as can be seen in Fig. 2. This figure shows the same base values previously displayed in Fig. 1, now represented in log-log coordinates. The vertical axis represents log_K b_r and the horizontal axis represents the logarithm of the associated rank, log_K r. The evident linearity of the graph over most of the top range of base values is graphic proof of an approximate inverse power law in this range.

Mandelbrot's [2] combinatorial proof explains this phenomenon, but his derivation does not lead to an explicit formula for the slope of the linear part of this graph. We obtain an asymptotic estimate for it here and show that it connects up in a natural way with the "broken stick" problem.

Consider the least squares regression of log_K b_r onto log_K r constrained to go through the origin, (0,0) = (log_K 1, log_K b_1). This constrained least squares regression line can be proven to be a very good fit in the sense that the R² (coefficient of determination) associated with the model approaches 1 as n→∞. (The fact that the regression line is not a good fit for the bottom of the distribution of base values is addressed below.) The following calculations show that the slope b̂ of this line is asymptotically equal to (Σ_{j=1}^{K} log_K a_j)/K.

Let μ_j be the mean and σ_j² the variance of the logarithms of the K^j base values in B_j. For example, μ_0 = σ_0² = 0, μ_1 = (Σ_{j=1}^{K} log_K a_j)/K, and σ_1² = [Σ_{j=1}^{K} (log_K a_j − μ_1)²]/K. Write x_n ~ y_n whenever lim_{n→∞} x_n/y_n = 1. The following required intermediate results are stated without proof.

(R1) The equations μ_j = jμ_1 and σ_j² = jσ_1² hold for j = 0, 1, ..., n.

(R2) The least squares line of log_K b_r regressed onto log_K r and constrained to pass through the origin has the slope b̂ = (Σ_{r=1}^{N_n} log_K b_r log_K r)/Σ_{r=1}^{N_n} log²_K r.

(R3) (i) Σ_{j=1}^{n} log_K j ~ n log_K n and (ii) Σ_{j=1}^{n} log²_K j ~ n log²_K n.

(R4) Σ_{j=1}^{n} j^α K^j ~ n^α K^{n+1}/(K−1) for α ≥ 0.

(R5) (i) Σ_{r=1}^{N_n} |log_K b_r| ~ |μ_1| n K^{n+1}/(K−1) and (ii) Σ_{r=1}^{N_n} log²_K b_r ~ μ_1² n² K^{n+1}/(K−1).

(R1) through (R4) are straightforward to prove, and (R5) follows from (R1) and (R4).

It is convenient to use |b̂| in our calculations, and we can obtain an asymptotic estimate of it by giving an upper and lower bound that are asymptotically equivalent. The key inequalities are

(1/N_n) Σ_{r=1}^{N_n} |log_K b_r| Σ_{r=1}^{N_n} log_K r ≤ Σ_{r=1}^{N_n} |log_K b_r| log_K r ≤ [Σ_{r=1}^{N_n} log²_K b_r]^{1/2} [Σ_{r=1}^{N_n} log²_K r]^{1/2},   (1)

where the upper bound follows from the Cauchy-Schwarz inequality and the lower follows from Chebyshev's monotonic inequality [13]. The latter is applicable here because |log_K b_r| and log_K r are both monotonically increasing in r. After dividing through by Σ_{r=1}^{N_n} log²_K r in (1), the middle expression in the chain of inequalities is just |b̂|. Then, using the asymptotic relations in (R3)–(R5), routine calculations show that both the upper and lower bounds on |b̂| are ~|μ_1|. Consequently, for n→∞ we must have |b̂| ~ |μ_1| as well. The inequalities of (1) are also the crux of similar manipulations that prove R² approaches 1 as a limit as n→∞.
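
The asymptotic slope estimate is easy to check numerically by continuing the same sketch; b̂ and μ_1 are not identical for finite n, but they drift together as n is increased.

    logb = [math.log(v, K) for v in b]                   # log_K b_r, using b from the Fig. 1 sketch
    logr = [math.log(r, K) for r in range(1, N_n + 1)]   # log_K r
    b_hat = sum(x * y for x, y in zip(logr, logb)) / sum(x * x for x in logr)
    mu_1 = sum(math.log(p, K) for p in a) / K            # asymptotic slope (sum_j log_K a_j)/K
    print(b_hat, mu_1)
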
FIG. 2. Log-log plot of base values by rank for the random text model. The linear trend is evident for approximately the top half of the data. In the random text model, observed word frequencies would be those of a sample drawn from a multinomial distribution with probabilities proportional to these base values. Zipf's inverse power law would tend to be observed because the bottom part of the distribution (i.e., words of low probability) would tend not to be represented in the sample data.

Empirical studies of natural language showing plots of word frequencies f_r against their rank r in log-log coordinates almost always have slopes very near −1, as a glance through Zipf's work will show [14]. The value −1 has some significance for the slope estimate b̂ ~ (Σ_{j=1}^{K} log_K a_j)/K of this random text model through its connection to the classical problem in probability theory concerned with the random division of the unit interval. Consider the letter probabilities a_1, a_2, ..., a_{K+1} as a random division of the interval [0,1]. We represent this in the standard way. Let X_1, X_2, ..., X_K be K independently and identically distributed random variables defined on [0,1]. Write the corresponding order statistics as X_(1) ≤ X_(2) ≤ ⋯ ≤ X_(K). The interval [0,1] is subdivided into K+1 mutually exclusive and exhaustive segments by the spacings D_j defined as D_1 = X_(1), D_2 = X_(2) − X_(1), ..., D_j = X_(j) − X_(j−1), ..., D_{K+1} = 1 − X_(K). Think of the letter probabilities a_j as realized values of the spacings D_j.

Darling [15] has shown that for uniformly distributed X_j and K→∞, Σ_{j=1}^{K+1} ln D_j has the asymptotic mean −K ln K. Blumenthal [16] has extended this result to a broader family of nonuniform distributions. Dropping any single term, say ln D_{K+1}, from the sum has no effect on this asymptotic result, and replacing ln with log_K we see that (Σ_{j=1}^{K} log_K D_j)/K must have the asymptotic mean (−K log_K K)/K = −1. The values of a_j used in the two figures of this paper were obtained by sampling the X_j from the uniform distribution on [0,1] and computing the associated spacings D_j. For this sample, (Σ_{j=1}^{10} log_10 D_j)/10 = −1.33, which is the slope of the asymptotic regression line graphed in Fig. 2.
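
This average is easy to simulate; the following self-contained sketch (not from the paper) uses an arbitrary large number of uniform cut points. The convergence is slow, which is consistent with the paper's own K = 10 sample giving −1.33 rather than −1.

    import math
    import random

    K_stick = 1000                                       # number of uniform cut points (illustrative)
    x = sorted(random.random() for _ in range(K_stick))
    D = [x[0]] + [x[j] - x[j - 1] for j in range(1, K_stick)] + [1 - x[-1]]
    avg = sum(math.log(d, K_stick) for d in D[:K_stick]) / K_stick
    print(avg)                                           # typically near -1 for large K_stick
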
IV. THE ROLE OF SIZE-BIASED SAMPLING AND CONCLUSION

It is clear from visual inspection of Fig. 2 that the asymptotic regression line with slope μ_1 does not fit the smallest base values well, and this will not change as n→∞. However, for the toy model defined here, the observed word frequencies of an experiment would be those of a sample drawn from a multinomial parent distribution with N_n categories
(words) with probabilities proportional to the base values plotted in Fig. 2. The smallest probability words would tend not to occur unless the sample size was very large. Consequently, an approximate inverse power law for the observed frequencies would be seen. The true log-normal structure of the parent distribution would be obscured or distorted for most practical sample sizes. In fact, Li [9] carries out a simulation that demonstrates this phenomenon. One of his simulations illustrates how Zipf's law can be generated in a sample of random text based on the independent, unequal letter probability model with a maximum word length (see his Fig. 2). However, he did not recognize that the underlying parent distribution is actually approximately log-normal—not a power law. With a sufficiently large sample size, his sample power law would have to break down as more and more of the low-probability random word sequences appear.
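
This sampling effect can be reproduced in the running sketch by drawing a modest multinomial sample from the base-value parent and ranking the observed frequencies; the sample size below is an arbitrary choice.

    import random
    from collections import Counter

    total = sum(b)                                  # b: ranked base values from the earlier sketch
    weights = [v / total for v in b]
    sample = random.choices(range(N_n), weights=weights, k=50000)
    freqs = sorted(Counter(sample).values(), reverse=True)
    print(len(freqs), "of", N_n, "words observed")
    # Most low-probability words never occur, so a log-log plot of freqs against
    # rank mimics Zipf's law even though the parent distribution is log-normal.
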
The significance of size-biased sampling in relation to this problem and many others should be emphasized. To a remarkable extent, what we observe are extreme, upper tail events. We see the brightest stars, not the dimmest; we record seismic occurrences only when they are sufficiently large to be detected by our instruments; we collect and report statistics on the heights of the tallest buildings, the areas of the world's largest lakes, and so on. In each of these cases, there is an obvious visibility bias in observing events that makes it difficult to assess the nature of the underlying parent distribution from which the sample was drawn. Statisticians [17] have remarked that this pervasive issue needs greater recognition and have developed analytic methods for attacking it.

We conclude with very brief pointers to related topics. Empirical, approximately log-normal distributions characterized by upper tails with inverse power laws have been noted numerous times in many phenomena, as discussed, for example, by Montroll and Shlesinger [18,19] and Perline [20]. Empirical and theoretical arguments in support of log-normal models of word frequency distributions are given by Herdan [21]. (The similarity between our construction of U_n and the derivation of a hybrid lognormal-Pareto distribution in [18,19] should be noted.) The fractal character of the set U_n (as n→∞) seems intuitively evident, and the expression (Σ_{j=1}^{K} log_K a_j)/K pops up as what Evertsz and Mandelbrot [22] call the "most probable Hölder exponent" of a multifractal. Finally, Gut [8] has shown the importance of Anscombe's generalization of the central limit theorem for more realistic models of random walks, and we suggest that it can be extended in many ways for applications to a great variety of phenomena.

ACKNOWLEDGMENTS

I am deeply grateful to Ron Perline, Department of Mathematics and Computer Science, Drexel University, for encouraging the composition of this paper.

[1] B. Mandelbrot, Fractals: Form, Chance, and Dimension (Freeman, New York, 1977).
[2] B. Mandelbrot, in Information Networks, the Brooklyn Polytechnic Institute Symposium, edited by E. Weber (Interscience, New York, 1955), pp. 205–221.
[3] B. Mandelbrot, in Proceedings of the Symposium on Applications of Information Theory, edited by W. Jackson (Butterworth, London, 1953), pp. 486–499.
[4] B. Mandelbrot, Trans. IRE 3, 124 (1954).
[5] B. Mandelbrot, Proc. Symp. Appl. Math. 12, 190 (1961).
[6] G. Miller, Am. J. Psychol. 70, 311 (1957).
[7] G. Miller and N. Chomsky, in Handbook of Mathematical Psychology II, edited by R. Luce, R. Bush, and E. Galanter (Wiley, New York, 1963), pp. 419–491.
[8] A. Gut, Stopped Random Walks: Limit Theorems and Applications (Springer, New York, 1987).
[9] W. Li, IEEE Trans. Inf. Theory 38, 1842 (1992).
[10] G. Miller, in the preface to G. Zipf, The Psycho-Biology of Language: An Introduction to Dynamic Psychology (MIT Press, Cambridge, MA, 1965).
[11] R. Mantegna, S. Buldyrev, A. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. Stanley, Phys. Rev. Lett. 73, 3169 (1994).
[12] P. Niyogi and R. Berwick, CMP-LG@XXX.LANL.GOV, Document No. 9503012, 1995; see also Phys. Rev. Lett. 76, 1976 (1996) for several Comments and a Reply.
[13] R. Graham, D. Knuth, and O. Patashnik, Concrete Mathematics (Addison-Wesley, Reading, MA, 1994).
[14] G. Zipf, Human Behavior and the Principle of Least Effort (Addison-Wesley, Cambridge, MA, 1949).
[15] D. Darling, Ann. Math. Stat. 23, 450 (1952).
[16] S. Blumenthal, SIAM J. Appl. Math. 16, 1184 (1968).
[17] G. P. Patil and C. Radhakrishna Rao, in Applications of Statistics, edited by P. Krishnaiah (North-Holland, Amsterdam, 1977), pp. 383–405.
[18] E. Montroll and M. Shlesinger, Proc. Natl. Acad. Sci. U.S.A. 79, 3380 (1982).
[19] E. Montroll and M. Shlesinger, J. Stat. Phys. 32, 209 (1983).
[20] R. Perline, Ph.D. dissertation, University of Chicago, 1982.
[21] G. Herdan, Type-Token Mathematics: A Textbook of Mathematical Linguistics (Mouton, ’s-Gravenhage, 1961).
[22] C. Evertsz and B. Mandelbrot, in Chaos and Fractals: New Frontiers of Science, edited by H.-O. Peitgen, H. Jürgens, and D. Saupe (Springer, New York, 1992).
