I. Kontoyiannis∗
November 6, 2007
Abstract
We illustrate how elementary information-theoretic ideas may be employed to provide
proofs for well-known, nontrivial results in number theory. Specifically, we give an elementary
and fairly short proof of the following asymptotic result,
\[
\sum_{p \le n} \frac{\log p}{p} \sim \log n, \quad \text{as } n \to \infty,
\]
where the sum is over all primes p not exceeding n. We also give finite-n bounds refining
the above limit. This result, originally proved by Chebyshev in 1852, is closely related to
the celebrated prime number theorem.
∗ Department of Informatics, Athens University of Economics and Business, Patission 76, Athens 10434, Greece.
Email: yiannis@aueb.gr. Web: http://pages.cs.aueb.gr/users/yiannisk/.
1 Introduction
The significant depth of the connection between information theory and statistics appears to
have been recognized very soon after the birth of information theory [17] in 1948; a book-length
exposition was provided by Kullback [12] already in 1959. In subsequent decades much was
accomplished, and in the 1980s the development of this connection culminated in Rissanen’s
celebrated work [14, 15, 16], laying the foundations for the notion of stochastic complexity and
the Minimum Description Length principle, or MDL.
Here we offer a first glimpse of a different connection, this time between information theory
and number theory. In particular, we will show that basic information-theoretic arguments
combined with elementary computations can be used to give a new proof for a classical result
concerning the distribution of prime numbers. The problem of understanding this “distribution”
(including the issue of exactly what is meant by that statement) has, of course, been at the heart
of mathematics since antiquity, and it has led, among other things, to the development of the
field of analytic number theory; e.g., Apostol’s text [1] offers an accessible introduction and [2]
gives a more historical perspective.
A major subfield is probabilistic number theory, where probabilistic tools are used to derive
results in number theory. This approach, pioneered by, among others, Mark Kac and Paul Erdős
from the 1930s on, is described, e.g., in Kac’s beautiful book [11], Billingsley’s review [3], and
Tenenbaum’s more recent text [18]. The starting point in much of the relevant literature is the
following setup: For a fixed, large integer n, choose a random integer N from {1, 2, . . . , n}, and
write it in its unique prime factorization,
\[
N = \prod_{p \le n} p^{X_p}, \tag{1}
\]
where the product runs over all primes p not exceeding n, and X_p is the largest power k ≥ 0
such that p^k divides N. Through this representation, the uniform distribution on N induces a
joint distribution on the {X_p ; p ≤ n}, and the key observation is that, for large n, the random
variables {X_p} are distributed approximately like independent geometrics. Indeed, since there
are exactly ⌊n/p^k⌋ multiples of p^k between 1 and n,
\[
\Pr\{X_p \ge k\} = \Pr\{N \text{ is a multiple of } p^k\} = \frac{1}{n}\Big\lfloor \frac{n}{p^k} \Big\rfloor \approx \Big(\frac{1}{p}\Big)^k, \quad \text{for large } n, \tag{2}
\]
so the distribution of X_p is approximately geometric. A similar calculation applies to the joint
distribution of the {X_p} and shows that they are approximately independent.
Billingsley's argument essentially begins with the observation that, since the representation (1)
is unique, the value of N and the values of the exponents {X_p} are in a one-to-one correspondence;
therefore, the entropy of N is the same as the entropy of the collection {X_p}. And since the
random variables {X_p} are approximately independent geometrics, we should expect that,
\[
\log n = H(X_p ;\, p \le n) \approx \sum_{p \le n} H(X_p) \approx \sum_{p \le n} \Big[ \frac{\log p}{p-1} - \log\Big(1 - \frac{1}{p}\Big) \Big], \tag{3}
\]
where in the last equality we simply substituted the well-known expression for the entropy of
a geometric random variable (see Section 2 for details on the definition of the entropy and its
computation). For large p, the above summands behave like (log p)/p to first order, leading to the
asymptotic estimate,
\[
\sum_{p \le n} \frac{\log p}{p} \approx \log n, \quad \text{for large } n.
\]
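To make the "first order" claim a little more explicit (this brief expansion is ours, added for clarity), note that for each prime p,
\[
\frac{\log p}{p-1} - \log\Big(1 - \frac{1}{p}\Big) = \frac{\log p}{p} + O\Big(\frac{\log p}{p^2}\Big) + \frac{1}{p} + O\Big(\frac{1}{p^2}\Big),
\]
and, after summing over p ≤ n, all but the first term contribute an amount of lower order than log n: the sums of (log p)/p^2 and 1/p^2 converge, while the sum of 1/p over p ≤ n grows only like log log n, by a classical estimate of Mertens.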
Our main goal in this paper is to show that this approximation can indeed be made rigorous,
mostly through elementary information-theoretic arguments; we will establish:
Theorem 1. As n → ∞,
\[
C(n) := \sum_{p \le n} \frac{\log p}{p} \sim \log n. \tag{4}
\]
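As a quick numerical illustration of Theorem 1 (not needed for any of the arguments below), the following Python sketch compares C(n) with log n for a few values of n; the helper primes_up_to and the simple sieve are ours, chosen purely for illustration.
\begin{verbatim}
from math import log

def primes_up_to(n):
    """All primes p <= n, via a simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def C(n):
    """C(n) = sum over primes p <= n of (log p)/p."""
    return sum(log(p) / p for p in primes_up_to(n))

for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    # The ratio C(n)/log n should slowly approach 1 as n grows,
    # consistent with Theorem 1; the convergence is only logarithmic.
    print(n, C(n) / log(n))
\end{verbatim}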
The prime number theorem, the statement that π(n) ∼ n/log n as n → ∞, was conjectured by
Gauss around 1792, and it was only proved in 1896; Chebyshev was not able to produce a complete
proof, but he used (4) and his finer bounds on C(n) to show that π(n) is of order n/log n. Although
we will not pursue this direction here, it is actually not hard to see that the asymptotic behavior
of C(n) is intimately connected with that of π(n). For example, a simple exercise in summation
by parts shows that π(n) can be expressed directly in terms of C(n):
\[
\pi(n) = \frac{n+1}{\log(n+1)}\, C(n) - \sum_{k=2}^{n} \Big[ \frac{k+1}{\log(k+1)} - \frac{k}{\log k} \Big] C(k), \quad \text{for all } n \ge 3. \tag{5}
\]
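As a sanity check of the identity (5) (again the code is ours and purely illustrative), one can verify it numerically for small values of n:
\begin{verbatim}
from math import log

# Primes below 200 by trial division; more than enough for n <= 100.
primes = [p for p in range(2, 200)
          if all(p % d for d in range(2, int(p ** 0.5) + 1))]

def pi(n):
    """pi(n): the number of primes p <= n."""
    return sum(1 for p in primes if p <= n)

def C(n):
    """C(n) = sum over primes p <= n of (log p)/p."""
    return sum(log(p) / p for p in primes if p <= n)

for n in (3, 10, 50, 100):
    rhs = (n + 1) / log(n + 1) * C(n) - sum(
        ((k + 1) / log(k + 1) - k / log(k)) * C(k) for k in range(2, n + 1)
    )
    print(n, pi(n), round(rhs, 6))  # the last two columns should agree
\end{verbatim}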
Here π(n) denotes the number of primes p_1, p_2, . . . , p_{π(n)} up to n, and, as before, X_p is
the largest integer power k ≥ 0 such that p^k divides N. As noted in (2) above, the distribution
of X_p can be described by,
\[
\Pr\{X_p \ge k\} = \frac{1}{n}\Big\lfloor \frac{n}{p^k} \Big\rfloor, \quad \text{for all } k \ge 1. \tag{6}
\]
This representation also gives simple upper and lower bounds on its mean E(X_p),
\[
\mu_p := E(X_p) = \sum_{k \ge 1} \Pr\{X_p \ge k\} \le \sum_{k \ge 1} \Big(\frac{1}{p}\Big)^k = \frac{1/p}{1 - 1/p} = \frac{1}{p-1}, \tag{7}
\]
and
\[
\mu_p \ge \Pr\{X_p \ge 1\} \ge \frac{1}{p} - \frac{1}{n}. \tag{8}
\]
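Both bounds come from (6) together with two elementary floor-function estimates, spelled out here for convenience: for every k ≥ 1,
\[
\frac{1}{n}\Big\lfloor \frac{n}{p^k} \Big\rfloor \le \frac{1}{n}\cdot\frac{n}{p^k} = \Big(\frac{1}{p}\Big)^k,
\qquad \text{while} \qquad
\frac{1}{n}\Big\lfloor \frac{n}{p} \Big\rfloor \ge \frac{1}{n}\Big( \frac{n}{p} - 1 \Big) = \frac{1}{p} - \frac{1}{n},
\]
combined with the standard identity expressing the mean of a random variable with values in {0, 1, 2, . . .} as the sum of its tail probabilities Pr{X ≥ k}, k ≥ 1.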
Recall the important observation that the distribution of each X_p is close to a geometric.
To be precise, a random variable Y with values in {0, 1, 2, . . .} is said to have a geometric
distribution with mean µ > 0, denoted Y ∼ Geom(µ), if Pr{Y = k} = µ^k/(1 + µ)^{k+1}, for all
k ≥ 0. Then Y of course has mean E(Y) = µ and its entropy is,
\[
h(\mu) := H(\mathrm{Geom}(\mu)) = -\sum_{k \ge 0} \Pr\{Y = k\} \log \Pr\{Y = k\} = (\mu + 1)\log(\mu + 1) - \mu \log \mu. \tag{9}
\]
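For the reader's convenience, here is a short verification of the closed form in (9) (the computation itself is standard): writing q := µ/(1 + µ), so that Pr{Y = k} = (1 − q)q^k and E(Y) = q/(1 − q) = µ,
\[
H(Y) = -\sum_{k \ge 0} (1-q)\,q^k \big[ \log(1-q) + k \log q \big]
     = -\log(1-q) - \mu \log q
     = (\mu + 1)\log(\mu + 1) - \mu \log \mu.
\]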
These ingredients combine to give the chain of relations
\[
\log n \,\overset{(a)}{=}\, H(N) \,\overset{(b)}{=}\, H(X_p ;\, p \le n) \,\overset{(c)}{\le}\, \sum_{p \le n} H(X_p) \,\overset{(d)}{\le}\, \sum_{p \le n} H(\mathrm{Geom}(\mu_p)) \,\overset{(e)}{=}\, \sum_{p \le n} h(\mu_p), \tag{10}
\]
where (a) is simply the entropy of the uniform distribution, (b) comes from the fact that N
and the {X_p} are in a one-to-one correspondence, (c) is the well-known subadditivity of the
entropy, (d) is because the geometric has maximal entropy among all distributions on the non-
negative integers with a fixed mean, and (e) is the definition of h(µ) in (9). Noting that h(µ) is
nondecreasing in µ and recalling the upper bound on µ_p in (7) gives,
\[
\log n \le \sum_{p \le n} h(\mu_p) \le \sum_{p \le n} h\big(1/(p-1)\big) = \sum_{p \le n} \Big[ \frac{p}{p-1}\log\frac{p}{p-1} - \frac{1}{p-1}\log\frac{1}{p-1} \Big]. \tag{11}
\]
Since the summands above behave like (log p)/p for large p, it is not difficult to deduce the
following lower bounds on C(n), the sum of (log p)/p over the primes p ≤ n:
2.3 A Simple Upper Bound on C(n)
Unfortunately, it is not clear how to reverse the inequalities in equations (10) and (11) to get
a corresponding upper bound on C(n) – especially inequality (c) in (10). Instead we use a
different argument, one which is less satisfying from an information-theoretic point of view, for
two reasons. First, although again we do go via the entropy of N , it is not necessary to do so;
see equation (13) below. And second, we need to use an auxiliary result, namely, the following
rough estimate on the sum ϑ(n):
\[
\vartheta(n) := \sum_{p \le n} \log p \le (2\log 2)\, n, \quad \text{for all } n \ge 2. \tag{12}
\]
thus proving:
Theorem 3. [Upper Bound] For all n ≥ 2,
\[
\sum_{p \le n} \frac{\log p}{p} \le \log n + 2\log 2.
\]
Theorem 3 together with Corollary 1 proves Theorem 1. Of course the use of the entropy could
have been avoided entirely: Instead of using that H(N ) = log n in (13), we could simply use
that n ≥ N by definition, so log n ≥ E[log N ], and proceed as before.
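Spelled out, the route just described presumably runs as follows (the intermediate steps are our reconstruction, using only (1), (8) and (12)): taking logarithms in (1) expresses log N as the sum of X_p log p over the primes p ≤ n, and therefore
\[
\log n \ge E[\log N] = \sum_{p \le n} \mu_p \log p \ge \sum_{p \le n} \Big( \frac{1}{p} - \frac{1}{n} \Big) \log p = C(n) - \frac{\vartheta(n)}{n} \ge C(n) - 2\log 2,
\]
which is exactly the bound of Theorem 3.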
Finally (paraphrasing from [10, p. 341]) we give an elegant argument of Erdős that employs
a cute, elementary trick to prove the inequality on ϑ(n) in (12). First observe that we can
restrict attention to odd n, since ϑ(2n) = ϑ(2n − 1) for all n ≥ 2 (as there are no even primes
other than 2). Let n ≥ 2 be arbitrary; then every prime n + 1 < p ≤ 2n + 1 divides the binomial
coefficient,
coefficient,
\[
B := \binom{2n+1}{n} = \frac{(2n+1)!}{\,n!\,(n+1)!\,},
\]
since it divides the numerator but not the denominator, and hence the product of all these
primes also divides B. In particular, their product must be no greater than B, i.e.,
\[
\prod_{n+1 < p \le 2n+1} p \;\le\; B \;=\; \frac{1}{2}\binom{2n+1}{n} + \frac{1}{2}\binom{2n+1}{n+1} \;\le\; \frac{1}{2}\,(1+1)^{2n+1} \;=\; 2^{2n},
\]
or, taking logarithms,
\[
\vartheta(2n+1) - \vartheta(n+1) = \sum_{n+1 < p \le 2n+1} \log p = \log\Big[ \prod_{n+1 < p \le 2n+1} p \Big] \le (2\log 2)\, n.
\]
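For completeness, the standard induction from [10] that turns this estimate into (12) runs as follows (the inductive step below is our reconstruction): assuming (12) holds for all arguments smaller than 2n + 1, the last display gives
\[
\vartheta(2n+1) = \vartheta(n+1) + \big[ \vartheta(2n+1) - \vartheta(n+1) \big] \le (2\log 2)(n+1) + (2\log 2)\, n = (2\log 2)(2n+1),
\]
and, since ϑ(2) = log 2 and ϑ(2n) = ϑ(2n − 1) as noted above, this establishes (12) for all n ≥ 2.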
\[
\log n = H(N) = H(X_p ;\, p \le n) \le \sum_{p \le n} H(X_p) \le \sum_{p \le n} \log\Big( \frac{\log n}{\log 2} + 1 \Big) \le \pi(n)\,\big(\log\log n + 1\big),
\]
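Rearranging gives a weak but nontrivial lower bound on the number of primes, presumably the conclusion intended by the surrounding argument:
\[
\pi(n) \ge \frac{\log n}{\log\log n + 1}.
\]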
For a second, similar bound, write each N in the form
\[
N = M^2 \prod_{p \le n} p^{Y_p},
\]
where M ≥ 1 is the largest integer such that M^2 divides N, and each of the Y_p is either zero
or one. Then H(Y_p) ≤ log 2 for all p, and the fact that M^2 ≤ n implies that H(M) ≤ log⌊√n⌋.
Therefore,
\[
\log n = H(N) = H(M, Y_{p_1}, Y_{p_2}, \ldots, Y_{p_{\pi(n)}}) \le H(M) + \sum_{p \le n} H(Y_p) \le \frac{1}{2}\log n + \pi(n)\log 2,
\]
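or, rearranging (presumably the conclusion this second argument was aiming at),
\[
\pi(n) \ge \frac{\log n}{2\log 2} = \frac{1}{2}\log_2 n.
\]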
Appendix
Proof of the Summation-by-parts Formula (5). Note that, since π(k) − π(k − 1) is zero
unless k is prime, C(n) can be expressed as a sum over all integers k ≤ n,
\[
C(n) = \sum_{2 \le k \le n} \big[ \pi(k) - \pi(k-1) \big]\, \frac{\log k}{k}.
\]
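One way to complete the computation (the remaining steps are a standard exercise; the rearrangement below is our sketch): since π(k) − π(k − 1) = [C(k) − C(k − 1)] · k/log k for every integer k ≥ 2 (both sides vanish unless k is prime, and C(1) = 0), summing over 2 ≤ k ≤ n and summing by parts gives
\[
\pi(n) = \sum_{2 \le k \le n} \big[ C(k) - C(k-1) \big] \frac{k}{\log k}
       = \frac{n}{\log n}\, C(n) \;-\; \sum_{2 \le k \le n-1} C(k) \Big[ \frac{k+1}{\log(k+1)} - \frac{k}{\log k} \Big],
\]
which coincides with (5) once the k = n term of the sum in (5) is combined with its leading coefficient (n + 1)/log(n + 1).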
Here the last inequality follows from the inequality −log(1 − x) ≤ x/(1 − δ), for all 0 ≤ x ≤
δ < 1, with δ = 1/N_0. Therefore,
\[
\log n \le T(N_0) + \sum_{N_0 < p \le n} \Big[ \frac{1}{1 - \frac{1}{N_0}}\, \frac{\log p}{p} + \frac{N_0}{N_0 - 1}\, \frac{\log p}{\log N_0}\, \frac{1}{p} \Big]
= T(N_0) + \frac{N_0}{N_0 - 1}\Big( 1 + \frac{1}{\log N_0} \Big)\big[ C(n) - C(N_0) \big]. \tag{14}
\]
Dividing by log n and letting n → ∞ yields,
\[
\liminf_{n \to \infty} \frac{C(n)}{\log n} \ge \frac{(N_0 - 1)\log N_0}{N_0\,(1 + \log N_0)},
\]
and since N_0 was arbitrary, letting now N_0 → ∞ implies (i).
For all n ≥ N_0, (14) implies,
\[
C(n) \ge \Big( 1 - \frac{1}{N_0} \Big)\Big( 1 - \frac{1}{1 + \log N_0} \Big) \log n + C(N_0) - T(N_0), \tag{15}
\]
and evaluating this at N_0 = 16 gives (ii).
Acknowledgments
Thanks to Peter Harremoës for spotting a small error in an earlier version of this paper.
References
[1] T.M. Apostol. Introduction to Analytic Number Theory. Springer-Verlag, New York, 1976.
[2] P.T. Bateman and H.G. Diamond. A hundred years of prime numbers. Amer. Math.
Monthly, 103(9):729–741, 1996.
[3] P. Billingsley. The probability theory of additive arithmetic functions. Ann. Probab., 2:749–
791, 1974.
[4] H. Bohr. Address of Professor Harold Bohr. In Proceedings of the International Congress of
Mathematicians (Cambridge, 1950) vol. 1, pages 127–134. Amer. Math. Soc., Providence,
RI, 1952.
[5] G.J. Chaitin. Toward a mathematical definition of “life”. In Maximum entropy formalism
(Conf., Mass. Inst. Tech., Cambridge, Mass., 1978), pages 477–498. MIT Press, Cambridge,
Mass., 1979.
[6] P.L. Chebychev. Mémoire sur les nombres premiers. J. de Math. Pures Appl., 17:366–390,
1852.
[7] P.L. Chebychev. Sur la totalité des nombres premiers inférieurs à une limite donnée. J. de
Math. Pures Appl., 17:341–365, 1852.
[8] T.M. Cover and J.A. Thomas. Elements of Information Theory. J. Wiley, New York, 1991.
[9] H.G. Diamond. Elementary methods in the study of the distribution of prime numbers.
Bull. Amer. Math. Soc. (N.S.), 7(3):553–589, 1982.
[10] G.H. Hardy and E.M. Wright. An Introduction to the Theory of Numbers. The Clarendon
Press Oxford University Press, New York, fifth edition, 1979.
[11] M. Kac. Statistical Independence in Probability, Analysis and Number Theory. Published
by the Mathematical Association of America. Distributed by John Wiley and Sons, Inc.,
New York, 1959.
[12] S. Kullback. Information Theory and Statistics. Dover Publications Inc., Mineola, NY,
1997. Reprint of the second (1968) edition.
[14] J. Rissanen. A universal prior for integers and estimation by minimum description length.
Ann. Statist., 11(2):416–431, 1983.
[15] J. Rissanen. Stochastic complexity. J. Roy. Statist. Soc. Ser. B, 49(3):223–239, 253–265,
1987. With discussion.
[16] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore, 1989.
[17] C.E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423,
623–656, 1948.