
Good–Turing frequency estimation is a statistical technique for estimating the probability of

encountering an object of a hitherto unseen species, given a set of past observations of objects
from different species. In drawing balls from an urn, the 'objects' would be balls and the 'species'
would be the distinct colors of the balls (finite but unknown in number). After drawing $R_\text{red}$ red balls, $R_\text{black}$ black balls and $R_\text{green}$ green balls, we would ask what is the probability of drawing a red ball, a black ball, a green ball or one of a previously unseen color.


Historical background

Good–Turing frequency estimation was developed by Alan Turing and his assistant I. J. Good as
part of their methods used at Bletchley Park for cracking German ciphers for the Enigma
machine during World War II. Turing at first modelled the frequencies as a multinomial
distribution, but found it inaccurate. Good developed smoothing algorithms to improve the
estimator's accuracy.

The discovery was recognized as significant when published by Good in 1953,[1] but the
calculations were difficult so it was not used as widely as it might have been.[2] The method
even gained some literary fame due to the Robert Harris novel Enigma.

In the 1990s, Geoffrey Sampson worked with William A. Gale of AT&T to create and implement a simplified and easier-to-use variant of the Good–Turing method[3][4] described below. Various heuristic justifications[5] and a simple combinatorial derivation[6] have been provided.
The method

Notation

Assume that $X$ distinct species have been observed, numbered $x = 1, \ldots, X$.

Then the frequency vector, $\bar{R}$, has elements $R_x$ that give the number of individuals that have been observed for species $x$.

The frequency of frequencies vector, $(N_r)_{r=0,1,\ldots}$, shows how many times the frequency $r$ occurs in the vector $\bar{R}$; i.e. among the elements $R_x$:

$$N_r = |\{x \mid R_x = r\}|.$$

For example, $N_1$ is the number of species for which only one individual was observed. Note that the total number of objects observed, $N$, can be found from

$$N = \sum_{r=1}^{\infty} r N_r.$$
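The notation above can be sketched in a few lines of code. The species counts below are illustrative values, not from the source:

```python
from collections import Counter

# Observed counts R_x per species x (made-up example data).
species_counts = {"red": 3, "black": 2, "green": 1, "blue": 1}

# Frequency of frequencies: N_r = |{x : R_x = r}|.
freq_of_freq = Counter(species_counts.values())
# Here freq_of_freq == {1: 2, 2: 1, 3: 1}: two species seen once,
# one seen twice, one seen three times.

# Total number of objects observed: N = sum over r of r * N_r.
N_total = sum(r * n_r for r, n_r in freq_of_freq.items())
print(N_total)  # 7, which equals sum(species_counts.values())
```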

The first step in the calculation is to estimate the probability that a future observed individual (or
the next observed individual) is a member of a thus far unseen species. This estimate is:[7]

$$p_0 = \frac{N_1}{N}.$$

The next step is to estimate the probability that the next observed individual is from a species
which has been seen r times. For a single species this estimate is:

$$p_r = \frac{(r+1)\,S(N_{r+1})}{N\,S(N_r)}.$$

To estimate the probability that the next observed individual is from any species from this group (i.e., the group of species seen $r$ times) one can use the following formula:

$$\frac{(r+1)\,S(N_{r+1})}{N}.$$
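A minimal sketch of these two estimates, with the smoother $S(\cdot)$ left as a pluggable parameter (the identity by default, i.e. unsmoothed; the function name and example numbers are ours, not from the source):

```python
def good_turing_probs(freq_of_freq, smooth=lambda n: n):
    """Good-Turing estimates with a pluggable smoother standing in for S(.).

    freq_of_freq maps each observed frequency r >= 1 to N_r.  The default
    smoother is the identity, giving the unsmoothed estimates.
    """
    N_total = sum(r * n_r for r, n_r in freq_of_freq.items())
    # Probability that the next individual is of an unseen species: p_0 = N_1 / N.
    p0 = freq_of_freq.get(1, 0) / N_total

    def p_r(r):
        # Per-species estimate: p_r = (r + 1) * S(N_{r+1}) / (N * S(N_r)).
        return ((r + 1) * smooth(freq_of_freq.get(r + 1, 0))
                / (N_total * smooth(freq_of_freq[r])))

    return p0, p_r

# Example: N_1 = 2, N_2 = 1, N_3 = 1, so N = 1*2 + 2*1 + 3*1 = 7.
p0, p_r = good_turing_probs({1: 2, 2: 1, 3: 1})
print(p0)      # 2/7: mass reserved for unseen species
print(p_r(1))  # (1+1)*N_2 / (N*N_1) = 2*1 / (7*2) = 1/7
```

Note that with the identity smoother, species seen at the highest observed frequency get probability zero (since $N_{r+1} = 0$ there), which is one reason smoothing is needed in practice.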

Here, the notation $S(\cdot)$ means the smoothed or adjusted value of the frequency shown in parentheses (see also empirical Bayes method). An overview of how to perform this smoothing follows.

We would like to make a plot of $\log N_r$ versus $\log r$, but this is problematic because for large $r$ many $N_r$ will be zero. Instead a revised quantity, $\log Z_r$, is plotted versus $\log r$, where $Z_r$ is defined as

$$Z_r = \frac{N_r}{0.5(t - q)},$$

and where $q$, $r$ and $t$ are consecutive subscripts having $N_q, N_r, N_t$ non-zero. When $r$ is 1, take $q$ to be 0. When $r$ is the last non-zero frequency, take $t$ to be $2r - q$.

The assumption of Good–Turing estimation is that the number of occurrences for each species follows a binomial distribution.[8]

A simple linear regression is then fitted to the log–log plot. For small values of r it is reasonable
to set S ( N r ) = N r {\displaystyle S(N_{r})=N_{r}} S(N_r) = N_r (that is, no smoothing is
performed), while for large values of r, values of S ( N r ) {\displaystyle S(N_{r})} S(N_r) are read
off the regression line. An automatic procedure (not described here) can be used to specify at
what point the switch from no smoothing to linear smoothing should take place.[9] Code for the
method is available in the public domain.[10]
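The regression step can be sketched as follows. This is a minimal least-squares fit of $\log Z_r$ on $\log r$, assuming the $Z_r$ values have already been computed as above; the published procedure additionally decides, per $r$, whether to use the raw $N_r$ or the fitted value, which is omitted here:

```python
import math

def fit_smoother(Z):
    """Simple linear regression of log Z_r on log r.

    Z maps each non-zero frequency r to Z_r.  Returns a function S with
    S(r) = exp(a + b * log r) lying on the fitted line; for large r this
    replaces the raw, noisy N_r.
    """
    xs = [math.log(r) for r in sorted(Z)]
    ys = [math.log(Z[r]) for r in sorted(Z)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda r: math.exp(intercept + slope * math.log(r))

# Illustrative Z values lying exactly on the power law Z_r = 20 / r**2,
# so the fit recovers slope -2 and S(3) = 20/9 even though r = 3 was unseen.
S = fit_smoother({1: 20.0, 2: 5.0, 4: 1.25})
print(S(3))  # 20/9, read off the regression line
```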
