
First, I’m submitting the result of the unit 2 assignment.

Documents 570

Tokens 26543

Terms 9606

Heaps' law states (Manning, Raghavan, & Schütze, 2009):

M = k * T^b // where M = vocabulary size (number of distinct terms), T = number of tokens, and k and b are constants.

From the assignment output, we find

Total tokens = 26543

Unique terms = 9606

Now let’s see whether Heaps' law holds for our assignment result:

M = 40 * (26543)^0.5 ≈ 6517 // k = 40; b = 0.5

The difference between 6517 and 9606 is significant. So, with these constants, our
result doesn’t follow Heaps' law.
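The check above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name heaps_estimate and the variable names are my own choices for illustration, and the token and term counts are the ones from the assignment output.

```python
def heaps_estimate(tokens: int, k: float, b: float) -> float:
    """Heaps' law: predicted vocabulary size M = k * T**b."""
    return k * tokens ** b


T = 26543        # tokens in our corpus (assignment output)
observed = 9606  # distinct terms in our corpus (assignment output)

# k = 40 and b = 0.5 are the constants used in the calculation above.
predicted = heaps_estimate(T, k=40, b=0.5)

# Share of the observed vocabulary that the prediction misses.
relative_gap = (observed - predicted) / observed

print(f"predicted M = {predicted:.0f}")
print(f"observed  M = {observed}")
print(f"prediction falls short by {relative_gap:.0%}")
```

Expressing the gap as a share of the observed vocabulary makes it easier to judge whether a difference is "significant" than comparing the two raw numbers.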

We learned from Heaps' law that out of roughly 1 million tokens, approximately
38,323 are distinct terms. It was represented as (Manning, Raghavan, & Schütze,
2009):

M = 44 * (1000020)^0.49 ≈ 38323 // k = 44; b = 0.49

M = 40 * (1000020)^0.5 ≈ 40000 // k = 40; b = 0.50

We find that the estimate changes slightly with the values of k and b. For example,
if we plug k = 44 and b = 0.49 into our equation, then

M = 44 * (26543)^0.49 ≈ 6474, which is still well below our output.

If we use k=59 and b=.50 we find

M = 59 * (26543)^0.50 ≈ 9612

9612 is very close to our output of 9606.
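Instead of trying values of k by hand, the equation can be rearranged to k = M / T^b. The short sketch below does this for our corpus, assuming b is fixed at 0.5; the variable names are illustrative only.

```python
T = 26543        # tokens in our corpus (assignment output)
observed = 9606  # distinct terms in our corpus (assignment output)

# Assumption: keep b fixed at 0.5 and solve M = k * T**b for k.
b = 0.5
k_fit = observed / T ** b

print(f"k that reproduces M = {observed} at b = {b}: {k_fit:.2f}")
```

The result is just under 59, which is why k = 59 lands so close to our output, and it still falls inside the 30 to 100 range reported for k.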

From Heaps' law, we’ve noticed that the value of k typically varies between 30 and
100 (Manning, Raghavan, & Schütze, 2009). Similarly, the value of b is close to 0.5.
Heaps' law is empirical: it is based on observation, not derived from a pure
mathematical function. The more tokens you take, the closer your result tends to be
to Heaps' law. The textbook example demonstrates the law on a collection of about
1 million tokens. On the other hand, our corpus has only 26543 tokens. If your
collection is small, you’ll observe less predictable results. If your collection is
large, you’ll observe higher accuracy. That’s why k and b vary within a range. It’s
also important to say that the vocabulary is not finite: day by day, new words are
being added, and not every term comes from a dictionary. For example, the word
“UoPeople” didn’t exist before 2009. So, testing with a large collection will give
you an approximately correct value, but testing with a small collection may not
always give you a correct result.
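Because the law is observational, T and M have to be counted from the collection itself. The sketch below shows one way the vocabulary could be tracked as tokens are read. It is an illustration only: the function name vocabulary_growth, the whitespace tokenization, the lowercasing, and the toy sentence are all my own assumptions and are not taken from the assignment corpus.

```python
def vocabulary_growth(tokens):
    """Return (T, M) pairs: tokens read so far and distinct terms seen so far."""
    seen = set()
    growth = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token.lower())  # assumption: terms are case-folded tokens
        growth.append((position, len(seen)))
    return growth


# Toy text, purely illustrative; a real test needs a much larger collection.
sample = "the cat sat on the mat and the dog sat on the log".split()

for t, m in vocabulary_growth(sample):
    print(f"T = {t:>2}  M = {m}")
```

On a real collection, the (T, M) pairs from a run like this are the observations that k and b would be fitted against.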
References

Manning, C. D., Raghavan, P., & Schütze, H. (2009). Heaps’ Law: Estimating The
Number of Terms. In An Introduction to Information Retrieval (pp. 88-89).
Cambridge, England: Cambridge University Press.
