Discussion Forum Unit 3
Documents 570
Tokens 26543
Terms 9606
Now let’s see whether Heaps’ law holds for our assignment result.
The difference between the estimate of 6517 terms and the 9606 terms we actually
counted is significant. So, we find that our result doesn’t follow Heaps’ law.
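We learned from Heaps’ law that out of 1 million tokens approximately 38,323 are
unique. In other words, out of 1 million tokens, 38,323 are terms. It was represented
as (Manning, Raghavan, & Schütze, 2009):
M = kT^b
where M is the number of distinct terms, T is the number of tokens, and k and b are
parameters. With k = 44 and b = 0.49, 44 × 1,000,020^0.49 ≈ 38,323.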
We find that the estimate changes with the values of k and b. For example, if we
plug k = 59 and b = 0.50 into Heaps’ law, then
M = 59 × 26543^0.50 ≈ 9612
which is very close to our 9606 terms.
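The arithmetic above can be checked with a short sketch. The function name `heaps_estimate` and the list of parameter pairs are illustrative choices, not part of the assignment. The pair k = 40, b = 0.50 is included only because it reproduces the 6517 figure quoted earlier; the post does not say which parameters produced that number, so treat that pairing as an assumption.

```python
def heaps_estimate(k: float, b: float, tokens: int) -> float:
    """Heaps' law estimate M = k * T**b of the number of distinct terms."""
    return k * tokens ** b


TOKENS = 26543          # tokens in our corpus
OBSERVED_TERMS = 9606   # terms actually counted in our corpus

# (k, b) pairs to try. (44, 0.49) is the pair used in the textbook example;
# (40, 0.50) is an assumed pair that happens to give about 6517;
# (59, 0.50) is the pair used in the calculation above.
PARAMETER_PAIRS = [(44, 0.49), (40, 0.50), (59, 0.50)]

for k, b in PARAMETER_PAIRS:
    estimate = heaps_estimate(k, b, TOKENS)
    error = estimate - OBSERVED_TERMS
    print(f"k={k}, b={b:.2f}: estimate={estimate:.0f}, error={error:+.0f}")
```

Every pair here stays inside the ranges usually quoted for k and b, yet the estimates differ by thousands of terms, which is why the choice of parameters matters so much for a corpus this small.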
According to Manning, Raghavan, and Schütze (2009), the value of k typically varies
between 30 and 100, and the value of b is close to 0.5. Heaps’ law is an empirical
law based on observation, not a result derived purely from a mathematical function.
The more tokens you take, the closer the result gets to the Heaps’ law estimate.
Heaps’ law has been demonstrated on collections of at least 1 million tokens. Our
corpus, on the other hand, has only 26,543 tokens. With a small collection the
result is less predictable; with a large collection the estimate is more accurate.
That’s why k and b vary within a range. It’s also important to note that the
vocabulary is not finite: new terms are added day by day, and not all of them come
from a dictionary. For example, the word “UoPeople” didn’t exist before 2009. So
testing with a large collection will give an approximately correct value, but
testing with a small collection may not always give a correct result.
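To see how k and b trade off for our small corpus, here is a rough back-calculation, assuming we fix b and solve M = kT^b for k. This is only a sketch: the helper name `solve_k` and the list of b values are hypothetical, and a single (tokens, terms) pair cannot determine both parameters, so this is not a proper fit.

```python
def solve_k(terms: int, tokens: int, b: float) -> float:
    """Solve M = k * T**b for k, given one observed (T, M) pair and a fixed b."""
    return terms / tokens ** b


TOKENS = 26543   # tokens in our corpus
TERMS = 9606     # terms actually counted in our corpus

# Hypothetical b values around 0.5, chosen only for illustration.
for b in (0.45, 0.49, 0.50, 0.55):
    k = solve_k(TERMS, TOKENS, b)
    print(f"b={b:.2f} -> k={k:.2f}")
```

With b = 0.50 this gives a k just under 59, which matches the rounded value used in the calculation earlier.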
References
Manning, C. D., Raghavan, P., & Schütze, H. (2009). Heaps’ Law: Estimating The
Number of Terms. In An Introduction to Information Retrieval (pp. 88-89).
Cambridge, England: Cambridge University Press.