CSCW researchers have also started looking at the field as an area for investigation. In particular, Sen et al. studied how tag selections are affected by community influences and personal tendencies. They studied four different tag-selection algorithms for displaying tags from other users and found that user tagging behaviors changed depending on the algorithm.

Millen et al., on the other hand, introduced a system designed for tagging intranet items in an enterprise. They collected some sample user data and showed that users form social networks through patterns of their usage of others’ bookmarks.

Within the HCI community, the CHI 2006 panel organized by Furnas et al. brought social tagging systems to the attention of HCI practitioners. Furnas’ suggestion that social tagging systems can be viewed from a “vocabulary problem” perspective directly inspired the approach used in this paper. More importantly, his comment pointed to the usefulness of social tagging systems as a communication device that can bridge the gap between document collections and users’ mental maps of those collections. Social navigation as enabled by social tagging systems can be studied by how well the tags form a vocabulary to describe the contents being tagged.

MacGregor also refers to social tagging fundamentally as a vocabulary problem in indexing: “terms assigned to resources that are exhaustive will result in high recall at the expense of precision. Conversely, terms that are too specific will result in high precision, but lower recall.” Sen et al. also referred to this as a fundamental research problem in the conclusion of their paper: “the density of tag applications across objects may provide information about their value to users.
A tag that is applied to a very large proportion of items may be too general to be useful, while a tag that is applied very few times may be useless due to its obscurity.”

Indeed, to understand how tags have evolved for a large corpus of tagged items such as del.icio.us, we need to understand whether the tags adequately describe the items being tagged. Moreover, we need a way to understand how the social tagging system will evolve in the future. Will the tags lose their specificity?

In short, our contribution is a methodology for understanding how tags in a social tagging system evolve as a vocabulary, and how well social tagging affords social navigation of a large collection of sources. We propose the use of concepts from information theory to examine these questions.
ENTROPY AND INFORMATION THEORY PRIMER
Information theory has long been applied to the understanding of human languages. Claude Shannon invented much of information theory in 1948 to understand how codes can be transmitted efficiently over a communication channel. The concepts of source coding and channel coding in information theory are concerned with the efficiency and robustness of codes in communication. These concepts have been applied to human languages. First, codes are essentially words, which should be efficient and short if they are commonly used, e.g. “I”, “the”, “a”, “this”. Second, a language (that is, the codes used in communication) should be robust to noise by having enough redundancy in the system.

The central idea in Shannon’s theory is the source-coding theorem, which says that, on average, the number of bits needed to communicate the result of an uncertain event is given by its entropy. From a statistical perspective, a somewhat more intuitive understanding of entropy is that it measures the amount of uncertainty about a particular event associated with a probability distribution.

The concept can be understood from a simple experiment. Consider a box containing many colored balls from which we are drawing balls. If no single color predominates in the box, then our uncertainty about the color of a drawn ball is maximal and the entropy is high. On the other hand, if the box contains more black balls than balls of any other color, then there is more certainty about the color of a drawn ball, and the entropy is lower. Intuitively, a gambler would prefer the second case, because it is possible to place bets on black and win. In fact, in the extreme case in which every ball is black, the entropy would be zero, and the gambler would win every time.

Entropy thus measures the average amount of information associated with a drawn ball. Intuitively, in the all-black case, the color of the ball is a certainty, and no information is conveyed by knowing the color of a drawn ball. In the first case, knowing the colors of previously drawn balls tells a gambler a lot about how she should bet.

Mathematically, given a discrete random variable X that consists of several events x_i, each occurring with probability p(x_i), the entropy H(X) is given by:

H(X) = − Σ_i p(x_i) log₂ p(x_i)

Looking at the above equation, there are two basic ways in which entropy can change:

(a) If the total number of events in X increases, the entropy of X will increase. This is because entropy is defined as a summation of values given by a function of the probabilities of x_i. (Note that the negative log of a probability is always a non-negative number.)

(b) If the distribution over X becomes more uniform, entropy will also increase.
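Both the ball-box intuition and properties (a) and (b) can be checked numerically. The following is a minimal Python sketch (the `entropy` helper is our own illustration, not part of the system described in the paper):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum_i p(x_i) * log2(p(x_i)).
    Zero-probability events contribute nothing (0 * log 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Ball-box examples from the text:
print(entropy([0.25, 0.25, 0.25, 0.25]))  # four colors equally likely: 2.0 bits
print(entropy([0.7, 0.1, 0.1, 0.1]))      # mostly black balls: ~1.36 bits
print(entropy([1.0]))                     # every ball is black: 0.0 bits

# (a) More events (here, uniform) -> higher entropy; H = log2(n) for n outcomes.
assert entropy([1/8] * 8) > entropy([1/4] * 4)

# (b) More uniform distribution, same number of events -> higher entropy.
assert entropy([1/3] * 3) > entropy([0.9, 0.05, 0.05])
```

Base-2 logarithms give entropy in bits, matching the source-coding theorem's reading of entropy as the average number of bits needed to communicate an outcome.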