Keyword Extraction for Social Snippets
Today, a huge amount of text is being generated for social pur-poses on social networking services on the Web. Unlike traditionaldocuments, such text is usually extremely short and tends to be in-formal. Analysis of such text beneﬁt many applications such asadvertising, search, and content ﬁltering. In this work, we studyone traditional text mining task on such new form of text, that isextraction of meaningful keywords. We propose several intuitiveyet useful features and experiment with various classiﬁcation mod-els. Evaluation is conducted on Facebook data. Performances of various features and models are reported and compared.
Categories and Subject Descriptors:
H.3.1 [Content AnalysisandIndexing]: Abstractingmethods, H.4.m[InformationSystems]:Miscellaneous.
Algorithms, Experimentation, Performance.
keyword extraction, social snippet, online advertising.
Social networking services, such as Facebook, Twitter, and MSNMessenger, are developing at an amazing speed, leading to produc-tion of a special kind of user-generated text for social purposes.Usually, users generate such text to broadcast to friends or follow-ers their current status, recent news, or interesting events. We namesuch text
. Status updates on Facebook and Twittermessages are representative examples of social snippets.Social snippets are valuable media to mine users’ recent interest.For example, if someone publishes his/her social snippet saying“is excited to receive my Wii today”. It is a sign that this per-son is interested in Wii games. One way to discover such interestis to extract meaningful keywords from social snippets. Extractedkeywords can be used for many applications. They can be treatedas personal social tags to better retrieve user-generated informa-tion. They can facilitate social recommendations. Most impor-tantly, they are essential for advertisement targeting. For example,advertisers can target the users who might be interested in videogames using the keyword “Wii”.In this work, we seek to develop the effective keywords extrac-tion methods for social snippets in the form of a classiﬁcation task.There have been many related works on keyword extraction fromdocuments [4, 1] and web pages . However, the special charac-teristics of social snippets make the keyword extraction a new and
The work was supported in part by NSF IIS-0905215 and theAFOSR MURI award FA9550-08-1-0265.
Copyright is held by the author/owner(s).
April 26–30, 2010, Raleigh, North Carolina, USA.ACM 978-1-60558-799-8/10/04.
even more challenging problem. It is statistically shown that socialsnippets are extremely short and tend to be more informal. Ac-cordingly, insights from related works on traditional data no longerhold true. In the following, we report a set of features to measurethe importance of keywords and compare the performances amongvarious classiﬁcation models. We compare our approach with sev-eral other keyword extraction systems, such as KEA  and Yahoo!keyword extraction system. All the experiments are conducted ona sample of Facebook data.
For each social snippet, we ﬁrst tokenize it and generate uni-grams and bigrams to be considered as keyword candidates. Then,a set of features are calculated to represent each candidate. Wetrain a classiﬁcation model based on the labeled keywords of socialsnippets. And ﬁnally the keyword candidates with highest scoresthrough classiﬁcation model are returned.Here are the features:
: information retrieval feature.
is a commonlyused feature in information retrieval. Intuitively, the word that ap-pears more often in a document but not very often in the corpus ismore likely to be a keyword. However, in our problem, this featurecould be less useful. Because in short social snippets, the TF formost keyword candidates is
. Frequency does not help to distin-guish real keywords from others.
: linguistic feature.
In most cases, nouns have higher proba-bility to be keywords. For a word, feature
is the number of itsnoun POS tags divided by the total number of its POS tags. Thisfeature is not affected by the “short” and “informal” properties of social snippets so we expect this feature to be an important one forclassiﬁcation.
: relative position.
is deﬁned as the relative positionwithin a social snippet. For the bigrams, we consider the positionof the ﬁrst word.
: length of social snippet.
of a social snippet iscalculated by the number of words (including stopwords) in it. Forall the keyword candidates in one social snippet, they are assignedwith the same value.
A hot and trendy topic is more likelyto be a keyword in social snippets. Feature
reﬂects the popu-larity of a keyword.
This feature has a binary value showingwhether there is a upper case letter in a keyword candidate.After applying the classiﬁcation model on testing data, we canget a relevance score for each keyword candidate. Finally, we needto return the “top” candidates with highest score. The most com-mon way to select the “top” ones is to choose top-
candidates.However, we observe from our experiment dataset that about