
Finding High-Quality Content in Social Media

Eugene Agichtein, Emory University, Atlanta, USA (eugene@mathcs.emory.edu)
Carlos Castillo, Yahoo! Research, Barcelona, Spain (chato@yahoo-inc.com)
Debora Donato, Yahoo! Research, Barcelona, Spain (debora@yahoo-inc.com)
Aristides Gionis, Yahoo! Research, Barcelona, Spain (gionis@yahoo-inc.com)
Gilad Mishne, Search and Advertising Sciences, Yahoo! (gilad@yahoo-inc.com)

ABSTRACT

The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions (social media sites) becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, which can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing - indexing methods, linguistic processing; H.3.3 Information Search and Retrieval - information filtering, search process.

General Terms

Algorithms, Design, Experimentation.

Keywords

Social media, Community Question Answering, User Interactions.

1. INTRODUCTION

Recent years have seen a transformation in the type of content available on the web. During the first decade of the web's prominence, from the early 1990s onwards, most online content resembled traditional published material: the majority of web users were consumers of content, created by a relatively small number of publishers. From the early 2000s, user-generated content has become increasingly popular on the web: more and more users participate in content creation, rather than just consumption. Popular user-generated content (or social media) domains include blogs and web forums, social bookmarking sites, photo and video sharing communities, as well as social networking platforms such as Facebook and MySpace, which offer a combination of all of these with an emphasis on the relationships among the users of the community.

Community-driven question/answering portals are a particular form of user-generated content that has been gaining a large audience in recent years. These portals, in which users answer questions posed by other users, provide an alternative channel for obtaining information on the web: rather than browsing results of search engines, users present detailed information needs and get direct responses authored by humans. In some markets, this information seeking behavior is dominating over traditional web search [29].

An important difference between user-generated content and traditional content, which is particularly significant for knowledge-based media such as question/answering portals, is the variance in the quality of the content. As Anderson [3] describes, in traditional publishing, mediated by a publisher, the typical range of quality is substantially narrower than in niche, unmediated markets. The main challenge posed by content in social media sites is the fact that the distribution of quality has high variance: from very high-quality items to low-quality, sometimes abusive content. This makes the tasks of filtering and ranking in such systems more complex than in other domains. However, for information-retrieval tasks, social media systems present inherent advantages over traditional collections of documents: their rich structure offers more available data than in other domains. In addition to document content and link structure, social media exhibit a wide variety of user-to-document relation types, and user-to-user interactions.

In this paper we address the task of identifying high-quality content in community-driven question/answering sites, exploring the benefits of having additional sources of information in this domain. As a test case, we focus on Yahoo! Answers, a large portal that is particularly rich in the amount and types of content and social interaction available in it. We focus on the following research questions:

1. What are the elements of social media that can be used to facilitate automated discovery of high-quality content? In addition to the content itself, there is a wide array of non-content information available, from links between items to explicit and implicit quality ratings from members of the community. What is the utility of each source of information for the task of estimating quality?

2. How are these different factors related? Is content alone enough for identifying high-quality items?

3. Can community feedback approximate judgments of specialists?

To our knowledge, this is the first large-scale study combining the analysis of content with user feedback in social media. In particular, we model all user interactions in a principled graph-based framework (Section 3 and Section 4), allowing us to effectively combine the different sources of evidence in a classification formulation. Furthermore, we investigate the utility of the different sources of feedback in a large-scale experimental setting (Section 5) over the market-leading question/answering portal. Our experimental results show that these sources of evidence are complementary, and allow our system to exhibit high accuracy in the task of identifying content of high quality (Section 6). We discuss our findings and directions for future work in Section 7, which concludes this paper.

2. BACKGROUND AND RELATED WORK

Social media content has become indispensable to millions of users. In particular, community question/answering portals are a popular destination for users looking for help with a particular situation, for entertainment, and for community interaction. Hence, in this paper we focus on one particularly important manifestation of social media, community question/answering sites, and specifically on Yahoo! Answers. Our work draws on a significant amount of prior research on social media, and we outline the related work before introducing our framework in Section 3.

2.1 Yahoo! Answers

Yahoo! Answers (http://answers.yahoo.com/) is a question/answering system where people ask and answer questions on any topic. What makes this system interesting is that around a seemingly trivial question/answer paradigm, users are forming a social network characterized by heterogeneous interactions. As a matter of fact, users do not limit their activity to asking and answering questions, but also actively participate in regulating the whole system. A user can vote for answers of other users, mark interesting questions, and even report abusive behavior. Thus, overall, each user has a threefold role: asker, answerer and evaluator.

The central elements of the Yahoo! Answers system are questions. Each question has a lifecycle. It starts in an "open" state where it receives answers. Then at some point (decided by the asker, or by an automatic timeout in the system), the question is considered "closed", and can receive no further answers. At this stage, a "best answer" is selected either by the asker or through a voting procedure by other users; once a best answer is chosen, the question is "resolved".

As previously noted, the system is partially moderated by the community: any user may report another user's question or answer as violating the community guidelines (e.g., containing spam, adult-oriented content, copyrighted material, etc.). A user can also award a question a "star", marking it as an interesting question, can sometimes vote for the best answer to a question, and can give any answer a "thumbs up" or "thumbs down" rating, corresponding to a positive or negative vote respectively.

Yahoo! Answers is a very popular service (according to some reports, it reached a market share of close to 100% about a year after its launch [27]); as a result, it hosts a very large amount of questions and answers on a wide variety of topics, making it a particularly useful domain for examining content quality in social media. Similar existing and past services (some with a different model) include Amazon's Askville (http://askville.amazon.com/), Google Answers (http://answers.google.com/), and Yedda (http://yedda.com/).

2.2 Related work

Link analysis in social media. Link-based methods have been shown to be successful for several tasks in social media [30]. In particular, link-based ranking algorithms that were successful in estimating the quality of web pages have been applied in this context. Two of the most prominent link-based ranking algorithms are PageRank [25] and HITS [22].

Consider a graph G = (V, E) with vertex set V corresponding to the users of a question/answer system and having a directed edge e = (u, v) in E from a user u in V to a user v in V if user u has answered at least one question of user v. ExpertiseRank [32] corresponds to PageRank over the transposed graph G^T = (V, E^T), that is, a score is propagated from the person receiving the answer to the person giving the answer (a short sketch of this computation follows below). The recursion implies that if person u was able to provide an answer to person v, and person v was able to provide an answer to person w, then u should receive some extra points given that he/she was able to provide an answer to a person with a certain degree of expertise.

The HITS algorithm was applied over the same graph [8, 19] and was shown to produce good results in finding experts and/or good answers. The mutual reinforcement process in this case can be interpreted as "good questions attract good answers, and good answers are given to good questions"; we examine this assumption in Section 5.2.

Propagating reputation. Guha et al. [14] study the problem of propagating trust and distrust among Epinions (http://epinions.com/) users, who may assign positive (trust) and negative (distrust) ratings to each other. The authors study ways of combining trust and distrust and observe that, while considering trust as a transitive property makes sense, distrust cannot be considered transitive.
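The ExpertiseRank construction described above under "Link analysis in social media" is easy to prototype: reverse the edges of the "who answered whom" graph and run PageRank. The sketch below is illustrative only and is not the implementation used in [32]; the damping factor, iteration count, and toy edge list are assumptions, and the random walk is computed with plain power iteration to keep the example self-contained.

```python
# Minimal sketch: PageRank-style power iteration over a "who answered whom" graph.
# Edge (u, v) means: user u answered at least one question asked by user v.
# ExpertiseRank propagates score from the asker to the answerer, i.e. it runs
# PageRank on the transposed graph (edges reversed to point v -> u).

def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (source, target) pairs; returns {node: score}."""
    nodes = {n for e in edges for n in e}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if not targets:                       # dangling node: spread uniformly
                for n in nodes:
                    new[n] += damping * score[src] / len(nodes)
            else:
                for dst in targets:
                    new[dst] += damping * score[src] / len(targets)
        score = new
    return score

# Toy data: u answered a question asked by v.
answered = [("ann", "bob"), ("ann", "carol"), ("bob", "carol"), ("dave", "ann")]

# ExpertiseRank: reverse the edges so that score flows asker -> answerer.
expertise = pagerank([(v, u) for (u, v) in answered])
print(sorted(expertise.items(), key=lambda kv: -kv[1]))  # "ann" should rank highly
```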

Ziegler and Lausen [33] also study models for the propagation of trust. They present a taxonomy of trust metrics and discuss ways of incorporating information about distrust into the rating scores.

Question/answering portals and forums. The particular context of question/answering communities we focus on in this paper has been the object of some study in recent years. According to Su et al. [31], the quality of answers in question/answering portals is good on average, but the quality of specific answers varies significantly. In particular, in a study of the answers to a set of questions in Yahoo! Answers, the authors found that the fraction of correct answers to specific questions asked by the authors of the study varied from 17% to 45%. The fraction of questions in their sample with at least one good answer was much higher, varying from 65% to 90%, meaning that a method for finding high-quality answers can have a significant impact on users' satisfaction with the system.

Jeon et al. [17] extracted a set of features from a sample of answers in Naver (http://naver.com/), a Korean question/answering portal similar to Yahoo! Answers. They built a model for answer quality based on features derived from the particular answer being analyzed, such as answer length, number of points received, etc., as well as user features, such as fraction of best answers, number of answers given, etc. Our work expands on this by exploring a substantially larger range of features, including structural, textual, and community features, and by identifying the quality of questions in addition to answer quality.

Expert finding. Zhang et al. [32] analyze data from an online forum, seeking to identify users with high expertise. They study the user answers graph, in which there is a link between users u and v if u answers a question by v, applying both ExpertiseRank and HITS to identify users with high expertise. Their results show high correlation between link-based metrics and the answer quality. The authors also develop synthetic models that capture some of the characteristics of the interactions among users in their dataset.

Jurczyk and Agichtein [20] show an application of the HITS algorithm [22] to a question/answering portal. The HITS algorithm is run on the user-answer graph. The results demonstrate that HITS is a promising approach, as the obtained authority score is better correlated with the number of votes that the items receive than simply counting the number of answers the answerer has given in the past.

Campbell et al. [8] computed the authority score of HITS over the user-user graph in a network of e-mail exchanges, showing that it is more correlated to quality than other simpler metrics. Dom et al. [11] studied the performance of several link-based algorithms for ranking people by expertise on a network of e-mail exchanges, testing on both real and synthetic data, and showing that on real data ExpertiseRank outperforms HITS.

Text analysis for content quality. Most work on estimating the quality of text has been in the field of Automated Essay Scoring (AES), where writings of students are graded by machines on several aspects, including compositionality, style, accuracy, and soundness. AES systems are typically built as text classification tools, and use a range of properties derived from the text as features. Some of the features employed in such systems are lexical, such as word length, measures of vocabulary irregularity via repetitiveness [7] or uncharacteristic co-occurrence [9], and measures of topicality through word and phrase frequencies [28]. Other features take into account the usage of punctuation and the detection of common grammatical errors (such as subject-verb disagreements) via predefined templates [4, 24]. Most platforms are commercial and do not disclose full details of their internal feature set; overall, AES systems have been shown to correlate very well with human judgments [6, 24].

A different area of study involving text quality is readability; here, the difficulty of text is analyzed to determine the minimal age group able to comprehend it. Several measures of text readability have been proposed, including the Gunning-Fog Index [15], the Flesch-Kincaid Formula [21], and SMOG Grading [23] (sketched at the end of this section). All measures combine the number of syllables or words in the text with the number of sentences, the first being a crude approximation of the syntactic complexity and the second of the semantic complexity. Although simplistic and controversial, these methods are widely used and provide a rough estimation of the difficulty of text.

Implicit feedback for ranking. Implicit feedback from millions of web users has been shown to be a valuable source of result quality and ranking information. In particular, clicks on results and methods for interpreting the clicks have been studied in references [1, 18, 2]. We apply the results on click interpretation of web search results from these studies as a source of quality information in social media. As we will show, content usage statistics are valuable, but require a different interpretation than in the web search domain.
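To make the readability measures referenced above concrete, the snippet below evaluates the commonly cited forms of the Gunning-Fog index [15], the Flesch-Kincaid grade level [21], and the SMOG grade [23]. It assumes the word, sentence, syllable, and polysyllable counts are already available (a real implementation also needs a tokenizer and a syllable counter), and the example counts are invented.

```python
import math

def gunning_fog(words, sentences, complex_words):
    # "Complex" words are usually those with three or more syllables.
    return 0.4 * (words / sentences + 100.0 * complex_words / words)

def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_grade(sentences, polysyllables):
    # Polysyllable count scaled to a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * 30.0 / sentences) + 3.1291

# Toy counts for a short answer: 120 words, 8 sentences, 170 syllables,
# of which 14 words have 3+ syllables.
print(gunning_fog(120, 8, 14))            # ~10.7
print(flesch_kincaid_grade(120, 8, 170))  # ~7.0
print(smog_grade(8, 14))                  # ~10.7
```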

3. CONTENT QUALITY ANALYSIS IN SOCIAL MEDIA

We now focus on the task of finding high quality content, and describe our overall approach to solving this problem. Evaluation of content quality is an essential module for performing more advanced information-retrieval tasks on the question/answering system. For instance, a quality score can be used as input to ranking algorithms. On a high level, our approach is to exploit features of social media that are intuitively correlated with quality, and then train a classifier to appropriately select and weight the features for each specific type of item, task, and quality definition.

In this section we identify a set of features of social media and interactions that can be applied to the task of content-quality identification. In particular, we model the intrinsic content quality (Section 3.1), the interactions between content creators and users (Section 3.2), as well as the content usage statistics (Section 3.3). All these feature types are used as input to a classifier that can be tuned for the quality definition of the particular media type (Section 3.4). In the next section, we will expand and refine the feature set specifically to match our main application domain of community question/answering portals.

3.1 Intrinsic content quality

The intrinsic quality metrics (i.e., the quality of the content of each item) that we use in this research are mostly text-related, given that the social media items we evaluate are primarily textual in nature. For user-generated content of other types (e.g., photos or bookmarks), intrinsic quality may be modeled differently.

As a baseline, we use textual features only, with all word n-grams up to length 5 that appear in the collection more than 3 times used as features. This straightforward approach is the de-facto standard for text classification tasks, both for classifying the topic and for other facets (e.g., sentiment classification [26]).

Additionally, we use a large number of semantic features, organized as follows:

Punctuation and typos. Poor quality text, and particularly of the type found in online sources, is often marked by low conformance to common writing practices. For example, capitalization rules may be ignored; excessive punctuation, particularly repeated ellipses and question marks, may be used; or spacing may be irregular. Several of our features capture the visual quality of the text, attempting to model these irregularities; among these are features measuring punctuation, capitalization, and spacing density (percent of all characters), as well as features measuring the character-level entropy of the text. A particular form of low visual quality is misspellings and typos; additional features in our set quantify the number of spelling mistakes, as well as the number of out-of-vocabulary words. (To identify out-of-vocabulary words, we construct multiple lists of the k most frequent words in Yahoo! Answers, with several k values ranging between 50 and 5000. These lists are then used to calculate a set of out-of-vocabulary features, where each feature assumes the list of top-k words for some k is the vocabulary. An example feature created this way is the fraction of words in an answer that do not appear in the top-1000 words of the collection.)

Syntactic and semantic complexity. Advancing from the punctuation level to more involved layers of the text, other features in this subset quantify its syntactic and semantic complexity. These include simple proxies for complexity such as the average number of syllables per word or the entropy of word lengths, as well as more intricate ones such as the readability measures [15, 21, 23] mentioned in Section 2.2.

Grammaticality. Finally, to measure the grammatical quality of the text, we use several linguistically-oriented features. We annotate the content with part-of-speech (POS) tags, and use the tag n-grams (again, up to length 5) as features. This allows us to capture, to some degree, the level of correctness of the grammar used. Some part-of-speech sequences are typical of correctly-formed questions: e.g., the sequence "when|how|why to (verb)" (as in "how to identify...") is typical of lower-quality questions, whereas the sequence "when|how|why (verb) (personal pronoun) (verb)" (as in "how do I remove...") is more typical of correctly-formed content.

Additional features used to represent grammatical properties of the text are its formality score [16], and the distance between its (trigram) language model and several given language models, such as the Wikipedia language model or the language model of the Yahoo! Answers corpus itself (the distance is measured with KL-divergence).

3.2 User relationships

A significant amount of quality information can be inferred from the relationships between users and items. For example, we could apply link-analysis algorithms for propagating quality scores among the entities of the question/answer system, using the intuition that good answerers write good answers, or vote for other good answerers. The main challenge we have to face is that our dataset, viewed as a graph, contains nodes of multiple types (e.g., questions, answers, users), and edges represent a set of interactions among the nodes with different semantics (e.g., "answers", "gives best answer", "votes for", "gives a star to").

These relationships are represented as edges in a graph, with content items and users as nodes. The edges are typed, i.e., labeled with the particular type of interaction (e.g., "user u answers question q"). Besides the user-item relationship graph, we also consider the user-user graph. This is the graph G = (V, E) in which the set of vertices V is composed of the set of users, and the set E represents implicit relationships between users. For example, a user-user relationship could be "user u has answered a question from user v".

The resulting user-user graph is extremely rich and heterogeneous, and is unlike traditional graphs studied in the web link analysis setting. However, we believe that (in our classification framework) traditional link analysis algorithms may provide useful evidence for quality classification, tuned for the particular domain. Hence, for each type of link we performed a separate computation of each link-analysis algorithm. We computed the hub and authority scores (as in the HITS algorithm [22]), and the PageRank scores [25]. In Section 4 we discuss the specific relationships and node types developed for community question/answering.

3.3 Usage statistics

Readers of the content (who may or may not also be contributors) provide valuable information about the items they find interesting. In particular, usage statistics such as the number of clicks on an item and the dwell time have been shown useful in the context of identifying high quality web search results, and are complementary to link-analysis based methods. Intuitively, usage statistics are useful for social media content, but require a different interpretation than in the previously studied settings.

For example, all items within a popular category such as celebrity images or popular culture topics may receive orders of magnitude more clicks than, for instance, science topics. Nevertheless, when normalized by the item category, the deviation from the expected number of clicks can be used to infer quality directly, or can be incorporated into the classification framework. The specific usage statistics that we use are described in Section 4.3.

3.4 Overall classification framework

We cast the problem of quality ranking as a binary classification problem, in which a system must learn automatically to separate high-quality content from the rest.

We experimented with several classification algorithms, including those reported to achieve good performance on text classification tasks, such as support vector machines and log-linear classifiers; the best performance among the techniques we tested was obtained with stochastic gradient boosted trees [13]. In this classification framework, a sequence of (typically simple) decision trees is constructed so that each tree minimizes the error on the residuals of the preceding sequence of trees; a stochastic element is added by randomly sampling the data repeatedly before each tree construction, to prevent overfitting. A particularly useful aspect of boosted trees for our setting is their ability to utilize combinations of sparse and dense features.
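The experiments in this paper use in-house classification software; as a rough, hedged analogue of the setup just described, the sketch below concatenates a few dense feature groups and trains scikit-learn's gradient boosting classifier, where subsample < 1.0 supplies the stochastic element. All feature names and values are placeholders invented for illustration.

```python
# Illustrative sketch only (not the authors' system): combine dense feature
# groups and train stochastic gradient boosted trees with scikit-learn.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical feature groups for each item (values are random placeholders).
intrinsic = rng.normal(size=(n, 5))     # e.g. punctuation density, entropy, ...
relation  = rng.normal(size=(n, 4))     # e.g. hub/authority/PageRank scores
usage     = rng.normal(size=(n, 2))     # e.g. normalized clickthrough
X = np.hstack([intrinsic, relation, usage])
y = (X[:, 0] + 0.5 * X[:, 5] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# subsample < 1.0 draws a random fraction of the data before fitting each tree,
# which is the "stochastic" part of stochastic gradient boosting.
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, subsample=0.7, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```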

Given a set of human-labeled quality judgments, the classifier is trained on all available features, combining evidence from semantic, user relationship, and content usage sources. The judgments are tuned for the particular goal. For example, we could use this framework to classify questions by genre or asker expertise. In the case of community question/answers, described next, our goal is to discover interesting, well formulated, and factually accurate content.

4. MODELING CONTENT QUALITY IN COMMUNITY QUESTION/ANSWERING

Our goal is to automatically assess the quality of questions and answers provided by users of the system. We believe that this particular sub-problem of quality evaluation is an essential module for performing more advanced information-retrieval tasks on the question/answering or web search system. For example, a quality score can be used as a feature for ranking search results in this system.

Note that Yahoo! Answers is question-centric: the interactions of users are organized around questions, and the main forms of interaction among the users are (i) asking a question, (ii) answering a question, (iii) selecting a best answer, and (iv) voting on an answer. These relationships are explicitly modeled in the relational features described next.

4.1 Application-specific user relationships

Our dataset, viewed as a graph, contains multiple types of nodes and multiple types of interactions, as illustrated in Figure 1.

[Figure 1: Partial entity-relationship diagram of answers.]

The relationships between questions, users asking and answering questions, and answers can be captured by a tripartite graph outlined in Figure 2, where an edge represents an explicit relationship between the different node types. Since a user is not allowed to answer his/her own questions, there are no triangles in the graph, so in fact all cycles in the graph have length at least 6.

[Figure 2: Interaction of users-questions-answers modeled as a tri-partite graph.]

We use multi-relational features to describe multiple classes of objects and multiple types of relationships between these objects. In this section, we expand on the general user relationship ideas of the previous section to develop specific relational features that exploit the unique characteristics of the community question/answering domain.

Answer features. In Figure 3, we show the user relationship data that is available for a particular answer. The types of the data related to a particular answer form a tree, in which the type "Answer" is the root. So, an answer a in A is at the 0-th level of the tree, the question q that a answers and the user u who posted a are at the first level of the tree, and so on.

To streamline the process of exploring new features, we suggest naming the features with respect to their position in this tree. Each feature corresponds to a data type, which resides in a specific node in the tree, and thus it is characterized by the path from the root of the tree to that node.

[Figure 3: Types of features available for inferring the quality of an answer.]

Hence, each specific feature can be represented by a path in the tree (following the direction of the edges). For instance, a feature of the type QU represents information about a question (Q) and the user (U) who asked that question. In Figure 3, we can see two subtrees starting from the answer being evaluated: one related to the question being answered, and the other related to the user contributing the answer.

The types of features on the question subtree are:
  Q    Features from the question being answered
  QU   Features from the asker of the question being answered
  QA   Features from the other answers to the same question

The types of features on the user subtree are:
  UA   Features from the answers of the user
  UQ   Features from the questions of the user
  UV   Features from the votes of the user
  UQA  Features from answers received to the user's questions
  U    Other user-based features

This string notation allows us to group several features into one bundle by using the wildcard characters "?" (one letter) and "*" (multiple letters). For instance, U* represents all the features on the user subtree, and Q* all the features on the question subtree (a small illustrative sketch of this naming scheme follows below).
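As a concrete (and purely hypothetical) illustration of this naming scheme, the sketch below resolves a path string such as "UA" or "QA" by walking object references starting from the answer being evaluated and averaging a value over the objects reached. The data classes and the aggregation by averaging are assumptions made for the example, not the authors' implementation.

```python
# Hypothetical sketch of resolving path-named features such as "QU" or "UA".
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class User:
    answers: list = field(default_factory=list)    # answers written by the user
    questions: list = field(default_factory=list)  # questions asked by the user

@dataclass
class Question:
    asker: User
    answers: list = field(default_factory=list)

@dataclass
class Answer:
    question: Question
    author: User
    length: int = 0

def feature(answer, path, leaf):
    """Walk `path` from the answer (the tree root), then aggregate `leaf` over
    the objects reached: Q -> question, U -> user, A -> answers."""
    nodes = [answer]
    for step in path:
        if step == "Q":
            nodes = [n.question for n in nodes]
        elif step == "U":
            nodes = [n.asker if isinstance(n, Question) else n.author for n in nodes]
        elif step == "A":
            nodes = [a for n in nodes for a in n.answers]
    values = [leaf(n) for n in nodes]
    return mean(values) if values else 0.0

# Toy data: one question by bob, two answers by alice.
alice, bob = User(), User()
q = Question(asker=bob)
a1 = Answer(question=q, author=alice, length=120); alice.answers.append(a1)
a2 = Answer(question=q, author=alice, length=40);  alice.answers.append(a2)
q.answers.extend([a1, a2])

print(feature(a1, "UA", lambda ans: ans.length))  # avg length of the author's answers
print(feature(a1, "QA", lambda ans: ans.length))  # avg length of answers to the question
```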

Question features. We represent user relationships around a question similarly to the way we represent relationships around an answer. These relationships are depicted in Figure 4. Again, there are two subtrees: one related to the asker of the question, and the other related to the answers received.

The types of features on the answers subtree are:
  A    Features directly from the answers received
  AU   Features from the answerers of the question being answered

The types of features on the user subtree are the same as the ones above for evaluating answers.

[Figure 4: Types of features available for inferring the quality of a question.]

Implicit user-user relations. As stated in Section 3.2, besides the user-question-answer graph, we also consider the user-user graph. This is the graph G = (V, E) in which the set of vertices V is composed of the set of users, and the set E = Ea ∪ Eb ∪ Ev ∪ Es ∪ E+ ∪ E- represents the relationships between users as follows:

- Ea represents the answers: (u, v) ∈ Ea iff user u has answered at least one question asked by user v.
- Eb represents the best answers: (u, v) ∈ Eb iff user u has provided at least one best answer to a question asked by user v.
- Ev represents the votes for best answer: (u, v) ∈ Ev iff user u has voted as best answer at least one answer given by user v.
- Es represents the stars given to questions: (u, v) ∈ Es iff user u has given a star to at least one question asked by user v.
- E+/E- represents the thumbs up/down: (u, v) ∈ E+/E- iff user u has given a thumbs up/down to an answer by user v.

For each graph Gx = (V, Ex), we denote by hx the vector of hub scores on the vertices V, by ax the vector of authority scores, and by px the vector of PageRank scores. We also denote by p'x the vector of PageRank scores in the transposed graph.

To classify these features in our framework, we consider that PageRank and authority scores are related mostly to in-links, while the hub score deals mostly with out-links. For instance, let us take hb. It is the hub score in the best-answer graph, in which an out-link from u to v means that u gave a best answer to user v. Then, hb represents the answers of users, and is assigned to the answerer record (UA). The assignment of these features is done in the following way:

  UQ   To the asker record of a user: aa, ab, as, pa, pb
  UA   To the answerer record of a user: ha, hb, p'a, p'b, av, pv, a+, p+, a-, p-
  UV   To the voter record of a user: hv, p'v, hs, p's, h+, p'+, h-, p'-

4.2 Content features for QA

As the base content quality features for question and answer text individually, we directly use the semantic features from Section 3.1. We rely on feature selection methods and the classifier to identify the most salient features for the specific tasks of question or answer quality classification.

Additionally, we devise a set of features specific to the QA domain that model the relationship between a question and an answer. Intuitively, a copy of a Wall Street Journal article about the economy may have good quality, but would not (usually) be a good answer to a question about celebrity fashion. Hence, we explicitly model the relationship between the question and the answer. To represent this we include the KL-divergence between the language models of the two texts, their non-stopword overlap, the ratio between their lengths, and other similar features. Interestingly, the text of answers often relates to other answers for the same question. While this information is difficult to capture explicitly, we believe that our semantic feature space is rich enough to allow a classifier to effectively detect quality questions (and answers).

4.3 Usage features for QA

Recall that community QA is question-centric: a question thread is usually viewed as a whole, and the content usage statistics are available primarily for the complete question thread. As a base set of content usage features we use the number of item views (clicks).

In addition, we exploit the rich set of metadata available for each question. This includes temporal statistics, e.g., how long ago the question was posted, which allows us to give a better interpretation to the number of views of a question. Also, given that clickthrough counts on a question are heavily influenced by its topical and genre category, we also use derived statistics. These statistics include the expected number of views for a given category, the deviation from the expected number of views, and other second-order statistics designed to normalize the values for each item type. For example, one of the features is computed as the click frequency normalized by subtracting the expected click frequency for that category, divided by the standard deviation of click frequency for the category.
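The category normalization just described can be written down directly as a per-category z-score of the raw click counts. The sketch below uses invented category names and counts, and the exact second-order statistics used in the paper's features may differ.

```python
# Illustrative sketch: normalize question clickthrough by topical category,
# z = (clicks - mean(category)) / std(category), as described in Section 4.3.
from collections import defaultdict
from statistics import mean, pstdev

# (question_id, category, clicks) -- invented numbers for illustration.
rows = [
    ("q1", "celebrities", 900), ("q2", "celebrities", 1200), ("q3", "celebrities", 300),
    ("q4", "science",      40), ("q5", "science",       70), ("q6", "science",      25),
]

by_cat = defaultdict(list)
for _, cat, clicks in rows:
    by_cat[cat].append(clicks)

stats = {cat: (mean(v), pstdev(v)) for cat, v in by_cat.items()}

normalized = {}
for qid, cat, clicks in rows:
    mu, sigma = stats[cat]
    normalized[qid] = (clicks - mu) / sigma if sigma > 0 else 0.0

print(normalized)  # a science question with 70 clicks scores higher than a
                   # celebrity question with 900, despite far fewer raw clicks
```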

In summary, while many of the item content, user relationship, and usage statistics features are designed for and applicable to many types of social media, we augment the general feature set with additional information specific to the community question/answering domain. As we will show in the empirical evaluation presented in the next sections, both the generally applicable and the domain-specific features turn out to be significant for quality identification.

5. EXPERIMENTAL SETTING

This section describes the experimental setting, datasets, and metrics used for producing our results in Section 6.

5.1 Dataset

Our dataset consists of 6,665 questions and 8,366 question/answer pairs. The base usage features (page views or clicks) were obtained from the total number of times a question thread was clicked (e.g., in response to a search result).

All of the above questions were labeled for quality by human editors, who were independent from the team that conducted this research. Editors graded questions and answers for well-formedness, readability, utility, and interestingness; for answers, an additional correctness element was taken into account. Additionally, a high-level type (informational, advice, poll, etc.) was assigned to each question. The assessors were also asked to look at the type of the questions. They found that roughly 1/4 of the questions were seeking an opinion (instead of information or advice). In a subset of 300 questions from this dataset, the inter-annotator agreement for the question quality rating was κ = 0.68.

Following links to obtain user relationship features. Starting from the questions and answers included in the evaluation dataset, we considered related questions and answers as follows. Let Q0 and A0 be the sets of questions and answers, respectively, included in the evaluation dataset. Now let U1 be the set of users who have asked a question in Q0 or given an answer in A0. Additionally, we select Q1 to be the set of all questions asked by all users in U1. Similarly, we select A1 to be the set of answers given by users in U1, and A2 to be the set of all the answers to questions in Q1. Obviously Q0 ⊆ Q1 and A0 ⊆ A1. Our dataset is then defined by the nodes (Q1, A1 ∪ A2, U1) and the edges induced from the whole dataset.

Figure 5 depicts the process of finding related items. The relative size of the portion we used (depicted with thick lines) is exaggerated for illustration purposes: actually, the data we use is a tiny fraction of the whole collection.

[Figure 5: Sketch showing how we find related questions and answers, depicted with thick lines in the figure. All the questions Q0 and answers A0 evaluated by the editors are included at the beginning, and then (1) all the askers U0 of the questions in Q0, (2) all the answerers U0 of the answers in A0, (3) all the questions Q1 by users in U0, (4) all the answers A1 by users in U0, and (5) all the answers A2 to questions in Q1.]

This process of following links to include a subset of the data only applies to questions and answers. In contrast, for the user rating features, we included all of the votes received and given by the users in U1 (including votes for best answers, stars for good questions, thumbs up and thumbs down), and all of the abuse reports written and received.

5.2 Dataset statistics

The degree distributions of the user interaction graphs described earlier are very skewed. The (complementary) cumulative distribution of the number of answers, best answers, and votes given and received is shown in Figure 6. The distribution of the number of votes given and received by the users can be modeled accurately by Pareto distributions with exponents 1.7 and 1.9, respectively.

In each of the graphs Gx = (V, Ex), with x in {a, b, v, s, +, -}, we computed the hub and authority scores (as in the HITS algorithm [22]), and the PageRank scores [25]. Note that in all cases we execute HITS and PageRank on a subgraph of the graph induced by the whole dataset, so the results might differ from the results that one would obtain if executing those algorithms on the whole graph.

The distributions of answers given and received are very similar to each other, in contrast to [12], where there were clearly "askers" and "answerers" with different types of behaviors. Indeed, in our sample of users, most users participate as both askers and answerers. From the scatter-plot in Figure 7, we observe that there are no clear roles of "asker" and "answerer" such as the ones identified by Fisher et al. [12] in USENET newsgroups. The fact that only users with many questions also have many answers is a by-product of the incentive mechanism of the system (points), where a certain number of points is required to ask a question, and points are gained mostly by answering questions.

In our evaluation dataset there is a positive correlation between question quality and answer quality. In Table 1 we can see that good answers are much more likely to be written in response to good questions, and bad questions are the ones that attract more bad answers. This observation is an important consideration for feature design.

Table 1: Relationship between question quality and answer quality

                      Question Quality
  Answer Quality    A. High   B. Medium   C. Low
  A. High             41%        15%         8%
  B. Medium           53%        76%        74%
  C. Low               6%         9%        18%
  Total              100%       100%       100%
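A column-normalized contingency table such as Table 1 can be computed directly from the editorial labels. A minimal sketch with pandas, using a handful of invented question/answer quality labels:

```python
# Sketch: contingency table of answer quality vs. question quality, expressed
# as column percentages in the spirit of Table 1. Labels below are invented.
import pandas as pd

labels = pd.DataFrame({
    "question_quality": ["High", "High", "Medium", "Low", "Medium", "High", "Low", "Medium"],
    "answer_quality":   ["High", "Medium", "Medium", "Low", "Medium", "High", "Medium", "Low"],
})

table = pd.crosstab(labels["answer_quality"], labels["question_quality"],
                    normalize="columns") * 100
print(table.round(1))  # each column sums to 100%
```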

[Figure 6: Distribution of degrees in the graph representing relationships between users: (a) number of answers given and received; (b) number of best answers given and received; (c) number of votes given; and (d) number of votes received. The votes include votes for best answer, stars, thumbs up and thumbs down. Each panel shows a log-log complementary cumulative distribution; the fitted lines in panels (c) and (d) follow x^-0.74 and x^-0.85, respectively.]

[Figure 7: Number of questions and number of answers for each user in our data (log-log scatter plot).]

5.3 Evaluation metrics and methodology

Recall that we want to automatically separate high-quality content from the rest. Since the class distribution is not balanced, we report the precision and recall for the two classes, "high quality" and "normal or low quality", separately; both are measured when the classifier threshold is set to maximize the F1 measure. We also report the area under the ROC curve for the classifiers, as a non-parametric single estimator of their accuracy.

For our classification task we used the 6,665 questions and 8,366 question/answer pairs of our base dataset, i.e., the sets Q0 and A0. The classification tasks are performed using our in-house classification software. The classification measures reported in the next section are obtained using 10-fold cross-validation on our base dataset. The sets Q1, U1, A1, and A2 are used only for extracting the additional user-relationship features for the sets Q0 and A0.

6. EXPERIMENTAL RESULTS

In this section we show the results for answer and question content quality. Recall that as a baseline we use only textual features for the current item (answer/question), at the level of the trees introduced in Section 4.1. In the experiments reported here, 80% of our data was used as a training set and the rest for testing.

6.1 Question quality

Table 2 shows the classification performance of the question classifier, using different subsets of our feature set. Text refers to the baseline bag-of-n-grams features; Intrinsic refers to the features derived from the text, described in Section 3.1; Usage refers to the click-based knowledge described in Section 3.3; and Relation features are those involving the community behavior, described in Section 3.2.

Clearly, a standard text classification approach (used in our baseline, the first line in Table 2) does not adequately address the task of identifying high quality content; but relying exclusively on usage patterns, relations, or intrinsic quality features derived from the text (the next 3 lines in the table) results in suboptimal solutions too.

In line with intuition, we witness a consistent, gradual increase in performance as additional information is made available to the classifier, indicating that the different feature sets we use provide, to some extent, independent information.
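The evaluation protocol of Section 5.3 (precision and recall at the classifier threshold that maximizes F1, together with the area under the ROC curve) can be reproduced with standard tooling. A hedged sketch with scikit-learn, using invented scores and labels:

```python
# Sketch: pick the score threshold that maximizes F1, then report precision,
# recall and ROC AUC, as in Section 5.3. Scores/labels below are invented.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
scores = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.5, 0.3, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# precision/recall have one more entry than thresholds; drop the final point.
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best = int(np.argmax(f1))

print("threshold:", thresholds[best])
print("precision:", precision[best], "recall:", recall[best], "F1:", f1[best])
print("AUC:", roc_auc_score(y_true, scores))
```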

Table 2: Precision P, Recall R, and Area Under the ROC Curve for the task of finding high-quality questions

                         High qual.        Normal/low qual.
  Method                 P      R          P      R        AUC
  Text (Baseline)        0.654  0.481      0.762  0.867    0.523
  Usage                  0.594  0.470      0.755  0.836    0.508
  Relation               0.694  0.603      0.806  0.861    0.614
  Intrinsic              0.746  0.650      0.829  0.885    0.645
  T+Usage                0.683  0.571      0.798  0.865    0.575
  T+Relation             0.739  0.647      0.828  0.881    0.659
  T+Intrinsic            0.757  0.650      0.830  0.891    0.648
  T+Intr.+Usage          0.717  0.690      0.845  0.861    0.686
  T+Relation+Usage       0.722  0.690      0.845  0.865    0.679
  T+Intr.+Relation       0.798  0.752      0.874  0.901    0.749
  All                    0.794  0.771      0.885  0.898    0.761

The 20 most significant features for question quality classification, according to a chi-squared test, included features from all subsets, as follows:

- UQV: Average number of "stars" given to questions by the same asker.
- The punctuation density in the question's subject.
- The question's category (assigned by the asker).
- Normalized Clickthrough: the number of clicks on the question thread, normalized by the average number of clicks for all questions in its category.
- UAV: Average number of "thumbs up" received by answers written by the asker of the current question.
- Number of words per sentence.
- UA: Average number of answers with references (URLs) given by the asker of the current question.
- UQ: Fraction of questions asked by the asker in which he/she opens the question's answers to voting (instead of picking the best answer by hand).
- UQ: Average length of the questions by the asker.
- UAV: The number of best answers authored by the user.
- U: The number of days the user was active in the system.
- UAV: "Thumbs up" received by the answers written by the asker of the current question, minus "thumbs down", divided by the total number of thumbs received.
- Clicks over Views: the number of clicks on a question thread divided by the number of times the question thread was retrieved as a search result (see [2]).
- The KL-divergence between the question's language model and a model estimated from a collection of questions answered by the Yahoo editorial team (available at http://ask.yahoo.com).
- The fraction of words that are not in the list of the top-10 words in the collection, ranked by frequency.
- The number of capitalization errors in the question (e.g., a sentence not starting with a capitalized word).
- U: The number of days that have passed since the asker wrote his/her first question or answer in the system.
- UAV: The total number of answers of the asker that have been selected as the best answer.
- UQ: The number of questions that the asker has asked in his/her most active category, over the total number of questions that the asker has asked.
- The entropy of the part-of-speech tags of the question.

In the above list, the features shown without a relational prefix (such as UQV or UA) are the intrinsic or usage features, which are obtained directly from the questions and for which we do not follow any path on the data graph.

We performed a comprehensive exploration of our feature spaces, focusing in particular on the user relational features and the content usage features. Due to space constraints, we discuss here only the effectiveness of the different content usage, or implicit feedback, features. These features are derived from page view statistics as described in Section 3.3. A variant of the C4.5 decision tree classifier was used to predict quality based on click features alone. Table 3 breaks down the classification performance by feature type.

Table 3: Overall Precision, Recall, and F1 for the task of finding high-quality questions using only usage features

  Features                     Precision  Recall  F1
  Page Views                   0.540      0.250   0.345
  + Question category          0.600      0.410   0.510
  + Deviation from expected    0.630      0.460   0.530
  All Usage features           0.594      0.470   0.530
  Top 10 Usage features        0.630      0.540   0.580

These results support our hypothesis that topical category information is crucial for interpreting usage statistics. As we can see, normalizing the raw page view counts by question category significantly improves the accuracy, as does modeling the deviation from the expected page view count, which provides additional improvement. Finally, including the top 10 content usage features selected according to the chi-squared statistic provides some additional improvement. Interestingly, including all derived features similar to those described in [1] actually degrades performance, indicating overfitting when relying on usage statistics alone without the benefit of other forms of user feedback.

Because of the effectiveness of the relational and usage features at independently identifying high-quality content, we hypothesized that a variant of co-training or co-boosting [10], or using a Maximum Entropy classifier [5], would be more effective for expanding the training set in a partially supervised setting. However, our experiments did not result in improved classification accuracy, and this remains an open question for future work.

6.2 Answer quality

Table 4 shows the classification performance of the answer classifier, again examining different subsets of our feature set. In this case, we did not use the Usage subset, as there are no separate clicks on answers within Yahoo! Answers (an answer is displayed on the question page, alongside other answers to the question). Our high precision and recall scores show that for the task of assessing answer quality, the performance of our system is close to the performance achieved by humans.

Table 4: Precision P, Recall R, and Area Under the ROC Curve for the task of finding high-quality answers

                       High qual.        Normal/low qual.
  Method               P      R          P      R        AUC
  Text (Baseline)      0.668  0.862      0.968  0.906    0.805
  Relation             0.552  0.617      0.914  0.890    0.623
  Intrinsic            0.712  0.918      0.981  0.918    0.869
  T+Relation           0.688  0.851      0.965  0.915    0.821
  T+Intrinsic          0.711  0.926      0.982  0.917    0.878
  All                  0.730  0.911      0.979  0.926    0.873

Once again, we observe an increase in performance attributable to both additional feature sets used; however, in this case the improvement is milder. An examination of the data shows that one particular feature, the answer length, is dominating over the other features, resulting in relatively high performance of the baseline.

The 20 most significant features for answer quality, according to a chi-squared test, were:

- Answer length.
- The number of words in the answer with a corpus frequency larger than c.
- UAV: The number of "thumbs up" minus "thumbs down" received by the answerer, divided by the total number of thumbs s/he has received.
- The entropy of the trigram character-level model of the answer.
- UAV: The fraction of answers of the answerer that have been picked as best answers (either by the askers of such questions, or by a community voting).
- The number of unique words in the answer.
- U: Average number of abuse reports received by the answerer over all his/her questions and answers.
- UAV: Average number of abuse reports received by the answerer over his/her answers.
- The non-stopword word overlap between the question and the answer.
- The Kincaid [21] score of the answer.
- QUA: The average number of answers received by the questions asked by the asker of this answer.
- The ratio between the length of the question and the length of the answer.
- UAV: The number of "thumbs up" minus "thumbs down" received by the answerer.
- QUAV: The average number of thumbs received by the answers to other questions asked by the asker of this answer.
- The entropy of the unigram character-level model of the answer.
- The KL-divergence between the answer's language model and a model estimated from the Wikipedia discussion pages.
- QU: Number of abuse reports received by the asker of the question being answered.
- QUQA: The sum of the lengths of all the answers received by the asker of the question being answered.
- QUQAV: The sum of the "thumbs down" received by the answers received by the asker of the question being answered.
- QUQAV: The average number of answers with votes in the questions asked by the asker of the question being answered.

ROC curves for the baseline question and answer classifiers from Tables 2 and 4, as well as for the classifiers with the maximal area under the curve appearing in these tables, are shown in Figure 8.

[Figure 8: ROC curves for the best-performing classifier, for the task of finding high-quality questions (top) and high-quality answers (bottom). Each panel plots the true positive rate against the false positive rate for the Best and Baseline classifiers.]

7. CONCLUSIONS

We presented a general classification framework for quality estimation in social media. As part of our work we developed a comprehensive graph-based model of contributor relationships and combined it with content- and usage-based features. We have successfully applied our framework to identifying high quality items in a web-scale community question answering portal, resulting in a high level of accuracy on the question and answer quality classification task. Community QA is a popular information seeking paradigm that has already entered the mainstream, and our results provide significant understanding of this new domain.

We investigated the contributions of the different sources of quality evidence, and have shown that some of the sources are complementary, i.e., they capture the same high-quality content from different perspectives. The combination of several types of sources of information is likely to increase the classifier's robustness to spam, as an adversary is required not only to create content that deceives the classifier, but also to simulate realistic user relationships or usage statistics. In the future, we plan to explore the relationship and usage features more specifically to automatically identify malicious users.

We demonstrated the utility of our approach on a large-scale community QA site. However, we believe that our results and insights are applicable to other social media settings, and to other emerging domains centered around user-contributed content.

ACKNOWLEDGEMENTS

The authors thank Byron Dom, Benoit Dumoulin, and Ravi Kumar for many useful discussions.

8. REFERENCES

[1] E. Agichtein, E. Brill, S. T. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In SIGIR, pages 3-10, 2006.

[2] K. Ali and M. Scarr. Robust methodologies for modeling web click distributions. In WWW, pages 511-520, 2007.
[3] C. Anderson. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, July 2006.
[4] Y. Attali and J. Burstein. Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3), February 2006.
[5] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[6] J. Burstein, K. Kukich, S. Wolff, C. Lu, M. Chodorow, L. Braden-Harder, and M. D. Harris. Automated scoring using a hybrid feature identification technique. In Proceedings of the 17th International Conference on Computational Linguistics, pages 206-210, Morristown, NJ, USA, 1998. Association for Computational Linguistics.
[7] J. Burstein and M. Wolska. Toward evaluation of writing style: finding overly repetitive word use in student essays. In EACL '03: Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, pages 35-42, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[8] C. S. Campbell, P. P. Maglio, A. Cozzi, and B. Dom. Expertise identification using email communications. In Proceedings of CIKM, pages 528-531, New Orleans, LA, USA, 2003.
[9] M. Chodorow and C. Leacock. An unsupervised method for detecting grammatical errors. In Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics, pages 140-147, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[10] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Natural Language Processing and Very Large Corpora, 1999.
[11] B. Dom, I. Eiron, A. Cozzi, and Y. Zhang. Graph-based ranking algorithms for e-mail expertise analysis. In Proceedings of the Workshop on Data Mining and Knowledge Discovery, pages 42-48, San Diego, CA, USA, 2003. ACM Press.
[12] D. Fisher, M. Smith, and H. T. Welser. You are who you talk to: Detecting roles in usenet newsgroups. Volume 3, pages 59b-59b, 2006.
[13] J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.
[14] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust and distrust. In WWW '04: Proceedings of the 13th International Conference on World Wide Web, pages 403-412, New York, NY, USA, 2004. ACM Press.
[15] R. Gunning. The Technique of Clear Writing. McGraw-Hill, 1952.
[16] F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure. Context in Context, Special Issue of Foundations of Science, 7(3):293-340, 2002.
[17] J. Jeon, B. W. Croft, J. H. Lee, and S. Park. A framework to predict the quality of answers with non-textual features. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 228-235, New York, NY, USA, 2006. ACM Press.
[18] T. Joachims, L. A. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In SIGIR, pages 154-161, 2005.
[19] P. Jurczyk and E. Agichtein. Discovering authorities in question answer communities using link analysis. In ACM Sixteenth Conference on Information and Knowledge Management (CIKM), 2007.
[20] P. Jurczyk and E. Agichtein. HITS on question answer portals: an exploration of link analysis for author ranking. In SIGIR (posters). ACM, 2007.
[21] J. P. Kincaid, R. P. Fishburn, R. L. Rogers, and B. S. Chissom. Derivation of new readability formulas for navy enlisted personnel. Technical Report, Research Branch Report 8-75, Millington, Tenn., Naval Air Station, 1975.
[22] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[23] G. H. McLaughlin. SMOG grading: A new readability formula. Journal of Reading, 12(8):639-646, 1969.
[24] E. B. Page. Computer grading of student prose, using modern concepts and software. Journal of Experimental Education, 62(2), 1994.
[25] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.
[26] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques, May 2002.
[27] L. Prescott. Yahoo! Answers captures 96% of Q and A market share, 2006.
[28] L. M. Rudner and T. Liang. Automated essay scoring using Bayes. Journal of Technology, Learning, and Assessment, 1(2), June 2002.
[29] C. Sang-Hun. To outdo Google, Naver taps into Korea's collective wisdom. International Herald Tribune, July 4, 2007.
[30] J. P. Scott. Social Network Analysis: A Handbook. SAGE Publications, January 2000.
[31] Q. Su, D. Pavlov, J.-H. Chow, and W. C. Baker. Internet-scale collection of human-reviewed data. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 231-240, New York, NY, USA, 2007. ACM Press.
[32] J. Zhang, M. S. Ackerman, and L. Adamic. Expertise networks in online communities: structure and algorithms. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 221-230, New York, NY, USA, 2007. ACM Press.
[33] C.-N. Ziegler and G. Lausen. Propagation models for trust and distrust in social networks. Information Systems Frontiers, 7(4-5):337-358, December 2005.

