Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Download
Standard view
Full view
of .
Save to My Library
Look up keyword
Like this
3Activity
0 of .
Results for:
No results containing your search query
P. 1
Google and the Mind

Google and the Mind

Ratings:

4.0

(1)
|Views: 144 |Likes:
Published by zzztimbo
predicting fluency with pagerank

thomas l. griffiths
mark steyvers
alana firl
predicting fluency with pagerank

thomas l. griffiths
mark steyvers
alana firl

More info:

Published by: zzztimbo on Jan 14, 2008
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

06/28/2010

pdf

text

original

 
Research Article
Google and the Mind
Predicting Fluency With PageRank
Thomas L. Griffiths,
1
Mark Steyvers,
2
and Alana Firl
3
1
 Department of Psychology, University of California, Berkeley;
2
 Department of Cognitive Sciences, University of California, Irvine; and
3
 Department of Cognitive and Linguistic Sciences, Brown University
ABSTRACT—
Human memory and Internet search engines face a shared computational problem, needing to retrievestored pieces of information in response to a query. Weexplored whether they employ similar solutions, testing whetherwecouldpredicthumanperformanceonafluencytask using PageRank, a component of the Google searchengine. In this task, people were shown a letter of the al- phabet and asked to name the first word beginning withthat letter that came to mind. We show that PageRank,computed on a semantic network constructed from word-association data, outperformed word frequency and thenumberofwordsforwhichawordisnamedasanassociateas a predictor of the words that people produced in thistask. We identify two simple process models that could support this apparent correspondence between humanmemory and Internet search, and relate our results to previous rational models of memory.
Rational models of cognition explain human behavior as ap-proximating optimal solutions to the computational problemsposed bythe environment (Anderson,1990;Chater&Oaksford,1999; Marr, 1982; Oaksford & Chater, 1998). Rational modelshave been developed for several aspects of cognition, includingmemory (Anderson, 1990; Griffiths, Steyvers, & Tenenbaum,2007;Shiffrin&Steyvers,1997),reasoning(Oaksford&Chater,1994), generalization (Shepard, 1987; Tenenbaum & Griffiths,2001), categorization (Anderson, 1990; Ashby & Alfonso-Reese,1995),andcausalinduction(Anderson,1990;Griffiths&Tenenbaum, 2005). By emphasizing the computational prob-lems underlying cognition, rational models sometimes revealconnections between human behavior and that of other systemsthat solve similar problems. For example, Anderson’s (1990;Anderson & Milson, 1989) rational analysis of memory identi-fied parallels between the problem solved by human memoryand that addressed by automated information-retrieval systems,arguing for similar solutions to the two problems. Since An-derson’s analysis, information-retrieval systems have evolved toproduce what might be an even more compelling metaphor for human memory—the Internet search engine—and computer scientists have developed new algorithms for solving the prob-lemofpullingrelevantfactsfromlargedatabases.Inthisarticle,we explore the correspondence between these new algorithmsand the structure of human memory. Specifically, we show thatPageRank (Page, Brin, Motwani, & Winograd, 1998), one of thekey components of the Google search engine, predicts humanresponses in a fluency task.Viewed abstractly, the World Wide Web forms a directedgraph, in which the nodes are Web pages and the links betweenthosenodesarehyperlinks,asshowninFigure1a.ThegoalofanInternet search engine is to retrieve an ordered list of pages thatare relevant to a particular query. Typically, this is done byidentifying all pages that contain the words that appear in thequery, then ordering those pages using a measure of their im-portance based on their link structure. Many psychologicaltheories view human memory as solving a similar problem: re-trievingtheitemsinastoredsetthatarelikelytoberelevanttoaquery. The targets of retrieval are facts, concepts, or words,rather than Web pages, but these pieces of information are oftenassumed to be connected to one another in a way similar to theway in which Web pages are connected. In an associative se-mantic network, such as that shown in Figure 1b, a set of wordsor concepts is represented using nodes connected by links thatindicate pair-wise associations (e.g., Collins & Loftus, 1975).Analyses of semantic networks estimated from human behavior revealthatthesenetworkshavepropertiessimilartothoseoftheWorld Wide Web, such as a ‘‘scale-free’’ distribution for thenumber of nodes to which a node is connected (Steyvers &Tenenbaum, 2005). If one takes such a network to be the rep-resentation of the knowledge on which retrieval processes op-erate, human memory and Internet search engines address thesame computational problem: identifying those items that are
Address correspondence to Tom Griffiths, University of California,Berkeley, Department of Psychology, 3210 Tolman Hall # 1650,Berkeley, CA 94720-1650, e-mail: tom_griffiths@berkeley.edu.
PSYCHOLOGICAL SCIENCE
Volume 18—Number 12
1069
Copyright
r
2007 Association for Psychological Science
 
relevanttoaqueryfromalargenetworkofinterconnectedpiecesof information. Consequently, it seems possible that they solvethis problem similarly.Although the details of the algorithms used by commercialsearch engines are proprietary, the basic principles behind thePageRank algorithm, part of the Google search engine, arepublic knowledge (Page et al., 1998). The algorithm makes useof two key ideas: first, that links between Web pages provideinformation about their importance, and second, that the rela-tionship between importance and linking is recursive. Given anorderedsetof 
n
pages,wecansummarizethelinksbetweenthemwith an
n
Â
n
matrix
L
, where
L
ij 
is 1 if there is a link from Webpage
to Web page
i
and is 0 otherwise. If we assume that linksarechoseninsuchawaythatmoreimportantpagesreceivemorelinks, then the number of links that a Web page receives (ingraph-theoretic terms, its
in-degree
) could be used as a simpleindex of its importance. Using the
n
-dimensional vector 
p
tosummarize the importance of our 
n
Web pages, this is the as-sumption that
p
5
L1
, where
1
is a column vector with
n
ele-ments each equal to 1.PageRankgoesbeyondthissimplemeasureoftheimportanceof a Web page by observing that a link from an important Webpage is a better indicator of importance than a link from anunimportant Web page. Under such a view, an important Webpage is one that receives many links from other important Webpages.
1
We might thus imagine importance as flowing along thelinks of the graph shown in Figure 1a. If each Web page dis-tributesitsimportanceuniformlyoveritsoutgoinglinks,thenwecan express the proportion of the importance of each Web pagetraveling along each link in a matrix
M
, where
M
ij 
¼
L
ij 
=
P
nk
¼
1
L
kj 
.The idea that highly important Web pages receive links fromhighly important Web pages implies a recursive definition of importance, resulting in the equation
p
¼
Mp
;
ð
1
Þ
which identifies
p
as the eigenvector of the matrix
M
with thegreatest eigenvalue.
2
The PageRank algorithm computes theimportance of a Web page by finding a vector 
p
that satisfies thisequation.The empirical success of Google suggests that PageRankconstitutes an effective solution to the problem of Internetsearch. This raises the possibility that computing a similar quantity for a semantic network might predict which items areparticularly prominent in human memory. If one constructs asemantic network from word-association norms, placing linksfrom words used as cues in an association task to the words thatare named as their associates, the in-degree of a node indicatesthe number of words for which the corresponding word is pro-duced as an associate. This kind of ‘‘associate frequency’’ is anaturalpredictoroftheprominenceofwordsinmemory,andhasbeen used as such in a number of studies (McEvoy, Nelson, &Komatsu, 1999; Nelson, Dyrdal, & Goodmon, 2005). However,thissimplemeasureassumesthatallcuesshouldbegivenequalweight, whereas computing PageRank for a semantic networktakes into account the fact that the cues themselves differ intheir prominence in memory.ToexplorethecorrespondencebetweenPageRankandhumanmemory, we used a task that closely parallels the formal struc-ture of Internet search. In this task, we showed people a letter of the alphabet (the query) and asked them to say the first wordbeginning with that letter that came to mind. By using a querythat was either true or false of each item in memory, we aimed tomimic the problem solved by Internet search engines, whichretrieve all pages containing the set of search terms, and thus toobtain a direct estimate of the prominence of different words inhuman memory. In memory research, such a task is used tomeasure
fluency
—the ease with which people retrieve differentfacts. This particular task is used to measure letter fluency or verbal fluency in neuropsychology, and has been applied inthe diagnosis of a variety of neurological and neuropsychiatricdisorders (e.g., Lezak, 1995). However, in the standard use of this task, the interest is in the number of words that can beproduced in a given time period, whereas we intended to dis-cover which words were more likely to be produced than others.Our goal was to determine whether people’s responses in thisfluency task were better predicted by PageRank or by moreconventional predictors—word frequency and associate fre-
Fig. 1.
Parallels between the problems faced by search engines andhuman memory. Internet search and retrieval from memory both involvefinding the items relevant to a query from within a large network of in-terconnected pieces of information. In the case of Internet search (a), theitems to be retrieved are Web pages connected by hyperlinks. When itemsare retrieved from a semantic network (b), the items are words or con-cepts connected by associative links.
1
A similar insight in the context of bibliometrics motivated the developmentof essentially the same method for measuring the importance of publicationslinked by citations (Geller, 1978; Pinski & Narin, 1976).
2
In general, an eigenvector 
x
of a matrix
M
satisfies the equation
l
x
5
Mx
,where
l
is the eigenvalue associated with
x
(Strang, 1988). Equation 1 iden-tifies
p
as an eigenvector of 
M
with eigenvalue 1. This is guaranteed to be theeigenvector with greatest eigenvalue because
M
is a stochastic matrix, with
P
ni
¼
1
M
ij 
¼
1, and thus has no eigenvalues greater than 1. For simplicity, all of themathematical results reported in this article assume that
M
has only oneeigenvalue equal to 1. The PageRank algorithm can be modified when thisassumption is violated (Brin & Page, 1998), and a similar correction can extendthe results we state here to the general case.
1070
Volume 18—Number 12
Google and the Mind
 
quency. Word frequency is viewed as a cause of fluency (e.g.,Balota & Spieler, 1999; Plaut, McClelland, Seidenberg, &Patterson, 1996; Seidenberg & McClelland, 1989; see alsoAdelman, Brown, & Quesada, 2006) and is used to set the prior probability of items in rational models (Anderson, 1990; D.Norris,2006).Associatefrequencywascomputedfromthesamedata as PageRank, differing only in the assumption that all cuesshould be given equal weight. These two measures thus con-stitute strong alternatives to compare with PageRank.
METHOD
Fifty members of the Brown University community (30 female,20male) participated intheexperiment.Theirages ranged from18 to 75 years, with a mean of 24.6 years and a standard devi-ation of 13.2 years.Twenty-one letters of the alphabet (the low-frequency letters
 K 
,
Q
,
 X 
,
,and
Z
beingexcluded)wereprintedindividuallyon3
Â
5 cards in 56-point Times New Roman font. The cards wereshuffled, face down, and each subject was told that he or shewould be shown letters of the alphabet one after the other andshould produce the first word beginning with each letter thatcame to mind. The experimenter then turned the cards up oneby one until the subject had responded to the entire set. Theexperimenter wrote down the words produced by the subject.This procedure was performed twice with each subject.
RESULTS
PageRank and associate frequency were calculated using asemantic network constructed from the word-association normsof Nelson, McEvoy, and Schreiber (1998). The norms list allwords named at least twice as an associate of each of 5,018words. From these norms, we constructed a directed graph inwhicheachwordwasanode,withlinkstoitsassociates.Wethenapplied the PageRank algorithm to this graph and also calcu-latedtheassociatefrequencyforeachword.Finally,werecordedfrom Kucera and Francis (1967) the word frequency for each of the words appearing in the norms.PageRank,associatefrequency,andwordfrequencyalldefinearanking ofthewords thatpeople couldproduceinourtask. Weused the responses of our subjects to evaluate these predictors.Responses that did not begin with the appropriate letter wereomitted, as were instances of repetition of a word by a singlesubject, as these words could have been produced as a result of memory for the previous trial. The number of times each wordwasproduced wasthensummedoverallsubjects. Table 1showsthe most popular responses for seven of the letters. For evalu-ating the predictors, we removed words that were produced onlyonce. This is a standard procedure used to control outliers intasks generating spontaneous productions, such as word-asso-ciation tasks (including that of Nelson et al., 1998). Finally, weomittedallresponsesthatdidnotappearintheword-associationnorms of Nelson et al., as PageRank and associate frequencywere restricted to these words. The result was a set of 1,017responses.Our analysis focused on the ranks that the predictors as-signed to the human responses. We identified all words in thenorms that began with each letter and then ordered those wordsby each predictor, assigning a rank of 1 to the highest-scoringword and lower rank (i.e., a higher number) as the score de-creased. Table 1 shows some of these ranks. Because the totalnumberofwordsinthenormsvariedacrossletters(from648for 
S
to50fo
),wereducedtheserankstopercentagesofthesetof possible responses for each letter before aggregating acrossletters. The distribution of ranks was heavily skewed, so wecompared the predictors using medians and nonparametrictests. The median percentile ranks for the different predictorsareshowninTable2.Figure2presentstheproportionofhumanresponses produced as a function of percentile rank. PageRankoutperformedbothassociatefrequencyandwordfrequencyasapredictor of fluency, assigning lower ranks to 59.42% and81.97%, respectively, of the human responses given differentranks by the different predictors (both
p
s
<
.0001 by binomialtest).We performed several additional analyses in an attempt togain insight into the relatively poor performance of wordfrequency as a predictor. First, we used word frequencies from alarger corpus—the Touchstone Applied Science Associates(TASA) corpus used by Landauer and Dumais (1997)—whichimproved predictions, although not enough for word frequencyto compete with PageRank or associate frequency. Second, welooked at performance within restricted sets of words. Althoughthewordsusedinouranalyseswereallproducedasassociatesinthe word-association task of Nelson et al. (1998), they varied inpart of speech and concreteness. From Table 1, it is apparentthat people’s responses in the fluency task were biased towardconcrete nouns, whereas word frequency does not take part of speech or concreteness into account. We repeated our analysisusing two subsets of the words from the norms. The first sub-set consisted of all words identified as nouns and possessingconcreteness ratings in the MRC Psycholinguistic Database(Wilson,1988).Thisreducedthetotalnumberofwordsto2,128,and the total number of responses matching these words to 753.This restriction controlled for part of speech, but still leftdifferences in concreteness among the words favored by thepredictors; the mean concreteness ratings averaged over thedistributions over words implied by PageRank, associate fre-quency,andwordfrequencyfromtheKuceraandFrancis(1967)and TASA corpora were 496.93, 490.32, 421.24, and 433.65,respectively (on a scale from 100 to 700). To address this issue,we also analyzed a more reduced subset of words: nouns withconcreteness ratings greater than or equal to the median. Thissecond subset included 1,068 words matching 526 human re-sponses and had mean concreteness ratings of 584.70, 583.93,
Volume 18—Number 12
1071
Thomas L. Griffiths, Mark Steyvers, and Alana Firl

Activity (3)

You've already reviewed this. Edit your review.
1 hundred reads
1 thousand reads
Dimension Akk liked this

You're Reading a Free Preview

Download
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->