
Soumen Chakrabarti

Indian Institute of Technology Bombay


soumen@cse.iitb.ernet.in

ABSTRACT

With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of recent and ongoing research.

1 Introduction

The volume of unstructured text and hypertext data exceeds that of structured data. Text and hypertext are used for digital libraries, product catalogs, reviews, newsgroups, medical reports, customer service reports, and homepages for individuals, organizations, and projects. Hypertext has been widely used long before the popularization of the Web. Communities such as the ACM (http://www.acm.org) Special Interest Groups on Information Retrieval (SIGIR, http://www.acm.org/sigir), Hypertext, Hypermedia and Web (SIGLINK/SIGWEB, http://www.acm.org/siglink), and Digital Libraries (http://www.acm.org/dl) have been engaged in research on effective search and retrieval from text and hypertext databases.

The spectacular ascent in the size and popularity of the Web has subjected traditional information retrieval (IR) techniques to an intense stress-test. The Web exceeds 800 million HTML pages, or six terabytes of data on about three million servers. Almost a million pages are added daily, a typical page changes in a few months, and several hundred gigabytes change every month. Even the largest search engines, such as Alta Vista (http://www.altavista.com) and HotBot (http://www.hotbot.com), index less than 18% of the accessible Web as of February 1999 [49], down from 35% in late 1997 [6].

Apart from sheer size and flux, the Web and its users differ significantly from traditional IR corpora and workloads. The Web is populist hypermedia at its best: there is no consistent standard or style of authorship dictated centrally; content is created autonomously by diverse people. Hyperlinks are created for navigation, citation, criticism, endorsement, or plain whim. Search technology inherited from the world of IR is evolving slowly to meet these new challenges. As mentioned before, search engines no longer attempt to index the whole Web. But apart from this deficiency of scale, search engines are also known for poor accuracy: they have both low recall (fraction of relevant documents that are retrieved) and low precision (fraction of retrieved documents that are relevant). Misrepresentation of content by `spamming' the page with meaningless terms is rampant, so that keyword indexing search engines rank such pages highly for many queries. The usual problems of text search, such as synonymy, polysemy (a word with more than one meaning), and context sensitivity, become especially severe on the Web. Moreover, a new dimension is added because of extant partial structure: whereas IR research deals predominantly with documents, the Web offers semistructured data in the form of directed graphs where nodes are primitive or compound objects and edges are field labels.

Apart from the IR community, documents and the use of natural language (NL) have been studied also by a large body of linguists and computational linguists. NL techniques can now parse relatively well-formed sentences in many languages [36, 65], disambiguate polysemous words with high accuracy [3, 28, 71, 68], tag words in running text with part-of-speech information [3, 4], represent NL documents in a canonical machine-usable form [67, 40, 11], and perform NL translation [39, 5, 62, 2]. A combination of NL and IR techniques has been used for creating hyperlinks automatically [9, 35, 46] and expanding brief queries with related words for enhanced search [12, 32]. For reasons unclear to us, popular Web search engines have been slow to embrace these capabilities. Although we see the progression towards better semantic understanding as inevitable, NL techniques are outside the scope of this survey.

Copyright c 2000 ACM SIGKDD. SIGKDD Explorations, Volume 1, Issue 2, Jan 2000. Page 1.

Outline: In this survey we will concentrate on statistical techniques for learning structure in various forms from text, hypertext and semistructured data.

Models: In x2 we describe some of the models used to represent hypertext and semistructured data.

Supervised learning: In x3 we discuss techniques for supervised learning or classification.

Unsupervised learning: The other end of the spectrum, unsupervised learning or clustering, is discussed in x4.

Semi-supervised learning: Real-life applications fit somewhere in between fully supervised and unsupervised learning, and are discussed in x5.

Social network analysis: A key feature which distinguishes hypertext mining from warehouse mining and much of IR is the analysis of structural information in hyperlinks. This is discussed in x6.

2 Basic models

In its full generality, a model for text must build machine representations of world knowledge, and must therefore involve a NL grammar. Since we restrict our scope to statistical analyses, we need to find suitable representations for text, hypertext, and semistructured data which will suffice for our learning applications. We discuss some such models in this section.

2.1 Models for text

In the IR domain, documents have been traditionally represented in the vector space model [63, 29]. Documents are tokenized using simple syntactic rules (such as whitespace delimiters in English) and tokens stemmed to canonical form (e.g., `reading' to `read,' `is,' `was,' `are' to `be'). Each canonical token represents an axis in a Euclidean space; documents are vectors in this space. In the crude form, if a term t occurs n(d, t) times in document d, the t-th coordinate of d is simply n(d, t). One may choose to normalize the length of the document to 1, typically using the L1, L2, or L-infinity norms: ||d||_1 = Σ_t n(d, t), ||d||_2 = (Σ_t n(d, t)^2)^{1/2}, and ||d||_∞ = max_t n(d, t).

These representations do not capture the fact that some terms (like `algebra') are more important than others (like `the' and `is') in determining document content. If t occurs in N_t out of N documents, N_t/N gives a sense of the rarity, hence, importance, of the term. The "inverse document frequency" IDF(t) = 1 + log(N/N_t) is used to stretch the axes of the vector space differentially. Thus the t-th coordinate of d may be chosen as (n(d, t)/||d||_1) IDF(t). (Many variants of this formula exist; our form is merely illustrative.) This is popularly called the `TFIDF' (term frequency times inverse document frequency) weighted vector space model.

Alternatively, one may construct probabilistic models for document generation. Once again, we should start with the disclaimer that these models have no bearing on grammar and semantic coherence. The simplest statistical model is the binary model: a document is a set of terms, which is a subset of the lexicon (the universe of possible terms). Word counts are not significant. In the multinomial model, one imagines a die with as many faces as there are words in the lexicon, each face t having an associated probability θ_t of showing up when tossed. To compose a document the author fixes a total word count (arbitrarily or by generating it from a distribution), and then tosses the die as many times, each time writing down the term corresponding to the face that shows up. In spite of minor variations, all these models regard documents as multisets of terms, without paying attention to ordering between terms. Therefore they are collectively called bag-of-words models. In spite of being extremely crude and not capturing any aspect of language or semantics, these models often perform well for their intended purpose.

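As a concrete illustration of the TFIDF construction in x2.1, the following sketch (our own; tokenization and stemming are reduced to whitespace splitting, and function names are ours) builds L1-normalized TFIDF vectors:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a string) to a TFIDF-weighted term vector.

    TF is n(d, t) normalized by the L1 norm of d; IDF(t) = 1 + log(N / N_t),
    as in Section 2.1. Tokenization is plain whitespace splitting.
    """
    bags = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(bags)
    # N_t: the number of documents containing term t.
    df = Counter(t for bag in bags for t in bag)
    vectors = []
    for bag in bags:
        l1 = sum(bag.values())
        vectors.append({t: (n / l1) * (1 + math.log(n_docs / df[t]))
                        for t, n in bag.items()})
    return vectors

docs = ["the cat sat", "the dog sat", "algebra algebra lecture"]
vecs = tfidf_vectors(docs)
# `algebra' is rare (N_t = 1) and frequent in its document, so its weight
# exceeds that of `the', which occurs in two of the three documents.
```

Real systems add stemming, stopword removal, and one of the many TFIDF variants the text mentions; this sketch only fixes the ideas.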
2.2 Models for hypertext

Hypertext has hyperlinks in addition to text. These are modeled with varying levels of detail, depending on the application. In the simplest model, hypertext can be regarded as a directed graph (D, L) where D is the set of nodes, documents, or pages, and L is the set of links. Crude models may not need to include the text models at the nodes. More refined models will characterize some sort of joint distribution between the term distribution of a node and those in a certain neighborhood. One may also wish to recognize that the source document is in fact a sequence of terms interspersed with outbound hyperlinks. This may be used to establish specific relations between certain links and terms (x6).

On occasion we will regard documents as being generated from topic-specific term distributions. For example, the term distribution of documents related to `bicycling' is quite different from that of documents related to `archaeology.' Unlike journals on archaeology and magazines on bicycling, the Web is not isolated: nodes related to diverse topics link to each other. (We have found recreational bicycling pages to link to pages about first-aid significantly more often than a typical page.) If the application so warrants (as in x6.2), one needs to model such coupling among topics, both across and within documents.

One prominent kind of inter-document structure are topic directories like the Open Directory Project (http://dmoz.org) and Yahoo! (http://www.yahoo.com). Such services have constructed, through human effort, a giant taxonomy of topic directories. Each directory has a collection of hyperlinks to relevant (and often popular or authoritative) sites related to the specific topic. One may model such tree-structured hierarchies with an is-a(specific-topic, general-topic) relation, and an example(topic, url) relation to assign URLs to topics. Although topic taxonomies are a special case of semistructured data, they are important and frequent enough to mention separately.

2.3 Models for semistructured data

Apart from hyperlinks, other structures exist on the Web. Semistructured data is a point of convergence [25] for the Web (http://www9.org) and database (http://www.acm.org/sigmod) communities: the former deals with documents, the latter with data. The form of that data is evolving from rigidly structured relational tables with numbers and strings to enable the natural representation of complex real-world objects like books, papers, movies, jet engine components, and chip designs without sending the application writer into contortions.

Emergent representations for semistructured data (such as XML, http://www.w3.org/XML/) are variations on the Object Exchange Model (OEM) [54, 33] (see http://www-db.stanford.edu/~widom/xml-whitepaper.html). In OEM, data is in the form of atomic or compound objects: atomic objects may be integers or strings; compound objects refer to other objects through labeled edges. HTML is a special case of such `intra-document' structure. The above forms of irregular structure naturally encourage data mining techniques from the domain of `standard' structured warehouses to be applied, adapted, and extended to discover useful patterns from semistructured sources as well.

3 Supervised learning

In supervised learning, also called classification, the learner first receives training data in which each item is marked with a label or class from a discrete finite set. The algorithm is trained using this data, after which it is given unlabeled data and has to guess the label.

Classification has numerous applications in the hypertext and semistructured data domains. Robust hypertext classification is useful for email and newsgroup management and maintaining Web directories. Surfers use topic directories because they help structure and restrict keyword searches, making it easy, for example, to ask for documents with the word socks which are about garments, not the security proxy protocol (http://www.socks.nec.com/). (socks AND NOT network will drop garment sites boasting of a large distribution network.) As another example, a campus search engine that can categorize faculty, student and project pages can answer queries that select on these attributes in addition to keywords (find faculty interested in cycling). Furthermore, learning relations between these types of pages (advised-by, investigator-of) can enable powerful searches such as "find faculty supervising more than the departmental average of Masters students."

3.1 Probabilistic models for text learning

Suppose there are m classes or topics c_1, ..., c_m, with some training documents D_c associated with each class c. The prior probability of a class is usually estimated as the fraction of documents in it, i.e., |D_c| / Σ_c |D_c|. Let T be the universe of terms, or the vocabulary, in all the training documents.

3.1.1 Naive Bayes classification

The bag-of-words models readily enable naive Bayes classification [13] (`naive' indicates the assumption of independence between terms).

One may assume that for each class c there is a binary text generator model. The model parameters are φ(c,t), which indicate the probability that a document in class c will mention term t at least once. With this definition,

    Pr(d|c) = Π_{t∈d} φ(c,t) × Π_{t∈T, t∉d} (1 − φ(c,t)).   (1)

Since typically |T| ≫ |d|, short documents are discouraged by this model. Also, the second product makes strong independence assumptions and is likely to greatly distort the estimate of Pr(d|c).

In the multinomial model, each class has an associated die with |T| faces. The parameters are θ(c,t), the probability of the face t ∈ T turning up on tossing the die (x2.1). The conditional probability of document d being generated from class c is

    Pr(d|c) = (||d||_1 choose {n(d,t)}) Π_{t∈d} θ(c,t)^{n(d,t)},   (2)

where (||d||_1 choose {n(d,t)}) = ||d||_1! / (n(d,t_1)! n(d,t_2)! ...) is the multinomial coefficient. Note that short documents are encouraged; in particular, the empty document is most likely. Not only is inter-term correlation ignored, but each occurrence of term t results in a multiplicative θ(c,t) `surprise' factor.

Both models are surprisingly effective given their crudeness: although their estimate of Pr(d|c) must be terrible, their real task is to pick arg max_c Pr(c|d), and the poor estimate of Pr(d|c) appears to matter less than one would expect [31, 14]. The multinomial model is usually preferred to the binary model because the additional term count information usually leads to higher accuracy [51, 26].

3.1.2 Parameter smoothing and feature selection

Many of the terms t ∈ T will occur in only a few classes. The maximum likelihood (ML) estimates of θ(c,t) (defined as Σ_{d∈D_c} n(d,t) / Σ_{τ∈T} Σ_{d∈D_c} n(d,τ)) will be zero for all classes c where t does not occur, and so Pr(c|d) will be zero if a test document d contains even one such term. To avoid this in practice, the ML estimate is often replaced by the Laplace-corrected estimate [47, 61]:

    θ(c,t) = (1 + Σ_{d∈D_c} n(d,t)) / (|T| + Σ_{τ∈T} Σ_{d∈D_c} n(d,τ)).   (3)

Since in the binary case there are two outcomes instead of |T| outcomes, the corresponding correction is

    φ(c,t) = (1 + |{d ∈ D_c : t ∈ d}|) / (2 + |D_c|).   (4)

Unfortunately the second form often leads to a gross overestimation of Pr(t|c) for sparse classes (small D_c). Other techniques for parameter smoothing, which use information from related classes (e.g., parent or sibling in a topic hierarchy), are discussed in x3.1.5.

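Equations (2) and (3) translate directly into code. The following toy classifier (our own sketch; the data and names are invented) drops the multinomial coefficient of equation (2), which is constant across classes and cannot change arg max_c Pr(c|d), and applies the Laplace correction of equation (3):

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (class_label, list_of_terms) pairs.
    Returns Laplace-smoothed log theta(c, t) per equation (3), plus log priors."""
    counts, class_sizes, vocab = {}, Counter(), set()
    for c, terms in labeled_docs:
        counts.setdefault(c, Counter()).update(terms)
        class_sizes[c] += 1
        vocab.update(terms)
    n = sum(class_sizes.values())
    model = {}
    for c, bag in counts.items():
        total = sum(bag.values())
        model[c] = {
            "logprior": math.log(class_sizes[c] / n),
            # equation (3): theta = (1 + n(c, t)) / (|T| + sum_t n(c, t))
            "logtheta": {t: math.log((1 + bag[t]) / (len(vocab) + total))
                         for t in vocab},
        }
    return model

def classify(model, terms):
    """Pick arg max_c [log Pr(c) + sum over term occurrences of log theta(c, t)].
    Terms outside the training vocabulary are ignored."""
    def score(c):
        m = model[c]
        return m["logprior"] + sum(m["logtheta"].get(t, 0.0) for t in terms)
    return max(model, key=score)

train_set = [("bike", "wheel gear ride".split()),
             ("bike", "ride helmet wheel".split()),
             ("dig",  "shovel ruins pottery".split())]
model = train(train_set)
```

Working in log space avoids the numeric underflow that multiplying many small θ(c,t) values would cause on realistic vocabularies.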
Many terms in T are quite useless and even potentially harmful for classification. `Stopwords' such as `a,' `an,' `the' are likely candidates, although one should exercise caution in such judgement: `can' (verb) is often regarded as a stopword, but `can' (noun) is hardly a stopword in the context of http://dmoz.org/Science/Environment/Pollution_Prevention_and_Recycling/, which may specialize into subtopics related to cans, bottles, paper, etc. (and therefore `can' might be an excellent feature at this level). Thus, stopwords are best determined by statistical testing on the corpus at hand w.r.t. the current topic context.

Various techniques can be used to grade terms on their "discriminating power" [72]. One option is to use mutual information [19]. Another option is to use class-preserving projections such as the Fisher index [27]. Ranking the terms gives some guidance on which to retain in T and which to throw out as `noise.' One idea commonly used is to order the terms best first and retain a prefix. This option can be made near-linear in the number of initial features and I/O efficient [14]. As in simple greedy approaches, the results are not too sensitive to the exact term scoring measures used. However, this static ranking does not recognize that selecting (or discarding) a term may lead to a change in the desirability score of another term. More sophisticated (and more time-consuming) feature selection techniques based on finding Markov blankets in Bayesian networks [38] have been proposed [44].

Experiments suggest that feature selection yields a significant improvement for the binary naive Bayes classifier and a moderate improvement for the multinomial naive Bayes classifier [44, 72, 13, 14, 51]. Performing some form of feature selection is important: |T| is usually large, in the range of tens of thousands of terms, and it is possible to discard a large fraction (say over 2/3) of T and yet maintain classification accuracy; in some cases, discarding noisy features improved accuracy by 5%.

3.1.3 Limited dependence modeling

The most general model of term dependence that can be used in this context is a Bayesian network [38] of the following form: there is a single node that encodes the class of the document; edges from this node go to `feature' terms chosen from T; these term nodes are arbitrarily connected with each other (but there are no directed cycles); and there are no other edges in the Bayesian network. Deriving the edges among T from training data is made difficult by the enormous dimensionality of the problem space. One approach to cut down the complexity (to slightly worse than |T|^2) is to insist that each node in T has at most d ancestors, and to pick them greedily based on pre-estimated pairwise correlation between terms [44]. However, further experimental analyses on larger text data sets are needed to assess the impact of modeling term dependence on classification accuracy.

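The mutual-information term grading mentioned in x3.1.2 can be sketched as follows (our own minimal version, over the binary event "document contains t"; the toy data is invented). A genuinely discriminating term scores high, while a term spread evenly over all classes scores zero:

```python
import math
from collections import Counter

def mutual_information(labeled_docs):
    """Score each term t by I(T_t; C): the mutual information between the
    binary event `document contains t' and the class label C."""
    n = len(labeled_docs)
    class_count = Counter(c for c, _ in labeled_docs)
    joint = Counter()      # (t, c) -> number of docs of class c containing t
    doc_freq = Counter()   # t -> number of docs containing t
    for c, terms in labeled_docs:
        for t in set(terms):
            joint[(t, c)] += 1
            doc_freq[t] += 1
    scores = {}
    for t in doc_freq:
        mi = 0.0
        for c, nc in class_count.items():
            for present in (True, False):
                n_tc = joint[(t, c)] if present else nc - joint[(t, c)]
                p_t = doc_freq[t] / n if present else 1 - doc_freq[t] / n
                if n_tc == 0 or p_t == 0:
                    continue  # zero cells contribute nothing
                p_joint = n_tc / n
                mi += p_joint * math.log(p_joint / (p_t * (nc / n)))
        scores[t] = mi
    return scores

docs = [("bike", ["the", "wheel"]), ("bike", ["the", "ride"]),
        ("dig", ["the", "ruins"]), ("dig", ["the", "shovel"])]
scores = mutual_information(docs)
# `the' occurs in every document regardless of class, so I(T_the; C) = 0,
# while `wheel' occurs only in one class and scores positive.
```

Ranking terms by this score and keeping a prefix is exactly the "order the terms best first and retain a prefix" heuristic of x3.1.2.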
3.1.4 Other advanced models

Two other techniques have emerged in recent research that capture term dependence. The maximum entropy technique regards individual term occurrence rates as marginal probabilities that constrain the cell probabilities of a gigantic contingency table potentially involving any subset of terms. Operationally there are similarities with Bayesian networks in the sense that one has to choose which regions of this rather large contingency table to estimate for subsequent use in classification. In experiments, improvements in accuracy over naive Bayes have been varied and dependent on specific data sets [57].

Thus far we have been discussing distribution-based classifiers, which characterize class-conditional term distributions to estimate likely generators for test documents. Another approach is to use separator-based classifiers that learn separating surfaces between the classes. Decision trees and perceptrons are examples of separator-based classifiers. Support vector machines (SVMs) [69] have yielded some of the best accuracy to date. SVMs seek to find a separator surface between classes such that a band of maximum possible thickness around the separator (also called the margin) is empty, i.e., has no training points. This leads to better generalization [59, 42].

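As a minimal concrete example of a separator-based classifier, here is a perceptron over binary bag-of-words vectors (our own sketch with invented data; note this is not the SVM algorithm, since no margin is maximized):

```python
def train_perceptron(docs, labels, vocab, epochs=20):
    """Learn weights w such that sign(bias + w . x) separates two classes.
    Documents are term lists treated as binary vectors over `vocab`;
    labels are +1 or -1."""
    w = {t: 0.0 for t in vocab}
    bias = 0.0
    for _ in range(epochs):
        for terms, y in zip(docs, labels):
            activation = bias + sum(w[t] for t in terms if t in w)
            if y * activation <= 0:  # misclassified: nudge the separator
                for t in terms:
                    if t in w:
                        w[t] += y
                bias += y
    return w, bias

def predict(w, bias, terms):
    return 1 if bias + sum(w.get(t, 0.0) for t in terms) > 0 else -1

docs = [["wheel", "ride"], ["gear", "ride"],
        ["ruins", "shovel"], ["pottery", "ruins"]]
labels = [1, 1, -1, -1]
vocab = {t for d in docs for t in d}
w, b = train_perceptron(docs, labels, vocab)
```

An SVM would pick, among all separators consistent with the data, the one with the thickest empty band around it; the perceptron stops at the first separator it finds.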
3.1.5 Hierarchies over class labels

In the traditional warehouse mining domain, class labels are drawn from a small discrete set, e.g., "fraudulent credit card transaction" vs. "normal transaction." Sometimes a mild ordering between the labels is possible, e.g., "low risk," "medium risk" and "high risk" patients. Topic directories such as Yahoo! provide a more complex scenario: the class labels have a large hierarchy defined on them, with the common interpretation that if c is-a c', all training documents belonging to c are also examples of c'. (Yahoo! is a directed acyclic graph, not a tree, but we restrict our discussion to trees only for simplicity.)

The role of the classifier in this setting is open to some discussion; a common goal is to find the leaf node in the class hierarchy which has the highest posterior probability. The common formulation in this setting is that Pr(root|d) = 1 for any document d, and if c' is the parent of c_1, ..., c_m, then Pr(c'|d) = Σ_i Pr(c_i|d). Using the chain rule, Pr(c_i|d) = Pr(c'|d) Pr(c_i|c', d). One starts at the root and computes posterior probabilities for all classes in the hierarchy via a best-first search, always expanding that c with the highest Pr(c|d) and collecting unexpanded nodes in a priority queue until sufficiently many leaf classes are obtained. An approximation to this is the `Pachinko machine' [45] which, starting at the root, selects in each step the most likely child of the current class until it reaches a leaf.

Another issue that arises with a taxonomy is that of context-dependent feature selection. As mentioned before in x3.1, the set of features useful for discrimination between classes may vary significantly, depending on c'. Surprisingly few terms (tens to hundreds) suffice at each node in the topic hierarchy to build models which are as good as (or slightly better than) using the whole lexicon. This has been observed in a set of experiments [13, 45] across different data sets.

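The `Pachinko machine' approximation of x3.1.5 reduces to a greedy descent of the taxonomy. A sketch (the tree layout and the keyword-overlap posterior are purely illustrative stand-ins for a trained per-node classifier):

```python
def pachinko_classify(tree, posterior, doc, root="root"):
    """Greedy top-down hierarchical classification: starting at the root,
    repeatedly move to the child c maximizing posterior(c, doc) until a
    leaf is reached. `tree` maps each internal class to its children."""
    node = root
    while tree.get(node):  # internal node: descend; leaf: stop
        node = max(tree[node], key=lambda c: posterior(c, doc))
    return node

# Toy taxonomy and a stand-in posterior based on keyword overlap.
tree = {"root": ["Science", "Recreation"],
        "Science": ["Archaeology", "Physics"],
        "Recreation": ["Bicycling", "Hiking"]}
keywords = {"Science": {"ruins", "quark", "pottery", "theory"},
            "Recreation": {"wheel", "trail", "ride", "helmet"},
            "Archaeology": {"ruins", "pottery"}, "Physics": {"quark", "theory"},
            "Bicycling": {"wheel", "ride", "helmet"}, "Hiking": {"trail"}}

def posterior(c, doc):
    return len(keywords[c] & doc) / (len(doc) or 1)

leaf = pachinko_classify(tree, posterior, {"wheel", "ride"})
# -> "Bicycling"
```

Unlike the best-first search described above, this greedy descent cannot recover from a wrong turn at an upper level, which is exactly why it is only an approximation.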
Bayesian classifiers in this domain tend to have high bias and rarely overfit the data; they are quite robust to moderate excesses in parameters. Therefore the major benefit of feature selection is in storage cost and speed [14]. For a data set such as Yahoo!, feature selection may make the difference between being able to hold the classifier model data in memory or not.

Yet another utility of a class hierarchy is in improved model estimation. Sparsity of data is always a problem in the text domain, especially for rare classes. Information gleaned indirectly from the class models of better populated ancestors and siblings in the taxonomy may be used to `patch' local term distributions. One such approach is shrinkage [53]. Given a topic tree hierarchy, its root is made the only child of a special `class' c_0 where all terms are equally likely (θ(c_0,t) = 1/|T| for all t ∈ T). Consider a path c_0, c_1, ..., c_k up to a leaf c_k, and suppose we need to estimate the θ-parameters of the classes on this path. The set of training documents D_{c_k} for leaf c_k is used as is. For c_{k-1}, the training set used is D_{c_{k-1}} \ D_{c_k}, i.e., documents in D_{c_k} are discarded so that the estimates become independent. This process is continued up the path. From the training data, the maximum likelihood estimates of θ(c_i,t) are computed; this is similar to equation (3), after dropping the `1+' and `|T|+' terms. Shrinkage involves estimating a weight vector λ = (λ_0, ..., λ_k) such that the `corrected' parameters for c_k are

    θ'(c_k,t) = Σ_{0≤i≤k} λ_i θ(c_i,t).   (5)

λ is estimated empirically by maximizing the likelihood of a held-out portion of the training data using an Expectation Maximization (EM) framework [24] (also see x5.1).

3.2 Learning relations

Classifying documents into topics can be extended into learning relations between pages. For example, in trying to classify pages at a college into faculty, student, project, and course pages, one may wish to learn relations like teaches(faculty, course), advises(faculty, student), enrolled(student, course). In turn, learning such relations may improve the accuracy of node classification.

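The shrinkage correction of equation (5) is just a convex interpolation of the θ-vectors along the root-to-leaf path. A sketch (our illustration; in the real method the λ weights come from EM on held-out data, here they are fixed by hand):

```python
def shrink(path_thetas, lambdas):
    """Equation (5): corrected parameters for a leaf class as a weighted sum
    of the theta-vectors of its ancestors c_0 ... c_k (c_0 = uniform root).
    path_thetas[i] maps term -> theta(c_i, t); lambdas must sum to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    vocab = set().union(*path_thetas)
    return {t: sum(lam * theta.get(t, 0.0)
                   for lam, theta in zip(lambdas, path_thetas))
            for t in vocab}

vocab = ["wheel", "ride", "ruins"]
uniform = {t: 1 / 3 for t in vocab}                 # c_0: all terms equally likely
parent = {"wheel": 0.4, "ride": 0.4, "ruins": 0.2}  # well-populated ancestor
leaf = {"wheel": 1.0, "ride": 0.0, "ruins": 0.0}    # sparse leaf: `ride' unseen
corrected = shrink([uniform, parent, leaf], [0.2, 0.3, 0.5])
# `ride' receives nonzero probability from the ancestors even though the
# sparse leaf class never observed it.
```

Because each ancestor θ-vector sums to 1 and the λ weights sum to 1, the corrected vector is again a proper distribution over the vocabulary.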
Learning relations between hypertext pages involves a combination of statistical and relational learning. The former component can analyze textual features while the latter can exploit hyperlinks. The idea is to augment a FOIL-based learner [50] with the ability to invent predicates, using a naive Bayes text classifier. Due to its relational component, it can represent hyperlink graph structure and the word statistics of the neighbouring documents. At the same time, using a statistical method for text implies that the learned rules will not be dependent on the presence or absence of specific keywords, as would be the case with a purely relational method [50].

4 Unsupervised learning

In unsupervised learning [41] of hypertext, the learner is given a set of hypertext documents, and is expected to discover a hierarchy among the documents based on some notion of similarity, and to organize the documents along that hierarchy. A good clustering will collect similar documents together near the leaf levels of the hierarchy and defer merging dissimilar subsets until near the root of the hierarchy. Clustering is used to enhance search, browsing, and visualization [37].

4.1 Basic clustering techniques

Clustering is a fundamental operation in structured data domains, and has been intensely studied. Some of the existing techniques, such as k-means [41] and hierarchical agglomerative clustering [21], can be applied to documents. However, these techniques cannot deal with the tens of thousands of dimensions. Typically, documents are represented in unweighted or TFIDF vector space, and the similarity between two documents is the cosine of the angle between their corresponding vectors, or the distance between the vectors, provided their lengths are normalized.

4.1.1 k-means clustering

The number k of clusters desired is input to the k-means algorithm, which then picks k `seed documents' whose coordinates in vector space are initially set arbitrarily. Iteratively, each input document is `assigned to' the most similar seed. The coordinate of the seed in the vector space is recomputed to be the centroid of all the documents assigned to that seed. This process is repeated until the seed coordinates stabilize.

The extreme high dimensionality of text creates two problems for the top-down k-means approach. Even if each of 30000 dimensions has only two possible values, the number of input documents will always be too small to populate each of the 2^30000 possible cells in the vector space. Hence most dimensions are unreliable for similarity computations. There exist bottom-up techniques to determine orthogonal subspaces of the original space (obtained by projecting out other dimensions) such that the clustering in those subspaces can be characterized as `strong' [1]. But since this is not a supervised problem, it is hard to detect such subspaces a priori.

4.1.2 Agglomerative clustering

In agglomerative or bottom-up clustering, documents are continually merged into super-documents or groups until only one group is left; the merge sequence generates a hierarchy based on similarity. In one variant [21], the self-similarity of a group Γ is defined as

    s(Γ) = (1 / (|Γ|(|Γ|−1))) Σ_{d1,d2 ∈ Γ, d1 ≠ d2} s(d1, d2),   (6)

where s(d1, d2) is the similarity between documents d1 and d2, often defined as the cosine of the angle between the vectors corresponding to these documents in TFIDF vector space. (Many other measures of similarity have been used.) The algorithm initially places each document into a group by itself. While there is more than one group, the algorithm looks for groups Γ and Δ such that s(Γ ∪ Δ) is maximized, and merges these two groups.

This algorithm takes time that is worse than quadratic in the number of initial documents. To scale it up to large collections, a variety of sampling techniques exist [20].

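The k-means loop of x4.1.1, with cosine similarity on length-normalized vectors, can be sketched as follows (our own toy version; the documents and seeds are invented):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {t: x / n for t, x in v.items()}

def cosine(u, v):
    # For unit-length vectors the dot product is the cosine of the angle.
    return sum(x * v.get(t, 0.0) for t, x in u.items())

def kmeans(docs, seeds, iters=10):
    """Assign each document to the most similar seed, then recompute each
    seed as the (renormalized) centroid of its documents (Section 4.1.1)."""
    seeds = [normalize(s) for s in seeds]
    for _ in range(iters):
        groups = [[] for _ in seeds]
        for d in docs:
            best = max(range(len(seeds)), key=lambda i: cosine(d, seeds[i]))
            groups[best].append(d)
        for i, group in enumerate(groups):
            if group:
                centroid = {}
                for d in group:
                    for t, x in d.items():
                        centroid[t] = centroid.get(t, 0.0) + x
                seeds[i] = normalize(centroid)
    return groups

docs = [normalize(d) for d in
        [{"wheel": 1, "ride": 1}, {"ride": 1, "helmet": 1},
         {"ruins": 1, "pottery": 1}, {"shovel": 1, "ruins": 1}]]
groups = kmeans(docs, seeds=[{"wheel": 1}, {"ruins": 1}])
```

A production implementation would iterate until the seeds stop moving rather than for a fixed number of rounds, and would pick seeds from the data instead of arbitrarily.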
4.2 Techniques from linear algebra

The vector space model and related representations suggest that linear transformations applied to documents and terms, regarded as vectors in Euclidean space, may expose interesting structure. We now discuss some applications of linear algebra to text analysis.

4.2.1 Latent semantic indexing

The similarity between two documents in the vector space model is a syntactic definition involving those two documents alone, defined as the cosine of the angle between the vectors corresponding to these documents in TFIDF vector space. (In TFIDF weighted vector space other documents do exert some influence through IDF.) However, indirect evidence often lets us build semantic connections between documents that may not even share terms. For example, `car' and `auto' co-occurring in a document may lead us to believe they are related. This may help us relate a document mentioning `car' and `gearbox' with another document mentioning `auto' and `transmission,' which may in turn lead us to believe `gearbox' and `transmission' are related.

Latent Semantic Indexing (LSI) [23] formalizes this intuition using notions from linear algebra. Consider the term-by-document matrix A where a_{ij} is 0 if term i does not occur in document j, and 1 otherwise (word counts and TFIDF weighting may also be used). Now if `car' and `auto' are related, we would expect them to occur in similar sets of documents; hence the rows corresponding to these words should have some similarity. Extending this intuition we might suspect that A has a rank r far lower than its dimensions; many rows and/or columns may be somewhat `redundant.' Let the singular values [34] of A (the eigenvalues of A A^T) be σ_1, ..., σ_r, where |σ_1| ≥ ... ≥ |σ_r|. The Singular Value Decomposition (SVD) [34] factorizes A into three matrices U D V^T, where D = diag(σ_1, ..., σ_r) and U and V are column-orthogonal matrices. LSI retains only the first k singular values together with the corresponding rows of U and V, which induce an approximation to A, denoted A_k = U_k D_k V_k^T (k is a tuned parameter). Each row of U_k represents a term as a k-dimensional vector; similarly, each row of V_k represents a document as a k-dimensional vector.

Among many things, this modified representation makes for a new search method: map a keyword query to a k-dimensional vector, and find documents which are nearby. (For multi-word queries one may take the centroid of the word vectors as the query.) Heuristically speaking, the hope is that the rows in U corresponding to `car' and `auto' will look very similar, and a query on `car' will have a nonzero match with a document containing the word `auto.' Improvement in query performance has indeed been observed [23]. Apart from search, LSI can also be used to preprocess the data (eliminating noise) for clustering and visualization of document collections in very low-dimensional spaces (2- or 3-d). In the Signal Processing community, SVD and truncation have been used for decades for robust model fitting and noise reduction [56]. Analogously for text, it has been shown that LSI extracts semantic clusters in spite of noisy mappings from clusters to terms [58].

4.2.2 Random projections

In either k-means or agglomerative clustering, a significant fraction of the running time is spent in computing similarities between documents and document groups. Although individual documents can be expected to have bounded length, cluster centroids and document groups become dense as they collect contributions from more and more documents during the execution of these algorithms. There is a theorem which establishes that a projection of a set of points to a randomly oriented subspace induces small distortions in most inter-point distances with high probability. More precisely, let v ∈ R^n be a unit vector, let H be a randomly oriented ℓ-dimensional subspace through the origin, and let X be the square of the length of the projection of v on to H (X is thus a random variable). Then E[X] = ℓ/n, and if ℓ is chosen between Ω(log n) and O(√n), then Pr(|X − ℓ/n| > ε ℓ/n) < 2√ℓ exp(−(ℓ−1)ε²/4), where 0 < ε < 1/2 [30]. It is easy to see that this implies small distortion in inter-point distances and inner products. This theorem can be used to develop a technique for reducing the dimensionality of the points so as to speed up the distance computations inherent in clustering programs. It has been proven [58] that the quality of the resulting clustering is not significantly worse than a clustering derived from the points in the original space. In practice, simpler methods, such as truncating the document or cluster centroid vectors to retain only the most frequent terms (i.e., the largest orthogonal components), perform almost as well as projection [20, 66].

In practice, simpler methods such as truncating the document or cluster centroid vectors to retain only the most frequent terms (i.e., the largest orthogonal components) perform almost as well as projection [20, 66].

5 Semi-supervised learning

Supervised learning is a goal-directed activity which can be precisely evaluated, whereas unsupervised learning is open to interpretation. On the other hand, supervised learning needs training data which must be obtained through human effort. In real life, most often one has a relatively small collection of labeled training documents, but a larger pool of unlabeled documents.

5.1 Learning from labeled and unlabeled documents

Clearly, term sharing and similarity between labeled and unlabeled documents is a source of information that may lead to increased classification accuracy. This has indeed been verified using a simple algorithm patterned after Expectation Maximization (EM) [24]. We restrict the discussion to naive Bayesian classifiers. The modification is that documents do not belong deterministically to a single class, but may belong probabilistically to many classes. (This distribution will be degenerate for the initial labeled documents.) The algorithm first trains a naive Bayes classifier using only labeled data. Thereafter each EM iteration consists of the E-step and M-step. In the E-step one estimates θ_{c,t} using a variant of equation (3):

    θ_{c,t} = (1 + Σ_d n(d,t) Pr(c|d)) / (|T| + Σ_{d,τ} n(d,τ) Pr(c|d)).    (7)

In the M-step, the θ_{c,t} estimates are used to assign class probabilities Pr(c|d) to all documents not labeled as part of the input. These EM iterations continue until (near-) convergence. Results are mostly favorable compared to naive Bayes alone: error is reduced by a third in the best cases, but care needs to be taken in modeling classes as mixtures of term distributions.
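The EM loop can be sketched on a toy corpus; the vocabulary, the documents, and the iteration count are all invented, and the smoothed estimate follows the form of equation (7):

```python
import numpy as np

# Tiny semi-supervised naive Bayes with EM (illustrative sketch only).
vocab = ["ball", "game", "vote", "party"]
V = len(vocab)

def counts(doc):
    v = np.zeros(V)
    for w in doc:
        v[vocab.index(w)] += 1
    return v

labeled = [(counts(["ball", "game"]), 0),        # class 0: sports
           (counts(["vote", "party"]), 1)]       # class 1: politics
unlabeled = [counts(["ball", "ball", "game"]),
             counts(["vote", "party", "party"])]

C = 2
resp_l = np.array([[1.0, 0.0], [0.0, 1.0]])      # degenerate Pr(c|d) for labeled docs
resp_u = np.full((len(unlabeled), C), 1.0 / C)   # start uninformative

X_l = np.array([d for d, _ in labeled])
X_u = np.array(unlabeled)

for _ in range(10):
    # Estimate theta_{c,t} with Laplace smoothing as in equation (7),
    # using fractional counts from all documents.
    num = 1.0 + resp_l.T @ X_l + resp_u.T @ X_u          # C x V
    theta = num / num.sum(axis=1, keepdims=True)
    prior = resp_l.sum(axis=0) + resp_u.sum(axis=0)
    prior = prior / prior.sum()
    # Re-assign class probabilities Pr(c|d) to the unlabeled documents.
    log_p = np.log(prior) + X_u @ np.log(theta).T        # docs x C
    log_p -= log_p.max(axis=1, keepdims=True)
    resp_u = np.exp(log_p)
    resp_u /= resp_u.sum(axis=1, keepdims=True)

print(resp_u.round(3))   # first doc leans to class 0, second to class 1
```

After a few iterations the unlabeled documents acquire sharp class posteriors and their terms reinforce the per-class term distributions, which is the mechanism behind the accuracy gains reported for EM over labeled data alone.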

SIGKDD Explorations, Volume 1, Issue 2, Jan 2000. Copyright © 2000 ACM SIGKDD.

5.2 Relaxation labeling

In supervised topic learning from text, distribution-based classifiers posit a class-conditional term distribution Pr(t⃗|c) and use this to estimate the posterior class probabilities Pr(c|d) for a given document. Consider now the hypertext model in §2, where training and testing documents are nodes in a hypertext graph. It seems obvious that apart from terms in the training and testing documents, there are other sources of information induced by the links, but it is unclear how to exploit them. One can adopt the notion, motivated by spreading activation techniques, that a term `diffuses' to documents in the local link neighborhood of the document where it belongs. (One may apply a heuristic damping function which limits the radius of influence of a term.) This has been proved useful in hypertext IR [64]. In our context, it means upgrading our model to a class-conditional distribution over terms in not only the current document d, but also neighboring documents N(d) (some suitable radius can be chosen to determine N(d)); in other words, Pr(t⃗(d), t⃗(N(d)) | c). Operationally, it means that terms in neighbors of a training document d1 contribute to the model for the class of d1, and terms in neighbors of testing document d2 may be used to estimate the class probabilities for d2. (Interestingly, this does not lead to better accuracy in experiments [16].) The reason, in hindsight, is that content expansion is not the only purpose of citation: it is not always the case that, with respect to a set of topics, hypertext forms very tightly isolated clusters. In the US Patents database, for example, one often finds citations from patents from the class `amplifiers' to the class `oscillators' (but none to `pesticides'). Thus one is led to refine the model of linkages to posit a class-conditional joint distribution on terms in a document and the classes of neighbors, denoted by Pr(t⃗(d), c⃗(N(d)) | c). Note that there is a circularity involving class labels. One approach to resolve the circularity is to initially use a text-only classifier to assign initial class probabilities Pr^(0)(c|d) to all documents in a suitable neighborhood of the test document, then iteratively compute, for each node at step r+1,

    Pr^(r+1)(c | d, N(d)) = Σ_{c⃗(N(d))} Pr^(r)(c⃗(N(d))) Pr^(r)(c | d, c⃗(N(d))),    (8)

where the sum is over all possible neighborhood class assignments. (In practice one truncates this sum.) In the case where some of the nodes in N(d) are preclassified, this becomes a partially supervised learning problem. Relaxation labeling is able to smoothly improve its accuracy as this happens [16].
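A heavily simplified sketch of this iteration follows. It approximates the sum in equation (8) by treating neighbors independently (a mean-field shortcut, not the exact computation), and the graph, the text-only posteriors, and the class-coupling matrix are all invented:

```python
import numpy as np

C = 2
# Text-only posteriors Pr^(0)(c|d) from a hypothetical classifier.
text = np.array([[0.9, 0.1],    # node 0: clearly class 0
                 [0.8, 0.2],    # node 1: clearly class 0
                 [0.5, 0.5],    # node 2: ambiguous from text alone
                 [0.1, 0.9]])   # node 3: clearly class 1
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
# coupling[c, c']: how often a page of class c links to a page of class c'.
coupling = np.array([[0.8, 0.2],
                     [0.2, 0.8]])

P = text.copy()
for _ in range(20):
    new = np.empty_like(P)
    for d, ns in nbrs.items():
        score = np.log(text[d])
        for e in ns:
            # marginalize over each neighbor's current label distribution
            score = score + np.log(coupling @ P[e])
        score -= score.max()
        new[d] = np.exp(score) / np.exp(score).sum()
    P = new

print(P[2].round(3))   # the ambiguous node is pulled towards class 0
```

The textually ambiguous node 2 inherits the dominant class of its neighborhood, while the strongly labeled nodes barely move — the smooth improvement described above.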

6 Social network analysis

The web is an example of a social network. Social networks have been extensively researched [70]. Social networks are formed between academics by co-authoring, advising, and serving on committees; between papers through citation; between musicians, football stars, friends and relatives; between people by making phone calls and transmitting infection; between movie personnel by directing and acting; and between web pages by hyperlinking to other web pages. Social network theory is concerned with properties related to connectivity and distances in graphs, with diverse applications like epidemiology, espionage, and citation indexing. In the first two examples, one might be interested in identifying a few nodes to be removed to significantly increase the average path length between pairs of nodes. In citation analysis, one may wish to identify influential or central papers; this turns out to be quite symmetric to finding good survey papers. This symmetry has been explored by Mizruchi and others [55]. The IR literature includes insightful studies of citation, co-citation, and the influence of academic publication [48].

6.1 Applying social network analysis to the Web

Starting in 1996, a series of applications of social network analysis were made to the web graph, with the purpose of identifying the most authoritative pages related to a user query.

6.1.1 Google

If one wanders on the Web for infinite time, following a random link out of each page with probability 1−p and jumping to a random Web page with probability p, then different pages will be visited at different rates; popular pages with many in-links will tend to be visited more often. This measure of popularity is called PageRank [10], defined recursively as

    PageRank(v) = p/N + (1−p) Σ_{u→v} PageRank(u) / OutDegree(u),    (9)

where `→' means "links to" and N is the total number of nodes in the Web graph. (The artifice of p is needed because the Web is not connected or known to be aperiodic; therefore the simpler eigen-equation is not guaranteed to have a fixed point.)

The Google search engine (http://google.com) simulates such a random walk on the web graph in order to estimate PageRank, which is used as a score of popularity. Given a keyword query, matching documents are ordered by this score. Note that the popularity score is precomputed independent of the query; hence Google can be potentially as fast as any relevance-ranking search engine.

6.1.2 Hyperlink induced topic search (HITS)

Hyperlink induced topic search [43] is slightly different: it does not crawl or pre-process the web, but depends on a search engine. A query to HITS is forwarded to a search engine such as Alta Vista, which retrieves a subgraph of the web whose nodes (pages) match the query. Pages citing or cited by these pages are also included. Each node u in this expanded graph has two associated scores h_u and a_u, initialized to 1. HITS then iteratively assigns

    a_v := Σ_{u→v} h_u   and   h_u := Σ_{u→v} a_v,    (10)

where Σ_u h_u and Σ_v a_v are normalized to 1 after each iteration. The a and h scores converge respectively to the measure of a page being an authority, and the measure of a page being a hub (a compilation of links to authorities, or a "survey paper" in bibliometric terms). Because of the query-dependent graph construction, HITS is slower than Google. A variant of this technique has been used by Dean and Henzinger to find similar pages on the Web using link-based analysis alone [22]. They improve speed by fetching the Web graph from a connectivity server which has pre-crawled substantial portions of the Web [7].
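The iteration in equation (10) can be sketched on an invented query subgraph:

```python
import numpy as np

# HITS iteration (equation (10)) on a small made-up query subgraph.
edges = [(0, 3), (1, 3), (2, 3), (1, 4), (2, 4), (0, 2)]  # (u, v): u links to v
N = 5
h = np.ones(N)
a = np.ones(N)
for _ in range(50):
    a_new = np.zeros(N)
    for u, v in edges:
        a_new[v] += h[u]          # a_v := sum of hub scores pointing at v
    h_new = np.zeros(N)
    for u, v in edges:
        h_new[u] += a_new[v]      # h_u := sum of authority scores u points at
    a = a_new / a_new.sum()       # normalize after each iteration
    h = h_new / h_new.sum()

# Node 3 is cited by three hubs, so it gets the top authority score;
# nodes 1 and 2 each point at both good authorities, so they are the top hubs.
print(a.round(3), h.round(3))
```

The mutual reinforcement is visible even on this toy graph: the heavily cited node dominates the authority vector while its citers dominate the hub vector.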

6.1.3 Adding text information to link-based popularity

HITS's graph expansion sometimes leads to topic contamination or drift. E.g., the community of movie awards pages on the web is closely knit with highly cited (and to some extent relevant) home pages of movie companies. Although movie awards is a finer topic than movies, the top movie companies emerge as the victors upon running HITS. This is partly because in HITS (and Google) all edges in the graph have the same importance. (Google might be using edge-weighting strategies which are not published.) Contamination can be reduced by recognizing that hyperlinks that contain award or awards near the anchor text are more relevant for this query than other edges. The Automatic Resource Compilation (ARC) and Clever systems incorporate such query-dependent modification of edge weights [15, 17]. Query results are significantly improved; in user studies, the results compared favorably with lists compiled by humans, such as Yahoo! and Infoseek (http://www.infoseek.com).

6.1.4 Outlier filtering

Bharat and Henzinger have invented another way to integrate textual content and thereby avoid contamination of the graph to be distilled.

Unlike HITS, during the graph expansion step they do not include all nodes at distance one from the preliminary query result. Instead they prune the graph expansion at nodes whose corresponding term vectors are outliers with respect to the set of vectors corresponding to documents directly retrieved from the search engine [8]. They model each page according to the "vector space" model [63]. In the example above, one would hope that the response to the query movie award from the initial keyword search would contain a majority of pages related to awards and not companies; thus the distribution of keywords on these pages will enable the Bharat and Henzinger algorithm to effectively prune away as outliers those nodes in the neighborhood that are about movie companies. Apart from speeding up their system by using the Connectivity Server [7], they describe several heuristics that cut down the query time substantially.

ARC, Clever, and Outlier Filtering have been shown to be better (as judged by testers) than HITS. There has not been a systematic comparison between Google and the HITS family; this would be of great interest given the basic difference in graph construction and the consequent greater speed of Google. It is possible to fabricate queries that demonstrate the strengths and weaknesses of each of these systems.
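The pruning idea can be sketched as follows; the term vectors, the cosine cutoff, and the use of a plain centroid are invented simplifications of the algorithm in [8]:

```python
import numpy as np

# Drop expansion pages whose term vectors are far from the centroid of the
# root (query-matched) set. Dimensions: [award, movie, studio] counts.
root = np.array([[3.0, 1.0, 0.0],     # "award"-heavy pages from the query
                 [2.0, 2.0, 0.0],
                 [4.0, 1.0, 1.0]])
expansion = {"awards-list": np.array([3.0, 2.0, 0.0]),
             "movie-studio": np.array([0.0, 1.0, 5.0])}

centroid = root.mean(axis=0)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

kept = [name for name, vec in expansion.items()
        if cos(vec, centroid) >= 0.7]        # hypothetical cutoff
print(kept)   # only the award-like page survives pruning
```

Pages resembling the root set pass the similarity test, while the off-topic studio page is pruned before link analysis ever sees it.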

6.2 Resource discovery

There is short-range topic locality on the Web, in at least two prominent forms. If the reader finds this paper interesting, there is reasonable chance that the reader will also find some significant fraction of the citations in this paper to be interesting too (i.e., this paper is a suitable hub). If this paper cites a paper that the reader thinks is interesting, that makes it more likely that another paper cited by this one is interesting. Coupling the ability to find good hubs with the ability to judge how likely a given page is to be generated from a given topic of interest, one can devise crawlers that crawl the Web for selected topics [18]. The goal is to crawl as many relevant pages as fast as possible while crawling as few irrelevant pages as possible. The crawler can be guided by a text or hypertext classifier; alternatively, one may use reinforcement learning techniques [60, 52].
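A minimal focused-crawl loop can be sketched with a priority queue; the tiny in-memory "web", the relevance scores standing in for a classifier, and the best-first policy are invented for illustration:

```python
import heapq

# Frontier URLs are ordered by the (hypothetical) classifier relevance of
# the page that linked to them.
web = {
    "seed": (0.9, ["a", "b"]),        # (topic relevance, out-links)
    "a":    (0.8, ["c"]),
    "b":    (0.1, ["d"]),             # off-topic page
    "c":    (0.7, []),
    "d":    (0.05, []),
}

def crawl(seed, budget):
    frontier = [(-1.0, seed)]         # max-heap via negated priority
    seen, order = set(), []
    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        if url in seen:
            continue
        seen.add(url)
        relevance, outlinks = web[url]
        order.append(url)
        for v in outlinks:
            # priority of an unseen link = relevance of its referrer
            heapq.heappush(frontier, (-relevance, v))
    return order

print(crawl("seed", budget=4))   # visits the on-topic branch before "d"
```

Under a fixed fetch budget the crawler exhausts the on-topic branch first and never reaches the page behind the off-topic one, which is the intended trade-off: many relevant pages early, few irrelevant fetches.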

7 Conclusion

In this survey we have reviewed recent research on the application of techniques from machine learning, statistical pattern recognition, and data mining to analyzing hypertext. Starting from simple bulk models for documents and hyperlinks, researchers have progressed towards document substructure and the interplay between link and content. Commercial search products and services are slowly adopting the results of recent investigations. We will also likely see increased systems and tool building as the winning research ideas become more firmly established. Mining of semistructured data has followed a slightly different path, but it seems clear that a confluence will be valuable. It seems inevitable that knowledge bases and robust algorithms from computational linguistics will play an increasingly important role in hypertext content management and mining in the future.

Acknowledgements: Thanks to Pushpak Bhattacharya and Yusuf Batterywala for helpful discussions and references.

8 REFERENCES

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD Conference on Management of Data, Seattle, June 1998. Online at http://www.almaden.ibm.com/cs/quest/papers/sigmod98_clique.pdf.
[2] J. Allan. Automatic hypertext link typing. In 7th ACM Conference on Hypertext, Hypertext '96, pages 42-51, 1996.
[3] J. Allen. Natural Language Understanding. Benjamin/Cummings, 1987.
[4] D. Arnold, L. Balkan, S. Meijer, R. L. Humphreys, and L. Sadler. Machine translation: An introductory guide, 1995. Online at http://clwww.essex.ac.uk/~doug/book/book.html.
[5] Babelfish Language Translation Service. http://www.altavista.com.
[6] K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. In 7th World-Wide Web Conference (WWW7), Brisbane, Australia, 1998. Online at http://www7.scu.edu.au/programme/fullpapers/1937/com1937.htm; also see an update at http://www.research.digital.com/SRC/whatsnew/sem.html.
[7] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The Connectivity Server: Fast access to linkage information on the Web. In 7th World Wide Web Conference, 1998. Online at http://decweb.ethz.ch/WWW7/1938/com1938.htm.
[8] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104-111, Aug. 1998. Online at ftp://ftp.digital.com/pub/DEC/SRC/publications/monika/sigir98.pdf.
[9] W. Bluestein. Hypertext versions of journal articles: Computer aided linking and realistic human evaluation. PhD thesis, University of Western Ontario, 1999.
[10] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th World-Wide Web Conference (WWW7), 1998. Online at http://decweb.ethz.ch/WWW7/1921/com1921.htm.
[11] C. Cardie and D. Pierce. The role of lexicalization and pruning for base noun phrase grammars. In AAAI 99, pages 423-430, July 1999.

[12] N. Catenazzi and F. Gibb. The publishing process: the hyperbook approach. Journal of Information Science, 21(3):161-172, 1995.
[13] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, Aug. 1998. Invited paper; online at http://www.cs.berkeley.edu/~soumen/VLDB54_3.PDF.
[14] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures to navigate in text databases. In VLDB, Athens, Greece, Sept. 1997.
[15] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. In 7th World-wide web conference (WWW7), 1998. Online at http://www7.scu.edu.au/programme/fullpapers/1898/com1898.html.
[16] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In SIGMOD Conference, 1998. Online at http://www.cs.berkeley.edu/~soumen/sigmod98.ps.
[17] S. Chakrabarti, B. Dom, D. Gibson, S. Ravi Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the Web's link structure. IEEE Computer, 32(8):60-67, Aug. 1999. Feature article.
[18] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31:1623-1640, 1999. First appeared in the 8th International World Wide Web Conference, Toronto, May 1999; available online at http://www8.org/w8-papers/5a-search-query/crawling/index.html.
[19] T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.

[20] D. Cutting, D. Karger, and J. Pedersen. Constant interaction-time scatter/gather browsing of very large document collections. In Annual International Conference on Research and Development in Information Retrieval (SIGIR), 1993.
[21] D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Annual International Conference on Research and Development in Information Retrieval (SIGIR), 1992.
[22] J. Dean and M. Henzinger. Finding related pages in the world wide web. In 8th World Wide Web Conference, May 1999.
[23] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the Society for Information Science, 41(6):391-407, 1990. Online at http://superbook.telcordia.com/~remde/lsi/papers/JASIS90.ps.
[24] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B(39):1-38, 1977.
[25] S. DeRose. What do those weird XML types want, anyway? Keynote address, VLDB 1999, Edinburgh, Scotland, Sept. 1999.
[26] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130, 1997.
[27] R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[28] S. Fong and R. Berwick. Parsing with Principles and Parameters. MIT Press, 1992.
[29] W. Frakes and R. Baeza-Yates. Information retrieval: Data structures and algorithms. Prentice-Hall, 1992.
[30] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, B 44:355-362, 1988.
[31] J. Friedman. On bias, variance, 0/1 loss, and the curse of dimensionality. Data Mining and Knowledge Discovery, 1(1):55-77, 1997. Online at ftp://playfair.stanford.edu/pub/friedman/curse.ps.Z.
[32] G. Gazdar and C. Mellish. Natural Language Processing in LISP. Addison-Wesley, 1989.
[33] R. Goldman, J. McHugh, and J. Widom. From semistructured data to XML: Migrating the Lore data model and query language. In Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99), June 1999. Online at http://www-db.stanford.edu/pub/papers/xml.ps.
[34] G. Golub and C. van Loan. Matrix Computations. Johns Hopkins University Press, London, 1989.
[35] S. Green. Building hypertext links in newspaper articles using semantic similarity. In Natural Language and Data Bases Conference (NLDB'97), 1997.
[36] L. Haegeman. Introduction to Government and Binding Theory. Basil Blackwell Ltd., Oxford, 1991.
[37] M. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proceedings of the 20th Annual International ACM/SIGIR Conference, Philadelphia, PA, July 1997. Online at ftp://parcftp.xerox.com/pub/hearst/sigir97.ps.
[38] D. Heckerman. Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1), 1997. Also see URLs ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS and ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.PS.
[39] W. Hutchins and H. Somers. An Introduction to Machine Translation. Academic Press, 1992.
[40] UNU. The universal networking language: Specification document. Institute of Advanced Studies, United Nations University.
[41] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[42] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999. Also see http://www-ai.informatik.uni-dortmund.de/FORSCHUNG/VERFAHREN/SVM_LIGHT/svm_light.html.
[43] J. Kleinberg. Authoritative sources in a hyperlinked environment. In ACM-SIAM Symposium on Discrete Algorithms, 1998. Online at http://www.cs.cornell.edu/home/kleinber/auth.ps.
[44] D. Koller and M. Sahami. Toward optimal feature selection. In L. Saitta, editor, International Conference on Machine Learning, volume 13. Morgan-Kaufmann, 1996.
[45] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In International Conference on Machine Learning, volume 14. Morgan-Kaufmann, July 1997. Online at http://robotics.stanford.edu/users/sahami/papers-dir/ml97-hier.ps.
[46] K. Kumarvel. Automatic hypertext creation. M.Tech thesis, Computer Science and Engineering Department, IIT Bombay, 1999.
[47] P. Laplace. Philosophical Essays on Probabilities. Springer-Verlag, New York, 1995. Translated by A. I. Dale from the 5th French edition of 1825.
[48] R. Larson. Bibliometrics of the world wide web: An exploratory analysis of the intellectual structure of cyberspace. In Annual Meeting of the American Society for Information Science, 1996. Online at http://sherlock.berkeley.edu/asis96/asis96.html.
[49] S. Lawrence and C. Lee Giles. Accessibility of information on the web. Nature, 400:107-109, July 1999.
[50] M. Craven and K. Nigam. First order learning for web mining. In 10th European Conference on Machine Learning, 1998.
[51] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41-48, 1998. Also technical report WS-98-05, AAAI Press. Online at http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf.
[52] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Building domain-specific search engines with machine learning techniques. In AAAI-99 Spring Symposium, 1999. Online at http://www.cs.cmu.edu/~mccallum/papers/cora-aaaiss99.ps.gz.
[53] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In ICML, 1998. Online at http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz.
[54] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, 26(3):54-66, Sept. 1997. Online at http://www-db.stanford.edu/pub/papers/lore97.ps.
[55] M. Mizruchi, P. Mariolis, M. Schwartz, and B. Mintz. Techniques for disaggregating centrality scores in social networks. In N. Tuma, editor, Sociological Methodology, pages 26-48. Jossey-Bass, San Francisco, 1986.
[56] T. Moon and W. Stirling. Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, 1 edition, 1999.
[57] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI'99 Workshop on Information Filtering, 1999. Online at http://www.cs.cmu.edu/~mccallum/papers/maxent-ijcaiws99.ps.gz.
[58] C. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. Submitted for publication.
[59] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998. Online at http://www.research.microsoft.com/users/jplatt/smoTR.pdf.
[60] J. Rennie and A. McCallum. Using reinforcement learning to spider the web efficiently. In ICML, 1999. Online at http://www.cs.cmu.edu/~mccallum/papers/rlspider-icml99s.ps.gz.
[61] E. Ristad. A natural law of succession. Research report CS-TR-495-95, Princeton University, July 1995.
[62] G. Salton. Automatic Text Processing. Addison Wesley, 1989.
[63] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[64] J. Savoy. An extended vector processing scheme for searching information in hypertext systems. Information Processing and Management, 32(2):155-170, Mar. 1996.
[65] R. Schank and C. Rieger. Inference and computer understanding of natural language. In R. J. Brachman and H. J. Levesque, editors, Readings in Knowledge Representation. Morgan Kaufmann Publishers, 1985.
[66] H. Schutze and C. Silverstein. A comparison of projections for efficient document clustering. In Proceedings of the Twentieth Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74-81, Philadelphia, July 1997. Online at http://www-cs-students.stanford.edu/~csilvers/papers/metrics-sigir.ps.
[67] J. Sowa. Conceptual Structures: Information Processing in Mind and Machines. Addison-Wesley, 1984.
[68] D. Temperley. An introduction to the link grammar parser. Technical report, Apr. 1999. Online at http://www.link.cs.cmu.edu/link/dict/introduction.html.
[69] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems, 1996.
[70] S. Wasserman and K. Faust. Social Network Analysis. Cambridge University Press, 1994.
[71] T. Winograd. Language as a Cognitive Process, Volume 1: Syntax. Addison Wesley, Reading, MA, 1983.
[72] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning, pages 412-420, 1997.

Biography: Soumen Chakrabarti (http://www.cse.iitb.ernet.in/~soumen/) received his B.Tech in Computer Science from the Indian Institute of Technology, Kharagpur, in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley, in 1992 and 1996. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message passing multiprocessors. He was a Research Staff Member at IBM Almaden Research Center between 1996 and 1999. At IBM he worked on hypertext analysis and information retrieval. He designed the Focused Crawler (http://www.cs.berkeley.edu/~soumen/focus) and part of the Clever (http://www.almaden.ibm.com/cs/k53/clever.html) search engine. He is currently an Assistant Professor in the Department of Computer Science and Engineering at the Indian Institute of Technology, Bombay. His research interests include hypertext information retrieval, web analysis and data mining.
