
ELECTRONIC PUBLISHING, VOL. 4(2), 87–108 (JUNE 1991)

Information retrieval in hypertext systems:
an approach using Bayesian networks
JACQUES SAVOY AND DANIEL DESBOIS
Université de Montréal
Département d'informatique et de recherche opérationnelle
P. O. Box 6128, station A
Montréal (Qc) H3C 3J7, Canada

SUMMARY
The emphasis in most hypertext systems is on the navigational methods, rather than on the
global document retrieval mechanisms. When a search mechanism is provided, it is often
restricted to simple string matching or to the Boolean model. As an alternative, we
propose a retrieval mechanism using Bayesian inference networks. The main contribution
of our approach is the automatic construction of this network using the expected mutual
information measure to build the inference tree, and using Jaccard’s formula to define fixed
conditional probability relationships.
KEY WORDS

Hypertext; Information retrieval; Information retrieval in hypertext; Bayesian network;
Probabilistic retrieval model; Probabilistic inference; Uncertainty processing

1 INTRODUCTION
Hypertext systems [1–3] have been designed to articulate, organize and visualize information units (called nodes) that are interconnected according to different relationships,
providing an environment where the user can navigate through the information network by
following existing links or by creating new links. Navigational access that is restricted to
existing links means that the user can only move from one document to another, and cannot
easily find an answer to the question, “Where (else) can I find something interesting?”.
Ideally, a hypertext system should be able to provide more information about the current
user’s focus without introducing a specific query.
To provide more effective information retrieval and facilitate suggestions of documents
of interest to the user, we have designed and implemented a system based on Bayesian
inference networks [4]. A similar proposition can be found in References [5] and [6],
which discuss a theoretical model, and in Frisse’s [7] approach, which uses the same
idea but is applied in a specific context. Unlike Frisse’s solution, based on predefined
relationships, our system does not require the provision of any prior information. Thus,
our prototype is automatic and can adapt to any field of knowledge or interest.
This article is divided into four sections. In the first section, we describe hypertext
systems and some information retrieval problems. Next, we reveal the Bayesian principles
and illustrate them with an example. In the third section, we explain how we create a
Bayesian inference network using a tree structure. Finally, we illustrate how one can search

0894–3982/91/020087–22$11.00
© 1991 by John Wiley & Sons, Ltd.

Received 12 November 1990
Revised 30 May 1991


for information in a hypertext system, and demonstrate the preliminary results of our
prototype.

2 HYPERTEXT SYSTEMS AND INFORMATION RETRIEVAL
In this section, we explore the general principles of hypertext systems and then explain
some global information retrieval approaches used in them. We then reveal some problems
related to information retrieval, and explain our objectives.

2.1 Hypertext
In the hypertext approach [1], unlike classical information retrieval [8], the user explores
the information sources instead of writing a formal query and consulting the result (a set
of documents relating to the request). It involves the paradigm "where → what" instead of
"what → where". The success of the navigation concept is based on the human capability
of rapidly scanning a text and extracting essential information. Also important is the
capability of easily recognizing the information one is looking for without being able to
describe it: “I don’t know how to describe ‘art’ but I know what it looks like”.
The hypertext approach suggests a form of writing and a dynamic document exploration
that breaks with the traditional linearity of text. It can be defined in the following way:
A hypertext is a network of information nodes connected by means of relational links. A
hypertext system is a configuration of hardware and software that presents a hypertext
to users and allows them to manage and access the information that it contains. [9]

This approach has already been explored in several prototypes (see Reference [1]) and
commercial products (Guide [10], HyperCard [11], etc.). To search relevant information,
a simple-minded solution is for the user to navigate through the information network to
find, by himself, the information that he is looking for. This linking facility is a “short-term
view”; for example, in Figure 1, the user reading the node D1 had to jump twice to reach
the document D3.
Beside these links, hypertext systems also propose global access methods using indexes
and tables of contents. According to Alschuler [12], these tools are not enough to find
relevant material. Indexes are divided into different levels and only one level is presented
at a time. They are sorted according to non-standard conventions (partial alphabetical order).
The tables of contents reflect the structure of the documents and therefore are not adequate
to search for specific information (i.e., the text of the table entry may be too short or the
meaning too broad).
Many hypertext systems use local or global maps [1], [13]. It is very difficult to build
these views because the hypergraph has no natural topology, the number of nodes and links
may become very large, the response time may be too slow, the differentiation among links
and/or nodes may be problematic, etc. Finally, the usability of these graphical tools has not
yet been established [13]. For example, when the number of nodes and links becomes too
large, the map shows too many things to be very helpful (i.e., Figure 11 in Reference [1]);
therefore, some systems, such as KMS [14], do not use graphical views. As an alternative,
the design of the KMS system is based on a hierarchical database structure combined with
rapid system response by following links.

Figure 1. Hypertext representation and its network of links (hypergraph)

3 INFORMATION RETRIEVAL

Based on the user's information needs, information retrieval proposes models and search strategies to find documents that match the user's interest. As a first solution, hypertext systems have used classical search techniques that consider documents as independent entities. String matching, for example, can be compared to information retrieval in a database. This method is not very effective [15, pp. 229–274], [16], and using the Prolog unification mechanism does not resolve the problem [5]. In the Boolean approach [8, Chapter 1], however, because of its deterministic nature (exact match), these techniques are not satisfactory. The search process itself may be problematic: most user difficulties are related to the choice of terms [17], the selection of the right operator, the exact meaning of each operator "AND", "OR" and "NOT", the fact that a concept may be represented by a great variety of key words [18], the knowledge of the content of the information space and the appropriate key words, the experience with the system, the learning of an artificial syntax language, the indexing processing used, the heuristics used to enlarge or narrow the search, etc. (see References [19], [20] or [21, Chapter 7] for a more detailed discussion).

In order to promote a global search method, we based our solution on two observations: (1) the user, as stated before, has many difficulties in writing a good query; (2) a search system may improve its performance by 10% to 20% over more conventional techniques by using a thesaurus of related terms [8, p. 308], although this tool is not yet widely available and its construction requires large amounts of work.

Our prototype proposes another way of identifying the user's information need. The user does not have to find a term, write it and submit a query. He just has to select the appropriate word(s) on a screen, press a button in order to express the relevance (or irrelevance) of each word, and ask the searching process to begin. Our system builds a query not only with the terms specified by the user, but by automatically including related terms. In order to find them, we use a Bayesian tree.

4 THE BAYESIAN NETWORK

This section explains the basic concepts of Bayesian reasoning. Using an example, we first show the usefulness of this approach. Next, we describe the belief computation and the evidence propagation mechanism within a tree.

4.1 Example of inference tree reasoning

On the basis of Pearl's concepts [4], a Bayesian tree contains as nodes a set of terms with their qualitative (i.e., synonym, related key word) and quantitative relationships with other terms (i.e., as a word becomes more important, how can we modify the relevance of other terms?). Each node can have two possible values representing mutually exclusive and collectively exhaustive events. The domain of each term X is represented as:

X = ["term x is relevant", "term x is irrelevant"]

The first value indicates that the term is relevant in the description of the user's interest, and the second that the term is irrelevant. Each term may be more or less important for a user who is searching for information. This weight factor is called a belief: the higher the belief, the more relevant the term, and vice versa. The belief vector is noted BEL(X), and since we have no information at the beginning about the relevance of a term, the belief vector is set to:

BEL(X) = [0.5, 0.5]

Let us assume for the moment that we already have a Bayesian tree (see Figure 2) where the key word "statistic" (the stem of "statistics") is directly related to the terms "interval", "variance" and "estimation". To each node, we assign a descendant called an evidence node (the node with the label "Evidence" in Figure 2).

Figure 2. Bayesian tree after initialization (hand-crafted)

This descendant node is used to propagate the user's wishes through the network. The "prob" matrix represents the strength of the link between the two nodes; equation (2) gives its exact structure.

The user can give new evidence to the inference tree in two ways: first, by indicating the relevant and irrelevant word(s) of a document to the system. For example, the user sees the key word "statistics" in the text and would like more information about it. The term is then selected, giving positive feedback (see Figure 8). Figure 3 shows the effect of this new evidence on the entire tree: the belief associated with the event "statistics is relevant" is 0.75, and the belief for the fact "geometric is relevant" is 0.43. Of course, the evidence nodes are not really updated as shown in Figures 2 and 3, because their beliefs are identical to the belief vector of the index node to which they are assigned. The belief values may be used to select the n most important terms in the user's information need (see Section 6.2).

Second, the user can make a judgment on an entire document. If a document is judged relevant, the system updates the evidence of all indexed terms of that document. In this update, we ignore the respective term weights (equation (4)), because the system is not able to know which part of the document is more important from the user's point of view. In our system, the user may use different values to quantify his feedback. The available options are: {Very Relevant, Relevant, No Assessment, Irrelevant, Very Irrelevant}. One can also subdivide the interval [0, 1] into disjoint subintervals in order to obtain a value in the context of multivalued logic.
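Before looking at the propagation itself, here is a minimal sketch, in Python, of how such judgments could be mapped to evidence for a term node. It assumes the five-point {−2, ..., 2} scale described with Figure 3 below; the exponential normalization is our own reading of the paper's brief remark that the values are "normalized with previous evidence to define probabilities", not the authors' exact formula.

    # Hypothetical mapping from judgment labels to an evidence vector.
    import math

    FEEDBACK_SCALE = {"Very Relevant": 2, "Relevant": 1, "No Assessment": 0,
                      "Irrelevant": -1, "Very Irrelevant": -2}

    def evidence_vector(label, prior=(0.5, 0.5), gain=0.5):
        """Map a judgment to a two-value vector [P(relevant), P(irrelevant)]."""
        s = FEEDBACK_SCALE[label]
        w_rel = prior[0] * math.exp(gain * s)    # strengthen "relevant" when s > 0
        w_irr = prior[1] * math.exp(-gain * s)   # strengthen "irrelevant" when s < 0
        z = w_rel + w_irr                        # renormalize to probabilities
        return (w_rel / z, w_irr / z)

    print(evidence_vector("Very Relevant"))      # approx (0.88, 0.12)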

Figure 3. Bayesian tree after a positive update from node "statistic"

This scale allows us to give positive or negative evidence with more or less strength. The values associated with this scale are {−2, −1, 0, 1, 2} and are normalized with previous evidence to define probabilities.

The user can save a Bayesian network or load an existing tree containing already set values. In this case, the user gets a Bayesian network already trained for a specific field of interest or for a specific user group interest. One cannot assert, however, that once a Bayesian tree is built, we can use it for every collection of documents. As Lesk [23] points out, one cannot generalize a relationship between any couple of terms: different interest domains imply the presence of a different indexing space, and thus require distinct Bayesian trees (or distinct thesauri).

We can view a Bayesian network as a kind of thesaurus, in which each entry is related to its synonyms by a qualitative relationship indicating narrower or broader terms. Furthermore, in our Bayesian network, we have quantified each relationship between words: in Figure 3, for example, one can see that the term "variance" is more closely related to "statistic" than "interval" is. On the other hand, we cannot see the immediate descendants of a node in order to find all related words without inspecting the tree upwards or downwards (i.e., "statistic" and "hypothesis" in Figure 3).

We are still studying the possibility of performing an update on the Bayesian structure itself, that is, adding new nodes, modifying the position of nodes, and changing the value of relationships between terms. These efforts are related to Fung's work [22].

The relationship of terms occurring concurrently forms a basis for the Bayesian network, but it is valid only for the document set from which the Bayesian tree is obtained, and this relationship may be invalid elsewhere.

4.2 Belief computation and evidence propagation

A Bayesian network is a directed acyclic dependency graph formed by nodes and arcs, with nodes representing uncertain variables and edges symbolizing dependence relationships between variables. In general, a variable may have more than two values. The belief associated with X is presented as a vector named BEL(X), which contains the probabilities associated with each value of node X. Each element of this vector is evaluated by:

BEL(x_i) = P(x_i | e+_x, e−_x)
         = α · P(e−_x | e+_x, x_i) · P(x_i | e+_x)
         = α · P(e−_x | x_i) · P(x_i | e+_x)        (1)

where the probability P(x_i | e+_x) measures the prior probability of the fact X = x_i (predictive support), while P(e−_x | x_i) measures the likelihood of X = x_i (diagnostic support). The factor α is a constant used for normalization.

Each link between variables is represented by a matrix M_x|u containing the strength of the relationship U → X. This matrix stores the fixed conditional probabilities for all possible values of the two variables as follows:

M_x|u = P(X = x | U = u) = P(x | u) =

    | P(x_1|u_1)  P(x_2|u_1)  ...  P(x_l|u_1) |
    | P(x_1|u_2)  P(x_2|u_2)  ...  P(x_l|u_2) |
    |    ...         ...              ...     |
    | P(x_1|u_q)  P(x_2|u_q)  ...  P(x_l|u_q) |        (2)

In our context, the structure of the matrix is the following:

M_x|u = | P(x relevant | u relevant)    P(x irrelevant | u relevant)   |
        | P(x relevant | u irrelevant)  P(x irrelevant | u irrelevant) |

The propagation of probabilities is done by information passed between adjacent nodes; the information passed is presented as a probability distribution. To compute the belief of a variable, we take into account two components, namely the predictive support given by the parents to their children, and the diagnostic support given by the children to their parents. When updating the belief of a variable X, this node must propagate to its parent U and to the children of X.

Although belief computation in a complex graph is NP-complete [24], the introduction of restrictions on the topology of the network allows using tractable algorithms. Our tree structure is such a restriction; this solution is efficient both in storage usage and in computation.
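The following sketch illustrates equations (1) and (2) for the two-valued nodes used in this paper: the belief at a node is the normalized element-wise product of its diagnostic support λ and its predictive support π, and messages between a parent U and a child X are mediated by the 2×2 link matrix. It follows Pearl's tree scheme [4] in spirit only; the variable names and the numeric values are ours.

    import numpy as np

    def belief(pi, lam):
        """Equation (1): BEL(x) = alpha * P(e-_x | x) * P(x | e+_x)."""
        b = pi * lam                  # element-wise product of the two supports
        return b / b.sum()            # alpha is just the normalizing constant

    # 2x2 link matrix, M[u, x] = P(X = x | U = u); each row sums to one.
    M = np.array([[0.7, 0.3],         # row: u relevant
                  [0.2, 0.8]])        # row: u irrelevant

    pi_u = np.array([0.5, 0.5])       # predictive support reaching parent U
    lam_x = np.array([0.9, 0.1])      # diagnostic support collected at child X

    pi_x = pi_u @ M                   # top-down message: parent -> child
    lam_u = M @ lam_x                 # bottom-up message: child -> parent

    print(belief(pi_x, lam_x))        # belief at X
    print(belief(pi_u, lam_u))        # belief at U after hearing from X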

Let us assume that the tree is composed of nodes having, on average, m children, and that each variable can take n values. At each node, we have to store the following real values:

Storage requirement for each node = n² + 2·n + m·n

To update each node during the propagation scheme, the system must compute a bottom-up part (from a child node to its parent) and a top-down part (from the parent node to its children). The following multiplications are required:

Bottom-up updating = n² + 2·m·n
Top-down updating  = n² + n + 2·n·m

Since in our model n is defined as 2, the mean value of m is 1, and the evidence nodes are not updated, the propagation does not generate a waiting time for the user.

Other methods can be used to evaluate probabilistic inference networks, for example fuzzy reasoning [25], modal logic [26] or the Dempster–Shafer theory [27].

5 CREATION OF THE BAYESIAN TREE

In this section, we clarify the formation of the Bayesian network tree structure. First, we describe the indexing methodology. Next, we illustrate how we build the Bayesian tree based on the "expected mutual information measure" (EMIM) and the Maximum Spanning Tree algorithm. Third, we explain the initialization process used in our system. Finally, we compare our approach with some related works.

5.1 Indexing

For information retrieval and reasoning on the information space to be possible, we must list all the essential indexing terms by performing an automatic indexing of each hypertext document D_i. By following the principles stated in Reference [8, Chapter 9], we can automatically give a weight to each non-common term. To express the weight of each term T_j in a node D_i, we use the following formula:

w_ij = tf_ij · dv_j        (3)

This expression combines two aspects: the term frequency tf_ij (frequency of term T_j in the document D_i) and the term-discrimination value dv_j [8, Section 9.3.1]. We reject all terms T_j having a dv_j value lower than or near zero, which means that the corresponding term appears either too frequently or too rarely.

The result of this automatic process is a descriptor R(D_i) consisting of the set of terms T_j, for j = 1, 2, ..., t, and their weights w_ij, characterizing each document or node D_i, that is:

R(D_i) = {(T_1, w_i1), (T_2, w_i2), ..., (T_t, w_it)}        (4)

We note from the schematic representation of these two sets (Figure 4) that:

1. The information space representing the "document" nodes {D_i, i = 1, 2, ..., n} is presented like a network.

Figure 4. Information space representation (top) and indexing space (bottom)

2. The index space, corresponding to the indexing term nodes {T_j, j = 1, 2, ..., t}, establishes a collection without any a priori structure. Although the hypertext defines relationships between "document" nodes, all term nodes in the indexing space are isolated from one another.

In Figure 4, we use directed vertices to represent the existing relationship between the two spaces. It is obvious that from one term T_i in the indexing space, we can find the set of documents D_j that contain this element.

5.2 Dependence tree

In the previous section, we assumed the independence of index space terms, but as Williams notes [28]:

    The assumption of independence of words in a document is usually made as a matter of mathematical convenience. Without this assumption, many of the subsequent mathematical relations could not be expressed. With it, many of the conclusions should be accepted with extreme caution.

To take the co-dependence of index space terms into account, we calculate the term dependence between T_i and T_j by using the "expected mutual information measure" or EMIM [15, Chapter 6], [29]. The variable x_i associated with the term T_i can be either 1 or 0 according to whether or not the term T_i is present in the document: a weight w_ki = 0 in equation (4) means that x_i = 0 for document D_k; on the contrary, if w_ki > 0 then x_i = 1 for document D_k. The EMIM is defined as:

I(T_i, T_j) = Σ_{x_i=0,1} Σ_{x_j=0,1} P(x_i, x_j) · ln( P(x_i, x_j) / (P(x_i) · P(x_j)) )        (5)
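As an illustration, equation (5) can be computed directly from a binary document–term matrix. The sketch below is ours; it uses the convention 0·ln(0) = 0 adopted in the next section, and returns a score that is high for strongly dependent term pairs and near zero for independent ones.

    import numpy as np

    def emim(D, i, j, eps=1e-12):
        """Equation (5): expected mutual information between terms Ti and Tj.
        D is a binary document-term matrix; column k tells which documents
        are indexed by term Tk (w_ik > 0 in equation (4))."""
        value = 0.0
        for xi in (0, 1):
            for xj in (0, 1):
                p_ij = np.mean((D[:, i] == xi) & (D[:, j] == xj))
                p_i = np.mean(D[:, i] == xi)
                p_j = np.mean(D[:, j] == xj)
                if p_ij > 0:                     # convention: 0 * ln(0) = 0
                    value += p_ij * np.log(p_ij / (p_i * p_j + eps))
        return value

    D = np.array([[1, 1, 1],                     # four documents, three terms
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
    print(emim(D, 0, 1), emim(D, 0, 2))          # ~0.69 (dependent), ~0.0 (independent)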

To calculate the EMIM, we build a two-dimensional table for each pair of terms (T_i, T_j), as shown in Table 1. In this table, the value [1] represents the number of documents indexed by term T_i and term T_j, the value [2] the number of documents indexed by term T_j and not by T_i, etc. The values in [5] and [6] are the summations of the columns, and the values in [7] and [8] the summations of the rows. The value in [9] gives the number of documents in the information space, n in our example.

Table 1. Calculation of EMIM values

             x_i = 1   x_i = 0
  x_j = 1      [1]       [2]       [7]
  x_j = 0      [3]       [4]       [8]
               [5]       [6]       [9]

Here P(x_i = 1, x_j = 1) means the proportion of documents indexed by term T_i and term T_j, P(x_i = 1) the proportion of documents indexed by term T_i, P(x_j = 0) the proportion of documents not indexed by term T_j, etc. With this table, we calculate the value I(T_i, T_j) by using the following formula:

I(T_i, T_j) = [1]·ln([1] / ([5]·[7])) + [2]·ln([2] / ([6]·[7])) + [3]·ln([3] / ([5]·[8])) + [4]·ln([4] / ([6]·[8]))

To be practical, we define 0·ln(0) as 0. The absolute value given by this symmetric measure does not interest us; only the relative value is important, because we only need the rank of each EMIM value.

The next step consists of building a tree based on the rank ordering of the different EMIM values for the couples (T_i, T_j), in order to create a tree where the summation of I(T_i, T_j) over the t − 1 selected branches is maximized. We build this tree by using the Maximum Spanning Tree algorithm (see Reference [30, Chapter 12] for refinements). The construction of that tree requires O(t²) steps in the algorithm shown in Table 2.

Table 2. Maximum Spanning Tree algorithm sketch

1. Compute all t·(t − 1)/2 pairwise values I(T_i, T_j).
2. Sort the EMIM values in decreasing order and store them in a list.
3. Select the greatest value of the EMIM list.
4. Add the branch from i to j to the tree only if this arc does not introduce a cycle; if so, remove I(T_i, T_j) from the list and go to 3.
5. Remove I(T_i, T_j) from the EMIM list.
6. Repeat from Step 3 until t − 1 branches have been selected.

The root of the tree is chosen arbitrarily; for example, we could choose the node that has the maximum number of descendants, or the node with the largest dv_j value (see Section 5.1).

Table 3 represents the similarity measure between four nodes (A, B, C, D, with t = 4). In the first step, the Maximum Spanning Tree algorithm selects the branch between nodes B and C. After that, the branch from C to D is added. The third selected arc is not the branch between B and D, because this arc would introduce a cycle; the last branch will be the arc between A and C.

Table 3. Example of the Maximum Spanning Tree algorithm

         A      B      C      D
  A      —
  B     0.3     —
  C     0.5    0.7     —
  D     0.1    0.4    0.6     —

Table 4. Examples of word–word associations with significance judgments (each pair was judged either locally or overall significant)

  word         word
  stable       convergence
  number       integer
  variance     distribution
  data         file
  protocol     telecommunication
  linear       theorem
  derivative   partial

The majority of these associations (co-occurring terms) are locally significant: the relationship of terms occurring concurrently, which forms the basis for the Bayesian network construction, is valid only for a given document set, and this relationship may be invalid elsewhere. In Lesk's work [23], 75% of these associations were judged locally significant, but only 20% in the absolute sense. In our work, we found 69% locally significant pairs (see Table 4 for examples). According to Salton's results [29], the documents from which the statistics are computed must be small; inside a larger context, nearly every term can appear concurrently with every other term. In our hypertext, the nodes have a relatively small content: a title and an abstract.

Moreover, van Rijsbergen [15, Chapter 6] and Salton et al. [29] have introduced a probabilistic information retrieval system to find semantically compatible terms. In order to account for the dependence between two terms and a third one, Yu et al. [31] propose a heuristic method that allows us to find the value for P(X_i | X_j, X_k), for i ≠ j, j ≠ k and i ≠ k. However, this work does not greatly improve the results obtained with the previous probabilistic model using only the first-order dependency.

Based on this dependence tree storing the first-order dependency between terms, our new indexing space is structured as a tree (see Figure 5); unlike Figure 4, the term nodes are no longer isolated from one another.

Figure 5. Information space and indexing space structured according to a dependence tree

5.3 Initialization of the Bayesian tree

The initialization process assigns the following values to each node:

BEL(X) = [0.5, 0.5]

In this case, there is no relevant information in the network, the prior distribution being non-informative.

The strength of the relationship between two nodes is stored in a conditional probability matrix. Since at each node we assign two events, all matrices have four fixed probabilities. Equation (2) shows the structure of this matrix, and examples are presented in Figure 2 beside the label "prob". The assessment of the matrix values is the major problem with Bayesian networks [7]. If the terms X and U are positively correlated, we use Jaccard's formula [15, p. 39] as a heuristic function:

P(x relevant | u relevant) = |{x = 1} ∩ {u = 1}| / |{x = 1} ∪ {u = 1}|        (6)

P(x irrelevant | u relevant) = 1 − P(x relevant | u relevant)        (7)

In these expressions, the set {x = 1} refers to the documents indexed by the term X, and |·| denotes the cardinality of a set. If X and U are connected by a link representing a negative correlation, we use the following heuristic:

P(x relevant | u relevant) = min( |{x = 1} ∩ ¬{u = 1}|, |{u = 1} ∩ ¬{x = 1}| ) / |{x = 1} ∪ {u = 1}|        (8)

P(x irrelevant | u relevant) = 1 − P(x relevant | u relevant)        (9)

In order to get a symmetric matrix quantifying the connection between the two variables, the second row of M_x|u is obtained from expressions (6) to (8).
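A compact sketch of the whole construction of Section 5: grow a maximum spanning tree over the EMIM scores (a Prim-style variant of the Table 2 sketch, which by construction never closes a cycle), estimate each link matrix with Jaccard's formula, and initialize every belief to [0.5, 0.5]. Function names are ours, and the "u irrelevant" row below is a symmetric completion standing in for the derivation from expressions (6) to (8).

    import numpy as np

    def maximum_spanning_tree(S):
        """Prim-style variant of Table 2: S[i, j] holds the EMIM scores.
        Returns t - 1 branches (parent, child); the root is node 0, chosen
        arbitrarily as the text allows."""
        t = S.shape[0]
        in_tree, branches = {0}, []
        while len(in_tree) < t:
            i, j = max(((i, j) for i in in_tree for j in range(t) if j not in in_tree),
                       key=lambda e: S[e[0], e[1]])
            branches.append((i, j))    # only tree-to-outside arcs are considered,
            in_tree.add(j)             # so no selected branch can close a cycle
        return branches

    def link_matrix(D, u, x):
        """Equations (6) and (7): Jaccard estimate of M_x|u for a link u -> x."""
        du, dx = D[:, u].astype(bool), D[:, x].astype(bool)
        p = (du & dx).sum() / max((du | dx).sum(), 1)
        return np.array([[p, 1.0 - p],
                         [1.0 - p, p]])    # second row: symmetric completion (assumption)

    S = np.array([[0.0, 0.3, 0.5, 0.1],    # Table 3 similarity values (A, B, C, D)
                  [0.3, 0.0, 0.7, 0.4],
                  [0.5, 0.7, 0.0, 0.6],
                  [0.1, 0.4, 0.6, 0.0]])
    print(maximum_spanning_tree(S))        # same tree as in the text: A-C, C-B, C-D

    beliefs = {n: np.array([0.5, 0.5]) for n in range(S.shape[0])}   # Section 5.3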

5.4 Related works

Fung et al. [22] also use a Bayesian network. Their objective is to reduce the cost associated with the acquisition of knowledge. In order to achieve this purpose, they build a knowledge base (or network of concepts) by identifying the relationships between concepts, and they use a concept-based retrieval system. The knowledge base consists of two types of networks: concept networks, representing relationships between concepts, and concept-evidence networks, relating concepts to features (or words). In that approach, one can distinguish between two types of nodes: state and evidence nodes. The first represent concepts, and the second are associated with a feature or key word; for example, the term "bombing" forms an evidence node and the concept of "explosion" a state node. Each evidence node is binary-valued (i.e., the word is present or absent), the links may have a weight, and a belief factor is associated with each state node.

In order to build that knowledge base, the user (or the author) has to specify the concepts, to identify the words relevant to the concepts, and to establish relationships between concepts. The knowledge base may be generated manually or by using the Constructor system, which builds the network of relationships between concepts. In both cases, the user has to decide for each document which concepts are the most appropriate to describe it; thus, a large amount of work is needed by a user to identify what concepts are present for each document [22].

The information retrieval process or inference mechanism goes through the following steps. First, the system extracts features from documents and updates the evidence nodes associated with the corresponding words. In a parallel way, the state nodes are activated by the evidences. Based on this information, the system may compute the belief that a document is relevant to the concept of interest. Finally, following a decision criterion (i.e., a threshold or a selection of the best n), the system makes its retrieval decision.

The retrieval model proposed by Croft and Turtle [5,6] is certainly the most important work in this field. They demonstrate how a retrieval process may be seen as an inference process integrating multiple document and query representations. In this model, one can distinguish between two networks: the document network and the query network. In the former, one can find nodes corresponding to documents (abstracts), texts (the specific text content of a document), and text concepts (extracted from the text by various techniques like manually assigned terms, automatic key word extraction, etc.). Links between document nodes and text nodes are based on a one-to-one correspondence, and the relationships between text nodes and text concept nodes are set by various representation (or indexing) schemes. This network is built once for a given document collection. The query network contains nodes representing query concepts, queries, and one node representing the user's information need; links are present between query concept nodes and queries, and between queries and the information need. Given a query, the retrieval process evaluates how strongly each document supports the user's information need. The extensions proposed in Reference [6, Section 5] show how one can deal with uncertainty, and how additional dependencies may be used.

Since their proposed network has the same kind of nodes and links as ours, this work was the starting point of our research. Our main purpose is to develop a precise methodology to build a Bayesian network without using a thesaurus, and to suggest a means of establishing the conditional probability matrix: if we already have the qualitative relationships between nodes, we must also quantify them.

Frisse and Cousins [7] propose the use of a Bayesian network to represent the user's information need, but their method is restrictive: the structure of the network is given, since it is derived from the "MeSH Tree" ("The National Library of Medicine's Medical Subject Headings"), a standard classification of about 15 000 medical terms forming a manually created hierarchical vocabulary. Lesk [32], however, supports this kind of approach with the Dewey decimal or "Library of Congress" systems, without using Bayesian networks. To assign the conditional probability matrix, Frisse and Cousins [7] propose using empirical methods, a heuristic function, or observing the reader's interests.

6 OUR HYPERTEXT SYSTEM

Based on the previously discussed theory and considerations, we have built a hypertext system. In addition to a navigation mechanism, this system includes global information retrieval functions. These functions can be divided into two distinct classes: a Boolean model and a Bayesian inference network system. In order to validate our ideas, we elaborated a first prototype using an object-oriented language (Smalltalk-80) running on Sun SparcStation computers.

In this section, we first explain information retrieval in a hypertext system. Secondly, we reveal the features of our document collection and present an example of the results obtained. Finally, we describe our current research aimed at improving this first version.

6.1 Information retrieval in a hypertext system

In order to find relevant information in hypertext, we assess the value of each document according to the user's information need as expressed in the Bayesian tree. The set of documents D_i can be structured as a tree or as a network (see Figure 1). For any document or node D_i, its weight is calculated as:

W(D_i) = Σ_{j=1..t} w_ij        (10)

The summation is performed over all terms T_j, for j = 1, 2, ..., t. The term weight of T_j in the document D_i is computed as:

w_ij = k_j · tf_ij · dv_j

where w_ij is similar to equation (3), except for the multiplicative constant k_j. This value indicates the importance of (or the opposition to) this term relative to the user's interest, and comes directly from BEL(T_j). This first solution is useful when the set of documents D_i is not structured.

If the information space is presented as a tree, Frisse [33] proposes to assess the node weight by adding to its intrinsic weight (based on equation (10)) an extrinsic weight: the average weight of its descendants. This solution is not satisfactory, since the addition of new descendants decreases the importance of existing children.

Figure 6. Partial view of a hypertext having a support link (S) and an opposition link (O) to an idea (Document 1 states "A is true because ...", supported by Document 5 and opposed by Document 2, which states "A is false because ...")

It also attributes different weights that depend on the tree structure, instead of being based on the component arguments or intrinsic weights. Furthermore, this method takes into account only one kind of link between the nodes: the hierarchical link. A hypertext system has several link types: "support", "objection", "question", "answer to", "generalization", "specialization", etc. Figure 6 presents a simplified view of the problem: which document has the most importance?

A first solution would consist of assigning a weight α_ik depending on the relation type connecting component D_i to D_k:

W(D_i) = Σ_{j=1..t} w_ij + Σ_{k=1..r} α_ik · W(D_ik)        (11)

In this case, the factor α_ik reflects the relative importance of each descendant of a node. With that, we can distinguish between the validity of a node and the pertinence of the link. For example, in Figure 6, the argumentation in Document 2 is very interesting, but the link between Documents 1 and 2 is inappropriate, because Document 2 does not really present an objection to the fact that "A is true". This value must be available to the user, to let him choose whether the document is interesting by itself or whether the link between the two is relevant. Lowe [34] presents a similar idea.
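As a sketch of equation (11), the recursive scorer below weights each linked component by a factor tied to its relation type. The "seen" set is our addition, anticipating the cycle problem discussed just below (A supports B, B supports C, C supports A), which would otherwise make the recursion run for ever. All names and constants are illustrative.

    def weight(node, docs, links, alpha, seen=None):
        """Equation (11): intrinsic weight (sum of the w_ij) plus contributions
        of the linked components D_ik, scaled by relation factors alpha_ik."""
        seen = set() if seen is None else seen
        if node in seen:                    # cycle guard: visit each node once
            return 0.0
        seen.add(node)
        w = sum(docs[node].values())        # intrinsic part, as in equation (10)
        for neighbor, relation in links.get(node, ()):
            w += alpha[relation] * weight(neighbor, docs, links, alpha, seen)
        return w

    docs = {"D1": {"a": 0.4}, "D2": {"a": 0.7}, "D5": {"a": 0.9}}
    links = {"D1": [("D2", "objection"), ("D5", "support")],
             "D2": [("D1", "support")]}     # D2 -> D1 closes a cycle
    alpha = {"support": 0.5, "objection": -0.5}
    print(weight("D1", docs, links, alpha))  # 0.4 - 0.35 + 0.45 = 0.5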

Our heuristic can still be improved through a classification of the nodes themselves. In legal texts, for example, one can distinguish between a Civil Code article, an extract from a court of first instance, or a section of a Supreme Court decision. In this judicial context, the node weight may change according to the importance of the court, the decision date, etc. To take this into account, we introduce a factor β_k. Our weight function thus becomes:

W(D_i) = β_i · Σ_{j=1..t} w_ij + Σ_{k=1..r} α_ik · β_k · W(D_ik)        (12)

The information retrieval process must avoid pitfalls by using a heuristic. When a node X is particularly important, it will have an important weight. Propagating this value through hierarchical links, there is a great chance that all components included in the path from the root to node X would be considered relevant, only because the weight attached to X is too large. The weight propagation must take care of this. Sometimes, one can also find a cycle in the document set: for example, node A supports node B, B supports C, and C supports A. Using equation (12) directly, the system will run for ever: when it computes the weight of node A, it must have the weight of node B; in order to get this weight, it must know the value of C; and to obtain the weight of node C, the system must also compute the weight of A.

6.2 An example session

Our information space contains 251 different documents. Each component describes a course given at Université de Montréal in Computer Science or Mathematics. Some of these documents are short (around 30 to 40 words), but others are longer (around 500 words). A collection of this size is not large enough to obtain definitive conclusions, but we have assessed the feasibility of our model and obtained some positive evidence about its performance. After node indexing and eliminating all words carrying weak information, we had 287 different terms describing the entire information space.

In our prototype, the document weight is measured according to the following expression:

W(D_i) = Σ_{j=1..t} w_ij        (13)

The weight of term T_j in the component D_i is expressed as w_ij and is calculated as:

w_ij = k_j · tf_ij · dv_j   if |k_j| ≥ τ
w_ij = 0                    if |k_j| < τ        (14)

where k_j is evaluated by the following expression:

k_j = e^(2·BEL(T_j is relevant)) − e^(2·BEL(T_j is irrelevant))        (15)

so that k_j is positive when BEL(T_j is relevant) ≥ 0.5 and negative when BEL(T_j is relevant) < 0.5. The exponential function is used in equation (15) in order to give more weight to the belief value, especially when the belief approaches unity. The user has the capability of indirectly changing the threshold value τ by using a gauge; this threshold directly affects the number of terms selected for a query (see Figure 8), since equation (14) implies a selection of the m terms for which the belief BEL(T_j is relevant) exceeds the threshold value τ.

Usually, a query containing a lot of terms has a better chance of increasing the recall, but this gain often results in a loss of precision; our formulation takes that into account. The recall is defined as the proportion of retrieved documents that are relevant over the total number of relevant documents in the collection. Precision is defined as the ratio between the number of retrieved documents that are relevant and the total number of retrieved documents. It is rather tricky to establish the threshold a priori, and during the validation of our prototype, several examples demonstrated the necessity of letting the user take care of this parameter of equation (14).

If we compare this to the Boolean model, the Boolean operator "NOT" must be available. We find the "NOT" when two terms in the Bayesian tree are connected by a negative correlative relationship (see the term "geometric" in Figure 3); this operator was not present in Frisse's work [7]. Furthermore, our approach would incorporate the Boolean operators "AND" and "OR", since the distinction between them may be viewed as a question of strength variation [8, Section 10.2].

We can examine the results of our approach by looking at an example.

Example: retrieval of courses discussing probability and statistics. Using a non-informative Bayesian network, we want to obtain the courses that discuss probability and statistics. First, the user has to increase the belief of the term "probability" by providing positive support to this term in particular. This action has the effect of increasing the belief in the key word "probability" from 0.5 to 0.83.

Table 5. Result of the first search

    Probability for computer scientists
    Probability 1
    Statistics and probability for economists
    Probability 2
    Statistics: concepts and applications
    Statistics for economists
    Cryptography: theory and applications   (non-relevant)
    ...

Table 6. Result of the second search (lowering the threshold τ value)

    Statistics and probability for economists
    Probability for computer scientists
    Statistics for geologists
    Probability 1
    Introduction to statistics
    Statistics: concepts and applications
    Probability 2
    Statistics for physicists
    Theory of distributions   (non-relevant)
    Statistics 1
    ...

Figure 7. Bayesian tree after an update with positive support for the term "probability" (partial network around the terms "probability", "statistic", "variable", "random", "expectation", "density", "characteristic" and "estimation"; the belief of each node is also shown)

Figure 7 shows the partial network obtained after this step; the belief for each node is also shown. In the retrieval mode, the system uses the two most significant terms of the Bayesian tree, "probability" and "random". In Table 5, the first seven documents are judged relevant by the system. All the course descriptions selected by the system contain the term "probability" and/or "random", even if some discuss other subjects, such as the document discussing "cryptography".

By lowering the threshold τ, we include in the query the following terms: "probability", "random", "statistic", "variable", "expectation", "distribute", "density" and "characteristic". After another search, we get more documents discussing probability and statistics. The document list returned by the system is shown in Table 6.

Figure 8. Example from our current hypertext system interface

After a manual verification of the information space, the entire set of relevant documents was defined. After the first document retrieval, the precision obtained was 0.857 and the recall 0.393. After the second retrieval, the precision was 0.708 and the recall 0.708. Even though more courses about probability and statistics have been returned, the system has also listed more irrelevant documents. The documents selected by the first search were not really absent from the second selection; they were simply further down on the list than those we presented, and the user would have had the capability of accessing them by moving further down in the system's list.

6.3 Work in progress

Currently, our information space contains nodes corresponding to course descriptions and aggregates (groupings of several courses in the same node), where hierarchical links structure the information space as a tree. Eventually, we hope to complete the current system by having it manage networks carrying different user opinions. As a result, "support" and "opposition" link types will be included. The user will then provide insight on link quality and on the reliability of some argumentations advanced in the text (see Figure 6), and a broader retrieval process would use equation (12). In the near future, the interface between man and machine will also be modified according to the first users' comments. An example showing the current interface is displayed in Figure 8, in which the term "algorithmes" is judged very relevant.

7 CONCLUSION

The use of a Bayesian network in an information retrieval system is not a new approach. Fung et al. [22] have used that technique in order to build a knowledge base incorporating networks of concepts. Croft and Turtle [5,6] presented a theoretical retrieval model that can be used in a hypertext environment. Our system is more closely related to the previous work of Frisse and Cousins [7], where the network is derived from the "MeSH Tree" ("The National Library of Medicine's Medical Subject Headings"), a standard classification of medical terms of about 15 000 entries forming a manually created hierarchical vocabulary.

In this paper, we have proposed a method to automatically create a Bayesian network structured as a tree; a specific context like the "MeSH Tree" is not really needed. The proposed solution uses the expected mutual information measure to build the inference tree, and Jaccard's formula to define conditional probability relationships. We have also shown that our system does not require the user to write a formal query. Instead, the user can choose terms on documents currently displayed on the screen and, based on these hints, our system updates a Bayesian tree that stores the user's information need. Our methodology takes user feedback into consideration: after a search, the user may select relevant documents from the list of results, indicate that they are relevant, and let the system start a new search process. Moreover, the user may always modify the relative importance of a term or a document simply by pressing a button (see Figure 8). Finally, the proposed solution can fit nicely into a hypertext system by explaining how one can use link semantics during the retrieval process.

ACKNOWLEDGMENTS

We would like to thank the University of Montreal and NSERC (Natural Sciences and Engineering Research Council of Canada), for without them this study would not have been possible. The authors would also like to thank R. Furuta and the three anonymous referees for their helpful suggestions and remarks.

REFERENCES

1. J. Conklin, 'Hypertext: an introduction and survey', IEEE Computer, 20(9), 17–41 (1987).
2. A. van Dam, 'Hypertext '87 keynote address', Communications of the ACM, 31(7), 887–895 (1988).
3. F. G. Halasz, 'Reflections on NoteCards: seven issues for the next generation of hypermedia systems', Communications of the ACM, 31(7), 836–852 (1988).
4. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo (California), 1988.
5. W. B. Croft and H. Turtle, 'A retrieval model for incorporating hypertext links', in Hypertext '89 Proceedings, ACM, New York, 1989, pp. 213–224.
6. H. Turtle and W. B. Croft, 'Inference networks for document retrieval', in Proceedings SIGIR'90, 1–24 (September 1990). To appear in ACM Transactions on Information Systems.
7. M. E. Frisse and S. B. Cousins, 'Information retrieval from hypertext: update on the dynamic medical handbook project', in Hypertext '89 Proceedings, ACM, New York, 1989, pp. 199–212.
8. G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading (Massachusetts), 1989.
9. H. van Dyke Parunak, 'Reference and data model group (RDMG): work plan status', Proceedings of the Hypertext Standardization Workshop, Gaithersburg, 9–13 January 1990.
10. P. Brown, 'Interactive documentation', Software—Practice and Experience, 16(3), 291–299 (1986).
11. D. Goodman, The Complete HyperCard Handbook, Bantam, Toronto, 1987.
12. L. Alschuler, 'Hand-crafted hypertext—lessons from the ACM experiment', in The Society of Text: Hypertext, Hypermedia, and the Social Construction of Information, ed. E. Barrett, MIT Press, 1989, pp. 343–361.
13. J. Nielsen, 'The art of navigating through hypertext', Communications of the ACM, 33(3), 296–310 (1990).
14. R. Akscyn, D. McCracken, and E. Yoder, 'KMS: a distributed hypermedia system for managing knowledge in organizations', Communications of the ACM, 31(7), 820–835 (1988).
15. C. J. van Rijsbergen, Information Retrieval, 2nd edn, Butterworths, London, 1979.
16. D. C. Blair and M. E. Maron, 'Full-text information retrieval: further analysis and classification', Information Processing & Management, 26(3), 437–447 (1990).
17. C. L. Borgman, 'Why are online catalogs hard to use? Lessons learned from information-retrieval studies', Journal of the American Society for Information Science, 37(6), 387–400 (1986).
18. G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, 'The vocabulary problem in human-system communication', Communications of the ACM, 30(11), 964–971 (1987).
19. G. Marchionini, 'Making the transition from print to electronic encyclopaedias: adaptation of mental models', International Journal of Man–Machine Studies, 30(6), 591–618 (1989).
20. N. J. Belkin and W. B. Croft, 'Retrieval techniques', in Annual Review of Information Science and Technology, ed. Martin E. Williams, volume 22, 1987, pp. 109–145.
21. S. P. Harter, Online Information Retrieval—Concepts, Principles, and Techniques, Academic Press, 1986.
22. R. Fung, S. Crawford, L. Appelbaum, and R. Tong, 'An architecture for probabilistic concept-based information retrieval', in Proceedings SIGIR'90, 1990, pp. 455–467.
23. M. Lesk, 'Word–word associations in document retrieval systems', American Documentation, 20(1), 27–38 (1969).
24. G. F. Cooper, 'The computational complexity of probabilistic inference using Bayesian belief networks', Artificial Intelligence, 42(2–3), 393–405 (1990).
25. G. Biswas, J. C. Bezdek, M. Marques, and V. Subramanian, 'Knowledge-assisted document retrieval: II. The retrieval process', Journal of the American Society for Information Science, 38(2), 97–110 (1987).
26. C. J. van Rijsbergen, 'A non-classical logic for information retrieval', Computer Journal, 29(6), 481–485 (1986).
27. D. Lucarella, 'A model for hypertext-based information retrieval', in Proceedings of ECHT'90, Cambridge University Press, 1990, pp. 81–94.

28. J. H. Williams, 'Results of classifying documents with multiple discriminant functions', in Statistical Association Methods for Mechanized Documentation, eds. M. E. Stevens et al., National Bureau of Standards, Washington, 1965, pp. 217–224.
29. G. Salton, C. Buckley, and C. T. Yu, 'An evaluation of term dependence models in information retrieval', in Lecture Notes in Computer Science, number 146, Springer Verlag, Berlin, 1982, pp. 151–173.
30. C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice Hall, Englewood Cliffs, N.J., 1982.
31. C. T. Yu, K. Lam, and G. Salton, 'Extensions to the tree dependence model in information retrieval', Technical report, Computer Science Department, Cornell University, Ithaca, New York, 1982.
32. M. Lesk, 'What to do when there's too much information', in Hypertext '89 Proceedings, ACM, New York, 1989, pp. 305–318.
33. M. E. Frisse, 'From text to hypertext', Byte, 13(9), 247–253 (1988).
34. D. Lowe, 'Cooperative structuring of information: the representation of reasoning and debate', International Journal of Man–Machine Studies, 23(2), 97–111 (1985).