
http://www.diva-portal.org

Postprint

This is the accepted version of a paper presented at Sundsvall 42, 19-21 October 1993, Sundsvall,
Sweden.

Citation for the original published paper:

Eberhagen, N. (1993)
Information filtering.
In: Sundsvall 42: ADB i verksamhetens tjänst Sundsvall, Sweden: Sundsvall 42 and The Swedish
Information Processing Society

N.B. When citing this work, cite the original published paper.

Permanent link to this version:


http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-6371
Information filtering

Niclas Eberhagen

Department of Mathematics, Statistics and Computer Science


Växjö University

Box 5053, S-351 95 Växjö

June 1993

Abstract
Information filtering is concerned with the process of filtering information relevant to
an individual’s information needs, in order to come to terms with the ever increasing
amount of information. The individual’s long term needs are expressed and represented
in a filter profile that is matched with the distributed information in order to select that
which is of interest to the individual. This process is discussed in two aspects: the
capturing and representation of the information needs, shedding light on traditional
methods and techniques and the different filtering processes adopted by individuals in
order to filter information; and the manageability of the filter profiles, exploring the
support for the principles of relevance feedback as the timeliness of the information
needs quickly out-dates them.

Keywords: Information filtering; Information retrieval; and Information overload.

Working paper of the Information System Science’s group, ISV-WP-15.


1. Introduction

In the post-industrial society organizations are believed to face a different world
compared to that of today. Huber (1984) characterizes the post-industrial society as
experiencing greater levels of knowledge, complexity, and turbulence, each of which
will be increasing at a considerable rate.

The amount of available information will grow and its absolute growth will increase.
The increase of knowledge will lead to large increases in technological, economical,
and social specialization and diversity. The high diversity and specialization will lead to
a large increase in societal interdependencies and thus escalate the level of complexity
and its absolute growth. The increase in turbulence follows from the rapidity of events:
increasing knowledge makes many technologies more effective and shortens
the duration of events, thus permitting more events per time unit (Huber 1984).

The effects of advanced communication technologies, such as e-mail systems, voice mail
systems, radio-phones, etc., and of computing technologies, mainly used for storing,
retrieving, and processing information to derive new information, will be that they
become more available to individuals, increase the efficiency of communication and
make it more timely, open up new sources of information which originally were
external to the organization, and keep the information more up-to-date.

Decision making in post-industrial organizations will be more frequent, faster, and more
complex, thereby increasing the decision-task load. This will put
high demands on the acquisition and distribution of the available information.
Organizations will need to, more than before, scan the environment for information
about the existence of problems and opportunities or for information to be used in the
future, and probe the environment for information not routinely gathered, but also guard
themselves against information overload as the amount of information available to
the individual decision-maker increases (Huber 1984).

Denning (1982) relates the problem to the implications of
automatic document preparation systems and electronic mail, and to the quantity of
information being received by end users, stating the following:

“The visibility of personal computers, individual workstations, and local area networks
has focused most of the attention on generating information - the process of producing
documents and disseminating them. It is now time to focus more attention on receiving
information - the process of controlling and filtering [information] that reaches the persons who must
use it.”
(Denning 1982, pp. 163-165)

Since the increase in recent years of electronically stored and produced information has led
to an escalating distribution of documents in e-mail systems, distribution lists, and
corporate and other public networks, users of such systems tend to feel flooded with
information, even in systems where users subscribe to distribution lists (Malone,
Grant, Turbak, Brobst, and Cohen 1987). Much of this information is of no interest to the
individual and thus leads either to a time-consuming process of reading all the
information bits or to complete neglect of them. The problem is how to find the
information relevant to the individual’s needs. How does one select that which is
relevant, not only from the publicly distributed information but also from that which is
specifically addressed to an individual?

Organizational environmental scanning is just one of many processes where the filtering
concept is applicable. Here individuals or groups of individuals, at different levels
within an organization, scan common information channels from external sources in
order to pick out that which is of major concern to them, e.g. a production manager
scanning technical and other related information sources in order to check for
innovations or competing products, or a marketing manager looking for information
concerning the market for a specific product in order to establish how well its launch
will do. The decision-maker collecting just those pieces of information
that will form the basis of a decision is another example. He would need clearly defined
selection models, stating his information needs, in order to pick out that which is of
relevance to him, because he is certainly not interested in having to deal with an
enormous amount of information most of which is of little relevance to him.

2. Information filtering

Information filtering is concerned with the problem of selecting the information
relevant to the needs of individuals. Users of a filtering system specify their needs in
a profile reflecting their long term wants, i.e. the information needs, interests and
preferences relevant to their work, and these profiles are automatically matched against
the incoming information (Malone et al. 1987). Filter profiles could also be constructed to
reflect the needs of a group of individuals so as to cover their common fields of interest.

What are an individual’s information needs, and how can they be captured in a
meaningful representation? Most often individuals will know what their needs are, since
these should be relevant to their work, but they might experience problems in expressing
them. Sometimes the needs are not known, as in an environment where
individuals scan everything in search of interesting topics, e.g. a news bureau.

The capturing and representation of an individual’s information needs concerns two
areas of information filtering. The first is the different types of filtering processes that
individuals adopt in order to select and filter information: cognitive filtering, based upon
the words of weakly structured and unstructured text documents; social filtering,
based upon how others judge and recommend different pieces of information
and upon the relation between the distributor and the receiver; and
economic filtering, based on the amount of information, e.g. the length of the text
document. The second is the techniques for representing the individual’s information
needs, mainly techniques borrowed from the field of information retrieval (Belkin and
Croft 1992).

Due to the timeliness of information and the dynamic and ever changing environment
these filters cannot be static, as the individual’s needs change. Even though one of the
characteristics of information is that it does not wear out (Glazer 1993), the dimension
of time may greatly affect its relevance. The filter profiles may also not be tuned so as to
perform at their best. An individual must be able to interact with his profile based on
his reaction to the captured information. Thus some sort of process is required to support
individuals in modifying and maintaining these profiles. This support should assist
individuals in fine-tuning their filter profiles in striving for optimum performance and
also help them to construct new profiles when expanding the filtering process to
include new areas of interest. Such a process, which captures the individual’s reaction to
incoming information and a changing environment, provides for a dynamic interaction
with the filter profile and greatly enhances the opportunities for successful goal
attainment where the individual’s work is concerned. In the case where the individual
does not clearly know what he is looking for and cannot express his needs it may be
easier to state that which he does not want, i.e. a negative filter (Stadnyk and Kass
1992). This could be the case for the news bureau mentioned earlier. Here the necessity
of a flexible filter profile is obvious.
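The idea of a negative filter can be illustrated with a minimal sketch. It is not taken from Stadnyk and Kass (1992); the function name and the set of unwanted terms are hypothetical and chosen only for illustration.

```python
def passes_negative_filter(text, unwanted_terms):
    """Let a message through unless it contains a term the user has ruled out."""
    # Lower-case the words and strip simple punctuation before comparing.
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return words.isdisjoint(unwanted_terms)


# Hypothetical list of topics the individual does not want to see.
unwanted = {"sports", "weather"}

print(passes_negative_filter("Parliament passes new budget.", unwanted))
print(passes_negative_filter("Local weather warning issued.", unwanted))
```

Everything not explicitly excluded is let through, which suits a setting where the needs cannot be stated positively.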

The questions and aspects of information filtering stated above are examined
below through a discussion of the different processes and techniques that are used in
information filtering, both for representation and for filtering, and of the principles of
relevance feedback that can be used to support the individual in
managing his filter profiles.

2.1 Information filtering processes

In a study by Malone, Grant, Lai, Rao, and Rosenblitt (1987) of how different
types of information are distributed within organizations, individuals were interviewed
about their needs and wants, and how they made their selections, in order to establish
their information filtering processes. The study focused on different types of
information filtering environments, such as electronic mailboxes, e-mail systems,
electronic billboards, etc., from which the individuals retrieved information.

From the results of this study Malone et al. (1987) identified three general filtering
processes which individuals use in order to select information:

2.1.1 Cognitive filtering

The cognitive filtering process is mainly based on a characterization of the content of
the distributed information. The individual’s stated information needs are represented in
a filter profile consisting of keywords. Distributed information is matched against
this list of keywords in order to establish its relevance to the needs of the
individual. Keywords are often of the type: date, sender, title words, message type, and
specific topic-describing words from the content of the text. These keywords may be
combined into complex patterns of selection criteria. The techniques used here are
mainly to be found in the field of information retrieval (Belkin and Croft 1992).

However, further research needs to be done here, since these techniques do not usually
involve any actual text understanding but merely a matching of words that takes no
account of the semantics of the words (Ram 1992).
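As a minimal sketch of keyword-based cognitive filtering, the example below matches a message against a profile of selection criteria. The message layout, the field names, and the rule that all stated criteria must hold are assumptions made for illustration; they are not prescribed by the cited studies.

```python
def matches_profile(message, profile):
    """Return True if the message satisfies every criterion stated in the profile.

    A criterion that is missing or empty in the profile is treated as 'no constraint'.
    """
    sender_ok = not profile.get("senders") or message["sender"] in profile["senders"]
    type_ok = not profile.get("types") or message["type"] in profile["types"]
    # Topic words are looked for in both the title and the body, ignoring case.
    words = set(message["title"].lower().split()) | set(message["body"].lower().split())
    topic_ok = not profile.get("topic_words") or bool(words & profile["topic_words"])
    return sender_ok and type_ok and topic_ok


# Hypothetical profile and message.
profile = {"senders": {"anna"}, "types": {"memo"}, "topic_words": {"budget", "forecast"}}
message = {"sender": "anna", "type": "memo",
           "title": "Budget review", "body": "See the attached figures"}

print(matches_profile(message, profile))
```

The match is purely lexical: a message about "spending plans" would not be selected by the topic word "budget", which is the limitation noted above.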

2.1.2 Social filtering

Social filtering is based upon the relations that exist between individuals in an
organization or group. The relation between the distributor and receiver often dictates
how the receiver will classify the information. If the distributor of the information is the
manager of another department the information may be classified as less interesting and
non-relevant. However, should the information come from the closest boss, it might be
selected and processed immediately because the individual sees his boss every day and
knows that he requires an instant reply.

Another aspect of social filtering is based upon the judgment of other individuals.
People often read what others refer to or recommend. The individual relies on those in
his environment who are well known to him, to the extent that he knows their
preferences and what kind of work they are occupied with and hence their interests. The
principle here is that there will always be eager readers who scan all information bits in
a specific area just to be in the front line. These individuals make annotations about the
information. If an individual knows these eager readers well he can use their
annotations in his filter profile. This requires, however, that the information be
redistributed all over again, as an information bit may go past a not-so-eager reader
unnoticed and not trigger any filtering criteria due to the lack of appropriate annotations
(Goldberg, Nichols, Oki, and Terry 1992).
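Both aspects of social filtering can be sketched in a few lines. The sketch below is only an illustration: the message layout, the annotation fields, and the names used are hypothetical and do not reproduce the system described by Goldberg et al. (1992).

```python
def socially_relevant(message, trusted_readers, superiors):
    """Select a message because of who sent it or who recommended it."""
    # Relation between distributor and receiver: messages from a superior pass.
    if message["sender"] in superiors:
        return True
    # Judgment of others: any recommendation by a trusted, eager reader passes.
    return any(
        note["reader"] in trusted_readers and note["recommended"]
        for note in message.get("annotations", [])
    )


# Hypothetical message carrying one annotation from an eager reader.
message = {
    "sender": "other-department",
    "annotations": [{"reader": "eva", "recommended": True}],
}

print(socially_relevant(message, trusted_readers={"eva"}, superiors={"boss"}))
```

A message without annotations from a non-superior is not selected, which illustrates why annotated information has to reach the receiver again after the eager readers have seen it.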

2.1.3 Economic filtering

Receivers of information often try to estimate the effort it would take to read the
messages. If a message is too long the receiver may ignore it because it would take
too much of his time to process the information. Here the length or amount of
information becomes a criterion upon which the information is classified.
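A length criterion of this kind is simple to express. In the sketch below the threshold of 200 words is an arbitrary, hypothetical value; in practice each receiver would set his own limit.

```python
def worth_reading(text, max_words=200):
    """Accept a message only if its length stays within the receiver's limit."""
    return len(text.split()) <= max_words


print(worth_reading("Short note about the meeting."))
```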

Another aspect of economic filtering concerns the distributor of information. Usage of
information channels is expensive; however, the broader the information is, i.e. the more
receivers it addresses, the cheaper it becomes to formulate and distribute the
information. The best message to a receiver is that which is specially formulated for his
needs, but this increases the distributor’s cost of formulating the information.
Therefore much of the information distributed through common
channels will be of a general character and have little relevance to most of the receivers’
needs (Malone et al. 1987).

2.2 Techniques for representation and filtering

Here two main techniques will be discussed: semi-structured templates and information
retrieval techniques. Natural language processing will not be discussed since its usage
in commercial applications is still low and the techniques are high in cost; however, one
example of natural language analysis within information filtering is to be found in the
SCISOR system (Jacobs and Rau 1990).

2.2.1 Semi-structured templates

Malone et al. (1987) found that most of the information which is distributed within
organizations is semi-structured, i.e. keywords such as date, sender, message type, and
so forth, were to be found in specific recurring places within the messages. This
semi-structure is important for how the receiver filters the information. It is upon the
occurrence of these keywords that most of the cognitive filtering is based. Malone et al.
(1987) define semi-structured information as messages of identifiable types, where each
type has a clearly defined set of templates, but some of the templates can include
unstructured text or other sorts of information.

In this approach both the distributed information and the individual’s filter profile are
structured according to these templates. Not all of the templates need to be filled in. The
templates of the structured information are matched against the templates of the
individual’s filter profile. If the templates, consisting of structured information, of both
the message and the filter profile match, the information is classified as relevant.
Malone et al. (1987) advocate the usage of semi-structured templates in
communication and point out:

- Semi-structured templates enable computers to automatically process a wider range
of information and eliminate the need for natural language analysis in order to classify
the messages, while still allowing advanced processing of the information.

- Semi-structured templates allow individuals to communicate unstructured information
as well as structured information in the same message, thus making the
communication more flexible and user-friendly.

- Most of the processing of information within organizations is done in a structured
form and usage of semi-structured templates helps to reflect this processing.
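The template matching described above can be sketched as follows. This is an illustration under stated assumptions, not the mechanism of Malone et al. (1987): the field names are hypothetical, exact equality is used as the matching rule, and a profile entry set to None stands for a template that has not been filled in.

```python
def template_matches(message_fields, profile_fields):
    """Match the filled-in templates of a profile against those of a message."""
    for name, wanted in profile_fields.items():
        if wanted is None:
            # Template left empty in the profile: it places no constraint.
            continue
        if message_fields.get(name) != wanted:
            return False
    return True


# Hypothetical semi-structured message: structured templates plus free text.
meeting = {
    "type": "meeting announcement",
    "sender": "lars",
    "date": "1993-10-19",
    "text": "Agenda to follow.",
}
profile = {"type": "meeting announcement", "sender": None, "date": "1993-10-19"}

print(template_matches(meeting, profile))
```

Note that the free-text template of the message plays no part in the match; only the structured templates are compared.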

2.2.2 Information retrieval techniques

These techniques are mainly used for the processing of weakly structured or unstructured
information. Their usage within information filtering strongly
resembles that of automatic document classification in information retrieval (Salton
1989); the techniques are in fact the same. The distributed information to be filtered is
classified by its content rather than by specific keywords in semi-structured templates.

The content of the text-based information is broken down into a list of individual
unique words. All of the non-content-descriptive function words such as: and, or, but,
to, etc., are removed as they are not related to the content. The remaining words are
stripped of prefixes and suffixes leaving only the root form of the word and are given
weight factors based on the frequency of occurrence in the text, to reflect the
importance of the word.

The resulting list of words, each with its weight factor, is known as a term vector, i.e.
the formal representation of the content of the information (Salton 1989), and is used to
describe or classify the content of the information.
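The construction of a term vector can be sketched in a few lines. The sketch makes several simplifying assumptions that are not taken from Salton (1989): the list of function words is a small hypothetical sample, only a few suffixes are stripped (prefixes are left alone), and the weight factor is the raw frequency of occurrence.

```python
from collections import Counter

# Hypothetical, deliberately small list of function words.
STOP_WORDS = {"and", "or", "but", "to", "the", "a", "of", "in", "is"}
# Crude suffix list; a real system would use a proper stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_vector(text):
    """Map each remaining root form to its frequency of occurrence."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    stems = [strip_suffix(w) for w in words if w and w not in STOP_WORDS]
    return Counter(stems)


vector = term_vector("Filtering filters the filtered documents and the document")
print(dict(vector))
```

In this example the three forms of "filter" collapse into one term with weight 3 and the two forms of "document" into one term with weight 2.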

The individual expresses his information needs in a filter profile as unstructured text in
natural language or as a set of keywords describing the information needs. In the former
case the filter profile undergoes the same treatment as the distributed
information.

The formal representations of the distributed information and the filter profile are
matched with each other. A similarity coefficient is established based on the matches
between the two lists. The higher the coefficient the higher the relevance of the
information is to the individual’s information needs.
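One common choice of similarity coefficient is the cosine of the angle between the two term vectors. The text does not say which coefficient is meant, so the cosine measure below is an assumption used only to make the matching step concrete; the example vectors are hypothetical.

```python
import math


def cosine_similarity(doc_vector, profile_vector):
    """Cosine of the angle between two term vectors given as dicts of weights."""
    shared = set(doc_vector) & set(profile_vector)
    dot = sum(doc_vector[t] * profile_vector[t] for t in shared)
    norm_doc = math.sqrt(sum(w * w for w in doc_vector.values()))
    norm_profile = math.sqrt(sum(w * w for w in profile_vector.values()))
    if norm_doc == 0 or norm_profile == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_doc * norm_profile)


# Hypothetical term vectors for a document and a filter profile.
document = {"filter": 3, "document": 2}
profile = {"filter": 1, "profile": 1}

print(round(cosine_similarity(document, profile), 3))
```

The coefficient lies between 0 (no terms in common) and 1 (identical direction) for non-negative weights, so documents can be ranked or cut off at a threshold chosen by the individual.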

However, words say nothing about the actual content. No semantic analysis is done and
these filter techniques involve no actual understanding of the information. Belkin and
Croft (1992) have shown that in spite of this the filtering based upon these
techniques is quite sufficient and adequate.

Words may have different meanings, and the same meaning can be expressed
with different words and phrases. This poses a problem in establishing a match
between the information and the filter profile, as the two may concern the same topic
even though the individual unique words indicate otherwise. The usage of a thesaurus
should help to reduce this problem (Salton 1989). A thesaurus is used to group together
words and phrases which concern the same topic and are therefore related in meaning.
By replacing words in the term vector with references to the corresponding thesaurus
groups the problem of synonyms can be coped with, as a look-up in the thesaurus
reveals other phrases or words that could be used instead to establish a possible match.
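The effect of a thesaurus on matching can be shown with a minimal sketch. The thesaurus below is a hypothetical toy mapping from words to group labels, not an excerpt from any real thesaurus, and the function name is illustrative.

```python
from collections import Counter

# Hypothetical thesaurus: each word points to the group it belongs to.
THESAURUS = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "truck": "VEHICLE",
    "cost": "PRICE",
    "price": "PRICE",
}


def to_concepts(term_vector):
    """Replace each term by its thesaurus group, summing the weight factors."""
    concepts = Counter()
    for term, weight in term_vector.items():
        # Terms not found in the thesaurus are kept unchanged.
        concepts[THESAURUS.get(term, term)] += weight
    return concepts


document = {"automobile": 2, "cost": 1}
profile = {"car": 1, "price": 1}

print(set(document) & set(profile))                            # no shared words
print(set(to_concepts(document)) & set(to_concepts(profile)))  # shared groups
```

Before the replacement the two vectors have no term in common; afterwards they share both groups, so a similarity coefficient computed on the group level would register the match.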

A drawback of IR-techniques is that the individual must formulate his information
needs using a rich vocabulary with a fairly large set of words in order to cover the
different aspects of his information needs. Otherwise the process of establishing the
degree of similarity, and hence the match between the profile and the information, may
not function properly as the expressed needs become too broad and general (Salton
1989).

Malone et al. (1987) point out further that the cognitive filtering process is based on
identifying keywords and that some of these keywords can have multiple forms of
representation, e.g. date. Malone et al. (1987) advocate the usage of semi-structured
templates in order to give such keywords a standardized representational form where
IR-techniques would fail in identifying them.

However, Malone et al. (1987) state further that usage of semi-structured templates
allows the distributor to write unstructured text in some of the templates and that these
cannot be analyzed without IR-techniques. The filtering process here is based on the
structured keywords and these may not always adequately describe the content of the
information. Usage of IR-techniques ensures that the content will be properly classified.

2.3 Relevance feedback

Due to the timeliness of an individual’s information needs, caused by the dynamic
and turbulent environment, the needs will vary over time; some must be
revised and others complemented as new goals arise and old ones disappear. Existing
information needs may not be as clearly and adequately expressed as they should be in
order to function at an optimum or sufficient level. They may be too broad in their
definition, thus allowing less relevant information to be selected, or they may be too
precisely formulated, allowing only a fraction of the relevant information to come to the
individual’s notice. Individuals must be able to react to the selected information and
establish its degree of relevance in order to revise the information needs or
complement them. Giving an individual support for these activities is essential for the
functionality of the filter profiles.

Where does one find support for these activities? Turning to the field of information
retrieval one can adopt the principles of relevance feedback.

The principles of relevance feedback within information retrieval are mainly used to
help individuals sharpen their information requests. The individual looks upon the
retrieved information and tries to establish a degree of relevance to his information
needs. He ranks the information according to the relevance.

The information needs expressed in his requests may either be too broad and thus have
a high recall, i.e. selecting a large amount of information of which only a small part may
be relevant, or may be too precise and thus have a high precision, i.e. selecting a small
amount of information within a narrow area of interest, which reduces the selected
amount of non-relevant information but at the same time fails to select all that is
relevant. The optimum requests are those where the information needs are expressed in
such a way that they give a high recall with a high degree of precision (Salton 1989).
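Recall and precision can be computed directly from the set of selected items and the set of relevant items, using the standard definitions: precision is the share of the selected items that are relevant, and recall is the share of the relevant items that were selected. The document identifiers below are hypothetical.

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected and a set of relevant items."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}

# A broad request selects everything: all relevant items, but much noise.
print(precision_recall(range(1, 11), relevant))   # (0.4, 1.0)

# A narrow request selects little: no noise, but half of the relevant items are missed.
print(precision_recall({1, 2}, relevant))         # (1.0, 0.5)
```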

The individual may adopt either of two main strategies: positive feedback, i.e. the
information judged to be of high relevance is used to enhance the information request so
as to make it more similar to the relevant information, or negative feedback, i.e.
the information deemed to be of little or no relevance is subtracted from the information
request so as to reduce its similarity to, or move it away from, non-relevant information.

These two principles work by either adding words or weight factors from the term
vector of the most relevant information to the term vector representing the information
request, thus broadening the area of interest and increasing the recall, or subtracting
words or weight factors of the term vector of the information ranked as having the
lowest relevance from the information request’s term vector, thus sharpening the area
of interest and increasing the precision (Salton 1989).
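The adding and subtracting of weight factors can be sketched as a simple vector update, in the spirit of the classical feedback formulas from information retrieval. The scaling factors beta and gamma, their default values, and the rule of dropping terms whose weight is no longer positive are assumptions made for this illustration, not details given in the text.

```python
def apply_feedback(profile, relevant_docs, nonrelevant_docs, beta=0.5, gamma=0.25):
    """Move a profile's term vector towards relevant and away from non-relevant items."""
    updated = dict(profile)
    # Positive feedback: add scaled weights from items judged relevant.
    for doc in relevant_docs:
        for term, weight in doc.items():
            updated[term] = updated.get(term, 0.0) + beta * weight
    # Negative feedback: subtract scaled weights from items judged non-relevant.
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            updated[term] = updated.get(term, 0.0) - gamma * weight
    # Keep only terms that still carry a positive weight.
    return {term: weight for term, weight in updated.items() if weight > 0}


profile = {"filter": 1.0}
judged_relevant = [{"filter": 2, "profile": 2}]
judged_nonrelevant = [{"recipe": 4}]

print(apply_feedback(profile, judged_relevant, judged_nonrelevant))
# {'filter': 2.0, 'profile': 1.0}
```

The relevance judgments themselves remain the individual's own; the update only records their effect on the profile.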

The support for the activities of relevance feedback must be dynamic and flexible in
order to give the individual the opportunity to interact with his filter profiles and adopt
either strategy of relevance feedback. The process of establishing the degree of relevance
of the captured information is subjective and the criteria for judging the relevance
vary from individual to individual. Thus these activities cannot be captured in an
automatic processing of the information needs as they depend upon the interaction of
the individual (Salton 1989).

Finding and developing such support will greatly enhance the chances of successfully
filtering information according to the needs of individuals, as old information
needs can be revised and new needs which were not expressed before can be captured.

3. Postscript

Behind all aspects of information filtering lies the unanswered question of
how to capture the individual’s information needs. The individual’s information needs
should be relevant to his work, of both formal and informal character, and be dictated by
his goals, both formal and informal, though the individual may find it difficult to express
them. The different needs can be related to each other in such ways that they are
interdependent. To extract the complex pattern of an individual’s information needs
into a representational form in a computer, so as to enable him to
succeed in attaining his goals, is a challenge. Techniques and models for
capturing these information needs, both for the individual and for a group of individuals,
are in dire need of exploration. Finding support here would greatly enhance the usage and
experience of the filtering concept for individuals.

The different techniques of information representation in the field of information
retrieval form a well-established area of research and much will come out of it in the
future. However, the usage of natural language analysis techniques is still an open
question since they are complex when it comes to semantically understanding the
information (Salton 1989). The question of how great an impact such techniques will
have on organizationally based information filtering systems is still unanswered since
the cost of using these techniques and the amount of effort in the processing will
probably remain high for a long time.
The concept of relevance feedback is another area that needs to be explored. The need
for support for maintaining and interacting with the filter profiles is quite obvious since
the increasing turbulence and complexity quickly out-date many of the information needs.
New goals will arise, which need to be examined in order to find the information needs
related to them, and old ones will disappear. The information needs soon wear out or
must be revised. In order to make sure that information selection models perform at
their maximum, support for this activity is essential. Capturing the individual’s reaction
to the information at hand and establishing whether the filter profile should be revised
or the information needs should be rephrased (based on this reaction and the timeliness
of the information needs) is an area of study that must be researched. Studying the
existing traditional techniques and models as well as exploring new ones is indeed an
object of future research.

4. References

Belkin, N.J., and Croft, W.B. (1992) Information Filtering and Information Retrieval:
Two Sides of the Same Coin? Communications of the ACM, Vol. 35, No. 12, December.

Denning, P. (1982) President’s Letter on “Junk Mail”, Communications of the ACM,
March.

Glazer, R. (1993) Measuring the value of information: The information-intensive
organization, IBM Systems Journal, Vol. 32, No. 1.

Goldberg, D., Nichols, D., Oki, B.M., and Terry, D. (1992) Using Collaborative
Filtering to Weave an Information Tapestry, Communications of the ACM, Vol. 35, No.
12, December.

Huber, G.P. (1984) The Nature and Design of Post-Industrial Organizations,
Management Science, Vol. 30, No. 8, August.

Jacobs, P.S., and Rau, L.F. (1990) SCISOR: Extracting Information from On-line News,
Communications of the ACM, Vol. 33, No. 11, November.

Malone, T.W., Grant, K.R., Lai, K.-Y., Rao, R., and Rosenblitt, D. (1987)
Semistructured Messages Are Surprisingly Useful for Computer-Supported
Coordination, ACM Transactions on Office Information Systems, Vol. 5, No. 2, April.

Malone, T.W., Grant, K.R., Turbak, F.A., Brobst, S.A., and Cohen, M.D. (1987)
Intelligent information-sharing systems, Communications of the ACM, Vol. 30, No. 5,
May.

Ram, A. (1992) Natural Language Understanding for Information-Filtering Systems,
Communications of the ACM, Vol. 35, No. 12, December.

Salton, G. (1989) Automatic Text Processing: The Transformation, Analysis, and
Retrieval of Information by Computer. Addison-Wesley, USA.

Stadnyk, I., and Kass, R. (1992) Modeling Users’ Interests in Information Filters,
Communications of the ACM, Vol. 35, No. 12, December.
