
Contents

• Search Statements and Binding


• Similarity Measures and Ranking
• Relevance Feedback
• Selective Dissemination of Information Search
• Weighted Searches of Boolean Systems
• Searching the INTERNET and Hypertext
• Information Visualization:
• Introduction to Information Visualization,
• Cognition and Perception,
• Information Visualization Technologies
Search Statements and Binding

• Search statements are generated by users to describe their information needs.

• Typically, a search statement uses Boolean logic or natural language.

• When generating the search statement, the user may have the ability to
weight (assign an importance to) different concepts in the statement.

Three levels of binding may be observed:

• At each level the query statement becomes more specific.


• At the first level, the user attempts to specify the information needed, using
his/her vocabulary and past experience.
• At the next level, the system translates the query to its own internal
language.
• This process is similar to that of processing (indexing) a new document.
• At the final level, the system reconsiders the query based upon the specific
database.
Examples of Query Binding

Search Statements and Binding

• The length of a search statement directly affects the ability of
Information Retrieval Systems to find relevant items.

• The longer the search query, the easier it is for the system to
find relevant items.
Similarity Measures and Ranking

• Searching in general is concerned with calculating the similarity between
a user’s search statement and the items in the database.

• Similarity measures are used when both queries and documents are
described by vectors.

• Once items are identified as possibly relevant to the user’s query, it is
best to present the most likely relevant items first (Ranking).

• The output of a similarity measure is a scalar number that
represents how similar an item is to the query.

Similarity measures can be used to measure:

• document-document similarity (used in document clustering)

• document-query similarity (used in searching)

• term-query similarity (used in term clustering)

Similarity Measure

The basic formula to determine the similarity between items is the sum of the
products of the term weights the two items have in common:

SIM(Item_i, Item_j) = Σ_k (w_ik × w_jk)

where w_ik is the weight of term k in item i.
Weighting Terms

• Weights are assigned to index terms based on term frequency and inverse
document frequency.

• The term frequency tf_ij of term i in document j is defined as the number of times that
i occurs in j.

• The document frequency df_i is the number of unique items in the database that contain the
processing token i.

• The inverse document frequency is idf_i = log_e(N / df_i), where N is the number of
items in the database.

• The above two indicators are very often multiplied together to form the “tf.idf”
weight:
w_ij = tf_ij × idf_i
or
w_ij = log(1 + tf_ij) · (1 + idf_i)
Example

Consider a 5-document collection:

D1 = “Dogs eat the same things that cats eat”
D2 = “No dog is a mouse”
D3 = “Mice eat little things”
D4 = “Cats often play with rats and mice”
D5 = “Cats often play, but not with other cats”

Index sets:

V1 = (dog, eat, cat)
V2 = (dog, mouse)
V3 = (mouse, eat)
V4 = (cat, play, rat, mouse)
V5 = (cat, play)

System dictionary: (cat, dog, eat, mouse, play, rat)

Document frequencies and inverse document frequencies:

df_cat = 3, idf_cat = ln(5/3) = 0.51
df_dog = 2, idf_dog = ln(5/2) = 0.91
df_eat = 2, idf_eat = ln(5/2) = 0.91
df_mouse = 3, idf_mouse = ln(5/3) = 0.51
df_play = 2, idf_play = ln(5/2) = 0.91
df_rat = 1, idf_rat = ln(5/1) = 1.61

Term weights (w = tf × idf):

V1 (cat, eat, dog): w_cat = 1 × 0.51 = 0.51; w_dog = 1 × 0.91 = 0.91; w_eat = 2 × 0.91 = 1.82
V2 (dog, mouse): w_dog = 1 × 0.91 = 0.91; w_mouse = 1 × 0.51 = 0.51
V3 (mouse, eat): w_mouse = 1 × 0.51 = 0.51; w_eat = 1 × 0.91 = 0.91
V4 (cat, mouse, play, rat): w_cat = 1 × 0.51 = 0.51; w_mouse = 1 × 0.51 = 0.51; w_play = 1 × 0.91 = 0.91; w_rat = 1 × 1.61 = 1.61
V5 (cat, play): w_cat = 2 × 0.51 = 1.02; w_play = 1 × 0.91 = 0.91

Weight vectors over the dictionary (cat, dog, eat, mouse, play, rat):

V1 = [0.51, 0.91, 1.82, 0, 0, 0]
V2 = [0, 0.91, 0, 0.51, 0, 0]
V3 = [0, 0, 0.91, 0.51, 0, 0]
V4 = [0.51, 0, 0, 0.51, 0.91, 1.61]
V5 = [1.02, 0, 0, 0, 0.91, 0]
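The tf·idf computation above can be reproduced with a short Python sketch. The document and dictionary names mirror the example; note that the slides truncate idf values to two decimals (ln(5/2) ≈ 0.916 appears as 0.91), so the last digit of some products differs here.

```python
import math

# Term-frequency maps for the five index sets ("eat" occurs twice in D1,
# "cat" twice in D5); keys D1..D5 correspond to vectors V1..V5.
docs = {
    "D1": {"dog": 1, "eat": 2, "cat": 1},
    "D2": {"dog": 1, "mouse": 1},
    "D3": {"mouse": 1, "eat": 1},
    "D4": {"cat": 1, "play": 1, "rat": 1, "mouse": 1},
    "D5": {"cat": 2, "play": 1},
}
N = len(docs)
dictionary = ["cat", "dog", "eat", "mouse", "play", "rat"]

# df_i = number of items containing term i; idf_i = ln(N / df_i)
df = {t: sum(1 for d in docs.values() if t in d) for t in dictionary}
idf = {t: math.log(N / df[t]) for t in dictionary}

# w_ij = tf_ij * idf_i, laid out in dictionary order
vectors = {
    name: [round(d.get(t, 0) * idf[t], 2) for t in dictionary]
    for name, d in docs.items()
}
print(vectors["D1"])  # → [0.51, 0.92, 1.83, 0.0, 0.0, 0.0]
```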
Similarity formula by Salton in the SMART system

To determine the “weight” an item has with respect to the search statement, the
cosine formula is used to calculate the distance between the vector for the item
and the vector for the query:

SIM(Q, D_i) = Σ_k (w_qk × w_ik) / sqrt(Σ_k w_qk² × Σ_k w_ik²)
Example

Assume there are 2 terms in the dictionary (t1, t2):

Doc-1 contains t1 and t2, with weights 0.5 and 0.3 respectively.

Doc-2 contains t1 with weight 0.6.

Doc-3 contains t2 with weight 0.4.

The query contains t2 with weight 0.5.

Cosine similarity

Similarity measured between the query (Q) and the documents:
SIM(Q, D1) ≈ 0.51, SIM(Q, D2) = 0, SIM(Q, D3) = 1.0

Ranked output: D3, D1, D2
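The ranking in this example can be checked with a small Python sketch of the cosine measure:

```python
import math

def cosine(q, d):
    # SIM(Q, D) = sum(q_k * d_k) / sqrt(sum(q_k^2) * sum(d_k^2))
    dot = sum(qk * dk for qk, dk in zip(q, d))
    norm = math.sqrt(sum(qk * qk for qk in q) * sum(dk * dk for dk in d))
    return dot / norm if norm else 0.0

Q = [0.0, 0.5]  # query: t2 with weight 0.5
docs = {"D1": [0.5, 0.3], "D2": [0.6, 0.0], "D3": [0.0, 0.4]}

scores = {name: cosine(Q, vec) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # → ['D3', 'D1', 'D2']
```

D3 scores 1.0 because its vector points in exactly the same direction as the query; D2 scores 0 because it shares no terms with it.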

Jaccard Similarity Measure

In the Jaccard similarity measure, the denominator becomes dependent upon
the number of terms in common. As the common elements increase, the
similarity value quickly decreases, but it is always in the range −1 to +1. The
Jaccard formula is:

SIM(Q, D_i) = Σ_k (w_qk × w_ik) / (Σ_k w_qk² + Σ_k w_ik² − Σ_k (w_qk × w_ik))

Dice Similarity Measure

The Dice measure simplifies the denominator from the Jaccard measure and
introduces a factor of 2 in the numerator. The normalization in the Dice
formula is also invariant to the number of terms in common:

SIM(Q, D_i) = 2 Σ_k (w_qk × w_ik) / (Σ_k w_qk² + Σ_k w_ik²)
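Both measures can be sketched in Python and tried on the two-term example above (Q with t2 weight 0.5, D3 with t2 weight 0.4):

```python
def jaccard(q, d):
    # SIM(Q,D) = sum(q_k*d_k) / (sum(q_k^2) + sum(d_k^2) - sum(q_k*d_k))
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (sum(a * a for a in q) + sum(b * b for b in d) - dot)

def dice(q, d):
    # SIM(Q,D) = 2*sum(q_k*d_k) / (sum(q_k^2) + sum(d_k^2))
    dot = sum(a * b for a, b in zip(q, d))
    return 2 * dot / (sum(a * a for a in q) + sum(b * b for b in d))

Q = [0.0, 0.5]
D3 = [0.0, 0.4]
print(round(jaccard(Q, D3), 3), round(dice(Q, D3), 3))  # → 0.952 0.976
```

Unlike the cosine measure, neither reaches 1.0 here: both penalize the weight mismatch (0.5 vs 0.4) even though the vectors point in the same direction.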
Query Threshold Process

• Similarity measures may return the entire database as a search result, with
some documents having similarity values close to zero.

• For this reason thresholds are usually associated with the search process.

• The threshold defines the items in the resultant Hit file from the query.

• Use of thresholds may decrease recall when documents are clustered.


Ranking Algorithms

• Similarity measures provide a means for ranking the set of retrieved


documents:

• Ordering the documents from the most likely to satisfy the query to the
least likely.

• Ranking reduces user overhead by allowing the most relevant items to be
displayed first.
Ranking Algorithms

• Page ranking algorithms are used by search engines to order search
results by relevance.

• If search results are not displayed according to the user’s interests, the
search engine will lose its popularity.

• Some of the popular page ranking algorithms:


• Weighted Page Rank
• HITS Algorithm
• Distance Rank Algorithm
• EigenRumor Algorithm
Relevance Feedback

• Thesauri and controlled vocabularies are useful for expanding a user’s
search statement to include potentially related search terms.

• An initial query might not provide an accurate description of the user’s


needs:
• User’s lack of knowledge of the domain.
• User’s vocabulary does not match authors’ vocabulary.

• The relevance feedback concept is that the new query should be based on
the old query, modified to increase the weight of terms in relevant items and
decrease the weight of terms in non-relevant items.
Relevance Feedback

• Relevance feedback: user feedback on relevance of docs in initial set of


results

• User issues a (short, simple) query

• The user marks returned documents as relevant or non-relevant.

• The system computes a better representation of the information need


based on feedback.

• Relevance feedback can go through one or more iterations.


Relevance Feedback: Example

Image search engine


http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Results for Initial Query
Relevance Feedback
Results after Relevance Feedback
The new query vector formula (Rocchio’s formula, used in the SMART system) is:

Q_new = Q_old + (1/|Dr|) Σ_{Di ∈ Dr} Di − (1/|Dnr|) Σ_{Dj ∈ Dnr} Dj

A revised version of the formula introduces tuning constants:

Q_new = α·Q_old + β·(1/|Dr|) Σ_{Di ∈ Dr} Di − γ·(1/|Dnr|) Σ_{Dj ∈ Dnr} Dj

where Dr is the set of items judged relevant (positive feedback) and Dnr the set
judged non-relevant (negative feedback). Positive feedback (β) is weighted
significantly greater than negative feedback (γ).
Impact of Relevance Feedback

• Positive feedback moves the query to retrieve items similar to the items retrieved
and thus in the direction of more relevant items.
• Negative feedback moves the query away from the non-relevant items retrieved,
but not necessarily closer to more relevant items.
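A minimal Python sketch of Rocchio-style feedback illustrates both effects. The α, β, γ values and the document vectors are illustrative assumptions, not values prescribed by the text.

```python
# Sketch of the revised (Rocchio-style) feedback formula; alpha/beta/gamma
# and the vectors below are invented for illustration.
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    new_q = [alpha * w for w in q]
    for doc in relevant:                 # positive feedback: move toward
        for k in range(len(q)):
            new_q[k] += beta * doc[k] / len(relevant)
    for doc in nonrelevant:              # negative feedback: move away
        for k in range(len(q)):
            new_q[k] -= gamma * doc[k] / len(nonrelevant)
    return [max(w, 0.0) for w in new_q]  # clip negative weights to zero

q = [0.0, 0.5]          # original query: t2 only
rel = [[0.5, 0.3]]      # items the user marked relevant
nonrel = [[0.6, 0.0]]   # items the user marked non-relevant
new_q = rocchio(q, rel, nonrel)
print(new_q)  # t1 gains weight from the relevant item, reduced by the non-relevant one
```

With β three times γ, the pull toward relevant items dominates, matching the point above that positive feedback is weighted more heavily.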
Query Modification via Relevance Feedback

Using the unnormalized similarity formula


Selective Dissemination of Information Search

• Selective dissemination of information systems, frequently called
dissemination systems, are becoming more prevalent with the growth of
the Internet.

• A dissemination system is sometimes labeled a “push” system while a


search system is called a “pull” system.

• The user defines a profile and as new info is added to the system it is
automatically compared to the user’s profile.

• If it is considered a match, it is asynchronously sent to the user’s “mail” file.


Selective Dissemination of Information Search
• The differences between the two functions lie in the dynamic nature of the
profiling process, the size and diversity of the search statements and number
of simultaneous searches per item.

• A search system operates against an existing database.

• As such, corpus statistics exist on term frequencies within and across items.

• These can be used as weighting factors in the indexing process and in the
similarity comparison (e.g., inverse document frequency algorithms).

• Profiles are relatively static search statements that cover a diversity of topics.
Example : Logicon Message Dissemination System (LMDS).

• Originated from a system created by Chase, Rosen and Wallace (CRW Inc.).

• Designed for speed to support the search of thousands of profiles with items arriving
every 20 seconds.

• It demonstrated one approach to the problem where the profiles were treated as the
static database and the new item acted like the query.

• It uses the terms in the item to search the profile structure to identify those profiles
whose logic could be satisfied by the item.

• The system uses a least-frequently-occurring-trigraph (three-character substring)
algorithm that quickly identifies which profiles cannot be satisfied by the item.

• The potential profiles are analyzed in detail to confirm if the item is a hit.
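The trigraph screen can be illustrated with a toy sketch. Everything below (the profiles, the terms, and the key-selection rule) is invented for illustration; LMDS's actual data structures are not described in this level of detail.

```python
# Toy illustration of trigraph screening (profiles and key selection invented).
def trigraphs(text):
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

profiles = {
    "energy": "petroleum pipeline",
    "medicine": "vaccine trial",
}

# One screening trigraph per profile. A real system would choose the trigraph
# least frequent in the corpus; as a stand-in we take the lexicographic minimum.
keys = {name: min(trigraphs(terms)) for name, terms in profiles.items()}

def candidate_profiles(item):
    # A profile can only be satisfied if its screening trigraph occurs in the
    # item, so one set lookup rejects most profiles; survivors would then be
    # analyzed in detail against the full profile logic.
    tris = trigraphs(item)
    return [name for name, key in keys.items() if key in tris]

print(candidate_profiles("New petroleum pipeline announced"))  # → ['energy']
```

The screen is necessary but not sufficient: it cheaply eliminates most profiles, leaving only a few candidates for the detailed confirmation step described above.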
Example: Personal Library Software (PLS) system.

• It uses the approach of accumulating newly received items into the database
and periodically running users’ profiles against the database.

• This makes maximum use of the retrospective search software but loses
near-real-time delivery of items.
Formula

• A χ² measure was used to determine the most important features.

• The test was applied to a 2×2 table containing the number of relevant (N_r) and
non-relevant (N_nr) items in which a term occurs, plus the number of relevant
and non-relevant items in which the term does not occur.

• The formula used was the standard χ² statistic for a 2×2 contingency table
with cells a, b, c, d:

χ² = N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)), where N = a + b + c + d


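A short Python sketch of the 2×2 χ² statistic, with invented counts for illustration:

```python
def chi_square(n_r, n_nr, total_r, total_nr):
    # 2x2 contingency table for one term:
    #   a = relevant items containing the term       (N_r)
    #   b = non-relevant items containing the term   (N_nr)
    #   c = relevant items not containing the term
    #   d = non-relevant items not containing the term
    a, b = n_r, n_nr
    c, d = total_r - n_r, total_nr - n_nr
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# A term appearing in 8 of 10 relevant items but only 5 of 90 non-relevant
# items scores highly; a term occurring in proportion to the class sizes
# scores zero (counts are invented).
print(round(chi_square(8, 5, 10, 90), 1))  # → 44.1
```

Higher χ² values identify terms whose occurrence pattern differs most between the relevant and non-relevant sets, i.e. the most important features.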
Weighted Searches of Boolean Systems

• The two major approaches to generating queries are Boolean and natural
language.

• Natural language queries are easily represented within statistical models


and are usable by the similarity measures discussed.

• Issues arise when Boolean queries are applied to weighted index
systems.

• Some of the issues concern how the logical operators (AND, OR, NOT)
function with weighted values, and how weights are associated with
the query terms.
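One common way to reconcile Boolean operators with weighted terms is the fuzzy-set interpretation, sketched below; this is one approach among several, not the method the text prescribes, and the document weights are invented.

```python
# Fuzzy-set evaluation of Boolean operators over term weights in [0, 1]:
# AND -> min, OR -> max, NOT -> complement.
def AND(*ws): return min(ws)
def OR(*ws): return max(ws)
def NOT(w): return 1.0 - w

# Hypothetical weights of three query terms in one document
doc = {"cat": 0.8, "dog": 0.4, "rat": 0.0}

# Query: cat AND (dog OR NOT rat)
score = AND(doc["cat"], OR(doc["dog"], NOT(doc["rat"])))
print(score)  # → 0.8
```

The result is a graded score rather than a strict true/false, so weighted items can still be ranked while honoring the Boolean structure of the query.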
Searching the INTERNET and Hypertext

• The Internet has multiple different mechanisms that serve as the basis for
searching items.

• The primary techniques are associated with servers on the Internet that
create indexes of items and allow them to be searched.

• Some of the most commonly used nodes are YAHOO, AltaVista and Lycos.

• There is also the process of searching for information on the Internet by
following hyperlinks.

• A hyperlink is an embedded link to another item that can be instantiated by
clicking on the item reference.
Introduction to Information Visualization

• A relatively new discipline growing out of the debates in the 1970s on the
way the brain processes and uses mental images.

• It required significant advancements in technology and information retrieval


techniques to become a possibility.

• One of the earliest researchers in information visualization was Doyle, who in


1962 discussed the concept of “semantic road maps” that could provide a
user a view of the whole database. The road maps show the items that are
related to a specific semantic theme.
Information visualization

• Information visualization techniques have the potential to significantly
reduce the resources a user must expend to locate needed information.

• There are many areas that information visualization and presentation can
help the user:
• reduce the amount of time to understand the results of a search and
likely clusters of relevant information
• yield information that comes from the relationships between items versus
treating each item as independent
• perform simple actions that produce sophisticated information search
functions
Information visualization

• Information visualization provides an intuitive interface to the user to


aggregate the results of the search into a display that provides a high-level
summary and facilitates focusing on likely centers of relevant items.

• The query logically extracts a virtual workspace (information space) of


potential relevant items which can be viewed and manipulated by the user.

• By representing the aggregate semantics of the workspace, relationships


between items become visible.

• These relationships are nearly impossible for the user to perceive by
viewing the items individually.
What makes a good Visualization
Cognition and Perception
Cognition
• Studied in different disciplines such as psychology,
anthropology, philosophy, and neurology.
• Cognitive psychology focuses on how processing
information influences behaviour and the role of different
mental processes in the acquisition of knowledge.
• Cognitive processes are techniques used to incorporate
new knowledge and make decisions based on that
knowledge.
• Attention, memory, perception, etc. are different
cognitive processes having different roles in cognition.
• Cognitive processes can happen consciously or
unconsciously, naturally or artificially, but they happen
fast. They are always at work, even without our
knowledge.
Perception

• Perception is the ability to capture, process, and


make sense of the information that our senses
receive.
• Our brain identifies and organizes the information it
obtains from the neural impulses and then begins to
interpret them.
• After our five senses receive several stimuli that are
sent to our brain as nerve impulses, our brain
interprets those impulses as a visual image, a sound,
taste, odour, touch, or pain.
• Perception involves both bottom-up and top-down
processing.
Types of Information Visualization

• Column chart
• Bar graph
• Network graph
• Stacked bar graph
• Histogram
• Line chart
• Pie chart
• Scatter plot or 3D scatter plot
• Box plot
• Bubble chart
• Dual-axis chart
• Stream graph
• Sankey diagram
• Chord diagram
• Choropleth map
• Hex map
• Voronoi polygon diagram
• Ridgeline plot
• Interactive decision tree
• Heatmap
• Tree map
• Circle packing
• Violin plot
• Real-time tracker
Information Visualization Examples

• Scientific studies
• Data mining
• Digital libraries
• Healthcare
• Market research
• Manufacturing
• Crime mapping
• Policy modeling
• Financial analysis
Example

The map, generated in Google Maps, offers two simple ways of representing
the route from Chiang Mai in Northern Thailand to the capital of Thailand,
Bangkok, in the center of the country.
Common Uses for Information Visualization

1. Presentation (for Understanding or Persuasion)

• “Use a picture. It’s worth a thousand words.” Tess Flanders, Journalist and
Editor, Syracuse Post Standard, 1911.

• A visual representation can help someone understand concepts that might
otherwise be impossible to explain.
2. Explorative Analysis

• The image portrays the frequency of lung cancer within the United States by
geographic region.

• Mapping disease data like this enables researchers to explore the
relationship between a disease and geography.

• Explorative analysis through information visualization allows you to see
where relationships in data may exist.
3. Confirmation Analysis

• Information visualization can be used to help confirm our understanding and
analysis of data.

• This graph is used to show the similarities in Brownian motion between sets
of particles, or it might be used to question the break in the relationship
towards the end of the graph.
Six examples of information visualization methods.
Thank You
