Internet Technology
By Dr. Noaman M. Ali Slide 2- 4
Introduction
Nowadays, users rely on the web for information,
but the amount of data on the web is growing in an
uncontrolled way.
Finding relevant and required information is a
hard task; this problem is referred to as
information overload.
In this chapter, we will discuss the information
retrieval issues related to the search for
information through the Internet.
Information Retrieval
Definition
o Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
o An information retrieval system is an
applied computer environment for processing,
storing, sorting, filtering, and searching for
large arrays of structured information.
Information Retrieval (Cont.)
Goal
o Retrieve documents with information that is
relevant to the user’s information need and help
the user complete a task.
o Users should be able to retrieve the information
they are interested in easily.
Information Retrieval (Cont.)
Examples
o Web Search (Search Engine)
o E-mail Search
o Searching your Laptop
o Searching authors, titles, and subjects in library
card catalogs or computers
o Document classification and categorization,
user interfaces, data visualization, filtering
Information Retrieval (Cont.)
Notes
o IR results may be inexact, as long as the error is
insignificant
o Data is usually natural language text, which is
not always well structured and could be
semantically ambiguous
Classification of IR Systems
Directories
Classification of IR Systems (Cont.)
Directories “Local”
Classification of IR Systems (Cont.)
Directories “Web”
Classification of IR Systems (Cont.)
Directories “Web” (Cont.)
Also called: catalogs, yellow pages, subject directories
Hierarchical taxonomies that classify human knowledge
The first level of each taxonomy has between 12 and 26 categories
Popular examples: Yahoo!, eBLAST, LookSmart, Magellan, and
Nacho.
Most allow keyword searches
Classification of IR Systems (Cont.)
Databases
Classification of IR Systems (Cont.)
Search Engines
Information Search Methods on the
Internet
► Direct Search Using Hypertext Links
Information Search Methods on the
Internet (Cont.)
► Use of Search Engines
Information Search Methods on the
Internet (Cont.)
► Search Using Special Tools
Search Engine
Search engines are the means by which most
people search the Web
Common examples are Google, AltaVista, and Bing
Yet a search engine does not actually search the
Web during your search
A search engine searches itself.
Difficulties of Building a Search Engine
Built by companies that hide the technical details
Distributed data
High percentage of volatile data
Large volume
Unstructured and redundant data
Quality of data
Heterogeneous data
Difficulties of Building a Search Engine
(Cont.)
Dynamic data
How to specify a query from the user
How to interpret the answer provided by the
system.
User Problems
Do not understand exactly how to phrase a search as a
sequence of words
Are not aware of the input requirements of the search
engine
Have trouble understanding Boolean logic, so they
cannot use advanced search
Novice users do not know how to start using a
search engine
User Problems (Cont.)
Users do not care about advertisements, so funding is a problem
Around 85% of users look only at the first page of
results, so relevant answers might be skipped
Searching Guidelines
Specify the words clearly (+, -)
Use Advanced Search when necessary
Provide as many particular terms as possible
If looking for a company, institution, or
organization, try:
www.name [.com | .edu | .org | .gov | country code]
Some search engines are specialized in some areas
Searching Guidelines (Cont.)
For broad queries, use Web directories as starting
points
Keep in mind that anyone can publish data on the
Web, so information returned by search engines
might not be accurate.
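The "+" and "-" operators from the guidelines can be illustrated with a small sketch. It assumes a common convention (a "+" term must appear, a "-" term must not appear); real search engines may interpret these symbols differently, and the function names here are hypothetical.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def matches(text, query):
    """Return True if text satisfies the +/- constraints of the query."""
    words = set(text.lower().split())
    required, excluded, optional = parse_query(query)
    if not required <= words:      # every +term must be present
        return False
    if excluded & words:           # no -term may be present
        return False
    # With no +terms, at least one plain term must match (if any were given).
    if not required and optional:
        return bool(optional & words)
    return True
```

For example, the query "+jaguar -cat" accepts a page about a jaguar car dealer and rejects a page that mentions the cat.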
Types of Search Engines
Search by Keywords
► Yandex, Google, and Bing
Search by categories
► Yahoo!
Specialize in other languages
► Chinese Yahoo and Yahoo Japan
Interview simulation
► Ask Jeeves!
Search Engine Basic Components
Indexing Module
o Spider
o Crawler ("traveling" spider)
o Indexer
Database
Search Server
An Interface
o Enables users to submit queries
o Displays results
Search Engine Basic Components (Cont.)
Crawling
o Search engines continually send
out hundreds of “robots” or
“bots” (or “spiders” or
“crawlers” )
o A robot that follows links
o Bots visit websites, read word by
word, and then index those
words, Metadata, and ALT
attributes in IMG tags
o Robot Exclusion Protocol (REP)
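As a minimal sketch of the Robot Exclusion Protocol, the example below uses Python's standard urllib.robotparser module to check whether a bot may fetch a URL. The robots.txt content, the bot name "ExampleBot", and the example.com URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a site might publish.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())   # parse rules without any network access

# A polite crawler asks before fetching each URL.
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
```

Here the first URL is permitted and the second is not, because it falls under the disallowed /private/ path.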
Search Engine Basic Components (Cont.)
Crawling (Cont.)
o Starting point?
o Popular pages
Search Engine Basic Components (Cont.)
Crawling (Cont.)
o At its peak:
o Use multiple spiders
o Each spider can keep ~300 connections to
pages at a time
o Generates around 600 KB of data per second
o Starting points:
o Dedicated server that feeds URLs to spiders
o Instead of relying on ISPs for domain names
they have their own DNS server
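The idea of feeding URLs to spiders from a set of starting points can be sketched as a breadth-first crawl over a queue of pending URLs. This is a simplified illustration, not how any particular search engine is implemented: the WEB dictionary is a hypothetical in-memory stand-in for fetching pages and extracting their links.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the links it contains.
WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seeds, max_pages=100):
    """Breadth-first crawl: follow links outward from the seed pages."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # avoids fetching the same URL twice
    visited = []              # pages in the order they were fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in WEB.get(url, []):   # stands in for fetching and parsing
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["a.example/"])
```

Starting from well-linked (popular) pages lets the crawl reach much of the graph quickly, since each visited page contributes new links to the queue.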
Search Engine Basic Components (Cont.)
Crawling (Cont.)
o Google spider looks at two things:
o Significant words within the page
o Location of the words -- Why is location
important?
Search Engine Basic Components (Cont.)
Indexing
o Spiders get the data
o Now what?
o Content analysis
o Method by which information is sorted and
stored
o One way: Storing the word and associated URL
o No way to tell if the word is important or
trivial
o How many times was the word used?
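A minimal sketch of this idea is an index that stores, for each word, the URLs where it occurs together with how many times it occurs on each page. The page texts and URLs below are hypothetical.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
pages = {
    "example.com/a": "search engines index the web",
    "example.com/b": "the web is large and the web changes",
}


def build_index(pages):
    """Map each word to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


index = build_index(pages)
```

Storing only the word and the URL would treat both pages alike for "web"; keeping the count shows that the word is used twice on the second page, which gives a first hint of its importance there.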
Search Engine Basic Components (Cont.)
Database
o Where the user's query is matched
o A huge database of Websites that is gathered and
indexed by word
o These databases can be huge, with millions of
links
o Contains only essential parts of
pages
o Only includes pages that were
indexed
o Search engines are always out of date.
Search Engine Basic Components (Cont.)
Interface
o Using the keywords you give it, a search engine then
searches its own current index.
Search Engine Basic Components (Cont.)
Interface (Cont.)
o Query Interface
✓ A box in which the user enters a sequence of words
(AltaVista uses union, HotBot uses intersection)
✓ Complex query interfaces (e.g., Boolean logic,
phrase search, title search, URL search, date
range search, data type search)
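The difference between union and intersection of query words can be sketched with set operations. The index below is hypothetical, and the search function is an illustration rather than the actual behavior of any named engine.

```python
# Hypothetical index: word -> set of pages containing it.
index = {
    "information": {"p1", "p2", "p3"},
    "retrieval": {"p2", "p3", "p4"},
}


def search(index, query, mode="union"):
    """Combine the page sets of all query words by union or intersection."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    if mode == "intersection":
        return set.intersection(*postings)   # pages containing every word
    return set.union(*postings)              # pages containing any word


any_word = search(index, "information retrieval", mode="union")
all_words = search(index, "information retrieval", mode="intersection")
```

Union returns more pages (any word matches), while intersection returns fewer but usually more focused pages (all words must match).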
Search Engine Basic Components (Cont.)
Interface (Cont.)
o Answer Interface
✓ Relevant pages appear at the top of the list
✓ Each entry in the list includes a title of the
page, a URL, a brief summary, a size, a date,
and a written language
Search Engine Basic Components (Cont.)
Search Server
o Interfaces are based on rankings
o Search engines return results based on a ranking
system
o Ranking is the order in which files are listed when
they are retrieved.
o Google is different:
Search Engine Basic Components (Cont.)
Search Server (Cont.)
o Ranking is a relationship among items that defines
their order
o For more useful information:
✓ Number of times word appears on page
✓ Assign a weight to each word
o Each search engine has a different formula for
assigning weight to words in its index
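Since each engine uses its own formula, the following is only one possible sketch of weighting. It assumes that a word in the title counts five times as much as a word in the body; the weights, the pages, and the function names are all hypothetical.

```python
# Hypothetical pages with a title and a body.
pages = {
    "p1": {"title": "web search", "body": "how a search engine ranks pages"},
    "p2": {"title": "cooking", "body": "search search search for recipes"},
    "p3": {"title": "gardening", "body": "plants and soil"},
}

# Assumed weights: a word in the title counts more than one in the body.
TITLE_WEIGHT = 5
BODY_WEIGHT = 1


def score(page, word):
    """Weighted count of how often the word appears in title and body."""
    title_hits = page["title"].lower().split().count(word)
    body_hits = page["body"].lower().split().count(word)
    return TITLE_WEIGHT * title_hits + BODY_WEIGHT * body_hits


def rank(pages, word):
    """Return URLs ordered by descending score, omitting non-matching pages."""
    scored = [(score(page, word), url) for url, page in pages.items()]
    return [url for s, url in sorted(scored, reverse=True) if s > 0]


results = rank(pages, "search")
```

With these assumed weights, p1 ranks above p2 even though p2 repeats the word more often, because p1 has the word in its title.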
o Popular way of indexing: Hashing
✓ Numerical value assigned to each word that
can be retrieved using a formula
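Hashing can be sketched as follows: a formula turns each word into a number, and that number selects the slot where the word's entry is stored. The formula here (sum of character codes modulo the table size) is a deliberately simple assumption for illustration; production systems use stronger hash functions.

```python
TABLE_SIZE = 8  # assumed number of slots in the table


def word_hash(word):
    """Toy formula: sum of character codes modulo the table size."""
    return sum(ord(ch) for ch in word) % TABLE_SIZE


# Each slot holds (word, urls) pairs for the words that hash to it.
table = [[] for _ in range(TABLE_SIZE)]


def insert(word, url):
    """Record that the word occurs at the given URL."""
    bucket = table[word_hash(word)]
    for entry_word, urls in bucket:
        if entry_word == word:
            urls.add(url)
            return
    bucket.append((word, {url}))


def lookup(word):
    """Return the URLs stored for the word, or an empty set."""
    for entry_word, urls in table[word_hash(word)]:
        if entry_word == word:
            return urls
    return set()


insert("web", "example.com/a")
insert("web", "example.com/b")
insert("search", "example.com/a")
```

A lookup recomputes the same formula, so only one slot has to be examined instead of the whole index.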
Web Browser
SUMMARY
Finding relevant and required information is a hard
task; this problem is referred to as information overload.
The primary goal of Information Retrieval systems is to
retrieve documents with information that is relevant to
the user’s information need and help the user
complete a task.
There are three main types of online IR systems: the
directory, the database and the search engine.
There are three main methods of internet search: Direct
Search Using Hypertext Links, Use of Search Engines,
and Search Using Special Tools.
SUMMARY (Cont.)
Almost all major search engines have their own
structure, different from others. However, it is possible
to single out the main components common to all
search engines.
The structures differ only in how the mechanisms by
which these components interact are implemented.