
Internet Technology

Dr. Noaman Muhammad Ali


Ph.D. in Informatics and Information Processes
Information Systems & Technology Department
Spring 2023-2024
Chapter 2

OUTLINE
 Introduction
 Information Retrieval
o Definition
o Goal
o Classification of Information Retrieval Systems
 Information Search Methods on the Internet
o Direct Search Using Hypertext Links
o Use of Search Engines
o Search Using Special Tools
Internet Technology
By Dr. Noaman M. Ali Slide 2- 3
OUTLINE (Cont.)
 Search Engine Basic Components
o Indexing Module
o Database
o Search Server
 Web Browser

Introduction
 Nowadays, users rely on the web for information,
but the amount of data on the web is growing in an
uncontrolled way.
 Finding relevant and required information is a
hard task; this problem is referred to as
information overload.
 In this chapter, we will discuss the information
retrieval issues related to the search for
information through the Internet.

Information Retrieval
 Definition
o Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
o An information retrieval system is an
applied computer environment for processing,
storing, sorting, filtering, and searching for
large arrays of structured information.
Information Retrieval (Cont.)
 Goal
o Retrieve documents with information that is
relevant to the user’s information need and help
the user complete a task.
o Users should be able to retrieve the information
of interest easily.

Information Retrieval (Cont.)
 Examples
o Web Search (Search Engine)
o E-mail Search
o Searching your Laptop
o Searching authors, titles, and subjects in library
card catalogs or computers
o Document classification and categorization,
user interfaces, data visualization, filtering

Information Retrieval (Cont.)
 Notes
o IR can be inaccurate as long as the error is
insignificant
o Data is usually natural language text, which is
not always well structured and could be
semantically ambiguous

Classification of IR Systems
 Directories

Classification of IR Systems (Cont.)
 Directories “Local”

Classification of IR Systems (Cont.)
 Directories “Web”

Classification of IR Systems (Cont.)
 Directories “Web” (Cont.)
 Also called: catalogs, yellow pages, subject directories
 Hierarchical taxonomies that classify human knowledge
 The first level of each taxonomy typically has 12 to 26 categories
 Popular examples: Yahoo!, eBLAST, LookSmart, Magellan, and
Nacho.
 Most allow keyword searches

Classification of IR Systems (Cont.)
 Databases

Classification of IR Systems (Cont.)
 Search Engines

Information Search Methods on the Internet
► Direct Search Using Hypertext Links

Information Search Methods on the Internet (Cont.)
► Use of Search Engines

Information Search Methods on the Internet (Cont.)
► Search Using Special Tools

Search Engine
 Search engines are the means by which most
people search the Web
 Common examples are Google, AltaVista, and Bing
 Yet a search engine does not actually search the
Web during your search
 A search engine searches its own index.

Difficulties of Building a Search Engine
 Built by companies that hide the technical details
 Distributed data
 High percentage of volatile data
 Large volume
 Unstructured and redundant data
 Quality of data
 Heterogeneous data

Difficulties of Building a Search Engine (Cont.)
 Dynamic data
 How to specify a query from the user
 How to interpret the answer provided by the
system.

User Problems
 Do not understand exactly how to phrase a search
as a sequence of words
 Are not aware of the input requirements of the
search engine
 Problems understanding Boolean logic, so the
users cannot use advanced search
 Novice users do not know how to start using a
search engine

User Problems (Cont.)
 Do not care about advertisements, yet advertising
is what funds search engines
 Around 85% of users only look at the first page of
the result, so relevant answers might be skipped

Searching Guidelines
 Specify the words clearly (+ to require a word,
- to exclude one)
 Use Advanced Search when necessary
 Provide as many particular terms as possible
 If looking for a company, institution, or
organization, try:
www.name [.com | .edu | .org | .gov | country code]
 Some search engines are specialized in some areas
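The + and - markers in the first guideline can be illustrated with a small filter. This is only a sketch under an assumption: it treats +word as "must appear" and -word as "must not appear", the convention used by several classic search engines. The function names `parse_query` and `matches` are hypothetical, not part of any real engine.

```python
def parse_query(query):
    """Split a query into required (+word), excluded (-word), and optional words."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def matches(text, query):
    """Return True if the text has every required word and no excluded word."""
    words = set(text.lower().split())
    required, excluded, optional = parse_query(query)
    if not required <= words or excluded & words:
        return False
    # With no required words, at least one optional word must appear.
    return bool(required) or bool(optional & words) or not optional


print(matches("jaguar car dealer", "+jaguar -cat"))
print(matches("jaguar big cat", "+jaguar -cat"))
```

The first call accepts the page about cars; the second rejects the page that contains the excluded word.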

Searching Guidelines (Cont.)
 For broad queries, use Web directories as
starting points
 Keep in mind that anyone can publish data on the
Web, so information returned by search engines
might not be accurate.

Types of Search Engines
 Search by Keywords
► Yandex, Google, and Bing
 Search by categories
► Yahoo!
 Specialize in other languages
► Chinese Yahoo and Yahoo Japan
 Interview simulation
► Ask Jeeves!
Search Engine Basic Components
 Indexing Module
o Spider
o Crawler ("traveling" spider)
o Indexer
 Database
 Search Server
 An Interface
o Enables users to submit queries
o Displays results
Search Engine Basic Components (Cont.)
 Crawling
o Search engines continually send out hundreds of “robots” or
“bots” (or “spiders” or “crawlers”)
o A robot that follows links
o Bots visit websites, read word by word, and then index those
words, metadata, and ALT attributes in IMG tags
o Robot Exclusion Protocol (REP)
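As a minimal sketch of the Robot Exclusion Protocol, the example below uses Python's standard `urllib.robotparser` module to decide whether a bot may fetch a page. The robots.txt content, the bot name `ExampleBot`, and the helper `may_crawl` are assumptions made for illustration; a real bot would first download the file from the site's `/robots.txt` address.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, kept in memory so the sketch
# runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, bot_name="ExampleBot"):
    """Return True if the robots.txt rules allow this bot to fetch the URL."""
    return parser.can_fetch(bot_name, url)


print(may_crawl("https://example.com/index.html"))
print(may_crawl("https://example.com/private/notes.html"))
```

A well-behaved crawler runs a check like this before every request and skips the URLs that the site owner has excluded.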
Search Engine Basic Components (Cont.)
 Crawling (Cont.)
o Starting point?
o Popular pages
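The idea of starting from popular pages and following links can be sketched as a breadth-first traversal. The miniature "web" below is a hypothetical in-memory dictionary rather than real sites, and `crawl` is an illustrative name; a production crawler would also download pages and respect exclusion rules.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the pages it links to.
TOY_WEB = {
    "portal.example": ["news.example", "shop.example"],
    "news.example": ["portal.example", "blog.example"],
    "shop.example": ["blog.example"],
    "blog.example": [],
    "island.example": ["portal.example"],  # no page links to this one
}


def crawl(start_pages, web):
    """Visit pages breadth-first by following links from the starting pages."""
    visited = []
    seen = set(start_pages)
    frontier = deque(start_pages)
    while frontier:
        page = frontier.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


print(crawl(["portal.example"], TOY_WEB))
```

Note that `island.example` is never reached, because no visited page links to it. This is one reason the choice of starting points matters.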

Search Engine Basic Components (Cont.)
 Crawling (Cont.)
o At its peak:
o Use multiple spiders
o Each spider can keep ~300 connections to
pages at a time
o Generates about 600 KB of data per second
o Starting points:
o Dedicated server that feeds URLs to spiders
o Instead of relying on ISPs for domain names
they have their own DNS server
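The arrangement of one dedicated server feeding URLs to several spiders can be sketched with a shared queue and worker threads. This is an assumption-laden toy: `FAKE_PAGES` stands in for real downloads, `run_spiders` is a hypothetical name, and the sketch does not model the hundreds of simultaneous connections per spider or the private DNS server.

```python
import queue
import threading

# Hypothetical page store standing in for real HTTP downloads.
FAKE_PAGES = {f"page{i}.example": f"content of page {i}" for i in range(20)}


def run_spiders(urls, spider_count=4):
    """Feed URLs from one shared queue to several spider threads."""
    url_server = queue.Queue()
    for url in urls:
        url_server.put(url)

    fetched = {}
    lock = threading.Lock()

    def spider():
        while True:
            try:
                url = url_server.get_nowait()
            except queue.Empty:
                return  # no URLs left, so this spider stops
            content = FAKE_PAGES.get(url, "")
            with lock:
                fetched[url] = content

    spiders = [threading.Thread(target=spider) for _ in range(spider_count)]
    for thread in spiders:
        thread.start()
    for thread in spiders:
        thread.join()
    return fetched


print(len(run_spiders(list(FAKE_PAGES))))
```

Because every spider takes its next URL from the same queue, no page is fetched twice and the work is balanced automatically.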
Search Engine Basic Components (Cont.)
 Crawling (Cont.)
o Google spider looks at two things:
o Significant words within the page
o Location of the words -- Why is location
important?
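To make "significant words" and "location" concrete, the sketch below uses Python's standard `html.parser` module to record each word together with where it was found (title, body, or an image's ALT text). The stop-word list and the class name `WordLocationParser` are assumptions for illustration; this is not Google's actual method.

```python
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in"}  # illustrative list


class WordLocationParser(HTMLParser):
    """Collect significant words together with where they occur."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.words = []  # list of (word, location) pairs

    def add_words(self, text, location):
        for raw in text.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if word and word not in STOP_WORDS:
                self.words.append((word, location))

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            alt_text = dict(attrs).get("alt")
            if alt_text:
                self.add_words(alt_text, "alt")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        self.add_words(data, "title" if self.in_title else "body")


def extract_words(html):
    parser = WordLocationParser()
    parser.feed(html)
    parser.close()
    return parser.words


SAMPLE = ('<html><head><title>Search Engines</title></head><body>'
          '<p>The history of search.</p>'
          '<img src="x.png" alt="Spider diagram"></body></html>')
print(extract_words(SAMPLE))
```

A word found in the title usually says more about the page's topic than the same word buried in the body, which is why an indexer may give it a higher weight.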

Search Engine Basic Components (Cont.)
 Indexing
o Spiders get the data
o Now what?
o Content analysis
o Method by which information is sorted and
stored
o One way: Storing the word and associated URL
o No way to tell if the word is important or
trivial
o How many times was the word used?
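Storing only the word and its URL cannot distinguish important words from trivial ones, so a common refinement is to store a count as well. The following is a minimal sketch of such an index; the page texts and the name `build_index` are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to {url: number of times the word occurs on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


# Hypothetical pages used only for illustration.
PAGES = {
    "a.example": "web search engines index the web",
    "b.example": "a web browser displays pages",
}

index = build_index(PAGES)
print(index["web"])  # {'a.example': 2, 'b.example': 1}
```

The count answers the question "How many times was the word used?" and gives the ranking step something to work with.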
Search Engine Basic Components (Cont.)
 Database
o Where the user's query is matched
o A huge database of Websites that is gathered and
indexed by word
o These databases can be huge, with millions of
links
o Contains only essential parts of pages
o Only includes pages that were indexed
o Search engines are always out of date.
Search Engine Basic Components (Cont.)
 Interface
o Using the keywords you give it, a search engine then
searches its own current index.

Search Engine Basic Components (Cont.)
 Interface (Cont.)
o Query Interface
✓ A box in which the user enters a sequence of words
(AltaVista combines them by union, HotBot by intersection)
✓ Complex query interfaces (e.g., Boolean logic,
phrase search, title search, URL search, date
range search, data type search)
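The difference between union and intersection can be shown on a tiny index. In this sketch the index contents and the function name `search` are hypothetical; union returns pages containing any query word, while intersection returns only pages containing all of them.

```python
def search(index, query, mode="intersection"):
    """Return URLs matching any query word (union) or all of them (intersection)."""
    postings = [set(index.get(word, ())) for word in query.lower().split()]
    if not postings:
        return set()
    if mode == "union":
        return set.union(*postings)
    return set.intersection(*postings)


# Hypothetical index: word -> set of URLs containing it.
INDEX = {
    "web": {"a.example", "b.example"},
    "browser": {"b.example"},
    "crawler": {"c.example"},
}

print(sorted(search(INDEX, "web browser", mode="union")))
print(sorted(search(INDEX, "web browser", mode="intersection")))
```

Union gives more results (Boolean OR), intersection gives fewer but more specific ones (Boolean AND).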

Search Engine Basic Components (Cont.)
 Interface (Cont.)
o Answer Interface
✓ Relevant pages appear at the top of the list
✓ Each entry in the list includes a title of the
page, a URL, a brief summary, a size, a date,
and a written language

Search Engine Basic Components (Cont.)
 Search Server
o Interfaces are based on rankings
o Search engines return results based on a ranking
system
o Ranking is the order in which files are listed when
they are retrieved.
o Google is different:

✓ PageRank™ method based on popularity
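The popularity idea can be sketched as follows: a page is important if important pages link to it. This is a simplified illustration only, not Google's production algorithm. The link graph is hypothetical, and the damping value 0.85 is the figure commonly quoted for PageRank, assumed here rather than taken from these slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly passing rank along links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.example": ["c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))
```

Here `c.example` comes out on top because three pages link to it, while `b.example` and `d.example` receive no links at all.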

Search Engine Basic Components (Cont.)
 Search Server (Cont.)
o Ranking is a relationship between items that
defines their order
o For more useful information:
✓ Number of times word appears on page
✓ Assign a weight to each word
o Each search engine has a different formula for
assigning weight to words in its index
o Popular way of indexing: Hashing
✓ Numerical value assigned to each word that
can be retrieved using a formula
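The points above can be combined in a small sketch: each query word contributes its weight multiplied by the number of times it appears on a page, and pages are listed from the highest score down. The weights, index contents, and the name `score_pages` are illustrative assumptions, since each search engine uses its own formula.

```python
def score_pages(index, weights, query):
    """Rank pages by the summed weight * count of each query word."""
    scores = {}
    for word in query.lower().split():
        weight = weights.get(word, 1.0)
        for url, count in index.get(word, {}).items():
            scores[url] = scores.get(url, 0.0) + weight * count
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical index: word -> {url: occurrences}. A Python dict is itself a
# hash table, so each word lookup computes a hash instead of scanning a list.
INDEX = {
    "search": {"a.example": 3, "b.example": 1},
    "engine": {"a.example": 1, "b.example": 3, "c.example": 1},
}
WEIGHTS = {"search": 1.0, "engine": 2.0}  # illustrative weights

print(score_pages(INDEX, WEIGHTS, "search engine"))
```

With these numbers `b.example` scores 7, `a.example` scores 5, and `c.example` scores 2, so the pages are returned in that order.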
Web Browser

SUMMARY
 Finding relevant and required information is a hard
task; this problem is referred to as information overload.
 The primary goal of Information Retrieval systems is to
retrieve documents with information that is relevant to
the user’s information need and help the user
complete a task.
 There are three main types of online IR systems: the
directory, the database and the search engine.
 There are three main methods of internet search: Direct
Search Using Hypertext Links, Use of Search Engines,
and Search Using Special Tools.

SUMMARY (Cont.)
 Almost all major search engines have their own
structure, different from others. However, it is possible
to single out the main components common to all
search engines.
 The structures differ only in how the interaction
between these components is implemented.
