
SEARCH ENGINES

INTRODUCTION
The Web is potentially a terrific place to get information on almost any topic. Doing
research without leaving your desk sounds like a great idea, but all too often you end up
wasting precious time chasing down useless URLs. Almost everyone agrees that there has
to be a better way! But for now, we're stuck with making the best use of the search
tools that already exist on the Web.

It's important to give some thought to your search strategy. Are you just beginning to
amass knowledge on a fairly broad subject? Or do you have a specific objective in mind--
like finding out everything you can about carpal tunnel syndrome, or the e-mail address
of your old college roommate?

If you're more interested in broad, general information, the first place to go is to a Web
Directory. If you're after narrow, specific information, a Web search engine is probably a
better choice.

Searching by Means of Subject Directories

Think back to the library card catalogue analogy. In the old card files, and even in
today's computer terminal library catalogues, you find information by searching on either
the author, the title, or the subject. You usually choose the subject option when you want
to cover a broad range of information.

Example: You'd like to create your own home page on the Web, but you don't know how
to write HTML, you've never created a graphic file, and you're not sure how you'd post a
page on the Web even if you knew how to write one. In short, you need a lot of
information on a rather broad topic--Web publishing.

DEPT. OF ELECTRONICS & COMMUNICATION
Your best bet is not a search engine, but a Web directory like Yahoo. Yahoo is a subject-
tree style catalogue that organizes the Web into 14 major topics, including Arts, Business
and Economy, Computers and Internet, Education, Entertainment, Government, Health,
News, Recreation, Reference, Regional, Science, Social Science, Society and Culture.
Under each of these topics is a list of subtopics, and under each of those is another list,
and another, and so on, moving from the more general to the more specific.

Example: To find out about Web page publishing from Yahoo, select the Computers and
Internet topic, under which you find a subtopic on the World Wide Web. Click on that
and you find another list of subtopics, several of which are pertinent to your search: Web
Page Authoring, CGI Scripting, Java, HTML, Page Design, Tutorials. Selecting any of
these subtopics eventually takes you to Web pages that have been posted precisely for the
purpose of giving you the information you need.

If you are clear about the topic of your query, start with a Web directory rather than a
search engine. Directories probably won't give you anywhere near as many references as
a search engine will, but they are more likely to be on topic.

Web directories usually come equipped with their own keyword search engines that allow
you to search through their indices for the information you need.

Important note: More and more search engines are incorporating Web directories into
their sites. These directories interact with the main search engine on the site in various
ways. See Excite, Infoseek and Lycos, even AltaVista--they are no longer "just a search
engine." They now characterize themselves as Web portals or hubs--places people come to
on the Web to get information about a multitude of subjects, and even to chat, send
email and form online communities.

Searching by Means of Search Engines

This is where things start to get very complicated.

Search engines are trickier than they look! You'll discover this the first time you enter a
query on C++, the programming language. At least one of the Web search engines will
essentially say, "Huh?"

C++ is not a word. It's a letter followed by two characters that might, depending on the
index, be regarded merely as punctuation. Many text search engines have trouble
handling input of this type. Many don't deal too well with numbers, either. So much for
"007," "R2D2,"or "Catch-22."

Important Note: This problem is no longer as bad as it used to be. I'm now finding
relevant hits for C++ on a majority of search engine sites.

Here's another example of a text string search engines hate: To be or not to be. Just
about anyone who finished junior high school will be able to tell you where the phrase
comes from and (possibly!) what it means. But some search engines choke because all
the words in the phrase are stop words--i.e., unimportant words too short and too
common to be considered relevant strings on which to search. However, if you enclose
the query in quotation marks, forcing the search engine to find the words, "to be or not to
be" in that precise order, most search engines can recognize the phrase as a famous
quotation from Hamlet.
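The stop-word and quoted-phrase behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not any real engine's parser; the stop-word list is invented for the example.

```python
# Minimal sketch: drop stop words from a loose query, but keep the words
# of a quoted phrase verbatim and in order. The stop-word list is a toy.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "an", "of", "in"}

def parse_query(query: str):
    """Return (phrases, keywords): quoted phrases are kept whole,
    unquoted words are filtered against the stop-word list."""
    phrases, keywords = [], []
    parts = query.split('"')
    # Odd-indexed chunks fall between a pair of quotation marks.
    for i, part in enumerate(parts):
        if i % 2 == 1:
            phrases.append(part.strip().lower())
        else:
            for word in part.lower().split():
                if word not in STOP_WORDS:
                    keywords.append(word)
    return phrases, keywords
```

Run against the examples above, the unquoted query "to be or not to be" is filtered down to nothing, while the quoted version survives as a phrase to be matched in that precise order.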

Let's take a less obvious example. Suppose you're a fan of murder mysteries and you
want to search the Web for the home pages of all your favorite authors in that genre. If
you simply enter the words "mystery" and "writer," most search engines will return
hyperlinks to all Web documents that contain the word "mystery" or the word "writer."
This will probably include hundreds--or even thousands--of URLs, most of which will
have no relevance to your search. If you enter the words as a phrase, however, you stand
a better chance of getting some good hits.

However, as search technology advances, this is not as much of a problem as it was a
couple of years ago. Many search engines will now automatically apply the "adjacency"
operator when responding to a two-word query. This means that they will indeed look for
documents in which your two words appear next to each other.

If you understand how search engines organize information and run queries, you can
maximize your chances of getting hits on URLs that matter.

Search Engine Servers

Search engines, like all other web sites, are housed on high-speed computers
called Web servers. These servers are completely dedicated to providing effective
search services 24 hours a day. Search engine servers are connected to the
backbone (the high-speed infrastructure) of the WWW via extremely fast, expensive
telephone lines called T3 lines. Most of Yahoo's servers, for example, are
located in Santa Clara, California.

Search Engine Databases

Before search engines can function, they need to have a collection of information
(a database, also called an index) to search. No search engine actually goes out
onto the WWW to look for matches when a query is entered. Think about it: web
sites sometimes go offline for maintenance, and connection speeds vary
depending on how busy the web is at any given time. If a search engine were to
initiate its search of the WWW when its visitor clicked the "Search" button, its
search would take weeks, not seconds!

The solution to this problem is the creation and maintenance of an enormous
database. When a surfer performs a search, the engine searches its database, not
the WWW itself. Ideally, these databases would be a perfect, complete reflection
of the WWW. Because thousands of web pages are added, deleted and changed every
minute of every day, no search engine database meets this lofty goal. If one did,
it would simply be a copy of the whole WWW! Realistically, each database is at
least a large and significant sampling of quality web sites. At best, these
collections sport an impressive, frequently updated and detailed majority of the
WWW.

Once a database is in place, search engines keep much of these giant summaries
in the memory of their computers, not just on hard drives or other mechanical
storage media. Electronic searches (in memory) are much faster than mechanical
searches (on hard drives) because electronic searches can be performed at the
speed of electricity (near the speed of light). In this manner, a search engine can
search through its database of millions of web site summaries within a few
seconds, delivering very fast results. Most household PCs these days have around
32 MegaBytes (MB - millions of bytes) of memory in them. Computers used as
web servers for search engines have GigaBytes (GB - billions of bytes) of
memory to allow them to maintain much of their huge databases in quickly
searchable electronic memory.
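The database the engine actually searches is typically an inverted index held in memory: a mapping from each word to the documents that contain it. Here is a toy version in Python; the document texts and IDs are invented for illustration.

```python
# Toy in-memory inverted index: the engine searches this dictionary,
# never the live Web. Documents here are invented examples.
from collections import defaultdict

DOCS = {
    "url1": "diabetes symptoms and diabetes treatment",
    "url2": "mystery writer home page",
    "url3": "treatment of heart disease",
}

def build_index(docs):
    """Map each word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, word):
    """Return the sorted list of documents containing the word."""
    return sorted(index.get(word.lower(), set()))
```

Because a lookup is a single dictionary access rather than a scan of the live Web, it completes in a fraction of a second even over millions of entries.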

Search Engine Ranking Algorithms

After the database has been created and placed in the search engine computer's
memory, the engine is finally ready to perform searches and deliver results. Only
now does another component come into play: the ranking algorithm. All search
engines, including directories, score the relevancy of web pages using these
mathematical machines. Their purpose is to deliver links to the web pages most
relevant to each search phrase. Rightfully so, these automatic mechanisms are a
source of great pride and revenue for their inventors.

When a surfer types in a search phrase on a search engine and hits the "Search"
button, the algorithm jumps into action. Say, for example, that a surfer types in
"martial arts in phoenix" as their search phrase. The algorithm then looks at the
first database entry in its memory, searching for occurrences of the entire search
phrase, or for occurrences of the individual key words "martial", "arts" or
"phoenix" (extremely common words like "in" are usually ignored).

Each ranking algorithm assigns different weights to different occurrences of the
key words, depending on where and in what form these matches are found (more
on this below). Taking all these factors into account, these algorithms generate a
relevancy score for the first web page in their memory. They then proceed to do
the same for the second, third and millionth web pages. Finally, the relevancy
scores are sorted in order from most relevant to least, and the corresponding web
pages are listed in this order with informative summary information from the
database. Voila! The surfer (hopefully) gets the results he or she was looking for.
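The scoring loop just described can be sketched as follows. The stored summaries, the stop-word list and the equal per-word weights are all invented for illustration; real engines use far more elaborate weighting.

```python
# Sketch of the ranking loop: score every stored summary against the
# query words, ignore common words like "in", then sort by score.
STOP_WORDS = {"in", "the", "a", "of"}

SUMMARIES = {
    "dojo.example":  "phoenix martial arts academy martial arts classes",
    "birds.example": "the phoenix is a mythical bird",
    "paint.example": "arts and crafts supplies",
}

def rank(query):
    """Return URLs with at least one keyword hit, best match first."""
    keywords = [w for w in query.lower().split() if w not in STOP_WORDS]
    scored = []
    for url, summary in SUMMARIES.items():
        words = summary.lower().split()
        score = sum(words.count(k) for k in keywords)
        scored.append((score, url))
    scored.sort(reverse=True)  # highest relevancy score first
    return [url for score, url in scored if score > 0]
```

For the query "martial arts in phoenix", the dojo page matches all three keywords repeatedly and rises to the top, while the pages that mention only "phoenix" or only "arts" trail behind.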

Although all search engines incorporate the basic components described above,
the boundaries among these components are not rigid. The designs of a search
engine's database and of its ranking algorithm go hand in hand, and usually it's
difficult to discern where one ends and the other begins. For example, some
search engines might calculate and store ranking information for obvious web
page themes during the creation of their databases, in order to speed up the job of
the ranking algorithm. Major functional differences are also apparent between
deep search engines and directories, beginning with their distinct approaches to
building databases.

INFORMATION RETRIEVAL STRATEGIES

We can safely regard web searches as an IR (Information Retrieval) problem. Compared
to searching a database, the search for document content is perhaps more daunting,
since it is not structured.

Documents should be indexed to make searching easier and less time consuming. Indexing
is the process of building a document representation by assigning content descriptors or
terms to the document. Each document has objective terms (e.g., the author's name, the
document URL, and the date of publication), and non-objective terms, known as content
terms, intended to reflect the document's information content. The effectiveness of
indexing can be measured by two main parameters: indexing exhaustivity and term
specificity.

In indexing, web documents are characterized by recall (the ratio of the number of
relevant documents retrieved to the total number of relevant documents in the collection)
and precision (the ratio of the number of relevant documents retrieved to the total
number of documents retrieved). This function can be performed either manually or
automatically, but millions of web sites render manual indexing quite impractical.
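The two definitions above translate directly into code. The document sets below are invented for the example.

```python
# Recall and precision as defined above, computed over sets of
# document IDs.
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)
```

Note the trade-off the two measures capture: retrieving everything drives recall to 1 but ruins precision, while retrieving one sure hit does the opposite.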

Automatic indexing includes single-term indexing, statistical methods, and
information-theoretic and probabilistic methods. In addition, automatic indexing
uses linguistic and multi-term or phrase indexing.

IR MODELS

An IR model can be characterized by the representations made for documents and
queries. The model matches strategies for accessing relevant documents to a user query.
It also utilizes methods for ranking query output, and has a mechanism to acquire user
relevance feedback. Let us take a look at the various IR models.

SET THEORETIC MODELS

These models represent documents by a set of index terms, each of which is viewed as a
Boolean variable and valued as true if it is present in the document.

ALGEBRAIC MODELS

Here the document can be represented by a set of vectors in a space spanned by a set of
normalized term vectors.
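A common concrete form of the algebraic model is the vector space model: documents and queries become term-count vectors, and relevance is the cosine of the angle between them. The vocabulary and texts below are invented for illustration.

```python
# Sketch of the vector space model: represent texts as term-count
# vectors over a fixed vocabulary and compare them by cosine similarity.
import math

def vectorize(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

A query vector is compared against every document vector, and documents are ranked by decreasing cosine score.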

PROBABILISTIC MODELS

These models take term dependencies and relationships into account and, in fact,
specify major parameters such as the weights of the query terms and the form of the
query-document match.

HYBRID MODELS

These models combine both set theoretic and algebraic models to retrieve documents
effectively.

INDEX OR DIRECTORY

Currently there are two different types of search engines - spider crawled index search
engines, such as Google, and human index search engines, such as Yahoo! directory.
Most people assume they are one and the same, but in fact they are very different. We
will discuss the differences between the two in more detail as we go along.
Search engines were designed as an index tool for displaying relevancy. They display
references based on the information fed into them. Fortunately, you can control what the
search engine spider sees when it crawls through your site.

Spider Crawled Index Search Engines

A search engine spider is a software program which crawls through each page of a
website to determine relevancy. The program operates on its own without any human
involvement. The spider then uses different methods to determine where the page ranks
among its competition. Once the spider absorbs the website's entire content, it follows
any hyperlinks listed and runs from one web page to another. In theory, it should
eventually locate every website on the Net. Once a website has been indexed, it is
recorded and loaded into an enormous database which keeps the site on file. Most of
these search engine databases have billions of websites stored. Google claims to have
close to 3.5 billion web pages stored in its database to date.
Most search engines today use these types of robots to rank your relevancy; Google,
Yahoo!, and AskJeeves are examples. In fact, more than 95% of all search engines use
this type of ranking system. If optimized correctly, you can achieve astonishing results
from a spider crawled index engine.

When you submit your site to an index crawled search engine, it then deploys the spider
to your website to gather and document information. The spider takes note of your
website’s URL, title, text, meta tags, alt tags as well as other factors. When dealing with
index crawled search engines, things you cannot see may be just as important as the
things you can.

Human Index Search Engines

Human index search engines, also referred to as directories, do not utilize the
spider. Instead, they have a real human visit your website and go through all the
different pages. This person then lists your site in the most relevant category in the
directory. Picking up favorable listings in a directory can be difficult because a real
human is reviewing your website, not a robot. Most of the listing tricks which influence
the spider will not work with a directory. Directories have real humans viewing what is
physically seen on your site and categorizing it as he or she sees fit. Everything is left
up to a single person and how that person happens to feel that day. There are still certain
criteria that must be built into the website if you want to have a shot at a top position,
but remember, after your page has been submitted, it's totally up to a real living person
to decide where you rank.

SEARCH TOOLS AND SERVICES

The search industry has two ways to find things: through directories and spiders.
The problem with directories, which store knowledge in some structure, is that
classification is a labour-intensive activity, and there are far more publishers than
classifiers on the web. And if the information you are looking for is not reflected by the
classification structure, then you are out of luck. This happens quite often.

An alternative is intensive automation involving a spider, or robot, which explores the
web and helps find web pages. Spiders also have the ability to test databases against
queries and order the resulting matches; they have a user interface for obtaining
queries and presenting results.

Search tools employ robots for indexing web documents, and these can be classified as
type 1 and type 2.

Search services
Search services broadcast user queries to several engines and various other information
sources simultaneously. They then merge the results submitted by these sources, check
for duplicates, and present them to the user as an HTML page with clickable URLs.

Search sites
There are basically two types of search sites on the web: search directories and search
engines.
Search directories contain a list of web sites organized hierarchically into categories
and subcategories. These are created manually rather than being automated.

Search engines, on the other hand, are huge computer-generated databases containing
information on millions of web sites. They use spiders to automatically look up web
sites and update their databases.

To eliminate the need to visit several engines, log on to meta search sites. They
take your request to various search engines and give you better coverage. Meta
search sites do not have search capabilities of their own.

HOW SEARCH ENGINES WORK

Search engines use software robots to survey the Web and build their databases. Web
documents are retrieved and indexed. When you enter a query at a search engine
website, your input is checked against the search engine's keyword indices. The best
matches are then returned to you as hits.

There are two primary methods of text searching--keyword and concept.

KEYWORD SEARCHING:

This is the most common form of text search on the Web. Most search engines do their
text query and retrieval using keywords.

Unless the author of the Web document specifies the keywords for her document (this is
possible by using meta tags in the latest version of HTML), it's up to the search engine to
determine them. Essentially, this means that search engines pull out and index words that
are believed to be significant. Words that are mentioned towards the top of a document
and words that are repeated several times throughout the document are more likely to be
deemed important.

Some sites index every word on every page. Others index only part of the document. For
example, Lycos indexes the title, headings, subheadings and the hyperlinks to other sites,
along with the first 20 lines of text and the 100 words that occur most often.

Infoseek uses a full-text indexing system, picking up every word in the text except
commonly occurring stop words such as "a," "an," "the," "is," "and," "or," and "www."
Hotbot also ignores stop words. AltaVista claims to index all words, even the articles,
"a," "an," and "the." Some of the search engines discriminate upper case from lower
case; others store all words without reference to capitalization.

THE PROBLEM WITH KEYWORD SEARCHING:

Keyword searches have a tough time distinguishing between words that are spelled the
same way but mean something different (e.g., hard cider, a hard stone, a hard exam, and
the hard drive on your computer). This often results in hits that are completely irrelevant
to your query. Some search engines also have trouble with so-called stemming--i.e., if
you enter the word "big," should they return a hit on the word, "bigger?" What about
singular and plural words? What about verb tenses that differ from the word you entered
by only an "s," or an "ed"?

Search engines also cannot return hits on keywords that mean the same, but are not
actually entered in your query. A query on heart disease would not return a document that
used the word "cardiac" instead of "heart."

CONCEPT BASED SEARCHING:

Unlike keyword search systems, concept-based search systems try to determine what you
mean, not just what you say. In the best circumstances, a concept-based search returns
hits on documents that are "about" the subject/theme you're exploring, even if the words
in the document don't precisely match the words you enter into the query.

Excite is currently the best-known general-purpose search engine site on the Web that
relies on concept-based searching.

This is also known as clustering -- which essentially means that words are examined in
relation to other words found nearby.

How does it work? There are various methods of building clustering systems, some of
which are highly complex, relying on sophisticated linguistic and artificial intelligence
theory that we won't even attempt to go into here. Excite sticks to a numerical approach.
Excite's software determines meaning by calculating the frequency with which certain
important words appear. When several words or phrases that are tagged to signal a
particular concept appear close to each other in a text, the search engine concludes, by
statistical analysis, that the piece is "about" a certain subject.

For example, the word heart, when used in the medical/health context, would be likely to
appear with such words as coronary, artery, lung, stroke, cholesterol, pump, blood, attack,
and arteriosclerosis. If the word heart appears in a document with others words such as
flowers, candy, love, passion, and valentine, a very different context is established, and
the search engine returns hits on the subject of romance.
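The heart example above can be sketched as a simple co-occurrence count: decide a document's likely context by seeing which concept-signal words appear alongside the ambiguous term. The signal-word lists here are invented, not Excite's actual lexicon.

```python
# Illustrative sketch of concept clustering: score each context by
# how many of its signal words appear in the text. Word lists are toys.
CONTEXTS = {
    "medical": {"coronary", "artery", "stroke", "cholesterol", "blood"},
    "romance": {"flowers", "candy", "love", "passion", "valentine"},
}

def classify(text):
    """Return the context whose signal words overlap the text most."""
    words = set(text.lower().split())
    scores = {name: len(words & signals)
              for name, signals in CONTEXTS.items()}
    return max(scores, key=scores.get)
```

Real clustering systems weigh word proximity and frequency statistically rather than taking a raw overlap count, but the principle is the same: surrounding words establish the concept.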

Warning: This often works better in theory than in practice. Concept-based indexing is a
good idea, but it's far from perfect. The results are best when you enter a lot of words, all
of which roughly refer to the concept you're seeking information about.

Here's an example of a concept-based query. Jump to Excite and enter the phrase "cyber
love and relationships" (don't use the quotation marks). You will get back a lot of
documents about love and romance online, even if they don't contain the precise words in
your query. On the keyword search engines, you will also get hits, but they will be
limited to those that do contain the precise words of your query.

Refining Your Search

Most sites offer two different types of searches--"basic" and "refined." In a "basic"
search, you just enter a keyword without sifting through any pulldown menus of
additional options. Depending on the engine, though, "basic" searches can be quite
complex.

Search refining options differ from one search engine to another, but some of the
possibilities include the ability to search on more than one word, to give more weight to
one search term than you give to another, and to exclude words that might be likely to
muddy the results. You might also be able to search on proper names, on phrases, and on
words that are found within a certain proximity to other search terms.

Some search engines also allow you to specify what form you'd like your results to
appear in, and whether you wish to restrict your search to certain fields on the internet
(i.e., usenet or the Web) or to specific parts of Web documents (i.e., the title or URL).

Many, but not all search engines allow you to use so-called Boolean operators to refine
your search. These are the logical terms AND, OR, NOT, and the so-called proximal
locators, NEAR and FOLLOWED BY.

Boolean AND means that all the terms you specify must appear in the documents, e.g.,
"heart" AND "attack." You might use this if you wanted to exclude common hits that
would be irrelevant to your query.

Boolean OR means that at least one of the terms you specify must appear in the
documents, e.g., bronchitis, acute OR chronic. You might use this if you didn't want to
rule out too much.

Boolean NOT means that at least one of the terms you specify must not appear in the
documents. You might use this if you anticipated results that would be totally off-base,
e.g., nirvana AND Buddhism, NOT Cobain.

Not quite Boolean: + and -. Some search engines use the characters + and - instead of
Boolean operators to include and exclude terms.

NEAR means that the terms you enter should be within a certain number of words of
each other. FOLLOWED BY means that one term must directly follow the other. ADJ,
for adjacent, serves the same function. A search engine that will allow you to search on
phrases uses, essentially, the same method (i.e., determining adjacency of keywords).
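Over an inverted index, the three Boolean operators reduce to set operations: AND is intersection, OR is union, and NOT is set difference. The index entries below are invented for illustration.

```python
# Set-based sketch of Boolean query evaluation over an inverted index.
INDEX = {
    "heart":    {"d1", "d2", "d4"},
    "attack":   {"d2", "d3"},
    "nirvana":  {"d5", "d6"},
    "buddhism": {"d5"},
    "cobain":   {"d6"},
}

def docs(term):
    """Documents containing the term (empty set if unindexed)."""
    return INDEX.get(term, set())

def AND(a, b):
    return docs(a) & docs(b)   # both terms must appear

def OR(a, b):
    return docs(a) | docs(b)   # at least one term must appear

def NOT(hits, term):
    return hits - docs(term)   # exclude documents with the term
```

The nirvana example from above becomes NOT(AND("nirvana", "buddhism"), "cobain"), which keeps the Buddhism page and drops the one about the band.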

Phrases: The ability to query on phrases is very important in a search engine. Those that
allow it usually require that you enclose the phrase in quotation marks, e.g., "space the
final frontier."

Capitalization: This is essential for searching on proper names of people, companies or
products. Unfortunately, many words in English are used both as proper and common
nouns--Bill, bill, Gates, gates, Oracle, oracle, Lotus, lotus, Digital, digital--the list is
endless.

All the search engines have different methods of refining queries. The best way to learn
them is to read the help files on the search engine sites and practice!

RELEVANCY RANKINGS:

Most of the search engines return results with confidence or relevancy rankings. In other
words, they list the hits according to how closely they think the results match the query.
However, these lists often leave users shaking their heads in confusion, since, to the
user, the results often seem completely irrelevant.

Why does this happen? Basically it's because search engine technology has not yet
reached the point where humans and computers understand each other well enough to
communicate clearly.

Most search engines use search term frequency as a primary way of determining whether
a document is relevant. If you're researching diabetes and the word "diabetes" appears
multiple times in a Web document, it's reasonable to assume that the document will
contain useful information. Therefore, a document that repeats the word "diabetes" over
and over is likely to turn up near the top of your list.

If your keyword is a common one, or if it has multiple other meanings, you could end up
with a lot of irrelevant hits. And if your keyword is a subject about which you desire
information, you don't need to see it repeated over and over--it's the information about
that word that you're interested in, not the word itself.

Some search engines consider both the frequency and the positioning of keywords to
determine relevancy, reasoning that if the keywords appear early in the document, or in
the headers, this increases the likelihood that the document is on target. For example,
Lycos ranks hits according to how many times your keywords appear in their indices of
the document and in which fields they appear (i.e., in headers, titles or text). It also takes
into consideration whether the documents that emerge as hits are frequently linked to
other documents on the Web, reasoning that if other folks consider them important, you
should, too.
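The frequency-plus-position idea can be sketched by giving each document field its own weight, so a keyword hit in the title counts for more than one in the body. The field weights and sample documents are invented; they echo the Lycos-style weighting described above rather than reproduce it.

```python
# Sketch of position-weighted scoring: a keyword occurrence counts more
# in the title than in a header, and more in a header than in the body.
WEIGHTS = {"title": 5, "header": 3, "body": 1}

def score(document, keyword):
    """document: dict mapping field name -> text for that field."""
    keyword = keyword.lower()
    total = 0
    for field, text in document.items():
        total += text.lower().split().count(keyword) * WEIGHTS[field]
    return total
```

Under this scheme a page with the keyword once in its title can outrank a page that repeats it twice in the body, which is exactly the "early and prominent placement matters" reasoning described above.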

If you use the advanced query form on AltaVista, you can assign relevance weights to
your query terms before conducting a search. Although this takes some practice, it
essentially allows you to have a stronger say in what results you will get back.

As far as the user is concerned, relevancy ranking is critical, and becomes more so as the
sheer volume of information on the Web grows. Most of us don't have the time to sift
through scores of hits to determine which hyperlinks we should actually explore. The
more clearly relevant the results are, the more we're likely to value the search engine.

HOW SEARCH ENGINES RANK PAGES

Each search engine has its own method of ranking its stored web pages. Algorithms are
processed and calculations are made to determine your web page score for each key
phrase. Your page is then ranked and displayed as a possible match when a user types in
a correlated keyword. The greater your search score, the higher your web page appears in
the search results.
Because each search engine processes information differently, it is possible that altering a
site’s design can increase web ranking in one search engine while decreasing the ranking
in another. Satisfying every aspect of every search engine is an impossible task. Our
design strategy will cover optimization techniques designed to get results with all search
engines, but will concentrate mostly on the major players in the search engine industry,
while including a blanket strategy for smaller engines.
Search engines rank your pages based on multiple factors working together. In order to
achieve a high overall score, we will break down every scoring factor and set our sights
on obtaining a high score for each aspect.
The first order of business is the website domain name (URL). Web pages which utilize
keywords or key phrases within the domain will score additional ranking points.
The title tag is the second feature the spider will come across. The title tag should also
include relevant keywords or key phrases. Title tags are also weighted more heavily than
ordinary text.
Search engines also place significant importance on keyword density. Keyword density is
the number of times keywords are used on a web page divided by the total number of
words on the page. The more keywords used, the higher the density.
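Keyword density as defined here is a one-line calculation:

```python
# Keyword density exactly as defined above: keyword occurrences
# divided by the total number of words on the page.
def keyword_density(page_text, keyword):
    words = page_text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)
```

For example, a 9-word page that uses "martial" three times has a density of 1/3 for that keyword.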

Some spiders also crawl meta tags, which are invisible tags inserted into the HTML code
to inform the search engine of relevancy. Meta keyword tags notify the search engine
which keywords apply to your website, while a meta description is sometimes used as
your site description on the result page.
Search engines also score alt tags, which are words attached to images. Alt tags can be
used to supply even richer content to a website.

Link popularity and link relevancy are the final and possibly the most important features.
Most websites link to other websites; the more links directing traffic to a particular
webpage, the more popular that webpage is. If the inbound link to the page is highly
correlated with the business of the web page, the link relevancy score will be higher. Link
popularity and link relevancy are off-the-page factors which you cannot design, and can
only be achieved through other websites.
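One simple way to picture link popularity and link relevancy together is to count inbound links, weighting a link more heavily when the linking page shares the target's topic. The link graph, topics and weights below are all invented; real engines use far more sophisticated link analysis.

```python
# Sketch of off-the-page link scoring: count inbound links to a page,
# giving extra weight to links from pages on the same topic.
LINKS = [
    # (from_page, from_topic, to_page, to_topic)
    ("a.example", "karate",  "dojo.example", "karate"),
    ("b.example", "karate",  "dojo.example", "karate"),
    ("c.example", "cooking", "dojo.example", "karate"),
]

def link_score(page, on_topic_weight=2, off_topic_weight=1):
    """Sum weighted inbound links pointing at the given page."""
    total = 0
    for _, from_topic, to_page, to_topic in LINKS:
        if to_page == page:
            total += on_topic_weight if from_topic == to_topic else off_topic_weight
    return total
```

Note that, as the text says, this score depends entirely on other sites' pages: nothing in the target page's own design changes it.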

INFORMATION ON META TAGS:

Some search engines are now indexing Web documents by the meta tags in the
documents' HTML (at the beginning of the document in the so-called "head" tag). What
this means is that the Web page author can have some influence over which keywords are
used to index the document, and even in the description of the document that appears
when it comes up as a search engine hit.

This is obviously very important if you are trying to draw people to your website based
on how your site ranks in search engine hit lists.

There is no perfect way to ensure that you'll receive a high ranking. Even if you do get a
great ranking, there's no assurance that you'll keep it for long. There is a lot of
conflicting information out there on meta-tagging. If you're confused it may be because
different search engines look at meta tags in different ways. Some rely heavily on meta
tags, others don't use them at all.

It seems to be generally agreed that the "title" and the "description" meta tags are
important to write effectively, since several major search engines use them in their
indices. Use relevant keywords in your title, and vary the titles on the different pages
that make up your website, in order to target as many keywords as possible. As for the

"description" meta tag, some search engines will use it as their short summary of your
URL, so make sure your description is one that will entice surfers to your site.

The "keyword" meta tag, which is essentially made up of a list of keywords that
(supposedly) appear in the document, has been abused by some webmasters. For
example, a recent ploy has been to put the words "Pamela Anderson" into keyword meta
tags, in hopes of luring searchers to one's website by using the keywords for one of the
most popular searches on the Web.

The search engines are aware of such deceptive tactics, and have devised various
methods to circumvent them, so be careful. Use keywords that are appropriate to your
subject, and make sure they appear in the top paragraphs of actual text on your webpage.
Many search engine algorithms score the words that appear towards the top of your
document more highly than the words that appear towards the bottom. Words that appear
in HTML header tags (H1, H2, H3, etc) are also given more weight by some search
engines. It sometimes helps to give your page a file name that makes use of one of your
prime keywords, and to include keywords in the "alt" image tags.
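A toy scoring function along these lines might weight earlier occurrences more heavily and add a bonus for keywords found in header text. The weights are illustrative guesses, not any engine's real values:

```python
def positional_score(words, keyword, header_words=()):
    """Score a keyword more highly the nearer it appears to the top of the
    document, with a bonus for appearing inside header (H1-H3) text."""
    score = 0.0
    total = len(words)
    for position, word in enumerate(words):
        if word == keyword:
            score += 1.0 - position / total  # earlier occurrences weigh more
    if keyword in header_words:
        score += 2.0                        # flat bonus for header placement
    return score

# The first "search" (top of page) contributes more than the last one.
words = "search engine basics ranking search".split()
print(positional_score(words, "search"))
```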

One thing you should not do is use some other company's trademarks in your meta tags.
Some website owners have been sued for trademark violations because they've used
other company names in the meta tags.

Remember that all the major search engines have slightly different policies. If you're
designing a website and meta-tagging your documents, we recommend that you take the
time to check out what the major search engines say in their help files about how they
each use meta tags. You might want to optimize your meta tags for the search engines
you believe are sending the most traffic to your site.

What are "meta-search" engines?

In a meta-search engine, you submit keywords in its search box, and it transmits your
search simultaneously to several individual search engines and their databases of web
pages. Within a few seconds, you get back results from all the search engines queried.

Meta-search engines do not own a database of Web pages; they send your search terms to
the databases maintained by other search engines.
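The fan-out-and-merge idea can be sketched as follows. The two "engines" are stub functions returning canned result lists, standing in for the real HTTP queries a meta-searcher would issue:

```python
# Stand-in fetchers for two engines (hypothetical URLs and results).
def search_engine_a(query):
    return ["page1.com", "page2.com"]

def search_engine_b(query):
    return ["page2.com", "page3.com"]

def meta_search(query, engines):
    """Fan the query out to each engine and merge results,
    deduplicating while keeping first-seen order."""
    seen, merged = set(), []
    for engine in engines:
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

print(meta_search("carpal tunnel", [search_engine_a, search_engine_b]))
# ['page1.com', 'page2.com', 'page3.com']
```

Real meta-searchers must also decide how to interleave or re-rank the merged list; simple first-seen order, as here, favors whichever engine is queried first.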

In ordinary (non-"meta") search engines such as Northern Light, AltaVista, Google, etc.,
you submit keywords to their individual database of web-pages, and you get back a
different display of documents from each search engine. Results from submitting
comparable searches can differ widely (about 40% of hits), but also contain some of the
same sites (about 60%).

Some meta-search engine sites offer many useful secondary, portal-like services and
specialized collections of web-sites and/or resources for businesses, web designers,
movie-goers, etc. Others offer what I call "pseudo-meta-searching" -- a collection of
search boxes for different search engines or a drop-down menu that lets you choose
which one among a list of search engines to search. Neither of these types of services is
commented on here. Pseudo-meta-searchers are, in fact, excluded from the table below
(see criteria), because they resemble collections of searchable databases more than
meta-searchers.

Limitations of Meta-Search engines

How do you know if your search terms will "work"? As anyone who does Internet
searching knows, search protocol (the way you enter search keywords) is far from
standardized. Almost all accept " " as denoting a phrase. A few accept Boolean AND, OR,
and NOT. Fewer accept ( ) to group terms. Some only accept + or -. Some default to OR,
some to AND. Some take * to truncate. Others stem automatically. And so on.
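A meta-searcher therefore has to translate one query into each engine's dialect. A minimal sketch, with three illustrative syntax styles (the engine labels are invented, not real engine names):

```python
def to_engine_syntax(terms, engine):
    """Rewrite a list of required terms in a given engine's query syntax."""
    if engine == "boolean":          # explicit Boolean AND
        return " AND ".join(terms)
    if engine == "plus-minus":       # '+' marks required terms
        return " ".join("+" + t for t in terms)
    if engine == "default-and":      # engine ANDs the terms implicitly
        return " ".join(terms)
    raise ValueError("unknown engine style: " + engine)

print(to_engine_syntax(["carpal", "tunnel"], "plus-minus"))  # +carpal +tunnel
```

A complete translator would also handle phrases, OR/NOT, grouping, and truncation, dropping whatever a target engine cannot express; as the text notes, that is exactly where complex queries lose meaning.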

Three main factors determine the usefulness of any meta-search engine (see Table
below):

1. The search engines they send your search terms to (size, content, number of
search engines, your ability to choose the search engines you prefer); all of them
search subject directories as well as search engines and intermix results from all.

2. How they handle your search terms and search syntax (Boolean operators,
phrases, and defaults they impose);

3. How they display results (ranking; aggregated into one list, or with each search
engine's results reported separately).

Good for simple searches. Meta-Search engines are useful if you are looking for a
unique term or phrase (enclose phrases in quotes " "); or if you simply want to test run a
couple of keywords to see if they get what you want. For such straightforward searches,
the unique ranking algorithm used by Google (based on how many other sites link to a
site) often finds exactly what you want, better than any meta-search engine (unless you
choose one you can limit to Google only).

For more difficult searches, you can search within results on a term or phrase you specify.

Use meta-search engines -- but use them CAUTIOUSLY:

- Most meta-search engines only spend a short time in each database and often
retrieve only about 10% of the results from any of the databases queried. This
makes their searches usually "quick and dirty," but often good enough to find
what you want.

- Most meta-searchers simply pass your search terms along, and if your search
contains more than one or two words or very complex logic, most of that will be
lost. It will only make sense to the few search engines that support such logic
(see table of general search engine features).

- Quantity in results does not equal satisfaction. If you get more results than you
want, try refining the results by going directly to AltaVista Advanced Search,
Northern Light, or Infoseek by clicking on their link in the results. Choose meta-
search engines that offer some of these as options.

- Look for meta-search engines that also send your terms to selective or odd
databases like WebCrawler, Thunderstone, Direct Hit, and WhatUSeek. One of
the advantages of a meta-searcher is that it covers databases like these, which you
might otherwise overlook and which may have sites missed by the big boys.

IN-DEPTH ANALYSIS OF POPULAR SEARCH ENGINES:

ALTA VISTA

Alta Vista is a fast, powerful search engine with enough bells and whistles to do an
extremely complex search, but first you have to master all its options. If you're serious
about Web searching, however, mastering Alta Vista is a wise policy.

Type of search: Keyword

Search options: Simple or Advanced search, search refining.

Domains searched: Web, Usenet

Search refining: Boolean "AND," "OR" and "NOT," plus the proximal locator "NEAR."
Allows wildcards and "backwards" searching (i.e., you can find all the other web sites
that link to another page). You can decide how search terms should be weighed, and
where in the document to look for them. Powerful search refining tools, and the more
refining you do, the better your results are.

Relevance ranking: Ranks according to how many of your search terms a page contains,
where in the document they appear, and how close to one another the search terms are.

Results presented as: First several lines of document. "Detailed" summaries don't appear
any more detailed than "standard" ones.

User interface: Reasonably good, but not very friendly to the casual user. Advanced
query now allows you to further refine your search at the end of each results page. You
can also visit specialized zones or channels in areas like finance, travel, news.

Help files: Complete, but confusing. Too much thrown at you at once. More clarity and
more explanation of options would be appreciated!

Good points: Fast searches, capitalization and proper nouns recognized, largest database;
finds things others don't. Alta Vista searches both the Web and Usenet. It will search on
both words and on phrases, including names and titles. You can even search to discover
how many people have linked their site to yours. You can also have the resulting pages
of your searches translated into several other languages.

Bad points: Multiple pages from the same site show up too frequently; some curious
relevancy rankings, especially on Simple search.

EXCITE:

Excite bills itself as the "intelligent" search engine because of its concept-based indexing.
While "intelligent" is an exaggeration (the apparent intelligence comes from the clever
use of statistics, not from a sudden advance in artificial intelligence), Excite is one of our
favorite search tools.

Type of search: Both concept and keyword

Domains searched: Web, Usenet and classified ads

Search refining: Suggests you use more words, repeating key choices several times. Uses
a fuzzy AND, which searches AND and OR, giving preference to AND. Has recently
added Boolean operators to aid in search refining--AND, OR, AND NOT, and the
characters + and -.

Relevance ranking: Confidence percentile provided on all searches, derivation unclear.

Results presented as: Summaries; will also sort them by site. By clicking on an icon
beside each summary, you will get a cross-reference of similar sites.

User interface: Generally good, nothing exciting.

Help files: Very good, including a handbook that explains the site, the Web, the
software, and how best to use their site.

Good points: Large index. Not quite as up-to-date as it used to be. Excellent summaries,
which they admit are actually highlights--the top few most important sentences in the
document. You can view your hits in various ways, too--grouped by confidence or
grouped by Web site.

Bad points: Does not specify the format or the size in megabytes of the hits it returns,
nor does it tell you upfront exactly how many hits there are.

INFOSEEK :

Type of search: Keyword

Search options: Simple, but powerful (see comments below). Infoseek now uses the
Ultraseek engine, which really zips along. The site has added an extensive catalogue
section for subject-oriented searching. You can also cross-reference your search terms
with similar catalogue subject items and searches come back with subjects automatically
appended. You can also search images, which has suddenly become popular.

Domains searched: Web, Usenet, Usenet FAQs, Reviews, Topics.

Search refining: Phrases, capitalization, no Boolean operators, but uses + and - instead
(similar to AND and NOT).

Relevance ranking: Gives numerical scores based on frequency and comparison to words
already in their database.

Results presented as: First 30-100 words of the page

User interface: Good, easy to use, clear. Infoseek is also now allowing free searches of
some of its extensive databases (stock quotes, company information, e-mail addresses,
various reference works like dictionaries and zip code directories).

Help files: Good, useful.

Good points: Fast, flexible, reliable searching. Good output, which gives the URL, the
size of the document and the relevancy score. Allows you to see similar pages (based on
topic information about the pages). Full-text indexing, allows capital letters and phrases.

Bad points: We're sure Infoseek has some bad points, but we really can't think of any
offhand!

LYCOS

Type of search: Keyword, but Lycos is gradually becoming less of a search engine, it
seems, and more of a Yahoo-like subject index. Has recently had a cool graphical
facelift. Proud of its ability to search on image and sound files.

Search options: Basic or Advanced

Domains searched: Web, Usenet, News, Stocks, Weather, Multimedia.

Search refining: Lycos now has full Boolean capabilities (using choices on drop-down
forms).

Relevance ranking: Lycos no longer provides a relevancy ranking.

Results presented as: First 100 or so words in simple search, you choose in advanced
search--summary, full results or short version.

User interface: Clean, clear, focuses more on directory now than on simple search.

Help files: Good, informative, graphical help screens are easy to understand.

Good points: Large database. Comprehensive results given--i.e., the date of the
document, its size, etc. Lycos indexes the frequency with which documents are linked to
by other documents to make sure the most popular web sites are found and indexed
before the less popular ones.

WEBCRAWLER

Type of search: Keyword

Search options: Simple, refined

Domains searched: Web, Usenet

Search refining: Uses either "and" or "any." Webcrawler has added full Boolean search
term capability, including AND, OR, AND NOT, ADJ (adjacent), and NEAR.

Relevance ranking: Yes--frequency calculated--computes the total number of times your
keywords appear in the document and divides it by the total number of words in the
document. Webcrawler returns surprisingly relevant results.

Results presented as: Lists of hyperlinks or summaries, as the user chooses.

User interface: Good--easy and fun to use

Help files: Useful tips and FAQ.

Good points: Easy to use. Popular on the Web because it belongs to AOL and there are a
lot of websurfers who sign on from AOL. Publishes usage statistics on their site. Also
provides a service by which you can check to see whether a particular URL is in their
index, and, if so, when it was last visited by their "spider." There is also some fascinating
information about how Webcrawler's search strategy works.

Bad points: Speed seems to be slowing down a little recently. Its previous weakness--no
way to refine search--has been eliminated with the addition of Boolean operators.

HOTBOT

Type of search: Keyword

Search options: Simple, Modified, Expert

Domains searched: Web

Search refining: Multiple types, including by phrase, person and Boolean-like choices in
pull-down boxes. No proximal operators at present. In Expert searches you can search by
date and even by different media types (Java, Javascript, Shockwave, VRML, etc.).

Relevance ranking: Yes. Methods used--search terms in the title are ranked higher than
search terms in the text. Frequency also counts, and will result in higher rankings when
search terms appear more frequently in short documents than when they appear
frequently in very long documents. (This sounds sensible and useful.)
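That ranking method (title matches boosted, frequency normalized by document length) can be sketched as follows; the weights are illustrative, not HotBot's actual values:

```python
def hotbot_like_score(title_words, body_words, keyword):
    """Length-normalized keyword frequency plus a flat bonus for title matches."""
    body_hits = body_words.count(keyword)
    frequency = body_hits / len(body_words) if body_words else 0.0
    title_boost = 2.0 if keyword in title_words else 0.0
    return frequency + title_boost

short_doc = ["search", "tips"]
long_doc = ["search"] + ["filler"] * 99

# The same single hit scores higher in the short document than in the long one.
print(hotbot_like_score([], short_doc, "search") > hotbot_like_score([], long_doc, "search"))  # True
```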

Results presented as: Relevancy score and URL

User interface: Very cool and lively. Some users have complained about the bright green
background, but we kinda like it.

Help files: A FAQ that answers users' questions, but not a lot of serious help files.

Good points: Claims to be fast because of the use of parallel processing, which distributes
the load of queries as well as the database over several workstations.

Bad points: Some limitations still on Boolean operators, and the help files still aren't very
good.

YAHOO

Although not precisely a search engine site, Yahoo is an important Web resource. It
works as a hierarchical subject index, allowing you to drill down from the general to the
specific. Yahoo is an attempt to organize and catalogue the Web.

Yahoo also has search capabilities. You can search the Yahoo index (note: when you do
this you are not searching the entire Web). If your query gets no hits in this manner,
Yahoo offers you the option of searching Alta Vista, which does search the entire
Web.

Yahoo will also automatically feed your query into the other major search engine sites if
you so desire. Thus, Yahoo has the capacity to act as a kind of meta-search engine.

Type of search: Keyword

Search options: Simple, Advanced

Domains searched: Yahoo's index, Usenet, E-mail addresses. Yahoo searches titles,
URLs and the brief comments or descriptions of the Web sites Yahoo indexes.

Search refining: Boolean AND and OR. Yahoo is case insensitive.

Relevance ranking: Since Yahoo returns relatively few hits (it will never return more
than 100), it's not clear how results are ranked.

Results presented as: Yahoo tells you the category where a hit is found, then gives you a
two-line description of the site.

User interface: Excellent, easy-to-use

Help files: Not very complete, but since there aren't a lot of search options, detailed help
files are not necessary.

Good points: Easy-to-navigate subject catalogue. If you know what you want to find,
Yahoo should be your first stop on the Web.

Bad points: Only a small portion of the Web has actually been catalogued by Yahoo.


CONCLUSION:

Though there are many search engines available on the web, searching methods and
the engines themselves still have a long way to go for efficient retrieval of information on
relevant topics. As technology advances at an unimaginable pace, it is not unreasonable
to expect an efficient search engine that addresses all these needs.

Indexing the entire web and building one huge integrated index will further
deteriorate retrieval effectiveness, since the web is growing at an exponential rate. On
the other hand, a collection of web indexes, each with its own specialized search tool, is
quite promising. Under this scheme, each web index is targeted to comprehensively
represent documents of a specific information space. Information spaces are bounded by,
for example, academic disciplines, a class of industries, and a group of services. The
commonality in the subject matter indexed supports the capture of semantic level
features. It also supports the incorporation of domain semantics into the indexing process.
The search tool for such an index can also be specialized for the information space.

The current generation of search tools and services has to significantly improve
its retrieval effectiveness. Otherwise, the web will continue to evolve towards an
information entertainment center for users with no specific search objectives.

Choosing the right search engine takes patience and experience. Use meta-search
engines; they can reduce your search effort to a great extent. The good news is that new
search engines are evolving every day to improve retrieval efficiency.

