
A Seminar Report On

Working of web search engine


Submitted in partial fulfillment of the requirement for the award of the degree of Bachelor of Engineering in Information Technology

Submitted by: Sachin Sharma, B.E. Final Year

Guide: Dr. K.R. Chowdhary Professor, CSE Dept.

Department of Computer Science and Engineering
M.B.M. Engineering College, Faculty of Engineering
Jai Narain Vyas University, Jodhpur (Rajasthan) 342001
Session 2008-09


CANDIDATE'S DECLARATION

I hereby declare that the work being presented in the seminar entitled "Working of Web Search Engine", in partial fulfillment of the requirement for the award of the degree of Bachelor of Engineering in Information Technology, submitted in the Department of Computer Science and Engineering, M.B.M. Engineering College, Jodhpur (Rajasthan), is an authentic record of my own work carried out during the period from February 2009 to May 2009, under the supervision of Dr. K.R. Chowdhary, Professor, Department of Computer Science and Engineering, M.B.M. Engineering College, Jodhpur (Rajasthan). The matter embodied in this project has not been submitted by me for the award of any other degree. I also declare that the matter of the seminar is not reproduced as it is from any source.

Date:
Place: Jodhpur

(SACHIN SHARMA)

CERTIFICATE

This is to certify that the above statement made by the candidate is correct to the best of my knowledge.

Dr. K.R. Chowdhary
Professor
Department of Computer Science and Engineering
M.B.M. Engineering College, Jodhpur (Rajasthan) 342001

Contents

1. Introduction
2. Types of search engine
3. General system architecture of web search engine
   3.1. Web crawling
      3.1.1. Types of crawling
         3.1.1.1. Focused crawling
         3.1.1.2. Distributed crawling
      3.1.2. Robot exclusion protocol
      3.1.3. Resource constraints
   3.2. Web indexing
      3.2.1. Index design factors
      3.2.2. Index data structures
      3.2.3. Types of indexing
         3.2.3.1. Inverted index
         3.2.3.2. Forward index
      3.2.4. Latent Semantic Indexing (LSI)
         3.2.4.1. What is LSI
         3.2.4.2. How LSI works
         3.2.4.3. Singular Value Decomposition (SVD)
         3.2.4.4. Stemming
         3.2.4.5. The term-document matrix
      3.2.5. Challenges in parallelism
4. Meta search engine
5. Search engine optimization
   5.1. Page Rank
   5.2. The ranking algorithm simplified
   5.3. Damping factor
   5.4. Uses of Page Rank
6. Marketing of search engines
7. Summary

Abstract
Exploring the content of web pages for automatic indexing is of fundamental importance for efficient e-commerce and other applications of the Web. It enables users, including customers and businesses, to locate the best sources for their use. Today's search engines use one of two approaches to indexing web pages. They either:
1. analyze the frequency of the words (after filtering out common or meaningless words) appearing in the entire text, or in a part of it (typically a title, an abstract or the first 300 words), of the target web page, or
2. use sophisticated algorithms that take into account associations of words in the indexed web page.
In both cases only words appearing in the web page in question are used in the analysis. Often, to increase the relevance of the selected terms to potential searches, the indexing is refined by human processing.


To identify so-called authority or expert pages, some search engines use the structure of the links between pages to identify pages that are often referenced by other pages. The approach used in the Google search engine implementation assigns each page a score that depends on the frequency with which the page is visited by web surfers.

1. Introduction
A search engine is an information retrieval system designed to help find information stored on a computer system. Search engines help to minimize the time required to find information and the amount of information which must be consulted, akin to other techniques for managing information overload. The most public, visible form of a search engine is a Web search engine, which searches for information on the World Wide Web. Engineering a web search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been conducted on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from what it was three years ago.

There are differences in the ways various search engines work, but they all perform three basic tasks:
1. They search the Internet, or select pieces of the Internet, based on important words.
2. They keep an index of the words they find, and where they find them.
3. They allow users to look for words or combinations of words found in that index.
The most important measures for a search engine are the search performance, the quality of the results, and the ability to crawl and index the web efficiently. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Some of the efficient and recommended search engines are Google, Yahoo and Teoma, which share some common features and are standardized to some extent.


2. Types of search engine


Search engines provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. The criteria are referred to as a search query. In the case of text search engines, the search query is typically expressed as a set of words that identify the desired concept that one or more documents may contain. There are several styles of search query syntax that vary in strictness. Whereas some text search engines require users to enter two or three words separated by white space, other search engines may enable users to specify entire documents, pictures, sounds, and various forms of natural language. Some search engines apply improvements to search queries to increase the likelihood of providing a quality set of items through a process known as query expansion.

3. General system architecture of web search engine


This section provides an overview of how the whole system of a search engine works. The major functions of a search engine (crawling, indexing and searching) are covered in detail in the later sub-sections. Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a typical search engine employs special software robots, called spiders, to build lists of the words found on Websites. When a spider is building its lists, the process is called Web crawling. A Web crawler is a program which automatically traverses the web by downloading documents and following links from page to page. Crawlers are mainly used by web search engines to gather data for indexing; other possible applications include page validation, structural analysis and visualization, update notification, mirroring, and personal web assistants/agents. Web crawlers are also known as spiders, robots, worms etc. Crawlers are automated programs that follow the links found on web pages. There is a URL Server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the store server. The store server then compresses and stores the web pages into a repository. Every web page has an associated ID number called a doc ID, which is assigned whenever a new URL is parsed out of a web page. The indexer and the sorter perform the indexing function. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The URL Resolver reads the anchors file and converts relative URLs into absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links, which are pairs of doc IDs. The links database is used to compute Page Ranks for all the documents. The sorter takes the barrels, which are sorted by doc ID, and resorts them by word ID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of word IDs and offsets into the inverted index. A program called Dump Lexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. A lexicon lists all the terms occurring in the index along with some term-level statistics (e.g., the total number of documents in which a term occurs) that are used by the ranking algorithms. The searcher is run by a web server and uses the lexicon built by Dump Lexicon together with the inverted index and the Page Ranks to answer queries.
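As a rough illustration of the "hits" described above, the following Python sketch (not Google's actual code; the record layout is simplified and font size is omitted) shows how a document might be parsed into word occurrences carrying a word ID, a position and a capitalization flag:

from collections import namedtuple
import re

# A "hit" records one word occurrence: which word, in which document,
# at which position, and whether it was capitalized.
Hit = namedtuple("Hit", ["word_id", "doc_id", "position", "capitalized"])

lexicon = {}  # word -> word ID, grown as new words are parsed

def word_id(word):
    return lexicon.setdefault(word.lower(), len(lexicon))

def parse_document(doc_id, text):
    """Convert one document into its list of hits (a forward-index entry)."""
    return [Hit(word_id(w), doc_id, pos, w[0].isupper())
            for pos, w in enumerate(re.findall(r"\w+", text))]

print(parse_document(1, "The cow says Moo"))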


3.1. Web crawling


Web crawlers are an essential component of search engines, and running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, which are all beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites that are to be crawled. Especially when crawling sites spread across multiple servers, the total crawling time can be significantly reduced if many downloads are done in parallel. Despite the numerous applications for web crawlers, at the core they are all fundamentally the same. The following is the process by which web crawlers work:
1. Download the web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.
The web crawler can be used for crawling through a whole site on the Inter-/Intranet. You specify a start-URL and the crawler follows all links found in that HTML page. This usually leads to more links, which will be followed again, and so on. A site can be seen as a tree structure: the root is the start-URL, all links in that root HTML page are direct sons of the root, and subsequent links are then sons of the previous sons.
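The following is a minimal Python sketch of this download-parse-follow loop, using only the standard library. A production crawler would add politeness delays, robots.txt checks, error handling and parallel downloads; the start URL below is only a placeholder.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                      # skip pages that fail to download
        pages[url] = page                 # keep the raw page, like a store server
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:         # follow the links found on the page
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# pages = crawl("http://example.com/")   # hypothetical start URL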

3.1.1. Types of crawling
Crawlers are basically of two types.


3.1.1.1. Focused crawling
A general-purpose web crawler gathers as many pages as it can from a particular set of URLs, whereas a focused crawler is designed to gather documents only on a specific topic, thus reducing the amount of network traffic and downloads. The goal of the focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. The focused crawler has three main components: a classifier, which makes relevance judgments on crawled pages to decide on link expansion; a distiller, which determines a measure of centrality of crawled pages to determine visit priorities; and a crawler with dynamically reconfigurable priority controls, which is governed by the classifier and distiller. The most crucial evaluation of focused crawling is to measure the harvest ratio, which is the rate at which relevant pages are acquired and irrelevant pages are effectively filtered out of the crawl. This harvest ratio must be high; otherwise the focused crawler would spend a lot of time merely eliminating irrelevant pages, and it might be better to use an ordinary crawler instead.
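A rough sketch of the priority-driven fetch loop of such a focused crawler is shown below. The relevance_score function is only a stand-in for the classifier, and the fetch and extract_links callables are assumed to be supplied by the ordinary crawling machinery described earlier; none of these names come from a real system.

import heapq

def relevance_score(page_text, topic_words):
    """Toy classifier: fraction of topic words that occur in the page."""
    words = set(page_text.lower().split())
    return sum(1 for t in topic_words if t in words) / len(topic_words)

def focused_crawl(start_urls, fetch, extract_links, topic_words, max_pages=50):
    frontier = [(-1.0, url) for url in start_urls]   # max-heap via negation
    heapq.heapify(frontier)
    seen, relevant, fetched = set(start_urls), 0, 0
    while frontier and fetched < max_pages:
        priority, url = heapq.heappop(frontier)
        text = fetch(url)
        fetched += 1
        score = relevance_score(text, topic_words)
        if score > 0.2:                  # arbitrary relevance threshold
            relevant += 1
        for link in extract_links(url, text):
            if link not in seen:
                seen.add(link)
                # links inherit the parent's score as their visit priority
                heapq.heappush(frontier, (-score, link))
    harvest_ratio = relevant / max(fetched, 1)
    return harvest_ratio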

3.1.1.2. Distributed crawling
Indexing the web is a challenge due to its growing and dynamic nature. As the size of the web grows, it has become imperative to parallelize the crawling process in order to finish downloading the pages in a reasonable amount of time. A single crawling process, even if multithreading is used, will be insufficient for large-scale engines that need to fetch large amounts of data rapidly. When a single centralized crawler is used, all the fetched data passes through a single physical link. Distributing the crawling activity via multiple processes can help build a scalable, easily configurable and fault-tolerant system. Splitting the load decreases hardware requirements and at the same time increases the overall download speed and reliability. Each task is performed in a fully distributed fashion, that is, no central coordinator exists.
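One common way to achieve this without a central coordinator is to partition the URL space among the crawler processes, for example by hashing each URL's host name, as in the following sketch (the number of crawler processes and the URLs are illustrative):

import hashlib
from urllib.parse import urlparse

def assigned_crawler(url, num_crawlers):
    """Map a URL to one of num_crawlers processes by hashing its host name."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

urls = [
    "http://example.com/a", "http://example.com/b",   # same host, same crawler
    "http://somehost.com/cgi-bin/register",
]
for u in urls:
    print(u, "->", "crawler", assigned_crawler(u, num_crawlers=4))

Hashing on the host name also keeps all pages of one site with one crawler process, which makes it easier to enforce per-site politeness rules.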

3.1.2. Robot exclusion protocol


Web sites also often have restricted areas that crawlers should not crawl. To address these concerns, many Web sites adopted the Robot protocol, which establishes guidelines that crawlers should follow. Over time, the protocol has become the unwritten law of the Internet for Web crawlers. The Robot protocol specifies that Web sites wishing to restrict certain areas or pages from crawling have a file called robots.txt placed at the root of the Web site. The ethical crawlers will then skip the disallowed areas. Following is an example robots.txt file and an explanation of its format:

# Robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration    # Disallow robots on registration page
Disallow: /login

The first line of the sample file contains a comment, as denoted by the use of a hash (#) character. Crawlers reading robots.txt files should ignore any comments. The second line of the sample file specifies the User-agent to which the Disallow rules following it apply. User-agent is a term used for the programs that access a web site. Each browser has a unique User-agent value that it sends along with each request to a web server. However, web sites typically want to disallow all robots (or User-agents) access to certain areas, so they use a value of asterisk (*) for the User-agent. This specifies that all User-agents are disallowed for the rules that follow it. The lines following the User-agent line are called disallow statements. The disallow statements define the web site paths that crawlers are not allowed to access. For example, the first disallow statement in the sample file tells crawlers not to crawl any links that begin with /cgi-bin/. Thus, the following URLs are both off limits to crawlers according to that line:

http://somehost.com/cgi-bin
http://somehost.com/cgi-bin/register
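An ethical crawler can check these rules with Python's standard urllib.robotparser module, as in the following sketch (somehost.com is the hypothetical host from the example above):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://somehost.com/robots.txt")
rp.read()   # downloads and parses the robots.txt file

for url in ("http://somehost.com/cgi-bin/register",
            "http://somehost.com/index.html"):
    # can_fetch returns False for paths matched by a Disallow rule
    print(url, "allowed:", rp.can_fetch("*", url))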

3.1.3. Resource Constraints


Crawlers consume resources: network bandwidth to download pages, memory to maintain private data structures in support of their algorithms, CPU to evaluate and select URLs, and disk storage to store the text and links of fetched pages as well as other persistent data.

3.2. Web Indexing


Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing. The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

3.2.1. Index design factors


Major factors in designing a search engine's architecture include:

Merge factors: How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.

Storage techniques: How to store the index data, that is, whether information should be data compressed or filtered.

Index size: How much computer storage is required to support the index.

Lookup speed: How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.

Maintenance: How the index is maintained over time.

Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, schemes such as hash-based or composite partitioning, as well as replication.

3.2.2. Index data structures


Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include:

Suffix tree: Figuratively structured like a tree, it supports linear-time lookup and is built by storing the suffixes of words. The suffix tree is a type of trie. Tries support extendable hashing, which is important for search engine indexing. Suffix trees are used for searching for patterns in DNA sequences and for clustering. A major drawback is that storing a word in the tree may require more storage than storing the word itself. An alternate representation is a suffix array, which is considered to require less virtual memory and supports data compression such as the BWT algorithm.

Trie: An ordered tree data structure that is used to store an associative array where the keys are strings. Regarded as faster than a hash table but less space-efficient.

Inverted index: Stores a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.

Citation index: Stores citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.

N-gram index: Stores sequences of length n of data to support other types of retrieval or text mining.

Term-document matrix: Used in latent semantic analysis, it stores the occurrences of words in documents in a two-dimensional sparse matrix.


3.2.3. Types of indexing: Indexing is basically of two types.


3.2.3.1. Inverted Index: Many search engines incorporate an inverted index when evaluating a search query to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

Word    Documents
the     Doc1, Doc3, Doc4, Doc5
cow     Doc2, Doc3, Doc4
says    Doc5
moo     Doc7

This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a Boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval. The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two-dimensional array. The index is similar to the term-document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table. The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives. After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time-consuming, so this process is commonly split into two parts: the development of a forward index, and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.

3.2.3.2. Forward Index: The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Document    Words
Doc1        the, cow, says, moo
Doc2        the, cat, and, the, hat
Doc3        the, dish, ran, away, with, the, spoon


The rationale behind developing a forward index is that, as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it into an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
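The following small Python sketch, using the toy documents from the tables above, shows the forward index and the inversion step:

documents = {
    "Doc1": "the cow says moo",
    "Doc2": "the cat and the hat",
    "Doc3": "the dish ran away with the spoon",
}

# Forward index: document -> list of the words it contains.
forward_index = {doc: text.split() for doc, text in documents.items()}

# Invert: list every (word, document) pair, sort by the word, and group the
# documents under each word to obtain the inverted index.
inverted_index = {}
for word, doc in sorted((w, d) for d, ws in forward_index.items() for w in ws):
    inverted_index.setdefault(word, [])
    if doc not in inverted_index[word]:
        inverted_index[word].append(doc)

print(inverted_index["the"])   # ['Doc1', 'Doc2', 'Doc3']
print(inverted_index["moo"])   # ['Doc1']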

3.2.4. Latent Semantic Indexing (LSI)


3.2.4.1. What is LSI: Regular keyword searches approach a document collection with a kind of accountant mentality: a document contains a given word or it doesn't, without any middle ground. We create a result set by looking through each document in turn for certain keywords and phrases, tossing aside any documents that don't contain them, and ordering the rest based on some ranking system. Each document stands alone in judgement before the search algorithm - there is no interdependence of any kind between documents, which are evaluated solely on their contents. Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent. When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not

share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don't contain the keyword at all. To use an earlier example, let's say we use LSI to index our collection of mathematical articles. If the words n-dimensional, manifold and topology appear together in enough articles, the search algorithm will notice that the three terms are semantically close. A search for n-dimensional manifolds will therefore return a set of articles containing that phrase (the same result we would get with a regular search), but also articles that contain just the word topology. The search engine understands nothing about mathematics, but examining a sufficient number of documents teaches it that the three terms are related. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.

3.2.4.2. How LSI Works: Natural language is full of redundancies, and not every word that appears in a document carries semantic meaning. In fact, the most frequently used words in English are words that don't carry content at all: functional words, conjunctions, prepositions, auxiliary verbs and others. The first step in doing LSI is culling all those extraneous words from a document, leaving only content words likely to have semantic meaning. There are many ways to define a content word - here is one recipe for generating a list of content words from a document collection:
Make a complete list of all the words that appear anywhere in the collection
Discard articles, prepositions, and conjunctions
Discard common verbs (know, see, do, be)
Discard pronouns
Discard common adjectives (big, late, high)
Discard frilly words (therefore, thus, however, albeit, etc.)
Discard any words that appear in every document
Discard any words that appear in only one document
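A minimal Python sketch of this filtering recipe is shown below; the stop list is only a tiny illustrative sample, not the full list a real collection would need, and the toy documents are invented for the example.

STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "with", "that",
    "is", "are", "be", "do", "know", "see", "it", "he", "she", "they",
    "therefore", "thus", "however",
}

def content_words(text):
    """Return the lower-cased words of a document minus stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS and w.isalpha()]

def build_vocabulary(documents):
    """Keep words that appear in more than one document but not in all of them."""
    doc_sets = [set(content_words(d)) for d in documents]
    vocabulary = set().union(*doc_sets)
    keep = []
    for word in vocabulary:
        count = sum(1 for s in doc_sets if word in s)
        if 1 < count < len(doc_sets):
            keep.append(word)
    return sorted(keep)

docs = ["stars and planets shine in the sky",
        "the sun is the closest star to earth",
        "planets orbit the sun"]
print(build_vocabulary(docs))   # ['planets', 'sun']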

General system architecture of search engine

This process condenses our documents into sets of content words that we can then use to index our collection. Using our list of content words and documents, we can now generate a term-document matrix. This is a fancy name for a very large grid, with documents listed along the horizontal axis, and content words along the vertical axis. For each content word in our list, we go across the appropriate row and put an 'X' in the column for any document where that word appears. If the word does not appear, we leave that column blank. Doing this for every word and document in our collection gives us a mostly empty grid with a sparse scattering of X-es. This grid displays everything that we know about our document collection. We can list all the content words in any given document by looking for X-es in the appropriate column, or we can find all the documents containing a certain content word by looking across the appropriate row. Notice that our arrangement is binary - a square in our grid either contains an X, or it doesn't. This big grid is the visual equivalent of a generic keyword search, which looks for exact matches between documents and keywords. If we replace blanks and X-es with zeroes and ones, we get a numerical matrix containing the same information. The key step in LSI is decomposing this matrix using a technique called singular value decomposition. The mathematics of this transformation is beyond the scope of this article. Imagine that you are curious about what people typically order for breakfast down at your local diner, and you want to display this information in visual form. You decide to examine all the breakfast orders from a busy weekend day, and record how many times the words bacon, eggs and coffee occur in each order. You can graph the results of your survey by setting up a chart with three orthogonal axes - one for each keyword. The choice of direction is arbitrary - perhaps a bacon axis in the x direction, an eggs axis in the y direction, and the all-important coffee axis in the

z direction. To plot a particular breakfast order, you count the occurrence of each keyword, and then take the appropriate number of steps along the axis for that word. When you are finished, you get a cloud of points in three-dimensional space, representing all of that day's breakfast orders. If you draw a line from the origin of the graph to each of these points, you obtain a set of vectors in 'bacon-eggs-and-coffee' space. The size and direction of each vector tells you how many of the three key items were in any particular order, and the set of all the vectors taken together tells you something about the kind of breakfast people favor on a Saturday morning. What your graph shows is called a term space. Each breakfast order forms a vector in that space, with its direction and magnitude determined by how many times the three keywords appear in it. Each keyword corresponds to a separate spatial direction, perpendicular to all the others. Because our example uses three keywords, the resulting term space has three dimensions, making it possible for us to visualize it. It is easy to see that this space could have any number of dimensions, depending on how many keywords we chose to use. If we were to go back through the orders and also record occurrences of sausage, muffin, and bagel, we would end up with a six-dimensional term space, and six-dimensional document vectors. Applying this procedure to a real document collection, where we note each use of a content word, results in a term space with many thousands of dimensions. Each document in our collection is a vector with as many components as there are content words. Although we can't possibly visualize such a space, it is built in the exact same way as the whimsical breakfast space we just described. Documents in such a space that have many words in common will have vectors that are near to each other, while documents with few shared words will have vectors that are far apart. Latent semantic indexing works by projecting this large, multidimensional space down into a smaller number of dimensions. In doing so, keywords that are semantically similar will get squeezed together, and will no longer be completely distinct. This blurring of

boundaries is what allows LSI to go beyond straight keyword matching. To understand how it takes place, we can use another analogy. 3.2.4.3. Singular Value Decomposition: Imagine you keep tropical fish, and are proud of your prize aquarium - so proud that you want to submit a picture of it to Modern Aquaria magazine, for fame and profit. To get the best possible picture, you will want to choose a good angle from which to take the photo. You want to make sure that as many of the fish as possible are visible in your picture, without being hidden by other fish in the foreground. You also won't want the fish all bunched together in a clump, but rather shot from an angle that shows them nicely distributed in the water. Since your tank is transparent on all sides, you can take a variety of pictures from above, below, and from all around the aquarium, and select the best one. In mathematical terms, you are looking for an optimal mapping of points in 3-space (the fish) onto a plane (the film in your camera). 'Optimal' can mean many things - in this case it means 'aesthetically pleasing'. But now imagine that your goal is to preserve the relative distance between the fish as much as possible, so that fish on opposite sides of the tank don't get superimposed in the photograph to look like they are right next to each other. Here you would be doing exactly what the SVD algorithm tries to do with a much higher-dimensional space. Instead of mapping 3-space to 2-space, however, the SVD algorithm goes to much greater extremes. A typical term space might have tens of thousands of dimensions, and be projected down into fewer than 150. Nevertheless, the principle is exactly the same. The SVD algorithm preserves as much information as possible about the relative distances between the document vectors, while collapsing them down into a much smaller set of dimensions. In this collapse, information is lost, and content words are superimposed on one another. Information loss sounds like a bad thing, but here it is a blessing. What we are losing is noise from our original term-document matrix, revealing similarities that were latent in

the document collection. Similar things become more similar, while dissimilar things remain distinct. This reductive mapping is what gives LSI its seemingly intelligent behavior of being able to correlate semantically related terms. We are really exploiting a property of natural language, namely that words with similar meaning tend to occur together. While a discussion of the mathematics behind singular value decomposition is beyond the scope of our article, it's worthwhile to follow the process of creating a term-document matrix in some detail, to get a feel for what goes on behind the scenes. Here we will process a sample wire story to demonstrate how real-life texts get converted into the numerical representation we use as input for our SVD algorithm. The first step in the chain is obtaining a set of documents in electronic form. This can be the hardest thing about LSI - there are all too many interesting collections not yet available online. In our experimental database, we download wire stories from an online newspaper with an AP news feed. A script downloads each day's news stories to a local disk, where they are stored as text files. Let's imagine we have downloaded the following sample wire story, and want to incorporate it in our collection:

O'Neill Criticizes Europe on Grants
PITTSBURGH (AP)

Treasury Secretary Paul O'Neill expressed irritation on Wednesday that European countries have refused to go along with a U.S. proposal to boost the amount of direct grants rich nations offer poor countries. The Bush administration is pushing a plan to increase the amount of direct grants the World Bank provides the poorest nations to 50 percent of assistance, reducing use of loans to these nations.


The first thing we do is strip all formatting from the article, including capitalization, punctuation, and extraneous markup (like the dateline). LSI pays no attention to word order, formatting, or capitalization, so we can safely discard that information. Our cleaned-up wire story looks like this:

ONeill criticizes Europe on grants treasury secretary Paul ONeill expressed irritation Wednesday that European countries have refused to go along with a us proposal to boost the amount of direct grants rich nations offer poor countries the bush administration is pushing a plan to increase the amount of direct grants the world bank provides the poorest nations to 50 percent of assistance reducing use of loans to these nations

The next thing we want to do is pick out the content words in our article. These are the words we consider semantically significant - everything else is clutter. We do this by applying a stop list of commonly used English words that don't carry semantic meaning. Using a stop list greatly reduces the amount of noise in our collection, as well as eliminating a large number of words that would make the computation more difficult. Creating a stop list is something of an art - they depend very much on the nature of the data collection. You can see our full wire stories stop list here. Here is our sample story with stop-list words highlighted:

ONeill criticizes Europe on grants treasury secretary Paul ONeill expressed irritation Wednesday that European countries have refused to go along with a us proposal to boost the amount of direct grants rich nations offer poor countries the bush administration is pushing a plan to increase the amount of direct grants the world bank provides the poorest nations to 50 percent of assistance reducing use of loans to these nations

Removing these stop words leaves us with an abbreviated version of the article containing content words only:


ONeill criticizes Europe grants treasury secretary Paul ONeill expressed irritation European countries refused US proposal boost direct grants rich nations poor countries bush administration pushing plan increase amount direct grants world bank poorest nations assistance loans nations

However, one more important step remains before our document is ready for indexing. We can notice how many of our content words are plural nouns (grants, nations) and inflected verbs (pushing, refused). It doesn't seem very useful to have each inflected form of a content word be listed separately in our master word list - with all the possible variants, the list would soon grow unwieldy. More troubling is that LSI might not recognize that the different variant forms were actually the same word in disguise. We solve this problem by using a stemmer.

3.2.4.4. Stemming: While LSI itself knows nothing about language (we saw how it deals exclusively with a mathematical vector space), some of the preparatory work needed to get documents ready for indexing is very language-specific. We have already seen the need for a stop list, which will vary entirely from language to language and to a lesser extent from document collection to document collection. Stemming is similarly language-specific, derived from the morphology of the language. For English documents, we use an algorithm called the Porter stemmer to remove common endings from words, leaving behind an invariant root form. Here are some examples of words before and after stemming:
Information -> inform
Presidency -> preside
Presiding -> preside
Happiness -> happy
Happily -> happy
Discouragement -> discourage
Battles -> battle
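For illustration, a Porter stemmer is available in the NLTK library (assuming it is installed); note that the exact stems it produces can differ slightly from the pairs listed above, which are only illustrative.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["information", "presidency", "presiding", "happiness",
         "happily", "discouragement", "battles", "nations", "grants"]
for word in words:
    # The stemmer strips common endings, leaving a root form.
    print(word, "->", stemmer.stem(word))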


And here is our sample story as it appears to the stemmer:

ONeill criticizes Europe grants treasury secretary Paul ONeill expressed irritation European countries refused US proposal boost direct grants rich nations poor countries bush administration pushing plan increase amount direct grants world bank poorest nations assistance loans nations

Note that at this point we have reduced the original natural-language news story to a series of word stems. All of the information carried by punctuation, grammar, and style is gone - all that remains is word order, and we will be doing away with even that by transforming our text into a word list. It is striking that so much of the meaning of text passages inheres in the number and choice of content words, and relatively little in the way they are arranged. This is very counterintuitive, considering how important grammar and writing style are to human perceptions of writing. Having stripped, pruned, and stemmed our text, we are left with a flat list of words:

administrat amount assist bank boost bush countri (2) direct europ express grant (2) increas irritat loan nation (3) ONeill Paul plan poor (2) propos push

refus rich secretar treasuri US world

This is the information we will use to generate our term-document matrix, along with a similar word list for every document in our collection.

3.2.4.5. The Term-Document Matrix: As we mentioned in our discussion of LSI, the term-document matrix is a large grid representing every document and content word in a collection. We have looked in detail at how a document is converted from its original form into a flat list of content words. We prepare a master word list by generating a similar set of words for every document in our collection, and discarding any content words that either appear in every document (such words won't let us discriminate between documents) or in only one document (such words tell us nothing about relationships across documents). With this master word list in hand, we are ready to build our TDM. We generate our TDM by arranging our list of all content words along the vertical axis, and a similar list of all documents along the horizontal axis. These need not be in any particular order, as long as we keep track of which column and row corresponds to which keyword and document. For clarity we will show the keywords as an alphabetized list. We fill in the TDM by going through every document and marking the grid square for all the content words that appear in it. Because any one document will contain only a tiny subset of our content word vocabulary, our matrix is very sparse (that is, it consists almost entirely of zeroes). Here is a fragment of the actual term-document matrix from our wire stories database (documents a through q):

            a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q
astro       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
satellite   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
shine       0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
star        0  0  1  0  0  0  0  0  0  1  0  0  0  0  0  0  0
planet      0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
sun         0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
earth       0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0

We can easily see if a given word appears in a given document by looking at the intersection of the appropriate row and column. In this sample matrix, we have used ones to represent document/keyword pairs. With such a binary scheme, all we can tell about any given document/keyword combination is whether the keyword appears in the document. This approach will give acceptable results, but we can significantly improve our results by applying a kind of linguistic favoritism called term weighting to the value we use for each non-zero term/document pair. Term weighting is a formalization of two common-sense insights: 1. Content words that appear several times in a document are probably more meaningful than content words that appear just once. 2. Infrequently used words are likely to be more interesting than common words. The first of these insights applies to individual documents, and we refer to it as local weighting. Words that appear multiple times in a document are given a greater local weight than words that appear once. We use a formula called logarithmic local weighting to generate our actual value. The second insight applies to the set of all documents in our collection, and is called global term weighting. There are many global weighting schemes; all of them reflect the fact that words that appear in a small handful of documents are likely to be more significant than words that are distributed widely across our document collection. Our own indexing system uses a scheme called inverse document frequency to calculate global weights.


By way of illustration, here are some sample words from our collection, with the number of documents they appear in, and their corresponding global weights.

word        count   global weight
unit        833     1.44
cost        295     2.47
project     169     3.03
tackle      40      4.47
wrestler    7       6.22
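The following sketch shows one common way to compute such weights: a logarithmic local weight, an inverse-document-frequency global weight like the ones in the table above, and the length normalization described next. The exact formulas vary between systems, so this is only an illustration with made-up counts, not the weighting actually used for the wire-story collection.

import math

def weighted_tdm(counts, num_docs):
    """counts: dict mapping (term, doc) -> raw occurrence count."""
    # Global weight: rarer terms get larger weights (inverse document frequency).
    docs_with_term = {}
    for (term, doc), c in counts.items():
        docs_with_term.setdefault(term, set()).add(doc)
    global_w = {t: 1.0 + math.log(num_docs / len(d)) for t, d in docs_with_term.items()}

    # Local weight: damp repeated occurrences of a term within one document.
    weights = {(t, d): (1.0 + math.log(c)) * global_w[t] for (t, d), c in counts.items()}

    # Normalization: scale each document vector to unit length so that long
    # documents do not overwhelm short ones.
    norm = {}
    for (t, d), w in weights.items():
        norm[d] = norm.get(d, 0.0) + w * w
    return {(t, d): w / math.sqrt(norm[d]) for (t, d), w in weights.items()}

counts = {("wrestler", "doc1"): 2, ("project", "doc1"): 1, ("project", "doc2"): 3}
print(weighted_tdm(counts, num_docs=2))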

You can see that a word like wrestler, which appears in only seven documents, is considered twice as significant as a word like project, which appears in over a hundred. There is a third and final step to weighting, called normalization. This is a scaling step designed to keep large documents with many keywords from overwhelming smaller documents in our result set. It is similar to handicapping in golf - smaller documents are given more importance, and larger documents are penalized, so that every document has equal significance. These three values multiplied together - local weight, global weight, and normalization factor - determine the actual numerical value that appears in each non-zero position of our term/document matrix. Although this step may appear language-specific, note that we are only looking at word frequencies within our collection. Unlike the stop list or stemmer, we don't need any outside source of linguistic information to calculate the various weights. While weighting isn't critical to understanding or implementing LSI, it does lead to much better results, as it takes into account the relative importance of potential search terms. With the weighting step done, we have done everything we need to construct a finished term-document matrix. The final step will be to run the SVD algorithm itself. Notice that this critical step will be purely mathematical - although we know that the matrix and its contents are a shorthand for certain linguistic features of our collection, the algorithm doesn't know anything about what the numbers mean. This is why we say LSI is language-agnostic - as long as you can perform the steps needed to generate a term-document matrix from your data collection, it can be in any language or format whatsoever. You may be wondering what the large matrix of numbers we have created has to do with the term vectors and many-dimensional spaces we discussed in our earlier explanation of how LSI works. In fact, our matrix is a convenient way to represent vectors in a high-dimensional space. While we have been thinking of it as a lookup grid that shows us which terms appear in which documents, we can also think of it in spatial terms. In this interpretation, every column is a long list of coordinates that gives us the exact position of one document in a many-dimensional term space. When we applied term weighting to our matrix in the previous step, we nudged those coordinates around to make the document's position more accurate. As the name suggests, singular value decomposition breaks our matrix down into a set of smaller components. The algorithm alters one of these components (this is where the number of dimensions gets reduced), and then recombines them into a matrix of the same shape as our original, so we can again use it as a lookup grid. The matrix we get back is an approximation of the term-document matrix we provided as input, and looks much different from the original. (In the fragment of the processed matrix for the terms star, planet, moon, sun, earth, astro and shine across documents a through j, the entries are no longer ones and zeroes but small positive and negative values such as -0.006, -0.002, 0.004 and 0.008.)

Notice two interesting features in the processed data:
1. The matrix contains far fewer zero values. Each document has a similarity value for most content words.


2. Some of the similarity values are negative. In our original TDM, this would correspond to a document with fewer than zero occurrences of a word, which is an impossibility. In the processed matrix, a negative value is indicative of a very large semantic distance between a term and a document.

This finished matrix is what we use to actually search our collection. Given one or more terms in a search query, we look up the values for each search term/document combination, calculate a cumulative score for every document, and rank the documents by that score, which is a measure of their similarity to the search query. In practice, we will probably assign an empirically-determined threshold value to serve as a cutoff between relevant and irrelevant documents, so that the query does not return every document in our collection.
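Putting the pieces together, the following sketch (assuming NumPy is available) builds a tiny weighted term-document matrix, reduces its rank with the SVD, and scores documents against a one-word query. Real collections use sparse matrices and a few hundred retained dimensions rather than two; the matrix values here are invented for the example.

import numpy as np

# Rows are terms, columns are documents a, b, c; the values stand in for
# weighted occurrence counts.
A = np.array([
    [1.0, 0.0, 1.0],   # star
    [1.0, 1.0, 0.0],   # planet
    [0.0, 1.0, 0.0],   # moon
    [0.0, 1.0, 1.0],   # sun
    [0.0, 1.0, 0.0],   # earth
    [1.0, 0.0, 0.0],   # satellite
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # keep the 2 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # low-rank approximation of A

# Score a query containing only "star" against each document column of the
# reduced matrix using the cosine similarity.
q = np.array([1.0, 0, 0, 0, 0, 0])
scores = (q @ A_k) / (np.linalg.norm(q) * np.linalg.norm(A_k, axis=0) + 1e-12)
for doc, score in zip("abc", scores):
    print(doc, round(float(score), 3))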

3.2.5. Challenges in parallelism


A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of

information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producerconsumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.
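The producer-consumer relationship can be sketched with a thread-safe queue, as below; real engines use distributed queues spread over many machines, but the structure is the same.

import queue
import threading

doc_queue = queue.Queue()

def crawler(pages):
    for doc_id, text in pages:
        doc_queue.put((doc_id, text))     # producer: hand pages to the indexer
    doc_queue.put(None)                   # sentinel: no more pages

def indexer(inverted_index):
    while True:
        item = doc_queue.get()
        if item is None:
            break
        doc_id, text = item               # consumer: update the index
        for word in text.lower().split():
            inverted_index.setdefault(word, set()).add(doc_id)

index = {}
pages = [(1, "the cow says moo"), (2, "the cat and the hat")]
t1 = threading.Thread(target=crawler, args=(pages,))
t2 = threading.Thread(target=indexer, args=(index,))
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(index["the"]))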

4. Meta-Search Engine
A meta-search engine is a search engine that does not have its own database of web pages. It sends search terms to the databases maintained by other search engines and gives users the results that come from all the search engines queried. Few meta-search engines allow you to delve into the largest, most useful search engine databases. They tend to return results from smaller and/or free search engines and miscellaneous free directories, often small and highly commercial. The mechanisms and algorithms that meta-search engines employ are quite different. The simplest meta-search engines just pass the queries to other direct search engines. The results are then simply displayed in different newly opened browser windows, as if several different queries were posed. Some improved meta-search engines organize the query results in one screen in different frames, or in one frame but in a sequential order. Some more sophisticated meta-search engines permit users to choose their favorite direct search engines in the query input process, while using filters and other algorithms to process the returned query results before displaying them to the users. Problems often arise in the query input process, though. Meta-search engines are useful if the user is looking for a unique term or phrase, or simply wants to run a couple of keywords. Some meta-search engines simply pass search terms along to the underlying direct search engines, and if a search contains more than one or two words or very complex logic, most of it will be lost; it will only make sense to the few search engines that support such logic. Some of the more powerful meta-search engines work on top of direct search engines like Google, AltaVista and Yahoo. No two meta-search engines are alike. Some search only the most popular search engines while others also search lesser-known engines, newsgroups, and other databases. They also differ in how the results are presented and in the quantity of engines that are used. Some will list results according to search engine or database. Others return results according to relevance, often concealing which search engine returned which results. This benefits the user by eliminating duplicate hits and grouping the most relevant ones at the top of the list. Search engines frequently have different ways they expect requests to be submitted. For example, some search engines allow the usage of the word "AND" while others require "+" and others require only a space to combine words. The better meta-search engines try to synthesize requests appropriately when submitting them. Results can vary between meta-search engines based on a large number of variables. Still, even the most basic meta-search engine will allow more of the web to be searched at once than any one stand-alone search engine. On the other hand, the results are said to be less relevant, since a meta-search engine cannot know the internal alchemy a search engine applies to its results (a meta-search engine does not have any direct access to the search engines' databases). Meta-search engines are sometimes used in vertical search portals, and to search the deep web. Some examples of meta-search engines are Dogpile and Metacrawler.
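The result-merging step can be sketched as follows; the engine result lists are made-up data, and the reciprocal-rank scoring is just one simple way to combine rankings and eliminate duplicate hits, not the method used by any particular meta-search engine.

def merge_results(result_lists):
    """result_lists: list of ranked URL lists, one per underlying search engine."""
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results):
            # Earlier positions count more, and URLs returned by several
            # engines accumulate a higher combined score.
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["http://example.com/1", "http://example.com/2", "http://example.com/3"]
engine_b = ["http://example.com/2", "http://example.com/4"]
print(merge_results([engine_a, engine_b]))
# ['http://example.com/2', 'http://example.com/1', ...] - duplicates appear once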

5. Search engine optimization


Search engine optimization (SEO) is the process of improving the volume and quality of traffic to a web site from search engines via "natural" ("organic" or "algorithmic") search results. Usually, the earlier a site is presented in the search results, or the higher it "ranks", the more searchers will visit that site. SEO can also target different kinds of search, including image search, local search, and industry-specific vertical search engines. As an Internet marketing strategy, SEO considers how search engines work and what people search for. Optimizing a website primarily involves editing its content and HTML coding, both to increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. In Internet marketing terms, search engine optimization or SEO is the process of making a website easy to find in search engines for its targeted and relevant keywords. This can be achieved by optimizing the internal and external factors that influence search engine positioning. The main goal of every professionally implemented SEO campaign is gaining top positioning for targeted keywords as well as growth in search engine traffic. That is the reason why search engine optimization may increase the number of sales and conversions several times over. Many third-party organizations provide visibility for a business organization's website on the World Wide Web, through search engine marketing techniques and by methods of increasing the page rank of the organization's website.

5.1. Page Rank


Page-Rank is a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the Page-Rank of E and denoted by PR (E). The name "Page-Rank" is a trademark of Google, and the Page-Rank process has been patented (U.S. Patent 6,285,999). However, the patent is assigned to Stanford University and not to Google. Google has exclusive license rights on the patent from Stanford University. The university received 1.8 million shares of Google in exchange for use of the patent; the shares were sold in 2005 for $336 million.


As an algorithm, Page-Rank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. Page-Rank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided between all documents in the collection at the beginning of the computational process. The Page-Rank computations require several passes, called "iterations", through the collection to adjust approximate Page-Rank values to more closely reflect the theoretical true value. A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 Page-Rank.

5.2. The Ranking algorithm simplified


Assume a small universe of four web pages: A, B, C and D. The initial approximation of Page-Rank would be evenly divided between these four documents, so each document would begin with an estimated Page-Rank of 0.25. In the original form of Page-Rank the initial values were simply 1, which meant that the sum over all pages was the total number of pages on the web. Later versions of Page-Rank (see the formulas below) assume a probability distribution between 0 and 1; here we simply use a probability distribution, hence the initial value of 0.25. If pages B, C and D each link only to A, they would each confer 0.25 Page-Rank to A. All Page-Rank in this simplistic system would thus gather at A, because all links would be pointing to A.

Upon the next iteration, A's Page-Rank would thus be 0.75. Now suppose instead that page B also has a link to page C, and page D has links to all three pages. The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C, while only one third of D's Page-Rank is counted for A's Page-Rank (approximately 0.083).
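This one-step vote distribution can be checked with a small sketch; the four-page link structure below is the assumed example just described (B links to A and C, C links to A, D links to A, B and C):

```python
# One iteration of the simplified (undamped) Page-Rank vote distribution
# for the four-page example: each page splits its current score evenly
# among its outbound links.

links = {            # page -> pages it links to (assumed example structure)
    "A": [],         # A is a sink here; Section 5.3 handles sinks explicitly
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
pr = {page: 0.25 for page in links}   # initial uniform distribution

new_pr = {page: 0.0 for page in links}
for page, outlinks in links.items():
    for target in outlinks:
        new_pr[target] += pr[page] / len(outlinks)   # each out-link gets an equal share

print({p: round(v, 3) for p, v in new_pr.items()})
# {'A': 0.458, 'B': 0.083, 'C': 0.208, 'D': 0.0}
# B's vote to A is worth 0.125, and one third of D's score (about 0.083) goes to A.
```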

In other words, the Page-Rank conferred by an outbound link is equal to the document's own Page-Rank score divided by the normalized number of outbound links L (it is assumed that links to specific URLs count only once per document).

In the general case, the Page-Rank value for any page u can be expressed as
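(a rendering in standard notation, with $B_u$ the set of pages linking to $u$ and $L(v)$ the number of out-links of page $v$, as described in the next sentence)

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$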

That is, the Page-Rank value for a page u depends on the Page-Rank values of each page v in the set Bu (the set of all pages linking to page u), each divided by the number L(v) of links from page v.

5.3. Damping factor


The Page-Rank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor is set around 0.85. The damping factor is subtracted from 1 (and, in some variations of the algorithm, the result is divided by the number of documents in the collection) and this term is then added to the product of the damping factor and the sum of the incoming Page-Rank scores. That is,
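in equation form (a rendering of the description above, with A's in-linking pages B, C, ..., their out-link counts L(B), L(C), ..., and damping factor d):

$$PR(A) = (1 - d) + d\left(\frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \cdots\right)$$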


or, in some variations of the algorithm, the (1 - d) term is divided by N, the number of documents in the collection.
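In the same (assumed) notation, this variant reads:

$$PR(A) = \frac{1 - d}{N} + d\left(\frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \cdots\right)$$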

So any page's Page-Rank is derived in large part from the Page-Ranks of other pages; the damping factor adjusts the derived value downward. Google recalculates Page-Rank scores each time it crawls the Web and rebuilds its index, and as Google increases the number of documents in its collection, the initial approximation of Page-Rank decreases for all documents.

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The Page-Rank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages and the transitions, which are all equally probable, are the links between pages.

If a page has no links to other pages, it becomes a sink and would therefore terminate the random surfing process. The solution is quite simple: if the random surfer arrives at a sink page, he picks another URL at random and continues surfing. When calculating Page-Rank, pages with no outbound links are therefore assumed to link out to all other pages in the collection, and their Page-Rank scores are divided evenly among all other pages. In other words, to be fair to pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability usually set to d = 0.85, estimated from the frequency with which an average surfer uses his or her browser's bookmark feature. So, the equation is as follows:
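(a rendering using the notation defined in the sentence that follows)

$$PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$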


where p1, p2, ..., pN are the pages under consideration, M(pi) is the set of pages that link to pi, L(pj) is the number of outbound links on page pj, and N is the total number of pages. The Page-Rank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes Page-Rank a particularly elegant metric: the eigenvector is
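the column vector of Page-Rank values (written here in standard notation, consistent with the next sentence),

$$\mathbf{R} = \begin{bmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{bmatrix},$$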

where R is the solution of the following equation:
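(a matrix-form rendering, assumed consistent with the adjacency function L(pi, pj) defined in the next sentence)

$$\mathbf{R} = \begin{bmatrix} (1-d)/N \\ (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix} + d \begin{bmatrix} L(p_1,p_1) & L(p_1,p_2) & \cdots & L(p_1,p_N) \\ L(p_2,p_1) & \ddots & & \vdots \\ \vdots & & & \\ L(p_N,p_1) & \cdots & & L(p_N,p_N) \end{bmatrix} \mathbf{R}$$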

where the adjacency function L(pi, pj) is 0 if page pj does not link to pi, and is normalized such that, for each j,
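(in the same assumed notation)

$$\sum_{i=1}^{N} L(p_i, p_j) = 1$$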

i.e. the elements of each column sum up to 1. This is a variant of the eigenvector centrality measure commonly used in network analysis. Because of the large eigen-gap of the modified adjacency matrix above, the values of the Page-Rank eigenvector are fast to approximate (only a few iterations are needed; a small iterative sketch is given at the end of this section). As a result of Markov theory, it can be shown that the Page-Rank of a page is the probability of being at that page after many clicks. This happens to equal 1/t, where t is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.

The main disadvantage of Page-Rank is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages, such as Wikipedia). The Google Directory (itself a derivative of the Open Directory Project) allows users to see results sorted by Page-Rank within categories, and it is the only service offered by Google where Page-Rank directly determines display order. In Google's other search services (such as its primary Web search) Page-Rank is used to weight the relevance scores of pages shown in search results.

Several strategies have been proposed to accelerate the computation of Page-Rank. Various strategies to manipulate Page-Rank have also been employed in concerted efforts to improve search result rankings and monetize advertising links. These strategies have severely impacted the reliability of the Page-Rank concept, which seeks to determine which documents are actually highly valued by the Web community.
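Returning to the computation itself, the iterative approximation mentioned above can be sketched in a few lines. The four-page link structure and d = 0.85 below are assumptions reused from the simplified example of Section 5.2; this is an illustrative sketch, not Google's actual implementation:

```python
# Minimal power-iteration sketch of damped Page-Rank.
# Sink pages (no out-links) are treated as linking to all other pages,
# as described in the damping-factor discussion above.

def pagerank(links, d=0.85, iterations=20):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                  # uniform initial distribution
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}         # random-jump term
        for p, outlinks in links.items():
            # a sink spreads its score evenly over all other pages
            targets = outlinks if outlinks else [q for q in pages if q != p]
            share = pr[p] / len(targets)
            for t in targets:
                new[t] += d * share                   # link-vote term
        pr = new
    return pr

# Assumed example from Section 5.2: B -> A, C; C -> A; D -> A, B, C; A is a sink.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
print({p: round(v, 3) for p, v in pagerank(links).items()})
```

A handful of iterations is enough for the values to stabilize, illustrating the fast convergence noted above.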

5.4. Uses of Page-Rank


A version of Page-Rank has recently been proposed as a replacement for the traditional Institute for Scientific Information (ISI) impact factor, and has been implemented at eigenfactor.org. Instead of merely counting the total citations to a journal, the "importance" of each citation is determined in a Page-Rank fashion.


A similar new use of Page-Rank is to rank academic doctoral programs based on their records of placing their graduates in faculty positions. In Page-Rank terms, academic departments link to each other by hiring their faculty from each other (and from themselves).

Page-Rank has been used to rank spaces or streets to predict how many people (pedestrians or vehicles) come to the individual spaces or streets.

Page-Rank has also been used to automatically rank WordNet synsets according to how strongly they possess a given semantic property, such as positivity or negativity.

A dynamic weighting method similar to Page-Rank has been used to generate customized reading lists based on the link structure of Wikipedia.

A Web crawler may use Page-Rank as one of a number of importance metrics it uses to determine which URL to visit next during a crawl of the web. One of the early working papers used in the creation of Google is "Efficient Crawling Through URL Ordering", which discusses the use of a number of different importance metrics to determine how deeply, and how much of, a site Google will crawl. Page-Rank is presented as one of a number of these importance metrics, though there are others listed, such as the number of inbound and outbound links for a URL and the distance from the root directory on a site to the URL.


Page-Rank may also be used as a methodology to measure the apparent impact of a community, like the blogosphere, on the overall Web itself. This approach therefore uses Page-Rank to measure the distribution of attention, reflecting the scale-free network paradigm.

6. Marketing of search engines


Search engine marketing, or SEM, is a form of Internet marketing that seeks to promote websites by increasing their visibility in search engine result pages (SERPs) through the use of paid placement, contextual advertising, and paid inclusion, with pay per click (PPC) as the leading model. The Search Engine Marketing Professional Organization (SEMPO) also includes search engine optimization (SEO) within its reporting, but SEO is treated as a separate discipline by most sources, including the New York Times, which defines SEM as "the practice of buying paid search listings".

Fig. 1. The Advertisement market share of search engines


Search engines have become indispensable to interacting on the Web. In addition to processing information requests, they are navigational tools that can direct users to specific Web sites or aid in browsing. Search engines can also facilitate e-commerce transactions as well as provide access to non-commercial services such as maps, online auctions, and driving directions. People use search engines as dictionaries, spell checkers, and thesauruses; as discussion groups (Google Groups) and social networking forums (Yahoo! Answers); and even as entertainment (Google-whacking, vanity searching).

In this competitive market, rivals continually strive to improve their information-retrieval capabilities and increase their financial returns. One innovation is sponsored search, an "economics meets search" model in which content providers pay search engines for user traffic going from the search engine to their Web sites. Sponsored search has proven to be a successful business. Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines that do not accept money for their search results make money by running search-related ads alongside the regular results; the search engines make money every time someone clicks on one of these ads.

Revenue in the web search portals industry is projected to grow in 2008 by 13.4 percent, with broadband connections expected to rise by 15.1 percent. Between 2008 and 2012, industry revenue is projected to rise by 56 percent, as Internet penetration still has some way to go to reach full saturation in American households. Furthermore, broadband services are projected to account for an ever increasing share of domestic Internet users, rising to 118.7 million by 2012, with an increasing share accounted for by fiber-optic and high-speed cable lines.

7. Summary
With the precipitous expansion of the Web, extracting knowledge from the Web is becoming increasingly important and popular. This is due to the Web's convenience and its richness of information. Today, search engines can cover more than 60% of the information on the World Wide Web. The future prospects of every aspect of search engines are very bright; Google, for example, is working on embedding intelligence in its search engine.

For all their problems, online search engines have come a long way. Sites like Google are pioneering the use of sophisticated techniques to help distinguish content from drivel, and the arms race between search engines and the marketers who want to manipulate them has spurred innovation. But the challenge of finding relevant content online remains. Because of the sheer number of documents available, we can find interesting and relevant results for any search query at all. The problem is that those results are likely to be hidden in a mass of semi-relevant and irrelevant information, with no easy way to distinguish the good from the bad.

8. References
Brin, Sergey and Page, Lawrence. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems, April 1998.

Baldi, Pierre. Modeling the Internet and the Web: Probabilistic Methods and Algorithms, 2003.

Chakrabarti, Soumen. Mining the Web: Analysis of Hypertext and Semi-Structured Data, 2003.

Jansen, B. J. "The Comparative Effectiveness of Sponsored and Non-sponsored Links for Web E-commerce Queries" (PDF). ACM Transactions on the Web, May 2007.

Del Corso, Gianna M., Gullì, Antonio, and Romani, Francesco. "Fast Page-Rank Computation via a Sparse Linear System (Extended Abstract)." http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.118.5422

Deeho Search Engine Optimization (SEO) solutions.

9. Bibliography
Wikipedia.org
Google Books
The SEO Books
