
1. INTRODUCTION

A search engine has been defined as: ‘a program designed to help find information stored on a computer system such as the World Wide Web, or a personal computer. The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of references that match those criteria. Search engines use regularly updated indexes to operate quickly and efficiently.’

In other words, a search engine is a sophisticated piece of software, accessed through a page on a website, that allows you to search the web by entering search queries into a search box. The search engine then attempts to match your search query with the content of the web pages that it has stored, or cached, and indexed on its powerful servers in advance of your search.

Many search engines allow you to search for things other than text: for example, images.

SEO methods are largely (but not exclusively) centred upon text as they involve matching key parts of
the text in your web pages with the keywords or keyphrases that people actually type into search
engines when looking for something on the internet.

There are two main types of search indexes we access when searching the web:

• directories
• crawler-based search engines

Directories

Unlike search engines, which use special software to locate and index sites, directories are compiled
and maintained by humans. Directories often consist of a categorised list of links to other sites to
which you can add your own site. Editors sometimes review your site to see if it is fit for inclusion in
the directory.

Crawler-based search engines

Crawler-based search engines differ from directories in that they are not compiled and maintained by
humans. Instead, crawler-based search engines use sophisticated pieces of software called spiders or
robots to search and index web pages.

These spiders are constantly at work, crawling around the web, locating pages, and taking snapshots of
those pages to be cached or stored on the search engine’s servers. They are so sophisticated that they can
follow links from one page to another and from one site to another. Google is a prominent example of a crawler-based search engine.

SEO, short for Search Engine Optimization, is the art, craft, and science of driving web traffic to web sites.

Learning how to construct web sites and pages to improve and not harm the search engine placement of
those web sites and web pages has become a key component in the evolution of SEO.

In this article we will look at one technique that is commonly used to improve SEO: creating friendly URLs.

Today, most websites are database-driven, or dynamic, sites, and most of them pass data between pages using query strings. Search engine crawlers have traditionally been reluctant to index pages whose URLs contain a question mark or other special characters. If a search engine does not identify a page or its content, that page effectively has no web presence. How can this be handled? This write-up discusses the topic with a sample web site project as an implementation reference.

Friendly URLs

Friendly URLs pass information to pages without using the question mark or other special characters, so these pages will be indexed by search engines, which helps maximise the search engine rankings of your website. Search engines prefer static URLs to dynamic URLs.

A dynamic URL is a page address that is created from the search of a database-driven web site or the URL
of a web site that runs a script. In contrast to static URLs, in which the contents of the web page stay the
same unless the changes are hard-coded into the HTML, dynamic URLs are generated from specific
queries to a site's database. The dynamic page is only a template in which to display the results of the
database query.

Search engines have been reluctant to index dynamic URLs because they contain non-standard characters such as ?, &, %, and =. Often, anything after the first non-standard character is omitted. For example, consider a URL like the one below:

http://www.myweb.com/default.aspx?id=120

In this case, if everything after the first non-standard character is omitted, the URL will look like:

http://www.myweb.com/default.aspx

To a search engine, URLs of this type look like a group of duplicates; duplicates are discarded, so not all of your dynamic pages will be indexed. A search engine will, however, readily index a URL like:

http://www.myweb.com/page/120.aspx

Even though search engines are nowadays better at indexing dynamic URLs, they still prefer static URLs.

Creating SEO Friendly URLs

What if we were to implement this in our own projects? Here's a method that could be used to create friendly URLs and boost your page rank.

In our example, let us try with these URLs:

http://www.myweb.com/Order.aspx?Itemid=10&Item=Apple

http://www.myweb.com/Order.aspx?Itemid=11&Item=Orange

Our objective is to convert them into contextual URLs that resemble:

http://www.myweb.com/shop/10/Apple.aspx

http://www.myweb.com/shop/11/Orange.aspx

The approach is as follows: first we convert the actual URL (with its query string) into the contextual URL, by reading the mapped URL name for the page from the web.config file, appending the query string values separated by '/', and finally using the item name as the page name. When this contextual URL is clicked and the page is requested, the Application_BeginRequest event in the Global.asax file rewrites the contextual URL back into the actual URL with its query strings.

Let's look in more detail at how to build an application that uses friendly URLs. This application simply demonstrates how to create SEO-friendly URLs and is just one idea of how URLs can be rewritten; you can take the idea and adapt it in your own way.
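
For illustration, below is a minimal sketch of the rewriting step, assuming an ASP.NET Web Forms site. The '/shop/' URL pattern and the Order.aspx page follow the example URLs above, but the regular expression and the exact handler code are illustrative assumptions, not the project's actual implementation.

// Global.asax.cs - a sketch of rewriting a friendly URL such as
// /shop/10/Apple.aspx back to /Order.aspx?Itemid=10&Item=Apple.
using System;
using System.Text.RegularExpressions;
using System.Web;

public class Global : HttpApplication
{
    // Matches e.g. "/shop/10/Apple.aspx" and captures the id and item name.
    private static readonly Regex FriendlyUrl =
        new Regex(@"^/shop/(?<id>\d+)/(?<item>[^/]+)\.aspx$", RegexOptions.IgnoreCase);

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        string path = Request.Url.AbsolutePath;
        Match m = FriendlyUrl.Match(path);
        if (m.Success)
        {
            // Server-side rewrite: the browser still shows the friendly URL,
            // but the request is processed by Order.aspx with its query string.
            string actualUrl = string.Format("~/Order.aspx?Itemid={0}&Item={1}",
                m.Groups["id"].Value,
                HttpUtility.UrlEncode(m.Groups["item"].Value));
            Context.RewritePath(actualUrl);
        }
    }
}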


1.1 OBJECTIVE
Background

This article follows on from the previous three Searcharoo samples:

Searcharoo Version 1 describes building a simple search engine that crawls the file system from a specified folder and indexes all HTML (or other known types of) documents. A basic design and object model was developed to support simple, single-word searches, whose results were displayed in a rudimentary query/results page.

Searcharoo Version 2 focused on adding a 'spider' to find data to index by following web links (rather than
just looking at directory listings in the file system). This means downloading files via HTTP, parsing the
HTML to find more links and ensuring we don't get into a recursive loop because many web pages refer to
each other. That article also discusses how the results for multiple search words are combined into a single set of 'matches'.

Searcharoo Version 3 implemented a 'save to disk' function for the catalog, so it could be reloaded across
IIS application restarts without having to be generated each time. It also spidered FRAMESETs and added
Stop words, Go words and Stemming to the indexer. A number of bugs reported via CodeProject were also
fixed.

Introduction to version 4
Version 4 of Searcharoo has changed in the following ways (often prompted by CodeProject members):

1. It can now index/search Word, Powerpoint, PDF and many other file types, thanks to the
excellent Using IFilter in C# article by Eyal Post. This is probably the coolest bit of the whole
project - but all credit goes to Eyal for his excellent article.
2. It parses and obeys your robots.txt file (in addition to the robots META tag, which it already
understood) ( cool263).
3. You can 'mark' regions of your html to be ignored during indexing (xbit45).
4. There is a rudimentary effort to follow links hiding in javascript ( ckohler).
5. You can run the Spider locally via a command-line application, then upload the Catalog file to your server (useful if your server doesn't have all the IFilters installed to parse the documents you want indexed).
6. The code has been significantly refactored (thanks to encouragement from mrhassell and j105
Rob). I hope this makes it easier for people to read/understand and edit to add the stuff they need.

Some things to note

• You need Visual Studio 2005 to work with this code. In previous versions I tried to keep the code
in a small number of files, and structure it so it'd be easy to open/run in Visual WebDev Express
(heck, the first version was written in WebMatrix), but it's just getting too big. As far as I know,
it's still possible to shoehorn the code into VWD (with App_Code directory and assemblies from
the ZIP file) if you want to give it a try...
• I've included two projects from other authors: Eyal's IFilter code (from CodeProject and his blog
on bypassing COM) and the Mono.GetOptions code (nice way to handle Command Line
arguments). I do NOT take credit for these projects - but thank the authors for the hard work that
went into them, and for making the source available.
• The UI (Search.aspx) hasn't really changed at all (except for class name changes as a result of
refactoring) - I have a whole list of ideas & suggestions to improve it, but they will have to wait
for another day.

Design & Refactoring


The Catalog-File-Word design that supports searching the Catalog remains basically unchanged (from Version 1!); however, there has been a total reorganization of the classes used to generate the Catalog.

In version 3, all the code to download a file, parse the html, extract the links, extract the words, add them to the catalog and save the catalog was crammed into two classes (Spider and HtmlDocument).

Notice that the StripHtml() method was in the Spider class - which doesn't really make sense, does it?

This made it difficult to add the new functionality required for supporting IFilter (or any other document
types we might like to add) that don't have the same attributes as an Html page.

To 'fix' this design flaw, I pulled out all the Html-specific code from Spider and put it into
HtmlDocument. Then I took all the 'generic' document attributes (Title, Length, Uri, ...) and pushed them
into a superclass Document, from which HtmlDocument inherits. To allow Spider to deal
(polymorphically) with any type of Document, I moved the object creation code into the static
DocumentFactory so there is a single place where Document subclasses get created (so it's easy to
extend later). DocumentFactory uses the MimeType from the HttpResponse header to decide which class
to instantiate.
The Spider and HtmlDocument classes are now much neater. To give you an idea of how the code 'moved around': Spider went from 680 lines
to 420, HtmlDocument from 165 to 450, and the Document base became 135 lines - the total line count
has increased (as has the functionality) but what's important is the way relevant functions are encapsulated
inside each class.

The new Document class can then form the basis of any downloadable file type: it is an abstract class so
any subclass must at least implement the GetResponse() and Parse() methods:

• GetResponse() controls how the class gets the data out of the stream from the remote server (eg.
Text and Html is read into memory, Word/PDF/etc are written to a temporary disk location) and
text is extracted.
• Parse() performs any additional work required on the files contents (eg. remove Html tags, parse
links, etc).
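
To make the shape of that design concrete, here is a hedged sketch of what the abstract base class might look like. The property names follow the attributes mentioned above (Uri, Title, Length) plus the extracted text; the exact signatures in the real Searcharoo source may differ.

// Sketch of the abstract Document base class described above (illustrative only).
using System;

public abstract class Document
{
    private Uri _uri;
    private string _title;
    private long _length;
    private string _all;

    public Uri Uri { get { return _uri; } set { _uri = value; } }             // where the document came from
    public string Title { get { return _title; } set { _title = value; } }    // display title (or file name)
    public long Length { get { return _length; } set { _length = value; } }   // size in bytes
    public string All { get { return _all; } set { _all = value; } }          // the extracted text to be indexed

    // Reads the data out of the response stream (text into memory,
    // binary formats to a temporary file on disk) and extracts the text.
    public abstract bool GetResponse(System.Net.HttpWebResponse webresponse);

    // Performs any additional work on the contents
    // (strip Html tags, parse out links, etc.).
    public abstract void Parse();
}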

The first 'new' class is TextDocument, which is a much simpler version of HtmlDocument: it doesn't handle any encodings (it assumes ASCII) and doesn't parse out links or Html, so the two abstract methods are very simple! From there it was relatively easy to build the FilterDocument class to wrap the IFilter calls which allow many different file types to be read.

To demonstrate just how easy it was to extend this design to support IFilter, the FilterDocument class
inherits pretty much everything from Document and only needs to add a touch of code (below; most of
which is to download binary data, plus three lines courtesy of Eyal's IFilter sample). Points to note:

• BinaryReader is used to read the webresponse for these files (in HtmlDocument we use
StreamReader, which is intended for use with Text/Encodings)
• The stream is actually saved to disk (NOTE: you need to specify the temp folder in *.config, and
ensure your process has write permission there).
• The saved file location is what's passed to IFilter
• The saved file is deleted at the end of the method

public override void Parse()
{
    // no parsing (for now).
}

public override bool GetResponse (System.Net.HttpWebResponse webresponse)
{
    System.IO.Stream filestream = webresponse.GetResponseStream();
    this.Uri = webresponse.ResponseUri;
    string filename = System.IO.Path.Combine(Preferences.DownloadedTempFilePath
        , (System.IO.Path.GetFileName(this.Uri.LocalPath)));
    this.Title = System.IO.Path.GetFileNameWithoutExtension(filename);
    using (System.IO.BinaryReader reader = new System.IO.BinaryReader(filestream))
    {   // we must use BinaryReader to avoid corrupting the data
        using (System.IO.FileStream iofilestream
            = new System.IO.FileStream(filename, System.IO.FileMode.Create))
        {   // we must save the stream to disk in order to use IFilter
            int BUFFER_SIZE = 1024;
            byte[] buf = new byte[BUFFER_SIZE];
            int n = reader.Read(buf, 0, BUFFER_SIZE);
            while (n > 0)
            {
                iofilestream.Write(buf, 0, n);
                n = reader.Read(buf, 0, BUFFER_SIZE);
            }
            this.Length = iofilestream.Length;
            iofilestream.Close(); iofilestream.Dispose();
        }
        reader.Close();
    }
    try
    {
        EPocalipse.IFilter.FilterReader ifil
            = new EPocalipse.IFilter.FilterReader(filename);
        this.All = ifil.ReadToEnd();
        ifil.Close();
        System.IO.File.Delete(filename); // clean up
    }
    catch { }
    return true; // indicate the response was handled
}

And there you have it - indexing and searching of Word, Excel, Powerpoint, PDF and more in one easy
class... all the indexing and search results display work as before, unmodified!

"Rest of the Code" Structure

The refactoring extended way beyond the HtmlDocument class. The 31 or so files are now organised into
five (5!) projects in the solution:

• EPocalipse.IFilter - Unmodified from the Using IFilter in C# CodeProject article.

• Mono.GetOptions - Wrapped in a Visual Studio project file, but otherwise unmodified from a Mono source repository.

• Searcharoo - All Searcharoo code now lives in this project, in three folders: /Common/, /Engine/ and /Indexer/.

• Searcharoo.Indexer - NEW console application that allows the Catalog file to be built on a local PC (more likely to have a wide variety of IFilters installed), then copied to your website for searching. You could also create a scheduled task to regularly re-index your site (it's also great for debugging).

• WebApplication - The ASPX files used to run Searcharoo. They have been renamed to Search.aspx, SearchControl.ascx and SearchSpider.aspx. Add these files to your website, merge the web.config settings (update whatever you need to), ensure Searcharoo.DLL is added to your /bin/ folder, and make sure your website user account (ASPNET) has write permission to the web root.

New features & bug fixes

I, robots.txt

Previous versions of Searcharoo only looked in Html Meta tags for robot directives - the robots.txt file was
ignored. Now that we can index non-Html files, however, we need the added flexibility of disallowing
search in certain places. robotstxt.org has further reading on how the scheme works.

The Searcharoo.Indexer.RobotsTxt class has two main functions:

1. Check for, and if present, download and parse the robots.txt file on the site
2. Provide an interface for the Spider to check each Url against the robots.txt rules

Function 1 is accomplished in the RobotsTxt class constructor - it reads through every line in the file (if
found), discards comments (indicated by a hash '#') and builds an Array of 'url fragments' that are to be
disallowed.
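
A sketch of that parsing step is shown below; the field and method names are illustrative assumptions rather than the exact Searcharoo source, but the logic follows the description above (read each line, drop '#' comments, collect the Disallow: fragments).

// Sketch of the robots.txt parsing described above (illustrative only).
//
// Example robots.txt:
//   # keep the spider out of the admin area
//   User-agent: *
//   Disallow: /admin/
//   Disallow: /tmp/
using System;
using System.Collections.Generic;
using System.Net;

public class RobotsTxt
{
    private readonly List<string> _denyUrls = new List<string>();

    public RobotsTxt(Uri siteRoot)
    {
        string robotsUrl = new Uri(siteRoot, "/robots.txt").ToString();
        string content;
        try
        {
            using (WebClient client = new WebClient())
            {
                content = client.DownloadString(robotsUrl);
            }
        }
        catch (WebException)
        {
            return; // no robots.txt - nothing is disallowed
        }

        foreach (string rawLine in content.Split('\n'))
        {
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);   // discard comments
            line = line.Trim();
            if (line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            {
                string fragment = line.Substring("Disallow:".Length).Trim().ToLower();
                if (fragment.Length > 0) _denyUrls.Add(fragment);
            }
        }
    }
}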

Function 2 is exposed by the Allowed() method below:

public bool Allowed (Uri uri)
{
    if (_DenyUrls.Count == 0) return true;

    string url = uri.AbsolutePath.ToLower();
    foreach (string denyUrlFragment in _DenyUrls)
    {
        if (url.Length >= denyUrlFragment.Length)
        {
            if (url.Substring(0, denyUrlFragment.Length) == denyUrlFragment)
            {
                return false;
            } // else not a match
        } // else url is shorter than fragment, therefore cannot be a 'match'
    }
    if (url == "/robots.txt") return false;
    // no disallows were found, so allow
    return true;
}

There is no explicit parsing of Allow: directives in the robots.txt file - so there's a little more work to do there.

Ignoring a NOSEARCHREGION

In HtmlDocument.StripHtml(), this new clause (along with the relevant settings in .config) will cause the indexer to skip over parts of an Html file surrounded by Html comments of the (default) form <!--SEARCHAROONOINDEX-->text not indexed<!--/SEARCHAROONOINDEX-->.

if (Preferences.IgnoreRegions)
{
    string noSearchStartTag = "<!--" + Preferences.IgnoreRegionTagNoIndex + "-->";
    string noSearchEndTag = "<!--/" + Preferences.IgnoreRegionTagNoIndex + "-->";
    string ignoreregex = noSearchStartTag + @"[\s\S]*?" + noSearchEndTag;
    System.Text.RegularExpressions.Regex ignores =
        new System.Text.RegularExpressions.Regex(ignoreregex
            , RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture);
    ignoreless = ignores.Replace(styleless, " ");
    // replaces the whole commented region with a space
}

Links inside the region are still followed - to stop the Spider searching specific links, use robots.txt.

Follow Javascript 'links'

In HtmlDocument.Parse(), the following code has been added inside the loop that matches anchor tags. It's a very rough piece of code which looks for the first apostrophe-quoted string inside an onclick="" attribute (eg. onclick="window.location='top.htm'") and treats it as a link.

if ("onclick" == submatch.Groups[1].ToString().ToLower())
{ // maybe try to parse some javascript in here
string jscript = submatch.Groups[2].ToString();
// some code here to extract a filename/link to follow from the
// onclick="_____"
int firstApos = jscript.IndexOf("'");
int secondApos = jscript.IndexOf("'", firstApos + 1);
if (secondApos > firstApos)
{
link = jscript.Substring(firstApos + 1, secondApos - firstApos - 1);
}
}
It would be almost impossible to predict the infinite variety of javascript links in use, but this code should hopefully provide a basis for people to modify to suit their own site (most likely where tricky menu image rollovers or similar bypass the regular href behaviour). At worst it will extract something that isn't a real page and get a 404 error...

Multilingual 'option'

Culture note: in the last version I was really focused on reducing the index size (and therefore the size of the Catalog on disk and in memory). To that end, I hardcoded the following Regex.Replace(word, @"[^a-z0-9,.]", "") statement which aggressively removes 'unindexable' characters from words. Unfortunately, if you are using Searcharoo in any language other than English, this Regex is so aggressive that it will delete a lot (if not ALL) of your content, leaving only numbers and spaces!

I've tried to improve the usability of that a bit by making it an option in the .config:

<add key="Searcharoo_AssumeAllWordsAreEnglish" value="true" />
which governs this method in the Spider:
private void RemovePunctuation(ref string word)
{ // this stuff is a bit 'English-language-centric'
if (Preferences.AssumeAllWordsAreEnglish)
{ // if all words are english, this strict parse to remove all
// punctuation ensures words are reduced to their least
// unique form before indexing
word = System.Text.RegularExpressions.Regex.Replace(word,
@"[^a-z0-9,.]", "",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
}
else
{ // by stripping out this specific list of punctuation only,
// there is potential to leave lots of cruft in the word
// before indexing BUT this will allow any language to be indexed
word = word.Trim
(' ','?','\"',',','\'',';',':','.','(',')','[',']','%','*','$','-');
}
}

In future I'd like to make Searcharoo more language aware, but for now hopefully this will at least make it
possible to use the code in a non-English-language environment.
Searcharoo.Indexer.EXE

The console application is a wrapper that performs the exact same function as SearchSpider.aspx (now
that all the code has been refactored out of the ASPX and into the Searcharoo 'common' project). The actual
console program code is extremely simple:

CommandLinePreferences clip = new CommandLinePreferences();
clip.ProcessArgs(args);
Spider spider = new Spider();
spider.SpiderProgressEvent += new SpiderProgressEventHandler(OnProgressEvent);
Catalog catalog = spider.BuildCatalog(new Uri(Preferences.StartPage));

That's almost identical to the SearchSpider.aspx web-based indexer interface.

The other code you'll find in the Searcharoo.Indexer project relates to parsing the command line arguments using Mono.GetOptions, which turns an attribute-adorned options class into a well-behaved console application with hardly an additional line of code.

When it runs, the indexer behaves just as SearchSpider.aspx does: you'll see output as it follows links and indexes text from each page in your website. The verbosity setting allows you to control how much 'debug' information is presented:

-v:0 None: totally silent (no console output)
-v:1 Minimal: just page names and word counts
-v:2 Informational: some error information (eg. 403, 404)
-v:3 Detailed: more exception and other info (eg. cookie errors)
-v:4 VeryDetailed: still more (eg. robot META exclusions)
-v:5 Verbose: outputs the extracted words from each document - VERY VERBOSE

NOTE: the exe has its own Searcharoo.Indexer.exe.config file, which would normally contain exactly the same settings as your web.config. You may want to consider using the Indexer if your website contains lots of IFilter documents (Word, Powerpoint, PDF) and you get errors when running SearchSpider.aspx on the server because it does not have the IFilters available. The catalog output file (searcharoo.dat, or whatever your .config says) can be FTPed to your server, where it will be loaded and searched!

References

There's a lot to read about IFilter and how it works (or doesn't work, as the case may be). Start with Using IFilter in C# and its references: Using IFilter in C# by bypassing COM for references to LoadIFilter, IFilter.org and IFilter Explorer (dotLucene also has file parsing references).

Searcharoo now has its own site - searcharoo.net - where you can actually try a working demo, and possibly find small fixes and enhancements that aren't groundbreaking enough to justify a new CodeProject article...

Wrap-up

Hopefully you find the new features useful and the article relevant. Thanks again to the authors of the other
open-source projects used in Searcharoo.

History

• 2004-06-30: Version 1 on CodeProject


• 2004-07-03: Version 2 on CodeProject
• 2006-05-24: Version 3 on CodeProject
• 2007-03-18: Version 4 (this page) on CodeProject

License

This article, along with any associated source code and files, is licensed under The Code Project Open
License (CPOL)


1.2 SCOPE
There are some important things to note about these search engines.

1. Each uses a different system to rank pages.

2. Because different systems are used, a high ranking for specific keywords in one search engine does not automatically mean that your page will rank highly for the same keywords in another search engine.

3. Nevertheless, each uses similar principles to determine the relevancy and importance of web pages in relation to search queries.

Earlier, we began to show you how search engines work. For the sake of simplicity, we can consider the search process to work something like the following:

1. The search engine spiders the web.
2. The search engine caches the pages that it spiders on its servers.
3. The user enters a search query.
4. The search engine checks the search query against its index.
5. The search engine returns what it believes to be the most relevant results for that query.

Although the process is actually more complex than this, the above outline is useful in helping us to visualise how searches work, and more so in reminding us that when we enter a search term, the search engine does not actually rush off and check every page on the web. That would take far too long. Instead, it checks your search term against an index that is stored on its servers. Spiders working their way around the web constantly update this index.

If I carry out a search for cheap web-hosting, the search engine checks its index to see which pages
carry the terms ‘cheap’, ‘web’ and ‘hosting’. It then returns a results page containing what it believes
are the most relevant pages for these particular keywords.
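
Conceptually, that index lookup works like the small sketch below: each keyword maps to the set of pages that contain it, and the engine intersects those sets before ranking them. The data structure and class names here are simplified assumptions for illustration; real engines add ranking, stemming, stop words and much more.

// A highly simplified sketch of checking a query against an inverted index.
using System;
using System.Collections.Generic;

public class TinyIndex
{
    // keyword -> set of page URLs containing that keyword
    private readonly Dictionary<string, HashSet<string>> _index =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    public void Add(string keyword, string pageUrl)
    {
        HashSet<string> pages;
        if (!_index.TryGetValue(keyword, out pages))
        {
            pages = new HashSet<string>();
            _index[keyword] = pages;
        }
        pages.Add(pageUrl);
    }

    // Returns the pages that contain every word in the query.
    public ICollection<string> Search(params string[] queryWords)
    {
        HashSet<string> result = null;
        foreach (string word in queryWords)
        {
            HashSet<string> pages;
            if (!_index.TryGetValue(word, out pages)) return new string[0];
            if (result == null) result = new HashSet<string>(pages);
            else result.IntersectWith(pages);
        }
        if (result == null) return new string[0];
        return result;
    }
}

For the 'cheap web-hosting' example, the lookup would amount to index.Search("cheap", "web", "hosting"), followed by a ranking step.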

1.3 Proposed System


We suggested earlier that the vast majority of Internet users use search engines to locate products or services. This free system of listings is a more popular method of locating sites than paid-for advertising such as PPC, and is thus a better way of improving the visibility of your website. But which search engines do you want to be found by, and which search engines should you target?

Although the majority of Internet users rely on search engines to find what they are looking for, they do not all use the same search engines. There are, in fact, numerous search engines out there, all vying for a share in the lucrative search engine market; Google, Yahoo and MSN are among the most widely used.

There are two main factors that search engines use to determine the position that pages will gain in
search results:

• Keyword relevancy
• Page importance or link popularity

As we noted above, when you carry out a search query, the search engine tries to return relevant pages
for that query by returning pages that contain the keywords in your search query.

However, search engines also take the importance of the page into account when ranking pages. This
importance is based on the number of external links pointing to a page. The more links pointing to
your pages, the more important they are deemed to be by the search engine.

• Search engines allow us to search the web by entering search queries that the search engine
compares against its index of web pages.
• The leading search engines are currently Google, Yahoo, and MSN.
• Crawler-based search engines use software called spiders to crawl the web and index web
pages.
• Search engines use complex mathematical algorithms to rank web pages.
• Search engine ranking is based on a combination of page relevance and page importance.
• Page importance (or PageRank) is based on the link popularity of a web page and the quantity
and quality of external links pointing to that page.
• PageRank is calculated on a per-page basis and does not apply to websites as a whole.
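
The PageRank idea in the last two points can be illustrated with the classic simplified iteration below: each page's rank is shared out over its outgoing links, with a damping factor. This is a sketch of the general idea only, not Google's actual algorithm.

// A simplified PageRank iteration, for illustration only.
// links[page] holds the pages that 'page' links out to.
using System.Collections.Generic;

public static class SimplePageRank
{
    public static Dictionary<string, double> Compute(
        Dictionary<string, List<string>> links, int iterations, double damping)
    {
        int n = links.Count;
        Dictionary<string, double> rank = new Dictionary<string, double>();
        foreach (string page in links.Keys) rank[page] = 1.0 / n;

        for (int i = 0; i < iterations; i++)
        {
            Dictionary<string, double> next = new Dictionary<string, double>();
            foreach (string page in links.Keys) next[page] = (1 - damping) / n;

            foreach (KeyValuePair<string, List<string>> kv in links)
            {
                if (kv.Value.Count == 0) continue; // dangling page: its rank is not redistributed in this sketch
                double share = rank[kv.Key] / kv.Value.Count;
                foreach (string target in kv.Value)
                {
                    if (next.ContainsKey(target))
                        next[target] += damping * share; // rank flows along each outgoing link
                }
            }
            rank = next;
        }
        return rank;
    }
}

A typical call would be Compute(links, 20, 0.85): pages with many (and highly ranked) incoming links end up with the highest scores.
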
A typical Google results page for the above search is set out as follows:
1. Search box with our search query
2. The number of results Google returned for our search query (plus the time the search took)
3. Sponsored links. This is paid-for advertising. For this results page, Google has selected
adverts that are relevant to our search query.
4. Search results. This section shows the pages that Google thinks are most relevant to our
particular search terms. These listings are free.
5. Link/Page title. The text is the exact text that appears between the title tags (<title></title>)
on the page that the search result links to. Notice how keywords from our search query have been
highlighted.
6. Page description. This text is commonly the actual text that appears in the meta description of
the page that the search result links to. This is the text between the quotation marks in the HTML tag
<META NAME="description" content="YOUR TEXT HERE">. Again, Google has matched this text
with our search query.
7. Domain. This is the address of the page linked to.
8. Cached page link. Unlike the above link, which links to the domain that the page is on, this link takes
us to the cached version of the page that Google has stored on its server.
9. More results. Links to further pages of results

2. Literature Survey

2.1 Technology Used


2.1.1 .NET Framework

.NET is a "Software Platform". It is a language-neutral environment for developing rich .NET experiences
and building applications that can easily and securely operate within it. When developed applications are
deployed, those applications will target .NET and will execute wherever .NET is implemented instead of
targeting a particular Hardware/OS combination. The components that make up the .NET platform are
collectively called the .NET Framework.

The .NET Framework is a managed, type-safe environment for developing and executing applications. The .NET Framework manages all aspects of program execution, such as allocating memory for the storage of data and instructions, granting and denying permissions to the application, managing execution of the application, and reclaiming memory from resources that are no longer needed.

The .NET Framework is designed for cross-language compatibility. Cross-language compatibility means that an application written in Visual Basic .NET may reference a DLL file written in C# (C-Sharp), and a Visual Basic .NET class might be derived from a C# class or vice versa.

The .NET Framework consists of two main components:

Common Language Runtime (CLR)


Class Libraries

Common Language Runtime (CLR)

The CLR is described as the "execution engine" of .NET. It's this CLR that manages the execution of
programs. It provides the environment within which the programs run. The software version of .NET is
actually the CLR version.

Working of the CLR

When a .NET program is compiled, the output of the compiler is not an executable file but a file that contains a special type of code called Microsoft Intermediate Language (MSIL). MSIL defines a set of portable instructions that are independent of any specific CPU. It is the job of the CLR to translate this intermediate code into executable code when the program is run, allowing the program to run in any environment for which the CLR is implemented. That is how the .NET Framework achieves portability. The MSIL is turned into executable code using a JIT (Just-In-Time) compiler. The process goes like this: when a .NET program is executed, the CLR activates the JIT compiler, which converts MSIL into native code on demand as each part of the program is needed. Thus the program executes as native code even though it was compiled into MSIL, running as fast as it would if it had been compiled to native code while retaining the portability benefits of MSIL.

Class Libraries

The class library is the second major entity of the .NET Framework. It gives programs access to the runtime environment. The class library consists of a large amount of prewritten code that all applications created in VB.NET and Visual Studio .NET use. The code for elements such as forms and controls in VB.NET applications actually comes from the class library.

Common Language Specification (CLS)

If we want the code we write in one language to be used by programs in other languages, then it should adhere to the Common Language Specification (CLS). The CLS describes a set of features that different languages have in common. The CLS includes a subset of the Common Type System (CTS), which defines the rules concerning data types and ensures that code is executed in a safe environment.

Some reasons why developers are building applications using the .NET Framework:
o Improved Reliability
o Increased Performance
o Developer Productivity
o Powerful Security
o Integration with existing Systems
o Ease of Deployment
o Mobility Support
o XML Web service Support
o Support for over 20 Programming Languages
o Flexible Data Access

Minimum System Requirements to Install and Use Visual Studio .NET

The minimum requirements are:


RAM: 512 MB (Recommended)
Processor: Pentium II 450 MHz
Operating System: Windows 2000 or Windows XP
Hard Disk Space: 3.5 GB (Includes 500 MB free space on disk)

2.1.2 MS SQL SERVER

Microsoft SQL Server is an application used to create computer databases for the Microsoft Windows family of server operating systems. It provides an environment used to generate databases that can be accessed from workstations, the web, or other media such as a personal digital assistant (PDA).

Before using a database, you must first have one. A database is primarily a group of computer files that each have a name and a location. Just as there are different ways to connect to a server, there are also different ways to create a database.

SQL is short for Structured Query Language and is a widely used database language, providing means of
data manipulation (store, retrieve, update, delete) and database creation.

Almost all modern Relational Database Management Systems like MS SQL Server, Microsoft Access,
MSDE, Oracle, DB2, Sybase, MySQL, Postgres and Informix use SQL as standard database language.
Now a word of warning here: although all those RDBMSs use SQL, they use different SQL dialects. For example, MS SQL Server's version of SQL is called T-SQL, Oracle's version is called PL/SQL, and MS Access's version is called JET SQL.

Microsoft SQL Server is a Relational Database Management System (RDBMS) designed to run on
platforms ranging from laptops to large multiprocessor servers. SQL Server is commonly used as the
backend system for websites and corporate CRMs and can support thousands of concurrent users.
SQL Server comes with a number of tools to help you with your database administration and programming
tasks.

SQL Server is much more robust and scalable than a desktop database management system such as Microsoft Access. Anyone who's ever tried using Access as the backend to a website will probably be familiar with the errors that are generated when too many users try to access the database!

Although SQL Server can also be run as a desktop database system, it is most commonly used as a server
database system.

Server Database Systems

Server based database systems are designed to run on a central server, so that multiple users can access the
same data simultaneously. The users normally access the database through an application.

For example, a website could store all its content in a database. Whenever a visitor views an article, they
are retrieving data from the database. As you know, websites aren't normally limited to just one user. So, at
any given moment, a website could be serving up hundreds, or even thousands of articles to its website
visitors. At the same time, other users could be updating their personal profile in the members' area, or
subscribing to a newsletter, or anything else that website users do.

Generally, it's the application that provides the functionality to these visitors. It is the database that stores
the data and makes it available. Having said this, SQL Server does include some useful features that can
assist the application in providing its functionality.

Database -

A database is a coherent collection of data with some inherent meaning, designed, built and populated with data for a specific purpose. It stores data that is useful to the user; this data is only a part of the entire data available in the world around us.

Database Management System -

A Database Management System (DBMS) consists of a collection of interrelated data and a set of programs to access those data. The primary goal of a DBMS is to provide an environment that is both convenient and efficient to use in retrieving and storing database information.

Objectives of database: -
The objectives of a database are: -

• It reduces data redundancy and inconsistency.
• It eliminates difficulty in accessing data.
• It helps in isolation of data so that there is no difficulty in writing new applications due to scattered data.

View of data (data abstraction) -

The major purpose of a database system is to provide users with an abstract view of data. For a usable system, the retrieval of data must be efficient. Since many database-system users are not computer trained, the complexity of the data has to be hidden through several levels of abstraction, to simplify the users' interaction with the system. The data can be classified into the following levels of abstraction:

 Physical level - The lowest level of abstraction describes how the data are actually stored. The record or data is described in terms of blocks of consecutive storage locations such as words or bytes.
 Logical level - The next level of abstraction describes what data are stored in the database, and what relationships exist among those data.

Drawback of DBMS -

In the early days of computing, the DBMSs used to manage data were of the hierarchic or network model. When these were placed onto network operating systems and multiple users began to access table data concurrently, the DBMS responded very sluggishly and became unstable when the number of users exceeded four or five. This caused commercial application developers to abandon the use of a DBMS to manage data and switch over to other programming environments to develop software that had to be used by multiple users concurrently.

Evolution of RDBMS -

The mathematician E.F. Codd applied the principles of relationships in statistics to data management and came up with twelve laws. Using mathematics, he proved that if all these 12 laws were incorporated into database core technology, there would be a revolution in the speed of any DBMS while managing data, even when used on network operating systems.

INTRODUCTION OF SQL SERVER

 Introduction to SQL and Its Tools:

The SQL SERVER product is primarily divided into:

 SQL Server Tools


 SQL Client Tools

 SQL Server:

SQL Server is one of the most widely used server-based, multi-user RDBMS products. The SQL Server engine is a program installed on the server's hard disk drive. This program must be loaded into RAM so that it can process user requests.

The SQL Server product is available as either SQL Server Workgroup Edition or SQL Server Enterprise Edition.

The functionality of both these products is identical. However, the Workgroup edition restricts the number of concurrent users who can query the server, while the Enterprise edition has no such restriction. Either product must be loaded on a multi-user operating system.

The SQL SERVER Server takes care of the following:

 Updating the database.


 Retrieving information from the database.
 Accepting query language statements.
 Enforcing security specifications.
 Enforcing data integrity specifications.
 Enforcing transaction consistency.
 Managing data sharing.
 Optimizing queries.
 Managing system catalogs.
 SQL SERVER Client Tools:

Once the SQL SERVER engine is loaded into the server’s memory, users would have to
log into the engine to get work done. SQL SERVER Corporation has several client-based
tools that facilitate this. The client tool most commonly used for Commercial Application
Development is called SQL SERVER Developer 2000.

SQL*Plus is a separate client-side tool. It is a product that works on Microsoft Windows 95 and Windows NT, both of which are standard client-based GUI operating systems.

 What is SQL Used for:

Using SQL one can create and maintain data manipulation objects such as tables, views, sequences, etc. These data manipulation objects are created and stored on the server's hard disk drive, in a table space to which the user has been assigned.

Once these data manipulation objects are created, they are used extensively in
commercial applications.

 DML, DCL, DDL:

In addition to the creation of data manipulation objects, the actual manipulation of data
within these objects is done using SQL.

The SQL sentences used to create these objects are called DDL (Data Definition Language) statements. The SQL sentences used to manipulate data within these objects are called DML (Data Manipulation Language) statements. The SQL sentences used to control the behavior of these objects are called DCL (Data Control Language) statements.

Hence, once access to the SQL*Plus tool is available and SQL syntax is known, the
creation of data storage and the manipulation of data within the storage system, required
by commercial applications, is possible.
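
As a concrete illustration of issuing DDL and DML statements from an application, the sketch below uses ADO.NET against SQL Server. The connection string, table and column names are hypothetical examples, not part of this project.

// Illustrative only: issuing one DDL and one DML statement from C# via ADO.NET.
using System.Data.SqlClient;

class SqlDemo
{
    static void Main()
    {
        // Hypothetical connection string - adjust server and database names as needed.
        string connectionString = @"Server=.\SQLEXPRESS;Database=SearchDemo;Integrated Security=true";
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // DDL: define a data manipulation object (a table).
            string ddl = "CREATE TABLE Items (ItemId INT PRIMARY KEY, ItemName VARCHAR(50))";
            using (SqlCommand create = new SqlCommand(ddl, conn))
                create.ExecuteNonQuery();

            // DML: manipulate the data inside that object.
            string dml = "INSERT INTO Items (ItemId, ItemName) VALUES (@id, @name)";
            using (SqlCommand insert = new SqlCommand(dml, conn))
            {
                insert.Parameters.AddWithValue("@id", 10);
                insert.Parameters.AddWithValue("@name", "Apple");
                insert.ExecuteNonQuery();
            }
        }
    }
}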

SQL SERVER and Client/Server

SQL SERVER Corporation's reputation as a database company is firmly established in its full-featured,
high-performance RDBMS server. With the database as the cornerstone of its product line, SQL SERVER
has evolved into more than just a database company, complementing its RDBMS server with a rich offering
of well-integrated products that are designed specifically for distributed processing and client/server
applications. As SQL SERVER's database server has evolved to support large-scale enterprise systems for
transaction processing and decision support, so too have its other products, to the extent that SQL SERVER
can provide a complete solution for client/server application development and deployment. This chapter
presents an overview of client/server database systems and the SQL SERVER product architectures that
support their implementation.

An Overview of Client/Server Computing

The premise of client/server computing is to distribute the execution of a task among multiple processors in
a network. Each processor is dedicated to a specific, focused set of subtasks that it performs best, and the
end result is increased overall efficiency and effectiveness of the system as a whole. Splitting the execution
of tasks between processors is done through a protocol of service requests; one processor, the client,
requests a service from another processor, the server. The most prevalent implementation of client/server
processing involves separating the user interface portion of an application from the data access portion.

On the client, or front end, of the typical client/server configuration is a user workstation operating with a
Graphical User Interface (GUI) platform, usually Microsoft Windows, Macintosh, or Motif. At the back
end of the configuration is a database server, often managed by a UNIX, Netware, Windows NT, or VMS
operating system.

Client/server architecture also takes the form of a server-to-server configuration. In this arrangement, one
server plays the role of a client, requesting database services from another server. Multiple database servers
can look like a single logical database, providing transparent access to data that is spread around the
network.

Designing an efficient client/server application is somewhat of a balancing act, the goal of which is to
evenly distribute execution of tasks among processors while making optimal use of available resources.
Given the increased complexity and processing power required to manage a graphical user interface (GUI)
and the increased demands for throughput on database servers and networks, achieving the proper
distribution of tasks is challenging. Client/server systems are inherently more difficult to develop and
manage than traditional host-based application systems because of the following challenges:

• The components of a client/server system are distributed across more varied types of processors. There are many more software components that manage client, network, and server functions, as well as an array of infrastructure layers, all of which must be in place and configured to be compatible with each other.

• The complexity of GUI applications far outweighs that of their character-based predecessors. GUIs are capable of presenting much more information to the user and providing many additional navigation paths to elements of the interface.

• Troubleshooting performance problems and errors is more difficult because of the increased number of components and layers in the system.

Databases in a Client/Server Architecture

Client/server technologies have changed the look and architecture of application systems in two ways. Not
only has the supporting hardware architecture undergone substantial changes, but there have also been
significant changes in the approach to designing the application logic of the system.

Prior to the advent of client/server technology, most SQL SERVER applications ran on a single node.
Typically, a character-based SQL*Forms application would access a database instance on the same
machine with the application and the RDBMS competing for the same CPU and memory resources. Not
only was the system responsible for supporting the entire database processing, but it was also responsible
for executing the application logic. In addition, the system was burdened with all the I/O processing for
each terminal on the system; the same processor that processed database requests and application logic
controlled each keystroke and display attribute.

Client/server systems change this architecture considerably by splitting the entire interface management
and much of the application processing from the host system processor and distributing it to the client
processor.

Combined with the advances in hardware infrastructure, the increased capabilities of RDBMS servers have
also contributed to changes in the application architecture. Prior to the release of SQL SERVER7, SQL
SERVER's RDBMS was less sophisticated in its capability to support the processing logic necessary to
maintain the integrity of data in the database. For example, primary and foreign key checking and
enforcement was performed by the application. As a result, the database was highly reliant on application
code for enforcement of business rules and integrity, making application code bulkier and more complex.
Figure 2.1 illustrates the differences between traditional host-based applications and client/server
applications. Client/server database applications can take advantage of the SQL SERVER7 server features
for implementation of some of the application logic.

3. Analysis

3.1 Process:-

Process: New Admin. Session / Administrator Security Sign_in


Used by the system to create a new session, session history, etc., when an Administrator logs in to the system.

Data-flow: User_Id
The User's (Administrator's) Id and Password are sent to the process for validation.

Attributes: User_Id (An Administrator’s)


Password

Data-flow: Validation

If the User_Id and Password that were passed are valid, i.e. they were checked against the ‘Current User Table’ and the user was found to be an Administrator, a ‘new’ Session will be created for the Administrator by creating an entry in the ‘User Table’.

Attributes: User_Id
User_Type
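
A hedged sketch of how this validation step might be implemented is shown below. The table and column names are loose assumptions based on the data-flow attributes (User_Id, Password, User_Type), not the project's actual schema.

// Sketch of the security sign-in validation described above (illustrative only).
using System.Data.SqlClient;

public class SecuritySignIn
{
    private readonly string _connectionString;

    public SecuritySignIn(string connectionString)
    {
        _connectionString = connectionString;
    }

    // Returns the User_Type (e.g. "Administrator" or "User") when the
    // credentials are valid, otherwise null.
    public string Validate(string userId, string password)
    {
        const string sql =
            "SELECT User_Type FROM CurrentUserTable " +
            "WHERE User_Id = @id AND Password = @pwd";
        using (SqlConnection conn = new SqlConnection(_connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@id", userId);
            cmd.Parameters.AddWithValue("@pwd", password); // a real system should store password hashes
            conn.Open();
            object result = cmd.ExecuteScalar();
            return result == null ? null : result.ToString();
        }
    }
}

When validation succeeds, a new row would then be inserted into the 'User Table' to record the session, as described above.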

Process: New User Session / Repository Users Security Sign_in

Used by the system to create a new session, session history, etc., when a User logs in to the system.

Data-flow: User_Id
The User's Id and Password are sent to the process for validation.

Attributes: User_Id
Password

Data-flow: Validation

If the User_Id and Password that were passed are valid, i.e. they were checked against the ‘Current User Table’ and the user was found to be a User, a ‘new’ Session will be created for the User by creating an entry in the ‘User Table’.
Attributes: User_Id
User_Type

Process: Next Session


Used when the system has to figure out which screen to display next, based on which user is requesting it and the relevant session history of screens.

Data-flow: Session_Id

Attributes: Session_Id

Process: User Log-out

Used to log a user out from the system and subsequently clean up the system after they have 'logged out'.

Data-flow: User_Id
The User_Id is used to clean up the system and 'log out' the user cleanly. The ‘User Table’ and the ‘Session History’ will be cleared of the relevant details.

Attributes: User_Id

Process: Activate User

Used to record the fact that a user is currently active on the system.

Data-flow: User_Details
The user details are sent from the ‘User Table’
Attributes: User_Name
User_Access_Type

Data-flow: User_Id

The ‘User Table’ is updated using the User_Id to mark the fact that the
user is enabled in the "User Enabled" field of the ‘User Table’.
Attributes: User_Id
User_Enabled

Process: Add User


Used to add a new user to the system including all his/her relevant details.

Data-flow: User_Details
The Administrator enters the details of the user via the ‘Add User Screen’
so that the Process can add them to the ‘User Table’.

Attributes: User_Id
Full_Name
Password
User_Access_Type
User_Enabled

Process: Delete User

Used to delete both record and the details of an existing user from the
system.

Data-flow: User_Id
The user id or a ‘wildcard’ is entered and passed to this process, which in turn passes it to the Search Process to find a user or list of users matching it, which the Administrator may wish to delete. The Administrator clicks on an entry in the returned list to delete that user from the system.

Attributes: User_Id

Data-flow: User_Details

The process returns the matching entries from the ‘User Table’ in a list, from which the Administrator clicks on the one(s) s/he wishes to delete from the system, as described above.

Attributes: (for each user)


User_Id
Full_Name
User_Access_Type
User_Enabled

Process: Update User


Used to modify the details of an existing user.

Data-flow: User_Details
The details are sourced from the ‘User Table’ and returned to the screen
whereupon the Administrator can modify them.

Attributes: User_Id
Password
Full_Name
User_Access_Type
User_Enabled

Data-flow: Updated_User

The modified details are returned to the ‘User Table’ and thus the changes
are now reflected in the system. I.e. the details for that user are updated.

Attributes: User_Id
Password
Full_Name
User_Access_Type
User_Enabled

Process: Find User

Used to find an existing user or users on the system and return the details
to the screen.

Data-flow: User_Id
The User_Id is accepted from the Find User screen and may contain ‘wildcards’. It is then used in the search process to find any user(s) who match.

Attributes: User_Id

Data-flow: User_Details

The details of the resulting match(es) are passed back to the screen in list format.

Attributes: User-Id
Full_Name
User_Access_Type
User_Enabled

Process: Get Session

Takes in the current Session’s Number so as to return the current Session’s


Id from ‘Current Session’.

Data-flow: SessionNo

Attributes: Session_No

Data-flow: SessionId

Attributes: Session_Id

Process: Current Session

Receives the current Session’s Number from ‘Get Session’ and returns the
current Session’s Id.

Data-flow: SessionNo

Attributes: Session_No

Data-flow: SessionId

Attributes: Session_Id

Process: Search for Administrator

On receipt of User ID from the Administrator the ID is validated. If the ID


matches an ID in the User Table as being an Administrator’s ID, then the
details of that Administrator (or Administrators if wild cards were used)
are returned to the respective screen for viewing, or for further updating
by the Administrator.

Data-flow: Administrator_Details_Request
A request from an Administrator for an Administrator’s Details.
Attributes: User ID

Data-flow: User_Details

The details pertaining to the Administrator (or Administrators if wild cards


were used) that were requested.

Attributes: Full Name


Password
User Access Type
User Enabled

Process: Search for User

On receipt of the User ID from an Administrator the ID is validated. If the


ID matches an ID in the User Table as being a User, then the details for
that User (or users if wild cards are used) are returned to the respective
screen for viewing or for further updating by the Administrator.

Data-flow: User_Details_Request
A request from an Administrator for User Details.

Attributes: User ID

Data-flow: User_Details

The details pertaining to the user (or Users if wilds cards are used) that
were requested.

Attributes: User ID
Full Name
Password
User Access Type
User Enabled

Process: Search for Documents


On receipt of the Document ID or a Search-string from the User (either a
User or
an Administrator), the system will search for documents matching these
parameters. Then it (the System) will return a list of the Names of these
document(s) with Summary-details, so that the reader can select which one
to access and read.

Data-flow: Document_Request
A request from a User for a Document.

Attributes: Document ID and/or Search String

The Document can be searched for either by using the ID (or a wild card for the ID), or by submitting a Search-string on its own, or as a complement to the ID wild card.

Data-flow: Document_Details
The Document Details matching the required parameters are returned to
the respective screen in list form where the user may select one from the
list to read.

Attributes: Document ID
Document Type
Document Summary
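
A hedged sketch of this document search is shown below; the table and column names are illustrative assumptions taken from the attributes above, and the wildcard handling simply reuses SQL's LIKE operator.

// Sketch of the "Search for Documents" process (illustrative only).
using System.Collections.Generic;
using System.Data.SqlClient;

public class DocumentSearch
{
    private readonly string _connectionString;

    public DocumentSearch(string connectionString)
    {
        _connectionString = connectionString;
    }

    // searchTerm may contain '%' wildcards; it is matched against either the
    // document id or the summary text, as described in the data-flow above.
    public List<string> Find(string searchTerm)
    {
        List<string> results = new List<string>();
        const string sql =
            "SELECT Document_Id, Document_Type, Document_Summary FROM DocumentTable " +
            "WHERE Document_Id LIKE @term OR Document_Summary LIKE @term";
        using (SqlConnection conn = new SqlConnection(_connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@term", searchTerm);
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    results.Add(string.Format("{0} ({1}): {2}",
                        reader["Document_Id"], reader["Document_Type"], reader["Document_Summary"]));
                }
            }
        }
        return results;
    }
}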

3.2 Analysis. DFD - 1.


4. DESIGN

USER LOGIN:-
5. Future of Search Engines

Future of Search Engine Technology

While many smaller search engines are surfacing, the onus of taking search engine technology to the next level lies with major search engines like Google, Yahoo and Microsoft's MSN.

Information technologists believe that the search results we receive in the not-too-distant future will make present search engine technology appear primitive and cumbersome. However, in order to achieve this new search technology, consumers must be forthcoming and shed their apprehensions about the protection of their privacy.

Picture a scenario where Google is able to track and monitor the web sites a consumer views and maintains a log of all of their search queries. This type of personalized information could greatly improve the relevancy of the results displayed by the search engine to that consumer. It may be worth giving up a part of one's privacy if it results in search engines returning more relevant results and saving time.

6. HARDWARE REQUIREMENTS

Minimum System Requirements to Install and Use Visual Studio .NET

The minimum requirements are:

RAM: 512 MB (Recommended)


Processor: Pentium III 450 MHz
Operating System: Windows 2000 or Windows XP
Hard Disk Space: 3.5 GB (Includes 500 MB free space on disk)

Software Requirement:

FRONT END : ASP.NET USING C#.NET

BACK END : MS SQL SERVER 2005 / MS-ACCESS


7. Examples of search engines

• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, Yahoo!): search by keywords; limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).
• Question answering systems (Ask, NSIR, Answerbus): search in (restricted) natural language.
• Clustering systems (Vivísimo, Clusty)
• Research systems (Lemur, Nutch)

8. Conclusion
This article considers a paid placement strategy for search engines. On the one hand, paid
placement appears to be a financial necessity, embraced by most major Web search
engines. On the other, paid placement can hurt the search engine's market share and its
potential for revenues brought by users. We have developed a mathematical model for
optimal design of a paid placement strategy, examined this tradeoff and analyzed
sensitivity of the placement strategy to users' perceived disutility, the service quality of
the gatekeeper, and the advertising rate. Our preliminary results are as follows. We show
that the negative impact of paid placement on users causes the search engine to set paid
placements at a below-ideal level. However, when disutility for paid placement is quite
low (though not zero), the search engine can maintain its ideal placement revenues. We
find that an increase in the search engine's quality of service allows it to improve its
utilization of paid placement, moving it closer to the ideal; this also increases surplus for
all players. However, an increase in the advertising rate motivates the search engine to
increase market share by reducing further its reliance on paid placement and fraction of
paying providers. As consumers get a better understanding of the factors underlying paid
placement, the search engine would likely need to spend heavily on marketing campaigns
in order to minimize users' perceived disutility for paid placement. While this research is
set in the context of Internet search engines, our model and results apply more generally
to many other contexts that share similar characteristics as search engines. This broader
category is often called information gatekeepers, that intermediate between a set of users
(or buyers, or consumers) and a set of products (or content providers, or vendors). Baye
& Morgan (2001) argue that modern markets for information tend to
be dominated by ``information gatekeepers'' that specialize in collating, aggregating, and
searching massive amounts of information available on the Web - and can often charge
consumers, advertisers, and information providers, for their ability to acquire and
transmit information. Wise & Morrison (2001) emphasize the
increasing role of information gatekeepers in today's economy, noting that in business-to-
business markets, ``value has shifted from the product itself to information about the
product.'' Specific categories of information gatekeepers to which our work applies
include recommender systems (e.g., at Amazon.com), comparison shopping services (e.g.,
mySimon.com), e-marketplaces and exchanges (e.g., Free Markets), and more traditional
information gatekeepers such as investment advisors and television networks. Like search
engines, many information gatekeepers generate user-based revenues, but also seek to
obtain revenues from their provider-base by offering some form of preferential
placement. For example, some Internet booksellers are influenced by advertising fees in
determining their bestseller lists. Similarly, certain Internet exchanges provide
preferential service (such as real time notification or favorable recommendation to
buyers) to some clients in return for higher fees.

We are pursuing extensions of this work, including a formal derivation of the optimal
bias, generalization of demand assumptions, and elimination of free placement by the
gatekeeper. Our models can be extended to examine conditions under which the
information gatekeeper will begin to charge users, and specifically the case where the
gatekeeper differentiates between users by offering two versions: a fee-based premium
service with no bias in the query results, and a free basic version with paid placement
bias. The fee-based premium version will bring additional user revenues to the search
engine, however it may reduce placement revenues because paid placement becomes less
attractive to content providers. In addition, the search engine's market coverage and
placement fee may change as well, and the models can be used to determine if it is
optimal for the gatekeeper to offer differentiated service. Similar models can be
developed to examine the impact of differentiation based on advertising. Some search
engines have already begun to offer fee-based premium search services that contain no
advertising. If this is the trend, it may eventually change people's view of Internet search
engines as a free resource for fair information.

Search engines are sophisticated tools that allow users to quickly locate products and services on the Internet. Since SEO is aimed at improving your visibility in search engine results, it is essential that you understand the criteria they use to rank web pages. In the next sections we will show how to use search engines to help locate the right keywords for your products and to help analyse the competition you will face in search engine listings.

9. Bibliography

• ASP.NET Unleashed - Stephen Walther
• SQL Server 2000 - Mike Gunderloy
• SAMS Teach Yourself VB.NET in 21 Days
• Programming ASP.NET - Jesse Liberty
• Beginning C#.NET - Richard Blair
