
Internet Searching Technique

By
P.Palniladevi, AP/ECE
Introduction
 Online web content is increasing day by day. Around the world, millions of users
search through search engines to find their desired information.
 The Web's popularity has created a crucial problem for internet users: finding
relevant and correct information in the huge amount of content available on
thousands of websites.
 In this regard, Search Engines (SEs) provide the information most relevant to
what users search for.
 SEs have become an integral part of every user's internet experience. One of the
greatest functions of any search engine is to satisfy users by collecting and
filtering the content that best matches the search term they enter.
Search Engines
 Search engines are internet-based computerized tools that search an
index of website documents, web pages, online content, or media for a
particular phrase, term, text, or image specified by the user.
 Search engines apply algorithms to the data gathered by web crawlers,
which also lets them retain information about users' actual internet usage.
 For example, Google keeps full activity logs for every Gmail account
holder.
Contd..
 Spiders, also called crawlers, "crawl" websites to find newly updated
documents: web pages, online web-based apps, or other online documents.
 Google follows the links to other websites found in any specific website, subject
to the "dofollow" or "nofollow" attribute in HTML tags. All those links are
saved in the search engine's own databases.
 Search engines repeat these crawls on a scheduled or spontaneous basis; this is
called monthly or spontaneous crawling. The documents resulting from a user's
search are then ranked using an algorithm based on various weights and ranking
factors.
Famous Search Engines

 Some famous search engines used by millions of online users are Google, Bing
(Microsoft), Yahoo, Baidu (Chinese web search), AOL.com (AOL Incorporation),
Ask.com, Excite, Yandex, DuckDuckGo, and Lycos.
 The main role of all these search engines is to find relevant information for users
at the right time.
 Another important task of SEs is to understand users' behavior from the search
terms they provide.
 Geography-based search engines use location, IP address, and zip codes to find
more accurate and relevant information for a specific country or geographical area.
Search engine optimization
SEO does the following:
• Optimizes the website
• Provides guidelines for developing and designing the website
• Offers modifications to the website's code
• Maximizes website traceability
• Improves website ranking in search
 Search engine result pages are lists of websites matching the term
specified by the user. SEO improves the quality of a website, increases web
traffic, helps achieve a higher ranking, and puts the website in the top result
pages of famous search engines when specific keywords are entered in the
search field.
Terms Related to SEO
• Page Rank: A website's value or importance according to search
engines is called its page rank. Ranking refers to the position of a
website in the free-listings section of a search engine for a
specific search query entered by users.
• Keywords: Words or phrases that define the content a website
publishes. Keywords are repeated throughout websites. Search
engines filter the website pages available on the internet to find
those that best match the queries users provide.
Contd..
• Spider / Crawler: Spiders or crawlers are programs that
automatically fetch all the available pages of a website.
• Crawling: Crawlers or spiders first crawl the website.
These software tools, also called bots, always follow incoming
(inbound) and outgoing (outbound) links within the
website. One page is often linked to another to create
backlinks. When a crawler finds a website page, it indexes the whole
page along with its links, whether inbound or outbound.
• Indexing: Search engines first crawl the webpage or whole website; the crawled
pages are then indexed in their databases. SEs maintain a hierarchy of these indexed
pages. The database containing the indexed information is regularly saved and updated.

• Sitemaps: A sitemap is an XML file stored in the website. It displays the hierarchical
structure of the site. It is best practice to always create two sitemaps: one for visiting
users and one for search engines. Sitemaps make a site easier to navigate; the pages
designed for visitors help users who have difficulty finding pages on a site.
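For illustration, a minimal search-engine sitemap following the sitemaps.org schema might look like this (the URL and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```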

• Web Directory: Windows, Microsoft's famous operating system, has a
built-in directory structure. Similarly, a web directory or web hierarchy is a
collection of web resources. It is a composite directory that defines the
structure of a website on the World Wide Web (WWW). It focuses on linking to
other websites and classifying those links.


PageRank Algorithm Mathematics

• PR(X) = (1-d) + d (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))

Explanation:
• PR(X) refers to the PageRank of page X (whose rank is to be calculated)
• PR(Ti) is the PageRank of page Ti, which links to page X
• C(Ti) is the total number of outgoing (outbound) links on page Ti
• d stands for the damping factor, with values between 0 and 1, usually set to
0.85
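As a concrete illustration, the formula above can be iterated to a fixed point over a small hypothetical link graph (the graph, damping factor, and iteration count here are illustrative choices, not part of the original algorithm description):

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(X) = (1-d) + d * sum(PR(Ti)/C(Ti)).

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # start every page at rank 1
    for _ in range(iterations):
        new_pr = {}
        for x in pages:
            # Sum PR(Ti)/C(Ti) over every page Ti that links to X
            incoming = sum(pr[t] / len(links[t]) for t in pages if x in links[t])
            new_pr[x] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page web: A links to B and C, B links to C, C links to A
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

With this (non-normalized) form of the formula, the ranks converge so that their sum equals the number of pages.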
Google Search Factors
• Google also depends on automated software robot agents to filter, read, compare,
analyze, and check the ranking of web pages. These software tools are called Google
Robots or Google Spiders. These spiders crawl every page of a website published to
the public. Reading the information contained in web pages is called crawling.
Types of Google Crawl

1. Deep Crawl: Google crawls every website on a monthly cycle; this is also called the
main crawl. GoogleBot is responsible for deep crawling. After crawling every
website, Google updates its main index each month.
2. Fresh Crawl: Google also crawls websites many times during a week, even daily for
some websites. The fresh crawl updates newly created and high-ranking websites and
is performed by Google FreshBot.
Contd..
• Keyword Factors: A webpage's relevancy is determined by the textual
keywords used in it. These keywords are used widely in webpages
because the optimization of a website depends on the chosen keywords. The best
places to use keywords are in the content, in the text of hyperlinks, in the title of
the webpage, and in content containing links to other webpages.
• Link Factor (PageRank): Link factors include the importance of a website's
pages, the quality and quantity of webpages, and the proper use of internal
and external links.
 The Knowledge Graph used by Google is a knowledge base that improves the
search engine's results for user queries with semantically accurate
information collected from a wide variety of data sources.
 Knowledge Graph Entities
• Google's Knowledge Graph contains millions of entries of different types, which
describe real-world things as objects: people, books, events, places, websites, and
so on. Some of the entity types established in Google's Knowledge
Graph are:
• Book / Book Series
• Educational Organization
• Movie / Movie Series
• Music Album / Group
Genetic Algorithm and the Internet
Algorithm Phases
1. Process the set of URLs given by the user
2. Select all links from the input set
3. Evaluate the fitness function for all genomes
4. Perform crossover, mutation, and reproduction
5. If a satisfactory solution is obtained, stop; otherwise repeat from step 2
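The loop in the phases above can be sketched as follows. This is a simplified skeleton, not the actual system: the fitness function is supplied by the caller, and the crossover/mutation step is a placeholder (here just recombining random survivors), since the real operators act on fetched documents.

```python
import random

def ga_search(initial_urls, fitness, max_generations=20, target=0.9):
    """Evaluate fitness of all genomes (URLs), then apply reproduction
    and a crossover/mutation placeholder until a satisfactory solution
    is obtained or the generation limit is reached."""
    population = list(initial_urls)
    for _ in range(max_generations):
        scored = sorted(population, key=fitness, reverse=True)
        if fitness(scored[0]) >= target:          # satisfactory solution obtained?
            return scored[0]
        survivors = scored[: max(2, len(scored) // 2)]   # reproduction
        # crossover/mutation placeholder: recombine random survivors
        children = [random.choice(survivors) for _ in survivors]
        population = survivors + children
    return max(population, key=fitness)
```

A caller would plug in a fitness function that scores how well a fetched document matches the search topic.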
Introduction
• GAs can be used for intelligent internet
search.
• GAs are used in cases where the search space
is relatively large.
• A GA is an adaptive search.
• A GA is a heuristic search method.
System for GA Internet Search

A control program coordinates the components of the system:
• Generator: produces URLs from the input set
• Agent and Spider: fetch documents and maintain the current set
• Topic: works with the "top data" database
• Space and Time: work with the "net data" database
• Output set: the final result of the search
Spider
• The spider is a software package that picks up internet documents from user-supplied
input, to a depth specified by the user.
• The spider takes one URL and fetches all links and documents it contains, up to a
predefined depth.
• The fetched documents are stored on the local hard disk with the same structure as at
the original location.
• The spider's task is to produce the first generation.
• The spider is also used during crossover and mutation.
Agent
• The agent takes as input a set of URLs and calls the spider for
every one of them, with depth 1.
• Then, the agent extracts keywords from each document and stores
them on the local hard disk.
Generator
• The generator generates a set of URLs from given keywords, using some
conventional search engine.
• It takes as input the desired topic, calls the Yahoo search engine, and submits
a query looking for all documents covering that topic.
• The generator stores the URL and topic of each web page in a database called top
data.
Topic
• Topic uses the top data DB to insert random URLs from the database into the
current set.
• Topic performs mutation.

Space
• Space takes as input the current set from the agent application and
injects into it those URLs from the netdata database that appeared with
the greatest frequency in the output sets of previous searches.
Time

• Time takes the set of URLs from the agent and inserts those with the
greatest frequency into the netdata DB.
• The netdata DB contains three fields: URL, topic,
and count.
• The DB is updated in each algorithm iteration.
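The netdata bookkeeping described above can be sketched with a plain dictionary standing in for the database (the function names and data layout are illustrative, not from the original system):

```python
def update_netdata(netdata, urls, topic):
    """Time's role: record each URL under its topic and bump its count
    on every algorithm iteration it appears in."""
    for url in urls:
        key = (url, topic)
        netdata[key] = netdata.get(key, 0) + 1
    return netdata

def most_frequent(netdata, topic, n=3):
    """Space's role: pick the URLs with the greatest count for a topic,
    to inject them back into the current set."""
    entries = [(count, url) for (url, t), count in netdata.items() if t == topic]
    return [url for count, url in sorted(entries, reverse=True)[:n]]
```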
GA and the Internet: Conclusion
• A GA for internet search is much faster and
more efficient than conventional solutions such as standard internet
search engines.

Boolean Operators
• Boolean operators are a system of logical operators that allow you to specify
relationships between search words.
• To use:
– Use AND to require all search terms
– Use OR to find either search term
– Use NOT to eliminate results containing a keyword
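The three operators can be sketched as a simple filter over a document's words (a toy bag-of-words matcher, not how real search engines implement Boolean queries):

```python
def matches(text, include_all=(), include_any=(), exclude=()):
    """AND / OR / NOT matching over a single document's words."""
    words = set(text.lower().split())
    if any(w.lower() in words for w in exclude):                          # NOT
        return False
    if include_all and not all(w.lower() in words for w in include_all):  # AND
        return False
    if include_any and not any(w.lower() in words for w in include_any):  # OR
        return False
    return True
```

For example, `matches(doc, include_all=["python", "search"])` behaves like the query `python AND search`.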
Thank You
