Internet
The Internet is popularly known as a network of networks.
The Internet lets any computer or mobile device connect with any other computer system globally
using the TCP/IP protocol suite, which is also known as the Internet protocol suite.
The Internet identifies each system on the network through a unique address known as an IP address.
Each computer system has a unique IP address that distinguishes it from the other computers on the
network, much like a voter ID identifies a person.
URL
URL stands for Uniform Resource Locator.
A URL is one kind of URI (Uniform Resource Identifier), so the two terms are often used interchangeably.
To visit any website, you type its URL in a web browser.
For example, to visit Google you type its URL: www.google.com
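As a rough illustration of what a URL contains, the sketch below splits one into its parts with Python's standard urllib.parse module. The function name describe_url and the sample search URL are assumptions made for this example only.

```python
from urllib.parse import urlparse, parse_qs


def describe_url(url):
    """Split a URL into the parts a browser uses to locate a resource."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,          # protocol, e.g. https
        "host": parts.hostname,          # the server to contact
        "path": parts.path,              # the resource on that server
        "query": parse_qs(parts.query),  # optional parameters
    }


# Hypothetical URL used only for illustration.
info = describe_url("https://www.google.com/search?q=web+mining")
print(info)
```

The host part is what the browser resolves to an IP address; the path and query are sent to the server.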
For example, when you visit amazon.com or flipkart.com, you first land on their home page.
From there you can search for products by category, sign up or log in, sell products,
purchase products, and so on.
If you go to the online exam section of this website, the questions differ between users
and from one visit to the next. This can also be viewed as the dynamic nature of the website.
A web server receives a request from the user through a browser, processes the
request, prepares the necessary response and sends it back to the browser.
Domain name
A domain name is a way to identify and locate computers connected to the internet.
No two websites can have the same combination of domain name and top-level domain.
For example, consider our website tutorialsinhand.com:
tutorialsinhand is the domain name of this website and .com is the top-level domain.
Similarly, in google.com, google is the domain name and .com is the top-level domain.
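The split between domain name and top-level domain can be sketched in a few lines of Python. This is a deliberately naive version: the helper split_domain is hypothetical, it treats only the last label as the top-level domain, and it does not handle multi-part suffixes such as .co.uk.

```python
def split_domain(hostname):
    """Naively split a host name into (domain name, top-level domain)."""
    labels = hostname.lower().rstrip(".").split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]  # drop the conventional "www" prefix
    if len(labels) < 2:
        raise ValueError(f"not a full domain name: {hostname!r}")
    # Assumption: the last label alone is the top-level domain.
    return ".".join(labels[:-1]), "." + labels[-1]


print(split_domain("tutorialsinhand.com"))
print(split_domain("www.google.com"))
```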
IP address
To communicate, send files, send emails or share information with other systems, it is necessary
to know where the other computer is. An IP address identifies each system uniquely.
An IP address is an identifier for a particular computer on a particular network.
There are two types of IP address:
IPv4: for example, 190.167.48.160
IPv6: for example, 2003:0eb8:75b3:0000:0000:8c2d:0371:7434
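The two address types can be told apart programmatically. The sketch below uses Python's standard ipaddress module on the two example addresses above; the helper name ip_version is an assumption for this example.

```python
import ipaddress


def ip_version(text):
    """Return 4 or 6 depending on the kind of IP address in `text`."""
    return ipaddress.ip_address(text).version


v4_example = "190.167.48.160"
v6_example = "2003:0eb8:75b3:0000:0000:8c2d:0371:7434"

print(ip_version(v4_example))
print(ip_version(v6_example))

# IPv6 addresses are usually written in a shortened form in which
# leading zeros are dropped and one run of zero groups becomes "::".
print(ipaddress.ip_address(v6_example).compressed)
```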
Firewall
A firewall is a security device for computers that access information via the internet.
A firewall protects the computer and network by restricting access by outsiders or intruders. It
also sets the criteria that must be met before anyone is allowed access to the network or
system.
A firewall can be hardware, software or both, and helps protect a system connected to the network
from untrusted sites that may contain viruses or other malware.
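The idea of "criteria that must be met before access is allowed" can be sketched as a small rule table. This is a toy model rather than how any particular firewall product works: the rule list, the addresses and the function is_allowed are all assumptions made for illustration, with the first matching rule deciding and everything else denied by default.

```python
import ipaddress

# Hypothetical rules: (action, source network, destination port or None for any).
RULES = [
    ("deny", "203.0.113.0/24", None),  # block an untrusted network on every port
    ("allow", "0.0.0.0/0", 443),       # allow HTTPS from anywhere else
    ("allow", "192.168.1.0/24", 22),   # allow SSH only from the local network
]


def is_allowed(source_ip, port, rules=RULES):
    """Return True if the first matching rule allows the connection."""
    addr = ipaddress.ip_address(source_ip)
    for action, network, rule_port in rules:
        if addr in ipaddress.ip_network(network) and rule_port in (None, port):
            return action == "allow"
    return False  # default deny: no rule matched


print(is_allowed("203.0.113.7", 443))
print(is_allowed("198.51.100.5", 443))
print(is_allowed("198.51.100.5", 22))
```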
Cache
A cache stores data from recently or frequently visited websites.
A cache speeds up the serving of web pages because the stored data does not have to be
fetched from the server again, which is a time-consuming task.
The browser cache stores data from frequently visited websites.
Many ad-serving websites use data stored by your browser (typically cookies rather than the page
cache itself) to track the activity or searches you do online and then serve ads according to your
recent activity. That is why you start seeing ads for footwear on every website you visit after you
have recently searched for anything related to footwear from your browser.
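The speed-up from caching can be sketched with a dictionary that remembers pages already fetched. The class PageCache and the stand-in function slow_fetch are hypothetical; a real browser cache also handles expiry and validation, which this sketch leaves out.

```python
class PageCache:
    """Keep fetched pages in memory so repeat visits skip the server."""

    def __init__(self, fetch):
        self._fetch = fetch  # function that really contacts the server
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._store:
            self.hits += 1    # served from the cache, no server round trip
        else:
            self.misses += 1  # first visit: fetch and remember
            self._store[url] = self._fetch(url)
        return self._store[url]


def slow_fetch(url):
    # Stand-in for a network request (assumption for this example).
    return f"<html>content of {url}</html>"


cache = PageCache(slow_fetch)
cache.get("example.com/a")
cache.get("example.com/a")  # second visit is a cache hit
cache.get("example.com/b")
print(cache.hits, cache.misses)
```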
FTP
FTP is an abbreviation for File Transfer Protocol.
FTP is a network protocol used to transfer files from one computer to another over a network.
FTP supports exchanging and manipulating files over any TCP-based computer network. An FTP
client connects to an FTP server to manipulate files on that server.
HTTP
HTTP is an abbreviation for Hypertext Transfer Protocol.
HTTP is a request/response standard between a client and a server.
HTTP is a communication protocol for transferring information on the internet and the
World Wide Web (WWW). The original purpose of HTTP was to provide a protocol to publish and
retrieve hypertext pages over the internet.
Hypertext pages are coded in HTML (Hypertext Markup Language). HTML pages
may contain text, sound, animations, images, or links to other hypertext pages. When the user clicks
a hyperlink, the client program on the computer uses HTTP to contact the server and ask it
to respond to the client's request. The server processes the request and sends its response back
over HTTP.
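To make the request/response exchange concrete, the sketch below builds the text of a minimal HTTP/1.1 GET request and parses a hand-written sample response. No network connection is made; the helper names and the sample response are assumptions for illustration.

```python
def build_request(host, path="/"):
    """Return the text of a minimal HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )


def parse_response(raw):
    """Split a raw HTTP response into status code, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


# Hand-written response standing in for what a server might send back.
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)

request_text = build_request("example.com", "/index.html")
status, headers, body = parse_response(sample_response)
print(request_text)
print(status, headers, body)
```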
HTML
HTML stands for Hypertext Markup Language.
HTML was the first language used to design web pages. Those web pages were static in
nature.
A web page designed in HTML can contain text, images, audio, video, etc.
HTML together with CSS can be used to design attractive websites. You can view HTML as a plain
sketch on white paper and CSS as the paint that fills the sketch with colors.
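A small example may help show what HTML markup looks like and how a program reads it. The sketch below feeds a made-up page to Python's standard html.parser module and collects its hyperlinks and images; the class name LinkCollector and the page content are assumptions for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the hyperlinks and image sources found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])


# Made-up page used only for illustration.
page = """
<html><body>
<h1>Welcome</h1>
<p>Read the <a href="/tutorials">tutorials</a> or visit
<a href="https://example.com">example</a>.</p>
<img src="logo.png" alt="logo">
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)
print(collector.images)
```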
Web Mining
Web mining is the application of data mining techniques to find
information patterns in web data. Web mining helps to improve the power of web
search engines by identifying web pages and classifying web documents.
1. Web Content Mining –
Web Content Mining extracts useful data, information, and
knowledge from web page content. It scans and mines
the text, images, and groups of web pages according to the content of the input, as a
search engine does when it displays a list of results.
There are two approaches to Web Content Mining:
(i) Agent-based approach:
This approach involves intelligent systems. It usually relies on autonomous agents that
can identify relevant websites.
(ii) Data-based approach:
The data-based approach organizes semi-structured data present on the internet
into structured data.
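As a minimal sketch of the data-based approach, the code below turns a semi-structured HTML table into a list of structured records. The class TableExtractor, the function table_to_records and the sample table are assumptions for illustration; real pages are usually messier and need more robust handling.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Use the first row as field names and return one dict per data row."""
    parser = TableExtractor()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up product table used only for illustration.
html = """
<table>
  <tr><th>product</th><th>price</th></tr>
  <tr><td>Shoes</td><td>40</td></tr>
  <tr><td>Socks</td><td>5</td></tr>
</table>
"""
records = table_to_records(html)
print(records)
```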
2. Web Structure Mining –
Web Structure Mining discovers the link structure of hyperlinks. Its purpose
is to produce a structural summary of websites and similar web
pages; its focus is the structure of hyperlinks within the web. This type of mining is
applied at the document level and at the hyperlink level. Web Structure Mining plays a
very important role in the mining process.
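One simple structural summary is the number of links pointing at each page. The sketch below computes it for a tiny made-up site; the link graph and the function in_link_counts are assumptions for this example.

```python
from collections import Counter

# Hypothetical site: each page maps to the pages it links to.
links = {
    "home": ["products", "about"],
    "products": ["home", "cart"],
    "about": ["home"],
    "cart": ["home", "products"],
}


def in_link_counts(graph):
    """Count how many hyperlinks point at each page."""
    counts = Counter({page: 0 for page in graph})
    for source, targets in graph.items():
        for target in targets:
            counts[target] += 1
    return dict(counts)


print(in_link_counts(links))
```

Pages with many incoming links, such as the home page here, tend to be the structurally central ones.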
3. Web Usage Mining –
Web Usage Mining mines weblog records (access information of web pages).
It helps to discover user access patterns of web pages. Many research
projects and tools analyze those patterns for different purposes. Four mining
techniques are mainly applied: Association Rule Mining,
Sequential Pattern Mining, Clustering, and Classification.
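A very small flavor of usage mining is sketched below: it counts page hits and page-to-page transitions per visitor from a simplified access log. The two-field log format, the sample lines and the function mine_usage are assumptions for illustration; real server logs carry more fields, and real sequential pattern mining is more elaborate.

```python
from collections import Counter, defaultdict

# Simplified, made-up log: "<visitor address> <page>" in time order.
log_lines = [
    "10.0.0.1 /home",
    "10.0.0.2 /home",
    "10.0.0.1 /products",
    "10.0.0.2 /products",
    "10.0.0.1 /cart",
    "10.0.0.2 /about",
]


def mine_usage(lines):
    """Return (hits per page, counts of consecutive page pairs per visitor)."""
    hits = Counter()
    sessions = defaultdict(list)
    for line in lines:
        visitor, page = line.split()
        hits[page] += 1
        sessions[visitor].append(page)
    transitions = Counter()
    for pages in sessions.values():
        transitions.update(zip(pages, pages[1:]))
    return hits, transitions


hits, transitions = mine_usage(log_lines)
print(hits)
print(transitions.most_common(1))
```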
Comparison of the types of web mining:
 | Web Content (IR view) | Web Content (DB view) | Web Structure | Web Usage
View of data | Unstructured, structured | Semi-structured, website as DB | Link structure | Interactivity
Main data | Text documents, hypertext documents | Hypertext documents | Link structure | Server logs, browser logs
Method | Machine learning, statistical (including NLP) | Proprietary algorithms, association rules | Proprietary algorithms | Machine learning, statistical, association rules
Representation | Bag of words, n-gram terms, phrases, concepts or ontology, relational | Edge-labeled graph, relational | Graph | Relational table, graph
Web Mining
Web mining is the application of data mining techniques to automatically discover and
extract information from web documents and services. The main purpose of web
mining is to discover useful information from the World Wide Web and its usage
patterns.
Applications of Web Mining:
1. Web mining helps to improve the power of web search engines by classifying
web documents and identifying web pages.
2. It is used for web searching (e.g., Google, Yahoo) and vertical searching (e.g.,
FatLens, Become).
3. Web mining is used to predict user behavior.
4. Web mining is very useful for improving a particular website or e-service, e.g., landing page
optimization.
Web mining can be broadly divided into three categories based on the data to be mined:
Web Content Mining, Web Structure Mining, and Web Usage Mining.
1. Web Content Mining:
Web content mining is the extraction of useful information from the
content of web documents. Web content consists of several types of data: text,
image, audio, video, etc. Content data is the collection of facts that a web page is
designed to convey. It can reveal useful and interesting patterns about user needs. Mining text
documents draws on text mining, machine learning and natural language
processing, which is why web content mining is also known as text mining. This type of mining
scans and mines the text, images and groups of web pages according to the
content of the input.
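Two common ways to represent page text for mining are a bag of words (word counts) and n-grams (runs of n consecutive words). The sketch below computes both for a made-up sentence; the tokenizer is intentionally simple and the function names are assumptions for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Return every run of n consecutive words as a tuple."""
    tokens = tokenize(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up page text used only for illustration.
page_text = "Web mining extracts patterns from web data. Web data is large."
bag = bag_of_words(page_text)
bigrams = Counter(ngrams(page_text, 2))
print(bag.most_common(2))
print(bigrams.most_common(1))
```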
 | Data Mining | Web Mining
Structure | Information is obtained from explicit structure. | Information is obtained from structured, unstructured and semi-structured web pages.
Problem type | Clustering, classification, regression, prediction, optimization and control. | Web content mining, web structure mining.
Scrapy
Scrapy is a popular web mining tool. It is an open-source framework for
extracting data from websites. It is written in Python, and the extraction rules are written as
Python code. It is considered a complete web scraping solution because it can handle requests,
follow redirects, maintain user sessions, and manage output pipelines.
PageRank Algorithm
PageRank is a widely used web mining algorithm. It is a link analysis
algorithm that assigns a numerical weight to every element of a hyperlinked set of
documents, such as the World Wide Web, with the objective of measuring its relative importance
within the set. It can be applied to any collection of entities with reciprocal references and
citations.
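The weighting idea can be sketched with a simplified iterative PageRank. The four-page graph, the damping factor of 0.85 and the fixed number of iterations are assumptions chosen for this example; this is not a description of how any search engine computes its ranking.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively estimate the relative importance of each page."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal weights
    for _ in range(iterations):
        # Every page keeps a small base share regardless of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: each page maps to the pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph page C collects links from three pages and ends up with the largest weight, while page D, which nothing links to, keeps only the base share.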
R
R is a language for statistical computing and graphics. It can be called from scripting
languages such as Ruby, Python and Perl. R supports procedural programming with functions and
object-oriented programming with generic functions. A generic function behaves
differently depending on the classes of the arguments passed to it.
Octoparse
Octoparse is a web data mining tool that automates web data extraction. It allows you
to create highly accurate extraction rules. Octoparse makes it faster and easier to get data from
the web without any coding. An extraction rule tells the software which website
to go to, what kind of data you want, where the data you plan to crawl is located, etc.
Tableau
Tableau is one of the fastest-growing interactive data visualization tools
used in the business intelligence industry. It simplifies raw data into an
accessible format and transforms data into interactive visualizations in the
form of dashboards and worksheets. Employees at any level of a company
can interpret data presented with Tableau.
CONCLUSION
Web mining tools are numerous, and each has its strengths and weaknesses. The right choice depends on
your business and the kind of insights you are looking for. If you identify your
requirements and then look for a tool that meets them, you can gain
the competitive advantage you are seeking. More tools will keep appearing as the
field of web mining continues to grow.