
Distributed computing

• Distributed computing refers to multiple computer systems, located at different places and linked together over a network, that are used to solve large computations without having to use an expensive supercomputer.
Cloud computing
• Cloud computing is the use of various shared services, such as software development platforms, servers, storage and software, over the internet, often referred to as the "cloud".
• In general, there are three cloud computing characteristics
that are common among all cloud-computing vendors:
– The back-end of the application (especially hardware) is
completely managed by a cloud vendor.
– A user only pays for services used (memory, processing time and
bandwidth, etc.).
– Services are scalable.
Big data
• Big data refers to a process that is used when
traditional data mining and handling techniques
cannot uncover the insights and meaning of the
underlying data.
• Data that is unstructured, time sensitive, or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach called big data, which uses massive parallelism on readily available hardware.
Cloud storage
• Data (or files) are said to be stored in the cloud when they are saved on a remote server that is easily accessible from anywhere with internet access. This allows access to the data from any device connected to the internet, including computers, tablets and smartphones.

Examples: Google Drive, Apple iCloud.


Virtualization
• Virtualization means to create a virtual version of a device
or resource, such as a server, storage device, network or
even an operating system where the framework divides
the resource into one or more execution environments.
• Even something as simple as partitioning a hard drive is
considered virtualization because you take one drive and
partition it to create two separate hard drives.
• Devices, applications and human users are able to interact with the virtual resource as if it were a real, single logical resource.
Approaches to Virtualization
There are two kinds of virtualization approaches, distinguished by where the Virtual Machine Monitor (VMM) runs:
• Hosted approach: the VMM runs inside a host operating system.
• Bare-metal approach: the VMM runs directly on top of the hardware. It is fairly complex to implement, but offers good performance.
Cloud Models
There are 3 cloud computing service models:
• IaaS (Infrastructure as a Service)
• PaaS (Platform as a Service)
• SaaS (Software as a Service)
• IaaS (Infrastructure as a Service) provides the computing infrastructure: physical or (quite often) virtual machines and other resources such as virtual-machine disk image libraries, block and file-based storage, firewalls, load balancers, IP addresses, virtual local area networks, etc.

Examples: Amazon EC2, Windows Azure, Rackspace, Google Compute Engine.
• PaaS (Platform as a Service) provides computing platforms, which typically include an operating system, a programming language execution environment, a database, a web server, etc.

Examples: AWS Elastic Beanstalk, Windows Azure, Heroku, Force.com, Google App Engine, Apache Stratos.
• SaaS (Software as a Service) provides access to application software, often referred to as "on-demand software". There is no need to worry about installation, setup and running of the application; the service provider handles that. The user simply pays for the service and uses it through some client.

Examples: Google Apps, Microsoft Office 365.


Cloud Services
• Amazon Elastic Compute Cloud (EC2) is a key web service that
provides a facility to create and manage virtual machine
instances with operating systems running inside them.
• There are three ways to pay for EC2 virtual machine instances,
and businesses may choose the one that best fits their
requirements.
– An on-demand instance provides a virtual machine (VM) whenever
you need it, and terminates it when you do not.
– A reserved instance allows the user to purchase a VM and prepay for a
certain period of time.
– A spot instance can be purchased through bidding, and can be used only as long as the bid price remains above the current market (spot) price.
Another convenient feature of Amazon’s cloud is that it allows
for hosting services across multiple geographical locations,
helping to reduce network latency for a geographically-
distributed customer base.
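As an illustration, here is a minimal sketch of launching and terminating an on-demand EC2 instance with the boto3 Python SDK; the AMI ID is a placeholder, and AWS credentials are assumed to be configured in the environment:

import boto3

# Create an EC2 client; region and credentials are assumed to be
# configured in the environment (e.g. via ~/.aws/credentials).
ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single on-demand instance. The AMI ID below is a
# placeholder; real AMI IDs vary by region.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("Launched instance:", instance_id)

# Terminate the instance when it is no longer needed;
# on-demand billing stops once it is terminated.
ec2.terminate_instances(InstanceIds=[instance_id])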
Cloud Services
• Amazon Relational Database Service (RDS) provides MySQL and Oracle database services in the cloud.
• Amazon S3 is a redundant and fast cloud storage service that provides public access to files over HTTP.
• Amazon SimpleDB is a very fast, unstructured NoSQL database.
• Amazon Simple Queue Service (SQS) provides a reliable queuing mechanism with which application developers can queue different tasks for background processing.
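For example, here is a minimal sketch of queuing a background task with SQS via the boto3 Python SDK; the queue name and message body are invented for the example:

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Create (or look up) a queue; the name is an example.
queue_url = sqs.create_queue(QueueName="background-tasks")["QueueUrl"]

# A producer queues a task for background processing.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"task": "resize-image", "id": 42}',
)

# A worker polls the queue, processes the task, then deletes
# the message so it is not delivered again.
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in messages.get("Messages", []):
    print("Processing:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])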
Big Data
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. Big data is not merely data; it also involves the various tools, techniques and frameworks used to work with it.
What Comes Under Big Data?
Big data involves the data produced by different devices and
applications. Given below are some of the fields that come under
the umbrella of Big Data.

• Black Box Data: the black box is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
• Social Media Data: social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
• Stock Exchange Data: the stock exchange data holds information about the 'buy' and 'sell' decisions made by customers on the shares of different companies.
• Power Grid Data: the power grid data holds information about the power consumed by a particular node with respect to a base station.
• Transport Data: transport data includes the model, capacity, distance and availability of a vehicle.
• Search Engine Data: search engines retrieve lots of data from different databases.
Thus big data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:

• Structured data: relational data.
• Semi-structured data: XML data.
• Unstructured data: Word, PDF, text, media logs.
Structured data
Structured data is information, usually text files, displayed in titled columns and rows, which can be easily ordered and processed by data mining tools. Most organizations are likely to be familiar with this form of data and are already using it effectively. It is data organized in a predefined format, used for querying and reporting.
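As an illustration of how easily structured data can be queried, here is a minimal sketch using Python's built-in sqlite3 module; the table and values are invented for the example:

import sqlite3

# An in-memory relational database with a predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (first_name TEXT, last_name TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("John", "Doe", 50000.0), ("Anna", "Smith", 62000.0)],
)

# Because the format is predefined, querying and reporting
# are straightforward.
query = "SELECT first_name, last_name FROM employees WHERE salary > 55000"
for row in conn.execute(query):
    print(row)  # ('Anna', 'Smith')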
Semi-structured data
Semi-structured data is neither raw data nor typed data in a conventional database system. It is structured data, but it is not organized in a relational model, such as a table or an object-based graph. A lot of data found on the Web can be described as semi-structured. Data integration especially makes use of semi-structured data. Common textual formats for semi-structured data include XML and JSON.

Example (JSON):
{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}
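As a small sketch, both formats can be handled with Python's standard library; the XML string below is a hypothetical equivalent of the JSON employee list above:

import json
import xml.etree.ElementTree as ET

# Parse the JSON employee list shown above.
json_text = '{"employees": [{"firstName": "John", "lastName": "Doe"}]}'
for emp in json.loads(json_text)["employees"]:
    print(emp["firstName"], emp["lastName"])

# A hypothetical XML representation of the same kind of data.
xml_text = ("<employees><employee><firstName>Anna</firstName>"
            "<lastName>Smith</lastName></employee></employees>")
for emp in ET.fromstring(xml_text):
    print(emp.findtext("firstName"), emp.findtext("lastName"))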
Unstructured data
It is a set of data that might or might not have any logical or repeating patterns. It consists of metadata, inconsistent data, email, video, images, social media data, etc. About 80% of enterprise data consists of unstructured content. For example, author, date created, date modified and file size are examples of very basic document metadata.
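As a small sketch, such basic metadata can be read for any file with Python's standard library; the file path is a placeholder:

import os
import time

path = "report.pdf"  # placeholder path to any unstructured document
info = os.stat(path)

# Basic document metadata: file size and modification time.
print("Size (bytes):", info.st_size)
print("Last modified:", time.ctime(info.st_mtime))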
Big Data Technologies
• Big data technologies are important in providing more
accurate analysis, which may lead to more concrete
decision-making resulting in greater operational
efficiencies, cost reductions, and reduced risks for the
business.
• There are various technologies in the market from
different vendors including Amazon, IBM, Microsoft, etc.,
to handle big data. While looking into the technologies
that handle big data, we examine the following two
classes of technology:
- Operational Big Data
- Analytical Big Data
Operational Big Data
• This includes systems, such as MongoDB, that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
• NoSQL Big Data systems are designed to take
advantage of new cloud computing architectures
that have emerged over the past decade to allow
massive computations to be run inexpensively and
efficiently. This makes operational big data
workloads much easier to manage, cheaper, and
faster to implement.
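A minimal sketch of such an operational, real-time workload against MongoDB, using the pymongo driver; the connection URI, database and collection names are examples:

from pymongo import MongoClient

# Connect to a local MongoDB instance; the URI is an example.
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Capture and store an event as it happens
# (a write-heavy, interactive workload).
db.orders.insert_one({"user": "anna", "item": "laptop", "qty": 1})

# Serve an interactive read immediately afterwards.
order = db.orders.find_one({"user": "anna"})
print(order)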
Analytical Big Data
• This includes systems like Massively Parallel
Processing (MPP) database systems and MapReduce
that provide analytical capabilities for retrospective
and complex analysis that may touch most or all of
the data.
• MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and systems based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
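To make the pattern concrete, here is a toy word-count sketch of MapReduce in plain Python; real systems such as Hadoop distribute the map and reduce phases across many machines:

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return (word, sum(counts))

documents = ["big data needs big tools", "data is the new oil"]

# Shuffle: group intermediate pairs by key
# (done by the framework in a real deployment).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(results)  # e.g. [('big', 2), ('data', 2), ...]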
Elements of Big Data
• Volume
• Velocity
• Variety
• Veracity
Volume
It is the most common term used to describe a big data opportunity. Enterprises in all industries have to address the ever-increasing volume of data created by everyday processes, people and systems.
Velocity
Velocity describes the frequency at which data is
generated, captured and shared. The faster we can
collect and process data, the more opportunity we
have to leverage the information for competitive
advantage. Traditional BI approaches commonly do not effectively address an organization's need to collect, analyze and disseminate insight in near real time.
Variety
Variety refers to the many different types and sources of data. Examples include the 80% of unstructured/semi-structured information and intellectual property stored internally in our organizations in the form of documents, emails, video, voice, etc. These datasets do not fit into our well-structured traditional data warehouses, as the data is constantly changing, is non-exact and is often unpredictable. Other emerging data types include geo-spatial and location data, log data, machine data, metrics, mobile, RFID, search, streaming data, social, text and so on.
Veracity
Big data is often not verified, verifiable or validated. Analysis can't always be duplicated easily, as the data keeps growing and changing, and duplication, omission and general incompleteness are to be expected. This is an important characteristic of big data that needs to be addressed early, in terms of how we deal with it and the type of insight we expect to gain from it.
