
Introduction to

Dr. Sandeep G. Deshmukh


Contents

❑ Big Data
❑ Distributed Systems
❑ Hadoop
➢ Hadoop Distributed File System (HDFS)

➢ MapReduce

Show of Hands
Introduction to Big Data
Definition

Big data is data that exceeds the processing capacity of
conventional database systems.

The data is too big, moves too fast, or doesn’t fit the strictures of
your database architectures.

To gain value from this data, you must choose an alternative way
to process it.

https://www.oreilly.com/ideas/what-is-big-data
Volume

Quantity of data

Data sets too large to store and analyze using traditional databases
Velocity

Speed at which data is generated

Speed at which data is moving around and analyzed

Analyze data while it is being generated, without even putting it into databases
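
As a rough illustration of analyzing data while it is being generated, the sketch below keeps a running mean over a stream of readings without storing them anywhere. The names `running_mean` and `sensor_stream`, and the sample values, are hypothetical and only stand in for a real data source.

```python
from typing import Iterable, Iterator


def running_mean(readings: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all readings seen so far, one value per reading."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        # Only the count and the sum are kept; the readings are never stored.
        yield total / count


def sensor_stream() -> Iterator[float]:
    """Hypothetical source that produces readings one at a time."""
    for value in (10.0, 20.0, 30.0):
        yield value


means = list(running_mean(sensor_stream()))
print(means)
```

Each reading is processed once and then discarded, so memory use stays constant no matter how long the stream runs.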
Variety

Different types of data that we can use


Veracity

Messiness or trustworthiness of the data

Volume makes up for quality

E.g., tweets with spelling mistakes and short forms (u -> you, thr -> there)
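
One simple way to tidy such messy text is a lookup table of known short forms. The sketch below is a minimal illustration only; the `SHORT_FORMS` table and the `normalize` function are assumptions made for this example, and real tweet cleaning needs far more than two entries.

```python
# Hypothetical lookup table built from the two short forms on the slide.
SHORT_FORMS = {"u": "you", "thr": "there"}


def normalize(text: str) -> str:
    """Replace known short forms with their full words; leave other words unchanged."""
    words = text.split()
    return " ".join(SHORT_FORMS.get(word.lower(), word) for word in words)


cleaned = normalize("r u thr")
print(cleaned)
```

Words missing from the table pass through untouched, which is why "r" survives in the example.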
Value

Getting value out of Big Data!!!


Definition

“Big data” is

high-volume, -velocity and -variety information assets

that demand cost-effective, innovative forms of information processing

for enhanced insight and decision making

By Gartner
Definition
Big data is a term for data sets that are so large or complex that traditional
data processing applications are inadequate.

Challenges include analysis, capture, data curation, search, sharing, storage,
transfer, visualization, querying, updating and information privacy.

The term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of
data set.

Accuracy in big data may lead to more confident decision making, and better
decisions can result in greater operational efficiency, cost reduction and reduced
risk.
Wikipedia
Use Case: Big Data in Oil & Gas Drilling

http://analytics-magazine.org/images/stories/novdec12/big-data.jpg
Use Case: Uber - Pay Surge Pricing if Battery is Low
Further Reading
● A Brief History of Big Data Everyone Should Read

● Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity

● What is big data? - OpenSource.com

● What is big data? - O’Reilly

● 5 Big Data Use Cases To Watch


● Best Big Data Analytics Use Cases
● The 5 game changing big data use cases
● Big Data - The 5 Vs Everyone Must Know
● Top SlideShare Presentations on Big Data
Distributed Systems
Definition

A distributed system is a collection of independent computers that appears to
its users as a single coherent system.
Distributed Systems: Principles and Paradigms, 2nd Edition, Andrew S. Tanenbaum, Maarten Van Steen, 2006

http://www.mypearsonstore.com/bookstore/distributed-systems-principles-and-paradigms-9780132392273?xid=PSED
Forms of Transparency in Distributed Systems

Transparency   Description

Access         Hide differences in data representation and how a resource is accessed
Location       Hide where a resource is located
Migration      Hide that a resource may move to another location
Relocation     Hide that a resource may be moved to another location while in use
Replication    Hide that a resource is replicated
Concurrency    Hide that a resource may be shared by several competitive users
Failure        Hide the failure and recovery of a resource


● A distributed system consists of components (i.e., computers) that are autonomous

● Users (be they people or programs) think they are dealing with a single system. This means that one way or
the other the autonomous components need to collaborate. How to establish this collaboration lies at the
heart of developing distributed systems.
Definition

A distributed system is a model in which components located on networked
computers communicate and coordinate their actions by passing messages.

The components interact with each other in order to achieve a common goal.

Three significant characteristics of distributed systems are: concurrency of
components, lack of a global clock, and independent failure of components.

Wikipedia
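
The message-passing idea can be sketched on a single machine. In the example below, threads stand in for networked computers and a queue stands in for the network; this is an assumption made for illustration, not a real distributed setup. The names `worker`, `coordinate` and the `node-N` labels are hypothetical.

```python
import queue
import threading


def worker(name, numbers, outbox):
    """A component: compute a partial sum and report it by sending a message."""
    outbox.put((name, sum(numbers)))


def coordinate(chunks):
    """Start one worker per chunk and combine the messages they send back."""
    inbox = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(f"node-{i}", chunk, inbox))
        for i, chunk in enumerate(chunks)
    ]
    for thread in threads:
        thread.start()

    total = 0
    for _ in chunks:
        # Messages may arrive in any order; the workers share no global clock.
        _name, partial = inbox.get()
        total += partial

    for thread in threads:
        thread.join()
    return total


result = coordinate([[1, 2, 3], [4, 5], [6]])
print(result)
```

The workers run concurrently and never share variables; they reach the common goal (the total) only through the messages placed on the queue.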

Further Reading

● Distributed Computing - Wikipedia

● Distributed computing

● Characteristics of distributed system


Miscellaneous Concepts
Big Data Primers: Size does matter
Big Data Primers: Vertical Vs Horizontal Scaling

Vertical Scaling Horizontal Scaling


Big Data Primers: The scale of infrastructure
