
Introduction to Big Data
IT and Software Systems
IT Evolution
Need for process automation
◦ Software solutions

Examples
◦ Client-server applications (ERP, HRMS, etc.)
◦ The traditional method of offering solutions, where well-defined data is captured
via user interface screens and eventually stored in the datastore (RDBMS).

Challenges
◦ Need for a more sophisticated data store mechanism as the amount of data from
such applications grows steadily
◦ How to manage the distributed data itself

IT and Software Systems
Web Evolution
◦ Web 1.0 -> Web 2.0 -> ...
◦ From static web pages to dynamic, user-generated content and the growth
of social media (Web 2.0).

Challenges
◦ Data Store
◦ Data format
◦ Not unique, not structured

Devices
◦ An ever-higher number of connected devices and the data they emit

Who is collecting what?
CREDIT CARD COMPANIES: WHAT DATA ARE THEY GETTING?
◦ Airline tickets
◦ Restaurant checks
◦ Grocery bills
◦ Hotel bills

Why are they collecting all this data?
TARGET MARKETING
◦ To send you catalogs for exactly the merchandise you typically purchase.
◦ To suggest medications that precisely match your medical history.
◦ To “push” television channels to your set instead of your “pulling” them in.
◦ To send advertisements on those channels just for you!

TARGETED INFORMATION
◦ To know what you need before you even know you need it, based on past purchasing habits!
◦ To notify you of your expiring driver’s license or credit cards, or the last refill on a prescription (Rx), etc.
◦ To give you turn-by-turn directions to a shelter in case of emergency.
IT and Software Systems' impact on data
◦ All of these technological advancements boil down to:
◦ How to capture large amounts of data?
◦ How to process multiple types of data?
◦ How to effectively make use of such data?

Type of Data
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
◦ Social Network, Semantic Web (RDF), …
What to do with these data?
Aggregation and Statistics
◦ Data warehouse and OLAP (see the sketch after this list)

Indexing, Searching, and Querying
◦ Keyword-based search
◦ Pattern matching (XML/RDF)

Knowledge discovery
◦ Data Mining
◦ Statistical Modeling
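
As a concrete illustration of the aggregation idea, here is a minimal sketch of an OLAP-style roll-up using pandas; the tool choice, dataset, and column names are illustrative assumptions, not anything these slides prescribe.

```python
# Minimal sketch: an OLAP-style roll-up with pandas.
# The dataset and column names are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [120.0, 150.0, 90.0, 110.0],
})

# Aggregate revenue by region and quarter, as a data warehouse query would.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum")
print(cube)
```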
What Is Big Data?
“Big data exceeds the reach of commonly used hardware
environments and software tools to capture, manage, and process it
within a tolerable elapsed time for its user population.” - Teradata
Magazine article, 2011

“Big data refers to data sets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.” - The
McKinsey Global Institute, 2011

What Is Big Data?

Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using
on-hand database management tools or traditional data
processing applications.

The challenges that we face with DBMS tools and other
technologies include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
What Is Big Data?

IOPS (Input/Output Operations Per Second): a standard measure of
storage system performance.
By 2015, 80% of all available data will be uncertain

[Chart: Global Data Volume in Exabytes vs. Aggregate Uncertainty %,
2005-2015. Multiple sources: IDC, Cisco.]
◦ By 2015 the number of networked devices will be double the entire
global population. All sensor data has uncertainty.
◦ The total number of social media accounts exceeds the entire global
population. This data is highly uncertain in both its expression and
content.
◦ Data quality solutions exist for enterprise data like customer,
product, and address data, but this is only a fraction of the total
enterprise data.
The term Big Data applies to information that can't be processed or
analyzed using traditional processes or tools.

Dimensions of big data, with examples:
◦ Volume: 12+ terabytes of Tweets created daily
◦ Velocity: 5+ million trade events per second
◦ Variety: 100's of different types of data
◦ Veracity: only 1 in 3 decision makers trust their information

Data sources and their characteristics:
◦ Transactional & Application Data: volume, structured, throughput
◦ Machine Data: velocity, semi-structured, ingestion
◦ Social Data: variety, highly unstructured, veracity
◦ Enterprise Content: variety, highly unstructured, volume
Why Big Data
Key enablers for the appearance and growth of 'Big Data' are:
+ Increase in storage capabilities
+ Increase in processing power
+ Availability of data
Examples of Big Data
Some examples:
eBay
◦ 90 PB of data about customer transactions & behavior
◦ 8 PB in a warehouse, 40 PB on Hadoop, 40 PB on Singularity
Facebook
◦ 100 billion hits each day
◦ 100's of millions of requests each second
◦ ~100 TB of logs each day
Aadhaar
◦ 30 TB of I/O each day, 4 TB of logs each day
Able-Grape, BabaCar, Beebler
◦ A few TB of data overall
◦ Hadoop clusters of a few machines
Examples of Big Data
Total data generated every day by the global 787 fleet in operation
today:
(20 TB/hr x 9 hr average operation per day) x 2 engines x 114 787 aircraft
= 41,040 terabytes (or about 40 petabytes)
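
The arithmetic can be checked directly, for example in Python:

```python
# Check the 787 fleet arithmetic above.
tb_per_day = 20 * 9 * 2 * 114   # TB/hr x hrs/day x engines x aircraft
print(tb_per_day)               # 41040 TB, i.e. roughly 40 PB (41040 / 1024)
```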

3 Vs of Big Data
The “BIG” in big data isn’t just about volume

Vs of Big Data Explained
Data Evolution

https://www.google.de/search?q=evolution+of+business+intelligence&newwindow=1&tbm=isch&tbo=u&source=univ&sa=X&ei=gEGoU5KXBuTb4QSGsoH4BQ&ved=0CDsQsAQ&biw=1366&bih=64
Big Data Challenges
How Is Big Data Different?
1) Automatically generated by a machine
(e.g. a sensor embedded in an engine)

2) Typically an entirely new source of data
(e.g. use of the internet)

3) Not designed to be friendly
(e.g. text streams)

4) May not have much value
◦ Need to focus on the important part
How Is Big Data More of the Same?
Most new data sources were considered big and difficult when they first appeared
Big data is just the next wave of new, bigger data

< The past >   < The present >   < The future >
Risks of Big Data
You can be overwhelmed by the data
◦ Need the right people, solving the right problems

Costs can escalate too fast
◦ It isn't necessary to capture 100% of the data

Many sources of big data raise privacy concerns
◦ Self-regulation
◦ Legal regulation

Why You Need to Tame Big Data
Analyzing big data is already standard practice
(e.g. e-commerce)

Organizations that ignore it will be left behind in a few years
◦ So far, they have only missed the chance to be on the bleeding edge

Capturing data and using analysis to make decisions
◦ Just an extension of what you are already doing today

The Structure of Big Data
Structured
◦ Most traditional data sources

Semi-structured
◦ Many sources of big data

Unstructured
◦ Video data, audio data

Filtering Big Data Effectively
The extract, transform, and load (ETL) process takes a raw feed of
data, reads it, and produces a usable set of output.

Extract -> Transform -> Load
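
A minimal sketch of the three ETL stages in Python; the file names, record layout, and filtering rule are hypothetical assumptions used only for illustration.

```python
# Minimal ETL sketch: read a raw feed, keep the usable records, write output.
# File names and the record layout are hypothetical.
import csv

def extract(path):
    # Extract: stream records from the raw feed.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: drop malformed records and normalize types.
    for row in rows:
        try:
            row["amount"] = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # skip records that cannot be normalized
        yield row

def load(rows, path):
    # Load: write the usable set of output.
    rows = list(rows)
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("raw_feed.csv")), "usable_output.csv")
```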

Computing - Evolution

1622  Slide rule
1642  Mechanical adding machine
1693  Mechanical calculator
1700  Binary number system proposed
1822  Modern computer conceived
1946  ENIAC, the first electronic computer
1951  UNIVAC and IBM 701 computer designs
1966  IBM memory chip
1969  First human steps on the Moon / ARPANET
1975  Production of microchips for PCs
1979  Cathode ray tube monitors and terminals
1988  IBM AS/400
1992  ThinkPad launch
Filtering Big Data Effectively
Focus on the important pieces of the data

It makes big data easier to handle

Mixing Big Data with Traditional Data
Browsing history
◦ Knowing how valuable a customer is
◦ What they have bought in the past

Smart-grid data
◦ For a utility company
◦ Knowing the historical billing patterns
◦ Dwelling type

Text (online chat and e-mails)
◦ Knowing the detailed product specification being discussed
◦ The sales data related to those products

Handling Big Data
Better algorithms

Stronger hardware

All areas of CS: AI, cognitive computing, data management

The Need for Standards
◦ Become more structured over time
◦ Fine-tune to be friendlier for analysis
◦ Standardize enough to make life much easier

Big Data Use Cases

http://www.meltinfo.com/ppt/ibm-big-data
Building a Big Data Platform

As with data warehousing, web stores or any IT platform, an infrastructure
for big data has unique requirements.
In considering all the components of a big data platform, it is important to
remember that the end goal is to easily integrate your big data with your
enterprise data to allow you to conduct deep analytics on the combined
data set.
Infrastructure Requirements
The requirements in a big data infrastructure span data acquisition, data
organization and data analysis.
Acquire Big Data

• The acquisition phase is one of the major changes in infrastructure from the
days before big data.
• Because big data refers to data streams of higher velocity and higher variety,
the infrastructure required to support the acquisition of big data must deliver
low, predictable latency in both capturing data and in executing short, simple
queries; be able to handle very high transaction volumes, often in a distributed
environment; and support flexible, dynamic data structures.

• NoSQL databases are frequently used to acquire and store big data. They are
well suited for dynamic data structures and are highly scalable.

• The data stored in a NoSQL database is typically of a high variety because the
systems are intended to simply capture all data without categorizing and
parsing the data into a fixed schema.
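
As a sketch of that schema-less acquisition, here is what capturing varied events into MongoDB might look like with pymongo. MongoDB is assumed here as one representative document store (the slides name no specific product), and the connection details and fields are hypothetical.

```python
# Sketch: capturing varied events in a NoSQL document store without a
# fixed schema. Assumes a local MongoDB instance and pymongo installed.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical endpoint
events = client["acquisition"]["events"]

# Documents need not share a schema: each is captured as-is.
events.insert_one({"type": "click", "page": "/home",
                   "ts": datetime.now(timezone.utc)})
events.insert_one({"type": "sensor", "engine_id": 7, "temp_c": 341.5})
```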
Organize Big Data

• Organizing data is called data integration.

• Because there is such a high volume of big data, there is a tendency to organize
data at its initial destination location, thus saving both time and money by not
moving around large volumes of data.
• The infrastructure required for organizing big data must be able to process and
manipulate data in the original storage location; support very high throughput
(often in batch) to deal with large data processing steps; and handle a large variety
of data formats, from unstructured to structured.
Hadoop is a new technology that allows large data volumes to be organized and
processed while keeping the data on the original data storage cluster.
The Hadoop Distributed File System (HDFS) is the long-term storage system for web
logs, for example.
These web logs are turned into browsing behavior (sessions) by running MapReduce
programs on the cluster and generating aggregated results on the same cluster.
These aggregated results are then loaded into a relational DBMS system, as in the
sketch below.
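
Below is a plain-Python sketch of that sessionization step. A real deployment would express it as MapReduce jobs over HDFS; the 30-minute inactivity rule and the log layout are illustrative assumptions.

```python
# Sketch: turning web-log hits into per-user session counts.
from itertools import groupby

SESSION_GAP = 30 * 60  # seconds of inactivity that end a session (assumption)

# (user, unix_timestamp) pairs, assumed already parsed from raw web logs
log = [("u1", 100), ("u1", 500), ("u1", 5000), ("u2", 200)]

sessions_per_user = {}
for user, hits in groupby(sorted(log), key=lambda r: r[0]):
    times = [t for _, t in hits]
    # A new session starts whenever the gap between hits exceeds SESSION_GAP.
    sessions = 1 + sum(1 for a, b in zip(times, times[1:]) if b - a > SESSION_GAP)
    sessions_per_user[user] = sessions

print(sessions_per_user)  # aggregated results, ready to load into an RDBMS
```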
Analyze Big Data

The infrastructure required for analyzing big data must be able to support
deeper analytics, such as statistical analysis and data mining, on a wider
variety of data types stored in diverse systems; scale to extreme data volumes;
deliver faster response times driven by changes in behavior; and automate
decisions based on analytical models.
Big data analytics tools and technology

Big data analytics cannot be narrowed down to a single tool or technology. Instead, several types
of tools work together to help you collect, process, cleanse, and analyze big data. Some of the
major players in big data ecosystems are listed below.

Hadoop is an open-source framework that efficiently stores and processes big datasets on clusters
of commodity hardware. This framework is free and can handle large amounts of structured and
unstructured data, making it a valuable mainstay for any big data operation.
NoSQL databases are non-relational data management systems that do not require a fixed schema,
making them a great option for big, raw, unstructured data. NoSQL stands for “not only SQL,” and
these databases can handle a variety of data models.
MapReduce is an essential component of the Hadoop framework, serving two functions. The first is
mapping, which filters and distributes data to various nodes within the cluster. The second is
reducing, which organizes and reduces the results from each node to answer a query (a word-count
sketch of the two phases follows).
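
Here is that word-count sketch in plain Python, showing only the logic of the two phases; real Hadoop MapReduce distributes these functions across the cluster.

```python
# Minimal sketch of the map and reduce phases of a word count,
# in plain Python rather than the Hadoop Java API.
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs; the framework would route them to nodes.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: combine the per-node results to answer the query.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(reduce_phase(map_phase(["big data", "big clusters"])))
# {'big': 2, 'data': 1, 'clusters': 1}
```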
YARN stands for “Yet Another Resource Negotiator.” It is another component of second-generation
Hadoop. The cluster management technology helps with job scheduling and resource management
in the cluster.
Spark is an open-source cluster computing framework that uses implicit data parallelism and fault
tolerance to provide an interface for programming entire clusters. Spark can handle both batch and
stream processing for fast computation.
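
A minimal word count on Spark might look like this (assuming PySpark and a local Spark installation; the input data is illustrative):

```python
# Sketch: the same word count expressed on Spark, which distributes the
# work across the cluster implicitly. Assumes pyspark is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.parallelize(["big data", "big clusters"])

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())  # e.g. [('big', 2), ('data', 1), ('clusters', 1)]
spark.stop()
```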
Tableau is an end-to-end data analytics platform that allows you to prep, analyze, collaborate, and
share your big data insights. Tableau excels in self-service visual analysis, allowing people to ask
new questions of governed big data and easily share those insights across the organization.
Core Analytics
One of the major challenges of data preparation is that it is a
time-consuming process, and it needs consistent effort to shape the
data and make it valuable.
Need to focus on changing the stride of data preparation and
fundamentally prepare clean, understandable data.
Taking that goal into consideration, various data wrangling
technologies are suitable to serve this purpose.
Azure Machine Learning Workbench includes powerful data wrangling
capabilities, using productive data transformations to prepare the data.
Data ingestion is a form of data wrangling that performs some level of
sanitization on the data (structured and unstructured) and prepares the
data before it is sent to the data lakes and used for reporting, analytics and
predictive modeling purposes.
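
As a small illustration of wrangling and sanitization, here is a sketch using pandas, chosen here only for brevity (the slides name other tools); the columns and cleaning rules are hypothetical.

```python
# Minimal data-wrangling sketch: drop incomplete and duplicate records,
# then coerce a text column to numbers. Column names are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ann", "Ann", None, "Bob"],
    "amount":   ["10.5", "10.5", "3.0", "oops"],
})

clean = (raw.dropna(subset=["customer"])  # drop records missing a key field
            .drop_duplicates())           # remove exact duplicates
# Coerce amounts to numbers; unparseable values become NaN and are dropped.
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
clean = clean.dropna(subset=["amount"])
print(clean)
```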
Byte Comparison Table
Metric           Value     Bytes
Byte (B)         1024^0    1
Kilobyte (KB)    1024^1    1,024
Megabyte (MB)    1024^2    1,048,576
Gigabyte (GB)    1024^3    1,073,741,824
Terabyte (TB)    1024^4    1,099,511,627,776
Petabyte (PB)    1024^5    1,125,899,906,842,624
Exabyte (EB)     1024^6    1,152,921,504,606,846,976
Zettabyte (ZB)   1024^7    1,180,591,620,717,411,303,424
Yottabyte (YB)   1024^8    1,208,925,819,614,629,174,706,176
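
The byte counts in the table are just successive powers of 1024, which a few lines of Python can reproduce:

```python
# Reproduce the byte comparison table as powers of 1024.
units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
for power, unit in enumerate(units):
    print(f"1 {unit} = 1024^{power} = {1024 ** power:,} bytes")
```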
The Resource Description Framework (RDF) is a general framework for
representing interconnected data on the web. RDF statements are used
for describing and exchanging metadata, which enables standardized
exchange of data based on relationships. RDF is used to integrate data
from multiple sources.
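
A minimal sketch of one RDF statement (a subject-predicate-object triple) using the rdflib library; the example.org namespace and resource names are illustrative assumptions.

```python
# Sketch: one RDF triple, serialized as Turtle. Assumes rdflib is installed
# (pip install rdflib); the example.org vocabulary is hypothetical.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Statement: the resource ex:alice has the name "Alice".
g.add((EX.alice, EX.name, Literal("Alice")))

print(g.serialize(format="turtle"))
```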
