BigData
IT and Software System
IT Evolution
Need for process automation
◦ Software Solutions
Examples
◦ Client-server applications (ERP, HRMS, etc.)
◦ The traditional method of offering solutions, where well-defined data is captured
via user interface screens and eventually stored in the datastore (RDBMS).
Challenges
◦ Need for a sophisticated data-store mechanism when the amount of data from
such applications grows steadily
◦ How to manage the distributed data itself
IT and Software System
Web Evolution
◦ Web 1.0 -> Web 2.0 -> …
◦ From static web pages to dynamic, user-generated content and the growth
of social media (Web 2.0).
Challenges
◦ Data Store
◦ Data format
◦ Not unique, not structured
Devices
◦ An ever-increasing number of connected devices and the data they emit
Who is collecting what?
Credit card companies
What data are they getting?
◦ Airline tickets
◦ Restaurant checks
◦ Grocery bills
◦ Hotel bills
Why are they collecting all this data?
Target marketing
◦ To send you catalogs for exactly the merchandise you typically purchase.
◦ To suggest medications that precisely match your medical history.
◦ To “push” television channels to your set instead of your “pulling” them in.
◦ To send advertisements on those channels just for you!
Targeted information
◦ To know what you need before you even know you need it, based on past purchasing habits!
◦ To notify you of your expiring driver’s license or credit cards, or the last refill on a prescription (Rx), etc.
◦ To give you turn-by-turn directions to a shelter in case of emergency.
IT and Software Systems’ impact on data
◦ All of these technological advancements boil down to:
◦ How to capture large amounts of data?
◦ How to process multiple types of data?
◦ How to effectively make use of such data?
Type of Data
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
◦ Social Network, Semantic Web (RDF), …
What to do with these data?
Aggregation and Statistics
◦ Data warehouse and OLAP
Knowledge discovery
◦ Data Mining
◦ Statistical Modeling
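The "aggregation and statistics" style of analysis can be sketched with a warehouse-like fact table rolled up via GROUP BY, the core OLAP operation. The table and column names below are illustrative assumptions, not from the slides.

```python
import sqlite3

# A minimal in-memory fact table, rolled up by region (a typical OLAP query).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EU", "widget", 100.0),
    ("EU", "gadget", 250.0),
    ("US", "widget", 175.0),
])
# Roll up sales by region:
rollup = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rollup)  # [('EU', 350.0), ('US', 175.0)]
```

A real data warehouse would run the same shape of query over far larger fact tables, often pre-aggregated into OLAP cubes.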
What Is Big Data?
“Big data exceeds the reach of commonly used hardware
environments and software tools to capture, manage, and process it
within a tolerable elapsed time for its user population.” - Teradata
Magazine article, 2011
“Big data refers to data sets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.” - The
McKinsey Global Institute, 2011
What Is Big Data?
[Figure: growth of global data volume, 2005–2015 (sources: IDC, Cisco). Data quality solutions exist for enterprise data like customer, product, and address data, but this is only a fraction of the total enterprise data.]
The term big data applies to information that can’t be processed or analyzed using traditional processes or tools.
Volume, Velocity, Variety
3 Vs of Big Data
The “BIG” in big data isn’t just about volume
Vs of Big Data Explained
Data Evolution
Big Data Challenges
How Is Big Data Different?
1) Automatically generated by a machine
(e.g. Sensor embedded in an engine)
How Is Big Data More of the Same?
Most new data sources were considered big and difficult when they first appeared.
Big data is just the next wave of new, bigger data.
The past -> The present -> The future
Risks of Big Data
◦ Organizations risk being overwhelmed by the sheer volume of data
◦ Need the right people to solve the right problems
Why You Need to Tame Big Data
Analyzing big data is already standard practice
(e.g., e-commerce)
The Structure of Big Data
Structured
◦ Most traditional data sources
Semi-structured
◦ Many sources of big data
Unstructured
◦ Video data, audio data
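The three structure levels above can be illustrated side by side. The record shapes and field names below are illustrative assumptions: the same kind of order information arriving as a fixed-schema row, as semi-structured JSON, and as free text.

```python
import json

# Structured: a fixed-schema row, every record has the same columns.
structured_row = ("ord-1", "alice", 120.50)

# Semi-structured: JSON with nested and optional fields.
semi = json.loads('{"id": "ord-2", "customer": "bob", '
                  '"notes": {"gift_wrap": true}}')
total = semi.get("total", 0.0)  # field may be absent; apply a default at read time

# Unstructured: free text needs extraction (search, NLP) before analysis.
unstructured = "Customer carol called about order ord-3."
mentions_order = "ord-3" in unstructured
```

The practical difference is where the schema lives: up front in the store for structured data, partially in the record for semi-structured data, and nowhere at all for unstructured data.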
Filtering Big Data Effectively
The extract, transform, and load (ETL) process:
taking a raw feed of data, reading it, and producing a usable set of
output.
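A minimal sketch of such an ETL filter, assuming a hypothetical raw CSV feed; the field names and cleanup rules (trimming, case normalization, dropping unparseable rows) are illustrative, not from the slides.

```python
import csv
import io

# Hypothetical raw feed with messy whitespace and one bad record.
RAW_FEED = """customer,amount,country
alice , 120.50 ,US
bob,not-a-number,DE
carol, 99.00 ,us
"""

def extract(feed):
    """Extract: read the raw feed into dict rows."""
    return list(csv.DictReader(io.StringIO(feed), skipinitialspace=True))

def transform(rows):
    """Transform: trim whitespace, normalize case, drop unparseable rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # filter out bad records instead of failing the load
        clean.append({"customer": row["customer"].strip(),
                      "amount": amount,
                      "country": row["country"].strip().upper()})
    return clean

def load(rows, target):
    """Load: append the usable output to the target store (a list here)."""
    target.extend(rows)

warehouse = []
load(transform(extract(RAW_FEED)), warehouse)
```

In a production pipeline the target would be a warehouse table rather than a list, but the extract/transform/load separation is the same.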
Computing - Evolution
Mixing Big Data with Traditional Data
Browsing history
◦ Knowing how valuable a customer is
◦ What they have bought in the past
Smart-grid data
◦ For a utility company
◦ Knowing the historical billing patterns
◦ Dwelling type
Handling Big Data
Better Algorithms
Stronger Hardware
Big Data Use Cases
http://www.meltinfo.com/ppt/ibm-big-data
Building a Big Data Platform
• The acquisition phase is one of the major changes in infrastructure from the
days before big data.
• Because big data refers to data streams of higher velocity and higher variety,
the infrastructure required to support the acquisition of big data must deliver
low, predictable latency both in capturing data and in executing short, simple
queries; be able to handle very high transaction volumes, often in a distributed
environment; and support flexible, dynamic data structures.
• NoSQL databases are frequently used to acquire and store big data. They are
well suited for dynamic data structures and are highly scalable.
• The data stored in a NoSQL database is typically of a high variety because the
systems are intended to simply capture all data without categorizing and
parsing the data into a fixed schema.
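The "capture everything without a fixed schema" idea can be sketched in a few lines. The record shapes below are illustrative assumptions; a plain list stands in for a NoSQL collection.

```python
import json

store = []  # stands in for a NoSQL collection

def acquire(record):
    """Capture the raw record as-is; no parsing into a fixed schema."""
    store.append(json.dumps(record))

# Records of completely different shapes are stored side by side,
# something a relational table would reject without schema changes.
acquire({"type": "click", "user": 42, "page": "/home"})
acquire({"type": "sensor", "engine_id": "E7", "temp_c": 91.4, "rpm": 2400})

# Structure is applied later, at read time ("schema on read"):
clicks = [json.loads(doc) for doc in store
          if json.loads(doc)["type"] == "click"]
```

This is the trade-off the slides describe: acquisition stays fast and flexible, and the cost of interpreting the data is deferred to analysis time.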
Organize Big Data
The infrastructure required for analyzing big data must be able to support
deeper analytics, such as statistical analysis and data mining, on a wider
variety of data types stored in diverse systems; scale to extreme data volumes;
deliver faster response times driven by changes in behavior; and automate
decisions based on analytical models.
Big data analytics tools and technology
Big data analytics cannot be narrowed down to a single tool or technology. Instead, several types
of tools work together to help you collect, process, cleanse, and analyze big data. Some of the
major players in big data ecosystems are listed below.
Hadoop is an open-source framework that efficiently stores and processes big datasets on clusters
of commodity hardware. This framework is free and can handle large amounts of structured and
unstructured data, making it a valuable mainstay for any big data operation.
NoSQL databases are non-relational data management systems that do not require a fixed schema,
making them a great option for big, raw, unstructured data. NoSQL stands for “not only SQL,” and
these databases can handle a variety of data models.
MapReduce is an essential component of the Hadoop framework, serving two functions. The first is
mapping, which filters and distributes data to various nodes within the cluster. The second is reducing, which
organizes and reduces the results from each node to answer a query.
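The map/reduce pattern can be shown with the classic word-count example. This is a single-process sketch of the pattern only; real Hadoop distributes the map and reduce phases across cluster nodes and shuffles data between them over the network.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single result per key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "data lakes"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'lakes': 1}
```

Because each mapper and each reducer works independently on its own slice of keys, the same program parallelizes naturally across many machines.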
YARN stands for “Yet Another Resource Negotiator.” It is another component of second-generation
Hadoop. The cluster management technology helps with job scheduling and resource management
in the cluster.
Spark is an open source cluster computing framework that uses implicit data parallelism and fault
tolerance to provide an interface for programming entire clusters. Spark can handle both batch and
stream processing for fast computation.
Tableau is an end-to-end data analytics platform that allows you to prep, analyze, collaborate, and
share your big data insights. Tableau excels in self-service visual analysis, allowing people to ask
new questions of governed big data and easily share those insights across the organization.
Core Analytics
One of the major challenges of data preparation is that it is a time-consuming
process that requires consistent effort to shape the data and make it valuable.
The focus needs to shift toward changing the pace of data preparation and
fundamentally producing clean, understandable data.
With that goal in mind, various data wrangling technologies exist that are
suitable to serve this purpose.
Azure Machine Learning Workbench includes powerful data wrangling
capabilities, using productive data transformations to prepare the data.
Data ingestion is a form of data wrangling that performs some level of
sanitization on the data (structured and unstructured) and prepares the
data before it is sent to the data lakes and used for reporting, analytics,
and predictive modeling purposes.
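A sanitization step of this kind might look like the sketch below. The specific rules (trimming whitespace, masking email-like strings, normalizing blanks to None) are illustrative assumptions, not taken from any particular product.

```python
import re

# Crude email pattern used only to illustrate masking of PII-like fields.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize(record):
    """Light cleanup applied before data lands in the data lake."""
    clean = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()                 # trim stray whitespace
            value = EMAIL.sub("<masked>", value)  # mask email addresses
            if value == "":
                value = None                      # normalize blanks
        clean[key] = value
    return clean

rec = sanitize({"name": "  Ada  ", "contact": "ada@example.com", "note": ""})
```

Keeping sanitization this light at ingestion time preserves the raw signal; heavier transformations are deferred to the analytics stage.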
Byte Comparison Table

Metric           Value      Bytes
Byte (B)         1          1
Kilobyte (KB)    1,024^1    1,024
Megabyte (MB)    1,024^2    1,048,576
Gigabyte (GB)    1,024^3    1,073,741,824
Terabyte (TB)    1,024^4    1,099,511,627,776
Petabyte (PB)    1,024^5    1,125,899,906,842,624
Exabyte (EB)     1,024^6    1,152,921,504,606,846,976
Zettabyte (ZB)   1,024^7    1,180,591,620,717,411,303,424
Yottabyte (YB)   1,024^8    1,208,925,819,614,629,174,706,176
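The pattern in the table is that each unit is 1,024 times the previous one, i.e. 1,024 raised to the unit's position. A few lines verify the byte counts:

```python
# Binary (base-1024) size units: each step multiplies by 1,024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

sizes = {unit: 1024 ** n for n, unit in enumerate(UNITS)}

assert sizes["TB"] == 1_099_511_627_776
assert sizes["YB"] == 1_208_925_819_614_629_174_706_176
```

Note that storage vendors often use decimal (base-1000) units instead, so a "terabyte" drive holds about 9% fewer bytes than the 1,024^4 figure above.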
The Resource Description Framework (RDF) is a general framework for
representing interconnected data on the web. RDF statements are used
for describing and exchanging metadata, which enables standardized
exchange of data based on relationships. RDF is used to integrate data
from multiple sources.
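The triple model at the heart of RDF can be sketched with plain tuples. This is an in-memory illustration only; the resource names are made up, and a real application would use a library such as rdflib with full IRIs.

```python
# RDF models data as (subject, predicate, object) statements.
triples = {
    ("ex:alice", "ex:knows",   "ex:bob"),
    ("ex:alice", "ex:worksAt", "ex:acme"),
    ("ex:bob",   "ex:worksAt", "ex:acme"),
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Who works at acme?
employees = sorted(t[0] for t in query(p="ex:worksAt", o="ex:acme"))
print(employees)  # ['ex:alice', 'ex:bob']
```

Because every statement has the same three-part shape, triples from different sources can be merged into one graph and queried uniformly, which is exactly the integration property described above.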