
IBM Big Data Platform

Overview

Martin Pavlík
+420 731 435 691
martin_pavlik@cz.ibm.com

January 2013 © 2013 IBM Corporation


Big Data is a Hot Topic Because Technology Makes it
Possible to Analyze ALL Available Data
Cost-effectively manage and analyze
all available data in its native form:
unstructured, structured, streaming

Data sources: Website, Social Media, Billing, ERP, Network Switches, CRM, RFID
2 2012 IBM Corporation
BIG DATA is not just HADOOP
Understand and navigate federated big data sources: Federated Discovery and Navigation
Manage & store huge volume of any data: Hadoop File System, MapReduce
Structure and control data: Data Warehousing
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM



Business-Centric Big Data Enables You to Start With a Critical Business
Pain and Expand the Foundation for Future Requirements

Big data isn't just a technology; it's a business strategy for capitalizing on information resources

Getting started is crucial

Success at each entry point is accelerated by products within the Big Data platform

Build the foundation for future requirements by expanding further into the big data platform



1 Unlock Big Data
Customer need
Understand existing data sources

Search and navigate data within existing systems

No copying of data

Value statement
Get up and running quickly

Discover and retrieve big data

Let business users work even with big data sources

Solution
Vivisimo Velocity renamed to
IBM InfoSphere DataDiscovery



2 Analyze Raw Data
Customer need
Ingest data as-is into Hadoop
Combine it with data from DWH

Process very large volumes of data

Value statement
Gain new insight

Overcome the high cost of converting data from unstructured to structured format

Experiment with analysis of different data and combine it with other sources

Solution
IBM InfoSphere BigInsights



Merging the Traditional and Big Data Approaches
Traditional Approach: Structured & Repeatable Analysis
Business Users: Determine what question to ask
IT: Structures the data to answer that question
Examples: Monthly sales reports, Profitability analysis, Customer surveys

Big Data Approach: Iterative & Exploratory Analysis
IT: Delivers a platform to enable creative discovery
Business: Explores what questions could be asked
Examples: Brand sentiment, Product strategy, Maximum asset utilization



InfoSphere BigInsights is more than just HADOOP

IBM InfoSphere BigInsights is much more than HADOOP

IBM Big Data platform includes much more than IBM InfoSphere BigInsights



Hadoop
Open-source software framework from Apache
Inspired by
Google MapReduce
GFS (Google File System)

HDFS
Map/Reduce
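
The Map/Reduce model named on this slide can be illustrated with a minimal, single-process sketch in plain Python. It does not use Hadoop or HDFS; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. On a real cluster the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit one (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (no parallelism here)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data at rest"])
print(counts)
```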



InfoSphere BigInsights
Platform for volume, variety, velocity
Can run also on top of
Enhanced Hadoop foundation
Analytics: Text analytics & tooling, Application accelerators
Usability: Web console, Spreadsheet-style tool, Ready-made apps
Enterprise Class: Storage, security, cluster management
Integration: Connectivity to Netezza, DB2, JDBC databases, etc.

Editions (ranked by breadth of capabilities and enterprise class, starting from Apache Hadoop)
Basic Edition (free download): Integrated install, Online InfoCenter, BigData Univ.
Enterprise Edition (licensed): Application accelerators, Pre-built applications, Text analytics, Spreadsheet-style tool, RDBMS and warehouse connectivity, Administrative tools, security, Eclipse development tools, Performance enhancements, ....
Spreadsheet-style Analysis
Web-based analysis
and visualization

Spreadsheet-like
interface
Define and manage
long running data
collection jobs

Analyze content of the text on the pages that have been retrieved



Build a Big Data Program: MapReduce example
Eclipse tools
For Jaql, Hive, Pig, Java MapReduce, BigSheets
plug-ins, text analytics, etc.



Jaql: IBM's programming language in the Hadoop world
Jaql is a complete solutions environment supporting all other BigInsights components

Integration point for various analytics
Text analytics (BigInsights Text Analytics)
Statistical analysis (R module)
Machine learning (SystemML)
Ad-hoc analysis (BigSheets)
Integration (DB2, Netezza, Streams)

Jaql: Jaql Core Operators, Jaql Modules, Jaql I/O

Integration point for various data sources
Local and distributed file systems (DFS, File System)
NoSQL databases (NoSQL)
Content repositories
Relational sources: warehouses, operational databases (RDBMS)
BigInsights and the data warehouse
Big Data analytic applications: BigInsights
Traditional analytic tools: Data warehouse
BigInsights feeds the data warehouse: Filter, Transform, Aggregate
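
The Filter, Transform, Aggregate steps on this slide could be sketched as follows. This is only an illustration in plain Python under stated assumptions: the record layout (`customer`, `amount`, `status`) and the function name `filter_transform_aggregate` are hypothetical, and a real deployment would run the equivalent logic inside BigInsights before loading the result into the warehouse.

```python
from collections import defaultdict

# Hypothetical raw records as they might arrive, with amounts still as text.
raw_events = [
    {"customer": "A", "amount": "10.5", "status": "ok"},
    {"customer": "A", "amount": "2.5", "status": "ok"},
    {"customer": "B", "amount": "7.0", "status": "error"},
    {"customer": "B", "amount": "4.0", "status": "ok"},
]


def filter_transform_aggregate(events):
    """Keep valid records, convert amounts to numbers, sum per customer."""
    totals = defaultdict(float)
    for event in events:
        if event["status"] != "ok":           # filter: drop unusable records
            continue
        amount = float(event["amount"])       # transform: text to number
        totals[event["customer"]] += amount   # aggregate: total per customer
    return dict(totals)


warehouse_rows = filter_transform_aggregate(raw_events)
print(warehouse_rows)
```

Only the small aggregated result would then need to be loaded into the warehouse, not the raw records.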



3 Simplify your warehouse
Customer need
Make performance of DWH SIGNIFICANTLY better
Reduce DWH administration costs

Value statement
Speed: 10-100x better performance
Simplicity: Administration costs reduced by 75% - 90%
Scalability
Smart system
In-database analytics
Out-of-the-box integration with SPSS

Solution
IBM Netezza renamed to
PureData System for Analytics



Analyst: I need to evaluate the possible relationship between client salary and overdrafts.

IT: OK. We have to evaluate a lot of statistics, set the correct db indexes and db partitioning. It will take us 5 days.



IT: Done. You can run your analytical query.

Analyst: Great. Thanks a lot. I'm going to check the results.



Analyst: Great. I can see here some nice correlations. Now I need to look at it from a different perspective.

IT (thinking): Noooo!!! It's not possible to work here!

IT: Ohhh, welcome dear friend. Understood. So, it's ... another 5 days of our work.



And now with Netezza ...



Analyst: I need to evaluate the possible relationship between client salary and overdrafts. I will use Netezza.



Analyst: Great. I can see here some nice correlations. Now I need to look at it from a different perspective. With Netezza I can run the query immediately. The response time will be the same.

IT can do something else, much more useful.



Built-In Expertise Makes This as Simple as an Appliance

Dedicated device
Optimized for purpose
Complete solution
Fast installation
Very easy operation
Standard interfaces
Low cost



In October 2012

IBM Netezza was renamed to IBM PureData System for Analytics



Netezza
Genesis in T-Mobile CZ

Proof-Of-Concept Project
New Enterprise Data Warehouse platform selection
Comparison of existing and other platforms

Selection Criteria
Performance
Operational Savings

...and the winner was: Netezza



Netezza Genesis in T-Mobile CZ
Expectations
Significant response improvement:
Faster platform means better reports response

Direct Data Availability
Higher trust in data, one version of truth
Aggregation reduction
Any attribute available

Operational Benefits
Storage savings (no data replicas)
Administration cost reduction (DBA)

Infrastructure Simplification
Lower environment complexity



Netezza Genesis in T-Mobile CZ
Project Implementation
EDW platform migration
Netezza platform implementation
ETL graphs/processes redesign

BI Front-End Tool Migration
SAP BusinessObjects implementation
All reports redesign

Main Integration Partner: T-Systems CZ



Netezza Genesis in T-Mobile CZ
Current Status
All relevant ETL processing redesigned

Parallel run on the original and Netezza platforms finished

Netezza is now the only primary platform



Real Netezza experience from T-Mobile Czech Rep.

Report                                          Original Platform   Netezza
Workflow Reporting                              2 hours             1 minute
Invoicing and Payments reporting
  Payment discipline of current month invoices  33 minutes          17 seconds
  Overdue Debt of Invoices in Current Month     10 hours            23 seconds
  Average Monthly Invoice Figures               50 minutes          38 seconds

RESPONSE TIME MASSIVELY IMPROVED
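
To put the response times above into perspective, the speedup factor for each report can be computed directly from the reported figures. The snippet below is a small arithmetic sketch; the variable names are chosen for illustration, and the timings are taken from the slide as-is.

```python
# (original platform, Netezza) response times from the slide, in seconds.
timings = {
    "Workflow Reporting": (2 * 3600, 60),
    "Payment discipline of current month invoices": (33 * 60, 17),
    "Overdue Debt of Invoices in Current Month": (10 * 3600, 23),
    "Average Monthly Invoice Figures": (50 * 60, 38),
}

# Speedup factor = original time divided by Netezza time.
speedups = {name: before / after for name, (before, after) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```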
4 Reduce costs SIGNIFICANTLY with Hadoop
Customer need
Too much data => too expensive to store and to maintain
A big portion is kept "just in case"
The amount of data is still growing => it gets even more expensive

=> too expensive to have all data in a standard DWH

Value statement
Leverage the architecture of parallel processing in Hadoop

Hadoop uses cheap commodity HW

Enable business users to keep working in the same or a similar way

Solution
IBM InfoSphere BigInsights



BigInsights and the data warehouse
Traditional analytic tools: Data Warehouse
Big Data analytic applications: BigInsights
From Cognos BI to BigInsights via Hive JDBC
BigInsights: Query-ready archive for cold warehouse data
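
The idea of a query-ready archive for cold warehouse data can be sketched as moving rows older than a cutoff date out of the warehouse table into an archive table that stays queryable. The sketch below is an assumption-heavy stand-in: it uses an in-memory SQLite database for both sides, and the table names (`warehouse_sales`, `archive_sales`) and the function `archive_cold_rows` are hypothetical. In the setup on the slide, the archive would live in BigInsights and be queried via Hive JDBC.

```python
import sqlite3


def archive_cold_rows(conn, cutoff_date):
    """Move rows older than cutoff_date to the archive; return rows moved."""
    with conn:  # one transaction: copy the rows, then delete them
        conn.execute(
            "INSERT INTO archive_sales "
            "SELECT * FROM warehouse_sales WHERE sale_date < ?",
            (cutoff_date,),
        )
        cursor = conn.execute(
            "DELETE FROM warehouse_sales WHERE sale_date < ?",
            (cutoff_date,),
        )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE archive_sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?)",
    [("2009-03-01", 10.0), ("2010-07-15", 20.0), ("2012-11-30", 30.0)],
)

moved = archive_cold_rows(conn, "2011-01-01")
hot = conn.execute("SELECT COUNT(*) FROM warehouse_sales").fetchone()[0]
cold = conn.execute("SELECT COUNT(*) FROM archive_sales").fetchone()[0]
print(moved, hot, cold)
```

The archived rows remain available to ordinary SQL queries, which is the point of keeping the archive query-ready.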



Future: The SQL interface . . . .
Rich SQL query capabilities
SQL '92 and 2011 features
Correlated subqueries
Windowed aggregates

SQL access to all data stored in InfoSphere BigInsights

Robust JDBC/ODBC support

Take advantage of key features of each data source
Leverage MapReduce parallelism
OR
achieving low-latency

Architecture (top to bottom): Application, SQL Language, JDBC / ODBC Driver, JDBC / ODBC Server, SQL interface Engine, Data Sources (Hive Tables, HBase tables, CSV Files), all within InfoSphere BigInsights
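
Two of the SQL features listed on this slide, correlated subqueries and windowed aggregates, are shown below in a minimal sketch. It runs against an in-memory SQLite database purely as a stand-in (window functions need SQLite 3.25 or newer); it does not use the BigInsights SQL interface, and the `invoices` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        ("A", "2012-01", 100.0),
        ("A", "2012-02", 150.0),
        ("B", "2012-01", 80.0),
        ("B", "2012-02", 40.0),
    ],
)

# Correlated subquery: invoices above the same customer's own average.
above_average = conn.execute(
    """
    SELECT customer, month
    FROM invoices AS i
    WHERE amount > (SELECT AVG(amount) FROM invoices
                    WHERE customer = i.customer)
    ORDER BY customer, month
    """
).fetchall()

# Windowed aggregate: running total of invoice amounts per customer.
running_totals = conn.execute(
    """
    SELECT customer, month,
           SUM(amount) OVER (PARTITION BY customer ORDER BY month)
    FROM invoices
    ORDER BY customer, month
    """
).fetchall()

print(above_average)
print(running_totals)
```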



5 Analyze Streaming Data
Customer need
Process and leverage streaming data

Select valuable data from data stream for future processing

Quickly process data that becomes useless if it's not processed immediately

Value statement
React in real time to seize an opportunity
before it expires

Periodically adjust streaming models based on analysis on data at rest

Solution
IBM InfoSphere Streams



Why and when to use InfoSphere Streams?
Applications needing on-the-fly processing, filtering and analyzing of streaming data
Sensors: Environmental, Industrial, GPS,
Images, Videos,
Data Exhaust: Network data, system logs (web server, app server),
High-rate transaction data: Financial transactions, CDRs

At least 2 criteria from the list below should be fulfilled

Isolation: Processing in isolation or in limited windows (time / number of records)
Non-traditional formats included: Spatial data, images, text, voice,
Integration challenges: Different connection methods, different data rates, different processing requirements
Multiple processing nodes: Volume / rate very high => scalability required
Sub-millisecond latency: Immediate analysis and response
Store & mine approach doesn't work: Because of very high volume of data (and its rates)
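
Processing "in limited windows" by number of records, as named in the criteria above, can be sketched with a sliding window over an incoming stream. This is a plain Python illustration, not InfoSphere Streams code; the function names `sliding_average` and `alerts`, the window size and the threshold are all hypothetical.

```python
from collections import deque


def sliding_average(stream, window_size):
    """Yield the average of the last window_size values after each arrival."""
    window = deque(maxlen=window_size)  # old values drop out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


def alerts(stream, window_size, threshold):
    """Yield (position, average) whenever the windowed average is too high."""
    averages = sliding_average(stream, window_size)
    for position, average in enumerate(averages):
        if average > threshold:
            yield position, average


readings = [10, 12, 11, 30, 35, 12]
alert_positions = [pos for pos, _ in alerts(readings, window_size=3, threshold=20)]
print(alert_positions)
```

Each value is looked at once and only the current window is kept in memory, so nothing has to be stored before it is analyzed.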



Streams and BigInsights - Integrated Analytics on Data in Motion & Data at Rest

InfoSphere Streams: Data ingest, preparation, online analysis, model validation
InfoSphere BigInsights, Database & Warehouse: Data integration, data mining, machine learning, statistical modeling
Visualization of real-time and historical insights

Flows between Streams and BigInsights:
1. Data Ingest (data)
2. Bootstrap/Enrich (control flow)
3. Adaptive Analytics Model
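
The adaptive analytics loop on this slide, where a model fitted on data at rest scores data in motion and is periodically refitted, might look roughly like the sketch below. It is a deliberately simple stand-in: the "model" is just a threshold of mean plus k standard deviations, and the names `fit_threshold`, `score_stream` and `refit_every` are hypothetical. It uses neither Streams nor BigInsights.

```python
import statistics


def fit_threshold(history, k=3.0):
    """Fit the 'model' on data at rest: mean plus k standard deviations."""
    return statistics.mean(history) + k * statistics.pstdev(history)


def score_stream(stream, history, refit_every, k=3.0):
    """Flag unusual values in the stream, refitting every refit_every items.

    Note: history is extended in place with every value seen.
    """
    threshold = fit_threshold(history, k)
    flagged = []
    for count, value in enumerate(stream, start=1):
        if value > threshold:             # online analysis on data in motion
            flagged.append(value)
        history.append(value)             # ingest into the data at rest
        if count % refit_every == 0:      # periodic model adjustment
            threshold = fit_threshold(history, k)
    return flagged, threshold


history = [8, 12, 8, 12]
flagged, final_threshold = score_stream([11, 20, 13], history, refit_every=2, k=2.0)
print(flagged, final_threshold)
```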



The Platform Advantage

BENEFITS IN DETAIL
Increase over time: By moving from entry to a 2nd and 3rd project
Lowering deployment costs: Shared components & integration
Points of leverage: Shared text analytics for Streams and BigInsights; HDFS connectors (data integration (ETL, ), Streams)
Accelerators: Build across multiple engines

Platform layers (top to bottom)
Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management
Accelerators
Hadoop System, Stream Computing, Data Warehouse
Information Integration & Governance



IBM big data
THINK

