APP-CAP1250
Fast Data Meets Big Data
Jags Ramnarayan, VMware, Inc. -- Chief Architect, GemFire/SQLFire
Mike Stolz, VMware, Inc. -- Global Senior Staff Architect
#vmworldapps
Disclaimer
This session may contain product features that are currently under development.
This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features discussed or presented have not been determined.
What's Common?
Big Data allows you to find opportunities you didn't know you had
Fast Data allows you to respond to opportunities before they are gone
Working together they enable entirely new business models
The Database is Being Stretched
Big Data: Petabytes vs. gigabytes; democratize BI
Fast Data: Low-latency expectations; horizontal scale
Flexible Data: Multi-structured data; developer productivity
Cloud Delivery: Virtualized; offered as a service
Need a Horizontally Scalable, Elastic Data Management Solution
Elastic
Add/remove data servers dynamically
Grow or shrink dynamically with no interruption of service or data loss
Tiered Data Strategy
Looking Back: Traditional Batch Analytics
[Diagram labels: Business Events; Polling UI; CEP/BAM (vendor proprietary); OLTP Query/Update; Transform, Enrich, Validate; Online DB (RDBMS, structured); Other Data; Delay; Transform Batch; Data Warehouse (structured); Analytic Queries; Delayed Batch Processing]
Pipeline with Hadoop (Log Analytics)
ACQUIRE: Logs, raw data in files
TRANSFORM: Batch, sequential process to filter, extract, transform (HDFS, MapReduce)
ANALYZE: SQL (SQL DB, MPP DB); visualization, analytics
The Real-time Pipeline
ACQUIRE: Stream data (social, SaaS, transactional); batch or single events?
PROCESS IN REAL TIME: Low latency; filter, enrich, correlate
BATCH ANALYZE: Raw data in batches (HDFS, MapReduce); SQL DB, MPP DB
[Diagram labels: Filtered events; Derived insight; Online apps; Online DB]
New Architecture for Real-time (Custom)
Streams
In-Memory Data Grids: Buffer data, process events, in-memory map-reduce (VMware GemFire, SQLFire, Oracle Coherence, etc.)
Stream Processing: Derive insight with continuous event processing (Apache S4, Storm, Esper, StreamBase, GemFire)
BATCH ANALYZE: HDFS, MapReduce; SQL DB, MPP DB
What Principles Drive This Architecture?
Very low latency ingest, highly scalable (write scalability)
Support structured as well as unstructured data
Real-time processing cannot throttle incoming stream(s)
Highly parallelizable with minimum I/O (network and disk)
Be elastic
Cannot lose events, else derived value is questionable
Post-processing of raw and derived events (batch analytics)
STAGE to SCALE
Staged Event-Driven Architecture (SEDA)
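The "stage to scale" idea can be sketched as a chain of stages that each own a thread pool and queue, so a slow stage can be given more workers without touching the others. This is a minimal sketch using only java.util.concurrent; the class name StagedPipeline, the three stage names and the trivial handlers are assumptions for illustration and do not come from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy staged pipeline (hypothetical): every stage runs on its own pool. */
class StagedPipeline {
    // One executor per stage; each executor has its own work queue,
    // so stages are decoupled and can be sized independently.
    private final ExecutorService acquire = Executors.newFixedThreadPool(2);
    private final ExecutorService transform = Executors.newFixedThreadPool(4);
    private final ExecutorService store = Executors.newSingleThreadExecutor();

    // Stand-in for the downstream sink (data grid, HDFS, ...).
    final ConcurrentLinkedQueue<String> sink = new ConcurrentLinkedQueue<>();

    /** Submits one raw event; the future completes when the last stage is done. */
    CompletableFuture<Void> submit(String rawEvent) {
        return CompletableFuture
                .supplyAsync(() -> rawEvent.trim(), acquire)       // stage 1: acquire and clean
                .thenApplyAsync(e -> e.toUpperCase(), transform)   // stage 2: transform
                .thenAcceptAsync(sink::add, store);                // stage 3: store
    }

    void shutdown() {
        acquire.shutdown();
        transform.shutdown();
        store.shutdown();
    }

    public static void main(String[] args) {
        StagedPipeline pipeline = new StagedPipeline();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String event : List.of(" buy ", " sell ", " hold ")) {
            pending.add(pipeline.submit(event));
        }
        // Wait for every event to pass through all three stages.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        pipeline.shutdown();
        System.out.println(pipeline.sink.size() + " events stored");
    }
}
```

The pools here use unbounded queues for brevity; a real ingest path would have to decide what happens when a stage falls behind, since the deck requires both that streams are not throttled and that events are not lost.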
Acquire, Transform, Filter (Fast Ingest)
In-memory Data Grid Concepts
Distributed, memory-oriented key-value store
Queryable, indexable and transactional
Distributed namespace of maps (key-value), called Regions (GemFire, Hibernate), Cache (Oracle), etc.
Two key storage models: replication and partitioning
Replicated Region: synchronous replication for slow-changing data
Partitioned Region: partition for large data or highly transactional data, with a redundant copy
Handle thousands of concurrent connections
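The two storage models can be illustrated with a toy, single-process model in which each cluster member is just a map. This is a sketch under stated assumptions, not GemFire's API: the class RegionModel, its method names, and the "next member holds the redundant copy" placement rule are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of replicated vs. partitioned storage (hypothetical, not a real grid). */
class RegionModel {

    /** Creates n "members", each modeled as a plain in-process map. */
    static List<Map<String, String>> members(int n) {
        List<Map<String, String>> list = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            list.add(new HashMap<>());
        }
        return list;
    }

    /** Replicated region: every member holds every entry, written on the put path. */
    static void putReplicated(List<Map<String, String>> members, String key, String value) {
        for (Map<String, String> member : members) {
            member.put(key, value);
        }
    }

    /** Partitioned region: the key's hash picks a primary; the next member keeps one redundant copy. */
    static void putPartitioned(List<Map<String, String>> members, String key, String value) {
        int primary = Math.floorMod(key.hashCode(), members.size());
        int redundant = (primary + 1) % members.size();
        members.get(primary).put(key, value);
        members.get(redundant).put(key, value);
    }

    /** Reads from the primary and falls back to the redundant copy if the primary lost the entry. */
    static String getPartitioned(List<Map<String, String>> members, String key) {
        int primary = Math.floorMod(key.hashCode(), members.size());
        String value = members.get(primary).get(key);
        if (value != null) {
            return value;
        }
        return members.get((primary + 1) % members.size()).get(key);
    }
}
```

Replication costs one write per member, which is why the slide reserves it for slow-changing data; partitioning keeps the number of copies per entry fixed as members are added.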
Acquire, Cache, Transform, etc.
High-ingest partitioned buffering
Expiry based on TTL, idle time
Windows: count, heap size, LRU eviction
Works with rigid or flexible schema (JSON, objects, SQL)
Cache frequently used DB data for transform, massaging
Partitioned listeners for filtering, event transform, etc.
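TTL expiry combined with a count-bounded window and LRU eviction can be sketched with java.util.LinkedHashMap in access order. The class ExpiringLruBuffer is hypothetical, and time is passed in explicitly so the behavior is deterministic; a real data grid would apply its own expiry and eviction configuration instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy buffer with a TTL and a size cap enforced by LRU eviction (hypothetical class). */
class ExpiringLruBuffer<K, V> {

    /** A value plus the time it was written. */
    private static final class Timed<V> {
        final V value;
        final long writtenAtMillis;

        Timed(V value, long writtenAtMillis) {
            this.value = value;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final long ttlMillis;
    private final LinkedHashMap<K, Timed<V>> entries;

    ExpiringLruBuffer(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true orders entries from least to most recently used;
        // removeEldestEntry then evicts the LRU entry once the cap is exceeded.
        this.entries = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(K key, V value, long nowMillis) {
        entries.put(key, new Timed<>(value, nowMillis));
    }

    /** Returns the value, or null if the key is absent or its TTL has elapsed. */
    synchronized V get(K key, long nowMillis) {
        Timed<V> timed = entries.get(key);
        if (timed == null) {
            return null;
        }
        if (nowMillis - timed.writtenAtMillis >= ttlMillis) {
            entries.remove(key); // expired: drop it lazily on read
            return null;
        }
        return timed.value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Expiry here is lazy (checked on read); an idle-time policy would compare against the last access time rather than the write time.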
Continuously Available
Complex Stream Processing
Continuous Filtering Using CQs
CQ: Continuous Queries
Apps subscribe to streams using queries:
    select * from Tweets where tweetCount > 10 and userId in (X,Y,Z)
Distributed processing: when data changes, async events are pushed reliably to subscribers
All related data is accessible at memory speeds
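The push model behind a continuous query can be sketched as a predicate registered once and evaluated on every write, with matches delivered to a subscriber callback. This is a single-process illustration, not GemFire's CQ API; the names ContinuousQueryDemo, Tweet and registerCq are hypothetical, and delivery is synchronous here for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Toy region that evaluates registered queries on every write (hypothetical). */
class ContinuousQueryDemo {

    /** Event type mirroring the fields used in the slide's query. */
    static final class Tweet {
        final String userId;
        final int tweetCount;

        Tweet(String userId, int tweetCount) {
            this.userId = userId;
            this.tweetCount = tweetCount;
        }
    }

    /** A registered query: the WHERE clause plus the subscriber to notify. */
    private static final class Subscription {
        final Predicate<Tweet> where;
        final Consumer<Tweet> subscriber;

        Subscription(Predicate<Tweet> where, Consumer<Tweet> subscriber) {
            this.where = where;
            this.subscriber = subscriber;
        }
    }

    private final Map<String, Tweet> tweets = new ConcurrentHashMap<>();
    private final List<Subscription> subscriptions = new CopyOnWriteArrayList<>();

    void registerCq(Predicate<Tweet> where, Consumer<Tweet> subscriber) {
        subscriptions.add(new Subscription(where, subscriber));
    }

    /** Stores the entry, then pushes it to every subscriber whose query matches. */
    void put(String key, Tweet tweet) {
        tweets.put(key, tweet);
        for (Subscription s : subscriptions) {
            if (s.where.test(tweet)) {
                s.subscriber.accept(tweet);
            }
        }
    }

    public static void main(String[] args) {
        ContinuousQueryDemo region = new ContinuousQueryDemo();
        Set<String> watched = Set.of("X", "Y", "Z");
        List<Tweet> pushed = new ArrayList<>();
        // Counterpart of: select * from Tweets where tweetCount > 10 and userId in (X,Y,Z)
        region.registerCq(t -> t.tweetCount > 10 && watched.contains(t.userId), pushed::add);
        region.put("t1", new Tweet("X", 42)); // matches
        region.put("t2", new Tweet("X", 3));  // tweetCount too low
        region.put("t3", new Tweet("Q", 99)); // user not watched
        System.out.println(pushed.size() + " event(s) pushed");
    }
}
```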
Using CEP, S4, Storm
Distributed framework for unbounded streams
Custom app processing code that filters, routes and joins multiple streams
Main proposition is horizontally and elastically scalable processors
Simple config model to create processing pipelines
Correlate, Joins with Data
Accessing Historical Data in Real-time
Correlations/Joins with History
Option 1) Keep history in memory
Option 2) Keep in MPP DB (Greenplum DB)
Option 3) In HDFS
Real-time Aggregations with Parallel Processing
Map-reduce for Real Time
Hadoop M/R is for sequential batch processing and not for real time
Java stored procedure:
    @DistributedFunction(regionName="trades")
    public List AnalyzeTrades(@FilterKey Set<String> months, String portfolio) { ... }
Parallel, data-aware function execution
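Data-aware function execution can be sketched as running the same function against each partition's local data in parallel and then reducing the partial results. The sketch below models partitions as in-process maps and uses a parallel stream for the fan-out; the class DataAwareExecution, the month-to-amounts layout and the method names are assumptions, not the grid's actual function service.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy in-memory map-reduce over partitions (hypothetical, single process). */
class DataAwareExecution {

    /**
     * "Map" step: the function runs once per partition and reads only that
     * partition's data. "Reduce" step: the partial sums are added together.
     */
    static double sumTrades(List<Map<String, List<Double>>> partitions, Set<String> months) {
        return partitions.parallelStream()
                .mapToDouble(local -> sumLocal(local, months))
                .sum();
    }

    /** Sums the trade amounts held locally for the requested months (the filter keys). */
    static double sumLocal(Map<String, List<Double>> local, Set<String> months) {
        double sum = 0;
        for (String month : months) {
            for (double amount : local.getOrDefault(month, List.of())) {
                sum += amount;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> member0 = new HashMap<>();
        member0.put("jan", List.of(1.0, 2.0));
        member0.put("feb", List.of(10.0));
        Map<String, List<Double>> member1 = new HashMap<>();
        member1.put("jan", List.of(3.0));
        System.out.println(sumTrades(List.of(member0, member1), Set.of("jan")));
    }
}
```

The point of the pattern is that only the small partial results cross the network; the trade data itself never leaves the member that holds it.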
Ingestion to Hadoop, MPP DB
Enabling Batch Analytics with Write Behind
Store all raw and derived data in Hadoop
Async, parallel write-behind from the data grid into HDFS
Each partition writes batches in parallel for maximum throughput
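Write-behind can be sketched as a queue that the write path only appends to, plus a flusher that drains it in batches toward the slow store. In this sketch the class WriteBehindQueue and the batchWriter callback (standing in for an HDFS append) are hypothetical; one such queue per partition would give the parallel batch writes the slide describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Toy write-behind queue for one partition (hypothetical class). */
class WriteBehindQueue<E> {
    private final BlockingQueue<E> pending = new LinkedBlockingQueue<>();
    private final int batchSize;
    private final Consumer<List<E>> batchWriter; // stand-in for the slow store

    WriteBehindQueue(int batchSize, Consumer<List<E>> batchWriter) {
        this.batchSize = batchSize;
        this.batchWriter = batchWriter;
    }

    /** Write path: only enqueues, so the caller never waits on the slow store. */
    void enqueue(E event) {
        pending.add(event);
    }

    /**
     * Flusher: drains up to batchSize events and writes them as one batch.
     * Returns the number of events written. A production version would have to
     * keep the batch until the store acknowledges it, so that no events are lost.
     */
    int flushOnce() {
        List<E> batch = new ArrayList<>(batchSize);
        pending.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            batchWriter.accept(batch);
        }
        return batch.size();
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        WriteBehindQueue<String> queue = new WriteBehindQueue<>(2, written::add);
        queue.enqueue("e1");
        queue.enqueue("e2");
        queue.enqueue("e3");
        while (queue.flushOnce() > 0) {
            // keep flushing until the queue is empty
        }
        System.out.println(written.size() + " batches written");
    }
}
```

In practice flushOnce would be called from a background thread, for example on a scheduled executor, rather than from the thread that accepts writes.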
Staged Pipeline: Bringing It All Together
Use Spring Integration to orchestrate the pipeline
Patterns: pub-sub, splitters, routers, transformers, etc.
Distribution of Analytic Results
Multi-Site Capability
Active Everywhere
Single Cluster Spanning Data Centers
[Diagram: three Data Fabric Nodes]
Multi-Site Capability
Active Everywhere
Asynchronous, Fault Tolerant, Bi-Directional WAN Gateway
Global Data Distribution
GemFire can keep clusters that are distributed around the world synchronized in real time, and can operate reliably in Disconnected, Intermittent and Low-Bandwidth network environments.
Bringing It All Together: What Would It Look Like?
Existing technologies working together
Greenplum + Gem/SQLFire Combined Architecture
[Diagram labels]
Greenplum: SQL, MapReduce; Master Servers (query planning & dispatch); Network Interconnect; Segment Servers (query processing & data storage); External Sources
GemFire: Now Data; CQ and Pub/Sub; Active Notification
SQLFire: Fast Ingest, XTP; Transaction Data
THANK YOU