
APP-CAP1250

Fast Data Meets Big Data

Jags Ramnarayan, VMware, Inc. -- Chief Architect, GemFire/SQLFire

Mike Stolz, VMware, Inc. -- Global Senior Staff Architect

#vmworldapps

Disclaimer

This session may contain product features that are currently under development.

This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new technologies or features discussed or presented have not been determined.

What's Common?

Fast Data Meets Big Data

Big Data allows you to find opportunities you didn't know you had.

Fast Data allows you to respond to opportunities before they are gone.

Working together, they enable entirely new business models.

The Database is Being Stretched

Fast Data
Low latency expectations
Horizontal scale

Big Data
Petabytes vs. gigabytes
Democratize BI

Flexible Data
Multi-structured data
Developer productivity

Cloud Delivery
Virtualized
Offered as-a-Service

Need a Horizontally Scalable, Elastic Data Management Solution

Elastic

Add/remove data servers dynamically

Grow or shrink dynamically with no interruption of service or data loss

Tiered Data Strategy


Looking Back: Traditional Batch Analytics

[Diagram. Labels: Business Events; Polling UI; CEP/BAM (vendor proprietary); Transform, Enrich, Validate; OLTP Query/Update; OLTP Data; Online DB (RDBMS, structured); Other Data; Delay; Batch Transform; Data Warehouse (structured); Delayed Batch Processing; Analytic Queries]

Pipeline with Hadoop (Log Analytics)

[Diagram. Stages: ACQUIRE, TRANSFORM, ANALYZE. Labels: Logs, raw data in files; Batch; Sequential process to filter, extract, transform; HDFS, MapReduce; SQL; SQL DB, MPP DB; Visualization; Analytics]

The Real-time Pipeline

[Diagram. Stages: ACQUIRE, PROCESS IN REAL TIME, BATCH ANALYZE. Labels: Stream data (social, SaaS, transactional); Low latency; Filter, enrich, correlate; Batch/single events?; Filtered events; Raw data in batches; Derived insight; HDFS, MapReduce; SQL DB, MPP DB; Online apps; Online DB]

New Architecture for Real-time (Custom)

[Diagram. Labels: Streams; BATCH ANALYZE; SQL DB, MPP DB; HDFS, MapReduce]

In-Memory Data Grids: buffer data, process events, in-memory map-reduce (VMware GemFire, SQLFire, Oracle Coherence, etc.)

Stream Processing: derive insight with continuous event processing (Apache S4, STORM, Esper, StreamBase, GemFire)

What Principles Drive This Architecture?

Very low latency ingest, highly scalable (write scalability)
Support structured as well as unstructured data
Real-time processing cannot throttle incoming stream(s)
Highly parallelizable with minimum I/O (network and disk)
Be elastic
Cannot lose events, else the derived value is questionable
Post-processing of raw and derived events (batch analytics)

What Principles Drive This Architecture?

STAGE to SCALE
Staged Event-Driven Architecture

Acquire, Transform, Filter (Fast Ingest)


In-memory Data Grid Concepts

Distributed, memory-oriented key-value store
Queryable, indexable, and transactional
Distributed namespace of Maps (key-value), called Regions (GemFire, Hibernate), Caches (Oracle), etc.
Two key storage models: replication and partitioning
Handles thousands of concurrent connections

Replicated Region: synchronous replication for slow-changing data
Partitioned Region: partition for large data or highly transactional data (the diagram shows a redundant copy)
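The two storage models above can be illustrated with a small toy model. This is only a sketch in plain Java, not GemFire's API: the class name `RegionSketch`, the hash-based placement, and the choice of "next member" for the redundant copy are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of replicated vs. partitioned storage; hypothetical, not a product API.
final class RegionSketch {
    private final List<Map<String, String>> members = new ArrayList<>();
    private final boolean replicated;

    RegionSketch(int memberCount, boolean replicated) {
        this.replicated = replicated;
        for (int i = 0; i < memberCount; i++) {
            members.add(new HashMap<>());
        }
    }

    // Index of the member that holds the primary copy of a key.
    int primaryFor(String key) {
        return Math.floorMod(key.hashCode(), members.size());
    }

    void put(String key, String value) {
        if (replicated) {
            // Replicated region: every member holds every entry.
            for (Map<String, String> member : members) {
                member.put(key, value);
            }
        } else {
            // Partitioned region: one primary plus one redundant copy on the next member.
            int primary = primaryFor(key);
            members.get(primary).put(key, value);
            members.get((primary + 1) % members.size()).put(key, value);
        }
    }

    // Read while pretending that one member is unavailable.
    String get(String key, int failedMember) {
        for (int i = 0; i < members.size(); i++) {
            if (i == failedMember) continue;
            String value = members.get(i).get(key);
            if (value != null) return value;
        }
        return null;
    }

    // Number of members currently holding the key.
    int copies(String key) {
        int count = 0;
        for (Map<String, String> member : members) {
            if (member.containsKey(key)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        RegionSketch partitioned = new RegionSketch(4, false);
        RegionSketch replicatedRegion = new RegionSketch(4, true);
        partitioned.put("trade-1", "IBM");
        replicatedRegion.put("USD", "US Dollar");
        System.out.println(partitioned.copies("trade-1"));  // 2
        System.out.println(replicatedRegion.copies("USD")); // 4
    }
}
```

The trade-off the sketch tries to show: replication costs memory on every member but makes any read local, while partitioning spreads data and write load and relies on the redundant copy to survive the loss of a member.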

Acquire, Cache, Transform, etc.

High-ingest partitioned buffering
Expiry based on TTL, idle time
Windows: count, heap size, LRU eviction
Works with rigid or flexible schema (JSON, objects, SQL)
Cache frequently used DB data for transforms and massaging
Partitioned listeners for filtering, event transformation, etc.
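As a rough illustration of the expiry and eviction ideas on this slide, the sketch below combines a count-based window with LRU eviction and a TTL check. It is plain Java built on `LinkedHashMap`, not a data grid API; the class name `ExpiringBuffer` and the explicit `nowMillis` parameter (used so the behavior is deterministic) are assumptions. Idle-time expiry and heap-size windows are not modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical buffer with a count window (LRU eviction) and TTL expiry.
final class ExpiringBuffer<K, V> {
    // Value plus the time it was written, used for the TTL check.
    private static final class Stamped<T> {
        final T value;
        final long writtenAt;

        Stamped(T value, long writtenAt) {
            this.value = value;
            this.writtenAt = writtenAt;
        }
    }

    private final long ttlMillis;
    private final LinkedHashMap<K, Stamped<V>> entries;

    ExpiringBuffer(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, Stamped<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Stamped<V>> eldest) {
                return size() > maxEntries; // count-based window
            }
        };
    }

    void put(K key, V value, long nowMillis) {
        entries.put(key, new Stamped<>(value, nowMillis));
    }

    // Returns null if the entry is absent, evicted, or older than the TTL.
    V get(K key, long nowMillis) {
        Stamped<V> stamped = entries.get(key);
        if (stamped == null) return null;
        if (nowMillis - stamped.writtenAt >= ttlMillis) {
            entries.remove(key);
            return null;
        }
        return stamped.value;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        ExpiringBuffer<String, String> buffer = new ExpiringBuffer<>(2, 1000);
        buffer.put("a", "1", 0);
        buffer.put("b", "2", 0);
        buffer.get("a", 10);      // touch "a" so "b" becomes least recently used
        buffer.put("c", "3", 20); // window is full, so "b" is evicted
        System.out.println(buffer.get("b", 30));   // null (evicted)
        System.out.println(buffer.get("a", 2000)); // null (expired)
    }
}
```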

Continuously Available


Complex Stream Processing


Continuous Filtering Using CQs

CQ: Continuous Queries

Apps subscribe to streams using queries:
Select * from Tweets where tweetCount > 10 and userId in (X,Y,Z)

Distributed processing
When data changes, subscribers are pushed async events reliably
All related data is accessible at memory speeds
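The continuous query idea can be sketched as a predicate registered once and evaluated on every change. The code below is a plain-Java illustration, not GemFire's CQ API: `ContinuousQuerySketch`, `Tweet`, and `register` are hypothetical names, and events are pushed synchronously here, whereas the slide describes reliable asynchronous delivery.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical in-process model of a continuous query over a key-value region.
final class ContinuousQuerySketch {
    // Event type mirroring the columns used in the slide's query.
    static final class Tweet {
        final String userId;
        final int tweetCount;

        Tweet(String userId, int tweetCount) {
            this.userId = userId;
            this.tweetCount = tweetCount;
        }
    }

    // A registered query: the WHERE clause plus the subscriber to notify.
    private static final class Subscription {
        final Predicate<Tweet> where;
        final Consumer<Tweet> listener;

        Subscription(Predicate<Tweet> where, Consumer<Tweet> listener) {
            this.where = where;
            this.listener = listener;
        }
    }

    private final Map<String, Tweet> region = new ConcurrentHashMap<>();
    private final List<Subscription> subscriptions = new CopyOnWriteArrayList<>();

    void register(Predicate<Tweet> where, Consumer<Tweet> listener) {
        subscriptions.add(new Subscription(where, listener));
    }

    // Every write is checked against each registered query and pushed on a match.
    void put(String key, Tweet tweet) {
        region.put(key, tweet);
        for (Subscription subscription : subscriptions) {
            if (subscription.where.test(tweet)) {
                subscription.listener.accept(tweet);
            }
        }
    }

    public static void main(String[] args) {
        ContinuousQuerySketch tweets = new ContinuousQuerySketch();
        List<String> users = Arrays.asList("X", "Y", "Z");
        // Equivalent of: where tweetCount > 10 and userId in (X,Y,Z)
        tweets.register(
            t -> t.tweetCount > 10 && users.contains(t.userId),
            t -> System.out.println("pushed: " + t.userId));
        tweets.put("k1", new Tweet("X", 11)); // prints "pushed: X"
        tweets.put("k2", new Tweet("X", 3));  // no match, nothing printed
    }
}
```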

Using CEP, S4, STORM

Distributed framework for unbounded streams
Custom app processing code that filters, routes, and joins multiple streams
Main proposition is horizontally, elastically scalable processors
Simple config model to create processing pipelines

Correlate, Joins with Data


Accessing Historical Data in Real-time

Correlations/Joins with History

Option 1) Keep history in memory
Option 2) Keep in MPP DB (Greenplum DB)
Option 3) In HDFS

Real-time Aggregations with Parallel Processing


Map-reduce for Real Time

Hadoop M/R is for sequential batch processing and not for real time
Java stored procedure
Parallel, data-aware function execution

@DistributedFunction(regionName="trades")
public List AnalyzeTrades(@FilterKey Set<String> months, String portfolio) { ... }
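To make "parallel, data-aware function execution" concrete, here is a minimal plain-Java sketch under stated assumptions. It does not use the `@DistributedFunction` annotation from the slide: the class `DataAwareFunctionSketch`, the hash-based ownership of months, and the summing function are all hypothetical. Each filter key is routed to the member that owns it, the function runs against that member's local data in parallel, and the partial results are reduced by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model: trades are partitioned by month across members.
final class DataAwareFunctionSketch {
    // One map per member: month -> trade amounts stored on that member.
    private final List<Map<String, List<Double>>> members = new ArrayList<>();

    DataAwareFunctionSketch(int memberCount) {
        for (int i = 0; i < memberCount; i++) {
            members.add(new HashMap<>());
        }
    }

    private Map<String, List<Double>> ownerOf(String month) {
        return members.get(Math.floorMod(month.hashCode(), members.size()));
    }

    void addTrade(String month, double amount) {
        ownerOf(month).computeIfAbsent(month, m -> new ArrayList<>()).add(amount);
    }

    // "Map": run on the owner of each filter key. "Reduce": sum the partial results.
    double totalFor(Set<String> months) {
        ExecutorService pool = Executors.newFixedThreadPool(members.size());
        try {
            List<Future<Double>> partials = new ArrayList<>();
            for (String month : months) {
                Map<String, List<Double>> local = ownerOf(month); // route to the data
                partials.add(pool.submit(() -> {
                    double sum = 0;
                    for (double amount : local.getOrDefault(month, Collections.emptyList())) {
                        sum += amount;
                    }
                    return sum;
                }));
            }
            double total = 0;
            for (Future<Double> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        DataAwareFunctionSketch trades = new DataAwareFunctionSketch(3);
        trades.addTrade("Jan", 100.0);
        trades.addTrade("Feb", 50.0);
        trades.addTrade("Mar", 25.0);
        Set<String> filter = new HashSet<>(Arrays.asList("Jan", "Feb"));
        System.out.println(trades.totalFor(filter)); // 150.0
    }
}
```

The point of the design is that the function moves to the data rather than the data moving to the function, so only the small partial results cross the network.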

Ingestion to Hadoop, MPP DB


Enabling Batch Analytics with Write Behind

Store all raw and derived data in Hadoop
Async, parallel write-behind from the data grid into HDFS
Each partition writes batches in parallel for maximum throughput
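A minimal sketch of the write-behind pattern described above, in plain Java. It assumes a hypothetical `sink` callback standing in for the HDFS writer; the class name `WriteBehindSketch`, the explicit `flush()` trigger, and the hash-based partitioning are illustration choices, not the product's mechanism. Writers return as soon as an event is queued, and each partition later drains its own queue in batches on its own thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

// Hypothetical write-behind buffer with one queue per partition.
final class WriteBehindSketch {
    private final List<BlockingQueue<String>> partitionQueues = new ArrayList<>();
    private final int batchSize;
    // Stand-in for the store writer: receives (partition index, batch of events).
    // It is called from several threads, so it must be thread-safe.
    private final BiConsumer<Integer, List<String>> sink;

    WriteBehindSketch(int partitions, int batchSize, BiConsumer<Integer, List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
        for (int i = 0; i < partitions; i++) {
            partitionQueues.add(new LinkedBlockingQueue<>());
        }
    }

    // The caller returns once the event is queued; the store write happens later.
    void put(String key, String event) {
        int partition = Math.floorMod(key.hashCode(), partitionQueues.size());
        partitionQueues.get(partition).add(event);
    }

    // Drain every partition queue in batches, one writer thread per partition.
    void flush() {
        List<Thread> writers = new ArrayList<>();
        for (int p = 0; p < partitionQueues.size(); p++) {
            final int partition = p;
            Thread writer = new Thread(() -> {
                BlockingQueue<String> queue = partitionQueues.get(partition);
                List<String> batch = new ArrayList<>();
                while (queue.drainTo(batch, batchSize) > 0) {
                    sink.accept(partition, new ArrayList<>(batch));
                    batch.clear();
                }
            });
            writers.add(writer);
            writer.start();
        }
        for (Thread writer : writers) {
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        WriteBehindSketch buffer = new WriteBehindSketch(3, 4,
            (partition, batch) -> System.out.println("partition " + partition + " wrote " + batch.size()));
        for (int i = 0; i < 10; i++) {
            buffer.put("k" + i, "event-" + i);
        }
        buffer.flush();
    }
}
```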

Staged Pipeline: Bringing It All Together

Use Spring Integration to orchestrate the pipeline
Patterns: pub-sub, splitters, routers, transformers, etc.

Distribution of Analytic Results


Multi-Site Capability

Active Everywhere
Single cluster spanning data centers
Asynchronous, fault-tolerant, bi-directional WAN gateway

[Diagram: Data Fabric Nodes]

Global Data Distribution

Distribute

GemFire can keep clusters that are distributed around the world synchronized in real-time and can operate reliably in Disconnected, Intermittent and Low-Bandwidth network environments.

Bringing It All Together: What Would It Look Like?

Existing technologies working together


Greenplum + Gem/SQLFire Combined Architecture

[Diagram. Labels: SQL, MapReduce; CQ and Pub/Sub; Master Servers (query planning & dispatch); Network Interconnect; Segment Servers (query processing & data storage); GemFire "Now Data"; Active Notification; External Sources; Fast Ingest, XTP: SQLFire; Transaction Data]

THANK YOU


