
Big Data Framework

Background

Multi Analytics Solution Expertise
8+ Years of Specialization
• Data Analytics
• Big Data
• Business Intelligence
• Data Integration
• Data Management

Delivering Right Business Value & Solution
• Right Architecture
• Accelerated Delivery
• Expertise in Multiple Technologies
• 160+ team members, HQ in Silicon Valley

Domain
• Recommendations & insights from real business experience, enabling us to deliver the right solution
• Cross-domain expertise in Hi-Tech, Media, Retail, Telecom, Finance
• Benefit from best-of-breed advice
• Pride in safeguarding each client's interests

Focused Delivery
From Strategy and Execution to Support, enabling efficient transfer of capabilities to clients
• Best Practices
• Skills & Learning
• "On the Ground Delivery", with focus on:
• Handholding
• Imparting knowledge
• Cultivating Relationships
• Mission "Project Success"

Innovation
We build Analytics Software/Data Products too!
• HiveDeveloper – SQL-based extract, query & load for Hadoop
• ROBII – BI Adoption Analytics
• Diversio – Predictive Model based Fraud Analytics platform
• Channel & Contact Persona
Offerings

Practice areas: BI | Big Data | Business Solutions | Education

Clients (200+): VMware, Cisco, Amazon, CapitalOne*, eBay, MasterCard*, Netflix, Comcast, Walgreen, StubHub, Visa, Imperva, Violin Memory, Rocketfuel, MTN, oDesk, Polycom, Bloom Energy, Meraki, Confidential 1, Confidential 2, and more

Representative engagements and products:
• Big Data Fraud Analytics
• Big Data Learning
• VMware PoC
• Hadoop VMs
• Big data for BI
• Off-the-shelf frameworks with sample models
• Channels and Data Marketing
• Reporting
• InsightApps Foundation – free, as a contribution to the Open Source Community
• Graph-based MDM
• Joint Product Development with Actian

* In flight
Team Matrix¹

Role                     North America   India
BI Business Analysts          10%         10%
Solution Architects²          10%          5%
Project Managers               8%          5%
Enterprise BI Leads           10%         10%
ETL Developers                10%         20%
BI Developers                  5%         20%
Advanced Analytics            20%         30%

Domain                   North America   India
Hi-Tech                       35%         25%
Media                         10%         20%
Telecom                       30%         45%
Retail                        25%         10%

¹ Percentages include employees who play multiple roles. ² Thought Leaders.


Big Data Engagement Objectives
• Help organizations understand the importance of a big data architecture
• Identify a big data architecture suitable for the organization
• Develop the organization's big data architecture by breaking tasks down into manageable decisions
• Develop an implementation plan to ensure the architecture is successful
• Define security and governance guidelines
• Operationalize and support the big data stack
Challenges in Big Data Implementation
• Data Ingestion – big data projects fail because of ineffective and inefficient data ingestion processes
• 3Vs of Big Data – dealing with the volume, velocity and variety of data sources
• Security – maintaining data privacy and anonymity
• Data Integration – integrating disparate data and making it available for data discovery and analysis
• Metadata Management – defining, capturing and maintaining metadata
• Governance – implementing an effective data governance process
• Performance – handling analytical query processing
• Real-Time Analysis – streaming real-time data and enabling analysis on it
Big Data Reference Architecture - Highlights
• Provides an effective framework which helps the business make decisions about how it uses and implements data architecture technology
• Drives the functional requirements of a data solution's architecture
• Illustrates the feasibility of handling all types of data and latency
• Governance and Monitoring are treated as the base foundation, with activities managed and tracked for each layer
• Availability of data in one or more formats to handle a multitude of use cases

Big Data Challenges Handled:
✓ Data Ingestion
✓ 3V's of Big Data
✓ Security
✓ Data Integration
✓ Metadata Management
✓ Governance
✓ Performance
✓ Real-Time Analysis
✓ Extensibility
Big Data Reference Architecture
Source Data Ingestion Data Processing Data Consumption
Systems

Model Building &


Data Access Insight Generation
Real time Real time Search
Batch Query ElasticSearch
• Map Reduce Query SAS JMP Knime
Streaming Language Solr
Flume Kafka • Pig • Tez
• Hive QL Lucene
• Impala
Algorithms
Chukwa
Storage Storage
Weka RHadoop
Apache Apache Cassandr Mongo MPP/
Hive HBase a DB RDBMS Mahout
Batch Load

Processing
Real Time processing Data Visualization

Data
Sqoop
Map Reduce
Storm Spark Tableau Qlikviiew
InsightApp
Metadata Management Spotfire
s
File Load
Hadoop Shell Multi-tenant Processing -YARN Discovery/ Exploration
Scripts
Excel
Hive Developer Reliable, Redundant, Linear, Compute Storage
( HDFS) Hive Developer

RDBMS Monitoring Security Workflow & Scheduling Coordination Management & pipeline

Ambari Knox Oozie ZooKeeper Falcon


Excel

Text
Files
Data Governance

File based Infrastructure

Commercial Open Source techIdeas Products NoSQL database (separate from Hadoop stack)
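To make the real-time ingestion path concrete, below is a minimal sketch, assuming the kafka-python client, of consuming a stream and landing raw events in a staging area. The topic name, broker address, and staging path are hypothetical placeholders; a production pipeline would typically land events into HDFS via Flume or a Kafka-to-HDFS connector rather than a local file.

```python
# Minimal sketch of the real-time ingestion path (Kafka -> staging),
# using the kafka-python client (pip install kafka-python).
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream-events",                  # hypothetical topic
    bootstrap_servers="broker1:9092",      # hypothetical broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Local file stands in for an HDFS landing zone in this sketch.
with open("staging_clickstream.jsonl", "a") as staging:
    for message in consumer:
        # Land each event unmodified; cleansing happens in the processing layer.
        staging.write(json.dumps(message.value) + "\n")
```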
Functional View of Big Data Architecture

Data Collection:
• Data Extraction (batch)
• Streaming Data Collection via APIs or web services
• Feeds both batch processing and real-time processing

Data Processing:
• Transformation – Structure Extraction, Entity & Relation Extraction, Classification
• Cleansing – Complete Empty Values, Correct Wrong Values, Duplicate Filtering
• Filtration
• Data Integration

Data Consumption:
• Visualization Tools, BI Tools, Discovery & Search Tools
• Data Discovery & Search, Ad-hoc Analytics
• Text/Predictive Analytics, BI Analytics
• Real-Time Analytics backed by a Rules Engine

Metadata Management and Security Management span all functions.
Data Ingestion Layer

Goals:
• Acquisition of required data from sources
• Handling of streaming data
• Data insights from disparate sources
• Effectively handling variability of data by implementing data quality steps

Metadata Activities:
• List of sources
• Data structures
• Frequency and size of data
• DQ techniques applied
• Metadata of semi-structured data

Governance Process:
• Data currency
• Data lineage and run-time successes
• Load statistics and error-handling statistics
• Data cleansing and transformation rules

Security Controls:
• Maintain a separate Hadoop cluster for staging
• Network encryption for all communication – HDFS data transfer and HTTP, with Kerberos-secured RPC connections

Big Data Challenges handled: ✓ Data Ingestion · ✓ 3V's of Big Data · ✓ Security · ✓ Data Integration · ✓ Metadata Management · ✓ Governance · ✓ Performance · ✓ Real-Time Analysis · ✓ Extensibility
Data Ingestion Layer – Functional Flow

Data Extraction:
• Set up connectivity to data sources and extract the required data set
• Aggregate data if there are multiple data sources

Streaming Data Collection:
• Set up connectivity to streaming APIs
• Define timelines of extraction
• Dump data in the staging area and perform further processing

Filtration:
• Sometimes unwanted data needs to be filtered out
• The decision is made based on current and future analysis requirements

Transformation:
• Structure Extraction – extract structured data from semi-structured data like JSON, XML and HTML by leveraging metadata
• Entity & Relation Extraction – identify unique entities, merge identical entities, and extract facts or relations about entities, similar to attributes in a traditional DWH
• Classification – sentiment analysis, influence analysis, and dividing data by topics decided by the user

Cleansing (a minimal sketch of these steps follows this slide):
• Complete Empty Values – fill incomplete and empty attribute values using other attributes or statistical and machine learning techniques; mostly defined in user requirements
• Correct Wrong Values – using defined rules and thresholds, correct outliers and errors in the data, along with conflicting data between different sources
• Duplicate Filtering – identify multiple data items referring to the same object or entity and remove them

Data Integration:
• Build a schema based on requirements
• Build logic to transform data from different sources into the schema
• Maintain standards across the whole dataset
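The cleansing steps above can be illustrated with a small, self-contained Python sketch; the records, field names, and thresholds are hypothetical, and a real pipeline would run equivalent logic as Pig, Hive or MapReduce jobs.

```python
# Illustrative sketch of the cleansing steps: complete empty values,
# correct wrong values, and duplicate filtering. All data is hypothetical.
records = [
    {"id": 1, "age": 34, "country": "US"},
    {"id": 1, "age": 34, "country": "US"},    # duplicate entity
    {"id": 2, "age": None, "country": "US"},  # empty value
    {"id": 3, "age": 230, "country": "US"},   # outlier
]

# Complete Empty Values: fill missing ages with a simple statistical default (median).
known_ages = sorted(r["age"] for r in records if r["age"] is not None)
median_age = known_ages[len(known_ages) // 2]
for r in records:
    if r["age"] is None:
        r["age"] = median_age

# Correct Wrong Values: clamp outliers using a defined rule/threshold.
MAX_AGE = 120
for r in records:
    if r["age"] > MAX_AGE:
        r["age"] = MAX_AGE

# Duplicate Filtering: drop records referring to the same entity (same id).
seen, deduped = set(), []
for r in records:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

print(deduped)
```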
Data Processing Layer

Goals:
• Incorporate feasibility for future analysis
• Maintain timeliness for real-time analysis
• Handle different usability requirements

Metadata Activities:
• Capture the machine learning and statistical techniques used
• Pre-calculated rules from the Rules Engine

Governance Process:
• Business terms & definitions
• Data ownership
• Storage description & data structures

Security Controls:
• ACLs on tables, column families and individual cells
• Encryption of data at rest using various approaches, such as tokenization
• Task authorization via access controls on jobs
• Auditing using the Namenode local log

Big Data Challenges handled: ✓ Data Ingestion · ✓ 3V's of Big Data · ✓ Security · ✓ Data Integration · ✓ Metadata Management · ✓ Governance · ✓ Performance · ✓ Real-Time Analysis · ✓ Extensibility
Data Processing Layer – Functional Flow

BI Analytics:
• Generate pre-calculated values as per business rules
• Create predefined views for standard reports and dashboards
• Create batch jobs to maintain these views, similar to a DWH

Text Analytics:
• Apply machine learning, data mining or statistical techniques, whichever suits the user's need
• Deduce values from the data to enrich master data, and also use them in cleansing

Data Discovery & Search:
• Largely depends on metadata maintenance
• Implementation of performance techniques for effective free-text search across all types of data and documents

Ad-hoc Analytics:
• Design schema and storage for feasibility of ad-hoc querying or text analytics
• Maintain datasets in sandboxes for the analytics that need sample data

Rules Engine:
• Generate rules or models from historical data or from existing text analytics
• Create complex event processes for detecting patterns

Real Time Analytics (a sketch follows this slide):
• Apply rules or processes existing in the Rules Engine to incremental data
• Use in-memory or caching techniques based on timeline requirements
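As an illustration of the Rules Engine / Real Time Analytics pairing, here is a hedged Python sketch that applies pre-calculated rules to one incremental event; the rule set and event fields are hypothetical.

```python
# Sketch: rules derived offline (the Rules Engine) applied to each
# incremental record (Real Time Analytics). All names are hypothetical.
rules = [
    # (description, predicate) pairs a rules engine might pre-calculate
    ("amount above fraud threshold", lambda e: e["amount"] > 10_000),
    ("velocity: too many events per minute", lambda e: e["events_last_minute"] > 50),
]

def evaluate(event):
    """Apply every pre-calculated rule to one incremental event."""
    return [desc for desc, predicate in rules if predicate(event)]

# Incremental (streaming) data would arrive from the ingestion layer;
# here a single in-memory event stands in for it.
event = {"amount": 12_500, "events_last_minute": 3}
print(evaluate(event))  # -> ['amount above fraud threshold']
```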
Data Consumption Layer

Goals:
• Enable BI systems or other appliances to consume bulk data through APIs
• Real-time access through query language tools
• Web access through RESTful APIs

Metadata Activities:
• Users and/or applications and their types of access
• Peak usage and data consumed

Governance Process:
• Usage statistics, error reports and audit trails

Security Controls:
• Authenticate users using Kerberos
• Enforce permissions on HDFS files for users/groups

Big Data Challenges handled: ✓ Data Ingestion · ✓ 3V's of Big Data · ✓ Security · ✓ Data Integration · ✓ Metadata Management · ✓ Governance · ✓ Performance · ✓ Real-Time Analysis · ✓ Extensibility
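As a sketch of the "web access through RESTful APIs" goal, the following minimal Flask service exposes a pre-aggregated dataset. The endpoint, port, and data are hypothetical; a real deployment would read from the analytical store and authenticate callers (e.g. via Kerberos/SPNEGO).

```python
# Minimal sketch of RESTful web access to consumption-layer data, using Flask.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for pre-aggregated data served out of the analytical store.
DAILY_METRICS = [{"day": "2014-01-01", "sessions": 1024}]

@app.route("/api/v1/metrics/daily")
def daily_metrics():
    # Bulk consumers (BI systems, appliances) would page through real data here.
    return jsonify(DAILY_METRICS)

if __name__ == "__main__":
    app.run(port=8080)
```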
Storage Layer

Data distribution across the storage layer is handled based on processing and access requirements:

• Staging (HDFS) – landing zone for extracted source data
• Analytical Data Store (columnar / key-value, MPP/RDBMS) – store for transformed and aggregated data for analytical or any other use cases; similar to a DWH
• Sandbox (document store, HBase) – a copy or sample data store for Text Analytics or Ad-Hoc Analytics
• Enriched Data Store (HBase, HDFS, document store) – centralized and long-term data
Metadata Management

Goals:
• Maintain sources & data structures
• Directory of analysis techniques and processing steps within the system
• Data provenance

Approach – capture metadata at each stage:
• Source Systems – list of sources from the catalog, data structures, frequency and size of data
• Ingestion – transformed data formats, DQ techniques applied, metadata of semi-structured data
• Data Processing Layer – machine learning techniques, statistical techniques, pre-calculated rules from the Rules Engine
• Operations – job runtimes, processing steps, user statistics

Lifecycle: Store Data → Monitor → Correction
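One way to make the per-layer metadata concrete is a single record type; the Python sketch below models the bullets above as a dataclass, with field names that are otherwise hypothetical.

```python
# Sketch of the per-layer metadata captured in the approach above.
from dataclasses import dataclass, field

@dataclass
class MetadataRecord:
    layer: str                 # "source", "ingestion", "processing", or "operations"
    source: str                # e.g. a system name from the source catalog
    data_structure: str        # schema or format description
    dq_techniques: list = field(default_factory=list)    # DQ techniques applied
    processing_steps: list = field(default_factory=list) # steps within the system
    job_runtime_seconds: float = 0.0                     # operations metadata

# One record per layer/dataset feeds the Store -> Monitor -> Correction loop.
rec = MetadataRecord(layer="ingestion",
                     source="crm_db",
                     data_structure="JSON (semi-structured)",
                     dq_techniques=["null-fill", "dedup"])
print(rec)
```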
Security Management

Security controls are defined to secure the complete stack, from application down to data block, through various layers: multiple Hadoop clusters, a Data Virtualization (DV) layer, users & groups, file permissions, OS authentication, network encryption, auditing, and encryption of data at rest.

Data Extraction:
• Maintain a separate Hadoop cluster for staging
• The big data ingestion process will not connect to source systems directly
• Implementation of a Data Virtualization layer

Users:
• Authenticate users using Kerberos
• Enforce permissions on HDFS files for users/groups
• Track logging history

Data Movement:
• Network encryption for all communication – HDFS data transfer and HTTP, with Kerberos-secured RPC connections
• Enforce task authorization by access control on jobs
• Auditing using the Namenode local log

Data Access:
• ACLs on tables, column families and individual cells
• Encryption of data at rest using various approaches, such as tokenization
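Two of the controls above (Kerberos authentication and HDFS file permissions) can be automated with the standard Hadoop command-line tools; the Python sketch below wraps them with subprocess, using a hypothetical principal, keytab, and HDFS path.

```python
# Hedged sketch: automate a Kerberos keytab login and an HDFS ACL grant
# using the standard `kinit` and `hdfs dfs` CLIs.
import subprocess

# Authenticate the service user via Kerberos (kinit with a keytab).
subprocess.run(
    ["kinit", "-kt", "/etc/security/keytabs/etl.keytab", "etl@EXAMPLE.COM"],
    check=True,
)

# Enforce file-level permissions: grant a group read-only access via HDFS ACLs.
subprocess.run(
    ["hdfs", "dfs", "-setfacl", "-m", "group:analysts:r-x", "/data/enriched"],
    check=True,
)
```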
Backup Slides
Our Approach to defining Big Data Architecture

Infrastructure - Cloud vs Bare Metal

Cloud-Based

Pros:
✓ Ability to analyze petabytes of data in the cloud without having to store, integrate, or manage it on your own
✓ Pay-per-use model with low up-front investment and quick time to market
✓ Option of private or public cloud
✓ Speed and ease of access
✓ Highly elastic model
✓ Reduces day-to-day maintenance and workload
✓ Broad range of vendors, including but not limited to Cloudera, Amazon EC2, Rackspace, SoftLayer, and VMware's vCloud

Cons:
× Responsible for deploying, managing, and maintaining clusters on your own
× Liability for the data falls on the organization, not the cloud vendor
× Security risk

Bare Metal

Pros:
✓ Large degree of control over daily management, security, and maintenance

Cons:
× The TCO is high for on-premise hosting, as all costs need to be covered (e.g. cooling, energy, property costs, staffing, security, maintenance)
× Greater up-front investment for infrastructure
× Longer time to market
× In-house security must be well developed to prevent being compromised
Infrastructure – OSS vs Vendor

OSS (e.g. Cloudera, Hortonworks)

Pros:
✓ Regular releases with updated versions of the different projects
✓ Easier to set up, manage, and monitor complex clusters thanks to the graphical deployment, administration, and monitoring tooling offered by distribution vendors
✓ Potentially more OSS-skilled resources than for vendor-specific solutions
✓ Organizations comfortable with coding can save money by avoiding a vendor suite

Cons:
× Still requires technical depth to write code for utilizing the OSS and to integrate data sources into the distributed persistence
× OSS distributions are not always compatible with vendor stacks; check which OSS distributions are supported

Vendor (e.g. HP Vertica, IBM)

Pros:
✓ Regular releases with tested and updated versions of the vendor's products
✓ Vendor-supported setup, management, and monitoring capabilities and tooling
✓ Vendor assumes any OSS licensing implications as part of its products
✓ Typically, vendors offer more in-depth support
✓ Organizations already using a specific vendor stack may be able to leverage existing relationships to set up a big data environment at a more reasonable cost

Cons:
× While solutions are vendor supported, they are not trivial and require in-house technical expertise
× High software license costs
× Some big data suites apply data-driven costs ("data tax") which can get very expensive
× Skilled resource availability: there are likely more OSS-skilled resources in the marketplace than vendor-specific skills, so reliance on the vendor and/or third-party system integrators for resources can be costly
× Selecting a vendor for big data persistence may constrain the number of alternatives available for data integration and big data analysis solutions
Governance

Governance policies and guidelines help organizations proactively track and monitor the state and health of the system, via:

• Change Management procedures to formally accept or reject a change
• An Architecture Review Board for reviewing and recommending changes to the existing architecture and architecture standards
• Compliance and policy regulations for external data
• Defined data archiving and purging policies
• Data replication and encryption policies
• Periodic auditing and redefinition of the policies


Data Propagation between Layers

Sources:
• Streaming – Kafka, Flume, Chukwa
• RDBMS (Oracle, MySQL, SQL Server) – loaded via Sqoop
• File-based sources (Excel, text files) – loaded via Hadoop shell scripts

Processing paths on Hadoop 2.0 (HDFS 2.0 + YARN, with data processing engines on HDFS):
• Batch – MapReduce with Pig, Hive, HBase, and Python/Java UDFs
• Near real time – interactive query processing engines: Tez, Impala, Hive QL
• Real time – Storm with HBase, Cassandra, MongoDB

Consumption: Tableau, Spotfire, Qlikview, InsightApps
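The batch path above (file-based source → MapReduce) can be sketched as a Hadoop Streaming mapper written in Python; the input format and field positions are hypothetical, and a matching reducer would aggregate the emitted counts.

```python
# Sketch: Hadoop Streaming mapper emitting "<country>\t1" pairs for a
# reducer to sum. Launched with the standard streaming jar, e.g.:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input /staging/logs -output /processed/counts
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        # Field 1 is assumed (hypothetically) to hold the country code.
        print(f"{fields[1]}\t1")
```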
Thank You
