Background
Multi
Analytics Solution
Domain Innovation
Focused Delivery
Expertise
* - Inflight
Team Matrix
Role                    North America    India
BI Business Analysts    10%              10%
Project Managers        8%               5%
BI Developers           5%               20%
[Architecture diagram. Labels recoverable from the slide:]
• Data sources: RDBMS, Excel, text files
• Data ingestion: Sqoop, file load, Hadoop shell scripts
• Processing: MapReduce, Hive
• Real-time processing: Storm, Spark
• Multi-tenant processing: YARN
• Storage: reliable, redundant, linear compute and storage (HDFS)
• Data visualization: Tableau, QlikView, Spotfire, InsightApps
• Discovery / exploration
• Roles: Hive Developer
• Cross-cutting services: metadata management, data governance, monitoring, security, workflow & scheduling, coordination, management & pipeline
• Legend: commercial products, open source, techIdeas products, NoSQL database (separate from Hadoop stack)
Functional View of Big Data Architecture
[Diagram. Labels recoverable from the slide:]
• Data ingestion: data extraction and collection through APIs or web services; streaming data and batch data
• Data integration: transformation, cleansing (correct wrong values, complete empty values), filtering / filtration, structure extraction, classification
• Processing: real-time processing, batch processing, rules engine
• Data consumption: visualization tools, BI tools, data discovery & ad-hoc analytics, discovery & search tools, text / predictive analytics, BI analytics, real-time analytics
• Cross-cutting: metadata management, security management
Data Ingestion Layer
Goals
• Acquisition of required data from source
• Handling of streaming data
• Data insights from disparate sources
• Effectively handling variability of data by implementing data quality steps
Metadata Activities
• List of sources
• Data structures
• Frequency and size of data
• DQ techniques applied
• Metadata of semi-structured data
Governance Process
• Data currency
• Data lineage and run-time successes
• Load statistics and error-handling statistics
• Data cleansing and transformation rules
• Value deduction from data to enrich master data, which is also used in cleansing
Big Data Challenges handled
• Data Ingestion
• 3V's of Big Data
• Security
• Metadata Management
• Data Integration
• Governance
• Performance
• Real-Time Processing
• Data Discovery & Ad-hoc Analytics
• Search
• Streaming Data
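The data quality steps and load statistics listed for the ingestion layer can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the deck's implementation: the rule tables (`CORRECTIONS`, `DEFAULTS`, `REQUIRED`), the `LoadStats` fields, and the column names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LoadStats:
    """Load and error-handling statistics gathered during one ingestion run."""
    rows_read: int = 0
    rows_loaded: int = 0
    rows_rejected: int = 0
    values_corrected: int = 0
    values_filled: int = 0
    errors: list = field(default_factory=list)


# Hypothetical cleansing rules (assumptions for this sketch only).
CORRECTIONS = {"country": {"USA": "US", "U.S.": "US"}}  # correct wrong values
DEFAULTS = {"segment": "UNKNOWN"}                       # complete empty values
REQUIRED = ("customer_id",)                             # reject rows missing these


def cleanse(records):
    """Apply the rules to each record and return (clean_rows, stats)."""
    stats = LoadStats()
    clean = []
    for row in records:
        stats.rows_read += 1
        missing = [key for key in REQUIRED if not row.get(key)]
        if missing:
            # Error handling: count the rejection and keep a readable message.
            stats.rows_rejected += 1
            stats.errors.append(f"row {stats.rows_read}: missing {missing}")
            continue
        out = dict(row)
        for col, mapping in CORRECTIONS.items():
            if out.get(col) in mapping:
                out[col] = mapping[out[col]]
                stats.values_corrected += 1
        for col, default in DEFAULTS.items():
            if not out.get(col):
                out[col] = default
                stats.values_filled += 1
        clean.append(out)
        stats.rows_loaded += 1
    return clean, stats
```

Keeping the counters next to the cleansing logic means the governance statistics are produced by the same pass that applies the rules, rather than being derived afterwards.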
Data Consumption Layer
Goals
• Enable BI systems or other appliances to consume bulk data through APIs
• Real-time access through query language tools
• Web access through RESTful APIs
Governance Process
• Usage statistics, error reports and audit trails
Security Controls
• Authenticate users using Kerberos
• Enforce permissions on HDFS files to users/groups
Big Data Challenges handled
• Data Ingestion
• 3V's of Big Data
• Security
• Metadata Management
• Governance
• Performance
• Real-Time Analysis
• Extensibility
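HDFS file permissions follow a POSIX-like owner / group / other model, so the "enforce permissions to users/groups" control can be sketched as a simple mode-bit check. This is a simplified model for illustration, not HDFS code: `FileStatus` and `is_allowed` are hypothetical names, and superuser handling and ACL extensions are left out.

```python
from dataclasses import dataclass

# Permission bits, as in POSIX-style modes.
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass(frozen=True)
class FileStatus:
    """Minimal stand-in for a file's ownership and mode (hypothetical type)."""
    path: str
    owner: str
    group: str
    mode: int  # octal mode, e.g. 0o640


def is_allowed(status, user, groups, wanted):
    """Return True if `user` (a member of `groups`) holds all `wanted` bits."""
    if user == status.owner:
        bits = (status.mode >> 6) & 0o7   # owner class
    elif status.group in groups:
        bits = (status.mode >> 3) & 0o7   # group class
    else:
        bits = status.mode & 0o7          # everyone else
    return (bits & wanted) == wanted
```

For example, with mode `0o640` the owner can read and write, members of the file's group can only read, and all other users are denied.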
Storage Layer
Data distribution across the storage layer is handled based on processing and access requirements
• Job Runtimes
• Processing Steps
• User Statistics
Security Management
Security controls are defined to secure the complete stack
from Application to Data Block through various layers
Data Access
• ACLs on tables, column families, and individual cells
• Encryption of data at rest using various approaches, such as tokenization
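Tokenization replaces a sensitive value with a random surrogate and keeps the mapping in a separately secured store. A minimal in-memory sketch, assuming a hypothetical `TokenVault` class and `tok_` prefix, could look like this; a real deployment would persist the mapping in a protected store with its own access controls.

```python
import secrets


class TokenVault:
    """Toy vault mapping sensitive values to random tokens (illustrative only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a stable random token for `value`, creating one if needed."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        while token in self._token_to_value:  # regenerate on the rare collision
            token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Recover the original value; raises KeyError for unknown tokens."""
        return self._token_to_value[token]
```

Because the token is random rather than derived from the value, the stored data reveals nothing about the original without access to the vault.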
Backup Slides
Our Approach to defining Big Data Architecture
Approach
Infrastructure - Cloud vs Bare Metal
Cloud-Based
Pros
• Ability to analyze petabytes of data in the cloud without having to store, integrate, or manage on your own.
• Pay-per-use model with low up-front investment and quick time to market.
• Option of private or public cloud.
• Speed and ease-of-access.
• Highly elastic model.
• Reduces day-to-day maintenance and workload.
• Broad range of vendors include, but are not limited to, Cloudera, Amazon EC2, Rackspace, SoftLayer, or VMware's vCloud.
Cons
× Responsible for deploying, managing, and maintaining clusters on your own.
× Liability for the data falls on the organization, not the cloud vendor.
× Security risk.
Bare Metal
Pros
• Large degree of control over daily management, security, and maintenance.
Cons
× The TCO is high for on-premise hosting, as all costs need to be covered (e.g. cooling, energy, property costs, staffing, security, maintenance, etc.).
× Greater up-front investment for infrastructure.
× Longer time to market.
× In-house security must be well developed to prevent being compromised.
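The trade-off between a pay-per-use model and a larger up-front investment can be made concrete with a simple cumulative-cost comparison. The sketch below is illustrative only: the function names are hypothetical, the cost model is deliberately linear, and any figures passed in are assumptions rather than data from this deck.

```python
def cumulative_cost(upfront, monthly, months):
    """Total spend after `months`: one-off investment plus recurring cost."""
    return upfront + monthly * months


def break_even_month(cloud_upfront, cloud_monthly,
                     metal_upfront, metal_monthly, horizon=120):
    """First month at which bare metal is no more expensive than cloud.

    Returns None if that never happens within `horizon` months, for
    example when on-premise running costs exceed the cloud bill.
    """
    for month in range(1, horizon + 1):
        metal = cumulative_cost(metal_upfront, metal_monthly, month)
        cloud = cumulative_cost(cloud_upfront, cloud_monthly, month)
        if metal <= cloud:
            return month
    return None
```

With hypothetical inputs such as `break_even_month(0, 10_000, 300_000, 4_000)`, the function reports the month in which the up-front investment has been recovered; when on-premise running costs are the higher of the two, it returns `None`.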
Infrastructure – OSS vs Vendor
OSS (Ex: Cloudera, Hortonworks)
Pros
• There are regular releases with updated versions of different projects.
• Easier to set up, manage, and monitor complex clusters thanks to graphical tooling for deployment, administration, and monitoring of clusters offered by vendors.
• Potentially more OSS skilled resources than vendor specific solutions.
• Organizations comfortable with coding can save money by avoiding a vendor suite.
Cons
× Still requires technical depth to write code for utilizing the OSS and to integrate data sources into the distributed persistence.
× OSS distributions are not always compatible with vendor stacks. Check which OSS distributions are supported.
Vendor (Ex: HP Vertica, IBM)
Pros
• There are regular releases with tested and updated versions of vendors' products.
• Vendor supported set up, management, and monitoring capabilities and tooling.
• Vendor assumes any OSS licensing implications as part of its products.
• Typically, vendors offer more in-depth support.
• Organizations already using a specific vendor stack may be able to leverage existing relationships to set up a big data environment at a more reasonable cost.
Cons
× While solutions are vendor supported, they are not trivial and require in-house technical expertise.
× High software license costs.
× Some big data suites apply data-driven costs ("data tax") which can get very expensive.
× Skilled resource availability. There are likely more OSS skilled resources available in the marketplace than vendor-specific skills. It could be costly if there is a reliance upon the vendor and/or third-party system integrators for resources.
× Selecting a vendor for big data persistence may constrain the number of alternatives available for data integration and big data analysis solutions.
Governance
Governance policies and guidelines help organizations proactively track and monitor the state and health of the system via:
• An Architecture Review Board for reviewing and recommending changes to the existing architecture and architecture standards
Data Processing Engines on HDFS
[Diagram. Labels recoverable from the slide:]
• Sources: RDBMS (Oracle, MySQL, SQL Server); file-based sources (Excel, text files)
• Ingestion: Sqoop, Flume, Kafka, Chukwa, shell scripts
• Batch: MapReduce, Hive (HiveQL), Pig, Python / Java UDFs, HBase
• Real time: Storm, HBase, Cassandra, MongoDB
• Platform: Hadoop
Thank You