
ARCHITECTING DATA INTENSIVE SOFTWARE SYSTEMS

Engels. R
Department of Computer Science and Engineering
PSG College of Technology
Speaker intro
• B.E. (CSE) from PSG College of Technology, 1997
• M.S. (CS) from Colorado State University, USA, 2001
• 12+ years of industry experience
– Last ~10 years at Microsoft USA and Microsoft India R&D
– Senior Technical Program Manager
• Windows / Server and Tools Business
– Business owner for review management / interop documentation
• Associate Professor (from Nov 2011)
– Department of Computer Science & Engineering, PSG Tech
• Research interests
– Computational intelligence
– Reasoning and knowledge representation
– Geospatial systems and location-aware services
– Software architectures for data intensive, HTC and HPC systems



Goals of this lecture
• Understand Big Data – a brief discourse
• Review current data intensive software architectures
• Highlight the key areas and challenges faced in current approaches
• At the end of the lecture, we should be able to
– identify important architectural needs for developing data intensive software / analyses
– apply these learnings to put together an architecture for an intended system, using reference architectures
• Motivational assumptions:
– Scheduled early in the FDP, after introductions to key concepts, this lecture should make later lectures on design and technology more effective
– The audience has some familiarity with software engineering and software architecture



Architecting Data Intensive Systems
1. Basic discussion
2. Need
3. Architectures
4. Challenges
5. Q&A
BIG DATA OR BIG FRAUD?
– THE OTHER SIDE



The other side of Big Data

(Highly opinionated!)
• How many of us understand what Big Data is?
– apart from the (in)famous 5 Vs definition
• How many of us can fathom the differences between Big Data and traditional (and not so traditional) methods of analyzing data?
– be it data mining or knowledge extraction
• Is Big Data a marketing term and push?
– If we agree, do we also say marketing is somehow ‘bad’?
• Technically oriented people long for a precise definition
– and are irritated by a “bigger is better” argument
Big Data
• “As one of the progenitors of the term “big data”, I’m happy with its imprecision. It wasn’t coined for marketing purposes, but as a descriptor for an emerging technology phenomenon. In the broader business consciousness, at the end of 2012, “big data” really means “smart use of data”.”
– Edd Dumbill, Forbes contributor
• Huh? Is “descriptor for an emerging tech phenomenon” different from marketing?
– Especially when the idea itself is imprecise?
– Similar to Web 2.0?



Huntsman, What Quarry?
Upon this gifted age, in its dark hour,
Rains from the sky a meteoric shower
Of facts… they lie unquestioned, uncombined.
Wisdom enough to leech us of our ill
Is daily spun; but there exists no loom
To weave it into fabric.
– Edna St. Vincent Millay

Big data is like fundamental research:
– Everyone talks about it
– Nobody really knows how to do it
– Everyone thinks everyone else is doing it
– So everyone claims they are doing it

• Am I talking against Big Data? No:
– ease of usage of the term builds familiarity and brings in more practitioners
– it drives the creation of simpler tools and propagates the use of large and dynamic data sets



NEED



Why?
• Innovations in the early twenty-first century are primarily driven by
– the collection, processing and sharing of large volumes of data
• Massive data manipulation
– Broad based, across different domains (politics, business intelligence, scientific and medical research)
• Data generation capabilities have increased
– Radio astronomy (LOFAR): 138 PB / day
– Climate models: 8 PB / run
– Large Hadron Collider: 2 PB / second (while in operation)
• Is data volume the only challenge?



Why?…2
• Brings several challenges in different related areas
– storage capacity, computing power, network bandwidth, computing languages, web protocols
• Scientific collaborations
– virtual teams / large data (SETI, weather analysis)
• Interpreting and making a collage of results
– How do we architect a system for doing this?
• In this session, the focus is on
– key software architectural elements in science data systems
– identifying the challenges around those key elements
– pointing out areas where Big Data research is currently focused
• in other words, where venture capital funding is available
• in India, read “funding agencies” for “VC”
[Figure: Simple functional model for data analysis]



[Figure: Information consumers and their focus on architectural pieces]



[Figure: Conceptual expansion of the functional model to include Big Data]



[Figure: Conceptual expansion of the functional model to include new knowledge into the analysis]



[Figure: V’s of Big Data – beyond the familiar five: Variability, Visualization, …]
ARCHITECTURES



Architecture versus design
• Abstraction: Architecture sits at a higher level of abstraction and lays the foundation for design; design is a deeper, more concrete view.
• Concern: Architecture is system-wide bridging of what is done, why it is done, and the alternatives considered (usually closer to requirements); design is how it is done, usually focusing on a specific part of the system.
• Scope: Architecture has the bigger scope: choices of frameworks, languages, goals and high-level methodologies (including how design might be done); design has a focused scope: how code will be organized, contracts between different parts of the system, and implementation of goals.
• Orientation: Architecture is strategic, about the overall fitness of the solution to the business; design is tactical, about realizing technical needs efficiently for the current need.
• Decisions: Architectural decisions are both political and technical; design decisions are almost always technical.
• Pictorial representation: UML and block diagrams for architecture; UML for design.
• In short: architecture ensures the right things are done; design ensures things are done right.

Recent trends, such as agile, blur the difference.


Fundamental architectural challenges
• Why challenging?
– Canonical software architecting approaches are unable to meet the demands
– imposed by highly connected and computationally demanding DI systems
• Challenges are seen in Workflow Management (WM), File Management (FM) and Resource Management (RM)
– Not exhaustive, but the most indicative at the architecture level
– Architecting solutions and tools for data intensive computing usually works around these needs


Data and Control Flow
• Data (and metadata) enter the system
– into a staging area
• In the staging area, data is curated
– May involve humans or automated tools for metadata extraction and structural information decoration
– Using information modeling
• Models of the data and metadata to be catalogued and archived are derived
• After curation
– Data is ingested (through a sophisticated FM)
– Data is disseminated from staging to archive
– Usually the archive is controlled and rich with metadata


Data and Control Flow…2
• FM is responsible for
– Staging data and metadata to the WM component
– Managing the total volume of data in the archive
• WM component is responsible for
– Data processing
– Orchestration of control flow (sequences of executions of work units, or tasks) and data flow (passing of information between these tasks); a minimal sketch follows
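A minimal sketch of such orchestration in Python, assuming three hypothetical task names (ingest, calibrate, mosaic); a real WM component adds scheduling, retries and persistence on top of this core idea:

    # Run tasks in dependency order (control flow), passing each task's
    # outputs to its downstream tasks (data flow). Task names are invented.
    from graphlib import TopologicalSorter

    def ingest():           return {"raw": "staged files"}
    def calibrate(raw):     return {"calibrated": "calibrated(" + raw + ")"}
    def mosaic(calibrated): return {"product": "mosaic(" + calibrated + ")"}

    dag = {"ingest": set(), "calibrate": {"ingest"}, "mosaic": {"calibrate"}}
    tasks = {"ingest": ingest, "calibrate": calibrate, "mosaic": mosaic}

    results = {}
    for name in TopologicalSorter(dag).static_order():
        inputs = {}
        for dep in dag[name]:                 # collect upstream outputs
            inputs.update(results[dep])
        results[name] = tasks[name](**inputs)

    print(results["mosaic"])  # {'product': 'mosaic(calibrated(staged files))'}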



Data and Control Flow…3
• RM component is responsible for
– Managing the underlying compute and storage resources
– to execute a ready-to-run task
– Identifying a hardware node for the job (after verifying resource availability)
– Staging the appropriate data files for the task
• done by communicating with the FM in the form of a metadata search (see the sketch below)
– Once a task is completed, its results, usually a set of files and metadata,
• are re-ingested and disseminated through the FM
• and made available to downstream tasks and data processing


Data and Control Flow…4
• At some point in the lifecycle of a data intensive system
– Data is delivered from the processing archive to the long-term archive
– Further data curation is possible, to enrich the metadata and the structure of the data
– Data can be presented externally using data portals
• for programmatic search of data and metadata
• to explore and consume the information model that describes the data
• Information model: a representation of concepts and their relationships, constraints, rules and operations, specifying data semantics (a minimal sketch follows this slide)
• Ex.: components in software, devices in a network, or a billing system’s entities
– Dissemination may also be done from the long-term archive
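To make "information model" concrete, a minimal Python sketch with invented concepts (Instrument, Observation), a relationship between them, and one enforced constraint:

    # Concepts as classes, relationships as references, constraints as checks.
    from dataclasses import dataclass, field

    @dataclass
    class Instrument:
        name: str
        band: str

    @dataclass
    class Observation:
        target: str
        instrument: Instrument           # relationship: Observation -> Instrument
        exposure_s: float
        keywords: list = field(default_factory=list)

        def __post_init__(self):         # modeled constraint, enforced on creation
            if self.exposure_s <= 0:
                raise ValueError("exposure must be positive")

    obs = Observation("M31", Instrument("EVLA", "L-band"), exposure_s=900.0)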



[Figure: Architectural lifecycle in a typical data intensive system]



[Figure: Prototype architecture for Expanded Very Large Array (EVLA) astronomy data]



[Figure: Oracle’s IM reference architecture model]



Oracle’s IM reference architecture
• A classically abstracted architecture, with the purpose of each layer clearly defined
• Five layers
– Staging Layer
– Foundation Data Layer
– Access and Performance Layer
– Knowledge Discovery Layer
– BI Abstraction and Query Federation Layer
• Staging Data Layer
– Decouples the rate at which data is received onto the platform from the rate at which it is prepared and made available to the general community
– Facilitates a ‘right-time’ flow of information through the system
Oracle’s IM reference architecture…2
• Foundation Data Layer
– Abstracts the atomic data from the business process
– For relational technologies, the data is represented close to third normal form, in a business process neutral fashion, to make it resilient to change over time
– For non-relational data, this layer contains the original pool of invariant data
• Access and Performance Layer
– Facilitates access and navigation of the data, allowing the current business view to be represented in the data
– For relational technologies, data may be logically or physically structured in simple relational, longitudinal or dimensional forms
– For non-relational data, this layer contains one or more pools of data, optimized for a specific analytical task, or the output of an analytical process
– e.g., in Hadoop it may contain the data resulting from a series of MapReduce jobs that will be consumed by a further analysis process
Oracle’s IM reference architecture…3
• Knowledge Discovery Layer
– Facilitates the addition of new reporting areas through agile development approaches, and data exploration (strongly and weakly typed data) through advanced analysis and data science tools (e.g., data mining)
• BI Abstraction & Query Federation Layer
– Abstracts the logical business definition from the location of the data
– Presents the logical view of the data to the consumers of BI
– This abstraction facilitates
• Rapid Application Development (RAD),
• migration to the target architecture, and
• the provision of a single reporting layer over multiple federated sources
What is the primary architectural goal?
• One of the key advantages often cited for the Big Data approach is
– the flexibility of the data model over and above a more traditional approach,
– where the relational data model is seen to be brittle in the face of rapidly changing business requirements
• By storing data in a business process neutral fashion, and incorporating an
– Access and Performance Layer and
– Knowledge Discovery Layer
• into the design, we can adapt quickly to new requirements and avoid this issue
• A well designed data warehouse should not require the data model to be changed to keep in step with the business,
• and it provides for rich, broad and deep analysis
[Figure: Layered architecture for Big Data]



Hadoop = HDFS + MapReduce
Hive = SQL-like queries (HiveQL) that compile to MapReduce jobs [Facebook]
Pig = high-level dataflow language (Pig Latin) for a wide variety of datasets [Yahoo]; fills a similar role to Hive
HBase = non-relational (NoSQL) store supporting random read/update/delete
Cassandra = distributed NoSQL storage system based on a peer-to-peer model
Flume = channelling data into Hadoop
Oozie = workflow processing system (MapReduce, HBase, Pig, etc.)
Mahout = data mining / machine learning library
ZooKeeper = centralized coordination service for distributed applications
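To make the MapReduce entry concrete, the classic word count in the Hadoop Streaming style: the mapper emits "word TAB 1" lines, Hadoop's shuffle sorts them by key, and the reducer sums consecutive counts per word. The Python sketch below simulates this locally; a real job would submit the mapper and reducer scripts through Hadoop Streaming instead:

    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                yield word + "\t1"

    def reducer(sorted_pairs):
        first = lambda kv: kv.split("\t")[0]
        for word, group in groupby(sorted_pairs, key=first):
            total = sum(int(kv.split("\t")[1]) for kv in group)
            yield word + "\t" + str(total)

    data = ["big data big analysis", "data pipelines"]
    shuffled = sorted(mapper(data))        # stands in for Hadoop's shuffle/sort
    print("\n".join(reducer(shuffled)))    # analysis 1, big 2, data 2, pipelines 1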
[Figure: Implementation architecture]



[Figure: MS layered architecture for Big Data]



CHALLENGES



Total Volume Challenge
• In science data systems
– Data is regularly stored as files on disk
– Associated metadata lives in
• a file catalogue such as a database, or
• a flat-file based index (such as Lucene), or
• simply metadata files alongside the data files
– At average-scale volumes (up to a few hundred GB),
• disk repositories and files can be managed in an ad hoc manner
– For prohibitive sizes we need other approaches



Total Volume Challenge…2
• Metadata-based file organization
– Partitioning of data based on metadata
– e.g., for spatially located datasets, regions may be used as partitions (see the sketch after this list)
• File management replication
– Replication of file management servers, to partition
• the overall data namespace and
• its access, search and processing
• Selection of data dissemination technologies
– Support for multiple delivery intervals
– Parallelized saturation of the underlying network
– Examples: bbFTP and GridFTP
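A small sketch of metadata-based partitioning, with invented metadata fields (mission, latitude, date): the partition path is derived from each granule's metadata so the archive can be searched and managed per partition:

    from pathlib import Path

    def partition_path(meta, root="/archive"):
        band = "lat%+03d" % (int(meta["lat"] // 10) * 10)  # 10-degree latitude bands
        return Path(root) / meta["mission"] / band / meta["date"][:7]  # YYYY-MM

    granule = {"mission": "climate-model", "lat": 23.4, "date": "2013-06-17"}
    print(partition_path(granule))   # /archive/climate-model/lat+20/2013-06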



Data Dissemination
• Need
– On one hand, upstream data sets distributed across the globe must be brought into a data-intensive system for processing;
– on the other, the data produced by a data-intensive system must be distributed to several geographically diverse archives
• Unfortunately, in many cases, the present state of the art
– involves providers that run legacy services customized to the specific storage type, data format, and control policies in use at the center
• Middleware code that
– delivers data and metadata conforming to a common API,
– while interfacing with the existing back-end servers and applications,
• is often necessary to bridge the gap



Data Dissemination…2
• Also requires
– network bandwidth that can keep up with the data generation stream, and
– data transfer protocols able to take advantage of that bandwidth
• Historically, network bandwidth has lagged behind the continuous increase in storage capacity and computing power
• Currently, the fastest networks allow transfers above 10 GB/s, but they are available only between a few selected locations
• New technologies
– GridFTP, UDT and bbFTP try to maximize data transfer rates by
– instantiating multiple concurrent data streams,
– automatically tuning buffer sizes, and
– decomposing each transferred file (a sketch of the multi-stream idea follows)
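A sketch of the multi-stream idea behind these protocols, using HTTP range requests as a stand-in for a real parallel transfer protocol: the file is decomposed into byte ranges that are fetched concurrently and reassembled:

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import Request, urlopen

    def fetch_range(url, start, end):
        req = Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
        with urlopen(req) as resp:
            return start, resp.read()

    def parallel_fetch(url, size, streams=4):
        chunk = size // streams
        ranges = [(i * chunk,
                   size - 1 if i == streams - 1 else (i + 1) * chunk - 1)
                  for i in range(streams)]
        with ThreadPoolExecutor(max_workers=streams) as pool:
            parts = list(pool.map(lambda r: fetch_range(url, *r), ranges))
        return b"".join(data for _, data in sorted(parts))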



Data Curation
• Data curation is a broad term for
– a set of processes designed to ensure that data at all stages of a data-intensive software system, from raw input data to processed output,
– exhibits properties that facilitate a common organization and a unified interpretation, and
– carries sufficient supporting information, or metadata,
– so as to be easily shared and preserved
• The curation concept is not new
– but it has taken an increasingly prominent role in the modern era of high-volume, complex data systems


Data Curation…2
• Data curation is closely related to efforts in information modeling
• It is very often one of the chief mechanisms by which
– the abstract information model is actually applied to the data in the system, and
– any modeled constraints are enforced
• From this perspective, the data curation process can be viewed as a mechanism for quality assurance
• It provides an opportunity
– to perform sanity checks and corrective action on the data
– as data moves through the system (see the sketch below)
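A sketch of curation as quality assurance, with invented validation rules: each record is checked against the modeled constraints, and a corrective action is applied where possible:

    RULES = {
        "temperature_k": lambda v: isinstance(v, (int, float)) and 0 < v < 1000,
        "station_id":    lambda v: isinstance(v, str) and v.strip() != "",
    }

    def curate(record):
        problems = [f for f, ok in RULES.items()
                    if f not in record or not ok(record[f])]
        if "station_id" in problems and "station" in record:
            record["station_id"] = record.pop("station")   # corrective action
            problems.remove("station_id")
        return record, problems

    rec, issues = curate({"temperature_k": 288.2, "station": "IN-CBE-001"})
    print(rec, issues)   # normalized record; issues is empty if all checks pass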



Data Curation…3
• The benefits of data curation are not limited to error detection
• The curation process is often viewed as a mechanism for adding value to the data
• Curation provides opportunities for
– enriching data with contextual metadata annotations
– that facilitate its downstream discovery and use (such as search)



Big Data + Deep Analysis
• The current evolutionary stage
• Big data analytics is usually done in a batch fashion (e.g., once a day)
– Big data processing and deep data analysis usually happen at different stages of this batch process
• The big data processing part is usually done using
– Hadoop / Pig / Hive technology, with classical ETL logic implementations
• By leveraging the MapReduce model that Hadoop provides,
– we can linearly scale up the processing by adding more machines to the Hadoop cluster
– Drawing on cloud computing resources (e.g., Amazon EMR) is a very common approach to performing these tasks


Big Data + Deep Analysis…2
• The deep analysis part is usually done in R, SPSS or SAS
– using a much smaller amount of carefully sampled data that fits into a single machine’s capacity
– usually fewer than a couple of hundred thousand data records
• The deep analysis part usually involves
– data visualization,
– data preparation,
– model learning
• linear regression and regularization,
• k-nearest neighbour,
• support vector machines,
• Bayesian networks,
• neural networks,
• decision trees and ensemble methods,
– and model evaluation (a stand-in sketch follows)
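As a stand-in for that R/SPSS/SAS step, a minimal Python sketch of the sample-then-model workflow; the synthetic data and the k-nearest-neighbour choice are illustrative (requires scikit-learn):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # Data preparation: a sample far below "a couple hundred thousand" records
    X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Model learning
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

    # Model evaluation
    print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))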



[Figure: Big Data + Deep Analysis]


[Figure: Big Data contextualization issues – courtesy Gartner]



SUMMARY



Summary
• We looked at different reference architectures, and at an example architecture based on one of those reference architectures
• We considered possible issues in the architectural phases of data intensive systems
• Solving individual component challenges is important, but architectural fit is crucial



References
• Most references are online resources from Microsoft, Oracle and IBM
• Images and quotes are from web resources
• I thank the original contributors


QUESTIONS?

