Trends in databases

ITP 249
Lecture 11
Outline
• Big data
– The Features of Big Data
– What Drives Big Data?
– Big Data Applications
– Real-time Analytics and Its Impact
• In-memory db
• Columnar db
• Limitations of SQL
• NoSQL db
What is big data?
• Something large and full of information?
– Maybe, but that says nothing about what Big Data really is
• Universal Definition
– Extremely large data sets
– Grown beyond capacity of traditional tools
– Also the processes of leveraging the data (e.g. Analytics, BI, Data Mining)
• What kind of data?
– Every day we create 2.5 quintillion (2.5 × 10^18) bytes of data
– 90% of the data in the world today has been created in the last two years
– Sensors (IoTs), Blogs, Pics, Videos, E-commerce, GPS, etc
• Analytics and Research define Big Data today
– More data, more analysis, more results
– Presents opportunity for deep analysis, pattern prediction and correlation
Structured vs. Unstructured Data
Structured
• Strictly organized, common schema
• Designed for management by computers
• Relational databases & spreadsheets
• Standard search operations
Unstructured
• No uniform structure
• Designed for use by humans & devices
• Word docs, PDFs, emails, videos, IoT sensor data, audio files, HTML, & images
• Limited data visibility
With the rise of 4K video, medical images, IoT, digital information, AI and analytics, the data explosion is accelerating.
• 3X: growth in the number of enterprises with 1PB+ of unstructured data from 2016 to 2017
• 90% of all data was created in the last 2 years
• 80% of data is unstructured
• 331 EB: object-based storage capacity by 2021
[Chart: projected storage capacity in exabytes, 2010 to 2020, unstructured (file and object) vs. structured (block) storage]
© Copyright IBM Corporation 2017
The growing imperative of Business Data
• Analytics have emerged for years from transactional, structured data (databases, sales transactions)…
• …to massive interactive, unstructured content (documents, web pages, cameras, text messages, emails)
• 80% of data is unstructured
Who is using Big Data?
– Science / Research (NASA / NOAA)
– Pharma / Health
– Energy
– Media and Entertainment
– Manufacturing
– Finance
– All Businesses today leverage some form of big data
– References:
• http://www.cnbc.com/id/100792215
• http://video.cnbc.com/gallery/?video=3000168940
• http://www.cnbc.com/id/100638376
The Features of Big Data
• 7 ‘V’s that describe the features of big data
Volume
• Volume of data collected, stored, and shared is growing
faster than ever before
• Not all data are stored. Some are discarded, others are
archived. Even then the total volume is growing
Variety
• Source of data
• Form of data
• Business data,
social media data
• Multiple languages
• Formats – text,
voice, photos,
video, audio
Velocity
• Speed at which data are generated and collected
• Can also refer to how quickly data can be
processed
Variability
• Changes in the meaning of data over time or in
context (asset class over time)
• Data of unknown or indistinct type or structure
or format (number, text, emoji, etc)
• Sentiment analysis uses natural language
processing to derive the attitude of the writer
Veracity
• Reliability or truthfulness of data
• Errors and inaccuracies
• Separating noise from signal
Volatility
• Lifespan of data
• How long data are available
• How long should it be stored
Value
• Driving force of big data analytics is value
• Should provide benefit to someone
• Providing big data itself is a business
• Evaluate the benefit of investing in big data
against the cost
What Continues to Drive Big Data?
• World is becoming more digital
• World is becoming more connected
• Electronic/digital devices are becoming more
economical (putting technology in the hands of
more people)
• Traditional forms of social communications are
being replaced with digital ones that are often
‘free’
Not just the Data =>
Big Data Applications
• Business Intelligence -> AI
• AI -> Machine Learning -> Deep Learning
• Application Categories:
– Statistical Applications (Trends)
– Predictive Analysis (Trends -> Predictions)
– Data modeling/Data Visualization
– What If scenarios
In-memory Databases (IMDB)
• Using RAM instead of hard disk for the database
• All relevant data are in memory all the time
• Speeds up queries to provide real time or near real time analytics
capabilities
• Innovations
• Data are stored in RAM
• Use of columnar storage for the relational database.
• Indexing (is free with columnar storage)
• Data compression
• Parallel data processing
• Partitioning data
• SAP HANA is an IMDB
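A minimal sketch of the in-memory idea, assuming Python's built-in sqlite3 module as a stand-in (it is not SAP HANA and has none of its columnar or parallel features). The `sales` table and its values are illustrative only.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; nothing is written to disk.
# SQLite is used here only as a convenient stand-in for an IMDB product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, product TEXT, pieces INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("USA", "DXTR1100", 5),
        ("USA", "DXTR1100", 21),
        ("Germany", "DXTR3100", 12),
        ("Germany", "DXTR3100", 34),
    ],
)

# An analytic query answered entirely from memory.
totals = dict(
    conn.execute("SELECT country, SUM(pieces) FROM sales GROUP BY country").fetchall()
)
print(totals)
```

The data disappears when the connection closes, which is why production in-memory databases add persistence mechanisms on top.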
Real-time Analytics and Its Impact

• Provide almost instantaneous feedback from analytics processing


• React to changing customer needs
• React to opportunities in real time
• Example of customer service in credit card companies
• Customer navigates the website but cannot find resolution
• Calls customer service rep
• Real time analytics helps improve customer experience
• Re-order call tree to the customer’s most likely reason for calling
• Prepopulate the rep’s screens
• Eliminate options from the phone tree based on customer’s browsing history
• Change language of chat or call
• Make promotional offers to customer
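One way the call-tree re-ordering could work, sketched under the assumption that the customer's recent page visits are available as a simple list. The function name `reorder_call_tree` and the option labels are hypothetical.

```python
from collections import Counter


def reorder_call_tree(options, browsing_history):
    """Put the options the customer browsed most often first."""
    visits = Counter(page for page in browsing_history if page in options)
    # sorted() is stable, so options with equal visit counts keep their order.
    return sorted(options, key=lambda option: -visits[option])


options = ["balance", "lost card", "dispute a charge", "rewards"]
history = ["dispute a charge", "rewards", "dispute a charge", "faq"]
reordered = reorder_call_tree(options, history)
print(reordered)
```

A real system would score options with a predictive model rather than raw visit counts; the count is just the simplest possible signal.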
In-Memory Appliance Development
• Drivers
– Big data
– Predictive analytics
– Real-time analytics
– Self-service BI

• Enabling hardware innovations


– High-capacity RAM
– Multi-core processor architectures
– Massive parallel scaling
– Massively parallel processing (MPP)
– Large symmetric multiprocessors (SMP)

Image Source: Ralokota, R. (May 15, 2011). New tools for new times – primer on big data, Hadoop and “in-memory” data clouds. Retrieved from http://practicalanalytics.wordpress.com/2011/05/15/new-tools-for-new-times-a-primer-on-big-data/
Performance Bottleneck Comparison
• Without high-capacity RAM
− Database stored on disk
− Bottleneck: Latency between disk and RAM
• With high-capacity RAM
− Database stored in memory
− Bottleneck: Latency between CPU and RAM
− Orders of magnitude response time improvements

Image Source: Morrison, A. (2012). The art and science of new analytics technology. PwC Technology Forecast, 1, 31-43. Retrieved from http://www.pwc.com/en_US/us/technology-forecast/2012/issue1/features/feature-art-science-analytics-technology.jhtml
Software That Leverages Hardware Innovations

Source: Plattner, H. & Zeier, A. (2011). In Memory Data Management: An Inflection Point for Enterprise Applications. Retrieved from http://www3.weforum.org/docs/GITR/2012/GITR_Chapter1.7_2012.pdf
Another Innovation - Columnar Databases
Advantages
• Better I/O bandwidth utilization
• Higher cache efficiency
• Faster data aggregation
• High compression rates
• Column-based parallel processing
Disadvantages
• Load times can be slow
• Less efficient for transactional processes
• Possibly slower relational interfaces
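Why columns compress well: values in one column share a type and often repeat. The sketch below uses run-length encoding, which is one common technique and is assumed here for illustration; the slides do not say which scheme any particular product uses. The `country_column` values are hypothetical.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


country_column = ["USA", "USA", "Germany", "Germany"]
encoded = run_length_encode(country_column)
print(encoded)  # [('USA', 2), ('Germany', 2)]
```

The same values spread across row records would be separated by other fields, so there would be no runs to collapse.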
Columnar Storage Example
Country   Customer   Product    Sold Pieces
USA       3000       DXTR1100   5
USA       4000       DXTR1100   21
Germany   23000      DXTR3100   12
Germany   17000      DXTR3100   34

Row table (each row is stored contiguously)
Row 1: USA, 3000, DXTR1100, 5
Row 2: USA, 4000, DXTR1100, 21
Row 3: Germany, 23000, DXTR3100, 12
Row 4: Germany, 17000, DXTR3100, 34

Column table (each column is stored contiguously)
Column 1 (Country): USA, USA, Germany, Germany
Column 2 (Customer): 3000, 4000, 23000, 17000
Column 3 (Product): DXTR1100, DXTR1100, DXTR3100, DXTR3100
Column 4 (Sold Pieces): 5, 21, 12, 34
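The two layouts can be sketched with plain Python structures, using the four sample records from this slide. The variable names are assumptions, and a real engine stores bytes on pages rather than Python lists.

```python
header = ("country", "customer", "product", "sold_pieces")
rows = [
    ("USA", 3000, "DXTR1100", 5),
    ("USA", 4000, "DXTR1100", 21),
    ("Germany", 23000, "DXTR3100", 12),
    ("Germany", 17000, "DXTR3100", 34),
]

# Row layout: the fields of each record sit next to each other.
row_store = [value for row in rows for value in row]

# Column layout: all values of one column sit next to each other.
column_store = {name: [row[i] for row in rows] for i, name in enumerate(header)}

# Summing from rows has to step through every record,
# while the column layout reads a single contiguous list.
total_from_rows = sum(row[3] for row in rows)
total_from_columns = sum(column_store["sold_pieces"])
print(total_from_rows, total_from_columns)
```

Inserting a new record is the opposite case: the row layout appends in one place, while the column layout must touch every column.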


Super Simple App & Schema
Monolithic ERP Application with super simple
schema:
• Employee
• Salary
• Department
Modern Apps (Mobile/Social)

A new app comes along that needs to be ‘internet scale’. What if in your schema…
• You need to add or remove fields, lots of them, frequently?
• You need another table with a ‘variable’ schema?

What if for your infrastructure…


• You need to scale out not up
• Writes are as numerous as reads
• Data volume is high and growth rate is high
• Use is decentralized (web, mobile, IoT)
Limitations of SQL (RDBMS)
• Rigid schema, not easy to add columns
(attributes) as needed
• JOINs are expensive!
• Transaction handling is complex with millions of
concurrent users
• Requires some downtime
• Unstructured data is not easily handled
• Not adaptive to new requirements
NoSQL
• Not Only SQL
• Not based on relational databases
• They may support SQL like querying
• Based on key-value pairs
• Schema-less
• ACID transactions may be compromised to
increase performance, availability, speed.
Eventually consistent.
SQL vs. NoSQL
Enter NoSQL Data stores
• Key-Value: Amazon Dynamo
• Column: Cassandra
• Graph DB: Neo4j
• Document: MongoDB
Key-Value Stores
• Use case:
– Quick lookups with no ‘relational’ component (no
joins)
– Quick and high scalability
– Often (mostly) in memory
• Example:

• Application:
– User session data between shared applications
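A toy illustration of the session-data use case: every operation is a lookup by key, with no joins. The class `SessionStore`, its methods, and the key format are hypothetical and do not mirror the API of any real key-value product.

```python
import time


class SessionStore:
    """Toy in-memory key-value store with per-key expiry."""

    def __init__(self):
        self._data = {}

    def put(self, key, value, ttl_seconds=1800, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl_seconds)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # Expired sessions are dropped on access.
            del self._data[key]
            return None
        return value


store = SessionStore()
store.put("session:42", {"user": "alice", "cart": ["DXTR1100"]}, ttl_seconds=60, now=0)
active = store.get("session:42", now=30)
expired = store.get("session:42", now=61)
```

The `now` parameter exists only so the expiry behavior can be demonstrated without waiting; callers would normally omit it.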
Column Stores
• Use case:
– Super scalable
– Map Reduce support
• Example

• Application:
– Large scale realtime data logging (Finance, Web Analytics)
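A simplified picture of the wide-column model for the logging use case: each row key maps to its own set of columns, and rows need not share columns. This is a sketch of the data model only, with hypothetical names and values; it is not Cassandra's API.

```python
from collections import defaultdict

# row key -> {column name -> value}; each row may hold different columns.
table = defaultdict(dict)


def write(row_key, column, value):
    table[row_key][column] = value


def read_slice(row_key, start, end):
    """Return the columns whose names fall in [start, end], sorted by name."""
    columns = table.get(row_key, {})
    return [(name, columns[name]) for name in sorted(columns) if start <= name <= end]


# Timestamps as column names give a time-ordered log per page.
write("page:/home", "2024-01-01T10:00", 120)
write("page:/home", "2024-01-01T10:01", 95)
write("page:/home", "2024-01-01T10:02", 143)
write("page:/cart", "2024-01-01T10:01", 12)

recent = read_slice("page:/home", "2024-01-01T10:01", "2024-01-01T10:02")
print(recent)
```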
Graph DB
• Use Case:
– Dense network of strongly connected entities
– Nodes and relationships
– Graph Data Modeling
• Example:

• Application:
– Facebook graph search, Google knowledge graph,
Twitter
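Nodes and relationships can be sketched as an adjacency map, with a breadth-first search answering a typical graph question such as "who are my friends of friends?". The names and the `within_hops` helper are hypothetical; a graph database would express this in its own query language.

```python
from collections import deque

# Each node maps to the set of nodes it is connected to.
graph = {
    "Ann": {"Bob", "Cara"},
    "Bob": {"Ann", "Dev"},
    "Cara": {"Ann", "Dev"},
    "Dev": {"Bob", "Cara", "Eli"},
    "Eli": {"Dev"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first search returning {node: hop distance} up to max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


reachable = within_hops(graph, "Ann", 2)
friends_of_friends = sorted(name for name, hops in reachable.items() if hops == 2)
print(friends_of_friends)  # ['Dev']
```

In a relational schema the same question needs a self-join per hop, which is why densely connected data is a poor fit for JOIN-heavy queries.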
Document Store
• Use case:
– Semi-structured data with SQL-like queries
– Collections of related key-value pairs with variable
schemas
• Example:

• Application:
– Document driven web or other applications
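A sketch of the variable-schema idea: documents in one collection carry different fields, yet can still be queried by field value. The `products` collection, its prices, and the `find` helper are all hypothetical and only loosely modeled on how document stores are queried.

```python
# Documents in the same collection do not need the same fields.
products = [
    {"_id": 1, "name": "DXTR1100", "price": 199, "tags": ["audio"]},
    {"_id": 2, "name": "DXTR3100", "price": 349, "warranty_years": 2},
    {"_id": 3, "name": "Cable", "price": 9},
]


def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


with_warranty = find(products, warranty_years=2)
cheap = find(products, price=9)
print([doc["name"] for doc in with_warranty], [doc["name"] for doc in cheap])
```

Documents that lack a queried field simply do not match, so adding a new field never requires changing existing documents.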
Distributed Computing
• Apache Hadoop
• Distributed computing
• Parallel processing
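The split-process-merge pattern behind this style of distributed computing can be sketched on one machine. Here a thread pool stands in for cluster nodes, which is an assumption for illustration; this is not Hadoop's API, and the log lines are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words within one chunk of the input."""
    return Counter(word for line in lines for word in line.lower().split())


def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(lines, workers=2):
    # Split the input so each worker gets its own chunk.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


log_lines = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(log_lines)
print(counts["error"], counts["disk"], counts["network"])
```

Because each chunk is counted independently and the merge is order-insensitive, adding workers changes the speed but not the result.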
When to…
Use an RDBMS when you need/have... | Use NoSQL when you need/have...
Centralized applications (e.g. ERP) | Decentralized applications (e.g. Web, mobile and IoT)
Moderate to high availability | Continuous availability; no downtime
Moderate velocity data | High velocity data (devices, sensors, etc.)
Data coming in from one/few locations | Data coming in from many locations
Primarily structured data | Structured, with semi/unstructured
Complex/nested transactions | Simple transactions
Primary concern is scaling reads | Concern is to scale both writes and reads
Philosophy of scaling up for more users/data | Philosophy of scaling out for more users/data
To maintain moderate data volumes with purge | To maintain high data volumes; retain forever
What if you have both?
(and they are Big)
• SQL-like Distributed Query engines:
– Hive
– Presto
– Drill
– Impala
– Spark SQL
– Lingual

• Distributed computing platforms:


– Hadoop
– Spark
– Tez
When in doubt ask…
• What are the application's use cases?
• What is the application's data model?
• What is the need for scalability on reads/writes?
• What is the query pattern for the application or
users?
