ITP 249
Lecture 11
Outline
• Big data
– The Features of Big Data
– What Drives Big Data?
– Big Data Applications
– Real-time Analytics and Its Impact
• In-memory db
• Columnar db
• Limitations of SQL
• NoSQL db
What is big data?
• Something large and full of information?
– Maybe, but that says nothing about what Big Data really is
• Universal Definition
– Extremely large data sets
– Grown beyond capacity of traditional tools
– Also the processes of leveraging the data (e.g. Analytics, BI, Data Mining)
• What kind of data?
– Every day we create 2.5 quintillion (2.5 × 10^18) bytes of data
– 90 % of the data in the world today has been created in the last two years
– Sensors (IoT), Blogs, Pics, Videos, E-commerce, GPS, etc.
• Analytics and Research define Big Data today
– More data, more analysis, more results
– Presents opportunity for deep analysis, pattern prediction and correlation
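To put the daily volume quoted above into more familiar units, here is a minimal sketch of the conversion. The 2.5 × 10^18 bytes/day figure is from the slide; the use of decimal (SI) exabytes and terabytes is an assumption.

```python
# Convert the slide's daily data volume into more familiar units.
BYTES_PER_DAY = 2.5e18        # 2.5 quintillion bytes per day, from the slide
BYTES_PER_EXABYTE = 1e18      # assumption: decimal (SI) exabyte
BYTES_PER_TERABYTE = 1e12     # assumption: decimal (SI) terabyte
SECONDS_PER_DAY = 24 * 60 * 60

exabytes_per_day = BYTES_PER_DAY / BYTES_PER_EXABYTE
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / BYTES_PER_TERABYTE

print(f"{exabytes_per_day:.1f} EB per day")          # 2.5 EB per day
print(f"{terabytes_per_second:.1f} TB per second")   # 28.9 TB per second
```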
Structured vs. Unstructured Data
[Figure: projected data volume in exabytes, 2010 to 2020, comparing structured block storage with unstructured data. Callout: 3X growth in the number of enterprises with 1 PB+ of unstructured data (from 2016). © Copyright IBM Corporation 2017]
The growing imperative of Business Data
• Analytics have emerged for years from transactional, structured data…
– Sales transactions
– Databases
• …to massive interactive, unstructured content
– Documents
– Web Pages
– Cameras
– Text Messages
– Emails
• 80 % is unstructured
Who is using Big Data?
– Science / Research (NASA / NOAA)
– Pharma / Health
– Energy
– Media and Entertainment
– Manufacturing
– Finance
– All Businesses today leverage some form of big data
– References:
• http://www.cnbc.com/id/100792215
• http://video.cnbc.com/gallery/?video=3000168940
• http://www.cnbc.com/id/100638376
The Features of Big Data
• 7 ‘V’s that describe the features of big data
Volume
• Volume of data collected, stored, and shared is growing
faster than ever before
• Not all data are stored. Some are discarded, others are
archived. Even then the total volume is growing
Variety
• Source of data
• Form of data
• Business data, social media data
• Multiple languages
• Formats – text, voice, photos, video, audio
Velocity
• Speed at which data are generated and collected
• Can also refer to how quickly data can be
processed
Variability
• Changes in the meaning of data over time or in
context (asset class over time)
• Data of unknown or indistinct type or structure
or format (number, text, emoji, etc)
• Sentiment analysis uses natural language
processing to derive the attitude of the writer
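As a rough illustration of the sentiment analysis idea above, the sketch below scores text by counting words from two tiny word lists. The lexicons and the `sentiment` function are hypothetical and chosen only for illustration; production systems rely on much larger resources or trained language models.

```python
import re

# Hypothetical toy lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment(text: str) -> str:
    """Classify text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this phone, the camera is great"))  # positive
print(sentiment("Terrible battery and poor support"))       # negative
print(sentiment("It arrived on Tuesday"))                   # neutral
```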
Veracity
• Reliability or truthfulness of data
• Errors and inaccuracies
• Separating noise from signal
Volatility
• Lifespan of data
• How long data are available
• How long data should be stored
Value
• Driving force of big data analytics is value
• Should provide benefit to someone
• Providing big data itself is a business
• Evaluate the benefit of investing in big data
against the cost
What Continues to Drive Big Data?
• World is becoming more digital
• World is becoming more connected
• Electronic/digital devices are becoming more
economical (putting technology in the hands of
more people)
• Traditional forms of social communications are
being replaced with digital ones that are often
‘free’
Not just the Data =>
Big Data Applications
• Business Intelligence -> AI
• AI -> Machine Learning -> Deep Learning
• Application Categories:
– Statistical Applications (Trends)
– Predictive Analysis (Trends -> Predictions)
– Data modeling/Data Visualization
– What If scenarios
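The step from "Trends" to "Predictions" can be sketched with an ordinary least-squares line fitted to past values and extended one period ahead. The monthly sales figures and the `fit_line` helper are hypothetical; real predictive analytics would use richer models and far more data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: units sold in months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]

slope, intercept = fit_line(months, sales)   # the trend
forecast = slope * 6 + intercept             # the prediction for month 6

print(f"trend: {slope:.1f} units per month")   # trend: 10.0 units per month
print(f"forecast for month 6: {forecast:.1f}") # forecast for month 6: 150.0
```

A "what if" scenario then amounts to changing an input (for example, the slope) and recomputing the forecast.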
In-memory Databases (IMDB)
• Using RAM instead of hard disk for the database
• All relevant data are in memory all the time
• Speeds up queries to provide real time or near real time analytics
capabilities
• Innovations
– Data are stored in RAM
– Use of columnar storage for the relational database
– Indexing (is free with columnar storage)
– Data compression
– Parallel data processing
– Partitioning data
• SAP HANA is an IMDB
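A minimal sketch of the "database in RAM" idea, using Python's built-in `sqlite3` module with the special `:memory:` database name. SQLite is a row-oriented engine and is not SAP HANA; it only illustrates keeping the whole database in memory. The table and rows are hypothetical.

```python
import sqlite3

# ":memory:" tells SQLite to keep the whole database in RAM, not on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, product TEXT, pieces INTEGER)")

# Hypothetical sample rows.
rows = [("US", "Widget", 10), ("US", "Gadget", 5), ("DE", "Widget", 7)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# An analytic query answered without touching the disk.
totals = conn.execute(
    "SELECT country, SUM(pieces) FROM sales GROUP BY country ORDER BY country"
).fetchall()
print(totals)  # [('DE', 7), ('US', 15)]

conn.close()   # the data disappears once the connection is closed
```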
Real-time Analytics and Its Impact
Image Source: Ralokota, R. (May 15, 2011). New tools for new times – primer on big data, Hadoop and “in-memory” data clouds. Retrieved from http://practicalanalytics.wordpress.com/2011/05/15/new-tools-for-new-times-a-primer-on-big-data/
Performance Bottleneck Comparison
• Without high-capacity RAM
− Database stored on disk
− Bottleneck: Latency between disk and RAM
• With high-capacity RAM
− Database stored in memory
− Bottleneck: Latency between CPU and RAM
− Orders of magnitude response time improvements
Image Source: Morrison, A. (2012). The art and science of new analytics technology. PwC Technology Forecast, 1, 31-43. Retrieved from http://www.pwc.com/en_US/us/technology-forecast/2012/issue1/features/feature-art-science-analytics-technology.jhtml
Software That Leverages Hardware Innovations
Source: Plattner, H. & Zeier, A. (2011). In Memory Data Management: An Inflection Point for Enterprise Applications. Retrieved from http://www3.weforum.org/docs/GITR/2012/GITR_Chapter1.7_2012.pdf
Another Innovation - Columnar Databases
Advantages
• Better I/O bandwidth utilization
• Higher cache efficiency
• Faster data aggregation
• High compression rates
• Column-based parallel processing
Disadvantages
• Load times can be slow
• Less efficient for transactional processes
• Possibly slower relational interfaces
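One reason columnar storage compresses well is that values within a single column tend to repeat. The sketch below uses run-length encoding as one simple example of such a scheme; it is not necessarily what any particular columnar database uses, and the sample column is hypothetical.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# Hypothetical "country" column, stored contiguously and sorted.
country = ["DE", "DE", "DE", "US", "US", "FR", "FR", "FR", "FR"]

runs = rle_encode(country)
print(runs)                          # [('DE', 3), ('US', 2), ('FR', 4)]
print(len(country), "->", len(runs)) # 9 -> 3
assert rle_decode(runs) == country   # the encoding is lossless
```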
Columnar Storage Example
Country Customer Product Sold Pieces
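The difference between row and column layout can be sketched with the four column names from the slide header (read here, as an assumption, as Country, Customer, Product, and Sold Pieces). All data values are hypothetical.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"country": "US", "customer": "Acme",   "product": "Widget", "sold_pieces": 10},
    {"country": "US", "customer": "Globex", "product": "Gadget", "sold_pieces": 5},
    {"country": "DE", "customer": "Initech", "product": "Widget", "sold_pieces": 7},
]

# Column-oriented layout: each column is stored as its own contiguous list.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregating in the row layout walks through every complete record...
row_total = sum(row["sold_pieces"] for row in rows)

# ...while the column layout only needs to read the one column involved.
col_total = sum(columns["sold_pieces"])

print(columns["sold_pieces"])  # [10, 5, 7]
print(row_total, col_total)    # 22 22
```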
• Application:
– User session data between shared applications
Column Stores
• Use case:
– Super scalable
– Map Reduce support
• Example
• Application:
– Large-scale real-time data logging (Finance, Web Analytics)
Graph DB
• Use Case:
– Dense network of strongly connected entities
– Nodes and relationships
– Graph Data Modeling
• Example:
• Application:
– Facebook graph search, Google knowledge graph,
Twitter
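The nodes-and-relationships model can be sketched with a plain adjacency mapping and a two-hop traversal. The names, the `follows` graph, and the `friends_of_friends` helper are hypothetical; a real graph database provides its own storage and query language.

```python
# Nodes are people; each relationship is a directed "follows" edge.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
    "dave": set(),
    "erin": set(),
}

def friends_of_friends(graph, start):
    """Return nodes exactly two hops from start, excluding direct neighbours."""
    direct = graph[start]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {start}

print(sorted(friends_of_friends(follows, "alice")))  # ['dave', 'erin']
```

In a relational schema the same question would need a self-join on a relationship table for every extra hop, which is the kind of query graph stores are designed to make cheap.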
Document Store
• Use case:
– Semi-structured data with SQL-like queries
– Collections of related key-value pairs with variable
schemas
• Example:
• Application:
– Document driven web or other applications
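A minimal sketch of the "variable schema" idea: documents in the same collection carry different fields, and a query simply matches on whichever fields are present. The collection contents and the `find` helper are hypothetical and do not mirror any specific product's API.

```python
# A "collection" of documents; each document may have different fields.
products = [
    {"_id": 1, "type": "book",  "title": "SQL Basics",  "pages": 320},
    {"_id": 2, "type": "movie", "title": "Data Story",  "runtime_min": 95},
    {"_id": 3, "type": "book",  "title": "NoSQL Intro", "pages": 210, "tags": ["db"]},
]

def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]

books = find(products, type="book")
print([doc["title"] for doc in books])  # ['SQL Basics', 'NoSQL Intro']

# Documents lacking a field are simply not matched; no schema change is needed.
print(find(products, runtime_min=95)[0]["title"])  # Data Story
```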
Distributed Computing
• Apache Hadoop
• Distributed computing
• Parallel processing
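The map-and-reduce style of parallel processing associated with Hadoop can be sketched on a single machine: each chunk of input is counted independently (map), and the partial counts are then merged (reduce). This is only an illustration using Python threads; it does not use Hadoop, and the text chunks and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(chunk.lower().split())

def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical input, already split into chunks (as if stored on different nodes).
chunks = ["big data big value", "data in memory", "big memory"]

# The chunks are processed concurrently; a cluster would spread them over machines.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_chunk, chunks))

word_counts = reduce_counts(partials)
print(word_counts["big"], word_counts["data"], word_counts["memory"])  # 3 2 2
```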
When to…
Use an RDBMS when you need/have... Use NoSQL when you need/have...