Chapter 1
Introduction to Big Data Analytics
• What is Big Data?
• What makes Big Data different from other related
“buzzwords”?
• What are we going to cover in this course?
What’s Big Data?
Big Data Analytics 2
Google Translate
Big Data: Some Examples
• Topic detection and tracking
• Trend analysis
• Social network analysis
• PageRank
• Predictive analytics
• Many others: healthcare, natural resources,
education, public sector, insurance,
transportation, finance and crime detection,
…
Google News
Google Trends
What is Big Data?
• Big data is a term for data sets that are so
large or complex that traditional data
processing application software is
inadequate to deal with them.
• Challenges include capture, storage,
analysis, data curation, search, sharing,
transfer, visualization, querying, updating
and information privacy. [source:
Wikipedia]
What is Big Data?
• “Big data is data whose scale, distribution,
diversity, and/or timeliness require the
use of new technical architectures and
analytics to enable insights that unlock
new sources of business value.”
• [source: J. Manyika et al., Big Data: The next
frontier for innovation, competition, and
productivity, McKinsey Global Institute,
2011]
Characteristics of Big Data
• META Group (now Gartner) defined data
growth challenges and opportunities as
being three-dimensional, i.e. increasing
volume, velocity, and variety [Doug
Laney, 2001]
– Volume: the quantity of generated and stored
data
– Velocity: the speed at which data is generated
and processed
– Variety: the type and nature of data
Characteristics of Big Data
• “Big data is high volume, high velocity,
and/or high variety information assets
that require new forms of processing to
enable enhanced decision making, insight
discovery and process optimization.”
[Gartner, 2012]
The Four V’s of Big Data
(Image source: [Link])
Characteristics of Big Data
• Five V’s:
– Volume: scale
• 2.5 EB generated per day (for scale: Library of Congress, 300 TB)
– Velocity: streaming
• In 60 seconds: 350,000 tweets, 300 hours of YouTube video,
171 million emails, 350 GB sensor data from a jet engine
– Variety: different forms
• Structured, Semi-structured, Unstructured
– Veracity: uncertainty
• Quality or fidelity
– Value
• Higher veracity, lower processing time -> higher value
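To put the volume and velocity figures above on a common footing, the short calculation below converts them to a ratio and to per-second rates. It is only a sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes), which the slide does not specify, and the variable names are illustrative.

```python
# Assumption: decimal (SI) units; the slide does not say which convention it uses.
TB = 10**12
EB = 10**18

daily_volume_bytes = 2.5 * EB          # 2.5 EB per day (figure from the slide)
library_of_congress_bytes = 300 * TB   # 300 TB (figure from the slide)

# How many Library-of-Congress equivalents are generated per day?
loc_per_day = daily_volume_bytes / library_of_congress_bytes

# The velocity figures are per 60 seconds; convert them to per-second rates.
tweets_per_second = 350_000 / 60
emails_per_second = 171_000_000 / 60

print(f"{loc_per_day:,.0f} Library-of-Congress equivalents per day")
print(f"{tweets_per_second:,.0f} tweets per second")
print(f"{emails_per_second:,.0f} emails per second")
```

Under these assumptions the daily volume is on the order of eight thousand Library-of-Congress equivalents, which is why volume and velocity are usually discussed together.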
Characteristics of Big Data
Data Structures
• Variety: different forms
– Structured: databases, spreadsheets, …
– Semi-structured: textual files such as Web
pages, XML, …
– Unstructured: text documents, images,
videos, …
• Data growth is increasingly unstructured
– Social media: Facebook, Twitter, …
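The three forms listed above differ mainly in how much work is needed before the data can be queried. The sketch below reads one tiny, made-up example of each form using only the Python standard library; all sample data and field names are hypothetical.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns, every record has the same fields.
structured_source = "user,city,age\nalice,Taipei,30\nbob,Tainan,25\n"
rows = list(csv.DictReader(io.StringIO(structured_source)))

# Semi-structured: self-describing markup, but fields may vary per record.
xml_source = (
    "<users>"
    "<user name='alice'><tag>ml</tag><tag>iot</tag></user>"
    "<user name='bob'/>"
    "</users>"
)
root = ET.fromstring(xml_source)
tags_per_user = {
    user.get("name"): [tag.text for tag in user.findall("tag")]
    for user in root.findall("user")
}

# Unstructured: free text; any structure has to be extracted by the analyst.
text_source = "Alice posted about #bigdata and #spark; Bob replied with #hadoop."
hashtags = re.findall(r"#(\w+)", text_source)

print(rows[0]["city"])
print(tags_per_user)
print(hashtags)
```

The structured rows can be queried directly by column name, the XML needs a traversal that tolerates missing fields, and the free text needs a pattern chosen by the analyst.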
Differences from traditional
data analysis
• Distinct requirements
– Combining of multiple unrelated datasets
– Processing of large amounts of unstructured data
– Harvesting of hidden information in a time-sensitive
manner
• Newer techniques that leverage computational
resources
• Interdisciplinary
– Mathematics, statistics, computer science, subject matter
expertise
• Benefits
– Optimization, predictions, fault or fraud detection,
improved decision making, discoveries
Data analysis vs. Data analytics
• “Analysis is the separation of a whole into its
component parts, and analytics is the method of
logical analysis.” [source: Merriam-Webster
dictionary]
• “Analysis is really a heuristic activity, where
scanning through all the data the analyst gains
some insight.” [source: [Link]]
• “Analytics is about applying a mechanical or
algorithmic process to derive the insights for
example running through various data sets
looking for meaningful correlations between
them.” [source: [Link]]
Related Terms
• Data science, predictive analytics
• Business intelligence, FinTech
• IoT, CPS, Industry 4.0
• Smart homes, smart cities
• Data mining, machine learning, artificial
intelligence
• Cloud computing, data-intensive computing,
parallel computing, distributed computing
• …
What is Data Science?
Data Engineering vs. Data
Analysis
• Data engineering: designing and building
infrastructure for integrating and managing data from
various sources
– MySQL, NoSQL, Hadoop, MapReduce
• Data analysis: querying and processing data,
providing reports, summarizing and visualizing data
– Statistics, visualization, Excel, SAS, SPSS, …
• Data science: applying statistics, machine learning and
analytic approaches to solve critical business
problems, and turning data into valuable and
actionable insights
– Advanced data analysis
– Data mining tools, machine learning, statistics, …
Big data vs. Business
Intelligence
• Business Intelligence (BI) is the set of
strategies, processes, applications, data,
products, technologies and technical
architectures which are used to support
the collection, analysis, presentation and
dissemination of business information.
[source: Wikipedia]
Data Mining in Business Intelligence
• Layers, from top to bottom (the potential to support
business decisions increases toward the top):
– Decision Making (End User)
– Data Presentation: Visualization Techniques (Business Analyst)
– Data Mining: Information Discovery (Data Analyst)
– Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
– Data Preprocessing/Integration, Data Warehouses (DBA)
– Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
[source: Han 2011]
Big Data vs. Financial
Technology
[source: Slideshare]
Big Data vs. IoT
• The Internet of things (IoT) is the
internetworking of physical devices,
vehicles (also referred to as "connected
devices" and "smart devices"), buildings,
and other items—embedded with
electronics, software, sensors, actuators,
and network connectivity that enable
these objects to collect and exchange data.
[source: Wikipedia]
Big Data vs. Industry 4.0
[source: Roland Berger]
Big Data vs. CPS
• A cyber-physical system (CPS) is a mechanism
controlled or monitored by computer-based
algorithms, tightly integrated with the internet and its
users.
• In cyber-physical systems, physical and software
components are deeply intertwined, each operating on
different spatial and temporal scales, exhibiting
multiple and distinct behavioral modalities, and
interacting with each other in a myriad of ways that
change with context.
– E.g.: smart grid, autonomous automobile systems, medical
monitoring, process control systems, robotics systems, and
automatic pilot avionics
– [source: Wikipedia]
Big Data vs. Data Mining
• Data mining is the computational process
of discovering patterns in large data sets
involving methods at the intersection of
artificial intelligence, machine learning,
statistics, and database systems. [source:
Wikipedia]
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data
warehousing communities
• Data mining plays an essential role in the knowledge
discovery process
• Stages shown in the figure, in order: Databases -> Data Cleaning
and Data Integration -> Data Warehouse -> Selection ->
Task-relevant Data -> Data Mining -> Pattern Evaluation
[source: Han 2011]
Big Data vs. Data-Intensive
Computing
• Data-intensive computing is a class of
parallel computing applications which use
a data parallel approach to process large
volumes of data typically terabytes or
petabytes in size and typically referred to
as big data. [source: Wikipedia]
Google Trends Comparison:
Limits of Predictions
• Predictive analytics: technology that learns
from experience (data) to predict the future
behavior of individuals in order to drive
better decisions
• Accurate prediction is generally not possible
• But predictions need not be accurate to bring
value
– E.g. direct mail marketing
• The prediction effect: predicting better than
pure guessing delivers value
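The direct mail example can be made concrete with a small calculation. The sketch below compares mailing everyone with mailing only the prospects an imperfect model scores highest. Every number in it is hypothetical and chosen only to keep the arithmetic easy to follow, and the function name is illustrative.

```python
def campaign_profit(n_mailed, response_rate, cost_per_mail, profit_per_response):
    """Expected profit of mailing n_mailed prospects."""
    revenue = n_mailed * response_rate * profit_per_response
    return revenue - n_mailed * cost_per_mail

# Hypothetical campaign parameters (not taken from the slides).
N_PROSPECTS = 1_000_000
COST_PER_MAIL = 2.0
PROFIT_PER_RESPONSE = 220.0
BASE_RESPONSE_RATE = 0.01

# Baseline: no model, mail every prospect.
mail_all = campaign_profit(
    N_PROSPECTS, BASE_RESPONSE_RATE, COST_PER_MAIL, PROFIT_PER_RESPONSE
)

# Imperfect model: its top-scored quarter responds at three times the base rate.
# The model is still wrong about 97% of the people it selects.
targeted = campaign_profit(
    N_PROSPECTS // 4, 3 * BASE_RESPONSE_RATE, COST_PER_MAIL, PROFIT_PER_RESPONSE
)

print(f"mail everyone: {mail_all:,.0f}")
print(f"mail top 25%:  {targeted:,.0f}")
```

With these made-up figures the targeted campaign earns several times the profit of the untargeted one, even though most of the selected prospects still do not respond. That is the sense in which a prediction need not be accurate to bring value.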
Is Big Data the End of Theory?
• Critiques
– Big data requires “big judgement”
– If the systems dynamics of the future change,
the past can say little about the future
– Privacy
– Biased, subjective, shallow
– “If you believe in Big Data analytics, it’s time
to begin planning for a Hillary Clinton
presidency and all that entails.”
What are we going to cover in
this course?
• Data mining
– Frequent pattern mining
– Classification
– Clustering
• Parallel programming in distributed
platforms
– Scalability
– Hadoop, Spark
– MapReduce programming
KDD Process: A Typical View from ML and Statistics
• Flow: Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
– Data Pre-Processing: data integration, normalization, feature
selection, dimension reduction
– Data Mining: pattern discovery, association & correlation,
classification, clustering, outlier analysis, …
– Post-Processing: pattern evaluation, pattern selection, pattern
interpretation, pattern visualization
• This is a view from typical machine learning and statistics communities [source: Han 2011]
Data Mining Function: Association
and Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in
your Walmart?
• Association, correlation vs. causality
– A typical association rule
• Diaper -> Beer [0.5%, 75%] (support, confidence)
– Are strongly associated items also strongly
correlated?
• How to mine such patterns and rules efficiently in
large datasets?
• How to use such patterns for classification,
clustering, and other applications?
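Support and confidence for a rule such as Diaper -> Beer can be computed directly from their definitions. The sketch below does so on a handful of made-up baskets, which is nowhere near the efficient mining the slide asks about. It also computes lift, a common correlation measure that the slide does not name, to illustrate the association versus correlation question. All function names and data are illustrative.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def support(transactions, itemset):
    """Fraction of all transactions that contain itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline support (1.0 = independent)."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical toy baskets, not real retail data.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

rule_support = support(baskets, {"diaper", "beer"})
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})
rule_lift = lift(baskets, {"diaper"}, {"beer"})
print(f"diaper -> beer [support={rule_support:.0%}, confidence={rule_confidence:.0%}], "
      f"lift={rule_lift:.2f}")
```

A rule can have high confidence simply because the consequent is popular everywhere. Comparing confidence with the consequent's own support, as lift does, is one way to check whether the association reflects a real dependence.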
Data Mining Function:
Classification
• Classification and label prediction
– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
– Predict some unknown class labels
• Typical methods
– Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification,
pattern-based classification, logistic regression, …
• Typical applications:
– Credit card fraud detection, direct marketing,
classifying stars, diseases, web-pages, …
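As a minimal illustration of constructing a model from training examples and then predicting unknown labels, the sketch below fits a one-split decision tree (a decision stump) that classifies cars by gas mileage. The data, labels, and function names are hypothetical. A real decision tree learner would use an impurity measure and many features, so this is only the simplest instance of the idea.

```python
def fit_stump(values, labels):
    """Learn a rule: value <= threshold -> left_label, otherwise right_label.

    Tries the midpoint between each pair of adjacent distinct values and keeps
    the rule with the fewest errors on the training examples.
    """
    classes = sorted(set(labels))
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for threshold in candidates:
        for left_label in classes:
            for right_label in classes:
                errors = sum(
                    (left_label if v <= threshold else right_label) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct values to place a split")
    _, threshold, left_label, right_label = best
    return threshold, left_label, right_label

def predict(stump, value):
    """Apply the learned rule to a new, unlabeled value."""
    threshold, left_label, right_label = stump
    return left_label if value <= threshold else right_label

# Hypothetical training examples: gas mileage (mpg) and a class label.
mpg = [12, 15, 18, 21, 30, 34, 38, 45]
label = ["low", "low", "low", "low", "high", "high", "high", "high"]

stump = fit_stump(mpg, label)
print(stump)
print(predict(stump, 16), predict(stump, 40))
```

The two functions mirror the two bullets above: `fit_stump` constructs the model from labeled examples, and `predict` applies it to cases whose class label is unknown.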
Data Mining Function: Cluster
Analysis
• Unsupervised learning (i.e., the class label is
unknown)
• Group data to form new categories (i.e.,
clusters), e.g., cluster houses to find
distribution patterns
• Principle: maximize intra-class similarity
and minimize inter-class similarity
• Many methods and applications
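One of the many methods is k-means, sketched below in plain Python on made-up house locations. It is an assumption-laden illustration: the initial centroids are simply the first k points, similarity is taken to be Euclidean closeness, and the data and names are hypothetical. No class labels are used anywhere.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm), seeded with the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment

# Hypothetical house locations (x, y) on an arbitrary grid.
houses = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, assignment = kmeans(houses, k=2)
print(assignment)
print(centroids)
```

Each update step pulls a centroid toward its members, which raises similarity within a cluster, while the assignment step sends each point to the closest centroid, which keeps different clusters apart.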
Limitations of Data Analysis
• CPU: computing time to execute the
analysis
• I/O: how much data can be moved into
memory per unit of time
• Memory: how much data can be
processed at a time
Data to be analyzed
• Tall data: large number of cases
• Wide data: large number of features
• Tall and wide data: large number of both
cases and features
• Sparse data: large number of zero entries
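Sparse data is usually stored by recording only the non-zero entries. The sketch below shows the idea with a dictionary keyed by (row, column), often called a dictionary-of-keys layout. The matrix is a made-up example and the helper name is illustrative.

```python
# Hypothetical 4 x 6 matrix (cases x features) in which most entries are zero.
dense = [
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 4, 0],
]

n_rows, n_cols = len(dense), len(dense[0])

# Sparse form: keep only the non-zero entries, keyed by (row, column).
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

# Fraction of entries that are zero.
sparsity = 1 - len(sparse) / (n_rows * n_cols)

def get(i, j):
    """Read an entry from the sparse form; missing keys are implicit zeros."""
    return sparse.get((i, j), 0)

print(f"{len(sparse)} stored entries instead of {n_rows * n_cols}")
print(f"sparsity: {sparsity:.0%}")
print(get(2, 5), get(1, 1))
```

The saving grows with the size of the matrix: for tall and wide data with few non-zero entries, memory use scales with the number of non-zeros rather than with cases times features.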
Algorithm to be used
• How complex is your algorithm?
• How many parameters are in your model?
• Are the optimization processes
parallelizable?
• Does your algorithm learn from all the data or
from small batches of data?
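The difference between using all data at once and using small batches can be seen with the simplest possible statistic, the mean. The sketch below computes it both ways. The batched version only ever holds one batch and two running totals in memory. The data and function names are hypothetical stand-ins.

```python
def batches(data, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def mean_from_batches(batch_iter):
    """Compute the mean while holding one batch and two running totals."""
    total, count = 0.0, 0
    for batch in batch_iter:
        total += sum(batch)
        count += len(batch)
    return total / count

# Hypothetical stand-in for a dataset too large to load at once.
data = list(range(1, 101))

full_pass_mean = sum(data) / len(data)
batched_mean = mean_from_batches(batches(data, batch_size=16))

print(full_pass_mean, batched_mean)
```

Algorithms whose updates can be written in this accumulate-per-batch form are the ones that cope best with data that does not fit in memory.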
Possible solutions
• Scale up: performance improvement on a single
machine
– More memory, faster CPU, faster storage, using
GPUs, …
– E.g. CUDA, TensorFlow, Keras, …
• Scale out: performance improvement by
distributing computations
– Using outside resources: other CPUs, GPUs, storage
– E.g. Hadoop, Spark, …
• Scale up and out
– E.g. Distributed TensorFlow, …
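The scale-out pattern has a simple structure: partition the data, let each worker compute a partial result independently, then combine the partial results. The sketch below shows that structure with a thread pool from the Python standard library. The threads only stand in for separate machines and are not expected to give a speedup. The function names and data are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    """Work done independently by one worker on its own partition."""
    return sum(x * x for x in chunk)

def split(data, n_parts):
    """Split data into contiguous partitions of near-equal size."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(1_000))
partitions = split(data, n_parts=4)

# Assumption: threads stand in for the other CPUs or machines named on the slide.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_of_squares, partitions))

# Combine step: merge the independent partial results.
total = sum(partials)
print(partials, total)
```

The approach works here because the sum of squares decomposes over partitions. Computations that need to see all the data at once are harder to scale out.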
Hadoop Architecture
MapReduce
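As a rough illustration of the MapReduce programming model, the sketch below runs the classic word count on a single machine in plain Python. It only mimics the map, shuffle, and reduce phases and does not use Hadoop or its API. The function names and documents are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)

# Hypothetical input documents.
documents = ["big data analytics", "big data is big", "data mining"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_counts)
```

In a distributed setting, each document could be mapped on a different node and each key reduced on a different node, since neither phase needs to see the others' inputs.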
Functional Programming
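The functional style relevant here builds a computation from side-effect-free transformations over collections. A minimal sketch using Python's built-in `map` and `filter` and `functools.reduce` follows. The input values are made up.

```python
from functools import reduce

readings = [3, 8, 1, 10, 6]  # hypothetical input values

# map: apply a function to every element.
squared = list(map(lambda x: x * x, readings))

# filter: keep only the elements that pass a test.
large = list(filter(lambda x: x > 20, squared))

# reduce: fold the remaining elements into a single value.
total = reduce(lambda acc, x: acc + x, large, 0)

print(squared, large, total)
```

Because none of the three steps modifies `readings` or depends on shared state, each could be applied to separate partitions of the data independently.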
Spark and Hadoop
TensorFlow
• An open-source library for machine
learning
– High-level API: Keras, …
– Low-level API: TensorFlow Core
• Prerequisites: algebra, Python
• Google Colab (Colaboratory) is an easy
way to learn and use TensorFlow
TensorFlow
Typical Workflows in
TensorFlow
• Input data: tensors
• Model: dataflow graph
• Model training and testing: sessions, run on one or
more devices
Thanks for Your Attention!