CHAPTER 1
INTRODUCTION TO BIG DATA SYSTEMS
Prepared by:
Saidatul Rahah Hamidi
Chapter Outline
• Introduction
• Data is everywhere
• Big data is larger, more complex data sets, especially from new data sources. These data sets are so
voluminous that traditional data processing software just can’t manage them. But these massive
volumes of data can be used to address business problems you wouldn’t have been able to tackle
before. (https://www.oracle.com/my/big-data/what-is-big-data/)
• “Big data” refers to datasets whose size is beyond the ability of typical database software tools to
capture, store, manage, and analyze. (mckinsey.com)
• “Extremely large collections of data (data sets) that may be analysed to reveal patterns, trends, and
associations, especially relating to human behaviour and interactions.”
“Big data are high-volume, high-velocity and high-variety information assets that require new
forms of processing to enable enhanced decision making, insight discovery and process
optimization”
- Gartner -
Process optimization - The most common
goals are minimizing cost and
maximizing throughput and/or efficiency.
This is one of the major quantitative tools
in industrial decision making.
Access to data is becoming the ultimate competitive advantage
Big Data is everywhere!
A new style of IT emerging
(Infographic. Figures shown: 41,000; 50 billion; 342,000; 204 million. Labels shown: photos uploaded, messages, tweets, emails sent.)
Big Data is made of structured and unstructured information.
Primary Source of Big Data
Massive amount of data generated
• Social Media: On social media platforms like Facebook, Twitter, and
Instagram, users generate billions of posts, comments, likes, and
shares every day. This produces a vast amount of structured and
unstructured data that can be analyzed for insights into user behavior
and preferences.
Massive amount of data generated (cont.)
• Streaming: Streaming platforms like Netflix and YouTube generate a vast amount of
data every second. Every time a user watches a video, data is generated, including
information on user behavior, preferences, and engagement.
• Internet of Things (IoT): The IoT generates data from various devices like sensors,
wearables, and smart home appliances. Every reading a device collects or sends to the
cloud adds to this data, which can be analyzed to understand patterns and trends in
device usage.
• Search Engines: Search engines like Google generate a vast amount of data every
second. Every time a user searches for a query, data is generated, including search
terms, location, and device information. This data can be analyzed to understand user
intent and preferences.
Big Data Systems
• Big data refers to the large volume of structured and unstructured data
generated from various sources such as social media, internet of things
(IoT) devices, mobile devices, and sensors. Big data systems are designed to
store, manage, and analyze this data.
• Big data systems are designed to handle large volumes of data that are
beyond the capabilities of traditional data processing systems. These
systems use distributed computing and parallel processing to store,
manage, and analyze massive amounts of structured and unstructured data.
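The idea of splitting work across workers and combining the partial results can be sketched on a single machine. This is only an illustration under stated assumptions: the function names (`split_into_chunks`, `process_chunk`, `parallel_sum`) are hypothetical, and a thread pool stands in for the nodes of a cluster. In CPython, threads do not speed up CPU-bound work, so the sketch shows the structure of the computation rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Split a list into at most num_chunks contiguous, roughly equal chunks."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """Work done independently on one chunk: here, a partial sum."""
    return sum(chunk)


def parallel_sum(records, num_workers=4):
    """Process each chunk on its own worker, then combine the partial results."""
    chunks = split_into_chunks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)


readings = list(range(1, 101))  # stand-in for a large data set
total = parallel_sum(readings)
print(total)
```

Each chunk is processed without reference to the others, which is what allows a real system to place the chunks on different machines.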
Big Data Systems (cont.)
• Big data systems are typically designed to be scalable, fault-tolerant,
and flexible to handle a variety of data sources and formats.
• They may include various components such as data ingestion,
storage, processing, analysis, and visualization tools.
Key Concepts
Here are some key concepts related to big data systems:
1. Volume: Scale of Data
• Terabytes to exabytes of existing data to process
• Big data systems deal with massive amounts of data that traditional systems cannot handle.
2. Velocity: Analysis of Streaming Data
• Streaming data, milliseconds to seconds to respond (speed/rate)
• Big data is generated at a high speed and needs to be processed quickly to derive insights.
3. Variety: Different Forms of Data
• Structured, unstructured, text, multimedia
• Big data comes in different formats, including structured, semi-structured, and unstructured data.
4. Veracity: Uncertainty of Data (accuracy)
• Managing the reliability and predictability of inherently imprecise data types
• Big data is often noisy, incomplete, and inconsistent, which can affect the accuracy of analysis.
• 40 zettabytes of data will be created by 2020.
• 2.5 quintillion bytes of data are created each day.
• 6 billion people have cell phones.
• Modern cars have close to 100 sensors that monitor items such as fuel level and tire pressure.
• By 2020, it is projected there will be more than 18.9 billion network connections, almost 2.5 connections per person on earth.
• By 2020, it’s anticipated there will be more than 420 million wearable, wireless health monitors.
• More than 4 billion hours of video are watched on YouTube each month.
• 400 million tweets are sent per day by about 200 million monthly active users.
• 30 billion pieces of content are shared on Facebook every month.
• 1 in 3 business leaders don’t trust the information they use to make decisions.
• 27% of respondents in one survey were unsure of how much of their data was inaccurate.
Some Popular Big Data Systems
• Some popular big data systems include Hadoop, Spark, NoSQL databases, and cloud-based
platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
• These systems use distributed computing and parallel processing to handle the volume and
velocity of big data.
• They also support data processing and analysis using machine learning and artificial intelligence
algorithms.
• In summary, big data systems are critical for organizations to gain insights and make informed
decisions from massive amounts of data.
Sample of Big Data Systems
• Hadoop is an open-source framework that provides distributed storage and
processing of large datasets across clusters of computers. It includes components
like Hadoop Distributed File System (HDFS) for storage and MapReduce for
processing.
• Spark is another open-source big data processing framework that can handle both
batch processing and real-time data processing. It includes components like Spark
Core, Spark SQL, and Spark Streaming.
• NoSQL databases like MongoDB, Cassandra, and Couchbase provide flexible and
scalable storage solutions for big data applications. They can handle a variety of
data formats and support distributed data storage and processing.
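The MapReduce model mentioned for Hadoop can be illustrated with a word count written in plain Python. This is a single-process sketch of the programming model only, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce steps on many nodes and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key, here by summing the counts."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    mapped = []
    for document in documents:  # each document could be mapped on a different node
        mapped.extend(map_phase(document))
    return reduce_phase(shuffle_phase(mapped))


documents = ["big data is everywhere", "Big data systems store big data"]
counts = word_count(documents)
print(counts["big"], counts["data"])
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across a cluster.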
Sample of Big Data Systems (cont.)
• Cloud-based platforms like AWS, Azure, and GCP provide scalable and
flexible infrastructure for big data applications. They offer various
services like data storage, processing, and analysis, as well as machine
learning and artificial intelligence tools.
• Overall, big data systems are critical for organizations to gain insights
and make informed decisions from massive amounts of data. They
can provide valuable insights for various industries, including finance,
healthcare, retail, and more.
https://www.geeksforgeeks.org/hadoop-ecosystem/
Spark Ecosystem
Key Principles in Designing Big Data Systems
• Key principles to consider when designing a big data system:
• scalability
• fault tolerance
• data processing and storage
• security
• data governance
• interoperability
• cost-effectiveness
• These principles help ensure that big data systems can handle the scale and
complexity of big data while maintaining high performance, reliability, and
security. By following these principles, organizations can design big data
systems that support their business needs and goals.
Key Principles in Designing Big Data Systems (cont.)
• Scalability: Big data systems should be scalable to handle the increasing
volume and variety of data. The system should be designed to add more
computing resources as needed.
• Fault tolerance: Big data systems should be fault tolerant to handle failures
in the system. The system should be designed to handle node failures and
ensure data consistency and availability.
• Data processing and storage: Big data systems should be designed to handle
both batch processing and real-time data processing. The system should also
support different storage formats and data models.
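Scalability and fault tolerance are often achieved together by partitioning records across nodes and keeping more than one copy of each record. The sketch below is a simplified illustration under stated assumptions: the node names, the key `"user:42"`, and the functions `stable_hash`, `replica_nodes`, and `read_node` are all hypothetical, and the placement rule (hash modulo the number of nodes) is chosen for brevity.

```python
import hashlib


def stable_hash(key):
    """Deterministic hash; Python's built-in hash() of a str varies between runs."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replica_nodes(key, nodes, replication_factor=2):
    """Return the nodes that hold copies of the record with this key."""
    start = stable_hash(key) % len(nodes)
    count = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def read_node(key, nodes, failed, replication_factor=2):
    """Return the first live replica for the key, or None if every replica is down."""
    for node in replica_nodes(key, nodes, replication_factor):
        if node not in failed:
            return node
    return None


nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster
replicas = replica_nodes("user:42", nodes)
survivor = read_node("user:42", nodes, failed={replicas[0]})
print(replicas, survivor)
```

With two copies per record, a read still succeeds when one replica's node fails. Note that modulo placement moves many keys when a node is added, which is why production systems often use consistent hashing instead.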
Key Principles in Designing Big Data Systems (cont.)
• Security: Big data systems should be designed to handle security and privacy
concerns. The system should ensure data confidentiality, integrity, and availability.
• Data governance: Big data systems should have proper data governance policies to
ensure data quality, accuracy, and compliance with regulations.
• Interoperability: Big data systems should be interoperable with other systems in the
organization. The system should allow for data exchange and integration with other
systems.
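Two of the principles above, data integrity and data quality, lend themselves to a small illustration. The sketch below assumes a hypothetical record schema (`id`, `timestamp`, `value`) and hypothetical helper names (`validate_record`, `checksum`). It shows one possible quality check and one possible integrity check, not a complete governance or security policy.

```python
import hashlib

REQUIRED_FIELDS = ("id", "timestamp", "value")  # hypothetical schema


def validate_record(record):
    """Data governance: return a list of data-quality problems in one record."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if record.get(name) is None]
    value = record.get("value")
    if value is not None and not isinstance(value, (int, float)):
        problems.append("value is not numeric")
    return problems


def checksum(payload: bytes) -> str:
    """Integrity: SHA-256 digest that changes if the stored payload is modified."""
    return hashlib.sha256(payload).hexdigest()


good = {"id": 1, "timestamp": "2024-01-01T00:00:00Z", "value": 3.5}
bad = {"id": 2, "value": "n/a"}
print(validate_record(good))
print(validate_record(bad))

stored_digest = checksum(b"sensor-batch-001")
print(stored_digest == checksum(b"sensor-batch-001"))
```

A system could run the quality check at ingestion time and recompute the digest when data is read back, rejecting or flagging anything that fails either check.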