Dsc652 - Chapter 1 Introduction To Big Data Systems

DSC 652
BIG DATA APPLICATIONS AND ISSUES
CHAPTER 1
INTRODUCTION TO BIG DATA SYSTEMS
Prepared by:
Saidatul Rahah Hamidi
Chapter Outline
• Introduction
• Data is everywhere
• Big data systems
• Key concepts of big data systems
• Big data system design principles

Introduction to big data: https://www.youtube.com/watch?v=bAyrObl7TYE
What is Big Data?
• Big data is a combination of structured, semistructured and unstructured data collected by
organizations that can be mined for information and used in machine learning projects,
predictive modeling and other advanced analytics applications. (https://www.techtarget.com/searchdatamanagement/definition/big-data)
• Big data is larger, more complex data sets, especially from new data sources. These data sets are so
voluminous that traditional data processing software just can’t manage them. But these massive
volumes of data can be used to address business problems you wouldn’t have been able to tackle
before. (https://www.oracle.com/my/big-data/what-is-big-data/)
• “Big data” refers to datasets whose size is beyond the ability of typical database software tools to
capture, store, manage, and analyze. (mckinsey.com)
• “Extremely large collections of data (data sets) that may be analysed to reveal patterns, trends, and
associations, especially relating to human behaviour and interactions.”
“Big data are high-volume, high-velocity and high-variety information assets that require new forms of
processing to enable enhanced decision making, insight discovery and process optimization”
- Gartner -
Process optimization: the most common goals are minimizing cost and maximizing throughput and/or
efficiency. This is one of the major quantitative tools in industrial decision making.
Access to data is becoming the ultimate competitive advantage
The amount of data we have doubles or triples every year…too fast
A collection of data sets so large and complex that it becomes difficult to process using on-hand
database management tools or traditional data processing applications
Databases (the 80s)
⮚ Relational databases
⮚ Gigabytes in size
⮚ Low latency (time interval/delay)
Data warehousing (the 90s)
⮚ Terabytes in size
⮚ Custom hardware
TODAY
Introduction
Big Data is everywhere!
A new style of IT emerging
What happens online in 60 seconds? (2014)
• 41000 photos uploaded
• 50 billion messages
• 342000 tweets
• 204 million emails sent
• 1.4 million voice calls
• 3.3 million Facebook posts
• 120 hours of video uploaded
• 4 million Google searches
[Figure: 2021 version]
Big Data is made of structured and unstructured information.
10% STRUCTURED
Structured information is the data in databases and is about 10% of the story.
90% UNSTRUCTURED
Unstructured information is 90% of Big Data and is ‘human information’ like emails, videos, tweets,
Facebook posts, call-center conversations, closed circuit TV footage, mobile phone calls, website clicks.
Footage is raw, unedited material as originally filmed by a movie camera or recorded
by a video camera, which typically must be edited to create a motion picture, video
clip, television show or similar completed work.
Primary Sources of Big Data
Machine
• data comes from what can be measured by the equipment used
• E.g. smart sensors, SIEM logs, medical devices and wearables, road cameras, IoT devices,
satellites, desktops, mobile phones, industrial machinery, etc.
Social
• derived from social media platforms through tweets, retweets, likes, video uploads, and
comments shared on Facebook, Instagram, Twitter, YouTube, LinkedIn, etc.
Transactional
• this comes from the transactions which are undertaken by the organisation.
• E.g. transaction time, location, products purchased, product prices, payment methods,
discounts/coupons used, and other relevant quantifiable information related to transactions.
• Transactional data is a key source of business intelligence.
Massive amount of data generated
• Social Media: On social media platforms like Facebook, Twitter, and
Instagram, users generate billions of posts, comments, likes, and
shares every day. This produces a vast amount of structured and
unstructured data that can be analyzed for insights into user behavior
and preferences.
• E-commerce: Online shopping generates a significant amount of data
every second. Every time a user browses a product, adds it to their
cart, or makes a purchase, data is generated. This data can be
analyzed to understand consumer behavior, preferences, and trends.
Massive amount of data generated (cont.)
• Streaming: Streaming platforms like Netflix and YouTube generate a vast amount of
data every second. Every time a user watches a video, data is generated, including
information on user behavior, preferences, and engagement.
• Internet of Things (IoT): The IoT generates data from various devices like sensors,
wearables, and smart home appliances. Every time a device collects data or sends data
to the cloud, it generates data. This data can be analyzed to understand patterns and
trends in device usage.
• Search Engines: Search engines like Google generate a vast amount of data every
second. Every time a user searches for a query, data is generated, including search
terms, location, and device information. This data can be analyzed to understand user
intent and preferences.
Big Data Systems
• Big data refers to the large volume of structured and unstructured data
generated from various sources such as social media, internet of things
(IoT) devices, mobile devices, and sensors. Big data systems are designed to
store, manage, and analyze this data.
• Big data systems are designed to handle large volumes of data that are
beyond the capabilities of traditional data processing systems. These
systems use distributed computing and parallel processing to store,
manage, and analyze massive amounts of structured and unstructured data.
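The split, process, combine idea behind distributed and parallel processing can be sketched on a single machine. The example below is only an illustration under simplifying assumptions: it uses a thread pool from Python's standard library in place of a cluster, and the function names (`split_into_chunks`, `chunk_total`, `parallel_total`) and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Partition the records into consecutive chunks of at most chunk_size items."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def chunk_total(chunk):
    """Work done independently on one partition: here, a simple sum."""
    return sum(chunk)


def parallel_total(records, chunk_size=1000, workers=4):
    """Process every chunk in a worker pool, then combine the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(chunk_total, chunks))
    return sum(partial_totals)


if __name__ == "__main__":
    data = list(range(10_000))      # stand-in for a large data set
    print(parallel_total(data))     # same value as sum(data)
```

In a real big data system the chunks would live on different machines and the combine step would run over the network, but the three stages are the same.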
Big Data Systems (cont.)
• Big data systems are typically designed to be scalable, fault-tolerant,
and flexible to handle a variety of data sources and formats.
• They may include various components such as data ingestion,
storage, processing, analysis, and visualization tools.
Key Concepts
Here are some key concepts related to big data systems:
1. Volume: Scale of Data
Terabytes to exabytes of existing data to process
Big data systems deal with massive amounts of data that traditional systems cannot handle.
2. Velocity: Analysis of Streaming Data
Streaming data, milliseconds to seconds to respond (speed/rate)
Big data is generated at a high speed and needs to be processed quickly to derive insights.
3. Variety: Different Forms of Data
Structured, unstructured, text, multimedia
Big data comes in different formats, including structured, semi-structured, and unstructured data.
4. Veracity: Uncertainty of Data (accuracy)
Managing the reliability and predictability of inherently imprecise data types
Big data is often noisy, incomplete, and inconsistent, which can affect the accuracy of analysis.
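To make the velocity concept more concrete, the sketch below processes readings one at a time as they arrive and keeps only a small window of recent values in memory, instead of storing everything first and analysing it later. It is a minimal illustration, not a streaming engine; the function name `sliding_average` and the sensor values are hypothetical.

```python
from collections import deque


def sliding_average(stream, window_size):
    """Yield the average of the most recent window_size readings after each new reading."""
    window = deque(maxlen=window_size)   # the oldest reading is dropped automatically
    running_sum = 0.0
    for reading in stream:
        if len(window) == window.maxlen:
            running_sum -= window[0]     # value that is about to be evicted
        window.append(reading)
        running_sum += reading
        yield running_sum / len(window)


if __name__ == "__main__":
    sensor_readings = [10, 20, 30, 40]   # hypothetical sensor values
    print(list(sliding_average(sensor_readings, window_size=2)))
    # [10.0, 15.0, 25.0, 35.0]
```

Because the function is a generator, each result is available as soon as its reading arrives, which is the property that matters when the response time is milliseconds to seconds.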
VOLUME: Data at Scale
• 40 zettabytes of data will be created by 2020
• 6 billion people have cell phones
• 2.5 quintillion bytes of data are created each day
VELOCITY: Data in Motion
• Modern cars have close to 100 sensors that monitor items such as fuel level and tire pressure
• By 2020, it is projected there will be more than 18.9 billion network connections,
almost 2.5 connections per person on earth
VARIETY: Data in Many Forms
• By 2020, it’s anticipated there will be more than 420 million wearable, wireless health monitors
• 4 billion+ hours of video are watched on YouTube each month
• 400 million tweets are sent per day by about 200 million monthly active users
• 30 billion pieces of content are shared on Facebook every month
VERACITY: Data Uncertainty
• 1 in 3 business leaders don’t trust the information they use to make decisions
• 27% of respondents in one survey were unsure of how much of their data was inaccurate
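A quick back-of-the-envelope calculation helps put two of the figures above in perspective: 2.5 quintillion bytes of data per day, and 18.9 billion network connections at almost 2.5 connections per person. The sketch assumes the short-scale quintillion (10**18) and decimal units (1 exabyte = 10**18 bytes, 1 terabyte = 10**12 bytes); the variable names are only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 seconds

bytes_per_day = 2.5e18                    # 2.5 quintillion bytes (short scale)
exabytes_per_day = bytes_per_day / 1e18   # 1 exabyte = 10**18 bytes
terabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 1e12

connections = 18.9e9                      # projected network connections
connections_per_person = 2.5              # "almost 2.5", so the result is approximate
implied_population = connections / connections_per_person

print(f"{exabytes_per_day:.1f} EB per day")                   # 2.5 EB per day
print(f"{terabytes_per_second:.1f} TB per second")            # 28.9 TB per second
print(f"{implied_population / 1e9:.2f} billion people")       # 7.56 billion people
```

In other words, the daily volume is about 2.5 exabytes, or roughly 29 terabytes every second, and the connection figure is consistent with a world population of about 7.6 billion.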
Some Popular Big Data Systems
• Some popular big data systems include Hadoop, Spark, NoSQL databases, and cloud-based
platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
• These systems use distributed computing and parallel processing to handle the volume and
velocity of big data.
• They also support data processing and analysis using machine learning and artificial intelligence
algorithms.
• In summary, big data systems are critical for organizations to gain insights and make informed
decisions from massive amounts of data.
Sample of Big Data Systems
• Hadoop is an open-source framework that provides distributed storage and
processing of large datasets across clusters of computers. It includes components
like Hadoop Distributed File System (HDFS) for storage and MapReduce for
processing.
• Spark is another open-source big data processing framework that can handle both
batch processing and real-time data processing. It includes components like Spark
Core, Spark SQL, and Spark Streaming.
• NoSQL databases like MongoDB, Cassandra, and Couchbase provide flexible and
scalable storage solutions for big data applications. They can handle a variety of
data formats and support distributed data storage and processing.
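The MapReduce processing model mentioned for Hadoop can be imitated in a few lines of plain Python. The sketch below is not the Hadoop API: it runs in a single process and only mirrors the three logical phases (map, shuffle, reduce) on the classic word-count task. All function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each word's list of counts into one total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data systems", "big data is everywhere"]   # hypothetical input splits
    print(word_count(docs))
    # {'big': 2, 'data': 2, 'systems': 1, 'is': 1, 'everywhere': 1}
```

On a cluster, each document (or block of a file) would be mapped on the node that stores it, and the shuffle step would move the grouped pairs across the network to the reducers.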
Sample of Big Data Systems (cont.)
• Cloud-based platforms like AWS, Azure, and GCP provide scalable and
flexible infrastructure for big data applications. They offer various
services like data storage, processing, and analysis, as well as machine
learning and artificial intelligence tools.
• Overall, big data systems are critical for organizations to gain insights
and make informed decisions from massive amounts of data. They
can provide valuable insights for various industries, including finance,
healthcare, retail, and more.
https://www.geeksforgeeks.org/hadoop-ecosystem/
Spark Ecosystem
Key Principles in Designing Big Data System
• Key principles to consider when designing a big data system:
• scalability
• fault tolerance
• data processing and storage
• security
• data governance
• interoperability
• cost-effectiveness
• These principles help ensure that big data systems can handle the scale and
complexity of big data while maintaining high performance, reliability, and
security. By following these principles, organizations can design big data
systems that support their business needs and goals.
Key Principles in Designing Big Data System (cont.)
• Scalability: Big data systems should be scalable to handle the increasing
volume and variety of data. The system should be designed to add more
computing resources as needed.
• Fault tolerance: Big data systems should be fault tolerant to handle failures
in the system. The system should be designed to handle node failures and
ensure data consistency and availability.
• Data processing and storage: Big data systems should be designed to handle
both batch processing and real-time data processing. The system should also
support different storage formats and data models.
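The fault-tolerance principle is often met by keeping more than one copy of each record on different nodes. The toy key-value store below sketches that idea under simplifying assumptions: everything runs in one process, node failure is simulated with a flag, and the class `TinyCluster`, its methods, and the node names are hypothetical rather than taken from any real system.

```python
import hashlib


class TinyCluster:
    """Toy key-value store that keeps each record on several nodes."""

    def __init__(self, node_names, replication_factor=2):
        if replication_factor > len(node_names):
            raise ValueError("replication_factor cannot exceed the number of nodes")
        self.node_names = list(node_names)
        self.replication_factor = replication_factor
        self.storage = {name: {} for name in self.node_names}
        self.failed = set()

    def _replica_nodes(self, key):
        """Pick replication_factor consecutive nodes, starting from a hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.node_names)
        return [self.node_names[(start + i) % len(self.node_names)]
                for i in range(self.replication_factor)]

    def put(self, key, value):
        """Write the value to every replica node of the key."""
        for name in self._replica_nodes(key):
            self.storage[name][key] = value

    def fail(self, name):
        """Simulate a node crash."""
        self.failed.add(name)

    def get(self, key):
        """Read from the first replica that is still alive."""
        for name in self._replica_nodes(key):
            if name not in self.failed:
                return self.storage[name][key]
        raise RuntimeError(f"all replicas of {key!r} are unavailable")


if __name__ == "__main__":
    cluster = TinyCluster(["node-a", "node-b", "node-c"], replication_factor=2)
    cluster.put("order-42", {"amount": 99.5})
    primary = cluster._replica_nodes("order-42")[0]
    cluster.fail(primary)              # lose one replica
    print(cluster.get("order-42"))     # still served by the surviving replica
```

With a replication factor of 2 the data survives any single node failure; raising the factor trades extra storage for tolerance of more simultaneous failures.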
Key Principles in Designing Big Data System (cont.)
• Security: Big data systems should be designed to handle security and privacy
concerns. The system should ensure data confidentiality, integrity, and availability.
• Data governance: Big data systems should have proper data governance policies to
ensure data quality, accuracy, and compliance with regulations.
• Interoperability: Big data systems should be interoperable with other systems in the
organization. The system should allow for data exchange and integration with other
systems.
• Cost-effectiveness: Big data systems should be designed to be cost-effective. The
system should optimize resource utilization and reduce operational costs.
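The data governance principle above mentions data quality and accuracy. One simple way to start enforcing it is an automated quality report that counts incomplete and duplicated records before they are analysed. The sketch below is a minimal example; the function name `quality_report` and the transaction fields (`id`, `amount`, `method`) are hypothetical.

```python
def quality_report(records, required_fields):
    """Count records with missing required fields and exact duplicate records."""
    missing = 0
    duplicates = 0
    seen = set()
    for record in records:
        # A required field counts as missing if it is absent, None, or an empty string.
        if any(record.get(field) in (None, "") for field in required_fields):
            missing += 1
        # Records with identical keys and values share the same fingerprint.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)
    return {"total": len(records), "missing": missing, "duplicates": duplicates}


if __name__ == "__main__":
    transactions = [
        {"id": 1, "amount": 20.0, "method": "card"},
        {"id": 2, "amount": None, "method": "cash"},   # incomplete
        {"id": 1, "amount": 20.0, "method": "card"},   # duplicate of the first record
        {"id": 3, "amount": 5.0, "method": ""},        # incomplete
    ]
    print(quality_report(transactions, ["amount", "method"]))
    # {'total': 4, 'missing': 2, 'duplicates': 1}
```

A report like this addresses only completeness and duplication; checks for accuracy and regulatory compliance need rules specific to the organisation.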
SEE YOU NEXT WEEK