
INTRODUCTION TO BIG DATA

Instructor: Oussama Derbel


About Me

■ Oussama Derbel, PhD

■ Big Data and Machine Learning Trainer

■ 2 years as an IT Instructor

■ 6 years of experience in R&D

■ 3 years of experience as an R&D Project Manager

■ Certificates:

1. Big Data Analytics using Spark

2. Machine Learning

3. Python For Data Science

4. AWS Certified Data Analytics (In Progress)


OUTLINE
Introduction

3V (Volume, Variety, Velocity) characteristics

Types of Big Data

Application and use cases of Big Data

Limitations of traditional large-scale systems

How a distributed way of computing is superior (cost and scale)

Opportunities and challenges with Big Data




Introduction

■ What is Data?

■ What is Big Data?

■ What is Information?

Introduction

■ What is Data?

– The quantities, characters, or symbols on which operations are performed by a computer.

– Data may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

■ What is Big Data?

– Big Data is still data, but of such huge and exponentially growing volume that none of the traditional data management tools can store or process it efficiently.

Introduction

■ What is Information?

– Information is a set of data processed in a meaningful way according to a given requirement.

– Information is processed, structured, or presented in a given context to make it meaningful and useful.

[Diagram: Data → ETL Process → Data Warehouse → Analyse → Present]


Introduction

■ What is ETL?
– ETL stands for Extract, Transform, Load: data is extracted from source systems, transformed into a consistent format, and loaded into a data warehouse.
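The three stages can be sketched in a few lines of Python (a toy illustration, not production ETL code; the sample rows, the `transform` rule, and the list-based `warehouse` are all made up for the example):

```python
# Toy ETL pipeline: extract raw records, transform them, load them into a "warehouse".
raw_rows = ["  Alice,34 ", "bob,29", "  CAROL,41"]            # Extract: raw source data

def transform(row):
    name, age = row.strip().split(",")                        # clean whitespace, split fields
    return {"name": name.strip().title(), "age": int(age)}    # normalize name, cast age

warehouse = [transform(r) for r in raw_rows]                  # Load into the target structure
print(warehouse[0])   # {'name': 'Alice', 'age': 34}
```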
Introduction

■ Data Growth over the years



Byte = 8 bits (e.g., 00100010)
1 KiloByte (KB) = 2^10 Bytes = 1024 Bytes
1 MegaByte (MB) = 10^3 KB
1 GigaByte (GB) = 10^3 MB
1 TeraByte (TB) = 10^3 GB
1 PetaByte (PB) = 10^3 TB
1 ExaByte (EB) = 10^3 PB
1 ZettaByte (ZB) = 10^3 EB

Typical sizes: hard disk ≈ 1 TB, mobile phone ≈ 128 GB, photo ≈ 10-15 MB
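The decimal steps in the table above can be captured in a short helper (an illustrative sketch; the function name and unit list are my own):

```python
# Convert a raw byte count into a human-readable size using the decimal
# (factor-of-1000) units from the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(n_bytes):
    size, unit = float(n_bytes), 0
    while size >= 1000 and unit < len(UNITS) - 1:
        size /= 1000          # step up one unit per factor of 1000
        unit += 1
    return f"{size:g} {UNITS[unit]}"

print(human_readable(128_000_000_000))   # 128 GB (a typical phone)
print(human_readable(10**12))            # 1 TB (a typical hard disk)
```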
Introduction

■ Data Growth over the years

Internet Data generated every day in 2020

• 500 million tweets are sent

• 294 billion emails are sent

• 4 petabytes of data are created on Facebook

• 4 terabytes of data are created from each connected car

• 65 billion messages are sent on WhatsApp

• 5 billion searches are made

• By 2025, it is estimated that 463 exabytes of data will be created each day globally. For scale, a single exabyte alone is the equivalent of roughly 212,765,957 DVDs (at 4.7 GB per DVD).
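The DVD comparison is easy to sanity-check with back-of-the-envelope arithmetic (assuming a standard 4.7 GB single-layer DVD):

```python
# How many 4.7 GB DVDs does one exabyte of data fill?
GB_PER_EB = 10**9        # 1 exabyte = 10^9 gigabytes in decimal units
DVD_GB = 4.7             # capacity of a single-layer DVD

dvds_per_exabyte = GB_PER_EB / DVD_GB
print(round(dvds_per_exabyte))   # 212765957
```

So the familiar 212,765,957-DVD figure corresponds to one exabyte; 463 exabytes per day would be 463 times that.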




3V (Volume, Variety, Velocity) characteristics

➢ Volume:

– Large amounts of data, from datasets of terabytes up to zettabytes in size.

➢ Velocity:

– Large amounts of data arrive from transactions with a high refresh rate, resulting in data streams coming in at great speed, and the time to act on these streams is often very short. There is a shift from batch processing to real-time streaming.

➢ Variety:

– Data come from different data sources.

– Data can come from both internal and external data sources.


Types of Big Data

Structured Data: Databases

Semi-structured Data: XML/JSON data, Email, Web pages, Documents

Unstructured Data: Audio, Video, Image data, Natural language
Types of Big Data

■ Structured Data

Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.

■ Unstructured Data

Any data whose form or structure is unknown is classified as unstructured data.

■ Semi-structured Data

Semi-structured data contains elements of both forms: it is structured in appearance, but it is not defined by, for example, a table definition in a relational DBMS.
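The difference is easy to see in code. A JSON record is self-describing (each value carries its field name), but unlike a relational table, nothing forces every record to share the same schema. A minimal sketch with made-up records:

```python
import json

# Two semi-structured records: self-describing, but with no fixed schema --
# the second record omits "age" and adds a field the first does not have.
records = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "email": "bob@example.com"}',
]

for raw in records:
    rec = json.loads(raw)                    # parsing works regardless of the fields
    print(rec.get("name"), rec.get("age"))   # a missing field simply yields None
```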




Application and use cases of Big Data

■ Big Data in Education Industry


• Customized and Dynamic Learning Programs
Customized programs and schemes that benefit individual students can be
created using the data collected on the basis of each student's learning
history. This improves overall student results.
• Reframing Course Material
Course material can be reframed according to data, collected through
real-time monitoring of course components, on what a student learns
and to what extent; this is beneficial for the students.

• Example
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no
real solutions for analyzing that much data, much of it went unused. Now administrators can apply analytics
and data visualizations to this data to draw out patterns of students, revolutionizing the university's
operations, recruitment, and retention efforts.
Application and use cases of Big Data

■ Big Data in Healthcare Industry

Example

• Wearable devices and sensors have been introduced in the healthcare industry which can provide a real-time feed to a patient's electronic health record. One such technology is from Apple.

• Apple has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.


Application and use cases of Big Data

■ Big Data in Government Sector

• Example

The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal Government of the USA, leverages the analysis of big data to discover patterns and associations in order to identify and examine expected or unexpected occurrences of food-based infections.
Application and use cases of Big Data

■ Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook
every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
Application and use cases of Big Data

■ Weather Patterns

IBM Deep Thunder, a research project by IBM, provides weather forecasting through high-performance computing of big data. IBM is also assisting Tokyo with improved weather forecasting for natural disasters and predicting the probability of damaged power lines.




Limitations of traditional large-scale systems

■ Problem—Schema-On-Write:

– Schema-on-write requires the data to be validated when it is written.

– This means that a lot of work must be done before new data sources can be analyzed.

– Example: Suppose a company wants to start analyzing a new source of data from unstructured or semi-
structured sources. A company will usually spend months (3–6 months) designing schemas and so on to store
the data in a data warehouse. That is 3 to 6 months that the company cannot use the data to make business
decisions. Then when the data warehouse design is completed 6 months later, often the data has changed
again. If you look at data structures from social media, they change on a regular basis. The schema-on-write
environment is too slow and rigid to deal with the dynamics of semi-structured and unstructured data
environments that are changing over a period of time.

■ The other problem with unstructured data is that traditional systems usually use large object (LOB) types to handle unstructured data, which are often inconvenient and difficult to work with.
Limitations of traditional large-scale systems

■ Solution—Schema-On-Read:

– Hadoop systems are schema-on-read, which means any data can be written to the storage system
immediately. Data are not validated until they are read. This enables Hadoop systems to load any type of data
and begin analyzing it quickly.
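The contrast between the two models can be sketched in plain Python (a toy illustration, not actual Hadoop code; the function names and the "store" lists are invented for the example):

```python
import json

# Schema-on-write: validate before storing; a bad record is rejected at load time.
def write_validated(store, raw):
    rec = json.loads(raw)
    if not isinstance(rec.get("age"), int):
        raise ValueError("schema violation: 'age' must be an int")
    store.append(rec)

# Schema-on-read: store the raw bytes immediately; apply a schema only at query time.
def read_with_schema(raw_store):
    for raw in raw_store:
        rec = json.loads(raw)
        yield {"name": rec.get("name"), "age": rec.get("age")}   # schema applied here

raw_store = ['{"name": "Alice", "age": 34}', '{"name": "Bob"}']  # loads with no upfront work
results = list(read_with_schema(raw_store))
print(results[1])   # {'name': 'Bob', 'age': None} -- tolerated, not rejected
```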
Limitations of traditional large-scale systems

■ Problem—Cost of Storage: Traditional systems use shared storage. As organizations start to ingest larger volumes of data, shared storage becomes cost prohibitive.

■ Solution—Local Storage: Hadoop can use the Hadoop Distributed File System (HDFS), a distributed file system that leverages local disks on commodity servers. Shared storage is about $1.20/GB, whereas local storage is about $0.04/GB. HDFS creates three replicas by default for high availability, so at 12 cents per GB it is still a fraction of the cost of traditional shared storage.
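The arithmetic behind the 12-cents figure is straightforward (the prices are the slide's own illustrative numbers, not current market rates):

```python
# Effective cost of triple-replicated HDFS storage vs. shared storage.
SHARED_COST_PER_GB = 1.20   # shared (SAN) storage, $/GB
LOCAL_COST_PER_GB = 0.04    # local commodity disk, $/GB
HDFS_REPLICAS = 3           # HDFS default replication factor

effective_hdfs_cost = LOCAL_COST_PER_GB * HDFS_REPLICAS
print(f"HDFS effective cost: ${effective_hdfs_cost:.2f}/GB")   # $0.12/GB
print(f"Shared storage costs {SHARED_COST_PER_GB / effective_hdfs_cost:.0f}x more")
```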
Limitations of traditional large-scale systems

■ Problem—Cost of Proprietary Hardware: Large proprietary hardware solutions can be cost prohibitive when deployed to process extremely large volumes of data. Organizations spend millions of dollars in hardware and software licensing costs to support large data environments, often growing their hardware in million-dollar increments to handle the increasing data. Traditional vendor systems that can grow to petabyte scale with good performance are extremely expensive.

■ Solution—Commodity Hardware: It is possible to build a high-performance supercomputer environment using Hadoop. One customer was evaluating a proprietary hardware vendor for a solution. The hardware vendor's solution was $1.2 million in hardware costs and $3 million in software licensing. The Hadoop solution with the same processing power was $400,000 for hardware, the software was free, and support costs were included. Because data volumes would constantly increase, the proprietary solution would have grown in $500,000 and $1 million increments, whereas the Hadoop solution grows in $10,000 and $100,000 increments.
Limitations of traditional large-scale systems

■ Problem—Complexity: When you look at any traditional proprietary solution, it is full of extremely complex silos of
system administrators, DBAs, application server teams, storage teams, and network teams. Often there is one DBA
for every 40 to 50 database servers. Anyone running traditional systems knows that complex systems fail in complex
ways.

■ Solution—Simplicity: Because Hadoop uses commodity hardware and follows the “shared-nothing” architecture, it is
a platform that one person can understand very easily. Numerous organizations running Hadoop have one
administrator for every 1,000 data nodes. With commodity hardware, one person can understand the entire
technology stack.
Limitations of traditional large-scale systems

■ Problem—Causation: Because data is so expensive to store in traditional systems, it is filtered and aggregated, and large volumes are thrown out because of the cost of storage. Minimizing the data to be analyzed reduces the accuracy and confidence of the results. Not only are the accuracy and confidence of the resulting data affected, but this also limits an organization's ability to identify business opportunities. Atomic data can yield more insights than aggregated data.

■ Solution—Correlation: Because of Hadoop's relatively low cost of storage, detailed records are stored in Hadoop's storage system, HDFS. Traditional data can then be analyzed together with non-traditional data in Hadoop to find correlation points that provide much higher accuracy of data analysis. We are moving to a world of correlation because the accuracy and confidence of the results are factors higher than with traditional systems. Organizations see big data as transformational: companies that would spend weeks or months building new predictive models and customer profiles now build them in a few days. One company had a data load that took 20 hours to complete; after moving to Hadoop, the load time went from 20 hours to 3 hours.


How a distributed way of computing is superior (cost and scale)

■ Big Data Architecture

– Time taken for data movement is high
– Operation is costly

[Diagram: traditional architecture (SLOW) vs. distributed big data architecture (FAST)]
How a distributed way of computing is superior (cost and scale)

■ Until 2000: Scale Up / Vertically


How a distributed way of computing is superior (cost and scale)

■ Scale Out / Horizontally
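A back-of-the-envelope calculation shows why scaling out wins for large reads (the 100 MB/s per-disk throughput is an assumed, but realistic, figure):

```python
# Time to read 1 TB from one disk vs. in parallel across many machines,
# assuming each disk sustains 100 MB/s and parallelism is perfect.
DATA_MB = 1_000_000   # 1 TB expressed in MB (decimal units)
DISK_MBPS = 100       # assumed sustained read throughput per disk

def read_time_seconds(n_disks):
    return DATA_MB / (DISK_MBPS * n_disks)

print(f"1 machine:    {read_time_seconds(1) / 3600:.1f} hours")    # ~2.8 hours
print(f"100 machines: {read_time_seconds(100) / 60:.1f} minutes")  # ~1.7 minutes
```

This is the essence of horizontal scaling: adding cheap machines divides the work, instead of buying one ever-larger machine.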


Opportunities and challenges with Big Data

■ Big Data Challenges

1. Huge data sources and poor data quality

■ Big data is characterized by heterogeneous data sources such as images, videos, and audio.

2. Efficient storage of Big Data

■ The way Big Data is stored affects not only cost but also analysis and processing. To meet service and analysis requirements, reliable, high-performance, high-availability, and low-cost storage for Big Data needs to be developed.

3. Efficient processing of unstructured and semi-structured data

■ Databases and warehouses are unsatisfactory for processing unstructured and semi-structured data. With Big Data, read/write operations are highly concurrent for large numbers of users, and as the size of the database increases, algorithms may become insufficient.
Opportunities and challenges with Big Data

■ Big Data Opportunities

– Enhanced information management

■ Big Data enables enhanced discovery, access, availability, exploitation, and provisioning of information
within companies and the supply chain. It can enable the discovery of new data sets that are not yet
being used to drive value.

– Increased operations efficiency and maintenance

– Enhanced product and market strategy

■ Big Data analytics can enhance customer segmentation, allowing for better scalability and mass personalization. It can improve customer service levels, enhance customer acquisition and sales strategies (through web and social channels), and enable customization of delivery.
Opportunities and challenges with Big Data

■ Big Data Opportunities

– Innovation and product design benefits

■ A wide variety of data streams can aid innovation and product design, including product usage data, point-of-sale data, field data from devices, customer data, and supplier suggestions, all of which can drive product and process innovation.

– Positive financial implications

■ Big Data can reduce long-term costs, increase the ability to invest, and improve understanding of cost drivers and their impacts.
Thank you
