
INTRODUCTION TO BIG DATA

Instructor: Oussama Derbel


About Me

■ Oussama Derbel, PhD

■ Big Data and Machine Learning Trainer

■ 2 years as an IT Instructor

■ 6 years of experience in R&D

■ 3 years of experience as an R&D Project Manager

■ Certificates:

1. Big Data Analytics using Spark

2. Machine Learning

3. Python For Data Science

4. AWS Certified Data Analytics (In Progress)


OUTLINE
Introduction

3V (Volume, Variety, Velocity) characteristics

Types of Big Data

Application and use cases of Big Data

Limitations of traditional large-scale systems

How a distributed way of computing is superior (cost and scale)

Opportunities and challenges with Big Data




Introduction

■ What is Data?

■ What is Big Data?

■ What is Information?

Introduction

■ What is Data?

– The quantities, characters, or symbols on which operations are performed by a computer.

– Data may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

■ What is Big Data?

– Big Data is still data, but of such huge and exponentially growing volume that none of the traditional data management tools can store or process it efficiently.

Introduction

■ What is Information?

– Information is a set of data processed in a meaningful way according to a given requirement.

– Information is processed, structured, or presented in a given context to make it meaningful and useful.

[Diagram: Data → ETL Process → Data Warehouse → Analyse → Present]


Introduction

■ What is ETL?
– ETL stands for Extract, Transform, Load: data is extracted from source systems, transformed into a consistent format, and loaded into a data warehouse.
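The three stages can be sketched in a few lines of Python (a toy illustration, not production ETL code; the sample rows, the `transform` rule, and the list-based `warehouse` are all made up for the example):

```python
# Toy ETL pipeline: extract raw records, transform them, load them into a "warehouse".
raw_rows = ["  Alice,34 ", "bob,29", "  CAROL,41"]            # Extract: raw source data

def transform(row):
    name, age = row.strip().split(",")                        # clean whitespace, split fields
    return {"name": name.strip().title(), "age": int(age)}    # normalize name, cast age

warehouse = [transform(r) for r in raw_rows]                  # Load into the target structure
print(warehouse[0])   # {'name': 'Alice', 'age': 34}
```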
Introduction

■ Data Growth over the years



Byte = 8 bits (e.g., 00100010)
1 KiloByte (KB) = 2^10 Bytes = 1024 Bytes
1 MegaByte (MB) = 10^3 KB
1 GigaByte (GB) = 10^3 MB
1 TeraByte (TB) = 10^3 GB
1 PetaByte (PB) = 10^3 TB
1 ExaByte (EB) = 10^3 PB
1 ZettaByte (ZB) = 10^3 EB

Typical sizes: hard disk ≈ 1 TB, mobile phone ≈ 128 GB, photo ≈ 10-15 MB
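The decimal steps in the table above can be captured in a short helper (an illustrative sketch; the function name and unit list are my own):

```python
# Convert a raw byte count into a human-readable size using the decimal
# (factor-of-1000) units from the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(n_bytes):
    size, unit = float(n_bytes), 0
    while size >= 1000 and unit < len(UNITS) - 1:
        size /= 1000          # step up one unit per factor of 1000
        unit += 1
    return f"{size:g} {UNITS[unit]}"

print(human_readable(128_000_000_000))   # 128 GB (a typical phone)
print(human_readable(10**12))            # 1 TB (a typical hard disk)
```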
Introduction

■ Data Growth over the years

Internet Data generated every day in 2020

• 500 million tweets are sent

• 294 billion emails are sent

• 4 petabytes of data are created on Facebook

• 4 terabytes of data are created from each connected car

• 65 billion messages are sent on WhatsApp

• 5 billion searches are made

• By 2025, it is estimated that 463 exabytes of data will be created each day globally. For scale, a single exabyte alone is the equivalent of roughly 212,765,957 DVDs (at 4.7 GB per DVD).
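The DVD comparison is easy to sanity-check with back-of-the-envelope arithmetic (assuming a standard 4.7 GB single-layer DVD):

```python
# How many 4.7 GB DVDs does one exabyte of data fill?
GB_PER_EB = 10**9        # 1 exabyte = 10^9 gigabytes in decimal units
DVD_GB = 4.7             # capacity of a single-layer DVD

dvds_per_exabyte = GB_PER_EB / DVD_GB
print(round(dvds_per_exabyte))   # 212765957
```

So the familiar 212,765,957-DVD figure corresponds to one exabyte; 463 exabytes per day would be 463 times that.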




3V (Volume, Variety, Velocity) characteristics

➢ Volume:

– Large amounts of data, from datasets of terabytes up to zettabytes in size.

➢ Velocity:

– Large amounts of data arrive from transactions with a high refresh rate, resulting in data streams coming in at great speed, and the time to act on these streams is often very short. There is a shift from batch processing to real-time streaming.

➢ Variety:

– Data come from different data sources.

– Data can come from both internal and external data sources.


Types of Big Data

Structured Data: Databases

Semi-structured Data: XML/JSON data, Email, Web pages, Documents

Unstructured Data: Audio, Video, Image data, Natural language
Types of Big Data

■ Structured Data

Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.

■ Unstructured Data

Any data whose form or structure is unknown is classified as unstructured data.

■ Semi-structured Data

Semi-structured data contains elements of both forms: it is structured in appearance, but it is not defined by, for example, a table definition in a relational DBMS.
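The difference is easy to see in code. A JSON record is self-describing (each value carries its field name), but unlike a relational table, nothing forces every record to share the same schema. A minimal sketch with made-up records:

```python
import json

# Two semi-structured records: self-describing, but with no fixed schema --
# the second record omits "age" and adds a field the first does not have.
records = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "email": "bob@example.com"}',
]

for raw in records:
    rec = json.loads(raw)                    # parsing works regardless of the fields
    print(rec.get("name"), rec.get("age"))   # a missing field simply yields None
```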




Application and use cases of Big Data

■ Big Data in Education Industry


• Customized and Dynamic Learning Programs
Customized programs and schemes that benefit individual students can be
created using the data collected on the basis of each student's learning
history. This improves overall student results.
• Reframing Course Material
Course material can be reframed according to data, collected through
real-time monitoring of course components, on what a student learns
and to what extent; this is beneficial for the students.

• Example
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no
real solutions for analyzing that much data, much of it went unused. Now administrators can apply analytics
and data visualizations to this data to draw out patterns of students, revolutionizing the university's
operations, recruitment, and retention efforts.
Application and use cases of Big Data

■ Big Data in Healthcare Industry

Example

• Wearable devices and sensors have been introduced in the healthcare industry which can provide a real-time feed to a patient's electronic health record. One such technology is from Apple.

• Apple has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.


Application and use cases of Big Data

■ Big Data in Government Sector

• Example

The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal Government of the USA, leverages the analysis of big data to discover patterns and associations in order to identify and examine expected or unexpected occurrences of food-based infections.
Application and use cases of Big Data

■ Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook
every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
Application and use cases of Big Data

■ Weather Patterns

IBM Deep Thunder, a research project by IBM, provides weather forecasting through high-performance computing of big data. IBM is also assisting Tokyo with improved weather forecasting for natural disasters and predicting the probability of damaged power lines.




Limitations of traditional large-scale systems

■ Problem—Schema-On-Write:

– Schema-on-write requires the data to be validated when it is written.

– This means that a lot of work must be done before new data sources can be analyzed.

– Example: Suppose a company wants to start analyzing a new source of data from unstructured or semi-
structured sources. A company will usually spend months (3–6 months) designing schemas and so on to store
the data in a data warehouse. That is 3 to 6 months that the company cannot use the data to make business
decisions. Then when the data warehouse design is completed 6 months later, often the data has changed
again. If you look at data structures from social media, they change on a regular basis. The schema-on-write
environment is too slow and rigid to deal with the dynamics of semi-structured and unstructured data
environments that are changing over a period of time.

■ The other problem with unstructured data is that traditional systems usually use large object (LOB) types to handle unstructured data, which are often inconvenient and difficult to work with.
Limitations of traditional large-scale systems

■ Solution—Schema-On-Read:

– Hadoop systems are schema-on-read, which means any data can be written to the storage system
immediately. Data are not validated until they are read. This enables Hadoop systems to load any type of data
and begin analyzing it quickly.
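The contrast between the two models can be sketched in plain Python (a toy illustration, not actual Hadoop code; the function names and the "store" lists are invented for the example):

```python
import json

# Schema-on-write: validate before storing; a bad record is rejected at load time.
def write_validated(store, raw):
    rec = json.loads(raw)
    if not isinstance(rec.get("age"), int):
        raise ValueError("schema violation: 'age' must be an int")
    store.append(rec)

# Schema-on-read: store the raw bytes immediately; apply a schema only at query time.
def read_with_schema(raw_store):
    for raw in raw_store:
        rec = json.loads(raw)
        yield {"name": rec.get("name"), "age": rec.get("age")}   # schema applied here

raw_store = ['{"name": "Alice", "age": 34}', '{"name": "Bob"}']  # loads with no upfront work
results = list(read_with_schema(raw_store))
print(results[1])   # {'name': 'Bob', 'age': None} -- tolerated, not rejected
```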
Limitations of traditional large-scale systems

■ Problem—Cost of Storage: Traditional systems use shared storage. As organizations start to ingest larger volumes of data, shared storage becomes cost prohibitive.

■ Solution—Local Storage: Hadoop can use the Hadoop Distributed File System (HDFS), a distributed file system that leverages local disks on commodity servers. Shared storage is about $1.20/GB, whereas local storage is about $0.04/GB. HDFS creates three replicas by default for high availability, so at 12 cents per GB it is still a fraction of the cost of traditional shared storage.
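The arithmetic behind the 12-cents figure is straightforward (the prices are the slide's own illustrative numbers, not current market rates):

```python
# Effective cost of triple-replicated HDFS storage vs. shared storage.
SHARED_COST_PER_GB = 1.20   # shared (SAN) storage, $/GB
LOCAL_COST_PER_GB = 0.04    # local commodity disk, $/GB
HDFS_REPLICAS = 3           # HDFS default replication factor

effective_hdfs_cost = LOCAL_COST_PER_GB * HDFS_REPLICAS
print(f"HDFS effective cost: ${effective_hdfs_cost:.2f}/GB")   # $0.12/GB
print(f"Shared storage costs {SHARED_COST_PER_GB / effective_hdfs_cost:.0f}x more")
```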
Limitations of traditional large-scale systems

■ Problem—Cost of Proprietary Hardware: Large proprietary hardware solutions can be cost prohibitive when deployed to process extremely large volumes of data. Organizations spend millions of dollars in hardware and software licensing costs to support large data environments, often growing their hardware in million-dollar increments to handle the increasing data. Traditional vendor systems that can grow to petabyte scale with good performance are extremely expensive.

■ Solution—Commodity Hardware: It is possible to build a high-performance supercomputer environment using Hadoop. One customer was evaluating a proprietary hardware vendor for a solution. The hardware vendor's solution was $1.2 million in hardware costs and $3 million in software licensing. The Hadoop solution with the same processing power was $400,000 for hardware, the software was free, and support costs were included. Because data volumes would constantly increase, the proprietary solution would have grown in $500,000 and $1 million increments, whereas the Hadoop solution grows in $10,000 and $100,000 increments.
Limitations of traditional large-scale systems

■ Problem—Complexity: When you look at any traditional proprietary solution, it is full of extremely complex silos of
system administrators, DBAs, application server teams, storage teams, and network teams. Often there is one DBA
for every 40 to 50 database servers. Anyone running traditional systems knows that complex systems fail in complex
ways.

■ Solution—Simplicity: Because Hadoop uses commodity hardware and follows the “shared-nothing” architecture, it is
a platform that one person can understand very easily. Numerous organizations running Hadoop have one
administrator for every 1,000 data nodes. With commodity hardware, one person can understand the entire
technology stack.
Limitations of traditional large-scale systems

■ Problem—Causation: Because data is so expensive to store in traditional systems, it is filtered and aggregated, and large volumes are thrown out because of the cost of storage. Minimizing the data to be analyzed reduces the accuracy and confidence of the results. Not only are the accuracy and confidence of the resulting data affected, but this also limits an organization's ability to identify business opportunities. Atomic data can yield more insights than aggregated data.

■ Solution—Correlation: Because of Hadoop's relatively low cost of storage, detailed records are stored in Hadoop's storage system, HDFS. Traditional data can then be analyzed together with non-traditional data in Hadoop to find correlation points that provide much higher accuracy of data analysis. We are moving to a world of correlation because the accuracy and confidence of the results are factors higher than with traditional systems. Organizations see big data as transformational: companies that would spend weeks or months building new predictive models and customer profiles now build them in a few days. One company had a data load that took 20 hours to complete; after moving to Hadoop, the load time went from 20 hours to 3 hours.


How a distributed way of computing is superior (cost and scale)

■ Big Data Architecture

– Time taken for data movement is high
– Operation is costly

[Diagram: traditional architecture (SLOW) vs. distributed big data architecture (FAST)]
How a distributed way of computing is superior (cost and scale)

■ Until 2000: Scale Up / Vertically


How a distributed way of computing is superior (cost and scale)

■ Scale Out / Horizontally
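A back-of-the-envelope calculation shows why scaling out wins for large reads (the 100 MB/s per-disk throughput is an assumed, but realistic, figure):

```python
# Time to read 1 TB from one disk vs. in parallel across many machines,
# assuming each disk sustains 100 MB/s and parallelism is perfect.
DATA_MB = 1_000_000   # 1 TB expressed in MB (decimal units)
DISK_MBPS = 100       # assumed sustained read throughput per disk

def read_time_seconds(n_disks):
    return DATA_MB / (DISK_MBPS * n_disks)

print(f"1 machine:    {read_time_seconds(1) / 3600:.1f} hours")    # ~2.8 hours
print(f"100 machines: {read_time_seconds(100) / 60:.1f} minutes")  # ~1.7 minutes
```

This is the essence of horizontal scaling: adding cheap machines divides the work, instead of buying one ever-larger machine.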


Opportunities and challenges with Big Data

■ Big Data Challenges

1. Huge data sources and poor data quality

■ Big data is characterized by heterogeneous data sources such as images, videos, and audio.

2. Efficient storage of Big Data

■ The way Big Data is stored affects not only cost but also analysis and processing. To meet service and analysis requirements, reliable, high-performance, high-availability, and low-cost storage for Big Data needs to be developed.

3. Efficient processing of unstructured and semi-structured data

■ Databases and warehouses are unsatisfactory for processing unstructured and semi-structured data. With Big Data, read/write operations are highly concurrent for large numbers of users, and as the size of the database increases, algorithms may become insufficient.
Opportunities and challenges with Big Data

■ Big Data Opportunities

– Enhanced information management

■ Big Data enables enhanced discovery, access, availability, exploitation, and provisioning of information
within companies and the supply chain. It can enable the discovery of new data sets that are not yet
being used to drive value.

– Increased operations efficiency and maintenance

– Enhanced product and market strategy

■ Big Data analytics can enhance customer segmentation, allowing for better scalability and mass personalization. It can improve customer service levels, enhance customer acquisition and sales strategies (through web and social channels), and enable customization of delivery.
Opportunities and challenges with Big Data

■ Big Data Opportunities

– Innovation and product design benefits

■ A wide variety of data streams can aid innovation and product design, including product usage data, point-of-sale data, field data from devices, customer data, and supplier suggestions, all of which can drive product and process innovation.

– Positive financial implications

■ Big Data can reduce long-term costs, increase the ability to invest, and improve understanding of cost drivers and their impacts.
Thank you
