02 Chapter 2 - ET - Data Science

CT060-3-3-EMTECH-Emerging Technology
EMERGING TECHNOLOGY – DATA SCIENCE
CT060-3-3-EMTECH Chapter 2: ET – Data Science SLIDE 1

TOPIC LEARNING OUTCOMES
At the end of this topic, you should be able to:
1. Describe what data science is.
2. Describe the data processing life cycle.
3. Understand the basics of Big Data.
4. Describe the data value chain in the emerging era of big data.
5. Describe the purpose of the Hadoop ecosystem components in relation to emerging technology.

Contents & Structure
• Overview of Data Science.
• Data Processing Cycle.
• Data Value Chain.
• Big Data Concepts.
• Hadoop Ecosystem.

Key Terminologies
• Data Science
• Data & Information
• Processing Cycle
• Structured
• Unstructured
• Big Data
• Hadoop

What is Data Science
• Multi-disciplinary field that uses scientific methods, processes, algorithms & systems to
extract knowledge & insights from structured, semi-structured and unstructured data.
• More than simply analyzing data.
• Offers a whole range of new roles and requires a specific range of skill sets.

What is Data Science
• The roles and skill sets relevant to Data Science include:
– Computer science
– Machine learning
– Data science
– Statistical mathematics
– Software development
– Business analytics
– Business understanding

Data vs Information

Data Processing Cycle
• The restructuring or re-ordering of data by people or machines in order to increase its usefulness and add value for a particular purpose.
• Consists of the following steps: input, processing and output.
Input → Processing → Output

Data Processing Cycle
Input
• Input data is prepared in some convenient form for processing, depending on the processing machine.
• Example: for computers, the input data can be recorded on any of several types of storage media, such as an HDD or a CD.

Processing
• The input data is changed to produce data in a more useful form.
• Example: interest can be calculated on a deposit to a bank, or a summary of sales for the month can be calculated from the sales orders.

Output
• The results of the preceding processing step are collected. The particular form of the output data depends on how the data will be used.
• Example: output data may be the payroll for employees.
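
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names (`read_input`, `process`, `write_output`) and the sample sales orders are hypothetical, chosen to mirror the "summary of sales for the month" example.

```python
from collections import defaultdict

def read_input(raw_lines):
    """Input: parse raw 'month,amount' text lines into (month, amount) records."""
    records = []
    for line in raw_lines:
        month, amount = line.strip().split(",")
        records.append((month, float(amount)))
    return records

def process(records):
    """Processing: turn individual sales orders into a total per month."""
    totals = defaultdict(float)
    for month, amount in records:
        totals[month] += amount
    return dict(totals)

def write_output(totals):
    """Output: format the summary in the form the reader needs."""
    return [f"{month}: {total:.2f}" for month, total in sorted(totals.items())]

# Hypothetical sales orders, used only for illustration.
raw_orders = ["Jan,120.50", "Jan,79.50", "Feb,200.00"]
report = write_output(process(read_input(raw_orders)))
for line in report:
    print(line)
```

Each function maps to one stage of the cycle, so the output of one stage is the input of the next.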

Data Types Representation
(Computer Programming)
• Data types can be described from various perspectives: the computer programming perspective and the data analytics perspective.
• In computer science, a data type simply refers to the attribute of data that tells the
compiler or interpreter how the programmer intends to use the data.
• Data types – Programming perspective includes:
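
As a minimal sketch of how a data type tells the interpreter what the programmer intends, the Python snippet below uses a few commonly cited types (integer, floating-point number, Boolean and string). The choice of types and all values are assumptions made for illustration.

```python
# Hypothetical values, chosen only for illustration.
count = 42          # int: whole number
price = 19.99       # float: floating-point number
is_valid = True     # bool: True or False
name = "Hadoop"     # str: text (Python has no separate character type)

for value in (count, price, is_valid, name):
    print(type(value).__name__, value)

# The same operator means different things depending on the data type.
total = count + 1       # numeric addition
joined = name + "!"     # string concatenation
```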

(Data Analytics)
• From a data analytics perspective, there are three (3) common data types / structures:
– Structured data
– Unstructured data
– Semi-structured data

(Data Analytics)
Structured Data
• Data that adheres to a pre-defined data model and is straightforward to analyze.
• Commonly conforms to a tabular format with a relationship between the different rows & columns.
• Has structured rows and columns that can be sorted.
• Examples: Excel files & SQL databases.

Semi-structured Data
• A form of structured data that does not conform to the formal structure of data models associated with relational databases.
• Contains tags or other markers to separate semantic elements and enforce hierarchies of records & fields within the data.
• Also known as a self-describing structure.
• Examples: JSON files & XML files.

Unstructured Data
• Contains information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
• Typically text-heavy, but may contain data such as dates, numbers, and facts as well.
• This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
• Examples: audio files, video files and NoSQL databases.
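
The difference between the three structures can be seen in how a program handles each one. The sketch below uses only the Python standard library. The CSV content, JSON record and free-text note are hypothetical samples made up for illustration.

```python
import csv
import io
import json

# Structured: tabular rows and columns that can be sorted by a column.
csv_text = "id,name,amount\n1,Alice,30\n2,Bob,45\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
rows_sorted = sorted(rows, key=lambda r: int(r["amount"]), reverse=True)

# Semi-structured: self-describing keys and nesting, but no fixed table schema.
json_text = '{"id": 1, "name": "Alice", "orders": [{"item": "book", "qty": 2}]}'
record = json.loads(json_text)
first_item = record["orders"][0]["item"]

# Unstructured: free text with no pre-defined model; only crude
# operations (such as counting words) apply without further analysis.
note = "Alice called on 3 March about her order of 2 books."
word_count = len(note.split())

print(rows_sorted[0]["name"], first_item, word_count)
```

The structured rows can be sorted directly, the JSON record is navigated through its own keys, and the note gives up its dates and numbers only after extra text processing.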

Data Value Chain
• Introduced to describe the information flow within a big data system.
• Series of steps needed to generate value & useful insights from data.
• It identifies the following key high-level activities:

What is BIG DATA?
• Blanket term for the non-traditional strategies & technologies needed to gather, organize, process and generate insights from large datasets.
• Big data – a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools / traditional data processing applications.
• The datasets are too large to reasonably process or store with traditional tooling or on a single computer.

What is BIG DATA?
• Big data is characterized by 4 Vs:
• Volume:
– Large amounts of data; zettabytes / massive datasets.
• Velocity:
– Data is live streaming or in motion.
• Variety:
– Data comes in many different forms from diverse sources.
• Veracity:
– Can the data be trusted? How accurate is the data?

Clustered Computing
• Due to the qualities of big data, individual computers are often inadequate for handling
the data at most stages.
• To better address the high storage and computational needs of big data, computer
clusters / clustered computing are a better fit.
• Clustered computing is a set of computers that work together so that they can be viewed as a single system.
• Big data clustering software combines the resources of many smaller machines in order to provide several benefits: high availability, resource pooling & easy scalability.

Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
• It’s a framework that allows for the distributed processing of large datasets across
clusters of computers using simple programming models.
• It was inspired by a technical document published by Google.

• The four key characteristics of Hadoop are:
Economical:
Its systems are highly economical, as ordinary computers can be used for data processing.
Reliable:
It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
Scalable:
It is easily scalable, both horizontally & vertically. A few extra nodes help in scaling up the framework.
Flexible:
It is flexible: you can store as much structured and unstructured data as needed & decide how to use it later.
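
The "Reliable" characteristic (copies of the data on different machines) can be illustrated with a toy model. The sketch below is an assumption-laden simplification: the round-robin placement, the function names `place_replicas` and `readable_blocks`, and the node and block names are all hypothetical, and this is not how Hadoop itself chooses where to put copies.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return [block for block, holders in placement.items()
            if any(node not in failed_nodes for node in holders)]

# Hypothetical cluster of four machines holding three data blocks.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_a", "blk_b", "blk_c"], nodes, replication=3)

# With three copies per block, losing one machine loses no data.
survivors = readable_blocks(placement, failed_nodes={"node2"})
print(survivors)
```

Because every block lives on three different machines, a block becomes unreadable only when all three of its holders fail at once.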

• Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing and storage.
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
• HDFS – Hadoop Distributed File System
• YARN – Yet Another Resource Negotiator
• MapReduce – Programming-based data processing
• Spark – In-memory data processing
• Pig, Hive – Query-based processing of data services
• Mahout, Spark MLlib – Machine learning algorithm libraries
• Solr, Lucene – Searching & indexing
• ZooKeeper – Managing the cluster
• Oozie – Job scheduling
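
To give a feel for the MapReduce style of processing listed above, here is a minimal single-machine sketch of a word count in plain Python. It does not use Hadoop: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are hypothetical, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input documents, used only for illustration.
documents = ["big data needs big clusters", "data clusters scale out"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

The map step works on each document independently and the reduce step works on each key independently, which is what lets a framework spread both steps over many machines.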

Big Data Life Cycle with Hadoop
• Stage 1 – Ingest
Data is ingested or transferred to Hadoop from various sources such as RDBMS, etc.
• Stage 2 – Processing
Data is stored & processed. Data is stored in the distributed file system, HDFS, & as NoSQL distributed data.
• Stage 3 – Analyze
Data is analyzed by processing frameworks such as Impala & Hive.
• Stage 4 – Access
Performed by tools such as Hue and Cloudera Search. In this stage, the analyzed data can be accessed by users.

BIG DATA applications in Industry
• Retail
• E-Commerce
• Banking, Financial Services & Insurance (BFSI)
• Manufacturing
• Logistics, Media & Entertainment
• Oil & Gas

Impact of BIG DATA on Business
• Improved customer service.
• Enhanced customer experience.
• Improved target marketing.
• Cost reduction.
• Improved efficiency.
• Assistance in analyzing business data.
• Improved decision-making.
• Social & economic benefits to the organization.

Summary / Recap of Main Points
• What is Data Science
• Data vs Information
• Data Processing Cycle
• Data Types Representation
• Data value chain
• What is Big Data
• Clustered Computing
• Hadoop Ecosystem

Question & Answer Session
