
CT060-3-3-EMTECH-Emerging Technology

EMERGING TECHNOLOGY – DATA SCIENCE

CT060-3-3-EMTECH Chapter 2: ET – Data Science SLIDE 1


TOPIC LEARNING OUTCOMES
At the end of this topic, you should be able to:
1. Describe what data science is.
2. Describe the data processing life cycle.
3. Understand the basics of Big Data.
4. Describe the data value chain in the emerging era of big data.
5. Describe the purpose of the Hadoop ecosystem components in
emerging technology.



Contents & Structure

• Overview of Data Science.
• Data Processing Cycle.
• Data Value Chain.
• Big Data Concepts.
• Hadoop Ecosystem.



Key Terminologies
• Data Science
• Data & Information
• Processing Cycle
• Structured
• Unstructured
• Big Data
• Hadoop



What is Data Science

• A multi-disciplinary field that uses scientific methods, processes, algorithms & systems to
extract knowledge & insights from structured, semi-structured and unstructured data.
• More than simply analyzing data.
• Offers a whole range of new roles and requires a specific range of skill sets.



What is Data Science

• The roles and skill sets relevant to data science include:
– Computer science
– Machine learning
– Data science
– Statistical mathematics
– Software development
– Business analytics
– Business understanding



Data vs Information



Data Processing Cycle

• The restructuring or re-ordering of data by people or machines in order to increase its
usefulness and add value for a particular purpose.
• Consists of three steps: input, processing and output.

Input → Processing → Output



Data Processing Cycle
Input
• Input data is prepared in a form convenient for the processing machine.
• Example: for computers, the input data can be recorded on any of several types of storage medium, such as HDD, CD and others.

Processing
• The input data is changed to produce data in a more useful form.
• Example: interest can be calculated on a deposit to a bank, or a summary of sales for the month can be calculated from the sales orders.

Output
• The results of the preceding processing step are collected. The particular form of the output data depends on how the data will be used.
• Example: the output data may be the payroll for employees.
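
The bank-interest example can be sketched as a small input, processing, output pipeline. This is only a minimal illustration: the function names, the sample deposits and the 5% annual rate are assumptions made for the example and do not come from the course material.

```python
# Hypothetical sketch of the input -> processing -> output cycle.
# The deposits and the 5% rate are illustrative assumptions.

def read_input():
    """Input: prepare raw data in a convenient form (here, a list of dicts)."""
    return [
        {"account": "A-001", "deposit": 1000.0},
        {"account": "A-002", "deposit": 2500.0},
    ]

def process(records, annual_rate=0.05):
    """Processing: turn raw deposits into a more useful form (interest earned)."""
    return [
        {**r, "interest": round(r["deposit"] * annual_rate, 2)}
        for r in records
    ]

def write_output(results):
    """Output: collect the results in the form the user needs (report lines)."""
    return [f"{r['account']}: interest = {r['interest']:.2f}" for r in results]

report = write_output(process(read_input()))
for line in report:
    print(line)
```

Each stage only depends on the result of the stage before it, which mirrors the one-way flow of the cycle.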



Data Types Representation
(Computer Programming)
• Data types can be described from two perspectives: the computer programming
perspective & the data analytics perspective.
• In computer science, a data type is the attribute of data that tells the
compiler or interpreter how the programmer intends to use the data.
• From the programming perspective, data types include:
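
As a minimal illustration of how a data type guides the interpreter, the sketch below uses four built-in Python types. The choice of Python and of these particular types is an assumption for the example; it is not the list from the original slide.

```python
# Illustrative only: these four built-in Python types are chosen as examples.
values = [42, 3.14, "hello", True]

# The type attached to each value tells the interpreter which operations apply.
type_names = [type(v).__name__ for v in values]
print(type_names)

# The same operator behaves differently depending on the operand types.
int_sum = 2 + 3          # integer addition
str_join = "2" + "3"     # string concatenation
print(int_sum, str_join)
```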



Data Types Representation
(Data Analytics)
• From the data analytics perspective, there are three (3) common data types / structures:
– Structured data
– Unstructured data
– Semi-structured data



Data Types Representation
(Data Analytics)

Structured Data
• Data that adheres to a pre-defined data model and is straightforward to analyze.
• Commonly conforms to a tabular format with a relationship between the different rows & columns.
• Has structured rows and columns that can be sorted.
• Examples: Excel files & SQL databases.

Semi-structured Data
• A form of structured data that does not conform to the formal structure of data models associated with relational databases.
• Contains tags or other markers to separate semantic elements and enforce hierarchies of records & fields within the data.
• Also known as a self-describing structure.
• Examples: JSON files & XML files.

Unstructured Data
• Contains information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
• Typically text-heavy, but may contain data such as dates, numbers, and facts.
• Results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
• Examples: audio files, video files & NoSQL databases.
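
The difference between the three kinds of data shows up as soon as a program tries to read them. The sketch below is a minimal illustration using only the Python standard library; the sample CSV, JSON and text values are invented for the example.

```python
import csv
import io
import json

# Structured: tabular text with a fixed set of columns (assumed sample data).
csv_text = "id,name,amount\n1,Ali,120.50\n2,Mei,80.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: JSON keys act as self-describing tags; fields may vary per record.
json_text = '[{"id": 1, "tags": ["new"]}, {"id": 2, "email": "mei@example.com"}]'
records = json.loads(json_text)
keys_per_record = [sorted(rec.keys()) for rec in records]

# Unstructured: free text has no data model, so even simple questions need ad hoc parsing.
note = "Customer called on 3 May about order 2 and asked for a refund."
word_count = len(note.split())

print(total)
print(keys_per_record)
print(word_count)
```

The CSV rows can be summed by column name directly, the JSON records have to be inspected one by one because their fields differ, and the free-text note offers nothing beyond raw words without further processing.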



Data Value Chain
• Introduced to describe the information flow within a big data system.
• Series of steps needed to generate value & useful insights from data.
• It identifies the following key high-level activities:



What is BIG DATA?

• A blanket term for the non-traditional strategies & technologies needed to
gather, organize, process and generate insights from large datasets.
• Big data is a collection of data sets so large and complex that it becomes difficult
to process using on-hand database management tools / traditional data
processing applications.
• The datasets are too large to reasonably process or store with traditional
tooling or on a single computer.



What is BIG DATA?

• Big data is characterized by the 4 Vs:
• Volume:
– Large amounts of data; zettabytes / massive datasets.
• Velocity:
– Data is live streaming or in motion.
• Variety:
– Data comes in many different forms from diverse sources.
• Veracity:
– Can the data be trusted? How accurate is the data?



Clustered Computing
• Due to the qualities of big data, individual computers are often inadequate for handling
the data at most stages.
• To better address the high storage and computational needs of big data, computer
clusters / clustered computing are a better fit.
• Clustered computing is a set of computers that work together so that they can be viewed as a
single system.
• Big data clustering software combines the resources of many smaller machines in order
to provide several benefits: high availability, resource pooling & easy scalability.



Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
• It’s a framework that allows for the distributed processing of large datasets across
clusters of computers using simple programming models.
• It was inspired by a technical document published by Google.



Hadoop and its Ecosystem
• The four key characteristics of Hadoop are:

Economical:
Its systems are highly economical, as ordinary computers can be used for data processing.
Reliable:
It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
Scalable:
It is easily scalable, both horizontally & vertically. Adding a few extra nodes helps scale up the framework.
Flexible:
It is flexible: you can store as much structured and unstructured data as needed and decide how to use it later.
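
The reliability claim can be made concrete with a toy model of block replication. The sketch below is an assumption-laden illustration, not HDFS's actual placement policy: the node names, the round-robin placement and the helper functions are all hypothetical. It keeps three copies of each block, which matches the default HDFS replication factor.

```python
# Toy model of block replication; NOT HDFS's actual placement policy.
# Node names and the round-robin placement are illustrative assumptions.

def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement

def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(num_blocks=6, nodes=nodes)

# With two of the four machines down, every block still has a live copy,
# because each block sits on three distinct nodes.
survivors = readable_blocks(placement, failed={"node1", "node2"})
print(survivors)
```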



Hadoop and its Ecosystem

• Hadoop has an ecosystem that has evolved
from its four core components: data
management, access, processing and storage.
• It is continuously growing to meet the needs of
Big Data.
• Comprises the following components and
many others:



Hadoop and its Ecosystem

• Comprises the following components and many others:
– HDFS – Hadoop Distributed File System
– YARN – Yet Another Resource Negotiator
– MapReduce – Programming-based data processing
– Spark – In-memory data processing
– Pig, Hive – Query-based processing of data services
– Mahout, Spark MLlib – Machine learning algorithm libraries
– Solr, Lucene – Searching & indexing
– ZooKeeper – Managing the cluster
– Oozie – Job scheduling
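
To give a feel for the MapReduce programming model listed above, the following is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop or any Hadoop API; the function names and the two sample lines are assumptions for the example. In a real cluster the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["big data needs big clusters", "data drives decisions"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, the work can be split across many nodes without the pieces needing to coordinate.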



Big Data Life Cycle with Hadoop

Cycle: Ingest → Process → Analyze → Access

• Stage 1 – Ingest
Data is ingested or transferred to Hadoop from various
sources such as RDBMS, etc.

• Stage 2 – Process
Data is stored & processed. Data is stored in the distributed
file system, HDFS, and in NoSQL distributed data stores.

• Stage 3 – Analyze
Data is analyzed by processing frameworks such as
Impala & Hive.

• Stage 4 – Access
Performed by tools such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed by
users.



BIG DATA applications in Industry
• Retail
• E-Commerce
• Banking, Financial Services & Insurance (BFSI)
• Manufacturing
• Logistics, Media & Entertainment
• Oil & Gas



Impact of BIG DATA on Business
• Improved customer service.
• Enhanced customer experience.
• Improved target marketing.
• Cost reduction.
• Improved efficiency.
• Assistance in analyzing business data.
• Improved decision-making.
• Social & economic benefits to the organization.



Summary / Recap of Main Points

• What is Data Science
• Data vs Information
• Data Processing Cycle
• Data Types Representation
• Data Value Chain
• What is Big Data
• Clustered Computing
• Hadoop Ecosystem



Question & Answer Session
