
Chapter 2

Data Science
Introduction

 In this chapter, we are going to learn more about:
 Data science
 Data vs. information
 Data types and representation
 Data value chain
 Basic concepts of big data

After completing this chapter, the students will be able to:

 Describe what data science is and the role of data scientists.
 Differentiate data and information.
 Describe the data processing life cycle.
 Understand different data types from diverse perspectives.
 Describe the data value chain in the emerging era of big data.
 Understand the basics of big data.
 Describe the purpose of the Hadoop ecosystem components.

2.1. An Overview of Data Science
Self Test Exercise
 What is data science?
 Can you describe the role of data in
emerging technology?
 What are data and information?
 What is big data?

Con’t…
 Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract
knowledge and insights from:
 Structured data (relational database): data whose
elements are addressable for effective analysis.
 It has been organized into a formatted repository that is typically a
database.
 It has relational keys and can easily be mapped into pre-designed
fields. Example: relational data.
 Semi-structured data: information that does not reside in a
relational database but has some organizational properties that make
it easier to analyze.
 Example: XML data.

Con’t…
 Unstructured data: data that is not organized in a
pre-defined manner or does not have a pre-defined data
model.
 It is not a good fit for a mainstream relational database.
 It is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics
applications. Example: Word, PDF, text, media logs.

Self Test Exercise
 What is the difference between structured, semi-structured, and
unstructured data? Discuss in detail.

Con’t…

 Data science is much more than simply analyzing data.
 E.g., supermarket data: on Thursday nights, people who buy milk
also tend to buy beer.
 Catalog orders of items,
 Know the profile of your best customers,
 Predict the most-wanted items for the next two months,
 The probability that a customer who buys a computer also buys a
pen drive.
Self Test Exercise
 Describe in some detail the main disciplines that
contribute to data science.
2.1.1 What are data and information?

 Data: a representation of facts, concepts, or instructions in a
formalized manner,
 suitable for communication, interpretation, or processing by
humans or electronic machines;
 represented with the help of characters such as alphabets,
digits, or special characters.

 Information: the processed data on which decisions and
actions are based.
 It is data that has been processed into a form that is
meaningful to the recipient.
Con’t…

Data                      Information
raw facts                 data with context
no context                processed data
just numbers and text     value-added to data:
                          summarized, organized, analyzed

Con’t…

Data: 51012
Information:
 5/10/12  The date of your final exam.
 51,012 Birr  The average starting salary of an accounting
major.
 51012  Zip code of Dilla.

2.1.2. Data Processing Cycle
 Data processing is the re-structuring or re-ordering of data by
people or machines to increase its usefulness.
 Three steps constitute the data processing cycle: input,
processing, and output.
 Input − in this step, the input data is prepared in some convenient
form for processing. The form will depend on the processing
machine.
 E.g., when electronic computers are used, the input data can be
recorded on any one of several types of storage medium,
such as a hard disk, CD, flash disk, and so on.

Con’t…
 Processing − in this step, the input data is changed to
produce data in a more useful form.
 For example, interest can be calculated on a deposit to a bank, or a
summary of sales for the month can be calculated from the sales
orders.
 Output − at this stage, the result of the preceding
processing step is collected.
 The particular form of the output data depends on the use of
the data. For example, output data may be payroll for
employees.

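 The following Java program is a minimal sketch of the cycle, assuming a
simple interest calculation as the processing step (names and prompts are
illustrative):

    import java.util.Scanner;

    // A minimal sketch of the input-process-output cycle.
    public class DataProcessingCycle {
        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);

            // Input: prepare the raw data in a convenient form.
            System.out.print("Deposit amount (Birr): ");
            double deposit = in.nextDouble();
            System.out.print("Annual interest rate (%): ");
            double rate = in.nextDouble();

            // Processing: change the input into a more useful form.
            double interest = deposit * rate / 100.0;

            // Output: collect and present the result.
            System.out.printf("Interest earned: %.2f Birr%n", interest);
        }
    }
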
Self Test Exercise
 Discuss the main differences between data and
information, with examples.
 Can we process data manually using a pencil and
paper?
 Discuss the differences between processing data
with a computer and processing it manually.

2.3 Data types and their representation

 Data types can be described from diverse perspectives.
 In computer science and computer programming, for instance,
a data type is simply an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.
 int x = 5;
 String name = "Melaku";
 final float pi = 3.14f;

2.3.1. Data types from Computer programming perspective

 What is a programming language?
 Almost all programming languages explicitly include the notion of
data type, though different languages may use different
terminology. Common data types include:
 Integers (int): used to store whole numbers,
mathematically known as integers.
 Booleans (bool): used to represent values restricted to one of
two values: true or false. E.g., boolean isSheCome = true;
 Characters (char): used to store a single character. E.g.,
char x = 'M';

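 A short runnable Java sketch putting these common types together
(the variable names are illustrative):

    // A minimal sketch of common data types in Java.
    public class CommonTypes {
        public static void main(String[] args) {
            int count = 42;            // integer: whole numbers
            boolean isSheCome = true;  // boolean: true or false only
            char initial = 'M';        // character: a single character
            String name = "Melaku";    // string: a sequence of characters
            final float PI = 3.14f;    // constant floating-point value

            System.out.println(count + " " + isSheCome + " "
                    + initial + " " + name + " " + PI);
        }
    }
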
Con’t
 Explain the types of programming languages.
 A data type constrains the values that an expression, such as a
variable or a function, might take.
 This data type defines the operations that can be done on the
data, the meaning of the data, and the way values of that type
can be stored.

2.3.2. Data types from Data Analytics perspective
 From a data analytics point of view, it is important to understand
that there are three common types of data types or structures:
 Structured
 Semi-structured, and
 Unstructured data types

Con’t…
Structured Data: data that adheres to a pre-defined data
model and is therefore straightforward to analyze.
 It conforms to a tabular format with relationships between the
different rows and columns.
 Common examples of structured data are Excel files or SQL
databases. Each of these has structured rows and columns that
can be sorted.

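 As an illustrative sketch (the table and column names are hypothetical),
structured data in a SQL database might look like:

    -- A pre-defined, tabular data model with typed columns.
    CREATE TABLE customers (
        id    INT PRIMARY KEY,
        name  VARCHAR(50),
        city  VARCHAR(30)
    );

    INSERT INTO customers VALUES (1, 'Melaku', 'Dilla');
    INSERT INTO customers VALUES (2, 'Abebe', 'Addis Ababa');

    -- Every row has the same columns, so the data is easy
    -- to sort, filter, and analyze.
    SELECT name FROM customers WHERE city = 'Dilla';
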
Con’t…
Semi-structured Data: a form of structured data
that does not conform to the formal structure of data
models associated with relational databases or other forms of
data tables.
 It contains tags or other markers to separate semantic
elements and enforce hierarchies of records and fields within
the data.
 Therefore, it is also known as a self-describing structure.
 JSON and XML are common examples of semi-structured
data.
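
 A small illustrative JSON document (the field names are hypothetical):
the tags name each semantic element, yet the two records need not share
the same fields:

    {
      "customers": [
        { "id": 1, "name": "Melaku", "city": "Dilla" },
        { "id": 2, "name": "Abebe", "email": "abebe@example.com" }
      ]
    }
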
Con’t…
Unstructured data: information that either does not have
a predefined data model or is not organized in a pre-defined
manner.
 Unstructured information is typically text-heavy but may
contain data such as dates, numbers, and facts as well.
 This results in irregularities and ambiguities that make it difficult to
understand using traditional programs, as compared to data
stored in structured databases.
 Common examples of unstructured data include audio and video
files, NoSQL databases, and social media data (Facebook, Twitter,
Instagram, …).

Con’t…
Metadata – Data about Data: this is not a separate data
structure, but it is one of the most important elements for Big
Data analysis and big data solutions.
 Metadata is data about data. It provides additional information
about a specific set of data.
 In a set of photographs, for example, metadata could
describe when and where the photos were taken.
 The metadata then provides fields for dates and locations
which, by themselves, can be considered structured data.
For this reason, metadata is frequently used by Big
Data solutions for initial analysis.
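
 An illustrative sketch of photo metadata (the field names are
hypothetical, loosely modeled on EXIF tags):

    photo: IMG_2041.jpg
      DateTaken:    2019-05-10 14:32
      GPSLatitude:  6.41
      GPSLongitude: 38.31
      CameraModel:  Canon EOS 80D

 The date and location fields are themselves structured data, which is
why metadata is a convenient starting point for analysis.
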
Con’t… Self Test Exercise
o Discuss data types from programming and
analytics perspectives.
o Compare metadata with structured,
unstructured and semi-structured data.
o Give at least one example of structured,
unstructured and semi-structured data types.

2.4. Data Value Chain
 The data value chain is introduced to describe the information flow
within a big data system as a series of steps needed to generate
value and useful insights from data.
 The Big Data Value Chain identifies the following key high-level
activities: data acquisition, data analysis, data curation, data
storage, and data usage.

2.4.1. Data Acquisition

 It is the process of gathering, filtering, and
cleaning data before it is put in a data warehouse.
 “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile collection
of data in support of management’s decision-
making process.”—W. H. Inmon
 A data warehouse is a decision support database
that is maintained separately from the
organization’s operational database.

Con’t…
 Data acquisition is one of the major big data
challenges in terms of infrastructure requirements.
 The infrastructure required to support the acquisition
of big data must deliver low, predictable latency both
in capturing data and in executing queries;
 be able to handle very high transaction volumes, often
in a distributed environment; and support flexible and
dynamic data structures.

2.4.2. Data Analysis

 It is concerned with making the raw data acquired amenable to
use in decision-making, as well as domain-specific usage.
 Data analysis involves exploring, transforming, and modeling data
with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential from a
business point of view.
 Related areas include data mining, business intelligence, and
machine learning.

2.4.3. Data Curation
 It is the active management of data over its life cycle to ensure
it meets the necessary data quality requirements for its
effective usage.
 Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation.
 Data curators (scientific curators or data annotators) hold
the responsibility of ensuring that data are trustworthy,
discoverable, accessible, reusable, and fit their purpose.

2.4.4. Data Storage
 It is the persistence and management of data in a scalable
way that satisfies the needs of applications that require
fast access to the data.
 RDBMSs have been the main, and almost unique, solution
to the storage paradigm for nearly 40 years.
 However, the ACID (Atomicity, Consistency, Isolation, and
Durability) properties that guarantee database
transactions lack flexibility with regard to schema changes;

Con’t…
 moreover, their performance and fault tolerance degrade when
data volumes and complexity grow, making them unsuitable
for big data scenarios.

 What is a NoSQL database? Compare it with SQL.
 NoSQL technologies have been designed with the
scalability goal in mind and present a wide range of
solutions based on alternative data models.

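 As an illustrative sketch (hypothetical data), a document-oriented
NoSQL store relaxes the fixed relational schema; two records in the same
collection may carry different fields:

    { "id": 1, "name": "Melaku", "city": "Dilla" }
    { "id": 2, "name": "Abebe", "orders": [101, 102], "vip": true }

 This schema flexibility is exactly what rigid, ACID-oriented relational
schemas lack.
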
2.4.5. Data Usage
 It covers the data-driven business activities that
need access to data, its analysis, and the tools
needed to integrate the data analysis within the
business activity.
 Data usage in business decision-making can
enhance competitiveness through the reduction of
costs, increased added value, or any other
parameter that can be measured against existing
performance criteria.

Self Test Exercise
 Which information flow step in the data value chain do you
think is labor-intensive? Why?

 What are the different data types and their value chain?

2.5. Basic concepts of big data

Big Data Everywhere!
 Lots of data is being collected
and warehoused:
 Web data, e-commerce
 Purchases at department/
grocery stores
 Bank/credit card
transactions
 Social networks

Con’t…
 Big data is a blanket term for the non-traditional
strategies and technologies needed to gather,
organize, process, and gain insights from large
datasets.
 While the problem of working with data that
exceeds the computing power or storage of a
single computer is not new, the pervasiveness,
scale, and value of this type of computing have
greatly expanded in recent years.

2.5.1. What Is Big Data?

 Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
 A “large dataset” means a dataset too large to reasonably
process or store with traditional tooling or on a single
computer.
 The common scale of big datasets is constantly shifting and
may vary significantly from organization to organization.

Big Data: 4 V’s
 Big data is characterized by 4 V’s and more:
 Volume: machine-generated data is produced in larger
quantities than non-traditional data.
 Large amounts of data: zettabytes/massive datasets.
 Velocity: the speed of data processing.
 Data is live-streaming or in motion.
 Variety: a large variety of input data, which in turn
generates a large amount of data as output.
 Data comes in many different forms from diverse sources.
 Veracity: can we trust the data? How accurate is it? etc.

2.5.2. Clustered Computing and Hadoop Ecosystem
2.5.2.1. Clustered Computing
 To better address the high storage and
computational needs of big data, computer clusters
are a better fit.
 Big data clustering software combines the
resources of many smaller machines, seeking to
provide a number of benefits:
 Resource Pooling: combining the available storage
space to hold data is a clear benefit, but CPU and
memory pooling are also extremely important.
Con’t…
 High Availability: clusters can provide varying levels
of fault tolerance and availability guarantees to
prevent hardware or software failures from affecting
access to data and processing.
 Easy Scalability: clusters make it easy to scale
horizontally by adding additional machines to the
group.
 Using clusters requires a solution for managing cluster
membership, coordinating resource sharing, and
scheduling actual work on individual nodes.
Con’t…
 Cluster membership and resource allocation can
be handled by software like Hadoop’s YARN
(which stands for Yet Another Resource
Negotiator).
 The assembled computing cluster often acts as a
foundation that other software interfaces with to
process the data.

Self Test Exercise
 List and discuss the characteristics of big data.
 Describe the big data life cycle. Which step do you think is
most useful, and why?
 List and describe each technology or tool used in the big
data life cycle.
 Discuss the three methods of computing over a large
dataset.

How much data?
 Google processes more than 20 PB a day (2018)
 Wayback Machine has 3 PB + 100 TB/month
(3/2009)
 Facebook processes 500+ TB/day (2019)
 eBay has 6.5 PB of user data + 50 TB/day
(5/2009)
 CERN’s Large Hadron Collider (LHC) generates 15
PB a year
Who is collecting what?

 Credit card companies: what data are they getting?
 Airline tickets
 Restaurant checks
 Grocery bills
 Hotel bills

How Can You Avoid Big Data?
 Pay cash for everything!
 Never go online!
 Don’t use a telephone! How????
 Don’t fill any prescriptions!
 Never leave your house!

Examples of big data…..
 Walmart handles more than 1 million customer
transactions every hour, which are imported into
databases estimated to contain more than 2.5
petabytes of data, the equivalent of 167 times the
information contained in all the books in the US
Library of Congress.
 The FICO Credit Card Fraud Detection System protects
2.1 billion active accounts world-wide.

Examples of big data…..
 NASA Climate Simulation: 32 petabytes
 The Large Hadron Collider: 25 petabytes annually,
200 petabytes after replication
 Walmart: 2.5 petabytes per hour

Who’s Generating Big Data

 Mobile devices
(tracking all objects all the time)
 Social media and networks
(all of us are generating data)
 Scientific instruments
(collecting all sorts of data)
 Sensor technology and networks
(measuring all kinds of data)

2.5.2.2. Hadoop and its Ecosystem
 Hadoop is an open-source framework intended to
make interaction with big data easier.
 It is a framework that allows for the distributed
processing of large datasets across clusters of
computers using simple programming models.
 It is designed to answer the question: “How to process
big data with reasonable cost and time?”

Con’t…
 Hadoop was inspired by a technical document published by Google.
The four key characteristics of Hadoop are:
 Economical: highly economical, as ordinary computers can be
used for data processing.
 Reliable: stores copies of the data on different machines and
is resistant to hardware failure.
 Scalable: easily scalable, both horizontally and vertically.
 Flexible: can store as much structured and unstructured
data as needed.

Con’t…
 Hadoop has an ecosystem that has evolved from
its four core components:
 data management,
 access,
 processing, and
 storage.
 It is continuously growing to meet the needs of Big Data.

Components of Big Data (Hadoop Ecosystem)
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: programming-based data processing (see the
sketch after this list)
 Spark: in-memory data processing
 Pig, Hive: query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: machine learning algorithm libraries
 Solr, Lucene: searching and indexing
 ZooKeeper: managing the cluster
 Oozie: job scheduling

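 To make the MapReduce component concrete, below is a minimal sketch of
the classic word-count job using the standard Hadoop MapReduce API (class
names and input/output paths are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
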
Assignment
 Discuss the purpose of each Hadoop ecosystem
component.

2.5.3. Big Data Life Cycle with Hadoop
2.5.3.1. Ingesting data into the system
 The first stage of big data processing is Ingest.
 The data is ingested or transferred to Hadoop
from various sources such as relational databases,
systems, or local files.
 Sqoop transfers data from RDBMSs to HDFS,
whereas Flume transfers event data.

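 For illustration, a typical Sqoop import command might look like the
following (the connection string, table, and target directory are
hypothetical):

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user \
      --table customers \
      --target-dir /data/ingest/customers
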
2.5.3.2. Processing the data in storage
 The second stage is Processing. In this stage, the
data is stored and processed.
 The data is stored in the distributed file system,
HDFS, and in the NoSQL distributed database, HBase.
 Spark and MapReduce perform the data processing.

2.5.3.3. Computing and analyzing data
 The third stage is Analyze. Here, the data is
analyzed by processing frameworks such as Pig,
Hive, and Impala.
 Pig converts the data using map and reduce
operations and then analyzes it.
 Hive is also based on map and reduce
programming and is most suitable for structured
data.

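 As a small illustration (the table and column names are hypothetical),
a HiveQL query over structured sales data might look like:

    SELECT product, SUM(amount) AS total_sales
    FROM sales
    GROUP BY product;

 Hive compiles such queries into map and reduce jobs behind the scenes.
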
2.5.3.4. Visualizing the results
 The fourth stage is Access, which is performed
by tools such as Hue and Cloudera Search.
 In this stage, the analyzed data can be accessed
by users.
