
CHAPTER TWO

Data Science

An Overview of Data Science
• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• It is the systematic study of raw data in order to make insightful observations.
• Data acquisition, data cleaning, feature engineering, modelling, and visualization are some of the major parts of this field.
• Data science is much more than simply analyzing data: it offers a range of roles and requires a range of skills.


Overview of Data Science …

• Example:
• Consider the data involved in buying a box of KitKat from a store or supermarket:
• Your data here is the planned purchase written down somewhere, for example in your notebook.
• When you get to the store, you use that piece of data to remind yourself what you need to buy, pick it up, and put it in your cart.
• At checkout, the cashier scans the barcode on your box and the cash register logs the price.
• Back in the warehouse, a computer informs the stock manager that it is time to order this item from the distributor, because your purchase took the last box in the store.
• You may have a coupon for your purchase and the cashier scans that too, giving
you a predetermined discount.
Overview of Data Science …
• Example:
• At the end of the week, a report of all the scanned
manufacturer coupons gets uploaded to the KitKat company
so they can issue a reimbursement to the grocery store for all
of the coupon discounts they have handed out to customers.
• Finally, at the end of the month, a store manager looks at a
colorful collection of pie charts showing all the different kinds
of KitKat that were sold and, on the basis of strong sales of
KitKat, decides to offer more varieties of these on the store’s
limited shelf space next month.
• So, the small piece of information in your notebook ended up in many different places
• Notably on the desk of a manager as an aid to decision making.
• The data went through many transformations.
Overview of Data Science …
• In addition to the computers where the data might have passed through or been stored for the long term, lots of other pieces of hardware, such as the barcode scanner, were involved in collecting, manipulating, transmitting, and storing the data.
• In addition, many different pieces of software were used to organize,
aggregate, visualize, and present the data.
• Finally, many different human systems were involved in working with the data.
• People decided which systems to buy and install, who should get access to
what kinds of data, and what would happen to the data after its immediate
purpose was fulfilled.
• Data science has evolved into one of the most promising and in-demand career paths.
• Professionals use advanced techniques for analyzing large volumes of
data.
Overview of Data Science …

• Skills important for data science:
• Statistics
• Linear algebra
• Programming knowledge, with a focus on data warehousing, data mining, and data modeling
What are data and information?
• Data is the representation of facts, concepts, or instructions in a formalized manner.
• It is unprocessed facts and figures.
• On its own, data has no single meaning, since it can be interpreted in multiple ways:
• What does 'CHMS' mean?
• What does '1992' mean?
• Data is the level of conceptualization.
What are data and information?
• Information is processed data on which decisions and actions are based.
• Data is processed to form information.
• Information is the level of contextualization.
• Information can answer the WH questions except 'why'.
• Information is interpreted data: it is created from organized, structured, and processed data in a particular context.
• Still, information alone is not enough for decision making; for that we go on to knowledge.
What are data and information?
• Knowledge: an appropriate collection of information.
• It is the level of patterning (creating relationships among concepts).
• It is used to answer the 'how' question.
• It is found through much experience and much information.
• It comes through understanding patterns.
• Wisdom: a collection of very deep knowledge.
• It comes through understanding principles.
[Figure: Hierarchical model of human competency]
Data VS Information…
Data vs. Information Examples Chart
• Seeing examples of data and information side-by-side in a chart can help you
better understand the differences between the two terms.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• It is the activity of converting raw facts (data) into information.
• Information is data that has been processed using data processing functions.
Data Processing Cycle

What is the ultimate purpose of storing and then analyzing/processing data?
It is to transform: Data → Information → Knowledge → Action
Data Processing Cycle
• Data processing consists of the following basic steps: input, processing, and output.
• Input − in this step, the input data is prepared in some convenient form for processing. The form will depend on the processing machine.
• Processing − in this step, the input data is changed to produce data in a more useful form.
• Output − at this stage, the result of the preceding processing step is collected.

Input → Processing → Output
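As a minimal illustration of this cycle, the following Python sketch (all data and names are hypothetical, not from the slides) takes raw figures as input, processes them into a more useful form, and outputs the result:

```python
# A minimal, hypothetical sketch of the data processing cycle: input -> processing -> output.

raw_sales = ["12.50", "7.25", "3.00", "9.75"]  # Input: raw facts, here as text strings

def process(values):
    """Processing: convert the raw figures into a more useful summary."""
    numbers = [float(v) for v in values]
    return {"total": sum(numbers), "average": sum(numbers) / len(numbers)}

summary = process(raw_sales)  # Processing step
# Output: the result of the preceding processing step is collected and presented.
print(f"Total: {summary['total']:.2f}, Average: {summary['average']:.2f}")
```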


Data types and their representation
• Data types can be described from diverse perspectives.
• In computer science and computer programming, for instance, a data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
• Almost all programming languages explicitly include the notion of data type, though different languages may use different terminology.
• Common data types include (illustrated in the sketch below):
• Integers: store whole numbers.
• Booleans: store one of two values: true or false.
• Characters: store a single character (numeric, alphabetic, symbol, …).
• Floating-point numbers: store real numbers.
• Alphanumeric strings: store a combination of characters.
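The short Python sketch below illustrates these common data types; all variable names and values are made up for illustration:

```python
# Illustrative (hypothetical) values for the common data types listed above.
quantity = 3                       # integer: stores a whole number
in_stock = True                    # boolean: stores true or false
grade = 'A'                        # character (in Python, a one-character string)
unit_price = 12.75                 # floating-point number: stores a real number
product_name = "KitKat 4-Finger"   # alphanumeric string: a combination of characters

print(type(quantity), type(in_stock), type(grade), type(unit_price), type(product_name))
```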
Data types from Data Analytics perspective
• From a data analytics point of view, it is important to understand that there are three common data types or structures:

A. Structured
• Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
• Structured data conforms to a tabular format, with relationships between the different rows and columns.
• Common examples of structured data are Excel files or SQL databases. A minimal tabular example is sketched below.
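As a rough illustration only, here is a small structured (tabular) dataset built with pandas; the column names and values are hypothetical:

```python
# Hypothetical structured data: every row follows the same pre-defined, tabular schema.
import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1001, 1002, 1003],
        "product": ["KitKat", "KitKat Chunky", "KitKat Mini"],
        "quantity": [2, 1, 5],
        "unit_price": [0.80, 1.20, 0.40],
    }
)

# Because the structure is known in advance, analysis is straightforward.
print((orders["quantity"] * orders["unit_price"]).sum())  # total revenue
```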
Data types from Data Analytics perspective
B. Semi-structured
• Semi-structured data is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other data tables.
• Common examples of semi-structured data are JSON and XML, as sketched below.
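A minimal sketch of semi-structured data using Python's json module; the records are hypothetical, and note that they do not all share the same fields:

```python
# Hypothetical semi-structured (JSON) data: records need not share a fixed schema.
import json

raw = """[
  {"order_id": 1001, "product": "KitKat", "quantity": 2},
  {"order_id": 1002, "product": "KitKat Chunky",
   "coupon": {"code": "SAVE10", "discount": 0.10}}
]"""

orders = json.loads(raw)
for order in orders:
    # Not every record has a coupon, so missing fields must be handled explicitly.
    discount = order.get("coupon", {}).get("discount", 0.0)
    print(order["order_id"], order["product"], discount)
```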

C. Unstructured
• Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner.
• Unstructured information is typically text-heavy but may also contain data such as dates, numbers, and facts.
Data types from Data Analytics perspective
Metadata: data about data
• The last category of data type is metadata.
• It provides additional information about a specific set of data.
• Example: in a set of photographs, metadata could describe when and where the photos were taken.
• The metadata then provides fields for dates and locations which, by themselves, can be considered structured data (see the sketch below).
• For this reason, metadata is frequently used by Big Data solutions for initial analysis.
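A small sketch of the photo-metadata example above; the file names, dates, and locations are invented for illustration:

```python
# Hypothetical metadata for a set of photographs: data that describes other data.
photo_metadata = [
    {"file": "IMG_0001.jpg", "taken_on": "2023-05-14", "location": "Addis Ababa"},
    {"file": "IMG_0002.jpg", "taken_on": "2023-05-15", "location": "Bahir Dar"},
]

# The date and location fields are themselves structured data, so they can be
# filtered and summarized before the photos themselves are ever opened.
for meta in photo_metadata:
    print(meta["file"], meta["taken_on"], meta["location"])
```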
Data value Chain
• The Data Value Chain is introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
Data value Chain
A. Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is put into a data warehouse or any other storage solution on which data analysis can be carried out.

B. Data Analysis
• Data analysis involves exploring, transforming, and modeling data, with the goal of highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view. A small acquisition-and-analysis sketch follows below.
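A minimal pandas sketch of the acquisition and analysis activities; the file name, column names, and filtering rules are assumptions made for illustration:

```python
# Hypothetical sketch: acquire (gather, filter, clean) and then analyze sales data.
import pandas as pd

# Data acquisition: gather the raw data and clean it before analysis.
sales = pd.read_csv("weekly_sales.csv")             # assumed input file
sales = sales.dropna(subset=["product", "amount"])  # drop incomplete records
sales = sales[sales["amount"] > 0]                  # filter out invalid amounts

# Data analysis: explore and transform to surface useful, business-relevant information.
by_product = sales.groupby("product")["amount"].sum().sort_values(ascending=False)
print(by_product.head(10))  # top-selling products
```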
Data value Chain
C. Data Curation
• It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.

D. Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.

E. Data Usage
• Data usage in business decision making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
Basic concepts of big data
• Big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
• The challenges include capture, storage, search, sharing, analysis, and visualization.
• "Large dataset" means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
• The scale of big datasets is constantly shifting and may vary significantly from organization to organization.
Characteristics of big data
• Big data is a term that describes large, hard-to-manage volumes of data, both structured and unstructured.
• It has the 4 Vs characteristics:
1. Volume: large amounts of data (in zettabytes)
2. Velocity: data is live-streaming or in motion
3. Variety: data comes in different forms from different sources
4. Veracity: can we trust the data? How accurate is it?
• Let’s look at our smartphones: nowadays a smartphone generates a lot of data in the form of texts, phone calls, emails, photos, videos, searches, and music.
• Approximately 40 exabytes (10^18 bytes) of data are generated every month by a single smartphone user; now consider how much data will be generated by 5 billion smartphones.
• That is mind-blowing; in fact, this amount of data is quite a lot for traditional computing systems to handle. This massive amount of data is what we call big data.
• Now let’s have a look at the data generated per minute on the internet:
• 2.1M snaps are shared on Snapchat,
• 3.8M search queries are made on Google,
• 1M people log in to Facebook,
• 4.5M videos are watched on YouTube, and
• 188M emails are sent.
Big Data Solutions: Clustered Computing
• Individual computers are often inadequate for handling big data at most stages.
• Clustered computing is used to better address the high storage and computational needs of big data.
• Clustered computing is a form of computing in which a group of computers (often called nodes) are connected through a LAN (local area network) so that they behave like a single machine.
• The set of computers is called a cluster.
• The resources of these computers are pooled so that they appear as one computer that is more powerful than the individual machines.
Clustered Computing
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:

I. Resource Pooling
• Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.

II. High Availability
• Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.


Clustered Computing
III. Easy Scalability
• Clusters make it easy to scale horizontally by adding additional machines to the group.
• Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
• The machines involved in the computing cluster are also typically involved with the management of a distributed storage system.


Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
• It allows users to store and process Big Data in a fault-tolerant ecosystem using programming models.
• It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
Characteristics of Hadoop
• Economical: its systems are highly economical, as ordinary computers can be used for data processing.
• Reliable: it is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
• Scalable: it is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
• Flexible: it is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.
Hadoop and its Ecosystem
• The Hadoop ecosystem comprises the following components, among many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing (a minimal sketch follows below)
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: cluster management
• Oozie: job scheduling
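As a rough illustration of the MapReduce idea (not the actual Hadoop Java API), here is a small word-count sketch in Python written in the Hadoop Streaming style, where a mapper emits (word, 1) pairs and a reducer sums them; the script and file names are hypothetical:

```python
# Hypothetical word-count sketch in the MapReduce style (Hadoop Streaming convention).
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word (pairs grouped by word)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Run locally as: cat input.txt | python wordcount.py
    # On a real cluster, Hadoop performs the shuffle and sort between the two phases.
    for word, count in reducer(mapper(sys.stdin)):
        print(f"{word}\t{count}")
```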


Hadoop and its Ecosystem
• Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
Big data life cycle with Hadoop
1. Ingesting data into the system
• The first stage of Big Data processing is to ingest data into the system.
• The data is ingested or transferred to Hadoop from various sources such as relational databases, other systems, or local files.
• Sqoop transfers data from RDBMSs to HDFS, whereas Flume transfers event data.
2. Processing the data in storage
• The second stage is processing.
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
• Spark and MapReduce perform the data processing (a minimal Spark sketch follows below).
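A minimal PySpark sketch of this processing stage; the HDFS path, column names, and application name are assumptions, not part of the original material:

```python
# Hypothetical PySpark sketch: process data already stored in HDFS (stage 2).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-processing").getOrCreate()

# Assumed HDFS path and schema; adjust to the actual cluster and dataset.
sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

top_products = (
    sales.groupBy("product")
         .agg(F.sum("amount").alias("total_amount"))
         .orderBy(F.desc("total_amount"))
)
top_products.show(10)  # summarized results, ready for the analysis and access stages

spark.stop()
```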
Big data life cycle with Hadoop
3. Computing and analyzing data
• The third stage is to analyze the data.
• Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then analyzes it.
• Hive is also based on the map-and-reduce programming model and is most suitable for structured data.
4. Visualizing the results
• The fourth stage is access, which is performed by tools such as Sqoop, Hive, Hue, and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.