
Unit 1

Basics of Data
Structure:
1.1 Introduction
1.2 Data, Information and Knowledge
1.3 Data Processing
1.4 Database Management System
1.4.1 Database Model
1.4.2 Designing a Database
1.4.3 Normalization
1.4.4 Data Types
1.5 Structured Query Language
1.6 Enterprise Databases
1.7 Data Warehouse and Data Mining
1.8 NoSQL
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This
license is available at https://creativecommons.org/licenses/by-sa/4.0/.
Objectives
After going through this unit, you will be able to:

• Understand the meaning of data, information and knowledge


• Discuss data processing concept
• Describe the role of a Database Management System and data models
• Describe the characteristics of a data warehouse and data mining
• Understand SQL and NoSQL

1.1 INTRODUCTION
Every day, organisations generate a large amount of data through their day-to-day operations, such as RFID-based employee tracking, daily transactions within the enterprise and by customers, and other routine tasks. For example, in a telecom company the data volume grows with the number of customers and their daily usage of cell phones. The stored data consists of call details such as the caller's time and location, call duration, cost per unit depending on the plan, billing, SMS, backups, historical information, etc.

The growth of data is thus extensive, and it keeps increasing at a fairly constant rate, with festive seasons being a notable exception when volumes spike. Business Intelligence (BI) is a broad category of application programs and technologies for gathering, storing, analysing, and providing access to data to help enterprise users make better business decisions.

In this unit, we are going to discuss the basics of data and database.

1.2 DATA, INFORMATION AND KNOWLEDGE


Data consists of unstructured raw facts in isolation that serve as the raw material of an information system; it is also the glue or mortar that holds the information system together.
Data can be expressed as numeric, alphabetic, alphanumeric, or special characters, images, symbols, or even voice. The computer understands these various forms of data as strings of “0”s and “1”s. Data can be quantitative or qualitative. Quantitative data is numeric, the result of a measurement, count, or some other mathematical calculation. Qualitative data is descriptive.

Information is data that has been given structure. Knowledge is information that has been given
meaning. Data, information and knowledge are interwoven and interrelated in complex ways. The
three entities influence each other and the value of each of them depends on the purpose of use.
Both data and information require knowledge in order to be interpreted. At the same time, data and
information are useful building blocks for constructing new knowledge. Knowledge, information and data work together as individual, yet critically interdependent, entities to provide a greater whole that can be analysed by the human mind.

In today’s global market, there is an increasing appreciation of the value of shared information
and knowledge – and of the technologies that make this happen. The organisation that ignores
the value of this capital does so at its own peril! Productivity and profits are inexorably linked to
the expertise of the workforce – and in particular to the learning and sharing of this expertise.

1.3 DATA PROCESSING


Data processing involves a number of operations, similar to those in a manufacturing unit, to convert the basic raw material − data − into a finished product, i.e., information. Data processing also involves carrying out a number of data operations, as summarised in Table 1.1.

Table 1.1 Data operations

Distributed data processing is a computer-networking method in which multiple computers across


different locations share computer-processing capability. This is in contrast to a single, centralised
server managing and providing processing capability to all connected systems. Computers that
comprise the distributed data processing network are located at different locations but
interconnected by means of wireless or satellite links.

The distributed data processing system provides the advantages of increased system availability and quicker system response time. System availability is increased because when a CPU malfunctions or undergoes preventive maintenance, its work may be transferred to another CPU in the system. The response time is improved because the workload can be distributed evenly among the CPUs to ensure optimum utilisation.
Check your Progress 1

State True or False.

1. Data is information that has been given meaning.


2. The distributed data processing system provides the advantages of increased system
availability and quicker systems response time.

1.4 DATABASE MANAGEMENT SYSTEM


A database is a collection of related information stored so that it is available to many users for different purposes. A Database Management System (DBMS) is a collection of interrelated data and
a set of programs to access that data. The primary goal of a DBMS is to provide a way to store and
retrieve information that is both convenient and efficient.

Most database management systems have the following facilities/capabilities:

a) Creation, addition, deletion, and modification of data.


b) Creation, addition, deletion, and modification of files.
c) Collective or selective retrieval of data.
d) Sorting or indexing of data at the user’s discretion and direction.
e) Various reports can be produced from the system. These may either be standardised or
specifically generated according to specific user definitions.
f) Mathematical functions can be performed and the data stored in the database can be
manipulated with these functions to perform the desired calculations.
g) Maintenance of data integrity and database use.

Characteristics of data in DBMS:

1. Data sharing: Data should be shared amongst different users and applications.
2. Data independence: Changes made in the schema at one level should not affect other levels.
3. Controlled redundancy: Data is not duplicated; any duplication is deliberate and controlled.
4. Validity/Integrity/Correctness: Data entered should be correct with respect to the real-world entities that it represents.
5. Security: Data should be protected from unauthorised users.

1.4.1 Database Model


The database models are used for keeping track of entities, attributes, and relationships. Some of
these database models are discussed below:

1. Hierarchical Database Model (HDBM): The Hierarchical Database Model is one of the earliest database models, dating from when computer applications focused on processing huge volumes of data, such as sales order processing, check processing, inventory updating, etc.

This Model follows a structured organisational form: it represents data in a pyramidal or tree-like structure. Each record appears like an organisational chart, with one top-level segment, called the root, spreading downwards into branches and leaves, as illustrated in Fig. 1.1.
Fig. 1.1 Hierarchical database model
Under this Model, data is organised into records. Within each record, data elements are organised into pieces of the record called segments. An upper segment is connected logically to a lower segment in a parent-child relationship. A parent segment can have more than one child, but a child can have only one parent, indicating a one-to-many relationship.

The Hierarchical Model is, thus, highly structured and requires a controlled, finite and rule-based
approach, where record and its segments are connected to each other in one-to-many parent-child
relationships.

The most common hierarchical DBMS has been the Information Management System (IMS), released by IBM in 1968.

2. Network Database Model (NDBM): The NDBM is a variation of the earlier Hierarchical Database Model. The Network Model represents data logically as many-to-many relationships. To put it more succinctly, just as “parents can have multiple children”, a “child” too can have more than one “parent”. The many-to-many relationship under this Model is illustrated in Fig. 1.2.

Fig. 1.2 Network Database Model


It would be observed that the data regarding the salesperson could be used for:
i. Understanding/analysing “Sales Zone” performance.
ii. Analysing the sales/recovery position.
iii. Analysing product-wise sales performance.
3. Relational Database Model (RDBM): The Relational Database Model is the most recent of the
three database models and was proposed by Dr. E. F. Codd in 1970. The Model represents all data in
the database as simple two-dimensional tables called “Relations”. The table has rows and columns,
the rows representing individual records and the columns representing attributes of each record.
Although the tables appear to be similar to flat files, the information in more than one file can be
easily extracted and combined to suit the user’s specific requirements, thereby providing ad-hoc
request flexibility/facility. The key is the separation of the data on logical and physical levels, which is
made possible by the use of sophisticated mathematical algorithms and notations, which are used in
the relational model. Popular examples of relational databases are Microsoft Access, MySQL, and
Oracle.

Let us see an example for better understanding. The Customer table below has rows (records) and columns (attributes):

Customer-id   Customer-name   Street      City
1             Anthony         M.G. Road   Pune
2             Mary            Street 2    Mumbai
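
To make the relational model concrete, the minimal sketch below creates this Customer relation and runs an ad-hoc query against it. It uses Python's built-in sqlite3 module purely for illustration; a production system (Microsoft Access, MySQL, Oracle, etc.) would use its own client tools, and the column names are simply our rendering of the table above.

    import sqlite3

    # In-memory database purely for illustration.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Each row is a record; each column is an attribute of that record.
    cur.execute("""
        CREATE TABLE Customer (
            customer_id   INTEGER PRIMARY KEY,
            customer_name TEXT,
            street        TEXT,
            city          TEXT
        )
    """)
    cur.executemany(
        "INSERT INTO Customer VALUES (?, ?, ?, ?)",
        [(1, "Anthony", "M.G. Road", "Pune"),
         (2, "Mary", "Street 2", "Mumbai")],
    )

    # An ad-hoc request: all customers located in Pune.
    cur.execute("SELECT customer_name, city FROM Customer WHERE city = ?", ("Pune",))
    print(cur.fetchall())   # [('Anthony', 'Pune')]
    conn.close()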

4. Other Data Models: The other data models include the object-oriented data model that is used
widely after the relational model. It includes the concept of encapsulation, methods and object
identity.

Historically, the other two data models are the network data model and the hierarchical data model, which preceded the relational data model. In practice today, however, it is the relational data model that is widely used across almost all application implementations.

1.4.2 Designing a Database

Suppose a university wants to create an information system to track participation in student clubs.
After interviewing several people, the design team learns that the goal of implementing the system is
to give better insight into how the university funds clubs. This will be accomplished by tracking how
many members each club has and how active the clubs are. From this, the team decides that the
system must keep track of the clubs, their members, and their events. Using this information, the
design team determines that the following tables need to be created:

• Clubs: this will track the club name, the club president, and a short description of the club.
• Students: student name, e-mail, and year of birth.
• Memberships: this table will correlate students with clubs, allowing us to have any given student join multiple clubs.
• Events: this table will track when the clubs meet and how many students showed up.

Now that the design team has determined which tables to create, they need to define the specific
information that each table will hold. This requires identifying the fields that will be in each table. For
example, Club Name would be one of the fields in the Clubs table. First Name and Last Name would
be fields in the Students table. Finally, since this will be a relational database, every table should have
a field in common with at least one other table (in other words: they should have a relationship with
each other). In order to properly create this relationship, a primary key must be selected for each
table. This key is a unique identifier for each record in the table. For example, in the Students table, it
might be possible to use students’ last name as a way to uniquely identify them. However, it is more
than likely that some students will share a last name (like Rodriguez, Smith, or Lee), so a different
field
should be selected. A student’s e-mail address might be a good choice for a primary key, since e-mail
addresses are unique. However, a primary key cannot change, so this would mean that if students
changed their e-mail address we would have to remove them from the database and then re-insert
them – not an attractive proposition. Our solution is to create a value for each student — a user ID —
that will act as a primary key. We will also do this for each of the student clubs. This solution is quite
common and is the reason you have so many user IDs!

Fig. 1.3 Student clubs database diagram


With this design, not only do we have a way to organise all of the information we need to meet the
requirements, but we have also successfully related all the tables together. Here’s what the database
tables might look like with some sample data. Note that the Memberships table has the sole purpose of allowing us to relate multiple students to multiple clubs.
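
A hedged sketch of this design is shown below using Python's sqlite3 module; the exact column names and sample rows are illustrative choices of ours, not the original figure, but the four tables, the surrogate "user ID" primary keys, and the role of the Memberships table follow the description above.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")

    # Surrogate keys (club_id, student_id) play the role of the "user IDs"
    # discussed above; Memberships relates students to clubs.
    conn.executescript("""
        CREATE TABLE Clubs (
            club_id     INTEGER PRIMARY KEY,
            club_name   TEXT,
            president   TEXT,
            description TEXT
        );
        CREATE TABLE Students (
            student_id INTEGER PRIMARY KEY,
            first_name TEXT,
            last_name  TEXT,
            email      TEXT,
            birth_year INTEGER
        );
        CREATE TABLE Memberships (
            club_id    INTEGER REFERENCES Clubs(club_id),
            student_id INTEGER REFERENCES Students(student_id),
            PRIMARY KEY (club_id, student_id)
        );
        CREATE TABLE Events (
            event_id   INTEGER PRIMARY KEY,
            club_id    INTEGER REFERENCES Clubs(club_id),
            event_date TEXT,
            attendance INTEGER
        );
    """)

    # A little sample data, then a query that relates the tables.
    conn.execute("INSERT INTO Clubs VALUES (1, 'Chess Club', 'Priya', 'Weekly chess meet-ups')")
    conn.execute("INSERT INTO Students VALUES (10, 'Anthony', 'D', 'anthony@example.edu', 2001)")
    conn.execute("INSERT INTO Memberships VALUES (1, 10)")

    # How many members does each club have?
    rows = conn.execute("""
        SELECT c.club_name, COUNT(m.student_id) AS members
        FROM Clubs c LEFT JOIN Memberships m ON c.club_id = m.club_id
        GROUP BY c.club_name
    """).fetchall()
    print(rows)   # [('Chess Club', 1)]
    conn.close()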
1.4.3 Normalization
When designing a database, one important concept to understand is normalization. In simple terms,
to normalize a database means to design it in a way that:

1) reduces duplication of data between tables and


2) gives the table as much flexibility as possible.

Edgar F. Codd originally established three normal forms: 1NF, 2NF and 3NF. There are now others that
are generally accepted, but 3NF is widely considered to be sufficient for many practical applications.
Most tables when reaching 3NF are also in BCNF. 4NF and 5NF are further extensions, and 6NF only
applies to temporal databases.

In 1NF, we need to ensure that the values in each column of a table are atomic. To qualify for second
normal form, a relation must be in first normal form (1NF) and not have any non-prime attribute that
is dependent on any proper subset of any candidate key of the relation. A non-prime attribute of a
relation is an attribute that is not a part of any candidate key of the relation. In order to conform to
3NF, the table should not contain transitive dependencies.

A relational schema R is in Boyce–Codd normal form (BCNF) if and only if, for every one of its dependencies X → Y, at least one of the following conditions holds:

∙ X → Y is a trivial functional dependency (Y ⊆ X)


∙ X is a superkey for schema R

A relation is in 4NF if and only if, for every one of its non-trivial multivalued dependencies X→ Y, X is
a superkey—that is, X is either a candidate key or a superset thereof.
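
As a small illustration of a transitive dependency (the table and column names here are invented for this sketch and are not part of the Student Clubs design): if one table stored each student's advisor name together with that advisor's office, the office would depend on the advisor rather than directly on the key, so the table would not be in 3NF, and the office would be repeated for every advisee. Splitting the table removes the dependency and the duplication.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Not in 3NF: student_id -> advisor_name -> advisor_office is a
    # transitive dependency, so each advisor's office is repeated for
    # every one of their advisees.
    conn.execute("""
        CREATE TABLE StudentAdvisorFlat (
            student_id     INTEGER PRIMARY KEY,
            advisor_name   TEXT,
            advisor_office TEXT
        )
    """)

    # 3NF decomposition: the advisor's office is stored exactly once.
    conn.executescript("""
        CREATE TABLE Advisors (
            advisor_id     INTEGER PRIMARY KEY,
            advisor_name   TEXT,
            advisor_office TEXT
        );
        CREATE TABLE StudentAdvisors (
            student_id INTEGER PRIMARY KEY,
            advisor_id INTEGER REFERENCES Advisors(advisor_id)
        );
    """)
    conn.close()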

In the Student Clubs database design, the design team worked to achieve these objectives. For
example, to track memberships, a simple solution might have been to create a Members field in the
Clubs table and then just list the names of all of the members there. However, this design would
mean
that if a student joined two clubs, then his or her information would have to be entered a second
time. Instead, the designers solved this problem by using two tables: Students and Memberships.
In this design, when a student joins their first club, we first must add the student to the Students
table, where their first name, last name, e-mail address, and birth year are entered. This addition to
the Students table will generate a student ID. Now we will add a new entry to denote that the
student is a member of a specific club. This is accomplished by adding a record with the student ID
and the club ID in the Memberships table. If this student joins a second club, we do not have to
duplicate the entry of the student’s name, e-mail, and birth year; instead, we only need to make
another entry in the Memberships table of the second club’s ID and the student’s ID.

The design of the Student Clubs database also makes it simple to change the design without major
modifications to the existing structure. For example, if the design team were asked to add
functionality to the system to track faculty advisors to the clubs, we could easily accomplish this by
adding a Faculty Advisors table (similar to the Students table) and then adding a new field to the
Clubs table to hold the Faculty Advisor ID.
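
The flexibility claim above can be illustrated with a small, hedged sketch: the FacultyAdvisors table and its column names are invented for illustration, and a minimal stand-in for the Clubs table is re-created so the snippet runs on its own.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Minimal stand-in for the existing Clubs table, so the snippet is self-contained.
    conn.execute("CREATE TABLE Clubs (club_id INTEGER PRIMARY KEY, club_name TEXT)")

    # New table for advisors, similar in shape to the Students table.
    conn.execute("""
        CREATE TABLE FacultyAdvisors (
            advisor_id INTEGER PRIMARY KEY,
            first_name TEXT,
            last_name  TEXT,
            email      TEXT
        )
    """)

    # The existing Clubs table gains a field that points at the new table;
    # no other part of the design has to change.
    conn.execute(
        "ALTER TABLE Clubs ADD COLUMN advisor_id INTEGER REFERENCES FacultyAdvisors(advisor_id)"
    )
    conn.close()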

1.4.4 Data Types

When defining the fields in a database table, we must give each field a data type. For example, the
field Birth Year is a year, so it will be a number, while First Name will be text. Most modern databases
allow for several different data types to be stored. Some of the more common data types are listed
here:

• Text: for storing non-numeric data that is brief, generally under 256 characters. The database
designer can identify the maximum length of the text.
• Number: for storing numbers. There are usually a few different number types that can be
selected, depending on how large the largest number will be.
• Yes/No: a special form of the number data type that is (usually) one byte long, with a 0 for
“No” or “False” and a 1 for “Yes” or “True”.
• Date/Time: a special form of the number data type that can be interpreted as a date or a time.
• Currency: a special form of the number data type that formats all values with a currency
indicator and two decimal places.
• Paragraph Text: this data type allows for text longer than 256 characters.
• Object: this data type allows for the storage of data that cannot be entered via keyboard, such as an image or a music file.

There are two important reasons that we must properly define the data type of a field. First, a
data type tells the database what functions can be performed with the data. For example, if we
wish to perform mathematical functions with one of the fields, we must be sure to tell the
database that the field is a number data type. So if we have, say, a field storing birth year, we can
subtract the number stored in that field from the current year to get age.

The second important reason to define data type is so that the proper amount of storage space is
allocated for our data. For example, if the First Name field is defined as a text (50) data type, this
means fifty characters are allocated for each first name we want to store. However, even if the
first name is only five characters long, fifty characters (bytes) will be allocated. While this may not
seem like a big deal, if our table ends up holding 50,000 names, we are allocating 50 * 50,000 =
2,500,000 bytes for storage of these values. It may be prudent to reduce the size of the field so
we do not waste storage space.
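
A minimal sketch of the two points above, assuming (as the text does) one byte per character for a fixed-width text(50) field; the birth year value is just an example:

    from datetime import date

    # A "Birth Year" field stored as a number lets the database do arithmetic:
    birth_year = 1995
    age = date.today().year - birth_year

    # A fixed-width text(50) field allocates the full width for every row,
    # assuming one byte per character as in the discussion above.
    field_width_bytes = 50
    rows = 50_000
    allocated = field_width_bytes * rows   # 2,500,000 bytes
    print(age, allocated)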
1.5 STRUCTURED QUERY LANGUAGE
The primary way to work with a relational database is to use Structured Query Language, SQL
(pronounced “sequel,” or simply stated as S-Q-L). Almost all applications that work with databases
(such as database management systems, discussed below) make use of SQL as a way to analyse and
manipulate relational data. As its name implies, SQL is a language that can be used to work with a
relational database. From a simple request for data to a complex update operation, SQL is a mainstay
of programmers and database administrators.

The basic structure of an SQL statement consists of 3 clauses.

a) The select clause corresponds to the projection operation of the relational algebra. It is used
to list the attributes desired in the result of a query.
b) The from clause corresponds to the cartesian product operation of the relational algebra. It
lists the relations to be scanned in the evaluation of the expression.
c) The where clause corresponds to the selection predicate of the relational algebra.

A typical SQL query has the following form:

Select A1, A2, A3, … From r1, r2, r3, … Where condition;

To retrieve every column of a table: Select * from tablename;

E.g.:
a) Select * from employee;
b) Select empno, empname from employee where salary > 20000;

Data Definition Language (DDL) deals with database schemas and descriptions of how the data should reside in the database; therefore, statements like CREATE TABLE or ALTER TABLE belong to DDL. Data Manipulation Language (DML) deals with data manipulation and therefore includes the most common SQL statements such as SELECT, INSERT, etc. Data Control Language (DCL) includes commands such as GRANT and is mostly concerned with rights, permissions and other controls of the database system.
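
A hedged sketch of the three categories, run through Python's built-in sqlite3 module; the employee table mirrors the example above. GRANT is shown only as a string because SQLite has no user accounts; on a server database such as MySQL or PostgreSQL it would be executed like any other statement.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # DDL: define the schema.
    conn.execute("""
        CREATE TABLE employee (
            empno   INTEGER PRIMARY KEY,
            empname TEXT,
            salary  REAL
        )
    """)

    # DML: manipulate the data.
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                     [(1, "Asha", 25000), (2, "Ravi", 18000)])
    rows = conn.execute(
        "SELECT empno, empname FROM employee WHERE salary > 20000"
    ).fetchall()
    print(rows)   # [(1, 'Asha')]

    # DCL: statements such as GRANT control rights and permissions.
    # Shown here only as text, since SQLite does not implement user privileges.
    dcl_example = "GRANT SELECT ON employee TO reporting_user;"
    conn.close()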

1.6 ENTERPRISE DATABASES


A database that can only be used by a single user at a time is not going to meet the needs of most
organizations. As computers have become networked and are now joined worldwide via the Internet,
a class of database has emerged that can be accessed by two, ten, or even a million people. These
databases are sometimes installed on a single computer to be accessed by a group of people at a
single location. Other times, they are installed over several servers worldwide, meant to be accessed
by millions. These relational enterprise database packages are built and supported by companies
such as Oracle, Microsoft, and IBM.

The open-source MySQL is also an enterprise database. However, the relational database model does not scale well when data must be spread across a large number of servers. The term scale here refers to a database getting larger and larger and being distributed on a larger number of computers connected via a network. Some companies are looking
to provide large-scale database solutions by moving away from the relational model to other, more
flexible models. For example, Google now offers the App Engine Datastore which is based on NoSQL.
Developers can use the App Engine Datastore to develop applications that access data from
anywhere in the world. Amazon.com offers several database services for enterprise use, including
Amazon RDS, which is a relational database service, and Amazon DynamoDB, a NoSQL enterprise
solution.
1.7 DATA WAREHOUSE AND DATA MINING
As organisations have begun to utilize databases as the centrepiece of their operations, the need to fully understand and leverage the data they are collecting has become more and more apparent. However, directly analysing the data that is needed for day-to-day operations is not a good idea; we
do not want to tax the operations of the company more than we need to. Further, organisations also
want to analyse data in a historical sense: How does the data we have today compare with the same
set of data this time last month, or last year? From these needs arose the concept of the data
warehouse.

The concept of the data warehouse is simple: extract data from one or more of the organisation’s
databases and load it into the data warehouse (which is itself another database) for storage and
analysis. However, the execution of this concept is not that simple. A data warehouse should be
designed so that it meets the following criteria:

• It uses non-operational data. This means that the data warehouse is using a copy of data from the active databases that the company uses in its day-to-day operations, so the data warehouse must pull data from the existing databases on a regular, scheduled basis.
• The data is time-variant. This means that whenever data is loaded into the data warehouse, it receives a time stamp, which allows for comparisons between different time periods.
• The data is standardised. Because the data in a data warehouse usually comes from several different sources, it is possible that the data does not use the same definitions or units. For example, our Events table in our Student Clubs database lists the event dates using the mm/dd/yyyy format (e.g., 01/10/2013). A table in another database might use the format yy/mm/dd (e.g., 13/01/10) for dates. In order for the data warehouse to match up dates, a standard date format would have to be agreed upon and all data loaded into the data warehouse would have to be converted to use this standard format. This process is called extraction-transformation-load (ETL); a minimal sketch of the transformation step follows this list.
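
The sketch below, using Python's standard datetime module and a function name of our choosing, converts the yy/mm/dd dates from the example above into the agreed mm/dd/yyyy standard:

    from datetime import datetime

    def standardise_date(value: str, source_format: str) -> str:
        """Convert a date string from a source system's format to the
        warehouse's agreed standard format (here, mm/dd/yyyy)."""
        return datetime.strptime(value, source_format).strftime("%m/%d/%Y")

    # Student Clubs events already use mm/dd/yyyy; the other system uses yy/mm/dd.
    print(standardise_date("01/10/2013", "%m/%d/%Y"))  # 01/10/2013
    print(standardise_date("13/01/10", "%y/%m/%d"))    # 01/10/2013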

There are two primary schools of thought when designing a data warehouse: bottom-up and top-down. The bottom-up approach starts by creating small data warehouses, called data marts, to solve
specific business problems. As these data marts are created, they can be combined into a larger data
warehouse. The top-down approach suggests that we should start by creating an enterprise-wide
data warehouse and then, as specific business needs are identified, create smaller data marts from
the data warehouse.
Fig. 1.4 Data warehouse process

Benefits of Data Warehouses:

Organisations find data warehouses quite beneficial for a number of reasons:

• The process of developing a data warehouse forces an organisation to better understand the data that it is currently collecting and, equally important, what data is not being collected.
• A data warehouse provides a centralized view of all data being collected across the enterprise and provides a means for determining data that is inconsistent.
• Once all data is identified as consistent, an organisation can generate one version of the truth.
This is important when the company wants to report consistent statistics about itself, such
as revenue or number of employees.
• By having a data warehouse, snapshots of data can be taken over time. This creates a historical
record of data, which allows for an analysis of trends.
• A data warehouse provides tools to combine data, which can provide new information and
analysis.

Data Mining

Data mining is the process of analysing data to find previously unknown trends, patterns, and
associations in order to make decisions. Generally, data mining is accomplished through automated
means against extremely large data sets, such as a data warehouse. Some examples of data mining
include:

• An analysis of sales from a large grocery chain might determine that milk is purchased more
frequently the day after it rains in cities with a population of less than 50,000.

• A bank may find that loan applicants whose bank accounts show particular deposit and withdrawal
patterns are not good credit risks.

• A baseball team may find that collegiate baseball players with specific statistics in hitting, pitching,
and fielding make for more successful major league players.

In some cases, a data-mining project is begun with a hypothetical result in mind. For example, a
grocery chain may already have some idea that buying patterns change after it rains and want to get
a deeper understanding of exactly what is happening. In other cases, there are no presuppositions
and a data-mining program is run against large data sets in order to find patterns and associations.

1.8 NOSQL
Perhaps the most interesting new development is the concept of NoSQL (from the phrase “not only
SQL”). NoSQL arose from the need to solve the problem of large-scale databases spread over several
servers or even across the world. For a relational database to work properly, it is important that only
one person be able to manipulate a piece of data at a time, a concept known as record-locking. But
with today’s large-scale databases (think Google and Amazon), this is just not possible. A NoSQL
database can work with data in a looser way, allowing for a more unstructured environment,
communicating changes to the data over time to all the servers that are part of the database.

Check your Progress 2


Fill in the Blanks.

1. The _____ data model is widely used across all application implementations.
2. ______ is used to reduce the redundancy of the data.
3. _______ deals with database schemas and descriptions of how the data should reside in
the database.
4. The term _____ refers to a database getting larger and larger, being distributed on a larger
number of computers connected via a network.

Activity 1
1. Browse the internet and try to find out and list the different sources of data.
2. Do some research and find two examples of data mining. Summarize each example and then write about what the two examples have in common.

Summary
∙ In this unit, we learned about the role that data and databases play in the context of
information systems. Data is made up of small facts and information without context. If we
give data context, then we have information. Knowledge is gained when information is
consumed and used for decision making. A database is an organized collection of related
information. Relational databases are the most widely used type of database, where data is
structured into tables and all tables must be related to each other through unique identifiers.
A database management system (DBMS) is a software application that is used to create and
manage databases, and can take the form of a personal DBMS, used by one person, or an
enterprise DBMS that can be used by multiple users. A data warehouse is a special form of
database that takes data from other databases in an enterprise and organizes it for analysis.
Data mining is the process of looking for patterns and relationships in large data sets. Many
businesses use databases, data warehouses, and data mining techniques in order to produce
business intelligence and gain a competitive advantage.

Keywords
• Distributed: Give a share or a unit of (something) to each of a number of recipients.
• Indexing: It is a data structure technique to efficiently retrieve records from the database files based on some attributes.
• Transitive Dependency: A functional dependency is said to be transitive if it is indirectly
formed by two functional dependencies.
• Analysis: The process of breaking a complex topic or substance into smaller parts in order to
gain a better understanding of it.
• Hypothesis: A supposition or proposed explanation made on the basis of limited evidence as a
starting point for further investigation.

Self-Assessment Questions
1. What is the difference between data, information, and knowledge?
2. Describe Normalisation.
3. What is a Data Warehouse?
4. State the role of Data Mining in the various organisations.
5. What is SQL? Explain the syntax of “Select” query.

Answers to Check your Progress


Check your Progress 1

State True or False.

1. False
2. True

Check your Progress 2

Fill in the Blanks.

1. The relational data model is widely used across all application implementations.
2. Normalisation is used to reduce the redundancy of the data.
3. Data Definition Language deals with database schemas and descriptions of how the data
should reside in the database.
4. The term scale refers to a database getting larger and larger, being distributed on a larger
number of computers connected via a network.

Suggested Reading
1. Information Systems for Business and Beyond, David T. Bourgeois, Ph.D. Saylor URL:
http://www.saylor.org/courses/bus206
2. Silberschatz, Avi, Henry F. Korth and S. Sudarshan. Database System Concepts. Mc-Graw-Hill.
3. Date, C.J. An Introduction to Database Systems. Addison-Wesley.
Unit 2
Basics of Data Science
Structure:

2.1 Introduction

2.2 Why Now, Why not earlier?

2.3 What is Data Science?

2.4 Data Science Composition

2.5 Applications & Case Studies

2.6 Core of Data Science – Machine Learning Algorithms

2.7 Types of Learning

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading

Published by Symbiosis Centre for Distance Learning (SCDL), Pune

2019

Copyright © 2019 Symbiosis Open Education Society

All rights reserved. No part of this unit may be reproduced, transmitted or utilised in any form or by any means, electronic or mechanical,
including photocopying, recording or by any information storage or retrieval system without written permission from the publisher.

Acknowledgement

Every attempt has been made to trace the copyright holders of materials reproduced in this unit. Should any infringement have occurred,
SCDL apologies for the same and will be pleased to make necessary corrections in future editions of this unit.
Objectives

After going through this unit, you will be able to:

∙ Understand what Data Science means


∙ Understand why we are studying it now
∙ Describe the applications of Data Science and their uses

2.1 INTRODUCTION
In a world where, with a single tap, we send friend requests and buy products on retail websites, we are knowingly or unknowingly creating data at every click. In doing so, we enable the companies collecting this data to give us recommendations that are personalised in nature.

All of this collected data makes a huge impact on how we see our recommendations. Along with this, nowadays people do not own just one device: data is collected through multiple phones, tablets and laptops. For example, if you search for a Data Science course online, you will see ads for Data Science courses on Facebook, YouTube, etc. This is where huge computers with a lot of processing power, data, and data science come into play.

In this unit, we are going to discuss the meaning of Data Science, why we are studying it now rather than earlier, what exactly data science involves, and some of the applications of Data Science.

2.2 WHY NOW, WHY NOT EARLIER?


A question rarely asked in the world of Data Science is: why are we studying this as a subject now, and why not earlier? The answer lies in the cost of collecting data. Data Science, as you know, is associated with data, but in the early 90s the cost of collecting data was enormous. As a result, most companies did not store data. Even if they did, most of the stored data was in the form of paper trails which were difficult to work with. As a result, even if somebody thought of analysing the data, it was very difficult to go back and retrieve records.

But by the late 2000s, the cost of storing data had come down dramatically. As a result, it was easier and cheaper to store data digitally, which in turn made it much easier to analyse the data, pull it up when required, transfer it whenever required, and so on. In the early 90s, storing data was associated with floppy disks and big hard drives that had to be transported as heavy equipment. That scenario has now changed completely. We now have more than 128 GB of space in our mobile phones, pocket-size hard disks exceeding 2 TB, and USB drives storing more than 64 GB of data. That is the reason why we can transform, analyse and visualise data in an easier manner.

Data storage was costly during the 90s, but large-scale computation on the data was also not possible. The computers of that era were not capable of processing huge amounts of data. So as data storage got cheaper, new advanced computers with far greater processing power came into existence. Not only could they store large amounts of data, but they could also process it quickly. This in turn led to the discovery of more powerful algorithms, and as a result Data Science came into existence.

2.3 WHAT IS DATA SCIENCE?

Data Science can be defined as an area concerned with collection, preparation, analysis, visualisation,
management, modelling and inference building from the large amounts of information available. Although the name seems connected to databases and management, it is mostly related to business problem
solving with the help of statistics and programming. Data Science cannot be defined as one skill or in
a way just analysis. It is a complete branch in which you have to prepare, analyse and transform data
so as to have an actionable insight on the business problem you are trying to solve.

There are a lot of myths about Data Science. Some hold that data scientists are statisticians in white lab coats with PhDs blinking at a computer screen, or coders trying to automate a process using some algorithm. Nothing could be further from the truth. The fact is that Data Scientists can be any of those as well as none of these. Sometimes you only need Microsoft Excel and logical analysis to solve the most complex problems, whereas at other times you might require a very powerful cloud cluster to get an answer. Another misconception about Data Science is that it is just analysis. In fact, Data Science goes well beyond just analysing the data: it is the process of extracting meaningful insights on which some action can be taken.

The people who work with Data Science problems day in and day out are called Data Scientists. The job profile of a Data Scientist can vary: it can range from collecting and storing data to building complex mathematical prediction models that need to be deployed into a production environment. A Data Scientist's role can also span automating dashboards and building products and applications which work flawlessly. Thus, being a Data Scientist requires command over ETL (Extract, Transform & Load), BI (Business Intelligence) tools for visualisation, and coding in languages like R, Python, Matlab and SAS, along with knowledge of big data tools like Hadoop, Apache Spark, etc.

Some of the skills required for Data Science are:

1) Knowing the business & application domain

A data scientist must know the problem context in terms of application as well as business
domain. He/she should know how solving the problem will impact the business side of
things.

2) Communication & looking at holistic view

A data scientist should be able to communicate the insights & results along with actions
required to the business user clearly. They should also have an understanding of the holistic
view of the domain and the impact of recommending this change on the business.

3) Data Transformation, Analysis & Visualisation

A data scientist should be able to extract, load, and transform data from any source into a usable format. He/she should be able to work with structured and unstructured data so as to find useful insights from it. A data scientist should be able to analyse the data, understand what each and every variable means, and visualise it so as to take in a snapshot view.

4) Modelling & Presentation

A data scientist should be able to build statistical algorithms on the data at hand. He/she
should also be able to present the results in a way which the business user can understand
and take actions on.

Thus, Data Science is a field which draws on a plethora of disciplines, involves a lot of languages to be learnt, and still has a lot of questions to be answered. These questions are generally answered by Data Scientists.

Some of the skills and work ethics required by Data Scientists are:

a) Conduct undirected research and frame open-ended industry questions


b) Extract huge volumes of data from multiple internal and external sources

c) Employ sophisticated analytics programs, machine learning and statistical methods to


prepare data for use in predictive and prescriptive modelling
d) Thoroughly clean and prune data to discard irrelevant information

e) Explore and examine data from a variety of angles to determine hidden weaknesses,
trends and/or opportunities

f) Form data-driven solutions to the most pressing challenges

g) Invent new algorithms to solve problems and build new tools to automate work

h) Communicate predictions and findings to management and IT departments through


effective data visualisations and reports

i) Recommend cost-effective changes to existing procedures and strategies

2.4 DATA SCIENCE COMPOSITION

Data Science mainly comprises four major components:

1) Domain Knowledge Expertise


2) Software Programming
3) Statistics
4) Communication

Figure Source - http://rpubs.com/ptherond/DSc_Venn

∙ Domain Knowledge Expertise

Data science is all about solving the business problem. The most imperative step is to
understand the impact of a problem on a business. While solving a problem one needs to
understand the intricacies of a particular business in a domain.

For example, the variables in a financial-domain problem will have a different meaning from those in a retail-domain problem. Another difference between domains is the domino effect (changing one thing leading to a change in ‘N’ things that follow it): a change in the finance domain typically triggers a greater domino effect than one in the retail domain. Thus, domain expertise is a very integral part of understanding data science.

∙ Statistics

The most integral part of Data Science is Statistics. When we say statistics, it does not purely mean mathematics. Statistics deals with a host of things, starting from the mean and extending to complex integral and differential calculus problems. The modelling of data, the predictions, the recommendations and every other actionable step in the world of data science depend on statistics. But this does not mean a data scientist needs to know the subject as deeply as a statistician. A Data Scientist should be well versed in the basic concepts and algorithms and should be a quick learner, so as to understand the working of any modelling/algorithmic technique.

∙ Software Programming

Software Programming is another main aspect of the Data Science discipline. Even if you understand the problem and know which algorithm to apply, you need to know programming concepts in order to execute it. Most of the coding in the world of Data Science is done in R and Python; other languages such as Scala, Matlab and SAS are also used. One of the upcoming languages that is changing the field of Data Science is ‘Julia’. If one gets a grip on writing good code in any of these languages, it is fairly easy to learn any new language, as only the syntax changes. In addition, because R and Python are open-source languages, Data Scientists can get a lot of queries answered by the community.

∙ Communication Skills

Out of all the branches of Data Science, communication is one of the most important. It is very important for a Data Scientist to explain the action he/she sees fit for the problem at hand. Not only this, a data scientist's tasks also involve explaining a statistical model to a layperson who has little or no understanding of how algorithms work. This also needs to be coupled with the presentation of a workable solution. A data scientist should be able to express his/her approach and ideas for a problem in the least complex manner, and also explain why that approach was chosen or rejected based on the data at hand. It is often hard for Data Scientists to explain the processes they have followed in solving a problem, and good communication skills can go a long way in making the conversation much easier with the business.
2.5 APPLICATIONS & CASE STUDIES

As the Harvard Business Review put it, Data Scientist is the sexiest job of the 21st century. As discussed in the previous sections of this unit, we now know that Data Science has stemmed from multiple disciplines. Data has become the new oil, and as a result Data Science has many applications. Some of the domains where Data Science has a wide variety of applications are as follows:

1. Banking
2. Retail & E-Commerce
3. Finance
4. Healthcare
Other domains where Data Science is widely used are Transportation, Manufacturing, Telecom and IT.

There are multiple applications of Data Science in each of the sectors mentioned above. Let us walk through one example from each sector.

1) Banking

Banking is one of the biggest sectors having Data Science applications. Coupled with Big Data
Technologies, it has enabled banks to avoid frauds, manage resources efficiently, personalize
customer recommendations & provide real time analytics. One of the biggest applications of
Data Science across banks is the Risk Score calculation which can be used to avoid frauds. In
this application, an algorithm generates a risk score for each and every individual who has
taken a loan, just made a transaction or has defaulted on a bill. The algorithm analyses the
past history of the individual, their borrowing & lending patters, investment amount, growth
rate etc. After taking into picture all these factors, the banks are able to get a risk score
related to a particular customer. This helps them mitigate and prevent frauds by taking
corrective actions in the respective directions.

2) Retail & E-Commerce

From Walmart to D-mart, each and every retail shop is engulfed in the world of Data Science.
Introduction of Data Science has revolutionised the retail & commerce industry.
Transportation, logistics, and forecasting are some of the areas where Data Science
algorithms have made a huge impact on the industry. Thanks to Data Science, retail and e-commerce companies now have fewer out-of-stock situations, better goods management and personalised recommendation systems. One such example is Amazon, where the home page looks different for every user, because the interests of each person are different. Similarly, what users have searched for on Google, along with their past history, also shapes the recommendations they get. Along with this, the retail giants are now able to manage their inventory well because of forecasting algorithms which can predict with
& state. As a result of all these applications, the retail and e-commerce industries have
completely changed.

3) Finance

Data Science has played a key role in automating various financial tasks. Just like how banks
have automated risk analytics, finance industries have also used data science for this task.
Financial industries need to automate risk analytics in order to carry out strategic decisions for the company. Using machine learning, they identify, monitor and prioritize the risks.
These machine learning algorithms enhance cost efficiency and model sustainability through
training on the massively available customer data. Similarly, financial institutions use
machine learning for predictive analytics. It allows the companies to predict customer
lifetime value and their stock market moves. Nowadays companies are being built on
predicting the next moves in stock markets. Data Science algorithms weigh various factors such as news, stock trends, overall market psychology, technicals and stock performance to give you a recommendation on whether a stock will move up or down in the long and short term. Thus, the collection of data across various platforms and sources has
made financial institutions take data driven decisions. This in turn has also helped customers
as they get personalized recommendations along with better quality of experience.

4) Healthcare
Healthcare is one of the leading domains in the world of Data Science. Everyday new
experiments and predictions are being done so as to predict heart attacks, cure diseases &
mitigate risks. From medical image analysis to genomics, data science is influencing the world
of Healthcare. As an example, the recently launched Apple Watch can predict a person's risk of a heart attack up to two days before its onset. This has enabled medical professionals to carry out more research into this field, which will guide them to improve the health conditions of their patients. Another application of Data Science in healthcare is drug discovery, in which new candidate medicines are formulated.
Drug Discovery is a tedious and often complex process. Data Science can help us to simplify
this process and provide us with an early insight into the success rate of the newly
discovered drug. With Data Science, we can also analyse several combinations of drugs and
their effect on different gene structure to predict the outcome. Thus Data Science is taking
Healthcare to new heights and that has improved the patient health in general.

2.6 CORE OF DATA SCIENCE - MACHINE LEARNING

The first question that you might be asking yourself is how all these projects in Data Science work. The core of nearly every solution to a Data Science problem is Machine Learning.

Now this begs the question: what is Machine Learning?

Machine learning is often a big part of a "data science" project, e.g., it is often heavily used for
exploratory analysis and discovery (clustering algorithms) and building predictive models (supervised
learning algorithms). However, in data science, you often also worry about the collection, wrangling,
and cleaning of your data (i.e., data engineering), and eventually, you want to draw conclusions from
your data that help you solve a particular problem.

The term machine learning is fairly self-explanatory: machines learn to perform tasks that they aren't specifically programmed to do. Many techniques are put into practice, like supervised learning, clustering, regression, naive Bayes, etc.

Alpaydin defines Machine Learning as:

“Optimising a performance criterion using example data and past experience”.

Machine Learning is concerned with giving machines the ability to learn by training algorithms on a
huge amount of data. It makes use of algorithms & statistical models to perform a task without
needing explicit instructions. The name machine learning was coined in 1959 by Arthur Samuel.
Machine learning generally involves 2 steps:

In the first, the human writes "learning" code that finds patterns in data, identifies which patterns
are similar, and reports that similarity (knowledge) in a useful way.

In the second step, the human writes more code that uses that knowledge, so that when new data is
encountered, this "predicting" code can anticipate the value of the data or interpret it as having
meaning, in context to what is already known.

The main objective of machine learning algorithms is to solve certain problems. These “problems”
come under the different types of machine learning.

Machine learning is divided into three main categories.

1) Supervised Learning
2) Unsupervised Learning
3) Reinforcement Learning
Supervised Learning:
Here, the system is trained using past data (which includes input and output), and is able to take
decisions or make predictions, when new data is encountered.

It is called supervised — because there is a teacher or supervisor.

Suppose you are provided with a data set consisting of images of bikes and cars. Now, you need to train the machine to classify all the different images. You can train it like this:

If there are 2 wheels and 1 headlight on the front, it will be labelled as a bike.
If there are 4 wheels and 2 headlights on the front, it will be labelled as a car.

Now, let's say that after training on this data, a new image (say, of a bike) arrives and you ask the machine to identify it.

Since your machine has already learned these rules, it uses that knowledge: it classifies the image according to the presence and number of wheels and headlights, and labels the image as a bike.
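
A minimal sketch of this supervised set-up, assuming scikit-learn is available; the wheel/headlight feature encoding and the label names are our own illustrative choices.

    from sklearn.tree import DecisionTreeClassifier

    # Each example is [number_of_wheels, number_of_headlights];
    # the labels are what the "supervisor" provides during training.
    X_train = [[2, 1], [2, 1], [4, 2], [4, 2]]
    y_train = ["bike", "bike", "car", "car"]

    model = DecisionTreeClassifier().fit(X_train, y_train)

    # A new, unseen example: two wheels and one headlight.
    print(model.predict([[2, 1]]))   # ['bike']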
Unsupervised Learning:

The system is able to recognise patterns, similarities and anomalies, taking into consideration only
the input data.

Unlike supervised learning, here the result is not known in advance. We approach the data with little or no knowledge of what the result will be, and the machine is expected to find the hidden patterns and structure in unlabelled data on its own.

That’s why it is called unsupervised — there is no supervisor to teach the machine.
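
A minimal sketch of the unsupervised case, again assuming scikit-learn: k-means is asked to find two groups in unlabelled points without ever being told what the groups mean. The data points are our own illustrative choices.

    from sklearn.cluster import KMeans

    # Unlabelled data: the same wheel/headlight measurements, but with no labels.
    X = [[2, 1], [2, 1], [4, 2], [4, 2], [4, 2]]

    # Ask the algorithm to find 2 groups on its own; n_init sets how many
    # random restarts are tried.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)   # e.g. [1 1 0 0 0] -- group ids, not named classes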

Reinforcement Learning:

Decisions are made by the system on the basis of the reward/punishment it received for the last
action it performed.

Reinforcement Learning is an aspect of Machine Learning where an agent learns to behave in an environment by performing certain actions and observing the rewards/results which it gets from those actions.

2.7 THE DATA SCIENCE LIFECYCLE

The Data Science Life Cycle consists of


1) Identify the problem
2) Identify available data sources
3) Identify if additional data sources are needed
4) Statistical analysis
5) Implementation, development
6) Communicate results
7) Maintenance

1) Identify the problem:

- Identify metrics used to measure success over baseline (doing nothing)

- Identify type of problem: prototyping, proof of concept, root cause analysis, predictive analytics,
prescriptive analytics and machine-to-machine implementation

- Identify key people within your organisation and outside

- Get specifications, requirements, priorities, and budgets

- How accurate the solution needs to be?

- Decide what data you need

- Built internally versus using a vendor solution

- Vendor comparison, benchmarking

2) Identify available data sources:

- Extract (or obtain) and check sample data (use sound sampling techniques); discuss fields to make sure data is understood by you

- Perform EDA (exploratory analysis, data dictionary)

- Assess quality of data, and value available in data


- Identify data glitches, find work-around

- Are data quality and field population consistent over time?

- Are some fields a blend of different stuff (example: keyword field, sometimes equal to user query,
sometimes to advertiser keyword, with no way to know except via statistical analyses or by talking to
business people)

- How to improve data quality moving forward

- Do I need to create mini summary tables / database?

- Which tool do I need (R, Excel, Tableau, Python, Perl, SAS)?

3) Identify if additional data sources are needed:

- What fields should be captured?

- How granular?

- How much historical data?

- Do we need real time data?

- How to store or access the data? (NoSQL? Map-Reduce?)

- Do we need experimental design?

4) Statistical analysis:

- Use imputation methods as needed

- Detect/remove outliers

- Selecting variables (variables reduction)

- Is the data censored (hidden data, as in survival analysis or time-to-crime statistics)

- Cross-correlation analysis

- Model selection (as needed, favour simple models)

- Sensitivity analysis

- Cross-validation, model fitting

- Measure accuracy, provide confidence intervals

5) Implementation, development:

- FSSRR: Fast, simple, scalable, robust, re-usable


- How frequently do I need to update lookup tables, white lists, data uploads, and so on
- Debugging

- Need to create an API to communicate with other apps

6) Communicating results

- To whom do I have to communicate the results?

- Is the business person someone who understands statistics or not?

- How does my solution affect the business and what is the impact it can make?

7) Maintenance

- Is the data shaping up as it was previously?

- Is the statistical model used previously still showing accurate numbers?

- Is there any other way the code can be optimised?

The data science workflow is a non-linear and iterative task which requires many skills and tools to
cover the whole process.

From framing your business problem to generating actionable insights, it is a complex process which requires a lot of brainstorming and iterative model-building.
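
A hedged, minimal skeleton mapping a few of the lifecycle steps onto code, assuming scikit-learn is available; the synthetic data stands in for whatever sources steps 2 and 3 would actually identify, and the model choice is illustrative only.

    from sklearn.datasets import make_classification
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Steps 2-3: identify and obtain data. A real project would pull from the
    # organisation's own sources; synthetic data is used so the sketch runs on its own.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Step 4: statistical analysis - impute missing values, favour a simple
    # model, and use cross-validation to measure accuracy.
    model = make_pipeline(SimpleImputer(strategy="median"),
                          LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)

    # Step 6: communicate results in terms the business user understands.
    print(f"Estimated accuracy: {scores.mean():.1%} (+/- {scores.std():.1%})")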

Check your Progress 1


1. What are the components of Data Science?
2. In which fields does Data Science have its applications?
3. What are the components of Data Science Life Cycle?
4. What is Machine Learning?
5. What are the components of Machine Learning?

Activity 1

1. Give one example in your day to day life where you see data science algorithms getting
applied.

Summary

∙ Data Science is a combination of multiple fields which involves creation, preparation,


transformation, modelling, and visualisation of data.
∙ Data Science consists of 4 major components which are domain knowledge expertise, software
programming, statistics, & communications skills.
∙ About a decade ago, because there was a lot of cost associated with storing data, analysis of past data was not feasible. Along with this, the lack of computing power in machines also made things a lot harder.
∙ Data Science has its applications in each and every field. Right from mobiles to predicting a
heart attack, Data Science is being used to create a better living.
∙ Data Scientists are the ones who predominantly do most Data Science work. They are integral to modelling and transforming the data.
∙ The job of Data Scientists can vary from project to project and from organisation to organisation. They do not have a fixed set of rules under which their work lies.
∙ The core part of Data Science is Machine Learning. Machine Learning consists of modelling the data to get some feasible output from it.
∙ Machine Learning consists of making the machine learn or training it so that it can predict the
outcome from previous data points.
∙ Machine Learning can be categorised in 3 major parts which are Supervised, Unsupervised &
Reinforcement Learning.
∙ Data Science can be defined into a work cycle which can be segregated in seven different parts
associated with problem identification, data gathering & transformation, statistical
modelling, communicating the results, and maintenance.

Keywords
∙ Data Science: The art of solving business problems coupled with statistics & software
programming.
∙ Machine Learning: The ability of a machine to learn from past data to predict an outcome for
the future.
∙ Supervised Learning: Learning in which the answers (labels) for the training data points are already known.
∙ Unsupervised Learning: Learning in which the answers for the data points are not known; there is no supervisor or teacher.
∙ Reinforcement Learning: Learning based on rewards and punishments: a right move leads to a reward and a wrong move leads to a punishment.

Self-Assessment Questions
1. What is Data Science & what are its applications?
2. What are algorithms and how are they used in Data Science field?
3. Why are we studying Data Science as a subject now & why didn’t we study it earlier?
4. What is the Data Science Life Cycle?
5. Which types of Learning do you see around you? Give one example of each and explain how it affects your day-to-day life.
6. What is Machine Learning?

Answers to Check your Progress


Check your Progress 1

1) Components of Data Science are:


a. Domain Knowledge Expertise
b. Software Programming
c. Statistics
d. Communication Skills

2) Data Science has its applications in:


a. Retail & Ecommerce
b. Healthcare & IT
c. Transport, Logistics & Manufacturing
d. Banking & Finance
3) Components of Data Science Life Cycle are:
a. Identifying the problem
b. Gathering Data
c. Statistical Analysis
d. Implementation & Development
e. Communicating Results
f. Maintenance

4) Machine Learning can be defined as:

Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.

5) Components of ML are:
a. Supervised Learning
b. Unsupervised Learning
c. Reinforcement Learning

Suggested Reading
1. Jeffrey Stanton, An Introduction to Data Science.
2. Jerome H. Friedman, Robert Tibshirani and Trevor Hastie, The Elements of Statistical Learning.
3. Andriy Burkov, The Hundred-Page Machine Learning Book.
Unit 3
Big Data, Datafication & its impact on Data Science
Structure:

3.1 Introduction to Big Data

3.2 Big Data, What is it?

3.3 Big Data & Data Science

3.4 Big Data Technologies

3.5 Datafication

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading
Published by Symbiosis Centre for Distance Learning (SCDL), Pune

2019

Copyright © 2019 Symbiosis Open Education Society

All rights reserved. No part of this unit may be reproduced, transmitted or utilised in any form or by
any means, electronic or mechanical, including photocopying, recording or by any information
storage or retrieval system without written permission from the publisher.

Acknowledgement

Every attempt has been made to trace the copyright holders of materials reproduced in this unit.
Should any infringement have occurred, SCDL apologises for the same and will be pleased to make
necessary corrections in future editions of this unit.
Objectives

After going through this unit, you will be able to:

∙ Understand what is meant by Big Data


∙ Make connections between Big Data Problems & Data Science

3.1 INTRODUCTION TO BIG DATA

When we order a cab or a parcel from Amazon, we rarely stop to think how easy it has become to sit in our room and do these things. A few years ago this was not the case: cabs had to be hailed by hand signals on the road, and we did not quite trust ordering items from online stores for one reason or another. These comforts have become possible because storing and analysing data has become cheaper and faster. Along with this, what has also changed is the amount of data we are generating. Everybody these days has social media accounts, a couple of tablets, a phone, a laptop and so on, and we are continuously generating data. As all of these data points are generated, they are stored in parallel as well. Have you ever wondered how Amazon manages and stores the data from all its users? How much storage space do they need? And even if they manage to store such huge amounts of data, won't it take forever to analyse it? These are some of the questions we might ask ourselves, and a decade back the answer to all of them would have been "No, it is not possible". But we have come a long way since then.
The answer to all the above questions is now yes, and that is possible only because of Big Data applications and their technological advancements. This raises the question: what is this Big Data, and why is it getting such importance?

“Big Data is a phrase used to mean a massive volume of both structured and
unstructured data that is so large it is difficult to process using traditional
database and software techniques. In most enterprise scenarios the volume
of data is too big or it moves too fast or it exceeds current processing
capacity.” – Webopedia

Thus, in short, Big Data is nothing but huge amounts of data which our traditional SQL databases cannot handle or store. Moreover, the data being generated is not necessarily all structured or all unstructured. A Big Data system is capable of handling both plain/random text (unstructured data) and tabular data (structured data). As a result, it can help companies and conglomerates take data-driven, intelligent decisions.

3.2 BIG DATA, WHAT IS IT?

One of the most common figures you will see when you search for Big Data is the "five V's" diagram.

[Figure: the five V's of Big Data. Source: https://blog.unbelievable-machine.com/en/what-is-big-data-definition-five-vs]
Generally, Big Data consists of 5 major V’s which help us in understanding, managing and
transforming Big Data. These are

a) Volume
b) Variety
c) Value
d) Velocity
e) Veracity

Let us talk about each one of them in brief so as to understand what each means and why it is important.

a) Volume

Volume is nothing but the amount of data that needs to be stored. Let us say you are building a product which will capture data from all users of a particular browser once you have asked for their permission. Suppose there are only 5 users at the start and the user base slowly starts growing. The tools you used at the start, such as Excel or a single SQL database, will not be enough: as the users multiply, the volume of incoming data becomes humongous and it is no longer possible to store it in the same old, traditional database. This is where volume plays a pivotal role in deciding the technology that needs to be used for the product back end.

b) Variety
The data that we see in Excel or any other database for that matter is mostly structured. In
the world of Big Data, it can be anything. You can store text, images and voice notes into a
database. This creates variety amongst the data types you are going to store. This variety is a
driving force in selecting the Big Data technology.

c) Value

Not all the data that is stored will be valuable to us. But does that mean we should not store low-value data? Sometimes a particular field/column in a dataset might not look useful, but that does not mean it won't be used in the future. That is the reason why the database should be built in such a way that it is easy to separate valuable and less valuable data at the source, and equally easy to combine them when required.

d) Velocity

In today’s world, speed is the name of the game. If analytics can be done in real time, it changes
a lot of things. Thus it is imperative to know beforehand the speed at which you need to
collect and store data, the timeframe of data capturing and the way in which it should be
processed. This is the most important part of Big Data.

e) Veracity

Most real-world datasets are not perfect; there are usually many inconsistencies in the data as it is captured. You will often find missing values, wrong data types in the wrong columns, and so on. It is important to handle these inconsistencies, for example by transforming the data and replacing missing values with suitable imputed ones, as sketched below.
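A minimal pandas sketch of handling such veracity problems, assuming a tiny, made-up call-records table (the column names and imputation choices are illustrative only):

    # Impute missing values and fix simple inconsistencies in a made-up dataset
    import numpy as np
    import pandas as pd

    calls = pd.DataFrame({
        "duration_sec": [120, np.nan, 300, 95, np.nan],
        "city": ["Pune", "pune", "Mumbai", None, "Mumbai"],
    })

    calls["duration_sec"] = calls["duration_sec"].fillna(calls["duration_sec"].median())  # numeric: median imputation
    calls["city"] = calls["city"].str.title().fillna("Unknown")   # text: normalise case, flag missing values
    print(calls)
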
3.3 BIG DATA & DATA SCIENCE
Data science is an interdisciplinary field that combines statistics, mathematics, data acquisition, data
cleansing, mining and programming to extract insights and information. When data sets get so big
that they cannot be analysed by traditional data processing application tools, it becomes ‘Big Data’.
That massive amount of data is useless if it is not analysed and processed.

Hence, big data and data science are inseparable.

Both Data Science and Big Data are related to Data Driven Decisions but are significantly different.

Data Driven Decision making (with the expectation of better decisions and increased value) is a process that involves different stages such as:

1) Capturing of Data
2) Processing & Storing Data
3) Analysing and Generating Insights
4) Decision & Actions

Big Data is typically involved in processing and storing the data (Step 2), and that too not in all scenarios. Big Data technology helps in reducing the cost of processing large volumes of data and also makes a few otherwise impractical analyses feasible.

Data Science is involved in analysing and generating insights (Step 3). It involves using statistical, mathematical and machine learning algorithms on the data to generate insights. Whether the data is "Big Data" or not, we can use Data Science to support data-driven decisions and take better decisions.
For example,

If you want to mail your friend a 100 MB file, the mail system will not allow it. So for the mail system, this file is "Big Data". But if you consider the same file being uploaded to a cloud drive, you would be able to do that easily. Hence, the definition of Big Data changes from system to system.

Big Data is the fuel required by Data Scientists to do Data Science.

Some of the technologies which work with Big Data are Hadoop, Apache Spark, etc.

3.4 BIG DATA TECHNOLOGIES

Nearly every industry has begun investing in big data analytics, but some are investing more heavily
than others. According to IDC, banking, discrete manufacturing, process manufacturing,
federal/central government, and professional services are among the biggest spenders. Together
those industries will likely spend $72.4 billion on big data and business analytics in 2017, climbing to
$101.5 billion by 2020.

The fastest growth in spending on big data technologies is occurring within banking, healthcare,
insurance, securities and investment services, and telecommunications.

It’s noteworthy that three of those industries lie within the financial sector, which has many
particularly strong use cases for big data analytics, such as fraud detection, risk management and
customer service optimisation.

The list of technology vendors offering big data solutions is seemingly infinite.

Many of the big data solutions that are particularly popular right now fit into one of the following 5
categories:
1) Hadoop Ecosystem
2) Apache Spark
3) Data Lakes
4) NoSQL Databases
5) In-Memory Databases

Let’s look at each one in detail.

1) Hadoop Ecosystem

Let's first learn about Hadoop itself; the idea of Hadoop and its ecosystem will then follow naturally.

Hadoop is an open-source, scalable, and fault-tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware.

Hadoop is not only a storage system but is a platform for large data storage as well as processing.

Hadoop is an open-source tool from the Apache Software Foundation (ASF). Open source means it is freely available and we can even change its source code as per our requirements: if certain functionality does not fulfil your need, you can change it accordingly. Much of Hadoop's code has been contributed by companies such as Yahoo, IBM, Facebook and Cloudera.

It provides an efficient framework for running jobs on multiple nodes of clusters.

Cluster means a group of systems connected via LAN. Apache Hadoop provides parallel
processing of data as it works on multiple machines simultaneously.
Hadoop consists of three key parts –

1) Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop.


2) Map-Reduce – It is the data processing layer of Hadoop.
3) YARN – It is the resource management layer of Hadoop.

Let’s go into details of each one of these.

1) HDFS

HDFS is a distributed file system which is provided in Hadoop as a primary storage


service. It is used to store large data sets on multiple nodes. HDFS is deployed on low
cost commodity hardware.

So, if you have ten computers where each computer (node) has a hard drive of 1 TB and you install Hadoop on top of these ten machines, you get a raw storage capacity of 10 TB in total. This means that, ignoring replication overhead, you could store a single 10 TB file in HDFS and it would be stored in a distributed fashion across these ten machines.

There are many features of HDFS which makes it suitable for storing large data like
scalability, data locality, fault tolerance etc.

2) Map Reduce

Map Reduce is the processing layer of Hadoop. The Map Reduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
You put your business logic into the map and reduce functions, and the framework takes care of the rest. The work (complete job) submitted by the user to the master is divided into small units (tasks) and assigned to slave nodes.

Map Reduce programs are written in a particular style influenced by functional


programming constructs, specifically idioms for processing lists of data.

In Map Reduce, the input is a list and the output is again a list. It is the heart of Hadoop: much of Hadoop's power and efficiency comes from Map Reduce, because processing is done in parallel. A conceptual word-count sketch of the map and reduce steps is given below.
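The map and reduce idea can be sketched conceptually in plain Python with the classic word-count example. This is not Hadoop code (a real job would be written against the Hadoop API or a framework such as Hadoop Streaming); it only illustrates what the map, shuffle and reduce phases do:

    # Conceptual word count: 'map' emits (word, 1) pairs, 'reduce' sums the counts per word
    from collections import defaultdict

    documents = ["big data is big", "data science uses big data"]

    # Map phase: turn each input record into a list of key-value pairs
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group values by key (Hadoop does this automatically between map and reduce)
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: combine the values for each key
    reduced = {word: sum(counts) for word, counts in grouped.items()}
    print(reduced)   # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}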

3) YARN

Apache YARN ("Yet Another Resource Negotiator") is the resource management layer of Hadoop. YARN was introduced in Hadoop 2.x. It allows different data processing engines, such as graph processing, interactive processing, stream processing and batch processing, to run and process data stored in HDFS (the Hadoop Distributed File System).

Apart from resource management, YARN is also used for job scheduling. YARN extends the power of Hadoop to other evolving technologies, so they can take advantage of HDFS (a highly reliable and popular storage system) and an economical cluster.

Apache YARN is also considered the data operating system of Hadoop 2.x. The YARN-based architecture of Hadoop 2.x provides a general-purpose data processing platform which is not limited to Map Reduce.

It enables Hadoop to run purpose-built data processing systems other than Map Reduce. It allows several different frameworks to run on the same hardware where Hadoop is deployed.

Now that we have understood what Hadoop is, let's try to understand what the Hadoop Ecosystem is.

The Hadoop ecosystem refers to the various components of the Apache Hadoop software library,
as well as to the accessories and tools provided by the Apache Software Foundation.

[Figure: components of the Hadoop ecosystem. Source: https://www.oreilly.com/library/view/apache-hiveessentials/9781788995092/e846ea02-6894-45c9-983a-03875076bb5b.xhtml]

The above figure shows the various components of Hadoop ecosystem. Some of the components
are explained as follows:

a) Hive

Apache Hive is an open-source data warehouse system for querying and analysing large datasets stored in Hadoop files. Hive performs three main functions:

Data summarisation
Query processing
Analysis

Hive uses a language called HiveQL (HQL), which is similar to SQL.

HiveQL automatically translates SQL-like queries into Map Reduce jobs which execute on Hadoop.

b) Pig

Apache Pig is a high-level language platform for analysing and querying huge datasets that are stored in HDFS.

Pig uses the Pig Latin language, which is very similar to SQL. It loads the data, applies the required filters and dumps the data in the required format.
For program execution, Pig requires a Java runtime environment.

c) HBase

Apache HBase is a distributed database designed to store structured data in tables that can have billions of rows and millions of columns.
HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access to read or write data in HDFS.

d) HCatalog

It is a table and storage management layer for Hadoop.

HCatalog supports different components available in Hadoop like Map Reduce, Hive, and Pig
to easily read and write data from the cluster. HCatalog is a key component of Hive that
enables the user to store their data in any format and structure.

e) Avro

It is a popular data serialisation system.

Avro is an open-source project that provides data serialisation and data exchange services for Hadoop. These services can be used together or independently.

Using Avro, big data programs written in different languages can exchange data.

The above-mentioned services are the ones generally present in the Hadoop ecosystem; it is not compulsory that each of these technologies will always be required.

Thus Hadoop is a very important part of Big Data and, most of it being open source, it can be modified as per requirement.

2) Apache Spark

Apache Spark is an open-source cluster-computing framework.

Apache Spark is a general-purpose and lightning-fast cluster computing system. It provides high-level APIs in Java, Scala, Python and R.

Apache Spark is a tool for running Spark applications. Spark can be up to 100 times faster than Hadoop MapReduce for in-memory processing and around 10 times faster when accessing data from disk.

Apache Spark started in 2009 in the UC Berkeley R&D Lab, which later became the AMPLab. It was open-sourced in 2010 under a BSD licence.

In 2013, Spark was donated to the Apache Software Foundation, where it became a top-level Apache project in 2014. It was built on top of the Hadoop Map Reduce model and extends that model to efficiently support more types of computation.

Spark can be used along with Map Reduce in the same Hadoop cluster or can be used alone as a processing framework. Spark applications can also run on YARN.

The Apache Spark framework can be programmed in Java, R, Python and Scala. However, Scala programming is often the most favourable choice because:

a) It offers great scalability on the JVM
b) Performance achieved using Scala is often better than that of data analysis tools like R or Python
c) It has excellent built-in concurrency support and libraries like Akka, making it easy to build a scalable application
d) A single complex line of code in Scala can replace 20–25 lines of complex Java code
e) Scala is fast and efficient, making it an ideal choice for computationally intensive algorithms

Some important features of Spark are:


1) Real-time: Real-time computation and low latency because of in-memory computation
2) Speed: 100x faster for large scale data processing
3) Ease of Use: Applications can be written easily in Java, Scala, Python, R and SQL
4) Generality: Combines SQL, streaming and complex analytics
5) Deployment: Easily deployable through Mesos, Hadoop via YARN or Spark’s own cluster
manager
6) Powerful Caching: Provides powerful caching and disk persistence capabilities

Apache Pyspark

PySpark is nothing but Python for Spark.

PySpark is one of the supported languages for Spark. Spark is a big data processing platform that provides the capability to process petabyte-scale data.

Using PySpark you can write Spark applications to process data and run them on the Spark platform. (AWS, for example, provides EMR, a managed Spark platform.)

Using PySpark you can read data from various file formats like CSV, Parquet and JSON, or from databases, and do analysis on top of it, as in the brief sketch below.
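A minimal PySpark sketch of this, assuming PySpark is installed locally; the file name "calls.csv" and the "city" column are placeholders for your own data:

    # Read a CSV file into a Spark DataFrame and run a simple aggregation
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-sketch").getOrCreate()

    df = spark.read.csv("calls.csv", header=True, inferSchema=True)   # "calls.csv" is a placeholder path
    df.groupBy("city").count().show()                                 # number of records per city
    spark.stop()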

It is because of such features that Spark is widely preferred in industry these days. Whether it is start-ups or Fortune 500s, all are adopting Apache Spark to build, scale and innovate their applications.

Spark has left no area of Industry untouched whether it is finance or entertainment, it is being
widely used everywhere.

3) Data Lakes

A data lake is a reservoir which can store vast amounts of raw data in its native format. This data
can be –

1) Structured data from relational databases (rows and columns),

2) Structured data from NoSQL databases (like MongoDB, Cassandra, etc.),

3) Semi-structured data (CSV, logs, XML, JSON),

4) Unstructured data (emails, documents, PDFs) and Binary data (images, audio, video).

The purpose of a data lake, a capacious and agile platform, is to hold all the data of an enterprise on a central platform.

By this, we can do comprehensive reporting, visualisation, analytics and eventually glean deep
business insights.

But keep in mind that Data Lakes and Data Warehouse are different things.

Contrary to a data warehouse, where data is processed and stored in files and folders, a data lake has a flat architecture: it stores all the data without any prior processing, reducing the time required to bring data in. The data in a data lake is retained in its original format until it is needed.

Data lakes provide agility and flexibility, making it easier to make changes. Though the reason to store data in a data lake is not predefined, the main objective of building one is to offer an unrefined view of data to data scientists whenever needed.
Data Lake also allows Ingestion i.e. connectors to get data from different data sources to be
loaded into the Data Lake. Data lake storage is more scalable and cost efficient and allows fast
data exploration.

If not designed correctly, Data Lake can soon become toxic. Some of the guiding principles for
designing Data Lake are:

Data within the data lake is stored in the same format as that of the source. The idea is to store
data quickly with minimal processing to make the process fast and cost efficient.

Data within the data lake is reconciled with the source every time a new data set is loaded, to
ensure that it is a mirror copy of data inside the source.

Data within the data lake is well documented to ensure correct interpretation of data. Data
catalogue and definitions are made available to all authorised users through a convenient
channel.

Data within the data lake can be traced back to its source to ensure integrity of data.

Data within the data lake is secured through a controlled access mechanism. It is generally made
available to data analysts and data scientists to explore further.

Data within the data lake is generally large in volume. The idea is to store as much data as
possible, without worrying about which data elements are going to be useful and which are not.
This enables an exploratory environment, where users can keep looking at more data and build
reports or analytical models in an incremental fashion.

Data within the data lake is stored in the form of daily copies of data so that previous versions of
data can be easily accessed for exploration. Accumulation of historic data over time enables
companies to do trend analysis as well as build intelligent machine learning models that can
learn from previous data to predict outcomes.

Data within the data lake is never deleted.

Data within the data lake is generally stored in open source big data platforms like Hadoop to
ensure minimum storage costs. This also enables very efficient querying and processing of large
volumes of data during iterative data exploration and analysis.

Data within the data lake is stored in the format in which it is received from the source, and is not necessarily structured. The idea is to put in minimum effort while storing data into the data lake; all efforts to organise and decipher data happen after loading.

Thus Data Lakes are now a major part of every enterprise architecture building process.

When a business question arises, the data lake can be queried for relevant data, and that smaller
set of data can then be analysed to help answer the question.

4) In Memory Databases

An in-memory database is a data store that primarily uses the main memory of a computer. Since main memory has the fastest access time, data stored there affords the highest speed for database applications.

Mainstream databases mostly store data in a permanent store (such as a hard disk or network storage), which increases access time; they are thus not as fast as in-memory databases.
Mission-critical applications which need very fast response times, such as medical and telecom applications, have always relied on in-memory databases. However, the recent development of memory devices that can hold large amounts of data at a very low price has made in-memory databases very attractive for commercial applications as well.

In-memory databases generally store data in proprietary forms. There are several open-source in-memory databases that store data in a 'key-value' format, so in that sense these databases are not similar to traditional relational databases that use SQL (a small sketch using one such store is given at the end of this subsection).

All properly constructed DBMSs are actually in-memory databases for query purposes at some level, because they really only query data that is in memory, i.e. in their buffer caches. The difference is that a database that claims to be in-memory will always have the entire database resident in memory from start-up, while more traditional databases use a demand-loading scheme, only copying data from permanent storage to memory when it is called for.

So, even if our Oracle, Informix, DB2, PostgreSQL, MySQL, or MS SQL Server instance has sufficient memory allocated to it to keep the entire database in memory, the first queries will run slower than later ones until all of the data has been called for directly by queries or pulled into memory by read-ahead activity.

A true in-memory database system will have a period at start-up when it will either refuse to respond to queries or will suspend them until the entire database has been loaded in from storage, after which all queries will be served as quickly as possible.
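As a small illustration, here is a sketch using Redis, a popular open-source in-memory key-value store, through the redis-py client. It assumes a Redis server is running locally on the default port; the key name is made up:

    # Store and retrieve a value from an in-memory key-value store (Redis)
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    r.set("customer:42:plan", "prepaid")       # the write lands in main memory
    print(r.get("customer:42:plan"))           # the read is served from memory: 'prepaid'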

5) NOSQL Databases

NoSQL refers to a general class of storage engines that store data in a non-relational format. This is in contrast to a traditional RDBMS, in which data is stored in tables whose rows relate to each other. NoSQL stands for "Not Only SQL" and isn't meant as a rejection of traditional databases.

There are different kinds of NoSQL databases for different jobs. They can be categorised broadly into four buckets (a short document-store sketch follows this list):

a) Key-Value Stores: Very simple, in that you simply define a key for a binary object. It is very common for programmers to store large serialised objects in these kinds of databases. Examples are Cassandra and Oracle NoSQL.
b) Document Stores: Store "documents", also based on a key-value system although more structured. The most common implementation is based on the JSON (JavaScript Object Notation) standard, which has a structure similar to XML. Examples are MongoDB and CouchDB.
c) Graph DBs: Store data as "graphs", which allow you to define complex relationships between objects. Very common for something like storing relationships between people in a social network. An example is Neo4j.
d) Column Oriented: Data is stored in columns rather than rows (a tricky concept at first). This allows for great compression and for building tables that are very large (hundreds of thousands of columns, billions or trillions of rows). An example is HBase.
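As a short illustration of the document-store bucket, here is a sketch using MongoDB through the pymongo client. It assumes a MongoDB instance running locally; the database, collection and document contents are made up:

    # Insert and query a JSON-like document in MongoDB (a document store)
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    users = client["demo_db"]["users"]          # database and collection are created lazily

    users.insert_one({"name": "Asha", "city": "Pune", "orders": [101, 204]})
    print(users.find_one({"city": "Pune"}))     # returns the stored document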

In general, NoSQL databases excel when you need something that can both read and write large amounts of data quickly. And since they scale horizontally, just adding more servers tends to improve performance with little effort. Facebook, for example, has used a NoSQL store (HBase) for its messaging inbox.

Other examples might be a user's online game profile or storing large amounts of legal documents. An RDBMS is still the best option for handling large numbers of atomic transactions (i.e., we are unlikely to see things like banking systems or supply chain management systems run on a NoSQL database).

This is also because most NoSQL databases are not fully ACID compliant (basically, two people looking at the same key might see different values, depending on when the data was accessed).

NoSQL databases are used in a large share of our daily applications. They are a very important component of the overall Big Data landscape.

3.5 DATAFICATION

Datafication is a new concept which refers to how we render many aspects of life, including uncontrollable and qualitative factors, into quantified data. In other words, this new term represents our ability to collect data about aspects of our lives that have never been quantified before and turn it into value, i.e. valuable knowledge.

Let us try to elaborate this with an example.

Every time we go to a big store to buy a product, the store staff ask us to fill in a form and get a card for that store. In earlier days, this practice was not present. So what changed?

Earlier, when we bought products, there was no traceability linking a product to the person who bought it. But now, as we buy anything by swiping a card associated with that store, the store associates the product with us. This helps them send us offers on our mobile and email.

This is DATAFICATION. Earlier this information was qualitative in nature and was not being captured. But now, with the introduction of this card system, we are able to know this for nearly 40% of customers.

Datafication is not only about the data, but refers to the process of collecting data, as well as the tools and the technologies that support data collection. In the business context, an organisation uses data to monitor processes, support decision-making and plan short- and long-term strategies. Many start-up companies have been established on the hype of big data by extracting value from it. In a few years, no business will be able to operate without exploiting the data available, while whole industries may face complete re-engineering.

But keep in mind that Datafication is not Digitalisation. The latter term describes the process of using digital technologies to restructure our society, businesses and personal lives. It began with the rise of computers and their introduction into organisations. In the following years, new technologies such as the Internet of Things have gradually been integrated into our lives and revolutionised them. Datafication represents the next phase of evolution, in which data production and proper collection are already a given and society establishes processes for the extraction of valuable knowledge.

Check your Progress 1

1. What are the 5 V’s in Big Data?


2. What is the connection between Data Science and Big Data?
3. What are different technologies associated with Big Data?
4. Which are the main components of Hadoop?
5. What is Apache Spark?
6. What are in memory databases?
Activity 1

1. Give one example in your day to day life where you see DATAFICATION happening.

Summary

∙ Big Data is nothing but huge data which is governed by 5 V’s.


∙ Big Data and Data Science go hand in hand. But it doesn’t mean that one cannot survive
without another.
∙ The various technologies associated with Big Data are:
o Hadoop
o Apache Spark
o In-Memory Databases
o NoSQL databases
o Data Lakes
∙ Datafication is a new trend which is nothing but finding new ways to convert qualitative data about a consumer/customer into quantitative data.

Keywords

∙ Big Data: Big data is data that exceeds the processing capacity of conventional database
systems.
∙ Datafication: Datafication refers to the collective tools, technologies and processes used to
transform an organisation to a data-driven enterprise.

Self-Assessment Questions

1. How is Data Science and Big Data connected? Explain with an example.
2. Where do you think Big Data can be most useful in today’s world?

Answers to Check your Progress

Check your Progress 1

1) 5 V’s of Big Data:


a. Volume
b. Variety
c. Velocity
d. Veracity
e. Value

2) Connection between Data Science and Big Data:


Big Data is the fuel for Data Science
3) Technologies associated with Big Data are
a. Hadoop
b. Spark
c. Data Lakes
d. NoSQL Databases
e. In Memory Databases
4) Main components of Hadoop are:
1) HDFS
2) Map Reduce
3) YARN
5) Apache Spark is a general-purpose & lightning fast cluster computing platform. It is an open
source, wide range data processing engine.
6) An In-Memory Database is a database management system that primarily relies on main
memory for computer data storage.

Suggested Reading

1. Big Data: A Revolution That Will Transform How We Live, Work, and Think. - Book by
Kenneth Cukier and Viktor Mayer-Schönberger.
2. Big Data For Dummies - Book by Alan Nugent, Fern Halper, Judith Hurwitz, and Marcia
Kaufman.
3. Big Data at Work: Dispelling the Myths, Uncovering the Opportunities - Book by Thomas H.
Davenport.
Unit 4
Data Science Pipeline, EDA & Data Preparation
Structure:

4.1 Introduction to Data Science Pipeline

4.2 Data Wrangling

4.3 Exploratory Data Analysis

4.4 Data Extraction & Cleansing

4.5 Statistical Modelling

4.6 Data Visualisation

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading

Published by Symbiosis Centre for Distance Learning (SCDL), Pune

2019
Copyright © 2019 Symbiosis Open Education Society

All rights reserved. No part of this unit may be reproduced, transmitted or utilised in any form or by
any means, electronic or mechanical, including photocopying, recording or by any information
storage or retrieval system without written permission from the publisher.

Acknowledgement

Every attempt has been made to trace the copyright holders of materials reproduced in this unit.
Should any infringement have occurred, SCDL apologises for the same and will be pleased to make
necessary corrections in future editions of this unit.
Objectives

After going through this unit, you will be able to:

∙ Understand what is meant by the Data Science Pipeline
∙ Understand the meaning of Data Wrangling and Exploratory Data Analysis
∙ Understand why cleansing the data is the most important part of Data Science
∙ Understand the basics of Statistical Modelling
∙ Know why visualising the data is an integral part of the Data Science work cycle

4.1 INTRODUCTION TO DATA SCIENCE PIPELINE

A data science pipeline is the overall step by step process towards obtaining, cleaning, visualising,
modelling, and interpreting data within a business or group.

Data science pipelines are sequences of processing and analysis steps applied to data for a specific
purpose.

They're useful in production projects, and they can also be useful if one expects to encounter the
same type of business question in the future, so as to save on design time and coding.

Stages of Data Science Pipeline are as follows:

1) Problem Definition

∙ Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. Problem definition aims at understanding, in depth, a given problem at hand.
∙ Multiple brainstorming sessions are organised to define the problem correctly, because your end goal depends upon what problem you are trying to solve. If you go wrong during the problem definition phase itself, you will be delivering a solution to a problem which never even existed in the first place.
2) Hypothesis Testing

∙ Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis.
∙ Hypothesis testing is used to infer the result of a hypothesis performed on sample data from a larger population. In simple words, we form some assumptions during the problem definition phase and then validate those assumptions statistically using data, as in the small sketch below.
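A small sketch of this in Python with SciPy, testing the assumption that average daily sales equal 500 units; the sample numbers are made up:

    # One-sample t-test of the assumption that mean daily sales = 500 units
    from scipy import stats

    daily_sales = [480, 520, 495, 510, 470, 530, 505, 490]      # made-up sample
    t_stat, p_value = stats.ttest_1samp(daily_sales, popmean=500)

    print("t = %.2f, p = %.3f" % (t_stat, p_value))
    # A large p-value means the assumption cannot be rejected at the usual 5% level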

3) Data Collection and processing

∙ Data collection is the process of gathering and measuring information on variables of interest,
in an established systematic fashion that enables one to answer stated research questions,
test hypotheses, and evaluate outcomes. Moreover, the data collection component of
research is common to all fields of study including physical and social sciences, humanities,
business, etc.
∙ While methods vary by discipline, the emphasis on ensuring accurate and honest collection
remains the same.
∙ Data processing is more about a series of actions or steps performed on data to verify,
organize, transform, integrate, and extract data in an appropriate output form for
subsequent use. Methods of processing must be rigorously documented to ensure the utility
and integrity of the data.

4) EDA and Feature Engineering

∙ Once you have clean and transformed data, the next step for machine learning projects is to become intimately familiar with the data using exploratory data analysis (EDA).
∙ EDA is about numeric summaries, plots, aggregations, distributions, densities, reviewing all the levels of factor variables and applying general statistical methods.
∙ A clear understanding of the data provides the foundation for model selection, i.e. choosing the correct machine learning algorithm to solve your problem.
∙ Feature engineering is the process of determining which predictor variables will contribute the most to the predictive power of a machine learning algorithm.
∙ The process of feature engineering is as much an art as a science. Often feature engineering is a give-and-take process with exploratory data analysis, which provides much-needed intuition about the data. It’s good to have a domain expert around for this process, but it’s also good to use your imagination.

5) Modelling and Prediction

∙ Machine learning can be used to make predictions about the future. You provide a model with
a collection of training instances, fit the model on this data set, and then apply the model to
new instances to make predictions.
∙ Predictive modelling is useful because you can make products that adapt based on expected
user behaviour. For example, if a viewer consistently watches the same broadcaster on a
streaming service, the application can load that channel on application start-up.

6) Data Visualisation

∙ Data visualisation is the process of displaying data/information in graphical charts, figures, and
bars. It is used as a means to deliver visual reporting to users for the performance,
operations or general statistics of data and model prediction.
7) Insight Generation and implementation

∙ Interpreting the data is more like communicating your findings to the interested parties. If you can’t explain your findings to someone, whatever you have done is of little use. Hence, this step becomes very crucial.
∙ The objective of this step is to first identify the business insight and then correlate it to your
data findings. Secondly, you might need to involve domain experts in correlating the findings
with business problems.
∙ Domain experts can help you in visualising your findings according to the business dimensions
which will also aid in communicating facts to a non-technical audience.

4.2 DATA WRANGLING


Data wrangling is said to be 80% of what a data scientist does; it is where most of the real value is created.

The first step in analytics is gathering data. Then as you begin to analyse and dig deep for answers, it
often becomes necessary to connect to and mashup information from a variety of data sources.

Data can be messy, disorganised, and contain errors. As soon as you start working with it, you will see
the need for enriching or expanding it, adding groupings and calculations. Sometimes it is difficult to
understand what changes have already been implemented.

Moving between data wrangling and analytics tools slows the analytics process—and can introduce
errors. It’s important to find a data wrangling function that lets you easily make adjustments to data
without leaving your analysis.

This is also called Data Munging. It follows certain steps: after extracting the data from different data sources, the data is sorted using certain algorithms, decomposed into a different, structured format, and finally stored in another database.

Some of the steps associated with Data Wrangling are:

1. Load, explore, and analyse your data

2. Drop the unnecessary columns like columns containing IDs, Names, etc.
3. Drop the columns which contain a lot of null or missing values

4. Impute missing values

5. Replace invalid values

6. Remove outliers

7. Log Transform Skewed Variables

8. Transform categorical variables to dummy variables

9. Binning the continuous numeric variables

10. Standardisation and Normalisation

Each of the above-mentioned steps has a special importance with respect to Data Science. Let us look at an example.

If you want to visualise number of customers of a telecom provider by city, then you need to ensure
that there is only one row per city before data visualisation.

If you have two rows like Bombay and Mumbai representing the same city, this could lead to wrong results. One of the values has to be changed by the data analyst, typically by creating a mapping on the fly in the visualisation tool that is applied to every row of data; the data is then checked for more such issues and the process is repeated for other cities. A small pandas sketch of this kind of clean-up, together with a few of the steps listed above, is given below.
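A minimal pandas sketch of a few of these wrangling steps, including the Bombay/Mumbai mapping; the table, column names and imputation choices are invented for illustration:

    # A few common wrangling steps on a small, made-up customer table
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "city": ["Bombay", "Mumbai", "Pune", "Pune"],
        "monthly_bill": [450.0, None, 600.0, 520.0],
        "plan": ["prepaid", "postpaid", "prepaid", "prepaid"],
    })

    customers = customers.drop(columns=["customer_id"])                   # drop ID-like columns
    customers["city"] = customers["city"].replace({"Bombay": "Mumbai"})   # map duplicate labels to one value
    customers["monthly_bill"] = customers["monthly_bill"].fillna(customers["monthly_bill"].mean())  # impute
    customers = pd.get_dummies(customers, columns=["plan"])               # categorical -> dummy variables
    print(customers.groupby("city").size())                               # now exactly one label per city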

Need of Data Wrangling

Data wrangling is an important prerequisite for implementing a statistical model.

Therefore, data is converted into a proper, feasible format before any model is applied to it. By filtering, grouping and selecting appropriate data, the accuracy and performance of the model can be increased.

4.3 EXPLORATORY DATA ANALYSIS


Exploratory data analysis is, as the name suggests, a first look at the data you will be working with. Usually this involves:

1. Cleaning the data: finding junk values and removing them, finding outliers and replacing them appropriately (with the 95th percentile, for example), etc.
2. Summary statistics: finding the mean, median and, if necessary, the mode, along with the standard deviation and variance of the particular distribution.
3. Univariate analysis: a simple histogram that shows the frequency of a particular variable's values, or a line chart that shows how a particular variable changes over time, so as to have a look at all the variables in the data and understand them.

The idea is that, after performing Exploratory Data Analysis, you should have a sound understanding
of the data you are about to dive into. Further hypothesis based analysis (post EDA) could involve
statistical testing, bi-variate analysis etc.

Let's understand this with the help of an example.

We have all seen our mother take a spoonful of soup to judge whether or not the salt is right. The act of tasting a spoonful to check the salt level and better understand the taste of the soup is exploratory data analysis. Based on that, our mothers decide the salt level; this is where they make inferences, and the validity of those inferences depends on whether or not the soup is well stirred, that is to say, whether or not the sample represents the whole population.

Let us take another business case example.

Say we have been given some sales data and daily revenue numbers for a big retail chain.

The business problem is: a retail chain wants to improve its revenue.

The question that arises now is: what are the ways in which we can achieve this?

What will you look for? Do you know what to look for? Will you immediately run a code to find the mean, median, mode and other statistics?

The main objective is to understand the data inside out. The first step in any EDA is asking the right questions, the ones we want answers to. If our questions go wrong, the whole EDA goes wrong.

The first step of any EDA is therefore to list down as many questions as you can on a piece of paper.
What are some of the questions that we can ask? They are:

∙ How many total stores are there in the retail company?


∙ Which stores and regions are performing the best and the worst?
∙ What are the actual sales across each and every store?
∙ How many stores are selling products below the average?
∙ How many stores are exclusively selling best profit making products?
∙ On which days are the sales maximum?
∙ Do we see seasonal sales across products?
∙ Are there any abnormal sales numbers?

These are some of the questions that need to be asked before deciding on the next steps.

EDA gives some very interesting insights from the data, such as:

1. Listing the outliers and anomalies in our data


2. Identifying the most important variables
3. Understanding the relationship between variables
4. Checking for any errors such as missing variables or incorrect entries
5. Know the data types of the dataset (continuous/discrete/categorical)
6. Understand how the data is distributed
7. Testing a hypothesis or checking assumptions related to a specific model

Exploratory data analysis (EDA) is very different from classical statistics. It is not about fitting models,
parameter estimation, or testing hypotheses, but is about finding information in data and generating
ideas.

So, this is the background of EDA. Technically, it involves steps like cleaning the data, calculating summary statistics and then making plots to better understand the data at hand and make meaningful inferences; a minimal first-pass sketch in pandas is given below.
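A minimal first-pass EDA sketch in pandas; the file name "sales.csv" and the column names are placeholders for the retail data described above:

    # Quick exploratory look at a dataset (file and column names are placeholders)
    import matplotlib.pyplot as plt
    import pandas as pd

    sales = pd.read_csv("sales.csv")

    print(sales.head())                            # peek at the raw rows
    print(sales.describe())                        # summary statistics for numeric columns
    print(sales.isnull().sum())                    # missing values per column
    print(sales["store_region"].value_counts())    # frequency of a categorical variable

    sales["daily_revenue"].hist(bins=30)           # univariate view of one numeric variable
    plt.show()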

4.4 DATA EXTRACTION & CLEANSING


Data extraction and cleaning (sometimes also referred to as data cleansing or data scrubbing) is the act of detecting and either removing or correcting corrupt or inaccurate records from a record set, table, or database. Used mainly in cleansing databases, the process involves identifying incomplete, incorrect, inaccurate or irrelevant items of data and then replacing, modifying, or deleting this "dirty" information.

The next step after data cleaning is data reduction. This includes defining and extracting attributes,
decreasing the dimensions of data, representing the problems to be solved, summarising the data,
and selecting portions of the data for analysis.

There are multiple data cleansing practices in vogue to clean and standardize bad data and make it
effective, usable and relevant to business needs.

Organisations relying heavily on data-driven business strategies need to choose a practice that best fits their operational working. A standard practice is outlined below.

The detailed steps of this process are as follows:

1. Stored Data:

Put together the data collected from all sources and create a data warehouse. Once your
data is stored in a place, it is ready to be put through the cleansing process.

2. Identify errors:

Multiple problems contribute to lowering the quality of data and making it dirty: inaccuracy, invalid data, incorrect data entry, missing values, spelling errors, incorrect data ranges and multiple representations of the same data.

These are some of the common errors which should be taken care of when creating a cleansed data regime.

3. Remove duplication/redundancy

Multiple employees work on a single file where they collect and enter data. Most of the time they don't realise they are entering the same data collected by some other employee at some other time. Such duplicate data corrupts the results and must be weeded out.

4. Validate the accuracy of data

Effective marketing relies on high-quality data, and thus validating accuracy is one of the first things organisations aim for. Note, however, that the method of collection is independent of the cleansing process.
Triple verification of the data will enhance the dataset and build trustworthiness among marketers and sales professionals in utilising the power of data.

5. Standardise data format

Now that the data is validated, it is important to put all of it into a standardised and accessible format. This ensures the entered data is clean, enriched and ready to use.

Some of the other best practices which need to be followed while Data Cleansing are:

∙ Sort data by different attributes


∙ For large datasets cleanse it stepwise and improve the data with each step until you achieve a
good data quality
∙ For large datasets, break them into small data. Working with less data will increase your
iteration speed
∙ To handle common cleansing tasks, create a set of utility functions/tools/scripts. These might include remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all values that don’t match a regex
∙ If you have issues with data cleanliness, arrange them by estimated frequency and attack the most common problems first
∙ Analyse the summary statistics for each column (standard deviation, mean, number of missing
values)
∙ Keep track of every data cleaning operation, so you can alter or roll back changes and remove operations if required

But keep in mind that all these are standard practices and they may or may not apply every time to a given problem. For example, if we have numerical data, we might first want to handle missing values and NAs.

For textual data, tokenisation, removing whitespace, punctuation and stopwords, and stemming can all be possible steps towards cleaning the data for further analysis, as in the sketch below.
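A small text-cleaning sketch using NLTK; the sentence is made up, and the resource names downloaded below can vary slightly between NLTK versions:

    # Basic text clean-up: tokenise, lowercase, drop punctuation/stopwords, then stem
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)        # one-time downloads; newer NLTK versions may
    nltk.download("punkt_tab", quiet=True)    # need "punkt_tab" instead of "punkt"
    nltk.download("stopwords", quiet=True)

    text = "The stores in Pune were selling FAR more units than expected!!"
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    tokens = [stemmer.stem(w) for w in word_tokenize(text.lower())
              if w.isalpha() and w not in stops]
    print(tokens)     # e.g. ['store', 'pune', 'sell', 'far', 'unit', 'expect']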

Thus data cleansing is imperative for model building. If the data is garbage, then the output will also be garbage, no matter how good the statistical analysis applied to it.

4.5 STATISTICAL MODELLING


In simple terms, statistical modelling is a simplified, mathematically-formalized way to approximate
reality (i.e. what generates your data) and optionally to make predictions from this approximation.
The statistical model is the mathematical equation that is used.

Statistical modelling is, literally, building statistical models. A linear regression is a statistical model.

To do any kind of statistical modelling, it is utmost necessary to know the basics of statistics like:

∙ Basic statistics: mean, median, mode, variance, standard deviation, percentiles, etc.
∙ Probability distributions: geometric distribution, binomial distribution, Poisson distribution, normal distribution, etc.
∙ Population and sample: understanding the basic concepts and the concept of sampling
∙ Confidence intervals and hypothesis testing: how to perform validation analysis
∙ Correlation and regression analysis: basic models for general data analysis

Statistical modelling is a step which comes after data cleansing. The most important parts are model selection, configuration, prediction, evaluation and presentation.

Let us look at each one of these in brief.

1) Model Selection

∙ One among many machine learning algorithms may be appropriate for a given predictive modelling problem. The process of selecting one method as the solution is called model selection.
∙ This may involve a suite of criteria, both from stakeholders in the project and from the careful interpretation of the estimated skill of the methods evaluated for the problem.
∙ As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection. They are:
o Statistical Hypothesis Tests: methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
o Estimation Statistics: methods that quantify the uncertainty of a result using confidence intervals.

2) Model Configuration

∙ A given machine learning algorithm often has a suite of hyperparameters (parameters passed to the statistical model which can be changed) that allow the learning method to be tailored to a specific problem.
∙ The configuration of the hyperparameters is often empirical in nature, rather than analytical, requiring large suites of experiments in order to evaluate the effect of different hyperparameter values on the skill of the model.
∙ Hyperparameters are the ones which can make or break a model. Hyperparameter tuning is a very common practice in the world of Data Science.
∙ The 2 methods by which we can do hyperparameter tuning are (a brief grid search sketch is given at the end of this section):
o Grid Search
o Random Search

3) Model Evaluation

∙ A crucial part of a predictive modelling problem is evaluating a learning method.
∙ This often requires the estimation of the skill of the model when making predictions on data not seen during the training of the model.
∙ Generally, the planning of this process of training and evaluating a predictive model is called experimental design, which is itself a whole subfield of statistical methods.
∙ Experimental Design: methods to design systematic experiments to compare the effect of independent variables on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.
∙ As part of implementing an experimental design, methods are used to resample a dataset in order to make economic use of available data and to estimate the skill of the model.
∙ Resampling Methods: methods for systematically splitting a dataset into subsets for the purposes of training and evaluating a predictive model.

4) Model Presentation

∙ Once a final model has been trained, it can be presented to stakeholders prior to being used or deployed to make actual predictions on real data.
∙ A part of presenting a final model involves presenting the estimated skill of the model.
∙ Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals.
∙ Estimation Statistics: methods that quantify the uncertainty in the skill of a model via confidence intervals.
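To tie model selection, configuration and evaluation together, here is a brief grid search sketch with scikit-learn. The dataset, algorithm and hyperparameter grid are assumptions chosen only for illustration:

    # Grid search over hyperparameters with cross-validated evaluation
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = load_breast_cancer(return_X_y=True)      # stand-in dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}   # hyperparameters to try
    search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
    search.fit(X_train, y_train)                    # model configuration via grid search

    print("Best hyperparameters:", search.best_params_)
    print("Cross-validated score: %.3f" % search.best_score_)
    print("Held-out test score:   %.3f" % search.score(X_test, y_test))   # model evaluation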

4.6 DATA VISUALISATION


Data Visualisation is the representation of information in the form of chart, diagram, picture, etc.
These are created as the visual representation of information.
Importance of Data Visualisation:

∙ Absorb information quickly


∙ Understand your next steps
∙ Connect the dots
∙ Hold your audience longer
∙ Kick the need for data scientists
∙ Share your insights with everyone
∙ Find the outliers
∙ Memorise the important insights
∙ Act on your findings quickly

The elements of a successful data visualisation include the following:

∙ It tells a visual story


∙ It’s easy to understand
∙ It’s tailored for your target audience
∙ It’s user friendly
∙ It’s useful
∙ It’s honest
∙ It’s succinct
∙ It provides context

Data science is useless if you can’t communicate your findings to others, and visualisations are
imperative if you’re speaking to a non-technical audience. If you come into a board room without
presenting any visuals, you’re going to run out of work pretty soon.

More than that, visualisations are very helpful for data scientists themselves. Visual representations are much more intuitive to grasp than numerical abstractions.

Let's consider an example.

Think of a chart which shows total air passengers across time for a particular airline. Just by glancing at such a chart for two seconds, we immediately recognise a seasonal pattern and a long-term trend. Identifying those patterns by analysing the numbers alone would require decomposing the signal in several steps.

Thus you require visualisations in two places:

∙ You need to understand the data yourself so you need to create visualisations which will
probably never be shared.
∙ You need to get the data’s story across and visualisation is usually the best way to go.

Visualisations are helpful in both the pre-processing and post-processing stages. They help us understand our datasets and results in the form of shapes and objects, which is somehow more real to the human brain, as in the brief plotting sketch below.
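As a brief plotting sketch, the classic monthly air-passenger series can be drawn in a few lines; seaborn is used here only because it ships this sample dataset (downloaded on first use), and matplotlib does the plotting:

    # Plot the air-passenger series: the trend and the seasonal pattern are visible at a glance
    import matplotlib.pyplot as plt
    import seaborn as sns

    flights = sns.load_dataset("flights")        # monthly airline passengers, 1949-1960
    plt.plot(flights["passengers"].values)       # one point per month, in time order
    plt.xlabel("Month (index from Jan 1949)")
    plt.ylabel("Passengers (thousands)")
    plt.title("Total air passengers over time")
    plt.show()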

What is the future of data visualisation?

There are currently three key trends that are probably going to shape the future of data visualisation:
Interactivity, Automation, and storytelling (VR).

1) Interactivity

Interactivity has been a key element of online data visualisation for many years, but it is now beginning to overtake static visualisation as the predominant manner in which visualisations are presented, particularly in news media. It is increasingly expected that every online map, chart and graph is interactive as well as animated.

The challenge of interactivity is to offer options that accommodate an extensive range of users and their corresponding needs, without overcomplicating the UI of the data visualisation. There are 7 key sorts of interactivity, as shown below:

∙ Reconfigure
∙ Choosing features
∙ Encode
∙ Abstract/elaborate
∙ Explore
∙ Connect
∙ Filter
2) Automation

In the past, Data Visualisation was a tedious and troublesome process. The current challenge is to automate Big Data Visualisation so that big-picture trends can be surfaced without losing sight of the details of interest.

Best-practice visualisation and design standards are vital, but there should also be a match between the kind of visualisation and the purpose for which it will be used.

3) Storytelling and VR

Storytelling with data is popular, and rightfully so. Data Visualisations are devoid of meaning without a story, and stories can be enormously enhanced when supplemented with data visualisation.

The future of storytelling might be virtual reality. The human visual system is optimised for seeing and interacting in three dimensions. The full storytelling capability of data visualisation can be explored once it is no longer confined to flat screens.
Some of the best Data Visualisation tools for Data Science are:

1) Tableau
2) QlikView
3) PowerBi
4) QlikSense
5) FusionCharts
6) HighCharts
7) Plotly

But the most important one if you are working with R is ggplot2, and its counterpart for Python is Seaborn or Matplotlib.

Let us discuss ggplot2 in a little more detail.

What is ggplot2?

Ggplot2 is a data visualisation package for the statistical programming language R, which tries to take
the good parts of base and lattice graphics and none of the bad parts.

It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as
providing a powerful model of graphics that makes it easy to produce complex multi-layered
graphics.

The 5 main reasons why you should explore ggplot are as follows:

∙ It can do quick-and-dirty and complex, so you only need one system


∙ The default colours and other aesthetics are nicer
∙ Never again lose an axis title (or get told your pdf can’t be created) due to wrongly specified
outer or inner margins
∙ You can save plots (or the beginnings of a plot) as objects
∙ Multivariate exploration is greatly simplified through faceting and colouring

Data Visualisation will change the manner in which our analysts work with data. They will be relied
upon to respond to issues more quickly and required to dig for more insights – look at information
differently, more creatively.
Data Visualisation will advance that imaginative data analysis.

Check your Progress 1


1. What are the components of Data Science Pipeline?
2. Name some Data Visualisation tools.
3. What are the four steps involved in model building?
4. What is EDA?
5. What is Data Wrangling?

Activity 1
Find and list more data visualisation tools.
Summary
∙ Data Science is a combination of multiple fields which involves creation, preparation,
transformation, modelling, and visualisation of data.
∙ Data Science pipeline consists of Data Wrangling, Data Cleansing & Extraction, EDA, Statistical
Model Building, and Data Visualisation.
∙ Data Wrangling is a step in which the data needs to be transformed and aggregated into a usable format from which insights can be derived.
∙ Data Cleansing is an important step in which the data needs to be cleansed: replacing missing values, handling NaNs, and removing outliers, along with standardisation and normalisation.
∙ Data Visualisation is a process of visualising the data so as to derive insights from it at a glance.
It is also used to present results of the data science problem.
∙ Statistical modelling is the core of Data Science problem solution. It is fitting of statistical
equations on the data at hand to predict a certain value on future observations.

Keywords
∙ Data Science Pipeline: The 7 major stages of solving a Data Science problem.
∙ Data Wrangling: The art of transforming the data into a format from which it is easier to draw insights.
∙ Data Cleansing: The process of cleaning the data of missing values, garbage, NaNs, and outliers.
∙ Data Visualisation: The art of building graphs and charts so as to understand data easily and find insights in it.
∙ Statistical Modelling: The implementation of statistical equations on existing data.

Self-Assessment Questions

1. What is Data Science Pipeline?


2. Why is there a need for Data Wrangling?
3. What are the steps involved in Data Cleansing?
4. What are the basics required to perform statistical modelling?
5. What do you mean by Data Visualisation and where is it used?

Answers to Check your Progress


Check your Progress 1

1) Components of Data Science Pipeline are:


a. Identifying the problem
b. Hypothesis testing
c. Data collection & data wrangling
d. EDA
e. Statistical Modelling
f. Interpreting and communicating results
g. Data Visualisation and Insight Generation
2) Some Data Visualisation tools are:
a. Tableau
b. Power Bi
c. R & Python
d. QlikView and QlikSense
3) 4 steps involved in model building are:
a. Model selection
b. Model configuration
c. Model evaluation
d. Model presentation
4) EDA is exploratory data analysis, which refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
5) Data wrangling is the process of cleaning and unifying messy and complex data sets for easy
access and analysis.

Suggested Reading
1. Jeffrey Stanton, An Introduction to Data Science.
2. Field Cady, The Data Science Handbook.
3. Frank Kane, Hands-On Data Science and Python Machine Learning.
4. Data Science in Practice.
Unit 5
Data Scientist Toolbox, Applications & Case Studies
Structure:
5.1 Data Scientist’s Toolbox

5.2 Applications & Case Study of Data Science

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading
Published by Symbiosis Centre for Distance Learning (SCDL), Pune

2019

Copyright © 2019 Symbiosis Open Education Society

All rights reserved. No part of this unit may be reproduced, transmitted or utilised in any form or by
any means, electronic or mechanical, including photocopying, recording or by any information
storage or retrieval system without written permission from the publisher.

Acknowledgement

Every attempt has been made to trace the copyright holders of materials reproduced in this unit.
Should any infringement have occurred, SCDL apologises for the same and will be pleased to make
necessary corrections in future editions of this unit.
Objectives
After going through this unit, you will be able to:

∙ Understand the tools inside the Data Scientist's Toolbox
∙ Know the different applications of Data Science
∙ Understand how the Data Science lifecycle works

5.1 DATA SCIENTIST’S TOOLBOX


Data scientists are responsible for discovering insights from massive amounts of structured and
unstructured data to help shape or meet specific business needs and goals. The data scientist role is
becoming increasingly important as businesses rely more heavily on data analytics to drive decision
making and lean on automation and machine learning as core components of their IT strategies.

A data scientist’s approach to data analysis depends on their industry and the specific needs of the
business or department they are working for.

Before a data scientist can find meaning in structured or unstructured data, business leaders and
department managers must communicate what they’re looking for. As such, a data scientist must
have enough business domain expertise to translate company or departmental goals into data-based
deliverables such as prediction engines, pattern detection analysis, optimisation algorithms, and the
like.

A Data Scientist's toolbox is one of a kind. It can vary from one organisation to another.

A set of commonly used, general-purpose tools is as follows:

a) R
b) Python
c) SQL
d) Tableau
e) PowerBi
f) Hadoop
g) Tensorflow
h) Apache Spark
i) Statistics

Let’s go through each one in detail.

1) R Programming

The R programming language includes a set of functions that support machine learning algorithms, linear regression, classification, statistical inference, and so on. The best algorithms for machine learning can be implemented in R.

Today, R is not only used by academia; many large companies also use the R programming language, including Google, Facebook, YouTube, and Amazon.

R is freely available under the GNU General Public License. It offers well-organised data analytics capabilities and, most importantly, it has an active online community of users to turn to for support. R is the first choice in the healthcare industry, followed by government and consulting.
R is specifically designed for Data Science needs. Data Scientists use R to solve statistical problems. R has a steep learning curve.

Some of the aspects of R programming are:

∙ The style of coding is quite easy
∙ It's open source; no need to pay any subscription charges
∙ The community support is overwhelming; there are numerous forums to help you out
∙ You get high-performance computing experience
∙ It is one of the skills most highly sought by analytics & Data Science companies
∙ Statistical analysis environment
∙ Open source
∙ Huge community support
∙ Availability of packages
∙ Benefits of charting

R has some features that are important for Data Science applications:

1) R, being a vector language, can perform many operations at once
2) R doesn't need a compiler, as it is an interpreted language
3) For statistical analysis and graphs, there is no better option than R, with capabilities such as matrix multiplication available straight out of the box
4) R provides support functions for Data Science applications
5) R's ability to translate maths into code seamlessly makes it an ideal choice for someone with minimal programming knowledge

Why is R important for data science?

∙ You can run your code without any compiler – R is an interpreted language, so code can be run without a compiler. R interprets the code, which makes development easier.
∙ Many calculations are done with vectors – R is a vector language, so a function can be applied to a whole vector without writing a loop. This makes R powerful and often faster than other languages.
∙ Statistical language – R is used in biology and genetics as well as in statistics. R is a Turing-complete language in which any type of task can be performed.

For example,

If you are interested in calculating the average of 10 numbers, you would probably write a for loop, calculate the total, maintain a counter, and compute the average. In other words, you deal with the numbers one by one.

If you are interested in applying a common formula to a set of numbers, you would probably store
the numbers in an array, use for loop, apply the operation to each number and obtain the result.

Conversely, R operates on vectors! This is critical: you have to think in vectors. The first example above can be solved using a single function, mean(set of numbers). So you are ideally not looking at each number any more; you are looking at a set of numbers, i.e. a vector, and performing an operation on it. This is called vectorisation.

As statisticians always deal with a set of data points, they have developed a statistical package called
R which basically operates on vectors. R is optimised for vectorised operations.
So, many of the statistical operations are in-built in R. With this design philosophy, handling datasets
is very natural to R.
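
Although this unit illustrates vectorisation with R, the same idea can be sketched in Python using NumPy (a hedged analogue, not part of the original text):

import numpy as np

numbers = np.array([4, 8, 15, 16, 23, 42, 7, 9, 11, 3])

# Loop-style thinking: accumulate a total and a count, one number at a time.
total, count = 0, 0
for x in numbers:
    total += x
    count += 1
print(total / count)

# Vectorised thinking: operate on the whole vector at once.
print(numbers.mean())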

Other competitive benefits of using R for data analysis include:

∙ More than 6,000 packages on CRAN spread across various domains of study


∙ Strong support on “stackoverflow” and good documentation reducing the learning curve for
beginners
∙ Availability of *almost all* machine learning packages
∙ Incredible plotting system (ggplot2).

2) Python Programming

Python is a popular open source programming language and it is one of the most-used languages in
artificial intelligence and other related scientific fields.

Machine learning (ML), on the other hand, is the field of artificial intelligence that uses algorithms to
learn from data and make predictions. Machine learning helps predict the world around us.

From self-driving cars to stock market predictions to online learning, machine learning is used in
almost every field that utilises prediction as a way to improve itself. Due to its practical usage, it is
one of the most in-demand skills right now in the job market. You don't have to be a data scientist to
be fascinated by the world of machine learning, but a few travel guides might help you navigate the
vast universe that also includes big data, artificial intelligence, and deep learning, along with a large
dose of statistics and analytics.

Also, getting started with Python and machine learning is easy as there are plenty of online resources
and lots of Python Machine learning libraries available.

For example some libraries in Python are:

∙ Theano
Released nearly a decade ago and primarily developed by a machine learning group at the Université de Montréal, it is one of the most-used CPU and GPU mathematical compilers in the machine learning community.
∙ TensorFlow
An open source library for numerical computing using data flow graphs. Although a relative newcomer to the world of open source, this Google-led project already has almost 15,000 commits and more than 600 contributors on GitHub, and nearly 12,000 stars on its models repository.
∙ Scikit-learn
A free software machine learning library for the Python programming language (a short sketch follows this list). It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to inter-operate with the Python numerical and scientific libraries NumPy and SciPy.
∙ Pandas
Pandas contain high level data structures and manipulation tools so as to make data analysis
faster and easier in Python.
∙ Seaborn
Seaborn is a statistical plotting library in Python. Whenever you use Python for data science, you are likely to use Matplotlib (for 2D visualisations) and Seaborn, which has beautiful default styles and a high-level interface for drawing statistical graphics.
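
As a hedged illustration of the scikit-learn library mentioned above (not taken from this unit), the sketch below trains a random forest on the bundled Iris dataset and reports its accuracy:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and split it into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a random forest classifier and evaluate it on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))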
The reasons why Data Scientists work on Python are:

∙ Python is a free, flexible and powerful open source language.


∙ Python cuts development time in half with its simple and easy-to-read syntax.
∙ With Python, you can perform data manipulation, analysis, and visualisation.
∙ Python provides powerful libraries for machine learning applications and other scientific computations.

Fundamentals of Python Programming:

∙ Variables

Variables refer to reserved memory locations used to store values. In Python, you don't need to declare variables before using them, or even declare their type.

∙ Data Types

Python supports numerous data types, which define the operations possible on the variables and the storage method. The list of data types includes Numeric, Lists, Strings, Tuples, Sets, and Dictionaries.

∙ Operators

Operators help to manipulate the values of operands. The list of operators in Python includes Arithmetic, Comparison, Assignment, Logical, Bitwise, Membership, and Identity operators.

∙ Conditional Statements

Conditional statements help to execute a set of statements based on a condition. There are three conditional statements – if, elif, and else.
∙ Loops

Loops are used to repeat small pieces of code. There are three kinds of loops – while loops, for loops, and nested loops.

∙ Functions

Functions are used to divide your code into useful blocks, allowing you to organise the code, make it more readable, reuse it, and save time (a small sketch tying these fundamentals together follows below).

Thus, Python for data science is a step by step process.
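
A minimal, hedged sketch (not from the original text) that touches each of the fundamentals above - variables, data types, operators, conditional statements, loops, and functions:

def average_above(values, threshold=100):
    # Return the average of the values greater than the threshold.
    selected = []
    for v in values:            # loop
        if v > threshold:       # conditional statement
            selected.append(v)
    if not selected:
        return 0.0
    return sum(selected) / len(selected)    # arithmetic operators

prices = [120.5, 99.0, 305.75, 48.2]            # variable holding a list (data type)
product = {"name": "Widget", "in_stock": True}  # dictionary (data type)
print(product["name"], "average high-value price:", average_above(prices))  # 213.125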

3) SQL

SQL is for storing, exporting, and importing data. Data scientists will frequently encounter the need
to retrieve data from a SQL data store, such as SQL Server, to complete an analytics task.
Additionally, data scientists may need to store some results created in an external package, such as
SAS, Python, or R, into a SQL Server database for documentation and subsequent re-use. To perform
these kinds of functions, you need to know how to obtain meta-data about the contents of a SQL
data store, how to query data in a SQL data store, and how to add new data structures to an existing
database, and finally how to update and insert data into a SQL database.
While SQL is all about data, it also permits program development, such as data mining based on statistical techniques or modelling based on artificial intelligence rules and/or technical indicators.

SQL is a critical tool to use in data science, mostly to prep and extract datasets.

It's easy to get "book knowledge" on SQL, such as understanding the parts and meaning of each
section of a Select query. But that’s entirely different from actually being able to solve real-world
problems, with realistic data.

SQL basics is a broad term which covers all the fundamental logic of SQL. It means you must learn about RDBMSs, triggers, data types, the relational model, and the commands in SQL. Let's have a look at the basics of SQL.

1. SQL Statements - Basically, SQL has 4 types of statements, categorised as follows:

I) DML (Data Manipulation Language) - Here you will learn the SELECT, INSERT, UPDATE, and DELETE statements. The SELECT statement is used to select rows or records. With the INSERT statement, we insert a set of values into a table. The UPDATE statement is used to update the values in a table, and the DELETE statement deletes records from a table.
II) DDL (Data Definition Language) - These statements make changes in the system catalogue tables. This covers 3 different SQL statements - CREATE, ALTER, and DROP. We use these statements to create a new table, to add or rename a column or table, and to drop (remove) data, indexes, triggers, and constraints for a table.
III) DCL (Data Control Language) - These statements control access to the data in the database. Here we use two commands, GRANT and REVOKE. GRANT allows a specific task to a specific user, and REVOKE cancels granted permissions.
IV) TCL (Transaction Control Language) - These statements manage database transactions; in other words, they manage the changes made by DML statements. We use 3 statements to do this: COMMIT, to save a transaction permanently; ROLLBACK, to restore the database; and SAVEPOINT, to temporarily save a transaction.

2. SQL Operators

∙ Operators perform an operation on one or two values or expressions. SQL has mainly 5
operators - Arithmetic, Bitwise, Compound, Logical, and Comparison operators.

3. SQL Joins

∙ Joins in SQL are used to combine records from two or more tables based on a related column. SQL has 4 different types of joins - INNER, LEFT, RIGHT, and FULL joins.

4. SQL Subquery

∙ A subquery is also called a nested or inner query. It is a query embedded within another query, often in the WHERE clause. We can use subqueries with the SELECT, INSERT, UPDATE, and DELETE statements.

5. SQL String Function

∙ We use string functions for string manipulation. Some common SQL string functions are ASCII(), BIN(), CHAR(), CHARACTER_LENGTH(), CONCAT(), and ELT().
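
As a hedged illustration of these SQL basics (not from the original text), the sketch below uses Python's built-in sqlite3 module with hypothetical tables to show DDL, DML, a join, and a subquery:

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# DDL: create two hypothetical tables.
cur.execute("CREATE TABLE stores (store_id INTEGER PRIMARY KEY, city TEXT)")
cur.execute("CREATE TABLE sales (store_id INTEGER, product TEXT, amount REAL)")

# DML: insert a few rows.
cur.executemany("INSERT INTO stores VALUES (?, ?)", [(1, "Pune"), (2, "Mumbai")])
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, "Widget", 120.0), (1, "Gadget", 80.0), (2, "Widget", 350.0)])

# JOIN: total sales per city.
cur.execute("""
    SELECT st.city, SUM(sa.amount)
    FROM sales AS sa INNER JOIN stores AS st ON sa.store_id = st.store_id
    GROUP BY st.city
""")
print(cur.fetchall())    # city-level totals, e.g. [('Mumbai', 350.0), ('Pune', 200.0)]

# Subquery: stores whose total sales exceed the average store total.
cur.execute("""
    SELECT store_id FROM sales GROUP BY store_id
    HAVING SUM(amount) > (SELECT AVG(total) FROM
                          (SELECT SUM(amount) AS total FROM sales GROUP BY store_id) AS t)
""")
print(cur.fetchall())    # [(2,)]
conn.close()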

Thus, every Data Scientist must know the basics of SQL for Data Preparation & Transformation.
4) Hadoop

Hadoop is an ecosystem of open source components that fundamentally changes the way
enterprises store, process, and analyse data.

Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same
data, at the same time, at massive scale on industry-standard hardware. CDH, Cloudera's open
source platform, is the most popular distribution of Hadoop and related projects in the world (with
support available via a Cloudera Enterprise subscription).

Apache Hadoop is an open source software framework for storage and large scale processing of data
sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used
by a global community of contributors and users. It is licensed under the Apache License 2.0.

The Apache Hadoop framework is composed of the following modules

∙ Hadoop Common
o Contains libraries and utilities needed by other Hadoop modules
∙ Hadoop Distributed File System (HDFS)
o A distributed file-system that stores data on the commodity machines, providing very
high aggregate bandwidth across the cluster
∙ Hadoop YARN
o A resource-management platform responsible for managing compute resources in
clusters and using them for scheduling of users' applications
∙ Hadoop MapReduce
o A programming model for large scale data processing

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of
individual machines, or racks of machines) are common and thus should be automatically handled in
software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.
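
As a hedged illustration of the MapReduce programming model (not from the original text), the classic word count can be written as two small Python scripts and run with Hadoop Streaming; the script names and any paths are hypothetical:

# mapper.py - emits "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - sums the counts per word (Hadoop delivers the mapper output sorted by key)
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

The pair would typically be submitted through the Hadoop Streaming jar (with options such as -input, -output, -mapper, and -reducer); the exact invocation depends on the installation.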

Beyond HDFS, YARN, and MapReduce, the entire Apache Hadoop "platform" is now commonly
considered to consist of a number of related projects as well: Apache Pig, Apache Hive, Apache
HBase, and others.

HDFS and MapReduce


There are two primary components at the core of Apache Hadoop 1.x: the Hadoop Distributed File
System (HDFS) and the MapReduce parallel processing framework. These are both open source
projects, inspired by technologies created inside Google.

HDFS terminology

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It
achieves reliability by replicating the data across multiple hosts, and hence does not require RAID
storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the
same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move
copies around, and to keep the replication of data high.
HDFS is not fully POSIX-compliant, because the requirements for a POSIX file-system differ from the
target goals for a Hadoop application. The trade-off of not having a fully POSIX-compliant file-system
is increased performance for data throughput and support for non-POSIX operations such as Append.

5) Apache Spark

Apache Spark is an open-source cluster computing framework for real-time processing. It has a
thriving open-source community and is the most active Apache project at the moment.

Spark provides an interface for programming entire clusters with implicit data parallelism and fault
tolerance.

It was designed to work with Hadoop and extends the MapReduce model to efficiently support more types of computations.

Let us look at the features in detail:

∙ Polyglot:

Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any
of these four languages. It provides a shell in Scala and Python. The Scala shell can be
accessed through ./bin/spark-shell and Python shell through ./bin/pyspark from the installed
directory.

∙ Speed:

Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing.
Spark is able to achieve this speed through controlled partitioning. It manages data using
partitions that help parallelise distributed data processing with minimal network traffic.

∙ Multiple Formats:

Spark supports multiple data sources such as Parquet, JSON, Hive, and Cassandra apart from
the usual formats such as text files, CSV, and RDBMS tables. The Data Source API provides a
pluggable mechanism for accessing structured data though Spark SQL. Data sources can be
more than just simple pipes that convert data and pull it into Spark.

∙ Lazy Evaluation:

Apache Spark delays its evaluation till it is absolutely necessary. This is one of the key factors
contributing to its speed. For transformations, Spark adds them to a DAG (Directed Acyclic
Graph) of computation and only when the driver requests some data, does this DAG actually
gets executed.

∙ Real Time Computation:

Spark’s computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability; the Spark team has documented users running production clusters with thousands of nodes, and Spark supports several computational models.

∙ Hadoop Integration:

Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the Big Data
engineers who started their careers with Hadoop. Spark is a potential replacement for the
MapReduce functions of Hadoop, while Spark has the ability to run on top of an existing
Hadoop cluster using YARN for resource scheduling.

∙ Machine Learning:

Spark’s MLlib is the machine learning component which is handy when it comes to big data
processing. It eradicates the need to use multiple tools, one for processing and one for
machine learning. Spark provides data engineers and data scientists with a powerful, unified
engine that is both fast and easy to use.
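
To make this concrete, here is a minimal, hedged PySpark sketch (not from the original text); the input path and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("retail-demo").getOrCreate()

# Read a hypothetical CSV of sales records into a distributed DataFrame.
sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Build the aggregation lazily; nothing runs until an action such as show() is called.
totals = sales.groupBy("store_id").agg(F.sum("amount").alias("total_amount"))
totals.orderBy(F.desc("total_amount")).show(10)

spark.stop()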

6) Tensorflow

Tensorflow is a library for creating deep learning models. I have used Tensorflow to count people in videos, to create a chatbot for a company website, to recognise handwritten text in images, and to create recommendation systems for articles.

To learn TensorFlow, you should understand the meaning of Tensor and how it works. Tensor is a
mathematical object which is represented as arrays of higher dimensions. Such arrays of data differ
with different ranks and dimensions. These arrays are fed as input to the neural network. TensorFlow
was developed initially to run large sets of numerical computations. It uses a data flow graph to
process data and perform computations.

Tensorflow operates on the basis of a computational graph. The graph consists of nodes and edges, where each node performs a mathematical operation such as addition, subtraction, or multiplication, while the edges carry the data.
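
A minimal, hedged TensorFlow sketch (not from the original text) showing tensors and an operation on them; note that recent TensorFlow versions execute operations eagerly while still allowing graphs to be built:

import tensorflow as tf

# Tensors: multi-dimensional arrays of a given rank and shape.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

# A node performing a mathematical operation (matrix multiplication); edges carry the tensors.
c = tf.matmul(a, b)
print(c.numpy())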

5.2 APPLICATIONS & CASE STUDY OF DATA SCIENCE

Let us take a look at a Data Science Case Study in retail domain. To understand the case, it is
imperative to understand the retail supply chain.

Keeping the supply chain flow in mind, let us now look at the case study in depth.

Goal

A Fortune 500 retail company approaches you and describes the problem of Demand Forecasting and Inventory Management. The demand forecasts that they currently receive as an input are not accurate and, as a result, there are problems with inventory as well. They want you to build a customised demand forecasting and inventory planning solution that will reduce out-of-stocks at their stores and put inventory in the right places.
Approach taken:

The data was gathered from the respective databases using SQL queries, and decisions about the databases were taken. As this was structured data, SQL databases were needed to store it.

The data is at a store-product-day level and goes back to 1990, so a Hadoop cluster would be required for processing the data and storing the results.

All forecasting in retail depends on a degree of aggregation. The aggregations could be on product
units, location or time buckets or promotion according to the objective of the forecasting activity.
We need the forecast at Day-Store-Product level. So we will be forecasting at that level.

Retail sales data often exhibit strong trend, seasonal variations, serial correlation, and regime shifts, because any long span of data may include periods of economic growth, inflation, and unexpected events.

Time series models have provided a solution to capturing these stylised characteristics. Thus, time
series models have long been applied for market level aggregate retail sales forecasting.

Simple exponential smoothing and its extensions to include trend and seasonal (Holt-Winters), and
ARIMA models have been the most frequent time series models employed for market level sales
forecasting. Even in the earliest references, reflecting controversies in the macroeconomic
literature, the researchers raised the question of which of various time series models performed
best and how they compared with simple econometric models.

So for this case study, we will be trying different time series models.
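
As a hedged sketch of one such model (not the project's actual code), Holt-Winters exponential smoothing can be fitted with the statsmodels library; the file name, column names, and seasonal period are hypothetical:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical daily sales for one store-product combination.
sales = pd.read_csv("store1_product42.csv", parse_dates=["date"], index_col="date")["units"]

# Additive trend and weekly seasonality (seasonal_periods=7 for daily data).
model = ExponentialSmoothing(sales, trend="add", seasonal="add", seasonal_periods=7).fit()

# Forecast the next 28 days.
forecast = model.forecast(28)
print(forecast.head())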
The steps for this case study are:

1) Analyse whether the data has any missing values. If yes, should they be replaced by mean values?
2) Are there any NaNs in the data? If yes, should they be replaced by mean values?
3) Check the data types of the variables.

4) Do EDA on the data

∙ Try to see if the data is showing trends across time


∙ Check if there exists any kind of seasonality in the dataset
∙ Are particular products showing jumps in sales during only certain periods?
∙ Is cannibalisation of products happening?
∙ Which are the best and worst selling products?
∙ Which stores are best and worst with respect to sales?
∙ How many stores are above and below the average sales number?
∙ Are all the products showing similar trends in sales?
∙ Can we group certain products by their sales trend?

5) Run a forecast on data with statistical models and choose the best model across products (after
grouping products into clusters which show similar sales trend)

6) Evaluate the results on the past data at hand

7) Present the results and measure the impact

8) The Outcome - Improved productivity and collaboration within the client organisation

∙ 8% improvement in product availability during the seasonal sales leading to increased revenue
to the tune of £ 1M annually.
∙ A stable and repeatable process for 6,000-8,000 SKUs is currently being used across the online channel.
∙ Forecasts served as a benchmark to measure the lift from promotional activities.
∙ The new process also ensured closer collaboration between various teams within the client organisation, such as allocation & replenishment, trading, merchandising, and operations development.

The above impact is shown just as an example of what the case study output should look like.

We can also include facts like the forecast accuracy increased by about 15% and that led to greater
sales of products by about 5%.

This has a direct impact on the business audience, which in turn lets you uncover and solve more and more problems in the business.
Check your Progress 1
1. What are the components of Data Scientist’s toolbox?
2. What does open source mean?
3. What are big data technologies a data scientist should know?

Activity 1
1. Find and list the commands used in MySQL database.

Summary
∙ Data Science is a combination of multiple fields which involves creation, preparation,
transformation, modelling, and visualisation of data.
∙ Data Science pipeline consists of Data Wrangling, Data Cleansing & Extraction, EDA, Statistical
Model Building and Data Visualisation.
∙ A Data Scientist's toolbox consists of a wide variety of tools, covering everything from data preparation to data visualisation.
∙ Any data science problem should have a solution that can depict the business impact the solution might have.
∙ The more the impact can be shown in terms of revenue, the better it is.

Keywords
∙ Data Science Pipeline: The 7 major stages of solving a Data Science problem.
∙ Data Wrangling: The art of transforming the data into a format from which it is easier to draw insights.
∙ Data Cleansing: The process of cleaning the data of missing values, garbage, NaNs, and outliers.
∙ Data Visualisation: The art of building graphs and charts so as to understand data easily and find insights in it.
∙ Statistical Modelling: The implementation of statistical equations on existing data.
∙ Data Scientist Toolbox: The list of technologies needed by a Data Scientist to solve a Data Science problem.

Self-Assessment Questions
1. Write a short note on:
1. Hadoop
2. Python
2. Explain the features of SQL.
3. What are the applications of Data Scientist’s toolbox?

Answers to Check your Progress


Check your Progress 1

1) Components of Data Scientist’s toolbox are:


a. R & Python
b. Hadoop & Spark
c. SQL & NoSQL Databases
d. PowerBi, Tableau and other visualisation tools
e. Statistics
2) Open source means that the software is available freely and can be distributed, with changes made to the source code as per one's requirements.
3) The big data technologies a Data Scientist should know are Hadoop and Spark.

Suggested Reading

1. Jeffrey Stanton, An Introduction to Data Science.


2. Field Cady, The Data Science Handbook.
3. Frank Kane, Hands-On Data Science and Python Machine Learning.
4. Data Science in Practice.
