DATA SCIENCE
Institute of
Management Technology
IMT Centre for Distance Learning, Ghaziabad
© Copyright 2018 Publisher
ISBN: 978-93-86052-63-6
This book may not be duplicated in any way without the express written consent of the publisher, except in the form of brief excerpts or quotations for the purposes of review. The information contained herein is for the personal use of the reader and may not be incorporated in any commercial programs, other books, databases, or any kind of software without written consent of the publisher. Making copies of this book or any portion of it for any purpose other than your own is a violation of copyright laws. The author and publisher have used their best efforts in preparing this book and believe that the content is reliable and correct to the best of their knowledge. The publisher makes no representation or warranties with respect to the accuracy or completeness of the contents of this book.
Brief Contents
1. INTRODUCTION TO DATA SCIENCE.......................................................................................................... 1
3.
4.
IMPLEMENTATION OF DECISION-MAKING AND SUPPORT..................................................................75
6. MACHINE LEARNING............................................................................................................................143
9. OPTIMIZATION......................................................................................................................................213
13. HADOOP................................................................................................................................................293
16. PIG........................................................................................................................................................351
END NOTES......................................................................................................................................................375
Table of Contents
CHAPTER 1: INTRODUCTION TO DATA SCIENCE .............................. 1
Statistical Inference ................................................ 41
in Key Areas ......................................................... 82
Telecommunication .................................................... 82
Bioinformatics ....................................................... 83
Engineering .......................................................... 83
Healthcare ........................................................... 84
Information and Communication Technology ............................. 86
Logistics ............................................................ 86
Process Industry ..................................................... 87
Summary .............................................................. 88
Predictive Analytics ................................................ 109
Logic Driven Models ................................................. 110
Summary ............................................................. 111
Exercise ............................................................ 111
Case Study .......................................................... 113
CHAPTER 5: DATA WAREHOUSING ........................................ 115
Introduction ........................................................ 116
CHAPTER 6: MACHINE LEARNING ........................................ 143
Information Extraction .............................................. 174
Clustering .......................................................... 174
Categorization ...................................................... 175
Exercise ............................................................ 205
Case Study .......................................................... 207
Lab Exercise ........................................................ 209
CHAPTER 9: OPTIMIZATION ............................................ 213
Introduction ........................................................ 214
Introduction ........................................................ 230
Introduction ........................................................ 248
What is Big Data? ................................................... 248
Advantages of Big Data .............................................. 249
Various Sources of Big Data ......................................... 250
History of Data Management – Evolution of Big Data .................. 251
Structuring Big Data ................................................ 253
Exercise ............................................................ 267
Use of Big Data in Social Networking ................................ 276
Use of Big Data in Preventing Fraudulent Activities ................. 278
Preventing Fraud using Big Data Analytics ........................... 279
Use of Big Data in Banking and Finance .............................. 281
Big Data in Healthcare Industry ..................................... 282
Big Data in Entertainment Industry .................................. 283
Use of Big Data in Retail Industry .................................. 284
Use of RFID Data in Retail .......................................... 284
Use of Big Data in Education ........................................ 286
Summary ............................................................. 287
Exercise ............................................................ 288
Case Study .......................................................... 290
CHAPTER 13: HADOOP ................................................. 293
Introduction ........................................................ 294
Hadoop .............................................................. 294
Real-Time Industry Applications of Hadoop ........................... 294
Hadoop Ecosystem .................................................... 295
Hadoop Architecture ................................................. 296
NameNodes and DataNodes ............................................. 311
The Command-Line Interface .......................................... 311
Using HDFS Files .................................................... 312
HDFS Commands ....................................................... 313
The org.apache.hadoop.io Package .................................... 313
HDFS High Availability .............................................. 314
Features of HDFS .................................................... 315
Data Integrity in HDFS .............................................. 317
Features of HBase ................................................... 317
Differences between HBase and HDFS .................................. 318
Summary ............................................................. 318
Exercise ............................................................ 319
Case Study .......................................................... 320
CHAPTER 15: INTRODUCING HIVE ....................................... 323
Introduction ........................................................ 325
Hive ................................................................ 325
Hive Services ....................................................... 327
Hive Variables ...................................................... 328
Hive Properties ..................................................... 328
Hive Queries ........................................................ 329
Using the WHERE Clause .............................................. 340
Using the GROUP BY Clause ........................................... 341
Using the HAVING Clause ............................................. 341
Using the LIMIT Clause .............................................. 341
Executing HiveQL Queries ............................................ 342
Summary ............................................................. 343
Case Study .......................................................... 345
Lab Exercise ........................................................ 348
Summary ............................................................. 367
Exercise ............................................................ 368
Case Study .......................................................... 369
Lab Exercise ........................................................ 373
CHAPTER 1
Introduction to Data Science
Topics Discussed
Introduction
What is Data?
What is Data Science?
Components of Data Science
Data-Driven Decision Making
Data Science and Business Strategy
Data-analytic Thinking
Self-Instructional
Material
DATA SCIENCE
INTRODUCTION
Data science1 is a multidisciplinary field in which data inference, algorithms, and innovative technologies are used to solve complex analytical problems. In simple terms, data science is a way of using data creatively to generate the most value for a business.
Data science has been used in solving many complex problems. It also helps in data analysis, data
cleaning, data modeling and data prototyping. Apart from that, data science is being used in many
essential tasks such as Internet searching, digital advertising, image recognition, speech recognition,
gaming, airline route planning, fraud and risk detection, etc.
In this chapter, we will first discuss the concepts of data and data science. Further, the chapter discusses data-analytic thinking. It next discusses business problems and data science solutions. Towards the end, it discusses data science and business strategy.
WHAT IS DATA?
In simple terms, data2 are raw facts and figures that are generally gathered in a systematic manner for some kind of analysis. Data can be in the form of characters, images, numbers, sounds, voice, etc. Whatever the format of the data, the most important thing is that it should be put into a meaningful context; otherwise, the data would be of no use.
Regardless of its format, data will always be stored in the computer in the pattern of just two
numbers, i.e., 0s and 1s. The smallest unit of data is called a bit which represents a single value. Eight
bits of data is called a byte. Data is measured in bytes, kilobytes, megabytes, gigabytes, etc.
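These units can be made concrete with a small sketch. The snippet below is illustrative only (Python, not part of the book's R lab material): it shows that each ASCII character occupies one byte, and that one byte is eight bits.

```python
# Minimal sketch of the storage units described above: every piece of
# data is ultimately stored as 0s and 1s, and 8 bits make one byte.
text = "data"
raw = text.encode("ascii")     # each ASCII character occupies one byte
print(len(raw))                # number of bytes: 4
print(len(raw) * 8)            # number of bits: 32
print(format(raw[0], "08b"))   # the first byte, 'd', as its 8-bit pattern
```

Larger units follow the same pattern: a kilobyte is roughly a thousand bytes, a megabyte roughly a million, and so on.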
Data science is used in many fields such as statistical computing, statistical modeling, data technology,
data research, data consulting, real-world application, scientific methods and visualisation.
FIGURE 1.1 Data Science (encompassing Visualization, Scientific Methods, Statistical Modeling, Data Consulting, Data Technology and Data Research)
Source: https://www.google.com/search?q=Data+Science&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjnlZTp8tzbAhWWUn0KHXL3DHMQ_AUICygC&biw=1366&bih=662#imgrc=amIvbQtCF6xY4M:
Components of Data Science
Basically, there are three components of data science:
1. Organizing the data
2. Packaging the data
3. Delivering the data
DATA-DRIVEN DECISION MAKING
Data-driven decision making, also called data-driven decision management or data-directed decision making, is the process of selecting the most logical choice from a significant amount of verified and analysed data. With the help of modern technologies, an individual is able to calculate the
number of calories consumed and the number of steps taken in a day, the amount of money spent on different items in a month, and many other data points. Gathering this type of information helps an individual make better decisions in future and improve his/her style of living. Similarly, to survive in today's competitive world, a data-driven culture has become the need of the hour. Implementing data-driven decisions in organisations results in higher efficacy as compared to decisions based on gut instinct (intuition) or past examples.
Regardless of the type of organisation, whether a multinational or a small-scale business firm, a data-driven culture can help in answering critical business questions such as "What measures or improvements can be taken to increase customer satisfaction?" Some organisations believe that they already possess a data-driven culture due to the presence of various reports and statistics that cover most of the organisation's elements. However, a data-driven culture is a lot more organised and different from these traditional measures.
Data-driven decisions help in the prior determination of different risks and opportunities, which leads to better productivity and profitability for an organisation. They also eliminate human error to a great extent and assist an organisation in implementing the best-fit solutions. They provide legitimate evidence to the authorities and other stakeholders about the stability, flaws and limitations of an organisation. Some other advantages of data-driven decision making are listed as follows:
It identifies the risks that need to be eliminated, or at least minimized, at an early stage.
It assists in analysing existing and upcoming products and services in order to fulfil current business needs and meet customer expectations.
It helps in predicting the new technologies that need to be added.
Figure 1.3 shows the steps that need to be followed for effective data-driven decision making:
FIGURE 1.3 Steps of Effective Data-Driven Decision Making
problem has again been encountered and try to prevent the same in future.
If you do not find similar problems in the past records, you can also review problems that are somewhat similar to the current one. This practice helps in speeding up decision making.
3. Assemble the required data: Data does not have limits or restrictions and is scattered all over the organisation in a raw form such as alphabets, integers, etc. Each slice of data contributes some meaningful information. Selecting the appropriate quality and quantity of data for addressing the problem leads to sound decision making. The majority of data in an organisation is collected from everyday operations; however, this routine information is not sufficient for finalising critical decisions. Hence, you need to collect data by conducting research and observations from various internal as well as external data sources.
4. Inspect the data: After finishing the third step, you will be left with a huge amount of complex data. Now, you need to inspect this data to identify various patterns and trends. You need to find the data that answers your questions. Today, there are many business intelligence tools that help in analysing this data and generating different statistics and trends. These tools are equipped with the latest technologies that transform complicated data sets into simplified displays.
5. Document possible solutions: Determine the best alternatives and solutions by interpreting
the results generated from the previous step. Depending upon the complexities of data,
more skilled officials are required to document the best alternatives or solutions.
6. Discuss and analyse the best solutions: Once the possible solutions to the identified problem
have been documented, they need to be discussed with different authorities before being
implemented. You can use the predictive analytics tools for predicting the results after
implementing different solutions. Once analysed, select the best possible alternative.
7. Finalize decision: The last step is to finalize the decision for the identified problem or some
situation. You need to finalize one of the best alternatives generated from step 6.
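The flow of these steps can be sketched as a minimal pipeline. The sketch below is illustrative only; every function name and data value in it is hypothetical, not from the text.

```python
# Hypothetical sketch of the data-driven decision steps described above.

def assemble_data(sources):
    """Step 3: gather raw records from internal and external sources."""
    return [record for source in sources for record in source]

def inspect_data(records):
    """Step 4: derive a simple pattern from the data, here just an average."""
    return sum(records) / len(records)

def choose_best(solutions, score):
    """Steps 5-7: rank the documented alternatives and finalize one."""
    return max(solutions, key=score)

internal = [10, 12, 14]          # e.g. everyday operational metrics
external = [11, 13]              # e.g. market research observations
trend = inspect_data(assemble_data([internal, external]))
decision = choose_best(["expand", "hold"], score=lambda s: len(s))
print(trend, decision)
```

In a real organisation each of these functions would be backed by business intelligence and predictive analytics tools rather than toy arithmetic, but the sequence of steps is the same.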
Exhibit 1
In 2009, a pandemic named H1N1, also called swine flu, was discovered. It was spreading very fast, and no vaccine was available against it. Moreover, it was almost impossible to develop a vaccine or cure in a short span of time. Surprisingly, a few weeks before it was declared a pandemic (an epidemic that has the possibility of spreading throughout the world) by the CDC (Centers for Disease Control and Prevention), Google's Flu Trends Web service predicted the pace and location of its infections. This was only possible due to the company's capability to deduce meaningful conclusions using Big Data that pours in at the rate of 24 petabytes per day.
DATA SCIENCE AND BUSINESS STRATEGY
Business strategies and data science are two different components. Data science is related to organizing the huge amount of Big Data within the company, whereas business strategies determine how to use this data to improve productivity. Hence, organisations should not look for strategies merely to store Big Data; instead, they should look for a business strategy that efficiently uses this Big Data. A primary measure that assists in determining an effective business strategy is to change the way the business operates. For instance, an organisation can change its business-related decisions and also hire a skilled data scientist who monitors the organisation's database to gain different insights. An organisation must follow the three Rs to implement a business strategy, as shown in Figure 1.4:
FIGURE 1.4 Implementing a Business Strategy: the Right Data, the Right Modeling Capability and the Right Transformational Methods
Let us study one of the major components of business strategy, that is, data-analytic thinking.
DATA-ANALYTIC THINKING
An organisation needs to select the exact amount of data required. Most organisations are not even able to handle their basic data, such as customer transaction data, internal supply chain management data and other performance-related data. Data analytics tasks do not end with handling the organisation's own data. An organisation cannot create value by just handling its own data. If it wants to be competitive and survive in today's world, it needs to collect external data as well. It needs to consider what other data sources are available and should bring external data into play. Some examples of external data sources for different organisations are weather or climate data, traffic pattern data, competitors' data, and prices of different products in the market. The choice of data, its source and how to receive it directly impacts the success of an organisation.
Data analytics is a highly mathematics-intensive process and can only be performed with the help of skilled professionals called data scientists. One of the key challenges for an organisation is to find a highly skilled data scientist suited to the organisation's statistical technology. Data analytics has been classified into three different categories, explained as follows:
Descriptive analytics: It is used to describe past events. It consists of collecting and then organising the data. Later, this data is depicted on a graph plot and is used to see the characteristics of a dataset. It is used for categorizing the data into different groups, such as user groups. With the help of past events, it assesses the organisation's performance. However, it does not provide information related to any future events or the cause of any past event.
Predictive analytics: It makes use of historical data models to predict future events. Initially, the relationships between different elements are established, followed by forecasting the dependent element. For example, how many users will purchase a specific product during its campaign or discount period?
Prescriptive analytics: It specifies the actions that need to be implemented. It further includes
two more components named design and optimisation. Design helps in answering the WHY
questions associated with different products and services. Optimisation refers to the process
of achieving the highest level of profitability for a product or service.
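The contrast between the first two categories can be made concrete with a small sketch. The following Python snippet is illustrative only (the sales figures are invented): descriptive analytics summarises what already happened, while predictive analytics fits a simple trend line through the history and forecasts the next period.

```python
# Toy sales history: units sold in each of the past four months.
sales = [100, 110, 120, 130]

# Descriptive analytics: characterise what already happened.
mean_sales = sum(sales) / len(sales)
print("average monthly sales:", mean_sales)

# Predictive analytics: fit a least-squares straight line through the
# history and extrapolate it one period ahead.
n = len(sales)
xs = range(n)
x_mean = sum(xs) / n
slope = sum((x - x_mean) * (y - mean_sales) for x, y in zip(xs, sales)) \
        / sum((x - x_mean) ** 2 for x in xs)
intercept = mean_sales - slope * x_mean
forecast = intercept + slope * n          # prediction for month 5
print("forecast for next month:", forecast)
```

Prescriptive analytics would go one step further and recommend an action (for example, how much stock to order) based on such a forecast.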
A framework to resolve business problems with the help of data science comprises five steps, as shown in Figure 1.5:
FIGURE 1.5 Steps for Resolving Business Problems using Data Science
Let us discuss these steps one by one.
To resolve such issues, an organisation must dive into the feedback provided by the customers
with a focus towards improving a single issue at an instant. These measures help in identifying the
business problems more accurately. Hence, the identified business problem comes out to be:
“What needs to be done when previous users have stopped purchasing a product or service?”
The businesses need to follow more advanced deep learning technologies instead of the traditional
ones. Consider the following business problem:
“What measures can be taken to acquire more customers for buying a product or service?”
With the above business problem, the analytics objective needs to be set as determining the factors that will lead to more buyers for a product or service.
In this step, proper analysis for data preparation from numerous sources needs to be done. For example, an e-commerce business can use its database or transactional data. It can also take the help of Google Analytics and various social platforms that capture Web behaviour.
Before the implementation of the developed model, businesses need to evaluate its performance. This step allows them to optimise their model with real-time data before its marketing or campaigning. In this final step, a business can decide on further improvements once the developed model has been tested for its performance.
Summary
In this chapter, we have first discussed the concept of data, data science and the components of data science. Further, the chapter discussed data-analytic thinking. Moreover, it discussed business problems and data science solutions. Towards the end, it discussed data science and business strategy.
Exercise
Multiple-Choice Questions
Q1. Which of the following is not a component of data science?
a. Collecting data
T
b. Organizing data
c. Delivering data
d. Packaging data
Assignment
Q1. The amount of data is increasing day by day, and various analytical tools are required to analyze it. What are the various analytical tools available in the market for analyzing data?
Q2. As a data scientist, you have to transform the unstructured data of your company into various forms that can be used for business value. What are the various branches of science that you have to deal with for analyzing the data?
Q3. What is data-driven decision making?
Q4. Explain different types of data-analytics.
References
https://towardsdatascience.com/examples-of-applied-data-science-in-healthcare-and-e-
commerce-e3b4a77ed306
https://www.techopedia.com/definition/32877/data-driven-decision-making-dddm
Answers for Multiple-Choice Questions
1. a.
2. c.
CASE STUDY
DATA STRATEGY IN A FINANCIAL COMPANY
This case study discusses how a financial company (FC) adopted Data Science methods to better target its customers.
The accounting division of a well-established financial company (say FC) wanted to launch a new
business pertaining to small business lending by targeting customers who were already using their
products. FC wanted a solution that could address its strategic goals, and build and demonstrate
the business value derived from the data-driven analytical capabilities in order to be able to launch
the new business as per the scheduled date. The accounting division already had a referral lending
business. This business had generated $350 million in outstanding loans over the last three years.
FC had marketed the loans to customers using direct mails (5,00,000 mails per month) targeted at
customers based on the information, such as number of employees, number of years in business,
location and industry codes. However, the rate of conversion of loan applications into approval was
low because of insufficient targeting and high drop-off and low approval rates.
FC wanted to make profits by entering into small business lending as the already established lenders
were extra cautious while lending. FC was in a very favorable position because it could use its
proprietary accounting, finance, payroll and payments data from their customer base. In addition,
FC had its own suite of software products that provided an effective channel for highly targeted
customer acquisition using in-app marketing.
After SVDS, the data science consultancy engaged for this project, took charge, it implemented the following approach:
A monthly marketing campaign process was created by SVDS. This process targeted customers
based on need.
SVDS’s analytics team helped FC in gaining a better understanding of the customer universe
and which customers may most likely need a loan. This led to a fourfold improvement in FC’s
lending business.
There were certain complexities in bringing a new data product to the market. For instance, the
acquisition of datasets on which analysis is done was quite complicated. This had a direct impact
on customer profiling and targeting activities.
SVDS’s team collaborated with FC’s Data Science and business teams to develop features for
the credit and risk models. These credit and risk models formed the basis for FC’s new loan
product business.
SVDS helped create a process that could generate features based on customer transactions.
This enabled FC in easily updating, aggregating and consuming data in a repeatable and scalable
way.
Data analytics capability developed by SVDS helped FC in effectively targeting customers based
on feature generation.
SVDS also developed data analytics ability for data ingestion for model development, which enabled FC to make better decisions related to its new lending business.
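The "features based on customer transactions" mentioned above can be illustrated with a small sketch. The snippet below is purely hypothetical (field names, amounts and the aggregation choices are all invented for illustration); it shows the general shape of turning raw transactions into per-customer model features.

```python
# Hypothetical sketch of transaction-based feature generation.
from collections import defaultdict

transactions = [
    {"customer": "A", "amount": 120.0},
    {"customer": "A", "amount": 80.0},
    {"customer": "B", "amount": 40.0},
]

def build_features(txns):
    """Aggregate raw transactions into per-customer model features."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for t in txns:
        totals[t["customer"]] += t["amount"]
        counts[t["customer"]] += 1
    return {c: {"total_spend": totals[c],
                "txn_count": counts[c],
                "avg_spend": totals[c] / counts[c]}
            for c in totals}

features = build_features(transactions)
print(features["A"])
```

In a production pipeline such aggregations would run repeatably and at scale over the full transaction history, which is what makes the resulting features usable for credit and risk models.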
The difference between the previous and new marketing strategies of FC and their impact on
customer targeting is shown in the following Figure 1.6:
FIGURE 1.6 Previous vs. New Marketing Funnel: Targeted Customers → Responding Customers → Converted Customers
Benefits realized by FC were as follows:
New data sources were used to create new businesses and products.
Data products across different Lines of Business (LOB) and platforms could be unified by
creating a sustainable data pipeline.
New marketing strategy adopted by FC was more efficient and advanced and it helped in
yielding four times the results as against the previous marketing strategy.
Source: https://cdn2.hubspot.net/hubfs/2464317/02_Case%20Studies/FinancialSoftware_DataScience.pdf
Questions
1. Why did FC want to adopt Data Science technologies?
(Hint: FC is required to build data-driven capabilities to launch the new lending business.)
2. Briefly describe the approach adopted by SVDS and its impact on FC.
(Hint: Data analytics capability developed by SVDS helped FC in effectively targeting
customers based on feature generation. New marketing strategy adopted by FC helped in
yielding four times the results as against the previous marketing strategy.)
LAB EXERCISE
R is a cross-platform programming language as well as a software environment for statistical computing and graphics. Generally, it is used by statisticians and data miners for developing statistical software and doing data analysis. The R language was developed by the R Development Core
Team. It is a GNU project, which is freely available under the GNU General Public License and its pre-
compiled binary versions are provided for various operating systems. R programs can be compiled
and run on a wide variety of UNIX platforms, Windows and MacOS.
R is an interpreted language, so it uses a command line interpreter for execution of commands. The
programming features of the R language are as follows:
R supports matrix arithmetic.
R includes data structures like scalars, vectors, matrices, data frames, and lists.
R provides an extensible object system, which includes the objects for regression models, time
series, and geo-spatial coordinates.
R provides support for procedural programming with functions, and supports object-oriented
programming with generic functions.
R language can also be used with several other scripting languages such as Python, Perl, Ruby,
F#, and Julia.
Some popular text editors and integrated development environments (IDEs) that support R
programming development are: ConTEXT, Eclipse (StatET), Emacs (Emacs Speaks Statistics),
Vim, jEdit, Kate, RStudio, WinEdt (R Package RWinEdt), Tinn-R, and Notepad++.
A package is a collection of functions and datasets. To access the contents of a package,
you have to load it first. The R language provides two types of packages: standard (also called
base packages) and contributed packages (user-defined packages). Standard packages are
considered as an in-built part of the R source code. These packages consist of basic functions
that allow R to work, and the datasets and standard statistical and graphical functions. On the
other hand, contributed packages are written by different users as they are user-defined. These
packages are developed primarily in R, and sometimes in Java, C, C++, or FORTRAN.
LAB 1
Exploring RGUI
RGUI stands for R Graphical User Interface. You can download and install it from the official website
of R, www.r-project.org. After installation, you see the R icons on your system's desktop
and in the Programs menu. Open the editor window of RGUI by navigating the following path:
Developing a Program
Let’s now learn to write a simple program in R. The following code showing the use of the print()
function to display “Hello world” in the R Console window:
You can see that the code begins with the > symbol, and the output line begins with [1].
Quitting R
You can quit an active R session by entering the q() command. The following code shows the use of the q() command in the console after the command prompt (>):
> q()
Whenever you execute the q() command, the Question dialog box appears asking whether or not
you wish to save your work. Besides RGUI, there is another popular IDE for R programs development,
i.e., RStudio. Let us now learn about RStudio.
Exploring RStudio
RStudio is a code editor and development environment, which provides a set of integrated tools
to develop productive R programs. It provides a console, syntax-highlighting editor, and tools for
plotting history, debugging, and managing workspace. You can download and install RStudio from
its official website, i.e., http://www.rstudio.org/.
Code highlighting gives different colors to keywords and variables, so that these words can be
easily recognized and differentiated from other text.
Automatic bracket matching keeps a check on the opening and closing of brackets.
Easy access to R Help allows you to understand the role of various functions and other concepts
of R.
Easy exploration of variables, values, functions, packages, etc.
After installation, you can see the RStudio icon on your desktop and in the Programs menu of Windows. Open the editor window of RStudio by navigating the following path:
The RStudio window opens (Figure 1.7).
Select File → New File → R Script. This action opens four panes in the RStudio window, as shown in Figure 1.7:
FIGURE 1.7 Opening RStudio with Different Panes
In Figure 1.7, you see the following panes in the RStudio window:
Script pane: Refers to an editor pane at the top left corner of the RStudio window. This pane is used to edit and save a collection of commands or scripts.
Console pane: Refers to a command pane at the bottom left corner of the RStudio window. In
this pane, you can enter commands after the > prompt.
Workspace/History pane: Refers to a pane at the top right corner of the RStudio window. The
workspace pane allows you to see the data and values that R has in its memory. You can edit
these data or values. The History pane shows what has been typed before.
Files, Plots, Package, and Help pane: Refers to a pane at the bottom right corner of the RStudio
window. It gives access to the following tools:
• Files: This tool allows a user to browse folders and files on a computer.
• Plots: This tool allows a user to display the user's plots.
• Packages: This tool allows a user to have a view of all the installed packages.
• Help: This tool allows a user to browse the built-in Help system of R.
Basic Arithmetic in R
You can use R like a calculator to perform complex mathematical and statistical calculations. For
example, you can perform simple arithmetic operations in R by typing the following command:
> 12 + 45 + 9 - 7
The following is the result of executing the preceding command in R:
[1] 59
R evaluates the expression in two steps:
1. 12 + 45 + 9 = 66
2. 66 – 7 = 59
Now, let's perform some complex calculations in R. Type the following command on the R Console:
> 18 + 23/2 - 5/4 * 3.5
[1] 25.125
To calculate this expression, R follows the standard rules of BODMAS, according to which the multiplication and division operations are done first, followed by the additions and subtractions. Therefore, the expression 18 + 23/2 – 5/4 * 3.5 is evaluated to 25.125.
Adding parentheses changes the result. Type the following command:
> (18 + 23/2 - 5/4) * 3.5
[1] 98.875
You can see that the output of (18 + 23/2 - 5/4) * 3.5 is evaluated to 98.875. This is because, as per the rule of BODMAS, R evaluates the expression inside the parentheses first, and then the result is multiplied by 3.5 to give the final result.
Calling Functions in R
To invoke predefined functions in R, you need to type their names on the R Console, followed by
comma-separated parameters as arguments within parentheses.
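For instance, R's built-in sum() and round() functions can be called this way (the particular values here are just for illustration):

```r
# Call sum() with three comma-separated arguments
sum(2, 7, 13)                  # [1] 22

# Call round() with a value and a named argument
round(3.14159, digits = 2)     # [1] 3.14
```

Named arguments, such as digits above, can be supplied in any order after the function name.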
Vectors
A vector can be defined as a single entity consisting of an ordered collection of numbers. For example,
a numeric vector consists of multiple numbers, such as an array. The following code shows how to
construct a vector in R:
> c(10,20,30,40,50)    Shows the implementation of the c() function to create a vector
The following is the output of executing the c() function:
[1] 10 20 30 40 50
The c() function is used to construct a vector with five integers. It should be noted that the values
or numbers written inside the parentheses are referred to as arguments.
A vector can also be created by using the ':' operator between the range of numbers. The following code shows how to create a vector in R by using the sequence (:) operator:
> 1:15
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> sum(1:15)    Generates the sum of numbers from 1 to 15
[1] 120
A sequence can also be stored in a variable:
> x <- 1:15
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The sequence 1:15 is assigned to a variable, x. The <- symbol is the assignment operator in R. To print the value of x, type x in the console and press the ENTER key.
Now, let's create a second variable, y, assign it the value 30, and add the values of x and y:
> y <- 30
> x + y
[1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
In the code, the number 30 is assigned to the variable y. Then, the x + y expression adds 30 to each number in the sequence 1 to 15. Therefore, you see the output as: 31, 32, 33, …, 45.
It should be noted that the values of the x and y variables do not change unless you assign a new
value. You can check this by entering x and y as individual commands, as shown in Figure 1.8:
Variables can also store text values. For example, you can assign the value "Hello" to a variable called msg, as shown in the code:
> msg <- "Hello"
> msg
The preceding code shows the output:
[1] "Hello"
The text “Hello” is assigned to the msg variable, and then the msg variable is invoked to display its
content.
You can also use the c() function to concatenate two text values, as shown in the code:
> hw <- c("Hello", "World!")    Combines two text values and assigns the result to the hw variable
> hw    Invokes the value of the hw variable
[1] "Hello" "World!"
The c() function is used to combine the two text values "Hello" and "World!" and assign the result to a variable, hw. Then, the hw variable is invoked to display the result.
Functions in R Workspace
Let's now learn about the following functions that allow users to handle data in the R workspace:
• The ls() function
• The save() function
> ls()
character(0)
> msg <- "Hello"
> yourname <- readline("What is your name?")
What is your name?Mary
> paste(msg, yourname)
[1] "Hello Mary"
> myObj = 25 + 12/2 - 16 + (7 * pi/2)
> myObj
[1] 25.99557
> hw <- c("Hello", "World!")
> hw
[1] "Hello" "World!"
> msg <- "Hello"
> msg
[1] "Hello"
> ls()
[1] "hw" "msg" "myObj" "yourname"
The code will show the names of all the variables created in the active session.
> rm(msg)
> ls()
[1] “hw” “myObj” “yourname”
You can see that the rm() function has removed the msg variable and the execution of the ls()
function does not list the msg variable in the current active session.
Next, save the yourname variable to a file by using the save() function:
> save(yourname, file = "yourname.rda")
It should be noted that R silently saves the file in the working directory, which means that no confirmation message is displayed in the console. However, you can check whether the file has been stored or not by selecting the File → Display file(s) option, as shown in Figure 1.9:
This action displays the list of files created in your working directory, as shown in Figure 1.10:
FIGURE 1.10 Displaying a List of Created Files
In Figure 1.10, you can see that a file, yourname.rda, is created in the MyFiles working directory of the user.
> load(“yourname.rda”)
> ls()
[1] “yourname”
> yourname
[1] “Mary”
You can see that the rm() function removes the yourname variable, and the load() function
reloads this variable in the active session.
Reading Multiple Data Values from Large Files
Till now, the data items created or read were simple as they contained a single value. Let’s now learn
to read multiple data values in R.
Example
Suppose Mr. Smith wants to read the complete data shown in Table 1.1. The data of employees
is maintained in a CSV file containing data in various fields, such as S.No., Names, Country, and
Salaries, as shown in Figure 1.11:
> read.csv()
The given syntax will read the entire CSV file and display the data on the console. You can add various
instructions with the read.csv() command, such as:
• file: to specify the file name
• sep: to provide the separator
• header: to specify whether or not the first row of the CSV file should be set as column names. By default, the value is set to TRUE
• row.names: to specify row names for the data. Generally, this will be a column in the dataset. You can set the row names by setting row.names = n, where n is the column number
To read the data from the spreadsheet shown in Figure 1.11, Mr. Smith runs the commands shown in the following code:
In Figure 1.12, each row is labelled with a simple index number. Here, the dataset is small, so this may not seem of great relevance; however, the command is of great help when you need to work with large amounts of data. Reading data from a CSV file using the read.csv() command also means that you have to type less for data entries.
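As a runnable sketch of this workflow (the file name and the sample employee data here are illustrative, not the book's actual spreadsheet):

```r
# Write a small sample CSV to a temporary file, then read it back
csvfile <- tempfile(fileext = ".csv")
writeLines(c("S.No.,Names,Country,Salaries",
             "1,Anne,UK,45000",
             "2,John,USA,52000"), csvfile)

# Read the CSV; header = TRUE takes the first row as column names
readfile <- read.csv(file = csvfile, header = TRUE, sep = ",")
readfile    # each row is labelled with a simple index number
```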
Execution of the preceding command prompts you to select the file. After selecting the file in which
the names of the employees are stored as a single column, the output is displayed as shown in
Figure 1.13:
FIGURE 1.13 Output of the read.table() Command Reading the File Separated by Spaces
Now, suppose the values saved in the employees.txt file are separated by tabs, as shown in
Figure 1.14:
To read the data from this file, you need to run the commands shown in following code:
The output of the commands listed in previous code is shown in Figure 1.15:
FIGURE 1.16 Executing the write.table() Command
The output will be saved in the mydata.txt file, as shown in Figure 1.17:
> write.csv(readfile_tab, "E:/sampledata.txt")
The data will be stored as a CSV file, as shown in Figure 1.18:
Two small data frames can be created from inline text with the read.table() function:
Data1 <- read.table(header = TRUE, text = '
ID Name
1 Anne
2 John
3 Berkeley
')
Data2 <- read.table(header = TRUE, text = '
ID Name
5 David
6 James
7 Thomson
')
rbind(Data1, Data2)    Combines the rows of Data1 and Data2
Two data frames, Data1 and Data2, are created. Then, the rbind() function is used to combine
the rows of Data1 and Data2.
FIGURE 1.20 Creating and Displaying Vectors and Data Frames
In Figure 1.20, three vectors personNames, salary, and bonus are created and displayed. All
these vectors are of equal length. Then, a data frame, company, is created by using these three
vectors and printed by using the print() function.
In Figure 1.21, the first three commands display different ways to access the salary column by using
double square brackets and the dollar ($) symbol. Next, two commands display the data of the
personNames column and the first row. Then, the head() and tail() functions are used to display the top 5 rows and the bottom 3 rows, respectively.
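The commands behind Figures 1.20 and 1.21 appear only as screenshots; a minimal sketch with assumed sample names and values would look like this:

```r
# Three vectors of equal length
personNames <- c("Anne", "John", "Mary", "David", "James", "Laura")
salary <- c(45000, 52000, 48000, 51000, 46000, 49000)
bonus  <- c(4000, 5000, 4500, 5200, 4100, 4700)

# Build a data frame from the vectors and print it
company <- data.frame(personNames, salary, bonus)
print(company)

# Different ways to access the salary column
company$salary
company[["salary"]]

# The first row, then the top 5 and bottom 3 rows
company[1, ]
head(company, 5)
tail(company, 3)
```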
Merging Data Frames
The merge() function takes two tables as arguments, with the left table being the x frame and
the right table being the y frame. The merge() function joins two tables by default on the basis of
columns with the same name. Let’s create some sample data frames, as shown in Figure 1.22:
FIGURE 1.22 Displaying Two Sample Data Frames
In Figure 1.22, two data frames, area and popuVehicle, are created.
Let’s try to merge these two data frames, as shown in Figure 1.23:
FIGURE 1.23 Merging Two Data Frames
In Figure 1.23, the merge() function is used to merge the data of the area and popuVehicle
data frames. The merged data is stored in a new data frame, mergeData. The dim() function
displays the number of columns and rows of mergeData. The colnames() and rownames()
functions display the column names and row names of mergeData, respectively.
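A minimal sketch of this merge (the column names and values are assumptions, since the book's frames appear only in the figures):

```r
# Two data frames sharing a "city" column
area <- data.frame(city = c("Delhi", "Mumbai", "Chennai"),
                   area_sq_km = c(1484, 603, 426))
popuVehicle <- data.frame(city = c("Delhi", "Mumbai", "Chennai"),
                          vehicles_lakh = c(109, 33, 57))

# merge() joins on the common column name by default
mergeData <- merge(area, popuVehicle)
dim(mergeData)       # number of rows and columns of the merged frame
colnames(mergeData)  # column names
rownames(mergeData)  # row names
```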
Packages
To view the complete list of packages installed in the R library, type the following command:
installed.packages()
You can select the packages that you need to install from the list of packages. When you run the install.packages() command, the CRAN mirror dialog box appears.
In addition to the available packages in R, you can also download packages from the Internet in the form of zip files and install them in the R library using the Install package(s) from local zip files option available in the Packages menu, as shown in Figure 1.24:
To load an installed package into the current session, use the library() function:
library("package-name")
To access the functions of the bigdata package in R, type the following command:
library("bigdata")
FIGURE 1.25 Using the library() Function
You can also load and update packages by using the Load package or Update packages options,
respectively, from the Packages menu, as shown in Figure 1.26:
You can unload or remove a package from the R library by using the detach() function. The syntax
of writing the detach() function is as follows:
detach(package:<name of package>)
In the preceding syntax, package is a keyword, while <name of package> specifies the name of the package.
To view the working of the detach() function, first load a package using the library() function.
In this case, we have loaded the bigdata package. Then, write the search() function to view the
list of loaded packages. Now, remove the bigdata package by using the detach() function, as
shown in Figure 1.27:
CHAPTER 2
Statistics for Data Science
Topics Discussed
Introduction
Measures of Central Tendency
Probability Theory
C
Sampling Theory
Sampling Frame
Sampling Methods
Sample Size Determination
Sampling and Data Collection
Sampling Errors
Hypothesis Testing
Four Steps to Hypothesis Testing
Hypothesis Testing and Sampling Distributions
Types of Error
Effect Size, Power, and Sample Size in Hypothesis Testing
t-test
Analysis of Variance (ANOVA)
Regression Analysis
Multiple Regression Analysis
Types of Regression Techniques
DATA SCIENCE
INTRODUCTION
Ever since the evolution of big data, data storage capacities have grown enormously, and it has become increasingly difficult for organisations to process such huge amounts of data. Here comes the
role of data science. Data science is a multidisciplinary subject that has developed as a combination
of mathematical expertise (data inference and statistics) and algorithm development, business
acumen and technology in order to solve complex problems. At the core of business operations
is data. An overwhelming amount of data is stored in the enterprise data warehouses and a lot of
value can be derived from it by mining the data. A data warehouse can be used for data discovery and the development of data products that help in generating value.
Data science helps in uncovering findings from data. Discovering data insights involves mining data
at granular level and understanding complex behaviours, trends and inferences which can be used by
the organisations to make better business decisions. For example, the P&G Company makes use of
time series models to understand the future demand and plan for production levels more optimally.
A data product is a technical asset that takes in data as input and processes the data to return
algorithm-based results. An appropriate example of data product is the recommendation engine
that takes in user data and makes personalised recommendations based on data. For example,
e-commerce websites such as Flipkart and Amazon mine data to understand the buying patterns of consumers and then, based on this analysis, recommend other similar products that may interest the buyers.
The field of data science involves the use of techniques such as machine learning, statistics, cluster analysis, data mining, algorithms, coding and visualisation. Note that statistics plays a central role
in the data science applications. Data science involves use of statistical techniques that are used for
data collection, visualising the data and deriving insights from them, obtaining supporting evidence
for data-based decisions and constructing models for predicting future trends from the data.
In this chapter, you will learn about the various techniques of statistics that are used frequently in
data sciences. You will learn about the important concepts, such as probability theory, statistical
inference, sampling theory, hypothesis testing and regression analysis.
The arithmetic mean of a frequency distribution is calculated as:
x̄ = (Σ fᵢXᵢ) / (Σ fᵢ), summed over i = 1, …, n
Statistics for Data Science
where x̄ represents the sample mean and fᵢ represents the frequency of the ith observation of the
variable. One of the problems with arithmetic mean is that it is highly sensitive to the presence
of outliers in the data of the related variable. To avoid this problem, the trimmed mean of the
variable can be estimated. Trimmed mean5 is the value of the mean of a variable after removing
some extreme observations (e.g., 2.5 percent from both the tails of the distribution) from the
frequency distribution. Mean is the hypothetical value of a variable. It may or may not exist in
the dataset.
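In R, the trim argument of mean() computes a trimmed mean; this sketch uses made-up values with one outlier:

```r
x <- c(2, 4, 5, 5, 6, 7, 9, 100)   # 100 is an outlier
mean(x)                  # ordinary mean, pulled up by the outlier: 17.25
mean(x, trim = 0.125)    # drop 12.5% from each tail (one value per side): 6
```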
The following code snippet shows how to calculate the mean in R:
participants <- c(108, 90, 100, 110, 113, 98, 95, 129, 137, 109)
IQ.mean = mean(participants)
print(IQ.mean)
## [1] 108.9
In the preceding code snippet, the IQ scores of 10 participants are used, having values 108, 90, 100, 110, 113, 98, 95, 129, 137 and 109. A new workspace variable with the name participants is created, and the mean function is used to calculate the mean of the given IQ scores.
Median: Median is known as the ‘positional average’ of a variable. If we arrange the observations
of a variable in an ascending or descending order, the value of the observation that lies in the
middle of the series is known as median. The value of the median divides the observations of
a variable into two equal halves. Half of the observations of the variable are higher than the
median value and the other half observations are lower than the median value. The extensions
of median are quartiles, deciles, and percentiles.
The following code snippet shows how to calculate median using R language:
IQ.median=median(participants)
print(IQ.median)
## [1] 108.5
In the preceding code snippet, the participants variable is created using the IQ scores of 10 participants. The median function is used for calculating the median of the IQ scores.
Consider another code snippet for calculating the interquartile range (IQR) using R:
IQ.IQR=IQR(participants)
print(IQ.IQR)
## [1] 13.75
Mode: The mode of a variable is the observation with the highest frequency or highest
concentration of frequencies.
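Base R has no built-in function for the statistical mode (R's own mode() reports the storage type of an object), but it can be computed from a frequency table; the sample scores below are only for illustration:

```r
# Statistical mode via a frequency table
scores <- c(12, 20, 12, 13, 16, 11, 19, 16, 15, 15, 14, 17,
            19, 20, 11, 10, 18, 17, 14, 19, 17, 18, 16, 19)
tab <- table(scores)                      # frequency of each value
as.numeric(names(tab)[which.max(tab)])    # 19 occurs most often (4 times)
```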
Probability theory involves the use of discrete and continuous random variables and probability
distributions. The distributions provide mathematical abstractions of non-deterministic or uncertain
processes or measured quantities which may occur as a single occurrence or over time.
Random events cannot be predicted perfectly. However, their behaviour can be analysed. The law
of large numbers and the central limit theorem are used to describe the behaviour of such random
events.
The study of probability theory is essential for human activities involving quantitative data analysis. It acts as a mathematical foundation for concepts such as uncertainty, confidence, randomness,
variability, chance and risk. Probability theory is also used by various experimenters and scientists
who make inferences and test hypotheses based on uncertain empirical data. Probability theory is
also used to build intelligent systems. For example, techniques and approaches such as automatic
speech recognition and computer vision which involve machine perception and artificial intelligence
are based on probabilistic models.
A probability distribution can be described using an equation called Probability Density Function
(PDF). The area under the curve of a random variable’s PDF shows the probabilities of the
continuous random variables. Here it must be remembered that a range of values can have a non-
zero probability. For example, we can calculate the probability that a student has scored marks
between 80 and 90. Probability of a continuous random variable having some value is zero. Due to
this reason, a continuous probability function cannot be expressed in tabular form. A continuous
probability distribution is described using an equation or a formula.
For a continuous random variable with PDF y = f(x), the value of f(x) is greater than or equal to zero for all values of x. Also, the total area under the curve of the function is equal to one.
For example, the PDF of men's heights is shown in Figure 2.1:
FIGURE 2.1 PDF of Men's Heights (a symmetric density centred at 70 inches, with 50% of the area on either side)
For a continuous probability distribution for men’s heights, we cannot measure the exact probability
that a man will have a height of exactly 70 inches. It only shows that an average man has a height of
70 inches. It is not possible to find out the probability that any one person has the height of exactly
70 inches.
Some of the most common continuous probability distributions used in statistics include the normal, uniform and exponential distributions. Some of the most common discrete probability distributions used in statistics include the binomial distribution. Discrete probability distributions can be described using frequency distribution tables, graphs or charts. The frequency distribution table for the probability of rolling a die is shown in Table 2.1 as follows:
TABLE 2.1: Frequency Distribution Table for the Probability of Rolling a Die
Roll 1 2 3 4 5 6
Probability 1/6 1/6 1/6 1/6 1/6 1/6
class<-c(12,20,12,13,16,11,19,16,15,15,14,17,19,20,11,10,18,17,
+14,19,17,18,16,19)
# Frequency distribution of the data with bar graph
library(descr)
freq(class)
The output of the preceding code snippet is shown in Figure 2.2:
FIGURE 2.2 Frequency Distribution of the class Data Shown as a Bar Graph
## class
## Frequency Percent
## 10 1 4.167
## 11 2 8.333
## 12 2 8.333
## 13 1 4.167
## 14 2 8.333
## 15 2 8.333
## 16 3 12.500
## 17 3 12.500
## 18 2 8.333
## 19 4 16.667
## 20 2 8.333
## Total 24 100.000
In the preceding code snippet, we have calculated the frequency distribution along with its visual
representation using R. For this, we have imported a library (a collection of precompiled routines).
1. Bernoulli Distribution
A Bernoulli distribution7 has only one trial and only two possible outcomes, namely 1 (success)
and 0 (failure). In a Bernoulli distribution, a random variable X can take value 1 (success) with
probability p or can take value 0 with probability q (= 1 – p). For example, the toss of a coin
once may result in heads or tails. In the case of a fair coin, probability of heads = probability of tails = 0.5. The probability function for the Bernoulli distribution is P(x) = pˣ(1 − p)¹⁻ˣ, where x ∈ {0, 1}. Alternatively,
P(x) = p if x = 1 and
P(x) = q = 1 – p if x = 0
In a Bernoulli trial, it is not necessary that both the outcomes will have equal probability like
in case of a fair coin toss. For example, in a karate competition between a karate green belt
holder and a black belt holder, it is highly likely that the black belt holder would win. You may
assume that probability of winning of the black belt holder (success) = 0.9 and the probability of winning of the green belt holder (failure) = 0.1.
It is known that the expected value of any distribution is the mean of the distribution.
Therefore, for a Bernoulli distribution, the expected value of a random variable X is found as
follows:
E(X) = 1*p + 0*(1 – p) = p
The variance of a random variable X from the Bernoulli distribution is calculated as:
V(X) = E(X²) – [E(X)]² = p – p² = p(1 – p)
Some other examples of Bernoulli distribution include winning or losing in a game of chance,
whether it will rain or not, whether an earthquake would happen tomorrow or not, etc.
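A Bernoulli trial can be simulated in R with rbinom() using size = 1; this sketch checks the karate example's p = 0.9 against the formulas E(X) = p and V(X) = p(1 − p):

```r
set.seed(42)
# 10,000 Bernoulli trials with success probability 0.9
wins <- rbinom(10000, size = 1, prob = 0.9)
mean(wins)   # close to E(X) = p = 0.9
var(wins)    # close to V(X) = p(1 - p) = 0.09
```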
2. Uniform Distribution
In a uniform distribution, there may be any number of outcomes and the probability of getting
any outcome is equally likely. For example, when a fair dice marked A, B, C, D, E, and F on 6 of
its sides is rolled, then:
Probability of getting A = Probability of getting B = Probability of getting C = Probability of
getting D = Probability of getting E = Probability of getting F = 1/6.
Assume that in one trial, n outcomes may turn up. Then, all the n number of possible outcomes
of a uniform distribution are equally likely. The probability function for a uniform distribution
is written as:
f(x) = 1/(b − a), for −∞ < a ≤ x ≤ b < ∞
FIGURE 2.3 A Uniform Distribution over the Interval [a, b]
In Figure 2.3, we observe that the uniform distribution is rectangular in shape. Due to this
reason, it is also called rectangular distribution.
Assume that a cake shop sells 50-80 cakes everyday. Let us calculate the probability that the
daily sales is between 65 and 75 cakes.
Probability that the daily sales will fall between 65 and 75 = (75 – 65)*(1/(80 – 50)) = 0.33
Probability that the daily sales is greater than 60 = (80 – 60)*(1/(80 – 50)) = 0.67
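The cake-shop probabilities can be checked with R's built-in punif() (the uniform CDF):

```r
# Daily sales uniform on [50, 80]
punif(75, min = 50, max = 80) - punif(65, min = 50, max = 80)  # P(65 < X < 75) = 1/3
punif(60, min = 50, max = 80, lower.tail = FALSE)              # P(X > 60) = 2/3
```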
For a uniform distribution, mean and variance are calculated as:
E(X) = (a + b)/2
V(X) = (b – a)²/12
A standard uniform distribution has parameters a = 0 and b = 1. The probability distribution
function for a standard uniform distribution is written as:
f (x) = 1 if 0 ≤ x ≤ 1
f (x) = 0 for all other cases
3. Binomial Distribution
Major characteristics of a binomial distribution are:
• There are 'n' different trials and each trial is independent of the others.
• For each trial, there are only two outcomes, success or failure.
• All the trials are identical and the probability of success and failure is the same for all the trials.
• There are three parameters in a binomial distribution, namely n, p and q, where n = number of trials; p = probability of success, and q = probability of failure.
Mathematically, a binomial distribution is represented as:
P(X = x) = [n!/(x!(n − x)!)] pˣqⁿ⁻ˣ
The mean and variance of a binomial distribution are calculated as:
Mean, µ = n*p
Variance, V(X) = n*p*q
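These formulas can be checked numerically with R's dbinom(); for example, with n = 10 and p = 0.5:

```r
n <- 10; p <- 0.5; q <- 1 - p
probs <- dbinom(0:n, size = n, prob = p)   # P(X = x) for x = 0..10
sum(probs)                       # probabilities sum to 1
sum((0:n) * probs)               # mean = n*p = 5
sum(((0:n) - n*p)^2 * probs)     # variance = n*p*q = 2.5
```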
When the probability of success is equal to probability of failure, the binomial distribution
curve looks as shown in Figure 2.4:
FIGURE 2.4 Binomial Distribution Curve when Probability of Success is equal to Probability of Failure
When the probability of success is not equal to the probability of failure, the binomial distribution
curve looks as shown in Figure 2.5:
FIGURE 2.5 Binomial Distribution Curve when Probability of Success is not Equal to Probability of Failure
4. Normal Distribution
Normal distribution results in a bell-shaped symmetrical curve. This distribution occurs naturally
in many situations. For example, if an examination is conducted, most of the students would
pass with average marks, a few will score extremely high and a few will score extremely low.
If this is shown on a graph, half of the data will fall on the left of the average marks and half
of the data would fall on the right side of the average marks. Many situations follow a normal
distribution. Due to this reason, normal distribution is widely used in businesses. Some of
the situations that follow normal distribution include heights of people, measurement errors,
test scores, IQ scores, salaries, etc. Under normal distribution, we have an empirical rule that
tells us what percentage of the data falls within a certain number of standard deviations from
the mean. They are:
• 68% of the data falls within the range of mean ± standard deviation (σ)
• 95% of the data falls within the range of mean ± 2σ
• 99.7% of the data falls within the range of mean ± 3σ
This empirical rule can be represented as shown in Figure 2.6:
FIGURE 2.6 The Empirical Rule (areas of about 34.1%, 13.6%, 2.1% and 0.1% between successive standard deviations from µ out to ±3σ)
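The empirical-rule percentages can be verified with R's pnorm() (the normal CDF):

```r
# Area within k standard deviations of the mean, for k = 1, 2, 3
pnorm(1) - pnorm(-1)   # about 0.683
pnorm(2) - pnorm(-2)   # about 0.954
pnorm(3) - pnorm(-3)   # about 0.997
```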
The spread of the normal deviation depends upon the standard deviation. If the standard
deviation is low, much of the data will be accumulated around the mean and the distribution
would appear taller. On the contrary, if the standard deviation is greater, the data would be
spread out from the mean position and the normal distribution would appear flatter and wider.
The following code snippet shows how to calculate standard deviation using R:
# Now try to get the standard deviation
IQ.SD=sd(participants)
print(IQ.SD)
## [1] 14.76068
The PDF of a normally distributed random variable X with mean µ and standard deviation σ is:
f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²)), for −∞ < x < ∞
For a random variable that is normally distributed, the mean and variance are given as:
Mean, E(X) = µ
Variance, V(X) = σ2
A standard normal distribution is defined as a distribution with mean 0 and standard deviation 1. The PDF of a standard normal distribution is:
f(x) = (1/√(2π)) e^(−x²/2), for −∞ < x < ∞
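R's dnorm() implements this PDF; at x = 0 it should equal 1/√(2π):

```r
dnorm(0)          # density of the standard normal at 0
1 / sqrt(2 * pi)  # the same value from the formula
```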
A standard normal distribution is shown in Figure 2.7:
FIGURE 2.7 The Standard Normal Distribution (symmetric about 0, x from −4 to 4)
The population refers to a collection of all the subjects or objects of interest. A sample refers to
the subset of the population. This subset is used to make inferences about the population and
its characteristics. Population parameters are those numeric characteristics of a population that
are fixed and are usually unknown; for example, the average lifespan of Tamil Nadu males or the
percentage of people satisfied with the current government. The information is collected from the
sample subjects or regarding the objects in a sample and is consequently analysed. The analysis
results in data or the values measured or recorded with respect to a sample. Numeric statistics of
the sample data, such as the mean, proportion and variance are called sample statistic and are often
used to provide estimates of the corresponding population parameters.
It must be remembered that different samples give different values for sample statistics. In
practice, the sample statistic such as mean is calculated for different samples and a histogram
can be constructed for all the samples’ mean. Here the statistic from a sample is regarded as a
random variable. Also, a histogram is an approximation of its probability distribution and is known
as the sampling distribution. In other words,
the sampling distribution shows how a statistic or the random variable varies when the random
samples are taken repeatedly from a population. In statistical inferencing, the distance between the
parameter and the expected value of sample statistic is known as bias.
In inferential statistics, the experimenter tries to achieve three goals as follows:
1. Parameter estimation: The experimenter estimates the parameters, i.e., the numeric characteristics that determine the properties of a distribution. For example, mean and standard deviation
are parameters in a normal distribution. Mean is the value around which the normal or the
bell-shaped distribution is centred and standard deviation helps in determining the expanse
or the width of the curve. If the type of distribution is known, the true values of its parameters
can be calculated.
2. Data prediction: After the parameters have been estimated for a particular distribution, they
can be used to predict the future data. For example, assume that a survey of 10% of the total
female population of India is conducted to estimate their average or mean height. We can
use this data to predict the probability that a particular female will have a height within a
certain range of values. Assume that the mean value of marks scored by class 10th students
is calculated as 81 and standard deviation is calculated as 10. Then if we select any class 10th
student, it is likely that he/she will have scores in the range of 71-91.
3. Model comparison: After the data has been predicted for an entire population, the
experimenter selects one model which best explains the observed data from two or more
models. In probability theory, a model is basically a combination of postulates about the
process that generates the data. For instance, a model may state that the height of an adult
person is determined by factors, such as gender, genes, nutrition, physical exercise, ethnicity,
race and geographical location. This is a generic model. A statistical model can be constructed
to examine or postulate a relationship between the given factors and the data to be explained.
For example, model 1 suggests that the height of a person depends 75% on genetic factors
and 25% on exercise. Similarly, model 2 may suggest that the height of a person depends 85%
on genetic factors and 10% on exercise and 5% on the rest of the factors. In such a case, the
model that can best accommodate the observed data is adopted.
Let us now discuss the two important types of inferences, namely frequentist inference and the
Bayesian inference.
Frequentist Inference
According to the classical definition of probability, the probability of an event is defined as:
P(Event) = Number of favourable outcomes / Total number of outcomes
In probability theory, two outcomes should have the same probability of occurrence, provided that
they are symmetric with respect to the factors that cause them.
Before we discuss frequentist and Bayesian inferences, let us study about the four important
definitions of probability as follows:
1. Probability as a long-term frequency: This definition of probability states that the probability
of an event is equal to the long-term frequency of the event’s occurrence when the same
process is repeated a large number of times. For instance, the probability that a die turns up an odd number upon being rolled is 0.5. This is true because when a die is rolled a large number of times, nearly half of the rolls result in an odd number.
2. Probability as propensity or physical tendency: This definition of probability states that
probability is a physical tendency that a certain event would occur under the given conditions.
For example, when a coin is tossed, it usually turns up in a head or tail with a propensity of 0.5,
T
but one in every few thousand flips of a coin, the coin may land on its edge as well.
3. Degrees of logical support/law of large numbers: This definition specifies the source of the long-term frequency of occurrence of certain events and is based on the law of large numbers, which establishes a link between probabilities and frequencies. As a particular process is repeated, the relative frequency of a particular outcome gets closer to that outcome's probability. For example, if a coin is tossed 20 times, heads may turn up 5, 7, 11 or 15 times, giving relative frequencies of 5/20 (0.25), 7/20 (0.35), 11/20 (0.55) or 15/20 (0.75). However, if the same coin is tossed 1,000 times, the relative frequency of heads increasingly nears 0.5. Therefore, as the number of flips increases, the relative frequency of heads tends to approach 0.5.
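This convergence can be demonstrated with a short simulation. The following is a minimal sketch in Python (the book's own snippets use R); the function name and the fixed seed are illustrative choices:

```python
import random

def heads_frequency(n_flips, seed=42):
    """Flip a fair coin n_flips times and return the relative frequency of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The relative frequency drifts toward 0.5 as the number of flips grows.
for n in (20, 1000, 100000):
    print(n, heads_frequency(n))
```

With only 20 flips the frequency can stray well away from 0.5, but with 100,000 flips it lands very close to it, as the law of large numbers predicts.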
4. Probability as degree of belief: This definition of probability states that the probability
measures the degrees of belief about the occurrence of an event or about the truth of a
hypothesis or truth regarding any random statement. In a way, probability here represents
how certain an experimenter is about the truth of a statement. When an experimenter has
total belief that an event or a statement is definitely true, it is assigned a probability of 1, and when the experimenter believes that something is definitely false, it is assigned a probability of 0. A probability between 0 and 1 means that an element of uncertainty remains; such a probability reflects the knowledge and experience of the experimenter.
All these four definitions describe how probabilities relate to the physical world as a mathematical concept. Frequentist inference is based on the first definition of probability, as a long-term frequency.
Statistics for Data Science
Also note that Bayesian inference is based on the definition of probability as a degree of belief.
Frequentist probability is applicable in case of repeatable random events. Here it is held that
frequentist probability is equal to long-term frequency of the occurrence of the events. In frequentist
inference, no probability is attached to hypotheses or any unknown value.
To better understand frequentist inference, let us assume that we want to estimate the average
marks of a group of 1000 students. Here we make two assumptions. First, the marks of the group
of students are distributed normally. Second, the value of standard deviation is known. Therefore,
now we need to calculate the mean of the normal distribution. If a frequentist experimenter is given
this data, he/she would probably observe that we don’t know what the mean marks of the students
are, but we certainly know that the mean is a fixed value. Therefore, we cannot assign a probability
to the mean being equal to a certain value or less than or more than a certain value. However, we
can collect data from the sample of population and analyse it to estimate the mean value of marks.
The estimate value calculated in this manner is called maximum likelihood estimate and it depends
upon the distribution of the data. For a normal distribution, the maximum likelihood estimate of
population mean is equal to sample mean.
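The claim that the sample mean is the maximum likelihood estimate of a normal mean can be checked numerically: up to constants, maximising the normal likelihood in the mean is equivalent to minimising the sum of squared deviations. A Python sketch with hypothetical marks data:

```python
def neg_log_likelihood_core(data, mu):
    """Up to constants, the normal negative log-likelihood in mu
    is proportional to the sum of squared deviations from mu."""
    return sum((x - mu) ** 2 for x in data)

marks = [62, 71, 55, 80, 67, 73]          # hypothetical sample of marks
sample_mean = sum(marks) / len(marks)     # 68.0

# The sample mean scores at least as well as nearby candidate values of mu.
for candidate in (sample_mean - 1, sample_mean, sample_mean + 1):
    print(candidate, neg_log_likelihood_core(marks, candidate))
```

Any value of µ other than the sample mean yields a larger sum of squared deviations, i.e., a lower likelihood.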
The following code snippet shows how to find the minimum of the input using R:
IQ.min=min(participants)
print(IQ.min)
Output:
## [1] 90
In the preceding code snippet, the min function is used to find the minimum value among participants.
Now, consider the following code snippet that shows how to find the maximum of input using R:
IQ.max=max(participants)
print(IQ.max)
Output:
## [1] 137
In the preceding code snippet, the max function is used to find the maximum value among
participants.
Now, consider the following code snippet that shows how to find the range of input using R:
IQ.range=range(participants)
print(IQ.range)
Output:
## [1] 90 137
In the preceding code snippet, the range function returns the minimum and maximum values among participants.
Exhibit-1
CONDITIONAL PROBABILITY
Conditional probability is denoted as P(A|B), which means the probability of A given that B has occurred. Conditioning on additional information can change probabilities considerably. For example, the probability of getting a free pizza on any day, P(free pizza), may be 0.1, whereas the probability of getting a free pizza on a Friday, P(free pizza|Friday), may be 1 and the probability on a Sunday, P(free pizza|Sunday), may be 0.
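The defining relation P(A|B) = P(A and B) / P(B) can be computed directly. A Python sketch; the pizza probabilities below are hypothetical and chosen only to exercise the formula, not taken from the exhibit:

```python
def conditional_probability(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b

# Hypothetical numbers: free pizza occurs on 1 day in 10 overall, and every
# such day happens to be a Friday; 1 day in 7 is a Friday.
p_friday = 1 / 7
p_pizza_and_friday = 1 / 10
print(conditional_probability(p_pizza_and_friday, p_friday))   # ≈ 0.7
```

Note how conditioning on "Friday" raises the probability from 0.1 to about 0.7 in this hypothetical setup.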
falsify the models. In Bayesian statistics, probabilities are assigned to a model.

P(A|B) = P(B|A) × P(A) / P(B)
A frequentist may argue that a certain event, say a person having a disease, either happens or does not happen. On the other hand, a Bayesian researcher may argue that there is a 1% probability that the person has the disease.
To better understand Bayesian inference, let us again assume that we want to estimate the average
marks of a group of 1000 students. If a Bayesian experimenter is given this data, he/she would probably
observe that: We know that the mean is a fixed and unknown value, but we can still represent the
uncertainty in terms of probability. This can be done by defining a probability distribution over the
possible values of the mean and using the sample data to update the distribution. In case of Bayesian
inferencing, newly collected data makes the probability distribution over the parameter narrower.
To be very specific, the probability distribution becomes narrower around the parameter’s true
value. For updating the entire probability distribution, Bayes’ theorem is applied to each possible
value of the parameter.
P(A|B) = P(B|A) × P(A) / P(B)

Here, P(A|B) represents the probability that event A occurs given that event B has already occurred.
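Bayes' theorem can be applied directly to the 1% disease example above. In this Python sketch, the test sensitivity of 0.95 and false-positive rate of 0.05 are assumed for illustration; only the 1% prior comes from the text:

```python
def bayes_posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_disease = 0.01                 # prior P(disease), from the text
p_pos_given_disease = 0.95       # assumed sensitivity
p_pos_given_healthy = 0.05       # assumed false-positive rate

# Total probability of a positive test, P(B).
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

posterior = bayes_posterior(p_disease, p_pos_given_disease, p_positive)
print(round(posterior, 3))       # about 0.161
```

A positive test updates the 1% prior belief to roughly a 16% posterior probability of disease; collecting more evidence would narrow the distribution further, as described above.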
Bayesian inference is a process of making inferences about a population or probability distribution
from the data using Bayes’ theorem.
SAMPLING THEORY
In research studies, sampling theory is used extensively. At times, experimenters have to conduct
certain researches that involve collecting data from a large population. In such cases, they do not
collect data from each object or subject in the population because it is infeasible to do so. Therefore,
the experimenters collect data from a sample of objects or subjects drawn from a population. The
goal of sample surveys is to collect data from a small part of the larger population so that inferences
can be made about the larger group. The practice of drawing samples and analysing them to derive
some useful information is called sampling theory. Certain important concepts related to sampling theory are as follows:
Data: Data refers to the entire set of observations that have been collected.
Population: An entire group of subjects or objects that are to be studied and analysed is called a population.
Sample: In research studies, the population whose characteristics are to be studied is usually
very large and it is practically infeasible to examine each subject or object of the population.
Therefore, a sample is selected to act as a representative of the population in the research
study. A sample is a portion or sub-collection of elements that are examined in order to estimate
the characteristics of a population.
The usual course of a research study involves four steps:
1. Determine the population, i.e., all the individuals of interest.
2. Select a sample from the given population.
3. Have the objects and subjects selected in the sample participate in the research study.
4. Analyse the data collected from the research study and generalise the results to the entire population.
Parameter: A parameter is a numerical measurement that describes a characteristic of a population; it is estimated from sample measurements that are generalised to the population. For example, the population mean, variance and standard deviation are parameters of a distribution.
Statistic: A statistic is a numeric characteristic that describes a sample, just as a parameter describes a population. More broadly, statistics is the branch of mathematics that deals with planning and conducting experiments, obtaining data, and organising, summarising, presenting, analysing, interpreting and drawing conclusions based on data.
Sampling Frame
Sampling frame refers to the complete list of all the items (everyone and everything) that must be
studied. At first, it would appear that a sampling frame is the same as population. But, population
is general, whereas a sampling frame is specific. For example, we may define a population as all those individuals who can be sampled (for example, all the Indian Americans living in Texas, USA), whereas an exhaustive list of those Indian Americans would be the sampling frame; note that not every member of the population necessarily appears in the list so provided. In statistical research, the experimenters require a list of
items in order to draw a sample from it. It must be ensured that the sampling frame is adequate for
the needs of the experimenter.
Let us understand the population and sampling frame with the help of an example.
According to Alaska University, a good sample frame for a project on living conditions has the
following characteristics:
Include all subjects/objects in the target population.
Include all the accurate information that can be used to contact the selected individuals.
Other than these characteristics, the following factors may also be considered:
For each object or subject in the sampling frame, a unique identifier must be fixed. It must be
ensured that the identifiers have no duplicates.
The objects or the subjects in the sampling list must be organised in a known manner. For
example, a list containing the names of 100 people may be arranged alphabetically or using age
as a criterion.
Information contained in the sampling frame (list) must be up-to-date. For example, a list
containing the names of 100 people made in 2018 must be updated to reflect changes in address,
age, live/dead status, etc. of each individual in the list.
Sampling Methods
In statistics, there are various sampling methods. Sampling methods are divided into two categories,
namely probability sampling and non-probability sampling. Probability sampling is the one wherein
the sample has a known probability of being selected. On the other hand, in non-probability
sampling, a sample does not have known probability of being selected. In probability sampling, we
can determine the probability that each sample will be selected. In stratified sampling, the population is divided into strata, i.e., groups whose members share a common characteristic. For example, once a population of individuals aged between 10 and 70 is identified,
the experimenters can identify the individuals aged between 40 and 50. A researcher first finds
the relevant strata and their representation in the total population. After the different strata
have been defined, random sampling is carried out to select the required number of subjects or
objects from each stratum. Experimenters use stratified sampling when different strata in the
population have different incidences relative to other strata.
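The stratified procedure described above, i.e., grouping the population by a shared characteristic and then sampling randomly within each stratum, can be sketched as follows (Python; the population of (name, age) pairs and the decade-based strata are hypothetical):

```python
import random

def stratified_sample(population, strata_key, per_stratum, seed=0):
    """Group the population by stratum, then draw a simple random
    sample of per_stratum members from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(strata_key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical population of 100 (name, age) pairs, stratified by decade of age.
people = [("p%d" % i, 10 + i % 60) for i in range(100)]
chosen = stratified_sample(people, strata_key=lambda p: p[1] // 10, per_stratum=2)
print(len(chosen))   # 2 members from each of the 6 age decades = 12
```

Because every stratum contributes members, groups with a low incidence in the population are still guaranteed representation in the sample.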
In cluster sampling, the experimenter divides the population into separate groups called
clusters. After the clusters have been defined, a random sample of clusters is selected from
the entire population. Now, the experimenter collects and analyses data from the objects or subjects of the selected clusters. Cluster sampling is less precise than simple random sampling (SRS) and stratified sampling, but it is more cost-effective than both.
In systematic random sampling, a sample frame list is prepared and from this list, the kth
element is selected as the first element of the sample. Thereafter, the (2k)th element is selected
as the second element. Then (3k)th element is selected as the third element, and so on. Under
systematic sampling, the sample members are selected on the basis of a constant interval called
sampling interval. Here k is the sampling interval.
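Selecting every kth element from a prepared frame can be sketched as follows (Python; the frame of 100 numbered units is hypothetical):

```python
def systematic_sample(frame, k, start=None):
    """Select every k-th element from the sampling frame, beginning at index
    start (defaults to k - 1, i.e. the k-th element, as described above)."""
    if start is None:
        start = k - 1
    return frame[start::k]

frame = list(range(1, 101))               # a hypothetical frame of 100 units
print(systematic_sample(frame, k=10))     # selects units 10, 20, ..., 100
```

With a sampling interval of k = 10 on a frame of 100 units, the sample contains 10 evenly spaced elements.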
Multistage sampling is a complex form of cluster sampling. In multistage sampling, clusters are formed from a population and these clusters are sub-divided into smaller groups or sub-clusters. The subjects or objects are chosen randomly from each sub-cluster. The sub-clustering activity can be undertaken multiple times depending on the nature of the research and the population size under study. In multistage sampling, the sample size gets reduced at each stage.
Sample Size Determination
We have mentioned that the experimenters select a sample out of the total population. However, till now, we have not specified the number of observations to include in a sample. The number of observations included in a sample is called the sample size. To be able to draw any statistical inference, we must carefully select a sample of sufficient size, i.e., one that properly represents the entire population.
entire population. A very large sample may lead to the wastage of time, money and other resources,
IM
In order to determine the sample size, the researcher needs to have certain information which
includes:
How accurate do the answers or estimates need to be (the margin of error)?
What confidence level is required? The confidence level determines the Z-score. The Z-scores for common confidence levels are:
For a confidence level of 90%, Z-score = 1.645
For a confidence level of 95%, Z-score = 1.96
For a confidence level of 99%, Z-score = 2.576
These are the most common confidence levels. If you use any other confidence level, you will need to refer to the Z-tables.
Now, we will use the Z-score, standard deviation and margin of error (confidence interval) to calculate the sample size using the following formula:

Sample Size = [(Z-score)² × Std. Dev × (1 − Std. Dev)] / (Margin of error)²
For example, assume that the experimenters choose a confidence level of 95%, standard deviation
of 0.5, and a margin of error (confidence interval) of +/– 5%.
Sample Size = [(1.96)² × 0.5 × (1 − 0.5)] / (0.05)² = 0.9604 / 0.0025 = 384.16 ≈ 385
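The calculation above can be packaged as a small helper (a Python sketch; the result is rounded up, since a sample size must be a whole number):

```python
import math

def required_sample_size(z_score, p, margin_of_error):
    """Sample size n0 = z^2 * p * (1 - p) / e^2, rounded up to a whole unit."""
    n0 = (z_score ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n0)

# 95% confidence (z = 1.96), p = 0.5, margin of error = 5%.
print(required_sample_size(1.96, 0.5, 0.05))   # 385
```

Using p = 0.5 maximises p(1 − p) and therefore gives the most conservative (largest) sample size for a given confidence level and margin of error.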
Alternatively, the sample size can also be determined using Cochran's formula:

n₀ = (Z² × p × q) / e²

Here, e = desired level of precision (margin of error), p = estimated proportion of the population that has the attribute in question, and q = 1 − p.
Short, simple and easy to understand questions should be included in the questionnaire.
Sampling is carried out to conduct experiments and draw conclusions regarding a population based
on sample. Sampling is carried out because of feasibility and cost factors. While sampling, it must be
remembered that the sampled population and the target population must be similar to one another.
Sampling can be done through different techniques, such as SRS, stratified sampling, etc. While
sampling, the sample size must be determined carefully because the larger the sample size is, the
better would be the sample estimates.
Sampling Errors
There are two main types of errors involved in sampling. When a sample of observations is taken from a population, two types of errors, namely sampling and non-sampling errors, can arise. In sampling theory, total error is the sum of sampling and non-sampling errors. Total error is defined as the difference between the true mean value of the population parameter and the observed mean value of the parameter. Figure 2.8 shows the division of total error.
FIGURE 2.8 Components of Total Error: total error divides into sampling error and non-sampling error; non-sampling error comprises errors in data acquisition, non-response error and selection bias.
The major differences between the sampling and non-sampling errors are shown in Table 2.2:

TABLE 2.2: Differences between the Sampling and Non-Sampling Errors

Errors in data acquisition: Errors in the recorded responses may arise due to faulty equipment, mistakes in transcription from the primary source, inaccurate recording, inaccurate responses, misinterpretation of terms, etc.
Non-response error: At times, certain members of the sample may not provide their responses.
In such a situation, a bias is introduced in the data and observations. In such instances, the
sample responses may not be representative of the population which leads to biased results.
Non-response error may creep in when the respondents in the sample are not available to
provide their response or when the respondents are not willing to provide their inputs.
Selection bias: Selection bias occurs when certain members of a population are given more
preference over other members and are not adequately represented among the sample
population. For example, if a population consists of 20 managers and 80 workers and the
sample of 20 consists of 15 managers and 5 workers, the sample is probably skewed as the
managers and workers are not represented proportionally.
HYPOTHESIS TESTING
There are basically two types of statistical inferences. One is estimation and the other one is
hypothesis testing. Before we discuss hypothesis testing, let us first discuss what a hypothesis is.
Hypothesis testing, also called significance testing, is a method which is used to test the hypothesis
regarding the population parameters using the data collected from a sample. Alternatively, we can
say that hypothesis testing is a method of evaluating samples to learn about the characteristics of
a given population.
In hypothesis testing, we test a hypothesis by determining the likelihood that a sample statistic would be selected if the hypothesis were true. For example, assume that a study published in a
journal claims that Indians aged between 25 and 40 years of age sleep for an average of 6 hours. To
test this claim made in the study for Indians aged between 25 and 40 years of age living in Bengaluru,
we may first record the average sleeping time of 100 (sample size) Bengaluru-based Indians aged
between 25 and 40 years of age. The average value of sleeping hours calculated for these 100 people
is the sample mean. Next, we can compare the sample mean with the population mean.
Step 1: State the hypothesis.
For example, the experimenter may want to test whether Bengaluru-based people in the age group of 25 to 40 years sleep for 6 hours on average. The null hypothesis is:
H0: µ = 6
Step 2: Set the criterion upon which the hypothesis would be tested.
For example, if we consider the hypothesis that Bengaluru-based people in the age group of 25 and
40 years sleep for 6 hours on an average, then the sample so selected should have a mean close to
or equal to 6 hours. However, if the Bengaluru-based people in the said age group sleep for more
than or less than 6 hours, the sample should have a mean distant from 6 hours. However, here it is
important to describe how much difference or deviation from 6 hours would make the experimenter
reject the hypothesis. For example, the mean of 6 ± 0.5 hours may be acceptable, whereas the mean
of 6 ± 0.51 hours onwards may not be acceptable.
Step 3: Select a random sample from the population and measure the sample mean (Compute the
test statistic).
For example, a sample of 100 Bengaluru-based people in the age group of 25 and 40 years is selected
at random and the mean time of their sleep is measured.
Step 4: Make a decision – Compare the observed value of the sample to what we expect to observe
if the claim we are testing is true.
For example, if the sample mean calculated is approximately 6 hours with a small discrepancy
between the population and sample mean, then the experimenter may decide to accept the
hypothesis, else if the discrepancy is large, the hypothesis would be rejected.
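Steps 3 and 4 can be sketched with a z-style test statistic, since the population standard deviation is assumed known in the sleep example. In this Python sketch, the sample mean of 6.2 hours and the population standard deviation of 1 hour are hypothetical values:

```python
import math

def z_statistic(sample_mean, population_mean, population_sd, n):
    """Test statistic for a mean with known population SD:
    z = (sample mean - population mean) / (sd / sqrt(n))."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Hypothetical numbers: a sample of 100 people with mean 6.2 hours,
# claimed population mean 6 hours, assumed population SD of 1 hour.
z = z_statistic(6.2, 6.0, 1.0, 100)
print(round(z, 2))   # 2.0; beyond ±1.96, so H0 is rejected at the 5% level
```

Here the observed discrepancy of 0.2 hours is about 2 standard errors, large enough to reject the hypothesis at a 5% significance level.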
For calculating the probability of obtaining sample mean in a sampling distribution, the population
mean and the Standard Error of the Mean (SEM) must be known. These values are input in the
test statistic formula calculated in step 3. The notations used to describe populations, samples and
sampling distributions are shown in Table 2.3:
TABLE 2.3: Notations Used for the Mean, Variance, and Standard Deviation in Populations, Samples and Sampling Distributions

Mean: population µ; sample M; sampling distribution µM = µ
Variance: population σ²; sample s² or SD²; sampling distribution σ²M = σ²/n
Standard deviation: population σ; sample s or SD; sampling distribution σM = σ/√n
We must know the important differences between population, sample, and sampling distributions.
They are described in Table 2.4:
TABLE 2.4: Differences between Population, Sample, and Sampling Distributions

Population distribution: scores of all persons in a population; generally not accessible.
Sample distribution: scores of a select number of persons from the population; accessible.
Sampling distribution: all the possible sample means that can be selected given a certain sample size; accessible.
Types of Error
You studied that there are four steps in hypothesis testing. In the fourth stage, the experimenter
decides whether to accept or reject the null hypothesis. Since a sample is used to observe the
population, the decision taken regarding the null hypothesis may be wrong. When a decision is
taken regarding a sample, there are four possible decision alternatives: correctly retaining the null hypothesis, correctly rejecting the null hypothesis, incorrectly retaining the null hypothesis and incorrectly rejecting the null hypothesis.
In the context of hypothesis testing, there are usually two types of errors, as follows:
Type I Error: It is the probability of rejecting a null hypothesis that is actually true. This error is
depicted using symbol α. Researchers directly control the probability of committing this type of
error by stating an alpha level.
Type II Error: It is the probability of incorrectly retaining a null hypothesis. This error is depicted
using symbol β.
Let us now analyse the two decision types, namely retaining the null hypothesis and rejecting the
null hypothesis.
When the researcher decides to retain a null hypothesis, the decision may be correct or incorrect.
The correct decision here is to retain a true null hypothesis (null result). In this case, we are retaining
what we had already assumed.
At times, the researcher may make an incorrect decision to retain a false null hypothesis. This is a Type
II (β) error. In most tests, there is a probability of making Type II error because the experimenters do
not reject the previous notions of truth that are, in fact, false. Type II error is less problematic than
Type I error, but it might be problematic in fields such as medicine and defence. Testing of defence
equipment or medicine may involve accepting null hypothesis that should have been rejected. This
may even put a risk on the lives of the patients and other individuals.
Whenever an experimenter wants to use inferential statistics to analyse the evaluation results, first
of all, he/she should conduct a power analysis to determine the required size of sample. Whenever
we are conducting an inferential statistics test, we are basically comparing two hypotheses, i.e., the
null hypothesis and the alternate hypothesis. For example, a null hypothesis may state that when a
group of students are taken for an environment scanning and conservation trip, the attitude towards
environment conservation before and after going to the trip will remain the same. On the contrary,
the alternate hypothesis may state that there is a significant difference between the attitude of
students before and after going to the trip. Usually, statistical tests are designed to allow you to reject the null hypothesis and conclude that the program had an effect. In any statistical test, there exists a possibility that the test will find a difference between groups when none exists; this is Type I error. Similarly, there is a possibility that the test will fail to identify a difference that does exist; this is Type II error.
Statistical power, or simply power, refers to the probability that the experimenter will reject the null
hypothesis when he/she should, thus avoiding Type II error. In other words, we can say that power
is the probability that a statistical test will find a significant difference when such difference exists.
In general, a power of 0.8 or more is considered as a standard. It means that there should be an 80%
or more chance of finding a statistically significant difference when the difference actually exists.
The experimenter can use power calculations to determine the sample size. There is a relation
between sample size and power. As the size of sample increases, the power of test also increases.
This is so because when a large sample is collected, more information is available and it makes it
easier to correctly reject the null hypothesis. In order to ensure that a sample size is sufficiently
large, a power analysis calculation should be conducted. For a power calculation, the following must be known:
Type of inferential test that must be used
The alpha (significance) level to be used
The estimated effect size
These values are input in statistical software to calculate the value of power. The power value comes out to be between 0 and 1. In case the power is less than 0.8, it is recommended that the sample size be increased.
Statistical Significance
Let us continue our previous example of a group of students going to an environment scanning and
conservation trip. In such examples, there is a possibility that the students’ knowledge, attitude and
behaviour might change due to chance rather than the trip itself. Testing for statistical significance
helps the experimenter estimate how likely it is that these changes occurred randomly and not due
to the program. To determine whether the difference is statistically significant or not, the p-value
must be compared with the critical probability value or the alpha value. If p-value is less than the
alpha value, then it can be concluded that the difference is statistically significant. The p-value is the probability that the results so obtained were due to chance and not due to the program; it ranges between 0 and 1. The lower the p-value, the more likely it is that the difference occurred as a result of the program rather than by chance. The alpha (α) level (the Type I error rate) refers to the error rate that an experimenter is willing to accept. Usually, the alpha level is set at 0.05 or 0.01. An alpha of 0.05/0.01 means that the experimenter is willing to accept a 5%/1% chance that the results are due to chance and not due to the program.
0.05 is the most common alpha level chosen by experimenters for hypothesis testing in the social sciences; a result whose p-value falls below it is considered statistically significant.
Effect Size
A statistically significant difference does not mean that it is big or helpful in decision-making.
It only means that there exists a difference. For example, assume that a group of students from
a population are selected and a pre-program test is conducted. The mean score obtained by the
students is 85. After an improvement program is implemented, the students are again tested
on a post-program test. The mean score obtained by the students is 85.5. Here the difference is
statistically significant due to large sample size, but the difference in scores is very low, which
indicates that the improvement program did not lead to a meaningful increase in the knowledge
of the students. It can be concluded that the difference among the means must be statistically
significant and also meaningful. To determine whether or not an observed difference is statistically
significant and meaningful, its effect size must be calculated. Effect size is a standardised measure
and it is calculated on a common scale, which allows for comparing the effectiveness of different
programs on the same outcome. Let us now see how we can calculate the effect size depending
on the evaluation design. To calculate the effect size, the difference between the test and control
groups is taken and is divided by the standard deviation of one of the groups. For example, in a
medical hypothesis testing, the difference of the means of the test and control groups is calculated
and divided by the standard deviation of the control group.
Effect Size = (Mean of treatment group − Mean of control group) / (Standard deviation of control group)
The effect size can only be calculated after collecting data from all the objects or subjects in the sample. Therefore, for the power analysis an estimate of the effect size must be derived in advance; the most commonly used value for a moderate to large difference is 0.5.
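The effect size formula above can be computed directly (a Python sketch; the score data are hypothetical):

```python
import statistics

def effect_size(treatment, control):
    """Standardised difference:
    (mean of treatment - mean of control) / SD of control group."""
    return ((statistics.mean(treatment) - statistics.mean(control))
            / statistics.stdev(control))

# Hypothetical post-program scores for a treatment and a control group.
control = [82, 85, 88, 84, 86]
treatment = [88, 90, 92, 89, 91]
print(round(effect_size(treatment, control), 2))   # 2.24
```

Because the difference in means is expressed in units of the control group's standard deviation, effect sizes from different programs can be compared on a common scale.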
t-Test
In many research studies, researchers may want to find out differences between various calculated
values. For example, there may be situations where the researcher may need to find out the
difference between the sample mean and the population mean (one-sample t-test). Similarly, in
other situations, a researcher may need to find out the difference between two independent sample
means (independent-samples t-test) or the difference between pre- and post-event outcomes
(paired samples t-test). Let us now discuss the three different types of t-tests as follows:
One-Sample t-test: To test the difference between sample mean and population mean
Independent-Sample t-test: To test the difference between two independent sample means
Paired-Sample t-test: To test the difference between pre- and post-event outcomes
One-Sample t-Test
A one-sample t-test is used to test the difference between a sample mean and the population mean. Its null hypothesis is:
'H0: There is no significant difference between sample mean and population mean.'
The t-statistic in a one-sample t-test can be estimated by using the following formula:

t = (X − µ) / (σ / √(N − 1))

where X = sample mean, µ = population mean, σ = standard deviation of the sample and N = sample size.
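The formula can be computed as follows. This is a Python sketch using the book's form of the statistic, in which σ is the standard deviation computed with N in the denominator; the sleep data are hypothetical:

```python
import math

def one_sample_t(sample, population_mean):
    """One-sample t-statistic in the book's form: t = (X - mu) / (sigma / sqrt(N - 1)),
    where sigma is the sample SD computed with N in the denominator."""
    n = len(sample)
    mean = sum(sample) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return (mean - population_mean) / (sigma / math.sqrt(n - 1))

hours = [6.1, 5.8, 6.4, 6.0, 6.3, 5.9]   # hypothetical sleep data
print(round(one_sample_t(hours, 6.0), 3))   # ≈ 0.881
```

Dividing σ (computed with N) by √(N − 1) is algebraically the same as the more familiar s/√N with s computed using N − 1, so this matches the usual one-sample t-statistic.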
Independent-Sample t-Test
When we want to test the difference between two independent sample means, we use independent-
sample t-test. The independent samples may belong to the same population or different population.
Some of the instances in which the independent-samples t-test can be used are as follows:
1. Testing difference in the average level of performance between employees with the MBA
degree and employees without the MBA degree.
2. Testing difference in the average wages received by labor in two different industries.
3. Testing difference in the average monthly sales of the two firms.
‘H0: There is no significant difference between sample means of two independent groups.’
The t-statistic in the case of independent-sample t-test can be calculated by using the following
formula:
t = (X₁ − X₂) / √{[((N₁ − 1)s₁² + (N₂ − 1)s₂²) / (N₁ + N₂ − 2)] × (1/N₁ + 1/N₂)}
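The pooled-variance formula can be sketched as follows (Python; the wage figures for the two industries are hypothetical):

```python
import math

def independent_t(x1, x2):
    """Pooled-variance independent-samples t-statistic."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    s1_sq = sum((x - m1) ** 2 for x in x1) / (n1 - 1)   # sample variances
    s2_sq = sum((x - m2) ** 2 for x in x2) / (n2 - 1)
    pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Hypothetical daily wages in two industries.
industry_a = [300, 320, 310, 305, 315]
industry_b = [290, 295, 300, 285, 305]
print(round(independent_t(industry_a, industry_b), 2))   # 3.0
```

Pooling the two sample variances weights each by its degrees of freedom, which is appropriate when the groups can be assumed to have equal variances (the "equal variances assumed" case discussed below in the context of Levene's test).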
In SPSS, the independent-sample t-test is conducted in two stages. At stage one, SPSS software
compares variances of two samples. The statistical method of comparing two sample variances
is known as Levene’s homogeneity test of variance. The null hypothesis of this test is ‘equal
variances assumed’, i.e., there are no significant differences between the sample variances of two
independent samples. In other words, the two samples are comparable. On the basis of Levene’s
test of homogeneity, SPSS gives two values of the t-statistic. In case of equal variances, both the values are the same. In case the sample variances are different, the lower t-statistic value should be
considered for final analysis.
Paired-Sample t-Test
A paired-sample t-test is also known as a repeated-sample t-test because the data (responses) are collected from the same respondents but at different time periods. A paired-sample t-test should be used when we want to test the impact of an event or experiment on the variable under study. In this case, the data is collected from the same respondents before and after the event, and the two means are then compared. The null hypothesis of the paired-sample t-test is that the means of the pre-sample and the post-sample are equal. A typical instance is a before-and-after comparison, such as students’ fitness scores measured before and after a training programme.
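The paired t-statistic is simply t = d̄ / (s_d / √n), where d̄ and s_d are the mean and standard deviation of the per-respondent differences. A minimal Python sketch (illustrative; the data are example before/after fitness scores):

```python
import math

def paired_t(before, after):
    """Paired t-statistic: mean difference over its standard error."""
    d = [b - a for b, a in zip(before, after)]  # per-respondent differences
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

before = [12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3]
after = [12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1]
print(round(paired_t(before, after), 4))  # -0.2133
```

A |t| this small (with n − 1 = 9 degrees of freedom) fails to reach any conventional critical value, so the null hypothesis of equal means is not rejected.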
Analysis of Variance (ANOVA)
The independent-sample t-test can be applied only to situations where there are two independent samples; in other words, we can use it for comparing the means of two populations (such as males and females). When we have more than two independent samples, the t-test is inappropriate. Analysis of Variance (ANOVA) has an advantage over the t-test when the researcher wants to compare the means of a larger number of populations (i.e., three or more). ANOVA is a parametric test that is used to study the differences among more than two groups in a dataset. It helps in explaining the amount of variation in the dataset. In a dataset, two main types of variation can occur: variation due to chance and variation due to specific reasons. These variations are studied separately in ANOVA to identify the actual cause of variation and help the researcher take effective decisions.
In the case of more than two independent samples, the ANOVA test works with three types of variance: the total variance, the between-group variance and the within-group variance. The test is based on the logic that if the between-group variance is significantly greater than the within-group variance, the means of the different samples are significantly different. There are two main types of ANOVA, namely one-way ANOVA and two-way ANOVA. One-way ANOVA determines whether all the independent samples (groups) have the same group mean or not. Two-way ANOVA, on the other hand, is used when you need to study the impact of two categorical variables on a scale variable.
In the case of more than two independent samples, the sample means can also be compared with the help of multiple t-tests. However, ANOVA is still preferred over multiple t-tests. The basic reason for this preference is the presence of a family-wise error in the case of multiple t-tests. Suppose that we are interested in comparing the sample means of three independent samples A, B and C. Applying the t-test requires three independent-sample t-tests:
1. Between A and B
2. Between B and C
3. Between A and C
If the level of significance in each test is 5 per cent, the confidence level of each test is 95 per cent. If we assume that the three independent-sample t-tests are independent, the overall confidence level of all the t-tests together will be:
(0.95)³ = 0.857
Hence, the combined probability of committing a Type I error in the multiple t-tests is 1 − 0.857 = 0.143, or 14.3 per cent. Therefore, the probability of making a Type I error increases from 5 per cent to 14.3 per cent in multiple t-tests. This error is known as the family-wise error rate. The family-wise error rate can be calculated using the generalised formula, in which n represents the number of tests carried out on the data:
Family-wise error rate = 1 − (1 − α)ⁿ
Because of the presence of the family-wise error, the ANOVA test is preferred to multiple t-tests.
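The family-wise error rate formula is easy to verify numerically; a quick Python sketch (the function name is ours):

```python
def family_wise_error(alpha, n):
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n

# Three pairwise t-tests, each at the 5 per cent level of significance:
print(round(family_wise_error(0.05, 3), 3))  # 0.143
```

With six pairwise tests (four groups) the rate climbs to about 26.5 per cent, which is why an omnibus test such as ANOVA is used instead.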
The various examples where the one-way ANOVA test can be used are as follows:
To test the difference in the level of product usage among the citizens of four different cities
To test the difference in the performance level among respondents of different educational backgrounds
To test whether the average income of different professionals is different
In the case of the t-test, the null hypothesis is that there is no difference between the two sample means, that is, the two sample means are equal. Similarly, in the case of the ANOVA test, the null hypothesis is that all group means are equal.
F-Statistics
Similar to the t-statistic in t-tests, the ANOVA procedure calculates the F-statistic, which compares the systematic variance in the data (the between-group variance) with the unsystematic variance (the within-group variance). With one numerator degree of freedom, the F-distribution is the square of the t-distribution; in that case, assuming that the assumptions of parametric tests hold true, an F-statistic greater than about 3.96 is sufficient to reject the null hypothesis at the 5 per cent level of significance.
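The F-statistic can be computed from first principles: the between-group and within-group sums of squares, each divided by its degrees of freedom. A self-contained Python sketch (the function name and toy data are ours):

```python
def one_way_f(groups):
    """One-way ANOVA F-statistic: between-group over within-group mean square."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

print(one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```

The resulting F is compared against the critical F-value with (k − 1, n − k) degrees of freedom.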
Combined Test
ANOVA is a combined test. It indicates that the rejection of the null hypothesis implies that all group
means are not the same. But, it may be possible that some group means are the same and some are
not. For example, if there are three groups, rejection of the null hypothesis means that all group
means are not equal. This is a confusing statement because of the following possibilities:
Null hypothesis: All group means are the same, that is, X̄1 = X̄2 = X̄3
Alternate hypothesis: The group means are not all equal, which includes the following possibilities:
X̄1 ≠ X̄2 = X̄3
X̄1 = X̄2 ≠ X̄3
X̄1 = X̄3 ≠ X̄2
X̄1 ≠ X̄2 ≠ X̄3
To determine exactly which group means differ, post-hoc tests must be applied along with ANOVA.
REGRESSION ANALYSIS
Regression analysis is a statistical method that is used to model the relationship between two or more variables of interest, usually between a response (dependent) variable and one or more predictor (independent) variables. There are various types of regression; however, the basic function of all these regression models is to examine the influence of one or more independent variables on a dependent variable.
Regression analysis helps in identifying which variables have an impact on a variable of interest, for example, the impact of demand on supply or the impact of money supply on inflation. By performing regression analysis, we can determine which factors matter the most, which factors have a negligible impact (and can hence be ignored), and how these factors influence each other. To understand regression analysis, it is important to know the following:
Dependent variable: The variable that regression analysis seeks to understand or predict.
Independent variables: The hypothesised factors (predictors) that are expected to influence the dependent variable.
The simplest regression analysis is simple regression, which involves a single response variable and a single predictor variable. Simple linear regression helps in summarising and studying the relationship between two continuous quantitative variables.
Y = β0 + β1X + ε
where
Y = response/outcome/dependent variable
X = predictor/independent variable
β0 = intercept
β1 = slope coefficient
ε = error term or disturbance
If β1 is close to 0, it indicates little or no relationship; if β1 is a large positive or negative number, it indicates a large positive or negative relationship.
A broadly linear relationship may exist between x and y even when the (x, y) pairs do not all lie on a straight line. In such cases, we fit a linear regression line. The variance associated with Y is written as:
Var(Y) = σ²
At times, X is a random variable and, in such cases, we consider the conditional mean of Y, given that X = x:
E(Y | X = x) = β0 + β1x
Var(Y | X = x) = σ²
If β0, β1 and σ² are known, the model is complete. However, the parameters β0, β1 and σ² are usually unknown and ε is not observed.
The value of the model Y = β0 + β1X + ε depends on the estimation of the values of β0, β1 and σ². To determine these values, n pairs of observations (xi, yi), for i = 1, 2, …, n, are collected, giving the least-squares estimates:
β0 = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
β1 = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
Let us now see how the linear regression equation is found.
Step 1: Enter all the x and y data in a table as shown in Table 2.6.
From the table, Σx = 97, Σy = 601.22, Σxy = 11664.07, Σx² = 1882.2 and n = 5.
β0 = [(601.22)(1882.2) − (97)(11664.07)] / [5(1882.2) − (97)²] = (1131616.284 − 1131414.79) / (9411 − 9409) = 201.494 / 2 = 100.747
β1 = [5(11664.07) − (97)(601.22)] / [5(1882.2) − (97)²] = (58320.35 − 58318.34) / (9411 − 9409) = 2.01 / 2 = 1.005
Hence, the fitted regression equation is:
Y = 100.747 + 1.005X
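The same estimates can be reproduced from the summary sums alone; a short Python sketch of the two formulas (the function name is ours, values from the worked example):

```python
def linreg_from_sums(n, sx, sy, sxy, sxx):
    """Least-squares intercept and slope from the summary sums."""
    denom = n * sxx - sx ** 2       # n(Σx²) − (Σx)²
    b0 = (sy * sxx - sx * sxy) / denom
    b1 = (n * sxy - sx * sy) / denom
    return b0, b1

b0, b1 = linreg_from_sums(n=5, sx=97, sy=601.22, sxy=11664.07, sxx=1882.2)
print(round(b0, 3), round(b1, 3))  # 100.747 1.005
```

This confirms the fitted equation Y = 100.747 + 1.005X.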
Multiple regression extends this model to more than one independent variable:
Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk
Note that the given multiple regression equation has k independent variables, where β1, β2, …, βk are the coefficients of the respective independent variables. The simplest form of multiple regression is created with two independent variables as follows:
Y = β0 + β1X1 + β2X2 + ε
How well an equation fits the data is expressed by R², called the coefficient of multiple determination, which can range from 0 to 1. An R² value of 0 means that there is no relationship between Y and the X variables. An R² value of 1 represents a perfect fit, meaning that there is no difference between the observed and expected values of Y. It must be noted that the P-value is a function of R², the number of observations and the number of X variables.
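R² is computed as 1 − SSresidual/SStotal; a small Python sketch makes the two extremes concrete (toy data, for illustration only):

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((v - y_bar) ** 2 for v in y)          # total variation in Y
    ss_res = sum((v - h) ** 2 for v, h in zip(y, y_hat))  # unexplained variation
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0 (perfect fit)
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0 (no better than predicting the mean)
```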
Exhibit-2
Multiple Regression Example
The National Highways Authority of India recently concluded a study to determine the relation between various factors and the average number of accidents on highway NH-226. Researchers assumed that the total number of accidents is affected by the total population and the average speed of vehicles (km/hr) on the highway. Here, the researchers constructed their null hypothesis as: the total number of accidents is not affected by the total population and the average speed of vehicles (km/hr) on the highway.
β0 = 1.2 (the number of accidents on NH-226 that would be expected if both independent variables were equal to zero, that is, if the population and the average speed both became 0)
β1 = 0.00005
If X2 remains the same, β1 indicates that for every person added to the population, the number of accidents (Y) increases by 0.00005.
β2 = 15
If X1 remains the same, β2 indicates that for every km/hr increase in the average speed, the number of accidents (Y) increases by 15.
The multiple regression equation for Y is:
Y = β0 + β1X1 + β2X2 = 1.2 + 0.00005 X1 + 15 X2
For example, with X1 = 3,00,00,000 (population) and X2 = 70 (km/hr):
Y = 1.2 + 0.00005 (3,00,00,000) + 15 (70) = 1.2 + 1500 + 1050 = 2551.2
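Substituting values into the fitted equation is a one-line computation; a Python sketch (the function name is ours, coefficients from the exhibit):

```python
def predicted_accidents(population, avg_speed_kmh):
    """NH-226 exhibit model: intercept 1.2, 0.00005 per person, 15 per km/hr."""
    return 1.2 + 0.00005 * population + 15 * avg_speed_kmh

print(round(predicted_accidents(30_000_000, 70), 1))  # 2551.2
```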
Types of Regression Techniques
Regression is a statistical technique that helps determine the relationship between a dependent variable and one or more independent variables. So far, you have studied only linear and multiple regression. However, there exist hundreds of different types of regression. Each regression technique has its importance and the specific conditions in which it is best suited. It is best to first analyse a given problem and then decide which regression technique should be applied, instead of simply applying linear regression in every case. Seven important types of regression techniques include:
Linear Regression: You have studied linear regression in detail in the previous section.
Logistic Regression: This type of regression is used in cases wherein the dependent variable is
binary in nature and the independent variables may be continuous or binary.
Polynomial Regression: This type of regression is used in situations wherein the relation between the dependent and independent variables is non-linear.
Ordinal Regression: Ordinal regression technique is used to predict ranked values. It is used in
cases where the dependent variable is ordinal in nature.
Ridge Regression: It is a form of linear regression that puts constraints on (shrinks) the regression coefficients, which makes the estimates more stable when the independent variables are correlated.
Principal Components Regression (PCR): This technique is used in cases where there are many independent variables or where the data contains multicollinearity. It is divided into two steps, namely obtaining the principal components and then running the regression analysis on those principal components.
Partial Least Squares (PLS) Regression: The PLS regression technique is an alternative to the PCR technique; it is used in cases where the independent variables are highly correlated or where there are a large number of independent variables.
Apart from these, there are various other regression techniques, such as Poisson regression, negative binomial regression, quasi-Poisson regression, Cox regression, stepwise regression, lasso regression and elastic net regression.
Summary
In this chapter, we have covered the concept of statistics and its use in data science. We started our discussion with probability theory, which included the study of various types of probability distributions and their characteristics. Next, we discussed the concept of statistical inference and the two major types of statistical inference, namely frequentist inference and Bayesian inference. Further, we discussed sampling theory, under which you studied the related concepts of population, sampling frame, sampling methods, sample size determination, sampling and data collection, and sampling errors. Then, we discussed the concept of hypothesis testing along with important concepts such as the four steps of hypothesis testing, sampling distributions and types of error. Towards the end of the chapter, we studied the concept of regression analysis.
Exercise
Multiple-Choice Questions
What will come in place of the blanks in the above statement if only two words would come in place of all the three blanks?
a. Sample
b. Data
c. Statistic
d. Sampling frame
Assignment
Q1. 1000 students of class 9 of ABC School score a mean IQ of 100 in an IQ test with a standard
deviation of 15. Assume that you are a researcher and want to determine what percent of the
students would score between mean – 1 SD and mean + 1 SD.
Q2. It is said that inferential statistics is a set of methods used to make generalizations, estimates,
predictions or decisions. Prepare a case study on the use of inferential statistics for a business.
Q3. In the year 1936, presidential elections were held in America. The main contestants were Alfred Landon (the Republican governor of Kansas) and the incumbent President, Franklin D. Roosevelt. A study to predict the winner of the presidential polls was conducted by The Literary Digest, which was one of the most respected magazines of the time. This study predicted that Landon would get 57% of the vote against Roosevelt’s 43%. However, the results revealed that Roosevelt got 62% of the votes as against the 38% got by Landon. It meant that there was a 19% sampling error. Study this case and report the causes of such a high sampling error.
Q4. Enlist and describe at least four different types of distributions. Also specify whether each of
the distributions is a continuous or a discrete distribution.
Q5. Explain in detail the frequentist and Bayesian inferences.
Q6. Write a short note on the process of determining sample size for population.
Q7. Explain the different types of regression. Describe linear regression in detail.
Medical science has established the relation between high cholesterol levels and increased risk
of coronary heart diseases and strokes. Therefore, the health authorities of any country take into
account the distribution of cholesterol levels in their country for undertaking any public health
planning. To thoroughly study the cholesterol levels of men of a specific area, the Centers for Disease
Control and Prevention (CDC), USA carried out a study.
The study was conducted by first randomly selecting a sample of 1169 men aged between 35 and 44. These men were tested for their cholesterol levels, the mean of which came out to be 205 milligrams per decilitre with a standard deviation of 39.2 milligrams per decilitre. The research team assumed that the total cholesterol level was normally distributed, and they wanted to find the highest total cholesterol level that a man aged between 35 and 44 could have and still be in the lowest 1% of the sample.
CDC established that the total cholesterol levels in the lowest 1% of the sample correspond to the shaded region shown in Figure 2.9: a standard normal curve with the area to the left of z = −2.33 shaded (1%), plotted against x (total cholesterol level, in mg/dL) with the mean at 205.
The 1st percentile represents the total cholesterol level corresponding to the lowest 1%. The cholesterol level corresponding to the 1st percentile can be found by determining the z-score that corresponds to a cumulative area of 0.01. For this, we look up the table of the standard normal distribution. From this table, we find that the area closest to 0.01 is 0.0099 and the corresponding z-score is −2.33.
Thus, x = 205 + (−2.33)(39.2) = 113.66 ≈ 114. It means that the value that separates the men having the lowest 1% of cholesterol levels from the rest of the 99% is 114 mg/dL.
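The same cutoff can be checked with Python’s standard library, which inverts the normal CDF exactly instead of using the table value z = −2.33:

```python
from statistics import NormalDist

# Total cholesterol ~ Normal(mean=205, sd=39.2); find the 1st-percentile cutoff
cutoff = NormalDist(mu=205, sigma=39.2).inv_cdf(0.01)
print(round(cutoff))  # 114
```

The exact z-score is about −2.326, so the cutoff comes out near 113.8, which rounds to the same 114 mg/dL.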
Questions
1. What would be the x-value if the standard deviation of cholesterol in sample of men was 25
milligrams per decilitre?
(Hint: x = 205 + (–2.33) (25) = 146.75 ≈ 147)
2. If, instead of 1169 men, the researchers had taken 2000 men, what could be its impact on the standard deviation?
(Hint: As the sample size increases, the standard error of the sample mean decreases; the sample standard deviation itself need not change systematically.)
L A B E X E R C I S E
Statistics has long been used in all forms of analysis. The advancements in the field of statistics in recent times, along with the large amount of data being generated every day, have increased the importance of this field of study. These days, R is used to write programs that find the statistical mean, median and mode, and implement other operations. Some basic statistical operations, such as the t-test, are implemented in this lab exercise.
LAB 1
Solution: To perform this lab, you must know the basic operations in statistics, such as calculation
of mean, median, and mode.
Start by loading the empRecruit.csv dataset file into the R environment by entering the following commands:
> empRecruit = read.csv("datasets/empRecruit.csv", header = T)
> head(empRecruit)
     Admit Gender    Dept Freq
1 Admitted   Male     Art  512
2 Rejected   Male     Art  313
3 Admitted Female     Art   89
4 Rejected Female     Art   19
5 Admitted   Male Banking  353
6 Rejected   Male Banking  207
The head() function retrieves the top records from the dataset.
To cross-tabulate the admission frequencies by gender, use the xtabs() function, as shown in the following command:
> xtabs(Freq ~ Gender + Admit, empRecruit)
        Admit
Gender   Admitted Rejected
  Female      557     1278
  Male       1198     1493
To view the median of the whole dataset, use the summary() function, as shown in the following command:
> summary(empRecruit)
FIGURE 2.10 Presenting Data Values and Summary
LAB 2
Before training: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
After training: 12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1
From the two paired samples, we have to check whether there is an improvement or deterioration in the means of the scores, or whether the means have substantially remained the same (hypothesis H0). We can check this by conducting a Student’s t-test for the paired samples.
Solution: Before starting the lab, you should have a sound understanding of the t-test, its significance in statistical analyses, and its application on datasets. You should also know about sample datasets on which the t-test can be applied.
First, we create the datasets by using the following commands:
> beforeTrn = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3,
14.4, 11.3)
> afterTrn = c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9,
16.0, 11.1)
The command to apply the Student’s t-test for paired samples on the datasets is as follows:
> t.test(beforeTrn, afterTrn, paired = TRUE)
        Paired t-test
data:  beforeTrn and afterTrn
t = -0.2133, df = 9, p-value = 0.8358
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5802549  0.4802549
sample estimates:
mean of the differences
                  -0.05
Here, since the p-value (0.8358) is greater than 0.05, we fail to reject the null hypothesis (H0) of equality of the averages, meaning that the new training has not made any significant improvement in the physical fitness of the students. Equivalently, observing the results, since |t-computed| < t-tabulated, we do not reject the null hypothesis (H0).
LAB 3
Solution: Before starting the lab, you need to load the CSV file containing the petrol consumption
data into the current R environment. You can load the file by using the following command:
> petrolData = read.csv("Datasets/Petrol_Consuption.csv")
You can view the records and summarize the dataset petrolData created above by using the following commands:
> head(petrolData)      # Displaying top records
FIGURE 2.11 Displaying Top Records from the Dataset
> summary(petrolData)   # Summarizing the dataset
After loading the data in a tabular format from a CSV file, let’s apply simple linear regression on the
petrol consumption dataset.
Let’s begin by finding the relation between two or more factors that impact the consumption
of petrol. The relation between factors is called correlation and it can be calculated by using the
following command:
> cor(data.frame(petrolData$Petrol_tax_cents_per_gallon,
+   petrolData$Average_income_dollars,
+   petrolData$Prop_pop_drivers_licenses,
+   petrolData$Consum_mill_gallons))
Enter the following command to plot the two most closely related variables against each other. The variables, in our case, are Prop_pop_drivers_licenses and Consum_mill_gallons:
> plot(petrolData$Prop_pop_drivers_licenses, petrolData$Consum_mill_gallons, col = "purple")
The resulting scatter plot shows Consum_mill_gallons (roughly 400 to 700 on the y-axis) plotted against petrolData$Prop_pop_drivers_licenses (roughly 0.4 to 0.9 on the x-axis).
The results of the regression analysis are stored in a model object named petrolReg, created, for example, as follows:
> petrolReg = lm(Consum_mill_gallons ~ Prop_pop_drivers_licenses, data = petrolData)
We can view these results by using the following command:
> petrolReg
You can also summarize the regression results by using the following command:
> summary(petrolReg)
FIGURE 2.14 Applying Simple Linear Regression on a Dataset
We can get some more useful results from the simple linear regression analysis by using certain functions, as shown in Figure 2.15.
We can predict values on the basis of the simple linear regression analysis by using the predict() function. Figure 2.16 shows the commands for prediction and for the analysis of variance:
FIGURE 2.16 Predicting Values and Calculating Analysis of Variance
The result of simple linear regression can be plotted on a graph. Before plotting the graph, we need
to assign a layout for the results of the analysis. In our case, we are going to use a matrix layout. For
this, use the following command:
> layout(matrix(1:4,2,2))
> plot(petrolReg)
Figure 2.17 shows the graphs plotted from the above commands:
(The four diagnostic plots display residuals and standardized residuals for the fitted model petrolReg.)
The results of the second regression analysis can be viewed by using the following command:
> petrolReg2
Figure 2.18 shows the application of simple linear regression for a different set of variables:
FIGURE 2.18 Applying Simple Linear Regression for a Different Set of Variables
Multiple Linear Regression
Simple linear regression can be applied to a dependent variable with only a single independent variable; it cannot handle more than one. If we have two or more explanatory variables on which the value of the dependent variable depends, we need to apply multiple linear regression for insights.
Multiple linear regression on the petrol consumption data can be applied (for example, regressing consumption on the other three variables) and summarized by using the following commands:
> petrolMultiReg = lm(Consum_mill_gallons ~ Petrol_tax_cents_per_gallon +
+   Average_income_dollars + Prop_pop_drivers_licenses, data = petrolData)
> summary(petrolMultiReg)
Figure 2.19 shows the application of multiple linear regression on the petrol consumption data:
For a different set of variables, apply multiple linear regression by using the following command:
> petrolMultiReg2
FIGURE 2.20 Displaying Result of the Analysis
Summarize the results of multiple linear regression by using the following command:
> summary(petrolMultiReg2)
Figure 2.21 shows the output of the preceding command:
Note
The models built by using simple linear regression and multiple linear regression on petrol
consumption data can now be used for predicting petrol consumption in a city with the given
characteristics.
CHAPTER
3
Implementation of
Decision-Making and Support
Topics Discussed
Introduction
Concept of Decision-Making
Types of Decisions
Decision-Making Process
Understanding Decision Support System
Techniques of Decision-Making
Application of Data Science for Decision-Making in Key Areas
Economics
Telecommunication
Bioinformatics
Engineering
Healthcare
Information and Communication Technology
Logistics
Process Industry
INTRODUCTION
Decisions are inevitable and help an organization pave its way towards better performance. These
decisions could be related to strategies, business activities or HR, and each one of them is made
in the best interest of the organization. The process of decision-making is an integral part of the
managerial process in any organization. However, this process is highly complex and involves
experts from different domains. In case of large organizations, there is often a team of experts
specially trained to make all sorts of decisions; while usually in small organizations, all significant
decisions are taken by the managerial board.
The decision-making process is collective and consultative, and has a considerable impact on
the overall growth and prospects of the organization. However, there are some advantages and
disadvantages in the process that reflect the consequences on the overall performance of the
organization.
As you know, decisions are taken to support organizational growth. Therefore, it requires the
manager to be able to take critical decisions at any level – top, middle, or entry level. The foundation
of management in an organization is built on managerial decisions, which are reflected through its
day-to-day operations. Many big corporations use effective communication tools in addition to the
normal consultation process to make decisions that would have large-scale implications.
Discussions and consultations along with standard procedures and techniques are the two main
tools that maintain and eventually facilitate decision-making. For example, when the strategic
management team suggests a decision on initiating a new business activity, it must follow a series of
deliberate discussions and consultations. Decisions taken by strategic managers often reflect new
and innovative business initiatives. Thus, such decisions require the implementation team to have
a consultative discussion. An extensive debate and research is required before finalizing a decision.
Moreover, the final decision to roll out a product or service is accomplished through collective interim
decisions taken by various internal and external units. This decision proves to be reflective with the
research done and consultations within various levels in the organization. The overall process is a
sequence of steps where one decision is taken at one point, and where each level has far-forcing
implications on the overall decision-making strategy of the organization.
In this chapter, you will first learn about the concept of decision-making and the types of decisions. Next, you will learn about the techniques of decision-making. The chapter then discusses the development of decision support systems and their applications.
CONCEPT OF DECISION-MAKING
Decision-making9 can be interpreted as the process of making the best choice between prospective and vague alternatives for meeting the objectives. Therefore, decision-making is concerned with the
future and involves the act of selecting one best course of action from various courses of action. It is
one of the major functions of management, which is difficult but very important. The most important
responsibility of management in any organization is to set up organizational goals and allocate the
available resources effectively and efficiently. Resources are always limited and need to be used
judiciously to achieve maximum profits. Accounting information can improve the understanding of
the management with respect to the alternative resource allocation. This information is provided by
the cost management information system in terms of supply cost and revenue data that are useful
to make strategic decisions.
Types of Decisions
As you have studied, the decision-making process refers to selecting the best choice from the
available options. It requires a hard thinking and intellectual weighing of different options to arrive
at a certain choice depending on the requirement of the situation. Decisions are needed when
different options are available, when problems occur and a solution is required, and when an
opportunity comes along and there is a need to make a choice.
As each individual has different perspectives and different intellectual skills, the ability to make a
decision also varies. Decision makers are categorized into various types. Similarly, the approach of
decision-making is also characterized into different kinds. The type of decision generally depends
on an individual as well as on the situation at hand. The most common types of decisions that an
organization usually makes are given as follows:
1. Programmed decisions: These are standard decisions that typically follow a repetitive practice.
These kinds of decisions are taken for routine jobs. For example, most organizations have
standard procedures to address customer complaints. Similarly, an inventory manager can
order the product that is dipping beyond the cut-off mark to maintain stock. The procedure
to take programmed decisions is fixed. Therefore, it can be written down in a sequence of
steps which everyone can follow as a standard. They could also be written in the form of a
computer program.
2. Non-programmed decisions: These are non-standard and non-routine decisions, where every
decision is different from the previous one. There is no need to set guidelines or rules for
such decisions as each situation is either uncertain or unplanned; for example, a decision on
whether the firm should go for a merger/acquisition or not. Similarly, selecting a college for
further studies is a non-programmed decision. These decisions are taken when the situation
is unique and information is unstructured. The decision maker needs to collect information,
establish link between the pieces of information available, consider the different alternatives,
and delve into a continuous thinking process.
3. Strategic decisions: These are the long-term decisions that could set the trend of business;
for example, a decision on whether the company should launch a new service or product,
stop offering a product, or acquire a company for some specific service or product. Another
example could be to train employees for enhancing performance and sustaining over a long
term.
4. Tactical decisions: These are medium-term decisions that are taken in regard to implementing
strategic decisions; for example, market analysis for a new product or the staff required.
Grocery stores sell the data received through bar code scanners to organizations such as Information Resources, Inc. (IRI), which are responsible for collating it and selling it on to grocery vendors and wholesalers. These buyers can study the selling patterns of their competitors and respond according to the prevailing situation. At a tactical level, the forecast defined
here is applied as a policy. At Continental Airlines, the tactical decisions are taken based on
the query generated by the staff that accesses the Flight Management Dashboard application Self-Instructional
to check a particular flight status at a specific time. Material
5. Operational decisions: These are short-term decisions that guide us on how to perform
the regular operations; for example, the decision to hire a particular logistic company to make
deliveries. Another example from the financial sector can be EMC Insurance Companies, which
find it difficult to determine the amount of money required to be held in reserve against any
potential case payouts. As a solution, EMC opted for PolyVista, a data analytics software, to
reveal the hidden patterns, relationships, and anomalies within the firm’s warehouse of claim
data.
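The fixed procedure of a programmed decision, such as the inventory reorder rule described under type 1 above, can indeed be written as a computer program. The sketch below is purely illustrative; the cut-off mark and order quantity are invented numbers, not taken from any real inventory system:

```python
# A programmed decision as code: the rule is fixed, so it can be written
# down once and applied the same way every time.
REORDER_POINT = 50    # illustrative cut-off mark for stock level
ORDER_QUANTITY = 200  # illustrative fixed quantity to order

def reorder_decision(stock_level):
    """Return how many units to order under the fixed rule."""
    if stock_level < REORDER_POINT:  # stock has dipped beyond the cut-off
        return ORDER_QUANTITY
    return 0  # sufficient stock: no action needed

print(reorder_decision(30), reorder_decision(120))
```

Because the rule is explicit, anyone, or any system, applying it reaches the same decision every time.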
Decision-Making Process
The decision-making process involves a series of steps that need to be taken in a logical order.
The process is lengthy and time-consuming, but it is carried out in a scientific manner, and it
suggests a number of general guidelines and methods for how a decision should be taken. Peter
Drucker suggested such a scientific method of decision-making in his book 'The Practice of
Management', published in 1955. This method is considered a base model, and organizations can
modify it depending on the nature of their business. This basic model, given by Peter Drucker,
consists of seven steps, as shown in Figure 3.1:
Figure 3.1 depicts the seven steps: identify the problem, analyze the problem, collect relevant
data, develop alternative solutions, select the best solution, convert the decision into action,
and ensure feedback.
Develop alternative solutions: After collecting the relevant data, the next step is the development
of a number of alternatives. The group participation approach is very useful in developing
alternative solutions. For example, alternative solutions for quality problems in an organization
can be the implementation of new techniques, the installation of sophisticated machinery, and
the installation of a quality monitoring system. Tesco uses real-time data to gain detailed
insights unlike any of its competitors. On the basis of the buying patterns of its consumers,
Tesco can introduce many new schemes to increase sales and customer loyalty.
Select the best solution: When the alternative solutions have been developed, the subsequent
step is to select the best alternative to obtain the best result. The selected alternative should
be conveyed to those who are liable to be influenced by it. The decision can be implemented
effectively if it is accepted by all group members.
Convert decision into action: Once the best decision has been selected, the next step is to
translate it into effectual action. In the absence of such action, the decision is merely an
assertion of good intentions. The manager's role is to convert 'his/her decision' into 'their
decision' with the help of his/her leadership skills. The manager should take his/her subordinates
into confidence and convince them of the appropriateness of the decision, and then follow up
to ensure its proper execution.
Ensure feedback: This is the last step of the decision-making process. The manager needs to
formulate provisions to ensure that the feedback is taken continuously and that the actual
developments are in tune with the expectations. It is the process of checking the effectiveness
of the decision taken. Feedback should provide organized information in the form of reports
and personal observations. Feedback helps in establishing whether the decision that has been
taken needs to be continued or modified taking the changed conditions into account.
There are many companies that enjoy the advantage of business intelligence in effective decision-
making. For example, the fast food chain McDonald's uses Business Intelligence (BI) to make
strategic decisions, such as what to add to the menu, which under-performing stores need to be
closed, or what new schemes should be implemented to lift them out of a low-profit zone.
Yahoo, Inc. also uses BI to bring changes to its website. Millions of users hit the organization's
home page each hour. To test changes to the home page, the organization randomly selects a
few thousand users as an experimental group and observes their behavior. It can obtain the
results of the analysis in just a few minutes. This fast access to results helps the organization
optimize its offering to increase the number of hits and profits. At any given time, Yahoo
generally runs about 20 such experiments.
In other words, decision support systems (DSS) belong to a particular category of information
systems that support decision-making activities. A DSS is used to analyze business data and
provide interactive information support to all decision makers. Its role starts right from the
stage of problem identification and continues till the decision is implemented. A DSS uses
analytical models, dedicated databases, the insight and judgement of the decision maker, and
an interactive, computer-based modeling approach to support unstructured decisions.
In an age of ever-evolving competition in the global business environment, you do not enjoy the
liberty of being able to spend too much time on taking decisions. You are always expected to make
more and faster decisions than ever before. This is the reason why businesses need leaders who can
take quick decisions. The entire decision-making process has become extremely accelerated. In such
a situation, it is impracticable to depend solely on human response. Therefore, companies need a
DSS to react and adapt to the persistently changing business environment.
Therefore, to be successful in today's business environment, your company requires information
systems that can support diverse information and decision-making needs. In addition, the
system needs to support you in taking prompt decisions. DSSs assist in assessing and resolving
the questions posed every day in businesses. To do so, DSSs analyze raw data, documents,
personal knowledge, and business models to gather useful information.
To attain maximum performance in the existing business environment, you need to achieve a
competitive advantage; otherwise, the proper functioning of your company is in doubt, and it
may ultimately come to a close.
One way to gain competitive advantage is to use a computerized DSS, whose most tangible
benefit, in the simplest terms, is the ability to help you take better decisions. Your decisions
can be considered better when, on implementation, they reduce costs effectively, use assets
more economically, enhance revenue, cut down risks, and improve customer service.
Exhibit 1
Decision-Making in Credit Market
In this app economy era, every one of us is influenced by data-driven decisions, whether we realize
it or not. Similarly, the credit business has also acknowledged the importance of analytics in
decision-making, so that the customers get the best value and don't feel underserved. Companies
using data and analytics for decision-making are generating up to 9 percent more revenue and
gaining 26 percent more profit than the competition. This is not easy to achieve, as organizations
face grave challenges in getting the right data, filtering it and then analyzing it.
Data, analytics and decision technologies are converging to solve these kinds of problems
at a rapid rate. The ability to create data models and predictions from new and larger quantities
of data is now much easier through data science, decision-making and machine learning,
producing a substantial benefit for the business.
TECHNIQUES OF DECISION-MAKING
A decision, when taken scientifically, leaves little scope for confusion and generally meets the
required goals. The accuracy and rationale of such decisions can be justified even if they do not
accomplish the required purpose. Though there is no standard to ensure that a decision has been
taken correctly, there are some important techniques that can assist the manager. Some of the
widely used techniques are given as follows:
1. Operations research: This is the study in which scientific methods are applied to complicated
problems arising in the direction and administration of sizeable systems comprising men,
machines, raw materials, and capital in various areas, such as business, industry, defense,
and government. As Robert J. Thierauf has stated, “Operations research utilises the planned
approach and an interdisciplinary team in order to represent functional relationships as
mathematical models for the purpose of providing a quantitative basis for decision-making
and uncovering new problems for quantitative analysis.”
Operations research enables the decision-making authority to formulate decisions objectively
through fact-based direction and support, easing the manager's burden in terms of effort
and time. Managerial problems that are generally subjected to operations research analysis
include inventory control, production scheduling, plant expansion, sales policies, etc. DANOPT
is a company that uses operations research for business analytics to outperform in the market.
2. Models: These are uncomplicated, convenient, and reasonably efficient resource-conservation
tools used for testing hypotheses. They are based on mathematical relationships and are
therefore also known as mathematical models. Such models facilitate the concept of optimization
while making decisions and play a very important role in the calculation and selection of the
best possible alternative solution to a particular problem. Linear programming, for example,
is a widely used mathematical model for choosing the best alternative under a set of resource
constraints.
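To illustrate how a mathematical model supports optimization, the sketch below states a tiny two-product mix problem and searches the feasible combinations for the most profitable one. All coefficients are hypothetical; a real problem of any size would use a dedicated linear-programming solver rather than enumeration, which is used here only to keep the sketch self-contained:

```python
# A toy product-mix model: maximize profit subject to labour and
# material constraints, by brute-force enumeration of integer points.
PROFIT = {"A": 3, "B": 5}    # profit per unit of products A and B
LABOUR = {"A": 1, "B": 2}    # labour hours per unit (40 hours available)
MATERIAL = {"A": 3, "B": 1}  # material units per unit (60 units available)

best = (0, 0, 0)  # (profit, units of A, units of B)
for a in range(61):
    for b in range(41):
        feasible = (LABOUR["A"] * a + LABOUR["B"] * b <= 40
                    and MATERIAL["A"] * a + MATERIAL["B"] * b <= 60)
        if feasible:
            p = PROFIT["A"] * a + PROFIT["B"] * b
            if p > best[0]:
                best = (p, a, b)

print(best)  # the most profitable feasible mix
```

The model makes the trade-off explicit: the best mix uses both resources fully, which is exactly the kind of answer a manager cannot reliably reach by intuition alone.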
Data science-driven decision-making finds application in a number of fields:
Economics
Telecommunications
Bioinformatics
Engineering
Healthcare
ICT
Logistics
Process industry
Let us discuss the application of data science in each of these fields in detail.
Economics
After many weeks of interviewing, you have got job offers from three different companies. The
offers differ greatly, which creates some confusion. You have created a small list of the offers:
1. Giant national firm: $12 per hour starting wage, insurance and dental benefits paid by
the corporation, a two-week holiday each year, and potential for fast advancement.
2. Small local firm: $20 per hour starting wage, insurance and dental benefits offered but
you need to pay the premiums, a two-week without-pay vacation every year, stock
options and profit-sharing benefits, and potential for advancement.
3. Regional firm: $15 per hour starting wage, full insurance and dental benefits, a one-week
holiday, a good retirement program, and moderate advancement potential.
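As a rough illustration of data-driven comparison, the sketch below totals the annual wages of the three offers, assuming a 40-hour week and a 52-week year, with offer 2 paid for only 50 weeks because its vacation is unpaid. Benefits, premiums and advancement potential are deliberately left out, which is exactly why wage totals alone cannot settle the decision:

```python
# Annual wage totals under stated assumptions (40 h/week, 52-week year,
# offer 2 unpaid for its two vacation weeks). Illustrative only.
offers = {
    "giant national firm": 12 * 40 * 52,
    "small local firm":    20 * 40 * 50,  # two weeks without pay
    "regional firm":       15 * 40 * 52,
}
for name, annual in offers.items():
    print(f"{name}: ${annual:,}")
```

The small local firm wins on wages, but the untallied items (premiums, retirement, advancement) could easily reverse the ranking, which is the point of the example.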
Regardless of the shape of the organization or enterprise, success in the world of business
usually depends on economic choices. Because economic decision-making depends heavily on
analysing data, it is crucial for that data to be helpful to economic decision makers. All
economic choices of any consequence require some form of accounting data, typically in the
form of financial reports.
Telecommunication
Telecommunication12 can be defined as the exchange of information over a distance using
different technologies; telecommunication industries provide the infrastructure for the
transmission of information using telephone devices and the Internet.
The telecommunication industry has evolved due to the prompt expansion of mobile devices
and smartphones. Because of this, telecommunication companies need to gather huge amounts
of data from call records, data usage, server logs, etc., and extract valuable information from
this large amount of data in order to improve customer experience and grow the business.
Data science can help handle this data and improve its accountability by boosting services
related to networks, customers and security. It is one of the best ways to understand
prospective customers better, predict their actions and behavior, and deliver the right
assistance and solutions. Some of the ways data science is used in the telecommunication
industry are as follows:
Gathering, mapping and analyzing data from different data sources.
Identifying the duration of the heaviest data usage over the network, and taking steps to
manage it.
Identifying customers who face problems with paying bills, assisting them, and taking
appropriate steps to make recovery of payment easy.
Analyzing call and data statistics.
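A small sketch of the second point, identifying the hour of heaviest data usage, might look as follows; the records and their layout are invented for illustration:

```python
# Find the hour of heaviest network load from hypothetical usage records.
from collections import defaultdict

records = [  # (hour of day, megabytes transferred)
    (8, 120), (9, 340), (20, 910), (21, 1450), (21, 600), (22, 980),
]

load = defaultdict(int)
for hour, mb in records:
    load[hour] += mb  # total traffic per hour

peak_hour = max(load, key=load.get)
print(peak_hour, load[peak_hour])
```

In practice the same aggregation would run over millions of call-detail and server-log records, but the decision logic, aggregate then compare, is identical.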
Bioinformatics
Bioinformatics can be understood as a field of science which involves the study, development
and utilization of tools, programming languages and software to analyze and comprehend
biological information.
Being an interdisciplinary field, bioinformatics uses various computational methods to analyze
huge amounts of data, such as cell counts, genetic sequences and protein structures, in order to
predict new solutions and expand biological understanding. Technologies like next-generation
sequencing generate enormous amounts of data, which must be properly organized and
clustered to make sense of it. Big data like this, if utilized optimally, can be extremely useful
in drug discovery or preventive medicine design. With the introduction of data science in this
field, the management of big data and data visualization has been made easy and scalable.
Although bioinformatics and data science are two distinct disciplines, they share a common
objective of cleaning, understanding and processing data. A large number of machine learning-
based tools are presently used in bioinformatics. Recently, TensorFlow, a deep learning library
from Google, has demonstrated how it can be used in biological computations. The application
of data science in the field of bioinformatics is relatively new, but in a short time, it has
established itself as a great contributor to disease diagnosis and prediction.
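As a toy illustration of the kind of sequence computation such tools automate, the sketch below computes GC content, the fraction of G and C bases, for made-up DNA sequences:

```python
# GC content: a classic, simple sequence statistic used in genomics.
def gc_content(seq):
    """Fraction of bases in a DNA sequence that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Hypothetical sequences, purely for illustration.
for name, seq in {"gene1": "ATGCGC", "gene2": "ATATAT"}.items():
    print(name, round(gc_content(seq), 2))
```

Real pipelines apply computations like this across millions of reads, which is why the organization and clustering of sequencing output matters so much.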
Engineering
Software engineering can be defined as the detailed study of designing, developing and
maintaining a software product. Since most software is used directly by a huge population or
by corporations processing large amounts of data, these products are sources of valuable
information that can be used for various purposes. This is why data science should matter
very much to a software engineer. Over the course of time, many software engineers have
realized that they can leverage this data flow to find answers to their own questions, some of
them crucial, for example:
Which feature of the software is most popular among consumers and which is not?
What problem is being created by a bug and who should fix it?
Let us take some cases where data science has affected the perspective on a problem.
Research professionals, while examining failures on Hadoop MapReduce and similar systems,
found the following facts:
Error logs contained enough data to reproduce the failures.
Only a few nodes were required to debug the whole cluster.
Simple testing and error handling could have prevented a majority of failures.
This analysis led to a simple but vital conclusion: if engineers had also tested how their code
behaved when things went wrong instead of right, most of the catastrophic failures could have
been averted. To improve software, a software engineer can apply the statistical tools of data
science to examine important information, such as server metrics and logs. Data science helps
engineers ask vital questions about a product and then find the answers using the relevant data.
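A minimal sketch of mining error logs for hotspots, in the spirit of the study above, might look like this; the log lines and module names are invented:

```python
# Count which module produces the most ERROR lines in a log.
import re
from collections import Counter

log = """\
2024-01-05 10:12:01 ERROR scheduler: task 17 failed
2024-01-05 10:12:04 INFO  worker: task 18 done
2024-01-05 10:13:22 ERROR scheduler: task 19 failed
2024-01-05 10:14:09 ERROR storage: block missing
"""

errors = Counter(
    m.group(1)
    for line in log.splitlines()
    if (m := re.search(r"ERROR\s+(\w+):", line))
)
print(errors.most_common(1))
```

Even this crude tally answers a real engineering question, where should debugging effort go first, directly from the data the system already emits.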
Healthcare
In the past decade, healthcare has also been impacted greatly by the advancement of technology.
According to Forbes, approximately 200 digital companies invested more than 3.5 billion
dollars in 2017 in its digitization. This is done to manage the large chunks of data generated by
this sector annually. For instance, in the US, approximately 1.2 billion documents related to the
healthcare of patients are generated annually. Such a high amount of data is collected,
structured and processed to understand health issues deeply. Thousands of data scientists and
machine-learning experts are contributing to the healthcare industry in different ways, helping
doctors diagnose patients with great accuracy and efficiency. Data science is thus playing
a key role in the advancement of the healthcare industry.
Approximately 2 terabytes of data are generated by the human body, and due to advancements
in technology, we are capable of collecting most of it. This data can relate to a patient's
heartbeat, sleeping patterns, levels of blood glucose and stress, brain activity, etc. Companies
like IBM, Apple, and Qualcomm provide advanced equipment and frameworks to collect
patients' data with high accuracy. Machine learning algorithms are also playing a vital role in
detecting and tracking common health ailments related to the heart or respiration. With the
collection and analysis of acquired patterns of heartbeat and breathing, the technology is
advanced enough to detect disorders in a
patient’s health on the basis of the collected data. These days, obesity has emerged as a common
ailment globally. To tackle this health issue, Omada Health, a medicinal company, has launched a
data science-based preventive medicine program, described as the first digital therapeutic, to
help obese patients change their daily routine to lose weight or keep their body weight under
control. The program also helps patients avoid the health risks associated with obesity.
Omada uses smart devices like pedometers and scales to collect patients’ behavioral data and
customize its program for each patient. This customization acts as a personal health coach,
gaining deeper knowledge about the patient’s health and modifying the
program accordingly. Another deep learning organization, Enlitic, uses data science to increase
the accuracy and efficiency of diagnosis. Enlitic has created a deep learning algorithm for
reading and analysing imaging data generated from X-rays and CT scans. It then compares its
analysis with the results of clinical and laboratory reports. It has been found that the algorithm
delivers results with 70 percent accuracy and fifty thousand times faster. So, you can easily
conclude that data science is playing a significant role in the field of healthcare and medicine.
Example
Decision Making in Healthcare
Consider an example of decision-making in healthcare. In this example, R code is used for
making decisions about patients on the basis of joint or conditional test results. The data set
comes from a heart disease study, with data on whether the patient had gone through
successive cardiovascular examinations, the clinical examination results and the results of a scan.
Now, consider the following code snippet:
dataset2 <- read.delim("c:/dca_example_dataset2.txt", header = TRUE, sep = "\t")
attach(dataset2)
# clinical exam: treat high-risk patients only
clinical_test <- clinical_exam == "high risk"
# joint test is positive if either the scan is positive
# or the clinical exam gives high risk
joint <- (clinical_exam == "high risk") | (scan == 1)
# conditional test: treat if high risk; scan if intermediate risk
conditional <- (clinical_exam == "high risk") |
  (clinical_exam == "intermediate risk" & scan == 1)
After determining the variables in the preceding code, you can execute the decision curve
analysis. The treatment threshold values are typically between 10 and 20% in the case of
cardiovascular disease. Consider the following code to plot the decision curve analysis:
library(rmda)
dca(yvar = event, xmatrix = cbind(clinical_test, joint, conditional),
  prob = rep("Y", 3), xstart = 0.1, xstop = 0.2, ymax = 0.15)
The output of the preceding code is a decision curve plot: net benefit (from –0.05 to 0.15) on
the y-axis against threshold probability (10% to 20%) on the x-axis, with curves for None, All,
and Models 1–3 (the clinical, joint and conditional tests).
The curve shows that the joint test is the best option for coming to a decision about the
health of a patient, as it has the greatest net benefit across almost the entire range of
threshold probabilities.
ICT
ICT technologies have achieved high optimization in handling data by utilizing distributed
storage mechanisms and computational capabilities arising from innovation in data science.
The association between industry and academia has led to a sharp rise in the usage of ICT
devices. Data science and ICT are now interdependent, which is immensely helpful in aiding
development in both areas. Data science plays a significant role in understanding the external
and internal factors impacting a business. These factors are determined using the data produced
from social media platforms, search engines and organizational portals, and are used extensively
in widespread business applications. There is a sharp rise in demand for a workforce that can
work in the interdisciplinary domain of data science and ICT: data scientists are required who
have adequate skills and knowledge of emerging information technologies and are capable of
implementing business solutions efficiently. Data science has also made the adoption of ICT
technologies easier for people. Moreover, one does not need highly specialized skills to adopt
data science; owing to the availability of libraries and user-friendly applications, business users
can implement data science easily and achieve swift business insights and results.
Logistics
The global logistics sector is one of the fastest growing industries. While generating revenue at
such a fast pace, it has yet to unleash the monetary and analytical potential of the huge amount
of data it generates. Business systems, social media, and a lot of other devices generate huge
volumes of data which, using business analytics, can be helpful for both the consumer and the
producer. This huge amount of data needs data science to make sense of it.
Let’s take an example to elaborate. A few years ago, if there was a problem in a vehicle, there
were two ways to address it: either the owner took the vehicle to the service garage, or a
repair service van came to the location of the vehicle. Nowadays, smart vehicles can detect,
analyze and report any kind of anomaly to the owner prior to any unexpected problem. This
saves huge cost and time for both the business and the consumers. Using data science, one can
predict future points of failure with significant precision using analytics and data-driven
algorithms. Better prediction of demand has enabled logistics companies to successfully cut
their inventory by 20 to 30 percent while increasing their fill rate by up to 7 percent. These are
only a few of the applications that can be implemented using data science and analytics.
Predictive analytics is the key to using data science successfully. Following are some examples
from the logistics industry:
Since logistics vehicles need frequent maintenance and a long lifetime on the road, data
analytics can be used to predict the mechanical parts that are most likely to fail and replace
them in time, saving time and money.
An increase or decrease in demand can be predicted for a particular demography, preparing
the company to move resources in a particular direction.
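A minimal sketch of the demand-prediction idea: a three-month moving average over hypothetical monthly shipment counts, used to judge whether demand in a region is rising:

```python
# Smooth monthly demand with a moving average to read the trend.
monthly_demand = [100, 104, 110, 118, 127, 139]  # hypothetical units/month

def moving_average(xs, window=3):
    """Trailing moving average over the series."""
    return [sum(xs[i - window:i]) / window for i in range(window, len(xs) + 1)]

trend = moving_average(monthly_demand)
rising = trend[-1] > trend[0]
print(trend)
print("demand rising:", rising)
```

Real forecasting models are far richer, but even this smoothing step is enough to trigger a tactical decision such as repositioning inventory toward the growing region.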
Data science techniques can help in improving various aspects such as automation, more
accurate vehicle tracking, and freight usage, including readiness for natural constraints such as
weather. Multiple data sources collated in a meaningful way, combined with the art of asking
the right question, pack the potential to push the logistics industry onto a very profitable path
from both the consumer's and the supplier's points of view.
Process Industry
The process industries are those in which items are produced in large amounts; examples
include food, chemicals, pharmaceuticals, petroleum, base metals, plastics, textiles, wood and
wood products, and paper and paper products. The process industry requires reformation due
to the advancement of new information technologies. The mechanism of improving processes
and gathering effective knowledge plays a significant role in all facets of the process industry,
including system integration, sustainability design, quality control, process control and decision
support. Data science and analytics are required to automate the process industry from
machine automation to information automation and, finally, to knowledge automation. In the
past few years, a large amount of data has been collected in the process industry because of
the usage of distributed control systems, yet this data has hardly been utilized for detailed
analyses. Nowadays, however, the significance of extracting information from the collected
data has emerged and acquired a central role in the process industry.
The useful information derived from data can similarly improve decision-making capacity in
healthcare, helping doctors and patients to overcome health
problems. Dr. Rema Padman of Carnegie Mellon University explains that today physicians rely
on classic clinical trials to make decisions about how to treat a particular patient. These trials are
very costly from the point of view of time and money. Also, the results of these trials only tell us
about the patient population on which the experiment was performed, which is not suitable for
every patient in every condition. But, if thousands of patients are being treated in several unique
conditions, why can’t we save that huge amount of data digitally to improve services? Now, with
the help of data science and analytics, this data-driven approach may help doctors assess optimal
routes for caring patients suffering from chronic conditions, something even the most acclaimed
clinical trials are not able to do. Using data analytics to help patients with multiple health conditions
enables doctors to catch any anomaly in a patient's plan of care. Dr. Padman gives an example:
“If a patient is being treated for kidney disease and, in the course of managing that disease,
their corresponding blood sugar, high blood pressure and other concurrent diseases can be
cured, then it is a definite upside from both monetary and health points of view.” This line of
medical research and its application would be impossible without the intervention of data
science and data-driven decision-making.
Source: https://www.heinz.cmu.edu/media/2017/january/analytics-better-health-care
Summary
In this chapter, you first learned about the concept of decision-making and the techniques of
decision-making. Subsequently, the chapter discussed the development of decision support
systems and the applications of DSS.
Exercise
Multiple-Choice Questions
Q1. DSS refers to ____________.
a. Decision Designing System b. Decision Support System
c. Describing Decision System d. Delay Decision Support
Q2. Which of the following is not a type of decision?
a. Programmed decision b. Non-programmed decision
c. Strategic decision d. Non-strategic decision
C
Q3. Which of the following is the first step in a decision-making process?
a. Analyze the problem b. Identify the problem
c. Collect the relevant data d. Develop alternate solutions
Q4. Which of the following is the last step in decision-making process?
a. Analyze the problem b. Identify the problem
c. Collect relevant data d. Ensure feedback
Assignment
Q1. What do you understand by decision-making?
Q2. What are programmed decisions?
Q3. What are non-programmed decisions?
Q4. List the steps involved in the decision-making process.
Q5. What are strategic decisions?
Q6. What are tactical decisions?
References
https://www.decision-making-solutions.com/decision_making_techniques.html
http://www.businessmanagementideas.com/decision-making/top-10-techniques-of-decision-
making/3377
https://www.useoftechnology.com/role-technology-decision-making/
https://towardsdatascience.com/how-data-science-is-enabling-better-decision-making-
1699defd6899
CASE STUDY
RATIONALIZING REAL ESTATE INVESTMENT DECISIONS
USING DATA SCIENCE
This Case Study discusses how Talentica helped its client HomeUnion, Inc. in building data-driven tools
for data analysis and decision making.
Company Profile
HomeUnion, Inc. is a real estate company that was established in 2009. It is headquartered in
Irvine, California. It is a subsidiary of HomeUnion Holdings, Inc. HomeUnion, Inc. works in the real
estate sector and provides services, such as property selection, acquisition, management and rental
services to the customers from across the world.
HomeUnion, Inc. is an online marketplace for real estate investments. It deals in fully managed
properties that are sourced through due diligence and data analysis process. Other services provided
by the company after a customer purchases the properties include market intelligence, portfolio
analysis and management oversight for the customers. HomeUnion, Inc. uses scientific data science
techniques in order to enable investors to choose portfolios and acquire assets that are best
suited to their investment goals.
According to Narayanan Srinivasan, CTO of HomeUnion, Inc., “Subjective opinions have long guided
real estate investment decisions, making portfolio investments risky. We enable our investors with
data-driven decision making and model-driven recommendations on investments.”
Building these data-driven tools posed several challenges, including large-scale data processing.
Variability in Data Sources and Availability: To build data-driven tools, the company required
statistically modeling factors, such as crime rate in a particular area, employment, income,
schools, rental trends, etc. All these data elements were acquired from different sources. In
addition, the frequency of all such data varied. These factors meant that the company required
a Big Data Pipeline Processing Factory.
Statistical Analysis of Data: Building the required data tools meant quantifying the investment
decisions, which depended upon various factors, and these factors in turn depended upon a
number of attributes. These attributes had to be rated, and statistical models built with
relevant predictive indices. All this was required for building the algorithms.
To carefully meet all these challenges and develop the required tools, HomeUnion, Inc. hired a leading
outsourced product development company, Talentica. According to Manjusha Madabushi, CTO of
Talentica, “Data Science will be the backbone of most futuristic solutions. Our solution for HomeUnion,
Inc. takes a data-driven approach to a domain which has always relied on subjective judgment. We are
thrilled at the possibilities this could have.”
Source: www.talentica.com
Talentica devised a strategy to develop and deploy a Data Science solution for HomeUnion, Inc. The
solution could collect, process, model and analyze the data to generate accurate insights that could
be used in business decision making. These insights could be used by the predictive tools in order to
help the investors in making better portfolio investment decisions. To achieve these goals, Talentica
set up a team comprising data scientists along with the product development team.
The stages in developing the data science solution were as follows:
1. Raw Data Collection and Processing: Data for the data-driven solution had to be collected from multiple sources and came in different forms. For this, Talentica's team developed an automated data ingestion ecosystem that could receive data from multiple sources. Sources were categorised as public/private and free/paid services.
The team configured Application Program Interfaces (APIs) to collect the data on the basis of the frequency of its publication and updates. This data was then stored in the Big Data environment. The team also deployed a Big Data based solution using Hadoop, MapReduce, R and Python to process the large volumes of data, which keep growing day by day.
2. Exploratory Analysis and Feature Engineering: The Big Data solution processed the data and provided clean, prepared data. Using this processed data, the team of data scientists started readying the data that would be used by the algorithms (of the data tools) that were to be built. The project team then outlined the various factors that drive real estate investment decisions, along with the features essential to the statistical model. Thereafter, the team used exploratory analysis and feature engineering techniques to reduce the number
of features to include only the most important ones. When modeled, these features could produce accurate and relevant insights.
3. Building Machine Learning Algorithms and Statistical Models: Data scientists selected a
set of parameters from the previous step and used it to build machine learning algorithms
by using tools, such as R, Python and Hadoop. Data Scientists used regression analysis and
machine learning techniques to build models for Neighborhood Investment Rating (rating
every US neighborhood for its investment potential); REALestimate (forecasting the return
on investment on every property); predicting the right offer for anchoring an optimal winning
bid; time-series based price trends on various US geographies; and predicting the likely rent
for every property in the US.
4. Data-Driven Insights and Analysis: Data Scientists built machine learning algorithms and
statistical models. These models ran on an automatic flow of data received from the Big Data
environment. The feature data was modeled to generate trends and forecasts for investment
risk, sale price range, rent prediction and price appreciation, etc. The trends and forecasts
could be generated for various geographical locations across the US.
5. Deciding the Technologies Used: The technologies used by the team in developing the models and algorithms included: a single-page app with the Java Spring stack for the server; databases such as MySQL, Vertica and Redis; mobile platforms such as iPhone and Android, with PhoneGap and Cordova plugins; reporting tools such as JasperReports and D3.js; data science tools such as R, Python, Hadoop, Elasticsearch and Pig; and Maven and Jenkins for continuous integration.
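The modelling stages above can be sketched in miniature. The Python snippet below shows a highly simplified, hypothetical weighted-scoring model for a neighbourhood investment rating; the factor names echo the case study, but the function names, weights and figures are invented for illustration and are not HomeUnion's actual model:

```python
def min_max_normalise(values, invert=False):
    """Scale a list of values to [0, 1]; invert for 'lower is better' factors like crime."""
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1 - s for s in scaled] if invert else scaled

def neighbourhood_rating(crime, income, school, weights=(0.4, 0.35, 0.25)):
    """Combine normalised factors into a 0-10 investment rating per neighbourhood."""
    c = min_max_normalise(crime, invert=True)   # lower crime -> higher score
    i = min_max_normalise(income)
    s = min_max_normalise(school)
    wc, wi, ws = weights
    return [round(10 * (wc * c[k] + wi * i[k] + ws * s[k]), 2)
            for k in range(len(crime))]

# Three hypothetical neighbourhoods
ratings = neighbourhood_rating(
    crime=[12, 45, 30],            # incidents per 1,000 residents
    income=[72000, 41000, 55000],  # median household income
    school=[8.5, 5.0, 7.0],        # school quality index
)
print(ratings)  # -> [10.0, 0.0, 4.83]
```

A production model would of course learn such weights from data (e.g., by regression, as the case study describes) rather than fixing them by hand.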
Benefits
After the implementation of these data-driven tools, the benefits realized by HomeUnion, Inc. were
as follows:
Feature-rich product: Using the new data-based tools, investors could make decisions regarding property acquisition, property management, market intelligence for buy/hold/sell decisions, investment plans tailored to achieve investor goals, etc.
Enhanced number of deals owing to easier investment decisions: The new data tools made it easier for customers to make informed portfolio investment decisions based on an accurate assessment of investment worthiness. As a result, HomeUnion, Inc. was able to close more deals than before, which meant an increase in business and, hence, revenues.
Questions
1. Describe why large data volume and variability in data sources was a challenge for HomeUnion, Inc.
(Hint: There was a constant flow of updated feeds that required reprocessing. Such large amounts of data required complex data processing. Factors such as crime rate in a particular area, employment, income, schools, rental trends, etc. were acquired from different sources. In addition, the frequency of all such data varied. These factors meant that the company required a Big Data Pipeline Processing Factory.)
2. What were the various stages involved in developing the data science solution for HomeUnion, Inc.?
(Hint: Stages involved in developing the data science solution for HomeUnion, Inc. included: raw data collection and processing, exploratory analysis and feature engineering, building machine learning algorithms and statistical models, data-driven insights and analysis, and deciding the technologies used.)
CHAPTER 4
Exploring Business Analytics
Topics Discussed
Introduction
What is Business Analytics (BA)?
Types of Business Analytics
Importance of Business Analytics
What is Business Intelligence (BI)?
Relation between BI and BA
Difference between BI and BA
Emerging Trends in BA
Data, Information and Knowledge
How are Data, Information and Knowledge linked?
DATA SCIENCE
INTRODUCTION
The word Analytics has multiple meanings and is open to interpretation by business and marketing professionals; experts and consultants use the term differently, yet in broadly similar ways. Analytics, as the business dictionary defines it, is anything that involves measurement: a quantifiable amount of data that signifies a cause and warrants an analysis culminating in a resolution.
This chapter discusses Business Analytics and its types, and the importance of Business Analytics. It then introduces the concept of Business Intelligence (BI) and its relation with Business Analytics, followed by the emerging trends in BI and BA. At the end, it discusses the different types of analytics in detail.
Business Analytics13 is a group of techniques and applications for storing, analysing and making data accessible to help users make better strategic decisions. Business Analytics is a subset of Business Intelligence, which creates competencies for companies to compete in the market efficiently and is likely to become one of the main functional areas in most companies (more on BI later in this chapter).
Analytics companies develop the ability to support decisions through analytical insight. Analytics influences the business by generating knowledge that can be used to make enhancements or bring about change. Business Analytics can be segregated into many branches. For a sales and advertising company, say, marketing analytics is essential to understand which marketing tactics and strategies clicked with the customer and which didn't. With the performance data of the marketing branch in hand, Business Analytics becomes an essential way of measuring the overall impact on the organisation's revenue chart. These insights direct investments in areas like media, events and digital campaigns, and allow us to understand customer outcomes clearly, such as lifetime value, acquisition, profit and revenue driven by our marketing expenditure.
Exhibit 1
Amnesty International
Amnesty International is a worldwide programme that includes over seven million crusaders
who fight for a free world with equal human rights for all. Being a non-profit institution, the
organisation has to rely on different donors and contributors, who get to know about campaigns
through activities such as street fundraising, telephone outreach, petitions and mailers. When
donors are involved, it is important to create a long-lasting relationship with them. Like many non-
profits, Amnesty International has a Customer Relationship Management (CRM) system to make
the relationship lifecycle last longer. The organisation also required performance improvement
using contemporary data analytics procedures.
Challenge
Around four years ago, with the help of its in-house fundraising consultants, Amnesty International started seeking analytics software to work in parallel with its existing CRM systems. The fundraising consultants are responsible for gathering funds and managing various kinds of donors. They are also required to measure donors' sentiments and interests based on multiple inputs, such as various parameters and participatory ratios. For such measurements, they depended on programmers to analyse customers and direct specific campaigns at them based on their interactions with, and contributions to, the campaign and the organisation. It was a tedious exercise and not always accurate, and there were regular gaps between what the consultants asked for and what they were delivered.
Solution
Based on the inputs gained from the consultants, Amnesty International finalised an analytics tool with an easy drag-and-drop interface to carry out the analytics processes as envisaged by the consultants.
The analytics tool was integrated with the CRM. Using the contemporary analytics software with the CRM database thus became easier, making the reporting features much more robust. Of course, as a human rights organisation, Amnesty International performs all data analytics in compliance with privacy rules and with data integrity protections.
Descriptive analysis: This analysis summarises historical data to describe what has already happened. For example, a coffee shop owner may learn how many customers were served between 9 a.m. and 11 a.m. and which coffee was ordered the most. This analysis answers questions like "What happened?", but it cannot answer deeper questions like "Why did it happen?". For this reason, highly data-driven companies do not rely on descriptive analysis alone; rather, they combine it with other analyses to get detailed results.
Diagnostic analysis: With the availability of historical data, diagnostic analysis can be used to answer the question "Why did it happen?". Diagnostic analysis provides a way to dig deeper by drilling down to find patterns and dependencies. The result of this analysis is often a predefined report structure, such as a Root Cause Analysis (RCA) report. For example, if the coffee shop owner experiences a heavy rush one day and finds he was unable to provide quality service, a diagnostic report can help him find out why things went wrong. Attribute importance, principal component analysis, sensitivity analysis and conjoint analysis are some techniques used in diagnostic analysis. Diagnostic analysis also includes training algorithms for classification and regression.
Predictive analysis: Predictive analysis can be defined as the process of predicting a possible outcome using machine learning techniques, such as SVMs and random forests, and statistical models. It forecasts on the basis of previous data and scenarios, and so is used to answer questions like "What is likely to happen?". For example, a hotel chain owner might ramp down promotional offers during the rainy season in a coastal area, based on the prediction that there are going to be fewer footfalls due to heavy rain. However, it must not be understood that this analysis can predict whether an event will occur in the future or not; it merely predicts the probability that an event will occur. If a predictive analysis model is tuned properly on historical data, it can support complex predictions in marketing and sales, and can perform better than standard BI in giving correct forecasts.
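As a minimal illustration of predictive analysis, the sketch below fits a plain least-squares line to invented historical data and uses it to answer "What is likely to happen?" (real predictive models would typically use richer techniques such as random forests or SVMs; the rainfall and footfall figures are hypothetical):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x, the simplest predictive model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical historical data: rainfall (mm) on a day vs. hotel footfall that day
rain = [0, 10, 20, 40, 60]
footfall = [200, 180, 160, 120, 80]
a, b = fit_line(rain, footfall)

# Predicted footfall for a forecast 50 mm rain day
predicted = a + b * 50
print(round(predicted))  # -> 100
```

The fitted slope (here, 2 fewer visitors per mm of rain) is what lets the model generalise from past observations to a day it has never seen.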
K nowledge C heck 4.2
1. A software firm has roped in a consultant to study the financial leaks happening in their
billing system. This is the example of ___________.
2. A company needs to launch its new product, but is on a limited marketing budget, and
needs to figure out the best possible market response with a minimum investment. The
___________ analytics should help the company with studying the market response.
To understand, improve and track the methods used to impress a first lead or prospect and convert them into a valuable customer
Significance of BA:
To get insights into customer behaviour: The prime advantage of investing in BI software and expertise is that it increases your ability to examine current customer purchasing trends. Once you know what your customers are ordering, this information can be used to create products matching present consumption trends and thus improve your cost-effectiveness, since you can now attract more valued customers.
To improve visibility: BA gives you a vantage point over organisational complexities from which you have better visibility of processes, making it possible to recognise any parts requiring a fix or improvement.
To convert data into worthy information: A BI system is an analytical tool that can equip you to make successful strategies for your corporation. Since such a system identifies patterns and key trends in your corporation's data, it makes it easier for you to connect the dots between different parts of your business that may otherwise seem disconnected. Such a system also helps you better comprehend the inferences drawn from multiple structural processes, and increases your ability to recognise the right opportunities for your organisation.
To improve efficiency: One critical reason to consider a BI system is the increase in organisational efficiency it brings, leading to increased productivity. BI helps in sharing information across multiple channels in the organisation, saving time on reporting analytics and processes. This ease of sharing information reduces redundancy of duties or roles within the organisation, and improves the precision and practicality of the data produced by different divisions.
Consider a typical website that relies on visitor footfall and subsequent click-based advertising revenues. Such an organisation needs analytics more often than organisations that run a dedicated business in brick-and-mortar stores and use their website only for marketing purposes.
BA is an important area that equips you with the right weapons to make correct business decisions. For example, if you already expect turmoil in one of your business sections, you can do a SWOT analysis of that section and influence the overall outcome positively. Here, BA not only helps you retain a section full of customers, but also helps you avoid a future conflict of a similar nature. BA arms you with a situational arsenal: you get a machine gun in the form of viral marketing campaigns when you are targeting a mass audience for a given product, whereas in the case of customer withdrawal or ramp-up, you can have your sniper ready to target them specifically.
Prepare a report on a case where a business gained effectively from SWOT analysis.
BI utilises computing techniques for the discovery, identification and analysis of business data, such as products, sales revenue, earnings and costs.
BI models provide present, past and predictive views of structured internal data for goods and departments. They provide effective strategic and operational insights and help in decision-making through predictive analytics, reporting, benchmarking, data/text mining and business performance management.
BI transforms business data into knowledge to aid the decision-making process. The conventional method of doing this includes logging and probing data from the past and using the overall outcome from that reading as the standard for setting future benchmarks.
BA emphasises using data to gain new insights, while conventional BI uses a constant, recurring set of metrics to drive future business strategies on the basis of historical data. If BI is the method of logging the past, BA is the method of dealing with the present and forecasting the future.
With the help of BA, you get to know the pain points of your business: your product's standing in the market, the business strengths that put you ahead of the competition and the opportunities you are yet to explore. BA helps you know your business thoroughly. BI helps bridge the gap between ground reality and the management perspective on a pan-organisational basis.
BI helps you compound your strong points, weed out weaknesses efficiently and manage the organisational business more effectively. It helps you capitalise on the lessons learned from the BA findings about the organisation. Table 4.1 shows the differences between BI and BA:
TABLE 4.1: Differences between BI and BA
BI: Uses current and past data to optimise present-day performance for success.
BA: Utilises past data and separately analyses current data, with past data as reference, to prepare the business for the future.

BI: Informs about what happened.
BA: Tells why it happened.

BI: Tells you the sales numbers for the first quarter of a fiscal year, or the total number of new users signed up on your platform.
BA: Tells you why your sales numbers tanked in the first quarter, or how effective the newly launched user campaign was at making users refer other users to your platform.

BI: Quantifiable in nature; it can help you measure your business in visualisations, charts and other data representation techniques.
BA: More subjective, open to interpretation and prone to changes due to ripples in the organisational or strategic structure.

BI: Studies the past of a company and ponders over what could have been done better in order to have more control over the outcomes.
BA: Predicts the future based on the learning gained from past, present and projected business models for a given term in the near future.
Another new trend is the ability to combine multiple data projects into one while making the result useful in sales, marketing and customer support. This concept is embodied in CRM (Customer Relationship Management) software, which sources raw data from every division and department and compiles it into a new understanding that would otherwise not have been visible from any one point alone.
All this boils down to the interchangeable usage of the terms 'business intelligence' and 'business analytics' and their importance in managing the relationship between business managers and data. As a result of such accessibility, owners and managers now need to be more familiar with what data is capable of doing and how they need to actively produce data to create lucrative future returns. The significance of the data hasn't changed; its availability has.
EMERGING TRENDS IN BA
Following are the contemporary trends in the BI and BA fields:
More Power and Monetary Impact for Data Analysts: Analysts are consistently topping the demand charts across many industries, thanks to the demand-driven analytical bandwagon that has made industry take cognisance of data analysts and led to a spike in related roles such as Information Research Scientist and Computer Systems Analyst.
Location Analytics: Another major business driver in 2016 was location and geospatial analytical tools, which gave organisations better market intelligence and placement in terms of effective campaigns; for example, a company aiming geo-targeted campaigns at specific customers.
Data at the Edge: Businesses must look beyond the usual data sources inside their data centres, since data flows now originate outside them, from multiple sensor devices and servers; e.g., a satellite in space or an oil rig at sea.
Artificial Intelligence (AI): This is a top trend as per multiple studies, with scientists aiming to build machines that can do what complex human reflexes and intelligence achieve. The analytical work on such programmes is growing exponentially, with AI and machine learning transforming the way we relate to analytics and data management.
Predictive Analytics and Impact on Data Discovery: By gathering more information, organisations will have the capacity to build more detailed visual models that will help them act in more accurate ways. For instance, better information models show organisations more about what clients are purchasing, and even what they are likely to purchase in the future. From CRM to sales and marketing deals, predictive analytics and cutting-edge BI are set to bring disruption.
Cloud Computing: Cloud computing is a technique that makes it possible for organisations to dynamically regulate their use of computing resources and access them as needed while paying only for the resources actually used. Cloud computing is being absorbed into many systems and will continue to grow. We have witnessed the division of the cloud into multiple vendor systems, and many companies are using cloud services to host powerful data analytics tools. A lot of customers are already using Microsoft Azure and Amazon Redshift, along with cloud resources that provide flexible handling and scalability for data.
Digitisation: This is the process of turning any analogue image, sound or video into a digital format understandable by electronic devices and computers. Digital data is usually easier to store, fetch and share than the raw original format (e.g., turning a tape recording into a digital song). The gains from digitising data-intensive processes are great: up to 90% cost reduction and much faster turnaround times than before. Using software instead of manual processes allows businesses to gather and screen data in real time, which helps managers tackle issues before they turn critical.
Examples of Data:
2, 4, 6, 8
Mercury, Jupiter, Pluto
The above data alone doesn't represent a true picture. Maybe the sequence is simply the two-times table, or a sequence with a common difference of two. The names may just be the names of conference rooms in an organisation rather than planet names. Unless you give the data a context and define the reasoning for its existence, it has no standalone meaning.
Information is the result we achieve after raw data is processed. This is where data takes shape as per the need and starts making sense. Standalone data has no meaning; it only assumes meaning and transitions into information upon being interpreted. In IT terms, characters, symbols, numbers or images are data. These are the joint inputs which a system running in a technical environment needs to process in order to produce a meaningful interpretation.
Information can offer answers to questions like which, who, why, when, what and how. Put into an equation, information looks like:
Data + Meaning = Information
Examples of Information:
2, 4, 6, 8 are the first four multiples of 2.
Only when we attach a situation or meaning to data does it become information.
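The transition from data to information can be shown in a few lines of Python. The pattern check below is purely illustrative; the point is that the raw list carries no meaning until an interpretation is attached:

```python
def interpret(data):
    """Attach meaning to raw data: check whether it matches a known pattern."""
    if all(v == 2 * (i + 1) for i, v in enumerate(data)):
        return "the first {} multiples of 2".format(len(data))
    return "pattern unknown"

raw_data = [2, 4, 6, 8]            # data: just numbers, no meaning
information = interpret(raw_data)  # information: data plus interpretation
print(information)                 # -> the first 4 multiples of 2
```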
The first type is commonly called explicit knowledge, meaning knowledge that can be easily transferred to others. Explicit knowledge and its derivatives can be kept in some media format, e.g., encyclopedias and textbooks.
The second type is termed tacit knowledge, referring to knowledge that is complex and intricate. It cannot be gained simply by being passed on by others, and requires elevated and advanced skills to be comprehended. For example, it will be tough for a foreign tourist to understand the local customs or rituals of a specific community in a country whose language differs from the tourist's own. In such a case, the tourist needs to be conversant with the language or requires additional resources in order to understand the rituals. Similarly, the ability to speak a language, use a computer or do similar things requires knowledge that cannot be gained explicitly and is instead learned through experience.
These concepts are hierarchical, as shown in Figure 4.1: data becomes information, which becomes knowledge.
For example, the temperature fell 15 degrees followed by rains. Here, the inference based on the
data becomes information.
Knowledge signifies a pattern that links data and information, and usually provides a high-level view and the likelihood of what will happen next or what is being described. For example, if humidity levels are high and the temperature drops considerably, the atmosphere is quite unlikely to hold the moisture, and hence it rains. The pattern is reached by comparing the valid points emanating from the data and information, resulting in knowledge, sometimes also referred to as wisdom.
Wisdom exemplifies the understanding of the essential values personified within knowledge, which are the foundation for the knowledge in its current form. Wisdom is systematic.
[Figure 4.2 depicts the business analyst at the intersection of many roles: Business Systems Planner, Organisation Analyst, Project Manager, Financial Analyst, Subject Area Expert, Data Analyst, Technology Architect, Application Architect, Application Designer and Process Analyst.]
FIGURE 4.2 Skills of a Business Analyst
Described below are a few of the key requirements and responsibilities of the BA in managing and defining requirements:
Gathering the requirements: Requirements are a key part of IT systems; inadequate or unsuitable requirements often lead to a failed project. The BA establishes the requirements of a project by mining them from stakeholders and from current and future users, through research and interaction.
Anticipating requirements: An expert BA knows that in the dynamic world of IT, things can change quickly, even before anyone expects the change. Plans developed at the start are always subject to alteration, and anticipating requirements that might be needed in the future is the key to successful results.
Constraining requirements: While complete requirements are a must for a successful project, the emphasis should be on essential business needs, not on personal user preferences, functions based on outdated processes or trends, or other unimportant changes.
Organising requirements: Requirements often come from multiple sources that may contradict one another. The BA must segregate requirements into associated categories to communicate and manage them efficiently. Requirements are organised into types as per their source and application. Good organisation prevents project requirements from being overlooked, and thus leads to an optimal use of budgets and time.
Translating requirements: The BA must be skilled at interpreting business requirements and converting them effectively into technical requirements. This involves using powerful modeling and analysis tools to match the planned business goals with real-world technical solutions.
Protecting requirements: At frequent intervals in a project's lifecycle, the BA protects the user's and the business's needs by confirming the functionality, precision and completeness of the requirements developed so far against the requirements gathered in the initial documents. Such protection reduces risk and saves considerable time by certifying that the requirements are being fulfilled before further time is devoted to development.
Simplifying requirements: The BA is all about simple and easy functionality, especially in implementation. Completing the business objective is the aim of every project; BAs recognise and avoid the unimportant activities that do not help in resolving the problem or achieving the objective.
Verifying requirements: BAs are the most informed persons in a project about the use cases; hence, they frequently validate the requirements and discard implementations that do not help in bringing the business objective to culmination. Requirement verification is carried out through test, analysis, inspection and demonstration.
Managing requirements: Usually, a formal requirements presentation is followed by a review and approval session, where project deliverables, cost and duration estimates and schedules are decided and the business objectives are rechecked. Post approval, the BA shifts to requirement management events and activities for the rest of the project lifecycle.
Maintaining system and operations: Once all the requirements are fulfilled and the solution is delivered, the BA's role shifts to post-implementation maintenance: ensuring that defects, if any, do not occur or are resolved within the agreed SLA timelines; handling any enhancements to be made to the project; and performing change activities to make the system yield more value. The BA is also responsible for many other post-implementation activities, such as operations and maintenance, providing system authentication procedures, deactivation plans, maintenance reports and other documents such as reports and future plans. The BA also plays a great role in studying the system to determine when replacement or deactivation may be required.
It is not at all necessary that a business analyst comes from an IT background, although a basic understanding of IT systems' functionality and working is certainly helpful. Sometimes BAs come from a programming or other technical background, often from within the business, carrying thorough knowledge of the business field, which can likewise be very useful. To be a successful BA, you ought to be a multi-skilled person who is adaptable to an ever-changing environment. The following are some of the most common skills that a decent BA should have:
Understanding the Objectives: Being able to understand directions and commands is important. If you can't understand what and, more significantly, why you are assigned to do something, the chances are high that you can't deliver what is required. Don't hesitate to ask questions or seek additional information if you have any doubts.
Having Good Communication Skills: It sounds obvious, but it is necessary to have good verbal communication skills, preferably in a global environment where multitudes of stakeholders, managers and resources from diverse backgrounds collaborate on a single platform to discuss, debate and finalise the requirements which will incidentally be captured by you. You need that level of comprehension, along with the eloquence to deliver your conceptions or clear any doubts you have. You should be able to make your point evidently and explicitly. Communicating data and information at the appropriate level is important, as some stakeholders require more detailed information than others due to varying levels of understanding.
Manage Stakeholder Meetings: While email, which also acts as an audit trail, is a fair method of facilitating communication, sometimes it turns out not to be enough. Old-school face-to-face discussions and meetings for detailed deliberation over problems and queries are still a popular way of carrying out effective analysis. Most of the time, you end up discovering more about your project when all stakeholders are physically present, as collaborators tend to be more open about debating circumstances.
A Good Listener: You are better off listening more than you speak, and jotting down notes and takeaways from meetings. Good listening skills require the patience and virtue to understand and listen to the stakeholder, which gives them the feeling of being heard rather than being overlooked or overpowered by a dominating analyst; projects with dominating analysts often end up in a mess sooner than they should. Your listening and information-absorbing skills are important to making you an effective analyst. Not only listen, but understand the situation, and question only where you think you are being condescended to by stakeholders passing off unnecessary off-business requirements and ignoring the actual requirements that can help in making an efficient system. You can attend personality development training to gain control over voice modulation, dialect and pitch moderation, along with effective body language and business presentation skills.
Improving the Presentation Skills: As a BA, you are supposed to be presentable at any time, round the clock. You will often lead workshops or pitch a workpiece to the stakeholders or to the internal project team. It is important to give due consideration to the content of your presentation and ensure that it matches the meeting objectives, since there is no point in presenting implementation methods if the meeting is about gathering requirements. These presentations not only present information but also act as a good way to get more clarity or information from stakeholders in case you are looking for further details on a specific part of the project.
A Time Manager: A BA is responsible for maintaining the timeframes of the project as well as the corporate schedules. A BA should ensure that the project meets the pre-agreed project milestones, and that daily tracking schedules are fulfilled by the development team. A BA should prioritise activities, separating critical ones from those that can wait, and focus on the critical ones.
Literary and Documenting Skills: As a BA, you are supposed to deliver numerous types of documentation that will go on to become project and legal documents later on. So, you need to ensure that your documents are written concisely and at a level comprehensible to the stakeholders. Avoid jargon specific to a particular field, as it may not be understood by all stakeholders and may later create confusion or other complexities of interpretation. Starting as an inexperienced BA, you will gradually learn to write requirement documentation and reports, but strong writing skills are enough to give you a head start over others, since they lead to unambiguous requirements documentation.
Stakeholder Management: It is important that you know how to deal with the stakeholders and how much power and impact they have on your project. Stakeholders can either be your best friends and supporters or your greatest critics. An accomplished BA will have the skill to assess the degree of management every stakeholder needs and how they ought to be dealt with individually.
Develop Your Modelling Skills: As the expression goes, a picture paints a thousand words. Techniques such as process modelling are compelling tools to convey a lot of information without depending on text. A visual portrayal gives you an overview of the issue or project so that you can see what works well and where the loopholes lie.
Knowledge Check 4.5
As a BA, write a report on your analytical study of Sony Corporation, which is currently undergoing turmoil for serving too many areas in its business fields.
DESCRIPTIVE ANALYTICS
Descriptive analytics16 is the most essential type of analytics and establishes the framework for more advanced types of analytics. This sort of analysis answers “What has occurred in the corporation?” and “What is going on now?”. Let us consider the case of Facebook. Facebook users produce content through comments, posts and picture uploads. This information is unstructured and is produced at an extensive rate. Facebook stats reveal that 2.4 million posts, equivalent to around 500 TB of information, are produced every minute. These jaw-dropping figures have popularised another term, which we know as Big Data.
Comprehending information in its raw form is troublesome. This information must be summarised, categorised and displayed in an easy-to-understand way to let managers comprehend it. Business Intelligence and data mining tools and methods have been the accepted means of doing so for bigger organisations. Practically every organisation does some type of summary and MIS reporting using a database or simply spreadsheets.
There are three crucial approaches to summarising and describing raw data:
Dashboards and MIS Reporting: This technique gives condensed data, answering “What has happened?”, “What’s been going on?” and “How does it compare with the plan?”.
Ad Hoc Reporting: This technique supplements the previous one in helping the administration extract information as required.
Drill-down Reporting: This is the most complex piece of descriptive analysis and gives the capacity to delve further into any report to comprehend the information better.
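The three approaches above can be sketched with plain Python (the transaction records, region and product names below are hypothetical; a BI tool or spreadsheet would normally do this, but the logic is the same):

```python
from collections import defaultdict

# Hypothetical raw transactions: (region, product, revenue)
transactions = [
    ("North", "Grocery", 1200), ("North", "Apparel", 800),
    ("South", "Grocery", 950),  ("South", "Apparel", 400),
    ("North", "Grocery", 300),
]

# Dashboard/MIS view: condensed totals answering "what has happened?" per region
summary = defaultdict(int)
for region, _, revenue in transactions:
    summary[region] += revenue

# Drill-down view: delve into one region, broken down by product
north_detail = defaultdict(int)
for region, product, revenue in transactions:
    if region == "North":
        north_detail[product] += revenue

print(dict(summary))       # {'North': 2300, 'South': 1350}
print(dict(north_detail))  # {'Grocery': 1500, 'Apparel': 800}
```

The dashboard view condenses everything to one number per region; the drill-down view answers the follow-up question that a summary alone cannot.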
Definition of Statistics
Statistics, as defined by David Hand, Past President of the Royal Statistical Society in the UK, is both
the science of uncertainty and the technology of extracting information from data. Statistics involves
collecting, organising, analysing, interpreting and presenting data. A statistic is a summary measure
of data. You are familiar with the concept of statistics in daily life as reported in newspapers and the media: for example, baseball batting averages, airline on-time arrival performance and economic statistics such as the Consumer Price Index.
A sampling plan states the following:
Objectives of the sampling activity
Target population
Population frame (the list from which the sample is selected)
Method of sampling
Operational procedures for collecting the data
Statistical tools that will be used to analyse the data
Exhibit 2
A Sampling Plan for a Market Research Study
Suppose that a company in America wants to understand how golfers might respond to a membership programme that provides discounts at golf courses in the golfers’ locality as well as across the country. The objective of a sampling study might be to estimate the proportion of golfers who would likely subscribe to this programme. The target population might be all golfers over 25 years old. However, identifying all golfers in America might be impossible. A practical population frame might be a list of golfers who have purchased equipment from national golf or sporting goods companies through which the discount card will be sold. The operational procedures for collecting the data might be an e-mail link to a survey site or a direct-mail questionnaire. The data might be stored in an Excel database; statistical tools such as PivotTables and simple descriptive statistics would be used to segment the respondents into different demographic groups and estimate their likelihood of responding positively.
Sampling Methods
Many types of sampling methods exist. Sampling methods can be subjective or probabilistic. Subjective methods include judgment sampling, in which expert judgment is used to select the sample, and convenience sampling, in which samples that are easier to collect are selected (e.g., surveying all customers who visited this month). Probabilistic sampling14 involves selecting items using a random procedure and is necessary for drawing valid statistical conclusions.
The most common probabilistic sampling approach is simple random sampling, which involves choosing items from a population such that every subset of a given sample size has an equal opportunity of being selected. Simple random samples can be obtained easily if the population data is kept in a database.
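A minimal sketch of drawing a simple random sample from such a database, assuming the population fits in memory as a Python list (the customer records and figures are illustrative):

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical customer database: (customer_id, annual_spend) records
population = [(i, 100 + (i * 37) % 4900) for i in range(1, 1001)]

# Simple random sample: every subset of size n is equally likely to be chosen
n = 50
sample = random.sample(population, n)

print(len(sample))  # 50 distinct records drawn without replacement
```

`random.sample` draws without replacement, which is exactly the simple-random-sampling requirement described above.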
stratum are not homogeneous. However, issues of cost or significance of certain strata might
make a disproportionate sample more useful. For example, the ethnic or racial mix of each ward
might be significantly different, making it difficult for a stratified sample to obtain the desired
information.
Cluster Sampling: It refers to dividing a population into clusters (subgroups), sampling a set of clusters, and conducting a complete survey within the sampled clusters. For instance, a company might segment its customers into small geographical regions. A cluster sample would consist of a random sample of the geographical regions, and all customers within these regions would be surveyed (which might be easier because regional lists might be easier to produce and mail).
Sampling from a Continuous Process: Selecting a sample from a continuous manufacturing process can be accomplished in two main ways. First, select a time at random and then select the next n items produced after that time. Second, randomly select n times and select the next item produced after each of these times. The first approach generally ensures that the observations come from a homogeneous population; the second approach, however, might include items from different populations if the characteristics of the process change over time.
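Cluster sampling and the two continuous-process approaches can each be sketched in a few lines of Python (the region names, cluster sizes and production stream below are all hypothetical):

```python
import random

random.seed(7)

# Cluster sampling: customers grouped into geographical regions (clusters)
regions = {f"region_{i}": [f"cust_{i}_{j}" for j in range(20)] for i in range(10)}
chosen_regions = random.sample(sorted(regions), 3)           # randomly pick 3 clusters
surveyed = [c for r in chosen_regions for c in regions[r]]   # survey everyone in them

# Sampling from a continuous process: items in production order
stream = list(range(10_000))
n = 10

# Approach 1: one random start time, then the next n consecutive items
start = random.randrange(len(stream) - n)
consecutive_sample = stream[start:start + n]

# Approach 2: n random times, then the next item produced after each
times = sorted(random.sample(range(len(stream)), n))
spread_sample = [stream[t] for t in times]

print(len(surveyed), len(consecutive_sample), len(spread_sample))  # 60 10 10
```

Note how Approach 1 yields a tight run of consecutive items (homogeneous if the process is stable), while Approach 2 spreads the observations across the whole run.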
Estimation Methods
Sample data provides the basis for many useful analyses to support decision-making. Estimation involves assessing the value of an unknown population parameter – such as a population proportion, population mean or population variance – using sample data. Estimators are measures used to approximate population parameters; e.g., we use the sample mean x̄ to estimate a population mean µ. The sample variance s² estimates a population variance σ², and the sample proportion p estimates a population proportion π. A point estimate is a single number, computed from sample data, that is used to estimate the value of a population parameter.
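These point estimates can be computed with Python's standard `statistics` module; the sample values and the threshold used for the proportion are hypothetical:

```python
import statistics

# A small hypothetical sample of measurements
sample = [12.1, 9.8, 11.4, 10.9, 12.6, 10.2, 11.8, 9.5]

x_bar = statistics.mean(sample)    # point estimate of the population mean µ
s2 = statistics.variance(sample)   # sample variance (n - 1 denominator) estimates σ²
p = sum(x > 11 for x in sample) / len(sample)  # sample proportion estimates π

print(round(x_bar, 3), round(s2, 3), p)
```

Note that `statistics.variance` uses the n − 1 denominator discussed below, while `statistics.pvariance` uses the population formula with an N denominator.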
Unbiased Estimators
It seems quite intuitive that the sample mean should provide a good point estimate for the population
mean. However, it may not be clear why the formula for the sample variance we read previously
has a denominator of n – 1, particularly because it is different from the formula for the population variance. Recall that the population variance is computed by the formula:
σ² = [ Σ (Xᵢ – µ)² ] / N, where the sum runs from i = 1 to N

whereas the sample variance is computed as:

s² = [ Σ (xᵢ – x̄)² ] / (n – 1), where the sum runs from i = 1 to n
Why so? Statisticians develop many types of estimators, and from a theoretical as well as a practical perspective, it is important that they truly estimate the population parameters they are meant to estimate. Suppose we perform an experiment in which we repeatedly sample from a population and calculate a point estimate for a population parameter. Each individual point estimate will vary from the population parameter; however, the long-run average (expected value) of all possible point estimates should equal the population parameter. If the expected value of an estimator equals the population parameter it is intended to estimate, the estimator is called unbiased; otherwise, the estimator is called biased and will yield incorrect results.
Luckily, all the estimators discussed here are unbiased and are therefore meaningful for decision-making involving the population parameter. Statisticians have shown that the denominator n – 1 used in computing s² is necessary to provide an unbiased estimator of σ². If we simply divide by the number of observations, the estimator tends to underestimate the true variance.
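A small simulation makes the bias visible: repeatedly sampling from a synthetic population, the n denominator lands well below the true variance, while the n − 1 denominator averages out close to it (the population parameters here are arbitrary):

```python
import random
import statistics

random.seed(42)
population = [random.gauss(50, 10) for _ in range(100_000)]
true_var = statistics.pvariance(population)  # population variance (N denominator)

n, trials = 5, 20_000
biased_total = unbiased_total = 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    biased_total += ss / n          # dividing by n: biased estimator
    unbiased_total += ss / (n - 1)  # dividing by n - 1: unbiased estimator

print(round(true_var, 1))
print(round(biased_total / trials, 1))    # noticeably below the true variance
print(round(unbiased_total / trials, 1))  # close to the true variance
```

With a sample size of 5, the biased estimator should average roughly (n − 1)/n = 80% of the true variance, which is exactly the shortfall the n − 1 correction removes.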
One of the drawbacks of using point estimates is that they do not provide any indication of the magnitude of the potential error in the estimate. A newspaper reported that college professors were the best-paid workers in the area, with an average pay of $150,004. However, it was found that the average pay at two local universities was less than $70,000. How did this happen? It was revealed that the sample size taken was very small and included a large number of highly paid medical school faculty; as a result, there was a significant error in the point estimate that was used.
When we sample, the estimators we use – such as a sample mean, sample proportion or sample
variance – are actually random variables that are characterised by some distribution. By knowing
what this distribution is, we can use probability theory to quantify the uncertainty associated with
the estimator. To understand this, we first need to discuss sampling error and sampling distributions.
Different samples from the same population have different characteristics – for example, variations in the mean, standard deviation, frequency distribution and so on. Sampling error occurs because samples are only a subset of the total population. Sampling error can be lessened but not completely avoided. Another type of error, non-sampling error, occurs when the sample does not represent the target population effectively. This is generally a result of poor sample design, such as using a convenience sample when a simple random sample would have been more appropriate, or choosing the wrong population frame. To draw good conclusions from samples, analysts need to eliminate non-sampling error and understand the nature of sampling error.
Sampling error depends on the size of the sample relative to the population. Thus, determining the sample size to be taken is basically a statistical issue based on the precision of the estimates required to draw a valid conclusion. From a practical point of view, one should also consider the cost of sampling and make a trade-off between cost and the information obtained.
Let’s take an example where you have done descriptive analysis and it shows low sales on your online grocery store website. After some event checks and analysis, it occurs to you that users are adding items to the cart but are not checking out. You now come to the conclusion that there is some issue with the user experience on your website, but what is it precisely? There are many factors which could be affecting sales: the payment page may not be using the more secure HTTPS protocol, the payment options form may not work, or an unexpected charged amount may appear on the page. Diagnostic analysis thus enables you to present a picture of the cause behind the numbers, which is not apparent in the presented data.
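A drop-off analysis over hypothetical funnel counts shows how this diagnostic step narrows the search; the event names and figures below are invented for illustration:

```python
# Hypothetical event counts from the grocery site's analytics, one per checkout step
funnel = {
    "viewed_product": 10_000,
    "added_to_cart": 2_400,
    "opened_payment_page": 1_900,
    "completed_checkout": 300,
}

steps = list(funnel)
drop_offs = {}
for prev, curr in zip(steps, steps[1:]):
    drop_offs[curr] = 1 - funnel[curr] / funnel[prev]
    print(f"{prev} -> {curr}: {drop_offs[curr]:.0%} drop-off")

# The step with the largest drop-off is the first place to investigate
worst = max(drop_offs, key=drop_offs.get)
print(worst)  # completed_checkout: users abandon at the payment step
```

Here the payment step loses about 84% of the users who reached it, which points the investigation at the payment page rather than, say, the product catalogue.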
Following are the functions which broadly cover diagnostic analytics:
1. Identifying the problems and events worth investigating: Using the results of descriptive analysis, the analyst must identify the areas which require further analysis and investigation, since they raise questions whose answers cannot be found by just looking at the data provided. It may be anything from falling sales to an unexpected performance boost. Each of these causes can then be analysed further using diagnostic analytics to find the root problems or causes.
2. Drilling into the analytics: Once anomalies are identified, the analyst must locate the data sources that might point to the root cause of the anomalies. During this process, the analyst may have to look outside the selected data sets to find patterns and directions. This may also require pulling data from other sources, which can be used to identify correlations between datasets and to check whether these correlations are causal in nature.
3. Identifying causal relationships: To explain the cause of identified anomalies, hidden relationships are identified by closely observing the events. Techniques and concepts such as probability theory, time-series analysis, filtering and regression analysis can be applied and prove useful for unravelling the true nature of the data.
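As a sketch of checking such a relationship, the Pearson correlation between two hypothetical weekly series (complaint-keyword mentions and customer-service calls) can be computed directly; a high correlation is a lead to investigate, not proof of causation:

```python
def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical weekly counts: complaint-keyword mentions vs. service calls
mentions = [12, 18, 9, 30, 25, 14, 40]
calls = [110, 150, 95, 260, 230, 120, 340]

r = pearson_r(mentions, calls)
print(round(r, 3))  # close to 1: a strong association worth drilling into
```

Whether the association is causal still has to be established by the event-level inspection described in step 3, since both series could be driven by a third factor.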
Since data volume, variety and velocity have increased drastically in past years, the manual methods that analysts used for diagnostic analytics produced results that were highly dependent on the abilities of the analyst, and even skilled analysts could not guarantee consistent results. Modern methods for diagnostic analytics use machine learning, since machines are far more capable than humans at recognising and clustering patterns. An intelligent implementation of diagnostic analytics is imperative, as it answers very specific questions such as “How do we avoid a given problem?” or finds ways to replicate a solution for other, similar problems. Documentation of the diagnostic analysis must also be done in a meaningful way, stating which issue was identified, which data sources were used to analyse and eliminate the issue, and which causal relationships between data sets were identified during the analysis. The importance of diagnostic analysis cannot be ignored: IDC survey data states that, by 2021, 25% of large organisations will supplement their data scientists with contextual data interpretation using qualitative research methods, enabling organisations to dig deep into people’s emotions, perceptions and stories of their world.
FIGURE 4.3 Predictive Analytics
(The figure plots BI maturity against level of insight: dashboards, scorecards and reports answer “What happened?”; OLAP, drill-down/across and ad hoc queries answer “Why did it happen?”; data mining and predictive analysis answer “What will happen?”)
Predictive Modelling
Predictive modelling is the method of making, testing and authenticating a model to best predict the likelihood of an outcome. Several modelling procedures from artificial intelligence, machine learning and statistics are present in predictive analytics software solutions. The model is selected on the basis of testing, validation and assessment, using detection theory to predict the likelihood of an outcome for a given amount of input data. Models can utilise one or more classifiers to determine the probability of a set of data being related to another set. The different models available in predictive analytics software enable the system to develop new data insights and predictive models. Each model has its own strengths and weaknesses and is best suited to particular types of problems.
Predictive analysis and models are characteristically used to predict future probabilities. Predictive models in a business context are used to analyse historical facts and current data to better comprehend customer habits, partners and products, and to identify possible risks and prospects for a company. Predictive analytics employs many techniques, including statistical modelling, data mining and machine learning, to help analysts make better future business predictions.
The greatest set of changes and advances in predictive modelling is coming about due to the increase in unstructured information – content archives, video, voice and pictures – joined with quickly improving analytical methods. Basically, predictive modelling requires organised (structured) data – the kind found in relational databases. To make unstructured data sets useful for this sort of analysis, structured data must be extracted from them first. One example is sentiment analysis of Web posts. Data can be found in customer posts on forums, online journals and other sources that foresee consumer loyalty and reveal trends for new items. It would be almost impossible, in any case, to attempt to assemble a predictive model directly from the text of the posts themselves. An extraction step is required to get usable data – keywords, expressions and meaning – from the content of the posts. At that point, it is possible to search for the connection between, say, instances of ‘issues with the item’ and an increase in customer service calls.
Predictive models represent the association between the performance of a sample and its known characteristics. The aim is to assess how likely a similar member from another sample is to behave in the same manner. This model helps in identifying implicit patterns indicating customers’ preferences and is widely used in marketing. Such a model can even perform calculations at the exact time a customer performs a transaction.
Predictive analytics methods rely upon quantifiable variables and control metrics to forecast future performance or results.
Many quantifiable variables (or predictors) combine together to form a predictive analytics model. This approach allows for data collection and the preparation of a statistical model, to which extra data can be added as and when it becomes available.
The accumulation of higher data volumes creates a better predictive model, since larger data sets produce more dependable forecasts based on the volume of data examined. Moreover, using actual data to power predictive analytics models improves the accuracy of the predicting process.
The various business processes in predictive modelling are as follows:
1. Creating the model: A software-based solution allows you to create a model by running multiple algorithms on the dataset.
2. Testing the model: Test the predictive model on the dataset. In some situations, the testing is done on past data to gauge the effectiveness of the model’s predictions.
3. Authenticating the model: Validate the model results by means of business data understanding and visualisation tools.
4. Assessing the model: Assess the models used and select the best-suited model tailored for the data.
The predictive modelling process includes executing one or more algorithms on the dataset subjected to prediction. This is an iterative process and often involves training the model, trying several models on the same dataset and finally settling on the most appropriate model based on the business data.
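The four-step process can be sketched on synthetic data: two candidate models are created on a training set, tested on held-out data, and the better one is selected (the customer figures and the linear relationship are invented for illustration):

```python
import random

random.seed(0)

# Hypothetical history: (visits_per_year, annual_spend) for 200 past customers,
# where spend is roughly 900 x visits plus noise
data = [(v, 900 * v + random.gauss(0, 300))
        for v in (random.randint(1, 12) for _ in range(200))]
random.shuffle(data)
train, test = data[:150], data[150:]

# Step 1 - create two candidate models from the training data
mean_spend = sum(y for _, y in train) / len(train)               # Model A: constant
k = sum(x * y for x, y in train) / sum(x * x for x, _ in train)  # Model B: spend = k * visits

# Step 2 - test each model's predictions on held-out data
def mse(predict):
    return sum((predict(x) - y) ** 2 for x, y in test) / len(test)

mse_a, mse_b = mse(lambda x: mean_spend), mse(lambda x: k * x)

# Steps 3-4 - validate against business understanding, then assess and select
best_model = "B" if mse_b < mse_a else "A"
print(best_model)  # the visits-based model fits far better than the constant
```

In a real project the same loop would run over many algorithms and richer features, but the create/test/validate/assess cycle is identical.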
To understand better, let’s take the example of a customer who visits a restaurant around six times a year and spends around Rs. 5,000/- per visit. The restaurant earns around a 40% margin on the billing amount of each visit.
The annual gross profit on that customer turns out to be 5000 × 6 × 0.40 = Rs. 12,000/-.
30% of the customers do not return each year, while 70% do return to provide more business to the restaurant. Assuming the average lifetime of a customer (the time for which a consumer remains a customer) is 1/0.3 = 3.33 years, the average gross profit for a typical customer turns out to be 12000 × 3.33 = Rs. 39,960/-.
Armed with all the above details, we can logically arrive at a conclusion and derive the following model for the above problem statement:
V = S × F × M × (1/D)
where,
V = Average lifetime gross profit of a customer
S = Spend per visit
F = Visits per year
M = Profit margin
D = Annual defection rate (the share of customers who do not return)
So, as you can see, logic-driven predictive models can be derived for a number of situations, conditions and problem statements, where predictive analytical models provide a forward-looking view on the basis of validation, testing and evaluation to estimate the likelihood of an outcome for a given set of input data.
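The worked restaurant example can be wrapped in a small, hypothetical helper function; note that using the exact lifetime 1/0.3 yields Rs. 40,000, whereas the rounded 3.33 years used in the text gives Rs. 39,960:

```python
def customer_lifetime_profit(spend_per_visit, visits_per_year, margin, defection_rate):
    """Logic-driven model: annual gross profit times average customer lifetime."""
    annual_gross_profit = spend_per_visit * visits_per_year * margin
    average_lifetime = 1 / defection_rate  # years a typical customer stays
    return annual_gross_profit * average_lifetime

# The restaurant example: Rs. 5000 per visit, 6 visits/year, 40% margin, 30% churn
print(round(customer_lifetime_profit(5000, 6, 0.40, 0.30)))  # 40000
```

Once expressed as a function, the model can be re-run for any segment of customers simply by plugging in that segment's spend, frequency, margin and churn figures.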
Summary
In this chapter, you have learned about Business Analytics and its types. Business Analytics is a group of techniques and applications for storing, analysing and making data accessible to help users make better strategic decisions. It frequently utilises numerous quantitative tools to convert big data into meaningful contexts valuable for making sound business moves. Next, you learned about the concept of Business Intelligence (BI) and its relation to business analytics. Business Intelligence (BI) is the set of applications, technologies and best practices for the collection, integration, presentation and analysis of business information. Finally, you learned about the different types of analytics in detail. Descriptive analytics answers “What has occurred in the corporation?” and “What is going on now?”. Diagnostic analytics is used to find the root cause of a given situation, answering ‘Why did it happen?’. Predictive modelling is the method of making, testing and authenticating a model to best predict the likelihood of an outcome.
Exercise
Multiple-Choice Questions
Q1. The result of this analysis is often a pre-defined reporting structure, such as a root cause analysis (RCA) report.
a. Descriptive analysis b. Diagnostic analysis
c. Predictive analysis d. Predictive modelling
Q2. Skills required for an analyst are:
a. Understanding the objectives b. Having good communication skills
c. Managing stakeholder meetings d. All of these
Assignment
Q1. Discuss the concept of BA.
Q2. Enlist and explain different types of BA.
Q3. Discuss the relation between BA and BI.
Q4. Discuss the importance of data visualisation with the help of suitable examples.
Q5. What do you understand by descriptive statistics? How are mean, median and mode calculated in statistics?
Q6. Describe sampling and estimation with suitable examples.
Q7. Explain the concept of predictive modelling.
Q8. What are logic-driven models? Discuss with appropriate examples.
References
What is big data analytics? Definition from WhatIs.com. (n.d.). Retrieved April 25, 2017, from
http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics
What is business analytics (BA)? Definition from WhatIs.com. (n.d.). Retrieved April 25, 2017,
from http://searchbusinessanalytics.techtarget.com/definition/business-analytics-BA
Monnappa, A. (2017, March 24). Data Science vs. Big Data vs. Data Analytics. Retrieved April 25,
Background
Opened in 1875, Cincinnati Zoo & Botanical Garden is a world-famous zoo that is located in Cincinnati,
Ohio, US. It receives more than 1.3 million visitors every year.
Challenge
In late 2007, the management of the zoo had begun a strategic planning process to increase the
number of visitors by enhancing their experience with an aim to generate more revenues. For
this, the management decided to increase the sales of food items and retail outlets in the zoo by
improving their marketing and promotional strategies.
According to John Lucas, the Director of Operations at Cincinnati Zoo & Botanical Garden, “Almost immediately we realised we had a story being told to us in the form of internal and customer data, but we didn’t have a lens through which to view it in a way that would allow us to make meaningful changes.”
Lucas and his team members were interested in finding business analytics solutions to meet the zoo’s needs. He said, “At the start, we had never heard the terms ‘business intelligence’ or ‘business analytics’; it was just an abstract idea. We more or less stumbled onto it.”
They looked at various providers, but did not initially include IBM on the false assumption that they could not afford it. Then somebody told them that it was completely free to talk to IBM. They then found that IBM had not only suggested a solution that fitted their budget, but that it was the most appropriate solution for what they were looking for.
Solution
IBM provided a business analytics solution to the zoo’s executive committee that offers a facility for analysing data related to customer memberships, admissions, food sales, etc., in order to gain a better understanding of visitors’ behaviour. The solution also provides a facility for analysing geographic and demographic information that could help in customer segmentation and marketing.
The executive committee of the zoo needed a platform capable of achieving the desired goals by uniting and analysing data associated with ticketing and point-of-sale systems, memberships and geographical details. The complete project was managed by senior executives of the zoo, IBM’s consultants and BrightStar Partners, an IBM Premier Business Partner.
Lucas said, “We already had a project vision, but the consultants on IBM’s pre-sales technology team helped us identify other opportunity areas.” While the project was being implemented, BrightStar became the main point of contact for the zoo. Next, a platform was created on IBM Cognos 8.4 in late 2010, which was then upgraded to Cognos 10 at the beginning of 2011.
Output
The result of implementing IBM’s business analytics solution is that the zoo’s return on investment (ROI) increased. Lucas admitted, “Over the 10 years we’d been running that promotion, we lost just under $1 million in revenue because we had no visibility into where the visitors using it were coming from.”
The new business analytics solution also helped the zoo save costs: for example, there was a saving of $40,000 in marketing in the first year, visitor numbers increased by around 50,000 in 2011, food sales increased by at least 25% and retail sales increased by at least 7.5%.
By adopting the new operational management strategies enabled by the business analytics solution, there was a remarkable increase in attendance and revenues, resulting in an annual ROI of 411%. Lucas further admitted, “Prior to this engagement, I never would have believed that an organisation of the size of the Cincinnati Zoo could reach the level of granularity its business analytics solution provides. These are Fortune 200 capabilities in my eyes.”
Questions
1. What measures were taken by Cincinnati Zoo & Botanical Garden in order to increase their
sales and revenue?
CHAPTER
5
Data Warehousing
Topics Discussed
Introduction
Data Warehousing: An Information Environment
Benefits of Data Warehousing
Features of Data Warehouse
Increased Demand for Strategic Information
Inability of Past Decision Support System
Operational vs. Decisional Support System
Information Flow Mechanism
Key Components
Classification of Metadata
Data Warehouse and Data Marts
Fact and Dimension Tables
Data Warehouse Architecture
Data Warehouse Design Techniques
Bottom-up Design
Top-down Design
ETL Process
Data Extraction
Identification of Data Source
Extraction Methods in Data Warehouse
Change Data Capture
Transformation
Staging
Loading
Cleaning
Describe ETL process
INTRODUCTION
A data warehouse can be defined as a centralised repository of data integrated from various sources in order to support the analytical techniques and decision-making processes of organisations. It is maintained by organisations as a central storehouse of data that can be accessed equally by all business experts and end users. The term ‘data warehouse’ was introduced by W. H. Inmon, a computer scientist also known as the Father of the Data Warehouse. Data warehouses18 are used to store huge amounts of data, which helps organisations in making decisions, assessing business conditions and formulating future strategies. Both a data warehouse and a database store data, but a data warehouse is better suited than an operational database to analytical workloads. A data warehouse is more effective in dealing with the information requirements of an organisation because it helps fulfil the information needs of management.
Extraction, Transformation, Loading (ETL)17 is a process of extracting data from source systems, validating it against certain quality standards, transforming it so that data from separate sources can be used together and delivered in a presentation-ready format, and then loading it into the data warehouse. This organised form of data helps organisations as well as end users to conduct analysis, create reports, formulate strategies and support the decision-making process. Apart from the three stages of extraction, transformation and loading, ETL also involves a transportation stage, in which data is transported from the various sources to the warehouse.
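The ETL flow just described can be illustrated with a toy sketch in Python, using two invented source schemas: rows are extracted, validated against a simple quality rule, transformed into a common schema and collected for loading:

```python
# Two hypothetical source systems with different schemas and units
source_a = [{"cust": "A1", "amount": "125.50"},
            {"cust": "A2", "amount": "not-a-number"}]    # fails validation
source_b = [{"customer_id": "B9", "total_paise": 9900}]  # amount stored in paise

warehouse_rows = []

# Extract + validate + transform source A into the common schema
for row in source_a:
    try:
        amount = float(row["amount"])  # quality check: amount must be numeric
    except ValueError:
        continue                       # reject records failing quality standards
    warehouse_rows.append({"customer_id": row["cust"], "amount_rs": amount})

# Extract + transform source B (convert paise to rupees)
for row in source_b:
    warehouse_rows.append({"customer_id": row["customer_id"],
                           "amount_rs": row["total_paise"] / 100})

# Load: in practice this would be a bulk insert into the warehouse table
print(warehouse_rows)
# [{'customer_id': 'A1', 'amount_rs': 125.5}, {'customer_id': 'B9', 'amount_rs': 99.0}]
```

Real ETL tools add staging areas, change-data capture and bulk loaders, as the later sections of this chapter describe, but the extract/validate/transform/load shape is the same.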
This chapter explains the significance of data warehousing and the need to implement it in organisations. It also discusses the importance of, and increased demand for, strategic information among business experts. It then elaborates on the benefits and features of data warehousing. Next, you will learn about the information flow mechanism, data warehouse architecture, data marts and data warehouse design techniques. The chapter also discusses the various steps involved in the ETL process, starting with an overview and proceeding to discuss its various stages in detail.
Self-Instructional
Material
116
Data Warehousing
FIGURE 5.1 Data Warehouse as an Informational Environment
A data warehouse stores a replica of information from the source transaction systems. Let us discuss
some benefits of implementing data warehousing:
• It collects data from numerous sources into a single database so that a single query engine can be used to access the data.
• It diminishes the problem of database isolation-level lock contention in transaction processing systems caused by attempts to run large, complex analysis queries against transaction processing databases.
• It maintains data history, even if the source transaction systems do not.
• It integrates data from several source systems, delivering a central view across the enterprise. This is always a valuable benefit, but especially so when the organization has merged with other organizations and grown in size.
• It improves the quality of data, provides reliable codes, descriptions and flags, and can even fix bad data.
• It delivers the organization's information consistently to business experts and managers.
• It provides a common data model for all data irrespective of the data's source.
• It restructures the data to deliver excellent query performance and supports complex analytic queries without affecting the operational systems.
• It enhances the value of operational business applications, especially Customer Relationship Management (CRM) systems.
• It makes decision-support queries easier to write.
• It is the best way to integrate valuable data from different sources into the database of a particular application.
• It makes it easy to develop and store metadata.
DATA SCIENCE
Business experts and users become accustomed to seeing many customized fields on display
screens, such as rolled-up general ledger balances. These fields do not exist in the operational
database. Moreover, when we perform reporting and analysis functions on the hardware that
handles transactions, the performance is often poor. Therefore, a data warehouse should be used
for reporting and analysis.
Integrated: Data coming from multiple sources may suffer from problems such as naming conflicts
and inconsistencies among units of measure. Once these problems are resolved in the data
warehouse, the data is regarded as integrated.
Nonvolatile: Non-volatile means that once data is entered into the data warehouse, it should not
be changed or altered. This is because the objective of a data warehouse is to let analysts study
what has occurred over time.
Time Variant: Business analysts require large amounts of historical data to discover trends in business.
This is very different from Online Transaction Processing (OLTP) systems, where performance
requirements demand that historical data be moved to an archive. The term time variant
signifies a data warehouse's focus on change over time. Normally, data flows on a monthly,
weekly or daily basis from one or many OLTP systems to the data warehouse.
Exhibit 1
Data warehouse: A vital resource for business analysis and decision making
Organizations in every part of the world generate huge volumes of data on a daily basis. Data
warehouses can provide decision makers the consolidated, accurate and time-stamped information
they need to make the right choices. Multinational brands like Pizza Hut have made huge progress
by building and analysing data warehouses of millions of customer records over the past 10 years.
This helped them design their product line-up for a particular demography. The company used
it to do target marketing and find the best deals for a given household. Using data warehousing,
the company could segment customer households according to their buying behavior. A data
warehouse can also be used to assess whether a campaign or promotional offering was successful.
Figure 5.2 shows some business objectives that can be achieved using strategic information.
Business experts and managers use strategic information to make decisions about these objectives
for some important purposes, such as to:
• Gain thorough knowledge of their company's operations
• Learn about important business factors and how they affect each other
• Compare the performance of their organization to that of its competitors
Business experts and managers need to concentrate on the needs and preferences of customers,
new technologies, sales and marketing outcomes, and the quality of products and services. Many
types of information are needed to make decisions regarding the creation and execution of business
strategies. We can group all these types of essential information together and call it strategic
information. Strategic information is not for executing daily operations, such as generating invoices,
making shipments and recording bank transactions. Strategic information is much more important
and helps organizations take some of their most crucial decisions. Figure 5.3 shows the features of
strategic information:
• Integrated: Must have a single, enterprise-wide view
• Data Integrity: Information must be accurate and conform to business rules
• Accessible: Easily accessible
• Credible: Every business factor must have one and only one value
• Timely: Information must be available within the stipulated time frame
You might face such situations many times during your career as an IT expert. Sometimes, you might
get the information needed for such ad hoc reports from databases and other sources, and
sometimes you may not. In the latter case, you may have to approach several applications, running
on different platforms in your company's environment, to get the information. Sometimes, you
may also be required to sort or present the information in different formats, and all these tasks can
prove to be very cumbersome and time-consuming in the absence of a data warehouse.
The fact is that for the last couple of decades or more, IT departments have been trying to deliver
information to key personnel in their companies for making strategic decisions. Sometimes an IT
department could generate ad hoc reports from a single application, but in most cases the reports
had to be created from multiple systems.
Most of these efforts by IT departments in the past resulted in failure. Users often could not clearly
define what they needed in the first place. Once the first set of reports was delivered, they
wanted more data in changed formats. This happened primarily because of the nature of the process
of making strategic decisions. Information required for strategic decision making has to be available
in an interactive manner: the user must be able to query online and get results, and the information
must be in an appropriate format for analysis.
Let us now compare the operational and decision support systems. Table 5.1 shows the comparison
between operational support systems and decisional support systems:
TABLE 5.1: Difference between Operational Support Systems and Decisional Support Systems
Data in decisional support systems needs to be updated periodically with new data derived from
the operational data. Decision support data does not store the details of each operational
transaction; it represents transaction summaries. Hence, we can say that decision support systems
store data that is summarized, integrated and aggregated for decision-support objectives.
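The contrast between per-transaction operational data and summarized decision-support data can be sketched with Python's built-in sqlite3 module. The table names, columns and figures below are invented purely for illustration:

```python
import sqlite3

# Illustrative schema: the operational table holds every transaction,
# while the decision-support copy keeps only aggregated summaries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_txn (txn_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_txn VALUES (?, ?, ?)",
    [(1, "North", 100.0), (2, "North", 50.0), (3, "South", 75.0)],
)

# Decision-support data: summarized and aggregated, not per-transaction.
conn.execute(
    """CREATE TABLE sales_summary AS
       SELECT region, COUNT(*) AS txn_count, SUM(amount) AS total_amount
       FROM sales_txn GROUP BY region"""
)
for row in conn.execute("SELECT * FROM sales_summary ORDER BY region"):
    print(row)
# prints ('North', 2, 150.0) then ('South', 1, 75.0)
```

The summary table answers "how is each region doing?" directly, without touching individual transactions, which is exactly the kind of query a decision support system is built for.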
FIGURE 5.4 Information Flow through a JMI-Enabled Metadata Service
Figure 5.4 shows the complete flow of information through the integrated data warehouse. The white
arrows denote the general flow of data through the data warehouse, from the Operational Data
Store (ODS) to the advanced visualization/reporting software. This data flow is characteristically
metadata-driven.
In this environment, metadata has a single, centralized depiction in terms of JMI. Shared
metadata is defined by a MOF (Meta Object Facility)-compliant metamodel, the Common
Warehouse Metamodel (CWM), but the JMI-enabled metadata service is not tied to any particular
metamodel: it is capable of loading the CWM metamodel and dynamically generating an internal
implementation of CWM. Communication of shared metadata is accomplished through JMI
interfaces. XmiReader and XmiWriter interfaces are used to transfer complete models or specific
packages of models in bulk format for loading into tools. On the other hand, metamodel-specific
(CWM, in this case) JMI interfaces are used by client tools for browsing and possibly building or
altering existing metadata structures. Finally, JMI reflection is used to enable metadata integration
between tools whose metamodels vary but are otherwise MOF-compliant. Let us see some key
components of the information flow mechanism.
Key Components
• The information supply chain is fully integrated
• Metadata communication takes place via the JMI programmatic API and bulk interchange (XMI)
• MOF/JMI reflection is used to reconcile metadata integration between tools based on dissimilar metamodels
With which metamodel is the JMI-enabled metadata service linked?
CLASSIFICATION OF METADATA
Classification of metadata is defined by a classification schema, which is a grouping of similar things.
It is commonly represented by a hierarchical structure with descriptive information for each group. A
classification is designed to be used for arranging individual objects into groups, which are formed
on the basis of the common features of the objects. When evaluating a classification schema,
consider the following:
• Whether subordinates may have several superordinates. Multiple supertypes for one subtype signify that the subordinate contains the features of all its superordinates.
• Whether the standards for belonging to a class or group are well defined.
• Whether the types of relations between the concepts are made clear and well defined.
• Whether subtype-supertype relations are differentiated from composition relations and from object-role relations.
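The idea that a subordinate with several superordinates carries the features of all of them can be sketched with Python's multiple inheritance. The class names here are invented for illustration and are not part of any metadata standard:

```python
# Hypothetical classification: 'Smartphone' is a subtype with two
# supertypes, so it inherits the features of both superordinates.
class Telephone:
    def call(self):
        return "can place calls"

class Computer:
    def compute(self):
        return "can run programs"

class Smartphone(Telephone, Computer):
    """Subordinate of two superordinates: combines both feature sets."""
    pass

s = Smartphone()
print(s.call())     # feature inherited from Telephone
print(s.compute())  # feature inherited from Computer
```

This mirrors the first evaluation criterion above: when a schema permits multiple superordinates, membership in the subtype implies every feature of every supertype.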
According to many vendors, data warehouses are not easy to build and are also expensive. They
would have you believe that building a data warehouse is just a waste of time. However, this is not
accurate. These data mart vendors see data warehouses as obstacles that stop them from earning
profits, so it is natural that they would tell you about all the drawbacks you may encounter while
implementing a data warehouse. Some vendors might suggest that you build a data warehouse by
building a few data marts and letting them grow.
However, using this method you might face many problems. Data mart companies try to advertise
their products as being data warehouses, which often confuses people. Many people purchased
data marts and started using them without data warehouses, but soon realized that the architecture
was defective. It should be understood clearly that data warehouses and data marts are two different
things. There are some noteworthy differences between them, and a data warehouse has a structure
that is distinct from that of a data mart.
A data mart19 is a collection of subjects that supports departments in making specific decisions.
For example, the marketing department will have its own data mart, while the sales department
will have a data mart that is separate from it. Additionally, each department completely owns the
software, hardware and other components that form its data mart.
Due to this, it is difficult to manage and organize data across various departments. Each department
controls its own data mart and decides how it looks, and the data mart each department uses is
specific to it. Comparatively, a data warehouse is designed around the entire organization and is not
owned by any single department. The data contained in data warehouses is granular, whereas the
information stored in data marts is not very granular.
Another thing that distinguishes data warehouses from data marts is that data warehouses store
more information. The information stored in data marts is normally summarized. Data warehouses
do not store information that is biased towards any one department. Instead, they contain
information that is analyzed and processed for the organization as a whole.
The information held in data warehouses is mostly historical in nature, and data warehouses are
designed to process this information. As we have seen, there are many differences between data
marts and data warehouses.
Table 5.2 shows the difference between a data warehouse and a data mart:
TABLE 5.2: Difference between Data Warehouse and Data Mart
Data Warehouse | Data Mart
It has a corporate/enterprise-wide scope. | Its scope is departmental, specific to one department.
Data is received from the staging area. | Data is received from a star-join (facts and dimensions).
It queries on the presentation resources. | It is technology optimal for data access and analysis.
It has a structure for the corporate view of data. | It has a structure to suit the departmental view of data.
A fact table stores the measurements of the business; the number of records may range up to
hundreds of millions for large organizations. These records contain one or more years of history
of the operations of the organization. One important characteristic of a fact table is that it contains
numerical data (facts) that can be summarized and aggregated to provide information about the
history of the operations of the organization. Each fact table also contains an index made up of the
primary keys of the related dimension tables as foreign keys; these dimension tables contain the
attributes of the fact records. Fact tables should not contain any kind of descriptive data; they should
contain only the numerical fact fields and the index fields that relate the facts to corresponding
entries in the dimension tables.
A dimension table, on the other hand, is a hierarchical structure that contains attributes describing
the fact records in the fact table. Dimension tables are also called lookup or reference tables and
contain relatively static data in the warehouse. Some of these attributes provide descriptive
information; others are used to specify how fact table data should be aggregated or summarized to
provide useful information to the analyst. Dimension tables consist of hierarchies of attributes that
help in summarization. For example, a dimension containing product information would often
contain a hierarchy that separates products into categories, such as food, drink and non-consumable
items, with each of these categories subdivided a number of times until the individual product is
reached at the lowest level.
Dimension data is usually collected at the lowest level and then aggregated into higher-level totals,
which prove more useful for business analysis. These natural progressions or aggregations within a
dimension table are called hierarchies. Dimension tables store the mostly textual, descriptive
information that queries typically constrain on and that can be used as row headers in result sets.
Dimension tables are produced by dimensional modeling. Each table contains attributes that
are independent of those in other dimensions. For example, a customer dimension table contains
data about customers, a product dimension table contains information about products, and a store
dimension table contains information about stores. Queries use attributes in dimensions to
specify a view into the fact information. For example, the product, store and time dimensions might
be used in a query to ask the question 'What was the cost of non-consumable goods in the southwest
in 1989?' Subsequent queries can drill down along one or more dimensions to study more detailed
data, such as 'What was the cost of kitchen products in New York City in the third quarter of 1999?'
In these examples, the dimension tables are used to specify how a numeric fact (cost) in the fact
table is to be summarized.
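A question like the one above can be sketched against a toy star schema using Python's standard sqlite3 module. The tables, columns and figures below are invented for illustration; only the shape (one fact table joined to dimension tables) reflects the design described in the text:

```python
import sqlite3

# A toy star schema: one fact table plus product and store dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, store_key INTEGER,
                          year INTEGER, cost REAL);  -- numeric facts only
""")
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)",
    [(1, "Soap", "non-consumable"), (2, "Bread", "food")])
conn.executemany("INSERT INTO dim_store VALUES (?,?)",
    [(1, "southwest"), (2, "northeast")])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
    [(1, 1, 1999, 10.0), (1, 1, 1999, 5.0), (2, 1, 1999, 3.0),
     (1, 2, 1999, 7.0)])

# "What was the cost of non-consumable goods in the southwest in 1999?"
# Dimension attributes constrain the query; the numeric fact is summed.
(total,) = conn.execute("""
    SELECT SUM(f.cost)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store   s ON f.store_key   = s.store_key
    WHERE p.category = 'non-consumable'
      AND s.region   = 'southwest'
      AND f.year     = 1999
""").fetchone()
print(total)  # 15.0
```

Note how the fact table holds only keys and the numeric measure, while all descriptive filtering happens through the dimension tables, exactly as the paragraphs above describe.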
FIGURE 5.5 Architecture of Data Warehouse (source data layer: production, internal, external and archived data; data staging layer: extraction, transformation and loading (ETL); data storage layer: metadata, data warehouse, MDDB and data marts; information delivery layer: data mining, OLAP analysis, query/reporting and analytics)
The following is a description of the layers of the data warehouse system:
1. Data Source Layer: Refers to the layer representing the various data sources that feed data
into the data warehouse. The data can be in any of these formats: plain text file, relational
database, Excel file and other types of databases. The following are the types of data that can
act as data sources:
• Production Sources: Represents sales data, HR data and product data
• Internal Data: Represents data of a department or an organization, such as employee data
• External Data: Represents data from outside the organization or third-party data, such as census, demographic or survey data
• Archived Data: Represents logs of the Web server along with users' browsing data
2. Data Staging Layer: Refers to the storage area where data is processed before being loaded
into the data warehouse. The following are the steps involved in transporting data from the
various sources to the data warehouse:
• Extraction: Refers to the process of extracting data from the different source systems and validating it against certain quality standards
• Transformation: Refers to converting the extracted data into a consistent format so that data from separate sources can be used together
• Loading: Refers to the process of loading the transformed data into the data warehouse or data mart
3. Data Storage Layer: Refers to the layer in which the transformed and cleaned data is stored.
On the basis of scope and functionality, the following are the types of entities in this stage:
• Data Warehouse: Maintained by organizations as a central warehouse of data that can be equally accessed by all business experts and end users.
• Data Mart: When a data warehouse is created at the departmental level, it is known as a data mart.
• Metadata: Details about the data are known as metadata. In other words, metadata is a catalogue of the data warehouse.
• MDDB: A multidimensional database that allows data to be modelled and viewed in multiple dimensions. It is defined by dimensions and facts.
4. Information Delivery Layer: Provides the information that reaches end users. The information
can be in any form, such as tables, charts, graphs or histograms. The following are the tools
used in this layer:
• Data Mining: Refers to the process of finding relevant and useful information in a large amount of data.
• OLAP: Allows the navigation of data at different levels of abstraction through operations such as drill-down, roll-up, slice and dice.
• Query/Reports: Query and reporting tools are used for accessing and displaying data stored in the data warehouse. A user enters a query and the corresponding information is displayed, mainly in the form of reports.
• Analytics: Data is dynamically and efficiently accessed from the data warehouse using various analytical tools. The accessed data can be used for analysis by end users and for decision making in the organization.
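The OLAP operations of roll-up and drill-down mentioned above can be sketched as plain aggregation at two levels of abstraction, using Python's standard sqlite3 module. The table, regions and amounts are invented for illustration:

```python
import sqlite3

# The same data viewed at two levels of abstraction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?)",
    [("West", "LA", 10.0), ("West", "SF", 20.0), ("East", "NY", 30.0)])

# Roll-up: summarize cities up to the coarser region level.
rollup = list(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"))
print(rollup)  # [('East', 30.0), ('West', 30.0)]

# Drill-down: descend from one region back to the finer city level.
drill = list(conn.execute(
    "SELECT city, SUM(amount) FROM sales "
    "WHERE region = 'West' GROUP BY city ORDER BY city"))
print(drill)   # [('LA', 10.0), ('SF', 20.0)]
```

Dedicated OLAP tools precompute and navigate such aggregates interactively; this sketch only shows the underlying idea of moving between levels of a hierarchy.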
A data warehouse is structured for a corporate view of data, whereas a data mart is structured
for a departmental view of data. Do you agree with this statement?
DATA WAREHOUSE DESIGN TECHNIQUES
A data warehouse can be designed using either a bottom-up or a top-down approach. In the
top-down view, a global data warehouse contains the data of the entire organisation and is
segmented into different data marts or subset warehouses.
Bottom-up Design
The bottom-up approach to data warehouse design was championed by Ralph Kimball. In this
approach, data marts are first created to provide reporting and then analytical capabilities for
specific business processes. Data marts store dimensions and facts; facts comprise atomic data
and, if necessary, summarized data. A single data mart usually models a particular business
area such as 'Sales' or 'Production'. These data marts can ultimately be integrated to create a
complete data warehouse. The data warehouse bus architecture is mainly an implementation of
'the bus': a group of conformed dimensions and conformed facts, where conformed dimensions are
dimensions shared between facts in multiple data marts.
The integration of data marts in a data warehouse is centered on the conformed dimensions residing
in the bus, which define the potential integration 'points' between data marts. The actual
integration of two or more data marts is then completed by a process known as 'drill across'. A
drill across works by grouping or summarizing data along the keys of the conformed dimensions of
each fact participating in the drill across, followed by a join on the keys of these summarized facts.
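The drill-across described above can be sketched with Python's standard sqlite3 module: each fact table is summarized along the conformed dimension key and the summaries are then joined. The fact tables, the month dimension and the numbers are invented for illustration:

```python
import sqlite3

# Two data marts share a conformed 'month' dimension on the bus.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales      (month TEXT, units_sold INTEGER);
CREATE TABLE fact_production (month TEXT, units_made INTEGER);
""")
conn.executemany("INSERT INTO fact_sales VALUES (?,?)",
    [("2018-01", 40), ("2018-01", 10), ("2018-02", 30)])
conn.executemany("INSERT INTO fact_production VALUES (?,?)",
    [("2018-01", 60), ("2018-02", 25), ("2018-02", 10)])

# Drill across: summarize each fact along the conformed key, then join.
rows = list(conn.execute("""
    SELECT s.month, s.sold, p.made
    FROM (SELECT month, SUM(units_sold) AS sold
          FROM fact_sales GROUP BY month)      s
    JOIN (SELECT month, SUM(units_made) AS made
          FROM fact_production GROUP BY month) p
      ON s.month = p.month  -- join on the conformed dimension key
    ORDER BY s.month
"""))
print(rows)  # [('2018-01', 50, 60), ('2018-02', 30, 35)]
```

Because both marts conform to the same month dimension, the join is meaningful; without conformed dimensions the two summaries could not be combined reliably.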
Upholding strict management over the data warehouse bus architecture is essential to maintain
the integrity of the data warehouse. The most significant management task is to ensure that the
dimensions among data marts are consistent.
Business value can be returned as soon as the first data marts are created. For example, data
warehousing might start in the 'Sales' department with the creation of a Sales data mart. After
completion of the Sales data mart, the business might choose to extend the warehousing
activities to the 'Production' department, resulting in a Production data mart. For the Sales
data mart and the Production data mart to be integrable, they must share the same bus: the
data warehousing team must have recognized and implemented the conformed dimensions
in the bus, and each separate data mart must link to that information from the bus. Since the
Sales data mart is already constructed, the Production data mart can be constructed virtually
independently of the Sales data mart (but not independently of the bus).
Top-down Design
The top-down approach is designed with the help of a normalized enterprise data model. 'Atomic'
data, i.e. data at the lowest level of detail, is stored in the data warehouse. Dimensional data marts
that store the data needed for specific business processes or specific departments are built from the
data warehouse. In the Inmon vision, the data warehouse sits at the centre of the 'Corporate
Information Factory' (CIF), which provides a logical framework for delivering Business Intelligence
(BI) and business management capabilities.
ETL PROCESS
ETL is the process of extracting data from varied source systems and loading it into the data
warehouse. A data warehouse needs to be loaded regularly so that it can serve its purpose of
providing relevant data to facilitate business analysis. To achieve this, data from one or more
operational systems needs to be extracted and copied into the data warehouse. The integration,
rearrangement and consolidation of large amounts of data over many systems is a challenge in the
data warehouse environment, with the goal of providing newly organised information for business
intelligence.
The three processes of extraction, transformation and loading are responsible for the majority of
operations taking place at the back end of data warehousing. Although ETL is primarily a back-end
process, it takes up almost 70% of the resources required for the maintenance and implementation
of a data warehouse. First, data extraction takes place from multiple sources, which typically range
from relational databases and On-Line Transaction Processing (OLTP) systems to Web pages and
various kinds of documents like Word documents or spreadsheets. After the extraction phase comes
the transformation phase, wherein all the extracted data is accumulated in a special area called the
Data Staging Area (DSA). The homogenization, transformation and cleaning of data take place at the
DSA. This transformation takes place with checks such as filters and integrity constraints in place to
ensure that the data that reaches the warehouse conforms to the business rules and schema of the
target data warehouse. In the final step, the data is loaded into the data warehouse.
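The three stages can be sketched end to end in a few lines of Python. The source records, field names and the single validation rule below are invented for demonstration; real ETL tools apply far richer transformations:

```python
import sqlite3

def extract():
    # Extraction: pull raw records from a (here, hard-coded) source.
    return [{"name": " Alice ", "amount": "100"},
            {"name": "Bob",     "amount": "oops"},   # fails validation
            {"name": "Carol",   "amount": "250"}]

def transform(rows):
    # Transformation at the staging area: cleaning plus an integrity
    # check (the amount field must be numeric to pass the filter).
    clean = []
    for r in rows:
        if r["amount"].strip().isdigit():
            clean.append((r["name"].strip(), int(r["amount"])))
    return clean

def load(rows, conn):
    # Loading: write only the conforming records into the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS dw_orders (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO dw_orders VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(list(conn.execute("SELECT * FROM dw_orders")))
# [('Alice', 100), ('Carol', 250)]
```

Bob's record is rejected at the transformation stage because it violates the integrity constraint, so the warehouse receives only data that conforms to its schema.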
FIGURE 5.6 Major Steps in the ETL Process
Exhibit 2
Google Cloud using ETL architecture
Google Cloud supports an ETL architecture for a cloud-native data warehouse on the Google Cloud
Platform. ETL solutions automate the tasks of extracting data from operational databases, making
initial transformations to the data, loading data records into staging tables and initiating aggregation
calculations. A lot of ETL tools are now capable of handling very large amounts of data that do
not necessarily have to be stored in any data warehouse. With Hadoop connectors to big data
sources being provided by almost 40% of ETL tools, support for big data continues to grow at a
fast pace.
Data Extraction
Data extraction is the first step in the ETL process. During this phase, the required data is first
identified and then extracted from varied sources, such as database systems and applications, using
as few resources as possible. The extraction process should not adversely affect the source in
terms of its execution, response time or any kind of locking. Most of the time, it is not possible to
precisely identify the data of interest, so during the extraction stage more data gets extracted than
is actually required; the identification of relevant data is done at a later point in time. The size of
the extracted data can range from hundreds of kilobytes up to gigabytes, depending on the source
system and the business situation. Depending upon the capabilities of the source system, some
transformation might take place during the extraction process itself.
Designing and creating the extraction process is often the most time-consuming part of the entire
ETL process. The source systems are diverse and varied in design; they are complex and usually
poorly documented, so extracting useful data from them can be challenging. Moreover, the data
is extracted not once but periodically, to keep the data warehouse up to date. The extraction
process has no control over the source system, nor over its performance or availability, to suit the
needs of the data warehouse extraction method.
There are various techniques to extract data from different kinds of sources. The extraction process
that we choose is influenced by various factors, such as the data source system, the transportation
process and the time needed to refresh the data warehouse.
Let’s assume that an organization designs a database to provide strategic information on the orders
that it fulfilled. To do that, it needs the records of previous as well as current fulfilled and pending
orders. Now, if the orders are fulfilled through multiple channels, then the organization also needs
reports about these channels. The order fact table contains data related to order, such as date of
delivery, item no, item codes, discounts and credit limit. The dimension table contains the details
about products, customers and channels. The organization also needs to ensure that it has the
correct data source needed for the database and this data source is able to supply correct data to
each data element. This is done by going through the verification process to authenticate the data
source.
Identification of data source is a crucial step in the data extraction process. We need to go through
L
the source identification and ensure that whatever bit of data is entered into the data warehouse
must be authenticated.
The time of the last extraction also determines which data is to be used. This is necessary because
the data source is in a constant state of updating. Whenever a new addition or modification takes
place in the existing data, the data source changes. Thus, data in a system is said to be time-dependent
or temporal, since the data in the system changes with time.
In incremental extraction, only the data that has changed after a specific point of time is extracted.
The extraction may be defined by an event that occurred in the past, such as the last time of
extraction; the data that has changed after this event is then identified. This change in data is either
indicated by the source data itself, such as a data column showing the last change, or recorded in a
separate table where any addition or modification keeps getting logged.
Since the incremental method entails the additional logic of maintaining a separate table, many data
warehouses do not use this extraction process. Instead, the whole table from the source system is
extracted to the data warehouse or data staging area and compared with the previous extract from
the source to identify the changes. Although this approach might prove simpler for the source data,
it clearly places a huge burden on the data warehouse processes, especially in cases where the data
volume is very high.
Incremental data extraction can be done in two ways:
Immediate Data Extraction: In this technique, data extraction is done in real time. Extraction
occurs at the same time as the transactions take place in the source databases and files, as shown
in Figure 5.8:
FIGURE 5.8 Immediate Data Extraction Process (Option 2: capture through database triggers; Option 3: capture in source applications)
In the option of capturing changes in the source applications, you have to accordingly modify the
relevant application programs that write to the source files and databases. You rewrite the
programs to include all adds, updates and deletes to the source files as well as database tables.
The changes to the source data can then be written to separate files by another extract program.
Deferred Data Extraction: In this technique, as compared to immediate data extraction, the
changes are not captured in real time, as shown in Figure 5.9:
FIGURE 5.9 Options for Deferred Data Extraction (Option 1: capture based on date and time stamp; Option 2: capture by comparing files)
For capture by comparing files, it is necessary to keep prior copies of all the relevant source data,
as they may be required for comparing data in the future. Though the technique is simple and
straightforward, the actual comparison of full rows in a large file can be very time-consuming and
may prove inefficient in the long run. Still, this technique may be the only feasible way to capture
changes from legacy data sources that have no transaction logs or timestamps.
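Capture by comparing files can be sketched in a few lines of Python by diffing yesterday's snapshot against today's extract. The keys and records below are purely illustrative; a real implementation would stream sorted files rather than hold everything in memory:

```python
# Deferred change capture by comparing two snapshots keyed by record id.
yesterday = {1: ("Alice", "NY"), 2: ("Bob", "LA"), 3: ("Carol", "SF")}
today     = {1: ("Alice", "NY"), 2: ("Bob", "TX"), 4: ("Dave", "CHI")}

# Records present only today were added; only yesterday, deleted;
# present in both but different, updated.
adds    = {k: today[k]     for k in today.keys() - yesterday.keys()}
deletes = {k: yesterday[k] for k in yesterday.keys() - today.keys()}
updates = {k: today[k]     for k in today.keys() & yesterday.keys()
           if today[k] != yesterday[k]}

print(adds)     # {4: ('Dave', 'CHI')}
print(deletes)  # {3: ('Carol', 'SF')}
print(updates)  # {2: ('Bob', 'TX')}
```

Only the adds, deletes and updates need to be applied to the warehouse, but producing them required reading both full snapshots, which is exactly the cost the paragraph above warns about for large files.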
Online Extraction: In this technique, data is extracted directly from the source system itself, or
from an intermediate system that stores the changes in a preconfigured manner (e.g. snapshot
logs or change tables). The point to note here is that the intermediate system is not necessarily
physically different from the source system. For online extraction, we need to consider whether
the distributed transactions use the original source objects or prepared source objects.
Offline Extraction: In contrast to online extraction, data is extracted from outside the original
system, not directly from the sources. The data either already has a pre-existing format or
structure (e.g. redo logs, archive logs or transportable tablespaces) or was created by an
extraction routine.
Redo and Archive Logs: These logs contain change information in a specific additional dump file.
Due to the efficient identification and extraction of only the most recently changed data, the extraction process (as well as all downstream operations in the ETL process) becomes much more efficient, because it must now extract a much smaller volume of data. On the other hand, for some source systems, identifying the recently modified data can be difficult or intrusive to the operations of the system, which works against the efficiency and speed of the system. In data extraction, change data capture is one of the most demanding issues.
Data Warehousing
Although Change Data Capture is an important and desirable part of the extraction process, it is not
always possible to implement it. The following are some alternate techniques for implementing a
self-developed change capture on Oracle Database source systems:
Timestamps: Some operational systems have specific timestamp columns in their tables. This
timestamp specifies the time and date when the specified row was last modified. This enables
the identification of the latest data very easily by using the timestamp columns and reduces the
overheads of extracting extra data. For example, the following query proves useful in extracting
today’s data from an orders table:
SELECT * FROM orders
WHERE TRUNC(CAST(order_date AS date),'dd') =
TO_DATE(SYSDATE,'dd-mm-yyyy');
If a timestamp column is not originally present in an operational source system, modifying the system to include timestamps can prove to be a difficult task. Such a modification would require changing the design of the operational system's table to include a new timestamp column, and then updating that column with the help of a trigger fired after every operation that modifies a given row.
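The timestamp technique amounts to filtering rows against a "last extracted" watermark. A minimal sketch, with invented rows and column names standing in for the orders table:

```python
from datetime import datetime

# Illustrative only: the rows, column names and watermark value are invented.
rows = [
    {"order_id": 1, "order_date": datetime(2018, 3, 1, 9, 0)},
    {"order_id": 2, "order_date": datetime(2018, 3, 2, 14, 30)},
    {"order_id": 3, "order_date": datetime(2018, 3, 2, 16, 45)},
]

def extract_since(rows, watermark):
    """Return rows modified after the last successful extract, plus the new watermark."""
    changed = [r for r in rows if r["order_date"] > watermark]
    new_watermark = max((r["order_date"] for r in changed), default=watermark)
    return changed, new_watermark

# Only rows stamped after the previous run are extracted.
changed, wm = extract_since(rows, datetime(2018, 3, 2, 0, 0))
```

The saved watermark plays the same role as the `TRUNC`/`SYSDATE` comparison in the SQL query above: it bounds the extract to recently modified rows.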
Partitioning: Some source systems use range partitioning, i.e., the source tables are partitioned along a date key. This helps in easy identification of new data. For instance, if data extraction is required from an orders table that is partitioned by week, then the current week's data is easily identifiable.
Triggers: Triggers are created to keep track of the recently updated records in an operational system. Timestamp columns can also be used along with triggers to identify the actual time and date when a given row was last modified. This can be done by creating a trigger on each source table where change data capture has been implemented. Thus, for every DML statement that is executed on the source table, the trigger updates the timestamp column with the current time. Hence, with the help of the timestamp column, which provides the exact time and date when a given row was last modified, you can extract the required data.
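What the trigger does can be mimicked in a few lines. The class below is purely illustrative (the names `upsert`, `changed_since` and the integer clock are invented); a real implementation would be a database trigger writing the current time into the timestamp column.

```python
# Sketch of what a change-capture trigger does: every DML on the source table
# also stamps the row with the current time. An integer counter stands in for
# the clock so the example is deterministic.
class SourceTable:
    def __init__(self):
        self.clock = 0          # stand-in for the system clock
        self.rows = {}

    def upsert(self, key, values):
        self.clock += 1
        # The "trigger": fired on every modification, it updates the
        # timestamp column with the current time.
        self.rows[key] = {**values, "last_modified": self.clock}

    def changed_since(self, watermark):
        return sorted(k for k, r in self.rows.items()
                      if r["last_modified"] > watermark)

t = SourceTable()
t.upsert("A", {"qty": 5})      # clock = 1
t.upsert("B", {"qty": 7})      # clock = 2
watermark = t.clock            # an extract runs here
t.upsert("A", {"qty": 9})      # clock = 3: only A changes afterwards
```

Only rows stamped after the watermark need to be extracted on the next run.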
This kind of technique is used for Oracle materialized view logs. These logs are used by materialized
views to identify changed data. These logs are accessible to end users also. However, the format of
the materialized view logs is not documented and might change over time.
These techniques are defined by the characteristics of the source systems. Some source systems might require modifications to implement them. Each of these techniques should be assessed carefully by the owner of the source system prior to implementation.
All these techniques can be combined with the data extraction techniques discussed previously. For example, timestamps can be used whether the data in a file is being unloaded or the data is accessed through a distributed query.
Materialized view logs are completely based on triggers, but this proves to be an advantage as the
creation and maintenance of the change-data system is managed by the database. Trigger-based
techniques might affect performance of the source systems.
Data can be extracted either as complete tables or as the result of joining several tables together. Different extraction techniques are implemented differently according to their capabilities to support these two scenarios.
If the source system is an Oracle database, the following options are available for extracting
data into files:
• Extracting into Flat Files using SQL*Plus
• Extracting into Flat Files using OCI or Pro*C Programs
• Exporting into Export Files using the Export Utility
• Extracting into Export Files using External Tables
Extraction through Distributed Operations: With distributed-query technology, one database can directly query tables located in various source systems. Specifically, a data warehouse or staging database can directly access tables and data located in a connected source system. Gateways are a form of distributed-query technology. They allow an Oracle database (such as a data warehouse) to access database tables stored in remote, non-Oracle databases. This is one of the simplest approaches for transferring data between two Oracle databases, because it merges the transformation and the extraction into a single step and requires minimal programming. However, it is not always feasible.
Transformation
Data transformation is the second step in the ETL process. It is also considered to be the most
complex and time-consuming process. Data transformation can include both simple data conversions
as well as extremely complex data scrubbing (cleaning/error correction) techniques at the same
time. Some data transformations can occur within the database, although most transformations are
implemented outside the database, for example on flat files.
The following headings will demonstrate the types of fundamental technology that can be applied
to implement the transformations.
Transformation Flow
From an architectural perspective, data can be transformed in two ways:
Multistage Data Transformation: Data transformation logic consists of multiple steps for most
data warehouses. For example, while inserting new records into a table, the transformation
may take place in separate logical transformation steps to validate each dimension key.
Figure 5.10 offers a graphical way of looking at the transformation logic:
[Figure: new sales data is transformed in separate logical steps and finally inserted into the sales warehouse table.]
FIGURE 5.10 Multistage Data Transformation
If the Oracle database is used as a transformation engine, a common technique would be to
implement each transformation as a separate SQL operation. After each such operation, a
separate, temporary staging table is created (such as new_sales_step1 and new_sales_step2 in
Figure 5.10) to store the intermediate results for each step. This load-then-transform technique
provides a natural checkpoint scheme to the entire transformation process, thus enabling the
process to be more easily monitored and restarted. However, due to this, there is an increase
in time and space requirement that proves to be a major disadvantage of multi-staging data
transformation.
To overcome this disadvantage, many simple logical transformations can be combined into a single SQL statement or a single PL/SQL procedure. Although combining steps may optimize performance, it may also make modifying, adding, or dropping individual transformations more difficult, as well as recovering from failed transformations.
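The load-then-transform idea with per-step staging tables can be sketched as follows. The step and table names (`new_sales_step1`, `new_sales_step2`) follow the figure, but the in-memory "staging" dictionary and the toy transformations are invented for illustration.

```python
# A minimal sketch of multistage transformation: each step writes its output
# to a named staging "table", so the result of every step is checkpointed
# and a failed run can restart from the last completed step.
staging = {}

def run_step(name, rows, transform):
    out = [transform(r) for r in rows]
    staging[name] = out            # checkpoint: intermediate result persisted
    return out

raw = [{"key": "k1", "amount": "10"}, {"key": "k2", "amount": "25"}]
step1 = run_step("new_sales_step1", raw,
                 lambda r: {**r, "amount": int(r["amount"])})   # type conversion
step2 = run_step("new_sales_step2", step1,
                 lambda r: {**r, "amount": r["amount"] * 100})  # unit conversion
```

The extra writes to `staging` are exactly the time-and-space cost the text identifies as the main disadvantage of multistage transformation.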
Pipelined Data Transformation: Pipelined data transformation technique changes the ETL
process flow dramatically. It renders some of the previous necessary process steps obsolete
whereas a few others are remodeled to enhance the data flow. This enhances the whole
process and creates a more scalable and non-interruptive data transformation procedure. The focus shifts from a serial transform-then-load process (where most of the tasks are done outside the database) or load-then-transform process to an enhanced transform-while-loading approach.
Figure 5.11 shows the pipelined data transformation:
[Figure: flat files are read through an external table; customer keys are validated by a lookup in the customer dimension table; source product keys are converted to warehouse product keys; and the rows are inserted into the sales warehouse table.]
FIGURE 5.11 Pipelined Data Transformation
The difference between pipelined data transformation and multistage data transformation is that
pipelined transformation occurs in database tables while multistage data transformation might also
take place outside database tables.
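A rough analogy to transform-while-loading can be drawn with Python generators, which pass each row through all steps without materializing an intermediate table between them. Everything here (the keys, the lookup set, the key map) is invented for illustration.

```python
# Sketch of pipelined transformation: rows stream through validation and
# conversion straight into the target, one at a time, with no intermediate
# staging table between the steps.
def validate_keys(rows, valid_customers):
    for r in rows:
        if r["customer"] in valid_customers:   # lookup in dimension table
            yield r

def convert_product_keys(rows, key_map):
    for r in rows:
        yield {**r, "product": key_map[r["product"]]}

source = [{"customer": "c1", "product": "src-1"},
          {"customer": "bad", "product": "src-2"},
          {"customer": "c2", "product": "src-2"}]
pipeline = convert_product_keys(
    validate_keys(source, {"c1", "c2"}), {"src-1": 101, "src-2": 102})
sales_table = list(pipeline)       # "insert into sales warehouse table"
```

Contrast this with the multistage sketch: no per-step checkpoint exists, which is why the pipelined flow is faster but harder to restart mid-way.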
Staging
During the ETL process, it should be possible to restart, if need be, some of the phases independently of the others. For example, if the transformation process fails, the process should not restart from the Extract step again. Only the failed step should be rectified and restarted. This
can be ensured by implementing proper staging. Staging means that the data is dumped to a special location called the Data Warehousing Staging Area (DSA), so that it is unaffected by any failed step. It can be read by the next processing phase independently. This area is also frequently used during the ETL process to store intermediate results of processing. However, only the ETL process can access the staging area.
It should not be made available to anyone outside the process, such as the end users as it is not
intended for data presentation to end users, as shown in Figure 5.12:
[Figure: data from source part tables (S1.PARTS, S2.PARTS) is transferred via FTP into the DSA, where keys are looked up and costs are aggregated (minimum daily cost, average monthly cost) before being loaded into the warehouse tables DW.PARTS and DW.PARTSUPP.]
FIGURE 5.12 The Data Staging Area (DSA)
DSA is a temporary location where data from source systems is copied and kept. The Data Warehousing
Architecture requires this area for timing reasons. It is not possible to extract all the data from all operational databases at the same time due to a couple of reasons: lack of hardware and network resources, geographical factors, and varying business and data processing cycles. In short, all required data must be accumulated
and made available to the process before data can be integrated into the data warehouse. For example, the daily extraction of sales data might be reasonable; however, the daily extraction of financial data that requires a month-end reconciliation process might not be suitable. Similarly, it might be feasible to extract "customer" data from a database in Bangkok at noon Eastern Standard Time, but this would not be feasible for "customer" data in a New York database.
Data in the staging area can be either persistent (i.e., it remains in the DSA for a long period) or transient (i.e., it remains in the DSA only temporarily or for a very short period). Not all businesses require a DSA. Many businesses find it more feasible to use ETL to copy data directly from operational databases into the data warehouse rather than maintaining a staging area.
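The restart property that staging buys can be sketched as follows. The phase names and the in-memory DSA dictionary are invented; a real DSA would of course be files or database tables.

```python
# Sketch of restartable ETL phases: each phase reads its input from the DSA
# and writes its output there, so a failure in one phase does not force the
# process back to the Extract step.
dsa = {}   # stand-in for the Data warehousing Staging Area

def run_phase(name, depends_on, fn):
    if name in dsa:                      # already completed: skip on restart
        return dsa[name]
    dsa[name] = fn(dsa.get(depends_on))
    return dsa[name]

run_phase("extracted", None, lambda _: [1, 2, 3])
try:
    # A buggy transform fails part-way; the extracted data in the DSA survives.
    run_phase("transformed", "extracted", lambda rows: [r / 0 for r in rows])
except ZeroDivisionError:
    pass
# Restart: only the failed phase reruns, with the corrected logic.
result = run_phase("transformed", "extracted", lambda rows: [r * 10 for r in rows])
```

Because the failed transform never wrote its output, the restart picks up the preserved extract instead of re-extracting from the sources.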
Loading
Loading is the third step in the ETL process, and is relatively simpler than the other two. During the loading process, it is ensured that the loading of data is completed correctly and with as few resources as possible. To increase the efficiency of the load process, it is desirable to disable any constraints and indexes before starting the load and to re-enable them only after the load completes. To ensure consistency, referential integrity is maintained by the ETL tool.
The following mechanisms are used for loading a data warehouse:
• Loading a Data Warehouse with SQL*Loader
Cleaning
Data cleaning, also called data scrubbing, involves detecting and removing errors and inconsistencies
from data in order to improve its quality. The cleaning step is one of the most important steps as
it ensures the quality of the data in the data warehouse and helps in integration of heterogeneous
data sources.
Data quality problems arise in single data collections as well as when multiple data sources are involved. In a single data source, such as a file or database, the need for data cleaning arises due to misspellings made while entering the data, missing useful data and other invalid data.
When multiple data sources need to be integrated, e.g. in data warehouses, federated database
systems or global Web-based information systems, the need for data cleaning increases manifold. In
this case, data quality problem arises because the sources often contain redundant data in different
forms. In order to provide access to clean, accurate and consistent data, consolidation of different
data forms and elimination of duplicate data becomes necessary.
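These cleaning operations can be illustrated with a toy example: fixing a known misspelling, dropping records with missing mandatory fields, and eliminating duplicates that arrive from different sources in different forms. All records and the correction table below are invented.

```python
# A small illustration of data cleaning (data scrubbing).
corrections = {"Bangalor": "Bangalore"}   # known misspellings and their fixes

def clean(records):
    seen, out = set(), []
    for r in records:
        if not r.get("name"):                         # drop missing useful data
            continue
        city = corrections.get(r["city"], r["city"])  # correct misspellings
        key = (r["name"].strip().lower(), city)       # normalize for dedup
        if key in seen:                               # eliminate duplicates
            continue
        seen.add(key)
        out.append({"name": r["name"].strip(), "city": city})
    return out

raw = [{"name": "Asha ", "city": "Bangalor"},
       {"name": "asha", "city": "Bangalore"},   # same person, other source
       {"name": "", "city": "Delhi"}]           # missing mandatory field
cleaned = clean(raw)
```

The second record is recognized as a duplicate of the first only after both are normalized, which is why consolidation of different data forms precedes duplicate elimination.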
Exhibit 3
A health benefits company operating in the United States serves more than 11.9 million customers.
The company offers several specialty products, including group life and disability insurance
benefits, pharmacy benefit management and dental, vision and behavioral health benefits services.
The company had a significantly large database, which was consolidated from various systems
using an ETL (Extract, Transform and Load) tool. This tool would mine data from a database and
then transform and store the mined data in a data warehouse. The company wanted to outsource
the reengineering of the data warehouse and partnered with Infosys to build a modular, scalable
architecture so that the processing batch time could be reduced considerably. Infosys used its
Global Delivery Model and its team of nine personnel completed the project in 16 months. The
initial task concerned a detailed study of the existing ETL model to identify the bottlenecks. Once
this was completed, the Infosys team determined areas of improvement. These changes were
implemented in a short duration, resulting in an ETL that processed information at a faster pace.
The new ETL delivered many benefits to the client including:
• The new batch processes ran significantly faster than before, thanks to the tightly integrated code. The speed-up ensured a 70% time reduction in batch processes and enhanced the efficiency of the client's processing capabilities.
• Flexibility was also improved, and the client could easily add newer data from sources that were not initially supported. This improved the capabilities of the data warehousing solution.
• As part of the improvement process, the client's data warehouse solution started generating more useful output, and this helped the company take key business decisions with high-quality data.
• The continuous enhancements being carried out to the production environment, by automating the load processes and adding reconciliation and automated balancing processes, are helping the client improve the satisfaction of its customers.
• The Infosys Global Delivery Model (GDM) lowered costs drastically and helped the company focus its IT budget savings on more important tasks.
Source: https://www.infosys.com/industries/healthcare/case-studies/Pages/data-warehousing.aspx
reliable. It also helps in minimizing the hazard of data loss in production. However, some challenges arise during ETL testing of data. Some of these challenges are shown in Figure 5.13:
• Data might get lost during the ETL process
• ETL testing has to verify the data on both the source and destination systems
• ETL testing has to verify the NOT NULL condition of the field values
FIGURE 5.13 Challenges in ETL Testing
Summary
In this chapter, you learned about data warehousing and the need for it. The chapter discussed the importance of and increased demand for strategic information among business experts. Next, it explained the inability of past decision support systems and the differences between operational and decision support systems. It then elaborated the benefits and features of data warehousing. Then, you learned about data warehouse architecture, data marts, and data warehouse design techniques. The chapter also discussed the ETL process and its various stages: extraction, transformation and loading. Next, it explained various data extraction mechanisms in detail. You were also familiarized with the various steps involved in the data transformation process. Finally, the chapter explained the cleaning process, which is an essential part of the ETL process.
Exercise
Multiple-Choice Questions
Q1. Which of the following features usually applies to data in a data warehouse?
a. Data is often deleted.
b. Data is rarely deleted.
c. Most applications consist of transactions.
d. Relatively few records are processed by applications.
Q2. Which of the following statements is true?
a. Data warehouse consists of data marts and operational data.
b. Data warehouse is used as a source for the operational data.
c. Operational data is used as a source for the data warehouse.
d. All of the above
Q3. The process of removing the deficiencies and loopholes in the data is called as _________.
a. Cleaning up of data
b. Extracting of data
c. Aggregation of data
d. Loading of data
Q4. An operational system is _________.
a. a system that is used to run the business in real time and is based on historical data
b. a system that is used to run the business in real time and is based on current data
c. a system that is used to support decision making and is based on current data
d. a system that is used to support decision making and is based on historical data
Q5. The extract process is _________ .
a. capturing all the data contained in various operational systems
b. capturing a subset of the data contained in various operational systems
Assignment
Q1. What is metadata? How is shared metadata defined?
Q2. What is the need for the classification of metadata?
Q3. Dreamkart is an online book store. For annual sales analysis, the company wants to categorize
its customers according to their expenditure and shopping personality trait. Some people
spend $500 and visit the site 4 times a month. Some spend $300 and visit 3 times a month.
Some spend $700 and visit 7 times a month. While designing classification schema, how many
groups can be made with respect to expenditure and number of visits per month?
Q4. What can be the consequences if we remove the data staging layer from the data warehouse
system and skip directly to data storage layer?
Q5. How can two or more data marts be combined?
Q6. Why are physical extraction methods used?
Q7. We have a static data mart, where data changes every 4 months, except date and time. Is the
change in data capture an efficient extraction implementation?
Q8. What are some alternate techniques for implementing a change capture on database source
systems?
Q9. What are the challenges faced during multistage data transformation?
Q10. What makes pipelined data transformation significantly faster than multistage data transformation?
Q11. Assume a case where a data transformation fails before completion. The database engineer observes that the transformation has restarted from the Extract step again. Which procedure do you think has not been implemented properly?
Q12. What measures can be taken to improve the efficiency of load process? Discuss the
mechanisms used for loading data warehouse.
Q13. What is data mart? How is it different from data warehouse?
References
http://www.dataintegration.info/etl
https://www.springpeople.com/blog/data-warehousing-essentials-what-is-etl-tool-what-are-its-
benefits/
https://intellipaat.com/tutorial/data-warehouse-tutorial/what-is-etl/
https://www.talend.com/solutions/etl-analytics/
https://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
CASE STUDY
REVAMPING THE DATA WAREHOUSE OF A HEALTH BENEFITS COMPANY
This Case Study discusses how Infosys helped XYZ, a health benefits company, in reengineering its data
warehouse.
The company had consolidated data from various systems using some Extract, Transform and Load
(ETL) tool. This resulted in the creation of a large database by mining, transforming and storing the
mined data in a warehouse. XYZ wanted to build a data warehouse that is modular and has scalable
architecture in order to reduce the processing batch time. For this, they outsourced the task of reengineering the data warehouse to Infosys.
Challenges
Picking up a database and reengineering it to improve efficiency is a difficult task. The challenges
faced by Infosys during this task were as follows:
• ETL mappings done by XYZ previously were inefficient, and performance was reduced during the transformation operation. To remedy this, Infosys had to optimize the performance by reengineering the whole process.
• The then-existing code was poor and difficult to manage. The engineers of Infosys had to rework the code to make it manageable and maintainable.
• The previous applications took a long time to run because of the large amount of data involved. Therefore, Infosys wanted to significantly improve the performance of the operations.
Solution
The reengineering task was now the responsibility of Infosys. In this task, time and cost were the two
critical factors. Infosys decided to use its Global Delivery Model and a team of nine personnel. They
completed the project in 1 year and 4 months. Initially, the team studied the previous ETL process in
order to identify the bottlenecks. After this activity, the team identified the areas of improvement.
All these changes led to the creation of an ETL that processed the information at a great speed.
After fixing the speed, Infosys team started working on the maintainability of the ETL by making
the ETL process concurrent and scalable. Scalability was important in order to ensure that XYZ could
ramp up storage and processing capabilities of the data warehouse (if required) in future. The last
activity that was undertaken by the team was to take up the responsibility of maintaining XYZ’s data
warehouse so that XYZ's other internal resources could be freed to undertake other tasks. This was introduced as an ongoing engagement, and Infosys has been managing it without dispute.
Benefits
The benefits accrued to XYZ as a result of the new ETL are as follows:
• A new, tightly integrated ETL code was introduced, which enabled a time reduction of 70% in running the batch processes.
• XYZ's processing capabilities were enhanced.
• XYZ could now load data from newer data sources which were not supported initially. This increased the flexibility of the new data warehouse solution.
• XYZ started using the improved data warehouse and generating more useful and quality data. This data was used by XYZ in making certain important business decisions.
• Continuous enhancements are being made in the production environment by automating the load processes and by adding reconciliation and automated balancing processes. These improvements help XYZ in improving its customer satisfaction.
Overall, the data warehouse solution helped XYZ in lowering its costs.
Questions
1. Why did XYZ feel the need for reengineering its data warehouse?
(Hint: XYZ wanted to reengineer its data warehouse in order to improve efficiency and to
make its existing codes more manageable and maintainable.)
2. What kinds of benefits were realized by XYZ after Infosys reengineered its data warehouse?
(Hint: Time reduction of 70% in running the batch processes.)
CHAPTER 6
Machine Learning
Topics Discussed
Introduction
Meaning of Machine Learning
Types of Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning Algorithms
Decision Trees
Linear Regression
Logistic Regression
Naive Bayes
K-nearest Neighbors (KNN)
Unsupervised Learning Algorithms
K-Means
PCA
Association Rule Mining
Applications of Machine Learning in Business
INTRODUCTION
With the advent of computing and ever-evolving technology, humans started exploring different frontiers to make their work easier and more efficient. Moreover, the importance of data, and of processing it into information, demanded proper analysis and decision-making. Humans started developing algorithms and programs to analyze big chunks of data with ease and take proper decisions based on the inferences. There came a need to innovate a process where the
computing systems start learning on their own based on the data and processing results without
any human intervention. This process is known as Machine Learning. We are well aware of the
fact that machines can outperform humans when it comes to scientific calculations and numerical
processing. Thus, the need arose where machines were programmed in such a manner to learn from
C
their previous data processing experiences and progress further so as to make it easy for the human
to make proper decisions and increase profitability.
In this chapter, you will learn about the meaning of machine learning and its different types. You will
also learn about the different approaches that can be followed to implement it. At the end, you will
learn about the applications of machine learning in business.
“A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with the
experience E.”
The important thing here is that you now have a set of objects to define machine learning:
• Task (T), either one or more
• Experience (E)
• Performance (P)
So, with a computer running a set of tasks, the experience should lead to performance increases. Machine Learning is a branch of Artificial Intelligence. Using computers and
programming languages, we design systems which can learn from the data by the process of
training. Such systems might learn and improve with experience, and with time, refine a model
which then can be used to predict outcomes of problems based on previous learning.
For example, your email program checks which emails you mark as spam and which are not,
and on the basis of that, it learns how to perform better in filtering spam. In order to understand
this in terms of machine learning, we will use Tom M. Mitchell’s theory, in which we need to
identify the Task (T), Experience (E) and Performance (P):
• Classifying emails as spam or not spam. -------------------------------------------------------- (T)
• Watching you label emails as spam or not spam. -------------------------------------------- (E)
• The number (or fraction) of emails correctly classified as spam or not spam. ---- (P)
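The performance measure P from this spam example is just the fraction of labels the program gets right. A minimal sketch, with invented labels:

```python
# Computing the performance measure P: the fraction of emails classified
# correctly, given the user's labels (E) and the program's output on task T.
def performance(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

actual    = ["spam", "ham", "spam", "ham"]   # labels supplied by the user (E)
predicted = ["spam", "ham", "ham", "ham"]    # the filter's output (T)
p = performance(predicted, actual)
```

As the filter watches more labeled emails (more E), P should rise, which is exactly Mitchell's definition of learning.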
Some prominent uses of machine learning are as follows:
• Classifying mails as spam or not spam
• Classifying customer groups so as to reach out using proper advertisement channels
• Weather forecasting
• Fraud detection
• Predicting the estimates related to natural or financial disasters
• Predicting a sports game/election result
• Programming algorithms to automate business processes
Exhibit 1
Pinterest – Improved Content Discovery
In 2015, Pinterest acquired a machine learning company, Kosei, which was specialized in the
commercial applications of machine learning, specifically in content discovery and recommendation
algorithms. Nowadays, machine learning touches virtually every aspect of Pinterest's business operations, from spam moderation and content discovery to advertising monetization and reducing the churn of email newsletter subscribers.
[Figure: the three types of machine learning with examples — supervised learning, unsupervised learning (e.g., estimating life expectancy), and reinforcement learning (e.g., robot navigation, skill acquisition).]
FIGURE 6.1 Types of Machine Learning Algorithms
Source: https://www.datasciencecentral.com/profiles/blogs/types-of-machine-learning-algorithms-in-one-picture
Supervised Learning
Supervised learning21 refers to learning in which the system is given data for which the desired output is already known. In other words, the data set is properly labelled for supervised learning. The system gets a data set which consists of the right answers corresponding to the given data points. The system then
needs to predict the result of some other data point which is not given in the set. This technique is
also termed as Regression where we predict the outcome corresponding to some data point which
is not residing in the data set, and in order to calculate it, we need to have a data set with properly
labelled data points, i.e., the correct answers.
Let's consider an example to understand what supervised learning is. Suppose you have a data set containing the heights and weights of 10 students. For instance, a student with a weight of 60 has a height of 180, and so on. Now, supervised learning can be used here by employing regression and then predicting, say, the height of a student who weighs 70.
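The height/weight example can be sketched with ordinary least squares, one common way to fit such a regression. The ten (weight, height) pairs below are made up, not taken from the text.

```python
# Simple linear regression fitted by ordinary least squares, then used to
# predict the height of a student weighing 70. Data is invented.
weights = [50, 55, 58, 60, 62, 65, 68, 72, 75, 80]
heights = [160, 165, 170, 180, 172, 175, 178, 182, 185, 190]

n = len(weights)
mean_w = sum(weights) / n
mean_h = sum(heights) / n
# Least-squares slope: covariance of (w, h) over variance of w.
slope = sum((w - mean_w) * (h - mean_h) for w, h in zip(weights, heights)) \
        / sum((w - mean_w) ** 2 for w in weights)
intercept = mean_h - slope * mean_w

predicted_height = slope * 70 + intercept   # prediction for an unseen point
```

The labelled pairs are the "right answers"; the fitted line is the model used to answer the query for a weight not present in the data set.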
Unsupervised Learning
On the other side of the coin comes unsupervised learning, where the system trains itself to accomplish the task without any human intervention, i.e., the system running an unsupervised
learning algorithm tries to find hidden information within the data set. One example of such learning
is clustering, where the algorithm creates clusters (groups) based on some predefined criteria so as to analyze the data more effectively. Here, the technique known as dimensionality reduction also comes into play, where a large data set is reduced to a smaller one so that analysis can be performed easily and effectively. The main categories of this kind of learning are as follows:
• Clustering: Grouping of variables into clusters according to some defined criteria. Further analysis is then performed on these clusters.
• Dimensionality Reduction: If the input data has high dimensionality, it becomes necessary to remove unwanted or redundant data.
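Clustering can be illustrated with a bare-bones k-means loop on invented one-dimensional data. This is a sketch of the idea, not a production algorithm.

```python
# Minimal k-means: assign each point to its nearest centroid, move each
# centroid to the mean of its group, and repeat until stable.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        groups = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centroids = sorted(sum(g) / len(g) for g in groups.values() if g)
    return centroids

# Two obvious groups around 1 and 10; no labels are supplied anywhere.
points = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
centroids = kmeans(points, [0.0, 5.0])
```

Note that nothing in the data says which group is "right"; the structure is discovered from the data alone, which is the defining property of unsupervised learning.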
Note
There is a semi-supervised learning also, which combines both labelled and unlabelled examples to
generate an appropriate function or classifier.
Reinforcement Learning
Reinforcement learning22 is a type of machine learning algorithm which enables machines to maximize their performance by identifying the ideal solution based on some conditions. It is a reward-based system in which the machine discovers the best action based on high-yielding rewards. One classical
example of reinforcement learning is the 'Tower of Hanoi', where we have 3 towers and some circular disks of different sizes stacked on one another, largest at the bottom, on the leftmost tower. The objective is to transfer all the disks from the leftmost tower to the rightmost tower using the least number of moves, provided no bigger disk can ever be put on a smaller disk.
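The Tower of Hanoi puzzle has a well-known recursive solution requiring 2^n − 1 moves for n disks; a reinforcement learner would have to discover this optimal sequence through trial and reward rather than being told the rule. A sketch of the solver itself:

```python
# Recursive Tower of Hanoi: move n disks from source to target using the
# spare peg, recording each move. The optimal solution has 2**n - 1 moves,
# which is the sequence an RL agent would be rewarded for finding.
def hanoi(n, source, target, spare, moves):
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the way
    moves.append((source, target))               # move the n-th (largest) disk
    hanoi(n - 1, spare, target, source, moves)   # restack on top of it

moves = []
hanoi(3, "left", "right", "middle", moves)
```

For 3 disks the solver emits the 7-move optimal sequence; an RL agent exploring moves at random would converge on the same sequence as high-reward actions are reinforced.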
Knowledge Check 6.2
1. What are the different types of machine learning? Explore different categories of algorithms
under each type.
2. John works as a developer in a software development company. He is assigned the task
of developing an application that can learn from analyzed data. If the output of the
application depends on a sequence of actions, which of the following types of machine
Exhibit 2
Machine learning algorithms – Increasing productivity in retail market
Unlike e-commerce stores, offline retail markets don’t have the technical advantage for saving
data containing preferences and choices for every customer they serve. These retailers have to
rely on their store staff for information and insight. Now, with the advent of machine learning,
companies are harnessing its power for providing effective business plan and boosted product
sales using customer queries and feedbacks. The technology takes data input such as consumers’
past behavior, their response to a particular campaign, their purchasing trends, and suggests a
time slot and product which needs to be marketed. Major textile manufacturers such as Arvind
Textiles have already started using these machine learning-driven technologies and have seen
twice the customer response rate as compared to traditional marketing. Technology companies
are also coming with products like store sense systems, which give an overall idea about what is
happening in the store. These systems provide insights into each customer's behavior pattern in a store based on their interactions with the store employees to understand their requirements,
category preferences, and readiness to purchase the product. These systems also provide detailed
store visitor analytics, including visitor demographics, visitor-fashion profiling to help brands
optimize their store operations, to increase sales and conversion from each store.
Let's discuss each of these algorithms in detail.
Decision Trees
A decision tree is a supervised learning technique. This technique uses a graphical representation to visualize all the possible outcomes of a set of decisions. The algorithm uses a tree representation comprising a root node and child nodes along with leaf nodes. The root represents the main decision or condition, and the alternative solutions that follow from it are its branches (child nodes). This structure helps identify all the alternatives, so the decision-making process becomes easier and more effective. Decision trees [23] help in classifying the different paths related to a problem, making it faster to arrive at the best decision. Decision trees can be used when the analyst wants to make sure all paths related to a condition are well checked and analyzed based on their reward, depending on the problem. Figure 6.2 shows a decision tree about animal-related information:
[Figure 6.2: A decision tree for guessing an animal. Yes/no questions such as "Is it very big?" branch until a leaf is reached, e.g. "Guess rhino" or "Guess hippo".]
Now, let's learn how to plot a decision tree using the R language. For this, you need to first install the 'party' package, as it provides the ctree() function that lets us build and plot the decision tree. We use the default dataset 'iris' and plot the decision tree. Now, consider the following code snippet in R:
Machine Learning
install.packages("party")
library("party")
data(iris)
str(iris)
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
print(iris_ctree)
plot(iris_ctree)
plot(iris_ctree, type = "simple")
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor","virginica": 1 1 1 1 1 1 1 1 1 1 ...
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
> print(iris_ctree)

	Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46
> plot(iris_ctree)
> plot(iris_ctree, type = "simple")
[Figure 6.3: The plotted conditional inference tree. Panel (a), produced by plot(iris_ctree), shows the full tree with splits on Petal.Length (<= 1.9 / > 1.9, p < 0.001), Petal.Width (<= 1.7 / > 1.7, p < 0.001) and Petal.Length (<= 4.8 / > 4.8, p < 0.001), with bar charts of the species distribution at terminal nodes 2 (n = 50), 5 (n = 46), 6 (n = 8) and 7 (n = 46). Panel (b), produced by plot(iris_ctree, type = "simple"), shows the same tree with class-probability vectors at the nodes, e.g. node 2: n = 50, y = (1, 0, 0); node 5: n = 46, y = (0, 0.978, 0.022); node 6: n = 8, y = (0, 0.5, 0.5); node 7: n = 46, y = (0, 0.022, 0.978).]
Linear Regression
Figure 6.4 shows the plotted x and y values for a dataset. The goal is to fit a line that is nearest to most of the points, which reduces the distance (error) between the y value of each data point and the line.
[Figure 6.4: A scatter plot of the data with a candidate regression line; the error for a point is the vertical distance between the point's y value and the line's y value at the same x.]
# values of weight (the response, y)
# 46, 87, 98, 48, 78, 55, 87, 67, 75, 88
# values of height (the predictor, x)
# 161, 185, 172, 168, 141, 158, 169, 185, 190, 176
# syntax of the lm() function for linear regression in R:
# lm(formula, data)
# parameters used:
# formula: a symbol representing the relation between x and y
# data: the data on which the formula will be applied
x <- c(161, 185, 172, 168, 141, 158, 169, 185, 190, 176)
y <- c(46, 87, 98, 48, 78, 55, 87, 67, 75, 88)
# applying lm():
relation <- lm(y ~ x)
print(relation)
# getting the summary of the relation:
print(summary(relation))
# predict() function for linear regression:
# predict(object, newdata)
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
 -24.070  -12.990    2.454   14.267      ...

Coefficients:
                24.602 ...

[Figure 6.5: Scatter plot of the data points with the fitted regression line.]
Logistic Regression
The solutions acquired through linear regression are continuous in nature (e.g., weights in kg), whereas the solutions acquired via logistic regression are discrete values (yes/no or 0/1). The best application of logistic regression is when we are dealing with only two sets of values, i.e., binary outcomes (either 0 or 1).
Example
Will a tossed coin give heads or tails? If the outcome is heads, it is represented as 1; otherwise as 0. Logistic regression gets its name from the transformation function used in it, i.e., the logistic function h(x) = 1/(1 + e^(-x)), which is an S-shaped curve. Figure 6.6 shows the S-shaped curve:
[Figure 6.6: The S-shaped logistic curve: h(x) plotted for x from -8 to 8, rising from 0 toward 1 and passing through 0.5 at x = 0. Source: analyticsvidhya]
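The values along the curve are easy to verify numerically. A small illustrative sketch (Python is used here purely for the arithmetic; the function itself is language-independent):

```python
import math

def h(x):
    """Logistic (sigmoid) function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(h(0))           # exactly 0.5 at the midpoint of the S-curve
print(h(8) > 0.99)    # far right of the curve: practically 1
print(h(-8) < 0.01)   # far left of the curve: practically 0
```

Thresholding h(x) at 0.5 is what turns the continuous curve into a discrete 0/1 prediction.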
For logistic regression, we are going to use the built-in dataset mtcars (available in R Studio), which comprises different models of cars along with their engine descriptions. We are going to program a logistic regression model between the different engine descriptions, as shown in the following code snippet:
where,
P(h|d) = Posterior probability: the probability of hypothesis h being true, given the data d, where
P(h|d) = P(d1|h) P(d2|h) ... P(dn|h) P(h) / P(d)
P(d|h) = Likelihood: the probability of the data d given that hypothesis h was true
P(h) = Class prior probability: the probability of hypothesis h being true (irrespective of the data)
P(d) = Predictor prior probability: the probability of the data (irrespective of the hypothesis)
This algorithm is called ‘naive’ because it assumes that all the variables are independent of each
other, which is a naive assumption to make in real-world examples. Using Naive Bayes to predict the
status of ‘play’ using the variable ‘weather’ is shown in Table 6.1.
Weather     Play
Sunny       No
Overcast    Yes
Rainy       Yes
Sunny       Yes
Sunny       Yes
Overcast    Yes
Rainy       No
Rainy       No
Sunny       Yes
Rainy       Yes
Sunny       No
Overcast    Yes
Overcast    Yes
Rainy       No
We can download and use a package named 'e1071' in R Studio to work with Naive Bayes. As an example, we are going to use the Titanic dataset and find out which passengers survived using Naive Bayes, based on the columns Class, Sex and Age given in the dataset. Bayes' theorem is based on conditional probability and uses the formula:
P(A | B) = P(A) * P(B | A) / P(B)
This conditional probability comes from the multiplication of events. Using the general multiplication rule, P(A AND B) = P(A) * P(B | A), we can obtain the value of the conditional probability: P(B | A) = P(A AND B) / P(A), which is a variation of Bayes' theorem. Since P(A AND B) also equals P(B) * P(A | B), we can substitute it and get back the original formula:
P(B | A) = P(B) * P(A | B) / P(A)
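To make the formula concrete, we can apply it to the weather/play data of Table 6.1 and compute P(Yes | Sunny). A worked sketch (Python, for illustration only; the chapter's R code does the same kind of computation on the Titanic data):

```python
# Bayes' theorem applied to the weather/play data of Table 6.1:
# P(Yes | Sunny) = P(Yes) * P(Sunny | Yes) / P(Sunny)
data = [("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"),
        ("Sunny", "Yes"), ("Overcast", "Yes"), ("Rainy", "No"), ("Rainy", "No"),
        ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

n = len(data)
p_yes = sum(1 for _, play in data if play == "Yes") / n              # 9/14
p_sunny = sum(1 for w, _ in data if w == "Sunny") / n                # 5/14
p_sunny_given_yes = (sum(1 for w, p in data if w == "Sunny" and p == "Yes")
                     / sum(1 for _, p in data if p == "Yes"))        # 3/9

p_yes_given_sunny = p_yes * p_sunny_given_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6: on a sunny day, play is more likely than not
```

The Naive Bayes classifier does exactly this, multiplying one such likelihood per feature under the independence assumption.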
Now, consider the following code snippet in R studio to implement Naive Bayes:
library(e1071)
data("Titanic")
Titanic_df <- as.data.frame(Titanic)
# expand the aggregated table so that each passenger becomes one row
repeating_sequence <- rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq)
Titanic_dataset <- Titanic_df[repeating_sequence, ]
Titanic_dataset$Freq = NULL
# Fitting the Naive Bayes model
Naive_Bayes_Model = naiveBayes(Survived ~ ., data = Titanic_dataset)
# What does the model say? Print the model summary
Naive_Bayes_Model
# Prediction on the dataset
NB_Predictions = predict(Naive_Bayes_Model, Titanic_dataset)
# Confusion matrix to check accuracy
table(NB_Predictions, Titanic_dataset$Survived)
Naïve Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x=X,y=Y,laplace=laplace)
A-priori probabilities:
Y
No Yes
0.676965 0.323035
Conditional probabilities:
          Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159
Sex
Y Male Female
No 0.91543624 0.08456376
Yes 0.51617440 0.48382560
Age
Y Child Adult
No 0.03489933 0.96510067
Yes 0.08016878 0.91983122
NB_Predictions No Yes
No 1364 362
Yes 126 349
K-Nearest Neighbors (KNN)
The KNN algorithm classifies new cases based on measures of distance. The value of k is user-specified. For example, in pattern recognition, if we need to classify a new case, i.e., produce the output, KNN finds the nearest instances based on the neighbor distances. Consider the following code snippet in R, in which we will use KNN to predict the stock market price, i.e., whether it will increase or not, based on the neighboring instances of prices:
library(class)
library(dplyr)
library(lubridate)
library(readr)
stocks <- read.csv(":/location/stocks.csv", header = TRUE)
View(stocks)
set.seed(10)
stocks$Date <- ymd(stocks$Date)
stocksTrain <- year(stocks$Date) < 2014
predictors <- cbind(lag(stocks$Apple, default = 210.73),
                    lag(stocks$Google, default = 619.98),
                    lag(stocks$MSFT, default = 30.48))
prediction <- knn(predictors[stocksTrain, ], predictors[!stocksTrain, ],
                  stocks$Increase[stocksTrain], k = 1)
table(prediction, stocks$Increase[!stocksTrain])
mean(prediction == stocks$Increase[!stocksTrain])
accuracy <- rep(0, 10)
k <- 1:10
for (x in k) {
  prediction <- knn(predictors[stocksTrain, ], predictors[!stocksTrain, ],
                    stocks$Increase[stocksTrain], k = x)
  accuracy[x] <- mean(prediction == stocks$Increase[!stocksTrain])
}
plot(k, accuracy, type = 'b')
prediction FALSE TRUE
FALSE 29 32
TRUE 192 202
>mean(prediction==stocks$Increase[!stocksTrain])
[1] 0.5076923
The graph on the basis of accuracy and k value is shown in Figure 6.7:
FIGURE 6.7 Graph between Accuracy and K-value (accuracy, roughly 0.505 to 0.525 on the y-axis, plotted against k from 1 to 10 on the x-axis)
Note
You can download the dataset from the link below:
http://datascienceplus.com/wp-content/uploads/2015/10/stocks.csv
K-Means
K-means algorithms [24] come into play when we need to create clusters from data points that have some sort of relevance between them. K-means is an iterative (repetitive) approach that alternates two things: a cluster assignment step and a moving-centroid step. In order to do so, we follow these steps:
1. We randomly pick points which we mark as cluster centroids. If you want 2 clusters, then you need 2 cluster centroids.
2. Every data point is then examined and assigned to a centroid based on distance. For instance, if we have 2 cluster centroids, then the data points will be analyzed and categorised under those 2 centroids. This is known as the cluster assignment step.
3. The next step is the moving-centroid step, where each centroid is moved to the average (mean) of the points assigned to its cluster.
4. Now repeat steps 2 and 3, which makes the process iterative. Keep repeating the steps until the points converge and you can distinctively identify the clusters; in this case, until you get 2 clusters.
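The alternating steps above can be sketched from scratch. A minimal illustrative implementation (Python; the chapter's R example instead relies on the built-in kmeans() function, so this sketch only exposes the mechanics):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: alternate the cluster assignment step
    and the moving-centroid step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    for _ in range(iters):
        # step 2 (cluster assignment): attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # step 3 (moving the centroid): recompute each centroid as its cluster's mean
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# two well-separated blobs of 3 points each should end up as two clusters of 3
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
centroids, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

In practice one repeats until the assignments stop changing rather than for a fixed number of iterations.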
Figure 6.8 shows steps of K-means algorithms:
FIGURE 6.8 Steps of K-Means Algorithms
As an example, we are going to create clusters with K-means using the iris dataset available in R, as follows:
library(datasets)
head(iris)
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
table(irisCluster$cluster, iris$Species)
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()
K-means clustering with 3 clusters of sizes 50, 52, 48

Cluster means:
  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500
Clustering vector:
  [1] 1 1 1 ... 1   (observations 1-50: all in cluster 1)
 [51] 2 2 2 ... 2   (observations 51-100: cluster 2, except two observations in cluster 3)
[101] 3 3 3 ... 3   (observations 101-150: cluster 3, except four observations in cluster 2)
Within cluster sum of squares by cluster:
[1]  2.02200 13.05769 16.29167
 (between_SS / total_SS = 94.3%)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
> table(irisCluster$cluster, iris$Species)

    setosa versicolor virginica
  1     50          0         0
  2      0         48         4
  3      0          2        46
Figure 6.9 shows the graph between Petal.Length and Petal.Width:
[Figure 6.9: Two scatter plots of Petal.Width (0 to 2.5) against Petal.Length (2 to 6). In the first, points are colored by Species (setosa, versicolor, virginica); in the second, by irisCluster$cluster (1, 2, 3). The two colorings match almost perfectly, showing that the clusters recover the species.]
PCA
Dimensionality reduction problems can be solved using Principal Component Analysis (PCA). Let's try to understand what PCA does. Say, for instance, we have a dataset with 2 dimensions, x and y, and we need to reduce the dimensions from 2 to 1. The objective of PCA is to find a lower-dimensional surface (direction) onto which the data can be projected so that the projection error is minimized. Finding the direction involves finding a vector onto which the data should be projected, as shown in Figure 6.11:
[Figure 6.11: Two-dimensional data points (axes Gene 1 and Gene 2) projected onto a single direction vector.]
In the general case, if we have n-dimensional data and need to reduce it to k dimensions, we need to find k vectors onto which to project the data so that the projection error is minimized.
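For 2-D data, the direction can be computed directly from the covariance matrix. A minimal numerical sketch (Python, for illustration; it uses the closed-form eigenvector angle of a 2x2 symmetric matrix rather than a general eigensolver):

```python
import math

def principal_direction(points):
    """First principal component of 2-D data: the unit vector (direction)
    onto which projecting the data minimizes the projection error."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # covariance matrix [[a, b], [b, c]] of the centered data
    a = sum((x - mx) ** 2 for x, _ in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    # for a 2x2 symmetric matrix, the leading eigenvector's angle has a closed form
    theta = 0.5 * math.atan2(2 * b, a - c)
    return (math.cos(theta), math.sin(theta))

# points lying almost exactly on the line y = 2x
pts = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 9.9)]
vx, vy = principal_direction(pts)
print(round(vy / vx, 1))  # slope of the recovered direction, close to 2
```

Projecting each point onto this vector reduces the two coordinates to a single number per point, which is the 2-to-1 reduction described above.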
Association Rule Mining
Association mining, also known as Market Basket Analysis, is used to make product recommendations based on the products that are bought together at a relatively high frequency. Association mining is the basis of modern recommender systems; giants like Amazon and Netflix use association rules and analysis to recommend products to users based on the products the users frequently buy together.
Association mining is done on transaction-level data taken from a retail market, say an online e-commerce store; for this purpose, we use the Apriori algorithm to find patterns.
Association rules analysis can be interpreted as a technique to discover how components/items are linked to each other. Association rules are used by programmers and analysts to build programs that are capable of machine learning. These programs play a vital role in the analysis of store layout, catalogue design and customer behaviour prediction. The following are some common methods to measure association:
1. Support: Support can be interpreted as an indicator of how popular (frequent) an item is in the given dataset. It is calculated as the ratio of the number of transactions in which a given item appears to the total number of transactions.
2. Confidence: Confidence provides the probability that an item 'B' is also purchased whenever an item 'A' is purchased, denoted as {A -> B}. In general, it is an indicator of how often a particular rule/relation is found to be true in a given dataset.
3. Lift: Lift can be defined as the ratio of the support of the union of two elements to the product of their supports calculated independently. Continuing with our example, lift can be interpreted as the probability that item 'B' is purchased whenever item 'A' is purchased, while controlling for the overall popularity of item 'B'.
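These three measures can be computed directly from a toy transaction list. An illustrative sketch (Python; the transactions are invented for illustration and are not the Groceries dataset used later):

```python
# Hypothetical basket data (invented for illustration)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(items):
    """Fraction of transactions that contain every item in the set."""
    return sum(1 for t in transactions if items <= t) / len(transactions)

# measures for the rule {bread} -> {milk}
supp_a = support({"bread"})           # 3/5: how popular bread is
supp_b = support({"milk"})            # 4/5: how popular milk is
supp_ab = support({"bread", "milk"})  # 2/5: both bought together
confidence = supp_ab / supp_a         # P(milk | bread) = 2/3
lift = supp_ab / (supp_a * supp_b)    # 0.4 / 0.48, slightly below 1
print(round(supp_ab, 2), round(confidence, 2), round(lift, 2))
```

A lift below 1, as here, means the two items co-occur slightly less often than their individual popularities would predict.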
Apriori Algorithm
Using Apriori principles, we can reduce the number of itemsets that need to be examined. The Apriori principle states that if an itemset is not frequent, then none of its supersets can be frequent either. Putting it in terms of our transaction-set example, if cheese spread was found to be infrequent, then we can expect {bread, cheese spread} to be infrequent as well. This gives us valid reasons to exclude {bread, cheese spread}, or any itemset containing cheese spread, from the list of candidate frequent itemsets. Using these principles, the dataset can be consolidated and made relevant.
Let's consider the rule A => B in order to compute the metrics, as shown in Figure 6.12:
Source: http://r-statistics.co/Association-Mining-With-R.html
In the given metrics, lift is the factor by which the co-occurrence of A and B exceeds the probability expected if A and B co-occurred independently. The higher the lift, the higher the chance of A and B occurring together.
Let's take the Groceries dataset in R and learn the associations between different products. The following code snippet loads the arules package, which provides the Groceries dataset:
library(arules)
data(Groceries)
class(Groceries)
inspect(head(Groceries, 3))
Output of the preceding code:
The following code snippet shows how to find the most frequent items using the eclat() function. The eclat() function implements the ECLAT algorithm, which is used to mine frequent itemsets. The algorithm makes use of simple intersection operations for clustering of equivalence classes along with bottom-up lattice traversal. Here, it takes two arguments, supp and maxlen:
frequentItems <- eclat(Groceries, parameter = list(supp = 0.07, maxlen = 15))
The following code snippet shows the data of the frequentItems variable:
inspect(frequentItems)
The following code snippet plots the frequent items by using the itemFrequencyPlot() function:
itemFrequencyPlot(Groceries, topN = 10, type = "absolute", main = "Item Frequency")
Figure 6.13 shows the item frequency:
The following code shows how to get the product recommendation rules:
The following code snippet shows how to adjust the maxlen, supp and conf arguments in the
apriori() function to control the number of rules generated:
The following code snippet shows how to remove redundant rules:
The following code snippet shows how to find out what customers had purchased before buying 'Whole Milk':
The following code snippet shows how to find out what the customers who bought 'Whole Milk' also bought:
rules <- apriori(data = Groceries,
                 parameter = list(supp = 0.001, conf = 0.15, minlen = 2),
                 appearance = list(default = "rhs", lhs = "whole milk"),
                 control = list(verbose = FALSE))  # those who bought 'milk' also bought...
rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)  # 'high-confidence' rules
inspect(head(rules_conf))
As you can see in the above code, we found the product recommendation rules, and we also found associations between products of our own choice.
Explore some more examples of different algorithms of machine learning.
APPLICATIONS OF MACHINE LEARNING IN BUSINESS
Today, most of the prominent business organizations depend heavily upon machine learning algorithms for understanding their clients and identifying opportunities to generate revenue. In this section, we are going to discuss some common machine-learning applications in business:
1. Customer Experience Evaluation: The businesses require clients and customers who serve as
4. Fraud Detection: Using machine learning algorithms, we can identify insights related to financial data which may be important and can help reduce fraud. Data mining techniques can help us identify risk-prone areas which may need immediate attention, so as to limit risky outcomes and prevent likely fraud in advance.
5. Logistics: Machine learning can help us identify the latest trends and patterns being used in the logistics discipline. Routes which are more efficient and can reduce time can be found easily using data mining techniques; potential issues or risks in supply chain management can also be identified in advance. This can help increase the profitability of a logistics or transportation business.
6. Software: The user experience of a piece of software can be improved with machine learning. The system can learn user customizations and settings and improve the user experience based on prior learning. The system can recommend the best-suited options to the user based on this learning. This will ultimately increase productivity and let software vendors understand user demands and needs, which, when fulfilled, will increase profitability.
7. Spam Detection: Email is the most used communication channel. Due to ever-increasing adverts and phishers, emails pose a threat to user security if not handled properly. Spam messages are attractive advertisements which can send important user data to unauthorised parties who can misuse such critical information. To avoid such issues, users need to be extra cautious when reading emails and clicking links unknowingly. This problem has been addressed by machine learning. Several email vendors and tech giants like Google have implemented algorithms which automatically classify and mark emails that are spam and may pose a threat, making things easy for users and limiting security threats to a great degree.
8. Voice Recognition: Voice recognition is another technology which bridges the gap between a user and their machine. The reason voice recognition is evolving is that it makes machines learn user demands and patterns and act accordingly, making things easier for the user. Using just their voice, a user can communicate with a smartphone or a computer system and let the machine take care of the rest. Thanks to this technology, productivity has increased a thousandfold, and it has brought the user and the machine closer. One example of voice recognition is Apple's Siri, which can communicate with the user and perform tasks on the go. Users don't even need to type or look for anything else.
9. Online Trading: There are various online platforms where users can trade or buy stocks. Buying or trading online requires a great deal of analysis and proper decision-making. One mistake, and the user can risk or lose their hard-earned money. Machine learning can help a user do a comparative study of different stocks and trades and identify the best ones to invest in based on thorough analysis and effective decision-making. The algorithms can take previous stock data and market fluctuations into consideration and predict a sensible outcome to work from. The user can then visualize those results and take decisions accordingly. One example is Bitcoin trading, where predictive algorithms can help identify which cryptocurrency one should invest in based on past and possible future outcomes.
10. Healthcare Services: Health is the most vital subject for us. Without proper health, we cannot accomplish our missions or progress seamlessly. Machine learning algorithms are being used to help doctors and patients understand diseases better, along with ways to neutralize them. According to recent research, Google AI can predict the death of a cancer patient. Just think about all the contingencies a doctor can adapt to, so as to help their patients, based on such research. With the help of self-learning systems, one can get the results of a test or examination without actually administering drugs to humans in case they are harmful. Once the results delivered are positive and harmless, humans can adopt those methods for their well-being.
Exhibit 3
Machine Learning Used in Online Cab Booking
When you book a cab, your app might estimate the price of the ride. How is such a price estimate calculated? The answer is machine learning. Jeff Schneider, the engineering lead at Uber ATC, revealed in an interview that they use machine learning to define price-surge hours by predicting rider demand. Machine learning plays a major role in the entire cycle of the service.
Summary
This chapter discussed the fundamentals of machine learning and the different types of machine learning. Next, it explained the different types of supervised learning algorithms. The chapter also discussed unsupervised learning algorithms. Towards the end, it elucidated the applications of machine learning in business.
Exercise
Multiple-Choice Questions
Q1. Which of the following is a characteristic of the best machine learning method?
a. Fast			b. Accurate
c. Scalable		d. All of these
Q2. Machine learning is
a. the autonomous acquisition of knowledge through the use of computer programs
Q6. In which of the following types of learning does the teacher return reward and punishment to the learner?
a. Active learning b. Reinforcement learning
c. Supervised learning d. Unsupervised learning
Q7. Decision trees are appropriate for the problems where
a. attributes are both numeric and nominal
b. target function takes on a discrete number of values.
c. data may have errors
d. All of these
Q8. Which of the following is also called exploratory learning?
a. Supervised learning b. Active learning
c. Unsupervised learning d. Reinforcement learning
Assignment
Q1. List some applications of machine learning in business.
Q2. Explain different types of supervised learning algorithms.
Q3. Describe unsupervised learning algorithms in machine learning.
Q4. Explain the difference between linear regression and logistic regression.
Q5. Discuss the meaning of machine learning.
References
Mitchell, T. M., The Discipline of Machine Learning, Machine Learning Department technical report CMU-ML-06-108. Pittsburgh, PA: Carnegie Mellon University, July 2006.
Mitchell, T. M., Machine Learning. New York: McGraw-Hill, 1997.
Hastie, T., Tibshirani, R., Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.
Aleksandrov, A.D., Kolmogorov, A.N., Lavrent'ev, M.A., Mathematics: Its Content, Methods, and Meaning, Courier Corporation.
Russell, S., Norvig, P., Artificial Intelligence: A Modern Approach, Pearson.
Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P., Understanding variable importance in forests of randomized trees, NIPS Proceedings 2013.
CASE STUDY
FRAUD ANALYTICS SOLUTION HELPED IN SAVING THE WEALTH OF COMPANIES
This case study discusses how IBM's fraud analytics helped organisations in detecting frauds and saving themselves from financial losses.
In the year 2011, industries in the US were suffering a huge financial loss of approximately $80 billion annually. Issuers of credit and debit cards in the US alone suffered a whopping loss of $2.4 billion. Besides industries, financial frauds also affected individuals and could take years to resolve.
Existing fraud detection systems were not effective, as they functioned on a predefined set of rules, such as flagging ATM withdrawals above a certain amount or purchases made with a credit card outside the card holder's country. These traditional methods helped reduce the number of fraudulent cases, but not all of them. The research team at IBM decided to take the fraud detection system to the next level, so that a larger number of fraudulent financial transactions could be detected and prevented. At IBM, the team created a virtual data-detective solution using machine learning and stream computing to prevent fraudulent transactions and save industries and individuals from financial losses.
In addition to signalling a particular type of transaction, the solution also analyses transactional data to create a model for detecting fraudulent patterns. This model is then used to process and analyse a large number of financial transactions as they occur in real time, which is termed 'stream computing'.
Each transaction is allocated a fraud score, which specifies the likelihood of the transaction being fraudulent. The model is further customised according to the client's data and then upgraded periodically to cover new fraud patterns. The underlying analytics depend on statistical analysis and machine-learning methods, which allow the detection of unusual fraud patterns that could be missed by human experts.
Consider the example of a large US-based bank that used the IBM machine-learning technologies for analysing transactions of the credit cards it issued and got the result shown in the following image:
Consider another case of an online clothing retailer. If most transactions made at the retailer were fraudulent, then there is a high probability that future purchase transactions would also be fraudulent. The system is capable of gathering these historical data points and analysing them further to detect the possibility of future fraudulent attempts. In addition to preventing fraudulent attempts, the system has also cut down on false alarms after analysing the relation between suspected fraudulent transactions and actual fraud.
“The triple combination of prevention, catching more incidents of actual fraud, and reducing the
number of false positives results in maximum savings with minimal hassle. In essence, we are able to
apply complicated logic that is outside the realm of human analysis to huge quantities of streaming
data,” notes Yaara Goldschmidt, manager, Machine Learning Technologies group.
These machine-learning technologies are presently used in detecting and preventing fraud in financial transactions, including transactions related to credit cards, ATMs and e-payments.
The system is embedded within the client's infrastructure, and a machine-learning model is developed using the client's existing data to combat fraudulent transactions before they take place.
“By identifying legal transactions that have a high probability of being followed by a fraudulent
transaction, a bank can take pro-active measures—warn a card owner or require extra measures for
approving a purchase,” explains Dan Gutfreund, project technical lead.
Machine learning and stream-computing technologies are not capable of predicting the future, yet
they enable financial institutions to take effective decisions and work towards preventing frauds
before they occur.
Questions
1. What is the need for fraud analytics in the organisations?
(Hint: To prevent fraudulent transactions.)
2. What are the benefits of machine learning and stream computing for organisations?
(Hint: To identify the pattern of fraudulent transactions, raise alarms, etc.)
CHAPTER 7
Text Mining and Analytics
Topics Discussed
Introduction
Differences between Text Mining and Text Analytics
Text Mining Techniques
Sentiment Analysis
Topic Modeling
Term Frequency
Named Entity Recognition
Event Extraction
INTRODUCTION
Text mining and analytics has gained great importance in the last few years. Data has been available in textual format since the advent of writing. Today, when you type a word into any search engine on the Internet, you probably get hundreds or thousands of articles related to that word. The data available on the Internet is of large volume, but only relevant and meaningful information is extracted from that data and displayed by the search engine. Extracting meaningful information from the available text (also called a corpus) in a concise manner is referred to as text mining and analytics. We use different techniques and technologies to gain knowledge (in summarised form) from all the available text, without reading or examining it manually.
Many industries are now using text analytics to analyse the inputs or comments given by the customers they have served, to gain valuable information from those comments. This valuable information can then be used to improve the customer experience, attract more customers and increase profits. For example, people often share their travel experiences on social networking sites like Twitter or Facebook. The hotel or airline industry can analyse Twitter or Facebook data to learn the feedback customers give about them and how well they are serving their customers.
This chapter first describes the concepts of text mining and text analytics. Further, it explains the differences between text mining and text analytics. Next, it discusses the different techniques of text mining. The chapter also discusses the different types of text mining technologies. Towards the end, it discusses various applications of text analytics.
Text Analytics
Text Mining25 is the first step before analysing the text data. It involves cleaning the data so that it is ready for text analytics.
The various steps involved in the text mining process are shown in Figure 7.1:
4. Verify words in data frame: It refers to making sure that all of our words are appropriate variable names in the data frame.
On the other hand, text analytics uses techniques to infer, prescribe or predict information from the mined data.
Knowledge Check 7.1
Enlist the differences between text mining and text analytics.
Sentiment Analysis
One of the most significant and popular techniques for describing and inferring from textual data is sentiment analysis26. It is used to derive emotions from text, tweets, Facebook posts or YouTube comments. Sentiments such as good, bad, anger, neutral, anxiety, etc. are inferred from the given text. For example, how people feel about a movie, a topic or a government decision can be analysed using a sentiment analysis tool.
Topic Modeling
Suppose you are given a text of multiple pages that has no title. What is it all about? The technique used to find out is topic modeling. Consider the following output for a six-page text data set:
topic
GST   will   states   system   India
 34     27       17       16      15
The preceding output suggests that the article is about GST in India. But this is a very raw method of identification. Topic Modeling27 is a statistical approach for discovering topic(s) from a collection of text documents based on the statistics of each word. Latent Dirichlet Allocation (LDA) is one of
the most common algorithms for topic modeling. The LDA Algorithm classifies the Corpus into
Topics automatically by self-learning to assign probabilities to all terms in the corpus. Consider the
following code snippet in R:
> lda.terms <- terms(article.lda)
> lda.terms
 Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
"please"   "aaa"     "air"     "board"   "will"
The preceding output is from topic modeling of the tweet corpora of 4 airlines.
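As a hedged sketch of how an object like `article.lda` above could be produced (assuming the tm and topicmodels packages are installed; the toy corpus here stands in for real tweets):

```r
library(tm)
library(topicmodels)

# Toy corpus standing in for the tweet corpora used in the chapter
docs <- c("tax gst india states revenue system",
          "airline flight air travel board please",
          "gst tax system india states reform")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Fit an LDA model with 2 topics; seed fixed for reproducibility
article.lda <- LDA(dtm, k = 2, control = list(seed = 42))
terms(article.lda, 3)   # top 3 terms per topic
```

In real use, the corpus would be built from downloaded tweets and cleaned (lower-casing, stop-word removal) before the document-term matrix is created.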
Term Frequency
In a given corpus, how important is a word to a document? The term frequency tells us the importance of the word with respect to the total number of terms in the document. The 'Term Frequency (TF)' is usually measured along with the 'Inverse Document Frequency (IDF)' as 'TF-IDF', an abbreviation for 'Term Frequency-Inverse Document Frequency'. It is a statistical measure which tells how important a word is in the given document. Consider the following formulas, in which you can see that the importance of a term increases proportionally with the frequency of the term in the document but is counter-balanced by the frequency of the term in the corpus:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
IDF(t) = log(N / number of documents containing t), where N is the total number of documents
TF-IDF(t, d) = TF(t, d) x IDF(t)
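As a toy illustration of this arithmetic in base R (package solutions such as tm's weightTfIdf exist, but this makes the counter-balancing visible):

```r
# A minimal TF-IDF sketch on a two-document toy corpus
docs <- list(doc1 = c("data", "science", "data"),
             doc2 = c("science", "mining"))

tf  <- function(term, doc)  sum(doc == term) / length(doc)
idf <- function(term, docs) log(length(docs) /
                                sum(sapply(docs, function(d) term %in% d)))
tfidf <- function(term, doc, docs) tf(term, doc) * idf(term, docs)

tfidf("data",    docs$doc1, docs)  # term in only one document: weight ~0.46
tfidf("science", docs$doc1, docs)  # term in every document: weight 0
```

Notice that "science", which occurs in every document, gets a TF-IDF weight of zero even though it is frequent, while the rarer "data" is weighted highly.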
Named Entity Recognition
Named Entity Recognition (NER) classifies the named entities in a given corpus into predefined classes such as quantity, place and time. Consider the following sentence:
The entire 1.35 billion citizens of India are a witness to this historical event.
The preceding text contains quantity, place and time entities, as shown below:
The entire 1.35 billion[quantity] citizens of India[place] are a witness to this historical[time] event.
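The tagging above can be imitated, very crudely, with base R regular expressions. This is a toy illustration only, not a real NER system, which would use a trained model:

```r
# Toy entity tagger: mark numbers as [quantity] and one known place name
sentence <- "The entire 1.35 billion citizens of India are a witness to this historical event."
tagged <- gsub("([0-9]+\\.?[0-9]*)", "\\1[quantity]", sentence)
tagged <- gsub("\\bIndia\\b", "India[place]", tagged)
tagged
```

Hand-written rules like these break down quickly on real text; production NER relies on statistical models trained on annotated corpora.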
Event Extraction
Suppose we want information about an event that has happened, and online news has published this information as a large text. Deriving detailed and structured information about the event from this text is called event extraction. By event extraction, we identify the W's, i.e., Who, When, Where, to Whom, Why and hoW. In other words, event extraction identifies the relationships between entities. Suppose you are analysing information on a joint venture. Then we will be extracting the partners, products, place, capital and profits of the said joint venture. Another example would be identifying events from the corpus of SMS messages received on a mobile phone. These could be passed to the mobile calendar to manage upcoming events.
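The SMS-to-calendar idea can be sketched with base R pattern matching. The message text and the date/time formats below are hypothetical; real extraction would handle many more formats:

```r
# Toy event extraction: pull a date and a time out of an SMS-style message
sms <- "Reminder: project review meeting on 12/08/2018 at 10:30 AM"

date <- regmatches(sms, regexpr("[0-9]{2}/[0-9]{2}/[0-9]{4}", sms))
time <- regmatches(sms, regexpr("[0-9]{1,2}:[0-9]{2} (AM|PM)", sms))

c(date = date, time = time)   # structured fields ready for a calendar entry
```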
Now let us describe the various techniques involved in the text mining process, which are as follows:
Information Retrieval
Categorization
Clustering
> page[[2]][59:69]
(1) "https://www.transparency.org/news/pressreleases/year/2010"
(2) "https://www.transparency.org/news/pressreleases/year/2010/P10"
(3) "https://www.transparency.org/news/pressreleases/year/2010/P20"
(4) "https://www.transparency.org/news/pressreleases/year/2010/P30"
(5) "https://www.transparency.org/news/pressreleases/year/2010/P40"
(6) "https://www.transparency.org/news/pressreleases/year/2010/P50"
(7) "https://www.transparency.org/news/pressreleases/year/2010/P60"
(8) "https://www.transparency.org/news/pressreleases/year/2010/P70"
Information Extraction
Extraction of structured information from unstructured and/or semi-structured documents is known as information extraction. In most cases, this activity concerns the processing of human-language texts by means of Natural Language Processing (NLP). Information extraction is the activity by which a document is processed with automatic annotation and extraction of content from images, audio and video. The Internet Movie Database (IMDb), an online database of information about world films, TV programs, home videos and video games, is one example. Another simple example: how many times has the phrase 'x y z' appeared in the given document? Figure 7.2 shows the sentiment analysis of the comments on GST:
FIGURE 7.2 Sentiment Analysis of the Comments on GST (bar chart comparing counts of Negative and Positive comments)
Clustering
When you search for something on a web search engine, you get a huge number of documents in response to the search phrase you entered. It becomes difficult to browse them or to identify the relevant information. Here, clustering helps to group the retrieved documents into meaningful categories. This grouping is done based on the descriptors (sets of words) in each document. Clustering is an unsupervised knowledge discovery technique. One common example is hierarchical clustering, in which each data point initially forms its own cluster and then pairs with the most adjacent cluster. One can stop at any point, depending on the number of clusters required. Hierarchical clustering takes time when applied to a large dataset. Figure 7.3 shows a hierarchy of files:
FIGURE 7.3 Hierarchy of Files
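Hierarchical clustering is available in base R via hclust(). A minimal sketch on four toy 2-D points (real document clustering would use a distance matrix over term vectors instead):

```r
# Four points: two near (1, 1) and two near (5, 5)
points <- matrix(c(1, 1,  1.2, 1.1,  5, 5,  5.1, 4.9),
                 ncol = 2, byrow = TRUE)

# Build the hierarchy from pairwise distances, then cut it into 2 clusters
hc <- hclust(dist(points))
cutree(hc, k = 2)
```

As the chapter notes, one can "stop at any point": cutree(hc, k) cuts the same tree into any desired number of clusters without refitting.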
Categorization
'Categorization' refers to assigning a given document to a specific category. A common example is segregating application forms on the basis of age, discipline, class, etc. Documents may be textual, image, music files, etc. The categorization can be done on the basis of topics or of attributes such as the type of document, author, year of printing, subject, etc.
Categorization is also called 'classification' when you want to assign instances to the appropriate class from your known types. If you use Gmail for handling emails, you find folders named Primary, Promotions, Social, Updates and Forums; your emails are being categorized into these categories. Similarly, the text messages on your mobile are categorized into Notifications and others. In categorization, the analysed data set has been given some inputs and the result is a discrete/categorical variable. Figure 7.4 shows the classification done for the raw SMS data set:
FIGURE 7.4 Classification of the Raw SMS Data Set (bar chart comparing counts of Ham and Spam messages)
Note: K-means has already been explained in Chapter 6, Machine Learning.
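The ham/spam split above can be caricatured with a keyword rule in base R. The word list and messages are hypothetical; a real classifier would be trained on labelled SMS data:

```r
# A minimal keyword-based SMS classifier sketch (illustrative only)
spam_words <- c("win", "free", "prize")

classify_sms <- function(msg) {
  words <- tolower(strsplit(msg, "\\s+")[[1]])
  if (any(words %in% spam_words)) "spam" else "ham"
}

classify_sms("You win a free prize")  # "spam"
classify_sms("See you at lunch")      # "ham"
```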
Summarization
You might have seen Blinkist or Mentorbox, which provide extracts of books as short texts. These apps are classic examples of text summarization. A summary is a shorter form of text, derived from one or more texts, which conveys the important knowledge in the original document. Automatic text summarization aims to condense the original text into a shorter version while preserving its meaning. The most important advantage of using a summary is that it reduces the reading time. Text summarization methods can be classified into the following types:
Extractive summarization
Abstractive summarization
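A minimal extractive summarization sketch in base R: score each sentence by the total corpus frequency of its words and keep the highest-scoring sentence. The three sentences are a made-up toy corpus:

```r
# Score sentences by summed word frequency; extract the top sentence
text <- c("Text mining extracts information from text.",
          "The weather was pleasant.",
          "Mining text gives structured information.")

words <- tolower(unlist(strsplit(gsub("\\.", "", text), "\\s+")))
freq  <- table(words)

score <- sapply(text, function(s) {
  w <- tolower(strsplit(gsub("\\.", "", s), "\\s+")[[1]])
  sum(freq[w])
})

text[which.max(score)]   # the extracted one-sentence summary
```

Abstractive summarization, by contrast, would generate new sentences rather than select existing ones, which requires far more sophisticated language models.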
Intelligent technologies attempt to understand language more deeply. They give better analysis by extracting the useful information and knowledge in the data. Artificial intelligence is an example of an intelligent technology.
On the other hand, text analytics is based on retrieval according to user requirements. For information retrieval, the following methods are used in text analytics:
Term-based method: The term-based method analyzes the document based on semantic terms (words). This method has a computational advantage and works efficiently with theories like term weighting.
Phrase-based method: Compared to words, phrases are more meaningful and less ambiguous. Therefore, the phrase-based method is more useful in many cases. However, this method may lag in performance because phrases have inferior statistical properties compared to terms: they have a low frequency of occurrence, and large numbers of redundant and noisy phrases are present among them.
Concept-based method: The concept-based method is more effective as it analyzes text at the sentence and document level. It can distinguish between the meaningful and non-meaningful terms that describe a sentence. Text analytics techniques are mainly based on statistical analysis of terms or phrases, and statistical analysis of term frequency calculates the importance of words without taking the document into account. For example, two terms may occur with the same frequency in a given text while the meaning of one term contributes more than the meaning of the other. Hence, the terms that capture the meaning, or semantics, of the given text must be given more weightage, which is why concept-based mining is useful. The concept-based method initially analyzes the semantic structure of the sentences in the text. Then, to describe the semantic structures of the sentences, a Conceptual Ontological Graph (COG) is constructed. Then, using the standard vector space model, a feature vector is built by extracting the top concepts. Natural language processing techniques are used in the concept-based method.
Pattern taxonomy method: Patterns in the given text are discovered by data mining techniques, viz., association rule mining, frequent itemset mining, sequential mining and closed pattern mining. These patterns are then classified using the relations between them, and then deployed and refined into new patterns in the text document. The pattern-based method is better than the term-based and concept-based methods, but it is more difficult and sometimes ineffective, as some useful low-frequency long patterns may be ignored while non-useful frequent short patterns are weighted more, leading to misinterpretation and decreased performance.
Content Analysis
As a marketing head, you promote your product through advertisement. The purpose of an advertisement is to promote the use of the product by increasing awareness (advertisement) and thereby increasing volumes (sales). Here you have two documents: one is the advertisement content (cause) and the other is the viewer/audience report (effect). Content analysis uncovers causes, while audience research uncovers effects.
Content analysis is a method for summarizing any form of content by counting various aspects of that content. It is better to do analysis based on the content itself than to evaluate the impressions of a listener; for example, a reader's review of a book is evaluation, not content analysis. Content analysis uses the quantitative method: although it analyzes terms, the results are in the form of numbers and percentages. With content analysis, one might say that 15 percent of TV programs are on knowledge as compared to 10 percent last year. The counting has two objectives:
Subjectivity from summaries is removed
The content analysis has six main stages, which are as follows:
1. Selecting the content sample
2. Determining the units of analysis
3. Preparing content for coding
4. Coding the content
5. Counting and weighing
6. Drawing conclusions
For print media, you may select a page or a column as the sample for content analysis. For radio or TV, you select, per program, the time and duration of the program. Your corpus needs to be divided into a number of units of roughly similar size. Before content analysis can begin, we have to get the content into a text format which can be analyzed, divide the content into groups based on similarity, and assign codes (numbers/topics/etc.). Draw conclusions after counting and weighing, and prepare the content analysis report beginning with a clear explanation of the focus – what was included, what was excluded, and why.
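The counting step at the heart of content analysis is a one-liner in base R. The category codes below are hypothetical, standing in for coded TV-program units:

```r
# Share of coded units per category, as percentages
codes <- c("knowledge", "entertainment", "news",
           "knowledge", "news", "news")

shares <- round(100 * prop.table(table(codes)), 1)
shares   # e.g. the percentage of programs coded as "knowledge"
```

Comparing such percentage tables across years gives exactly the kind of statement quoted above ("15 percent of TV programs are on knowledge as compared to 10 percent last year").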
Natural Language Processing
Nowadays, machine-learning algorithms, such as decision trees or probabilistic decision models, are used for the Natural Language Processing (NLP) of large corpora.
The NLP process is broken down into three parts. The first task of NLP is to understand the natural language received by the computer. A built-in statistical model is used by the machine to perform speech recognition and convert the natural language into a programming language. This is done by breaking the data down into tiny units and then comparing these units to units from a previous speech. The words and sentences that were most likely said are determined statistically to produce the output in text format.
The next task is called part-of-speech (POS) tagging or word-category disambiguation. This process identifies words in their grammatical forms as nouns, verbs, adjectives, past tense, etc., using a set of lexicon rules coded into the computer. With the completion of these two steps, the machine understands the meaning of the speech that was made.
The third step taken by an NLP system is text-to-speech conversion. At this stage, the computer programming language is converted into an audible or textual format for the user.
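The POS-tagging step can be sketched in R with the udpipe package (listed in this chapter's references). This is a hedged example: it assumes udpipe is installed and that network access is available to download the English model on first run:

```r
library(udpipe)

# Download and load the pretrained English model (one-time download)
model <- udpipe_download_model(language = "english")
ud <- udpipe_load_model(model$file_model)

# Annotate a sentence and inspect the word-category (POS) tags
x <- as.data.frame(udpipe_annotate(ud, x = "The cat sat on the mat."))
x[, c("token", "upos")]
```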
Syntax: It refers to organizing words to form a sentence and determining the structural role of words in the sentence and phrase.
Semantics: It is concerned with the meanings of words and how to arrange words into meaningful phrases and sentences.
Pragmatics: It concerns understanding and using sentences in different situations and how the interpretation of a sentence is affected.
Discourse: It concerns how the preceding sentence affects the interpretation of the next sentence.
World knowledge: It is the general knowledge about the world.
Predictive Modeling
Predictive modeling is done by running one or more algorithms on the data set on which prediction is to be carried out. Different techniques are used while building a model. Predictive modeling is part of predictive analytics. The different steps involved in predictive modeling are shown in Figure 7.5:
FIGURE 7.5 Steps in Predictive Modeling (cycle: Data Mining → Understanding the Data → Preprocessing the Data → Model Data → Evaluate Model & Select Best-Fit Model → Deploy the Model → Monitor and Improve)
Understanding the Data: The data is then understood to prepare the model.
Preprocessing the Data: The data is preprocessed to prepare the data model.
Evaluate model and select the best-fit model: The model created is then evaluated and the best-fit model is selected for deployment.
Deploy the model: The best-fit model is then deployed in business.
Monitor and improve: The deployed model is monitored and improved on a timely basis.
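These steps can be sketched with the rpart package (listed in this chapter's references). As a hedged stand-in for a real, preprocessed text data set, the example uses R's built-in iris data:

```r
# Hedged sketch of the modeling steps: partition, model, evaluate
library(rpart)

set.seed(42)
idx <- sample(nrow(iris), 100)                        # split the data
model <- rpart(Species ~ ., data = iris[idx, ])       # model the data
pred <- predict(model, iris[-idx, ], type = "class")  # evaluate the model

accuracy <- mean(pred == iris[-idx, ]$Species)
accuracy   # proportion of correct predictions on held-out rows
```

The evaluate/select and monitor steps would repeat this loop over several candidate models and over time, keeping whichever generalizes best.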
Exhibit-1
Language Processing Algorithms for Disease Detection
According to a study published in JMIR Medical Informatics, researchers have designed and
developed a language-processing algorithm that is capable of detecting disease symptoms
with accuracy as high as 97.6%. They developed a natural language-processing algorithm, CHESS
(Clinical History Extractor for Syndromic Surveillance), which takes input data from clinical
records containing valuable information regarding many disease symptoms and identifies the
illness clusters that would not otherwise be suspected. Due to the involvement of digital services in the healthcare industry, such as smart watches and electronic health records, a huge amount of data is being created each day, and the advent of technologies like NLP makes it extremely convenient to gather and analyze this data.
CHESS extracts 48 signs and symptoms suggesting various infections and diseases using the keywords found in a patient's electronic medical records. According to training and validation evaluations, CHESS was able to reach 96.7% precision and 97.6% recall on the training data set and 96% precision and 93.1% recall on the validation data set. The overall accuracy of the tool was recorded as 97.6%, and it successfully identified symptom duration in 81.2% of the cases. In addition to the presence of symptoms, the algorithm can also accurately distinguish affirmed, negated, and suspected assertion statuses and extract symptom durations.
Sentiment Analysis
Sentiments can be extracted from a given text corpus. These sentiments can be positive, negative, happiness, fear, trust, anger, etc. Sentiments can also be expressed in terms of a range of scores. Sentiment analysis is widely used by organisations to understand what customers actually feel about the products or services provided to them. For example, a marketing department can analyze the response to its advertisement/product through sentiment analysis. Bags of words (positive or negative) are used to classify the terms and analyze the text document. Based on the type of sentiment, three lexicons (dictionaries) are available in the tidytext package in R: nrc (sentiment categories), bing (binary positive/negative) and afinn (numeric scores).
Again, the same text corpus used in the predictive analysis section above is analyzed further for understanding. You can use any text data set for the analysis. Terms in the example data set were identified, and sparse terms were removed for better analysis. These terms are then analyzed to get the sentiments. The code snippet in R is as follows:
word_freq2 <- as.data.frame(word_freq2)
library(plyr)
library(dplyr)
# Sentiments - Categories
sentiments <- tidytext::get_sentiments("nrc")
plyr1 <- join(word_freq2, sentiments, by = "word")
plyr1 %>% filter(sentiment != "NA")
barplot(table(plyr1$sentiment),
        col = c("Red", "Blue", "Orange", "Yellow", "Red",
                "Green", "Cyan", "Purple", "Green"),
        main = "Sentiment Analysis")
The output of the preceding code snippet is shown in Figure 7.6:
FIGURE 7.6 Sentiment Analysis (bar chart of counts for the categories anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise and trust)
Consider the following R code that shows analysis of the negative and positive thoughts of people:
# Sentiments - Binary
sentiments <- tidytext::get_sentiments("bing")
plyr1 <- join(word_freq2, sentiments, by = "word")
plyr1 %>% filter(sentiment != "NA")
barplot(table(plyr1$sentiment),
        col = c("Red", "Blue", "Orange", "Yellow", "Red",
                "Green", "Cyan", "Purple", "Green"),
        main = "Sentiment Analysis")
FIGURE 7.7 Displaying Analysis of Negative and Positive Thoughts
Similarly, the numeric sentiment scores can be plotted:
barplot(table(plyr1$score),
        col = c("Red", "Blue", "Orange", "Yellow", "Red",
                "Green", "Cyan", "Purple", "Green"),
        main = "Sentiment Score")
FIGURE 7.8 Graph Showing Sentiment Score
FIGURE 7.9 Picture of Swami Vivekanandji
Consider the following code snippet in R:
> Emotion_face
The preceding command shows the different emotions and their related scores:
  Emotion     Score
1 Neutral     0.9841457
2 Sad         0.015763
3 Fear        4.44e-05
4 Happy       3.24e-05
5 Disgust     1.06e-05
6 Angry       2.9e-06
7 Surprise    1.1e-06
Scholarly Communication
We learn by reading books, articles and the web. But who has written the information available through these different media of communication? Scholars, researchers and academicians share and publish their findings so that they are available to all. This is called scholarly communication. Remember, during our school days we used to learn only through teachers and books. Now technology has grown so much that the world has come to our desk through computers. The evolution in scholarly communication is shown in Figure 7.10:
FIGURE 7.10 Evolution in Scholarly Communication
Source: https://101innovations.wordpress.com
Health
For any hospital, the safety and well-being of its patients is the top priority. Despite taking many precautions, mistakes and unfavourable events may still occur frequently. The traditional way of detecting unfavourable conditions is not very successful: manually reviewing records of patients' conditions to assess injuries and identify the causes behind them is very time-consuming and engages more staff than required. The main cause of injuries, or inappropriate treatment given to patients, might go undetected. This also leads to lost opportunities to save lives by improving treatment procedures. Here, text analytics is very helpful as it saves time in detecting diseases and helps in fast diagnosis. Text analytics software is used in the healthcare industry due to the following benefits:
It automates manual records review, which helps in shortening the timespan of diagnosis.
It enables thorough evaluation of patients' records for better healthcare.
Visualization
By now, we have understood that text mining and text analytics are used by companies to discover and generate new information through deep analysis and examination of large amounts of text using various methods and technologies. The combination of text mining and visualization tools can make this process even more powerful. Text mining and analytics essentially transform unstructured information into structured data, which is further explored and analyzed to get valuable information that is used to draw conclusions or to make decisions. Reading through a long list of elements or browsing a large number of documents takes a long time to extract the knowledge contained within. Insightful and interactive data visualisation allows users to understand the analysis and dig deeper into the area of interest. Thus, combining text analytics and visualization tools represents information with even greater clarity. Text mining and visualization tools convert documents, spreadsheets, reports, etc. into clear charts or graphs, helping analysts easily explore and work with data and content. Visualization tools help users to:
Get the sense of the data
Exhibit-2
Text Analytics: Increasing the return on existing investments
In the last several years, many organizations have invested huge sums in the development of optimized document management systems. This is a vital requirement as it saves a lot of time in finding the relevant document. The introduction of text analytics is definitely one of the highest-return investments for organizations that want to cope with the challenge of information access.
Several crucial business decisions are taken on the basis of data extracted from data-based enterprise applications. Using text mining and analytics to analyze customers' behavior can help companies detect product and business problems and then solve them to boost sales. Using resources like customer reviews and comments on product description web pages as a source for text mining and analytics, companies can now also predict customer churn, enabling them to get ahead of potential rivals. Several other areas benefit from text mining and analytics, such as fraud detection, risk management, web content management and online advertising.
In areas like business intelligence, text mining and analytics make a great difference, enabling the analyst to quickly reach the answer even when analyzing huge amounts of internal and open-source data. Text mining applications such as the Cogito Intelligence Platform are able to monitor thousands of sources and analyze large volumes of data to extract the relevant content.
Summary
In this chapter, we first discussed what text mining and analytics are. We learnt the different methods and techniques of text analytics and discussed the step-by-step approach to analysing text data. The chapter also discussed the different applications of text mining. Finally, we got hands-on experience analysing different types of text data in user-friendly R.
Exercise
Multiple-Choice Questions
Q1. Which of the following is/are the text mining techniques?
a. Sentiment analysis b. Topic modeling
c. Term frequency d. All of these
Q2. What is the full form of LDA?
a. Latent Dirichlet Allocation b. Latent Dirich Allocation
c. Latent Dirichlet Association d. Latency Dirich Allocation
Q3. ___________ is a tool used in text analytics which classifies the named entities in the given
corpus into the predefined classes.
a. Named Entities Recognition b. Name Entity Recognition
c. Named Entity Recognition d. None of these
Q4. In ______, each data point forms one cluster and then pairs with the most adjacent cluster.
a. K-means clustering b. Hierarchical clustering
c. Hierarchy collection d. None of these
Q5. The methods used in text analytics are:
a. Term-based method b. Phrase-based method
c. Concept-based method d. All of these
Assignment
Q1. Describe text mining and analytics.
Q2. What is the difference between extractive summarization and abstractive summarization?
Q3. How many methods are there for clustering documents? How are they different from each other?
Q4. How would you explain the term ‘natural language processing’?
Q5. Discuss the application of analytics in healthcare.
References
Jan Wijffels, udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit, https://cran.r-project.org/web/packages/udpipe/index.html
Hadley Wickham, dplyr: A Grammar of Data Manipulation, https://cran.r-project.org/web/packages/dplyr/index.html
Hadley Wickham, ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics, https://cran.r-project.org/web/packages/ggplot2/index.html
Ingo Feinerer, Kurt Hornik, tm: Text Mining Package, https://cran.r-project.org/web/packages/tm/index.html
Dr Martin Porter, SnowballC: Snowball Stemmers Based on the C libstemmer UTF-8 Library, https://cran.r-project.org/web/packages/SnowballC/index.html
Terry Therneau, Beth Atkinson, Brian Ripley, rpart: Recursive Partitioning and Regression Trees, https://cran.r-project.org/web/packages/rpart/index.html
James Athappilly, algorithmia: Allows you to Easily Interact with the Algorithmia Platform, https://cran.r-project.org/web/packages/algorithmia/index.html
C A S E S T U D Y
AUTOMATION OF TEXTUAL DATA DISCOVERY AND ANALYSIS
This Case Study discusses how Elder Research helped a United States’ federal agency in automating the
textual data discovery process.
Elder Research is an established company in the field of advanced analytics. The company has worked with various organizations belonging to different industries and has helped them solve real-world problems using technologies for text mining, data visualization, software engineering, technical teaching, projects and algorithms, advanced validation techniques and innovative model combination methods (ensembles). It has been proven that Elder Research can maximize project success in addition to ensuring a continuous return on analytics investment.
Solution
Elder Research followed a planned approach under which it undertook the following activities:
Elder Research helped in improving analysts' efficiency in analyzing country-specific veterinary capability studies by combining advanced data mining and machine learning tools.
The system implemented by Elder Research is shown in Figure 7.11:
FIGURE 7.11 The System Implemented by Elder Research (pipeline: Internet → Link Queue → Crawler → Parser → Classifier → Analyst, with predicted and actual Good/Bad document scores)
Elder Research’s solution employed a ‘learned rule’ natural language processing which did not
rely on dictionaries and heuristics.
Activities of clustering and document fingerprinting used robust rules learned from the
observed outcomes.
Elder Research evaluated all the available technologies and software tools to determine the
most cost-effective solution that could be employed for a real-time data streaming architecture.
Elder Research also deployed analytical models that were compatible with the agency's then-existing systems and workflows.
Analysts could use a user-friendly, real-time, web-based search interface to access the results
of the analysis.
Statistical model implemented by Elder Research for text analytics used the full text of
documents rather than just keywords in order to search for the desired topics.
Elder Research used a set of training documents provided by the analysts as an input to a web
crawler. This web crawler used data mining technology to locate similar relevant documents on
different websites and stored all those documents on the capabilities database.
Whenever any new documents that were of interest to the analysts were received, the systems
generated automated alerts in the form of electronic mail or text messages.
Different documents could be tagged and scores could be attached with the documents.
The system received thumbs-up or thumbs-down judgments regarding particular documents, which helped it gain an understanding of what the analyst was and was not looking for.
A machine learning model builder analyzes the different documents that have different labels.
Based on the type of labels, the model builder can update a model.
Documents having the highest ratings are documents of interest.
Elder Research also deployed various methods that were used for gathering information:
External Search Engine Module: Under this module, there were three major search engines.
In addition, this module also ran a parallel country-specific keyword search engine against the
content that was already stored in the database. The analysts could use keyword search to
restrict the searches on the basis of document tags, ratings, or specific web addresses.
Search Monitor Module: This module was enabled to run keyword searches and save the
sets of keyword searches automatically. After this, the keywords were tagged and stored in
the database for future searching. The analysts could configure the modules to run searches
repeatedly at the defined time intervals.
Site Monitor Module: This module could monitor or ‘scrape’ a particular domain or website. It
could add the scraped content into the database for future retrieval.
Import Modules: This module was used to import documents from hard drives, emails or FTP
directories.
Results
Elder Research helped the agency in tracking and evaluating the key events related to the infectious
animal diseases by designing and deploying advanced text mining tools. These text mining tools
could be used to search, index and automatically classify information pertaining to infectious
diseases in animals.
Questions
1. Why did the federal agency contact Elder Research?
(Hint: The federal agency required a solution which would help the analysts in conveniently
tracking and evaluating key events related to animal diseases for any country. For this, the
federal agency hired Elder Research.)
2. What was the role of web crawler deployed by Elder Research?
(Hint: Elder Research used a set of training documents provided by the analysts as an input to a web crawler. This web crawler used data mining technology to locate similar relevant documents on different websites and stored all those documents in the capabilities database.)
L A B E X E R C I S E
This lab exercise takes you through the step-by-step process of gathering, segregating, and analyzing
text data using the R tool.
LAB 1
Solution: In this lab, you need to load some library utilities into the current R environment and verify
the Twitter authentication information to work with the tweets.
Enter the following commands to load required packages to work with online tweets:
install.packages("twitteR")
install.packages("bitops")
install.packages("digest")
install.packages("RCurl")
# If there is any error while installing RCurl, run the following
# command in a terminal:
# sudo apt-get install libcurl4-openssl-dev
install.packages("ROAuth")
install.packages("tm")
install.packages("stringr")
install.packages("plyr")
library(twitteR)
library(ROAuth)
library(RCurl)
library(plyr)
library(stringr)
library(tm)
If you are working on Windows Operating System, you may face Secured Socket Layer (SSL)
certificate issues.
You can avoid that by providing certificate authentication information in the options() function through the following command:
# commonly used form in twitteR-based labs; verify the certificate path for your setup
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
After loading the required R utilities and providing the SSL certificate authentication information,
load the Twitter authentication information. This information will be used to download tweets later.
Enter the following commands to load Twitter authentication information using your own Twitter
credentials:
load("/Datasets/twitter_cred.RData")
registerTwitterOAuth(cred)
To analyze tweets, you first need to segregate and download them on the basis of some specific
keywords.
Note
R provides automatic downloading of tweets by using the searchTwitter() function, which takes
as its arguments the language in which the tweets need to be searched, the keyword (term to be
searched on the Internet), and the number of tweets that need to be extracted containing the
keyword.
Now, enter the following command to download 1000 English-language tweets (specified by the lang="en" argument to the searchTwitter() function) containing the word "nokia":
input_tweets = searchTwitter("nokia", n = 1000, lang = "en")
We can take a list of tweets at a time to analyze different opinions, as shown by the following command:
input_tweets[1:3]
Some tweets, containing the search word, may be insignificant for our analysis. Therefore, we need
to extract tweets only with the relevant texts. Enter the following command to extract a specific set
of words as a text string:
tweet = sapply(input_tweets, function(x) x$getText())
tweet[1:4]
The next task is to segregate the tweets according to the nature of the feedback they provide. The feedback can be positive, negative, or neutral. In our case, we use only positive and negative words. Before running the sentiment function, load the files containing positive and negative words. Enter the following commands to load the data files containing positive and negative words, respectively:
pos = readLines("/Datasets/positive-words.txt")
# find file positive-words.txt
neg = readLines("/Datasets/negative-words.txt")
# find file negative-words.txt
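The lab's R sentiment function is not reproduced here. The underlying idea, sketched below in Python purely for illustration (the function name and the small word sets are assumptions, not the lab's code), is that each tweet's score is the count of positive-word matches minus the count of negative-word matches:

```python
# Illustrative sketch (assumption, not the lab's R code): score each
# tweet as (# positive-word matches) - (# negative-word matches).
def score_sentiment(text, pos_words, neg_words):
    words = text.lower().split()
    pos_count = sum(w in pos_words for w in words)
    neg_count = sum(w in neg_words for w in words)
    return pos_count - neg_count

pos = {"good", "great", "love"}   # stand-in for positive-words.txt
neg = {"bad", "poor", "hate"}     # stand-in for negative-words.txt

print(score_sentiment("great camera but bad battery", pos, neg))  # 0
print(score_sentiment("love it, great phone", pos, neg))          # 2
```

A score above zero marks a tweet as positive, below zero as negative, and exactly zero as neutral, which is how the counts in the next step are obtained.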
Categorize each tweet as positive, negative, or neutral by using the following code:
# threshold values reconstructed from context; adjust to your scoring scheme
scores$very.pos = as.numeric(scores$score >= 1)
scores$very.neg = as.numeric(scores$score <= -1)
scores$very.neu = as.numeric(scores$score == 0)
Enter the following commands to find out the number of positive, negative, and neutral tweets:
# Number of positive, neutral, and negative tweets
numpos = sum(scores$very.pos)
numneg = sum(scores$very.neg)
numneu = sum(scores$very.neu)
After the sentiments are categorized and the number of positive, negative, and neutral tweets is found out, plot the results by using the following command:
# reconstructed plotting command for the pie chart shown in Figure 7.12
pie(c(numpos, numneg, numneu), labels = c("Positive", "Negative", "Neutral"))
The Pie chart for the analyzed sentiment score is shown in Figure 7.12:
OPINION
CHAPTER
8
Data Representation
and Visualization
Topics Discussed
Introduction
Ways of Representing Visual Data
Techniques Used for Visual Data Representation
Types of Data Visualization
Applications of Data Visualization
Visualizing Big Data
Deriving Business Solutions
Chapter Objectives
After completing Chapter 8, you should be able to:
INTRODUCTION
Data is everywhere, but to represent the data in front of users in such a way that it communicates
all the necessary information effectively is important. Data visualization can be understood as a
technique which can be used to communicate data or information by transforming it into pictorial
or graphical format. It represents the data as visual objects, with the help of visual aids such as graphs, bar charts, histograms, tables, pie charts, mind maps32, etc. The main purpose of data visualization
is to make users understand the information clearly and efficiently. It is one of the important steps
in data analysis or data science.
Depending upon the complexity of data and the aspects from which it is analysed, visuals can vary
in terms of their dimensions (one-/two-/multi-dimensional) or types, such as temporal, hierarchical,
network, etc. All these visuals are used for presenting different types of datasets. Different types of
tools are available in the market for visualising data. But what is the use of data visualisation in Big
Data? Is it necessary to use it? Let’s first track down the real meaning of visualisation in the context
of Big Data analytics.
This chapter familiarises you with the concept of data visualisation and the need to visualise data in
Big Data analytics. You also learn about different types of data visualisations. Next, you learn about
various types of tools using which data or information can be presented in a visual format.
Data visualisation30 is a different approach from infographics. It is the study of representing
data or information in a visual form. With the advancement of digital technologies, the scope
of multimedia has increased manifold. Visuals in the form of graphs, images, diagrams, and animations now pervade the media industry and the Internet. It is an established
fact that the human mind can comprehend information more easily if it is presented in the form
of visuals. Instructional designers focus on abstract and model-based scientific visualisations to
make the learning content more interesting and easy to understand. Nowadays, scientific data
is also presented through digitally constructed images. These images are generally created with
the help of computer software.
Visualisation is an excellent medium to analyse, comprehend, and share information. Let’s see
why:
Visual images help to transmit a huge amount of information to the human brain at a glance.
Visual images help in establishing relationships and distinctions between different patterns or processes easily.
Visual interpretations help in exploring data from different angles, which helps gain insights.
Visualisation helps in identifying problems and understanding trends and outliers.
Visualisations point out key or interesting breakthroughs in a large dataset.
Data can be classified on the basis of the following three criteria irrespective of whether it is
presented as data visualisation or infographics:
Method of creation: It refers to the type of content used while creating any graphical representation.
Quantity of data displayed: It refers to the amount of data which is represented, for example, a geographical map, a company's financial data, etc.
Degree of creativity applied: It refers to the extent to which the data is created graphically or
designed in a colorful way or it is just showing some important data in black and white diagrams.
On the basis of the above evaluation, we can understand which is the correct form of representation for a given data type. Let's discuss the various content types:
Graph: A representation in which X and Y axes are used to depict the meaning of the information
Diagram: A two-dimensional representation of information to show how something works
FIGURE 8.2 Isolines
Isosurface: It is used to represent points that are bounded in a volume of space by a constant value, that is, in a domain that covers 3D space.
Direct Volume Rendering (DVR): It is a method used for obtaining a 2D projection for a 3D
dataset. A 3D record is projected in a 2D form through DVR for a clearer and more transparent
visualisation.
FIGURE 8.4 2D Image DVR
Streamline: It is a field line that results from the velocity vector field description of the data
flow.
Figure 8.5 shows a set of streamlines:
Map: It is a visual representation of locations within a specific area. It is depicted on a planar
surface.
Parallel Coordinate Plot: It is a visualisation technique of representing multidimensional data.
Venn Diagram: It is used to represent logical relations between finite collections of sets. Figure
8.7 shows a Venn diagram for a set of relations:
FIGURE 8.7 Venn Diagrams
DATA SCIENCE
Figure 8.8 shows an example of a timeline for some critical event sets:
Euler Diagram: It is a representation of the relationships between sets.
FIGURE 8.9 Euler Diagram of the British Isles
Hyperbolic Trees: They represent graphs that are drawn using the hyperbolic geometry. Figure
8.10 shows a hyperbolic tree:
FIGURE 8.10 Hyperbolic Tree
Cluster Diagram: It represents a cluster, such as a cluster of astronomic entities. Figure 8.11 shows a cluster diagram:
Network (Table 8.1): matrix, node-link diagram, hive plot, and tube map; tools: Pajek, Gephi, NodeXL, VOSviewer, UCINET, GUESS, Network Workbench/Sci2, sigma.js, d3
As shown in Table 8.1, the simplest type of data visualisation is 1D representation and the most
complex data visualisation is the network representation.
Knowledge Check 8.3
Which data visualization type do you think would have the least spatial complexity?
Data visualisation tools and techniques are used in various applications. Some of the areas in which
we apply data visualisation are as follows:
Education: Visualisation is applied to teach a topic that requires simulation or modeling of any
object or process. Have you ever wondered how difficult it would be to explain any organ or
organ system without any visuals? An organ system or the structure of an atom is best described with the help of diagrams or animations.
Information: Visualisation is applied to transform abstract data into visual forms for easy
interpretation and further exploration.
Production: Various applications are used to create 3D models of products for better viewing
and manipulation. Real estate, communication, and automobile industry extensively use 3D
advertisements to provide a better look and feel to their products.
Science: Every field of science including fluid dynamics, astrophysics, and medicine use visual
representation of information. Isosurfaces and direct volume rendering are typically used to
explain scientific concepts.
Systems visualisation: Systems visualisation is a relatively new concept that integrates visual
techniques to better describe complex systems.
Visual communication: The multimedia and entertainment industries use visuals to communicate their ideas and information.
Visual analytics: It refers to the science of analytical reasoning supported by the interactive
visual interface. The data generated by social media interaction is interpreted using visual
analytics techniques.
Exhibit 1
“Data visualization app” by UNICEF
UNICEF has released a ‘data visualization application’ which provides a user-friendly visual representation of the Indian education scenario’s complex analytics. The application uses the government
database for schools, known as UDISE (Unified District Information System for Education) and
NAS (National Assessment Survey). This can be used as a visual tool by government officials, policy makers, and research scholars to resolve issues and make the education system successful and future-ready.
2. What problems could be faced if we don’t use data visualization in fields like scientific
education and analytics?
Almost every organisation today is struggling to tackle the huge amount of data pouring in every day. Data visualisation is a great way to reduce the turnaround time consumed in interpreting Big Data. Traditional visualisation techniques are not efficient enough to capture or interpret the information that Big Data possesses. For example, such techniques are not able to interpret videos, audios, and complex sentences. Apart from the type of data, the volume and speed at which it is generated pose a great challenge. Most of the traditional analytics techniques are unable to cater to any of these problems.
Big Data comprises both structured as well as unstructured forms of data collected from various
sources. Heterogeneity of data sources, data streaming, and real-time data are also difficult to
handle by using traditional tools. Traditional tools are developed by using relational models that
work best on static interaction. Big Data is highly dynamic in function and therefore, most traditional
tools are not able to generate quality results. The response time of traditional tools is quite high,
making it unfit for quality interaction.
Considering all these factors, IT companies are focusing more on research and development of
robust algorithms, software, and tools to analyse the data that is scattered in the Internet space.
Tools such as Hadoop are providing the state-of-the-art technology to store and process Big Data.
Analytical tools are now able to produce interpretations on smartphones and tablets. It is possible
because of the advanced visual analytics that is enabling business owners and researchers to explore
data for finding out trends and patterns.
Visual data mining also works on the same principle as simple data mining; however, it involves
the integration of information visualisation and human–computer interaction. Visualisation of data
produces cluttered images that are filtered with the help of clutter-reduction techniques. Uniform
sampling and dimension reduction are two commonly used clutter-reduction techniques.
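Uniform sampling can be illustrated with a minimal sketch (in Python, purely for illustration; the data and sample size are assumptions, not from the text): a large point set is thinned to a uniformly chosen subset before plotting, which reduces overplotting while preserving the overall distribution.

```python
import random

# Illustrative sketch of uniform sampling as a clutter-reduction step:
# keep a uniformly chosen subset of a large point set before plotting.
random.seed(42)
points = [(random.random(), random.random()) for _ in range(100_000)]

def uniform_sample(data, k):
    """Return a uniformly chosen subset of k points to reduce overplotting."""
    return random.sample(data, k)

reduced = uniform_sample(points, 1_000)
print(len(points), "->", len(reduced))  # 100000 -> 1000
```

Dimension reduction, the other technique named above, would instead project the points into fewer coordinates rather than discarding points.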
Visual data reduction process involves automated data analysis to measure density, outliers, and
their differences. These measures are then used as quality metrics to evaluate data-reduction
activity. Visual quality metrics can be categorised as:
Size metrics (e.g. number of data points)
Visual effectiveness metrics (e.g. data density, collisions)
Feature preservation metrics (e.g. discovering and preserving data density differences)
1. What are the challenges faced while analysing big data using traditional analytics
techniques? What are the techniques available to overcome these challenges?
2. List some characteristic properties of a visual analytic tool.
Pics: This tool is used to track the activity of images on the website.
Arc: It is used to display topics and stories in a spherical form. Here, a sphere is used to display stories and topics, and bunches of stories are aligned at the outer circumference of the
sphere. Larger stories have more Diggs. The arc becomes thicker with the number of times users Digg the story.
Google Charts API: This tool allows a user to create dynamic charts to be embedded in a Web
page. A chart obtained from the data and formatting parameters supplied in a HyperText
Transfer Protocol (HTTP) request is converted into a Portable Network Graphics (PNG) image
by Google to simplify the embedding process.
TwittEarth: TwittEarth is a tool capable of mapping the locations of tweets from all over the world on a 3D representation of the globe and displaying them. It is an effort to improve social media visualisation and provide a global mapping of tweets.
Tag Galaxy: Tag Galaxy provides a stunning way of finding a collection of Flickr images. It is an unusual site whose search tool makes the online combing process a memorable visual experience. To search for a picture, enter a tag of your choice and it will find the matching pictures. The central (core) star contains all the images directly relating to the initial tag, and the revolving planets consist of similar or corresponding tags. Click on a planet and additional sub-categories will appear. Click on the central star and Flickr images gather and land on a gigantic 3D sphere.
D3: With D3, you get the ability to attach the DOM (Document Object Model) to arbitrary data and then apply data-driven transformations to the document. For example, you can utilise D3 for creating an HTML table from a sequence of numbers. Also, you can use the same data to design and develop an interactive SVG chart with features like smooth transitions and interactions.
Rootzmap Mapping the Internet: It is a tool to generate a series of maps on the basis of the
datasets provided by the National Aeronautics and Space Administration (NASA).
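The request-based embedding described in the Google Charts API item above can be sketched as follows. Note the hedge: the endpoint and the parameter names (cht, chs, chd, chl) are from the legacy, now-deprecated Google Image Charts API that returned PNG images over HTTP, shown only to illustrate the idea of a chart as a URL:

```python
from urllib.parse import urlencode

# Sketch of a legacy Google Image Charts request (deprecated API).
# Parameter names are assumptions drawn from that old API, not from
# the current Google Charts JavaScript library.
params = {
    "cht": "p3",        # chart type: 3D pie
    "chs": "250x100",   # chart size in pixels
    "chd": "t:60,40",   # data series (text encoding)
    "chl": "Yes|No",    # slice labels
}
url = "https://chart.googleapis.com/chart?" + urlencode(params)
print(url)
```

Embedding then amounted to using this URL as the src of an <img> tag, with Google returning the rendered PNG.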
OPEN-SOURCE DATA VISUALISATION TOOLS
We already know that Big Data analytics requires the implementation of advanced tools and
technologies. Due to economic and infrastructural limitations, every organisation cannot purchase
all the applications required for analysing data. Therefore, to fulfill their requirement of advanced
tools and technologies, organisations often turn to open-source libraries. These libraries can be
defined as pools of freely available applications and analytical tools. Some examples of open-source
tools available for data visualisation are VTK, Cave5D, ELKI, Tulip, Gephi, IBM OpenDX, Tableau
Public, and Vis5D.
Open-source tools are easy to use, consistent, and reusable. They deliver high-quality performance
and are compliant with the Web as well as mobile Web security. In addition, they provide multichannel
analytics for modeling as well as customised business solutions that can be altered with changing
business demands.
TABLEAU PRODUCTS
Tableau31 is one of the popular and evolving business intelligence and data visualization tools these days. It is used to create and share interactive dashboards, which can depict the variation and density of data in various visual forms like charts and graphs. Tableau can acquire data from various sources, such as files, Big Data stores, and relational databases, and then process it. Currently, it is positioned as an industry leader in business intelligence and analytics.
Tableau offers five main products to fit diverse data visualization requirements for professionals
and organizations. They are:
1. Tableau desktop: It is a self-service business analytics and data visualization tool that can be used by anybody. It translates pictures of data into optimized queries. You can directly connect to your data warehouse using Tableau desktop for up-to-date, live data analysis. Queries can be performed in Tableau without writing any code. Data from multiple sources can be imported into Tableau's data engine and then integrated to create an interactive dashboard.
2. Tableau server: It is the enterprise-level Tableau software. You can publish dashboards created with Tableau desktop and then share them with others inside the organization with the help of the web-based Tableau server. Following are the components of Tableau server:
a. Application server: It handles permissions and browsing for the web and mobile interfaces of Tableau server.
b. VizQL server: After receiving a request from a client, the VizQL server sends the query to the data source and returns the result. The result set is then rendered as an image and finally presented to the user.
c. Data server: It manages and stores data sources and metadata.
3. Tableau online: It has the same functionality as Tableau server, but it is hosted by Tableau in their cloud.
4. Tableau reader: It is a free application that can be downloaded from Tableau's website. It can be used to analyze workbooks that have been saved with the .twbx extension. The only thing worth keeping in mind is that anyone who receives the workbook can use Tableau reader to open it. So, essentially, there is no security.
5. Tableau public: It is free software that can be downloaded from Tableau's website. Being "public" means anyone can post their visualizations and anyone can view others' workbooks. There is no feature that allows you to save your workbook locally; however, it allows you to block others from downloading it. Tableau public can be used for sharing any resource and its development.
Note
Data visualization can be used as a method to convey data or information by transforming it visually, so that it is more accessible to the people receiving it. The best visual representations give readers a clearer idea and let them draw conclusions from the data that they might otherwise have missed. At the same time, presenting data in too much detail may confuse consumers, and they may lose interest. So, a balance needs to be struck while modelling a data visualization.
Organizations can also assess their direction of business using data visualization. For example, you may find yourself looking for answers on how you can boost sales in a particular demography or how you can cut down operational costs. You can validate your future business moves by backing them up with the data you have.
In 2013, a survey conducted by the Aberdeen Group found that managers in organizations using data visualization tools were more than 28% more likely to find relevant information than their peers who relied simply on dashboards and managed reporting.
Summary
In this chapter, you have learned about data representation and visualisation. First, you learned different ways of representing visual data and the techniques used for visual data representation. Next, you learned about the types and applications of data visualisation. You have also learned the importance of Big Data visualisation. Then, you learned about the tools used in data visualisation. At the end, you learned about Tableau products and data visualization for managers.
Exercise
Multiple-Choice Questions
Q1. Which of the following visual aids are used for representing data?
a. Graphs b. Bar
c. Histograms d. All of these
Q2. Which of the following refers to a representation of instructions that shows how something works or a step-by-step procedure to perform a task?
a. Template b. Checklist
Assignment
Q1. What do you understand by data visualisation? List the different ways of data visualisation.
Q2. Describe the different techniques used for visual data representation.
Q3. Discuss the types and applications of data visualisation.
Q4. Describe the importance of Big Data visualisation.
Q5. Elucidate the transformation process of data into information.
Q6. Enlist and explain the tools used in data visualisation.
References
Data visualization. (2017, April 26). Retrieved May 02, 2017, from https://en.wikipedia.org/wiki/Data_visualization
Suda, B., & Hampton-Smith, S. (2017, February 07). The 38 best tools for data visualization. Retrieved May 02, 2017, from http://www.creativebloq.com/design-tools/data-visualization-712402
50 Great Examples of Data Visualization. (2009, June 01). Retrieved May 02, 2017, from https://www.webdesignerdepot.com/2009/06/50-great-examples-of-data-visualization/
C A S E S T U D Y
DUNDAS BI SOLUTION HELPED MEDIDATA AND ITS CLIENTS IN GETTING BETTER DATA VISUALISATION
This Case Study discusses how a custom data visualisation solutions provider is helping its clients in
getting better visualisation of the data stored in their database.
Medidata started receiving demands from its clients to provide software that could help them in analysing and interacting with the data generated from the ERP software. Medidata felt it necessary to include a Business Intelligence (BI) and analytics solution in its collection of software solutions.
The BI and analytics solution would have the following advantages for Medidata's clients:
It helped them in taking better and more informed decisions.
It was scalable, which means resources could be increased or decreased as and when required.
To detect and resolve its own issues related to data quality, Medidata also used BI and analytics, along with fulfilling the needs of clients. Medidata decided to migrate to the Dundas BI solution for data visualisation. The decision was obvious because Dundas had been working as a partner since 2009, when it was involved in developing business intelligence components. The satisfaction and belief in using Dundas legacy products helped Medidata in migrating to the Dundas BI solution for visualising the data. Dundas has created and provided customised data visualisation software for start-up companies as well as Fortune 500 companies.
Before formalising the partnership, several meetings were held between Medidata and Dundas
to discuss how Medidata would encourage the clients to use the Dundas BI solution for data
visualisation. After understanding the strategies of Medidata of selling and marketing the Dundas BI
solution, Dundas decided to provide the customised BI solution with full support as per Medidata’s
needs to use and test it for a certain period of time. Dundas also helped Medidata in learning the
use of BI solution by providing multimedia training content and webinars. It helped in the adoption
of BI solution rapidly across Medidata. The interface of BI solution for data visualisation is shown in
the following Figure 8.13:
FIGURE 8.13 BI solution for Data Visualisation
Some important features of the BI solution are:
Superb interactivity: A highly interactive environment of Dundas BI visualisation enabled clients
of Medidata in engaging and understanding their data in a better way.
Data-driven alerts: Using alert notifications, built-in annotations, and scheduled reports in Dundas BI, Medidata's clients can collaborate with their users.
Smart design tools: Dundas BI provides smart and in-built design tools, which provide drag-and-
drop functionality for quickly designing reports and dashboards.
Extensibility: Dundas BI provides connectivity with earlier unsupported data sources.
Performance tuning: The BI solution provides an ability to store the output of the data cubes
within Dundas BI’s data warehouse for better performance.
Due to the presence of preceding features, some key benefits of the BI solution for Medidata are
as follows:
Medidata can now validate those database attributes, which were incorrect in some situations
but not in others.
Medidata can now determine inconsistencies in their database.
Medidata also became able to resolve various issues related to data integrity.
The BI solution has resolved 60% of the validity concerns faced by Medidata. Not only has the BI solution for data visualisation benefitted Medidata, it has also proved useful for Medidata's clients:
It helped clients by increasing their ability to take data-driven actions.
It helped clients in identifying and understanding their Key Performance Indicators (KPIs).
It has provided dashboards to clients which include KPIs, such as workflow performance, in
addition to the ratio of workflow outstanding tasks, grouped by department.
By making information available quickly, it enabled clients' decision-makers to regulate resources in real time, adjust task execution time in different scenarios, and finally improve the ratio of overdue tasks.
“While using Dundas BI, I found I was able to accelerate the time-to-market of my BI projects. The
usability, self-contained management and the easy way that, in a blink, I could see and analyse data
from various sources were a great and awesome surprise!” – Luis Silva, Senior BI Consultant, Medidata.
Questions
1. Why is data visualisation important for companies?
(Hint: Timely action, resource allocation, etc.)
2. What should be the features of a good data visualisation solution?
(Hint: Highly interactive, data-driven alerts, etc.)
L A B E X E R C I S E
In this lab, you will learn to use the Tableau tool for data visualization.
LAB 1
1. Load the data in Tableau by selecting the file format. Open the Tableau and click the file
formats such as xls, csv, txt, etc. Figure 8.14 shows the dataset in Tableau:
2. Analyze the data that you have loaded in Tableau. For analyzing the data, you need to go to
the worksheet of the Tableau. Figure 8.15 shows the worksheet of Tableau:
Figure 8.17 shows the total population of the persons and the population of males between 0 and
6 years:
FIGURE 8.17 Total Population of the Persons and the Population of Males between 0 and 6 Years
Solution (b): To plot a graph to show the data in histogram, perform the following steps:
FIGURE 8.18 Crime Data of San Diego
In Figure 8.19, the horizontal axis shows the cities and the vertical axis shows the count of the crime.
3. Now, click on the Marks pane and click on "show marks".
4. Analyze the data on the horizontal bars and the labels on the vertical axis. This analysis gives you an idea of which city is represented by which horizontal bar. Figure 8.19 shows the labels with the horizontal bars:
FIGURE 8.19 Labels and Associated Horizontal Bars of Histogram
Solution (c): A treemap is a graphical representation of a dataset in rectangular boxes. These boxes are associated with each other. Now, perform the following steps to show data in a treemap:
1. Load the data in Tableau.
2. Select the measure that has to be analyzed. In this case, we have selected the "sum of crimes in night" in a city and community. Then, drag the measure "sum of crimes in night" to the column bar, and the "City" measure as well as the "Community" measure to the row bar.
3. Select the Treemap from the “Show Me” panel. Figure 8.20 shows the “Show Me”, indicating
the Treemap:
FIGURE 8.21 Treemap in Tableau
CHAPTER
9
Optimization
Topics Discussed
Introduction
Optimization Process
Linear Programming
Standard Form
Augmented Form
Integer Linear Programs
Multi-criteria Optimization
IM
Goal programming
Analytic Hierarchy Process (AHP)
Consistency in AHP Analysis
Sensitivity Analysis
INTRODUCTION
Optimization33 refers to a technique used to make decisions in organizations and analyze physical
systems. In mathematics, an optimization problem is defined as the problem of determining
L
the best solution from the set of all possible solutions. Optimization also helps in finding the highest
or lowest value of a function within some constraints. In other words, optimization determines the
best possible value for a function from a given domain.
In this chapter, you first learn about the optimization process. Next, you learn about linear programming. Then, the chapter explains multi-criteria optimization. Towards the end, it discusses goal programming and the Analytic Hierarchy Process (AHP).
OPTIMIZATION PROCESS
The optimization process is widely used in fields such as physics and computing, where it is commonly known as energy optimization. Consider a linear objective function h(x) whose domain is the set of real numbers R. A maximum optimal solution occurs at a point x0 where h(x0) ≥ h(x) for all x in R, and a minimum optimal solution occurs where h(x0) ≤ h(x) for all x in R. Optimizing a function involves determining the function’s absolute extrema by using either the first derivative test or the second derivative test.
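As a minimal sketch of these ideas (the quadratic below is an illustrative choice, not a function from the text), the extremum found by the derivative tests can also be located numerically:

```python
# Sketch: locating the extremum of h(x) = (x - 2)**2 + 1 numerically.
# The first derivative test gives h'(x) = 2*(x - 2) = 0, so x0 = 2 and h(x0) = 1;
# h''(x) = 2 > 0 confirms this point is a minimum.
from scipy.optimize import minimize_scalar

h = lambda x: (x - 2) ** 2 + 1
res = minimize_scalar(h)       # Brent's method for scalar minimization
print(res.x, res.fun)          # approximately 2.0 and 1.0
```

The numerical result agrees with the analytical derivative test, which is a useful sanity check when the objective is too complicated to differentiate by hand.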
There are three steps involved in an optimization process. The first step is the creation of an appropriate model. Modeling refers to the process of determining and stating the objectives, variables and constraints of a problem in mathematical terms. An objective is a measure of the system’s performance that we want to increase or decrease; for example, in the manufacturing sector, we want to increase profit and decrease the production cost of goods. The variables are the constituents of the system whose values are to be determined; in manufacturing, the variables can be the time taken by each resource or the time spent in each process. The constraints are restrictions on the values of the variables, determining which values the variables can take; in manufacturing, for instance, the consumption of resources cannot exceed the amount of resources available.
The second step is to determine the category of the developed model in the optimization process.
It is important to find the category of the model because the algorithms used to solve problems
in optimization can then be customized accordingly. Different types of optimization problems are
Integer Linear Programming, Linear Programming (LP), Bound Constrained Optimization, Nonlinear
Programming, Nonlinear Equations, Nonlinear Least-Squares Problems, etc.
The third and final step in optimization involves the selection of software suitable for solving the optimization problem. Optimization software is of two types: solver software and modeling software. The solver34 software concentrates on determining a solution for a particular instance of an optimization model, which acts as an input for the software; on the basis of this input, the software returns a result. The modeling software, on the other hand, is used for formulating the optimization model and analyzing its output. It takes the optimization problem in symbolic form as input and displays the result in a similar manner; the conversion to and from symbolic form is handled internally by the modeling software. The symbolic form of the problem or result is written in a modeling language, which may depend upon the modeling software. Modeling software packages vary in their support for importing data, invoking solvers, processing results and integrating with larger applications. Modeling and solver packages are often bundled together for marketing or operational purposes, and sometimes the distinction between the two becomes blurred.
LINEAR PROGRAMMING
Linear programming35, also known as linear optimization, is a method for achieving the best outcome, such as maximum revenue or lowest cost, in a mathematical model. A mathematical model describes a system using mathematical language and concepts; in linear programming, the requirements of the model are specified by linear relationships. Linear programming is a special case of mathematical optimization: it optimizes a linear objective function subject to linear equality and linear inequality constraints. A simple linear program involving two variables and six inequalities can be represented using a two-dimensional polytope. Figure 9.1 represents such a convex polytope or convex polyhedron:
The objective function over a polytope is a real-valued linear function. A linear programming algorithm determines a point on the polytope at which this function attains its smallest or largest value, provided such a point exists. The linear cost function is denoted by an arrow and a straight line on the polyhedron. Linear programs are problems that can be expressed in the canonical form in the following manner:

minimize dTy
subject to Py ≤ b
and y ≥ 0
In the preceding canonical form, y denotes the vector of variables to be determined, d and b are known vectors, P is the matrix of coefficients, and T denotes the transpose of a matrix. The expression to be maximized or minimized is known as the objective function; here, dTy is the objective function. The inequalities Py ≤ b and y ≥ 0 are the constraints and specify the convex polytope over which the objective function is to be optimized.
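The canonical form maps directly onto off-the-shelf solvers. As a sketch (the coefficient values below are illustrative assumptions, not data from the text), `scipy.optimize.linprog` minimizes dTy subject to Py ≤ b and y ≥ 0:

```python
# Sketch: solving  min d^T y  subject to  P y <= b,  y >= 0
# with SciPy's linprog. The numbers are illustrative, not from the text.
from scipy.optimize import linprog

d = [-1, -2]            # objective coefficients (minimize d^T y)
P = [[1, 1],            # constraint matrix
     [1, 3]]
b = [4, 6]              # right-hand side

res = linprog(c=d, A_ub=P, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x, res.fun)   # optimal point and optimal objective value
```

Because the feasible region here is a polytope with a handful of vertices, the optimum can be checked by hand: evaluating dTy at each vertex confirms the solver's answer.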
You can apply linear programming in various fields, such as mathematics, business, economics and engineering. Linear programming models are used in industries such as manufacturing, energy, transportation and telecommunications. The linear programming model has proved useful in diverse problems arising in routing, planning, scheduling, design and assignment within company management.
Standard Form
The standard form of a linear programming problem comprises three parts, which are as follows:
1. A linear objective function to be maximized, for example:
f(y1, y2) = b1y1 + b2y2
2. Problem constraints, for example:
a11y1 + a12y2 ≤ c1
a21y1 + a22y2 ≤ c2
a31y1 + a32y2 ≤ c3
3. Non-negativity constraints, for example:
y1 ≥ 0
y2 ≥ 0
The preceding problem can be expressed in matrix form as:
max{bTy | Py ≤ c ∧ y ≥ 0}
Consider the example of a gardener who has K km2 of garden area in which he wants to plant roses, marigolds or a combination of both. The problem is that the gardener has only limited amounts of fertilizer and pesticide, A and B kilograms respectively. Every km2 of roses needs A1 kg of fertilizer and B1 kg of pesticide, while every km2 of marigolds needs A2 kg of fertilizer and B2 kg of pesticide. Let R be the selling price of roses per km2 and M the selling price of marigolds per km2. If the areas planted with roses and marigolds are denoted by y1 and y2, respectively, then the profit can be maximized by picking optimal values for y1 and y2. This problem can be stated in the standard form as:

Maximize Ry1 + My2 (i.e., [R M][y1 y2]T)
Subject to:
y1 + y2 ≤ K (area constraint)
A1y1 + A2y2 ≤ A (fertilizer constraint)
B1y1 + B2y2 ≤ B (pesticide constraint)
y1, y2 ≥ 0
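With illustrative numbers substituted for the symbols (K = 10, A = 30, B = 18, A1 = 4, A2 = 2, B1 = 2, B2 = 1, R = 300 and M = 100 are assumptions, not values from the text), the gardener problem can be solved with `scipy.optimize.linprog` by negating the objective, since linprog minimizes:

```python
# Gardener LP sketch with assumed numbers: maximize R*y1 + M*y2
# subject to y1 + y2 <= K, A1*y1 + A2*y2 <= A, B1*y1 + B2*y2 <= B, y >= 0.
from scipy.optimize import linprog

K, A, B = 10, 30, 18           # area, fertilizer, pesticide (assumed)
A1, A2, B1, B2 = 4, 2, 2, 1    # per-km2 requirements (assumed)
R, M = 300, 100                # selling prices (assumed)

res = linprog(c=[-R, -M],                       # negate: linprog minimizes
              A_ub=[[1, 1], [A1, A2], [B1, B2]],
              b_ub=[K, A, B],
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)         # optimal planted areas and maximum profit
```

With these particular numbers the fertilizer constraint binds first, so the solver plants only the more profitable flower; changing the prices or resource amounts shifts the optimum to a mixed planting.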
Augmented Form
Linear programming problems can also be represented in an augmented form, in which the problem is written as a block matrix. This form introduces non-negative slack variables to replace the inequalities in the constraints with equalities. The block matrix form for maximizing m is as follows:

[ 1  −dT  0 ] [ m ]   [ 0 ]
[ 0   P   I ] [ y ] = [ c ]
              [ t ]

where t denotes the vector of slack variables, I is the identity matrix, and y, t ≥ 0.
The gardener example discussed in the standard form can be represented in the augmented form as:

Maximize: Ry1 + My2 (the objective function)
Subject to:
y1 + y2 + y3 = K (augmented constraint)
A1y1 + A2y2 + y4 = A (augmented constraint)
B1y1 + B2y2 + y5 = B (augmented constraint)
y1, y2, y3, y4, y5 ≥ 0

Here y3, y4 and y5 are non-negative slack variables: y3 represents the unused area, y4 the quantity of unused fertilizer and y5 the quantity of unused pesticide. In block matrix form, the gardener problem becomes:

Maximize m subject to

[ 1  −R  −M  0  0  0 ] [ m  ]   [ 0 ]
[ 0   1   1  1  0  0 ] [ y1 ]   [ K ]
[ 0  A1  A2  0  1  0 ] [ y2 ] = [ A ]
[ 0  B1  B2  0  0  1 ] [ y3 ]   [ B ]
                       [ y4 ]
                       [ y5 ]
Integer Linear Programs
In an integer linear programming problem, we generally minimize a linear cost function over n-dimensional vectors y subject to a set of linear equality, inequality and integrality constraints on all or some of the variables in y.
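As a sketch of what such a problem looks like computationally (the coefficients below are illustrative assumptions), a tiny integer linear program can be solved by brute-force enumeration over a bounded grid, since the integrality constraints make the feasible set discrete:

```python
# Brute-force sketch of a tiny integer linear program (illustrative numbers):
# minimize 3*y1 + 4*y2  subject to  y1 + 2*y2 >= 4,  3*y1 + y2 >= 6,
# with y1, y2 non-negative integers.
from itertools import product

def solve_ilp():
    best = None
    for y1, y2 in product(range(11), repeat=2):    # bounded search grid
        if y1 + 2 * y2 >= 4 and 3 * y1 + y2 >= 6:  # feasibility check
            cost = 3 * y1 + 4 * y2
            if best is None or cost < best[0]:
                best = (cost, y1, y2)
    return best

print(solve_ilp())   # (optimal cost, y1, y2)
```

Enumeration is only workable for toy instances; real integer programs are solved with branch-and-bound or cutting-plane methods inside dedicated MILP solvers.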
Knowledge Check 9.2
Discuss the augmented form of linear programming.
MULTI-CRITERIA OPTIMIZATION
In multi-criteria optimization36, optimization problems involving more than one objective function are considered simultaneously. Multi-criteria optimization is also known as multi-objective programming, vector optimization, multi-objective optimization, multi-attribute optimization or Pareto optimization. It finds application in various fields, such as economics, engineering, finance and logistics, where optimal decisions must be taken in the presence of two or more conflicting objectives. For example, while developing a new component, you may need to decrease its weight while increasing its strength. Another example of conflicting objectives arises while buying a car: you want maximum comfort at minimum cost, and maximum performance while minimizing fuel consumption and smoke emission. There are only two objectives in each of the preceding scenarios, but there can be scenarios with more than two conflicting objectives.
Multi-criteria optimization problems are not easy to solve, because the objectives being optimized usually conflict with each other; it is therefore very difficult to find a solution that satisfies all the objectives from a mathematical perspective. Analytical and classical numerical methods require complex mathematical calculation, but there are intelligent optimization algorithms capable of determining a global optimal solution.
Optimization techniques usually assume that the constraints defined in the model are hard constraints, which cannot be violated. In some cases, however, these restrictions may be too restrictive. There may be situations in which we set several goals for a given project, not all of which are hard constraints. Such projects can be modeled more precisely around goals instead of hard constraints. This methodology of analyzing and modelling a given problem to find its solution is called goal programming, which is discussed in the next section.
GOAL PROGRAMMING
A situation in which we have to deal with more than one objective (goal), where the objectives conflict with each other but we still have to reach a decision that takes all of them into account, is a scenario of multi-criteria decision making. With respect to objectives, goal programming differs from linear programming in that linear programming deals with the optimization of a single linear objective.
The methods of goal programming, however, do not produce a complete answer: whenever multiple goals compete with each other, it is complex to conduct a purely objective analysis that would produce an optimal solution. There always exists a subjective component in the analysis that reflects the preferences of the decision maker. Hence, it becomes crucial for the decision maker to reach a compromise keeping all the objectives in context. The primary concept behind goal programming is to specify a numeric goal for each objective and to devise a solution that satisfies all the given constraints while minimizing the weighted sum of deviations from the specified goals. Here, the weights on the deviations define the relative significance of each objective function. The advantage of goal programming is that it can handle relatively large numbers of goals while remaining simple and easy to use. This makes it applicable across a large domain of applications, whether a problem reduces to a single linear program or to a series of connected linear programs (as in lexicographic variants).
In goal programming, the objective is to minimize W, the weighted sum of deviations with respect to the predefined goals; the weights can also be interpreted as penalty weights for missing a goal. In mathematical programming terms, all the individual objective functions and constraints of the problem are linear, and the objective can be stated as:

Min W = Σ(i=1 to n) wi(Di+ + Di−)

where Di+ and Di− denote the positive and negative deviations from goal i.
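This formulation can be sketched as an ordinary linear program: each goal becomes an equality with deviation variables Di+ and Di−, and the objective is the weighted deviation sum. The goals, weights and hard constraint below are illustrative assumptions, not data from the text:

```python
# Goal programming sketch as an LP with deviation variables. Goals:
#   G1: x1 + x2 = 10 (weight 4),  G2: 2*x1 + x2 = 12 (weight 1),
# hard constraint x1 + 2*x2 <= 14 (cannot be violated).
# Variable order: [x1, x2, d1p, d1m, d2p, d2m].
from scipy.optimize import linprog

c = [0, 0, 4, 4, 1, 1]                  # minimize weighted deviation sum W
A_eq = [[1, 1, -1, 1, 0, 0],            # x1 + x2 - d1p + d1m = 10
        [2, 1, 0, 0, -1, 1]]            # 2*x1 + x2 - d2p + d2m = 12
b_eq = [10, 12]
A_ub = [[1, 2, 0, 0, 0, 0]]             # hard constraint row
b_ub = [14]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 6)
print(res.x[:2], res.fun)               # decision variables and minimal W
```

Because both goals cannot be met simultaneously under the hard constraint, the solver satisfies the heavily weighted goal exactly and absorbs the remaining shortfall in the cheaper deviation variable.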
ANALYTIC HIERARCHY PROCESS (AHP)
If you want to take decisions regarding the priorities of defined criteria, you should consider the Analytic Hierarchy Process (AHP). AHP involves a set of evaluation criteria along with a set of alternatives from which the decision is to be made. It does not guarantee that the option that is optimal under every single criterion will be chosen.
Rather, the option that strikes the best balance of optimization among all the criteria is deemed most suitable. A weight is assigned to each evaluation criterion based on the decision maker’s pairwise comparisons of the criteria; this weight is directly proportional to the importance of the corresponding criterion. In the next step, AHP assigns a score to each option under each criterion, based on the decision maker’s pairwise comparisons of the options with respect to that criterion. Combining the criteria weights and the corresponding option scores yields a global score for each option, followed by a ranking. The magnitude of a score reflects the performance of the option on the given criteria and in the overall ranking. The global score of an option equals the weighted sum of the scores it obtained under all the criteria. The fundamental judgement scale used to assign individual scores is given in Table 9.1:
TABLE 9.1: The Fundamental Judgement Scale

Score   Verbal judgement
3       Moderate importance
4       Moderate plus
5       Strong importance
6       Strong plus
7       Very strong
8       Very, very strong
9       Extremely important
Let’s take an example to understand how this hierarchy design works. Suppose you want to buy a house. This decision involves a lot of prioritization and future planning regarding the property. Some people may prefer cost over location; others may prefer the resale value the house will provide 20 years after the date of acquisition. Let’s see how we can apply AHP analysis to this decision-making problem to find an optimal solution.
The first step in the AHP analysis process is building the decision hierarchy. This step generates an overall picture of the decision process, hence it is also called decision modelling. Figure 9.2 shows the hierarchy for our example:

Level 1: Goal
Level 2: Criteria
Level 3: Alternatives

In our example, the first level depicts the goal that we want to achieve, the second level contains the criteria or constraints that we want to apply for decision making, and the third level shows the alternatives that are available to us with respect to our criteria and goal. This hierarchy design presents the problem in a structured manner, which helps the decision maker think and analyze clearly. This step also involves the participation of domain experts, who observe and ensure that all possible criteria (and subcriteria) and alternatives have been included.
Now that our objectives and alternatives are clear, we must understand that not every criterion can be given equal priority. Also, every individual has different needs, which are reflected in his or her decision making. In the second step, we assign weights to the individual criteria using pairwise comparisons and the fundamental judgement scale proposed by Saaty (2012). A performance matrix is designed to record these pairwise comparisons. To fill the performance matrix: suppose your preference for cost is very strong as compared to location; then the (cost, location) cell gets the value 7 and the (location, cost) cell gets the value 1/7. Similarly, if your preference for resale value is moderate as compared to location, the (resale value, location) cell gets a value of 3 and the corresponding (location, resale value) cell gets a value of 1/3. It should be intuitive that the diagonal cells, e.g. (location, location), have a value of 1. The filled-up performance matrix looks as shown in Table 9.2:
TABLE 9.2: Performance Matrix with Judgement Scores

Buying a home   Cost    Location   Resale value
Cost            1       7          3
Location        1/7     1          1/3
Resale value    1/3     3          1
Sum             1.476   11.00      4.33
Now, to calculate the overall weights, we can use either the exact method or the approximate method. Keep in mind that the approximate method provides accurate results only when the comparison matrix has low inconsistency. From the normalized matrix, we can simply average each row to obtain the final priorities, as shown in Table 9.4. For example, the final priority for resale value can be calculated as (0.226+0.273+0.231)/3 = 0.243. According to the results in Table 9.4, the decision maker has given the highest priority to cost (66.9%) in his/her purchasing decision, followed by resale value (24.3%) and finally location (8.8%). It is important to observe that these pairwise priorities let the decision maker compare one criterion with another, one pair at a time, so that a correct overall prioritization can be derived for every criterion. The calculated priorities are also mathematically valid, as they are derived from ratio calculations, and are thus intuitive in nature.
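The approximate method just described (normalize each column of the comparison matrix, then average across each row) can be sketched with NumPy, reproducing the priorities quoted above:

```python
# Approximate AHP priorities from the Table 9.2 pairwise comparison matrix:
# normalize each column, then average across each row.
import numpy as np

A = np.array([[1.0, 7.0, 3.0],       # Cost vs (Cost, Location, Resale value)
              [1/7, 1.0, 1/3],       # Location
              [1/3, 3.0, 1.0]])      # Resale value

normalized = A / A.sum(axis=0)       # divide each column by its sum
priorities = normalized.mean(axis=1) # row averages give the priority vector
print(priorities.round(3))           # cost, location, resale value
```

The rounded output matches the priorities in the text: cost 0.669, location 0.088, resale value 0.243.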
Consistency in AHP Analysis
The consistency of the judgements is checked using the consistency ratio, CR = CI/RI, where CI = (λmax − n)/(n − 1) is the consistency index computed from the principal eigenvalue λmax of the n × n comparison matrix, and RI is the random index, the average consistency index of randomly generated matrices of the same order. According to the standard definitions set by Saaty (2012), if the value of CR is greater than 0.10, the judgements made must be revised to find and resolve the issues that are causing the inconsistency.
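As a sketch, the consistency ratio for the Table 9.2 matrix can be computed from its principal eigenvalue, using CI = (λmax − n)/(n − 1) and Saaty's standard random index RI = 0.58 for a 3×3 matrix:

```python
# Sketch: consistency ratio for the Table 9.2 comparison matrix.
import numpy as np

A = np.array([[1.0, 7.0, 3.0],
              [1/7, 1.0, 1/3],
              [1/3, 3.0, 1.0]])

n = A.shape[0]
lam_max = max(np.linalg.eigvals(A).real)  # principal eigenvalue
CI = (lam_max - n) / (n - 1)              # consistency index
RI = 0.58                                 # Saaty's random index for n = 3
CR = CI / RI
print(round(CR, 3))                       # well below the 0.10 threshold
```

For this matrix the CR comes out far under 0.10, so the house-buying judgements would not need to be revised.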
Sensitivity Analysis
The final priorities are driven by the weights given to the criteria. However, it is important to know how the results would change if a given criterion had a different priority. The process of finding answers to such “what-if” questions is called sensitivity analysis. By performing sensitivity analysis, we can test the robustness of our decision as well as identify the key factors that influence the decision maker. Sensitivity analysis is an integral part of the whole decision-making process, and no final conclusion about the decision should be drawn without performing it.
Readers may notice that in the house-buying example, the highest priority was given to cost, so the decision maker would finally go with the cheapest house available. But what if the decision maker changes his priorities? How would that affect the decision, or the property he is going to buy? One way to perform sensitivity analysis is to change the weight of a given criterion and then redo the decision modelling on the new set of information.
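As a sketch of such a what-if analysis (the per-criterion option scores below are hypothetical, since the chapter does not list them; the weights follow the Table 9.4 priorities), we can vary the criterion weights, renormalize them, and watch whether the best option changes:

```python
# Hypothetical sensitivity-analysis sketch. Option scores per criterion are
# invented for illustration; weights follow the priorities derived in the text.
import numpy as np

weights = np.array([0.669, 0.088, 0.243])   # cost, location, resale value

# Rows: houses H1..H3; columns: score under each criterion (hypothetical).
scores = np.array([[0.7, 0.2, 0.3],   # H1: cheap, poor location
                   [0.2, 0.5, 0.5],   # H2: balanced
                   [0.1, 0.3, 0.2]])  # H3

def best_option(w):
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                   # renormalize the weights
    return int(np.argmax(scores @ w)) # index of the highest global score

print(best_option(weights))           # baseline ranking
print(best_option([0.2, 0.3, 0.5]))   # what if cost matters much less?
```

Under the baseline weights the cheap house wins, but down-weighting cost flips the decision to the balanced house, which is exactly the kind of robustness check sensitivity analysis is meant to provide.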
Summary
This chapter first discussed the optimization process and linear programming. Next, you learned about multi-criteria optimization. Further, it described goal programming and the Analytic Hierarchy Process.
Exercise
Multiple-Choice Questions
Q1. MILP stands for
a. Mixed Integer Linear Programming
b. Mixed Integer Linear Program
c. Mixed Integrity Linear Programming
d. All of these
Q2. An example of no-preference method is
a. Lexicographic method
b. Utility function method
c. Goal programming method
d. Multiobjective proximal bundle method
Q3. MINLP stands for
a. Mixing Integer Nonlinear Programming Problem
b. Mixed Integer Nonlinear Programming Problem
c. Mixed Integral Nonlinear Programming Problem
d. All of these
Q4. Which of the following is a verbal equivalent for a judgement score of 5?
a. Equally important b. Slightly important
c. Strongly important d. Extremely important
Q5. Level 2 of decision modelling in the decision hierarchy depicts:
a. Goal b. Criteria
c. Alternatives d. Sensitivity
Assignment
Q1. What do you understand by convex polytope in linear programming?
Q2. Discuss integer linear programs.
Q3. Enlist the steps involved in interactive methods.
Q4. Define consistency. What do you understand by consistency ratio?
Q5. Discuss the steps involved in analytic hierarchy process.
References
https://www.sciencedirect.com/science/article/pii/S0898122111010406
https://www.igi-global.com/dictionary/pareto-optimal-solution/21879
http://www.superdecisions.com
Project management is a field that deals with planning, scheduling, monitoring, and controlling projects. In the planning phase, the goals and objectives of the project are clearly defined. In the scheduling phase, the time and sequence interdependencies between project activities are determined. The monitoring and control phase involves dealing with unexpected events that may disturb the time and budget schedules. Usually, all projects have their own specific objectives. However, some of the most common objectives for all projects include:
Completing the project on time

Goal programming is based on the linear programming model, and there are various algorithms for solving it.
In this case, Seror applied goal programming for allocating time and cost in a construction project.
For applying goal programming in this case, Paul Kizito Mubiru (Head of Section for Industrial
Engineering and Management at Kyambogo University) proposed a goal programming model for
allocating time and cost in a project. The goal programming problem was formulated as follows:
Minimize Z = Σ(k=1 to 3) Σ(i=1 to 3) Pk(i)(Dk+ + Dk−)

Subject to, for each project i (i = 1, 2, 3):

Σ(j=1 to 3) Xij − D1+ + D1− = Ti
Σ(j=1 to 3) CijXij − D2+ + D2− = TCi
Σ(j=1 to 3) AijXij − D3+ + D3− = LMCi

where:
j = 1, 2, 3 refers to the three distinctive phases each project must complete successfully
Ti = Total time of completion
The project data table listed, for each project, its action plan, monthly labor and material costs, and monthly total costs.
CASE STUDY
The project team also estimated the total allocations and project durations of the three projects, as shown in Table 9.6:
For each project, there were different priorities related to time and cost. These priorities were identified by the project team as follows:

Project 1 Priorities:
P1(1): Complete the project in 6 months.
P2(1): Total project expenditure should not exceed 89893286.53 Algerian dinars.
Project 2 Priorities:
P1(2): Complete the project in 8 months.
P2(2): Total project expenditure should not exceed 105351315.31 Algerian dinars.

Project 3 Priorities:
P1(3): Complete the project in 11 months.
P2(3): Total project expenditure should not exceed 129971789.61 Algerian dinars.
The project team formulated the goal programming problem for each project in order to allocate time to each phase of the project, i.e., planning, scheduling, and monitoring and control. This was important for achieving the time and total expenditure goals.
Project 1:
subject to:
X11 + X12 + X13 − D1+ + D1− = 6
X11, X12, X13, D1+, D1−, D2+, D2−, D3+, D3− ≥ 0

Project 2:
subject to:
X21 + X22 + X23 − D1+ + D1− = 8
2016867.27X21 + 63867463.46X22 + 1344578.18X23 − D3+ + D3− = 75968667.06
X21, X22, X23, D1+, D1−, D2+, D2−, D3+, D3− ≥ 0

Project 3:
subject to:
X31 + X32 + X33 − D1+ + D1− = 11
X31, X32, X33, D1+, D1−, D2+, D2−, D3+, D3− ≥ 0
The project team used the LINDO software for goal programming, and the results were as follows:

Project 1:
X11 = Time allocated for planning = 0 months
P1(1) = The goal of completing project 1 on time is fully achieved, because X11 + X12 + X13 = 0 + 1.09 + 4.91 = 6 months. However, this solution is illogical because, in the absence of a planning phase, the project fails.
P2(1) = The goal of keeping total project expenditure within the budgeted amount is partially achieved.
Project 2:
X23 = Time allocated for scheduling = 6.96 months
P1(2) = The goal of completing project 2 on time is fully achieved, because X21 + X22 + X23 = 0 + 1.04 + 6.96 = 8 months. However, this solution is illogical because, in the absence of a planning phase, the project fails.
P2(2) = The goal of keeping total project expenditure within the budgeted amount is not fully achieved.
Project 3:
X33 = Time allocated for scheduling = 10.02 months
P1(3) = The goal of completing project 3 on time is fully achieved, because X31 + X32 + X33 = 0 + 0.98 + 10.02 = 11 months. However, this solution is illogical because, in the absence of a planning phase, the project fails.
P2(3) = The goal of keeping total project expenditure within the budgeted amount is partially achieved.
An analysis of the goal programming results suggests that the model provides satisfactory levels of
achievement for managing three projects with preemptive goals.
Source: http://yujor.fon.bg.ac.rs/index.php/yujor/article/view/486/377
Questions
1. List some of the most common objectives for a project.
(Hint: Some of the common objectives for any given project can be meeting the profit goal,
meeting the manpower goal etc.)
2. Comment on the value of Xij for all the three projects during the planning phase.
(Hint: Xij represents the time allocated for project i during phase j. Therefore, X11, X21, and X31
represent the time allocated for projects 1, 2 and 3 during the planning phase, respectively. In
the given goal programming, solution value (0) of time allocated for planning is illogical and
impractical because without the planning phase, a project fails.)
CHAPTER
10
Need for a System Management – KPI
Topics Discussed
Introduction
Need for a System Management
Data Quality
Business Metrics
Key Performance Indicators (KPIs)
Types of KPIs
KPI Software
Chapter Objectives
After completing Chapter 10, you should be able to:
INTRODUCTION
System management37 signifies the centralized management of the Information Technology (IT) assets of an organization. It not only manages the IT assets but also tackles and resolves problems related to them. A number of system management solutions are available that help small and big organizations address requirements such as monitoring the organizational network, managing servers, monitoring data storage, and handling organizational and client devices such as printers, laptops and mobile phones. System management also includes sending or generating notifications in case of failures, data capacity issues and other events taking place over a network. Effective system management also handles compliance issues and is capable of enforcing company policies on employees regarding the usage of the organization’s IT assets. Like system management, Key Performance Indicators (KPIs) are also used by organizations for achieving their business goals.
KPIs38 are used by organizations to measure progress towards business goals in order to check performance and determine whether the organization is on a successful track or whether improvement is required. KPIs vary from organization to organization, as some organizations focus on certain aspects of the business while others focus on different aspects. Each department of an organization might also have different KPIs for tracking its specific goals. KPIs help an organization not only in tracking its goals but also in determining the health of its practices so as to get the best results. KPIs can also be used outside the company. For example, an organization can agree on particular KPIs with its customers while creating a contract with them; the agreed KPIs help both the organization and the customer track the success of the contract now and in the future. Sometimes, KPIs used by one department of an organization are useful for another department of the same organization. With the help of KPI-tracking software, companies can view the results of their KPIs on a single dashboard in real time.
This chapter first discusses the need for system management and data quality. Next, it explains business metrics and their importance. Further, it discusses the KPI solution and types of KPIs. Towards the end, it discusses the benefits of KPI software.

NEED FOR A SYSTEM MANAGEMENT
When the business of an organization grows, its IT requirements grow with it. It is very hard to find an organization that does not depend upon IT for its business. Therefore, it becomes very important for an organization to effectively manage and safeguard its IT assets. System management is an umbrella term that covers many management solutions. For example, in order to keep systems running, an organization uses management solutions such as service desk management and patch management. Sometimes, an organization also uses single sign-on authentication as a management solution to authenticate its employees when they access organizational resources, protecting those resources from unauthorized access. These management solutions help an organization enhance the productivity of its IT assets and employees. They also help an organization develop new software solutions efficiently and upgrade existing software. System management solutions also help protect an organization against the following:

Fallout from downtime or threats caused by improper functioning of systems
Network sabotage
Power outages
If any of the previously stated events occurs in an organization, it may lead to any of the following:

Financial loss
Damage to the brand
Legal liabilities
[Figure: System management encompasses automated backup and restore, hardware asset inventory and configuration, security, application software, and software asset inventory]
Source: smallbusinesscomputing.com
System management solutions also handle hardware inventory and configuration. The configuration of hardware devices includes the firmware present in them and the type of software installed on them. System management also focuses on the security of IT assets through the installation of anti-virus software on devices and its constant updating over time. Besides device security, system management also focuses on the security of data stored on the devices, emphasizing the back-up and restoration of data in case of storage device failure. Data is restored from the central data repository or from the location where it was backed up. Loss of good-quality data may lead to huge financial losses for an organization. The quality of data is judged on several dimensions, including completeness, consistency, conformity, accuracy, integrity and timeliness. It is not wrong to conclude that system management helps in achieving organizational goals by managing the IT assets of an organization in every possible manner.
Consider the example of a small organization that has a small number of computers. For this organization, a system management solution may require more investment of money and time than managing each system individually. However, in the absence of system management, small and medium organizations are vulnerable to security risks. Small and medium organizations can follow these tips to safeguard themselves to a certain extent:
Proper assessment of bottlenecks, gaps and vulnerabilities residing in an organization’s IT
set-up.
Search for vendors and solutions that are helpful in addressing the immediate IT issue, but that are also capable of providing help that the organization might require after a certain period of time.
Search for vendors that may provide customized solutions as per your budget in managing the
IT assets. Some popular vendors for small and medium organizations may include Dell KACE, HP
Insight Manager, IBM Service Manager for Smart Business, etc.
Evaluate the packages or offerings of the vendors that suit your organizational needs and the level of support the vendors can provide. For example, the best vendor may be the one that can provide a high level of IT expertise and 24/7 support at a lower price as compared to other vendors.
Discuss the need of system management in an organization.
DATA QUALITY
The quality of data is extremely important for an organization. Bad-quality data produces inaccurate results and slows down the decision-making process in an organization. Therefore, improving data quality is of utmost importance for an organization in order to take accurate decisions; clearing bad data out of the available data helps improve its quality. You can be successful in your business only when you make the right decisions, and the right decisions can be taken only when the right information is available. The availability of the right information also makes the decision-making process faster.
The organizational data is mainly stored in a data warehouse. Business intelligence solutions are used by organizations to access the data from the data warehouse and gain better insight into their business at any point of time. The data accessed must be of good quality so that the executives of an organization can make faster decisions. If the data available in the data warehouse is of bad quality, then the same set of data might give inaccurate results when accessed at different intervals of time. In this case, the executives have to put on hold the decision-making process that was to be performed on the basis of the results, and have to work towards finding the cause of the differing results over the same set of data. Data quality issues might arise because of the following reasons:
Patchwork between enterprise applications and operational systems
Inconsistent (or undefined) standards and formats in the stored data
Human error while entering, editing, maintaining, manipulating and reporting the data
Self-Instructional
Material
232
Need for a System Management – KPI
In order to avoid inaccuracy in maintaining the data, business organizations have to implement a data quality strategy, which includes techniques for maintaining data quality during the business processes going on in the organization. You can conclude that data quality is all about cleaning out bad data, i.e., data that is incorrect or invalid in some way. If an organization wants to make sure that its available data is trustworthy, it has to understand the key dimensions of data quality. Organizations use the data quality dimensions to measure the level of accuracy of the data from time to time. Figure 10.2 shows the key dimensions of data quality:
FIGURE 10.2 Dimensions of the Data Quality: completeness, consistency, conformity, accuracy, integrity and timeliness
Source: smartbridge.com
Completeness: Completeness of the data refers to whether the data meets the expectations of the executives of an organization. Data is considered complete even if optional data is not present. For example, if a client's first name and last name are present in the data but the middle name is not (as it is optional to provide it), the client's record is still considered complete despite the fact that the middle name is not available in the company's database.
Consistency: Consistency of the data means that it must reflect the same information across all the systems or units of an organization. For example, an office of an organization has already been closed, but sales figures are still getting reflected for it in the database. Another example of inconsistency is an employee who left the organization many years ago but whose salary status is still reflected in the organization's database. If the employee has left the organization, then his status across all the offices of the organization must be consistent.
Conformity: Conformity of the data means that the same set of standards has been followed
for entering the data. The standard data definitions include data type, format and size of the
data. For example, there must be conformity while entering the date of birth of an employee
working in the organization. The date of birth of the employee must be entered in the ‘dd–mm–
yyyy’ format across all the offices of the organization.
DATA SCIENCE
Accuracy: Accuracy of the data reflects the degree of correctness of the data entered in the
database. For example, the profit earned or the sales figures of an organization must be entered
correctly. These profit or sales figures are mathematical values and reflect the business growth.
Therefore, these must be entered with high level of accuracy.
Integrity: Integrity of data means that the data entered in the database must have appropriate relationships, i.e., related records must be connected properly. For example, employees have several attributes, and address is one of them; therefore, a relationship must exist between addresses and employee records. If addresses exist in the database without any employee record, then these are considered orphaned records. When related records are not linked properly, this may lead to duplication of the data in the database.
Timeliness: The timeliness of the data refers to the availability of the data when it is required and expected. Timeliness largely depends on the expectations of the users. The availability of data in a timely manner is considered very important for an organization because of the following reasons:
Data must be available when an organization wants to publish its quarterly business results.
Data must be available when an organization wants to provide correct information to its clients.
Data must be available when an organization wants to check its financial activity at any point of time.
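Some of these dimension checks can be partly automated. The minimal Python sketch below flags completeness and conformity issues in a single record; the field names and the dd-mm-yyyy rule are illustrative assumptions taken from the examples above, not a standard scheme:

```python
from datetime import datetime

def check_record(record, required_fields=("first_name", "last_name", "dob")):
    """Run two simple data-quality checks on one client record (a dict)."""
    issues = []
    # Completeness: every required field must be present and non-empty.
    # Optional fields such as middle_name are ignored, as in the text.
    for field in required_fields:
        if not record.get(field):
            issues.append("incomplete: missing " + field)
    # Conformity: the date of birth must follow the agreed dd-mm-yyyy format.
    try:
        datetime.strptime(record.get("dob", ""), "%d-%m-%Y")
    except ValueError:
        issues.append("non-conforming dob: " + repr(record.get("dob")))
    return issues

print(check_record({"first_name": "Asha", "last_name": "Rao", "dob": "07-03-1990"}))  # []
print(check_record({"first_name": "Ravi", "dob": "1990/03/07"}))  # two issues reported
```

Running such a loop over all records and counting the clean ones would give completeness and conformity scores that can be tracked over time.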
BUSINESS METRICS
Business metrics are quantifiable measures used to track and assess the status of business processes and the decisions made. They also help an organization in judging its progress towards its set short-term or long-term goals. Business metrics are very important for the key stakeholders of a business, i.e., those people whose input plays a significant role in an organization.
An organization states its business metrics in its mission statements, which are nothing but the organization's communication with its customers and the general public. Sometimes, an organization also includes business metrics in its workflows. Business metrics matter to the different departments of an organization according to their interests. For example, the marketing department measures the success of the campaigns it conducts, whereas the sales department uses business metrics to track sales over a certain period of time or at any instant. Financial executives use metrics to view the overall performance of the organization in financial terms.
Business metrics have little worth without some context attached to them. An organization generally considers metrics in terms of existing benchmarks, approaches and goals. People often confuse business metrics with key performance indicators (KPIs), but there exists a fine line between them: business metrics measure performance across all business areas, whereas KPIs quantify performance in only one particular area of a business.
Figure 10.3 shows some important business metrics, which are as follows:
FIGURE 10.3 Displaying Important Business Metrics: sales revenue, customer loyalty and retention, inventory size, cost of customer acquisition, variable cost percentage, monthly profit/loss, productivity ratios, and size of gross margin
Customer loyalty and retention: This business metric measures how an organization attracts customers in order to increase its sales and retains them for long-term business profit. Long-term relationships with customers help an organization make long-term profits. An organization conducts surveys in order to get feedback from the customers. Sometimes, an organization also gets feedback from direct interaction with the customers or by performing other kinds of analyses, and uses the feedback to improve customer satisfaction while offering its products or services. Implementing the feedback helps an organization generate more loyalty among customers and retain a strong customer base.
Cost of customer acquisition: This business metric helps in assessing the cost of the money invested in the different processes implemented to acquire new customers. It is calculated by dividing the total expense made by the organization in acquiring customers by the total number of new customers over a certain interval of time.
Churn rate: This business metric is used to assess the rate at which an organization loses customers. A rising churn rate indicates an increase in the cost of acquiring customers and a decrease in the customers' long-term value to an organization.
Productivity ratios: This business metric is used to assess the productivity of the employees working in an organization. It is calculated from the total revenue generated by the employees of a particular organization, which is then compared with the productivity of the employees of another organization to gain deeper insight into the effectiveness of the employees. This metric finds its application in almost any area of business.
Size of gross margin: This metric is helpful in improving the efficiency of an organization by finding opportunities to reduce costs and increase the organization's margin on the sales of its products. It is calculated by subtracting the cost of the products sold from the total sales revenue and dividing the result by the total sales revenue; the size of the gross margin is then expressed as a percentage. The higher the value obtained, the more money the organization can spend on other costs it has incurred and the more profit it can generate.
Monthly profit/loss: This metric helps in measuring the fixed and variable operational costs paid on a monthly basis. The costs might include office rent, insurance, mortgage payments, taxes, salaries, etc.
Overhead costs: This business metric helps in assessing the fixed costs that do not depend on the production levels of products or services. Fixed costs include the salaries paid to employees and the rent paid for the usage of various services. Overhead costs are not affected by the earnings and growth of the business; therefore, they must be tracked separately.
Variable cost percentage: This business metric is used to assess the cost of goods. The cost of goods is variable as it depends upon various factors, such as the cost of raw materials, labour charges, shipping costs, and other costs related to the production or delivery of goods. Therefore, the cost of goods might increase from time to time with an increase in the charges of these different factors.
Inventory size: This business metric is used to track the inventory that is ready to be sold at any instant. It is also used to assess how much inventory will be ready for sale after a certain period of time. An organization always keeps a close eye on its inventory as it is the primary source of its income.
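Two of the metrics above are simple ratios and translate directly into code. A Python sketch with invented figures (not drawn from any real organization):

```python
def customer_acquisition_cost(total_acquisition_expense, new_customers):
    # Total spend on acquiring customers divided by the number of new
    # customers gained over the same interval.
    return total_acquisition_expense / new_customers

def gross_margin_percentage(total_sales_revenue, cost_of_goods_sold):
    # (revenue - cost of goods sold) / revenue, expressed as a percentage.
    return (total_sales_revenue - cost_of_goods_sold) / total_sales_revenue * 100

print(customer_acquisition_cost(50_000, 200))       # 250.0 spent per new customer
print(gross_margin_percentage(1_000_000, 600_000))  # 40.0 per cent
```

Computed monthly, these two numbers are exactly the kind of figures that belong on the dashboards discussed next.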
Business performance metrics play a significant role in conveying information to the organizational executives, investors and clients to make them aware of the overall organizational performance. The simplest, easiest and most effective method to assess your company's performance is to keep the key business metrics on a dashboard. Various departments of an organization keep a close eye on different metrics, so the dashboard varies from one department to another, and also from one organization to another.
Exhibit 1
Business Metrics in Decision-Making
Business metrics share many similarities with KPIs, but the property that distinguishes them is that business metrics focus on the overall development of the business, whereas a KPI provides information and details about the particular domain on which it is focused. Business metrics play a critical role in optimizing business and discovering loopholes, and they prove to be a key element in diagnosing a problem inside the organization. Consider, for example, a technology company that now also wants to start a customer-centric division, which includes direct communication with customers along with providing them various services, such as insurance and loan financing. Such a company needs a thorough evaluation of its strategy and soft points to avoid future problems. Business metrics can be a great resource for this diagnostic evaluation as they provide correct insight into the company's past records in various domains.
Assume a case where the company wants to know whether it has a breach-proof security system installed to protect the customers' data. Metrics can come in handy to show how many times the company has faced a security threat and which areas were weak. Those areas can be easily identified using metrics and the diagnosis can be narrowed down, ultimately resulting in saving time and effort. Also, using metrics, business managers can react to dynamic customer behaviour in a flexible way.
Types of KPIs
As discussed earlier, different types of KPIs are used in organizations to track business goals. An organization uses the KPIs that suit its requirements and its business strategy best. Some KPI indicators are shown in Figure 10.4:
C
Quantitative
indicators
Financial Qualitative
indicators indicators
T
Actionable Leading
indicators indicators
IM
KPIs Indicators
Directional Lagging
indicators indicators
Practical Input
indicators indicators
Output Process
indicators indicators
Leading indicators: These can be used for predicting the result of a process.
Lagging indicators: These can be used for representing the success or failure after the execution
of the business processes.
Input indicators: These are used for measuring the quantity of resources used for generating
the desired result.
Process indicators: These represent the efficiency or the productivity of the business processes.
Output indicators: These denote the outcomes of the business process activities.
Financial indicators: These are used for the performance measurement of an organization.
Some types of KPIs across different business functions in an organization are shown in Figure 10.5:
FIGURE 10.5 KPIs across Business Functions: IT operations and project execution KPIs, financial performance KPIs, human resource performance KPIs, supply chain and operational performance KPIs, and consumer insights and marketing KPIs
IT operations and project execution KPIs: These KPIs are used for tracking the execution of projects and IT operations. Some of these KPIs are as follows:
Estimate at completion: It is used to determine the actual cost required for completing the project and the cost required to complete the remaining work of the project.
Cost of managing the processes: This allows estimating the periodic cost required for managing the processes.
Financial performance KPIs: These KPIs are used to measure the financial performance of an organization. Some financial performance KPIs are as follows:
EV/EBITDA: EV stands for enterprise value and EBITDA stands for earnings before interest, taxes, depreciation and amortization. The ratio of the two helps you in analyzing the debt value of an organization.
Return on Investment (ROI): It is used to evaluate the performance of an organization by dividing net profit by net worth.
Debt-equity ratio: This ratio is also known as the risk ratio. It is used for measuring the proportion of the debt used for financing the assets of an organization to the shareholders' equity.
Operating margin: It is used to evaluate an organization's pricing strategy and its operating efficiency.
Return on Assets/Return on Equity (ROA/ROE): ROE is used to evaluate the returns generated on the money taken from the shareholders. On the other hand, ROA measures an organization's profitability relative to its assets.
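These financial ratios are one-line calculations. A hedged Python sketch with invented figures; note that ROI here follows the text's net-profit-over-net-worth definition, which is only one of several conventions in use:

```python
def ev_to_ebitda(enterprise_value, ebitda):
    # Valuation multiple used to analyze the debt value of an organization.
    return enterprise_value / ebitda

def return_on_investment(net_profit, net_worth):
    # ROI as defined in the text: net profit divided by net worth.
    return net_profit / net_worth

def debt_equity_ratio(total_debt, shareholders_equity):
    # 'Risk ratio': debt used to finance assets relative to equity.
    return total_debt / shareholders_equity

def operating_margin(operating_income, revenue):
    # Indicator of pricing strategy and operating efficiency.
    return operating_income / revenue

print(return_on_investment(net_profit=120_000, net_worth=800_000))         # 0.15
print(debt_equity_ratio(total_debt=400_000, shareholders_equity=500_000))  # 0.8
```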
Human resource performance KPIs: These are used in an organization to measure different aspects of the human resources. The different types of human resource KPIs are as follows:
Revenue per employee: It is used to evaluate the productivity of an organization's workforce. It determines the amount of sales made per employee and how effectively the human resources of an organization are being utilized.
Employee satisfaction index: It allows an organization to determine how satisfied the employees are with the organization.
Salary competitiveness ratio: It helps in gathering data about how much competitor organizations are paying their employees. Thus, it helps in determining the salary levels of an organization in comparison to other organizations.
Human capital ROI: It helps in measuring the return on the capital invested in an employee in terms of pay and benefits.
Supply chain and operational performance KPIs: The KPIs in this area are used for improving the experience of the customers of an organization. Some KPIs used in this area are as follows:
Order fulfillment cycle time: This KPI is used to measure the total time taken from ordering a product till its delivery to the customer. It helps in developing customer responsiveness towards the organization. Moreover, it also helps in determining the time required for completing a manufacturing order.
Yield: This metric is used for measuring, and thereby improving, the quality of the products. It denotes the percentage of correctly manufactured products that do not require rework or do not need to be scrapped.
Throughput: It is used to evaluate the speed of the production process on the basis of its inputs and outputs.
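The three operational KPIs above can be sketched as small functions; the dates and counts below are made up for illustration only:

```python
from datetime import datetime

def order_fulfillment_cycle_time(ordered_at, delivered_at):
    # Total time from placing the order to delivery, in hours.
    return (delivered_at - ordered_at).total_seconds() / 3600

def yield_percentage(units_produced, units_reworked_or_scrapped):
    # Share of products manufactured correctly, needing no rework or scrap.
    good_units = units_produced - units_reworked_or_scrapped
    return good_units / units_produced * 100

def throughput(units_output, hours_of_production):
    # Speed of the production process: output per hour of input time.
    return units_output / hours_of_production

print(order_fulfillment_cycle_time(datetime(2018, 3, 1, 9, 0),
                                   datetime(2018, 3, 3, 9, 0)))  # 48.0 hours
print(yield_percentage(1_000, 25))  # 97.5 per cent
print(throughput(1_000, 40))        # 25.0 units per hour
```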
Consumer insights and marketing KPIs: These KPIs are used for getting insights into customers. Some of the KPIs used in this area are:
Market growth rate: It is used for analyzing the change in the number of consumers in a particular market after a certain interval of time.
Customer satisfaction index: This KPI is used for evaluating the performance of the products and services, i.e., whether they have met or surpassed the customers' expectations.
Social networking footprint: This KPI is used for evaluating the presence of an organization on social media.
Brand equity: It is used for measuring the premium that a brand name may add to a product.
Customer lifetime value: This KPI is used to evaluate the revenue that can be generated for as long as the customer remains in a relationship with the company.
Customer acquisition cost: This KPI is used for determining the cost incurred in marketing and campaigning to acquire new customers over a certain period of time.
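Customer lifetime value is often compared against customer acquisition cost to judge whether acquiring a customer pays off. A deliberately simplified sketch with made-up figures (real CLV models also discount future revenue and subtract servicing costs):

```python
def customer_lifetime_value(avg_annual_revenue, avg_retention_years):
    # Revenue a customer generates for as long as the relationship lasts.
    # A simplification: no discounting, no servicing costs.
    return avg_annual_revenue * avg_retention_years

clv = customer_lifetime_value(avg_annual_revenue=300, avg_retention_years=4)
assumed_acquisition_cost = 120  # made-up acquisition cost for the comparison
print(clv)                             # 1200
print(clv > assumed_acquisition_cost)  # True: acquiring the customer pays off
```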
Note
There can be high-level and low-level KPIs. High-level KPIs are used for measuring the overall
performance of an organization, whereas the low-level KPIs are used to measure the performance
of a particular department of an organization.
Exhibit 2
KPI in Maintaining Business Quality
The performance of a business is never static; it always changes with market strategy and new technologies. However, performance and strategy must be compared side by side using both qualitative and quantitative methods, which together provide a bigger picture of overall growth. Key performance indicators (KPIs) prove to be beneficial in such cases, where performance measurements need to be evaluated; they are analyzed in the process of evaluating and making recommendations to improve company performance. Performance management experts also agree that cascading and designing goals across multiple people results in shared accountability and responsibility, which play a crucial role in a company's success. A company can then use its KPIs as the basis for tracking and analyzing business performance as well as for designing future strategies. Businesses such as the Lancaster Landmark Hotel group have multiple properties across the globe, but everybody is aligned with a focus on the collective measures needed to achieve individual as well as prime targets. Using KPIs, the group is now able to analyze market trends and competition and has created a performance-based environment across the organization. Using KPIs, managers and employees can easily synchronize with a 360-degree view of the projected goal and can understand how their individual goals contribute to its achievement. These indicators also create responsibility among the employees. Since the process of goal achievement is divided into many phases, managers and employees keep in touch and track each other's performance so as to keep the overall development and project deadlines on track. This also helps in creating an open communication environment, which generates quality feedback.
KPI Software
KPI software, also known as KPI solutions, is used to track the performance evaluated by several
KPIs on dashboards in real time. Different organizations provide different KPI solutions as per the
need of an organization. Different KPI solutions provide different benefits to the organizations,
some of which are as follows:
Enable users, executives and organizations to measure and handle targets and objectives.
Enable tracking of project performance and provide reporting to the stakeholders of the project.
Allow conveying updated information to the right people in the organization.
Provide a snapshot of the performance against the targets set by an organization.
Integrate all KPI data in one place easily, with no need to depend on creating spreadsheets for assessment.
Display the performance of departments residing at different locations in a single view and, thus, provide an integrated view of the organization irrespective of its geographically separated departments.
Enable an organization to view its performance transparently at any instant.
Allow an organization to assess its weaknesses and set KPIs for improving its performance.
Enable an organization to judge its operational performance.
Some popular KPI software solutions are as follows:
1. Scoro KPI dashboard: The Scoro KPI dashboard software allows an organization to see every aspect of its business on one or several dashboards. It also allows the organization to track KPIs related to projects, work and finance at any instant.
2. Datapine: It allows an organization to view and monitor most significant KPIs at a single
location. Some more features of this software include advanced analytics, automated
reporting, highly interactive dashboard and intelligent warning or alarming system.
3. Inetsoft dashboard: This is an analytical dashboard and reporting software. Some main
features of this include data modeling, online data mashup, embedding of dashboards, highly
secure infrastructure with high performance, etc.
4. Tableau: It is a good solution for organizations that have clients with a very small number of users and wish to deploy the solution across multiple organizations. Some main features of Tableau include adding users as needs grow, the ability to refresh data automatically using Web applications such as Google Analytics and Salesforce, and allowing site administrators to manage authentication and permissions for users and data.
5. SimpleKPI: It is a powerful and highly flexible KPI dashboard. It allows you to customize your dashboards and reports.
Besides the KPI software discussed above, the names of some more popular KPI software are:
Smartsheet, Bilbeo, DATAZEN, Databox, InfoCaptor, KPI Fire and Dasheroo.
Summary
This chapter first discussed the dimensions of data quality. Next, it explained business metrics and their importance. Further, it discussed KPI solutions and the types of KPIs. Towards the end, it discussed the benefits of KPI software.
Exercise
Multiple-Choice Questions
Q1. System management solutions help the organizations in protecting them against which of
the following?
a. Lost or stolen devices b. Power outages
c. Security breaches d. All of these
Q2. Which of the following is not a data quality dimension?
a. Completeness b. Consistency
c. Identity theft d. Conformity
Q3. Which of the following refers to the fact that the data entered in the database must have relationships?
a. Completeness b. Consistency
c. Integrity d. Conformity
Q4. KPIs refer to ____________.
a. Key Performance Indicators b. Key Performance Indications
c. Key Performer Indications d. None of these
Q5. Which of the following comes under financial performance KPIs?
Assignment
Q1. What do you understand by system management in an organization?
Q2. Discuss the need of system management in an organization.
Q3. Explain the different dimensions of data quality.
Q4. Discuss the importance of business metrics for an organization.
Q5. What are KPIs? Enlist the benefits of KPIs.
References
https://www.smallbusinesscomputing.com/news/article.php/3928971/What-is-Systems-Management-and-Why-Should-You-Care.htm
https://smartbridge.com/data-done-right-6-dimensions-of-data-quality-part-1/
https://www.whitepapers.em360tech.com/wp-content/files_mf/1407250286DAMAUKDQDimensionsWhitePaperR37.pdf
https://www.edq.com/uk/glossary/data-quality-dimensions/
http://dashboardinsight.com/articles/digital-dashboards/fundamentals/the-benefits-of-key-performance-indicators-to-businesses.aspx
https://www.successfactors.com/en_us/lp/articles/key-performance-indicators.html
https://blog.results.com/blog/the-benefits-of-having-the-right-kpis-key-performance-indicators
https://www.klipfolio.com/blog/17-kpi-management-data-driven-manager
CASE STUDY
MANAGING AND MEASURING QUALITY
This Case Study discusses how CQI developed a KPI system for performance management.
The Chartered Quality Institute (CQI) is a professional body that works in the field of quality
management in all the sectors. It has about 20,000 members from all over the world. It works
in the field of organizational excellence and improvement. CQI wanted to create an exemplary
performance management framework. For this, it hired Bernard Marr who works as a consultant
for companies and public concerns. Since CQI focuses on achieving excellence and improving
performance of companies, it is itself committed to continuously improve its own performance.
Marr started the process of building a performance framework by first conducting a series of
workshops. In these workshops, CQI executive management team and its employees had to identify
the important strategic goals and objectives. On the basis of inputs received from these workshops,
Marr prepared a strategy map as shown in Figure 10.6:
FIGURE 10.6 The CQI Strategy Map. At the top is the vision: 'We help organisations improve performance by placing quality at the heart of what they do', supported by the customer objective of retaining, growing and developing the membership base. Below this sit the internal processes: influencing and making the CQI's voice heard; publishing and sharing quality knowledge; providing training and qualifications; managing the CQI image and brand; engaging and managing the relationship with members and partner organisations; understanding customers and the market; developing new products; and maintaining the body of quality knowledge. Under these are the enablers: people, culture and behaviours, information systems and e-enablement, and effective governance. The financial perspective (generating income and achieving financial sustainability, operational excellence, and effective and efficient resource management) runs vertically alongside all the other perspectives.
At the top of this map is the vision of the company. After this come the customer-related objectives
and the internal processes needed by an organization. These processes are necessary for an
organization to excel in its domain. These processes are the enablers of success. The financial
perspective was placed vertically because it is related to all the other perspectives. This map was
taken as a base and it is reviewed, revised and refined regularly to ensure that it remains fresh and
relevant.
Designing KPIs
At this stage, the CQI developed KPIs for all the goals and key performance questions (KPQs). For instance, for the objective 'Increasing influence and making our voice heard', the KPQs and KPIs were as follows:
KPQs:
How effective is our campaigning in making quality a strategic issue with the business media?
How effective is our campaigning in making quality a strategic issue with the Government?
KPIs:
To what extent are we increasing our profile in target markets?
Heat Map
A heat map is a color-coded strategy map that shows how well an organization is doing with respect
to its strategic objectives. Marr helped in developing a heat map of the strategy map. In a heat map,
green color signifies that everything is fine; yellow color signifies that there are certain issues; and
red color signifies that there are very critical issues that may require immediate attention. A heat
map can be easily looked at by the management, and the decisions to deploy resources such as
money and people can be taken accordingly.
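The heat-map logic lends itself to a small function. In the sketch below, the thresholds and the objective figures are illustrative assumptions, not CQI's actual rules:

```python
def heat_map_colour(actual, target, warning_band=0.9):
    """Map a KPI's actual-to-target ratio to a heat-map colour:
    green  -> on or above target (everything is fine)
    yellow -> within the warning band (certain issues)
    red    -> below the band (critical, needs immediate attention)
    """
    ratio = actual / target
    if ratio >= 1.0:
        return "green"
    if ratio >= warning_band:
        return "yellow"
    return "red"

# Made-up objectives with (actual, target) figures:
objectives = {"Grow membership": (21_000, 20_000),      # on target
              "Provide training": (920, 1_000),         # slightly short
              "Generate income": (700_000, 1_000_000)}  # well short
for name, (actual, target) in objectives.items():
    print(name, heat_map_colour(actual, target))
```

In practice each objective would carry its own thresholds, agreed with the management team, rather than a single global warning band.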
Management Meetings
After the implementation of the strategic performance framework, the management discusses the issues flagged by the heat map in its monthly meetings. It can be said that the heat map and the KPQs form the basis of discussion for these meetings, and corrective actions are taken for the issues the heat map flags.
Result
CQI used the strategy framework to highlight the issues that were relevant to everyone for delivering
satisfactory performance in the future. All the activities and projects were aligned to deliver the
strategy. For this, CQI also developed meaningful indicators to monitor and track the progress.
Questions
1. Discuss the role of a strategy map in performance management.
(Hint: Strategy map led to the development of KPIs for all the goals and KPQs.)
2. How did the management of CQI use the heat maps?
(Hint: Heat map and the KPQs formed the basis for discussion of the monthly meetings. Using
the heat maps, the corrective actions were taken for issues that were flagged in the heat
map.)
CHAPTER
11
Introduction to Big Data
Topics Discussed
Introduction
What is Big Data?
Advantages of Big Data
Various Sources of Big Data
History of Data Management – Evolution of Big Data
Structuring Big Data
Types of Big Data
Structured Data
Unstructured Data
Semi-Structured Data
Elements of Big Data
Volume
Velocity
Variety
Veracity
Big Data Analytics
Advantages of Big Data Analytics
Benefits of Big Data Analytics in Different Sectors
Careers in Big Data
Skills Required
Future of Big Data
DATA SCIENCE
INTRODUCTION
If you think of the world around you, there is an enormous amount of data generated, captured, and
transferred through various media within seconds. This data may come from a personal computer
(PC), social networking sites, the transaction or communication systems of an organization, ATMs,
and multiple other channels such as mobile phones and RFID readers.
The accumulation or storage of this data results in the continuous generation of an enormous
volume of data, which, if analyzed intelligently, can be of immense value, as it can give us a variety
of critical information for making smarter decisions. The accumulation of data in such large amounts
is called Big Data. Careful analysis of Big Data can transform data into information, and information
into insight. Organizations analyze the data to understand and interpret market trends, study
customer behaviour, and make financial decisions.
Big Data consists of large datasets that cannot be managed efficiently by common database
management systems. These datasets range in size from terabytes to exabytes. Mobile phones,
credit cards, Radio Frequency Identification (RFID) devices, and social networking platforms create
huge amounts of data that may reside unutilized on unknown servers for many years. However,
with the evolution of Big Data, this stored data can be accessed and analyzed to generate useful
information. Various tools and technologies are available to store and analyze Big Data.
In this chapter, we will first discuss the concept of Big Data, its advantages, and the various sources
from which it is collected. Next, we will cover the evolution and structure of Big Data, as well as its
different types. Further, the chapter examines the benefits of Big Data analytics. Towards the end,
it discusses careers in Big Data and the future of the field.
Every hour, Walmart, a global discount departmental store chain, handles more than 1 million
customer transactions.
Introduction to Big Data
Every day, millions of users generate data over popular networking sites. For example:
- Twitter's users post 500 million tweets per day.
- Facebook's users post 2.7 billion likes and comments per day.
- Radio-Frequency Identification (RFID) systems generate nearly a thousand times the data of bar code systems.
According to IBM, “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data
in the world today has been created in the last two years alone. This data comes from everywhere:
sensors used to gather climate information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals to name a few. This data is big data.”
Data is everywhere, in every industry and business function, in the form of numbers, images, videos,
and text. As data continues to grow, it needs to be organized and made available so that individuals
and organizations can use it as an information source. That is where Big Data comes into the picture.
Big Data is a pool of large-sized datasets that must be captured, stored, searched, shared,
transferred, analyzed, and visualized within an acceptable elapsed time. In the IT industry, Big Data
refers to the art and science of analyzing data to gain deeper insights, which was earlier not possible
because of the lack of access to data and the means to process it.
[Figure: Big Data (i) is a new data challenge that requires leveraging existing systems differently,
(ii) is classified in terms of the four V's (Volume, Variety, Velocity, Veracity), and (iii) is usually
unstructured and qualitative in nature.]
Across industries, data combined with analytics can transform major business processes in various
ways, such as saving several billions of dollars through improvements in operational efficiency, or
improving performance in sports by analyzing and tracking athlete performance and behaviour.
Across organizations, the right analysis of available data can transform major business processes in
various ways, such as:
Procurement: Find out which suppliers are more cost-effective in delivering products efficiently
and on time
Product Development: Draw insights on innovative product and service formats and designs to
enhance the development process and come up with in-demand products
Manufacturing: Identify machinery and process variations that may be indicators of quality
problems
Distribution: Enhance supply chain activities and standardize optimal inventory levels vis-à-vis
various external factors such as weather, holidays, economy, etc.
Marketing: Find out which marketing campaigns will be most effective in driving and engaging
customers, and understand customer and channel behaviours
Price Management: Optimize prices based on the analysis of external factors and customer
behaviours
Sales: Optimize the assignment of sales resources and accounts, product mix, and other operations
Store operations: Adjust inventory levels on the basis of predicted buying patterns, study of
demographics, weather, key events, and other factors
Human Resources: Find out the characteristics and behaviours of successful and effective
employees, as well as other employee insights, to manage talent better
Exhibit-1
Google Inc. uses Big Data for health tracking
Google Inc. applied its massive data-collecting power to raise warnings of flu epidemics
approximately two weeks ahead of the existing public health services. To do this, Google monitored
millions of users' health-tracking behaviours and followed clusters of queries on themes such as flu
symptoms, chest congestion, and purchases of thermometers. Google analyzed this collected data
and generated consolidated results that revealed strong indications of flu levels across America.
Before divulging the information, Google carried out further research and data comparison to
verify its accuracy.
The need for Big Data is evident. If leaders and economies want exemplary growth and wish to
generate value for all their stakeholders, Big Data has to be embraced and used extensively to:
- Allow the storage and use of transactional data in digital form
- Classify customers to provide customized products and services based on buying patterns
Big Data is the new stage of data evolution directed by the enormous Velocity, Variety, and Volume
of data. Figure 11.2 shows the challenges faced while handling data over the past few decades:
The advent of IT, the Internet, and globalization have facilitated increased volumes of data and
information generation at an exponential rate, which has led to “information explosion.” This, in
turn, fueled the evolution of Big Data that started in 1940s and continues till date. Table 11.2 lists
some of the major milestones in the evolution of Big Data.
Self-Instructional
Material
251
TABLE 11.2: Some Major Milestones in the Evolution of Big Data
Year Milestone
1940s An American librarian speculated about a potential shortfall of shelves and cataloging
staff, realizing that information was increasing rapidly while storage remained limited.
1960s Automatic Data Compression was published in the Communications of the ACM. It
stated that the explosion of information in the past few years made it necessary to
minimize the requirements for storing information.
The paper described 'Automatic Data Compression' as a completely automatic and
fast three-part compressor that could be used for any kind of information in order to
reduce slow external storage requirements and increase the rate of transmission
from a computer system.
1970s In Japan, the Ministry of Posts and Telecommunications initiated a project to study
information flow in order to track the volume of information circulating in the
country.
1980s A research project was started by the Hungarian Central Statistics Office to account
for the country's information industry. It measured the volume of information in bits.
1990s Digital storage systems became more economical than paper storage. Challenges
related to the amount of data and the presence of obsolete data became apparent.
Some papers that discussed this concern are as follows:
- Michael Lesk published How Much Information Is There in the World?
- John R. Masey presented a paper titled Big Data and the Next Wave of InfraStress.
- K.G. Coffman and Andrew Odlyzko published The Size and Growth Rate of the Internet.
- Steve Bryson, David Kenwright, Michael Cox, David Ellsworth, and Robert Haimes published Visually Exploring Gigabyte Datasets in Real Time.
2000 onwards Many researchers and scientists published papers raising similar concerns and
discussing ways to solve them.
This table is only a synopsis of the evolution. The idea of Big Data began when a librarian speculated
about the need for more storage shelves for books, as explained in Table 11.2, and with time, Big
Data has grown into a cultural, technological, and scholarly phenomenon. The generation of Big
Data, and with it new storage and processing solutions equipped to handle this information, helped
businesses to:
Enhance and streamline existing databases
Exhibit-2
easy to study, analyze, and derive conclusions from. But why is structuring required?
How do I use the vast amount of data and information I come across to my advantage?
Today, solutions to such questions can be found through information processing systems. These
systems can analyze and structure a large amount of data specifically for you on the basis of what
you searched, what you looked at, and how long you remained on a particular page or website, thus
scanning and presenting you with customized information as per your behaviour and habits. In
other words, structuring data helps in understanding user behaviours, requirements, and preferences.
When a user regularly visits or purchases from online shopping sites, say eBay, each time he/she logs
in, the system can present a recommended list of products that may interest the user on the basis
of his/her earlier purchases or searches, thus presenting a specially customized recommendation set
for every user. This is the power of Big Data analytics.
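The recommendation logic described above can be sketched with a simple item co-occurrence approach. The users, products, and scoring rule below are invented for illustration; a real shopping site would use far richer signals than shared purchases:

```python
from collections import Counter

# Hypothetical purchase histories; names and items are illustrative only.
purchases = {
    "user_a": {"phone", "charger", "case"},
    "user_b": {"phone", "charger", "earbuds"},
    "user_c": {"phone", "case"},
}

def recommend(user, purchases, top_n=2):
    """Suggest items that buyers with overlapping purchases also bought."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        if owned & items:               # this buyer shares at least one purchase
            for item in items - owned:  # count only items the user lacks
                scores[item] += 1
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("user_c", purchases))  # ['charger', 'earbuds']
```

Here "user_c" owns a phone and a case, so the charger (bought by both similar users) outranks the earbuds (bought by one).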
Today, various sources generate a variety of data, such as images, text, and audio. All such different
types of data can be structured only if they are sorted and organized in some logical pattern. Thus,
the process of structuring data requires one to first understand the various types of data available
today.
Table 11.3 compares the internal and external sources of data as follows:
External sources: Unorganized data that originates from the external environment of an
organization. Examples include the Internet, government sources, and market research
organizations. Such data mostly concerns entities external to the organization, such as customers,
competitors, the market, and the environment.
Thus, on the basis of the data received from the aforementioned sources, Big Data comprises
structured data, unstructured data, and semi-structured data. In a real-world scenario, unstructured
data is typically larger in volume than structured and semi-structured data. Figure 11.3 illustrates
the types of data that comprise Big Data:
[Figure 11.3: Big Data comprises structured data, unstructured data, and semi-structured data.]
Structured Data
Structured data41 can be defined as the data that has a defined repeating pattern. This pattern makes
it easier for any program to sort, read, and process the data. Processing structured data is much
easier and faster than processing data without any specific repeating patterns. Structured data:
- Is organized data in a predefined format
- Includes flat files in the form of records (such as CSV and tab-separated files)
Table 11.4 shows a sample of structured data in which the attribute data for every customer is
stored in the defined fields.
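The defining property of structured data, a repeating pattern of fields, is what lets generic tools parse it. The sketch below uses Python's standard csv module on a few invented customer records (the field names and values are assumptions, not taken from Table 11.4):

```python
import csv
import io

# A small sample of structured customer data in CSV form.
# Every record repeats the same fields, so a generic reader can parse it.
raw = io.StringIO(
    "customer_id,name,city\n"
    "101,Asha,Delhi\n"
    "102,Ravi,Mumbai\n"
)

rows = list(csv.DictReader(raw))  # each row becomes a dict keyed by field name
print(rows[0]["name"])  # Asha
print(len(rows))        # 2
```

The same idea underlies tab-separated files: only the delimiter changes, because the repeating pattern of fields does the real work.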
Unstructured Data
Unstructured data42 is a set of data that might or might not have any logical or repeating patterns.
Unstructured data:
- Typically consists of metadata, i.e., additional information related to the data
- Comprises inconsistent data, such as data obtained from files, social media websites, satellites, etc.
- Consists of data in different formats, such as e-mails, text, audio, video, or images
Some sources of unstructured data include:
- Text both internal and external to an organization: documents, logs, survey results, feedback, and e-mails from within and across the organization
- Social media: data obtained from social networking platforms, including YouTube, Facebook, Twitter, LinkedIn, and Flickr
- Mobile data: data such as text messages and location information
Exhibit-3
Analyzing Customer Behavior
We all know that nowadays CCTV cameras are installed in almost every supermarket, and their
footage is thoroughly analyzed by the management for various purposes. Some focus points of the
analysis include the routes customers take to navigate through the store, customer behaviour
during a bottleneck, and the places where customers typically halt while shopping. This unstructured
information from the CCTV footage is combined with structured data comprising the details
obtained from the bill counters, products sold, the amount and nature of payments, etc., to arrive
at a complete data-driven picture of customer behaviour. The analysis of the obtained information
helps the management to provide a pleasant shopping experience to customers, as well as improve
sales figures.
Working with unstructured data poses certain challenges, which are as follows:
- Identifying the unstructured data that can be processed
- Sorting, organizing, and arranging unstructured data in different sets and formats
- Combining and linking unstructured data in a more structured format to derive any logical conclusions out of the available information
- Costing in terms of storage space and human resources (data analysts and scientists) needed to deal with the exponential growth of unstructured data
Figure 11.4 shows the result of a survey conducted to ascertain the challenges associated with
unstructured data, in percentage terms, from the most to the least challenging IT areas.
[Figure 11.4: Survey results ranking IT areas such as client security, mobility, data center migration
from RISC/UNIX systems, and Bring Your Own Device (BYOD) by first, second, and third priority.]
Semi-Structured Data
Semi-structured data43, also known as schema-less or self-describing structure, refers to a form
of structured data that contains tags or markup elements in order to separate semantic elements
and generate hierarchies of records and fields in the given data. Such type of data does not follow
proper structure of data models as in relation databases.
To be organized, semi-structured data should be fed electronically from database systems, file
systems, and through data exchange formats including scientific data and XML (eXtensible Markup
Language). XML44 enables data to have an elaborate and intricate structure that is significantly richer
and comparatively complex. Some sources for semi-structured data include:
Self-Instructional
Database systems
Material
256
An example of semi-structured data is shown in Table 11.5, which indicates that entities that belong
to the same class can have different attributes even if they are grouped together:
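The point that entities of the same class may carry different attributes can be illustrated with a small XML fragment. The element and attribute names below are hypothetical, not taken from Table 11.5:

```python
import xml.etree.ElementTree as ET

# Two <customer> records of the same class carry different attribute sets,
# which is characteristic of semi-structured data. Values are illustrative.
doc = """
<customers>
  <customer name="Asha" city="Delhi"/>
  <customer name="Ravi" phone="9800000000"/>
</customers>
"""

root = ET.fromstring(doc)
for cust in root.findall("customer"):
    # Each record exposes its own set of fields rather than a fixed schema.
    print(sorted(cust.attrib))
```

A relational table would force both records into identical columns; the markup lets each record describe itself, which is exactly what "self-describing structure" means.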
Now that we have examined the way data arrives and is presented, let us examine the elements that
characterize this data.
ELEMENTS OF BIG DATA
According to Gartner, data is growing at the rate of 59% every year. This growth can be depicted in
terms of the following four V’s:
- Volume
- Velocity
- Variety
- Veracity
Figure 11.5 explains the four V's of Big Data with examples:
Volume
Volume is the amount of data generated by organizations or individuals. Today, the volume of data
in most organizations is approaching exabytes. Some experts predict that the volume of data will
reach zettabytes in the coming years. Organizations are doing their best to handle this ever-increasing
volume of data. For example, according to IBM, over 2.7 zettabytes of data are present in the digital
universe today, and every minute over 571 new websites are created. IDC estimates that by 2020,
online business transactions will reach 450 billion per day.
Velocity
Velocity describes the rate at which data is generated, captured, and shared. Enterprises can
capitalize on data only if it is captured and shared in real time. Information processing systems
such as CRM and ERP face problems associated with data, which keeps adding up but cannot be
processed quickly.
These systems are able to attend to data in batches every few hours; however, even this time lag
causes the data to lose its importance, as new data is constantly being generated. For example, eBay
analyzes around 5 million transactions per day in real time to detect and prevent frauds arising from
the use of PayPal.
Variety
D
We all know that data is being generated at a very fast pace. Now, this data is generated from
different types of sources, such as internal, external, social, and behavioural, and comes in different
C
formats, such as images, text, videos, etc. Even a single source can generate data in varied formats,
for example, GPS and social networking sites, such as Facebook, produce data of all types, including
text, images, videos, etc.
T
Veracity
Veracity generally refers to the uncertainty of data, i.e., whether the obtained data is correct or
consistent. Out of the huge amount of data generated in almost every process, only the data that is
correct and consistent can be used for further analysis. Data, when processed, becomes information;
however, a lot of effort goes into processing it. Big Data, especially in its unstructured and
semi-structured forms, is messy in nature, and it takes a good amount of time and expertise to clean
that data and make it suitable for analysis.
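A minimal sketch of such a veracity check appears below. The records, the duplicate rule, and the plausible age range are all assumptions chosen for illustration; real cleaning pipelines apply many more rules:

```python
# Drop duplicate and inconsistent records before analysis.
records = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},   # exact duplicate
    {"id": 2, "age": -5},   # inconsistent: negative age
    {"id": 3, "age": 41},
]

seen = set()
clean = []
for rec in records:
    key = (rec["id"], rec["age"])
    # Discard duplicates and values outside a plausible human age range.
    if key in seen or not 0 <= rec["age"] <= 120:
        continue
    seen.add(key)
    clean.append(rec)

print([r["id"] for r in clean])  # [1, 3]
```

Even this toy example shows why veracity costs effort: every field needs its own notion of "consistent", and those rules must be written and maintained by people who understand the data.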
Figure 11.6 highlights the proportion of business areas that have benefited by using Big Data; for
example, 61% of businesses reported better social-influencer marketing.
[FIGURE 11.6 Big Data Benefit Areas. Source: TDWI, July 2013]
Let us understand some common analytical approaches that businesses apply to use Big Data.
Table 11.6 describes various analytical approaches typically associated with Big Data.
TABLE 11.6: Analytical Approaches
Behavioral Analysis: How will a business leverage complex data in order to create new models for
driving business outcomes?
Data Interpretation: What new business analyses can be estimated from the available data? Which
data should be analyzed for new product innovation?
Education: Big Data helps in analyzing requirements and finding easy and innovative ways of
imparting education, especially distance learning over vast geographical areas.
Travel: The travel industry also uses Big Data to conduct business. It maintains complete records of
all its customers, which are then analyzed to determine certain behavioural patterns. For example,
in the airline industry, Big Data is analyzed to identify personal preferences or to spot which
passengers like window seats for short-haul flights and aisle seats for long-haul flights. This helps
airlines offer similar seats to customers when they make a fresh booking. Big Data also helps airlines
track customers who regularly fly between specific routes so that they can make the right cross-sell
and up-sell offers. Some airlines also apply analytics to pricing, inventory, and advertising to improve
customer experiences, leading to more customer satisfaction and hence more business. Some
airlines even go to the length of evaluating customers who tend to miss their flights, and try to help
such customers by delaying flights or booking them on another flight.
Government: Big Data has come to play an important role in almost all the undertakings and
processes of government. For instance, UIDAI, an Indian government body, successfully implemented
the Aadhaar card using Big Data technologies, registering millions of citizens and performing trillions
of data matches every day. Analysis of Big Data promotes clarity and transparency in various
government processes and helps in:
- Taking timely and informed decisions about various issues
- Identifying flaws and loopholes in processes and taking preventive or corrective measures on time
- Assessing the areas of improvement in various sectors such as education, health, defense, and research
- Using budgets more judiciously and reducing unnecessary wastage and costs
- Preventing fraudulent practices in various sectors
Healthcare: In healthcare, pharmacy and medical device companies use Big Data to improve their
research and development practices, while health insurance companies use it to determine
patient-specific treatment and therapy modes that promise the best results. Big Data also helps
researchers work towards eliminating healthcare-related challenges before they become real
problems, and helps doctors analyze the requirements and medical history of every patient to
provide individualized services depending on their medical condition.
Telecom: The mobile revolution and the Internet usage on mobile phones have led to a
tremendous increase in the amount of data generated in the telecom sector. Managing this
huge pool of data has almost become a challenge for the telecom industry. For example, in
Europe, telecom companies are required to keep customer data for at least six months and up to a
maximum of two years. Now, all this collection, storage, and
maintenance of data would just be a waste of time and resources unless we could derive any
significant benefits from this data. Big Data analytics allows telecom industries to utilize this
data for extracting meaningful information that could be used to gain crucial business insights
that help industries in enhancing their performance, improving customer services, maintaining
their hold on the market, and generating more business opportunities.
Consumer Goods Industry: Consumer goods companies generate huge volumes of data in
varied formats from different sources, such as transactions, billing details, feedback forms,
etc. This data needs to be organized and analyzed in a systemic manner in order to derive any
meaningful information from it. For example, the data generated from the Point-of-Sale (POS)
systems provides significant real-time information about customers’ preferences, current
market trends, the increase and decrease in demand of different products at different regions,
etc. This information helps organizations to predict any possible fluctuations in the prices of goods
and make purchases accordingly. It also helps marketing teams take suitable actions rapidly if there
is a deviation in the expected sales of a product, thus preventing any further losses to the company.
Therefore, we can say that Big Data analytics allows organizations to gain better business insights
and take informed and timely decisions.
Aviation Industry: Big Data analytics also plays a significant role in the commercial aviation industry.
Like other industries, the aviation industry maintains detailed records of all its customers, including
their personal information, flying preferences, and other trends and patterns. Organizations analyze
this data to improve their customer services and, thus, their brand image. In addition, every aircraft
generates a significant amount of data during operation. This data is then analyzed for enhancing
operational efficiency, identifying parts that require repairs, and taking any necessary corrective or
preventive measures on time.
Exhibit-4
Big Data analysis helps organizations in identifying specific customer needs across a diversified
audience
In today’s competitive environment, every organization competes to attract the customers for
their product by using different techniques for promotion. This is because success and productivity
depend on what promotion techniques one is following to meet customer demands. The different
promotion techniques that organizations follow to attract their customers are as follows:
Qualified and experienced Big Data professionals must have a blend of technical expertise, creative
and analytical thinking, and communication skills to be able to effectively collate, clean, analyze, and
present information extracted from Big Data.
Most jobs in Big Data are from companies that can be categorized into the following four broad
buckets:
- Big Data technology drivers, e.g., Google, IBM, Salesforce
- Big Data services companies, e.g., EMC
Figure 11.7 shows the logos of some companies that hire Big Data professionals:
[Figure 11.7: Top companies hiring Big Data professionals]
As shown in Figure 11.7, companies such as Google, Salesforce, and Apple offer various types of
opportunities to Big Data professionals. These companies deal in various domains such as retail,
manufacturing, information, finance, and consumer electronics. The hiring of Big Data experts in
these domains, as per the Big Data Analytics 2014 report, is shown in Figure 11.8:
- 27.14% Professional, Scientific, and Technical Services
- 18.89% Information
- 12.35% Manufacturing
- 9.60% Retail Trade
- 8.20% Sustainability, Waste Management & Remediation Services
- 8.13% Finance and Insurance
- 5.70% Wholesale Trade
- 3.04% Educational Services
- 1.71% Other Services (except Public Administration)
- 1.22% Accommodation and Food Services
- 1.05% Health Care and Social Assistance
- 0.76% Real Estate and Rental and Leasing
- 0.62% Construction
- 0.46% Public Administration
- 0.42% Transportation and Warehousing
- 0.28% Management of Companies and Enterprises
- 0.18% Arts, Entertainment, and Recreation
- 0.11% Mining, Quarrying, and Oil & Gas Extraction
- 0.11% Utilities
- 0.02% Agriculture, Forestry, Fishing and Hunting
FIGURE 11.8 Top 20 Industries Hiring Big Data Experts
Source: Wanted Analytics, 2014
Some key Big Data roles include:
- Data scientist
- Big Data developer
In 2011, McKinsey & Co. published a report indicating that by 2018, the United States alone might
face a huge shortage (of about 140,000 to 190,000) of data analytics professionals.
Skills Required
Big Data professionals can have various educational backgrounds, such as econometrics, physics,
biostatistics, computer science, applied mathematics, or engineering. Data scientists mostly possess
a master’s degree or Ph.D. because it is a senior position and often achieved after considerable
experience in dealing with data. Developers generally prefer implementing Big Data solutions by
using Hadoop and its components.
Technical Skills
A Big Data analyst should possess the following technical skills:
- Understanding of Hadoop ecosystem components, such as HDFS, MapReduce, Pig, Hive, etc.
- Knowledge of statistical analysis and analytical tools
A Big Data developer should possess the following skills:
- Programming skills in Java, Hadoop, Hive, HBase, and HQL
These skills can be acquired with proper training and practice. This book familiarizes you with the
technical skills required by a Big Data analyst and a Big Data developer.
Soft Skills
Organizations look for professionals who possess good logical and analytical skills, along with good
communication skills. The preferred soft-skill requirements for a Big Data professional are:
- Strong written and verbal communication skills
- Analytical ability
Most organizations today consider data and information to be their most valuable and differentiating
asset. By analyzing this data effectively, organizations worldwide are finding new ways to compete
and emerge as leaders in their fields, improve decision-making, and enhance their productivity and
performance. At the same time, the volume and variety of data are increasing at an immense rate
every day. The global phenomenon of using Big Data to gain business value and competitive
advantage will only continue to grow, as will the opportunities associated with it.
Figure 11.10 depicts the tremendous growth in the volume of Big Data over the coming years.
[FIGURE 11.10 Growth Pattern of Data, 2008-2020. Source: Oracle, 2012]
Research conducted by MGI and McKinsey's Business Technology Office suggests that the use of Big
Data is likely to become a key basis of competition for individual firms, underpinning success and
growth while strengthening consumer surplus, productivity growth, and innovation.
Exhibit-5
Today, clients often ask about the future of big data and what the next step is; how can we
leverage data on an even deeper level in order to extract meaningful consumer insights that go
beyond where we are now? Most of the standard answers are around the ability to get data and
insights in real time and from more devices than ever. It’s time we move beyond structured data
and into the prime time of text analytics.
For us, the easiest way to get started with Big Data 2.0 is to focus on the unstructured data we
collect every day. This can be reviews, customer support emails, community forums, or even your
own CRM system. The simplest way to look at this data is through a process called text analytics.
Text analytics is a fairly straightforward process that breaks out like this:
1. Acquisition: Collecting and aggregating the raw data you want to analyze
2. Transforming & Preprocessing: Cleaning and formatting the data to make it easier to read
3. Enrichment: Enhancing the data by adding additional data points
4. Processing: Performing specific analyses and classifications on the data
5. Frequencies & Analysis: Evaluating the results and translating them into numerical indicators
6. Mining: Actual extraction of information
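The six steps above can be sketched in miniature. The review snippets and stop-word list below are invented, and a real pipeline would be far more elaborate at every step:

```python
import re
from collections import Counter

# 1. Acquisition: collect the raw review texts (invented examples).
reviews = [
    "Great value for the price!",
    "The tape kept opening.",
    "Good price, poor tape.",
]

# 2. Transforming & Preprocessing: lowercase and tokenize each review.
tokens = [re.findall(r"[a-z]+", r.lower()) for r in reviews]

# 3. Enrichment: add a stop-word list as an extra data point.
stop = {"the", "for", "a", "is"}

# 4. Processing: filter out stop words across all reviews.
words = [w for doc in tokens for w in doc if w not in stop]

# 5. Frequencies & Analysis: turn the words into numerical indicators.
freq = Counter(words)

# 6. Mining: extract the recurring terms ("price" and "tape" here).
print(freq.most_common(2))
```

Scaled up from three sentences to millions of reviews, the same frequency counts are what surface the recurring patterns discussed in the examples that follow.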
Let’s assume we are a consumer packaged goods company and we want to introduce a new line
of diapers into the market. We decide to look at Amazon in order to better understand which
products are category leaders (sales rank and number of sales) and how the consumers like the
product itself (reviews). If we analyze these metrics across all diapers, we have a Big Data 1.0
picture that tells us exactly who sells the most and what the audience favorite is.
We are trying to understand the diaper market. In order not to turn this into a step-by-step guide,
let's assume that we have already collected all diaper reviews as well as their qualitative indicators.
That means we know what sells best and what ranks best and worst. In order to take this to the next
level, we would start to extract words and phrases from the reviews. This will tell us some of the
recurring patterns and their frequencies within the reviews.
The words that recurred across the majority of the helpful reviews were "price," "special," and
"value." This tells us that people did not buy it because of its quality or features but because of its
pricing. So, when we are launching our product, we want to look at this one for price/value guidance
instead of features.
This one was very revealing. The brand with the most negative reviews had an extremely high
frequency around the terms “tape,” “stick,” “stay closed,” and “open.” After a few reads,
I discovered that consumers had no issues with the usual key features on a diaper such as
“absorbency,” “leakage,” or “softness” but actually had issues with the tape on the side of the
diaper, and the fact that it kept opening. The number of negative reviews that mentioned these issues makes us believe that this is a feature that brands don’t talk about but consumers care about. Therefore, we would recommend testing ads that address this issue.
3. Smart Filtering
One interesting issue we came across is that many of the negative reviews were not actually about the product but rather focused on shipping, stock level, and packaging concerns. By tagging and removing these from the set, we are able to evaluate the product itself and focus on product-related concerns.
on product-related concerns. If we were to list our diaper on Amazon, we would recommend
adding a shipping and stock level guarantee prominently in the copy—a competitive advantage
that speaks directly to consumer concerns.
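A minimal sketch of this tagging-and-removal step, assuming simple keyword matching is enough to flag non-product reviews (the keyword list and sample reviews are invented for illustration; a real system would use a trained classifier):

```python
# Keywords that mark a review as being about logistics, not the product.
NON_PRODUCT_TERMS = ("shipping", "stock", "packaging", "delivery")

reviews = [
    "Terrible, the tape kept opening.",
    "Shipping took three weeks, never again.",
    "Box arrived crushed, awful packaging.",
    "Leaks every night, very disappointed.",
]

def tag(review):
    """Tag a review as 'logistics' or 'product'."""
    text = review.lower()
    return "logistics" if any(t in text for t in NON_PRODUCT_TERMS) else "product"

# Keep only product-level concerns for the analysis set
product_reviews = [r for r in reviews if tag(r) == "product"]
print(product_reviews)  # the tape and leakage complaints remain
```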
4. What Do They Want
From an R&D perspective, this insight is worth gold. By evaluating reviews that have terms like “I
wish,” “hope,” or “they should,” we are able to detect common features consumers are looking
for when thinking about diapers. These are great insights that address the constantly changing
need of the consumers. We can feed these product feature-specific insights to our R&D team as
well as our copywriters.
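Detecting these wish statements can be as simple as a phrase scan. The phrase list and example reviews below are illustrative assumptions:

```python
import re

# Phrases that signal an unmet consumer need
WISH_PATTERNS = [r"\bi wish\b", r"\bhope\b", r"\bthey should\b"]

reviews = [
    "I wish the tabs were stretchier.",
    "Absorbs well, no leaks at night.",
    "They should add a wetness indicator.",
    "Hope a fragrance-free version comes out.",
]

def wish_statements(reviews):
    """Return the reviews that express a wished-for feature."""
    pattern = re.compile("|".join(WISH_PATTERNS), re.IGNORECASE)
    return [r for r in reviews if pattern.search(r)]

feature_requests = wish_statements(reviews)
for req in feature_requests:
    print(req)  # candidate items for the R&D and copywriting teams
```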
As you can see, when analyzing the diaper category, Big Data 2.0 yielded insights beyond binary
performance indicators. We could see the crowd favorites but did not (yet) know the “why”
behind purchases or understand the positive or negative reviews until our text analytics exercise.
There are countless consumer insights to be mined from textual, unstructured data that give us
the voice of the consumer, their motivations, and a deeper understanding of their purchasing
behavior.
Introduction to Big Data
Summary
In this chapter, we first discussed the concept of Big Data, its advantages, and the various sources from which it is collected. Next, we discussed the evolution and structure of Big Data. The chapter also covered the different types of Big Data and the benefits of Big Data analytics. Towards the end, it discussed careers in and the future of Big Data.
Exercise
Multiple-Choice Questions
Q1. Which of the following is not a characteristic of Big Data?
a. Volume b. Variable
c. Variety d. Velocity
Q2. Which one of the following is not an example of external data sources?
a. Data from Sales b. Data from Web logs
c. Data from government sources d. Data from market surveys
Q3. Which of the following is/are sources of structured data?
a. Relational databases b. Flat files
c. Multidimensional databases d. All of these
Q4. Some people call this data as “structured but not relational.” Which data are we talking
about?
a. Structured data b. Unstructured data
c. Semi-structured data d. Mixed data
Q5. The data generated from a GPS satellite is classified as ______________.
a. Structured data b. Unstructured data
Assignment
Q1. A retail company wants to launch a new line of products but has no experience. Which type of
data can help the company to effectively strategize and launch its new product? What could
be some of the potential sources for such data?
Q2. As an HR Manager of a company providing Big Data solutions to clients, what characteristics
would you look for while recruiting a potential candidate for the position of a data analyst?
Q3. If a Big Data analyst were to analyze data from a database of call logs provided by a telecom
service provider, which element of Big Data would he be dealing with?
Q4. Coffee chain Cool Cafe has come up with a coffee variety that they plan to target to
prospective customers aged between 18 and 25. The key selling point of this coffee would
be that this is not only to be enjoyed as a beverage but also has health benefits that increase
body’s energy levels. Hence, the new brand of coffee needs to be targeted at college-going
or freshly recruited youngsters, who are health-conscious.
You need to identify the method by which the advertisers/organizations can reach their
specific target audience online and track the prospective customers across the web.
Q5. List and discuss the three elements of Big Data. Which element is responsible for the inception
of Big Data?
Q6. What are the challenges associated with unstructured data? How is it different from semi-structured data?
Q7. Discuss the benefits of Big Data analytics in different sectors.
Q8. “The need to process large amounts of data in real time as well as applying the results to the
business in a timely fashion is unavoidable in today’s world.” Justify this statement.
References
Oracle Big Data. (n.d.). Retrieved from https://www.oracle.com/in/big-data/guide/what-is-big-data.html
What is big data analytics? - Definition from WhatIs.com. (n.d.). Retrieved from https://searchbusinessanalytics.techtarget.com/definition/big-data-analytics
Big Data Analytics. (n.d.). Retrieved from https://www.qubole.com/big-data-analytics/
What is Big Data Analytics? Learn About Tools and Trends – NGDATA. (2018, May 02). Retrieved from https://www.ngdata.com/what-is-big-data-analytics/
Answers for Multiple-Choice Questions
2. a.
C A S E S T U D Y
EVOLUTION OF ONLINE CLASSIFIEDS WITH BIG DATA ANALYTICS
This case study shows how data is integrated from hundreds of countries and dozens of languages, providing users with powerful data-driven insights to predict the future of free online classifieds.
Background
OLX is a popular, fast-growing online classified advertising website. It is active in around 105 countries and supports over 40 languages. The website has more than 125 million unique visitors per month across the world and generates approximately one billion page-hits per month. OLX allows its users to design and personalise their advertisements and add them to their social networking profiles; data on this scale requires Big Data analytics.
Challenges
The main challenge for the OLX website was to find new ways to use business analytics to handle the vast data of its customers. The business users of OLX required numerous metrics to track their customer data. To achieve this aim, they needed to build good control over their data warehouse.
OLX took the help of Datalytics, Pentaho’s partner vendor, in finding solutions for extracting, transforming and loading data from around the world and then creating an improved data warehouse. After creating such a warehouse, OLX wanted to allow its customers to visualise the stored data in real time without facing any technical error or barrier. OLX knew that this would be difficult for people who do not have previous Business Intelligence (BI) knowledge, so it was essential to use a visualisation tool for this purpose. According to Francisco Achaval, Business Intelligence Manager at OLX, “While it may be easy for a BI analyst to understand what’s happening in the numbers, to explain this to business users who are not versed in BI or OLAP (On-line Analytical Processing), you need visualisations.”
Solutions
OLX approached Pentaho, a business intelligence software company that provides open source products and services to its customers, such as data integration, OLAP services, reporting, and information dashboards. Pentaho has a partnership with Datalytics, a consulting firm based in Argentina. Datalytics provides data integration, business intelligence, and data mining solutions to Pentaho’s worldwide clients.
Results
OLX has realised that Datalytics’ expertise and Pentaho’s platform have enabled them to deploy
their new analytics solution in less than a month. They have realised the following changes in the
new solution:
Pentaho Business Analytics enables OLX to facilitate its users to create easy and creative reports
about key business metrics.
Instead of buying an expensive enterprise solution or investing time in building a new data
warehouse internally, OLX was able to save time by focussing on data integration with analytics
capabilities.
Pentaho Business Analytics provides end-user satisfaction.
Pentaho Business Analytics provides a scalable solution to OLX, as it can integrate any type of data from any data source as the business grows. In addition, Datalytics’ assistance gives OLX the opportunity to experiment with Big Data.
Questions
1. What were the challenges faced by OLX?
(Hint: The main challenge for the OLX website was to find new ways to use business analytics to handle the vast data of its customers.)
2. What was the result of implementing Pentaho Business Analytics?
(Hint: Pentaho Business Analytics enables OLX to facilitate its users to create easy and creative reports about key business metrics.)
L A B E X E R C I S E
In today’s competitive environment, every organization competes to attract customers to its products by using different techniques for promotion. This is because success and productivity depend on the promotion techniques one follows to meet customer demands.
The different promotion techniques that organizations follow to attract their customers are as
follows:
To address this issue, organizations use Big Data, which gives them the option to analyze customers on the basis of the ‘comments’ (an unstructured form of data) expressed for specific products on social networking websites, and accordingly target those customers later. Big Data analysis actually involves converting this unstructured form of data into a structured form and presenting the results as per the analysis done.
This Lab Exercise is a discussion-and-brainstorming session where you will be presented with a series
of problem scenarios and will be required to suggest suitable Big Data solutions to each of them.
LAB 1
Problem: Coffee chain Cool Cafe has come up with a coffee variety that they plan to target at prospective customers aged between 18 and 25. The key selling point of this coffee would be that it is not only to be enjoyed as a beverage but also has health benefits that increase the body’s energy levels. Hence, the new brand of coffee needs to be targeted at college-going or freshly recruited youngsters, who are health-conscious.
You need to identify the method by which the advertisers/organizations can reach their specific
target audience online and track the prospective customers across the web using Big Data.
Solution: Before you start answering the questions, let us first analyze how modern-day audience
and advertising have changed. Traditionally, advertisers used demographic targeting and reached
audiences through TV or other mass media. However, consumers today spend less time watching
TV broadcasts and more time in their own personalized media environments, including their own
individual blogs, customized news items, songs, and videos. Although good for consumers, this
media fragmentation has scattered advertisers’ audience. With means to measure the available Big
Data, organizations today can monitor relevant interactive communication across various digital
channels, such as e-mails, mobile apps, social networking sites, and the web.
Analyzing the social media data gives organizations useful insights into the target audience’s brand
communication preferences and interests. Analysis of this data provides information about the
other brands that the target audience talks about and plenty of other data, including new ideas and
ways of presentation and communication.
Big Data helps organizations in improving products and services. By analyzing Big Data, organizations
can identify gaps between consumer requirements and business offerings for clients. Organizations
can make correct/profitable decisions in the product development process, thus making sure
consumers ultimately get better products and services.
Using advanced Big Data monitoring and analysis techniques, organizations can gather analysis
reports from multiple online media platforms, such as e-mails, social networking sites, video-sharing
websites, etc. Let us consider that a customer organization analyzes its research on a specific
target audience using a social media monitoring platform. This research reveals what consumers
talk about, their likes and dislikes, and comments about a product, service, or industry online, more
often on Facebook than on any other social platform. Comparing this with data obtained from the
e-mail marketing platform, organizations can plan more effective advertising efforts on Facebook than on e-mail or any other platform.
LAB 2
Demonstrating the Use of Big Data in Retail Sector
Problem: Let us consider the example of Amazon, an e-commerce company based in Seattle,
Washington. Amazon started with selling books online and gradually expanded to electronic goods
and a variety of other products. In your opinion:
How can Amazon use Big Data for increasing its sales?
How can Big Data help Amazon in providing personalized products and services to its varied
customer base?
Solution: Before you start answering the questions, let’s first analyze how today’s retail shopping is different from the shopping that existed earlier. With the availability of a large number of retail channels and the integration of social media into everyday life, customers are easily able to get useful information about products and services. Now customers can compare product features and prices, learn about product ratings, and get products and services online and pay for them online too.
This advancement provides a better means for shopping in comparison to shopping in physical
stores. Since customers have options to easily interact with product and service providers,
it becomes difficult for organizations to retain customers for a long time. How does a retail
company manage this challenge?
Retail organizations can use a Big Data solution to get clear insights about customers’ likes and dislikes and use this information for marketing products and services and enhancing the efficiency of merchandising decisions.
They can also use this data to remove inefficiencies in distribution and operations. To succeed in
addressing Big Data challenges, Amazon needed to collect, manage, and analyze huge volumes of
data. Amazon implemented advanced Big Data analysis techniques to capitalize on newer trends
and other changes in the retail industry.
Amazon has used a Big Data solution to identify specific reasons for ineffectiveness in its operations and distribution and then addressed those reasons. By adopting the Big Data solution, Amazon was able to get clear insight into the shifting retail landscape and then used this
knowledge for positive transformation. Amazon considered its critical objectives to be:
Provide a more satisfying shopping experience to their customers
By analyzing the vast variety and volumes of data, Amazon was able to get an understanding of each of its customers and then developed a better strategy to offer customers a smarter shopping experience.
The analyzed information helped Amazon to understand customer preferences and shopping behavior. When a user searches for an item on the retail site of Amazon, the user is suggested items available with the retailer from the same segment of products. Figure 11.11 shows a screenshot of some suggested products.
Build Smarter Merchandising and Supply Networks
By implementing Big Data platforms, Amazon got a better understanding of:
Demand Trends: By analyzing the demand trends, Amazon could provide competitive pricing
and promotions for their business. Big Data helps organizations to plan promotional activities
by generating conclusions based on the data generated from several sources. These sources
can be social networks, social media, business reports, market reports, sales data of an
organization, and buying patterns of customers.
Optimal Pricing: Amazon was able to develop a better supply and distribution chain by analyzing
key indicators such as customer sentiment, price analysis, demand and supply analysis, and
current market trends.
CHAPTER 12
Business Applications of Big Data
Topics Discussed
Introduction
Use of Big Data in Social Networking
Use of Big Data in Preventing Fraudulent Activities
Preventing Fraud using Big Data Analytics
Use of Big Data in Banking and Finance
Big Data in Healthcare Industry
Big Data in Entertainment Industry
Explain how Big Data is used in detecting fraudulent activities in the insurance sector
INTRODUCTION
Almost all organizations collect and collate relevant data in various forms, such as customers’ feedback, inputs from retailers and suppliers, current market trends, etc. The information thus derived is used by the management to take major organizational decisions. An organization
generally has to spend huge amounts to collect data and information. For example, customer
surveys or market research reports require a significant amount of investment by an organization.
The cost of collecting information goes on escalating as an organization keeps on collecting more
information. The continuously increasing cost decreases the value of the collected information. In
other words, collecting and maintaining a pool of data and information is just a waste of resources
unless any logical conclusions and business insights can be derived from it. This is where Big Data
analytics comes into the picture.
This chapter explains how Big Data influences businesses in today’s world. The key is to understand
how Big Data and different methods of data analytics are used in real time and why. How can an
organization make the optimum use of Big Data? How can large volumes of data be used to get
better insights? How can the data obtained be used to form better business strategies, and thereby
help in scalability and profitability? The key to implementing a Big Data solution is to manage Big
Data for meeting business requirements. Business insights gained from Big Data analytics help
the organizations reduce their cycle time, fulfill orders quickly, cut excess inventory, and improve
forecast accuracy and customer services by exchanging information, such as inventory levels,
forecast data, and sales trends. These insights can be applied to almost all the core domains of an
organization and shared with partners, suppliers, customers, and other stakeholders.
In this section, we analyze the effects of Big Data generated from the social media on different
industries. Let’s first understand the meaning of social network data.
Social network data45 refers to the data generated from people socializing on social media. On a
social networking site, you will find different people constantly adding and updating comments,
statuses, preferences, etc. All these activities generate large amounts of data. Analyzing and mining
such large volumes of data show business trends with respect to wants and preferences and likes and dislikes of a wide audience.
This data can be segregated on the basis of different age groups, locations, and genders for the
purpose of analysis. Based on the information extracted, organizations design products and services
specific to people’s needs.
Figure 12.1 shows the social network data generated daily through various social media:
FIGURE 12.1 Social Network Data Generated Every Minute of the Day. Every minute: YouTube users upload 72 hours of video; Amazon generates over $80,000 in online sales; Google receives over 2,000,000 search queries; Twitter users send over 300,000 tweets; Facebook users share 2-5 million pieces of content.
Social Network Analysis (SNA)46 is the analysis performed on the data obtained from social media.
As the data generated is huge in volume, it results in the formation of a Big Data pool.
Let’s understand the importance of social network data with the help of an example of a Mobile Network Operator (MNO). The data captured by an MNO in a day, such as the cell phone calls, text messages, and other related details of all its customers, is huge in volume. This type of data is used daily for different purposes.
An MNO does not simply need to record and analyze the calls of a customer but the entire network
calls related to that customer. The company must study the data of the people whom the customer
called and also of the people in the customer’s network who called him back. Such a network is
called a social network.
Figure 12.2 shows the graphical view of how the structure of a caller’s social network is created:
As seen in Figure 12.2, the data analysis process can go deeper and deeper within the network to get a complete picture of a social network. As the analysis goes deeper, the volume of data to be analyzed also becomes massive. The challenge comes while using traditional methods to analyze
the huge volume of data. The same structure of SNA is followed when it comes to social networking
sites. While analyzing the social media data of a user, it is not considered a difficult task to identify
the number of connections a user has, the frequency of messages posted on the user’s timeline, and
other standard metrics. However, it becomes a daunting task to know how wide the network of a
user is, including his or her friends, the friends of friends, and so on.
It is not difficult to keep track of a thousand users, but it becomes difficult when it comes to
one million direct connections between these thousand users, and another one billion connections
when friends of friends are taken into consideration. Extracting, obtaining, and analyzing data from
every single point of connection is a big challenge faced by SNA.
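The explosion in connection counts described above can be illustrated with a small adjacency-list graph. The caller IDs and call links are made-up assumptions; real SNA would run on a distributed graph engine, but the breadth-first idea is the same:

```python
from collections import deque

# Adjacency list: who calls whom (illustrative, symmetric network)
calls = {
    "A": ["B", "C"],
    "B": ["A", "D", "E"],
    "C": ["A", "F"],
    "D": ["B"],
    "E": ["B", "F"],
    "F": ["C", "E"],
}

def network_within(graph, start, max_hops):
    """Breadth-first search: everyone reachable within max_hops calls."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return seen - {start}

print(network_within(calls, "A", 1))  # direct contacts only
print(network_within(calls, "A", 2))  # friends of friends: already much larger
```

Even on six nodes, going from one hop to two more than doubles the set to examine; on a million subscribers the growth is what makes SNA a Big Data problem.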
The data derived from social media enables an organization to calculate the total revenue a customer
can influence instead of the direct revenue the customer generates. Because of this advantage,
organizations are compelled to invest in such customers.
A customer who is quite influential must be pampered, thereby increasing the total profitability of
the network the customer is using. For an organization, increasing the total profitability of a network
takes priority over increasing the profitability of every individual customer’s account.
Some facts about Big Data and social media are listed as follows:
Facebook collects 500 times more data each day than the New York Stock Exchange.
(Source: BI Intelligence)
Twitter produces 12 times more data each day than the New York Stock Exchange.
(Source: BI Intelligence)
Social media analytics48 is nowadays used for online reputation management, crisis management, lead generation, brand checks, measuring campaign performance, and much more.
Exhibit 1
American Airlines
According to a survey conducted by MSN Money, American Airlines has been ranked as one of the
most disliked companies in the U.S.
Studies reveal that American Airlines has about 346,259 ‘followers’ on Twitter and 273,591 ‘likes’ on
Facebook. But this cannot be taken as a true indicator of the company’s popularity. Deep studies
of the sentiments of customers reveal that the trends of online conversation about the company
are negative, which indicate that it is allegedly one of the most disliked airlines. Thus, clearly, the
company’s efforts to engage social media community have not paid any beneficial returns.
To improve their image and ranking, American Airlines should focus more on the sentiment and
emotive data and the ‘correct’ types of incoming data sources rather than just counting the
numbers of ‘followers’ and ‘likes’.
Frauds that are committed against financial institutions, such as banks and insurance and healthcare companies, or that involve any type of monetary transaction, such as in the retail industry, are called financial frauds. In such fraudulent cases, online retailers, such as Amazon, eBay, and Groupon, tend to incur huge expenses and losses.
The following are some of the most common types of financial frauds:
Credit card fraud: This type of fraud is quite common these days and is related to the use
of credit card facilities. In an online shopping transaction, the online retailer cannot see the
authentic user of the card and, therefore, the valid owner of the card cannot be verified. It is
quite likely that a fake or a stolen card is used in the transaction.
In an online transaction, in spite of the security checks, such as address verification or card
security code, fraudsters manage to manipulate the loopholes in the system.
Exchange or return policy fraud: An online retailer always has a policy allowing the exchange
and return of goods and, sometimes, people take advantage of this policy. These people buy
a product online, use it, and then return it, claiming they are not satisfied with the product.
Sometimes, they even report non-delivery of the product and later attempt to sell it online.
What leads to such a fraud is that retailers encourage consumers to order products in bulk and
later return the ones that they don’t require.
Such a fraud can be averted by charging a restocking fee on the returned goods, getting
customer’s signature on the delivery of the product, and staying cautious of such customers
who are known to commit such frauds.
Personal information fraud: In this type of fraud, people obtain the login information of
a customer and then log in to the customer’s account, purchase a product online, and then
change the delivery address to a different location. The actual customer keeps calling the
retailer to refund the amount as he or she has not made the transaction. Once the transaction
is proved fraudulent, the retailer has to refund the amount to the customer.
All these frauds can be prevented only by studying the customer’s ordering patterns and keeping
track of out-of-line orders. Other aspects should also be taken into consideration such as any
change in the shipping address, rush orders, sudden huge orders, and suspicious billing addresses.
By observing such precautions, the frequency of the occurrence of such frauds can be reduced to a
certain extent, but cannot be completely eliminated.
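The out-of-line-order checks listed above can be sketched as simple rules over order records. The field names and the 3x threshold are illustrative assumptions; production systems would derive such checks from historical data:

```python
def fraud_flags(order, customer_history):
    """Return a list of reasons an order looks out of line for this customer."""
    flags = []
    if order["ship_address"] != customer_history["usual_address"]:
        flags.append("shipping address changed")
    if order["rush"]:
        flags.append("rush order")
    # "Sudden huge order": illustrative threshold of 3x the usual order value
    if order["amount"] > 3 * customer_history["avg_amount"]:
        flags.append("unusually large order")
    return flags

history = {"usual_address": "12 Oak St", "avg_amount": 40.0}
order = {"ship_address": "99 Elm Rd", "rush": True, "amount": 500.0}

print(fraud_flags(order, history))  # all three checks fire for this order
```

An order that raises several flags at once would be held for manual review rather than rejected outright, since each rule alone also matches legitimate behaviour.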
Big Data systems also examine the entire historical data to track suspicious patterns in customer orders. These patterns are then used to create checks for avoiding real-time fraud. Big Data analysis is
performed in real time by retailers to know the actual time when the products were delivered to
the customers. Costly products often have sensors attached to them that transmit their location
information. When such products are delivered to the customers, the streaming data obtained from
these sensors provides location information to the retailer, thereby preventing frauds.
Centralization of Big Data takes place through Massively Parallel Processing (MPP) systems. Any organization that aims at improving its analytic scalability needs an MPP system. With the continuous increase in the volume of data, it is not always feasible to move data as part of the analysis process except where it is absolutely required. MPP is the most widely used technique for storing and analyzing huge volumes of data.
Let us now understand what an MPP database is and what makes it so special and preferred. An
MPP database has several independent pieces of data stored on multiple networks of connected
computers. It eliminates the concept of one central server having a single CPU and disk.
The data in an MPP database is divided into different disks managed by different CPUs across
different servers, as shown in Figure 12.3:
FIGURE 12.3 A Single Overloaded Server versus Multiple Lightly Loaded Servers
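The idea of splitting data across several CPUs can be imitated on one machine with hash partitioning plus per-partition workers. The partition count and sample rows are assumptions for illustration; a real MPP database does this across physical servers with data kept on separate disks:

```python
from multiprocessing.dummy import Pool  # a thread pool is enough for a sketch

N_SERVERS = 4  # pretend each partition lives on its own server

sales = [("east", 10), ("west", 20), ("east", 5), ("north", 7), ("west", 1)]

# Divide the rows into partitions by hashing the key, as an MPP database does
partitions = [[] for _ in range(N_SERVERS)]
for region, amount in sales:
    partitions[hash(region) % N_SERVERS].append((region, amount))

def partial_sums(rows):
    """Each 'server' aggregates only its own partition."""
    out = {}
    for region, amount in rows:
        out[region] = out.get(region, 0) + amount
    return out

with Pool(N_SERVERS) as pool:
    results = pool.map(partial_sums, partitions)

# A final merge step combines the per-server partial results
totals = {}
for part in results:
    for region, amount in part.items():
        totals[region] = totals.get(region, 0) + amount

print(totals)
```

Because every row for a given region hashes to the same partition, each worker computes its sums independently and only the small partial results travel to the merge step, which is the core of the "don't move the data" principle described above.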
Big Data can also help in creating maps and graphs for comparisons that can be used to analyze
situations and take decisions. An analysis in the graphical form, for example, can help identify the
customers, areas, and products that display a high fraud rate. Big Data can even show comparisons
between products and regions, which alert retailers as to where a greater probability of fraud exists.
The retailer can then take proper actions to mitigate the risk accordingly.
In the course of his research, Smith finds that most cases of cheating and fraudulent activities occur in the insurance and retail industries. He concludes that the reason behind such fraudulent activities is that these sectors mainly deal with hardcore financial transactions. He decided to concentrate his
study on the role of Big Data analytics in these industries.
Exhibit 2
Implementing Social Network Analysis for Fraud Prevention
The use of social network analysis for combating fraud is slowly gaining acceptance within a range of
sectors, primarily in financial services, telecommunications, and public organizations. Anti-money
laundering, identity fraud, network fraud, denial of service attacks, and terrorist financing are
some of the areas of fraud where SNA could be used to significantly improve fraud detection. SNA
techniques and tools have been deployed in landmark cases like tracing terrorist funding after 9/11
attacks by FinCEN and insider trading cases identified by the Australian Securities and Investment
Commission. For most businesses though, profit or loss continues to be a key reason to invest in
fraud detection. More than 15% of income loss for the medium-sized businesses in Germany is due
to fraud, corruption, and defalcation. This is the second largest area of loss after theft, burglary,
and assault [Corporate Trust 2009]. Financial crimes are continuously evolving into more complex
systems of attack on businesses and, therefore, the technologies that financial institutions use to
detect and stop these crimes from occurring need to evolve. In Germany, there were more than
4,100 reported cases of cheque fraud in 2002. By 2009, this dropped to only a little more than 600
cases [BKA, 2002 & 2009]. In contrast, reported fraud cases related to card payments have risen
345% just from 2007 to 2009 [BKA, 2008 & 2009]. With the recent economic crisis and obvious
changes in technology, it is important to be more vigilant regarding fraud detection. Increasingly
sophisticated fraudsters are able to easily slip behind risk-score-based analysis to avoid detection,
and to overcome this issue, organizations need to better understand the dynamics and patterns
of fraud and fraud networks. This is where the visual and analytical capabilities of SNA can help
the fraud prevention function to effectively detect and prevent fraud originating from Web-
based and other more traditional business channels. In general, SNA is a ‘data mining technique
that reveals the structure and content of a body of information by representing it as a set of
interconnected, linked objects or entities’. [Mena, 2003]. The perfect combination of advances
in knowledge management, visualization techniques, data availability, and increased computing
power enabled the steady rise of SNA as an interdisciplinary investigative technique in a wide
array of sectors. Unlike other analytical techniques like statistics that are based on the notion
of independence of subjects, SNA can provide useful insight into large datasets along network,
spatial, and time dimensions based on the interconnectedness of the subjects being analyzed.
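The link-analysis idea behind SNA can be illustrated with a small sketch. The entities, links, and the union-find grouping below are illustrative assumptions for the example, not taken from the source:

```python
# Illustrative sketch: grouping entities that share attributes (a basic SNA
# step for surfacing potential fraud rings) using union-find.
def find_rings(links):
    """links: list of (entity_a, entity_b) pairs; returns clusters of size > 1."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in links:
        union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return [c for c in clusters.values() if len(c) > 1]

# Example: two claims sharing a phone number are linked into one cluster.
links = [("claim_1", "phone_555"), ("claim_2", "phone_555"),
         ("claim_3", "addr_9")]
rings = find_rings(links)
```

Claims that share an attribute (here, a phone number) fall into the same cluster — exactly the kind of interconnectedness that per-subject risk scoring would miss.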
Source: http://www.cgi.com/sites/default/files/white-papers/Implementing-social-network-analysis-for-fraud-prevention.pdf
Banks and financial institutions can now track and analyze huge customer databases in real time and take business decisions, such as making customized offers and providing other benefits. These decisions prove to be valuable for organizations as well as customers. Some of the key areas on which banks and other financial institutions focus using Big Data analytics are:
1. Optimizing business operations: Big Data technology improves the effectiveness of system response, the prediction capacity and accuracy of risk models, and the coverage of extensive risks. This provides huge cost savings as there are fewer risks to track, processes are automated, and systems become more predictive.
2. Customer experience: The large and profitable customer bases of many banking and financial institutions have outgrown decision-making strategies based on organizational size and forced these organizations to think in a customer-centric way. Since modern customers expect a lot from these organizations, it is both important and challenging to focus on customers, their needs, and their expectations from banks and financial institutions. To gain a deep insight into customer patterns, these organizations use large data hubs that combine and aggregate raw Big Data, such as purchase patterns, interactions with the brand, and transaction and browsing history, with Big Data analytics. This proves to be a great help in creating customer segmentation,
product idea generation and building a customer-centric infrastructure and culture.
3. Employee performance measurement: This is one of the potential uses of Big Data analytics, which can provide a performance index for the employees of an organization. Applying Big Data in an organization can highlight not only top performers and contributors but also those who are unhappy and struggling. This information can be utilized to optimize the work culture and environment inside the organization.
4. Fraud identification: Big Data helps in analyzing the occurrences, types, and source addresses of unauthorized accesses recorded by banks. With this data, those addresses can be blocked from future access.
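The fraud-identification step above can be sketched in a few lines. The log format and the blocking threshold are assumptions for the example, not from the source:

```python
# Illustrative sketch: flag source addresses with repeated unauthorized-access
# attempts so they can be blocked from future access.
from collections import Counter

def addresses_to_block(access_log, threshold=3):
    """access_log: list of (address, authorized) tuples; returns addresses to block."""
    failures = Counter(addr for addr, ok in access_log if not ok)
    return {addr for addr, n in failures.items() if n >= threshold}

log = [("10.0.0.5", False), ("10.0.0.5", False), ("10.0.0.5", False),
       ("10.0.0.9", False), ("10.0.0.7", True)]
blocked = addresses_to_block(log)
```

A real system would of course feed this from bank access records and apply far richer rules, but the pattern — aggregate by source, compare against a threshold, act — is the same.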
Apart from all this, aligning Big Data with global trends and analyzing it using data analytics can be very useful in providing new business and product ideas, which can be realized at optimal operating costs in a timely manner.
Healthcare companies like Dignity Health have designed programs such as the 'Sepsis Bio-surveillance Program' using a Big Data analytics platform, which enables them to predict sepsis (an inflammatory reaction to infection) cases and manage the 7,500 patients each month who might be under the influence of the disease. A global survey of hospitals conducted by McKinsey says that almost half of the hospitals worldwide have already adopted Big Data analytics and 40 percent of them are about to adopt it.
Self-Instructional
Material
282
Business Applications of Big Data
Figure 12.4 shows the Big Data sources in the healthcare industry:
[Figure: Big Data in healthcare is sourced from hospitals, health insurance firms, research institutions, medical equipment, and pharmaceuticals]
FIGURE 12.4 Big Data in Healthcare
1. Predictive analysis is one of the key capabilities of Big Data analytics, and it can be applied in the entertainment industry as well. The input, which is huge and includes searches, views, log files, viewing history, etc., can be processed using Big Data to find out what consumers want.
2. Using the subscriber and view counts on a given product, media companies can strategize about their products and content promotion to attract and retain customers. Using unstructured Big Data sources, such as social media and call details, analyses can be performed that benefit both content creators and consumers.
3. The entertainment industry can also utilize Big Data to generate other sources of revenue by
suggesting new ways to attract customers, by exploring opportunities for a new product or
identifying a potential product/service.
4. Using Big Data, entertainment companies can identify which devices are being used to consume media. This helps them tune content to the technical and other configurations of the devices on which the media is consumed the most, customizing the content for better traffic.
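Identifying the most-used consumption device, as in point 4, is a simple aggregation. The view events below are made-up sample data:

```python
# Illustrative sketch: counting media view events by device type to find
# where content is consumed the most.
from collections import Counter

views = [("mobile", "show_a"), ("tv", "show_a"), ("mobile", "show_b"),
         ("mobile", "show_a"), ("desktop", "show_b")]

by_device = Counter(device for device, _ in views)
top_device, top_count = by_device.most_common(1)[0]
```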
DATA SCIENCE
Seemingly simple questions are easy to answer when there is a single retail location and a small customer base:
How many basic tees did we sell today?
What else has customer X bought, and what kind of coupons can we send to customer X?
However, with millions of transactions spread across multiple disconnected legacy systems and IT teams, it is very difficult to find answers to such questions. Business insights into customer behavior and company health can be obtained by relating the organization's in-store and online sales. It can be very difficult for a marketing analyst to understand the health and strength of different types of products and campaigns and to reconcile the data obtained from these systems.
these systems. While omnichannel retailing solutions do exist, they require both store managers and
Web developers to learn entirely new systems. The company-wide training and deployment of these
systems would incur huge costs in terms of time and money.
Many times, extracting data in real time is not feasible as systems are affected by scaling issues. Suppose you want to know if a particular item is in stock in another nearby store. This data cannot be found immediately and requires phone calls or other ways of accessing information, which prevents the immediate sale of the item.
If access to the data is possible, there may not be anything particularly rich or useful about it.
Raw transactional data can only help a company understand its sales but does not provide any
relationships, patterns, or other clues for deeper analysis.
Also, the fact remains that most Big Data is just not required and not useful either. Some information in a Big Data feed can have long-term strategic value, some information will be used immediately, and some information will not be used at all. A major part of taming Big Data is identifying which information falls into which of these categories.
RFID technology enables better item tracking by distinguishing items that are out of stock from those available on shelves. For instance, if an item is not available on the shelves, it does not imply
that the item is not available throughout. With the help of an RFID reader and a mobile computer,
the inventory can be immediately verified and stocks replenished, if required.
Various types of RFID tags are available for various environments, such as cardboard boxes,
wooden, glass, or metal containers. Tags also come in various sizes and are of varied capabilities,
including read and write capability, memory, and power requirements. They also have a wide range
of durability. Some varieties are paper-thin and are typically for one-time use and are called ‘smart
labels’. RFID tags can also be customized and withstand heat, moisture, acids, and other extreme
conditions. Some RFID tags are also reusable, thus offering a Total Cost of Ownership (TCO) benefit over bar code labels.
The use of RFIDs saves time, reduces labor, enhances the visibility of products throughout the
production-delivery cycle, and saves costs. Some common benefits of using RFID are shown in
Figure 12.5:
[Figure 12.5: common benefits of using RFID — asset management, inventory control, and regulatory compliance]
Asset Management
Organizations can tag all their capital assets, such as pallets, vehicles, and tools, in order to trace
them anytime and from any location. Readers fixed at specific locations can observe and record
all movements of the tagged assets with great accuracy. This mechanism also works as a security check and alerts supervisors and raises an alarm in case anyone tries to take the asset outside the authorized area.
When containers are loaded for shipment, pallets with RFID tracking tags are included in them. These
RFIDs contain records of what is stored in the container. This helps production managers to have
a complete view of the inventory level and location of containers. This information can be used to
locate items and fulfil rush orders without any waste of time.
Shipping containers, pallets, cylinders, and reusable plastic bottles having RFID tags can be easily
identified at the dock entry as they leave with an outbound consignment. After the database is
matched with the shipping information, the manufacturers of the products create a log of each
shipping container with its details and develop a procedure for tracking their goods. This information
can be utilized to reduce the time required for documentation and can be of great value in resolving
disputes of lost and damaged goods.
Inventory Control
One of the primary benefits of using RFID is inventory tracking, especially in areas where tracking
has not been done or was not possible earlier. RFID tags can be read even if the contents are packed
and are not in the direct line of sight. This means that an entire pallet with an assortment of goods
can be read without disturbing the arrangement of goods in the pallet. RFID tags are resistant to
temperature and environmental variances, such as dirt, moisture, heat, and contaminants. On the
other hand, bar codes cannot handle such conditions and are prone to damage or errors.
Using an RFID tracking system can result in an optimized inventory level, and thus reduce the overall
cost of stocking and labor. RFID allows manufacturers to track inventory for raw materials, work in
progress, or finished goods. Readers installed on shelves can update inventory automatically and
raise alarms in case the requirement for restocking arises.
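The shelf-reader behavior just described — update inventory from tag events and raise a restocking alarm — can be sketched as follows. The event format, item names, and reorder threshold are assumptions for the example:

```python
# Illustrative sketch: shelf readers report RFID tag events as (item, delta)
# pairs; inventory is updated and a restock alert raised at a reorder level.
def process_events(stock, events, reorder_level=5):
    """Apply reader events to stock in place; return items needing restock."""
    alerts = []
    for item, delta in events:
        stock[item] = stock.get(item, 0) + delta
        if stock[item] <= reorder_level:
            alerts.append(item)
    return alerts

stock = {"tee_basic": 12, "jeans_slim": 6}
events = [("tee_basic", -4), ("jeans_slim", -2), ("tee_basic", -5)]
alerts = process_events(stock, events)
```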
RFIDs are used to create secure storage areas wherein readers can be programmed to raise an alarm
in case the items are removed or placed elsewhere. A study has indicated that a consumer goods
manufacturer can reduce the chance of shrinkage and loss of inventory by approximately 10% by using RFID tags.
Nowadays, the Serial Shipping Container Code (SSCC) is widely used in shipping labels. SSCC codes can be easily encoded into RFID tags in order to provide automatic handling of shipments.
The data contained in the RFID tag can be combined with the shipment information and easily read by the receiving organization to simplify the receiving process and eliminate processing delays.
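As a brief aside on how an SSCC value becomes machine-verifiable before being encoded into a tag, the sketch below computes the standard GS1 mod-10 check digit over a 17-digit SSCC body; the sample number itself is made up:

```python
# Illustrative sketch: GS1 mod-10 check digit for an SSCC-18 code.
# Digits are weighted 3, 1, 3, 1, ... starting from the rightmost data digit.
def sscc_check_digit(digits17):
    """digits17: 17-digit SSCC string without the check digit; returns 0-9."""
    total = sum(int(d) * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(digits17)))
    return (10 - total % 10) % 10

sscc_body = "00614141123456789"          # made-up example body
full_sscc = sscc_body + str(sscc_check_digit(sscc_body))
```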
Regulatory Compliance
Provided the RFID tag that travels with the material has been updated with all the handling data, the entire custody trail can be produced before regulatory bodies, such as the Food and Drug Administration (FDA), the Department of Transportation (DOT), and the Occupational Safety and Health Administration (OSHA), to meet their regulatory requirements. This can be of great use for companies that work with hazardous items, food, pharmaceuticals, and other regulated materials.
A logistics company brings various packages from different locations to a hub. Thereafter, it sorts out the urgent ones for morning delivery from the regular ones. This is where RFID can help in locating these packages or pallets and loading them for faster delivery.
The RFID tag and the data it carries will always remain on the product. If future repairs are required, the technician can access this information without accessing any external database, which helps in reducing calls and time-consuming enquiries into documents.
Exhibit 3
AXA OYAK uses the SAS Social CRM solution to deal with risks and avoid frauds
A Turkish insurance company called AXA OYAK uses the SAS Social CRM solution to deal with
risks and avoid frauds. With the help of the social CRM, AXA cleans its customer portfolio data at
regular intervals. This helps AXA OYAK to detect and rectify inconsistencies in customer data that
enables it to relate even two slightly different records of the same customer. Thus, AXA OYAK
is able to make more accurate analysis of customer data and examine fraudulent claims more
efficiently. SAS enables the insurance company to detect the relationships between customer
behavior and fraudulent claims quickly and efficiently. Using an SAS data warehouse, AXA could
check its customer data on the basis of the flags generated while analyzing certain relationships
between datasets.
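The step of relating two slightly different records of the same customer, as described in the exhibit, can be sketched with a simple string-similarity check. The record fields, sample names, and similarity threshold are assumptions for the example, not AXA OYAK's actual method:

```python
# Illustrative sketch: detecting that two slightly different customer records
# likely refer to the same person, using a normalized similarity ratio.
from difflib import SequenceMatcher

def same_customer(rec_a, rec_b, threshold=0.85):
    """Compare normalized 'name' fields; True if similarity >= threshold."""
    a = rec_a["name"].lower().replace(".", "").strip()
    b = rec_b["name"].lower().replace(".", "").strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

r1 = {"name": "Mehmet Yilmaz"}
r2 = {"name": "Mehmet Yilmaz."}   # same customer, slightly different record
r3 = {"name": "Ayse Demir"}
```

Production-grade record linkage would compare many fields and use trained models, but the principle — normalize, compare, flag near-duplicates — is the same.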
Educational institutions believe that they are providing the best facilities and quality education, and in reality, they do put their best efforts in favour of students. However, this effort needs to be extended to exploring new effective teaching mechanisms, monitoring and predicting students' performance, and identifying the strengths and weaknesses of students. Big Data helps in achieving these tasks easily and effectively.
Figure 12.6 shows the ways in which Big Data can impact the education sector.
[Figure: Big Data in Education — decreasing dropouts, improving students' performance, and customised courses]
FIGURE 12.6 Big Data in Education sector
Let us discuss these ways one-by-one.
Improving students' performance: In earlier times, students were judged only on the basis of the marks scored in exams and tests. However, these marks do not reveal the actual performance trail of a student. Big Data can keep track of various student-related activities, such as which books are being referred to, how many problems have been skipped, and many others. Monitoring such activities helps in improving students' performance by optimising the learning environment.
Customised courses: Each student has his/her own subjects of interest. Using Big Data, educational institutions can create customised courses or programs as per each student's choice. Big Data helps in implementing the concept of blended learning, which combines digital and traditional learning. With such implementations, students have the flexibility to select their courses and subjects.
Decreasing dropouts: Once Big Data has been implemented, it improves student performance, leading to decreased dropout rates. Dropout rates are lowered when students are tracked across different activities and provided with instant feedback.
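Flagging students who may need support from tracked activity signals, as described above, can be sketched as a simple rule. The metrics and cutoffs below are assumptions for the example:

```python
# Illustrative sketch: flag potentially at-risk students from simple
# activity metrics gathered by a learning platform.
def flag_at_risk(students, max_skipped=10, min_logins=3):
    """students: dicts with 'id', 'problems_skipped', 'weekly_logins'."""
    return [s["id"] for s in students
            if s["problems_skipped"] > max_skipped
            or s["weekly_logins"] < min_logins]

roster = [
    {"id": "S1", "problems_skipped": 2,  "weekly_logins": 5},
    {"id": "S2", "problems_skipped": 14, "weekly_logins": 4},
    {"id": "S3", "problems_skipped": 1,  "weekly_logins": 1},
]
at_risk = flag_at_risk(roster)
```

Flagged students can then receive the instant feedback and attention the text describes.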
Summary
This chapter explored the need and importance of conducting Big Data analytics in the business context. You first learned how businesses process and use the data collected from various social
network platforms. Next, the chapter discussed how Big Data analytics helps in detecting and
preventing fraudulent activities in businesses, especially in the insurance and retail sectors. Towards
the end, you learned about the use of RFID tags in the retail industry.
Exercise
Multiple-Choice Questions
Q1. Which application of social network data analysis is used by a customer retention manager?
a. Business intelligence b. Marketing
c. Product design and development d. Insurance fraud
Q2. Identify the types of frauds that typically impact online retailers.
a. Credit card fraud b. Forward fraud
c. Corporate fraud d. Insurance fraud
Q3. How can Big Data analytics help prevent fraud?
a. Analyze all the data b. Detect fraud in real time
c. Use predictive analytics d. All of these
Q4. From the following, select the analytical method that Social Network Analysis (SNA) uses to
show relationships via links:
a. Organizational business rules b. Pattern framework
c. Link analysis d. Statistical methods
Q5. Identify the technologies that enable fraud identification and the predictive modeling
process:
a. Text mining b. Social media data analysis
c. Regression analysis d. All of these
Q6. Predictive models based on both historical and real-time data can help ________________
to identify suspected cases of fraud in the early stages.
Assignment
Q1. Discuss some areas in which decision-making processes are influenced by social network
data.
Q2. List some common types of financial frauds prevalent in the current business scenario.
Q3. In what ways does analyzing Big Data help organizations prevent fraud?
Q4. List some methods used for the verification of credit cards.
Q5. List the steps that SNA follows to detect fraud.
Q6. Write a note on Social Customer Relationship Management.
Q7. How does the use of RFID tags help in inventory control? Explain.
References
https://www.journals.elsevier.com/future-generation-computer-systems/call-for-papers/
special-issue-on-social-networking-big-data-opportunities-so
https://www.business2community.com/social-media/big-data-analytics-social-media-02007051
https://www.simplilearn.com/big-data-transforming-retail-industry-article
https://www-935.ibm.com/services/us/gbs/thoughtleadership/big-data-retail/
Answers for Multiple-Choice Questions
1. a. 2. a. 3. d. 4. c. 5. a. 6. b. 7. a. 8. a. 9. c. 10. d.
C A S E S T U D Y
USING BIG DATA TO DRIVE PRODUCT INNOVATION
This Case Study discusses how Big Data is being used by various organizations to drive product
innovation.
Traditionally, organizations have been storing and using data for improving their products and
services. However, in the past few decades, technology has made great advancements and led
to the development of various products, such as smartphones, computers, sensors, PDAs, etc. All
these devices or products help in collecting data that is huge in amount and complex in nature, and,
thus, called Big Data. Specialized data analysis techniques are developed to process such Big Data.
Nowadays, big brands such as Google, Apple, etc., are using their huge amounts of data to stay
ahead of smaller competitors having small databases. However, certain smart small start-ups are also giving tough competition to the bigger, established organizations.
Source: http://www.ibmbigdatahub.com/blog/big-data-challenge-transformation-manufacturing-industry
Product manufacturers work closely with their retail, supply chain and research partners, and
generate huge amounts of data that are analyzed to answer questions such as:
1. Who are our end customers or the final users of our products and what do they expect from
our products?
2. How can we improve or innovate our products?
3. What are the present opportunities that exist in the market?
An organization can optimize its data gathering and analysis techniques as well as technologies to
gather useful and actionable results in order to innovate products.
In order to innovate products, the organizations may implement the following solutions:
1. Get to know your customers
Organizations should gather customer information on a continuous basis, such as sales numbers, customer data, surfing habits, shopping preferences, hobbies, etc. Such information should be combined with demographic information, such as age, gender, income, occupation, geography, etc., to generate a set of customer personas, which are used to create different customer segments. New product concepts can then be developed to target such segments.
Live Nation is an American organization that works as an event promoter and venue operator.
John Forese is the SVP & Head of Data and Marketing Services at Live Nation. According to
John, organizations use data to develop products that customers really want and to market
them appropriately. For example, the data that is collected from consumers of Live Nation
can help provide a fair picture regarding the number and nature of music or sports fans. The
consumer-buying pattern can be combined with demographics and psychographics in order
to gain information for its varied clients.
2. Chase the signals
With advancement in technology and increase in the use of digital devices, consumers live in
a truly virtual environment where it is extremely easy to gather data about the consumers.
Organizations gather data regarding how consumers use their products and how they behave
within the product environment. All these activities generate huge amounts of data which
is then organized and analyzed using statistical and other techniques. The analysis of data
reveals important facts, figures and trends used by the organizations to plan their actions.
FiveThirtyEight is an opinion poll analysis, politics, economics and sports website. According
to its founder, Nate Silver, the possibilities in the world cannot be looked at from a black and
white angle. Forecasts published by Nate are probabilistic in nature and a range of possible
outcomes must be provided. While conducting research and data analysis, organizations may
want the research team to provide definitive and actionable directives. However, it is not
always possible. While conducting data analysis, it may be observed that data contradicts
internal wisdom and, thus, must be reconciled. At times, it is difficult to separate the signals
(meaningful trends or facts) from noise (data that are inconsistent with greater trends).
Organizations must chase the signals in order to take advantage of any potential or unrealized
opportunity.
For example, a synthetic rubber maker observes that the demand for synthetic rubber was on
the rise in 2017, and there was a shortage of synthetic rubber distributors in 2018. Should the
synthetic rubber maker try to extend its distribution network in 2018? The synthetic rubber
maker can collect all the details regarding the use of natural and synthetic rubbers in the past
few years and hand over that data to the data analysts and product researchers. In addition,
the rubber maker can analyze online shopping data of rubber products to study consumer
buying. Assume that the trends (signals) are encouraging. Therefore, it calls for further signal
chasing by placing certain synthetic rubber products at a few selected retail outlets and
online platforms and tracking the customer interests.
3. Gather maximum amounts of data from maximum number of sources
The traditional sources of data, such as sales figures, revenue, customer feedback surveys,
etc., can be used along with other non-conventional sources of data. For instance, customer
profiles along with demographics, browsing and shopping data can be acquired from third
parties. An organization should elicit sales and customer data from retail and manufacturing
partners. An organization may also keep track of product returns and replacements during
warranty period in order to analyze product problems. An organization may also design
products along with sensors that would help in tracking product usage. Product reviews
of the organization’s products and the products of its competitors must be analyzed to
develop product fixes, improvements and new features. Another very practical approach for
gathering data is to engage in social listening.
After the data has been collected from all the possible sources, it should be correlated, analyzed, and interpreted to gain a better understanding of the product.
4. Use quantitative data along with qualitative data
Big Data is usually related to quantitative data that is gathered, stored, analyzed and visualized.
However, it fails to capture the qualitative information, such as emotional, aspirational,
and environmental aspects. Therefore, it is important that the product innovators and the
research teams include and take care of the qualitative data while developing new products.
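The signal-versus-noise idea in point 2 — separating a meaningful trend from noisy month-to-month swings — can be sketched with a simple moving average. The demand figures below are made up for illustration:

```python
# Illustrative sketch: smoothing noisy monthly demand data with a simple
# moving average so the underlying trend (the "signal") becomes visible.
def moving_average(series, window=3):
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

demand = [100, 130, 95, 140, 150, 135, 170]   # noisy raw figures
trend = moving_average(demand)                # steadily rising signal
```

The raw series jumps up and down, but the smoothed series rises steadily — the kind of encouraging signal the synthetic rubber maker in the example would chase further.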
Conclusion
An organization can use different data types, such as social listening, data from sensors on product
usage, customer demographics, customer reviews, retail sales, customer information, return and
warranty data, etc., for product innovation.
Questions
1. Data is analyzed today and was analyzed in the past as well. What improvements have taken place in the data analysis field as we know it today?
(Hint: In the past few decades, technology has made great advancements and led to the
development of various products, such as smartphones, computers, sensors, PDAs, etc.
All these devices or products help in collecting data that is huge in amount and complex
in nature, and, thus, called Big Data. Specialized data analysis techniques are developed to
process such Big Data.)
2. What strategies can an organization adopt in order to innovate its products?
(Hint: Get to know your customers by using data, chase the signals, etc.)
CHAPTER
13
Hadoop
Topics Discussed
Introduction
Hadoop
Features of Hadoop
Chapter Objectives
After completing Chapter 13, you should be able to:
Describe Hadoop architecture
INTRODUCTION
Handling the huge volumes of data generated by billions of online activities and transactions requires continuous upgradation and evolution of Big Data technologies. One such technology is Hadoop.
People often confuse Hadoop with Big Data. Hadoop is not Big Data; however, it is also true that
Hadoop plays an integral part in almost all Big Data processes. In fact, it is almost impossible to use
Big Data without the tools and techniques of Hadoop. So what exactly is Hadoop?
According to Apache, “Hadoop49 is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models. It is designed to scale
up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failures.” In simple words, Hadoop is a ‘software library’
that allows its users to process large datasets across distributed clusters of computers, thereby
enabling them to gather, store and analyze huge sets of data.
Hadoop provides various tools and technologies, collectively termed as the Hadoop ecosystem, to
enable development and deployment of Big Data solutions.
In this chapter, we will discuss Hadoop ecosystem and the various components of Hadoop. Next, we
will learn about the features of Hadoop. Finally, we will discuss the functioning of Hadoop.
HADOOP
Traditional technologies have proved incapable of handling the huge amounts of data generated in organizations and of fulfilling the processing requirements of such data. Therefore, a need was felt to
combine a number of technologies and products into a system that can overcome the challenges
faced by the traditional processing systems in handling Big Data.
One of the technologies designed to process Big Data (which is a combination of both structured
and unstructured data available in huge volumes) is known as Hadoop. Hadoop is an open-source
platform that provides the analytical technologies and computational power required to work with such data.
Earlier, distributed environments were used to process high volumes of data. However, multiple
nodes in such an environment may not always cooperate with each other through a communication
system, leaving a lot of scope for errors. Hadoop platform provides an improved programming
model, which is used to create and run distributed systems quickly and efficiently.
Hadoop is a distributed file system, which allows you to store and handle massive amounts of data on a cloud of machines. The main benefit of using Hadoop is that since data is stored on multiple nodes, it is preferable to process it in a distributed way: the data stored on a node is processed by that node itself instead of spending time distributing it over the network. In a relational database computing system, you can query data in real time; however, in the case of huge amounts of data, it becomes almost impossible to store the data in tables, records, and columns.
Some common uses of Hadoop are as follows:
Used in advertisement targeting platforms to capture and analyze the social media data
Used to manage the content, posts, images and videos on social media platforms
Used by financial agencies to reduce risk, analyze fraud patterns, and improve customer
satisfaction
Hadoop Ecosystem
Hadoop ecosystem50 refers to a collection of components of the Apache Hadoop software library,
including the accessories and tools provided by the Apache Software Foundation. Several elements
of the Hadoop ecosystem differ from one another in their architecture, but they are gathered under a single system because they all derive from shared Hadoop functionalities, such as scalability and power. In simple words, the Hadoop ecosystem can be defined as a comprehensive collection of tools and technologies that can be effectively
implemented and deployed to provide Big Data solutions in a cost-effective manner. MapReduce
and Hadoop Distributed File System (HDFS) are two core components of the Hadoop ecosystem
that provide a great starting point to manage Big Data; however, they are not sufficient to deal
with Big Data challenges. Along with these two, Hadoop ecosystem provides a collection of various
elements to support the complete development and deployment of Big Data solutions.
[Figure 13.1: The Hadoop ecosystem — tools such as Sqoop (data exchange), Hive (SQL query), Pig (scripting), Mahout, Oozie (workflow), and R connectors (statistics), along with coordination services, built on top of HDFS (Hadoop Distributed File System)]
All these elements enable users to process large datasets in real time and provide tools to support
various types of Hadoop projects, schedule jobs, and manage cluster resources.
MapReduce and HDFS provide the necessary services and basic structure to deal with the
core requirements of Big Data solutions. Other services and tools of the ecosystem provide the
environment and components required to build and manage purpose-driven Big Data applications.
In the absence of an ecosystem, the developers, database administrators, system and network
managers will have to implement separate sets of technologies to create Big Data solutions.
However, such an approach would prove to be expensive in terms of both time and money.
HADOOP ARCHITECTURE
A Hadoop cluster consists of a single master node and multiple worker nodes. The master node contains a NameNode and a JobTracker, whereas a slave or worker node acts as both a DataNode and a TaskTracker. Secure shell (SSH) access should be set up between the nodes in the cluster, as required by the standard startup and shutdown scripts. In a larger cluster, HDFS is managed through a NameNode server to
host the file system index. A secondary NameNode keeps snapshots of the NameNode. At the time of failure of the primary NameNode, the secondary NameNode replaces it, thus preventing the file system from getting corrupted and reducing data loss. Figure 13.2 shows the
Hadoop multinode cluster architecture:
[Figure 13.2: Hadoop multinode cluster architecture — an HDFS client performs file-system namespace and metadata operations through the NameNode; a secondary NameNode holds a namespace backup; multiple DataNodes handle data serving]
The secondary NameNode takes snapshots of the primary NameNode directory information after
a regular interval of time, and these snapshots are saved in local or remote directories. These checkpoint images can be used in place of the primary NameNode to restart a failed primary NameNode without replaying the entire journal of file-system actions and then editing the log to create an up-to-date directory structure. To process the data, the JobTracker assigns tasks to the TaskTrackers. Let us assume that a DataNode in the cluster goes down while processing is going on; the NameNode must know that this DataNode is down, otherwise it cannot continue processing. Each DataNode sends a 'heartbeat' signal to the NameNode at a regular interval (every few seconds by default) to make the NameNode aware of the active/inactive status of each DataNode. This system is called the Heartbeat mechanism.
COMPONENTS OF HADOOP
There are two main components of Apache Hadoop—the Hadoop Distributed File System (HDFS)
and the MapReduce parallel processing framework. Both of these components are open source
projects—HDFS is used for storage and MapReduce is used for processing.
HDFS is a fault-tolerant storage system that stores files ranging in size from terabytes to petabytes across different terminals. HDFS replicates the data over multiple hosts to achieve reliability. The
default replication value is 3: data is replicated on three nodes, two on the same rack and one on a different rack. A file in HDFS is split into large blocks of around 64 to 128 megabytes, and each block of the file is independently replicated at multiple DataNodes. The NameNode actively monitors the number of replicas of each block (3, by default). When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
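The replica-monitoring behaviour described above can be sketched in a few lines of Python. This is a simplified, hypothetical illustration (the data structures and function names are invented for this sketch, not taken from Hadoop's code): given a map from block IDs to the DataNodes holding them, find under-replicated blocks and schedule a new replica on a node that does not already hold one.

```python
# Simplified sketch of NameNode-style replica monitoring (illustrative only,
# not Hadoop's actual code). block_locations maps block id -> set of DataNodes.
REPLICATION_FACTOR = 3

def under_replicated(block_locations, target=REPLICATION_FACTOR):
    """Return block ids whose replica count has fallen below the target."""
    return [blk for blk, nodes in block_locations.items() if len(nodes) < target]

def choose_new_replica_node(nodes_holding, all_nodes):
    """Pick any live node that does not already hold a replica."""
    candidates = sorted(set(all_nodes) - set(nodes_holding))
    return candidates[0] if candidates else None

all_nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = {
    "blk_1": {"dn1", "dn2", "dn3"},   # fully replicated
    "blk_2": {"dn1", "dn4"},          # lost one replica (e.g., a disk failure)
}
for blk in under_replicated(blocks):
    target_node = choose_new_replica_node(blocks[blk], all_nodes)
    blocks[blk].add(target_node)      # NameNode schedules re-replication
```

In real HDFS the NameNode also applies rack awareness when choosing the target node; this sketch simply picks the first available node.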
MapReduce is a framework that helps developers write programs to process large volumes of unstructured data in parallel over a distributed or standalone architecture and obtain the output in an aggregated format.
MapReduce consists of several components. Some of the most important ones are:
JobTracker: Master node that manages all jobs and resources in a cluster of commodity
computers
TaskTrackers: Agents deployed at each machine in the cluster to run the map and reduce task
at the terminal
JobHistoryServer: Component that tracks completed jobs
We can write MapReduce programs in several languages, such as C, C++, Java, Ruby, Perl, and Python. Programmers use the MapReduce libraries to build tasks without having to handle communication and coordination between nodes themselves. Each node periodically reports its status to the master node; if a node does not respond as expected, the master node reassigns that piece of the job to other available nodes in the cluster, so we can say that MapReduce is also fault-tolerant.
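As a concrete illustration of the map and reduce phases just described, here is a minimal word count written in pure Python. The function names and the in-memory "shuffle" step are assumptions made for this sketch; with Hadoop Streaming, the same mapper and reducer logic would read from stdin and write to stdout instead.

```python
# A minimal word count in the map/shuffle/reduce style (pure-Python sketch).
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: aggregate all counts emitted for one word.
    return (word, sum(counts))

def run_job(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort phase: group pairs by key, as the framework would.
    pairs.sort(key=itemgetter(0))
    return dict(reducer(k, (c for _, c in g))
                for k, g in groupby(pairs, key=itemgetter(0)))

result = run_job(["big data big cluster", "data node"])
```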
Exhibit-1
Facebook Hadoop Architecture
Each day, around 30 million Facebook users update their status; every month, around 10 million users upload videos; every week, around 1 billion users share pieces of content; and every month, more than 1 billion users upload photos. To manage such a huge amount of data, Facebook uses Hadoop to interact with petabytes of data. Facebook employs the world's largest Hadoop cluster, with more than 4,000 machines storing hundreds of millions of gigabytes of data. The biggest Hadoop cluster at Facebook has about 2,500 CPU cores and 1 PB of disk space. Moreover, the engineers at Facebook load more than 250 GB of compressed data into HDFS daily, with hundreds of Hadoop jobs running on these datasets.
The data is distributed across these machines (also known as nodes), which are allowed to work independently and provide their responses to the starting node. Moreover, it is possible to add or remove nodes dynamically in a Hadoop cluster on the basis of varying workloads. Hadoop has the ability to detect changes in the cluster (including server failures) and adjust to them without causing any interruption in the system.
Hadoop accomplishes its operations (dividing computing tasks into subtasks that are handled by individual nodes) with the help of the MapReduce model, which comprises two functions: the mapper and the reducer. The mapper function maps the different computation subtasks onto the nodes, and the reducer function reduces the responses coming from the compute nodes to a single result. The MapReduce model implements the MapReduce algorithm, as discussed earlier, to incorporate the capability of breaking data into manageable subtasks, processing the data on the distributed cluster simultaneously, and making the data available for additional processing or user consumption.
In the MapReduce algorithm, the operations of distributing tasks across various systems, handling task placement for load balancing, and managing failure recovery are accomplished during the map phase (by the mapper function). The reduce phase (the reducer function), on the other hand, has the responsibility of aggregating all the elements together after the completion of the distributed computation.
When an indexing job is provided to Hadoop, the organizational data must be loaded first. Next, the data is divided into various pieces, and each piece is forwarded to a different individual server. Each server receives a job code with the piece of data it is required to process; the job code helps Hadoop track the current state of data processing. Once a server completes its operations on the data provided to it, the response is forwarded with the job code appended to the result.
In the end, results from all the nodes are integrated by the Hadoop software and provided to the user, as shown in Figure 13.3:
FIGURE 13.3 Processing an Indexing Job in Hadoop (input data and job codes are distributed to servers, and the Hadoop software integrates the results)
Consider an example. A researcher examines the call records of all the telephones in a city and wants to know specifically about the calls made by college students on the occasion of an event. The fields required as a result of the analysis include the timing of the event and the relevant information about the user. The query is fired on every machine to search the call records stored on that machine, and each machine returns its relevant results. Finally, a single result is generated by aggregating the individual results obtained from all the machines. We assume that the records are collected in a Comma-Separated Values (CSV) file.
The processing starts by first loading the data into Hadoop and then applying the MapReduce
programming model. Let us consider the following five columns to be contained in the CSV file:
u_id, u_name, c_name, sp_name, call_time
To identify the individual users who made phone calls at a specific time, the u_id field is used. It helps us determine the number of users who called at that time, so the final output is the number of users who made calls during the specified time. To obtain this output, each mapper receives the data line by line. Once the mappers complete their job, the results are shuffled and sorted by the Hadoop framework, which then combines the data into groups that are forwarded to the reducer. The final output is ultimately obtained from the reducer. Data in Hadoop can be stored across multiple machines. Businesses can take advantage of this storage facility by using multiple commodity machines capable of hosting the Hadoop software; in this case, they do not need to create integrated systems.
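The call-record analysis described above can be sketched as follows. The column order of the CSV file and the sample records are assumptions made for illustration; the sketch counts the distinct u_id values observed at a given call_time.

```python
# Sketch of the call-record example (pure Python; the column order
# [u_id, u_name, c_name, sp_name, call_time] and the sample data are assumed).
import csv
from collections import defaultdict

def mapper(row):
    # Emit (call_time, u_id) pairs, one per record.
    u_id, call_time = row[0], row[4]
    yield (call_time, u_id)

def count_callers(csv_lines, target_time):
    groups = defaultdict(set)              # shuffle: group user ids by call time
    for row in csv.reader(csv_lines):
        for call_time, u_id in mapper(row):
            groups[call_time].add(u_id)
    # reduce: number of distinct users who called at the target time
    return len(groups.get(target_time, set()))

records = [
    "u1,Asha,Delhi,AirTel,18:00",
    "u2,Ravi,Delhi,Jio,18:00",
    "u1,Asha,Delhi,AirTel,19:00",
]
n = count_callers(records, "18:00")        # two distinct callers at 18:00
```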
Features of Hadoop
Hadoop has become the most commonly used platform for Big Data processing in businesses. It helps in Big Data analytics by overcoming the obstacles usually faced in handling Big Data. Hadoop allows analysts to break down large computational problems into smaller tasks, since smaller elements can be analyzed quickly and economically. All these parts are analyzed in parallel, and the results of the analysis are regrouped to produce the final output.
Hadoop follows the client–server architecture in which the server works as a master and is
responsible for data distribution among clients that are commodity machines and work as
slaves to carry out all the computational tasks. The master node also performs the tasks of job
controlling, disk management, and work allocation.
The data stored across various nodes can be tracked through the Hadoop NameNode, which helps in accessing and retrieving data as and when required.
Hadoop improves data processing by running computing tasks on all available processors that
are working in parallel. The performance of Hadoop remains up to the mark both in the case of
complex computational questions and of large and varied data.
Hadoop keeps multiple copies of data (data replicas) to improve resilience, which helps maintain availability, especially in case of server failure. Usually, three copies of the data are maintained, so the default replication factor in Hadoop is 3.
Summary
In this chapter, you have learned about Hadoop, an open-source framework that provides a distributed file system for processing Big Data. Hadoop uses the MapReduce programming model of data processing, which allows users to split big datasets and process them into meaningful information. Next, you learned about the Hadoop architecture as well as the two main components of Hadoop: HDFS and MapReduce. Finally, you learned about the basic functions and features of Hadoop.
Exercise
Multiple-Choice Questions
Q1. How does the Hadoop architecture use computing resources?
a. By distributing software to computing resources
b. By distributing data and computing tasks to computing resources
c. By creating shared memory for computing resources
d. By distributing data to computing resources
Q2. Hadoop makes the system more resilient by ________________.
a. Using an effective firewall and anti-virus
b. Keeping multiple copies of data
c. Keeping each computing resource isolated
d. Uploading data to a cloud for backup
Q3. Hadoop is a/an _______________ that provides analytical technologies and computational
power required to work with such large volumes of data.
a. digital platform b. open-source platform
c. cross-platform d. social media platform
Q4. Hadoop follows the client–server architecture in which the server works as a master and
performs the tasks of _______________.
a. job controlling b. disk management
c. work allocation d. All of the above
Q5. In the MapReduce algorithm, the ______________ component has the responsibility to
aggregate all the elements together after the completion of the distributed computation.
a. Map b. Reduce
c. HDFS d. Hive
Assignment
References
http://ercoppa.github.io/HadoopInternals/HadoopArchitectureOverview.html
http://www.bmcsoftware.in/guides/hadoop-ecosystem.html
https://data-flair.training/blogs/hadoop-ecosystem-components/
Source: https://www.fiercetelecom.com/telecom/nokia-acquires-deepfield-enhances-ip-network-security-analytics-capabilities
Nokia is aware that today's most important resource is data. Therefore, Nokia has resolved to leverage digital data that can be used to easily navigate the physical world. To achieve this, Nokia wanted to find a technology solution that could support the collection, storage, and analysis of a number of data types in large volumes.
After Nokia returned to the smartphone business, effective collection and use of data became important for Nokia in order to understand users' experiences with its phones and other location products. By analysing user feedback based on product or service use, Nokia could make improvements and eliminate problems in its products. According to Amy O'Connor, the Senior Director of Analytics at Nokia, the company focuses on conducting data processing and complex analysis in order to build maps with predictive traffic and layered elevation models. To understand phone quality and user expectations, Nokia also sources information about points of interest from around the world.
Since Nokia has exposure to really large amounts of data (Big Data), it has adopted a Teradata Enterprise Data Warehouse (EDW), various Oracle and MySQL data marts, visualization technologies, and the Hadoop platform to perform data analysis. Together these constitute Nokia's technology ecosystem. Nokia runs a Hadoop Distributed File System (HDFS) holding over 100 TB of structured data and petabytes of multi-structured data on Dell PowerEdge servers. As of 2012, Nokia's centralised Hadoop cluster contained 0.5 PB of data. Nokia's warehouses and marts constantly stream data into the Hadoop environment, where it can be accessed and used by Nokia's employees.
Nokia moves data from servers at one location to Hadoop clusters at another location by using Scribe processes. Data is moved from HDFS to Oracle/Teradata using Sqoop, and data is moved out of Hadoop using HBase.
C A S E S T U D Y
Before deploying Hadoop, the various applications developed by Nokia were integrated in order to allow cross-referencing and to create a single dataset. Users interacting with mobile phones generate data regarding the services; in addition, log files and other data are generated. All these data and log files are collected and processed by Nokia to gain insights about markets and the behaviour of different user groups. It was not cost-effective for the company to capture petabyte-scale data using a relational database.
Therefore, the company decided to deploy a Hadoop platform due to the following benefits:
Cost per TB of storage is about 10 times cheaper than in traditional relational data warehouse systems.
Hadoop is a rapidly evolving platform, and it is difficult to deploy the tools designed to support it. Therefore, the company decided to use a customized Hadoop platform, Cloudera's Distribution Including Apache Hadoop (CDH). CDH bundles the most popular open-source projects in the Apache Hadoop stack into a single, integrated package with steady and reliable releases.
After a successful trial of the Hadoop system, Nokia deployed a central CDH cluster to serve as the company's enterprise-wide information core. Cloudera helped deploy the platform from start to finish and, at the same time, ensured that it was fully integrated with other Hadoop clusters and relational technologies for reliability and performance.
The benefits realised by Nokia after implementing CDH are described as follows:
Nokia uses Hadoop to create 3D digital maps. These maps can incorporate traffic models that understand speed categories, recent speeds on roads, historical traffic models, elevation, ongoing events, video streams of the world, and more.
Hadoop offers scalability and flexibility.
The usage patterns of customers across the various applications running in Nokia's systems can be judged.
Questions
1. What changes did Nokia want to implement after it decided to fully enter into the smartphones
market?
(Hint: After Nokia returned to the smartphone business, effective collection and use of data
has become important in order to understand user’s experiences with their phones and other
location products. By analysing the feedback of users based on their product or service use,
Nokia could make improvements and eliminate the problems in their products.)
2. What were the benefits realised by Nokia after implementing Cloudera’s Hadoop platform?
(Hint: Scalability and flexibility; Usage pattern of the customers across various applications
can be judged.)
L A B E X E R C I S E
In this Lab Exercise, you are going to perform the steps for installing Hadoop on a computer system. Ensure that the steps are performed sequentially and accurately to complete the installation.
LAB 1
Solution: Apache Hadoop is one of the most popular frameworks for processing Big Data. As a Big Data professional, you are likely to work on the Hadoop platform extensively; hence, knowledge of the Hadoop installation process is essential if you choose a career as a Big Data Developer, and an added advantage if you choose to be a Big Data Analyst. Let us start with the prerequisites for installing Hadoop:
1. To run Hadoop, we have to install Sun JDK 1.7.
2. Secure Shell (SSH) facilitates communication between Hadoop and other entities, such as localhost or remote nodes. SSH can be configured in the terminal by using the following commands:
$ sudo apt-get install openssh-server
$ sudo apt-get install openssh-client
These commands install the SSH server on port 22 by default. After this step, generate the
SSH key for the current user by using the following command:
$ ssh-keygen -t rsa -P ""
T
To verify SSH installation, open a new terminal and try to create an SSH session by using the
following command:
$ ssh localhost
IM
3. Since Ubuntu uses Internet Protocol version 4 (IPv4) [0.0.0.0], IPv6 needs to be disabled. Open the configuration file as root by using the following command:
$ sudo gedit /etc/sysctl.conf
This command opens sysctl.conf in a text editor. Disable IPv6 by adding the following lines:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have fulfilled all the prerequisites for installing Hadoop.
FIGURE 13.4 Hadoop Downloaded in Apache Directory
2. Extract the Hadoop tar file in a new terminal using the following commands:
D
$ cd /home/wcbda/apache
$ sudo tar xzf hadoop-1.0.4.tar.gz
$ sudo tar xzf jdk-7u25-linux-i586.tar.gz
Figure 13.5 shows the extracted Hadoop tar file in the apache directory:
3. To update the .bashrc file for Hadoop, open it as root by using the following commands:
$ cd ~
$ sudo gedit .bashrc
Add the following configuration at the end of the .bashrc file:
# Set Hadoop-related environment variables
export HADOOP_HOME=/home/wcbda/apache/hadoop-1.0.4
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/home/wcbda/apache/jdk1.7.0_25
# Add the Hadoop & Java directories to PATH
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
4. After completing the installation of Hadoop, you need to configure the Hadoop framework on the Ubuntu machine. Perform the following steps:
a. In the hadoop-env.sh file, update the value of JAVA_HOME variable by using the
following commands:
$ sudo gedit /home/wcbda/apache/hadoop-1.0.4/conf/hadoop-env.sh
# add the lines below
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export JAVA_HOME=/home/wcbda/apache/jdk1.7.0_25
export PATH=$PATH:$JAVA_HOME/bin
b. In core-site.xml, create a temp directory for the Hadoop framework. The
directory should be created in a shared place under a shared folder; for example,
/usr/local... can be considered as a feasible file structure for creating the directory.
To avoid exceptions, such as java.io.IOException, caused due to security issues,
create the temp folder in $HADOOP_HOME space by using the following commands:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/wcbda/apache/hadoop-1.0.4/hadoop_temp/</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. The filesystem implementation is determined by the scheme and authority of the URI. The config property (fs.SCHEME.impl) naming the FileSystem implementation class is described by the URI's scheme. The URI's authority contains the information about the host, port, etc. of a filesystem.</description>
</property>
c. In mapred-site.xml, update hadoop-1.0.4/conf/mapred-site.xml in a text
editor by using the following commands:
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
d. In hdfs-site.xml, update hadoop-1.0.4/conf/hdfs-site.xml in a text
editor by using the following commands:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
</property>
e. The NameNode in HDFS should be formatted once at the time of installation by using the
following command:
$ bin/hadoop namenode -format
5. To start Hadoop, navigate to the $HADOOP_HOME directory and run the bin/start-all.sh script.
6. Open a web browser and try to open the following links:
http://localhost:50070/ (NameNode)
http://localhost:50030/ (JobTracker)
http://localhost:50060/ (TaskTracker)
Figure 13.6 shows the snapshot of the NameNode window:
CHAPTER
14
Hadoop File System
for Storing Data
Topics Discussed
Introduction
Hadoop Distributed File System
HDFS Architecture
Concept of Blocks in HDFS Architecture
NameNodes and DataNodes
The Command-Line Interface
Using HDFS Files
HDFS Commands
The org.apache.hadoop.io Package
HDFS High Availability
Features of HDFS
Data Integrity in HDFS
Features of HBase
Difference between HBase and HDFS
INTRODUCTION
Hadoop Distributed File System (HDFS) is a resilient, flexible, clustered approach to file management in a Big Data setup that runs on commodity hardware. It is a data service offering unique capabilities, using nodes (commodity servers), required when data variety, volume, and velocity exceed the controllable levels of traditional data management systems.
HDFS is based on the Google File System (GFS), which provides reliable and efficient access to data using clusters of commodity hardware. Unlike traditional data management systems that perform continuous read-write cycles, HDFS targets workloads in which data is written only once in its lifetime and read many times after it is inserted into the system; for such workloads, HDFS offers excellent Big Data analysis and support.
In this chapter, you will learn about HDFS. First, you will learn about the HDFS file system and then about the HDFS architecture. You will also learn some important features of HDFS. At the end, you will learn about data integrity in HDFS and the features of HBase.
HDFS follows a write-once, read-many model: an application writes its data only once and reads it many times thereafter. This helps in simplifying data coherency and enabling high-throughput access. File systems like HDFS are designed to manage the challenges of accessing Big Data, which cannot be handled using traditional data-processing application software.
HDFS stores file metadata and application data separately, on the NameNode (also called the master node) and on DataNode servers, respectively. All these servers communicate with each other using TCP (Transmission Control Protocol)-based protocols. A large interconnection of these servers makes a cluster that can support close to a billion files and blocks, and HDFS has demonstrated the capacity to scale a single cluster up to about 4,500 nodes. The file content is divided into several small data blocks (the block size is configurable) and replicated across the DataNodes. This distribution of blocks helps HDFS maintain data durability, multiply data-transfer bandwidth, and enable highly efficient parallel processing.
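The effect of block splitting and replication on storage can be checked with simple arithmetic. The file size, block size, and replication factor below are illustrative values:

```python
# Illustrative calculation of how HDFS splits a file into blocks and how
# replication multiplies the raw storage consumed (the numbers are examples).
import math

def num_blocks(file_size_bytes, block_size_bytes):
    # A file occupies ceil(size / block_size) blocks; the last may be partial.
    return math.ceil(file_size_bytes / block_size_bytes)

MB = 1024 * 1024
block_size = 128 * MB
file_size = 1000 * MB                      # a 1000 MB file

blocks = num_blocks(file_size, block_size)  # number of HDFS blocks
replicas = blocks * 3                       # block copies with replication 3
raw_storage = file_size * 3                 # bytes actually consumed on disk
```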
Streaming information access: A dataset is typically copied from a source, and various analyses are then performed on that dataset over the long run. Every analysis is done in detail and is hence time-consuming, so the time required for examining the entire dataset is quite high.
Appliance hardware: Hadoop does not require large, exceptionally dependable hardware
to run. It can be installed on generic hardware which is available easily at low cost.
Low-latency information access: Applications that require access to information in milliseconds do not function well with HDFS, because HDFS is optimized for delivering a high throughput of information, and this comes at the expense of latency. Apache HBase is at present a superior choice for low-latency access, as it gives real-time, random read/write access to Big Data.
Loads of small documents: Since the NameNode holds the file-system metadata in memory, the number of documents a file system can hold is governed by the amount of memory on the server. As a rule of thumb, each document and directory takes around 150 bytes; thus, for instance, if you have one million documents, you would require no less than 300 MB of memory. While storing millions of records is achievable today, billions may become achievable in the near future using efficient commodity hardware.
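The 300 MB figure above can be reproduced with a back-of-envelope calculation. One common reading of the rule of thumb, assumed here, is that each file contributes a file entry plus at least one block entry, each costing roughly 150 bytes of NameNode memory:

```python
# Back-of-envelope NameNode memory estimate. The "one file object plus one
# block object per single-block file" reading is an assumption made to
# reproduce the ~300 MB figure quoted in the text, not an exact rule.
BYTES_PER_OBJECT = 150

def namenode_memory_bytes(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)  # file entries + block entries
    return objects * BYTES_PER_OBJECT

mem = namenode_memory_bytes(1_000_000)  # one million single-block files
```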
Discuss the following terms:
Streaming information access
Low-latency information access
HDFS ARCHITECTURE
HDFS consists of a central NameNode and multiple DataNodes running on a commodity-hardware cluster, and it offers the highest performance levels when the same physical rack is used for the entire cluster. The NameNode and DataNode are software pieces specifically designed to run on commodity hardware. HDFS manages the redundancy built into its environment, and its design can detect failures and resolve them by automatically running particular programs on the multiple servers present in the cluster. HDFS allows simultaneous application execution across multiple servers consisting of economical internal disk drives.
FIGURE 14.1 The Architecture of HDFS (a client performs metadata operations against the NameNode, which records file names and replica counts, and performs block reads and writes against the DataNodes, which replicate blocks among themselves)
The NameNode manages the HDFS cluster metadata, whereas the DataNodes store the data. Clients present records and directories to the NameNode, which manages them and performs operations on them, such as modifying, opening, and closing them. Internally, a file is divided into one or more blocks, which are stored in a group of DataNodes. DataNodes serve read and write requests from clients. DataNodes can also execute operations such as the creation, deletion, and replication of blocks, depending on the instructions from the NameNode.
HDFS blocks are huge in comparison to disk blocks in order to minimize the cost of seek operations. Consequently, the time to transfer a large record made of multiple blocks is dominated by the disk transfer rate. A quick computation demonstrates that if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to keep the seek time to 1% of the transfer time, we need a block size of around 100 MB. The default block size is 64 MB, though many HDFS systems use 128 MB blocks. Map tasks of MapReduce (a component of Hadoop) typically work on one block at a time, so if you have fewer tasks than nodes in the cluster, your job will run slower. Figure 14.2 shows the heartbeat message of Hadoop:
FIGURE 14.2 The Heartbeat Message Exchanged Between Servers in an HDFS Cluster
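The block-size argument given before Figure 14.2 (10 ms seek time, 100 MB/s transfer rate, seek overhead held to 1% of transfer time) can be checked numerically:

```python
# Numerical check of the block-size argument: block size such that one seek
# costs ~1% of the time spent transferring the block.
seek_time_s = 0.010          # 10 ms seek time
transfer_rate = 100          # MB/s disk transfer rate
overhead = 0.01              # seek time as a fraction of transfer time

# seek_time = overhead * (block_size / transfer_rate)  =>
block_size_mb = seek_time_s * transfer_rate / overhead   # about 100 MB
```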
When a heartbeat message reappears or a new heartbeat message is received, the DataNode sending the message is added to the cluster. HDFS achieves performance through the distribution of data, and it achieves fault tolerance by detecting faults and quickly recovering the data. Recovery is accomplished through replication, which results in a reliable file system capable of storing huge files.
Hadoop File System for Storing Data
To enable this reliability, several failure-management tasks must be facilitated, some of which are utilized within HDFS and others of which are still in the process of being implemented:
Monitoring: The DataNode and NameNode communicate through continuous signals ("heartbeats"). If a signal is not heard by either of the two, the node is considered to have failed and is no longer available. The failed node is replaced by a replica, and the replication scheme is changed accordingly.
Rebalancing: In this process, blocks are shifted from one location to another wherever free space is available. Rebalancing also helps performance when the demand for particular data increases or when frequent node failures increase the demand for replication.
Metadata replication: The metadata files are prone to failures, so replicas of the corresponding files are maintained on the same HDFS.
The NameNode manages the file-system namespace and maintains the metadata for all the documents and indexes in the file system. This metadata is stored on the local disk as two files: the file-system image and the edit log. The NameNode knows the DataNodes on which all the pieces of a given document are located; however, it does not store block locations persistently, since this data is reconstructed from the DataNodes.
By communicating with the DataNodes and NameNode, a client accesses the file system on behalf of the user. The client provides a file-system interface, like the POSIX interface, so the user code does not need to know about the NameNode and DataNodes in order to execute. DataNodes are the workhorses of the file system. When asked by a client or the NameNode, they store and retrieve data blocks and report back to the NameNode with lists of the blocks they store.
The file system cannot be accessed without the NameNode. The file system would lose all its files if the NameNode crashed, because without the NameNode there would be no way to reconstruct files from the DataNode blocks. This is why it is important to make the NameNode robust against failures, and Hadoop provides two ways of doing this.
The first way is to back up the files that make up the file-system metadata. Hadoop can be configured so that the NameNode writes its state to multiple file systems; the usual setup choice is to write to the local disk as well as to a remote NFS mount.
Another way is to run a secondary NameNode, which does not operate like a normal NameNode. The secondary NameNode periodically reads the file-system edit log and applies its changes to the fsimage file. The secondary NameNode comes online whenever the NameNode is down, but the constraint is that it only has access to the fsimage and edit-log files.
DataNodes ensure connectivity with the NameNode by sending heartbeat messages. Whenever the
NameNode ceases to receive a heartbeat message from a DataNode, it unmaps the DataNode from
the cluster and proceeds with further operations.
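The heartbeat bookkeeping described above can be sketched as follows. The timeout value is an assumption made for illustration, not Hadoop's configured default:

```python
# Simplified NameNode-side heartbeat bookkeeping (illustrative sketch).
def live_and_dead(last_heartbeat, now, timeout_s=30.0):
    """Split DataNodes into live and dead sets based on heartbeat age."""
    live, dead = set(), set()
    for node, ts in last_heartbeat.items():
        (live if now - ts <= timeout_s else dead).add(node)
    return live, dead

# Timestamps (seconds) of the last heartbeat received from each DataNode.
heartbeats = {"dn1": 100.0, "dn2": 95.0, "dn3": 60.0}
live, dead = live_and_dead(heartbeats, now=105.0)  # dn3 last seen 45 s ago
```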
There are two properties that we set in the distributed-mode prototype setup that call for further clarification. The principal one is fs.default.name, set to hdfs://localhost/, which is used to set the default Hadoop file system. File systems are tagged by a URI, and here we have used an HDFS URI to configure Hadoop. The HDFS daemons use this property to determine the host and port for the HDFS NameNode. We will be running it on localhost, on the default HDFS port, 8020. Furthermore, HDFS clients use this property to figure out where the NameNode is running so they can connect to it.
To access HDFS, an object of the FileSystem class is used. The FileSystem class is an abstract base class for a generic file system, so user code that refers to HDFS must be written to use a FileSystem object. An instance of the FileSystem class can be obtained by passing a new Configuration object to the FileSystem.get() factory method. Assume that the Hadoop configuration files, such as hadoop-default.xml and hadoop-site.xml, are present on the class path. The code to create an instance of the FileSystem class is shown in Listing 14.1:
Listing 14.1: Creating a FileSystem Object
Configuration config = new Configuration();
FileSystem fsys = FileSystem.get(config);    // obtaining a FileSystem instance
Path fp = new Path("filename");              // creating an object of the Path class
if (fsys.isFile(fp)) { ... }                 // checking whether the path refers to a file
boolean created = fsys.createNewFile(fp);    // creating a new file
boolean deleted = fsys.delete(fp);           // deleting the file
FSDataInputStream fin = fsys.open(fp);       // opening the file for reading
You must note that whenever a file is opened for writing, the client opening the file is granted an exclusive write lease on it. Because of this, no other client can perform write operations on the file until this client's operation is completed. To ensure that no lease is held by a "runaway" client, leases expire periodically. The effective use of leases ensures that no two applications can perform the write operation on a given file simultaneously.
A soft limit and a hard limit bound the lease duration. Within the soft limit, the writer has
exclusive access to the file. If the client fails to renew the lease and fails to close the file
before the soft limit expires, the lease can be preempted by another client. If the client fails
to renew the lease and the hard limit expires, HDFS by default behaves as if the client has quit,
and the file is closed immediately.
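The soft-limit and hard-limit behaviour can be summarized in a small Python sketch. The timeout values here are hypothetical placeholders, not the actual HDFS defaults:

```python
# Illustrative model of HDFS write-lease expiry; not the real HDFS implementation.
SOFT_LIMIT = 60     # hypothetical soft limit, in seconds
HARD_LIMIT = 3600   # hypothetical hard limit, in seconds

def lease_state(seconds_since_renewal):
    """Classify a write lease by how long ago the writer last renewed it."""
    if seconds_since_renewal < SOFT_LIMIT:
        return "exclusive"      # writer still has exclusive access to the file
    if seconds_since_renewal < HARD_LIMIT:
        return "preemptible"    # another client may now preempt the lease
    return "revoked"            # HDFS assumes the writer quit; the file is closed

print(lease_state(10))     # exclusive
print(lease_state(300))    # preemptible
print(lease_state(7200))   # revoked
```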
HDFS Commands
HDFS and the other file systems supported by Hadoop (for example, the Local FS, HFTP FS, and S3 FS)
can be interacted with directly through various shell-like commands. These commands are provided by
the File System (FS) shell, which can be invoked as follows:
bin/hadoop fs <args>
Most commands in the FS shell are similar to Unix commands and perform almost similar functions.
Some commonly used HDFS commands are shown in Table 14.1:
Command: appendToFile
Description: Used to append a single src or multiple srcs from a local file system to the destination file system. This command is also used to read input from stdin and append it to the destination file system.
Syntax: hdfs dfs -appendToFile <localsrc> ... <dst>

Command: cat
Description: Used for copying source paths to stdout. This command returns 0 on success and -1 in case of error.
Syntax: hdfs dfs -cat URI [URI ...]

Command: chmod
Description: Used for changing the permissions of files.
Syntax: hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Command: get
Description: Used for copying files to the local file system.
Syntax: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

Command: mkdir
Description: Used for creating directories by taking path URIs as arguments.
Syntax: hdfs dfs -mkdir [-p] <paths>

Command: mv
Description: Used for moving files from the source to the destination.
Syntax: hdfs dfs -mv URI [URI ...] <dest>
Class: AbstractMapWritable
Description: Refers to an abstract base class for MapWritable and SortedMapWritable. Unlike org.apache.nutch.crawl.MapWritable, this class enables the creation of MapWritable<Writable, MapWritable>.

Class: ArrayFile
Description: Used to perform a dense integer-to-value file-based mapping.

Class: ArrayPrimitiveWritable
Description: Refers to a wrapper class.

Class: ArrayWritable
Description: Acts as a Writable object for arrays containing instances of a class.

Class: BinaryComparable
Description: Refers to the interface supported by WritableComparable types that support ordering/permutation by a representative set of bytes.
The cluster would remain unavailable until an operator restarts the NameNode if an unplanned event,
such as a machine crash, occurs.
Cluster downtime can also result from planned events, such as maintenance (hardware or software
upgrades) of the NameNode machine.
The preceding problems are addressed by the HDFS High Availability feature, which provides the
facility of running two redundant NameNodes in the same cluster. Because the NameNodes can run in
an active/passive configuration, a fast failover to the other NameNode can be performed in case of
a machine crash.
Generally, there are two separate machines configured as NameNodes in a typical HA cluster. At any
given instant, one of the NameNodes is in the active state while the other is in the standby state.
The active NameNode performs all the client operations in the cluster, and the standby NameNode
acts as a slave that maintains enough state to provide a fast failover, if required.
You can deploy an HA cluster by preparing the following:
NameNode machines: These are the machines on which you run the active and standby NameNodes.
The NameNode machines must have similar hardware configurations.
Shared storage: Both NameNode machines must have read/write access to a shared directory.
FEATURES OF HDFS
Data replication, data resilience, and data integrity are the three key features of HDFS. You have
already learned that HDFS allows replication of data, thus automatically providing resilience in
case of an unexpected loss or damage of the contents at any location. Additionally, HDFS supports
data pipelines. A client application writes a block on the first DataNode in the pipeline. The
DataNode then forwards the data block to the next node in the pipeline, which in turn forwards it
to the next node, and so on. Once all the replicas of the block are written to disk, the client
writes the next block, which undergoes the same process. This feature is also used by Hadoop
MapReduce.
When a file is divided into blocks and the replicated blocks are distributed across the different
DataNodes of a cluster, the process requires careful execution, as even a minute variation may
result in corrupt data. HDFS ensures data integrity throughout the cluster with the help of the
following features:
Maintaining transaction logs: HDFS maintains transaction logs in order to monitor every operation
and carry out effective auditing and recovery of data in case something goes wrong.
Validating checksums: A checksum is an effective error-detection technique wherein a numerical
value is assigned to a transmitted message on the basis of the bits it contains. HDFS uses checksum
validation to verify the contents of a file. The validation is carried out as follows:
1. When a file is requested by a client, its contents are verified using a checksum.
2. If the checksums of the received and sent messages match, the file operations proceed further;
otherwise, an error is reported.
3. The receiver verifies the checksum of the message to ensure that it is the same as that of the
sent message. If the two values differ, the message is discarded on the assumption that it has been
tampered with in transit. Checksum files are themselves hidden to prevent tampering.
Creating data blocks: HDFS maintains replicated copies of data blocks to avoid corruption of a file
due to the failure of a server. The degree of replication, the number of DataNodes in the cluster,
and the specifications of the HDFS namespace are identified and implemented during the initial
deployment of the cluster. However, these parameters can be adjusted at any time during the
operation of the cluster.
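The checksum comparison described in steps 1-3 can be sketched with Python's standard zlib.crc32. Real HDFS computes CRC-32 checksums per fixed-size chunk of a block; the single-message framing here is a simplification:

```python
import zlib

def checksum(message: bytes) -> int:
    # assign a numerical error-detection value to the message based on its bits
    return zlib.crc32(message)

def verify(received: bytes, expected: int) -> bool:
    # the receiver recomputes the checksum and compares it with the sender's;
    # a mismatch means the message was corrupted or tampered with in transit
    return checksum(received) == expected

sent = b"block contents"
tag = checksum(sent)
print(verify(sent, tag))               # True: contents arrived intact
print(verify(b"block c0ntents", tag))  # False: corruption is detected
```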
DataNodes are sometimes also called block servers. A block server primarily stores data in a file
system and maintains the metadata of a block. A block server carries out the following functions:
Storage (and retrieval) of data on a local file system. HDFS supports different operating systems
and provides similar performance on all of them.
Storage of the metadata of a block on the local file system, on the basis of a similar template on
the NameNode.
Conduct of periodic validations of file checksums.
Intimation of the availability of blocks to the NameNode by sending reports regularly.
On-demand supply of metadata and data to clients, whose application programs can directly access
DataNodes.
Movement of data to connected nodes on the basis of the pipelining model.
Note
A connection between multiple DataNodes that supports the movement of data across servers is
termed a pipeline.
The manner in which blocks are placed on the DataNodes critically affects data replication and
support for pipelining. HDFS primarily maintains one replica of each block locally. A second
replica of the block is then placed on a different rack to guard against rack failure. A third
replica is maintained on a different server of a remote rack. Finally, additional replicas are
placed at random locations in local and remote clusters.

However, without proper supervision, one DataNode may get overloaded with data while another
remains empty. To avoid such situations, HDFS has a rebalancer service that balances the load of
data on the DataNodes. The rebalancer is executed while a cluster is running and can be halted to
avoid congestion due to network traffic. The rebalancer provides an effective mechanism; however,
it has not been designed to deal intelligently with every scenario. For instance, the rebalancer
cannot optimize for access or load patterns. Such features can be anticipated in future releases
of HDFS. The HDFS balancer rebalances data over the DataNodes, moving blocks from overloaded to
underloaded nodes. The system administrator can run the balancer from the command line.

Note
Only the superuser of HDFS has the capability to run the balancer.
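The default placement policy described above (first replica on the writer's node, second on a node of a different rack, third on another node of that same remote rack) can be sketched as follows. The node and rack names are invented for illustration:

```python
# Illustrative sketch of HDFS's default three-replica placement policy.
def place_replicas(writer_node, writer_rack, remote_rack, remote_rack_nodes):
    """Return (node, rack) placements for a block's three primary replicas."""
    return [
        (writer_node, writer_rack),            # replica 1: local to the writer
        (remote_rack_nodes[0], remote_rack),   # replica 2: on a different rack
        (remote_rack_nodes[1], remote_rack),   # replica 3: another node of that remote rack
    ]

plan = place_replicas("dn1", "rack-A", "rack-B", ["dn5", "dn6"])
print(plan)  # [('dn1', 'rack-A'), ('dn5', 'rack-B'), ('dn6', 'rack-B')]
```

This layout survives the loss of any single node or any single rack, which is the design goal stated above.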
Exhibit 1
HDFS Use in the Financial Services Sector
The financial services sector is the most data-intensive sector of the global economy. With many
companies competing in this league, the pressure to use available data for developing better
products and services is very high. Meanwhile, companies are squeezed by security risks and
regulatory requirements. For financial service companies such as banks, which constantly face high
security risks and fraud, managing their data using Hadoop and HDFS can solve many of these
problems without costing a fortune. Hadoop is a technology that delivers thorough data management
with proper security. Due to its high storage capacity and analytic capability, HDFS can handle
large customer databases and can even detect whether the organization has made any mistakes in
managing data. Post-financial-crisis regulations such as Basel III have introduced liquidity
reserve requirements that pressure lenders to know precisely how much money they have to keep in
reserve. Using HDFS, organizations can now analyze credit risk and counterparty risk, helping them
make better business decisions.
The storage overhead of checksums is less than 1%. DataNodes are responsible for verifying the data
they receive before storing the data and its checksum. This applies to data they receive from
clients and from other DataNodes during replication. Data written by a client is sent to a pipeline
of DataNodes, where the last DataNode in the pipeline verifies the checksum. If it detects an
error, the client receives a ChecksumException, a subclass of IOException.
While reading data from DataNodes, clients also verify checksums, comparing them with the ones
stored at the DataNode. Every DataNode keeps a persistent log of checksum verifications, so it
knows when each of its blocks was last verified. When a client successfully verifies a block, it
tells the DataNode, which updates its log. Keeping such statistics is valuable in detecting bad
disks. Every DataNode also runs a DataBlockScanner in a background thread that periodically checks
all the blocks stored on the DataNode. This guards against corruption due to "bit decay" in the
physical storage media.
Since HDFS stores replicas of blocks, it can "repair" corrupted blocks by copying one of the good
replicas to produce a new, uncorrupted copy. The way this works is that if a client detects an
error when reading a block, it reports the bad block to the NameNode before throwing the
ChecksumException. The NameNode marks the block replica as corrupt, so it does not direct clients
to it. It then schedules the block to be replicated on another DataNode. It is possible to disable
checksum verification by passing false to the setVerifyChecksum() method on the file system, before
using the open() method to read a file. The same effect is possible from the shell by using the
-ignorecrc option with the -get or the equivalent -copyToLocal command. This feature is useful when
you have a corrupt file that you want to inspect so you can decide what to do with it.
Note
HBase stores data in a column-oriented fashion instead of the record storage pattern used in
RDBMS.
FEATURES OF HBASE
HBase is a column-oriented distributed database built on top of HDFS. HBase is used when you need
real-time read/write access to huge datasets. Some of the main features of HBase are:
Consistency: In spite of not being an ACID implementation, HBase supports consistent read and
write operations. Where RDBMS-supported features such as full transaction support or typed columns
are not required, this feature makes HBase suitable for high-speed requirements.
Sharding: HBase supports operations such as transparent and automatic splitting, redistribution of
content, and distribution of data using the underlying file system.
High availability: HBase implements region servers to ensure the recovery of LAN and WAN
operations in case of a failure. The master server at the core monitors the region servers and
manages all the metadata for the cluster.
Client API: HBase supports programmatic access using Java APIs.
Support for IT operations: HBase provides a set of built-in Web pages to view detailed operational
insights about the system.
Note
ACID stands for Atomicity, Consistency, Isolation, and Durability. It is a set of properties that
ensures reliable processing of database transactions.
Table 14.5 shows the difference between HDFS and HBase:

HDFS: Batch processing of data takes place.
HBase: Real-time information processing and exchange of data take place.

HDFS: Its performance with Hive is faster.
HBase: Its performance with Hive is slower.

HDFS: It can store approximately 30 petabytes of data.
HBase: It can store approximately 1 petabyte of data.

HDFS: It has an inelastic architecture that does not allow you to make modifications. It also does not allow dynamic storage of data.
HBase: It allows dynamic modifications and can be used for applications independently.

HDFS: It uses the MapReduce technique to divide files into key-value pairs.
HBase: It uses key-value pairs because it is based on Google's Bigtable model.

HDFS: It allows high-latency operations.
HBase: It allows low-latency operations.

HDFS: It is accessible through MapReduce jobs.
HBase: It is accessible through shell commands.
Summary
In this chapter, you have learned about HDFS, a distributed file system useful for storing very
large data sets. It has a master-slave architecture comprising a NameNode and a number of
DataNodes; the NameNode is the master that manages the various DataNodes. Next, you learned about
some important features of HDFS, such as data replication, data resilience, and data integrity.
You then learned about MapReduce and, at the end, about HBase, which is used when you need
real-time read/write access to huge datasets.
Exercise
Multiple-Choice Questions
Q1. Which of the following terms is used to denote the small subsets of a large file created by
HDFS?
a. NameNode  b. DataNode
c. Blocks  d. Namespace
Q2. What message is generated by a DataNode to indicate its connectivity with the NameNode?
a. Beep  b. Heartbeat
c. Analog pulse  d. Map
Q3. Which of the following defines metadata?
a. Data about data b. Data from Web logs
c. Data from government sources d. Data from market surveys
Q4. Which of the following is managed by the MapReduce environment?
a. Web logs  b. Images
c. Structured data  d. Unstructured data

Q5. In an HDFS cluster, _______ manages cluster metadata.
a. NameNode  b. Inode
c. DataNode  d. Namespace

Q6. Which of the following commands of HDFS can issue directives to blocks?
a. fcsk  b. fkcs
c. fsck  d. fkcs
Assignment
References
https://searchdatamanagement.techtarget.com/definition/Hadoop-Distributed-File-System-
HDFS
https://intellipaat.com/blog/what-is-hdfs/
https://www.ibm.com/analytics/hadoop/hdfs
https://data-flair.training/blogs/big-data-use-cases-case-studies-hadoop-spark-flink/
Background
Cisco is one of the world's leading networking organisations and has transformed the way people
connect, communicate and collaborate. Cisco IT has 38 global data centres comprising a total of
334,000 square feet of space.
Challenge
The company had to manage large datasets of information about customers, products and network
activities, which constitute the company's business intelligence. In addition, there was a large
quantity of unstructured data, amounting to terabytes, in the form of Web logs, videos, emails,
documents and images. To handle such a huge amount of data, the company decided to adopt Hadoop,
an open-source software framework that supports distributed storage and processing of big
datasets.
According to Piyush Bhargava, a distinguished engineer at Cisco IT who handles Big Data programs,
"Hadoop behaves like an affordable supercomputing platform." He also says, "It moves compute to
where the data is stored, which mitigates the disk I/O bottleneck and provides almost linear
scalability. Hadoop would enable us to consolidate the islands of data scattered throughout the
enterprise." To implement the Hadoop platform for providing Big Data analytics services to Cisco
business teams, Cisco IT first needed to design and implement an enterprise platform that could
support appropriate Service Level Agreements (SLAs) for availability and performance. Piyush
Bhargava says, "Our challenge was adapting the open source Hadoop platform for the enterprise."
The technical requirements of the company for implementing the Big Data architecture were to:
Have open-source components in place to establish the architecture
Know the hidden business value of large datasets, whether the data is structured or unstructured
Provide SLAs to internal customers who want to use Big Data analytics services
Solution
Cisco IT developed a Hadoop platform using the Cisco® UCS Common Platform Architecture (CPA) for
Big Data. According to Jag Kahlon, a Cisco IT architect, "Cisco UCS CPA for Big Data provides the
capabilities we need to use Big Data analytics for business advantage, including high-performance,
scalability, and ease of management."
The Cisco UCS C240 rack servers are the building blocks of the Cisco IT Hadoop platform for
computation. These servers are powered by Intel Xeon E5-2600 series processors coupled with 256 GB
of RAM and 24 TB of local storage. Virendra Singh, a Cisco IT architect, says, "Cisco UCS C-Series
Servers provide high performance access to local storage, the biggest factor in Hadoop performance."
The present architecture contains four racks of servers, where each rack has 16 server nodes
providing 384 TB of raw storage per rack. Kahlon says, "This configuration can scale to 160 servers
in a single management domain supporting 3.8 petabytes of raw storage capacity."
Cisco IT server administrators are able to manage all elements of Cisco UCS, including servers,
storage access, networking and virtualisation, from a single Cisco UCS Manager interface. Kahlon
declares,
C A S E S T U D Y
"Cisco UCS Manager significantly simplifies management of our Hadoop platform. UCS Manager will
help us manage larger clusters as our platform grows without increasing staffing." Cisco IT uses
the MapR Distribution for Apache Hadoop, with code written in advanced C++ rather than Java.
Virendra Singh says, "Hadoop complements rather than replacing Cisco IT's traditional
data-processing tools, such as Oracle and Teradata. Its unique value is to process unstructured
data and very large data sets far more quickly and at far less cost."
All Cisco UCS C240 M3 servers form a cluster that acts as one large unit, and the storage on all
of them is managed by HDFS (Hadoop Distributed File System). The HDFS system then splits the data
into smaller chunks for further processing and for performing ETL (Extract, Transform and Load)
operations. Hari Shankar, a Cisco IT architect, says, "Processing can continue even if a node fails
because Hadoop makes multiple copies of every data element, distributing them across several
servers in the cluster. Even if a node fails, there is no data loss." Hadoop can detect node
failure automatically and create another parallel copy of the data, without disrupting any process
across the remaining servers. In addition, the total volume of data is not increased, as Hadoop
also compresses the data.
To handle tasks like job scheduling and orchestration, Cisco IT uses Cisco TES (Cisco Tidal
Enterprise Scheduler), which works as an alternative to Oozie. Cisco TES connects Hadoop
components automatically and eliminates the need to write Sqoop code manually to download data,
move it to HDFS and then execute commands to load the data into Hive. Singh says, "Using Cisco TES
for job-scheduling saves hours on each job compared to Oozie because reducing the number of
programming steps means less time needed for debugging." Also, Cisco TES is compatible with mobile
devices, making it easy for the end-users of the company to check and manage big-data jobs from
anywhere.
Results
The main result of transforming the business using Big Data is that Cisco IT has introduced
multiple Big Data analytics programs based on the Cisco® UCS Common Platform Architecture (CPA)
for Big Data. The company's revenues from partner sales have increased. The company has also
started the Cisco Partner Annuity Initiative program, which is in production. Piyush says, "With
our Hadoop architecture, analysis of partner sales opportunities completes in approximately
one-tenth the time it did on our traditional data analysis architecture, and at one-tenth the
cost."
The productivity of the company has increased because intellectual capital is now easier to find.
Since most of the content was not tagged with relevant keywords, knowledge workers used to spend a
lot of time searching for content. Now, Cisco IT has replaced the static, manual tagging process
with dynamic tagging based on user feedback. This process uses machine-learning techniques to
examine users' usage patterns and also acts on user suggestions for new search tags. Moreover, the
Hadoop platform analyses the log data of collaboration tools, such as Cisco Unified
Communications, email, Cisco TelePresence®, Cisco WebEx®, Cisco WebEx Social, and Cisco Jabber™,
to reveal commonly used communication methods and organisational dynamics.
Lesson Learned
Cisco IT has shared the following observations with other organisations:
Hive is good for structured data processing, but provides limited SQL support.
Network File System (NFS) saves time and effort in managing a large amount of data.
Cisco TES simplifies the job-scheduling and orchestration process.
A library of user-defined functions (UDFs) provided by Hive and Pig increases developer
productivity.
Knowledge of internal users is enhanced, as they can now analyse unstructured data from email,
webpages, documents, etc., besides data stored in databases.
Questions
1. In case of a node failure, what happens to the data?
(Hint: Hadoop can detect node failure automatically and create another parallel copy of the
data, without disrupting any process across the remaining servers. In addition, the total
volume of data is not increased, as Hadoop also compresses the data.)
2. What benefits were realized by Cisco IT after implementing the Big Data Hadoop platform?
(Hint: The company's revenues from partner sales have increased; the productivity of the
company has increased because intellectual capital is easier to find.)
CHAPTER
15
Introducing Hive
Topics Discussed
Introduction
Hive
Hive Services
Hive Variables
Hive Properties
Hive Queries
Built-In Functions in Hive
Hive DDL
Creating Databases
Viewing a Database
Dropping a Database
Altering Databases
Creating Tables
External Table
Creating a Table using the Existing Schema
Dropping Tables
Altering Tables
Using Hive DDL Statements
Data Manipulation in Hive
Loading Files into Tables
Inserting Data into Tables
Update in Hive
Delete in Hive
Using Hive DML Statements
Data Retrieval Queries
Using the SELECT Command
Using the WHERE Clause
Using the GROUP BY Clause
Using the HAVING Clause
Using the LIMIT Clause
Executing HiveQL Queries
Chapter Objectives
After completing Chapter 15, you should be able to:
Describe the built-in functions in Hive
INTRODUCTION
You have already learned that Hadoop is a framework that handles huge volumes of datasets in a
distributed computing environment. These datasets can be in any form, unstructured or structured.
To carry out operations such as querying and analysis on such huge amounts of data, Hadoop offers
an open-source data warehouse system called Hive, or Apache Hive. Hive is a mechanism through
which we can access the data stored in the Hadoop Distributed File System (HDFS). It provides an
interface, similar to SQL, which enables you to create databases and tables for storing data. In
this way, you can apply the MapReduce concept without explicitly writing the source code for it.
Hive also supports a language called HiveQL, which is considered the primary data processing
method for Treasure Data. Treasure Data is a cloud data platform that allows you to collect,
store, and analyze data on the cloud. It manages its own Hadoop cluster, which accepts your
queries and executes them using the Hadoop MapReduce framework. HiveQL automatically translates
SQL-like queries into MapReduce jobs executed on Hadoop.
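To make "translating SQL-like queries into MapReduce jobs" concrete, consider a query such as SELECT word, COUNT(*) FROM words GROUP BY word. The following is a conceptual Python sketch of the map and reduce steps such a query compiles down to; it illustrates the idea, not Hive's actual compiler output:

```python
from collections import defaultdict

# Conceptual equivalent of: SELECT word, COUNT(*) FROM words GROUP BY word
def map_phase(rows):
    # emit one (key, 1) pair per row, keyed by the GROUP BY column
    return [(row, 1) for row in rows]

def reduce_phase(pairs):
    # sum the emitted counts for each distinct key
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

rows = ["hive", "hadoop", "hive"]
print(reduce_phase(map_phase(rows)))  # {'hive': 2, 'hadoop': 1}
```

Hive generates jobs of this shape automatically, which is why users never have to write the map and reduce functions themselves.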
In this chapter, you learn about Hive, the Hive architecture, and the Hive metastore. Next, you
learn about the variables, properties, and commands used in Hive. You also learn to execute Hive
queries from files. The chapter then discusses Data Definition Language (DDL) and data
manipulation in Hive. Toward the end, you learn how to execute HiveQL queries in Hive.
HIVE
Hive provides a Structured Query Language (SQL)-like interface known as HiveQL, or the Hive Query
Language. This interface translates a given query into MapReduce code. Hive can be seen as a
mechanism through which one can access the data stored in HDFS. HiveQL enables users to perform
tasks using the MapReduce concept but without explicitly writing the code in terms of the map and
reduce functions. The data stored in HDFS can be accessed through HiveQL, which offers the
features of SQL but runs on the MapReduce framework.
It should be noted that Hive is not a complete database and is not meant to be used in online
transactional processing systems, such as online ticketing, bank transactions, etc. It is mostly
used in data warehousing kinds of applications, where you need to perform batch processing on a
huge amount of data. Typical examples of this kind of data include Web logs, call data records,
weather data, etc. As Hive queries are converted into MapReduce jobs, their latency is also
increased because of the overhead involved in job startup. This means that queries that usually
take a few milliseconds to execute in traditional database systems take more time on Hive. Figure
15.1 shows the architecture of Hive:

FIGURE 15.1 Displaying the Architecture of Hive
The architecture of Hive consists of various components. These components are described as follows:
User Interface (UI): Allows you to submit queries to the Hive system for execution.
Driver: Receives the submitted queries. This driver component creates a session handle for the
submitted query and then sends the query to the compiler to generate an execution plan.
Compiler: Parses the query, performs semantic analysis on different query blocks and query
expressions, and generates an execution plan.
Metastore: Stores all the information related to the structure of the various tables and
partitions in the data warehouse. It also includes column and column type information and the
serializers and deserializers necessary to read and write data. It also contains information about
the corresponding HDFS files where your data is stored.
Execution Engine: Executes the execution plan created by the compiler. The plan is in the form
of a Directed Acyclic Graph (DAG) to be executed in various stages. This engine manages the
dependencies between the different stages of a plan and is also responsible for executing these
stages on the appropriate system components.
Note
A DAG is a model for scheduling work. In this model, jobs are represented as vertices in a graph,
and the order of execution of the jobs is specified by the directions of the edges in the graph.
The term "acyclic" means there are no loops (or cycles) in the graph.
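The execution engine's DAG model can be sketched as a topological traversal: a stage runs only after every stage it depends on has finished. The stage names and dependencies below are hypothetical:

```python
# Minimal topological execution of a DAG of query stages (hypothetical plan).
def run_dag(dependencies):
    """dependencies maps each stage to the list of stages it depends on.
    Returns an execution order in which every stage follows its prerequisites."""
    order, done = [], set()

    def run(stage):
        if stage in done:
            return
        for prerequisite in dependencies[stage]:
            run(prerequisite)   # execute prerequisite stages first
        done.add(stage)
        order.append(stage)

    for stage in dependencies:
        run(stage)
    return order

plan = {"scan": [], "filter": ["scan"], "aggregate": ["filter"], "write": ["aggregate"]}
print(run_dag(plan))  # ['scan', 'filter', 'aggregate', 'write']
```

Because the graph is acyclic, this recursion always terminates, which is exactly why the DAG model is safe for scheduling.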
The following are the different ways to access Hive:
Hive Command-Line Interface: This is the most commonly used interface of Hive. It is mostly
referred to as the Hive CLI.
Hive Web Interface: This is a simple Graphical User Interface (GUI) used to connect to Hive. To
use this interface, you need to configure it during the Hive installation.
Hive Server: This is an optional server. By using this server, users can submit their Hive jobs
from a remote client.
JDBC/ODBC: This is a JDBC client that allows users to connect to Hive and submit their jobs.
Note
Most Hadoop users wish to work on a GUI instead of a CLI. The GUI of Apache Hive is provided by a
Web browser technology known as Hue. Hue supports not only Hive but also other Hadoop
technologies, such as HDFS, MapReduce/YARN, Oozie, Zookeeper, HBase, Pig, and Sqoop. Hue is an
open-source project, and its Apache Hive GUI is called BeesWax.
Hive operates in a shell interactive mode; therefore, to work with Hive, you first need to enter
this mode. The shell interactive mode is also known as the Command-Line Interface (CLI). You can
enter Hive's shell interactive mode by running the $HIVE_HOME/bin/hive command. Table 15.1 lists
some commonly used Hive commands in the shell interactive mode:
Command: ! <command>
Description: Executes a shell command from the Hive shell

Command: <query string>
Description: Executes a Hive query and prints the results to standard output

Command: add FILE[S] <filepath> <filepath>* | add JAR[S] <filepath> <filepath>* | add ARCHIVE[S] <filepath> <filepath>*
Description: Used to add one or more files, jars, or archives to the distributed cache resource list
Command: set <key>=<value>
Description: Sets the value of a particular configuration variable (key)

Command: source FILE <filepath>
Description: Executes a script file inside the CLI
For example:

hive> set;
hive> select my.* from mytable;
hive> dfs -ls;

In the preceding examples, the set command prints a list of configuration variables, the select
query displays the columns beginning with my. from the mytable table, and dfs -ls executes the dfs
command to list the directory contents. The semicolon (;) symbol is used to terminate a command,
and -- is used to insert comments. Hive queries can be run in either batch mode or shell
interactive mode using the $HIVE_HOME/bin/hive utility. Some examples of using the Hive CLI are as
follows:

$ hive -e 'select * from mytable'
$ hive -f <path-to-script>

The Hive command in the first example executes the select query from the command line, while the
second command executes a script non-interactively from the local disk.
Hive Services
You can get a list of all the available Hive services by typing hive --service help. Some Hive services are as
follows:
CLI: This is the command-line interface of Hive, the shell window you get after the installation
of Hive. It is the inbuilt default service of Hive.
Hive server: It runs Hive as a server exposing a Thrift service, which enables access from a
number of clients written in different languages. Applications that use the JDBC and ODBC
connectors need a running Hive server to communicate with Hive.
Hive Web Interface (HWI): The Hive Web Interface is the GUI of Hive on which we can execute
queries. It is an alternative to the shell. HWI is started using the following commands:
% export ANT_LIB=/path/to/ant/lib
% hive --service hwi
JAR: This service is the Hive equivalent of hadoop jar: it is a convenient way to run Java
applications that include both Hadoop and Hive classes on the classpath, without having to
build a special jar for the purpose.
Metastore: It is the service that runs along with the Hive service whenever Hive starts. This is the
default behaviour. It is also possible to run the metastore as a standalone (remote) process using
the metastore service. To use this service, set the METASTORE_PORT environment
variable so that the server listens on the specified port.
Hive client: There are many different mechanisms by which applications can contact Hive when
you run Hive as a server, that is, hiveserver. Following is one of the clients of Hive:
Thrift Client: The Hive Thrift client makes it very easy to run Hive commands from a wide
variety of programming languages. Thrift clients are available for languages such as
C++, Java, PHP, Python, and Ruby.
Similarly, there are JDBC and ODBC drivers that are compatible with Hive. These drivers connect
applications to Hive and its metastore and help run queries.
Hive Variables
Hive allows you to set variables that can be referred to in a Hive script. For this purpose, you need to
use the -d or --define option, as shown in the following command:

$ hive --define db_name=sampledatabase    (defines the db_name variable)

Using this variable, a table named sampletable can then be created in the database sampledatabase.
By default, the variable substitution option is enabled. However, you can disable this option by using
the following command:
set hive.variable.substitute=false;
In general, there are three namespaces for defining variables: hiveconf, system, and env. You can
also define custom variables in a separate namespace, hivevar, using the define or hivevar options.
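A variable defined this way can be referenced from a script; a minimal sketch (table and column names are illustrative):

```sql
-- db_name was set on the command line with --define db_name=sampledatabase
USE ${hivevar:db_name};
CREATE TABLE sampletable (id INT, name STRING);
```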
Hive Properties
The hive-site.xml file stores the configuration properties of Hive. These properties can be
overridden by developers. To override the properties of the hive-site.xml file, the set
command is used. For example, the following command sets Hive's scratch (temporary) directory to
/tmp/mydir for all subsequent commands:
set hive.exec.scratchdir=/tmp/mydir;
You can also enable the automatic optimization of joins, which is controlled by a configuration
property. It should be noted that properties changed with the set command are valid only until the
Hive CLI session is alive.
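Assuming the standard Hive property for automatic map-join conversion, the setting would look like:

```sql
set hive.auto.convert.join=true;
```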
Hive Queries
Hive allows you to simultaneously execute one or more queries. These queries can be stored in and
executed from files. The extension of a query file in Hive is .hql or .q. Let's take an example of a
Hive query file, named ourquery.hql, stored in the /home/weusers/queries/ folder. The file
contains the following query:

SELECT roll_no, name FROM students;

Now, type the following command to execute the query stored in the ourquery.hql file:

$ hive -f /home/weusers/queries/ourquery.hql    (executes a query from a file)
You can also execute a Hive query in the background, as shown in the following command:
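One common way to do this is through the shell's job control, sketched here (the output file name is illustrative):

```shell
$ nohup hive -f /home/weusers/queries/ourquery.hql > ourquery.out 2>&1 &
```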
Exhibit 1
Hive Architecture Used by NASA
The NASA team uses a climate model, which is a mathematical representation of climate systems
based on various factors that impact the climate of the Earth. For this purpose, NASA's Jet Propulsion
Laboratory (JPL) has developed the Regional Climate Model Evaluation System (RCMES) for analysis and
evaluation of the climate model output. The RCMES system has two components:
RCMED (Regional Climate Model Evaluation Database): It is a scalable cloud database used to
load and reanalyze remote sensing data and other climate-related data.
RCMET (Regional Climate Model Evaluation Toolkit): It is a collection of tools that provide
the user the ability to perform different types of evaluation and analysis by comparing the
reference data present in RCMED with the climate model output data.
Following are the reasons for which NASA's JPL team decided to go with Apache Hive as an
integral part of their solution strategy:
Apache Hive runs on top of Hadoop, which is scalable and can process data in a distributed
and parallel manner.
It provides the Hive Query Language, which is easy to learn because it is similar to SQL.
The NASA team installed Hive using Cloudera and Apache Hadoop. They used Apache Sqoop to ingest
data into Hive from a MySQL database. Moreover, an Apache OODT wrapper was implemented to
perform queries on Hive and retrieve the data back to RCMET.
Inspired By: https://www.edureka.co/blog/hive-tutorial/
A function is a group of commands used to perform a particular task in a program and
return an outcome. Like every programming language, Hive has its own set of built-in functions (also
known as pre-defined functions). Table 15.2 lists some commonly used built-in functions in Hive:
TABLE 15.2: Built-in Functions Available in Hive

Function                          Return Type   Description
floor(double a)                   BIGINT        Returns the maximum BIGINT value that is equal
                                                to or less than the double argument
concat(string A, string B, ...)   string        Returns the string resulting from concatenating
                                                B after A; can concatenate an arbitrary number
                                                of arguments
trim(string A)                    string        Returns the resulting string by trimming spaces
                                                from both ends of A
To perform various types of computations, Hive provides aggregate functions. Table 15.3 lists the
aggregate functions available in Hive:
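Typical aggregate functions include count, sum, avg, min, and max; for example (table and column names are illustrative):

```sql
SELECT count(*), sum(salary), avg(salary), min(salary), max(salary)
FROM employee;
```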
HIVE DDL
Data Definition Language (DDL) is used to describe data and data structures of a database. Hive
has its own DDL, similar to SQL DDL, which is used for creating, managing, altering, and dropping
databases, tables, and other objects in a database. Similar to other SQL databases, Hive databases
also act as namespaces for tables. If the name of the database is not specified, the table is created
in the default database. Some of the main commands used in DDL are as follows:
Create
Drop
Truncate
Let’s learn more about these commands and the operations that can be performed by using them.
Creating Databases
In order to create a database, we can use the following command:
hive> CREATE DATABASE temp_database;    (creates temp_database)
The preceding command creates a database with the name temp_database. In case a database
with the same name already exists, an error is thrown. You can avoid the error by creating the
database by using the following command:
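The standard form that suppresses the error is:

```sql
CREATE DATABASE IF NOT EXISTS temp_database;
```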
Note
When a database is created in Hive, a folder with the same name is created under Hive’s directory
on HDFS.
You can also add a table in a particular database by using the following command:
In the preceding command, the name of the database is added before the name of the table.
Therefore, temp_table gets added in the temp_database. In addition, you can also create a table in
the database by using the following commands:
In the preceding commands, the USE statement is used for setting the current database so that all
subsequent HiveQL statements execute against it. In this way, you do not need to add the name of the
database before the table name. The table temp_table is created in the database temp_database.
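The two ways of creating a table in a specific database described above can be sketched as follows (the column definition is illustrative):

```sql
-- Qualify the table name with the database name
CREATE TABLE temp_database.temp_table (id INT);

-- Or set the current database first
USE temp_database;
CREATE TABLE temp_table (id INT);
```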
Furthermore, you can specify DBPROPERTIES in the form of key-value pairs while creating a database.
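A sketch of such a statement (the property names and values are illustrative):

```sql
CREATE DATABASE temp_database
WITH DBPROPERTIES ('owner' = 'Joseph', 'created at' = '05/02/2015');
```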
Viewing a Database
You can view all the databases present in Hive by using the following command, which shows all
the databases present at the warehouse location:
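The listing command, optionally with a LIKE pattern to filter database names:

```sql
SHOW DATABASES;
SHOW DATABASES LIKE 'temp*';
```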
Dropping a Database
Dropping a database means deleting it from its storage location. The database can be deleted by
using the DROP DATABASE command.
You must note that if the database to be deleted contains any tables, you first need to delete the
tables from the database by using the DROP command. Deleting each table individually from a database
is a tedious task. Hive therefore allows you to delete the tables along with the database: the CASCADE
keyword drops all the tables in the database as well as the actual database itself.
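The two forms can be sketched as:

```sql
DROP DATABASE IF EXISTS temp_database;           -- fails if tables still exist
DROP DATABASE IF EXISTS temp_database CASCADE;   -- drops the tables first
```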
Altering Databases
IM
Altering a database means making changes to an existing database, such as setting its properties.
You can alter a database by using the ALTER DATABASE command.
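For example (the property name and value are illustrative):

```sql
ALTER DATABASE temp_database SET DBPROPERTIES ('edited-by' = 'Joseph');
```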
Creating Tables
You can create a table in a database by using the CREATE command, as discussed earlier. Now,
let’s learn how to provide the complete definition of a table in a database by using the following
commands:
USE temp_database;

CREATE TABLE IF NOT EXISTS employee (ename STRING, salary FLOAT, designation STRING)
ROW FORMAT DELIMITED
TBLPROPERTIES ('owner' = 'Joseph', 'created at' = '05/02/2015');
In the preceding commands, temp_database is first set as the current database, and then the table is
created with the name employee. In the employee table, the columns (ename, salary, and designation)
are specified with their respective data types. TBLPROPERTIES is a set of key-value properties that can
be used to attach descriptive details, such as the owner, to the table.
External Table
External tables are created over data files that already exist outside Hive's control, for example, at a
given location on HDFS. You can specify the location of a data file by using the LOCATION keyword while
creating the external table. Execute the following commands for creating an external table:

CREATE EXTERNAL TABLE IF NOT EXISTS student (name STRING,
percentage FLOAT, grade STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/details.db/student'
TBLPROPERTIES ('owner' = 'Joseph', 'created at' = '05/02/2015');
C
The concept of external table proves to be very useful when HDFS has huge data files, making copies
of which is not feasible due to space constraints.
In the preceding command, the schema of the student table present in the details database gets
copied to the employee table present in the same database. The name of the table to which the
schema gets copied is specified before the LIKE keyword, and the name of the table from which the
schema gets copied is specified after the LIKE keyword.
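The command being described has the following form:

```sql
CREATE TABLE details.employee LIKE details.student;
```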
You can know the structure of the existing table by using the following command:
The DESCRIBE command shows the structure of the student table present in the details database.
Note
You can also use the short notation of DESCRIBE as DESC for knowing the structure of the table.
You can also find the details about any column present in a table by using the DESCRIBE command.
In this case, the name of the column is specified after the table name, with a dot between them. This
returns the data type of the name column of the student table.
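The DESCRIBE forms discussed above can be sketched as:

```sql
DESCRIBE details.student;        -- structure of the student table
DESC details.student;            -- short notation
DESCRIBE details.student.name;   -- data type of the name column
```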
Here, we are creating a table named employee with the ename and sal columns. The table is
partitioned on the designation column.
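A command matching this description would be:

```sql
CREATE TABLE employee (ename STRING, sal FLOAT)
PARTITIONED BY (designation STRING);
```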
Dropping Tables
You can delete or drop a table by using the following command:
hive> DROP TABLE IF EXISTS students;    (deletes the students table)
Altering Tables
Altering a table means modifying or changing an existing table. By altering a table, you can modify
the metadata associated with the table. The table can be modified by using the ALTER TABLE
statement. The altering of a table allows you to:
Rename tables
Modify columns
Delete some columns
Change table properties
Let's now learn how to perform the preceding operations by using the ALTER TABLE statement.
Rename Tables
You can rename a table by using the following command:
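For example (using the employee table from earlier; the new name is illustrative):

```sql
ALTER TABLE employee RENAME TO emp;
```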
Modify Columns
You can modify the name of a column in Hive by using the following command:
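The CHANGE clause of ALTER TABLE renames a column and can also change its type; a sketch (names are illustrative):

```sql
ALTER TABLE employees CHANGE sal salary FLOAT;
```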
The preceding command adds an extra column with the name sales_month to the employees table.
By default, the new column is added at the last position, before the partition columns.
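The ADD COLUMNS statement described above can be sketched as:

```sql
ALTER TABLE employees ADD COLUMNS (sales_month STRING);
```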
Replace Columns
You can replace all the existing columns with new columns in a table by using the following command:
The REPLACE COLUMNS statement is used to delete all the existing columns and add new columns in
their place. Let’s now learn how to use Hive DDL statements.
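For example (the new column list is illustrative):

```sql
ALTER TABLE employees REPLACE COLUMNS (empid INT, empname STRING);
```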
Using Hive DDL Statements
Example
Mark works as a team leader of the research department in Argon Technology. His department
deals with data on stocks and shares and requires running a set of standard queries on Big Data
in order to prepare some ad-hoc reports. Instead of writing the MapReduce source code for the
desired results, Mark decides to use Hive for writing queries similar to SQL on Big Data. His team
compiles the data available in the CSV file format and saves it in a sequence format.
Let's learn how to write HQL queries for importing data given in a text file named Sample1.txt into
HDFS, first as a CSV table named stock_data, and then copy it into a sequence table named
stock_data2. Log into Hive and type the following commands on the terminal to go to the bin folder
of the Hive directory and start Hive:

cd $HIVE_HOME/bin/
./hive
1. To view a list of all the databases in Hive, type the following command on the terminal:
show databases;
The preceding command, along with its output, is shown in Figure 15.2:
FIGURE 15.2 Showing the List of Databases
4. To use the demo database in Hive, type the following command:
USE demo;
D
5. To create a table named stock_data, and a sequence-file table named stock_data2, type the
corresponding CREATE TABLE commands on the terminal.
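A plausible form of these commands (the symbol column is an assumption; date1 and volume appear in the later queries):

```sql
CREATE TABLE stock_data (symbol STRING, date1 STRING, volume BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE stock_data2 (symbol STRING, date1 STRING, volume BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
```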
7. To view the structure of the stock_data table, type the following
command:
describe stock_data;
The operations of creating the stock_data table and viewing its structure are shown in
Figure 15.4:
FIGURE 15.4 Creating and Describing a Table in Hive
Loading Files into Tables
Hive doesn’t perform any kind of transformation while loading data into tables. The data load
operations in Hive are, at present, pure copy/move operations, which move data files from one
location to another. You can upload data into Hive tables from the local file system as well as from
HDFS. The syntax of loading data from files into tables is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
When the LOCAL keyword is specified in the LOAD DATA command, Hive searches the local file
system for the input path. If the LOCAL keyword is not used, Hive checks the path on HDFS. On the
other hand, when the OVERWRITE keyword is specified, Hive deletes all the existing files under its
warehouse directory for the given table before the new files are uploaded. If you do not specify the
OVERWRITE keyword, the new files are added to the already existing folder.
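For instance (the file path is illustrative):

```sql
LOAD DATA LOCAL INPATH '/home/user/data/stocks.csv'
OVERWRITE INTO TABLE stock_data;
```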
In the preceding syntax, the INSERT OVERWRITE statement overwrites the current data in the table
or partition. The IF NOT EXISTS statement is given for a partition. On the other hand, the INSERT
INTO statement either appends the table or creates a partition without modifying the existing data.
The insert operation can be performed on a table or a partition. You can also specify multiple insert
clauses in the same query.
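The INSERT syntax being discussed has roughly the following form:

```sql
INSERT OVERWRITE TABLE tablename [PARTITION (partcol1=val1, ...) [IF NOT EXISTS]]
SELECT ... FROM ...;

INSERT INTO TABLE tablename [PARTITION (partcol1=val1, ...)]
SELECT ... FROM ...;
```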
Consider two tables, T1 and T2. We want to copy the sal column from T2 to T1 by using the INSERT
command. It can be done as follows:
CREATE TABLE T1 (sal STRING);
INSERT OVERWRITE TABLE T1 SELECT sal FROM T2;
Here, we use the keyword OVERWRITE. It means that any previous data in table T1 will be deleted
and replaced with the query result. A query result can also be exported with the INSERT OVERWRITE
LOCAL DIRECTORY statement, which inserts the SELECT query result of the employee table into the
myfile file of the local directory.
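A sketch of such an export (the directory path is illustrative):

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/myfile'
SELECT * FROM employee;
```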
CREATE TABLE T1
AS SELECT name, sal, month FROM T2;
Here, a new table called T1 would be created, and the schema for the table would be three columns
named name, sal, and month, with the same data types as mentioned in the T2 table.
Update in Hive
The update operation in Hive is available from Hive 0.14 version. The update operation can only be
performed on tables that support the ACID property.
The syntax for performing the update operation in Hive is as follows:
UPDATE tablename SET column = value [, column = value ...]
[WHERE expression]
In the preceding syntax, the UPDATE statement is followed by the name of the table. Only those
rows that match the WHERE clause will be updated. You must note that the partitioning and
bucketing columns cannot be updated.
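A concrete command of this kind (the new salary value is illustrative) would be:

```sql
UPDATE employee SET salary = 50000 WHERE empid = 10001;
```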
In the preceding command, the salary of the employee whose empid is 10001 is updated.
Delete in Hive
The delete operation is available in Hive from version 0.14 onwards. The delete operation can only
be performed on those tables that support ACID transactions. The syntax for performing the delete
operation is as follows:
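The syntax, together with a command matching the description that follows (table and column names as in the update example):

```sql
DELETE FROM tablename [WHERE expression];

DELETE FROM employee WHERE empid = 10001;
```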
When the preceding command is executed, the records of the employee whose employee id is 10001
get deleted.
The data stored in a table can be modified either by using the LOAD command or by using the INSERT
command, whose general syntax was given earlier.
Mark needs to type the following command on the terminal to access the data stored in the stock_
data table:
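A command of the kind described would be:

```sql
SELECT * FROM stock_data;
```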
To overwrite the content of the table, Mark uses the INSERT command as follows:
Mark types the following command on the terminal to overwrite the data stored in the stock_data2
table with the data stored in the stock_data table:
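A sketch of that command:

```sql
INSERT OVERWRITE TABLE stock_data2
SELECT * FROM stock_data;
```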
You have successfully created two tables, stock_data and stock_data2, and loaded data in the CSV
format into the first table and as a sequence file into the second table.
Write a script to copy the same table structure into a new table.
Hive allows you to perform data retrieval queries by using the SELECT command along with various
types of operators and clauses. In this section, you learn about the following:
Using the SELECT command
Using the HAVING clause
The SELECT command is used to retrieve specific columns, specific rows, or both. The following
query retrieves all the columns and rows from the table mydemotable:
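A sketch of the general syntax, together with the query just described:

```sql
SELECT [ALL | DISTINCT] col1, col2, ...
FROM tablename
[WHERE condition]
[GROUP BY col_list [HAVING condition]]
[LIMIT n];

SELECT * FROM mydemotable;
```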
For instance, the WHERE clause can restrict the result to sales records with an amount greater than
15000 from the US region. Hive also supports a number of operators (such as > and <) in the WHERE
clause. The following query shows an example of using the WHERE clause:
SELECT * FROM sales WHERE amount > 15000 AND region = "US"
(retrieves sales records where amount is greater than 15000 and region is equal to US)
For example, if you wish to calculate the average marks obtained by the students from all semesters,
you may use the GROUP BY clause as shown in the following example:
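Such a query might look like this (table and column names are illustrative):

```sql
SELECT semester, avg(marks)
FROM students
GROUP BY semester;
```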
The following example helps you to count the number of distinct users by gender:
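A sketch of such a query (the column names are assumptions):

```sql
SELECT gender, count(DISTINCT userid)
FROM users
GROUP BY gender;
```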
The preceding command displays the clmn1 column of the mydemotable table grouped by clmn1
where the sum of clmn2 is greater than 15.
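The command being described is of the following form:

```sql
SELECT clmn1
FROM mydemotable
GROUP BY clmn1
HAVING sum(clmn2) > 15;
```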
1. Type the following command on the terminal to go to the bin folder of the Hive directory:
cd $HIVE_HOME/bin/
./hive
The preceding command logs you into Hive, as shown in Figure 15.5:
FIGURE 15.5 Logging into Hive
2. To retrieve the data stored in the stock_data table on a particular date, you can type the
following HiveQL command on the terminal:
select * from stock_data where date1="2013-12-12";
The preceding command, along with its output, is shown in Figure 15.6:
3. In order to retrieve a fixed number of records from a table, you need to specify its limit. Type
the following command on the terminal to retrieve three records from the stock_data table,
having volume less than 1,500,000:
select * from stock_data where volume<1500000 limit 3;
The preceding command, along with its output, is shown in Figure 15.7:
FIGURE 15.7 Retrieving Specific Number of Records from an HDFS Table
Summary
This chapter discussed the concept of Hive. You learned about Hive architecture and Hive metastore.
Next, you learned about Hive Command Line Interface (CLI), and the variables, properties, and
commands used in Hive. You also learned how to execute Hive queries from files. Next, the chapter
discussed shell execution, data types, and built-in functions in Hive. You also learned about Hive DDL
and data manipulation. Finally, you learned how to execute HiveQL queries in Hive.
Exercise
Multiple-Choice Questions
Q1. The syntax ALTER TABLE old_table_name RENAME TO new_table_name is used for:
a. Renaming an existing database
b. Renaming an existing field name
c. Renaming an existing table with a new name
d. None of the above
Q2. In Hive, the term 'aggregation' is used to:
a. Count the number of distinct users by gender
b. Get the best performance of a table
c. Merge the data in a table to another table
d. There is no term like 'aggregation'
Q6. What is the syntax to declare the struct data type?
a. STRUCT<col_name : data_type [COMMENT col_comment], ...>
b. STRUCTS<col_name : data_type [COMMENT col_comment], ...>
c. STRUCT<[COMMENT col_comment], ...>
d. None of the above
Q7. The CREATE statement in Hive is related to:
a. DDL statements b. DML statements
c. Session control statements d. Embedded SQL statements
Assignment
CASE STUDY
HIVE FOR RETAIL ANALYSIS
This Case Study discusses how BigX implemented a Big Data solution for analyzing its sales and
achieving a better growth rate.
The concept of hypermarkets and supermarkets spread rapidly in the latter part of the 20th century.
The chief cause of this expansion was that they provided a one-stop solution for all the needs of
the people. The retail business is a very important industry and will continue to be so in the future.
Walmart, Costco, Kroger, Tesco, Metro Group, etc., are the biggest retailers in the world. In India,
Future Group, Reliance Industries, Aditya Birla Group, etc., have their own retail chains.
A major Indian retailer, say BigX, has approximately 220 stores across 85 cities and towns in India.
It employs more than 35,000 people. The annual revenue of BigX was USD 10 billion in 2017. BigX
stores offer a wide range of products including fashion and apparel, food products, books, furniture,
electronics, health care, general merchandise, and entertainment sections. About 33% of BigX stores
have daily sales of USD 25,000 or more. The rest (67%) of the BigX stores have daily sales in the
range of USD 14,000-25,000. The stores have an average daily footfall of more than 1,200 customers. BigX
wanted to analyze the trends and patterns using the semi-structured data it had acquired in the last
five years. It decided to implement a Hadoop platform to assist in this cause. The problem scenario
faced by BigX was:
One of the big datasets of BigX held 5 years of vital information in a semi-structured form and
BigX wanted to analyze this vital information.
The dataset that BigX wanted to analyze mostly consisted of logs, which did not conform to
any specific schema. This meant that Business Intelligence (BI) tools could not be applied over
such data without a schema. Also, BI tools are not efficient for terabytes or petabytes
of data.
The data was moved to BigX’s BI systems to analyze and derive a result. It took more than 12
hours bi-weekly to move such data. It was a major goal for BigX to reduce this time.
It was difficult and time-consuming to query such a large dataset.
For a dataset of size 12 TB, data analysis and application of BI tools becomes difficult because of
the two prime reasons. First, it is difficult to move such large datasets to HDFS periodically. Second,
it is difficult to perform analysis on the HDFS dataset. In order to resolve the first problem, i.e., to
move the dataset into HDFS, BigX used Flume. Thereafter, Hive was used to perform analysis on
datasets. Flume solved BigX’s data transfer problem. Flume is used to move large amounts of data
as it is a distributed and reliable service. The way it works is that it behaves as a logging system to
collect various log files from all the machines available in cluster and then it aggregates them into a
purposeful and centralized HDFS store. Flume’s architecture is shown as follows:
[Figure: Flume architecture, with one Flume agent per application-server host collecting logs]
This process of sending logs from the nodes to HDFS can be done in real-time or on a daily, weekly
or monthly basis. BigX chose to send the logs on a bi-weekly basis. In order to resolve the second
problem, i.e., to perform analysis on the HDFS dataset, BigX chose Hive. Hive is a data warehouse
system that is used for analyzing both the structured and semi-structured data. Hive provides a
mechanism using which the datasets can be structured and queries can be performed upon them.
The queries are written in the Hive Query Language (HQL) which is quite similar to Structured Query
Language (SQL). HQL converts the query into MapReduce tasks.
The solution architecture implemented by BigX using Flume and Hive is shown as follows:
[Figure: BigX solution architecture in the BigX datacenter, with one Flume agent per host and one
Flume collector per n agents, sinking data into the HDFS cluster, where Hive and MapReduce
perform inserts and queries]
After the implementation of Hive, some of the interesting trends that were observed are as follows:
There was a steady increase in YoY growth for all categories of products.
There was a growth of 65% in health & beauty products; 55% in food products and 54.6% in
entertainment.
BigX found that the people from Northern India spent more on health & beauty products,
whereas people from Southern India spent more on books and food products.
The sales of fashion and apparel products were maximum in Delhi and Mumbai.
Questions
1. Elaborate the problems faced by BigX and how they resolved these problems.
(Hint: The problem scenario faced by BigX was that it held a dataset having 5 years of vital
information in semi-structured form and it wanted to analyze this vital information.)
2. Comment on the success or failure of the BigX’s Big Data solution.
(Hint: BigX’s Big Data solution was a success because the company saw a steady increase in
its growth rate. In addition, it was able to analyze different trends and patterns of shopping
done by various segments of people.)
LAB EXERCISE
In this Lab Exercise, you are going to perform the steps for installing Hive on Hadoop. Ensure that
the steps are performed sequentially and accurately to complete the installation.
Let’s first understand the purpose of Hive before implementing the steps for installing it. Hive is a
data warehouse system for Hadoop. It uses the language HiveQL, which has SQL-like structure. Hive
accesses data stored in HDFS by using a SQL-like interface and HiveQL commands.
LAB 1
Solution: Before using Hive, we need to first install it and set the home variable to use Hive on
Hadoop.
The steps for installing Hive on Hadoop are as follows:
1. Download the latest version of Hive.
2. Untar the package:
$ tar -xzvf apache-hive-0.13.1-bin.tar.gz
3. Add the following to ~/.bash_profile:
$ sudo nano ~/.bash_profile
export HIVE_HOME=/home/hduser/hive-0.13.1
export PATH=$PATH:$HIVE_HOME/bin
Here, hduser is the user name and hive-0.13.1 is the Hive directory extracted from the tar file.
4. Run Hive from terminal:
4. Run Hive from terminal:
$ hive
5. Make sure that the Hive node is connected to the Hadoop cluster.
In this setup, Hive uses the embedded Derby database, and the data is stored in the local
file system. Only one Hive session can be open on the node. In case more than one user
tries to run the Hive shell, the second user will get a "Failed to start database
'metastore_db'" error.
6. Run Hive queries for datastore to test for the installation:
hive> SHOW TABLES;
hive> CREATE TABLE sales(id INT, product STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
7. Logs are generated on a per-user basis in the /tmp/<username> folder.
Installing Hive with local metastore:

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hduser</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>passwd</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>
6. Run Hive from terminal:
$ hive
Installing Hive with remote metastore:

    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>passwd</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>
6. Run Hive from terminal:
$ hive
CHAPTER
16
Pig
Topics Discussed
Introduction
Pig
Benefits of Pig
Properties of Pig
Modes of Running Pig Scripts
Running Pig Programs
Schema
Getting Started with Pig Latin
Chapter Objectives
After completing Chapter 16, you should be able to:
INTRODUCTION
While programming with Hadoop, we use the MapReduce model. It consists of the map phase and
the reduce phase. In the map phase, a portion of the dataset is processed, and the output is
represented in an intermediate form. In the reduce phase, the results from the map phase are
combined, and the final output is produced.
The MapReduce model can be extended to many real-world problems, but it is not always convenient
and efficient. When there is a large amount of data that needs to be processed using Hadoop, the
processing involves more overhead and becomes complex. A better solution to such
kind of problems is Pig, which is a Hadoop extension. Pig is recommended if you need to handle
gigabytes or terabytes of data. However, it is not recommended in case you are required to write a
single or a small group of records or search various records randomly. Yahoo uses Hadoop heavily
and executes 40 percent of all its Hadoop jobs with Pig. The well-known company Twitter is also
another user of Pig. The main philosophy behind the development of Pig is its ease of use, high
performance, and massive scalability.
This chapter begins by introducing Pig. It then discusses the benefits of Pig. After that, you learn
how to install and run Pig. Next, the chapter explains the Pig language, called Pig Latin, along with
its structure. In the end, it explores the operators and functions used in Pig.
PIG
D
Pig was designed and developed for performing a long series of data operations. The Pig
platform is specially designed for handling many kinds of data, be it structured, semi-structured, or
unstructured. Pig enables users to focus more on what to do than on how to do it. Pig was developed
in 2006 at Yahoo. Its aim as a research project was to provide a simple way to use Hadoop and focus
on examining large datasets instead of spending effort on MapReduce. Pig became an Apache
project in 2007. By 2009, other companies had started using Pig, and it became a top-level Apache
project in 2010. The use of Pig can be divided into three categories: ETL (Extract, Transform, and
Load), research, and interactive data processing.
IM
Pig consists of a scripting language, known as Pig Latin, and a Pig Latin compiler. The scripting
language is used to write the code for analysing the data, and the compiler converts the code
into the equivalent MapReduce code. So, we can say that Pig automates the process of designing
and implementing MapReduce applications. It is easier to write code in Pig than to program
in MapReduce directly. Pig also has an optimizer that decides how to get the data quickly. This
essentially means that Pig is smart and intelligent in terms of data processing.
Benefits of Pig
The Pig programming language offers the following benefits:
Ease of coding: Pig Latin lets you write complex programs in code that is simple and easy to understand and maintain. It explicitly encodes complex tasks involving interrelated data transformations as data flow sequences.
Optimization: Pig Latin encodes tasks in such a way that they can be easily optimized for
execution. This allows users to concentrate on the data processing aspects without bothering
about efficiency.
Extensibility: Pig Latin is designed in such a way that it allows you to create your own custom functions, which can be used for performing special tasks. Custom functions are also called user-defined functions (UDFs).
MODES OF RUNNING PIG SCRIPTS
Pig scripts can be run in the following two modes:
Local mode: In this mode, scripts run on a single machine without requiring Hadoop MapReduce or the Hadoop Distributed File System (HDFS). This mode is useful for developing and testing Pig logic. If you are using a small dataset to develop or test your code, the local mode is usually faster than the MapReduce infrastructure. This mode does not require Hadoop: the Pig program runs in a local JVM (Java Virtual Machine), and the data is accessed through the local file system on a single machine. This process runs separately, outside of the Pig Latin compiler.
MapReduce mode: This mode is also known as the Hadoop mode. The Pig script, in this mode, gets converted into a series of MapReduce jobs that are then run on a Hadoop cluster.
Figure 16.1 shows the modes available for running Pig scripts:
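On the command line, the execution mode is selected with Pig's -x flag; for example (assuming the pig executable is on the PATH and a script file named myscript.pig):

```shell
# Run the script locally, using the local file system (no Hadoop cluster needed)
pig -x local myscript.pig

# Run the same script as a series of MapReduce jobs on a Hadoop cluster
pig -x mapreduce myscript.pig
```

Running pig with no script argument starts the interactive Grunt shell in the chosen mode.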
The decision whether to use the local mode or the Hadoop mode is made on the basis of the amount of data available. Suppose you write a program that performs operations on several terabytes of data. You may notice that the operations slow down significantly after some time. The
local mode enables you to perform tasks with subsets of your data in a highly interactive manner.
You can determine the logic and rectify bugs in your Pig program. After performing these tasks, and
when your operations are running smoothly, you can use the MapReduce mode.
DATA SCIENCE
Schema
After getting to know Grunt, we need to understand the schema used when writing a script. For every scripting language, there is a schema definition that describes the data used in the script. Let us take an example.
Here, we declare the columns in the script: year is given a datatype of integer, temp an integer, and name a char array. This is nothing but the schema declaration of the script for the relation "a".
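The declaration being described survives only as prose; a sketch of what it would look like, assuming the data lives in a hypothetical file named weather.txt, is:

```pig
-- Load the file and declare a schema for relation 'a':
-- year and temp as integers, name as a chararray
a = LOAD 'weather.txt' AS (year:int, temp:int, name:chararray);
```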
Example
Consider the case of Argon Technology, which provides Big Data solutions. An input file containing
weather data is provided to the project manager of this company. This is a simple CSV file and has
three columns. The first column indicates the serial number, the second column indicates the year,
and the third column indicates the temperatures recorded over various years. The manager is
interested in finding the maximum temperature from the given dataset. To do this, a Pig program
is required, which can be implemented on Pig Grunt. He will have to write a Pig program and
execute the commands for running it. This scenario involves implementation of a Pig program to
find out the maximum temperature year-wise from the given dataset.
The manager has to first ensure that Hadoop, Java, and Pig are installed and running smoothly on his
system. He can set the environment variables, if required, by using the following commands:
Then, he writes the Pig program to find the maximum temperature from the given set of data. The
program is shown in Listing 16.1:
DUMP max_temp;
Finally, he runs this ‘Dump’ command, shown in Listing 16.1, on a terminal to generate the output.
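Listing 16.1 survives here only as its final DUMP statement; a sketch of such a program, assuming the three CSV columns described (serial number, year, temperature) and a hypothetical file name weather.csv, might be:

```pig
-- Load the comma-separated weather data with an explicit schema
records = LOAD 'weather.csv' USING PigStorage(',')
          AS (sno:int, year:int, temp:int);

-- Group the records year-wise
grouped = GROUP records BY year;

-- For each year, emit the maximum recorded temperature
max_temp = FOREACH grouped GENERATE group AS year, MAX(records.temp);

DUMP max_temp;
```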
A command shell known as _____________ is interactive in nature and is used for scripting in Pig.
Exhibit-1
Companies using Pig
Yahoo!: Research supporting Web search and ad systems is done using Pig, which generates between 40% and 60% of Yahoo!'s Hadoop jobs.
LinkedIn: Uses Hadoop with Pig to find and display content such as "People You May Know" and other facts for its users.
Twitter: Makes extensive use of Pig for various jobs, such as mining tweet data and processing logs, in scheduled as well as ad hoc jobs, since Pig accomplishes a lot with few statements.
DropFire: Generates Pig Latin scripts that describe semantic and structural conversions between data contexts.
Nokia: Explores and analyzes unstructured datasets coming from logs, database dumps, and data feeds using Pig.
PayPal: Uses Pig to analyze transaction data in order to prevent fraud.
SARA Computing and Networking Services: Uses Pig for fast exploration of large datasets by scientists.
WhitePages: Uses Pig to clean, merge, and filter multi-billion-record datasets for its people- and business-search applications. It also analyzes daily search and Web logs to produce key performance indicators for its critical Web services.
Source: https://cwiki.apache.org/confluence/display/PIG/PoweredBy
Extensible: Due to its extensible nature, Pig Latin enables developers to address specific business problems by adding custom functions.
Compared to Java MapReduce programs, Pig Latin scripts are easier to write, understand, and maintain. All this is made possible because:
You are not required to write jobs in Java.
Pig Latin provides an easy and simple language to effectively use Hadoop. It thus makes it easier for more users to leverage the power of Hadoop, reduce development time, and enhance productivity.
With Big Data, however, you do not want large amounts of data to be moving around, so the processing is brought to the data itself. Instead of the old ETL approach, the Pig Latin language follows the ELT approach: data is first extracted from various sources, then loaded into HDFS, and finally transformed so that it can be used as a resource for further analysis.
X = LOAD 'file_name.txt';
...
Y = GROUP ... ;
...
Z = FILTER ... ;
...
DUMP Y;
...
STORE Z INTO 'temp';
As you can see from the preceding syntax, Pig Latin makes use of the following statements:
The LOAD statement is used for reading data from the file system. The data that needs to be manipulated is first loaded; it is typically stored in HDFS. For a Pig program to access the data,
you first need to specify the file to use. For this, you use the LOAD statement. The specified file
can be either an HDFS file or a directory. If a directory is specified, all files in that directory are
loaded into the program.
The FILTER statement is used for filtering the records.
To view the result on the screen, the DUMP statement is used. Typically, you use the DUMP command to
show the output on to the screen when the program needs to be debugged. When working in
a production environment, you can use the STORE command to save the results in a file. These
results can be used for further processing or analysis.
The STORE statement is used for saving the results.
In Pig Latin, relational operators are used for transforming data. Different types of transformations include grouping, filtering, sorting, and joining. The following are some basic relational operators used in Pig:
FOREACH
ASSERT
FILTER
ORDER BY
DISTINCT
JOIN
SAMPLE
SPLIT
FLATTEN
GROUP
LIMIT
Suppose we have student data loaded from HDFS and want to extract data from it. The following script illustrates the use of the FOREACH operator on the student data:
(John,3,M)
(Robinson,2,M)
(Jilly,6,F)
(Peter,7,M)
The preceding example uses the fields 'student', 'rollno', and 'gender'. The asterisk (*) symbol is used for projecting all the fields. You can also use the FOREACH operator for projecting only two fields of the student table by using the following script:
(John,3)
(Robinson,2)
(Jilly,6)
(Peter,7)
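The scripts that produced the outputs above are not reproduced in the extracted text; sketches of both projections, assuming the student data sits in a hypothetical comma-separated file student.txt, might be:

```pig
student = LOAD 'student.txt' USING PigStorage(',')
          AS (name:chararray, rollno:int, gender:chararray);

-- Project all fields with the asterisk
b = FOREACH student GENERATE *;
DUMP b;

-- Project only two of the fields
c = FOREACH student GENERATE name, rollno;
DUMP c;
```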
ASSERT
The ASSERT operator asserts a condition on the data. Assertions are used for ensuring that a
condition is true on the data. The processing fails if any of the records violate the condition. The
syntax for using the ASSERT operator is as follows:
DUMP temp;
(2,3,4)
(4,1,2)
(3,4,7)
(5,8,2)
Suppose you want to ensure that the value of the first column in your data is greater than 0. In that case, you can use the following script:
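The script itself is not reproduced in the extracted text; the general form of the statement is ASSERT alias BY expression [, message], so a sketch for the data shown, referencing the first column positionally, might be:

```pig
-- Fail the job with the given message if any record has $0 <= 0
ASSERT temp BY $0 > 0, 'first column must be greater than zero';
```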
FILTER
The FILTER operator enables you to use a predicate for selecting the records that need to be
retained in the pipeline. Only those records will be passed down the pipeline successfully for which
the predicate defined in the FILTER statement remains true. The syntax for using the FILTER
operator is as follows:
alias = FILTER alias BY expression;
Suppose the relation table contains the following tuples:
(1,2,3)
(6,1,3)
(6,1,2)
temp = FILTER table BY (first == 6) OR (NOT (second + third > a1));
DUMP temp;
The result of the preceding script is as follows:
(6,1,3)
(6,1,2)
GROUP
Pig Latin provides various operators for grouping and aggregation. The syntax of the GROUP operator in Pig Latin is similar to SQL, but it differs in functionality from the GROUP BY clause in SQL. In Pig Latin, the GROUP operator is used for grouping data in one or more relations, whereas the GROUP BY clause in SQL creates a group that can be input directly into one or more aggregate functions. In Pig Latin, a star expression cannot be included in a GROUP BY column.
ALL: Refers to the keyword used for inputting all the tuples into a group; for example, Z =
GROUP A ALL;
BY: Refers to a keyword used for grouping relations by field, tuple, or expression; for example,
X = GROUP A BY f1;
PARTITION BY partitioner: Describes the Hadoop Partitioner, used for controlling the keys that
partition intermediate map-outputs.
The following example shows the use of the GROUP operator. Suppose relation X contains the following tuples:
(George,20)
(Lisa,21)
(Paul,22)
(Joseph,20)
Now, let’s suppose that you want to group relation X on the field age to form relation Z, as shown
in the following script:
Z = GROUP X BY age;
DESCRIBE Z;
Z: {group: int, X: {name: chararray, age: int}}
The output of the preceding script is as follows:
DUMP Z;
(20,{(George,20),(Joseph,20)})
(21,{(Lisa,21)})
(22,{(Paul,22)})
Note
The GROUP and COGROUP operators are similar in several respects. Both these operators can
perform their functions with one or multiple relations. The GROUP operator can be used when the
statements deal with a single relation, while the COGROUP operator is used when two or more
IM
relations are involved in statements. COGROUP can handle a maximum of 127 relations at a time.
ORDER BY
Depending on one or more fields, a given relation can be sorted using the ORDER BY operator. The
syntax of the ORDER BY operator is as follows:
PARALLEL n: Enhances the parallelism of a job by specifying the number of reduce tasks, n
Pig supports ordering on fields with simple data types or by using the tuple designator (*). You cannot order fields with complex types or by expressions.
The following commands describe the syntax used for ordering fields in Pig:
The following script depicts an example of the ORDER BY operator in Pig. Suppose relation data contains the following tuples:
(1,2,4)
(2,4,1)
(4,2,5)
(3,5,7)
Now, consider the following script that arranges the data in decreasing order of the third element:
Z = ORDER data BY third DESC;
DUMP Z;
(3,5,7)
(4,2,5)
(1,2,4)
(2,4,1)
DISTINCT
In Pig Latin, the DISTINCT operator works on the entire records and not on individual fields.
This operator is used for removing duplicate fields from a given set of records. The syntax of the
DISTINCT operator is as follows:
PARALLEL n: Enhances the parallelism of a task by mentioning the number of reduce tasks, n
Consider the following script that illustrates the use of the DISTINCT operator:
(Sam,2,4)
(Sam,2,4)
(Paul,3,2)
(Peter,3,6)
(John,3,6)
Consider the following output in which all the redundant tuples are removed:
(Sam,2,4)
(Paul,3,2)
(Peter,3,6)
(John,3,6)
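The script that produces such a de-duplicated result is not shown in the extracted text; a sketch, assuming the input sits in a hypothetical comma-separated file students.txt, might be:

```pig
data = LOAD 'students.txt' USING PigStorage(',')
       AS (name:chararray, a:int, b:int);

-- DISTINCT removes duplicate records (entire tuples, not individual fields)
uniq = DISTINCT data;
DUMP uniq;
```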
JOIN
The JOIN operator is used for joining two or more relations. Two rows can be joined when they have identical keys. Records whose keys do not match are dropped. The following two types of joins can be performed in Pig Latin:
Inner join
Outer join
The syntax of the JOIN operator is as follows:
Inner Join
In an inner join, the same dataset can be used with different aliases. Because Pig scripts refer to relations by name, different aliases are used to avoid naming conflicts. Let's take an example to show the inner join on a data file "mydata" with different aliases:
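The original listing is not reproduced in the extracted text; a sketch of such a self-join, assuming mydata has at least one column, might look like this:

```pig
-- Load the same file twice under different aliases to avoid naming conflicts
data1 = LOAD 'mydata' AS (col1, col2);
data2 = LOAD 'mydata' AS (col1, col2);

-- Inner join the two aliases on the first column
joined = JOIN data1 BY col1, data2 BY col1;
DUMP joined;
```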
In the preceding code, the two different aliases, data1 and data2, of the mydata table are created.
Then, the join operation is performed using these aliases.
Outer Join
When you implement outer joins in Pig, records that do not have a match in the other table are included, with null values filled in for the missing fields. There are three types of outer joins:
Left outer join: Used to fetch all rows from the left table, even when there are no matches in the corresponding right table. Let's consider two relations, A and B, with the following elements:
grunt> DUMP A;
(2, maggie)
(4, yepme)
(3, topreman)
(1, chings)
grunt> DUMP B;
(Nestle, 2)
(xyz,4)
(tops, 2)
(Kelogs, 0)
(itc, 3)
These are the elements of both A and B. Now, on joining them on the basis of a left outer join, the result will be as follows:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
The output for this is as follows:
(1,chings,,)
(2,maggie,Nestle,2)
(2,maggie,tops,2)
(3,topreman,itc,3)
(4,yepme,xyz,4)
Right outer join: Returns all the rows from the right table, even if there are no matches in the left table. For the above relations, we get the result by applying the right outer join as follows:
grunt> C = JOIN A BY $0 RIGHT OUTER, B BY $1;
grunt> DUMP C;
The output for this is as follows:
(2,maggie,Nestle,2)
(2,maggie,tops,2)
(3,topreman,itc,3)
(4,yepme,xyz,4)
(,,Kelogs,0)
Full outer join: It includes all the records from both sides even if they do not have any matches.
LIMIT
The LIMIT operator in Pig allows a user to limit the number of results returned. The syntax of the LIMIT operator is as follows:
alias = LIMIT alias n;
Suppose the relation temp contains the following tuples:
(2,2,3)
(3,2,4)
(7,3,5)
(3,2,3)
Now, consider the following statement, which selects only the first two results out of four:
lmt_tmp = LIMIT temp 2;
DUMP lmt_tmp;
The output of the preceding statement is as follows:
(2,2,3)
(3,2,4)
In the preceding output, you can see that only the first two results are displayed.
SAMPLE
By providing a sample size, the SAMPLE operator can be used for selecting a random data sample in Pig. The sample size is expressed as a double value representing a fraction of the rows; for example, 0.2 indicates 20%. It is not guaranteed that the same number of rows will be returned for a particular sample size each time the SAMPLE operator is used; for this reason, the operator is termed probabilistic.
SAMPLE alias n;
n: Refers to sample size, which can be a constant that ranges from 0 to 1 or a scalar used in an
expression
Consider the following example in which table Z will have 1% of the data present in table X:
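The statement being described would look along these lines (relation names X and Z as in the text):

```pig
-- Select roughly 1% of the tuples of X at random
Z = SAMPLE X 0.01;
DUMP Z;
```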
SPLIT
The SPLIT operator partitions a given relation into two or more relations. The syntax for using the SPLIT operator is as follows:
A tuple may be assigned to more than one relation or may not be assigned to any relation, depending
on the conditions given in the expression. In the following example, the relation table is split into
two relations, tab1 and tab2:
(1,3,5)
(2,4,6)
(5,7,8)
SPLIT table INTO tab1 IF first<5, tab2 IF second==7;
DUMP tab1;
(1,3,5)
(2,4,6)
DUMP tab2;
(5,7,8)
FLATTEN
The FLATTEN operator is used for un-nesting tuples as well as bags (a bag is a collection of tuples). The FLATTEN operator seems syntactically similar to a user-defined function statement; however, it is actually an operator that allows the user to modify the structure of tuples and bags. In the case of a tuple, the FLATTEN operator replaces the tuple with its fields. For instance, consider a relation that contains a tuple of the form (x, (y, z)). The expression GENERATE $0, FLATTEN($1) will change the form of this tuple to (x, y, z).
In the case of bags, the situation is a bit more complex. We form new tuples every time we un-nest a bag. Consider a relation of the form ({(w,x),(y,z)}); when you apply GENERATE FLATTEN($0), two tuples, (w,x) and (y,z), are generated. Sometimes, you need to perform a cross product to remove a level of nesting from a bag. Consider a relation that contains a tuple of the form (u, {(w,x),(y,z)}). Now, when you apply the expression GENERATE $0, FLATTEN($1) to this tuple, the new tuples (u, w, x) and (u, y, z) are generated.
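The bag case can be sketched as follows, assuming a relation A whose tuples have the form (u, {(w,x),(y,z)}):

```pig
-- Cross the scalar field with each tuple of the bag:
-- (u,{(w,x),(y,z)}) yields (u,w,x) and (u,y,z)
B = FOREACH A GENERATE $0, FLATTEN($1);
DUMP B;
```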
Exhibit-2
Process a Million Songs with Apache Pig
There exists a dataset containing detailed acoustic and contextual data for about a million songs, called the Million Song Dataset (MSD). For every listed song, one can find plenty of information, such as title, loudness, tempo, danceability, artist name, duration, and localization; only links to sample mp3 data are provided. The dataset consists of 339 tab-separated text files, each containing around 3,000 songs, where each song is represented as a separate line of text. The total size of the dataset is 218 GB, making it very hard to process on a single machine. So, an obvious and efficient approach is to use an open-source tool from the Apache Hadoop ecosystem, such as Pig.
The goal here is to find exotic and folk songs in this dataset. Pig provides access to a programming language called Pig Latin, which is easy to learn and understand and maintains a clear data flow. It is very similar to a scripting language, as it supports operations such as filtering, sorting, aggregation, joining, and splitting, and it also supports various complex data types, e.g., tuples, maps, and bags. A Pig Latin script is considerably more compact than its Java MapReduce equivalent. Many large organizations, such as Yahoo, Nokia, Twitter, and LinkedIn, rely on Apache Pig for data processing.
There are basically two types of functions in Pig: user-defined functions and built-in functions. As the
name suggests, user-defined functions can be created by users as per their requirements. On the
other hand, built-in functions are already defined in Pig. There are mainly five categories of built-in
functions in Pig, which are as follows:
Eval or Evaluation functions: These functions are used to evaluate a value by using an expression.
Some commonly used eval functions are listed in Table 16.2:
TABLE 16.2: Some Eval Functions and Their Description
Function Description Syntax
AVG Calculates the average of numeric values. AVG(expression)
Math functions: These functions are used to perform mathematical operations. Some commonly
used mathematical functions are listed in Table 16.3:
String functions: These functions are used to perform various operations on strings. Some commonly used string functions are listed in Table 16.4:
Function Description Syntax
TOBAG Transforms one or more expressions to the type bag. TOBAG(expression [, expression ...])
TOP Returns the top-n tuples from a bag of tuples. TOP(topN, column, relation)
TOTUPLE Converts one or more expressions into the type tuple. TOTUPLE(expression [, expression ...])
Load and Store functions: These functions are used to load and extract data. Pig provides a set
of built-in load and store functions, some of which are described in Table 16.6:
T
TABLE 16.6: Load and Store Functions
Function Description Syntax
BinStorage Loads and stores data in a machine-readable format. BinStorage()
Summary
This chapter discussed various aspects of data analysis using the Pig Latin language. After introducing Pig, it explained the benefits of the language. Next, the chapter took you through the steps of installing Pig on your system. It also explained the properties of Pig and how to run the installed Pig. In the end, it explained the operators and functions used in Pig.
Exercise
Multiple-Choice Questions
Q1. Pig was developed by:
a. Yahoo b. Gmail
c. Twitter d. Facebook
Q2. Which of the following operators is used for performing iteration?
a. FOREACH b. ASSERT
c. FILTER d. GROUP
Q3. UDF stands for:
a. Universal Defined Function b. Unique Defined Function
c. Universal Disk Format d. Unique Definition of Function
Q4. On starting Pig, you can specify Hadoop properties with which of the following options?
a. -A b. -B
c. -P d. -D
Q5. Which mode of Pig is also known as the Hadoop mode?
a. Local mode b. MapReduce mode
c. Global mode d. Universal mode
Assignment
Q1. Discuss the two modes used for running the Pig scripts.
Q2. What are the main reasons for developing Pig Latin?
Q3. What do you understand by Pig Latin application flow?
Q4. Discuss the use of the FOREACH operator.
Q5. What are the various statements used in flow of data processing in Pig Latin?
Q6. What is the use of the ASSERT operator in Pig Latin?
Q7. What is the use of the FILTER operator in Pig Latin?
Q8. Write a short note on the following operators:
GROUP
ORDER BY
References
https://wiki.apache.org/pig/PigLatin
https://pig.apache.org/
http://www.dictionary.com/e/pig-latin/
https://data-flair.training/blogs/pig-latin-operators-statements/
This Case Study discusses how Shutterfly used Hadoop, Pig and Tableau for its business analytics.
It has been estimated that Shutterfly stores over 120 petabytes (PB) of customer data. The volume,
velocity, variety and veracity of Shutterfly’s data are increasing on a daily basis. The organization
wanted to study which Big Data systems it can implement in order to effectively manage such
enormous amounts of data and how it can derive customer and product insights from it. One
possible solution was to deploy a Hadoop platform and integrate it with a traditional relational
database. The organization decided to use a Hadoop, Pig and Tableau infrastructure to scale and
automate its data analytics.
Big Data is characterized by 5 Vs, namely Velocity, Volume, Value, Variety, and Veracity.
Velocity: Data is being generated and processed at ever-increasing speed.
Volume: Large amounts of data are being produced.
Value: Large amounts of data can be analyzed to generate business insights. In this way, value
is associated with the Big Data.
Variety: Data so generated by organizations is usually in different forms or types. In addition,
we have structured as well as unstructured data.
Veracity: This relates to the accuracy or trustworthiness of data.
Velocity: Shutterfly is acquiring new companies and is also expanding organically. In addition, the organization has deployed machine learning and other Big Data analytics, which leads to increased velocity through machine-generated data.
Variety: Shutterfly generates a variety of data, such as customer order data; photos; videos;
Facebook and Twitter data; customer service data; text logs; RFID data from production,
printer, press data; external shipping data from FedEx and UPS and other logistics partners.
Veracity: Data is generated through customer service text logs which originate from chats,
phone conversations, emails, etc. All these different sources of data have certain amounts of
inaccuracies, such as misspellings, symbols, and web- and tool-generated characters, etc.
Shutterfly deployed a Big Data ecosystem comprising Hadoop, Pig and Tableau. Shutterfly used
this ecosystem to discover important insights about customers and products. Shutterfly wanted
to analyze which products would be successful in the first quarter of 2015 and which products need
improvement and which products can be eliminated altogether. For this, they analyzed the top
10 products sold by Shutterfly in the first quarter of 2014. The data that was taken up for analysis
contained a sample of 1 lakh orders from the first quarter of 2014. The data contained information
such as userid, userzipcode, orderdate, prodsku, ordercount, unitstotal and rating. Shutterfly used a
Pig Script to find out the top 10 products of the first quarter of 2014 by order count.
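Figure 16.2 is not reproduced here; a sketch of such a script, assuming the column names listed above and a hypothetical tab-separated input file orders.tsv, might be:

```pig
orders = LOAD 'orders.tsv' AS (userid:chararray, userzipcode:chararray,
         orderdate:chararray, prodsku:chararray, ordercount:int,
         unitstotal:int, rating:int);

-- Total order count per product SKU
by_sku = GROUP orders BY prodsku;
totals = FOREACH by_sku GENERATE group AS prodsku,
         SUM(orders.ordercount) AS total_orders;

-- Keep the ten SKUs with the highest order counts
ranked = ORDER totals BY total_orders DESC;
top10  = LIMIT ranked 10;
STORE top10 INTO 'top10_products';
```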
CASE STUDY
The Pig Script for finding the top 10 products of Q1 2014 by order count is shown in Figure 16.2:
FIGURE 16.2 Pig Script for Finding the Top 10 Products of Q1 2014 by Order Count
prodsku ordercount
AB25 9810
CD31 3808
AR52 3188
AR54 2366
CR23 1812
PB21 1725
AP45 970
MA03 712
AB01 600
FG88 582
In addition to order count, we could also explore the data by quantity ordered. A bar chart is
generated by Tableau, showing top 10 products sold in Q1 2014, using the output of the Pig script.
This is shown in Figure 16.4:
FIGURE 16.4 Tableau Bar Chart for Top Ten Products of Shutterfly in Q1 2014
After finding out the top ten products that had the maximum demand in Q1 of 2014, the company again ran a Pig Script for conducting a geospatial analysis of the top ten products. It ran a script to
discover the top 100 zip codes where demand for those top ten products was the maximum. Figure 16.5 shows the Pig Script for finding out the top 100 zip codes where most orders originated:
FIGURE 16.5 Pig Script for Finding Top 100 Zip Codes
This geospatial analysis helped the company in determining the states from which it received the maximum number of orders. After analyzing the number of orders based on the top 100 zip codes, the company ran a Pig Script for finding out the top 800 zip codes. Figure 16.6 shows the Pig Script used for finding the top 800 zip codes:
FIGURE 16.6 Pig Script for Finding Top 800 Zip Codes
Using the output of the Pig Script given in Figure 16.6, a representational map showing the top 800 zip codes was generated by Tableau.
FIGURE 16.7 Representational Map Showing the Top 800 Zip Codes Generated by Tableau
After running the first script, the company knew which top ten products were sold in maximum
numbers in Q1 of 2014. For example, AB25 was sold the maximum. Therefore, this data was used
along with a Pig Script to analyze the regions from which it received the maximum orders and to determine which states and zip codes should be targeted for marketing promotions. This is shown
in Figure 16.8:
FIGURE 16.8 Pig Script used for Finding Top 800 AB25s by Zip Codes
All this data was used by Tableau to generate a representational map showing the different regions
and the number of orders of Product AB25 from each of the regions. This is shown in Figure 16.9:
From Tableau output, it was revealed that the customers from areas such as SF Bay, Los Angeles and
San Diego ordered the most AB25s.
Result
Shutterfly used Hadoop, Pig and Tableau to store and analyze its retail data in order to improve its
analytics capabilities and discover certain actionable insights. Using Hadoop meant that Shutterfly
could scale well over the 120 PBs of data that was being stored and analyzed by it at that time.
Shutterfly used Pig along with Tableau to automate its geospatial reporting and analysis. Shutterfly realized that Hadoop, Pig and Tableau have the ability to scale, process and automate analysis of large datasets and data pipelines, which helps it enhance the capabilities of its traditional relational databases and business intelligence tools.
Questions
1. Briefly explain how the use of Hadoop, Pig and Tableau is beneficial for Shutterfly.
(Hint: Hadoop, Pig and Tableau can be used together to produce actionable insights related
to Shutterfly customers and products.)
2. Explain how geospatial analysis was used by Shutterfly.
(Hint: Geospatial analysis helps in determining geospatial demand insights that can be used for targeted marketing.)
LAB EXERCISE
In this Lab Exercise, you are going to perform the steps for installing Pig. Ensure that the steps are performed sequentially and accurately to complete the installation.
Let’s first understand the purpose of Pig before implementing the steps for installing it.
Pig is an interactive, or script-based, execution environment supporting Pig Latin—a language used
to express data flows. The Pig platform is specially designed for handling different kinds of data, be
it structured, semi-structured, or unstructured. Pig enables users to focus more on what to do than
on how to do it.
LAB 1
Install Pig
Assume that you have to install Pig on your laptop or desktop computer.
Solution: Perform the following procedure to install Pig:
Installing Pig
D
You can run Pig from your laptop/desktop computers. It can also operate on the machine from which
Hadoop jobs are launched. Pig can be installed on a UNIX or Windows system.
Note
Windows users also need to install the Cygwin and Perl packages. The packages can be installed
from http://www.cygwin.com/.
Before installing Pig, you need to ensure that your system meets the following requirements:
Java is installed (Pig runs on the Java Virtual Machine).
The HADOOP_HOME environment variable is set to indicate the directory where Hadoop is installed.
Downloading Pig
Perform the following steps to download Pig:
1. Download a recent stable release from one of the Apache Download Mirrors.
2. Unpack the downloaded Pig distribution and ensure it contains the following files:
zz The Pig script file, Pig, which is located in the bin directory.
zz The Pig properties file, pig.properties, which is located in the conf directory.
3. Add /pig-n.n.n/bin to your path. Then, use either the export (bash, sh, ksh) or the setenv (tcsh, csh) command, as shown here:
$ export PATH=/<my-path-to-pig>/pig-n.n.n/bin:$PATH
This completes the process of downloading and installing Pig on your local machine.
4. The Pig installation can be tested with the following command:
$ pig -help
1. Download the Pig code from Apache Subversion (SVN), which is available at the following URL: http://svn.apache.org/repos/asf/pig/trunk.
2. You can build the code in the working directory. A successfully completed build would result
in the creation of the pig.jar file in the working directory.
3. You can validate the pig.jar file by running a unit test, such as the ant test.
When you start Pig, Hadoop properties can be specified with the -D option and Pig properties with the -P option in the Pig interface. You can also use the set command to change individual properties in the Grunt mode of Pig.
The Hadoop properties that you specify support the following precedence order:
Hadoop configuration files < -D Hadoop property < -P properties_file < set command
End notes
1. Data science: It is a multidisciplinary field in which data interference, algorithms, and
innovative technologies are used to solve complex analytical problems.
2. Data: These are raw facts and information that are generally gathered in a systematic
approach for some kind of analysis.
3. Analytics: It is defined as the science of analysing large data pools that contribute to an
effective decision.
4. Big Data: It is a term for large collections of structured as well as unstructured datasets that must be collected, saved, searched, transferred and reviewed within a reasonable time.
5. Trimmed mean: It is the value of the mean of a variable after removing some extreme
observations from the frequency distribution.
6. Probability theory: It is a branch of mathematics that is concerned with chance or
probability.
7. Bernoulli distribution: It is one wherein only two outcomes are possible for each trial, and the results of the trials are independent of one another.
8. Statistical inference: It refers to the process by which inferences about a population are made on the basis of statistics calculated from a sample of data drawn from that population.
9. Decision-making: It can be interpreted as the process of making the best choice among prospective, often uncertain, alternatives for meeting the objectives.
10. Decision Support System (DSS): It is a type of computer-based system that helps decision-making authorities confront ill-structured problems by interacting directly with data and analysis models.
11. Game theory: It helps find an optimum solution for developing an effective strategy in a given situation, whether the aim is to maximize profits or minimize losses in a competitive environment.
12. Telecommunication: It can be defined as the exchange of information over a distance using different technologies; telecommunication industries provide the infrastructure for transmitting information using phone devices and the internet.
13. Business Analytics: It is a group of techniques and applications for storing, analysing and
making data accessible to help users make better strategic decisions.
14. Probabilistic Sampling: In this, items are selected for the sample using a random procedure.
15. Business Intelligence (BI): It is a set of applications, technologies and best practices for the collection, integration, analysis and presentation of business information.
16. Descriptive Analytics: It is the most basic type of analytics and establishes the framework for more advanced types of analytics.
17. Extraction Transformation Loading (ETL): It is the process of extracting data from the source systems, validating it against certain quality standards, transforming it so that data from separate sources can be used together and delivered in a presentation-ready format, and then loading it into the data warehouse.
18. Data warehouses: These are used to store huge amounts of data, which helps organizations in decision-making, defining business conditions and formulating future strategies.
19. Data mart: It is a collection of subjects that support departments in making specific decisions.
20. Fact table: It is a special table that contains the data used to measure the organization's business operations.
21. Supervised learning: It refers to learning in which the system is given data that already contains the desired outputs.
22. Reinforcement learning: It is a type of machine learning algorithm that enables machines to maximize their performance by identifying the ideal solution under given conditions.
23. Decision trees: These help in classifying the different paths related to a problem, making it faster to arrive at the best decision.
24. K-means algorithms: These are used where clusters must be created from data points that have some sort of relevance to one another.
25. Text Mining: It involves cleaning the data so that it is ready for text analytics.
26. Sentiment Analysis: It is used to derive the emotions from the text, tweets, Facebook posts,
or YouTube comments.
27. Topic Modeling: It is a statistical approach for discovering topic(s) from a collection of text
documents based on statistics of each word.
28. Named Entity Recognition: It is a tool used in text analytics that classifies the named entities in a given corpus into predefined classes, such as place, person, product, organisation, quantity, percentage, time, etc.
29. Infographics: These are visual representations that convey information or data rapidly and accurately.
30. Data visualisation: It is the study of representing data or information in a visual form.
31. Tableau: It is one of the popular evolving business intelligence and data visualization tools.
32. Mind Maps: These are diagrams used to organise information visually.
33. Optimization: It refers to a technique used to make decisions in organizations and analyze
physical systems.
34. Solver software: This software concentrates on determining a solution for a particular instance of an optimization model, which acts as an input for the software.
35. Linear programming: It is a method for achieving the best result, such as maximum revenue or lowest investment, in the context of a mathematical model.
36. Multi-criteria optimization: It is a type of optimization in which problems involving more than one objective function are considered simultaneously.
37. System management: It signifies the management of information technology (IT) assets of
an organization in a centralized manner.
38. KPIs: These are used by organizations to measure progress towards business goals, check performance and determine whether the organization is on track for success or needs improvement.
39. Conformity: It means that the same set of standards have been followed for entering the
data.
40. Business metric: It is defined as a quantifiable measure that is used by an organization for
tracking, monitoring and assessing the success or failure of its business process.
41. Structured data: It can be defined as the data that has a defined repeating pattern.
42. Unstructured data: It is a set of data that might or might not have any logical or repeating patterns.
43. Semi-structured data: Also known as schema-less or self-describing data, it refers to a form of structured data that contains tags or markup elements in order to separate semantic elements and generate hierarchies of records and fields in the given data.
44. XML (eXtensible Markup Language): It enables data to have an elaborate and intricate structure that is significantly richer and comparatively more complex.
45. Social network data: It refers to the data generated from people socializing on social media.
46. Social Network Analysis (SNA): It is the analysis performed on the data obtained from social
media.
47. Radio Frequency Identification (RFID): It is a technology that has automated the process of
labeling and tracking of products, thereby saving significant time, cost, and effort.
48. Social media analytics: It is used for online reputation management, crisis management, lead generation, brand checks, measuring campaign performance, and much more.
49. Hadoop: It is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
50. Hadoop ecosystem: It refers to a collection of components of the Apache Hadoop software
library, including the accessories and tools provided by the Apache Software Foundation.
51. MapReduce: It is a framework that helps developers write programs to process large volumes of unstructured data in parallel over a distributed or standalone architecture in order to get the output in an aggregated format.
52. Hadoop Distributed File System (HDFS): It is a cluster of highly reliable, efficient, and
economical storage solutions that facilitates the management of files containing related data
across machines.
53. Checksum: It is an effective error-detection technique wherein a numerical value is assigned
to a transmitted message on the basis of the number of bits contained in the message.
54. NameNode: It is the master node of HDFS that maintains the file system namespace and metadata; in a high-availability setup, Active and Standby NameNodes run on separate machines.
55. Block server: It stores data in a file system and maintains the metadata of a block.
56. HBase: It is a column-oriented database built on top of HDFS.
57. Hive: It is a mechanism through which we can access the data stored in the Hadoop Distributed File System (HDFS).
58. Treasure Data: It is a cloud data platform that allows you to collect, store, and analyze data on the cloud.
59. HiveQL: It stands for Hive Query Language, in which a given query is translated into MapReduce code.
60. Metastore: It stores all the information related to the structure of the various tables and
partitions in the data warehouse.
61. Pig: It is a platform that is specially designed for handling many kinds of data, be it structured,
semi-structured, or unstructured.
62. Pig Latin compiler: It is used to convert the Pig Latin code into executable code.
63. Inner join: In an inner join, only records with matching keys in both datasets are included; in Pig, joining a dataset with itself requires loading the same dataset under different aliases.
64. Outer join: In an outer join, records that do not have a match of the records in the other table
are included with null values filled in for the missing fields.