
Fundamentals of Big Data & Business Analytics

COURSE DESIGN COMMITTEE

Chief Academic Officer:
Dr. Sanjeev Chaturvedi, NMIMS Global Access – School for Continuing Education

Content Reviewer:
Dr. R. Vijaylakshmi, Visiting Faculty, NMIMS Global Access – School for Continuing Education
Specialization: Information Technology

TOC Reviewer:
Ms. Brinda Sampat, Assistant Professor, NMIMS Global Access – School for Continuing Education
Specialization: Information Technology

TOC Reviewer:
Kali Charan Sabat, Visiting Faculty, NMIMS Global Access – School for Continuing Education
Specialization: Operations Management

Author: Shashank Mishra

Reviewed By: Dr. R. Vijaylakshmi



Copyright: 2017 Publisher
ISBN: 978-93-86052-15-5
Address: 4435/7, Ansari Road, Daryaganj, New Delhi–110002
School Address: V. L. Mehta Road, Vile Parle (W), Mumbai – 400 056, India.



CONTENTS

CHAPTER NO. CHAPTER NAME PAGE NO.

1 Business Transformation with Big Data 1

2 Technologies for Handling Big Data 29

3 Basics of Business Analytics 69

4 Resource Considerations to Support Business Analytics 87

5 Descriptive Analytics 111


6 Predictive Analytics 141

7 Prescriptive Analytics 163



8 Social Media Analytics and Mobile Analytics 181



9 Data Visualisation 239

10 Business Analytics in Practice 263

11 Case Studies 287



Fundamentals of Big Data & Business Analytics

curriculum

Business Transformation with Big Data: What is Big Data; Structured v/s Unstructured Data; Big Data Skills and Sources of Big Data; Big Data Adoption; Characteristics of Big Data – The Seven V's; Understanding Big Data with Examples; Key Aspects of a Big Data Platform; Governance for Big Data; Text Analytics and Streams; Business Applications of Big Data; Technology Infrastructure Required to Store, Handle, and Manage Big Data

Technologies for Handling Big Data: Distributed and Parallel Computing for Big Data; Introduction to Big Data Technologies (Hadoop, Python, R, etc.); Cloud Computing and Big Data; In-Memory Technology for Big Data; Big Data Techniques (Massive Parallelism; Data Distribution; High-Performance Computing; Task and Thread Management; Data Mining and Analytics; Data Retrieval; Machine Learning; Data Visualization)

Introduction to Business Analytics: What is Business Analytics (BA)?; Types of BA; Business Analytics Model; Importance of Business Analytics Now; What is Business Intelligence (BI)?; Relation between BI and BA; Emerging Trends in BI and BA

Resource Considerations to Support Business Analytics: What is Data, Information and Knowledge; Business Analytics Personnel and their Roles; Required Competencies for an Analyst; Business Analytics Data; Ensuring Data Quality; Technology for Business Analytics; Managing Change

Descriptive Analytics: What is Descriptive Analytics; Visualizing and Exploring Data; Descriptive Statistics; Sampling and Estimation; Introduction to Probability Distributions

Predictive Analytics: What is Predictive Analytics; Introduction to Predictive Modeling: Logic-driven and Data-driven Models; Data Mining; Data Mining Methodologies

Prescriptive Analytics: What is Prescriptive Analytics; Introduction to Prescriptive Modeling; Nonlinear Optimization

Social Media Analytics, Mobile Analytics, and Visualization

• Social media analytics: What is Social Media?; Social Analytics, Metrics, and Measurement; Key Elements of Social Media Analytics
• Mobile analytics: Introducing Mobile Analytics; Mobile Analytics Tools; Performing Mobile Analytics
• Big Data visualization techniques: What is Visualization?; Importance of Big Data Visualization; Big Data Visualization Tools
• Business Analytics in Practice: Financial and Fraud Analytics, HR Analytics, Marketing Analytics, Healthcare Analytics, Supply Chain Analytics, Web Analytics, Sports Analytics and Analytics for Government and NGOs



Chapter 1

Business Transformation with Big Data

CONTENTS

1.1 Introduction
1.2 Evolution of Big Data
Self Assessment Questions
Activity
1.3 Structured v/s Unstructured data
Self Assessment Questions
Activity
1.4 Big Data Skills and Sources

1.4.1 The Sources of Big Data


Self Assessment Questions
Activity
1.5 Big Data Adoption

1.5.1 Use of Big Data in Social Networking


1.5.2 Use of Big Data in Preventing Fraudulent Activities
1.5.3 Use of Big Data in Retail Industry
Self Assessment Questions
Activity
1.6 Characteristics of Big Data – The Seven Vs
Self Assessment Questions
Activity
1.7 Big Data Analytics
1.7.1 Advantages of Big Data Analytics
Self Assessment Questions
Activity
1.8 Key Aspects of a Big Data Platform
Self Assessment Questions
Activity
1.9 Governance for Big Data
Self Assessment Questions
Activity


1.10 Text Analytics


Self Assessment Questions
Activity
1.11 Business Applications of Big Data
Self Assessment Questions
Activity
1.12 Technology Infrastructure Requirement
1.12.1 Storing of Big Data
1.12.2 Handling of Big Data
1.12.3 Managing Big Data
Self Assessment Questions
Activity

1.13 Summary
1.14 Descriptive Questions
1.15 Answers and Hints
1.16 Suggested Readings & References


Introductory Caselet

Big Data Handling in CGL Corporation

A $10-billion IT corporation, CGL Inc. has over 30,000 data centres across the world. With new-age virtualisation support catching up, CGL has already virtualised 82% of its data centres and is now aiming at 95%. The customers of CGL belong to multiple domains – industrial, IT, pharmaceuticals, aviation, government, defence, and so on. The data related to these domains, such as customer details, products, services and network activity, is what defines the business intelligence.

This data is a repository of great informational value for the industries it serves and the customers it deals with. It is of immense value to the corporation and serves as a driving factor for most of its business decisions, forward strategy, trend analysis and internal quality-control policy formulations.

However, at the same time, it also accounts for an immense magnitude of never-ending unstructured data, like videos, images, documents, server configurations, customer set-ups, infrastructure details, and so on. To unleash the actual BI potential lying underneath those information mountains, the corporation decided to implement Hadoop – an open source framework for distributed, data-intensive applications. Overall, the implementation has yielded positive results, with reduced disk I/O bottlenecks and linear scalability.

Going forward, CGL expects to consolidate the data scattered across its data centres throughout the world so that basic functions like retrieval and data-fetching operations can be performed faster. The Big Data analytical strategy, while fruitful, needs to be adaptive enough to accommodate further changes in methodologies and business technicalities, and to become a multi-source dividend-yielding platform.


learning objectives

After studying this chapter, you will be able to:
>> Discuss the evolution of Big Data
>> Describe the differences between structured and unstructured data
>> Explain Big Data skills and sources
>> Describe Big Data adoption
>> Elucidate the characteristics of Big Data
>> Explain Big Data analytics
>> Describe key aspects of a Big Data platform
>> Elucidate governance for Big Data
>> Discuss text analytics
>> Describe business applications of Big Data
>> Explain technology infrastructure requirements
1.1 INTRODUCTION
The 21st century is characterised by rapid advancement in the field of information technology. IT has become an integral part of daily life as well as of various industries, be it health, education, entertainment, science and technology, genetics or business operations. In today's competitive and global economy, organisations must possess a number of skills to create their place and sustain themselves in the market. One of the most crucial of these skills is an understanding of, and the ability to utilise and harness, the immense potential of information technology.
This is truly an information age, where data is being generated at an alarming rate. This huge amount of data is often termed Big Data. Organisations use data generated through various sources to run their businesses. They analyse the data to understand and interpret market trends, study customer behaviour and take financial decisions. The term 'Big Data' is now widely used, particularly in the IT industry, where it has generated various job opportunities.

Big Data consists of large datasets that cannot be managed efficiently by common database management systems. These datasets range from terabytes to exabytes. Mobile phones, credit cards, Radio Frequency Identification (RFID) devices and social networking platforms create huge amounts of data that may reside unutilised on unknown servers for many years. However, with the evolution of Big Data, this data can be accessed and analysed on a regular basis to generate useful information.

This chapter first discusses the evolution of Big Data. Next, it describes the differences between structured and unstructured data. Further, the chapter explains Big Data skills and sources, and then discusses the adoption of Big Data. The chapter also covers the characteristics of Big Data and Big Data analytics. Next, it discusses key aspects of a Big Data platform and text analytics. Towards the end, the chapter discusses business applications of Big Data and technology infrastructure requirements.

1.2 Evolution of Big Data


The earliest need for managing large datasets originated in the late nineteenth century, around 1880, when the US census authorities faced a critical problem: they held data on millions of citizens, including age, sex, gender, even whether a person was 'insane', and so on. The data also covered people who had been displaced after the great railroad program into random habitats or places far away from their original ones. The authorities felt the need for an efficient system that could hold data of such dynamics.
In 1890, the Hollerith Tabulating System was utilised for census – it
was a mechanical device and worked with punch cards that could hold
80 different variables or attributes. It revolutionised the way census
was conducted and reduced the time taken for compilation of census
data from almost seven years to six weeks.

Some years later, in 1919, IBM took up the agricultural census, with over 5,000 federal employees deployed across Washington and over 90,000 enumerators, using more than 100 million IBM punch cards and other processing equipment. After that successful program, Big Data took yet another leap forward with the development of the Manhattan Project – the atomic bomb developed by the US in World War II – and further still in the US space programs from 1950 onwards. Later, a synoptic data collection model was adopted, which relied heavily on the allocation of large data sets. This shift in data-collecting techniques, analysis and subsequent collaboration helped redefine how bigger scientific projects were planned and accomplished. One such ambitious project was the International Biological Program, which studied environmental changes on the species and flora-fauna of a particular place. This program led to an exponential increase in the amount of data gathered and combined the latest analysis technologies. Although it met with difficulties related to research structures and methodologies, and ultimately ended in 1974, it opened a host of transformed ways in which data was collected, organised and shared, and redefined how the existing technology could use data science more efficiently.

The lessons gained from the arrival of Big Data science laid the way for further contemporary Big Data projects, like weather prediction, supercollider data analytics and other physics-based research, astronomical sciences and data collection such as planetary image detection, medical research and many others. Big Data has become such a dynamic force that it no longer applies only to the sciences; many businesses have hooked their critical data-based services onto its methodologies, techniques and objectives too, which has allowed them to unleash data value that might have gone unnoticed earlier.

self assessment Questions

1. The path towards modern Big Data was actually laid during
__________.
2. In 1890, the Hollerith Tabulating System was utilised for
census. (True/False)

Activity

Where else, besides the existing industries and domains, do you think Big Data can play a crucial role in improving overall operational and organisational efficiency? Make a list of the domains with reasons to back them up.

1.3 Structured v/s Unstructured Data
Anything that has a well-defined arrangement, an easy-to-understand structure and a comprehensible hierarchy is considered a structurally sound entity. Anything which does not have the above-mentioned attributes is considered an unorganised and structurally weak entity.

For example, imagine a 10 GB Outlook .pst file (an Outlook email data file) holding the last two years of mail for a company executive who receives over 100 emails per day. If you open it raw, by means of reverse engineering, all you are going to see is a sea of randomly occurring datasets that point to nothing, with hard-to-decipher meanings and number codes and only occasional sightings of familiar words. But if you open it in the program it is made for, you will see the structure and arrangement in which it is supposed to be presented.

So, does anything that has a structure fall into place everywhere as a structure?

No. A Word file may not fit in a database where only text files are supposed to be kept. The Word file may have an internal structure, with all sorts of indentation, grammar, alignment and margins thoroughly worked upon, but in a database with different definitions for the data, the database designer expects a text or Excel file, and the Word file is considered unstructured.


The joys of having structurally sound data are many: it can be seamlessly added to a relational database and is easily searchable by the simplest of search engine operations or algorithms. Unstructured data is basically the reverse of the above definition. It is a nightmare for designers to connect the random strands of such data with the existing meaningful ones and present it as a structure. Structured data is closer to machine language than unstructured data. The battle of finding a fine balance between keeping the machine happy and the user happier is what leads to the ever-refining Big Data sciences and their affiliated technologies.
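To make the contrast concrete, here is a minimal sketch (the records below are invented for illustration) of how structured data supports direct field access, while the same facts buried in unstructured text must be recovered by pattern matching:

```python
import re

# Structured: fixed fields with known types - trivially queryable
structured_order = {"order_id": 1042, "customer": "A. Rao", "amount": 259.99}
print(structured_order["amount"])  # direct field access

# Unstructured: the same facts buried in free text - a parser must
# guess at patterns before the data becomes usable
unstructured_note = "Order 1042 from A. Rao came to $259.99, paid by card."
match = re.search(r"\$(\d+\.\d{2})", unstructured_note)
if match:
    print(float(match.group(1)))  # recovered only via pattern matching
```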

self assessment Questions

3. Anything that has a well-defined arrangement, easy-to-understand structure and comprehensible hierarchy is considered a structurally sound entity. (True/False)
4. Structured data is closer to ______ language.
IM
Activity

In your day-to-day life, write down all the structured data patterns you observe for a week and compare them with the unstructured patterns around you. Now think of ways in which connections between them could be made, if required. Relate only logically cohesive things, i.e. things that can co-exist.

Exhibit

Semi-structured data

Semi-structured data, also known as having a schema-less or self-describing structure, refers to a form of structured data that contains tags or markup elements in order to separate elements and generate hierarchies of records and fields in the given data. Such data does not follow the proper structure of data models as in relational databases. In other words, data is stored inconsistently in the rows and columns of a database. Some sources of semi-structured data include:
• File systems, such as Web data in the form of cookies
• Data exchange formats, such as JavaScript Object Notation (JSON) data

Now, consider the following scenario:

Mr. Smith also observes the presence of some semi-structured data saved in the database system of the publishing house. This data contains personal details of the authors working for the publishing house, as shown in the following table:

Semi-Structured Data
S. No. | Name | E-mail
1. | Sam Jacobs | smj@xyz.com
2. | First Name: David, Last Name: Brown | davidb@xyz.com

As you can notice from the preceding table, semi-structured data indicates that entities belonging to the same class can have different attributes even if they are grouped together. In this case, differently structured names and e-mails are grouped under common column names.
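Since the exhibit names JSON as a common carrier of semi-structured data, the following minimal sketch (author records invented to mirror the table above) shows how records of the same class can self-describe different attributes:

```python
import json

# Two records of the same class with differing attributes - the
# hallmark of semi-structured data
authors = [
    {"s_no": 1, "name": "Sam Jacobs", "email": "smj@xyz.com"},
    {"s_no": 2, "first_name": "David", "last_name": "Brown",
     "email": "davidb@xyz.com"},
]

for author in authors:
    # Each record self-describes its own fields via its keys
    name = author.get("name") or f'{author["first_name"]} {author["last_name"]}'
    print(name, "->", author["email"])

print(json.dumps(authors, indent=2))  # serialise back to JSON
```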

1.4 Big Data Skills and Sources
Now that we know theoretically what Big Data means, how it evolved into data science, the dramatic turnaround that made several industries latch onto it, and what kinds of data exist in the known data sphere, the next stage is reining in the data science itself. Here we will look at the tools of the trade that are frequently used and the skills you need to possess to tame datasets that may seem intimidating at first.

Normally, while dealing with an enormous number of datasets, you need a good sense for observing patterns, the frequency of data occurrences and other features that help in narrowing data down to its correct place. A keen statistical and data-mining mind will always take less time in finding patterns and studying the data. Hence, it is necessary to be hands-on with statistics and to have good mathematical skills – needless to say, you don't need to be a genius.

Big Data science uses concepts of statistics and relational database programming extensively. In forensic data analysis, patterns are often recorded and studied for days before they yield anything, even with the help of sophisticated software and tools.

According to a survey, the technical skills most commonly required for Big Data positions between 2012 and 2017 comprise knowledge of NoSQL, Oracle, Java and SQL. Moreover, the technical process/methodological requirements most often cited by recruiters were in relation to Agile software development, statistical analysis, test-driven development, Extract, Transform and Load (ETL) development, and Cascading Style Sheets (CSS). Besides, existing technologies such as Hadoop, a Java-based open source framework that actively supports large dataset processing, have long been in the game. Within the Hadoop data framework, multiple technologies like Hive, MapReduce, Pig, HBase and so on are also an efficient medium for transforming large datasets into meaningful bits and pieces for varying degrees of requirement.

Over the next five years, demand for Big Data staff is forecast to increase at an average rate of between 13% p.a. (low growth) and 23% p.a. (high growth). A mid-point average of these two rates gives an expected growth rate of 18% p.a. This would be a favourable situation and should equate to the creation of approximately 28,000 job opportunities p.a. by 2017.

That was an overview of Big Data technologies and methodologies, with a brief look at the job prospects for a potential Big Data candidate. Let us now take a brief look at the sources of the datasets that define Big Data as a science and complete it as a method.

1.4.1 The Sources of Big Data

The philosophy around Big Data science and collection has often been defined around the 3 Vs – the volume, velocity and variety of data flowing into a system. For many years this used to be enough, but as companies moved more towards online processes, the description has been stretched to take in variability as well – which denotes the increase in the range of values in a large dataset – and value, which addresses the evaluation of typical enterprise data.

The bulk of Big Data comes from three primary sources: machine data, social data and transactional data. Besides, companies need to distinguish between internally generated data, i.e. data residing behind a corporation's firewall, and externally generated data that is imported into a system.

Whether data is structured or unstructured is also a crucial factor, since unstructured data does not have a definite data model and, hence, requires more resources to make sense out of it.

The three top primary sources of data are described as follows:

• Social data comes from Tweets, Likes, Comments, Retweets, video uploads and the overall media shared on the world's most popular social media platforms. This type of data provides vital understanding of consumer behaviour and perception and can be hugely effective in marketing analytics. The public Web is another major source of social data, and tools such as Google Trends can be used to advantageous effect to increase the Big Data volume.
• Machine data is the data created by sensors installed in machinery and industrial equipment, and even the logs that track typical user behaviour. This data type is likely to grow manifold as the Internet of Things (IoT) becomes ever more prevalent and expands around the world. Sensors present in devices such as smart meters, medical devices, satellites, road cameras and games, and the ever-growing IoT, will deliver data of high value, velocity, variety and volume in the near future.
• Transactional data is the data generated from the online and offline transactions that occur daily. Invoices, storage records, payment orders, delivery receipts – all are considered transactional data.

Despite the immense variety of existing data, these datasets and types alone are almost meaningless, and most organisations struggle to make sense of the data they are generating and how it can be put to effective use.

self assessment Questions

5. Data that comes from door-to-door surveys falls in the _______ category.
6. ________ data is the data created by sensors installed in machinery and industrial equipment, and even logs that track typical user behaviour.

Activity

Can there be specific data types that are most reliable and authentic, while others are more prone to errors? Consider metrics such as references, quotes and sources while creating the visualisation.

1.5 Big Data Adoption



The adoption of a contemporary technology like Big Data can enable game-altering innovation that brings a transition in the structure of a business, whether in its services, products or organisation. However, managing innovation requires due attention: too many regulations can throttle the initiative and diminish the results, while too little oversight can turn a great project with great intentions into a science trial that never yields the promised results.

Given Big Data's nature and its analytical prowess, there are many issues that require consideration and planning at the very start. For example, with the adoption of any new technology, it becomes equally important to secure it in a way that conforms to current corporate standards. Tracking issues related to the source of a dataset, from its discovery to its consumption, is a new requirement for organisations. Managing the privacy of the parties whose data or identity is being handled by analytical processes must also be planned ahead.


In fact, all the above deliberations require the organisation to identify and set up different decision frameworks and governance processes to ensure that accountable parties know about Big Data's consequences and management requirements.

As explained earlier, there are many things to consider and account for when adopting Big Data.

Big Data frameworks are not push-button answers. For data analysis and analytics to offer value, corporations ought to have data management and governance frameworks for Big Data. Complete, well-defined processes and ample skill sets for those who will be responsible for customising, implementing, populating and using Big Data solutions are also necessary. Additionally, the quality of the data destined for Big Data-powered processing needs to be evaluated as well.

1.5.1 Use of Big Data in Social Networking

The magnitude of the datasets present in social media, even on not-so-popular sites, is large enough to warrant considering Big Data as the crucial technology for effectively utilising the barrage of data that is waiting to be comprehended.

For example, Facebook's ad feature is a comprehensive analytical tool that studies users' activities on different e-commerce websites and targets them with contextual ads that may arouse their interest and end in a successful purchase. This may seem simple at first, but in truth it is a clever use of Big Data science, deployed to study user activity and customise experiences in a way that is mutually rewarding for the corporations – a win-all situation in the end.

Big Data is also used to generate friend requests, activity suggestions and pages to be followed – all of these are Big Data behind the scenes, the chief driving force enabling you to reconnect with an old lost friend and customise your account to your liking and interests. Not only on Facebook: the interconnection of several other social media platforms has opened the potential of a new social media world order that may be brewing with several hidden features, the exploitation of which can prove beneficial for all.

1.5.2 Use of Big Data in Preventing Fraudulent


Activities

"The accountant for a U.S. company recently received an e-mail from her chief executive, who was on vacation out of the country, requesting a transfer of funds on a time-sensitive acquisition that required completion by the end of the day. The CEO said a lawyer would contact the accountant to provide further details. 'It was not unusual for me to receive e-mails requesting a transfer of funds,' the accountant later wrote, and when she was contacted by the lawyer via e-mail, she noted the appropriate letter of authorisation – including her CEO's signature over the company's seal – and followed the instructions to wire more than $737,000 to a bank in China."

Similarly, the clerk for a U.S. council received an e-mail from her senior, who was out of the country on vacation, requesting a funds transfer for a time-bound acquisition that needed to be closed by the end of the day. The senior said that a lawyer would contact her to provide further details.

"It was not uncommon for me to get official e-mails seeking funds transfer," the clerk said. Later, the lawyer contacted her via e-mail with the appropriate authorisation – including her senior's signature with the company's seal – and she simply followed the directions to transfer more than $880,000 to a bank in China.

Clearly, to handle such attacks, you need a unique defence outlook, and Big Data offers a potential answer, as it allows institutions and corporations to tackle fraud differently and get results accordingly.

Here is how Big Data helps in preventing fraud (a minimal sketch follows this list):

• Recognising suspicious activities in advance: Banks are always on the lookout for real-time data showing suspicious behaviour. For example, if a credit card owner transacts for the first time from a particular device, the bank gets notified. If multiple transactions occur from different devices in a day, the data generated is enough to raise an alarm and red-flag the transactions. A few banks also inform the actual cardholders instantly and can prohibit the transaction. Big Data is simplifying the detection of unusual transactions: if two transactions take place with a single credit card in different cities within a short period, the bank is going to be alerted.
• Leveraging data to detect suspicious activities: Banks access a large amount of customer data from various sources, such as social media, logs and call-centre conversations, and that data can be very helpful in determining abnormal activities. For example, suppose a credit card holder is currently travelling on an airplane and has posted this status on Facebook. Any transaction on the user's credit card during that period can then be considered suspicious and blocked at the bank's discretion.
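Here is the promised sketch: a minimal, hypothetical rule (the field names, sample values and one-hour threshold are invented for illustration; this is not a real bank's fraud engine) implementing the two-cities-in-a-short-period check described above:

```python
from datetime import datetime, timedelta

# Hypothetical transaction log for one card: (timestamp, city)
transactions = [
    (datetime(2017, 3, 1, 10, 15), "Mumbai"),
    (datetime(2017, 3, 1, 10, 45), "Delhi"),  # 30 minutes later, ~1,400 km away
]

WINDOW = timedelta(hours=1)  # illustrative threshold

def flag_suspicious(txns):
    """Red-flag consecutive transactions in different cities
    that occur within the time window."""
    flags = []
    for (t1, city1), (t2, city2) in zip(txns, txns[1:]):
        if city1 != city2 and (t2 - t1) <= WINDOW:
            flags.append((t2, city1, city2))
    return flags

for when, origin, dest in flag_suspicious(transactions):
    print(f"ALERT: card used in {origin} and {dest} within an hour ({when})")
```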

Let us now consider the insurance industry, which receives a lot of deceitful claims, accepts some of them and disburses substantial claim amounts. How does Big Data assist in such a case? The industry can access data gained from a variety of sources, such as past claim records, social media, phone records and criminal records. Upon receipt of a claim, the scrutiniser should verify the claimant's information. If any suspicious activity is found in the claimant's record, the claim should be forwarded for additional investigation.


The Chinese e-commerce giant Alibaba utilises Big Data effectively to handle fraud by subjecting any suspected fraudster to five stages of verification: Device Check, Account Check, Risk Strategy, Activity Check and Manual Review. Each step uses the immense amount of seller-related data and activity. For example, in the first stage, multiple questions may be asked, such as any previous record of suspicious activity, retailing experience, and so on. The second layer inspects technicalities such as the IP address and device ID, the number of devices the seller has used or is going to use, and so on.

Using Big Data gives industries involved in critical financial transactions an opportunity to avoid scams to a great extent. However, Big Data usage in such industries is still in its early stages, and a lot has to be done in this regard. It requires companies to be conducive to change and to learn to be data-driven and data-centric, solving problems that call for bigger datasets. A cultural change needs to happen for Big Data solutions to become the universal norm across the industry – including tolerating solutions that don't work or take you to a dead end but invariably end up educating you.
1.5.3 Use of Big Data in Retail Industry

Big Data has brought in some remarkable results for retailers across the industries, as evident from their testimonials.

A famous jewellery shop claims a 47% increase in holiday season sales, all thanks to Big Data. Similarly, a prominent hotel chain experienced increased online and over-the-phone reservations and enquiries after implementing a Big Data solution recommended by the consultants they employed to drive their business. As per an analysis undertaken by McKinsey, more than 250 companies with data-centric and data-driven approaches to sales and marketing choices improved their overall ROI by 20 to 25 percent over a five-year period.

However, as with any other great bargain, plenty of obstacles and cynicism still remain around using Big Data as the key retail transformation expert. Big Data is creating a lot of interest, as confirmed by many senior executives, but most of them struggle with common challenges – like aligning Big Data with use cases, identifying new (usually unstructured) types of data, and working out how to utilise Big Data for faster and more efficient decision-making.

Some clever usages of Big Data in the retail industry are true examples of the creative thinking of solution architects. Consider the following example, in which a hotel used Big Data to increase reservations.

Bad weather naturally results in decreased overnight stays at hotels due to fewer travellers. If you are in the hotel business, places with such unpredictable weather are not good. However, Café Inn turned this adversity to its advantage. The company observed that the travellers of a cancelled flight end up in an urgent situation and need an overnight stay. It used weather and flight cancellation information that was readily and freely available, coupled with hotel and airport information, and developed an algorithm that took in factors like travel conditions, weather severity, time of day and rates of cancellation by airlines, among other variables. With Big Data insights, and the recognition that stranded travellers were using their mobiles in this use case, the company effectively used Pay Per Click (PPC) and mobile search campaigns to send specific mobile ads to stuck travellers, making it easy for them to book a nearby hotel and increasing overall hotel revenue manifold, even at the most unexpected of times.
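As a hedged illustration of the kind of rule such an algorithm might encode (the function, factors, weights and threshold below are all invented for illustration; they are not Café Inn's actual model):

```python
# Hypothetical scoring rule for triggering a mobile ad campaign
def should_target(flight_cancelled: bool, weather_severity: int,
                  hour_of_day: int, airline_cancel_rate: float) -> bool:
    """Return True when stranded travellers are likely to need a room.
    weather_severity: 0 (clear) to 10 (severe); airline_cancel_rate: 0-1."""
    score = 0.0
    if flight_cancelled:
        score += 0.5                       # strongest single signal
    score += 0.03 * weather_severity       # worse weather, more cancellations
    score += 0.2 * airline_cancel_rate     # airline already cancelling often
    if hour_of_day >= 18 or hour_of_day <= 5:
        score += 0.2                       # evening/night: overnight stay likely
    return score >= 0.7                    # illustrative threshold

print(should_target(True, 8, 22, 0.4))  # True: launch PPC mobile ads nearby
```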

There are several such case studies and stories where Big Data’s effec-
tive utilisation resulted in a great deal of turnaround for corporations.
self assessment Questions

7. The fashion industry can utilise _________ to predict the next stage of fashion resurgence.

Activity

Big Data for retail industries can be a hit and miss affair. Discuss
with your friends.

1.6 CHARACTERISTICS OF BIG DATA – THE SEVEN Vs
The seven Vs of Big Data almost perfectly define the true Big Data attributes and sum it up as an effective yet extremely straightforward lens for datasets that involve dealing with incredibly large amounts of information. The key Vs used in Big Data are:
• Volume: When deliberating Big Data volumes, incredible sizes and numerical terms are required. Each day, data to the tune of 2.5 quintillion bytes is produced. Most companies on average have 100 terabytes of data stored, while Facebook users upload that much data on a daily basis.
• Velocity: The speed at which data is accumulated, generated and analysed is considered vital for more responsive, accurate and profitable solutions. Knowledge of the rate of data generation results in a faster system that is ready to handle that traffic.
• Variety: Beyond the massive volumes and data velocities lies another challenge – operating on the vast variety of data. Seen as a whole, these datasets are incomprehensible without any finite or defined structure.
• Variability: A single word can have multiple meanings. Newer trends are created and older ones are discarded over time – the same goes for meanings as well. Big Data's limitless variability poses a unique deciphering challenge if its full potential is to be realised.
• Veracity: What Big Data tells you and what the data tells you are two different situations. If the data being analysed is incomplete or inaccurate, the Big Data solution will be erroneous. This situation occurs when data streams come from multiple sources in a variety of formats. The overall analysis and effort are useless without first cleaning up the data they begin with.
• Visualisation: Another daunting task for a Big Data system is to represent the immense scale of information it processes as something easily comprehensible and actionable. For human purposes, the best methods are conversion into graphical formats like charts, graphs, diagrams, etc.
• Value: Big Data offers excellent value to those who can actually play with and tame it at its scale and unlock the true knowledge. It also offers newer and more effective methods of putting new products to their true value, even in formerly unknown markets and demands.

While Velocity, Volume and Variety are inherent to Big Data itself, the other Vs of Variability, Value, Veracity and Visualisation are important properties that reflect the gigantic complexity that Big Data presents to those who would analyse, process and benefit from it.

self assessment Questions

8. If a dataset complies with all the Vs but fails in one, due to incorrect details of the data received, which V does it fail to adhere to? ________

Activity

Are seven Vs enough or too many for Big Data classification? Critically explain both cases with examples.

1.7 BIG DATA ANALYTICS


Big Data analytics is a set of advanced analytic techniques used against very large, miscellaneous data sets that include unstructured and structured data, batch and streaming data, and sizes ranging from terabytes to zettabytes. Analysis of Big Data allows researchers, analysts and business users to make better and faster decisions using data that was previously unusable or inaccessible. Using advanced analytics techniques such as machine learning, text analytics, predictive analytics, statistics, data mining and natural language processing, businesses can examine previously untouched data sources independently or together with their current enterprise data to gain new perceptions, resulting in faster and better decisions.

1.7.1 Advantages of Big Data Analytics

Big Data analytics helps corporations utilise their data to identify new opportunities, which further leads to more efficient operations, smarter and well-calculated business moves, happier clients and higher revenues. Companies are actively looking to find workable insights in their data. Many Big Data projects are initiated by the need to answer key business requirements and questions. With the selection of a correct Big Data platform, an enterprise can increase efficiency and sales, improve operations, and be better at managing risks and servicing customers.
• Cost reduction: Big Data technologies like Hadoop bring substantial cost advantages when it comes to storing large amounts of data, and help recognise more efficient ways of doing business.
• Better and faster decision-making: With evolving new-age technologies and in-memory analytics, coupled with the ability to analyse new data sources, corporations are now able to analyse information immediately – and make decisions based on what they learn.
• New services and products: The clarity to read customers' needs and gauge their satisfaction analytically gives companies the power to give consumers what they want – even to the level of tailor-making the solution to the requirements of each individual customer. Such technological prowess has opened up further potential arenas of customer servicing.

self assessment Questions

9. The process of evaluating a situation and analysing it for creating faster and efficient decision-making systems is called ____________.

Activity

Analyse a real-life situation around you that could use Big Data analytics to increase overall operational and functional efficiency.

1.8 KEY ASPECTS OF A BIG DATA PLATFORM
For most organisations, the answer to many questions is Big Data itself – the massive volumes of structured and unstructured data generated within the organisation. Being able to analyse all this data in a meaningful way can be an intimidating task without the proper infrastructure and ways to process data from diverse sources effectively. And once you have managed it, it is another fight to make it meaningful to the people who need to understand it. So, for organisations to build the correct Big Data policy, here are the five crucial components to consider:
• A universal data model: Ensure your entire data is centralised and unified in a common data model to provide a single accurate view of the business. The conventions of the common data model, such as naming, field relationships and attributes, are created by the data model itself in a way that keeps everything aligned across transactional and other related systems.
• Exploit the power of external data: Capturing the true meaning of the data means successfully integrating internal source data with external data from diverse environments (like social media, vendor data and demographics). The platform should be flexible enough to accommodate information in multiple ways from multiple structured or unstructured distributed databases.
• Focus on open standards and scalability: Organisations can utilise existing systems efficiently by using a platform with scalable standards, simultaneously gaining flexibility and reducing IT-related costs. Open, industry-standard-compliant systems are readily available and preferred for many reasons, one being their effortless integration with existing systems from multiple other vendors, legacy systems and future add-on solutions.
• Platform-independent model: In today's age, information is readily accessible across various platforms; hence organisations must ensure a universal infrastructure for delivering and producing scorecards, dashboards, enterprise reports and ad-hoc analysis, while giving end users real-time, round-the-clock access to mobile BI and self-service BI, and the capacity to tailor-make their own BI content and customised dashboards using a simple point-and-click interface.
• Provide users with insights: Users need a single point at which to act on information, rather than switching between tasks or multiple applications. This type of cross-domain, closed-loop analytics ensures that Big Data will have an instant, beneficial and informative impact on daily operations.

Establishing a foundation for leveraging Big Data is worth the extra effort. When business users can make decisions and take actions straight from the analytics dashboard, the positive impact on customer experience and operations is almost instantaneous.


self assessment Questions

10. An ideal Big Data platform should not have:
a. Singular access mode
b. Platform-independent architecture
c. Well-defined structure
d. End-user-friendly and easily accessible design

Activity

How would you unify different data sets if you were given the opportunity to design and develop an architecture?

1.9 Governance for Big Data
Big Data governance is a crucial factor in managing diverse datasets because, many times, such data poses risks, like unplanned costs and misleading input data.
Since Big Data is a new model, ever changing with the dynamics of industries, data governance is at a nascent stage and not many know about it. With policies and procedures yet to be developed, many governance companies are offering services to help organisations organise their data.
Data modelling tools that aid data governance have unified the metadata repository, allowing effective integration and metadata aggregation from different data sources. Big Data governance gives data the required validation and the authority to selectively distribute the data within or outside the organisation. These modelling tools also provide great graphical data representation and advanced research while maintaining accuracy. This gives an organisation the scalability to discover and study different implications of Big Data.
Data governance tools actively deploy data pipelining technology, which enables sequential data processing where the output from one process works as the input for the next process. With these pipelines being linear or dynamic, the scale of data flexibility in data governance becomes high.
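A minimal sketch of that pipelining idea, assuming a simple linear pipeline of plain Python functions (the stage names and data are invented for illustration):

```python
# Each stage consumes the previous stage's output - a linear pipeline
def ingest(raw):
    return [line.strip() for line in raw if line.strip()]

def validate(records):
    return [r for r in records if "@" in r]       # keep well-formed e-mails

def distribute(records):
    return {"approved_for_sharing": records}      # selective distribution

pipeline = [ingest, validate, distribute]

data = ["  alice@xyz.com ", "not-an-email", "bob@xyz.com\n"]
for stage in pipeline:
    data = stage(data)                            # output feeds the next stage
print(data)  # {'approved_for_sharing': ['alice@xyz.com', 'bob@xyz.com']}
```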
A data governance strategy is a must – you can borrow it from another successful strategy, but make sure to custom-fit it to your unique business needs. The strategy should cover things such as access to information, ownership of different information types, and the purposes for which data is used. While defining the strategy, consider data quality, regulatory requirements, management of the information lifecycle, privacy and security.


A cross-functional method for data governance is recommended, and is a compliance requirement, since data and information systems often interact with different departments, which can leave top management with an opaque, non-transparent view of their functioning. Hence, there needs to be an all-inclusive approach to Big Data, taken by a team consisting of members from all the required departments, to check on auditable proof, controls and compliance documentation.

Data usually has a lifecycle, beyond which it either becomes obsolete or simply becomes a liability to be looked after. Overlooking such aspects is a common error that organisations commit. Hence, a standard schedule is never recommended for all data types, as they may have different retention stages. Data archival is recommended to enhance the overall performance of your applications.

self assessment Questions

11. Big Data governance model utilises the __________ thoroughly to present the information effectively.
Activity

Which governance model, other than the Big Data model, can you think of for managing low-traffic data centres?

1.10 Text Analytics


Text analytics is the conversion of unstructured textual data into comprehensible analytical data. It often includes processes like checking product reviews, measuring consumer opinions, analysing buyer sentiment and feedback, providing search facilities, and object modelling to ensure factual decision-making. Text analysis requires multiple statistical, linguistic and machine learning techniques and involves the retrieval of information from unstructured data and the restructuring of the input text to create patterns and trends, and to evaluate and interpret the data output. It also involves categorisation, alphabetical analysis, tagging, recognition of recurring or singular patterns, clustering, extraction of vital information, visualisation, link and association analysis, and predictive analytics. Text analytics determines topics, keywords, categories, tags and semantics from the humongous text data stored in different files and formats in a typical organisation. The term 'text analytics' also refers to 'text mining'.

Text analytics software provides servers, algorithm- and tool-based applications, extraction tools and provisions for data mining to turn unstructured data into data with some value. The output is composed of recovered entities and relationships and is stored in a relational format – typically XML – that is compliant with other analytical applications such as Big Data analytics, business intelligence tools or predictive analytics tools. Figure 1.1 shows the text analytics process flow:

[Figure: text analytics at the centre, connected to text identification, text mining, text summarisation, categorisation, text clustering, search access, entity/relation modeling, link analysis, sentiment analysis and visualisation]
Figure 1.1: Displaying the Text Analytics Process Flow
Source: https://s-media-cache-ak0.pinimg.com/originals/05/3d/e0/053de0478bb02ab7dfb73222059fe182.jpg

The features and processes involved in text analytics solutions are as follows:

• Text parsing, mining, identification, categorisation, extraction and clustering
• Extraction of entities, concepts, events and relations
• Indexing, Web crawling, search access and duplicate document identification
• Link analysis; identifying and analysing people, sentiments and other information from websites, reports, internal files, forms, surveys, claims, underwriting notes, employee surveys, medical records, blogs, emails, news, social media, online forums, customer surveys, market surveys, online reviews, website feedback, review sites, scientific journals, call centre logs, snail mail, transcripts, sales notes, etc.

Textual analytics has wide application dynamics and is often utilised in analysing market sentiment, consumer behavioural patterns and the segments that lead the domain for a given product or company. Besides, it drives a lot of actionable items, such as ad placement on sites customised to a user's browsing experience, enterprise business intelligence and records management, national security and intelligence, and scientific discoveries of species (especially in the life sciences domain), etc. A minimal sketch of basic text analytics follows.
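As a hedged, minimal illustration of the keyword-extraction and recurring-pattern steps named above (a toy word-frequency approach over invented reviews, not a production text-mining engine):

```python
from collections import Counter
import re

reviews = [
    "Great battery life, camera could be better.",
    "Battery drains fast; disappointed with battery.",
    "Camera is excellent, battery life is great.",
]

STOPWORDS = {"is", "with", "could", "be", "the", "a"}

def keywords(text):
    """Lowercase, strip punctuation and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

# Recurring-pattern recognition: most frequent terms across all reviews
counts = Counter(w for r in reviews for w in keywords(r))
print(counts.most_common(3))  # e.g. [('battery', 4), ('great', 2), ('life', 2)]
```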


self assessment Questions

12. __________ is the conversion of unstructured textual data into comprehensible analytical data.

Activity

Big Data for retail industries can be a hit-and-miss affair. Explain.

1.11 Business Applications of Big Data


From the examples described earlier in this chapter, Big Data's crucial role in transforming even the most adverse situations for companies, organisations or even smaller hotel chains is no uncommon achievement for a model that is meant to failsafe you against the worst of conditions. But Big Data is not limited to that; it comes with much deeper and broader applications.

As discussed earlier, Big Data is used to gain a better understanding of a customer's behaviour, needs and preferences. You might remember the example of the hotel chain discussed earlier, which can now almost accurately predict when the weather is going to go bad and customers will come hunting for its tailor-made services. Similarly, a car dealership can predict when the next car is going to be sold, and Walmart can predict the best-selling item at each point in time for a month, a year or any holiday season.

Big Data helps in optimising election campaigns as well. Reportedly, in the 2012 presidential election campaign, many believed Obama's win was because of his team's greater ability to use Big Data analytics to their advantage.

Big Data is now seeping into areas that were earlier prone to miscalculation and poor prediction, such as the stock inventory model, where a retailer could not decide, based on the surrounding factors, whether or not to stock up for upcoming seasonal sales. Now, the same retailer can optimise their stock using Web search trends, social media data and weather forecast predictions.

In supply chain and delivery route optimisation, Big Data is helping big time as well. Radio sensors, along with route optimisation based on traffic data, road blockages or even live protest detectors, are being actively used by many postal corporations. The power of Big Data analytics is now helping scientists decode entire DNA mutation sets in minutes, allowing them to find new cures, predict disease patterns and better understand genomes. Science and research are currently being transformed by Big Data and its associated techniques. For example, at CERN, the nuclear physics lab whose Large Hadron Collider is the world's most powerful and largest particle accelerator, experiments on the genesis of the universe are under way in search of the elusive God particle. The datacentre responsible for managing CERN's datasets has 66,000 processors to analyse around 30 petabytes of data produced. It uses the distributed computing power of thousands of systems located across 140 datacentres around the world. Such computing power can be utilised to change the way many other areas of science and research function and deliver results.

self assessment Questions

13. ___________'s election campaign actively used Big Data analytics to gain traction over the competition, as per a report.

Activity

Pattern-based recognition and fingerprint recognition systems store their data and keep it unique based on patterns and fingerprints. What do you think facial recognition systems keep as a unique identifier, and how does Big Data help with it?

1.12 Technology Infrastructure Requirement

Big Data is simply a large data repository with the following characteristics:
• Has distributed, redundant data storage
• Handles large amounts (a petabyte or more) of data
• Provides data processing (MapReduce or equivalent) capabilities
• Processes tasks in parallel
• Is relatively inexpensive
• Is centrally managed and orchestrated
• Is extensible – basic capabilities can be augmented and altered
• Is accessible – easy to use and available

So, the infrastructure that is going to host Big Data as the prime driver of an organisation must be robust, scalable, elastic and fail-safe for unplanned situations. But how do we arrive at such a robust scale of infrastructure? Will merely having super-expensive, high-spec systems and networking gear be enough, or does Big Data require something more than these usual factors?


Another driving force behind the successful implementation of Big Data is the software – both analytics and infrastructure. The primary infrastructure is Hadoop – open source Big Data management software used to distribute, manage, catalogue and query data across horizontally scaled multiple server nodes. Hadoop is basically a framework for storing, processing and analysing massive amounts of unstructured, distributed data. The Hadoop Distributed File System (HDFS), its file storage subsystem, was planned and designed to handle trillions of bytes of data distributed in parallel across multiple nodes.

The most important components of Hadoop are the Hadoop Distributed File System (HDFS), which provides storage, and MapReduce, for parallel processing of large datasets. Going forward, we will use Hadoop as the chief example of a Big Data product and infrastructure.

1.12.1 Storing of Big Data

Data, once gathered from your sources, is stored in sophisticated but accessible systems: a traditional data warehouse, a distributed/cloud-based storage system, a data lake, the company's servers or even a simple computer's hard disk, depending on the magnitude of the data received. For not-so-large amounts of data, one can consider using clustered network storage as the data-storing option, given that it is well designed and has failsafe measures to withstand unpredictable storage issues. However, for larger data inflows, where a group of interconnected networks alone will not suffice, it is better to consider cloud-based data caches or professionally managed datacentres. The following are the characteristics with which a typical HDFS storage system should comply:
N

‰‰ Scalable: Storage should be flexible in throughput, size and access speed.
‰‰ Tiered storage: It is important for the storage system to manage the hierarchy of the data across the range of storage devices present within a system, such as fast disk, flash, tape and slower disk.
‰‰ Widely accessible: Storage should be globally distributed to be closer to users for ready access.
‰‰ Backward compatible with analytical and content applications, and legacy systems: A well-built Big Data storage system should be flexible and heterogeneous. It should be composed of interfaces allowing access to the Big Data storage and its inbuilt functionality.
‰‰ Supports integration with cloud ecosystems: A near-perfect Big Data storage system must be built with cloud storage in purview, as cloud-based storage has come up as a great option for most businesses. It is flexible, requires neither your physical presence nor physical systems onsite, and it reduces the data security problem. It is also much cheaper than investing in and maintaining expensive data warehouses and dedicated systems.


1.12.2 Handling of Big Data

Handling large datasets is never a one-time job. Hadoop is changing the conventions of Big Data management, especially for unstructured data. Let us see how the Apache Hadoop software library plays a crucial role in managing Big Data.

Apache Hadoop streamlines the processing of excess data across computer clusters for any distributed processing system, using simple programming models. Instead of depending on hardware to provide uptime, the library has inbuilt features at the application layer to detect and handle breakdowns, providing a reliable and always-available service on top of a cluster of computers, each of which may be prone to failure (a small sketch of the underlying map-and-reduce idea follows the list below).

The Hadoop Community Package consists of:
‰‰ OS-level and file system abstractions
‰‰ A MapReduce or YARN (Yet Another Resource Negotiator) engine
‰‰ Hadoop Distributed File System (HDFS)
‰‰ Java ARchive (JAR) files
‰‰ Scripts needed to start Hadoop, documentation and source code, and a contribution section
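
To make the map-and-reduce idea concrete before the detailed discussion of the framework, here is a minimal, self-contained Python sketch that simulates the map, shuffle and reduce phases of a word count on a single machine. The sample documents are hypothetical; a real Hadoop job would distribute exactly this logic across many worker nodes.

    from collections import defaultdict

    documents = ["big data needs big storage", "data is the new oil"]

    # Map phase: emit a (key, value) pair for every word in every record
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group all emitted values by their key
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)

    # Reduce phase: aggregate the grouped values for each key
    counts = {word: sum(ones) for word, ones in grouped.items()}
    print(counts)  # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}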

1.12.3 Managing Big Data

A lot has been discussed and written about Big Data's functioning, its associated workflows, the technologies used and the traits they need to share in order to perform efficiently. The following key points should be considered while keeping Big Data management in context:


‰‰ Cluster design: Application requirements are evaluated in terms of volume, workload and other associated factors that form the basis of cluster design, which is not a repetitive process. The initial set-up is validated and verified with an application and a data sample before being actuated. Although the cluster design of a typical Big Data structure allows scalability by tuning configuration parameters, the large number of other parameters and their impacts on each other lead to additional complexity.
‰‰ Hardware architecture: A key factor that works in favour of Hadoop clusters is the quality of the equipment used. Most Hadoop users are concerned about cost, and as clusters grow, cost rises significantly. In the current scenario, the hardware requirements for the NameNode are higher RAM and lower or mid-range HDDs. If the JobTracker runs as a separate server, it needs higher CPU speed and more RAM. DataNodes are standard lower-end server machines.


‰‰ Network architecture: As of now, network architecture is not designed explicitly for Big Data. Inputs from application requirements and cluster design are not always mapped to it. The standard set-up for the network within the existing datacentre is used as the primary set-up. This results in a network deployment that is over-provisioned most of the time and has a negative effect on the MapReduce algorithm responsible for data processing. Hence, there lies great scope for creating actual guidelines for network architecture design for Big Data.
‰‰ Storage architecture: Most enterprises are already hugely invested in SAN and NAS devices when they consider Big Data. During implementation, they attempt to reuse the current storage infrastructure, even though DAS is recommended as storage for Big Data clusters.

‰‰ Information security architecture: An examination of multiple Big Data implementations illustrates that security features are treated as secondary to other pressing requirements of a demanding system, and aftermarket security solutions are not tailor-made for these clusters. These deployments often turn out to be insecure and rely solely on perimeter and network security support.

self assessment Questions

14. MapReduce was originally part of a framework developed at _______.

Activity

Can HDFS be replaced by a much more efficient system? Make a list of technologies that have the potential to do so.

1.13 Summary
‰‰ The Big Data sciences use concepts of statistics and relational database programming extensively.
‰‰ Normally, while dealing with an enormous number of datasets, you need to have a good sense for observing patterns, frequencies of data occurrences and other features that help in narrowing data down further to its correct place.
‰‰ The chunk of Big Data created comes from three primary sources: machine data, social data and transactional data.
‰‰ The adoption of a contemporary technology like Big Data can enable altering innovation that can bring a transition in the structure of a business, whether in its services, products or organisation.


‰‰ Big Data has brought in some remarkable results for retailers across industries, as evident from their testimonials.
‰‰ Analysis of Big Data allows researchers, analysts and business users to make better and faster decisions using data that was previously unusable or inaccessible.

key words

‰‰ Big data analytics: It is a set of advanced analytic techniques used against very large, miscellaneous data sets.
‰‰ Structured data: It is data with a well-defined arrangement, an easy-to-understand structure and a comprehensible hierarchy.
‰‰ Social data: It is the data that comes from tweets, likes, comments, retweets, video uploads and the overall media shared on the world's most popular social media platforms.
‰‰ Transactional data: It is the data that is generated from online and offline transactions occurring daily.
‰‰ Unstructured data: It is the data that is not well organised.

1.14 Descriptive Questions

1. Discuss the evolution of Big Data.
2. What are the basic differences between structured and unstructured data?
3. Enlist and explain different sources of Big Data.
4. Explain various characteristics of Big Data.
5. What are the different advantages of Big Data?
6. Explain the concept of text analytics with suitable examples.

1.15 Answers and Hints

Answers for Self Assessment Questions

Topic                                          Q. No.   Answers
Evolution of Big Data                          1.       Late Fifties
                                               2.       True
Structured v/s Unstructured Data               3.       True
                                               4.       Machine
Big Data Skills and Sources                    5.       Transactional
                                               6.       Machine
Big Data Adoption                              7.       Analytics
Characteristics of Big Data – The Seven Vs     8.       Veracity
Big Data Analytics                             9.       Big Data analytics
Key Aspects of a Big Data Platform             10.      a. Singular Access Model
Governance for Big Data                        11.      Data Model
Text Analytics                                 12.      Text Analytics
Business Applications of Big Data              13.      Presidential
Technology Infrastructure Requirement          14.      Google
IM
ANSWERS FOR DESCRIPTIVE QUESTIONS
1. The earliest need for managing large datasets of information originated back in the nineteenth century, around 1880. Refer to Section 1.2 Evolution of Big Data.
2. Anything that has a well-defined arrangement, an easy-to-understand structure and a comprehensible hierarchy is considered a structurally sound entity. Refer to Section 1.3 Structured v/s Unstructured Data.
3. Whether data is structured or unstructured is also a crucial factor, since unstructured data does not have a definite data model and, hence, requires more resources to make sense out of it. Refer to Section 1.4 Big Data Skills and Sources.
4. The seven signs of Big Data define the true Big Data attributes and sum it up as an effective yet extremely straightforward solution for those datasets that require dealing with incredibly plumped-up information. Refer to Section 1.6 Characteristics of Big Data – The Seven Vs.
5. Big Data analytics is a set of advanced analytic techniques used against very large, miscellaneous data sets that include unstructured/structured and batch/streaming data of sizes ranging from terabytes to zettabytes. Refer to Section 1.7 Big Data Analytics.
6. Text analysis requires multiple statistical, linguistic and machine-learning techniques and involves the retrieval of information from unstructured data and the restructuring of the input text to create patterns and trends, and to evaluate and interpret the data output. Refer to Section 1.10 Text Analytics.


1.16 Suggested Readings & References

SUGGESTED READINGS
‰‰ Mayer-Schönberger, V., & Cukier, K. (2014). Big data: A revolution that will transform how we live, work, and think. Boston: Mariner Books, Houghton Mifflin Harcourt.
‰‰ Erl, T., Khattak, W., & Buhler, P. (2016). Big data fundamentals: Concepts, drivers & techniques. Boston: Prentice Hall.

E-REFERENCES
‰‰ What is Big Data and why it matters. (n.d.). Retrieved April 22, 2017, from https://www.sas.com/en_us/insights/big-data/what-is-big-data.html
‰‰ Big Data. (2017, March 17). Retrieved April 22, 2017, from https://www.ibm.com/big-data/us/en/



Chapter 2

Technologies for Handling Big Data

CONTENTS

2.1 Introduction
2.2 Distributed and Parallel Computing for Big Data
    Self Assessment Questions
    Activity
2.3 Introduction to Big Data Technologies
    2.3.1 Hadoop
    2.3.2 Python
    2.3.3 R
    Self Assessment Questions
    Activity
2.4 Cloud Computing and Big Data
    Self Assessment Questions
    Activity
2.5 In-Memory Technology for Big Data
    Self Assessment Questions
    Activity
2.6 Big Data Techniques
    2.6.1 Massive Parallelism
    2.6.2 Data Distribution
    2.6.3 High-Performance Computing
    2.6.4 Task and Thread Management
    2.6.5 Data Mining and Analytics
    2.6.6 Data Retrieval
    2.6.7 Machine Learning
    2.6.8 Data Visualisation
    Self Assessment Questions
    Activity
2.7 Summary
2.8 Descriptive Questions
2.9 Answers and Hints
2.10 Suggested Readings & References

Introductory Caselet

Improved Data Security with Cisco and MapR Technologies

A company, Solutionary, located in Omaha, Nebraska, provides IT security and managed services to its customers. It has more than 310 employees, who handle trillions of queries from its customers per year. The main challenges for the company were to increase its data analytics capabilities for improving data security for its customers. The company also wanted to improve scalability, as the number of clients and datasets grows remarkably every year. In addition, the company wanted to reduce the costs of expanding the database solution to meet current business demands.

The company formed a partnership with Cisco and MapR Technologies for implementing the Cisco UCS Common Platform Architecture (CPA) for Big Data. MapR Technologies suggested the Apache Hadoop solution, which provides a completely new way of handling Big Data. Unlike traditional databases that store only structured data, Hadoop allows Solutionary to distribute and analyse both types of data, structured and unstructured, smoothly on a single data infrastructure.
on a single data infrastructure.

This partnership resulted in the following benefits for the company:
‰‰ Less time required to investigate security events for relevance and impact
‰‰ Easy data availability along with new services and enhanced security features
‰‰ Enhanced agility along with on-demand deployment of applications or services

According to Dave Caplinger, Director of Architecture of Solutionary, "By implementing MapR and Cisco UCS, we have achieved performance and flexibility with incredible scalability via Hadoop's clustered infrastructure. This infrastructure allows us to perform real-time analysis on big data in order to help protect and defend against sophisticated, organised, and state-sponsored adversaries."

He also declares, “MapR and Cisco UCS have many of the same
values: high performance, efficient management, and ease of use.
Using both solutions together enables us to scale our security analy-
sis services while keeping complexity and cost under control.”


learning objectives

After studying this chapter, you will be able to:


>> Explain distributed and parallel computing for Big Data
>> Recognise Big Data technologies
>> Describe cloud computing in reference to Big Data
>> Discuss in-memory technology for Big Data
>> Elucidate Big Data techniques

2.1 Introduction

The market is flooded with corporations offering custom-made tools and frameworks for implementing Big Data and analytics. However, behind the branding and beneath the platform, the basic features are common to all. Given below is a list of methods and practices that are usually followed for a typical Big Data implementation:
IM
‰‰ NoSQL database: It offers provisions for the storage and retrieval of data modelled in ways other than the tabular relations of typical relational databases, to cater efficiently to real-time situations (a short sketch follows this list).
‰‰ Data incorporation: Data management tools available as solutions, like Amazon Elastic MapReduce (EMR), that run customised versions of Apache Hive, Pig, Spark, Couchbase, MapReduce, Hadoop, MongoDB, etc.
‰‰ Data virtualisation: Virtualisation of multiple data sources into one helps in real-time extraction, fetching and storage operations from multiple sources, such as Hadoop and distributed data stores, all from a single point.


‰‰ Search and knowledge finding: These tools and applications aid in self-serviced processes to extract information and new findings from humongous storage spaces consisting of structured/unstructured data residing in numerous sources, such as databases, file systems, APIs, streams and other platforms and applications.
‰‰ Stream analysis: These tools and applications can enrich, aggre-
gate, filter and analyse a high data influx from multiple incongru-
ent real-time data sources and in any format.
‰‰ Data memory composition: These tools provide faster access and
processing of humongous data by spreading it across the dynamic
RAM, SSD or Flash storage of a distributed computer system.
‰‰ Big Data predictive analytics: Predictive analysis is simply the analysis of expected events and pre-planning to manage events that might have an impact on the overall structural, operational and functional aspects of an organisation. It usually comprises hardware- or tool-based solutions that let the organisation discover, evaluate, deploy and optimise predictive models by evaluating Big Data sources, in order to better business performance and alleviate risks.
‰‰ Quality of data: These products perform data cleansing and improvement on voluminous, high-speed datasets, using simultaneous operations on distributed databases and storage. They consist of software that performs the process of sourcing, cleansing, shaping and sharing different and untidy datasets to make the final data useful for analytics.
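
As a small illustration of the NoSQL item above, the following sketch stores and retrieves a schema-free document in MongoDB, one of the stores named in this list. It assumes the pymongo package and a MongoDB server running on localhost; the database and collection names are hypothetical.

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    events = client["demo_db"]["events"]  # hypothetical database and collection

    # Documents need no fixed schema; fields can vary from record to record
    events.insert_one({"user": "u42", "action": "click", "tags": ["promo", "mobile"]})

    # Query by field value, exactly as the data arrived
    print(events.find_one({"user": "u42"}))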

In this chapter, you will first learn distributed and parallel comput-
ing for Big Data. Next, you will learn the basics of Big Data technolo-
gies. Further, you will study cloud computing in reference to Big Data.
Next, you will learn in-memory technology for Big Data. Towards the
end, you will learn about various Big Data techniques.

2.2 Distributed and Parallel Computing for Big Data
IM
In Big Data, computing-related terminologies have meanings similar to those they carry in other fields, although with a different scope of applicability. Let us have a look at what they mean and what they stand for:
‰‰ Distributed computing: It works on the rules of the divide and conquer approach, performing modules of parent tasks on multiple machines and then combining the results. It basically consists of multiple processors interconnected by communication links, as opposed to parallel computing models, which usually (but not always) work on shared memory. Distributed systems basically aim at passing messages. If the systems are separated by geographically different locations, such a setup is said to be characteristically distributed. Imagine your computer and your friend's computer in the same room; by some interconnecting technology, you have managed to join them up as a single system for performing a task. Such a system would be called a parallel system. Now consider the same setup, albeit with your friend's computer miles away from yours, connected to a node that runs common to both systems' processing power. Such a setup would be called distributed.
‰‰ Parallel computing: Parallel computing refers to the utilisation of a single CPU present in a system, or a group of internally coupled systems, by means of efficient and clever multi-threading operations. It aims at finishing a specific computation operation in the lowest time possible by utilising multiple processors. The processor scale may vary from multiple logical units within a single processor, to many memory-sharing processors, to the distribution of a computational process over multiple computers. On computational models, parallelism is simply the execution of internal simultaneous threads of computation to achieve a final result. Parallelism is evident in finite real-time systems consisting of multiple processors with a single master clock used by all. In the context of Big Data, such parallel systems are the ones that execute from multiple dataset throughput points and run in parallel, connected to a master system (a small sketch follows Figure 2.1 below). Parallel computing is a close-coupled approach and is used in solving the following:
 Computation-exhaustive problems
 Bigger problems in the same time
 Similar-sized problems in the same time with high precision

Figure 2.1 shows a comparison between distributed and parallel processing techniques:

[Figure: distributed computing shown as a control server assigning tasks to grid nodes; parallel computing shown as compute nodes connected through a 10/100 MB/s Ethernet switch to network disk storage and the Internet]

Figure 2.1: Distributed Computing and Parallel Computing
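
The parallel model can be illustrated on a single machine with the short Python sketch below, which splits one computation-exhaustive task across all available CPU cores using the standard multiprocessing module; the workload itself (summing squares over number ranges) is purely illustrative.

    from multiprocessing import Pool, cpu_count

    def sum_of_squares(bounds):
        lo, hi = bounds
        return sum(n * n for n in range(lo, hi))

    if __name__ == "__main__":
        # Divide one large range into equal chunks, one per processor core
        step = 1_000_000
        chunks = [(i * step, (i + 1) * step) for i in range(cpu_count())]
        with Pool() as pool:
            partials = pool.map(sum_of_squares, chunks)  # executed in parallel
        # Combine the partial results, divide-and-conquer style
        print(sum(partials))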

Organisations use both parallel and distributed computing techniques to process Big Data. The most important constraint for businesses today is time. If there were no restriction on time, every organisation would hire outside (or third-party) sources to perform the analysis of its complex data. The direct benefit of adopting this method is that the organisation does not require its own resources and data sources to process and analyse complex data. These third parties are usually agencies specialised in the field of data manipulation, processing and analysis. Apart from being effective, hiring third-party agencies also reduces the storage and processing costs of handling large amounts of data.


DIFFERENCE BETWEEN DISTRIBUTED AND PARALLEL COMPUTING SYSTEMS

Table 2.1 differentiates between distributed and parallel computing systems:

Table 2.1: Difference Between Distributed and Parallel Computing Systems

Distributed Computing System:
‰‰ An independent, autonomous system connected to a network for accomplishing specific tasks
‰‰ Coordination is possible between connected computers that have their own memory and CPU
‰‰ Loose coupling of computers connected in a network, providing access to data and remotely located resources

Parallel Computing System:
‰‰ A computer system with several processing units attached to it
‰‰ A common shared memory that can be directly accessed by every processing unit in a network
‰‰ Tight coupling of processing resources that are used for solving a single, complex problem
IM
Besides these computing models, a commonly occurring model that lies somewhere between the two is called the concurrent computing model. Concurrency of a system is simply the operation of multiple threads that execute on single or multiple processors. Concurrency refers to the sharing of multiple resources in real time.

Distributed computing is considered a subset of parallel computing, which is a subset of concurrent computing.

self assessment Questions

1. The _________ works on the rules of the divide and conquer approach, performing modules of parent tasks on multiple machines and then combining the results.
2. _________ refers to the sharing of multiple resources in real time.
3. Parallel computing is a close-coupled system that is used in solving similar-sized problems in the same time with high precision. (True/False)

Activity

Supercomputers are multi-threaded, multi-processor and multi-core utilising systems. What kind of systems are they – parallel, distributed, concurrent or some hybrid in between these models? Explain.


2.3 Introduction to Big Data Technologies
A Big Data system is vastly different from other solution-providing systems and is based on the seven Vs described in the previous chapter, namely: Volume, Velocity, Variety, Veracity, Variability, Value and Visualisation. A system that complies with these properties, is robust enough to withstand unexpected events and is scalable enough to accommodate future methodologies qualifies to be called a Big Data system.

A typical Big Data system consists of a setup that adheres to these seven Vs and provides an infrastructure that can withstand the influx of huge datasets at high velocity, while providing an effective mechanism to process the datasets by cleansing, shaping, filtering and sorting them into meaningful information, aimed at making the data both user- and machine-friendly. Beneath the complex system of architecture, sophisticated hardware and methodologies working in conjunction with each other lie the interfaces responsible for communicating with the hardware and the user simultaneously – the programmable applications or tools that are the prime drivers of the efficiency of a typical Big Data system setup. A few such contemporary interface development programs are described in the next few sections, along with their applications.

2.3.1 Hadoop

Hadoop is an open-source platform that provides the analytical technologies and computational power required to work with such large volumes of data.

Earlier, distributed environments were used to process high volumes of data. However, multiple nodes in such an environment may not always cooperate with each other through a communication system, leaving a lot of scope for errors. The Hadoop platform provides an improved programming model, which is used to create and run distributed systems quickly and efficiently.

A Hadoop cluster consists of a single MasterNode and multiple worker nodes. The master node contains a NameNode and a JobTracker, while a slave or worker node acts as both a DataNode and a TaskTracker. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version. The standard start-up and shutdown scripts require Secure Shell to be set up between the nodes in the cluster. In a larger cluster, the Hadoop Distributed File System (HDFS) is managed through a NameNode server that hosts the file-system index, and a secondary NameNode that keeps snapshots of the NameNode; at the time of failure of the primary NameNode, the secondary NameNode replaces it, thus preventing the file system from getting corrupted and reducing data loss. Figure 2.2 shows the Hadoop multinode cluster architecture:

Figure 2.2: Hadoop Multinode Cluster Architecture

The secondary NameNode takes snapshots of the primary NameNode's directory information at regular intervals of time, which are saved in local or remote directories. These checkpoint images can be used in place of the primary NameNode to restart a failed primary NameNode without replaying the entire journal of file-system actions and editing the log to create an up-to-date directory structure. The NameNode is the single point for the storage and management of metadata. To process the data, the JobTracker assigns tasks to the TaskTrackers. Let us assume that a DataNode in the cluster goes down while processing is going on; the NameNode should know that some DataNode is down in the cluster, otherwise it cannot continue processing. Each DataNode sends a "Heart Beat Signal" to the NameNode every few minutes (as per the default time set) to make the NameNode aware of the active/inactive status of DataNodes. This system is called the Heartbeat mechanism.

There are two main components of Apache Hadoop – the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. Both of these are open-source projects; HDFS is used for storage and MapReduce is used for processing.

HDFS is a fault-tolerant storage system in Hadoop. It stores large files, from terabytes to petabytes, across different terminals and attains reliability by replicating the data over multiple hosts. The default replication value is 3: data is replicated on three nodes, two on the same rack and one on a different rack. A file in HDFS is split into large blocks of 64 MB by default (typically 64 to 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The NameNode actively monitors the number of replicas of a block (by default 3). When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
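
A quick back-of-the-envelope sketch in Python shows what these defaults imply for storage; the 1 GB file size used here is hypothetical.

    import math

    file_size_mb = 1024   # a hypothetical 1 GB file
    block_size_mb = 64    # HDFS default block size
    replication = 3       # HDFS default replication factor

    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication

    print(blocks)          # 16 blocks, each independently replicated
    print(raw_storage_mb)  # 3072 MB of raw cluster storage consumed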

Figure 2.3 shows the typical HDFS architecture:


Figure 2.3: HDFS Architecture


Source: https://hadoop.apache.org

MapReduce is a framework that helps developers write programs to process large volumes of unstructured data in parallel over a distributed or standalone architecture, producing results in a useful aggregated form. MapReduce consists of several components; a few important ones are mentioned here:
‰‰ JobTracker: It is the master that looks over the execution of a Ma-
pReduce job. It acts as a medium between the application and Ha-
doop.
‰‰ TaskTracker: It manages individual task execution on each of the
slave nodes.
‰‰ JobHistoryServer: It tracks completed jobs.

We can write MapReduce programs in several languages, such as C, C++, Java, Ruby, Perl and Python.
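
For example, with Hadoop Streaming (a utility shipped with Hadoop that lets any executable reading stdin and writing stdout act as mapper or reducer), a word-count job can be written as two small Python scripts. This is a minimal sketch; the reducer relies on Hadoop sorting the mapper output by key before it arrives.

    # mapper.py: read raw text lines, emit "word<TAB>1" pairs
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py: input arrives sorted by key, so counts can be streamed
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

A job of this shape would typically be launched with the streaming JAR, along the lines of: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py (the exact JAR path varies across Hadoop distributions, and the input/output paths here are hypothetical).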

The following are some important features of Hadoop:

‰‰ Hadoop performs well with several nodes without requiring shared memory or disks among them. Hence, efficiency-related issues in the context of storage and access to data get automatically solved.


‰‰ Hadoop follows the client-server architecture, in which the server works as a master and is responsible for data distribution among clients, which are commodity machines and work as slaves to carry out all computational tasks. The master node also performs the tasks of job control, disk management and work allocation.
‰‰ The data stored across various nodes can be tracked through the Hadoop NameNode. It helps in accessing and retrieving data as and when required.
‰‰ Hadoop improves data processing by running computing tasks on all available processors working in parallel. The performance of Hadoop remains up to the mark both in the case of complex computational questions and in that of large and varied data.
‰‰ Hadoop keeps multiple copies of data (data replicas) to improve resilience, which helps in maintaining consistency, especially in case of server failure. Usually, three copies of data are maintained, so the usual fault-replication factor in Hadoop is 3.
IM
Hadoop also manages hardware failure and smoothens data handling. The following inbuilt components of Hadoop make it a great platform for performing operations related to larger datasets:
‰‰ Hive: A data warehouse tool created by Facebook on top of Hadoop that converts query language into MapReduce jobs. It deals with the storage, analysis and querying of large sets of data. HQL (Hive Query Language) statements are used as the query language in Hive and are similar to SQL statements.
‰‰ Hbase: Hbase is a Hadoop application running atop HDFS. It represents sets of relations or tables, but is a column-oriented DBMS, different from conventional row-oriented DBMSs. The conventional databases we usually know are relational database systems, but Hbase is not a relational database, nor does it support a query language such as SQL.
‰‰ Pig: Pig is a high-level modular programming tool developed by Yahoo in 2006 for streamlining huge datasets with the use of Hadoop and MapReduce. Pig comprises two components – Pig Latin, the programming language, and the run-time environment where programs are executed, similar to the Java environment.

2.3.2 Python

Python is a popular interpreted, general-purpose, high-level dynamic programming language that aims to improve code readability and overall ease of use, expressing ideas in fewer statements than other competing languages such as C++ or Java.

The most acknowledged fact in favour of Python as a language is that it is widely used by developers, analysts and even finance/statistics executives, people of all intellectual levels, without getting too syntax-heavy. It retains its simplistic character, with a semantics system that is not too verbose, and yet it turns out to be one of the most flexible and powerful languages, with plenty of data libraries for data analysis and manipulation. It has the unique distinction of being a well-crafted programming language that is also easy to use for quantitative and analytical computing. Anyone with the slightest prior programming experience can settle down with Python faster than with any other language. This makes it a great choice for many companies that always try to find the best value for their time investment.

Python has been instrumental in building enormously flexible Web applications such as YouTube and has almost single-handedly driven the internal infrastructure of the search giant Google. Numerous corporations like Disney or Sony trust the reliability of Python to manage colossal groups of graphics servers that compile the imagery for blockbuster movies. Python consistently ranks higher than JavaScript, Ruby and Perl in popularity ratings.
IM
Just like Hadoop, Python has a custom implementation of Apache's Spark framework, which is used to handle, manage and analyse large chunks of datasets. Apache Spark is a large-scale data processing framework that is fast and can be customised according to the platform it is implemented on.
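
As a flavour of what this looks like in practice, here is a minimal PySpark sketch; it assumes the pyspark package is installed and that a local events.csv file (a hypothetical name, with a country column) exists.

    from pyspark.sql import SparkSession

    # Start a local Spark session; on a real cluster the master URL would differ
    spark = SparkSession.builder.appName("BigDataDemo").getOrCreate()

    # Read a (hypothetical) CSV file into a distributed DataFrame
    df = spark.read.csv("events.csv", header=True, inferSchema=True)

    # The aggregation is planned lazily and executed in parallel across workers
    df.groupBy("country").count().orderBy("count", ascending=False).show()

    spark.stop()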

However, a key point to note here is that Python itself is not what gets "implemented" in a Big Data system; rather, it is used for implementing multiple things, such as machine learning, data processing and visualisation, with the help of the multiple frameworks available for specific tasks. Libraries such as PyDoop and SciPy available with Python make it genuinely easier for an analyst to evaluate and manage datasets.

Python can be used for creating Hadoop MapReduce programs and applications that access the Hadoop HDFS API through the PyDoop package. The PyDoop package offers a MapReduce- and HDFS-compatible Python API, letting you connect to an existing HDFS installation; read and write files; and get information on files, directories or global file-system properties. Also, the MapReduce API helps you solve many complex problems with nominal programming effort. In addition, advanced MapReduce concepts like Record Readers and Counters can also be implemented using PyDoop.
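
A minimal sketch of the HDFS side of PyDoop might look as follows, assuming a reachable Hadoop installation with the pydoop package built against it; the path and file contents are hypothetical, and the exact open modes may vary across PyDoop versions.

    import pydoop.hdfs as hdfs

    # Write a small text file into HDFS (hypothetical path)
    with hdfs.open("/user/demo/sample.txt", "wt") as f:
        f.write("hello from pydoop\n")

    # Read the file back
    with hdfs.open("/user/demo/sample.txt", "rt") as f:
        print(f.read())

    # List the directory contents
    print(hdfs.ls("/user/demo"))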

Python provides effective provisions for tackling Big Data problems, some of which are listed as follows:
‰‰ Numpy: Its memory-mapped arrays allow you to access a file saved on the disk as if it were an array; only those parts of the array you need or are working with are loaded into memory (see the sketch after this list).
‰‰ Pytables and h5py: Libraries that provide access to HDF5 files, which allow access to just a specific part of the data. Further, many manipulations and mathematical operations on the data can be done without formally loading it into a Python data structure, thanks to the underlying libraries. They also allow lossless, seamless compression.
‰‰ Pandas: It allows high-level access to different types of data, such as CSV files, HDF5 data, databases or websites. It offers wrapper access to HDF5 files for Big Data, making it easy to do closer scrutiny of big datasets.
‰‰ Mpi4py: A tool for executing Python code in a distributed manner across numerous processors or even computers, allowing you to work on parts of your data concurrently, in a modular fashion.
‰‰ Blaze: A tool specifically meant to cater to Big Data. It is basically a wrapper built around the libraries described above, providing a consistent, steady interface to many huge data storage spaces (such as databases or HDF5 files) and applications, to make it easier to mathematically operate on, manipulate or simply analyse data that is otherwise too big for memory.
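
The memory-mapped array idea from the Numpy item above can be sketched as follows; the file name big.dat is hypothetical.

    import numpy as np

    # Create a 10-million-element array backed by a file on disk (~40 MB)
    arr = np.memmap("big.dat", dtype="float32", mode="w+", shape=(10_000_000,))
    arr[:5] = [1, 2, 3, 4, 5]
    arr.flush()  # persist the changes to disk

    # Re-open read-only; slicing loads only the touched pages into RAM
    view = np.memmap("big.dat", dtype="float32", mode="r", shape=(10_000_000,))
    print(view[:5])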
IM
There are some limitations to Python in the context of a Big Data implementation. In benchmarking performance, Python fares less than or equal to Java. It is not slow by any measure, but there still remains a lot of optimisation to be done. Let us move on to another statistical and analytical language called R, study it and then summarise the differences between the two languages.
M

2.3.3 R

R is an open-source programming language and an application environment for statistical computing with graphics, developed by the R Foundation for Statistical Computing. It is an interpreted language like Python and uses a command-line interpreter. It supports procedural programming as well as generic functions with OOP.

R is extensively used by data miners and statisticians, providing a vast variety of graphical and statistical techniques, with linear and nonlinear modelling, time-series analysis, classical statistical tests, clustering, classification and others. R is easily extendable and implementable through functions and available extensions. Another area of strength where R scores over its competitors is static graphics representation, which can produce publication-quality graphs.

R is an extremely powerful statistical and visualisation analysis tool that is used in Big Data for the following purposes:
‰‰ Visualisation (charts, graphs, etc.): Using the ggplot2 package and/or some inbuilt functions, e.g. plot()
‰‰ Data cleansing: Polishing the data to take out useful information
‰‰ Cluster/parallel computation: Using Apache Spark (SparkR)


Usually, when tackling Big Data with a language like R, there has to be a strategy and a streamlined process to follow. R is a statistical analyst's dream language that can visually enthral the best of data miners, while it may not appeal to coders who depend on or prefer outputs as raw as they can get. Here are a few key things one should take care of when dealing with R:
‰‰ Sampling: A dataset that is too big to be analysed as a whole can be sampled down to reduce its size. Now, the problem with sampling down is that the performance of a model can be affected significantly, because in a typical Big Data setup, plenty of data is always preferred over fewer sets of scattered data. However, according to many experts, a sample-based model is fine until the size of the data records goes beyond the one-billion threshold. So, if sampling can be avoided to bypass uncalled-for complexities, another Big Data approach is recommended. However, in situations where sampling is a must, it can still lead to substantial models, especially if the sample is:
 Still big in total numbers
 Not too small proportionally to the size of the entire dataset, and not biased
‰‰ Bigger hardware: Since R retains and keeps all objects in dynamic memory, this can pose a serious problem if the dataset gets exponentially larger. However, given current memory costs, it is easier to upgrade memory, and the current version of R can support up to 8 TB of RAM (on 64-bit machines) of standard data ready to be traversed or extracted, which is by no means a slouch for even the most demanding of programs.
‰‰ Storing objects on hard disk: As a substitute, there are a few packages that ensure that objects are stored on the hard disk and analysed in chunks. Though processing data in groups of chunks leads to parallelisation as a side effect, it is not much of a problem if the algorithms are capable of parallel analysis of data chunks. However, only those algorithms that are designed to analyse data chunks within the R system are supported. Any external concoction with an interface from a different platform might result in an error.
‰‰ Integration with other programming languages like Java or C++: The integration of programming languages gives R the great advantage of multi-platform compatibility and of being a performance-oriented language. Small modules of a program are moved to another language's (like Java's or C++'s) compilation and execution environment to avoid bottlenecks and expensive performance procedures. The goal of this feature is to balance R's elegant way of dealing with data efficiently on one hand and, at the same time, take advantage of the performance of other advanced programming languages on the other, thus getting the best of both.


The rJava package of R and Java is an example that facilitates the above-mentioned operation. Similarly, Rcpp is an example of the integration between R and C++. It is easy to outsource code from R to C++ using Rcpp; a simple understanding of C++ syntax is enough to utilise it.

Difference Between R and Python for Big Data

Both Python and R are popular and widely used programming languages for statistics. While R's functionality and usability are created with statisticians at their crux, given its strong data visualisation and charting prowess, Python is considered easier to comprehend, both by machines and by users, due to its simpler syntax.

In this section, we will study some differences between R and Python, and how they both co-exist successfully in the statistics world and in data science.

R and Python: The General Numbers

On the Internet, you can find good comparisons between the languages, adoption numbers and popularity charts of R and Python. While these figures are a good indicator of how these two languages have evolved so far, and are still evolving, in the computer science ecosystem, it is always tough to compare them side by side. The primary reason is that R is found only in data-science-related, stats-heavy, number-crunching environments, whereas Python is a dynamic language with a wide variety of applications in many fields, such as Web and software development.

When and how to use R?

R is used primarily when a standalone computing task is required for data analysis, or for individual servers. It is a great mode of examining data and figuring out patterns, with great use of visualisations. It is always ready for any type of data analysis because of its readily available tests, which keep you updated with the necessary tools to get up and running in a shorter time.

When and how to use Python?

When your data analysis errands need to be combined with Web-based apps, or if statistics-heavy code needs to be merged into a production database, Python is a no-brainer. Being a full-fledged dynamic programming language, it is a great tool for implementing algorithms for production use.


Table 2.2 lists the pros and cons of using R and Python for Big Data:

Table 2.2: Pros and Cons of Using R and Python for Big Data

Pros of R:
‰‰ Visualised data is often easier to understand than unreadable numbers randomly lying atop each other. R effectively utilises visualisation as a major plus point over all the other options available.
‰‰ R has a rich repository of front-line packages and great community support. All R packages are available in the R documentation.
‰‰ R is meant for statisticians. They can interconnect their respective ideas through R code and packages without necessarily requiring a computer background to start. It is also highly adaptive outside its own applicability zone.

Pros of Python:
‰‰ Ease of doing it: the in-built IPython notebook makes it easier to work, since you can easily share your notebook with a co-worker or peer without them needing to install anything. This considerably reduces the extra effort of organising code, output and note files.
‰‰ Python is a general-purpose programming language that is easy and intuitive. It comes with virtually no learning curve for those with prior programming experience, and it increases the speed at which you can create a program. You need less time to code, leaving more time to test it.
‰‰ The Python testing framework is an in-built testing framework that promotes test coverage and guarantees your code is reusable and dependable.

Cons of R:
‰‰ Although efficient, visualisation and other related processes can take a toll on computer performance, and R can turn out to be a slow performer due to poorly written and optimised code; with the help of packages like renjin, pqR and FastR, though, performance can be improved considerably.
‰‰ R's learning curve is a critical aspect, especially if you are coming from a GUI-based environment for your statistical analysis.

Cons of Python:
‰‰ Visualisation is an important criterion in an ideal data analysis software. While Python has good visualisation libraries, such as Bokeh, Seaborn and Pygal, its visualisations are nowhere close to R's in terms of comprehensibility and ease on the eye.
‰‰ Python still needs to come up with alternatives to several important R packages.


self assessment Questions

4. Pig was developed by Facebook in 2006. (True/False)
5. Which of the following manages hardware failure and smoothens data handling?
a. Pig  b. Hadoop
c. R  d. Python

Activity

Try to find out alternatives to R in Python that resonate equally well with the Hadoop HDFS architecture.

2.4 CLOUD COMPUTING AND BIG DATA
One of the vital issues that organisations face with the storage and management of Big Data is the huge investment required for the hardware setup and software packages. Some of these resources may be over-utilised or underutilised as requirements vary over time. We can overcome these challenges by providing a set of computing resources that can be shared through cloud computing. These shared resources comprise applications, storage solutions, computational units, networking solutions, development and deployment platforms, business processes, etc. The cloud computing environment saves costs related to infrastructure in an organisation by providing a framework that can be optimised and expanded horizontally. In order to operate in the real world, cloud implementation requires common standardised processes and their automation.

Figure 2.4 shows the cloud computing model:

[Figure: laptops, desktops and mobiles or PDAs connecting over the Internet to a cloud provider offering SaaS, PaaS and IaaS]

Figure 2.4: Cloud Computing Model


In cloud-based platforms, applications can easily obtain resources to perform computing tasks. The costs of acquiring these resources need to be paid according to the resources acquired and their use. In cloud computing, this feature of acquiring resources as required and paying based on use is known as elasticity. Cloud computing makes it possible for organisations to dynamically regulate the use of computing resources and access them as per need, while paying only for those resources that are used. This facility of dynamic use of resources provides flexibility; however, an organisation needs to plan, monitor and control its resource utilisation carefully. Careless resource monitoring and control can result in unexpectedly high costs.

A cloud computing technique uses data centres to collect data and ensures that data backup and recovery are automatically performed to cater to the requirements of businesses. Both cloud computing and Big Data analytics use the distributed computing model in a similar manner and are, hence, complementary to each other.
FEATURES OF CLOUD COMPUTING

The following are some features of cloud computing that can be used
to handle Big Data:
‰‰ Scalability: Scalability means the addition of new resources to an existing infrastructure. An increase in the amount of data being collected and analysed requires organisations to improve the processing ability of their hardware components. These organisations may, at times, need to replace existing hardware with a new set of hardware components in order to improve data management and processing activities. New hardware may not provide complete support to the software that used to run properly on the earlier set of hardware. We can solve such issues by using cloud services that employ the distributed computing technique to provide scalability to the architecture.
‰‰ Elasticity: Elasticity in the cloud means hiring certain resources as and when required, and paying for the resources that have been used. No extra payment is required for acquiring specific cloud services. For example, a business expecting the use of more data during an in-store promotion could hire more resources to provide higher processing power. Moreover, a cloud does not require customers to declare their resource requirements in advance.
‰‰ Resource pooling: Resource pooling is an important aspect of
cloud services for Big Data analytics. In resource pooling, multiple
organisations, which use similar kinds of resources to carry out
computing practices, have no need to individually hire all resourc-
es. The sharing of resources is allowed in a cloud, which facilitates
cost cutting through resource pooling.


‰‰ Self service: Cloud computing involves a simple user interface that helps customers directly access the cloud services they want. The process of selecting the needed services requires no intervention from human beings and can be carried out automatically.
‰‰ Low cost: Careful planning, use, management and control of resources help organisations reduce the cost of acquiring hardware significantly. Also, the cloud offers customised solutions, especially to organisations that cannot afford too much initial investment in purchasing the resources used for computation in Big Data analytics. The cloud provides them the pay-as-you-use option, in which organisations need to sign up only for those resources that are essential. This also helps the cloud provider in harnessing the benefits of economies of scale and providing a benefit to their customers in terms of cost reduction.

‰‰ Fault tolerance: Cloud computing provides fault tolerance by offering uninterrupted services to customers, especially in cases of component failure. The responsibility of handling the workload is shifted to other components of the cloud.
CLOUD DEPLOYMENT MODELS

Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services are offered in the form of various deployment models. The following are the most commonly used cloud deployment models:


‰‰ Public cloud (end-user level cloud): A cloud that is owned and managed by a company other than the one (either an individual user or a company) using it is known as a public cloud. In this cloud, there is no need for organisations (customers) to control or manage the resources; instead, the resources are administered by a third party. Some examples of public cloud providers are Savvis, Verizon, Amazon Web Services and Rackspace. You should understand that in the case of a public cloud, the resources are owned or hosted by the cloud service provider (a company), and the services are sold to other companies. Companies or individuals can obtain various services in a public cloud. The workload is categorised on the basis of service category; therefore, in this cloud, hardware customisation is possible to provide optimised performance. The process of computing becomes flexible and scalable through customised hardware resources. For example, a cloud can be used specifically for video storage, so that videos can be streamed live on YouTube or Vimeo. You can also optimise this cloud for handling large traffic volumes.
Businesses can obtain economical cloud storage solutions in a public cloud, which provides efficient mechanisms for complex data handling. The primary concerns with a public cloud include security and latency, which can be overlooked given the benefits of this cloud.


Figure 2.5 demonstrates the use of a public cloud:

[Figure: Companies X, Y and Z all accessing cloud services (IaaS/PaaS/SaaS) from a shared public cloud]

Figure 2.5: Level of Accessibility in a Public Cloud


‰‰ Private cloud (enterprise level cloud): The cloud that remains entirely in the ownership of the organisation using it is known as a private cloud. In other words, in this cloud, the cloud computing infrastructure is solely designed for a single organisation and cannot be accessed by other organisations. However, the organisation may allow this cloud to be used by its employees, partners and customers. The primary feature of a private cloud is that an organisation installs the cloud for its own requirements. These requirements are customary to the organisation, which plans and manages the resources and their use. A private cloud integrates all the processes, systems, rules, policies, compliance checks, etc. of the organisation at one place. In a private cloud, you can automate several processes and operations that require manual handling in a public cloud. Moreover, you can also provide firewall protection to the cloud, thereby solving many latency and security concerns. A private cloud can be either on-premises or hosted externally. In the case of on-premises private clouds, the service is exclusively used and hosted by a single organisation. Private clouds that are hosted externally are also used by a single organisation and are not shared with other organisations; here, the cloud services are hosted by a third party that specialises in cloud infrastructure. Note that on-premises private clouds are costlier than externally hosted private clouds. In the case of a private cloud, security is kept in mind at every level of design. The general objective of a private cloud is not to sell cloud services (IaaS/PaaS/SaaS) to external organisations, but to get the advantages of cloud architecture without handing over the management of one's own data centre.

Figure 2.6 demonstrates the use of a private cloud:


[Figure: a single organisation accessing cloud services (IaaS/PaaS/SaaS) from its own private cloud]

Figure 2.6: Level of Accessibility in a Private Cloud


‰‰ Community cloud: A community cloud is a type of cloud that is shared among various organisations with a common tie. This type of cloud is generally managed by a third party offering the cloud service and can be made available on or off premises. To make the concept of the community cloud clear, and to explain when community clouds can be designed, let us take an example. In any state or country, say England, a community cloud can be provided so that almost all government organisations of that state can share the resources available on the cloud. Because of the sharing of cloud resources on the community cloud, the data of all citizens of that state can be easily managed by government organisations.
Figure 2.7 shows the use of community clouds:
[Figure: two community clouds (for Level A and Level B), each offering cloud services (IaaS/PaaS/SaaS) to organisations having a common tie to share resources]

Figure 2.7: Level of Accessibility in Community Clouds


‰‰ Hybrid cloud: The cloud environment in which various internal or external service providers offer services to many organisations is known as a hybrid cloud. Generally, it is observed that an organisation hosts applications that require a high level of security and are critical on the private cloud, while applications that are not so important or confidential can be hosted on the public cloud. In hybrid clouds, an organisation can use both types of cloud, i.e. public and private, together. Such a cloud is generally used in situations such as cloud bursting. In the case of cloud bursting, an organisation generally uses its own computing infrastructure; however, during high-load requirements, the organisation can access clouds. In other words, the organisation using the hybrid cloud can manage an internal private cloud for general use and migrate the entire application, or a part of it, to the public cloud during peak periods.
Figure 2.8 shows a hybrid cloud:

[Figure: Organisations X and Y using cloud services (IaaS/PaaS/SaaS); an application is migrated from a private cloud to the public cloud]

Figure 2.8: Implementation of a Hybrid Cloud


Cloud is a multipurpose platform that not only helps in handling Big Data analytics operations but also performs various tasks, including data storage, data backup and customer service. Nowadays, business operations are performed mostly using laptops, tablets and mobile devices, which are suited for accessing cloud services, because most people today want to access computers even when on the move. In addition, many customers use the Internet to purchase a product or service. These online orders are taken from customers by product stores, which send instructions to the warehouse for delivering the product. The entire process of receiving orders, forwarding instructions to warehouses, handling payments and tracking deliveries can be assisted by the cloud, which is not essential but reduces the infrastructure cost and improves scalability in content storage.

CLOUD DELIVERY MODELS

Cloud environment provides computational resources in the form


of hardware, software and platform, which are deployed as services.

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 51

n o t e s

Therefore, we can categorise these services in the following manner:


‰‰ Infrastructure as a Service (IaaS): It is one of the categories of
cloud computing services, which makes available virtualised com-
puting resources on the Internet. It helps in avoiding the expense
of buying and managing your own physical resources, as you can
use any resource virtually using the Internet and paying the rent
for as long as you need it. Actually all the responsibility is of the
cloud computing service provider, who manages the infrastruc-
ture, its installation, configuration, and the software purchased.
‰‰ Platform as a Service (PaaS): It is built above IaaS and is the
layer that interacts with the users, allowing them to deploy and
use applications created using programming and run-time envi-
ronment platforms that are supported by the provider. This is the
stage where DBMS related to Big Data are implemented.

S
‰‰ Software as a Service (SaaS): SaaS is one of the most popular
cloud-based models and comprises applications provided by the
service provider.
IM
Exhibit

Difference between SaaS, PaaS and IaaS

The cloud is a broad concept and it covers just about every possi-
ble sort of online service, but when businesses refer to cloud pro-
M

curement, there are usually three models of cloud service under


consideration: Software as a Service (SaaS), Platform as a Service
(PaaS),and Infrastructure as a Service (IaaS). Each has its own in-
tricacies and hybrid cloud models, but today we’re going to help
you develop an understanding of high-level differences between
N

SaaS, PaaS and IaaS.

Build Buy Deploy


IaaS SaaS PaaS

Software as a Service

In some ways, SaaS is very similar to the old thin-client model of


software provision, where clients, in this case usually Web brows-
ers, provide the point of access to software running on servers.
SaaS is the most familiar form of cloud service for consumers.
SaaS moves the task of managing software and its deployment to
third-party services. Among the most familiar SaaS applications for
business are customer relationship management applications like
Sales force, productivity software suites like Google Apps and stor-
age solutions brothers like Box and Drop box.

NMIMS Global Access - School for Continuing Education


52 Fundamentals of Big Data & Business Analytics

n o t e s

Use of SaaS applications tends to reduce the cost of software owner-


ship by removing the need for technical staff to install, manage and
upgrade software, as well as reduce the cost of licensing software.
SaaS applications are usually provided on a subscription model.

Platform as a Service

PaaS functions at a lower level than SaaS, typically providing a


platform on which software can be developed and deployed. PaaS
providers abstract much of the work of dealing with servers and
give clients an environment in which the operating system and
server software, as well as the underlying server hardware and net-
work infrastructure are taken care of, leaving users free to focus
on the business side of scalability and the application development
of their product or service. Businesses can requisite resources as

S
they need them, scaling as demand grows, rather than investing in
hardware with redundant resources. Examples of PaaS providers
include Heroku, Google App Engine and Red Hat’s OpenShift.
IM
Infrastructure as a Service

Moving down the stack, we get to fundamental building blocks for


cloud services. IaaS is comprised of highly automated and scalable
compute resources, complemented by cloud storage and network
capability, which can be self-provisioned, metered and available
M

on-demand.

IaaS providers offer these cloud servers and their associated re-
sources via dashboard and/or API. IaaS clients have direct access
to their servers and storage, just as they would with traditional
N

servers but gain access to a much higher order of scalability. Users


of IaaS can outsource and build a “virtual data center” in the cloud
and have access to many of the same technologies and resource ca-
pabilities of a traditional data center without having to invest in ca-
pacity planning or the physical maintenance and management of it.

IaaS is the most flexible cloud computing model and allows for
automated deployment of servers, processing power, storage and
networking. IaaS clients have true control over their infrastructure
than the users of PaaS or SaaS services. The main uses of IaaS in-
clude the actual development and deployment of PaaS, SaaS and
Web-scale applications.
Source: https://www.computenext.com/blog/when-to-use-saas-paas-and-iaas/

CLOUD PROVIDERS IN BIG DATA MARKET

Big Data cloud providers have been gearing up to bring the most ad-
vanced technologies at competitive prices in the market. Some pro-
viders are established, whereas some of them are relatively new to the

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 53

n o t e s

field of cloud services. Some of these providers are rendering services


that are relevant to Big Data analytics only. Some such providers are
discussed as follows:
‰‰ Amazon: Amazon is one of the largest cloud service provider, and
it offers its cloud services as Amazon Web Services (AWS). AWS
includes some of the most popular cloud services, such as Elastic
Compute Cloud (EC2), Elastic MapReduce, Simple Storage Ser-
vice (S3), etc. Some of these services are discussed as follows:
 EC2: It is a Web service that employs a large set of computing
resources to perform its business operations. These resources
are not properly utilised by Amazon, and therefore, they are
pooled in the form of an IaaS cloud so that other organisations
can take the benefit of these resources, ultimately benefitting
Amazon through the rental cost. Organisations can use these

S
resources elastically in a way that the hiring of resources is
possible on an hourly basis.
 Elastic MapReduce: It is a Web service that uses Amazon EC2
IM
computation and Amazon S3 storage for storing and process-
ing large amounts of data so that the cost of processing and
storage is reduced significantly.
 DynamoDB: It is a NoSQL database system in which data stor-
age is done on Solid State Devices (SSDs). DynamoDB allows
data replication for high availability and durability.
M

 Amazon S3: Amazon Simple Storage Service (Amazon S3) is


a Web interface that allows data storage over the Internet and
makes Web-scale computing possible.
 High Performance Computing (HPC): It is a network that is
N

replete with high bandwidth, low latency and high computa-


tional abilities, which are required for processing Big Data,
especially for solving issues related to education and business
domains.
 RedShift: It is a data warehouse service that is used to anal-
yse data with the help of existing tools of business intelligence
in an economical manner. You can scale Amazon RedShift for
handling data up to a petabyte.
‰‰ Google: Cloud services that are provided by Google for handling
Big Data include the following:
 Google compute engine is a computing environment, which is
secure, flexible, and based on virtual machine.
 Google BigQuery is a Desktop as a Service (DaaS), which is
used for searching huge amounts of data at a faster pace on the
basis of SQL-format queries.
 Google prediction API, which is used for identifying patterns
in data, storing patterns and improving the patterns with suc-
cessive utilisation.

NMIMS Global Access - School for Continuing Education


54 Fundamentals of Big Data & Business Analytics

n o t e s

‰‰ Windows azure: Microsoft offers a PaaS cloud that is based on


Windows and SQL abstractions and consists of a set of develop-
ment tools, virtual machine support, management and media ser-
vices and mobile device services. Windows Azure PaaS is easy-
to-adopt for people who are well equipped with the operations of
.NET, SQL Server and Windows. In addition, the Windows Azure
HD Insight option added to the PaaS cloud makes it possible for
cloud users to address emerging requirements for integrating Big
Data into Windows Azure solutions.
The platform used for building the Windows Azure PaaS is Horton
works Data Platform (HDP) that, as stated by Microsoft, is fully
compatible with Apache Hadoop. Moreover, Microsoft Excel and
various other Business Intelligence (BI) tools can be connected to
Windows Azure with support from HDInsight, which can be devel-
oped on the Windows Server also.

S
Hadoop is used as a cloud service in Windows Azure PaaS with
the help of HDInsight. HDFS and MapReduce related frameworks
are thus, offered economically, and in a simpler way, by the in-
IM
tegration of Hadoop in this PaaS. The efficient management and
storage of data are important features of HDInsight, which also
uses the Sqoop connector for importing the Windows Azure SQL
data into HDFS or exporting the data to a Windows Azure SQL
database from HDFS.
M

self assessment Questions

6. The cloud environment in which various internal or external


service providers offer services to many organisations is
known as a _______.
N

a. private cloud
b. public cloud
c. hybrid cloud
d. community cloud
7. The SaaS model of cloud service allows its users to deploy and
use applications on run-time environment platforms, which
are provided on the Internet and supported by the provider.
(True/False)

Activity

Search the names of companies on the Internet, which make avail-


able different cloud computing service (IaaS, PaaS, or SaaS) bene-
fits to their users.

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 55

n o t e s

2.5 IN-MEMORY TECHNOLOGY FOR BIG DATA


Nowadays, there are systems that require data availability to be faster
than anything ever before. Imagine in future a real-time stock bidding
platform where corporations bid for stocks of their choices in lots.
Even if we consider the bidding for a penny stock being auctioned
for a lot of two million stocks, a single fluctuation of few pennies can
result in profit turning towards the loss statement and can deride the
deal away. Now imagine the same for blue chip shares, this is one such
example where the utopian requirement of the Big Data to be in al-
ways ready and standby mode and serve back the data in the quickest
possible time is already being catered and served by many corpora-
tions well ahead in time.

Twitch is a social media gaming platform community that serves 100

S
million members supporting over 3 million concurrent visitors watch-
ing and chatting about games from over 2 million broadcasters where
the capacity of a chat room often goes beyond 500,000 in a single chat
room. Besides, it also offers a target-based advertising – a potential
IM
revenue driver, based on the chat history. This is one such example
where hardware obstructions and limitations, lag of memory indiffer-
ences have to be sidelined and streamlined with something faster like
a cache memory or dynamic access memory so that the data is readily
available for disposal. To deliver such services and capabilities, busi-
nesses require the skill to integrate both abrupt dynamics with histori-
M

cal breakdown and evaluation of the information. This combo provides


direction and context for taking real-time decisions. The in-memory
big data computing tool supports the processing of high velocity data
in real-time and also faster processing of the stationary data. Tech-
nologies like event streaming platforms, in-memory databases and
N

analytics and high level messaging structures are witnessing massive


growth that resonates with the organisational needs.

Now cost variations for such setups have abridged. Figure 2.9 shows
the cost of various storage technologies available for a sample 1GB of
memory along with respective read/write performance:

500 µsec
$9 250 µsec
90 µsec

25 µsec

1 µsec
0.10 µsec

$2 $0.4
$1
0
DRAM NV-DIMM/PM NVMe SSD SATA SSD

1GB cost Read Latency Write Latency

Figure 2.9: Showing the Cost of Various Storage Technologies


Source: http://flarrio.com/in-memory-big-data-real-time-decisions-technology-2016/

NMIMS Global Access - School for Continuing Education


56 Fundamentals of Big Data & Business Analytics

n o t e s

It takes $9 for 1GB of RAM, $0.40 for SSDs and $1 for PCI compatible
memory cards. The choice of a specific memory technology is subject
to its raw performance figures for a real-time scenario than bench-
marking figures, for a given use case. As memory evolution goes on,
new dynamic memory substitutes are shortening performance gaps
by far and large. Database-related technologies are adapting with the
evolution that has struck the goldmine for corporations for giving a
capability to fuse the newer and older setups in tandem with deliver-
ing radical performance to cost ratios.

self assessment Questions

8. The _________ tool supports processing of high velocity data


in real-time and also faster processing of the stationary data.

S
9. Twitch is a social media gaming platform community.(True/
False)
IM
Activity

Study the evolution of storage-based flash memories along with


their counterparts’ dynamic memories and try to figure out com-
mon points where the difference between them in future is going to
be the shortest before either one of them takes another lead.
M

2.6 BIG DATA TECHNIQUES


To analyse the datasets, there are many techniques available. In this
N

section, we will study about some of the techniques that are used to
tackle datasets and bring them to a conclusive end. However, this list
is not exhaustive since newer methodologies and techniques keep on
evolving from time to time.

2.6.1 MASSIVE PARALLELISM

According to the simplest definition available, a parallel system is


a system where multiple processors are involved and associated to
carry out concurrent computations. Since operations go side by side,
the parallelism occurs in the processes and hence the technique is
called parallel computing. Massive parallelism refers to a parallel sys-
tem where multiple systems interconnected with each other pose as
a single mighty conjoint processor and carry out tasks received from
the data sets parallelly. However, things don’t end here. In terms of
Big Data dynamics, the systems can not only be processor, but also
memory, hardware and even network conjoint to scale up the opera-
tional efficiency posing as a massive system that can eat humongous
datasets parallelly without breaking a sweat. But, this is where the
complacency of a hardware owner may pose an error-prone system.

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 57

n o t e s

Let’s say an organisation can afford a hypothetical 1TB of RAM in a


single system. While the system will certainly be efficient and faster
in operations than anything else, the framework or the driving force
of that hardware may not be as efficient to utilise the full potential of
those terabytes of dynamic memory, half of which might lie wasted
and underutilised. Further factors that can affect a typical setup can
be many – incompatible processors, latencies, MOSFETs based error,
storage lag, delay in processing and other hardware related flaws. On
the application side, the software may not be properly optimised for
concurrent usage or may break down over multiple simultaneous ac-
cess. These are bottlenecks for parallelism which if duly looked after,
can actually work with existing systems and spare the need of upgrad-
ing to expensive hardware. While hardware specs are crucial, the in-
terface driving application selection is equally important for such a
large-scale methodology.

S
2.6.2 DATA DISTRIBUTION

Distribution of data is a highly critical step in a typical Big Data setup.


IM
There are approaches to data distribution in a Big Data system de-
scribed as follows:
‰‰ Centralised approach: A central repository is used to store and
download the essential dataset by virtual machines. In the starting
script, all virtual machines connect to the central repository and
get the required data. A limitation of such an approach is that if
M

multiple transfers are parallelly requested, the server will drop the
connections due to numerous virtual machines seeking blocks of
data – leading to a flash crowd effect.
‰‰ Semi-centralised approach: Given the flash crowd effect in the
N

earlier approach, the semi-centralised approach reduces the stress


on the networking infrastructure. It shares the dataset across mul-
tiple machines in the data centre at different times. The limitation
of such approaches is that when datasets change, they may grow
beyond its predefined size making it difficult to foresee the chang-
es and expect the outcome.
‰‰ Hierarchical approach: If datasets keep on adding new data to
itself, semi-centralised approach becomes hard to track and main-
tain. In a hierarchical approach, the data is fetched from the par-
ent node, i.e. the virtual machine, in the hierarchy. But, this conse-
quently leads us to bottleneck of the first approach and it cannot
offer failure-resistance during the transfer if one virtual machine
gets stuck then the deployments of all the VMs fail after the trans-
fers have been initiated.
‰‰ P2P approach: P2P streaming connections are based on hierarchi-
cal multi trees. Each system acts as a client and server and to ac-
cess virtual machines, the data centre environment offers a low-la-
tency, firewall and NAT excluded, and unmonitored ISP traffic to
deliver a P2P delivery of datasets for the big data.

NMIMS Global Access - School for Continuing Education


58 Fundamentals of Big Data & Business Analytics

n o t e s

These approaches deal with design challenges for flexible data-heavy


systems which stem from issues as described ahead. First, a high-
ly-distributed system automatically paves the way for high availability
and scalability. Data distribution occurs in all levels, from web/cloud
server farms to caches to storage at the backend. Second, the single
system image abstraction with consistent reads and transactional
writes using query languages is difficult to achieve at the given scale.
Applications need to be alert of the data replicas; and to handle incon-
sistencies from replica updates that are conflicting; and continue op-
erations even in the occurrence of a network processor and software
failure. Third, each Big Data database application like NoSQL comes
with a set of compromises on quality, especially in terms of scalability,
performance, consistency and durability. The solution architects must
meticulously evaluate and select the databases that fulfil the appli-
cation’s requirements. This situation often ends up in a polyglot per-

S
sistence – where multiple database technologies are singularly used to
store multiple datasets in a single system.
IM
2.6.3 HIGH-PERFORMANCE COMPUTING
High-performance computing is the simultaneous use of supercom-
puters and parallel processing techniques for solving intricate com-
putation problems. It emphasises making parallel processing systems
and algorithms by joining both parallel and administrative computa-
tional methods. The words ‘supercomputing’ and ‘high-performance
M

computing’ are often used to resemble each other.


High-performance computing is used for performing research activ-
ities and cracking advanced problems through computer simulation,
modelling and analysis. Sometimes, such computing prowess is used
N

in special observations, satellite imagery and weather analytics as well


through the means of concurrency of computing resources.
For Big Data, where large datasets are required to be broken down
into chunks and then evaluated into meaningful data, high perfor-
mance computing comes as an excellent partner to be coupled with.
Hadoop enjoys some excellent libraries and with the load sharing ca-
pability of MapReduce, a typical big data system can use large files,
and for analytics processing, can perform huge block-wise sequential
read operations. Utilising a parallel file system such as Lustre, which
is a massively parallel and open-source file system developed by In-
tel, designed for large-scale data and high-performance charts, comes
handy in such systems. The bandwidth for such a file system often ex-
ceeds 700 GB/s or more, with premium users getting 1.9 TB/s as band-
width, Lustre easily scales up to thousands of clients and few hundred
petabytes as storage.
Besides that, Hadoop utilises popular accelerators, such as Kepler
GPUs. Like these technologies assist significantly in calculating solu-

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 59

n o t e s

tions, they also assist Big Data in the bioinformatics domain as they do
for sequencing and alignment.

2.6.4 TASK AND THREAD MANAGEMENT


Threads are simply the OS-based feature with their own Kernel and
memory resources, and allow an application logic to be segregated
into concurrent multiple execution paths. It is a useful feature when
complex applications having multiple tasks need to be performed at
the same time.
When an OS executes an application instance, it creates a process
having an execution thread to manage the instance. This is just the
programming instruction being performed by the code. You can sus-
pend or resume a thread, but not a task. Task can only be killed or
started. This is where it becomes a problem statement for data intense

S
environments and to deal with such concurrency related issues in Big
Data, we deal with two types of parallelisms – Task and Data.
Task parallelism refers to the execution of computer programes
IM
throughout the multiple processors on different or same machines. It
emphasises on performing diverse operations in parallel to best utilise
the accessible computing resources like memory and processors.
One example for such parallelism would be an application creating
multiple threads for doing parallel processing with every thread re-
M

sponsible for performing a dissimilar operation.


Data parallelism focuses on effective distribution of datasets through-
out multiple calculation programs. Same parallel operations are exe-
cuted on multiple computing processors on the subset of the distrib-
N

uted data.
It is often dealt in normal programming languages under the syntax of
synchronous and asynchronous programming techniques which are
similarly implemented in Hadoop, with the use of Java.

2.6.5 DATA MINING AND ANALYTICS


Data mining is a process of data extraction, evaluating it from multiple
perspectives and then producing the information summary in a mean-
ingful form that identifies one or more relationships within the data-
set. Descriptive data mining gives information about existing data and
the patterns recorded within it; while predictive data mining, foretells
predictions based on the occurrence of a pattern of figures within the
dataset.
Data analysis is an experiential activity, where the data sourcing gives
out some insight. By looking at the dataset of a premium system vs.
a budget bound system, you can well say that while the initial cost is
higher in premium systems, operational faults and failures are less
likely to happen than those budgeted systems.

NMIMS Global Access - School for Continuing Education


60 Fundamentals of Big Data & Business Analytics

n o t e s

Data analytics is about applying an algorithmic or logical process to


derive insights from a given dataset. For example, looking at the past
year’s weather and pest data, for the current month, we can deter-
mine that a particular type of fungus grows often when humidity lev-
els reach a definite point.

2.6.6 DATA RETRIEVAL
Big Data refers to the large amounts of multi-structural data that con-
tinuously flows around and within the organisations, and includes
text, video, transactional records and sensor logs. Big Data systems
utilise the Hadoop and the HDFS architecture to retrieve the data us-
ing MapReduce - a distributed processing framework.
It helps programmers in solving parallel data problems where the
dataset can be divided into small chunks and handled autonomous-

S
ly. MapReduce is an important step as it allows normal developers to
utilise parallel programming concepts irrespective of cluster commu-
nication details, failure handling and task monitoring.
IM
MapReduce simplifies all that by splitting the input data-set into mul-
tiple portions, each assigned a map task to process the data parallelly.
Each map task takes the input (key, value) and creates a transformed
(key, value) output.
MapReduce uses TaskTracker and JobTracker mechanisms for task
M

scheduling and monitoring. HDFS keeps bulky data files by cutting


them into lots (64 or 128 MB) and copying the lots on more than three
servers. MapReduce applications use APIs provided by HDFS to par-
allelly read and write data. Performance and capacity can be met by
adding single NameNode and DataNodes and the mechanism manag-
N

es the data location and monitors server accessibility.


In addition to MapReduce and HDFS, Apache Hadoop includes many
other components, some of which are very useful for data retrieval
and extraction:
‰‰ Apache Flume: It is a distributed system for gathering, combining
and moving huge data from various sources into HDFS.
‰‰ Apache Sqoop: It is a tool for moving data between relational da-
tabases and Hadoop.
‰‰ Apache Hive and Pig: These are programming languages that
streamline the application development while retaining the Ma-
pReduce framework.

2.6.7 MACHINE LEARNING
Machine learning formally focuses on the performance, theory and
properties of learning algorithms and systems. Machine learning is
considered to be an ideal research field for taking advantage of the
opportunities available in Big Data.

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 61

n o t e s

It delivers on the potential of mining the value from huge and differ-
ent data sources with less dependence on human instructions. It is
data-driven and runs at machine scale and well-suited to the compli-
cation of dealing with different data sources and the enormous range
of variables and quantities of data involved. And in contrast to con-
ventional analysis, machine learning blooms on expanding datasets.
More data a machine learning system gets, more it learns and applies
the results to yield higher quality insights.

Machine learning systems utilise multiple algorithms to discover and


show the patterns hidden in the datasets. Most of them are the gradi-
ent-based algorithms, which are a form of algorithms used to optimise
the problems with a form f(x) with search instructions well-defined
by the gradient function at a current point. Following are examples of
algorithms:

S
‰‰ Logistic Regression
‰‰ Linear Regression
‰‰ Autoencoders
IM
‰‰ Neural Networks

Machine learning comprises a wider collection of algorithms, with


some being more efficient than others making it harder to select an
efficient algorithm without knowing about the dataset it will work on.
For example, Linear regression algorithm can be solved recursively
M

or with normal equations. The recursive process is much efficient for


datasets greater than the variable range of 10,000 because the normal
equation solution becomes tough to be solved in lesser time.

Let’s discuss a few machine learning methods that may prove to be


N

vital for solving the Big Data problems are discussed. These methods
do not focus on the algorithm logic only rather on the idea of learning:
‰‰ Representation learning: Datasets with multi-dimensional fea-
tures are becoming gradually more common nowadays, which
challenges the current learning algorithms to excerpt and man-
age the discerning information from the datasets. Representation
learning aims to achieve a rational size learned representation
that can capture many likely input configurations, and can pro-
vide improvements in both statistical efficiency and computational
efficiency.
‰‰ Deep learning: Unlike most learning techniques that use scarcely
designed learning styles, the deep learning technique uses con-
trolled and/or uncontrolled strategies in deep structures to learn
hierarchical representation automatically. Deep architectures
gather hierarchically launched statistical and complicated input
patterns for achieving adaptiveness for newer areas than outdat-
ed learning methods and frequently beat the state-of-the-art tech-
niques.

NMIMS Global Access - School for Continuing Education


62 Fundamentals of Big Data & Business Analytics

n o t e s

‰‰ Distributed and parallel learning: Learning from the massive


amounts of datasets and figuring out the meaning hidden beneath
those data behemoths can be exciting but a bottleneck occurs in
the form of incapability of algorithms to use all the data present in
a dataset to learn in a given time limit. This is where distributed
and parallel learning offers a capable solution since assigning the
learning process to several workplaces is an obvious way of im-
proving the efficiency of the system as well as the machine learn-
ing algorithms.
‰‰ Active learning: In real-world applications, data may be plenty
but labelling is scarce and expensive to be instantaneously ob-
tained. Also, learning from enormous quantities of raw data is time
consuming and difficult. Active learning deals this issue by picking
a subgroup of most critical occurrences for labelling. In this way,
the learner machine looks forward to achieving high precision by

S
using less labelled instances possible, thus curtailing the cost of
finding the labelled data.
IM
All the above forms of learning find a supportive library function in
Hadoop and HDFS file structure. Textual analysis, analytical tools
end up deploying a few of above learning techniques implicitly during
regular operations, which is further evaluated and later studied to fig-
ure out valuable insights offered by the automated learning. It is a
clear case of artificial intelligence coupled with Big Data and associat-
ed technologies and several developments in this field have only sup-
M

ported the overall machine learning narrative for corporations and


service providers.

2.6.8 DATA VISUALISATION
N

Data visualisation is a valuable means through which the larger data-


sets after being combined may appear practical, sensible and open to
most people. Data visualisation is a trailblazing method that not only
keeps you enlightened but helps other with the attributes of a typical
statistical and computational result that would’ve otherwise appeared
intimidating for normal minds.

Visual representation is often considered to be the most effective me-


dium of information and communication channel. As the saying goes,
a picture is worth thousand words, data visualisation is a great exam-
ple of that saying. When properly aligned, it can convey critical infor-
mation of data analysis in probably the easiest way possible.

Data visualisation should consist of the correct amount of communi-


cating quotient to be truly effective. They should be easy to use, well
designed, meaningful, understandable and approachable.

Typical data visualisation helps in:


‰‰ identifying the areas requiring improvement or attention

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 63

n o t e s

‰‰ clarifying the factors that affect consumer decision making or be-


haviour
‰‰ making you realise about the products popularity

Many conventional data visualisation methods are still popular for


imparting critical information in easier formats like histogram, table,
line chart, scatter plot, bar chart, area chart, pie chart, flow chart,
combination of charts, data flow diagram, Venn diagram and entity re-
lationship diagram. Besides, a few data visualisation approaches are
less known compared the above methods but still are used like tree
map, parallel coordinates, semantic network and cone tree.

Visualisation in Big Data can be achieved through numerous ap-


proaches such as greater than one view per illustrative display, active
changes in filtering and factor numbers (star-field display, dynamic

S
query filters and tight coupling). There are also a few standard prob-
lems for big data visualisation:
‰‰ Visual noise: Most dataset objects are too tightly coupled to each
IM
other making it tougher for users to divide them as distinct objects
on the screen.
‰‰ Information loss: Lessening of evident datasets often leads to in-
formation loss.
‰‰ High image change rate: Users simply observe the data and can-
not react to the data change or their intensity in real time on dis-
M

play.
‰‰ High performance necessities: Good data visualisation requires
a higher degree of efficient setup backed by scalable and robust
machines that are ready to churn out visualisation in high perfor-
N

mance environment.

According to the dataset criteria, following points are considered be-


fore planning a dataset evaluation: data volume, variety and dynamics.
Few popular forms of data visualisation are Treemap, Circle Packing,
Sunburst, Parallel Coordinates, Circular Network Diagram.

A number of visualisation tools are available on the Hadoop platform.


The common modules in Hadoop namely Hadoop Distributed File
System (HDFS), Hadoop Common, Hadoop YARN and MapReduce,
efficiently analyse the big data, but lack suitable visualisation. Some
software with the visualisation and interactive functions for the Big
Data have been developed and are given below:
‰‰ Pentaho: Supports the BI functions such as dashboard, analysis,
data mining and enterprise-class reporting.
‰‰ Flare: A library belonging to ActionScript for making data visual-
isation in Adobe Flash Player.
‰‰ JasperReports: It has a different software layer for producing vi-
sual reports from dataset storage.

NMIMS Global Access - School for Continuing Education


64 Fundamentals of Big Data & Business Analytics

n o t e s

‰‰ Platfora: It changes raw Big Data of Hadoop to interactive data


processing engine and has the segmental functionality of data en-
gine built in memory.

self assessment Questions

10. Each system acts as a client and a server, and to access virtual
machines, the data centre offers firewall-free, low-latency ISP
traffic. Such an approach is called ____________.
11. Parallelism is the execution of multiple threads concurrently
to complete a task in the shortest possible time. (True/False)

Activity

S
Research on different machine learning methods and find out
which methods and their algorithms are vital for solving Big Data
problems.
IM
2.7 SUMMARY
‰‰ Distributed computing works on the rules of the divide and con-
quer approach, performing modules of the parent tasks on multi-
ple machines and then combining the results.
M

‰‰ Parallel computing refers to the utilisation of a single CPU present


in a system or a group of internally coupled systems by the means
of efficient and clever multi-threading operations.
‰‰ Concurrency of a system is simply an operation of multiple threads
N

that execute on single or multiple processors.


‰‰ Distributed computing is considered to be the subset of parallel
computing, which further is the subset of concurrent computing.
‰‰ A Big Data system is vastly different from other solution provid-
ing systems and is based on the seven Vs, as described in previ-
ous chapter, namely: Volume, Velocity, Variety, Veracity, Variability,
Value and Visualisation.
‰‰ Hadoop is an open-source platform that provides analytical tech-
nologies and computational power required to work with such
large volumes of data.
‰‰ MapReduce is a framework that helps developers to write pro-
grams to process large volumes of unstructured data parallel over
a distributed architecture/standalone architecture which produc-
es result in a useful aggregated form.
‰‰ Hive is a data warehouse tool created by Facebook based on Ha-
doop and converts the query language into MapReduce jobs.
‰‰ Hbase is a Hadoop application running atop the HDFS.

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 65

n o t e s

‰‰ Pig is a high-level modular programming tool developed by Yahoo


in 2006 for streamlining huge data sets with the use of Hadoop and
MapReduce.
‰‰ Python is a popular interpreted, general-purpose, high-level dy-
namic programming language that aims to improve code readabil-
ity and overall ease of use and expression in fewer statements than
other competitive languages such as C++ or Java.
‰‰ R is an open source programming language and an application en-
vironment for statistical computing with graphics, developed by
R Foundation for Statistical Computing. It is an interpreted lan-
guage like Python and uses a command line interpreter.
‰‰ One of the vital issues that organisations face with the storage and
management of Big Data is the huge amount of investment to get
the required hardware setup and software packages.

S
‰‰ Cloud computing makes it possible for organisations to dynami-
cally regulate the use of computing resources and access them as
per the need while paying only for those resources that are used.
IM
‰‰ The in-memory Big Data computing tool supports the processing
of high velocity data in real time and also faster processing of the
stationary data.
‰‰ Massive parallelism refers to a parallel system where multiple sys-
tems are interconnected with each other pose as a single mighty
M

conjoint processor and carry out the tasks received from the data
sets parallelly.
‰‰ Distribution of data is a highly critical step in a typical Big Data
setup.
N

‰‰ High-performance computing is used for performing research ac-


tivities and cracking advanced problems through computer simu-
lation, modelling and analysis.
‰‰ Task parallelism refers to the execution of computer programs
throughout multiple processors on different or same machines. It
emphasises performing diverse operations in parallel to best uti-
lise accessible computing resources like memory and processors.
‰‰ Data mining is a process of data extraction, evaluating it from mul-
tiple perspectives and then producing the information summary in
a meaningful form that identifies one or more relationships within
the dataset.
‰‰ Machine leaning formally focuses on the performance, theory and
properties of learning algorithms and systems. Machine learning
is considered to be an ideal research field for taking advantage of
the opportunities available in Big Data.
‰‰ Data visualisation is a valuable means through which larger data-
sets after being combined may appear practical, sensible and open
to most people.

NMIMS Global Access - School for Continuing Education


66 Fundamentals of Big Data & Business Analytics

n o t e s

key words

‰‰ Hadoop distributed file system (HDFS): It is a fault-tolerant


storage system in Hadoop.
‰‰ Hive: A data warehouse tool created by Facebook based on Ha-
doop that converts a query language into MapReduce jobs.
‰‰ MapReduce: It is a framework that helps developers to write
programs to process large volumes of unstructured data over a
distributed architecture/standalone architecture which produc-
es results in a useful aggregated form.
‰‰ Object Oriented Programming (OOP): A paradigm where data
is encompassed within an object and carries several heuristic
properties.

S
‰‰ Pig: Pig is a high-level modular programming tool developed by
Yahoo for streamlining huge data sets with the use of Hadoop
and MapReduce.
IM
‰‰ Python: It is a popular interpreted, general-purpose, high-lev-
el dynamic programming language that aims to improve code
readability and overall ease of use and expression in fewer state-
ments than other competitive languages such as C++ or Java.
‰‰ R: It is an open source interpreted programming language
and an application environment for statistical computing with
M

graphics, developed by R Foundation for Statistical Computing.


‰‰ Solid State Drives (SSD): Such storage drives have no me-
chanical components and higher read/write rates that result in
less wear or tear and robust performance.
N

2.8 DESCRIPTIVE QUESTIONS


1. Differentiate between parallel and distributed computing.
2. Explain the concept of Hadoop in Big Data.
3. What do you understand by cloud computing? Also, discuss its
three basic types of services.
4. Describe the concept of in-memory technology for Big Data.
5. Enlist and explain different types of Big Data techniques.

2.9 ANSWERS AND HINTS

ANSWERS FOR SELF-ASSESSMENT QUESTIONS

Topic Q. No. Answers


Distributed and Parallel 1. Distributed computing
Computing for Big Data

NMIMS Global Access - School for Continuing Education


Technologies for Handling Big Data 67

n o t e s

Topic Q. No. Answers


2. Concurrency
3. True
Introduction to Big Data 4. False
Technologies
5. b. Hadoop
Cloud Computing and Big 6. c. Hybrid cloud
Data
7. False
In-Memory Technology for 8. In-memory big data Computing
Big Data
9. True

S
Big Data Techniques 10. P2P
11. False
IM
ANSWERS FOR DESCRIPTIVE QUESTIONS
1. The distributed computing is basically multiple processors
interconnected by communication links as opposed to parallel
computing models which usually work on shared memory (but
not always). Refer to Section 2.2 Distributed and Parallel
Computing for Big Data.
M

2. Hadoop is an open-source platform that provides analytical


technologies and computational power required to work with
such large volumes of data. Refer to Section 2.3 Introduction to
Big Data Technologies.
N

3. Cloud computing makes it possible for organisations to


dynamically regulate the use of computing resources and access
them as per the need while paying only for those resources that
are used. Refer to Section 2.4 Cloud Computing and Big Data.
4. The in-memory Big Data computing tool supports the processing
of high velocity data in real-time and also faster processing of the
stationary data. Refer to Section 2.5 In-Memory Technology for
Big Data.
5. To analyse datasets, there are many Big Data techniques
available. Refer to Section 2.6 Big Data Techniques.

2.10 SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
‰‰ Wadkar, S., Siddalingaiah, M., &Venner, J. (2014). Pro Apache Ha-
doop. Berkeley, CA: Apress.
‰‰ White, T. (2011). Hadoop: the definitive guide. Sebastopol, CA:
O’Reilly.

NMIMS Global Access - School for Continuing Education


68 Fundamentals of Big Data & Business Analytics

n o t e s

E-REFERENCES
‰‰ Welcome to Apache™ Hadoop®! (n.d.). Retrieved April 22, 2017,
from http://hadoop.apache.org/
‰‰ What is Hadoop? (n.d.). Retrieved April 22, 2017, from https://www.
sas.com/en_us/insights/big-data/hadoop.html
‰‰ Hadoop& Big Data.(n.d.). Retrieved April 22, 2017, from https://
mapr.com/products/apache-hadoop/

S
IM
M
N

NMIMS Global Access - School for Continuing Education


C h a
3 p t e r

Basics of Business Analytics

CONTENTS

S
3.1 Introduction
3.2 Introduction to Business Analytics
IM
Self Assessment Questions
Activity
3.3 Types of BA
Self Assessment Questions
Activity
3.4 Business Analytics Model
M

3.4.1 SWOT Analytical Model


3.4.2 PESTLE or PEST Analytical Model
Self Assessment Questions
Activity
N

3.5 Importance of Business Analytics


Self Assessment Questions
Activity
3.6 What is Business Intelligence (BI)?
Self Assessment Questions
Activity
3.7 Relation between BI and BA
Self Assessment Questions
Activity
3.8 Emerging Trends in BI and BA
Self Assessment Questions
Activity
3.9 Summary
3.10 Descriptive Questions
3.11 Answers and Hints
3.12 Suggested Readings & References

NMIMS Global Access - School for Continuing Education


70 Fundamentals of Big Data & Business Analytics

Introductory Caselet
n o t e s

AMNESTY INTERNATIONAL

Amnesty International is a worldwide programme that includes


over seven million crusaders who fight for a free world with equal
human rights for all. Being a non-profit institution, the organi-
sation has to rely on different donors and contributors, who get
to know about campaigns through activities, such as street fund-
raising, telephone outreach, petitions and mailers. When donors
are involved, it is important to create a long-lasting relationship
with them. Like many non-profits, Amnesty International has a
Customer Relationship Management (CRM) system to make the
relationship life-cycle last longer. The organisation also required
performance improvement using contemporary data analytics
procedures.

S
THE CHALLENGE

Around four years back, with the help of its in-house fund-
IM
raising consultants, Amnesty International started seeking an
analytics software to work parallel to the existing CRM systems.
The fund-raising consultants are responsible for gathering funds
and managing various kinds of donors. They are also required to
measure the donors’ sentiments and interests based on multiple
inputs, such as various parameters and participatory ratios. For
M

such measurements, they were dependent on programmers for


analysing customers, directing specific campaigns at them based
on their interactions and contributions to the campaign and the
organisation. It was a tedious exercise and not always accurate.
There were regular gaps between the requirements consultants
N

asked for and what they were delivered.

THE SOLUTION

Based on the inputs gained from the consultants, Amnesty In-


ternational finalised an analytics tool with easy drag-and-drop
interface to carry out the analytics processes as envisaged by the
consultants.

The analytical tool was integrated with the CRM. Thus, using the
contemporary analytics software with CRM database became eas-
ier, making the reporting features much more robust. Of course,
as a human rights organisation, Amnesty International performs
all data analytics in obedience with privacy rules and protective
integrity.

NMIMS Global Access - School for Continuing Education


Basics of Business Analytics 71

n o t e s

learning objectives

After studying this chapter, you will be able to:


>> Describe business analytics and its types
>> Explain business analytics model
>> Recognise the importance of business analytics
>> Elucidate the concept of Business Intelligence (BI)
>> Describe the relation between BI and BA
>> Identify the emerging trends in BI and BA

3.1 INTRODUCTION
The word ‘Analytics’ has multiple meanings and is open to interpreta-

S
tion for business and marketing professionals. This term is used dif-
ferently by experts and consultants in almost a similar fashion. Ana-
lytics, as per the definition of the business dictionary, is anything that
IM
involves measurement – a quantifiable amount of data that signifies a
cause and warrants an analysis that culminates into resolution.

This chapter discusses about Business Analytics and its types. Next,
the chapter discusses about Business Analytics (BA) model. This chap-
ter further discusses about importance of Business Analytics. Further,
this chapter discusses about the concept of Business Intelligence (BI)
M

and its relation with business analytics. In the end, this chapter dis-
cusses about emerging trends of BI and BA.

INTRODUCTION TO BUSINESS
3.2
N

ANALYTICS
Business Analytics is a group of techniques and applications for stor-
ing, analysing and making data accessible to help users make better
strategic decisions. Business Analytics is a subset of Business Intel-
ligence, which creates competences for companies to contest in the
market efficiently and is likely to become one of the main functional
areas in most companies (More on BI later in this chapter).

Analytics companies develop the ability to support decisions through


analytical perception. The analytics certainly influence the business
by acquiring knowledge that can be helpful to make enhancements
or bring change. Business Analytics can be segregated into many
branches. Say, for a sales and advertisement company, marketing an-
alytics are essential to understand about which marketing tactics and
strategies clicked with the customer and which didn’t. With perfor-
mance data of marketing branch in hand, Business Analytics become
an essential way for measuring the overall impact on the organisa-
tion’s revenue chart. These understandings direct the investments in
areas like media, events and digital campaigns. These allow us to un-

NMIMS Global Access - School for Continuing Education


72 Fundamentals of Big Data & Business Analytics

n o t e s

derstand customer results clearly, such as lifetime value, acquisition,


profit and revenue driven by our marketing expenditure.

self assessment Questions

1. Business analytics is a subset of business analysis. (True/


False)
2. Analytics companies develop the ability to support decisions
through ______ perception.

Activity

How can business analytics bring a change for a newspaper hawk-


er? Think it out.

S
3.3 TYPES OF BA
IM
Going by the linguistic definition purely, there may be multiple elu-
cidations of the term BA. However, in practical terms, there are four
types of BA that help an organisation in gauging out the customer
sentiments and then take respective actions:
‰‰ Descriptive analysis: It refers to “What is happening?” or “What
M

happened?” type analytics based on incoming data. Such analyt-


ics is better studied by the dashboards and reports. Like, a coffee
shop experiencing heavy rush on a day they least expected and are
ill-prepared to do anything about it.
‰‰ Diagnostic analysis: It refers to analysis of the past figures and
N

facts to derive the scenarios about what happened and why it hap-
pened. The result of this analysis is often a pre-defined reporting
structure, such as root cause analysis (RCA) report. For example,
a root cause analysis may help in finding out the factors which the
above coffee shop owners fail to read and comprehend.
‰‰ Predictive analysis: It refers to analysis of probabilities. Predic-
tive analysis tries to forecast on the basis of previous data and sce-
narios. For example, a hotel chain owner might ramp down pro-
motional offers during a restive season of rains in a coastal area.
This is based on the predictions that there is going to be fewer
footfalls due to heavy rain.
‰‰ Prescriptive analysis: This analysis type tells you about the ac-
tions you should take. This is the most essential analysis type and
typically forms the standards and recommendations for the next
phase. For example, a doctor prescribes medicines to the patient
after researching, studying, evaluating and diagnosing the cause
of pain or irritation with the patient. Similarly, organisations too,
after drawing out the statements, resultants, conclusions and oth-

NMIMS Global Access - School for Continuing Education


Basics of Business Analytics 73

n o t e s

er factors will take a step in ensuring that the factors affecting the
growth charts positively continue to exist, whereas the damaging
factors stay out of their future prospects.

self assessment Questions

3. A software firm has roped in a consultant to study the financial


leaks happening in their billing system. This is the example of
_________.
4. A company needs to launch their new product, but is on a
limited marketing budget, and needs to figure out the best
possible market response with a minimum investment. The
_________ analytics should help the company with studying
the market response.

S
Activity
IM
Is there any other analysis type you can think of other than above
four models? What would it be?

3.4 BUSINESS ANALYTICS MODEL


BA frequently utilises numerous quantitative tools to convert big data
M

into meaningful information for making sound business moves. These


tools can be further categorised into tools for data mining, operations
research, statistics and simulation. Statistics for instance, can be help-
ful in gathering, articulating and understanding big data as part of
N

descriptive analytical model.

A BA model assists organisations in making a move which yields fruit-


ful results.

Here we will discuss two most commonly used analytical models by


the analysts across the globe as a standard analysis factor – SWOT
and PESTEL analysis.

3.4.1 SWOT ANALYTICAL MODEL

SWOT analysis is amongst the most popular method of gauging the


organisational and corporate nerve of an organisation. SWOT stands
for Strengths, Weaknesses, Opportunities, Threats.

As evident from the abbreviation, an organisation uses SWOT analy-


sis to figure out its greatest extremes – strengths to which it can stand-
by even in toughest of times, weaknesses that may lead it to a certain
failure even in the greener pastures, opportunities that may help in
realising the organisation’s full potential and finally the threats to the

NMIMS Global Access - School for Continuing Education


74 Fundamentals of Big Data & Business Analytics

n o t e s

businesses that may end up exploiting its weaknesses and may turn its
strengths into weakness. Figure 3.1 shows the SWOT diagram:

S
IM
Figure 3.1: The SWOT Diagram
Source: https://s-media-cache-ak0.pinimg.com/736x/88/b0/1a/88b01aa805648a30 4c0a3bbd-
954c1a5e.jpg
M

SWOT is often considered as a 360-degree tool to measure the pulse


and vitals of an organisation. Businesses that have been in market
for long should conduct SWOT analysis periodically to evaluate the
impact of the changing situations in the market, getting around the
newer business models and respond actively.
N

On the other hand, new starters should include SWOT as their plan-
ning process. SWOT is not necessarily a pan-organisation-based pro-
cess; rather each of the organisation’s departments can have their
own dedicated SWOT, such as Marketing SWOT, Operational SWOT,
Sales SWOT, etc.

A great example of benefits of SWOT analysis could be the turn-


around of modern day’s largest company of the world by valuation
– Apple Inc. Apple was incorporated in 1995 after a long battle with
the existing stakeholders who had control over the shares and stocks.
Post return to the computing market, facing a mighty challenger in
Microsoft, Apple didn’t take them head-on as most would’ve expect-
ed. Rather, it realised the opportunities and laid back on the threats
part since they had ‘nothing to lose’. Apple identified opportunities in
newer areas of the technology, while the world was busy hailing com-
puters as the lone IT revolution torch-bearer.

NMIMS Global Access - School for Continuing Education


Basics of Business Analytics 75

n o t e s

3.4.2 PESTLE OR PEST ANALYTICAL MODEL

PESTLE stands for Political, Economic, Social, Technological, Legal


and Environmental. PESTLE analysis is a method for figuring out ex-
ternal impacts on a business. In some countries, legal and environ-
mental parts are combined in the social, political and economic part.
Hence they use PEST.

PEST analysis is an examination of the external environment in which


an organisation currently exists or is going to enter. It is a handy tool
for understanding the economic, socio-cultural, political and techno-
logical environment that an organisation functions in. The sample
PEST analysis is shown in Figure 3.2:

S
IM
M

Figure 3.2: PEST Analysis


N

Source: https://www.smartdraw.com/pest-analysis/

Following is how PEST can reign in as an effective analytical charter


ready for most organisations:
• Political factors: Government regulations in different countries related to employment, tax, environment, trade and government stability.
• Economic factors: Factors that affect the purchasing power and cost of capital of a corporation, such as economic growth, inflation, currency exchange and interest rates.
• Social factors: Factors that influence consumer requirements and the possible market size for an organisation's products and services, including age demographics, population growth and healthcare.
• Technological factors: Factors that influence barriers to entry and investment decisions related to buying and innovation, such as investment incentives, automation and the adaptability quotient for the technology.


PEST factors can also be categorised as threats or opportunities in SWOT analysis, so it is ideal to complete a PEST analysis before SWOT. Note also that the four components of the PEST model vary in meaning depending on the type of business. For example, social factors matter more to a consumer-oriented business at the customer's end of the supply chain, whereas political factors weigh more heavily on an aerospace manufacturer or a defence-contracting firm.
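To make this mapping concrete, the following minimal Python sketch shows how PEST findings, being external factors, land only in the opportunities and threats quadrants of SWOT. All factor descriptions here are hypothetical examples, not taken from the text.

```python
# Hypothetical PEST findings, each tagged with the SWOT quadrant it feeds.
pest_findings = [
    ("Political",     "new import tariff on components",         "threats"),
    ("Economic",      "falling interest rates cut capital cost", "opportunities"),
    ("Social",        "ageing demographic shrinks the segment",  "threats"),
    ("Technological", "automation incentives lower entry cost",  "opportunities"),
]

# PEST factors are external, so they map only onto the O and T quadrants.
swot = {"strengths": [], "weaknesses": [], "opportunities": [], "threats": []}
for category, description, quadrant in pest_findings:
    swot[quadrant].append(f"{category}: {description}")

for quadrant, items in swot.items():
    print(f"{quadrant.title()}: {items}")
```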

self assessment Questions

5. SWOT stands for ________________________.


6. SWOT is often considered as a 360-degree tool to measure the
pulse and vitals of an organisation. (True/False)

Activity

1. Do an honest SWOT of Big Data so far.
2. Can a strength identified in SWOT be a political challenge in
PEST? Support your answer with an example.

3.5 IMPORTANCE OF BUSINESS ANALYTICS



The need for analytics arises from our basic day-to-day life. An average person has to analyse the time available from getting out of bed to getting ready to leave for the office, so as to reach on time in a relaxed manner. That also includes analysing the best possible route to avoid traffic and save more time, in order to have an extra cup of coffee for the day! Evidently, even a ballpark analysis of daily life often yields results assuring us that analytics is an efficient way of measuring and tracking results periodically.

This is even more true for a business. BA helps organisations:
• To understand leads, audience, prospects and visitors
• To understand, improve and track the methods used to impress and convert the first lead or prospect into a valuable customer

Significance of BA:
• To get insights into customer behaviour: The prime advantage of investing in BI software and expertise is that it increases your ability to examine present customer-purchasing trends (see the sketch after this list). Once you know what your customers are ordering, this information can be used to create products matching present consumption trends and thus improve your profitability, since you can now attract more valued consumers.


• To improve visibility: BA helps you reach a vantage point amid organisational complexities from which you have better visibility of the processes, making it likely that you will recognise any parts requiring a fix or improvement.
• To convert data into worthy information: A BI system is a logical tool that can equip you to make successful strategies for your corporation. Since such a system identifies patterns and key trends in your corporation's data, it makes it easier to connect the dots between points of your business that may otherwise seem disconnected. Such a system also helps you better comprehend the inferences drawn from multiple structural processes and increases your ability to recognise the right opportunities for your organisation.
• To improve efficiency: One critical reason to consider a BI system is the increase in organisational efficiency, leading to increased productivity. BI helps in sharing information across multiple channels in the organisation, saving time on reporting analytics and processes. This ease of sharing information reduces redundancy of duties or roles within the organisation and improves the precision and practicality of the data produced by different divisions.
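As a hedged illustration of the first point above, the following Python sketch shows one way of surfacing customer-purchasing trends with pandas. The file "orders.csv" and its columns (order_date, product, amount) are assumptions for illustration, not a real dataset from the text.

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Revenue per product per quarter, then quarter-on-quarter growth.
orders["quarter"] = orders["order_date"].dt.to_period("Q")
revenue = orders.pivot_table(index="product", columns="quarter",
                             values="amount", aggfunc="sum").fillna(0)

# Compare the two most recent quarters (assumes non-zero prior revenue).
prev_q, curr_q = revenue.columns[-2], revenue.columns[-1]
revenue["growth_pct"] = 100 * (revenue[curr_q] - revenue[prev_q]) / revenue[prev_q]

# Products with the fastest-growing consumption trend come out on top.
print(revenue.sort_values("growth_pct", ascending=False).head(10))
```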

Consider a typical website that relies on visitor footfall and subsequent click-based advertising revenues. Such an organisation needs analytics more often than organisations that run a dedicated brick-and-mortar business and use their website only for marketing purposes.

BA is an important area that equips you with the correct weapons to make the correct business decisions. For example, if you already expect some turmoil in one of your business sections, you can do a SWOT of that section and positively influence the overall outcome. Here, BA not only helps you retain a section full of customers, but also helps you avoid a future conflict of a similar nature. BA arms you with situational arsenals: you get a machine gun in the form of viral marketing campaigns when you are targeting a mass audience for a given product, whereas in the case of customer withdrawal or ramp-up, you have your sniper ready to target them specifically.

self assessment Questions

7. A BI system is a ____ tool that can equip you to make successful strategies for your corporation.
8. Business intelligence helps in sharing information across
multiple channels in the organisation, saving time on reporting
analytics and processes. (True/False)


Activity

Prepare a report on a case where a business gained effectively from SWOT analysis.

3.6 WHAT IS BUSINESS INTELLIGENCE (BI)?


Business Intelligence (BI) is a set of applications, technologies and best practices for the collection, integration and presentation of business information and its analysis. The goal of BI is to facilitate improved decision-making for businesses.

BI utilises computing techniques for the discovery, identification and analysis of business data, such as products, sales revenue, earnings and costs.

BI models provide past, present and predictive views of structured internal data for products and departments. BI provides effective strategic and operational insights and helps decision-making through predictive analytics, reporting, benchmarking, data/text mining and business performance management.

Common applications of BI:
• Performance and benchmarking measurement, and overall progress tracking towards achieving business goals
• Quantifiable analysis with the help of predictive modelling, analytics, statistical analysis and business process modelling
• Joint plans allowing internal and external business units to cooperate through data sharing and electronic data interchange
• Usage of knowledge management programmes to recognise and build insights and skills for regulatory compliance and learning management

BI also includes explicit practices and procedures for applying interactive data-amassing techniques, like:
• Examining the organisations and institutions
• Selecting and preparing interview candidates
• Creating and developing interview questions based on the subject
• Preparing and lining up the interviews

BI-based solutions are most apt for industries with a huge customer base, higher competition levels and massive data volumes. Some of the exclusive BI functions include the following:
• Examining sales trends


• Following customer-purchasing habits
• Handling finances
• Assessing sales and advertising campaign efficiency
• Forecasting market demand
• Examining vendor dealings
• Evaluating staffing requirements and performance

self assessment Questions

9. Business intelligence does not utilise existing computing techniques for the discovery, identification and analysis of business data. (True/False)
10. _________-based solutions are most apt for industries with a huge customer base, higher competition levels and massive data volumes.
IM
Activity

How can an election campaign benefit from BI? Make a case study
on it.

3.7 RELATION BETWEEN BI AND BA


BI is an umbrella in the broader sense that encompasses everything under it, like data analytics and visualisation, which also includes BA. BA is a subset of BI. BI at the root level is the skill of converting business data into knowledge to aid the decision-making process. The conventional method of doing this includes logging and probing past data and using the overall outcome of the reading as the standard for setting future benchmarks.

BA emphasises data usage to gain new insights, while conventional BI uses a constant, recurring set of metrics to drive strategies for future business on the basis of historical data. If BI is the method of logging the past, BA is the method of dealing with the present and forecasting the future.

The Evolution of BI vs. BA

Earlier, BI was utilised to describe the people, procedures and applications used to access and infer meaning from information, for enhancing choices and understanding the effectiveness of focused choices. The quick development of BA originates from this flaw, and it is in a way the advanced type of BI solution. In a business world moving with ever-increasing speed, the user should have the capacity to work

with data at the speed of business. An information-driven organisation sees its information as an asset, and leverages it to outperform rivals. The more information the client has, the better the lead he or she has over a competitor who could well turn into a threat.

An ever-increasing number of individuals are being asked to interpret information in roles that are not entirely analytical. With the significance of information-driven choices progressively gaining acknowledgement in less data-literate branches of the organisation, the requirement for easier-to-use and quicker platforms grows. In addition, diagrams and charts indicating BA conclusions are faster and more effective than written metrics and spreadsheets overladen with information.

The difference between BI and BA is that BI equips you with the in-
formation whereas BA gives you the knowledge.

With the help of BA, you get to know the pain points of your business: your product's standing in the market, the business strengths that put you ahead of the competition and the opportunities which you are yet to explore. BA helps you know your business thoroughly. BI helps in bridging the gap between ground reality and the management perspective on a pan-organisational basis.

BI helps you in compounding your strong points collectively, weeding out weaknesses in an efficient manner and managing the organisational business more efficiently. It helps you capitalise on the lessons learned from the BA findings about the organisation. Table 3.1 shows the differences between BI and BA:

Table 3.1: Differences between BI and BA

1. BI: Uses current and past data to optimise present-day performance for success.
   BA: Utilises past data, and separately analyses current data with past data as reference, to prepare the business for the future.
2. BI: Informs about what happened.
   BA: Tells why it happened.
3. BI: Tells you the sales numbers for the first quarter of a fiscal year, or the total number of new users signed up on your platform.
   BA: Tells you why your sales numbers tanked in the first quarter, or how effective the newly launched campaign was at making users refer other users to your platform.
4. BI: Quantifiable in nature; it can help you measure your business through visualisations, charts and other data-representation techniques.
   BA: More subjective, open to interpretation and prone to change due to ripples in the organisational or strategic structure.
5. BI: Studies the past of a company and ponders what could have been done better in order to have more control over the outcomes.
   BA: Predicts the future based on the learning gained from past, present and projected business models for a given term in the near future.


Another new trend is the ability to combine multiple data projects in one, while making the result useful in sales, marketing and customer support. That concept is realised in CRM (Customer Relationship Management) software, which sources raw data from every division and department and compiles it into a new understanding that otherwise would not have been visible from any one point alone.

All this boils down to the interchangeable usage of the terms "business intelligence" and "business analytics" and their importance in managing the relationship between business managers and data. As a result of such accessibility, owners and managers now need to be more familiar with what data is capable of doing and how they must actively produce data to create lucrative future returns. The significance of the data hasn't changed; its availability has.

self assessment Questions

11. BI at the _____ level is the skill of converting business data into knowledge to aid the decision-making process.
12. BA emphasises data usage to gain new insights, while conventional BI uses a constant, recurring set of metrics to drive strategies for future business on the basis of historical data. (True/False)

Activity

Create a case study on an election campaign for a new party using a BA system and compare the outcomes with those of a BI system.

3.8 EMERGING TRENDS IN BI AND BA


Following are the contemporary trends in the BI and BA fields:
• More power and monetary impact for data analysts: Analysts are consistently topping demand charts across many industries, thanks to the demand-driven analytics bandwagon that has made the industry take cognizance of data analysts and led to a spike in related roles, like Information Research Scientists and Computer Systems Analysts.
• Location analytics: Another major business driver in 2016 was location and geospatial analytical tools, which give organisations better market intelligence and placement in terms of effective campaigns (for example, a company aiming geocentric campaigns at specific customers).
• Data at the rough edge: Businesses must look beyond the usual data sources inside their data centres, since data flows now originate outside them, from multiple sensor devices and servers, e.g. a spatial satellite or an oil rig at sea.
• Artificial Intelligence (AI): This is a top trend as per multiple studies, with scientists aiming to build machines that achieve what complex human reflexes and intelligence do. Analytical work on such programmes is growing exponentially, with AI and machine learning transforming the way we interact with analytics and data management.
• BI Centre of Excellence (CoE): Moving to a simpler, more secure and effective BI strategy isn't entirely the onus of IT. The difficulty of data management in huge companies is astounding, and the need to strengthen it is becoming important. A growing number of organisations are opting for BI and Analytics CoEs to support the implementation of self-serviced analytics. These CoEs will have a great role in applying an information-driven culture and getting the maximum advantage from a BI solution. Through mediums like virtual forums and training, the CoEs will enable even laymen to include data in their decision-making strategy. It is quite an efficient way of aligning skilled people, processes and technology in a structured manner at one place.
• Predictive analytics and impact on data discovery: By gathering more information, organisations will have the capacity to build more detailed visual models that help them act in more accurate ways. For instance, better information models show organisations more about what clients are purchasing, and even what they are likely to purchase in future (see the sketch after this list). From CRM to sales and marketing deals, predictive analytics and cutting-edge BI are set to bring disruption.

• Cloud computing: Cloud computing is being absorbed into many systems and will continue to grow. We have witnessed the division of the cloud into multiple vendor systems, and many companies are utilising cloud services to host powerful data analytics tools. A lot of customers are already using Microsoft Azure and Amazon Redshift, along with cloud resources that provide flexible handling and scalability for data.
• Digitisation: This is the process of turning any analogue image, sound or video into a digital format understandable by electronic devices and computers. Such data is usually easier to store, fetch and share than the raw original format (e.g. turning a tape recording into a digital song). The gains from digitising data-intensive processes are great, with up to 90% cost cuts and much faster turnaround times than before. Creating and utilising software over manual processes allows businesses to gather and screen data in real time, which helps managers tackle issues before they turn critical.
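As a hedged illustration of the predictive analytics trend mentioned above, the following Python sketch fits a simple linear model (scikit-learn) on past quarterly sales and projects the next quarter. All numbers are fabricated for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

quarters = np.arange(1, 9).reshape(-1, 1)                   # Q1..Q8
sales = np.array([110, 118, 125, 140, 151, 158, 170, 183])  # hypothetical units

# Fit the historical trend and extrapolate one quarter ahead.
model = LinearRegression().fit(quarters, sales)
forecast = model.predict(np.array([[9]]))[0]
print(f"Projected Q9 sales: {forecast:.0f} units")
```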


self assessment Questions

13. CoE stands for:


a. Centre for Excellence b. Centre of Excellence
c. Centre of Excel d. None of these
14. _________ is a process of turning any analogue image, sound,
video into a digital format understandable by the electronic
devices and computers.

Activity

Which trend do you think will emerge next in the BI and BA fields? Discuss.

3.9 Summary
• Business Analytics is a group of techniques and applications for storing, analysing and making data accessible to help users make better strategic decisions.
• Analytics certainly influences the business by acquiring knowledge that can be helpful in making enhancements or bringing changes.
• In diagnostic analysis, past figures and facts are analysed to derive scenarios about what happened and why it happened.
• Business analytics frequently utilises numerous quantitative tools to convert big data into meaningful contexts valuable for making sound business moves.
• PESTLE stands for Political, Economic, Social, Technological, Legal and Environmental: a method for figuring out numerous external impacts on a business.
• Business Intelligence (BI) is the set of applications, technologies and best practices for the integration, collection and presentation of business information and its analysis.

key words

• Business analytics: A subset of Business Intelligence, which creates competencies for companies to compete in the market efficiently.
• PEST analysis: An examination of the external environment in which an organisation currently exists or is going to enter.


• Predictive analysis: A kind of analysis that is based on probabilities.
• Prescriptive analysis: A kind of analysis that tells you what actions you should take.
• SWOT: Stands for Strengths, Weaknesses, Opportunities and Threats.

3.10 DESCRIPTIVE QUESTIONS


1. Discuss the concept of BA.
2. Enlist and explain different types of BA.
3. Explain the different analytical models with the help of real-time examples.
4. Discuss the importance of BA with suitable examples.
5. Describe the importance of BI.
6. Discuss the evolution and relation between BA and BI.

3.11 ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS



Topic: Q. No. and Answer
Introduction to Business Analytics: 1. False; 2. Analytical
Types of BA: 3. Diagnostic; 4. Predictive
Business Analytics Model: 5. Strengths, Weaknesses, Opportunities, Threats; 6. True
Importance of Business Analytics: 7. Logical; 8. True
What is Business Intelligence (BI)?: 9. False; 10. Business Intelligence (BI)
Relation between BI and BA: 11. Root; 12. True
Emerging Trends in BI and BA: 13. b. Centre of Excellence; 14. Digitisation


HINTS FOR DESCRIPTIVE QUESTIONS


1. Business Analytics is a group of techniques and applications for
storing, analysing and making data accessible to help users make
better strategic decisions. Refer to Section 3.2 Introduction to
Business Analytics.
2. There are four types of BA that help an organisation in gauging
out the customer sentiments and then take respective decisive
actions. Refer to Section 3.3 Types of BA.
3. The two analytical models most commonly used by analysts across the globe as standard analysis frameworks are SWOT and PESTLE. Refer to Section 3.4 Business Analytics Model.
4. BA helps you in getting to a vantage point amid organisational complexities where you can have better visibility of the processes, making it likely that you will recognise any parts requiring a fix or improvement. Refer to Section 3.5 Importance of Business Analytics.
5. Business Intelligence (BI) is the set of applications, technologies
and ideal practices for the integration, collection, presentation of
business information and analysis. Refer to Section 3.6 What is
Business Intelligence (BI)?
6. BA and BI are two of the most interchangeably used terms, but they are rarely explained in a way that does not leave the end-user in a vaguer position than before. Refer to Section 3.7 Relation between BI and BA.

3.12 SUGGESTED READINGS & REFERENCES



Suggested Readings
• Liebowitz, J. (2013). Big data and business analytics. Boca Raton (FL): CRC Press.
• Laursen, G. H., & Thorlund, J. (2017). Business analytics for managers: Taking business intelligence beyond reporting. Hoboken, NJ: John Wiley & Sons, Inc.

E-References
• What is big data analytics? Definition from WhatIs.com. (n.d.). Retrieved April 25, 2017, from http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics
• What is business analytics (BA)? Definition from WhatIs.com. (n.d.). Retrieved April 25, 2017, from http://searchbusinessanalytics.techtarget.com/definition/business-analytics-BA
• Monnappa, A. (2017, March 24). Data Science vs. Big Data vs. Data Analytics. Retrieved April 25, 2017, from https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article

Chapter 4

Resource Considerations to Support Business Analytics

CONTENTS

4.1 Introduction
4.2 What is Data, Information and Knowledge?
    Self Assessment Questions
    Activity
4.3 Business Analytics Personnel and their Roles
    Self Assessment Questions
    Activity
4.4 Required Competencies for an Analyst
    Self Assessment Questions
    Activity
4.5 Business Analytics Data
    Self Assessment Questions
    Activity
4.6 Ensuring Data Quality
    Self Assessment Questions
    Activity
4.7 Technology for Business Analytics
    Self Assessment Questions
    Activity
4.8 Managing Change
    Self Assessment Questions
    Activity
4.9 Summary
4.10 Descriptive Questions
4.11 Answers and Hints
4.12 Suggested Readings & References


Introductory Caselet

Challenges faced by a Cloud Service Provider

A corporation, XYZ Inc., based outside India, delivers managed IT operations, hosted applications and cloud-based services to business enterprises across the globe. It has earned great ratings for its brilliant service and customer care, thanks to its inclusive Service Level Agreements (SLAs) and consistent focus on improving the customer service experience.

XYZ Inc. provides its consumers a private, tailor-made cloud infrastructure to execute important applications, with the help of the latest cutting-edge tools, which help the company look after customer needs while reducing management and system complications.

S
Along with a zero-acceptance policy for downtime, max data se-
curity is another core focus area of the company for which it has
two network connected data centers in metro cities working in
IM
tandem with the first data center deputed as a backup/failover
recovery with other data center to create a secure and reliable
disaster recovery solution.

The organisation faces many of the challenges that other corporate IT organisations experience nowadays, like availability, reliability, agility, security and shoestring-budget concerns, as do cloud and hosting service providers. Being a managed service provider, XYZ Inc. must abide by tighter SLAs than most organisations deliver to their internal customers.

Being an organisation with products and services of this range, XYZ certainly faces some challenges, as described here:
• Growing operational productivity: XYZ needs to ensure unified deployment of ongoing operations and customer applications despite ever-augmenting resource requirements from new and prevailing customers.
• Dropping operational expenses: The company has to reduce costs in order to remain competitive while managing all sorts of non-revenue-linked maintenance sources.
• Guaranteeing high accessibility: Reliable disaster recovery, high availability and complete security of data are a few of the reasons why customers have chosen XYZ as their service provider.


learning objectives

After studying this chapter, you will be able to:


>> Describe the meaning of the terms: data, information and knowledge
>> Discuss the role of business analytics personnel
>> List the required competencies for an analyst
>> Recognise the challenges of business data analytics
>> Describe how data quality management framework ensures
data quality
>> Explain the technology used for business analytics
>> Discuss change management in business analytics

4.1 Introduction
Business analytics is a process to filter and analyse sets of data, which might be small bits of data, a file containing data or a large collection of data generally known as a database. With growth in data, the need arises to store it at an appropriate location from where it can be easily accessed and modified irrespective of geographical location. Unlike small datasets, which are useful only to individual organisations, Big Data is useful to various organisations. To store Big Data, companies use cloud technology, data warehousing, etc. This data is later retrieved from storage, and analytics is applied to it to derive useful information. The analytics involves the use of various statistical methods, such as measures of central tendency and graphs, to derive significant information from data. This useful information is further used in businesses for decision making, growth, planning, creating action plans and increasing overall profitability. This way of sorting data to derive useful information has given a new purpose to business analytics.

In this chapter, you will first study data, information and knowledge. Next, the chapter discusses business analytics personnel and their roles. Further, the chapter discusses the required competencies for an analyst. Next, the chapter details business analytics data and the importance of ensuring data quality. Towards the end, the chapter discusses technology for business analytics and change management.

4.2 WHAT IS DATA, INFORMATION AND KNOWLEDGE?
Data, to put it simply, is raw material that does not make any definite sense unless you process it to a meaningful end. It can be anything from a collection of numbers to text and unrelated symbols. It needs to be processed with a context before being logically viable.


Examples of data
2,4,6,8
Mercury, Jupiter, Pluto

The above data alone does not represent the true picture. Maybe the sequence above is simply the table of two, or a sequence with a difference of two between numbers. The names may just be the names of conference rooms in an organisation rather than planet names. Unless you give data a logic and define the reasoning for its existence, it has no standalone existence by itself.

Information is the result that we achieve after the raw data is processed. This is where the data takes shape as per the need and starts making sense. Standalone data has no meaning; it only assumes meaning and transitions into information upon being interpreted. In IT terms, characters, symbols, numbers or images are data. These are joint inputs which a system running a technical environment needs to process in order to produce a meaningful interpretation.

Information can offer answers to questions like which, who, why, when, what and how. Put into an equation, information should look like:

Information = Data + Meaning
Information = Data + Meaning

Examples of Information
2, 4, 6, 8 are the first four multiples of 2.
Mercury, Jupiter and Pluto are the names of planets.

Only when we allocate a situation, or meaning, does the data become information.
information.
N

Data is raw, information is processed and knowledge is gained.
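A minimal Python sketch of this progression, reusing the document's own multiples-of-2 example, could look as follows; the dictionary structure and the final rule are only one illustrative way to attach meaning and infer a pattern.

```python
data = [2, 4, 6, 8]                    # raw data: no standalone meaning

information = {                        # data + meaning = information
    "meaning": "first four multiples of 2",
    "values": data,
}

# Knowledge: a pattern inferred from the information, usable for the future.
def knowledge(n):
    return 2 * n                       # predicts the nth multiple

print(information["meaning"], "->", knowledge(5))  # prints: ... -> 10
```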

Knowledge is something that is inferred from data and information. Actually, knowledge has a far broader meaning than the typical definition. Knowledge is an assembly of meaningful information whose intent is to be valuable. Gaining knowledge is a deterministic process.

Knowledge can be of two types:
• Obtaining and memorising the facts
• Using the information to crack problems

The first type is regularly called explicit knowledge, meaning knowledge that can be simply transferred to others. Explicit knowledge and its offspring can be kept in a certain media format, for example encyclopedias and textbooks.

The second type is termed tacit knowledge, referring to the type of knowledge that is complex and intricate. It is not gained simply by being passed on to others, and it requires elevated and advanced skills in order to be comprehended. For example, it will be tough for a foreign tourist to understand the local customs or rituals of a specific community located in a country whose language is different from the tourist's own. In such a case, the tourist needs to be conversant with the language, or requires additional resources, in order to understand the rituals. Similarly, the ability to speak a language, use a computer or do similar things requires knowledge that cannot be gained explicitly and is rather learned through experience.

How are data, information and knowledge linked?

Data signifies an element or statement of procedures without being related to other things, for example: it is raining. Information symbolises a relationship of some type, perhaps cause and effect, that bridges individual pieces of data. The topics are hierarchical in the following order, as shown in Figure 4.1:

Data -> (becomes) -> Information -> (becomes) -> Knowledge

Figure 4.1: Transforming Data into Knowledge

For example, the temperature fell 15 degrees and then it rained. Here, the inference based on the data becomes information.


Knowledge signifies a design that links and usually provides a high-lev-
el view and likelihood of what will happen next or what is described.
N

For example, if humidity levels are high and the temperature drips
considerably, the atmosphere is pretty much unlikely to hold the mois-
ture and the humidity, hence it rains. The pattern is reached on the
basis of comparing valid points emanating from data and information
resulting into the knowledge or sometimes also referred to as wisdom.
Wisdom exemplifies the understanding of essential values personi-
fied within the knowledge that are foundation for the knowledge in its
current form. Wisdom is systematic and includes an understanding of
all interactions that happen between raining, temperature gradients,
evaporation, changes, air currents and raining.

self assessment Questions

1. Information put into an equation should look like:
   Information = _______ + Meaning
2. Explicit knowledge and its offspring can be kept in a certain media format, for example in encyclopedias and textbooks. (True/False)


Activity

Suppose you have to explain to a school-going kid the difference between data, information and knowledge. Describe the method and technique you will use.

4.3 BUSINESS ANALYTICS PERSONNEL AND THEIR ROLES
A business analyst is anyone who has the key domain experience and knowledge related to the paradigms being followed. He/she often needs to wear multiple hats related to the field he/she is in. A business analyst can be anyone, from an executive to a top-level project director, given that they have a grasp of the system, its techniques and functionality, since all they represent is the business their organisation offers to customers.
Key Roles and Responsibilities of a Business Analyst

Requirements are the essential part of creating successful IT solutions. Defining, documenting and analysing requirements from a business analyst's perspective helps demonstrate what a system can do. The skills of a business analyst are shown in Figure 4.2:

[Figure 4.2 depicts the skills of a business analyst, spanning roles such as Business Planner, System Analyst, Project Manager, Organization Analyst, Financial Analyst, Technology Architect, Subject Area Expert, Data Analyst, Application Designer, Application Architect and Process Analyst.]

Figure 4.2: Skills of a Business Analyst

Described below are a few of the key roles and responsibilities of a business analyst in managing and defining requirements:
• Gathering the requirements: Requirements are a key part of IT systems. Inadequate or unfitting requirements often lead to a failed project. The business analyst fixes the requirements of a project by mining them from stakeholders and from current and future users, through research and interaction.
• Expecting requirements: A business analyst who has expertise in his/her field knows that in the dynamic world of IT, things can change quickly, even before anyone expects the change. Plans developed at the start are always subject to alteration, and anticipating requirements that might be needed in the future is key to successful results.
• Constraining requirements: While complete requirements are a must for a successful project, the emphasis should be on essential business needs, not personal user preferences, functions based on outdated processes or trends, or other unimportant changes.

S
‰‰ Organising requirements: Requirements often come from mul-
tiple sources that sometimes may contrast with other sources.
A business analyst must segregate requirements into associated
IM
categories to efficiently communicate and manage them. Require-
ments are organised into types as per their source and applica-
tion. An ideal organisation averts project requirements from over-
looked, and thus leads to an optimum use of budgets and time.
• Translating requirements: A business analyst must be skilled at interpreting and converting business requirements effectively into technical requirements. This involves using powerful modelling and analysis tools to match planned business goals with real-world technical solutions.
• Protecting requirements: At frequent intervals in a project's lifecycle, the business analyst protects the user's and the business's needs by confirming the functionality, precision and inclusiveness of the requirements developed so far against the requirements gathered in the initial documents. Such protection reduces risk and saves considerable time by certifying that the requirements are being fulfilled before further time is devoted to development.
• Simplifying requirements: The main role of a business analyst is to simplify tasks and maintain easier functionality. Completing the business objective is the aim of every project; a business analyst recognises and avoids unimportant activities that are not helpful in resolving the problem or achieving the objective.
• Verifying requirements: A business analyst is the most informed person in a project about the use cases; hence, they frequently validate the requirements and discard implementations that do not help in carrying the business objective to culmination. Requirement verification is completed through test, analysis, inspection and demonstration.
• Managing requirements: Usually, an official requirements presentation is followed by a review and approval session, where project deliverables, cost and duration estimates and schedules are decided and the business objectives are rechecked. Post approval, the business analyst shifts to requirement-management events and activities for the rest of the project lifecycle.
• Maintaining system and operations: Once all the requirements are completed and the solution is delivered, the business analyst's role shifts to post-implementation maintenance: ensuring that defects, if any, are resolved within the agreed SLA timelines; making any enhancements to the project; and performing change activities to make the system yield more value. The business analyst is similarly responsible for many other post-implementation activities, such as operations and maintenance, providing system authentication procedures, deactivation plans, maintenance reports and other documents like reports and future plans. The business analyst also plays a great role in studying the system to determine when replacement or deactivation may be required.
IM
self assessment Questions

3. Inadequate or unfitting requirements often lead to _____ of a project.
4. A business analyst must _______ requirements into associated categories to efficiently communicate and manage them.
M

Activity

As a business analyst, prepare a report on your analytical study of Sony Corporation, currently undergoing turmoil for serving too many business areas.

4.4 REQUIRED COMPETENCIES FOR AN ANALYST
The business analyst role is considered a bridge between business stakeholders and IT. Business analysts need to be great at verbal and written communication, diplomatic, experts with problem-solving acumen, and theorists with the ability to engage with stakeholders to comprehend and answer their needs in a dynamic business environment. This includes dealing with senior members of management and challenging interrogation sessions to confirm that the time is well spent and value-for-money development can commence.

Business analysts need not necessarily be from an IT background, although it certainly helps to have a basic understanding of IT systems and how they work. Sometimes, business analysts come from a programming or other technical background, often from within the business, carrying thorough information about the business field, which can likewise be very useful. To be called a successful business analyst, you ought to be a multi-skilled person who is adaptable to an ever-changing environment. The following are some of the most common skills that a decent business analyst should have:
• Understanding the objectives: Being able to understand directions and commands is important. If you cannot understand what and, more significantly, why you are assigned to do something, the chances that you cannot deliver what is required are high. Do not hesitate to ask questions or seek additional information if you have any doubts.
• Having good communication skills: It sounds obvious, but it is necessary to have good verbal communication skills, preferably in a global environment where multitudes of stakeholders, management and resources from diverse backgrounds collaborate on a single platform to discuss, debate and finalise the requirements, which will incidentally be captured by you. It is necessary for you to have that comprehension level, along with the eloquence to deliver your conceptions or clear any doubts you have. You should be able to make your point evidently and explicitly. Communicating data and information at the appropriate level is important, as some stakeholders require more detailed information than others due to varying levels of understanding.
• Manage stakeholder meetings: While email, which also acts as an audit trail, is a fair method of facilitating communication, sometimes it turns out not to be enough. Old-school face-to-face discussions and meetings for detailed deliberation over problems and queries are still a popular way of carrying out effective analysis. Most of the time, you end up discovering more about your project from the physical presence of all stakeholders, where all collaborators tend to be open about debating circumstances.
• A good listener: You are better off listening more than you speak, and jotting down notes and takeaways from meetings. Good listening skills require the patience and virtue to understand the stakeholder, which gives them a feeling of being heard rather than being overlooked or overpowered by a dominating analyst; projects lacking this often end up in a mess sooner than they should. Your listening and information-absorbing skills are important in making you an effective analyst. Do not just listen; understand the situation, and question only where you think stakeholders are passing off unnecessary off-business requirements while ignoring the actual requirements that can help in making an efficient system. You can attend personality development training to gain control over voice modulation, dialect and pitch moderation, along with effective body language and business presentation skills.


• Improving the presentation skills: As a business analyst, you are supposed to be presentable at any time, round the clock. You will often lead workshops or pitch a piece of work to stakeholders or to the internal project team. It is important to give due consideration to the content of your presentation and ensure that it matches the objectives to be met, since there is no point presenting implementation methods if the meeting is about gathering requirements. These presentations not only present information but also act as a good way to get more clarity or information from stakeholders in case you are looking for further details on a specific part of the project.
• A time manager: A business analyst is responsible for maintaining the timeframes of the project as well as the corporate schedules. The BA should ensure that the project meets the pre-agreed milestones, with daily tracking schedules being fulfilled by the development team. A business analyst should prioritise activities, separating critical ones from those that can wait, and focus on them.
IM
‰‰ Literary and documenting skills: Requirements documents, spec-
ifications, reports, analysis and plans. Being a business analyst, you
are supposed to deliver numerous types of documentations that
will go on to become project and legal documents later on. So, you
need to ensure that your documents are created concisely, and at
a comprehensible level for the stakeholders. Avoid specific jargons
M

to a particular field as they may not be understood by all stake-


holders and later may create confusion or other complexities with
their interpretations. Starting as an inexperienced business ana-
lyst, you will gradually learn to write requirement documentations
and reports, but having strong writing skills is enough to give you a
N

head start over the others since it will lead to unambiguous require-
ments documentation.
• Stakeholder management: It is important that you know how to deal with stakeholders and know how much power and impact they have on your project. Stakeholders can either be your best friends/supporters or your greatest critics. An accomplished business analyst will have the skill to investigate the degree of management every stakeholder needs and how each ought to be dealt with individually.
• Develop your modelling skills: As the expression goes, a picture paints a thousand words. Techniques such as process modelling are compelling tools for conveying a lot of data without depending on text. A visual portrayal gives you an outline of the issue or project so that you can see what functions well and where the loopholes lie.


self assessment Questions

5. To be called a successful business analyst, you ought to be a multi-skilled person who is adaptable to an ever-changing environment. (True/False)
6. A business analyst is not responsible for maintaining the timeframes of the project as well as corporate schedules. (True/False)

Activity

You are a veteran business analyst, responsible for coaching a new batch of management trainees in an organisation. Lay out the course plans and methods you will utilise to train them on the standards and the knowledge.
4.5 Business Analytics Data
Any approach to analytics must adjust to changes in the way people work inside their business settings, particularly with the growing size of data volumes. Arranging data in a way that is customised to make sense for every business customer requires infusing content with context before augmenting the value of relevant filtering and representation. Enhancing the enormous amounts of data and presenting significant learnings for every business consumer's needs comes with many difficulties. We can segregate those problems as data analytics challenges: creating algorithms that will gather, analyse, group, channel, categorise and, at last, filter the meaning, and also persistently retrain the machine, cutting and dicing this data in view of individual needs and conveying it in a way that is most useful depending on a person's perspective (area, time, device and so on). Some of the data analytics challenges are as follows:
useful relying upon a person’s perspective (area, time, gadget and so
on). Some of data analytics challenges are as follows:
• Content variety and quality: Information sources are no longer entirely organised. Business folks depend on a pool of information objects that mix customarily structured information with various types of artefacts, for example transactional system databases as well as Web-based social networking channels, like Facebook, Twitter, LinkedIn, Web journals, wikis, etc., each of which must be surveyed for logical importance and incorporated within different data models.
For quality, the bits of information that can be mined from an information source like a database or a social networking Web page may have distinctive levels of relevance for various sorts of data consumers in different places of an organisation. One example is information gathered for announcing product launches: for senior officials, a rolled-up view of positive or negative sentiment might be adequate, while the product manager may search for insights with respect to potential product defects that can be quickly remediated.
with respect to potential item defects that can be quickly remedi-
ated.
• Content organisation: Forming the data inputs begins with a set of meanings and semantics, but business requirements change over time, so the models need to be flexible, with the capacity to provide allowances in relation to taxonomic models and tag inputs, and to match them based on incidental content. However, dissimilar levels of information sparseness, density, freshness and quality affect the capability to unify the data and require increased sophistication.
• Connectivity: Any information source may have different levels of importance inside a wide range of business settings. For instance, remarks about a bike's drivability might be more important coming from a vehicle-enthusiast blog owner, which can be checked through Twitter. That poses two difficulties: the first is linking information artefacts to various business domains, while the second includes deriving dynamic linkages, connections and relevance beyond fixed ordered models. The latter challenge likewise implies striving to advance an understanding of how data sets are utilised by various people and adjusting analytical models respectively.
• Personalisation challenges: Beyond filtering through substantial volumes of data resources taken from a variety of sources, a wide range of channels must be set up to recognise different filters of business value depending on who the customers are. For instance, a sales delegate may be informed about a few particular contacts from their client base to help in generating leads. The same data sources can be refined to provide sales and marketing executives with subjective information about their top clients, help to recognise potential threats from competitors and inform about techniques for continuing expansion inside vertical markets.
• Finding correlations in a dynamically changing business world: Pattern detection in data correlations may indicate developing trends. For example, investigating the correlation between Web searches about influenza symptoms, medicines and geographical places over a period can help in forecasting patterns of influenza infections.
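A minimal Python sketch of this idea measures the strength of such a correlation with NumPy; both weekly series below are fabricated for illustration.

```python
import numpy as np

searches   = np.array([120, 150, 180, 240, 300, 280, 210, 160])  # weekly flu-symptom queries
infections = np.array([ 80, 100, 130, 170, 220, 200, 150, 110])  # reported cases

# Pearson correlation between the two series; values near 1 hint at a
# usable pattern for forecasting.
r = np.corrcoef(searches, infections)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```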

self assessment Questions

7. Any information source may have different _________ of importance inside a wide range of business settings.
8. _____ detection in data correlations may specify developing trends.


Activity

If raw sample data from a research institute lands at your department, what will be your first reaction in order to polish up the data?

4.6 Ensuring Data Quality


Data is formed during the progression of a single business method and flows throughout an organisation as it goes through the multiple phases of one or more business procedures. As data moves from one place to another, it converts and presents itself in supplementary forms which, unless governed and managed properly, can lose veracity. Although each data type needs a separate plan and method for supervision, there is a general framework that can be used to efficiently manage all data types. The data quality management framework comprises three mechanisms: control, monitor and improve.
ly manage all data types. The data quality management framework
comprises of three mechanisms: control, monitor and improve.

Control
IM
The most ideal approach to deal with the nature of information in a
data framework is to guarantee that only the information which meets
the standard models is permitted to enter the framework. This can
be accomplished by setting up solid controls at the front end of every
data inflow system, or by putting validation runs in the integration
M

layer which is in charge of moving information from one system to the


other. However, this is not generally plausible or financially practical
when, for instance, information is captured physically and after that
later captured in a framework/system, or when changes to applications
are excessively costly, especially with programming involved with
N

commercial off-the-shelf (COTS). In one specific case, an organisation


ruled against executing changes to one of its primary data capture
COTS applications that would have authorised stricter information
controls. They depended rather on preparing, observing and giving
an account of the utilisation of the framework to help them enhance
their business procedure, and accordingly, experienced heightened
data quality. In any case, organisations that have solid quality controls
at the data influx entry points have experienced exceptionally viable
data quality administration.

Monitor

It is natural to assume the data to be of higher quality provided there


are strong data controls installed at the entry gate of the system. As
processes are developed and enhanced, folks responsible for data
change managing and the systems age up and the quality controls are
not necessarily maintained to keep up with the anticipated data qual-
ity phases. This creates a need for intermittent monitoring of the data
quality by running authentication rules against the existing stored

NMIMS Global Access - School for Continuing Education


100 Fundamentals of Big Data & Business Analytics

n o t e s

data to ensure that the data quality matches the desired levels. Ad-
ditionally, information captured from one system to another compels
the company to monitor the data frequently to confirm consistency
across multiple systems. Data quality monitoring enables the organi-
sation to actively discover issues before they affect the decision-mak-
ing process.

At present, organisations increasingly rely on innovative data visualisation methods and analytics to deliver increased business value. However, when those efforts are hindered by issues related to data quality, the trustworthiness of the whole analytics strategy comes into question. Since analytics is conventionally considered a presentation of a wide range of data points, it is falsely presumed that data quality issues can be ignored because they would not influence the broad ranges. The 5Cs for ensuring data quality are shown in Figure 4.3:
quality issues can be ignored since they would not influence the broad
ranges. The 5Cs for ensuring data quality are shown in Figure 4.3:

S
Correctness Measure the degree of data accuracy
IM
Measure the degree to which all required data is present
Completeness

Currency Measure the degree to which data is refreshed or made


available at the time it is needed
M

Conformity Measure the degree to which data adheres to standards


and how well it is represented in an expected format

Measure the degree to which data is in sync or uniform


Consistency across the various systems in the enterprise
N

Figure 4.3: 5Cs of Data Quality

Improve

When the data quality checks report a decline in quality, a few corrective measures can be deployed. As described above, training, adjusting processes and system enhancements involve both people and technology. Usually, an improvement plan implemented right after the first instance of a quality dip comprises data cleansing, which can be completed via automation or manually by business users. If the business can self-define the rules to improve data, then data-purging programs can easily be created to mechanise the data enhancement method. The next step, business validation, makes sure that the data regains its required quality levels. Habitually, organisations end the data quality enhancement programme after a single round of positive validation, which is a wrong step. An important step that is often missed is improving the data quality controls, by doing a full root cause analysis (RCA) of the issues and quality controls, to ensure that the same issues do not recur. Applying these steps is more critical when a project consists of master or reference data, such as product, client or market data. Besides, organisations implementing an integrative solution will gain from this extra exertion, since it aids quality data flow throughout the enterprise in an adaptable solution.

Besides technical challenges, there are often organisational hindrances that must be dealt with. This is evident in organisations with huge vastness and diversity of data, which is often kept by unalike departments with contradictory priorities. Hence, a mixture of stakeholder management, data governance and careful planning is required, along with the right approach and solution.

self assessment Questions

9. _______________________ the data requires a close look at the data parameters and controlling the overall aspects of data to ensure ________________ in quality.
10. In a closed environment, data quality is achievable and can be achieved without adhering to data metrics. (True/False)
achieved without adhering to data metrics. (True/False)

Activity

Prepare a report on popular tools used for measuring data quality.



4.7 Technology for Business Analytics

In a push to make analysis more meaningful and visible to the business user, solutions are concentrating on particular vertical applications and customising the outcomes and interfaces for the business audience. For usability, simpler and more compelling deployment, and better value, analytics is being embedded in larger systems. Therefore, issues like data gathering, storage and processing related to analytics are increasingly viewed as critical issues in system design. In an endeavour to expand the role of analytics in business processes, provisions are being developed that go beyond client-facing applications, working in the background of applications in sales, supply chain visibility, advertising, price optimisation and workforce analysis. For this purpose, business intelligence (BI) includes tools in the following categories:
‰‰ AQL - Associative Query Logic
‰‰ Business planning
‰‰ Business process re-engineering
‰‰ Competitive analysis
‰‰ Data mining (DM), Data farming and data warehouses and so on.


Variations in technologies are possibly the most noticeable BI component in the IT industry. We might think that the volume of data needed to make a specific decision has decreased over time, either due to the overall shift in management style or due to the growing emphasis on terms of higher significance, such as insight, knowledge and ideas.

Keeping the human factor in mind, the difference between reactive and proactive decision making is defined by the level of complexity of the fields between advanced analytics and BI. Summary reports, statistics and queries, and low-latency dashboards are built on historical information. There is a middle ground for simple analytics, e.g., algebraic or trending predictions that give approximate answers about expectations in terms of sales, production, etc. Advanced analytics is much more refined and supports techniques such as statistical analysis, forecasting, prediction and correlation, whereas trend analysis simply extrapolates the existing data to project the next quarter. A refined predictive model takes seasonality, correlations between strong and weak quarters, and historical sales patterns into account.
IM
Let us look at decision making from another point of view. Say we want to examine our brain while taking a decision. From a logical viewpoint, when our brain encounters a task it has no idea about, it attempts to create rational assumptions, guessing the input and the likely outcomes versus the actions to be taken, and tries to find the best answer. When the brain encounters the same kind of problem again, it recalls the outcomes and methods deployed in the old task, assesses what worked earlier and what did not, and then tries to figure out the right answer to the current problem. After being subjected to a number of similar or varying tasks, the brain becomes familiar with cracking that specific type of task. Consequently, the time spent re-examining older solutions and finding the right solution for a new task reduces significantly.

Alongside supporting human decision-making patterns, the structural setup of a BI system should be carefully considered. A number of published studies indicate that intelligence works best when planned as a joint effort involving people. This effort needs to be properly coordinated in terms of priorities, responsibilities and procedures, and at the same time the intelligence setup should support and encourage an effective horizontal exchange of data among contributors.

There are multiple cases where state-of-the-art business intelligence technology failed to deliver on expectations because of the unwillingness of people to take care of the data-hungry system and perform the additional actions required from time to time. Even taking the learning curve into account, the intricate capabilities and patterns of human capacity and learning still exceed machine learning in countless areas. At the same time, people have never been more able to understand and use specific technologies; the next generations find technologies less intimidating and regard the techno-human connection as regular and indisputable.

Business analysis today functions as per the standards prescribed by BABOK (Business Analysis Body of Knowledge), a compilation of the most commonly utilised effective practices in business analysis across the globe. These standards keep evolving and incorporate new changes dynamically in the form of versions. It is a framework that describes the knowledge, skills and capabilities required to accomplish business analysis efficiently. Software development methodologies like Agile and SCRUM are commonly used standards that help in creating an iterative, informative solution composed of several layered steps for dealing with the SDLC and its associated phases. Coming to application tools, business analysts across the world utilise applications like MS Word, Excel, Visio, PowerPoint and Project, and many such tools, in order to put their best foot forward. These tools are effective and clear in presenting information as close as possible to the depiction wanted by the analyst, and hence raise the overall level of analytical and operational standards.

self assessment Questions

11. AQL stands for
a. Associative Query Logic
b. Associated Query Logic
c. Association of Query Logic
d. Associative Query in Logic

Activity

Create a presentation on data mining tools and show it in your class.

4.8 Managing Change


There are numerous reasons why change is fraught with fear: our characteristic need for a sense of security around existing processes, and a comfort zone that is often tough to break, decrease a contemporary change's probability of success. For example, many Windows XP users, several of them elderly bank employees in India, were intimidated on hearing that Microsoft was discontinuing support for XP, since they would have to learn a new OS from scratch, which could have taken them considerable time. Instead, they found ways of doing existing work efficiently with available resources and with the help of consultants hired to ease the fear factor that had them on their toes. Change management in the field of business analytics often interrelates with, and precedes or succeeds, other phases, as shown in Figure 4.4:

zz Compare planned
zz Register and study and actual indi-
corporate data cators
zz Follow the budget

Monitoring Analysis

Change Control
Management zz Evaluate the
zz Implement a efficiency of
achieved targets

S
balanced indicator
system zz Make exact
decisions
IM
Figure 4.4: Change Management Phases

There should be multiple phase auditors to ensure that the roles and
responsibilities of one phase assigned to a business analyst do not seep
into the other phases, affecting the overall outcomes and messing up
the overall project execution.
M

As a business analyst, you often come across initiatives or projects that act as defining watershed moments and lead to massive changes within an organisation. In some cases, you are required to stay at the front lines, be it gathering requirements from sceptical stakeholders or reviewing a solution that was put in place a bit too early and is now facing strong resistance. To get your job done efficiently in such circumstances, you need to comprehend how well a change is received by the affected individuals and how to lead people through the change. Let us discuss a few of the topics related to change management that a business analyst should abide by.

All Change Is Personal

An organisational change always occurs at the level of the individual – this is the first thing you need to learn and understand. Each employee will respond to the change in a different way based on their worldview, culture, understanding and the relevance of the change to their responsibilities, their current lifestyle and other factors.

Change is not Team Bound

Independent studies have found that visible and active leadership from the management team is the principal contributor to a successful change. Individuals in an organisation naturally look to their senior management for leadership on the importance of activities and to understand why actions are needed. If the change leaders are not seen to be frequently involved in and supportive of the process, the change faces a high probability of failure, since the indifference of senior people will lead those at lower levels to believe that the change is not worthwhile or much required.

When business analysts are involved in transition and operational activities, they need to assess the methods being used to support the individual's viewpoint and make suggestions to ensure all applicable stakeholders can successfully bring about and implement the change.

There is more to Change Management than only


Communication and Training

When most people consider helping individuals adjust to a change, the two most commonly used methods are training and communication. Both are important tools that help individuals work through the change process and address the awareness and ability/knowledge areas. Nonetheless, they are not adequate on their own to fully support the implementation of a change.

Managers who are associated, throughout the entire period, with the areas affected by a change should be adequately supported so that they themselves can come on board and play a part in the change before they are asked to help their staff. These people do not simply require training on the solution; they need to understand what to do in order to help their staff overcome any issues they confront.

Business analysts frequently perform stakeholder analysis to track every group involved in a project. We frequently evaluate attributes like attitude, influence and engagement. These attributes can be utilised as part of a larger context to evaluate how people are managing the change, and what to do if some of them are resistant to it. An official roadmap will concentrate on sustaining support by engaging key partners routinely to ensure that they stay on board and involved with the change.

Change Managers as Business Analysts

Change management is a different field from business analysis; however, the two are highly complementary. While a few organisations now have dedicated change management resources, business analysts will regularly be included in the planning and execution of change management, given their frontline involvement with stakeholders throughout the project. In case there are no dedicated change management resources or pre-defined change management obligations, the business analyst has a chance to help the project successfully meet its goals by knowing the basics of change management and applying them in their activities. When your organisation is implementing a new BI initiative, the chances of success are greatly improved when change management is an integral piece of the initiative.
Change management encourages communication from the start of the initiative, moving resistant users to acceptance and even enthusiasm, which significantly improves the effective adoption of the new functionality. Also, it does not end when the innovation goes live; change management exercises keep helping with adoption and user capability until the technology is completely incorporated into the business.
Another use of change management is to ensure that partners and stakeholders, like BI groups and business lines, are cooperating to guarantee that the correct data is captured for future business requirements. Change management encourages the understanding of the business needs of the organisation by uniting leaders from various offices and departments, which enables BI and development teams to concentrate on coming trends and anticipate the restatement of business data needs.
Project and change management are distinct and complementary activities that use different skill sets. Project management drives the technical side of a technology initiative, concentrating on guaranteeing that the solution is appropriately designed and works as required. Change management is centred on the people side, preparing users for the change and working to ensure that the new procedures are adoptable and usable. According to a study carried out by an independent research group, emerging best practices call for change management to be incorporated with project management; together they satisfy many diverse functions, collaborating as a single entity for successful implementation.
Developing change managers and leaders across the organisation can be extraordinarily helpful in improving change management endeavours. These could be leaders from many functional areas and various departments: leaders who have managerial and operational aptitude, are educated about organisational processes, and know how to set the course for the effective and enthusiastic adoption of new procedures and practices. Having business analysts as change leaders enables improved BI implementation in the following ways:
‰‰ Ensuring that business units have a reliable group or person who shares data within the business domain and throughout the organisation, and carries the understanding of the development and BI teams.
‰‰ Increasing BI and development teams' understanding of business requirements across the organisation, resulting in better data discovery and capture of the right data for the corporation.

self assessment Questions

12. _______ management encourages communication from the start of the activity.


Activity

Your existing medical project requires some sudden changes due to a large influx of disorganised sample data. It also requires a change in the system dynamics used so far to manage the existing volumes of data. How will you proceed to ensure that effective change management is carried out without affecting operations?

4.9 Summary
‰‰ Data, to put it simply, is the raw material that does not make any definite sense unless you process it to some meaningful end.
‰‰ Information is the result we achieve after the raw data is processed.
‰‰ Standalone data has no meaning; it only assumes meaning and transitions into information upon being interpreted.
‰‰ Knowledge is something that is inferred from data and information.
‰‰ A business analyst is anyone who has the key domain experience and knowledge related to the paradigms being followed.
‰‰ Business analysts need not necessarily be from an IT background, although it certainly helps to have a basic understanding of IT systems and how they work.
‰‰ When the data quality checks report a decline in quality, a few corrective measures can be deployed.
‰‰ Change management is a different field from business analysis; however, the two are highly complementary.

key words

‰‰ Business analyst: Anyone who has the key domain experience and knowledge related to the paradigms being followed.
‰‰ Explicit knowledge: A type of knowledge that can be simply transferred to others.
‰‰ Information: The result that we achieve after the raw data is processed.
‰‰ Stakeholder management: The process of dealing with stakeholders and understanding how much power and impact they have on your project.
‰‰ Tacit knowledge: A type of knowledge that is complex and intricate, cannot simply be passed on to others, and requires elevated and advanced skills in order to be comprehended.


4.10 Descriptive Questions


1. Discuss the relation between data, information and knowledge.
2. Explain the role and responsibilities of a business analyst.
3. Enlist and describe the skills required to be a good business
analyst.
4. Discuss the ways of ensuring data quality.

4.11 ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

What is Data, Information and Knowledge?: 1. Data; 2. True
Business Analytics Personnel and their Roles: 3. Failure; 4. Segregate
Required Competencies for an Analyst: 5. True; 6. False
Business Analytics Data: 7. Levels; 8. Pattern
Ensuring Data Quality: 9. Monitoring, improvement; 10. False
Technology for Business Analytics: 11. a. Associative Query Logic
Managing Change: 12. Change

HINTS FOR DESCRIPTIVE QUESTIONS


1. Data, to put simply, is the raw material that does not make any
definite sense unless you process it to any meaningful end. Refer
to Section 4.2 What is Data, Information and Knowledge?
2. A business analyst is anyone who has the key domain experience
and knowledge related to the paradigms being followed. Refer to
Section 4.3 Business Analytics Personnel and their Roles.
3. Business analysts need to be great at verbal and written communication, diplomatic, experts with problem-solving acumen, and able to engage with stakeholders to comprehend and respond to their needs in a dynamic business environment. Refer to Section 4.4 Required Competencies for an Analyst.


4. As data moves from one place to another, it is converted and presented in additional forms and, unless governed and managed properly, can lose its veracity. Refer to Section 4.6 Ensuring Data Quality.

4.12 SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
‰‰ Laursen, G. H., & Thorlund, J. (2017). Business analytics for managers: Taking business intelligence beyond reporting. Hoboken, NJ: Wiley.
‰‰ Isson, J. P. (2013). Win with advanced business analytics: Creating business value from your data. Hoboken, NJ: John Wiley & Sons.

E-REFERENCES
‰‰ Risk, S. (n.d.). Business analytics less data quality equals bad decisions. Retrieved April 26, 2017, from https://www.blue-granite.com/blog/business-analytics-less-data-quality-equals-bad-decisions
‰‰ Data quality for business analytics by David Loshin - BeyeNETWORK. (n.d.). Retrieved April 26, 2017, from http://www.b-eye-network.com/view/15539
work.com/view/15539
M
N



Chapter 5

Descriptive Analytics

CONTENTS

5.1 Introduction
5.2 Visualising and Exploring Data
5.2.1 Dashboards
5.2.2 Column and Bar Charts
5.2.3 Data Labels and Data Tables Chart Options
5.2.4 Line Charts
5.2.5 Pie Charts
5.2.6 Scatter Chart

5.2.7 Bubble Charts


5.2.8 Miscellaneous Excel Charts
5.2.9 Pareto Analysis
Self-Assessment Questions

Activity
5.3 Descriptive Statistics
5.3.1 Central Tendency (Mean, Median and Mode)
5.3.2 Variability
5.3.3 Standard Deviation
Self-Assessment Questions
Activity
5.4 Sampling and Estimation
5.4.1 Sampling Methods
5.4.2 Estimation Methods
Self-Assessment Questions
Activity
5.5 Introduction to Probability Distributions
Self-Assessment Questions
Activity


5.6 Summary
5.7 Descriptive Questions
5.8 Answers and Hints
5.9 Suggested Readings & References



Introductory Caselet

CAB SERVICE COMPANY USING DESCRIPTIVE ANALYTICS FOR BETTER CUSTOMER SATISFACTION

To reap the maximum benefits of social media marketing, a newly launched cab service company deploys the analytical expertise of a consultancy firm. The firm has recommended an extended social media campaign followed by a series of introductory offers and joining gifts in the form of free travel and exclusive cashback offers for the first few customers. The firm has offered to help with the social media operations along with reputation management, in case some disgruntled customers throng the social forums to voice their opinions, or other cab companies plan to bog it down by targeting a malicious false-review campaign against the company.

The cab company is on a strict marketing and advertising budget and needs the analytics to stay true to their potential. A misfired campaign may result in a damaged image as well as revenue loss for the company. The statistics and analysis of the consultancy firm need to be spot on in order to create a niche in a market where there are already several players. They need to make sure that customers are won over from the existing players and retained for a long time. The consultancy will study the current market and statistics around the area where the company is planning to deploy its cabs. Based on the data gathered, the consultancy will go into technical detail such as occurrences of low-travel days, weather-dependent phases, and predicting traffic, movements and random happenings, along with workarounds to deal with them.
N


learning objectives

After studying this chapter, you will be able to:


>> Explain about visualising and exploring data
>> Describe descriptive statistics
>> Define sampling and estimation
>> Elucidate probability distributions

5.1 Introduction
Descriptive analytics is the most essential type of analytics and establishes the framework for more advanced types of analytics. This sort of analysis addresses "What has occurred in the corporation?" and "What is going on now?" Let us consider the case of Facebook. Facebook users produce content through comments, posts and picture uploads. This information is unstructured and is produced at an extensive rate. Facebook stats reveal that 2.4 million posts, equivalent to around 500 TB of information, are produced every minute. These jaw-dropping figures have brought popularity to another term, which we know as Big Data.
Data.

Comprehending the information in its raw form is troublesome. This information must be summarised, categorised and displayed in an easy-to-understand way to let managers comprehend it. Business intelligence and data mining tools and methods have been the accepted means of doing so for bigger organisations. Practically every organisation does some type of summary and MIS reporting using a database or simply spreadsheets.
N

There are three crucial approaches to summarise and describe raw data:
‰‰ Dashboards and MIS reporting: This technique gives condensed information on "What has happened?", "What has been going on?" and "How does it stand against the plan?"
‰‰ Ad hoc reporting: This technique supplements the previous one by helping the administration extract information as required.
‰‰ Drill-down reporting: This is the most complex piece of descriptive analysis and gives the capacity to delve deeper into any report to comprehend the information better.

This chapter first discusses the processes of visualising and exploring data. Next, the chapter discusses descriptive statistics. Further, the chapter discusses sampling and estimation. Towards the end, the chapter discusses probability distributions.


5.2 VISUALISING AND EXPLORING DATA


Data visualisation is the method of depicting data (typically in larg-
er quantities) in graphical or visual form. Researchers observed that
data visualisation improves decision-making, provides managers with
better analytic capabilities that reduce the dependence on IT profes-
sionals, and improves collaboration and information sharing.

Raw data is important, particularly when one needs to identify ac-


curate values or compare individual numbers. However, it is quite
difficult to identify trends, patterns and find exceptions, or compare
groups of data in tabular form. The human brain does a surprisingly
good job in processing visual information—if presented in an effective
way.

Data visualisation provides a way of data collaboration at all business levels and can disclose surprising relationships and patterns.

Data visualisation is also important both for building decision models


and for interpreting their results. To identify the appropriate model to
use, we would normally have to collect and analyse data to determine
the type of relationship (linear or non-linear, for example) and esti-
mate the values of the parameters in the model. Visualising the data
will help to identify the proper relationship and use the appropriate
data analysis tool. Furthermore, complex analytical models often yield
complex results. Visualising the results helps in understanding and

gaining insight about model output and solutions.

5.2.1 DASHBOARDS
N

Making data visible and accessible to employees at all levels is a hallmark of effective modern organisations. A dashboard is a visual picture of a group of specific business measures. It is similar to the dashboard of an automobile, such as a car, which displays fuel level, speed, seatbelt indicators, temperature, and so on. Dashboards deliver key summaries of valuable business data to efficiently manage a business function or process. Dashboards might include tabular as well as visual data to allow managers to quickly locate the key data.

5.2.2 COLUMN AND BAR CHARTS

MS Excel refers to vertical bar charts as column charts and horizontal bar charts as bar charts. Column and bar charts are valuable for comparing categorical or series-specific data, for demonstrating differences between value sets, and for displaying percentages or proportions of a whole.


Figure 5.1 shows column and bar charts:

Figure 5.1: Column and Bar Chart


Source: https://www.aploris.com/support/documentation/bar-and-line-charts

5.2.3 DATA LABELS AND DATA TABLES CHART OPTIONS
IM
MS Excel provides options for including the numerical data on which
charts are based within the charts. Data labels can be added to chart
elements to show the actual value of bars. Data tables can also be
added; these are usually better than data labels, which can get quite
messy. Both can be added from the Add Chart Element Button in the
Chart Tools Design tab, or also from the Quick Layout button, which
provides standard design options. Figure 5.2 shows data labels and

data tables chart:


N

Figure 5.2: Data Labels and Data Tables Chart


Source: http://datapigtechnologies.com/blog/index.php/the-trouble-with-chart-data-tables/

5.2.4 LINE CHARTS

Line charts are a useful way of displaying data over a given period. You may enter multiple series of data in a line chart; however, it can become difficult to interpret if the magnitudes of the data values differ greatly. In such a case, it would be advisable to create individual charts for the different data series. Figure 5.3 shows line charts:

Figure 5.3: Line Charts
Source: http://www.advsofteng.com/gallery_line.html
5.2.5 PIE CHARTS

For many types of data, we are interested in understanding the relative proportion of each data source to the total. A pie chart shows this by dividing a circle into pie-shaped areas, each displaying the relative proportion. 3D pie charts can become confusing when there are many data variables, because a third dimension normally carries meaning on a coordinate graph but conveys no data in a pie. Hence, pie charts are preferred in two-dimensional form for effective and simpler data representation. Figure 5.4 displays a pie chart:

Organic (36%)

Email marketing (31%)


Google + (5%)

Facebook (6%)

Twitter (7%)
Pinterest (7%) Referrals (3%)

Figure 5.4: Pie Charts


Source: http://www.f1f9.com


5.2.6 SCATTER CHART

Scatter charts demonstrate the connection between two variables. To


create a scatter chart, we require variable pairs and observations re-
lated to them. For example, students in a class might have grades for
both a midterm and a final exam. Figure 5.5 shows a scatter chart:

Figure 5.5: Scatter Chart
Source: https://www.zingchart.com/docs/chart-types/scatter-plots/

5.2.7 BUBBLE CHARTS


A bubble chart is a chart related to scatter chart, in which the data

marker size corresponds to a third variable; thus, it is a method to


display three variables in 2D space. Figure 5.6 shows a bubble chart:

Figure 5.6: Displaying Bubble Charts


Source: https://community.devexpress.com/blogs/ctodx/archive/2008/10/28/dxperience-v2008-
vol-3-bubble-charts-for-winforms-and-asp-net.aspx

5.2.8 MISCELLANEOUS EXCEL CHARTS


Excel provides several additional charts for special applications.
These additional types of charts (including bubble charts) can be se-
lected and created from the Other Charts button in the Excel ribbon.


These include the following:


‰‰ A stock chart allows you to plot stock prices, such as the daily high,
low, and close. It may also be used for scientific data such as tem-
perature changes.
‰‰ A surface chart shows 3-D data.
‰‰ A doughnut chart is similar to a pie chart but can contain more
than one data series.
‰‰ A radar chart allows you to plot multiple dimensions of several
data series.
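While the chapter works with Excel, the same chart types can be produced programmatically. Below is a minimal Python sketch, assuming the matplotlib library and made-up quarterly sales figures, that draws a column chart, a line chart and a pie chart:

```python
import matplotlib.pyplot as plt

# Made-up quarterly sales figures, for illustration only.
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [120, 150, 90, 180]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

ax1.bar(quarters, sales)                 # column chart: compare categories
ax1.set_title("Column chart")

ax2.plot(quarters, sales, marker="o")    # line chart: change over a period
ax2.set_title("Line chart")

ax3.pie(sales, labels=quarters, autopct="%1.0f%%")  # pie chart: share of total
ax3.set_title("Pie chart")

plt.tight_layout()
plt.show()
```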

5.2.9 PARETO ANALYSIS

Pareto analysis is named after Vilfredo Pareto, an Italian economist who, in 1906, observed that a large portion of the total wealth in Italy was held by a comparatively small number of people. The Pareto principle is often seen in many business situations. For example, a high percentage of sales usually comes from a small percentage of customers, a high percentage of defects originates from relatively few batches of the product, or a high percentage of stock value belongs to a small percentage of items. As a result, the Pareto principle is also often called the "80–20 rule", referring to such generic situations.
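The 80–20 idea can be checked with a few lines of code. The following minimal sketch (pure Python, made-up customer revenues) ranks customers by revenue and reports the share contributed by the top 20%:

```python
# Made-up revenue per customer, for illustration only.
revenues = [700, 40, 25, 460, 60, 35, 30, 55, 45, 50]

# Rank customers from largest to smallest contribution.
ranked = sorted(revenues, reverse=True)

top_n = max(1, len(ranked) // 5)               # the top 20% of customers
top_share = sum(ranked[:top_n]) / sum(ranked)  # their share of total revenue

print(f"Top {top_n} of {len(ranked)} customers generate {top_share:.0%} of revenue")
```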
M

self assessment Questions

1. Data _____ gives a way of data collaboration at all business


levels and can disclose surprising relationships and patterns.
N

2. Dashboards might include ____ as well as ____ data to allow


managers to quickly locate key data.
3. Bubble chart is a method to display three variables in 2D
space. (True/False)

Activity

Prepare a report on data visualisation tools available on the Web


other than the tools discussed in the chapter.

5.3 DESCRIPTIVE STATISTICS


Statistics, as defined by David Hand, past president of the Royal Sta-
tistical Society in the UK, is both the science of uncertainty and the
technology of extracting information from data. Statistics involves col-
lecting, organising, analysing, interpreting and presenting data. You
are familiar with the concept of statistics in daily life as reported in
newspapers and the media, for example, baseball batting averages,


airline on-time arrival performance, and economic statistics such as


the Consumer Price Index.

Statistical methods are essential to business analytics and are used


throughout this book. Microsoft Excel supports statistical analysis in
two ways:
1. With statistical functions that are entered in worksheet cells
directly or embedded in formulas.
2. With the Excel Analysis Toolpak add-in to perform more complex
statistical computations. We wish to point out that Excel for the
Mac does not support the Analysis Toolpak.

A population consists of all items of interest for a particular decision


or investigation—for example, all individuals in the United States who
do not own cell phones, all subscribers to Netflix, or all stockholders

of Google. A company like Netflix keeps extensive records on its cus-
tomers, making it easy to retrieve data about the entire population of
customers. However, it would probably be impossible to identify all
individuals who do not own cell phones.

A sample is a subset of a population. For example, a list of individuals


who rented a comedy from Netflix in the past year would be a sample
from the population of all customers. Whether this sample is repre-
sentative of the population of customers—which depends on how the
sample data is intended to be used—may be debatable; nevertheless,

it is a sample. Most populations, even the finite ones, are usually too
large to practically or effectively deal with. For example, it would be
unreasonable as well as costly to survey the TV viewers’ population of
the United States. Sampling is also necessary when data must be ob-
tained from destructive testing or from a continuous production pro-

cess. Thus, the process of sampling aims to obtain enough information


to draw a valid conclusion about a population. Market research-
ers, for example, use sampling to gauge consumer perceptions on new
or existing goods and services; auditors use sampling to verify the
accuracy of financial statements; and quality control analysts sample
production output to verify quality levels and identify opportunities
for improvement.

Understanding Statistical Notation

We typically label the elements of a dataset using subscripted vari-


ables, x1, x2, … and so on. In general, xi represents the ith observation.
In statistics, it is common to use Greek letters, such as σ (sigma), µ (mu) and π (pi), to represent population measures, and italic letters such as x̄ (x-bar), s and p for sample statistics. We will use N to represent the number of items in a population and n to represent the number of observations in a sample. Statistical formulas often contain a summation operator, Σ (Greek capital sigma), which means that the terms that follow it are added together; thus, Σxᵢ = x₁ + x₂ + ⋯ + xₙ. Understanding these conventions and mathematical notations will help you interpret and apply statistical formulas.

5.3.1 CENTRAL TENDENCY (MEAN, MEDIAN AND MODE)

Central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. Measures of central tendency are also called measures of central location. Some common valid measures of central tendency are as follows:
‰‰ Mean

‰‰ Median

‰‰ Mode

‰‰ Midrange

Mean
IM
The mathematical average is called the mean (or the arithmetic mean),
which is the sum of the observations divided by the total number of
observations. The mean of a population is shown by the µ, and the
sample mean is denoted by . If the population contains N observations
x1, x2,…xN, the population mean is calculated as
N

∑x
M

i
µ= i=1

The mean of n observations sample, x1, x2, …xn, ,is calculated as


N

∑x
i=1
i
x=
n

Note that the calculations for the mean are the same whether we are
dealing with a population or a sample; only the notation differs. We
may also calculate the mean in Excel using the function AVERAGE
(data range).

One property of the mean is that the sum of the deviations of each
observation from the mean is zero:

∑ (X
i
i − X) =
0

This simply means that the sum of the deviations above the mean is
the same as the sum of the deviations below the mean. Thus, the mean
“balances” the values on either side of it. However, it does not suggest
that half the data lie above or below the mean.


Median

The measure of location that specifies the middle value when the data are arranged from least to greatest is the median. If the number of observations is odd, the median is the exact middle of the sorted numbers. If the number of observations is even, say 8, the median is the mean of the two middle numbers, i.e. the mean of the 4th and 5th observations. We can use the Sort option of MS Excel to order the data and then find the median. The Excel function MEDIAN (data range) could also be used. The median is meaningful for ratio, interval and ordinal data. As opposed to the mean, the median is not affected by outliers.
Mode

A third measure of location is the mode: the observation that occurs most frequently. The mode is most useful for datasets containing a small number of unique values. You can easily identify the mode from a frequency distribution by identifying the value having the largest frequency, or from a histogram by identifying the highest bar. You may also use the Excel function MODE.SNGL (data range). For frequency distributions or grouped data, the modal group is the group with the greatest frequency.

Midrange

A fourth measure of location that is used occasionally is the midrange.


This is simply the average of the greatest and least values in the data
set.
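These four measures can be computed directly with Python's standard statistics module; a minimal sketch with made-up observations:

```python
import statistics

# Made-up observations, for illustration only.
data = [4, 7, 7, 9, 12, 15, 21]

mean = statistics.mean(data)            # arithmetic mean
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequently occurring value
midrange = (max(data) + min(data)) / 2  # average of the extremes

print(mean, median, mode, midrange)
```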

5.3.2 Variability

A commonly used measure of dispersion is the variance. Basically, variance is the average of the squared deviations of the observations from the mean. The bigger the variance, the more the observations are spread out from the mean, indicating greater variability in the observations.
The formula used for calculating the variance is different for popula-
tions and samples.

The formula for the variance of a population is:


σ² = Σᵢ₌₁ᴺ (xᵢ − µ)² / N

where xi is the value of the ith item, N is the number of items in the
population, and µ is the population mean.


The variance of a sample is calculated by using the formula


s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

where n is the number of items in the sample and x̄ is the sample mean.

Example: A population has four observations: {1, 3, 5, 7}. Find the


variance.

(A) 2 (B) 4  (C) 5  (D) 6  (E) None

Solution: The answer is (C). First, we need to compute the population


mean.

μ = (1 + 3 + 5 + 7) / 4 = 4

Insert all known values into the formula for the variance, as shown
below:
σ² = Σ (xᵢ − μ)² / N
σ² = [ (1 − 4)² + (3 − 4)² + (5 − 4)² + (7 − 4)² ] / 4
σ² = [ (−3)² + (−1)² + (1)² + (3)² ] / 4
σ² = [ 9 + 1 + 1 + 9 ] / 4 = 20 / 4 = 5

5.3.3 Standard Deviation



The square root of the variance is the standard deviation. For a popu-
lation, the standard deviation is computed as:
σ = √[ Σᵢ₌₁ᴺ (xᵢ − µ)² / N ]

and for samples, it is


s = √[ Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) ]

The standard deviation is usually easier to interpret than the variance because its units of measure are the same as those of the data. Thus, it can be more easily related to the mean or other statistics measured in the same units.

The standard deviation is a popular measure of risk, particularly in


financial analysis, because many people associate risk with volatility
in stock prices. The standard deviation measures the tendency of a
fund’s monthly returns to vary from their long-term average (as For-
tune stated in one of its issues, “. . . standard deviation tells you what
to expect in the way of dips and rolls. It tells you how scared you’ll
be.”). For example, a mutual fund’s return might have averaged 11%


with a standard deviation of 10%. Thus, about two-thirds of the time


the annualised monthly return was between 1% and 21%. By contrast,
another fund’s average return might be 14% but have a standard de-
viation of 20%. Its returns would have fallen in a range of -6% to 34%
and, therefore, is riskier.

Example: A random sample consists of four observations: {1, 3, 5, 7}.


Based on these sample observations, what is the best estimate of the
standard deviation of the population?

(A) 2  (B) 2.58  (C) 6  (D) 6.67  (E) None

Solution: The answer is (B). First, compute the sample mean.

x̄ = (1 + 3 + 5 + 7) / 4 = 4

Then, we insert all the known values into formula for calculating the
SD of a sample, as shown below:

s = √[ Σ (xᵢ − x̄)² / (n − 1) ]
s = √{ [ (1 − 4)² + (3 − 4)² + (5 − 4)² + (7 − 4)² ] / (4 − 1) }
s = √{ [ (−3)² + (−1)² + (1)² + (3)² ] / 3 }
s = √{ [ 9 + 1 + 1 + 9 ] / 3 } = √(20 / 3) = √6.67 = 2.58
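These hand computations can be verified with Python's statistics module. The sketch below uses the same numbers as the examples above and also standardises each observation as a z-score, anticipating the next subsection:

```python
import statistics

data = [1, 3, 5, 7]

pop_var = statistics.pvariance(data)  # population variance: 5.0
sample_sd = statistics.stdev(data)    # sample standard deviation: about 2.58

# z-scores: distance of each observation from the mean, in SD units.
mean = statistics.mean(data)
z_scores = [(x - mean) / sample_sd for x in data]

print(pop_var, round(sample_sd, 2), [round(z, 2) for z in z_scores])
```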

Standardised Values

A z-score, or standardised value, provides a measure of the distance


M

of the observation away from the mean, irrespective of the measure-


ment units. In a data set, z-score for the ith observation is calculated
as follows:

We subtract the sample mean from the ith observation, xi, and divide

the result by the sample standard deviation. The numerator denotes


the distance that xi is away from the sample mean; a negative value
designates that xi is at the left of the mean, and a positive value means
it lies at the right. By dividing by the standard deviation, s, we scale
the distance from the mean to express it in units of standard devia-
tions.

Thus, a z-score of 1.0 means that the observation is one standard de-
viation to the right of the mean; a z-score of -1.5 means that the ob-
servation is 1.5 standard deviations to the left of the mean. Thus, even
though two data sets may have different means and standard devia-
tions, the same z-score means that the observations have the same
relative distance from their respective means.
Z-scores can be computed easily on a spreadsheet; however, Excel has
a function that calculates it directly, STANDARDISE (x, mean, stan-
dard_dev).
zᵢ = (xᵢ − x̄) / s


Coefficient of Variation

The coefficient of variation (CV) provides a relative measure of the


dispersion in data relative to the mean and is defined as

CV = Standard Deviation/Mean

Often, the coefficient of variation is multiplied by 100 to be expressed


as a percentage.

This statistic is useful when comparing the variability of two or more


data sets when their scales differ.

The coefficient of variation offers a relative risk to return measure.


The smaller the coefficient of variation, the smaller the relative risk is
for the return provided. The reciprocal of the coefficient of variation,
called return to risk, is often used because it is easier to interpret.

That is, if the objective is to maximise return, a higher return-to-risk
ratio is often considered better. A related measure in finance is the
Sharpe ratio, which is the ratio of a fund’s excess returns (annualised
IM
total returns minus Treasury bill returns) to its standard deviation. If
several investment opportunities have the same mean but different
variances, a rational (risk-averse) investor will select the one that has
the smallest variance. This approach to formalising risk is the basis for
modern portfolio theory, which seeks to construct minimum-variance
portfolios.
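A small sketch (standard library only, with made-up monthly returns for two hypothetical funds) shows the coefficient of variation and its reciprocal, return to risk:

```python
import statistics

# Made-up monthly returns (%) for two hypothetical funds.
fund_a = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3]
fund_b = [3.0, -1.5, 4.2, -0.8, 2.9, 1.0]

for name, returns in (("Fund A", fund_a), ("Fund B", fund_b)):
    mean = statistics.mean(returns)
    sd = statistics.stdev(returns)
    cv = sd / mean                       # relative dispersion per unit of return
    print(f"{name}: CV = {cv:.2f}, return to risk = {1 / cv:.2f}")
```

The fund with the smaller CV offers less relative risk for the return provided.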

self assessment Questions

4. ____ involves collecting, organising, analysing, interpreting


and presenting data.
N

5. Sampling is also clearly necessary when data must be obtained


from destructive testing or from a continuous production
process. (True/False)
6. The _____is meaningful for ratio, interval and ordinal data.

Activity

Prepare a report on the relationship between statistical analytical


concepts and their usage in analytical sciences in the simplest man-
ner possible.

5.4 Sampling and Estimation


The first steps in sampling require an effective sampling plan to be designed that will produce representative samples of the populations under scrutiny. A sampling plan is a description of the approach that is used to obtain samples from a population prior to any data collection activity.


A sampling plan states the:


‰‰ Objectives of the sampling activity
‰‰ Target population
‰‰ Population frame (the list from which the sample is selected)
‰‰ Method of sampling
‰‰ Operational procedures for collecting the data
‰‰ Statistical tools that will be used to analyse the data

Example: A sampling plan for a market research study

Suppose that a company in America wants to understand how golf-


ers might respond to a membership program that provides discounts
at golf courses in the golfers’ locality as well as across the country.

The objective of a sampling study might be to estimate the proportion
of golfers who would likely subscribe to this programme. The target
population might be all golfers over 25 years old. However, identify-
ing all golfers in America might be impossible. A practical population
frame might be a list of golfers who have purchased equipment from
national golf or sporting goods companies through which the discount
card will be sold. The operational procedures for collecting the data
might be an e-mail link to a survey site or direct-mail questionnaire.
The data might be stored in an Excel database; statistical tools such
as PivotTables and simple descriptive statistics would be used to seg-

ment the respondents into different demographic groups and estimate


their likelihood of responding positively.

5.4.1 Sampling Methods



Many types of sampling methods exist. Sampling methods can be subjective or probabilistic. Subjective methods include judgment sampling, in which expert judgment is used to select the sample, and convenience sampling, in which samples that are easier to collect are selected (e.g., surveying all customers who visited this month). Probabilistic sampling involves selecting items using a random procedure and is necessary for drawing valid statistical conclusions.
pling includes items selection from the sample using a random proce-
dure and it is necessary to derive effective statistical conclusions.

The most common probabilistic sampling approach is simple random sampling. Simple random sampling requires choosing items from a population such that every subset of a given sample size has an equal opportunity of being selected. Simple random samples can be easily obtained if the population data is kept in a database.

Other methods of sampling include the following (a short code sketch after this list illustrates some of these approaches):


‰‰ Systematic (periodic) sampling: Systematic (or periodic) sampling is a sampling plan that selects every specified nth item from the population. For example, to sample 200 names from a list of 400,000, the first name can be randomly selected from the first 2,000, and then every 2,000th name can be selected. This approach can be used for telephone sampling supported by an automated dialler used to dial numbers in an orderly manner. However, systematic sampling differs from simple random sampling in that, for any given sample, not every possible sample of a given size from the population has an equal chance of being selected. In some situations, this method can introduce significant bias if the population has some underlying pattern. For example, sampling the orders received on each Sunday may not produce a representative sample if consumers tend to order more or less on other days.
‰‰ Stratified sampling: It applies to populations divided into natu-
ral subsets (strata). For example, a large city may be divided into
political districts called wards. Each ward has a different number
of citizens. A stratified sample would choose a sample of individu-
als in each ward proportionate to its size. This approach ensures

that each stratum is weighted by its size relative to the population
and can provide better results than simple random sampling if the
items in each stratum are not homogeneous. However, issues of
IM
cost or significance of certain strata might make a disproportion-
ate sample more useful. For example, the ethnic or racial mix of
each ward might be significantly different, making it difficult for a
stratified sample to obtain the desired information.
‰‰ Cluster sampling: It refers to dividing a population into clusters
(subgroups), sampling a cluster set, and conducting a complete

survey within the sampled clusters. For instance, a company might


segment its customers into small geographical regions. A cluster
sample would consist of a random sample of the geographical re-
gions, and all customers within these regions would be surveyed
(which might be easier because regional lists might be easier to

produce and mail).


‰‰ Sampling from a continuous process: Selecting a sample from
a continuous manufacturing process can be accomplished in two
main ways. First, select a time at random; then select the next n
items produced after that time. Second, randomly select n times;
select the next item created after each of these times. The first
approach generally ensures that the observations will come from
a homogeneous population; however, the second approach might
include items from different populations if the characteristics of
the process should change over time, so caution should be used.
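The sketch below (standard library only, with a made-up population of customer IDs tagged by ward) is an informal illustration of simple random, systematic and stratified sampling, not a full survey design:

```python
import random

random.seed(42)  # reproducible, for illustration

# Made-up population: 1,000 customers, each assigned to one of 5 wards.
population = [(i, f"ward-{i % 5}") for i in range(1000)]

# Simple random sampling: every subset of size n is equally likely.
simple = random.sample(population, 50)

# Systematic sampling: random start, then every kth item.
k = len(population) // 50
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample each ward proportionately to its size.
strata = {}
for item in population:
    strata.setdefault(item[1], []).append(item)
stratified = []
for ward, members in strata.items():
    share = round(50 * len(members) / len(population))
    stratified.extend(random.sample(members, share))

print(len(simple), len(systematic), len(stratified))
```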

5.4.2 Estimation Methods

Sample data provides the basis for many useful analyses to support decision making. Estimation involves assessing the value of an unknown population parameter—such as a population proportion, population mean or population variance—using sample data. Estimators are the measures used to approximate population parameters; e.g., we use the sample mean x̄ to approximate a population mean µ. The sample variance s² estimates a population variance σ², and the sample proportion p estimates a population proportion π. A point estimate is a single number, computed from sample data, that is used to estimate the value of a population parameter.

Unbiased Estimators

It seems quite intuitive that the sample mean should provide a good
point estimate for the population mean. However, it may not be clear
why the formula for the sample variance we read previously, has a
denominator of n - 1, particularly because it is different from the for-
mula for the population variance. In these formulas, the population
variance is computed by
σ² = Σᵢ₌₁ᴺ (xᵢ − µ)² / N

Whereas, the sample variance is computed by the formula


s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

Why so? Statisticians develop many types of estimators, and from both a theoretical and a practical perspective, it is important that the estimators truly estimate the population parameters they are meant to estimate. Suppose we performed an experiment in which we repeatedly sampled from a population and computed a point estimate for a population parameter. Each individual point estimate will vary from the population parameter; however, we would hope that the long-term average (expected value) of all possible point estimates equals the population parameter. If the expected value of an estimator equals the population parameter it is intended to estimate, the estimator is said to be unbiased; otherwise, the estimator is biased and will yield incorrect results.

Fortunately, all the estimators we have discussed are unbiased and therefore meaningful for making decisions about the population parameter. Statisticians have shown that the denominator n – 1 used in computing s² is necessary to provide an unbiased estimator of σ². If we simply divided by the number of observations, the estimator would tend to underestimate the true variance.
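A small simulation makes this bias concrete. The following sketch (standard library only, made-up population) repeatedly samples and compares dividing by n with dividing by n − 1:

```python
import random
import statistics

random.seed(1)  # reproducible, for illustration

population = [random.gauss(50, 10) for _ in range(10_000)]
true_var = statistics.pvariance(population)

biased, unbiased = [], []
for _ in range(5_000):
    sample = random.sample(population, 10)
    m = sum(sample) / len(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / len(sample))          # divide by n
    unbiased.append(ss / (len(sample) - 1))  # divide by n - 1

print(f"true variance: {true_var:.1f}")
print(f"average of /n estimates: {statistics.mean(biased):.1f} (underestimates)")
print(f"average of /(n-1) estimates: {statistics.mean(unbiased):.1f}")
```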

Errors in Point Estimation

One of the drawbacks of using point estimates is that they do not provide any indication of the magnitude of the potential error in the estimate. A newspaper reported that college professors were the best-paid employees in the area, with an average pay of $150,004. However, it was found that the average pay at two local universities was less than $70,000. How did this happen? It was revealed that the sample size taken was very small and included a large number of highly paid medical school faculty; as a result, there was a significant error in the point estimate that was used.

When we sample, the estimators we use—such as a sample mean,


sample proportion or sample variance—are actually random variables
that are characterised by some distribution. By knowing what this dis-
tribution is, we can use probability theory to quantify the uncertainty
associated with the estimator. To understand this, we first need to dis-
cuss sampling error and sampling distributions.

Different samples from the same population have different characteristics—for example, variations in the mean, standard deviation, frequency distribution and so on. Sampling error occurs because samples are only a subset of the total population. Sampling error can be reduced but not completely avoided. Another type of error, non-sampling error, occurs when the sample does not represent the target population effectively. This is generally a result of poor sample design, such as using a convenience sample when a simple random sample would have been more appropriate, or choosing the wrong population frame. To draw good conclusions from samples, analysts need to eliminate non-sampling error and understand the nature of sampling error.

Sampling error depends on the size of the sample relative to the population. Thus, determining the sample size to be taken is essentially a statistical issue based on the precision of the estimates needed to draw a useful conclusion. From a practical point of view, one should also consider the cost of sampling and make a trade-off between cost and the information obtained.

Understanding Sampling Error



Suppose that we estimate the mean of a population using the sample


mean. How can we determine how accurate we are? In other words,
can we make an informed statement about how far the sample mean
might be from the true population mean? We could gain some insight
into this question by performing a sampling experiment.

Sampling Distributions

We can quantify the sampling error in estimating the mean for any
unknown population. To do this, we need to characterise the sampling
distribution of the mean.

Sampling Distribution of the Mean

The means of all possible samples of a fixed size n from some popu-
lation will form a distribution that we call the sampling distribution
of the mean. The histograms are approximations to the sampling dis-
tributions of the mean based on 25 samples. Statisticians have shown
two key results about the sampling distribution of the mean. First, the standard deviation of the sampling distribution (called the standard error of the mean) is computed as:

Standard Error of the Mean = σ/√n

where σ is the standard deviation of the population from which the individual observations are drawn and n is the sample size. From this formula, we see that as n increases, the standard error decreases. This suggests that the estimates of the mean that we obtain from larger sample sizes provide greater accuracy in estimating the true population mean. In other words, larger sample sizes have smaller sampling errors. Second, if the sample size is large enough, the sampling distribution of the mean is approximately normal regardless of the shape of the population distribution; this result is known as the central limit theorem.
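These results can be checked with a short simulation. The following is a minimal Python sketch (the exponential population, sample sizes and replication counts are illustrative assumptions); the observed standard deviation of the sample means should track σ/√n:

import numpy as np

rng = np.random.default_rng(42)
# hypothetical skewed population of 100,000 values
population = rng.exponential(scale=50, size=100_000)

for n in (10, 100, 1000):
    # draw 2,000 samples of size n and record each sample mean
    means = [rng.choice(population, size=n).mean() for _ in range(2_000)]
    print(n, round(np.std(means), 3), round(population.std() / np.sqrt(n), 3))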

Confidence Intervals

Confidence interval estimates provide a way of assessing the accuracy of a point estimate. A confidence interval is a range of values within which the true (unknown) value of the population parameter is believed to lie, together with an associated probability. This probability is called the level of confidence, denoted by 1 – α, where α is a number between 0 and 1.

The level of confidence is usually expressed as a percent. (Note that if the level of confidence is 90%, then α = 0.1.) The margin of error depends on the level of confidence and the sample size. For example, suppose that the margin of error for some sample size and a level of confidence of 95% is calculated to be 2.0. One sample might yield a point estimate of 10. Then, a 95% confidence interval would be [8, 12]. This means that if the sample mean is 10, we can be 95% confident that the population mean lies between 8 and 12. However, any particular interval may or may not include the true population mean.
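As an illustration, a confidence interval for the mean can be computed from a small sample with SciPy; the data values below are hypothetical:

import numpy as np
from scipy import stats

# hypothetical sample of eight measurements
data = np.array([9.2, 10.5, 8.8, 11.1, 9.9, 10.4, 9.5, 10.8])
sem = stats.sem(data)                                   # estimated standard error
lo, hi = stats.t.interval(0.95, len(data) - 1, loc=data.mean(), scale=sem)
print(f"95% confidence interval: [{lo:.2f}, {hi:.2f}]")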
N

Additional Types of Confidence Intervals

Confidence intervals may be computed for other population parameters, such as the standard deviation or variance, and for differences in the means or proportions of two populations. The concepts are similar to the types of confidence intervals we have discussed, but many of the formulas are rather complex and more difficult to implement on a spreadsheet.

Prediction Intervals

Another type of interval used in estimation is a prediction interval. A prediction interval is one that provides a range for predicting the value of a new observation from the same population. This is different from a confidence interval, which provides an interval estimate of a population parameter. A confidence interval is associated with the sampling distribution of a statistic, whereas a prediction interval is associated with the distribution of the random variable itself.

When the population standard deviation is unknown, a 100(1 – α)% prediction interval for a new observation is:

x̄ ± t_{α/2, n−1} · s√(1 + 1/n)

Note that this interval is wider than the confidence interval by the
additional value of 1 under the square root. This is because, in addi-
tion to estimating the population mean, we must also account for the
variability of the new observation around the mean.
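The formula can be applied directly; the following minimal sketch (using the same hypothetical sample as in the confidence interval example) shows how the extra 1 under the square root widens the interval:

import numpy as np
from scipy import stats

data = np.array([9.2, 10.5, 8.8, 11.1, 9.9, 10.4, 9.5, 10.8])  # hypothetical sample
n, s = len(data), data.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)                 # t-value for 95% coverage
half = t * s * np.sqrt(1 + 1 / n)                # note the extra 1 under the root
print(f"95% prediction interval: [{data.mean() - half:.2f}, {data.mean() + half:.2f}]")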

self assessment Questions

7. A sampling plan states:


a. The objectives of the sampling activity

S
b. The population frame
c. The method of sampling
d. All of these
IM
8. The most common probabilistic sampling approach is simple
____ sampling.
9. A _____ sample would choose a sample of individuals in each
ward proportionate to its size.
M

Activity

Create a PowerPoint presentation on quantitative methods and show it in your class.
N

5.5 Introduction to Probability Distributions
The concept of probability is prevalent everywhere, from stock mar-
ket predictions and market research to weather forecasts. In a busi-
ness, managers need to know the likelihood that a new product will
be profitable or the chances that a project will be completed on time.
Probability quantifies the uncertainty that we encounter all around us
and is an important building block for business analytics applications.
Probability is the likelihood that an outcome occurs. Probabilities are
expressed as values between 0 and 1, although many people convert
them to percentages. The statement that there is a 10% chance that oil
prices will rise next quarter is another way of stating that the proba-
bility of a rise in oil prices is 0.1.

The closer the probability is to 1, the more likely it is that the outcome
will occur. Before we discuss probability, let’s get familiarised with its
terminology.


Experiment: An experiment is a process that results in an outcome. An experiment can be as straightforward as tossing a coin or a complex
one such as conducting a market research study, observing weather
conditions or the stock market.

Outcome: The outcome of an experiment is the result that we observe. The collection of all possible outcomes of an experiment is the sample space. For instance, if we roll two fair dice, the possible outcomes (sums) are the numbers 2 through 12. A sample space may consist of a small number of discrete outcomes or an infinite number of outcomes.

Probability may be defined from one of the following three perspectives:
‰‰ First, if the process that generates the outcomes is known, probabilities can be deduced from theoretical arguments; this is the classical definition of probability.
‰‰ The second approach to probability, called the relative frequency definition, is based on empirical data. The probability that an outcome will occur is simply the relative frequency associated with that outcome.
‰‰ Finally, the subjective definition of probability is based on judgment and experience, as financial analysts might use in predicting a 75% chance that the DJIA will increase 10% over the next year, or as sports experts might predict, at the start of the football season, a 1-in-5 chance (0.20 probability) of a certain team making it to the final.

The definition to use depends on the specific application and the available information. We will see various examples that draw upon each of these perspectives.

Probability Rules and Formulas

Suppose we label the n outcomes in a sample space as O1, O2, ..., On, where Oi represents the ith outcome in the sample space, and let P(Oi) be the probability associated with the outcome Oi. Two elementary facts hold:
‰‰ The probability associated with any outcome must be between 0 and 1, or
0 ≤ P(Oi) ≤ 1 for each outcome Oi
‰‰ The sum of the probabilities over all possible outcomes must be 1, or
P(O1) + P(O2) + … + P(On) = 1


An event is a collection of one or more outcomes from a sample space. An example of an event would be rolling a 7 or an 11 with two dice. This leads to the following rules:
‰‰ Rule 1: The probability of any event is the sum of the probabilities
of the outcomes that comprise that event.
‰‰ Rule 2: If A is any event, the complement of A, denoted AC, con-
sists of all outcomes in the sample space not in A. The probability
of the complement of any event A is P(AC) = 1 – P(A).
‰‰ Rule 3: The union of two events contains all outcomes that belong to either of the two events. To illustrate this with the rolling of two dice, let A be the event {7, 11} and B be the event {2, 3, 12}.

The union of A and B is the event {2, 3, 7, 11, 12}. The probability that some outcome in either A or B (i.e., the union of A and B) occurs is denoted as P(A or B). Finding this probability depends on whether the events are mutually exclusive or not. Two events are mutually exclusive if they have no outcomes in common. The events A and B in this example are mutually exclusive. The following rules apply:
‰‰ If events A and B are mutually exclusive, then P(A or B) = P(A) + P(B)
‰‰ If two events A and B are not mutually exclusive, then P(A or B) = P(A) + P(B) – P(A and B). Here, (A and B) represents the intersection of events A and B, that is, all outcomes belonging to both A and B.
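These rules can be verified by brute-force enumeration. The following is a minimal Python sketch using the two-dice events from the text:

from itertools import product
from fractions import Fraction

# enumerate all 36 equally likely sums of two fair dice
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
p = lambda event: Fraction(sum(1 for s in sums if s in event), len(sums))

A, B = {7, 11}, {2, 3, 12}     # mutually exclusive events from the text
print(p(A) + p(B))             # P(A) + P(B)
print(p(A | B))                # P(A or B); equal, since A and B are disjoint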

Conditional Probability
Conditional probability is the probability of occurrence of one event A, given that another event B is known to be true or has already occurred. Conditional probabilities are useful in analysing data in cross-tabulations, as well as in other types of applications. Many companies save purchase histories of customers to predict future sales. Conditional probabilities can help to predict future purchases based on past purchases.

The conditional probability of an event A given that event B is known to have occurred is:

P(A | B) = P(A and B) / P(B)

We read the notation P(A|B) as “the probability of A given B.”
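A minimal sketch that computes a conditional probability by enumeration (the dice events chosen here are illustrative):

from itertools import product
from fractions import Fraction

sums = [a + b for a, b in product(range(1, 7), repeat=2)]   # two fair dice
p = lambda event: Fraction(sum(1 for s in sums if s in event), len(sums))

A = {7}                                  # event: the sum is 7
B = {s for s in sums if s % 2 == 1}      # event: the sum is odd
print(p(A & B) / p(B))                   # P(A | B) = P(A and B)/P(B) -> 1/3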

Random Variables and Probability Distributions

Some experiments naturally have numerical outcomes, such as a roll of the dice, the time it takes to repair computers, or the weekly change
in a stock market index. For other experiments, such as obtaining consumer response to a new product, the sample space is categorical. To


have a consistent mathematical basis for dealing with probability, we
would like the outcomes of all experiments to be numerical. A random
variable is a numerical description of the outcome of an experiment.
If we have categorical outcomes, we can associate an arbitrary nu-
merical value to them. For example, if a consumer likes a product in
a market research study, we might assign this outcome a value of 1;
if the consumer dislikes the product, we might assign this outcome
a value of 0. Random variables are usually denoted by capital italic
letters, such as X or Y.

Random variables may be discrete or continuous. A discrete random variable is one for which the number of possible outcomes can be counted. A continuous random variable has outcomes over one or more continuous intervals of real numbers.

A probability distribution is a characterisation of the possible values that a random variable may take on, along with the probability of taking those values. A probability distribution can be continuous or discrete, depending on the nature of the random variable it represents.

We may develop a probability distribution using any one of the three perspectives of probability.
‰‰ First, if we can quantify the probabilities associated with the values of a random variable from theoretical arguments, then we can easily define the probability distribution.
‰‰ Second, we can calculate the relative frequencies from a sample of empirical data to develop a probability distribution.
‰‰ Finally, we could simply specify a probability distribution using subjective values and expert judgment. This is often done in creating decision models for phenomena for which we have no historical data.

Researchers have identified many common types of probability distributions that are useful in a variety of applications of business analytics.
A working knowledge of common families of probability distributions
is important for several reasons. First, it can help you to understand
the underlying process that generates sample data. We will investi-
gate the relationship between distributions and samples later. Second,
many phenomena in business and nature follow some theoretical dis-
tribution and, therefore, are useful in building decision models. Final-
ly, working with distributions is essential in computing probabilities of
occurrence of outcomes to assess risk and make decisions.

Discrete Probability Distributions


For a discrete random variable X, the probability distribution of the discrete outcomes is called a probability mass function and is denoted by a mathematical function, f(x). The symbol xi represents the ith value of the random variable X and f(xi) is the corresponding probability.

Bernoulli Distribution

The Bernoulli distribution describes a random variable with two possible outcomes, each having a constant probability of occurrence.
A success can be any outcome you define. For example, in attempting
to boot a new computer just off the assembly line, we might define a
success as “does not boot up” in defining a Bernoulli random vari-
able to characterise the probability distribution of a defective product.
Thus, success need not be a favourable result in the traditional sense.

Binomial Distribution

The binomial distribution models n independent replications of a Bernoulli experiment, each with a probability p of success. The random variable X represents the number of successes in these n experiments. Let us consider a telemarketing example: suppose we call n = 10 customers, each of which has a probability p = 0.2 of making a purchase. Then the probability distribution of the number of positive responses obtained from 10 customers is binomial. Using the binomial distribution, we can calculate the probability that exactly x customers out of the 10 will make a purchase. The value of x will always be between 0 and 10. A binomial distribution might also be used to model the results of sampling inspection in a production operation or the effects of drug research on a sample of patients.
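For the telemarketing example, the binomial probabilities can be computed with SciPy as a quick sketch:

from scipy.stats import binom

n, p = 10, 0.2                   # 10 calls, each with a 0.2 purchase probability
print(binom.pmf(2, n, p))        # P(exactly 2 purchases), about 0.302
print(binom.cdf(3, n, p))        # P(at most 3 purchases), about 0.879
print(binom.mean(n, p))          # expected number of purchases = n*p = 2.0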

Poisson Distribution
N

The Poisson distribution is a discrete distribution used to model the number of occurrences in some unit of measure—for example, the number of customers arriving at a Subway store during a weekday lunch hour, the number of failures of a machine during a month, the number of visits to a Web page during 1 minute, or the number of errors per line of software code. The Poisson distribution assumes no limit on the number of occurrences (meaning that the random variable X may assume any non-negative integer value), that occurrences are independent and that the average number of occurrences per unit is a constant, λ (Greek lowercase lambda). The expected value of the Poisson distribution is λ, and the variance is also equal to λ.
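A brief SciPy sketch (the arrival rate λ = 4 per minute is an assumed value):

from scipy.stats import poisson

lam = 4                              # assumed average occurrences per unit
print(poisson.pmf(2, lam))           # P(exactly 2 occurrences), about 0.147
print(1 - poisson.cdf(6, lam))       # P(more than 6 occurrences), about 0.111
print(poisson.mean(lam), poisson.var(lam))   # both equal to lambda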

Uniform Distribution

The uniform distribution describes a continuous random variable for which all outcomes between some minimum and maximum value are equally likely. The uniform distribution is often assumed in business analytics applications when little is known about a random variable other than reasonable estimates for minimum and maximum values. The parameters are chosen judgmentally to reflect a modeller's best guess about the range of the random variable.

Normal Distribution

The normal distribution is a continuous distribution that is described by the familiar bell-shaped curve and is perhaps the most important distribution used in statistics. The normal distribution is observed in many natural phenomena. Test scores such as the SAT, deviations from specifications of machined items, human height and weight and many other measurements are often normally distributed.

The normal distribution is characterised by two parameters: the mean, µ, and the standard deviation, σ. Thus, as µ changes, the location of the distribution on the x-axis also changes, and as σ is decreased or increased, the distribution becomes narrower or wider, respectively.
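A short sketch computing normal probabilities with SciPy (the mean and standard deviation are assumed values for a test-score distribution):

from scipy.stats import norm

mu, sigma = 500, 100                 # hypothetical test-score distribution
print(norm.cdf(600, mu, sigma))      # P(score <= 600), about 0.841
print(norm.ppf(0.95, mu, sigma))     # 95th percentile, about 664.5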

Data Modelling and Distribution Fitting


In many applications of business analytics, we need to collect sam-
ple data of important variables, such as customer demand, purchase
behaviour, machine failure times, service activity times, etc. to gain
an understanding of the distributions of these variables. We can also
construct frequency distributions and histograms and compute basic
descriptive statistical measures to better understand the nature of the
data. However, sample data are just that—samples.
M

Using sample data may limit our ability to predict uncertain events
that may occur because potential values outside the range of the sam-
ple data are not included. A better method is to identify the probability distribution of the sample data by fitting a theoretical distribution to the data and verifying the fit.

To select an appropriate theoretical distribution that fits the sample data, we might begin by examining a histogram of the data to look for distinctive shapes. If the histogram is symmetric with a peak in the middle, the distribution may be normal. If the histogram is very positively skewed with no negative values, the distribution may be exponential. Similarly, a very positively skewed histogram with the density dropping to zero at the edge may indicate a lognormal distribution.

Various forms of the gamma, Weibull, or beta distributions could be used for distributions that do not seem to fit one of the other common
forms. This approach is not, of course, always accurate or valid, and
sometimes it can be difficult to apply, especially if sample sizes are
small. However, it may narrow the search down to a few potential dis-
tributions.

Summary statistics can also provide clues about the nature of a distri-
bution. The mean, median, standard deviation and coefficient of vari-
ation often provide information about the nature of the distribution.


For instance, normally distributed data tend to have a fairly low coef-
ficient of variation (however, this may not be true if the mean is small).

For normally distributed data, we would also expect the median and
mean to be approximately the same. For exponentially distributed
data, however, the median will be less than the mean. Also, we would
expect the mean to be about equal to the standard deviation, or, equiv-
alently, the coefficient of variation would be close to 1. We could also
look at the skewness index. Normal data are not skewed, whereas
lognormal and exponential data are positively skewed. The following
example of analysing airline passenger data will help in understanding the distribution of normally distributed data.

An airline operates a daily route between two medium-sized cities using a 70-seat regional jet. The flight is rarely booked to capacity but often accommodates business travellers who book at the last minute at a high price. A histogram of the daily passenger counts shows a relatively symmetric distribution. The mean, median, and mode are all similar, although there is some degree of positive skewness. It is important to recognise that this is a relatively small sample that can exhibit a lot of variability compared with the population from which it is drawn. Thus, based on these characteristics, it would not be unreasonable to assume a normal distribution for developing a predictive or prescriptive analytics model.
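The summary-statistic checks described above can be scripted. The following is a minimal sketch on simulated passenger data (all values are illustrative assumptions):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# hypothetical daily passenger counts for the 70-seat jet
passengers = np.clip(rng.normal(loc=55, scale=8, size=60).round(), 0, 70)

cv = passengers.std(ddof=1) / passengers.mean()
print(passengers.mean(), np.median(passengers))   # similar values suggest symmetry
print(cv, stats.skew(passengers))                 # low CV, skew near 0: normal is plausible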

self assessment Questions

10. Probabilities are expressed as values between 0 and 10. (True/False)
11. _________ probabilities can help to predict future purchases based on past purchases.


12. A ______ variable is a numerical description of the outcome of
an experiment.

Activity

Outline your plans if you are assigned an opportunity to study, evaluate and come out with an execution plan for a newly launched store chain that is planning to maximise its sales.

5.6 Summary
‰‰ Descriptive analytics is the most essential type of analytics and establishes the framework for more advanced types of analytics.
‰‰ Datavisualisation is the method of showing data in a graphical
manner to provide insights that help take better decisions.


‰‰ Raw data is important, particularly when one needs to identify accurate values or compare individual numbers.
‰‰ Dashboards deliver important key synopses of valuable business
data to efficiently manage a business function or process.
‰‰ Excel refers to vertical bar charts as column charts and to horizontal bar charts as bar charts.
‰‰ Data labels can be added to chart elements to show the actual val-
ue of bars.
‰‰ Pie charts are preferred only in two dimensional form for effective
and simpler data representation.
‰‰ The measure of location that specifies the middle value when the
data sets are arranged from the least to the greatest is the median.

‰‰ Conditional probability is the probability of occurrence of one
event A, given that another event B is known to be true or has
already occurred.
IM
key words

‰‰ Cluster Sampling: It refers to dividing a population into clusters (subgroups), sampling a cluster set, and conducting a complete survey within the sampled clusters.
‰‰ Dashboard: It is a visual picture of a group of specific business measures.
‰‰ Line chart: A type of chart that is used to display data pertaining to a given period.
‰‰ Mean: It is the sum of the observations divided by the total number of observations.
‰‰ Scatter chart: A type of chart that is used to demonstrate the connection between two variables.

5.7 DESCRIPTIVE QUESTIONS


1. Discuss the importance of data visualisation with the help of
suitable examples.
2. What do you understand by descriptive statistics? How are mean, median and mode calculated in statistics?
3. Describe sampling and estimation with suitable examples.
4. Explain the concept of probability distribution. Also, enlist the
rules and formulas used in probability.


5.8 Answers and Hints

Answers for Self-Assessment Questions

Topic Q. No. Answers
Visualising and Exploring Data 1. Visualising
2. Tabular and visual
3. True
Descriptive Statistics 4. Statistics
5. True
6. Median
Sampling and Estimation 7. d. All of these
8. Random
9. Stratified
Introduction to Probability Distributions 10. False
11. Conditional
12. Random

Hints for Descriptive Answers


1. Data visualisation is the method of showing data (typically in larger quantities) in an expressive manner to provide insights that will help in taking better decisions. Refer to Section 5.2 Visualising and Exploring Data.
2. Statistics involves collecting, organising, analysing, interpreting, and presenting data. Refer to Section 5.3 Descriptive Statistics.
3. A sampling plan is a description of the approach that is used to obtain samples from a population prior to any data collection activity. Refer to Section 5.4 Sampling and Estimation.
4. Probability quantifies the uncertainty that we encounter all
around us and is an important building block for business
analytics applications. Refer to Section 5.5 Introduction to
Probability Distributions.

5.9 SUGGESTED READINGS & REFERENCES

Suggested Readings
‰‰ Sheikh, N. M. (2013). Implementing analytics: a blueprint for de-
sign, development, and adoption. Amsterdam: Elsevier.
‰‰ Atzmüller, M., & Roth-Berghofer, T. R. (2016). Enterprise big data engineering, analytics, and management. Hershey: IGI Global.



140 Fundamentals of Big Data & Business Analytics

n o t e s

E-References
‰‰ Descriptive, Predictive, and Prescriptive Analytics Explained. (2016, August 05). Retrieved May 01, 2017, from https://halobi.com/2016/07/descriptive-predictive-and-prescriptive-analytics-explained/
‰‰ Big Data Analytics: Descriptive Vs. Predictive Vs. Prescriptive. (n.d.). Retrieved May 01, 2017, from http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-id/1113279
‰‰ What is descriptive analytics? - Definition from WhatIs.com. (n.d.). Retrieved May 01, 2017, from http://whatis.techtarget.com/definition/descriptive-analytics



Chapter 6

Predictive Analytics

CONTENTS

6.1 Introduction
6.2 Predictive Modelling
6.2.1 Logic Driven Models
6.2.2 Data Driven Models
Self Assessment Questions
Activity
6.3 Introduction to Data Mining
Self Assessment Questions
Activity
6.4 Data Mining Methodologies
6.4.1 Classification
6.4.2 Regression
6.4.3 Clustering (K-means)


6.4.4 Artificial Neural Networks
Self Assessment Questions
Activity
6.5 Summary
6.6 Descriptive Questions
6.7 Answers and Hints
6.8 Suggested Readings & References

Introductory Caselet

Samsung Won over the Market Sentiments Using Predictive Analytics

Global mobile major Samsung Electronics introduced a phone called the Note 7 around October 2016. Although futuristic in speci-
fications with class leading performance, this phone turned out
to be the darkest blot in the otherwise clean bowl of the Samsung
smartphone assembly lines. The phone had critical battery fail-
ure issues which even resulted in few phone explosions across the
world. Airlines across the world banned passengers from board-
ing the flight if they were found to be carrying Note 7 with them.
Samsung restricted the charging to 60% with a firmware upgrade
but in the end, it became a matter of so much ridicule for the com-
pany with decreased levels of brand confidence and customers

fleeing, that led the company to ultimately recall all the phones it
sold and put the lid on the project Note 7 forever – a total loss of
$18 billion.
However, rather than treating it as an incident to beat around the bush with and pinning the blame on quality control, vendors and everyone else, Samsung took it in a positive stride.
ured out the real issue with the battery, fixed the gaps and ex-
ploited the existing market sentiments cleverly by emphasising
on their battery issues openly and steps they took to fix that goof-

up and not staying behind in accepting and recalling the defec-


tive brand like a true professional consumer driven company. Re-
sults? Their competitors too had to follow the suit and declare
the safety features of their devices along with other specifications
and next phone launch of Samsung – Galaxy S8 got rave reviews

and accolades across the technical diaspora and forums. Samsung


achieved this by applying predictive analytics on the data collect-
ed related to the issues of the Note 7 phone. The company pre-
dicted the existing anger and expected the scornful views of the
loyal base of consumers – and gave them quite a few industry-first
reasons to make them believe to their consumer friendly image
again – from issuing credit notes to exchanging devices with S7
Edge device with extra offers to issuing apology notes and to lead-
ing the only complete recall in history of mobiles – Samsung won
over the market sentiments simply by predicting the outpour and
anger of the customers way before it could get worse.


learning objectives

After studying this chapter, you will be able to:


>> Explain predictive modelling
>> Describe the concept of data mining
>> Explore different data mining methodologies

6.1 INTRODUCTION
In the previous chapter, you learned that descriptive analytics analyses a database to provide information on the trends of past or
current business events that can help managers, planners, leaders,
etc., to develop a road map for future actions. Descriptive analytics

performs an in-depth analysis of data to reveal details such as fre-
quency of events, operation costs, and the underlying reason for fail-
ures. It helps in identifying the root cause of the problem. On the other
hand, Predictive analytics is about understanding and predicting the
IM
future and answers the question ‘What could happen?’ by using statis-
tical models and different forecast techniques. It predicts the near fu-
ture probabilities and trends and helps in what-if analysis. In predic-
tive analytics, we use statistics, data mining techniques, and machine
learning to analyse the future. Figure 6.1 shows the steps involved in
predictive analytics:

Figure 6.1: Predictive Analytics


Source: http://www.witinc.com/predictive-analytics.id.355.htm

In this chapter, you will first learn about predictive modelling. Further, the chapter discusses the concept of data mining. Towards the end, the chapter covers different data mining methodologies such as classification, regression, clustering (K-means) and artificial neural networks.


6.2 Predictive Modelling


Predictive modelling is the process of creating, testing and validating a model to best predict the likelihood of an outcome. Several modelling procedures from artificial intelligence, machine learning and statistics are available in predictive analytics software solutions. The model is selected on the basis of testing, validation and assessment, using detection theory to predict the likelihood of an outcome for a given amount of input data. Models can utilise one or more classifiers to decide the probability of a set of data belonging to another set. The different models available in predictive analytics software enable the system to develop new information and predictive models. Each model has its own strengths and weaknesses and is best suited to particular types of problems.

Predictive analysis and models are typically used to forecast future probabilities. Predictive models, in a business context, are used to analyse historical facts and current data to better comprehend customer habits, partners and products and to identify potential risks and opportunities for a company. Predictive analysis uses many techniques, including statistical modelling, data mining and machine learning, to help analysts make better future business predictions.
‰‰ Predictive modelling is at the heart of business decision making.
‰‰ Building decision models is more an art than a science.
‰‰ Creating an ideal decision model demands:
 Good understanding of functional business areas
 Knowledge of conventional and in-trend business practices and research
 Logical skillset
‰‰ It is always recommended to start simple and keep on adding to the models as required.

The greatest changes and advances in predictive modelling are coming to fruition due to the increase in unstructured information—text archives, video, voice and pictures—combined with rapidly improving analytical methods. Basically, predictive modelling requires structured data—the kind found in relational databases. To make unstructured data sets useful for this sort of analysis, structured data must be extracted from them first. One example is sentiment analysis of Web posts. Data can be found in customer posts on forums, blogs and other sources that predict consumer satisfaction and sales trends for new products. It would be all but impossible, however, to attempt to build a predictive model directly from the text in the posts themselves. An extraction step is required to obtain usable data as keywords, phrases and meaning from the content in the posts. At that point, it is possible to search for the con-
nection between instances of phrases such as "issues with the product", for example, and increases in customer service calls.

Predictive models are representations of the relationship between how a member of a sample performs and some of the known charac-
teristics of the sample. The aim is to assess how likely a similar mem-
ber from another sample is to behave in the same manner. This model
is used a lot in marketing. It helps identify implied patterns which
indicate customers’ preferences. This model can even perform calcu-
lations at the exact time that a customer performs a transaction.

Predictive analytics methods depend on quantifiable variables and measurable metrics to forecast future performance or outputs.

A predictive analytics model combines many predictors or quantifiable variables. This method allows for the data collection and prepa-
ration of a statistical model, to which extra data can be added as and
when available.
IM
Accumulating larger data volumes improves a predictive model, as larger data sets produce more dependable forecasts based on the volume of data examined. Moreover, relying on actual data to power predictive analytics models improves the accuracy of the predicting process.
M

The various business processes on predictive modelling are as follows:


1. Creating the model: A software based solution allows you to
make a model to multiple algorithms on the dataset.
2. Testing the model: Test the predictive model on the dataset.
N

In some situations, the testing is done on previous data to the


effectiveness of a model’s prediction.
3. Authenticating the model: Authenticate the model results by
means of business data understanding and visualisation tools.
4. Assessing the model: Assessing the best suited model from the
used models and selecting the appropriate model tailored for the
data.

The predictive modelling process includes executing one or more al-


gorithms on the dataset subjected to prediction. This is a recurring
process and often includes model training, using several models on
the same dataset and lastly getting the appropriate model based on
the business data.

6.2.1 Logic Driven Models

Logic driven models are created on the basis of inferences and postu-
lations which the sample space and existing conditions provide. Creat-
ing logical models requires solid understanding of business functional


areas, logical skills to evaluate the propositions better and knowledge


of business practices and research.

To understand better, let's take an example of a customer who visits a restaurant around six times in a year and spends around ₹5,000 per visit. The restaurant gets around a 40% margin on the billing amount of each visit.

The annual gross profit on that customer turns out to be 5,000 × 6 × 0.40 = ₹12,000.

30% of the customers do not return each year, while 70% do return to
provide more business to the restaurant.

Assuming the average lifetime of a customer (the time for which a consumer remains a customer) is 1/0.3 = 3.33 years, the average gross profit for a typical customer turns out to be 12,000 × 3.33 = ₹39,960.
S
profit for a typical customer turns out to be 12000×3.33 = `39,960

Armed with all the above details, we can logically arrive at a conclu-
sion and can derive the following model for the above problem state-
IM
ment:

Economic value of each customer (V) = (R × F × M)/D

where,
R = Revenue generated per customer visit
F = Frequency of visits per year
M = Profit margin
D = Defection rate (non-returning customers each year)



So, as you can see, logic driven predictive models can be derived for a number of situations, conditions and problem statements, and for many other scenarios where predictive analytical models provide a futuristic view on the basis of validation, testing and evaluation to estimate the likelihood of an outcome for a given amount of input data.
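The model translates directly into code. A minimal sketch with the restaurant figures from the example (the function name is illustrative):

def customer_value(revenue_per_visit, visits_per_year, margin, defection_rate):
    """Economic value of a customer: V = (R * F * M) / D."""
    return revenue_per_visit * visits_per_year * margin / defection_rate

# restaurant example: ₹5,000 per visit, 6 visits/year, 40% margin,
# 30% of customers not returning each year
print(customer_value(5000, 6, 0.40, 0.30))   # -> 40000.0 (≈ ₹39,960 with 3.33-year rounding)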

6.2.2 Data Driven Models

A data-driven model is based on the data analysis of a specific system. The main concept of a data-driven model is to find links between the system state variables (input and output) without explicit knowledge of the physical attributes and behaviour of the system. Data driven predictive modelling derives the modelling method from the set of existing data and entails a predictive methodology to forecast future outcomes. A company expecting losses in the current quarter due to poor market performance and sentiments is a classic example of data driven predictive modelling. You have the data and you know the inferences from the data. You need not predict anything related to the data, unlike logic driven models. You are simply predicting the out-

comes based on the data. Refer to the caselet in this chapter for data
driven modelling – Samsung’s case with their product and their en-
suing actions as a good example of data driven predictive modelling.

self assessment Questions

1. Predictive analysis is all about predicting outcomes. (True/


False)
2. _______ is at the heart of business decision making.
3. Logical models differ from data driven models based on the
size and type of input variables available. (True/False)
4. Economic value of each customer (V) = ____________.

S
Activity

Create a data driven model using MS Excel to denote the variation in a product's sales for the last 3 years.
IM
6.3 Introduction to Data Mining
Data mining is a growing field of business analytics focused on better understanding the characteristics and patterns among variables in large databases using a variety of analytical and statistical tools. Most of the tools discussed in earlier chapters, such as data visualisation, data summarisation, PivotTables, correlation, regression analysis, etc., can be used in data mining extensively. However, as the amount of data has grown exponentially, many other statistical and analytical methods have been developed to identify relationships among variables in large data sets and understand hidden patterns that they may contain. Figure 6.2 shows the four stages of data mining:
large data sets and understand hidden patterns that they may contain.
Figure 6.2 shows the four stages of data mining:

The four stages are: (1) data sources, which range from databases to news wires and contribute to problem definition; (2) data exploration, which involves the sampling and transformation of data; (3) modelling, in which users create a model, test it and then evaluate it; and (4) deploying models, i.e., taking an action based on the results from the model.

Figure 6.2: Four Stages of Data Mining


Source: http://searchsqlserver.techtarget.com/definition/data-mining

Data mining can be considered part descriptive and part predictive analytics. In descriptive analytics, data-mining tools help analysts
to identify patterns in data. Excel charts and PivotTables, for exam-
ple, are useful tools for describing patterns and analysing data sets;
however, they require manual intervention. Regression analysis and
forecasting models help us to predict relationships or future values of
variables of interest.


In most business applications, the purpose of data mining is to help managers predict the future or make better decisions that will impact future performance, so we can generally state that data mining is primarily a predictive analytics approach. Some core ideas in data mining are as follows:
‰‰ Classification: Classification is the most essential type of data analysis. The recipient of an offer can respond or not respond. An applicant for a loan can repay on time, repay late, or default. A credit card charge can be normal or fraudulent. A data packet travelling on a network can be good or bad. A bus in a fleet can be available for service or unavailable. A patient can recover, still be sick, or expire. A typical task in data mining is to analyse data where the classification is unknown or will occur in the future. Similar data where the classification is known are utilised to create rules, which are then applied to the data with the unknown classification. We will study classification in more detail further in the chapter.
IM
‰‰ Prediction: Prediction resembles classification, except that we are attempting to predict the value of a numerical variable (e.g., amount of purchase) rather than a class (e.g., buyer or non-buyer). Of course, in classification we are also trying to predict a class, but the term prediction here refers to the prediction of the value of a continuous variable. Sometimes, in data mining terminology, the terms estimation and regression are used to refer to the prediction of the value of a continuous variable, and prediction may be used for both continuous and categorical data.
‰‰ Association rules and recommendation systems: Huge databases of customer transactions lend themselves to the analysis of associations among items purchased, or "what goes with what."

ysis among things acquired, or “what runs with what.”
Association rules are intended to discover such general association patterns among items in large databases. The rules can then be utilised in a variety of ways. For instance, supermarkets can utilise such data for product placement. They can utilise the rules for weekly special offers or for bundling products.
Association rules derived from a hospital database on patients' symptoms during successive hospitalisations can help discover "which symptom is followed by what other symptom" and help predict future symptoms for returning patients. Online recommendation systems, such as those used on Amazon.com and Netflix.com, use collaborative filtering, a method that uses individual users' preferences and tastes given their past purchases, ratings, browsing, or any other measurable behaviour indicative of preference, together with other users' histories. As opposed to classification, which creates rules general to an entire population, collaborative filtering generates "what goes with what" at


the individual user level. Hence, collaborative filtering is used in numerous recommendation systems that aim to deliver personalised recommendations to users with a wide variety of preferences.

self assessment Questions

5. Data mining is a practice of scrubbing out the data from various sources for further evaluation and analytical purposes. (True/False)
6. Which of the following is/are the tools used in data mining?
a. Data visualisation b. Data summarisation
c. Correlation d. All of these

S
7. Predictive analysis deals with data mining in the same way
business analytics deals with raw data. (True/False)
8. The third stage in data mining is ________.
IM
9. Data mining is solely predictive analytical strategy since
descriptive and prescriptive analytics deal with data only
after receiving it and predictive analysis forecasts the data
outcomes. (True/False)
M

Activity

Create a PowerPoint presentation on techniques used in data mining and show it in your class.

6.4 Data Mining Methodologies


Databases can accommodate vast quantities of data that aids in de-
cision making. As discussed earlier, data mining is a set of tools and
techniques that help organisations to perform this task. Some com-
mon approaches used in data mining include the following:
‰‰ Data Exploration and Reduction: This often involves identifying
groups in which the elements of the groups are in some way simi-
lar. This approach is often used to understand differences among
customers and segment them into homogenous groups. For exam-
ple, a department store recognised four lifestyles of its customers:
 “Kacy,” a traditional, classic dresser who loves quality and takes few risks;
 “Brenda,” a hybrid of traditional and contemporary, classic but with a modern touch;
 “Victoria,” a modern, contemporary brand lover customer;
and finally


 “Alex,” the fashion-oriented customer who seeks the newest and best.
Such segmentation is useful in design and marketing activities to
better target product offerings. These techniques have also been
used to identify characteristics of successful employees and im-
prove recruiting and hiring practices.
‰‰ Association: Association is the analysis of databases to recognise natural associations among variables and create buying recommendations
or target marketing rules. For example, Netflix uses association to
understand what types of movies a customer likes and provides
recommendations based on the data. Amazon.com also makes rec-
ommendations based on past purchases.
‰‰ Cause-and-effect modelling: Cause-and-effect modelling is the
process of developing analytic models to describe the relationship

S
between metrics that drive business performance—for instance,
profitability, customer satisfaction, or employee satisfaction. Un-
derstanding the drivers of performance can lead to better deci-
IM
sions to improve performance. For example, the controls group
of CGL Inc. evaluated the relationship between contract-renewal
rates and overall satisfaction. They concluded that 91% contract
renewals were of the customers who were either very satisfied or
satisfied, and higher defection rate for not satisfied customers.
Their model foretold that a one-percent-point surge in the gen-
eral satisfaction score was worth $12 million in renewals of yearly
M

service contracts. As a result, they identified decisions that would


improve customer satisfaction. Regression and correlation analy-
sis are key tools for cause-and-effect modelling.
N

6.4.1 Classification
Classification is the process of analysing data to predict how to classify
a new data element. An example of classification is spam filtering in
an e-mail client. By examining textual characteristics of a message
(subject header, key words, and so on), the message is classified as
junk or not. Classification methods can aid predicting if a credit-card
charge may be fake, risk details of a loan applicant, or whether expect-
ing a consumer response to an advertisement.
Classification is about predicting a positive conclusion based on a
given input and algorithm. The algorithm attempts to determine the
relationships between the attributes that will make it feasible to fore-
cast the outcome. Next an unseen data set is given to the algorithm,
called prediction set, containing the same set of attributes, excluding
the prediction attribute. The algorithm examines the input and yields
a prediction. The accuracy of the prediction describes the ef-
ficiency of the algorithm. For example, the training set in a medical
database would have applicable patient information captured earlier
in which the prediction attribute is the patient’s heart problem.

NMIMS Global Access - School for Continuing Education


Predictive Analytics 151

n o t e s

Figure 6.3 demonstrates the prediction sets and training of such a da-
tabase:

Figure 6.3: Showing Training set and Prediction Set for Medical
Database

Among a few types of data representation known, classification nor-
mally uses forecast principles to express learning and knowledge. Pre-
diction standards are communicated as IF-THEN guidelines, where
IM
the antecedent (IF part) comprises a conjunction of conditions and the
rule subsequent (THEN part) predicts a specific expectations trait for
an item that fulfils the forerunner. Utilising the above example, a rule
expecting the first row in the training set might be represented as:
IF (Age=65 AND Heart rate>70) OR (Age>60 AND Blood pres-
sure>140/70) THEN Heart problem=yes
M

Most of the time, the prediction rule is monstrously bigger in com-


parison to the case specified above. In conjunction, each condition is
isolated by the OR keyword and therefore, defines tinier standards
catching attributes relationship. Fulfilling any of these smaller rules
N

implies that the resultant is the expectation. Each smaller rule is


shaped with AND’s which encourages narrowing down relations be-
tween attributes.
How well forecasts are done is measured as rate of predictions hit
against the total number of forecasts specified. A good rule should
have a hit rate more prominent than the event of the prediction attri-
bute. At the end of the day, if the algorithm is attempting to foresee
rain in Seattle and it downpours 80% of the time, any algorithm could
without much of a stretch can hit a rate of 80%. Subsequently, 80% is
the base prediction rate that any algorithm ought to achieve in this sit-
uation. The ideal solution is a rule with 100% forecast hit rate, which
is hard, if not impossible, to accomplish. In this manner, apart from
some certain issues, classification by definition must be cracked by
approximation based algorithms.

6.4.2 Regression

Regression analysis is an instrument for creating statistical and math-


ematical models that define relations between a dependent variable





(should be ratio variable, not categorical) and one or more descriptive


or independent numerical (ratio or categorical) variables.

Two broad categories of regression models often used in business


settings are (1) Regression models of cross-sectional data and (2) Re-
gression models of time-series data, in which the independent vari-
ables are time or some function of time and the focus is on predicting
the future. Time-series regression is an important tool in forecasting.

A regression model involving a single independent variable is called


simple regression, while a regression model involving two or more independent variables is called multiple regression.

Simple linear regression involves finding a linear relationship be-


tween one independent variable, X, and one dependent variable, Y.
The relationship between two variables can assume many forms. The

S
relationship may be linear or nonlinear, or there may be no relation-
ship at all. Because we are focusing our discussion on linear regression
models, the first thing to do is to verify that the relationship is linear.
IM
We would not expect to see the data line up perfectly along a straight
line; we simply want to verify that the general relationship is linear. If
the relationship is clearly nonlinear, then alternative approaches must
be used, and if no relationship is evident, then it is pointless to even
consider developing a linear regression model.

To determine if a linear relationship exists between the variables, we


M

recommend that you create a scatter chart that can display the rela-
tionship between variables visually as shown in Figure 6.4:
N

Figure 6.4: Displaying Relationship Between Variables
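Alongside a scatter chart, the strength of a linear relationship can also be checked numerically. The following is a minimal sketch on hypothetical data, fitting the line by least squares with NumPy:

import numpy as np

# hypothetical data: advertising spend (X) vs. weekly sales (Y)
X = np.array([10, 12, 15, 17, 20, 22, 25, 28], dtype=float)
Y = np.array([110, 119, 133, 140, 152, 160, 171, 183], dtype=float)

b1, b0 = np.polyfit(X, Y, deg=1)      # least-squares slope and intercept
print(f"Y = {b0:.2f} + {b1:.2f} X")   # fitted simple linear regression line
print(np.corrcoef(X, Y)[0, 1])        # correlation near 1 suggests a linear fit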

Linear regression models are not appropriate for every situation.


A scatter chart of the data might show a nonlinear relationship, or the
residuals for a linear fit might result in a nonlinear pattern. In such
cases, we might propose a nonlinear model to explain the relationship.
For instance, a second-order polynomial model would be:

Y = β0 + β1X + β2X² + e

Sometimes, this is called a curvilinear regression model. In this mod-


el, β1 represents the linear effect of X on Y, and β2 represents the
curvilinear effect. However, although this model appears to be quite





different from ordinary linear regression models, it is still linear in the


parameters (the betas, which are the unknowns that we are trying to
estimate). In other words, all terms are a product of a beta coefficient
and some function of the data, which are simply numerical values. In
such cases, we can still apply least squares to estimate the regression
coefficients. Curvilinear regression models are also often used in fore-
casting when the independent variable is time.
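As the text notes, a second-order polynomial can still be fitted by least squares. A minimal sketch on hypothetical data:

import numpy as np

# hypothetical nonlinear data: Y rises and then falls with X
X = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
Y = np.array([3.1, 5.8, 7.9, 9.2, 9.8, 9.1, 7.7, 5.9])

b2, b1, b0 = np.polyfit(X, Y, deg=2)               # least squares still applies
print(f"Y = {b0:.2f} + {b1:.2f} X + {b2:.2f} X^2")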

6.4.3 Clustering (K-means)

Cluster analysis (data segmentation) is a set of techniques that aim to


group or categorise an object collection (i.e., remarks or records) into
clusters or subsets in a way that those inside each cluster are more
closely related than objects of different clusters. The objects inside
clusters should display high similarity, while objects of different clus-

S
ters will stay dissimilar. Cluster analysis reduces the data overhead
since it can take number of observations, such as questionnaires or
customer surveys, and decrease the information into smaller easier to
interpret similar groups. The segmentation of customers into small-
IM
er groups, for example, can be used to customise advertising or pro-
motions. As opposed to many other data-mining techniques, cluster
analysis is primarily descriptive, and we cannot draw statistical infer-
ences about a sample using it. In addition, the clusters identified are
not unique and depend on the specific procedure used; therefore, it
does not result in a definitive answer but only provides new ways of
M

looking at data. Nevertheless, it is a widely used technique. There are


two major methods of clustering—hierarchical clustering and k-means
clustering.

In hierarchical clustering, the data is not divided into a specific cluster


N

in one step. Instead, number of partitions take place, running from a


single cluster covering all n clusters objects, each having a lone object.
Hierarchical clustering is further divided into agglomerative grouping
methods, which continue by fusions of the n objects series into groups
and divisive clustering methods, which isolate n objects consecutively
into higher groupings. Figure 6.5 shows the concept of agglomerative
grouping methods and divisive clustering methods:

Figure 6.5: Concept of Agglomerative Grouping Methods


and Divisive Clustering Methods
Source: http://hbanaszak.mjr.uw.edu.pl/TempTxt/ClusterAnalysis/Hierarchical%20Cluster-
ing-Introduction.htm





K-means is one of the simplest, most intuitive learning algorithms that can


crack the well-known grouping problem. The procedure consists of a
simple and easy way to categorise a dataset through a fixed number of
clusters (say, k clusters). The primary idea is to define k centroids,
one for each cluster.

Clustering is a group partitioning process of data points into smaller


clusters. E.g. the items in a supermarket are grouped in categories
(cheese, butter and milk are as dairy products). Naturally, this is a
qualitative partitioning. A quantitative approach should be to calcu-
late certain product features, like milk percentage and products with
high milk percentage grouped as one. In common, we have n data
points xi, i=1...n to be partitioned in k clusters. The aim is to allocate
a cluster to each data point. K-means is a clustering method with the
purpose of finding the μi, i=1...k positions of the clusters that mini-

S
malise the data points distance from the cluster. K-means clustering
solves:
IM
minimise Σ(i=1..k) Σ(x∈ci) d(x, μi), where ci = the set of points belonging to cluster i.

The K-means clustering uses the Euclidean distance square d(x,


μi) = ||x − μi||². This problem is in fact NP-hard, so the K-means algo-
M

rithm aims to find the universal minimum, perhaps getting trapped in


a different solution.

The goal of this algorithm is to find groups in the data, with the num-
ber of groups represented by the variable K. The algorithm works it-
N

eratively to assign each data point to one of K groups based on the


features that are provided. Data points are clustered based on feature
similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new
data
2. Labels for the training data (each data point is assigned to a
single cluster)

As opposed to characterising groups before studying the data, clus-


tering enables you to discover and dissect the groups that have
naturally shaped. Every centroid of a group is a collection of the
highlighted values which characterise the subsequent groups. Eval-
uating the centroid feature weights can be utilised to subjectively
decipher what sort of group each cluster communicates with and
represents. These centroids ought to be set skillfully as different ar-
eas cause diverse outcomes. Along these lines, the better decision
is to place them as much as reasonably could be expected, far from
each other. The following step is to take each group having a place





towards a given dataset and associate it with the closest centroid.


At the point, when no point is pending, the initial step and an ear-
ly groupage is finished. Now we have to re-compute k new centroids
as barycenters of the groups coming about because of the previous
step. After we have these k new centroids, another binding should
be done between similar informational index focuses and the clos-
est new centroid. A loop has been produced, citing which, we may
see that the k centroids change their area stepwise until no more
changes are done. Simply put, the centroids don’t move any more.
Lastly, this algorithm focuses on minimising the objective function
which is a squared error function in this case.

Business Uses

The K-means clustering algorithm is employed to discover groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex datasets. Once the algorithm has been executed and the groups defined, any new data can be effortlessly allotted to the right group.

This is a versatile algorithm that can be utilised for any type of grouping. A few types of use cases are:
‰‰ Behavioral segmentation:
 Segment purchase history and activities on application, web-
M

site, or platform
 Define personas based on interests
 Profiling based on activity monitoring
N

‰‰ Inventory categorisation:
 Group inventory by sales activity and manufacturing metrics
‰‰ Sorting sensor measurements:
 Detect activity in motion sensors
 Group images and separate audio
 Identify health monitoring groups
‰‰ Detecting bots or irregularities:
 Separating valid activity groups from bots
 Grouping valid activity to clean up outlier detection

The K-means clustering algorithm uses iterative refinement to yield a final result. The inputs to the algorithm are the number of clusters K and the dataset, which is a collection of the features of each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly selected or generated from the dataset.


Example: Consider the following data set containing the scores on two variables for seven individuals:

Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5

This data set is to be clustered into two groups. Let the A and B values of the two individuals farthest apart (using the Euclidean distance calculation) define the initial cluster means:

Individual Mean Vector (Centroid)
Group 1 1 (1.0, 1.0)
Group 2 4 (5.0, 7.0)

The remaining individuals are now inspected in sequence and each is assigned to the cluster whose mean is closest in Euclidean distance. The mean vector is recalculated each time a new member joins. This leads to the following steps:
Step Cluster 1 Individual Cluster 1 Mean Vector (Centroid) Cluster 2 Individual Cluster 2 Mean Vector (Centroid)
1 1 (1.0, 1.0) 4 (5.0, 7.0)
2 1, 2 (1.2, 1.5) 4 (5.0, 7.0)
3 1, 2, 3 (1.8, 2.3) 4 (5.0, 7.0)
4 1, 2, 3 (1.8, 2.3) 4, 5 (4.2, 6.0)
5 1, 2, 3 (1.8, 2.3) 4, 5, 6 (4.3, 5.7)
6 1, 2, 3 (1.8, 2.3) 4, 5, 6, 7 (4.1, 5.4)

The initial partition has now changed, and the two clusters currently have the following characteristics:
Individual Mean Vector (centroid)
Cluster 1 1, 2, 3 (1.8, 2.3)
Cluster 2 4, 5, 6, 7 (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each individual's distance to its own cluster mean and to that of the opposite cluster:
Individual Distance to mean (centroid) of Cluster 1 Distance to mean (centroid) of Cluster 2
1 1.5 5.4


2 0.4 4.3
3 2.1 1.8
4 5.7 1.8
5 3.2 0.7
6 3.8 0.6
7 2.8 1.1

Person 3 is closer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1). In other words, each person's distance to its own cluster mean should be lower than the distance to the other cluster's mean, which is not the case for person 3. Thus, person 3 is moved to Cluster 2, resulting in a new partition:

Individual Mean Vector (centroid)

Cluster 1 1, 2 (1.3, 1.5)
Cluster 2 3, 4, 5, 6, 7 (3.9, 5.1)

The iterative relocation would continue from this new partition until no more relocations occur. However, in this example, each individual is now nearer to its own cluster mean than to that of the other cluster, so the iteration stops, and the latest partitioning is taken as the final cluster solution.
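The distance comparisons in the tables above can be verified with a few lines of Python (a sketch, assuming NumPy is available; the points and centroids are taken directly from the example):

import numpy as np

# The seven individuals and the two cluster centroids from the example.
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
c1, c2 = np.array([1.8, 2.3]), np.array([4.1, 5.4])

d1 = np.linalg.norm(X - c1, axis=1)  # distance to Cluster 1 mean
d2 = np.linalg.norm(X - c2, axis=1)  # distance to Cluster 2 mean
for i, (a, b) in enumerate(zip(d1, d2), start=1):
    print(i, round(a, 1), round(b, 1), "-> Cluster", 1 if a < b else 2)

Running this reproduces the distance table: only individual 3 is closer to the opposite centroid, which is why it is the one relocated.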

6.4.4 Artificial Neural Networks



Neural networks were designed to mimic the cognitive processes of the brain. They can predict new observations from existing ones. A neural network comprises interconnected processing elements, also called units, nodes (hubs), or neurons. The neurons within the network cooperate, in parallel, to produce an output function. Since the computation is performed by the collective neurons, a neural network can deliver the output function even if a portion of the individual neurons is malfunctioning (the network is robust and fault tolerant).

As a rule, every neuron within a neural network has an associated activation number. Additionally, every connection between neurons has a weight associated with it. These quantities recreate their counterparts in the biological brain: the firing rate of a neuron and the strength of a synapse. The activation of a neuron depends upon the activation of the other neurons and the weights of the edges that are connected to it. The neurons within a neural network are typically arranged in layers. The number of layers within the network, and the number of neurons within each layer, normally matches the nature of the studied phenomenon. After the size has been decided, the network is generally subjected to training. Here, the network receives sample training inputs with their associated classes. It then applies an iterative procedure to the inputs to adjust the weights of the network so that its future predictions are optimal. After the training stage,


the network is ready to perform predictions on new groups of data. Neural networks can frequently deliver extremely accurate predictions. However, one of their most prominent criticisms is that they represent a black-box approach to research, as they do not divulge insight into the underlying nature of the phenomena.

Neural networks can be used to predict time series data, for instance, climate data. A neural network can be designed to recognise patterns in the data and produce an output free of noise.

As a complex algorithm, the neural network is biologically inspired by the structure of the human brain, although it provides an extremely simple model in comparison with the brain itself.

Generally utilised for data classification, neural networks process past and current data to estimate future values, finding any complex relationships hidden in the data, in a way closely resembling that used by the human brain.

Figure 6.6 demonstrates the neural-network structure of the algorithm and its three layers:

Figure 6.6: The Neural-network Structure of Algorithm and its Three Layers
Source: http://ecee.colorado.edu/~ecen4831/lectures/MLPnet.gif

The input layer feeds past data values into the next (hidden) layer. The black circles denote the nodes of the neural network. The hidden layer contains a few complex functions that create predictors; those functions are hidden from the user. An arrangement of nodes (black circles) at the hidden layer represents mathematical functions, called neurons, that modify the input data. The output layer collects the predictions from the hidden layer and delivers the final outcome.
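To make the three-layer structure concrete, the following NumPy sketch performs one forward pass through such a network; all weights, biases and input values are made-up illustrative numbers, not taken from the figure:

import numpy as np

def sigmoid(z):
    # Logistic activation used by the hidden neurons
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.9, 0.1])                 # input layer: two past data values

W_hidden = np.array([[0.5, -0.3],        # weights into three hidden neurons
                     [0.8,  0.2],
                     [-0.4, 0.7]])
b_hidden = np.array([0.1, 0.0, -0.2])    # one bias per hidden neuron

W_out = np.array([[0.6, -0.1, 0.3]])     # weights into one output neuron
b_out = np.array([0.05])

hidden = sigmoid(W_hidden @ x + b_hidden)  # hidden layer transforms the input
output = W_out @ hidden + b_out            # output layer collects predictions
print(output)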

Here is a more rigorous look at how a neural network can deliver a predicted output from the input information. The hidden layer is the key part of a neural network on account of the neurons it contains; they work in coordination to do the significant calculations and create


the output. Every neuron takes a group of input values; each input is associated with a weight (more about that in a minute) and a numerical value called the bias. The output of every neuron is a function of the weighted sum of its inputs plus the bias.

Most neural networks use mathematical functions to activate the neurons. A function in mathematics is a relation between an input set and an output set, such that each input corresponds to exactly one output. (For example, consider the negation function, where a whole number is the input and its negative is the output.) Basically, a function in mathematics works like a black box: it takes an input and produces an output.

Neurons in a neural network can utilise sigmoid functions to map inputs to outputs. A sigmoid function used in this manner is known as a logistic function, and its equation resembles the following:

f(x) = 1 / (1 + e^(−x))
Here f refers to the activation function which activates the neuron, and e denotes a mathematical constant with an approximate value of 2.718. Sigmoid functions are used in neurons because these functions have positive derivatives and are easy to compute. Moreover, they are continuous, bounded, and can act as smoothing functions. This combination of characteristics is important for the workings of a neural network algorithm — mainly when a derivative calculation (such as for the weight related to each input to a neuron) is required. Neural networks possess high accuracy, even when the data contains a significant amount of noise; this is a major advantage, as the hidden layer can still determine associations in the data despite the presence of noise.
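A single sigmoid neuron can be written out in a few lines of plain Python; the inputs, weights and bias below are hypothetical values chosen only for illustration:

import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

inputs  = [0.5, 0.8, 0.2]   # activations arriving from the previous layer
weights = [0.4, -0.6, 0.9]  # one weight per incoming connection
bias    = 0.1               # numerical value added to the weighted sum

weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
output = sigmoid(weighted_sum)  # the neuron's activation, between 0 and 1
print(round(output, 4))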
self assessment Questions

10. Classification is a predictive analytical strategy aimed to


forecast the data. (True/False)
11. _________ modelling is the process of developing analytic
models to describe the relationship between metrics that
drive business performance.
12. A simple linear regression analysis differs from a non-linear
analysis on the fact that nonlinear (or curvilinear) regression
is used more to predict the outcomes when one of the
independent variables happen to be time. (True/False)
13. A regression model involving a single autonomous variable is
called ______.
14. K-means clustering is similar to neural networks, the only
difference being the approach and method involved in
devising the solution. (True/False)


Activity

A consumer products company has collected some data relating to


the advertising expenditure and sales of one of its products:
Advertising cost Sales
$300 $7000
$350 $9000
$400 $10000
$450 $10600

Figure out the model that would best depict the above data in the
least number of steps.

6.5 SUMMARY
‰‰ Predictive modelling is the method of building, testing and validating a model to best predict the likelihood of an outcome.
‰‰ Predictive analysis and models are characteristically used to pre-
dict future probabilities.
‰‰ Predictive models are representations of the relationship between
how a member of a sample performs and some of the known char-
acteristics of the sample.

‰‰ Predictive analytics methods depend on the quantifiable variables,


controlling metrics to forecast future performance or outputs.
‰‰ Logic driven models are created on the basis of inferences and
postulations provided by the sample space and existing conditions.

‰‰ A data-driven model is based on the data analysis of a specific sys-


tem.
‰‰ Regression analysis and forecasting models help us to predict rela-
tionships or future values of variables of interest.
‰‰ Association rules are intended to discover broad association patterns among data in large databases.
‰‰ Regression analysis is an instrument for creating statistical and
mathematical models that define relations between a dependent
variable (should be ratio variable, not categorical) and one or more
descriptive or independent numerical (ratio or categorical) vari-
ables.

key words

‰‰ Association rules: These rules are used to discover broad association patterns in large databases.
‰‰ Cause-and-effect modelling: It is the process of developing an-
alytic models to describe the relationship between metrics that
drive business performance.


‰‰ Descriptive analytics: In this type of analytics, analysts help to


identify patterns in data with the help of data-mining tools.
‰‰ Logic driven models: These are created on the basis of infer-
ences and postulations provided by the sample space and exist-
ing conditions.
‰‰ Predictive models: These are used to analyse historical facts
and current data to better comprehend customer habits, part-
ners and products and to classify possible risks and prospects
for a company.
‰‰ Simple linear regression: It involves finding a linear relation-
ship between one independent variable and one dependent
variable.

6.6 DESCRIPTIVE QUESTIONS
1. Explain the concept of predictive modelling.
2. What are logic driven models? Discuss with appropriate
examples.
3. Describe the concept of data mining. Enlist its four stages.
4. Discuss the differences between classification and prediction.
5. Explain some approaches in data mining.

6. Describe the concept of regression analysis.

6.7 ANSWERS AND HINTS



ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic Q. No. Answers


Predictive Modelling 1. False
2. Predictive modelling
3. True
4. (R x F x M)/D
Introduction to Data Mining 5. True
6. d. All of these
7. True
8. modeling
9. False
Data Mining Methodologies 10. False
11. Cause-and-effect
12. True
13. simple
14. False


HINTS FOR DESCRIPTIVE QUESTIONS


1. Predictive modelling is the method of building, testing and validating a model to best predict the likelihood of an outcome. Refer to Section 6.2 Predictive Modelling.
2. Logic driven models are created on the basis of inferences and
postulations which the sample space and existing conditions
provide. Refer to Section 6.2 Predictive Modelling.
3. Data mining is a growing business analytics field focused on better
understanding of the features and designs among variables in
huge databases using a variety of analytical and statistical tools.
Refer to Section 6.3 Introduction to Data Mining.
4. Classification is the most essential type of data analysis. Prediction resembles classification, except that we are attempting to predict the value of a numerical variable. Refer to Section 6.3 Introduction to Data Mining.
5. Some common approaches in data mining include data exploration and reduction, association and cause-and-effect modelling. Refer to Section 6.4 Data Mining Methodologies.
6. Regression analysis is an instrument for creating statistical and mathematical models that define relations between a dependent variable (should be a ratio variable, not categorical) and one or more descriptive or independent numerical (ratio or categorical) variables. Refer to Section 6.4 Data Mining Methodologies.

6.8 SUGGESTED READINGS & REFERENCES



SUGGESTED READINGS
‰‰ Bari,A., Chaouchi, M., & Jung, T. (2014). Predictive Analytics for
Dummies. Hoboken, NJ: John Wiley & Sons, Inc.
‰‰ Finlay, S. (2014). Predictive Analytics, Data Mining and Big Data:
Myths, Misconceptions and Methods. Basingstoke: Palgrave Mac-
millan.
‰‰ Larose, D. T., & Larose, C. D. (2015). Data Mining and Predictive
Analytics. Wiley.

E-REFERENCES
‰‰ Predictive analytics. (2017, May 09). Retrieved May 16, 2017, from
https://en.wikipedia.org/wiki/Predictive_analytics
‰‰ What is predictive analytics? - Definition from WhatIs.com. (n.d.).
Retrieved May 16, 2017, from http://searchbusinessanalytics.
techtarget.com/definition/predictive-analytics
‰‰ Predictive Analytics World. (n.d.). Predictive Analytics. Retrieved May 16, 2017, from http://www.predictiveanalyticsworld.com/predictive_analytics.php



Chapter 7

PRESCRIPTIVE ANALYTICS

CONTENTS

7.1 Introduction
7.2 Overview of Prescriptive Analytics
7.2.1 Prescriptive Analytics brings a lot of Input into the Mix
7.2.2 Prescriptive Analytics Comes of Age
7.2.3 How Prescriptive Analytics Functions
7.2.4 Commercial Operations and Viability
7.2.5 Research and Innovation
7.2.6 Business Development

7.2.7 Consumer Excellence


7.2.8 Corporate Accounts
7.2.9 Supply Chain
7.2.10 Governance, Risk and Compliance

Self Assessment Questions


Activity
7.3 Introduction to Prescriptive Modeling
7.3.1 The Waterfall Model
7.3.2 Incremental Process Model
7.3.3 Rapid Application Development (RAD) Model
Self Assessment Questions
Activity
7.4 Non-linear Optimisation
Self Assessment Questions
Activity
7.5 Summary
7.6 Descriptive Questions
7.7 Answers and Hints
7.8 Suggested Readings & References


Introductory Caselet

CREDIT CARD COMPANY USING PRESCRIPTIVE ANALYTICS TO SERVE ITS CUSTOMERS IN A BETTER WAY

This caselet illustrates the use of prescriptive analytics in our day-to-day life. The following incident happened to a person named Bill, whose credit card company started offering electronic coupons from retailers which could be downloaded to the customer's card. The respective customers would then automatically receive discounts from the retailer whenever a purchase was made using the card. Although not a regular eater of fast food, Bill added a fast food coupon to his card so that, if he were ever running late for his office, he could purchase a quick meal and save time and money. With that, he signed off from his account and forgot about the entire process.

After many weeks had passed, Bill's cell phone received a notification one day while he was driving. Upon opening the notification, Bill was surprised to see an alert message from the fast food vendor, notifying him that the area he was currently travelling through had a restaurant where he could use his coupon. Though initially shocked, Bill had always heard about the existence of this cutting-edge technology, but had not known that one day he might benefit from it. This technology is a sheer example of what retailers can do in the future by combining the geo-location ability of the phone with the other information they have acquired from their customers.

Bill was excited to be rewarded positively for sharing his data with the credit card company. Although initially a bit uncomfortable with the possibility of sharing all his details with the credit card company, Bill is currently very pleased that the company is using innovative methods like prescriptive analytics to serve its customers better.

In this caselet, you can see how the credit card company uses prescriptive analytics to link customers with their requirements. Once an individual's information is shared with the company, it can use many mathematical modeling and statistical methods to find actionable insights, which can in turn be used to help customers get better results.


learning objectives

After studying this chapter, you will be able to:


>> Describe the meaning of prescriptive analytics
>> Explain the prescriptive modeling
>> Discuss about non-linear optimisation

7.1 INTRODUCTION
After studying the descriptive and predictive analytics steps of the business analytics process in the previous chapters, one should be in a good position to take the final step, i.e., prescriptive analytics. The preceding analyses provide a prediction or a forecast of what future trends in the business may look like.

For example, there can be significant statistical measures of higher or lower sales; profitability trends accurately measured in dollars for new market prospects; or measured cost savings from a future joint venture. If an organisation knows where the future lies by foreseeing these patterns, it can best prepare to exploit the possible opportunities the patterns may offer. The third step of the business analytics process is prescriptive analytics, which involves the application of decision science, operations research methodologies and management science to make optimal utilisation of available resources.

Prescriptive analytics methods and techniques are mathematically based algorithms designed to take variables and other parameters into a quantitative framework and generate an optimal or real-time solution for complex problems. Such methods can be utilised to optimally distribute a company's limited assets to take best advantage of the opportunities it has found in the anticipated future trends. The limitations on human and financial assets prevent organisations from pursuing every opportunity. Utilising prescriptive analytics allows an organisation to allocate limited assets to accomplish goals as optimally as possible. Prescriptive analytics is simply a computerised method for applying calculation and interpretation and providing valuable insights from various data sources.

By the end of this chapter, readers will understand how various class-
es of analytics—predictive and descriptive—can lead to prescriptive
analysis. This chapter will first discuss the meaning of prescriptive
analytics. Next, the chapter discusses prescriptive modeling. In the
end, the chapter discusses non-linear optimisation.


7.2 OVERVIEW OF PRESCRIPTIVE ANALYTICS
Prescriptive analysis answers the question 'What should we do?' on the basis of complex data obtained from descriptive and predictive analyses. By using optimisation techniques, prescriptive analytics determines the best alternative to minimise or maximise some objective in finance, marketing and many other areas. For example, if we have to find the best way of shipping goods from a factory to a destination so as to minimise costs, we will use prescriptive analytics. Figure 7.1 shows a diagrammatic representation of the stages involved in prescriptive analytics:


Figure 7.1: Prescriptive Analytics
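The shipping example mentioned above can be written down as a small optimisation model. The following is a minimal sketch using SciPy's linprog routine (an assumption; any linear solver would do), with hypothetical routes, unit costs, capacities and demand:

from scipy.optimize import linprog

# Two routes from a factory to a destination: $4/unit and $5/unit.
costs = [4, 5]
# Together the routes must ship exactly 100 units of demand.
A_eq, b_eq = [[1, 1]], [100]
# Capacity limits: at most 70 units on route 1 and 60 on route 2.
bounds = [(0, 70), (0, 60)]

res = linprog(c=costs, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x)    # optimal plan: [70. 30.]
print(res.fun)  # minimum total cost: 430.0

Prescriptive analytics generalises this idea: the decision variables, constraints and objective come from business data, and the solver recommends the action.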

Data, which is available in abundance, can be streamlined for growth and expansion in technology as well as business. When data is analysed successfully, it can become the answer to one of the most important questions: how can businesses acquire more customers and gain business insight? The key to this problem lies in being able to source, link, understand and analyse data.

All companies need to address their data challenges to support their decision-making capabilities, or they risk falling behind in this highly competitive landscape. Today, businesses are collecting, storing, analysing and interpreting more data than in previous years, and this trend continues to gain momentum at a remarkable rate. According to many leading professors and researchers, this is the era of a Big Data revolution. In any case, it is not the amount of information that is revolutionary; rather, the revolution lies in what can be done with the data.

Since a lot has been written on Big Data, we will focus on analytics,
which will help companies transform the finance function by offering
forward looking insights and help them devise a solution appropriate
for the optimal course of action, improve the ability to communicate
and collaborate with other companies at a lower cost of ownership.


These transformative characteristics will lead to better performance


improvements in business sectors.

Prescriptive analytics goes beyond predictions, workforce optimisation and decision options. It is usually used to analyse huge, complex data sets to forecast outcomes, offer decision options and show alternative business impacts. This method also draws on many scientific and mathematical techniques for understanding how alternative investments impact the bottom line. Moreover, this analytics can also help enterprises decide how to take advantage of a future scenario, or reduce a future risk, and represent the implications of each decision option.

In real life, prescriptive analytics can automatically and continuously process new data to improve forecast accuracy and offer better decision options. For instance, prescriptive analytics can be utilised to benefit strategic planning in healthcare. By utilising data analytics, one can harness operational information, including demographic trends, financial information and population health patterns, to plan more accurately and to invest future capital in, for example, equipment usage and new facilities.

7.2.1 PRESCRIPTIVE ANALYTICS BRINGS A LOT OF INPUT INTO THE MIX

Prescriptive analytics tools articulate improvements to business outcomes by combining business rules, historical data, variables, mathematical models, constraints and machine-learning algorithms. Prescriptive analytics, much like predictive analytics, is especially useful in situations where there are too many variables, options, constraints and data points to evaluate effectively without any help from technology, and where experimenting in the real-world scenario would be overly risky, rather expensive, or simply take too much time.

Sophisticated analytical models and simulations can be run with known and randomised variables to suggest next steps, show if/then scenarios and gain a better understanding of the range of possible outcomes.

Some examples of business processes where we can apply prescriptive analytics include pricing, operational resource allocation, inventory management, supply chain optimisation, production planning, utility management, sales lead assignment, transportation and distribution planning, marketing mix optimisation and financial planning. For instance, in an airline ticket pricing framework, prescriptive analytics is used to gain insight into complex demand levels, travel patterns and booking timings in order to attract more potential travelers, with prices computed to maximise profits without discouraging sales. Another notable case study is
NMIMS Global Access - School for Continuing Education


168 Fundamentals of Big Data & Business Analytics

n o t e s

the utilisation of prescriptive analytics in UPS in enhancing package


delivery courses. Prescriptive analytics applications have been in op-
eration for a long while.

7.2.2 PRESCRIPTIVE ANALYTICS COMES OF AGE

Prescriptive analytics is an absolute necessity for any company executing key marketing strategies. It highlights optimal choices and the impact of those choices, resulting in a strategic and clearly defined path ahead. The good news for organisations seeking a competitive edge is that the staggering amount of data now accessible is the very thing that powers prescriptive analytics.

Prescriptive analytics takes subjective choices in the target region, utilising the abundance of information to structure the basic decision-making process. The approach dissects potential choices, the interactions among those choices and the influences on these decisions, and then uses this information to chart the best actions and choices. It has become feasible in view of advancements in processing speed and the subsequent development of complex mathematical algorithms applied to varied data sets (big data).

7.2.3 HOW PRESCRIPTIVE ANALYTICS FUNCTIONS

Utilising prescriptive analytics is a complex and time-consuming process that investigates all the aspects feeding the decision-making process, including:
‰‰ Identifying and analysing every single potential choice
‰‰ Defining potential connections and associations between these choices
‰‰ Identifying variables that could affect each of these choices (positively or negatively)

A prescriptive analytics engine processes every one of these aspects and maps out one (or more) potential outcomes for each of the choices, resulting in a customised model. Factors feeding the model, including data volume and quality, can affect the accuracy of the model (as they would in descriptive and predictive analytics).

Prescriptive analytics utilises procedures like optimisation, game theory, simulation and decision-analysis techniques. A process, as opposed to a one-off event, prescriptive analytics can constantly and automatically process new information to enhance predictive precision and provide better decision choices.

7.2.4 COMMERCIAL OPERATIONS AND VIABILITY

Enhanced operations are an essential use of prescriptive analytics. Most organisations have concentrated intensely on finding the


correct cost levels and operating model to adequately enable them to grow. Prescriptive analytics adds another dimension to operational and business effectiveness by giving managers a chance to foresee which structures, messages and targets will yield optimal outcomes given the organisation's unique parameters, and after that to choose which path will give the biggest returns. There are numerous other business applications of prescriptive analytics, such as:
‰‰ Optimising spend and return on investment (ROI) through precise customer profiling
‰‰ Providing important data for brand planning and go-to-market procedures
‰‰ Maximising campaign productivity, sales force alignment and promotional activities
‰‰ Predicting and proactively managing market events
‰‰ Providing significant data for territory analysis, customer sales and medical data
A one-size-fits-all business model is no longer viable; the future of a competitive sales model is centred on customised messaging.

7.2.5 RESEARCH AND INNOVATION



Research and innovation are frequently guessing games; however, prescriptive analytics can be a noteworthy differentiator for any organisation engaged in R&D activities in a competitive industry, including:
‰‰ Modelling, anticipating and enhancing results from product utility
‰‰ Understanding disease (or other areas of interest) trends/progression
‰‰ Establishing ideal trial conditions through focused patient cohorts
‰‰ Increasing customer adherence to the product and reducing non-compliance
‰‰ Understanding requirements for personalised medicine and other innovations
‰‰ Determining and setting up targeted products and interventions

7.2.6 BUSINESS DEVELOPMENT

Understanding what new products are required, what differentiating features will make one product sell better than another, or which


markets are demanding which products, are key areas for prescriptive analytics, including:
‰‰ Identifying and making decisions about opportunities/emerging areas of unmet need
‰‰ Predicting the potential benefit
‰‰ Proactively following industry trends and implementing strategies to gain an advantage
‰‰ Exploiting data analytics to identify specific buyer populations and regions that ought to be targeted
‰‰ Leveraging data analytics to identify key innovations for product development that will produce the biggest return on the investment
‰‰ Identifying likely purchasers to cut business development costs significantly; what-if scenarios for products, markets and purchasers can be a clear differentiator for growing organisations
7.2.7 CONSUMER EXCELLENCE

Understanding buyer needs and having the capacity to tailor offerings (products or services) are basic factors in business growth. Prescriptive analytics can be utilised to improve consumer excellence in a huge number of ways, including:


‰‰ Predicting what purchasers will want and making key choices that address those needs
‰‰ Segmenting purchasers and identifying and targeting custom-fitted messages to them
‰‰ Staying on top of the competition and making decisions (e.g., marketing, branding) about products that will lead to more desirable offerings and higher sales

7.2.8 CORPORATE ACCOUNTS

Corporate account functions can immensely use prescriptive analytics to improve their capacity to make choices that help drive internal excellence and external strategy:
‰‰ Internal excellence
 Viability and direction for non-product related activities: what choices ought to be made and what is their effect.
 Viability and direction for product related activities: what choices ought to be made and what is their effect.


‰‰ External-facing strategic direction
 Utilising important data to demonstrate product value and establish market pricing
 Utilising analytics to establish a targeted coupon strategy
 Recognising optimal price point alternatives and the effect of those choices on the product's revenue model
 Better understanding the whole price cycle from list price to reimbursement (counting all rebates and refunds) to inform the optimal pricing strategy
 Utilising relevant competitor data to establish pricing and gain market access

7.2.9 SUPPLY CHAIN

Prescriptive analytics can likewise furnish supply chain functions with a competitive edge through the capacity to predict and make decisions in a few basic areas, including:
‰‰ Forecasting future demand and pricing (e.g., supplies, material, fuel and other components affecting cost, to guarantee proper supply)
‰‰ Utilising prescriptive analytics to inform stock levels, schedule plants, route trucks and other components in the supply chain cycle
‰‰ Mitigating supplier risk by mining unstructured information alongside transactional information
‰‰ Better understanding historical demand patterns and product flow through supply chain channels, anticipating future patterns and making choices on future-state strategies

ples and settling on choices on future state procedures

7.2.10 GOVERNANCE, RISK AND COMPLIANCE

Governance, risk and compliance are functions of increasing importance across almost every industry. Prescriptive analytics can help organisations achieve compliance through the ability to anticipate forthcoming risks and make proper mitigation decisions. Uses of prescriptive analytics in the area of governance, risk and compliance include:
‰‰ Improving internal audit effectiveness
‰‰ Informing third-party alignment and management


‰‰ Classifying patterns associated with improper spend (e.g., aggregate spend in the case of pharma)
‰‰ Applying well-informed compliance controls

self assessment Questions

1. Prescriptive analytics can be utilised in improving services of


the healthcare industry. (True/False)
2. Prescriptive analytics take ______ choices in the target region,
utilising the abundance of information to structure the basic
_______ process.
3. _______ analytics can be a noteworthy differentiator for any
organisation occupied with R&D exercises in a competitive
industry.

4. Prescriptive analytics can help associations to remain
consistent in anticipating the upcoming dangers and settling
on the proper mitigation choices. (True/False)
Activity

Assign a group of students the task of collecting information on the money spent by the residents of a town to keep their area pollution free and clean. All the data needs to be collected and documented cleanly in a spreadsheet. The students then need to find out the probable amount of money the residents would spend for the same purpose over a future time frame.
probability of the amount of money the residents would be spend-
ing for the same purpose in the future course of time frame.

7.3 INTRODUCTION TO PRESCRIPTIVE MODELING
Prescriptive analytics methods do not just concentrate on Why, How, When and What; they also prescribe how to act to take advantage of the situation. Prescriptive analytics frequently serves as a benchmark for an organisation's analytics maturity. The segments of prescriptive analytics are:
a. Evaluate and choose better ways to deal with work
b. Target business goals and conform all restrictions

Prescriptive models guide everybody precisely and tend to be substantial. These models require a great deal of documentation and are costly. Prescriptive methodologies are basically "project insurance". Prescriptive decision models help leaders recognise the best arrangement. There are three kinds of prescriptive process models in business. They are:
‰‰ The Waterfall Model


‰‰ Incremental Process Model


‰‰ RAD Model

7.3.1 THE WATERFALL MODEL

The waterfall model is additionally called the 'Linear sequential model' or 'classic life cycle model'. In this model, each stage is completely finished before the following stage starts. This model is utilised for small projects. In this model, feedback is taken after each stage to guarantee that the project is on the correct path. The testing phase begins only after the development is finished.

The advantages of using the waterfall model are as follows:
‰‰ The waterfall model is easy to implement and simple to use.
‰‰ It avoids overlapping of the phases.
‰‰ This model works for small projects, as the requirements are understood extremely well.
‰‰ This model is preferred for those projects where quality is more important than the cost of the project.

The disadvantages of using the waterfall model are as follows:
‰‰ This model is not suitable for complex and object-oriented projects.
‰‰ The issues with this model remain uncovered until the testing stage.
‰‰ The measure of risk is quite high.
N

7.3.2 INCREMENTAL PROCESS MODEL

The incremental model is an evolutionary model in which a product is implemented and tested incrementally. The process sequence used is: build, implement, integrate and test. Successive builds follow until the product is complete. The product is then in operation mode, and the model provides stepwise development. It retains the discipline introduced by the waterfall model at each build. The model can be used at all stages of the life cycle.

The advantages of using the incremental model are as follows:
‰‰ This model delivers products quicker and is cost effective.
‰‰ Testing and debugging are easier in this model.
‰‰ It generates working software rapidly and early in the product life cycle.


‰‰ The handling of risk is easier as risky items can be determined and


managed during each iteration.

The disadvantages of using the incremental model are as follows:
‰‰ The cost of the final product may exceed the cost estimated at the beginning.
‰‰ This model requires good design and planning.
‰‰ This model requires a precise definition of the whole system before it gets broken down and built incrementally.
‰‰ The cost involved in this model is more than the cost involved in the waterfall model.
‰‰ Client requests for extra functionality after each increment cause issues during the system design.

S
7.3.3 Rapid Application Development (RAD) Model

RAD is a Rapid Application Development model which is based upon prototyping and iterative development without any specific plan. It emphasises gathering the requirements of customers with the help of workshops or focus groups. The RAD model comprises the following stages:
‰‰ Business modeling: It describes the flow of information among
business functions and is modeled in a manner that answers the
M

following questions:
 What information is generated?
 Who generated the information?
N

 Where does the information flow?


 Who processes the information?
‰‰ Data modeling: It describes the flow of information defined as
part of the business modeling phase that is refined into a group of
data objects required for supporting the business. The character-
istics of each object are ascertained and the relationships between
these objects are defined.
‰‰ Process modeling: It refers to data objects described in the data
modeling phase that are transformed for achieving the flow of in-
formation required to implement a business function. Processing
descriptions are built to add, modify, delete, or retrieve a data ob-
ject.
‰‰ Application generation: It assumes the use of 4GT (Fourth Gener-
ation Techniques) which comprise a wide range of software tools.
Each tool allows the software engineer to specify software at a high
level. Automated tools are used for facilitating the construction of
software.


‰‰ Testing and turnover: It helps in reducing the overall testing time, as many of the program components have already been tested. New components must be tested and all interfaces must be fully exercised.

self assessment Questions

5. Segments of prescriptive analytics are:


a. Evaluate and choose better ways to deal with work
b. _______________________________________________
6. Identify the disadvantage of the Waterfall model from the
following statements:
a. The waterfall model is easy.

b. It avoids overlapping of each phase.
c. This model works for small projects.
d. It is a poor model for long activities.
7. RAD is a Rapid ________ Development model.
8. The waterfall model is also called the _______.

Activity

Create a datasheet related to the allocation of budgets for constructing a building in your locality. Use prescriptive analytics to find out the annual budget allocation for the maintenance of the building over the next 5 years in your locality.

7.4 NON-LINEAR OPTIMISATION


You already know that there are numerous mathematically programmed, nonlinear techniques and methodologies intended to produce optimal business performance solutions. Most of them require careful estimation of parameters that may or may not be exact, especially given the precision required of an answer that can depend so delicately on parameter accuracy. This is further complicated in business analytics by the vast data records that ought to be figured into the model-building effort. To overcome these impediments and be more comprehensive in the use of large data sets, regression software can be used. Curve-fitting software can be utilised to create predictive analytical models that can likewise be used to help in making prescriptive analytical decisions.

Prescriptive analysis gives precise choices on the action plan for future success. One of the conspicuous uses of prescriptive analytics in advertising is the optimisation problem of marketing budget allocation. The business problem is to figure out the optimal amount of spending that should be allocated from the aggregate advertising budget to each of the advertising media, like TV, press, Web video and so forth, to maximise the revenue. The budget optimisation problem is solved through either Linear or Nonlinear Programing (NLP), which depends on whether:
‰‰ The objective function is linear/nonlinear
‰‰ The feasible region is defined by linear/nonlinear constraints
A simple nonlinear formulation of the budget problem is sketched below.
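The sketch assumes SciPy is available and uses hypothetical concave response curves (revenue proportional to the square root of spend) as a stand-in for fitted media response functions:

import numpy as np
from scipy.optimize import minimize

a = np.array([120.0, 90.0, 60.0])  # effectiveness of TV, press, Web video
budget = 1000.0                    # total advertising budget

def neg_revenue(x):
    # Maximising revenue = minimising its negative
    return -np.sum(a * np.sqrt(x))

constraints = [{"type": "eq", "fun": lambda x: np.sum(x) - budget}]
bounds = [(0.0, budget)] * 3
x0 = np.full(3, budget / 3)        # start from an equal split

res = minimize(neg_revenue, x0, bounds=bounds, constraints=constraints)
print(res.x, -res.fun)             # optimal split and the revenue it yields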

Nonetheless, in the real world, TV advertisement data, as plotted in Figure 7.2, challenges such an assumption, as the graph demonstrates a curved function. A constraint that might be considered when developing such an optimisation problem is the maximum amount that ought to be spent on a particular medium; beyond that point, any further expenditure may still increase revenue, but at a diminishing rate. Hence, it is essential to discover the diminishing point of return for each of the advertising mediums. Figure 7.2 demonstrates the revenue generated against the cost incurred for a TV commercial; both the cost and the revenue, stated in present dollar value, are in thousands:

Figure 7.2: Diminishing Point of Return for TV Advertisement


Source: https://www.blueoceanmi.com/blueblog/application-derivatives-nonlinear-program-
ming-prescriptive-analytics

The curve that best fits the plotted revenue and cost of TV promotion is cubic, as plotted in Figure 7.2. The R-square achieved through the cubic equation is an astounding 98.7%. Writing the fitted cubic generically as y = ax³ + bx² + cx + d (Equation 1), the first and second order derivatives of the cubic equation are computed as follows:

y′ = 3ax² + 2bx + c (Equation 2)
y″ = 6ax + 2b (Equation 3)


The inflection point is recognised where the second derivative changes from positive to negative. Numerically, therefore, it is the point where the second derivative equals 0.

In this manner, solving Equation 3, the cost at the diminishing point of return is 1777.78, and the corresponding revenue at the diminishing point of return, after plugging this value of x into Equation 1, is 36735.68. For further study of the use of second derivatives in nonlinear optimisation, you may refer to the Newton–Raphson algorithm and conjugate direction algorithms.
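The same calculation can be scripted. The cubic coefficients below are hypothetical placeholders (the fitted coefficients themselves are not reproduced in the text), but the structure of the computation follows Equations 1-3:

a, b, c, d = -1.0e-6, 0.006, 5.0, 2000.0  # hypothetical fitted cubic

def revenue(x):
    return a * x**3 + b * x**2 + c * x + d   # Equation 1

def second_derivative(x):
    return 6 * a * x + 2 * b                 # Equation 3

x_star = -b / (3 * a)           # point where the second derivative equals 0
print(x_star, revenue(x_star))  # diminishing point of return and its revenue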

The partial derivative is the other notable use of calculus in optimisation problems. Partial derivatives of a function of several variables are obtained when a particular variable's derivative is computed while keeping the other variables constant. One of the most general uses of the partial derivative is the least squares model, where the goal is to discover the best-fitting line by minimising the distance of the line from the data points.
This is accomplished by setting the first order partial derivatives with respect to the intercept and the slope equal to zero. The second order partial derivative is utilised in an optimisation problem to figure out whether a given critical point is a relative maximum, a relative minimum, or a saddle point.
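As an illustration, fitting a straight line y = b0 + b1x by least squares reduces to two normal equations whose closed-form solution follows directly from setting those partial derivatives to zero. The sketch below uses plain Python, reusing the advertising cost/sales pairs from the Chapter 6 activity purely as sample numbers:

xs = [300.0, 350.0, 400.0, 450.0]        # advertising cost
ys = [7000.0, 9000.0, 10000.0, 10600.0]  # sales

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Closed-form solution of the normal equations for slope and intercept:
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

print(b0, b1)  # intercept 300.0, slope 23.6, i.e. sales ≈ 300 + 23.6 × cost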

The methodologies discussed earlier show how calculus can be incorporated with nonlinear programming while delivering an enhanced solution. In a similar way, Lagrangean-based procedures can likewise be incorporated with Mixed Integer Non-Linear Programming (MINLP) to provide the marketing budget optimisation solution. At present, data scientists cannot bet on a solitary technique to provide the analytics solution. The genuine test is to figure out how numerous techniques can be inventively combined to give an answer as unique as the business problem.

self assessment Questions

9. Curve fitting programming can be utilised to create ________


analytical models.
10. NLP stands for
a. Nonlinear Programing
b. New Language for programing
c. New linear programing
d. None of these
11. The _______ point is recognised where the second derivative
changes from positive to negative.
12. The full form of MINLP is ________________.


Activity

Create some teams in your class, each having four students, and go to the nearest truck dealer. Use the non-linear optimisation method to calculate how to minimise the cost of transport when many of the dealer's trucks ship goods to a large network of markets or stores.

7.5 SUMMARY
‰‰ By using optimisation techniques, prescriptive analytics determines the best alternative to minimise or maximise some objective in finance, marketing and many other areas.
‰‰ Data, which is available in abundance, can be streamlined for growth and expansion in technology as well as business.
‰‰ In real life, prescriptive analytics can automatically and continuously process new data to improve forecast accuracy and offer better decision options.
‰‰ Prescriptive analytics is an absolute necessity for any company to execute key marketing strategies.
‰‰ Corporate account functions can immensely use prescriptive analytics to improve their capacity to make choices that help drive internal excellence and external strategy.
‰‰ Prescriptive analytics can likewise furnish supply chain functions with a competitive edge through the capacity to predict and make decisions in a few basic areas.
‰‰ RAD is a Rapid Application Development model. Using the RAD model, a software product is produced in a brief time frame.
‰‰ The inflection point is recognised where the second derivative changes from positive to negative.

changes from positive to negative.

key words

‰‰ Analytics: It refers to the discovery, interpretation and commu-


nication of meaningful patterns in data.
‰‰ Descriptive analytics: It is a preliminary stage of data process-
ing that creates a summary of historical data to yield useful in-
formation and possibly prepare the data for further analysis.
‰‰ Prescriptive analytics: It is the area of business analytics (BA)
dedicated to finding the best course of action for a given situation.
‰‰ Predictive modelling: It is a process that uses data mining and
probability to forecast outcomes.
‰‰ Waterfall model: The model in which each stage is completely
finished before the start of the following stage.


7.6 DESCRIPTIVE QUESTIONS


1. Explain the concept of prescriptive analytics along with its functions.
2. What do you understand by prescriptive modeling? Discuss the three kinds of prescriptive process models in business.
3. Describe non-linear optimisation in analytics.
4. Discuss the importance of prescriptive analytics in commercial operations, research and innovation, business development and consumer excellence.

7.7 Answers and Hints

Answers for Self Assessment Questions

Topic Q.No. Answers


Overview of Prescriptive Analytics 1. True
2. subjective, decision-making
3. prescriptive
4. True
Introduction to Prescriptive Modeling 5. b. Target business goals and conform all restrictions
6. d. It is a poor model for long activities

tivities
7. Application
8. Linear sequential model
N

Non-linear Optimisation 9. predictive


10. a. Nonlinear Programing
11. inflection
12. Mixed Integer Non-Linear
Programming

HINTS FOR DESCRIPTIVE QUESTIONS


1. Prescriptive analytics go beyond predictions, workforce
optimisation and decision options. Refer to Section 7.2 Overview
of Prescriptive Analytics.
2. Prescriptive analytics every now and again fill in as a benchmark
for an organisation’s analytics development. Refer to Section
7.3 Introduction to Prescriptive Modeling.
3. The spending advancement issue is understood either through
Linear or Nonlinear Programing (NLP). Refer to Section
7.4 Non-linear Optimisation.

NMIMS Global Access - School for Continuing Education


180 Fundamentals of Big Data & Business Analytics

n o t e s

4. Enhancing operations are an essential utilisation of prescriptive


analytics. Refer to Section 7.2 Overview of Prescriptive
Analytics.

7.8 SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
‰‰ Liebowitz, J. (2014). Business analytics: an introduction. Boca Ra-
ton: CRC Press.
‰‰ Williams, S. (2016). Business intelligence strategy and big data an-
alytics: a general management perspective. Cambridge, MA: Mor-
gan Kaufmann.
‰‰ Bruce, P. C. (2015). Introductory statistics and analytics a resam-

S
pling perspective ;. Hoboken, NJ: Wiley.

E-REFERENCES
IM
‰‰ August 17, 2016 · by Tuhin Chattopadhyay · in Big Data Analyt-
ics. (2016, August 23). Application of Derivatives to Nonlinear
Programming for Prescriptive Analytics. Retrieved May 02, 2017,
from https://www.blueoceanmi.com/blueblog/application-deriva-
tives-nonlinear-programming-prescriptive-analytics/
‰‰ Beginning Prescriptive Analytics with Optimization Modeling by
M

Jen Underwood - BeyeNETWORK. (n.d.). Retrieved May 02, 2017,


from http://www.b-eye-network.com/view/17152
‰‰ Prescriptive
Analytics. (n.d.). Retrieved May 02, 2017, from https://
www.mathworks.com/discovery/prescriptive-analytics.html



Chapter 8

Social Media Analytics and Mobile Analytics

CONTENTS

8.1 Introduction
8.2 Social Media Analytics
Self Assessment Questions
Activity
8.3 Key Elements of Social Media
Self Assessment Questions
Activity
8.4 Overview of Text Mining

8.4.1 Understanding Text Mining Process


8.4.2 Sentiment Analysis
Self Assessment Questions
Activity

8.5 Performing Social Media Analytics and Opinion Mining on Tweets


Self Assessment Questions
Activity
8.6 Online Social Media Analysis
Self Assessment Questions
Activity
8.7 Mobile Analytics
8.7.1 Define Mobile Analytics
8.7.2 Mobile Analytics and Web Analytics
8.7.3 Types of Results from Mobile Analytics
8.7.4 Types of Applications for Mobile Analytics
Self Assessment Questions
Activity
8.8 Mobile Analytics Tools
8.8.1 Location-based Tracking tools
8.8.2 Real-Time Analytics Tools
8.8.3 User Behavior Tracking Tools





Self Assessment Questions


Activity
8.9 Performing Mobile Analytics
8.9.1 Data Collection Through Mobile Device
8.9.2 Data Collection on Server
Self Assessment Questions
Activity
8.10 Challenges of Mobile Analytics
Self Assessment Questions
Activity
8.11 Summary
8.12 Descriptive Questions

8.13 Answers and Hints
8.14 Suggested Readings & References


Introductory Caselet

Tracking Customer Sentiment through Wipro’s


Social Media Analytics (SMA)

Wipro Ltd. is a famous Information Technology, Consulting and Outsourcing company that provides business solutions to help its client companies do better business. One of Wipro's clients, providing paid applications for the Entertainment and Media Industry to its customers, had recently launched its application services. The company needed a way to know its customers' feedback, issues, demands and their overall experience with this new launch. It was struggling to improve the promotional activities that engage its customers, and it was also taking significant time to respond to and resolve customer issues.

The company took the help of Wipro to reduce the response time for resolving customer issues by 65% by tracking customer experience through social media analytics. These analytics, using sentiment analysis, uncover business insights about the client's strategies related to marketing and customer relationships. Sentiment analysis also helps in improving promotional activities and engaging customers' attention with improved services.

Wipro provided its Social Media Analytics (SMA) solution, which accurately understands customers' core sentiments and translates them into key business insights. This SMA solution has been built over time by considering customers' feelings about products and services, collecting the social media data provided on Twitter, Facebook, blogs, forums, etc. The solution is based on Naïve Bayesian and association mining techniques, which can also handle noisy data. The SMA solution also allows reports containing the generated insights to be sent to clients weekly or fortnightly.

Some main features of SMA solution are as follows:


‰‰ Taxonomy generation: It helps in categorising the social me-
dia data into various categories, like functionality, issues, net-
work, environment, competition and content.
‰‰ Insights generation: Insights are generated on the basis of competitor trends, geography/demography/topic based sentiment, product launches, product/service performance, etc.
‰‰ Data collection: Data is collected from social media in real time, on the basis of preset rules and configurations defined by the client, with the help of a social listening tool.


• Data preparation by categorisation: Data is prepared on the basis of different categories, like recent trending topics or customer sentiments about the product.
• Text Analytics Engine: The prepared data is fed into an in-house Text Analytics Engine that transforms social media data into a structured format, which can be easily analysed quantitatively.

The business impact of using Wipro's Social Media Analytics solution was that promotional activities around improved services increased, driven by buzz analysis, launch analysis and campaign analysis. The insights generated from the SMA solution through these analyses helped in resource channelisation and market expansion on the basis of customers' sentiments. Moreover, the SMA solution also helped in identifying key influencers on social media through Social Node Network analysis.


learning objectives

After studying this chapter, you will be able to:


>> Explain the concept of social media
>> Describe the key elements of social media
>> Explain the concept of text mining
>> Understand the text mining process
>> Describe the sentiment analysis
>> Explain how to perform social media analytics and opinion mining on tweets
>> Describe the concept of mobile analytics
>> Describe the mobile analytics tools

>> Explain how to perform mobile analytics
>> Describe the challenges of mobile analytics
8.1 Introduction
In a world where information is readily available via the Internet at the click of a button, organisations need to remain abreast of ongoing events and the latest happenings in order to gain a competitive edge in business markets. Apart from that, organisations also need to interact with their consumers more effectively in order to gain an insight into ongoing business trends and the market position of particular products. Social media provides an opportunity for business organisations and individuals to connect and interact with each other worldwide. With the evolution of social media as a tool to connect with existing and potential customers, business organisations have begun to recognise the need to employ social media analytics for gaining crucial business insights and taking timely decisions.

This chapter discusses the role of social media and the importance of conducting social media analytics in business organisations. These analyses help organisations to evaluate feedback from consumers and gauge their current and future position in the market. Further, you will learn about text mining and sentiment analysis. The chapter ends with a demonstration of how to perform social media analytics and opinion mining on tweets.

8.2 Social Media Analytics


Simply put, social media refers to a computer-mediated, interactive,
and Internet-based platform that allows people to create, distribute,
and share a wide range of content and information, such as text and
images. Social media technologies unleash and leverage the power of
social networks to enable interaction and information exchange.


Jesse Farmer, cofounder of Dev Bootcamp, describes a social network as a collection of people bound together via a specific set of social relations. Social media, in turn, denotes a group of Internet-based applications built on the foundations of Web 2.0 that support the creation and exchange of user-generated content. In other words, social media relies on Web-based technologies to generate interactive platforms where people and organisations can create, co-create, recreate, share, discuss, and modify user-generated content.
Prior to the advent of social media as an open-system approach to exchanging content effectively, business organisations and public relations practitioners rarely focused on business dynamics to manage brand images. With the changing business environment brought about by the evolution of social media, business organisations also adopted the open-system approach based on reciprocal feedback. This, in turn, has completely transformed the way information is communicated and the manner in which public relations are developed. The new approach encourages active participation in the development and distribution of information by merging innovative technologies and sociology. Social media provides a collaborative environment which can be employed for:
• Building relationships
• Distributing content
• Rating products and services
• Engaging target audience
Social media provides an equally open platform for novices as well as experts to express and share their viewpoints and feedback on various events and issues. This information can, in turn, be employed by business organisations to gain insights into customers' perspectives on their products and services. In this manner, social media enables business organisations to receive feedback and promote a dialogue between customers, potential customers, and the organisation. In other words, social media allows business organisations to promote participation, conversation, sharing, and publishing of content. Social media, however, can take different forms, which can be categorised as follows:
• Social networking websites: These provide a Web-based platform where users can create a personalised profile summarising and showcasing their interests, define other members as connections or contacts, and communicate and share content with their contacts. Examples of social networking websites include Facebook, LinkedIn, MySpace, Hi5, and Bebo.
• Blogs: Short for 'Web logs', blogs represent online journals that showcase content organised in reverse chronological order. Examples of blogging sites include Blogger, WordPress, and Tumblr.
• Microblogs: These allow people to share and showcase small posts and are suitable for quick sharing of content in a few lines of text or an individual photo or video. Twitter is a well-known microblogging website.
• Content communities and media sharing sites: These allow users to organise and share different types of media content, such as videos and images. Members can also comment on the shared content. Examples include YouTube, Pinterest, Flickr, and Instagram.
• Wiki: This represents a collaborative website in which members can create and modify content in a community-based database. In other words, users can modify the content of any hosted page and can also create new pages on a website based on the wiki technology. One of the most popular examples of wiki websites is Wikipedia, which is an online encyclopedia.
• Social bookmarking websites: These websites allow users to organise and manage tags and links to other websites. Well-known examples include Reddit, StumbleUpon, and Digg.

Apart from the listed ones, social media may include websites that showcase reviews and ratings, such as Yelp; forums and discussion boards, such as Yahoo!; and virtual social worlds that create a virtual environment where people can interact, such as SecondLife. Figure 8.1 depicts the forms of conversations possible via social media:

Figure 8.1: Possible Forms of Conversation via Social Media


Social media analytics is the practice of collecting data from social media websites or blogs and then analysing the data to take crucial business decisions. Generally, the data obtained from social media is mined to identify customer sentiments and opinions regarding particular products and services. Such an analysis helps organisations to enhance their products and services, improve marketing strategies, provide better customer service, reduce costs, and gain a competitive edge in the market.

self assessment Questions

1. ______ websites allow users to organise and manage tags and links to other websites.
2. WordPress is an example of a ________ site, while Twitter is an example of a _____ site.

Activity
Search and prepare a report on the Social Media Analytics Cycle.

8.3 Key Elements of Social Media


Incorporating social media into the everyday sales and marketing routines of an organisation is not easy and requires gaining command over a certain set of tactics and tools related to the efficient management and utilisation of social media. In order to effectively leverage the possibilities provided by social media for the growth of business, organisations need to focus on certain key elements of social media along with the corresponding techniques.

Social media participation involves a focus on the following key elements:
• Collect: In order to effectively incorporate social media, business organisations first need to understand how to collect and leverage useful information and market artifacts. This involves critical and careful analysis of the information coming from various sources, such as customers, competitors, journalists, and other market influencers. Various tools, such as feed readers, blog subscriptions, and email newsletters, can be employed to collect information from these sources.
• Curate: Once the information is collected from various sources, the next step is to effectively curate the important information to be sent to clients and internal stakeholders. This involves intelligent filtering and aggregation of the information collected from various sources. It not only provides an effective insight to customers but also helps in having a clear vision of current industry standards and market trends. Various curation tools, such as Newsle, LinkedIn, and RSS readers, can be employed for this task.
• Create: After the collection and curation of information, organisations need to create valuable content objects that can provide focus and industry buzz. This is an effective marketing strategy for creating a leadership position in the industry, and it can be accomplished by employing various publishing programs and sharing routines.
• Share: A key element of implementing effective social media is the sharing of information. This involves sharing your content, information, and ideas with others, which helps in expanding the social media network. Various tools, such as Feedly and Hootsuite, help in sharing information and content over social media.
• Engage: The basic idea behind social media is to engage existing and prospective customers. The tools and routines of social media, and the regular practice of listening, curation, and sharing, help the executives and sales personnel of an organisation engage more and more customers, stakeholders, prospective customers, journalists, and industry influencers. Tools such as Salesforce help to connect people from different categories. Apart from that, various mobile apps also help in expanding the reach to more and more people.

self assessment Questions

3. Which of the following elements involves intelligent filtering and aggregation of the information collected from various sources?
a. Collect    b. Curate
c. Create    d. Share
4. The Feedly and Hootsuite tools help in _______ information and content over social media.

Activity

Enlist and discuss the elements of a social media marketing strategy in your class.

8.4 Overview of Text Mining


We all know that social networks are a rich source of information. A lot of valuable content can be extracted and analysed from this information to serve the knowledge requirements of various business organisations, political parties, scientific research departments, social science fraternities, and other interested domains. Social networks generally support the exchange of information and data in various formats, such as text, videos, and photos. However, the most common form of information and content exchange on social networking sites is text.

Online marketers and business analysts examine and interpret online content using social media analytics. This analysis helps them amend and mould their business objectives as per customer behavior. For example, the reviews posted by customers on websites or social marketing media, in the form of text or rating scores, enable organisations to understand and analyse customers' perspectives and expectations.

The insight obtained from such reviews can help organisations to identify their key areas of improvement and enhance their performance. However, certain tools and methodologies are required to read, interpret, and analyse the large number of reviews received on a daily basis. This is accomplished by text mining.
It is quite difficult for any database administrator, marketing professional, or researcher to explore and extract the desired information from the huge amount of data and information generated and exchanged online on a daily basis. The problem is multiplied manifold by the text-based social networking communications and documents exchanged during business operations. Although keyword searching provides some scope for locating the desired information, it cannot always relate to the exact terms in a document.

Text mining or text analytics comes in as a handy tool to quantitatively examine the text generated by social media, filtered in the form of different clusters, patterns, and trends. In other words, text mining represents the set of tools, techniques, and methods applied for automatically processing natural-language textual data provided in huge amounts in the form of computer files. The extracted and structured content and themes are used for rapid analysis, identification of hidden data and information, and automatic decision making. Text mining tools are often based on the principles of information retrieval and natural language processing.

The complex linkage structure makes text mining in social networks a challenging job, requiring the help of automated tools and sorting techniques. A number of text mining tools and algorithms have been developed to enable easy extraction of information from different textual resources. Recent developments in statistical and data processing tools have added to the evolution of the text mining domain.

Text mining employs concepts obtained from various fields, ranging from linguistics and statistics to Information and Communication Technologies (ICT). Statistical pattern learning is applied to create patterns from the extracted text, which are further examined to obtain valuable information. The overall process of text mining comprises retrieval of information, lexical analysis, creation and recognition of patterns, tagging, extraction of information, application of data mining techniques, and predictive analytics. This can be summarised as follows:
as follows:

Text mining = Lexicometry + Data mining

note

Lexicometry or lexical statistics refers to the study of identifying


the frequency of occurrence of words in textual data.

The process is initiated with the retrieval of information, which involves the collection and identification of information from a set of textual material. The information can come from various sources, such as websites, databases, documents, or content management systems. The textual information is processed by parsers and other linguistic analysis tools to examine and recognise textual features, such as people, organisations, names of places, stock ticker symbols, and abbreviations.

Figure 8.2 depicts the text mining process:

(Figure 8.2 is a flowchart: Collect Data → Parse → Apply Text Mining Algorithms → Optimise → View Results, supported by a document Repository and offering unrestricted exploratory freedom.)

Figure 8.2: Text Mining Process

note

The process of analysing a string of symbols, either in a natural or a computer language, on the basis of formal grammar rules is termed parsing or syntactic analysis.

On the basis of certain identified patterns, other quantities such as


entities, emails, and telephone numbers are identified. Further, sen-
timent analysis is applied to identify the underlying attitude. Finally,
the psychological profiling is determined by conducting quantitative
text analysis.


The overall purpose of text mining analytics is to transform unstructured text into valuable structured data, which can be further analysed and applied in various domains, such as research, investigation, exploratory data analysis, biomedical applications, and business intelligence.

Statistical analysis tools, such as R and word counts, aid in the assessment of the overall review. Further, positive and negative relationships can be explored using various plotting techniques, such as the scatter plot. Apart from the listed application areas, text mining techniques can further be applied to analyse the demographics, financial status, and buying tendencies of customers.
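To make the word-count idea concrete, the following is a minimal sketch in base R; the review strings are hypothetical sample data, not output from any real dataset.

# Count word frequencies in a small set of hypothetical reviews
reviews <- c("Great phone and a great battery",
             "Poor screen but a great price")

words <- unlist(strsplit(tolower(reviews), "[^a-z']+"))  # lower-case and tokenise
words <- words[nzchar(words)]                            # drop empty tokens
freq  <- sort(table(words), decreasing = TRUE)           # count each word
head(freq)                                               # 'great' tops the list with 3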

To sum up, text mining can be applied in the following areas:
• Competitive intelligence: In order to succeed, business organisations need to know not only the key players in the industry but also the strengths and weaknesses of their competitors. Text mining provides factual data to organisations that can be applied for strategic decision making.
• Community leveraging: Text mining facilitates the identification and extraction of the information embedded in community interaction. This information can be applied for amending marketing strategies.
• Law enforcement: Text mining can be applied in the domain of government intelligence for countering anti-terrorist activities.
• Life sciences: Text mining can also be effectively applied in the area of research and development of drugs. Bioinformatics companies, such as PubGen, are applying biomedical text mining combined with network visualisation as an Internet service.

8.4.1 Understanding Text Mining Process

The enormous amount of unstructured data collected from social media makes text mining a very challenging process. The key steps of any text mining process can be summed up as follows:
1. Extracting the keyword: Any text analysis process begins with the identification of relevant and precise keyword(s) that can be applied to specific queries. Next, the content and the linkage patterns are considered for applying keyword searches, as content related to similar keywords is often linked. The selected keywords act as social network nodes and play an important role while clustering the text.
2. Classifying and clustering the text: Various algorithms are applied for classifying text from the source content. For this process, the nodes are associated with labels prior to classification. After that, the classified text is clustered on the basis of similarity. The classification and clustering of the text are greatly influenced by the linkage structure of the data. Accurate results can be obtained by applying node labelling and content-based classification techniques.
3. Identifying patterns: Trend analysis applies the principle that, even for the same content, the clusters collected at different nodes can have different concept distributions. For this reason, the concepts at various nodes are compared and classified accordingly into the same or different subcollections.

Obtaining the desired results for a specific query involves careful processing of the relevant document. For effective text mining, several stages of processing need to be applied to a document, such as:

• Text preprocessing: This involves the identification of all the unique words in a document. Non-informative words, such as the, and, or, and when, are filtered out from the document text before applying word stemming. Word stemming refers to the process of reducing inflected or derived words to their stem base. For example, words such as cat, cats, catlike, and catty will all be mapped to the same stem base 'cat'. The terms stemmers and stemming algorithms are used interchangeably in stemming programs. Affix stemmers trim both suffixes and prefixes, such as ed, ly, and ing, from a given word. Popular stemmers include the Brute Force algorithm and the Suffix Stripping algorithm (a short R sketch of this stage appears after the lists below).
• Document representation: A document is basically represented in words and terms.
• Document retrieval: This involves the retrieval of a document based on some query. Accurate results are ensured using text indexing and accuracy measures. Text indexing and searching capabilities can be incorporated in an application using Lucene, which is a Java library.
• Document clustering: This involves the grouping of conceptually related documents to ensure fast retrieval. A term for a given query can be searched faster from well-clustered documents.

Document clustering can be implemented using the following techniques:
• Hierarchical clustering
• One-pass clustering
• Buckshot clustering
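The following is a minimal sketch of the text preprocessing stage described above, using the tm package (with the SnowballC package assumed installed for stemming); the two documents are hypothetical sample text, not taken from any real corpus.

# Stop-word removal and stemming with the tm package
library(tm)

docs   <- c("Cats and catty cats!", "The cat sat when the cats played.")
corpus <- Corpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))       # normalise case
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # filter 'the', 'and', 'when', ...
corpus <- tm_map(corpus, stemDocument)                       # reduce words to their stems

inspect(TermDocumentMatrix(corpus))  # view term frequencies per document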


Once clustered, the documents are then organised into user-defined categories or taxonomies. Figure 8.3 depicts the stages of document processing:
Figure 8.3: Stages of Document Processing in Text Mining

Both structured and unstructured data are involved in text mining. Unstructured data comes from reviews and summaries, while structured data is obtained from organised spreadsheets. Text mining tools identify themes, patterns, and insights hidden in structured as well as unstructured data. Various text mining software products are employed by organisations for different data mining applications. The following are some commonly used text mining software:
• R: Used for statistical data analysis, text processing, and sentiment analysis
• ActivePoint: Applied for natural language processing and online catalog-based contextual search
• Attensity: Used for extraction of facts, including who, what, where, and why, and then identifying people, places, and events and how they are related
• Crossminder: Applied for cross-lingual text analytics
• Compare Suite: Used for comparing texts by keywords and highlighting common and unique keywords
• IBM SPSS Predictive Analytics Suite: Applied for data and text mining
• Monarch: Applied for analysis and transformation of reports into live data
• SAS Text Miner: Provides a rich suite of text processing and analysis tools
• Textalyzer: Used for online text analysis

Apart from these, some other text mining tools include AeroText, Angoss, Autonomy, Clarabridge, IBM LanguageWare, IBM SPSS, WordStat, and Lexalytics.

Now, let’s explore an important component of text mining, i.e., senti-


ment analysis.

8.4.2 Sentiment Analysis
Sentiment analysis is one of the most important components of text mining. Also termed opinion mining, it involves careful analysis of people's opinions, sentiments, attitudes, appraisals, and evaluations. This is accomplished by examining large amounts of unstructured data obtained from the Internet on the basis of the positive, negative, or neutral views of end users. Sentiment analysis involves the analysis of sentences such as the following:
• Facts: Product A is better than product B.
• Opinions: I don't like A. I think B is better in terms of durability.
Similar to Web analysis, specific queries are applied in sentiment analysis to retrieve and rank relevant content. However, sentiment analysis also differs from Web analysis in certain factors. It is possible to determine from sentiment analysis whether the content expresses an opinion on a topic and also whether that opinion is positive or negative. Ranking in Web analysis is done on the basis of the frequency of keywords; ranking in sentiment analysis, on the other hand, is done on the basis of the polarity of the attitude.
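A tiny sketch of this difference is given below; the documents and polarity scores are hypothetical and assigned by hand, not produced by any real tool.

# Ranking by keyword frequency versus ranking by polarity
docs     <- c("camera camera zoom", "camera is poor", "nice lens")
kw_freq  <- sapply(strsplit(docs, " "), function(w) sum(w == "camera"))
polarity <- c(0, -1, 1)  # e.g., positive minus negative word counts

docs[order(kw_freq, decreasing = TRUE)]   # Web-analysis-style ranking
docs[order(polarity, decreasing = TRUE)]  # sentiment-analysis-style ranking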


With the widespread use of Web 2.0 technologies, a huge volume of opinionated data is available on social media. People using social media post their reviews and comments about the products they use and also share their feedback, opinions and experiences with others in their network. These reviews and feedback are utilised by organisations to improve and upgrade their products and services and to enhance their brand equity. Sentiment analysis draws on other domains, such as linguistics, digital technologies, text analysis tools, artificial intelligence, and Natural Language Processing (NLP), for the identification and extraction of useful information. This greatly influences various domains, ranging from politics and science to social science.

note

Artificial intelligence is a technology and a branch of science that


deals with the study and development of intelligent machines and
software. Natural language processing is a domain of computer sci-
ence, artificial intelligence, and linguistics that deals with the inter-
actions between computers and human (natural) languages.


The most common application of sentiment analysis is in the field of consumer products and services. It also provides valuable information to competing organisations and candidates; sentiment analysis can effectively track voters' expectations, perspectives, and feedback. Apart from that, sentiment analysis can be applied in automated scoring systems and rating applications to provide scores and ratings to public companies. An example of a rating application is Stock Sonar, which generates automatic ratings by analysing articles, blogs, and tweets.

The process of sentiment analysis begins by tagging words with Parts of Speech (POS), such as subject, verb phrase, verb, noun phrase, determiner, and preposition. Defined patterns are then filtered to identify their sentiment orientation. For example, 'beautiful room' has an adjective followed by a noun; the adjective 'beautiful' indicates a positive perspective about the noun 'room'. At this stage, the emotional factor in the phrase is also examined and analysed. After that, the average sentiment orientation of all the phrases is computed and analysed to conclude whether a product is recommended by the user.
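As a minimal illustration of this final step, assume hypothetical phrase-level orientations of +1 (positive), -1 (negative) and 0 (neutral):

# Average the sentiment orientations of all phrases in a review
phrase_scores <- c("beautiful room" = 1, "friendly staff" = 1, "noisy street" = -1)
mean(phrase_scores)  # about 0.33; a positive average suggests a recommendation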
IM
The following parameters may be applied to classify the given text in the process of sentiment analysis:
• Polarity, which can be positive, negative, or neutral
• Emotional states, which can be sad, angry, or happy
• Subjectivity or objectivity
• Features of key entities, like the screen size of a cell phone, the durability of furniture, the lens quality of a camera, etc.
• Scaling systems or numeric values

Automated sentiment analysis is still evolving, as it is difficult to interpret the conditional phrases used by people to express their sentiments on social media. Consider, for example, 'if you don't like A, try B'. In this sentence, the user clearly shows his/her positivity towards B but doesn't indicate clear views about A. After removing 'if', the first clause clearly indicates negativity towards A.

However, sentiment analysis employs various online tools to effectively interpret consumer sentiments. Some of these online tools are listed as follows:
• Topsy: It is used to measure the success of a website on Twitter. It tracks the occurrence of given and related keywords, the website name, and the website URL in tweets.
• BackTweets: This tool is applied to improve the search engine ranking of a website. It tracks tweets that link back to the website.
• Twitterfall: It locates tweets that are important for a website. It can be used to stay in touch with customers and consumers and respond to their queries and suggestions in real time.
• TweetBeep: This is used to send timely updates or alerts for topics of interest.
• Reachli: It is designed especially for Pinterest, a content-sharing website. This tool helps in tracking data and in scheduling and organising pins (pins denote the updates in Pinterest) in advance.

Apart from these, some other sentiment analysis tools include Social Mention, AlertRank Sentiment Analysis, and Twitter Sentiment Analysis. Business organisations can apply specific tools as per their requirements and sentiment analysis needs.

self assessment Questions

5. Social networks generally support the exchange of information


and data in various formats, such as text, videos, and photos.

(True/False)
6. Text mining tools are often based on the principles of _____ and ______.
Activity

Search and prepare a report on the various applications of text mining.

8.5 Performing Social Media Analytics and Opinion Mining on Tweets

In today’s IT-driven world, social media has emerged as the most


popular means of communicating and sharing views and information
across the world. Some examples of social media are social network-
ing websites, such as Facebook and Twitter. These websites act as
global platforms that allow people to share their likes, dislikes, and
opinions on various topics.

In this section, you will practice deriving useful information from the
data obtained from social networking sites.

Every day, millions of people use social platforms to express their


opinions about almost everything under the sun. This is why most
organisations prefer to collect data from these websites to know the
public opinion about their products and services. This information
helps organisations to take timely and crucial business decisions. R is
a statistical programming tool that implements modern statistical al-
gorithms to perform various types of analytical activities. This section
takes you through the step-by-step process of gathering, segregating,


and analysing text data using the R tool. A mobile phone manufactur-
ing company hires a data analyst to review the opinion given by peo-
ple on its products. This information will help the company to know
about the current market trends and further enhance the quality of
its products based on the insights. The data analyst decides to collect
data from the tweets of people, and then examine it under three cate-
gories: positive, negative, and neutral. Here, you are going to help him
download tweets and analyse them to derive valuable information.
Before performing the social media analytics, you need to load some
library utilities into the current R environment and verify the Twitter
authentication information to work with the tweets.

Enter the following commands to load required packages to work with


online tweets:

install.packages("twitteR")

S
install.packages("bitops")
install.packages("digest")
install.packages("RCurl")
IM
# If there is any error while installing RCurl,
follow the below
command using terminal
#sudo apt-get install libcurl4-openssl-dev
install.packages("ROAuth")
install.packages("tm")
install.packages("stringr")
M

install.packages("plyr")
library(twitteR)
library(ROAuth)
library(RCurl)
N

library(plyr)
library(stringr)
library(tm)

If you are working on the Windows operating system, you may face Secure Sockets Layer (SSL) certificate issues.

You can avoid that by providing certificate authentication information


in the options() function through the following command:

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

After loading the required R utilities and providing the SSL certificate
authentication information, load the Twitter authentication informa-
tion. This information will be used to download tweets later.

Enter the following commands to load Twitter authentication infor-


mation using your own Twitter credentials:


load("/Datasets/twitter_cred.RData")
registerTwitterOAuth(cred)

Figure 8.4 shows the use of Twitter credentials for Twitter authentication in R:

Figure 8.4: Using Twitter Credentials for Twitter Authentication in R

To analyse tweets, you first need to segregate and download them on


the basis of some specific keywords.

note

R provides automatic downloading of tweets by using the searchTwit-


ter() function, which takes as its arguments the language in which the
tweets need to be searched, the keyword (term to be searched on the
Internet), and the number of tweets that need to be extracted contain-
ing the keyword.

Now, enter the following command to download 1000 English-language tweets (specified by the lang="en" argument to the searchTwitter() function) containing the word "nokia":

input_tweets=searchTwitter("nokia", n=1000,lang="en")

We can take a few tweets at a time to analyse different opinions, as shown by the following command:

input_tweets[1:3]


Figure 8.5 shows the commands along with their outputs:

Figure 8.5: Searching a Keyword in a Specified Number of Tweets

Some tweets containing the search word may be insignificant for our analysis. Therefore, we need to extract only the relevant text from the tweets. Enter the following command to extract the text of each tweet as a string:
tweet=sapply(input_tweets,function(x) x$getText())

The strings can be viewed as vectors by entering the following command:

tweet[1:4]
IM
The next task is to segregate the tweets based on the nature of the feedback they provide: positive, negative, or neutral. In our case, we are using only positive and negative words. The function for sentiment analysis is given as follows:

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  scores = laply(sentences, function(sentence, pos.words, neg.words)
  {
    # Remove punctuation, control characters, and digits
    sentence = gsub("[[:punct:]]", "", sentence)
    sentence = gsub("[[:cntrl:]]", "", sentence)
    sentence = gsub('\\d+', '', sentence)
    # Lower-case the text, returning NA if the conversion fails
    tryTolower = function(x)
    {
      y = NA
      try_error = tryCatch(tolower(x), error=function(e) e)
      if (!inherits(try_error, "error"))
        y = tolower(x)
      return(y)
    }
    sentence = sapply(sentence, tryTolower)
    # Split the sentence into individual words
    word.list = str_split(sentence, "\\s+")
    words = unlist(word.list)
    # Match the words against the positive and negative word lists
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # Score = number of positive matches minus number of negative matches
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df = data.frame(text=sentences, score=scores)
  return(scores.df)
}


After writing the preceding function, the files containing positive and
negative words are loaded to run the sentiment function. Enter the
following commands to load the data file containing positive and neg-
ative words, respectively:

pos=readLines("/Datasets/positive-words.txt")  # find file positive-words.txt
neg=readLines("/Datasets/negative-words.txt")  # find file negative-words.txt

Categorise each tweet as positive, negative, or neutral by using the


following code:

scores = score.sentiment(tweet, pos, neg, .progress='text')
scores$very.pos = as.numeric(scores$score > 0)
scores$very.neg = as.numeric(scores$score < 0)
scores$very.neu = as.numeric(scores$score == 0)

Figure 8.6 shows the commands along with their outputs:

Figure 8.6: Assigning Sentiment Scores to Tweets

Enter the following commands to find out the number of positive, neg-
ative, and neutral tweets:

# Number of positive, neutral, and negative tweets


numpos = sum(scores$very.pos)
numneg = sum(scores$very.neg)
numneu = sum(scores$very.neu)

Now, aggregate the final results by using the following commands:

# Final results aggregation
s <- c(numpos, numneg, numneu)
lbls <- c("POSITIVE", "NEGATIVE", "NEUTRAL")
pct <- round(s/sum(s)*100)
lbls <- paste(lbls, pct)
lbls <- paste(lbls, "%", sep="")


After the sentiments are categorised and the number of positive, neg-
ative, and neutral tweets is found out, plot the results by using the
following command:

# Plot the results
pie(s, labels = lbls, col = rainbow(length(lbls)), main = "OPINION")

The pie chart for the analysed sentiment scores is shown in Figure 8.7:

Figure 8.7: Pie Chart for the Sentiment Score

self assessment Questions

7. Social networking websites act as global platforms that allow people to share their likes, dislikes, and opinions on various topics. (True/False)
8. R is a statistical programming tool that implements modern statistical ______ to perform various types of analytical activities.

Activity

Search and enlist various text mining packages available in R.

8.6 Online Social Media Analysis


We can use online tools to analyse the text generated from social media. One of the major online social media analysis tools is Social Mention. The analysis of text to find positive, negative, and neutral views can be performed online through the following steps (you will need an Internet connection):
1. Open the following link in your browser:
   http://socialmention.com/


The Social Mention website will appear, as shown in Figure 8.8:

Figure 8.8: Showing the Social Mention Website
2. Type the name of the product about which you need to gather information in the Search box and press the Search button, as shown in Figure 8.9:

Figure 8.9: Searching a Product for Online Text Analysis


A Web page appears with the sentiment score for the product (Sony, in our case) displayed on the left-hand side, as shown in Figure 8.10:

Figure 8.10: Showing the Web Page with Sentiment Score for the Product


The extended view of the sentiment score marked in Figure 8.10 is shown in Figure 8.11:

Figure 8.11: Showing the Sentiment Score for the Product and Related Information
You can get more information on the feedback for the product by clicking any link on the Web page shown in Figure 8.10.

Apart from R, you can also use the Sentiment140 tool to analyse data on the basis of the feedback given by users. Sentiment140 uses Twitter's data for the analysis.

Figure 8.12 shows the Web page of the Sentiment140 text analysis
tool:

Figure 8.12: Showing the Web Page of Sentiment140


Figure 8.13 shows the online analysis of views for Toshiba products:

Figure 8.13: Showing the Online Analysis for Toshiba Products Using the Sentiment140 Tool
self assessment Questions

9. We cannot use online tools for the analysis of text generated from social media. (True/False)
10. Apart from R, you can also use the ____ tool to analyse data on the basis of the feedback given by users.

Activity

Search and find some tools that organisations use to analyse their social media competitors.

8.7 Mobile Analytics


We are now using fourth-generation (4G) wireless mobile technologies. When you look at the past, you will see that wireless mobile technologies have shown steady growth, evolving from 1G to 4G. With every major shift in the technology, there has been a corresponding improvement in both the speed and efficiency of mobile devices.

First-generation (1G) mobile devices provided only a 'mobile voice', while second-generation (2G) devices provided larger coverage and improved digital quality. Third-generation (3G) technology focused on multimedia applications, like videoconferencing through mobile phones. 3G opened the gates for mobile broadband, which was realised in fourth-generation (4G) devices. 4G provides wide-range access, multiservice capacity, integration of all older mobile technologies, and a low bit cost to the user.


Figure 8.14 shows the evolution of different generations of mobile technologies:
Figure 8.14: Evolution of Mobile Technologies
Source: 3GPP Alliance, UMTS forums, Informa telecoms, Motorola, ZTI.

note

The full forms of the terms used in Figure 8.14 are as follows:
• GSM: Global System for Mobile Communications
• CDMA: Code Division Multiple Access
• GPRS: General Packet Radio Service
• EDGE: Enhanced Data rates for GSM Evolution
• WCDMA: Wideband Code Division Multiple Access
• LTE: Long Term Evolution

“Forget what we have taken for granted on how consumers use the In-
ternet,” said Karsten Weide, research vice president, Media and En-
tertainment. “Soon, more users will access the Web using mobile devic-
es than using PCs, and it’s going to make the Internet a very different
place.”

According to the International Telecommunication Union (ITU), an agency of the United Nations (UN) responsible for issues related to information and communication technologies, the number of mobile users in 2008 was 548 million. This number increased to 6,835 million in 2012.

According to Informa, a US-based research firm, there were over 3.3 billion active cell phone subscriptions worldwide in 2007. This effectively means that around half of the total population of the earth was using mobile devices.


Given the above statistics, it is imperative for organisations to find


ways of analysing data related to mobile devices and use the data to
market and sell their products and services through these devices.
Mobile analytics is a tool that allows organisations to do this.

8.7.1 Define Mobile Analytics

Marketers want to know what their customers want to see and do on


their mobile device so that they can target the customer.

Similar to the process of analytics used to study the behavior of users


on the Web or social media, mobile analytics is the process of analys-
ing the behavior of mobile users. The primary goal of mobile analytics
is to understand the following:
• New users: These are users who have just started using a mobile service. Users are identified by unique device IDs. The growth and popularity of a service greatly depend on the number of new users it is able to attract.
• Active users: These are users who use a mobile service at least once in a specified period. If the period is one day, for example, an active user will use the service at least once during that day. The number of active users in any specific period of time shows the popularity of the service during that period.
• Percentage of new users: This is the percentage of new users over the total active users of a mobile service. This figure cannot exceed 100%, and a very low value means that the particular service or app is not doing very well.
• Sessions: When a user opens an app, it is counted as one session. In other words, a session starts with the launch of the app and finishes with the app's termination. Note that a session is not related to how long the app has been used.
• Average usage duration: This is the average duration for which a mobile user uses the service.
• Accumulated users: This refers to the total number of users (old as well as new) who have used an app before a specific time.
• Bounce rate: The bounce rate is expressed as a percentage (%) and can be calculated as follows (a small R illustration of this formula appears after this list):
  Bounce rate = (Number of sessions terminated on a specific page of the app / Total number of sessions of the app) × 100
  The bounce rate can be used by service providers to help them monitor and improve their service so that customers remain satisfied and do not leave the service.
• User retention: The number of new users still using an app after a certain period of time is known as the user retention of that app.
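A small R illustration of the bounce-rate formula, using hypothetical session counts, is as follows:

# Bounce rate for one specific page of an app
sessions_total      <- 4000  # total sessions of the app
sessions_terminated <- 900   # sessions that ended on the page in question

bounce_rate <- sessions_terminated / sessions_total * 100
bounce_rate                  # 22.5 (%)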


Commercially, mobile analytics can be defined as the study of data that is collected to achieve the following purposes:
• Track sales: Mobile analytics can track the sale of products.
• Analyse screen flow: Mobile analytics can track how and where a user touches the screen. This information can be used to make interactive GUIs and also to decide the placement of mobile ads.
• Keep customers engaged: Mobile analytics studies the behavior of users or customers and displays ads and other screens to keep them engaged.
• Analyse the preferences of visitors: On the basis of a user's touches, taps and other behavior on the screen, mobile analytics can analyse his or her preferences.
• Convert potential buyers into buyers: According to users' likes and dislikes, mobile analytics offers different products and services to them. The purpose of this exercise is to convert a visitor into a buyer.
• Analyse m-commerce activities of visitors: Mobile analytics can analyse the m-commerce activities of visitors and find out a lot of useful information, like a user's frequency of making purchases and the amount he or she is willing to spend. Mobile commerce (or m-commerce) refers to the delivery of electronic commerce capabilities directly into the consumer's hand, anywhere, via wireless technology.
• Track Web links that users visit on their mobile phones: Mobile analytics can be used to analyse the Web links visited by users and know their preferences.
N

8.7.2 Mobile Analytics and Web Analytics

Mobile analytics has several similarities with Web and social analytics; for example, both can analyse the behavior of a user with regard to an application and send this information to the service provider. However, there are also several important differences between Web analytics and mobile analytics. Some of the main differences are as follows:
• Analytics segmentation: Mobile analytics works on the basis of the location of mobile devices. For example, suppose a company offers a cab service in a city like New York. In this case, the company can use mobile analytics to identify and target people travelling in New York. Mobile analytics works on location-based segments, while Web analytics works globally.
• Complexity of code: Mobile analytics requires more complex code and programming languages to implement than Web analytics, which is easier to code.
• Network service providers: Mobile analytics is totally dependent on Network Service Providers (NSPs), while Web analytics is independent of this factor.
• Measure: Sometimes it is difficult to measure information from mobile analytics apps because they can run offline. Web analytics always runs online, so vital information can easily be measured with it.
• Tools: To perform the final analysis on data, mobile analytics tools need to be supplemented with some Web analytics tools. Web analytics, on the other hand, does not require any other tool for analysis.

8.7.3 Types of Results from Mobile Analytics

The study of consumer behavior helps business firms and other organisations improve their marketing strategies. Nowadays, every organisation is making extra efforts to understand and know the behavior of its consumers.

Mobile analytics provides an effective way of measuring large amounts of mobile data for organisations. It shows how well marketing tools, such as ads, are converting potential buyers into actual purchasers. It also offers deep insight into what makes people buy a product or service and what makes them quit a service.

The technologies behind mobile analytics, like the Global Positioning System (GPS), are more sophisticated than those used in Web analytics; hence, compared to Web analytics, users can be tracked and targeted more accurately with mobile analytics.

Mobile analytics can easily and effectively collect data from various data sources and turn it into useful information. Mobile analytics keeps track of the following information:
• Total time spent: This information shows the total time spent by a user with an application.
• Visitors' location: This information shows the location of the user using any particular application.
• Number of total visitors: This is the total number of users using any particular application, which is useful in knowing the application's popularity.
• Click paths of the visitors: Mobile analytics tracks the activities of a user visiting the pages of any application.
• Pages viewed by the visitor: Mobile analytics tracks the pages of any application visited by the user, which again reflects the popular sections of the application.
• Downloading choices of users: Mobile analytics keeps track of the files downloaded by the user. This helps app owners to understand the type of data users like to download.
• Type of mobile device and network used: Mobile analytics tracks the type of mobile device and network used by the user. This information helps mobile service providers and mobile phone sellers understand the popularity of mobile devices and networks and make further improvements as required.
• Screen resolution of the mobile phone used: Any information or content that appears on a mobile device is laid out according to the screen size of the device. This important aspect of ensuring that the content fits a particular device screen is handled through mobile analytics.
• Performance of advertising campaigns: Mobile analytics is used to keep track of the performance of advertising campaigns and other activities by analysing the number of visitors and the time spent by them, as well as through other methods.

8.7.4 Types of Applications for Mobile Analytics



There are two types of applications made for mobile analytics:
• Mobile Web analytics
• Mobile application analytics

Let's learn about these types of applications in detail in the following sections.

Mobile Web Analytics

Mobile Web refers to the use of mobile phones or other devices, like tablets, to view online content via a lightweight browser. The name of a mobile-specific site can take the form m.example.com. The mobile Web experience sometimes depends on the screen size of the device. For example, if you design an application for a small screen, its images will appear blurred on a big screen; similarly, if you make your site for the big screen, it can be heavy for a small-screen device. Some organisations are starting to build sites specifically for tablets because they have found that neither their mobile-specific site nor their main website ideally serves the tablet segment. To solve this problem, the mobile Web should have a responsive design. In other words, it should have the property of adapting the content to the screen size of the user's device.

Figure 8.15 shows the difference between a website, a mobile site, and
a responsive-design site:

Figure 8.15: Difference among a Website, Mobile Site, and Responsive-design Site

In Figure 8.15, you can see that a website can be opened on both computers and mobile phones, while a mobile site can be opened only on mobile phones; responsive-design sites, on the other hand, can be opened on any device, such as a computer, tablet, or mobile phone.

Mobile Application Analytics



The term mobile app is short for mobile application software. It is an application program designed to run on smartphones and other mobile devices.

Mobile apps are usually available through application distribution platforms, like the Apple App Store and Google Play, which are generally operated by the owners of the mobile operating systems. Examples of such distribution platforms include the Apple App Store, Google Play, Windows Phone Store, and BlackBerry App World. Some mobile apps are freely available, while others must be bought.

Depending on the objective of analytics, an organisation should decide whether it needs a mobile application or a mobile website. If the organisation wants to create an interactive engagement with users on mobile devices, a mobile app is a good option; for business purposes, however, mobile websites are more suitable than mobile apps.


Table 8.1 lists the main differences between mobile app analytics and mobile Web analytics:

Table 8.1: Differences between Mobile App Analytics and Mobile Web Analytics
• Screen and page: Mobile app analytics does not have pages; the user interacts with various screens. Mobile Web analytics has pages like normal websites, and users interact with various pages.
• Use of built-in features of mobile devices: Mobile app analytics can access built-in features, such as the gyroscope, GPS, accelerometer, and storage. Mobile Web analytics does not use such built-in features.
• Session time: Mobile app analytics has shorter session timeouts (around 30 seconds), while mobile Web analytics has longer session timeouts; in general, a website session ends after 30 minutes of inactivity.
• Online/Offline: Depending on how it was developed, a mobile app may not require a connection to a mobile network, whereas mobile Web analytics requires an Internet connection and can run online only.
• Updates: App owners provide frequent updates and new versions of their apps, while website updates are not as frequent.
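The session-time point above can be made concrete with a small sketch: given hypothetical page-view timestamps for one mobile Web user, a new session is assumed to begin whenever the gap between consecutive page views exceeds the 30-minute timeout.

# Count sessions from page-view times (in minutes) with a 30-minute timeout
events  <- c(0, 5, 12, 60, 62, 130)  # hypothetical page-view times for one user
timeout <- 30

gaps     <- diff(events)             # minutes between consecutive page views
sessions <- 1 + sum(gaps > timeout)  # each long gap starts a new session
sessions                             # 3 sessions for this user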

Exhibit

Enhancement in Search Queries on Mobile Phones

According to Google, the percentage of search queries on mobile


phones has increased manifold in the last few years. Some of the
most searched industries on mobiles are as follows: restaurants
(29.6%), auto (16.8%), electronics (15.5%), and insurance (15.4%).

self assessment Questions

11. Which of the following technologies focused on multimedia applications, like videoconferencing through mobile phones?
a. 3G    b. 2G
c. 1G    d. None of these
12. CDMA stands for Code ______ Multiple Access.


Activity

Prepare a report on Amazon Mobile Analytics.

8.8 Mobile Analytics Tools


Advances in analytic technologies and business intelligence are allow-
ing CIOs to go big, go fast, go deep, go cheap and go mobile with business
data.—www.CIO.com

The fundamental task of mobile analytics tools is similar to that of other digital analytics tools, like Web analytics tools: they capture and collect data, and help generate reports that can be used meaningfully after processing.

The selection of analytics tools is not an easy process because these
tools are new and undergo rapid enhancements as compared to tra-
ditional Web analytics tools. Companies frequently upgrade their ex-
IM
isting analytical tools as well as launch new tools with new features.

Mobile analytics tools have some technical limitations; not all tools provide all services, so you must find out which tool will be beneficial for you. Following are some points to consider while selecting mobile analytics tools:

‰‰ What is your analytical goal?: No single mobile analytics tool can fulfill all your needs; therefore, it is essential to set your analytical goal so that you can select the right tools. You can take the help of experts to frame your goal in terms of the capabilities of available tools.

‰‰ Analysis techniques: Various techniques exist to analyse information about mobile users’ behavior. For example, through ‘packet sniffing’ an intruder can obtain important personal information of users from data packets; ‘image-based tagging’ uses a query string to guess the activities of the user; and ‘data collection scripts’ analyse users’ requests.
‰‰ Way of presentation: Information flows between a mobile device and the server according to the client–server architecture. Mobile analytics captures data as it passes through the mobile network. The capture is done on an intermediate server placed between the mobile devices and the actual application server. This type of tracking generates a large amount of data in the form of log files, and reading these log files is a daunting task.
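To make this concrete, here is a minimal Python sketch of reading such a log file. The space-separated log format (timestamp, device ID, screen, duration) is a hypothetical illustration, not any specific vendor's format:

from collections import Counter

def screens_viewed(log_path):
    # Count how often each screen appears in the captured log file.
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) == 4:  # skip malformed lines
                timestamp, device_id, screen, duration = fields
                counts[screen] += 1
    return counts

# Example (hypothetical file name): screens_viewed("mobile_capture.log")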

Now, the question is, ‘how is the information presented to you?’ You
must choose an analytical tool that best suits your requirements.


There are two classes of mobile analytics tools:


‰‰ Internal mobile analytics tools: These refer to software that an organisation runs itself; it may be provided by SaaS (Software as a Service) vendors or installed and maintained by the IT department in the organisation’s own data center. Some examples of internal mobile analytics tools are Localytics and WebTrends.
‰‰ External mobile analytics tools: These are services provided by
third-party data vendors, which are responsible for collecting, ma-
nipulating, analysing, and generating reports for the customers
from their proprietary systems. Some examples of external mobile
analytics tools are comScore and Groundtruth.

According to Aconitum Mobile (a software development company in the US), the top four mobile analytics tools (or packages) are as follows:
1. Localytics: This is a large marketing and analytics platform for mobile and Web apps, developed by Localytics in Boston. It supports cross-platform and Web-based applications. For more details, you can check out their website at www.localytics.com. Localytics supports push messaging, business analytics, and acquisition campaign management, and its customer list includes Microsoft, New York Times, ESPN, Soundcloud, and eBay.

2. Appsee: Appsee was founded in 2012 by Zahi Boussiba and Yoni Douek and is based in Tel Aviv, Israel. It provides analytical services with features like conversion funnel analysis, heatmaps, and much more. For more details, you can check out their website at http://www.appsee.com.

3. Google Analytics: Google Analytics is a great free service provided by Google. It is a cross-platform, Web-based application that offers analytics services to mobile app owners and developers. You can check out the tool at www.google.com/analytics.
4. Mixpanel: Mixpanel is a ‘business analytics service.’ It is also a cross-platform, Web-based application. It continually follows user interactions with mobile and Web applications and offers analysis tools to the user on that basis. For more information, you can visit the website at www.mixpanel.com.
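As an illustration of how such tools capture data, the following minimal sketch uses Mixpanel's official Python client; the project token, user ID, and event details are placeholders, not real values:

from mixpanel import Mixpanel

# Placeholder token; a real one comes from your Mixpanel project settings.
mp = Mixpanel("YOUR_PROJECT_TOKEN")

# Record one user interaction as a named event with properties.
mp.track("user_42", "Screen Viewed", {"screen": "product_listing", "platform": "android"})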

Mobile analytics tools can be categorised as follows:


‰‰ Location-based tracking tools
‰‰ Real-time analytics tools
‰‰ User behavior tracking tools


8.8.1 Location-based Tracking tools

A location-based tracking tool stores information about the location of mobile devices (and hence the location of the user). These tools are software applications for mobile devices; they continuously monitor the location of the device and manipulate the information thus obtained in various ways. For example, such a tool can display the location of friends (persons in the contact list of your mobile phone), ATMs, cafés, hotels, and nearby police stations. The following are some location-based tracking tools:
‰‰ Geoloqi: This tool is a platform for location-based services; it was launched in 2010 by its founder, Amber Case, in the United States. It supports the creation of location-based notes and time-limited private location sharing.

‰‰ Placed: Placed provides a ‘ratings service’ by measuring various types of information, such as places visited and duration of visits. This is an efficient tool for explaining offline consumer behavior.
8.8.2 Real-Time Analytics Tools

“The 6.8 billion subscribers are approaching the 7.1 billion world popu-
lation” (ITU). This is illustrated in Figure 8.16.

Figure 8.16 shows the growth of mobile phone users (in billions) with
respect to years:

Figure 8.16: Growth of Mobile Subscribers

With the increase in the popularity of mobile phones and mobile Web,
business organisations want to know more about the behavior of the
user. Real-time analytics tools refer to software tools that analyse and
report data in real time.

The following are some real-time analytics tools:


‰‰ Geckoboard: Geckoboard is a real-time dashboard that can col-
lect, display, report, and share data that is important for you and
your business in real time. According to Damian Kimmelman,


the founder and CEO at Duedil, “Geckoboard simplifies the deci-


sion-making process. It is hard to dispute something that’s right in
front of you.” Figure 8.17 shows the Geckoboard application:

Figure 8.17: Geckoboard Application
‰‰ Mixpanel: Mixpanel is a Web-based, cross-platform service. It is a business analytics service that tracks user interactions with mobile Web applications and offers services to users on the basis of user behavior. Mixpanel does all its activities in real time. Suhail Doshi and Tim Trefren founded Mixpanel in 2009 in San Francisco, California. Figure 8.18 shows the Mixpanel application:

Figure 8.18: Mixpanel Application

8.8.3 User Behavior Tracking Tools

A user behavior tracking tool is a software tool that tracks user behavior within a particular mobile application.

These behavior reports can help organisations to improve their appli-


cations and services. You can get a lot of information about your users,
such as:
‰‰ Screens viewed in a session


‰‰ Number of screens viewed in a session


‰‰ Technical errors faced by the user
‰‰ The frequency of the user using any particular app
‰‰ Session length
‰‰ Time taken in the loading of any app elements
‰‰ Any specific action performed with the content, such as clicking
of ads
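For illustration, a single tracked event covering the items above might be represented as the following Python record; the field names are hypothetical, not any particular vendor's schema:

screen_view_event = {
    "session_id": "a1b2c3",
    "screen": "product_listing",   # screen viewed in the session
    "screens_in_session": 5,       # number of screens viewed so far
    "error": None,                 # technical error faced by the user, if any
    "app_open_count": 12,          # how frequently the user opens the app
    "session_length_s": 340,       # session length in seconds
    "load_time_ms": 420,           # time taken to load app elements
    "ad_clicked": True,            # specific action performed with the content
}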

These reports provide an excellent way for an organisation to know how users use specific applications on their mobile phones. By customising the tracking settings, an organisation can observe the behavior of the user on any particular page (such as the product listing page) and use the information to fulfill their business objectives. The following are some popular behavior-tracking tools:
‰‰ TestFlight: Let’s suppose a company has many testers spread over different countries (or locations), and it has a new app that it wants to test. How is it going to test the app? The solution is TestFlight. TestFlight is a free software platform through which a team of developers can distribute beta and internal iOS applications and manage testing and feedback using the TestFlight dashboard. The TestFlight SDK has a wide range of useful APIs to test an application from various dimensions. Figure 8.19 shows the TestFlight application:

Figure 8.19: TestFlight Application


‰‰ Mobile app tracking: This tool tracks and analyses mobile app installations. It can report user engagement and Lifetime Value (LTV) beyond installations, and it is efficient at scaling mobile advertising campaigns. It can capture a customer’s behavior related to an ad from the moment the ad is tapped. Lifetime Value (LTV) refers to the net profit attributed to the entire future relationship with the customer.


Figure 8.20 shows the Mobile App Tracking application:

Figure 8.20: Mobile App Tracking Application

self assessment Questions


13. Through _________, an intruder can obtain important personal
information of users from data packets.
14. Geckoboard is a _______ that can collect, display, report, and
share data that is important for you and your business.

Activity

Search and enlist at least 10 mobile analytics tools used by organi-


sations these days.

8.9 Performing Mobile Analytics


In this section, you will first understand the basic steps to integrate
mobile analytics within your business processes. Then, you will do a
practical where you will analyse datasets on mobile phones by using
mobile applications.

To integrate mobile analytics with business processes, you must per-


form the following basic steps:
‰‰ Select the appropriate mobile device like a smartphone or tablet.
‰‰ List the objectives of mobile analytics for your business process.
‰‰ Identify the target audience and create a dataset for it.
‰‰ Clean the dataset by filling in missing values and removing unnecessary data.
‰‰ Use the dimension-reduction technique and transform data (if
possible).


‰‰ Perform some required data mining techniques, like text mining.


‰‰ Select data mining algorithms and perform the required analysis for mobile mining.
‰‰ Apply the data mining algorithms and verify the relationships among the variables.
‰‰ Evaluate the result/interpretation.
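A minimal Python sketch of these steps is given below, assuming the dataset is a CSV file of user records; the file name and column choices are illustrative only:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("target_audience.csv")      # dataset for the target audience
df = df.dropna(axis=1, how="all")            # remove unnecessary (empty) columns
df = df.fillna(df.mean(numeric_only=True))   # fill in missing values
features = df.select_dtypes("number")

reduced = PCA(n_components=2).fit_transform(features)                 # dimension reduction
df["segment"] = KMeans(n_clusters=3, n_init=10).fit_predict(reduced)  # mining step

print(df.groupby("segment").size())          # evaluate the result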

As we have discussed in the earlier section, the whole process of mo-


bile communication is done through the client–server method. So, the
mobile analytics process can be done either on a mobile device or on
the server providing services to the mobile device. According to the
location of data collection, mobile analytics is categorised as follows:
‰‰ Data collection through mobile device
‰‰ Data collection on server

8.9.1 Data Collection through Mobile Device

Data for analysis is collected on the mobile devices and sent back to the server for further manipulation. This collection process may be done online as well as offline: some data collection processes require an Internet connection to send collected data to the server, whereas several applications collect data into spreadsheets and do not require an Internet connection. Collected data can be stored in various formats.
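A simple Python sketch of such offline collection is shown below: records are buffered to a local CSV file and pushed to the server only when a connection becomes available. The field names and the upload callable are assumptions for illustration:

import csv
import os

BUFFER = "collected.csv"

def record_offline(row):
    # Append one collected record to the local buffer file.
    is_new = not os.path.exists(BUFFER)
    with open(BUFFER, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["field_agent", "location", "reading"])
        writer.writerow(row)

def sync_when_online(upload):
    # 'upload' is any callable that sends one row to the server.
    with open(BUFFER) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            upload(row)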

Data collection through mobile devices has the following benefits:


‰‰ Allows data collection in real time
‰‰ Proves beneficial for field executives

‰‰ Allows data collection to be stored in various formats like text,


graphs, images, etc.
‰‰ Allows the location of field executives to be tracked and the management to assign tasks to them according to their location
‰‰ Reduces unnecessary paperwork
‰‰ Prevents data redundancy because the data is collected and shared
with the entire team within seconds

The following are some commonly used data collection applications:


‰‰ Numbers is a very simple-to-use spreadsheet tool from Apple. It can be used to organise data, perform calculations, and manage lists with a few taps. It provides various templates for making graphs and charts; moreover, you can create your own templates. It has around 250 functions that can perform simple-to-complex calculations, and you can create your own formulas through the built-in functions and the help feature.


Figure 8.21 shows the GUI of Numbers:

Figure 8.21: GUI of Numbers
‰‰ HanDBase is a relational database management system. It was initially designed to run on Palm PDAs but can run on almost any handheld platform. HanDBase is not as full-featured as Oracle, Sybase, or DB2, but it has various other features that make it important for computing: it is simple, supports multiple handheld platforms, and provides high security. The company offers a few apps to download free from its website.

Figure 8.22 shows the GUI of HanDBase:



Figure 8.22: GUI of HanDBase


‰‰ Statistics Visualiser is a tool for iPads, nicknamed StatViz. It is a perfect statistical data tool for students and researchers: it performs statistical calculations and provides results with detailed explanations. The main feature of this tool is the dynamic graph, which helps you quickly understand difficult statistical concepts.


Figure 8.23 shows the GUI of StatViz:

Figure 8.23: GUI of StatViz

8.9.2 Data Collection on Server
As we have studied in the previous section, data collected by a mobile device is ultimately transferred to its server for analysis. A server stores the received data, performs analysis over it, and creates reports.

The following are some popular applications that collect data into the
server:
‰‰ DataWinners is a data collection service designed for experts. This application converts paper forms into digital questionnaires, and team members can submit their data through any service like SMS, the Web, etc. DataWinners provides an efficient data collection facility that can reduce users’ decision-making time. The home page of the DataWinners application is shown in Figure 8.24:

Figure 8.24: Home Page of DataWinners


‰‰ COMMANDmobile is an application that provides mobile data collection and workforce management services. COMMANDmobile is more accurate and efficient than traditional paper-and-pencil approaches. The application uploads the data as soon as it is collected, and its response time is very short; thus, it can perform well in emergencies. It gives a very good ROI (Return on Investment).


The home page of the COMMANDmobile application is shown in Figure 8.25:

Figure 8.25: Home Page of COMMANDmobile

Till now, you have studied various fundamental concepts related to mo-
bile analytics. Now, let’s do a practical activity with mobile analytics.

Get ready to analyse datasets on mobile phones using a mobile application. We have already discussed the key points for analysing data by using mobile phones and tablets. Now, through this hands-on practice, you will analyse data stored in a given dataset available on a mobile device. You must have an Android mobile phone and an Internet connection to do this practical work.

The objective of the activity is as follows: The marketing manager of a company has performed an analysis on the listed prices of items that the company needs to purchase from its peers. The manager wants to present the results of the analysis to the company’s top management during a meeting, using graphical techniques, on a tablet that runs on the Android Operating System (OS). The marketing manager will use a mobile application, which needs to be downloaded and installed on the tablet for demonstrating the analysis results. You are going to help the manager download and install the application and create graphs using it.

Perform the following steps to download and install the Graph Trial
app, and then create a graph to present the results of data analysis:
1. Open the Google Play Store by tapping the Play Store icon on
the screen of any android phone or tablet. A window appears,
showing the contents of the play store.


2. Type Graph Trial in the search box, and tap the button to start
the search operation, as shown in Figure 8.26:

Figure 8.26: Showing the Google Play Store Window
A window appears, containing the list of available apps for the
particular search item.

3. Select the first app, named Graph trial, from the window by
tapping it, as shown in Figure 8.27:

Figure 8.27: Selecting the Graph Trial App from the List
The next window appears, asking for permission to install the
app.


4. Tap the ACCEPT button to install the app on your device, as


shown in Figure 8.28:

Figure 8.28: Showing the Installation Permission Window

A new window appears with the INSTALL button.
5. Tap the INSTALL button to install the app, as shown in Figure 8.29:

Figure 8.29: Showing the Install App Window


The downloading and installation process of the Graph trial app
begins, as shown in Figure 8.30:

Figure 8.30: Showing the Downloading Window


After the installation is completed, the Graph trial icon appears


on the tablet’s screen, as shown in Figure 8.31:

Figure 8.31: Showing the Tablet Screen Containing the Installed App Icon
6. Tap the Graph trial icon to get to the home screen, as shown in
Figure 8.32:

Figure 8.32: Showing the App Home Screen


7. Tap the button to get the graph type selection window, as
shown in Figure 8.33:

Figure 8.33: Showing the Graph Type Selection Window


8. Select the type of graph you want to create by tapping on its icon.
In our case, we have selected simple graph. The Create simple
graph window appears.
9. Select the simple type of graph from the Graph type tab. In our
case, we have selected the Bar graph, as shown in Figure 8.34:

Figure 8.34: Showing the Create Simple Graph Window
10. Input the details in the Y axis title, Min, and Max fields, as shown in Figure 8.35:

Figure 8.35: Showing the Bar Graph Window


11. Scroll down the window to input details in other fields, as shown
in Figure 8.36:

Figure 8.36: Showing the Data Input Window


12. Tap the Save button to get the Barchart of list items icon, as
shown in Figure 8.37:

Figure 8.37: Showing the Window with the Barchart of list items Icon

13. Long tap the Barchart of list items icon to see the graph, as
shown in Figure 8.38:

Figure 8.38: Showing the Bar Graph for a Given Dataset


14. Select the Pie tab by tapping it to get a pie chart, as shown in
Figure 8.39:

Figure 8.39: Showing a Pie Chart for the Given Dataset
15. Select the Line tab by tapping it to get a line chart, as shown in Figure 8.40:

Figure 8.40: Showing a Line Chart for the Given Dataset


Till now, we have created different types of simple graphs for a
sample dataset provided in our virtual lab.


Let’s now create a bar graph by loading a dataset stored as a


Comma Separated Value (CSV) file in the tablet’s memory.
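For reference, such a CSV file could look like the following; the item names and listed prices are purely illustrative:

Item,Listed Price
Office chair,4500
Desk lamp,1200
Filing cabinet,7800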
16. Tap the Settings button to open a Settings window, as shown in
Figure 8.41:


Figure 8.41: Showing the Settings Window


17. Tap the CSV import option to open the window listing the CSV
files in the selected location, as shown in Figure 8.42:

Figure 8.42: Showing the List of CSV File in a Memory Location


18. Select the particular file name to get a graph for the dataset.
19. Select the size of the graph by tapping an option from the Select image size window, as shown in Figure 8.43:

Figure 8.43: Selecting the Graph Size
20. Tap the OK button to save the graph in the form of an image, as shown in Figure 8.44:



Figure 8.44: Saving a Graph as an Image


After saving the graph as an image, you can share it via e-mail.
To do this, proceed to the next step.


21. Tap the Share option to open the Share graph image window, as
shown in Figure 8.45:

Figure 8.45: Showing the Share graph image Window
22. Tap the option through which you want to share the image. In our case, we have selected Gmail, as shown in Figure 8.46:



Figure 8.46: Selecting the Particular Option for Sharing the Graph Image


23. Enter the details in the required fields, as shown in Figure 8.47:

Figure 8.47: Entering Details
24. Tap the Share button and exit the application by pressing
the OK button on the pop-up box that appears, as shown in
Figure 8.48:

Figure 8.48: Sharing the Graph Image

We can quickly create some dashboards by using the above procedure


to create a presentation of the analysis on a mobile phone or a tablet.
The charts included in the presentation can also be shared quickly
through e-mails.


Exhibit

Premier Inn Generates £1m in revenues through mobile analytics

Premier Inn is a chain of hotels in the UK. It is the largest hotel brand in the UK, having around 650 hotels. Premier Inn launched a mobile app in January 2011 to make bookings online. Through this mobile app, Premier Inn was able to generate revenues of over £1m in just three months of launching the app. Since then, the app has achieved more than two million downloads. Around 77% of the total bookings are made through the mobile app.

How was the hotel able to achieve such big revenues? Actually, the magic behind the success of the Premier Inn mobile app was mobile data analytics provided by Grapple, a mobile-innovation agency. Grapple collected data from its clients’ 300 branded applications. Branded applications are those which either offer a utility or make the life of the customer easier when he or she is on the move. Grapple analysed this data to enable companies, such as Premier Inn, to better understand customer behavior and make the required changes to improve sales, customer retention, and loyalty. Premier Inn used Grapple’s analysis to improve the features and functionality of its mobile application, increase sales conversion rates from 3% to 5.9%, and generate revenues of £1m in a short period of three months.

self assessment Questions

15. The mobile analytics process can be done either on a mobile


device or on the _______ providing services to the mobile
device.
16. DataWinners is the data collection service design for experts.
(True/False)

Activity

Prepare a report on mobile app analytics.


8.10 Challenges of Mobile Analytics


Mobile analytics has its own challenges. Some of the main ones can be
listed as follows:
‰‰ Unavailability of uniform technology: Different mobile phones
support different technologies. For example, some mobile phones
support images, JavaScript, HTML, and cookies while others
do not.
‰‰ Random change in subscriber identity: The TMSI (Temporary Mobile Subscriber Identity) is the identity of a mobile device and can be known by the mobile network being used. This identity is randomly assigned by the VLR (Visitor Location Register) to every mobile device (after it is switched on) located in the area. This random change in the subscriber identity makes it difficult to gather important information, such as the location of the user.
‰‰ Redirect: Some mobile devices do not support redirects. The term ‘redirect’ is used to describe the process in which the system automatically opens another page.
‰‰ Special characters in the URL: In some mobile devices, some
special characters in the URL are not supported.
‰‰ Interrupted connections: The mobile connection with the tower is not always dedicated; it can be interrupted when the user is moving from one tower to another. This interruption in the connection breaks the requests sent by the devices.

In addition to the general issues mentioned above, mobile analysts also face the following critical issues, which discourage mobile analytics marketing:
‰‰ Limited understanding of the network operators: Network oper-
ators are unable to understand the business processes happening
outside the carrier’s firewall.
‰‰ True real-time analysis: True real-time data analysis is not always
possible with mobile analytics due to various reasons such as sig-
nal interruption, variation in technology used in mobiles, random
change in subscriber ID, etc.
‰‰ Security issues: Mobile technology has various important features, but some of them, such as GPS, cookies, Wi-Fi, and beacons, can disclose important information about the user. Information like details of credit cards, bank accounts, medical history, or other personal content can be easily misused. Some techniques like Deep Packet Inspection (DPI), Deep Packet Capture (DPC), and application logs can increase security threats.

To cope with such security threats, business organisations must intel-


ligently monitor all communications in real time and make sure that
personal data is not accessible to everyone.


self assessment Questions

17. TMSI stands for


a. Temporary Mobile Subscribed Identity
b. Temporary Mobile Subscriber Identification
c. Temporary Mobile Subscribed Identification
d. Temporary Mobile Subscriber Identity
18. The term ______ is used to describe the process in which the
system automatically opens another page.

Activity

Determine the ways to overcome the challenges in the field of mo-
bile marketing and mobile advertising.
8.11 SUMMARY
‰‰ Social media refers to a computer-mediated, interactive, and in-
ternet-based platform that allows people to create, distribute, and
share a wide range of content and information, such as text and
images.

‰‰ Social media analytics is the practice of collecting data from so-


cial media, websites or blogs and analysing the data to take crucial
business decisions.
‰‰ Text mining or text analytics comes as a handy tool to quantitatively examine the text generated by social media and filtered in the form of different clusters, patterns, and trends.


‰‰ Sentiment analysis involves careful analysis of people’s opinions,
sentiments, attitudes, appraisals, and evaluations.
‰‰ Automated sentiment analysis is still evolving as it is difficult to
interpret the conditional phrases used by people to express their
sentiments on social media.
‰‰ 4G provides wide range access, multiservice capacity, integration
of all older mobile technologies, and low bit cost to the user.
‰‰ Mobile analytics has several similarities with web and social ana-
lytics, such as both can analyse the behavior of the user with regard
to an application and send this information to the service provider.
‰‰ Mobile web refers to the use of mobile phones or other devices like
tablets to view online content via a light-weight browser.
‰‰ Mobile apps are usually available through application distribution platforms like the Apple App Store and Google Play.


‰‰ Mobile analytics tools have some technical limitations; not all tools provide all services, so you must find out which tool will be beneficial for you.

key words

‰‰ Blog: It represents an online journal to showcase the content


organised in the reverse chronological order.
‰‰ Microblogs: The types of blogs that allow people to share and
showcase small posts and are suitable for quick sharing of con-
tent in a few lines of text or an individual photo or video.
‰‰ Wiki: It represents a collaborative website in which the members can create and modify content in a community-based database.
‰‰ Social networks: It is a network that generally supports the exchange of information and data in various formats, such as text, videos, and photos.
‰‰ Text mining tools: The tools used to identify themes, patterns, and insights hidden in the structured as well as unstructured data.

8.12 DESCRIPTIVE QUESTIONS


1. Discuss the concept of social media analytics with suitable

example.
2. Enlist and explain the key elements of social media analytics.
3. What do you understand by text mining? Discuss the key steps
for any text mining process.

4. Explain the concept of mobile analytics with appropriate


examples.
5. Enlist the differences between Web analytics and mobile
analytics.
6. Describe the tasks of mobile analytics tools.

8.13 Answers and Hints

Answer FOR SELF ASSESSMENT QUESTIONS

Topic Q. No. Answers


Social media analytics 1. Social Bookmarking
2. Blogging, microblogging
Key elements of social media 3. b. Curate
analytics
4. Sharing



Overview of text mining 5. True
6. Information retrieval, natu-
ral learning
Performing social media ana- 7. True
lytics and opinion mining on
tweets
8. Algorithms
Online social media analysis 9. False
10. Sentiment140
Mobile analytics 11. a. 3G
12. Division
Mobile analytics tools 13. Packet sniffing

14. Real-time dashboard
Performing mobile analytics 15. Server
16. True
17. d. Temporary Mobile Subscriber Identity
Challenges of mobile analytics 18. Redirect

HINTS FOR DESCRIPTIVE QUESTIONS


1. Social media refers to a computer-mediated, interactive, and

Internet-based platform that allows people to create, distribute,


and share a wide range of content and information, such as text
and images. Refer to Section 8.2 Social Media Analytics.
2. Incorporating social media into everyday sales and marketing

routines of an organisation is not easy and requires gaining


a command over certain set of tactics and tools related to the
efficient management and utilisation of social media. Refer to
Section 8.3 Key Elements of Social Media Analytics.
3. Text mining or text analytics comes as a handy tool to
quantitatively examine the text generated by social media and
filtered in the form of different clusters, patterns, and trends.
Refer to Section 8.4 Overview of Text Mining.
4. Similar to the process of analytics used to study the behavior of
users on the Web or social media, mobile analytics is the process
of analysing the behavior of mobile users. Refer to Section
8.7 Mobile Analytics.
5. Mobile analytics has several similarities with Web and social
analytics, such as both can analyse the behavior of the user with
regard to an application and send this information to the service
provider. However, there are also several important differences
between Web analytics and mobile analytics. Refer to Section
8.7 Mobile Analytics.


6. The fundamental task of the mobile analytics tool is similar to


other digital analytical tools like Web analytics. Refer to Sections
8.7 Mobile Analytics and 8.8 Mobile Analytics Tools.

8.14 SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
‰‰ Ganis, M., & Kohirkar, A. (2016). Social media analytics: techniques
and insights for extracting business value out of social media. New
York: IBM Press.
‰‰ Rowles, D. (2017). Mobile marketing: how mobile technology is
revolutionizing marketing, communications and advertising. Lon-
don: Kogan Page.

E-REFERENCES
‰‰ Top 25 social media analytics tools for marketers - Keyhole. (2017, March 9). Retrieved April 28, 2017, from http://keyhole.co/blog/list-of-the-top-25-social-media-analytics-tools/
‰‰ Social media analytics. (2017, April 13). Retrieved April 28, 2017,
from https://en.wikipedia.org/wiki/Social_media_analytics
‰‰ What is social media analytics? - Definition from WhatIs.com. (n.d.). Retrieved April 28, 2017, from http://searchbusinessanalytics.techtarget.com/definition/social-media-analytics
‰‰ Mobile Analytics Key Benefits | Mobile Marketing. (n.d.). Retrieved April 28, 2017, from https://www.webtrends.com/products-solutions/digital-analytics/mobile-analytics-use-cases/


Chapter 9

Data Visualisation

CONTENTS

9.1 Introduction
9.2 What is Visualisation?
9.2.1 Ways of Representing Visual Data
9.2.2 Techniques Used for Visual Data Representation
9.2.3 Types of Data Visualisation
9.2.4 Applications of Data Visualisation
Self Assessment Questions
Activity

9.3 Importance of Big Data Visualisation


9.3.1 Deriving Business Solutions
9.3.2 Turning Data into Information
Self Assessment Questions

Activity
9.4 Tools Used in Data Visualisation
9.4.1 Open-Source Data Visualisation Tools
9.4.2 Analytical Techniques Used in Big Data Visualisation
Self Assessment Questions
Activity
9.5 Summary
9.6 Descriptive Questions
9.7 Answers and Hints
9.8 Suggested Readings & References


Introductory Caselet

How a company used the power of data visualisation for better analytics

Knowledgent, an industry information consultancy company, helps organisations in transforming their information into business results by using innovations in data and analytics. The company’s expertise integrates industry experience, the capabilities of data analysts and scientists, and data architecture and engineering skills to discover areas that require action.
One of the client companies of Knowledgent, a commercial distribution company, had grown rapidly with a regular series of achievements. However, the company was facing the problem of using critical business information extracted from Enterprise Resource Planning (ERP) across a variety of different data architectures and source systems for data-driven decision making.
Main business stakeholders were making their decision on the
basis of manual assembled reports, which were lacking measur-
IM
able consistency, data reliability, and metric transparency.
As a result, the company realised that they require a way to
visualise key performance areas across the organisation. They
had the requirement of creating real-time dashboards with a
consistent user interface, across Sales, Finance, and Operations.
Knowledgent provided an Enterprise Data Warehouse and data
M

visualisation solution to the client company, which was imple-


mented by using Agile methodology. They started the project by
dividing it into three phases, where each business unit was start-
ed from Sales data. At each phase of project, Knowledgent’ team
N

conducted an assessment of the ERP system with the focus on


dimensions and measures required at each phase. They also de-
manded the clients to define key performance indicators (KPI),
dashboards, and required end-user reporting capabilities. After
that, Knowledgent designed the Enterprise Data Warehouse to
support visualisations, and then implemented it. The ETL pro-
cess was developed to integrate data from different sources and
normalise and harmonise it. Finally, a commercial visualisation
tool was used to manage visualisation development in combina-
tion with stakeholders.
At the end of the project, the client company had gained robust analytics and reporting efficiencies across organisation, customer, product, sales, and supplier data. Its analytic dashboards can now surface key business drivers, trends, and issues. In fact, the company considers its analytics and data visualisation capabilities a differentiator from its competitors.


learning objectives

After studying this chapter, you will be able to:


>> Describe the meaning of visualisation
>> Discuss the importance of Big Data visualisation
>> Explain the tools used in data visualisation

9.1 INTRODUCTION
In the previous chapter, you have learned about prescriptive analyt-
ics. It is the final phase of Business Analytics, which uses fundamen-
tals of mathematical and computational sciences to provide different
decision options for taking the benefit of the results of descriptive and

S
predictive analytics.

Data visualisation is a pictorial or visual representation of data with


the help of visual aids such as graphs, bar, histograms, tables, pie
IM
charts, mind maps, etc. Depending upon the complexity of data and
the aspects from which it is analysed, visuals can vary in terms of their
dimensions (one-/two-/multi-dimensional) or types, such as temporal,
hierarchical, network, etc. All these visuals are used for presenting
different types of datasets. Different types of tools are available in the
market for visualising data. But what is the use of data visualisation
M

in Big Data? Is it necessary to use it? To answer these questions, we


need to track down the real meaning of visualisation in the context of
Big Data analytics.

This chapter familiarises you with the concept of data visualisation


N

and the need to visualise data in Big Data analytics. You also learn
about different types of data visualisations. Next, you learn about var-
ious types of tools using which data or information can be presented
in a visual format.

9.2 WHAT IS VISUALISATION?


Visualisation is a pictorial or visual representation technique. Any-
thing that is represented in pictorial or graphical form, with the help
of diagrams, charts, pictures, flowcharts, etc. is known as visualisa-
tion. Data presented in the form of graphics can be analysed better
than the data presented in words.

9.2.1 WAYS OF REPRESENTING VISUAL DATA

The data is first analysed and then the result of that analysis is visu-
alised in different ways as discussed above. There are two ways to
visualise a data—infographics and data visualisation:
‰‰ Infographics are the visual representations of information or data.
The use of colorful graphics in drawing charts and graphs helps in


improving the interpretation of given data. Figure 9.1 shows an


example of infographics:

Figure 9.1: An Example of Infographics


Source: http://www.jackhagley.com/What-s-the-difference-between-an-Infographic-and-a-Da-
ta-Visualisation

‰‰ Data visualisation approach is different from Infographics. It is the


study of representing data or information in a visual form. With
the advancement of digital technologies, the scope of multimedia
has increased manifold. Visuals in the form of graphs, images, di-
agrams, or animations have completely proliferated the media in-
dustry and the Internet. It is an established fact that the human
mind can comprehend information more easily if it is presented in
the form of visuals. Instructional designers focus on abstract and
model-based scientific visualisations to make the learning content
more interesting and easy to understand. Nowadays, scientific
data is also presented through digitally constructed images. These
images are generally created with the help of computer software.
Visualisation is an excellent medium to analyse, comprehend, and
share information. Let’s see why:
 Visual images help in transmitting a huge amount of informa-
tion to the human brain at a glance.


 Visual images help in establishing relationships and distinc-


tions between different patterns or processes easily.
 Visual interpretations help in exploring data from different angles, which helps in gaining insights.
 Visualisation helps in identifying problems and understanding
trends and outliers.
 Visualisations point out key or interesting breakthroughs in a
large dataset.

Data can be classified on the basis of the following three criteria irre-
spective of whether it is presented as data visualisation or infographics:
‰‰ Method of creation: It refers to the type of content used while cre-
ating any graphical representation.

S
‰‰ Quantity of data displayed: It refers to the amount of data which
is represented.
‰‰ Degree of creativity applied: It refers to the extent to which the
IM
data is created graphically, and wheather it is designed in a color-
ful way or in black and white diagrams.

On the basis of above evaluation, we can understand which is the cor-


rect form of representation for a given data type. Let’s discuss the var-
ious content types:
M

‰‰ Graph: A representation in which X and Y axes are used to depict


the meaning of the information
‰‰ Diagram: A two-dimensional representation of information to
show how something works
N

‰‰ Timeline: A representation of important events in a sequence with


the help of self-explanatory visual material
‰‰ Template: A layout design for presenting information
‰‰ Checklist: A list of items for comparison and verification
‰‰ Flowchart: A representation of instructions which shows how
something works or a step-by-step procedure to perform a task
‰‰ Mind Map: A type of diagram which is used to visually organise
information

9.2.2 Techniques Used for Visual Data


Representation

Data can be presented in various visual forms, which include simple


line diagrams, bar graphs, tables, matrices, etc. Some techniques used
for a visual presentation of data are as follows:
‰‰ Isoline: It is a 2D representation in which a curved line joins points of a constant value on the surface of a graph. The plotting of an isoline is based on data arrangement rather than data visualisation.


Figure 9.2 shows a set of isolines:

S
IM
M

Figure 9.2: Isolines
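A minimal Python sketch of plotting isolines with matplotlib is shown below; the surface function is an arbitrary example chosen for illustration:

import numpy as np
import matplotlib.pyplot as plt

x, y = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
z = np.exp(-(x**2 + y**2))        # any smooth surface over the grid
plt.contour(x, y, z, levels=8)    # each contour curve is an isoline
plt.show()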


‰‰ Isosurface: It is a 3D representation of an isoline. Isosurfaces are
created to represent points that are bounded in a volume of space
by a constant value, that is, in a domain that covers 3D space. Fig-
N

ure 9.3 shows how isosurfaces look like:


Figure 9.3: Isosurfaces


‰‰ Direct Volume Rendering (DVR): It is a method used for obtain-
ing a 2D projection for a 3D dataset. A 3D record is projected in a
2D form through DVR for a clearer and more transparent visuali-
sation.

NMIMS Global Access - School for Continuing Education


Data Visualisation 245

n o t e s

Figure 9.4 shows a 2D DVR of a 3D image:

Figure 9.4: 2D Image DVR


‰‰ Streamline: It is a field line that results from the velocity vector

S
field description of the data flow. Figure 9.5 shows a set of stream-
lines:
IM
M

Figure 9.5: Streamlines


N

‰‰ Map: It is a visual representation of locations within a specific area.


It is depicted on a planar surface. Figure 9.6 shows an instance of
Google Map:

Figure 9.6: Google Map


‰‰ Parallel Coordinate Plot: It is a visualisation technique of repre-


senting multidimensional data. Figure 9.7 shows a parallel coordi-
nate plot:

Figure 9.7: Parallel Coordinate Plot
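As a brief illustration, pandas ships with a built-in parallel coordinates plot; the small iris-style dataset below is made up for the example:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})
parallel_coordinates(df, "species")  # one line per row, one axis per column
plt.show()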
‰‰ Venn Diagram: It is used to represent logical relations between
finite collections of sets. Figure 9.8 shows a Venn diagram for a set
IM
of relations:

Figure 9.8: Venn Diagrams


‰‰ Timeline: It is used to represent a chronological display of events.
Figure 9.9 shows an example of a timeline for some critical events:

Figure 9.9: Timeline for Some Critical Events


‰‰ Euler Diagram: It is a representation of the relationships between


sets. Figure 9.10 shows an example of an Euler diagram:

Figure 9.10: Euler Diagram


‰‰ Hyperbolic Trees: They represent graphs that are drawn using

S
the hyperbolic geometry. Figure 9.11 shows a hyperbolic tree:
IM
M
N

Figure 9.11: Hyperbolic Tree


‰‰ Cluster Diagram: It represents a cluster, such as a cluster of astro-
nomic entities. Figure 9.12 shows a cluster diagram:

Figure 9.12: Cluster Diagram


‰‰ Ordinogram: It is used to analyse various sets of multivariate


objects. Figure 9.13 shows an ordinogram:

Figure 9.13: Ordinogram

S
9.2.3 TYPES OF DATA VISUALISATION

You already know that data can be visualised in many ways, such as
in the forms of 1D, 2D, or 3D structures. Table 9.1 briefly describes the
IM
different types of data visualisation:
Table 9.1: Data Visualisation Types

1D/Linear
- Description: A list of items organised in a predefined manner
- Tools: Generally, no tool is used for 1D visualisation

2D/Planar
- Description: Choropleth, cartogram, dot distribution map, and proportional symbol map
- Tools: GeoCommons, Google Fusion Tables, Google Maps API, Polymaps, Many Eyes, Google Charts, and Tableau Public

3D/Volumetric
- Description: 3D computer models, surface rendering, volume rendering, and computer simulations
- Tools: AC3D, AutoQ3D, TrueSpace

Temporal
- Description: Timeline, time series, Gantt chart, sanky diagram, alluvial diagram, and connected scatter plot
- Tools: TimeFlow, Timeline JS, Excel, Timeplot, TimeSearcher, Google Charts, Tableau Public, and Google Fusion Tables

Multidimensional
- Description: Pie chart, histogram, tag cloud, bubble cloud, bar chart, scatter plot, heat map, etc.
- Tools: Many Eyes, Google Charts, Tableau Public, and Google Fusion Tables

Tree/Hierarchical
- Description: Dendogram, radial tree, hyperbolic tree, and wedge stack graph
- Tools: d3, Google Charts, and Network Workbench/Sci2

Network
- Description: Matrix, node link diagram, hive plot, and tube map
- Tools: Pajek, Gephi, NodeXL, VOSviewer, UCINET, GUESS, Network Workbench/Sci2, sigma.js, d3/Protovis, Many Eyes, and Google Fusion Tables


As shown in Table 9.1, the simplest type of data visualisation is 1D


representation and the most complex data visualisation is the network
representation. The following is a brief description of each of these
data visualisations:
‰‰ 1D (Linear) data visualisation: In the linear data visualisation,
data is presented in the form of lists. Hence, we cannot term it as
visualisation. It is rather a data organisation technique. Therefore,
no tool is required to visualise data in a linear manner.
‰‰ 2D (Planar) data visualisation: This technique presents data in
the form of images, diagrams, or charts on a plane surface. Car-
togram and dot distribution map are examples of 2D data visual-
isation. Some tools used to create 2D data visualisation patterns
are GeoCommons, Google Fusion Tables, Google Maps API, Poly-
maps, Tableau Public, etc.

S
‰‰ 3D (Volumetric) data visualisation: In this method, data presen-
tation involves exactly three dimensions to show simulations, sur-
face and volume rendering, etc. Generally, it is used in scientific
IM
studies. Today, many organisations use 3D computer modelling
and volume rendering in advertisements to provide users a better
feel of their products. To create 3D visualisations, we use some
visualisation tools that involve AC3D, AutoQ3D, TrueSpace, etc.
‰‰ Temporal data visualisation: Sometimes, visualisations are time
dependent. To visualise the dependence of analyses on time, the
M

temporal data visualisation is used, which includes Gantt chart,


time series, sanky diagram, etc. TimeFlow, Timeline JS, Excel,
Timeplot, TimeSearcher, Google Charts, Tableau Public, Google
Fusion Tables, etc. are some tools used to create temporal data
visualisation.
N

‰‰ Multidimensional data visualisation: In this type of data visuali-


sation, numerous dimensions are used to present data. We have pie
charts, histograms, bar charts, etc. to exemplify multidimensional
data visualisation. Many Eyes, Google Charts, Tableau Public, etc.
are some tools used to create multidimensional data visualisation.
‰‰ Tree/Hierarchical data visualisation: Sometimes, data relation-
ships need to be shown in the form of hierarchies. To represent
such kind of relationships, we use tree or hierarchical data visual-
isations. Examples of tree/hierarchical data visualisation include
hyperbolic tree, wedge-stack graph, etc. Some tools to create
hierarchical data visualisation are D3, Google Charts, and Net-
work Workbench/Sci2.
‰‰ Network data visualisation: It is used to represent data relations that are too complex to be represented in the form of hierarchies. Examples include the matrix, node link diagram, and hive plot; tools include Pajek, Gephi, NodeXL, VOSviewer, UCINET, GUESS, Network Workbench/Sci2, sigma.js, d3/Protovis, Many Eyes, Google Fusion Tables, etc.


9.2.4 APPLICATIONS OF DATA VISUALISATION

Data visualisation tools and techniques are used in various applica-


tions. Some of the areas in which we apply data visualisation are as
follows:
‰‰ Education: Visualisation is applied to teach a topic that requires
simulation or modelling of any object or process. Have you ever
wondered how difficult it would be to explain any organ or organ
system without any visuals? Organ system, structure of an atom,
etc. are best described with the help of diagrams or animations.
‰‰ Information: Visualisation is applied to transform abstract data
into visual forms for easy interpretation and further exploration.
‰‰ Production: Various applications are used to create 3D models of
products for better viewing and manipulation. Real estate, com-

S
munication, and automobile industry extensively use 3D adver-
tisements to provide a better look and feel to their products.
‰‰ Science: Every field of science including fluid dynamics, astro-
IM
physics, and medicine use visual representation of information.
Isosurfaces and direct volume rendering are typically used to
explain scientific concepts.
‰‰ Systems visualisation: Systems visualisation is a relatively new
concept that integrates visual techniques to better describe com-
plex systems.
M

‰‰ Visual communication: Multimedia and entertainment industry


use visuals to communicate their ideas and information.
‰‰ Visual analytics: It refers to the science of analytical reasoning
supported by the interactive visual interface. The data generat-
N

ed by social media interaction is interpreted using visual analytics


techniques.

self assessment Questions

1. Which of the following visual aids is/are used for representing


data?
a. Graphs b. Bar
c. Histograms d. All of these
2. The use of colorful graphics in drawing charts and graphs
helps in improving the interpretation of a given data. (True/
False)
3. Scientific data is also presented through _______ constructed
images.
4. Visual images do not help in transmitting huge amount of
information to the human brain at a glance. (True/False)


5. Which of the following types of diagrams refers to a


representation of instructions that shows how something
works or a step-by-step procedure to perform a task?
a. Graph b. Diagram
c. Flowchart d. Mind Map
6. DVR stands for ____________.
7. ______ diagram is used to represent logical relations between
finite collections of sets.
8. Ordinogram is used to analyse various sets of multivariate
objects. (True/False)

S
Activity

Search and enlist the symbols used in a flowchart. Also, create a


flowchart which represents a sequence of instructions for resolving
a problem using its symbols.
IM
9.3 IMPORTANCE OF BIG DATA VISUALISATION
M

Visual analysis of data is not a new thing. For years, statisticians and
analysts have been using visualisation tools and techniques to inter-
pret and present the outcomes of their analyses.

Almost every organisation today is struggling to tackle the huge


N

amount of data pouring in every day. Data visualisation is a great


way to reduce the turn-around time consumed in interpreting Big
Data. Traditional visualisation techniques are not efficient enough to
capture or interpret the information that Big Data possesses. For ex-
ample, such techniques are not able to interpret videos, audios, and
complex sentences. Apart from the type of data, the volume and speed
with which data is generated pose a great challenge. Most of the tra-
ditional analytics techniques are unable to cater to any of these prob-
lems.

Big Data comprises both structured as well as unstructured forms of


data collected from various sources. Because of the heterogeneity of
data sources, data streaming, and real-time data, it becomes difficult
to handle Big Data by using traditional tools. Traditional tools are de-
veloped by using relational models that work best on static interaction.
Big Data is highly dynamic in function and therefore, most traditional
tools are not able to generate quality results. The response time of tra-
ditional tools is quite high, making them unfit for quality interaction.


9.3.1 DERIVING BUSINESS SOLUTIONS

The most common notation used for Big Data is 3Vs—volume, veloci-
ty, and variety. But, the most exciting feature is the way in which val-
ue is filtered from the haystack of data. Big Data generated through
social media sites is a valuable source of information to understand
consumer sentiments and demographics. Almost every company now-
adays is working with Big Data and facing the following challenges:
‰‰ Most data is in unstructured form
‰‰ Data is not analysed in real time
‰‰ The amount of data generated is huge
‰‰ There is a lack of efficient tools and techniques

Considering all these factors, IT companies are focusing more on re-

S
search and development of robust algorithms, software, and tools to
analyse the data that is scattered in the Internet space. Tools such
as Hadoop provide state-of-the-art technology to store and process
IM
Big Data. Analytical tools are now able to produce interpretations on
smartphones and tablets. It is possible because of the advanced visual
analytics that is enabling business owners and researchers to explore
data for finding out trends and patterns.

9.3.2 TURNING DATA INTO INFORMATION


M

The most exciting part of any analytical study is to find useful infor-
mation from a plethora of data. Visualisation facilitates identification
of patterns in the form of graphs or charts, which in turn helps to de-
rive useful information. Data reduction and abstraction are generally
N

followed during data mining to get valuable information.

Visual data mining also works on the same principle as simple data
mining; however, it involves the integration of information visualisa-
tion and human–computer interaction. Visualisation of data produces
cluttered images that are filtered with the help of clutter-reduction
techniques. Uniform sampling and dimension reduction are two com-
monly used clutter-reduction techniques.

The visual data reduction process involves automated data analysis to measure density, outliers, and their differences. These measures are
then used as quality metrics to evaluate data-reduction activity. Visual
quality metrics can be categorised as:
‰‰ Size metrics (e.g. number of data points)
‰‰ Visual effectiveness metrics (e.g. data density, collisions)
‰‰ Feature preservation metrics (e.g. discovering and preserving data
density differences)


In general, we can conclude that a visual analytics tool should be:


‰‰ Simple enough so that even non-technical users can operate it
‰‰ Interactive to connect with different sources of data
‰‰ Competent to create appropriate visuals for interpretations
‰‰ Able to interpret Big Data and share information

Apart from representing data, a visualisation tool must be able to establish links between different data values, restore missing data, and polish data for further analysis.

self assessment Questions

9. Analytical tools are now able to produce interpretations on smartphones and tablets. (True/False)
10. Big Data generated through _________ sites is a valuable
source of information to understand consumer sentiments
and demographics.
11. Which of the following is/are the challenges with Big Data?
a. Most data is in unstructured form.
b. Data is not analysed in real time.
c. The amount of data generated is huge.
d. All of these
12. _______ and _________ are two commonly used clutter-
reduction techniques.
13. Data reduction and _______ are generally followed during data mining to get valuable information.
14. Which of the following is/are a visual quality metric?
a. Size metric
b. Visual effectiveness metric
c. Feature preservation metric
d. All of these

Activity

Prepare a report on Big Data visualisation tools that are widely used by organisations nowadays.


9.4 TOOLS USED IN DATA VISUALISATION


Some useful visualisation tools are listed as follows:
‰‰ Excel: It is a widely used spreadsheet tool for data analysis. It helps you to track and visualise data for deriving better insights. It provides various ways to share data and analytical conclusions within and across organisations. Figure 9.14 shows an example of an Excel sheet:


Figure 9.14: Excel Sheet


‰‰ Last.Forward: It is open-source software provided by last.fm for analysing and visualising the social music network. Figure 9.15 shows an example of a Last.Forward visual:

Figure 9.15: Last.Forward


‰‰ Digg.com: Digg.com provides some of the best Web-based visualisation tools.
‰‰ Pics: This tool is used to track the activity of images on a website.
‰‰ Arc: It is used to display topics and stories in a spherical form. Here, a sphere is used to display stories and topics, and bunches of stories are aligned along the outer circumference of the sphere. Figure 9.16 shows Digg Arc:

Figure 9.16: Digg Arc
Larger stories have more diggs, as shown in Figure 9.16. The arc becomes thicker with the number of times users digg the story.
‰‰ Google Charts API: This tool allows a user to create dynamic charts that can be embedded in a Web page. A chart generated from the data and formatting parameters supplied in a HyperText Transfer Protocol (HTTP) request is converted into a Portable Network Graphics (PNG) image by Google to simplify the embedding process. Figure 9.17 shows some charts created by using Google Charts API:

Column Chart, Area Chart, Candlestick Chart, Timeline, Bubble Chart, Donut Chart
Figure 9.17: Charts Obtained from Google Charts API


‰‰ TwittEarth: This tool is capable of showing live tweets from all over the world on a 3D globe. It is an effort to improve social media visualisation by mapping tweets onto a global image. Figure 9.18 shows an example of a TwittEarth visual:

Figure 9.18: TwittEarth
Source: http://cybergyaan.com/2010/01/10-supercool-ways-to-visualise-internet.html

‰‰ Tag Galaxy: Tag Galaxy provides a stunning way of finding a collection of Flickr images. It is an unusual site whose search tool makes the online combing process a memorable visual experience. To search for a picture, you enter a tag of your choice and it finds matching pictures. The central (core) star contains all the images directly relating to the initial tag, and the revolving planets consist of similar or corresponding tags. Click on a planet and additional sub-categories appear. Click on the central star and Flickr images gather and land on a gigantic 3D sphere. Figure 9.19 shows a visual created by Tag Galaxy:

Figure 9.19: Tag Galaxy


Source: Taggalaxy.de


‰‰ D3: D3 enables you to bind arbitrary data to a Document Object Model (DOM) and then apply data-driven transformations to the document. For example, you can use D3 to create an HTML table from a sequence of numbers, or use the same data to develop an interactive SVG bar chart with smooth transitions and interactions. Figure 9.20 shows some complex visuals created through D3:

Figure 9.20: Some Visuals Obtained from D3
Source: http://d3js.org/

‰‰ Rootzmap Mapping the Internet: It is a tool to generate a series of maps on the basis of datasets provided by the National Aeronautics and Space Administration (NASA). Figure 9.21 shows an example of Internet mapping through Rootzmap:
N

Figure 9.21: Internet Mapping


Source: http://www.sysctl.org/rootzmap/e-map.jpg


9.4.1 OPEN-SOURCE DATA VISUALISATION TOOLS

We already know that Big Data analytics requires the implementation of advanced tools and technologies. Due to economic and infrastructural limitations, every organisation cannot purchase all the applications required for analysing data. Therefore, to fulfill their requirement of advanced tools and technologies, organisations often turn to open-source libraries. These libraries can be defined as pools of freely available applications and analytical tools. Some examples of open-source tools available for data visualisation are VTK, Cave5D, ELKI, Tulip, Gephi, IBM OpenDX, Tableau Public, and Vis5D.

Open-source tools are easy to use, consistent, and reusable. They de-
liver high-quality performance and are compliant with the Web as
well as mobile Web security. In addition, they provide multichannel
analytics for modelling as well as customised business solutions that can be altered with changing business demands.

9.4.2 ANALYTICAL TECHNIQUES USED IN BIG DATA VISUALISATION

Analytical techniques are used to analyse complex relationships among variables. The following are some commonly used analytical techniques for Big Data solutions:
‰‰ Regression analysis: It is a statistical tool used for prediction. Regression analysis is used to predict a continuous dependent variable from independent variables; that is, we try to find the effect of one variable on another. For example, sales increase when prices decrease. Types of regression analysis are as follows (a minimal sketch of the first two types follows this list):
 Ordinary least squares regression: It is used when the dependent variable is continuous and there exists some relationship between the dependent variable and the independent variable.
 Logistic regression: It is used when the dependent variable has only two potential outcomes.
 Hierarchical linear modelling: It is used when data is in nested form.
 Duration models: These are used to measure the length of a process.
‰‰ Grouping methods: The technique of categorising observations into significant or purposeful blocks is called grouping. The recognition of features that create a distinction between groups is called discriminant analysis.
‰‰ Multiple equation models: These are used to analyse causal pathways from independent variables to dependent variables. Types of multiple equation models are as follows:
 Path analysis
 Structural equation modelling
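As a minimal, illustrative sketch of the first two regression types (using Python with the statsmodels package, an assumption; the synthetic data and variable names are hypothetical):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Ordinary least squares: continuous outcome (sales) vs. price
price = rng.uniform(10, 50, size=200)
sales = 500 - 8 * price + rng.normal(0, 20, size=200)
ols = sm.OLS(sales, sm.add_constant(price)).fit()
print(ols.params)   # negative price coefficient: sales fall as price rises

# Logistic regression: binary outcome (bought = 1, did not buy = 0)
buy_prob = 1 / (1 + np.exp(0.1 * (price - 30)))
bought = (rng.uniform(size=200) < buy_prob).astype(int)
logit = sm.Logit(bought, sm.add_constant(price)).fit(disp=False)
print(logit.params)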


self assessment Questions

15. ________ is open-source software provided by last.fm for analysing and visualising the social music network.
16. Google Charts API tool allows a user to create dynamic charts
to be embedded in a Web page. (True/False)
17. Tag Galaxy provides a stunning way of finding a collection of
______ images.

Activity

Collect information about the pivot table used in Excel for repre-
senting data.

9.5 SUMMARY
‰‰ Visualisation is a pictorial or visual representation technique.
‰‰ Anything which is represented in pictorial or graphical form, with
the help of diagrams, charts, pictures, flowcharts, etc. is known as
visualisation.
‰‰ Data presented in the form of graphics can be analysed better than
the data presented in words.
‰‰ Infographics are the visual representation of information or data.
‰‰ The data visualisation approach is different from infographics. It is the study of representing data or information in a visual form.
‰‰ Data can be presented in various visual forms, which include simple line diagrams, bar graphs, tables, matrices, etc.
‰‰ The multimedia and entertainment industry uses visuals to communicate ideas and information.
‰‰ The data generated by social media interaction is interpreted
using visual analytics techniques.
‰‰ Apart from the type of data, the volume and speed with which data
is generated pose a great challenge.
‰‰ Because of heterogeneity of data sources, data streaming, and
real-time data, it becomes difficult to handle Big Data by using
traditional tools.
‰‰ The visual data reduction process involves automated data analysis to measure density, outliers and their differences.


key words

‰‰ Graph: It is a representation in which X and Y axes are used to depict the meaning of the information.
‰‰ Diagram: It is a two-dimensional representation of information
to show how something works.
‰‰ Timeline: It is a representation of important events in a se-
quence with the help of self-explanatory visual material.
‰‰ Flowchart: It is a representation of instructions which shows
how something works or a step-by-step procedure to perform
a task.
‰‰ Isosurfaces: These are designed to represent points that are
bound by a constant value in a volume of space.

9.6 DESCRIPTIVE QUESTIONS
1. What do you understand by data visualisation? List the different
ways of data visualisation.
2. Describe the different techniques used for visual data
representation.
3. Discuss the types and applications of data visualisation.
4. Describe the importance of Big Data visualisation.


5. Elucidate the transformation process of data into information.
6. Enlist and explain the tools used in data visualisation.
7. Describe the analytical techniques used in data visualisation.

9.7 ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic Q. No. Answers


What Is Visualisation? 1. d.  All of these
2. True
3. digitally
4. False
5. c. Flowchart
6. Direct Volume Rendering
7. Venn

8. True
Importance of Big Data Visualisation 9. True
10. social media
11. d. All of these
12. Uniform sampling; dimension reduction
13. abstraction
14. d. All of these
Tools Used in Data Visualisation 15. Last.Forward
16. True
17. Flickr
IM
HINTS FOR DESCRIPTIVE QUESTIONS
1. Visualisation is a pictorial or visual representation technique.
Anything which is represented in pictorial or graphical form, with
the help of diagrams, charts, pictures, flowcharts, etc. is known
as visualisation. Refer to Section 9.2 What is Visualisation?
2. Data can be presented in various visual forms, which include simple line diagrams, bar graphs, tables, matrices, etc. Refer to
Section 9.2 What is Visualisation?
3. Data can be visualised in many ways, such as in the forms of 1D, 2D, or 3D structures. Refer to Section 9.2 What is Visualisation?


4. Visual analysis of data is not a new thing. For years, statisticians
and analysts have been using visualisation tools and techniques
to interpret and present the outcomes of their analyses. Refer to
Section 9.3 Importance of Big Data Visualisation.
5. The most exciting part of any analytical study is to find useful
information from a plethora of data. Refer to Section 9.3
Importance of Big Data Visualisation.
6. Excel is a new tool that is used for data analysis. It helps you
to track and visualise data for deriving better insights. Refer to
Section 9.4 Tools used in Data Visualisation.
7. Analytical techniques are used to analyse complex relationships
among variables. Refer to Section 9.4 Tools used in Data
Visualisation.


9.8 SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
‰‰ Kirk, A. (2016). Data visualisation: a handbook for data driven de-
sign. Los Angeles: Sage Publications.
‰‰ Evergreen, S. (2017). Effective data visualization: the right chart
for the right data. Los Angeles: Sage.
‰‰ Kirk, A. (2012). Data visualization: a successful design process. S.l.:
Packt Publ.

E-REFERENCES
‰‰ Data visualization. (2017, April 26). Retrieved May 02, 2017, from https://en.wikipedia.org/wiki/Data_visualization
‰‰ Suda, B., & Hampton-Smith, S. (2017, February 07). The 38 best
tools for data visualization. Retrieved May 02, 2017, from http://
www.creativebloq.com/design-tools/data-visualization-712402
‰‰ 50 Great Examples of Data Visualization. (2009, June 01). Re-
trieved May 02, 2017, from https://www.webdesignerdepot.
com/2009/06/50-great-examples-of-data-visualization/


Chapter 10

Business Analytics in Practice

CONTENTS

10.1 Introduction
10.2 Financial and Fraud Analytics
Self Assessment Questions
Activity
10.3 HR Analytics
Self Assessment Questions
Activity
10.4 Marketing Analytics
Self Assessment Questions
Activity
10.5 Healthcare Analytics
Self Assessment Questions
Activity
10.6 Supply Chain Analytics
Self Assessment Questions
Activity
10.7 Web Analytics
Self Assessment Questions
Activity
10.8 Sports Analytics
Self Assessment Questions
Activity
10.9 Analytics for Government and NGOs
Self Assessment Questions
Activity
10.10 Summary
10.11 Descriptive Questions
10.12 Answers and Hints
10.13 Suggested Readings & References


Introductory Caselet

MIAMI BASEBALL TEAM USED SPORTS ANALYTICS TO PERFORM BETTER

The Miami Red Hawks are a National Collegiate Athletic Association (NCAA) Division I American baseball team belonging to Miami University in Oxford, Ohio. Dan Hayden is the present coach of the Red Hawks. The team is also a member of the Mid-American Conference East division. Miami fielded its first baseball team in 1915.
The Red Hawks were trying to get a competitive advantage over their competitors with the help of baseball statistics and analytics. This need was felt because of the frequent changes in this sport to make it more interesting and competitive. Data analytics is widely used, and its use is far more popular at the professional level than in amateur sports. Miami baseball wanted to stay updated and competitive in this sport with the use of analytics. The team wanted to use analytics for analysing pitching, which includes the type, speed and location of each pitch.
After searching the various tools available in the market for sports analytics, the team decided to use Vizion360 impact analytics. This analytics solution uses a visualisation tool, Microsoft Power BI, at the front end. The tool provides deep insight into the collected data. It provides a detailed summary of the performance of the team and of each player at the individual level. The summary also includes statistics related to pitching and batting. After studying this data, the team worked on improving its performance in intricate situations of the game.
The tool also helped in improving the productivity of the players and helped coaches refrain from making unfounded assumptions about the performance of players, by providing deep statistics about players' earlier performance which was not possible using simple statistics. The tool also enabled custom visualisation instead of using spreadsheets for cut-and-paste reporting. The results obtained after applying sports analytics using Vizion360 are as follows:
‰‰ The team became capable of analysing pitch’s type, location
and speed.
‰‰ The team can access the statistics, as and when required, for
analysing the current situation.
‰‰ Before the implementation of Vizion360, the team coaches
used to select players on their gut feeling. Now, they can check
performance data before selecting a player.
‰‰ Vizion360 helped coaches in better decision making against a
particular team.
‰‰ Coaches can save and analyse the data even on their mobile
devices.


learning objectives

After studying this chapter, you will be able to:


>> Describe the concept of financial and fraud analytics
>> Explain the importance of HR analytics
>> Discuss marketing analytics
>> Define healthcare analytics
>> State the significance of supply chain analytics
>> Describe the functions of Web analytics
>> Explain the functions of sports analytics
>> Discuss how analytics is used by the government and NGOs

10.1 INTRODUCTION
Business analytics has emerged as a growth driver for most new-era organisations. Gone are the days when managers used to make decisions on the basis of gut feeling, or relied on large-scale financial indicators and their likely effect on individual organisations. Decisions made without data and information have turned out to be unfortunate for many organisations. With the advent of information technology and the increased data-handling ability of computers, managers are utilising numerous methods to anticipate the future of business and enhance the profitability of the enterprise. The application of descriptive and predictive analytics, customer relationship management tools and different process improvement tools brings benefits to the organisation. The entire business world is looking at Big Data as an opportunity and a source of competitive advantage.

Business analytics has expanded consistently over the previous decade, as confirmed by the constantly developing business analytics software market. It is targeting more organisations and reaching out to more users, from administrators and line-of-business supervisors to analysts and other information specialists, inside organisations.

This chapter first discusses financial and fraud analytics. Next, the
chapter explains HR analytics, marketing analytics and healthcare
analytics. The chapter also explains supply chain analytics and Web
analytics. Towards the end, the chapter discusses sports analytics and
how analytics is used by the government and NGOs for providing var-
ious beneficial services to people.


10.2 FINANCIAL AND FRAUD ANALYTICS


Fraud impacts organisations in several ways which might be related
to financial, operational or psychological processes. While the monetary loss owing to fraud is huge, the full effect of fraud on an organisation can be even more damaging. As fraud can be executed by any
worker inside an organisation or by an external source, it is essential
for an organisation to have successful fraud management or a fraud
analytics program to defend its reputation against fraud and prevent
financial loss. Many organisations such as Simility, MATLAB, Actim-
ize, etc. provide fraud detection software or suite to detect fraud at an
early stage and take appropriate measures to prevent it. Numerous organisations remain vulnerable to fraud and financial crime because they are not exploiting the new capabilities available to battle today's threats.
These abilities depend intensely on the Big Data and analytic innovations that are currently accessible.

With these advancements, organisations can manage and examine terabytes of historical and third-party information. The capacity to break
down enormous information volumes empowers organisations to
make exact and precise models for perceiving and forestalling future
fraud.

By utilising the most recent advancements in robust analytics, organisations can confidently protect themselves and their clients with regard to the privacy and security of data while doing business with them or offering them various services which require their personal data to be utilised.

Advanced analytics can also be applied to all key fraud information to foresee whether an activity is possibly fraudulent before losses happen. Looking at only small sets of security data, for example, event logs, decreases a bank's capacity to anticipate or identify sophisticated crime. The more volume and variety of information an organisation can analyse, and the greater the velocity at which it can do so, the better the organisation can guard against internal and external threats.

Intelligent investigation of suspicious activity requires performing and managing inquiries that are supported by careful investigation and data availability. With these tools, organisations can rapidly confirm fraud, after which further actions such as prosecution and recovery can be taken.

Harness existing historical information

Organisations can use already recorded information and analyse it to detect and prevent fraud in future. This information also helps in detecting the past and likely future footprints of fraud. Recorded information related to fraud can help organisations prevent huge losses of money and of data related to the business or its clients.


Data management software empowers auditors and fraud analysts to analyse an organisation's business information to gain knowledge
into how well internal controls are working and distinguish transac-
tions that appear to be fraudulent. Generally, data analysis can be
done at places in an organisation where electronic transactions are
recorded and stored.

There is no doubt that data analysis provides a powerful approach to becoming more proactive in the battle against fraud. Companies also use whistleblower hotlines, which help individuals report suspected fraudulent or unsafe conduct and violations of law and policy. However, hotlines alone are insufficient. Why be merely reactive and wait for a whistleblower to come forward as the last resort? Why not search out indicators of fraud in the information itself? To successfully test for fraud, every important transaction must be analysed across all pertinent business systems and applications. Analysing business transactions at the source level provides auditors with better knowledge and a more complete view of the probability of fraud, and helps identify suspicious activities and control weaknesses that could be misused by fraudsters.
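As a hedged illustration of this kind of source-level transaction testing (Python with scikit-learn assumed; the transaction amounts are synthetic, and a real fraud program would combine many more signals), an Isolation Forest can flag statistically unusual transactions for review:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Synthetic transaction amounts: mostly routine, a few extreme values
routine = rng.normal(loc=2_000, scale=400, size=(995, 1))
unusual = rng.uniform(low=20_000, high=50_000, size=(5, 1))
amounts = np.vstack([routine, unusual])

# Isolation Forest labels each transaction; -1 marks an anomaly
model = IsolationForest(contamination=0.005, random_state=1)
labels = model.fit_predict(amounts)

flagged = amounts[labels == -1]
print(f"{len(flagged)} transactions flagged for manual review")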

self assessment Questions

1. Companies also use __________ hotlines to help individuals report suspected fraudulent or unsafe conduct and violations of law and policy.
2. _________ can also be applied to all key fraud information to foresee whether an activity is possibly fraudulent before losses happen.
3. It is essential for an organisation to have successful fraud
management or a fraud analytics program to defend its
reputation against fraud. (True/False)

Activity

Collect information from a nearby local bank related to the impact of fraud in the financial system and all the measures taken by the
banking institution to reduce fraud. Prepare a report on this topic.

10.3 HR ANALYTICS
Human Resource (HR) analytics, additionally called talent analytics, is the use of sophisticated data mining and business analytics (BA) techniques on HR data. HR analytics is an area in the field of analytics that refers to applying analytic processes to the human resource department of a company in the expectation of enhancing


worker performance along with improving the degree of profitability. Organisations generally move to HR analytics and data-led solutions when there exist problems that cannot be resolved with current management practices.

HR analytics does not simply involve gathering information on employee performance and efficiency; instead, it also provides deeper details of each process by accumulating data and then using it to make important decisions about improving these processes.

HR analytics establishes a relationship between business data and people data, which further helps in building important connections between them. The main aspect of HR analytics is to show the impact of the HR department on the whole organisation. HR analytics also helps in establishing a cause-and-effect relationship between the tasks of HR and business outcomes, and then making strategies on the basis of that information.

The core functionalities of HR can be improved by applying various analytics processes, which include acquiring, optimising, paying and developing the organisation's workforce. HR analytics can also help in uncovering problems and challenges using an analytical workflow and guide managers in answering questions. It also helps managers gain deeper details from the information at hand, and then make important decisions and take proper actions.
M

The field of HR analytics can be further divided into the following


segments:
‰‰ Capability analytics: It is a talent management process that enables you to identify the capabilities or core competencies that you require in your business. It helps in identifying the capabilities of your workforce, which include skills, levels and expertise.
‰‰ Competency acquisition analytics: It refers to the process of assessing how well or otherwise your business can attain the required competencies. Acquiring and managing talent is very critical for the growth of business.
‰‰ Capacity analytics: It helps in identifying how operationally efficient people in the business are. For example, it identifies whether people are spending time on profitable work or not.
‰‰ Employee churn analytics: Hiring employees and training them involve time and money. Employee churn analytics refers to the process of estimating staff turnover rates to predict the future and reduce employee churn (a minimal predictive sketch appears after this list).
‰‰ Corporate culture analytics: It refers to the process of assessing and understanding the corporate culture, or the different cultures that exist across an organisation.


‰‰ Recruitment channel analytics: It refers to the process of finding out where the best employees come from and which recruitment channels are most efficient.
‰‰ Employee performance analytics: Every organisation requires capable employees who perform well in order to survive and thrive. Employee performance analytics is used in assessing the performance of individual employees. The resulting information can be used to determine which employees are performing efficiently and which may require extra support or training to improve their performance.
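As promised in the employee churn segment above, here is a minimal predictive sketch (Python with scikit-learn assumed; the HR records, the tenure/overtime features and the churn relationship are all invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Hypothetical HR records: tenure (years) and monthly overtime hours
n = 1_000
tenure = rng.uniform(0, 10, n)
overtime = rng.uniform(0, 40, n)

# Assumed relationship: short tenure plus heavy overtime raises churn
risk = 1 / (1 + np.exp(-(1.5 - 0.4 * tenure + 0.05 * overtime)))
left = (rng.uniform(size=n) < risk).astype(int)

X = np.column_stack([tenure, overtime])
X_train, X_test, y_train, y_test = train_test_split(X, left, random_state=7)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.2f}")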

self assessment Questions

4. HR analytics is also known as ______ analytics.

5. HR analytics help managers in gaining deeper details from
information at hand, then make important decisions and take
proper actions. (True/False)
6. ____________ analytics helps in identifying how operationally efficient people in the business are.

Activity

Visit an organisation and meet its HR executives to know how HR analytics has helped them motivate their employees and reduce employee turnover in the last five years.

10.4 MARKETING ANALYTICS


Every organisation strives to gain an edge over its competitors. This
can be possible if an organisation develops an effective industry lev-
el strategy. For this, an organisation needs to analyse various forces,
such as level of competition in the market, entry of new organisations,
availability of substitute products, etc. For this purpose, marketing an-
alytics are used by organisations.

Marketing analytics is the practice of measuring, managing and analysing marketing performance to maximise its effectiveness and improve return on investment (ROI). Marketing analytics helps in providing deeper insight into customer preferences and trends. Despite these benefits, a majority of organisations fail to realise the benefits of marketing analytics. With the advancement of search engines, paid search marketing, search engine optimisation (SEO) and efficient new software solutions, marketing analytics has become more effective and easier to implement than ever.


You need to follow the three steps below to get the benefits from marketing analytics:
1. Practice a balanced collection of analytic methods
In order to get the best benefits from marketing analytics, you
need an analytic evaluation that is balanced – that is, one that
merges methods for:
 Covering the past: Utilising marketing analytics to examine the past, you can answer queries such as: which campaign component generated the most revenue last quarter?
 Exploring the present: Marketing analytics enables you to determine how your marketing activities are performing right now by asking questions such as: How are campaigns engaging clients? Which channels do clients use to gain maximum benefits? What is the reaction of different social media users to the company's image?
 Predicting and influencing the future: Marketing analytics can be used to deliver data-driven predictions to shape the future by asking questions such as: How can we turn short-term wins into loyalty and continuous engagement? How many sales representatives do we need to add to meet expectations? Which cities should we focus on next, given our present situation?


2. Evaluate your analytical capabilities and fill in the gaps
Marketing organisations have access to many analytic capabilities for supporting different marketing goals. Assessing your present analytic capabilities is necessary to attain these goals. It is important to know where you currently stand along the analytic spectrum, so that you can determine gaps and take steps to create a strategy for filling those gaps.
Consider an example in which a marketing organisation is
already gathering data from sources like the Internet and POS
transactions, but is not providing importance to the unstructured
information coming from social media platforms. Such
unstructured sources are very useful, and the technology for
transforming unstructured data into actual insights is available
today that can be used by marketers. A marketing organisation
can plan and allocate budget for adding these analytic capabilities
that can be used to fill that particular gap.
3. Take action as per analytical findings
The information collected after performing marketing analytics
is not useful until you try to act on that information. In the
continuous process of testing and learning, marketing analytics


allows you to enhance the performance of your marketing program as a whole by performing the following tasks:
 Determining deficiencies in the channel
 Adjusting strategies and tactics as and when required
 Optimising processes

Without the capability to test and evaluate the performance of your marketing programs, you would not be able to know what had worked
and what had not. Moreover, you would not be able to know whether
things needed to be changed or in what manner. In other words, if you
are using marketing analytics for evaluating success and doing noth-
ing with the details gained, then what is the point of using analytics?
Marketing analytics enables better, more successful marketing for your efforts and investments. It can lead to better management which
helps in generating more revenue and greater profitability.

self assessment Questions


7. SEO stands for
a. Search engine optimisation
b. Searching engine optimisation
c. Search engine operation
d. None of these
8. Marketing analytics enables you to decide how your marketing
activities are acting at this moment. (True/False)
9. The information collected after performing marketing analytics remains useful whether or not you act on that information. (True/False)

Activity

Prepare a report on the total sales and revenue generated by a store at your nearby location by using marketing analytics.

10.5 HEALTHCARE ANALYTICS


Healthcare analytics is a term used to describe the analysis of health-
care activities using the data generated and collected from different
areas in healthcare such as pharmaceutical data, research and de-
velopment (R&D) data, clinical data, patient behavior and sentiment
data, etc. In addition to this, data also gets generated from patients
buying behavior in stores, claims made by patients, preference of pa-
tients in selecting activities, and more. Analytics is applied to this data to gain insights for providing healthcare services in a better way.

Organisations in the field of healthcare are quickly adopting information systems to enhance both business operations and clinical care. Many classes of information systems are being developed in the healthcare sector, extending from electronic medical records (EMRs) and specialty care management to supply chain systems.

Healthcare organisations are also implementing approaches such as lean and Six Sigma to take a more patient-driven focus, lessen errors and waste and increase the flow of patients, with the objective of enhancing quality. The healthcare analytics industry is growing, and it is estimated that it will cross $18.7 billion by 2020 in the United States (US) alone. The industry also emphasises various areas such as financial analysis, clinical analysis, fraud analysis, supply chain analysis and HR analysis. Basically, healthcare analytics is based on the verification of patterns in healthcare data for determining how clinical care can be enhanced while minimising excessive cost.

In addition to revealing data about present and past organisational performance, analytical tools are also used to study large data collections by using statistical analysis procedures to uncover and comprehend historical information, with an eye to predicting and enhancing operational performance later on.
Healthcare analytics is used as a statistical instrument for getting deeper details of healthcare-related information, with the end goal of understanding past performance (i.e., operational execution or clinical results) to enhance the quality and proficiency of clinical and business procedures and their execution in future.

As the volume and accessibility of healthcare information keep on increasing, healthcare organisations progressively need to depend on
analytics as a key competency to comprehend and enhance their op-
erations.

Implementing Real-Time Healthcare Data Analytics

Healthcare data is not easily available in a unified and informative way, and this restricts the industry's endeavour to enhance
quality and effectiveness in healthcare. Real-time analytics tools are
used in healthcare for addressing these issues by bringing data from
various sources at a single location with the purpose of presenting it
in a unified manner so that fruitful information can be derived from it.

Moreover, the insight gained from analysing huge amounts of collected health information can provide noteworthy knowledge to enhance operational quality and effectiveness for providers, insurers and others. The healthcare industry is quickly transitioning from volume-based to value-based healthcare. Now more than ever, analytics is crucial for clinicians and health service providers so that they
can distinguish and address gaps in care, quality and hazards and use
it to bolster changes in clinical and quality results and financial per-
formance.

Real-time analytics is capable of continuous reporting that illustrates the status of patients and how to enhance the current quality of
the services. It can also give instant and exact knowledge into patients’
therapeutic history including past clinical conditions, analyses, med-
icines, usage and results irrespective of their geographical location.
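As a toy illustration of such continuous reporting (a hedged sketch only: it assumes the pandas library, and the simulated vital-sign stream and the alert threshold of 95 are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Simulated heart-rate readings for one patient, sampled every minute
readings = pd.Series(
    rng.normal(loc=80, scale=6, size=120),
    index=pd.date_range("2017-01-01 08:00", periods=120, freq="min"),
)

# A 15-minute rolling average smooths noise for a live dashboard view,
# and a simple threshold flags windows that may need clinical review
rolling = readings.rolling("15min").mean()
alerts = rolling[rolling > 95]
print(f"{len(alerts)} minute(s) above the alert threshold")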

self assessment Questions

10. EMR stands for

a. Electrical Medical Records
b. Electronic Medical Records
c. Electronic Mediclaim Records
d. None of these
11. Healthcare analytics is based on the verification of patterns
in healthcare data for determining how clinical care can be
enhanced while minimising excessive cost. (True/False)
12. _______ analytics is capable of continuous reporting that illustrates where a patient stands and how to enhance the quality of services.
N

Activity

Study how healthcare analytics has helped in improving the care


delivered in your nearby hospital.

10.6 SUPPLY CHAIN ANALYTICS


A supply chain is an arrangement of organisations, individuals, activities, data and assets required to move a product or service from supplier to customer. Generally, a supply chain comprises suppliers, manufacturers, wholesalers, retailers and customers.

Intense competition and the compulsion to reduce cost have impelled organisations to maintain an effective supply chain network. Therefore, organisations came up with various tools and techniques for effectively managing a supply chain. Globalisation gave a major push to supply chain management. Organisations that operate in a highly competitive global environment need to have a highly effective supply chain
management system in place. For example, Apple faces huge demand


for their products as soon as the products are announced in the mar-
ket. Most Apple products are manufactured in China; therefore, Apple
needs to have a highly efficient supply chain to ship items from China
to different countries in the world.

It can be clearly concluded from the above discussion that a supply chain is a dynamic process in which various parties, such as suppliers
and distributors, are involved in delivering products and services to
fulfil customer requirements. Thus, in the absence of a supply chain,
there would be disruptions in the flow of products and information.

It can be said that a supply chain plays an important role in an organisation. Thus, it is of utmost importance for an organisation to manage the activities involved in a supply chain. The activities in a supply chain convert raw material into a final product which can then be delivered to the customer.

S
Almost every economy is getting globalised today, and the companies
are competing to increase their presence in the global market. The
IM
operations performed by global manufacturing and logistic teams are
getting more intrinsic and challenging. Delay in shipments, ineffec-
tive planning and inconsistent supplies can lead to an increase in the
supply chain cost of the company. Some issues faced by supply chain
organisations are as follows:
‰‰ Visibility of the global supply chain and various processes in logistics
‰‰ Management of demand volatility
‰‰ Fluctuations of cost in a supply chain

To overcome such challenges in the supply chain, supply chain analytics is used by most organisations and supply chain executives. Organisations are planning to increase their investment in analytics to perform better in the market in comparison to their competitors. With improvements in supply chain analytics in the past few years, it helps in making decisions for critical tactical and strategic supply chain activities. The details gained from these activities help supply chain organisations in reducing excessive cost and optimising their supply chains.

Various solutions provided by supply chain analytics to supply chain organisations are as follows:
‰‰ Use of smarter logistics: The use of smarter logistics helps supply chain organisations attain more visibility in the global market. With the growth of businesses, opportunities are developed worldwide to lure customers by satisfying their product-related needs irrespective of geographical location. As customers are present worldwide, a complex web of supply chains has been created that must be monitored closely to remain competitive in business.


The use of advanced analytics-driven 'control metrics' allows the monitoring of real-time critical events and key performance indicators (KPIs) with the help of various touch points. The integration of these metrics with predictive analytics provides a high amount
of savings in areas such as freight optimisation. Organisations that
are making investments in supply chain visibility can take deci-
sions to increase supply chain responsiveness, optimise cost and
minimise customer impact.
‰‰ Managing customers' demand through inventory management: Due to globalisation and variations in products to fulfill the requirements of globally available customers, demand volatility has increased to a significant level. Industries in various sectors such as retail, consumer goods and automotive need daily or real-time prediction to perform better in the market. Advanced supply chain analytics can be applied in these sectors or related industries to forecast demand more precisely and to describe and monitor policies related to supply and replenishment (a minimal forecasting sketch appears after this list). It is also used for planning the inventory flow of goods and services.
icies related to supply and replenishment. It is also used for plan-
ning inventory flow of goods and services.
IM
‰‰ Reducing cost by optimising sourcing and logistic activities: The
cost involved in supply chain is a major portion of company’s over-
all cost. The supply chain costs significantly impact various finan-
cial metrics such as the cost of goods sold, working capital and
cash flow. There is a constant requirement to improve organisa-
tions’ financial performance which can manage huge amounts of
M

inventories. The main areas where costs can be handled by using


analytics-driven intelligence include materials, logistics and sourc-
ing. Analytical tools help in providing better visibility to the actual
total component cost of products. It is necessary to make decision
regarding the buying and selling of products. With the availabili-
N

ty of complete information at the fingertips, organisations can de-


cline the material purchases through improved practices in supply
chain and better price negotiation. The fluctuation in patterns of
demand of customers and an increased base of suppliers and logis-
tics partners make organisations redesign their logistics network
planning. Companies can make strong ROI improvement by using
analytics-driven planning tasks which include route optimisation,
load planning, fleet sizing and freight cost settlement. With the
growth of business, suppliers also increase, the companies can use
these suppliers against each other by applying analytics to get the
lowest price from them. Supply chain managers can use sophis-
tacted analytics programs which can provide them real-time sup-
plier performance management data to improve their strategies.
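As promised in the inventory management point above, here is a hedged, minimal sketch of demand forecasting for replenishment planning (Python with pandas assumed; the monthly demand figures for the single item are invented for illustration):

import pandas as pd

# Hypothetical monthly demand history for one product (units)
demand = pd.Series(
    [120, 135, 128, 150, 165, 158, 172, 180, 175, 190, 205, 198],
    index=pd.period_range("2016-01", periods=12, freq="M"),
)

# Simple exponential smoothing: recent months get more weight
smoothed = demand.ewm(alpha=0.3, adjust=False).mean()

# Use the last smoothed value as the next-month point forecast
print(f"forecast for next month: {smoothed.iloc[-1]:.0f} units")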


self assessment Questions

13. Intense competition and the compulsion to reduce cost have impelled organisations to maintain an effective supply chain network. (True/False)
14. ______ supply chain analytics can be implemented to these
sectors or related industries more precisely to forecast
demand and describe and monitor policies related to supply
and replenishment.

Activity

Prepare a report on how the use of business analytics tools in supply chain has helped in improving the production of the manufacturing industry.
10.7 WEB ANALYTICS
Web analytics refers to the measuring, collecting, analysing and reporting of Web data to understand and optimise Web usage. However, Web analytics is not restricted to the measurement of Web traffic; it can also be utilised as a method of performing business and market research.

Web analytics also helps companies in measuring the outcomes of traditional print or broadcast advertising campaigns, in estimating how traffic to a website changes after the launch of a new advertising campaign, in providing accurate figures of visitors to a website and page views, and in gauging Web traffic and popularity patterns which are
useful in market research. The four basic steps of Web analytics are
as follows:
‰‰ Collection of information: This stage involves gathering of basic
or elementary data. This data involves counting of things.
‰‰ Processing of data into information: The purpose of this stage is
to process the collected data and derive information from it.
‰‰ Developing KPI: This stage focuses on using the derived informa-
tion with business methodologies, referred to as Key Performance
Indicators (KPI).
‰‰ Formulating online strategy: This stage emphasises setting online goals, objectives and standards for the organisation or business. It also lays emphasis on making and saving money and increasing market share.

There are two categories of Web analytics: off-site Web analytics and
on-site Web analytics. Off-site Web analytics allows Web measurement
and analysis irrespective of whether you own or maintain a website. It


includes the measurement of a website's potential audience, its visibility and the comments about it on the Internet. On the other hand, on-site Web analytics is used to measure the behaviour of a visitor who has visited the website. On-site Web analytics is used to measure the effectiveness and performance of your website in a commercial context.

The data generated is further compared against KPIs for performance and is used for the improvement of a website. Google Analytics and Adobe Analytics are popular on-site Web analytics services.
Heat maps and session replays are some examples of new on-site Web
analytics tools.

There are mainly two methods of gathering the data technically. The
first method lays emphasis on server log file analysis in which the log
files are read and used by the Web server for recording file requests

S
sent by browsers. The second method, known as page tagging, uses
JavaScript embedded in the Web page for tracking it. Both the meth-
ods can gather data which can be processed for generating reports
IM
of Web traffic. This second method provides more accurate result as
compared to the first method.

Web analytics is helpful to any business for deciding the division of


market, determining target market, analysing market trends and de-
ciding the conduct of site visitors. It is additionally helpful to compre-
hend visitor’s advantages and priorities. Some important uses of Web
M

analytics for business growth are as follows:


‰‰ Measure Web traffic: Web analytics can track the number of users
visiting the site and identify the source from where they are com-
ing. It also focuses on the keywords that the visitors utilise to query
N

items on the website. It also demonstrates the quantity of visitors


on the Web page by means of the diverse sources like Web search
tools, through messages, online networking and promotions.
‰‰ Estimate visitors count: Visitors on the Internet refer to the quan-
tity of unique individuals that went to the site. Frequent or large
number of visits from visitors shows the activity the site is getting.
The Web analytics tool helps in deciding how frequently a visitor
came back to a site and which pages of a site were given more pref-
erence by visitors. It additionally tells various traits about a visitor
such as its nation, language, etc. Web analytics also provide report
about the time that was spent by a particular visitor on the website
or total time by visitors as a whole. Such reports help to enhance
pages and reduce their bounce rate (or low engagement). It addi-
tionally demonstrates high engagement time of pages and tells in
which item or service visitor may be interested.
‰‰ Track bounce rate: A bounce describes a situation in which a visitor visits a page on the site and leaves without taking any action or clicking on any links on that page. A high bounce rate could mean visitors were unable to find what they were searching for on the site (a small worked example of the bounce-rate calculation appears after this list).
‰‰ Identify exit pages: An exit is the point at which a visitor, after visiting various pages on a site, leaves that site. A few pages on a site may legitimately have a high exit rate, such as the thank-you page shown after a purchase is completed successfully on an e-commerce website. However, a high exit rate on another page may indicate that the page has some issue and should be investigated quickly. Such pages should be examined to determine whether visitors are failing to get the information for which they visited the website. Web analytics tools help in finding such pages quickly and rectifying their problems.
‰‰ Identify target market: It is essential for advertisers to understand their visitors and deliver information according to their requirements. The findings of analytics services uncover present market demands, which generally change with geographic area. By utilising Web analytics, marketers can track the volume and geographical information of visitors and can offer things according to visitors' interests.
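As promised above, a small worked example of the bounce-rate calculation (a hedged Python sketch; the session data is invented and assumed to come from a page-tagging or log-analysis pipeline):

# Each session is the ordered list of pages viewed by one visitor
sessions = [
    ["/home"],                       # bounce: one page, no further action
    ["/home", "/pricing", "/buy"],
    ["/blog"],                       # bounce
    ["/home", "/contact"],
]

bounces = sum(1 for pages in sessions if len(pages) == 1)
bounce_rate = bounces / len(sessions)
print(f"bounce rate: {bounce_rate:.0%}")   # 50%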

self assessment Questions

15. ________ analytics helps in gauging Web traffic and popularity patterns which are useful in market research.
M

16. There are two categories of Web analytics which are _______
Web analytics and ________ Web analytics.
N

Activity

Visit a Web hosting company and try to learn how Web analytics can
help the company to monitor the activity on the hosted websites of
the server.

10.8 SPORTS ANALYTICS


Sports analytics is a technique of analysing relevant and historical information in the field of sports, mainly to perform better than another team or individual. The information gathered in sports is analysed by coaches, players and other staff members for decision making both during and prior to sporting events. With rapid advancements in technology in the past few years, data collection has become more precise and relatively easier than earlier. The advancement in the collection of data has also contributed to the growth of sports analytics, as it totally relies on the collected pool of data. The growth in analytics has further led to the building of technologies such as fitness trackers, game simulators, etc. Fitness trackers are smart devices that provide data about the fitness of players, on the basis of which coaches can take a decision on whether or not to include particular players in the team. Game simulators help in practising the game before the actual sporting event takes place.

Sports analytics not only modifies the way a game is played but also changes the way the performance of players is recorded. National Basketball Association (NBA) teams are now using player tracking technology which can evaluate the efficiency of a team by analysing the movement of its players. As per the information provided by the SportVu software website, NBA teams have installed six cameras for tracking the movements of each player on the court and the basketball at the rate of 25 times per second. The data collected using cameras provides a significant amount of innovative statistics on the basis of speed, player separation and ball possession: for example, how fast a player moved, how much distance he covered during the game, how many times he passed the ball and much more. On the basis of the data collected, strategies are created to win the game or to improve performance in the game.
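A hedged sketch of what such tracking statistics look like computationally (Python with NumPy assumed; the simulated 25-samples-per-second positions are invented, not real SportVu data):

import numpy as np

rng = np.random.default_rng(5)

# Simulated (x, y) positions in metres for one player over 10 seconds,
# sampled 25 times per second as described above
hz = 25
positions = np.cumsum(rng.normal(scale=0.08, size=(10 * hz, 2)), axis=0)

# Distance covered = sum of straight-line steps between samples
steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
total_distance = steps.sum()
avg_speed = total_distance / (len(positions) / hz)   # metres per second

print(f"distance covered: {total_distance:.1f} m")
print(f"average speed: {avg_speed:.2f} m/s")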

Sports analytics has also found application in the field of sports gambling. The availability of more accurate information about teams and players on websites has taken sports gambling to new levels. Analytics information helps gamblers in better decision making and in attaining accuracy in predicting the outcomes of games or the performance of a particular player. In addition to websites or Web pages, a number of companies also provide minute details of players or teams to gamblers to fulfill their betting requirements. Sports gambling contributes around 13% of the global gambling industry, valued somewhere between $700 and $1,000 billion. Some of the popular websites which provide betting services to users are bet365, bwin, Paddy Power, betfair, and Unibet.

self assessment Questions

17. Fitness trackers are _____ devices that provide data about the
fitness of players.
18. Sports analytics does not contribute in the field of sports
gambling. (True/False)

Activity

Discuss with your friends how analytics can be used in the field of
sports to enhance the energy of players while protecting them from
injuries.


10.9 ANALYTICS FOR GOVERNMENT AND NGOs
Data analytics is also playing its role in the government sector. Not only is it important for government, it is equally beneficial for non-governmental organisations. Data analytics is used by these organisations to get deeper details from data. These details are used by the organisations for modernising their services, tracking progress and determining solutions faster.

Big Data analytics is used in almost every part of the world for deriving useful information from huge sets of data. Not only private organisations and industries are employing data analytics, but many government enterprises are also adopting data analytics for taking smart decisions for the benefit of their citizens. A lot of data gets generated in the government sector, and processing and analysing this data helps the government in improving its policies and services for citizens. Some benefits of data analytics in the government sector are as follows:
‰‰ With the rise of national threats and criminal activities these days,
it is important for any government to ensure the safety and security of
its citizens. With the help of data analytics, intelligence organisa-
tions can detect crime-prone areas and be prepared to prevent or
stop any kind of criminal activity.
‰‰ Analytics also helps in detecting the possibility of cyber attacks,
identifying criminals and detecting their patterns of attack. The
government can, therefore, take appropriate action in advance to
protect people from any kind of financial loss.
‰‰ Government can use analytics to track and monitor the health of its
citizens. It can also be used for tracking disease patterns. The gov-
ernment can launch proper healthcare facilities in advance in
areas prone to diseases. Analytics also helps in arranging and managing
free medicines, vaccinations, etc., in order to save people's lives.
‰‰ Real-time analysis and sensors help government departments in
water management in the city. The officials can detect issues
in the flow of water and the pollution level in water, predict scarcity of
water on the basis of usage, detect areas of leakage, etc. Government
departments can take proper action to avoid these issues and ensure
the supply of clean water in the city.
‰‰ Government organisations also use analytics to detect tax frauds
and predict the revenue. Government can take necessary steps to
prevent tax frauds and increase the revenue.
‰‰ Government can also use the analytics in the field of agriculture
to know the appropriate time for cultivation of crops, fertilisers
required for crops, etc. Moreover, the government can also take
prior actions to prevent damage of crops in case of various envi-
ronmental challenges.


You can say that data analytics is helping the government in building
smart cities having the capability of fast detection and rectification of
problems. For example, in India, the government led by Prime Min-
ister Narendra Modi has been encouraging people to adopt the Digital
India initiative. This will lead to ease of collection and quicker avail-
ability of data for analytics to detect flaws in money transactions and
prevent people from becoming victims of fake currency.

Data analytics also helps NGOs in improving their services to needy
or poor people. Mainly, NGOs help people in several ways, such as by
providing free education, books, medicines, clothes, etc. NGOs use
data analytics to become more efficient in raising and allocating
funds, predicting trends and planning campaigns, identifying pro-
spective donors and encouraging donors who have made contribu-
tions earlier, etc. Consider the case of a non-profit organisation, the Ak-
shaya Patra Foundation, which supplies food to government schools
in Bangalore. The foundation was finding it difficult to supply food to
government schools due to the high cost involved. Therefore, it
looked for a cost-effective solution to deliver food to schools without
any interruption.

According to Chanchalapathi Dasa, Vice-chairman of Akshaya Pa-
tra, the foundation uses 34 routes for delivering food to government
schools in Bangalore, and the expenditure on each route is approximately
Rs. 60,000 per month. Therefore, the organisation decided to use
data analytics to find a cost-effective solution to this problem.
While analysing various parameters involved in food delivery, such as
the number of vehicles utilised and the time and fuel used on each route,
it found that Rs. 3 lakh could be saved by reducing the number of
routes by five.
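The arithmetic behind the quoted saving follows directly from the figures above; the short Python sketch below reproduces it, assuming (as the text does) a flat approximate cost per route:

routes = 34                # delivery routes used in Bangalore
cost_per_route = 60_000    # approximate monthly expenditure per route (Rs.)
routes_removed = 5         # routes eliminated after analysing vehicles, time and fuel

print("Total monthly cost: Rs.", routes * cost_per_route)          # 34 x 60,000 = Rs. 20,40,000
print("Monthly saving:     Rs.", routes_removed * cost_per_route)  # 5 x 60,000 = Rs. 3,00,000 (Rs. 3 lakh)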

Besides Akshaya Patra, several other large NGOs, such as the Bill and
Melinda Gates Foundation India, Save the Children India, and Child
Rights and You (CRY), are also utilising data to raise their efficiency
in getting and allocating funds, predicting trends and planning cam-
paigns.

These NGOs often face difficulties with data collection because they
use traditional ways of collecting data. In order to overcome these
challenges, NGOs have allotted mobile phones equipped with apps
so that real-time collection and recording of data can take place. The
data recorded in this manner is accurate and gives more precise
information on the basis of which further decisions or action plans
can be made.


self assessment Questions

19. NGO stands for
a. Non-governmental organisation
b. Non-governor organisation
c. Non-governing organisation
d. None of these
20. Analytics is helpful for government in building smart cities.
(True/False)

Activity

Visit a nearby NGO and try to find out how analytics has helped
it in improving its services and focusing more on the overall de-
velopment of the people or area.
10.10 SUMMARY
‰‰ Business analytics has expanded consistently over the previous
decade, as evidenced by the constantly growing business ana-
lytics software market.
‰‰ Fraud impacts organisations in several ways, which might be relat-
ed to financial, operational and psychological processes.
‰‰ Numerous organisations remain vulnerable to fraud and financial
crime because they are not exploiting new capabilities to combat
today's threats.
‰‰ Organisations generally move to HR analytics and data-led solu-
tions when problems exist that cannot be resolved with
current management practices.
‰‰ Marketing analytics helps in providing deeper insights into custom-
er preferences and trends. Despite these benefits, a majority of
organisations fail to realise the potential of marketing analytics.
‰‰ Healthcare organisations are also implementing approaches,
for example Lean and Six Sigma, to take a more patient-centric
focus, reduce errors and waste, and improve the flow of
patients, with the objective of enhancing quality.
‰‰ Organisations that operate in a highly competitive global environ-
ment need to have a highly effective supply chain management
system in place.


‰‰ The use of smarter logistics helps supply chain organisations in
providing more visibility in the global market.
‰‰ Web analytics can provide accurate figures of visitors on a website
and page views. It helps in gauging Web traffic and popularity pat-
terns.

key words

‰‰ Capacity analytics: It helps in tracking the number of people
who are operationally efficient and currently in business.
‰‰ Employee churn analytics: It refers to the process of estimating
your staff turnover rates for predicting the future and reducing
employee churn.
‰‰ Employee performance analytics: It is used in assessing the
performance of an individual employee.
‰‰ Fraud analytics: It is used to detect whether a financial activity
is fraudulent or not, to prevent any kind of financial loss.
‰‰ Marketing analytics: It helps in providing deep insights into cus-
tomer preferences and trends.

10.11 DESCRIPTIVE QUESTIONS


1. Discuss the importance of financial and fraud analytics for an
organisation.
2. Describe the role of HR analytics in an organisation.
3. What do you understand by marketing analytics? Discuss the
steps in getting the best assistance from marketing analytics.


4. How is healthcare analytics useful in the medical field? Explain
with suitable examples.
5. Why is analytics required in the supply chain? Discuss with
suitable reasons.
6. What is Web analytics? Enlist the steps involved in the Web
analytics process.
7. Describe the importance of analytics in the field of sports.
8. Discuss the need for analytics for government and NGOs.


10.12 ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                                Q. No.   Answers
Financial and Fraud Analytics        1.       Whistleblower
                                     2.       Advanced analytics
                                     3.       True
HR Analytics                         4.       Talent
                                     5.       True
                                     6.       Capacity
Marketing Analytics                  7.       a. Search Engine Optimisation
                                     8.       True
                                     9.       False
Healthcare Analytics                 10.      b. Electronic Medical Records
                                     11.      True
                                     12.      Real-time
Supply Chain Analytics               13.      True
                                     14.      Advanced
Web Analytics                        15.      Web
                                     16.      Off-site, On-site
Sports Analytics                     17.      Smart
                                     18.      False
Analytics for Government and NGOs    19.      a. Non-governmental organisation
                                     20.      True

HINTS FOR DESCRIPTIVE QUESTIONS


1. Fraud impacts organisations in several ways which might be
related to financial, operational and psychological processes.
Refer to Section 10.2 Financial and Fraud Analytics.
2. HR analytics, also called talent analytics, is the application
of sophisticated data mining and business analytics (BA)
techniques to HR data. Refer to Section 10.3 HR Analytics.
3. Marketing analytics is the practice of measuring, managing and
analysing marketing performance to maximise its effectiveness
and optimise return on investment (ROI). Refer to Section
10.4 Marketing Analytics.


4. Healthcare analytics is a term used to describe the analysis of
healthcare activities using the data generated and collected
from different areas in healthcare, such as pharmaceutical
data, research and development (R&D) data, clinical data,
patient behaviour and sentiment data, etc. Refer to Section 10.5
Healthcare Analytics.
5. Supply chain is an arrangement of organisations, individuals,
activities, data and assets required in moving an item or service
from supplier to the client. Refer to Section 10.6 Supply Chain
Analytics.
6. Web analytics refers to measuring, collecting, analysing and
reporting of Web data to understand and optimise the usage of
Web. Refer to Section 10.7 Web Analytics.
7. Sports analytics is the technique of analysing relevant and
historical information in the field of sports, mainly to perform
better than any other team or individual. Refer to Section
10.8 Sports Analytics.
8. Data analytics is also playing its role in the government sector.
Not only is it important for the government, it is equally beneficial
for non-governmental organisations, which are also often
called non-profit organisations. Data analytics is used by these
organisations to get deeper insights from data. Refer to Section
10.9 Analytics for Government and NGOs.

10.13 SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS

‰‰ Yang, H., & Lee, E. K. (2016). Healthcare analytics: From data to
knowledge to healthcare improvement. Hoboken, NJ: John Wiley
& Sons, Inc.
‰‰ Marketing Analytics: Data-Driven Techniques with Microsoft
Excel. (n.d.). Retrieved May 03, 2017, from
http://www.wiley.com/WileyCDA/WileyTitle/productCd-111837343X.html
‰‰ Predictive HR Analytics: Mastering the HR Metric. (n.d.).
Retrieved May 3, 2017, from
https://www.amazon.com/Predictive-HR-Analytics-Mastering-Metric/dp/0749473916

E-REFERENCES
‰‰ Data analysis techniques for fraud detection. (2017, April 26).
Retrieved May 03, 2017, from
https://en.wikipedia.org/wiki/Data_analysis_techniques_for_fraud_detection


‰‰ HR Analytics. (2017, March 17). Retrieved May 03, 2017, from
https://www-01.ibm.com/software/analytics/solutions/operational-analytics/hr-analytics/
‰‰ Health care analytics. (2017, March 26). Retrieved May 03, 2017,
from https://en.wikipedia.org/wiki/Health_care_analytics
‰‰ What is marketing analytics? (n.d.). Retrieved May 03, 2017, from
https://www.sas.com/en_us/insights/marketing/marketing-analytics.html



Chapter 11

CASE STUDIES

CONTENTS

Case Study 1   How Cisco IT Uses Big Data Platform to Transform Data Management
Case Study 2   USDA Used Data Mining to Know the Patterns of Loan Defaulters
Case Study 3   Cincinnati Zoo Used Business Analytics for Improving Performance
Case Study 4   Application of Business Analytics in Resource Management
Case Study 5   Role of Descriptive Analytics in the Healthcare Sector
Case Study 6   An Application of Predictive Analytics in Underwriting
Case Study 7   UniCredit Bank Applies Prescriptive Analytics for Risk Management
Case Study 8   Campaign Success of MediaCom
Case Study 9   Dundas BI Solution Helped Medidata and its Clients in Getting Better Data Visualisation
Case Study 10  Sports Analytics Helped in the Enrichment of Performance of Players
Case Study 11  Fraud Analytics Solution Helped in Saving the Wealth of Companies
Case Study 12  Big Data Analytics Allowing Users to Visualise the Future of Free Online Classifieds


HOW CISCO IT USES BIG DATA PLATFORM TO TRANSFORM DATA MANAGEMENT

This Case Study shows how Hadoop Architecture based on Cisco
UCS Common Platform Architecture (CPA) for Big Data is used for
business insight. It is with respect to Chapter 1 of the book.

Background
Cisco is one of the world's leading networking organisations and
has transformed the way people connect, communicate and
collaborate. Cisco IT has 38 global data centres that together
comprise 334,000 square feet of space.

Challenge
The company had to manage large datasets of information about
customers, products and network activities, which actually
comprise the company's business intelligence. In addition,
there was a large quantity of unstructured data, running into
terabytes, in the form of Web logs, videos, emails, documents
and images. To handle such a huge amount of data, the company
decided to adopt Hadoop, an open-source software framework
that supports distributed storage and processing of big datasets.
According to Piyush Bhargava, a distinguished engineer at
Cisco IT who handles big data programs, “Hadoop behaves like
an affordable supercomputing platform.” He also says, “It moves
compute to where the data is stored, which mitigates the disk I/O
bottleneck and provides almost linear scalability. Hadoop would
enable us to consolidate the islands of data scattered throughout the
enterprise.”
To implement the Hadoop platform for providing big data analytics
services to Cisco business teams, firstly Cisco IT was required to
design and implement an enterprise platform that could support
appropriate service level agreements (SLAs) for availability and
performance. Piyush Bhargava says, “Our challenge was adapting
the open source Hadoop platform for the enterprise.”
The technical requirements of the company for implementing the
big data architecture were to:
‰‰ have open source components in place to establish the archi-
tecture.
‰‰ know the hidden business value of large datasets, whether the
data is structured or unstructured


‰‰ provide service-level agreements (SLAs) to internal custom-
ers who want to use big data analytics services
‰‰ support multiple internal users on the same platform

Solution
Cisco IT developed a Hadoop platform using Cisco® UCS Common
Platform Architecture (CPA) for Big Data.
According to Jag Kahlon, a Cisco IT architect, “Cisco UCS CPA for
Big Data provides the capabilities we need to use big data analytics
for business advantage, including high-performance, scalability,
and ease of management.”
For computation, the building block of the Cisco IT Hadoop
Platform is the Cisco UCS C240 M3 Rack Server, which is
powered by Intel Xeon E5-2600 series processors, 256 GB of RAM,
and 24 TB of local storage.
Virendra Singh, a Cisco IT architect, says, “Cisco UCS C-Series
Servers provide high performance access to local storage, the biggest
factor in Hadoop performance.”
The present architecture contains four racks of servers, where
each rack has 16 server nodes providing 384 TB of raw
storage per rack. Kahlon says, “This configuration can scale to 160
servers in a single management domain supporting 3.8 petabytes of
raw storage capacity.”
Cisco IT server administrators are able to manage all elements
of Cisco UCS, including servers, storage access, networking and
virtualisation, from a single Cisco UCS Manager interface. Kahlon
declares, “Cisco UCS Manager significantly simplifies management
of our Hadoop platform. UCS Manager will help us manage larger
clusters as our platform grows without increasing staffing.”
Cisco IT uses MapR Distribution for Apache Hadoop, and code
written in advanced C++ rather than Java. Virendra Singh says,
“Hadoop complements rather than replacing Cisco IT’s traditional
data-processing tools, such as Oracle and Teradata. Its unique value
is to process unstructured data and very large data sets far more
quickly and at far less cost.”
Hadoop Distributed File System (HDFS) manages the storage on
all Cisco UCS C240 M3 servers in the cluster to form one
large logical unit. The HDFS system then splits the data into smaller
chunks for further processing and for performing ETL (Extract,
Transform and Load) operations.
Hari Shankar, a Cisco IT architect, says, “Processing can continue
even if a node fails because Hadoop makes multiple copies of every
data element, distributing them across several servers in the cluster.
Even if a node fails, there is no data loss.” Hadoop can detect node
failure automatically and create another copy of the data in parallel,
without disrupting any processes across the remaining servers. In
addition, the total volume of data is not increased, as Hadoop also
compresses the data.
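The fault tolerance described here comes from block replication. The following minimal Python sketch simulates, rather than uses, Hadoop: it only illustrates the idea that every block is copied to several nodes (HDFS's default replication factor is three), so the failure of any single server loses no data. The node and block names are hypothetical.

import random

REPLICATION = 3  # HDFS's default block replication factor

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Assign each block to `replication` distinct nodes, HDFS-style
    return {block: random.sample(nodes, replication) for block in blocks}

def survives_failure(placement, failed_node):
    # Data survives if every block keeps at least one live replica
    return all(any(n != failed_node for n in replicas)
               for replicas in placement.values())

nodes = [f"node-{i}" for i in range(1, 17)]      # a 16-node rack, as above
blocks = [f"block-{i}" for i in range(100)]      # chunks of one large file
placement = place_blocks(blocks, nodes)
print(survives_failure(placement, "node-7"))     # True: replicas remain elsewhere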
To handle tasks like job scheduling and orchestration, Cisco IT
uses the Cisco Tidal Enterprise Scheduler (Cisco TES), which
works as an alternative to Oozie. Cisco TES connects
Hadoop components automatically and eliminates the need to
write Sqoop code manually to download data, move it to
HDFS and then execute commands to load the data into Hive.
Singh says, “Using Cisco TES for job-scheduling saves hours
on each job compared to Oozie because reducing the number of
programming steps means less time needed for debugging.” Another
benefit of using Cisco TES is that it operates on mobile devices, so
that the end-users of the company can manage big data jobs from
anywhere.

Results
The main result of transforming the business using Big Data at
Cisco IT is that the company has introduced multiple big data
analytics programs, which are based on the Cisco® UCS Common
Platform Architecture (CPA) for Big Data.


The company's revenues from partner sales have increased. The
company has started the Cisco Partner Annuity Initiative program,
which is in production. Piyush says, “With our Hadoop architecture,
analysis of partner sales opportunities completes in approximately
one-tenth the time it did on our traditional data analysis architecture,
and at one-tenth the cost.”
The company's productivity has increased by making intellectual
capital easier to find. Earlier, many knowledge workers in Cisco
used to spend a lot of time throughout the day searching for content
on websites, as most of the content was not tagged with relevant
keywords. Now, Cisco IT has replaced the static, manual tagging
process with dynamic tagging based on user feedback. This process
uses machine-learning techniques to examine users' usage patterns
and also acts on user suggestions for new search tags.
Moreover, the Hadoop platform analyses log data of collaboration
tools, such as Cisco Unified Communications, email, Cisco
TelePresence®, Cisco WebEx®, Cisco WebEx Social, and Cisco
Jabber™, to reveal commonly used communication methods and
organisational dynamics.

Lessons Learned
Cisco IT has come up with the following observations shared with
other organisations:
‰‰ Hive is good for structured data processing, but provides lim-
ited SQL support.
‰‰ Sqoop easily moves a large amount of data to Hadoop.
‰‰ Network File System (NFS) saves time and effort to manage a
large amount of data.
‰‰ Cisco TES simplifies the job-scheduling and orchestration
process.
‰‰ A library of user-defined functions (UDFs) provided by Hive
and Pig increases developer productivity.
‰‰ Knowledge of internal users is enhanced as they can now
analyse unstructured data of email, webpages, documents,
etc., besides data stored in databases.

questions
1. What were the challenges faced by Cisco?
(Hint: Open source components, service-level agreements
(SLAs) to the internal customers, etc.)
2. What are the lessons learned by Cisco?


(Hint: Hive is good for structured data processing, Cisco
TES simplifies job-scheduling and orchestration process,
Network File System (NFS) saves time and effort to
manage a large amount of data, etc.)


USDA USED DATA MINING TO KNOW THE PATTERNS OF LOAN DEFAULTERS

This Case Study discusses how a US-based rural welfare department
used the data mining technique for providing loans to people for
their welfare and development. It is with respect to Chapter 2 of the
book.
USDA Rural Development is presided over by an Under-Secretary,
who is appointed directly by the US President and confirmed by the
Senate of the United States. The role of the Under-Secretary is to
provide executive direction and policy leadership so as to ensure
improved economic opportunities for the rural communities
of America. The department has a loan portfolio of more than
$216 billion for providing economic opportunities to the rural
communities of the nation.
The Rural Housing Service of the USDA runs various programmes
to create and improve housing and other important community
facilities in rural areas. The USDA also provides loans, grants
and loan guarantees for single- and multi-family housing, fire
and police stations, child care centres, hospitals, nursing homes,
libraries, schools, etc. The main aim of the USDA and its partners
working together is to make sure that rural America is a better
place to live, work and raise a family.
The USDA's Rural Housing Service has administered a loan
program that provides mortgage loans to people residing in rural
areas. To manage these nearly 6,00,000 loans, the department
has maintained detailed information about each loan in its data
N

warehouse. Like earlier lending programs, although some USDA


loans got a better response than others, it was difficult for the
department to track the exact status of those loans.
The USDA decided to adopt a data mining technique for a better
understanding of loans, improvement in handling its lending
program and reducing the occurrence of problem loans. Using
the data mining technique, the department wanted to determine
patterns that could differentiate borrowers who repay their loans
punctually from those who do not. Determining such patterns
could help forecast the creditworthiness of a borrower.
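As an illustration of the kind of pattern mining involved, the sketch below fits a small decision tree to hypothetical borrower records using scikit-learn. The features, figures and labels are invented for illustration only; this is not the USDA's actual model or data.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical borrower records: [annual income, loan amount, years at address]
X = [
    [28_000, 90_000, 1], [52_000, 80_000, 6], [31_000, 110_000, 2],
    [64_000, 70_000, 9], [24_000, 95_000, 1], [58_000, 60_000, 7],
    [35_000, 120_000, 2], [71_000, 85_000, 11],
]
y = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = repayment problems, 0 = repaid punctually

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned rules approximate the "patterns" the text describes
print(export_text(model,
                  feature_names=["income", "loan_amount", "years_at_address"]))

A tree trained on real loan histories would surface rules of exactly this shape, separating borrowers likely to repay punctually from those likely to become problem loans.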
Commercial lenders in the US also use the data mining technique
for predicting loan default or poor repayment behaviour at the
time of providing loans to people. However, the main interest of the
USDA is somewhat different from that of commercial lenders, as it
is more interested in determining problems in loans that have
already been granted.


Segregating problem loans allows the USDA to give more attention
and assistance to such borrowers, thereby reducing the possibility
that their loans will become problems.

questions

1. What were the motives behind setting up USDA Rural
Development?
(Hint: Welfare of rural areas of America, etc.)
2. How could the data mining technique help the USDA?
(Hint: To determine problems in the already granted
loans, etc.)


CINCINNATI ZOO USED BUSINESS ANALYTICS FOR IMPROVING PERFORMANCE

This Case Study discusses how business analytics has made an
impact on midsized organisations by improving their business
performance in real time. It is with respect to Chapter 3.

Background
Opened in 1875, Cincinnati Zoo & Botanical Garden is a world-
famous zoo that is located in Cincinnati, Ohio, US. It has more
than 1.3 million visitors every year.

Challenge

In late 2007, the management of the zoo had begun a strategic
planning process to increase the number of visitors by enhancing
their experience with an aim to generate more revenues. For
this, the management decided to increase the sales of food items
and retail outlets in the zoo by improving their marketing and
promotional strategies.
According to John Lucas, the Director of Operations at Cincinnati
Zoo & Botanical Garden, “Almost immediately, we realised we had
a story being told to us in the form of internal and customer data, but
we didn’t have a lens through which to view it in a way that would
allow us to make meaningful changes.”


Lucas and his team members were interested in finding business
analytics solutions to meet the zoo’s needs. He said, “At the start,
we had never heard the terms ‘business intelligence’ or ‘business
analytics’; it was just an abstract idea. We more or less stumbled
onto it.”
They looked at various providers but initially did not include
IBM, on the false assumption that they could not afford it. Then
somebody pointed out that it was completely free to talk to IBM,
and they found that IBM not only suggested a solution that could
fit their budget, but it was also the most appropriate solution for
what they were looking for.

Solution
IBM provided a business analytics solution to the zoo's
executive committee, which makes it possible to analyse data
related to customers' memberships, admissions, food purchases,
etc., in order to gain a better understanding of visitors' behaviour.
This solution also makes it possible to analyse geographic
and demographic information that could help in customer
segmentation and marketing.
The zoo's executive committee wanted a platform which would
be capable of delivering the desired goals by combining and
analysing data related to ticketing and point-of-sale systems,
memberships and geographical facts. The entire project was
handled by senior executives of the zoo and consultants from IBM
and BrightStar Partners, an IBM Business Premier Partner.
handled by senior executives of the zoo and consultants of IBM
and BrightStar Partners, an IBM Business Premier Partner.
Lucas said, “We already had a project vision, but the consultants on
IBM’s pre-sales technology team helped us identify other opportunity
areas.” During the project implementation, BrightStar became
the zoo's main point of contact, and a platform was built
on IBM Cognos 8.4 in late 2010, which was further upgraded to
Cognos 10 in early 2011.

Output
The result of implementing IBM's business analytics solution
is that the zoo's return on investment (ROI) has increased. Lucas
admits, “Over the 10 years we’d been running that promotion, we
lost just under $1 million in revenue because we had no visibility
into where the visitors using it were coming from.”
The new business analytics solution has helped the zoo save costs;
for example, there was a saving of $40,000 in marketing in the
first year, visitor numbers increased by 50,000 in 2011, food sales
increased by at least 25%, and retail sales increased by at least 7.5%.
By adopting the new operational management strategies of the
business analytics solution, there has been a remarkable increase in
attendance and revenues, which has resulted in an annual ROI
of 411%. Lucas admits, “Prior to this engagement, I never would
have believed that an organisation of the size of the Cincinnati Zoo
could reach the level of granularity its business analytics solution
provides. These are Fortune 200 capabilities in my eyes.”

questions

1. What was desired by the Cincinnati Zoo & Botanical
Garden in their business operations?
(Hint: They wanted to increase the sales of food items
and retail outlets in the zoo by improving their marketing
and promotional strategies.)
2. How did IBM help the zoo?
(Hint: IBM has provided a business analytics solution to
the zoo’s executive committee, which helps in analysing
data related to the membership of customers, their
admission and food, etc. in order to gain the better
understanding of visitor behaviour.)


APPLICATION OF BUSINESS ANALYTICS IN RESOURCE MANAGEMENT

This Case Study discusses how a real estate company uses business
analytics for resource management. It is with respect to Chapter 4.
Analytics can draw on cross-domain expertise. This case study
presents an instance where a real estate company assisted a law
firm in deciding whether or not to relocate to a different office
space through the use of data devices. This was done based on
the feedback of employees of the law firm, collected by the internal
analytics team of the real estate company. This feedback helped
the real estate company come up with an employee lean
management program for the law firm.

In a one-of-a-kind example, the law firm wished to attract and retain
the most suitable employees, so the first factor to be evaluated
was personnel retention. The firm had received great ratings for its
brilliant services and consistent focus on improving the customer
service experience. Being a firm with services of this range, it
certainly faced some challenges, as with any other resource-critical
organisation. To deal with space-related issues, the firm roped in
the real estate company, which went on not only to suggest the
office space but also to streamline resource operations, demonstrating
the positive impact of adopting big data business analytics as the
firm's chief workforce driver.

Method
The company conducted a few surveys and questionnaires among
the group and came up with a solution to streamline and lean-
manage the teams within the law firm. For the office
space, the real estate company used the firm's resources to
map out where the employees were most often. It assisted the law
firm by utilising different location-aware mechanisms to keep track
of the whereabouts of the firm's personnel, with data accumulated
based on employee preferences and activities. The end result was
that the law firm decided to relocate from the high-rise office to a
more affordable space based on the location habits of its personnel.
The new location was so convenient for employees that it resulted
in increased employee retention, thereby saving costs for the firm.
Apart from the above actions, the following questionnaires were
circulated across various departments:
Questions for Management:
‰‰ What evaluation methods should be employed to assess the
yearly performance of employees?


‰‰ What cost or economic leaks can be present inside the sys-
tem which can be fixed altogether to come up with a fool-proof
plan?
‰‰ How will rapid change in existing SLAs be met along with
proper and due transition? (For upper level executives)
Based on the knowledge and responses received, the organisation
began its study, applying standard resourcing parameters and
concepts to work out the best possible scenario and its implications.
This helped the organisation go far in using business analytics as
an effective medium for whichever operations it takes up.

questions

1. What were the initial challenges faced by the law firm?
(Hint: Office space relocation indecisiveness, employee
IM
retention impact and resourcing issues)
2. What are the lessons learned from this case study?
(Hint: You can cite examples of cross-functionality
deployed by the real estate team to denote excellent all-
round services provided by the real estate company.)

ROLE OF DESCRIPTIVE ANALYTICS IN THE HEALTHCARE SECTOR

This Case Study discusses the role of descriptive analytics in
overcoming challenges in the healthcare industry. It is with respect
to Chapter 5 of the book.
Across the world, healthcare organisations are focussed on
providing better-quality services to patients. Therefore, it is
necessary to define performance and determine the methods required
for improving quality in the healthcare sector. Much research
has been performed with an aim to take feedback from patients
and their families, healthcare professionals, planners and
others on patient outcomes, professional development, system
performance, etc.

Many standards and measurable attributes can be used for
defining performance and quality in the healthcare industry.
Some attributes are effectiveness, timeliness, safety, efficiency,
accessibility and availability. In addition to this, healthcare
organisations also consider patient and social preferences in
order to assess and assure quality in the healthcare sector.
The major challenge in the healthcare sector across the world
is the crowding of emergency rooms, which may lead to serious
consequences and complications. Overcrowding and the poor
performance of emergency rooms lead to patients waiting a long
time for treatment from the time they arrive at the hospital.
Crowding of emergency rooms and reduced performance in
this essential service is a serious issue for both researchers and
professionals in the healthcare sector. A number of studies
have been conducted to analyse the factors associated with the
overcrowding of emergency rooms. Some researchers have
classified the factors that lead to overcrowding of emergency rooms
into three categories, namely input factors, throughput factors
and output factors. On the other hand, some other researchers
have determined the length of stay (LOS) in emergency rooms by
dividing it into the following three intervals (a short worked sketch
of computing them follows the list):
‰‰ Waiting time: It refers to the interval of time between a
patient's arrival and his/her examination by a physician in an
emergency room.
‰‰ Treatment time: It refers to the interval of time between
the start of the examination by the physician and a decision to
admit the patient to the hospital or discharge him/her.
‰‰ Boarding time: It refers to the interval of time from the
decision to admit a patient till he/she is shifted to
an inpatient hospital bed.
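A minimal pandas sketch of these three intervals follows, using invented timestamps; the column names and data are hypothetical, not from any real hospital system:

import pandas as pd

# Hypothetical emergency-room timestamps for three patients
visits = pd.DataFrame({
    "arrival":    pd.to_datetime(["08:00", "08:30", "09:10"]),
    "seen_by_md": pd.to_datetime(["08:45", "09:40", "09:35"]),
    "decision":   pd.to_datetime(["10:15", "11:00", "10:05"]),
    "bed":        pd.to_datetime(["12:00", "11:20", "10:30"]),
})

visits["waiting_time"]   = visits["seen_by_md"] - visits["arrival"]
visits["treatment_time"] = visits["decision"]   - visits["seen_by_md"]
visits["boarding_time"]  = visits["bed"]        - visits["decision"]

# Descriptive summary of each interval -- the kind of table referred to below
for col in ("waiting_time", "treatment_time", "boarding_time"):
    print(col, visits[col].mean())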


These conceptual models help in building strategies and
solutions to reduce crowding to a great extent. Besides the
handling of patients, some more problems also exist that contribute
to overcrowding in emergency rooms and prolonged LOS.
Inadequate staffing and a shortage of treatment areas make
patients wait longer for their turn or leave the hospital without
examination or proper treatment. Moreover, delays in using
ancillary services, like labs, radiology and other procedures, also
contribute to overcrowding.
Descriptive analytics has emerged as a data processing method
in modern healthcare organisations, which helps in summarising
historical data to extract meaningful information from it and
preparing the data for further analysis. Such information helps
in resolving various issues and making appropriate decisions in
healthcare organisations.
Descriptive analytics also helps in studying various decisions in
healthcare and their impact on service performance and clinical
results. Descriptive analytics is an easy and simple approach to
apply, and the data is usually represented in terms of graphs and
tables, which display hospital occupancy rates, average time of
stay, indicators related to healthcare services, etc.
Moreover, descriptive analytics provides data visualisation, which
helps in answering specific queries or determining patterns
of care. It therefore provides a broader perspective for evidence-
based clinical practice. This allows organisations to handle real-
time, or near real-time, data (what can be referred to as operational
content) and capture visual data of all the patients. This analytics is
also helpful in determining those patterns among patients which
previously went unnoticed. Thus, descriptive analytics is playing
its role in providing better services to patients by providing deep
insights into data.

questions

1. What were the challenges faced by hospitals in emergency
services?
(Hint: Overcrowding of patients, delay in providing health
services, etc.)
2. What are the advantages of descriptive analytics?
(Hint: To determine hidden patterns, better visualisation
of information, etc.)


AN APPLICATION OF PREDICTIVE ANALYTICS IN UNDERWRITING

This case study is from the D&O (Directors and Officers
Liability) insurance industry, in which the executives of Scottsdale
Insurance Company were presented with a precarious underwriting
proposal following the recession in 2008. The proposal stated
that liability insurance (compensation for damages or defence
fee loans, in a scenario in which an insured customer was
to suffer a loss as a result of a legal settlement) was to be paid
to the institution and/or its executives and administrators.
Scottsdale Insurance Company approved this proposition, and
thus Freedom Specialty Insurance Company was formed.

Freedom Specialty Insurance Company placed the industry as its
top priority. Using external predictive analytics data to calculate
risk, D&O claims could be foreseen from class action lawsuit
data. An exclusive, multimillion-dollar underwriting model was
created, which has proven profitable to Freedom in the amount
of $300 million in annual direct written premiums. Losses have
been kept at a minimum, with a rate below 49% in 2012, the
industry's average loss percentage.
The model has proven successful in all areas, with satisfied and
assured employees at all levels of the company, as well as the
reinsurers being content. This case study is a great example
of how predictive analytics helped Freedom soar high with a
revamped and modernised underwriting model. Many teams took
part in developing the new policy: the predictive model itself was
constructed and assessed by an actuarial firm; the user interface
was crafted by an external technology supplier, who also built
the integration with the company's systems; and technology from
SAS supplied components, such as data repositories, statistical
analytics engines and reporting and visualisation utilities. The
refurbished system that Freedom employed consists of the
following components:
refurbished system that Freedom employed consists of the
following components:
‰‰ Data sources: The system assimilates six external sources
of data (such as class action lawsuits and other financial ma-
terial), and the data is acquired through executive company
applications. The external sources are frequently utilised by
the D&O industry. In the case of Freedom Specialty Insurance
Company, they have exclusive sources to contribute to their
predictive model. Freedom, in particular, spends a lot of time
in acquiring and retaining valuable information regarding
merchant activities. Classification and back testing exposes
merchant flaws, as well as inconsistencies in their informa-
tion. Freedom exerts extra time and energy for collaborating
with merchants to catalogue information and maintaining it at

NMIMS Global Access - School for Continuing Education


Case study 6: AN APPLICATION OF PREDICTIVE ANALYTICS IN UNDERWRITING 301

Case study 6
n o t e s

a high worth. Freedom also keeps a close watch on their exter-


nal data, applying stringent inspection to develop policy and
claims information. The company upholds data liberty from
merchants’ identification schemes. Although it takes more ef-
fort to decipher values, the process safeguards that Freedom
can promptly terminate business with certain merchants if
necessary.
‰‰ Data scrubbing: Upon its delivery, information undergoes
many “cleaning” processes, which ensures that the information
can be used to its maximum ability. For example, there is
a review of 20,000 separate class action lawsuits per month to
observe if any variations have occurred; they were originally
classified by different factors, but are now gone through
monthly. In the past, before the automated system was put
into place and the process had to be carried out manually,
it took weeks to finish. Now, with the modernised
methods and information cataloguing devices, everything can
be completed within hours.
IM
‰‰ Back testing: This is one of the most important processes,
which determines the potential risk upon receiving a claim. The
system will use the predictive model to run the claim and analyse
the selection criterion, altering tolerances as required. Upon
being used numerous times, the positive feedback loop polish-
es the system (a small back-testing sketch follows this list).
‰‰ Predictive model: Information is consolidated and run
through a model, which defines the most appropriate range of
pricing and limits through the use of multivariate analysis.
Algorithms assess the submission against numerous programmed
thresholds.
‰‰ Risk selection analysis: This provides the underwriter with
a brief analytical report of recommendations. Similar risks
are shown and contrasted alongside various other risk factors,
such as industry, size, monetary configuration and consider-
ations. The platform's fundamental standard is driven by
the underwriter's judgement, with the assistance of technol-
ogy. In other words, the system is made to support, but not
replace, the human underwriter.
‰‰ Interface with company systems: Once a conclusion is made,
designated data is delivered to the executives of the compa-
ny. The policy distribution procedure is still generally done
by hand, but is likely to be replaced by automated systems
later on. The policy is distributed, and the statistical data is
re-run through the data source element. More information is
contributed to the platform as claims are filed through loss. As
is evident in all D&O processes, underwriters are required to
have a thorough understanding of technical insurance. While
in the past underwriters put a great deal of effort into acquir-
ing, organising and evaluating information, they now have to
adapt to a system in which enormous quantities of data are
condensed onto a number of analytical pages.
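A back-test of the kind described above can be sketched in a few lines of Python. The numbers below are synthetic and the decision rule is deliberately simple; it is meant only to show how tolerances (thresholds) are altered and re-evaluated against history, not Freedom's actual model.

import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic history: a risk score per past submission, and whether it claimed
scores = rng.uniform(0, 1, 1_000)
claims = rng.uniform(0, 1, 1_000) < scores  # riskier business claims more often

def back_test(threshold):
    # Decline submissions scoring above the threshold; report the trade-off
    accepted = scores <= threshold
    loss_rate = claims[accepted].mean() if accepted.any() else 0.0
    return accepted.mean(), loss_rate

for threshold in (0.4, 0.5, 0.6):
    share, loss = back_test(threshold)
    print(f"threshold={threshold}: accept {share:.0%}, loss rate {loss:.0%}")

Each run of the loop plays the role of one pass through the positive feedback loop: the tolerance is altered, the historical outcome is recomputed, and the setting with the best balance of accepted business and loss rate is kept.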
Predictive analytics has greatly altered the responsibilities of the
customary underwriter, who now works closely with policyholders
and negotiators on book and risk control. Although the technology
has simplified and cut down a lot of the manual work, additional
experienced technical personnel also needed to be employed, with
the legal and numerical awareness that allows them to construct
predictive models in the financial area. Integrating this model has
enabled Freedom to improve proficiency in the process across many
zones. Processes involved in managing information, such as data
scrubbing, back testing and classification, were all discovered and
learned by the people themselves and were originally carried out
by hand. However, they have been increasingly mechanised since
they were first conceived. Also, there is an ever-growing quantity
of external sources. Freedom is currently undergoing processes
to assess the implementation of cyber security and intellectual
property lawsuits, with the predictive model continuously being
enhanced and improved.
The D&O industry has adopted many processes related to the
upkeep, feeding and preservation of the predictive model that
are utilised by other industries too. One situation in particular is
that, following the actuarial firm originally constructing the
predictive model, Freedom achieved full fluency in the program's
complex processes over the course of many months. Operations were
implemented to efficiently oversee all external merchants together.
A number of external parties (including the actuarial firm, the
IT firm, data vendors, reinsurers and internal IT) came together
to refine and organise the predictive model, all of them in
close collaboration with each other. It was a great feat for Freedom
to unite all of these individuals to take advantage of their distinct
expertise and understanding simultaneously.

POSITIVE RESULTS OF THE MODEL


‰‰ Freedom ended up having positive results from the implemen-
tation of their predictive analytics model, with many new op-
portunities and insights provided for the company.
‰‰ Communication and correspondence with brokers and poli-
cyholders on the topic of risk management was boosted as a
result of the highly detailed analytic results.
‰‰ The model could even be expanded to cover other areas of
liability, like property and indemnity.


‰‰ Back testing and cataloguing mechanisms can also now be im-
plemented to foresee other data components in the future.
‰‰ The updated and automated model highlights Freedom as
a tough contender amongst competitor companies, and has
opened up windows to uncover even more possible data
sources.

questions

1. What were the initial challenges faced by Freedom
Specialty Insurance?
(Hint: Interface bottlenecks, manual processes, various
policies developed in silos, etc.)

2. What changes did the implementation of an advanced
predictive model bring in for the company?
(Hint: Integrated processes, easier claim tracking, etc.)

UniCredit Bank Applies Prescriptive Analytics for Risk Management

The Case Study discusses how an Italian bank, UniCredit, is using
FICO software to apply prescriptive analytics to risk management. It
is with respect to Chapter 7.
When analytics is combined with algorithms, it can make a
great impact on business. Italy's largest bank, UniCredit, has
done something similar, as it has figured out a model to handle
high volumes of data in its risk management processes.
For the bank, it was important to have the right information in
order to handle its risk management projects, as it may affect the
data infrastructure. The bank's goal was to replace the older
decision-making process with a new agile, flexible and productive
technology framework.
Recently, the bank implemented FICO software, which works
as a decision engine to manage data related to credit cards,
personal loans and other small business loans. According to Ivan
Cavinato, head of credit risk methodologies for the Italian bank,
“The predictive analytics and decision management software will
analyse big data to improve customer lending decisions and capital
optimization.”
The FICO software supports UniCredit's strategy on data and
prescriptive analytics to enhance customer relationships and
credit risk management. Prescriptive analytics provides multiple
decision options along with their future opportunity and risk. In
short, prescriptive analytics provides better decision options and
improved predictions with greater accuracy.
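The decision-option idea can be made concrete with a small sketch. Everything below (the options, probabilities, margins and the risk cap) is an invented illustration of prescriptive logic, combining predictions with business rules, rather than anything from FICO's or UniCredit's actual system.

# Hypothetical options for one application: predicted default probability,
# margin earned if repaid, and loss incurred if it defaults
options = {
    "approve_full_limit":    {"p_default": 0.08, "margin": 1_200, "loss": 9_000},
    "approve_reduced_limit": {"p_default": 0.05, "margin": 700,   "loss": 4_000},
    "decline":               {"p_default": 0.00, "margin": 0,     "loss": 0},
}

RISK_CAP = 0.06  # business rule: tolerate at most a 6% default probability

def expected_value(opt):
    return (1 - opt["p_default"]) * opt["margin"] - opt["p_default"] * opt["loss"]

# Prescriptive step: filter by the rule, then recommend the best feasible option
feasible = {name: o for name, o in options.items() if o["p_default"] <= RISK_CAP}
best = max(feasible, key=lambda name: expected_value(feasible[name]))
print(best, round(expected_value(options[best]), 2))  # approve_reduced_limit 465.0

The point is the combination: the probabilities are predictions, the cap is a business rule, and the recommended action, with its expected outcome, is the prescriptive output.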


Cavinato says, “Our goal is to get actionable insights resulting in
smarter decisions and better business outcomes. How you architect
business technologies and design data analytics processes to
get valuable, actionable insights varies. FICO allows us to put in
place a prescriptive analytic environment. Prescriptive analytics
automatically synthesizes big data, mathematics and business rules
to suggest decision options to take advantage of the predictions.”
According to Cavinato, the FICO software is integrated with a new
vision of enterprise data infrastructure. He says, “We aim to build
a more flexible and agile architecture. That also means displacing
pieces of legacy software and embracing distributed architecture,
such as Hadoop. But let’s be clear. That doesn’t necessarily mean
dealing with unstructured data.”
Cavinato also admits that Hadoop has created efficiency at the
processing and operational levels, so that the total time taken
by dependent tasks is automatically reduced. In addition,
re-engineering of the data infrastructure using Hadoop and
big data paradigms has also reduced the overall cost. The
previous software lacked all such advantages.
One of the major reasons UniCredit accepted the FICO software
is that it can be modified as per the requirements of other types
of businesses as well. Finally, Cavinato summarises, “FICO underpins
a software methodology that’s largely dependent on algorithms. It’s
a key building block to proceed to a complete overhaul of the entire
infrastructure, physical and logical, that supports our data business.
It helps redefine processes with greater agility and granularity,
bringing new opportunities and greater performance.”

questions

1. What was the challenge faced by UniCredit?
(Hint: UniCredit required the right information in order to
handle its risk management projects.)
2. How has UniCredit achieved its goal?
(Hint: By adopting FICO software that uses prescriptive
analytics to enhance customer relationships and credit
risk management.)

Campaign Success of MediaCom

This Case Study discusses how MediaCom has taken the assistance
of Sysomos for planning and measuring data related to advertising
campaigns for its clients. It is with respect to Chapter 8.
MediaCom is one of the world's leading media agencies, which
helps its clients plan and measure their advertising strategies
across all media channels. The company depends greatly on
Sysomos in planning and measuring the performance of its
clients' campaigns.
The main motto of the MediaCom agency was to improve its
business while gaining insightful data about audiences' responses
to its clients' brands and issues.

Alejandro De Luna, Social Strategy Manager at MediaCom, says
“The value Sysomos provides for us is very clear. We need to have a
bedrock of insights to justify how to approach content solutions for
different audiences and different platforms, and Sysomos helps us
to sell in our strategies by giving us a much clearer understanding
of how audiences feel about specific brands and issues.”
Sysomos has enabled MediaCom to analyse online conversations,
without any limitation on keywords or results, across a database of
over 550 billion social media posts. Now, MediaCom is able to use
social intelligence for planning and reporting. For example, it can
analyse the data of social media discussions about a campaign on
Twitter and discussion forums to know about consumer opinions.
Sysomos has provided MediaCom with a tool, Buzzgraph, that
helps in gaining knowledge about the key concepts of online
conversations, while the Tweet Life tool helps in analysing
how a tweet goes viral on the Internet. With the help of Sysomos,
MediaCom now easily convinces its clients that its plans are
based on solid facts and figures. Sysomos has helped MediaCom
carry out complex analysis of a wide range of topics, without
restrictions on the number of search terms or the results obtained,
in order to gain insightful knowledge and apply its campaign
strategies.


questions

1. What was the basic aim of the MediaCom media agency?
(Hint: The main motto of the MediaCom agency was to
improve its business while gaining insightful data about
audiences' responses to its clients' brands.)
2. How has Sysomos helped MediaCom?
(Hint: Sysomos has provided tools like Buzzgraph and
Tweet Life to analyse online conversations, without any
limitation on keywords or results, across a database of
many billions of social media posts.)


Dundas BI Solution Helped Medidata and its Clients in Getting Better Data Visualisation

This Case Study discusses how a custom data visualisation solutions
provider is helping its clients get better visualisation of the
data stored in their databases. It is with respect to Chapter 9 of the
book.
Medidata is a Portugal-based company which specialises in
providing ERP-based solutions to the Portuguese government.
The company believes in modernising the technology and
software solutions used by the Portuguese government to keep
up with the fast evolution of the market. Medidata is committed
to continuous development, providing a variety of software
products and services, which include back office applications,
support systems, etc., in order to fulfil the requirements of the
municipalities and residents of Portugal. In addition to fulfilling
the requirements of the Portuguese government, Medidata has its
own pool of customers who use its ERP solutions and services for
improving document management and enhancing workflow.
workflow.
Medidata started receiving demands from its clients to provide
software that could help them in analysing and interacting with the
data generated from the ERP software. Medidata felt it necessary to
include a Business Intelligence (BI) and analytics solution in its
collection of software solutions. The BI and analytics solution would
have the following advantages for Medidata's clients:
‰‰ It helped them in taking better and more informed decisions
‰‰ It improved the efficiency and productivity of clients
‰‰ It was capable of redefining processes when required
‰‰ It was scalable, which means it could increase or decrease re-
sources as and when required
In addition to fulfilling the needs of its clients, Medidata also
wanted a BI and analytics solution for detecting its own issues
related to data quality. Medidata decided to migrate to the Dundas
BI solution for data visualisation. The decision was obvious because
Dundas had been working as a partner since 2009, when it was
involved in developing business intelligence components.
The satisfaction with and belief in Dundas legacy products
helped Medidata migrate to the Dundas BI solution for visualising
data. Dundas has been involved in creating and providing
customised data visualisation software for both Fortune 500 and
start-up companies across the world.
Before formalising the partnership, several meetings were held
between Medidata and Dundas to discuss how Medidata would
encourage the clients to use the Dundas BI solution for data
visualisation. After understanding Medidata's strategies for
selling and marketing the Dundas BI solution, Dundas decided
to provide the customised BI solution with full support, as per
Medidata's needs, to use and test it for a certain period of time.
Dundas also helped Medidata in learning to use the BI solution by
providing multimedia training content and webinars. This helped
the rapid adoption of the BI solution across Medidata. The interface
of the BI solution for data visualisation is shown in the following
figure:

[Figure: Interface of the Dundas BI data visualisation solution]

Some important features of the BI solution are:
‰‰ Superb interactivity: The highly interactive environment of Dundas BI enables Medidata's clients to engage with and understand their data better.
‰‰ Data-driven alerts: Using alert notifications, built-in annotations and scheduled reports in Dundas BI, clients can collaborate around their data.
‰‰ Smart design tools: Dundas BI provides smart, built-in design tools with drag-and-drop functionality for quickly designing reports and dashboards.
‰‰ Extensibility: Dundas BI provides connectivity to previously unsupported data sources.
‰‰ Performance tuning: The BI solution can store the output of data cubes within Dundas BI's data warehouse for better performance.

Owing to these features, the key benefits of the BI solution for Medidata are as follows:
‰‰ Medidata can now validate database attributes that were incorrect in some situations but not in others.
‰‰ Medidata can now detect inconsistencies in its database.
‰‰ Medidata is also able to resolve various data integrity issues.
The BI solution resolved 60% of the validity concerns faced by Medidata. The data visualisation solution has not only benefitted Medidata; it has also proved useful for Medidata's clients:
‰‰ It helped clients by increasing their ability to take data-driven actions.
‰‰ It helped clients identify and understand their key performance indicators (KPIs).
‰‰ It provided clients with dashboards that include KPIs such as workflow performance and the ratio of outstanding workflow tasks, grouped by department (a toy sketch of such a KPI computation follows this list).
‰‰ By making information available quickly, it enabled clients' decision-makers to regulate resources in real time, track task execution time in different scenarios and, ultimately, improve the ratio of overdue tasks.
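As a toy illustration of the kind of computation behind such a KPI dashboard, the following sketch derives an outstanding-task ratio grouped by department using pandas. The column names and data are hypothetical; in the actual case, Dundas BI computed such KPIs through its visual interface rather than hand-written code.

# Toy sketch of a dashboard KPI: ratio of outstanding workflow
# tasks per department (hypothetical column names and data).
import pandas as pd

tasks = pd.DataFrame({
    "department": ["Urbanism", "Urbanism", "Finance", "Finance", "Finance"],
    "status":     ["done", "outstanding", "done", "outstanding", "outstanding"],
})

# Share of tasks still outstanding, computed per department
kpi = (tasks["status"].eq("outstanding")
       .groupby(tasks["department"])
       .mean()
       .rename("outstanding_ratio"))
print(kpi)  # Finance: ~0.67, Urbanism: 0.50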


“While using Dundas BI, I found I was able to accelerate the time-to-market of my BI projects. The usability, self-contained management and the easy way that, in a blink, I could see and analyse data from various sources were a great and awesome surprise!” – Luis Silva, Senior BI Consultant, Medidata.

questions

1. Why is data visualisation important for companies?


(Hint: Timely action, resource allocation, etc.)
2. What should be the features of a good data visualisation
solution?
(Hint: Highly interactive, data-driven alerts, etc.)


Case study 10

SPORTS ANALYTICS HELPED IN THE ENRICHMENT OF PERFORMANCE OF PLAYERS

This case study discusses how real-time analytics from IBM were used by Team USA to measure and improve athletes' performance. It relates to Chapter 10 of the book.
A US-based cycling organisation, dedicated to advancing elite US cycling teams in the Olympics and other international events, was looking for ways to gain an edge over its well-funded competitors in events such as the Women's Team Pursuit. In a team pursuit event, four cyclists ride together, with one in the lead and the other three behind. The difficulty arises when riders exchange places, which disrupts the formation and slows the group down; in this extremely competitive sport, a delay of a fraction of a second can cost the race.
USA Cycling depends entirely on private donations, unlike national teams that are fully supported by government bodies. Coaches at USA Cycling felt the need for analytics to analyse riders' performance while also managing the organisation's budget efficiently. The challenge facing USA Cycling was to quantify performance in Team Pursuit track cycling events in real time, events which are held indoors in velodromes. Monitoring and tracking a rider's performance was straightforward outdoors only when there were no variations in wind or track conditions.
“The single most important factor in winning a race is the power that the riders are able to exert on the pedals. The bikes we use have a power meter on the crank that measures the power generated in watts,” according to Andy Sparks, Director of Track Programs for USA Cycling. Collecting and applying analytics from the bicycles' sensors was a slow process that usually took an hour just to gather the data for each cyclist.
“At the end of a training session, the coach had to plug the head unit of each bike into his PC, download the data, manually slice it into half-second intervals, match those intervals to the events that took place during the session — for example, when each rider was pulling, versus when they were exchanging or pursuing — and then calculate a variety of key metrics,” according to Andy Sparks. This meant that performance data and cycling analytics were not ready until the next training day.
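To make the manual workflow Sparks describes concrete, here is a rough pandas sketch that slices power-meter samples into half-second intervals and summarises power by session event. The column names, timestamps and event labels are hypothetical; the coaches originally did this step by hand, not in code.

# Rough sketch of the manual post-session workflow: slice power-meter
# data into half-second intervals and average power per event phase.
# (Column names, timestamps and event labels are hypothetical.)
import pandas as pd

raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-07-01 10:00:00.0", "2016-07-01 10:00:00.3",
        "2016-07-01 10:00:00.6", "2016-07-01 10:00:01.1",
    ]),
    "power_watts": [310, 452, 448, 285],
    "event": ["pulling", "pulling", "exchanging", "pursuing"],
})

# Slice into half-second bins, then average power within each phase
sliced = (raw.set_index("timestamp")
             .resample("500ms")
             .agg({"power_watts": "mean", "event": "first"}))
print(sliced.groupby("event")["power_watts"].mean())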
USA Cycling's organisers were looking for a solution to overcome these challenges. They decided to use real-time analytics to analyse riders' performance and achieve their goal. They started working with IBM jStart to configure a flow of real-time data that would deliver instant analytics about riders' performance to coaches on their mobile dashboards. jStart is an IBM team with expertise in offering intelligent business solutions built on the latest emerging technologies.
Android smartphones were placed in riders' pockets, and the data generated by the phones was transferred to IBM's Watson Internet of Things platform for analysis. The analytics were then presented on summary dashboards displaying metrics such as W-prime depletion, which indicates how much of a rider's anaerobic muscle capacity has been used and how long the rider takes to regenerate it.
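The idea behind a W-prime balance metric can be sketched in a few lines of Python. The critical power and anaerobic capacity values below are invented, and the linear depletion/recovery rule is a simplification of published W'-balance models (such as Skiba's, which uses exponential recovery); it is not IBM's actual computation.

# Simplified sketch of W-prime (anaerobic capacity) depletion tracking.
# CP and W_PRIME are assumed example values, not real rider data.
CP = 300.0         # critical power in watts (assumed)
W_PRIME = 20000.0  # anaerobic work capacity in joules (assumed)
DT = 0.5           # sampling interval in seconds (half-second slices)

def w_prime_balance(power_samples):
    """Return the remaining W' after each half-second power sample."""
    balance = W_PRIME
    history = []
    for p in power_samples:
        if p > CP:
            # Riding above critical power depletes anaerobic capacity
            balance -= (p - CP) * DT
        else:
            # Riding below it allows (here, linear) recovery
            balance = min(W_PRIME, balance + (CP - p) * DT)
        history.append(balance)
    return history

# Example: a hard pull followed by sitting in the draft
print(w_prime_balance([450, 450, 450, 200, 200, 200]))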
Further, the jStart team incorporated IBM Analytics for Apache Spark to compute key metrics while the cyclists were riding at speed. Because the cycling analytics are produced in real time and shown on a mobile dashboard, both coaches and cyclists can access performance data, with immediate feedback, during an ongoing training session.
“The ability to get hold of the data immediately after the training session has finished has completely changed my relationship with the team,” according to Neal Henderson, a high-performance consultant with USA Cycling.
Because USA Cycling can now view data promptly, it is much easier to identify problems, make positive modifications and reinforce the winning behaviours that can be carried into the next session. The analytics solution lets riders review how efficiently they are performing, and it has also helped relieve the athletes' stress.
With instant analytics, coaches can give “riders a quick debrief after the first race, advise them on tactics for the next one, and then just let them relax and recover,” says Henderson.
USA Cycling benefited so much from the analytics solution that its team won a gold medal at the London World Championships.

questions

1. List the challenges faced by USA Cycling.


(Hint: Delay in assessing the performance of players.)
2. Discuss the benefits of sports analytics for USA Cycling.
(Hint: Better utilisation of resources, better assessment
of players, etc.)


Case study 11

FRAUD ANALYTICS SOLUTION HELPED IN SAVING THE WEALTH OF COMPANIES

This case study discusses how IBM's fraud analytics helped organisations detect fraud and avoid financial losses. It relates to Chapter 10 of the book.
In 2011, US industries were suffering huge financial losses from fraud of approximately $80 billion annually; US credit and debit card issuers alone suffered a whopping loss of $2.4 billion. Besides industries, financial fraud also affected individuals, and such cases could take years to resolve. Existing fraud detection systems were not effective because they operated on predefined sets of rules, such as flagging ATM withdrawals above a certain amount or credit card purchases made outside the cardholder's country. These traditional methods reduced the number of fraudulent cases, but did not catch them all. The research team at IBM decided to take fraud detection to the next level so that a much larger share of fraudulent financial transactions could be detected and prevented. The IBM team created a virtual data detective solution using machine learning and stream computing to prevent fraudulent transactions and save industries and individuals from financial losses.

In addition to flagging particular types of transaction, the solution analyses transactional data to create a model for detecting fraudulent patterns. This model is then used to process and analyse large volumes of financial transactions as they occur in real time, an approach termed 'stream computing'. Each transaction is allocated a fraud score, which specifies the likelihood of the transaction being fraudulent. The model is further customised to each client's data and periodically updated to cover new fraud patterns. The underlying analytics rely on statistical analysis and machine-learning methods, which can reveal unusual fraud patterns that human experts might miss.
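As a rough illustration of the scoring idea (not IBM's actual system), a classifier can be trained on historical transactions labelled as fraudulent or legitimate and then used to assign each incoming transaction a fraud probability. The sketch below uses scikit-learn; the feature names, data and threshold are hypothetical.

# Minimal sketch of transaction fraud scoring (illustrative only).
# Feature columns: [amount, hour_of_day, is_foreign, merchant_risk]
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.array([
    [25.0,   14, 0, 0.1],
    [900.0,   3, 1, 0.8],
    [60.0,   11, 0, 0.2],
    [1500.0,  2, 1, 0.9],
])
y_train = np.array([0, 1, 0, 1])  # 1 = known fraud, 0 = legitimate

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score an incoming transaction as it streams in
incoming = np.array([[1200.0, 4, 1, 0.7]])
fraud_score = model.predict_proba(incoming)[0][1]  # probability of fraud
if fraud_score > 0.5:  # threshold would be tuned per client in practice
    print(f"Flag for review (fraud score = {fraud_score:.2f})")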
Consider the example of a large US-based bank that used IBM's machine-learning technologies to analyse transactions on the credit cards it had issued.
[Figure: Results of the bank's credit card fraud analysis (not reproduced)]

Consider another case, of an online clothing retailer. If most past transactions at the retailer were fraudulent, there is a high probability that future purchase transactions will also be fraudulent. The system can gather and analyse these historical data points to assess the likelihood of future fraud attempts. Besides preventing fraud attempts, the system has also cut down on false alarms by analysing the relationship between suspected fraudulent transactions and actual fraud.
“The triple combination of prevention, catching more incidents of actual fraud, and reducing the number of false positives results in maximum savings with minimal hassle. In essence, we are able to apply complicated logic that is outside the realm of human analysis to huge quantities of streaming data,” notes Yaara Goldschmidt, manager, Machine Learning Technologies group.
These machine-learning technologies are presently used to detect and prevent fraud in financial transactions, including credit card, ATM and e-payment transactions. The system is embedded in the client's infrastructure, and a machine-learning model is developed from the client's existing data to combat fraudulent transactions before they take place.
“By identifying legal transactions that have a high probability of being followed by a fraudulent transaction, a bank can take pro-active measures—warn a card owner or require extra measures for approving a purchase,” explains Dan Gutfreund, project technical lead.

Machine learning and stream-computing technologies cannot predict the future, yet they enable financial institutions to take effective decisions and work towards preventing fraud before it occurs.

questions

1. Why do organisations need fraud analytics?


(Hint: To prevent fraudulent transactions.)
2. What are the benefits of machine learning and stream computing for organisations?
(Hint: Identifying patterns of fraudulent transactions, raising alarms, etc.)


Case study 12

BIG DATA ANALYTICS ALLOWING USERS TO VISUALISE THE FUTURE OF FREE ONLINE CLASSIFIEDS

This case study shows how data from over a hundred countries and dozens of languages is integrated, giving users powerful, data-driven insight to predict the future of free online classifieds. It relates to Chapter 1.

Background
OLX is a popular, fast-growing online classified advertising website. It is active in around 105 countries and supports over 40 languages. The website has more than 125 million unique visitors per month worldwide and generates approximately one billion page hits per month. OLX allows its users to design and personalise their advertisements and add them to their social networking profiles, so analysing its data calls for big data analytics.
Challenges
The main challenge for the OLX website was to find new ways to use business analytics to handle its customers' vast data. OLX's business users required numerous metrics to track customer data, and to achieve this they needed good control over their data warehouse. OLX took the help of Datalytics, a Pentaho partner, to find solutions for extracting, transforming and loading data from across the world and building an improved data warehouse. Once such a warehouse was created, OLX wanted its users to be able to visualise the stored data in real time without any technical errors or barriers. OLX knew this would be difficult for people without previous Business Intelligence (BI) knowledge, so a visualisation tool was essential. According to Franciso Achaval, Business Intelligence Manager at OLX, “While it may be easy for a BI analyst to understand what's happening in the numbers, to explain this to business users who are not versed in BI or OLAP (On-line Analytical Processing), you need visualisations.”

Solutions
OLX approached Pentaho, a business intelligence software company that provides open-source products and services such as data integration, OLAP services, reporting and information dashboards. Pentaho has a partnership with Datalytics, a consulting firm based in Argentina that provides data integration, business intelligence and data mining solutions to Pentaho's worldwide clients.


The following solutions were provided to OLX:
‰‰ Pentaho Data Integration and the Mondrian OLAP engine are used to handle huge amounts of data from across the world. The data is extracted from multiple sources, such as Google Analytics, transformed, and then loaded into a data warehouse (a minimal ETL sketch follows this list).
‰‰ Pentaho Business Analytics allows users to gain insight into the data and analyse key trends in real time.
‰‰ Pentaho's partner, Datalytics, assisted OLX with the design and deployment of the new analytics solution, exploring new ways to build out data analysis capabilities and integrate big data.
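To make the extract-transform-load (ETL) step concrete, the sketch below shows a minimal hand-written pipeline in Python. In the actual case, OLX used Pentaho Data Integration's graphical tools; the file names, column names and SQLite warehouse here are invented stand-ins for illustration.

# Hypothetical ETL sketch: pull page-view metrics exported per country,
# normalise them, and load them into a warehouse fact table.
import sqlite3
import pandas as pd

def extract(csv_path):
    """Extract: read one country's raw metrics export (e.g. from Google Analytics)."""
    return pd.read_csv(csv_path)

def transform(df, country):
    """Transform: standardise column names and tag rows with their country."""
    df = df.rename(columns={"pageviews": "page_views", "visitors": "unique_visitors"})
    df["country"] = country
    return df[["date", "country", "page_views", "unique_visitors"]]

def load(df, conn):
    """Load: append the cleaned rows to the warehouse fact table."""
    df.to_sql("fact_site_traffic", conn, if_exists="append", index=False)

conn = sqlite3.connect("warehouse.db")  # stand-in for a real data warehouse
for country, path in [("IN", "in_metrics.csv"), ("BR", "br_metrics.csv")]:
    load(transform(extract(path), country), conn)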

Results

OLX found that Datalytics' expertise and Pentaho's platform enabled it to deploy the new analytics solution in less than a month. OLX observed the following changes with the new solution:
‰‰ Pentaho Business Analytics enables OLX's users to create simple, creative reports about key business metrics.
‰‰ Instead of buying an expensive enterprise solution or investing time in building a new data warehouse internally, OLX saved time by focusing on data integration with analytics capabilities.
‰‰ Pentaho Business Analytics delivers end-user satisfaction.
‰‰ Pentaho Business Analytics gives OLX a scalable solution, as it can integrate any type of data from any data source as the business grows. In addition, Datalytics' assistance gives OLX the opportunity to experiment with big data.

questions

1. What were the challenges faced by OLX?


(Hint: The main challenge for the OLX website was to find new ways to use business analytics to handle its customers' vast data.)
2. What was the result of implementing Pentaho Business
Analytics?
(Hint: Pentaho Business Analytics enables OLX's users to create simple, creative reports about key business metrics.)
