
FACULTY OF MANAGEMENT STUDIES

UNIVERSITY OF DELHI
MBA II Year, 1st Semester Examination, November 2021
MBA 7708: Predictive Analytics and Big Data
Time: 3 hours Maximum marks: 70
Answer any FOUR questions. All questions carry equal marks.

1.(a) "Data science is ... much more than data analysis, e.g., using techniques from machine learning
and statistics; extracting this value takes a lot of work, before and after data analysis."

What are the different stages of the Data Science Life cycle? Discuss the above statement by
describing the activities done in each stage and their utility.

(b) The following figure is a section of output obtained during multiple regression analysis.
Describe the results presented there, especially those highlighted (boldfaced and underlined
text).

```{r}
# Multiple Linear Regression
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data = mydata)
summary(lm.fit)
```

Call:
lm(formula = Sales ~ TV + Radio + Newspaper, data = mydata)

Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
Radio 0.188530 0.008611 21.893 <2e-16 ***
Newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.686 on 196 degrees of freedom


Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
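
For reference, a minimal sketch (not part of the exam output, assuming the same lm.fit object from the chunk above) of how the quantities typically highlighted in this figure can be extracted programmatically:

```{r}
# Hedged sketch: pulling the usually highlighted quantities from the fitted object.
coef(summary(lm.fit))           # coefficient table: estimates, SEs, t values, p-values
summary(lm.fit)$r.squared       # multiple R-squared
summary(lm.fit)$adj.r.squared   # adjusted R-squared
summary(lm.fit)$fstatistic      # F-statistic with its degrees of freedom
confint(lm.fit, level = 0.95)   # 95% confidence intervals for the coefficients
```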

2.(a) One of the most common applications of text mining is in marketing, especially survey research.
Although the questions are often multiple-choice-based (and therefore easy to process), the most
useful ones are open-ended, to explore the topic in depth. How does text mining help in such
cases?
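
For reference, a minimal illustrative sketch (not part of the question) of how open-ended survey responses are commonly processed in R with the tidytext and dplyr packages; the data frame and its columns (respondent, response) are purely hypothetical:

```{r}
# Hypothetical data: open-ended answers tokenised into words,
# stop words removed, and frequent terms counted.
library(dplyr)
library(tidytext)

responses <- data.frame(
  respondent = 1:3,
  response   = c("delivery was slow but support was helpful",
                 "great product, poor delivery experience",
                 "support team resolved my issue quickly"),
  stringsAsFactors = FALSE
)

responses %>%
  unnest_tokens(word, response) %>%        # one row per word
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)                 # recurring themes across answers
```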

(b) The following figure is a section of output obtained from application of a predictive analytics
tool. Describe the results presented there, especially those highlighted (boldfaced and underlined
text).

```{r}
# Logistic regression model with all covariates
model1 <- glm(diabetes ~ ., data = train.data, family = binomial)
summary(model1)
```

Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.7037 -0.6530 -0.3794 0.6352 2.5264

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.491e+00 1.570e+00 -6.045 1.49e-09 ***
pregnant 1.546e-01 7.271e-02 2.127 0.03344 *
glucose 3.783e-02 7.150e-03 5.291 1.22e-07 ***
pressure -3.663e-04 1.639e-02 -0.022 0.98217
triceps 2.170e-02 2.171e-02 1.000 0.31739
insulin -2.174e-05 1.773e-03 -0.012 0.99021
mass 4.788e-02 3.602e-02 1.329 0.18378
pedigree 1.817e+00 5.953e-01 3.053 0.00227 **
age 1.221e-02 2.369e-02 0.515 0.60623
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)

Null deviance: 308.67 on 238 degrees of freedom


Residual deviance: 212.37 on 230 degrees of freedom
AIC: 230.37

Number of Fisher Scoring iterations: 5
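
Note that test.predicted.m1, used in the chunks below, is not defined in this excerpt. A hedged sketch of how it would typically be produced (predicted probabilities on the held-out test.data), together with the ROCR package relied on for the AUC calculation further below:

```{r}
# Hedged sketch: predicted probabilities for the test set (assumes test.data exists).
test.predicted.m1 <- predict(model1, newdata = test.data, type = "response")

library(ROCR)   # prediction() and performance(), used for the AUC below
```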

```{r}
# Confusion matrix (as proportions) on the test data at a 0.5 cutoff
list(
  model1 = table(test.data$diabetes, test.predicted.m1 > 0.5) %>%
    prop.table() %>% round(3)
)
```

$model1

FALSE TRUE
neg 0.601 0.092
pos 0.118 0.190

```{r}
# model 1 AUC
prediction(test.predicted.m1, test.data$diabetes) %>%
performance(measure = "auc") %>%
.@y.values
```

[[1]]
[1] 0.8250703
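
As an optional follow-up, a sketch of how the ROC curve behind this AUC value could be drawn from the same ROCR objects:

```{r}
# Hedged sketch: ROC curve for model 1 on the test data.
roc.perf <- prediction(test.predicted.m1, test.data$diabetes) %>%
  performance(measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a = 0, b = 1, lty = 2)   # reference line for a random classifier
```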

3.(a) Read the case “Dell Is Staying Agile and Effective with Analytics in the 21st Century” printed at
the end of this document. Answer the following questions on the case.

1. What was the challenge Dell was facing that led to their analytics journey?

2. What solution did Dell develop and implement? What were the results?

3. As an analytics company itself, Dell has used its service offerings for its own business. Do you
think it is easier or harder for a company to taste its own medicine? Explain.

(b) (i) In association rule mining, describe the following terms with examples (see the illustrative sketch after this question):
Support, Confidence, and Lift ratio

(ii) Provide an example of an association rule from the market basket domain that satisfies the
following condition. Also, describe whether such a rule is subjectively interesting.

"A rule that has reasonably high support but low confidence."

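Illustrative sketch for (b)(i), not part of the question: computing support, confidence, and lift with the arules package on a tiny, hypothetical set of transactions.

```{r}
# Hypothetical market baskets; each rule is reported with its
# support, confidence and lift.
library(arules)

baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "eggs"),
  c("bread", "milk", "eggs")
)
trans <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.5, minlen = 2))
inspect(rules)
```
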
4. Answer the following questions in approximately 150 words each.

(i) What is Big Data? Why is it important?

(ii) What is Big Data analytics?

(iii) What are the common business problems addressed by Big Data analytics?

(iv) What are the big challenges that one should be mindful of when considering
implementation of Big Data analytics?

5. A classification tree was applied to a dataset. A section of the output from the analysis is
presented below, divided into nine sections. Describe what is being done in each of the nine
sections and define the terms printed in boldfaced and underlined text.

What conclusion can you draw from this analysis?

Loading the data as a data frame


library(ISLR)
mydata <- Carseats

Check all variables


str(mydata)

## 'data.frame': 400 obs. of 12 variables:


## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ High : Factor w/ 2 levels "No","Yes": 2 2 2 1 1 2 1 2 1 1 ...

Section 1. Split data into training (70%) and validation (30%)


dt <- sort(sample(nrow(mydata), nrow(mydata) * 0.7))
train.data <- mydata[dt, ]
test.data  <- mydata[-dt, ]
# Check number of rows in training data set
nrow(train.data)

## [1] 280

Section 2. Fit the decision tree model


library(rpart)  # needed for rpart() and rpart.control()
mytree <- rpart(High ~ . - Sales, data = train.data, method = "class",
                control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10,
                                        usesurrogate = 2, xval = 10))
mytree

## n= 280
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 280 118 No (0.57857143 0.42142857)
## 2) ShelveLoc=Bad,Medium 221 67 No (0.69683258 0.30316742)
## 4) Price>=106.5 146 26 No (0.82191781 0.17808219)
## 8) Advertising< 13.5 123 16 No (0.86991870 0.13008130) *
## 9) Advertising>=13.5 23 10 No (0.56521739 0.43478261)
## 18) Age>=55 11 2 No (0.81818182 0.18181818) *
## 19) Age< 55 12 4 Yes (0.33333333 0.66666667) *
## 5) Price< 106.5 75 34 Yes (0.45333333 0.54666667)
## 10) Age>=68.5 15 1 No (0.93333333 0.06666667) *
## 11) Age< 68.5 60 20 Yes (0.33333333 0.66666667)
## 22) Income< 59.5 17 6 No (0.64705882 0.35294118) *
## 23) Income>=59.5 43 9 Yes (0.20930233 0.79069767) *
## 3) ShelveLoc=Good 59 8 Yes (0.13559322 0.86440678)
## 6) Price>=150 7 2 No (0.71428571 0.28571429) *
## 7) Price< 150 52 3 Yes (0.05769231 0.94230769) *
Section 3. Plot the tree


# Plot the tree and add node labels
plot(mytree)
text(mytree)

Section 4. Draw the confusion matrix on TRAIN data


library(caret)

## Loading required package: lattice

pred.default.train <- predict(mytree, train.data, type = "class")


confusionMatrix(pred.default.train,train.data$High)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 146 27
## Yes 16 91
##
## Accuracy : 0.8464
## 95% CI : (0.7988, 0.8866)
## No Information Rate : 0.5786
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.681
##
## Mcnemar's Test P-Value : 0.1273
##
## Sensitivity : 0.9012
## Specificity : 0.7712
## Pos Pred Value : 0.8439
## Neg Pred Value : 0.8505
## Prevalence : 0.5786
## Detection Rate : 0.5214
## Detection Prevalence : 0.6179
## Balanced Accuracy : 0.8362
##
## 'Positive' Class : No
##

Section 5. Draw the confusion matrix on TEST data


pred.default.test <- predict(mytree, test.data, type = "class")
confusionMatrix(pred.default.test,test.data$High)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 54 19
## Yes 20 27
##
## Accuracy : 0.675
## 95% CI : (0.5835, 0.7577)
## No Information Rate : 0.6167
## P-Value [Acc > NIR] : 0.1104
##
## Kappa : 0.3154
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7297
## Specificity : 0.5870
## Pos Pred Value : 0.7397
## Neg Pred Value : 0.5745
## Prevalence : 0.6167
## Detection Rate : 0.4500
## Detection Prevalence : 0.6083
## Balanced Accuracy : 0.6583
##
## 'Positive' Class : No
##

Section 6. Tabulate the performance of trees for different values of the Complexity Parameter (CP)
printcp(mytree)

##
## Classification tree:
## rpart(formula = High ~ . - Sales, data = train.data, method = "class",
## control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10,
## usesurrogate = 2, xval = 10))
##
## Variables actually used in tree construction:
## [1] Advertising Age Income Price ShelveLoc
##
## Root node error: 118/280 = 0.42143
##
## n= 280
##
## CP nsplit rel error xerror xstd
## 1 0.364407 0 1.00000 1.00000 0.070022
## 2 0.084746 1 0.63559 0.63559 0.062798
## 3 0.042373 3 0.46610 0.56780 0.060502
## 4 0.025424 4 0.42373 0.46610 0.056339
## 5 0.016949 5 0.39831 0.52542 0.058879
## 6 0.010000 7 0.36441 0.60169 0.061694

# Pick the CP value with the lowest cross-validated error (xerror)
bestcp <- mytree$cptable[which.min(mytree$cptable[, "xerror"]), "CP"]

Section 7. Prune the tree using the best CP


pruned <- prune(mytree, cp = bestcp)

Section 8. Plot the pruned tree


library(rpart.plot)  # provides prp() for plotting rpart trees
prp(pruned, faclen = 0, cex = 0.8, extra = 1)

Section 9. Draw the confusion matrix and calculate the accuracy

conf.matrix <- table(train.data$High, predict(pruned,type="class"))
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ":")
colnames(conf.matrix) <- paste("Pred", colnames(conf.matrix), sep = ":")
print(conf.matrix)

##
## Pred:No Pred:Yes
## Actual:No 145 17
## Actual:Yes 33 85

pred.pruned.test <- predict(pruned, test.data, type = "class")


confusionMatrix(pred.pruned.test,test.data$High)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 53 24
## Yes 21 22
##
## Accuracy : 0.625
## 95% CI : (0.532, 0.7117)
## No Information Rate : 0.6167
## P-Value [Acc > NIR] : 0.4655
##
## Kappa : 0.1969
##
## Mcnemar's Test P-Value : 0.7656
##
## Sensitivity : 0.7162
## Specificity : 0.4783
## Pos Pred Value : 0.6883
## Neg Pred Value : 0.5116
## Prevalence : 0.6167
## Detection Rate : 0.4417
## Detection Prevalence : 0.6417
## Balanced Accuracy : 0.5972
##
## 'Positive' Class : No
##
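
For Section 9, the accuracy on the training data can also be computed directly from the printed conf.matrix; a minimal sketch, using the same object created above:

```{r}
# Hedged sketch: training-set accuracy of the pruned tree from conf.matrix.
sum(diag(conf.matrix)) / sum(conf.matrix)   # (145 + 85) / 280 = 0.821
```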

CASE
Dell Is Staying Agile and Effective with Analytics in the 21st
Century
The digital revolution is changing how people shop. Studies show that even commercial
customers spend more of their buyer journey researching solutions online before they engage a
vendor. To compete, companies like Dell are transforming sales and marketing models to support
these new requirements. However, doing so effectively requires a Big Data solution that can
analyze corporate databases along with unstructured information from sources such as
clickstreams and social media.

Dell has evolved into a technology leader by using efficient, data-driven processes. For decades,
employees could get measurable results by using enterprise applications to support insight and
facilitate processes such as customer relationship management (CRM), sales, and accounting.
When Dell recognized that customers were spending dramatically more time researching
products online before contacting a sales representative, it wanted to update marketing models
accordingly so that it could deliver the new types of personalized services and the support that
customers expected. To make such changes, however, marketing employees needed more data
about customers’ online behavior. Staff also needed an easier way to condense insight from
numerous business intelligence (BI) tools and data sources. Drew Miller, Executive Director,
Marketing Analytics and Insights at Dell, says, “There are petabytes of available information
about customers’ online and offline shopping habits. We just needed to give marketing employees
an easy-to-use solution that could assimilate all of it, pinpoint patterns and make
recommendations about marketing spend and activities.”

Setting Up an Agile Team to Boost Return on Investment (ROI) with BI and Analytics

To improve its global BI and analytics strategy and communications, Dell established an IT task
force. Executives created a flexible governance model for the team so that it can rapidly respond
to employees’ evolving BI and analytics requirements and deliver rapid ROI. For example, in
addition to having the freedom to collaborate with internal business groups, the task force is
empowered to modify business and IT processes using agile and innovative strategies. The team
must dedicate more than 50% of its effort to identifying and implementing quick-win BI and
analytics projects that are typically too small for the “A” priority list of Dell’s IT department. And
the team must also spend at least 30% of its time evangelizing within internal business groups to
raise awareness about BI’s transformative capabilities—as well as opportunities for collaboration.

One of the task force’s first projects was a new BI and analytics solution called the Marketing
Analytics Workbench. Its initial application was focused on a select set of use cases around online
and offline commercial customer engagements. This effort was co-funded by Dell’s IT and
marketing organizations. “There was a desire to expand the usage of this solution to support many
more sales and marketing activities as soon as possible. However, we knew we could build a more
effective solution if we scaled it out via iterative quick sprint efforts,” says Fadi Taffal, Director,
Enterprise IT at Dell.

One Massive Data Mart Facilitates a Single Source of Truth

Working closely with marketing, task force engineers use lean software development strategies
and numerous technologies to create a highly scalable data mart. The overall solution utilizes
multiple technologies and tools to enable different types of data storage, manipulation, and
automation activities. For example, engineers store unstructured data from digital/social media
sources on servers running Apache Hadoop. They then use the Teradata Aster platform to
integrate and explore large amounts of customer data from other sources in near real time. For
various data transformation and automation needs, the solution includes the use of Dell’s Toad
software suite, specifically Toad Data Point and Toad Intelligence Central, and Dell Statistica.
Toad Data Point provides a business-friendly interface for data manipulation and automation,
which is a critical gap in the ecosystem. For advanced analytical models, the system uses Dell
Statistica, which provides data preparation, predictive analytics, data mining and machine
learning, statistics, text analytics, visualization and reporting, and model deployment and
monitoring. Engineers also utilize this solution to develop analytical models that can sift through
all of the disparate data and provide an accurate picture of customers’ shopping behavior. Tools
provide suggestions for improving service, as well as ROI metrics for multivehicle strategies that
include Web site marketing, phone calls, and site visits.

Within several months, employees were using the initial Marketing Analytics Workbench. The
task force plans to expand the solution’s capabilities so it can analyze data from more sources,
provide additional visualizations, and measure the returns of other channel activities such as
tweets, texts, e-mail messages, and social media posts.

Saves More Than $2.5 Million in Operational Costs

With its new solution, Dell has already eliminated several third-party BI applications.
“Although we’re just in the initial phases of rolling out our Marketing Analytics Workbench, we’ve
saved approximately $2.5 million in vendor outsourcing costs,” says Chaitanya Laxminarayana,
Marketing Program Manager at Dell. “Plus, employees gain faster and more detailed insights.” As
Dell scales the Marketing Analytics Workbench, it will phase out additional third-party BI
applications, further reducing costs and boosting efficiency.

Facilitates $5.3 Million in Revenue

Marketing employees now have the insight they need to identify emerging trends in customer
engagements—and update models accordingly. “We’ve already realized $5.3 million in
incremental revenue by initiating more personalized marketing programs and uncovering new
opportunities with our big data Marketing Analytics Workbench,” says Laxman Srigiri, Director,
Marketing Analytics at Dell. “Additionally, we have programs on track to scale this impact many
times over in the next three years.” For example, employees can now see a timeline of a customer’s
online and offline interactions with Dell, including purchases, the specific Dell Web site pages the
customer visited, and the files they downloaded. Plus, employees receive database suggestions for
when and how to contact a customer, as well as the URLs of specific pages they should read to
learn more about the technologies a customer is researching. Srigiri says, “It was imperative that
we understand changing requirements so we could stay agile. Now that we have that insight, we
can quickly develop more effective marketing models that deliver the personalized information
and support customers expect.”

