
School of Computing and Information Systems

PROGRAMME: BSC HONS BUSINESS INTELLIGENCE AND DATA ANALYTICS

BI 204 Advanced Data Analytics Year 2 Semester 2

ASSIGNMENT

Handout Date: 8 March 2024 Hand in date: 14 May 2024

Total Marks: 100

Instructions to candidates

1. Candidates must attempt ALL questions.


2. Submit your work through Turnitin. You may consult your tutor/lecturer on how this is to
be done.
3. This assignment is 100% practical and will therefore be assessed practically through a
presentation. A candidate who fails to attend the presentation will get a fail grade in this
assessment.
4. The purpose of the presentation is for each candidate to validate that the work they are
submitting belongs to them and therefore it will be necessary for each candidate to be able
to defend their work.

This question paper consists of Five (5) printed pages excluding the cover page
Part A [42 marks]

Exploratory data analysis and data pre-processing

The dataset for this question (churn_real.xlsx) is provided in Microsoft Teams files under the
ADA module space. This dataset contains variables that have a bearing on predicting whether a
customer is likely to churn from a telecommunication service. In each case, write Python code
snippets to achieve the following:

a) Import the requisite libraries [4 marks]

b) Load the dataset into a Python data frame and check its contents. [3 marks]

c) Convert the “Total Charges” column to numeric [2 marks]


d) Create a visually enhanced summary of descriptive statistics for a pandas DataFrame,
excluding the count row, with specific styling such as centered text, a background
gradient, and custom header colors. [3 marks]
e) Analyze customer churn through visual comparisons of numeric features, excluding
numeric columns that are not relevant, such as identifiers and location data (e.g., "Churn
Value", "Latitude", "Longitude", "Churn Score", "Count", "Zip Code"). The code should
generate multiple bar plots in a subplot arrangement, each representing a different numeric
feature's average impact on churn, and should enhance readability through
customized bar labels. [6 marks]
f) From the visualization in e) above, summarize the insights you derive from the effects of
Tenure Months, Monthly Charges and Customer Life Time Value (CLTV) on customer
churn. [6 marks]
g) Using a pie chart, visualize the proportion of churned customers [2 marks]

h) Create a set of count plot visualizations for each categorical variable in the dataset
against churn labels, ensuring each plot includes a legend and customized titles, with
the entire collection of plots organized into a grid layout and each bar labeled with its
count. [6 marks]
i) Display the frequency and percentage of each churn reason in the dataset, ensuring that
the reasons are neatly organized in a DataFrame containing both counts and percentages,
with the percentages formatted as percentage strings. [4 marks]
j) Reflect on the insights provided by the top 3 churn influencers from the results in i) above.
What interventions could be put in place to address these findings? [6 marks]
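
As an illustration of the kind of snippet Part A asks for, the following minimal sketch covers items c) and i) on a small invented data frame. The column name "Churn Reason" and every sample value are assumptions made for the example; they are not taken from churn_real.xlsx, and other approaches are equally acceptable.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from churn_real.xlsx.
df = pd.DataFrame({
    "Total Charges": ["108.15", " ", "820.5", "3046.05"],
    "Churn Reason": [
        "Competitor made better offer",
        "Moved",
        "Competitor made better offer",
        "Price too high",
    ],
})

# c) Convert to numeric; entries that cannot be parsed (e.g. blanks) become NaN.
df["Total Charges"] = pd.to_numeric(df["Total Charges"], errors="coerce")

# i) Count each churn reason and express its share as a formatted percentage.
counts = df["Churn Reason"].value_counts()
reason_table = pd.DataFrame({
    "Count": counts,
    "Percentage": (counts / counts.sum() * 100).map("{:.2f}%".format),
})
print(reason_table)
```

Using errors="coerce" turns unparseable entries into missing values, which is why the missing-value check and imputation in Part B follow naturally from item c).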
Part B [21 marks]
Data preprocessing
a) Display the list of the variables alongside their total missing values. [2 marks]
b) Using the most frequently occurring value (mode), perform imputation to deal with the
missing values in the “Total Charges” column. [3 marks]

c) Analyze the skewness of the numeric columns in the dataset, excluding specific columns
such as Latitude, Longitude, Churn Value, and Churn Score. Visualize their relationship
with 'Churn Score' using regression plots, compare their distributions by 'Churn Label'
with KDE plots, and create box plots to assess the spread and identify
outliers in these numeric columns. [8 marks]
d) Giving examples from the plots, identify any outliers in any of the variables.
[2 marks]
e) Reduce the skewness of any variable whose skewness value is greater than 0.8. [6 marks]
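
To make items a), b) and e) concrete, here is a minimal sketch on an invented right-skewed column. The sample values are assumptions, and the log(1 + x) transform is only one possible way to reduce positive skew (square-root or Box-Cox transforms are alternatives); the paper does not prescribe a particular method.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample; real values come from churn_real.xlsx.
df = pd.DataFrame({
    "Total Charges": [20.0, 20.0, 45.0, 70.0, np.nan,
                      150.0, 250.0, 400.0, 700.0, 1500.0],
})

# a) Total missing values per variable.
print(df.isnull().sum())

# b) Impute missing entries with the most frequently occurring value (mode).
mode_value = df["Total Charges"].mode()[0]
df["Total Charges"] = df["Total Charges"].fillna(mode_value)

# e) Measure skewness and apply log(1 + x) where it exceeds the 0.8 threshold.
skew_before = df["Total Charges"].skew()
skew_after = skew_before
if skew_before > 0.8:
    df["Total Charges (log)"] = np.log1p(df["Total Charges"])
    skew_after = df["Total Charges (log)"].skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Storing the transformed values in a new column keeps the original values available for the plots in item c).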
Part C [12 marks]
Hypothesis Testing
a) Test the following hypothesis:
i) Null Hypothesis (H0): There is no significant relationship between Phone Service
and Churn. Alternative Hypothesis (H1): There is a significant relationship
between Phone Service and Churn.
ii) Null Hypothesis (H0): The type of contract does not affect the likelihood of
churn. Alternative Hypothesis (H1): The type of contract significantly influences
the likelihood of churn.
iii) Null Hypothesis (H0): Senior Citizen status does not affect the likelihood of churn.
Alternative Hypothesis (H1): Senior Citizen status significantly influences the
likelihood of churn.
[6 marks]
b) From the outcomes of the hypothesis tests, what management conclusions can be
reached? [6 marks]
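
Each hypothesis in Part C relates two categorical variables. One common choice for such a test is the chi-square test of independence, sketched below for the contract hypothesis. The choice of test, the 0.05 significance level, the column names and the invented sample counts are all assumptions for illustration; the paper does not specify them.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical sample; the real columns come from churn_real.xlsx.
df = pd.DataFrame({
    "Contract": ["Month-to-month"] * 40 + ["Two year"] * 40,
    "Churn Label": ["Yes"] * 30 + ["No"] * 10 + ["Yes"] * 5 + ["No"] * 35,
})

# Contingency table of observed counts: contract type by churn label.
table = pd.crosstab(df["Contract"], df["Churn Label"])

# Chi-square test of independence on the observed counts.
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}: {decision}")
```

The same pattern applies to Phone Service and Senior Citizen by swapping the first column passed to pd.crosstab.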
Part D [25 marks]
Predictive model development
a) Perform data normalization on the ‘Tenure Months’, ‘Monthly Charges’ and the ‘Total
Charges’ columns. [3 marks]
b) Split the dataset into training set and test set using an appropriate proportion.
[5 marks]
c) Compile and train a Logistic Regression model with appropriate hyperparameters.
[6 marks]
d) Print a classification report from the model. [4 marks]
e) Print the confusion matrix from the model. [3 marks]
f) Interpret the confusion matrix [4 marks]
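
The steps in Part D can be sketched end to end as follows. The data here is synthetic, and the 80/20 stratified split, min-max scaling and the hyperparameters (C=1.0, max_iter=1000) are assumptions chosen for illustration rather than required settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the three numeric columns and the churn target.
rng = np.random.default_rng(42)
n = 400
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n)
X = pd.DataFrame({
    "Tenure Months": tenure,
    "Monthly Charges": monthly,
    "Total Charges": tenure * monthly,
})
# Assumed toy rule: short tenure and high monthly charges raise churn.
y = ((monthly / 120 - tenure / 72 + rng.normal(0, 0.2, size=n)) > 0).astype(int)

# b) Stratified 80/20 split keeps the churn proportion similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# a) Min-max normalization, fitted on the training set only to avoid leakage.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# c) Logistic Regression with explicitly stated hyperparameters.
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

# d) and e) Classification report and confusion matrix on the test set.
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

In scikit-learn's confusion matrix, rows are actual classes and columns are predicted classes, so with labels 0 and 1 the entries read cm[0, 0] true negatives, cm[0, 1] false positives, cm[1, 0] false negatives and cm[1, 1] true positives.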

Marking key

PART A EDA [42 marks] Possible mark Scored mark Comment

a) Libraries 4

b) Dataset loading 3

c) Total charges to numeric 2

d) Descriptive statistics 3

e) Numeric features visuals 6

f) Insights summaries 6

g) Pie chart 2

h) Count plot visualization 6

i) Churn reason analysis 4

j) Churn influencers analysis 6

PART B Data Preprocessing [21 marks]

a) Display missing values 2

b) Imputation for missing values 3

c) Skewness 8

d) Outlier detection 2

e) Skewness improvement 6

PART C Hypothesis Testing [12 marks]

a) Hypothesis testing 6

b) Management conclusions 6

PART D Predictive model development [25 marks]

a) Data Normalization 3

b) Train test split 5

c) LR model training 6

d) Classification report 4

e) Confusion matrix 3

f) Confusion matrix interpretation 4

END OF ASSIGNMENT PAPER
