DATA ANALYTICS WITH R
DATA MANAGEMENT PROJECT BY:

NISHANT CHATURVEDI
REBECCA NAMUBIRU
AGENDA
INTRODUCTION

PROBLEM STATEMENT

DATA WRANGLING

RANDOM FOREST

FEATURE ANALYSIS

SUMMARY

NISHANT CHATURVEDI | REBECCA NAMUBIRU 2


INTRODUCTION
WHEN TO USE R

• Building statistical models
• Free and open source
• Data is not in proper format
• Qualitative analysis
• Cleansing and wrangling data
• Scientific research

KDNUGGETS SURVEY 2019

Popularity by tool:
• Python: 66%
• RapidMiner: 51%
• R: 47%
• Excel: 35%
• Anaconda: 34%
• SQL: 33%

Reference: KDnuggets 2019 poll (listed under REFERENCES)



Number of scholarly articles by tool:
• SPSS: 80000
• R: 55000
• SAS: 38000
• Stata: 33000
• GraphPad Prism: 31000
• Matlab: 30000

Reference:

FACT SHEET

ATTRIBUTES

• Customer_id :- integer values to identify customers
• Subscription_flag :- takes the values 0, 1, -1 to signify the status of the customer: 0 for unsubscribed, 1 for subscribed and -1 for unknown
• days_of_membership :- the tenure of the customer's subscription
• no_of_movie :- number of movies watched by the customer in the previous month
• No_of_serie :- number of TV series watched by the customer in the previous month
• No_of_documentary :- number of documentaries watched by the customer in the previous month
• activity_time :- activity time on the application in the last month
• avg_rating :- sum of ratings given / number of shows rated
• no_of_accounts :- number of active accounts in the subscription
• device_type :- the devices being used to access the application
• age :- age of the customer
• gender :- 0 for Female and 1 for Male
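The coded attributes above lend themselves to a small decoding step before any modelling. The deck's own R code is not reproduced in the slides, so what follows is only a minimal sketch in Python using the standard library. The sample rows are invented for illustration, and setting aside customers with an unknown status is an assumption rather than a step the deck states.

```python
# Hypothetical sample rows: the column names follow the fact sheet,
# the values are invented purely for illustration.
rows = [
    {"Customer_id": 1, "Subscription_flag": 1, "gender": 0, "age": 34},
    {"Customer_id": 2, "Subscription_flag": 0, "gender": 1, "age": 27},
    {"Customer_id": 3, "Subscription_flag": -1, "gender": 1, "age": 45},
]

# Code-to-label mappings taken from the fact sheet.
STATUS = {0: "unsubscribed", 1: "subscribed", -1: "unknown"}
GENDER = {0: "Female", 1: "Male"}


def decode(row):
    """Return a copy of the row with the coded fields replaced by labels."""
    out = dict(row)
    out["Subscription_flag"] = STATUS[row["Subscription_flag"]]
    out["gender"] = GENDER[row["gender"]]
    return out


decoded = [decode(r) for r in rows]

# Assumption: rows with an unknown status carry no label to learn from,
# so this sketch sets them aside before modelling.
labelled = [r for r in decoded if r["Subscription_flag"] != "unknown"]

for r in labelled:
    print(r["Customer_id"], r["Subscription_flag"], r["gender"])
```

Keeping the mappings in two small dictionaries makes the coding scheme explicit and causes an unexpected code to fail loudly with a KeyError instead of passing through silently.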



QUESTIONS

1. Which customers are most likely to churn out of their subscription?

2. What key factors play the most important role in determining the subscription of a customer?



PROBLEM STATEMENT

• WRANGLING DATA
• PREDICTING CUSTOMERS WHO ARE GOING TO CHURN OUT
• FEATURE ANALYSIS



WRANGLING DATA
• Converting data from one format to another.
• Gathering Data, Selecting Data, Transforming Data

PREDICTIVE ANALYSIS
• Identify future likelihood of an event
• Applying Random Forest to perform Predictive Analysis

FEATURE IMPORTANCE
• To find the relative importance of different parameters
• By giving scores to different parameters



IMPLEMENTATION

DATA WRANGLING
PREDICTIVE ANALYSIS
FEATURE IMPORTANCE



DECISION TREE | RANDOM FOREST

Inspired by:
https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/



Gini index
• Range: 0 - 1

Out of bag
• Error measured on the observations that were not included in the sample used to build a tree

Confusion matrix
Often computed rates:
• Accuracy = (TP+TN)/n
• Error Rate = (FP+FN)/n or 1-Accuracy
• True Positive Rate = TP/actual yes
• True Negative Rate = TN/actual no
• Precision = TP/predicted yes
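The confusion-matrix rates listed above can be computed directly from the four cell counts. The following is a minimal sketch in Python (standard library only, not the deck's R code). The function name and the example counts are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def confusion_rates(tp, fp, fn, tn):
    """Compute the commonly reported rates from confusion-matrix counts.

    Assumes n, "actual yes", "actual no" and "predicted yes" are all
    non-zero; no guard against division by zero is included.
    """
    n = tp + fp + fn + tn
    actual_yes = tp + fn       # customers that really belong to the positive class
    actual_no = tn + fp        # customers that really belong to the negative class
    predicted_yes = tp + fp    # customers the model flagged as positive
    accuracy = (tp + tn) / n
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,            # same as (fp + fn) / n
        "true_positive_rate": tp / actual_yes,
        "true_negative_rate": tn / actual_no,
        "precision": tp / predicted_yes,
    }


# Hypothetical counts, chosen only to exercise the formulas.
rates = confusion_rates(tp=100, fp=10, fn=5, tn=50)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Returning the rates as a dictionary keeps each value paired with its name, which is convenient when comparing several models side by side.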



FEATURE ANALYSIS
USING R

• Remove Redundant Features
• Rank Features by Importance
• Select Features
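The three steps above can be sketched in a few lines. This is an illustrative Python version (standard library only), not the deck's R implementation. It assumes that "redundant" means a feature is strongly correlated with one already kept, and it takes the importance scores as given. The sample columns, the scores, the 0.9 threshold and the function names are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def remove_redundant(columns, max_corr=0.9):
    """Step 1: keep a feature only if it is not strongly correlated
    (in absolute value) with any feature kept before it."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: names follow the fact sheet, values are invented.
columns = {
    "activity_time": [10, 20, 30, 40, 50],
    "no_of_movie": [1, 2, 3, 4, 5],  # moves in lockstep with activity_time
    "age": [30, 22, 41, 35, 28],
    "avg_rating": [3.0, 4.5, 2.0, 4.0, 3.5],
}
# Hypothetical importance scores, e.g. as a fitted model might report them.
importance = {"activity_time": 0.40, "no_of_movie": 0.35,
              "age": 0.10, "avg_rating": 0.15}

kept = remove_redundant(columns)                                   # step 1
ranked = sorted(kept, key=lambda f: importance[f], reverse=True)   # step 2
selected = ranked[:2]                                              # step 3
print(selected)
```

In this sketch the first feature of a correlated pair is the one kept, so the column order matters. A different tie-breaking rule, such as keeping the more important feature of the pair, would be an equally reasonable choice.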



SUMMARY

• Reliable
• Large user base
• Many user-contributed packages
• Time saving



REFERENCES

• https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html (Survey by KD Nuggets)

• https://www.linkedin.com/learning/learning-r-2/r-in-context?u=2154233 (To learn R)

• https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/



THANK YOU

Nishant Chaturvedi
Rebecca Namubiru
