
ASSIGNMENT-2

SUBJECT: Predictive Modelling (Code: MB207)

SUBMITTED BY: Dikshita Jain (2K19/BMBA/08)


Prediction is a very powerful tool when it comes to big data. Predictions from big data depend on the dataset: its volume, the values it contains and the variety in the data. To make a prediction we use different big data tools and algorithms.

Predictive modeling can be described as the process of building statistical models to predict future outcomes from our data.

Predictive modeling uses machine learning algorithms for prediction. Machine learning depends on the dataset that you provide. Machine learning algorithms help us build a predictive model that forecasts future outcomes based on past numbers and data.

There are three basic steps for building a predictive model. This is just a simple guide to understand how a predictive model is built:

Data: Data is the information needed for working on a given problem. Whenever we select a problem to build a predictive model, we need information on which the prediction can be based. Data can be in the form of text, comma-separated values (CSV), a database or a raw file.

Model: The model is responsible for giving us the needed results. It uses one of the machine learning algorithms and is used for learning. All this learning is applied when we wish to make a prediction. Once the model is built, we can use it for various predictions. We can also reuse the model on different datasets with a different set of predictors.

We use an algorithm and provide a training dataset for learning purposes, so that the model can then be used for prediction.

Prediction: In this phase, we use the trained model on a different dataset for similar or different predictions. Once the model is built, we can use it for different predictions based on the input. The input may be the same data or new data; as long as it contains the same predictors, the model will still work (a rough sketch of the whole flow follows).
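As a rough illustration, here is a minimal Python sketch of this data, model and prediction flow, assuming pandas and scikit-learn are installed; the file names (sales.csv, new_sales.csv) and the column name "target" are hypothetical stand-ins:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Data: the information the prediction will be based on
data = pd.read_csv("sales.csv")                 # hypothetical file
X, y = data.drop(columns=["target"]), data["target"]

# Model: learn from the historical data using one ML algorithm
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Prediction: reuse the trained model on new input with the same predictors
new_data = pd.read_csv("new_sales.csv")         # hypothetical file
predictions = model.predict(new_data)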

PREDICTIVE MODELLING PROCESS

Step 1: Understand Business Objective

Step 2: Define Modeling Goals

Step 3: Select/Get Data


Step 4: Prepare Data

Step 5: Analyze and Transform Variables; Random Sampling

Step 6: Model Selection and Develop Models (Training)

Step 7: Validate Models (Testing), Optimize and Assess Profitability

Step 1: Business Objective(s)- Define objectives clearly and ask questions!

• Target Marketing
• Risk & Fraud Management
• Strategy Implementation and Change Management
• Operational Efficiency
• Improve Customer Experience
• Manage Marketing Campaigns
• Forecast Revenue or Loss
• Workforce Management
• Financial Modeling
• Churn Management
• Social Media Influencers

Step 2: Define Goals - Translate the Business Objective into an Analytics Goal

Based on the business questions we want to answer, translate the business objective into analytic terms:

• Profile Analysis
• Segmentation
• Response Modeling
• Risk Modeling
• Activation
• Cross-Sell and Up-Sell
• Attrition/Churn Modeling
• Net Present Value (NPV)
• Customer Lifetime Value (CLTV)

Step 3: Selecting Data for Modeling

Selecting the best data for target modeling requires a thorough understanding of the market, the business and the objective. The model is only as good and relevant as the underlying data.

Step 4: Prepare Data

In this step we prepare the data in the right format for the analysis and for the tool you intend to use.

• Do initial clean-up
• Define variables and create a data dictionary
• Join/append multiple datasets
• Validate for correctness
• Produce basic summary reports (a short sketch follows this list)
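A short pandas sketch of these preparation tasks; the file names and the customer_id key are hypothetical:

import pandas as pd

customers = pd.read_csv("customers.csv")           # hypothetical file
transactions = pd.read_csv("transactions.csv")     # hypothetical file

# Initial clean-up: drop exact duplicates, standardise column names
customers = customers.drop_duplicates()
customers.columns = customers.columns.str.lower().str.strip()

# Join multiple datasets on a common key
df = customers.merge(transactions, on="customer_id", how="left")

# Validate for correctness and produce basic summary reports
print(df.isna().sum())        # missing values per column
print(df.describe())          # basic numeric summaries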

Step 5: Analyze and Transform Variables

Once the data is in the right shape, perform univariate analysis to check the distribution of each of the variables and features, and multivariate analysis to check relationships with other variables and with the dependent variable.

Based on the type of model you are going to use, you may need to transform the variables using one of the following approaches (a short sketch follows the list):

• Binning approach: create distinct groups
• Transformation: Logarithmic, Polynomial, Square Root, Inverse, Square, Box-Cox
• Extreme value (outlier) treatments
• Missing value treatment
• Dimension reduction: Information Value (IV) and Weight of Evidence (WoE), variable clustering, PCA, Factor Analysis, etc.
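A small Python sketch of the binning, transformation, outlier and missing-value treatments listed above, using made-up illustration values:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({"income": [25_000, 48_000, 52_000, 300_000],
                   "age": [23, 35, 47, 61]})

# Binning: create distinct groups
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 100],
                        labels=["<=30", "31-45", "46-60", "60+"])

# Transformations: logarithmic and Box-Cox (Box-Cox needs positive values)
df["log_income"] = np.log1p(df["income"])
df["boxcox_income"], _ = stats.boxcox(df["income"])

# Extreme value (outlier) treatment: cap at the 1st and 99th percentiles
low, high = df["income"].quantile([0.01, 0.99])
df["income_capped"] = df["income"].clip(low, high)

# Missing value treatment: impute with the median (a no-op here,
# shown for illustration)
df["income"] = df["income"].fillna(df["income"].median())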

Step 5.1: Random Sampling (Train and Test)

• Training Sample: the model will be developed on this sample. Typically 50%, 60%, 70% or 80% of the data goes here.
• Test Sample: model performance will be validated on this sample. Typically 50%, 40%, 30% or 20% of the data goes here.
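A minimal sketch of this random split with scikit-learn, using synthetic data in place of a real modelling dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared modelling dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70% training sample, 30% test sample (any of the splits above works)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape)   # (700, 10) (300, 10)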

Step 6.1: Model Selection

Based on the defined goal(s) (supervised or unsupervised), we have to select one of, or a combination of, modeling techniques, such as:

• General Linear Model
• Non-Linear Regression
• Linear Regression
• Classification/Decision Trees
• Random Forest
• Support Vector Machine (SVM)
• Distance Metric Learning
• Bayesian Methods
• Graphical Models
• Neural Networks
• Genetic Algorithms
• Hazard and Survival Functions
• Time Series Models
• Signal Processing
• Clustering Techniques
• Market Basket Analysis
• Frequent Itemset Mining
• Association Rule Mining, etc.

There is a wide variety of choices available beyond this list.

Step 6.2: Build/Develop/Train Models


• Validate the assumptions of the chosen algorithm.
• Check for multicollinearity and redundancy among the independent variables (features). Sometimes in machine learning we are mainly interested in model accuracy, and hence may not perform these checks!
• Develop/train the model on the training sample, which is 80%/70%/60%/50% of the available data (population).
• Check model performance: Error, Accuracy, ROC, KS, Gini (a sketch follows).
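A sketch of this training-and-evaluation step on synthetic data; computing Gini as 2*AUC - 1 and KS from the two score distributions are standard conventions:

from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Train on the training sample
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# Check performance: Accuracy, ROC AUC, Gini, KS
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, scores)
gini = 2 * auc - 1                                  # Gini from AUC
ks = ks_2samp(scores[y_test == 1], scores[y_test == 0]).statistic
print(f"Accuracy={accuracy:.3f}  AUC={auc:.3f}  Gini={gini:.3f}  KS={ks:.3f}")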

Step 7: Validate/Test Models

• Score and predict using the Test Sample.
• Check the robustness and stability of the model.
• Check model performance: Accuracy, ROC, AUC, KS, Gini, etc.
• Perform cross-validation to get more reliable estimates of model accuracy/performance (a brief sketch follows).
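A brief sketch of k-fold cross-validation with scikit-learn, again on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# 5-fold cross-validation: average AUC and its variability across folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())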

HYPOTHESIS TESTING

• A statistical hypothesis is an assumption about an unknown population parameter.

• It is a well-defined procedure which helps us to decide objectively whether to accept or reject the hypothesis based on the information available from the sample.

STEP 1: SET NULL AND ALTERNATIVE HYPOTHESIS

• The null hypothesis, denoted by H0, is the hypothesis which is tested for possible rejection under the assumption that it is true.

• Theoretically, H0 is set as "no difference" and is considered true until it is proved wrong by the collected sample data.

• The alternative hypothesis, denoted by H1 or Ha, is the logical opposite of H0.

1. H0: μ = μ0 versus
Ha: μ ≠ μ0 (two-sided) TWO-TAILED TEST

2. H0: μ ≥ μ0 versus
Ha: μ < μ0 (one-sided) LEFT-TAILED TEST

3. H0: μ ≤ μ0 versus
Ha: μ > μ0 (one-sided) RIGHT-TAILED TEST
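The three formulations map directly onto the "alternative" parameter of SciPy's one-sample t-test (SciPy 1.6+); the sample values and μ0 = 50 below are made up for illustration:

import numpy as np
from scipy import stats

sample = np.array([48.2, 51.5, 49.8, 52.1, 47.9, 50.4, 51.0, 49.3])
mu0 = 50

# Two-tailed: H0: mu = mu0  vs  Ha: mu != mu0
print(stats.ttest_1samp(sample, mu0, alternative="two-sided"))

# Left-tailed: H0: mu >= mu0  vs  Ha: mu < mu0
print(stats.ttest_1samp(sample, mu0, alternative="less"))

# Right-tailed: H0: mu <= mu0  vs  Ha: mu > mu0
print(stats.ttest_1samp(sample, mu0, alternative="greater"))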

STEP 2: DETERMINE THE APPROPRIATE STATISTICAL TEST


• After setting the hypothesis, the researcher has to decide on an appropriate statistical test to be used for the statistical analysis.
• The statistic used in the study (mean, proportion, variance, etc.) must also be considered when a researcher decides on the appropriate statistical test to apply for hypothesis testing, in order to obtain the best results.

STEP 3: SET THE LEVEL OF SIGNIFICANCE

• The level of significance, denoted by α, is the probability of rejecting the null hypothesis even when it is true.
• The level of significance is also known as the size of the rejection region or the size of the critical region.
• The level of significance must be determined before we draw samples, so that the obtained result is free from the bias of the decision maker.
• Common choices are 0.01, 0.02, 0.05 and 0.10.

STEP 4: SET THE DECISION RULE

• The next step is to establish the critical region, which is an area under the normal curve. These regions are termed the acceptance region (where H0 is accepted) and the rejection region or critical region.

• If the computed value of the test statistic falls in the acceptance region, the null hypothesis is accepted.

• Otherwise, H0 is rejected.

STEP 5: COLLECT THE SAMPLE DATA

• In this stage data are collected and appropriate sample statistics are computed.

• The first 4 steps should be completed before collecting the data for the study.

STEP 6: ANALYZE THE DATA

• In this step the researcher has to compute the test statistic. This involves selection of
appropriate probability distribution for a particular test.

• For example, when the sample is small, the t-distribution is used; if the sample size is large, the Z-test is used (a sketch follows).

• Some commonly used testing procedures are the F, t, Z and chi-square tests.
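A sketch of this choice in Python, using SciPy's t-test for a small sample and the Z-test from statsmodels for a large one; all data are simulated:

import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)

# Small sample (n < 30): use the t-test
small = rng.normal(loc=100, scale=15, size=12)
t_stat, t_p = stats.ttest_1samp(small, 100)

# Large sample: use the Z-test
large = rng.normal(loc=100, scale=15, size=500)
z_stat, z_p = ztest(large, value=100)

print(f"t = {t_stat:.3f} (p = {t_p:.3f}),  Z = {z_stat:.3f} (p = {z_p:.3f})")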

HOW TO SELECT A TEST


• In attempting to choose a particular significance test, the researcher should consider at
least three questions:

1. Does the test involve one sample, two samples, or k (more than two) samples?

2. If two samples or k samples are involved, are the individual cases independent or
related?

3. Is the measurement scale nominal, ordinal, interval, or ratio?


STEP 7: ARRIVE AT STATISTICAL CONCLUSION

• In this step the researcher draws a conclusion. A statistical conclusion is a decision to accept or reject H0, depending on whether the computed statistic falls in the acceptance region or the rejection region.

IMPORTANT CONCEPTS:

1. CRITICAL REGION

The critical region (or rejection region) is the set of all values of the test statistic that cause us to reject the null hypothesis.

2. LEVEL OF SIGNIFICANCE
The significance level (denoted by α) is the probability that the test statistic will fall in the critical region when the null hypothesis is actually true. Common choices for α are 0.05, 0.01, 0.02 and 0.10.

3. CRITICAL VALUE

A critical value is any value that separates the critical region (where we reject H0) from the values of the test statistic that do not lead to rejection of the null hypothesis. It depends on the sampling distribution that applies and on the significance level α. For example, the critical value z = 1.645 corresponds to a significance level of α = 0.05 in a right-tailed test.
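A short SciPy sketch of computing critical values and applying the decision rule from Step 4; the computed test statistic is a made-up value:

from scipy import stats

alpha = 0.05

# Right-tailed critical value: z such that P(Z > z) = alpha
z_right = stats.norm.ppf(1 - alpha)        # 1.645
# Two-tailed critical value: z such that P(|Z| > z) = alpha
z_two = stats.norm.ppf(1 - alpha / 2)      # 1.960

test_statistic = 2.10                      # made-up computed value
if test_statistic > z_right:
    print("Reject H0 (statistic falls in the critical region)")
else:
    print("Fail to reject H0")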

4. TWO-TAILED, RIGHT TAILED AND LEFT TAILED TESTS

The tails in a distribution are the extreme regions bounded by critical values. A test may be two-tailed, left-tailed or right-tailed, depending on where the rejection region lies.

5. CONCLUSION IN HYPOTHESIS TESTING

We always test the null hypothesis.

1. Reject the H0

2. Fail to reject the H0

ACCEPT VERSUS FAIL TO REJECT

• Some texts use "accept the null hypothesis."
• We are not proving the null hypothesis.
• The sample evidence is simply not strong enough to warrant rejection (just as there may not be enough evidence to convict a suspect).

DECISION CRITERIA

• Reject H0 if the test statistic falls within the critical region.

• Fail to reject H0 if the test statistic does not fall within the critical region.

6. THE TWO TYPES OF ERRORS AND THEIR PROBABILITIES


• When the null hypothesis is true, the probability of a Type I error, the level of significance and the α-level are all equivalent.
• When the null hypothesis is not true, a Type I error cannot be made.

TYPE I ERROR
• A Type I error is the mistake of rejecting the null hypothesis when it is true.
• The symbol α (alpha) is used to represent the probability of a Type I error.

TYPE II ERROR
• A Type II error is the mistake of failing to reject the null hypothesis when it is actually false.
• The symbol β (beta) is used to represent the probability of a Type II error.

ERRORS

                        H0 is true            H0 is false
Reject H0               Type I error (α)      Correct decision
Fail to reject H0       Correct decision      Type II error (β)

CONTROLLING THE ERRORS

• For any fixed a, an increase in the sample size n will cause a decrease in b.
• For any fixed sample size n , a decrease in a will cause an increase in b. Conversely, an
increase in a will cause a decrease in b .
• To decrease both a and b, increase the sample size.
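A sketch of these trade-offs for a right-tailed Z-test under the normal approximation; the true mean μ1, the hypothesised μ0 and σ are made-up illustration values:

from scipy import stats

def beta_error(alpha, n, mu0=100, mu1=105, sigma=15):
    """P(Type II error): fail to reject H0 when the true mean is mu1."""
    z_crit = stats.norm.ppf(1 - alpha)        # rejection cutoff
    shift = (mu1 - mu0) * n**0.5 / sigma      # standardised effect size
    return stats.norm.cdf(z_crit - shift)

print(beta_error(alpha=0.05, n=30))    # baseline beta
print(beta_error(alpha=0.05, n=100))   # larger n shrinks beta
print(beta_error(alpha=0.01, n=30))    # smaller alpha raises beta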

PREDICTION VS INTERPRETATION

• Interpretation: use the model to learn about the data generation process.
• Prediction: use the model to predict the outcomes for new data points.
The two perspectives compare as follows:

Criterion: Model Selection
• Prediction: evaluate a variety of models; select the best-performing model.
• Interpretation: reason about the data generation process; select the model whose assumptions seem most reasonable.

Criterion: Validation
• Prediction: empirically determine the loss on the test set.
• Interpretation: use goodness-of-fit tests.

Criterion: Application
• Prediction: predict the outcome for new samples.
• Interpretation: use the model to explain the data generation process.

Criterion: Ramifications
• Prediction: model validity is shown for the test set, but the model may overfit if the test data are similar to the training data.
• Interpretation: model validity is uncertain since predictive accuracy was not considered; overfitting is prevented through simplified assumptions.
