Assignment2 Dikshita
Predictive modeling is the process of building statistical models to predict the future
behaviour of data.
Predictive modeling uses machine learning algorithms for prediction, and the results depend
on the dataset supplied for training. Machine learning algorithms help us build a
predictive model that forecasts outcomes based on past numbers and data.
There are three basic steps for building a predictive model. This is a simple guide to
understanding how a predictive model is built:
Data: Data is the information needed for working on a given problem. Whenever we select a
problem to build a predictive model for, we need information on which the prediction
can be based. Data can be in the form of text, comma-separated values (CSV), a database or a raw
file.
Model: The model is responsible for giving us the needed results. It uses one of the machine
learning algorithms. We choose an algorithm and provide a training dataset from which the model
learns; what it learns is then used for prediction. Once the model is built, we can use it for
various predictions. The same modeling approach can also be reused on a different dataset with a
different set of predictors, although the model then has to be trained again on that data.
Prediction: In this phase, we use the trained model on a different dataset for similar or different
predictions. Once the model is built, we can use it for different predictions based on the input.
The input may be the same data or new data; the model should still work, provided the new data
contains the same predictors the model was trained on.
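The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended modeling approach: the data values are made up, the "model" is a simple least-squares line, and the names `fit_line` and `predict` are hypothetical helpers introduced only for this example.

```python
# Data step: hypothetical (x, y) observations, made up for illustration.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.0, 4.0, 6.0, 8.0, 10.0]


def fit_line(xs, ys):
    """Model step: learn slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Prediction step: apply the learned parameters to a new input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(train_x, train_y)   # learn from the training data
prediction = predict(model, 6.0)     # predict for an input not seen in training
print(prediction)
```

The new input has the same single predictor as the training data, which is what allows the trained model to be applied to it.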
Target Marketing
Risk & Fraud Management
Strategy Implementation and Change Management
Operational Efficiency
Increase Customer Experience
Manage Marketing Campaigns
Forecast Revenue or Loss
Workforce Management
Financial Modeling
Churn Management
Social Media Influencers
Based on the business questions we want to answer, translate the business objective into analytic
terms:
Profile Analysis
Segmentations
Response Modeling
Risk Modeling
Activation
Cross-Sell and Upsell
Attrition/Churn Modeling
Net Present Value(NPV)
Customer Life Time Value (CLTV)
Selecting the best data for target modeling requires a thorough understanding of the market, the
business and the objective. The model is only as good and relevant as the underlying data.
In this step we need to prepare the data in the right format for the analysis and for the tool
you want to use:
Do initial cleaning up
Define Variables and Create Data Dictionary
Joining/Appending multiple datasets
Validate for correctness
Produce Basic Summary Reports
Once the data is in the right shape, perform univariate analysis to check the distribution of each
of the variables and features, and multivariate analysis to check their relationships with other
variables and with the dependent variable.
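As a rough illustration of the two kinds of analysis, the sketch below summarizes one variable (univariate) and measures the linear relationship between two variables (a simple bivariate case of multivariate analysis). The columns `age` and `spend` and their values are hypothetical, and `summarize` and `pearson` are helper names introduced here.

```python
from math import sqrt
from statistics import mean, median, stdev


def summarize(values):
    """Univariate summary: centre, spread and range of one variable."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),   # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: a predictor and a dependent variable.
age = [25, 32, 47, 51, 62]
spend = [120.0, 150.0, 210.0, 230.0, 280.0]

summary = summarize(age)
r = pearson(age, spend)
print(summary)
print(r)
```

A correlation close to 1 or -1 suggests a strong linear relationship with the dependent variable; it does not by itself show that the predictor causes the outcome.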
Based on the type of model you are going to use, you may need to transform the variables using
one of the available approaches.
Training Sample: The model is developed on this sample. Typically 50%, 60%, 70% or
80% of the data goes here.
Test Sample: Model performance is validated on this sample. Typically the remaining 50%, 40%,
30% or 20% of the data goes here.
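A training/test split along these lines could be sketched as follows. The function name `train_test_split`, the 70/30 default and the fixed seed are assumptions made for this example; the rows are placeholder integers standing in for real records.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test) samples."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))                       # placeholder records
train, test = train_test_split(rows, train_fraction=0.7)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that simply follows the order in which the data was stored, and every record ends up in exactly one of the two samples.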
HYPOTHESIS TESTING
• The null hypothesis, denoted by H0, is the hypothesis that is tested for possible
rejection under the assumption that it is true.
1. H0: μ = μ0 versus
Ha: μ ≠ μ0 (two-sided) TWO TAILED TEST
2. H0: μ ≥ μ0 versus
Ha: μ < μ0 (one-sided) LEFT TAILED TEST
3. H0: μ ≤ μ0 versus
Ha: μ > μ0 (one-sided) RIGHT TAILED TEST
• The next step is to establish a critical region, which is an area under the normal curve.
The curve is divided into the acceptance region (where H0 is accepted) and the
rejection region, or critical region.
• If the computed value of the test statistic falls in the acceptance region, the null
hypothesis is accepted.
• Otherwise H0 is rejected.
• In this stage data are collected and appropriate sample statistics are computed.
• The first 4 steps should be completed before collecting the data for the study.
• In this step the researcher has to compute the test statistic. This involves selecting the
appropriate probability distribution for the particular test.
• For example, when the sample is small, the t-distribution is used; when the sample size is
large, the Z-test is used.
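For the large-sample case, computing the test statistic and applying the decision rule might look like the sketch below. It assumes a two-tailed one-sample Z-test with a known population standard deviation `sigma`; the sample values, `mu0` and `sigma` are hypothetical, and `one_sample_z_test` is a name introduced for this example.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-tailed one-sample z-test; sigma is assumed to be known."""
    n = len(sample)
    # Test statistic: distance of the sample mean from mu0 in standard errors.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Two-tailed critical value: alpha is split equally between both tails.
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    reject = abs(z) > critical
    return z, critical, reject


# Hypothetical sample of 36 measurements with mean 52; H0: mu = 50.
sample = [50.0, 54.0] * 18
z, critical, reject = one_sample_z_test(sample, mu0=50.0, sigma=6.0)
print(z, critical, reject)
```

With a small sample and an unknown population standard deviation, the same structure applies, but the statistic would be compared against a t-distribution critical value instead.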
SELECTION OF TEST
1. Does the test involve one sample, two samples, or k (more than two) samples?
2. If two samples or k samples are involved, are the individual cases independent or
related?
IMPORTANT CONCEPTS:
1. CRITICAL REGION
The critical region (or rejection region) is the set of all values of the test statistic that cause us to
reject the null hypothesis
2. LEVEL OF SIGNIFICANCE
The significance level (denoted by α) is the probability that the test statistic will fall in the
critical region when the null hypothesis is actually true. Common choices for α are 0.05, 0.01,
0.02, and 0.10.
3. CRITICAL VALUE
A critical value is any value separating the critical region (where we reject H0) from the
values of the test statistic that do not lead to rejection of the null hypothesis. Critical values
depend on the nature of the null hypothesis, the sampling distribution that applies, and the
significance level α. For example, in a right-tailed test the critical value z = 1.645
corresponds to a significance level of α = 0.05.
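The tails in a distribution are the extreme regions bounded by critical values.
TWO TAILED: the critical region lies in both tails.
LEFT TAILED: the critical region lies in the left tail.
RIGHT TAILED: the critical region lies in the right tail.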
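The link between the significance level, the tail of the test and the critical value can be checked numerically. The sketch below assumes the test statistic follows a standard normal distribution; `z_critical_values` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_critical_values(alpha, tail):
    """Critical value(s) of a z-test for the given significance level and tail."""
    std_normal = NormalDist()   # mean 0, standard deviation 1
    if tail == "two":
        # alpha is split equally between the two tails.
        c = std_normal.inv_cdf(1.0 - alpha / 2.0)
        return (-c, c)
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)
    if tail == "right":
        return (std_normal.inv_cdf(1.0 - alpha),)
    raise ValueError("tail must be 'two', 'left' or 'right'")


right = z_critical_values(0.05, "right")   # approximately (1.645,)
left = z_critical_values(0.05, "left")     # approximately (-1.645,)
two = z_critical_values(0.05, "two")       # approximately (-1.96, 1.96)
print(right, left, two)
```

For the same significance level, the two-tailed critical values lie further from zero than the one-tailed value because each tail receives only half of the total probability.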
DECISION CRITERIA
• Reject H0 if the test statistic falls within the critical region.
• Fail to reject H0 if the test statistic does not fall within the critical region.
TYPE I ERROR
• A Type I error is the mistake of rejecting the null hypothesis when it is true.
• The symbol α (alpha) is used to represent the probability of a type I error.
TYPE II ERROR
• A Type II error is the mistake of failing to reject the null hypothesis when it is false.
• The symbol β (beta) is used to represent the probability of a type II error.
ERRORS
• For any fixed α, an increase in the sample size n will cause a decrease in β.
• For any fixed sample size n, a decrease in α will cause an increase in β. Conversely, an
increase in α will cause a decrease in β.
• To decrease both α and β, increase the sample size.
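These relationships can be illustrated numerically. The sketch below assumes a right-tailed one-sample z-test with known standard deviation, where the true mean exceeds the hypothesized mean by a fixed amount `effect`; the function name `type_ii_error` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, n, effect, sigma):
    """Beta for a right-tailed z-test when the true mean is mu0 + effect."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - alpha)
    # Under the alternative, the z statistic is centred at effect * sqrt(n) / sigma.
    shift = effect * sqrt(n) / sigma
    # Beta is the probability that the statistic stays below the critical value.
    return std_normal.cdf(critical - shift)


beta_base = type_ii_error(alpha=0.05, n=25, effect=1.0, sigma=5.0)
beta_larger_n = type_ii_error(alpha=0.05, n=100, effect=1.0, sigma=5.0)
beta_smaller_alpha = type_ii_error(alpha=0.01, n=25, effect=1.0, sigma=5.0)
print(beta_base, beta_larger_n, beta_smaller_alpha)
```

Under these assumptions, raising n from 25 to 100 lowers the type II error probability, while tightening the significance level from 0.05 to 0.01 at the same n raises it.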
PREDICTION VS INTERPRETATION
Interpretation: Use the model to learn about the data generation process.
Prediction: Use the model to predict the outcomes for new data points
Criterion | Prediction | Interpretation
Model Selection | Evaluate a variety of models; select the best-performing model | Reason about the data generation process; select the model whose assumptions seem most reasonable
Validation | Empirically determine loss on test set | Use goodness-of-fit tests
Application | Predict the outcome for new samples | Use the model to explain the data generation process
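The prediction workflow (evaluate several models, measure loss on a test set, keep the best performer) can be sketched as below. The data is made up, and the two candidates (a constant baseline and a least-squares line) along with the helper names `fit_line` and `mse` are assumptions chosen only to keep the example small.

```python
# Hypothetical training and test samples.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]
test_x = [5.0, 6.0]
test_y = [10.1, 11.9]


def fit_line(xs, ys):
    """Least-squares line; returns a function mapping x to a prediction."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a sample."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_mean = sum(train_y) / len(train_y)
candidates = {
    "constant": lambda x: train_mean,       # baseline: always predict the mean
    "linear": fit_line(train_x, train_y),   # fitted on the training sample only
}

# Validation: empirically determine the loss of each candidate on the test set.
losses = {name: mse(model, test_x, test_y) for name, model in candidates.items()}
best_name = min(losses, key=losses.get)
print(losses, best_name)
```

Under the interpretation criterion the choice would instead rest on whether each model's assumptions are plausible for the data generation process, which test-set loss alone does not establish.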