
ASSIGNMENT-2

SUBJECT: Predictive Modelling (Code: MB207)

SUBMITTED BY: Dikshita Jain (2K19/BMBA/08)


Prediction is a very powerful tool when it comes to big data. Predictions from big data depend on the dataset: its volume, the values it contains and the variety in the data. To make a prediction we use different big data tools and algorithms.

Predictive modeling can be described as the process of building statistical models to predict future outcomes from our data.

Predictive modeling uses machine learning algorithms for prediction. Machine learning depends on the dataset that you provide. Machine learning algorithms help us build a predictive model that forecasts future outcomes based on past numbers and data.

There are three basic steps for building a predictive model. This is just a simple guide to understand how a predictive model is built:

Data: Data is the information needed for working on a given problem. Whenever we select a problem to build a predictive model, we need information on which the prediction can be based. Data can be in the form of text, comma-separated values (CSV), a database or a raw file.

Model: The model is responsible for giving us the needed results. It uses one of the machine learning algorithms and is used for learning. All this learning is applied when we wish to make a prediction. Once the model is built, we can use it for various predictions. We can also reuse the model on different datasets with a different set of predictors.

We use an algorithm and provide a training dataset for learning purposes, so that the model can then be used for prediction.

Prediction: In this phase, we use the trained model on a different dataset for similar or different predictions. Once the model is built, we can use it for different predictions based on the input. The input may be the same data or new data; as long as it contains the same predictors, the model will still work (a rough sketch of the whole flow follows).
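As a rough illustration, here is a minimal Python sketch of this data, model and prediction flow, assuming pandas and scikit-learn are installed; the file names (sales.csv, new_sales.csv) and the column name "target" are hypothetical stand-ins:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Data: the information the prediction will be based on
data = pd.read_csv("sales.csv")                 # hypothetical file
X, y = data.drop(columns=["target"]), data["target"]

# Model: learn from the historical data using one ML algorithm
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Prediction: reuse the trained model on new input with the same predictors
new_data = pd.read_csv("new_sales.csv")         # hypothetical file
predictions = model.predict(new_data)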

PREDICTIVE MODELLING PROCESS

Step 1: Understand Business Objective

Step 2: Define Modeling Goals

Step 3: Select/Get Data


Step 4: Prepare Data

Step 5: Analyze and Transform Variables; Random Sampling

Step 6: Model Selection and Develop Models (Training)

Step 7: Validate Models (Testing), Optimize and Assess Profitability

Step 1: Business Objective(s)- Define objectives clearly and ask questions!

• Target Marketing
• Risk & Fraud Management
• Strategy Implementation and Change Management
• Operational Efficiency
• Improve Customer Experience
• Manage Marketing Campaigns
• Forecast Revenue or Loss
• Workforce Management
• Financial Modeling
• Churn Management
• Social Media Influencers

Step 2: Define Goals - Translate the Business Objective into an Analytics Goal

Based on the business questions we want to answer, translate the business objective into analytic terms:

• Profile Analysis
• Segmentation
• Response Modeling
• Risk Modeling
• Activation
• Cross-Sell and Up-Sell
• Attrition/Churn Modeling
• Net Present Value (NPV)
• Customer Lifetime Value (CLTV)

Step 3: Selecting Data for Modeling

Selecting the best data for target modeling requires a thorough understanding of the market, the business and the objective. The model is only as good and relevant as the underlying data.

Step 4: Prepare Data

In this step we prepare the data in the right format for the analysis and for the tool you intend to use.

• Do initial clean-up
• Define variables and create a data dictionary
• Join/append multiple datasets
• Validate for correctness
• Produce basic summary reports (a short sketch follows this list)
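A short pandas sketch of these preparation tasks; the file names and the customer_id key are hypothetical:

import pandas as pd

customers = pd.read_csv("customers.csv")           # hypothetical file
transactions = pd.read_csv("transactions.csv")     # hypothetical file

# Initial clean-up: drop exact duplicates, standardise column names
customers = customers.drop_duplicates()
customers.columns = customers.columns.str.lower().str.strip()

# Join multiple datasets on a common key
df = customers.merge(transactions, on="customer_id", how="left")

# Validate for correctness and produce basic summary reports
print(df.isna().sum())        # missing values per column
print(df.describe())          # basic numeric summaries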

Step 5: Analyze and Transform Variables

Once the data is in the right shape, perform univariate analysis to check the distribution of each of the variables and features, and multivariate analysis to check relationships with other variables and with the dependent variable.

Based on the type of model you are going to use, you may need to transform the variables using one of the following approaches (a short sketch follows the list):

• Binning approach: create distinct groups
• Transformation: Logarithmic, Polynomial, Square Root, Inverse, Square, Box-Cox
• Extreme value (outlier) treatments
• Missing value treatment
• Dimension reduction: Information Value (IV) and Weight of Evidence (WoE), variable clustering, PCA, Factor Analysis, etc.
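A small Python sketch of the binning, transformation, outlier and missing-value treatments listed above, using made-up illustration values:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({"income": [25_000, 48_000, 52_000, 300_000],
                   "age": [23, 35, 47, 61]})

# Binning: create distinct groups
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 100],
                        labels=["<=30", "31-45", "46-60", "60+"])

# Transformations: logarithmic and Box-Cox (Box-Cox needs positive values)
df["log_income"] = np.log1p(df["income"])
df["boxcox_income"], _ = stats.boxcox(df["income"])

# Extreme value (outlier) treatment: cap at the 1st and 99th percentiles
low, high = df["income"].quantile([0.01, 0.99])
df["income_capped"] = df["income"].clip(low, high)

# Missing value treatment: impute with the median (a no-op here,
# shown for illustration)
df["income"] = df["income"].fillna(df["income"].median())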

Step 5.1: Random Sampling (Train and Test)

• Training Sample: the model will be developed on this sample. Typically 50%, 60%, 70% or 80% of the data goes here.
• Test Sample: model performance will be validated on this sample. Typically 50%, 40%, 30% or 20% of the data goes here.
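A minimal sketch of this random split with scikit-learn, using synthetic data in place of a real modelling dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared modelling dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70% training sample, 30% test sample (any of the splits above works)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape)   # (700, 10) (300, 10)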

Step 6.1: Model Selection

Based on the defined goal(s) (supervised or unsupervised), we have to select one of, or a combination of, modeling techniques, such as:

• General Linear Model
• Non-Linear Regression
• Linear Regression
• Classification/Decision Trees
• Random Forest
• Support Vector Machine (SVM)
• Distance Metric Learning
• Bayesian Methods
• Graphical Models
• Neural Networks
• Genetic Algorithms
• Hazard and Survival Functions
• Time Series Models
• Signal Processing
• Clustering Techniques
• Market Basket Analysis
• Frequent Itemset Mining
• Association Rule Mining, etc.

There is a wide variety of choices available beyond this list.

Step 6.2: Build/Develop/Train Models


• Validate the assumptions of the chosen algorithm.
• Check for multicollinearity and redundancy among the independent variables (features). Sometimes in machine learning we are mainly interested in model accuracy, and hence may not perform these checks!
• Develop/train the model on the training sample, which is 80%/70%/60%/50% of the available data (population).
• Check model performance: Error, Accuracy, ROC, KS, Gini (a sketch follows).
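A sketch of this training-and-evaluation step on synthetic data; computing Gini as 2*AUC - 1 and KS from the two score distributions are standard conventions:

from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Train on the training sample
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# Check performance: Accuracy, ROC AUC, Gini, KS
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, scores)
gini = 2 * auc - 1                                  # Gini from AUC
ks = ks_2samp(scores[y_test == 1], scores[y_test == 0]).statistic
print(f"Accuracy={accuracy:.3f}  AUC={auc:.3f}  Gini={gini:.3f}  KS={ks:.3f}")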

Step 7: Validate/Test Models

• Score and predict using the Test Sample.
• Check the robustness and stability of the model.
• Check model performance: Accuracy, ROC, AUC, KS, Gini, etc.
• Perform cross-validation to get more reliable estimates of model accuracy/performance (a brief sketch follows).
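A brief sketch of k-fold cross-validation with scikit-learn, again on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# 5-fold cross-validation: average AUC and its variability across folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())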

HYPOTHESIS TESTING

• A statistical hypothesis is an assumption about an unknown population parameter.

• It is a well-defined procedure which helps us to decide objectively whether to accept or reject the hypothesis based on the information available from the sample.

STEP 1: SET NULL AND ALTERNATIVE HYPOTHESIS

• The null hypothesis, denoted by H0, is the hypothesis which is tested for possible rejection under the assumption that it is true.

• Theoretically, H0 is set as "no difference" and is considered true until it is proved wrong by the collected sample data.

• The alternative hypothesis, denoted by H1 or Ha, is the logical opposite of H0.

1. H0: μ = μ0 versus
Ha: μ ≠ μ0 (two-sided) TWO-TAILED TEST

2. H0: μ ≥ μ0 versus
Ha: μ < μ0 (one-sided) LEFT-TAILED TEST

3. H0: μ ≤ μ0 versus
Ha: μ > μ0 (one-sided) RIGHT-TAILED TEST
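The three formulations map directly onto the "alternative" parameter of SciPy's one-sample t-test (SciPy 1.6+); the sample values and μ0 = 50 below are made up for illustration:

import numpy as np
from scipy import stats

sample = np.array([48.2, 51.5, 49.8, 52.1, 47.9, 50.4, 51.0, 49.3])
mu0 = 50

# Two-tailed: H0: mu = mu0  vs  Ha: mu != mu0
print(stats.ttest_1samp(sample, mu0, alternative="two-sided"))

# Left-tailed: H0: mu >= mu0  vs  Ha: mu < mu0
print(stats.ttest_1samp(sample, mu0, alternative="less"))

# Right-tailed: H0: mu <= mu0  vs  Ha: mu > mu0
print(stats.ttest_1samp(sample, mu0, alternative="greater"))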

STEP 2: DETERMINE THE APPROPRIATE STATISTICAL TEST


• After setting the hypothesis, the researcher has to decide on an appropriate statistical test to be used for the statistical analysis.
• The statistic used in the study (mean, proportion, variance, etc.) must also be considered when a researcher decides on the appropriate statistical test to apply for hypothesis testing, in order to obtain the best results.

STEP 3: SET THE LEVEL OF SIGNIFICANCE

• The level of significance, denoted by α, is the probability of rejecting the null hypothesis even when it is true.
• The level of significance is also known as the size of the rejection region or the size of the critical region.
• The level of significance must be determined before we draw samples, so that the obtained result is free from the bias of the decision maker.
• Common choices are 0.01, 0.02, 0.05 and 0.10.

STEP 4: SET THE DECISION RULE

• The next step is to establish the critical region, which is an area under the normal curve. These regions are termed the acceptance region (where H0 is accepted) and the rejection region or critical region.

• If the computed value of the test statistic falls in the acceptance region, the null hypothesis is accepted.

• Otherwise, H0 is rejected.

STEP 5: COLLECT THE SAMPLE DATA

• In this stage data are collected and appropriate sample statistics are computed.

• The first 4 steps should be completed before collecting the data for the study.

STEP 6: ANALYZE THE DATA

• In this step the researcher has to compute the test statistic. This involves selection of
appropriate probability distribution for a particular test.

• For example, when the sample is small, the t-distribution is used; if the sample size is large, the Z-test is used (a sketch follows).

• Some commonly used testing procedures are the F, t, Z and chi-square tests.
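A sketch of this choice in Python, using SciPy's t-test for a small sample and the Z-test from statsmodels for a large one; all data are simulated:

import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)

# Small sample (n < 30): use the t-test
small = rng.normal(loc=100, scale=15, size=12)
t_stat, t_p = stats.ttest_1samp(small, 100)

# Large sample: use the Z-test
large = rng.normal(loc=100, scale=15, size=500)
z_stat, z_p = ztest(large, value=100)

print(f"t = {t_stat:.3f} (p = {t_p:.3f}),  Z = {z_stat:.3f} (p = {z_p:.3f})")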

HOW TO SELECT A TEST


• In attempting to choose a particular significance test, the researcher should consider at
least three questions:

1. Does the test involve one sample, two samples, or k (more than two) samples?

2. If two samples or k samples are involved, are the individual cases independent or
related?

3. Is the measurement scale nominal, ordinal, interval, or ratio?


STEP 7: ARRIVE AT STATISTICAL CONCLUSION

• In this step the researcher draws a conclusion. A statistical conclusion is a decision to accept or reject H0, depending on whether the computed statistic falls in the acceptance region or the rejection region.

IMPORTANT CONCEPTS:

1. CRITICAL REGION

The critical region (or rejection region) is the set of all values of the test statistic that cause us to reject the null hypothesis.

2. LEVEL OF SIGNIFICANCE
The significance level (denoted by α) is the probability that the test statistic will fall in the critical region when the null hypothesis is actually true. Common choices for α are 0.05, 0.01, 0.02 and 0.10.

3. CRITICAL VALUE

A critical value is any value that separates the critical region (where we reject H0) from the values of the test statistic that do not lead to rejection of the null hypothesis. It depends on the sampling distribution that applies and on the significance level α. For example, the critical value z = 1.645 corresponds to a significance level of α = 0.05 in a right-tailed test.
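A short SciPy sketch of computing critical values and applying the decision rule from Step 4; the computed test statistic is a made-up value:

from scipy import stats

alpha = 0.05

# Right-tailed critical value: z such that P(Z > z) = alpha
z_right = stats.norm.ppf(1 - alpha)        # 1.645
# Two-tailed critical value: z such that P(|Z| > z) = alpha
z_two = stats.norm.ppf(1 - alpha / 2)      # 1.960

test_statistic = 2.10                      # made-up computed value
if test_statistic > z_right:
    print("Reject H0 (statistic falls in the critical region)")
else:
    print("Fail to reject H0")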

4. TWO-TAILED, RIGHT TAILED AND LEFT TAILED TESTS

The tails in a distribution are the extreme regions bounded by critical values. A test may be two-tailed, left-tailed or right-tailed, depending on where the rejection region lies.

5. CONCLUSION IN HYPOTHESIS TESTING

We always test the null hypothesis.

1. Reject the H0

2. Fail to reject the H0

ACCEPT VERSUS FAIL TO REJECT

• Some texts use "accept the null hypothesis."
• We are not proving the null hypothesis.
• The sample evidence is simply not strong enough to warrant rejection (just as there may not be enough evidence to convict a suspect).

DECISION CRITERIA

• Reject H0 if the test statistic falls within the critical region.

• Fail to reject H0 if the test statistic does not fall within the critical region.

6. THE TWO TYPES OF ERRORS AND THEIR PROBABILITIES


• When the null hypothesis is true, the probability of a Type I error, the level of significance and the α-level are all equivalent.
• When the null hypothesis is not true, a Type I error cannot be made.

TYPE I ERROR
• A Type I error is the mistake of rejecting the null hypothesis when it is true.
• The symbol α (alpha) is used to represent the probability of a Type I error.

TYPE II ERROR
• A Type II error is the mistake of failing to reject the null hypothesis when it is actually false.
• The symbol β (beta) is used to represent the probability of a Type II error.

ERRORS

                        H0 is true            H0 is false
Reject H0               Type I error (α)      Correct decision
Fail to reject H0       Correct decision      Type II error (β)

CONTROLLING THE ERRORS

• For any fixed a, an increase in the sample size n will cause a decrease in b.
• For any fixed sample size n , a decrease in a will cause an increase in b. Conversely, an
increase in a will cause a decrease in b .
• To decrease both a and b, increase the sample size.
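A sketch of these trade-offs for a right-tailed Z-test under the normal approximation; the true mean μ1, the hypothesised μ0 and σ are made-up illustration values:

from scipy import stats

def beta_error(alpha, n, mu0=100, mu1=105, sigma=15):
    """P(Type II error): fail to reject H0 when the true mean is mu1."""
    z_crit = stats.norm.ppf(1 - alpha)        # rejection cutoff
    shift = (mu1 - mu0) * n**0.5 / sigma      # standardised effect size
    return stats.norm.cdf(z_crit - shift)

print(beta_error(alpha=0.05, n=30))    # baseline beta
print(beta_error(alpha=0.05, n=100))   # larger n shrinks beta
print(beta_error(alpha=0.01, n=30))    # smaller alpha raises beta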

PREDICTION VS INTERPRETATION

• Interpretation: use the model to learn about the data generation process.
• Prediction: use the model to predict the outcomes for new data points.
The two perspectives compare as follows:

Criterion: Model Selection
• Prediction: evaluate a variety of models; select the best-performing model.
• Interpretation: reason about the data generation process; select the model whose assumptions seem most reasonable.

Criterion: Validation
• Prediction: empirically determine the loss on the test set.
• Interpretation: use goodness-of-fit tests.

Criterion: Application
• Prediction: predict the outcome for new samples.
• Interpretation: use the model to explain the data generation process.

Criterion: Ramifications
• Prediction: model validity is shown for the test set, but the model may overfit if the test data are similar to the training data.
• Interpretation: model validity is uncertain since predictive accuracy was not considered; overfitting is prevented through simplified assumptions.
