
How to Prepare Data for a Predictive Analysis Model
Predictive modelling is aimed at developing tools that can be used to predict, for an individual, the most likely value of a continuous measure or the probability of the occurrence (or recurrence) of an event. Developing such tools for prediction of outcomes at the level of the individual unit has grown hugely in popularity.

Pre-processing Your Data


The first step after collecting data is checking for inconsistencies and impossible values in the data. On the case level, variables that are dependent on each other may be checked in several ways. On the variable level, computing ranges provides a first check of whether values beyond an acceptable range were entered in the data. Examine outliers and determine for each outlier whether it is likely due to an error in the data collection, or whether it represents the true value of the case.
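As a minimal sketch of such checks, assuming a hypothetical dataset with an age column (in years) and a systolic blood pressure column (in mmHg), range and outlier screening might look like this in pandas:

```python
import pandas as pd

# Hypothetical data: age in years, systolic blood pressure (sbp) in mmHg
df = pd.DataFrame({
    "age": [34, 51, 290, 45, 62],      # 290 is an impossible value
    "sbp": [120, 135, 118, 260, 125],  # 260 may be an outlier
})

# Variable level: ranges give a first check for values beyond an acceptable range
print(df.agg(["min", "max"]))

# Flag impossible values (e.g., age outside 0-120 years) for correction or removal
print(df[(df["age"] < 0) | (df["age"] > 120)])

# Flag candidate outliers with a simple IQR rule; judge each flagged case
# individually: data-collection error or a true extreme value?
q1, q3 = df["sbp"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["sbp"] < q1 - 1.5 * iqr) | (df["sbp"] > q3 + 1.5 * iqr)])
```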
Transforming Predictor Variables
Regression models that are employed to develop prediction models explicitly assume
additivity and linearity of the associations between the predictors and the outcome (in linear
regression), or between the predictors and the log odds of the outcome (in logistic regression).
The linearity assumption implies that the slope of the regression line (or the estimated
coefficient) is the same value over the whole range of the predictor, and the additivity
assumption implies that effects of different predictor variables on the outcome are not
dependent on the value of other predictors. Regression methods do not place assumptions on
the distribution of the predictor variables, but severely skewed continuous variables (e.g.,
circulating levels of biomarkers) often perform better after transformation to a roughly
normal distribution. A frequent transformation of right-skewed predictors that consist of only
positive values is taking the natural logarithm. This compresses the long right tail and
expands the short left tail. In addition to taking the logarithm of a predictor, other
mathematical transformations may be performed as well (e.g., taking the square root). A drawback of including transformed predictors in the model is that their effects become harder to interpret on the original scale.
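A brief sketch of these transformations using NumPy and pandas, with made-up biomarker values for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive biomarker levels
biomarker = pd.Series([0.8, 1.1, 1.3, 2.0, 2.4, 3.5, 9.7, 25.0])

# The natural logarithm compresses the long right tail and expands the short left tail
log_biomarker = np.log(biomarker)

# Another common option for non-negative values: the square root
sqrt_biomarker = np.sqrt(biomarker)

# Compare skewness before and after transformation; note that any regression
# coefficient for log_biomarker now refers to the log scale, not the original one
print(pd.DataFrame({
    "original": biomarker,
    "log": log_biomarker,
    "sqrt": sqrt_biomarker,
}).skew())
```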
There are other methods to account for non-linear associations between the predictor and the
outcome, but those are strictly part of the regression modelling phase and do not fall within
the scope of preparing data for predictive modelling. Examples of such methods include
polynomial regression and spline regression.
Categorizing Predictor Variables
If transforming does not yield the desired effect, or if easy interpretation of coefficients is
necessary, continuous predictor variables may be categorized into two or more categories.
Keep in mind that when the assumptions of additivity and linearity are met, categorization is
likely to result in a decrease of predictive performance compared to using the continuous
predictor. Categorizing not only causes a loss of information and statistical power, but also underestimates the extent of variation in risk. Categorization can be performed using data-
driven cut-off values after visualization of the association between the determinant and the
outcome, or using well-established cut-off values.
For example, evidence suggests that the association between body mass index (BMI) and
mortality is U-shaped. In this case, choosing cut-off values that are commonly accepted (e.g.,
below 18.5 kg/m2 to define underweight and above 25 kg/m2 to define overweight) may not
result in the best performing categories on the data used for development compared to data-
driven determination of cut-off values, but it aids interpretation and practical implementation.
Bear in mind that the number of categories that are made depends not only on the best fit of the predictor during the modelling phase, but also on the number of predictors that can be studied using the sample at hand.
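A minimal sketch of both approaches in pandas, using made-up BMI values, the commonly accepted cut-offs mentioned above, and quartiles as the data-driven alternative:

```python
import pandas as pd

# Hypothetical BMI values in kg/m2
bmi = pd.Series([17.2, 19.5, 23.0, 24.8, 27.4, 31.0])

# Well-established cut-offs: underweight below 18.5, normal 18.5-25, overweight 25 and above
bmi_cat = pd.cut(
    bmi,
    bins=[0, 18.5, 25, float("inf")],
    labels=["underweight", "normal", "overweight"],
    right=False,  # intervals are [0, 18.5), [18.5, 25), [25, inf)
)

# Data-driven alternative: quartile-based categories
bmi_quartiles = pd.qcut(bmi, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"bmi": bmi, "category": bmi_cat, "quartile": bmi_quartiles}))
```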
Visualizing Data
Associations between continuous predictor variables and the outcome (or log odds etc. of the
outcome) can be visualized to check if non-linearity exists and if so, if there are clear
indications for certain transformations, polynomials, or categorization. For a continuous
outcome, a simple plot can be made consisting of the predictor on the x-axis and the outcome
variable on the y-axis with a smooth local regression curve to provide a visual representation
of the association.
For binary outcomes, graphing the association becomes more tedious because the outcome variable consists only of zeroes and ones. A simple solution is to make groups based on quartiles of the predictor variable, and plot the average of the predictor values against the average of the outcome (i.e., the observed proportion of events) in each group.
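A sketch of both plots using matplotlib, pandas and statsmodels on simulated data (the predictor, outcomes, and smoothing settings are purely illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = rng.uniform(18, 40, 300)                              # hypothetical continuous predictor
y_cont = 0.05 * (x - 28) ** 2 + rng.normal(0, 1, 300)     # continuous outcome (U-shaped)
y_bin = rng.binomial(1, 1 / (1 + np.exp(-(0.04 * (x - 28) ** 2 - 1))))  # binary outcome

# Continuous outcome: scatter plot with a smooth local (LOWESS) regression curve
smoothed = lowess(y_cont, x)
plt.scatter(x, y_cont, s=10, alpha=0.4)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
plt.xlabel("predictor")
plt.ylabel("outcome")
plt.show()

# Binary outcome: group by quartiles of the predictor and plot group means,
# i.e. the mean predictor value against the observed proportion of events
df = pd.DataFrame({"x": x, "y": y_bin})
grouped = df.groupby(pd.qcut(df["x"], 4), observed=True).mean()
plt.plot(grouped["x"], grouped["y"], marker="o")
plt.xlabel("mean predictor per quartile")
plt.ylabel("proportion with outcome")
plt.show()
```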
Missing Data
Most statistical and machine learning packages will omit patients that have one or more
missing values on the variables that are used to develop the model. This results in less
statistical precision in estimating regression coefficients and other statistics of interest,
reflected by larger standard errors, wider confidence intervals and thus p-values that are less
likely to be lower than the alpha that is chosen for testing. Such complete case analysis or
listwise deletion not only decreases the sample size, but may also introduce bias if the
incomplete cases are not a random sample of all cases recruited for the study. In that case, the completely observed cases in the sample no longer reflect the population of interest. The mechanism that underlies the missing values is therefore important for deciding how to handle missing data.
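A small illustration of how listwise deletion shrinks the sample, using a hypothetical dataset in pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset in which two predictors each have some missing values
df = pd.DataFrame({
    "age": [34, 51, np.nan, 45, 62, 58],
    "sbp": [120, np.nan, 118, 150, 125, np.nan],
    "outcome": [0, 1, 0, 1, 0, 1],
})

# Complete case analysis (listwise deletion): any row with a missing value is dropped
complete = df.dropna()
print(f"{len(df)} cases collected, {len(complete)} remain after listwise deletion")

# Compare complete and incomplete cases; systematic differences suggest that
# the remaining complete cases no longer reflect the population of interest
print(df[df.isna().any(axis=1)].mean(), complete.mean(), sep="\n")
```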
Handling Missing Data
To prevent a decrease in precision and a high likelihood of biased regression coefficients,
missing data can be imputed. Imputing means replacing the empty cells in the dataset with plausible values. The goal of imputation is not to add new information to the dataset, but to allow all other observations of incomplete patients to be used in the subsequent analysis.
There are numerous methods that can be used to impute missing data. A simple method to
impute a continuous variable is to compute the mean of that variable using data of patients
that have an observed value of this variable, and replace every missing data point with this
mean value. Simple as it is, imputation with the mean decreases the variance within a
variable and distorts the association between the imputed variable and other covariates in the
data. Proper imputation methods produce a synthetic part of the data that, when analysed, does not introduce bias in the estimation of regression coefficients.
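A minimal sketch in scikit-learn contrasting mean imputation with a model-based (iterative) imputation; the columns are hypothetical, and a full multiple imputation procedure would repeat the imputation several times with added noise:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical predictors with missing values
X = pd.DataFrame({
    "age": [34, 51, np.nan, 45, 62, 58],
    "sbp": [120, np.nan, 118, 150, 125, 135],
})

# Mean imputation: simple, but it shrinks the variance of the imputed variable
# and distorts its association with the other covariates
X_mean = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)

# More principled: model each incomplete variable from the other variables
# (a single iterative imputation; multiple imputation repeats this with added noise)
X_iter = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(X), columns=X.columns)

print(X_mean.var(), X_iter.var(), sep="\n")
```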

In addition to the above discussion, the following points are also important when preparing data for predictive modelling.
When you’ve defined the objectives of the model for predictive analysis, the next step is to
identify and prepare the data you’ll use to build your model. The general sequence of steps
looks like this:
1. Identify your data sources.
Data could be in different formats or reside in various locations.
2. Identify how you will access that data.
Sometimes, you would need to acquire third-party data, or data owned by a different division
in your organization, etc.
3. Consider which variables to include in your analysis.
One standard approach is to start off with a wide range of variables and eliminate the ones that offer no predictive value for the model.
4. Determine whether to use derived variables.
In many cases, a derived variable (such as the price-to-earnings ratio used to analyze stock prices) has a greater direct impact on the model than the raw variable would.
5. Explore the quality of your data, seeking to understand both its state and limitations.
The accuracy of the model’s predictions is directly related to the variables you select and the
quality of your data. You would want to answer some data-specific questions at this point (a quick sketch of such checks follows the list):
o Is the data complete?
o Does it have any outliers?
o Does the data need cleansing?
o Do you need to fill in missing values, keep them as they are, or eliminate them
altogether?
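A quick sketch of such checks in pandas, assuming the data has been loaded from a hypothetical customers.csv file:

```python
import pandas as pd

# Load the data from one of the identified sources (file name is illustrative)
df = pd.read_csv("customers.csv")

# Is the data complete? Count missing values per variable
print(df.isna().sum())

# Does it have any outliers? A numeric summary highlights extreme minima and maxima
print(df.describe())

# Does the data need cleansing? e.g., duplicated records or inconsistent categories
print(df.duplicated().sum())
print(df.select_dtypes(include="object").nunique())

# Missing values: how many cases would listwise deletion remove?
print(len(df) - len(df.dropna()))
```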

Understanding your data and its properties can help you choose the algorithm that will be
most useful in building your model. For example:

- Regression algorithms can be used to analyze time-series data.
- Classification algorithms can be used to analyze discrete data.
- Association algorithms can be used for data with correlated attributes.
The dataset used to train and test the model must contain business information relevant to the problem you're trying to solve. If your goal is (for example) to determine which
customer is likely to churn, then the dataset you choose must contain information about
customers who have churned in the past in addition to customers who have not.

Some models created to mine data and make sense of its underlying relationships — for
example, those built with clustering algorithms — need not have a particular end result in
mind.

Two problems arise when dealing with data as you’re building your model: underfitting and
overfitting.
Underfitting
Underfitting occurs when your model can't detect any relationships in your data. This is usually an
indication that essential variables — those with predictive power — weren’t included in your
analysis. For example, a stock analysis that includes only data from a bull market (where
overall stock prices are going up) doesn’t account for crises or bubbles that can bring major
corrections to the overall performance of stocks.
Failing to include data that spans both bull and bear markets (when overall stock prices are
falling) keeps the model from producing the best possible portfolio selection.
Overfitting
Overfitting occurs when your model picks up patterns that have no predictive power and are specific only to the dataset you're analyzing. Noise (random variation in the dataset) can find its way into the model, such that running the model on a different dataset produces a major drop in its predictive performance and accuracy. The accompanying sidebar provides an example.
If your model performs just fine on a particular dataset and only underperforms when you test
it on a different dataset, suspect overfitting.
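A brief sketch of how this shows up in practice, using scikit-learn on simulated data: an unconstrained decision tree memorizes the training set almost perfectly but scores noticeably lower on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Simulated data: a few informative features plus many noise features
X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree can memorize noise in the training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A large gap between training and test accuracy is the classic sign of overfitting
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```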
In addition to the above requirements, the needs of all possible users of the database must be taken care of, as we can observe from the following college student data example.

Overall data

| Sno | Roll No | Student Name | Father Name | Mother Name | Address | Student Adhar Card | Caste | Gender | Blood Gp | Height | Weight | Contact Number | 10th Marks | 12th Marks |
|-----|---------|--------------|-------------|-------------|---------|--------------------|-------|--------|----------|--------|--------|----------------|------------|------------|
| 1 | 19201 | Ram | Sh. Hari | Smt. Puspa | Panipat | 123654987 | G | M | O | 5'10" | 67 | 1234567890 | 82 | 85 |
| 2 | 19210 | Shyam | Sh. Kalesh | Smt. Radha | Sonipat | 321456789 | BC | M | B+ | 5'8" | 71 | 9087654321 | 85 | 72 |

Data required during admission time in MBA

| Sno | Roll No | Student Name | Father Name | Mother Name | Address | Student Adhar Card | Caste | Gender | 10th Marks | 12th Marks | Contact Number | Graduation Marks |
|-----|---------|--------------|-------------|-------------|---------|--------------------|-------|--------|------------|------------|----------------|------------------|
| 1 | 19201 | Ram | Sh. Hari | Smt. Puspa | Panipat | 123654987 | G | M | 82 | 85 | 1234567890 | 72 |
| 2 | 19210 | Shyam | Sh. Kalesh | Smt. Radha | Sonipat | 321456789 | BC | M | 85 | 72 | 9087654321 | 78 |

Data required in the gym

| Sno | Student Name | Father Name | Mother Name | Address | Gender | Blood Gp | Height | Weight | Contact Number |
|-----|--------------|-------------|-------------|---------|--------|----------|--------|--------|----------------|
| 1 | Ram | Sh. Hari | Smt. Puspa | Panipat | M | O | 5'10" | 67 | 1234567890 |
| 2 | Shyam | Sh. Kalesh | Smt. Radha | Sonipat | M | B+ | 5'8" | 71 | 9087654321 |

Data required in the college placement cell

| Sno | Roll No | Student Name | Father Name | No of Interviews Cleared | Offer Letter | Contact Number |
|-----|---------|--------------|-------------|--------------------------|--------------|----------------|
| 1 | 19201 | Ram | Sh. Hari | 2 | 1 | 1234567890 |
| 2 | 19210 | Shyam | Sh. Kalesh | 2 | 0 | 9087654321 |

Data required by a friend

| Student Name | Father Name | Mother Name | Address | Contact Number |
|--------------|-------------|-------------|---------|----------------|
| Ram | Sh. Hari | Smt. Puspa | Panipat | 1234567890 |
| Shyam | Sh. Kalesh | Smt. Radha | Sonipat | 9087654321 |
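A small sketch in pandas of how the same master table can serve each of these user groups through simple column selection (only part of the example data is reproduced):

```python
import pandas as pd

# Master student table holding every attribute collected by the college
students = pd.DataFrame({
    "Sno": [1, 2],
    "Roll No": [19201, 19210],
    "Student Name": ["Ram", "Shyam"],
    "Father Name": ["Sh. Hari", "Sh. Kalesh"],
    "Mother Name": ["Smt. Puspa", "Smt. Radha"],
    "Address": ["Panipat", "Sonipat"],
    "Gender": ["M", "M"],
    "Blood Gp": ["O", "B+"],
    "Height": ["5'10\"", "5'8\""],
    "Weight": [67, 71],
    "Contact Number": ["1234567890", "9087654321"],
})

# Each user group gets only the columns it needs, all derived from the same source
gym_view = students[["Sno", "Student Name", "Father Name", "Mother Name", "Address",
                     "Gender", "Blood Gp", "Height", "Weight", "Contact Number"]]
friend_view = students[["Student Name", "Father Name", "Mother Name",
                        "Address", "Contact Number"]]

print(gym_view, friend_view, sep="\n\n")
```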
