Analysis Model
Predictive modelling is aimed at developing tools that can be used for individual prediction of
the most likely value of a continuous measure, or the probability of the occurrence (or
recurrence) of an event. Developing such tools for predicting outcomes at the level of the individual unit has become increasingly popular.
In addition to the above discussion, the following points are also important when preparing data
for predictive modelling.
When you’ve defined the objectives of the model for predictive analysis, the next step is to
identify and prepare the data you’ll use to build your model. The general sequence of steps
looks like this:
1. Identify your data sources.
Data could be in different formats or reside in various locations.
2. Identify how you will access that data.
For example, you may need to acquire third-party data, data owned by a different division in
your organization, and so on.
3. Consider which variables to include in your analysis.
One standard approach is to start off with a wide range of variables and eliminate the ones
that offer no predictive value for the model.
4. Determine whether to use derived variables.
In many cases, a derived variable (such as the price-to-earnings ratio used to analyze stock
prices) has a greater direct impact on the model than the raw variables do.
5. Explore the quality of your data, seeking to understand both its state and limitations.
The accuracy of the model’s predictions is directly related to the variables you select and the
quality of your data. You would want to answer some data-specific questions at this point (a short sketch of these checks follows this list):
o Is the data complete?
o Does it have any outliers?
o Does the data need cleansing?
o Do you need to fill in missing values, keep them as they are, or eliminate them
altogether?
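As a hedged sketch of steps 4 and 5, the snippet below uses pandas to build a derived variable and run basic quality checks. The file name students.csv and the column names marks_10th and marks_12th are assumptions made for illustration, not fields from any particular dataset.

```python
# A minimal sketch of the data-preparation checks above, using pandas.
# The file name and column names are hypothetical.
import pandas as pd

# Steps 1-2: identify and access the data source (here, a local CSV file).
df = pd.read_csv("students.csv")

# Step 4: create a derived variable; a ratio of two raw columns is assumed here.
df["marks_ratio"] = df["marks_12th"] / df["marks_10th"]

# Step 5: explore quality -- completeness, missing values, possible outliers.
print(df.isna().sum())    # how many values are missing per column
print(df.describe())      # ranges that may reveal outliers

# Decide how to handle missing values: fill them, keep them, or drop them.
df["marks_10th"] = df["marks_10th"].fillna(df["marks_10th"].median())
df = df.dropna(subset=["marks_12th"])  # drop rows still missing this column
```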
Understanding your data and its properties can help you choose the algorithm that will be
most useful in building your model. For example:
Some models created to mine data and make sense of its underlying relationships — for
example, those built with clustering algorithms — need not have a particular end result in
mind.
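For instance, here is a minimal sketch, assuming scikit-learn is available, of such an exploratory model: k-means clustering groups similar records together without any target variable or predefined end result. The data is synthetic and purely illustrative.

```python
# Mining structure without a predefined outcome: k-means clustering
# groups similar records; no target variable is needed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose groups of points in a 2-D feature space (synthetic data).
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(model.labels_[:10])       # cluster assignment for the first few records
print(model.cluster_centers_)   # the discovered group centres
```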
Two problems arise when dealing with data as you’re building your model: underfitting and
overfitting.
Underfitting
Underfitting is when your model can’t detect any relationships in your data. This is usually an
indication that essential variables — those with predictive power — weren’t included in your
analysis. For example, a stock analysis that includes only data from a bull market (where
overall stock prices are going up) doesn’t account for crises or bubbles that can bring major
corrections to the overall performance of stocks.
Failing to include data that spans both bull and bear markets (when overall stock prices are
falling) keeps the model from producing the best possible portfolio selection.
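As an illustrative sketch on synthetic data (not the stock example itself), a linear regression fitted without the essential variable cannot detect the underlying relationship, which shows up as a near-zero R-squared:

```python
# A sketch of underfitting: an essential variable is left out of the model,
# so it cannot detect the real relationship. Synthetic data for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
essential = rng.normal(size=(200, 1))    # variable with real predictive power
irrelevant = rng.normal(size=(200, 1))   # variable unrelated to the outcome
y = 3 * essential[:, 0] + rng.normal(scale=0.5, size=200)

with_essential = LinearRegression().fit(essential, y)
without_essential = LinearRegression().fit(irrelevant, y)

print(with_essential.score(essential, y))      # R^2 close to 1
print(without_essential.score(irrelevant, y))  # R^2 close to 0: underfitting
```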
Overfitting
Overfitting is when your model includes data that has no predictive power but is specific only
to the dataset you’re analyzing. Noise — random variations in the dataset — can find
its way into the model, such that running the model on a different dataset produces a major
drop in the model’s predictive performance and accuracy. The accompanying sidebar
provides an example.
If your model performs just fine on a particular dataset and only underperforms when you test
it on a different dataset, suspect overfitting.
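A minimal sketch of that check, assuming scikit-learn and synthetic data: an over-flexible polynomial model scores well on the data it was fitted to but drops noticeably on a fresh dataset drawn from the same process.

```python
# If a model looks good on the data it was fitted to but degrades on new data,
# suspect overfitting. Synthetic data and an over-flexible polynomial are used.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x_train = rng.uniform(-3, 3, size=(30, 1))
y_train = x_train[:, 0] ** 2 + rng.normal(scale=1.0, size=30)  # signal + noise
x_test = rng.uniform(-3, 3, size=(30, 1))
y_test = x_test[:, 0] ** 2 + rng.normal(scale=1.0, size=30)

# Degree-15 polynomial: flexible enough to chase the noise in the training set.
model = make_pipeline(PolynomialFeatures(degree=15),
                      LinearRegression()).fit(x_train, y_train)

print(model.score(x_train, y_train))  # high R^2 on the data it has seen
print(model.score(x_test, y_test))    # noticeably lower R^2 on fresh data
```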
In addition to the above, the requirements of all possible users of the database must also be taken care of, as we can observe from the following college student data
example.
Overall data
Sno | Roll No | Student Name | Father Name | Mother Name | Address | Student Adhar Card No. | Caste | Gender | Blood Gp | Height | Weight | Contact Number | 10th Marks | 12th Marks
1 | 19201 | Ram | Sh. Hari | Smt. Puspa | Panipat | 123654987 | G | M | O | 5'10" | 67 | 1234567890 | 82 | 85
2 | 19210 | Shyam | Sh. Kalesh | Smt. Radha | Sonipat | 321456789 | BC | M | B+ | 5'8" | 71 | 9087654321 | 85 | 72
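As a sketch of how such overall data can serve different users of the database, the snippet below (with simplified, hypothetical column names) pulls a different subset of columns for each office; a predictive model would likewise keep only the variables with predictive value, as in step 3 above.

```python
# Different users of the same overall student data need different views.
# Column names are simplified and hypothetical.
import pandas as pd

students = pd.DataFrame({
    "RollNo": [19201, 19210],
    "StudentName": ["Ram", "Shyam"],
    "Marks10th": [82, 85],
    "Marks12th": [85, 72],
    "BloodGroup": ["O", "B+"],
    "ContactNumber": ["1234567890", "9087654321"],
})

# Examination cell: identification plus marks.
print(students[["RollNo", "StudentName", "Marks10th", "Marks12th"]])

# Medical room: identification plus blood group and contact details.
print(students[["RollNo", "StudentName", "BloodGroup", "ContactNumber"]])
```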