
Predictive Analytics

INTRODUCTION
What and Why of Analytics

• Analytics is a journey that combines skills, advanced technologies, applications, and processes used by firms to gain business insights from data and statistics, in order to support business planning.
WHAT IS PREDICTIVE ANALYTICS?

• Predictive analytics is about using historical data to make predictions.
Ex: our credit score.

• While predictive analytics has been used for years in financial services, it is now an integral part of many industries and businesses. The massive increase in data collection abilities and the widespread availability of commodity hardware are the two trends that have made the spread of predictive modeling a reality.
Why now?

• Growing volumes and types of data, and more interest in using data to produce valuable insights.
• Faster, cheaper computers.
• Easier-to-use software.
• Tougher economic conditions and a need for competitive differentiation.
• Predictive analytics is no longer just the domain of mathematicians and statisticians; business analysts and line-of-business experts are using these technologies as well.
Places where Analytics is used
Business Applications of Predictive Analytics
Reporting vs. Analytics:

• Reporting presents the results of data analysis, while analytics is the process or system involved in analyzing data to obtain a desired output.
Predictive Analytics Process Cycle
Introduction to Tools and Environment:

• Data science and analytics are used by manufacturing companies as well as real estate firms to develop their business and solve various issues with the help of historical databases.
• Tools are the software that can be used for analytics, such as SAS or R, while techniques are the procedures to be followed to arrive at a solution.
• Various steps involved in analytics (see the sketch after this list):
• Access
• Manage
• Analyze
• Report
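As a minimal sketch of these four steps in R (the data here is a made-up toy data frame, and the output file name is hypothetical; in practice the Access step would read from a file or database):

# Access: in practice read.csv("sales.csv") or a database query;
# a toy data frame stands in here.
sales <- data.frame(
  units   = c(10, 15, NA, 20, 25, 30),
  price   = c(5, 5, 6, 6, 7, 7),
  revenue = c(52, 74, 80, 118, 171, 205)
)

# Manage: drop incomplete rows before modeling.
sales <- na.omit(sales)

# Analyze: fit a simple model of revenue on units and price.
fit <- lm(revenue ~ units + price, data = sales)

# Report: summarize the fit and export the coefficients
# (hypothetical output file).
print(summary(fit))
write.csv(coef(summary(fit)), "model_report.csv")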
Predictive Analytics Tools in Market
Various analytics techniques are:

• Data Preparation
• Reporting, Dashboards & Visualization
• Segmentation
• Forecasting
• Descriptive Modeling
• Predictive Modeling
• Optimization
Why is predictive analytics important?
COMMON APPLICATIONS

• Customer relationship management (CRM)
• Detecting outliers and fraud
• Anticipating demand
• Improving processes
• Building recommendation engines
• Improving time-to-hire and retention
Application of Modeling in Business:
• A statistical model embodies a set of assumptions concerning the generation of the observed data, and of similar data from a larger population.
• A model represents, often in considerably idealized form, the data-generating process.
• Signal processing is an enabling technology that encompasses the fundamental theory, applications, algorithms, and implementations of processing or transferring information contained in many different physical, symbolic, or abstract formats broadly designated as signals.
• It uses mathematical, statistical, computational, heuristic, and linguistic representations, formalisms, and techniques for representation, modeling, analysis, synthesis, discovery, recovery, sensing, acquisition, extraction, learning, security, or forensics.
• In manufacturing, statistical models are used to define warranty policies, solve various conveyor-related issues, perform statistical process control, etc.
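For instance, one basic statistical process control check computes 3-sigma control limits from historical measurements; a small R sketch with simulated data:

# Simulated historical measurements from a stable process
set.seed(1)
measurements <- rnorm(100, mean = 50, sd = 2)

# Classic 3-sigma control limits
center <- mean(measurements)
ucl <- center + 3 * sd(measurements)   # upper control limit
lcl <- center - 3 * sd(measurements)   # lower control limit

# Flag new observations that fall outside the limits
new_obs <- c(49.8, 57.3, 50.5)
new_obs > ucl | new_obs < lcl
# returns FALSE TRUE FALSE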
Databases & Types of Data and Variables:
• A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format".

• The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

• A document describing a database or collection of databases

• An integral component of a DBMS that is required to determine its structure

• A piece of middleware that extends or supplants the native data dictionary of a DBMS

• Data can be categorized on various parameters, such as category and type.

• Data is of two types: numeric and character. Numeric data can be further divided into the subgroups discrete and continuous.

• Data can also be divided into two categories: nominal and ordinal.

• Also, based on usage, data is divided into two categories: quantitative and qualitative.

• Manufacturing industries also have their data divided into the groups discussed above: production quantity is discrete data while production rate is continuous data; similarly, quality parameters can be given ratings, which are ordinal data.
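A short R sketch of these categories (the values are made up for illustration):

production_qty  <- c(120L, 118L, 125L)   # numeric, discrete (counts)
production_rate <- c(14.2, 13.9, 14.5)   # numeric, continuous
machine_id      <- c("M1", "M2", "M3")   # character

# Nominal: categories with no inherent order
shift <- factor(c("day", "night", "day"))

# Ordinal: ordered categories, e.g. quality ratings
quality <- factor(c("low", "high", "medium"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)
str(quality)   # Ord.factor w/ 3 levels "low"<"medium"<"high"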
DATA MODELING
• A data model is a conceptual representation of the data structures that are required by a database.

• To use a common analogy, the data model is equivalent to an architect's building plans.

• A data model is independent of hardware or software constraints.
IMPORTANCE OF DATA MODELS

• Data models are representations, usually graphical, of complex real-world data structures.
• They facilitate interaction among the designer, the applications programmer, and the end user.
• End users have different views of, and needs for, the data.
• A data model organizes data for the various users.
TYPES OF DATA MODELS
• File Based Approach
• Hierarchical Model
• Network Model
• Relational Model
• ER Model
• Object Oriented Model
• Object Relational Model
• Deductive / Inference Model
FILE BASED APPROACH
• A collection of unrelated files and a collection of application programs that perform services for the end users, such as the production of reports. Each program defines and manages its own data.

1. Traditionally, each department in a company would maintain its own collection of files.

2. The data processing department would write programs for each application that each office needed performed.
LIMITATIONS OF THE FILE-BASED APPROACH
• Duplication of data
• Incompatible file formats
• Data dependence
• Fixed queries / proliferation of application programs
• Inability to generate timely reports
HIERARCHICAL MODEL
• Oldest database model (1950s).
• The tree structure is the most frequently occurring relationship; data elements are organized as tabular rows.

Advantages
• Simplicity
• Data security
• Data integrity
• Efficiency when the database contains a large number of relations

Disadvantages
• Implementation complexity
• Database management problems: maintenance is difficult
• Lack of structural independence
• Programming complexity
Network Model
• Graph structure
• Allows more connections between nodes
• Ex: an employee who works for two departments is not possible in the hierarchical model, but is possible here

Advantages
• Conceptual simplicity
• Handles more relationships
• Ease of data access
• Data integrity
• Data independence
• Database standards

Disadvantages
• System complexity
• Absence of structural independence
Relational Model
• Data is stored in the form of tables
• Each table → an application entity
• Each row → an instance of that entity
• SQL serves as a uniform interface for users, providing a collection of standard expressions for storing and retrieving data
• Most popular database model

Advantages
• Structural independence
• Conceptual simplicity
• Ease of design, implementation, maintenance, and use
• Powerful and flexible query capability
• Easy to use
The main highlights of the relational model
• Data is stored in tables called relations.
• Relations can be normalized.
• In normalized relations, the stored values are atomic.
• Each row in a relation is unique.
• Each column in a relation contains values from the same domain.
Data Modeling Techniques
Overview:
• Regression analysis mainly focuses on finding a relationship between a dependent variable and one or more independent variables.
• It predicts the value of a dependent variable based on the value of at least one independent variable.
• It explains the impact of changes in an independent variable on the dependent variable.

Y = f(X, β)
where Y is the dependent variable,
X is the independent variable, and
β is the vector of unknown parameters (coefficients).
Types of regression models are as below:
Linear Regression
• A common technique to determine how one variable of interest is affected by another.
• It is used for three main purposes:
• Describing the linear dependence of one variable on the other.
• Predicting values of one variable from the other (the one with more data).
• Correcting for the linear dependence of one variable on the other.
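A minimal R example of fitting a simple linear regression on simulated data:

# Simulate data where y depends linearly on x plus noise
set.seed(42)
x <- 1:50
y <- 3 + 0.5 * x + rnorm(50, sd = 2)

# Fit the model y = b0 + b1*x and inspect the estimates
fit <- lm(y ~ x)
summary(fit)

# Predict y for new values of x
predict(fit, newdata = data.frame(x = c(55, 60)))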
Cluster Analysis:

• Cluster analysis is the process of forming groups of related variables for the purpose of drawing important conclusions based on the similarities within each group.
• The greater the similarity within a group, and the greater the difference between groups, the more distinct the clustering.
• Often there are no assumptions about the underlying distribution of the data.
• The reason for taking such an approach is that the objects in a group are similar to one another and different from the objects in other groups, which makes it easy to find patterns.
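As an illustration, here is k-means (one common clustering method, not the only one) applied to R's built-in iris measurements:

# Cluster the numeric columns of the built-in iris data
features <- scale(iris[, 1:4])   # standardize so no variable dominates

set.seed(7)
km <- kmeans(features, centers = 3, nstart = 25)

# Cluster sizes, and how clusters align with the known species
km$size
table(km$cluster, iris$Species)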
Time Series:
• Time series data is an ordered sequence of observations on a quantitative variable measured over equally spaced time intervals.
• Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and other fields.

Time series analysis is used for:
• Analyzing time series data.
• Forecasting the future value of the variable under consideration.
• In time series analysis it is assumed that the data consist of a set of identifiable components plus random errors, which usually make the pattern difficult to identify.
E.g., sales of quilts and blankets in a store across a period of five years.
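A small R sketch using the built-in monthly AirPassengers series to separate components and produce a forecast (Holt-Winters smoothing is used here just as one common method):

# Built-in monthly airline passenger counts, 1949-1960
plot(decompose(AirPassengers))   # trend, seasonal, and random components

# Forecast 12 months ahead with Holt-Winters exponential smoothing
hw <- HoltWinters(AirPassengers)
predict(hw, n.ahead = 12)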
Missing Imputations:
• In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.

• To test whether there are any missing values in a dataset, we use the is.na() function.

For example, we define y and then check whether there are any missing values. TRUE means that the value is missing.

y <- c(1, 2, 3, NA)
is.na(y)
# returns the vector FALSE FALSE FALSE TRUE
Arithmetic functions on missing values yield missing values. For example:

x <- c(1, 2, NA, 3)
mean(x)
# returns NA
To remove missing values from our dataset we use the na.omit() function. For example, we can create a new dataset without missing data as below:

newdata <- na.omit(mydata)

Or, we can pass na.rm=TRUE as an argument to the function. Continuing the example above, we use na.rm and get the desired result:

x <- c(1, 2, NA, 3)
mean(x, na.rm=TRUE)
# returns 2
MICE Package -> Multiple Imputation by Chained Equations

MICE uses PMM to impute missing values in a dataset.

PMM -> Predictive Mean Matching (PMM) is a semi-parametric imputation approach. It is similar to the regression method, except that for each missing value it fills in a value drawn randomly from among the observed donor values of observations whose regression-predicted values are closest to the regression-predicted value for the missing value under the simulated regression model.
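A sketch of imputing with the mice package (assuming the package is installed; the data frame below is made up):

library(mice)

# Toy data frame with missing values
df <- data.frame(
  age    = c(25, 30, NA, 41, 37, NA, 52, 29),
  income = c(40, NA, 55, 62, NA, 48, 70, 38)
)

# Impute with predictive mean matching; m = 5 imputed datasets
imp <- mice(df, m = 5, method = "pmm", seed = 123)

# Extract the first completed dataset
complete(imp, 1)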
