
Non-Linear Regression aka Attributes Data Analysis

What is Non-Linear Regression?

Non-linear regression, also known as attributes data analysis, is used when the response data is organized into discrete types (groups or categories).

When do you use Non-Linear Regression aka Attributes Data Analysis?

Simple linear regression is used to model the relationship between two variables (say x and y) with a
straight line, y = mx + b. Non-linear regression, aka attributes data analysis, is used to explain a
non-linear relationship between a response variable and one or more predictor variables
(typically a curve).

Linear Vs Nonlinear Regression

Specifically, use non-linear regression instead of ordinary least squares regression when the
relationship cannot be adequately modeled with linear parameters.

DO NOT USE linear regression for attribute data analysis, because it assumes the response
variable is continuous.

For instance, analyze attributes data using logit, probit, or logistic regression to investigate sources
of variation.

Logistic Regression

Logistic regression is a method to predict a dependent variable from a given set of independent
variables when the dependent variable is categorical. This method is especially widely used in
machine learning.

Logistic regression is similar to linear regression; however, logistic regression predicts a categorical
outcome (such as true or false), whereas linear regression predicts continuous data. Hence the
dependent variable in logistic regression is categorical, but in linear regression it is always continuous.
Logistic regression applies a logit transformation to the probability of the outcome so that a linear
model can be fitted.

The outcome of logistic regression is always categorical; in the binary case it takes only two
possible values, 0 or 1. The graph of logistic regression is not a straight line but a curve that
runs between 0 and 1. This curve is called the “S” curve, also known as the sigmoid curve.

Although logistic regression yields a probability, it is most commonly used for classification.
The team has to choose a threshold value between 0 and 1; the event is then classified as 0 or 1
depending on which side of the threshold its predicted probability falls.

For example, if the threshold value is 0.5, any probability between 0.5 and 1 is classified as 1,
and any probability below 0.5 is classified as 0.
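The threshold rule can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the sample probabilities are hypothetical, and a probability exactly at the threshold is assumed to count as 1.

```python
def classify(probability, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0

# Hypothetical predicted probabilities, for illustration only.
probabilities = [0.12, 0.49, 0.50, 0.87]
labels = [classify(p) for p in probabilities]
print(labels)  # [0, 0, 1, 1]
```

Raising the threshold (for example `classify(0.6, threshold=0.7)`) makes the rule stricter about labelling an event as 1.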

Logistic Regression calculation

P = 1 / (1 + e^-(C + B1X1 + B2X2))

P ranges between 0 and 1 and represents the probability that the event occurs.

X1, X2 are the independent variables that determine the occurrence of the event.

C is the constant.

B1, B2 are the respective coefficients of X1, X2.

Logistic Regression Example

Non-linear regression models are used in the Analyze phase of DMAIC. For example, the American
Cancer Institute wants to predict the chance of throat cancer based on a smoker’s average number of
cigarettes consumed per day. In the random data below, 0 indicates no cancer and 1 indicates throat
cancer.

From the above data, logistic regression helps to assess what level of cigarette consumption is
associated with throat cancer. From the above graph, one can say that consumption of more than 8
cigarettes goes with throat cancer and consumption below 8 goes with no cancer. Logistic regression
becomes necessary in more complex problems, where the split is not this clear-cut.

Statistical tools (such as R or Minitab) are needed to fit the logistic regression model, using the
formula from the calculation above.

Suppose we get the following (random) values from the above information:

C = -1.00245

B1 = 0.1814

If we want to predict for X1 = 2.6:

z = -1.00245 + 0.1814*2.6 = -0.53081

Use the sigmoid function to find the probability:

f(z) = 1 / (1 + e^-z) = 1 / (1 + e^0.53081) = 0.37

Since f(z) is 0.37, which is below the 0.5 threshold (closer to zero), we classify a consumption of
2.6 as no throat cancer.
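The arithmetic of this worked example can be checked with a short Python sketch. It assumes the constant is C = -1.00245, the value that reproduces z = -0.53081 with B1 = 0.1814 and an input of 2.6; the variable names are illustrative.

```python
import math

def sigmoid(z):
    """Map any real z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

C = -1.00245   # constant (assumed value that reproduces z = -0.53081)
B1 = 0.1814    # coefficient of X1
x1 = 2.6       # value to predict for

z = C + B1 * x1
p = sigmoid(z)
print(round(z, 5), round(p, 2))  # -0.53081 0.37
```

With a threshold of 0.5, a probability of 0.37 is classified as 0 (no throat cancer).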

Logit Analysis

Logit regression is a model in which the dependent variable is qualitative; in other words, the
dependent variable is categorical. If the dependent variable has two categories, it is a binary logistic
regression; for more than two categories, it is a multinomial logit regression.

Formula: P = f(z) = 1 / (1 + e^-z)

f is the cumulative distribution function of a standard logistic random variable.

z = β0 + β1x1 + β2x2 + … + βnxn

β0 is the intercept and β1, β2, …, βn are the slopes for the independent variables x1, x2, …, xn.

f is a function with 0 < f(z) < 1 for all z.

Logit analysis Example

A cardiologist wants to predict the risk of death from the cholesterol level. Data on patients alive
and dead after a specific period were collected at different cholesterol levels. The response is a
binary variable (i.e., alive or dead).

Find the base-10 logarithm of the cholesterol level.

Calculate the proportion of dead patients = dead/(alive+dead)

Calculate logit(p) = ln(p/(1-p))

After calculating the logit values, calculate the slope and intercept by regressing the logit values on
the log values of cholesterol.

We have to use statistical tools (like R or Minitab) to fit the logistic regression model.

Once the model is developed, use the sigmoid function to find the probability that a patient is dead
or alive at a particular cholesterol level.
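
The steps of this example can be sketched in Python. The grouped counts below are hypothetical (the source data are not reproduced here), and the sketch uses a plain least-squares line through the logit values rather than the maximum likelihood fit a statistics package would report.

```python
import math

# Hypothetical grouped data: (cholesterol level, patients alive, patients dead).
groups = [(160, 45, 5), (200, 40, 10), (240, 30, 20), (280, 18, 32)]

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

log_chol = [math.log10(level) for level, _, _ in groups]
proportions = [dead / (alive + dead) for _, alive, dead in groups]
logits = [logit(p) for p in proportions]
slope, intercept = least_squares(log_chol, logits)

def predicted_probability(cholesterol):
    """Sigmoid of the fitted line: estimated probability of death."""
    z = intercept + slope * math.log10(cholesterol)
    return 1.0 / (1.0 + math.exp(-z))

print(round(predicted_probability(220), 3))
```

The proportions must lie strictly between 0 and 1 for the logit to be defined, which is one reason software fits the model by maximum likelihood instead.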

Probit Analysis

The probit model was first introduced by Chester Bliss in 1934, and the maximum likelihood method
for it was proposed by Ronald Fisher in an appendix to Bliss's work in 1935.

A probit model is a popular specification for an ordinal or a binary response model. The model
employs a probit link function and is typically estimated using the maximum likelihood method;
this estimation is called probit regression.

P = f(z)

z = β0 + β1x1 + β2x2 + … + βnxn

where f is the standard normal cumulative distribution function (cdf).

Probit Analysis Example

A coaching center is interested in the relationship between students' GMAT (Graduate Management
Admission Test) scores and admission into business school. The response is a binary variable (i.e.,
admission or no admission into the business school).

First, find the base-10 logarithm of the GMAT score.

Calculate the proportion of no admission = no admission/(admission + no admission)

Calculate the probit values using the chart.

After calculating the probit values, calculate the slope and intercept by regressing the probit values
on the log values of GMAT score.

We have to use statistical tools (like R or Minitab) to fit the probit regression model.

Once the model is developed, use the probit function to find the probability of a student's
admission or no admission into the business school for a particular GMAT score.
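
A rough Python sketch of these steps follows. The grouped counts are hypothetical, and the probit value is taken here as the inverse standard normal cdf of the proportion (printed probit charts traditionally add 5 to this value to avoid negative numbers). As in the logit example, a simple least-squares line stands in for the maximum likelihood fit.

```python
import math
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Hypothetical grouped data: (GMAT score, admitted, not admitted).
groups = [(550, 4, 36), (600, 10, 30), (650, 20, 20), (700, 31, 9)]

log_score = [math.log10(score) for score, _, _ in groups]
p_no_admit = [no / (yes + no) for _, yes, no in groups]
probits = [standard_normal.inv_cdf(p) for p in p_no_admit]

# Least-squares slope and intercept of probit versus log10(GMAT score).
n = len(groups)
mean_x = sum(log_score) / n
mean_y = sum(probits) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(log_score, probits))
         / sum((x - mean_x) ** 2 for x in log_score))
intercept = mean_y - slope * mean_x

def probability_no_admission(gmat_score):
    """Standard normal cdf of the fitted line: estimated P(no admission)."""
    z = intercept + slope * math.log10(gmat_score)
    return standard_normal.cdf(z)

print(round(probability_no_admission(620), 3))
```

The only change from the logit sketch is the link: `inv_cdf` and `cdf` of the standard normal replace the logit and sigmoid functions.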

Difference between logit and probit

Both logit and probit models yield almost the same results. Both functions are increasing, rise
relatively quickly in the central portion and relatively slowly at the extremes, and lie between
0 and 1. The only key difference between the logit and probit models is the link function used.

In other words, the logit model has slightly flatter tails than the probit model.
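
The similarity of the two curves, and the heavier logit tails, can be checked numerically. In this sketch the probit argument is divided by 1.702, a commonly used rescaling that lines the two curves up; that constant and the grid of z values are assumptions made for the illustration.

```python
import math
from statistics import NormalDist

def logistic_cdf(z):
    """Logit model response curve (standard logistic cdf)."""
    return 1.0 / (1.0 + math.exp(-z))

normal_cdf = NormalDist().cdf  # probit model response curve
SCALE = 1.702  # assumed rescaling so the two curves roughly coincide

grid = [i / 10 for i in range(-80, 81)]  # z from -8.0 to 8.0
max_gap = max(abs(logistic_cdf(z) - normal_cdf(z / SCALE)) for z in grid)

tail_logit = logistic_cdf(-6.0)
tail_probit = normal_cdf(-6.0 / SCALE)

print(f"largest gap between the curves: {max_gap:.4f}")
print(f"tail at z = -6: logit {tail_logit:.5f}, probit {tail_probit:.5f}")
```

The curves stay within about one percentage point of each other, while far from the center the logit curve approaches 0 and 1 more slowly than the probit curve.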

MAKING SENSE OF THE BINARY LOGISTIC REGRESSION TOOL

In some situations, Six Sigma practitioners find a Y that is discrete and Xs that are continuous. How
can a regression equation be developed in these cases? Black Belt training indicates that the correct
technique is logistic regression, but this tool is often not well understood. An
example about a well-known space shuttle accident can help to demystify logistic regression using
the simplest form, binary logistic regression, where the Y has just two potential
outcomes (i.e., “yes” or “no,” or 0 or 1).

The data in Table 1 comes from the Presidential Commission on the Space Shuttle Challenger
Accident (1986). The data consists of the number of the flight, the air temperature at the time of the
launch and whether or not there was damage to the booster rocket field joints (no = 0, yes = 1).

Using logistic regression and given a particular temperature at launch time, this data can be used
to calculate the probability of damage to the booster rocket field joints.

There are five steps to apply logistic regression.


Step 1. Graphically Visualize the Data

A stratified dot plot can be used to graphically display the data. It is obvious from Figure 1 that the
probability of damage is greater at lower temperatures. However, there is a fair bit of overlap
in the distribution. Is launch temperature a real X (i.e., a real predictor of damage)? And if so, what
is the probability of damage for any given launch temperature?

Figure 1: Stratified Dot Plot



Step 2. Formulate the Regression Model

Ordinary regression requires a continuous output, or Y. However, in this case the Y is discrete with
only two categories or two events: damage, yes or no. What to do? The “trick” behind logistic
regression is to turn the discrete output into a continuous output by calculating the probability
(p) for the occurrence of a specific event. That means logistic regression provides a model to
predict the p for a specific event for Y (here, the damage of booster rocket field joints, p = P[Y=1])
given any value of X (here, the temperature at the time of the launch). The logistic regression
equation has the form:

ln(p / (1 - p)) = β0 + β1X

where β0 is the constant and β1 is the coefficient of X.

This function is the so-called “logit” function, from which this regression takes its name.
Fitting a logistic model means determining the actual percentages of the event as a
function of X and finding the constant and coefficients that best fit those percentages.

This is exactly the equation that appears in statistical software output for logistic regression.
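
How software finds the best constant and coefficient can be sketched with a small maximum likelihood fit. The sketch below uses Newton's method on a tiny hypothetical data set (not the Challenger data); the function name `fit_logistic` and the fixed iteration count are assumptions, and real packages add convergence checks and standard errors.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Fit ln(p/(1-p)) = b0 + b1*x by Newton's method (maximum likelihood)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to b0
            g1 += (y - p) * x      # gradient with respect to b1
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: events (y = 1) are more frequent at low x, with overlap.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

b0, b1 = fit_logistic(xs, ys)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
print(round(b0, 3), round(b1, 3))
```

A negative `b1` means the event probability falls as X rises, which is the pattern the dot plot suggests for damage versus launch temperature.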

Step 3. Check Validity of Regression Model

There are two major checks that need to be done before it can be said with confidence that this
model is valid (please refer to the session output):

1. P-value for the coefficient is less than 0.05.


A p-value is calculated for each coefficient. If the p-value is low, there is a significant
relationship between the X variable and the Y. In this case, the coefficient for temperature has a
p-value of 0.032 (i.e., there is a significant relationship between the temperature and the probability
of damage).

2. P-values of the “goodness-of-fit” tests are greater than 0.05.

Goodness-of-fit tests are conducted to see whether the model adequately fits the actual situation.
Low p-values indicate a significant difference between the model and the observed data. Hence, the
p-values should be above 0.05 to show that there are no significant differences between the
predicted probabilities (from the model) and the observed probabilities (from the raw data). In this
case, none of the goodness-of-fit tests shows a significant difference, so the regression
model is valid.
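
The two checks can be written as a small decision rule. In the sketch below, the coefficient p-value is assumed to come from a Wald z-test (coefficient divided by its standard error), which is one common way software computes it; the coefficient, standard error, and goodness-of-fit p-values are hypothetical, and only the 0.032 figure comes from the text.

```python
from statistics import NormalDist

def wald_p_value(coefficient, standard_error):
    """Two-sided p-value for the hypothesis that the coefficient is zero."""
    z = coefficient / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def model_is_valid(coefficient_p, goodness_of_fit_ps, alpha=0.05):
    """Check 1: coefficient p below alpha. Check 2: every fit-test p above alpha."""
    return coefficient_p < alpha and all(p > alpha for p in goodness_of_fit_ps)

# Hypothetical coefficient and standard error, for illustration only.
p_coef = wald_p_value(coefficient=-0.20, standard_error=0.10)
print(round(p_coef, 4))  # 0.0455

# 0.032 is the temperature p-value quoted in the text; the fit-test
# p-values are hypothetical placeholders.
print(model_is_valid(0.032, [0.25, 0.40, 0.61]))  # True
```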

Step 4. Reverse the Logit Equation

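This is done to obtain an answer to the question: given a particular setting of X, what is the
probability of failure? Reversing the logit equation, with β0 the constant and β1 the coefficient of X
from the model, the result is:

p = 1 / (1 + e^-(β0 + β1X))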

On the day of the Challenger incident, the temperature was 31 degrees Fahrenheit. Hence, the
probability of damage to the booster rocket field joints on that day is:

Damage was almost a certainty.
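
Reversing the logit can be sketched as follows. The constant and coefficient here are hypothetical placeholders, not the fitted values from the session output, so the printed probabilities only illustrate the shape of the calculation.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Reverse of the logit: turn log-odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def damage_probability(temperature, constant, coefficient):
    """Probability of damage at a given launch temperature."""
    return inverse_logit(constant + coefficient * temperature)

# Hypothetical constant and coefficient, for illustration only.
CONSTANT = 10.0
COEFFICIENT = -0.15

for temperature in (31, 50, 70, 80):
    p = damage_probability(temperature, CONSTANT, COEFFICIENT)
    print(temperature, round(p, 3))
```

Evaluating the same function over a range of temperatures gives the event probabilities that Step 5 plots as a decreasing logistic curve.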

Step 5. Visualize the Results (Optional)

The event probability for all the possible temperature settings can be obtained by using statistical
software. In Minitab software, for example, one must go to “Storage” and check the “Event
Probability” box. The output is illustrated in Table 2.
Using this data, the scatter plot (decreasing logistic plot) in Figure 2 can be produced.

Figure 2: Scatter Plot of Damage Versus Temperature
