Professional Documents
Culture Documents
新增 Microsoft Word Document
新增 Microsoft Word Document
Non-Linear Regression aka Attributes Data Analysis is organized into discrete data types (groups
or categories.)
Simple Linear regression is use to predict the relationship between two variable (say x and y) with a
straight line y=mx+b. While non-linear regression aka Attributes Data Analysis is used to explain the
nonlinear relationship between a response variable and one or more than one predictor variable
(mostly curve line).
Specifically use non-linear regression instead of ordinary least square regression when one cannot
adequately model the relationship with linear parameters
DO NOT USE Linear Regression for attribute data analysis- because this assumes the response
variable is continuous.
For instance Analyze attributes data using logit, probit, logistic regression, etc., to investigate sources
of variation
Logistic Regression
Logistic regression is a method to predict a dependent variable for a given set of independent
variables, such that the dependent variable is categorical. This method is especially widely using in
machine learning algorithms.
Logistic regression similar to linear regression, however logistic regression predicts categorical (like
true or false) where as linear regression predicts continuous data. Hence the dependent variable in
logistic regression is categorical, but in linear regression dependent variable is always continuous.
Logistic regression uses a logit transformation on the dependent variable to fit a linear regression
model.
The outcome of logistic regression is always categorical, when the resultant outcome has always
only two possible value of 0 or 1. In logistic regression the graph is not a linear line, but the line
looks like a curve goes between 0 and 1. The curve is called “S” curve and also called as Sigmoid
curve.
Although the logistic regression tells the probability, it is most commonly used for classification.
Team has to determine the threshold value if the probability of event happened between 0 and 1,
then based on threshold value event to be classified as 0 or 1.
For example if the threshold value is 0.5, then any value between 0.5 and 1 then it should be
classified as 1, similarly any value below 0.5 then it should classify as 0
X1, X2 are the independent variables which determines the occurrence of an event
C is the constant
Non-linear regression models will be used in Analyze phase of DMAIC. For example American
Cancer Institute wants to predict chances of throat cancer based on the smoker’s average cigarettes
consuming per day. From the below random data, 0 indicates no cancer and 1 indicates throat
cancer .
From the above data logistic regression helps to assess what level of cigarette consumption leads
to throat cancer. Moreover, from the above graph one can say if number of cigarette consumption
is more than 8 it leads to throat cancer and if cigarette consumption is below 8 no cancer. Hence
logistic regression will be used in complex problem
Certainly we have to use statistical tools (like R, Minitab) to model the logistic regression.
Using the formula
Suppose if we get following values (random values) from the above information
C= -0.100245
B1= 0.1814
Since f(z ) value is 0.37 (towards zero), hence we can predict at 2.6 there will no chance of throat
cancer.
Logit Analysis
Logit regression is a model where dependent variable is qualitative, In other words dependent
variable is categorical. If the dependent variable is categorical in two parts it is binary logistic
regression, for more than two categories it is a multinomial logit regression.
β0 is the intercept and β1 ,β2,β3 are the slopes against independent variable
A cardiology doctor wants to predict the chances of life threat with the cholesterol. Patient alive and
also dead data collected after specific period having different cholesterol levels. The response is a
binary variable (i.e alive or dead) AND Find the logarithm base 10 for cholesterol
Calculate the proportion of dead patients= dead/(alive+dead)
While calculating the logit values, calculate the slope and intercept using logit values and Log values
of cholesterol.
We have to use statistical tools (like R, Minitab) to model the logistic regression.
Once the model was developed, Use sigmoid function to find whether the chances of patient dead
or alive for a particular cholesterol level.
Probit Analysis
The probit model was first introduced by Chester Bliss in 1934, but the maximum likelihood method
was proposed by Ronald Fisher as an appendix to Bliss in 1935.
A probit model is a popular specification for an ordinal or a binary response model. This model ,
which employs a probit link function, this was estimated using the maximum likelihood method,
hence this estimation was named it as probit regression.
A coaching center interested in students GMAT (Graduate Management Admission Test )score and
admission into the business school. The response is binary variable (i.e admit or no admission in to
the business school) AND First of all find the logarithm base 10 for GMAT Score
We have to use statistical tools (like R, Minitab) to model the logistic regression.
Once the model was developed, Use probit function to find whether the chances of student
admission or no admission into the business school for a particular GMAT score.
Both logit and probit models yields the almost same results. The logit and probit function are
increasing and both functions increase relatively quick in the central portion and relatively slow at
the extremities and both function lie between 0 and 1. While the only key difference between logit
and probit model is in the use of link function
In other words logit model have slightly flatter tails when compared to probit model.
MAKING SENSE OF THE BINARY LOGISTIC REGRESSION TOOL
In some situations, Six Sigma practitioners find a Y that is discrete and Xs that are continuous. How
can a regression equation be developed in these cases? Black Belt training indicated that the correct
technique is something called logistic regression but this tool is often not well understood. An
example about a well-known space shuttle accident can help to demystify logistic regression using
the simplest logistic regression – binary logistic regression, where the Y has just two potential
outcomes (i.e., “yes” or “no,” or 0 or 1).
The data in Table 1 comes from the Presidential Commission on the Space Shuttle Challenger
Accident (1986). The data consists of the number of the flight, the air temperature at the time of the
launch and whether or not there was damage to the booster rocket field joints (no = 0, yes = 1).
Using normal regression and given a particular temperature at launch time, this data can be used
to calculate the probability of damage to the booster rocket field joints.
A stratified dot plot can be used to graphically display the data. It is obvious from Figure 2 that the
probability of damage is greater at lower temperatures. However, there is quite a fair bit of overlap
in the distribution. Is launch temperature a real X (i.e., a real predictor of damage)? And if so, what
is the probability of damage for any given launch temperature?
Any regression requires a continuous output or Y. However, in this case the Y is discrete with only
two categories or two events: Damage – yes or no. What to do? The “trick” behind the logistic
regression is to turn the discrete output into a continuous output by calculating the probability
(p) for the occurrence of a specific event. That means, the logistic regression provides a model to
predict the p for a specific event for Y (here, the damage of booster rocket field joints, p = P[Y=1])
given any value of X (here, the temperature at the time of the launch). The logistic regression
equation has the form:
This function is the so-called “logit” function where this regression has its name from. The
procedure for modeling a logistic model is determining the actual percentages for an event as a
function of the X and finding the best constant and coefficients fitting the different percentages.
This is exactly the equation that comes out of statistical software’s output for logistics regression:
Step 3. Check Validity of Regression Model
There are two major checks that need to be done before it can said with confidence that this
model is valid (please refer to the session output):
This is done to obtain an answer to the question, given a particular setting of X, what is the
probability of failure? Reversing this, the result is:
On the day of the Challenger incident, the temperature was 31 degree Fahrenheit. Hence, the
probability of damage to the booster rocket field joints on that day is:
The event probability for all the possible temperature settings can be obtained by using statistical
software. In Minitab software, for example, one must go to “Storage” and check the “Event
Probability” box. The output is illustrated in Table 2.
Using this data, the scatter plot (decreasing logistic plot) in Figure 2 can be produced.