Data200_W5

Week 5
Fitting the Model
Saksham Joshi
Presidential Graduate School, Kathmandu
DATA200: Applied Statistical Analytics
Professor Das
April 7, 2024
Quiz 1: Predictive Modeling Using Logistic Regression
Lesson 2: Fitting the Model
Theoretical Concept
Topic: Logistic Regression Model
Regression methods have become an integral component of any data analysis concerned with
describing the relationship between a response variable and one or more explanatory variables. Quite
often the outcome variable is discrete, taking on two or more possible values. The logistic regression
model is the most frequently used regression model for the analysis of these data.
Before beginning a thorough study of the logistic regression model, it is important to understand
that the goal of an analysis using this model is the same as that of any other regression model used in
statistics, that is, to find the best fitting and most parsimonious, clinically interpretable model to
describe the relationship between an outcome (dependent or response) variable and a set of independent
(predictor or explanatory) variables. The independent variables are often called covariates (Hosmer et
al., 2013, p.1). The most common example of modeling, and one assumed to be familiar to the readers
of this text, is the usual linear regression model where the outcome variable is assumed to be
continuous.
What distinguishes a logistic regression model from the linear regression model is that the
outcome variable in logistic regression is binary or dichotomous. This difference between logistic and
linear regression is reflected both in the form of the model and its assumptions (Hosmer et al., 2013,
p.1). Once this difference is accounted for, the methods employed in an analysis using logistic
regression follow, more or less, the same general principles used in linear regression. Thus, the
techniques used in linear regression analysis motivate our approach to logistic regression. We illustrate
both the similarities and differences between logistic regression and linear regression with an example.
In the case of the linear regression model, the link function is the identity function, as the dependent
variable, by definition, is linear in the parameters (Hosmer et al., 2013, p.50). For a linear regression
model, the slope coefficient, β1 is equal to the difference between the value of the dependent variable
at x+1 and the value of the dependent variable at x, for any value of x. For example, the linear
regression model at x is y(x) = β0 + β1x. It follows that the slope coefficient is β1 = y(x+1) – y(x). In
this case, the interpretation of the slope coefficient is that it is the change in the outcome variable
corresponding to a one-unit change in the independent variable.
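The slope identity above can be checked numerically. The following is a minimal sketch, assuming hypothetical coefficient values (BETA0 and BETA1 are chosen only for illustration and do not come from any fitted model); it shows that y(x+1) – y(x) equals β1 regardless of the value of x.

```python
# Hypothetical coefficients, chosen only for illustration (not from the source).
BETA0 = 2.0
BETA1 = 0.5


def y(x: float) -> float:
    """Linear regression function y(x) = beta0 + beta1 * x."""
    return BETA0 + BETA1 * x


# The difference y(x + 1) - y(x) is the slope beta1 at every x.
slope_differences = [y(x + 1) - y(x) for x in (-3.0, 0.0, 4.0, 10.0)]
print(slope_differences)  # prints [0.5, 0.5, 0.5, 0.5]
```

Because the model is linear in x, the one-unit difference does not depend on where it is evaluated, which is why a single number summarizes the effect of the independent variable.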
In the logistic regression model, the slope coefficient is the change in the logit corresponding to
a change of one unit in the independent variable [i.e., β1 = g(x+1) – g(x)]. Proper interpretation of the
coefficient in a logistic regression model depends on being able to place meaning on the difference
between two values of the logit function.
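One common way to place meaning on a difference between two logits is to exponentiate it, which gives the ratio of the odds at x+1 to the odds at x. The sketch below assumes a single-predictor model with logit g(x) = β0 + β1x and hypothetical coefficient values; the names probability and odds, and the coefficients themselves, are illustrative assumptions rather than anything taken from the source.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not from the source).
BETA0 = -1.0
BETA1 = 0.8


def logit(x: float) -> float:
    """Logit g(x) = beta0 + beta1 * x, the linear part of the model."""
    return BETA0 + BETA1 * x


def probability(x: float) -> float:
    """Probability of the outcome implied by the logit: 1 / (1 + exp(-g(x)))."""
    return 1.0 / (1.0 + math.exp(-logit(x)))


def odds(x: float) -> float:
    """Odds of the outcome at x: p / (1 - p)."""
    p = probability(x)
    return p / (1.0 - p)


x = 2.0
logit_difference = logit(x + 1) - logit(x)  # equals beta1
odds_ratio = odds(x + 1) / odds(x)          # equals exp(beta1)

print(logit_difference, odds_ratio, math.exp(BETA1))
```

Unlike the linear case, the change in the predicted probability for a one-unit change in x depends on the starting value of x; it is the logit difference, and therefore the odds ratio, that stays constant.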
Article Review
Source
Yang, W., Pan, C., & Zhang, Y. (2022). An oversampling method for imbalanced data based on spatial
distribution of minority samples SD-KMSMOTE. Scientific Reports (Nature Publishing Group), 12(1).
https://doi.org/10.1038/s41598-022-21046-1
Purpose
The issue of data imbalance has grown in importance in domains such as finance, networking, and
medical treatment because of the rapid growth of data, and oversampling is the method usually used
to address it. Nevertheless, most oversampling techniques currently in use either sample randomly
or sample only a particular region, which affects the classification results.
Findings
This paper proposes SD-KMSMOTE, an oversampling technique for imbalanced data based on the
spatial distribution of minority samples, as a solution to the aforementioned issues. A noise-filtering
pre-treatment is applied, the category information of nearby samples is taken into account, and noise
in the minority class samples is removed. These steps motivate a new sample synthesis technique, on
which the rules for computing the weight values are based. The spatial distribution of the minority
class samples is considered as a whole: the samples are clustered, and the sub-clusters with valuable
information are given higher weight values and more synthetic samples.
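The paper's full procedure (noise filtering, clustering, and weight computation) is not reproduced here. As a rough illustration of what synthesizing a minority sample can mean, the sketch below shows only the generic SMOTE-style interpolation step, in which a new point is placed on the line segment between a minority sample and its nearest minority neighbor. It is an assumption-laden toy, not SD-KMSMOTE: the function names, the toy data, and the use of a single nearest neighbor are all hypothetical choices made for this example.

```python
import math
import random


def nearest_minority_neighbor(sample, minority):
    """Return the minority point closest to `sample` (Euclidean distance)."""
    candidates = [m for m in minority if m is not sample]
    return min(candidates, key=lambda m: math.dist(sample, m))


def synthesize(sample, neighbor, rng):
    """Generic SMOTE-style step: sample + gap * (neighbor - sample), gap in [0, 1)."""
    gap = rng.random()
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# Toy minority-class points, invented for illustration only.
minority = [[1.0, 1.0], [1.5, 1.2], [4.0, 4.0]]
rng = random.Random(0)  # fixed seed so the run is repeatable

base = minority[0]
neighbor = nearest_minority_neighbor(base, minority)
synthetic = synthesize(base, neighbor, rng)
print(neighbor, synthetic)
```

A method that weights sub-clusters, as the paper describes, would additionally decide how many such synthetic points each region receives; that allocation logic is deliberately left out of this sketch.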
Value
The current growth in artificial intelligence has given rise to an imbalanced learning problem
that continues to attract more and more attention from researchers. In contrast to the majority of
oversampling methods currently in use, which sample randomly or only within a fixed region, this
study proposes SD-KMSMOTE, an oversampling method for imbalanced data based on the spatial
distribution of minority class samples. The work focuses on the spatial distribution of samples from
minority classes, increases the number of new synthetic samples allocated to the boundary zone,
allocates a specific number of new samples to be generated in the safe region, and improves
performance on the imbalanced learning problem.
References
Quiz 1:
Predictive Modeling Using Logistic Regression [Fitting the Model], [Date: 4/7/2024, 1:41 PM]
https://vle.sas.com/pluginfile.php/440287/mod_scorm/content/26/02/epmlr5102_3_b_quizx.htm
Theoretical:
Hosmer, D. W., Jr., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (pp. 1-50).
John Wiley & Sons, Incorporated.
Article Review:
Yang, W., Pan, C., & Zhang, Y. (2022). An oversampling method for imbalanced data based on spatial
distribution of minority samples SD-KMSMOTE. Scientific Reports (Nature Publishing Group), 12(1).
https://doi.org/10.1038/s41598-022-21046-1
