



Regression

Qinghua Yang
Department of Communication Studies, Texas Christian University, Fort Worth, TX, USA

Regression is a statistical tool to estimate the relationship(s) between a dependent variable (y, or outcome variable) and one or more independent variables (x, or predicting variables; Fox 2008). More specifically, regression analysis helps in understanding how the variation in a dependent variable relates to variation in the independent variables while other confounding variable(s) are held constant. Regression analysis is widely used to make predictions and to estimate the conditional expectation of the dependent variable given the independent variables, where its use overlaps with the field of machine learning. Figure 1 shows how crime rate is related to residents' poverty level and can be used to predict the crime rate of a specific community. We know from this regression that there is a positive linear relationship between the crime rate (y axis) and residents' poverty level (x axis). Given the poverty index of a specific community, we are able to predict the crime rate in that area.

Linear Regression

The estimation target of regression is a function that predicts the dependent variable based upon values of the independent variables, which is called the regression function. For a simple linear regression, the function can be represented as y_i = a + bx_i + e_i. The function for multiple linear regression is y_i = b_0 + b_1x_1 + b_2x_2 + ... + b_kx_k + e_i, where k is the number of independent variables. Regression estimation using ordinary least squares (OLS) selects the line with the lowest total sum of squared residuals. The proportion of the total variation (SST) that is explained by the regression (SSR) is known as the coefficient of determination, often referred to as R^2, a value ranging between 0 and 1 with a higher value indicating a better regression model (Keith 2015).

Regression, Figure 1: Linear regression of crime rate (y axis: crime) on residents' poverty level (x axis: poverty_sqrt)
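
To make the OLS estimation concrete, here is a minimal sketch in Python (not part of the original chapter; NumPy is assumed, and the numbers are invented, loosely echoing the poverty-and-crime example). It estimates a and b for y_i = a + bx_i + e_i and computes R^2 as SSR/SST:

    import numpy as np

    # Invented illustrative data: poverty index (x) and crime rate (y).
    x = np.array([0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.1, 2.5])
    y = np.array([3.1, 4.0, 5.2, 5.9, 6.8, 8.1, 8.7, 10.2])

    # OLS picks the line with the lowest total sum of squared residuals;
    # for one predictor the closed-form estimates are:
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    # Coefficient of determination: the proportion of the total
    # variation (SST) explained by the regression (SSR).
    y_hat = a + b * x
    sst = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((y_hat - y.mean()) ** 2)
    r_squared = ssr / sst

    print(f"y-hat = {a:.2f} + {b:.2f}x, R^2 = {r_squared:.3f}")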

Nonlinear Regression

In the real world, there are many more nonlinear functions than linear ones. For example, the relationship between x and y may be fitted by a quadratic function, as shown in Figure 2. There are in general two ways to deal with nonlinear models. First, nonlinear models can be approximated with linear functions. Both nonlinear functions in Figure 2 can be approximated by two linear functions split according to the slope: the first linear regression function runs from the beginning of the semester to the final exam, and the second function runs from the final to the end of the semester. Similarly, cubic, quartic, and more complicated regressions can also be approximated with a sequence of linear functions. However, analyzing nonlinear models in this way can produce large residuals and leave considerable variance unexplained. The second way, which is considered better in this respect, is to include nonlinear terms in the regression function, as in ŷ = a + b_1x + b_2x^2. As the graph of a quadratic function is a parabola, the parabola opens downward if b_2 < 0 and upward if b_2 > 0. Instead of having x^2 in the model, the nonlinearity can also be represented in many other ways, such as √x, ln(x), sin(x), cos(x), and so on. Which nonlinear model to choose, however, should be based both on theory or former research and on the R^2.

Regression, Figure 2: Nonlinear regression models (top panel: Anxiety; bottom panel: Confidence in the Subject; x axis in both panels: semester begins, mid-term, final, semester ends)
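
As a sketch of the second approach (invented data; NumPy assumed), the quadratic model ŷ = a + b_1x + b_2x^2 can be fitted by adding a squared column to the design matrix; the model remains linear in its coefficients, so OLS still applies:

    import numpy as np

    # Invented data following a downward-opening parabola plus noise.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2 + 3 * x - 0.4 * x**2 + rng.normal(0, 1, x.size)

    # Design matrix with an intercept, x, and the nonlinear term x^2.
    X = np.column_stack([np.ones_like(x), x, x**2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b1, b2 = coef

    # Here b2 < 0, so the fitted parabola opens downward.
    print(f"y-hat = {a:.2f} + {b1:.2f}x + {b2:.2f}x^2")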

Logistic Regression

When the outcome variable is dichotomous (e.g., yes/no, success/failure, survived/died, accept/reject), logistic regression is applied to predict the outcome variable. In logistic regression, we predict the odds or log-odds (logit) that a certain condition will or will not happen. Odds range from 0 to infinity and are the ratio of the chance of an event (p) to the chance of the event not happening, that is, p/(1 - p). Log-odds (logits) are transformed odds, ln[p/(1 - p)], and range from negative to positive infinity. The relationship predicting the probability of y from x follows an S-shaped curve, as shown in Figure 3; a curve of this shape is called a "logistic curve" and is defined as p(y_i) = exp(b_0 + b_1x_i + e_i) / [1 + exp(b_0 + b_1x_i + e_i)]. In this logistic regression, the value predicted by the equation is a log-odds, or logit. This means that when we run a logistic regression and obtain coefficients, the values the equation produces are logits. Odds are computed as exp(logit), and probability is computed as exp(logit) / [1 + exp(logit)]. Another model used to predict binary outcomes is the probit model; the difference between the logistic and probit models lies in the assumption about the distribution of errors: while the logit model assumes a standard logistic distribution of errors, the probit model assumes a normal distribution of errors (Chumney & Simpson 2006). Despite the difference in assumption, the predictive results of these two models are very similar. When the outcome variable has multiple categories, multinomial logistic regression or ordered logistic regression should be implemented, depending on whether the dependent variable is nominal or ordinal.

Regression, Figure 3: Logistic regression model (y axis: pass, from 0.00 to 1.00; x axis: X, from 0 to 10)
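
The logit-odds-probability conversions described above take only a few lines of Python; in the sketch below, the coefficients b_0 and b_1 are invented for illustration rather than taken from the figure:

    import math

    b0, b1 = -4.0, 0.8   # hypothetical fitted logistic coefficients
    x = 6.0

    # The fitted equation produces a logit (log-odds).
    logit = b0 + b1 * x
    # Odds = exp(logit), ranging from 0 to infinity.
    odds = math.exp(logit)
    # Probability = exp(logit) / (1 + exp(logit)), the S-shaped curve.
    p = odds / (1 + odds)

    print(f"logit = {logit:.2f}, odds = {odds:.2f}, p = {p:.3f}")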

Regression in Big Data

Due to the advanced technologies increasingly used in data collection and the vast amount of user-generated data, the amount of data will continue to increase at a rapid pace, along with a growing accumulation of scholarly works. This explosion of knowledge makes big data one of the new research frontiers, with an extensive number of application areas affected by big data, such as public health, social science, finance, geography, and so on. The high volume and complex structure of big data bring statisticians both opportunities and challenges. Generally speaking, big data is a collection of large-scale and complex data sets that are difficult to process and analyze using traditional data analytic tools. Inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, including supervised and unsupervised statistical learning (James, Witten, Hastie, & Tibshirani, 2013). Supervised statistical learning refers to a set of approaches for estimating a function f based on observed data points, in order to understand the relationship between Y and X = (X_1, X_2, ..., X_P), which can be represented as Y = f(X) + e. Since the two main purposes of the estimation are to make predictions and inferences, which regression modeling is widely used for, many classical statistical learning methods use regression models, such as linear, nonlinear, and logistic regression, with the selection of the specific regression model based on the research question and data structure. In contrast, in unsupervised statistical learning, there is no response variable to predict for every observation that can supervise our analysis (James et al. 2013).

Additionally, more methods have been developed recently, such as the Bayesian approach and Markov chain Monte Carlo (MCMC). The Bayesian approach, distinct from the frequentist approach, treats model parameters as random and models them via distributions. MCMC refers to statistical sampling investigations that involve generating sample data to obtain empirical sampling distributions, based on constructing a Markov chain that has the desired distribution (Bandalos & Leite 2013).
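
To illustrate the MCMC idea in miniature, the sketch below (invented data; a random-walk Metropolis sampler with a flat prior, one simple MCMC variant) treats a regression slope as random and draws from its posterior distribution:

    import numpy as np

    # Invented data: y = 2x + noise, with a known error SD of 1.
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 5, 40)
    y = 2.0 * x + rng.normal(0, 1, 40)

    def log_likelihood(b, sigma=1.0):
        resid = y - b * x
        return -0.5 * np.sum((resid / sigma) ** 2)

    # Random-walk Metropolis: the constructed Markov chain has the
    # posterior of the slope b as its desired stationary distribution.
    b_current, samples = 0.0, []
    for _ in range(5000):
        b_proposal = b_current + rng.normal(0, 0.1)
        # With a flat prior, accept with probability min(1, likelihood ratio).
        if np.log(rng.uniform()) < log_likelihood(b_proposal) - log_likelihood(b_current):
            b_current = b_proposal
        samples.append(b_current)

    posterior = np.array(samples[1000:])   # discard burn-in draws
    print(f"posterior mean of slope: {posterior.mean():.2f}")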
Cross-References

▶ Data Mining Algorithms
▶ Machine Learning
▶ Statistical Analysis
▶ Statistics

Further Readings

Bandalos, D. L., & Leite, W. (2013). Use of Monte Carlo studies in structural equation modeling research. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 625-666). Charlotte, NC: Information Age Publishing.
Chumney, E. C., & Simpson, K. N. (2006). Methods and designs for outcomes research. Bethesda, MD: ASHP.
Fox, J. (2008). Applied regression analysis and generalized linear models. Thousand Oaks, CA: Sage.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). New York, NY: Springer.
Keith, T. Z. (2015). Multiple regression and beyond: An introduction to multiple regression and structural equation modeling. New York, NY: Routledge.
