
OPERATIONAL FOUNDATION OF

STATISTICS

Exequiel R. Gono Jr., PhD


Professional School
Topic Outline

◦ Review of Basic Statistics
◦ Class Orientation
◦ Topic Outline
◦ Requirements
Role of Regression Analysis in Research

◦ We want to model a phenomenon that influences human behavior.


◦ Most inferential statistical procedures in social science research are derived
from a general family of statistical models called the general linear model
(GLM). A model is an estimated mathematical equation that can be used to
represent a set of data, and linear refers to a straight line.
Chain of Reasoning for Inferential Statistics

[Figure: chain-of-reasoning diagram linking Population, Sample, and data through Selection, Measure, Inference, and Probability.]

Are our inferences valid? The best we can do is to calculate the probability about our inferences.

Wisdom of the crowd
General Linear Model
The General Linear Model (GLM) is a useful framework for comparing how
several variables affect different continuous variables. In its simplest form, the GLM is
described as:

Data = Model + Error (Rutherford, 2001, p.3)

The formula of the GLM:


Scatter Plot
When Should I Use Regression Analysis?

◦ Use regression analysis to describe the relationships between a set
of independent variables and the dependent variable.
◦ Regression analysis produces a regression equation where the
coefficients represent the relationship between each independent
variable and the dependent variable.
◦ You can also use the equation to make predictions.
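
As a rough illustration of the three points above, the sketch below fits a least-squares line to a small made-up data set, prints the regression equation, and uses it for a prediction. The variable names (`hours`, `score`) and the numbers are hypothetical and are not taken from the slides.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: hours studied (independent) and exam score (dependent).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

b0, b1 = fit_line(hours, score)
print(f"score = {b0:.2f} + {b1:.2f} * hours")

# Use the equation to make a prediction for a new value of the predictor.
predicted = b0 + b1 * 6
print(f"predicted score for 6 hours: {predicted:.2f}")

# Data = Model + Error: each observed score is a fitted value plus a residual.
fitted = [b0 + b1 * h for h in hours]
residuals = [s - f for s, f in zip(score, fitted)]
print("residuals:", [round(e, 2) for e in residuals])
```

The slope is the coefficient that describes the relationship: here it is the expected change in score for one additional hour of study.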
Why Regression Analysis?

[Figure: two diagrams showing Predictor A, Predictor B, and Predictor C in relation to the Criterion. Left panel: less accurate, weaker prediction. Right panel: more accurate, stronger prediction.]


Popular Research Design in Social Research

Case Studies
Experimental
Action Research
Field Surveys
Ethnography
Secondary Data
FGD/KII
Bhattacherjee (2012). Social Science Research: Principles, Methods, and Practices. University of South Florida
Use of Regression to Analyze a Wide Variety of Relationships

◦ Model multiple independent variables
◦ Assess interaction of independent variable and dependent variable
◦ Include continuous and categorical variables
◦ Use polynomial terms to model curvature

Sample Study

Do socio-economic status and race affect educational achievement?
• IV- Socio-Economic Status and Race
• DV- Educational Achievement

Do education and IQ affect earnings?
• IV- Education and IQ
• DV- Earnings

Do exercise habits and diet affect weight?
• IV- Exercise Habits and Diet
• DV- Weight
The Research Process

[Figure: flowchart of the research process with the stages Initial Observation (Research Question), Generate Theory, Generate Hypothesis, Identify the Variables, Measure Variables, Collect the Data, Analyze the Data (Graph Data, Fit a Model), and Conclusion.]



Operational Foundation

[Figure: Operational Foundation diagram with the labels: Level (possible score, e.g., Sex has 2 levels); Score/Observation; Group (groups of participants); Model (abstract equation); "By chance"; Participants (people who participate in the study); True characteristics of the population; Statistic (quantitative value of the sample).]

Levels of Measurement

Nominal- describes the lowest level of measurement, where numbers are used to quantify and qualify
Ordinal- the score represents some rank order
Interval- the scoring rules are such that the spacing between scores reflects equal amounts of the variable
Ratio- a level of measurement that differs from the interval level only in that it has a true zero, so negative values are not allowed
Independent and Dependent Variable
One variable affects another. In those hypotheses, the variable that is affected by the other is
labeled the dependent variable because it depends on the other variable. The variable doing the
affecting, the supposed causal variable, is called the independent variable.

Predictor and Criterion Variable
Some hypotheses suppose only that variables are related to each other. In those cases we
distinguish between the roles of variables in the design by using the term predictor variable for
the one that is kind of "independenty" and criterion variable for the one we are trying to explain
or predict.

[Figure: Independent Variable → Dependent Variable; Predictor Variable → Criterion Variable]
Type of Statistics

Frequency analysis
• is used when one is interested in the distribution of one or more variables in a single sample
   2 categories: Binomial Test
   >2 categories: Chi-Square

Group comparisons
• analyses that compare groups to each other
   2 categories: t test
   >2 categories: F test

Repeated measures analyses
• use data from groups that have been measured more than once, usually on the same variable
• The comparisons are usually across time
   ANOVA (repeated)
   Time Series

Correlational analyses
• involve one group of participants that have been measured on more than one variable
   2 categories: Logistic
   Interval: Linear Regression
Introduction to Statistical Learning

What is Statistical Learning?


◦ Statistical learning refers to a vast set of tools for
understanding data.
• These tools can be classified as supervised or
unsupervised.
Supervised (input and output)
Unsupervised (Input)
Suppose we are statistical consultants hired by a client to provide advice on how to improve
sales of a particular product. We have sales data of the product in 200 different markets,
along with advertising budgets for three different media: TV, radio, and newspaper.
◦ The advertising budgets are input variables while sales is an
output variable. The input variables are typically denoted using
the symbol X, with a subscript to distinguish them.
◦ The inputs are also called predictors, independent variables,
features, or sometimes just variables.
◦ Output variables are also called response or dependent variables.
◦ In a general set-up, we have p different predictors, 𝑋1, 𝑋2, … , 𝑋𝑝.
◦ We assume that there is a relationship between 𝑌 and 𝑿 = (𝑋1, 𝑋2, … ,
𝑋𝑝) , which can be written in a general form
𝑌 = 𝑓 (𝑿) + 𝜀
◦ 𝑓() is some unknown function that represents the systematic
information that X provides about Y.
◦ 𝜀 is a random error term
In essence, statistical learning refers to a set of approaches for
estimating 𝑓.
We estimate 𝑓 for two main reasons: prediction and inference.

Prediction- Using our estimate for 𝑓, which we denote by f̂, we
obtain the predicted values of Y: Ŷ = f̂(X)

Inference- Here our goal is not so much predicting 𝑌 as
understanding how 𝑌 changes as a function of 𝑿.
Regression Versus Classification

We refer to problems with a quantitative response as regression
problems.
Problems involving a qualitative response are referred to as
classification problems.

Researchers will focus on the response variable or the dependent
variable.

We tend to select statistical learning methods on the basis
of whether the response is quantitative or qualitative; i.e.,
we might use linear regression when it is quantitative and
logistic regression when it is qualitative.
Simple Linear Regression Analysis

A simple linear regression models a quantitative response Y as an approximately
linear function of 1 predictor variable X.

Predictor Variable/IV: 1
   Level of Measurement: Interval
   Number of Levels: Many
   Number of Groups: 1
Criterion Variable/Response/DV: 1
   Level of Measurement: Interval
   Number of Levels: Many
Measurement Occasions: 1 or 2
Sample Size: 108 + k (to determine the significant predictors); 100 + 8k (for the best fit model)
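
The two sample-size rules quoted above are simple arithmetic. The sketch below evaluates them for a few values of k, assuming that k denotes the number of predictors (the slide does not define it); the function name is hypothetical.

```python
def min_sample_sizes(k):
    """Minimum n under the two rules of thumb quoted above.

    Assumption: k is the number of predictors in the model.
    """
    return {
        "significant_predictors": 108 + k,  # rule quoted for finding significant predictors
        "best_fit_model": 100 + 8 * k,      # rule quoted for the best fit model
    }

for k in (1, 3, 5):
    print(k, min_sample_sizes(k))
```

Under these rules the best-fit-model requirement grows faster, so it becomes the binding one once there are two or more predictors.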
Research Design- Quantitative Research, Correlational Research, Predictive Causation Research
Objectives-
Prediction- Sometimes researchers want to predict a score in the future, such as administrators looking at
students' high school performance to guess what their college grade point averages will be. We tend to say that
we are predicting scores.
Inferences- Researchers are interested in exploring the relationship between two variables to understand
them better.

Statement of the Problem- Does the students' high school performance predict their college grade point average?
Primary Statistical Questions equation?
◦ How accurate are our guesses using the regression equation?
◦ Example of a Study That Would Use Simple Linear Regression
Colleges don't have enough room for every high school student who applies,
and admissions offices must use some information to try to guess who will
succeed in order to make their decisions. One popular predictor has always been
SAT scores. In the late 1960s, as the college population was changing, researchers
were interested in what the actual linear relationship was between scores on the
verbal section of the SAT, an interval level variable that ranged from 200 to 800,
and college grade point average (GPA) for the first year, which ranged from 0.00
to 4.00. They collected data on both variables from a sample of about 4,000
students.
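
One way to approach the question of how accurate the guesses are is to look at r, R², and the standard error of estimate. The sketch below computes them for eight made-up students; these numbers are hypothetical and are not the data from the study described above.

```python
from math import sqrt
from statistics import mean

# Hypothetical (made-up) scores for eight students; NOT the study's data.
sat_verbal = [420, 480, 510, 550, 600, 640, 700, 760]
gpa = [2.1, 2.4, 2.3, 2.9, 3.0, 3.2, 3.4, 3.7]

x_bar, y_bar = mean(sat_verbal), mean(gpa)
sxx = sum((x - x_bar) ** 2 for x in sat_verbal)
syy = sum((y - y_bar) ** 2 for y in gpa)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sat_verbal, gpa))

# Regression equation: predicted GPA = intercept + slope * SAT verbal score.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Correlation and proportion of variance in GPA explained by SAT verbal.
r = sxy / sqrt(sxx * syy)
r_squared = r ** 2

# Standard error of estimate: typical size of a prediction error, in GPA units.
residuals = [y - (intercept + slope * x) for x, y in zip(sat_verbal, gpa)]
sse = sum(e ** 2 for e in residuals)
see = sqrt(sse / (len(gpa) - 2))

print(f"GPA = {intercept:.3f} + {slope:.5f} * SAT")
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}, standard error of estimate = {see:.3f}")
```

A larger R² and a smaller standard error of estimate both indicate that the regression equation yields more accurate guesses.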
Multiple Linear Regression

A multiple linear regression models a quantitative response Y as an approximately
linear function of more than 1 predictor variable X.

Predictor Variable/IV: 2+
   Level of Measurement: Interval
   Number of Levels: Many
   Number of Groups: 1
Criterion Variable/Response/DV: 1
   Level of Measurement: Interval
   Number of Levels: Many
Measurement Occasions: 1
Sample Size: 108 + k (to determine the significant predictors); 100 + 8k (for the best fit model)
Research Design- Quantitative Research, Correlational Research, Predictive Causation Research
Objectives-
Prediction- Sometimes researchers want to predict a score in the future, such as administrators looking at
students' high school performance to guess what their college grade point averages will be. We tend to say that
we are predicting scores.
Inferences- Researchers are interested in exploring the relationship between two variables to understand
them better.

Statement of the Problem- What best-fit model can be derived from the relationship between Teaching Competence
and Academic Performance?
Primary Statistical Questions equation?
◦ What are the relative contributions of each predictor to the criterion variable?
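
One common way to compare relative contributions is through standardized coefficients (beta). The sketch below fits a two-predictor model by solving the normal equations in plain Python and then standardizes the slopes. The predictor `study_hours` and all of the numbers are hypothetical and only serve as an illustration.

```python
from statistics import stdev

def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(predictors, y):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]  # design matrix with intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data for eight students.
teaching_competence = [3.2, 3.8, 4.1, 2.9, 4.5, 3.6, 4.0, 3.3]
study_hours = [5, 8, 6, 4, 9, 7, 5, 6]
academic_performance = [78, 85, 86, 74, 92, 83, 84, 79]

predictors = [teaching_competence, study_hours]
coefs = ols(predictors, academic_performance)

# Standardized coefficient: slope rescaled by sd(predictor) / sd(criterion).
sd_y = stdev(academic_performance)
betas = [b * stdev(x) / sd_y for b, x in zip(coefs[1:], predictors)]

print("unstandardized:", [round(c, 3) for c in coefs])
print("standardized betas:", [round(b, 3) for b in betas])
```

Because the betas are on a common scale, the predictor with the larger absolute beta contributes more to the prediction in this sample.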
Statistical Assumptions

◦ Linear relationship
◦ Multivariate normality
◦ No or little multicollinearity
◦ No auto-correlation
◦ Homoscedasticity
Linear relationship
◦ First, linear regression needs the relationship between the independent and dependent variables to be
linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The
linearity assumption can best be tested with scatter plots; the following two examples depict cases
where no linearity and little linearity are present.
Multivariate normality
This assumption can best be checked with a histogram or a Q-Q plot.
Normality can also be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov
test. When the data are not normally distributed, a non-linear
transformation (e.g., log-transformation) might fix this issue.
Multivariate normality

If the sig value < 0.05, the distribution is not normal.

If the sig value > 0.05, the distribution is normal.

Null hypothesis- The data do not deviate from a normal distribution.
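
The decision rule above can be sketched in a few lines. The example below also computes the Kolmogorov-Smirnov distance D between a sample and a normal curve fitted to it. It is only a sketch: the sample values are hypothetical, and turning D into a sig value is left to statistical software (when the mean and standard deviation are estimated from the same sample, the textbook K-S sig value is only approximate).

```python
from statistics import NormalDist, mean, stdev

def ks_distance_normal(sample):
    """Largest gap between the sample's empirical CDF and a fitted normal CDF."""
    data = sorted(sample)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, value in enumerate(data):
        cdf = fitted.cdf(value)
        # The empirical CDF jumps from i/n to (i+1)/n at this value; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def normality_decision(sig, alpha=0.05):
    """Apply the rule above: reject normality only when sig is below alpha."""
    if sig < alpha:
        return "reject the null hypothesis: not normal"
    return "retain the null hypothesis: no evidence against normality"

# Hypothetical residuals from a regression.
residuals = [-2.1, -1.3, -0.8, -0.4, -0.1, 0.2, 0.5, 0.9, 1.4, 1.7]
print(f"D = {ks_distance_normal(residuals):.3f}")

# The sig values below are illustrative inputs, as read from a software output.
print(normality_decision(0.200))
print(normality_decision(0.012))
```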
Multicollinearity
Multicollinearity may be tested with three central criteria:
1) Correlation matrix: correlation coefficients among the independent variables need to be smaller than 1.
2) Tolerance: the tolerance measures the influence of one independent
variable on all other independent variables. Tolerance is defined as T = 1 - R²
for this first-step regression analysis (the regression of one independent variable
on all the others). With T < 0.1 there might be
multicollinearity in the data, and with T < 0.01 there certainly is.
3) Variance Inflation Factor (VIF): the variance inflation factor of the
linear regression is defined as VIF = 1/T. With VIF > 5 there is an indication
that multicollinearity may be present; with VIF > 10 there is certainly
multicollinearity among the variables.
◦ There is no multicollinearity since T > 0.01 and VIF < 5.
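
The tolerance and VIF definitions above can be sketched for the simplest case of two predictors, where the R² of the first-step regression is just the squared correlation between them. The predictor values are hypothetical; the VIF cut-offs are the ones quoted above.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equally long lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def tolerance_and_vif(r_squared):
    """T = 1 - R^2 and VIF = 1 / T for one first-step regression."""
    tolerance = 1 - r_squared
    return tolerance, 1 / tolerance

def vif_flag(vif):
    """Classify a VIF value with the cut-offs quoted above (5 and 10)."""
    if vif > 10:
        return "multicollinearity"
    if vif > 5:
        return "possible multicollinearity"
    return "no indication of multicollinearity"

# Hypothetical values of two predictors for eight cases.
predictor_a = [610, 720, 540, 880, 930, 450, 700, 810]
predictor_b = [2100, 2500, 1900, 2700, 3100, 1800, 2200, 2900]

r = pearson_r(predictor_a, predictor_b)
tolerance, vif = tolerance_and_vif(r ** 2)
print(f"r = {r:.3f}, T = {tolerance:.3f}, VIF = {vif:.2f}: {vif_flag(vif)}")
```

With more than two predictors, R² would instead come from regressing each predictor on all of the others.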
How to solve:

◦ If multicollinearity is found in the data, centering the data
(that is, deducting the mean of the variable from each
score) might help to solve the problem. However, the
simplest way to address the problem is to remove
independent variables with high VIF values.
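
Centering tends to help most when the collinearity comes from a squared or product term built from a predictor. The sketch below, using a hypothetical predictor x with values 1 to 10, compares the correlation between x and x² before and after centering.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equally long lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical predictor and its squared term (e.g., for modelling curvature).
x = list(range(1, 11))
x_squared = [v ** 2 for v in x]
r_raw = pearson_r(x, x_squared)

# Centering: deduct the mean of the variable from each score, then square.
x_mean = mean(x)
x_centered = [v - x_mean for v in x]
x_centered_squared = [v ** 2 for v in x_centered]
r_centered = pearson_r(x_centered, x_centered_squared)

print(f"correlation of x with x^2 before centering: {r_raw:.3f}")
print(f"correlation of x with x^2 after centering:  {r_centered:.3f}")
```

Centering does not remove collinearity between two genuinely distinct predictors that are highly correlated; in that case dropping a variable, as noted above, is the simpler remedy.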
No auto-correlation
◦ Autocorrelation occurs when the residuals are not independent from each other. For instance, this
typically occurs in stock prices, where the price is not independent from the previous price.
◦ Use the Durbin-Watson test.
◦ Durbin-Watson's d tests the null hypothesis that the residuals are not linearly auto-correlated. While d can
assume values between 0 and 4, values around 2 indicate no autocorrelation. As a rule of thumb, values of
1.5 < d < 2.5 show that there is no auto-correlation in the data.
◦ However, the Durbin-Watson test only analyses linear autocorrelation and only between direct neighbours,
which are first order effects.

There is no auto-correlation when 1.5 < d < 2.5.
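
The d statistic itself is the sum of squared differences between neighbouring residuals divided by the sum of squared residuals. A minimal sketch, with hypothetical residuals listed in time order:

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2), with residuals in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def no_autocorrelation(d):
    """Rule of thumb quoted above: 1.5 < d < 2.5."""
    return 1.5 < d < 2.5

# Hypothetical residuals from a regression on ten consecutive observations.
residuals = [0.5, -0.3, 0.2, -0.6, 0.4, 0.1, -0.2, 0.3, -0.5, 0.1]
d = durbin_watson(residuals)
print(f"d = {d:.3f}, no auto-correlation by the rule of thumb: {no_autocorrelation(d)}")
```

Residuals that barely change from one observation to the next push d towards 0, while residuals that flip sign every time push it towards 4.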
Methods of Regressions
◦ Forced Entry (default)- All independent variables are entered into the
equation in one step.
◦ Stepwise Methods- Stepwise methods include or remove one independent
variable at each step, based (by default) on the probability of F (p-value);
alternatively, the F value can be used instead.
◦ Hierarchical (Blockwise entry)- Predictors are selected based on past
work, and the experimenter decides in which order to enter them into the model.
The Multiple Linear Regression Analysis in
SPSS
Research Problem.
This example is based on the FBI's 2006 crime statistics. In particular, we are interested in the
relationship between size of the state, various property crime rates, and the number of murders in
the city. It is our hypothesis that less violent crimes open the door to violent crimes. We also
hypothesize that even when we account for some effect of the city size by comparing crime rates per
100,000 inhabitants, there still is an effect left.
◦ Conceptual Framework

Independent Variable
1. Motor vehicle theft
2. Burglary
3. Larceny Theft
4. Residence Population

Dependent Variable
Murder
Results

This shows the multiple linear regression model summary and overall fit statistics. We find that the adjusted R²
of our model is .398 with the R² = .407. This means that the linear regression explains 40.7% of the variance in
the data. The Durbin-Watson d = 2.074, which is between the two critical values of 1.5 < d < 2.5. Therefore, we
can assume that there is no first order linear auto-correlation in our multiple linear regression data.

If we had forced all variables (Method: Enter) into the linear regression model, we would have
seen a slightly higher R² and adjusted R² (.458 and .424, respectively).
◦ The next output table is the F-test. The linear regression’s F-test has the null hypothesis that
the model explains zero variance in the dependent variable (in other words R² = 0). The F-test
is highly significant, thus we can assume that the model explains a significant amount of the
variance in murder rate.
In our stepwise multiple linear regression analysis, we find a non-significant intercept but highly significant vehicle theft
coefficient, which we can interpret as: for every 1-unit increase in vehicle thefts per 100,000 inhabitants, we will see .014
additional murders per 100,000.

If we force all variables into the multiple linear regression, we find that only burglary and motor vehicle theft are
significant predictors. We can also see that motor vehicle theft has a higher impact than burglary by comparing the
standardized coefficients (beta = .507 versus beta = .333).
Table 18. Empirical Analysis on the Indicator's Influence of Spiritual Programs towards
Spiritual Development

Variables | Unstandardized B | Std. Error | Standardized Beta | t | Sig.
Constant | 1.538 | 0.336 | | 4.714 | 0
Beliefs about the Church | 0.101 | 0.039 | 0.164 | 2.589 | 0.01*
Beliefs about my life | -0.382 | 0.121 | -0.31 | -3.17 | 0.002*
The Practice of Worship | 0.026 | 0.086 | 0.025 | 0.305 | 0.761
The Practice of Prayer | 0.391 | 0.105 | 0.309 | 3.738 | 0.00*
The Practice of Fellowship | 0.202 | 0.444 | 0.289 | 4.623 | 0.00*

F Value = 13.369; P Value = 0.00; Adjusted R Square = 0.271; R Value = 0.524

Table 18 shows the empirical analysis of the influence of the spiritual programs on spiritual development. Using
Multiple Linear Regression Analysis, the model was a significant fit (F value = 13.369, P value = 0.00). This means that the regression model
results in significantly better prediction of spiritual development than the mean value. Further, around 27.1% of the variability in
spiritual development can be explained by the spiritual programs.
The indicators Beliefs about the Church, Beliefs about my Life, The Practice of Prayer, and The Practice of Fellowship
significantly predict the spiritual development of the students in San Pedro College.
Regression Analysis Using Dummy Variable
ANOVA VS Regression
Regression Analysis

ANOVA
Simple Logistic Regression

It predicts the probability that an observation falls into one of
two categories of a dichotomous dependent variable based on one
independent variable that can be either continuous or categorical.

Predictor Variable: 1
   Level of Measurement: Nominal +
   Number of Levels: 2+
   Number of Groups: 1
Criterion Variable: 1
   Level of Measurement: Nominal
   Number of Levels: 2
Measurement Occasions: 1
Sample Size: n = 100 + 50i, where i is the number of predictors
Research Design- Quantitative Research , Correlational
Research, Predictive Causation Research

Primary Statistical Question
For those at each level (or at each score or each category) of the independent
variable, what are the probabilities that they will be in each category on the
dependent variable?
◦ Example of a Study That Would Use Simple Logistic Regression
A survey was administered to 1,431 inhabitants of a seaside community. As an
independent variable, the amount of fish eaten regularly was assessed ("Think of all the
meals you eat in a week; how many usually include fish?"). To simplify interpretation,
the researchers chose a "cut score" on the independent variable and created a nominal
independent variable with two levels. In this example, the researchers decided that the
key point on the independent variable was whether villagers ate two or more fish meals
a week. Anything less than that, and they were "infrequent fish eaters". As a dependent
variable, the survey included items from a depression measure. Scores above an accepted
point on the depression scale were interpreted as indicating depression.
Multiple Logistic Regression

It predicts the probability that an observation falls into one of
two categories of a dichotomous dependent variable based on 2 or more
independent variables that can be either continuous or categorical.

Predictor Variable: 2+
   Level of Measurement: Nominal +
   Number of Levels: 2+
   Number of Groups: 1
Criterion Variable: 1
   Level of Measurement: Nominal
   Number of Levels: 2
Measurement Occasions: 1
Sample Size: n = 100 + 50i, where i is the number of predictors
Multiple Logistic Regression
Primary Statistical Question
For those at each level (or at each score or each category) of the independent
variables, what are the probabilities that they will be in each category on the
dependent variable?
Example of a Study That Would Use Multiple Logistic Regression
Researchers were interested in the dangers of tanning in terms of the risk of getting skin cancer.
They recruited two types of people, those who had skin cancer and those who did not, and
formed one large group. Two independent variables that should theoretically be risk factors for
skin cancer were chosen and measured at the nominal level with two levels. They were type of
job (outdoors or indoors) and ability to tan (good tanner or bad tanner). The dependent variable,
presence or absence of skin cancer, was nominal with two levels.
Assumptions
Assumption #1: Your dependent variable should be measured on
a dichotomous scale.
Assumption #2: You have one or more independent variables, which can be
either continuous (i.e., an interval or ratio variable) or categorical (i.e.,
an ordinal or nominal variable)
Assumption #3: You should have independence of observations and the dependent
variable should have mutually exclusive and exhaustive categories.
Assumption #4: There needs to be a linear relationship between any continuous
independent variables and the logit transformation of the dependent variable.
Interpretation

This is the chi-square statistic and its significance level. It plays the same role as the F test in Multiple Linear Regression.
The significance level is the probability of obtaining this chi-square statistic (65.588) if there is in fact no effect of the independent
variables, taken together, on the dependent variable.

Cox & Snell R Square and Nagelkerke R Square – These are pseudo R-squares.  Logistic regression does not have an equivalent to the
R-squared that is found in OLS regression; however, many people have tried to come up with one.  There are a wide variety of pseudo-R-
square statistics (these are only two of them).  Because this statistic does not mean what R-squared means in OLS regression (the
proportion of variance explained by the predictors), we suggest interpreting this statistic with great caution.
Predicted – These are the predicted values of the dependent variable based on the full logistic regression model.  This
table shows how many cases are correctly predicted (132 cases are observed to be 0 and are correctly predicted to be
0; 27 cases are observed to be 1 and are correctly predicted to be 1), and how many cases are not correctly predicted
(15 cases are observed to be 0 but are predicted to be 1; 26 cases are observed to be 1 but are predicted to be 0).

Overall Percentage – This gives the overall percent of cases that are correctly predicted by the model (in this case,
the full model that we specified).  As you can see, this percentage has increased from 73.5 for the null model to 79.5
for the full model.
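
The percentages quoted above can be recomputed from the four cell counts. The sketch below does so, and adds the per-category percentages correct; the null-model figure assumes, as is usual, that the null model predicts the more frequent observed category for every case.

```python
# Cell counts quoted in the classification table described above.
observed0_predicted0 = 132  # correctly predicted to be 0
observed0_predicted1 = 15   # observed 0 but predicted 1
observed1_predicted0 = 26   # observed 1 but predicted 0
observed1_predicted1 = 27   # correctly predicted to be 1

observed0 = observed0_predicted0 + observed0_predicted1
observed1 = observed1_predicted0 + observed1_predicted1
total = observed0 + observed1

# Overall percentage correct for the full model.
overall_pct = 100 * (observed0_predicted0 + observed1_predicted1) / total

# Null model (assumption): every case is predicted to be in the larger category.
null_pct = 100 * max(observed0, observed1) / total

# Percentage correct within each observed category.
pct_correct_0 = 100 * observed0_predicted0 / observed0
pct_correct_1 = 100 * observed1_predicted1 / observed1

print(f"n = {total}")
print(f"null model: {null_pct:.1f}%, full model: {overall_pct:.1f}%")
print(f"correct among observed 0: {pct_correct_0:.1f}%")
print(f"correct among observed 1: {pct_correct_1:.1f}%")
```

The per-category figures show that the model classifies the observed 0 cases far more accurately than the observed 1 cases, which the overall percentage alone hides.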
These are the values for the logistic regression equation for predicting the dependent variable from the independent
variables. They are in log-odds units. Similar to OLS regression, the prediction equation is
log(p/(1-p)) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
where p is the probability of being in honours composition. Expressed in terms of the variables used in this example,
the logistic regression equation is
log(p/(1-p)) = -9.561 + 0.098*read + 0.066*science + 0.058*ses(1) - 1.013*ses(2)

Solving for p gives
p = e^(-9.561 + 0.098*read + 0.066*science + 0.058*ses(1) - 1.013*ses(2)) / (1 + e^(-9.561 + 0.098*read + 0.066*science + 0.058*ses(1) - 1.013*ses(2)))
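
The equation above can be turned into a small function that converts log-odds into a probability. The sketch assumes that ses(1) and ses(2) are 0/1 dummy indicators; the student scores passed in are hypothetical.

```python
from math import exp

def honours_probability(read, science, ses1, ses2):
    """Probability from the logistic regression equation quoted above.

    Assumption: ses1 and ses2 are 0/1 dummy indicators for ses(1) and ses(2).
    """
    log_odds = -9.561 + 0.098 * read + 0.066 * science + 0.058 * ses1 - 1.013 * ses2
    return exp(log_odds) / (1 + exp(log_odds))

# Hypothetical students: same science score and ses category, different reading scores.
p_low = honours_probability(read=50, science=55, ses1=0, ses2=0)
p_high = honours_probability(read=60, science=55, ses1=0, ses2=0)
print(f"read = 50: p = {p_low:.3f}")
print(f"read = 60: p = {p_high:.3f}")

# Odds ratio for a one-point increase in the reading score.
odds_ratio_read = exp(0.098)
print(f"odds ratio per reading point: {odds_ratio_read:.3f}")
```

Because the coefficients are in log-odds units, exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that variable.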
Thank You Very
Much!!!!
