
UNIT 4

Introduction to Univariate and Multivariate Data Analysis

• Univariate analysis examines a single variable, while
multivariate analysis examines two or more variables.
• Most multivariate analysis involves a dependent variable
and multiple independent variables.
Figure: Job satisfaction (dependent variable) as influenced by multiple
independent variables: organisation climate and culture; company policies and
career growth; salary and wage benefits; promotion policies; work nature; safety
and welfare benefits.
Basics of Statistics

Definition: The science of the collection, presentation, analysis, and reasonable
interpretation of data.

Statistics provides a rigorous scientific method for gaining insight into data. For
example, suppose we measure the weight of 100 patients in a study. With so
many measurements, simply looking at the data fails to provide an informative
account. However, statistics can give an instant overall picture of the data, based
on graphical presentation or numerical summarization, irrespective of the
number of data points. Besides data summarization, another important task of
statistics is to make inferences and predict relations among variables.
A Taxonomy of Statistics
Statistical Description of Data
• Statistics describes a numeric set of data by its
• Center
• Variability
• Shape
• Statistics describes a categorical set of data by
• Frequency, percentage or proportion of each category
Age of the Respondents

Age            No. of Respondents   Percentage
Below 20 Yrs   50                   25.0
21-35 Yrs      75                   37.5
36-50 Yrs      25                   12.5
Above 50 Yrs   50                   25.0
Total          200                  100.0
Some Definitions
Variable - any characteristic of an individual or entity. A variable can take
different values for different individuals. Variables can be categorical or
quantitative. Per S. S. Stevens…
• Nominal - Categorical variables with no inherent order or ranking sequence, such as names
or classes (e.g., gender). Values may be numeric labels that carry no numerical meaning (e.g.,
I, II, III). The only operation that can be applied to nominal variables is enumeration.
• Ordinal - Variables with an inherent rank or order, e.g., mild, moderate, severe. Values can be
compared for equality, or greater or less, but not for how much greater or less.
• Interval - Values of the variable are ordered, as in ordinal, and additionally, differences
between values are meaningful; however, the scale is not absolutely anchored. Calendar
dates and temperatures on the Fahrenheit scale are examples. Addition and subtraction, but
not multiplication and division, are meaningful operations.
• Ratio - Variables with all the properties of interval plus an absolute, non-arbitrary zero point,
e.g., age, weight, temperature (Kelvin). Addition, subtraction, multiplication, and division are
all meaningful operations.
Some Definitions
Distribution - (of a variable) tells us what values the variable takes and how often it
takes these values.
• Unimodal - having a single peak
• Bimodal - having two distinct peaks
• Symmetric - left and right half are mirror images.
Frequency Distribution
Consider a data set of 26 children of ages 1-6 years. Then the frequency
distribution of the variable 'age' can be tabulated as follows:

Frequency Distribution of Age
Age         1   2   3   4   5   6
Frequency   5   3   7   5   4   2

Grouped Frequency Distribution of Age
Age Group   1-2   3-4   5-6
Frequency   8     12    6
Cumulative Frequency
Cumulative frequency of the data on the previous slide:

Age                    1   2   3    4    5    6
Frequency              5   3   7    5    4    2
Cumulative Frequency   5   8   15   20   24   26

Age Group              1-2   3-4   5-6
Frequency              8     12    6
Cumulative Frequency   8     20    26
Age            No. of Respondents   Percentage   Cumulative Percentage
Below 20 Yrs   5                    11.11        11.11
21-30          10                   22.22        33.33
31-40          10                   22.22        55.55
41-50          5                    11.11        66.66
Above 50 Yrs   15                   33.34        100.00
Total          45                   100.00
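Tables like these can be generated directly from raw data. A minimal Python sketch (assuming pandas is installed), using the 26-child age data from the frequency distribution above:

```python
# Frequency, cumulative frequency, and percentage table (minimal sketch)
import pandas as pd

ages = [1]*5 + [2]*3 + [3]*7 + [4]*5 + [5]*4 + [6]*2  # the 26 observations

freq = pd.Series(ages).value_counts().sort_index()
table = pd.DataFrame({
    "Frequency": freq,
    "Cumulative Frequency": freq.cumsum(),
    "Percentage": (freq / freq.sum() * 100).round(1),
})
print(table)
```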
Data Presentation
Two types of statistical presentation of data: graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking deviations
from that pattern. The overall pattern is usually described by the shape, center, and
spread of the data. An individual value that falls outside the overall pattern is called
an outlier.

Bar diagrams and pie charts are used for categorical variables.

Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
Data Presentation –Categorical Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals
who fall in each category.

Figure 1: Bar chart of subjects in treatment groups (x-axis: treatment group;
y-axis: number of subjects).

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.00              100
Data Presentation –Categorical Variable
Pie Chart: Lists the categories and presents the percent or count of individuals
who fall in each category.

Figure 2: Pie chart of subjects in treatment groups (group 1: 25%; group 2: 42%;
group 3: 33%).

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.00              100
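Both the bar diagram (Figure 1) and the pie chart (Figure 2) can be reproduced from the frequency table. A minimal sketch with matplotlib, taking the group labels and counts from the table above:

```python
# Bar chart and pie chart for a categorical variable (minimal sketch)
import matplotlib.pyplot as plt

groups = ["1", "2", "3"]
counts = [15, 25, 20]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.bar(groups, counts)
ax1.set_xlabel("Treatment Group")
ax1.set_ylabel("Number of Subjects")
ax2.pie(counts, labels=groups, autopct="%.1f%%")  # percents computed from counts
plt.show()
```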
Graphical Presentation –Numerical Variable

Histogram: The overall pattern can be described by its shape, center, and spread.
The following age distribution is right skewed. The center lies between 80 and
100. There are no outliers.

Figure 3: Age distribution (histogram; x-axis: age in months; y-axis: number of
subjects).

Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60
Graphical Presentation –Numerical Variable

Box-Plot: Describes the five-number summary

Figure 3: Distribution of age (box plot showing min, Q1, median, Q3, and max;
y-axis: age, 0-160).
Numerical Presentation
A fundamental concept in summary statistics is that of a central value for a set
of observations and the extent to which the central value characterizes the
whole set of data. Measures of central value such as the mean or median must
be coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.

To understand how well a central value characterizes a set of observations, let
us consider the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50, but the distance of the observations from
the mean in data set A is larger than in data set B. Thus, the mean of data
set B is a better representation of its data set than is the case for set A.
Methods of Center Measurement

Center measurement is a summary measure of the overall level of a dataset.

Commonly used methods are the mean, median, mode, geometric mean, etc.

Mean: Sum all the observations and divide by the number of observations. The
mean of 20, 30, 40 is (20+30+40)/3 = 30.

Notation: Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable $x$. Then
the mean of this variable is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Methods of Center Measurement

Median: The middle value in an ordered sequence of observations. That is,
to find the median we order the data set and then take the middle value.
For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data,
giving {3, 5, 6, 7, 9}, then choose the middle value, 6. If the number of
observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of
the two middle values from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.

Mode: The value that is observed most frequently. The mode is undefined
for sequences in which no observation is repeated.
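For a quick check of these definitions, Python's standard statistics module computes all three center measures. A minimal sketch reusing the examples above:

```python
# Mean, median, and mode (minimal sketch with the slide's examples)
import statistics

data = [9, 3, 6, 7, 5]
print(statistics.mean(data))                  # (9+3+6+7+5)/5 = 6.0
print(statistics.median(data))                # sorted: 3 5 6 7 9 -> 6
print(statistics.median([9, 3, 6, 7, 5, 2]))  # even n: (5+6)/2 = 5.5
print(statistics.mode([1, 2, 2, 3]))          # most frequent value: 2
```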
Mean or Median
The median is less sensitive to outliers (extreme scores) than the mean, and is
thus a better measure than the mean for highly skewed distributions, e.g.,
family income. For example, the mean of 20, 30, 40, and 990 is
(20+30+40+990)/4 = 270, while the median of these four observations is
(30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40, so the
mean, 270, fails to give a realistic picture of the major part of the data; it is
influenced by the extreme value 990.
Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods: range, variance, standard deviation, interquartile
range, coefficient of variation, etc.

Range: The difference between the largest and the smallest observations. The
range of 10, 5, 2, 100 is (100 - 2) = 98. It is a crude measure of variability.
Methods of Variability Measurement

Variance: The variance of a set of observations is the average of the squares of
the deviations of the observations from their mean. In symbols, the variance of
the $n$ observations $x_1, x_2, \ldots, x_n$ is

$$S^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5, and the variance is

$$\frac{(5-5)^2 + (3-5)^2 + (7-5)^2}{3-1} = 4$$

Standard Deviation: The square root of the variance. The standard deviation of
the above example is 2.
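The worked example can be verified in Python; statistics.variance uses the same n - 1 denominator as the formula above:

```python
# Sample variance and standard deviation of 5, 7, 3 (minimal sketch)
import statistics

data = [5, 7, 3]
print(statistics.variance(data))  # ((0)^2 + (2)^2 + (-2)^2) / (3-1) = 4
print(statistics.stdev(data))     # sqrt(4) = 2.0
```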
Methods of Variability Measurement

Quartiles: Data can be divided into four regions that cover the total range of
observed values. The cut points for these regions are known as quartiles.

In notation, the qth quartile of a data set is the (q(n+1)/4)th observation of the
ordered data, where q is the desired quartile and n is the number of observations.

The first quartile (Q1) cuts off the lowest 25% of the data. The second quartile
(Q2) lies between the 25th and 50th percentage points in the data; its upper
bound is the median. The third quartile (Q3) cuts off the 25% of the data lying
between the median and the 75% cut point in the data.

Q1 is the median of the first half of the ordered observations and Q3 is the
median of the second half of the ordered observations.
Methods of Variability Measurement

In the following example, Q1 = (1(15+1)/4) = 4th observation of the ordered
data. The 4th observation is 11, so Q1 of this data is 11.

An example with 15 numbers:

3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
      Q1           Q2            Q3

The first quartile is Q1 = 11. The second quartile is Q2 = 40 (this is also the
median). The third quartile is Q3 = 61.

Inter-quartile Range: The difference between Q3 and Q1. The inter-quartile range
of the previous example is 61 - 11 = 50. The middle half of the ordered data lies
between 11 and 61.
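A minimal Python sketch implementing the slide's (n+1)p position rule on the 15-number example (the quartile positions happen to be whole numbers here, so no interpolation is needed):

```python
# Quartiles by the (q(n+1)/4)th-observation rule (minimal sketch)
data = sorted([3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94])
n = len(data)

def quartile(q):
    pos = (n + 1) * q // 4     # position is exactly integral for n = 15
    return data[pos - 1]       # 1-based position -> 0-based index

q1, q2, q3 = quartile(1), quartile(2), quartile(3)
print(q1, q2, q3)              # 11 40 61
print("IQR =", q3 - q1)        # 61 - 11 = 50
```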
Deciles and Percentiles
Deciles: If the ordered data are divided into 10 equal parts, the cut points are
called deciles.
Percentiles: If the ordered data are divided into 100 equal parts, the cut points
are called percentiles. The 25th percentile is Q1, the 50th percentile is the
median (Q2), and the 75th percentile of the data is Q3.

In notation, the pth percentile of a data set is the (p(n+1)/100)th observation of
the ordered data, where p is the desired percentile and n is the number of
observations.

Coefficient of Variation: The standard deviation of the data divided by its mean,
usually expressed as a percentage:

$$CV = \frac{s}{\bar{x}} \times 100$$
Five Number Summary

Five Number Summary: The five-number summary of a distribution consists of
the smallest (minimum) observation, the first quartile (Q1),
the median (Q2), the third quartile (Q3), and the largest (maximum) observation,
written in order from smallest to largest.

Box Plot: A box plot is a graph of the five number summary. The central box
spans the quartiles. A line within the box marks the median. Lines extending
above and below the box mark the smallest and the largest observations
(i.e., the range). Outlying samples may be additionally plotted outside the
range.
Boxplot

Figure: Distribution of age in months (box plot showing min, Q1, median, Q3,
and max; y-axis: 0-160).
Choosing a Summary
The five-number summary is usually better than the mean and standard deviation
for describing a skewed distribution or a distribution with extreme outliers. The
mean and standard deviation are reasonable for symmetric distributions that are
free of outliers.

In real life we cannot always expect symmetry of the data. It is common practice
to include the number of observations (n), mean, median, standard deviation, and
range for data summarization. We can include other summary statistics, like Q1,
Q3, and the coefficient of variation, if they are considered important for
describing the data.
Shape of Data
• Shape of data is measured by
– Skewness
– Kurtosis
Skewness
• Measures asymmetry of data
– Positive or right skewed: longer right tail
– Negative or left skewed: longer left tail

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then

$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{3/2}}$$
Kurtosis
• Measures peakedness of the distribution of data. The
kurtosis of the normal distribution is 0.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then

$$\text{Kurtosis} = \frac{n\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^2} - 3$$
Summary of the Variable ‘Age’ in the given
data set
Figure: Histogram of age (x-axis: age in months; y-axis: number of subjects).

Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60
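The skewness (0.39) and kurtosis (-1.18) reported above can be computed with scipy.stats, whose default bias-uncorrected estimators and Fisher (excess) definition of kurtosis match the formulas on the previous slides. A minimal sketch with illustrative normal data, since the raw age values are not listed here:

```python
# Skewness and excess kurtosis (minimal sketch; illustrative data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
print(stats.skew(x))                   # close to 0 for symmetric data
print(stats.kurtosis(x, fisher=True))  # close to 0 for normal data
```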
Summary of the Variable 'Age' in the given data set

Figure: Box plot of age in months (y-axis: 60-140).
Class Summary (First Part)
So far we have learned:

Statistics and data presentation/data summarization

Graphical presentation: bar chart, pie chart, histogram, and box plot
Numerical presentation: measuring the central value of data (mean, median, mode,
etc.), measuring dispersion (standard deviation, variance, coefficient of variation,
range, inter-quartile range, etc.), quartiles, percentiles, and the five-number
summary

Any questions?
Brief Concept of Statistical Software

There are many software packages for performing statistical analysis and
visualization of data. Some of them are SAS (Statistical Analysis System), S-Plus,
R, Matlab, Minitab, BMDP, Stata, SPSS, StatXact, Statistica, LISREL, JMP, GLIM,
HIL, MS Excel, etc. We will discuss MS Excel and SPSS in brief.

Some useful websites for more information on statistical software:

http://www.galaxy.gmu.edu/papers/astr1.html
http://ourworld.compuserve.com/homepages/Rainer_Wuerlaender/statsoft.htm#archiv
http://www.R-project.org
Microsoft Excel
A spreadsheet application. It features calculation, graphing tools, pivot tables,
and a macro programming language called VBA (Visual Basic for Applications).

There are many versions of MS Excel. Excel XP, Excel 2003, and Excel 2007 are
capable of performing a number of statistical analyses.

Starting MS Excel: Double-click the Microsoft Excel icon on the desktop, or click
Start → Programs → Microsoft Excel.

Worksheet: Consists of a grid of cells, with numbered rows down the page and
alphabetically-titled columns across the page. Each cell is referenced by its
coordinates. For example, A3 refers to the cell in column A and row 3;
B10:B20 refers to the range of cells in column B, rows 10 through 20.
Microsoft Excel
Opening a document: File → Open (from an existing workbook). Change the
directory or drive to look for files in other locations.
Creating a new workbook: File → New → Blank Document
Saving a file: File → Save

Selecting more than one cell: Click on a cell (e.g., A1), then hold the Shift key and
click on another (e.g., D4) to select the cells between A1 and D4, or click on a cell
and drag the mouse across the desired range.

Creating formulas: 1. Click the cell in which you want to enter the formula.
2. Type = (an equal sign). 3. Click the Function button (fx). 4. Select the formula
you want and step through the on-screen instructions.
Microsoft Excel
Entering date and time: Dates are stored as MM/DD/YYYY, but you need not
enter them in that format. For example, Excel will recognize jan 9 or jan-9 as
1/9/2007 and jan 9, 1999 as 1/9/1999. To enter today's date, press Ctrl and ;
together. Use a or p to indicate am or pm; for example, 8:30 p is interpreted as
8:30 pm. To enter the current time, press Ctrl and : together.

Copy and paste all cells in a sheet: Ctrl+A to select, Ctrl+C to copy, and Ctrl+V
to paste.

Sorting: Data → Sort → Sort By…

Descriptive statistics and other statistical methods: Tools → Data Analysis →
Statistical method. If Data Analysis is not available, click on Tools → Add-Ins and
then select Analysis ToolPak and Analysis ToolPak-VBA.
Microsoft Excel
Statistical and mathematical functions: Start with an '=' sign and then select a
function from the function wizard (fx).

Inserting a chart: Click on the Chart Wizard (or Insert → Chart), select a chart
type, give the input data range, update the chart options, and select the output
range/worksheet.

Importing data into Excel: File → Open → File Type → click on the file → choose
option (Delimited/Fixed Width) → choose options (Tab/Semicolon/Comma/Space/
Other) → Finish.

Limitations: Excel uses algorithms that are vulnerable to rounding and truncation
errors and may produce inaccurate results in extreme cases.
Statistical Package for the Social Sciences (SPSS)
A general-purpose statistical package, SPSS is widely used in the social sciences,
particularly in sociology and psychology.
SPSS can import data from almost any type of file to generate tabulated reports,
plots of distributions and trends, descriptive statistics, and complex statistical
analyses.
Starting SPSS: Double-click on SPSS on the desktop, or Program → SPSS.

Opening an SPSS file: File → Open

MENUS AND TOOLBARS


• Data Editor
Various pull-down menus appear at the top of the Data Editor window. These
pull-down menus are at the heart of using SPSSWIN. The Data Editor menu
items (with some of the uses of the menu) are:
Statistical Package for the Social Sciences (SPSS)
MENUS AND TOOLBARS

FILE used to open and save data files

EDIT used to copy and paste data values; used to find data in a
file; insert variables and cases; OPTIONS allows the user to
set general preferences as well as the setup for the
Navigator, Charts, etc.

VIEW user can change toolbars; value labels can be seen in cells
instead of data values

DATA select, sort or weight cases; merge files

TRANSFORM Compute new variables, recode variables, etc.


Statistical Package for the Social Sciences (SPSS)
MENUS AND TOOLBARS

ANALYZE perform various statistical procedures

GRAPHS create bar and pie charts, etc.

UTILITIES add comments to accompany a data file (and other,
advanced features)

ADD-ONS features not currently installed (advanced
statistical procedures)

WINDOW switch between data, syntax and navigator windows

HELP to access SPSSWIN Help information


Statistical Package for the Social Sciences (SPSS)
MENUS AND TOOLBARS

Navigator (Output) Menus


When statistical procedures are run or charts are created, the output will appear
in the Navigator window. The Navigator window contains many of the pull-down
menus found in the Data Editor window. Some of the important menus in the
Navigator window include:

INSERT used to insert page breaks, titles, charts, etc.

FORMAT for changing the alignment of a particular portion of the output


Statistical Package for the Social Sciences (SPSS)
• Formatting Toolbar
When a table has been created by a statistical procedure, the user can edit the
table to create a desired look or add/delete information. Beginning with version
14.0, the user has a choice of editing the table in the Output or opening it in a
separate Pivot Table window. Various pull-down menus are activated when the
user double-clicks on the table. These include:

EDIT undo and redo a pivot, select a table or table body (e.g., to
change the font)

INSERT used to insert titles, captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells


Statistical Package for the Social Sciences (SPSS)
• Additional menus
CHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window


• Show or hide a toolbar

Click on VIEW ⇒ TOOLBARS, then check a toolbar to show it or uncheck it to hide it


• Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to
its new location

• Customize a toolbar

Click on VIEW ⇒ TOOLBARS ⇒ CUSTOMIZE


Statistical Package for the Social Sciences (SPSS)
Importing data from an EXCEL spreadsheet:
Data from an Excel spreadsheet can be imported into SPSSWIN as follows:
1. In SPSSWIN click on FILE ⇒ OPEN ⇒ DATA. The OPEN DATA FILE Dialog
Box will appear.
2. Locate the file of interest: Use the "Look In" pull-down list to identify the folder
containing the Excel file of interest
3. From the FILE TYPE pull down menu select EXCEL (*.xls).
4. Click on the file name of interest and click on OPEN or simply double-click on
the file name.
5. Keep the box checked that reads "Read variable names from the first row of
data". This presumes that the first row of the Excel data file contains the variable
names. [If the data resided in a different worksheet in the Excel file, this would
need to be entered.]
6. Click on OK. The Excel data file will now appear in the SPSSWIN Data
Editor.
Statistical Package for the Social Sciences (SPSS)
Importing data from an EXCEL spreadsheet:
7. The former EXCEL spreadsheet can now be saved as an SPSS file (FILE ⇒
SAVE AS) and is ready to be used in analyses. Typically, you would label variables
and values, and define missing values.
Importing an Access table
SPSSWIN does not offer a direct import for Access tables. Therefore, we must follow
these steps:
1. Open the Access file
2. Open the data table
3. Save the data as an Excel file
4. Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN.
Importing Text Files into SPSSWIN
Text data points typically are separated (or “delimited”) by tabs or commas.
Sometimes they can be of fixed format.
Statistical Package for the Social Sciences (SPSS)
Importing tab-delimited data
In SPSSWIN click on FILE ⇒ OPEN ⇒ DATA. Look in the appropriate location for
the text file. Then select “Text” from “Files of type”: Click on the file name and then
click on “Open.” You will see the Text Import Wizard – step 1 of 6 dialog box.

You will now have an SPSS data file containing the former tab-delimited data. You
simply need to add variable and value labels and define missing values.

Exporting Data to Excel


click on FILE ⇒ SAVE AS. Click on the File Name for the file to be exported. For
the “Save as Type” select from the pull-down menu Excel (*.xls). You will notice the
checkbox for “write variable names to spreadsheet.” Leave this checked as you will
want the variable names to be in the first row of each column in the Excel
spreadsheet. Finally, click on Save.
Statistical Package for the Social Sciences (SPSS)
Running the FREQUENCIES procedure

1. Open the data file (from the menus, click on FILE ⇒ OPEN ⇒ DATA) of
interest.

2. From the menus, click on ANALYZE ⇒ DESCRIPTIVE STATISTICS ⇒


FREQUENCIES
3. The FREQUENCIES Dialog Box will appear. In the left-hand box will be a listing
("source variable list") of all the variables that have been defined in the data file. The
first step is identifying the variable(s) for which you want to run a frequency analysis.
Click on a variable name(s). Then click the [ > ] pushbutton. The variable name(s)
will now appear in the VARIABLE[S]: box ("selected variable list"). Repeat these
steps for each variable of interest.

4. If all that is being requested is a frequency table showing count, percentages


(raw, adjusted and cumulative), then click on OK.
Statistical Package for the Social Sciences (SPSS)
Requesting STATISTICS
Descriptive and summary STATISTICS can be requested for numeric variables. To
request Statistics:
1. From the FREQUENCIES Dialog Box, click on the STATISTICS... pushbutton.
2. This will bring up the FREQUENCIES: STATISTICS Dialog Box.
3. The STATISTICS Dialog Box offers the user a variety of choices, including percentile values and measures of central tendency, dispersion, and distribution shape.

DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics


(click on ANALYZE ⇒ DESCRIPTIVE STATISTICS ⇒ DESCRIPTIVES). The
procedure offers many of the same statistics as the FREQUENCIES procedure,
but without generating frequency analysis tables.
Statistical Package for the Social Sciences (SPSS)
Requesting CHARTS
One can request a chart (graph) to be created for a variable or variables included in
a FREQUENCIES procedure.

1. In the FREQUENCIES Dialog box click on CHARTS.


2. The FREQUENCIES: CHARTS Dialog box will appear. Choose the intended chart
(e.g., bar diagram, pie chart, histogram).

Pasting charts into Word


1. Click on the chart.
2. Click on the pulldown menu EDIT ⇒ COPY OBJECTS
3. Go to the Word document in which the chart is to be embedded. Click on EDIT ⇒

PASTE SPECIAL
4. Select Formatted Text (RTF) and then click on OK
5. Enlarge the graph to a desired size by dragging one or more of the black squares

along the perimeter (if the black squares are not visible, click once on the graph).
Statistical Package for the Social Sciences (SPSS)
BASIC STATISTICAL PROCEDURES: CROSSTABS
1. From the ANALYZE pull-down menu, click on DESCRIPTIVE STATISTICS ⇒
CROSSTABS.
2. The CROSSTABS Dialog Box will then open.

3. From the variable selection box on the left click on a variable you wish to
designate as the Row variable. The values (codes) for the Row variable make up
the rows of the crosstabs table. Click on the arrow (>) button for Row(s). Next,
click on a different variable you wish to designate as the Column variable. The
values (codes) for the Column variable make up the columns of the crosstabs
table. Click on the arrow (>) button for Column(s).

4. You can specify more than one variable in the Row(s) and/or Column(s). A
crosstab will be generated for each combination of Row and Column variables.
Statistical Package for the Social Sciences (SPSS)
Limitations: SPSS users have less control over data manipulation and statistical
output than users of other statistical packages such as SAS and Stata.

SPSS is a good first statistical package to perform quantitative research in social


science because it is easy to use and because it can be a good starting point to
learn more advanced statistical packages.
Data Analysis
• Data Analysis
– Helps us achieve the four scientific goals of
description, prediction, explanation, and control
• Statistical Data Analysis
– Three primary reasons geographers treat data in a
statistical fashion

Statistical Description
• Descriptive Statistics
– Parameters
– Central Tendency
• Mode
• Median
• Mean ($\bar{X}$, $\mu$)
– Arithmetic mean
– When would you use the median or the mode
instead of the mean?
Descriptive Statistics
• Variability
– Range
• = largest value – smallest value
– Variance

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$

– Standard Deviation

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$$
Descriptive Statistics
• Form
– Modality
– Skewness
• Positive
• Negative
– Symmetry
• Unimodal – Bell-shaped
– Normal Distribution

Descriptive Statistics
• Derived Scores
– Percentile Rank
• Highest – 99th percentile
• Where is the median?
– Z-score
• Standard deviation units above or below the mean
$$z = \frac{x - \mu}{\sigma}$$
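A minimal Python sketch of the z-score formula with made-up numbers:

```python
# Z-scores: standard deviation units above or below the mean (minimal sketch)
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std()  # population standard deviation (ddof=0)
print(z)  # mean 5, std sqrt(5) -> [-1.342, -0.447, 0.447, 1.342]
```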
Descriptive Statistics
• Relationship
• Linear Relationship
– Positive
– Negative
• Relationship Strength
– Weak, strong, no relationship
• Correlation Coefficient
– Between -1 and 1
– 0 – no relationship
• Regression Analysis
– Criterion variables (Y)
– Predictor variables (X)
Correlation – Causation?

http://xkcd.com/552/

“Correlation doesn’t imply causation, but it does


waggle its eyebrows suggestively and gesture
furtively while mouthing ‘look over there’.” - XKCD
Statistical Inference
• Inferential Statistics
– Statistics
• Sampling error
• Given our sample statistics, we infer our parameters
• Assign probabilities to our guesses
– Power and difficulty of inferential statistics comes
from deriving probabilities about how likely it is
that sample patterns reflect population patterns
Inferential Statistics
• Sampling distribution
– Ex: the sampling distribution of means shows the
probability that a single sample would have a
mean within some given RANGE of values
– Central limit theorem – the sampling distribution of
sample means will be normal, with a mean equal
to the population mean and a standard deviation
equal to the population standard deviation
divided by the square root of the sample size
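A short simulation illustrates the central limit theorem. This is a minimal sketch with numpy; the exponential population is an arbitrary skewed choice:

```python
# Central limit theorem: means of samples from a skewed population
# are approximately normal, centered on the population mean,
# with spread sigma / sqrt(n)  (minimal sketch)
import numpy as np

rng = np.random.default_rng(42)
n = 50
sample_means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())  # approx population mean 2.0
print(sample_means.std())   # approx sigma/sqrt(n) = 2.0/sqrt(50) ~ 0.283
```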
• Descriptive statistics describes data (for example, a
chart or graph) and inferential statistics allows you
to make predictions (“inferences”) from that data.
With inferential statistics, you take data from
samples and make generalizations about a
population.
• Descriptive statistics summarize the characteristics
of a data set. Inferential statistics allow you to test a
hypothesis or assess whether your data is
generalizable to the broader population.
Inferential Statistics
• Generation of sampling distributions
– Assumptions
• Distributional assumptions
– Nonparametric
– Parametric
» Normality
» Homogeneity of variance
• Independence of scores
• Correct specification of models
• Parametric tests are those that make
assumptions about the parameters of the
population distribution from which the sample
is drawn. This is often the assumption that the
population data are normally distributed. 
• Non-parametric tests are “distribution-free”
and, as such, can be used for non-Normal
variables.
Estimation and Hypothesis Testing
• Estimation
– Point estimation
– Confidence Interval
• Usually 95%
• Hypothesis Testing
– Null hypothesis
• A hypothesis about the exact (point) value of a
parameter or set of parameters
• Use sample statistics to make an inference about the
probable truth of our null hypothesis
Hypothesis Testing
• Alternative Hypothesis
– Hypothesis that the parameter does not equal the
exact value hypothesized in the null
– A range rather than an exact value

• Modus Tollens
– Useful for disconfirming
– Not confirming!

If A is true, then B is true.
B is not true. Therefore, A is not true. (valid)
B is true. Therefore, ??? (nothing follows)
Example
• From a recent nationwide study it is known that the typical
American watches 25 hours of television per week, with a
population standard deviation of 5.6 hours. Suppose 50
Denver residents are randomly sampled with an average
viewing time of 22 hours per week and a standard deviation
of 4.8. Are Denver television viewing habits different from
nationwide viewing habits?
• Step 1: State your null and alternative hypotheses

$$H_0: \mu = 25 \qquad H_A: \mu \neq 25$$

• What is this saying?

Example
• Step 2: Determine your appropriate test statistic and its sampling
distribution assuming the null is true
– We are testing a sample mean where n > 30, so a z distribution can be used
• Step 3: Calculate the test statistic from your sample data

$$\bar{X} = 22, \quad \mu = 25, \quad s = 4.8, \quad \sigma = 5.6, \quad n = 50$$

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{22 - 25}{5.6 / \sqrt{50}} = -3.79$$

• Step 4: Compare the empirically obtained test statistic to the null
sampling distribution
– P value: p = .0001
– OR critical value at the .05 significance level: z = ±1.96
– Decision: Reject the null hypothesis
• -3.79 is less than -1.96: reject
• The p value is very small, less than .05 and even .01: reject
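The same computation in Python (a minimal sketch, assuming scipy for the normal CDF):

```python
# One-sample z test for the Denver TV-viewing example (minimal sketch)
import math
from scipy import stats

x_bar, mu, sigma, n = 22, 25, 5.6, 50
z = (x_bar - mu) / (sigma / math.sqrt(n))
p = 2 * stats.norm.cdf(-abs(z))  # two-tailed p-value
print(round(z, 2), p)            # -3.79, p ~ 0.00015 -> reject H0 at .05
```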
Error
• You have made either a correct inference or a
mistake
• Type I error is the rejection level, p (or α)
• Type II error - β

Hypothesis Development and
Testing
The hypothesis is usually considered the principal
instrument in research. Its main function is to suggest new
experiments and observations. In fact, many experiments
are carried out with the deliberate object of testing
hypotheses.

WHAT IS A HYPOTHESIS?
A hypothesis may be defined as a proposition or a set of
propositions set forth as an explanation for the occurrence
of some specified group of phenomena, either asserted
merely as a provisional conjecture to guide some
investigation or accepted as highly probable in the light of
established facts.
Characteristics of hypothesis
Hypothesis should be clear and precise
Hypothesis should be capable of being tested.
Hypothesis should state relationship between
variables, if it happens to be a relational
hypothesis.
Hypothesis should be limited in scope and
must be specific.
Hypothesis should be stated, as far as possible,
in the simplest terms so that it is easily
understandable by all concerned.
• Hypothesis should be consistent with
most known facts.
• Hypothesis should be amenable to testing
within a reasonable time.
• Hypothesis must explain the facts that
gave rise to the need for explanation.
BASIC CONCEPTS CONCERNING TESTING OF
HYPOTHESES
• (a) Null hypothesis and alternative hypothesis
If we are to compare method A with method B
regarding their superiority, and we proceed on the
assumption that both methods are equally good,
then this assumption is termed the null
hypothesis. As against this, if we think that
method A is superior or method B is inferior,
we are then stating what is termed the
alternative hypothesis.
• (b) The level of significance:
• It is always some percentage (usually 5%) which
should be chosen with great care, thought and
reason. If we take the significance level at 5 per
cent, this implies that H0 will be rejected when
the sampling result (i.e., observed evidence) has
a less than 0.05 probability of occurring if H0 is
true.
c. Decision rule or test of hypothesis
Given a hypothesis H0 and an alternative
hypothesis Ha, we make a rule which is known
as decision rule according to which we accept
H0 (i.e., reject Ha) or reject H0 (i.e., accept
Ha).
• (d) Type I and Type II errors:
We may reject H0 when H0 is true and we may
accept H0 when in fact H0 is not true. The
former is known as Type I error and the latter as
Type II error.
(e) Two-tailed and One-tailed tests:
A two-tailed test rejects the null hypothesis if,
say, the sample mean is significantly higher or
lower than the hypothesised value of the mean
of the population.
• A one-tailed test would be used when we are
to test, say, whether the population mean is
either lower than or higher than some
hypothesised value.
• Acceptance Region A : |Z| ≤ 1.96
• Rejection Region R : |Z| > 1.96
Figure: one-tailed (left-tail) test.
PROCEDURE FOR HYPOTHESIS
TESTING
• (i) Making a formal statement
Null hypothesis H0 : µ = 10 tons
Alternative Hypothesis Ha: µ > 10 tons
• (ii) Selecting a significance level
• (iii) Deciding the distribution to use (sampling
distribution)
• (iv) Selecting a random sample and
computing an appropriate value
(v) Calculation of the probability
(vi) Comparing the probability- If the calculated
probability is equal to or smaller than the α value
in case of one-tailed test (and α /2 in case of two-
tailed test), then reject the null hypothesis (i.e.,
accept the alternative hypothesis), but if the
calculated probability is greater, then accept the
null hypothesis.
FLOW DIAGRAM FOR HYPOTHESIS TESTING
TESTS OF HYPOTHESES
Hypothesis testing helps to decide on the basis
of a sample data, whether a hypothesis about
the population is likely to be true or false.
Statisticians have developed several tests of
hypotheses (also known as the tests of
significance) for the purpose of testing of
hypotheses which can be classified as:
(a) Parametric tests or standard tests of
hypotheses; and
(b) Non-parametric tests or distribution-free test
of hypotheses.
Parametric tests usually assume certain
properties of the parent population from which
we draw samples. Assumptions like observations
coming from a normal population, the sample size
being large, and assumptions about population
parameters like mean, variance, etc., must hold
good before parametric tests can be used.
Parametric tests assume a normal distribution
of values, or a “bell-shaped curve.” For example,
height is roughly a normal distribution in that if
you were to graph height from a group of
people, one would see a typical bell-shaped
curve.
• Non-parametric tests assume only nominal or
ordinal data, whereas parametric tests require
measurement equivalent to at least an interval
scale. As a result, non-parametric tests need
more observations than parametric tests.
• A non parametric test (sometimes called
a distribution free test) does not assume anything
about the underlying distribution (for example,
that the data comes from a normal distribution). 
IMPORTANT PARAMETRIC TESTS
• z-test
• t-test
• χ²-test
• F-test
• z-test is based on the normal probability
distribution and is used for judging the
significance of several statistical measures,
particularly the mean.
• t-test is based on t-distribution and is
considered an appropriate test for judging the
significance of a sample mean or for judging
the significance of difference between the
means of two samples in case of small
sample(s) when population variance is not
known.
• χ² test is based on chi-square distribution and
as a parametric test is used for comparing a
sample variance to a theoretical population
variance.
• F-test is based on F-distribution and is used to
compare the variance of the two-independent
samples. This test is also used in the context of
analysis of variance (ANOVA) for judging the
significance of more than two sample means at
one and the same time.
Non parametric or Distribution Free test
(i) Test of a hypothesis concerning some single value for the
given data (such as one-sample sign test).
(ii) Test of a hypothesis concerning no difference among two
or more sets of data (such as two-sample sign test, Fisher-
Irwin test, Rank sum test, etc.).
(iii) Test of a hypothesis of a relationship between variables
(such as rank correlation, Kendall's coefficient of
concordance, and other tests for dependence).
(iv) Test of a hypothesis concerning variation in the given
data i.e., test analogous to ANOVA viz., Kruskal-Wallis test.
(v) Tests of randomness of a sample based on the theory of
runs viz., one sample runs test.
(vi) Test of hypothesis to determine if categorical data shows
dependency or if two classifications are independent viz., the
chi-square test.
1. Sign Tests - One Sample Test, Two Sample Test
2. Fisher-Irwin Test
3. McNemar Test
4. Wilcoxon Matched-Pairs Test (or Signed Rank
Test)
5. Rank Sum Tests
a) Wilcoxon-Mann-Whitney Test (U Test)
b) The Kruskal-Wallis Test (H Test)
6. One Sample Runs Test
7. Spearman's Rank Correlation
8. Kendall's Coefficient of Concordance
Chi-Square Test
Purpose
• To measure discontinuous categorical/binned data in which a
number of subjects fall into categories
• We want to compare our observed data to what we expect to
see. Due to chance? Due to association?
• When can we use the Chi-Square Test?
– Testing outcome of Mendelian Crosses, Testing Independence – Is one
factor associated with another?, Testing a population for expected
proportions
Assumptions:
• 1 or more categories
• Independent observations
• A sample size of at least 10
• Random sampling
• All observations must be used
• For the test to be accurate, the expected frequency
should be at least 5
Conducting Chi-Square Analysis
1) Make a hypothesis based on your basic biological
question
2) Determine the expected frequencies
3) Create a table with observed frequencies, expected
frequencies, and chi-square values using the formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

4) Find the degrees of freedom: (c-1)(r-1)
5) Find the critical chi-square value in the Chi-Square
Distribution table
6) If the critical chi-square value > your calculated chi-square
statistic, you do not reject your null hypothesis; otherwise,
you reject it.
Example 1: Testing for Proportions
HO: Horned lizards eat equal amounts of leaf cutter, carpenter and black ants.
HA: Horned lizards eat more of one species of ant than the others.

           Leaf Cutter Ants   Carpenter Ants   Black Ants   Total
Observed   25                 18               17           60
Expected   20                 20               20           60
O-E        5                  -2               -3           0
(O-E)²/E   1.25               0.2              0.45         χ² = 1.90
Example 1: Testing for Proportions

Critical value (df = 2, α = 0.05): χ² = 5.991
Example 1: Testing for Proportions

Critical value: χ² = 5.991    Our calculated value: χ² = 1.90

*If the critical chi-square value > your calculated value, then you do not reject
your null hypothesis: the difference between observed and expected frequencies
is attributable to chance.

5.991 > 1.90 ∴ We do not reject our null hypothesis.
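The same test is available in scipy. A minimal sketch using the observed and expected counts of the lizard example:

```python
# Chi-square goodness-of-fit test for the horned lizard example (minimal sketch)
from scipy import stats

observed = [25, 18, 17]
expected = [20, 20, 20]
chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)  # chi2 = 1.90, p ~ 0.39 > 0.05 -> do not reject H0
```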


T-test
for dependent Samples
for independent Samples
(a.k.a. paired samples t-test, correlated
groups design, within-subjects design,
repeated measures, ...)
The t Test for Dependent Samples
• Repeated-Measures Design
– When you have two sets of scores from the same
person in your sample, you have a repeated-
measures, or within-subjects design.
– You are more similar to yourself than you are to
other people.
Difference Scores
• The way to handle two scores per person, or a
matched pair, is to make difference scores.
– For each person, or each pair, you subtract one score from
the other.
– Once you have a difference score for each person, or pair,
in the study, you treat the study as if there were a single
sample of scores (scores that in this situation happen to be
difference scores).
A Population of Difference Scores with a
Mean of 0
• The null hypothesis in a repeated-measures
design is that on the average there is no
difference between the two groups of scores.
• This is the same as saying that the mean of
the sampling distribution of difference scores
is 0.
The t Test for Dependent Samples
• You do a t test for dependent samples the
same way you do a t test for a single sample,
except that:
– You use difference scores.
– You assume the population mean is 0.

$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}} \quad\longrightarrow\quad t = \frac{\bar{D} - \mu_{D_{hyp}}}{s_{\bar{D}}}$$

The t Test for Dependent Samples

$$t = \frac{\bar{D} - \mu_{D_{hyp}}}{s_{\bar{D}}}, \qquad s_{\bar{D}} = \frac{s_D}{\sqrt{n}}, \qquad s_D = \sqrt{\frac{n\sum D^2 - \left(\sum D\right)^2}{n(n-1)}}$$
The t Test for Dependent Samples:
An Example
Hypothesis Testing
1. State the research question.
2. State the statistical hypothesis.
3. Set decision rule.
4. Calculate the test statistic.
5. Decide if result is significant.
6. Interpret result as it relates to your research
question.
The t Test for Dependent Samples:
An Example
• State the research hypothesis.
– Does listening to a pro-socialized medicine lecture
change an individual’s attitude toward socialized
medicine?
• State the statistical hypotheses.

$$H_0: \mu_D = 0 \qquad H_A: \mu_D \neq 0$$
The t Test for Dependent Samples:
An Example
• Set the decision rule.

$$\alpha = .05, \qquad df = (\text{number of difference scores}) - 1 = 8 - 1 = 7, \qquad t_{crit} = \pm 2.365$$
The t Test for Dependent Samples:
An Example
• Calculate the test statistic.

$$\bar{D} = \frac{\sum D}{n} = \frac{-16}{8} = -2$$

$$s_D = \sqrt{\frac{n\sum D^2 - \left(\sum D\right)^2}{n(n-1)}} = \sqrt{\frac{8(42) - (-16)^2}{8(7)}} \approx 1.2$$

$$s_{\bar{D}} = \frac{s_D}{\sqrt{n}} = \frac{1.2}{\sqrt{8}} \approx .42$$

$$t = \frac{\bar{D} - \mu_{D_{hyp}}}{s_{\bar{D}}} = \frac{-2.0}{.42} \approx -4.76$$
The t Test for Dependent Samples:
An Example
• Decide if your results are significant.
– Reject H0, -4.76<-2.365
• Interpret your results.
– After the pro-socialized medicine lecture,
individuals’ attitudes toward socialized medicine
were significantly more positive than before the
lecture.
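The computation can be replayed from the slide's summary quantities (sum of difference scores = -16, sum of squared difference scores = 42, n = 8; the raw difference scores themselves are not shown). A minimal Python sketch:

```python
# Dependent-samples t test from summary statistics (minimal sketch)
import math

n, sum_d, sum_d2 = 8, -16, 42
d_bar = sum_d / n                                         # -2.0
s_d = math.sqrt((n * sum_d2 - sum_d**2) / (n * (n - 1)))  # ~ 1.20
t = d_bar / (s_d / math.sqrt(n))
print(d_bar, round(s_d, 2), round(t, 2))
# t ~ -4.73 (the slide's -4.76 comes from rounding s_D-bar to .42)
```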
Issues with Repeated Measures
Designs
• Order effects.
– Use counterbalancing in order to eliminate any potential
bias in favor of one condition because most subjects
happen to experience it first (order effects).
– Randomly assign half of the subjects to experience the two
conditions in a particular order.
• Practice effects.
– Do not repeat measurement if effects linger.
The t Tests

Independent Samples
The t Test for Independent
Samples
• Observations in each sample are independent of
one another (the two samples come from different
populations).
• We want to compare differences between
sample means.

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_{hyp}}{s_{\bar{X}_1 - \bar{X}_2}}$$
Sampling Distribution of the
Difference Between Means
• Imagine two sampling distributions of the mean...
• And then subtracting one from the other…
• If you create a sampling distribution of the difference
between the means…
– Given the null hypothesis, we expect the mean of the
sampling distribution of differences, $\mu_1 - \mu_2$, to be 0.
– We must estimate the standard deviation of the sampling
distribution of the difference between means.
Pooled Estimate of the Population
Variance
• Using the assumption of homogeneity of variance,
both s1 and s2 are estimates of the same population
variance.
• If this is so, rather than make two separate
estimates, each based on some small sample, it is
preferable to combine the information from both
samples and make a single pooled estimate of the
population variance.
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}$$
Pooled Estimate of the Population
Variance
• The pooled estimate of the population variance becomes the
average of both sample variances, once adjusted for their
degrees of freedom.
– Multiplying each sample variance by its degrees of freedom ensures
that the contribution of each sample variance is proportionate to its
degrees of freedom.
– You know you have made a mistake in calculating the pooled estimate
of the variance if it does not come out between the two estimates.
– You have also made a mistake if it does not come out closer to the
estimate from the larger sample.
• The degrees of freedom for the pooled estimate of the
variance equals the sum of the two sample sizes minus two,
or (n1-1) +(n2-1).
Estimating Standard Error of the
Difference Between Means
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}$$

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$$

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_{hyp}}{s_{\bar{X}_1 - \bar{X}_2}}$$
The t Test for Independent
Samples: An Example
• Stereotype Threat
– Threat condition told: "This test is a measure of your academic ability."
– Control condition told: "Trying to develop the test itself."
The t Test for Independent
Samples: An Example
• State the research question.
– Does stereotype threat hinder the performance of
those individuals to which it is applied?
• State the statistical hypotheses.

$$H_0: \mu_1 - \mu_2 \geq 0 \qquad H_1: \mu_1 - \mu_2 < 0$$

or, equivalently,

$$H_0: \mu_1 \geq \mu_2 \qquad H_1: \mu_1 < \mu_2$$

(group 1 = threat, group 2 = control)
The t Test for Independent Samples: An
Example
• Set the decision rule.

$$\alpha = .05, \qquad df = (n_1 - 1) + (n_2 - 1) = (12 - 1) + (11 - 1) = 21, \qquad t_{crit} = -1.721$$
The t Test for Independent Samples: An
Example
• Calculate the test statistic.
Threat:  7, 8, 7, 2, 6, 9, 7, 10, 5, 0, 10, 8     (n1 = 12, ΣX1 = 79, ΣX1² = 621)
Control: 4, 9, 12, 8, 9, 13, 12, 13, 13, 7, 6     (n2 = 11, ΣX2 = 106, ΣX2² = 1122)

$$\bar{X}_1 = \frac{79}{12} = 6.58 \qquad \bar{X}_2 = \frac{106}{11} = 9.64$$

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_{hyp}}{s_{\bar{X}_1 - \bar{X}_2}}$$
The t Test for Independent Samples: An
Example
• Calculate the test statistic.

$$s_1^2 = \frac{12(621) - (79)^2}{12(11)} = 9.18 \qquad s_2^2 = \frac{11(1122) - (106)^2}{11(10)} = 10.05$$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{(12 - 1)9.18 + (11 - 1)10.05}{(12 - 1) + (11 - 1)} = 9.59$$

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = \sqrt{\frac{9.59}{12} + \frac{9.59}{11}} = 1.29$$
The t Test for Independent Samples: An
Example
• Calculate the test statistic.

$$\bar{X}_1 = 6.58 \qquad \bar{X}_2 = 9.64 \qquad s_{\bar{X}_1 - \bar{X}_2} = 1.29$$

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_{hyp}}{s_{\bar{X}_1 - \bar{X}_2}} = \frac{6.58 - 9.64}{1.29} = -2.37$$
The t Test for Independent
Samples: An Example
• Decide if your result is significant.
– Reject H0, - 2.37< - 1.721
• Interpret your results.
– Stereotype threat significantly reduced
performance of those to whom it was applied.
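A minimal sketch with scipy, using the raw scores reconstructed from the slide's table; ttest_ind pools the variances by default (equal_var=True), matching the hand calculation:

```python
# Independent-samples (pooled-variance) t test (minimal sketch)
from scipy import stats

threat  = [7, 8, 7, 2, 6, 9, 7, 10, 5, 0, 10, 8]   # mean 79/12 ~ 6.58
control = [4, 9, 12, 8, 9, 13, 12, 13, 13, 7, 6]   # mean 106/11 ~ 9.64
t, p_two_tailed = stats.ttest_ind(threat, control)
print(round(t, 2), p_two_tailed / 2)
# t ~ -2.36 (the slide's -2.37 reflects intermediate rounding);
# one-tailed p ~ .014 < .05 -> reject H0
```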
Assumptions
1) The observations within each sample must be independent.
2) The two populations from which the samples are selected
must be normal.
3) The two populations from which the samples are selected
must have equal variances.
– This is also known as homogeneity of variance, and there are two
methods for testing that we have equal variances:
• a) informal method – simply compare sample variances
• b) Levene’s test – We’ll see this on the SPSS output
4) Random Assignment
To make causal claims
5) Random Sampling
To make generalizations to the target population
Reliability and Validity
• Reliability and validity are concepts used to
evaluate the quality of research. They indicate
how well a method, technique or test
measures something.
• Reliability is about the consistency of a
measure, and validity is about the accuracy of
a measure.
What is reliability?

• Reliability refers to how consistently a method measures


something. If the same result can be consistently achieved by
using the same methods under the same circumstances, the
measurement is considered reliable.
• You measure the temperature of a liquid sample several times
under identical conditions. The thermometer displays the same
temperature every time, so the results are reliable.
• A doctor uses a symptom questionnaire to diagnose a patient
with a long-term medical condition. Several different doctors
use the same questionnaire with the same patient but give
different diagnoses. This indicates that the questionnaire has
low reliability as a measure of the condition.
What is validity?
Validity refers to how accurately a method measures
what it is intended to measure. If research has high
validity, that means it produces results that
correspond to real properties, characteristics, and
variations in the physical or social world.
High reliability is one indicator that a measurement is
valid. If a method is not reliable, it probably isn’t
valid.
If the thermometer shows different temperatures
each time, even though you have carefully controlled
conditions to ensure the sample’s temperature stays
the same, the thermometer is probably
malfunctioning, and therefore its measurements are
not valid.
Reliability
• Reliability refers to the consistency of a
measure. Psychologists consider three types
of consistency: over time (test-retest
reliability), across items (internal consistency),
and across different researchers (inter-rater
reliability).
Running Reliability Analysis

• Analyze > Scale > Reliability Analysis


• Select items for analysis
• Click “Statistics” and check “item” and
“scale if item deleted”
• Click continue
• Click OK
• See Output
Running Reliability Analysis

(Screenshot of the Reliability Analysis dialog omitted.)

Output from Reliability Analysis

(Screenshot of the reliability output omitted.)
Interpret Output

• Case Processing Summary – N is the number of test
takers
• Reliability Statistics – Cronbach's Alpha is
our statistic: .50-.60 marginal, .61-.70
good, .71-.85 very good
• Item Statistics – average response for all
test takers
• Item Total Statistics – use to determine
which items stay and which get dropped
Item Total Statistics
• If reliability goes up after deleting item,
bad item
• If reliability goes down after deleting item,
good item
• See item total correlations
• Q3 good item, Q2 bad item
Revise Questionnaire
• Use best judgment to exclude items
• Drop a couple of items, rerun reliability
analysis, check results
• Drop more items, check results, Cronbach’s
Alpha go up or down?
• Pick final set of items
• Run Reliability Analysis with final set of
items
• If good reliability, use questions for your
questionnaire
Update Questionnaire

• Include subset of old questionnaire items


• Include validity questionnaire items
• Check for typos
• Submit to friends and family
Uses of Internal Consistency Reliability Analysis

• Quality research must come from measures that have the


ability to consistently (i.e., reliably) and accurately detect
changes in research participants’ skills, knowledge,
attitudes, or behavior.
– A reliable measure is reproducible and precise: each time it is used
it produces the same results, all else being equal.
• Internal consistency reliability analysis is a parametric
procedure used to evaluate the consistency of results across
items within a single scale (i.e., instrument) or subscale that
is composed of multiple items.
– All items in an internally consistent scale assess the same construct.
Uses of Internal Consistency Reliability Analysis

• The following models of internal consistency reliability are


available in SPSS:
– Cronbach’s alpha model is based on the average inter-item correlation.
It is used when items are not scored dichotomously, e.g., it is used for
multiple choice items.
– Split-half model splits the scale into two parts and examines the
correlation between the parts.
– Guttman model computes Guttman’s lower bounds for true reliability.
– Parallel model assumes that all items have equal variances and equal
error variances across replications.
– Strict parallel model makes the assumptions of the parallel model and
also assumes equal means across items.
• Each model involves administering the instrument once to a single
group of subjects and yields a reliability coefficient (also known as the
coefficient of internal consistency).
Uses of Internal Consistency Reliability Analysis
• Procedures for estimating reliability produce a reliability coefficient,
which is a correlation coefficient that ranges in value from zero to +
1.0. When a reliability coefficient is zero, all variability in obtained
test scores is due to measurement error. Conversely, when a
reliability coefficient is 1.0, all variability in scores reflects true
score variability.
• Reliability coefficients can be interpreted as follows:
– Very high reliability = .90 and above
– High reliability = .70 to < .90
– Moderate reliability = .50 to < .70
– Low reliability = .30 to < .50
– Little if any reliability < .30
Note: many social science researchers consider scale reliability below .70 as
questionable and avoid using such scales.
• A reliability coefficient is never squared to interpret it, as is the case
with other correlation coefficients, but is interpreted directly as a
measure of true score variability. A reliability coefficient of .70
means that 70% of the variability in obtained scores is true score
variability.
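Outside SPSS, Cronbach's alpha can be computed directly from its definition. A minimal Python sketch with numpy and made-up illustrative scores (rows = respondents, columns = items):

```python
# Cronbach's alpha from an items-by-respondents score matrix (minimal sketch)
import numpy as np

scores = np.array([
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 3, 3, 4],
])
k = scores.shape[1]                         # number of items
item_vars = scores.var(axis=0, ddof=1)      # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 3))                      # ~ 0.93 for this toy data
```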
