1/10/2018 10:24:00 AM

Variability – scores or measurement values obtained in a study differ from
one another, even when all the subjects in the study are assessed under the
same circumstances (not everyone will get the same grade in our stats
class). It is the dissimilarity in scores and outcomes obtained under the
same circumstances.
3 Sources of Variability:
 Individual Differences - People are different from one another/have
different behaviors
 Measurement Error – a characteristic cannot always be measured
accurately
o Ambiguous wording
o One part of a course is emphasized more than other parts
 Unreliability – people don’t respond to the same question in exactly
the same way on two different occasions

Statistics – the study of methods for describing and interpreting
quantitative information, which is called data. 2 Broad categories:
 Descriptive Stats – procedures for organizing, summarizing, and
describing data
o Did Ritalin improve performance? (refers to what you found in
your experiment with your own sample)
 Inferential Stats – methods for making inferences about a larger
group of individuals on the basis of data actually collected on a
much smaller group
o Does Ritalin improve performance? (refers to the
generalization of your findings to a population)

Population – the whole larger group of individuals you want to generalize to

Sample – the much smaller group actually measured in the experiment

Look at the difference between the control and the experimental group
 Large Difference – the IV manipulation has an effect
 Small Difference – the difference is likely due to variability
Probability – how likely it is that the observed result is due to expected
variability (chance). Ranges from 0.00 (impossible) to 1.00 (certain). The
difference has to be unlikely enough under chance (conventionally
p < 0.05) to conclude that the IV had an effect on the DV.
 Small probability = unlikely it occurred by chance
Measurement – the orderly assignment of a numerical value to a
characteristic

Scale of Measurement – the ordered set of possible numbers that may be
obtained by the measurement process (ex: the numbers on a tape measure)

Properties of Scales
 Rank order – ranking in increasing magnitude (rank them 1, 2, 3)
 Categorization/Rating – putting people in categories (fun, boring),
or in rating groups (5 star scale)
 Magnitude – values on the scale can be judged as less than, greater
than, or equal to one another.
 Equal Intervals – a unit of measurement on the scale is the same
regardless of where on the scale the unit falls
 Abs Zero (AZ) – a value where “nothing at all” of the attribute being
measured exists.
o Height or Kelvin
o Assigned 1, 2, 3 ranks don’t have AZ (no rank of zero is used)

Types of Scales
 Ratio Scale – Mag, EI, AZ
o Ratio statements can be made (a 70” person is double the height
of a 35” person)
 Interval Scale – Mag, EI (NO AZ)
o Fahrenheit or Celsius – at zero degrees there is still
temperature, so 0° is not abs zero. 25° is warmer than 12.5°,
but it isn’t twice as hot. A 12° difference on the scale
is the same magnitude regardless of where on the scale you
are.
 Ordinal Scale – Mag (No EI or AZ) – rank in magnitude, e.g.
shortest to tallest. You can assign the ranks numbers too, but the
differences bw the rankings (between 1 and 2, and between 2 and 3)
are not necessarily equal.
 Nominal Scale – Mutually exclusive groups (No Mag, EI, AZ)
o Classification of cars based on their brand name
Continuous Variable – infinite number of possible values bw any two points
on it.
Discrete Variable – only separate, countable values with nothing in between
(ex: basketball game points)

Continuous or Discrete applies to the variable being assessed, not its
measurement.
 Temp is a continuous variable, even though the device only reads
differences in 0.1° increments. Your data will show 9.1°, but the
variable is continuous because the temp could really have been
9.1218376876…°

Real Limits – Upper and lower values that can be rounded off to the value
collected as data.
 Example) You finish a race in 33s. The timer measures to a tenth of a
second, but you round to the nearest second.
o Lower real limit 32.5s (below this is 32s)
o Upper real limit 33.5s (above this is 34s)
Real Limits of a number are the points falling ½ measurement unit above
and below the number.
 Depends on the unit the device measures in: 1s intervals
(+/- 0.5s limits) or 0.1s intervals (+/- 0.05s limits)
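
The ½-unit rule can be sketched in a few lines of Python. The function name
real_limits and its arguments are made up for illustration; the only rule
used is the one stated above (half a measurement unit on either side).

```python
def real_limits(value, unit):
    """Return (lower, upper) real limits: half a measurement unit either side."""
    half = unit / 2
    return value - half, value + half


# Race time recorded to the nearest second.
print(real_limits(33, 1))
# Same time recorded to the nearest tenth of a second.
print(real_limits(33.0, 0.1))
```

With unit = 1 this reproduces the 32.5s and 33.5s limits from the race
example; a finer unit gives narrower limits.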

Notation
 Capital letters represent a variable X, Y, V
 Subscripts distinguish one score from another
o Participant 1 = X1
o Participant 2 = X2
 N – total # of scores. Subscripts run from 1 to N
 Xi – any particular score in a distribution
 Σ – summing of scores. Ex) Σ X = all Xi values summed. Written out in
full: $\sum_{i=1}^{N} X_i$
 N above the Σ sets the upper limit for summing (all Xi values)
 i = 1 below the Σ sets the lower limit. Start at X1
*The sum of a constant times a variable equals the constant times the sum
of the variable (c is a constant)
$\sum_{i=1}^{N} cX_i = c\sum_{i=1}^{N} X_i$

*The sum of a constant taken N times is N times the constant
$\sum_{i=1}^{N} c = Nc$

*The sum of a sum of variables equals the sum of their separate sums
$\sum_{i=1}^{N} (X_i + Y_i) = \sum_{i=1}^{N} X_i + \sum_{i=1}^{N} Y_i$

ΣX² – the sum of each squared score
(ΣX)² – the square of the sum of the scores
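
The summation rules can be checked numerically. This is only a quick
sanity check on made-up scores (X, Y and c are arbitrary illustration
values), not a proof.

```python
X = [2, 4, 6]  # made-up scores
Y = [1, 3, 5]  # a second made-up variable
c = 3          # an arbitrary constant
N = len(X)

# Sum of a constant times a variable = constant times the sum.
assert sum(c * x for x in X) == c * sum(X)

# Sum of a constant taken N times = N times the constant.
assert sum(c for _ in range(N)) == N * c

# Sum of a sum of variables = sum of the separate sums.
assert sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)

# The sum of squared scores is not the same as the squared sum of scores.
sum_of_squares = sum(x ** 2 for x in X)  # 4 + 16 + 36
square_of_sum = sum(X) ** 2              # 12 squared
print(sum_of_squares, square_of_sum)
```

The last two lines show why the placement of the exponent matters: the two
quantities differ for almost any set of scores.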
Frequency Distribution – indicates the number of cases observed at each
score value or within each interval of score values in a group of scores.

You can make a tally
 Tally up all the people who scored 10pts, or 4pts, etc.
Relative Frequency Distribution – a distribution that indicates the
proportion of the total number of cases (scores) observed at each score
value or interval of score values.
 Find N: total # of data values
 If 5 people scored a certain value, divide 5 by N
 That quotient is the relative frequency for that value

Class Interval – a segment of the measurement scale that contains more
than one possible score value. Ex) an A+ is 90-100%

Grouped Data – when scores are presented in class intervals

Cum f (cumulative freq distribution) – one in which the entry for any score
value or class interval is the sum of the frequencies for that value or that
interval plus the frequencies of all lower scores.
Cum Rel f (cumulative relative freq distribution) - one in which the entry for
any score value or class interval expresses that value’s or that interval’s
cumulative frequency as a proportion of the total number of cases.
f – how many scores are in that class interval?
Rel f – what proportion of scores are in that class interval? (f/N)
Cum f – how many scores are in that class interval and the class intervals
below it?
Cum Rel f – what proportion of all scores are in that class interval and the
ones below it? (Cum f/N)

Don’t add Rel f to get Cum Rel f, because the dividing and rounding to get
the Rel f will make those values less accurate than adding up the individual
frequency values and dividing by N.
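
A rough sketch of how the four columns could be built for ungrouped
scores. The function name frequency_table and the quiz scores are made up
for illustration. Note that Cum Rel f is computed as Cum f/N rather than
by adding up Rel f values, as recommended above.

```python
from collections import Counter


def frequency_table(scores):
    """Return rows of (value, f, rel_f, cum_f, cum_rel_f), lowest value first."""
    n = len(scores)
    counts = Counter(scores)
    rows = []
    cum_f = 0
    for value in sorted(counts):
        f = counts[value]
        cum_f += f  # this value's frequency plus all lower ones
        # Cum Rel f comes from Cum f / N, not from summing rounded Rel f values.
        rows.append((value, f, f / n, cum_f, cum_f / n))
    return rows


scores = [4, 10, 7, 4, 10, 10, 7, 4, 9, 10]  # made-up quiz scores
for row in frequency_table(scores):
    print(row)
```

The last row always has Cum f = N and Cum Rel f = 1, which is a handy check
on a hand-built table.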

If 45% of scores are in a class interval and below, then the upper limit of
that class interval defines the 45th percentile.

Minimum size of class interval = Range/Max number of intervals
 Divide the range of the score values in the distribution by the max
number of intervals. Determine Max number of intervals using
below…
Round the minimum class interval size UP to the next round number: 0.1,
0.2, 0.5, 1, 2, 5, 10, 20, 50, 100

Usually the lowest class interval starts with the lowest score… see below.

The first score value (lowest value) in the lowest class interval should be
evenly divisible by the size of the interval.
Ex) If the lowest score is 30 and the intervals are 5 units large, 30/5 = 6,
so the lowest interval can start at 30.

If the lowest score is 49 and the interval size is 4, drop down so that the
lowest interval is 48-51, because 48 is evenly divisible by 4.
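
The sizing steps above can be sketched as follows. The helper names are
hypothetical, and the max number of intervals is passed in as an argument
because the guideline for choosing it is not reproduced in these notes.
The list of round sizes is the one given above.

```python
import math

# Round sizes listed in the notes.
ROUND_SIZES = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100]


def interval_size(lowest, highest, max_intervals):
    """Minimum size = range / max intervals, rounded UP to a round size."""
    minimum = (highest - lowest) / max_intervals
    for size in ROUND_SIZES:
        if size >= minimum:
            return size
    raise ValueError("range is too large for the listed round sizes")


def lowest_interval_start(lowest_score, size):
    """Largest multiple of the interval size that does not exceed the lowest score."""
    return math.floor(lowest_score / size) * size


print(interval_size(30, 98, 15))     # 68/15 is about 4.5, rounded up to a round size
print(lowest_interval_start(49, 4))  # drops down to a multiple of 4
print(lowest_interval_start(30, 5))  # already a multiple of 5
```

For whole-number interval sizes the start value is exact; with sizes such
as 0.1 the result can pick up floating-point noise, so round it before
printing labels.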

Stated Limits – In a class interval, the highest and lowest values in the
interval.

The size of a class interval is obtained by subtracting the lower real limit
from the upper real limit.
Ex) The Class interval size of 30-34 is 5 because 34.5-29.5 = 5

Midpoint of a class interval = lower real limit + ½ the interval size
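
As a small worked check of the size and midpoint rules, here is a sketch
that assumes scores are measured in whole units by default (the unit
argument and the function name interval_summary are illustrative
assumptions).

```python
def interval_summary(stated_low, stated_high, unit=1):
    """Return (lower real limit, upper real limit, size, midpoint)."""
    lower_real = stated_low - unit / 2
    upper_real = stated_high + unit / 2
    size = upper_real - lower_real       # upper real limit minus lower real limit
    midpoint = lower_real + size / 2     # lower real limit + half the interval size
    return lower_real, upper_real, size, midpoint


print(interval_summary(30, 34))
```

For the stated limits 30-34 this gives real limits of 29.5 and 34.5, a size
of 5, and a midpoint of 32.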

Frequency Histogram
 Abscissa – horizontal axis
 Ordinate – vertical axis (3/4 as long as abscissa)
 Each axis should have a zero origin; a break in the abscissa scale
indicates that part of the scale was omitted.
 Each bar width = 1 class interval and straddles the interval midpoint

FREQUENCY POLYGON
 A point above each interval midpoint corresponds to the frequency
within that interval
 Must have empty intervals with 0 frequencies to the left and right
(dots on the abscissa)

Rel f Histogram or polygon – ordinate labels and heights of bars/points show
relative frequency, not frequency.
Cum Rel f & Cum f Histogram or polygon
 Ordinate is labeled Cum f or Cum Rel f
 Polygon – points placed over upper real limit of each CI including
the lowest 0 frequency interval. There is no upper 0 frequency
interval
HISTOGRAMS – preferred over polygons for discrete variables
POLYGONS – preferred over histograms when you have a continuous
variable or a very large number of score values (you get a smooth
curve, indicating the frequency distribution for ANY score)
BAR GRAPH - NOMINAL frequency
You can swap the labels on the abscissa and ordinate for these
PIE CHART - NOMINAL relative frequency
Relative frequency is an f/N value, i.e. a fraction, so you can multiply it
by 360° to find out how big a wedge a relative frequency should occupy.
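
A minimal sketch of the wedge calculation. The brand names and counts are
invented for illustration; the only rule used is Rel f times 360 degrees.

```python
def wedge_angles(frequencies):
    """Map each category to its pie wedge in degrees (Rel f * 360)."""
    n = sum(frequencies.values())
    return {category: f / n * 360 for category, f in frequencies.items()}


brands = {"Ford": 5, "Toyota": 10, "Honda": 5}  # made-up nominal counts
print(wedge_angles(brands))
```

The wedges should add up to 360 degrees, which is a quick check that no
category was left out.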

HOW DISTRIBUTIONS DIFFER


 Central Tendency of a distribution is a point on the scale
corresponding to a typical, representative, or central score.
 Variability – the extent to which scores in a distribution deviate
from their central tendency
 Skewness – an asymmetric distribution in which the scores are
bunched on one side of the central tendency and trail out on the
other. The longer the tail, the more it is skewed (B is more
skewed than A)
o Score A: Skewed to the right – positive skew
o Score B: Skewed to the left – negative skew (due to an easy
exam: lots of students getting high scores)
 Kurtosis – the curvedness or peakedness of the graph. Peaks below
have the same central tendency, but different kurtoses.
o Dist A is leptokurtic (thin distribution. Less variability)
o Dist B is platykurtic (flat dist. Greater variability)

A & B differ only with respect to central tendency

A has less variability than B
Exploratory Data Analysis – an approach and a set of tools that are used,
often in an unplanned and exploratory manner, to describe and understand
the meaning of a set of data.

Stem & Leaf Display – combine histogram and frequency distribution into
one display, while preserving more of the information in the original data
than does a frequency distribution
 A score value is broken down into a stem and a leaf
o Stem – the remaining larger digits
o Leaf – the smallest digit
o Ex) Score 23 -> Stem (2), Leaf (3)
 Here, stems are tens digits and leaves are ones digits.
 Batch – ordered list of the stems and leaves of the scores (similar
to a distribution)
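
A small sketch of splitting scores into stems and leaves, assuming
non-negative whole-number scores with tens-digit stems as in the example
above. The function names and the scores are made up for illustration.

```python
from collections import defaultdict


def stem_and_leaf(scores):
    """Map each tens-digit stem to its sorted ones-digit leaves."""
    batch = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)  # 23 -> stem 2, leaf 3
        batch[stem].append(leaf)
    return dict(batch)


def show(batch):
    """Print one line per stem, largest stem first."""
    for stem in sorted(batch, reverse=True):
        leaves = "".join(str(leaf) for leaf in batch[stem])
        print(f"{stem} | {leaves}")


show(stem_and_leaf([23, 31, 27, 45, 38, 23, 40]))
```

Unlike a grouped frequency distribution, every original score can be read
back off the display (stem 2 with leaves 3, 3, 7 is 23, 23, 27).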

Creating a Stem & Leaf Display


 Smallest stem at the bottom and go in ascending order (like class
intervals)
 Depth of a case is its rank order from the top or bottom of the
distribution (whichever rank is smaller)
o It’s like cumulative frequency
o Calculated from top/bottom until the accumulated total = ½N
o Count left to right (increasing) when counting from the
bottom
o Count right to left (decreasing) when counting from the top
o Counts (trained condition)
 73 is the 5th score from the bottom
 96 is the 3rd score from the top
o Untrained – symmetrical distribution
o Trained – centre of the batch is at the 8th stem, so they are
scoring high
o ^notice how it’s a vertical histogram
Line Width – new line for each stem. The line width is the number of possible
leaves for that stem
 Above, there are 10 possible leaves per stem.
 You can have bigger line widths by using the same stem for
multiple lines, but with code symbols
o * = 0,1
o t = 2,3
o f = 4,5
o s = 6,7
o  = 8,9
 Line Widths can be 2, 5, 10 times a power of 10
 When you have positive and negative data, you have two stems for
zero
o 0 for positive leaves
o -0 for negative leaves
Resistant indicators that describe a set of data are those that change
relatively little in value if a small portion of the data is replaced with new
numbers that may be very different from the original ones
 Mean, variance, and standard deviation are not resistant, bc they
depend on all values in a set of data, and a few very high/low scores
can throw them off

Median is a more resistant indicator than mean because the median value is
influenced by how many scores are higher or lower than it, not by how much
higher or lower they are.

Median Case for Odd number of scores = (N + 1)/2


Median Case for Even number of scores = [(N/2) + (N/2 + 1)]/2
 Take the avg of the two median cases

Ex) if N=15, the median case is the 8th case


Ex) if N = 16, the median case is an average of the 8th & 9th cases
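
Both cases reduce to the same depth, (N + 1)/2, since the average of N/2
and N/2 + 1 is also (N + 1)/2. A fractional depth such as 8.5 means
"average the 8th and 9th cases". A minimal sketch, with hypothetical
function names:

```python
def median_depth(n):
    """Depth (1-based case number) of the median for n scores."""
    return (n + 1) / 2


def median(scores):
    """Median via its depth; a depth ending in .5 averages two adjacent cases."""
    ordered = sorted(scores)
    depth = median_depth(len(ordered))
    lower = ordered[int(depth) - 1]        # case just at or below the depth
    upper = ordered[int(depth + 0.5) - 1]  # same case when the depth is whole
    return (lower + upper) / 2


print(median_depth(15), median_depth(16))
print(median([1, 3, 5]), median([1, 3, 5, 7]))
```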

Range is hugely affected by atypical high/low scores, so we can use the
“Fourth Spread” instead.
 Find the depth of the median and drop any fractional part of it
 Depth of the fourth = [depth of median + 1]/2
 FL – Lower fourth
 FU – Upper fourth
 If the depth of the fourth is fractional, count to the cases on either
side of it from the bottom and from the top and average them
o Ex) if you get 4.5, take an average of the 4th and 5th cases
from the bottom (FL), and of the 4th and 5th cases from the top (FU).
o FL – ½ way bw median and bottom of the batch
o FU – ½ way bw median and top of the batch
 Fourth-Spread = FU - FL
o Use these to see which outliers deviate from the group
substantially
o Outliers fall below FL – 1.5(4th spread)
o or above FU + 1.5(4th spread)

Extreme Scores – the lowest & highest scores in the batch, excluding
outliers
 ESL is immediately above FL – 1.5(4th spread)
 ESH is immediately below FU + 1.5(4th spread)
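
The fourths, fourth-spread, outlier cutoffs and extreme scores can be
sketched together. The function names and the data are made up for
illustration, and one assumption is added: a score sitting exactly on a
cutoff is treated as not an outlier.

```python
import math


def case_at_depth(ordered, depth):
    """Score at a 1-based depth; a depth ending in .5 averages adjacent cases."""
    low = ordered[math.floor(depth) - 1]
    high = ordered[math.ceil(depth) - 1]
    return (low + high) / 2


def fourths(scores):
    """Return (FL, FU) using depth of fourth = (truncated median depth + 1) / 2."""
    ordered = sorted(scores)
    median_depth = (len(ordered) + 1) / 2
    fourth_depth = (math.floor(median_depth) + 1) / 2
    lower = case_at_depth(ordered, fourth_depth)        # counted from the bottom
    upper = case_at_depth(ordered[::-1], fourth_depth)  # counted from the top
    return lower, upper


def outlier_summary(scores):
    """Fourths, fourth-spread, cutoffs, extreme scores and outliers."""
    f_l, f_u = fourths(scores)
    spread = f_u - f_l
    low_cut = f_l - 1.5 * spread
    high_cut = f_u + 1.5 * spread
    # Assumption: scores exactly on a cutoff count as inside, not as outliers.
    inside = [x for x in scores if low_cut <= x <= high_cut]
    outliers = sorted(x for x in scores if x < low_cut or x > high_cut)
    return {
        "FL": f_l,
        "FU": f_u,
        "spread": spread,
        "cutoffs": (low_cut, high_cut),
        "extremes": (min(inside), max(inside)),  # extreme scores, outliers excluded
        "outliers": outliers,
    }


print(outlier_summary([2, 10, 11, 12, 13, 14, 15, 16, 40]))
```

With these nine made-up scores the median depth is 5, so the depth of the
fourth is 3: FL is the 3rd score from the bottom and FU the 3rd from the
top.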
The 4th spread and the fourths +/- 1.5(4th spread) are resistant
indicators of variability

4th spread is a resistant analogue to the SD
- reflects the main body of the data, the central half of the batch

fourths +/- 1.5(4th spread) is a resistant analogue to the range
- not influenced by outliers.

Five Number Summary – a table with the major resistant indicators of
central tendency and variability
 Md – median
 FL – lower fourth
 FU – upper fourth
 LEx – lower extreme score
 UEx – upper extreme score

Boxplot
 Ordinate has no scale
 Box stretches from FL to FU
 Horizontal lines going to the extremes
 X to label outliers
 Median reflects central tendency
 The box and lines reflect the variability
 Asymmetry implies skewness
 Short box is more peaked – leptokurtic
 Broad box is flatter – platykurtic
Median – ½ have done it, ½ haven’t


Mode – more do it at this value than any other

Indices of Variability
 Range
 Variance
 Standard deviation

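Mean (Arithmetic Avg) – the sum of the scores divided by the number of
scores: $\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$

The Sum of the Deviations of Scores about their mean is zero:
$\sum_{i=1}^{N} (X_i - \bar{X}) = 0$

The Sum of the Squared Deviations of Scores about their mean is not zero
(unless every score equals the mean)

Least Squares Sense – the sum of the squared deviations of the scores about
the mean is smaller than the sum of the squared deviations about any other
number.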
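
Both properties of the mean can be checked on a small made-up batch. This
is a numeric illustration with arbitrary comparison points, not a proof;
the helper name sum_sq_dev is hypothetical.

```python
def sum_sq_dev(scores, point):
    """Sum of squared deviations of the scores about an arbitrary point."""
    return sum((x - point) ** 2 for x in scores)


scores = [2, 4, 6, 8, 10]  # made-up scores
mean = sum(scores) / len(scores)

# Deviations about the mean cancel out.
deviations = [x - mean for x in scores]
print(sum(deviations))

# Squared deviations about the mean versus about some other points.
print(mean, sum_sq_dev(scores, mean))
for other in (5, 5.5, 6.5, 7):
    print(other, sum_sq_dev(scores, other))
```

Every other point tried gives a larger sum of squares than the mean does,
which is what "least squares" refers to.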

Median
 Odd # Cases: Md score corresponds to case (N+1)/2
 Even # Cases: Md score corresponds to the middle/avg of two cases
o [(N/2) + (N/2 + 1)]/2
Mode – the most frequently occurring score value, NOT the frequency of the
most commonly occurring score value. Always the value at max peak height.
 Bimodal – 2 modes
 Multimodal – more than two modes

Mean – reflects every value
Median – not affected by extremes
Mode – describes nominal/categorical data and bimodal distributions

Mean is useful because it is used in many statistical procedures

Mean is an appropriate measure of central tendency bc
- the sum of the deviations of the scores about the mean is zero
- the sum of the squared deviations of the scores about the mean is
less than the sum about any other value.

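Variability – the extent to which scores in a distribution differ from their
central tendency.
 Range – not sensitive to all scores, just the extremes
 Variance (s²) – sum of squares (SS) divided by N − 1:
$s^2 = \frac{SS}{N-1} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N-1}$
 Standard Deviation (s) – the square root of the variance:
$s = \sqrt{s^2}$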
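
A minimal sketch of the sample variance (SS over N − 1) and standard
deviation, compared against Python's standard statistics module. The
function names and scores are made up for illustration.

```python
import math
import statistics


def sample_variance(scores):
    """Sum of squared deviations about the mean, divided by N - 1."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)
    return ss / (n - 1)


def sample_sd(scores):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(scores))


scores = [2, 4, 6, 8, 10]  # made-up scores
print(sample_variance(scores), sample_sd(scores))
# statistics.variance and statistics.stdev also use the N - 1 denominator.
print(statistics.variance(scores), statistics.stdev(scores))
```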
Properties of s² and s
 Logical to base variability on deviation from the mean
 Never negative
 s² is very sensitive to extremes since it’s based on squared
deviations
 Variance is proportional to the average squared deviation of each
score from every other score.
 As variability increases, variance increases.
 No variability gives s² and s of zero
 Under certain conditions, variance can be partitioned and its
portions attributed to different sources.

Population Formulas
μ = population average
σ² = pop variance
σ = pop SD
Mean/Variance of the sample are used as estimators of the mean and
variance of the population.

Statistic – quantitative characteristic of a sample. Roman Letter

Parameter – quantitative characteristic of a pop. Greek Letter

Stats are usually used to estimate parameters.

In variance/SD, the N-1 denominator makes s² and s better estimators of σ²
and σ than an N denominator does.

Variance is proportional to the average squared deviation of each score from
every other score. So not only does the variance reflect the extent to which
scores deviate from the mean, it also reflects how much they deviate from
one another.
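
The proportionality can be made concrete: with the N − 1 denominator, the
sample variance works out to exactly half the average squared difference
between all pairs of scores. A quick numeric check on made-up scores (the
function name is hypothetical):

```python
from itertools import combinations
import statistics


def half_mean_squared_pair_diff(scores):
    """Half the average squared difference over all pairs of scores."""
    pairs = list(combinations(scores, 2))
    mean_sq_diff = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return mean_sq_diff / 2


scores = [2, 4, 6, 8, 10]  # made-up scores
print(half_mean_squared_pair_diff(scores))
print(statistics.variance(scores))  # sample variance, N - 1 denominator
```

No mean appears anywhere in the first calculation, yet it matches the
variance, so the variance really does describe how far scores sit from
one another.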