Mother's age (valid cases)    Frequency
13-20 Years                          21
20-30 Years                         305
30-35 Years                         101
35-55 Years                          62
Total                               489
(b) Exhaustive: ALL values are considered. In the example above, are the
categories exhaustive?
(i) Relative homogeneity: cases should be truly comparable
ii) The median and nominal level data.
b) ORDINAL VARIABLES: Categories that are ranked
i) EXAMPLE: In the gss dataset, there is a variable called “spanking” that asks
whether the respondent favors spanking to discipline a child. The answers
range from strongly agree to strongly disagree, and these can be considered
ranked.
(1) Note: We can code the answers from 0 to 10 or from 10 to 0, running
from strongly disagree to strongly agree, and it does not matter at all.
The “numeric” values we assign to the answers are arbitrary; only their
order carries meaning.
c) Interval Ratio: Two properties
i) Equal distance between values
ii) 0 is a real value
4) What does Healey mean by “data reduction”?
a) Data reduction involves using a few numbers to summarize the distribution of a
variable, or an array of data as he calls it.
b) What is the problem with using only a few numbers to summarize the distribution
of a variable?
i) Summarizing a distribution involves using the mean, denoted $\bar{x}$, or
standard deviation, denoted $s$, to describe the variable. This inevitably
leads to a loss of information (precision and detail).
5) Rates: rates are defined as the number of actual occurrences of some phenomenon
divided by the number of possible occurrences per some unit of time.
a) EXAMPLE: In a city of 750,000 people, the frequency of unwed pregnancies in
a one-year period was 1875. What is the unwed pregnancy rate for this city?
i) ANS. 1875/750,000 = .0025 = 2.5 per thousand
ii) What is the pregnancy rate per 10,000 people? 25
b) EXAMPLE: In a city with population 1,000,000, there were 516 homicides in the
past year. What is the homicide rate per 10,000 people? 5.16.
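The rate examples above all follow one pattern, which can be sketched in Python. The function name `rate` and its `per` parameter are illustrative choices, not something the notes define.

```python
def rate(occurrences, population, per=1):
    """Actual occurrences divided by possible occurrences, scaled to `per` people."""
    # Multiply before dividing so whole-number inputs stay exact as long as possible.
    return occurrences * per / population


# Unwed pregnancies: 1875 in a city of 750,000
pregnancy_per_1000 = rate(1875, 750_000, per=1_000)
pregnancy_per_10000 = rate(1875, 750_000, per=10_000)

# Homicides: 516 in a city of 1,000,000
homicide_per_10000 = rate(516, 1_000_000, per=10_000)

print(pregnancy_per_1000, pregnancy_per_10000, homicide_per_10000)
```

Changing `per` only rescales the same underlying proportion, which is why 2.5 per thousand and 25 per 10,000 describe the same city.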
6) Measures of Central Tendency
a) Measures of central tendency measure the typical value of a distribution.
i) It is a way to summarize the distribution to give you an idea about the typical
case of that distribution, in other words, the center of it.
b) There are three measures of central tendency
i) The mean: describes the average score
ii) The mode: describes the most recurring score
(1) The only measure of central tendency that can be used with nominal
variables
iii) The median: is the 50th Percentile of the distribution
(1) A median is a special case of a percentile, which is the score below
which a specific percentage of cases fall.
c) How does the median differ from the mode and the mean? Unlike the mode or the
mean, the median always represents the exact center of a distribution of scores,
meaning that 50% of the cases always fall above the median and 50% of the cases
always fall below the median.
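As a rough illustration of the three measures, the sketch below uses Python's standard `statistics` module. The list `scores` is hypothetical data made up for this example; it does not come from the notes.

```python
import statistics

scores = [1, 2, 2, 3, 10]  # hypothetical data, chosen to include one extreme value

mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # middle score of the sorted list
mode = statistics.mode(scores)      # most frequently occurring score

print(mean, median, mode)
```

In this made-up list the single large score pulls the mean above the median, while the median still splits the sorted cases in half.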
d) Characteristics of the mean
i) The mean is the balance point of any distribution: it is the point
around which all of the scores cancel out. Mathematically, this says that if I
subtract the mean from each value and sum the results, the resulting sum will
be equal to 0. This is mathematically given as
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
For example, for the scores 1, 2, 3, 4, 5 (mean = 3):
1 – 3 = -2
2 – 3 = -1
3 – 3 = 0
4 – 3 = 1
5 – 3 = 2
(-2) + (-1) + (0) + (1) + (2) = 0
More generally,
$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$
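The cancelling-out property can be checked numerically. This is a minimal sketch; the helper name `deviations_from_mean` is an assumption made for illustration, and exact fractions are used so that rounding error does not hide the result.

```python
from fractions import Fraction


def deviations_from_mean(scores):
    """Return each score minus the mean, using exact fractions."""
    mean = Fraction(sum(scores), len(scores))
    return [x - mean for x in scores]


# The scores 1 to 5, whose mean is 3
small = deviations_from_mean([1, 2, 3, 4, 5])
print([int(d) for d in small], sum(small))

# A second, hypothetical set of scores whose mean is not a whole number
other = deviations_from_mean([2, 3, 7])
print(sum(other))
```

With ordinary floating-point numbers the sum may come out as a tiny nonzero value instead of exactly 0; that is a rounding effect, not a failure of the property.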
Statistics: AGE OF RESPONDENT
N (valid)     1385
N (missing)      2
Mean         44.94
Median       41.00
[Figure: histogram of AGE OF RESPONDENT, N = 1385; horizontal axis from 20 to 90 years]
Minutes spent on test        mid pt     f    mid pt x f
0 to less than 5 minutes        2.5     2             5
...
Total                                  30           335
Mean = 335/30 ≈ 11.2
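The grouped-data calculation (sum of mid pt x f divided by the total f) can be sketched as follows. The function name `grouped_mean` is an assumption, and because the notes preserve only the first class interval, every other row in the example is hypothetical and will not reproduce the 335/30 result.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower, upper, frequency) class intervals."""
    total_f = 0
    total_mf = 0.0
    for lower, upper, f in classes:
        midpoint = (lower + upper) / 2   # the "mid pt" column
        total_f += f                     # running total of f
        total_mf += midpoint * f         # running total of mid pt x f
    return total_mf / total_f


# Only the first class (0 to less than 5 minutes, f = 2) comes from the notes;
# the other three rows are hypothetical and exist only to make the sketch runnable.
classes = [(0, 5, 2), (5, 10, 3), (10, 15, 4), (15, 20, 1)]
print(grouped_mean(classes))
```

The result is an estimate, since every case in a class is treated as if it sat exactly at the class midpoint.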
8) Measures of Dispersion
a) What is a “measure of dispersion?”
i) Measures of central tendency tell us nothing about how much the data
values differ from each other.
(1) EXAMPLE: What is the mean of the following two distributions of
AGE?
(a) 50 50 50 50 50
(b) 10 20 50 80 90
(2) Both distributions have a mean of 50, yet they are obviously very
different.
(a) Measures of dispersion or variability attempt to quantify the spread of
observations.
(b) It is a measure of variability, usually defined in terms of variability
around the mean.
(c) A deviation is the distance between an individual score and the mean
value; mathematically this is $(X_i - \bar{X})$.
(d) The larger the distance from the mean, the larger the deviation will be.
(e) The more tightly the scores cluster around the mean, the less
variability there will be.
(i) PRACTICAL EXAMPLE: Let’s assume that average income for
people with PhDs is $55,000 and average income for people with
a high school education is $20,000. Since people with only a HS
education have fewer opportunities than those with PhDs, most
of them would make somewhere around $20K, so there is not
much variation. However, it is possible for PhDs to make
anywhere from $20K to $800K per year, and hence there is much
more variation around the average salary for PhDs than there is
for HS graduates.
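To make the contrast between the two AGE distributions concrete, the sketch below computes the mean, the deviations from the mean, and the sample standard deviation for each. It uses the standard `statistics` module; the choice of the sample (n - 1) standard deviation rather than the population version is an assumption.

```python
import statistics

ages_a = [50, 50, 50, 50, 50]
ages_b = [10, 20, 50, 80, 90]

for label, ages in (("a", ages_a), ("b", ages_b)):
    mean = statistics.mean(ages)
    deviations = [x - mean for x in ages]   # (X_i - mean) for each score
    sd = statistics.stdev(ages)             # sample standard deviation (n - 1 divisor)
    print(label, mean, deviations, round(sd, 2))
```

Both lists share the same mean, so only the deviations and the standard deviation reveal how differently the ages are spread.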
b) Measures of dispersion we have looked at
i) Inter-Quartile Range: defined as the 75th percentile minus the 25th percentile.
ii) Quartile/Deciles
iii) Standard deviations
iv) Creating Box and Whiskers
9) Standardized Variables
a) EXAMPLES
i) Here is a random sample of eleven scores on a PLS 201 exam: 12, 16, 16, 18,
23, 23, 24, 25, 25, 26, 29
(1) Find the sample mean.
(a) Answer: $\bar{x}$ = 21.5
(2) Find the sample standard deviation.
(a) Answer: you should get approximately 5 (about 5.2 before rounding).
(3) Find the median. Answer: 23
(4) Find the z-score for the student who received the highest score on the
exam. Answer: z = (29 - $\bar{x}$)/$s_x$ = (29 - 21.5)/5 = 1.5, where
$\bar{x}$ = 21.5 and $s_x$ = 5. Interpretation: this student’s score was
1.5 standard deviations above the mean.
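The exam answers can be checked with a few lines of Python using the standard `statistics` module. This is a sketch that assumes the sample (n - 1) standard deviation is intended, as the question asks for the sample standard deviation.

```python
import statistics

exam = [12, 16, 16, 18, 23, 23, 24, 25, 25, 26, 29]

mean = statistics.mean(exam)
sd = statistics.stdev(exam)          # sample standard deviation (n - 1 divisor)
median = statistics.median(exam)
z_top = (max(exam) - mean) / sd      # z-score of the highest score

print(round(mean, 2), round(sd, 2), median, round(z_top, 2))
```

Working with unrounded values gives a z-score of roughly 1.4 rather than 1.5; the difference comes entirely from rounding the standard deviation down to 5 in the hand calculation.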
ii) Faculty salaries at a Midwestern university are normally distributed with a
mean of $51,500 and standard deviation of $3,000.
(1) Find the probability that one faculty member chosen at random has a
salary less than $50,000.
(a) Answer: Let X = salary of a randomly selected faculty member.
Given that X ~ N(51500, 3000), i.e. normal with mean 51,500 and
standard deviation 3,000: P(X < 50000) = P(Z < (50000 -
51500)/3000) = P(Z < -.5) = .3085
iii) The mean height of adults in an African village is 150 cm, and the
standard deviation is 6 cm. What is the probability that a randomly selected
adult from this village will be shorter than 162 cm, if we assume that the
distribution of height in the population is normal?
(1) Calculate the z-value for 162 cm using the z-transformation, and look up
the corresponding p value in the table. The z-transformation is
$z = \frac{x - \bar{x}}{s_x}$
For x = 162 cm, mean = 150 cm, SD = 6 cm:
$z = \frac{162 - 150}{6} = +2$
The table gives P(Z < 2) = .9772, so about 97.7% of adults in the village
are shorter than 162 cm.
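Instead of a printed z-table, the normal probabilities in the salary and height examples can be computed directly. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

# Faculty salaries: mean 51,500 and standard deviation 3,000
salaries = NormalDist(mu=51_500, sigma=3_000)
z_salary = (50_000 - salaries.mean) / salaries.stdev
p_salary = salaries.cdf(50_000)           # P(X < 50,000)

# Village heights: mean 150 cm and standard deviation 6 cm
heights = NormalDist(mu=150, sigma=6)
z_height = (162 - heights.mean) / heights.stdev
p_height = heights.cdf(162)               # P(X < 162)

print(z_salary, round(p_salary, 4))
print(z_height, round(p_height, 4))
```

The `cdf` method returns the area to the left of the given value, which is the same quantity a standard z-table reports for the corresponding z-score.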
1. Calculate the median M, lower and upper quartiles, Q1 and Q3, and the
interquartile range, IQR= Q3 – Q1, for the data set.
2. Construct a box with Q1 and Q3 located at the lower corners. The base width will
then be equal to IQR. Draw a vertical line inside the box to locate the median M.
3. Construct the limits on the box plot at a distance of 1.5 * IQR below Q1
and 1.5 * IQR above Q3. Values that fall outside these limits are Extreme Values.
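The box plot steps can be sketched in Python with the standard `statistics` module. The function name `box_plot_summary` and the data are assumptions for illustration. Quartile conventions differ between textbooks and software, so the values from `statistics.quantiles` (default "exclusive" method) may not match a hand calculation exactly.

```python
import statistics


def box_plot_summary(data):
    """Return the median, quartiles, IQR, limits and any extreme values."""
    q1, median, q3 = statistics.quantiles(data, n=4)   # default "exclusive" method
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    extremes = [x for x in data if x < lower_limit or x > upper_limit]
    return {"Q1": q1, "M": median, "Q3": q3, "IQR": iqr,
            "limits": (lower_limit, upper_limit), "extremes": extremes}


# Hypothetical scores; the value 60 is included to show an extreme value being flagged.
scores = [12, 16, 16, 18, 23, 23, 24, 25, 25, 26, 29, 60]
print(box_plot_summary(scores))
```

Only the summary numbers are computed here; drawing the box and whiskers from them is left to whatever plotting tool is at hand.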
3. An experiment was performed upon rats to investigate the effect of ingesting Alar
(a chemical sprayed on apple trees to keep fruit from dropping before ripe) upon
subsequent cancer rates. The following variables were measured:
gender (0=female, 1=male); weight (g); dose of Alar (nil, low, high); number of
tumors
The typical weight of a rat is about 800 g and the weights were rounded to the
nearest gram. The number of tumors is around 10. Which of the following is
FALSE? (Answer: c)
a. Gender is nominal scale; dose is ordinal scale
b. Gender is discrete; weight is continuous
c. Number of tumors is discrete and is interval scale
d. Dose is ordinal scale and discrete
e. Weight is ratio scale; and number of tumors is discrete.
4. Here are some summary statistics on the results of the experiment. Draw suitable
BOXPLOTS to compare the results. Salmon production is in kg/km of spawning
sites.
Quantiles
Level       Minimum   10.0%   25.0%   median   75.0%   90.0%   maximum
clear cut       0.9     0.9     3.3     19.2    48.1    87.4      90.0
selective       0.9     1.2     8.5     29.3    51.5    93.4     108.0
5. What do you conclude from your boxplot and the descriptive statistics? Be sure to
explain how your plot leads you to this conclusion.
Solution: It appears that clear cut streams produce, on average, less salmon than
selectively cut streams. This is because the box plot for the clear-cut areas is shifted
down relative to the box plot for the selective harvest areas, and the median of the
clear cut areas appears to be less than the median of the selective harvest areas.
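The numbers needed to draw and compare the two boxplots can be derived from the quantile table. This is a sketch under the assumption that the 25.0% and 75.0% columns are used as Q1 and Q3; the dictionary layout and the helper name `box_measures` are illustrative.

```python
# Quantiles copied from the summary table (salmon production, kg/km)
summary = {
    "clear cut": {"min": 0.9, "q1": 3.3, "median": 19.2, "q3": 48.1, "max": 90.0},
    "selective": {"min": 0.9, "q1": 8.5, "median": 29.3, "q3": 51.5, "max": 108.0},
}


def box_measures(row):
    """IQR and 1.5 * IQR limits for one row of the summary table."""
    iqr = row["q3"] - row["q1"]
    return {"iqr": iqr,
            "lower_limit": row["q1"] - 1.5 * iqr,
            "upper_limit": row["q3"] + 1.5 * iqr}


for level, row in summary.items():
    m = box_measures(row)
    print(level, row["median"], round(m["iqr"], 1),
          round(m["lower_limit"], 1), round(m["upper_limit"], 1))
```

Comparing the medians and the box positions side by side supports the same reading as the solution: the clear cut box sits lower than the selective one.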
8. A student discovers that his grade on a recent test was the 72nd percentile. If 90
students wrote the test, then approximately how many students received a higher
grade than he did? (Answer: b)
a. 65
b. 25
c. 72
d. 71
e. 18
Solution: (1 - .72)(90) = 25.2 or (.72)(90) = 64.8 students who scored less than him
so 90 – 64.8 is approximately 25.
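The percentile arithmetic in this solution can be written as a small function. The name `students_above` is an assumption made for illustration.

```python
def students_above(percentile, class_size):
    """Approximate count of students scoring above a given percentile."""
    return class_size * (100 - percentile) / 100


above = students_above(72, 90)   # share of the class above the 72nd percentile
below = 90 - above               # share of the class at or below it
print(round(above, 1), round(below, 1))
```

Because a head count must be a whole number, the result is rounded to the nearest student when choosing among the answer options.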
Last Question
Total 14 100.0
Identify the Percent, Cumulative Percent and state the median, Q1, Q3, Interquartile
Range and identify any extreme values.
5.8 – 5.1 = .7