
PRIVATE - Megha Majumder

Chapter 1 - Controlled Experiments
I. The Salk Vaccine Field Trial
A. The effectiveness of a drug can be tested by COMPARISON. The drug is given to
subjects in a treatment group, while other subjects are used as controls (not treated). Then
the responses of the two groups are compared. Subjects should be assigned to
treatment or control at RANDOM, and the experiment should be run DOUBLE-BLIND:
neither the subjects nor the doctors who measure the responses should know who was
in the treatment group and who was in the control group.
B. Polio - one of the polio vaccines of the 1950s, developed by Jonas Salk, was tested. When
the vaccine came out, it wasn't possible to prove its effectiveness observationally,
because polio is an epidemic disease whose incidence varies from year to year. A low incidence
could have meant that the vaccine was effective OR that that year wasn't an epidemic year.
C. The only way to find out if the vaccine worked was to deliberately leave some children
unvaccinated and use them as controls. The NFIP (National Foundation for Infantile
Paralysis) ran a controlled experiment to show the effectiveness of the vaccine.
D. Grades most vulnerable to polio: 1, 2, and 3. The field trial was carried out in selected
school districts throughout the country where the risk of polio was high. Two million children
were involved: 500,000 were vaccinated, 1,000,000 were deliberately left unvaccinated, and
500,000 refused vaccination.
E. This is COMPARISON: only subjects in the treatment group were vaccinated; the controls did
not get the vaccine. The responses of the two groups could then be compared to see if
the treatment made any difference.
F. Although the sizes of the groups differed, the investigators compared the rates at which
children got polio in the two groups - cases per hundred thousand. Looking at rates
instead of raw numbers adjusts for the differences in size.
G. Children whose parents consented would go into the treatment group and get the
vaccine; the others would serve as controls. However, it's known that higher-income
parents were more likely to consent to treatment than lower-income parents. This makes
the study biased AGAINST the vaccine, because children of higher-income parents are
more vulnerable to polio.
H. Polio is a disease of hygiene, so rich kids are more likely to get it. Kids who grow up in
less hygienic surroundings tend to be infected early; after being infected, they generate
their own antibodies, which protect them against more severe infection later.
I. If the two groups differ with respect to some factor other than the treatment, the effect of
this other factor might be CONFOUNDED (mixed up) with the effect of the treatment.
Thus, confounding is a major source of bias.
J. NFIP study design - vaccinate grade 2 children whose parents would consent, leaving
children in grades 1 and 3 as controls. Potential bias: polio is contagious, spreading
through contact, so the incidence could have been higher in grade 2 than in grades 1 and 3,
biasing the study against the vaccine. Also, parental consent was needed for the treatment
group but not for the control group, so the children in the treatment group were more
vulnerable to polio than those in the control group (another bias against the vaccine).
K. In the randomized Salk vaccine field trial, the chance of assignment to the treatment group
or the control group was 50-50 - it was RANDOMIZED CONTROLLED.
L. PLACEBO - another basic precaution: the children in the control group
were given an injection of salt dissolved in water. During the experiment, the subjects
didn't know whether they were in treatment or control, so their response was to the vaccine,
not to the idea of a treatment.
M. DOUBLE-BLINDING - the Salk trial doctors could have been affected by knowing whether
the child was vaccinated or not when diagnosing polio, so the experiment was
double-blinded. The subjects didn't know whether they got the treatment or the placebo,
and neither did those who evaluated the responses.
N. The NFIP study was biased against the vaccine: the vaccine looked less effective in the
NFIP study than in the randomized controlled trial. Main source of bias = confounding. The control group
couldn't be compared to the treatment group because the control group included kids whose parents
wouldn't have consented, while the treatment group had only kids whose parents
consented.
O. RCDB (randomized controlled double-blind) reduces bias to a minimum - the main reason to
use it. Assignment to treatment or control is made by chance, so confounding variables are
unlikely to explain the results; what is left is the treatment. Each child has a 50-50 chance
to be in treatment or control. If the vaccine had no effect, each polio
case would then have a 50-50 chance to turn up in the treatment or the control group, and the
number of polio cases in the two groups would be about the same. A large difference is
evidence that the vaccine works.
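The 50-50 argument in item O can be checked with a small simulation. This is only a sketch: the number of cases (200) and the function name `split_cases` are made up for illustration and are not figures from the trial. It assumes the vaccine has no effect, so each case lands in treatment or control by a fair coin toss.

```python
import random

def split_cases(n_cases, seed=0):
    """Assign each polio case to treatment or control by a fair coin toss.

    Models the "no effect" assumption: the vaccine changes nothing, so
    chance alone decides which group a case turns up in.
    """
    rng = random.Random(seed)
    treatment = sum(rng.random() < 0.5 for _ in range(n_cases))
    return treatment, n_cases - treatment

# Hypothetical number of cases, chosen only for illustration.
treatment_cases, control_cases = split_cases(200)
print(treatment_cases, control_cases)
```

Under this assumption the two counts come out close to each other, so a split far from even would be hard to explain by chance alone.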
!
II. The Portacaval Shunt
i. Portacaval shunt - surgery that redirects blood flow, used when a patient with
cirrhosis is hemorrhaging and may bleed to death. The operation is difficult and risky, so do the
benefits outweigh the risks? The studies done on this question:
a. 32 without controls
b. 15 with non-randomized controls
c. 4 with randomized controls
ii. Of the 51 studies conducted to assess the effect of the surgery:
a. 75% of the studies without controls were markedly enthusiastic about the shunt (yes, benefits
outweigh risks)
b. 67% of the studies with non-randomized controls were markedly enthusiastic (assignment to
treatment or control was not random)
c. 0% of the randomized studies were markedly enthusiastic
The poorly designed studies exaggerated the value of the surgery; the well-designed
studies showed the surgery to have little or no value.
iii. In a randomized controlled experiment, the people in the control group are like the people
in the treatment group. In a non-randomized experiment, the surgeon tends to operate only on the
healthier patients, and patients who are ineligible for surgery (like the ones who
are sicker) end up as controls.
iv. Three-year survival rates show that the subjects selected for surgery in the non-randomized
studies were healthier than the controls. The non-randomized experiments are biased in favor of
the surgery because sicker people were used as controls and healthier
people got the surgery, so a greater percentage of the surgery patients survived.

!
III. HISTORICAL CONTROLS!
i. Historical controls - patients who were treated the old way in the past. The treatment
group and the historical control group may differ in important ways besides the treatment.
ii. In a controlled experiment, there's a group of patients eligible for a treatment at the
beginning of the study. Some of these are assigned to treatment; the others are used as
controls. Assignment to treatment or control is done CONTEMPORANEOUSLY, that is, in the
same time period.
iii. Portacaval shunt experiment - the poorly controlled trials were done with historical controls
or other non-randomized controls.
iv. Coronary bypass surgery - a common and expensive operation for coronary artery
disease. The badly designed studies were more enthusiastic about the value of the
surgery.
v. Three-year survival rates for surgery patients and controls show that the treatment
and historical control groups differed: patients selected for surgery were healthier.
vi. The controls in the historically controlled studies have a much lower survival rate because
the controls were less healthy to begin with.
vii. Randomized trials avoid that kind of bias, which is why the design of a study matters.
viii. DES, a drug that was used to prevent spontaneous abortion, looked effective in studies
with historical controls but proved not to help in randomized controlled experiments. In fact,
it was later linked to a rare form of cancer in the daughters of women who took it.
!
1. In the Salk Vaccine Field Trial of 1954, run by the NFIP, two million children from grades 1 through
3 in schools across the USA were involved in the experiment. In total, 500,000
students were vaccinated, 500,000 refused, and 1,000,000 students were deliberately left
unvaccinated as controls. The researchers then determined the polio incidence
rate post-vaccination and compared it to that of the previous year (1953). The 500,000
students who were vaccinated got the vaccine with the consent of their parents, and it was
primarily higher-income families who consented to the vaccination of their children. Also, the
students who were vaccinated were all second graders. Children from grades 1 and
3 were used as the 1,000,000 controls. The results for the trial were as follows:
!
NFIP study                  Size       Rate (cases per 100,000 subjects)
Grade 2 (vaccine)           225,000    25
Grades 1 and 3 (control)    725,000    54
Grade 2 (no consent)        125,000    44
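A quick way to see why the table reports rates rather than counts is to convert each rate back into an approximate number of cases. The sketch below uses only the sizes and rates in the NFIP table; the helper name `approx_cases` is made up for illustration.

```python
# (group, size, polio cases per 100,000) as listed in the NFIP table
nfip = [
    ("Grade 2 (vaccine)", 225_000, 25),
    ("Grades 1 and 3 (control)", 725_000, 54),
    ("Grade 2 (no consent)", 125_000, 44),
]

def approx_cases(size, rate_per_100k):
    """Approximate number of cases implied by a group size and a rate per 100,000."""
    return size * rate_per_100k / 100_000

for group, size, rate in nfip:
    print(f"{group}: about {approx_cases(size, rate):.0f} cases, {rate} per 100,000")
```

The vaccinated and no-consent groups in grade 2 work out to nearly the same number of cases (about 56 and 55), yet their rates are 25 and 44 per 100,000, because the groups differ in size.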
!
What are the potential ways in which the results of this study could have been biased?!
- Children could not be vaccinated without parental consent!
- Higher-income parents were more likely to give consent!
- Children of higher-income families were more vulnerable to polio!
- Infection rate can vary from grade to grade!
!
How could this experiment be made to be less biased? Give 2 examples.!
Treatment and control groups should be as similar as possible, except for the treatment. Use
randomness rather than human judgment to assign subjects to groups and avoid bias.!
!
!
IV. SUMMARY!
i. Statisticians use the method of comparison. They want to know the effect of a treatment
(like the Salk vaccine) on a response (like getting polio). To find out, they compare the
responses of a treatment group with a control group. Usually, it is hard to judge the effect
of a treatment without comparing it to something else. !
ii. If the control group is comparable to the treatment group, apart from the treatment, then
a difference in the responses of the two groups is likely to be due to the effect of the
treatment. !
iii. However, if the treatment group is different from the control group with respect to other
factors, the effects of these other factors are likely to be CONFOUNDED with the effect
of the treatment.!
iv. To make sure that the treatment group is like the control group, investigators put
subjects into treatment or control at random. This is done in randomized controlled
experiments. !
v. Whenever possible, the control group is given a placebo, which is neutral but resembles
the treatment. The response should be to the treatment itself rather than to the idea of
the treatment.!
vi. In a double-blind experiment, the subjects do not know whether they are in treatment or
in control; neither do those who evaluate the responses. This guards against bias, either
in the responses or in the evaluations.
!
!
8. Some studies find an association between liver cancer and smoking. However, alcohol
consumption is a confounding variable. This means
(ii) Drinking is associated with smoking, and alcohol causes liver cancer.
A confounding variable is a source of bias: it is a factor, other than the treatment itself,
by which the two groups (a treatment group and a control group) differ.
It is a third variable that is associated with exposure and with disease (Freedman, Statistics). In
this particular case, alcohol is a confounding variable: its effect got mixed up with the effect
of smoking. Smokers are generally known to consume more alcohol than non-smokers, and
greater alcohol consumption can lead to liver cancer. Thus, drinking is associated with smoking,
and the heavier consumption of alcohol, rather than the smoking, may be what causes the
liver cancer.
!
9a. Does screening save lives? Which numbers in the table prove your point?
Yes, screening does save lives. Out of the 31,000 women in the total treatment group, 39
died from breast cancer, giving a rate of 1.3 per 1,000. Out of the 31,000 women in the control
group, 63 died from breast cancer, giving a much higher rate of 2.0 per 1,000, compared to the
1.3 rate in the treatment group (the women who were offered screening in the HIP trial).
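The two rates in the answer above can be reproduced from the counts. A minimal sketch, using only the numbers quoted in the answer; the function name is an illustrative choice, and it assumes the rates are deaths per 1,000 women, which is the scale on which 39 out of 31,000 rounds to 1.3.

```python
def rate_per_1000(deaths, group_size):
    """Deaths per 1,000 women in a group."""
    return 1000 * deaths / group_size

treatment_rate = rate_per_1000(39, 31_000)  # whole treatment group
control_rate = rate_per_1000(63, 31_000)    # control group
print(round(treatment_rate, 1), round(control_rate, 1))
```

Because the two groups are the same size here, comparing counts (39 vs. 63) and comparing rates lead to the same conclusion.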
!
!
Chapter 2 - Observational Studies!
I. Introduction!
i. Controlled experiment - a study where the investigators decide who'll be in the treatment
group and who will not. Control = someone who didn't get the treatment.
ii. Observational study - the subjects assign themselves to the different groups, and the
investigators just watch what happens. The treatment-control idea is still used, but subjects
end up in treatment or in control without any influence from the
investigators.
iii. To determine via observational study whether giving up smoking will make you live longer, you
have to control for confounding variables like age and sex. Older people have different
smoking habits and are more at risk for lung cancer, so you have to compare smokers
and non-smokers by age. Also, men are more likely to get heart disease than women, so
you have to compare smokers and non-smokers within each sex. EX: compare male
smokers age 55-59 and male non-smokers age 55-59.
iv. Association between treatment and outcome is circumstantial evidence for causation.!
v. Association does not prove causation. Confounding factors may exist.!
vi. Observational studies can be powerful tools but can also be misleading.!
a. Were the control and treatment groups similar?!
b. Did the two populations differ in ways other than the treatment?!
vii. Technique: compare small, more homogeneous groups (e.g., age, sex)!
!
II. THE CLOFIBRATE TRIAL!
i. Coronary Drug Project: randomized, controlled double-blind experiment (placebo =
lactose) to evaluate heart attack prevention drugs!
ii. 8,341 patients followed for five years (5,552 got treatment, 2,789 controls)!
iii. Clofibrate: a cholesterol reduction drug evaluated in the study!
iv. Comparing 20% to 21% (five-year death rates, clofibrate group vs. placebo group) shows that
clofibrate did not save lives.
v. Many subjects did not take their medicine (non-adherers).
vi. To control for adherence, compare adherers with adherers: 15% to 15% (not 15% to 21% or
25%), which showed that:
vii. Conclusions:
a. Clofibrate does not have an effect.
b. Adherers are different from non-adherers. The likely reason adherers lived longer than
non-adherers in both the treatment and control groups is that they were more
concerned with their health and took better care of themselves in general, not that they took
the capsules.
!
III. MORE EXAMPLES!
i. Example 1 - Pellagra: Among many associations between this disease (first observed in Europe
in the 18th century) and other factors, lack of niacin was found to be the underlying cause.
Pellagra = disease found in European villages that caused disability and death. The households
where the disease struck were usually unsanitary and filled with blood-sucking flies. The fly had
the same geographical range as pellagra cases, and the disease peaked at the times of year when
the fly was most active. Epidemiologists concluded that the disease was
infectious and, like malaria, transmitted by insects. However, the flies were just a
marker of poverty. Impoverished people ate corn, which has little niacin. Niacin is the
P-P (pellagra-preventive) factor, which is found in meat, milk, eggs, and veggies, not corn. People
who got pellagra were too poor to eat those, so they ate corn. The flies only indicated that
they were in poverty.
ii. Example 2 - Cervical Cancer and Circumcision: Human papilloma virus (HPV) was found to be
the underlying cause, which explained the differences in cancer rates between particular
populations seen in the 1950s. Cervical cancer used to be one of the most common cancers among
women, but among Jews and Muslims it was rare. Because these
populations were largely unaffected, investigators thought that circumcision of males was the
protective factor. HOWEVER, it turns out that cervical cancer is a sexually transmitted
disease, and HPV is the causal agent.
iii. Example 3 - Ultrasound and Low Birthweight: In an observational study, babies exposed to
ultrasound exams had lower birthweight, on average, than babies who weren't exposed. Is this evidence
that ultrasound causes lower birthweight? No - women get ultrasound exams if they have or
expect to have problem pregnancies. The confounding factor of problem
pregnancy explains the association between ultrasound and low birthweight.
Randomized controlled experiments showed that, if anything, ultrasounds may be protective.
iv. Example 4 - The Samaritans and Suicide: A confounding factor explained the association
between the expansion of a volunteer welfare organization (the Samaritans) and a decrease in the
English suicide rate in 1964-1970; the Samaritans did not prevent the suicides,
as one investigator thought. To control for confounding variables in his study, the towns in a pair were
matched on the variables regarded as important (one town had a branch of the Samaritans,
the other did not). However, when another investigator used a bigger sample, he
found no effect. The suicide rate was stable in the 70s, even though the Samaritans
continued to expand. The decline in suicide rates in the 60s is explained by a shift from
coal gas to natural gas, which is less toxic.
!
IV. SEX BIAS IN GRADUATE ADMISSIONS!
i. Observational study on sex bias in admissions at UC Berkeley in 1973
i. 44% of 8,442 male applicants were admitted
ii. 35% of 4,321 female applicants were admitted
ii. Compare admissions within the six largest majors:
i. Major A: less selective, but few women and many men applied
ii. Major E: highly selective, but many women and few men applied
iii. Simpson's paradox: relationships between percentages in subgroups can be reversed
when the subgroups are combined.
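A small numerical sketch may make Simpson's paradox easier to see. The counts below are hypothetical, invented only for illustration; they are NOT the Berkeley figures. Women apply mostly to the hard major, men mostly to the easy one.

```python
# Hypothetical (applicants, admitted) counts, invented for illustration only;
# these are NOT the Berkeley admissions data.
data = {
    "easy major": {"men": (800, 480), "women": (100, 70)},
    "hard major": {"men": (200, 40), "women": (900, 225)},
}

def admit_rate(applicants, admitted):
    """Percentage of applicants admitted."""
    return 100 * admitted / applicants

def overall_rate(sex):
    """Admission percentage for one sex with both majors combined."""
    applicants = sum(data[major][sex][0] for major in data)
    admitted = sum(data[major][sex][1] for major in data)
    return admit_rate(applicants, admitted)

for major, by_sex in data.items():
    print(major, {sex: admit_rate(*counts) for sex, counts in by_sex.items()})
print("overall", {sex: overall_rate(sex) for sex in ("men", "women")})
```

In these made-up numbers, women are admitted at a higher rate than men within each major, but at a lower rate overall, because most women applied to the major that is hard to get into.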
!
V. CONFOUNDING!
i. Confounding: a difference between the treatment and control groups, other than the
treatment, that affects the responses being studied.
ii. Confounders must be associated with both:
i. The disease/outcome and
ii. The exposure/treatment.
iii. ex (Fisher's constitutional hypothesis): if there's a gene which increases the risk of lung
cancer and ALSO gets people to smoke, it meets both tests for a confounder. This gene would
create an association between smoking and lung cancer.
(A gene that causes cancer but is unrelated to smoking is not a confounder and is
sideways to the argument, because it does not account for the association
between smoking and cancer.)
iii. Hidden confounders are a major problem in observational studies.!
iv. Examples:!
i. NFIP polio vaccine study: family income.!
ii. Portacaval shunt studies: health of patients selected for surgery!
iii. Coronary bypass surgery studies: health of patients selected for surgery!
iv. Cervical cancer study: sexual activity!
!
VI. SUMMARY AND OVERVIEW!
i. In an observational study, investigators do not assign the subjects to treatment or
control. Some of the subjects have the condition whose effects are being studied; this is
the treatment group. The other subjects are the controls. For example, in a study on
smoking, the smokers form the treatment group and the non-smokers are the controls.
ii. Observational studies can establish association: one thing is linked to another.
Association may point to causation: if exposure causes disease, then people who are
exposed should be sicker than similar people who are not exposed. But association
does not prove causation. !
iii. In an observational study, the effects of treatment may be confounded with the effects of
factors that got the subjects into control or treatment in the first place. Observational
studies can be quite misleading about the cause-and-effect relationships, because of
confounding. A CONFOUNDER is a third variable, associated with exposure and
disease. !
iv. When looking at a study, ask the following questions. Was there any control group at all?
Were historical controls used, or contemporaneous controls? How were subjects
assigned to treatment - through a process under the control of an investigator
(controlled experiment) or a process outside the control of the investigator (observational
study)? If a controlled experiment, was the assignment made using a chance
mechanism (randomized controlled), or did assignment depend on the judgment of the
investigator?
v. With observational studies and with nonrandomized controlled experiments, try to find
out how the subjects came to be in treatment or in control. Are the groups comparable?
Different? What factors are confounded with treatment? What adjustments were made to
take care of confounding? Were they sensible? !
vi. In an observational study, a confounding factor can sometimes be CONTROLLED FOR,
by comparing smaller groups which are relatively homogeneous with respect to the
factor.!
vii. Study design is a central issue in applied statistics. The great weakness of observational
studies is confounding; randomized experiments minimize this problem.
!
!
Chapter 3 - The Histogram!
I. INTRODUCTION!
i. A histogram is a graph used to summarize data.!
ii. The total area under the curve is 1 (that is, 100%).!
iii. The horizontal axis is divided into class intervals.!
iv. The area of a rectangle is proportional to the percentage of data values in the class
interval.!
v. Areas of the blocks represent percentages !
!
#10a p23 (the refused group is your control here since they aren't examined)
To show that screening reduces the risk from breast cancer, someone wants to compare 1.1
and 1.5. Is this a good comparison? Is it biased against screening? For screening?

No, this is not a good comparison, because the two groups can differ in many ways besides
screening, including potentially confounding variables. We know that women from a
higher socioeconomic class are more likely to accept screening, and those who refuse it are
more likely to be from a lower socioeconomic class. The two classes also have different
incidences of breast cancer. Socioeconomic class, then, is a potential confounding variable:
it is associated with the treatment, in the sense that women of different
socioeconomic status react differently when they're offered screening, and it's also
associated with the outcome, in the sense that women from different
socioeconomic classes are affected by breast cancer at significantly different
rates. This particular comparison, between 1.1 and 1.5, makes screening look worse than it is
(it is biased against screening), because we are comparing the two subgroups within the
treatment group, examined and refused. The examined are likely to be from one class, and the
refused are likely to be from the other. The rate for the examined is low, likely because of
the screening. The rate for the refused is also low in comparison to the control group because
it started off low: the women who refused were less likely to get breast cancer in the first
place, based on the information we were given. The subgroup that agreed to be screened has a higher
incidence of breast cancer, and the subgroup that refused has a lower one. Taken
together, the two subgroups still have a lower overall death rate from the disease
(lower than 1.5, that is) than the control group.
!
#11 p23
Cervical cancer is more common among women who have been exposed to the herpes virus,
according to many observational studies. Is it fair to conclude that the virus causes cervical
cancer?
It is not fair to conclude that the virus causes cervical cancer, because herpes is sexually
transmitted and cervical cancer is also linked to sexually transmitted infection. It is more likely
that the two are associated with one another (association, not causation), and that a confounding
variable is behind the cervical cancer. For example, women who have many sexual
partners are at a higher risk for both cervical cancer and herpes, so the increased
number and kind of partners means that there's an increased chance that those women would
contract the STDs. The confounding factor, then, would be the number of sexual partners the
women have, which is the variable that likely leads the women to contract both diseases.
!
#11 p27
California is evaluating a new program to rehabilitate prisoners before their release; the object is
to reduce the recidivism rate - the percentage who will be back in prison within two years of
release. The program involves several months of boot camp - military-style basic training with
very strict discipline. Admission to the program is voluntary. According to a prison spokesman,
"Those who complete boot camp are less likely to return to prison than other inmates."

a. What is the treatment group in the prison spokesman's comparison? The control group?
The treatment group in the prison spokesman's comparison is the prisoners who chose
to enter the boot camp and finished it. The control group is the prisoners who don't
volunteer for the program, plus those who do volunteer but don't complete it.
b. Is the prison spokesman's comparison based on an observational study or a randomized
controlled experiment?
The prison spokesman's comparison is based on an observational study, because it's the
prisoners who decided whether to volunteer for the camp, and whether to stay
in it or leave it. The investigators did not randomly choose who was to be in the boot camp
and who wasn't.
c. True or false: the data show that boot camp worked.
False: the data do not show that boot camp worked. The boot camp
people volunteered, so it is not possible to know whether their lower recidivism rate
is because of the boot camp or because of a confounding factor. For example, it's possible that the
people who chose to be in boot camp and finished it were the ones who were extremely committed
and willing to work hard to live better lives and stay out of prison.
!
#2 p33!
2. In figure 2, were there more families earning between $10,000 and $11,000 or between
$15,000 and $16,000? Or were the numbers about the same? Make your best guess.
!
There were more families between $10,000 and $11,000.!
!
#3 p33 - Histogram!
(a) B represents the people who scored between 60 and 80.!
(b) 20% scored between 40 and 60.!
(c) 70% scored over 60. !
!
#4 p42!
In a public health service study, a histogram was plotted showing the number of cigarettes
smoked by each subject (male current smokers), as shown below. The density is marked in
parentheses. The class intervals include the right endpoint, not the left. !
(a) 15%!
(b) 30%!
(c) 50%!
(d) 10%!
(e) 3.5%!
!
!
II. DRAWING A HISTOGRAM!
i. A distribution table shows the percentage of data in each class interval.!
ii. Choose an endpoint convention - left or right (e.g. put left endpoints in class intervals, but
exclude right, or vice versa)!
iii. Use class intervals to draw horizontal axis !
iv. To figure out the height of a block over a class interval, divide the percentage by the length
of the interval. (width x height = percentage)
v. This way, the area of the block equals the percentage of families in the class interval.
vi. Units on the vertical scale = percent per (horizontal units)!
vii. The height of the block over the interval $7,000 to $10,000 is 5% per thousand dollars: in
other words, in each thousand-dollar interval between $7,000 and $10,000, there are about
5% of the families.!
viii. The unit on the horizontal axis is $1000 of family income, and the vertical axis shows the
percentage of families per $1000 of the income. !
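The height rule above (height = percentage divided by interval length) can be sketched as follows. The distribution table is hypothetical: only the 7-10 row is chosen to agree with the 5% per thousand dollars example in these notes, and the other rows are made up so that the percentages add to 100. Incomes are in thousands of dollars.

```python
def block_height(percent, left, right):
    """Height in percent per horizontal unit: percentage / interval length."""
    return percent / (right - left)

# Hypothetical distribution table: (left endpoint, right endpoint, percent of families).
# Endpoints are in thousands of dollars; only the 7-10 row matches the notes.
table = [(0, 7, 30), (7, 10, 15), (10, 25, 45), (25, 50, 10)]

heights = [block_height(p, lo, hi) for lo, hi, p in table]

# Area of each block = width x height, so the areas should add back up to 100%.
total_area = sum(h * (hi - lo) for h, (lo, hi, _) in zip(heights, table))

for (lo, hi, p), h in zip(table, heights):
    print(f"{lo}-{hi}: {p}% of families, height {h:.2f}% per $1000")
print("total area:", round(total_area, 6))
```

Dividing by the interval length is what lets blocks of different widths be compared fairly: a wide block can hold many families and still be short.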
!
III. GENERATING A HISTOGRAM FROM DATA!
i. Toss a fair coin n = 4 times and count the number of heads.!
ii. Repeat this experiment N = 10 times.!
iii. Example: 3, 1, 3, 2, 0, 2, 1, 4, 2, 0 heads in the 10 trials gives the histogram below left.!
iv. If we repeat the experiment N = 1000 times, we get a histogram such as the one shown
below right.!
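The steps above can be sketched in a few lines. The first part tabulates the N = 10 results listed in item iii; the second part, `simulate`, is a hypothetical helper (name and seed are illustrative) that repeats the 4-toss experiment N times with a pseudo-random fair coin.

```python
import random
from collections import Counter

# The N = 10 results listed above (number of heads in 4 tosses).
heads = [3, 1, 3, 2, 0, 2, 1, 4, 2, 0]
counts = Counter(heads)

# Class intervals are 1 unit wide and centered at 0, 1, 2, 3, 4, so the
# height of each block equals the percentage of trials with that many heads.
percent = {k: 100 * counts[k] / len(heads) for k in range(5)}
print(percent)

def simulate(N, n=4, seed=0):
    """Repeat 'toss a fair coin n times and count heads' N times."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n)) for _ in range(N)]

big = Counter(simulate(1000))
print({k: 100 * big[k] / 1000 for k in range(5)})
```

With only 10 repetitions the histogram is lumpy; with 1000 it settles toward a symmetric shape peaked at 2 heads.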

IV. THE DENSITY SCALE!


i. The histogram below shows years of school completed by persons age 25 and older in
the U.S. in 1991.!
ii. Endpoint convention: years of school completed (e.g., people who dropped out part way
through ninth grade are in the 8-9 block) (left endpoint included)
iii. Units on the vertical axis are percent (of people) per year (of schooling).
iv. Area represents percent. Total area = 100%.
v. Crowding is represented by the height of the block.
i. Compare the blocks over 8-9 and 9-12. The 8-9 block is a little taller, so this interval is a
little more crowded. The 9-12 block has a larger area, so it has more people. There's more
room in the 9-12 interval because it's longer. It's like the Netherlands (small country) being
more crowded, even though the US has more people.
ii. In a histogram, the height of a block represents crowding: percentage per
horizontal unit.
vi. Peaks: 8-9, 12-13 and 16-17 - the peaks show how people tend to stop their schooling at
one of the three possible graduations rather than dropping out in between.
vii. 12-13 = all the people with high school degrees. Some may have gone to college but
didn't finish the first year. The left endpoint convention means that the right endpoint
value belongs to the next block.
!
!
viii. With the density scale on the vertical axis, the areas of the blocks come out in percent.
The area under the histogram over an interval equals the percentage of cases in that
interval. The total area under the histogram is 100%. MAKE SURE IT EQUALS 100%
(and not 200%).!
!
IV. VARIABLES!
i. A (random) variable is a measurement that depends on the outcome of a (random)
event. A variable is a characteristic which changes from person to person in a study.!
ii. Quantitative variables have numeric values. (age, family size, income)!
i. Continuous variables can assume a continuum of values. Examples include
income, temperature, pressure, mass, speed, and age.
ii. A discrete variable can assume only finitely (or countably) many values. Examples
include family size and number of engine cylinders. Values can differ by 0 or 1 or 2, but
nothing in between is possible.
iii. Qualitative variables are non-numeric. Values are typically descriptive words or phrases. !
i. Examples include: marriage status, true or false, employment status, eye color,
automobile transmission type.!
!
!
iv. With discrete variables, center the class intervals at the possible values (so if family size
can be 2, 3, or 4, the class interval for 2 = 1.5 to 2.5 and the class interval for 3 = 2.5 to 3.5).

v. In the March Current Population Survey, women are asked how many children they
have. Results are shown below for the women age 25-39, by educational level. !
i. Is the number of children discrete or continuous?!
i. The number of children is discrete.!
ii. Draw histograms for these data.!
KEY:!
- Black columns = women who are high school graduates. !
- Outlined columns (no fill) = women with college degrees.!

[Figure: two histogram panels, each with horizontal axis "Number of children" (0 to 6) and vertical axis marked 0, 25, 50.]

!
iii. What do you conclude?!
i. Women who have college degrees have fewer children, so women who are
better educated (have more years of schooling) have fewer children. The percent
of women per number of children is generally more skewed towards the left end
of the graph for the women with college degrees than those with high school
degrees only, thus supporting my conclusion. !
!
V. CONTROLLING FOR A VARIABLE!
i. Study: 2 groups
ii. users who took the oral contraceptive pill (the treatment group)
iii. non-users who don't take the pill (the control group)
iv. Problems: it's observational. Women decided for themselves whether or not to take the pill. One
question is the pill's effect on blood pressure. Blood pressure tends to go up with age, and
non-users were on the whole older than the users in the treatment group: 70% of non-users
were over 30, compared to 50% of the users. The effect of AGE is confounded with the effect
of the pill.
v. To make the full effect of the pill visible, it's necessary to make a separate comparison
for each age group, THUS CONTROLLING FOR AGE.
vi. Results suggest that if a woman goes on the pill, her blood pressure will go up around 5
mm, but the proof is incomplete. The drug study is an observational study, not a controlled
experiment. There could be some factor other than the pill or age which affects the blood
pressures (though that's unlikely here - the pill actually does affect blood pressure).
!
VI. CROSS TABULATION!
i. To make comparisons between age groups or subgroups, use a cross-tab!

!
VII. SELECTIVE BREEDING!
i. Rats were bred for intelligence: maze-bright rats (rats making only a small
number of errors in a maze) were bred with each other, and maze-dull rats were bred together.
Seven generations later there was a clear separation in maze scores. This is
evidence that some mental abilities are at least in part genetically
determined. BUT the maze-bright rats did no better than the maze-dulls
on other tests of animal intelligence, which is evidence against the experimenter's theory. The
brights seemed to be introverts (they could not form good relationships with other rats but were
well adjusted to life in the maze). The dulls were the opposite.
!
p51 #4 !
The figure below is a histogram showing the distribution of blood pressure for all 14,148 women
in the drug study (sec 5). Use the histogram to answer the following questions:!
a) Is the percentage of women with blood pressures above 130 mm around 25%, 50%, or
75%?!
a) 25%!
b) Is the percentage of women with blood pressures between 90 mm and 160 mm around 1%,
50%, or 99%!
a) 99%!
c) In which interval are there more women: 135-140 mm or 140-150 mm?!
a) 140-150 mm!
d) Which interval is more crowded: 135-140 mm or 140-150 mm?
a) 135-140 mm is more crowded
e) On the interval 125-130 mm, the height of the histogram is about 2.1% per mm. What
percentage of women had blood pressures in this class interval?!
a) 10.5% which is equal to the width times height (otherwise known as the area) of that
section/bar.!
f) Which interval has more women: 97-98 mm or 102-103 mm?!
a) 102-103 mm interval has more women!
g) Which is the most crowded millimeter of all?!
a) 115-120 interval!
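
The area rule used in part (e) can be checked with a few lines of Python. This is only a sketch; the function name `block_area_percent` is made up for illustration and is not from the text.

```python
def block_area_percent(height_pct_per_mm, left_mm, right_mm):
    """Area of one histogram block: height (% per mm) times width (mm)."""
    width_mm = right_mm - left_mm
    return height_pct_per_mm * width_mm

# Part (e): the block over 125-130 mm has height about 2.1% per mm.
area = block_area_percent(2.1, 125, 130)
print(round(area, 1))  # percentage of women in the 125-130 mm interval
```

The same function works for any block, as long as the height is read off the density scale (% per mm) and not a raw count.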
!
SUMMARY !
1. A histogram represents percents by area. It consists of a set of blocks. The area of each
block represents the percentage of cases in the corresponding class interval.
2. With the density scale, the height of each block equals the % of cases in the corresponding
class interval, divided by the length of that interval.
3. With the density scale, area comes out in percent, and the total area is 100%. The area under
the histogram between two values gives the percentage of cases falling in that interval.
4. A variable is a characteristic of the subjects in a study. It can be either qualitative or
quantitative. A quantitative variable is either discrete or continuous.
5. A confounding factor is sometimes controlled for by cross-tabulation.
!
!
!
CHAPTER 4 - THE AVERAGE AND STANDARD DEVIATION!
I. INTRODUCTION!
A. Average - used to find center, as is median. !
B. Standard deviation - measures spread around average!
C. Interquartile range - another measure of spread.!
II. THE AVERAGE!
D. HANES - the Health and Nutrition Examination Survey - gets baseline data about
demographic variables (age, education, income), physiological variables (height, weight,
BP, cholesterol levels), dietary habits, and prevalence of diseases.
E. Average of a list of numbers equals their sum, divided by how many numbers there are.
F. HANES is cross-sectional: different subjects are compared to each other at one
point in time. A longitudinal study follows subjects over time and compares them
with themselves at different points in time. Because HANES is cross-sectional, the people at
each age are different people, so a lower average height at older ages does not by itself
mean that men get shorter as they age.
G. If a study draws conclusions about the effects of age, find out whether data are cross-
sectional or longitudinal. !
!
6. Twenty-one people in a room have an average height of 5 ft 6 inches. A 22nd person enters
the room. How tall would he have to be to raise the avg height by 1 inch? !
!
[(21)(66 inches) + x] / 22 = 67 inches

x = (22)(67) - (21)(66) = 1474 - 1386 = 88 inches, or 7 ft 4 in
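
The same arithmetic can be written as a short Python sketch. The helper name `height_needed` is an assumption for illustration: the newcomer has to supply the difference between the new total and the old total.

```python
def height_needed(n_people, current_avg, target_avg):
    """Value a newcomer must have to move the average of n_people to target_avg."""
    new_total = (n_people + 1) * target_avg   # total needed for the new average
    old_total = n_people * current_avg        # total already in the room
    return new_total - old_total

x = height_needed(21, 66, 67)   # heights in inches
feet, inches = divmod(x, 12)
print(x, feet, inches)          # height in inches, then as feet and inches
```

Raising the average by 1 inch costs 1 inch for each of the 22 people, which is why the newcomer has to be 22 inches above the old average of 66.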
!
!
#4 P 70!
Each of the following lists has an average of 50. For each one, guess whether the SD is around
1, 2, or 10. !
a) 1 (all numbers deviate from average by +1 or -1)!
b) 2!
c) 2!
d) 2!
e) 10!
!
#13 p 24!
13. A hypothetical university has two departments, A and B. There are 2000 male applicants, of
whom half apply to each department. There are 1100 female applicants: 100 apply to dept A and
1000 to dept. B. Dept A admits 60% of men who apply and 60% of women. Dept B admits 30%
of men and 30% of women. For each dept, the % of men admitted equals the % of women
admitted: this must be so for both departments together. T or F, explain.
This is false.
Dept A accepts 60% of 1000 men, which is 600 men.
Dept B accepts 30% of 1000 men, which is 300 men.
That totals 900/2000 men = 45%.
Dept A accepts 60% of 100 women, which is 60 women.
Dept B accepts 30% of 1000 women, which is 300 women.
That totals 360/1100 women = 32.7%, a smaller overall percentage than for the men, because
most of the women applied to Dept B, which is harder to get into.
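
A minimal Python sketch of this calculation, using only the numbers given in the problem. The dictionary layout and the name `overall_rate` are choices made for illustration.

```python
# Applicants per department and admission percentage per department,
# as stated in the problem.
applicants = {
    "A": {"men": 1000, "women": 100},
    "B": {"men": 1000, "women": 1000},
}
admit_pct = {"A": 60, "B": 30}   # same percentage for men and women

def overall_rate(sex):
    """Overall admission rate (as a fraction) for one sex across both departments."""
    admitted = sum(applicants[d][sex] * admit_pct[d] / 100 for d in applicants)
    total = sum(applicants[d][sex] for d in applicants)
    return admitted / total

men_rate = overall_rate("men")       # 900 / 2000
women_rate = overall_rate("women")   # 360 / 1100
print(round(100 * men_rate, 1), round(100 * women_rate, 1))
```

Within each department the rates are equal, yet the combined rates differ because the two sexes are spread differently across the easy and the hard department.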
!
#7 p 26!
According to a study done at Kaiser Permanente in Walnut Creek, users of oral contraceptives
have a higher rate of cervical cancer, even after adjusting for age, education, and marital status.
Investigators concluded that the pill causes cervical cancer. !
a) This is an observational study.!
b) The investigators adjusted for age, education, and marital status because they were
potentially confounding variables, and the way to adjust is to compare smaller, more
homogeneous groups of people. As women age, the possibility of getting cervical cancer
increases. Married and single women tend to differ in sexual activity, so they face
different risks of getting cervical cancer, and the same may apply to women with
different levels of education.
c) Women using the pill were likely to differ from non-users on another factor which affects the
risk of cervical cancer, namely sexual activity. They may have been more sexually active and
had more partners, which increases their likelihood of contracting STDs, which
could result in cervical cancer.
d) The conclusions of the study were not justified by the data because the data showed
association, not causation. Also, the confounding variable of sexual activity was not
accounted for. The data do not establish a causal role for the oral contraceptive in cervical
cancer: they show only that a relationship exists, not that it is causal.
!
III. THE AVERAGE AND THE HISTOGRAM!
i. The histogram balances when supported at the mean.!
ii. The first histogram below is symmetric about its mean. Half the data is to the left of the
mean and half is to the right.!
iii. As the red box moves to the right, it pulls the average along with it. !
iv. A histogram balances when supported at the average. !
v. Median of a histogram is the value with half the area to the left and half to the right. !

vi. LONG RIGHT HAND TAIL - AVERAGE IS BIGGER THAN MEDIAN!


vii. LONG LEFT HAND TAIL - AVERAGE IS SMALLER THAN MEDIAN!
!
IV. THE ROOT MEAN SQUARE!
i. The root-mean-square (or RMS) of a list of numbers measures the average
magnitude (ignoring signs) of the numbers in the list.
ii. The calculation steps are:
    i. (1) square the entries of the list, SQUARE
    ii. (2) take the mean of this new list, and MEAN
    iii. (3) take the square root of this mean. ROOT
iii. RMS is used to compute the SD (or spread) of a list of numbers.
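
The three steps (square, mean, root) can be sketched in Python as follows. The list `example` is an illustrative list chosen here, not data from these notes.

```python
import math

def rms(numbers):
    """Root-mean-square: read the name backwards to get the order of the steps."""
    squares = [x * x for x in numbers]          # (1) SQUARE the entries
    mean_square = sum(squares) / len(squares)   # (2) take the MEAN of the squares
    return math.sqrt(mean_square)               # (3) take the square ROOT

example = [0, 5, -8, 7, -3]   # illustrative list (an assumption, not from the notes)
print(round(rms(example), 2))
```

Squaring removes the signs, so positive and negative entries cannot cancel the way they do in an ordinary average.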

!
V. THE STANDARD DEVIATION!
i. Standard deviation (SD) measures the spread of the data.
    i. Roughly 68% of the data falls within one SD of the average.
    ii. Roughly 95% of the data falls within two SDs of the average.
ii. Average (mean) and median were measures of the center of the data.
iii. Units of SD and average are the same as those of the data.
iv. Variance is SD^2
v. The SD says how far away numbers on a list are from their average. Most entries on the
list will be somewhere around one SD away from the average. Very few will be more
than two or three SDs away.!
!
VI. COMPUTING THE STANDARD DEVIATION!
i. Deviation from average = entry - average!
ii. SD = rms deviation from average!
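
A short Python sketch of this recipe, applied to the list 13, 9, 11, 7, 10 that these notes use in the standard-units example. The function name `sd` is a choice made here; `statistics.pstdev` is the standard-library function that divides by the number of entries, so it serves as a cross-check.

```python
import math
import statistics

def sd(numbers):
    """SD = rms size of the deviations from the average."""
    average = sum(numbers) / len(numbers)
    deviations = [x - average for x in numbers]            # entry - average
    mean_square = sum(d * d for d in deviations) / len(deviations)
    return math.sqrt(mean_square)

data = [13, 9, 11, 7, 10]
print(sd(data))                  # rms of the deviations 3, -1, 1, -3, 0
print(statistics.pstdev(data))   # same definition (divides by n)
print(statistics.stdev(data))    # different: divides by n - 1, so it is larger
```

Note that many software packages default to the n - 1 version, which is not the SD defined in these notes.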

SUMMARY!
1. A typical list of numbers can be summarized by its average and standard deviation!
2. Average of a list = SUM OF ENTRIES / NUMBER OF ENTRIES!
3. The average locates the center of a histogram, in the sense that the histogram balances
when supported at the average.
4. Half the area under a histogram lies to the left of the median, and half to the right of the
median. The median is another way to locate the center of a histogram.
5. The RMS size of a list measures how big the entries are, neglecting signs.
6. RMS = SQUARE ROOT of (average of (entries^2))
7. SD measures distance from average. Each number on a list is off the average by some
amount. SD is a sort of average size for these amounts off. SD is the rms size of the
deviations from the average.
8. Roughly 68% of the entries on a list of numbers are within one SD of the average, and about
95% are within 2 SDs of the average. This holds for many lists, but not all.
9. If a study draws conclusions about the effects of age, find out whether the data are cross-
sectional or longitudinal.!
!
12 p 27!
A study is carried out to determine the effect of party affiliation on voting behavior in a certain
city. The city is divided up into wards. In each ward, the percentage of registered Democrats
who vote is higher than the percentage of registered Republicans who vote. True or false: for
the city as a whole, the percentage of registered Democrats who vote must be higher than the
percentage of registered Republicans who vote. If true, why? If false, give an example.
!
This is false because of Simpson's paradox: relationships between percentages in subgroups
can be reversed when the subgroups are combined. For example, suppose Ward 1 has 10
registered Democrats, all 10 of whom vote (100% turnout), and 1000 registered Republicans,
of whom 990 vote (99% turnout). The Democratic percentage is higher in Ward 1.
Suppose Ward 2 has 1000 registered Democrats, of whom 600 vote (60% turnout), and 10
registered Republicans, of whom 5 vote (50% turnout). The Democratic percentage is again
higher in Ward 2. However, totaling the two wards gives 1010 Democrats, of whom 610 voted,
or about 60%, and 1010 Republicans, of whom 995 voted, or about 98.5%. So the Republican
percentage is higher for the city as a whole. The reversal happens because most Republicans
live in the high-turnout ward (Ward 1), while most Democrats live in the low-turnout ward
(Ward 2).

2. p 48
Draw the histograms for the blood pressures of the users and non-users age 17-24. What do
you conclude?

[Figure: histogram, "Users of Pill, Age 17-24". Horizontal axis: blood pressure (mm), 85 to
165; vertical axis: % per mm, 0 to 5.]
!

[Figure: histogram, "Non-users of Pill, Age 17-24". Horizontal axis: blood pressure (mm), 85 to
165; vertical axis: % per mm, 0 to 5.]


I conclude that using the pill is associated with higher blood pressure: the histogram for the
users has more of its area at higher blood pressures than the histogram for the non-users.
!
#7 page 52!
!
Two histograms are sketched below. One shows the distribution of age at death from natural
causes (heart disease, cancer, and so forth). The other shows age at death from trauma
(accident, murder, suicide). Which is which, and why?!
(i), the histogram with a long left tail, shows the distribution of age at death from natural causes
because people who are older are more likely to die of natural causes than younger people, as
they are at higher risk of contracting heart disease, cancer, and so forth. !
(ii), the histogram with a long right hand tail, shows age at death from trauma. Accidents,
murders, and suicides are more likely to afflict people who are younger than those who are
older, so there is a larger number of people towards the left (younger) end of the curve. !
!
#4 page 74!
!
For persons age 25 and over in the US, would the average or the median be higher for income?
for years of schooling completed?!
!
For income, the average would be larger than the median. A small number of people with
extremely high incomes stretch the histogram out to the right, giving it a long right hand
tail. The balance point has to lie further to the right to balance out the tail, so the
average is bigger than the median.
For years of schooling, we have the opposite. The left hand tail is longer, so the average is
smaller than the median. Most people age 25 and older in the US have completed school at
least through 12th grade, so the histogram has its peak at about that point and drops off
quickly after it (there are only so many years of school). Fewer people leave school at
an earlier point, and they make up the long left hand tail.
!
!
#6 page 75!
!
a. !
(i) has an average of 60!
(ii) has an average of 50!
(iii) has an average of 40!
!
b. The median is less than the average iii!
The median is about equal to the average ii!
The median is bigger than the average i!
!
c. The SD of histogram (iii) is around 15. Nearly all the area lies within 50 units of the
average, so 50 is too large. Only a small portion of the area lies within 5 units of the
average, so 5 is too small. By elimination, and from eyeing where most of the area lies,
the SD is about 15.

d. The SD for histogram (i) is NOT a lot smaller than that for histogram (iii), so the
original statement is false. The histograms are essentially mirror images of one another, so
they have the same spread around their averages, and spread is what the SD measures.
!
!
Extra problem: A list of transactions contains 100 numbers: 60 gains and 40 losses. The gains are
positive numbers and the losses are negative numbers. The units are thousands of dollars.
For the 60 gains, the average is 18 and the SD is 7.5. For the losses the average is -20 and the
SD is 9.2.!
!
a) Find the average of the 100 transactions !
!
[(18)(60/100)] + [(-20)(40/100)] = !
10.8 + (-8) = !
2.8!
!
b) Find the SD of the 100 transactions. !
!
S^2 = (60/100)(7.5)^2 + (40/100)(9.2)^2 + (60/100)(18)^2 + (40/100)(-20)^2 - (2.8)^2!
= 33.75 + 33.856 + 194.4 + 160 - 7.84 !
= 414.166!
S = 20.35!
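
The calculation in (a) and (b) can be sketched in Python. It rests on the identity (average of squares) = SD^2 + average^2 within each group, which is what the S^2 line above uses. The function name `combine` and the tuple layout are assumptions made for illustration.

```python
import math

def combine(groups):
    """Average and SD of a pooled list, given (count, average, SD) for each group."""
    n = sum(count for count, _, _ in groups)
    # Pooled average: weight each group's average by its share of the list.
    avg = sum(count * a for count, a, _ in groups) / n
    # Within a group, the average of the squared entries is SD^2 + average^2.
    mean_square = sum(count * (s * s + a * a) for count, a, s in groups) / n
    return avg, math.sqrt(mean_square - avg * avg)

# 60 gains (average 18, SD 7.5) and 40 losses (average -20, SD 9.2)
avg, sd = combine([(60, 18, 7.5), (40, -20, 9.2)])
print(round(avg, 2), round(sd, 2))
```

The pooled SD is much larger than either group's SD because the two groups sit far apart, on opposite sides of zero.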
!
!
!
CHAPTER 5: The Normal Approximation for Data!
I. THE NORMAL CURVE!
i. The standard normal (or Gaussian) curve is an ideal histogram to which we will compare
other data.!

ii. The total area under the curve is 1 (or 100%).!


iii. The curve is symmetric about the line x = 0.!
iv. It has mean = 0 and SD = 1.!
!
II. STANDARD UNITS!
i. If x1, . . . , xn is a list of numbers, we convert the values in the list to standard units using
the following formula:
ii. zi = (xi - mean) / SD
iii. zi measures how far (in units of SD) xi is from the mean (average) of the list
iv. Example: Consider the list 13, 9, 11, 7, 10.
> x <- c(13, 9, 11, 7, 10)
> mean(x) = 10
> SD(x) = 2
Now convert 13 to standard units:
> (13 - 10) / 2 -> (NUMBER - AVERAGE) / SD
= 1.5
!
v. Method: Convert to standard units, then find the corresponding area under the normal
curve. !
!
!
#1 page 82 !
!
On a certain exam, the average of the scores was 50 and the SD was 10. !
(a) Convert each of the following scores to standard units: 60, 45, 75!
60:
zi = (xi - mean) / SD
zi = (60 - 50) / 10
zi = 1

45:
zi = (xi - mean) / SD
zi = (45 - 50) / 10
zi = -0.5

75:
zi = (xi - mean) / SD
zi = (75 - 50) / 10
zi = 2.5

(b) Find the scores which in standard units are 0, +1.5, -2.8
0:
zi = (xi - mean) / SD
0 = (xi - 50) / 10
xi = 50

+1.5:
zi = (xi - mean) / SD
1.5 = (xi - 50) / 10
xi = 65

-2.8:
zi = (xi - mean) / SD
-2.8 = (xi - 50) / 10
xi = 22
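
Both directions of the conversion in this exercise can be sketched in Python. The function names are made up for illustration; the average (50) and SD (10) are the ones given in the problem.

```python
def to_standard_units(x, average, sd):
    """How many SDs x lies above (+) or below (-) the average."""
    return (x - average) / sd

def from_standard_units(z, average, sd):
    """Invert the conversion: start at the average and move z SDs."""
    return average + z * sd

AVERAGE, SD = 50, 10

# Part (a): scores to standard units
z_values = [to_standard_units(x, AVERAGE, SD) for x in (60, 45, 75)]

# Part (b): standard units back to scores
scores = [round(from_standard_units(z, AVERAGE, SD), 2) for z in (0, 1.5, -2.8)]

print(z_values)
print(scores)
```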
!
!
III. FINDING AREAS UNDER THE NORMAL CURVE!
i. Use one or more of the following to find areas under the normal curve:!
i. Total area under the curve is 1 (that is, 100%)!
ii. The area is symmetric about the vertical line x = 0
iii. (area to the left of x) = (1 - area to the right of x)
iv. The 68%, 95%, 99.7% rules (see slide 2)!
v. The pnorm or qnorm functions in R!
vi. Normal table in Appendix A of your text!
!
#1a,b,c page 84!
Find the area under the normal curve!
(a) to the right of 1.25!
.5 (100% - 78.87%) = 11%!
(b) to the left of -0.40!
.5 (100% - 31.08%) = 34%!
(c) to the left of 0.80!
50% + .5(57.63%) = 79%!
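
These table lookups can be cross-checked in Python with `statistics.NormalDist` from the standard library (Python 3.8 or later), whose `cdf` method gives the area to the left of a value. This is a sketch for checking the answers, not the method the exercise expects; the variable names are made up.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal curve: mean 0, SD 1

right_of_1_25 = 1 - Z.cdf(1.25)    # (a) area to the right of 1.25
left_of_minus_0_40 = Z.cdf(-0.40)  # (b) area to the left of -0.40
left_of_0_80 = Z.cdf(0.80)         # (c) area to the left of 0.80

for area in (right_of_1_25, left_of_minus_0_40, left_of_0_80):
    print(round(100 * area))       # areas in percent, rounded as in the answers
```

The table in the text gives the area between -z and z, which is why the hand calculations halve the leftover area; `cdf` gives the left-hand area directly.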
!
IV. THE NORMAL APPROXIMATION FOR DATA!
i. For the normal curve,!
a. 68% of the area is within 1 SD of the mean!
b. 95% of the area is within 2 SDs of the mean!
c. 99.7% of the area is within 3 SDs of the mean!
ii. The same area vs. SD rule roughly holds for histograms generated by many other data sets.!
iii. From another perspective, for many data sets, if we convert the data to standard units, the
histogram will look a lot like the normal curve.!
iv. Example:!
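
As a quick numerical check of the 68%, 95%, 99.7% rule for the normal curve itself, the areas within 1, 2, and 3 SDs can be computed with Python's `statistics.NormalDist`. The helper name `area_within` is an assumption for illustration.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal curve: mean 0, SD 1

def area_within(k):
    """Area under the normal curve between -k and +k SDs."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 1))   # area in percent
```

The rule is a rounding of these values. For real data sets the percentages only match roughly, and only when the histogram is shaped something like the normal curve.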

V. PERCENTILES!
i. For data that does not follow the normal curve, we use percentiles, quartiles and related
descriptive statistics to summarize the data.!
ii. The 25th percentile is a number x for which 25% of the data is less than x!
iii. The 50th percentile is a number x for which 50% of the data is less than x (this is the same
as the median)!
iv. The 75th percentile is a number x for which 75% of the data is less than x!
v. 1st percentile means that 1% of the data is below that value, and 99% is above.
vi. 10th percentile means that 10% of the data is below that value, and 90% is above.
vii. 50th percentile = median
viii. INTERQUARTILE RANGE = 75th percentile - 25th percentile
ix. Percentiles are used for distributions with long tails.
!
VI. PERCENTILES AND THE NORMAL CURVE!
i. !
!
#1 page 93 review exercises (in class we did 1.33 SD)!
!
The following list of test scores has an average of 50 and an SD of 10:!
!
(a) Use the normal approximation to estimate the number of scores within 1.25 SDs of the
average.!
First, look up 1.25 in the normal table to find the corresponding area (percentage): 79%. This
is the area between -1.25 and 1.25 under the normal curve. Multiplying 79% by the number of
entries on the given list (25), we get 25 x .79, which is about 20.
Thus, I estimate that there will be about 20 scores within 1.25 SDs of the average.

(b) How many scores really were within 1.25 SDs of the average?
To count the scores that are actually within 1.25 SDs of the average, I have to convert
1.25 SDs from standard units to the units of the given list. Because the SD is 10, 1.25 SDs
is 12.5 points. Subtracting 12.5 from the average (50) and adding 12.5 to it gives
37.5 and 62.5. Counting the scores on the list between 37.5 and 62.5, there are 18
scores.
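
The two parts of this exercise can be sketched in Python. The list of 25 scores itself is not reproduced in these notes, so `count_within` is written to take any list; the names used here are assumptions for illustration.

```python
from statistics import NormalDist

N_SCORES, AVERAGE, SD, K = 25, 50, 10, 1.25

# (a) Normal approximation: area between -1.25 and +1.25 times the list length.
area = NormalDist().cdf(K) - NormalDist().cdf(-K)
estimate = N_SCORES * area

# (b) Actual count: convert 1.25 SDs back to score units, then count.
low, high = AVERAGE - K * SD, AVERAGE + K * SD

def count_within(scores, low, high):
    """Number of scores in the interval from low to high."""
    return sum(low <= s <= high for s in scores)

print(round(area, 4), round(estimate, 1), low, high)
```

Calling `count_within(scores, low, high)` with the 25 scores from the text gives the actual count to compare against the estimate.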
!
!
#11 page 95!
!
One term, about 700 Stats 2 students at UC Berkeley were asked how many college math
courses they took, other than Stats 2. The average number of courses was about 1.1; the SD
was about 1.5. Would the histogram for the data look like (i), (ii), or (iii)? Why?!
The histogram would look like (i) because you can't have taken a negative number of courses,
and that eliminates (ii) and (iii). Histogram (ii) shows students taking fewer than 0
classes (1.1 - 1.5 = -0.4), and histogram (iii) shows that as well, but to a greater extreme.
It is possible to take fewer courses than the average, say 0 courses, but it is not possible
to have taken -0.4 classes.
!
#5 page 65!
!
For registered students at universities in the US, which is larger: average age or median age?!
Average age is larger because the histogram depicting the registered students at US
universities would have a long right hand tail (lots of young people going to school, and fewer
older people, but there are still older people going to school nonetheless). !
!
MEASUREMENT ERROR: Chapter 6!
!
I. Introduction!
i. In the real world, if we measure something several times, we observe different values
each time.!
ii. Each result is thrown off by chance error.!
iii. How do these errors arise?!
iv. How big are they likely to be?!
!
II. Chance Error!
i. Standard weights are maintained at local, state, national and international levels for
commercial, scientific and other purposes.
ii. The International Bureau of Weights and Measures near Paris maintains the
International Prototype Kilogram.!
iii. The National Bureau of Standards in Washington, D.C. maintains a national prototype
kilogram (Kilogram #20) that is calibrated against the international standard.!
iv. The Bureau maintains several other standard weights that are calibrated against
Kilogram #20.!
v. NB 10 is one such standard weight. It weighs very nearly 10 grams.!
vi. NB 10 is maintained by the National Bureau of Standards.
    i. The first five NB 10 measurements (in grams) from Table 1 on page 99 are:
    ii. 9.999591 9.999600 9.999594 9.999601 9.999598
    iii. The same measurements, expressed as micrograms below 10 grams:
    iv. 409 400 406 399 402
    v. For the measurements in Table 1, the mean is about 405 and the SD is about 6, in micrograms.
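
The conversion from grams to micrograms below 10 grams can be sketched in Python for the five measurements listed above. Rounding is needed because the subtraction is done in floating point; the variable names are made up.

```python
# First five NB 10 measurements, in grams, as listed in the notes.
measurements_g = [9.999591, 9.999600, 9.999594, 9.999601, 9.999598]

MICROGRAMS_PER_GRAM = 1_000_000

# Amount by which each measurement falls short of 10 grams, in micrograms.
micrograms_below_10g = [
    round((10 - m) * MICROGRAMS_PER_GRAM) for m in measurements_g
]
print(micrograms_below_10g)
```

Working in micrograms below 10 grams keeps only the digits that change from one measurement to the next.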
!
III. Bias!
i. Bias affects all measurements the same way, pushing them in the same direction.!
ii. Chance errors change from measurement to measurement, sometimes up and
sometimes down.!
iii. (individual measurement) = (exact value) + bias + (chance error)!
iv. If there is no bias, the long-run average of repeated measurements should approach
the exact value. Chance errors should cancel out in the average.!
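
The equation (individual measurement) = (exact value) + bias + (chance error) can be illustrated with a small simulation. Everything here is a sketch under stated assumptions: the chance errors are modeled as random draws from a normal curve, and the exact value, bias, and SD are made-up numbers, not results from the text.

```python
import random

def simulate(exact, bias, chance_sd, n, seed=0):
    """n simulated measurements: exact value + bias + chance error.

    Assumption: chance errors are normal draws with mean 0 and SD chance_sd.
    """
    rng = random.Random(seed)
    return [exact + bias + rng.gauss(0, chance_sd) for _ in range(n)]

def average(numbers):
    return sum(numbers) / len(numbers)

# Hypothetical numbers chosen only for illustration.
unbiased = simulate(exact=405.0, bias=0.0, chance_sd=6.0, n=10_000)
biased = simulate(exact=405.0, bias=3.0, chance_sd=6.0, n=10_000)

print(round(average(unbiased), 2))   # close to the exact value
print(round(average(biased), 2))     # off by about the bias
```

Averaging many measurements shrinks the effect of chance error but leaves the bias untouched, which is why bias cannot be detected from the measurements alone.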
!
Examples:!
A carpenter is using a tape measure to get the length of the board.!
(a) What are some possible sources of bias?!
(a) Some possible sources of bias include stretching the cloth purposely and to a significant
extent, and manufacturing errors by the makers (of the tape measure itself).!
(b) Which is more subject to bias, a steel tape or a cloth tape?!
(a) Cloth tape is more subject to bias because it stretches with time (systematic error or
bias) or can shrink, thus making its measuring accuracy less reliable, although it is
possible for a steel tape to expand with temperature (really high temperatures, though) -
its just more likely in comparison that a cloth tape will be biased.!
(c) Would the bias in a cloth tape change over time?!
(a) Yes - continuous use of the cloth tape leads to stretching, thus increasing the bias. !
!
True or false, and explain.!
(a) Bias is a kind of chance error!
(a) False - its not a form of chance error because bias is a systematic and predictable error. !
(b) Chance error is a kind of bias!
(a) False - chance error is not a kind of bias because chance errors change with every
measurement you make, but bias shifts measurement in a singular direction. !
(c) Measurements are usually affected by both bias and chance error. !
(a) True - most measurements are affected by both. Bias cannot be detected just by
observing the results, so an external standard or theoretical comparison point is needed
to figure out whether the measurements are biased.
!
!
