
Course STAT2: STATISTICAL STRUCTURES IN DATA (SSD)

Teacher: AMITA PAL
Interdisciplinary Statistical Research Unit (ISRU)
ISI Kolkata

Postgraduate Diploma in Business Analytics (PGDBA): 2022-24 Batch


Data Categories by Source/Mode of Collection 2

By Source
• Secondary
• Primary
• Observational studies
• Controlled experiments
• Sample surveys

By Mode of Collection
• Longitudinal
• Measurements taken over time
• Panel data: special case where observations are for the same subjects each time.
• Cross-sectional
• Measurements on several subjects at a single time point
Stat2: Statistical Structures in Data, PGDBA Programme, ISI, 2022 September 29, 2022
Observational Studies 3

• When the variable under study is not under
the control of the researcher because of ethical
concerns or logistical constraints
Example
• Study of the suspected link between a certain
medication and some symptom arising as a side
effect
Controlled Experiments 5

• Data collected on the basis of a study designed according to the principles
of Statistical Design of Experiments (due to R A Fisher)
• Randomization
• The process of assigning individuals at random to the different groups in an
experiment, so that each individual has the same chance of being placed in
any given group.
• Replication
• To obtain more precise estimates as well as measures of their precision
• Local Control
• Bringing extraneous sources of variation under control to get more precise estimates

Controlled Experiments 6

• Objective: To study objectively the efficacy of some
strategy (treatment) by measurement of responses on a
group of subjects
• Method: Subjects are divided into treatment and control
groups
• Subjects in the treatment group receive the treatment
• Subjects in the control group do not receive the treatment (or
are given a placebo)
• Responses of the two groups are compared.
Illustration 7

Experiment Control

Illustration (contd.) 8

Replication Why replicate?

Illustration (contd.) 9

Randomization
• Experimental subjects (“units”) should be assigned to treatment groups at random.
• At random does not mean haphazardly. One needs to explicitly randomize using
• A computer, or
• Coins, dice or cards.

Why randomize?
• To avoid bias.
• For example: the first six mice you grab may have intrinsically higher BP.
• Control the role of chance.
• Randomization allows the later use of probability theory, and so gives a solid foundation for statistical analysis.
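
Explicit randomization by computer can be sketched as follows. This is only an illustration and not part of the original slides: the function name `randomize`, the mouse labels, the seed and the choice of two equal-sized groups are all assumptions.

```python
import random

def randomize(units, seed=None):
    """Randomly split units into two equal-sized groups (treatment, control)."""
    rng = random.Random(seed)   # seeded generator so the assignment can be reproduced
    shuffled = list(units)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)       # every ordering of the units is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical example: 12 mice, 6 per group
mice = [f"mouse_{i}" for i in range(1, 13)]
treatment, control = randomize(mice, seed=2022)
print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because the split depends only on the shuffle, the experimenter's own preferences (such as which mice are easiest to grab) cannot influence who ends up in which group.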

Observational Studies vs Controlled
Experiments 10

• In controlled experiments
• investigators assign subjects to treatment and control
groups
• In observational studies
• subjects are naturally assigned to the two groups
• the investigator just observes what happens

Confounding 11

• If the treatment and control groups differ with respect
to some factor other than the treatment, the effect of
this factor may be confounded (or mixed up) with the
effect of the treatment.
• A major source of bias
• Example: In a study of the effect of physical activity level on
weight gain, possible confounding variables are sex, food intake,
starting weight, age, …
Case Study I: John Snow and the Cholera
Outbreak in 1854 12
Possibly the First Recorded Observational Study in Epidemiology

Statistical Structures in Data, PGDBA Programme, ISI, 2019 August 2, 2019


Case Study II: The Minnesota Twin Family
Study (MTFS) 13
A Controlled Study

What is MTFS? 14

• A longitudinal study of twins conducted by
researchers at the University of Minnesota - Twin
Cities (http://mctfr.psych.umn.edu/).
• Objective: to identify the genetic and
environmental influences on the development of
psychological traits.

MTFS 15

• Established in June 1989 using same-gendered twin pairs aged
11 or 17.
• All twins born in MN at that time were invited to participate
using birth registry data. 500 additional 11-year-old twin-pairs
were added in 2000.
• Assessment done every three years of twins and their parents.
• A large number of variables were measured.

Components of MTFS 16

• Minnesota Twin Study of Adult Development
• Began in 1986 to identify what causes individual differences in
aging.
• Study of identical (MZ) and fraternal (DZ) twins allows for
estimation of how genes and environment affect the aging
process.
• Minnesota Study of Twins Reared Apart
• Study of twins who were separated at birth and raised in
different families.
Motivation 17

• Identical twins share 100% of their genes and
fraternal twins share, on average, 50% of their
genes.
• Both identical and fraternal twins share certain
aspects of their environment (e.g. religious
practices in the home).

Some Findings 18

• Genetic factors appear to influence personality,
mental, and activity-level changes as adults become older
• Maintaining an active lifestyle will contribute to more
successful aging
• Continuing to engage in intellectual activities will help adults
retain cognitive functioning as they age
• Keeping an active social life will contribute to stronger
feelings of happiness and well being.
Some Findings (contd.) 19

• Twins reared apart were found to have about an equal chance of
being similar to each other in terms of personality, interests, and
attitudes as those reared together.
• Similarities between twins are due to genes, not environment.
• Given that the differences between twins reared apart must be
due totally to the environment, and given that these twins are
just as similar as twins reared together, we can conclude that the
environment, rather than making twins alike, makes them
different.
An interesting footnote: The Jim Twins 20

• One example of the amazing similarity of twins reared
apart is the so-called Jim twins.
• These twins were adopted at the age of four weeks.
• Both of the adopting couples, unknown to each other,
named their son James.
• Upon reunion of the twins when they were 39 years old,
Jim and Jim discovered certain striking facts about
themselves.
An interesting footnote: The Jim Twins (contd.) 21

• Both twins are married to women named Betty and divorced from women named Linda.
• One has named his first son James Alan while the other named his first son James Allan.
• Both twins have an adopted brother whose name is Larry.
• Both named their pet dog "Toy."
• Both had some law-enforcement training and had been a part-time deputy sheriff in Ohio.
• Each did poorly in spelling and well in math.
• Each did carpentry, mechanical drawing, and block lettering.
• Each vacations in Florida in the same three-block-long beach area.
• Both twins began suffering from tension headaches at eighteen, gained ten pounds at the same time, and are six feet tall and 180 pounds.
The Salk Polio Vaccine Trial 22
A Controlled Experiment

Reference 23

• David A. Freedman, Robert Pisani and Roger Purves, Statistics (Fourth Edition)

Background 24

• In 1954, the US Public Health Service and the National Foundation
for Infantile Paralysis (NFIP) organized a trial for a polio vaccine
developed by Jonas Salk.
• Millions of children in grades 1, 2 and 3 were involved.
• Some were vaccinated, some were not (treatment and control
groups)
• Rates of incidence of polio in the two groups were compared.

Background (contd.) 25

• Comparison with incidence in past years is not advisable, as polio is
an epidemic disease whose incidence varies from year to year.
• Consent of parents was also required.
• Should the treatment group be children whose parents gave
consent, and control group without consent?
• No.
• This may lead to bias as the family backgrounds differ between the two
groups.
• Educated parents are more likely to give consent.

The NFIP design 26

• Treatment group – All grade 2 students whose parents
consented
• Control group – Grade 1 & 3 students
• Possible source of bias
• Difference in family background
• Since Polio is highly contagious, rates may differ in different
grades
• Effects of these other factors could be confounded with
the effect of the treatment.
Alternative design 27

• Treatment and Control Groups were chosen from all the children whose
parents gave consent
• The assignment of these children to the two groups was completely
random (by a coin toss type strategy) (Randomization)
• To reduce the psychological effect, children in the control group were
given a placebo.
• A blind experiment: the subjects did not know which group they
belonged to.
• Diagnosticians had to decide if a child contracted polio during the study
period. They were also unaware of the group assignments.
A double-blind randomized controlled experiment
The results 28

The randomized controlled double-blind experiment
                          Size      Rate*
Treatment                 200000    28
Control                   200000    71
No Consent                350000    46

The NFIP Study
                          Size      Rate*
Grade 2 (vaccine)         225000    25
Grades 1 & 3 (control)    725000    54
Grade 2 (No Consent)      125000    44

* Per 100,000
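
The two designs can be compared with a little arithmetic on the rates above. The sketch below is an illustration only: the "relative reduction" measure and all the names in the code are assumptions made here, not something the slides define.

```python
# Polio rates per 100,000, copied from the results table
rct = {"treatment": 28, "control": 71, "no_consent": 46}
nfip = {"grade2_vaccine": 25, "grades1_3_control": 54, "grade2_no_consent": 44}

def relative_reduction(rate_treated, rate_control):
    """Fraction by which the treated rate falls below the control rate."""
    return 1 - rate_treated / rate_control

rct_reduction = relative_reduction(rct["treatment"], rct["control"])
nfip_reduction = relative_reduction(nfip["grade2_vaccine"], nfip["grades1_3_control"])

print(f"Randomized double-blind experiment: {rct_reduction:.0%} lower rate with vaccine")
print(f"NFIP design:                        {nfip_reduction:.0%} lower rate with vaccine")
```

On these figures the NFIP comparison shows a smaller reduction than the randomized experiment, which is consistent with the concern that its control group was not comparable to its treatment group.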
Gender Discrimination in Graduate
Admissions to a US University 29
An observational study with confounding

Background 30

• A study conducted by the Graduate Division of the
University of California Berkeley revealed that
• Out of 8442 men who applied for admissions to
graduate programs, about 44% were admitted.
• Out of 4321 women applicants, only 35% were
admitted.
A closer look at the six largest departments 32

              MEN                          WOMEN
Department    No. applied   % admitted     No. applied   % admitted
A             825           62             108           82
B             560           63             25            68
C             325           37             593           34
D             417           33             375           35
E             191           28             393           24
F             373           6              341           7
Overall       2691          44             1835          30
Observation 33

• The first two departments were less selective
• Over 50% of the men applied to these
• The other four departments were relatively more
selective
• Over 90% of the women applied to these
• The choice of department was the confounding factor!

Remedy 34

• Instead of computing the overall admission rate by a simple
average, if a weighted average of the departmental rates is taken
with weights proportional to the total number of applicants (men
and women) to each department, the admission rates are
• 39% for men
• 43% for women
• The weighted averaging controls for the confounding factor.
• The admission process was actually biased against the
men!
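
The weighted averages can be reproduced from the table of the six largest departments. The sketch below assumes the rounded percentages printed in that table, so the pooled rates it computes may differ slightly from the "Overall" row; the function names are illustrative.

```python
# (number applied, % admitted) for each sex, from the six largest departments
departments = {
    "A": {"men": (825, 62), "women": (108, 82)},
    "B": {"men": (560, 63), "women": (25, 68)},
    "C": {"men": (325, 37), "women": (593, 34)},
    "D": {"men": (417, 33), "women": (375, 35)},
    "E": {"men": (191, 28), "women": (393, 24)},
    "F": {"men": (373, 6), "women": (341, 7)},
}

def pooled_rate(sex):
    """Overall % admitted when all applicants of one sex are pooled."""
    admitted = sum(d[sex][0] * d[sex][1] / 100 for d in departments.values())
    applied = sum(d[sex][0] for d in departments.values())
    return 100 * admitted / applied

def weighted_rate(sex):
    """Departmental rates averaged with weights = total applicants (men + women)."""
    weights = {name: d["men"][0] + d["women"][0] for name, d in departments.items()}
    total = sum(weights.values())
    return sum(weights[name] * d[sex][1] for name, d in departments.items()) / total

for sex in ("men", "women"):
    print(f"{sex}: pooled {pooled_rate(sex):.1f}%, weighted {weighted_rate(sex):.1f}%")
```

Using the same weights for both sexes removes the effect of men and women applying to different departments, which is why the ordering of the two rates flips.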
Controlled Experiments vs Observational
Studies 35

• In controlled experiments (like the Salk Vaccine trial), the
experimenter decides the assignment of the subjects to the
two groups.
• In observational studies (like the gender bias study), the two
groups already exist.
• Even in controlled experiments, double-blind experiments
may not always be feasible.

Statistical Lesson (SIMPSON’S PARADOX) 36

• Relationships between percentages in subgroups
(e.g., admission rates for men and women in each
department separately) can be reversed when the
subgroups are combined.

One Final Example: Effects of Smoking 37

• Example of an observational study


• A number of such studies have been conducted
• They show strong association between smoking and disease
• Association: Circumstantial evidence for causation
• ASSOCIATION IS NOT THE SAME AS CAUSATION
• There may be some hidden confounding factors that make people
smoke and also make them susceptible to certain diseases.

Measurement Levels 38

Measurement Level 39

• A classification that describes the nature of information
within the values assigned to variables
• Most widely-used classification with four levels or scales of
measurement
• nominal
• ordinal
• interval
• ratio
• Originated in psychology
• widely criticized by scholars in other disciplines.
Interval Data 40

• Has
• a meaningful order
• the quality that equal intervals between measurements represent
equal changes in the quantity of whatever is being measured.
• Addition and subtraction are appropriate with interval scales.
• The zero point is arbitrary on interval scales.
• Multiplication and division are not appropriate with interval
data.
Interval Data (contd.) 41

• Example: the Fahrenheit or Centigrade temperature
scale
• The difference between 10 degrees and 25 degrees (a
difference of 15 degrees) represents the same amount of
temperature change as the difference between 60 and 75
degrees.
• A temperature of 40 degrees is not twice as hot as 20 degrees.
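
A short check makes the point concrete. It assumes the temperatures quoted above are in degrees Centigrade (the slide does not say which scale) and converts them to Fahrenheit; the function name `c_to_f` is illustrative.

```python
import math

def c_to_f(celsius):
    """Convert degrees Centigrade to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Equal intervals stay equal after changing the scale
diff_low_f = c_to_f(25) - c_to_f(10)
diff_high_f = c_to_f(75) - c_to_f(60)
print("Intervals equal in Fahrenheit:", math.isclose(diff_low_f, diff_high_f))

# Ratios do not survive a change of scale, because the zero point is arbitrary
ratio_c = 40 / 20
ratio_f = c_to_f(40) / c_to_f(20)
print("Ratio in Centigrade:", ratio_c)
print("Ratio in Fahrenheit:", round(ratio_f, 3))
```

A statement such as "twice as hot" would have to give the same answer on both scales to be meaningful, and it does not.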

Ratio Data 42

• Has
• a meaningful order
• the quality that equal intervals between measurements represent
equal changes in the quantity of whatever is being measured.
• Addition and subtraction are appropriate.
• There is a natural zero point.
• Multiplication and division are also appropriate.

Ratio Data (contd.) 43

• Examples
• Many familiar measurements are on the ratio scale
• Height
• Weight
• Age
• Income
• Score in a test

Secondary Data 44

• Data already collected by some agency, like
government departments, private organizations

• No control over quality or reliability

Primary Data 45

• The investigator collects the data himself/herself

• More reliable
• Total control over
• Coverage
• Definitions

Data Collection 46

Modes of Collection 47

CENSUS
• Complete enumeration
• Expensive
• Time-consuming

SAMPLING
• A representative sample is used
• Various sampling techniques available
• Reliable, leading to more accurate analysis

Collection Methodology 48

• Questionnaire method
• High non-response
• Interviewer method
• Less possibility of non-response
• Direct Observation

SCRUTINY OF DATA is an important aspect since data
collected may be prone to errors.
Measurement Errors 49
Acknowledgement: Statistics by Freedman, Pisani and Purves (4th ed.)

Stat3: INFERENCE, PGDBA Programme, ISI, 2021 September 29, 2022


Introduction 50

• Ideally, a property, when measured several times
• under identical conditions
• with the same instrument/method
should lead to identical measurements.
• In reality, this does not happen.
• There are small variations in the measurements
(chance error)
Example: Standardization of Weights in the US 51

• Local stores weigh goods on (weighing) scales.
• These scales are checked periodically by county weights-and-measures
officials.
• The county standards are in turn calibrated periodically against state
standards.
• The state standards are in turn calibrated periodically against national
standards by the National Bureau of Standards (NBS) in the USA.
• The chain of comparisons ends with the International Prototype Kilogram
maintained by the International Bureau of Weights and Measures near Paris.
Example (contd.) 52

• Accuracy of weighing in the supermarkets and stores ultimately depends
on the accuracy of the calibration work done at the NBS.
• The Bureau addresses the issue of reproducibility (If a measurement is
repeated, how much will it change?) by making repeated measurements
on their own weights, say, NB10, which has a nominal value of 10
grammes.
• Consider 100 weighings of NB10 made
• in the same room
• on the same apparatus
• by the same technicians.
Example (contd.) 53

The First 5 Measurements The set of 100 Measurements

Min 375
Max 437
Mean 405
s.d. 6



What does this Illustrate? 54

• No matter how carefully it is made,
a measurement can come out to be
different from the true value.
• If the measurement is repeated, the
difference could take some other
value.
• Can we quantify the extent to
which the measurement differs
from the true value? Yes, by
replication of the measurements.
Chance or Random Error 55

• Is an error in measurement that leads to measurable values being
inconsistent when repeated measurements of a constant attribute or
quantity are taken.

Individual measurement = exact value + chance error

• Despite repeated measurements, the exact value remains unknown or
unknowable.
• It can at best be estimated by the average of all measurements.
• The standard deviation of a series of repeated measurements estimates
the probable size of the chance error in a single measurement.
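
The chance-error model can be illustrated with a small simulation. Everything numerical here is an assumption made for illustration: the "exact value" of 405 and error size of 6 are borrowed from the NB10 summary figures and treated as if they were known, and the chance errors are assumed to be normally distributed, which the slides do not claim.

```python
import random
import statistics

random.seed(10)        # fixed seed so the run is repeatable

EXACT_VALUE = 405.0    # assumed exact value (in practice unknown)
CHANCE_SD = 6.0        # assumed typical size of the chance error

# Individual measurement = exact value + chance error
measurements = [EXACT_VALUE + random.gauss(0, CHANCE_SD) for _ in range(100)]

estimate = statistics.mean(measurements)   # estimates the exact value
spread = statistics.stdev(measurements)    # estimates the size of the chance error

print(f"Average of 100 measurements: {estimate:.1f}")
print(f"Standard deviation:          {spread:.1f}")
```

The average lands close to the assumed exact value, and the standard deviation lands close to the assumed size of the chance error in a single measurement.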



Outliers 56

• In the set of 100 measurements on NB10
• measurement no. 36 (423) differs from the average by 3 standard deviations.
• measurement nos. 86 and 94 (437 and 375) differ from the average by 5
standard deviations.
• These are examples of outliers.
• They do not result from any mistakes committed during
measurements.
• Discarding outliers
• may or may not result in a significant change in the average value
• is expected to decrease the standard deviation significantly



Outliers (contd.) 57

• Illustration with the NB10 experiment

Histogram with all 100 observations: Mean 405, s.d. 6
Histogram after discarding 3 outliers: Mean 404, s.d. 4


Outliers (contd.)
58

• Even when measurements are taken very carefully, a


small proportion of outliers is expected to be present.
• When outliers are present in the data, the investigator
can
• discard them
• use statistical analysis that can handle them without having to
remove them (robust statistical methods)
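
Both options can be tried on a small data set. The readings below are made up for illustration (they are not the actual NB10 data); only the three outlying values 423, 437 and 375 are taken from the slides. The median is used as one simple example of a robust summary, which is a choice made here rather than a method named in the slides.

```python
import statistics

# Made-up data: 97 well-behaved readings plus the three outlying values quoted for NB10
typical = [401, 403, 404, 405, 405, 406, 407, 409] * 12 + [405]
outliers = [423, 437, 375]
all_readings = typical + outliers

def summarize(values):
    """Return (mean, standard deviation, median) of the values."""
    return statistics.mean(values), statistics.stdev(values), statistics.median(values)

mean_all, sd_all, median_all = summarize(all_readings)
mean_kept, sd_kept, median_kept = summarize(typical)

print(f"All readings:       mean {mean_all:.1f}, s.d. {sd_all:.1f}, median {median_all}")
print(f"Outliers discarded: mean {mean_kept:.1f}, s.d. {sd_kept:.1f}, median {median_kept}")
```

In this example the mean barely moves when the outliers are discarded, the standard deviation drops sharply, and the median is the same either way without any readings having to be removed.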



Bias or Systematic Error 59

• Examples
• When
• a grocer weighs some grocery item on a scale but tends to touch the
pan of the scale with his fingers, or
• a shopkeeper selling fabrics by the metre uses a flexible tape measure
for measurement and tends to stretch the measure so that 30 inches on
the measure actually represent 31 inches of the fabric,
a bias or systematic error is introduced into the measurement.



Bias (contd.) 60

• Error that is not determined by chance but introduced by an
inaccuracy (involving either the observation or measurement
process) inherent to the system
• It may also refer to an error with a non-zero mean, the effect of
which is not reduced when observations are averaged.
• Sources
• Imperfect calibration
• Quantity (error is proportional to the actual value of the measured quantity)
• Drift (error exhibits a trend with time)



Bias (contd.) 61

• Bias affects all measurements the same way.
• Pushes them in the same direction, unlike chance errors which
affect them randomly, sometimes increasing them and decreasing
them at other times.
• Long-run averaging of measurements generally cancels out the
effect of chance error but not bias.
• The average is as affected by the bias as the original measurements.

Individual measurement = exact value + bias + chance error
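
A simulation shows why averaging helps with chance error but not with bias. All the numbers (exact value, bias, size of chance error) are hypothetical, and the chance errors are assumed to be normally distributed.

```python
import random
import statistics

rng = random.Random(61)   # fixed seed so the run is repeatable

EXACT_VALUE = 1000.0      # hypothetical exact weight
BIAS = 5.0                # hypothetical systematic error, same for every measurement
CHANCE_SD = 2.0           # hypothetical size of the chance error

def measure():
    # Individual measurement = exact value + bias + chance error
    return EXACT_VALUE + BIAS + rng.gauss(0, CHANCE_SD)

for n in (10, 100, 10000):
    average = statistics.mean([measure() for _ in range(n)])
    print(f"n = {n:>5}: average - exact value = {average - EXACT_VALUE:.2f}")

# Error of a long-run average: chance error has mostly cancelled, bias has not
long_run_error = statistics.mean([measure() for _ in range(10000)]) - EXACT_VALUE
```

As the number of measurements grows, the error of the average settles near the bias rather than near zero.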



Bias (contd.) 62

• Bias cannot be detected just by looking at the measurements.
• For detection of bias, the measurements must be compared to an
external standard or to theoretical predictions.
• Example
• K20, the US standard for the kilogram, has been compared several times with
the universal standard.
• It has been seen that K20 is marginally lighter, by about 19 parts in a billion.
• It has a negative bias.
• The NBS routinely revises all weight calculations upwards by adding this amount.

