● Define Statistics
● Discuss the significance of Statistics for Physicians
● Suggest the study strategies for learning Statistics
● Present role of Statistics in the scientific process
● Review basic concepts of Statistics
● Introduce methods of Exploratory Data Analysis

● “Amazingly, it is widely considered acceptable for medical researchers to be ignorant of statistics. Many are not ashamed (and some seem proud) to admit that they don't know anything about Statistics.” [1]
● “It may not be expected from doctors to be expert in statistics but they should be made capable of understanding the basic statistical methodology.” [2]
● “Medical students may not like statistics, but as good doctors they will have to understand it.” [3]

1. Altman DG. The scandal of poor medical research. BMJ. 1994; 308: 283-4.
2. Singh G. Medical Science without Statistics. The Internet Journal of Healthcare Administration. 2006; 4:2.
3. Chen J. Lecture: Advice to GCRC & Surgery Fellows and Residents, SBU 2004.

● Statistics: theory & methodology for the collection, organization, analysis, interpretation & presentation of data.
● Descriptive Statistics: discipline of quantitatively describing the features of data.
o Collecting, Organizing, Summarizing, Presenting Data
● Inferential Statistics: deals with drawing conclusions from data.
o Inferences, Hypothesis Testing, Relationships, Predictions

Ability to understand:
● the value of published Medical Research.
● the role of Statistics in Medical Business.

In the past the USE of Statistics was its most significant aspect.

Today, the MISUSE of Statistics in research has become a concern.

● Statistics is an essential aspect of modern science.
● Before Statistics, science was perceived as the process of developing absolute knowledge through observations.
● In contrast, Statistics is based on the notion that scientific knowledge is not absolute.
● Hence, uncertainty & error are part of science.
● The only real things in science are distributions of numbers.
● Probability theory is used to interpret those distributions.
● Statistics reflects acts of interpretation - not irrefutable facts.

the Wilcoxon rank sum test; Poisson regression models; the Bayesian estimates; Wald χ² statistics; Cox proportional hazards; compared using t tests; repeated-measures ANOVA; adjusted hazard ratios; 2-stage statistical model; 95% confidence interval; the degrees of freedom; odds ratio; Kaplan-Meier method; Pearson χ²; Fisher exact test; to have 90% power; Mann-Whitney test; a 2-tailed α level of less than .05; the log-rank test; a 2-factor analysis of variance; χ² tests; the Z test; Logistic regression models; stratified Mantel-Haenszel analysis

Source: JAMA Vol. 292 (19): Six Original Contribution Papers

● In developed countries, much of what laymen know about medicine is gleaned from the media.
● Unfortunately, the more frightening an event is - the more “newsworthy” it is.
● The Statistical Analysis of Research Studies is complex. Regrettably, it tends to be oversimplified & sensationalized.

Data dredging (data fishing, data snooping, equation fitting) is the inappropriate use of statistics to uncover misleading relationships.

● The SIMPLEST FORM OF STATISTICS will suffice in most well-designed studies.
● Therefore, a revision of the study design should occur - before the use of more “sophisticated” analysis.
● Similarly, any study that uses overly complex statistical methods should be approached with caution.

Bhattacharya K. University of Oxford Introduction to Statistics for Medical Students. 2004

● As opposed to the past, we now live in the QUANTITATIVE ERA.
● In the clinical environment everything is measured.
● All aspects of physicians’ work are being statistically analyzed & compared to benchmarks such as evidence-based guidelines.
● Any physician who does not understand this WILL BE CRUSHED.

Computerization of medical business facilitated:
● Automated surreptitious data gathering
● Data Mining
● Physician Performance analyses
● Outcome & Cost-Effectiveness analyses
● Practitioner vs. Peer-Group comparison analyses
● More accurate Actuarial analyses

Statistics is challenging for everybody. Physicians may find it especially challenging - as Statistics is:
● Math-based. It has many rigorous quantitative aspects rooted in mathematics. Most physicians are not used to studying math-based subjects.
● Time consuming. It is a tedious subject requiring a tremendous time commitment.
● Seemingly non-essential. It appears not to be a topic of everyday use. (“I can get away w/o studying it”)

Statistics is not a spectator sport.

● Get Motivated by understanding why you need Statistics.
● Learn Actively: it cannot be passively “crammed”:
o Use pen & paper: for solving problems & reflecting on ideas
o Make your own scenarios
● Study Deliberately as:
o few words & symbols can mean a lot in statistics
o it may be necessary to read a topic many times
● Study Incrementally:
o Statistics is based on a small number of principles
o Those must be memorized & understood first
o It is futile to “look up” an advanced test (e.g. one used in a research paper) w/o knowing those essentials
● Assemble Resources:
o There is no single “best statistical manual”
o It pays to prepare a set of personalized references

Source: University of Oxford & LISA: Laboratory for Interdisciplinary Statistical Analysis at Virginia Tech

● Population: all elements to be studied
o Parameter: characteristic of the Population (e.g. Mean, Standard Deviation)

● Sample: a subset of Population.
o Statistic: characteristic of the Sample (not to be confused with Statistics)

VARIABLE: any measurable attribute that differs among the units studied.
● Quantitative = Numerical
o Continuous: any value within an interval of numbers
▪ E.g.: Time
o Discrete: only separate, countable values
▪ E.g.: Number of children in a family
● Qualitative = Categorical
o Ordinal: can be ordered (ranked)
▪ E.g.: Clothing Size: S, M, L, XL
o Nominal: cannot be ordered
▪ E.g.: Colors

DATA: values that variables can assume

● Univariate: analysis of one variable

● Bivariate: analysis of two variables
● Multivariate: analysis of many variables

● SAMPLE SELECTION & DATA COLLECTION (SSDC)
● INITIAL DATA MANIPULATION (IDM)
o Data Formatting
o Data Quality Control
● EXPLORATORY DATA ANALYSIS (EDA)
o Tabular, Numerical, Graphical data summaries
o Choosing ways of Definitive Analysis
● DEFINITIVE DATA ANALYSIS (DDA)
o Final Inferential Data Analysis
● PRESENTATION of CONCLUSIONS (PoC)
o Concise graphical & tabular summaries
o Statement of conclusions

● Understanding the phases of Statistical Analysis (SA) is important not only for performing research.
● It is essential for the critical appraisal of the published studies.
● This truth is frequently overlooked.

GOALS:
● Descriptive INFERENCE (DI): describe a population, using information from a sample
● Analytical INFERENCE (AI): describe relationships between variables, using a sample - assuming that it can be generalized to a population.

SAMPLING:
● Simple Random
● Stratified
● Cluster
● Multistage

SIMPLE RANDOM Sample
● subset of individuals chosen RANDOMLY from a population
● each individual has the same probability of being chosen
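
The definition above can be sketched in a few lines. This is only an illustration: the population of 100 patient IDs, the sample size and the function name are assumptions made up for the example.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct individuals; each one is equally likely to be chosen."""
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return rng.sample(population, k)

# Hypothetical population: 100 patient IDs (illustrative data only).
patient_ids = list(range(1, 101))
sample = simple_random_sample(patient_ids, 10, seed=1)
print(sample)
```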

STRATIFIED Sample
● STRATA: homogeneous nonoverlapping subgroups
● STRATIFICATION: dividing population into strata
● STRATIFIED Sample is obtained by simple random sampling from each stratum
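
A minimal sketch of stratification followed by simple random sampling within each stratum. The patient records, the two age groups and the fixed number drawn per stratum are hypothetical choices for illustration; real designs often draw proportionally to stratum size instead.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, k_per_stratum, seed=None):
    """Split units into strata, then draw a simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # stratification step
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample

# Hypothetical patients as (id, age_group) pairs: 40 "under 40", 20 "40 and over".
patients = [(i, "under 40" if i % 3 else "40 and over") for i in range(1, 61)]
chosen = stratified_sample(patients, lambda p: p[1], 5, seed=1)
print(chosen)
```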

CLUSTER Sample
● CLUSTERS: natural heterogeneous subgroups representative of population
● CLUSTERING: identifying clusters in population
● CLUSTER Sample is obtained by randomly selecting whole clusters and including all elements of the selected clusters

MULTISTAGE Sample
● a form of cluster sampling
● used when including all the elements of the selected clusters is not feasible
● instead the researcher randomly selects elements from within the selected clusters
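
The difference between the two designs can be sketched as follows, assuming the common textbook reading: a one-stage cluster sample keeps every element of the randomly chosen clusters, while a multistage sample draws random elements inside them. The clinics and patient labels are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """One stage: randomly pick whole clusters and keep every element in them."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in picked for unit in clusters[name]]

def multistage_sample(clusters, n_clusters, k_per_cluster, seed=None):
    """Two stages: randomly pick clusters, then randomly pick elements within each."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in picked:
        members = clusters[name]
        sample.extend(rng.sample(members, min(k_per_cluster, len(members))))
    return sample

# Hypothetical clusters: 5 clinics with 20 patients each.
clinics = {f"clinic_{c}": [f"{c}{i}" for i in range(1, 21)] for c in "ABCDE"}
one_stage = cluster_sample(clinics, 2, seed=1)
two_stage = multistage_sample(clinics, 2, 5, seed=1)
print(len(one_stage), len(two_stage))
```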

Putting a Data Set in order, making it usable:
● Data Formatting
● Checking Quality of:
o Data (outliers?)
o Implementation of Design
● Basic Characteristics of data

OUTLIERS: data points that deviate remarkably from the majority of the sample.

DISTRIBUTION: The pattern of occurrence of the various values of a variable. It is a listing or function showing all the possible values of the data and how often they occur.
● POPULATION D: distribution of values for all units in the population
● EMPIRICAL D: distribution of values for the units in a sample.
● It is assumed that the Empirical Distribution is a good representation of the Population Distribution
● Distribution of categorical data shows the number & percentage of individuals in each group.
● Distribution of numerical data is typically presented using graphs & charts to examine:
o the shape,
o center,
o amount of variability in the data.

NORMAL (GAUSSIAN) DISTRIBUTION
● A PROBABILITY DISTRIBUTION assigns a probability to each measurable subset of the possible outcomes of a procedure.
● Normal (Gaussian) distribution is a very common continuous probability distribution.
● Continuous probability distribution is a probability distribution that has a pdf.
● pdf: Probability density function, or density of a continuous random variable, is a function that describes the relative likelihood for this variable to take on a given value.
● There are myriad probability distributions. Most are related to each other, and ultimately to the Normal.
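
As a small illustration of what a pdf expresses, the sketch below evaluates the density of a standard Normal distribution (mean 0, SD 1, an assumption chosen for the example) with Python's standard library. Values near the mean have a higher density than values in the tails.

```python
from statistics import NormalDist

# Standard Normal: mean 0, standard deviation 1 (chosen only for illustration).
standard_normal = NormalDist(mu=0, sigma=1)

density_at_mean = standard_normal.pdf(0)  # relative likelihood at the center
density_at_two = standard_normal.pdf(2)   # relative likelihood two SDs away

# Probability of falling within one SD of the mean, from the cumulative distribution.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(density_at_mean, density_at_two, within_one_sd)
```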

● GOAL: to reduce the information contained in a data set to a few key indicators.
● APPROACH: summarization of the data with visual methods to reveal trends & patterns.
● METHODS: depend on the type of data
o TABULAR
o NUMERICAL: Quantiles & Quartiles, Median, Mean, Mode, Spread or Dispersion, Interquartile Range, Standard Deviation, Coefficient of Variation
o GRAPHICAL

● The EDA methods to be presented in this section are important not just for researchers.
● Any reader of scientific literature or business statistical analyses will encounter the methods discussed here.
● Familiarity with them is essential for one’s ability to critically appraise any statistics-based document.

FREQUENCY DISTRIBUTION: an organization of the raw data in tabular form using classes & frequencies
● Frequency: the number of times a value occurs in a data set
● Relative Frequency: frequency counts expressed as percentages of the total observations
● Cumulative Frequency: the sum of the frequencies for all values at or below the given value
● Cumulative Relative Frequency: the sum of the relative frequencies for all values at or below the given value
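
The four quantities defined above can be tabulated together. The sketch below is one possible way to do it; the pain scores are hypothetical data invented for the example.

```python
from collections import Counter

def frequency_table(values):
    """Rows of (value, frequency, relative %, cumulative frequency, cumulative relative %)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq  # frequencies of all values at or below this one
        rows.append((value, freq, 100 * freq / total,
                     cumulative, 100 * cumulative / total))
    return rows

# Hypothetical pain scores reported by 10 patients.
pain_scores = [2, 3, 3, 1, 2, 3, 4, 2, 3, 1]
table = frequency_table(pain_scores)
for row in table:
    print(row)
```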

● Useful for categorical data.
● It presents the distribution of values by showing their frequencies.

Contingency table (cross tab) is used to analyze the relationship between two or more categorical variables.
Example: 100 individuals are randomly sampled from a population as part of a study of sex differences in handedness.
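
A sketch of how such a table could be built from raw observations. The cell counts below are hypothetical numbers chosen only so that they add up to 100; they are not results from any study.

```python
from collections import Counter

def crosstab(pairs):
    """Count how often each (row category, column category) pair occurs."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    table = {r: {c: cells[(r, c)] for c in cols} for r in rows}
    return table, rows, cols

# Hypothetical sample of 100 individuals: (sex, handedness).
observations = ([("Male", "Right")] * 43 + [("Male", "Left")] * 9 +
                [("Female", "Right")] * 44 + [("Female", "Left")] * 4)

table, rows, cols = crosstab(observations)
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

for r in rows:
    print(r, table[r], "total:", row_totals[r])
print("column totals:", col_totals)
```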

● Quantiles & Quartiles
● Location
o median
o mean
o mode
● Spread or Dispersion
o Range
o Interquartile range
o Variance
o Standard deviation
o Coefficient of variation
● Skewness
o Coefficient of Skewness
● Kurtosis
o Coefficient of Kurtosis
● Covariance
● Correlation
o Correlation Coefficients
▪ Pearson’s CC
▪ Spearman's rank CC

Simple Definition: QUANTILES: Points taken at regular intervals, that divide the data set into equal subsets.

Example of Formal Definition: The α-th sample quantile, denoted η(α), is the smallest value such that (100×α)% of the observations for the variable take values which are less than or equal to η(α).

● Quantiles are the data values (cut-off POINTS) marking the boundaries between subsets.
● Examples of specific quantiles:
o 2-quantile: median
o 4-quantiles: quartiles
o 5-quantiles: quintiles
o 100-quantiles: percentiles
● Common misconception: the use of the name of quantiles to denote the subsets they mark. These subsets should be called thirds, quarters, fifths, etc.
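
The formal definition of the α-th sample quantile can be sketched directly, reading "(100×α)% of the observations" as "at least that share". This is one convention among several; statistical software often interpolates between neighboring values instead. The function name is an assumption for the example.

```python
import math

def sample_quantile(values, alpha):
    """Smallest data value with at least a share alpha of observations at or below it."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of observations that must lie at or below the quantile (at least one).
    k = max(1, math.ceil(alpha * n))
    return ordered[k - 1]

data = [2, 3, 5, 8, 9]
lower, middle, upper = (sample_quantile(data, a) for a in (0.25, 0.5, 0.75))
print(lower, middle, upper)
```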

QUARTILES: three POINTS that divide the data set into four equal groups, each comprising a quarter of data. A quartile is a type of quantile.
● Q1: First = lower = 25th percentile
o splits off the lowest 25% of data
● Q2: Second = median = 50th percentile
o cuts data set in half
● Q3: Third = upper = 75th percentile
o splits off the highest 25% of data

Interquartile Range (IQR): the difference between upper and lower quartiles.
IQR = Q3 - Q1
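
Quartiles and the IQR can be computed with Python's standard library (version 3.8 or later). The data set is hypothetical, and the "inclusive" method is just one of several conventions for locating quartiles, so other software may report slightly different cut-off points.

```python
import statistics

# Hypothetical data set of nine ordered measurements.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut-off points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3, "IQR:", iqr)
```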

Finding the position of the value in the data set that best characterizes it.
● Median (X̃): value separating the “higher” half of a data set from the “lower” half
o The median of {2,3,5,8,9} is 5
● Mean (X̄): the sum of the n numbers divided by n
o The mean of {6,4,7,10,4} is (6+4+7+10+4)/5 = 6.2
● Mode (Mo): the most frequent value in the data set
o The mode of {1, 3, 6, 4, 3, 5, 3} is 3
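
The three examples above can be checked with Python's standard `statistics` module; the variable names are chosen only for this illustration.

```python
import statistics

median_value = statistics.median([2, 3, 5, 8, 9])     # middle value of the sorted data
mean_value = statistics.mean([6, 4, 7, 10, 4])        # sum of the values divided by n
mode_value = statistics.mode([1, 3, 6, 4, 3, 5, 3])   # most frequent value

print(median_value, mean_value, mode_value)
```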

● Mean is affected by outliers, median is not
● Median exhibits robustness against outliers
● Robustness: “the ability to resist”.
● Robust statistics: statistics with good performance for data drawn from a wide range of probability distributions & not affected by outliers
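
A short demonstration of this robustness: replacing one value with an extreme outlier (a made-up recording error, 900 instead of 9) shifts the mean dramatically while the median stays put.

```python
import statistics

clean = [2, 3, 5, 8, 9]
with_outlier = [2, 3, 5, 8, 900]  # hypothetical data-entry error in the last value

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(label, "mean:", statistics.mean(data), "median:", statistics.median(data))
```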

● Spread (Dispersion): measures the degree to which the observed values are concentrated around a location measure.
● Smaller spread: values are tightly clustered around the center.

Measures of Spread:
● Range
● Interquartile range
● Variance
● Standard deviation
● Coefficient of variation

● RANGE: difference between the sample Maximum & Minimum.
o The simplest measure of dispersion
o Very sensitive to outliers
● INTERQUARTILE RANGE (IQR): the difference between upper and lower quartiles.
o Less sensitive to outliers

VARIANCE (s²): measure of how far a set of numbers is spread out: how far the numbers are located from the mean.
● s² is never negative
● s² = 0: no variation
● s² Small: data close to X̄
● s² High: data far from X̄
n = Number of observations; Xi = Each of the values of the data; X̄ = Mean

Since Variance is expressed in squared units it is difficult to interpret intuitively.

Standard Deviation (SD): square root of the Variance. It shows the extent of variation from the mean.
● s Small: data close to X̄
● s High: data far from X̄
● Both s² & s depend on the units in which a variable is measured.
● It can be misleading when comparing variables using different units.
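
A sketch of both measures, assuming the usual sample formula s² = Σ(Xi - X̄)² / (n - 1); the slides' own formula did not survive, so the n - 1 divisor is an assumption (some texts divide by n). The result is cross-checked against Python's `statistics` module, which uses the same divisor.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [6, 4, 7, 10, 4]
s2 = sample_variance(data)  # variance, in squared units
s = math.sqrt(s2)           # standard deviation, in the original units

print("variance:", s2, "SD:", s)
print("library check:", statistics.variance(data), statistics.stdev(data))
```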

COEFFICIENT
from Latin: co (together) + efficere (to effect)

In Mathematics: Number or other known factor (symbol) by which another number or factor is multiplied.
● E.g.: in the equation ax² + bx + c = 0
o a is the coefficient of x²
o b is the coefficient of x

In Statistics: Measure of a specified characteristic of a phenomenon

Coefficient of Variation (CV): ratio of the SD to the Mean: CV = s / X̄
Relative SD (RSD): CV expressed as a percentage
s = Standard Deviation; X̄ = Mean
● CV=0: No Variance
● CV<1: Low Variance
● CV>1: High Variance
● CV has no units
● It can be used for comparing dispersions of variables measured in different units.
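
To illustrate the unit-free property, the sketch below computes the CV of the same hypothetical heights expressed in centimeters and in meters. The SD changes with the unit; the CV does not. The function name and data are assumptions for the example.

```python
import statistics

def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical heights, recorded in two different units.
heights_cm = [160, 165, 170, 175, 180]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print("SD in cm:", statistics.stdev(heights_cm), "SD in m:", statistics.stdev(heights_m))
print("CV in cm:", cv_cm, "CV in m:", cv_m)
```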

Skewness: deviation from symmetry with respect to a location measure. It is unit-free. (b1 = Coefficient of Skewness)
● b1=0: values distributed symmetrically around X̄
o Tails are symmetric
● b1>0: positively, right-skewed
o Longer tail for values > X̄
● b1<0: negatively, left-skewed
o Longer tail for values < X̄
s = Standard Deviation; X̄ = Mean; n = Number of observations; Xi = The data values

Kurtosis: the degree of peakedness of the distribution - as compared to a Normal (Gaussian) Distribution. It is unit-free. (b2 = Coefficient of Kurtosis)
● b2>3: Leptokurtic
o Peaked > Normal
● b2=3: Mesokurtic
o Peaked as Normal
● b2<3: Platykurtic
o Peaked < Normal
s = Standard Deviation; X̄ = Mean; n = Number of observations; Xi = Each of the data values
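
A sketch of the skewness and kurtosis coefficients, assuming the moment-based forms b1 = m3 / m2^1.5 and b2 = m4 / m2², where mk is the average of (Xi - X̄)^k over the n observations. The original formulas are not recoverable from the slides, and other texts apply small-sample corrections, so treat these definitions as an assumption. Both data sets are hypothetical.

```python
def central_moment(values, k):
    """Average of the k-th powers of the deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness_b1(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis_b2(values):
    """Moment-based kurtosis: 3 corresponds to the Normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]

print("b1 symmetric:", skewness_b1(symmetric))
print("b1 right-skewed:", skewness_b1(right_skewed))
print("b2 symmetric:", kurtosis_b2(symmetric))
```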

● Covariance is a measure of how much two random variables change together.
● Dependence is any statistical relationship between two random variables.
● Correlation refers to statistical relationships involving dependence.

Correlation does not imply causation!

Covariance: measures association between two numerical variables
● cov(X,Y)=0: X&Y are UNCORRELATED
o X&Y show no linear correspondence (this alone does not prove independence)
● cov(X,Y)>0: X&Y POSITIVELY associated
o greater values of X correspond w/ greater Y
● cov(X,Y)<0: X&Y NEGATIVELY associated
o greater values of X correspond w/ smaller Y
X, Y: variables; Xi, Yi: observations for unit i; X̄, Ȳ: means of the variables; n: number of observations
● Sign (+/-) of cov shows the type of linear relationship between X&Y.
● The magnitude of the cov is hard to interpret, hence normalized cov is used.
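
A sketch of the sample covariance, assuming the usual form cov(X,Y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1); the n - 1 divisor is an assumption since the slide's formula is missing. The last example uses made-up data in which Y is completely determined by X (Y = X²) yet the covariance is zero, a reminder that covariance only captures linear association.

```python
def covariance(x, y):
    """Average product of paired deviations from the means (n - 1 divisor assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

x = [1, 2, 3]
rising = [2, 4, 6]    # larger X goes with larger Y
falling = [6, 4, 2]   # larger X goes with smaller Y

centered = [-2, -1, 0, 1, 2]
squared = [v ** 2 for v in centered]  # Y depends on X, but not linearly

print(covariance(x, rising), covariance(x, falling), covariance(centered, squared))
```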

● NORMALIZATION: creation of scaled versions of statistics to allow comparisons free of the influence of scale or measurement units.
● Correlation Coefficients (CC): normalized versions of covariance
● CC measure the degree of correlation
● CC commonly used:
o Pearson Correlation Coefficient
o Spearman’s Rank Correlation Coefficient

Pearson Correlation Coefficient (r): measure of the linear correlation between variables X&Y.
Linear X,Y relationship is modeled best by a straight line
r = cov(X,Y) / (sx · sy)
● r=-1: total NEGATIVE correlation
● r=0: NO linear correlation
● r=+1: total POSITIVE correlation
X, Y: variables; cov(X,Y): covariance; sx, sy: Standard Deviations
● r removes the dependence on the units by scaling the cov by the product of the SDs of X,Y
● r is not robust to: outliers, unequal variances, non-normality, & non-linearity
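
A sketch of r computed as the covariance scaled by the product of the two SDs. The helper names and the height/weight data are hypothetical; the same divisor (n - 1, an assumed convention) is used in the covariance and in the SDs, so it cancels out. Changing the units of one variable leaves r unchanged.

```python
import math

def covariance(x, y):
    """Sample covariance with an n - 1 divisor (assumed convention)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def std_dev(values):
    """Standard deviation: square root of the covariance of a variable with itself."""
    return math.sqrt(covariance(values, values))

def pearson_r(x, y):
    """Covariance scaled by the product of the standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

# Hypothetical paired measurements.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 61, 68, 80, 85]
weight_g = [w * 1000 for w in weight_kg]  # same weights in different units

r_kg = pearson_r(height_cm, weight_kg)
r_g = pearson_r(height_cm, weight_g)
print(r_kg, r_g)
```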

● RANK: relative position in a graded group
● RANKING: transformation of data, in which values are replaced by their rank

Ranking of numerical dataset: { 3.4, 5.1, 7.3, 2.6, 8.9 }
Value   Rank
8.9     5
7.3     4
5.1     3
3.4     2
2.6     1

Spearman's Rank Correlation Coefficient (ρ): measure of monotonic dependence between variables X&Y
In Monotonic X,Y relationship: Y moves in one direction (↑ or ↓) as X moves, but the relationship is not necessarily linear
ρ reflects Monotone Trend (M.T.) between X&Y:
● ρ=+1: perfect increasing M.T.
o +1>ρ>0: increasing M.T. (Y↑ when X↑)
● ρ=0: no M.T.
o -1<ρ<0: decreasing M.T. (Y↓ when X↑)
● ρ=-1: perfect decreasing M.T.
● ρ is calculated by applying the Pearson CC formula to the ranks of the data, not to the values. For a sample of size n, the n raw scores Xi, Yi are converted to ranks xi, yi.
● ρ is robust to outliers, unequal variances, non-normality, & non-linearity
● ρ is non-parametric as its exact sampling distribution can be obtained w/o knowing the parameters of the joint probability distribution of X&Y.
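
A sketch of the rank-then-correlate procedure described above. Giving tied values the average of their ranks is an assumption (a common convention, not stated in the slides). The example pairs X with X³, a relationship that is monotonic but not linear, so the rank-based coefficient reaches 1 while Pearson's coefficient on the raw values stays below 1.

```python
import math

def ranks(values):
    """Rank 1 = smallest value; tied values share the average of their ranks (assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = average_rank
        pos = end + 1
    return result

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson formula applied to the ranks instead of the raw values."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but non-linear relationship

print(ranks([3.4, 5.1, 7.3, 2.6, 8.9]))
print("Spearman:", spearman_rho(x, y), "Pearson:", pearson_r(x, y))
```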


● Statistics is an essential component of both: Science & Business of Medicine.
● Despite this fact, statistical illiteracy & innumeracy are prevalent among physicians.
● This situation should be remedied.
● Study of Statistics poses many challenges, but those are well worth overcoming.

● Statistics is based on a definite number of principles.
● It is best studied in an incremental fashion.

● Statistics reflects acts of interpretation, not irrefutable facts.
● Statistics can be misused & abused.
● Statistical Analyses are the result of a multiphasic process, that:
o starts at Sample Selection,
o ends with Presentation of Conclusions.
● Appraisal of Statistical Analyses requires familiarity with all phases.
● Understanding of Tabular, Numerical and Graphical Methods of EDA is critical for assessing the quality of the Statistical Analysis.

To be continued…

Author wishes to thank: Stephen DeCherney, MD, MPH for his valuable comments.

Nothing to disclose: there are no known conflicts of interest associated with this presentation. Specifically, neither the author nor his family have any potential conflicts of interest, financial or otherwise, regarding any of the products and/or services discussed here.
