# SSDC

● Discussed the significance of Statistics for Physicians
● Suggested the study strategies for learning Statistics
● Presented the role of Statistics in the Scientific Process
● Reviewed basic concepts of Statistics including:
o Sample Selection & Data Collection
o Initial Data Manipulation
o Tabular & Numerical Methods of Exploratory Data Analysis

Diagram labels: IDM, EDA, DDA, PoC

● Discuss Graphical Methods of Exploratory Data Analysis
● Present Basic Methods of Statistical Inference
● Introduce the most commonly used statistical tests
● Provide directions for further study of Statistics

Graphical methods allow one to discover trends & patterns

COMMON GRAPHICAL METHODS
● PIE chart
● LINE chart
● BAR chart
● HISTOGRAMS
● FREQUENCY POLYGONS
● DOT plots
● STEM-&-LEAF plots
● BOX-WHISKERS plots
● SCATTER plots
● Q-Q plots

PIE chart:
● A circular chart divided into sectors
● It illustrates numerical proportion
● Simple but flawed method
● Unsuitable for large data sets
● Inconvenient for data comparisons
● Commonly used in business
● Avoided in research

LINE chart:
● A series of data points (markers) connected by straight line segments
● It shows how data change over time & in response to interventions
● It reveals trends very clearly

● Error bars added to markers provide

● Commonly used in research

BAR chart:
● Useful for presenting DISCRETE data
● Uses horizontal or vertical bars to show comparisons among categories
● Axes represent categories vs discrete data
● Stacked bar graphs: bars divided into subparts
● Grouped bar graphs: bars clustered in groups
● Commonly used in research & business

HISTOGRAM:
● Useful for presenting CONTINUOUS data
● Used to show the data distribution
● Uses RECTANGLES whose:
o widths represent class intervals (BINS), i.e. equal width groups into which data are put
o heights are proportional to the frequencies of the corresponding bins

Figure axes: Frequency vs BINS

Histogram a.k.a. Frequency Plot (FP)
TRUE HISTOGRAM (TH) differs from FP:
● TH uses RECTANGLES whose:
o widths represent class intervals (BINS), like FP, but
o areas (not heights) are proportional to the corresponding frequencies
o heights are equal to the Frequency Density of the interval, i.e., the frequency divided by the width of the interval
● FP is identical to TH only when relative frequencies and bins of equal length are used for the FP

Figure axes: Frequency Density vs BINS
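The relation "height = frequency / bin width" can be sketched in a few lines of Python. The bin edges below are an assumption chosen only to show unequal widths, and the data are the 16-value set used for the binning example in these slides; this is an illustration, not a reference implementation.

```python
# Data: the 16-value set from the binning example in these slides.
data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 38, 45, 49]
edges = [0, 10, 20, 50]  # assumed, deliberately unequal bins; each bin is [low, high)

def frequency_density(values, bin_edges):
    """Return one (frequency, width, density) tuple per bin [low, high)."""
    rows = []
    for low, high in zip(bin_edges, bin_edges[1:]):
        freq = sum(1 for v in values if low <= v < high)
        width = high - low
        rows.append((freq, width, freq / width))
    return rows

rows = frequency_density(data, edges)
for (freq, width, density), low, high in zip(rows, edges, edges[1:]):
    # In a true histogram the rectangle area (density * width) equals the frequency.
    print(f"[{low}, {high}): frequency={freq}, width={width}, density={density:.2f}")
```

The wide last bin holds the most observations, yet its rectangle is only modestly taller than its neighbour, because area rather than height carries the frequency.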

BINS = class intervals:
● equal width groups into which data are put
● series of ranges of numerical values into which data are sorted
● There is no "best" number of bins
● Different bin sizes can reveal different features

FREQUENCY:
● the number of observations in each class interval (BIN)

● Term “histogram” sounds awkward for physicians ● It evokes the word ὑστέρα (hystera=uterus), but this is just a coincidence

The origin of “Histogram” is unknown.
It was likely coined by statistician Karl Pearson as: ἱστός(histos: mast) + γραμμα(gramma: writing) or historical + diagram

HISTOGRAM: “representation of a frequency distribution by means of rectangles, whose widths represent class intervals (BINS) and whose areas are proportional to frequencies."

Data Set: {1,2,2,3,3,3,3,4,4,5,6} ● Divide it into BINS, note frequency: ● {[1] [2,2] [3,3,3,3] [4,4] [5] [6]}

The assignment pattern here is: ● Bin 1 contains 1: its frequency is 1 ● Bin 2 contains 2,2: its frequency is 2 ● Bin 3 contains 3,3,3,3: its frequency is 4, etc.

FREQUENCY BINS (class intervals)

In real data sets most numbers will be unique
Data Set: {3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 38, 45, 49}
● Let’s use Ranges as BINS
● By using Ranges with width of 10, data can be organized as follows:

| Data Range (BIN) | Data | Frequency |
|---|---|---|
| 0-10 | 3 | 1 |
| 10-20 | 11,12,19 | 3 |
| 20-30 | 22,23,24,25,27,29 | 6 |
| 30-40 | 35,36,37,38 | 4 |
| 40-50 | 45,49 | 2 |

Figure axes: Frequency vs BINS (Data Ranges)
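As a rough sketch, the same binning can be reproduced in Python. The rule for values that fall exactly on a boundary (assigned here to the upper bin) is an assumption, since the slide does not say which bin a value such as 20 would belong to.

```python
from collections import Counter

data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 38, 45, 49]
BIN_WIDTH = 10

# Assumption: bins are [low, high), so a boundary value would go to the upper bin.
counts = Counter((value // BIN_WIDTH) * BIN_WIDTH for value in data)

# Print a crude text histogram, one row per bin.
for low in range(0, 50, BIN_WIDTH):
    freq = counts.get(low, 0)
    print(f"{low:>2}-{low + BIN_WIDTH:<2} {'#' * freq} ({freq})")
```

Changing `BIN_WIDTH` is a quick way to see how different bin sizes reveal different features of the same data.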

A HISTOGRAM allows one to analyze a dataset by reducing it to a single graph showing the significance of primary & secondary peaks

THIS HISTOGRAM tells us that in the dataset:
● The peak is:

o well-defined o close to Median & Mean
● Outliers are not frequent

Thus: the deviations from the mean are not frequent

THIS HISTOGRAM tells us that in the dataset:
● The peak is:

o Not well-defined

o Fairly close to Median & Mean
● Outliers are frequent

Thus: the deviations from the mean are frequent

THIS HISTOGRAM tells us that in the dataset:
● There are two peaks:

o a taller primary peak & a shorter secondary o Median & Mean are hard to localize
● Outliers are hard to define as well

Thus: it is very poorly defined whether the data contain one signal or two

FREQUENCY POLYGON:
● It is drawn by joining the midpoints of the tops of the bars of a histogram
● Unlike histograms, frequency polygons can be superimposed
● It is used to compare frequency distributions of multiple datasets on one diagram

Figure axes: FREQUENCY vs BINS

DOT plot: data plotted on a simple scale, using circles

Figure: dot plot of "Favorite Movies" (categories: SciFi, Comedy, Action, Horror, Drama)

WILKINSON DP: ● Used for univariate data ● Simple representation of a distribution ● Useful for highlighting clusters, gaps & outliers in small datasets

CLEVELAND DP: ● Used for multivariate data ● Plots of points belonging to several categories ● Alternative to bar charts ● Bars are replaced by dots at the values associated with each category

STEM-&-LEAF plot:
● Two columns separated by a line:
o Right: Leaves contain the last digit of the number
o Left: Stem contains all of the other digits
● Visualization of a distribution in a small data set
● Useful for outliers and finding the mode
● Used in the ’80s (easy to make with typewriters)
● Became less common after computer graphics became available

Data set: 44,46,47,49, 63,64,66,68,68, 72,72,75,76, 81,84,88, 106

Displayed as:

 4 | 4 6 7 9

 5 |

 6 | 3 4 6 8 8

 7 | 2 2 5 6

 8 | 1 4 8

 9 |

10 | 6

Key: 6|3=63 Leaf unit: 1.0 Stem unit: 10.0
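A stem-and-leaf display is simple enough to generate by hand or in code. The sketch below assumes non-negative integers with a leaf unit of 1 and a stem unit of 10, as in the key above; the function name `stem_and_leaf` is only illustrative.

```python
from collections import defaultdict

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

def stem_and_leaf(values):
    """Return display lines; stem = value // 10, leaf = value % 10."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Empty stems are kept so that gaps in the data stay visible.
    stems = range(min(leaves), max(leaves) + 1)
    return [
        f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves.get(stem, []))}".rstrip()
        for stem in stems
    ]

lines = stem_and_leaf(data)
print("\n".join(lines))
```

The empty stems 5 and 9 and the isolated leaf on stem 10 make the gaps and the outlier (106) easy to spot.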

Figure labels (BOX-WHISKERS plot): Upper Whisker, Q3, MEDIAN (Q2), Mean (+), IQR, Q1, Lower Whisker, OUTLIERS

Conventions used for the Box are uniform: ● Bottom: first quartile Q1 ● Line inside: Median Q2 ● Symbol inside (e.g.+): Mean ● Top: third quartile Q3 ● Height: Interquartile Range ● Width of the box is arbitrary, as there is no x-axis ● Spacings between parts of the box help indicate the degree of dispersion (spread) and skewness (asymmetry)

Conventions used for Whiskers & Outliers vary:
● Whiskers: their ends can represent:
o Minimum & Maximum of the data
o one Standard Deviation above & below the Mean
o 9th percentile and the 91st percentile
o the 2nd percentile and the 98th percentile
● Outliers: data not included between the whiskers are plotted with dots, circles, stars, or lines
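The numbers behind a box plot can be sketched with Python's `statistics` module. Two choices below are assumptions rather than anything fixed by these slides: the "inclusive" quartile method (tools differ in how they define quartiles) and the 1.5 x IQR whisker rule, which is one more common convention in addition to those listed above. The data set is reused from the stem-and-leaf example.

```python
import statistics

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

# Quartile definitions vary between tools; method="inclusive" is one choice.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Assumed whisker rule: fences at 1.5 x IQR beyond the box.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = [x for x in data if low_fence <= x <= high_fence]
outliers = [x for x in data if x < low_fence or x > high_fence]

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr, "mean:", round(statistics.mean(data), 1))
print("whisker ends:", min(inside), max(inside), "outliers:", outliers)
```

With a different whisker convention (for example minimum and maximum), the value 106 would sit at the end of the upper whisker instead of being drawn as an outlier.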

All cases (except outliers) fit between the upper & lower marks

Distribution is not symmetric: the median line is NOT in the middle of box

EXAMPLE OF MULTIVARIATE BOX-PLOTS

Boxplot for the height of 240 students by gender (M vs F) shows:
● A difference in distributions of height in M vs F
● Spread is similar in M & F: similar heights of the boxes on either side of the medians
● There are more outliers among M

SCATTER plot:
● Uses Cartesian coordinates to display values for two numerical variables
● Useful when comparing discrete variables vs numeric variables
● Can suggest various kinds of correlations between variables

Q-Q plot:
● Used for comparing two probability distributions (DD) by plotting their quantiles vs each other
● x-coordinate: quantile of 1st distribution
● y-coordinate: quantile of 2nd distribution
● Q–Q plot on line y=x -> DD are similar
● Q–Q plot on a straight line other than y=x -> DD are linearly related
● Commonly used to compare an observed data set to a theoretical model

Figure axes: x = OBSERVED, y = THEORETICAL
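The coordinates of a Q-Q plot can be computed without any plotting library. This sketch compares an observed sample with a normal model fitted to it; the plotting positions (i - 0.5)/n and the fitted normal model are assumptions made for illustration, and other conventions exist.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each observed value with the matching quantile of a fitted normal model."""
    ordered = sorted(sample)
    n = len(ordered)
    model = NormalDist(mean(ordered), stdev(ordered))  # assumed theoretical model
    # Assumed plotting positions (i - 0.5) / n for the i-th smallest value.
    theoretical = [model.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, theoretical))  # (x = observed, y = theoretical)

sample = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]
for observed, theoretical in qq_points(sample):
    print(f"{observed:>5} {theoretical:8.2f}")
```

If the printed pairs fall close to the line y = x when plotted, the normal model describes the sample reasonably well.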

STATISTICAL INFERENCE:
● Drawing conclusions from data subjected to random variation such as: observational errors, random sampling/experiments
● It draws conclusions about populations by analyzing their samples

Includes inter-related areas: ● ESTIMATION ● CONFIDENCE INTERVALS ● HYPOTHESIS TESTS

● Statistics reflect acts of interpretation, not “absolute” facts
● Conclusions of a statistical inference are statistical PROPOSITIONS
● Final Inference is obtained by using the following interrelated propositions:
o ESTIMATION: calculation of the attribute of a sample (statistic) representing the “Best Estimate” of the attribute of the population (parameter)
o CONFIDENCE INTERVAL: calculated region likely to contain true values of the attributes of interest; it indicates the reliability of an Estimate
o HYPOTHESIS TESTING: consideration if chance is a plausible explanation of findings; it uses the data to decide if the hypothesis that there is no relationship between measured phenomena (Null Hypothesis) can be rejected

Statistical ASSUMPTIONS:
● suppositions about mechanisms & features of Population & Sample
● Statistical Inference relies on them
Statistical MODEL:
● set of statistical assumptions
Assumptions can be divided into:
● Non-Modeling (re: Population, Sample)
● Modeling (re: Distribution, Structure, Cross-Variation)

Based upon Modeling Assumptions INFERENCE can be divided into:
● PARAMETRIC: assumes that the population fits a certain ideal distribution described by parameters
● NON-PARAMETRIC: does not depend on the population fitting any parameterized distributions
● SEMI-PARAMETRIC: has both parametric and nonparametric components

● Parametric Inference (PI) assumes the existence of an idealized distribution for the population from which the sample is drawn
● Uses the known shape & parameters of that ideal distribution for the Inference

PI is less robust but simpler than Non-Parametric Inference (NPI)
● Makes more assumptions than NPI
● If those assumptions are:
o Correct -> better estimates than NPI
o Wrong -> it is misleading
● Thus it is less robust than NPI
● Its models are simpler than NPI
● Thus it is more convenient than NPI
● PI has been most commonly used
● Became subject of recent criticism*

(*) Nassim Taleb. The Black Swan: The Impact of the Highly Improbable. 2nd Ed, 2010

● Non-Parametric Inference (NPI) relies on no or few assumptions about the shape or parameters of the population distribution from which the sample is drawn
● NPI is useful for study of populations that take on a ranked order
● Compared to PI:
o NPI is frequently less convenient, but always more robust
o NPI has less power (larger sample size is required to draw conclusions with same degree of confidence)
o NPI can be occasionally simpler than PI
● NPI is seen by some as leaving less room for misuse & misunderstanding
CAVEAT: Term “non-parametric” has additional meanings in Statistics, e.g. it denotes techniques that do not assume that the structure of a model is fixed.

ESTIMATION: using the data to provide a suitable guess at the population attributes
● STATISTIC: any mathematical function of the data in a sample, e.g. the sample mean X̄
● ESTIMATOR: statistic for calculating an estimate based on data
● ESTIMATE: result of an estimator, e.g. Mean, Variance
● POINT estimator: yields a single-valued result
● INTERVAL estimator: yields a range of plausible values
● ERROR of the estimator: reflects the degree of its precision and reliability. It depends on sample size.

SAMPLING DISTRIBUTION: a probability distribution that describes the probabilities of the possible values for a specific statistic
● The form of the Sampling Distribution will depend on the population distribution
● The Sampling Distribution is necessary for constructing confidence intervals & hypothesis testing

2 pool balls out of 3

| Outcome | Ball 1 | Ball 2 | Mean |
|---|---|---|---|
| 1 | 1 | 1 | 1.0 |
| 2 | 1 | 2 | 1.5 |
| 3 | 1 | 3 | 2.0 |
| 4 | 2 | 1 | 1.5 |
| 5 | 2 | 2 | 2.0 |
| 6 | 2 | 3 | 2.5 |
| 7 | 3 | 1 | 2.0 |
| 8 | 3 | 2 | 2.5 |
| 9 | 3 | 3 | 3.0 |

| Mean | FREQ | REL FREQ |
|---|---|---|
| 1.0 | 1 | 0.111 |
| 1.5 | 2 | 0.222 |
| 2.0 | 3 | 0.333 |
| 2.5 | 2 | 0.222 |
| 3.0 | 1 | 0.111 |

Sampling Distribution Of The Mean: a probability distribution that describes the probabilities of the possible values for the Mean
● Consider a Normal Population
● Repeatedly take samples of a given size from it
● Calculate Mean for each sample (Sample Mean)
● Each sample has its own Mean value
● The distribution of these Means is called the sampling distribution of the mean

Figure axes: Rel. Frequency (PROBABILITY) vs MEAN
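The pool-ball table can be reproduced by enumerating every possible sample. The sketch below assumes, as the nine outcomes in the table imply, that the two balls are drawn with replacement and that order matters.

```python
from collections import Counter
from itertools import product

balls = [1, 2, 3]

# All ordered samples of size 2, drawn with replacement (9 equally likely outcomes).
samples = list(product(balls, repeat=2))
mean_counts = Counter(sum(s) / len(s) for s in samples)

for value in sorted(mean_counts):
    freq = mean_counts[value]
    print(f"mean={value:.1f} frequency={freq} relative frequency={freq / len(samples):.3f}")
```

The relative frequencies printed here are the sampling distribution of the mean for this tiny population.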

STANDARD ERROR of a statistic: Standard Deviation (σ) of the Sampling Distribution of that statistic
STANDARD ERROR OF THE MEAN (SEM): SD (σ) of the Sampling Distribution of the Mean
SEM vs SD:
● SEM: estimate of how far the sample mean is likely to be from the population mean. It is an Inferential Statistic.
● SD: degree to which individuals within the sample differ from the sample mean. It is a descriptive statistic.
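In practice the SEM is usually estimated from a single sample as s / √n, where s is the sample standard deviation. The sketch below uses hypothetical measurements to contrast the two quantities; the function name is illustrative only.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the SEM as s / sqrt(n), with s the sample standard deviation."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
sd = statistics.stdev(sample)
sem = standard_error_of_mean(sample)
print(f"SD = {sd:.3f} (spread of individuals), SEM = {sem:.3f} (uncertainty of the mean)")
```

Collecting more observations leaves the SD roughly unchanged but shrinks the SEM, which is why the two should not be used interchangeably.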

BIAS: the difference between the expected value of an Estimator & the corresponding population parameter it is designed to estimate
● An Estimator is unbiased for a parameter A if its expected value is precisely A
● An Estimator is biased for a parameter A if its expected value differs from A
● Unbiasedness is usually a desirable property
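A small worked example may make bias concrete. Assuming a population of three pool balls {1, 2, 3} and samples of size 2 drawn with replacement, the expected value of a variance estimator can be computed exactly by averaging it over all nine possible samples. Dividing by n turns out to be biased, dividing by n - 1 does not. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
mu = Fraction(sum(population), len(population))
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / divisor

samples = list(product(population, repeat=2))  # every sample of size 2, with replacement

def expected_value(divisor):
    # Average of the estimator over all equally likely samples.
    return sum(sample_variance(s, divisor) for s in samples) / len(samples)

bias_n = expected_value(2) - true_variance          # divide by n
bias_n_minus_1 = expected_value(1) - true_variance  # divide by n - 1
print("bias when dividing by n:    ", bias_n)
print("bias when dividing by n - 1:", bias_n_minus_1)
```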

● The Central Limit Theorem (CLT) states that the sampling distribution of the mean will be NORMAL or nearly Normal, if the sample size is large enough
● Normal distribution is useful for modeling
● CLT allows one to make assumptions about the population
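The CLT can be checked informally by simulation. The sketch below draws repeated samples from a uniform population, which is clearly not normal, and looks at the distribution of the sample means. The sample size, number of samples, and seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30
N_SAMPLES = 2000

# Population: uniform on [0, 1), i.e. not normal (mean 0.5, SD about 0.289).
sample_means = [
    statistics.mean(random.random() for _ in range(SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / N_SAMPLES

print(f"mean of sample means: {center:.3f}")
print(f"SD of sample means:   {spread:.3f} (theory: {0.2887 / SAMPLE_SIZE ** 0.5:.3f})")
print(f"share within 1 SD:    {within_one_sd:.2f} (normal: about 0.68)")
```

The sample means cluster around the population mean, and roughly two thirds of them fall within one standard deviation, as a normal distribution would predict.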

● Standardized normal distribution, with an idealized Mean of 0 & Standard Deviation of 1
● It allows one to create a compact table for all normal distributions (widely used before computers became available)
● Score (raw score, datum): original datum (observation) that has not been transformed
● Z-Score (Standard Score): number of standard deviations a score is from the mean of the population
● Positive Z-Score: a datum above the mean
● Negative Z-Score: a datum below the mean
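The Z-Score definition translates directly into z = (score - mean) / SD. A minimal sketch, using a hypothetical score of 130 in a population with mean 100 and SD 15, with Python's `statistics.NormalDist` standing in for the printed table:

```python
from statistics import NormalDist

def z_score(raw_score, population_mean, population_sd):
    """Number of standard deviations the raw score lies from the population mean."""
    return (raw_score - population_mean) / population_sd

# Hypothetical example: a score of 130 in a population with mean 100 and SD 15.
z = z_score(130, 100, 15)
share_below = NormalDist().cdf(z)  # standard normal: mean 0, SD 1
print(f"z = {z:.1f}; about {share_below:.1%} of a normal population lies below this score")
```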

t DISTRIBUTION: probability distribution that is used to estimate normal population parameters when the sample size is small &/or when the population Standard Deviation is unknown
● Per CLT, the sampling distribution of the mean will follow a normal distribution (ND), as long as the sample size is sufficiently large
● Thus, when we know the standard deviation (SD) of the population, we can compute a z-Score, and use ND to evaluate probabilities with the sample mean
● In reality, sample sizes are sometimes small, and SDs of the population are unknown
● When either of these occurs, we have to rely on the distribution of the t statistic (also known as the t score)

● Indicates precision & accuracy of the Estimator
● Margin of Error: reflects Observational Error (Measurement Error): the difference between a measured value of a quantity & its true value

● A Confidence Interval (CI) gives an estimated range of values which is likely to include an unknown population parameter
● It is calculated from a sample dataset
● If independent samples are taken repeatedly from the same population, & a CI is calculated for each sample, then a certain percentage (Confidence Level) of the intervals will include the unknown population parameter
● CIs are usually calculated so that this percentage (CL) is 95% for the unknown parameter

CI Visualized (figure): 300 experiments, Sample size of 10, Mean: 50. CIs shown in:
o Yellow: 95% CI containing the mean
o Red: those that do not
o Blue: 99% CI containing the mean (50)
o White: those that do not
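A rough sketch of how such an interval can be computed for a mean: estimate ± z × SEM. This uses the normal approximation for simplicity, which is an assumption; for a sample as small as the hypothetical one below, an interval based on the t distribution would be wider and more appropriate.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, confidence_level=0.95):
    """Normal-approximation CI for the mean: mean +/- z * SEM."""
    sem = stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)  # about 1.96 for 95%
    center = mean(sample)
    return center - z * sem, center + z * sem

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
low95, high95 = confidence_interval(sample, 0.95)
low99, high99 = confidence_interval(sample, 0.99)
print(f"95% CI: {low95:.2f} to {high95:.2f}")
print(f"99% CI: {low99:.2f} to {high99:.2f} (wider, as expected)")
```

Raising the Confidence Level widens the interval; collecting more data narrows it.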

Width of CI indicates how uncertain we are about the unknown parameter
● Wide CI: less confident
● Narrow CI: more confident
● Wide CI -> more data should be collected before anything definite can be said about the parameter

Figure labels: Small n, Large n

CIs are more informative than hypothesis tests: they provide a range of plausible values for the unknown parameter
CIs are underrated and underused in research

Most Medical Journals prefer to rely on p-values instead

HYPOTHESIS TESTING (HT):
● Consideration if chance is a plausible explanation of findings
● It uses the data to decide if the Null Hypothesis can be rejected
● Null Hypothesis H0: there is no relationship between measured phenomena
● Alternative Hypothesis H1: there is a relationship between measured phenomena. This is typically the hypothesis of interest to the researcher

The decision in HT may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is actually true:
● TYPE I error is rejecting the Null Hypothesis H0 when H0 is true
● TYPE II error is failing to reject the Null Hypothesis H0 when the Alternative Hypothesis H1 is true

TYPE I ERROR (TIE): rejecting the Null Hypothesis H0 when it is true
● It asserts something that is absent, a false positive
● The rate of TIE is called the Size of a test (α)
● It usually equals the Significance Level of a test (α)
● If H0 is simple, α is the probability of TIE
● If H0 is composite, α is the maximum of the possible probabilities of TIE
● The rate of TYPE II ERROR (TIIE) is called the false negative rate (β)

Significance testing involves calculating the probability that a statistic would differ as much or more from the parameter specified in the H0 as does the statistic obtained in the experiment
● One-tailed probability: probability computed considering differences in only one direction, such as the statistic being larger than the parameter
● Two-tailed probability: probability computed considering differences in both directions (statistic either larger or smaller than the parameter)
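The difference between the two probabilities is easy to see numerically. The sketch below assumes the test statistic is a standard normal z (a simplification) and uses a hypothetical observed value of 1.8.

```python
from statistics import NormalDist

def tail_probabilities(z):
    """Return (one-tailed, two-tailed) probabilities for a standard normal statistic z."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return upper_tail, 2 * upper_tail

one_tailed, two_tailed = tail_probabilities(1.8)  # hypothetical observed z
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

For this value the one-tailed probability is below 0.05 while the two-tailed probability is above it, which is why the choice of tails has to be made before looking at the data.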

1. Specify H0 Hypothesis
2. Select Significance Level (α)
3. Compute p-value
4. Compare p-value with α

● In significance testing, H0 is typically the hypothesis that a parameter is zero or that a difference between parameters is zero
● E.g.: the null hypothesis might be that the difference between population means is 0

● The Significance Level (α) is the highest value of a probability value for which H0 is rejected
● Common Significance Levels α are: 0.05 & 0.01
● For α=0.05: H0 is rejected if the probability value p ≤ 0.05

● Probability value (p value) is the probability of obtaining a statistic as different or more different from the parameter specified in the H0 as the statistic obtained in the experiment
● The p value is computed assuming H0 is true
● The lower the p value, the stronger the evidence that H0 is false
● Traditionally, H0 is rejected if p ≤ 0.05

● Final step: comparison of p value with Significance Level α
● If p ≤ α -> H0 is rejected
● Rejecting H0 is not an all-or-none decision
● The lower the p value -> the more confidence that H0 is false
● If p > α -> findings are inconclusive
● Failure to reject the H0 does not constitute support for H0
● It just means: the data are not sufficiently strong to reject it
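The four steps of significance testing can be walked through with a toy example. The sketch assumes a coin-tossing experiment (9 heads in 10 tosses, hypothetical data) and an exact two-tailed binomial test of a fair coin; it is an illustration of the procedure, not a recommendation of this particular test.

```python
from math import comb

def two_tailed_binomial_p(successes, trials):
    """Exact p-value for H0: p = 0.5, counting outcomes at least as far from trials/2."""
    observed_distance = abs(successes - trials / 2)
    extreme = [k for k in range(trials + 1) if abs(k - trials / 2) >= observed_distance]
    return sum(comb(trials, k) for k in extreme) / 2 ** trials

# 1. H0: the coin is fair (p = 0.5). Hypothetical data: 9 heads in 10 tosses.
alpha = 0.05                             # 2. significance level
p_value = two_tailed_binomial_p(9, 10)   # 3. p-value, computed assuming H0 is true
reject_h0 = p_value <= alpha             # 4. compare p-value with alpha
print(f"p = {p_value:.4f}; reject H0: {reject_h0}")
```

With 7 heads instead of 9 the same procedure would fail to reject H0, which would leave the question open rather than support the claim that the coin is fair.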

STATISTICAL POWER (SP): the probability that the test will reject H0 when H1 is true, i.e. the probability of not committing Type II Error
Power of a Test = 1−β (β is the rate of TIIE)
● SP (aka Sensitivity) is a function of the possible distributions, determined by a parameter, under the H1
● As SP increases, the chances of TIIE decrease
● Power analysis can be used to calculate the minimum:
o sample size required to detect an effect of a given size
o effect size to be detected using a given sample size
● SP is used to make comparisons between different statistical tests (e.g.: between a parametric and a nonparametric test of the same hypothesis)
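A power analysis can be sketched for the simplest case. The code below assumes a one-sided z test with a known population SD, which is a textbook simplification chosen for illustration; the effect size, SD, and sample size are hypothetical, and real studies typically use dedicated software and the test actually planned.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z test (known sigma) to detect a true mean shift `effect`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - effect * math.sqrt(n) / sigma)

def min_sample_size(effect, sigma, power=0.80, alpha=0.05):
    """Smallest n reaching the requested power for the same one-sided z test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / effect) ** 2)

# Hypothetical numbers: true shift of 5 units, sigma = 15.
print(f"power with n=36: {power_one_sided_z(5, 15, 36):.3f}")
print(f"n needed for 80% power: {min_sample_size(5, 15)}")
```

Here β is simply 1 minus the printed power, so increasing n lowers the chance of a Type II Error.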

● In a research paper the focus is on the Results
● Statistical Tests are mentioned briefly in Materials & Methods
● This brief mention may indicate if the paper contains valid research or not.

● Have the tests been properly chosen?
● Have the results been interpreted in the context of tests’ capabilities?
● When nonstandard, complex tests are used: Is their use justified?

| ANALYSIS | PARAMETRIC Test | NONPARAMETRIC Test | Example |
|---|---|---|---|
| Compare means between 2 groups | Two-sample t-test | Wilcoxon rank-sum test | Is the mean systolic blood pressure (at baseline) for patients assigned to placebo different from the mean for patients assigned to the Tx group? |
| Compare 2 quantitative data from one subject | Paired t-test | Wilcoxon signed-rank test | Was there a significant change in systolic blood pressure between baseline and 6-month follow-up in the Tx group? |
| Compare means between 3 or more groups | Analysis of variance (ANOVA) | Kruskal-Wallis | If our experiment had 3 groups & we want to know whether the mean systolic blood pressure at baseline differed among them? |
| Estimate the degree of association between 2 quantitative variables | Pearson coefficient of correlation | Spearman’s rank correlation | Is systolic blood pressure associated with the Pt’s age? |

Source: Hoskin T. Parametric and Nonparametric: Demystifying the Terms. Mayo Clinics

t-TEST

TYPE
o Parametric

APPLICATION
o Evaluates if the means of two groups are statistically different from each other

CORRECT USE
o Especially appropriate for the posttest-only two-group randomized experimental design

ASSUMPTIONS
o Population from which the sample is taken has normal distribution
o The variances of the populations to be compared are equal

PROCEDURE
o Examines the difference between means relative to variability of their scores

FORMULA
o t = “signal to noise” ratio
o Top part: difference between two means (signal)
o Bottom part is a measure of scores variability (noise)
o Calculated t is determined as significant or not by using tables
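The "signal to noise" ratio can be sketched as follows. The standard error used for the noise term, sqrt(var_a/n_a + var_b/n_b), is an assumption about the intended formula; a pooled-variance version is another common form, and the two agree when group sizes are equal. The scores are hypothetical.

```python
import math
import statistics

def two_sample_t(group_a, group_b):
    """t = (difference between means) / (standard error of that difference)."""
    signal = statistics.mean(group_a) - statistics.mean(group_b)
    noise = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return signal / noise

placebo = [1, 2, 3, 4, 5]      # hypothetical scores
treatment = [2, 4, 6, 8, 10]   # hypothetical scores
t = two_sample_t(placebo, treatment)
print(f"t = {t:.3f}")  # compare against a t table (or software) for significance
```

The sign of t only reflects which group was listed first; its magnitude is what gets compared with the table value.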

Trochim WV. The Research Methods Knowledge Base, 2nd Edition. 2006

ANOVA

TYPE
o Parametric & Non-parametric

APPLICATION
o Evaluates if the means of more than 2 groups are statistically different

CORRECT USE of ANOVA TYPES
o One-way: difference between two or more groups with one grouping method
o One-way repeated measures: when repeated measures are done in one group
o Two-way: difference between groups with complex grouping
o Two-way repeated measures: for repeated measures structure with an interaction effect

ASSUMPTIONS
o The expected values of the errors are zero
o The variances of all errors are equal to each other
o The errors are independent from one another & normally distributed

PROCEDURE
o The mean is calculated for each group
o The overall mean is then calculated for all of the groups combined
o Within each group, the total deviation of each individual’s score from the group mean is calculated. This is called within group variation
o Next, the deviation of each group mean from the overall mean is calculated. This is called between group variation
o Finally: F statistic is calculated

FORMULA
o F statistic: the ratio of between group variation to within group variation
o If the between group variation is significantly greater than the within group variation, it is likely that there is a statistically significant difference between the groups. The statistical software determines if the F statistic is significant or not.
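The procedure above can be followed step by step for a one-way ANOVA. In this sketch each variation is divided by its degrees of freedom before taking the ratio (the usual mean-square form of F); the group scores are hypothetical, and judging significance is still left to tables or statistical software.

```python
import statistics

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)            # overall mean
    group_means = [statistics.mean(g) for g in groups]  # mean of each group

    # Between group variation: group means vs the overall mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within group variation: individual scores vs their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical scores for 3 groups
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```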

● In general, this step entails presenting the graphical and numerical results & inferred conclusions from other steps of Data Analysis in an accurate and concise form
● Specifically, it involves presentation of an Abstract (in the form of a Poster or Oral Presentation) followed by preparation & publication of the Manuscript


As discussed previously:
● There is no single “best statistical manual”
● A set of personalized references has to be assembled
● Electronic texts that contain hyperlinks or allow for instant Web Searches for terms are best suited for studying Statistics
The following suggestions are examples of texts taken from the large pool of study materials

Harvey Motulsky: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking, 3rd Ed, 2013
o Non-mathematical Approach to Statistics by a Physician-Statistician

Betty Kirkwood: Essentials of Medical Statistics 2nd Ed, 2003
o Classic Statistical Manual; New Edition is being prepared

Yosef Dlugacz: Measuring Health Care: Using Quality Data for Operational, Financial, and Clinical Improvement. 1st Ed, 2006
o Deals with application of Statistics in Medical Business

Yasar A. Ozcan: Quantitative Methods in Health Care Management: Techniques and Applications 1st Ed, 2009
o Deals with application of Statistics in Medical Business

Android
● Statistics Quick Reference by Nuzzed 2013

● Statistics Tutor by Statistics Research 2014

● Learn Statistics by Miaoshuang Dong 2013 ● Statistics Video Lectures by Khan Academy 2013

Windows 8
● Statistics Formulas by Hexxa 2013
● Statistics and Probability by SimpleNEasy 2013

● Online Statistics Education: A Multimedia Course of Study. Rice University
o http://www.onlinestatbook.com/2/index.html

● University of Oxford. Introduction to Statistics for Medical Students
o http://www.well.ox.ac.uk/~kanishka/Lectures/MSTC_researchers/Notes/notes%20for%20medical%20students.pdf

● G Singh. Medical Science without Statistics. The Internet Journal of

● Altman DG. The scandal of poor medical research. BMJ 1994; 308:283
● Ghami N. Good Clinical Care Requires Understanding of Statistics. Psychiatric Times. March 6, 2009
● Bennette C, Vickers A. Against quantiles: categorization of continuous variables in epidemiologic research, and its discontents. BMC Med Res Methodol. 2012;12:21
● Taleb N. The Black Swan: “The Impact of the Highly Improbable” & “On Robustness and Fragility” 2nd Ed, 2010

● Statistics plays a pivotal role in the Science & Business of Medicine
● Statistics can be abused & misused
● Statistical illiteracy & innumeracy can be extremely detrimental for physicians
● Study of Statistics is challenging but unavoidable
● Statistics is based on several, sometimes counterintuitive, principles & conventions
● Those axioms have to be mastered first

● Statistics reflects acts of interpretation, not absolute facts
● Therefore, it is based on numerous assumptions
● Understanding of statistical assumptive models is critical for appraisal of Statistical Analyses
● Statistics continues to evolve as a methodology
● Reliance on Parametric Statistics & the validity of p-value based Hypothesis Testing have recently been questioned

Author wishes to thank: Stephen DeCherney, MD, MPH for his valuable comments.

Nothing to disclose: there are no known conflicts of interest associated with this presentation. Specifically, neither the author nor his family has any potential conflicts of interest, financial or otherwise, regarding any of the products and/or services discussed here.