Basic Statistics

The following lecture has been approved for
University Undergraduate Students

This lecture may contain information, ideas, concepts and discursive anecdotes that may be thought provoking and challenging. It is not intended for the content or delivery to cause offence. Any issues raised in the lecture may require the viewer to engage in further thought, insight, reflection or critical evaluation.
Background to Statistics for non-statisticians
Craig Jackson Prof. Occupational Health Psychology Faculty of Education, Law & Social Sciences BCU
craig.jackson@bcu.ac.uk
Keep it simple
Some people hate the very name of statistics but.....their power of dealing with complicated phenomena is extraordinary. They are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the science of man.
Sir Francis Galton, 1889
How Many Make a Sample?

8 out of 10 owners who expressed a preference, said their cats preferred it.
How confident can we be about such statistics? 8 out of 10? 80 out of 100? 800 out of 1000? 80,000 out of 100,000?
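One way to put a number on "how confident" is a confidence interval for the proportion. The sketch below is not from the lecture: it assumes a 95% Wilson score interval (z = 1.96) and simply shows that the same 80% becomes more trustworthy as the sample grows.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (assumed method)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# The four samples quoted on the slide: always 80%, but very different sizes.
for successes, n in [(8, 10), (80, 100), (800, 1000), (80_000, 100_000)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes} out of {n}: {low:.3f} to {high:.3f}")
```

The interval narrows at every step, which is the point of the question: 8 out of 10 is compatible with a true preference anywhere from roughly a half to over 90%.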
Types of Data / Variables

Continuous: BP, Height, Weight, Age
Discrete: Children, Age last birthday, Colds in last year
Ordinal: Grade of condition, Positions (1st, 2nd, 3rd), Better-Same-Worse, Height groups
Nominal: Sex, Hair colour, Blood group, Eye colour
Conversion & Re-classification

Easier to summarise Ordinal / Nominal data
Cut-off Points (who decides this?)
Allows Continuous variables to be changed into Nominal variables
BP > 90mmHg = Hypertensive
BP =< 90mmHg = Normotensive
Easier clinical decisions
Categorisation reduces quality of data
Statistical tests may be more sensational
BMI: Obese vs Underweight
Good for summaries
Bad for accuracy
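A minimal sketch of the re-classification, assuming the 90 mmHg cut-off from the slide. The readings and the function name `classify_bp` are hypothetical, chosen only to show what is lost when a continuous value becomes a label.

```python
CUTOFF_MMHG = 90  # cut-off point taken from the slide

def classify_bp(bp_mmhg, cutoff=CUTOFF_MMHG):
    """Turn a continuous BP reading into a nominal category."""
    return "Hypertensive" if bp_mmhg > cutoff else "Normotensive"

readings = [72, 89, 90, 91, 112]  # hypothetical readings in mmHg
labels = [classify_bp(bp) for bp in readings]

for bp, label in zip(readings, labels):
    print(bp, label)
```

Readings of 72 and 90 end up with the same label, while 90 and 91 are split apart: good for summaries, bad for accuracy.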
Types of statistics / analyses

DESCRIPTIVE STATISTICS
Describing a phenomenon
Frequencies: How many
Basic measurements: Meters, seconds, cm3, IQ

INFERENTIAL STATISTICS
Inferences about phenomena
Hypothesis Testing: Proving or disproving theories
Confidence Intervals: If sample relates to the larger population
Correlation: Associations between phenomena
Multiple Measurement
or... why statisticians and love don't mix
[Figure: four repeated counts of 25, 22, 24 and 21 cells, plotted on a scale from 20 to 26]
Total = 92 cells
Mean = 23 cells
Small samples spoil research

N    1    2    3    4    5    6    7    8    9    10   Total  Mean   SD
Age  20   20   20   20   20   20   20   20   20   20   200    20     0
IQ   100  100  100  100  100  100  100  100  100  100  1000   100    0

N    1    2    3    4    5    6    7    8    9    10   Total  Mean   SD
Age  18   20   22   24   26   21   19   25   20   21   216    21.6   4.2
IQ   100  110  119  101  105  113  120  119  114  101  1102   110.2  19.2

N    1    2    3    4    5    6    7    8    9    10   Total  Mean   SD
Age  18   20   22   24   26   21   19   25   20   45   240    24     8.5
IQ   100  110  119  101  105  113  120  119  114  156  1157   115.7  30.2
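The second and third tables differ only in participant 10 (age 45 and IQ 156 instead of age 21 and IQ 101). The sketch below recomputes the summaries with Python's `statistics` module to show how much one person moves a sample of ten. It assumes the sample (n - 1) formula for the SD, which may not be how the slide's SD column was produced, so the SDs it prints need not match the slide.

```python
import statistics

# Ages and IQs copied from the second and third tables on the slide.
ages_typical = [18, 20, 22, 24, 26, 21, 19, 25, 20, 21]
ages_outlier = [18, 20, 22, 24, 26, 21, 19, 25, 20, 45]
iq_typical = [100, 110, 119, 101, 105, 113, 120, 119, 114, 101]
iq_outlier = [100, 110, 119, 101, 105, 113, 120, 119, 114, 156]

def summarise(values):
    """Return (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)

for name, values in [
    ("Age, typical sample", ages_typical),
    ("Age, one outlier", ages_outlier),
    ("IQ, typical sample", iq_typical),
    ("IQ, one outlier", iq_outlier),
]:
    mean, sd = summarise(values)
    print(f"{name}: mean = {mean:.1f}, SD = {sd:.1f}")
```

With only ten participants, a single unusual person shifts the mean and inflates the SD; in a sample of several hundred the same person would barely register.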
Central Tendency
[Figure: bar chart of Frequency against Patient comfort rating (1 to 10), with the Mode, Median and Mean marked. Frequencies shown on the bars: 31, 27, 70, 121, 140, 129, 128, 90, 80, 62]
Dispersion
Range: Spread of data
Mean: Arithmetic average
Median: Location
Mode: Frequency
SD: Spread of data about the mean

Range 50-112 mmHg, Mean 82 mmHg, Median 82 mmHg, Mode 82 mmHg, SD 10 mmHg
Dispersion
An individual score therefore possesses a deviation (away from the mean), which can be positive or negative depending on which side of the mean the score is.
If we add the positive and negative deviations together, the sum is zero (the positives and negatives cancel out).
[Figure: distribution with the central value (mean) marked, negative deviations on one side and positive deviations on the other]
Dispersion
Range
The interval between the highest and lowest measures
Limited value as it involves the two most extreme (likely faulty) measures

Percentile
The value below / above which a particular percentage of values fall (median is the 50th percentile)
e.g. 5th percentile - 5% of values fall below it, 95% of values fall above it
A series of percentiles (1st, 5th, 25th, 50th, 75th, 95th, 99th) gives a good general idea of the scatter and shape of the data

[Figure: distribution of heights from 5'6" to 6'4", with the Range and the 1st, 5th, 25th, 50th, 75th, 95th and 99th percentiles marked]
Standard Deviation
To get around this, we square each of the deviations
Makes all the values positive (a minus times a minus...)
Then sum all those squared deviations to calculate their mean
This gives the variance - where every deviation is squared
Need to take the square root of the variance to get the standard deviation
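The steps above can be followed line by line in a short sketch. The scores are hypothetical, and the `sample` switch is an assumption added here: dividing by n - 1 gives the usual sample SD, dividing by n gives the population SD.

```python
import math

def standard_deviation(values, sample=True):
    """Return (mean, deviations, variance, SD), following the steps on the slide."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]        # positive and negative
    squared = [d ** 2 for d in deviations]         # squaring makes them all positive
    variance = sum(squared) / (n - 1 if sample else n)
    return mean, deviations, variance, math.sqrt(variance)

scores = [4, 6, 8, 10, 12]  # hypothetical scores
mean, deviations, variance, sd = standard_deviation(scores)

print("mean:", mean)
print("deviations:", deviations, "sum:", sum(deviations))
print("variance:", variance)
print("SD:", round(sd, 2))
```

The raw deviations sum to zero, which is why they cannot be averaged directly and have to be squared first.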
Grouped Data
Normal Distribution
SD is useful because of the shape of many distributions of data.
Symmetrical, bell-shaped / normal / Gaussian distribution

Non-Normal Distribution
Some distributions fail to be symmetrical
If the tail on the left is longer than the right, the distribution is negatively skewed (to the left)
If the tail on the right is longer than the left, the distribution is positively skewed (to the right)
Normal Distributions
Standard Normal Distribution has a mean of 0 and a standard deviation of 1
The total area under the curve amounts to 100% / unity of the observations
[Figure: normal curve with the central value (mean) at 0 SD and the axis marked at 1 SD, 2 SD and 3 SD on either side]
Proportions of observations within any given range can be obtained from the distribution by using statistical tables of the standard normal distribution
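Instead of printed tables, the same proportions can be read from the standard normal distribution in software. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `proportion_within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)  # mean 0, SD 1

def proportion_within(k):
    """Proportion of observations within k SD either side of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {proportion_within(k):.1%}")
```

Roughly two thirds of observations lie within 1 SD of the mean, about 95% within 2 SD, and nearly all within 3 SD.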
Quincunx machine 1877

Balls dropped through a succession of metal pins... a normal distribution of balls
We do not have a normal distribution here. Why?
Normal & Non-normal distributions
The distribution derived from the quincunx is not perfect
It was only made from 18 balls
Distributions
Sir Francis Galton (1822-1911)
Alumnus of Birmingham University
9 books and > 200 papers
Fingerprints, correlation of calculus, twins, neuropsychology, blood transfusions, travel in undeveloped countries, criminality and meteorology
Deeply concerned with improving standards of measurement
[Figure: % of population plotted against height, from 5'6" to 6'4"]
Normal & Non-normal distributions

Galton's quincunx machine ran with hundreds of balls, giving a more perfectly shaped normal distribution.
Obvious implications for the size of samples of populations used
The more lead shot runs through the quincunx machine, the smoother the distribution
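A quincunx is easy to imitate in a few lines. The sketch below is a simulation under stated assumptions, not a model of Galton's actual machine: 12 rows of pins, an even chance of bouncing left or right at each pin, and a fixed random seed so that the run is repeatable.

```python
import random
from collections import Counter

def run_quincunx(n_balls, n_rows=12, seed=0):
    """Drop n_balls through n_rows of pins; return how many land in each slot."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(n_balls):
        # Each pin sends the ball left (0) or right (1) with equal chance.
        slot = sum(rng.choice((0, 1)) for _ in range(n_rows))
        bins[slot] += 1
    return bins

def show(bins, n_rows=12, width=40):
    """Print a sideways bar chart of the slots."""
    peak = max(bins.values())
    for slot in range(n_rows + 1):
        print(f"{slot:2d} " + "#" * round(width * bins[slot] / peak))

few = run_quincunx(18)      # as few balls as on the earlier slide
many = run_quincunx(5000)   # "hundreds of balls" and more

show(few)
print()
show(many)
```

With 18 balls the chart is ragged; with thousands it settles into the familiar bell shape, which is the argument for larger samples.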
Presentation of data
Table of means
            Exposed (n=197)   Controls (n=178)   T      P
Age (yrs)   45.5 (± 9.4)      48.9 (± 7.3)       2.19   0.07
I.Q.        105 (± 10.8)      99 (± 8.7)         1.78   0.12
Speed (ms)  115.1 (± 13.4)    94.7 (± 12.4)      3.76   0.04
Presentation of data
Category tables
          Exposed   Controls   Total
Healthy   50        150        200
Unwell    147       28         175
Total     197       178        375
Chi square (test of association) shows: Chi square = 7.2 P = 0.02
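For readers who want to see where a chi square value comes from, here is a minimal sketch of the Pearson chi square test for a 2 x 2 table, applied to the counts in the category table (Healthy 50 / 150, Unwell 147 / 28). It assumes no continuity correction and 1 degree of freedom. For these counts it prints a statistic far larger than the 7.2 shown on the slide, so the slide's figures are perhaps best read as illustrative.

```python
import math

def chi_square_2x2(table):
    """Pearson chi square (no continuity correction) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def p_value_1df(chi2):
    """Upper-tail probability of a chi square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Rows: Healthy, Unwell. Columns: Exposed, Controls.
counts = [[50, 150], [147, 28]]
chi2 = chi_square_2x2(counts)
print(f"Chi square = {chi2:.1f}, P = {p_value_1df(chi2):.3g}")
```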
Bar Charts
A set of measurements can be presented either as a table or as a figure
Graphs are not always as accurate as tables, but portray trends more easily
[Figure: anatomy of a bar chart, labelling the title of graph, legend key, y-axis label (ordinate), y-axis scale, data display area, x-axis (abscissa) and groups]
Bar Charts
Some Real Data
A combination of distributions is acceptable to facilitate comparisons
[Figure: bar chart of Votes (0 to 7000) against User rating (1 to 10) for "Vacation" and "Empire". Caption: Movie goers' ratings for both movies]
Correlation and Association

With a scatter diagram, each individual observation becomes a point on the scatter plot, based on two co-ordinates, measured on the abscissa and the ordinate
Two perpendicular lines are drawn through the medians - dividing the plot into quadrants
When the two variables are unrelated, each quadrant should contain about 25% of all observations
[Figure: scatter plot with the ordinate and abscissa labelled]

Correlation is a numerical expression between 1 and -1 (extending through all points in between). Properly called the Correlation Coefficient. A decimal measure of association (not necessarily causation) between variables

Correlation of 1: Maximal - any value of one variable precisely determines the other. Perfect +ve
Correlation of 0: No linear relationship between the variables. Nothing
Correlation of -1: Any value of one variable precisely determines the other, but in an opposite direction to a correlation of 1. As one value increases, the other decreases. Perfect -ve
Correlation of 0.5: A moderate relationship between the variables, i.e. a quarter of the variation in one variable (0.5 squared = 0.25) can be predicted from the other; the rest can't. Medium +ve

Correlations between 0 and 0.3 are weak
Correlations between 0.4 and 0.7 are moderate
Correlations between 0.8 and 1 are strong
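The correlation coefficient can be computed directly from paired observations. This is a sketch of Pearson's r with hypothetical data. The slide's bands leave gaps (between 0.3 and 0.4, and between 0.7 and 0.8), so the cut-offs of 0.4 and 0.8 used in `describe` are an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe(r):
    """Rough verbal label; boundaries at 0.4 and 0.8 are assumed."""
    size = abs(r)
    if size < 0.4:
        return "weak"
    if size < 0.8:
        return "moderate"
    return "strong"

x = [1, 2, 3, 4, 5]             # hypothetical measurements
y_up = [2, 4, 6, 8, 10]         # rises exactly with x
y_down = [10, 8, 6, 4, 2]       # falls exactly as x rises

for label, ys in [("rising", y_up), ("falling", y_down)]:
    r = pearson_r(x, ys)
    print(f"{label}: r = {r:.2f} ({describe(r)})")
```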

How can the above variables be correlated?
Sampling Keywords
POPULATIONS: Can be mundane or extraordinary
SAMPLE: Must be representative
INTERNAL VALIDITY OF SAMPLE: Sometimes validity is more important than generalizability
SELECTION PROCEDURES: Random, Opportunistic, Conscriptive, Quota
Sampling Keywords
THEORETICAL: Developing, exploring, and testing ideas
EMPIRICAL: Based on observations and measurements of reality
NOMOTHETIC: Rules pertaining to the general case (nomos - Greek)
PROBABILISTIC: Based on probabilities
CAUSAL: How causes (treatments) affect the outcomes
Clinical Research
Types of clinical research
Experimental vs. Observational
Longitudinal vs. Cross-sectional
Prospective vs. Retrospective

Experimental, Longitudinal, Prospective: Randomised Controlled Trial
Observational, Longitudinal, Prospective: Cohort studies
Observational, Longitudinal, Retrospective: Case control studies
Observational, Cross-sectional: Survey
Experimental Designs
Between subjects studies
patients -> Treatment group -> Outcome measured
patients -> Control group -> Outcome measured
Within Subjects studies
patients -> Outcome measured #1 -> Treatment -> Outcome measured #2
Observational studies
Cohort (prospective)
cohort -> prospectively measure risk factors -> end point measured (cases)
aetiology, prevalence, development, odds ratios
Case-Control (retrospective)
start point measured -> retrospectively measure risk factors
aetiology, odds ratios, prevalence, development
Case-Control Study Smoking & Cancer

Cases have Lung Cancer
Controls could be other hospital patients (other disease) or normals
Matched Cases & Controls for age & gender
Option of 2 Controls per Case

Smoking years of Lung Cancer cases and controls (matched for age and sex)
                Cases (n=456)   Controls (n=456)   F     P
Smoking years   13.75 (± 1.5)   6.12 (± 2.1)       7.5   0.04
Cohort Study: Methods

Volunteers in 2 groups e.g. exposed vs non-exposed
All complete health survey every 12 months
End point at 5 years: groups compared for Health Status

Comparison of general health between users and non-users of mobile phones
          mobile phone user   non-phone user   Total
ill       292                 89               381
healthy   108                 313              421
Total     400                 402              802
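A cohort table like the one above is usually summarised as a risk ratio or an odds ratio. The slide does not say which measure was used, so the sketch below is an assumption: it computes both from the four cell counts (292 and 108 for mobile phone users, 89 and 313 for non-phone users), with hypothetical function names.

```python
def risk_ratio(ill_exposed, healthy_exposed, ill_unexposed, healthy_unexposed):
    """Risk of illness in the exposed group divided by risk in the unexposed group."""
    risk_exposed = ill_exposed / (ill_exposed + healthy_exposed)
    risk_unexposed = ill_unexposed / (ill_unexposed + healthy_unexposed)
    return risk_exposed / risk_unexposed

def odds_ratio(ill_exposed, healthy_exposed, ill_unexposed, healthy_unexposed):
    """Odds of illness in the exposed group divided by odds in the unexposed group."""
    return (ill_exposed * healthy_unexposed) / (healthy_exposed * ill_unexposed)

# Counts from the slide: mobile phone users first, then non-phone users.
cells = (292, 108, 89, 313)

print(f"Risk ratio: {risk_ratio(*cells):.2f}")
print(f"Odds ratio: {odds_ratio(*cells):.2f}")
```

The two measures answer slightly different questions and diverge when the outcome is common, as it is here, so it matters which one a paper reports.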
Randomized Controlled Trials in GP & Primary Care

90% of consultations take place in GP surgery
50 years old
Potential problems
2 Key areas: Recruitment Bias, Randomisation Bias
Over-focus on failings of RCTs
RCT Deficiencies
Trials too small
Trials too short
Poor quality
Poorly presented
Address wrong question
Methodological inadequacies
Inadequate measures of quality of life (changing)
Cost-data poorly presented
Ethical neglect
Patients given limited understanding
Poor trial management
Politics
Marketeering
Why still the dominant model?
Quantitative Data Summary

What data is needed to answer the larger-scale research question?
Combination of quantitative and qualitative?
Cleaning, re-scoring, re-scaling, or re-formatting
Measurement of both IVs and DVs is complex but can be simplified
Binary measurement makes analysis easier but less meaningful
Binary data needs clear parameters e.g. exposed vs controls
Quantitative Data Summary

Continuous & Discrete data can also be converted into Binary data
Normal distribution of participants / data points desirable
Means - age, height, weight, BMI, IQ, attitudes
Frequencies / Classifications - job type, sick vs. healthy, dead vs. alive
Means must be followed by Standard Deviation (SD or ±)
Presentation of data must enhance understanding
If you or anyone you know has been affected by any of the issues covered in this lecture, you may need a statistician's help:
www.statistics.gov.uk
Further Reading
Abbott, P., & Sapsford, R.J. (1988). Research methods for nurses and the caring professions. Buckingham: Open University Press.
Altman, D.G. (1991). Designing Research. In D.G. Altman (ed.), Practical Statistics For Medical Research (pp. 74-106). London: Chapman and Hall.
Bland, M. (1995). The design of experiments. In M. Bland (ed.), An introduction to medical statistics (pp525). Oxford: Oxford Medical Publications.
Bowling, A. (1994). Measuring Health. Milton Keynes: Open University Press.
Daly, L.E., & Bourke, G.J. (2000). Epidemiological and clinical research methods. In L.E. Daly & G.J. Bourke (eds.), Interpretation and uses of medical statistics
Jackson, C.A. (2002). Planning Health and Safety Research Projects. Health and Safety at Work Special Report 62, (pp. 1-16).
Jackson, C.A. (2003). Analyzing Statistical Data in Occupational Health Research. Management of Health Risks Special Report 81, (pp. 2-8).
Kumar, R. (1999). Research Methodology: a step by step guide for beginners. London: Sage.
Polit, D., & Hungler, B. (2003). Nursing research: Principles and methods (7th ed.). Philadelphia: Lippincott, Williams & Wilkins.
