This action might not be possible to undo. Are you sure you want to continue?

# Statistical Methods for

Anaesthesia and

I ntensive Care

Paul S Myles MB BS MPH MD FCARCSI FANZCA

Head of Research and Specialist Anaesthetist

Department of Anaesthesia and Pain Management

Alfred Hospital, Victoria

Associate Professor

Departments of Anaesthesia, and Epidemiology and Preventive

Medicine

Monash University

Melbourne, Australia

and

Tony Gin MB ChB BSc MD DipHSM FRCA FANZCA

Chairman and Chief of Service

Department of Anaesthesia and Intensive Care

Chinese University of Hong Kong

Prince of Wales Hospital

Shatin, Hong Kong

Butterworth-Heinemann

Linacre House, Jordan Hill, Oxford OX2 8DP

225 Wildwood Avenue, Woburn, MA 01801-2041

A division of Reed Educational and Professional Publishing Ltd

A member of the Reed Elsevier Group

First published 2000

© Reed Educational and Professional Publishing Ltd 2000

All rights reserved. No part of this publication may be reproduced in

any material form (including photocopying or storing in any medium by

electronic means and whether or not transiently or incidentally to some

other use of this publication) without the written permission of the

copyright holder except in accordance with the provisions of the Copyright,

Designs and Patents Act 1988 or under the terms of a licence issued by the

Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London,

England WlP OLE Applications for the copyright holder's written

permission to reproduce any part of this publication should be addressed

to the publishers

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

Library of Congress Cataloguing in Publication Data

A catalogue record for this book is available from the Library of Congress

ISBN 0 7506 4065 0

Typeset by E & M Graphics, Midsomer Norton, Bath

Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn

Contents

About the authors

Foreword by Professor Teik E. Oh

Preface

Acknowledgements

1

Data types

Types of data

-categorical

-ordinal

-numerical

Visual analogue scale

2

Descriptive statistics

Measures of central tendency

-mode

-median

-mean

Degree of dispersion

-range

-percentiles

-variance

-standard deviation

-standard error

Confidence intervals

Frequency distributions

-normal

-binomial

-Poisson

Data transformation

Rates and proportions

Incidence and prevalence

Presentation of data

3

**Principles of probability and inference
**

Samples and populations

Inferential statistics

ix

xi

xiii

xvii

1

1

1

2

3

5

7

8

8

8

8

8

8

8

8

9

9

10

11

12

14

15

16

16

16

16

19

19

20

vi Contents

-definition of probability

21

-null hypothesis

21

-P value

22

-type I and type II error

22

Confidence intervals

23

Sample size and power calculations

24

Parametric and non-parametric tests

28

Permutation tests

29

Bayesian inference

30

4

Research design

33

Bias and confounding

33

-randomization and stratification

34

Types of research design

34

-observation vs. experimentation

34

-case reports and case series

35

-case-control study

36

-cohort study

37

-association vs. causation

39

-randomized controlled trial

39

-self-controlled and crossover trials

40

Randomization techniques

42

-block randomization

42

-stratification 42

-minimization 43

Blinding

43

Sequential analysis 44

Interim analysis

45

Data accuracy and data checking

46

Missing data 47

Intention to treat 47

5

Comparing groups: numerical data

51

Parametric tests

51

-Student's t-test

52

-analysis of variance (ANOVA) 58

-repeated measures ANOVA

60

Non-parametric tests

63

-Mann-Whitney U test (Wilcoxon rank sum test) 64

-Wilcoxon signed ranks test

65

-Kruskal-Wallis ANOVA

65

-Friedman two-way ANOVA

65

6 Comparing groups: categorical data

68

Chi-square

68

-Yates' correction

71

Fisher's exact test

71

The binomial test

72

McNemar's chi-square test

72

Risk ratio and odds ratio

74

Contents

vii

Number needed to treat

Mantel-Haenszel test

Kappa statistic

75

76

76

7

**Regression and correlation
**

Association vs. prediction

Assumptions

Correlation

Spearman rank correlation

Regression analysis

Non-linear regression

Multivariate regression

Mathematical coupling

Agreement

78

78

79

80

82

82

85

87

89

90

8

Predicting outcome: diagnostic tests or predictive equations

94

Sensitivity and specificity

94

Prior probability: incidence and prevalence

95

-positive and negative predictive value

95

Bayes' theorem

97

Receiver operating characteristic (ROC) curve

98

Predictive equations and risk scores

99

9

Survival analysis

What is survival analysis?

Kaplan-Meier estimate

Comparison of survival curves

-logrank test

-Cox proportional hazard model

The 'hazards' of survival analysis

105

105

106

107

108

110

110

11

Statistical errors in anaesthesia

122

Prevalence of statistical errors in anaesthesia journals

122

Ethical considerations

123

How to prevent errors

123

What are the common mistakes?

124

-no control group

125

-no randomization

126

-lack of blinding

126

-misleading analysis of baseline characteristics

126

-inadequate sample size

127

-multiple testing, subgroup analyses and interim analysis

127

-misuse of parametric tests

128

10

Large trials, meta-analysis, and evidence-based medicine

112

Efficacy vs. effectiveness

112

Why we need large randomized trials in anaesthesia

113

Meta-analysis and systematic reviews

114

Evidence-based medicine

116

Clinical practice guidelines

118

viii Contents

-misuse of Student's t-test

129

-repeat ('paired') testing

129

-misuse of chi-square - small numbers

130

-standard deviation vs. standard error

130

-misuse of correlation and simple linear regression analysis

130

-preoccupation with P values

131

-overvaluing diagnostic tests and predictive equations

131

A statistical checklist

133

12

How to design a clinical trial

135

Why should anaesthetists do research?

136

Setting up a clinical trial

136

Data and safety monitoring committee

137

Phase I-IV drug studies

137

Drug regulations

137

Role of the ethics committee (institutional review board)

139

Informed consent

140

Successful research funding

141

Submission for publication

142

13

Which statistical test to use: algorithms

145

Index

149

About the authors

Paul Myles is

Head of Research in the Department of Anaesthesia and

Pain Management at the Alfred Hospital, Melbourne. He has been a

specialist anaesthetist for ten years. He received his MPH (majoring in

advanced statistics and epidemiology) from Monash University in 1995

and his MD (Clinical Aspects of Cardiothoracic Anaesthesia) in 1996. He

has a joint university appointment (Monash University) as Associate

Professor in the Department of Anaesthesia, and Department of

Epidemiology and Preventive Medicine. He is Chairman of Alfred

Hospital Research Committee. He has published over 70 papers and has

received more than ten peer-reviewed research grants. He is a member of

three editorial boards

(Anaesthesia and Intensive Care, Asia-Pacific Heart

Journal, Journal of Cardiothoracic and Vascular Anaesthesia) and has

reviewed for four others

(British Journal of Anaesthesia, Anesthesiology,

Annals of Thoracic Surgery and

Medical Journal of Australia).

Tony Gin

is Professor, Chairman and Chief of Service of the Department

of Anaesthesia and Intensive Care, Chinese University of Hong Kong at

the Prince of Wales Hospital, Hong Kong. He was previously Professor of

the

Department of Anaesthesia, Christchurch School of Medicine,

University of Otago. He completed his BSc in statistics after finishing

medical school and has been lecturing and examining in pharmacology

and statistics in Asia and Australasia for over ten years. He has over 100

publications and has been a regular reviewer of research proposals, grant

applications and manuscripts for many ethics committees, funding

bodies and journals.

Paul Myles and Tony Gin

are members of the Australian and New

Zealand College of

Anaesthetists' Examinations and Research

Committees.

Foreword

An often puzzling, if not fearsome, aspect for an anaesthetist in training

or practice, is that of grappling with 'stats'. Of course, a basic under-

standing of statistics is necessary in our reading of a scientific paper or in

planning even the most modest research project. Books on statistics are

usually written by acclaimed statisticians, they are generally helpful only

to the extent that we can digest the heavy prose or the barrage of

unfamiliar terms and concepts. Thankfully, Statistical Methods for

Anaesthesia and Intensive Care will

make our lives easier. The authors of

this book are Tony Gin and Paul Myles, both accomplished researchers

and anaesthetists. The book is written by anaesthetists for anaesthetists

and clinicians, and it will lead us through the minefield of statistics that

we have to cross.

Professor Teik E. Oh

President, Australian and New Zealand College of Anaesthetists

Professor, Department of Anaesthesia, University of Western Australia

Preface

'Statistics' is the science of collecting, describing and analysing data that

are subject to randomvariation. It consists of two main areas: (i)

descriptive statistics, whereby a collection of data is summarized in

order to characterize features of its distribution, and (ii) inferential

statistics, whereby these summary data are processed in order to

estimate, or predict, characteristics of another (usually larger) group.

In most circumstances, the collection of data is froma restricted

number of observations (individuals, animals or any event subject to

variation); this chosen set of data is referred to as a sample and the

reference group fromwhich it is derived is referred to as a population. A

population does not necessarily include all individuals, but is most often

a defined group of interest, such as 'all adult surgical patients', 'all

women undergoing laparoscopic surgery' or 'patients admitted to

intensive care with a diagnosis of septic shock'.

So, for our purposes, statistics usually refers to the process of

measuring and analysing data froma sample, inorder to estimate certain

characteristics of a population. These estimates of a population are most

commonly an average value or proportion; and these estimates are

usually compared with those of another group to determine whether one

group differs significantly fromthe other. In order to be confident about

estimation of population parameters, we need to be sure that our sample

accurately represents our intended population. The statistical methods

outlined in this book are meant to optimize this process.

Why is an understanding of statistics important for anaesthetists and

intensivists? Advances in anaesthesia and intensive care rely upon

development of newdrugs, techniques and equipment. Evaluation and

clinical application of these advances relies critically upon statistical

methodology. However it is at this level that most anaesthetists and

intensivists lose interest or become sceptical - this can often be attributed

to the common belief that statistical analyses are misused or

misrepresented. This is actually a justification for having at least a basic

understanding of statistics. Accurate, reliable descriptive statistics and

the correct use of inferential statistics are essential for good-quality

research, and their understanding is crucial for evaluating reported

advances in our specialty. In 1982, Longnecker wrote: 'If valid data are

analyzed improperly, then the results become invalid and the conclusions

xiv Preface

may well be inappropriate. At best, the net effect is to waste time, effort,

and money for the project. At worst, therapeutic decisions may well be

based upon invalid conclusions and patients' wellbeing may be

jeopardized.'*

Medical statistics is not just the use of clever mathematical formulae,

but a collection of tools used to logically guide rational clinical decision-

making. Our readers can be reassured that knowledge of statistics,

although an essential component of the process, does not displace

everyday clear thinking.

As clinicians,

we want to know what management options are

available for our patients, and what evidence there is to justify our

choices (our patients may also want this information). How convincing is

this evidence and how does it relate to our own clinical experience, or

'

gut feelings'? How can we compare our own results with those of

others?

Under what circumstances should we change our practice?

How can we show that one drug or technique is better than another,

or

whether a new diagnostic test adds to our clinical decision-

making? Research design and statistics are tools to help clinicians make

decisions. These processes are not new to medical practice, though they

have recently been formalized and embraced by 'evidence-based

medicine'.

Although there are many medical statistics books available on the

market, in our experience they are either too mathematical in their

approach or, when designed as a basic introductory textbook, use

examples that have little relevance to our specialty. The design of this

book is such that anaesthetists, intensivists, and trainees can

systematically learn the basic principles of statistics (without boredom or

frustration). This should enable the reader to successfully pass relevant

examinations, design a research trial, or interpret the statistical

methodology and design of a published scientific paper.

Each chapter begins with basic principles and definitions, and then

explains how and why certain statistical methods are applied in clinical

studies, using examples from the anaesthetic and intensive care

literature. More sophisticated information is presented in brief detail,

usually with reference to sources of further information for the interested

reader. Note that we have highlighted key words

in bold print. Our

intention was to make it easier for readers to find specific topics within

the text.

As doctors, we are expected to apply our special knowledge and

training in such a way that promotes healing and good health; for

anaesthetists and intensivists, this process can often be dramatic or life-

saving. Progress in our specialty is rapidly evolving and acquisition of

up-to-date knowledge should be based upon critical scrutiny. The aim of

this book is to explain a variety of statistical principles in such a way that

advances the application and development of our knowledge base, and

promotes the scientific foundations of our unique specialty: anaesthesia

and intensive care.

* David E. Longnecker. Support versus illumination: trends in medical statistics. Anesthesiology

1982; 57:73-74.

A cautionary tale

Preface

xv

Three statisticians and three epidemiologists are travelling by train to a

conference. The statisticians ask the epidemiologists whether they have

bought tickets. They have. 'Fools!', say the statisticians, 'We've only

bought one between us!' When the ticket inspector appears, the

statisticians hide together in the toilet. The inspector knocks and they

pass the ticket under the door. He clips the ticket and slides it back under

the door to the statisticians.

The epidemiologists are very impressed, and resolve to adopt this

technique themselves. On the return they purchase one ticket between

them, and share the journey with the statisticians, who again ask whether

they've all bought tickets. 'No', they reply, 'We've bought one to share.'

'Fools!', say the statisticians, 'We've not bought any.' 'But what will you

do when the inspector comes?' 'You'll see.'

This time when the inspector appears, the epidemiologists hide

together in the toilet. The statisticians walk up to the door and knock on

it.

The epidemiologists slide their ticket under the door, and the

statisticians take it and use it as before - leaving the epidemiologists to be

caught by the inspector.

The moral of this story is that you should never use a statistical

technique unless you are completely familiar with it.

As retold by Frank Shann (Royal Children's Hospital, Melbourne,

Australia)

The Lancet 1996; 348: 1392.

Acknowledgements

We would like to thank Dr Rod Tayler, Dr Mark Reeves and Dr Mark

Langley for their constructive criticism of earlier drafts of this book, and

Dr Anna Lee for proofreading. Paul Myles would like to thank the Alfred

Hospital Whole Time Medical Specialists for providing funds to purchase

a notebook computer and statistical software.

We have used data from many studies, published in many journals, to

illustrate some of our explanations. We would like to thank the journal

publishers for permission to reproduce these results. We would

particularly like to thank and acknowledge the investigators who

produced the work.

Data types

Types of data

-categorical

-ordinal

-numerical

Visual analogue scale (VAS)

Key points

• Categorical data are nominal and can be counted.

• Numerical data may be ordinal, discrete or continuous, and are usually

measured.

• VAS measurements are ordinal data.

Types of data

Before a research study is undertaken it is important to consider the

nature of the observations to be recorded. This is an essential step during

the planning phase, as the type of data collected ultimately determines

the

way in which the study observations are described and which

statistical tests will eventually be used.

At the most basic level, it is useful to distinguish between two types of

data. The first type of data includes those which are defined by some

characteristic, or quality, and are referred to as qualitative data. The

second type of data includes those which are measured on a numerical

scale and are referred to as quantitative data.

The precision with which these data are observed and recorded, and

eventually analysed, is also described by a hierarchical scale (of

increasing precision): categorical, ordinal, interval and ratio scales

(Figure 1.1).

Categorical data

Because qualitative data are best summarized by grouping the

observations into categories and counting the number in each, they are

most often referred to as categorical (or nominal) data. A special case

exists when there are only two categories; these are known as

dichotomous (or binary) data.

Examples of categorical data

1. Gender

- male

- female

Figure 1.1 Types of data

2. Type of operation (cardiac, adult)

- valvular

- coronary artery

- myocardial

- pericardial

- other

3. Type of ICU admission

- medical

- surgical

- physical injury

- poisoning

- other

4.

Adverse events (major cardiovascular)

- acute myocardial infarction

- congestive cardiac failure

- arrhythmia

- sudden death

- other

The simplest way to describe categorical data is to count the number of

observations in each group. These observations can then be reported

using absolute count, percentages, rates or proportions.

Ordinal data

If there is a natural order among categories, so that there is a relative

value among them (usually from smallest to largest), then the data can be

Data types

3

considered as ordinal data. Although there is a semiquantitative

relationship between each of the categories on an ordinal scale, there is

not a direct mathematical relationship. For example, a pain score of 2

indicates more pain than a score of 1, but it does not mean twice as much

pain, nor is the difference between a score of 1 and 0 equal to the

difference between a score of 3 and 2.

For ordinal data, a numerical scoring system is often used to rank the

categories; it may be equally appropriate to use a non-numerical record

(

A, B, C, D; or +, ++, +++, ++++). A numerical scoring system does,

however, have practical usage, particularly for the convenience of data

recording and eventual statistical analyses. Nevertheless, ordinal data

are, strictly speaking, a type of categorical data. Once again these

observations can be described by an absolute count, percentages, rates or

proportions. Ordinal data can also be summarized by the median value

and range (see Chapter 2).

Examples of ordinal data

1. Pain score

0 = no pain

1 = mild pain

2 = moderate pain

3 = severe pain

4 = unbearable pain

2. Extent of epidural block:

A = lumbar (L1-L5)

B = lowthoracic (T10-T12)

C = mid-thoracic (T5-T9)

D = high thoracic (T1-T4)

3. Preoperative risk:

ASA* I/II = lowrisk

ASA III = mild risk

ASA IV = moderate risk

ASA V = high risk

Numerical data

Quantitative data are more commonly referred to as numerical data;

these observations can be subdivided into discrete and continuous

measurements. Discrete numerical data can only be recorded as whole

numbers (integers), whereas continuous data can assume any value. Put

simply, observations that are counted are discrete numerical data and

observations that are measured are usually continuous data.

Examples of numerical data

1. Episodes of myocardial ischaemia (discrete)

2. Body weight (continuous)

3. Creatinine clearance (continuous)

4. Cardiac index (continuous)

*

ASA = American Society of Anesthesiologists' physical status classification.

4

**Statistical Methods for Anaesthesia and Intensive Care
**

5.

Respiratory rate (discrete/ continuous)

6. Post-tetanic count (discrete)

There are circumstances where data are recorded on a

discrete scale,

but may be considered as continuous data, if it is conceptually possible to

achieve any value throughout the possible range of values (even if the

observations are not recorded as such, or eventual statistical analysis

does not consider this possible precision). For example, although

respiratory rate is generally considered to only have discrete values, and

is usually recorded as such, it is possible that any value may exist (at any

one time) and a value of, say, 9.4 breaths/ min is meaningful. This would

not be the case, for example, with number of episodes of myocardial

ischaemia.

It

has to be admitted that the distinction between discrete and

continuous numerical data is sometimes blurred, so that discrete data

may assume the properties of continuous data if there is a large range of

potential values.

Continuous data can also be further subdivided into either an

interval

or ratio scale,

whereby data on a ratio scale have a true zero point and

any two values can be numerically related, resulting in a true ratio. The

classic example of this is the measurement of temperature. If temper-

atures are measured on a Celsius scale they are considered interval data,

but when measured on a Kelvin scale they are ratio data: 0°C is not zero

heat, nor is 26°C twice as hot as 13°C. However, this distinction has no

practical significance for our purposes, as both types of continuous data

are recorded and reported in the same way, and are dealt with using the

same statistical methods.

Numerical data are usually reported as mean and standard deviation,

or as median and range (see Chapter 2).

In general, the observations of interest in a research study are also

referred to as variables, in that they can have different values (i.e. they

can vary). So that, for example, gender may be referred to as a categorical

or dichotomous variable, and cardiac index as a continuous variable.

Studies may include more than one type of data.

For example, in a study investigating the comparative benefits of

patient controlled analgesia after cardiac surgery, Myles

et al.

1 recorded

the following outcomes: pain score, where 0 = no pain, 1 = mild pain, 2 =

moderate pain, 3 = severe pain and 4 = unbearable pain (these are ordinal

data); incidence of respiratory depression (categorical data); total

morphine consumption (continuous data) and serum cortisol level

(continuous data).

As another example, Gutierrez

et a1. ` investigated whether outcome in

the ICU could be improved with therapy guided by measurement of

gastric intramucosal pH (pHi). Outcomes of interest included number of

organs failed in each patient (discrete data), incidence of organ failure

(categorical data), pHi (continuous data) and blood pressure (continuous

data).

They also recorded therapeutic interventions, including the

therapeutic intervention scoring system (TISS) (ordinal data), use of

inotrope infusions (categorical data) and bicarbonate administration

(yes/no: dichotomous, categorical data).

Visual analogue scale (VAS)

Data types

5

A frequently used tool in anaesthesia research is the 100 mm visual

analogue scale (VAS). 3 This is most commonly used to measure

postoperative pain, but can also be used to measure a diverse range of

(mostly)

subjective experiences such as preoperative anxiety,

postoperative nausea, and patient satisfaction after ICU discharge.

Because there are infinite possible values that can occur throughout the

range 0-100 mm, describing a continuum of pain intensity, most

researchers treat the resulting data as continuous.

4,5

If there is some

doubt about the sample distribution, then the data should be considered

ordinal.

There has been some controversy in the literature regarding which

statistical tests should be used when analysing VAS data. 4,,5

Some

statistical tests ('parametric tests') assume that sample data have been

taken from a normally distributed population. Mantha

et a1.4 surveyed

the anaesthetic literature and found that approximately 50%used

parametric tests. Dexter and Chestnuts used a multiple resampling (of

VAS data) method to demonstrate that parametric tests had the greater

power

to detect differences among groups.

Myles

et a1.6 have recently shown that the VAS has properties

consistent with a linear scale, and thus VAS scores can be treated as ratio

data. This supports the notion that a change in the VAS score represents

a relative change in the magnitude of pain sensation. This enhances its

clinical application.

Nevertheless, when small numbers of observations are being analysed

(say, less than 30 observations), it is preferable to consider VAS data as

ordinal.

For a number of practical reasons, a VAS is sometimes

converted to a

' verbal rating scale',

whereby the subject is asked to rate an endpoint on

a scale of 0-10 (or 0-5), most commonly recorded as whole numbers. In

this situation it is preferable to treat the observations as ordinal data.

Changing data scales

Although data are characterized by the nature of the observations, the

precision of the recorded data may be reduced so that continuous data

become ordinal, or ordinal data become categorical (even dichotomous).

This may occur because the researcher is not confident with the accuracy

of their measuring instrument, is unconcerned about loss of fine detail, or

where group numbers are not large enough to adequately represent a

variable of interest. In most cases, however, it simply makes clinical

interpretation easier and this is the most valid and prevalent in the

medical literature.

For example, smoking status can be recorded as smoker/non-smoker

(categorical data), heavy smoker/light smoker/ex-smoker/non-smoker

(ordinal data), or by the number of cigarettes smoked per day (discrete

data).

Another example is the detection of myocardial ischaemia using ECG

ST-segment monitoring - these are actually continuous numerical data,

whereby the extent of ST-segment depression is considered to represent

6

**Statistical Methods for Anaesthesia and Intensive Care
**

the degree of myocardial ischaemia. For several reasons, it is generally

accepted that ST-segment depression greater than 1.0 mm indicates

myocardial ischaemia, so that ST-segment depression less than this value

is categorized as 'no ischaemia' and that beyond 1.0 mm as 'ischaemia'. 7

This results in a loss of detail, but has widespread clinical acceptance (see

Chapter 8 for further discussion of this issue).

References

1. Myles PS, Buckland MR, Cannon GB et al. Comparison of patient-controlled

analgesia and nurse-controlled infusion analgesia after cardiac surgery.

Anaesth Intensive Care 1994; 22:672-678.

2. Gutierrez G, Palizas F, Doglio G et al. Gastric mucosal pH as a therapeutic

index of tissue oxygenation in critically ill patients. Lancet 1992; 339:195-199.

3.

Revill Sl, Robinson JO, Rosen M et al. The reliability of a linear analogue for

evaluating pain. Anaesthesia 1976; 31:1191-1198.

4. Mantha S, Thisted R, Foss J et al. A proposal to use confidence intervals for

visual analog scale data for pain measurement to determine clinical

significance. Anesth Analg 1993; 77:1041-1047.

5.

Dexter F, Chestnut DH. Analysis of statistical tests to compare visual analogue

scale data measurements among groups. Anesthesiology 1995; 82:896-902.

6. Myles PS, Troedel S, Boquest M, Reeves M. The pain visual analogue scale: is

it linear or non-linear? Anesth Analg 1999; 89:1517-1520.

7. Fleisher L, Rosenbaum S, Nelson A et al. The predictive value of preoperative

silent ischemia for postoperative ischemic cardiac events in vascular and

nonvascular surgical patients. Am Heart J 1991; 122:980-986.

2

Descriptive statistics

Measures of central tendency

-mode

-median

-mean

Degree of dispersion

-range

-percentiles

-variance

-standard deviation

-standard error

Confidence intervals

Frequency distributions

-normal

-binomial

-Poisson

Data transformation

Rates and proportions

I

ncidence and prevalence

Presentation of data

Key points

• The central tendency of a frequency distribution can be described by the

mean, median or mode.

• The mean is the average value, median the middle value, and mode the most

common value.

• Degree of dispersion can be described by the range of values, percentiles,

standard deviation or variance.

• Standard error is a measure of precision and can be used to calculate a

confidence interval.

• Most biological variation has a normal distribution, whereby approximately

95%of observations lie within two standard deviations of the mean.

• Data transformation can be used to produce a more normal distribution.

Descriptive statistics summarize a collection of data from a sample or

population. Traditionally summaries of sample data ('statistics') are

defined by Roman letters (z, s,,, etc.) and summaries of population data

('parameters') are defined by Greek letters (y,

6,

etc.).

Individual observations within a sample or population tend to cluster

about a central location, with more extreme observations being less

frequent. The extent that observations cluster can be described by the

central tendency. The spread can be described by the degree of

dispersion.

For example, if 13 anaesthetic registrars have their cardiac output

measured at rest, their results may be: 6.2, 4.9, 4.7, 5.9, 5.2, 6.6, 5.0, 6.1, 5.8,

5.6, 7.0, 6.6 and 5.5 1 /min. Howcan their data be summarized in order to

best represent the observations, so that we can compare their cardiac

output data with other groups?

The most simple approach is to rank the observations, from lowest to

highest: 4.7, 4.9, 5.0, 5.2, 5.5, 5.6, 5.8, 5.9, 6.1, 6.2, 6.6, 6.6 and 7.01/min. We

now have a clearer idea of what the typical cardiac output might be,

because we can identify a middle value or a commonly occurring value

(the smallest or largest value is least likely to represent our sample

group).

8

**Statistical Methods for Anaesthesia and Intensive Care
**

Measures of central tendency

The sample mode is the most common value. In the example above it is

6.6 1/min. This may not be the best method of summarizing the data (in

our example it occurs twice, not much more frequent than other

observations).

If the sample is ranked, the median is the middle value. If there is an

even number of observations, then the median is calculated as the

average of the two middle values. In the example above it is 5.8 1/min.

The mean (or more correctly, the

arithmetic mean)

is the average

value. It is calculated as the sum of (depicted by the Greek letter, Y,) the

observations, divided by the number of observations. The formula for the

mean is:

where x = each observation, and

n

= number of observations. In the

example, the mean can be calculated as 75.1/13 = 5.78 1/min.

The mean is the most commonly used single measure to summarize a

set of observations. It is usually a reliable measure of central tendency.

Degree of dispersion

The spread, or variability, of a sample can be readily described by the

minimum and maximum values. The difference between them is the

range.

In the example above, the range is (7.0 - 4.7) 2.3 1/min. The range

does not provide much information about the overall distribution of

observations, and is also heavily affected by extreme values.

A clearer description of the observations can be obtained by ranking

the data and grouping them into

percentiles.

Percentiles rank obser-

vations into 100 equal parts. We then have more information about the

pattern of spread. We can describe 25%, 50%, 75%, or any other amount

of observations. The median is the 50th centile. If we include the middle

50%of the observations about the median (25th to 75th centile), we have

the interquartile range.

In the example above, the interquartile range is

5.2-6.1 1/min.

A better method of measuring variability about the mean is to see how

closely each individual observation clusters about it. The variance is such

a method. It sums the square of each difference

('sum of squares')

and

divides by the number of observations. The formula for variance is:

The expression within the parentheses is squared so that it removes

negative values. The formula for the variance (and standard deviation,

see below) for a population has the value 'n' as the denominator. The

expression ' n -1'

is known as the

degrees of freedomand is one less than

the number of observations. This is explained by a defined number of

Descriptive statistics

9

observations in a sample with a known mean - each observation is free to

vary except the last one which must be a defined value.

The degrees of freedom describe the number of independent observa-

tions or choices available. Consider a situation where four numbers must

add up to ten and one can choose the four numbers (n = 4). Provided that

one does not choose the largest remainder, it is possible to have free

choice in choosing the first three numbers, but the last number is fixed by

the first three choices. The degee of freedom was (n - 1).

The degrees of freedom is used when calculating the variance (and

standard deviation) of a sample because the sample mean is a

predetermined estimate of the population mean (each individual in the

sample is a random selection, but not the fixed sample mean value).

The variance is measured in units of x2 . This is sometimes difficult to

comprehend and so we often use the square root of variance in order to

retain the basic unit of observation. The positive square root of the

variance is the standard deviation (SD or s x).

The formula for SD is:

In the example above, SD can be calculated as 0.7141/min.

Another measure of variability is the coefficient of variation (CV).

This considers the relative size of the SD with respect to the mean. It is

commonly used to describe variability of measurement instruments. It is

generally accepted that a CV of less than 5%is acceptable reproducibility.

CV = SD/mean x 100%

There are many sources of variability in data collection. Biological

variability - variation between individuals and over time - is a

fundamental source of scatter. Another source of variability is

measurement imprecision (this can be quantified by the CV). These types

of variability result in random error. Lastly, there are mistakes or biases

in measurement or recording. This is

systematic error. Determined

efforts should be made to minimize random and systematic error.

Random error can be reduced by use of accurate measurement

instruments, taking multiple measurements, and using trained observers.

The ability to detect differences between groups is blurred by large

variance, and this inflates the sample size that is needed to be studied.

Systematic error cannot be compensated for by increasing sample size.

Another measure of variability is the standard error (SE). It is

calculated from the SD and sample size (n):

Standard error is a much smaller numerical value than SD and is often

presented (wrongly) for this reason. It is not meant to be used to describe

variability of sample data-1-5 It is used to estimate a population para-

meter from a sample - it is a measure of precision.

1 0

**Statistical Methods for Anaesthesia and Intensive Care
**

Standard error is also known as the standard error of the mean. If one

takes a number of samples from a population, we will have a mean for

each sample. The SD of the sample means is the standard error.

In the example above we selected 13 individuals and measured their

cardiac outputs. If we nowselected a second group of (say) 11 individuals

(n

= 11), and then a third (n

= 8), a fourth (n = 13), and perhaps a fifth

(n

= 15), we would have five different sample means. Each may have

sampled from the same population (in our example these may be

anaesthetic registrars within a regional training programme) and so each

sample could be used to estimate the true population mean and the SD.

The five sample means would have their own distribution and it would

be expected to have less dispersion than that of all the individuals in the

samples, i.e.

sample A (n = 13) mean 5.78 l/ min

sample B (n = 11) mean 5.54 1/min

sample C (n = 8) mean 5.99 1/min

sample D (n = 13) mean 6.12 1/min

sample E (n = 15) mean 5.75 1/min

mean (of 5 samples): 5.841/min

SD of the 5 sample means: 0.23 1/min

The SE represents the SD of the sample means (0.23 1/min). But we do

not generally take multiple samples and are left to determine the SE from

one sample. The example above (sample A) has an SD of 0.714 and an SE

of (0.714/3.61) = 0.201/min.

In general we are not interested in the characteristics of multiple

samples, but more specifically how reliable our one sample is in

describing the true population. We use SE to define a range in which the

true population mean value should lie.

Standard error is used to calculate

confidence intervals,

and so is a

measure of precision (of how well sample data can be used to predict a

population parameter). If the sample is very large (with a large value of

n), then prediction becomes more reliable. Large samples increase

precision.

We stated above that random error can be compensated for by

increasing sample size. This is an inefficient (and possibly costly) method,

as a halving of SE requires a four-fold increase in sample size (sq4 = 2).

Confidence intervals

Confidence intervals are derived from the SE and define a range of values

that are likely to include a population parameter. The two ends of the

range are called

confidence limits.

The width of the confidence interval depends on the SE (and thus

sample size, n)

and the degree of confidence required (say 90, 95 or

99%) : 95%confidence intervals (95%CI) are most commonly used. The

range, 1.96 standard errors either side of the sample mean, has a 95%

probability of including the population mean; and 2.58 standard errors

either side of the sample mean has a 99%probability of including the

population mean, i.e.

95%CI of the mean = sample mean ± (1.96 x SE)

Descriptive statistics

1 1

The 95%CI should be distinguished from a property of the normal

distribution

where 95%of observations lie within 1.96 standard

deviations of the mean. The 95%CI relates to the sample statistic (e.g.

mean), not the individual observations. In a similar way, the SD indicates

the spread of individual observations in the sample, while the SE of the

mean relates the sample mean to the true population mean.

In our example above, 95%CI can be calculated as the mean (5.78) ±

(1.96 x 0.198), that is 5.39 to 6.171/min. This states that the probability (P)

of the true population cardiac output lying within this range is P = 0.95,

or 95%. It can also be stated that 95%of further sample means would lie

in this range.

Confidence intervals can be used to estimate most population

parameters from sample statistics (means, proportions, correlation

coefficients, regression coefficients, risk ratios, etc. - see later chapters).

They are all calculated from SE (but each has a different formula to

estimate its SE).

Frequency distributions

It is useful to summarize a number of observations with a frequency

distribution. This is a set of all observations and their frequencies. They

may be summarized in a table or a graph (see also Figures 2.1 and 2.2).

Example 2.1 A set of observations: creatinine clearance values

(x, ml/min) in 15

critically ill patients. I QR = interquartile range, CI =

confidence i nterval

Patient x

x x) 2

1 76 8.5 4 72.9

2 100 32.5 4 105 9

3

46 -21.46 4620

4 65 -2.46

6.1 range = 81 (26 to 107)

5 89 11.5 4 133

mode = 76

6 37 -30.46

928 median = 68

7 5 9 -8.46

71.6 I QR = 5 2.5 to 82.5

8 68 0.5 4

0.25

9 107 39.5 4 15 63

10 26 -41.46

1719

11 38 -29.46 868

12 90 22.5 4

5 08

13 75 7.5 4 5 6.9

14 76 8.5 4

72.9

15 60 -7.46 5 5 .7

total (su m) = 1012

7930

mean, x = 1012/15 = 67.46

SD = sq(7930/14) = 23.8

SE=23.8/3.87=6.14

95 % CI of the mean = 5 5 .4 to 79.5

12

**Statistical Methods for Anaesthesia and Intensive Care
**

Figure 2.1 A bar diagram of the distribution of creatinine clearance values in

critically ill patients

Example 2.2 From Example 2.1 we can summarize the frequency distribution of

c

reatinine clearance values (CrCl, ml/min) when categorized into intervals

Normal distribu tion

Most biological variation has a tendency to cluster around a central value,

with a symmetrical positive and negative deviation about this point, and

more extreme values becoming less frequent the further they lie from the

CrCl

interval

Frequ ency

Relative

frequ ency

Cu mu lative

frequ ency

0-20 0

0

0

21-40 3

0.2

0.20

41-60 3

0.2

0.40

61-80 5

0.33

0.73

81-100 3

0.2

0.93

101-120 1

0.07

1.0

Descriptive statistics

13

central point. These features describe a

normal distribution and can be

plotted as a normal distribution curve. It is sometimes referred to

as a Gaussian distribution after the German mathematician, Gauss

(1777-1855).

If we now look at the formula for the normal distribution, we can see

that there are two parameters that define the curve, namely p (mu, the

mean) and 6 (sigma, the standard deviation); the other terms are

constants:

The standard normal distribution curve (Figure 2.3) is a symmetrical

bell-shaped curve with a mean of 0 and a standard deviation of 1. This is

also known as the z distribution. It can be defined by the following

eauation:

A

z transformation converts any normal distribution curve to a

standard normal distribution curve, with mean = 0, SD = 1. The formula

for z is:

where y = mean. This results in a standardized score, z, i.e. the number

of standard deviations from the mean in a standard normal distribution

curve. It can be used to determine probability (by referring to a z table in

a reference book).

Using Example 2.1 above, we can determine the probability of a

creatinine clearance value less than 40 ml/min if we assume that the

sample data are derived from a normal distribution and the sample mean

and SD represent that population:

1 4

**Statistiral Mathnrfs fnr AnAPSthPCI A e·cI ntancivo ('.a,
**

l.lv L:viie~1Yeiiut1 uw a wile-Laiieuf r value

or u.rz. rnis means mat to

probability of a critically ill patient having a creatinine clearance of les

than 40 ml/min is 0.12 (or 12%). In other words, it would not b,

considered a very uncommon event in this population.

As the number of observations increases (say, n > 100), the shape of

sampling distribution will approximate a normal distribution curve eve]

if the distribution of the variable in question is not normal. This i

explained by the central limit theorem,

and indicates why the norma

distribution is so important in medical research.

The normal distribution curve has a central tendency and a degree o

dispersion. The mode, median and mean of this curve are the same. The

probability is equal to the area under the curve. In a normal distribution

one SD either side of the mean includes 68%of the total area, twc

standard deviations 95.4%, and three standard deviations 99.7%. 95%o

the population lie within 1.96 standard deviations.

Many statistical techniques have assumptions of normality of the data

It is not necessary for the sample data to be normally distributed, but i

should represent a population that is normally distributed

. 5 ,6

It i;

preferable to be able to demonstrate this, either graphically or by using i

' goodness of fit'

test (see Chapters 5 and 11).

In some circumstances there is an asymmetric distribution (skew), sc

that one of the tails is elongated. If a distribution is skewed, then the

measures of central tendency will differ. If the distribution is skewed tc

the right, then the median will be smaller than the mean. The median i;

a better measure of central tendency in a skewed distribution. If sampef

data are skewed, they can first be transformed into a normal distribution

and then analysed (see below). Kurtosis describes how peaked thf

distribution is. The kurtosis of a normal distribution is zero.

A bimodal distribution

consists of two peaks and suggests that the

sample is not homogeneous but possibly represents two different

populations.

Binomial distribution

A binomial distribution

exists if a population contains items which

belong to one of two mutually exclusive categories (A or B), e.g.

male/female

complication/no complication

It has the following conditions:

1.

There are a fixed number of observations (trials)

2.

Only two outcomes are possible

3.

The trials are independent

4.

There is a constant probability for the occurrence of each event

The binomial distribution describes the probability of events in a fixed

1 6

**Statistical Methods for Anaesthesia and Intensive Care
**

Data transformation

In some circumstances it may be preferable to transform a distribution so

that it approximates a normal distribution. This generally equalizes

group variances, and makes data analyses and interpretation easier.6

This

is a useful approach if sample data are skewed.

The most commonly used transformation is a

log transformation. This

can result in a mean that is independent of the variance, a characteristic

of a normal distribution.6

The antilog of the mean of a set of logarithms is a geometric mean. It

is a good measure of central tendency if a distribution is skewed.

Rates and proportions

A rate

is a measure of the frequency of an event. It consists of a numerator

(number of events) and a denominator (number in the population). For

example, if 14 colorectal surgical patients have died in a hospital

performing 188 cases in the previous 12 months, the reported mortality

rate

would be 14/174 (= 0.0805), or 8.05%. Note that a rate does not

include the number of events in the denominator.

A proportion

includes the numerator within the denominator. It has a

value between 0 and 1.0, and can be multiplied by 100%to give a

percentage.

In our colorectal surgical example, the proportion of deaths

is 14/188 (= 0.0745), or 7.45%. Rates are sometimes used interchangeably

with proportions (e.g. mortality 'rates' often include the deaths in the

denominator). This is a common practice that is generally accepted.

Two proportions may be compared by combining them as a ratio. For

example, a

risk ratio is the incidence rate of an event in an exposed

population versus the incidence rate in a non-exposed population. If the

risk ratio is greater than 1.0, there is an increased risk with that exposure.

I ncidence and prevalence

Incidence and prevalence are often used interchangeably, but they are

different and this difference should be understood. Incidence

is the

number of individuals who develop a disease (i.e. new case) in a given

time period. The incidence rate is an estimate of the probability, or risk,

of developing a disease in a specified time period.

Prevalence is the current number of cases (pre-existing and new).

Prevalence is a proportion, obtained by dividing the number of indi-

viduals with the disease by the number of people in the population.

Presentation of data

The mean is the most commonly used single measure to summarize a set

of observations. It is usually a reliable measure of central tendency. This

Descriptive statistics

17

is because most biological variation is symmetrically distributed about a

central location.

Mean and SD are the best statistics to use when

describing data from a normal distribution.

It

has been suggested that the correct method for presentation of

normally distributed sample data variability is mean (SD) and not mean

(± SD). 7

The '±' symbol implies that we are interested in the range of one

SD above and below the mean; we are generally more interested in the

degree of spread of the sample data.

One of the weaknesses of the mean is that it is affected by extreme

values. In these circumstances it may be preferable to use median or

geometric mean as a measure of central tendency, and range or inter-

quartile range for degree of spread. The mode is best used if the data have

a bimodal distribution.

Ordinal data should be described with mode or median, or each

category as number (%). Categorical data can be presented as number

A box and whisker plot

(Figure 2.4) can be used to depict median,

interquartile range and range. For example, we can use our data from

Example 2.1 and depict the median (line through box), interquartile range

(box) and whiskers (5%and 95%centiles, or minimum and maximum).

Tables and diagrams are convenient ways of summarizing data. They

should be clearly labelled and self-explanatory, and as simple as possible.

Most journals include specific guidelines and these should be followed.

Tables should have their rows and columns labelled. Excessive detail

or numerical precision (say, beyond 3 significant figures) should be

avoided. Graph axes and scales should be labelled. If the axis does not

begin at the origin (zero), then a break in the axis should be included.

On some occasions it may be acceptable to use standard error bars on

graphs (for ease of presentation), but on these occasions they should be

clearly labelled.

Figure 2.4 A box and whisker plot of creatinine clearance data in critically ill

patients

18

**Statistical Methods for Anaesthesia and Intensive Care
**

References

1. Glantz SA. Biostatistics: howto detect, correct and prevent errors in the

medical literature. Circulation 1980; 61:1-7.

2. Altman DG. Statistics and ethics in medical research, v - analysing data.

BMJ

1980; 281:1473-1475.

3. Horan BE Standard deviation, or standard error of the mean? [editorial]

Anaesth Intensive Care 1982;10:297.

4. Avram MJ, Shanks CA, Dykes MHM et al. Statistical

methods in anesthesia

articles: an evaluation of two American journals during two six-month

periods. Anesth Analg 1985; 64:607-611.

5. Altman DG, Bland JM. The normal distribution.

BMJ 1995; 310:298.

6. Bland JM, Altman DG. Transforming data.

BMJ 1996:312:770.

7. Altman DG, Gardner MJ. Presentation of variability.

Lancet 1986; ii:639.

3

Principles of probability and

i

nference

Samples and populations

I nferential statistics

-definition of probability

-null hypothesis

-Pvalue

-type I and type I I error

Samples and populations

Confidence intervals

Sample size and power calculations

Parametric and non-parametric tests

Permutation tests

Bayesian inference

Key points

j

• A sample is a group taken from a population.

j

• Data from samples are analysed to make inferences about the population.

• The null hypothesis states that there i s no difference between the population

variables in question.

• A type I or alpha error is where one rejects the null hypothesis incorrectly.

• A type II or beta error is where one accepts the null hypothesis incorrectly.

• The P value is the probability of the event occurring by chance if the null

hypothesis is true.

• A Confidence interval i ndicates where the true population parameter

probably lies.

• Power is the likelihood of detecting a difference between groups if one exists.

• Sample size is determined by a,

**(delta, the difference between groups)
**

and sigma2 (variance).

A sample is a group taken from a population. The population may be all

human beings on the earth, or just within a specific country, or all

patients

with a specific condition. A population may also consist of

laboratory animals or cell types. A population, therefore, is not defined

by geography, but by its characteristics. Examples of populations studied

in anaesthesia include:

1. All day stay surgical patients undergoing general anaesthesia

2. Low-risk coronary artery bypass graft surgery patients

3. Critically ill patients in ICU with septic shock

4. Women undergoing caesarean section under spinal anaesthesia

5. Skeletal muscle fibres (from quadriceps muscle biopsy)

A clinical trial involves selecting a sample of patients, in the belief that

the sample represents the response of the average patient in the

population. Therefore, when applying the results of a trial to your

practice, it remains important first to decide if the patients recruited in

the trial (the trial 'sample') are similar to those that you wish to apply the

results to (your clinical practice 'population').

20

**Statistical Methods for Anaesthesia and Intensive Care
**

Clear description and consideration of inclusion and exclusion criteria

are therefore required.

Sample data are estimates of population parameters. Sampling

procedures are therefore very important when selecting patients for

study, so that they ultimately represent the population. Two common

methods are to sequentially select all patients until the required number

is

obtained, or to randomly select (this is preferable) from a larger

population. If patients are selected by other criteria, then it may

bias the

sample, such that it does not truly represent the population. In general,

the larger the sample size, the more representative it will be of the

population, but this involves more time and expense.

I

nferential statistics

Inferential statistics is that branch of statistics where data are collected

and analysed from a sample to make inferences about the larger

population. The purpose is to derive an estimate of one or more

population parameters, so that one may answer questions or

test

hypotheses.

The deductive philosophy of science attributes man with an inquiring

mind that asks questions about himself and the environment. Each

question is refined to produce a specific hypothesis and logical

implications of the hypothesis that can be specifically tested. A scientific

method is used to collect evidence, either by observation or controlled

experiment, to try to support or refute the hypothesis.

Hypothesis tests are thus procedures for making rational decisions

about the reality of observed effects. Most decisions require that an

individual select a single choice from a number of alternatives. The

decision is usually made without knowing whether or not it is absolutely

correct.

A rational decision is characterized by the use of a procedure

which ensures that a probability of success is incorporated into the

decision-making process. The procedure is strictly proscribed so that

another individual, using the same information, would make the same

decision.

The effects under study may be quantitative or qualitative, and can be

summarized as various statistics (differences between means,

contingency tables, correlation coefficients, etc.), each requiring an

appropriate hypothesis testing procedure.

In logic, it is preferable to refute a hypothesis rather than try to prove

one. The reason for this is that it is deductively valid to reject a hypothesis

if the testable implications are found to be false (a method of argument

known as modus tollens),

but it is not deductively valid to accept a

hypothesis if the testable implications are found to be true (known as the

fallacy of affirming the consequent).

The specific implications may be found true (because of other

circumstances) even though the general hypothesis may be false.

Hypotheses may however be generally accepted based on weight of

supporting evidence and lack of contrary evidence. A hypothesis can be

· e e·c · · `1 c · e e· e ·e e e · e· e c e·· c · · e · · e · e c · e·c e c · e e e e e · e · e c e e e ·e e· c e · e· · e · c e· · e c ·c · · e · e · · e e e e e e · e· · · e e c c · e e· · · · · e e · e ec c · e e e ec e e c · e e · · c e e c e e· · e e·· e c e· ·

22

**Statistical Methods for Anaesthesia and Intensive Care
**

associated with a certain probability, P, also known as the

P value, which

indicates the likelihood that the result obtained, or one more extreme,

could have occurred randomly by chance, assuming that H

0 is true.

If P is less than an arbitrarily chosen value, known as

a or the

significance level, the H0 is rejected. However the H

0 may be incorrectly

rejected and this is known as a making a type I error.

By convention, the

a value is often set at 0.05 which means that one accepts a 5%probability

of making a type I error. Other values for a can be chosen, depending on

circumstances.

Note that if multiple comparisons are made, then there is increased

likelihood of a type I error: the more you look for a difference, the more

likely you are of finding one, even by chance (see Chapter 11)!

A type II error occurs when one accepts the H

0 incorrectly and the

probability of this occurring is termed

P.

This will be discussed later

when we talk about power.

Having rejected the H0 ,

it is usual to accept the complementary

alternative hypothesis (H1

), in this case that the drug does cause some

effect, (although strictly speaking we have not logically proved the H

l or

asserted that it is in fact true). We accept the H

l but should remember that

this is specifically for the experimental conditions of the trial and the

sample tested. It is hoped that the results are

generalizable to the

population. However, it is up to the researchers and readers to decide

whether or not this is valid.

In the example above, Hl

is that there is a difference in mean heart rate.

The drug effect could be either to increase or decrease heart rate. This is

known as a two-tailed

hypothesis and the two-tailed form of the sig-

nificance test is used. We can just be interested in one direction of effect,

for example that the drug increases heart rate. In this case the

complementary H0

would state that these is no increase in heart rate, and

a one-tailedtest is used.

The two-tailed test is so named because when we specify an a of 0.05

in a two-tailed hypothesis, we are interested in results similar to or more

extreme than that observed (with no indication of direction). Thus we are

using the extremes at both ends or tails of the distribution, each tail

containing half the total a probability, in this case, 2.5%.

A one-tailed test at an a of 0.05 would use just one end of the

distribution and use different

critical values to compare against the test

statistic.

Two-tailed tests should usually be used unless there are clear

reasons specified in advance as to why one-tailed tests are appropriate for

that study 1

We must remember that, based on our experiment, we make an

inference based on likelihood. A statistically significant result may or

may not be real because it is possible to make type I and type II errors.

However, even if the result is real, we still need to decide whether or not

the result is important. A very small effect may be real and shown to be

statistically significant, but may also be clinically unimportant or

irrelevant. The important finding is the likely size of the treatment effect

(see Chapter 11).

Some investigators conduct trials and report only the result of

significance tests; the null hypothesis is rejected, a real effect is observed

and a P value is given indicating that the probability of this effect

occurring by chance is very low.

However, the use of only P values has been criticized because they do

not give an indication of the magnitude of any observed differences and

thus no indication of the clinical significance of the result. Confidence

intervals are often preferred when resenting results because they also

provide this additional information.

Confidence intervals

Principles of probability and inference

23

When estimating any population parameter by taking measurements

from a sample, there is always some imprecision expected from the

estimate. It is obviously helpful to give some measure of this imprecision.

There is an increasing tendency to quote confidence intervals (CI),

either with P values or instead of them. This is preferable as it gives the

reader a better idea of clinical applicability. If the 95%CI describing the

difference between two mean values includes the point zero (i.e. zero

difference), then obviously P > 0.05.

A 95%CI for any estimate gives a 95%probability that the true

population parameter will be contained within that interval. Any %CI

can be calculated, although 95%and 99%are the most common. CIs can

be calculated for many statistics such as mean, median, proportion, odds

ratio, slope of a regression line, etc.

2,3

For example, in a study comparing the dose requirements for thio-

pentone in pregnant and non-pregnant patients, 4 the ED50 (95%CI) for

hypnosis in the pregnant group was 2.6 (2.3-2.8) mg/kg. Thus we can be

95%certain that the true population ED50

is some value between 2.3 and

2.8

mg/kg. In addition, the pregnant to non-pregnant relative median

potency (95%CI) for hypnosis was 0.83 (0.67-0.96). Because the 95%Cl

does not contain 1.0, we conclude that there is a significant difference in

median potency between the two groups.

The lowest and highest values of the CI are also known as the lower

and upper confidence limits.

It is commonly thought that the 95%CI for the population mean is the

sample mean ± (1.96 x SEM), where

SEM is the standard error of the

mean.

However this is only true for large sample sizes. If the sample size

is small, the t distribution

is applicable (see Chapter 5) and the 95%CI

for the sample mean in small samples is calculated as the mean of the

sample ± (the appropriate t value corresponding to 95%x SEM).

CI can be used to indicate the precision of any estimate. Obviously the

smaller the CI the more precisely we assume the sample estimate

represents the true population value. CI can also be used for hypothesis

testing.

For example, if the CI for the difference between two means contains

zero, then one can conclude that there is no significant difference between

the two populations from which the samples were drawn at the

significance

%of the CI. Thus if we use a 95%CI, this is similar to

choosing an a value of 0.05.

24

**Statistical Methods for Anaesthesia and Intensive Care
**

Figure 3.1

Theoretical plots of the mean and 95 %

confidence intervals (CIs) for

troponin-1 levels (wg/1) in three patient groups after major surgery. The

95 % CI

of group C do not overlap those of groups A and B, and so the difference in

means is statistically significant (P < 0.05 ).

The 95 % CI of group B overlaps the

mean value of group A and so it is not statistically significantly different (at

P < 0.05 )

It is possible to graphically illustrate the use of CI for hypothesis

testing in a limited manner (Figure 3.1). If we showthree sample means

and their respective CIs, there is a significant difference between the

group means if the respective CIs do not overlap. There is no difference

if one CI includes the mean of the other sample. However, if the CIs just

overlap, it is not easy to determine graphically whether or not there is a

statistical difference between means and a statistical test is used.

For example, in a study comparing propofol requirements in normal

patients with that in patients with small and large brain tumours,

5

the

dose-response curves show that the 95%CIs for the control and small

brain tumour contain the ED50

for the other group (Figure 3.2). There is

thus no difference in ED

:`

for these two groups. However, these 95%CIs

do not overlap the 95%CI for the large tumour group and so there is a

difference in the ED

:`

of the large tumour group compared with the other

two. Note however that there is overlap of the 95%CI for the ED

`:

and it

is not clear from the graph whether or not this represents a significant

difference. In this example, the authors also used a statistical test of

significance. 5

CIs can indicate statistical significance but, more importantly, by

illustrating the accuracy of the sample estimates, presenting the results in

the original units of measurement and showing the magnitude of any

effect, they reveal more information. This enables the investigator (or

reader) to determine whether or not any difference shown is clinically

significant.

Sample size and power calculations

The difference between a sample mean and population mean is

the

sampling error.

A larger sample size will decrease the sampling error.

Principles of probability and inference

2

5

Figure 3.2 Calculated dose-response curves (log dose scale) for loss of response

to verbal command in patients with brain tumour and patients undergoing

non-cranial surgery. The 95%confidence intervals for the ED50 and ED95 are

also displayed, slightly offset for clarity. Fraction of patients (out of ten) who

failed to respond to verbal command are shown as (•) for patients with large

brain tumour, (X) for patients with small tumour and (.) for control patients

(From Chan et al. 5 )

When designing a trial, two important questions are:

1. How large a sample is needed to allow statistical judgments that are

accurate and reliable

2. Howlikely is a statistical test able to detect effects of a given size in a

particular situation

Earlier we mentioned that it was possible to reject the H0 incorrectly (a

type I error), or accept the H0 incorrectly (a type II error). The investigator

sets a threshold for both these errors, often 0.05 for the type I or a error,

and between 0.05 and 0.20 for the type 11 or beta error.

Power is the likelihood of detecting a specified difference if it exists

and it is equal to 1-beta For example, in a paper comparing the duration of

mivacurium neuromuscular block in normal and postpartum patients ,6

power analysis beta = 0.1, a = 0.05) indicated that a sample size of 11 would

be sufficient to detect a three-minute difference in clinical duration of

neuromuscular block.

Performing power analysis and sample size estimation is an important

aspect of experimental design, 7,8 because without these calculations

sample size may be too large or too small. If sample size is too small, the

experiment will lack the precision to provide reliable answers to the

questions being investigated. If sample size is too large, time and

resources will be wasted, often for minimal gain.

It is considered unethical to conduct a trial with low power because it

wastes time and resources while subjecting patients to risk. The simplest

2 6

**Statistical Methods for Anaesthesia and Intensive Care
**

way to increase power in a study is to increase the sample size, but power

is only one of the factors that determine sample size.

There are various formulae for sample size depending on the study

design and whether one is looking for difference in means, proportions,

or other statistics. 7,9 Nomograms8 and computer programs are also

available.

As an example, if one is interested in calculating the sample size for a

two-sided study comparing the means of two populations, then the

following approximate formula can be used: 9

delta. Traditionally, many researchers choose beta = 0.2, but this implies that the

study only has 80%power (1 - 0.2), i.e. an 80%probability to detect a

difference if one exists. Why should we accept a lower probability for a,

typically 0.05, than beta? Is it more important to protect against a type I or

type II error? It has been argued that we should be more concerned about

type I error because rejecting the H

0 would mean that we are accepting

an effect and may incorrectly decide to implement a new therapy. It is

generally important not to do this lightly.

Committing a type II error and concluding falsely that there is no effect

will only delay the implementation of a new treatment (although this

presupposes that some satisfactory alternative exists). However both

errors are important and aand beta values should be considered carefully,

depending on the hypothesis being tested.

Delta ,

the effect size, is the difference that one wishes to detect. It is

more difficult to detect a small difference than a large difference. Here the

investigator is called upon to decide what is a clinically significant

difference. This is arbitrary but should be plausible and acceptable by

most peers.

For example, a muscle relaxant may truly increase serum potassium by

0.05

mmol/l. However we may decide that a clinically significant

Principles of probability and inference

2 7

increase in potassium is 0.3 mmol/l. Thus the sample size is set so that a

rise of 0.3 mmol/1 is detectable. In this case, we may well fail to detect the

true increase in serum potassium, but this is inconsequential because a

rise of 0.05 mmol/1 is not important. We have not however committed a

type II error because the H

0 was that there is no difference as great as 0.3

mmol/l.

In another example, comparing postoperative morphine requirements

after ketamine induction compared with thiopentone induction,

10 the

effect size chosen was a 40%decrease in 24-hour morphine consumption.

This was the expected effect size from a previous study in a different

population. This effect was thought to be clinically relevant, although the

authors would not dispute that some others might consider 30%or 50%

to be thresholds for clinical relevance. Given the historical variance in

morphine requirement derived from their acute pain database, a sample

size of 20 was eventually calculated at a = 0.05 and beta = 0.2.

The effect size

can be related to the standard deviation (delta/sigma) and

cat e orized as small (< 0.2), medium (0.2-0.5) or large (> 0.5).

o

the variance

in the underlying populations, is the only variable that

the investigator cannot choose. An estimate for this can be obtained from

pilot studies or other published data, if available. The greatest concern is

that if the variance is underestimated, on completion of the study at the

given sample size, the power of the study will be diminished and a

statistically significant difference may not be found.

The variance can also be minimized by maximizing measurement

precision. For example, estimation of cardiac output using the thermo-

dilution method has been shown to be more precise using measurements

in triplicate and with iced injectate.

11,12

Thus, inclusion of these

thermodilution methods in the study design can reduce the study cardiac

output variance, and so limit the sample size required.

Note from the above formula that if one chooses delta equal to sigma these two

would then cancel mathematically in the sample size formula. However

this is not a recommended technique for calculating sample size when

one does not have an estimate of o'.

Having calculated a sample size, one usually increases the number by

a factor based on projected dropouts from the study, and allowing for a

reasonable margin of error in the estimate of sigma. It is useful to recalculate

sample size for greater estimates of sigma because at times a surprisingly

large sample size is required for a small change in sigma and the feasibility of

the whole study may be in doubt.

Sample size estimations for numerical data assume a normal

distribution and so if the study data are skewed or non-parametric it is

common to increase the sample size estimate by at least 10%.

The sample size formula for a two-sided comparison of proportions is:

9

28

**Statistical Methods for Anaesthesia and Intensive Care
**

and

p2

= the expected proportions in the two groups, q1

and

q2

=1-pl

and

1-p2,

respectively, delta = the effect size, which is p1-p2.

Thus, the sample size for a difference in proportions depends on four

factors:

1.

Value chosen for a: a smaller a means a larger sample size

2. Value chosen for beta:

smaller beta (higher power) means larger sample size

3. A (effect size, p1

p2):

smaller A means larger sample size

4.

Number of study endpoints (p l

): rare events require larger sample size

There may also be several outcomes of interest and the sample size

calculations for each of these outcomes may be different. Good trial

design dictates that the sample size should be based on the

primary

endpoint,

though it may be further increased if important secondary

endpoints are also being studied.

Although a sample size is calculated,

a priori, before the study begins,

it is also useful to use the same formula to calculate the power of the

study, a posteriori,

after the study has been completed. This can be useful

when the H0

is accepted because it indicates howlikely a given difference

could have been detected using the actual standard deviation from the

final study samples, rather than the original estimate.

Previous authors have noted that many published studies had failed to

study an adequate number of patients, such that they were unlikely to

have reliably determined a true treatment effect if it existed.

7-913,14

This

has been a common error in anaesthesia studies (see Chapter 11).

Parametric and non-parametric tests

In the previous discussion of statistical inference and hypothesis testing,

it

was necessary to determine the probability of any observed difference.

This probability is based on the sampling distribution of a test statistic. It

is important to realize that there are assumptions made when using these

test statistics (see also Chapters 5 and 6).

Parametric tests

are based on estimates of parameters. In the case of

normally distributed data, the tests are based on sampling distributions

derived from p or

sigma Parametric tests are based on the actual magnitude

of values (quantitative, continuous data), and can only be used for data

on a numerical scale (cardiac output, renal blood flow, etc.). The

parametric tests discussed later in this book have many inherent

assumptions and thus should only be used when these are met.

Non-parametric tests

were developed for situations when the

researcher knows nothing about the parameters of the variable of interest

in the population.

1s Non-parametric methods do not rely on the

estimation of parameters (such as the mean or the standard deviation).

Non-parametric tests should generally be used to analyse ordinal and

categorical data (rating of patient satisfaction, incidence of adverse

events, etc.).

A common criticism of non-parametric tests is that they are not as

powerful as parametric tests. This means that they are not as likely to

detect a significant difference as parametric tests, if the conditions of the

parametric test are fulfilled. When the parametric assumptions are not

met, non-parametric tests often become more powerfu1.

15

Proponents of parametric tests agree that non-parametric methods are

most appropriate when the sample sizes are small. However, the tests of

significance of many of the non-parametric statistics described here are

based on asymptotic (large sample) theory and meaningful tests often

cannot be performed if the sample sizes become too small (say,

n < 10).

When the data set is large (say, n > 100),

it often makes little sense to use

non-parametric statistics because the sample means will follow the

normal distribution even if the respective variable is not normally

distributed in the population, or is not measured very well. This is a

result of an extremely important principle called the central limit

theorem.

The central limit theorem states that, as the sample size increases, the

shape of the sampling distribution approaches normal shape, even if the

distribution of the variable in question is not normal. For n = 30, the

shape of that distribution is 'almost' normal. Ordinal data can be

analysed with parametric tests in large samples, particularly if the sample

data represent a population variable on a continuous scale. 16

Proponents of the non-parametric tests argue that the power of these

tests is very close to the parametric tests even when all conditions for the

parametric test are satisfied. When the parametric assumptions are not

met, non-parametric tests actually become more powerful. Power

efficiency is the increase in sample size necessary to make the test as

powerful at an alpha level and given sample size. Asymptotic relative

efficiency is the same concept for large sample sizes, and is independent

of the type I error. The power for the common non-parametric tests is

often 95%of the equivalent parametric test.

15

Permutation tests

17,18

Principles of probability and inference

2 9

There has been renewed interest in a third approach for testing signifi-

cance, apart from using the common parametric and non-parametric

statistics.

One important assumption of statistical inference is that the

samples are drawn randomly from the population. In reality, this is

almost never the case. Our samples are mostly not truly random samples

from the population but are instead comprised only of subjects to which

we have access and are then able to recruit. Thus the subjects who enter

a trial conducted in a hospital will depend on many geographical and

social as well as medical factors influencing admissions to a particular

hospital. In almost all clinical trials, we actually study a non-random

sample of the population that undergoes random allocation to

treatments. It could be argued then that the results of the trial cannot be

generalized to the population at large and it is inappropriate to determine

the probability of any differences based on sampling theory from the

population.

17,18

An alternative permutation test

works out all the possible outcomes

given your sample size, determines the likelihood of each of them and

3 0

**Statistical Methods for Anaesthesia and Intensive Care
**

then calculates how likely it is to have achieved the given result or one

more extreme.

As a simple example, consider tossing a coin six times to determine

whether or not the coin was biased. If the outcome of interest is the

number of heads, the probabilities range from: P(0 heads) to P(6 heads).

We can thus work out the probability of getting a result as extreme as 1

head by P(0 heads) + P(1 head) + P(5 heads) + P(6 heads) = 14/64! This

being greater than an arbitrary P value of 0.05 would lead us to conclude

that the coin was not biased. However if we got 0 heads, the probability

of a result this extreme is P(0 heads) + P(6 heads) = 2/64, that is less than

an arbitrary P value of 0.05 would lead us to conclude that the coin was

biased.

Note that we have used a two-tailed hypothesis in these

examples.

Thus the permutation tests provide a result exactly applicable to the

samples under study and make no assumptions about the distribution of

the underlying (and remaining) population. We are however usually

interested in generalizing our specific sample results to the population

and it appears that permutation tests do not permit this. However, if

other researchers replicate the trial with different samples and achieve

similar conclusions, then the weight of evidence would lend us to

support (or reject) the overall hypothesis.

The only permutation test in common use is

Fisher's exact test

(Chapter 6). This is because the permutations are very time intensive and

only recently has the advent of personal computers made these tests

more feasible.

Bayesian inference

Use of

P value

to determine whether an observed effect is statistically

significant has its critics.

19,22

This is because a P value is a mathematical

statement of probability; it both ignores howlarge is the

treatment effect,

and conclusions based solely on it do not take into consideration prior

knowledge. Thus, an observed effect that is not statistically significant,

but is consistent with previous study results, is more likely to be true than

a similar treatment effect observed that had not been previously reported.

Clinicians do not consider a trial result in isolation. They generally

consider

what is already known and judge whether the new trial

information modifies their belief and practice. This is one explanation

why clinicians may reach different conclusions from the same study data.

A Bayesian approach incorporates prior knowledge in its con-

clusion.

0-22

It has been developed from

Bayes' theorem*, a formula used

to calculate the probability of an outcome, given a positive test result (see

Chapter 8). It combines the prior probability

and the study P value, to

calculate a

posterior probability.

For example, if a new alpha

2

agonist is tested to see if it reduces the

rate of myocardial ischaemia, it would be a plausible hypothesis because

of what is known about other alpha2agonists. The resultant P value is

Thomas Bayes (1763): `An essay towards solving a problem in the doctrine of chances'.

0.04. There are two possible explanations for this result. Either, it was a

chance finding (1 in 25) or it was a true effect of the newdrug. The second

option makes much more sense because of prior knowledge (this would

still be the case if the P value was 0.06).

Alternatively, if the same study also found that the newalpha2 agonist

reduced the rate of vomiting (P = 0.04), this would be more likely to be a

chance finding. This is because prior knowledge suggests that, if

anything, other alpha2agonists increase the risk of nausea and vomiting.

Bayesian inference has appeal because it considers the totality of

knowledge.

19-23

In fact, there are literally two opposing camps of statis-

ticians: frequentists and Bayesianists!

One of the criticisms of Bayesian inference is that the methods used to

determine prior probability are ill-defined.

References

Principles of probability and inference

31

1. Bland JM, Altman DG. One and two sided tests of significance.

Br Med J 1994;

309:248.

2. Gardner MJ, Altman DG. Confidence intervals rather than P values:

estimation rather than hypothesis testing. Br Med J

1986; 292:746-750.

3. Gardner MJ, Altman DG. Statistics with Confidence - Confidence Intervals

and Statistical Guidelines. British

Medical Journal, London, 1989.

4. Gin T, Mainland P, Chan MTV et al.

Decreased thiopental requirements in

early pregnancy. Anesthesiology 1997; 86:73-78.

5.

Chan MTV, Gin T, Poon WS. Propofol requirement is decreased in patients

with large supratentorial brain tumor.

Anesthesiology 1999; 90:1571-1576.

6. Gin T, Derrick J, Chan MTV, et

al. Postpartum patients have slightly

prolonged neuromuscular block following mivacurium.

Anesth Analg 1998;

86:82-85.

7. Florey C du V Sample size for beginners.

Br Med J 1993; 306:1181-1184.

8.

Altman DG. Statistics and ethics in medical research. III Howlarge a sample.

Br Med J 1980; 281:1336-1338.

9.

Campbell MJ, Julious SA, Altman DG. Estimating sample sizes for binary,

ordered categorical, and continuous outcomes in two group comparisons. Br

Med J 1995; 311:1145-1148.

10. Ngan Kee WD, KhawKS, Ma ML

et al. Postoperative analgesic requirement

after cesarean section: a comparison of anesthetic induction with ketamine or

thiopental. Anesth Analg

1997; 85:1294-1298.

11. Stetz CW, Miller RG, Kelly GE

et al. Reliability of the thermodilution method

in determination of cardiac output in clinical practice. Am Rev Respir Dis 1982;

126:1001-1004.

12. Bazaral

MG, Petre L, Novoa R. Errors in thermodilution cardiac output

measurements caused by rapid pulmonary artery temperature decreases after

cardiopulmonary bypass.

Anesthesiology 1992; 77:31-37.

13.

Goodman NW, Hughes AO. Statistical awareness of research workers in

British anaesthesia.

Br J Anaesth 1992; 68:321-324.

14. Frieman JA, Chalmers TC, Smith H

et al. The importance of beta, the type II

error and sample size in the design and interpretation of the randomized

controlled trial.

N Engl J Med 1978; 299:690-694.

15. Siegal S, Castellan NJ Jr. Nonparametric Statistics for the Behavioral Sciences.

2nd ed. McGraw-Hill, NewYork 1988.

16.

Moses LE, Emerson JD, Hosseini H. Statistics in practice. Analyzing data

from ordered categories.

N Engl J Med 1984; 311:442-448.

32

**Statistical Methods for Anaesthesia and Intensive Care
**

17.

Ludbrook J. Advantages of permutation (randomization) tests in clinical and

experimental pharmacology and physiology. Clin Exp Pharmacol Physiol 1994;

21:673-686.

18.

Ludbrook J, Dudley H. Issues in biomedical statistics: statistical inference.

Aust NZ J Surg 1994; 64:630-636.

19.

Browner WS, Newman TB. Are all significant p values created equal? The

analogy between diagnostic tests and clinical research. JAMA

1987;

257:2459-2463.

20. Brophy JM, Joseph L. Bayesian interim statistical analysis of randomised

trials.

Lancet 1997; 349:1166-1168.

21.

Goodman SN. Towards evidence based medical statistics: the P value fallacy.

Ann Intern Med

1999;130:995-1004.

22. Goodman SN. Towards evidence based medical statistics: the Bayes factor.

Ann Intern Med 1999;130:1005-1013.

23.

Davidoff F. Standing statistics right side up.

Ann Intern Med 1999;

130:1019-1021.

4

Research design

Bias and confounding

Randomization techniques

-randomization and stratification

-block randomization

Types of research design

-stratification

-observation vs. experimentation

-minimization

-case reports and case series

Blinding

-case-control study

Sequential analysis

-cohort study

I

nterim analysis

-association vs. causation

**Data accuracy and data checking
**

-randomized controlled trial

Missing data

-self-controlled and crossover trials

I ntention to treat

Key points

. Bias is a systematic deviation from the truth.

• Randomization and blinding reduce bias.

• Confounding occurs when another factor also affects the outcome of interest.

• Observational studies may be retrospective, cross-sectional or prospective.

• The gold standard study is a double-blind randomized controlled trial (RCT).

• Sequential analysis (interim analysis) allows the early stopping of a trial as

soon as a significant difference is identified.

• Analysis of patients in an RCT should be by intention to treat'.

In the past, large dramatic advances in medicine (e.g. discovery of ether

anaesthesia) did not require a clinical trial to demonstrate benefit. Most

current advances have small-to-moderate benefits, and a reliable method

of assessment is required to demonstrate a true effect.

Bias and confounding

In a research study, an observed difference between groups may be a

result of treatment effect (a true difference), random variation (chance), or

a deficiency in the research design which enabled systematic differences

to exist in either the group characteristics, measurement, data collection

or analysis.' These deficiencies lead to bias, a systematic deviation from

the truth. There are many potential sources of bias in medical research.

Examples include:

1. Selection bias - where group allocation leads to a spurious improved

outcome because one group is healthier or at lower risk than another

2. Detection bias - where measurements or observations in one group are

not as vigilantly sought as in the other

3. Observer bias - where the person responsible for data collection is able

to use their judgment as to whether an event occurred or not, or

determine its extent

3 6

**Statistical Methods for Anaesthesia and Intensive Care
**

surgical or anaesthetic techniques, or other aspects of the patients'

perioperative care. Ideally such reports should only be used to generate

hypotheses that should then be tested with more advanced research

designs.

Case-control study

A case-control study is an observational study that begins with a

definition of the outcome of interest and then looks backward in time to

identify possible exposures, or risk factors, associated with that outcome

(Figure 4.1).3 Patients who experienced this outcome are defined as

'cases'; patients who did not are defined as 'controls'. This study design

is particularly useful for uncommon events, as cases can be collected over

a long period of time (retrospective and prospective) or from specialist

units which have a higher proportion of patients with these outcomes.

The case-control study has been under-utilized in anaesthesia and

intensive care research. It should become more popular as large

departmental and institutional databases are established, thereby

allowing exploratory studies to be undertaken.

In most case-control studies the control patients are matched to the

cases on some criteria, usually age and gender. The aim of this matching

process is to equalize some baseline characteristics so that confounding is

reduced. This allows a more precise estimate of the effect of various

exposures of interest. It is important to understand that matched

Figure 4.1 Types of observational studies: (a) case-control study; (b) cohort

study; (c) cross-sectional study

characteristics cannot then be compared, because they will have equal

values for cases and controls! Another way of reducing the effects of

confounding is to use multivariate statistical adjustment (see Chapter 8).

It is common to increase the sample size of a case-control study by using

a higher proportion of controls than cases. Hence 1:2, 1:3 or 1:4 matching

is commonly used.

Once the cases and controls have been selected, the aim is to look back

in time

at specific exposures that may be related to the outcome of interest.

The exposures may include patient characteristics (such as severity of

illness, pre-existing disease, age group* or smoking status), drug admin-

istration, and type of surgery (or surgeon!). In order to minimize bias, the

definition of each of these exposures should be defined before the data

are collected and efforts used to acquire them should be equivalent for

cases and controls. 3

Rebollo et a1.4

investigated possible factors that were associated with

infection following cardiac surgery. Over a 12-month period they

identified postoperative infection in 89 patients ('cases') and these were

matched to 89 controls. They then retrospectively identified the following

perioperative characteristics

which were significantly associated with

infection: patient age > 65 years, urgent surgery, prolonged surgery and

use of blood transfusion.

Cohort study

A cohort study observes a group of patients forward in time

in order to

record their eventual outcome. A specific group of patients are identified

(a 'cohort'),

and these can be matched to one or several other control

groups for comparison (Figure 4.1). Because cohort studies are usually

performed prospectively, the accuracy of the data can be improved and

so results are generally accepted as being more reliable than in

retrospective case-control studies. 5

But because the outcome of interest

may occur infrequently, or take a long time to develop, this design may

require a large number of patients to be observed over a long period of

time in order to collect enough outcome events. A cohort study is

therefore relatively inefficient.

Observational studies can be used to estimate the risk

of an outcome in

patients who are exposed to a risk factor versus those not exposed. In

prospective cohort studies this is described by the risk ratio (also known

as relative risk). If exposure is not associated with the outcome, the risk

ratio is equal to one, if there is an increased risk, the risk ratio will be

greater than one, and if there is a reduced risk, the risk ratio will be less

than one. For example, if the risk ratio for smokers acquiring a

postoperative wound infection is 10, 6 then smokers have a 10-fold

increased risk of wound infection compared to non-smokers. If the risk

ratio for men reporting postoperative emesis is 0.6, then men have a 40%

reduction (1.0 - 0.6 = 0.4) in postoperative emesis compared with women.

The risk ratio can be expressed with 95% confidence intervals (CI). 7 If

this interval does not include the value of one, the association between

*Only if age was not used to match cases and controls.

Research design

37

3 8

**Statistical Methods for Anaesthesia and Intensive Care
**

Figure 4.2 In prospective cohort studies the risk ratio is equal to the risk of an

outcome when exposed compared to the risk when not exposed. For case-

control studies (outcome 'yes' = cases, outcome 'no' = controls), the value for

the denominator is unreliable and so the odds ratio is used as an estimate of

risk

exposure and outcome is significant (P < 0.05). Because accurate

information concerning total numbers is unavailable in a retrospective

case-control study (because sample size is set by the researcher),

incidence rate and risk cannot be accurately determined, and the

odds

ratio is used as the estimate of risk (Figure 4.2). Odds ratios can also be

expressed with 95%Cl. 7 The statistical methods used to analyse case-

control and comparative cohort studies are presented in more detail in

Chapter 6.

An example of a case-control study by Myles

et al.8 was one designed

to investigate the potential role of calcium antagonists and ACE

inhibitors in causing persistent vasodilatation after cardiac surgery.

Overall, 42 cases (patients with persistent low systemic vascular

resistance) were identified in a 12-month period and these were matched

for age and sex to 84 controls. Looking back at their preoperative

medications ('exposure'), 11 cases and 19 controls had been given ACE

inhibitors, and 22 cases and 62 controls had been given calcium

antagonists. Univariate ('unadjusted') odds ratios were 1.21 and 0.39,

respectively. These were 'adjusted' using a multivariate statistical

technique in order to balance for possible confounding - there was no

significant association for either ACE inhibitors (odds ratio 1.33, 95%Cl:

0.53-3.34) or calcium antagonists (odds ratio 0.49, 95%Cl: 0.21-1.13).

An example of a cohort study was one designed by Strom

et al.,9 who

investigated the adverse effects of the non-steroidal drug, ketorolac, on

postoperative outcome. They compared 10 272 patients who received

ketorolac, and matched them to 10 247 treated with opiates, investigating

the risk of gastrointestinal and operative site bleeding. The risk (odds)

ratio of gastrointestinal bleeding in those exposed to ketorolac was 1.3 (or

30%greater risk), 95%Cl: 1.11-1.52. This risk was increased in patients

treated longer than 5 days, and in those over 70 years of age. Because

their study design enabled them to include very large numbers of

patients, they were able to give a precise estimate of risk (i.e. narrow95%

Association vs. causation

Research design

39

CI).

Nevertheless, because patients were not randomized to each

treatment group (ketorolac or opiates), concern over potential bias and

confounding remain.

Because observational studies do not require a specific intervention, and

many departments and institutions now manage extensive patient

databases, it is relatively easy to obtain information on large numbers of

patients. Such studies form the basis of much epidemiological research

('study of the health of populations') and their value lies in their

efficiency, in that 100s or 1000s of patients can be analysed. But it must be

recognized that the results of observational studies depend heavily on

the accuracy of the original data set ('gigo': garbage in, garbage out). Bias

and confounding are difficult to avoid and should always be considered

as alternative explanations to an observed relationship between drug

exposure (or intervention) and outcome.

Even if a relationship between exposure and outcome is beyond doubt,

it does not prove that exposure caused the outcome: 'association does not

i

mply causation'. In order to demonstrate causation requires the

collective weight of evidence from a number of potential sources.

5,10,11

All the available evidence should be processed:

1. Is the evidence consistent?

2. Is there a demonstrated temporal sequence between drug exposure

and adverse outcome? This is particularly relevant for case-control

studies and case reports.

3. Is there a dose-response relationship (greater risk if exposed to higher

doses, or for longer periods)?

4. Is there biological plausibility?

It is the mounting body of supportive evidence that finally supports

causation.

Randomized controlled trial

The gold standard experimental research design is the

prospective

randomized controlled trial.

Here patients are allocated to a treatment

group and their outcome is compared with a similar group allocated to

an alternative treatment. This reference group is called the

control group

and its role is to represent an equivalent patient group who have not

received the intervention of interest. A controlled trial therefore allows

meaningful conclusions concerning the relative benefits (or otherwise) of

the intervention of interest. There may be one or more control groups,

consisting of patients who receive a current standard treatment or

placebo.

A randomized controlled trial is sometimes referred to as a

parallel

groups trial,

in that each group is exposed to treatment and then

followed concurrently forwards in time to measure the relative effects of

each treatment (Figure 4.3).

40

**Statistical Methods for Anaesthesia and Intensive Care
**

Figure 4.3 Comparing two treatments (A and B) with a standard parallel trial

design or a crossover trial design: (a) parallel groups trial; (b) crossover trial

Trials which compare an active treatment to placebo can demonstrate

whether the active treatment has some significant effect and/or to

document the side-effects of that active treatment (relative to placebo). If

the control group is to receive another active treatment (usually current

standard treatment), then the aim of the trial is to demonstrate that the

new treatment has a significant advantage over current treatment. This

often has more clinical relevance.

For example, Suen et a1.

12

enrolled 204 women undergoing

laparoscopic gynaecological surgery and randomly allocated each to

receive ondansetron or placebo, and measured the incidence of nausea

and vomiting. Patients were followed up for 24 hours. The investigators

clearly demonstrated that ondansetron was an effective anti-emetic in

that it reduced emetic symptoms by approximately 50%.

Fujii et a1.

13

randomized 100 women undergoing major gynaecolgical

surgery into two groups (domperidone 20 mg, or granisetron 2 mg). They

chose domperidone as the comparator because it was a commonly used

anti-emetic.

They clearly demonstrated that granisetron was more

effective in their patient population. This provides additional, clinically

useful information.

Self-controlled and crossover trials

It is difficult to detect a significant difference between groups in trials

when the observation of interest is subject to a lot of variation

('background noise'). This

variance

can be reduced by restricting trial

entry (excluding patients who have extreme values or who may react

differently to the intervention) or by improving measurement precision.

Stratification and blocking can be used to equalize potential confounding

variables (see later).

Another way of reducing variance is to allow each patient to act as

their own control and so all patient characteristics affecting the outcome

Research design

4 1

of interest are equalized. This is a suitable design to test the effect of a

new drug or treatment on a group of patients and is known as a

self-

controlled, or before and after

study. Here, baseline measurements are

taken, the treatment is then given and after an appropriate period of time

measurements are repeated. Assuming that the patients were otherwise

in a stable condition, then any change in the observation can be attributed

to the effect of the treatment. The appropriate methods to analyse this

' paired' data are presented in Chapters 5 and 6.

When two or more interventions are to be compared in patients who

act as their own control, they need to be exposed to each of the

treatments. This requires they be crossed over from one treatment group

to the next; this is known as a

crossover trial.

14,15

This is best achieved by

randomizing patients to each treatment, measuring the effect, and then

giving them the alternative treatment, followed by another set of

measurements (see Figure 4.4).

Crossover trials are most useful when assessing treatment for a stable

disease, or where the intervention being tested has a short period of onset

and offset and the effect can be measured quickly. Each treatment period

can be separated by a '

washout period', enabling the effect of the first

treatment to dissipate before testing the second treatment.

This is a very powerful research design as it avoids the confounding

effect of patient characteristics, thereby markedly reducing variance and

maximizing the likelihood of detecting a treatment effect. It is a very

efficient design: the sample size required to show a difference between

groups can often be substantially reduced.

Nevertheless, readers must be aware of potential problems in using

this design.

14,15

These include avoiding a carry-over effect

(where the

effects of the first treatment are still operating when the next treatment is

evaluated), period effect

(where there is a tendency for the group either

to improve or deteriorate over time - the second treatment evaluation

will be confounded by this) and sequence effect

(where the order of

treatments also has an effect on outcome). There are statistical methods

available to investigate these effects.

14,15

Patient dropouts also have a

potent adverse impact on study power (because each patient contributes

data at all periods, for each treatment), and because each patient requires

at least two treatments, trial duration usually needs to be extended.

Despite these concerns, crossover trials are a very efficient method to

compare interventions, when well designed and correctly analysed. If

possible, the crossover point should be blinded to reduce bias.

In general, crossover trials have been under-utilized in anaesthesia

research. If readers are considering employing,

a crossover trial design,

we would recommend two excellent reviews.1

15

Some examples can be

found in the anaesthetic and intensive care literature.

16,17

When interest is focused on one patient and their response to one or

more treatments, this is known as an '

n-of-1 trial'. This is usually per-

formed in the setting of clinical practice when treating a specific patient

(

who may be resistant to standard treatment, or who warrants a new or

experimental treatment). 18

The results are not intended to be generalized

to other patients. Ideally, an n-of-1 trial should be performed under

blinded conditions. This trial design may be useful when optimizing, say,

42

**Statistical Methods for Anaesthesia and Intensive Care
**

an anaesthetic technique for a patient requiring repeated surgical

procedures or a course of electroconvulsive therapy. It may also be used

in intensive care, say to optimize a sedative or analgesic regimen for a

problematic patient. Obviously the trial result will only apply to that

patient, but this is a useful method of objectively measuring individual

response to changes in treatment.

Randomization techniques

To minimize bias, the allocation of patients to each treatment group

should

be randomized. The commonest method is

simple

randomization,

which allocates patients in such a way that each has an

equal chance* of being allocated to any particular group and that process

is not affected by previous allocations. This is usually guided by referring

to a table of random numbers or a computer-generated list. This

commonly used method has the value of simplicity, but may result in

unequal numbers of patients allocated to each group or unequal

distribution of potential confounding factors (particularly in smaller

studies).

Block randomization

is a method used to keep the number of patients

in each group approximately the same (usually in blocks with random

sizes of 4, 6 or 8). As each block of patients is completed, the number in

each group will then be equal.

Stratification

is a very useful method to minimize confounding.

2,19-21

The identified confounding factors act as criteria for separate

randomization schedules, ensuring that the confounding variables are

equalized between groups. Here, the presence of a confounding variable

divides patients into separate blocks (with and without the confounder),

and each block of patients is separately randomized into groups, so that

ultimately equal numbers of patients with that particular confounder will

be allocated to each group. This allows a clearer interpretation of the

effect of an intervention on eventual outcome. Common stratifying

variables include gender, patient age, risk strata, smoking status (these

depend on whether it is considered they may have an effect on a

particular outcome) and, for multi-centred research, each institution.

For example, in a study investigating the potential benefits of lung

CPAP during cardiopulmonary bypass in patients undergoing cardiac

surgery, Berry et a1. 22

first stratified patients according to their

preoperative ventricular function in order to balance out an important

confounding factor known to have an effect on postoperative recovery.

The patients were then randomly allocated into groups. This method

resulted in near-equal numbers of patients in both groups having poor

ventricular function (Figure 4.4).

A Latin square design

is a more complex method to control for two or

more confounding variables. Here the levels of each confounding

*In some trials there may intentionally be an uneven allocation to each treatment group, so that

there is a fixed (known) chance other than 0.5. This is a valid method to increase the sample size of

a particular group if its variance is to be more precisely estimated.

Research design

4

3

Figure 4.4 An example of stratified randomization adapted from Berry et al.

22

Here, patients are first stratified, or divided, into blocks (according to left

ventricular function) and then separately randomized in order to equalize the

numbers of patients in each group with poor ventricular function. This reduces

confounding

variable make up the rows and columns of a square and patients are

randomized to each treatment cell.

23

Minimization is another method of equalizing baseline characteristics.

Here the distribution of relevant patient (or other) factors between the

groups is considered when the next patient is to be enrolled and group

allocation is aimed at minimizing any imbalance. Minimization is a

particularly useful method of group allocation when there is a wish to

equalize groups for several confounding variables.

24

Minimization also has advantages in situations where randomization

is unacceptable because of ethical concerns (see Chapter 12).

However, because group allocation is no longer randomly determined,

minimization may expose a trial to bias. For example, a selection bias

may occur whereby 'sicker' patients are placed into a control group. A

solution to this can be achieved by retaining random allocation, but

modifying the ratio of random allocation (from 1:1 to, say, 1:4), to increase

the chance the next patient will be allocated to the desired group.

Knowledge of group allocation should be kept secure (blind) until after

the patient is enrolled in a trial. Prior knowledge may affect the decision

to recruit a particular patient and so distort the eventual generalizability

of the trial results. The commonest method is to use sealed, opaque

envelopes.

Blinding

Blinding of the

patient (single-blind), observer (double-blind), and

investigator or person responsible for the analysis of results (sometimes

referred to as triple-blind)

can dramatically reduce bias. It is otherwise

tempting for the subject or researcher to consciously or unconsciously

distort observations, measurement, recordings, data cleaning or analyses.

In a case-control study, the person responsible for identification and

recording of exposure should be blinded to group identity.4 Similarly, in

a comparative cohort study, the observer should be blinded to group

identity when identifying and recording outcome events. In clinical trials,

44

**Statistical Methods for Anaesthesia and Intensive Care
**

a double-blind design should be used whenever possible. If, because of

the nature of the intervention, it is impossible to blind the observer or

investigator, then a separate person who remains blinded to group

identity should be responsible for identifying eventual outcome events. If

a patient cannot be blinded to their treatment group, then outcome events

should be objectively and clearly predetermined in order to reduce

detection and reporting bias. Efforts made to maximize blinding in trial

design are repaid by improved scientific credibility and enhanced impact

on clinical practice.

Sequential analysis

If the results of a clinical trial are analysed and it is found that there is no

significant difference between groups, then some investigators continue

the trial in order to recruit more patients, hoping that the larger numbers

will eventually result in statistical significance. This is grossly incorrect.

23

Some investigators do not even report that they have done this and so the

reader remains unaware of this potent source of bias. The more 'looks' at

the data, the greater the chance of a type I error (i.e. falsely rejecting the

null hypothesis and concluding that a difference exists between

groups).

25-30

Sequential analysis is a collection of valid statistical methods used to

repeatedly compare the outcome of two groups while a trial is in

progress. '6 This allows a clinical trial to be stopped as soon as a

significant difference between groups is identified. This is a useful trial

design to investigate rare conditions, as the traditional randomized

controlled trial could take a very long time to recruit sufficient numbers

of patients, or would need to be multi-centred (and this requires much

greater effort to establish and run). Sequential analysis is also a good

method to investigate potential treatments for serious life-threatening

conditions, as once again, by the time a traditional trial is completed, it

may be found that many patients were denied life-saving treatment.

Sequential analysis is also useful if there are ethical concerns about added

risk of a newtreatment.

26-28 In these cases, sequential analysis is a valid,

cost-efficient approach used to detect a significant difference between

groups as soon as possible.

As the outcome of each patient (or pair of patients) is established, a

sequential line is plotted on a graph with preset boundary limits (Figure

4.5). If either of the two limits is broached, a conclusion of difference is

made and the trial stopped (so, in effect, there is a statistical comparison

made each time a preference is determined). If the boundary limits are

not broached and the plotted line continues on to cross the right-hand

boundary (the preset sample size limit), the trial is stopped and a

conclusion of no difference is made. Boundary limits can be calculated for

different significance levels (usually P < 0.05 or P < 0.01), using binomial

probability (see Chapter 2) or a paired non-parametric test (see Chapters

5 and 6).

There are several other ways of developing and using boundary limits

Research design

45

Figure 4.5 A sequential design comparing two treatments (A and B), using a P

value of 0.05 as the stopping rule. A preference for A leads the line upwards

and to the right; a preference for B leads the line downwards and to the right.

When a boundary limit is crossed, a conclusion of significant difference is made

in sequential analysis. Some examples can be found in the anaesthetic

literature; these include investigation of anaesthetic drug thrombo-

phlebitis, 31 treatment of postoperative nausea and vomiting,

3

and use of

low molecular weight heparin to prevent deep venous thrombosis after

hip surgery33

The above method of sequential analysis is seldom used in current

medical research (unfortunately, College and Board examiners persist in

asking about it!). A more appropriate modification is interim analysis.

I nterim analysis

Interim analysis is also a method of repeatedly comparing groups while

a trial is still in progress. But, in contrast to sequential analysis, these are

preplanned statistical comparisons between groups (usually) on a

restricted number of specified occasions during the trial. This approach

has become the most common method to use when there is a requirement

to stop a trial early, once sufficient evidence of significant difference is

obtained, without jeopardizing statistical validity. A significant difference

is inferred if one of these comparisons results in a P value that is smaller

than a pre-specified value and the trial is generally stopped. Thus, each

pre-specified type I error can be described as a 'stopping rule'. Most

commonly, two to four comparisons are made (usually after a certain

proportion of patients are enrolled). The number and timing of these

comparisons should be determined before the trial is commenced.

Interim analysis is universally employed in large, multi-centred trials.

In many cases the interim analyses are performed blind, independent of

the investigators, by a Data and Safety Monitoring Committee.

26,27

This

4 8

**Statistical Methods for Anaesthesia and Intensive Care
**

be lost to follow-up. They may also refuse the allocated treatment,

unintentionally receive the comparator treatment, or receive other

treatments which may affect the outcome of interest. Some investigators

(and clinicians) only analyse those patients who received the study

treatment (per protocol analysis).

This approach seems intuitively

obvious to many clinicians, as they are only interested in the effect of the

actual treatment (not what happened to those who did not receive it). But

a per protocol analysis can be misleading, particularly if the allocated

treatment has side-effects or is ineffective in some patients. Per protocol

analysis may then over-estimate the true benefit and under-estimate

adverse effects. The most valid method is to use

intention to treat

analysis, so that all patients who were enrolled and randomly allocated

to treatment are included in the analysis. This gives a more reliable

estimate of true effect in routine practice because it replicates what we

actually do - we consider a treatment and want to know what is most

likely going to happen (thus accommodating for treatment failure, non-

compliance, additional treatments, and so on).

Thus, if 20%of epidurals are ineffective, say because of failed insertion,

displacement or inadequate management, 6 and a theoretical study

demonstrates a reduction in major complications on per protocol

analysis, it may be explained by an actual shift in group identity (Table

4.2). A real example can be found in the recent anaesthetic literature,

where Bode et al.

37

found no significant effect of regional anaesthesia in

peripheral vascular surgery on intention to treat analysis, but that those

who had a failed regional technique had a higher mortality than those

who did not.

A per protocol analysis is sometimes used appropriately when

analysing adverse events in drug trials, as it can be argued that the side-

effects of the actual treatment received is clinically relevant in that

circumstance.

Table 4.2 Effect of how groups are analysed if four patients did not receive their

allocated (epidural) treatment and were treated with patient-controlled anaesthesia

(PCA) (and three of these had major complications). A per protocol analysis would

consider these patients in the PCA group. The recommended approach is to use intention

to treat analysis. The resultant P value of 0.33 suggests that the observed difference could

be explained by chance

A. Per protocol analysis

Epidural group (n = 16) PCA group (n = 24) P value

Complicati ons 3(19%) 13(54%) 0.047

B. Intention to

treat analysis

Epidural group (n = 20) PCA group (n = 20) P value

Complications

6(30%) 10(50%) 0.33

References

Research design

4 9

1.

Sackett DL. Bias in analytic research. J Chron Dis 1979; 32:51-63.

2.

Rothman KJ. Epidemiological methods in clinical trials.

Cancer 1977;

39:1771-1779.

3.

Horwitz RI, Feinstein AR. Methodologic standards and contradictory results

in case-control research. Am J

Med 1979; 66:556-562.

4. Rebollo MH, Bernal JM, Llorca J

et al. Nosocomial infections in patients

having cardiovascular operations: a multivariate analysis of risk factors. J

Thorac Cardiovasc Surg

1996;112:908-913.

5. Sackett DL, Haynes RB, Guyatt GH

et al. Clinical Epidemiology: A Basic

Science for Clinical Medicine, 2nd ed. Boston: Little Brown, 1991:283-302.

6.

Kurz A, Sessler DI, Lenhardt R. Perioperative normothermia to reduce the

incidence of surgical wound infection and shorten hospitalization. N Engl

J

Med 1996; 334:1209-1215.

7.

Morris JA, Gardner MJ. Calculating confidence intervals for relative risks,

odds ratios, and standardised ratios and rates. In: Gardner MJ, Altman DG.

Statistics

with Confidence - Confidence Intervals and Statistical Guidelines.

London: BritishMedical journal,

1989:50-63.

8. Myles PS, Olenikov I, Bujor MA

et al. ACE-inhibitors, calcium antagonists

and lowsystemic vascular resistance following cardiopulmonary bypass. A

case-control study.

MedJ Aust 1993;158:675-677.

9. Strom BL, Berlin JA, Kinman JL

et al. Parenteral ketorolac and risk of

gastrointestinal and operative site bleeding. A postmarketing surveillance

study. JAMA

1996; 275:376-382.

10.

Hill AB. The environment and disease: association or causation?

Proc Roy Soc

Med 1965:295-300.

11.

Myles PS, Power 1. Does ketorolac cause renal failure - howdo we assess the

evidence? BrJ

Anaesth 1998; 80:420-421 [editorial].

12. Suen TKL, Gin T, Chen PP

et al.

Ondansetron 4 mg for the prevention of

nausea and vomiting after minor laparoscopic gynaecological surgery.

Anaesth Intensive Care

1994; 22:142-146.

13. Fujii Y, Saitoh Y, Tanaka H

et al.

Prophylactic oral antiemetics for preventing

postoperative nausea and vomiting: granisetron versus domperidone.

Anesth

Analg

1998; 87:1404-1407.

14.

Loius TA, Lavori PW Bailar JC et al.

Crossover and self-controlled designs in

clinical research.

N Engl J Med 1984; 310:24-31.

15.

Woods JR, Williams JG, Tavel M. The two-period crossover design in medical

research.

Ann Intern Med 1989;110:560-566.

16.

Ngan Kee WD, Lam KK, Chen PP, Gin T. Comparison of patient-controlled

epidural analgesia with patient-controlled intravenous analgesia using

pethidine or fentanyl.

Anaesth Intensive Care

1997; 25:126-132.

17.

Myles PS, Leong CK, Weeks AMet al.

Early hemodynamic effects of left atrial

administration of epinephrine after cardiac transplantation.

Anesth Analg

1997; 84:976-981.

18.

Guyatt GH, Keller JL, Jaeschke R et al.

The n-of-1 randomized controlled trial:

clinical

usefulness. Our three year experience.

Ann Intern Med 1990;

112:293-299.

19.

Altman DG. Comparability of randomised groups.

Statistician 1985;

34:125-136.

20.

Altman DG, Dore CJ. Randomisation and baseline comparisons in clinical

trials.

Lancet 1990; 335:149-153.

21. Lavori PW, Louis TA, Bailar JC et al.

Designs for experiments - parallel

comparisons of treatment. N Engl J Med

1983; 309:1291-1299.

5 0

**Statistical Methods for Anaesthesia and Intensive Care
**

22. Berry CB, Butler PJ, Myles PS. Lung management during cardiopulmonary

bypass: is continuous positive airways pressure beneficial? Br J Anaesth 1993;

71:864-868.

23. Armitage P Statistical Methods in Medical Research. London: Blackwell

Scientific Publications, 1985:239-245.

24. Treasure T, Macrae KD. Minimisation: the platinum standard for trials? BMJ

1998; 317:362-363.

25. McPherson K. Statistics: the problem of examining accumulating data more

than once. N Engl J Med 1974; 290:501-502.

26. Armitage P Sequential methods in clinical trials. Am J Pub Health 1958;

48:1395-1402.

27. Pocock SJ. Statistical and ethical issues in monitoring clinical trials. Stat

Med

1993; 12:1459-1469.

28. Task Force of the Working Group on Arrhythmias of the European Society of

Cardiology. The early termination of clinical trials: causes, consequences, and

control - with special reference to trials in the field of arrhythmias and

sudden death. Circulation 1994; 89:2892-2907.

29. Geller NL, Pocock SJ. Interim analyses in randomized clinical trials:

ramifications and guidelines for practitioners. Biometrics 1987; 43:213-223.

30. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials.

Biometrics 1979; 35:549-556.

31. Boon J, Beemer GH, Bainbridge DJ et al. Postinfusion thrombophlebitis: effect

of intravenous drugs used in anaesthesia. Anaesth Intensive Care 1981; 9:23-27.

32. Abramowitz MD, Oh TH, Epstein BS et al. The antiemetic effect of droperidol

following outpatient strabismus surgery in children. Anesthesiology 1983;

59:579-583.

33. Samama CM, Clergue F, Barre J et al. Lowmolecular weight heparin

associated with spinal anaesthesia and gradual compression stockings in total

hip replacement surgery. Br J Anaesth 1997; 78:660-665.

34. Shearer PR. Missing data in quantitative designs. J Royal Statist Soc Ser C Appl

Statist 1973; 22:135-140.

35. Ludington E, Dexter F. Statistical analysis of total labor pain using the visual

analog scale and application to studies of analgesic effectiveness during

childbirth. Anesth Analg 1998; 87:723-727.

36. Burstal R, Hayes C, Lantry G et al. Epidural analgesia - a prospective audit of

1062 patients. Anaesth Intensive Care 1998; 26:165-173.

37. Bode RH Jr, Lewis KP, Zarich SW et al. Cardiac outcome after peripheral

vascular surgery. Comparison of general and regional anesthesia.

Anesthesiology 1996; 84:3-13.

5

Comparing groups: numerical data

Parametric tests

-Student's Mest

-analysis of variance (ANOVA)

-repeated measures ANOVA

Key points

Numerical data that are normally distributed can be analysed with parametric

tests.

Student's

Mest is a parametric test used to compare the means of two

groups.

The unpaired t-test is used to compare two dependent groups.

A one-tailed t-test is used to look for a difference between two groups in only

one direction (i.e. larger or smaller).

Analysis of variance ( ANOVA) i

s a parametric test used to compare the

means of two or more groups.

Mann-Whitney U test is a non-parametric equivalent to the unpaired Mest.

• Kruskal-Wallis test is a non-parametric equivalent of

ANOVA.

Numerical data may be continuous or ordinal (see Chapter 1).

Continuous data are sometimes further divided into ratio or interval

scales, but this division does not influence the choice of statistical test.

This chapter is concerned with the various methods used to compare the

central tendency of two or more groups when the data are on a numerical

scale.

Numerical data that are normally distributed can be analysed with

parametric tests. These tests are based on the parameters that define a

normal distribution: mean and standard deviation (or variance).

Parametric tests

The parametric tests assume that:

Non-parametric tests

-Mann-Whitney U test (Wilcoxon rank

sum test)

-Wilcoxon signed ranks test

-Kruskal-Wallis ANOVA

-Friedman two-way

ANOVA

1. Data are on a numerical scale

2.

The distribution of the underlying population is normal

3. The samples have the same variance

('homogeneity of variances')

4. Observations within a group are independent

5.

The samples are randomly drawn from the population

If it is uncertain whether the data are normally distributed they can be

plotted and visually inspected, and/or tested for normality, using one of

5 2

**Statistical Methods for Anaesthesia and Intensive Care
**

a number of goodness of fit tests.

One example is the Kolmogorov-

Smirnov test. 1 This compares the sample data with a normal distribution

and derives a P value; if P > 0.05 the null hypothesis is accepted (i.e. the

sample data are not different from the normal distribution) and the data

are considered to be normally distributed.

Non-normal, or skewed, data can be transformed so that they

approximate a normal distribution.2-4

The commonest method is a log

transformation,

whereby the natural logarithms of the raw data are

analysed to calculate a mean and standard deviation. The antilogarithm

of the mean of this transformed data is known as the

geometric mean. If

the transformed data are shown to approximate a normal distribution,

they can then be analysed with parametric tests.

Large sample (studies with, say,

n > 100) data approximate a

normal distribution and can nearly always be analysed with parametric

tests.

The requirement for observations within a group to be independent

means that multiple measurements from the same subject cannot be

treated as separate individual observations. Thus, if three measurements

are made on each of 12 subjects, these data cannot be considered as 36

independent samples. This is a special case of repeated measures

and

requires specific analyses (see later). The requirement for samples to be

drawn randomly from a population is rarely achieved in clinical trials,

5

but this is not considered to be a major problem as results from inferential

statistical tests have proved to be reliable in circumstances where this rule

was not followed.

Student's t-test

Student's t-test is used to test the null hypothesis that there is no

difference between two means. It is used in three circumstances:

• to test if a sample mean (as an estimate of a population mean) differs

significantly from a given population mean (this is a

one-sample

t-test)

• to test if the population means estimated by two independent samples

differ significantly (the unpaired t-test)

• to test if the population means estimated by two dependent samples

differ significantly (the paired t-test).

The t-test can be used when the underlying assumptions of parametric

tests are satisfied (see above). However the t-test is considered to be a

robust test, in that it can accommodate some deviation from these

assumptions. This is one of the reasons why it has been a popular test in

clinical trials, where small samples (say n < 30) are commonly studied.

The t-test only compares the means of the two groups. Without

formally testing the assumption of equal variance, it is possible to accept

the null hypothesis and conclude that the samples come from the same

population when they in fact come from two different populations that

have similar means but different variances. The group variances can be

compared using the F test. The F test is the ratio of variances (var l /var2);

Comparing grou ps: nu merical data

5 3

Figure 5.1 Howa t distribution (for

n = 10) compares with a normal

distribution. A t distribution is broader and flatter, such that 95%of

observations lie within the range mean ± t x SD (t = 2.23 for n = 10) compared

with mean ± 1.96 SD for the normal distribution

if

F differs significantly from 1.0 then it is concluded that the group

variances differ significantly.*

The t distribution was calculated by W.L. Gosset of the Guinness

Brewing Company under the pseudonym Student (company policy

prevented him from using his real name). A sample from a population

with a normal distribution is also normally distributed if the sample size

is large.

With smaller sample sizes, the likelihood of extreme values is

greater, so the distribution 'curve' is flatter and broader (Figure 5.1). The

t

distribution, like the normal distribution, is also bell shaped, but has

wider dispersion - this accommodates for the unreliability of the sample

standard deviation as an estimate of the population standard deviation.

There is a t distribution curve for any particular sample size and this is

identified by denoting the t distribution at a given

degree of freedom.

Degrees of freedom is equal to one less than the sample size (d.f. = n -1).

It

describes the number of independent observations available. As the

degrees of freedom increases, the t distribution approaches the normal

distribution. Thus, if you refer to a t-table in a reference text you can see

that, as the degrees of freedom increases, the value of t approaches 1.96 at

a P value of 0.05. This is analogous to a normal distribution where 5%of

values lie outside 1.96 standard deviations from the mean.

The t-test is mostly used for small samples. When the sample size is

large (say,

n > 100), the sampling distribution is nearly normal and it is

possible to use a test based on the normal distribution (a

test).

Theoretically, the t-test can be used even if the sample sizes are very small

(n <

10), as long as the variables are normally distributed within each

group and the variation of scores in the two groups is not too different.

*The value of F that defines a significant difference (say P < 0.05) depends on the sample

size (degrees of freedom); this can be found in reference tables (F table) or can be

calculated using statistical software.

5 4

**Statistical Methods for Anaesthesia and Intensive Care
**

The simplified formulae for the different forms of the t-test are:

where X = sample mean,

u =

population mean, and SE = standard error.

where d is the mean difference, and SE denotes the standard error of this

difference.

In each of these cases a P value can be obtained from a t-table in a

reference text. More commonly now, a P value is derived using statistical

software. The P value quantifies the likelihood of the observed difference

occurring by chance alone. The null hypothesis (no difference) is rejected

if the P value is less than the chosen type I error (a). Thus, it can be

concluded that the sample(s) are subsets of different populations.

The t value can also be used to derive 95% confidence intervals (95%

CI). 7 In Chapter 3 we described how 95%CI can be calculated as the

mean ± (1.96 x standard error). In small samples it is preferable to use the

value of t rather than 1.96:

1. 95%CI of the group mean = mean ± (t value x SE)

2. 95%Cl of the difference between groups = mean difference ± (t value

x SE of the difference).

If the 95%CI of the difference between groups does not include zero

(i.e.

no difference), then there is a significant difference between groups

at the 0.05 level. Thus, the 95%CI gives an estimate of precision, as well

as indirectly giving the information about the probability of the observed

difference being due to chance.

For example, Scheinkestel et al.$ studied the effect of hyperbaric oxygen

(

HBO) therapy in patients with carbon monoxide poisoning. They

reported the following results for a verbal learning test (with higher

scores indicating better function): HBO group 42 vs. control (normal

*The SE is calculated from a pooled standard deviation that is a weighted average of the

t wo sample variances:

Comparing grou ps: nu merical data

5 5

oxygen) group 49.2. The mean difference was -7.2 and the 95%CI of the

difference was -12.2 to -2.2. Thus, the 95%CI did not include the zero

value, and so it can be concluded that there was a statistically significant

difference between groups. The interval 2.2-12.2 was fairly wide and so

the study did not have high precision for this estimate of effect. The

authors concluded that HBO therapy does not improve outcome in

carbon monoxide poisoning.

Unpaired vs. paired tests

Unpaired tests are used when two different ('independent') groups are

compared. Paired tests are used when the two samples are matched or

paired ('dependent'): the usual setting is when measurements are made

on the same subjects before and after a treatment.

It is useful to try and reduce variability within the sample group to

make more apparent the difference between groups. In the t-test, this has

the effect of reducing the denominator and making the t value larger.

With all samples, there is variability of inherent characteristics that may

influence the variable under study. For example, in a two-group unpaired

comparison of a drug to lower blood pressure, different patients will

have a variety of factors that may affect the blood pressure immediately

before intervention. These initial differences contribute to the total

variability within each group

(variance). By using the same subjects twice

in a before and after treatment

design, there is reduced individual and total

within group variance.

Another example of a paired design is a crossover design of two

treatments when instead of using two groups, each receiving one

treatment, the same group receives the two drugs on two separate

occasions (see Chapter 4). Because the same group of patients is used,

there is less variability.

In the analysis of paired designs, instead of treating each group

separately and analysing raw scores, we can look only at the

differences

between the two measures in each subject. By subtracting the first score

from the second for each subject and then analysing only those

differences, we will exclude the variation in our data set that results from

unequal baseline levels of individual subjects. Thus a smaller sample size

can be used in a paired design to achieve the same

power as an unpaired

design.

It is useful to take another viewof the t-test procedure because it may

be helpful in understanding the basis of analysis of variance.

When

comparing central location between samples, we actually compare the

difference (or variability) between samples with the variability

within

samples. Intuitively, if the variability between sample means is very large

and the variability within a sample is very low, then it will be easier to

detect a difference between the means. Conversely if the difference

between means is very small and the variability within the sample is very

large, it will become more difficult to detect a difference.

If we look at the formula for the t-test, the difference between means is

the numerator. If this is small relative to the variance within the samples

(the denominator), the resultant t value will be small and we are less

likely to reject the null hypothesis (Figure 5.2).

5 6

**Statistical Methods for Anaesthesia and Intensive Care
**

Figure 5.2 The effect of variance: when comparing two groups, the ability to

detect a difference between group means is affected by not only the absolute

difference but also the group variance. (a) Two curves of sampling distributions

with no overlap and easily detected difference; (b) means nowcloser together

causing overlap of curves and possibility of not detecting a difference; (c) means

same distance as in B but smaller variance so that there is no overlap and

difference easy to detect

Comparing groups: numerical data

5

7

If there is reason to look for a difference between mean values in only

one direction (i.e. larger or smaller), then a one-tailed t-test can be used.

This essentially doubles the chance of finding a significant difference (i.e.

increases power) (Figure 5.3). Some investigators have used a one-tailed

t-test because a two-tailed test failed to show a significant (P < 0.05)

result. This should not be done. A one-tailed t-test should only be used if

there is a valid reason for investigating a difference in only one

direction . 9 Ideally this should be based on known effects of the treatment

and be outlined in the study protocol before results are analysed (a

priori).

Comparing more than two groups

The t-test should not be used to compare three or more groups. 10,11

Although it is possible to divide three groups into three different pairs

and use the t-test for each pair, this will increase the chance of making a

type I error (conducting three t tests will have approximately a 3a-fold

chance of making a type I error).

If we consider a seven-group study, there are 21 possible pairs and an

a of 0.05, or 1/20 for each would make it likely that one of the observed

differences could have easily occurred by chance. The probability of

getting at least one significant result is 1-0.9521 = 0.66. There is a better

way to conduct multiple comparisons.

It is possible to divide the a value for each test by the number of

comparisons so that overall, the type I error is limited to the original a. 1

For example, if there are three t-tests, then an a of 0.05 would be reduced

to 0.0167 for each test and only if the P value was less than this adjusted

a would we reject the null hypothesis. This maintains a probability of

0.05 of making a type I error overall. This is known as the Bonferroni

correction.

However, it is apparent that as the number of comparisons increases,

the adjusted a becomes so small that it could be very unlikely to find a

Figure 5.3 Two-tailed and one-tailed t-tests. A one-tailed t-test is used to look

for a difference between mean values in only one direction (i.e. larger or

smaller). This increases the likelihood of showing a significant difference

(power). (a) For two-tailed a = 0.05, and a normal distribution the critical z

value is 1.96; (b) for one-tailed a = 0.05, and a normal distribution, the critical z

value is 1.645

5 8

**Statistical Methods for Anaesthesia and Intensive Care
**

difference and we risk making more type II errors. Thus the Bonferroni

correction is a conservative approach. The best way to avoid this is to

li mit the number of comparisons.

The comparison of means from multiple groups is better carried out

using a family of techniques broadly known as analysis of variance

(ANOVA).

6,11

Thus one important reason for using ANOVA methods

rather than multiple t-tests is that ANOVA is more powerful (i.e. more

efficient at detecting a true difference).

Analysis of variance (ANOVA)

In general, the purpose of ANOVA is to test for significant differences

between the means of two or more groups.

6 It seems contradictory

that a test that compares means is actually called analysis of variance.

However, to determine differences between means, we are actually

comparing the ratio of two variances - ANOVA is based on the F test of

variance.

From our discussion of the t-test, a significant result was more likely

when the difference between means is much greater than the variance

within the samples. With ANOVA, we also compare the difference

between the means (using variance as our measure of dispersion) with

the variance within the samples. The test first asks if the difference

between groups can be explained by the degree of spread (variance)

within a group. It divides up the total variability (variance) into the

variance within each group, and that between each group. If the observed

variance between groups is greater than that within groups, then there is

a significant difference.

These two variances are sometimes known as the between-group

variability and within-group variability. The within-group variability is

also known as the error variance because it is variation that we cannot

readily account for in the study design, being based on random differ-

ences in our samples. However we hope that the between-group, or effect

variance, is the result of our treatment. We can compare these two

estimates of variance using the F test.

There are many types of ANOVA and the only two we will consider

here are the extensions of the unpaired and paired t-test to circumstances

where there are more than two groups. Like the t-test, ANOVA uses the

same assumptions that apply to parametric tests. A simplified formula

for the F statistic is:

where MS is the mean squares between and within groups.

The formulae for mean squares are complex. If k represents the number

of groups and N the total number of results for all groups, the variation

between groups has degrees of freedom k-1, and the variation within

groups has degrees of freedom N-k. Thus if one uses reference tables to

look at critical value of the F distribution, the two degrees of freedom

must be used to locate the correct entry.

Comparing groups: numerical data

5 9

If only two means are compared, ANOVA will give the same results as

the

Nest.* In an analogous manner to the Mest described earlier, the F

statistic calculated from the samples is compared with known values of

the F distribution. A large F value indicates that it is more unlikely that

the null hypothesis is true. Decisions on accepting or rejecting the null

hypothesis are based on preset choices for a, the chosen type I error.

If

we simply compare the means of three or more groups, the ANOVA

is often referred to as a one-way or one-factor ANOVA. There is also two-

way ANOVA when two grouping factors are analysed, and multiple

analysis of variance (MANOVA)

when multiple grouping factors are

analysed. Another method is the

general linear model

(GLM), a form of

multivariate regression. The GLM calculates R

2, a measure of effect size.

R2 is

mathematically related to F and t.

An example of one-way ANOVA would be to compare the changes in

blood pressure after the administration of three different drugs. One may

then consider additional contributory factors by looking at, for example,

a gender-based difference in effect. This would be a two-factor (drug

treatment and gender) ANOVA. In such a case the ANOVA will return a

P value for the difference based on drug treatment and another P value

for the difference based on gender. There will also be a P value for the

interaction

of drug treatment and gender, indicating perhaps, for

example, that one drug treatment may be more likely to cause an effect in

female patients.

ANOVA is a

multivariate

statistical technique because it can test each

factor while controlling for all other factors and also enable us to detect

interaction effects between variables. Thus more complex hypotheses can

be tested and this is another reason why ANOVA is more powerful than

using multiple t-tests.

However, if the ANOVA returns a significant result, the ANOVA by

itself

will only tell us that there is a difference, not where the difference

lies. So if we are comparing three samples, a significant result will not

identify which sample mean is different to any other. Clearly this is not

that useful and we must make use of further tests (post hoc tests) to

identify the differences.b

One confusing aspect of ANOVA is that there are many post hoc tests

and there is not universal agreement among statisticians as to which tests

are preferred.b Statistical software packages often provide a limited

selection. Of the common tests, the

Fisher Protected Least Significant

Difference (LSD)

is the least conservative (i.e. most likely to indicate

significant differences) and the

Scheffe test is the most conservative but

the

most versatile because it can test complex hypotheses involving

combinations of group means. For comparisons of specifically selected

pairs of means, tests such as

Tukey's Honestly Significant Difference

(HSD) and Newman-Keuls

are often used. Dunnett's test

is used

specifically when one wishes to test just one sample mean against all the

others.

*Numerically F = tz.

60

**Statistical Methods for Anaesthesia and Intensive Care
**

Repeated measures ANOVA

As an extension of the paired t-test, we can imagine situations where we

take repeated measurements of the same variable under different

conditions or at different points in time. It is very common to have

repeated measures designs in anaesthesia, but the analysis of these is

complex and fraught with hazard.

11

An example is a study by Myles et a1.,

12

who measured quality of

recovery scores on each of three days after surgery in four groups of

patients (Figure 5.4). They found a reduction in quality of recovery scores

early after surgery in all groups, followed by a gradual improvement.

There were no significant differences between groups with ANOVA and

so post hoc tests looking at each time interval were not performed.

Although repeated measures designs can be very useful in analysing

this type of data, some of the assumptions of ANOVA may not be met

and this casts doubt on the validity of the analysis.

11

Homogeneity of

variance can be a problem because there is usually greater variation at

very lowreadings (assays at the limit of detection) or very high readings.

Transformation of data can be a possible solution. 2 A more important

consideration is analysis of residuals. Here the difference between the

group mean and individual values ('residuals') are analysed to check that

they are normally distributed.

Figure 5.4 Perioperative changes in mean quality of recovery (QoR) score (after

Myles et a1.

12

).

From this graph, one can easily imagine all the possible

comparisons that one could make between different points on the curves,

with the attendant problems of multiple comparisons. Time periods

0 = preoperative, 1 = recovery room discharge, 2 = at 2-4 h postoperatively,

3 = day 1 (am), 4 = day 1 (pm), 5 = day 2 (am), 6 = day 2 (pm), 7 = day 3 (am),

8 = day 3 (pm)

Comparing grou ps: nu merical data

61

One of the assumptions of repeated measures ANOVA is compound

symmetry, also known as multisample sphericity. This means that the

outcome data should not only have the same variance at each time but

also that the correlations between all pairs of repeated measurements in

the same subject are equal. This is not usually the case. In the typical

repeated measures examples above, values at adjacent dose or time

points are likely to be closer to one another than those further apart. If

uncorrected, the risk of type I error is increased. Correction factors

include the Greenhouse-Geisser and Hunyh-Feldt.

11

It is often possible to simplify the data and use summary measures of

the important features of each curve to compare the groups. 13 A simple

unpaired t-test can then be used to compare these summary measures

between groups. Some measures include: 13

• time to peak effect

• area under a time-response curve

•

mean effect over time.

For example, drug absorption profiles are conventionally summarized

by three results: (a) the time to peak concentration (t

ax), (b) the actual

peak concentration (

Cinax),

(c) the area under the curve (AUC) to a certain

ti

me point as a measure of overall absorption. In a comparison of the

interpleural injection of bupivacaine with and without adrenaline,

14 a

Mann-Whitney U test was used to showthat tmax was delayed and Cmax

was decreased in the adrenaline group (Figure 5.5). The conclusion was

Figure 5.5 Absorption of bupivacaine after interpleural administration with

(empty symbols) and without (plain, full symbols) adrenaline. The addition of

adrenaline delays systemic absorption of bupivacaine. 14

Mean (SD) and median

(range) pharmacokinetic data are detailed below

6 2

**Statistical Methods for Anaesthesia and Intensive Care
**

that the addition of adrenaline did decrease systemic absorption of

bupivacaine.

As another example, when paracetamol was used as a measure of

gastric emptying in intensive care patients, 15 the AUC to 30 min was less

in patients with intracranial hypertension, indicating delayed emptying

when compared with patients without intracranial hypertension.

As a more complex example, we may want to compare the haemo-

dynamic stability of two induction agents in a special patient group. We

might inject the drug and measure systolic arterial pressure for the first 5

min after injection. For each drug we might want to know: (a) the time to

maximum effect, and (b) the greatest change from baseline. These would

be analogous to the

Cmax

and

CmaX

in the previous example. However we

may also want to know: (a) when the blood pressure first changes from

the baseline (latency), (b) when the blood pressure returns to normal, and

(c)

whether or not these variables are different for each drug.

Rather than conduct multiple paired t-tests against the baseline blood

pressure, a repeated measures ANOVA is more appropriate. Note that

within each group, one can perform a repeated measures ANOVA to

compare the baseline data with subsequent readings. To test whether or

not there are any differences between groups, one would conduct a

repeated measures ANOVA on both groups at the same time, using

group as a factor. One can see that ANOVA designs can become quite

complex.

Note that these patients will have different baseline blood pressures.

Repeated measures ANOVA will take into account the different baseline

blood pressures when comparing subsequent differences. However, if the

baseline blood pressure is quite variable, then absolute changes may be

slightly

misleading. A drop in blood pressure of 30 mmHg is probably

more significant in someone with a baseline of 100 mmHg than one with

a baseline of 150 mmHg. This has led some authors to convert absolute

differences into percentage changes from baseline before analysis.

Another question may be to determine if, overall, there is a greater

change in blood pressure in one group than the other. This can pose

further problems. In a manner similar to the AUC for the drug absorption

examples, one could calculate the AUC or other summary measures such

as the mean or sum of the blood pressure readings in each patient.

However if the time limit chosen is too long, one may not detect any

differences between groups because the blood pressure has long returned

to normal and this would obscure any initial differences. It is not

appropriate to recalculate the summary measure at every time point to

find significant differences ('data dredging').

The ANOVA and summary measures can also obscure extreme data.

We may have to make a clinical judgment on whether it is important to

distinguish a rapid but short duration of severe hypotension in one group

from a more sustained less severe drop in blood pressure in the other

group. Similarly, we would be interested if one drug occasionally causes

very severe hypotension.

From this discussion of a common and apparently simple question, one

can see that there are many possible multiple comparisons that would

inflate the type I error, even with ANOVA - and we have not even

considered mean arterial pressure, heart rate and other cardiovascular

variables!

The investigator is well advised to precisely define the

i

mportant hypotheses in advance (a priori)

so that the appropriate

selected analyses are undertaken.

A dose-response study of an analgesic presents similar problems. We

may want to know the onset of action, maximal effect, duration and

overall efficacy of each dose. If the primary measure is a pain visual

analogue score (VAS), then non-parametric ANOVA tests may be more

appropriate. In a dose-response study of epidural pethidine,

16 pain

scores were measured at 3-minute intervals and the onset of action was

defined as the time taken to decrease the initial pain score by 50%. This

summary measure was then compared among groups using a Kruskal-

Wallis test followed by Mann-Whitney U tests. Overall postoperative

analgesia was also analysed by comparing the area under the curve of the

pain VAS among the three groups. In both cases, a repeated measures

analysis of the pain scores (to determine when the pain score was

different from the baseline) could have been theoretically used but would

have added complexity without any further useful clinical information.

In both these studies, non-parametric tests were used to compare VAS

measurements because these data were considered ordinal data.

However, the use of parametric t-tests and ANOVA is considered

acceptable (see Chapter 1).

ANOVA is a powerful statistical technique and many complex

analyses are possible. However there are also many pitfalls, especially

with repeated measures ANOVA. Advice and assistance from an experi-

enced statistician is highly recommended. Other computer-intensive tests

have also been advocated for comparing means.s

Non-parametric tests

Comparing groups: numerical data

63

When the assumptions for the parametric tests are not met, there are

many non-parametric alternatives for the parametric tests described

above. 1 These include:

1.

Mann-Whitney U test (identical to the Wilcoxon rank sum) is a non-

parametric equivalent to the unpaired Student's t-test

2.

Wilcoxon signed ranks test is a non-parametric equivalent to the

paired Student's Mest

3.

Kruskal-Wallis test is a non-parametric equivalent of one-way

ANOVA

4.

Friedman's test is a non-parametric repeated measures ANOVA

5.

Spearman rank order (rho) is a non-parametric version of the Pearson

correlation coefficient r (see Chapter 7)

Many other non-parametric tests are available, 1 but these are not often

used in anaesthesia research. Non-parametric tests do not assume a

normal distribution and so are sometimes referred to as distribution-free

tests.

They are best used when small samples are selected because in

these circumstances it is unlikely that the data can be demonstrated to be

normally distributed.

6 4

**Statistical Methods for Anaesthesia and Intensive Care
**

Non-parametric tests do however have some underlying assumptions:

1.

Data are from a continuous distribution of at least ordinal scale

2. Observations within a group are independent

3. Samples have been drawn randomly from the population

Non-parametric tests convert the raw results into ranks and then

perform calculations on these ranks to obtain a test statistic. The calcu-

lations are generally easier to perform than for the parametric tests.

However, non-parametric tests may fail to detect a significant

difference (which a parametric test may). That is, they usually have less

power.

Just as described above for the parametric tests, the test statistic is

compared with known values for the sampling distribution of that

statistic and the null hypothesis is accepted or rejected. With these tests,

the null hypothesis is that the samples come from populations with the

same median.

Statistical

programs may use approximations to determine the

sampling distribution of the test statistic, especially when the sample size

is large. For example, in the case of the Mann-Whitney U test, one

approach has been to perform an unpaired t-test on the ranks (rather than

the original raw scores) and, strictly speaking, this normal approximation

actually compares the mean ranks of the data between two groups rather

than the medians.

Mann-Whitney U test (Wilcoxon rank sum test)

This test is used to determine whether or not two independent groups

have been drawn from the same population. It is a very useful test

because it can have high power (= 0.95), compared with the unpaired t-

test, even when all the conditions of the latter are satisfied.

1 It has fewer

assumptions and can be more powerful than the t-test when conditions

for the latter are not satisfied. The test has several names because Mann,

Whitney and Wilcoxon all described tests that were essentially identical

in analysis but presented them differently.

The Mann-Whitney U test is the recommended test to use when

comparing two groups that have data measured on an ordinal scale.

17

However, if the data represent a variable that is, in effect, a continuous

quantity, then a t-test may be used if the data are normally distributed.

This is more likely with large samples

(say, n >1``) 1!

In the Wilcoxon rank sum test, data from both the groups are combined

and treated as one large group. Then the data are ordered and given

ranks, separated back into their original groups, and the ranks in each

group are then added to give the test statistic for each group. Tied data

are given the same rank, calculated as the mean rank of the tied

observations. The test then determines whether or not the sum of ranks

in one group is different from that in the other. The sum of all the ranks

is N(N +

1)/2, where N is the total number of observations.

For example, a hypothetical study investigating the effect of gender on

postoperative headache may measure this pain on a 100 mm visual

Wilcoxon signed ranks test

Comparing grou ps: nu merical data

66

analogue scale. Each patient would have their pain score recorded and

they would be ranked from lowest to highest (Table 5.1).

It is important to distinguish this test from the similar sounding unpaired

test above. This is also a very valuable test with good efficiency (power

Table 5.1 A hypothetical study investigating the effect of gender on postoperative

headache in 30 patients (16 male, 14 female). Pain is measured on a 100 mm visual

analogue scale. Each patient has their pain score ranked from lowest to highest. Tied

data are given the same rank, calculated as the mean rank of the tied observations

1. Exact method

Use the group sum of ranks and consult a reference table for group sizes 16 and 14.

Because W, lies outside the quoted range (at P < 0.05), the null hypothesis can be rejected.

2. Normal approximation

The mean rank in the male group (n = 16) is 10.44, with a standard deviation of 6.67. The

mean rank in the female group (n = 14) is 21.07, with a standard deviation of 7.79. The

pooled standard deviation (see footnote on page 54) is 7.21. The degrees of freedom are 16

+ 14 - 2 = 28. The t statistic based on ranks is:

6 6

**Statistical Methods for Anaesthesia and Intensive Care
**

= 95%) compared with the paired t-test.] As in the paired t-test, the

differences between pairs are calculated but then the absolute differences

are ranked (without regard to whether they are positive or negative). The

positive or negative signs of the original differences are preserved and

assigned back to the corresponding ranks when calculating the test

statistic. The sum of the positive ranks is compared with the sum of the

negative ranks. If there is no difference between groups, we would expect

the sum of the positive ranks to be equal to the sum of the negative ranks.

Kruskal-Wallis ANOVA

This tests the null hypothesis that k independent groups come from

populations with the same median. A formula for the Kruskal-Wallis test

statistic is:]

where N = the total number of cases, k = the number of groups, nj

- the

number of cases in the jth sample, Rj - the average of the ranks in the jth

group, and R= the average of all the ranks (and equal to [N + 1]/2).

If a significant difference is found, post hoc comparisons are usually

performed with the Mann-Whitney U test with a

Bonferroni correction.

This approach does not consider all group data and a method based on

group mean ranks can also be used.]

Friedman two-way ANOVA

This tests the null hypothesis that

k repeated measures or matched

groups come from populations with the same median. Post hoc

tests need

to be performed if a significant difference is found. These tests are

unfortunately not available with most statistical software packages but

can be found in specialized texts.]

References

1.

Siegal S, Castellan NJ Jr. Non-parametric Statistics for the Behavioral Sciences

2nd ed. McGraw-Hill, NewYork 1988.

2. Bland JM, Altman DG. Transforming data.

Br Med j 1996; 312:770.

3.

Bland JM, Altman DG. Transformations, means, and confidence intervals. Br

Med j 1996; 312: 1079.

4. Bland JM, Altman DG. The use of transformation when comparing two

means. Br Med J 1996; 312:1153.

5. Ludbrook J. Issues in biomedical statistics: comparing means by computer-

intensive tests. Aust NZ J Surg1995; 65:812-819.

6. Godfrey K. Comparing the means of several groups.

N Fngl j Med 1985;

313:1450-1456.

7. Gardner MJ, Altman DG. Statistics with Confidence - Confidence Intervals

and Statistical Guidelines. British Medical journal,

London 1989:pp20-27.

Comparing groups: numerical data

67

8. Scheinkestel CD, Bailey M, Myles PS et al.

Hyperbaric or normobaric oxygen

for acute carbon monoxide poisoning - a randomized, controlled clinical trial.

Med J Aust

1999;170:203-210.

9.

Bland JM, Altman DG. One and two sided tests of significance. Br Med J 1994;

309:248.

10. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method.

Br

Med J 1995; 310:170.

11. Ludbrook J. Repeated measurements and multiple comparisons in

cardiovascular research. Cardiovasc Res

1994; 28:303-311.

12. Myles PS, Hunt JO, Fletcher H et al.

Propofol, thiopental, sevoflurane and

isoflurane: a randomized controlled trial of effectiveness study.

Anesth Analg

in press.

13. Matthews JNS, Altman DG, Campbell MJ et al.

Analysis of serial

measurements in medical research. BMJ 1990; 300:230-235.

14. Gin T, Chan K, Kan AF et al. Effect of adrenaline on venous plasma

concentrations of bupivacaine after interpleural administration.

Br J Anaesth

1990; 64:662-666.

15. McArthur CJ, Gin T, McLaren IM et al. Gastric emptying following brain

injury: effects of choice of sedation and intracranial pressure. Intensive Care

Med 1995; 21:573-576.

16. Ngan Kee WD, Lam KK, Chen PP, Gin T. Epidural meperidine after cesarean

section: the effect of diluent volume. Anesth Analg 1997; 85:380-384.

17.

Moses LE, Emerson JD, Hosseini H. Analyzing data from ordered categories.

N Engl J Med 1984; 311:442-448.

6

Comparing groups: categorical data

Chi-square

-Yates' correction

Fisher's exact test

The binomial test

McNemar's chi-square test

Risk ratio and odds ratio

Number needed to treat

Mantel-Haenszel test

Kappa statistic

Key points

•

The chi-square test is used to

compare independent groups of categorical

data.

• Yates' correction factor should be used when the sample size is small.

•

The results from two group

comparisons with two categories are set out in a

2 x

2 contingency table.

•

Fisher's exact test is a recommended alternative for analysing data from 2

x

2 tables.

McNemar's test is used to compare paired groups of categorical data.

The risk ratio is the proportion of patients with an outcome who were

exposed to a risk factor vs. the proportion not exposed.

•

Odds ratio is an estimate of risk ratio, used mostly in retrospective case-

control studies.

•

The number needed to treat (NNT) is the reciprocal of the absolute risk

reduction.

• The kappa statistic is a measure of

agreement.

Categorical data are nominal and can be counted (see Chapter 1). This

chapter is concerned with various methods to compare two or more

groups when the data are categorical. Extensive further reading is

available in a textbook on non-parametric statistics by Siegal and

Castellan.1

Chi-square

(x2)

The Pearson chi-square (X

2

)

test is the most common significance test

used for comparing groups of categorical data. It compares frequencies

and tests whether the observed rate differs significantly from that

expected if there were no difference between groups (i.e. the

null

hypothesis).

The calculated value of the Pearson X,2 test statistic is compared to the

chi-square distribution,

a continuous frequency distribution, and the

resultant significance level (P value) depends on the overall number of

observations and the number of cells in the table. The x2

distribution is

derived from the square of standard normal variables (X) and provides a

basis for calculating the t and F distributions described in the previous

chapter. It consists of a family of curves, each of which, like the t-test, has

70

**Statistical Methods for Anaesthesia and Intensive Care
**

The Pearson

· `

statistic is calculated as:

where O = the observed number in each cell, and

E = the expected

number in each cell.

The expected number in each cell is that expected if there were no

differences between groups so that the ratio of outcome 1 to outcome 2 is

the same in each group. Thus for outcome 1, we would expect (a +

b)/N

as the ratio in each group, the expected number for group A is ([a + b])/N

x [a + c]), and the expected number for group B is ([a + b])/N) x [b +d]).

All four expected numbers are calculated and the X` is then the sum of the

four [(O- E)` ]/Eterms.

The degrees of freedomis equal to: (number of rows -1) x (number of

columns -1). In a 2 x 2 table, given fixed rowand column totals, there is

only free choice for one of the inner numbers because, in doing so, the

others are calculated by subtraction. The degrees of freedom was

(2-1)(2-1) = 1. Thus, in a 2 x 2 table, the result is compared to known

values of the ` distribution at 1 degree of freedom.

For example, consider a clinical trial investigating the effect of pre-

operative beta-blocker therapy in patients at risk of myocardial ischaemia

(Example 6.1).

X2

= (12.25 /8.5 ) + (12.25 /8.5 ) + (12.25 /11.5 ) + (12.25 /11.5 ) = 5 .013

The P value can be obtained in a X2-table in a reference text and is equal to 0.025 , and so

one would reject the null hypothesis. Thus, patients in group A had a statistically significant

lower rate of myocardial ischaemia.

The X2X2 distribution is actually a continuous distribution and yet each

cell can only take integers. When the total number of observations is

small, the estimates of probabilities in each cell become inaccurate and

the risk of type I error increases. It is not certain howlarge N should be,

but probably at least 20 with the expected frequency in each cell at least

Example 6.1 An observational study of 20 patients at risk of

Group A is receiving beta-blockers whereas group B is not. The

myocardial ischaemia

myocardial ischaemia.

outcome of interest is

Observed:

Group A Group B Row total

I

sc

h

aem

ia

5

12 17

No

i

schaemi

a

15 8 23

Colu mn total

Expected (if there was

20

no difference

Group A

20

between groups):

Group B

40

Row total

I

sch

a

em

ia

8.5 8.5

17

No ischaemia 11.5

11.5

23

Colu mn total

20 2

0 40

Comparing groups: categorical data

71

5. When the expected frequencies are small, the approximation of the x2

statistic can be improved by a continuity correction known as Yates'

correction.

The formula is:

In Example 6.1, the continuity corrected x2 is:

x2=(9/8.5)+(9/8.5)+(9/11.5)+(9/11.5)=3.68

This has an associated P value of 0.055 and one would accept the null

hypothesis! Yates' correction is considered by some statisticians to be an

overly conservative adjustment. It should be remembered that the

x2 test

is

an approximation and the derived P value may differ from that

obtained by an exact method. We stated above that the Pearson x2 test

may not be the best approach and this is more so if small numbers of

observations are analysed. If there are multiple categories it may be

useful to combine them so that the numbers in each cell are greater. With

small numbers in a 2 x 2 table, the best approach is to use Fisher's exact

test.

Fisher's exact test

This is the preferred test for 2 x 2 tables described above. It calculates the

probability under the null hypothesis of obtaining the observed

distribution of frequencies across cells, or one that is more extreme. It

does not assume random sampling and instead of referring a calculated

statistic to a sampling distribution, it calculates an exact probability. The

test examines all the possible 2 x 2 tables that can be constructed with the

same marginal totals (i.e. the numbers in the cells are different but the

rowand column totals are the same). One can think of this as analogous

to the problem of working out all the possible combinations of heads and

tails if one tosses a coin a fixed number of times. The probability of

obtaining each of these tables is calculated. The probability of all tables

with cell frequencies as uneven or more extreme than the one observed is

then added to give the final P value. This test was not common before the

use of computers because the calculation of probability for each cell was

arduous. After constructing all possible tables, the probability of each

table is:

W here ! denotes factorial.

In Example 6.1, the P value for Fisher's exact test is 0.054, similar to that

obtained using Yates' correction factor, and again we would accept the

null hypothesis. Current statistical packages are able to calculate

Fisher's exact test and it seems logical to use the exact probability

rather than approximate

x2

tests. Further discussion can be found

elsewhere.1,2

72

**Statistical Methods for Anaesthesia and Intensive Care
**

Analysis of larger contingency tables

If there are more than two groups and/or more than two categories, one

can construct larger contingency tables. However, if there are more than

two categories, it is often the case that some rank can be assigned to the

categories (e.g. excellent, good, poor) and tests such as the Mann-

Whitney U test may be more appropriate (see Chapter 5). 3 An alternative

is to use a variation of Z2 known as the X2 test for trends. 4 This test will

give a smaller P value if the variation in groups is due to a trend across

groups.4

The analysis of larger tables can also be carried out using the Pearson

X2 test as indicated above, with more cells contributing to the test statistic.

The result is referred to the X2 distribution at (m -1)(n -1) degrees of

freedom, if there are m rows and n columns. For a 4 x 3 table there are

3 x 2 = 6 degrees of freedom. All cells should have an expected frequency

greater than 1 and 80%of the cells should have an expected frequencies

of at least 5. If this is not the case, it is better to combine some of the

categories to have a smaller table.

In the analysis of a large table, for example two categories in three

groups, a significant result on

x2 testing will not indicate which group is

different from the others. It is not appropriate to partition the 2 x 3 table

into several 2 x 2 tables and perform multiple comparisons. Three 2 x 2

tables are possible and a test on each table at the original a may give a

spuriously significant result. One approach is to do an initial x2 test and,

if

P is less than 0.05, perform separate tests on each pair using a

Bonferroni correction

for multiple comparisons.5

The binomial test

The binomial distribution was briefly

described in Chapter 2. Data that

can only assume one of two groups are called

dichotomous or binary

data. Thus, if the proportion in one group is equal to p, then in the other

it will be 1 - p. The binomial test can be used to test whether a sample

represents a known dichotomous population. 1 It is a one-sample test

based on the binomial distribution. A

normal approximation based on

the z test can be used for large samples.1

The binomial test could test whether a single study site in a multi-

centred trial had a similar mortality to that obtained from the entire study

population.

McNemar's chi-square test

McNemar's X2 test is used when the frequencies in the 2 x 2 table

represent paired (dependent) samples. The null hypothesis is that the

paired proportions are equal. The paired contingency table is constructed

such that groups A and Ypairs that had an event (outcome 1) would be

Comparing grou ps: categorical data

73

Table 6.2 A 2 x 2 contingency table for paired groups

counted in the a cell, those pairs that did not have an event (outcome 2)

would be counted in the d cell, and the respective pairs with an event at

only one period in cells b and c (Table 6.2).

Groups A and Ydenote either

matched pairs

of subjects, or a single

group of patients in a

before and after

treatment design. The calculation

of McNemar's

x2

is different from that described above for the Pearson

2 . 1 The value of the McNemar's

x2

is referred to the X

2 distribution with

1 degree of freedom. There is a continuity correction similar to Yates'

correction and an exact version of the test that is similar to the Fisher's

exact test. If available, the exact test is preferred.

For example, if we had used the group A patients (n = 20) described in

Example 6.1, but on this occasion had then given them a newtreatment,

such as lowmolecular weight heparin, so that we nowlabel them as pre-

treatment (group A) and post-treatment (with heparin, group Y), we

would get the following table (Example 6.2), still preserving the same

distribution of outcomes after each treatment.

Example 6.2 A randomized controlled trial of low molecular weight heparin (LMWH)

i

n 20 patients at risk of myocardial ischaemia who are receiving beta-blockers.

The

outcome of interest is myocardial ischaemia

The McNemar P value is 0.065 (using SPSS V9.0 software). The conclusion from this small

before and after study is that LMWH is not effective in the prevention of myocardial

i schaemia in patients receiving beta-blockers.

The Cochran Q test can be used if there are more than two groups.1

Grou p Y

(post-L MW H ):

I schaemia

Grou p Y

(post-L MW H ):

No ischaemia

Row totals

Grou p A

(pre-L MW H ):

Ischaemia 3 2 5

Grou p A

(pre-L MW H ):

No ischaemia

9

6

15

Colu mn totals

12

8

20

Grou p Y:

Grou p Y.

Outcome 1

Outcome 2

Row totals

Grou pA:

Outcome 1 a

b

a +b

Grou pA:

Outcome 2 c

d

c +d

Column totals a+c

b+d

a+b+c+d=N

74

**Statistical Methods for Anaesthesia and Intensive Care
**

Risk ratio and odds ratio

The P value derived from a X2 statistic does not indicate the

strength

of an

association. As clinicians, we are usually interested in how much more

likely an outcome will be when a treatment is given or a risk factor is

present. This can be described by the

risk ratio (also known as

relative

risk)

and it can be calculated from a 2 x 2 table (Table 6.3). It is equal to

the proportion of patients with a defined outcome after an exposure to a

risk factor (or treatment) divided by the proportion of patients with a

defined outcome who were not exposed. If exposure is not associated

with the outcome, the risk ratio is equal to one; if there is an increased

risk, the risk ratio will be greater than one; and if there is a reduced risk,

the risk ratio will be less than one.

Because accurate information concerning all patients at risk in a

retrospective case-control study is not available (because sample size is

set by the researcher), incidence rate and risk cannot be accurately

determined, and the

odds ratio

is used as the estimate of the risk ratio

(Table 6.3). It is equal to the ratio of the odds of an event in an active

group divided by the odds of an event in the control group. It is a

reasonable estimate of risk when the outcome event is uncommon (say,

< 10%). If the outcome event occurs commonly, the odds ratio tends to

overestimate risk.

6

Odds ratios are mostly used in case-control studies

that investigate uncommon events.

The risk ratio and odds ratio can be expressed with

95% confidence

intervals (CI). 1

If this interval does not include the value of 1.0, then the

association between exposure and outcome is significant (at P < 0.05).

These methods not only tell you if there is a significant association, but

also howstrong this association is.

For example, data from Example 6.1 can be reanalysed using these

methods (Example 6.3); note that the axes have been switched.

Table 6.3 In prospective cohort studies and clinical trials the risk

ratio is equal to the risk of an outcome when exposed compared

to the risk when not exposed. For retrospective case-control

studies (outcome 'yes' = cases, outcome 'no' = controls), the value

for the denominator is unreliable and so the odds ratio is used as

an estimate of risk. If an outcome event is uncommon the

aand c

cells have very small numbers relative to the

b and dcells, and so

the risk ratio can be approximated by the odds ratio, using the

fraction alb divided by cld;

this can be rewritten as adlbc.

Comparing groups: categorical data

75

Example 6.3 An observational study of 20 patients at risk of myocardial ischaemia.

Group A is receiving beta-blockers,

whereas group B is not. The outcome of interest is

myocardial ischaemia. The incidence of myocardial ischaemia is high in this study

group and so the odds ratio overestimates risk reduction

Thus, using risk ratio, patients receiving /3-blocker therapy have a

5 8% reduction in risk of

myocardial ischaemia.

The estimation of risk ratios and odds ratios have been used in

epidemiological research for many years where the relationship between

exposure to risk factors and adverse outcomes is frequently studied. They

are nowbeing used more commonly in anaesthesia research.

For example, Wilson et al.

8 investigated the benefits of preoperative

optimization and inotropes in patients undergoing major surgery. They

randomized 138 patients to receive adrenaline, dopexamine or control

(routine care). There was a significant reduction in the proportion of

patients who had complications in the dopexamine group compared with

those in the control group. The risk ratio (95%CI) was 0.30 (0.11-0.50),

indicating a significant 70%reduction in risk.

Number needed to treat

A risk ratio describes howmuch more likely it is for an event to occur, but

this information is limited unless we consider the baseline level of risk, or

incidence rate.

An increase in risk of a very rare event is still very rare!

Thus the change in

absolute risk

is of clinical importance. This is the

difference in the probabilities of an event between the two groups. If an

event has an incidence of 12%and risk is reduced by 33%(i.e. risk ratio

0.67), then the expected incidence will be 8%; this gives an absolute risk

reduction of 4%, or 0.04. If the baseline incidence were 60%, a 25%risk

reduction would result in an absolute risk reduction of 15%, or 0.15.

The

number needed to treat (NNT)

is the reciprocal of the absolute

risk reduction.

9,10

It

describes the number of patients who need to be

treated in order to avoid one adverse event. Thus an absolute risk

reduction of 0.04 translates to a NNT of 25 (1/0.04) - about 25 patients

need to be treated in order to avoid one adverse event. An absolute risk

reduction of 0.15 translates to a NNT of 6.7.

Ischaemia No ischaemia

Row total

Grou p A

-blocker therapy

5 15

20

Grou p B

No therapy

12 8 20

Colu mn total

17 23

40

76

**Statistical Methods for Anaesthesia and Intensive Care
**

For example, data from Example 6.3 can be used to calculate a NNT

of 2.9, suggesting that two or three patients need to be treated with

beta-blockers in order to prevent one patient from having myocardial

ischaemia.

For example, Lee et a1. 11 investigated the use of acupuncture/

acupressure to prevent postoperative nausea and vomiting. They

performed a meta-analysis of all relevant trials and found that

acupuncture/ acupressure were better than placebo at preventing early

vomiting in adults, with an RR (95%CI) of 0.47 (0.34-0.64). If the

incidence of early vomiting is 35%(proportion = 0.35), then these results

suggest that acupuncture/ acupressure, with an RR of 0.47, would reduce

the proportion to 0.17, or an absolute risk reduction of 0.18 (incidence

decreased from 35%to 17%). The NNT, or reciprocal of the absolute risk

reduction (1/0.18), is 5.5. Therefore, it can be concluded that five to six

adult patients need to be treated in order to prevent one patient from

vomiting.

A 95%CI can also be estimated for the NNT. It is calculated as the

reciprocals of the two 95%confidence limits of the absolute risk

reduction.10

Mantel-Haenszel test

If a group response is affected by more than one variable, then it may be

of interest to determine the relative impact that each of the variables may

have on a group outcome. The Mantel-Haenszel x2 test can be used to

analyse several grouping variables (i.e. it is a multivariate test) and so

can adjust for confounding.

1` 1¯

It stratifies the analysis according to the

nominated confounding variables and identifies any that affect the

primary outcome variable. If the outcome variable is dichotomous, then

logistic regressioncan be used (see Chapters 7 and 8). Both these tests are

used most often in outcome studies where there may be several

independent (predictor) variables. They can therefore calculate adjusted

odds ratios.

Kappa statistic

Kappa W measures the agreement between two observers when both are

rating the same variable on a categorical scale.1 14 The kappa statistic describes

the amount of agreement beyond that which would be due to chance.

14

The difference between the observed proportion of cases in which the

raters agree and the proportion expected by chance is divided by the

maximum difference possible between the observed and expected

proportions, given the marginal totals. The formula for the kappa statistic is:

where A = the proportion of times the raters agree, and E = the proportion

of agreement expected by chance.

A value of 1.0 indicates perfect agreement. A value of 0 indicates that

agreement is no better than chance and the null hypothesis is thus kappa

= 0.

A kappa value of 0.1-0.3 can be described as mild agreement, 0.3-0.5 as

moderate agreement, and 0.5-1.0 as excellent agreement. The value of kappa

can be transformed and tested for statistical significance. 1

A common situation where kappa is used in anaesthesia studies is to

measure agreement between researchers when recording data in clinical

trials. Reproducibility is a very important issue in clinical research. For

example, Higgins et a1.

15

developed a risk score from their cardiac

surgical database. They measured the reliability of their research nurses

data coding and entry by measuring agreement with a sample of

reabstracted data checked by study physicians. The kappa statistics were

0.66-0.99, indicating very good agreement, and this supported the

validity of their study.

References

Comparing groups: categorical data

7 7

1. Siegal S, Castellan NJ Jr. Nonparametric Statistics for the Behavioral Sciences

2nd ed. McGraw-Hill, NewYork 1988.

2. Ludbrook J, Dudley H. Issues in biomedical statistics: analysing 2 x 2 tables

of frequencies. Aust NZ J Surg 1994; 64:780-787.

3. Moses LE, Emerson JD, Hosseini H. Analyzing data from ordered categories.

N Engl J Med 1984; 311:442-448.

4. Altman DG. Practical Statistics for Medical Research. Chapman & Hall,

London 1991, pp261-264.

5. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. Br

Med J 1995; 310:170.

6. Egger M, Smith GD, Phillips AN. Meta-analysis: principles and procedures.

BMJ 1997; 315:1533-1537.

7. Morris JA, Gardner MJ. Calculating confidence intervals for relative risks,

odds ratios, and standardised ratios and rates. In: Gardner MJ, Altman DG.

Statistics with Confidence - Confidence Intervals and Statistical Guidelines.

British Medical Journal, London 1989, ppl-63.

8. Wilson J, Woods I, Fawcett J et al. Reducing the risk of major elective surgery:

randomised controlled trial of preoperative optimisation of oxygen delivery.

BMJ 1999; 318:1099-1103.

9. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful

measures of the consequences of treatment. N Engl J Med 1988; 318:1728-1733.

10. Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure

of treatment effect. BMJ 1995; 310:452-454.

11. Lee A, Done ML. The use of nonpharmacologic techniques to prevent

postoperative nausea and vomiting: a meta-analysis.

Anesth Analg 1999;

88:1362-1369.

12. Kuritz SJ, Landis JR, Koch GG. A general overviewof Mantel-Haenszel

methods: applications and recent developments. Ann Rev Public Health

1988;

9:123-160.

13. Zeiss EE, Hanley JA. Mantel-Haenszel techniques and logistic regression:

always examine one's data first and don't overlook the simpler techniques.

Paediatr Perinat Epidemiol 1992; 6:311-315.

14. Morton AP, Dobson AJ. Assessing agreement. Med J Aust 1989;150:384-387.

15. Higgins TL, Estafanous FG, Loop FD, et al. Stratification of morbidity and

mortality outcome by preoperative risk factors in coronary artery bypass

patients: a clinical severity score. JAMA 1992; 267:2344-2348.

7

Regression and correlation

correlation coefficient.

• Regression is used for prediction.

•

Agreement between two methods of measurement can be described by the

Bland-Altman approach or the kappa statistic.

Association vs. prediction

There are many circumstances in anaesthesia research where the strength

of a relationship between two variables on a numerical scale is of interest.

For example, the relationship between body temperature and oxygen

consumption. The commonest methods for describing such a relationship

are correlation and regression analysis, yet these are both frequently

misunderstood and misused techniques. One of the reasons for this is

that they are used in similar circumstances and are derived from similar

mathematical formulae. The main distinction between them is the pur-

pose of the analysis.

Usually one of the variables is of particular interest, whereby we wish

to determine howwell it is related to the other. This variable of interest is

called the dependent variable, but is also known as the outcome or

response variable. The other variable is called the independent variable,

but is also known as the predictor or explanatory variable.

The first step in correlation and regression analyses should be to plot a

scatter diagram (this is essential if misleading conclusions are to be

avoided). Here, the dependent (outcome) variable is placed on the y-axis

and the independent (predictor) variable is placed on the x-axis, and the

plotted data represent individual observations of both variables. For

example, if we wanted to describe the relationship between body

temperature and total body oxygen consumption (Vo 2), we would first

plot the respective measurements obtained from each individual: ' a

scatterplot' (Figure 7.1).

It appears from this scatterplot that Vo 2 increases with increasing body

temperature. But howcan the relationship between them be described in

Vo 2

(ml/min)

Figure 7 .1 A scatterplot of oxygen cons

data point represents a single observad

more detail: howstrongly are they a

to predict Vo2

from body temperatu

The first is described by the Pears

by r): correlation is a measure of th

question is answered by calculatin;

used for prediction. Thus, one

correlation and regression is the pu

Assumptions

20

Before describing correlation and re;

be aware of their underlying assum]

the dependent and independent vi

i mplies that a unit change in one vai

in the other. The Pearson correlatic

scatter of data around a straight line

Two variables can have a strong

coefficient if the relationship is not

often used to describe a linear

regression is used. This is one of t

data. It allows a visual inspection

non-linear relationship is suggeste(

used which do not assume a linear i

described below).

The data should also be indepenc

on the scatterplot should represe

patient. Multiple measurements

analysed using simple correlation o

300

Association vs. prediction

Non-linear regression

Assumptions Multivariate regression

200

Correlation

Mathematical coupling

Spearman rank correlation Agreement

Regression analysis

Key points

100

• Correlation and regression are used to describe the relationship between

two numerical variables.

• Correlation is a measure of association.

•

Spearman rank order (rho) is a non-parametric version of the Pearson

10

Regression and correlation

79

Figure 7.1 A scatterplot of oxygen consumption (Vo2) and temperature. Each

data point represents a single observation in each individual patient (n = 16)

more detail: howstrongly are they associated and, if relevant, are we able

to predict V02

from body temperature? These are two different questions.

The first is described by the

Pearson correlation coefficient (denoted

by r): correlation is a measure of the strength of association. The second

question is answered by calculating a regression equation: regression is

used for prediction. Thus, one of the major distinctions between

correlation and regression is the purpose of the analysis.

Assumptions

Before describing correlation and regression any further, it is important to

be aware of their underlying assumptions. First, the relationship between

the dependent and independent variable is assumed to be

linear. This

i mplies that a unit change in one variable is associated with a unit change

in the other. The Pearson correlation coefficient describes the degree of

scatter of data around a straight line - it is a measure of

linear association.

Two variables can have a strong association but a small correlation

coefficient if the relationship is not linear. Similarly, regression is most

often used to describe a linear relationship and so simple linear

regression is used. This is one of the main benefits of first plotting the

data. It allows a visual inspection of the pattern of the scatterplot: if a

non-linear relationship is suggested, then alternative techniques can be

used which do not assume a linear relationship (some of these are briefly

described below).

The data should also be

independent. This means that each data point

on the scatterplot should represent a single observation from each

patient. Multiple

measurements from each patient should not be

analysed using simple correlation or regression analysis as this will lead

80

**Statistical Methods for Anaesthesia and Intensive Care
**

to

misleading conclusions (most often an over-inflated value of r, or a

misleading regression equation). This is perhaps the most common error

relating to correlation and regression in anaesthesia research. Repeated

measures over time should also not be simply analysed using correlation,

as once again this often results in an over-inflated value for r (time trends

require

more advanced statistical methods). Statistical

methods are

available which can accommodate for such trial designs. 1,2

These analyses also assume that the observations follow a

normal

distribution

(in particular, that for any given value of the independent

variable, the corresponding values of the dependent variable are

normally distributed). This is known as homoscedasticity. If doubt

exists, or if the distribution appears non-normal after visualizing a

scatterplot, then the data can be transformed (commonly using log-

transformation) or a non-parametric method used (e.g. Spearman rank

correlation, see below).

Neither correlation nor regression should be used to measure

agreement between two measurement techniques (see below).

Correlation

The Pearson correlation coefficient (r)

is a measure of how closely the

data points on a scatterplot assume a straight line. It is a measure of

association. The statistic, r, can have any value between -1.0 and + 1.0.*

A value of 1.0 describes a perfect positive linear association; a value of

-1.0

describes a perfect negative linear association (i.e. as the

independent variable increases, the dependent variable decreases). A

value of 0 describes no association at all, and this would appear on a

scatterplot as a (roughly) circular plot (Figure 7.2). In general, if r has an

absolute value between 0.2 and 0.4, it may be described as a mild

association; a value of 0.4-0.7 would be a moderate association, and

0.7-1.0 a strong association. This distinction is an arbitrary one, however,

with the final descriptors of the extent of association being determined

more by the intended clinical application.

There is obviously a degree of uncertainty for any calculated value of

r.

This will depend largely on the number of observations (sample size),

but is also influenced by various other factors (such as measurement

precision, the range of measurements and presence of outliers). The

degree of uncertainty can be described by the standard error of r and its

95%confidence interval.3 Perhaps the most useful measure is the value

r2,

the

coefficient of determination.

This is an estimate of how much a

change in one variable influences the other (and 1 - r 2

is the proportion

of variance yet to be accounted for). For example, from Figure 7.2: 53%

[(0.73)

2] of the variability in body weight is explained by age, and 30%

where x = value of independent variable, y = value of dependent variable, and

x = mean

value of x and

y = mean value of y (several forms of this equation exist).

Regression and correlation

81

Figure 7.2 Examples of scatterplots demonstrating a variety of correlation

coefficients (r)

[(-0.55)2] of the variability in urine osmolarity is explained by a change in

urine flow. Knowledge of r2 therefore has clinical application. It tells us

how influential one factor is in relation to another (and perhaps more

i mportantly, the effect of other factors - using 1 - r2).

Hypothesis tests can also be applied to correlation; the most common

test used is Student's t-test.* The t-test is used to compare the means of

two groups; for correlation, the resultant P value describes the likelihood

of no correlation (r = 0) - it does not describe the strength of that

association. Although uncommon, hypothesis tests can also be used to

determine if a correlation coefficient is significantly different from some

other specified value of r.

An important characteristic of correlation is that it is independent of

units of measurements. Again, referring to Figure 7.2, the r values would

not be altered whether body weight was measured in pounds or

kilograms, or if mean BP was measured in mmHg or kPa. The value of r

will, however, be markedly influenced by the presence of outliers (if the

range is increased by an outlier there is a tendency for r to increase).

Similarly, if the range of values is restricted, then r will usually be

reduced.

Data should be randomly selected from a specified target

population and measurement precision should be optimized.

A

partial correlation coefficient can also be calculated. This is an

adjusted r value which takes into account the impact of a third variable,

which may be associated with both the dependent and independent

variables (this third variable is called a

covariate). For example (referring

82

**Statistical Methods for Anaesthesia and Intensive Care
**

to Figure 7.2), the relationship between age and weight may be influ-

enced by the gender of the patient, or their nutritional status. Similarly,

renal blood flowmay be affected by changes in cardiac output (which is

related to mean blood pressure). If multiple independent (predictor)

variables are used to describe a relationship with a dependent (outcome)

variable, then a

multiple correlation coefficient

can be calculated

(denoted as R) using multivariate regression (see below).

One remaining point should not be forgotten:

association does not

i

mply causation.

A strong association does not, of itself, support a

conclusion of cause and effect. This requires additional proof, such as a

biologically plausible argument, demonstration of the time sequence

(discerning cause from effect) and exclusion of other confounding

influences (i.e. a third variable associated with the two variables of

interest, that is actually the causative factor).' Unfortunately, these issues

have been rarely addressed in the anaesthetic literature and this often

leads to unsubstantiated conclusions (see also Chapter 11).

Spearman rank correlation

If the distribution of the data is skewed (i.e. not normally distributed),

then it can be transformed, typically using the logarithm of each value to

create a more normal distribution to the data so that correlation and

regression analyses can then be performed reliably. Alternatively, or if the

data are ordinal, a non-parametric version of correlation, such as

Spearman rank correlation

should be used. Because one of the

assumptions used in correlation is that the data are normally distributed,

then it is also preferable to use Spearman rank correlation when

analysing small data sets, say n < 20 (as it is difficult to demonstrate a

normal distribution with a small number of observations). This

calculation is based on the ranking of observations and is denoted by the

Greek letter, rho (p). It is the ordered rank values, rather than the actual

values, that are correlated against one another (Table 7.1). In fact, if

Spearman s p is a similar value to r, then the distribution of the data

approximates normal; if not, then it suggests non-normality and the

Spearman p value should be preferentially used to describe association.

Other aspects of correlation apply equally to both. These include use of

the t-test (to derive a P value in order to determine if the correlation is

significantly different from zero), standard error, 95%confidence

intervals and the coefficient of determination.

There are other non-parametric correlation methods - Kendall's tau

(r),

Kendall's coefficient of concordance

W, Cramer coefficient (C),

and lambda (L). Further details of these methods can be found

elsewhere. 5

Regression analysis

If the aim of the investigation is to predict one variable from another, or

at least describe the value of one variable in relation to another, then

Regression and correlation

83

Table 7 .1 Actual values and their ranking of patient weight and morphine consumption

at 24-48 hours after cardiac surgery. These results are a subgroup (n = 23) taken from a

study investigating the efficacy of patient-controlled analgesia after cardiac surgery.6

Spearman's rho (p)is calculated by measuring the association between the rank values

(if actual values are equal, the rank is calculated as the average between them)

regression analysis can be used. Here the dependent (outcome) variable

is again placed on the y-axis of a scatterplot and the independent

(predictor) variable is placed on the x-axis. A

line of best fit, called a

regression line, can then be calculated using a technique known as the

method of least squares. This is where the perpendicular difference

between each data point and the straight line (this difference is called the

residual) is squared and summed - the eventual line chosen is that with

the smallest total sum (hence the term 'least squares method' or 'residual

sum of squares').

The general formula for the line of best fit is y =

a + bx, where 'b' is the

measure of slope and 'a' is the y-intercept.* For example, referring to our

original scatterplot of V02 and body temperature (Figure 7.1), a

regression line can be derived which enables prediction of V0

2 after

measuring body temperature (Figure 7.3). This line is described by the

equation, Vo2

(in ml/min) = 6.8 + 6.0 x temp. (in °C). This equation states

that for each 1°C increase in temperature there is a 6 ml/min increase in

V02 . From this we are able to predict that if a patient has a body

Subject W eight

(kg)

W eight

rank

Total dose of

morphine (mg)

Morphine

rank

1 82 13 13 3

2 86 17 52 19

3 90 9 54 21

4 64 3 29 9

5 83 14 24 7

6 65 4 26 8

7 74 10 38 14

8 53 2 9 1

9 80 12 46 16

10 46 1 19 5.5

11 91 20 53 20

12 69 6 34 12

13 84 15 32

11

14 105 23 30 10

15 78 11 48

17.5

16 73 9 41 15

17 88 18 14

4

18 70 8 36

13

19 97 22 55

22

20 92 21 48

17.5

21 85 16

60 23

22 69 5

12 2

23 89 19

19 5.5

p =0.54 (P =0.031)

84

**Statistical Methods for Anaesthesia and Intensive Care
**

temperature of 32°C, then the best estimate for VO2

would be

199 ml/min, and if their body temperature was 38°C, then

VO2 would be

expected to be 235 ml/min.

Hence regression is a very useful way of describing a linear

relationship between two numerical variables. Nevertheless there

remains some uncertainty about how accurate this equation is in

representing the population: it is unlikely that a derived equation will be

able to perfectly predict the dependent variable of interest in the

population. This uncertainty can be described by the

standard error

of

the slope (b) and its

95% confidence interval.3 As the study sample size

increases, the standard error of

b decreases and so the uncertainty

decreases (and the more reliable the regression equation will become).

For the population, the general form of the equation is actually Y' = beta0+

beta1X, where Y' is the predicted value of the dependent variable and e1 is

the slope. Hence, the slope of our sample regression line (b) is an estimate

of beta1; beta1 is known as the regression coefficient.

In our example (Figure 7.3), the standard error of b can be calculated

from a known formula,

SE(b) = 2.65. From this we can nowcalculate a

95%confidence interval for beta1

(after first looking up a t-table, using 14

degrees of freedom [n - 2], t = 2.15, d.f. = 14): the 95%confidence interval

for be

ta1 is 6.0 ± ([2.15 x 2.65] = 5.7), that is, from 0.3 to 11.7. Because the 95%

confidence interval for beta1

does not include the value zero, it is statistically

significant at the 5%level (P < 0.05). Similarly, a hypothesis test can be

performed to test whether a value of be

ta1 is equal to zero by dividing the

value of b

by its standard error and looking up a t-table (degrees of

freedom = n - 2). Because the level of uncertainty increases the further we

are from the mean value of x (our independent variable), the width of the

95%confidence interval increases towards the extreme values. For this

reason, 95%confidence intervals for a regression line are curved (Figure

7.4).

Figure 7.3. A regression line ('line of best fit') for regression of

VO2 on body

temperature. Note that the line should not extend beyond the limits of the data

(where accurate prediction becomes unreliable)

Regression and correlation

85

Figure 7.4 A regression line for regression of V0 2 on body temperature, showing

the 95%confidence intervals (broken lines)

Just as hypothesis testing can be used to determine whether a

regression line (slope) is statistically significant (from zero), two

regression lines can also be compared to see whether they differ in their

Y-intercept or slope (i.e. do they represent two different populations?).

The difference between a predicted value for y and the actual observed

value is known as the residual. There are methods available which can

analyse the distribution of the residuals across a range of values for X in

order to determine if the data are normally distributed. The residuals can

also be used to describe the ' goodness of fit'

of a regression equation, or

' model' (i.e. does it predict well?).

The assumptions stated earlier for correlation are also important for

regression. However, it is not necessary for the independent (predictor)

variable to be normally distributed. It is important to remember that the

scale of measurement in regression analysis determines the magnitude of

the constants (a and b) in the regression equation, and so units should be

clearly stated.

Other, non-linear versions of regression can be used (the above is called

simple linear regression). These obviously do not plot a straight line

through the scatterplot.

Non-linear regression

An example of non-linear regression is the common classical phar-

macokinetic problem of fitting a polyexponential curve to drug

concentration-time data.7 Specialized pharmacokinetic programs are

usually used to determine an exponential function that has minimal sum

of squares for a set of data points.

Because drug concentrations may vary by several orders of magnitude,

and variance is proportional to the concentration, a differential weighting

factor is often used for each data point in determining the regression

86

**Statistical Methods for Anaesthesia and Intensive Care
**

estimates. Several polyexponential solutions are possible and a variety of

criteria (e.g. Schwarz, Akaike) can be used to determine the most likely

model. Rather than fit individual curves with polyexponential equations,

it is

becoming common nowadays to carry out population

pharmacokinetic modelling that combines all the data points in one

overall regression analysis.

The calculation of quantal dose-response curves is another example of

the use of non-linear regression. In this case a procedure known as

probit

analysis

is often used. In quantal dose-response experiments, several

doses of a drug are chosen and the observed response must be

dichotomous. (For a log dose-response curve, the doses are best chosen

so that the logarithm of the doses are approximately equally spaced.) At

each dose, a number of subjects are exposed to the drug and the response

observed.

For example, in a comparison of thiopentone requirements between

pregnant and non-pregnant patients ,

9 the numbers of patients found to

be unconscious at each dose was determined (Table 7.2 and Figure 7.5).

The shape of the dose-response curve is expected to be sigmoid and

thus the raw proportions of responses in each group must undergo an

appropriate transformation. In the

probit transformation,

the proportion

responding (y) is transformed using the inverse of the cumulative

standard normal distribution function. 10

The basis for this is that a

cumulative normal distribution curve is sigmoid in shape. In the

logit

transformation,

the proportion responding (y) is transformed using the

natural log of the odds ratio: ln(y/[1 - y]). Either transformation can be

used and they give similar results. The probit analysis procedure also

provides

methods to compare the median potency (ED50

),

and

parallelism of two curves, as well as providing confidence limits for the

likelihood of response at any dose.

With graded dose-response curve data, a common related method of

analysis is to model the data points to fit a sigmoid

EmaX

model (i.e. the

Hill equation):

where

E = effect,

EmaX

= the maximum effect, [D] = the drug concentration

EC50 =

the concentration yielding 50%of maximal effect, and ydescribes

the slope of the curve.

Table 7.2 Number of

patients (n = 10) with

hypnosis at different

doses of thiopentone

1`

Dose (mg/kg)

Non-pregnant

Pregnant

2.0

0

1

2.4

1 5

2.8

4

6

3.3

7 8

3.8

7

10

4.5

10

10

5.3

10

10

Regression and correlation

87

Figure 7.5 Dose-response curves in pregnant and non-pregnant women. The

95%confidence intervals for ED50 and ED95

are displayed, slightly offset for

clarity. Data points are the original proportions in groups of 10 patients

Multivariate regression

Multiple linear regression is a more complex form of regression used

when there are several independent variables, using the general form of

the equation, Y' = beta0 + beta1X1 + beta2X2 +...

Using this method, many

independent (predictor) variables can be

included in a model, in order to predict the population value of the

dependent (outcome) variable.

It is not necessary for the independent

variables to be normally distributed, nor even continuous.

For example, Boyd et al. 11

measured arterial blood gases and gastric

tonometry (intramucosal pHi) in 20 ICU patients. As part of their

analyses, they used multivariate linear regression to describe the

relationship between pHi (their dependent variable) and a number of

cardiorespiratory (independent) variables. They found mild negative

associations

with heart rate (r = -0.29), systolic pulmonary artery

pressure (r = -0.25), diastolic pulmonary artery pressure (r = -0.22) and

blood lactate (r = -0.36). Because they also found a strong correlation

between blood base deficit and pHi (r = 0.63), they concluded that routine

blood gas measurements could be used instead of gastric tonometry.

Interestingly, they have recently reanalysed their data and included the

variable (Pr-Pa)CO , the gap between gastric mucosal and arterial carbon

dioxide tensions.12

They found that (Pr-Pa)CO2

was not correlated with

arterial blood gas data and so may be a unique measure of splanchnic

perfusion.

Stepwise regression analysis is a type of multivariate analysis used to

assess the impact of each of several (independent) variables separately,

adding or subtracting one at a time, in order to ascertain whether the

addition of each extra variable increases the predictive ability of the

equation (model) - the 'goodness of fit'. It does this by determining

88

**Statistical Methods for Anaesthesia and Intensive Care
**

whether there has been an increase in the overall value of R2

(where R =

multiple correlation coefficient).

For example, Wong and Chundu used stepwise multiple linear

regression to describe factors associated with metabolic alkalosis after

paediatric cardiac surgery13 Here, the dependent variable was arterial

pH, and several patient characteristics and biochemical measures were

included as independent variables. They found that patient age and

serum chloride concentration were the only significant (negative)

associations with arterial pH, and explained 42%of the variability in

postoperative arterial pH (i.e. R2 = 0.42). They concluded that chloride

depletion may be a factor in the pathogenesis of metabolic alkalosis in

that population.

Analysis of covariance is a combination of regression analysis and

analysis of variance (used to compare the mean values of two or more

groups), that adjusts for baseline confounding variables (also known as

covariates). This method can be used when several groups being

compared have an imbalance in potentially important baseline

characteristics which may influence the outcome of interest. Here the

relationship between each baseline factor and the endpoint of interest is

first determined, leading to an adjusted comparison (i.e. so that the

groups are 'equalized' before comparison).

Logistic regression

is a type of regression analysis used when the

outcome of interest is a dichotomous (binary,

or yes/no) categorical

variable. It generates a probability of an outcome from 0 to 1, using an

exponential equation.* This technique is commonly used in outcome

studies in anaesthesia and intensive care, where the outcome of interest

is a dichotomous variable - typically an adverse event or mortality 14 As

with multivariate linear regression, a number of independent (predictor)

variables can be included in the equation, so that their specific effect on

outcome can be adjusted according to the presence of other variables.

Each independent (predictor) variables may be included in the equation

in a stepwise method (one at a time), or all entered together.

If any of the independent variables are also dichotomous, then their

relationship to the outcome of interest can be expressed by the risk ratio,

or its estimate, odds ratio (

OR) (see Chapter 6). Because this is a

multivariate technique, logistic regression can be used to calculate an

adjusted OR.

The OR is the exponential of the regression coefficient (i.e.

OR for the factor x1 is equal to ebeta1

For example, Kurz et a1.

15 investigated the potential relationship

between postoperative wound infection and various perioperative

factors (including

maintenance of normothermia) in patients having

abdominal surgery. Because wound infection is a dichotomous categori-

cal variable, they used multivariate logistic regression. They found there

was a significant association between postoperative wound infection and

smoking (OR 10.5), as well as with perioperative hypothermia (OR 4.9).

where OR = odds ratio, w= beta + beta1X 1 + beta2X 2 + ... , and P = probability of outcome.

This means that smokers were approximately 10.5 times more likely to

have a postoperative wound infection (compared to non-smokers), and

patients who developed hypothermia were 4.9 times more likely

(compared to those who were normothermic).

It should be stressed that a number of equations, or 'models', may be

developed from a data set using multivariate (linear or logistic)

regression analysis. How the final model is constructed depends partly

on the choice of independent variables and their characteristics (as

numerical, ordinal or categorical data). 16 There may be other (unknown)

variables that may have a significant impact on the outcome of interest. 17

Development of a reliable predictive model requires assistance from a

statistician experienced in multivariate regression techniques, because of

the potential problems with, for example, correlation (co-linearity) and

interaction of variables. But it also requires involvement of an

experienced clinician, as the predictor variables ultimately chosen in the

model must be reliable and clinically relevant. These predictor variables

are often considered as 'risk factors'. Further discussion of these issues

can be found in Chapter 8.

Mathematical coupling

If two variables have a mathematical relationship between them, then a

spurious relationship can be calculated using correlation. This is known

as mathematical coupling and overestimates the value of

r.18

This is also

a common error in anaesthesia research, as many endpoints of interest are

actually derived (as indices) from another measured variable(s). For

example, oxygen delivery

(D02)

is a term derived from a measurement of

cardiac output and oxygen content (which in turn is calculated from a

measurement of haemoglobin concentration, arterial oxygen saturation

and tension).* This is commonly calculated along with V0 2 . t Hence, both

V02 and D02 share several values in their derivation. It has been a

frequent error to describe the relationship between Vo2 and D02 using

correlation and regression analysis, with most authors finding an r value

of approximately 0.75, and so concluding, possibly falsely, that not only

is V02 strongly associated with Do2 , but is also dependent on Do2 (i.e.

'supply-dependence').

19

Another common situation is where one variable includes the value of

the other variable - this is an additive mathematical relationship. An

example would be describing the relationship between an initial urine

output (say over the first 4 h) and that over 24 h (i.e. 0-4 h and 0-24 h).

Clearly the fact that the 24-hour urine volume includes the first 4-hour

urine volume will ensure a reasonable degree of association - in this

example mathematical coupling can be avoided by excluding the first

4-hour volume from the 24-hour measurement (i.e. 0-4 hours and 4-24

hours). Mathematical coupling should always be considered when one or

both variables has been derived from other measurements.

*Do2 = CO x (Hb x 1.34 x Sao2 + Pao2 x 0.003).

+Vo2 = CO x [Hb x 1.34 x (Sao2 - Svo2) + (PaoZ - Pvo2) x 0.003]

Regression and correlation

8 9

90

**Statistical Methods for Anaesthesia and Intensive Care
**

Agreement

How well two measurement techniques agree is a common question in

anaesthesia and intensive care: comparing two methods of measuring

cardiac output, arterial (or mixed venous) oxygen saturation, extent of

neuromuscular blockade, depth of anaesthesia, etc. Although correlation

is the correct method for measuring the association between two

numerical variables, and regression can be used to describe their

relationship, they should not (generally) be used to describe agreement

between two measurement methods.20,21

In nearly all situations two

methods used to measure the same variable will have very close

correlation - but they may not have useful clinical agreement! As an

illustration, if two methods differ by a constant amount (which may be

quite large) they will have excellent correlation, but poor agreement.

To describe the agreement between two measurement techniques, the

average between them (considered the 'best guess') and their difference

are first calculated. 20

The average is then plotted against the difference;

this plot is sometimes referred to as a Bland-Altman plot

. 21 The mean

difference between measurement techniques is referred to as the

' bias'

and the standard deviation of the difference is referred to as the

' precision'. The bias is an estimate of howclosely the two methods agree

on average (for the population), but does not tell us how well the

Table 7 .3 Assessing the agreement between two methods of measuring arterial carbon

dioxide tension ( Pco2, mmHg) in 20 patients (after Myles et a1. 22)

The raw data are

presented, along with the calculated bias, precision and limits of agreement

Mean difference between methods ('bias') = 1.1 mmHg.

Standard deviation (SD) of difference ('precision) = 1.6 mmHg.

1.96x SD ('limits of agreement') = 3.1 mmHg.

Laboratory

Labco2

Paratrend-7

P7co2

Average Pco2

(Labco2 +P7co2)/2

Difference between methods

Labco2 - P7co2

33

37 35 -4

39

39 39 0

39

34 36.5 5

38

36 37 2

42 42

42 0

41

41 41 0

32 31

31.5 1

37

35 36 2

42 41

41.5 1

39

38 38.5 1

29 29

29 0

33

33 33 0

41 40

40.5 1

32

30 31 2

34 33

33.5 1

39 37

38 2

37 36

36.5 1

31 29

30 2

38 36

37 2

43 41

42 2

Regression and correlation

91

Figure 7.6

Bland-Altman plot of two methods of measuring arterial carbon

dioxide tension (PCO2) (see Table 7.3)

methods agree for an individual. For this we must use the estimate of

precision. The precision can be multiplied by 1.96 to calculate the 'limits

of agreement',

which describe where 95%of the data (observed

differences) lie.

Whether two methods have clinically useful agreement is

not determined by hypothesis testing; it is the clinician's impression of

the calculated bias and limits of agreement.

For example, two methods for measuring arterial carbon dioxide are

the

Paratrend 7 intravascular device (Biomedical Sensors, High

Wycombe, UK), and a standard laboratory blood gas analyser. These

were compared in patients undergoing cardiac surgery

22 and the data

recorded after cardiopulmonary bypass are presented in Table 7.3 and

Figure 7.6.

If one or both variables are categorical, then the agreement between

them can be determined by the kappa statistic W. This can be used in

situations where, for example, an assessment is made as to whether a

disease is present or absent (using either a diagnostic test, predictive

score or clinical judgment), and this is compared to another method of

assessment. The most common use of the kappa statistic is to describe the

reliability of two observers' ratings or recordings. The kappa statistic describes

the amount of agreement beyond that which would be due to chance. 23 A

kappa value of 0.1-0.3 is sometimes described as mild agreement, 0.3-0.5 as

moderate agreement, and 0.5-1.0 as excellent agreement. There are other

situations where calculation of positive predictive value, likelihood ratio

or risk ratio may be more appropriate (i.e. the chance of a particular

outcome, given a test result - see Chapters 6 and 8).

If

either of the variables is measured on an ordinal scale (or the

question being asked is howwell does a measurement technique agree to

a previous measurement using the same method?), then the

intraclass

92

**Statistical Methods for Anaesthesia and Intensive Care
**

correlation coefficient can be used.23 This is a test of reproducibility. The

extent of agreement, however, is still best described by the standard

deviation of the difference between methods.20

References

1. Bland JM, Altman DG. Calculating correlation coefficients with repeated

observations: Part I - correlation within subjects. Br Med J 1995; 310:446.

2. Bland JM, Altman DG. Calculating correlation coefficients with repeated

observations: Part II - correlation between subjects. Br Med J 1995; 310:633.

3. Altman DG, Gardner MJ. Calculating confidence intervals for regression and

correlation. In: Gardner MJ, Altman DG. Statistics with confidence -

confidence intervals and statistical guidelines. BritishMedical Journal, London

1989: pp34-49.

4. Sackett DL, Haynes RB, Guyatt GH et al. Clinical Epidemiology: A Basic

Science for Clinical Medicine, 2nd ed. Little Brown, Boston 1991: pp283-302.

5.

Siegel S, Castellan NJ. Nonparametric Statistics for the Behavioural Sciences,

2nd ed. McGraw-Hill International Editions, NewYork 1988.

6. Myles PS, Buckland MR, Cannon GB

et al. Comparison of patient-controlled

analgesia and conventional analgesia after cardiac surgery.

Anaesth Intens

Care 1994; 22:672-678.

7. Hull CJ. The identification of compartmental models. In: Pharmacokinetics

for Anaesthesia. Butterworth, London 1991: pp187-197.

8. Sheiner LB, Beal SL. NONMEM Users Guide. Division of Clinical

Pharmacology, University of California, San Francisco 1979.

9. Gin T, Mainland P, Chan MTV, Short TG. Decreased thiopental requirements

in early pregnancy. Anesthesiology 1997; 86:73-78.

10. Finney, D. J. Probit Analysis, 3rd ed. Cambridge University Press, London

1971.

11.

Boyd O, Mackay CJ, Lamb G et al. Comparison of clinical information gained

from routine blood-gas analysis and from gastric tonometry for intramural

pH. Lancet

1993; 341:142-146.

12.

Rhodes A, Boyd O, Bland JM, Grounds RM, Bennett ED. Routine blood-gas

analysis and gastric tonometry: a reappraisal. Lancet 1997; 350:413.

13.

Wong HR, Chundu KR. Metabolic alkalosis in children undergoing cardiac

surgery. Crit Care Med

1993; 21:884-887.

14.

Myles PS, Williams NJ, Powell J. Predicting outcome in anaesthesia:

understanding statistical methods. Anaesth Intensive Care 1994; 22:447-453.

15.

Kurz A, Sessler DI, Lenhardt R. Perioperative normothermia to reduce the

incidence of surgical-wound infection and shorten hospitalization. N Engl J

Med 1996; 334:1209-1215.

16. Simon R, Altman DG. Statistical aspects of prognostic factor studies in

oncology.

Br J Cancer 1994; 69:979-985.

17.

Datta M. You cannot exclude the explanation you have not considered. Lancet

1993; 342:345-347.

18.

Archie JP Mathematical coupling of data. A common source of error. Ann

Surg 1981;193:296-303.

19.

Myles PS, McRae RJ. Relation between oxygen consumption and oxygen

delivery after cardiac surgery: beware mathematical coupling. Anesth Analg

1995; 81:430-431.

20. Bland JM, Altman DG. Statistical methods for assessing agreement between

two methods of clinical measurement.

Lancet 1986; i:307-310.

Regression and correlation

93

21. Bland JM, Altman DG. Comparing methods of measurement: why plotting

difference against standard method is misleading.

Lancet

1995; 346:1085-1087.

22.

Myles PS, Story DA, Higgs MA

et al. Continuous measurement of arterial and

end-tidal carbon dioxide during cardiac surgery: Pa_ETCO2

gradient. Anaesth

Intensive Care 1997; 25: 459-463.

23.

Morton AP, Dobson AJ. Assessing agreement.

Med J Aust

1989; 150:384-387.

8

Predicting outcome: diagnostic

tests or predictive equations

Sensitivity and specificity

Prior probability: incidence and

prevalence

-positive and negative predictive value

Bayes' theorem

Receiver operating characteristic

(ROC) curve

Predictive equations and risk scores

Key points

• Sensitivity of a test is the true positive rate.

• Specificity of a

test is the true negative rate.

• Positive predictive value is the proportion of patients with an outcome if the

test is positive.

• Negative predictive value is the proportion of patients without an outcome if

the test is negative.

A receiver operating characteristic (ROC) curve can be used to illustrate the

diagnostic properties of a test on a numerical scale.

• Risk prediction is usually based on a multivariate regression equation.

• A predictive score

should be prospectively validated on a separate group of

patients.

•

Predictive scores are generally unhelpful for predicting uncommon (< 10%)

events in individual patients.

Sensitivity and specificity

Diagnostic tests are used to guide clinical practice.l -3 They are used to

enhance a clinician's certainty about what will happen to their patient.

The most familiar is a laboratory test or investigation, but many aspects

of a clinical examination or patient monitoring should also be considered

as diagnostic tests. Predictive equations and risk scores are diagnostic

tests. For example, the Mallampati score is commonly used to assess a

patient's airway in order to predict difficulty with endotracheal

intubation. 4

Clinicians need to know how much confidence should be

placed in such tests - are they accurate and reliable?

A diagnostic test usually gives a positive or negative result and this

may be correct or incorrect. The accuracy of a diagnostic test can be

described by its sensitivity and specificity. The sensitivity of a test is its

true positive rate. The specificity is its true negative rate. Thus the

sensitivity and specificity of a test describe what proportion of positive

and negative tests results are correct given a known outcome (Figure 8.1).

Common events occur commonly. If a disease is common, and is

confirmed by a diagnostic test, then it is very likely to be a true result.

Similarly, if a predictive test is positive for an outcome that is common,

then it is even more likely to occur. If a test is negative for a common (or

expected) event, then the clinician needs to be certain that the chance of

Predicting ou tcome: diagnostic tests or predictive equ ations

95

Sensitivity of the new test = TP/(TP + FN)

Specificity of the new test = TN/(TN + FP)

Positive predictive value of the new test = TP/(TP + FP)

Negative predictive value of the new test = TN/(TN + FN)

Figure 8.1 Sensitivity and specificity, positive and negative predictive value

a false negative result (1-specificity) is extremely low. In most clinical

situations a negative test result should be reviewed if clinical suspicion

was high. Conversely, a single positive test result for a rare event is

unlikely to be true (most positive results will be incorrect).

Prior probability: incidence and prevalence

Where

**TP = true positive
**

FP = false positive

TN = true negative

FN = false negative

The value of a diagnostic test in clinical practice does not just depend on

its sensitivity and specificity. As stated above, common events can be

more confidently predicted and the clinical circumstances in which a test

is to be applied must be taken into consideration. Information about

prevalence (or incidence) is required. Prevalence is the proportion of

patients with a disease (or condition of interest) at a specified time.

Incidence is the proportion of patients who develop the disease (or

outcome of interest) during a specified time. Pre-existing conditions

should be described by their prevalence rate, whereas outcomes should

be described by their incidence rate. Prior probability is a term used

interchangeably for prevalence (or incidence, if it is in reference to an

expected outcome rate).

Clinical interpretation of diagnostic tests requires consideration of

prior probability. This can be done by calculating the positive predictive

value (PPV) and negative predictive value (NPV). PPV describes the

likelihood of disease or outcome of interest given a positive test result.

NPV describes the likelihood of no disease or avoiding an outcome given

a negative test result (Figure 8.1). Prior probability is sometimes referred

to as the pre-test risk; the post-test risk refers to either PPV or NPV

It is common for authors to report optimistic values for PPV and NPV,

yet both are dependent on prevalence* - if a disease (or outcome) is

*If the test is being used for prediction of outcome (e.g. risk score), then PPV is dependent

on its incidence rate.

96

**Statistical Methods for Anaesthesia and Intensive Care
**

common, a test (irrespective of its diagnostic utility) will tend to have a

high PPV The same test in another situation where disease prevalence is

lowwill tend to have poor PPV and yet high NPV. Therefore, the context

in which the diagnostic test was evaluated should be considered - does

the trial population in which the test was developed represent the clinical

circumstances in which the test is to be applied? Was there a broad

spectrum of patients (of variable risk) studied?1

An example of this is electrocardiographic diagnosis of myocardial

ischaemia. 2,5

It is generally accepted that ST-segment depression >- 1 mm

is an indicator of myocardial ischaemia, and with this criterion it has a

sensitivity of about 70%and specificity of about 60%.

5 If a 70-year-old

man with coronary risk factors is found to have ST-segment depression,

then it is very likely that this indicates myocardial ischaemia. But if a

60-year-old woman has the same degree of ST-segment depression, then

it

remains unlikely that she has myocardial ischaemia.

2.6

This

discrepancy can be quantified using PPV and NPV and illustrates the

relevance of prior probability (Figure 8.2). In the first case (the elderly

man), it could be expected that 60%of such patients (i.e. prior probability

60%) have ischaemic heart disease, but in the second case (the woman), it

might only be 10%. If the sensitivity of ECG diagnosis of myocardial

ischaemia is 70%, then the PPV for such patients is 74%. The PPV for the

woman is only 17%. Rifkin and Hood present a cogent argument

describing how the extent of ST-depression should be interpreted

according to the perceived risk (prior probability) of myocardial

Figure 8.2 The effect of prevalence, or prior probability on PPV and NPV,

assuming an ST-segment diagnosis of myocardial ischaemia with sensitivity

70%, specificity 60%: (a) 100 elderly men (prevalence =60 %); (b) 100 women

(prevalence =10%)

ischaemia. 2

PPV can also be calculated by several other methods; one of

these is a mathematical formula known as

Bayes' theorem.

Sensitivity, specificity, PPV and NPV are proportions and so can be

described

with corresponding 95% confidence intervals.

7 These are

measures of a test's reliability.

Bayes' theorem

Predicting outcome: diagnostic tests or predictive equations

97

Bayes' theorem* is a formula used to calculate the probability of an

outcome (or disease), given a positive test result.

2,3 It combines the

characteristics of the patient (prior probability), the test (sensitivity) and

the test result, to calculate PPV Bayes' formula states that the PPV is

equal to the sensitivity of the test multiplied by the prevalence (or

incidence) rate, divided by all those with a positive test (Figure 8.3).

Figure 8.3 Bayes' theorem

PPV is a

conditional probability.

It is the probability of having the

disease (or outcome)

given that the test was positive. The symbol

'I ' is

used to denote that the item to its left presumes the condition to its right.

3

Hence PPV is denoted by P(D+I

T+). Using this nomenclature, sensitivity

can be denoted by P(T+1 D+) and specificity by P(T-/

D-).

As described above, the utility of a test depends on its accuracy

(sensitivity and specificity) and prior probability. If this probability is low,

then the ratio of the true positive rate (sensitivity) to the false negative

rate (1 -specificity) must be very high in order for the test to be useful in

clinical practice. This relationship can be illustrated using a

nomogram,

which illustrates the varying effect of the prior probability and sensitivity

of the test, with the likelihood of an outcome.$ In clinical practice a

positive test result usually offers very little extra information for an

outcome that is already likely, unless the test is very sensitive and/or

specific.

Bayes' theorem can be rearranged to calculate the

odds of an outcome

given a test result - the likelihood ratio.

1,2

The prior odds is defined as

the odds of an outcome before the test result is known and is the ratio of

prior probability to 1 minus prior probability (prevalence/ 1-prevalence).

Either a positive or negative likelihood ratio can be calculated, according

to whether a test result is positive or negative.

A Bayesian approach can also be used to interpret clinical trials. 9

A

significant P value for an unexpected event is less likely to be true (i.e.

*Thomas Bayes (1763): 'An essay towards solving a problem in the doctrine of chances'.

98

**Statistical Methods for Anaesthesia and Intensive Care
**

lower PPV because of a lower prior probability) than a P value that may

not be significant (say P = 0.11) for an event that had been the main

subject of study, or had been demonstrated in previous studies.

The important issue is that

effect size, not P value, is the more

i mportant consideration when interpreting clinical trial results. This

approach has also been suggested for interim analysis of large trials. 1

o

Receiver operating characteristic (ROC) curve

Not all diagnostic test results are simply categorized as 'positive' or

'negative'. Anaesthetists and intensivists are frequently exposed to test

results on a numerical scale. Some judgment is required in choosing a cut-

off point to denote normal from abnormal (or negative from positive).

Laboratory reference ranges are usually calculated from a healthy

population, assuming a normal distribution in the population, as mean

± 1.96 standard deviations. Predictive equations, or risk scores, usually

have some arbitrary cut-off value, whereby it is considered a higher score

denotes higher risk of an adverse outcome ('test positive'). The cut-off

value should ideally be selected so that the risk score has greatest

accuracy. There is a trade off between sensitivity and specificity -

if the

cut-off value is too lowit will identify most patients who have an adverse

outcome (increase sensitivity) but also incorrectly identify many who do

not (decrease specificity).

The change in sensitivity and specificity with different cut-off points

can be described by a receiver operating characteristic (ROC) curve

(Figure 8.4). 3 An ROC curve assists in defining a suitable cut-off point to

denote 'positive' and 'negative'. In general, the best point lies at the

elbow of the curve (its highest point at the left). However, the final,

perhaps most important consideration, is for the intended clinical

Figure 8.4 Receiver operating characteristic (ROC) curve. The broken line

signifies no predictive ability

Predicting outcome: diagnostic tests or predictive equations

99

circumstances to guide the final choice of cut-off point. If the con-

sequences of false positives outweigh those of false negatives, then a

lower point on the curve (to the left) can be chosen.

Because an ROC curve plots the relationship between sensitivity and

specificity, which are independent of prevalence, it will not be affected by

changes in prevalence. The slope of the ROC curve represents the ratio of

sensitivity (true positive rate) to the false positive rate. The line of

equality (slope = 1.0) signifies no predictive ability. The steeper the slope,

the greater the gain in PPV.

The area under an ROC curve represents the diagnostic (or predictive)

ability of the test. An ROC area of 0.5 occurs with the curve of equality

(the line y = x) and signifies no predictive ability. Most good predictive

scores have an ROC area of at least 0.75. Two or more predictive, or risk,

scores can be compared by measuring their ROC areas.

11-13

For example, Weightman et a1. 13 compared four predictive scores used

in adult cardiac surgery and found they had similar ROC areas (about

0.70) in their surgical population. They concluded that all of the scores

performed well when predicting group outcome, but would be unreliable

for individual patients. This was because adverse outcomes were rare in

their study (mortality 3.5%).

Predictive equations and risk scores

Outcome prediction has four main purposes:

• to identify factors associated with outcome (so that changes in

management can improve outcome)

• to identify patient groups who are at unacceptable risk (in order to

avoid further disability or death, or for resource allocation)

• to match (or adjust) groups for comparison

• to provide the patient and clinician with information about their risk

Identification of low-risk patients (who should not need extensive

preoperative evaluation or expensive perioperative care) may save

valuable resources for those most at need. Similarly, if patients are at

unacceptable risk, whereby expensive resources are not expected to

i mprove outcome, it may be appropriate to deny further treatment. In

both of these situations it is imperative that a predictive score is reliable.

This is often the case for groups of patients, but less so for the individual.

This is a very common problem in anaesthesia because serious morbidity

and mortality are rare events and so most 'predictive scores' are not very

helpful. Risk adjustment is a more accurate way of correcting for

baseline differences in clinical studies, or for correcting for 'casemix'

when comparing institutions.

Many studies in anaesthesia and intensive care are used to derive a

predictive equation or risk score. Information regarding patient

demographics, comorbid disease, results of laboratory tests and other

clinical data, may be analysed in order to describe their association with

eventual patient outcome. This process may identify causative or

exacerbating factors, as well as preventive factors.

100

**Statistical Methods for Anaesthesia and Intensive Care
**

The simplest method of describing the relationship between a

predictor variable and outcome is with one of the familiar univariate

techniques - for numerical outcomes it may be Student's t-test or Mann-

Whitney U test; for categorical data it is usually

X 2 or risk ratio calculated

from a 2 x 2 contingency table. Though not essential, these techniques are

commonly used during the initial stages of developing a predictive

equation or risk score. They act as a screening process in order to identify

any possible predictor variables, which are usually chosen as those with

P < 0.05. These predictor variables are often considered as 'risk factors'.

Univariate techniques cannot adjust for the combined effects of other

predictor variables. Hence, with multiple factors, each possibly inter-

related, it is necessary to use some form of

multivariate

statistical

analysis.

These include linear and logistic regression, discriminant

analysis and proportional hazards. 14-17

Regression analysis is used to predict a dependent (outcome) variable

from one, or more independent (predictor) variables.

Multiple linear

regression is

used when the outcome variable is measured on a

numerical scale. Logistic regression is used when the outcome of interest

is a

dichotomous (binary, or yes/no) categorical variable.

1: 1

Discriminant analysis

is used when there are more than two outcome

categories (i.e. on a categorical or ordinal scale).

Cox proportional

hazards is

used when the outcome is time to an event (usually

mortality).

1: 18 1`

Further descriptions of these methods can be found in

Chapters 7 and 9.

Stepwise regression analysis is a type of multivariate analysis used to

assess the impact of each of several predictor variables separately, adding

or subtracting one at a time, in order to ascertain whether the addition of

each extra variable increases the predictive ability of the equation, or

model -

the 'goodness of fit'. A forward stepwise procedure adds one

variable at a time; a backward stepwise procedure removes one variable

at a time. It does this by determining whether there has been a significant

increase (for a forward procedure) or decrease (for a backward

procedure) in the overall value of R 2

(for regression methods, where R =

multi le correlation coefficient) or a goodness of fit statistic (similar to

,2). 16

``

R2

measures the amount of variability explained by the model

and is one method of describing its reliability. It is a measure of effect size.

1 - R2

is the proportion of variance yet to be accounted for. This process

may not necessarily select the most valid, or clinically important,

predictor variables? It may also continue to include factors that offer

very little additional predictive ability (at the expense of added

complexity). An alternative, or complementary, method is to first include

known, established risk factors.

A derived equation, or model, is likely to be unreliable if too few

outcome events are studied. 15

This may result in spurious factors being

identified ('over-fitting' the data) and important ones being missed. The

reliability, or precision, of the equation is dependent on the size of the

study. The larger the sample the more reliable the estimate of risk. It is

recommended that at least ten outcome events should have occurred

with each predictor variable in the mode1.

15,17

The regression coefficients,

or weightings, assume a linear gradient

Predicting outcome: diagnostic tests or predictive equations

10 1

between the predictor variable and the outcome of interest. This means

that a unit change in the predictor variable will be associated with a unit

change in the probability of the outcome. It is best to check for this by

visual inspection of the plotted data or stratifying the predictor variables

into ordered groups to confirm that the effect is uniform across the range

of values. It may be preferable to categorize a numerical predictor

variable if this more clearly discriminates different levels of risk.

It

must be stressed that a number of equations, or models, may be

developed from a data set.

15,17

How the final model is constructed

depends partly on the choice of predictor variables and their

characteristics, or coding (as numerical, ordinal or categorical data).

15,17

There may be other variables, known or unknown, that may have a

significant impact on the outcome of interest. 21 Development of a reliable

predictive model requires assistance from a statistician experienced in

multivariate techniques, because of the potential problems with, for

example, correlation (co-linearity) and interaction of variables. But it

also requires involvement of an experienced clinician, as the predictor

variables ultimately chosen in the model must be reliable and clinically

relevant. 17

Because outcome prediction is usually based on a predictive equation

developed using multivariate analyses, it will, by virtue of its derivation,

be able to predict that original data set well.14,1

,1'7

Further validation is

required before accepting its clinical utility. One method is to split the

study population into two, deriving a risk score from the first and testing

it on the second. Another method is to prospectively validate it using

another data set, or preferably, externally validating the equation, or

score, at another institution.)? Bootstrapping is a method of random

sampling and replacement from the data set so that multiple samples can

be analysed in order to validate the derived model, or risk score.

2

It has

been shown to be a more reliable method than split-samples. This

method was used recently by Wong et al.23 when identifying risk factors

for delayed extubation and prolonged length of stay with fast-track

cardiac surgery.

Wasson et al.14 and others

1,15,17

have described standards for derived

predictive scores. They include a clearly defined outcome, clearly defined

(objective) risk factors, separation of predictive and diagnostic factors

(ideally

with blinded assessment of outcome), a clearly defined study

population (ideally a wide spectrum of patients) and prospective

validation in a variety of settings. 14

Each of the above multivariate methods derives an equation that

predicts the probability of an outcome of interest. Because regression

equations are often very complex, it is common to convert them to a risk

score for clinical use. 15 The

regression coefficients (from linear or logistic

regression), odds ratios

(from logistic regression) or hazard ratios

(proportional hazards) usually form the basis of a numerical score for

each risk factor.

Outcome prediction only applies to groups of patients. If a predictive

equation or risk score estimates that the risk of postoperative mortality is

42%, this does not mean that a patient with the specified characteristics

has a 42%risk of death, only that if 100 similar patients were to proceed

1 02

**Statistical Methods for Anaesthesia and Intensive Care
**

with surgery, then

42

would be expected to die postoperatively. We do

not knowwith any certainty, which of those patients will survive or die.

One remaining point should not be forgotten: association does not

i

mply causation. For this reason, treatment of identified risk factors may

not improve outcome. Just because a strong association is demonstrated,

does not, of itself, support a conclusion of cause and effect. This requires

added proof, such as a biologically plausible argument, demonstration of

the time sequence (discerning cause from effect) and exclusion of other,

perhaps unknown, confounding factors.

21,24

For example, amiodarone

has been associated with poor outcome after cardiac surgery, 25

yet this

may be partly explained by the fact that patients who have been treated

with amiodarone are more likely to have poor ventricular function and it

may be this confounding factor that explains the poor outcome. This

issue could be clarified with a prospective, randomized controlled trial.

Outcome during and after ICU admission has been the subject of many

studies. The most familiar are the APACHE scoring systems.26 These

were developed because ICU patients frequently suffer multisystem

disease and previous risk scores usually focused on a single organ

dysfunction or disease. In APACHE III, Knaus et a1.

26

collected data on

predictor variables and patient outcome from 26 randomly selected

hospitals and 14 volunteer hospitals in the USA. A total of 17 440 patients

were studied as a split sample. The first (derivation) group had weights

calculated for various chronic disease and physiological variables using

logistic regression and these weights were converted into scores. The

APACHE III was then prospectively tested on a second (validation)

group. A series of regression equations and the APACHE III score are

available to calculate the probability of various ICU outcomes. 26 Seneff

and Knaus have written an excellent review of several ICU scoring

systems

A good example of a predictive, or risk score is that developed by

Higgins et a1.

28

who collected retrospective data on 5051 patients

undergoing coronary artery bypass grafting. 28 They used ,'C

2

and Fisher's

exact test to identify risk factors associated with morbidity and mortality

and also calculated odds ratios to measure the degree of association.

Significant risk factors identified by these univariate methods were then

entered into a logistic regression analysis. This enabled adjusted odds

ratios (for confounding) to be calculated. The final logistic equation, or

model, was tested for its predictive ability ('goodness-of-fit') using a test

similar to X2 (called the Hosmer-Lemeshow statistic). It should be noted,

that at this stage of their study, they only validated their model on their

original data set and so naturally they found their model had good

predictive properties. They then used the univariate odds ratios (and

'clinical considerations') to give each significant factor a score of 1 to 6.

They also constructed ROC curves to compare various versions of their

derived clinical severity score. Importantly, they then prospectively

collected data on a further 4169 patients at their institution and tested

their score on this (validation) group. The overall agreement with this

newdata set was also tested with the Hosmer-Lemeshowstatistic. They

calculated that if a score of 6 (out of 33) was chosen as a cut-off point for

mortality, their score had a sensitivity of 63%and specificity of 86%, and

a PPV of 11%and a NPV of 99%. As the authors state, because the PPV

was only 11%(i.e. for each 100 patients identified by their score, only 11

actually died postoperatively; the other 89 survived), their score is best to

identify low-risk patients as the NPV was 99%.

The bispectral index (BIS) is an EEG-derived estimate of depth of

hypnosis.

29

In earlier versions, the manufacturers used multivariate

regression to calculate the probability of movement. This was later

modified to correlate the BIS with level of hypnosis. In either case,

regression analysis was used to construct a range of BIS values that could

be used to reflect depth of hypnosis. Thus, BIS is a predictive model. It

can be considered to have a certain sensitivity and specificity, and

positive and negative predictive value.

References

Predicting ou tcome: diagnostic tests or predictive equ ations

103

1. Sackett DL, Richardson WS, Rosenberg W et al. Evidence-based Medicine:

Howto Practice and Teach EBM. Churchill Livingstone, London 1997:

pp81-84.

2. Rifkin RD, Hood WB. Bayesian analysis of electrocardiographic exercise

stress testing. N Engl J Med 1977; 297:681-686.

3. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;

V111:283-298.

4. Mallampati SR, Gatt SP, Gugino LD et al. A clinical sign to predict difficult

intubation: a prospective study. Can Anaesth Soc J 1985; 32:429-434.

5. Carliner NH, Fisher ML, Plotnick GD et al. Routine preoperative exercise

testing in patients undergoing major noncardiac surgery. Am J Cardiol 1985;

56:51-58.

6. Fleisher LA, Zielski MM, Schulman SP Perioperative ST-segment depression

is rare and may not indicate myocardial ischemia in moderate-risk patients

undergoing noncardiac surgery. J Cardiothorac vasc Anesth 1997; 11:155-159.

7. Gardner MJ, Altman DG. Calculating confidence intervals for proportions

and their differences. In: Gardner MJ, Altman DG. Statistics with Confidence.

British Medical Journal, London 1989: pp28-33.

8. Fagan TJ. Nomogram for Bayes' theorem. N Engl J Med 1975; 293:257.

9. Browner WS, Newman TB. Are all significant p values created equal? The

analogy between diagnostic tests and clinical research. JAMA 1987;

257:2459-2463.

10. Brophy JM, Joseph L. Bayesian interim statistical analysis of randomised

trials. Lancet

1997; 349:1166-1168.

11.

DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under

t

wo or more correlated receiver operating characteristic curves: a

nonparametric approach.

Biometrics 1988; 44:837-845.

12.

Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a

fundamental evaluation tool in clinical medicine. Clin Chem 1993; 39:561-577.

13. Weightman WM, Gibbs NM, Sheminant MR

et al. Risk prediction in coronary

artery surgery: a comparison of four risk scores.

MJA 1997; 166:408-411.

14. Wasson JH, Sox HC, Neff RK et al.

Clinical prediction rules: applications and

methodological standards. N Engl J Med

1985; 313:793-799.

15. Concato J, Feinstein AR, Holford TR. The risk of determining risk with

multivariable models. Ann Intern Med

1993; 118:201-210.

16.

Lee J. Covariance adjustment of rates based on the multiple logistic regression

model. J Chronic Dis 1981; 34:415-426.

1 04

**Statistical Methods for Anaesthesia and Intensive Care
**

17. Simon R, Altman DG. Statistical aspects of prognostic factor studies in

oncology. Br J Cancer 1994; 69:979-985.

18. Cox DR. Regression models and life tables. J R Stat Soc Series B 1972;

34:187-220.

19. Peto R, Pike MC, Armitage P et al. Design and analysis of randomized clinical

trials requiring prolonged observation of each patient. II. Analysis and

examples. Br J Cancer

1977; 35:1-39.

20. LemeshowS, Hosner DW A reviewof goodness-of-fit statistics for use in the

development of logistic regression models. Am J Epidemiol 1982; 115:92-98.

21.

Datta M. You cannot exclude the explanation you have not considered. Lancet

1993; 342:345-347.

22.

Harrell FE, Lee KI, Mark DB. Multivariable prognostic models: issues in

developing models, evaluating assumptions and adequacy, and measuring

and reducing errors. Stat Med

1996; 15:361-387.

23. Wong DT, Cheng DCH, Kustra R

et al. Risk factors of delayed extubation,

prolonged length of stay in the intensive care unit, and mortality in patients

undergoing coronary artery bypass graft with fast-track cardiac anesthesia. A

newcardiac risk score. Anesthesiology

1999; 91:936-944.

24. Sackett DL, Haynes RB, Guyatt GH et al.

Clinical Epidemiology: a Basic

Science for Clinical Medicine, 2nd edn. Little Brown, Boston 1991: pp283-302.

25. Mickleborough LL, Maruyama H, Mohammed A et al.

Are patients receiving

amiodarone at increased risk for cardiac operations? Ann Thorac Surg 1994;

58:622-629.

26. Knaus WA, Wagner DP, Draper EA et al. The APACHE III prognostic system:

risk prediction of hospital mortality for critically ill hospitalized adults. Chest

1991; 100:1619-1639.

27. Seneff M, Knaus WA. Predicting patient outcome from intensive care: a guide

to APACHE, MPM, SAPS, PRISM, and other prognostic scoring systems. J

Intens Care Med 1990; 5:33-52.

28. Higgins TL, Estafanous FG, Loop FD et al. Stratification of morbidity and

mortality outcome by preoperative risk factors in coronary artery bypass

patients: a clinical severity score. JAMA 1992; 267:2344-2348.

29. Rampil IJ. A primer for EEG signal processing in anesthesia. Anesthesiology

1998; 89:980-1002.

9

Survival analysis

What is survival analysis?

Kaplan-Meier estimate

Comparison of survival curves

-logrank test

-Cox proportional hazard model

Key points

• Survival analysis is used when analysing time to an event.

• The Kaplan-Meier method estimates survival using conditional probability,

whereby survival depends on the probability of surviving to that point and the

probability of surviving through the next time interval.

• Two survival curves can be compared using the logrank test.

• The hazard ratio is the risk of an event compared with a reference group.

What is survival analysis?

The `hazards' of survival analysis

Many patients undergoing major surgery, or admitted to intensive care,

are at increased risk of early death. Clinical research in these areas often

includes measures of outcome such as major morbidity and mortality. But

these rates are usually only described at certain points in time (such as in-

hospital mortality or 30-day mortality). They ignore other valuable

information about exactly when the deaths occurred. The actual pattern of

death - when, and how many patients die - and its converse, the pattern

of survival, are less frequently investigated in anaesthesia and intensive

care research. This is particularly so for survival rates over a longer

period of time. The statistical analysis of the pattern of survival is known

as survival analysis.1,2 The outcome of interest, survival, is treated as a

dichotomous (binary, or yes/no) categorical variable and can be

presented at any point in time as a proportion.

Although survival analysis is most often concerned with death rates,

the outcome of interest may be any survival event, such as extubation in

ICU, failure of arterial cannulae, or freedom from postoperative nausea

and vomiting. The difference, of course, is that in these circumstances

complete outcome data on all patients are usually available and so the

more familiar comparative statistical tests are most commonly used.

However, survival analysis is generally a preferable approach as it

provides more clinically relevant information concerning the pattern of

the outcome of interest.

If a death rate remains constant, the probability of death over any time

period can be estimated (using the Poisson distribution - see Chapter 2).

This rarely occurs in clinical practice: usually mortality is high initially,

then tapers off. Such rates that vary over time are called hazard rates.

Survival, or actuarial, analysis is most commonly applied in cancer

research, and analysis of outcome after cardiothoracic surgery and organ

1 06

**Statistical Methods for Anaesthesia and Intensive Care
**

transplantation. The mean time of survival can be a totally misleading

statistic, as it will depend on patients' mortality distribution pattern as

well as how long they were followed up. The median

survival time is

sometimes used but is only available after more than half the patients

have eventually died. Because deaths do not occur in a linear fashion,

estimates made over a brief period of time may not reflect the true overall

pattern of survival. If some patients have only been in a trial for a few

months, we cannot know how many of them will survive over a longer

period (say one or two years). Start and finish times (patient recruitment

and eventual mortality) are usually scattered.

We therefore need a

method that accommodates for the incomplete data. The number of

patients studied, when they died, and the length of time they have been

observed, are crucial data required to describe the pattern of survival.

Because we have information on only those patients who actually died,

we do not know how long the remaining patients will survive. These

survival times are 'censored' and the

data concerning surviving patients

are called censored observations. Censoring removes these individuals

from further analysis. Patients who are withdrawn from treatment, or

who are lost to follow-up, are also censored observations (at the point

they leave the study). Yet the information concerning patients in a trial

who have not yet died, or have not yet survived for a specified period, is

of some value. After all, we know they have survived for at least that

period of time - some of their survival information can be included.

Essentially, an estimate is made of the patients' probability of survival,

given the observed survival rates in the trial at each time period of

interest. Information about the censored data is included (up until they

leave or are lost from the trial).

There are two main methods for describing survival data, the

actuarial

(life table)

method and the

Kaplan-Meier method. The actuarial

method first divides time into intervals and calculates survival during

each of these intervals (censored observations are assumed to have

survived to halfway into the interval). The Kaplan-Meier method

calculates the probability of survival each time a patient dies. A survival

table or graph may be referred to as a

'life table'. An example, using

survival after heart transplantation, is presented in Table

9.1.

Kaplan-Meier estimate

The Kaplan-Meier technique is a non-parametric method of estimating

survival and producing a survival curve. The probability of death is

calculated when patients die, and withdrawals are ignored (Table 9.2);

therefore withdrawals in smaller studies do not have such an effect on

calculated survival rates. The Kaplan-Meier method calculates a

conditional probability:

the chance of surviving a time interval can be

calculated as the probability of survival up until that time, multiplied by

the probability of death during that particular interval. For example, to

survive for 15 months, a patient must first survive 12 months, and then

three more. The probability of surviving 12 months (using Table 9.2) is

Survival analysis

107

Table 9.1 Survival data after heart transplantation (hypothetical data). The life table

describes the observed outcome over two years

0.742, and the probability of survival over the next three months is 0.96;

hence the probability of surviving 15 months is 0.742 x 0.96 = 0.712.

Changes in probability only occur at the times when a patient dies. The

resultant Kaplan-Meier curve is step-like, as a change in the proportion

surviving occurs at the instant a death occurs (Figure 9.1). Both

standard

error and 95% confidence intervals

can be calculated. 3

These usually

widen over the time period because the number of observations

decreases.

Comparison of survival curves

The survival pattern of different patient groups can be compared using

their survival curves. This is most commonly applied when comparing

two (or more) treatment regimens or different treatment periods (changes

over time), or deciding whether various baseline patient characteristics

(gender, age groups, risk strata, etc.) can be used to predict (or describe)

eventual survival. The most obvious method would be to compare the

Table 9.2 Kaplan-Meier estimates for survival after heart transplantation (using the

above hypothetical data)

Time period

Number of patients

alive at the start

of each time period

Number of deaths

Alive, but yet to reach

next time period

(or lost to follow-up)

0-2 mths

74

10

2

2-4 mths

62

4 1

4-6 mths

57

1 1

6-8 mths

55

1 2

8-12 mths

52

1 1

12-16 mths

50

2 3

16-20 mths

45 2 1

20-24 mths

42

0 0

24-30 mths

42

2 1

30-36 mths

39 0 1

Month of

death

Number of

patients (p)

Number of

deaths (d)

Probability of

death (d/p)

Probability of

survival (1-d/p)

Cumulative

survival

1

74 10 0.135

0.865

0.865

3

62 4

0.065 0.935

0.809

6

57 1

0.018 0.972

0.786

7

55 1

0.018 0.972

0.764

12

52 1

0.019 0.971

0.742

15

50 2

0.040 0.960

0.712

20

45 2

0.044 0.956

0.681

28

42 2

0.048 0.952

0.648

Figure 9.1 Kaplan-Meier survival curve for heart transplantation (see Table 9.2)

survival rates of the groups using standard hypothesis testing for

categorical data (such as the familiar chi-square test). But this is

unreliable, as it only depicts the groups at a certain point in time, and

such differences between groups will obviously fluctuate. What is needed

is a method that can compare the whole pattern of survival of the groups.

Various non-parametric tests can be used for this purpose.

One technique is to rank the survival times of each individual and use

the Wilcoxon rank sum test (survival time is not normally distributed

and so the t-test would be inappropriate). This method is unreliable if

there are censored observations (i.e. patients lost to follow-up or still

alive). In these situations, a modification known as the generalized

Wilcoxon test can be used (also known as the Breslow or Gehan test). In

general, these tests are uncommonly used because they are not very

powerful and so may fail to detect significant differences in survival

between groups.

An alternative (and popular) technique is to use the logrank test.2 This

is based on the

X 2

test and compares the observed death rate with that

expected according to the null hypothesis (the Mantel-Haenszel test is

also a variation of this). Time intervals are chosen (such as one- or two-

month periods) and the number of deaths occurring in each is tabulated.

The logrank test then determines the overall death rate (irrespective of

group), and compares the observed group death rate with that expected

if there was no difference between groups. The results of each time

interval are tabulated and a

X 2

statistic is generated. An advantage of the

logrank test is that it can also be used to produce an odds ratio as an

estimate of risk of death: this is called a hazard ratio. A test for trends can

also be used, so that risk stratification can be quantified.

For example, Myles ct al.' compared two anaesthetic techniques in

patients undergoing cardiac surgery. The time to tracheal extubation was

Survival analysis

109

Table 9.3 Kaplan-Meier estimates for tracheal extubation after coronary artery bypass

graft surgery in patients receiving either an enflurane-based (Enf), or propofol-based

(Prop) anaesthetic (using data from Myles et al.

4 )

analysed using survival techniques (Table 9.3 and Figure 9.2). This study

demonstrated that a propofol-based anaesthetic technique, when

compared to an enflurane-based technique, resulted in shorter extubation

ti

mes. As stated above, survival techniques do not need to be restricted

to analysing death rates but can be used to analyse many terminal events

of interest in anaesthesia and intensive care research. Because eventual

outcome was known for all patients in Myles' study (i.e. there were no

Figure 9.2 Kaplan-Meier survival curves, illustrating tracheal extubation after

coronary artery bypass graft surgery in patients receiving either an enflurane-

based or propofol-based anaesthetic (see Table 9.3)

Time after

ICU

admission (h)

Number of

patients

Enf

Prop

Number of

patients

extubated

Enf Prop

Probability

of extubation

Enf Prop

Probability of

continued

mechanical

ventilation

Enf Prop

Cumulative

proportion

still

ventilated

Enf Prop

0 66 58

0 0 0.00 0.00

1.00 1.00 1.00

1.00

2 66 58

2 4 0.03 0.07

0.97 0.93 0.97

0.93

4 64 54

7 12 0.11 0.22

0.89 0.78 0.86

0.72

6 57

42 9 12 0.16 0.29

0.84 0.71 0.73

0.52

8 48 30

4 3 0.08 0.10

0.92 0.90 0.67

0.47

10

44 27 7 10 0.16

0.37 0.84 0.63

0.56 0.29

12 37

17 11 7 0.30 0.41

0.70 0.59 0.39

0.17

14 26

10 6 3 0.23

0.30 0.77 0.70 0.30

0.12

16 20

7 8 0 0.40 0.00

0.60 1.00 0.18

0.12

18 12

7 5 2 0.42

0.29 0.58 0.71 0.11 0.09

20 7 5

2 0 0.29 0.00

0.71 1.00 0.08

0.09

22 5

5 1 1 0.20

0.20 0.80 0.80 0.06

0.07

24 4

4 0 0 0.00

0.00 1.00 1.00

0.06 0.07

110

**Statistical Methods for Anaesthesia and Intensive Care
**

censored observations), traditional (non-parametric) hypothesis testing

was also employed to compare the median time to extubation.

The Cox proportional hazards model is a multivariate technique

similar to logistic regression, where the dependent (outcome) variable of

interest is not only whether an event occurred, but when. 5

It is used to

adjust the risk of death when there are a number of known confounding

factors (covariates) and therefore produces an adjusted hazard ratio

according to the influence on survival of the modifying factors. Common

modifying factors include patient age, gender and baseline risk status.

For example, Mangano et al. investigated the benefits of perioperative

atenolol therapy in patients with, or at risk of, coronary artery disease.

They randomized 200 patients to receive atenolol or placebo. Their major

endpoint was mortality within the two-year follow-up period. They used

the logrank test to compare the two groups and found a significant

reduction in mortality in those patients treated with atenolol (P = 0.019),

principally through a reduction in cardiovascular deaths. They then used

the Cox proportional hazards method to identify other (univariate)

factors that may be associated with mortality. These included diabetes

(P = 0.01) and postoperative myocardial ischaemia (P = 0.04). When these

factors were included in a multivariate analysis, only diabetes was a

significant predictor of mortality over the two-year period (P = 0.01).

The `hazards' of survival analysis

Comparison of survival data does not need to be restricted to the total

period of observation, but can be split into specific time intervals. For

example, an early and late period can be artificially constructed for

analysis, the definition of these periods being determined by their

intended clinical application. This may be a useful exercise if differences

only exist during one or other periods, or if the group survival curves

actually cross. In general, such arbitrary decisions should be guided by

clinical interest, and not be influenced after

visualization of the survival

curves.

Presentation of survival data should include the number of individuals

at each time period (particularly at the final periods, where the statistical

power to detect a difference is reduced), and survival curves should also

include 95%confidence intervals. A comparison of mortality rates at a

chosen point in time should not be based upon visualization of survival

curves (when the curves are most divergent): this may only reflect

random fluctuation and selective conclusion of significance may be

totally

misleading. Conclusions based upon the terminal (right-hand)

portion of a survival curve are often inappropriate, because patients

numbers are usually too small. It is common for such curves to have long

periods of flatness (i.e. where no patient dies); this should not be

interpreted as no risk of death (or 'cure').

In summary, the Kaplan-Meier method is used to estimate a survival

curve using conditional probability, the logrank test is used to compare

survival between groups, and the Cox proportional hazards method is

used to study the effects of several risk factors on survival.

References

Survival analysis

111

1. Peto R, Pike MC, Armitage

P et al.

Design and analysis of randomized clinical

trials requiring prolonged observation of each patient. I. Introduction and

design.

Br j Cancer 1976; 34:585-612.

2. Peto R, Pike MC, Armitage

P et al. Design and analysis of randomized clinical

trials requiring prolonged observation of each patient. II. Analysis and

examples.

Br j Cancer 1977; 35:1-39.

3.

Machin D, Gardner Mj. Calculating confidence intervals for survival time

analysis. In: Gardner Mj, Altman DG. Statistics with Confidence - Confidence

Intervals and Statistical Guidelines. British Medical journal, London

1989:

pp64-70.

4. Myles PS, Buckland MR, Weeks AM

et al.

Hemodynamic effects, myocardial

ischemia, and timing of tracheal extubation with propofol-based anesthesia for

cardiac surgery.

Anesth Analg 1997; 84:12-19.

5. Cox

DR. Regression models and life tables. j

R Stat Soc Series B 1972;

34:187-220.

6.

Mangano DT, Layug EL, Wallace A et al.

Effect of atenolol on mortality and

cardiovascular

morbidity after noncardiac surgery.

N Engl j Med 1996;

335:1713-1720.

10

Large trials, meta-analysis, and

evidence-based medicine

Efficacy vs. effectiveness

Large randomized trials

Meta-analysis and systematic reviews

Evidence-based medicine

Clinical practice guidelines

Key points

•

Stringently designed randomized trials are best to test for efficacy, but lack

applicability.

Large randomized trials are best to test for effectiveness, and so have

greater applicability (generalizability).

Large randomized trials can detect moderate beneficial effects on outcome.

Meta-analysis combines the results of different trials to derive a pooled

estimate of effect.

Meta-analysis is especially prone to publication bias.

A systematic review is a planned, unbiased summary of the evidence to

guide clinical management.

Evidence-based medicine optimizes the acquisition of up-to-date knowledge,

so that it can be readily applied in clinical practice.

Clinical practice guidelines are developed by an expert panel using an

evidence-based approach.

Efficacy vs. effectiveness

The randomized controlled trial (RCT) is the gold standard method to

test the effect of a newtreatment in clinical practice. It is a proven method

of producing the most reliable information, because it is least exposed to

bias.1 ¯

Randomization balances known and unknown confounding

factors that may also affect the outcome of interest.

Stringently designed RCTs are best to test for

efficacy, but lack

applicability.

+ :

They are conducted in specific patient populations, often

in academic institutions, by experienced researchers. They are explanatory

trials. They commonly exclude patients with common medical conditions

or at higher risk. For these reasons their results may not be widely

applicable and so they do not necessarily demonstrate

effectiveness in

day-to-day clinical practice. Trials that test effectiveness are also called

pragmatic trials.

Large RCTs are usually conducted in many centres, by a number of

clinicians, on patients who may have different characteristics, and so

have greater applicability (generalizability).

` •

+• :

Large RCTs are an

excellent way of testing for effectiveness. `

•

+

Large trials, meta-analysis, and evidence-based medicine

113

Why we need large randomized trials in anaesthesia*

A good clinical trial asks an important question and answers it reliably. 2,8

In 1984, Yusuf et a1. 2 explained how large, simple randomized trials can

reliably detect moderate effects on important endpoints (e.g. mortality,

major morbidity). 2 In part they argued: (a) effective treatments are more

likely to be important if they can be used widely, (b) widely applicable

treatments are generally simple, (c) major endpoints (death, disability)

are more important and assessment of these endpoints can be simple, and

(d) newinterventions are likely to have only a moderate beneficial effect

on outcome. These considerations have fostered the widespread use of

large multi-centred RCTs, particularly in the disciplines of cardiology and

oncology. Do these issues apply to anaesthesia?

The use of surrogate, or intermediate, outcome measures in

anaesthesia is widespread.

9-12

Their inherent weaknesses include un-

certain clinical importance, transience, and unconvincing relationships

with more definitive endpoints. One of the reasons for studying

surrogate endpoints is that more definitive endpoints, such as mortality

or major morbidity, are very uncommon after surgery, and anaesthesia is

considered to play a small role in their occurrence. 5 Nevertheless,

i mportant, moderate effects of anaesthetic interventions are worthy of

study, but these require large RCTs in order to be reliable.

3,4,6

The increasing interest in evidence-based medicine has added a further

i mperative to conducting reliable clinical trials.

3,5

Small studies can rarely

answer important clinical questions. Most improvements in our specialty

are incremental, and these require large numbers of patients to be studied

in order to have the power to detect a clinically significant difference.

12

McPeek argued 13 years ago that changes in anaesthetic practice should

be based on reliable trial evidence that can be generalized to other

situations. 4

Large RCTs are more likely to convincingly demonstrate effectiveness

because their treatments are generally widely applicable.

2,4,5

They are

usually multi-centred, and perhaps multi-national, in order to maximize

recruitment and enable early conclusion. This offers an opportunity to

identify other patient, clinician and institutional factors that may

influence outcome. These extraneous, potentially confounding factors

are more likely to be balanced between groups in large RCTs.

4,13,14

They are therefore less biased and so are more reliable,

4,13,15 with less

chance of false conclusion of effect (type I error)

or no effect (type II

error).

What is a large trial? This depends on the clinical question. Some will

accept trials that study more than 1000 patients, but the more important

issue is that the trial should have

adequate power (> 80%) to detect a true

difference for an important primary endpoint.

16,17

Most important

adverse outcomes after surgery are rare. For example, the incidence of

stroke, renal failure or death after coronary artery surgery is 2-4%, and

the incidence of major sepsis after colorectal surgery is 5-10%. In order to

`This has been adapted from an Editorial in British Journal of Anaesthesia,7 published with

permission.

1 1 4

**Statistical Methods for Anaesthesia and Intensive Care
**

Table 10.1 The approximate number of patients needed to be studied (assuming a type I

error 0.05 and type II error 0.2)

detect a moderate, but clinically important difference between groups,

many thousands of patients are required to be studied (Table 10.1).

There have been some excellent examples of large RCTs in anaes-

thesia. 18-20 In some of these the investigators selected a high-risk group

in order to increase the number of adverse events in the study; this

reduced the number of patients required (i.e. with a fixed sample size this

equates to a higher incidence rate).

Meta-analysis and systematic reviews

Meta-analysis is a process of combining the results of different trials to

derive a pooled estimate of effect and is considered to offer very reliable

information.

21-23

Some recent examples in anaesthesia include the effect

of ondansetron on postoperative nausea and vomiting (PONV) 24 the role

of epidural analgesia in reducing postoperative pulmonary morbidity, 25

and the benefit of acupressure and acupuncture on PONV.

26

The term systematic review, or overview, is sometimes used inter-

changeably with meta-analysis, but this more aptly describes the

complete process of obtaining and evaluating all relevant trials, their

statistical analyses, and interpretation of the results. The most well-

known is the Cochrane Collaboration

2

7,28

an Oxford-based group that

was established to identify all randomized controlled trials on specific

topics. They have several subgroups that focus on particular topics,

including acute pain management and obstetrics. An anaesthetic sub-

group is being considered (see web-site: www.cochrane-anaesthesia.

suite.dk).

Individual trial results can be summarized by a measure of treatment

effect and this is most commonly an odds ratio (OR) and its 95%

confidence interval (95%CI). The OR is the ratio of odds of an outcome

in those treated vs. those not treated. It is a commonly used estimate of

risk. An OR of 1.0 suggests no effect; less than 1.0 suggests a reduction in

risk, and greater than 1.0 an increased risk. If the 95%CI of the OR

exceeds the value of 1.0, then it is not statistically significant at P < 0.05

(i.e. it

may be a chance finding).

The results (ORs) of individual trials are combined in such a way that

large trials have more weight. As stated above, differences in trial

characteristics (heterogeneity) can obscure this process. For this reason,

it is recommended that a random effects model be used to combine ORs,

Baseline incidence 25% improvement with intervention Number of patients

40% 30% 920

20° % 15° % 2500

10% 7.5% 5400

6% 4.5 % 9300

L arge trials, meta-analysis, and evidence-based medicine

115

whereby the individual trials are considered to have randomly varied

results. This leads to slightly wider 95%CI.

Results from each trial can be displayed graphically. The OR (box) and

95%CI (lines) for each subsequent trial are usually displayed along a

vertical axis. The size of the box represents the sample size of the trial.

The pooled OR and 95%CI for all the trials is represented at the bottom

as a diamond, with the width of the diamond representing the 95 CI%. If

this pooled result does not cross the value 1.0, it is considered to be

statistically significant. A logarithmic scale is often used to display ORs

because increased or decreased risk can be displayed with equal

magnitude.

23

For example, Lee and Done

26

investigated the role of acupressure and

acupuncture on PONY

26 They found eight relevant studies and

produced a summary diagram (Figure 10.1). The pooled estimate of effect

for prevention of early vomiting (expressed as a risk ratio in their study)

was 0.47 (95%CI: 0.34-0.64). They did a sensitivity analysis, by separately

analysing large and small trials, those of good quality, and those where a

sham treatment was included.

The trials included in a meta-analysis should have similar patient

groups, using a similar intervention, and measure similar endpoints.

Each of these characteristics should be defined in advance. Meta-analysis

may include non-randomized trials, but this is not recommended because

it obviously weakens their reliability.

Meta-analysis has been criticized,"

29 and some of its potential

weaknesses identified.

16,17,23,30-32

These include publication bias

Figure 10.1 Effect of non-pharmacological techniques on risk of early

postoperative vomiting in adults • = relative risk for individual study,

_

overall summary effect. The control was sham or no treatment. Large trials = n

> 50, small trials = n <_ 50, high-quality studies = quality score > 2, low-quality

studies = quality score <_ 2.

1 1 6

**Statistical Methods for Anaesthesia and Intensive Care
**

(negative studies are less likely to be submitted, or accepted, for

publication), duplicate publication (and therefore double-counting in the

meta-analysis), heterogeneity (different interventions, different clinical

circumstances) and inclusion of historical (outdated) studies.

Despite these weaknesses, meta-analysis is considered a reliable source

of evidence.

22,28

There are now established methods to find all relevant

trials, 23 and so minimize publication bias. These include electronic data-

base searching, perusal of meeting abstracts and personal contact with

known experts in the relevant field. Advanced statistical techniques (e.g.

weighting of trial quality, use of a random effects model, funnel plots)

and sensitivity analysis can accommodate for heterogeneity

16,21,23,30,31.

The QUOROM statement, a recent review, has formulated guidelines on

the conduct and reporting of meta-analyses.

32

Meta-analyses sometimes give conflicting results when compared with

large RCTs.

16,17,29,30

A frequently cited example is the effect of

magnesium sulphate on outcome in patients with acute myocardial

infarction.

17,21,30,33,34

Many small RCTs had suggested that magnesium

i mproves outcome after acute myocardial infarction and this was the

conclusion of a meta-analysis, LIMIT-2, published in 1992. 35 A

subsequent large RCT, ISIS-4, disproved the earlier finding.

36

There have been several explanations for such disagreement.

33,34

But it

is generally recognized that positive meta-analyses should be confirmed

by large RCTs. Meta-analyses that include one or more large RCTs are

considered to be more reliable. 29 Meta-analyses that find a lack of

treatment effect can probably be accepted more readily

11

The findings of a meta-anal sis are sometimes presented as the

number needed to treat (NNT).

28• 37

Here the reciprocal of the absolute

risk reduction can be used to describe the number of patients who need

to be treated with the new intervention in order to avoid one adverse

event. For example, Tramer et a1. 24 found a pooled estimate in favour of

ondansetron , with an OR (95%CI) of approximately 0.75 (0.71-0.83)*. If

the incidence of early PONV is 60%(proportion = 0.60), then these results

suggest that ondansetron, with an OR of 0.75, would reduce the

proportion to 0.45, or an absolute risk reduction of 0.15 (60%to 45%). The

NNT, or reciprocal of the absolute risk reduction (1/0.15) is 6.7.

Therefore, it can be concluded that six or seven patients need to be treated

in order to prevent one patient from having PONY

Evidence-based medicine

Evidence-based medicine (EBM) has been defined by its proponents as

the 'conscientious, explicit and judicious use of current best evidence in

making decisions about the care of individual patients'. 28

Although

referred to as a newparadigm in clinical care, 15

it could more accurately

*The authors calculated odds ratios (as estimates of risk ratios) in terms of a relative

benefit, and so we have used the reciprocal to present their results as a relative reduction

in PONV (a ratio of benefit of 1.3became an OR of 0.75).

L arge trials, meta-analysis, and evidence-based medicine

117

be described as a simplified approach that optimizes the acquisition of

up-to-date knowledge, so that it can be readily applied in clinical

practice. As such, EBM formalizes several aspects of traditional practice.

The five steps of EBM are: 28

Step 1. Ask an answerable question

Step 2. Search for evidence

Step 3. Is the evidence valid?

Step 4. Does the evidence apply to my patient?

Step 5. Self-assessment

It teaches how to formulate a specific and relevant question arising

from clinical practice, how to efficiently and reliably access up-to-date

knowledge ('evidence'), and then reminds us of established critical

appraisal skills used to asses the validity of that evidence.

The fourth, and perhaps most important, step is to use clinical

expertise in order to determine whether that evidence is applicable to our

situation. It is this step that requires clinical experience and judgment,

understanding of basic principles (such as pathophysiology, pharma-

cology and clinical measurement), and discussion with the patient before

a final decision is made.5,15,2s EBM also has a fifth step, asking clinicians

to evaluate their own evidence-based practice.

There have been some concerns raised about the application of

evidence-based methods in anaesthetic practice. 11,12,31 Some clinicians

argue that the principles of EBM are in fact those of good traditional care,

and not a new paradigm or approach to clinical practice.11,38 But the

central feature of EBM is its direct application at the bedside (or in the

operating suite), with a specific patient or procedure in mind. For this

reason, it has direct relevance for us, encourages active learning, and

should reduce poor anaesthetic practices not supported by evidence.

What constitutes 'evidence'? Most classifications consider how well a

study has minimized bias and rate well-designed and conducted RCTs as

the best form of evidence (Table 10.2).

22

• 39

But it is generally accepted that

other trial designs play an important role in anaesthesia research and can

still be used for clinical decision-making.

3,5,

8,12,28

Table 10.2 Level of evidence supporting clinical practice (adapted from the US

Preventive Services Task Force)

Level

Definition

I

Evidence obtained from a systematic reviewof all relevant randomized

controlled trials

II Evidence obtained

from at least one properly designed randomized

controlled trial

III Evidence obtained

from other well-designed experimental or analytical

studies

IV Evidence obtained

from descriptive studies, reports of expert committees

or from opinions of

respected authorities based on clinical experience

1 1 8

**Statistical Methods for Anaesthesia and Intensive Care
**

More recently, there has been some recognition that the level of

evidence (I-IV) is not the only aspect of a study that is of relevance to

clinicians

when they apply the results in their practice. Thus, the

dimensions

of evidence are all important: level, quality, relevance, strength

and magnitude of effect.

Myles et a1.

22

surveyed their anaesthetic practice and found that 96.7%

was evidence based, including 32%supported by RCTs. These results are

similar to recent studies in other specialties

39 40

and refute claims that

only 10-20%of treatments have any scientific foundation.

The traditional narrative review has been questioned in recent years.

Their content is largely dependent on the author(s) and their own

experiences and biases. EBM favours the systematic review, as an

unbiased summary of the evidence base.

41

These have become more

commonly used in anaesthesia research. 24-26 42

Clinical practice guidelines

Clinical practice guidelines have been developed in order to improve

processes or outcomes of care.

5,43,44

They are usually developed by a

group of recognized experts after scrutinizing all the available evidence.

They generally follow similar strategies to that of EBM, and so the

strongest form of evidence remains the randomized controlled trial.

In the past clinical practice guidelines were promulgated by

individuals and organizations without adequate attention to their

validity44

More recently there has been a number of excellent efforts at

developing guidelines in many areas of anaesthetic practice.

45

49

The relationship between the RCT and development of clinical practice

guidelines has been explored eloquently by Sniderman.

5

He pointed out

that both the RCT and practice guidelines developed by an expert

committee can be seen as impersonal and detached. Yet, he suggests, this

is also their strength, in that they have transparency and objectivity.

Because there is often incomplete evidence on which to develop

guidelines, there is a risk of them being affected by the interpretations

and opinions of the individuals who make up the expert panel.

Sniderman points out that they are also a social process and are exposed

to personal opinions and compromise.

5 He suggests that their findings

can be strengthened by including a diverse group of experts and not to

demand unanimity in their recommendations.

As with EBM, systematic evaluation of published trials can also

identify important clinical problems that require further study. Smith

et

a1.

49

noted that pain medicine (as with many areas in anaesthesia) is

rapidly evolving and so guidelines may become outdated within a few

years. The cost and effort to maintain them may be a limiting factor in

future developments.

The evaluation of clinical practice guidelines can be biased.

Participating clinicians

may perform better, or enrolled patients may

receive better care or report improved outcomes, because they are being

studied. This is known as the

Hawthorne effect.

There are several

approaches that can minimize this bias.

44

The gold standard remains the

RCT, but this is often not feasible. Other designs include the crossover

trial (where both methods of practice are used in each group at alternate

ti

mes), or the before-and-after study that includes another control group

for comparison.

44

The management of acute pain has received recent attention 4

9 For

example, the Australian National Health and Medical Research Council

established an expert working party to develop clinical practice

guidelines for the management of acute pain. Members of the party

scrutinized all relevant studies and rated the level of evidence (levels

I-IV) in order to make recommendations.

Anaesthetists gain newknowledge from a variety of sources and study

designs.

7,11,12,50

Observational studies can be used to identify potential

risk factors or effective treatments. In most cases these findings should be

confirmed with an RCT. In some circumstances we are interested in a

specific

mechanistic question for which a small, tightly controlled RCT

testing efficacy

may be preferable.

3,12,33,51

Investigation of moderate

treatment effects on important endpoints are best done using large

RCTs.

2,6

References

L arge trials, meta-analysis, and evidence-based medicine

119

1.

Sackett DL, Haynes RB, Guyatt GH, Tugwell P Deciding on the Best Therapy:

A Basic Science for Clinical Medicine. Little Brown, Boston 1991: pp187-248.

2.

Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized

trials? Stat Med

1984; 3:409-420.

3.

Rigg JRA, Jamrozik K, Myles PS. Evidence-based methods to improve

anaesthesia and intensive care.

Curr Opinion Anaesthesiol

1999; 12:221-227.

4.

McPeek B. Inference, generalizability, and a major change in anesthetic

practice [editorial].

Anesthesiology 1987; 66:723-724.

5.

Sniderman AD. Clinical trials, consensus conferences, and clinical practice.

Lancet

1999; 354:327-330.

6.

Schwarz D, Lellouch J. Explanatory and pragmatic attitudes in therapeutic

trials. J

Chron Dis 1967; 20:637-648.

7.

Myles PS. Why we need large randomised trials in anaesthesia [editorial].

Br

J Anaesth

1999; 83:833-834.

8.

Duncan PG, Cohen MM. The literature of anaesthesia: what are we learning?

Can J

Anaesthesia1988; 3:494-499.

9. Fisher

DM. Surrogate end points: are they meaningful [editorial]?

Anesthesiology 1994; 81:795-796.

10.

Lee A, Lum ME. Measuring anaesthetic outcomes.

Anaesth Intensive Care1996;

24:685-693.

11.

Horan B. Evidence based medicine and anaesthesia: Uneasy bedfellows?

Anaesth Intensive Care 1997; 25: 679-685.

12. Goodman NW. Anaesthesia and evidence-based medicine.

Anaesthesia 1998;

53:353-368.

13.

Rothman KJ. Epidemiologic methods in clinical trials.

Cancer 1977;

39:S1771-1775.

14. Ioannidis JPA, Lau J. The impact of high-risk patients on the results of clinical

trials. J Clin Epidemiol 1997; 50:1089-1098.

15.

Evidence-based medicine working group. Evidence-based medicine: a new

approach to teaching the practice of medicine.

JAMA 1992; 268:2420-2425.

120

**Statistical Methods for Anaesthesia and Intensive Care
**

16.

Cappelleri JC, Ioannidis JPA, Schmid CH

et al.

Large trials vs meta-analysis of

smaller trials: howdo their results compare?

JAMA1996; 276:1332-1338.

17.

LeLoerier J, Gregoire G, Benhaddad A

et al.

Discrepancies between meta-

analyses and subsequent large randomized, controlled trials.

N Engl J Med

1997; 337:536-542.

18.

Kurz A, Sessler DI, Lenhardt R. The study of wound infection and

temperature group. Perioperative normothermia to reduce the incidence of

surgical-wound infection and shorten hospitalization.

N Engl J Med 1996;

334:1209-1215.

19.

Mangano DT, Layug EL, Wallace A

et al.

Effect of atenolol on mortality and

cardiovascular

morbidity after noncardiac surgery.

N Engl J Med 1996;

335:1713-1720.

20.

Diemunsch P, Conseiller C, Clyti N

et al.

Ondansetron compared with

metoclopramide in the treatment of established postoperative nausea and

vomiting.

Br J Anaesth 1997; 79:322-326.

21.

Pogue J, Yusuf S. Overcoming the limitations of current meta-analysis of

randomised controlled trials.

Lancet 1998; 351:47-52.

22.

Myles PS, Bain DL, Johnson F, McMahon R. Is anaesthesia evidence-based? A

survey of anaesthetic practice.

Br J Anaesth 1999; 82:591-595.

23. Egger

M, Davey-Smith G, Phillips AN. Meta-analysis: principles and

procedures. BMJ

1997; 315:1533-1537.

24.

Tramer MR, Reynolds DJ, Moore RA, McQuay HJ. Efficacy, dose-response,

and safety of ondansetron in prevention of postoperative nausea and

vomiting.

A quantitative systematic review of randomized placebo-

controlled trials.

Anesthesiology 1997; 87:1277-1289.

25.

Ballantyne JC, Carr DB, deFerrabti S

et al.

The comparative effects of

postoperative analgesic therapies on pulmonary outcome: cumulative

meta-analyses of randomized, controlled trials.

Anesth

Analg 1998;

86:598-612.

26.

Lee A, Done ML. The use of nonpharmacologic techniques to prevent

postoperative nausea and vomiting: a meta-analysis.

Anesth Analg 1999;

88:1362-1369.

27.

Sackett

DL, Oxman AD, eds. Cochrane Collaboration Handbook. The

Cochrane Collaboration, Oxford

1997.

28. Sackett

DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-based

Medicine: Howto Practice and Teach EBM. Churchill Livingstone, London

1997.

29.

Horwitz RI. 'Large-scale randomized evidence: large, simple trials and

overviews of trials': discussion: a clinician's perspective on meta-analyses.

J

Clin Epidemiol 1995; 48:41-44.

30.

Egger M, Smith GD. Misleading meta-analysis. Lessons learned from 'an

effective, safe, simple' intervention that wasn't.

BMJ 1995; 310:752-754.

31.

Moher D, Jones A, Cook DJ

et al.

Does quality of reports of randomised trials

affect estimates of intervention efficacy reported in meta-analyses?

Lancet

1998; 352:609-613.

32.

Moher D, Cook DJ, Eastwood S

et al.

Improving the quality of reports of meta-

analyses of randomised controlled trials: the QUORUM statement.

Lancet

1999; 354:1896-1900.

33.

Woods KL. Mega-trials and management of acute myocardial infarction.

Lancet 1995; 346:611-614.

34.

Antman EM. Randomized trials of magnesium in acute myocardial

infarction: big numbers do not tell the whole story.

Am J Cardiol 1995;

75:391-393.

35.

Woods KL, Fletcher S, Roffe C, Haider Y. Intravenous magnesium sulphate in

suspected acute myocardial infarction: results of the second Leicester

L arge trials, meta-analysis, and evidence-based medicine

121

Intravenous

Magnesium Intervention Trial (LIMIT-2). Lancet 1992;

339:816-819.

36. ISIS-4 Collaborative Group. ISIS-4: a randomised factorial trial assessing early

oral captopril, oral mononitrate, and intravenous magnesium sulphate in

58 050 patients with suspected acute myocardial infarction. Lancet 1995;

345:669-665.

37. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful

measures of the consequences of treatment. N

Engl J Med 1988; 318:1728-1733.

38. Various authors: Evidence-based medicine. Lancet 1995, 346:837-840.

39. Ellis J, Mulligan I, Rowe J, Sackett DL. Inpatient general medicine is evidence

based.

Lancet

1995; 346:407-410.

40.

Howes N, Chagla L, Thorpe M, McCulloch P Surgical practice is evidence

based. Br J Surg 1997; 84:1220-1223.

41. Sheldon TA. Systematic reviews and meta-analyses: the value for surgery.

Br

J Surg 1999; 86:977-978.

42. Munro J, Booth A, Nicholl J. Routine preoperative testing: a systematic review

of the evidence. Health Technology Assessment 1997; 1(12).

43. Lomas J. Words without action? The production, dissemination and impact of

consensus recommendations. Ann Rev Pub Health 1991;12:41-65.

44. GrimshawJM, Russell IT. Effect of clinical guidelines on medical practice. A

systematic reviewof rigorous evaluations. Lancet 1993; 342:1317-1322.

45. American Society of Anesthesiologists Task Force on Management of the

Difficult Airway: practice guidelines for management of the difficult airway.

Anesthesiology 1993; 78:597-602.

46. Practice Guidelines for Pulmonary Artery Catheterisation: a report by the

American Society of Anesthesiologists Task Force on Pulmonary Artery

Catheterisation. Anesthesiology 1993; 78:380-394.

47. ACC/AHA Task Force Report. Special report: guidelines for perioperative

cardiovascular evaluation for noncardiac surgery. Report of the American

College of Cardiology/American Heart Association Task Force on practice

guidelines (Committee on Perioperative Cardiovascular Evaluation for

Noncardiac Surgery). J Cardiothorac Vasc Anesth 1996;10:540-552.

48. Practice guidelines for obstetrical anesthesia: a report by the American

Society of Anesthesiologists Task Force on Obstetrical Anesthesia.

Anesthesiology 1999; 90:600-611.

49. Smith G, Power I, Cousins MJ. Acute pain - is there scientific evidence on

which to base treatment? [editorial] Br J Anaesth 1999; 82:817-819

50. Solomon MJ, McLeod RS. Surgery and the randomised controlled trial: past,

present and future. Med J Aust 1998;169:380-383.

51. Powell-Tuck J, McRae KD, Heaty MJR et al. A defense of the small clinical

trial: evaluation of three gastroenterological studies. BMJ 1986; 292:599-602.

Statistical errors in anaesthesia

Prevalence of statistical errors in

anaesthesia journals

Ethical considerations

How to prevent errors

What are the common mistakes?

-no control group

-no randomization

-lack of blinding

-misleading analysis of baseline

characteristics

-inadequate sample size

-multiple testing, subgroup analyses and interim analysis

-misuse of parametric tests

-misuse of Student's t-test

-repeat (`paired') testing

-misuse of chi-square - small numbers

-standard deviation vs. standard error

-misuse of correlation and simple linear

regression analysis

-preoccupation with P values

-overvaluing diagnostic tests and

predictive equations

A statistical checklist

Key points

Obtain statistical advice before commencement

of the study.

Consider inclusion of a statistician as a co-researcher.

Prevalence of statistical errors in anaesthesia journals

Advances in clinical practice depend on new knowledge, mostly gained

through medical research. We otherwise risk stagnation. Yet conclusions

based on poor research, or its misinterpretation, can be even more

harmful to patient care. The detailed reporting of medical research

usually occurs in any of a large number of peer-reviewed medical

journals. Approximately 50% of such published reports contain errors in

statistical

methodology or presentation.1-5 These errors are also prevalent

in the anaesthetic and intensive care literature.

6-8 Avram et al.7 evaluated

the statistical analyses used in 243 articles from two American

anaesthesia journals (Anesthesiology and

Anesthesia and Analgesia) and

found common errors included treating ordinal data as interval data,

ignoring repeated measures or paired data, uncorrected multiple

comparisons, and use of two-sample tests for more than two groups.

Goodman$ surveyed five abstract booklets of the Anaesthesia Research

Society (UK) and found that 61 of 94 abstracts (65%) contained errors.

These included failure to identify which statistical tests were used,

inadequate presentation of data (to enable interpretation of P value),

misuse of standard error and, for negative studies, no consideration of

type II error.

Most statistical analyses are performed by researchers who have some

basic understanding of medical statistics, but may not be aware of

fundamental assumptions underlying some of the tests they employ, nor

of pitfalls in their execution. These mistakes often lead to misleading

conclusions. On some occasions it is apparent that researchers reproduce

a previous study's methodology (including statistical techniques),

perpetuating mistakes.

Ethical considerations

As stated in the Preface to this book, Longnecker wrote in 1982, 'If valid

data are analyzed improperly, then the results become invalid and the

conclusions

may well be inappropriate. At best, the net effect is to

waste time, effort, and money for the project. At worst, therapeutic

decisions

may well be based upon invalid conclusions and patients'

wellbeing may be jeopardized'. 6 Similar statements have been made by

others.

2,9

Flaws in research design and errors in statistical analysis obviously

raise ethical issues. In fact, a poorly designed research project should not

be approved by an institutional ethics committee unless it is satisfied that

the project is likely to lead to valid conclusions. It would be unethical to

proceed. Therefore,

ethical reviewshould include scientific scrutiny of

the research design, paying

particular attention to methods of

randomization and blinding

(if appropriate), definition of outcome

measures, identification of which statistical tests will be applied (and on

what data), and a reasonable estimation of how many patients will be

required to be studied in order to prove or disprove the hypothesis under

investigation. Scrutiny at this early stage will do much to avoid the

multitude of errors prevalent in the anaesthetic and intensive care

literature.

How to prevent errors

Statistical errors in anaesthesia

123

A research paper submitted for publication to a medical journal normally

undergoes a peer review process. This detects many mistakes, but

remains dependent on the statistical knowledge of the journal reviewers

and editor. This can vary. One solution is to include a statistician in the

process, but this can delay publication, may be unachievable for many

journals and may not avoid all mistakes. Some journals only identify

those papers with more advanced statistical methods for selective

assessment by a statistician. This paradoxical process may only serve to

identify papers that already have statistician involvement (as an author)

124

**Statistical Methods for Anaesthesia and Intensive Care
**

and miss the majority of papers (which do not include a statistician) that

are flawed by basic statistical errors. 2 Statisticians can also disagree on

how research data should be analysed and presented, increasing the

medical researcher's confusion. Complete, valid presentation of study

data can be compromised by a journal's word limit and space constraints.

What information to include, and in what form, is perhaps best

dictated by the policies of the particular journal (see Table 11.1, at the end

of this chapter). This can be found in a journal's 'Advice to Authors'

section. If in doubt, direct advice can also be sought from the journal's

editor.

The ultimate solution, hopefully addressed in part by this book, is for

researchers to further develop their knowledge and understanding of

medical statistics. The growing market in introductory texts and

attendance of medical researchers at statistical courses would appear to

be addressing the problems. Readers should be reassured that most

studies can be appropriately analysed using basic statistical tests. The

skill, of course, is to know which studies require more advanced

statistical methods and assistance from a statistician - this should not be

undervalued or resisted. After all, 'specialist' consultation occurs in most

other areas of clinical practice! If in doubt, the best habit is to have a low

threshold for seeking advice. If this is not available, then advice from an

experienced clinical researcher may also be of assistance (although, in our

experience, this may only perpetuate mistakes).

It cannot be stressed strongly enough: the best time to obtain advice is

during the process of study design and protocol development. It is very

frustrating to receive a pile of data from an eager novice researcher,

which has a multitude of deficiencies. The choice of statistical tests

depends on the type of data collected (categorical, ordinal, numerical)

and the exact hypotheses to be tested (i.e. what scientific questions are

being asked). Often the study has not been designed to answer the

question. Exactly what data to collect, when, and how, are fundamental

components of the research design. This can rarely be corrected after the

study is completed!

Where possible, inclusion of a statistician as a co-researcher almost

always provides a definitive solution.

What are the common mistakes?

The commonest errors are often quite basic and relate more to research

design: lack of a control group, no randomization to treatment groups

(or poorly documented randomization) and inadequate blinding of

group allocation. 1,2 These can seriously increase the risk of bias, making

the researcher (and reader) susceptible to misleading results and

conclusions. Another common problem is inadequate description of

methods (including which statistical tests were used for analysing

what data). Some of these issues are addressed in more detail in

Chapter 4.

Specifically, we have found that the following errors (or deficiencies)

Statistical errors in anaesthesia

125

are common in the anaesthetic and intensive care literature:

1.

2.

3.

4.

5.

6.

7.

8.

9.

10.

11.

12.

13.

14.

No control group

No randomization

Lack of blinding

Misleading analysis of baseline characteristics (confounding)

Inadequate sample size (and type II error)

Multiple testing

Misuse of parametric tests

Misuse of Student's t-test

-paired vs. unpaired

-one-tailed vs. two-tailed

-multiple groups (ANOVA)

-multiple comparisons

Repeat ('paired') testing

Misuse of chi-square - small numbers

Standard deviation vs. standard error

Misuse of correlation and simple linear regression analysis

Preoccupation with P values

Overvaluing diagnostic tests and predictive equations.

No control group

To demonstrate the superiority of one treatment (or technique) over

another requires more than just an observed improvement in a chosen

clinical endpoint. A reference group should be included in order to

document the usual clinical course (which may include fluctuating

periods of improvement, stability and deterioration). In most situations,

a contemporary, equivalent, representative control group should be used.

Use of an historical control group may not satisfy these requirements,

because of different baseline characteristics, quality or quantity of

treatment, or methods used for outcome assessment.

If the control group is given a placebo treatment, then the question

being asked is 'does the newtreatment have an effect (over and above no

treatment)?' This is a common scenario in anaesthesia research which

only shows that, for example, an antiemetic is an antiemetic, or an

inotrope is an inotrope. Placebo-controlled studies have very little value

(other than for detecting adverse events - in which case, the study should

be of sufficient size to detect them). If the control group is given an active

treatment, then the question being asked is 'does the newtreatment have

an equal or better effect than the current treatment?' This has more

clinical relevance.

It is difficult to detect

'regression to the mean'

unless a control group

is included. Regression to the mean occurs when random fluctuation,

through biological variation or measurement error, leads to a falsely

extreme (high or low) value.10

This leads to a biased selection, whereby a

group has a spuriously extreme mean value, that on re-measurement will

tend towards the population mean (which is less extreme). This is a

common error if group measurements are not stabilized or if there is no

control group.

1 26

**Statistical Methods for Anaesthesia and Intensive Care
**

For example, if a study were set up to investigate the potential benefits

of acupressure on postoperative pain control, and patients were selected

on the basis that they had severe pain (as measured by VAS on one

occasion), then it is likely that VAS measurements at a later time will be

lower. Is this evidence of a beneficial effect of acupuncture? It may be a

result of the treatment given, but it may also be caused by random

fluctuation in pain levels: patients with high levels are more likely to have

a lower score on retesting and so the average (group) pain level is much

more likely to be reduced.

No randomization

The aim of randomization is to reduce bias and confounding. All eligible

patients should be included in a trial and then randomized to the various

treatment groups. This avoids selection bias and increases the

generalizability of the results. In large trials, randomization tends to

equalize baseline characteristics (both known and unknown) which may

have an effect on the outcome of interest (i.e. confounding). The

commonest method is simple randomization, which allocates groups in

such a way that each individual has an equal chance of being allocated to

any particular group and that process is not affected by previous

allocations. This is usually dictated by referring to a table of random

numbers or a computer-generated list. Other methods are available

which can assist in equalizing groups, such as

stratification and blocking

(see

Chapter 4). These are very useful modifications to simple

randomization, but have been under-used in anaesthesia research.

Knowledge of group allocation should be kept secure (blind) until after

the patient is enrolled in a trial, in order to reduce bias. The commonest

method is to use sealed, opaque envelopes.

Lack of blinding

It is tempting for the subject or researcher to consciously or

unconsciously distort observations, measurement, recordings, data

cleaning or analyses. Blinding of the patient (single-blind), observer

(double-blind), and investigator (sometimes referred to as triple-blind)

can dramatically reduce these sources of bias. Unblinded studies remain

unconvincing. Every attempt should be made to maximize blinding in

medical research.

Misleading analysis of baseline characteristics

It is not uncommon for patient baseline characteristics to be compared

with hypotheses testing. This is wrong for two major reasons. The first

does not alter interpretation of the study, yet remains senseless. If

treatment allocation is randomized, then performing significance tests

only tests the success of randomization! With a significance level of 0.05,

roughly one in 20 comparisons will be significant purely by chance. The

Statistical errors in anaesthesia

127

second reason is more important and may affect interpretation of results.

That is, there may be a clinically significant difference between the groups

which is not detected by significance testing, yet such an imbalance may

have an important effect on the outcome of interest. This is known as

confounding.

Just because 'there was no statistically significant difference

between the groups' does not imply that there were no subtle differences

that

may unevenly affect the endpoint of interest. An apparent small

(statistically non-significant) difference at baseline for a factor that has a

strong effect on outcome can lead to serious confounding.

Authors should certainly describe their group baseline characteristics,

consider the possibility of confounding, but not falsely reassure the

readers that 'there is no significant difference' between them. This can be

interpreted by the reader using clinical judgment, after he/she has been

provided with the relevant baseline information.

If an imbalance in baseline characteristics is found to exist at the end of

the trial (which may well occur by chance!), there are some advanced

multivariate

statistical techniques available which adjust the results post

hoc (after the event), taking the

covariate into account. The simplest

method to lessen this problem is to stratify the patients according to one

or two important confounding variables (e.g. gender, preoperative risk,

type of surgery) before randomizing patients to the respective treatment

groups. There are some excellent papers which explore these issues in

greater depth.

11-13

Inadequate sample size

A common reason for failing to find a significant difference between

groups is that the trial was not large enough (i.e. did not enrol a sufficient

number of patients).8,14 This is a

type II error, where

the null hypothesis

is

accepted incorrectly.

Minimization of this occurrence requires

consideration of the incidence rate of the endpoint of interest or, for

numerical data, the anticipated mean and variance, along with an

estimation of the difference between groups that is being investigated.

An approximate sample size can then be calculated (see Chapter 3). Rare

outcomes, or small differences between groups, require very large

studies. If a study concludes 'no difference' between groups,

consideration of the treatment effect size (if any) and likelihood of a type

Il error should be addressed by the authors.

Multiple testing, subgroup analyses and interim analysis

Multiple comparisons between groups will increase the chance of finding

a significant difference which may not be real.15

This is because each

comparison has a probability of roughly one in 20 (if using a

type I error,

or a value,

of 0.05) of being significant purely by chance, and multiple

comparisons magnify this chance accordingly. Multiple testing therefore

increases the risk of a type I error, where the null hypothesis is incorrectly

rejected. This is often called 'a fishing expedition' or 'data dredging'. A

similar problem occurs when multiple subgroups are compared at the

end of a trial, or during interim testing while a trial is in progress.

16-1

8

Misuse of Student's t-test

Statistical errors in anaesthesia

129

achieved by plotting the data and demonstrating a normal distribution,

and/or analysing the distribution using a test of goodness of

fit (e.g.

Kolmogorov-Smirnov test).

This is unlikely to be satisfactorily achieved

with smaller studies (say, n < 20). For these, either

data transformationor

non-parametric tests should be used.

In general, parametric tests should only be used to analyse numerical

data. Ordinal data is best analysed using non-parametric tests (such as

Mann-Whitney U test or Kruskall-Wallis analysis of variance). Some

statisticians accept that if the observations have an underlying theoretical

continuous distribution, such as pain or perioperative risk, then the data

can be considered as continuous, even if measured on an ordinal scale.

This argument is most credible for larger studies.

22

As stated above, it is important to verify assumptions of normality when

using the t-test, as well as independence of the data and equality of

variance (see Chapter 5). The Mest can only be used to compare two

groups; if more than two groups are being compared, then analysis of

variance should be used. If the groups are

independent (this is the usual

situation), then the unpaired t-test is used. If the groups are related to one

another (i.e. dependent),

as with comparing a group before and after

treatment, then the paired t-test must be used. In most cases a difference

between groups may occur in either direction and so a two-tailed t-test is

used. If there is a clear, predetermined rationale for only exploring an

increase, or a decrease, then a

one-tailed t-test may be appropriate.

Unfortunately, a one-tailed t-test is usually selected to lower a P value so

that it becomes significant (which a two-tailed test failed to achieve). This

is gravely misleading.

A paper using a one-tailed t-test should be

scrutinized: was there a valid reason for only investigating a difference in

one direction, and was this preplanned (before analysing the data)?

Repeat (`paired') testing

If an endpoint is measured on a number of occasions (e.g. measurement

of postoperative pain, or cardiac index during ICU stay), then any

subsequent measurement, at least in part, is determined by the previous

measurement. The amount of individual patient variation over time is

much less than that between patients (i.e. intea-group variance

is lower

than

inter-group variance)

and so differences can be more easily

detected.

If a group endpoint is measured on two occasions then a

paired t-test

(or non-parametric equivalent) can be used. If three or more measure-

ments are made, or two or more groups are to be compared on a number

of occasions, then the most appropriate method is to use repeated

measures analysis of variance. If a significant difference is demonstrated

overall, then individual comparisons can be made to identify at which

ti

me the differences were significant (adjusting P values for multiple

comparisons).

1 3 0

**Statistical Methods for Anaesthesia and Intensive Care
**

An alternative approach is to use summary data that describe the

variable of interest over time. This may be the overall mean of repeated

measurements, or the area under a curve (with time on the chi-axis ).23

Misuse of chi-square - small numbers

Mathematically, the

·

2distribution is a continuous distribution and the

calculated

X2

statistic is an approximation of this. When small numbers

are analysed, there is an artificial incremental separation between

potential values. For this reason, an adjustment factor is needed. For a

2 x 2 contingency table, where there are only two groups being

compared, looking at one dichotomous endpoint, Yates' correction

should be used. This subtracts 0.5 from each component in the

X2

equation.* If two or more cells in a 2 x 2 contingency table of expected

valueshave a value less than 5, then

Fisher's exact test should be used (for

larger contigency tables the categories can be collapsed, reducing the

number of categories and increasing the number in each). These

considerations become less important with large studies.

Chi-square should not be used if the groups are matched or if repeat

observations are made (i.e. paired categorical data). McNemar's test can

be used in these situations.

Standard deviation vs. standard error

In a normal distribution, 95%of data points will lie within 1.96 standard

deviations of the mean. Standard deviation (SD)

is therefore a measure

of variability and should be quoted when describing the distribution of

sample data. 2,7-9,24 Standard error is a derived valuet used to calculate

95% confidence intervals, and so is a measure of precision (of howwell

sample data can be used to predict a population parameter). Standard

error is a much smaller value than SD and is often presented (wrongly)

for this reason. It should not be confused with standard deviation, nor

used to describe variability of sample data. On some occasions it may be

acceptable to use standard error bars on graphs (for ease of presentation),

but on these occasions they should be clearly labelled.

It has been suggested that the correct method for presentation of

normally distributed sample data variability is mean (SD) and not mean

(± SD).52

Misuse of correlation and simple linear regression analysis

These techniques are used to measure a linear relationship between two

numerical variables. The correlation coefficient is a measure of linear

association and linear regression is used to describe that linear

relationship. These analyses assume that the observations follow a

normal distribution (in particular, that for any given value of the

independent [predictor] variable, the corresponding values of the

dependent [outcome] variable are normally distributed). If doubt exists,

or if the distribution appears non-normal after visualizing a scatterplot,

then the data can be transformed (commonly using log-transformation)

or a non-parametric method used (e.g. Spearman rank correlation).

The data should also be independent. This means that each data point

on the scatterplot should represent a single observation from each

patient. Multiple measurements from each patient should not be

analysed using correlation or regression analysis as this will lead to

misleading conclusions. Repeated measures over time should also not be

simply analysed using correlation.

26

Variables with a mathematical relationship between them will be

spuriously highly correlated because of mathematical coupling. 27

Further details can be found in Chapter 7.

Normally a scatterplot should be included to illustrate the relationship

between both numerical variables. A regression line should not exceed

the limits of the sample data (extrapolation).

Neither correlation nor regression should be used to measure

agreement between two measurement techniques. Bland and Altman

have described a suitable method. 28 As with correlation, multiple

measurements from each patient should not be plotted together and

treated as independent observations (this is a very common mistake in

anaesthesia and intensive care research). Further details can be found in

Chapter 7.

Preoccupation with P values

St at ist ical errors in anaest hesia 13 1

Too much importance is often placed on the actual P value, rather than

the size of the treatment effect.8,29 A P value describes the probability of

an observed difference being due to chance alone. It does not describe

how large the difference is, nor whether it is clinically significant.

Importantly, a P value is affected by the size of the trial: a highly

significant P value in a large trial may be associated with a trivial

difference, and a non-significant P value in a small trial may conceal an

effect of profound clinical significance. A P value is only a mathematical

statement of probability, it ignores the more important information from

a trial: howlarge is the treatment effect?

The 95% confidence interval

(CI) for effect describes a range in which

the size of the true treatment effect will lie. In general, a large trial will

have a small standard error and so narrow 95%CI: a more precise

estimate of effect. From this the clinician can interpret whether the

observed difference is of clinical importance.

overvaluing diagnostic tests and predictive equations

Diagnostic tests can be described by their sensitivity (true positive rate)

and specificity

(true negative rate). This only tells us what proportion of

positive and negative tests results are correct (given a known outcome).

1 3

2

St at ist ical Met hods for Anaest hesia and Int ensive Care

Of greater clinical application is the positive and negative predictive

values of the test

(PPV and NPVrespectively), which inform us of the

likelihood of disease (or adverse outcome) given a test result.30 It is

common for authors to report optimistic values for these indices, yet both

are dependent on the prevalence of disease - if a disease (or outcome) is

common, any test (irrespective of its diagnostic utility) will tend to have

a high PPV A similar test in another situation where disease prevalence

is lowwill tend to have poor PPV, yet high NPV Therefore, the context in

which the diagnostic test was evaluated should be considered - does the

trial population in which the test was developed represent the clinical

Table 11.1. A statistical checklist used by the British Medical journal (after Gardner et

al. 4)

Design features

1. Was the objective of the trial sufficiently described?

Yes Unclear No

2. Was there a satisfactory statement given of diagnostic

criteria for entry to trial?

Yes Unclear No

3. Was there a satisfactory statement given of source of subjects? Yes

Unclear No

4. Were concurrent controls used (as opposed to historical

controls)?

Yes Unclear No

5. Were the treatments well defined?

Yes Unclear No

6. Was random allocation to treatment used? Yes

Unclear No

7. Was the method of randomization described?

Yes Unclear No

8. Was there an acceptable delay from allocation to commence-

ment of treatment? Yes

Unclear No

9. Was the potential degree of blindness used? Yes Unclear No

10. Was there a satisfactory statement of criteria for outcome

measures? Yes Unclear No

11. Were the outcome measures appropriate? Yes Unclear

No

12. Was there a power based assessment of adequacy of sample

size? Yes Unclear

No

13.

Was the duration of post-treatment follow-up stated? Yes Unclear No

Commencement of trial

14.

Were the treatment and control groups comparable in

relevant measures? Yes Unclear No

15. Was a high proportion of subjects followed up? Yes Unclear

No

16.

Did a high proportion of subjects complete treatment? Yes Unclear No

17.

Were the drop-outs described by treatment/ control groups? Yes Unclear No

18.

Were side-effects of treatment reported? Yes Unclear No

Analysis and presentation

19.

Was there a statement adequately describing or referencing

all statistical procedures used? Yes No

20.

Were the statistical analyses used appropriate? Yes Unclear No

21.

Were prognostic factors adequately considered? Yes Unclear No

22.

Was the presentation of statistical material satisfactory? Yes No

23.

Were confidence intervals given for the main results? Yes No

24.

Was the conclusion drawn from the statistical analysis

justified? Yes Unclear No

Recommendation

25.

Is the paper of acceptable statistical standard for publication? Yes No

26.

If 'No' to Question 25, could it become acceptable with

suitable revision?

Yes No

circumstances for which the test is to be applied? Was there a broad

spectrum of patients studied?

Outcome prediction is sometimes based on a risk score or predictive

equation developed from a large data set using multivariate analyses

which, by virtue of its derivation, should be able to predict that original

data set well. Such derived tests need to be

externally validated using

other data sets, preferably at other institutions before accepting their

clinical utility.

Statistical checklist

The British

Medical journal used a statistical checklist

,4

which is

reproduced in Table 11.1.

References

Statistical errors in anaesthesia 133

1. Gore SM, Jones IG, Rytter EC. Misuse of statistical methods: critical

assessment of articles in BMJ from January to March 1976. Br Med J 1977;

1:85-87.

2. Glantz SA. Biostatistics: howto detect, correct and prevent errors in the

medical literature.

Circulation 1980; 61:1-7.

3.

Altman DG, Gore SM, Gardner MJ, Pocock SJ. Statistical guidelines for

contributors to medical journals. Br Med J 1983; 286:1489-1493.

4. Gardner MJ, Machin D, Campbell MJ. Use of check lists in assessing the

statistical content of medical studies. Br Med J 1986; 292:810-812.

5. Godfrey K. Statistics in practice. Comparing the means of several groups. N

Engl J Med 1985; 313:1450-1456.

6. Longnecker DE. Support versus illumination: trends in medical statistics

[editorial]. Anesthesiology 1982; 57:73-74.

7. Avram MJ, Shanks CA, Dykes MHM et al. Statistical methods in anesthesia

articles: an evaluation of two American journals during two six-month

periods. Anesth Analg 1985; 64:607-611.

8. Goodman NW, Hughes AO. Statistical awareness of research workers in

British anaesthesia. Br J Anaesth 1992; 68:321-324.

9. Altman DG. Statistics and ethics in medical research, v - analysing data. Br

Med J 1980; 281:1473-1475.

10. Yudkin PL, Stratton IM. How to deal with regression to the mean in

intervention studies. Lancet 1996; 347:241-243.

11. Altman DG. Comparability of randomised groups. Statistician 1985;

34:125-136.

12. Altman DG, Dore CJ. Randomisation and baseline comparisons in clinical

trials. Lancet 1990; 335:149-153.

13. Lavori PW, Louis TA, Bailar JC, Polansky M. Designs for experiments -

parallel comparisons of treatment. N Engl J Med 1983; 309:1291-1299.

14. Frieman JA, Chalmers TC, Smith H et al. The importance of beta, the type II

error and sample size in the design and interpretation of the randomized

controlled trial. N Engl J Med 1978; 299:690-694.

15. McPherson K. Statistics: the problem of examining accumulating data more

than once. N Engl J Med 1974; 290:501-502.

16. Bulpitt CJ. Subgroup analysis.

Lancet 1988; ii:31-34.

1 3

4

St at ist ical Met hods for Anaest hesia and Int ensive Care

17.

Oxman AD, Guyatt GH. A consumer's guide to subgroup analysis. Ann

Intern Med1992;116:78-84.

18. Geller NL, Pocock SJ. Interim analyses in randomized clinical trials:

ramifications and guidelines for practitioners. Biometrics 1987; 43:213-223.

19. Hochberg YA sharper Bonferroni method for multiple tests of significance.

Biometrika

1988; 75:800-802.

20.

Michels KB, Rosner BA. Data trawling: to fish or not to fish. Lancet 1996;

348:1152-1153.

21.

Abramson NS, Kelsey SF, Safar P, Sutton-Tyrrell K. Simpson s paradox and

clinical trials:

what you find is not necessarily what you prove. Ann Emerg

Med

1992; 21:1480-1482.

22.

Moses LE, Emerson JD, Hosseini H. Statistics in practice. Analyzing data

from ordered categories. N Engl J Med

1984; 311:442-448.

23.

Mathews JNS, Altman DG, Campbell MJ, Royston P. Analysis of serial

measurements in medical research.

Br Med J 1990; 300:230-235.

24.

Horan BE Standard deviation, or standard error of the mean? [editorial]

Anaesth Intensive Care

1982; 10:297.

25.

Altman DG, Gardner MJ. Presentation of variability. Lancet 1986; ii:639.

26. Bland JM, Altman DG. Calculating correlation coefficients with repeated

observations: part II - correlation between subjects.

Br Med J 1995; 310:633.

27. Archie JP Mathematical coupling of data: a common source of error.

Ann Surg

1981; 193:296-303.

28. Bland MJ, Altman DG. Statistical methods for assessing agreement between

two methods of clinical measurement. Lancet

1986; ii:307-310.

29. Gardner MJ, Altman DG. Confidence intervals rather than P values:

estimation rather than hypothesis testing.

Br Med J 1986; 292:746-750.

30. Myles PS, Williams NJ, Powell J. Predicting outcome in anaesthesia:

understanding statistical methods. Anaesth Intensive Care 1994; 22:447-453.

12

How to design a clinical trial

Why should anaesthetists do research?

Setting up a clinical trial

Data and safety monitoring committee

Phase I -I V drug studies

Drug regulations

Key points

Define the study question(s): what is the aim and study hypothesis?

Perform a literature review.

Use a pilot study to test your methods, measurement

generate preliminary data for consideration.

Develop a study protocol

-background

-aim, hypothesis, endpoints

-study design

-define groups, intervention(s)

-measurements, data recording

-sample size, statistics (and get advice from

-adverse events, safety monitoring.

Regulation

-drug licensing

-ethics committee approval and informed consent.

Role of the ethics committee

(institutional review board)

I nformed consent

Successful research funding

Submission for publication

a statistician)

techniques, an

In most circumstances medical research consists of studying a sample of

subjects (cells, animals, healthy humans, or patients) so that inferences

can be made about a population of interest. Unbiased sample selection

and measurement will improve the reliability of the estimates of the

population parameters and this is more likely to influence anaesthetic

practice.

Laboratory research usually investigates underlying mechanisms of

disease or aspects of drug disposition. Clinical research occurs in

patients. Epidemiology is the study of disease in populations. Each are

important and have their strengths. Most anaesthetists undertaking

laboratory research are supervised in an experienced (hopefully well-

resourced) environment. Clinical research is undertaken by a broad array

of researchers, of variable quality and support, with variable

infrastructure, equipment and staffing. This chapter is primarily

addressing aspects of clinical research.

The best studies are those that answer important questions reliably1,2

The most reliable study design is the randomized controlled trial,2-4 but

other designs have an important role in clinical research.1,5,6 One of the

main aims of medical research is to produce convincing study results and

conclusions that can ultimately improve patient outcome.

1 3 6

St at ist ical Met hods for Anaest hesia and Int ensive Care

Why should anaesthetists do research?

Identification of a clinical problem, and subsequent development and

participation in a study hypothesis, design, conduct, analysis and writing

of a research project can be a rewarding experience. Unfortunately much

research is poor and co-investigators may have little involvement in its

development and conduct.7,8 Involvement in the processes required to

complete a successful research project can teach critical appraisal skills,

but these can also be explicitly taught at the undergraduate and

postgraduate levels.

Many specialist training schemes demand completion of a research

project before specialist recognition is obtained, and consultant

appointment or promotion usually includes consideration of research

output. Thus there are imperatives to 'do research', despite some having

a lack of interest, support or specific training. Cynicism is often generated

by those who have had poor research experiences. This should be

avoidable. Anaesthetic trainees, and those with an interest in research,

should be guided and supported in a healthy, funded and staffed

research environment.

Setting up a clinical trial

The major steps involved in setting up a clinical trial are:

1. Define the study question(s).

Explicitly, what are the aims and

significance of the project? Identify a primary endpoint (there may be

several secondary endpoints); it should be clearly defined, including

under what conditions it is measured and recorded. An essential, often

neglected step, is to state the study hypothesis. Ultimately, the study

design must be able to answer the hypothesis.

2. Perform a literature review.

Previous studies may help in designing a

new study. What is the current evidence in the literature? What

questions remain unanswered? What deficiencies exist in previous

studies? In other words, explain why you are doing this study.

3. Develop a study protocol.

(a)

background - previous published research, outline why this study

should be undertaken

(b) clear description of aim and hypothesis

(c)

overview of study design (retrospective vs. prospective, control

group, randomization, blinding, parallel or crossover design) - a

good study design minimizes bias and maximizes precision

(d) study population, criteria for inclusion and exclusion (define

population)

(e) treatment groups, timing of intervention

(f) clear,

concise data collection, defined times, measurement

instruments

(g) sample size calculation based on the primary endpoint 9,10

(h) details of statistical methods - get advice from a statistician

(i)

reporting of adverse events, safety monitoring.

4. Perform a pilot study.

This is an important and neglected process. The

study protocol assumptions and methodologies need to be tested in

your specific environment. It is an opportunity to test measurement

techniques and generate preliminary data that may be used to

reconsider the sample size calculation and likely results. Is the

recruitment rate feasible?

5. Modify and finalize the study protocol.

This should be agreed to and

understood by all study investigators.

6. Satisfy regulations.

Drug licensing and ethics committee approval.

Data and safety monitoring committee

Clinical trials may be stopped early if (a) the superiority of one treatment

is so marked that it becomes unethical to deny subsequent patients the

opportunity to be treated with it, or (b) one treatment is associated with

serious risks.

An independent

data and safety monitoring committee (DSMC)

should be established to monitor large trials, and can advise early

stopping of a trial in the above circumstances. They are usually guided

by

predetermined stopping rules derived from

interim analyses.1

-13

Phase I -I V drug studies

H ow to design a clinical trial 13

7

New drug compound development can take up to 15 years and cost

US$700 million to get to market. Laboratory and animal testing of new

drug compounds eventually lead to human testing, which is divided into

four phases:

1.

Phase I: this is the first administration in humans (usually healthy

volunteers).

The aim is to confirm (or establish) basic drug

pharmacokinetic data and obtain early human toxicology data. Phase

I trials often only include 20-100 human subjects before moving on to

phase II trials.

2.

Phase II: selected clinical investigations in patients for whom the drug

is intended, aimed at establishing a dose-response ('dose-finding')

relationship, as well as some evidence of efficacy and further safety.

3.

Phase III: is full-scale clinical evaluation of benefits, potential risks

and cost analyses.

4.

Phase IV: is post-marketing surveillance involving many thousands of

patients.

Pharmaceutical companies usually design and sponsor phase I-III

studies. Phase IV studies are mostly designed and conducted by inde-

pendent investigators.

Drug regulations

Most countries have restrictions on the administration and research of

new drugs in humans (Table 12.1). There are established good clinical

138

Statistical Methods for Anaesthesia and Intensive Care

research practice (GCRP) guidelines for clinical investigators and

pharmaceutical companies. These include that a principal investigator

should have the relevant clinical and research expertise, there be a formal

study protocol, adequate staffing and facilities, ethics approval and

informed consent, maintenance of patient confidentiality, maintenance of

accurate and secure data, and there be processes to report adverse events.

In Australia, the

Therapeutic Goods Administration

(TGA) of the

Commonwealth Department of Health and Aged Care approves new

drug trials under one of two schemes:

1.

CTX: the clinical trials exemption scheme

2.

CTN: the clinical trials notification scheme.

The CTX scheme requires an expert committee to evaluate all aspects

of the drug pharmacology, including potential toxicology (mutagenicity,

teratogenicity, organ dysfunction and other reported side-effects) and

benefits. The CTN scheme bypasses this evaluation, usually because

extensive evaluation has occurred in one of a number of key index

countries (Netherlands,

New Zealand, Sweden, UK, USA). In this

Table 12.1 Websites for government

conduct of

c

l

inical trials

agencies responsible for new drug research or the

Agency

W ebsite

Australia

Therapeutic Goods Administration

(TGA)

www.tga.health.gov.a

u

Australian Health Ethics

Committee (AHEC)

www.health.gov.au/nhmrc/ethics/contents.ht m

Canada

Therapeutic Products Programme

(TPP)

www.hc-sc.gc.ca/hpb-dgps/therapeut /

Medical Research Council of

Canada www.mrc.gc.c

a

Europe

European Medicines Evaluation

Agency (EMEA)

www2.eudra.org/emea.html

International Conference on

Harmonisation (ICH)

www.ifpma.org/ichl.html

United Kingdom

Department of Health Research

and Development

www.doh.gov.uk/research/index.htm

Medicines Control Agency

www.open.gov.uk/mca/mcahome.htm

Committee on Safety of Medicines

(CSM)

www.open.gov.uk/mca/csmhome.ht

m

Medical Research Council (MRC) www.mrc.ac.u

k

United States

Food and Drug Administration

(FDA)

www.fda.gov

Center for Drug Evaluation and

Research

www.fda.gov/cder/

National Institutes of Health (NIH)

www.nih.gov

NIH Ethics Program

ethics.od.nih.go v

circumstance, the local ethics committee accepts responsibility for the

trial.

In the UK, the Licensing Division of the Medicines Control Agency of

the Department of Health is responsible for the approval and monitoring

of all clinical drug trials. The Secretariat of the Medicines Division of the

Department of Health will issue a CTX certificate after evaluation. More

extensive phase II-III trials are conducted only after a Clinical Trial

Certificate (CTC) is issued and the drug data reviewed by the Committee

on Safety of Medicines (CSM). The UK, along with most other European

countries, is also guided by the European Medicines Evaluation Agency

(EMEA).

In the USA, the Center of Drug Evaluation and Research, a Food and

Drug Administration (FDA) body of the Department of Health and

Human Services, evaluates new drugs through an Investigational New

Drug (IND) application. Clinical research can start after 30 days. In

Canada, this is overseen by the Therapeutic Products Programme (TPP).

In each of these countries there are similar processes required for new

therapeutic devices, such as implantable spinal catheters and computer-

controlled infusion pumps. Other countries have similar processes which

can be found on the world wide web, or via links from websites included

in Table 12.1.

The different regulations and standards that have existed in different

countries have been an obstacle to drug development and research in

humans. This has prompted co-operation and consistency between

countries. One of the more significant advances has been the

International Conference on Harmonisation (ICH) of Technical

Requirements for Registration of Pharmaceuticals for Human Use. This

includes the regulatory authorities of Europe, Japan and the USA, and the

pharmaceutical industry.

Role of the ethics committee (institutional review board)

Advances in medical care depend on medical research, for which

laboratory investigation, followed by experimentation on animals and

healthy volunteers leads to research on patients. Clinical research should

be thoroughly evaluated and supported within an institution so that it

has the best chance of being successfully completed and providing

reliable results. Poor research leads to misleading results, wastes

resources and puts patients at risk, and so is unethical. Ethics committee

approval has a role in ensuring good-quality research.

14

Ethical considerations include the Hippocratic principle of protecting

the health and welfare of the individual patient as well as the utilitarian

view of the potential benefit for the majority vs. risk to a few. These

considerations were explored by earlier investigations into ethical

research in humans, such as the Nuremberg Code of 1949

15,16

and the

Declaration of Helsinki in 1964.*

17

Most countries have developed

ethical guidelines based on these principles.

*

www.cirp.org/library/ethics/helsinki .

How t o design a clinical t rial 139

1 4 0 St at ist ical Met hods for Anaest hesia and Int ensive Care

In Australia, this is governed by the National Health and Medical

Research Council (NHMRC) statement on human experimentation and

local ethics committees are guided by the NHMRC Australian Health

Ethics Committee (see Table 12.1). Medical colleges and associations also

have their own ethical guidelines. A similar situation occurs in the UK

where the Department of Health has issued guidelines for research

within the NHS (including multi-centre research). 18 In the USA, the

National Institutes of Health (NIH) Ethics Program guides research

practices, and in Canada this is guided by the Medical Research Council

of Canada.

All research involving human beings, either observational or experi-

mental, should include approval through an established ethical review

and approval process. This is included in all GCRP guidelines.

I nformed consent

Patients should be informed of the nature of the research and be asked to

provide informed consent. This requires adequate disclosure of

information, potential risks and benefits (if any), competency and

understanding, and self-determination.

18-20

Patient confidentiality must

be maintained. It should be made clear to patients that they are under no

obligation to participate, that they can withdraw from the study at any

time, and that refusal or withdrawal will not jeopardize their future

medical care.

A key role in the requirement for informed consent for medical

research was played by Beecher in the early 1960s. 21 He presented details

of 18 studies at a symposium (and later published 22 examples in the New

England Journal of Medicine) where no patient consent was obtained.

21

Similar examples can still be found in the literature today

22

The concept of randomization to different treatment groups is a

challenging concept for patients (and some doctors).

23

The ethical

principle underlying this process includes the concept of equipoise,

whereby the clinician and patient have no particular preference or reason

to favour one treatment over another.

18,24 The conflicting roles of

researcher and clinician are sometimes difficult to resolve in this

situation.

18,23,24

Note that the Declaration of Helsinki includes the words

'

The health of my patient will be my first consideration'.

Some have argued that it is not always necessary to obtain consent,

25,26

or that patients are unable to provide truly informed consent,

19,20

or that

the clinician is in a better position to consider the relative merits of the

research. This paternalistic attitude has been rightly challenged. Madder,

in an essay on clinical decision-making, argued cogently that patients are

entitled to clear and reasonable information, and that they should be

included in decisions regarding their care.

27

Informed consent can be difficult in anaesthesia research. Patients

approached before elective surgery are often anxious and may also be

li

mited by concurrent disease.23 Feelings of anxiety, vulnerability,

confusion or mistrust may dominate their thought processes and restrict

H ow to design a clinical trial 141

their ability to provide informed consent. 20 Alternative randomization

methods have been advocated which may address these and other

concerns?28-30 but there is little evidence of benefit.

23

Obtaining informed consent for clinical trials on the day of surgery has

been studied previously

31-33

and is an important consideration given the

increasing trend to day-of-admission surgery. Patients generally prefer to

be approached for consent well in advance, but still accept recruitment on

the day of surgery if approached appropriately (i.e. private setting,

adequate time to consider trial information).31 3 Interestingly, 51%of

patients preferred not to know about a trial prior to admission as it only

increased their level of anxiety

32

Informed consent cannot be obtained in some circumstances. Patients

arriving unconscious or critically ill to the emergency department, or

those in the intensive care unit who are critically ill, confused or sedated

cannot provide informed consent. Government or institutional ethics

committees usually provide guidelines in these circumstances.

34

There are many issues at stake when considering the ethics of research

and consent in incompetent subjects. Many argue that such research is

important and should be supported. In general consent can be waived if

the research has no more than minimal risk to the subjects and it can be

demonstrated that the research could not be carried out otherwise.

34

Some institutions consider that a family member or next of kin can

provide consent in these circumstances, but this may not be legally

binding in many countries. In any case it would be reasonable to inform

the patient's family or next of kin of the nature of research so that they

have an opportunity to have any concerns or questions answered and be

asked to sign an acknowledgement

form. Under these circumstances the

institutional ethics committee accepts greater responsibility until the

patient's consent can be sought at a later

date (deferred consent).

Successful research funding

Peer review funding through major government medical research

agencies (e.g. NHMRC, MRC, NIH) is limited to only the top ranked 20%

of projects. Other sources of research funds are also available, including

from institutions, colleges, associations and benevolent bodies.

35

Successful funding is more likely if the proposed study addresses an

i

mportant question that has demonstrable clinical significance (nowor in

the future). There should be a clearly stated hypothesis and the study

design must be capable of answering it. The application must demon-

strate that there is minimal bias, and maximal precision and relevance. It

should include a sample size calculation and have detailed statistical

analyses. The study should be feasible, with demonstrable ability to

successfully recruit patients. This is best achieved with pilot or previous

study data. A successful track record of the chief investigator (or mentor)

is reassuring.

Funding agencies commonly rate applications on a number of criteria.

For example, in Australia the NHMRC and the Australian and New

1 4

2

St at ist ical Met hods for Anaest hesia and Int ensive Care

Zealand College of Anaesthetists use the following:

1. Scientific merit

2. Track record

3. Originality

4. Feasibility

5. Design and methods

6. International competitiveness.

The ten most common reasons for failure at NIH are:36

1. Lack of original ideas

2.

Diffuse, unfocused, or superficial research plan

3.

Lack of knowledge of published relevant work

4.

Lack of experience in essential methodology

5. Uncertainty concerning future directions

6. Questionable reasoning in experimental approach

7. Absence of acceptable scientific rationale

8. Unrealistic large amount of work

9. Lack of sufficient experimental detail

10. Uncritical approach.

Submission for publication

A paper is more likely to be published if it offers newinformation about

an important topic that has been studied reliably. Editors have a respon-

sibility to their readership and this is what they demand.

Advice on what to include and howa manuscript should be presented

can be sought from experienced colleagues (even in other disciplines).

The simplest and most important message is to follow a target journal's

guidelines for authors exactly. Many authors do not do this and it annoys

editors and reviewers to such an extent that it may jeopardize a fair

assessment! Efforts at maximizing the presentation of the manuscript are

more likely to be rewarded.

Manuscripts are usually set out with an Introduction, Methods, Results

and Discussion. A clear, complete description of the study methodology

(including statistical analysed°) is essential - a reader should be able to

reproduce the study results. The discussion should follow a logical

sequence: what were the study's main findings, how do they fit in with

previous knowledge, what were the weaknesses (and strengths) of the

study design, and what should now occur - a change in practice and/or

further research?

The Consolidated Standards of

Reporting Trials (CONSORT)

state-

ment has defined how and what should be reported in a

randomized

controlled trial.37*

The essential features include identifying the study as

a randomized trial, use of a structured abstract, definition of the study

population, description of all aspects of the randomization process, clear

study endpoints and methods of analyses, and discussion of potential

biases.

* www.ama-assn.org.

References

How t o design a clinical t rial 143

1. Duncan PG, Cohen MM. The literature of anaesthesia: what are we learning?

Can J Anaesthesia 1988; 3:494-499.

2. Sackett DL, Haynes RB, Guyatt GH, Tugwell P Deciding on the Best

Therapy: A Basic Science for Clinical Medicine. Little Brown, Boston 1991:

pp187-248.

3. Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized

trials? Stat Med 1984; 3:409-420.

4. Myles PS. Why we need large randomised trials in anaesthesia [editorial].

Br

J Anaesth 1999; 83:833-834.

5. Rigg JRA, Jamrozik K, Myles PS. Evidence-based methods to improve

anaesthesia and intensive care. Curr Opinion Anaesthesiol

1999;12:221-227.

6. Sniderman AD. Clinical trials, consensus conferences, and clinical practice.

Lancet 1999; 354:327-330.

7. Goodman NW. Making a mockery of research. BMJ 1991; 302:242.

8. Goodman NW. Does research make better doctors?

Lancet 1994; 343:59.

9. Frieman JA, Chalmers TC, Smith H

et al. The importance of beta, type II error

and sample size in the design and interpretation of the randomized

controlled trial. N Engl J Med 1978; 299:690-694.

10.

Gardner MJ, Machin D, Campbell MJ. Use of check lists in assessing the

statistical content of medical studies. BMJ 1986; 292:810-812.

11. Geller

NL, Pocock SJ. Interim analyses in randomized clinical trials:

ramifications and guidelines for practitioners.

Biometrics 1987; 43:213-223.

12. Pocock SJ. When to stop a clinical trial. BMJ

1992; 305:235-240.

13.

Brophy JM, Joseph L. Bayesian interim statistical analysis of randomised

trials. Lancet 1997; 349:1166-1169.

14.

Department of Health. Ethics Committee Reviewof Multicentre Research.

Department of Health, London 1997 (HSG[97]23).

15. Beales WB, Sebring HL, Crawford JT. Permissible medical experiments. From:

The judgement of the Nuremberg Doctors Trial Tribunal. In: Trials of war

criminals before the Nuremberg Military Tribunal, 1946-49 vol 2. US

Government Printing Office, Washington, DC.

16. Shuster E. The Nuremberg Code: Hippocratic ethics and human rights. Lancet

1998; 351:974-977.

17. World

Medical Organisation. Declaration of Helsinki. BMJ 1996;

313:1448-1449.

18.

Gilbertson AA. Ethical reviewof research [editorial]. Br J Anaesth 1999; 92:6-7.

19. Schafer A. The ethics of the randomised clinical trial. N Engl J Med 1982;

307:719-724.

20. Ingelfinger FJ. Informed (but uneducated) consent [editorial]. N Engl J Med

1972; 287:4651166.

21.

Kopp VJ. Henry Knowles Beecher and the development of informed consent

in anesthesia research. Anesthesiology 1999; 90:1756-1765.

22. Madder H, Myles P, McRae R. Ethics reviewand clinical trials. Lancet 1998;

351:1065.

23.

Myles PS, Fletcher HE, Cairo S et al. Randomized trial of informed consent

and recruitment for clinical trials in the immediate preoperative period.

Anesthesiology 1999; 91:969-978.

24. Freedman B. Equipoise and the ethics of clinical research. N Eng J Med 1987;

317:141-145.

25. Hanna GB, Shimi S, Cuschieri A. A randomised study of influence of two-

dimensional versus three-dimensional imaging on performance of

laparoscopic cholecystectomy. Lancet 1998; 351:248-251.

26. Cuschieri A. Ethics reviewand clinical trials (reply). Lancet 1998; 351:1065.

144 Statistical Methods for Anaesthesia and Intensive Care

27. Madder H. Existential autonomy: why patients should make their own

choices. J Ethics 1997; 23(4):221-225.

28. Zelen M. A newdesign for randomised clinical trials. N

Engl J Med 1979;

300:1242-1245.

29. Gore SM. The consumer principle of randomisation [letter].

Lancet 1994;

343:58.

30. Truog RD. Randomized controlled trials: lessons from ECMO.

Clin Res 1993;

40:519-527.

31. Mingus IVIL, Levitan SA, Bradford CN, Eisenkraft JB. Surgical patient's

attitudes regarding participation in clinical anesthesia research.

Anesth Analg

1996; 82:332-337.

32.

Montgomery JE, Sneyd JR. Consent to clinical trials in anaesthesia.

Anaesthesia1998; 53:227-230.

33. Tait AR, Voepel-Lewis T, Siewart M, Malviya S. Factors that influence parents'

decisions to consent to their child's participation in clinical anesthesia

research. Anesth Analg1998; 86:50-53.

34. Pearson KS. Emergency informed consent.

Anesthesiology 1998; 89:1047-1049.

35. Schwinn DA, DeLong ER, Shafer SL. Writing successful research proposals

for medical science.

Anesthesiology 1998; 88:1660-1666.

36.

Ogden TE, Goldberg IA. Research Proposals. A Guide to Success. 2nd ed.

Raven Press, NewYork 1995: pp15-21.

37. Begg F, Cho M, Eastwood S

et al. I mproving the quality of reporting of

randomized controlled trials. The CONSORT Statement. JAMA 1996;

276:637-639.

13

Which statistical test to use:

algorithms

The following algorithms are presented as a guide for newresearchers, in

order to assist them in choosing an appropriate statistical test. This is

ultimately determined by the research question, which in turn

determines the actual way in which the research should be designed and

the type of data to collect.

In practice, there may be several other tests that could be employed to

analyse data, with the final choice being left to the preference and

experience of the statistician (or researcher). Many of these statistical tests

are modifications of those presented here (and go under different names).

Nevertheless, the choices offered here should satisfy most, if not all, of

the beginner researcher's requirements.

We strongly recommend the

reader refer to the appropriate sections of this book in order to find more

detail about each of the tests, their underlying assumptions, and how

common mistakes can be avoided.

Each algorithm has three steps: (a) what type of research design is it,

(b)

what question is being asked, and (c) what type of data are being

analysed. Further description of these issues can be found in Chapters 1

and 4.

The algorithms are given in Figures 13.1-13.4:

• to compare two or more independent groups - is there a difference?

(Figure 13.1)

• to compare two or more paired (dependent) groups - is there a

difference? (Figure 13.2)

• to describe the relationship between two variables - is there an

association? (Figure 13.3)

• to describe the relationship between two measurement techniques - is

there agreement? (Figure 13.4)

146 Statistical Methods for Anaesthesia and Intensive Care

Figure 13.1 To compare two or more independent groups

Figure 13.2 To compare two or more paired (dependent or matched) groups

W hich statistical test to u se: algorithms

147

Figure 13.3 To describe the relationship between two variables

Figure 13.4 To describe the relationship between two measurement techniques

I ndex

Absolute risk, 75

Actuarial analysis, 105-11

Actuarial (life table) method, 106

Adjusted odds ratio, 76, 88, 102

Agreement, 80, 90-2, 131

Alpha (a) error, 22, 44, 54, 70, 113, 127

Alpha (a) value, 127

Alternative hypothesis, 21

Analysis of covariance, 88

Analysis of variance (ANOVA), 55,

58-9

repeated, 60-3

ANOVA see Analysis of variance

APACHE scoring systems, 102

Arithmetic mean, 8

Association, 39, 78-9

Bayesian inference, 30-1

Bayes' theorem, 30, 97-8

Before and after studies, 40-2

Beta error, 22, 113, 127

Bias, 33-4, 124, 126

Bimodal distribution, 14

Binary data, 1, 72

Binary variable, 88

Binomial distribution, 14-15, 72

Binomial test, 72

Bispectral index (BIS), 103

Bland-Altman plot, 90,131

Blinding, 43-4, 123

lack of, 126

Blocking, 126

Block randomization, 42

Bonferroni correction, 55, 66, 72, 128

Bootstrapping, 101

Box and whisker plot, 17

Breslowtest, 108

Carry-over effect, 41

Case-control study, 36-7

Case reports, 35-6

Case series, 35-6

Categorical data, 1-2, 68-77, 108, 124

Categorical variable, 88

Causation, 39

Censoring, 106

Central limit theorem, 14, 29

Central tendency, 7

measures of, 8

Chi-square distribution, 68-9

Chi-square test, 68-71, 108

misuse of, 130

Clinical practice guidelines, 118-19

Clinical trials, 35, 135-44

drug regulations, 137-8

funding, 141-2

informed consent, 140-1

phase I-IV drug studies, 137

publication, 142

role of ethics committee, 139-40

setting up, 136-7

see also Randomized controlled trials

Cochrane Collaboration, 114

Cochran Q test, 72

Coefficient of determination, 80

Coefficient of variation, 9

Cohort study, 37-9

Co-linearity, 89, 101

Committee on Safety of Medicines (UK),

139

Conditional probability, 97, 106

Confidence intervals, 10-11, 23-4, 37, 54,

74, 80, 84, 107, 114, 130, 131

Confidence limits, 10, 23

Confounding, 34, 76, 126, 127

Contingency tables, 69

analysis, 7

Controls, 36

Correlation analysis, 78, 80-2

misuse of, 130-1

Covariate, 81, 110

Cox proportional hazards, 100, 110

Cramer coefficient, 82

Critical values, 22

Crossover design, 40-2

Cross-sectional studies, 35

1 5 0 Index

Data accuracy, 46-7

Data checking, 46-7

Data and Safety Monitoring Committee, 45,

137

Data transformation, 16, 82, 129

Declaration of Helsinki, 139

Degree of dispersion, 7, 8-10

Degrees of freedom, 8, 53, 69, 70

Dependent variable, 78, 87, 129, 131

Dichotomous data, 1, 72

Dichotomous variable, 88

Digit preference, 46

Discrete scales, 4

Discriminant analysis, 100

Double-blind, 43,126

Dunnett's test, 59, 128

Effectiveness, 112

Efficacy, 112

Ethical review, 123

European Medicines Evaluation Agency,

139

Evidence-based medicine, 116-18

Exact probability, 71

Exact test, 69

Fallacy of affirming the consequent, 20

Fisher Protected Least Significant Difference

(LSD), 59

Fisher's exact test, 30, 71-2, 130

Food and Drug Administration (USA),

139

Frequency distributions, 11-15

binomial distribution, 14-15

normal distribution, 12-14

Poisson distribution, 15

Friedman two-way ANOVA, 66

Gehan test, 108

Generalizability, 22, 126

General linear model, 59

Geometric mean, 16,52

Good clinical research practice guidelines,

137-8

Goodness of fit, 14, 52, 85, 87, 100

Greenhouse-Geisser correction factor, 61

Hawthorne effect, 34, 118

Hazard rates, 105

Hazard ratio, 101, 108

Heterogeneity, 114,116

Hochberg procedure, 128

Homogeneity of variance, 51, 60

Homoscedasticity, 80

Hosmer-Lemeshowstatistic, 102

Hunyh-Feldt correction factor, 61

Hypotheses, 20

Incidence, 16

Incidence rate, 75

Independent data, 79, 129, 131

Independent variable, 78, 87, 131

Inferential statistics, 20-3

Informed consent, 140-1

Integers, 3

Intention to treat, 47-8

Interaction, 59, 101

Interim analysis, 45-6, 128, 137

International Conference on

Harmonisation, 139

Interquartile range, 8

Interval scales, 4

Intraclass correlation coefficient, 91-2

Intra-group variance, 129

Investigational NewDrug, 139

Kaplan-Meier method, 106-7

Kappa statistic, 76-7, 91

Kendall's coefficient of concordance, 82

Kendall's tau, 82

Kolmogorov-Smirnov test, 52, 129

Kruskal-Wallis ANOVA, 66

Kurtosis, 14

Lambda, 82

Latin square design, 42-3

Likelihood ratio, 97

Linear regression analysis, 78, 82-5

misuse of, 130-1

multivariate regression, 87-9

non-linear regression, 85-7

Linear relationship, 79

Line of best fit, 83

Logistic regression, 76, 88, 100, 102

Logrank test, 108

Log transformation, 16, 47, 52, 131

McNemar's chi-square test, 72-3, 130

Mallampati score, 94

Mann-Whitney U test, 64-5, 72

MANOVA, 59

Mantel-Haenszel test, 76,108

Matching, 36

Mathematical coupling, 89, 131

Mean, 8

Median, 8

Median survival time, 106

Medical Research Council of Canada, 140

Meta-analysis, 76, 114-16

Method of least squares, 83

Minimization, 43

Missing data, 47

Mode, 8

Modus tollens, 20

Multiple analysis of variance (MANOVA),

59

Multiple comparisons, 55, 72

Multiple correlation coefficient, 82

Multiple linear regression, 100

Multisample sphericity, 61

Multivariate analysis, 100

Multivariate regression, 59, 87-9

Multivariate tests, 76

National Health and Medical Research

Council (Australia), 140

National Institutes of Health (USA), 140

Negative predictive value, 95, 132

Newman-Keuls test, 59,128

n-of-1 trials, 41

Nomogram, 97

Non-linear regression, 85-7

Non-parametric tests, 28-9, 63-6

Friedman two-way ANOVA, 66

Kruskal-Wallis ANOVA, 66

Mann-Whitney U test, 64-5

Wilcoxon signed ranks test, 65

Normal approximation, 72

Normal distribution, 12-14, 80,128

Null hypothesis, 21, 68, 127

Number needed to treat (NNT), 75-6, 116

Numerical data, 3-4, 124

Nuremberg Code, 139

O'Brien-Fleming method, 128

Observational studies, 35

Odds ratio, 38, 74-5, 88, 101, 114

adjusted, 76, 88, 102

One-sample t-test, 52, 54

One-tailed t-test, 22, 55, 129

Ordinal data, 2-3, 124

Outcome variable, 78, 87, 129, 131

Outliers, 46

Paired (dependent) data, 69

Paired Mest, 52, 54, 129

Parallel groups design, 39-40

Parameters, 7

Parametric tests, 5, 28, 51-63

analysis of variance, 58-9

misuse of, 128-9

repeated ANOVA, 60-3

Student's t-test, 52-8

Partial correlation coefficient, 81

Pearson chi-square, 68-71

Pearson correlation coefficient, 79, 80

Percentage, 16

Percentiles, 8

Period effect, 41

Permutation tests, 29-30

Per protocol analysis, 48

Poisson distribution, 15

Population, 19-20, 21, 135

Positive predictive value, 95, 132

Index 1 5 1

Posterior probability, 30

Post hoc tests, 59

Post-test risk, 95

Power, 5, 25, 55, 64, 113

Power analysis, 25

Predictive equation, 133

Predictor variable, 78, 87, 131

Presentation of data, 16-17

Pre-test risk, 95

Prevalence, 16, 95, 99

Primary endpoint, 28

Prior probability, 30, 95

Probability, 21

Probit analysis, 86

Probit transformation, 86

Proportion, 16, 69

Prospective randomized controlled trial,

39-40

Publication bias, 115

P value, 22, 30, 131

Qualitative data, 1

Quantitative data, 1

Random effects model, 114

Random error, 34

Randomization, 42-3, 123

lack of, 126

Randomized controlled trials, 39-40, 112,

135,142

see also Clinical trials

Rate, 16

Ratio scales, 1, 4

Receiver operating characteristic (ROC)

curve, 98-9, 102

Regression analysis see

Linear/Logistic

regression

Regression coefficient, 84, 100, 101

Regression line, 83

Regression to mean, 125

Relative risk, 37, 74

Repeated measures ANOVA, 52, 60-3, 80,

129,131

Repeat ('paired') testing, 129-30

Residual, 60, 85

Risk, 37

Risk adjustment, 99

Risk factors, 100

Risk ratio, 16, 37, 74-5, 88

Risk score, 133

Sample, 19-20, 21

Sampling error, 25

Scatter diagram, 78, 79

Scatterplot, 78, 79

Scheffe test, 59, 128

Self-controlled trials, 40-2

Sensitivity, 94-5, 98, 131

15 2 Index

Sequence effect, 41

Sequential analysis, 44-5

Significance, 21

Significance level, 22

Simple randomization, 42, 126

Simpson s paradox, 128

Single-blind, 43, 126

Skew, 14, 82

Spearman rank correlation, 82, 131

Specificity, 94-5, 98, 131

Standard deviation, 130

Standard error, 9, 80, 84, 130

Standard error of the mean, 10, 23

Standardized score, 13

Statistical errors, 122-33

ethical considerations, 123

prevalence of, 122-3

prevention of, 123-4

Statistics, 7

inferential, 20-3

Stepwise regression analysis, 87, 100

Stopping rule, 45

Stratification, 42, 126

Student's t-test, 52-8

misuse of, 129

Subgroup analyses, 128

Sum of squares, 8

Survival analysis, 105-11

Survival curves, 107-10

Survival event, 105

Systematic error, 9, 34

Systematic review, 114-16

t distribution, 23

Test statistic, 21

Therapeutic Goods Administration

(Australia), 138

Therapeutic Products Programme

(

Canada), 139

Treatment effect, 131

Triple-blind, 43

Tukey's Honestly Significant Difference

(HSD) test, 59, 128

Two-tailed hypothesis, 22

type I error (a), 22, 44, 54, 70, 113, 127

type II error (beta), 22, 113, 127

Univariate analysis, 100, 102

Unpaired t-test, 52, 54

Variance, 8, 27, 40, 55

Verbal rating scales, 5

Visual analogue scales, 5-6, 63

Washout period, 41

Which test?

algorithims, 145-7

Wilcoxon rank sum test, 64-5, 108

Wilcoxon signed ranks test, 65-6

Yates' correction, 71, 130

z distribution, 13

z test, 53, 72

z transformation, 13

**Statistical Methods for Anaesthesia and I ntensive Care
**

Paul S Myles MB BS MPH MD FCARCSI FANZCA Head of Research and Specialist Anaesthetist Department of Anaesthesia and Pain Management Alfred Hospital, Victoria Associate Professor Departments of Anaesthesia, and Epidemiology and Preventive Medicine Monash University Melbourne, Australia and Tony Gin MB ChB BSc MD DipHSM FRCA FANZCA Chairman and Chief of Service Department of Anaesthesia and Intensive Care Chinese University of Hong Kong Prince of Wales Hospital Shatin, Hong Kong

Butterworth-Heinemann Linacre House, Jordan Hill, Oxford OX2 8DP 225 Wildwood Avenue, Woburn, MA 01801-2041 A division of Reed Educational and Professional Publishing Ltd A member of the Reed Elsevier Group First published 2000 © Reed Educational and Professional Publishing Ltd 2000 All rights reserved. No part of this publication may be reproduced in any material form (including photocopying or storing in any medium by electronic means and whether or not transiently or incidentally to some other use of this publication) without the written permission of the copyright holder except in accordance with the provisions of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London, England WlP OLE Applications for the copyright holder's written permission to reproduce any part of this publication should be addressed to the publishers British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library Library of Congress Cataloguing in Publication Data A catalogue record for this book is available from the Library of Congress ISBN 0 7506 4065 0

Typeset by E & M Graphics, Midsomer Norton, Bath Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn

Contents About the authors Foreword by Professor Teik E. Oh Preface Acknowledgements ix xi xiii xvii 1 1 1 2 3 5 7 8 8 8 8 8 8 8 8 9 9 10 11 12 14 15 16 16 16 16 19 19 20 1 Data types Types of data -categorical -ordinal -numerical Visual analogue scale Descriptive statistics Measures of central tendency -mode -median -mean Degree of dispersion -range -percentiles -variance -standard deviation -standard error Confidence intervals Frequency distributions -normal -binomial -Poisson Data transformation Rates and proportions Incidence and prevalence Presentation of data Principles of probability and inference Samples and populations Inferential statistics 2 3 .

vi Contents -definition of probability -null hypothesis -P value -type I and type II error Confidence intervals Sample size and power calculations Parametric and non-parametric tests Permutation tests Bayesian inference 21 21 22 22 23 24 28 29 30 33 33 34 34 34 35 36 37 39 39 40 42 42 42 43 43 44 45 46 47 47 51 51 52 58 60 63 64 65 65 65 68 68 71 71 72 72 74 4 Research design Bias and confounding -randomization and stratification Types of research design -observation vs. causation -randomized controlled trial -self-controlled and crossover trials Randomization techniques -block randomization -stratification -minimization Blinding Sequential analysis Interim analysis Data accuracy and data checking Missing data Intention to treat Comparing groups: numerical data Parametric tests -Student's t-test -analysis of variance (ANOVA) -repeated measures ANOVA Non-parametric tests -Mann-Whitney U test (Wilcoxon rank sum test) -Wilcoxon signed ranks test -Kruskal-Wallis ANOVA -Friedman two-way ANOVA Comparing groups: categorical data Chi-square -Yates' correction Fisher's exact test The binomial test McNemar's chi-square test Risk ratio and odds ratio 5 6 . experimentation -case reports and case series -case-control study -cohort study -association vs.

subgroup analyses and interim analysis -misuse of parametric tests . meta-analysis.Contents Number needed to treat Mantel-Haenszel test Kappa statistic 7 Regression and correlation Association vs. prediction Assumptions Correlation Spearman rank correlation Regression analysis Non-linear regression Multivariate regression Mathematical coupling Agreement Predicting outcome: diagnostic tests or predictive equations Sensitivity and specificity Prior probability: incidence and prevalence -positive and negative predictive value Bayes' theorem Receiver operating characteristic (ROC) curve Predictive equations and risk scores Survival analysis What is survival analysis? Kaplan-Meier estimate Comparison of survival curves -logrank test -Cox proportional hazard model The 'hazards' of survival analysis vii 75 76 76 78 78 79 80 82 82 85 87 89 90 94 94 95 95 97 98 99 105 105 106 107 108 110 110 112 112 113 114 116 118 122 122 123 123 124 125 126 126 126 127 127 128 8 9 10 Large trials. effectiveness Why we need large randomized trials in anaesthesia Meta-analysis and systematic reviews Evidence-based medicine Clinical practice guidelines 11 Statistical errors in anaesthesia Prevalence of statistical errors in anaesthesia journals Ethical considerations How to prevent errors What are the common mistakes? -no control group -no randomization -lack of blinding -misleading analysis of baseline characteristics -inadequate sample size -multiple testing. and evidence-based medicine Efficacy vs.

viii Contents -misuse of Student's t-test -repeat ('paired') testing -misuse of chi-square . standard error -misuse of correlation and simple linear regression analysis -preoccupation with P values -overvaluing diagnostic tests and predictive equations A statistical checklist 12 How to design a clinical trial Why should anaesthetists do research? Setting up a clinical trial Data and safety monitoring committee Phase I-IV drug studies Drug regulations Role of the ethics committee (institutional review board) Informed consent Successful research funding Submission for publication 13 Which statistical test to use: algorithms Index 129 129 130 130 130 131 131 133 135 136 136 137 137 137 139 140 141 142 145 149 .small numbers -standard deviation vs.

Chairman and Chief of Service of the Department of Anaesthesia and Intensive Care.About the authors Paul Myles is Head of Research in the Department of Anaesthesia and Pain Management at the Alfred Hospital. He is Chairman of Alfred Hospital Research Committee. Annals of Thoracic Surgery and Medical Journal of Australia). He was previously Professor of the Department of Anaesthesia. and Department of Epidemiology and Preventive Medicine. He is a member of three editorial boards (Anaesthesia and Intensive Care. Anesthesiology. He has over 100 publications and has been a regular reviewer of research proposals. He received his MPH (majoring in advanced statistics and epidemiology) from Monash University in 1995 and his MD (Clinical Aspects of Cardiothoracic Anaesthesia) in 1996. He has been a specialist anaesthetist for ten years. Paul Myles and Tony Gin are members of the Australian and New Zealand College of Anaesthetists' Examinations and Research Committees. Christchurch School of Medicine. University of Otago. Tony Gin is Professor. He has a joint university appointment (Monash University) as Associate Professor in the Department of Anaesthesia. . grant applications and manuscripts for many ethics committees. Chinese University of Hong Kong at the Prince of Wales Hospital. Journal of Cardiothoracic and Vascular Anaesthesia) and has reviewed for four others (British Journal of Anaesthesia. He completed his BSc in statistics after finishing medical school and has been lecturing and examining in pharmacology and statistics in Asia and Australasia for over ten years. Melbourne. funding bodies and journals. Asia-Pacific Heart Journal. Hong Kong. He has published over 70 papers and has received more than ten peer-reviewed research grants.

if not fearsome. The authors of this book are Tony Gin and Paul Myles. Thankfully. Statistical Methods for Anaesthesia and Intensive Care will make our lives easier. Oh President. both accomplished researchers and anaesthetists. aspect for an anaesthetist in training or practice. Books on statistics are usually written by acclaimed statisticians. they are generally helpful only to the extent that we can digest the heavy prose or the barrage of unfamiliar terms and concepts.Foreword An often puzzling. and it will lead us through the minefield of statistics that we have to cross. The book is written by anaesthetists for anaesthetists and clinicians. is that of grappling with 'stats'. a basic understanding of statistics is necessary in our reading of a scientific paper or in planning even the most modest research project. Of course. Professor Teik E. University of Western Australia . Department of Anaesthesia. Australian and New Zealand College of Anaesthetists Professor.

and (ii) inferential statistics. Why is an understanding of statistics important for anaesthetists and intensivists? Advances in anaesthesia and intensive care rely upon development of new drugs. or predict. the collection of data is from a restricted number of observations (individuals. Longnecker wrote: 'If valid data are analyzed improperly. in order to estimate certain characteristics of a population. such as 'all adult surgical patients'. then the results become invalid and the conclusions . statistics usually refers to the process of measuring and analysing data from a sample. techniques and equipment. 'all women undergoing laparoscopic surgery' or 'patients admitted to intensive care with a diagnosis of septic shock'. this chosen set of data is referred to as a sample and the reference group from which it is derived is referred to as a population.this can often be attributed to the common belief that statistical analyses are misused or misrepresented. and these estimates are usually compared with those of another group to determine whether one group differs significantly from the other. In most circumstances. In order to be confident about estimation of population parameters. In 1982. whereby a collection of data is summarized in order to characterize features of its distribution. we need to be sure that our sample accurately represents our intended population. A population does not necessarily include all individuals. describing and analysing data that are subject to random variation. characteristics of another (usually larger) group. and their understanding is crucial for evaluating reported advances in our specialty. reliable descriptive statistics and the correct use of inferential statistics are essential for good-quality research. It consists of two main areas: (i) descriptive statistics. whereby these summary data are processed in order to estimate. The statistical methods outlined in this book are meant to optimize this process. but is most often a defined group of interest. Evaluation and clinical application of these advances relies critically upon statistical methodology. Accurate. This is actually a justification for having at least a basic understanding of statistics. However it is at this level that most anaesthetists and intensivists lose interest or become sceptical .Preface 'Statistics' is the science of collecting. for our purposes. animals or any event subject to variation). These estimates of a population are most commonly an average value or proportion. So.

in our experience they are either too mathematical in their approach or. and trainees can systematically learn the basic principles of statistics (without boredom or frustration). This should enable the reader to successfully pass relevant examinations. usually with reference to sources of further information for the interested reader. this process can often be dramatic or lifesaving. Our intention was to make it easier for readers to find specific topics within the text. As doctors. Note that we have highlighted key words in bold print. Our readers can be reassured that knowledge of statistics. use examples that have little relevance to our specialty. * David E. when designed as a basic introductory textbook. although an essential component of the process. or whether a new diagnostic test adds to our clinical decisionmaking? Research design and statistics are tools to help clinicians make decisions. How convincing is this evidence and how does it relate to our own clinical experience. the net effect is to waste time. effort. or interpret the statistical methodology and design of a published scientific paper. 57:73-74. but a collection of tools used to logically guide rational clinical decisionmaking. and promotes the scientific foundations of our unique specialty: anaesthesia and intensive care. design a research trial.xiv Preface may well be inappropriate. Anesthesiology 1982. does not displace everyday clear thinking. As clinicians. using examples from the anaesthetic and intensive care literature. for anaesthetists and intensivists. At worst.'* Medical statistics is not just the use of clever mathematical formulae. The aim of this book is to explain a variety of statistical principles in such a way that advances the application and development of our knowledge base. and money for the project. At best. and what evidence there is to justify our choices (our patients may also want this information). or ' gut feelings'? How can we compare our own results with those of others? Under what circumstances should we change our practice? How can we show that one drug or technique is better than another. though they have recently been formalized and embraced by 'evidence-based medicine'. therapeutic decisions may well be based upon invalid conclusions and patients' wellbeing may be jeopardized. The design of this book is such that anaesthetists. . intensivists. More sophisticated information is presented in brief detail. Although there are many medical statistics books available on the market. Progress in our specialty is rapidly evolving and acquisition of up-to-date knowledge should be based upon critical scrutiny. we are expected to apply our special knowledge and training in such a way that promotes healing and good health. Each chapter begins with basic principles and definitions. and then explains how and why certain statistical methods are applied in clinical studies. we want to know what management options are available for our patients. Support versus illumination: trends in medical statistics. Longnecker. These processes are not new to medical practice.

' This time when the inspector appears. The moral of this story is that you should never use a statistical technique unless you are completely familiar with it. 'We've only bought one between us!' When the ticket inspector appears. the epidemiologists hide together in the toilet. 348: 1392. they reply.' 'Fools!'. 'We've bought one to share. 'No'. and share the journey with the statisticians. . 'We've not bought any. The epidemiologists slide their ticket under the door. say the statisticians. He clips the ticket and slides it back under the door to the statisticians. say the statisticians. On the return they purchase one ticket between them.leaving the epidemiologists to be caught by the inspector. They have.Preface xv A cautionary tale Three statisticians and three epidemiologists are travelling by train to a conference. The epidemiologists are very impressed. Melbourne. Australia) The Lancet 1996. 'Fools!'. who again ask whether they've all bought tickets. the statisticians hide together in the toilet. As retold by Frank Shann (Royal Children's Hospital. and resolve to adopt this technique themselves.' 'But what will you do when the inspector comes?' 'You'll see. The statisticians walk up to the door and knock on it. and the statisticians take it and use it as before . The inspector knocks and they pass the ticket under the door. The statisticians ask the epidemiologists whether they have bought tickets.

to illustrate some of our explanations. We would particularly like to thank and acknowledge the investigators who produced the work. Paul Myles would like to thank the Alfred Hospital Whole Time Medical Specialists for providing funds to purchase a notebook computer and statistical software. We have used data from many studies. published in many journals.Acknowledgements We would like to thank Dr Rod Tayler. and Dr Anna Lee for proofreading. Dr Mark Reeves and Dr Mark Langley for their constructive criticism of earlier drafts of this book. . We would like to thank the journal publishers for permission to reproduce these results.

• Numerical data may be ordinal. ordinal. The second type of data includes those which are measured on a numerical scale and are referred to as quantitative data. and eventually analysed. is also described by a hierarchical scale (of increasing precision): categorical. discrete or continuous. At the most basic level. Examples of categorical data 1. they are most often referred to as categorical (or nominal) data. Gender . The precision with which these data are observed and recorded. interval and ratio scales (Figure 1. these are known as dichotomous (or binary) data. The first type of data includes those which are defined by some characteristic.1). as the type of data collected ultimately determines the way in which the study observations are described and which statistical tests will eventually be used. Categorical data Because qualitative data are best summarized by grouping the observations into categories and counting the number in each. • VAS measurements are ordinal data. or quality. Types of data Before a research study is undertaken it is important to consider the nature of the observations to be recorded.male .female . it is useful to distinguish between two types of data. This is an essential step during the planning phase. A special case exists when there are only two categories. and are referred to as qualitative data.Data types Types of data -categorical -ordinal -numerical Visual analogue scale (VAS) Key points • Categorical data are nominal and can be counted. and are usually measured.

arrhythmia . adult) . Adverse events (major cardiovascular) .other The simplest way to describe categorical data is to count the number of observations in each group.acute myocardial infarction . percentages. Ordinal data If there is a natural order among categories.pericardial .medical .valvular .surgical .coronary artery . These observations can then be reported using absolute count.congestive cardiac failure . then the data can be .physical injury .1 Types of data 2.Figure 1. Type of ICU admission .other 4.myocardial .other 3.poisoning . rates or proportions.sudden death . so that there is a relative value among them (usually from smallest to largest). Type of operation (cardiac.

++. strictly speaking. Pain score 0 = no pain 1 = mild pain 2 = moderate pain 3 = severe pain 4 = unbearable pain 2. there is not a direct mathematical relationship. D.Data types 3 considered as ordinal data. Although there is a semiquantitative relationship between each of the categories on an ordinal scale. ++++). Ordinal data can also be summarized by the median value and range (see Chapter 2). Nevertheless. it may be equally appropriate to use a non-numerical record (A. Examples of ordinal data 1. Extent of epidural block: A = lumbar (L1-L5) B = low thoracic (T10-T12) C = mid-thoracic (T5-T9) D = high thoracic (T1-T4) 3. particularly for the convenience of data recording and eventual statistical analyses. rates or proportions. however. a type of categorical data. C. For example. whereas continuous data can assume any value. ordinal data are. Preoperative risk: ASA* I/II = low risk ASA III = mild risk ASA IV = moderate risk ASA V = high risk Numerical data Quantitative data are more commonly referred to as numerical data. Episodes of myocardial ischaemia (discrete) 2. Creatinine clearance (continuous) 4. observations that are counted are discrete numerical data and observations that are measured are usually continuous data. have practical usage. A numerical scoring system does. B. Once again these observations can be described by an absolute count. percentages. . Discrete numerical data can only be recorded as whole numbers (integers). Examples of numerical data 1. nor is the difference between a score of 1 and 0 equal to the difference between a score of 3 and 2. or +. +++. Body weight (continuous) 3. For ordinal data. a numerical scoring system is often used to rank the categories. Cardiac index (continuous) * ASA = American Society of Anesthesiologists' physical status classification. Put simply. these observations can be subdivided into discrete and continuous measurements. a pain score of 2 indicates more pain than a score of 1. but it does not mean twice as much pain.

and are dealt with using the same statistical methods. The classic example of this is the measurement of temperature. but may be considered as continuous data. For example. Numerical data are usually reported as mean and standard deviation. Continuous data can also be further subdivided into either an interval or ratio scale. with number of episodes of myocardial ischaemia. pHi (continuous data) and blood pressure (continuous data). total morphine consumption (continuous data) and serum cortisol level (continuous data). use of inotrope infusions (categorical data) and bicarbonate administration (yes/no: dichotomous. resulting in a true ratio. it is possible that any value may exist (at any one time) and a value of. including the therapeutic intervention scoring system (TISS) (ordinal data). this distinction has no practical significance for our purposes. whereby data on a ratio scale have a true zero point and any two values can be numerically related. and is usually recorded as such. Myles et al. but when measured on a Kelvin scale they are ratio data: 0°C is not zero heat. For example. Outcomes of interest included number of organs failed in each patient (discrete data). 2 = moderate pain. . Gutierrez et a1. 9.e. or as median and range (see Chapter 2). so that discrete data may assume the properties of continuous data if there is a large range of potential values. Respiratory rate (discrete/ continuous) 6. If temperatures are measured on a Celsius scale they are considered interval data. In general. where 0 = no pain. 1 recorded the following outcomes: pain score. in a study investigating the comparative benefits of patient controlled analgesia after cardiac surgery. It has to be admitted that the distinction between discrete and continuous numerical data is sometimes blurred. or eventual statistical analysis does not consider this possible precision). This would not be the case. nor is 26°C twice as hot as 13°C. the observations of interest in a research study are also referred to as variables. in that they can have different values (i. although respiratory rate is generally considered to only have discrete values. for example. 2 investigated whether outcome in the ICU could be improved with therapy guided by measurement of gastric intramucosal pH (pHi). as both types of continuous data are recorded and reported in the same way. So that. gender may be referred to as a categorical or dichotomous variable. They also recorded therapeutic interventions. categorical data).4 breaths/ min is meaningful. 1 = mild pain. Studies may include more than one type of data. As another example. for example. if it is conceptually possible to achieve any value throughout the possible range of values (even if the observations are not recorded as such. However. say. they can vary). and cardiac index as a continuous variable.4 Statistical Methods for Anaesthesia and Intensive Care 5. incidence of respiratory depression (categorical data). incidence of organ failure (categorical data). Post-tetanic count (discrete) There are circumstances where data are recorded on a discrete scale. 3 = severe pain and 4 = unbearable pain (these are ordinal data).

Data types

5

**Visual analogue scale (VAS)
**

A frequently used tool in anaesthesia research is the 100 mm visual analogue scale (VAS). 3 This is most commonly used to measure postoperative pain, but can also be used to measure a diverse range of ( mostly) subjective experiences such as preoperative anxiety, postoperative nausea, and patient satisfaction after ICU discharge. Because there are infinite possible values that can occur throughout the range 0-100 mm, describing a continuum of pain intensity, most researchers treat the resulting data as continuous. 4,5 If there is some doubt about the sample distribution, then the data should be considered ordinal. There has been some controversy in the literature regarding which statistical tests should be used when analysing VAS data. 4,,5 Some statistical tests ('parametric tests') assume that sample data have been taken from a normally distributed population. Mantha et a1. 4 surveyed the anaesthetic literature and found that approximately 50% used parametric tests. Dexter and Chestnuts used a multiple resampling (of VAS data) method to demonstrate that parametric tests had the greater power to detect differences among groups. Myles et a1.6 have recently shown that the VAS has properties consistent with a linear scale, and thus VAS scores can be treated as ratio data. This supports the notion that a change in the VAS score represents a relative change in the magnitude of pain sensation. This enhances its clinical application. Nevertheless, when small numbers of observations are being analysed (say, less than 30 observations), it is preferable to consider VAS data as ordinal. For a number of practical reasons, a VAS is sometimes converted to a ' verbal rating scale', whereby the subject is asked to rate an endpoint on a scale of 0-10 (or 0-5), most commonly recorded as whole numbers. In this situation it is preferable to treat the observations as ordinal data.

**Changing data scales
**

Although data are characterized by the nature of the observations, the precision of the recorded data may be reduced so that continuous data become ordinal, or ordinal data become categorical (even dichotomous). This may occur because the researcher is not confident with the accuracy of their measuring instrument, is unconcerned about loss of fine detail, or where group numbers are not large enough to adequately represent a variable of interest. In most cases, however, it simply makes clinical interpretation easier and this is the most valid and prevalent in the medical literature. For example, smoking status can be recorded as smoker/non-smoker (categorical data), heavy smoker/light smoker/ex-smoker/non-smoker (ordinal data), or by the number of cigarettes smoked per day (discrete data). Another example is the detection of myocardial ischaemia using ECG ST-segment monitoring - these are actually continuous numerical data, whereby the extent of ST-segment depression is considered to represent

6

Statistical Methods for Anaesthesia and Intensive Care

the degree of myocardial ischaemia. For several reasons, it is generally accepted that ST-segment depression greater than 1.0 mm indicates myocardial ischaemia, so that ST-segment depression less than this value is categorized as 'no ischaemia' and that beyond 1.0 mm as 'ischaemia'. 7 This results in a loss of detail, but has widespread clinical acceptance (see Chapter 8 for further discussion of this issue).

References

1. Myles PS, Buckland MR, Cannon GB et al. Comparison of patient-controlled analgesia and nurse-controlled infusion analgesia after cardiac surgery.

Anaesth Intensive Care 1994; 22:672-678. 2. Gutierrez G, Palizas F, Doglio G et al. Gastric mucosal pH as a therapeutic index of tissue oxygenation in critically ill patients. Lancet 1992; 339:195-199. 3. Revill Sl, Robinson JO, Rosen M et al. The reliability of a linear analogue for evaluating pain. Anaesthesia 1976; 31:1191-1198. 4. Mantha S, Thisted R, Foss J et al. A proposal to use confidence intervals for

visual analog scale data for pain measurement to determine clinical significance. Anesth Analg 1993; 77:1041-1047. 5. Dexter F, Chestnut DH. Analysis of statistical tests to compare visual analogue scale data measurements among groups. Anesthesiology 1995; 82:896-902. 6. Myles PS, Troedel S, Boquest M, Reeves M. The pain visual analogue scale: is it linear or non-linear? Anesth Analg 1999; 89:1517-1520. 7. Fleisher L, Rosenbaum S, Nelson A et al. The predictive value of preoperative silent ischemia for postoperative ischemic cardiac events in vascular and nonvascular surgical patients. Am Heart J 1991; 122:980-986.

2

Descriptive statistics

Measures of central tendency

**-mode -median -mean
**

Degree of dispersion

Confidence intervals Frequency distributions

-normal -binomial -Poisson

**-range -percentiles -variance -standard deviation -standard error
**

Key points

Data transformation Rates and proportions I ncidence and prevalence Presentation of data

• The central tendency of a frequency distribution can be described by the mean, median or mode. • The mean is the average value, median the middle value, and mode the most common value. • Degree of dispersion can be described by the range of values, percentiles, standard deviation or variance. • Standard error is a measure of precision and can be used to calculate a confidence interval. • Most biological variation has a normal distribution, whereby approximately 95% of observations lie within two standard deviations of the mean. • Data transformation can be used to produce a more normal distribution.

Descriptive statistics summarize a collection of data from a sample or population. Traditionally summaries of sample data ('statistics') are defined by Roman letters (z, s,,, etc.) and summaries of population data ('parameters') are defined by Greek letters (y, 6, etc.). Individual observations within a sample or population tend to cluster about a central location, with more extreme observations being less frequent. The extent that observations cluster can be described by the central tendency. The spread can be described by the degree of dispersion. For example, if 13 anaesthetic registrars have their cardiac output measured at rest, their results may be: 6.2, 4.9, 4.7, 5.9, 5.2, 6.6, 5.0, 6.1, 5.8, 5.6, 7.0, 6.6 and 5.5 1 /min. How can their data be summarized in order to best represent the observations, so that we can compare their cardiac output data with other groups? The most simple approach is to rank the observations, from lowest to highest: 4.7, 4.9, 5.0, 5.2, 5.5, 5.6, 5.8, 5.9, 6.1, 6.2, 6.6, 6.6 and 7.01/min. We now have a clearer idea of what the typical cardiac output might be, because we can identify a middle value or a commonly occurring value (the smallest or largest value is least likely to represent our sample group).

the mean can be calculated as 75. the arithmetic mean) is the average value.7) 2. The median is the 50th centile. The formula for the mean is: where x = each observation. We can describe 25%. The variance is such a method. If there is an even number of observations.1/13 = 5. In the example above. If the sample is ranked. If we include the middle 50% of the observations about the median (25th to 75th centile).0 . This is explained by a defined number of . The formula for the variance (and standard deviation.78 1/min. or variability. The mean is the most commonly used single measure to summarize a set of observations. 50%. It sums the square of each difference ('sum of squares') and divides by the number of observations. or any other amount of observations. The range does not provide much information about the overall distribution of observations. The formula for variance is: The expression within the parentheses is squared so that it removes negative values. The expression ' n -1' is known as the degrees of freedom and is one less than the number of observations. It is calculated as the sum of (depicted by the Greek letter. A clearer description of the observations can be obtained by ranking the data and grouping them into percentiles. and n = number of observations.8 1/min. In the example. Y. In the example above. and is also heavily affected by extreme values. divided by the number of observations.4. In the example above it is 6.8 Statistical Methods for Anaesthesia and Intensive Care Measures of central tendency The sample mode is the most common value.3 1/min. A better method of measuring variability about the mean is to see how closely each individual observation clusters about it.2-6. The mean (or more correctly.1 1/min. In the example above it is 5.6 1/min. We then have more information about the pattern of spread. the range is (7. of a sample can be readily described by the minimum and maximum values. The difference between them is the range. It is usually a reliable measure of central tendency. then the median is calculated as the average of the two middle values. the interquartile range is 5. Degree of dispersion The spread. we have the interquartile range. not much more frequent than other observations). This may not be the best method of summarizing the data (in our example it occurs twice. see below) for a population has the value 'n' as the denominator. the median is the middle value. Percentiles rank observations into 100 equal parts.) the observations. 75%.

SD can be calculated as 0. The formula for SD is: In the example above. It is not meant to be used to describe variability of sample data. Provided that one does not choose the largest remainder. . Another measure of variability is the standard error (SE). Biological variability . but not the fixed sample mean value). but the last number is fixed by the first three choices. The degrees of freedom describe the number of independent observations or choices available. This is sometimes difficult to comprehend and so we often use the square root of variance in order to retain the basic unit of observation. It is commonly used to describe variability of measurement instruments. CV = SD/mean x 100% There are many sources of variability in data collection. Another measure of variability is the coefficient of variation (CV). This is systematic error. Lastly.7141/min. The positive square root of the variance is the standard deviation (SD or s x ). The variance is measured in units of x2 . This considers the relative size of the SD with respect to the mean.it is a measure of precision. taking multiple measurements. Systematic error cannot be compensated for by increasing sample size. and using trained observers.variation between individuals and over time .1-5 It is used to estimate a population parameter from a sample . it is possible to have free choice in choosing the first three numbers. It is calculated from the SD and sample size (n): Standard error is a much smaller numerical value than SD and is often presented (wrongly) for this reason.is a fundamental source of scatter.each observation is free to vary except the last one which must be a defined value.Descriptive statistics 9 observations in a sample with a known mean . The ability to detect differences between groups is blurred by large variance. and this inflates the sample size that is needed to be studied. The degee of freedom was (n . It is generally accepted that a CV of less than 5% is acceptable reproducibility. Random error can be reduced by use of accurate measurement instruments.1). These types of variability result in random error. Consider a situation where four numbers must add up to ten and one can choose the four numbers (n = 4). Determined efforts should be made to minimize random and systematic error. The degrees of freedom is used when calculating the variance (and standard deviation) of a sample because the sample mean is a predetermined estimate of the population mean (each individual in the sample is a random selection. Another source of variability is measurement imprecision (this can be quantified by the CV). there are mistakes or biases in measurement or recording.

54 1/min sample C (n = 8) mean 5. has a 95% probability of including the population mean.99 1/min sample D (n = 13) mean 6. We use SE to define a range in which the true population mean value should lie.e. a fourth (n = 13). sample A (n = 13) mean 5. In general we are not interested in the characteristics of multiple samples. Confidence intervals Confidence intervals are derived from the SE and define a range of values that are likely to include a population parameter.78 l/ min sample B (n = 11) mean 5. Each may have sampled from the same population (in our example these may be anaesthetic registrars within a regional training programme) and so each sample could be used to estimate the true population mean and the SD. and 2. and then a third (n = 8).96 standard errors either side of the sample mean. The range. If one takes a number of samples from a population.23 1/min The SE represents the SD of the sample means (0.714 and an SE of (0. 95 or 99%) : 95% confidence intervals (95% CI) are most commonly used. i. The example above (sample A) has an SD of 0.10 Statistical Methods for Anaesthesia and Intensive Care Standard error is also known as the standard error of the mean. The five sample means would have their own distribution and it would be expected to have less dispersion than that of all the individuals in the samples. n) and the degree of confidence required (say 90.58 standard errors either side of the sample mean has a 99% probability of including the . then prediction becomes more reliable. Standard error is used to calculate confidence intervals.75 1/min mean (of 5 samples): 5. If the sample is very large (with a large value of n).12 1/min sample E (n = 15) mean 5. If we now selected a second group of (say) 11 individuals (n = 11). and so is a measure of precision (of how well sample data can be used to predict a population parameter). The two ends of the range are called confidence limits.61) = 0.714/3. The SD of the sample means is the standard error. 1. Large samples increase precision. and perhaps a fifth (n = 15). we will have a mean for each sample. but more specifically how reliable our one sample is in describing the true population. This is an inefficient (and possibly costly) method. we would have five different sample means. In the example above we selected 13 individuals and measured their cardiac outputs. as a halving of SE requires a four-fold increase in sample size (sq4 = 2). The width of the confidence interval depends on the SE (and thus sample size. We stated above that random error can be compensated for by increasing sample size.841/min SD of the 5 sample means: 0. But we do not generally take multiple samples and are left to determine the SE from one sample.23 1/min).201/min.

171/min.8/3. .46 -29.46 -8. They are all calculated from SE (but each has a different formula to estimate its SE).54 -30. or 95%.54 -7.5 Frequency distributions It is useful to summarize a number of observations with a frequency distribution.95. .e.1 and 2. proportions. mean).96 x 0. It can also be stated that 95% of further sample means would lie in this range.54 32. ml/min) in 15 critically ill patients.87=6.14 95% CI of the mean = 55.2).46 0. The 95% CI relates to the sample statistic (e.54 -41.96 standard deviations of the mean.Descriptive statistics population mean.25 1563 1719 868 508 56. while the SE of the mean relates the sample mean to the true population mean.5 total (sum) = 1012 mean.54 39.1 A set of observations: creatinine clearance values (x. 95% CI can be calculated as the mean (5. i. risk ratios.9 72.78) ± (1. regression coefficients. I QR = interquartile range.8 SE=23. In a similar way. etc.54 7.46 22.9 55. correlation coefficients.198).g.54 8.1 133 928 71. Example 2. the SD indicates the spread of individual observations in the sample. Confidence intervals can be used to estimate most population parameters from sample statistics (means. This states that the probability (P) of the true population cardiac output lying within this range is P = 0. In our example above.4 to 79.46 11.46 SD = sq(7930/14) = 23.9 1059 4620 6. CI = confidence i nterval Patient 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 x 76 100 46 65 89 37 59 68 107 26 38 90 75 76 60 8.54 -21.96 x SE) 11 The 95% CI should be distinguished from a property of the normal distribution where 95% of observations lie within 1. x = 1012/15 = 67. that is 5.x) 2 72.46 X-(oversc) (x. They may be summarized in a table or a graph (see also Figures 2.6 0. 95% CI of the mean = sample mean ± (1. not the individual observations.5 to 82.7 7930 range = 81 (26 to 107) mode = 76 median = 68 I QR = 52. This is a set of all observations and their frequencies.39 to 6.46 -2.see later chapters).

12 Statistical Methods for Anaesthesia and Intensive Care Figure 2. with a symmetrical positive and negative deviation about this point.1 we can summarize the frequency distribution of c reatinine clearance values (CrCl.1 A bar diagram of the distribution of creatinine clearance values in critically ill patients Example 2.0 Normal distribution Most biological variation has a tendency to cluster around a central value.20 0.2 0. and more extreme values becoming less frequent the further they lie from the .33 0. ml/min) when categorized into intervals CrCl interval 0-20 21-40 41-60 61-80 81-100 101-120 Frequency 0 3 3 5 3 1 Relative frequency 0 0.07 Cumulative frequency 0 0.2 0.40 0.93 1.2 From Example 2.73 0.2 0.

The formula for z is: where y = mean. It can be used to determine probability (by referring to a z table in a reference book). It can be defined by the following eauation: A z transformation converts any normal distribution curve to a standard normal distribution curve. Using Example 2. This results in a standardized score. If we now look at the formula for the normal distribution. Gauss (1777-1855). the number of standard deviations from the mean in a standard normal distribution curve.e. the standard deviation). This is also known as the z distribution.1 above. the other terms are constants: The standard normal distribution curve (Figure 2. SD = 1. namely p (mu. we can determine the probability of a creatinine clearance value less than 40 ml/min if we assume that the sample data are derived from a normal distribution and the sample mean and SD represent that population: . with mean = 0. z.Descriptive statistics 13 central point. It is sometimes referred to as a Gaussian distribution after the German mathematician. the mean) and 6 (sigma. i. These features describe a normal distribution and can be plotted as a normal distribution curve. we can see that there are two parameters that define the curve.3) is a symmetrical bell-shaped curve with a mean of 0 and a standard deviation of 1.

then the measures of central tendency will differ. and indicates why the norma distribution is so important in medical research. In some circumstances there is an asymmetric distribution (skew). it would not b. n > 100). The median i. either graphically or by using i ' goodness of fit' test (see Chapters 5 and 11). If the distribution is skewed tc the right.7%. A bimodal distribution consists of two peaks and suggests that the sample is not homogeneous but possibly represents two different populations. Many statistical techniques have assumptions of normality of the data It is not necessary for the sample data to be normally distributed. The probability is equal to the area under the curve. e. 5.lv L:viie~1Yeiiut1 uw a wile-Laiieuf r value or u. If a distribution is skewed. If sampef data are skewed. they can first be transformed into a normal distribution and then analysed (see below). median and mean of this curve are the same. preferable to be able to demonstrate this. considered a very uncommon event in this population. This i explained by the central limit theorem. l.rz. There are a fixed number of observations (trials) Only two outcomes are possible The trials are independent There is a constant probability for the occurrence of each event The binomial distribution describes the probability of events in a fixed . The mode.a. then the median will be smaller than the mean. the shape of sampling distribution will approximate a normal distribution curve eve] if the distribution of the variable in question is not normal.4%. 4.14 Statistiral Mathnrfs fnr AnAPSthPCIA and Intancivo ('. As the number of observations increases (say. 95% o the population lie within 1. Kurtosis describes how peaked thf distribution is. but i should represent a population that is normally distributed . In other words. sc that one of the tails is elongated. rnis means mat to probability of a critically ill patient having a creatinine clearance of les than 40 ml/min is 0. 3.12 (or 12%). Binomial distribution A binomial distribution exists if a population contains items which belong to one of two mutually exclusive categories (A or B). a better measure of central tendency in a skewed distribution. The kurtosis of a normal distribution is zero. twc standard deviations 95. The normal distribution curve has a central tendency and a degree o dispersion.96 standard deviations. and three standard deviations 99. 2. male/female complication/no complication It has the following conditions: 1.6 It i. In a normal distribution one SD either side of the mean includes 68% of the total area.g.

.

Incidence and prevalence Incidence and prevalence are often used interchangeably.0805). new case) in a given time period.0. The most commonly used transformation is a log transformation. It consists of a numerator (number of events) and a denominator (number in the population).0745). there is an increased risk with that exposure. This is a common practice that is generally accepted. or 7. or 8. if 14 colorectal surgical patients have died in a hospital performing 188 cases in the previous 12 months. or risk. mortality 'rates' often include the deaths in the denominator). obtained by dividing the number of individuals with the disease by the number of people in the population.16 Statistical Methods for Anaesthesia and Intensive Care Data transformation In some circumstances it may be preferable to transform a distribution so that it approximates a normal distribution.45%. Rates are sometimes used interchangeably with proportions (e. It has a value between 0 and 1. Prevalence is the current number of cases (pre-existing and new). Presentation of data The mean is the most commonly used single measure to summarize a set of observations. It is usually a reliable measure of central tendency.0. and can be multiplied by 100% to give a percentage. Rates and proportions A rate is a measure of the frequency of an event. the proportion of deaths is 14/188 (= 0.6 This is a useful approach if sample data are skewed. For example. If the risk ratio is greater than 1. Two proportions may be compared by combining them as a ratio. This generally equalizes group variances. A proportion includes the numerator within the denominator. and makes data analyses and interpretation easier. This .6 The antilog of the mean of a set of logarithms is a geometric mean.e.g. but they are different and this difference should be understood. It is a good measure of central tendency if a distribution is skewed. of developing a disease in a specified time period. For example. The incidence rate is an estimate of the probability. the reported mortality rate would be 14/174 (= 0.05%. a characteristic of a normal distribution. This can result in a mean that is independent of the variance. Prevalence is a proportion. Note that a rate does not include the number of events in the denominator. Incidence is the number of individuals who develop a disease (i. In our colorectal surgical example. a risk ratio is the incidence rate of an event in an exposed population versus the incidence rate in a non-exposed population.

and range or interquartile range for degree of spread. Ordinal data should be described with mode or median. or minimum and maximum). Figure 2. we can use our data from Example 2. Tables should have their rows and columns labelled. but on these occasions they should be clearly labelled. It has been suggested that the correct method for presentation of normally distributed sample data variability is mean (SD) and not mean (± SD). Graph axes and scales should be labelled. Most journals include specific guidelines and these should be followed.1 and depict the median (line through box). Categorical data can be presented as number A box and whisker plot (Figure 2. One of the weaknesses of the mean is that it is affected by extreme values. On some occasions it may be acceptable to use standard error bars on graphs (for ease of presentation). Tables and diagrams are convenient ways of summarizing data.4 A box and whisker plot of creatinine clearance data in critically ill patients . 7 The '±' symbol implies that we are interested in the range of one SD above and below the mean.Descriptive statistics 17 is because most biological variation is symmetrically distributed about a central location. Excessive detail or numerical precision (say. beyond 3 significant figures) should be avoided. we are generally more interested in the degree of spread of the sample data. interquartile range (box) and whiskers (5% and 95% centiles. If the axis does not begin at the origin (zero). and as simple as possible. The mode is best used if the data have a bimodal distribution. interquartile range and range. then a break in the axis should be included. They should be clearly labelled and self-explanatory. Mean and SD are the best statistics to use when describing data from a normal distribution. In these circumstances it may be preferable to use median or geometric mean as a measure of central tendency. For example. or each category as number (%).4) can be used to depict median.

ii:639. Lancet 1986.18 Statistical Methods for Anaesthesia and Intensive Care References 1. 7. 64:607-611. Altman DG. 281:1473-1475. Transforming data. Statistical methods in anesthesia articles: an evaluation of two American journals during two six-month periods. Gardner MJ.analysing data. 310:298. Horan BE Standard deviation. Glantz SA. Dykes MHM et al. 4. . The normal distribution. BMJ 1996:312:770. 2. 61:1-7. Statistics and ethics in medical research. Altman DG. Bland JM. BMJ 1980. 5. 3. v . Altman DG. BMJ 1995. Altman DG. Circulation 1980. Biostatistics: how to detect. Bland JM. or standard error of the mean? [editorial] Anaesth Intensive Care 1982. correct and prevent errors in the medical literature. Anesth Analg 1985. Presentation of variability. Avram MJ. 6.10:297. Shanks CA.

• A type I or alpha error is where one rejects the null hypothesis incorrectly. j Samples and populations A sample is a group taken from a population. • Sample size is determined by a. • The P value is the probability of the event occurring by chance if the null hypothesis is true. Therefore. • A type II or beta error is where one accepts the null hypothesis incorrectly. Examples of populations studied in anaesthesia include: 1.3 Principles of probability and i nference Samples and populations Inferential statistics -definition of probability -null hypothesis -P value -type I and type II error Key points Confidence intervals Sample size and power calculations Parametric and non-parametric tests Permutation tests Bayesian inference • A sample is a group taken from a population. . 3. or all patients with a specific condition. The population may be all human beings on the earth. but by its characteristics. the difference between groups) and sigma2 (variance). A population. (delta. A population may also consist of laboratory animals or cell types. 5. • The null hypothesis states that there i s no difference between the population variables in question. j • Data from samples are analysed to make inferences about the population. or just within a specific country. is not defined by geography. when applying the results of a trial to your practice. it remains important first to decide if the patients recruited in the trial (the trial 'sample') are similar to those that you wish to apply the results to (your clinical practice 'population'). 2. • Power is the likelihood of detecting a difference between groups if one exists. in the belief that the sample represents the response of the average patient in the population. All day stay surgical patients undergoing general anaesthesia Low-risk coronary artery bypass graft surgery patients Critically ill patients in ICU with septic shock Women undergoing caesarean section under spinal anaesthesia Skeletal muscle fibres (from quadriceps muscle biopsy) A clinical trial involves selecting a sample of patients. therefore. 4. • A Confidence interval i ndicates where the true population parameter probably lies.

I nferential statistics Inferential statistics is that branch of statistics where data are collected and analysed from a sample to make inferences about the larger population. so that one may answer questions or test hypotheses. such that it does not truly represent the population. In general. The specific implications may be found true (because of other circumstances) even though the general hypothesis may be false. but it is not deductively valid to accept a hypothesis if the testable implications are found to be true (known as the fallacy of affirming the consequent). to try to support or refute the hypothesis. Sampling procedures are therefore very important when selecting patients for study. The deductive philosophy of science attributes man with an inquiring mind that asks questions about himself and the environment. The purpose is to derive an estimate of one or more population parameters. Each question is refined to produce a specific hypothesis and logical implications of the hypothesis that can be specifically tested. Hypotheses may however be generally accepted based on weight of supporting evidence and lack of contrary evidence. then it may bias the sample. The procedure is strictly proscribed so that another individual. would make the same decision. or to randomly select (this is preferable) from a larger population. If patients are selected by other criteria. so that they ultimately represent the population. A scientific method is used to collect evidence.20 Statistical Methods for Anaesthesia and Intensive Care Clear description and consideration of inclusion and exclusion criteria are therefore required. using the same information. Sample data are estimates of population parameters. A hypothesis can be . The effects under study may be quantitative or qualitative. each requiring an appropriate hypothesis testing procedure. A rational decision is characterized by the use of a procedure which ensures that a probability of success is incorporated into the decision-making process. In logic. Two common methods are to sequentially select all patients until the required number is obtained. the larger the sample size. etc. Hypothesis tests are thus procedures for making rational decisions about the reality of observed effects. The reason for this is that it is deductively valid to reject a hypothesis if the testable implications are found to be false (a method of argument known as modus tollens). correlation coefficients.). Most decisions require that an individual select a single choice from a number of alternatives. but this involves more time and expense. the more representative it will be of the population. and can be summarized as various statistics (differences between means. contingency tables. The decision is usually made without knowing whether or not it is absolutely correct. either by observation or controlled experiment. it is preferable to refute a hypothesis rather than try to prove one.

mayfurel-wodhyptsecanobdfitvelyshownbtruefals.Iimpobletknwvryhigaboutenvrs.Princplesofbaityndferc21supotdbyhwingat ler mutayxclsivehpot sfale.Thspramoeclsyratdoclin resahw onmigtcsderfoxampltherobiyfdeath.Ihclasi proh.How itpsbleomakprdictns makedcions.Prbatisheoryfunctai hsuedartionlmesfdaingwthucerainwold.Thmegoft rmpbailtydensupo 'hilspcaorenti.nxprimthacnoberpatdnif .Inthesubjcivaproh.ifhxpermntwas edintcaly fienumbrotis.Unikogc.Waepthyos aremoliky.temprobailtyefsoadgrfbeli.Anexamplithrobalyfgetina6 desrol.bailtesrfohelativfrquncyoaevt.

If P is less than an arbitrarily chosen value. in this case that the drug does cause some effect.5%. This will be discussed later when we talk about power. or one more extreme. it is up to the researchers and readers to decide whether or not this is valid. assuming that H 0 is true. even if the result is real. The important finding is the likely size of the treatment effect (see Chapter 11). we are interested in results similar to or more extreme than that observed (with no indication of direction). for example that the drug increases heart rate. We can just be interested in one direction of effect. based on our experiment. Thus we are using the extremes at both ends or tails of the distribution. in this case. However. This is known as a two-tailed hypothesis and the two-tailed form of the significance test is used. we make an inference based on likelihood. (although strictly speaking we have not logically proved the H l or asserted that it is in fact true). a real effect is observed . the null hypothesis is rejected. A one-tailed test at an a of 0. then there is increased likelihood of a type I error: the more you look for a difference. Other values for a can be chosen.22 Statistical Methods for Anaesthesia and Intensive Care associated with a certain probability. Hl is that there is a difference in mean heart rate. We accept the H l but should remember that this is specifically for the experimental conditions of the trial and the sample tested. the a value is often set at 0. The two-tailed test is so named because when we specify an a of 0. Having rejected the H0. it is usual to accept the complementary alternative hypothesis (H 1 ). but may also be clinically unimportant or irrelevant. By convention. even by chance (see Chapter 11)! A type II error occurs when one accepts the H 0 incorrectly and the probability of this occurring is termed P. It is hoped that the results are generalizable to the population. A statistically significant result may or may not be real because it is possible to make type I and type II errors. could have occurred randomly by chance. the H0 is rejected. we still need to decide whether or not the result is important. the more likely you are of finding one. Some investigators conduct trials and report only the result of significance tests.05 which means that one accepts a 5% probability of making a type I error. which indicates the likelihood that the result obtained. Note that if multiple comparisons are made.05 would use just one end of the distribution and use different critical values to compare against the test statistic. known as a or the significance level. 2.05 in a two-tailed hypothesis. Two-tailed tests should usually be used unless there are clear reasons specified in advance as to why one-tailed tests are appropriate for that study 1 We must remember that. depending on circumstances. and a one-tailed test is used. each tail containing half the total a probability. In this case the complementary H0 would state that these is no increase in heart rate. P. However. In the example above. also known as the P value. A very small effect may be real and shown to be statistically significant. However the H 0 may be incorrectly rejected and this is known as a making a type I error. The drug effect could be either to increase or decrease heart rate.

Obviously the smaller the CI the more precisely we assume the sample estimate represents the true population value.Principles of probability and inference 23 and a P value is given indicating that the probability of this effect occurring by chance is very low. in a study comparing the dose requirements for thiopentone in pregnant and non-pregnant patients. It is obviously helpful to give some measure of this imprecision.3 For example.8) mg/kg. Any % CI can be calculated. either with P values or instead of them. CI can also be used for hypothesis testing. There is an increasing tendency to quote confidence intervals (CI). . In addition. Confidence intervals are often preferred when resenting results because they also provide this additional information. etc.6 (2. although 95% and 99% are the most common.3 and 2. then one can conclude that there is no significant difference between the two populations from which the samples were drawn at the significance % of the CI. If the 95% CI describing the difference between two mean values includes the point zero (i. the pregnant to non-pregnant relative median potency (95% CI) for hypnosis was 0. the t distribution is applicable (see Chapter 5) and the 95% CI for the sample mean in small samples is calculated as the mean of the sample ± (the appropriate t value corresponding to 95% x SEM). where SEM is the standard error of the mean.83 (0. we conclude that there is a significant difference in median potency between the two groups. median. It is commonly thought that the 95% CI for the population mean is the sample mean ± (1.96 x SEM).e. zero difference). However. slope of a regression line. Thus if we use a 95% CI. if the CI for the difference between two means contains zero.05. there is always some imprecision expected from the estimate. However this is only true for large sample sizes. this is similar to choosing an a value of 0.8 mg/kg. Confidence intervals When estimating any population parameter by taking measurements from a sample. proportion. For example. 4 the ED 50 (95% CI) for hypnosis in the pregnant group was 2. Because the 95% Cl does not contain 1. This is preferable as it gives the reader a better idea of clinical applicability. CIs can be calculated for many statistics such as mean. A 95% CI for any estimate gives a 95% probability that the true population parameter will be contained within that interval. CI can be used to indicate the precision of any estimate. 2. The lowest and highest values of the CI are also known as the lower and upper confidence limits. odds ratio.96). If the sample size is small. Thus we can be 95% certain that the true population ED50 is some value between 2.67-0.0.05.3-2. the use of only P values has been criticized because they do not give an indication of the magnitude of any observed differences and thus no indication of the clinical significance of the result. then obviously P > 0.

However. . these 95% CIs do not overlap the 95% CI for the large tumour group and so there is a difference in the ED50 of the large tumour group compared with the other two. For example.2). A larger sample size will decrease the sampling error. Sample size and power calculations The difference between a sample mean and population mean is the sampling error. 5 the dose-response curves show that the 95% CIs for the control and small brain tumour contain the ED50 for the other group (Figure 3.1 Theoretical plots of the mean and 95% confidence intervals (CIs) for troponin-1 levels (wg/1) in three patient groups after major surgery. it is not easy to determine graphically whether or not there is a statistical difference between means and a statistical test is used. This enables the investigator (or reader) to determine whether or not any difference shown is clinically significant. If we show three sample means and their respective CIs. there is a significant difference between the group means if the respective CIs do not overlap. they reveal more information. 5 CIs can indicate statistical significance but. However. by illustrating the accuracy of the sample estimates. The 95% CI of group C do not overlap those of groups A and B. the authors also used a statistical test of significance.24 Statistical Methods for Anaesthesia and Intensive Care Figure 3. Note however that there is overlap of the 95% CI for the ED 95 and it is not clear from the graph whether or not this represents a significant difference.05). if the CIs just overlap. There is thus no difference in ED 50 for these two groups. presenting the results in the original units of measurement and showing the magnitude of any effect. in a study comparing propofol requirements in normal patients with that in patients with small and large brain tumours. There is no difference if one CI includes the mean of the other sample.1). The 95% CI of group B overlaps the mean value of group A and so it is not statistically significantly different (at P < 0. In this example.05) It is possible to graphically illustrate the use of CI for hypothesis testing in a limited manner (Figure 3. and so the difference in means is statistically significant (P < 0. more importantly.

The investigator sets a threshold for both these errors. Fraction of patients (out of ten) who failed to respond to verbal command are shown as (•) for patients with large brain tumour.Principles of probability and inference 25 Figure 3. (X) for patients with small tumour and (. and between 0.2 Calculated dose-response curves (log dose scale) for loss of response to verbal command in patients with brain tumour and patients undergoing non-cranial surgery. in a paper comparing the duration of mivacurium neuromuscular block in normal and postpartum patients .8 because without these calculations sample size may be too large or too small. time and resources will be wasted. Performing power analysis and sample size estimation is an important aspect of experimental design. 7.05 and 0. If sample size is too large. often for minimal gain.20 for the type 11 or beta error. If sample size is too small. two important questions are: 1. Power is the likelihood of detecting a specified difference if it exists and it is equal to 1-beta For example. slightly offset for clarity. a = 0.05 for the type I or a error. often 0. It is considered unethical to conduct a trial with low power because it wastes time and resources while subjecting patients to risk.1.6 power analysis beta = 0.05) indicated that a sample size of 11 would be sufficient to detect a three-minute difference in clinical duration of neuromuscular block. the experiment will lack the precision to provide reliable answers to the questions being investigated. The simplest . The 95% confidence intervals for the ED 50 and ED 95 are also displayed. or accept the H 0 incorrectly (a type II error). How large a sample is needed to allow statistical judgments that are accurate and reliable 2. 5 ) When designing a trial.) for control patients (From Chan et al. How likely is a statistical test able to detect effects of a given size in a particular situation Earlier we mentioned that it was possible to reject the H0 incorrectly (a type I error).

is the difference that one wishes to detect. proportions. then the following approximate formula can be used: 9 delta.05 mmol/l. Traditionally. For example. but this implies that the study only has 80% power (1 . However both errors are important and a and beta values should be considered carefully. There are various formulae for sample size depending on the study design and whether one is looking for difference in means. a muscle relaxant may truly increase serum potassium by 0. i.2). Here the investigator is called upon to decide what is a clinically significant difference. Delta . It is generally important not to do this lightly. This is arbitrary but should be plausible and acceptable by most peers. 7.9 Nomograms8 and computer programs are also available. or other statistics. than beta? Is it more important to protect against a type I or type II error? It has been argued that we should be more concerned about type I error because rejecting the H 0 would mean that we are accepting an effect and may incorrectly decide to implement a new therapy. depending on the hypothesis being tested. As an example.05.2. if one is interested in calculating the sample size for a two-sided study comparing the means of two populations. typically 0.e. However we may decide that a clinically significant . but power is only one of the factors that determine sample size. Committing a type II error and concluding falsely that there is no effect will only delay the implementation of a new treatment (although this presupposes that some satisfactory alternative exists).0. many researchers choose beta = 0. Why should we accept a lower probability for a. the effect size. an 80% probability to detect a difference if one exists.26 Statistical Methods for Anaesthesia and Intensive Care way to increase power in a study is to increase the sample size. It is more difficult to detect a small difference than a large difference.

o the variance in the underlying populations.3 mmol/l. The variance can also be minimized by maximizing measurement precision. Having calculated a sample size.2. The sample size formula for a two-sided comparison of proportions is: 9 . The greatest concern is that if the variance is underestimated. comparing postoperative morphine requirements after ketamine induction compared with thiopentone induction. medium (0. on completion of the study at the given sample size. estimation of cardiac output using the thermodilution method has been shown to be more precise using measurements in triplicate and with iced injectate.Principles of probability and inference 27 increase in potassium is 0. but this is inconsequential because a rise of 0.05 and beta = 0. 10 the effect size chosen was a 40% decrease in 24-hour morphine consumption. The effect size can be related to the standard deviation (delta/sigma) and cat e orized as small (< 0.2-0. This was the expected effect size from a previous study in a different population. is the only variable that the investigator cannot choose. This effect was thought to be clinically relevant.12 Thus.3 mmol/l. the power of the study will be diminished and a statistically significant difference may not be found. one usually increases the number by a factor based on projected dropouts from the study. Sample size estimations for numerical data assume a normal distribution and so if the study data are skewed or non-parametric it is common to increase the sample size estimate by at least 10%. a sample size of 20 was eventually calculated at a = 0.5) or large (> 0. and so limit the sample size required. if available. In another example.2). An estimate for this can be obtained from pilot studies or other published data. It is useful to recalculate sample size for greater estimates of sigma because at times a surprisingly large sample size is required for a small change in sigma and the feasibility of the whole study may be in doubt.5). 11. We have not however committed a type II error because the H 0 was that there is no difference as great as 0. However this is not a recommended technique for calculating sample size when one does not have an estimate of o'. For example. Given the historical variance in morphine requirement derived from their acute pain database. Thus the sample size is set so that a rise of 0. In this case. we may well fail to detect the true increase in serum potassium. and allowing for a reasonable margin of error in the estimate of sigma. Note from the above formula that if one chooses delta equal to sigma these two would then cancel mathematically in the sample size formula. although the authors would not dispute that some others might consider 30% or 50% to be thresholds for clinical relevance. inclusion of these thermodilution methods in the study design can reduce the study cardiac output variance.05 mmol/1 is not important.3 mmol/1 is detectable.

Parametric and non-parametric tests In the previous discussion of statistical inference and hypothesis testing. A common criticism of non-parametric tests is that they are not as powerful as parametric tests. etc. 1s Non-parametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation). incidence of adverse events.). such that they were unlikely to have reliably determined a true treatment effect if it existed. p1 p2): smaller A means larger sample size Number of study endpoints (p l ): rare events require larger sample size There may also be several outcomes of interest and the sample size calculations for each of these outcomes may be different. respectively. q1 and q2 =1-p l and 1-p2. Although a sample size is calculated. a posteriori.14 This has been a common error in anaesthesia studies (see Chapter 11). Parametric tests are based on estimates of parameters. Non-parametric tests should generally be used to analyse ordinal and categorical data (rating of patient satisfaction. delta = the effect size. Good trial design dictates that the sample size should be based on the primary endpoint. 3. continuous data). This can be useful when the H 0 is accepted because it indicates how likely a given difference could have been detected using the actual standard deviation from the final study samples. a priori.28 Statistical Methods for Anaesthesia and Intensive Care and p2 = the expected proportions in the two groups.). 4. Non-parametric tests were developed for situations when the researcher knows nothing about the parameters of the variable of interest in the population. though it may be further increased if important secondary endpoints are also being studied. after the study has been completed. This probability is based on the sampling distribution of a test statistic. It is important to realize that there are assumptions made when using these test statistics (see also Chapters 5 and 6). it is also useful to use the same formula to calculate the power of the study. The parametric tests discussed later in this book have many inherent assumptions and thus should only be used when these are met. the tests are based on sampling distributions derived from p or sigma Parametric tests are based on the actual magnitude of values (quantitative. 7-913. the sample size for a difference in proportions depends on four factors: 1. before the study begins. Thus. Previous authors have noted that many published studies had failed to study an adequate number of patients. if the conditions of the . renal blood flow. and can only be used for data on a numerical scale (cardiac output. rather than the original estimate. etc. which is p1-p2. This means that they are not as likely to detect a significant difference as parametric tests. In the case of normally distributed data. Value chosen for a: a smaller a means a larger sample size Value chosen for beta: smaller beta (higher power) means larger sample size A (effect size. it was necessary to determine the probability of any observed difference. 2.

the shape of the sampling distribution approaches normal shape. 16 Proponents of the non-parametric tests argue that the power of these tests is very close to the parametric tests even when all conditions for the parametric test are satisfied. This is a result of an extremely important principle called the central limit theorem. 17. n > 100). as the sample size increases. In almost all clinical trials.Principles of probability and inference 29 parametric test are fulfilled. In reality. particularly if the sample data represent a population variable on a continuous scale. When the data set is large (say. One important assumption of statistical inference is that the samples are drawn randomly from the population. When the parametric assumptions are not met. It could be argued then that the results of the trial cannot be generalized to the population at large and it is inappropriate to determine the probability of any differences based on sampling theory from the population. 15 Permutation tests 17. For n = 30. this is almost never the case. apart from using the common parametric and non-parametric statistics. Our samples are mostly not truly random samples from the population but are instead comprised only of subjects to which we have access and are then able to recruit.18 There has been renewed interest in a third approach for testing significance. Ordinal data can be analysed with parametric tests in large samples. non-parametric tests actually become more powerful. The central limit theorem states that. determines the likelihood of each of them and . we actually study a non-random sample of the population that undergoes random allocation to treatments. even if the distribution of the variable in question is not normal.18 An alternative permutation test works out all the possible outcomes given your sample size. and is independent of the type I error. n < 10). The power for the common non-parametric tests is often 95% of the equivalent parametric test. or is not measured very well. Power efficiency is the increase in sample size necessary to make the test as powerful at an alpha level and given sample size. However. When the parametric assumptions are not met. non-parametric tests often become more powerfu1. Asymptotic relative efficiency is the same concept for large sample sizes. the shape of that distribution is 'almost' normal. it often makes little sense to use non-parametric statistics because the sample means will follow the normal distribution even if the respective variable is not normally distributed in the population. Thus the subjects who enter a trial conducted in a hospital will depend on many geographical and social as well as medical factors influencing admissions to a particular hospital. the tests of significance of many of the non-parametric statistics described here are based on asymptotic (large sample) theory and meaningful tests often cannot be performed if the sample sizes become too small (say. 15 Proponents of parametric tests agree that non-parametric methods are most appropriate when the sample sizes are small.

consider tossing a coin six times to determine whether or not the coin was biased.05 would lead us to conclude that the coin was biased. As a simple example. This is one explanation why clinicians may reach different conclusions from the same study data. the probability of a result this extreme is P(0 heads) + P(6 heads) = 2/64. For example. Bayesian inference Use of P value to determine whether an observed effect is statistically significant has its critics. It combines the prior probability and the study P value. is more likely to be true than a similar treatment effect observed that had not been previously reported. that is less than an arbitrary P value of 0.22 This is because a P value is a mathematical statement of probability. We can thus work out the probability of getting a result as extreme as 1 head by P(0 heads) + P(1 head) + P(5 heads) + P(6 heads) = 14/64! This being greater than an arbitrary P value of 0.05 would lead us to conclude that the coin was not biased. . Thus the permutation tests provide a result exactly applicable to the samples under study and make no assumptions about the distribution of the underlying (and remaining) population. Note that we have used a two-tailed hypothesis in these examples. Clinicians do not consider a trial result in isolation. The only permutation test in common use is Fisher's exact test (Chapter 6). This is because the permutations are very time intensive and only recently has the advent of personal computers made these tests more feasible. We are however usually interested in generalizing our specific sample results to the population and it appears that permutation tests do not permit this. 0-22 It has been developed from Bayes' theorem*. and conclusions based solely on it do not take into consideration prior knowledge. A Bayesian approach incorporates prior knowledge in its conclusion. If the outcome of interest is the number of heads. a formula used to calculate the probability of an outcome. an observed effect that is not statistically significant. it both ignores how large is the treatment effect. However if we got 0 heads. if other researchers replicate the trial with different samples and achieve similar conclusions. Thus.30 Statistical Methods for Anaesthesia and Intensive Care then calculates how likely it is to have achieved the given result or one more extreme. 19. if a new alpha 2 agonist is tested to see if it reduces the rate of myocardial ischaemia. given a positive test result (see Chapter 8). The resultant P value is Thomas Bayes (1763): `An essay towards solving a problem in the doctrine of chances'. then the weight of evidence would lend us to support (or reject) the overall hypothesis. However. to calculate a posterior probability. it would be a plausible hypothesis because of what is known about other alpha2agonists. the probabilities range from: P(0 heads) to P(6 heads). but is consistent with previous study results. They generally consider what is already known and judge whether the new trial information modifies their belief and practice.

Poon WS. Bazaral MG. Ma ML et al. 10. Julious SA. 311:442-448. Either. 14. Bayesian inference has appeal because it considers the totality of knowledge. Analyzing data from ordered categories. Br Med J 1993. Statistics with Confidence . Altman DG. III How large a sample. Confidence intervals rather than P values: estimation rather than hypothesis testing. there are literally two opposing camps of statisticians: frequentists and Bayesianists! One of the criticisms of Bayesian inference is that the methods used to determine prior probability are ill-defined. 2nd ed.04). Frieman JA. Anesthesiology 1997. 299:690-694. Br Med J 1994. Goodman NW. 85:1294-1298. Br Med J 1986. Chan MTV. if anything. 12. Anesth Analg 1997. 9. 3. Moses LE. Stetz CW. 90:1571-1576. Propofol requirement is decreased in patients with large supratentorial brain tumor. Florey C du V Sample size for beginners. Hosseini H. Statistics in practice. One and two sided tests of significance. if the same study also found that the new alpha2 agonist reduced the rate of vomiting (P = 0. the type II error and sample size in the design and interpretation of the randomized controlled trial. 6. Estimating sample sizes for binary. Nonparametric Statistics for the Behavioral Sciences. Chan MTV et al. Decreased thiopental requirements in early pregnancy. . Postoperative analgesic requirement after cesarean section: a comparison of anesthetic induction with ketamine or thiopental. 8. Altman DG. British Medical Journal. Novoa R. Chan MTV. Gin T. Miller RG. 1989. 13. 4.Confidence Intervals and Statistical Guidelines. 2. this would be more likely to be a chance finding. Derrick J. other alpha2agonists increase the risk of nausea and vomiting. Chalmers TC. Reliability of the thermodilution method in determination of cardiac output in clinical practice.Principles of probability and inference 31 0. it was a chance finding (1 in 25) or it was a true effect of the new drug. Mainland P. 7.06). Khaw KS. 15. McGraw-Hill. 311:1145-1148. et al. Gardner MJ. ordered categorical. Gardner MJ. Hughes AO. Errors in thermodilution cardiac output measurements caused by rapid pulmonary artery temperature decreases after cardiopulmonary bypass. Anesthesiology 1992. London. Altman DG. and continuous outcomes in two group comparisons. Petre L. Br Med J 1980. Castellan NJ Jr. Smith H et al. Kelly GE et al.04. Campbell MJ. Gin T. 306:1181-1184. 86:73-78. 86:82-85. 77:31-37. 309:248. Siegal S. 19-23 In fact. Gin T. N Engl J Med 1984. Statistics and ethics in medical research. Br J Anaesth 1992. 126:1001-1004. Altman DG. 281:1336-1338. Ngan Kee WD. Anesthesiology 1999. References 1. Altman DG. Alternatively. Statistical awareness of research workers in British anaesthesia. The second option makes much more sense because of prior knowledge (this would still be the case if the P value was 0. 11. N Engl J Med 1978. Br Med J 1995. The importance of beta. Am Rev Respir Dis 1982. Anesth Analg 1998. This is because prior knowledge suggests that. Bland JM. New York 1988. Emerson JD. There are two possible explanations for this result. 292:746-750. 5. 16. Postpartum patients have slightly prolonged neuromuscular block following mivacurium. 68:321-324.

32

Statistical Methods for Anaesthesia and Intensive Care

17. Ludbrook J. Advantages of permutation (randomization) tests in clinical and experimental pharmacology and physiology. Clin Exp Pharmacol Physiol 1994; 21:673-686. 18. Ludbrook J, Dudley H. Issues in biomedical statistics: statistical inference. Aust NZ J Surg 1994; 64:630-636. 19. Browner WS, Newman TB. Are all significant p values created equal? The analogy between diagnostic tests and clinical research. JAMA 1987; 257:2459-2463. 20. Brophy JM, Joseph L. Bayesian interim statistical analysis of randomised trials. Lancet 1997; 349:1166-1168. 21. Goodman SN. Towards evidence based medical statistics: the P value fallacy. Ann Intern Med 1999;130:995-1004. 22. Goodman SN. Towards evidence based medical statistics: the Bayes factor. Ann Intern Med 1999;130:1005-1013. 23. Davidoff F. Standing statistics right side up. Ann Intern Med 1999; 130:1019-1021.

4

Research design

Bias and confounding -randomization and stratification Types of research design -observation vs. experimentation -case reports and case series -case-control study -cohort study -association vs. causation -randomized controlled trial -self-controlled and crossover trials Randomization techniques -block randomization -stratification -minimization Blinding Sequential analysis I nterim analysis Data accuracy and data checking Missing data I ntention to treat

Key points . Bias is a systematic deviation from the truth. • Randomization and blinding reduce bias. • Confounding occurs when another factor also affects the outcome of interest. • Observational studies may be retrospective, cross-sectional or prospective. • The gold standard study is a double-blind randomized controlled trial (RCT). • Sequential analysis (interim analysis) allows the early stopping of a trial as soon as a significant difference is identified. • Analysis of patients in an RCT should be by intention to treat'.

In the past, large dramatic advances in medicine (e.g. discovery of ether anaesthesia) did not require a clinical trial to demonstrate benefit. Most current advances have small-to-moderate benefits, and a reliable method of assessment is required to demonstrate a true effect.

**Bias and confounding
**

In a research study, an observed difference between groups may be a result of treatment effect (a true difference), random variation (chance), or a deficiency in the research design which enabled systematic differences to exist in either the group characteristics, measurement, data collection or analysis.' These deficiencies lead to bias, a systematic deviation from the truth. There are many potential sources of bias in medical research. Examples include: 1. Selection bias - where group allocation leads to a spurious improved outcome because one group is healthier or at lower risk than another 2. Detection bias - where measurements or observations in one group are not as vigilantly sought as in the other 3. Observer bias - where the person responsible for data collection is able to use their judgment as to whether an event occurred or not, or determine its extent

36

Statistical Methods for Anaesthesia and Intensive Care

surgical or anaesthetic techniques, or other aspects of the patients' perioperative care. Ideally such reports should only be used to generate hypotheses that should then be tested with more advanced research designs.

**Case-control study
**

A case-control study is an observational study that begins with a definition of the outcome of interest and then looks backward in time to identify possible exposures, or risk factors, associated with that outcome (Figure 4.1). 3 Patients who experienced this outcome are defined as 'cases'; patients who did not are defined as 'controls'. This study design is particularly useful for uncommon events, as cases can be collected over a long period of time (retrospective and prospective) or from specialist units which have a higher proportion of patients with these outcomes. The case-control study has been under-utilized in anaesthesia and intensive care research. It should become more popular as large departmental and institutional databases are established, thereby allowing exploratory studies to be undertaken. In most case-control studies the control patients are matched to the cases on some criteria, usually age and gender. The aim of this matching process is to equalize some baseline characteristics so that confounding is reduced. This allows a more precise estimate of the effect of various exposures of interest. It is important to understand that matched

Figure 4.1 Types of observational studies: (a) case-control study; (b) cohort study; (c) cross-sectional study

4 investigated possible factors that were associated with infection following cardiac surgery. the aim is to look back in time at specific exposures that may be related to the outcome of interest. Once the cases and controls have been selected.6. the risk ratio will be less than one.4) in postoperative emesis compared with women. 7 If this interval does not include the value of one. For example. A specific group of patients are identified (a 'cohort'). then men have a 40% reduction (1. 5 But because the outcome of interest may occur infrequently. Because cohort studies are usually performed prospectively. this design may require a large number of patients to be observed over a long period of time in order to collect enough outcome events. In order to minimize bias. because they will have equal values for cases and controls! Another way of reducing the effects of confounding is to use multivariate statistical adjustment (see Chapter 8). The exposures may include patient characteristics (such as severity of illness. urgent surgery. The risk ratio can be expressed with 95% confidence intervals (CI). A cohort study is therefore relatively inefficient. In prospective cohort studies this is described by the risk ratio (also known as relative risk). .0. the risk ratio will be greater than one. age group* or smoking status). It is common to increase the sample size of a case-control study by using a higher proportion of controls than cases.0 . and if there is a reduced risk. Hence 1:2.1). if there is an increased risk. If exposure is not associated with the outcome. and type of surgery (or surgeon!). pre-existing disease. the accuracy of the data can be improved and so results are generally accepted as being more reliable than in retrospective case-control studies. the association between *Only if age was not used to match cases and controls. and these can be matched to one or several other control groups for comparison (Figure 4. the risk ratio is equal to one. Over a 12-month period they identified postoperative infection in 89 patients ('cases') and these were matched to 89 controls. prolonged surgery and use of blood transfusion. They then retrospectively identified the following perioperative characteristics which were significantly associated with infection: patient age > 65 years.6 = 0. 1:3 or 1:4 matching is commonly used. if the risk ratio for smokers acquiring a postoperative wound infection is 10.Research design 37 characteristics cannot then be compared. or take a long time to develop. the definition of each of these exposures should be defined before the data are collected and efforts used to acquire them should be equivalent for cases and controls. drug administration. 6 then smokers have a 10-fold increased risk of wound infection compared to non-smokers. Observational studies can be used to estimate the risk of an outcome in patients who are exposed to a risk factor versus those not exposed. Cohort study A cohort study observes a group of patients forward in time in order to record their eventual outcome. 3 Rebollo et a1. If the risk ratio for men reporting postoperative emesis is 0.

34) or calcium antagonists (odds ratio 0. and matched them to 10 247 treated with opiates. 42 cases (patients with persistent low systemic vascular resistance) were identified in a 12-month period and these were matched for age and sex to 84 controls. 95% Cl: 1. the value for the denominator is unreliable and so the odds ratio is used as an estimate of risk exposure and outcome is significant (P < 0.11-1. Looking back at their preoperative medications ('exposure').3 (or 30% greater risk). Because their study design enabled them to include very large numbers of patients. .53-3. 11 cases and 19 controls had been given ACE inhibitors. 7 The statistical methods used to analyse casecontrol and comparative cohort studies are presented in more detail in Chapter 6. outcome 'no' = controls).05).13).21-1. they were able to give a precise estimate of risk (i. This risk was increased in patients treated longer than 5 days. Odds ratios can also be expressed with 95% Cl. They compared 10 272 patients who received ketorolac.21 and 0.33. Because accurate information concerning total numbers is unavailable in a retrospective case-control study (because sample size is set by the researcher).38 Statistical Methods for Anaesthesia and Intensive Care Figure 4.39.2 In prospective cohort studies the risk ratio is equal to the risk of an outcome when exposed compared to the risk when not exposed.52. incidence rate and risk cannot be accurately determined.there was no significant association for either ACE inhibitors (odds ratio 1. ketorolac. For casecontrol studies (outcome 'yes' = cases. These were 'adjusted' using a multivariate statistical technique in order to balance for possible confounding . Univariate ('unadjusted') odds ratios were 1.9 who investigated the adverse effects of the non-steroidal drug. 95% Cl: 0. Overall. narrow 95% . investigating the risk of gastrointestinal and operative site bleeding. An example of a cohort study was one designed by Strom et al. 8 was one designed to investigate the potential role of calcium antagonists and ACE inhibitors in causing persistent vasodilatation after cardiac surgery.e. The risk (odds) ratio of gastrointestinal bleeding in those exposed to ketorolac was 1. 95% Cl: 0. and 22 cases and 62 controls had been given calcium antagonists. An example of a case-control study by Myles et al. on postoperative outcome. and the odds ratio is used as the estimate of risk (Figure 4. and in those over 70 years of age. respectively.49.2).

Bias and confounding are difficult to avoid and should always be considered as alternative explanations to an observed relationship between drug exposure (or intervention) and outcome. or for longer periods)? 4. Randomized controlled trial The gold standard experimental research design is the prospective randomized controlled trial. garbage out). Such studies form the basis of much epidemiological research ('study of the health of populations') and their value lies in their efficiency. because patients were not randomized to each treatment group (ketorolac or opiates). 3.11 collective weight of evidence from a number of potential sources. A controlled trial therefore allows meaningful conclusions concerning the relative benefits (or otherwise) of the intervention of interest. There may be one or more control groups. it does not prove that exposure caused the outcome: 'association does not i mply causation'. . Here patients are allocated to a treatment group and their outcome is compared with a similar group allocated to an alternative treatment. causation Because observational studies do not require a specific intervention. Nevertheless. Association vs. and many departments and institutions now manage extensive patient databases.10. Is the evidence consistent? 2.Research design 39 CI). All the available evidence should be processed: 1. Is there biological plausibility? It is the mounting body of supportive evidence that finally supports causation. consisting of patients who receive a current standard treatment or placebo. In order to demonstrate causation requires the 5. A randomized controlled trial is sometimes referred to as a parallel groups trial. in that each group is exposed to treatment and then followed concurrently forwards in time to measure the relative effects of each treatment (Figure 4. Is there a dose-response relationship (greater risk if exposed to higher doses. This reference group is called the control group and its role is to represent an equivalent patient group who have not received the intervention of interest. Is there a demonstrated temporal sequence between drug exposure and adverse outcome? This is particularly relevant for case-control studies and case reports. Even if a relationship between exposure and outcome is beyond doubt. in that 100s or 1000s of patients can be analysed. But it must be recognized that the results of observational studies depend heavily on the accuracy of the original data set ('gigo': garbage in.3). concern over potential bias and confounding remain. it is relatively easy to obtain information on large numbers of patients.

or granisetron 2 mg). The investigators clearly demonstrated that ondansetron was an effective anti-emetic in that it reduced emetic symptoms by approximately 50%. Suen et a1. Another way of reducing variance is to allow each patient to act as their own control and so all patient characteristics affecting the outcome . 13 randomized 100 women undergoing major gynaecolgical surgery into two groups (domperidone 20 mg. (b) crossover trial Trials which compare an active treatment to placebo can demonstrate whether the active treatment has some significant effect and/or to document the side-effects of that active treatment (relative to placebo). Stratification and blocking can be used to equalize potential confounding variables (see later). This variance can be reduced by restricting trial entry (excluding patients who have extreme values or who may react differently to the intervention) or by improving measurement precision. clinically useful information. and measured the incidence of nausea and vomiting. For example. Self-controlled and crossover trials It is difficult to detect a significant difference between groups in trials when the observation of interest is subject to a lot of variation ('background noise').3 Comparing two treatments (A and B) with a standard parallel trial design or a crossover trial design: (a) parallel groups trial. then the aim of the trial is to demonstrate that the new treatment has a significant advantage over current treatment. They chose domperidone as the comparator because it was a commonly used anti-emetic. If the control group is to receive another active treatment (usually current standard treatment).40 Statistical Methods for Anaesthesia and Intensive Care Figure 4. Patients were followed up for 24 hours. They clearly demonstrated that granisetron was more effective in their patient population. This often has more clinical relevance. This provides additional. 12 enrolled 204 women undergoing laparoscopic gynaecological surgery and randomly allocated each to receive ondansetron or placebo. Fujii et a1.

18 The results are not intended to be generalized to other patients. say. or who warrants a new or experimental treatment).17 When interest is focused on one patient and their response to one or more treatments.4). the treatment is then given and after an appropriate period of time measurements are repeated. a crossover trial design. This is usually performed in the setting of clinical practice when treating a specific patient ( who may be resistant to standard treatment. 14. baseline measurements are taken.Research design 41 of interest are equalized. for each treatment). 16. this is known as a crossover trial. 14. crossover trials have been under-utilized in anaesthesia research. then any change in the observation can be attributed to the effect of the treatment. Each treatment period can be separated by a 'washout period'. this is known as an ' n-of-1 trial'. When two or more interventions are to be compared in patients who act as their own control. measuring the effect.15 Patient dropouts also have a potent adverse impact on study power (because each patient contributes data at all periods. the crossover point should be blinded to reduce bias. or before and after study. Crossover trials are most useful when assessing treatment for a stable disease. This trial design may be useful when optimizing. 14. crossover trials are a very efficient method to compare interventions. they need to be exposed to each of the treatments. and because each patient requires at least two treatments. Assuming that the patients were otherwise in a stable condition. . Here. or where the intervention being tested has a short period of onset and offset and the effect can be measured quickly. trial duration usually needs to be extended. when well designed and correctly analysed. readers must be aware of potential problems in using this design. In general. we would recommend two excellent reviews. Ideally. enabling the effect of the first treatment to dissipate before testing the second treatment. followed by another set of measurements (see Figure 4. If possible. It is a very efficient design: the sample size required to show a difference between groups can often be substantially reduced. If readers are considering employing. and then giving them the alternative treatment. an n-of-1 trial should be performed under blinded conditions. The appropriate methods to analyse this 'paired' data are presented in Chapters 5 and 6.1 15 Some examples can be found in the anaesthetic and intensive care literature. This requires they be crossed over from one treatment group to the next. Despite these concerns.15 This is best achieved by randomizing patients to each treatment. This is a very powerful research design as it avoids the confounding effect of patient characteristics. Nevertheless.the second treatment evaluation will be confounded by this) and sequence effect (where the order of treatments also has an effect on outcome). period effect (where there is a tendency for the group either to improve or deteriorate over time . thereby markedly reducing variance and maximizing the likelihood of detecting a treatment effect. This is a suitable design to test the effect of a new drug or treatment on a group of patients and is known as a selfcontrolled. There are statistical methods available to investigate these effects.15 These include avoiding a carry-over effect (where the effects of the first treatment are still operating when the next treatment is evaluated).

Common stratifying variables include gender. A Latin square design is a more complex method to control for two or more confounding variables. Berry et a1. ensuring that the confounding variables are equalized between groups. 2. say to optimize a sedative or analgesic regimen for a problematic patient.5. Obviously the trial result will only apply to that patient. This is usually guided by referring to a table of random numbers or a computer-generated list. The patients were then randomly allocated into groups. 22 first stratified patients according to their preoperative ventricular function in order to balance out an important confounding factor known to have an effect on postoperative recovery. so that there is a fixed (known) chance other than 0. For example. This allows a clearer interpretation of the effect of an intervention on eventual outcome.19-21 The identified confounding factors act as criteria for separate randomization schedules. patient age. which allocates patients in such a way that each has an equal chance* of being allocated to any particular group and that process is not affected by previous allocations. Randomization techniques To minimize bias. so that ultimately equal numbers of patients with that particular confounder will be allocated to each group. It may also be used in intensive care. each institution. Here. for multi-centred research. This method resulted in near-equal numbers of patients in both groups having poor ventricular function (Figure 4. the number in each group will then be equal. This is a valid method to increase the sample size of a particular group if its variance is to be more precisely estimated. in a study investigating the potential benefits of lung CPAP during cardiopulmonary bypass in patients undergoing cardiac surgery. . This commonly used method has the value of simplicity. smoking status (these depend on whether it is considered they may have an effect on a particular outcome) and.42 Statistical Methods for Anaesthesia and Intensive Care an anaesthetic technique for a patient requiring repeated surgical procedures or a course of electroconvulsive therapy. but may result in unequal numbers of patients allocated to each group or unequal distribution of potential confounding factors (particularly in smaller studies). Stratification is a very useful method to minimize confounding. but this is a useful method of objectively measuring individual response to changes in treatment. the allocation of patients to each treatment group should be randomized. and each block of patients is separately randomized into groups. the presence of a confounding variable divides patients into separate blocks (with and without the confounder). Block randomization is a method used to keep the number of patients in each group approximately the same (usually in blocks with random sizes of 4.4). The commonest method is simple randomization. risk strata. As each block of patients is completed. Here the levels of each confounding *In some trials there may intentionally be an uneven allocation to each treatment group. 6 or 8).

For example. and investigator or person responsible for the analysis of results (sometimes referred to as triple-blind) can dramatically reduce bias. a selection bias may occur whereby 'sicker' patients are placed into a control group. However. say. data cleaning or analyses. measurement. . because group allocation is no longer randomly determined. observer (double-blind). Knowledge of group allocation should be kept secure (blind) until after the patient is enrolled in a trial. 1:4). into blocks (according to left ventricular function) and then separately randomized in order to equalize the numbers of patients in each group with poor ventricular function. patients are first stratified. to increase the chance the next patient will be allocated to the desired group. This reduces confounding variable make up the rows and columns of a square and patients are randomized to each treatment cell. opaque envelopes. the person responsible for identification and recording of exposure should be blinded to group identity. in a comparative cohort study. 22 Here.Research design 43 Figure 4. or divided. minimization may expose a trial to bias. Blinding Blinding of the patient (single-blind). recordings. Minimization is a particularly useful method of group allocation when there is a wish to equalize groups for several confounding variables.4 An example of stratified randomization adapted from Berry et al.23 Minimization is another method of equalizing baseline characteristics. In a case-control study. A solution to this can be achieved by retaining random allocation. It is otherwise tempting for the subject or researcher to consciously or unconsciously distort observations. but modifying the ratio of random allocation (from 1:1 to. The commonest method is to use sealed. Here the distribution of relevant patient (or other) factors between the groups is considered when the next patient is to be enrolled and group allocation is aimed at minimizing any imbalance. In clinical trials. Prior knowledge may affect the decision to recruit a particular patient and so distort the eventual generalizability of the trial results. 4 Similarly. the observer should be blinded to group identity when identifying and recording outcome events. 24 Minimization also has advantages in situations where randomization is unacceptable because of ethical concerns (see Chapter 12).

hoping that the larger numbers will eventually result in statistical significance.44 Statistical Methods for Anaesthesia and Intensive Care a double-blind design should be used whenever possible. falsely rejecting the null hypothesis and concluding that a difference exists between groups). it may be found that many patients were denied life-saving treatment. because of the nature of the intervention.5). There are several other ways of developing and using boundary limits . Sequential analysis is also a good method to investigate potential treatments for serious life-threatening conditions. in effect. '6 This allows a clinical trial to be stopped as soon as a significant difference between groups is identified. a sequential line is plotted on a graph with preset boundary limits (Figure 4. If. Efforts made to maximize blinding in trial design are repaid by improved scientific credibility and enhanced impact on clinical practice.e. If a patient cannot be blinded to their treatment group. the trial is stopped and a conclusion of no difference is made.01). This is a useful trial design to investigate rare conditions. The more 'looks' at the data. cost-efficient approach used to detect a significant difference between groups as soon as possible. 23 Some investigators do not even report that they have done this and so the reader remains unaware of this potent source of bias. Boundary limits can be calculated for different significance levels (usually P < 0. or would need to be multi-centred (and this requires much greater effort to establish and run). then outcome events should be objectively and clearly predetermined in order to reduce detection and reporting bias. as the traditional randomized controlled trial could take a very long time to recruit sufficient numbers of patients. it is impossible to blind the observer or investigator.25-30 Sequential analysis is a collection of valid statistical methods used to repeatedly compare the outcome of two groups while a trial is in progress. a conclusion of difference is made and the trial stopped (so. Sequential analysis If the results of a clinical trial are analysed and it is found that there is no significant difference between groups. 26-28 In these cases. by the time a traditional trial is completed. If either of the two limits is broached. If the boundary limits are not broached and the plotted line continues on to cross the right-hand boundary (the preset sample size limit). This is grossly incorrect. sequential analysis is a valid. As the outcome of each patient (or pair of patients) is established.05 or P < 0. then a separate person who remains blinded to group identity should be responsible for identifying eventual outcome events. as once again. there is a statistical comparison made each time a preference is determined). the greater the chance of a type I error (i. then some investigators continue the trial in order to recruit more patients. using binomial probability (see Chapter 2) or a paired non-parametric test (see Chapters 5 and 6). Sequential analysis is also useful if there are ethical concerns about added risk of a new treatment.

A preference for A leads the line upwards and to the right. a conclusion of significant difference is made in sequential analysis. The number and timing of these comparisons should be determined before the trial is commenced. by a Data and Safety Monitoring Committee. Some examples can be found in the anaesthetic literature. two to four comparisons are made (usually after a certain proportion of patients are enrolled). using a P value of 0. multi-centred trials. But.27 This . A significant difference is inferred if one of these comparisons results in a P value that is smaller than a pre-specified value and the trial is generally stopped. independent of the investigators.05 as the stopping rule. each pre-specified type I error can be described as a 'stopping rule'. This approach has become the most common method to use when there is a requirement to stop a trial early. When a boundary limit is crossed.5 A sequential design comparing two treatments (A and B). Interim analysis is universally employed in large. these include investigation of anaesthetic drug thrombophlebitis. In many cases the interim analyses are performed blind. these are preplanned statistical comparisons between groups (usually) on a restricted number of specified occasions during the trial. College and Board examiners persist in asking about it!).Research design 45 Figure 4. 31 treatment of postoperative nausea and vomiting. once sufficient evidence of significant difference is obtained. Thus. 3 and use of low molecular weight heparin to prevent deep venous thrombosis after hip surgery 33 The above method of sequential analysis is seldom used in current medical research (unfortunately. Most commonly. I nterim analysis Interim analysis is also a method of repeatedly comparing groups while a trial is still in progress. in contrast to sequential analysis. 26. A more appropriate modification is interim analysis. without jeopardizing statistical validity. a preference for B leads the line downwards and to the right.

so that all patients who were enrolled and randomly allocated to treatment are included in the analysis. Some investigators (and clinicians) only analyse those patients who received the study treatment (per protocol analysis).2 Effect of how groups are analysed if four patients did not receive their allocated (epidural) treatment and were treated with patient-controlled anaesthesia (PCA) (and three of these had major complications). 37 found no significant effect of regional anaesthesia in peripheral vascular surgery on intention to treat analysis. 6 and a theoretical study demonstrates a reduction in major complications on per protocol analysis. The resultant P value of 0.2). Intention to treat analysis Epidural group (n = 20) Complications 6(30%) PCA group (n = 20) 10(50%) P value 0. as it can be argued that the sideeffects of the actual treatment received is clinically relevant in that circumstance. A per protocol analysis is sometimes used appropriately when analysing adverse events in drug trials.047 B. unintentionally receive the comparator treatment. The recommended approach is to use intention to treat analysis. Per protocol analysis may then over-estimate the true benefit and under-estimate adverse effects.we consider a treatment and want to know what is most likely going to happen (thus accommodating for treatment failure. But a per protocol analysis can be misleading. displacement or inadequate management. additional treatments. They may also refuse the allocated treatment. if 20% of epidurals are ineffective. The most valid method is to use intention to treat analysis. as they are only interested in the effect of the actual treatment (not what happened to those who did not receive it).48 Statistical Methods for Anaesthesia and Intensive Care be lost to follow-up. Per protocol analysis Epidural group (n = 16) Complicati ons 3(19%) PCA group (n = 24) 13(54%) P value 0. noncompliance. and so on). This approach seems intuitively obvious to many clinicians. where Bode et al. but that those who had a failed regional technique had a higher mortality than those who did not. A per protocol analysis would consider these patients in the PCA group.33 suggests that the observed difference could be explained by chance A. Thus. Table 4. particularly if the allocated treatment has side-effects or is ineffective in some patients. This gives a more reliable estimate of true effect in routine practice because it replicates what we actually do . it may be explained by an actual shift in group identity (Table 4. A real example can be found in the recent anaesthetic literature. or receive other treatments which may affect the outcome of interest.33 . say because of failed insertion.

Loius TA. Med J Aust 1993. Guyatt GH et al. Feinstein AR. The n-of-1 randomized controlled trial: clinical usefulness. 334:1209-1215. Lancet 1990. Anesth Analg 1998. Gin T. Suen TKL. Anesth Analg 1997. Comparison of patient-controlled epidural analgesia with patient-controlled intravenous analgesia using pethidine or fentanyl. Lenhardt R. Haynes RB. Sackett DL. Statistics with Confidence . Randomisation and baseline comparisons in clinical trials. 66:556-562. and standardised ratios and rates. Perioperative normothermia to reduce the incidence of surgical wound infection and shorten hospitalization. Tavel M. Calculating confidence intervals for relative risks. Weeks AM et al. Does ketorolac cause renal failure . 10. 15. The environment and disease: association or causation? Proc Roy Soc Med 1965:295-300. Hill AB. 1989:50-63. 12.Research design 49 References 1. 7. Woods JR. Olenikov I. London: British Medical journal. 34:125-136. 335:149-153. Anaesth Intensive Care 1994. Parenteral ketorolac and risk of gastrointestinal and operative site bleeding. Kurz A. . N Engl J Med 1984. J Chron Dis 1979. Crossover and self-controlled designs in clinical research.158:675-677.how do we assess the evidence? Br J Anaesth 1998. Llorca J et al. calcium antagonists and low systemic vascular resistance following cardiopulmonary bypass. Sessler DI. Ngan Kee WD. Horwitz RI. Altman DG. A postmarketing surveillance study. 310:24-31. Saitoh Y. Am J Med 1979. 275:376-382.112:908-913. 16. Power 1. Rothman KJ. Williams JG. Fujii Y. 18. Lavori PW. Lam KK. 8. Chen PP. Rebollo MH. Gin T. Kinman JL et al. 84:976-981. 80:420-421 [editorial]. In: Gardner MJ. N Engl J Med 1996. Anaesth Intensive Care 1997. Bailar JC et al. Altman DG. Ann Intern Med 1990.110:560-566. 20. Altman DG. Dore CJ. 19. Tanaka H et al. 11. 25:126-132. 2. 9. 3. Guyatt GH. Prophylactic oral antiemetics for preventing postoperative nausea and vomiting: granisetron versus domperidone. Bujor MA et al. 2nd ed. 87:1404-1407. 1991:283-302. 17. N Engl J Med 1983. Bernal JM. J Thorac Cardiovasc Surg 1996. ACE-inhibitors. Jaeschke R et al. Strom BL. Methodologic standards and contradictory results in case-control research. Comparability of randomised groups. 4. Sackett DL. Gardner MJ. Myles PS. Louis TA. 21. Bias in analytic research. 22:142-146. Epidemiological methods in clinical trials. Clinical Epidemiology: A Basic Science for Clinical Medicine. Berlin JA. 5. Early hemodynamic effects of left atrial administration of epinephrine after cardiac transplantation. 112:293-299. Ondansetron 4 mg for the prevention of nausea and vomiting after minor laparoscopic gynaecological surgery. Myles PS. 13. Designs for experiments . A case-control study. 39:1771-1779. Boston: Little Brown. Cancer 1977. Nosocomial infections in patients having cardiovascular operations: a multivariate analysis of risk factors.parallel comparisons of treatment. Myles PS. 6. Statistician 1985. 14. JAMA 1996. Morris JA. Our three year experience. Ann Intern Med 1989. odds ratios. Keller JL. The two-period crossover design in medical research. Leong CK.Confidence Intervals and Statistical Guidelines. 32:51-63. Chen PP et al. Lavori PW Bailar JC et al. 309:1291-1299.

Anaesth Intensive Care 1981. Br J Anaesth 1997. Pocock SJ. 22:135-140. Armitage P Sequential methods in clinical trials. Boon J. J Royal Statist Soc Ser C Appl Statist 1973. Low molecular weight heparin associated with spinal anaesthesia and gradual compression stockings in total hip replacement surgery. Stat Med 1993. Oh TH. Shearer PR. Anesthesiology 1983. Butler PJ. Myles PS. Biometrics 1987. Task Force of the Working Group on Arrhythmias of the European Society of Cardiology. Comparison of general and regional anesthesia. Bode RH Jr. Epstein BS et al. 87:723-727. Clergue F. 290:501-502. 32. Burstal R. London: Blackwell Scientific Publications. 84:3-13. 34. The antiemetic effect of droperidol following outpatient strabismus surgery in children. Cardiac outcome after peripheral vascular surgery. Treasure T. 26:165-173. 317:362-363. 25. The early termination of clinical trials: causes. 23. Anesth Analg 1998. Epidural analgesia . Am J Pub Health 1958. 26. Abramowitz MD. Zarich SW et al. McPherson K. . Ludington E. 28. Bainbridge DJ et al. 30. Samama CM.a prospective audit of 1062 patients.with special reference to trials in the field of arrhythmias and sudden death. Missing data in quantitative designs. 37. Berry CB. A multiple testing procedure for clinical trials. 1985:239-245. 36. 9:23-27. 48:1395-1402. Dexter F. O'Brien PC. Barre J et al. Postinfusion thrombophlebitis: effect of intravenous drugs used in anaesthesia. Lung management during cardiopulmonary bypass: is continuous positive airways pressure beneficial? Br J Anaesth 1993. Interim analyses in randomized clinical trials: ramifications and guidelines for practitioners. Macrae KD. Fleming TR. Geller NL. 24. 12:1459-1469. Statistical and ethical issues in monitoring clinical trials. 78:660-665. 59:579-583. and control . 27.50 Statistical Methods for Anaesthesia and Intensive Care 22. 31. Circulation 1994. Beemer GH. Minimisation: the platinum standard for trials? BMJ 1998. Pocock SJ. N Engl J Med 1974. 71:864-868. Hayes C. Lewis KP. 43:213-223. Anesthesiology 1996. Armitage P Statistical Methods in Medical Research. 33. consequences. 29. 35. 89:2892-2907. 35:549-556. Statistical analysis of total labor pain using the visual analog scale and application to studies of analgesic effectiveness during childbirth. Lantry G et al. Statistics: the problem of examining accumulating data more than once. Anaesth Intensive Care 1998. Biometrics 1979.

3. The unpaired t-test is used to compare two dependent groups. 5. Parametric tests The parametric tests assume that: 1. but this division does not influence the choice of statistical test. A one-tailed t-test is used to look for a difference between two groups in only one direction (i.5 Comparing groups: numerical data Parametric tests -Student's Mest -analysis of variance (ANOVA) -repeated measures ANOVA Non-parametric tests -Mann-Whitney U test (Wilcoxon rank sum test) -Wilcoxon signed ranks test -Kruskal-Wallis ANOVA -Friedman two-way ANOVA Key points Numerical data that are normally distributed can be analysed with parametric tests. Numerical data that are normally distributed can be analysed with parametric tests. Numerical data may be continuous or ordinal (see Chapter 1). Analysis of variance ( ANOVA) is a parametric test used to compare the means of two or more groups. using one of . Student's Mest is a parametric test used to compare the means of two groups. This chapter is concerned with the various methods used to compare the central tendency of two or more groups when the data are on a numerical scale. • Kruskal-Wallis test is a non-parametric equivalent of ANOVA.e. and/or tested for normality. larger or smaller). These tests are based on the parameters that define a normal distribution: mean and standard deviation (or variance). 2. Mann-Whitney U test is a non-parametric equivalent to the unpaired Mest. 4. Data are on a numerical scale The distribution of the underlying population is normal The samples have the same variance ('homogeneity of variances') Observations within a group are independent The samples are randomly drawn from the population If it is uncertain whether the data are normally distributed they can be plotted and visually inspected. Continuous data are sometimes further divided into ratio or interval scales.

where small samples (say n < 30) are commonly studied. If the transformed data are shown to approximate a normal distribution. or skewed. in that it can accommodate some deviation from these assumptions. say. The requirement for samples to be drawn randomly from a population is rarely achieved in clinical trials. The antilogarithm of the mean of this transformed data is known as the geometric mean. Non-normal. n > 100) data approximate a normal distribution and can nearly always be analysed with parametric tests.The commonest method is a log 4 transformation. The t-test only compares the means of the two groups. Large sample (studies with. This is one of the reasons why it has been a popular test in clinical trials. Without formally testing the assumption of equal variance. if P > 0. 1 This compares the sample data with a normal distribution and derives a P value.05 the null hypothesis is accepted (i. The F test is the ratio of variances (var l /var2). One example is the KolmogorovSmirnov test. data can be transformed so that they approximate a normal distribution. The requirement for observations within a group to be independent means that multiple measurements from the same subject cannot be treated as separate individual observations. 5 but this is not considered to be a major problem as results from inferential statistical tests have proved to be reliable in circumstances where this rule was not followed.2. This is a special case of repeated measures and requires specific analyses (see later). these data cannot be considered as 36 independent samples. The group variances can be compared using the F test. The t-test can be used when the underlying assumptions of parametric tests are satisfied (see above). Student's t-test Student's t-test is used to test the null hypothesis that there is no difference between two means. However the t-test is considered to be a robust test. they can then be analysed with parametric tests. Thus.52 Statistical Methods for Anaesthesia and Intensive Care a number of goodness of fit tests. if three measurements are made on each of 12 subjects. . it is possible to accept the null hypothesis and conclude that the samples come from the same population when they in fact come from two different populations that have similar means but different variances. It is used in three circumstances: • to test if a sample mean (as an estimate of a population mean) differs significantly from a given population mean (this is a one-sample t-test) • to test if the population means estimated by two independent samples differ significantly (the unpaired t-test) • to test if the population means estimated by two dependent samples differ significantly (the paired t-test).e. the sample data are not different from the normal distribution) and the data are considered to be normally distributed. whereby the natural logarithms of the raw data are analysed to calculate a mean and standard deviation.

0 then it is concluded that the group variances differ significantly. There is a t distribution curve for any particular sample size and this is identified by denoting the t distribution at a given degree of freedom. A t distribution is broader and flatter. Thus. the t distribution approaches the normal distribution. It describes the number of independent observations available.1 How a t distribution (for n = 10) compares with a normal distribution.05) depends on the sample size (degrees of freedom). but has wider dispersion .* The t distribution was calculated by W.1). n > 100).96 SD for the normal distribution if F differs significantly from 1. Degrees of freedom is equal to one less than the sample size (d. the likelihood of extreme values is greater.96 at a P value of 0.L. as long as the variables are normally distributed within each group and the variation of scores in the two groups is not too different.f.23 for n = 10) compared with mean ± 1. the t-test can be used even if the sample sizes are very small (n < 10). . the sampling distribution is nearly normal and it is possible to use a test based on the normal distribution (a test). As the degrees of freedom increases. When the sample size is large (say. This is analogous to a normal distribution where 5% of values lie outside 1.96 standard deviations from the mean.this accommodates for the unreliability of the sample standard deviation as an estimate of the population standard deviation. = n -1).Comparing groups: numerical data 53 Figure 5. With smaller sample sizes. the value of t approaches 1. *The value of F that defines a significant difference (say P < 0. as the degrees of freedom increases. such that 95% of observations lie within the range mean ± t x SD (t = 2. so the distribution 'curve' is flatter and broader (Figure 5. The t distribution. this can be found in reference tables (F table) or can be calculated using statistical software. Theoretically. if you refer to a t-table in a reference text you can see that. A sample from a population with a normal distribution is also normally distributed if the sample size is large. is also bell shaped. The t-test is mostly used for small samples. like the normal distribution. Gosset of the Guinness Brewing Company under the pseudonym Student (company policy prevented him from using his real name).05.

More commonly now. where d is the mean difference. In small samples it is preferable to use the value of t rather than 1.96 x standard error). 95% CI of the group mean = mean ± (t value x SE) 2. and SE denotes the standard error of this difference. the 95% CI gives an estimate of precision.e.$ studied the effect of hyperbaric oxygen ( HBO) therapy in patients with carbon monoxide poisoning. control (normal *The SE is calculated from a pooled standard deviation that is a weighted average of the t wo sample variances: . They reported the following results for a verbal learning test (with higher scores indicating better function): HBO group 42 vs. and SE = standard error. The t value can also be used to derive 95% confidence intervals (95% CI). For example. 95% Cl of the difference between groups = mean difference ± (t value x SE of the difference). u = population mean. In each of these cases a P value can be obtained from a t-table in a reference text.05 level. If the 95% CI of the difference between groups does not include zero (i. Thus. as well as indirectly giving the information about the probability of the observed difference being due to chance. it can be concluded that the sample(s) are subsets of different populations.7 In Chapter 3 we described how 95% CI can be calculated as the mean ± (1. Scheinkestel et al. Thus. The P value quantifies the likelihood of the observed difference occurring by chance alone.96: 1. no difference). then there is a significant difference between groups at the 0.54 Statistical Methods for Anaesthesia and Intensive Care The simplified formulae for the different forms of the t-test are: where X = sample mean. The null hypothesis (no difference) is rejected if the P value is less than the chosen type I error (a). a P value is derived using statistical software.

paired tests Unpaired tests are used when two different ('independent') groups are compared. the difference between means is the numerator. These initial differences contribute to the total variability within each group (variance). if the variability between sample means is very large and the variability within a sample is very low. For example. different patients will have a variety of factors that may affect the blood pressure immediately before intervention. By subtracting the first score from the second for each subject and then analysing only those differences. It is useful to take another view of the t-test procedure because it may be helpful in understanding the basis of analysis of variance. The mean difference was -7. Because the same group of patients is used. there is reduced individual and total within group variance. If we look at the formula for the t-test. instead of treating each group separately and analysing raw scores. it will become more difficult to detect a difference. there is variability of inherent characteristics that may influence the variable under study. The interval 2. the resultant t value will be small and we are less likely to reject the null hypothesis (Figure 5. When comparing central location between samples.Comparing groups: numerical data 55 oxygen) group 49. we actually compare the difference (or variability) between samples with the variability within samples.2). and so it can be concluded that there was a statistically significant difference between groups. With all samples.2-12. The authors concluded that HBO therapy does not improve outcome in carbon monoxide poisoning. this has the effect of reducing the denominator and making the t value larger.2 and the 95% CI of the difference was -12. Thus. In the t-test. then it will be easier to detect a difference between the means.2 was fairly wide and so the study did not have high precision for this estimate of effect. . It is useful to try and reduce variability within the sample group to make more apparent the difference between groups. By using the same subjects twice in a before and after treatment design.2. In the analysis of paired designs. If this is small relative to the variance within the samples (the denominator). each receiving one treatment. we will exclude the variation in our data set that results from unequal baseline levels of individual subjects. we can look only at the differences between the two measures in each subject. Intuitively. in a two-group unpaired comparison of a drug to lower blood pressure.2 to -2. Another example of a paired design is a crossover design of two treatments when instead of using two groups. Conversely if the difference between means is very small and the variability within the sample is very large. there is less variability. Paired tests are used when the two samples are matched or paired ('dependent'): the usual setting is when measurements are made on the same subjects before and after a treatment. the 95% CI did not include the zero value. Unpaired vs.2. Thus a smaller sample size can be used in a paired design to achieve the same power as an unpaired design. the same group receives the two drugs on two separate occasions (see Chapter 4).

(b) means now closer together causing overlap of curves and possibility of not detecting a difference. (c) means same distance as in B but smaller variance so that there is no overlap and difference easy to detect .2 The effect of variance: when comparing two groups. (a) Two curves of sampling distributions with no overlap and easily detected difference.56 Statistical Methods for Anaesthesia and Intensive Care Figure 5. the ability to detect a difference between group means is affected by not only the absolute difference but also the group variance.

The probability of getting at least one significant result is 1-0. the critical z value is 1. This essentially doubles the chance of finding a significant difference (i. This should not be done. Some investigators have used a one-tailed t-test because a two-tailed test failed to show a significant (P < 0.3).11 Although it is possible to divide three groups into three different pairs and use the t-test for each pair.05 would be reduced to 0.96. A one-tailed t-test should only be used if there is a valid reason for investigating a difference in only one direction .66. (a) For two-tailed a = 0. and a normal distribution the critical z value is 1. increases power) (Figure 5. and a normal distribution. 1 For example. This maintains a probability of 0.e.05. It is possible to divide the a value for each test by the number of comparisons so that overall. this will increase the chance of making a type I error (conducting three t tests will have approximately a 3a-fold chance of making a type I error). If we consider a seven-group study. there are 21 possible pairs and an a of 0. (b) for one-tailed a = 0. Comparing more than two groups The t-test should not be used to compare three or more groups. However. There is a better way to conduct multiple comparisons.Comparing groups: numerical data 57 If there is reason to look for a difference between mean values in only one direction (i. A one-tailed t-test is used to look for a difference between mean values in only one direction (i. 10.645 .05. larger or smaller). if there are three t-tests.0167 for each test and only if the P value was less than this adjusted a would we reject the null hypothesis.9521 = 0.05. it is apparent that as the number of comparisons increases. the adjusted a becomes so small that it could be very unlikely to find a Figure 5.05 of making a type I error overall. or 1/20 for each would make it likely that one of the observed differences could have easily occurred by chance. This increases the likelihood of showing a significant difference (power). 9 Ideally this should be based on known effects of the treatment and be outlined in the study protocol before results are analysed (a priori). This is known as the Bonferroni correction.e. then a one-tailed t-test can be used.05) result.3 Two-tailed and one-tailed t-tests. then an a of 0. the type I error is limited to the original a. larger or smaller).e.

The best way to avoid this is to limit the number of comparisons. Thus the Bonferroni correction is a conservative approach. The formulae for mean squares are complex. Analysis of variance (ANOVA) In general. There are many types of ANOVA and the only two we will consider here are the extensions of the unpaired and paired t-test to circumstances where there are more than two groups. However we hope that the between-group. Like the t-test. If k represents the number of groups and N the total number of results for all groups. is the result of our treatment. then there is a significant difference.ANOVA is based on the F test of variance. we also compare the difference between the means (using variance as our measure of dispersion) with the variance within the samples. A simplified formula for the F statistic is: where MS is the mean squares between and within groups. With ANOVA. ANOVA uses the same assumptions that apply to parametric tests. the variation between groups has degrees of freedom k-1. we are actually comparing the ratio of two variances . Thus if one uses reference tables to look at critical value of the F distribution. The comparison of means from multiple groups is better carried out using a family of techniques broadly known as analysis of variance (ANOVA).e. From our discussion of the t-test. or effect variance. These two variances are sometimes known as the between-group variability and within-group variability. We can compare these two estimates of variance using the F test. more efficient at detecting a true difference). the two degrees of freedom must be used to locate the correct entry. The test first asks if the difference between groups can be explained by the degree of spread (variance) within a group. 6. It divides up the total variability (variance) into the variance within each group. and that between each group. the purpose of ANOVA is to test for significant differences between the means of two or more groups. If the observed variance between groups is greater than that within groups.58 Statistical Methods for Anaesthesia and Intensive Care difference and we risk making more type II errors. a significant result was more likely when the difference between means is much greater than the variance within the samples. being based on random differences in our samples. However. .11 Thus one important reason for using ANOVA methods rather than multiple t-tests is that ANOVA is more powerful (i. to determine differences between means. The within-group variability is also known as the error variance because it is variation that we cannot readily account for in the study design. 6 It seems contradictory that a test that compares means is actually called analysis of variance. and the variation within groups has degrees of freedom N-k.

R2 is mathematically related to F and t. A large F value indicates that it is more unlikely that the null hypothesis is true. for example. *Numerically F = tz. There will also be a P value for the interaction of drug treatment and gender.e. a gender-based difference in effect. the chosen type I error. and multiple analysis of variance (MANOVA) when multiple grouping factors are analysed. the ANOVA is often referred to as a one-way or one-factor ANOVA. a measure of effect size. Of the common tests. However. Another method is the general linear model (GLM). the ANOVA by itself will only tell us that there is a difference. There is also twoway ANOVA when two grouping factors are analysed. This would be a two-factor (drug treatment and gender) ANOVA. For comparisons of specifically selected pairs of means. So if we are comparing three samples. not where the difference lies. ANOVA is a multivariate statistical technique because it can test each factor while controlling for all other factors and also enable us to detect interaction effects between variables. if the ANOVA returns a significant result. Clearly this is not that useful and we must make use of further tests (post hoc tests) to identify the differences.b Statistical software packages often provide a limited selection.Comparing groups: numerical data 59 If only two means are compared. .* In an analogous manner to the Mest described earlier. One may then consider additional contributory factors by looking at. Decisions on accepting or rejecting the null hypothesis are based on preset choices for a. An example of one-way ANOVA would be to compare the changes in blood pressure after the administration of three different drugs. most likely to indicate significant differences) and the Scheffe test is the most conservative but the most versatile because it can test complex hypotheses involving combinations of group means. that one drug treatment may be more likely to cause an effect in female patients. Dunnett's test is used specifically when one wishes to test just one sample mean against all the others. the F statistic calculated from the samples is compared with known values of the F distribution. a significant result will not identify which sample mean is different to any other. indicating perhaps. a form of multivariate regression. If we simply compare the means of three or more groups. The GLM calculates R2. In such a case the ANOVA will return a P value for the difference based on drug treatment and another P value for the difference based on gender. Thus more complex hypotheses can be tested and this is another reason why ANOVA is more powerful than using multiple t-tests. ANOVA will give the same results as the Nest.b One confusing aspect of ANOVA is that there are many post hoc tests and there is not universal agreement among statisticians as to which tests are preferred. tests such as Tukey's Honestly Significant Difference (HSD) and Newman-Keuls are often used. the Fisher Protected Least Significant Difference (LSD) is the least conservative (i. for example.

6 = day 2 (pm). Transformation of data can be a possible solution.60 Statistical Methods for Anaesthesia and Intensive Care Repeated measures ANOVA As an extension of the paired t-test. Here the difference between the group mean and individual values ('residuals') are analysed to check that they are normally distributed. From this graph.4). 12). one can easily imagine all the possible comparisons that one could make between different points on the curves. There were no significant differences between groups with ANOVA and so post hoc tests looking at each time interval were not performed. They found a reduction in quality of recovery scores early after surgery in all groups.. 5 = day 2 (am). Figure 5. It is very common to have repeated measures designs in anaesthesia. 2 A more important consideration is analysis of residuals. 11 An example is a study by Myles et a1. 4 = day 1 (pm). 3 = day 1 (am). some of the assumptions of ANOVA may not be met and this casts doubt on the validity of the analysis. Although repeated measures designs can be very useful in analysing this type of data. but the analysis of these is complex and fraught with hazard. Time periods 0 = preoperative. followed by a gradual improvement. 7 = day 3 (am).4 Perioperative changes in mean quality of recovery (QoR) score (after Myles et a1. 8 = day 3 (pm) . 12 who measured quality of recovery scores on each of three days after surgery in four groups of patients (Figure 5. 1 = recovery room discharge. we can imagine situations where we take repeated measurements of the same variable under different conditions or at different points in time. 11 Homogeneity of variance can be a problem because there is usually greater variation at very low readings (assays at the limit of detection) or very high readings. 2 = at 2-4 h postoperatively. with the attendant problems of multiple comparisons.

Some measures include: 13 • time to peak effect • area under a time-response curve • mean effect over time. This means that the outcome data should not only have the same variance at each time but also that the correlations between all pairs of repeated measurements in the same subject are equal. The conclusion was Figure 5. (b) the actual peak concentration ( Cinax). 14 a Mann-Whitney U test was used to show that tmax was delayed and Cmax was decreased in the adrenaline group (Figure 5. drug absorption profiles are conventionally summarized by three results: (a) the time to peak concentration (t ax). values at adjacent dose or time points are likely to be closer to one another than those further apart. If uncorrected. 11 It is often possible to simplify the data and use summary measures of the important features of each curve to compare the groups. 13 A simple unpaired t-test can then be used to compare these summary measures between groups. This is not usually the case. (c) the area under the curve (AUC) to a certain time point as a measure of overall absorption. In a comparison of the interpleural injection of bupivacaine with and without adrenaline.Comparing groups: numerical data 61 One of the assumptions of repeated measures ANOVA is compound symmetry. also known as multisample sphericity. The addition of adrenaline delays systemic absorption of bupivacaine. For example. 14 Mean (SD) and median (range) pharmacokinetic data are detailed below . the risk of type I error is increased. In the typical repeated measures examples above. Correction factors include the Greenhouse-Geisser and Hunyh-Feldt. full symbols) adrenaline.5 Absorption of bupivacaine after interpleural administration with (empty symbols) and without (plain.5).

one would conduct a repeated measures ANOVA on both groups at the same time. We may have to make a clinical judgment on whether it is important to distinguish a rapid but short duration of severe hypotension in one group from a more sustained less severe drop in blood pressure in the other group. For each drug we might want to know: (a) the time to maximum effect. 15 the AUC to 30 min was less in patients with intracranial hypertension. and (b) the greatest change from baseline. using group as a factor. We might inject the drug and measure systolic arterial pressure for the first 5 min after injection. when paracetamol was used as a measure of gastric emptying in intensive care patients. One can see that ANOVA designs can become quite complex. we may want to compare the haemodynamic stability of two induction agents in a special patient group. However.62 Statistical Methods for Anaesthesia and Intensive Care that the addition of adrenaline did decrease systemic absorption of bupivacaine. However if the time limit chosen is too long. and (c) whether or not these variables are different for each drug. In a manner similar to the AUC for the drug absorption examples. a repeated measures ANOVA is more appropriate. To test whether or not there are any differences between groups. there is a greater change in blood pressure in one group than the other. Note that these patients will have different baseline blood pressures.and we have not even . The ANOVA and summary measures can also obscure extreme data. one could calculate the AUC or other summary measures such as the mean or sum of the blood pressure readings in each patient. A drop in blood pressure of 30 mmHg is probably more significant in someone with a baseline of 100 mmHg than one with a baseline of 150 mmHg. Note that within each group. These would be analogous to the Cmax and CmaX in the previous example. one can see that there are many possible multiple comparisons that would inflate the type I error. From this discussion of a common and apparently simple question. if the baseline blood pressure is quite variable. This can pose further problems. Repeated measures ANOVA will take into account the different baseline blood pressures when comparing subsequent differences. As another example. However we may also want to know: (a) when the blood pressure first changes from the baseline (latency). Similarly. As a more complex example. then absolute changes may be slightly misleading. Rather than conduct multiple paired t-tests against the baseline blood pressure. one can perform a repeated measures ANOVA to compare the baseline data with subsequent readings. (b) when the blood pressure returns to normal. overall. one may not detect any differences between groups because the blood pressure has long returned to normal and this would obscure any initial differences. It is not appropriate to recalculate the summary measure at every time point to find significant differences ('data dredging'). Another question may be to determine if. This has led some authors to convert absolute differences into percentage changes from baseline before analysis. indicating delayed emptying when compared with patients without intracranial hypertension. even with ANOVA . we would be interested if one drug occasionally causes very severe hypotension.

Spearman rank order (rho) is a non-parametric version of the Pearson correlation coefficient r (see Chapter 7) Many other non-parametric tests are available. the use of parametric t-tests and ANOVA is considered acceptable (see Chapter 1). 1 These include: 1.s Non-parametric tests When the assumptions for the parametric tests are not met. there are many non-parametric alternatives for the parametric tests described above.Comparing groups: numerical data 63 considered mean arterial pressure. Wilcoxon signed ranks test is a non-parametric equivalent to the paired Student's Mest 3. A dose-response study of an analgesic presents similar problems. In a dose-response study of epidural pethidine. then non-parametric ANOVA tests may be more appropriate. heart rate and other cardiovascular variables! The investigator is well advised to precisely define the important hypotheses in advance (a priori) so that the appropriate selected analyses are undertaken. especially with repeated measures ANOVA. a repeated measures analysis of the pain scores (to determine when the pain score was different from the baseline) could have been theoretically used but would have added complexity without any further useful clinical information. . duration and overall efficacy of each dose. They are best used when small samples are selected because in these circumstances it is unlikely that the data can be demonstrated to be normally distributed. Overall postoperative analgesia was also analysed by comparing the area under the curve of the pain VAS among the three groups. In both cases. Friedman's test is a non-parametric repeated measures ANOVA 5. In both these studies. maximal effect. If the primary measure is a pain visual analogue score (VAS). Mann-Whitney U test (identical to the Wilcoxon rank sum) is a nonparametric equivalent to the unpaired Student's t-test 2. Advice and assistance from an experienced statistician is highly recommended. Non-parametric tests do not assume a normal distribution and so are sometimes referred to as distribution-free tests. ANOVA is a powerful statistical technique and many complex analyses are possible. However there are also many pitfalls. We may want to know the onset of action. 16 pain scores were measured at 3-minute intervals and the onset of action was defined as the time taken to decrease the initial pain score by 50%. However. non-parametric tests were used to compare VAS measurements because these data were considered ordinal data. Other computer-intensive tests have also been advocated for comparing means.1 but these are not often used in anaesthesia research. Kruskal-Wallis test is a non-parametric equivalent of one-way ANOVA 4. This summary measure was then compared among groups using a KruskalWallis test followed by Mann-Whitney U tests.

Just as described above for the parametric tests. the null hypothesis is that the samples come from populations with the same median. With these tests. For example. and the ranks in each group are then added to give the test statistic for each group. compared with the unpaired ttest. Mann-Whitney U test (Wilcoxon rank sum test) This test is used to determine whether or not two independent groups have been drawn from the same population. the test statistic is compared with known values for the sampling distribution of that statistic and the null hypothesis is accepted or rejected. The sum of all the ranks is N(N + 1)/2. That is. Tied data are given the same rank. The calculations are generally easier to perform than for the parametric tests. This is more likely with large samples (say. a hypothetical study investigating the effect of gender on postoperative headache may measure this pain on a 100 mm visual . especially when the sample size is large. Whitney and Wilcoxon all described tests that were essentially identical in analysis but presented them differently. they usually have less power. non-parametric tests may fail to detect a significant difference (which a parametric test may). n > 100). Samples have been drawn randomly from the population Non-parametric tests convert the raw results into ranks and then perform calculations on these ranks to obtain a test statistic. separated back into their original groups.17 In the Wilcoxon rank sum test. strictly speaking. one approach has been to perform an unpaired t-test on the ranks (rather than the original raw scores) and. even when all the conditions of the latter are satisfied. this normal approximation actually compares the mean ranks of the data between two groups rather than the medians. The test then determines whether or not the sum of ranks in one group is different from that in the other.95). 17 However. It is a very useful test because it can have high power (= 0. calculated as the mean rank of the tied observations. The Mann-Whitney U test is the recommended test to use when comparing two groups that have data measured on an ordinal scale. However. Data are from a continuous distribution of at least ordinal scale 2. 1 It has fewer assumptions and can be more powerful than the t-test when conditions for the latter are not satisfied. in the case of the Mann-Whitney U test. Statistical programs may use approximations to determine the sampling distribution of the test statistic. Then the data are ordered and given ranks. in effect. Observations within a group are independent 3.64 Statistical Methods for Anaesthesia and Intensive Care Non-parametric tests do however have some underlying assumptions: 1. then a t-test may be used if the data are normally distributed. if the data represent a variable that is. The test has several names because Mann. where N is the total number of observations. a continuous quantity. data from both the groups are combined and treated as one large group. For example.

1). with a standard deviation of 6.79. lies outside the quoted range (at P < 0. Exact method Use the group sum of ranks and consult a reference table for group sizes 16 and 14. Normal approximation The mean rank in the male group (n = 16) is 10. The t statistic based on ranks is: .44.21. 14 female).2 = 28.67. Because W.1 A hypothetical study investigating the effect of gender on postoperative headache in 30 patients (16 male. with a standard deviation of 7. 2. Pain is measured on a 100 mm visual analogue scale. The pooled standard deviation (see footnote on page 54) is 7.07.05). Tied data are given the same rank. Each patient would have their pain score recorded and they would be ranked from lowest to highest (Table 5. calculated as the mean rank of the tied observations 1. Each patient has their pain score ranked from lowest to highest.Comparing groups: numerical data 66 analogue scale. The degrees of freedom are 16 + 14 . the null hypothesis can be rejected. This is also a very valuable test with good efficiency (power Table 5. Wilcoxon signed ranks test It is important to distinguish this test from the similar sounding unpaired test above. The mean rank in the female group (n = 14) is 21.

Non-parametric Statistics for the Behavioral Sciences 2nd ed. Transformations. The use of transformation when comparing two means. post hoc comparisons are usually performed with the Mann-Whitney U test with a Bonferroni correction.] Friedman two-way ANOVA This tests the null hypothesis that k repeated measures or matched groups come from populations with the same median. . New York 1988. Gardner MJ.Confidence Intervals and Statistical Guidelines. Godfrey K. 312:1153. London 1989:pp20-27. Ludbrook J. Aust NZ J Surg 1995. Comparing the means of several groups. nj . N Fngl j Med 1985. we would expect the sum of the positive ranks to be equal to the sum of the negative ranks.66 Statistical Methods for Anaesthesia and Intensive Care = 95%) compared with the paired t-test. the differences between pairs are calculated but then the absolute differences are ranked (without regard to whether they are positive or negative). and R= the average of all the ranks (and equal to [N + 1]/2). 65:812-819. 313:1450-1456.] References 1. Br Med j 1996. Statistics with Confidence . Post hoc tests need to be performed if a significant difference is found. McGraw-Hill. Siegal S. k = the number of groups. Castellan NJ Jr. Bland JM. 312:770. 4. The positive or negative signs of the original differences are preserved and assigned back to the corresponding ranks when calculating the test statistic. 3. Br Med J 1996. Bland JM. and confidence intervals. Issues in biomedical statistics: comparing means by computerintensive tests. Rj .] As in the paired t-test. The sum of the positive ranks is compared with the sum of the negative ranks. Altman DG. This approach does not consider all group data and a method based on group mean ranks can also be used. 7. Altman DG. means. Kruskal-Wallis ANOVA This tests the null hypothesis that k independent groups come from populations with the same median. British Medical journal. Altman DG. 312: 1079.the number of cases in the jth sample. If a significant difference is found. Transforming data. Br Med j 1996. If there is no difference between groups. Altman DG. These tests are unfortunately not available with most statistical software packages but can be found in specialized texts. 6. 5.the average of the ranks in the jth group. A formula for the Kruskal-Wallis test statistic is:] where N = the total number of cases. Bland JM. 2.

Br Med J 1995. 14. Epidural meperidine after cesarean section: the effect of diluent volume. 10. Hunt JO. Anesth Analg in press. McArthur CJ. 310:170. Propofol. 16. Br J Anaesth 1990. . Fletcher H et al.Comparing groups: numerical data 67 8. Cardiovasc Res 1994. 15. 12. Myles PS et al. Multiple significance tests: the Bonferroni method. 17.a randomized. 13. Gin T. Br Med J 1994. Altman DG. Analysis of serial measurements in medical research. Bland JM. controlled clinical trial. Ngan Kee WD. 311:442-448. Kan AF et al. Matthews JNS. Hyperbaric or normobaric oxygen for acute carbon monoxide poisoning . 309:248. sevoflurane and isoflurane: a randomized controlled trial of effectiveness study. N Engl J Med 1984. Emerson JD. Anesth Analg 1997. Altman DG. Effect of adrenaline on venous plasma concentrations of bupivacaine after interpleural administration. Scheinkestel CD. BMJ 1990. 300:230-235. 28:303-311. Intensive Care Med 1995. Analyzing data from ordered categories. Chan K. 64:662-666. 9. Ludbrook J. Repeated measurements and multiple comparisons in cardiovascular research. Gin T. thiopental. Med J Aust 1999. Altman DG. Gastric emptying following brain injury: effects of choice of sedation and intracranial pressure. Campbell MJ et al. Bland JM. Hosseini H. 11. 85:380-384. Lam KK.170:203-210. Bailey M. Moses LE. Chen PP. Myles PS. McLaren IM et al. One and two sided tests of significance. Gin T. 21:573-576.

The x2 distribution is derived from the square of standard normal variables (X) and provides a basis for calculating the t and F distributions described in the previous chapter. each of which. has .e. the proportion not exposed. Categorical data are nominal and can be counted (see Chapter 1). The risk ratio is the proportion of patients with an outcome who were exposed to a risk factor vs. the null hypothesis). • The number needed to treat (NNT) is the reciprocal of the absolute risk reduction. It consists of a family of curves. This chapter is concerned with various methods to compare two or more groups when the data are categorical. The calculated value of the Pearson X. • Fisher's exact test is a recommended alternative for analysing data from 2 x 2 tables. Extensive further reading is available in a textbook on non-parametric statistics by Siegal and Castellan. McNemar's test is used to compare paired groups of categorical data.1 Chi-square (x2) The Pearson chi-square (X2 ) test is the most common significance test used for comparing groups of categorical data. • Yates' correction factor should be used when the sample size is small. • The kappa statistic is a measure of agreement.2 test statistic is compared to the chi-square distribution. • Odds ratio is an estimate of risk ratio.6 Comparing groups: categorical data Chi-square -Yates' correction Fisher's exact test The binomial test McNemar's chi-square test Risk ratio and odds ratio Number needed to treat Mantel-Haenszel test Kappa statistic Key points • The chi-square test is used to compare independent groups of categorical data. • The results from two group comparisons with two categories are set out in a 2 x 2 contingency table. and the resultant significance level (P value) depends on the overall number of observations and the number of cells in the table. used mostly in retrospective casecontrol studies. like the t-test. It compares frequencies and tests whether the observed rate differs significantly from that expected if there were no difference between groups (i. a continuous frequency distribution.

All four expected numbers are calculated and the X2 is then the sum of the four [(O.25/8. the estimates of probabilities in each cell become inaccurate and the risk of type I error increases. the result is compared to known values of the x2 distribution at 1 degree of freedom. and the expected number for group B is ([a + b])/N) x [b + d]). In a 2 x 2 table. the expected number for group A is ([a + b])/N x [a + c]).25/8.E) 2 ]/E terms.5 11. Example 6. but probably at least 20 with the expected frequency in each cell at least . When the total number of observations is small. Group A is receiving beta-blockers whereas group B is not. The X2 X2 distribution is actually a continuous distribution and yet each cell can only take integers. in a 2 x 2 table.013 The P value can be obtained in a X2-table in a reference text and is equal to 0.25/11. The outcome of interest is myocardial ischaemia Observed: G roup A Ischaemia No ischaemia Column total 5 15 20 Group B 12 8 20 Row total 17 23 40 Row total 17 23 40 Expected (if there was no difference between groups): Group A Group B Ischa em ia No ischaemia Column total 8.5) = 5.1). It is not certain how large N should be.5 11. The expected number in each cell is that expected if there were no differences between groups so that the ratio of outcome 1 to outcome 2 is the same in each group. Thus. The degrees of freedom was (2-1)(2-1) = 1. Thus. patients in group A had a statistically significant lower rate of myocardial ischaemia.5 20 X2 = (12.5 20 8. consider a clinical trial investigating the effect of preoperative beta-blocker therapy in patients at risk of myocardial ischaemia (Example 6. For example. given fixed row and column totals.70 Statistical Methods for Anaesthesia and Intensive Care The Pearson X2 statistic is calculated as: where O = the observed number in each cell.25/11. there is only free choice for one of the inner numbers because.025. the others are calculated by subtraction.1 An observational study of 20 patients at risk of myocardial ischaemia. in doing so. and so one would reject the null hypothesis.5) + (12. Thus for outcome 1. and E = the expected number in each cell.5) + (12. The degrees of freedom is equal to: (number of rows -1) x (number of columns -1).5) + (12. we would expect (a + b)/N as the ratio in each group.

the probability of each table is: Where ! denotes factorial.1. the P value for Fisher's exact test is 0. After constructing all possible tables. The probability of obtaining each of these tables is calculated. If there are multiple categories it may be useful to combine them so that the numbers in each cell are greater. Current statistical packages are able to calculate Fisher's exact test and it seems logical to use the exact probability rather than approximate x2 tests. the best approach is to use Fisher's exact test. This test was not common before the use of computers because the calculation of probability for each cell was arduous.e. and again we would accept the null hypothesis. the continuity corrected x2 is: x2=(9/8. the numbers in the cells are different but the row and column totals are the same).5)+(9/11. It does not assume random sampling and instead of referring a calculated statistic to a sampling distribution.5)+(9/11. similar to that obtained using Yates' correction factor. One can think of this as analogous to the problem of working out all the possible combinations of heads and tails if one tosses a coin a fixed number of times. It should be remembered that the x2 test is an approximation and the derived P value may differ from that obtained by an exact method.1. or one that is more extreme. the approximation of the x2 statistic can be improved by a continuity correction known as Yates' correction. The formula is: In Example 6. Further discussion can be found elsewhere. The probability of all tables with cell frequencies as uneven or more extreme than the one observed is then added to give the final P value. it calculates an exact probability. When the expected frequencies are small.68 This has an associated P value of 0. Fisher's exact test This is the preferred test for 2 x 2 tables described above.Comparing groups: categorical data 71 5.2 .054. It calculates the probability under the null hypothesis of obtaining the observed distribution of frequencies across cells.5)=3.1. We stated above that the Pearson x2 test may not be the best approach and this is more so if small numbers of observations are analysed. In Example 6. With small numbers in a 2 x 2 table. The test examines all the possible 2 x 2 tables that can be constructed with the same marginal totals (i.5)+(9/8.055 and one would accept the null hypothesis! Yates' correction is considered by some statisticians to be an overly conservative adjustment.

72 Statistical Methods for Anaesthesia and Intensive Care Analysis of larger contingency tables If there are more than two groups and/or more than two categories.p. perform separate tests on each pair using a Bonferroni correction for multiple comparisons. one can construct larger contingency tables.5 The binomial test The binomial distribution was briefly described in Chapter 2. with more cells contributing to the test statistic. It is not appropriate to partition the 2 x 3 table into several 2 x 2 tables and perform multiple comparisons. 4 This test will give a smaller P value if the variation in groups is due to a trend across groups. The paired contingency table is constructed such that groups A and Y pairs that had an event (outcome 1) would be . The null hypothesis is that the paired proportions are equal. Three 2 x 2 tables are possible and a test on each table at the original a may give a spuriously significant result. excellent. for example two categories in three groups. 3 An alternative is to use a variation of Z2 known as the X2 test for trends. Thus. For a 4 x 3 table there are 3 x 2 = 6 degrees of freedom.05. However.g. The result is referred to the X2 distribution at (m -1)(n -1) degrees of freedom.4 The analysis of larger tables can also be carried out using the Pearson X2 test as indicated above. All cells should have an expected frequency greater than 1 and 80% of the cells should have an expected frequencies of at least 5. it is better to combine some of the categories to have a smaller table. Data that can only assume one of two groups are called dichotomous or binary data.1 The binomial test could test whether a single study site in a multicentred trial had a similar mortality to that obtained from the entire study population. If this is not the case. it is often the case that some rank can be assigned to the categories (e. if the proportion in one group is equal to p. if P is less than 0. McNemar's chi-square test McNemar's X2 test is used when the frequencies in the 2 x 2 table represent paired (dependent) samples. if there are more than two categories. In the analysis of a large table. The binomial test can be used to test whether a sample represents a known dichotomous population. 1 It is a one-sample test based on the binomial distribution. a significant result on x2 testing will not indicate which group is different from the others. good. if there are m rows and n columns. then in the other it will be 1 . poor) and tests such as the MannWhitney U test may be more appropriate (see Chapter 5). A normal approximation based on the z test can be used for large samples. One approach is to do an initial x2 test and.

but on this occasion had then given them a new treatment. If available. so that we now label them as pretreatment (group A) and post-treatment (with heparin. still preserving the same distribution of outcomes after each treatment.2).1. The outcome of interest is myocardial ischaemia Group Y Ischaemia Group A (pre-LMWH): Ischaemia Group A (pre-LMWH): No ischaemia Column totals Group Y (post-LMWH): (post-LMWH): No ischaemia Row totals 3 2 5 9 12 6 8 15 20 The McNemar P value is 0. if we had used the group A patients (n = 20) described in Example 6.Comparing groups: categorical data Table 6. The conclusion from this small before and after study is that LMWH is not effective in the prevention of myocardial i schaemia in patients receiving beta-blockers.0 software). and the respective pairs with an event at only one period in cells b and c (Table 6.2 A randomized controlled trial of low molecular weight heparin (LMWH) i n 20 patients at risk of myocardial ischaemia who are receiving beta-blockers. Example 6. those pairs that did not have an event (outcome 2) would be counted in the d cell. Outcome 2 b d b+d Row totals a+b c+d a+b+c+d=N 73 counted in the a cell. The calculation of McNemar's x2 is different from that described above for the Pearson x2 .2). such as low molecular weight heparin. There is a continuity correction similar to Yates' correction and an exact version of the test that is similar to the Fisher's exact test. The Cochran Q test can be used if there are more than two groups. group Y).2 A 2 x 2 contingency table for paired groups Group Y: Outcome 1 Group A: Outcome 1 Group A: Outcome 2 Column totals a c a+c Group Y. or a single group of patients in a before and after treatment design. Groups A and Y denote either matched pairs of subjects. the exact test is preferred.1 .065 (using SPSS V9. 1 The value of the McNemar's x2 is referred to the X 2 distribution with 1 degree of freedom. For example. we would get the following table (Example 6.

but also how strong this association is. and the odds ratio is used as the estimate of the risk ratio (Table 6. outcome 'no' = controls).0. the risk ratio will be greater than one.3). It is equal to the proportion of patients with a defined outcome after an exposure to a risk factor (or treatment) divided by the proportion of patients with a defined outcome who were not exposed. and so the risk ratio can be approximated by the odds ratio. < 10%). then the association between exposure and outcome is significant (at P < 0. It is equal to the ratio of the odds of an event in an active group divided by the odds of an event in the control group. data from Example 6. note that the axes have been switched. The risk ratio and odds ratio can be expressed with 95% confidence intervals (CI). For retrospective case-control studies (outcome 'yes' = cases.74 Statistical Methods for Anaesthesia and Intensive Care Risk ratio and odds ratio The P value derived from a X2 statistic does not indicate the strength of an association. Table 6. incidence rate and risk cannot be accurately determined. the risk ratio is equal to one. using the fraction alb divided by cld. If an outcome event is uncommon the a and c cells have very small numbers relative to the b and d cells. As clinicians. if there is an increased risk.3).05). It is a reasonable estimate of risk when the outcome event is uncommon (say. this can be rewritten as adlbc. This can be described by the risk ratio (also known as relative risk) and it can be calculated from a 2 x 2 table (Table 6. the risk ratio will be less than one. If the outcome event occurs commonly.1 can be reanalysed using these methods (Example 6.3). For example. 6 Odds ratios are mostly used in case-control studies that investigate uncommon events. Because accurate information concerning all patients at risk in a retrospective case-control study is not available (because sample size is set by the researcher). we are usually interested in how much more likely an outcome will be when a treatment is given or a risk factor is present. If exposure is not associated with the outcome. the odds ratio tends to overestimate risk. and if there is a reduced risk. the value for the denominator is unreliable and so the odds ratio is used as an estimate of risk. These methods not only tell you if there is a significant association.3 In prospective cohort studies and clinical trials the risk ratio is equal to the risk of an outcome when exposed compared to the risk when not exposed. 1 If this interval does not include the value of 1. .

8 investigated the benefits of preoperative optimization and inotropes in patients undergoing major surgery.Comparing groups: categorical data 75 Example 6. whereas group B is not. There was a significant reduction in the proportion of patients who had complications in the dopexamine group compared with those in the control group.7.67).3 An observational study of 20 patients at risk of myocardial ischaemia.10 It describes the number of patients who need to be treated in order to avoid one adverse event. The incidence of myocardial ischaemia is high in this study group and so the odds ratio overestimates risk reduction Ischaemia Group A -blocker therapy Group B No therapy Column total 5 12 17 No ischaemia 15 8 23 Row total 20 20 40 Thus. patients receiving /3-blocker therapy have a 58% reduction in risk of myocardial ischaemia. risk ratio 0. An absolute risk reduction of 0.04) .about 25 patients need to be treated in order to avoid one adverse event. Wilson et al. indicating a significant 70% reduction in risk. dopexamine or control (routine care). If the baseline incidence were 60%. or 0.15 translates to a NNT of 6. They randomized 138 patients to receive adrenaline. . This is the difference in the probabilities of an event between the two groups. using risk ratio. An increase in risk of a very rare event is still very rare! Thus the change in absolute risk is of clinical importance. but this information is limited unless we consider the baseline level of risk. then the expected incidence will be 8%. Thus an absolute risk reduction of 0. They are now being used more commonly in anaesthesia research.30 (0. or 0. this gives an absolute risk reduction of 4%. Group A is receiving beta-blockers.11-0.e. a 25% risk reduction would result in an absolute risk reduction of 15%.50). The outcome of interest is myocardial ischaemia. 9.15. or incidence rate.04 translates to a NNT of 25 (1/0. The estimation of risk ratios and odds ratios have been used in epidemiological research for many years where the relationship between exposure to risk factors and adverse outcomes is frequently studied. For example.04. The number needed to treat (NNT) is the reciprocal of the absolute risk reduction. If an event has an incidence of 12% and risk is reduced by 33% (i. The risk ratio (95% CI) was 0. Number needed to treat A risk ratio describes how much more likely it is for an event to occur.

would reduce the proportion to 0. The NNT.13 It stratifies the analysis according to the nominated confounding variables and identifies any that affect the primary outcome variable.14 The kappa statistic describes the amount of agreement beyond that which would be due to chance. They performed a meta-analysis of all relevant trials and found that acupuncture/ acupressure were better than placebo at preventing early vomiting in adults. is 5.3 can be used to calculate a NNT of 2. it can be concluded that five to six adult patients need to be treated in order to prevent one patient from vomiting. 12.1of the two 95% confidence limits of the absolute risk reciprocals 0 Mantel-Haenszel test If a group response is affected by more than one variable. Kappa statistic Kappa W measures the agreement between two observers when both are rating the same variable on a categorical scale. If the incidence of early vomiting is 35% (proportion = 0. then logistic regression can be used (see Chapters 7 and 8). They can therefore calculate adjusted odds ratios.18).34-0.5.17. suggesting that two or three patients need to be treated with beta-blockers in order to prevent one patient from having myocardial ischaemia.35). with an RR (95% CI) of 0. A 95% CI can also be estimated for the NNT. or an absolute risk reduction of 0. then it may be of interest to determine the relative impact that each of the variables may have on a group outcome. The formula for the kappa statistic is: where A = the proportion of times the raters agree.47 (0.76 Statistical Methods for Anaesthesia and Intensive Care For example.18 (incidence decreased from 35% to 17%).47. or reciprocal of the absolute risk reduction (1/0. Both these tests are used most often in outcome studies where there may be several independent (predictor) variables. . Lee et a1. 11 investigated the use of acupuncture/ acupressure to prevent postoperative nausea and vomiting. then these results suggest that acupuncture/ acupressure. given the marginal totals. The Mantel-Haenszel x2 test can be used to analyse several grouping variables (i. 14 The difference between the observed proportion of cases in which the raters agree and the proportion expected by chance is divided by the maximum difference possible between the observed and expected proportions. with an RR of 0. data from Example 6. it is a multivariate test) and so can adjust for confounding. If the outcome variable is dichotomous. Therefore. and E = the proportion of agreement expected by chance. For example.9.e.64).1. It is calculated as the reduction.

They measured the reliability of their research nurses data coding and entry by measuring agreement with a sample of reabstracted data checked by study physicians. 267:2344-2348.3-0.1-0. 88:1362-1369. 15. 13. Castellan NJ Jr. 318:1099-1103. Dudley H. Issues in biomedical statistics: analysing 2 x 2 tables of frequencies. 0. Loop FD. Landis JR. Dobson AJ. JAMA 1992. Analyzing data from ordered categories. The number needed to treat: a clinically useful measure of treatment effect. Moses LE. Mantel-Haenszel techniques and logistic regression: always examine one's data first and don't overlook the simpler techniques. Meta-analysis: principles and procedures. London 1989.66-0. Morris JA. An assessment of clinically useful measures of the consequences of treatment. Higgins et a1. The use of nonpharmacologic techniques to prevent postoperative nausea and vomiting: a meta-analysis. Altman DG. 311:442-448. New York 1988. 10. 4. Lee A. Practical Statistics for Medical Research. Altman DG. Emerson JD.99. The value of kappa can be transformed and tested for statistical significance.3 can be described as mild agreement. 15 developed a risk score from their cardiac surgical database. and 0. Chapman & Hall. Kuritz SJ. McGraw-Hill. Assessing agreement. Wilson J. Done ML.0 indicates perfect agreement. 12. A general overview of Mantel-Haenszel methods: applications and recent developments. Sackett DL. Woods I.0 as excellent agreement. Morton AP. BMJ 1999. 14. Reproducibility is a very important issue in clinical research. 6. Bland JM. N Engl J Med 1984. 2. A value of 0 indicates that agreement is no better than chance and the null hypothesis is thus kappa = 0. Gardner MJ.5-1. 310:170. British Medical Journal. London 1991. Statistics with Confidence . Sackett DL. Hanley JA.Comparing groups: categorical data 77 A value of 1. N Engl J Med 1988. Ludbrook J. Egger M. Hosseini H. Reducing the risk of major elective surgery: randomised controlled trial of preoperative optimisation of oxygen delivery. Aust NZ J Surg 1994. Nonparametric Statistics for the Behavioral Sciences 2nd ed.150:384-387. Koch GG. et al. ppl-63. odds ratios. 1 A common situation where kappa is used in anaesthesia studies is to measure agreement between researchers when recording data in clinical trials. 8. Zeiss EE. indicating very good agreement. The kappa statistics were 0. Smith GD. References 1.5 as moderate agreement. BMJ 1995. 6:311-315. Calculating confidence intervals for relative risks. BMJ 1997. 9. Fawcett J et al. Estafanous FG. Phillips AN. Stratification of morbidity and mortality outcome by preoperative risk factors in coronary artery bypass patients: a clinical severity score. and standardised ratios and rates. pp261-264. In: Gardner MJ. Roberts RS. 310:452-454. Laupacis A. 64:780-787. . 318:1728-1733. and this supported the validity of their study. 11. Ann Rev Public Health 1988. 315:1533-1537. Multiple significance tests: the Bonferroni method. Higgins TL. Cook RJ. For example. 9:123-160. 5. Br Med J 1995. 7. A kappa value of 0. 3. Altman DG. Med J Aust 1989.Confidence Intervals and Statistical Guidelines. Paediatr Perinat Epidemiol 1992. Siegal S. Anesth Analg 1999.

• Spearman rank order (rho) is a non-parametric version of the Pearson correlation coefficient. • Regression is used for prediction. prediction There are many circumstances in anaesthesia research where the strength of a relationship between two variables on a numerical scale is of interest. • Agreement between two methods of measurement can be described by the Bland-Altman approach or the kappa statistic. The data should also be indepenc on the scatterplot should represe patient. whereby we wish to determine how well it is related to the other. Usually one of the variables is of particular interest. It allows a visual inspection non-linear relationship is suggeste( used which do not assume a linear i described below). For example. used for prediction. we would first plot the respective measurements obtained from each individual: 'a scatterplot' (Figure 7. One of the reasons for this is that they are used in similar circumstances and are derived from similar mathematical formulae. Here. be aware of their underlying assum] the dependent and independent vi implies that a unit change in one vai in the other. For example. This variable of interest is called the dependent variable.1 A 20 scatterplot of oxygen cons data point represents a single observad Association vs. if we wanted to describe the relationship between body temperature and total body oxygen consumption (Vo 2).7 Regression and correlation Association vs. the dependent (outcome) variable is placed on the y-axis and the independent (predictor) variable is placed on the x-axis. 100 10 Figure 7. The Pearson correlatic scatter of data around a straight line Two variables can have a strong coefficient if the relationship is not often used to describe a linear regression is used. and the plotted data represent individual observations of both variables. yet these are both frequently misunderstood and misused techniques. The first step in correlation and regression analyses should be to plot a scatter diagram (this is essential if misleading conclusions are to be avoided). prediction Assumptions Correlation Spearman rank correlation Regression analysis Non-linear regression Multivariate regression Mathematical coupling Agreement Vo 2 ( ml/min) 300 200 Key points • Correlation and regression are used to describe the relationship between two numerical variables. The commonest methods for describing such a relationship are correlation and regression analysis. one correlation and regression is the pu Assumptions Before describing correlation and re. Multiple measurements analysed using simple correlation o . • Correlation is a measure of association. The other variable is called the independent variable. but is also known as the outcome or response variable. It appears from this scatterplot that Vo 2 increases with increasing body temperature. The main distinction between them is the purpose of the analysis. the relationship between body temperature and oxygen consumption. This is one of t data. But how can the relationship between them be described in more detail: how strongly are they a to predict Vo 2 from body temperatu The first is described by the Pears by r): correlation is a measure of th question is answered by calculatin.1). Thus. but is also known as the predictor or explanatory variable.

Similarly. Assumptions Before describing correlation and regression any further. This i mplies that a unit change in one variable is associated with a unit change in the other. Each data point represents a single observation in each individual patient (n = 16) more detail: how strongly are they associated and.1 A scatterplot of oxygen consumption (Vo2) and temperature. Two variables can have a strong association but a small correlation coefficient if the relationship is not linear. The Pearson correlation coefficient describes the degree of scatter of data around a straight line . the relationship between the dependent and independent variable is assumed to be linear. it is important to be aware of their underlying assumptions. regression is most often used to describe a linear relationship and so simple linear regression is used. one of the major distinctions between correlation and regression is the purpose of the analysis. Thus.Regression and correlation 79 Figure 7.it is a measure of linear association. The data should also be independent. This means that each data point on the scatterplot should represent a single observation from each patient. Multiple measurements from each patient should not be analysed using simple correlation or regression analysis as this will lead . This is one of the main benefits of first plotting the data. then alternative techniques can be used which do not assume a linear relationship (some of these are briefly described below). are we able to predict V0 2 from body temperature? These are two different questions. It allows a visual inspection of the pattern of the scatterplot: if a non-linear relationship is suggested. The first is described by the Pearson correlation coefficient (denoted by r): correlation is a measure of the strength of association. The second question is answered by calculating a regression equation: regression is used for prediction. if relevant. First.

This distinction is an arbitrary one.2 These analyses also assume that the observations follow a normal distribution (in particular. and x = mean value of x and y = mean value of y (several forms of this equation exist). This is perhaps the most common error relating to correlation and regression in anaesthesia research. a value of -1.3 Perhaps the most useful measure is the value r2.0 describes a perfect positive linear association. . from Figure 7. the corresponding values of the dependent variable are normally distributed). however. r. the dependent variable decreases).0 and + 1.4-0. as once again this often results in an over-inflated value for r (time trends require more advanced statistical methods). For example. with the final descriptors of the extent of association being determined more by the intended clinical application.0 a strong association. then the data can be transformed (commonly using logtransformation) or a non-parametric method used (e. and 0. see below). if r has an absolute value between 0. It is a measure of association. If doubt exists. the coefficient of determination. The degree of uncertainty can be described by the standard error of r and its 95% confidence interval. and this would appear on a scatterplot as a (roughly) circular plot (Figure 7.g. as the independent variable increases.80 Statistical Methods for Anaesthesia and Intensive Care to misleading conclusions (most often an over-inflated value of r.e.7 would be a moderate association. can have any value between -1.r 2 is the proportion of variance yet to be accounted for). y = value of dependent variable. 1.73)2] of the variability in body weight is explained by age. This is an estimate of how much a change in one variable influences the other (and 1 .0. In general. but is also influenced by various other factors (such as measurement precision. This will depend largely on the number of observations (sample size). Spearman rank correlation. Correlation The Pearson correlation coefficient (r) is a measure of how closely the data points on a scatterplot assume a straight line.2: 53% [(0. that for any given value of the independent variable. or if the distribution appears non-normal after visualizing a scatterplot. and 30% where x = value of independent variable.2 and 0. Repeated measures over time should also not be simply analysed using correlation. There is obviously a degree of uncertainty for any calculated value of r.7-1.4. the range of measurements and presence of outliers). a value of 0. Statistical methods are available which can accommodate for such trial designs.0 describes a perfect negative linear association (i.* A value of 1. Neither correlation nor regression should be used to measure agreement between two measurement techniques (see below). or a misleading regression equation). A value of 0 describes no association at all. This is known as homoscedasticity.2). The statistic. it may be described as a mild association.

the resultant P value describes the likelihood of no correlation (r = 0) . the r values would not be altered whether body weight was measured in pounds or kilograms. for correlation. the effect of other factors .it does not describe the strength of that association. Knowledge of r2 therefore has clinical application. This is an adjusted r value which takes into account the impact of a third variable. Hypothesis tests can also be applied to correlation. The value of r will. however. the most common test used is Student's t-test.55)2] of the variability in urine osmolarity is explained by a change in urine flow.using 1 . Similarly.2 Examples of scatterplots demonstrating a variety of correlation coefficients (r) [(-0. which may be associated with both the dependent and independent variables (this third variable is called a covariate). A partial correlation coefficient can also be calculated. An important characteristic of correlation is that it is independent of units of measurements. It tells us how influential one factor is in relation to another (and perhaps more i mportantly. For example (referring . or if mean BP was measured in mmHg or kPa. Although uncommon. hypothesis tests can also be used to determine if a correlation coefficient is significantly different from some other specified value of r. Again. if the range of values is restricted. referring to Figure 7. then r will usually be reduced.Regression and correlation 81 Figure 7. be markedly influenced by the presence of outliers (if the range is increased by an outlier there is a tendency for r to increase).2.r2). Data should be randomly selected from a specified target population and measurement precision should be optimized.* The t-test is used to compare the means of two groups.

There are other non-parametric correlation methods . then a multiple correlation coefficient can be calculated (denoted as R) using multivariate regression (see below). that are correlated against one another (Table 7.' Unfortunately. Similarly. such as a biologically plausible argument. demonstration of the time sequence (discerning cause from effect) and exclusion of other confounding influences (i. Cramer coefficient (C). rho (p). 5 Regression analysis If the aim of the investigation is to predict one variable from another.e. then it can be transformed. a third variable associated with the two variables of interest. typically using the logarithm of each value to create a more normal distribution to the data so that correlation and regression analyses can then be performed reliably. not normally distributed). or their nutritional status. and lambda (L). renal blood flow may be affected by changes in cardiac output (which is related to mean blood pressure). standard error. Further details of these methods can be found elsewhere. a non-parametric version of correlation.1). if not. Spearman rank correlation If the distribution of the data is skewed (i. If multiple independent (predictor) variables are used to describe a relationship with a dependent (outcome) variable. Because one of the assumptions used in correlation is that the data are normally distributed. of itself. then it suggests non-normality and the Spearman p value should be preferentially used to describe association. or at least describe the value of one variable in relation to another. One remaining point should not be forgotten: association does not imply causation.82 Statistical Methods for Anaesthesia and Intensive Care to Figure 7. the relationship between age and weight may be influenced by the gender of the patient. This calculation is based on the ranking of observations and is denoted by the Greek letter. or if the data are ordinal. rather than the actual values. support a conclusion of cause and effect.Kendall's tau (r). then . A strong association does not. This requires additional proof. if Spearman s p is a similar value to r. These include use of the t-test (to derive a P value in order to determine if the correlation is significantly different from zero). Other aspects of correlation apply equally to both. 95% confidence intervals and the coefficient of determination. such as Spearman rank correlation should be used.2). then it is also preferable to use Spearman rank correlation when analysing small data sets. In fact. say n < 20 (as it is difficult to demonstrate a normal distribution with a small number of observations). these issues have been rarely addressed in the anaesthetic literature and this often leads to unsubstantiated conclusions (see also Chapter 11). Kendall's coefficient of concordance W.e. that is actually the causative factor). It is the ordered rank values. then the distribution of the data approximates normal. Alternatively.

5 20 12 11 10 17.0 x temp.5 15 4 13 22 17. These results are a subgroup (n = 23) taken from a study investigating the efficacy of patient-controlled analgesia after cardiac surgery.1).* For example. (in °C). Here the dependent (outcome) variable is again placed on the y-axis of a scatterplot and the independent (predictor) variable is placed on the x-axis. This equation states that for each 1°C increase in temperature there is a 6 ml/min increase in V0 2 . This is where the perpendicular difference between each data point and the straight line (this difference is called the residual) is squared and summed . can then be calculated using a technique known as the method of least squares. a regression line can be derived which enables prediction of V0 2 after measuring body temperature (Figure 7. This line is described by the equation. where 'b' is the measure of slope and 'a' is the y-intercept. the rank is calculated as the average between them) Subject Weight (kg) 82 86 90 64 83 65 74 53 80 46 91 69 84 105 78 73 88 70 97 92 85 69 89 Weight rank 13 17 9 3 14 4 10 2 12 1 20 6 15 23 11 9 18 8 22 21 16 5 19 Total dose of morphine (mg) 13 52 54 29 24 26 38 9 46 19 53 34 32 30 48 41 14 36 55 48 60 12 19 Morphine rank 3 19 21 9 7 8 14 1 16 5. The general formula for the line of best fit is y = a + bx.1 Actual values and their ranking of patient weight and morphine consumption at 24-48 hours after cardiac surgery. called a regression line. From this we are able to predict that if a patient has a body .54 (P = 0. Vo 2 (in ml/min) = 6.5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 p = 0. referring to our original scatterplot of V0 2 and body temperature (Figure 7.Regression and correlation 83 Table 7.031) regression analysis can be used.the eventual line chosen is that with the smallest total sum (hence the term 'least squares method' or 'residual sum of squares').6 Spearman's rho (p)is calculated by measuring the association between the rank values (if actual values are equal. A line of best fit.5 23 2 5.3).8 + 6.

that is.7. Similarly. d. and if their body temperature was 38°C. where Y' is the predicted value of the dependent variable and beta1 is the slope.84 Statistical Methods for Anaesthesia and Intensive Care temperature of 32°C. In our example (Figure 7. then VO2 would be expected to be 235 ml/min. the standard error of b decreases and so the uncertainty decreases (and the more reliable the regression equation will become). Note that the line should not extend beyond the limits of the data (where accurate prediction becomes unreliable) . t = 2. Figure 7. the width of the 95% confidence interval increases towards the extreme values.2].05).f. Because the level of uncertainty increases the further we are from the mean value of x (our independent variable). 95% confidence intervals for a regression line are curved (Figure 7.0 ± ([2. beta1 is known as the regression coefficient. using 14 degrees of freedom [n .15 x 2. This uncertainty can be described by the standard error of the slope (b) and its 95% confidence interval.15. it is statistically significant at the 5% level (P < 0. from 0. Because the 95% confidence interval for beta1 does not include the value zero. a hypothesis test can be performed to test whether a value of be ta1 is equal to zero by dividing the value of b by its standard error and looking up a t-table (degrees of freedom = n .2).3 As the study sample size increases. Hence regression is a very useful way of describing a linear relationship between two numerical variables. Hence. For the population.3.65.3 to 11.4). A regression line ('line of best fit') for regression of VO 2 on body temperature. then the best estimate for VO2 would be 199 ml/min. From this we can now calculate a 95% confidence interval for beta1 (after first looking up a t-table. the general form of the equation is actually Y' = beta0 + beta1X. = 14): the 95% confidence interval for beta1 is 6. Nevertheless there remains some uncertainty about how accurate this equation is in representing the population: it is unlikely that a derived equation will be able to perfectly predict the dependent variable of interest in the population.3).65] = 5. the slope of our sample regression line (b) is an estimate of beta1. the standard error of b can be calculated from a known formula. For this reason. SE(b) = 2.7).

showing the 95% confidence intervals (broken lines) Just as hypothesis testing can be used to determine whether a regression line (slope) is statistically significant (from zero). There are methods available which can analyse the distribution of the residuals across a range of values for X in order to determine if the data are normally distributed. The assumptions stated earlier for correlation are also important for regression. However. The residuals can also be used to describe the ' goodness of fit' of a regression equation.e.Regression and correlation 85 Figure 7.7 Specialized pharmacokinetic programs are usually used to determine an exponential function that has minimal sum of squares for a set of data points. These obviously do not plot a straight line through the scatterplot. It is important to remember that the scale of measurement in regression analysis determines the magnitude of the constants (a and b) in the regression equation. a differential weighting factor is often used for each data point in determining the regression . Because drug concentrations may vary by several orders of magnitude. Non-linear regression An example of non-linear regression is the common classical pharmacokinetic problem of fitting a polyexponential curve to drug concentration-time data. non-linear versions of regression can be used (the above is called simple linear regression). does it predict well?).e. and variance is proportional to the concentration.4 A regression line for regression of V0 2 on body temperature. do they represent two different populations?). and so units should be clearly stated. Other. two regression lines can also be compared to see whether they differ in their Y-intercept or slope (i. or ' model' (i. it is not necessary for the independent (predictor) variable to be normally distributed. The difference between a predicted value for y and the actual observed value is known as the residual.

several doses of a drug are chosen and the observed response must be dichotomous. Several polyexponential solutions are possible and a variety of criteria (e. EmaX = the maximum effect.8 3.y]). a common related method of analysis is to model the data points to fit a sigmoid EmaX model (i. The shape of the dose-response curve is expected to be sigmoid and thus the raw proportions of responses in each group must undergo an appropriate transformation. The calculation of quantal dose-response curves is another example of the use of non-linear regression. In the logit transformation.86 Statistical Methods for Anaesthesia and Intensive Care estimates. In this case a procedure known as probit analysis is often used.8 4. Table 7. and parallelism of two curves.2 and Figure 7.5 5. as well as providing confidence limits for the likelihood of response at any dose. For example. in a comparison of thiopentone requirements between pregnant and non-pregnant patients .5). 10 The basis for this is that a cumulative normal distribution curve is sigmoid in shape.e.2 Number of patients (n = 10) with hypnosis at different doses of thiopentone 1° Dose (mg/kg) 2. the doses are best chosen so that the logarithm of the doses are approximately equally spaced. the proportion responding (y) is transformed using the inverse of the cumulative standard normal distribution function. Akaike) can be used to determine the most likely model.3 Non-pregnant 0 1 4 7 7 10 10 Pregnant 1 5 6 8 10 10 10 . and ydescribes the slope of the curve. it is becoming common nowadays to carry out population pharmacokinetic modelling that combines all the data points in one overall regression analysis. 9 the numbers of patients found to be unconscious at each dose was determined (Table 7. a number of subjects are exposed to the drug and the response observed. (For a log dose-response curve.) At each dose. Either transformation can be used and they give similar results.3 3. In the probit transformation.g. In quantal dose-response experiments.4 2. the Hill equation): where E = effect. With graded dose-response curve data. [D] = the drug concentration EC50 = the concentration yielding 50% of maximal effect.0 2. The probit analysis procedure also provides methods to compare the median potency (ED 50 ). the proportion responding (y) is transformed using the natural log of the odds ratio: ln(y/[1 . Schwarz. Rather than fit individual curves with polyexponential equations.

25). Interestingly. slightly offset for clarity.29). It is not necessary for the independent variables to be normally distributed. Using this method.the 'goodness of fit'. they have recently reanalysed their data and included the variable (Pr-Pa)CO . in order to ascertain whether the addition of each extra variable increases the predictive ability of the equation (model) . systolic pulmonary artery pressure (r = -0. Y' = beta0 + beta1X1 + beta2X2 +. Data points are the original proportions in groups of 10 patients Multivariate regression Multiple linear regression is a more complex form of regression used when there are several independent variables.Regression and correlation 87 Figure 7. For example. they concluded that routine blood gas measurements could be used instead of gastric tonometry. the gap between gastric mucosal and arterial carbon dioxide tensions. they used multivariate linear regression to describe the relationship between pHi (their dependent variable) and a number of cardiorespiratory (independent) variables. Stepwise regression analysis is a type of multivariate analysis used to assess the impact of each of several (independent) variables separately. Boyd et al. 11 measured arterial blood gases and gastric tonometry (intramucosal pHi) in 20 ICU patients. nor even continuous. in order to predict the population value of the dependent (outcome) variable. It does this by determining ..12 They found that (Pr-Pa)CO2 was not correlated with arterial blood gas data and so may be a unique measure of splanchnic perfusion. using the general form of the equation.36).5 Dose-response curves in pregnant and non-pregnant women. diastolic pulmonary artery pressure (r = -0. They found mild negative associations with heart rate (r = -0. The 95% confidence intervals for ED50 and ED95 are displayed. As part of their analyses. many independent (predictor) variables can be included in a model.22) and blood lactate (r = -0.. adding or subtracting one at a time. Because they also found a strong correlation between blood base deficit and pHi (r = 0.63).

and P = probability of outcome. so that the groups are 'equalized' before comparison). Wong and Chundu used stepwise multiple linear regression to describe factors associated with metabolic alkalosis after paediatric cardiac surgery 13 Here..* This technique is commonly used in outcome studies in anaesthesia and intensive care. where the outcome of interest is a dichotomous variable .9). the dependent variable was arterial pH. Because wound infection is a dichotomous categorical variable. so that their specific effect on outcome can be adjusted according to the presence of other variables.88 Statistical Methods for Anaesthesia and Intensive Care whether there has been an increase in the overall value of R2 (where R = multiple correlation coefficient). .. that adjusts for baseline confounding variables (also known as covariates). logistic regression can be used to calculate an adjusted OR.5). then their relationship to the outcome of interest can be expressed by the risk ratio. They concluded that chloride depletion may be a factor in the pathogenesis of metabolic alkalosis in that population. w = beta + beta1X1 + beta2X2 + . where OR = odds ratio. They found there was a significant association between postoperative wound infection and smoking (OR 10. R2 = 0. 15 investigated the potential relationship between postoperative wound infection and various perioperative factors (including maintenance of normothermia) in patients having abdominal surgery. The OR is the exponential of the regression coefficient (i. Here the relationship between each baseline factor and the endpoint of interest is first determined. or its estimate. Because this is a multivariate technique. Each independent (predictor) variables may be included in the equation in a stepwise method (one at a time). and explained 42% of the variability in postoperative arterial pH (i. It generates a probability of an outcome from 0 to 1. They found that patient age and serum chloride concentration were the only significant (negative) associations with arterial pH. Analysis of covariance is a combination of regression analysis and analysis of variance (used to compare the mean values of two or more groups).e. and several patient characteristics and biochemical measures were included as independent variables. This method can be used when several groups being compared have an imbalance in potentially important baseline characteristics which may influence the outcome of interest.42).e. odds ratio ( OR) (see Chapter 6). using an exponential equation. If any of the independent variables are also dichotomous. a number of independent (predictor) variables can be included in the equation. they used multivariate logistic regression.e. leading to an adjusted comparison (i. Kurz et a1. For example.typically an adverse event or mortality 14 As with multivariate linear regression. Logistic regression is a type of regression analysis used when the outcome of interest is a dichotomous (binary. OR for the factor x 1 is equal to ebeta1 For example. or all entered together. as well as with perioperative hypothermia (OR 4. . or yes/no) categorical variable.

These predictor variables are often considered as 'risk factors'. An example would be describing the relationship between an initial urine output (say over the first 4 h) and that over 24 h (i. possibly falsely.18 This is also a common error in anaesthesia research.Pvo2) x 0.34 x Sao2 + Pao2 x 0.5 times more likely to have a postoperative wound infection (compared to non-smokers). as many endpoints of interest are actually derived (as indices) from another measured variable(s). ordinal or categorical data). For example. 0-4 hours and 4-24 hours). as the predictor variables ultimately chosen in the model must be reliable and clinically relevant. 'supply-dependence'). correlation (co-linearity) and interaction of variables.Regression and correlation 89 This means that smokers were approximately 10. But it also requires involvement of an experienced clinician. 19 Another common situation is where one variable includes the value of the other variable .e.75. Mathematical coupling should always be considered when one or both variables has been derived from other measurements. because of the potential problems with. Further discussion of these issues can be found in Chapter 8. and so concluding. Clearly the fact that the 24-hour urine volume includes the first 4-hour urine volume will ensure a reasonable degree of association . (Sao2 . *Do2 = CO x (Hb x 1. 0-4 h and 0-24 h). How the final model is constructed depends partly on the choice of independent variables and their characteristics (as numerical.003). It has been a frequent error to describe the relationship between Vo2 and D0 2 using correlation and regression analysis. 16 There may be other (unknown) variables that may have a significant impact on the outcome of interest. and patients who developed hypothermia were 4. that not only is V0 2 strongly associated with Do 2 . may be developed from a data set using multivariate (linear or logistic) regression analysis.t Hence.e. oxygen delivery ( D02) is a term derived from a measurement of cardiac output and oxygen content (which in turn is calculated from a measurement of haemoglobin concentration. arterial oxygen saturation and tension). but is also dependent on Do2 (i. for example. This is known as mathematical coupling and overestimates the value of r. or 'models'. Mathematical coupling If two variables have a mathematical relationship between them. with most authors finding an r value of approximately 0.* This is commonly calculated along with V0 2 .Svo 2 ) + (PaoZ .34 x +Vo2 = CO x [Hb x 1. then a spurious relationship can be calculated using correlation.e.this is an additive mathematical relationship. 17 Development of a reliable predictive model requires assistance from a statistician experienced in multivariate regression techniques. both V02 and D02 share several values in their derivation.in this example mathematical coupling can be avoided by excluding the first 4-hour volume from the 24-hour measurement (i.003] .9 times more likely (compared to those who were normothermic). It should be stressed that a number of equations.

Standard deviation (SD) of difference ('precision) = 1. 22) The raw data are presented.but they may not have useful clinical agreement! As an illustration. To describe the agreement between two measurement techniques. if two methods differ by a constant amount (which may be quite large) they will have excellent correlation.20. along with the calculated bias. . 20 The average is then plotted against the difference.P7co2 -4 0 5 2 0 0 1 2 1 1 0 0 1 2 1 2 1 2 2 2 Mean difference between methods ('bias') = 1.5 36 41.5 29 33 40. extent of neuromuscular blockade.96 x SD ('limits of agreement') = 3.6 mmHg. arterial (or mixed venous) oxygen saturation.21 The mean difference between measurement techniques is referred to as the 'bias' and the standard deviation of the difference is referred to as the 'precision'. this plot is sometimes referred to as a Bland-Altman plot . depth of anaesthesia.5 38 36.1 mmHg. but does not tell us how well the Table 7.1 mmHg.90 Statistical Methods for Anaesthesia and Intensive Care Agreement How well two measurement techniques agree is a common question in anaesthesia and intensive care: comparing two methods of measuring cardiac output. mmHg) in 20 patients (after Myles et a1. the average between them (considered the 'best guess') and their difference are first calculated. precision and limits of agreement Laboratory Labco2 33 39 39 38 42 41 32 37 42 39 29 33 41 32 34 39 37 31 38 43 Paratrend-7 P7co2 37 39 34 36 42 41 31 35 41 38 29 33 40 30 33 37 36 29 36 41 Average Pco2 (Labco2 + P7co2)/2 35 39 36. 1.5 30 37 42 Difference between methods Labco2 .21 In nearly all situations two methods used to measure the same variable will have very close correlation . they should not (generally) be used to describe agreement between two measurement methods. The bias is an estimate of how closely the two methods agree on average (for the population).5 38.3 Assessing the agreement between two methods of measuring arterial carbon dioxide tension ( Pco2.5 31 33.5 37 42 41 31. and regression can be used to describe their relationship. but poor agreement. Although correlation is the correct method for measuring the association between two numerical variables. etc.

predictive score or clinical judgment). For example.Regression and correlation 91 Figure 7. These were compared in patients undergoing cardiac surgery 22 and the data recorded after cardiopulmonary bypass are presented in Table 7.6 Bland-Altman plot of two methods of measuring arterial carbon dioxide tension (PCO2) (see Table 7. then the agreement between them can be determined by the kappa statistic W.6. 0. For this we must use the estimate of precision. for example. High Wycombe. 23 A kappa value of 0.3 and Figure 7.e. If either of the variables is measured on an ordinal scale (or the question being asked is how well does a measurement technique agree to a previous measurement using the same method?).5 as moderate agreement. then the intraclass .see Chapters 6 and 8). given a test result . If one or both variables are categorical. This can be used in situations where. it is the clinician's impression of the calculated bias and limits of agreement.0 as excellent agreement.3 is sometimes described as mild agreement. The most common use of the kappa statistic is to describe the reliability of two observers' ratings or recordings.3) methods agree for an individual. the chance of a particular outcome. two methods for measuring arterial carbon dioxide are the Paratrend 7 intravascular device (Biomedical Sensors. which describe where 95% of the data (observed differences) lie.5-1. The precision can be multiplied by 1. There are other situations where calculation of positive predictive value.3-0. likelihood ratio or risk ratio may be more appropriate (i.96 to calculate the 'li mits of agreement'. and this is compared to another method of assessment. an assessment is made as to whether a disease is present or absent (using either a diagnostic test. The kappa statistic describes the amount of agreement beyond that which would be due to chance. Whether two methods have clinically useful agreement is not determined by hypothesis testing. and 0. UK).1-0. and a standard laboratory blood gas analyser.

The extent of agreement. Lancet 1993. London 1971. Bland JM. Sackett DL. You cannot exclude the explanation you have not considered. Short TG. Altman DG. 5. Decreased thiopental requirements in early pregnancy. Castellan NJ. Gin T. Probit Analysis. San Francisco 1979. 12. 3. New York 1988. 11. 13. Butterworth. Siegel S. 20. Anesth Analg 1995. is still best described by the standard deviation of the difference between methods. Boston 1991: pp283-302. 2. Lancet 1986. Haynes RB. London 1989: pp34-49. J. Buckland MR. Rhodes A. Little Brown. Powell J. 86:73-78. Myles PS. N Engl J Med 1996. Anesthesiology 1997. 14. i:307-310. 17. Kurz A. Br Med J 1995. Cannon GB et al. Myles PS. NONMEM Users Guide. McGraw-Hill International Editions. Comparison of patient-controlled analgesia and conventional analgesia after cardiac surgery. Bland JM. Br J Cancer 1994. Lenhardt R. Nonparametric Statistics for the Behavioural Sciences. Bland JM. Metabolic alkalosis in children undergoing cardiac surgery. Wong HR. 15. Calculating correlation coefficients with repeated observations: Part II . Datta M. 21:884-887. Cambridge University Press.correlation between subjects. Statistical methods for assessing agreement between two methods of clinical measurement. 19. Altman DG. 9. 3rd ed. Ann Surg 1981. Lancet 1993. McRae RJ. Lancet 1997. A common source of error. 341:142-146. Hull CJ. Sessler DI. 81:430-431. Grounds RM. Sheiner LB. Statistics with confidence confidence intervals and statistical guidelines. In: Pharmacokinetics for Anaesthesia.23 This is a test of reproducibility. Clinical Epidemiology: A Basic Science for Clinical Medicine. Beal SL. Relation between oxygen consumption and oxygen delivery after cardiac surgery: beware mathematical coupling. Boyd O. 7. Bland JM. Altman DG. British Medical Journal. 310:633.92 Statistical Methods for Anaesthesia and Intensive Care correlation coefficient can be used. Chundu KR.20 References 1. Statistical aspects of prognostic factor studies in oncology. Guyatt GH et al. Finney. Perioperative normothermia to reduce the incidence of surgical-wound infection and shorten hospitalization. 22:447-453. 342:345-347. 2nd ed. Bennett ED. 18. Mainland P. Mackay CJ. Altman DG. Predicting outcome in anaesthesia: understanding statistical methods.correlation within subjects. 69:979-985. Br Med J 1995. Division of Clinical Pharmacology. Anaesth Intensive Care 1994. 8. D. Williams NJ. 310:446. London 1991: pp187-197. Altman DG. Altman DG. Comparison of clinical information gained from routine blood-gas analysis and from gastric tonometry for intramural pH.193:296-303. 6. 2nd ed. University of California. 350:413. Routine blood-gas analysis and gastric tonometry: a reappraisal. 334:1209-1215. Anaesth Intens Care 1994. The identification of compartmental models. Myles PS. In: Gardner MJ. Crit Care Med 1993. . however. 22:672-678. Archie JP Mathematical coupling of data. Boyd O. 16. Lamb G et al. 4. Gardner MJ. Chan MTV. 10. Calculating correlation coefficients with repeated observations: Part I . Simon R. Calculating confidence intervals for regression and correlation.

. Assessing agreement. Med J Aust 1989.Regression and correlation 93 21. Altman DG. Continuous measurement of arterial and end-tidal carbon dioxide during cardiac surgery: Pa_ETCO2 gradient. Higgs MA et al. 150:384-387. Comparing methods of measurement: why plotting difference against standard method is misleading. Anaesth Intensive Care 1997. 346:1085-1087. 25: 459-463. Morton AP. 22. Myles PS. Bland JM. 23. Dobson AJ. Story DA. Lancet 1995.

Sensitivity and specificity Diagnostic tests are used to guide clinical practice.1). • Predictive scores are generally unhelpful for predicting uncommon (< 10%) events in individual patients. the Mallampati score is commonly used to assess a patient's airway in order to predict difficulty with endotracheal intubation. The specificity is its true negative rate. if a predictive test is positive for an outcome that is common. • Specificity of a test is the true negative rate. The sensitivity of a test is its true positive rate. Similarly. Thus the sensitivity and specificity of a test describe what proportion of positive and negative tests results are correct given a known outcome (Figure 8. • A predictive score should be prospectively validated on a separate group of patients.8 Predicting outcome: diagnostic tests or predictive equations Sensitivity and specificity Prior probability: incidence and prevalence -positive and negative predictive value Bayes' theorem Receiver operating characteristic (ROC) curve Predictive equations and risk scores Key points • Sensitivity of a test is the true positive rate.l -3 They are used to enhance a clinician's certainty about what will happen to their patient. The accuracy of a diagnostic test can be described by its sensitivity and specificity. A receiver operating characteristic (ROC) curve can be used to illustrate the diagnostic properties of a test on a numerical scale. If a test is negative for a common (or expected) event. Predictive equations and risk scores are diagnostic tests. then it is very likely to be a true result. The most familiar is a laboratory test or investigation. • Risk prediction is usually based on a multivariate regression equation. Common events occur commonly. and is confirmed by a diagnostic test. then it is even more likely to occur. • Positive predictive value is the proportion of patients with an outcome if the test is positive. then the clinician needs to be certain that the chance of . 4 Clinicians need to know how much confidence should be placed in such tests . • Negative predictive value is the proportion of patients without an outcome if the test is negative. For example. but many aspects of a clinical examination or patient monitoring should also be considered as diagnostic tests.are they accurate and reliable? A diagnostic test usually gives a positive or negative result and this may be correct or incorrect. If a disease is common.

1).Predicting outcome: diagnostic tests or predictive equations 95 Where TP = true positive FP = false positive TN = true negative FN = false negative Sensitivity of the new test = TP/(TP + FN) Specificity of the new test = TN/(TN + FP) Positive predictive value of the new test = TP/(TP + FP) Negative predictive value of the new test = TN/(TN + FN) Figure 8. As stated above.g. Pre-existing conditions should be described by their prevalence rate. risk score). . PPV describes the likelihood of disease or outcome of interest given a positive test result. the post-test risk refers to either PPV or NPV It is common for authors to report optimistic values for PPV and NPV. Prior probability: incidence and prevalence The value of a diagnostic test in clinical practice does not just depend on its sensitivity and specificity. Conversely. In most clinical situations a negative test result should be reviewed if clinical suspicion was high. yet both are dependent on prevalence* .if a disease (or outcome) is *If the test is being used for prediction of outcome (e. if it is in reference to an expected outcome rate). Clinical interpretation of diagnostic tests requires consideration of prior probability. a single positive test result for a rare event is unlikely to be true (most positive results will be incorrect). whereas outcomes should be described by their incidence rate. positive and negative predictive value a false negative result (1-specificity) is extremely low. Prior probability is a term used interchangeably for prevalence (or incidence. Prior probability is sometimes referred to as the pre-test risk. Information about prevalence (or incidence) is required. This can be done by calculating the positive predictive value (PPV) and negative predictive value (NPV). Incidence is the proportion of patients who develop the disease (or outcome of interest) during a specified time.1 Sensitivity and specificity. then PPV is dependent on its incidence rate. Prevalence is the proportion of patients with a disease (or condition of interest) at a specified time. NPV describes the likelihood of no disease or avoiding an outcome given a negative test result (Figure 8. common events can be more confidently predicted and the clinical circumstances in which a test is to be applied must be taken into consideration.

2.1 mm is an indicator of myocardial ischaemia.e.6 This discrepancy can be quantified using PPV and NPV and illustrates the relevance of prior probability (Figure 8. But if a 60-year-old woman has the same degree of ST-segment depression.does the trial population in which the test was developed represent the clinical circumstances in which the test is to be applied? Was there a broad spectrum of patients (of variable risk) studied?1 An example of this is electrocardiographic diagnosis of myocardial ischaemia. (b) 100 women (prevalence =10%) . it could be expected that 60% of such patients (i. 2.2 The effect of prevalence. If the sensitivity of ECG diagnosis of myocardial ischaemia is 70%. The PPV for the woman is only 17%. Therefore. it might only be 10%. or prior probability on PPV and NPV. then the PPV for such patients is 74%. prior probability 60%) have ischaemic heart disease. a test (irrespective of its diagnostic utility) will tend to have a high PPV The same test in another situation where disease prevalence is low will tend to have poor PPV and yet high NPV. specificity 60%: (a) 100 elderly men (prevalence =60 %). the context in which the diagnostic test was evaluated should be considered . and with this criterion it has a sensitivity of about 70% and specificity of about 60%. but in the second case (the woman).2). then it remains unlikely that she has myocardial ischaemia.96 Statistical Methods for Anaesthesia and Intensive Care common. assuming an ST-segment diagnosis of myocardial ischaemia with sensitivity 70%.5 It is generally accepted that ST-segment depression >. Rifkin and Hood present a cogent argument describing how the extent of ST-depression should be interpreted according to the perceived risk (prior probability) of myocardial Figure 8. then it is very likely that this indicates myocardial ischaemia. 5 If a 70-year-old man with coronary risk factors is found to have ST-segment depression. In the first case (the elderly man).

divided by all those with a positive test (Figure 8.2 The prior odds is defined as the odds of an outcome before the test result is known and is the ratio of prior probability to 1 minus prior probability (prevalence/ 1-prevalence). which illustrates the varying effect of the prior probability and sensitivity of the test. 7 These are measures of a test's reliability. If this probability is low.3). It is the probability of having the disease (or outcome) given that the test was positive. A Bayesian approach can also be used to interpret clinical trials.3 Bayes' theorem PPV is a conditional probability. This relationship can be illustrated using a nomogram. The symbol 'I ' is used to denote that the item to its left presumes the condition to its right. 1. with the likelihood of an outcome. 2 PPV can also be calculated by several other methods. given a positive test result.e. to calculate PPV Bayes' formula states that the PPV is equal to the sensitivity of the test multiplied by the prevalence (or incidence) rate. the test (sensitivity) and the test result.$ In clinical practice a positive test result usually offers very little extra information for an outcome that is already likely. the utility of a test depends on its accuracy (sensitivity and specificity) and prior probability. *Thomas Bayes (1763): 'An essay towards solving a problem in the doctrine of chances'. sensitivity can be denoted by P(T+1 D+) and specificity by P(T-/ D-). 9 A significant P value for an unexpected event is less likely to be true (i. Figure 8. 3 Hence PPV is denoted by P(D +I T+). Bayes' theorem can be rearranged to calculate the odds of an outcome given a test result . PPV and NPV are proportions and so can be described with corresponding 95% confidence intervals. . unless the test is very sensitive and/or specific.3 It combines the characteristics of the patient (prior probability). one of these is a mathematical formula known as Bayes' theorem.Predicting outcome: diagnostic tests or predictive equations 97 ischaemia. according to whether a test result is positive or negative. Bayes' theorem Bayes' theorem* is a formula used to calculate the probability of an outcome (or disease). Sensitivity. 2. Using this nomenclature. Either a positive or negative likelihood ratio can be calculated. As described above. then the ratio of the true positive rate (sensitivity) to the false negative rate (1 -specificity) must be very high in order for the test to be useful in clinical practice.the likelihood ratio. specificity.

There is a trade off between sensitivity and specificity . as mean ± 1. 1o Receiver operating characteristic (ROC) curve Not all diagnostic test results are simply categorized as 'positive' or 'negative'. However. perhaps most important consideration. the best point lies at the elbow of the curve (its highest point at the left). The change in sensitivity and specificity with different cut-off points can be described by a receiver operating characteristic (ROC) curve (Figure 8.98 Statistical Methods for Anaesthesia and Intensive Care lower PPV because of a lower prior probability) than a P value that may not be significant (say P = 0. The broken line signifies no predictive ability . The cut-off value should ideally be selected so that the risk score has greatest accuracy.96 standard deviations. is the more important consideration when interpreting clinical trial results. Anaesthetists and intensivists are frequently exposed to test results on a numerical scale. Predictive equations. usually have some arbitrary cut-off value. 3 An ROC curve assists in defining a suitable cut-off point to denote 'positive' and 'negative'. Laboratory reference ranges are usually calculated from a healthy population.4 Receiver operating characteristic (ROC) curve. whereby it is considered a higher score denotes higher risk of an adverse outcome ('test positive').if the cut-off value is too low it will identify most patients who have an adverse outcome (increase sensitivity) but also incorrectly identify many who do not (decrease specificity). is for the intended clinical Figure 8. Some judgment is required in choosing a cutoff point to denote normal from abnormal (or negative from positive). In general.4). This approach has also been suggested for interim analysis of large trials. The important issue is that effect size. not P value. the final. or risk scores. or had been demonstrated in previous studies. assuming a normal distribution in the population.11) for an event that had been the main subject of study.

which are independent of prevalence. This was because adverse outcomes were rare in their study (mortality 3. 11-13 For example. scores can be compared by measuring their ROC areas. Most good predictive scores have an ROC area of at least 0. .70) in their surgical population. the greater the gain in PPV. The line of equality (slope = 1.5%). The area under an ROC curve represents the diagnostic (or predictive) ability of the test. whereby expensive resources are not expected to improve outcome. or for correcting for 'casemix' when comparing institutions. An ROC area of 0. If the consequences of false positives outweigh those of false negatives. Because an ROC curve plots the relationship between sensitivity and specificity. but would be unreliable for individual patients. 13 compared four predictive scores used in adult cardiac surgery and found they had similar ROC areas (about 0. This is often the case for groups of patients. This process may identify causative or exacerbating factors. it will not be affected by changes in prevalence. In both of these situations it is imperative that a predictive score is reliable. but less so for the individual.Predicting outcome: diagnostic tests or predictive equations 99 circumstances to guide the final choice of cut-off point. if patients are at unacceptable risk. Predictive equations and risk scores Outcome prediction has four main purposes: • to identify factors associated with outcome (so that changes in management can improve outcome) • to identify patient groups who are at unacceptable risk (in order to avoid further disability or death. as well as preventive factors. or for resource allocation) • to match (or adjust) groups for comparison • to provide the patient and clinician with information about their risk Identification of low-risk patients (who should not need extensive preoperative evaluation or expensive perioperative care) may save valuable resources for those most at need. it may be appropriate to deny further treatment. Risk adjustment is a more accurate way of correcting for baseline differences in clinical studies.75. comorbid disease. This is a very common problem in anaesthesia because serious morbidity and mortality are rare events and so most 'predictive scores' are not very helpful. then a lower point on the curve (to the left) can be chosen. Weightman et a1. The slope of the ROC curve represents the ratio of sensitivity (true positive rate) to the false positive rate. Many studies in anaesthesia and intensive care are used to derive a predictive equation or risk score.0) signifies no predictive ability. Information regarding patient demographics. Similarly. Two or more predictive.5 occurs with the curve of equality (the line y = x) and signifies no predictive ability. results of laboratory tests and other clinical data. or risk. They concluded that all of the scores performed well when predicting group outcome. The steeper the slope. may be analysed in order to describe their association with eventual patient outcome.

for categorical data it is usually X2 or risk ratio calculated from a 2 x 2 contingency table. or clinically important.05. predictor variables? It may also continue to include factors that offer very little additional predictive ability (at the expense of added complexity). each possibly interrelated. Logistic regression is used when the outcome of interest is a dichotomous (binary. 15. is likely to be unreliable if too few outcome events are studied. a backward stepwise procedure removes one variable at a time. 15. The larger the sample the more reliable the estimate of risk. of the equation is dependent on the size of the study. Cox proportional hazards is used when the outcome is time to an event (usually mortality). or model . Multiple linear regression is used when the outcome variable is measured on a numerical scale.20 R2 measures the amount of variability explained by the model and is one method of describing its reliability. in order to ascertain whether the addition of each extra variable increases the predictive ability of the equation. Stepwise regression analysis is a type of multivariate analysis used to assess the impact of each of several predictor variables separately. 14-17 Regression analysis is used to predict a dependent (outcome) variable from one.18. An alternative.16 . These include linear and logistic regression. It is recommended that at least ten outcome events should have occurred with each predictor variable in the mode1. these techniques are commonly used during the initial stages of developing a predictive equation or risk score.100 Statistical Methods for Anaesthesia and Intensive Care The simplest method of describing the relationship between a predictor variable and outcome is with one of the familiar univariate techniques .19 Further descriptions of these methods can be found in Chapters 7 and 9. They act as a screening process in order to identify any possible predictor variables. assume a linear gradient . or complementary. This process may not necessarily select the most valid. It does this by determining whether there has been a significant increase (for a forward procedure) or decrease (for a backward procedure) in the overall value of R 2 (for regression methods. Hence. or model. 1 . or precision.16 Discriminant analysis is used when there are more than two outcome categories (i. A derived equation. Univariate techniques cannot adjust for the combined effects of other predictor variables. it is necessary to use some form of multivariate statistical analysis.the 'goodness of fit'. with multiple factors. established risk factors.2).for numerical outcomes it may be Student's t-test or MannWhitney U test. 15 This may result in spurious factors being identified ('over-fitting' the data) and important ones being missed. The reliability. which are usually chosen as those with P < 0. on a categorical or ordinal scale). method is to first include known. or more independent (predictor) variables. A forward stepwise procedure adds one variable at a time. or weightings. discriminant analysis and proportional hazards. These predictor variables are often considered as 'risk factors'. or yes/no) categorical variable. 15. Though not essential.e.R2 is the proportion of variance yet to be accounted for. adding or subtracting one at a time.17 The regression coefficients. It is a measure of effect size. where R = multi le correlation coefficient) or a goodness of fit statistic (similar to .

Predicting outcome: diagnostic tests or predictive equations 10 1 between the predictor variable and the outcome of interest. It may be preferable to categorize a numerical predictor variable if this more clearly discriminates different levels of risk. for example. may be developed from a data set. 17 Because outcome prediction is usually based on a predictive equation developed using multivariate analyses.15. because of the potential problems with. correlation (co-linearity) and interaction of variables. 15 The regression coefficients (from linear or logistic regression). This means that a unit change in the predictor variable will be associated with a unit change in the probability of the outcome. But it also requires involvement of an experienced clinician. 14 and others 1.17 How the final model is constructed depends partly on the choice of predictor variables and their characteristics.)? Bootstrapping is a method of random sampling and replacement from the data set so that multiple samples can be analysed in order to validate the derived model.14. or score. or models. 14 Each of the above multivariate methods derives an equation that predicts the probability of an outcome of interest. or preferably. 15. This method was used recently by Wong et al. a clearly defined study population (ideally a wide spectrum of patients) and prospective validation in a variety of settings. 2 It has been shown to be a more reliable method than split-samples. known or unknown. this does not mean that a patient with the specified characteristics has a 42% risk of death. Another method is to prospectively validate it using another data set. or risk score. be able to predict that original data set well. or coding (as numerical. as the predictor variables ultimately chosen in the model must be reliable and clinically relevant. ordinal or categorical data). clearly defined (objective) risk factors. 15. by virtue of its derivation. It must be stressed that a number of equations. Because regression equations are often very complex. Outcome prediction only applies to groups of patients. at another institution. 23 when identifying risk factors for delayed extubation and prolonged length of stay with fast-track cardiac surgery. 21 Development of a reliable predictive model requires assistance from a statistician experienced in multivariate techniques. odds ratios (from logistic regression) or hazard ratios (proportional hazards) usually form the basis of a numerical score for each risk factor. It is best to check for this by visual inspection of the plotted data or stratifying the predictor variables into ordered groups to confirm that the effect is uniform across the range of values. If a predictive equation or risk score estimates that the risk of postoperative mortality is 42%.17 have described standards for derived predictive scores. deriving a risk score from the first and testing it on the second. only that if 100 similar patients were to proceed . One method is to split the study population into two. it is common to convert them to a risk score for clinical use. Wasson et al.17 There may be other variables.1'7 Further validation is required before accepting its clinical utility.1 . separation of predictive and diagnostic factors (ideally with blinded assessment of outcome). that may have a significant impact on the outcome of interest. They include a clearly defined outcome. it will. externally validating the equation.

was tested for its predictive ability ('goodness-of-fit') using a test similar to X2 (called the Hosmer-Lemeshow statistic). 25 yet this may be partly explained by the fact that patients who have been treated with amiodarone are more likely to have poor ventricular function and it may be this confounding factor that explains the poor outcome. randomized controlled trial.24 For example. their score had a sensitivity of 63% and specificity of 86%.'C2 and Fisher's exact test to identify risk factors associated with morbidity and mortality and also calculated odds ratios to measure the degree of association. It should be noted. of itself. 26 collected data on predictor variables and patient outcome from 26 randomly selected hospitals and 14 volunteer hospitals in the USA. A total of 17 440 patients were studied as a split sample. perhaps unknown.102 Statistical Methods for Anaesthesia and Intensive Care with surgery. One remaining point should not be forgotten: association does not imply causation.26 These were developed because ICU patients frequently suffer multisystem disease and previous risk scores usually focused on a single organ dysfunction or disease. they only validated their model on their original data set and so naturally they found their model had good predictive properties. Importantly. treatment of identified risk factors may not improve outcome. The APACHE III was then prospectively tested on a second (validation) group. such as a biologically plausible argument. Significant risk factors identified by these univariate methods were then entered into a logistic regression analysis.21. The first (derivation) group had weights calculated for various chronic disease and physiological variables using logistic regression and these weights were converted into scores. A series of regression equations and the APACHE III score are available to calculate the probability of various ICU outcomes. Outcome during and after ICU admission has been the subject of many studies. which of those patients will survive or die. demonstration of the time sequence (discerning cause from effect) and exclusion of other. they then prospectively collected data on a further 4169 patients at their institution and tested their score on this (validation) group. does not. This requires added proof. The most familiar are the APACHE scoring systems. Just because a strong association is demonstrated. Knaus et a1. In APACHE III. or model. They also constructed ROC curves to compare various versions of their derived clinical severity score. 26 Seneff and Knaus have written an excellent review of several ICU scoring systems A good example of a predictive. This enabled adjusted odds ratios (for confounding) to be calculated. confounding factors. The overall agreement with this new data set was also tested with the Hosmer-Lemeshow statistic. They calculated that if a score of 6 (out of 33) was chosen as a cut-off point for mortality. This issue could be clarified with a prospective. amiodarone has been associated with poor outcome after cardiac surgery. support a conclusion of cause and effect. The final logistic equation. that at this stage of their study. 28 who collected retrospective data on 5051 patients undergoing coronary artery bypass grafting. For this reason. We do not know with any certainty. then 42 would be expected to die postoperatively. They then used the univariate odds ratios (and 'clinical considerations') to give each significant factor a score of 1 to 6. or risk score is that developed by Higgins et a1. 28 They used . and .

Sox HC. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. 8. Campbell G. N Engl J Med 1977. Joseph L. Am J Cardiol 1985. J Cardiothorac vasc Anesth 1997. As the authors state. 32:429-434. Covariance adjustment of rates based on the multiple logistic regression model. only 11 actually died postoperatively. 313:793-799. 7. V111:283-298. Rifkin RD. Lee J. 56:51-58. References 1. 9. DeLong ER. 29 In earlier versions. 2. 293:257. . Biometrics 1988. Risk prediction in coronary artery surgery: a comparison of four risk scores. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. 14. Sackett DL. 13. Holford TR. 3. Calculating confidence intervals for proportions and their differences. 4. British Medical Journal. BIS is a predictive model. Clin Chem 1993. Zielski MM. for each 100 patients identified by their score. DeLong DM. London 1997: pp81-84. and positive and negative predictive value. regression analysis was used to construct a range of BIS values that could be used to reflect depth of hypnosis. J Chronic Dis 1981. Altman DG. MJA 1997. London 1989: pp28-33.Predicting outcome: diagnostic tests or predictive equations 103 a PPV of 11% and a NPV of 99%. JAMA 1987. Hood WB. Clinical prediction rules: applications and methodological standards. 12. This was later modified to correlate the BIS with level of hypnosis. Thus. Mallampati SR. Plotnick GD et al. their score is best to identify low-risk patients as the NPV was 99%. 349:1166-1168. A clinical sign to predict difficult intubation: a prospective study. Bayesian interim statistical analysis of randomised trials. 118:201-210. Weightman WM. 39:561-577. because the PPV was only 11% (i. 15. In either case. Gatt SP. Fisher ML. The bispectral index (BIS) is an EEG-derived estimate of depth of hypnosis. Neff RK et al. Semin Nucl Med 1978. 11:155-159. 16. Wasson JH. 10. 11. 34:415-426. Gibbs NM.e. Concato J. Gugino LD et al. The risk of determining risk with multivariable models. Basic principles of ROC analysis. Richardson WS. Newman TB. Can Anaesth Soc J 1985. Are all significant p values created equal? The analogy between diagnostic tests and clinical research. Evidence-based Medicine: How to Practice and Teach EBM. Statistics with Confidence. the other 89 survived). Bayesian analysis of electrocardiographic exercise stress testing. Feinstein AR. 6. Gardner MJ. Altman DG. It can be considered to have a certain sensitivity and specificity. In: Gardner MJ. Clarke-Pearson DL. 257:2459-2463. Schulman SP Perioperative ST-segment depression is rare and may not indicate myocardial ischemia in moderate-risk patients undergoing noncardiac surgery. Metz CE. Churchill Livingstone. N Engl J Med 1975. N Engl J Med 1985. 166:408-411. Nomogram for Bayes' theorem. Routine preoperative exercise testing in patients undergoing major noncardiac surgery. 44:837-845. Ann Intern Med 1993. 297:681-686. Rosenberg W et al. the manufacturers used multivariate regression to calculate the probability of movement. Sheminant MR et al. Brophy JM. Fagan TJ. Fleisher LA. 5. Carliner NH. Lancet 1997. Zweig MH. Browner WS.

Statistical aspects of prognostic factor studies in oncology. Br J Cancer 1977. Sackett DL. Simon R. A new cardiac risk score. 5:33-52. Higgins TL. Boston 1991: pp283-302. Seneff M. 115:92-98. prolonged length of stay in the intensive care unit. Cox DR. 26. Mohammed A et al. . 89:980-1002. Risk factors of delayed extubation. Estafanous FG. Wagner DP. 22. Regression models and life tables. and mortality in patients undergoing coronary artery bypass graft with fast-track cardiac anesthesia. Haynes RB. 91:936-944. J Intens Care Med 1990. SAPS. Lemeshow S. A primer for EEG signal processing in anesthesia. The APACHE III prognostic system: risk prediction of hospital mortality for critically ill hospitalized adults. JAMA 1992. Stratification of morbidity and mortality outcome by preoperative risk factors in coronary artery bypass patients: a clinical severity score. Predicting patient outcome from intensive care: a guide to APACHE. 20. 34:187-220. Knaus WA. Wong DT. Anesthesiology 1999. Are patients receiving amiodarone at increased risk for cardiac operations? Ann Thorac Surg 1994. Draper EA et al. Lancet 1993. Cheng DCH. Kustra R et al. Guyatt GH et al. 23. Am J Epidemiol 1982. Chest 1991. 28. 100:1619-1639. 342:345-347. J R Stat Soc Series B 1972. Analysis and examples. Pike MC. Maruyama H. 25. Little Brown. 2nd edn. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. 15:361-387. 27. Multivariable prognostic models: issues in developing models. Armitage P et al. 69:979-985. 35:1-39.10 4 Statistical Methods for Anaesthesia and Intensive Care 17. Clinical Epidemiology: a Basic Science for Clinical Medicine. Knaus WA. and other prognostic scoring systems. evaluating assumptions and adequacy. Hosner DW A review of goodness-of-fit statistics for use in the development of logistic regression models. Rampil IJ. 21. 267:2344-2348. MPM. Br J Cancer 1994. Altman DG. Loop FD et al. Mickleborough LL. Mark DB. Stat Med 1996. 24. II. Anesthesiology 1998. PRISM. Datta M. Peto R. 18. Lee KI. 58:622-629. You cannot exclude the explanation you have not considered. and measuring and reducing errors. 29. Harrell FE. 19.

• The hazard ratio is the risk of an event compared with a reference group. whereby survival depends on the probability of surviving to that point and the probability of surviving through the next time interval. analysis is most commonly applied in cancer research. and analysis of outcome after cardiothoracic surgery and organ . are at increased risk of early death. or admitted to intensive care.9 Survival analysis What is survival analysis? Kaplan-Meier estimate Comparison of survival curves -logrank test -Cox proportional hazard model The `hazards' of survival analysis Key points • Survival analysis is used when analysing time to an event. Although survival analysis is most often concerned with death rates. the outcome of interest may be any survival event. or actuarial. The difference.and its converse.when. survival analysis is generally a preferable approach as it provides more clinically relevant information concerning the pattern of the outcome of interest. What is survival analysis? Many patients undergoing major surgery.2 The outcome of interest. or freedom from postoperative nausea and vomiting. is treated as a dichotomous (binary. The actual pattern of death . of course. Clinical research in these areas often includes measures of outcome such as major morbidity and mortality. are less frequently investigated in anaesthesia and intensive care research. or yes/no) categorical variable and can be presented at any point in time as a proportion. then tapers off. Such rates that vary over time are called hazard rates. This rarely occurs in clinical practice: usually mortality is high initially.1. survival. • The Kaplan-Meier method estimates survival using conditional probability. Survival. This is particularly so for survival rates over a longer period of time. and how many patients die . • Two survival curves can be compared using the logrank test. failure of arterial cannulae. the probability of death over any time period can be estimated (using the Poisson distribution . such as extubation in ICU. If a death rate remains constant. However. is that in these circumstances complete outcome data on all patients are usually available and so the more familiar comparative statistical tests are most commonly used. They ignore other valuable information about exactly when the deaths occurred.see Chapter 2). The statistical analysis of the pattern of survival is known as survival analysis. But these rates are usually only described at certain points in time (such as inhospital mortality or 30-day mortality). the pattern of survival.

The mean time of survival can be a totally misleading statistic. After all. If some patients have only been in a trial for a few months. and then three more. a patient must first survive 12 months. using survival after heart transplantation. multiplied by the probability of death during that particular interval. we know they have survived for at least that period of time . or have not yet survived for a specified period. Start and finish times (patient recruitment and eventual mortality) are usually scattered. or who are lost to follow-up. The probability of death is calculated when patients die. The number of patients studied. An example. Patients who are withdrawn from treatment. when they died. Because deaths do not occur in a linear fashion. The median survival time is sometimes used but is only available after more than half the patients have eventually died. we cannot know how many of them will survive over a longer period (say one or two years). Censoring removes these individuals from further analysis. an estimate is made of the patients' probability of survival. estimates made over a brief period of time may not reflect the true overall pattern of survival. given the observed survival rates in the trial at each time period of interest.1. The Kaplan-Meier method calculates a conditional probability: the chance of surviving a time interval can be calculated as the probability of survival up until that time. The actuarial method first divides time into intervals and calculates survival during each of these intervals (censored observations are assumed to have survived to halfway into the interval). These survival times are 'censored' and the data concerning surviving patients are called censored observations.106 Statistical Methods for Anaesthesia and Intensive Care transplantation. Information about the censored data is included (up until they leave or are lost from the trial). is of some value. A survival table or graph may be referred to as a 'life table'. For example. Essentially. therefore withdrawals in smaller studies do not have such an effect on calculated survival rates. Kaplan-Meier estimate The Kaplan-Meier technique is a non-parametric method of estimating survival and producing a survival curve. we do not know how long the remaining patients will survive.2). as it will depend on patients' mortality distribution pattern as well as how long they were followed up. There are two main methods for describing survival data. The Kaplan-Meier method calculates the probability of survival each time a patient dies. the actuarial (life table) method and the Kaplan-Meier method. Because we have information on only those patients who actually died. We therefore need a method that accommodates for the incomplete data. Yet the information concerning patients in a trial who have not yet died. and withdrawals are ignored (Table 9. to survive for 15 months. is presented in Table 9. The probability of surviving 12 months (using Table 9. are crucial data required to describe the pattern of survival.some of their survival information can be included. are also censored observations (at the point they leave the study). and the length of time they have been observed.2) is .

681 0. or deciding whether various baseline patient characteristics (gender.) can be used to predict (or describe) eventual survival. Both standard error and 95% confidence intervals can be calculated.742 x 0.648 .2 Kaplan-Meier estimates for survival after heart transplantation (using the above hypothetical data) Month of death 1 3 6 7 12 15 20 28 Number of patients (p) 74 62 57 55 52 50 45 42 Number of deaths (d) 10 4 1 1 1 2 2 2 Probability of death (d/p) 0. Comparison of survival curves The survival pattern of different patient groups can be compared using their survival curves. as a change in the proportion surviving occurs at the instant a death occurs (Figure 9.96 = 0.Survival analysis Table 9.712. Changes in probability only occur at the times when a patient dies.786 0.1).742 0.865 0. age groups. The life table describes the observed outcome over two years Time period Number of patients alive at the start of each time period 74 62 57 55 52 50 45 42 42 39 Number of deaths 107 Alive.135 0. hence the probability of surviving 15 months is 0.044 0.960 0.040 0. The resultant Kaplan-Meier curve is step-like. risk strata.865 0.809 0.96. 3 These usually widen over the time period because the number of observations decreases.952 Cumulative survival 0.764 0.742. and the probability of survival over the next three months is 0.956 0.019 0. but yet to reach next time period (or lost to follow-up) 2 1 1 2 1 3 1 0 1 1 0-2 mths 2-4 mths 4-6 mths 6-8 mths 8-12 mths 12-16 mths 16-20 mths 20-24 mths 24-30 mths 30-36 mths 10 4 1 1 1 2 2 0 2 0 0.048 Probability of survival (1-d/p) 0.065 0.018 0.935 0.972 0.018 0.972 0. etc.712 0.971 0. This is most commonly applied when comparing two (or more) treatment regimens or different treatment periods (changes over time).1 Survival data after heart transplantation (hypothetical data). The most obvious method would be to compare the Table 9.

patients lost to follow-up or still alive).Figure 9. so that risk stratification can be quantified. and such differences between groups will obviously fluctuate. In these situations. as it only depicts the groups at a certain point in time.1 Kaplan-Meier survival curve for heart transplantation (see Table 9. Various non-parametric tests can be used for this purpose. But this is unreliable. For example.2) survival rates of the groups using standard hypothesis testing for categorical data (such as the familiar chi-square test). A test for trends can also be used. In general.' compared two anaesthetic techniques in patients undergoing cardiac surgery. What is needed is a method that can compare the whole pattern of survival of the groups. The time to tracheal extubation was . An alternative (and popular) technique is to use the logrank test. One technique is to rank the survival times of each individual and use the Wilcoxon rank sum test (survival time is not normally distributed and so the t-test would be inappropriate).e. The results of each time interval are tabulated and a X2 statistic is generated. 2 This is based on the X2 test and compares the observed death rate with that expected according to the null hypothesis (the Mantel-Haenszel test is also a variation of this). This method is unreliable if there are censored observations (i. a modification known as the generalized Wilcoxon test can be used (also known as the Breslow or Gehan test). and compares the observed group death rate with that expected if there was no difference between groups. An advantage of the logrank test is that it can also be used to produce an odds ratio as an estimate of risk of death: this is called a hazard ratio. Time intervals are chosen (such as one.or twomonth periods) and the number of deaths occurring in each is tabulated. Myles ct al. The logrank test then determines the overall death rate (irrespective of group). these tests are uncommonly used because they are not very powerful and so may fail to detect significant differences in survival between groups.

07 0.84 0. survival techniques do not need to be restricted to analysing death rates but can be used to analyse many terminal events of interest in anaesthesia and intensive care research.12 0.92 0.59 0.73 0. illustrating tracheal extubation after coronary artery bypass graft surgery in patients receiving either an enfluranebased or propofol-based anaesthetic (see Table 9.16 0.60 1.08 0.29 0.78 0.00 1.29 0.07 Enf 0 2 4 6 8 10 12 14 16 18 20 22 24 66 66 64 57 48 44 37 26 20 12 7 5 4 analysed using survival techniques (Table 9. or propofol-based (Prop) anaesthetic (using data from Myles et al.e.70 0.93 0.77 0.00 0.18 0.23 0.80 0.17 0.3) .71 1.80 1.03 0.00 Cumulative proportion still ventilated Enf Prop 1.00 0.3 and Figure 9.00 0.71 0.29 0.00 Prop 0.07 0.86 0.20 0.2).2 Kaplan-Meier survival curves.70 0.30 0.09 0.12 0.71 0. This study demonstrated that a propofol-based anaesthetic technique.56 0. 4 ) Time after ICU admission (h) Number of patients Prop 58 58 54 42 30 27 17 10 7 7 5 5 4 Number of patients extubated Enf 0 2 7 9 4 7 11 6 8 5 2 1 0 Prop 0 4 12 12 3 10 7 3 0 2 0 1 0 Probability of extubation Enf 0.58 0.72 0.08 0.84 0.00 Probability of continued mechanical ventilation Enf Prop 1.97 0.90 0.00 0.00 0.67 0.30 0.52 0.30 0.42 0.11 0.06 1.63 0.00 0.Survival analysis 109 Table 9.06 0.3 Kaplan-Meier estimates for tracheal extubation after coronary artery bypass graft surgery in patients receiving either an enflurane-based (Enf).40 0. there were no Figure 9.47 0.20 0.00 0.09 0.00 0.41 0.22 0. when compared to an enflurane-based technique.16 0.29 0. Because eventual outcome was known for all patients in Myles' study (i. resulted in shorter extubation times.00 1.11 0.93 0.00 0.37 0.10 0. As stated above.39 0.97 0.89 0.

the definition of these periods being determined by their intended clinical application.01). but when. The Cox proportional hazards model is a multivariate technique similar to logistic regression. where the dependent (outcome) variable of interest is not only whether an event occurred. These included diabetes (P = 0. . Presentation of survival data should include the number of individuals at each time period (particularly at the final periods. They used the logrank test to compare the two groups and found a significant reduction in mortality in those patients treated with atenolol (P = 0. The `hazards' of survival analysis Comparison of survival data does not need to be restricted to the total period of observation. gender and baseline risk status. For example. an early and late period can be artificially constructed for analysis.019). and the Cox proportional hazards method is used to study the effects of several risk factors on survival. or if the group survival curves actually cross. and not be influenced after visualization of the survival curves. only diabetes was a significant predictor of mortality over the two-year period (P = 0. the Kaplan-Meier method is used to estimate a survival curve using conditional probability. such arbitrary decisions should be guided by clinical interest. In summary. or at risk of. Conclusions based upon the terminal (right-hand) portion of a survival curve are often inappropriate. the logrank test is used to compare survival between groups.04). coronary artery disease.110 Statistical Methods for Anaesthesia and Intensive Care censored observations). In general. It is common for such curves to have long periods of flatness (i. Their major endpoint was mortality within the two-year follow-up period. 5 It is used to adjust the risk of death when there are a number of known confounding factors (covariates) and therefore produces an adjusted hazard ratio according to the influence on survival of the modifying factors.e. principally through a reduction in cardiovascular deaths. where no patient dies). because patients numbers are usually too small. investigated the benefits of perioperative atenolol therapy in patients with. This may be a useful exercise if differences only exist during one or other periods. Common modifying factors include patient age. They then used the Cox proportional hazards method to identify other (univariate) factors that may be associated with mortality. A comparison of mortality rates at a chosen point in time should not be based upon visualization of survival curves (when the curves are most divergent): this may only reflect random fluctuation and selective conclusion of significance may be totally misleading.01) and postoperative myocardial ischaemia (P = 0. and survival curves should also include 95% confidence intervals. When these factors were included in a multivariate analysis. where the statistical power to detect a difference is reduced). They randomized 200 patients to receive atenolol or placebo. For example. Mangano et al. but can be split into specific time intervals. traditional (non-parametric) hypothesis testing was also employed to compare the median time to extubation. this should not be interpreted as no risk of death (or 'cure').

Design and analysis of randomized clinical trials requiring prolonged observation of each patient. 4. 2. Br j Cancer 1976. myocardial ischemia. Anesth Analg 1997. Peto R. 34:187-220. Myles PS. 6. 35:1-39. 34:585-612. 5. Layug EL. Cox DR. Pike MC. Effect of atenolol on mortality and cardiovascular morbidity after noncardiac surgery. Statistics with Confidence . Hemodynamic effects. Calculating confidence intervals for survival time analysis. Mangano DT. Weeks AM et al. II. Wallace A et al. Armitage P et al. Pike MC. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. . I.Confidence Intervals and Statistical Guidelines. N Engl j Med 1996. Analysis and examples. 335:1713-1720.Survival analysis 111 References 1. j R Stat Soc Series B 1972. Altman DG. Machin D. and timing of tracheal extubation with propofol-based anesthesia for cardiac surgery. 84:12-19. Regression models and life tables. British Medical journal. 3. Introduction and design. Gardner Mj. London 1989: pp64-70. Peto R. Armitage P et al. Br j Cancer 1977. In: Gardner Mj. Buckland MR.

effectiveness The randomized controlled trial (RCT) is the gold standard method to test the effect of a new treatment in clinical practice. unbiased summary of the evidence to guide clinical management. Clinical practice guidelines are developed by an expert panel using an evidence-based approach. 2 • 4•5 Large RCTs are an excellent way of testing for effectiveness. A systematic review is a planned. meta-analysis. 6 Large RCTs are usually conducted in many centres. effectiveness Large randomized trials Meta-analysis and systematic reviews Evidence-based medicine Clinical practice guidelines Key points • Stringently designed randomized trials are best to test for efficacy. 2•4 . It is a proven method of producing the most reliable information. They are explanatory trials. by a number of clinicians.5 They are conducted in specific patient populations. Large randomized trials can detect moderate beneficial effects on outcome. often in academic institutions. Trials that test effectiveness are also called pragmatic trials. on patients who may have different characteristics. For these reasons their results may not be widely applicable and so they do not necessarily demonstrate effectiveness in day-to-day clinical practice. because it is least exposed to bias. but lack applicability. and evidence-based medicine Efficacy vs. Meta-analysis combines the results of different trials to derive a pooled estimate of effect. and so have greater applicability (generalizability). Large randomized trials are best to test for effectiveness.1-3 Randomization balances known and unknown confounding factors that may also affect the outcome of interest.4. Meta-analysis is especially prone to publication bias.10 Large trials. Evidence-based medicine optimizes the acquisition of up-to-date knowledge. but lack applicability. Stringently designed RCTs are best to test for efficacy. by experienced researchers. so that it can be readily applied in clinical practice. and so have greater applicability (generalizability). Efficacy vs. They commonly exclude patients with common medical conditions or at higher risk.

13. 2. potentially confounding factors are more likely to be balanced between groups in large RCTs. and perhaps multi-national.9-12 Their inherent weaknesses include uncertain clinical importance. renal failure or death after coronary artery surgery is 2-4%. 16. in order to maximize recruitment and enable early conclusion. (b) widely applicable treatments are generally simple. This offers an opportunity to identify other patient. moderate effects of anaesthetic interventions are worthy of study.6 The increasing interest in evidence-based medicine has added a further i mperative to conducting reliable clinical trials. transience. and anaesthesia is considered to play a small role in their occurrence. and these require large numbers of patients to be studied in order to have the power to detect a clinically significant difference. 2 In part they argued: (a) effective treatments are more likely to be important if they can be used widely. and evidence-based medicine 11 3 Why we need large randomized trials in anaesthesia* A good clinical trial asks an important question and answers it reliably. and unconvincing relationships with more definitive endpoints. 12 McPeek argued 13 years ago that changes in anaesthetic practice should be based on reliable trial evidence that can be generalized to other situations.g. simple randomized trials can reliably detect moderate effects on important endpoints (e.15 with less chance of false conclusion of effect (type I error) or no effect (type II error). and the incidence of major sepsis after colorectal surgery is 5-10%. clinician and institutional factors that may influence outcome. outcome measures in anaesthesia is widespread.7 published with permission. (c) major endpoints (death.4.4. Most improvements in our specialty are incremental. In order to `This has been adapted from an Editorial in British Journal of Anaesthesia. Do these issues apply to anaesthesia? The use of surrogate. particularly in the disciplines of cardiology and oncology.5 They are usually multi-centred. 3.13. 2.8 In 1984. 3. the incidence of stroke. 2 explained how large. One of the reasons for studying surrogate endpoints is that more definitive endpoints. Yusuf et a1. 4. meta-analysis. . mortality.5 Small studies can rarely answer important clinical questions. such as mortality or major morbidity. but the more important issue is that the trial should have adequate power (> 80%) to detect a true difference for an important primary endpoint. 4 Large RCTs are more likely to convincingly demonstrate effectiveness because their treatments are generally widely applicable. What is a large trial? This depends on the clinical question. disability) are more important and assessment of these endpoints can be simple.Large trials. but these require large RCTs in order to be reliable. i mportant. For example. These extraneous. and (d) new interventions are likely to have only a moderate beneficial effect on outcome. 4.17 Most important adverse outcomes after surgery are rare. or intermediate.14 They are therefore less biased and so are more reliable. Some will accept trials that study more than 1000 patients. major morbidity). These considerations have fostered the widespread use of large multi-centred RCTs. 5 Nevertheless. are very uncommon after surgery.

Individual trial results can be summarized by a measure of treatment effect and this is most commonly an odds ratio (OR) and its 95% confidence interval (95% CI). Meta-analysis and systematic reviews Meta-analysis is a process of combining the results of different trials to derive a pooled estimate of effect and is considered to offer very reliable information. 26 The term systematic review. with a fixed sample size this equates to a higher incidence rate). including acute pain management and obstetrics.dk). It is a commonly used estimate of risk.0 suggests a reduction in risk. The most wellknown is the Cochrane Collaboration 2 7.05 and type II error 0. suite. differences in trial characteristics (heterogeneity) can obscure this process. If the 95% CI of the OR exceeds the value of 1.0.e.1 The approximate number of patients needed to be studied (assuming a type I error 0. it may be a chance finding).5% 4.28 an Oxford-based group that was established to identify all randomized controlled trials on specific topics. those not treated. An anaesthetic subgroup is being considered (see web-site: www. 5 % Number of patients 920 2500 5400 9300 detect a moderate. The results (ORs) of individual trials are combined in such a way that large trials have more weight. There have been some excellent examples of large RCTs in anaesthesia. their statistical analyses. and greater than 1.e. As stated above. . less than 1. or overview. then it is not statistically significant at P < 0. 18-20 In some of these the investigators selected a high-risk group in order to increase the number of adverse events in the study.0 an increased risk. and interpretation of the results.05 (i.2) Baseline incidence 40% 20°% 10% 6% 25% improvement with intervention 30% 15°% 7.cochrane-anaesthesia. is sometimes used interchangeably with meta-analysis. The OR is the ratio of odds of an outcome in those treated vs. They have several subgroups that focus on particular topics.1).0 suggests no effect. 21-23 Some recent examples in anaesthesia include the effect of ondansetron on postoperative nausea and vomiting (PONV) 24 the role of epidural analgesia in reducing postoperative pulmonary morbidity. many thousands of patients are required to be studied (Table 10. this reduced the number of patients required (i. it is recommended that a random effects model be used to combine ORs. but clinically important difference between groups. 25 and the benefit of acupressure and acupuncture on PONV. An OR of 1. but this more aptly describes the complete process of obtaining and evaluating all relevant trials.114 Statistical Methods for Anaesthesia and Intensive Care Table 10. For this reason.

If this pooled result does not cross the value 1. small trials = n <_ 50. 16.23.1 Effect of non-pharmacological techniques on risk of early postoperative vomiting in adults • = relative risk for individual study. it is considered to be statistically significant. The trials included in a meta-analysis should have similar patient groups. but this is not recommended because it obviously weakens their reliability. _ overall summary effect.47 (95% CI: 0.64). A logarithmic scale is often used to display ORs because increased or decreased risk can be displayed with equal magnitude. The OR (box) and 95% CI (lines) for each subsequent trial are usually displayed along a vertical axis. Meta-analysis may include non-randomized trials. The pooled OR and 95% CI for all the trials is represented at the bottom as a diamond. meta-analysis. and evidence-based medicine 11 5 whereby the individual trials are considered to have randomly varied results.1). low-quality studies = quality score <_ 2. The pooled estimate of effect for prevention of early vomiting (expressed as a risk ratio in their study) was 0. and measure similar endpoints. Each of these characteristics should be defined in advance. The control was sham or no treatment. Large trials = n > 50.0. They did a sensitivity analysis." 29 and some of its potential weaknesses identified. with the width of the diamond representing the 95 CI%. 23 For example. Meta-analysis has been criticized.Large trials. Results from each trial can be displayed graphically. The size of the box represents the sample size of the trial. . those of good quality.34-0. using a similar intervention. and those where a sham treatment was included.17. Lee and Done26 investigated the role of acupressure and acupuncture on PONY 26 They found eight relevant studies and produced a summary diagram (Figure 10.30-32 These include publication bias Figure 10. This leads to slightly wider 95% CI. high-quality studies = quality score > 2. by separately analysing large and small trials.

116

Statistical Methods for Anaesthesia and Intensive Care

(negative studies are less likely to be submitted, or accepted, for publication), duplicate publication (and therefore double-counting in the meta-analysis), heterogeneity (different interventions, different clinical circumstances) and inclusion of historical (outdated) studies. Despite these weaknesses, meta-analysis is considered a reliable source of evidence. 22,28 There are now established methods to find all relevant trials, 23 and so minimize publication bias. These include electronic database searching, perusal of meeting abstracts and personal contact with known experts in the relevant field. Advanced statistical techniques (e.g. weighting of trial quality, use of a random effects model, funnel plots) and sensitivity analysis can accommodate for heterogeneity 16,21,23,30,31 . The QUOROM statement, a recent review, has formulated guidelines on the conduct and reporting of meta-analyses. 32 Meta-analyses sometimes give conflicting results when compared with large RCTs. 16,17,29,30 A frequently cited example is the effect of magnesium sulphate on outcome in patients with acute myocardial infarction. 17,21,30,33,34 Many small RCTs had suggested that magnesium i mproves outcome after acute myocardial infarction and this was the conclusion of a meta-analysis, LIMIT-2, published in 1992. 35 A subsequent large RCT, ISIS-4, disproved the earlier finding. 36 There have been several explanations for such disagreement. 33,34 But it is generally recognized that positive meta-analyses should be confirmed by large RCTs. Meta-analyses that include one or more large RCTs are considered to be more reliable. 29 Meta-analyses that find a lack of treatment effect can probably be accepted more readily 11 The findings of a meta-anal sis are sometimes presented as the number needed to treat ( NNT). 28 • 37 Here the reciprocal of the absolute risk reduction can be used to describe the number of patients who need to be treated with the new intervention in order to avoid one adverse event. For example, Tramer et a1. 24 found a pooled estimate in favour of ondansetron , with an OR (95% CI) of approximately 0.75 (0.71-0.83)*. If the incidence of early PONV is 60% (proportion = 0.60), then these results suggest that ondansetron, with an OR of 0.75, would reduce the proportion to 0.45, or an absolute risk reduction of 0.15 (60% to 45%). The NNT, or reciprocal of the absolute risk reduction (1/0.15) is 6.7. Therefore, it can be concluded that six or seven patients need to be treated in order to prevent one patient from having PONY

**Evidence-based medicine
**

Evidence-based medicine (EBM) has been defined by its proponents as the 'conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients'. 28 Although referred to as a new paradigm in clinical care, 15 it could more accurately *The authors calculated odds ratios (as estimates of risk ratios) in terms of a relative benefit, and so we have used the reciprocal to present their results as a relative reduction in PONV (a ratio of benefit of 1.3 became an OR of 0.75).

Large trials, meta-analysis, and evidence-based medicine

11 7

be described as a simplified approach that optimizes the acquisition of up-to-date knowledge, so that it can be readily applied in clinical practice. As such, EBM formalizes several aspects of traditional practice. The five steps of EBM are: 28 Step 1. Ask an answerable question Step 2. Search for evidence Step 3. Is the evidence valid? Step 4. Does the evidence apply to my patient? Step 5. Self-assessment It teaches how to formulate a specific and relevant question arising from clinical practice, how to efficiently and reliably access up-to-date knowledge ('evidence'), and then reminds us of established critical appraisal skills used to asses the validity of that evidence. The fourth, and perhaps most important, step is to use clinical expertise in order to determine whether that evidence is applicable to our situation. It is this step that requires clinical experience and judgment, understanding of basic principles (such as pathophysiology, pharmacology and clinical measurement), and discussion with the patient before a final decision is made.5,15,2s EBM also has a fifth step, asking clinicians to evaluate their own evidence-based practice. There have been some concerns raised about the application of evidence-based methods in anaesthetic practice. 11,12,31 Some clinicians argue that the principles of EBM are in fact those of good traditional care, and not a new paradigm or approach to clinical practice.11,38 But the central feature of EBM is its direct application at the bedside (or in the operating suite), with a specific patient or procedure in mind. For this reason, it has direct relevance for us, encourages active learning, and should reduce poor anaesthetic practices not supported by evidence. What constitutes 'evidence'? Most classifications consider how well a study has minimized bias and rate well-designed and conducted RCTs as the best form of evidence (Table 10.2). 22 • 39 But it is generally accepted that other trial designs play an important role in anaesthesia research and can still be used for clinical decision-making. 3,5,8 ,12,28

**Table 10.2 Level of evidence supporting clinical practice (adapted from the US Preventive Services Task Force)
**

Level Definition

I

Evidence obtained from a systematic review of all relevant randomized controlled trials Evidence obtained from at least one properly designed randomized controlled trial Evidence obtained from other well-designed experimental or analytical studies Evidence obtained from descriptive studies, reports of expert committees or from opinions of respected authorities based on clinical experience

II

III

IV

11 8

Statistical Methods for Anaesthesia and Intensive Care

More recently, there has been some recognition that the level of evidence (I-IV) is not the only aspect of a study that is of relevance to clinicians when they apply the results in their practice. Thus, the dimensions of evidence are all important: level, quality, relevance, strength and magnitude of effect. Myles et a1. 22 surveyed their anaesthetic practice and found that 96.7% was evidence based, including 32% supported by RCTs. These results are similar to recent studies in other specialties 39,40 and refute claims that only 10-20% of treatments have any scientific foundation. The traditional narrative review has been questioned in recent years. Their content is largely dependent on the author(s) and their own experiences and biases. EBM favours the systematic review, as an unbiased summary of the evidence base. 41 These have become more commonly used in anaesthesia research. 24-26,42

**Clinical practice guidelines
**

Clinical practice guidelines have been developed in order to improve processes or outcomes of care. 5,43,44 They are usually developed by a group of recognized experts after scrutinizing all the available evidence. They generally follow similar strategies to that of EBM, and so the strongest form of evidence remains the randomized controlled trial. In the past clinical practice guidelines were promulgated by individuals and organizations without adequate attention to their validity 44 More recently there has been a number of excellent efforts at developing guidelines in many areas of anaesthetic practice. 45 , 49 The relationship between the RCT and development of clinical practice guidelines has been explored eloquently by Sniderman. 5 He pointed out that both the RCT and practice guidelines developed by an expert committee can be seen as impersonal and detached. Yet, he suggests, this is also their strength, in that they have transparency and objectivity. Because there is often incomplete evidence on which to develop guidelines, there is a risk of them being affected by the interpretations and opinions of the individuals who make up the expert panel. Sniderman points out that they are also a social process and are exposed to personal opinions and compromise. 5 He suggests that their findings can be strengthened by including a diverse group of experts and not to demand unanimity in their recommendations. As with EBM, systematic evaluation of published trials can also identify important clinical problems that require further study. Smith et a1. 49 noted that pain medicine (as with many areas in anaesthesia) is rapidly evolving and so guidelines may become outdated within a few years. The cost and effort to maintain them may be a limiting factor in future developments. The evaluation of clinical practice guidelines can be biased. Participating clinicians may perform better, or enrolled patients may receive better care or report improved outcomes, because they are being studied. This is known as the Hawthorne effect. There are several approaches that can minimize this bias. 44 The gold standard remains the

3:409-420. 11. and a major change in anesthetic practice [editorial]. Peto R. The literature of anaesthesia: what are we learning? Can J Anaesthesia 1988. 12:221-227. 13. Ioannidis JPA. and clinical practice. Anaesth Intensive Care 1996. 6. Boston 1991: pp187-248. 20:637-648.6 References 1. The impact of high-risk patients on the results of clinical trials. Collins R. Rothman KJ. In some circumstances we are interested in a specific mechanistic question for which a small. 81:795-796. but this is often not feasible. Measuring anaesthetic outcomes. 354:327-330. Why do we need some large. Yusuf S. Evidence based medicine and anaesthesia: Uneasy bedfellows? Anaesth Intensive Care 1997. Lau J. 3.33. Myles PS. Tugwell P Deciding on the Best Therapy: A Basic Science for Clinical Medicine. Goodman NW. Little Brown. 24:685-693. . 53:353-368. Guyatt GH. 7. Fisher DM. Lee A. Inference. Members of the party scrutinized all relevant studies and rated the level of evidence (levels I-IV) in order to make recommendations.11. 39:S1771-1775. meta-analysis. Why we need large randomised trials in anaesthesia [editorial]. simple randomized trials? Stat Med 1984. Anaesthetists gain new knowledge from a variety of sources and study designs. Haynes RB. Anaesthesia 1998. generalizability. 15. Horan B. consensus conferences.Large trials. Cancer 1977. Surrogate end points: are they meaningful [editorial]? Anesthesiology 1994. 3:494-499.12. Lancet 1999. 2. McPeek B. In most cases these findings should be confirmed with an RCT. 268:2420-2425. 12. Sackett DL. 9. Other designs include the crossover trial (where both methods of practice are used in each group at alternate times). Epidemiologic methods in clinical trials.51 Investigation of moderate treatment effects on important endpoints are best done using large RCTs. Br J Anaesth 1999. J Chron Dis 1967. 7. Evidence-based medicine: a new approach to teaching the practice of medicine. 3. 4. Lum ME. Explanatory and pragmatic attitudes in therapeutic trials.12. Myles PS. Curr Opinion Anaesthesiol 1999. Anaesthesia and evidence-based medicine. 14. 44 The management of acute pain has received recent attention 49 For example. 8. 5. JAMA 1992. Evidence-based medicine working group. 2. Lellouch J. tightly controlled RCT testing efficacy may be preferable. the Australian National Health and Medical Research Council established an expert working party to develop clinical practice guidelines for the management of acute pain. 50:1089-1098. Sniderman AD. J Clin Epidemiol 1997. Rigg JRA. and evidence-based medicine 11 9 RCT. or the before-and-after study that includes another control group for comparison. 66:723-724. Jamrozik K. 83:833-834.50 Observational studies can be used to identify potential risk factors or effective treatments. Schwarz D. Duncan PG. Evidence-based methods to improve anaesthesia and intensive care. 25: 679-685. 10. Anesthesiology 1987. Cohen MM. Clinical trials.

Moher D. Benhaddad A et al. 18. Rosenberg W. eds. Horwitz RI. 30. Ioannidis JPA. Clyti N et al. safe. Woods KL. Wallace A et al. Egger M. Mega-trials and management of acute myocardial infarction. Conseiller C. 23. Roffe C.120 Statistical Methods for Anaesthesia and Intensive Care 16. Overcoming the limitations of current meta-analysis of randomised controlled trials. Kurz A. Cook DJ et al. Anesth Analg 1999. 25. Oxford 1997. BMJ 1997. Sessler DI. Ondansetron compared with 21. Lessons learned from 'an effective. simple trials and overviews of trials': discussion: a clinician's perspective on meta-analyses. simple' intervention that wasn't. Anesthesiology 1997. Improving the quality of reports of metaanalyses of randomised controlled trials: the QUORUM statement. N Engl J Med 1996. 346:611-614. Lee A. Pogue J. BMJ 1995. Layug EL. Anesth Analg 1998. 32. 88:1362-1369. A quantitative systematic review of randomized placebocontrolled trials. The use of nonpharmacologic techniques to prevent postoperative nausea and vomiting: a meta-analysis. Sackett DL. dose-response. 352:609-613. 86:598-612. Lancet 1998. LeLoerier J. 335:1713-1720. Churchill Livingstone. Lancet 1999. Done ML. 19. Lenhardt R. Myles PS. 17. 24. Moher D. 33. 82:591-595. Effect of atenolol on mortality and cardiovascular morbidity after noncardiac surgery. controlled trials. Reynolds DJ. 310:752-754. Egger M. Gregoire G. and safety of ondansetron in prevention of postoperative nausea and vomiting. Jones A. Ballantyne JC. 48:41-44. 354:1896-1900. Am J Cardiol 1995. 29. Yusuf S. Sackett DL. Br J Anaesth 1999. J Clin Epidemiol 1995. London 1997. The comparative effects of postoperative analgesic therapies on pulmonary outcome: cumulative meta-analyses of randomized. McQuay HJ. Cook DJ. Diemunsch P. temperature group. Richardson WS. Bain DL. 28. Intravenous magnesium sulphate in infarction: big numbers do not tell the whole story. Cochrane Collaboration Handbook. 22. Schmid CH et al. 276:1332-1338. Johnson F. The study of wound infection and 334:1209-1215. 315:1533-1537. 79:322-326. Lancet 1995. The Cochrane Collaboration. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 1998. Perioperative normothermia to reduce the incidence of surgical-wound infection and shorten hospitalization. Discrepancies between metaanalyses and subsequent large randomized. Antman EM. Large trials vs meta-analysis of smaller trials: how do their results compare? JAMA1996. Haynes RB. Woods KL. Davey-Smith G. Meta-analysis: principles and procedures. 31. controlled trials. Br J Anaesth 1997. 'Large-scale randomized evidence: large. 351:47-52. Smith GD. Medicine: How to Practice and Teach EBM. deFerrabti S et al. Cappelleri JC. Oxman AD. McMahon R. Eastwood S et al. 27. 35. Haider Y. Misleading meta-analysis. Efficacy. 34. Carr DB. Mangano DT. Is anaesthesia evidence-based? A survey of anaesthetic practice. N Engl J Med 1997. Fletcher S. Moore RA. Phillips AN. Evidence-based metoclopramide in the treatment of established postoperative nausea and vomiting. Randomized trials of magnesium in acute myocardial 75:391-393. 337:536-542. suspected acute myocardial infarction: results of the second Leicester . 26. 20. N Engl J Med 1996. 87:1277-1289. Tramer MR.

McLeod RS. dissemination and impact of consensus recommendations. Anesthesiology 1993. Routine preoperative testing: a systematic review of the evidence. 346:837-840. 342:1317-1322. Br J Surg 1999. A defense of the small clinical trial: evaluation of three gastroenterological studies. Practice Guidelines for Pulmonary Artery Catheterisation: a report by the American Society of Anesthesiologists Task Force on Pulmonary Artery Catheterisation. present and future. Br J Surg 1997. 51. meta-analysis. Russell IT. 1(12). Nicholl J.169:380-383. 40. 47. 84:1220-1223.is there scientific evidence on which to base treatment? [editorial] Br J Anaesth 1999. Power I. Booth A. McRae KD. J Cardiothorac Vasc Anesth 1996. ISIS-4 Collaborative Group. Systematic reviews and meta-analyses: the value for surgery. Munro J. Practice guidelines for obstetrical anesthesia: a report by the American Society of Anesthesiologists Task Force on Obstetrical Anesthesia. Rowe J. and evidence-based medicine 12 1 Intravenous Magnesium Intervention Trial (LIMIT-2). 339:816-819. 41. oral mononitrate. 345:669-665. Lancet 1995. Anesthesiology 1999. 42. Med J Aust 1998.Large trials. 78:380-394. American Society of Anesthesiologists Task Force on Management of the Difficult Airway: practice guidelines for management of the difficult airway. Roberts RS. 39. Surgery and the randomised controlled trial: past. Thorpe M.10:540-552. An assessment of clinically useful measures of the consequences of treatment. Heaty MJR et al. Laupacis A. 49. Various authors: Evidence-based medicine. Health Technology Assessment 1997. 90:600-611. Ann Rev Pub Health 1991. Sackett DL. 318:1728-1733. Chagla L. BMJ 1986. 86:977-978. Report of the American College of Cardiology/American Heart Association Task Force on practice guidelines (Committee on Perioperative Cardiovascular Evaluation for Noncardiac Surgery). Solomon MJ. Lancet 1992. N Engl J Med 1988. Ellis J. 37. 48. 45. Lancet 1993. Anesthesiology 1993. 82:817-819 50. Cousins MJ. Special report: guidelines for perioperative cardiovascular evaluation for noncardiac surgery. 346:407-410. Effect of clinical guidelines on medical practice. Grimshaw JM. Inpatient general medicine is evidence based. Acute pain . 44. A systematic review of rigorous evaluations. ISIS-4: a randomised factorial trial assessing early oral captopril. Lancet 1995. ACC/AHA Task Force Report. 46. McCulloch P Surgical practice is evidence based. Mulligan I. Sackett DL. Lancet 1995. 292:599-602. Sheldon TA. Howes N. Lomas J. 43.12:41-65. and intravenous magnesium sulphate in 58 050 patients with suspected acute myocardial infarction. . 38. 78:597-602. Smith G. 36. Words without action? The production. Powell-Tuck J.

Prevalence of statistical errors in anaesthesia journals Advances in clinical practice depend on new knowledge. 6-8 Avram et al. We otherwise risk stagnation. subgroup analyses and interim analysis -misuse of parametric tests -misuse of Student's t-test -repeat (`paired') testing -misuse of chi-square . or its misinterpretation. and use of two-sample tests for more than two groups. ignoring repeated measures or paired data.1-5 These errors are also prevalent in the anaesthetic and intensive care literature. standard error -misuse of correlation and simple linear regression analysis -preoccupation with P values -overvaluing diagnostic tests and predictive equations A statistical checklist Key points Obtain statistical advice before commencement of the study. Approximately 50% of such published reports contain errors in statistical methodology or presentation. . can be even more harmful to patient care.small numbers -standard deviation vs. Yet conclusions based on poor research. mostly gained through medical research. Consider inclusion of a statistician as a co-researcher. Goodman$ surveyed five abstract booklets of the Anaesthesia Research Society (UK) and found that 61 of 94 abstracts (65%) contained errors. 7 evaluated the statistical analyses used in 243 articles from two American anaesthesia journals (Anesthesiology and Anesthesia and Analgesia) and found common errors included treating ordinal data as interval data. The detailed reporting of medical research usually occurs in any of a large number of peer-reviewed medical journals.Statistical errors in anaesthesia Prevalence of statistical errors in anaesthesia journals Ethical considerations How to prevent errors What are the common mistakes? -no control group -no randomization -lack of blinding -misleading analysis of baseline characteristics -inadequate sample size -multiple testing. uncorrected multiple comparisons.

and money for the project. inadequate presentation of data (to enable interpretation of P value). but remains dependent on the statistical knowledge of the journal reviewers and editor. but may not be aware of fundamental assumptions underlying some of the tests they employ. 6 Similar statements have been made by others. for negative studies. no consideration of type II error.Statistical errors in anaesthesia 123 These included failure to identify which statistical tests were used. One solution is to include a statistician in the process. Most statistical analyses are performed by researchers who have some basic understanding of medical statistics. On some occasions it is apparent that researchers reproduce a previous study's methodology (including statistical techniques). but this can delay publication. How to prevent errors A research paper submitted for publication to a medical journal normally undergoes a peer review process. 2. then the results become invalid and the conclusions may well be inappropriate. identification of which statistical tests will be applied (and on what data).9 Flaws in research design and errors in statistical analysis obviously raise ethical issues. At worst. a poorly designed research project should not be approved by an institutional ethics committee unless it is satisfied that the project is likely to lead to valid conclusions. Ethical considerations As stated in the Preface to this book. Therefore. definition of outcome measures. Scrutiny at this early stage will do much to avoid the multitude of errors prevalent in the anaesthetic and intensive care literature. may be unachievable for many journals and may not avoid all mistakes. ethical review should include scientific scrutiny of the research design. perpetuating mistakes. This can vary. and a reasonable estimation of how many patients will be required to be studied in order to prove or disprove the hypothesis under investigation. 'If valid data are analyzed improperly. At best. This paradoxical process may only serve to identify papers that already have statistician involvement (as an author) . paying particular attention to methods of randomization and blinding (if appropriate). the net effect is to waste time. It would be unethical to proceed. misuse of standard error and. In fact. Longnecker wrote in 1982. therapeutic decisions may well be based upon invalid conclusions and patients' wellbeing may be jeopardized'. nor of pitfalls in their execution. Some journals only identify those papers with more advanced statistical methods for selective assessment by a statistician. effort. This detects many mistakes. These mistakes often lead to misleading conclusions.

The choice of statistical tests depends on the type of data collected (categorical. then advice from an experienced clinical researcher may also be of assistance (although. 'specialist' consultation occurs in most other areas of clinical practice! If in doubt. What information to include. increasing the medical researcher's confusion. If this is not available. After all. we have found that the following errors (or deficiencies) . It cannot be stressed strongly enough: the best time to obtain advice is during the process of study design and protocol development. which has a multitude of deficiencies. is perhaps best dictated by the policies of the particular journal (see Table 11. inclusion of a statistician as a co-researcher almost always provides a definitive solution.2 These can seriously increase the risk of bias. at the end of this chapter). of course. If in doubt. Specifically. Exactly what data to collect. the best habit is to have a low threshold for seeking advice. is to know which studies require more advanced statistical methods and assistance from a statistician . Some of these issues are addressed in more detail in Chapter 4. Another common problem is inadequate description of methods (including which statistical tests were used for analysing what data). Complete. this may only perpetuate mistakes). what scientific questions are being asked).e. hopefully addressed in part by this book. is for researchers to further develop their knowledge and understanding of medical statistics. The growing market in introductory texts and attendance of medical researchers at statistical courses would appear to be addressing the problems. The skill. and in what form. making the researcher (and reader) susceptible to misleading results and conclusions. 2 Statisticians can also disagree on how research data should be analysed and presented. valid presentation of study data can be compromised by a journal's word limit and space constraints. in our experience.1. It is very frustrating to receive a pile of data from an eager novice researcher. are fundamental components of the research design. The ultimate solution. no randomization to treatment groups (or poorly documented randomization) and inadequate blinding of group allocation. ordinal. and how. This can be found in a journal's 'Advice to Authors' section. numerical) and the exact hypotheses to be tested (i. when. 1.this should not be undervalued or resisted. This can rarely be corrected after the study is completed! Where possible.124 Statistical Methods for Anaesthesia and Intensive Care and miss the majority of papers (which do not include a statistician) that are flawed by basic statistical errors. direct advice can also be sought from the journal's editor. Readers should be reassured that most studies can be appropriately analysed using basic statistical tests. What are the common mistakes? The commonest errors are often quite basic and relate more to research design: lack of a control group. Often the study has not been designed to answer the question.

10 This leads to a biased selection. 11. unpaired -one-tailed vs. the study should be of sufficient size to detect them). because of different baseline characteristics. If the control group is given a placebo treatment. No control group No randomization Lack of blinding Misleading analysis of baseline characteristics (confounding) Inadequate sample size (and type II error) Multiple testing Misuse of parametric tests Misuse of Student's t-test -paired vs. a contemporary. 5. . representative control group should be used. 8. 3.in which case.Statistical errors in anaesthesia 12 5 are common in the anaesthetic and intensive care literature: 1. leads to a falsely extreme (high or low) value. No control group To demonstrate the superiority of one treatment (or technique) over another requires more than just an observed improvement in a chosen clinical endpoint. two-tailed -multiple groups (ANOVA) -multiple comparisons Repeat ('paired') testing Misuse of chi-square . or methods used for outcome assessment. It is difficult to detect 'regression to the mean' unless a control group is included. an antiemetic is an antiemetic. 12. 2. then the question being asked is 'does the new treatment have an effect (over and above no treatment)?' This is a common scenario in anaesthesia research which only shows that. 7. 6. 14. stability and deterioration). standard error Misuse of correlation and simple linear regression analysis Preoccupation with P values Overvaluing diagnostic tests and predictive equations. then the question being asked is 'does the new treatment have an equal or better effect than the current treatment?' This has more clinical relevance. Placebo-controlled studies have very little value (other than for detecting adverse events . through biological variation or measurement error. 9. or an inotrope is an inotrope. 13. 10. Use of an historical control group may not satisfy these requirements. that on re-measurement will tend towards the population mean (which is less extreme). quality or quantity of treatment. A reference group should be included in order to document the usual clinical course (which may include fluctuating periods of improvement. Regression to the mean occurs when random fluctuation. 4. equivalent.small numbers Standard deviation vs. for example. This is a common error if group measurements are not stabilized or if there is no control group. If the control group is given an active treatment. whereby a group has a spuriously extreme mean value. In most situations.

Unblinded studies remain unconvincing.05. Lack of blinding It is tempting for the subject or researcher to consciously or unconsciously distort observations. and investigator (sometimes referred to as triple-blind) can dramatically reduce these sources of bias. in order to reduce bias. Knowledge of group allocation should be kept secure (blind) until after the patient is enrolled in a trial. The commonest method is to use sealed. then performing significance tests only tests the success of randomization! With a significance level of 0. Other methods are available which can assist in equalizing groups. roughly one in 20 comparisons will be significant purely by chance. If treatment allocation is randomized. Every attempt should be made to maximize blinding in medical research. This is usually dictated by referring to a table of random numbers or a computer-generated list. such as stratification and blocking (see Chapter 4). but it may also be caused by random fluctuation in pain levels: patients with high levels are more likely to have a lower score on retesting and so the average (group) pain level is much more likely to be reduced. and patients were selected on the basis that they had severe pain (as measured by VAS on one occasion). recordings. confounding). observer (double-blind). opaque envelopes. if a study were set up to investigate the potential benefits of acupressure on postoperative pain control. measurement. This avoids selection bias and increases the generalizability of the results. These are very useful modifications to simple randomization. but have been under-used in anaesthesia research.e. then it is likely that VAS measurements at a later time will be lower. No randomization The aim of randomization is to reduce bias and confounding. Misleading analysis of baseline characteristics It is not uncommon for patient baseline characteristics to be compared with hypotheses testing. The first does not alter interpretation of the study. which allocates groups in such a way that each individual has an equal chance of being allocated to any particular group and that process is not affected by previous allocations.126 Statistical Methods for Anaesthesia and Intensive Care For example. data cleaning or analyses. The . Blinding of the patient (single-blind). randomization tends to equalize baseline characteristics (both known and unknown) which may have an effect on the outcome of interest (i. In large trials. Is this evidence of a beneficial effect of acupuncture? It may be a result of the treatment given. The commonest method is simple randomization. This is wrong for two major reasons. All eligible patients should be included in a trial and then randomized to the various treatment groups. yet remains senseless.

15 This is because each comparison has a probability of roughly one in 20 (if using a type I error.8. 11-13 Inadequate sample size A common reason for failing to find a significant difference between groups is that the trial was not large enough (i. This is often called 'a fishing expedition' or 'data dredging'. The simplest method to lessen this problem is to stratify the patients according to one or two important confounding variables (e. or during interim testing while a trial is in progress. Just because 'there was no statistically significant difference between the groups' does not imply that there were no subtle differences that may unevenly affect the endpoint of interest. Multiple testing therefore increases the risk of a type I error. the anticipated mean and variance. type of surgery) before randomizing patients to the respective treatment groups. taking the covariate into account.g. along with an estimation of the difference between groups that is being investigated. Minimization of this occurrence requires consideration of the incidence rate of the endpoint of interest or. for numerical data. yet such an imbalance may have an important effect on the outcome of interest.e. consideration of the treatment effect size (if any) and likelihood of a type Il error should be addressed by the authors. There are some excellent papers which explore these issues in greater depth. An apparent small (statistically non-significant) difference at baseline for a factor that has a strong effect on outcome can lead to serious confounding. after he/she has been provided with the relevant baseline information. of 0. Authors should certainly describe their group baseline characteristics.05) of being significant purely by chance. where the null hypothesis is incorrectly rejected. gender. Rare outcomes. consider the possibility of confounding. 16-1 . did not enrol a sufficient number of patients). That is. preoperative risk. An approximate sample size can then be calculated (see Chapter 3). but not falsely reassure the readers that 'there is no significant difference' between them. and multiple comparisons magnify this chance accordingly.14 This is a type II error. where the null hypothesis is accepted incorrectly. or a value. require very large studies. This can be interpreted by the reader using clinical judgment. there may be a clinically significant difference between the groups which is not detected by significance testing. Multiple testing. or small differences between groups.Statistical errors in anaesthesia 12 7 second reason is more important and may affect interpretation of results. there are some advanced multivariate statistical techniques available which adjust the results post hoc (after the event). If an imbalance in baseline characteristics is found to exist at the end of the trial (which may well occur by chance!). subgroup analyses and interim analysis Multiple comparisons between groups will increase the chance of finding a significant difference which may not be real. If a study concludes 'no difference' between groups. This is known as confounding. A similar problem occurs when multiple subgroups are compared at the 8 end of a trial.

if more than two groups are being compared. is determined by the previous measurement. and was this preplanned (before analysing the data)? Repeat (`paired') testing If an endpoint is measured on a number of occasions (e. then the paired t-test must be used. This is gravely misleading. Ordinal data is best analysed using non-parametric tests (such as Mann-Whitney U test or Kruskall-Wallis analysis of variance). If the groups are related to one another (i.e. at least in part. either data transformation or non-parametric tests should be used. then the most appropriate method is to use repeated measures analysis of variance. predetermined rationale for only exploring an increase. even if measured on an ordinal scale. then the unpaired t-test is used. n < 20). intea-group variance is lower than inter-group variance) and so differences can be more easily detected. The Mest can only be used to compare two groups. A paper using a one-tailed t-test should be scrutinized: was there a valid reason for only investigating a difference in one direction. If a group endpoint is measured on two occasions then a paired t-test (or non-parametric equivalent) can be used. In general. For these.g. If three or more measurements are made. then a one-tailed t-test may be appropriate. . In most cases a difference between groups may occur in either direction and so a two-tailed t-test is used. then the data can be considered as continuous. such as pain or perioperative risk. as well as independence of the data and equality of variance (see Chapter 5). This is unlikely to be satisfactorily achieved with smaller studies (say. Unfortunately. If there is a clear. Kolmogorov-Smirnov test). and/or analysing the distribution using a test of goodness of fit (e. a one-tailed t-test is usually selected to lower a P value so that it becomes significant (which a two-tailed test failed to achieve). Some statisticians accept that if the observations have an underlying theoretical continuous distribution. it is important to verify assumptions of normality when using the t-test. or a decrease. If the groups are independent (this is the usual situation). then individual comparisons can be made to identify at which time the differences were significant (adjusting P values for multiple comparisons). 22 Misuse of Student's t-test As stated above. or cardiac index during ICU stay). then analysis of variance should be used. or two or more groups are to be compared on a number of occasions. If a significant difference is demonstrated overall. measurement of postoperative pain. This argument is most credible for larger studies. The amount of individual patient variation over time is much less than that between patients (i. parametric tests should only be used to analyse numerical data. dependent).g.e.Statistical errors in anaesthesia 12 9 achieved by plotting the data and demonstrating a normal distribution. as with comparing a group before and after treatment. then any subsequent measurement.

24 Standard error is a derived valuet used to calculate 95% confidence intervals. paired categorical data). looking at one dichotomous endpoint. an adjustment factor is needed. Chi-square should not be used if the groups are matched or if repeat observations are made (i. the X distribution is a continuous distribution and the 2 calculated X2 statistic is an approximation of this. or the area under a curve (with time on the chi-axis ). It should not be confused with standard deviation.130 Statistical Methods for Anaesthesia and Intensive Care An alternative approach is to use summary data that describe the variable of interest over time. reducing the number of categories and increasing the number in each). Standard deviation (SD) is therefore a measure of variability and should be quoted when describing the distribution of sample data. Yates' correction should be used. This subtracts 0. and so is a measure of precision (of how well sample data can be used to predict a population parameter). On some occasions it may be acceptable to use standard error bars on graphs (for ease of presentation). These considerations become less important with large studies. When small numbers are analysed.96 standard deviations of the mean.23 Misuse of chi-square . The correlation coefficient is a measure of linear association and linear regression is used to describe that linear . This may be the overall mean of repeated measurements.52 Misuse of correlation and simple linear regression analysis These techniques are used to measure a linear relationship between two numerical variables. For this reason. nor used to describe variability of sample data. McNemar's test can be used in these situations. then Fisher's exact test should be used (for larger contigency tables the categories can be collapsed.5 from each component in the X2 equation. but on these occasions they should be clearly labelled.7-9.* If two or more cells in a 2 x 2 contingency table of expected values have a value less than 5. For a 2 x 2 contingency table. It has been suggested that the correct method for presentation of normally distributed sample data variability is mean (SD) and not mean (± SD). 95% of data points will lie within 1. there is an artificial incremental separation between potential values. Standard deviation vs.e. Standard error is a much smaller value than SD and is often presented (wrongly) for this reason. standard error In a normal distribution. where there are only two groups being compared. 2.small numbers Mathematically.

multiple measurements from each patient should not be plotted together and treated as independent observations (this is a very common mistake in anaesthesia and intensive care research). Multiple measurements from each patient should not be analysed using correlation or regression analysis as this will lead to misleading conclusions. it ignores the more important information from a trial: how large is the treatment effect? The 95% confidence interval (CI) for effect describes a range in which the size of the true treatment effect will lie. 28 As with correlation. If doubt exists. From this the clinician can interpret whether the observed difference is of clinical importance. The data should also be independent. This means that each data point on the scatterplot should represent a single observation from each patient. a large trial will have a small standard error and so narrow 95% CI: a more precise estimate of effect. overvaluing diagnostic tests and predictive equations Diagnostic tests can be described by their sensitivity (true positive rate) and specificity (true negative rate). .8. then the data can be transformed (commonly using log-transformation) or a non-parametric method used (e.g.Statistical errors in anaesthesia 13 1 relationship. It does not describe how large the difference is. 26 Variables with a mathematical relationship between them will be spuriously highly correlated because of mathematical coupling. nor whether it is clinically significant. Preoccupation with P values Too much importance is often placed on the actual P value. or if the distribution appears non-normal after visualizing a scatterplot. the corresponding values of the dependent [outcome] variable are normally distributed). A regression line should not exceed the limits of the sample data (extrapolation). These analyses assume that the observations follow a normal distribution (in particular. Further details can be found in Chapter 7.29 A P value describes the probability of an observed difference being due to chance alone. This only tells us what proportion of positive and negative tests results are correct (given a known outcome). that for any given value of the independent [predictor] variable. Normally a scatterplot should be included to illustrate the relationship between both numerical variables. and a non-significant P value in a small trial may conceal an effect of profound clinical significance. Repeated measures over time should also not be simply analysed using correlation. Spearman rank correlation). A P value is only a mathematical statement of probability. 27 Further details can be found in Chapter 7. a P value is affected by the size of the trial: a highly significant P value in a large trial may be associated with a trivial difference. Bland and Altman have described a suitable method. Importantly. Neither correlation nor regression should be used to measure agreement between two measurement techniques. In general. rather than the size of the treatment effect.

30 It is common for authors to report optimistic values for these indices. Was there an acceptable delay from allocation to commencement of treatment? 9. Were the drop-outs described by treatment/ control groups? 18. Is the paper of acceptable statistical standard for publication? 26. Was there a power based assessment of adequacy of sample size? 13.13 2 Statistical Methods for Anaesthesia and Intensive Care Of greater clinical application is the positive and negative predictive values of the test (PPV and NPV respectively). any test (irrespective of its diagnostic utility) will tend to have a high PPV A similar test in another situation where disease prevalence is low will tend to have poor PPV. could it become acceptable with suitable revision? Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Unclear Unclear Unclear Unclear Unclear Unclear Unclear Unclear Unclear Unclear Unclear Unclear Unclear No No No No No No No No No No No No No Yes Yes Yes Yes Yes Unclear Unclear Unclear Unclear Unclear No No No No No Yes Yes Yes Yes Yes Yes Unclear Unclear No No No No No No Unclear Yes Yes No No . Were prognostic factors adequately considered? 22. Was there a satisfactory statement given of diagnostic criteria for entry to trial? 3. Were confidence intervals given for the main results? 24.1. Were the treatment and control groups comparable in relevant measures? 15. Were the treatments well defined? 6. the context in which the diagnostic test was evaluated should be considered . A statistical checklist used by the British Medical journal (after Gardner et al. Was a high proportion of subjects followed up? 16. Was there a satisfactory statement given of source of subjects? 4.does the trial population in which the test was developed represent the clinical Table 11. Was the duration of post-treatment follow-up stated? Commencement of trial 14. yet high NPV Therefore. Were the outcome measures appropriate? 12. Was the objective of the trial sufficiently described? 2. which inform us of the likelihood of disease (or adverse outcome) given a test result. Was the method of randomization described? 8. If 'No' to Question 25. yet both are dependent on the prevalence of disease . Were side-effects of treatment reported? Analysis and presentation 19. Did a high proportion of subjects complete treatment? 17. Were the statistical analyses used appropriate? 21.if a disease (or outcome) is common. Was the conclusion drawn from the statistical analysis justified? Recommendation 25. Was the presentation of statistical material satisfactory? 23. Was there a satisfactory statement of criteria for outcome measures? 11. 4) Design features 1. Was there a statement adequately describing or referencing all statistical procedures used? 20. Was the potential degree of blindness used? 10. Was random allocation to treatment used? 7. Were concurrent controls used (as opposed to historical controls)? 5.

8. 309:1291-1299. 14. v . Gardner MJ.1. 299:690-694. Altman DG. 57:73-74. Statistical methods in anesthesia articles: an evaluation of two American journals during two six-month periods. Goodman NW. Br Med J 1980. 16. Dore CJ. Statistics: the problem of examining accumulating data more than once. 61:1-7. 2. 4. A nesthesiology 1982. Comparing the means of several groups. Frieman JA. N Engl J Med 1985. Chalmers TC. McPherson K. Br J A naesth 1992. Glantz SA. Biostatistics: how to detect. . Machin D. 335:149-153. Pocock SJ. Support versus illumination: trends in medical statistics [editorial]. N Engl J Med 1978. Lavori PW. Altman DG. Randomisation and baseline comparisons in clinical trials. 13. Gore SM. Statistics in practice. Altman DG. Smith H et al. 286:1489-1493. Jones IG. Campbell MJ. Longnecker DE. Yudkin PL. correct and prevent errors in the medical literature. 7. Lancet 1988. Lancet 1996. 290:501-502. How to deal with regression to the mean in intervention studies. Statistical checklist The British Medical journal used a statistical checklist . A nesth A nalg 1985. Misuse of statistical methods: critical assessment of articles in BMJ from January to March 1976. Avram MJ. Dykes MHM et al. Statistical guidelines for contributors to medical journals. Bulpitt CJ. 3.analysing data. Godfrey K. Hughes AO. 64:607-611. ii:31-34. the type II error and sample size in the design and interpretation of the randomized controlled trial. 11. preferably at other institutions before accepting their clinical utility. Lancet 1990. Louis TA. References 1. should be able to predict that original data set well. Br Med J 1986. 281:1473-1475. 292:810-812. Br Med J 1983. 313:1450-1456. Statistics and ethics in medical research. Br Med J 1977. 6.4 which is reproduced in Table 11. Circulation 1980. 10. 5. 68:321-324. Stratton IM. Shanks CA. 9. Bailar JC. Altman DG. by virtue of its derivation.Statistical errors in anaesthesia 133 circumstances for which the test is to be applied? Was there a broad spectrum of patients studied? Outcome prediction is sometimes based on a risk score or predictive equation developed from a large data set using multivariate analyses which. Gore SM. 1:85-87. Use of check lists in assessing the statistical content of medical studies. Statistician 1985. The importance of beta. N Engl J Med 1983. 15. Polansky M. Rytter EC. Comparability of randomised groups. 34:125-136. 12. Gardner MJ. Subgroup analysis. Statistical awareness of research workers in British anaesthesia. 347:241-243. N Engl J Med 1974. Designs for experiments parallel comparisons of treatment. Such derived tests need to be externally validated using other data sets.

. 23. 20. A nn Intern Med 1992. Mathews JNS. Gardner MJ. Calculating correlation coefficients with repeated observations: part II . or standard error of the mean? [editorial] A naesth Intensive Care 1982. Lancet 1986. Confidence intervals rather than P values: estimation rather than hypothesis testing. Rosner BA. Altman DG. 311:442-448. 24. 300:230-235. Statistics in practice. Campbell MJ. Br Med J 1995. Hochberg Y A sharper Bonferroni method for multiple tests of significance.134 Statistical Methods for Anaesthesia and Intensive Care 17. Biometrics 1987. Michels KB. Gardner MJ. Altman DG. Abramson NS. 27. Predicting outcome in anaesthesia: understanding statistical methods. 292:746-750. Bland JM. 10:297. Analyzing data from ordered categories. 28. Statistical methods for assessing agreement between two methods of clinical measurement. Analysis of serial measurements in medical research. Simpson s paradox and clinical trials: what you find is not necessarily what you prove. 22:447-453. ii:639. A nn Emerg Med 1992. 22.116:78-84. Altman DG. Safar P. A nn Surg 1981. 29. 21:1480-1482. Biometrika 1988. Lancet 1996. 30. Interim analyses in randomized clinical trials: ramifications and guidelines for practitioners. N Engl J Med 1984. Myles PS. 21. Oxman AD. ii:307-310. 43:213-223. 26. 348:1152-1153. Guyatt GH. Altman DG. Powell J. Br Med J 1986. Moses LE. Kelsey SF. Emerson JD. Horan BE Standard deviation. Pocock SJ. 25. 310:633. A naesth Intensive Care 1994. 75:800-802. Altman DG. Data trawling: to fish or not to fish. Sutton-Tyrrell K. Br Med J 1990. Williams NJ. 18.correlation between subjects. Royston P. 19. Hosseini H. A consumer's guide to subgroup analysis. Archie JP Mathematical coupling of data: a common source of error. Presentation of variability. Bland MJ. Geller NL. Lancet 1986. 193:296-303.

safety monitoring. Laboratory research usually investigates underlying mechanisms of disease or aspects of drug disposition.6 One of the main aims of medical research is to produce convincing study results and conclusions that can ultimately improve patient outcome.1.2. endpoints -study design -define groups. Develop a study protocol -background -aim. Each are important and have their strengths. data recording -sample size. animals. healthy humans. Unbiased sample selection and measurement will improve the reliability of the estimates of the population parameters and this is more likely to influence anaesthetic practice. This chapter is primarily addressing aspects of clinical research.2 The most reliable study design is the randomized controlled trial. Regulation -drug licensing -ethics committee approval and informed consent. In most circumstances medical research consists of studying a sample of subjects (cells.12 How to design a clinical trial Why should anaesthetists do research? Setting up a clinical trial Data and safety monitoring committee Phase I-IV drug studies Drug regulations Key points Role of the ethics committee (institutional review board) I nformed consent Successful research funding Submission for publication Define the study question(s): what is the aim and study hypothesis? Perform a literature review. with variable infrastructure.but 4 other designs have an important role in clinical research. an generate preliminary data for consideration. measurement techniques. Clinical research occurs in patients. or patients) so that inferences can be made about a population of interest. of variable quality and support. equipment and staffing. intervention(s) -measurements. Use a pilot study to test your methods. Clinical research is undertaken by a broad array of researchers. Most anaesthetists undertaking laboratory research are supervised in an experienced (hopefully wellresourced) environment.5. Epidemiology is the study of disease in populations. The best studies are those that answer important questions reliably1. hypothesis. statistics (and get advice from a statistician) -adverse events. .

An essential.get advice from a statistician (i) reporting of adverse events. Thus there are imperatives to 'do research'. timing of intervention (f) clear. 3.8 Involvement in the processes required to complete a successful research project can teach critical appraisal skills. is to state the study hypothesis. explain why you are doing this study. conduct. What is the current evidence in the literature? What questions remain unanswered? What deficiencies exist in previous studies? In other words. and those with an interest in research. but these can also be explicitly taught at the undergraduate and postgraduate levels. randomization. This should be avoidable. 2. prospective.10 (h) details of statistical methods . Cynicism is often generated by those who have had poor research experiences. . design.13 6 Statistical Methods for Anaesthesia and Intensive Care Why should anaesthetists do research? Identification of a clinical problem. measurement instruments (g) sample size calculation based on the primary endpoint 9. and subsequent development and participation in a study hypothesis. Develop a study protocol. Ultimately. criteria for inclusion and exclusion (define population) (e) treatment groups. despite some having a lack of interest. what are the aims and significance of the project? Identify a primary endpoint (there may be several secondary endpoints). funded and staffed research environment. the study design must be able to answer the hypothesis. control group.7. it should be clearly defined. Perform a literature review. defined times. Unfortunately much research is poor and co-investigators may have little involvement in its development and conduct. blinding. (a) background . should be guided and supported in a healthy. parallel or crossover design) .a good study design minimizes bias and maximizes precision (d) study population. Setting up a clinical trial The major steps involved in setting up a clinical trial are: 1. including under what conditions it is measured and recorded. often neglected step. Define the study question(s). Explicitly. outline why this study should be undertaken (b) clear description of aim and hypothesis (c) overview of study design (retrospective vs. Previous studies may help in designing a new study. Many specialist training schemes demand completion of a research project before specialist recognition is obtained. support or specific training. analysis and writing of a research project can be a rewarding experience. concise data collection. Anaesthetic trainees. safety monitoring.previous published research. and consultant appointment or promotion usually includes consideration of research output.

3. Phase II: selected clinical investigations in patients for whom the drug is intended. aimed at establishing a dose-response ('dose-finding') relationship. Phase IV studies are mostly designed and conducted by independent investigators. Is the recruitment rate feasible? 5. This is an important and neglected process. or (b) one treatment is associated with serious risks. Phase IV: is post-marketing surveillance involving many thousands of patients. Satisfy regulations.How to design a clinical trial 137 4. Phase I: this is the first administration in humans (usually healthy volunteers).1 -13 Phase I-IV drug studies New drug compound development can take up to 15 years and cost US$700 million to get to market.1). potential risks and cost analyses. Data and safety monitoring committee Clinical trials may be stopped early if (a) the superiority of one treatment is so marked that it becomes unethical to deny subsequent patients the opportunity to be treated with it. The study protocol assumptions and methodologies need to be tested in your specific environment. Phase III: is full-scale clinical evaluation of benefits. Modify and finalize the study protocol. Drug regulations Most countries have restrictions on the administration and research of new drugs in humans (Table 12. It is an opportunity to test measurement techniques and generate preliminary data that may be used to reconsider the sample size calculation and likely results. which is divided into four phases: 1. Perform a pilot study. Phase I trials often only include 20-100 human subjects before moving on to phase II trials. Laboratory and animal testing of new drug compounds eventually lead to human testing. There are established good clinical . An independent data and safety monitoring committee (DSMC) should be established to monitor large trials. They are usually guided by predetermined stopping rules derived from interim analyses. Pharmaceutical companies usually design and sponsor phase I-III studies. This should be agreed to and understood by all study investigators. and can advise early stopping of a trial in the above circumstances. 4. Drug licensing and ethics committee approval. as well as some evidence of efficacy and further safety. 2. The aim is to confirm (or establish) basic drug pharmacokinetic data and obtain early human toxicology data. 6.

teratogenicity.uk/mca/csmhome.htm www. UK.gov.htm www.1 Websites for government agencies responsible for new drug research or the conduct of c l ini cal tri a ls Agency Australia Therapeutic Goods Administration (TGA) Australian Health Ethics Committee (AHEC) Canada Therapeutic Products Programme (TPP) Medical Research Council of Canada Europe European Medicines Evaluation Agency (EMEA) International Conference on Harmonisation (ICH) United Kingdom Department of Health Research and Development Medicines Control Agency Committee on Safety of Medicines (CSM) Medical Research Council (MRC) United States Food and Drug Administration (FDA) Center for Drug Evaluation and Research National Institutes of Health (NIH) NIH Ethics Program Website www.gov.html www.mrc. In Australia.gov. maintenance of patient confidentiality.org/ichl.open.mrc.od. usually because extensive evaluation has occurred in one of a number of key index countries (Netherlands. CTX: the clinical trials exemption scheme 2.health. ethics approval and informed consent.gov. maintenance of accurate and secure data. These include that a principal investigator should have the relevant clinical and research expertise.ht m www. The CTX scheme requires an expert committee to evaluate all aspects of the drug pharmacology.c a www2.health.open.uk/research/index. organ dysfunction and other reported side-effects) and benefits.uk/mca/mcahome. CTN: the clinical trials notification scheme.ca/hpb-dgps/therapeut / www.gov/cder/ www. New Zealand.tga.htm l www. Sweden.fda.gov.fda.eudra. The CTN scheme bypasses this evaluation. USA).u k www.htm www.au/nhmrc/ethics/contents.nih.gov ethics. adequate staffing and facilities.ifpma.nih.go v www.a u www.138 Statistical Methods for Anaesthesia and Intensive Care Table 12. there be a formal study protocol.doh. and there be processes to report adverse events. In this .go v research practice (GCRP) guidelines for clinical investigators and pharmaceutical companies.gc.ac.hc-sc.gc. the Therapeutic Goods Administration (TGA) of the Commonwealth Department of Health and Aged Care approves new drug trials under one of two schemes: 1.org/emea. including potential toxicology (mutagenicity.

How to design a clinical trial 139 circumstance. The Secretariat of the Medicines Division of the Department of Health will issue a CTX certificate after evaluation.* 17 Most countries have developed ethical guidelines based on these principles. In each of these countries there are similar processes required for new therapeutic devices. In Canada. such as implantable spinal catheters and computercontrolled infusion pumps. Japan and the USA. for which laboratory investigation. . evaluates new drugs through an Investigational New Drug (IND) application.1. The UK. the Center of Drug Evaluation and Research. In the USA. the local ethics committee accepts responsibility for the trial.org/library/ethics/helsinki . the Licensing Division of the Medicines Control Agency of the Department of Health is responsible for the approval and monitoring of all clinical drug trials. Poor research leads to misleading results. and so is unethical. The different regulations and standards that have existed in different countries have been an obstacle to drug development and research in humans. and the pharmaceutical industry. Clinical research can start after 30 days. Clinical research should be thoroughly evaluated and supported within an institution so that it has the best chance of being successfully completed and providing reliable results. This has prompted co-operation and consistency between countries. In the UK. Ethics committee approval has a role in ensuring good-quality research. One of the more significant advances has been the International Conference on Harmonisation (ICH) of Technical Requirements for Registration of Pharmaceuticals for Human Use.16 and the Declaration of Helsinki in 1964. is also guided by the European Medicines Evaluation Agency (EMEA). Role of the ethics committee (institutional review board) Advances in medical care depend on medical research. this is overseen by the Therapeutic Products Programme (TPP). a Food and Drug Administration (FDA) body of the Department of Health and Human Services. followed by experimentation on animals and healthy volunteers leads to research on patients.14 Ethical considerations include the Hippocratic principle of protecting the health and welfare of the individual patient as well as the utilitarian view of the potential benefit for the majority vs. These considerations were explored by earlier investigations into ethical research in humans. wastes resources and puts patients at risk. Other countries have similar processes which can be found on the world wide web. This includes the regulatory authorities of Europe. risk to a few. * www.cirp. along with most other European countries. or via links from websites included in Table 12. such as the Nuremberg Code of 1949 15. More extensive phase II-III trials are conducted only after a Clinical Trial Certificate (CTC) is issued and the drug data reviewed by the Committee on Safety of Medicines (CSM).

this is governed by the National Health and Medical Research Council (NHMRC) statement on human experimentation and local ethics committees are guided by the NHMRC Australian Health Ethics Committee (see Table 12. All research involving human beings. the National Institutes of Health (NIH) Ethics Program guides research practices. confusion or mistrust may dominate their thought processes and restrict .23. argued cogently that patients are entitled to clear and reasonable information.24 The conflicting roles of researcher and clinician are sometimes difficult to resolve in this situation. Madder. whereby the clinician and patient have no particular preference or reason to favour one treatment over another.26 or that patients are unable to provide truly informed consent.14 0 Statistical Methods for Anaesthesia and Intensive Care In Australia. in an essay on clinical decision-making.1). This requires adequate disclosure of information.25. 18-20 Patient confidentiality must be maintained. Medical colleges and associations also have their own ethical guidelines. should include approval through an established ethical review and approval process. This paternalistic attitude has been rightly challenged. that they can withdraw from the study at any time. A similar situation occurs in the UK where the Department of Health has issued guidelines for research within the NHS (including multi-centre research). competency and understanding.23 Feelings of anxiety. 18. vulnerability. and in Canada this is guided by the Medical Research Council of Canada. 27 Informed consent can be difficult in anaesthesia research.20 or that the clinician is in a better position to consider the relative merits of the research.24 Note that the Declaration of Helsinki includes the words ' The health of my patient will be my first consideration'. Patients approached before elective surgery are often anxious and may also be limited by concurrent disease. This is included in all GCRP guidelines. and that they should be included in decisions regarding their care. and that refusal or withdrawal will not jeopardize their future medical care. I nformed consent Patients should be informed of the nature of the research and be asked to provide informed consent. either observational or experimental. 21 Similar examples can still be found in the literature today 22 The concept of randomization to different treatment groups is a challenging concept for patients (and some doctors). It should be made clear to patients that they are under no obligation to participate. 21 He presented details of 18 studies at a symposium (and later published 22 examples in the New England Journal of Medicine) where no patient consent was obtained. 23 The ethical principle underlying this process includes the concept of equipoise. Some have argued that it is not always necessary to obtain consent. 18. 18 In the USA. and self-determination. potential risks and benefits (if any). A key role in the requirement for informed consent for medical research was played by Beecher in the early 1960s. 19.

private setting. The study should be feasible. It should include a sample size calculation and have detailed statistical analyses. Many argue that such research is important and should be supported. Funding agencies commonly rate applications on a number of criteria.34 Some institutions consider that a family member or next of kin can provide consent in these circumstances. or those in the intensive care unit who are critically ill. NHMRC. with demonstrable ability to successfully recruit patients.e. including from institutions. confused or sedated cannot provide informed consent. colleges.How to design a clinical trial 14 1 their ability to provide informed consent. NIH) is limited to only the top ranked 20% of projects. and maximal precision and relevance. Patients arriving unconscious or critically ill to the emergency department. There should be a clearly stated hypothesis and the study design must be capable of answering it. For example. 20 Alternative randomization methods have been advocated which may address these and other concerns?28-30 but there is little evidence of benefit. In general consent can be waived if the research has no more than minimal risk to the subjects and it can be demonstrated that the research could not be carried out otherwise. MRC. The application must demonstrate that there is minimal bias. This is best achieved with pilot or previous study data. Government or institutional ethics committees usually provide guidelines in these circumstances. adequate time to consider trial information). Other sources of research funds are also available. Patients generally prefer to be approached for consent well in advance.g.31 3 Interestingly. 35 Successful funding is more likely if the proposed study addresses an i mportant question that has demonstrable clinical significance (now or in the future). Under these circumstances the institutional ethics committee accepts greater responsibility until the patient's consent can be sought at a later date (deferred consent). but this may not be legally binding in many countries. 23 Obtaining informed consent for clinical trials on the day of surgery has been studied previously 31-33 and is an important consideration given the increasing trend to day-of-admission surgery. in Australia the NHMRC and the Australian and New . A successful track record of the chief investigator (or mentor) is reassuring. 34 There are many issues at stake when considering the ethics of research and consent in incompetent subjects. 51% of patients preferred not to know about a trial prior to admission as it only increased their level of anxiety 32 Informed consent cannot be obtained in some circumstances. associations and benevolent bodies. but still accept recruitment on the day of surgery if approached appropriately (i. Successful research funding Peer review funding through major government medical research agencies (e. In any case it would be reasonable to inform the patient's family or next of kin of the nature of research so that they have an opportunity to have any concerns or questions answered and be asked to sign an acknowledgement form.

what were the weaknesses (and strengths) of the study design. 3. description of all aspects of the randomization process. Lack of original ideas Diffuse.a reader should be able to reproduce the study results. Results and Discussion. A clear. 4. unfocused. use of a structured abstract.142 Statistical Methods for Anaesthesia and Intensive Care Zealand College of Anaesthetists use the following: 1. Many authors do not do this and it annoys editors and reviewers to such an extent that it may jeopardize a fair assessment! Efforts at maximizing the presentation of the manuscript are more likely to be rewarded.org. Manuscripts are usually set out with an Introduction. Advice on what to include and how a manuscript should be presented can be sought from experienced colleagues (even in other disciplines). clear study endpoints and methods of analyses. 10. Methods. 5. and discussion of potential biases. 2. The simplest and most important message is to follow a target journal's guidelines for authors exactly. * www. 7.37* The essential features include identifying the study as a randomized trial. 8. The ten most common reasons for failure at NIH are:36 1. 2. Editors have a responsibility to their readership and this is what they demand. Scientific merit Track record Originality Feasibility Design and methods International competitiveness. The discussion should follow a logical sequence: what were the study's main findings. 6. definition of the study population. Submission for publication A paper is more likely to be published if it offers new information about an important topic that has been studied reliably. 4. 5. or superficial research plan Lack of knowledge of published relevant work Lack of experience in essential methodology Uncertainty concerning future directions Questionable reasoning in experimental approach Absence of acceptable scientific rationale Unrealistic large amount of work Lack of sufficient experimental detail Uncritical approach. 9. how do they fit in with previous knowledge. . 3. and what should now occur . complete description of the study methodology (including statistical analysed°) is essential .a change in practice and/or further research? The Consolidated Standards of Reporting Trials (CONSORT) statement has defined how and what should be reported in a randomized controlled trial.ama-assn. 6.

How to design a clinical trial

143

References

1. Duncan PG, Cohen MM. The literature of anaesthesia: what are we learning? Can J A naesthesia 1988; 3:494-499. 2. Sackett DL, Haynes RB, Guyatt GH, Tugwell P Deciding on the Best Therapy: A Basic Science for Clinical Medicine. Little Brown, Boston 1991: pp187-248. 3. Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials? Stat Med 1984; 3:409-420. 4. Myles PS. Why we need large randomised trials in anaesthesia [editorial]. Br J A naesth 1999; 83:833-834. 5. Rigg JRA, Jamrozik K, Myles PS. Evidence-based methods to improve anaesthesia and intensive care. Curr Opinion A naesthesiol 1999;12:221-227. 6. Sniderman AD. Clinical trials, consensus conferences, and clinical practice. Lancet 1999; 354:327-330. 7. Goodman NW. Making a mockery of research. BMJ 1991; 302:242. 8. Goodman NW. Does research make better doctors? Lancet 1994; 343:59. 9. Frieman JA, Chalmers TC, Smith H et al. The importance of beta, type II error and sample size in the design and interpretation of the randomized controlled trial. N Engl J Med 1978; 299:690-694. 10. Gardner MJ, Machin D, Campbell MJ. Use of check lists in assessing the statistical content of medical studies. BMJ 1986; 292:810-812. 11. Geller NL, Pocock SJ. Interim analyses in randomized clinical trials: ramifications and guidelines for practitioners. Biometrics 1987; 43:213-223. 12. Pocock SJ. When to stop a clinical trial. BMJ 1992; 305:235-240. 13. Brophy JM, Joseph L. Bayesian interim statistical analysis of randomised trials. Lancet 1997; 349:1166-1169. 14. Department of Health. Ethics Committee Review of Multicentre Research. Department of Health, London 1997 (HSG[97]23). 15. Beales WB, Sebring HL, Crawford JT. Permissible medical experiments. From: The judgement of the Nuremberg Doctors Trial Tribunal. In: Trials of war criminals before the Nuremberg Military Tribunal, 1946-49 vol 2. US Government Printing Office, Washington, DC. 16. Shuster E. The Nuremberg Code: Hippocratic ethics and human rights. Lancet 1998; 351:974-977. 17. World Medical Organisation. Declaration of Helsinki. BMJ 1996; 313:1448-1449. 18. Gilbertson AA. Ethical review of research [editorial]. Br J A naesth 1999; 92:6-7. 19. Schafer A. The ethics of the randomised clinical trial. N Engl J Med 1982; 307:719-724. 20. Ingelfinger FJ. Informed (but uneducated) consent [editorial]. N Engl J Med 1972; 287:4651166. 21. Kopp VJ. Henry Knowles Beecher and the development of informed consent in anesthesia research. A nesthesiology 1999; 90:1756-1765. 22. Madder H, Myles P, McRae R. Ethics review and clinical trials. Lancet 1998; 351:1065. 23. Myles PS, Fletcher HE, Cairo S et al. Randomized trial of informed consent and recruitment for clinical trials in the immediate preoperative period. Anesthesiology 1999; 91:969-978. 24. Freedman B. Equipoise and the ethics of clinical research. N Eng J Med 1987; 317:141-145. 25. Hanna GB, Shimi S, Cuschieri A. A randomised study of influence of twodimensional versus three-dimensional imaging on performance of laparoscopic cholecystectomy. Lancet 1998; 351:248-251. 26. Cuschieri A. Ethics review and clinical trials (reply). Lancet 1998; 351:1065.

14 4

Statistical Methods for Anaesthesia and Intensive Care

27. Madder H. Existential autonomy: why patients should make their own choices. J Ethics 1997; 23(4):221-225. 28. Zelen M. A new design for randomised clinical trials. N Engl J Med 1979; 300:1242-1245. 29. Gore SM. The consumer principle of randomisation [letter]. Lancet 1994; 343:58. 30. Truog RD. Randomized controlled trials: lessons from ECMO. Clin Res 1993; 40:519-527. 31. Mingus IVIL, Levitan SA, Bradford CN, Eisenkraft JB. Surgical patient's attitudes regarding participation in clinical anesthesia research. A nesth A nalg 1996; 82:332-337. 32. Montgomery JE, Sneyd JR. Consent to clinical trials in anaesthesia. A naesthesia 1998; 53:227-230. 33. Tait AR, Voepel-Lewis T, Siewart M, Malviya S. Factors that influence parents' decisions to consent to their child's participation in clinical anesthesia research. A nesth A nalg 1998; 86:50-53. 34. Pearson KS. Emergency informed consent. A nesthesiology 1998; 89:1047-1049. 35. Schwinn DA, DeLong ER, Shafer SL. Writing successful research proposals for medical science. A nesthesiology 1998; 88:1660-1666. 36. Ogden TE, Goldberg IA. Research Proposals. A Guide to Success. 2nd ed. Raven Press, New York 1995: pp15-21. 37. Begg F, Cho M, Eastwood S et al. I mproving the quality of reporting of randomized controlled trials. The CONSORT Statement. JA MA 1996; 276:637-639.

13

Which statistical test to use: algorithms

The following algorithms are presented as a guide for new researchers, in order to assist them in choosing an appropriate statistical test. This is ultimately determined by the research question, which in turn determines the actual way in which the research should be designed and the type of data to collect. In practice, there may be several other tests that could be employed to analyse data, with the final choice being left to the preference and experience of the statistician (or researcher). Many of these statistical tests are modifications of those presented here (and go under different names). Nevertheless, the choices offered here should satisfy most, if not all, of the beginner researcher's requirements. We strongly recommend the reader refer to the appropriate sections of this book in order to find more detail about each of the tests, their underlying assumptions, and how common mistakes can be avoided. Each algorithm has three steps: (a) what type of research design is it, (b) what question is being asked, and (c) what type of data are being analysed. Further description of these issues can be found in Chapters 1 and 4. The algorithms are given in Figures 13.1-13.4: • to compare two or more independent groups - is there a difference? (Figure 13.1) • to compare two or more paired (dependent) groups - is there a difference? (Figure 13.2) • to describe the relationship between two variables - is there an association? (Figure 13.3) • to describe the relationship between two measurement techniques - is there agreement? (Figure 13.4)

146 Statistical Methods for Anaesthesia and Intensive Care Figure 13.2 To compare two or more paired (dependent or matched) groups .1 To compare two or more independent groups Figure 13.

4 To describe the relationship between two measurement techniques .3 To describe the relationship between two variables Figure 13.Which statistical test to use: algorithms 147 Figure 13.

114 Cochran Q test. 126. 72 Binomial test. 72 Binary variable. 54. 80. 130. 1-2. 30. 124 Categorical variable. 135-44 drug regulations. 110 Cox proportional hazards. 127 Alternative hypothesis. 107. 140-1 phase I-IV drug studies. 88 Analysis of variance (ANOVA). 90-2. 130 Clinical practice guidelines. 78. 39 Censoring. 54. 8 Chi-square distribution. 97-8 Before and after studies. 10-11. 76. 113. 103 Bland-Altman plot. 106 Confidence intervals. 80-2 misuse of. 102 Agreement. 58-9 repeated. 9 Cohort study. 139 Conditional probability. 29 Central tendency. 35-6 Case series. 33-4. 97. 17 Breslow test. 131 Confidence limits. 14. 37-9 Co-linearity. 141-2 informed consent. 35 . 124. 69 analysis. 136-7 see also Randomized controlled trials Cochrane Collaboration. 78-9 Bayesian inference. 108 Carry-over effect. 36-7 Case reports. 101 Box and whisker plot. 131 Alpha (a) error. 14 Binary data. 126 Block randomization. 108 misuse of. 7 Controls. 7 measures of. 39. 70. 23 Confounding. 68-9 Chi-square test. 88. 127 Contingency tables. 68-77. 10. 72 Coefficient of determination. 1. 37. 36 Correlation analysis. 21 Analysis of covariance. 89. 55. 101 Committee on Safety of Medicines (UK). 35-6 Categorical data. 22 Crossover design. 102 Arithmetic mean. 126 Blocking. 90. 81. 128 Bootstrapping. 35. 142 role of ethics committee. 130-1 Covariate. 22. 75 Actuarial analysis.131 Blinding. 82 Critical values. 44. 137 publication. 105-11 Actuarial (life table) method. 126 Bimodal distribution. 72. 40-2 Beta error. 84. 55. 41 Case-control study. 127 Alpha (a) value. 88 Causation. 40-2 Cross-sectional studies. 60-3 ANOVA see Analysis of variance APACHE scoring systems. 106 Central limit theorem. 113. 22. 74. 42 Bonferroni correction. 14-15. 43-4. 76. 123 lack of. 72 Bispectral index (BIS). 108. 127 Bias. 68-71. 66. 88 Binomial distribution. 80 Coefficient of variation. 110 Cramer coefficient.I ndex Absolute risk. 30-1 Bayes' theorem. 34. 137-8 funding. 8 Association. 114. 139-40 setting up. 118-19 Clinical trials. 106 Adjusted odds ratio. 23-4. 80. 100.

16. 140 Meta-analysis. 11-15 binomial distribution. 52. 108 Log transformation. 3 Intention to treat. 8 Modus tollens. 20-3 Informed consent. 82 Kolmogorov-Smirnov test. 139 Evidence-based medicine. 123 European Medicines Evaluation Agency. 78. 131 Inferential statistics. 100 Greenhouse-Geisser correction factor. 128. 64-5. 116-18 Exact probability. 51. 94 Mann-Whitney U test. 112 Ethical review. 72 Dichotomous variable. 91 Kendall's coefficient of concordance. 34. 102 Hunyh-Feldt correction factor. 114. 85. 43. 82. 139 Frequency distributions. 129 Investigational New Drug. 71-2. 66 Gehan test. 45-6. 128 Homogeneity of variance. 139 Kaplan-Meier method. 78. 131 Dichotomous data. 89. 43 Missing data. 66 Kurtosis. 83 Minimization. 87. 8 Median survival time. 82-5 misuse of. 46 Discrete scales. 131 McNemar's chi-square test. 20 Fisher Protected Least Significant Difference (LSD). 130-1 multivariate regression. 45. 14 Lambda. 101 Interim analysis. 82 Latin square design. 130 Mallampati score. 70 Dependent variable. 47 Mode. 87. 12-14 Poisson distribution. 88. 137-8 Goodness of fit. 100.108 Matching. 52. 118 Hazard rates. 4 Intraclass correlation coefficient. 61 Hawthorne effect. 75 Independent data. 8 Interval scales. 106-7 Kappa statistic. 4 Discriminant analysis. 139 Degree of dispersion. 59. 8. 78. 139 Interquartile range. 112 Efficacy. 105 Hazard ratio. 20 Incidence. 8 Median. 14-15 normal distribution. 114-16 Method of least squares.150 Index Data accuracy. 14.126 Dunnett's test. 79. 126 General linear model. 1. 83 Logistic regression. 108 Heterogeneity. 128 Effectiveness. 71 Exact test. 16. 85-7 Linear relationship. 69. 59 Geometric mean. 76. 53. 108 Generalizability. 130 Food and Drug Administration (USA). 129. 59 Mantel-Haenszel test. 72-3. 7. 131 Mean. 88 Digit preference. 61 Hypotheses. 30.116 Hochberg procedure. 102 Logrank test. 42-3 Likelihood ratio. 47-8 Interaction. 101. 106 Medical Research Council of Canada. 82 Kendall's tau. 76. 69 Fallacy of affirming the consequent. 16 Incidence rate. 100 Double-blind. 129 Kruskal-Wallis ANOVA. 59. 15 Friedman two-way ANOVA. 16. 80 Hosmer-Lemeshow statistic. 22. 76. 87. 60 Homoscedasticity. 8-10 Degrees of freedom. 59 Fisher's exact test. 52. 76-7. 59 . 79 Line of best fit. 36 Mathematical coupling. 129 Declaration of Helsinki. 129.52 Good clinical research practice guidelines. 47. 46-7 Data checking. 137 International Conference on Harmonisation. 137 Data transformation. 72 MANOVA. 20 Multiple analysis of variance (MANOVA). 131 Independent variable. 91-2 Intra-group variance. 87-9 non-linear regression. 140-1 Integers. 46-7 Data and Safety Monitoring Committee. 97 Linear regression analysis.

126 Randomized controlled trials. 83 Regression to mean. 95 Prevalence. 125 Relative risk. 51-63 analysis of variance. 140 Negative predictive value. 100 Multisample sphericity. 112. 7 Parametric tests. 59 Post-test risk. 39-40 Publication bias. 68. 15 Population. 74-5. 22. 30 Post hoc tests. 16 Ratio scales. 81 Pearson chi-square. 55.Index 15 1 Multiple comparisons. 30. 69 Paired Mest. 84. 116 Numerical data. 25 Predictive equation. 69 Prospective randomized controlled trial.131 Repeat ('paired') testing. 52. 58-9 misuse of. 5. 28. 95 Power. 16 Percentiles. 86 Proportion. 76 National Health and Medical Research Council (Australia). 85-7 Non-parametric tests. 139 O'Brien-Fleming method. 135. 59. 66 Kruskal-Wallis ANOVA. 129. 114 Random error. 40-2 Sensitivity. 131 Outliers. 99 Primary endpoint. 41 Permutation tests. 94-5. 98. 54 One-tailed t-test. 39-40. 68-71 Pearson correlation coefficient. 86 Probit transformation. 87. 88 Risk score. 79. 129 Ordinal data. 129 Parallel groups design. 21 Sampling error. 128 Self-controlled trials. 78. 29-30 Per protocol analysis. 78. 28-9. 48 Poisson distribution. 79 Scheffe test. 78. 135 Positive predictive value.128 Null hypothesis. 12-14.128 n-of-1 trials. 115 P value. 65 Normal approximation. 46 Paired (dependent) data. 80 Percentage. 55. 124 Outcome variable. 95. 52. 3-4. 75-6. 124 Nuremberg Code. 52-8 Partial correlation coefficient.142 see also Clinical trials Rate. 19-20. 87. 85 Risk. 5. 61 Multivariate analysis. 22. 25. 99 Risk factors. 64. 102 One-sample t-test. 37 Risk adjustment. 59. 4 Receiver operating characteristic (ROC) curve. 129-30 Residual. 132 Posterior probability. 127 Number needed to treat (NNT). 16. 66 Mann-Whitney U test. 132 Newman-Keuls test. 41 Nomogram. 60. 16-17 Pre-test risk. 72 Normal distribution. 133 Predictor variable. 39-40 Parameters. 87-9 Multivariate tests. 21. 113 Power analysis. 100. 128-9 repeated ANOVA. 97 Non-linear regression. 100 Risk ratio. 98-9. 140 National Institutes of Health (USA). 2-3. 34 Randomization. 129. 25 Scatter diagram. 88. 78. 35 Odds ratio. 95. 8 Period effect. 19-20. 74 Repeated measures ANOVA. 79 Scatterplot. 72 Multiple correlation coefficient. 21. 114 adjusted. 128 Observational studies. 1 Quantitative data. 131 Qualitative data. 37. 95. 59. 60-3 Student's t-test. 16. 80. 101. 38. 82 Multiple linear regression. 100 Multivariate regression. 102 Regression analysis see Linear/Logistic regression Regression coefficient. 64-5 Wilcoxon signed ranks test. 80. 21 Probit analysis. 76. 1 Random effects model. 52. 123 lack of. 1. 30. 60-3. 63-6 Friedman two-way ANOVA. 37. 88. 54. 74-5. 133 Sample. 101 Regression line. 131 Presentation of data. 28 Prior probability. 131 . 42-3. 16. 55. 95 Probability.

52-8 misuse of. 127 type II error (beta). 130 z distribution. 55 Verbal rating scales. 84. 22. 108 Wilcoxon signed ranks test. 43 Tukey's Honestly Significant Difference ( HSD) test. 59. 9. 5-6. 23 Standardized score. 54. 100.15 2 Index Therapeutic Goods Administration (Australia). 113. 23 Test statistic. 122-3 prevention of. 43. 22 type I error (a). 145-7 Wilcoxon rank sum test. 80. 71. 5 Visual analogue scales. 82. 34 Systematic review. 52. 70. 63 Washout period. 102 Unpaired t-test. 100 Stopping rule. 114-16 t distribution. 105 Systematic error. 105-11 Survival curves. 113. 128 Two-tailed hypothesis. 44. 13 Statistical errors. 129 Subgroup analyses. 122-33 ethical considerations. 7 inferential. 123 prevalence of. 22 Simple randomization. 130 Standard error of the mean. 128 Single-blind. 126 Student's t-test. 130 Standard error. 126 Skew. 54 Variance. 21 Significance level. 139 Treatment effect. 126 Simpson s paradox. 27. 45 Stratification. 53. 40. 10. 44-5 Significance. 22. 94-5. 21 . 8 Survival analysis. 127 Univariate analysis. 98. 72 z transformation. 65-6 Yates' correction. 13 z test. 82 Spearman rank correlation. 42. 131 Triple-blind. 131 Specificity. 13 Sequence effect. 107-10 Survival event. 138 Therapeutic Products Programme ( Canada). 41 Sequential analysis. 42. 41 Which test? algorithims. 131 Standard deviation. 20-3 Stepwise regression analysis. 123-4 Statistics. 64-5. 9. 87. 128 Sum of squares. 14. 8.