
Cohort study-practical

Research Methods and Introduction to Statistics


Dr Saiful Islam, Medical Statistician, IoN

Dr Caroline Selai, Senior Lecturer, IoN
Date: 02.10.2018
Cohort code : CLNE0007
Module name: Research Methods and Introduction to Statistics
Introduction to statistics will cover:
• Why statistics?
• How to conduct research / research
• Role of statistics in research.
• What does my research data look like?
• Data presentation/display.
• Data analysis using appropriate statistical tests/methods
• How to interpret / present statistical output?
Lectures: 8 one-hour lectures
Workshops: 2 (repeated)
Revision lectures: 2
Assessment: 1-hour unseen written exam (proposed).

Critical appraisal will cover:
• What is Evidence Based Medicine (EBM)?
• Why EBM?
• Hierarchy of evidence.
• How to extract the evidence you need?
• What is critical appraisal?
• Methods of critical appraisal?
• Consider a variety of published research papers to make it clear how you could appraise them critically.
Lectures: 8 one-hour lectures

Exam date: 6th February 2019 at 11.30am

Total credits = 15 . Half in introduction to statistics & the other half for research
Overall module aim

• This module aims to equip you to do your own research independently by
– Understanding the research process
– Critically appraising any research paper
– Understanding current research methods by critically appraising some recent research
– Clearly knowing different statistical methods needed in common
– Presenting/displaying your own research data
– Learning statistical tests/methods needed for neuroscience
– Learning at least one statistical software package well (we will use STATA), aiming to analyze and interpret your own data.
Lesson plan

• Why learn statistics

• Research process
• Role of statistics in data analysis
• Summary measure of the data
• Identifying outliers in your data
• Types of data
• Data management
Introduction to data analysis

Learning outcome

• At the end of today’s lecture and workshop, you should be familiar with
– Importance of learning statistics
– How statistics is related to Neurology/Neuroscience
– Different ways of displaying and summarising data
– Identifying outliers in my data
– How I can manage my own research data
– Using STATA to carry out exploratory analysis and presentation of a dataset (histogram, box plot, cumulative frequency)

Why learn statistics?

The reason you are here is because you have an inquiring mind!
• Does using a mobile phone increase risk of brain cancer?
• Is drinking the occasional glass of wine during pregnancy harmful to the baby?
• Why do women live longer than men?
• What are the potential health risks of climate change, and who will be most
• Is there a gene for Alzheimer’s?
• Will banning cigarette sales from vending machines reduce smoking rates in
• Should all children be routinely offered the swine flu vaccination?

• To answer interesting questions, you need two things: data and an explanation of those data
Other Reasons

In the MSc course:

Analysing MRI data
Analysing dementia / Alzheimer disease data
Preparing poster
Reading scientific paper
Conducting MSc project
Interview for PhD/Job
Doing PhD
Publishing paper
Leading judge who hanged himself after dementia diagnosis left wife a note
saying she had 'a life to live', inquest hears :Telegraph Reporters
7 June 2017 • 11:39am

Sir Nicholas, who has died aged 71, was England’s senior divorce court judge. He had a rare neurological disease called fronto-temporal lobe dementia that had only recently been diagnosed.

Fronto-temporal dementia is one of the least common forms of dementia and is sometimes called Pick's disease or frontal lobe dementia, according to the Alzheimer's
It affects the part of the brain involved in controlling behaviour and emotions, plus the understanding of words. Fronto-temporal dementia is caused when nerve cells in the frontal and/or temporal lobes of the brain die and the pathways that connect them change.
Early detection of this rare neurological disease, made possible by more in-depth research in this area, might have saved this person’s life.
Undiagnosed: mother-of-four Marina Fagan had a family history of aneurism. Her brain
disease went unrecognised for 13 days :
Evening Standard : Wednesday 15 June 2016 08:53

Marina Fagan, a 51-year-old mother of four, was discharged following a two-day stay at
Whipps Cross hospital, in Leytonstone, after investigations ruled out a brain haemorrhage.
She returned to A&E the same day as her headache persisted but was advised to get her
GP to refer her to an outpatient clinic. Her condition was finally diagnosed 11 days after she
was first admitted to hospital. She died six days later, on October 6, 2015.
So we need more research & more neurologists

This means that neuroscientists should be involved in more research to understand the underlying/persisting disease

The research process

[Figure: the research process. Surviving annotation: "thinking is involved in all these phases, along with"]

Role of statistics in data analysis

• Data are the raw material of knowledge

• Scientists rely on data to provide empirical evidence to support and refine their
• Governments, businesses, communities, hospitals, GP’s and individuals need data to
help inform decision-making and risk assessment
• Learning statistics will provide you with basic skills to read and
understand data
• Broadly speaking, statistics provides us with techniques for
– Summarising and presenting the information contained in a data set
– Handling and quantifying variation and uncertainty in the data, to help us
infer what they tell us about the underlying theory of interest

Statistics – the art of telling stories with numbers

Summary measure of any numerical data:
mean, median, mode, range and inter-quartile range (IQR)
Example: Patient ages (ordered)
24 32 37 39 41 41 43 44
Marked on the list: the 25th percentile, the median (40, midway between 39 and 41) and the 75th percentile

Mean = sum of all values ÷ number of values = ?

Median (middle value) = ?

Inter-quartile range = 75th percentile – 25th percentile = ?
Range = largest value – smallest value = ?
Mode: the value that occurs most often, which is ……..
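As a cross-check outside STATA, the summary measures above can be sketched in Python. This is a minimal sketch, assuming the eight ages that the STATA output reports (Obs = 8). The helper `percentile` is hypothetical, and its averaging rule is an assumption chosen because it reproduces the quartiles shown in that output; other software may use a different rule and give slightly different quartiles.

```python
import statistics

# Eight patient ages (ordered); assumed to be the data behind Obs = 8
ages = [24, 32, 37, 39, 41, 41, 43, 44]


def percentile(sorted_values, p):
    """Return the p-th percentile (0 < p < 100) of an ordered list.

    Assumed rule: if n*p/100 is a whole number k, average the k-th and
    (k+1)-th values; otherwise take the next value up.
    """
    n = len(sorted_values)
    position = n * p / 100
    if position == int(position):
        k = int(position)
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    return sorted_values[int(position)]


mean_age = statistics.mean(ages)      # sum of values / number of values
median_age = statistics.median(ages)  # middle value (average of the two middle ones)
mode_age = statistics.mode(ages)      # most frequent value
age_range = max(ages) - min(ages)     # largest minus smallest
q1 = percentile(ages, 25)
q3 = percentile(ages, 75)
iqr = q3 - q1                         # 75th percentile minus 25th percentile

print(mean_age, median_age, mode_age, age_range, q1, q3, iqr)
```

Note that the range and IQR are always "larger minus smaller", so neither can be negative.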
Variability within data – Variance and standard deviations
Standard deviation (Std. Dev.) = √(variance)
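To make the link between variance and standard deviation concrete, here is a minimal Python sketch. It assumes the eight patient ages from the STATA example and the sample (n – 1) form of the variance, which is the form that matches the Variance and Std. Dev. values in the STATA output.

```python
import math
import statistics

# Eight patient ages; assumed to be the data behind the STATA output
ages = [24, 32, 37, 39, 41, 41, 43, 44]

n = len(ages)
mean_age = sum(ages) / n

# Sample variance: squared deviations from the mean, divided by n - 1
variance = sum((x - mean_age) ** 2 for x in ages) / (n - 1)

# Standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

# The standard library computes the same sample statistics
assert math.isclose(variance, statistics.variance(ages))
assert math.isclose(std_dev, statistics.stdev(ages))

print(variance, std_dev)
```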

Summary measure of any numerical data:
Use statistical software STATA
We are in the age of technology, so use statistical software such as STATA. Type the data into STATA and give the variable the name ‘Age’.
Type the following command in the STATA command line:
summarize Age
Output is

Variable |  Obs     Mean   Std. Dev.   Min   Max
---------+---------------------------------------
     Age |    8   37.625    6.674846    24    44

You should still know what the mean, Std. Dev. and all the other quantities are.
Summary measure of any numerical data:
Use statistical software STATA
Type the following command in the STATA command line to get more information (quartiles, median etc.):
summarize Age, detail
Output is

                     Age of patients
      Percentiles    Smallest
 1%        24           24
 5%        24           32
10%        24           37        Obs                 8
25%      34.5           39        Sum of Wgt.         8

50%        40                     Mean           37.625
                     Largest      Std. Dev.    6.674846
75%        42           41
90%        44           41        Variance     44.55357
95%        44           43        Skewness    -1.136833
99%        44           44        Kurtosis     3.142978
Mean < median
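A mean below the median hints at a longer left tail, that is, negative skew. The following Python sketch checks this for the age data. It assumes the eight ages from the STATA example and the moment-based skewness formula (third central moment divided by the second central moment to the power 1.5), which reproduces the Skewness value in the STATA output.

```python
import statistics

# Eight patient ages; assumed to be the data behind the STATA output
ages = [24, 32, 37, 39, 41, 41, 43, 44]

n = len(ages)
mean_age = sum(ages) / n
median_age = statistics.median(ages)

# Central moments (divided by n, not n - 1)
m2 = sum((x - mean_age) ** 2 for x in ages) / n
m3 = sum((x - mean_age) ** 3 for x in ages) / n

# Moment-based skewness: negative means a longer left tail
skewness = m3 / m2 ** 1.5

print(mean_age < median_age, skewness)
```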

Measure from graph

• Negatively skewed: there is no symmetry in the data, so the mean and standard deviation are not appropriate measures; use the median and inter-quartile range.
• Positively skewed: there is no symmetry in the data, so the mean and standard deviation are not appropriate measures; use the median and inter-quartile range.
• Normal distribution: the tails extend equally on both sides, so the mean and standard deviation are appropriate measures.

“A picture is worth a thousand words”

• Graphical presentation of data enables us to get a feel for:
– typical (central) values and range of values
– shape and spread of the distribution of values
– interesting patterns and relationships in the data
– ……..

• Graphical displays also help reveal problems with data quality, e.g.:
– outlying / erroneous observations
– digit preference
– ……..

Displaying Data

Several possible methods:

• Tables
– Frequency Tables
– Cross tabulations (contingency tables)
– …...

• Graphs
– Bar Charts
– Histograms
– Line Graphs
– ……

Displaying Data

• Before embarking on formal statistical analysis of a dataset, it is essential to carry out some simple exploratory analyses to get a feel for the data
Example: Normal and day case hospital admissions in England
with a neurological condition.

Data stored in Moodle named HospAdmNeu.dta


Histogram of ordinary hospital admissions with a neurological condition

[Figure: Histogram of the 2012/13 ordinary hospital admissions with a neurological condition among England CCGs. x-axis: ordinary hospital admissions, 0 to 15,000]

Histograms: Number of Classes

• Too few classes and it could be difficult to see any interesting patterns
• Too many classes and you will end up with only one observation per class.
• Aim is to ensure that the number of classes does not mask interesting patterns
– Rule of thumb: optimal number of classes is approximately log (base 2) of the number of observations

Number of obs    Approx. number of classes
50               5-6
100              6-7
1000             10
10000            13

– Number of classes also depends on choosing ‘nice’ cutpoints
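The rule of thumb above is easy to check numerically. The following Python sketch computes log base 2 of the number of observations for the sample sizes in the table; the function name `approx_classes` is hypothetical, and the result is only a starting point before choosing ‘nice’ cutpoints.

```python
import math


def approx_classes(n_obs):
    """Rule-of-thumb number of histogram classes: log base 2 of n."""
    return math.log2(n_obs)


for n_obs in (50, 100, 1000, 10000):
    # Print the unrounded value; round up or down to suit the cutpoints
    print(n_obs, round(approx_classes(n_obs), 2))
```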


Box plot of ordinary hospital admissions with a neurological condition

[Figure: Boxplot for ordinary hospital admissions in England CCGs in 2012/13]

The box shows the median and the two quartiles (1st quartile = 2269, median = 2895 and 3rd quartile = 4013). The vertical lines above and below the box indicate the range of values, with outliers shown as separate points.

Cumulative Frequency Graph of the ordinary hospital admissions 12/13

[Figure: cumulative frequency graph. x-axis: ordinary hospital admissions, 0 to 15,000]
Identifying outliers in your data

• Outliers are identified by assessing whether or not they fall within a set
of numerical boundaries called "inner fences" and "outer fences".
• A point that falls outside the data set's inner fences is classified as a
minor outlier, while one that falls outside the outer fences is classified
as a major outlier.
• Multiply the inter-quartile range (Q3-Q1) by 1.5, then add this number to Q3 and subtract it from Q1 to find the boundaries of the inner fences.
• Multiply the inter-quartile range (Q3-Q1) by 3 (instead of 1.5), then add this number to Q3 and subtract it from Q1 to find the upper and lower boundaries of the outer fences.
Identifying outliers in your data-example hospital admissions

• Use hospital admissions data HospAdmNeu.dta

• Run the following command to find the 1st quartile (Q1) and 3rd quartile (Q3):
summ Ordinary1213, det
• IQR = Q3 – Q1 = 4013 – 2269 = 1744
• 1744 × 1.5 = 2616, 1744 × 3 = 5232
• Boundaries for inner fence: (Q1 – 2616, Q3 + 2616) = (–347, 6629)
• Boundaries for outer fence: (Q1 – 5232, Q3 + 5232) = (–2963, 9245)
• As hospital admissions can never be negative, only the upper fences matter. We now check how many data points are outside the inner fence and how many are outside the outer fence using STATA:
count if Ordinary1213 > 6629
count if Ordinary1213 > 9245
(STATA treats missing values as larger than any number, so add & !missing(Ordinary1213) to each command if the variable has missing values.)

As the data are positively skewed, report the median and inter-quartile range.
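The fence arithmetic can be verified with a short Python sketch. It assumes only the quartiles quoted for the admissions data (Q1 = 2269, Q3 = 4013); the function name `classify` and the test values passed to it are hypothetical.

```python
# Quartiles quoted for the ordinary hospital admissions data
q1, q3 = 2269, 4013
iqr = q3 - q1

# Inner fences: 1.5 x IQR beyond the quartiles
inner_fence = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
# Outer fences: 3 x IQR beyond the quartiles
outer_fence = (q1 - 3 * iqr, q3 + 3 * iqr)


def classify(value):
    """Label a value as a major outlier, a minor outlier or neither."""
    if value < outer_fence[0] or value > outer_fence[1]:
        return "major outlier"
    if value < inner_fence[0] or value > inner_fence[1]:
        return "minor outlier"
    return "not an outlier"


print(iqr, inner_fence, outer_fence)
print(classify(7000), classify(10000))
```

The outer fences are checked first so that a major outlier is not also reported as a minor one.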
Types of data

Continuous                                Discrete
Blood pressure                            Number of children (parity)
Age                                       Number of cigarettes per day
Concentration of a pollutant              Counts of deaths in small areas

Ordinal (ordered categories)              Nominal (unordered categories)
Grade of breast cancer                    Sex (male/female)
Disease severity (mild/moderate/severe)   Exposed/unexposed
Social class (I, II, III, IV, V)          Ethnicity (white/asian/black/other)

• Categorical covariate data are often called factors

• Categorical data that take on only two distinct
values are said to be dichotomous or binary
• Categorical data are often coded using numerical
values (e.g. 0 = NO, 1 = YES)
– statistical packages usually treat numeric data as quantitative
unless you explicitly declare it to be categorical

• Limiting factor for any continuous observation is the accuracy of the measurement instrument
Quantitative versus Categorical

• Sometimes we do not need all the detail provided by continuous data, in which case we can transform it into categorical (binary or ordinal) data.
• For example, in a study of the effect of maternal smoking on birthweight, we can recode birthweight as:
≥2.5 kg    0 (normal bwt)
<2.5 kg    1 (low bwt)
• In a study of the effect of air pollution on asthma prevalence, we can recode ambient NO2 concentration as:
<30 µg m-3      LOW
30-60 µg m-3    MEDIUM
>60 µg m-3      HIGH
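The two recoding rules for birthweight and NO2 concentration can be sketched in Python as follows. The function names `recode_birthweight` and `recode_no2` are hypothetical, and assigning the boundary values 30 and 60 to MEDIUM is an assumption read from the "30-60" band.

```python
def recode_birthweight(weight_kg):
    """Binary recode: 1 = low birthweight (< 2.5 kg), 0 = normal."""
    return 1 if weight_kg < 2.5 else 0


def recode_no2(concentration):
    """Ordinal recode of ambient NO2 concentration into three bands."""
    if concentration < 30:
        return "LOW"
    if concentration <= 60:
        # Assumption: the boundary values 30 and 60 belong to MEDIUM
        return "MEDIUM"
    return "HIGH"


print(recode_birthweight(2.1), recode_birthweight(3.4))
print(recode_no2(12), recode_no2(45), recode_no2(75))
```

Recoding discards detail, so it is usually worth keeping the original continuous variable alongside the new categorical one.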


• It is sometimes helpful to transform data to a different scale, to aid interpretation and/or statistical analysis
• Reasons for transforming data include:
– improved approximation to normality
– reducing skewness
– linearising the relationship between 2 variables
– making multiplicative relationships additive
• Common transformations include:
– Natural logarithm (y = log_e(x) ⇔ x = e^y or exp(y), where e ≈ 2.718)
– Power transformations (y = √x, y = x^2, y = x^3, etc.)
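Two properties of the natural logarithm listed above can be checked directly: exp undoes log, and a product on the original scale becomes a sum on the log scale. This is a minimal Python sketch with arbitrary illustrative numbers.

```python
import math

x = 20.0
y = math.log(x)          # natural logarithm, y = log_e(x)
x_back = math.exp(y)     # back-transform, x = e^y

a, b = 3.0, 7.0
log_of_product = math.log(a * b)
sum_of_logs = math.log(a) + math.log(b)

# Both comparisons hold up to floating-point rounding
print(math.isclose(x, x_back), math.isclose(log_of_product, sum_of_logs))
```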

Log transformation
• Log transform stretches scale at lower end and compresses it at upper end
• Can only take logs of positive values

[Figure: curve of y = log(x) for x from 0 to 10]

[Figure: Histograms of CD4 counts in a sample of 537 AIDS patients. Panels: CD4 count (per cubic mm) and log CD4 count (per cubic mm); y-axis: number of patients]
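The stretching and compressing effect of the log transform can be seen by comparing equal-sized gaps at the two ends of the scale. This is a minimal Python sketch with illustrative values; the helper `safe_log` is hypothetical and simply returns None where the log is undefined.

```python
import math


def safe_log(value):
    """Natural log for positive values; None for zero or negative ones."""
    if value <= 0:
        return None
    return math.log(value)


# The same gap of 1 unit on the original scale...
gap_low = safe_log(2) - safe_log(1)     # at the lower end
gap_high = safe_log(10) - safe_log(9)   # at the upper end

# ...is wider on the log scale at the lower end than at the upper end
print(gap_low, gap_high, gap_low > gap_high)
print(safe_log(0))
```

If a variable contains zeros (for example some counts), the log cannot be taken directly and another approach is needed.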
Class Exercise
Classify the following data as categorical (binary/nominal/ordinal) or numerical (discrete/continuous)

Variable            Description                         Data Type
Age at diagnosis    Age of patients at diagnosis of
Education           0=Primary, 1=Secondary,
Ethnicity           1=Black, 2=White, 3=Asian
Smoking status      0=Non-smoker, 1=Smoker

Derived variable
Percentages, ratios, rates & scores: can be treated as numerical in most analyses
Data display in a spreadsheet / Data management

Suppose you are running a study at UCLH aiming to lower the low-density lipoprotein (LDL) cholesterol levels of patients with cardiovascular disease. Your study is a double-blind, placebo-controlled RCT. Patients were randomly assigned to receive evolocumab (either 140 mg every 2 weeks or 420 mg monthly) or matching placebo as subcutaneous injections. Out of the first 20 patients:
Group: 11 patients received evolocumab and 9 patients received placebo.
Gender: 12 female and 8 male.
Statin use: High intensity – 12 patients
            Medium intensity – 6 patients
            Low intensity – 2 patients
Using patient IDs 1 to 20 and appropriate codes, display the above information in a spreadsheet. Ignore relationships between variables for now.
Data display in a spreadsheet - coding

Group: 1 if patient received evolocumab
       0 if patient received placebo
Gender: 1 if patient is female
        0 if patient is male
Statin use: 2 for High intensity
            1 for Medium intensity
            0 for Low intensity

Possible data entry looks like below:

Data display in a spreadsheet – looks like -

Patient ID Group Gender Statin-use

1 1 0 0
2 1 1 2
3 1 1 2
4 1 1 1
5 1 0 2
6 1 1 2
7 1 1 2
8 1 0 2
9 1 1 1
10 1 0 2
11 1 1 2
12 0 0 0
13 0 1 1
14 0 0 2
15 0 1 2
16 0 1 2
17 0 0 1
18 0 1 1
19 0 0 1
20 0 1 2
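One way to follow the advice to check your coding is to tabulate the entered codes and compare the counts with the study description. The Python sketch below does this for the table above; the variable names are hypothetical and the rows are copied from the spreadsheet.

```python
from collections import Counter

# (patient_id, group, gender, statin_use) copied from the spreadsheet
rows = [
    (1, 1, 0, 0), (2, 1, 1, 2), (3, 1, 1, 2), (4, 1, 1, 1),
    (5, 1, 0, 2), (6, 1, 1, 2), (7, 1, 1, 2), (8, 1, 0, 2),
    (9, 1, 1, 1), (10, 1, 0, 2), (11, 1, 1, 2), (12, 0, 0, 0),
    (13, 0, 1, 1), (14, 0, 0, 2), (15, 0, 1, 2), (16, 0, 1, 2),
    (17, 0, 0, 1), (18, 0, 1, 1), (19, 0, 0, 1), (20, 0, 1, 2),
]

# Frequency of each code, one variable at a time
group_counts = Counter(group for _, group, _, _ in rows)
gender_counts = Counter(gender for _, _, gender, _ in rows)
statin_counts = Counter(statin for _, _, _, statin in rows)

print("Group (1 = evolocumab, 0 = placebo):", dict(group_counts))
print("Gender (1 = female, 0 = male):", dict(gender_counts))
print("Statin use (2 = high, 1 = medium, 0 = low):", dict(statin_counts))
```

If any count disagrees with the study description, a code has been mistyped somewhere in the spreadsheet.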
Data display in a spreadsheet - coding

Suppose the patients are aged between 50 and 70, with a mean age of 60 years. Can you now add an extra column for the age of the patients?

In your study you might get different variables but need to present in a similar
Data display in a spreadsheet – type in extra column Age

Patient ID Group Gender Statin-use Age

1 1 0 0 56
2 1 1 2 52
3 1 1 2 59
4 1 1 1 60
5 1 0 2 63
6 1 1 2 70
7 1 1 2 63
8 1 0 2 58
9 1 1 1 55
10 1 0 2 59
11 1 1 2 68
12 0 0 0 59
13 0 1 1 67
14 0 0 2 69
15 0 1 2 52
16 0 1 2 53
17 0 0 1 61
18 0 1 1 63
19 0 0 1 62
20 0 1 2 51
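The Age column can be checked against the stated constraints (ages between 50 and 70, mean age 60) with a few lines of Python. This is a minimal sketch; the list is copied from the Age column above in patient ID order.

```python
# Ages for patient IDs 1 to 20, copied from the Age column
ages = [56, 52, 59, 60, 63, 70, 63, 58, 55, 59,
        68, 59, 67, 69, 52, 53, 61, 63, 62, 51]

mean_age = sum(ages) / len(ages)
all_in_range = all(50 <= age <= 70 for age in ages)

print(len(ages), mean_age, min(ages), max(ages), all_in_range)
```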
Data display in a spreadsheet

Check twice that your coding is correct and make sure you did not enter any wrong information or mistype any number.
Check that relevant published research is consistent with your findings.
Check the proportion of people in other research who use statins and have lowered LDL. Is it consistent with yours?
Identify and develop methods for handling missing data.

If you are convinced, the data are ready to cook (for analysis).



• Need to distinguish between different types of data (continuous, discrete, categorical)
• Most appropriate way of presenting data depends on data type
• Frequency tables are appropriate for all types of data
– For quantitative data, need to think carefully about appropriate choice of
classes/intervals to group data before display
– Keep information in tables to the minimum necessary to convey the
message (story) you want to present (significant figures, number of

• Bar charts are appropriate for displaying categorical data

• Histograms and box plots are appropriate for quantitative data
Reference:
1. Introduction to Medical Statistics by Martin Bland: Chapter 4
2. Medical Statistics by B. Kirkwood & J. Sterne: Chapter 4
3. Practical Statistics for Medical Research by Douglas Altman: Chapter 6