Introductory Biostatistics I

KNOWING YOUR DATA:
FIRST THINGS FIRST

Sharon Biribo
Umanand Prasad School of Medicine and Health Sciences
Review the Epidemiological Study Process
Once you have developed an idea about what you want to study, it is
important to understand what you will need to provide in the form of
evidence.
• What is data?
• What are its sources?
• What are its forms?
• Why do I need to know this?
What is data?
1. A collection of facts, values or measurements
2. Data is raw: it simply exists and, in its basic form, has no
significance beyond its existence
3. The singular form is “datum”, but “data” is often used as a singular mass noun
4. Can be Qualitative (descriptive; has meaning)
5. Can be Quantitative (numerical; has value)

Sources of Health Data
• Vital statistics
- births
- deaths
- marriages
• Census data
• Health surveys
- national NCD survey
- national oral health survey
• Clinical records
- hospital records
- OP clinical encounters
- insurance records
- reportable diseases
- disease registries (CA, birth defects, etc.)
- use of ICD codes
Class Discussion
• Timothy, a 25-year-old man, Fijian of other descent, presents at your clinic with a cough and
respiratory tract infection. The nurse had taken his preliminary vital signs, and the
slip he brought in indicated he was 1.5 m tall and weighed 95 kg. His blood pressure was
175/110.
• Further clinical history:
• The patient lives at the family home with his parents, brother and family, and sister and
family. His siblings have a total of 8 children (5 boys and 3 girls), all living within the
extended family setting.
• Timothy reported that he had developed a cough after one of the children caught a flu-like
illness from school two months ago. Several members of his family were ill and recovered,
but he continued to suffer with a recurrent cough and sore throat, which had recently
progressed to shortness of breath.
• He works from home for a call centre and enjoys spending his free time on his social media
pages as a local ‘foodie’ influencer. He has been smoking since he was 18 and consumes up
to 2 packets of cigarettes daily.
• His mother is diabetic and his father was recently diagnosed with hypertension.
E.g., what do we know about our patient? Can you pick out
the different types of data?
• Qualitative
- Gender: Male
- BMI: Obese
- Presenting complaint: chronic cough and respiratory infection
- Ethnicity: Fijian of other descent
- Lifestyle: Chronic smoker, likes spending time on TikTok and social media as a local influencer
- Family history: Diabetes, Hypertension
• Quantitative
- Age: 25 years
- Has 2 siblings (1 brother, 1 sister)
- Has 5 nephews and 3 nieces
- Parents: 2
- Has resting BP of 175/110
- Weighs 95 kg
- Is 1.50 m tall
- BMI: 42.2
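The patient's BMI appears in both lists: as a number (42.2) and as a category (Obese). A minimal Python sketch of how the two are related is given here. The function names are hypothetical, and the category cut-offs are the standard WHO adult ranges, which are an assumption brought in for illustration and are not stated in the slides.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to a category (WHO adult cut-offs, assumed for this sketch)."""
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal weight"
    if value < 30:
        return "Overweight"
    return "Obese"


# Values taken from the case description: 95 kg and 1.50 m.
timothy_bmi = bmi(weight_kg=95, height_m=1.50)
timothy_category = bmi_category(timothy_bmi)

print(round(timothy_bmi, 1))  # quantitative (continuous) form: 42.2
print(timothy_category)       # qualitative (ordinal) form: Obese
```

The same measurement can therefore be recorded as quantitative or qualitative data, and the choice affects which analyses are possible later.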
Why is it important to know your data?
• Allows you to think of the best methods of data collection to capture these data,
i.e. whether it is best to use quantitative or qualitative methodologies, or perhaps
mixed methods
• Helps you think about what analyses you want to perform
- Qualitative data is used to extract meanings and themes
- Quantitative data can be used to identify or test associations
• Both forms can be used to provide a basic descriptive account of an idea or study.
• Remember RIRO: Rubbish In = Rubbish Out
If you have not thought carefully about what data you include in your study,
chances are that you will get results that don’t make much sense.
Subjects (units of analysis) and variables
• The approach in most investigations is to study and compare
subjects in groups of reasonable size, rather than to study
individual subjects. In research, subjects (or units of analysis)
are not always people. They may be mosquitoes, mice,
hospitals, villages etc., depending on the study being
carried out.
• To analyse data from a study we form groups and compare
them with respect to different characteristics of the subjects.
For example, we may compare the CD4 cell count per microlitre
between HIV-1 and HIV-2 infected hospital patients at
their first visit to hospital. In statistical terminology, we call
these characteristics variables, because the values they take (e.g.
CD4 cell count per microlitre and HIV type) vary from
subject to subject.
Explanatory and response variables
• A variable in a study may be defined as:
• 1. An outcome (or response or dependent) variable. This is the variable that is of main
interest in our study. In the example above, CD4 count is considered the outcome of
interest.
• 2. An explanatory (or exposure or independent) variable. This is a variable that may influence
the value of our outcome. In the example above, HIV type is considered the explanatory
variable.
• The distinction between explanatory and outcome variables depends on the context
and the objectives of the question being answered. It is therefore important to be able to
identify the outcome of interest in a study, and to design the study so that it can
answer the question about that outcome. In randomised controlled trials, the main
explanatory variable is often a drug or treatment. In most studies there is more
than one explanatory variable that can influence the response or outcome of interest.
When we have more than one explanatory variable, we call the variables other than the
main one covariates.
What are its Forms?
Data
• Qualitative (Categorical)
- Nominal (naming). Examples: Name, Gender, Ethnicity
- Ordinal (order). Examples: Satisfied, Neither satisfied nor Dissatisfied, Dissatisfied
- Binary. Examples: Yes/No, True/False
• Quantitative (Numerical)
- Discrete (Whole). Examples: 0, 1, 2, 3…
- Continuous (Any value including fractions). Examples: 2 ½, 3.714, …
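One way to see why the form of a variable matters is to look at how each form might be stored. The sketch below is illustrative only: the record and the helper `satisfaction_rank` are hypothetical, and the point is simply that ordinal categories carry an order that nominal categories do not.

```python
# Ordinal categories need their order stored explicitly (lowest to highest).
SATISFACTION_ORDER = [
    "Dissatisfied",
    "Neither satisfied nor dissatisfied",
    "Satisfied",
]


def satisfaction_rank(response: str) -> int:
    """Return the position of an ordinal response in its defined order."""
    return SATISFACTION_ORDER.index(response)


# A hypothetical record showing one variable of each form.
record = {
    "ethnicity": "Fijian of other descent",  # nominal: a name, no order
    "satisfaction": "Satisfied",             # ordinal: ordered categories
    "smoker": True,                          # binary: two possible values
    "nephews": 5,                            # discrete: whole numbers only
    "height_m": 1.50,                        # continuous: any value, incl. fractions
}

# Ordinal responses can be compared by rank; nominal ones cannot.
print(satisfaction_rank("Satisfied") > satisfaction_rank("Dissatisfied"))
```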
How do we make
sense of data?
DESCRIPTIVE STATISTICS
Descriptive Statistics
• Distribution
• Measures of Central Tendency
- Mode
- Median
- Mean
• Measures of Variability and Dispersion
- Range
- Interquartile range
- Variance
- Standard Deviation
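The measures named in this outline can all be computed with Python's standard `statistics` module. The sketch below uses a small list of ages that is invented purely for illustration. It also assumes the sample (n - 1) versions of variance and standard deviation, and the module's default quartile method; other software may use a different quartile convention and give slightly different quartiles.

```python
import statistics

# Hypothetical ages for ten students (illustrative values, not real class data).
ages = [19, 20, 20, 21, 21, 21, 22, 23, 25, 28]

# Measures of central tendency
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

# Measures of variability and dispersion
value_range = max(ages) - min(ages)
q1, q2, q3 = statistics.quantiles(ages, n=4)  # quartiles (default method)
iqr = q3 - q1
variance = statistics.variance(ages)  # sample variance (divides by n - 1)
sd = statistics.stdev(ages)           # sample standard deviation

print("mean:", mean_age, "median:", median_age, "mode:", mode_age)
print("range:", value_range, "IQR:", iqr)
print("variance:", round(variance, 2), "SD:", round(sd, 2))
```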
Data
• RAW DATA: information in its unorganised form, e.g., a collection of MBBS 3
class data for age, gender, and favourite colour
• DISCRETE DATA: data which can only be collected as whole numbers; it
cannot take fractions or decimals
• CONTINUOUS DATA: data which need not be whole numbers; it can take
decimal values
How do we make
sense of data?
DISTRIBUTION
DISTRIBUTION
• Distribution refers to the frequencies of different responses.
• Frequencies may be represented by plots, graphs, histograms etc., which are a
quick way of visualising datasets based on similarities in responses.
Describing your data - displaying categorical variables
• Before we begin to answer any question of interest from
our study, we need to summarise and display our data to
get some idea of what it is telling us. For example, we can
look at how the values of variables change from subject to
subject, i.e. the distribution of values taken by a
single variable, or at the association between two variables.
We can summarise data through tables or graphs. These
presentations are purely descriptive, and each has
advantages and disadvantages. Careful consideration
must be given to what you would like to show.
Presentation of categorical data using tables
• Tables (and diagrams) should be well labelled and self-explanatory; you
should be able to obtain all the information you require from the table
without any text to describe it. To ensure this, the title should be
informative; the outcome and explanatory variables should be clear;
the derivation of the percentages should be clear; and footnotes should be
provided for missing values and abbreviations. However, tables must not be
cluttered with too much information.
• Summarising categorical variables is straightforward. For each
category of a variable the number of subjects is counted. These counts
are known as frequencies. One-way tables show the frequency of
categories (or values) of each variable.
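Counting subjects per category, as described here, can be sketched in a few lines of Python using `collections.Counter`. The favourite-colour responses are invented for illustration only.

```python
from collections import Counter

# Hypothetical responses to "favourite colour" from eight students.
colours = ["Blue", "Red", "Blue", "Green", "Blue", "Red", "Green", "Blue"]

counts = Counter(colours)     # frequency of each category
total = sum(counts.values())  # number of subjects

# One-way table: category -> (frequency, percentage of all subjects)
one_way_table = {
    category: (n, 100 * n / total) for category, n in counts.most_common()
}

print(f"{'Colour':<8}{'n':>4}{'%':>8}")
for category, (n, pct) in one_way_table.items():
    print(f"{category:<8}{n:>4}{pct:>8.1f}")
print(f"{'Total':<8}{total:>4}{100.0:>8.1f}")
```

Reporting the total alongside the percentages makes it clear how each percentage was derived.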
Example 2 continued
• In this example, the exposure is BCG status. It is presented as the row
variable, so row percentages are appropriate.
• In this example the proportion of non-vaccinated children who were atopic
(40%) was higher than the proportion of children vaccinated according to
documentation who were atopic (21%). There is therefore some suggestion
that the prevalence of atopy is lower in children who have been vaccinated
with BCG.
--------------------------------------------------------------------
1 Aaby P et al. Early BCG vaccination and reduction in atopy in Guinea-Bissau.
Clinical and Experimental Allergy 2000; 30: 644-50
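Row percentages for a two-way table with the exposure as the row variable can be sketched as follows. The cell counts are invented: they were chosen only so that the row percentages reproduce the 40% and 21% quoted in this example, and they are not the counts from the study. The function name `row_percentages` is likewise hypothetical.

```python
# Rows = exposure (BCG status); columns = outcome (atopy).
# Hypothetical counts, NOT the study data.
counts = {
    "Not vaccinated": {"Atopic": 40, "Not atopic": 60},
    "Vaccinated (documented)": {"Atopic": 21, "Not atopic": 79},
}


def row_percentages(table):
    """Express each cell as a percentage of its own row total."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result


row_pct = row_percentages(counts)
for row, cells in row_pct.items():
    print(row, {col: f"{pct:.0f}%" for col, pct in cells.items()})
```

Because each row sums to 100%, the "Atopic" column can be compared directly between exposure groups.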
Presentation of categorical data using graphs
• Graphical representation can often be used to show the same information as
a table but in a more vivid manner. Graphs are particularly useful for
presentations and talks. Frequencies are often illustrated in two forms:
▪ Pie charts - In a pie chart the frequencies or the percentages are represented
by the angles of the different sectors (slices) of a circle; the total (360 degrees) is
equal to 100%, as shown in Figure 1.
▪ Bar charts - In a bar chart the numbers or percentages are represented by
the lengths of the bars, as shown in Figure 2.
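The relationship between frequencies and pie-chart angles is simple arithmetic: each sector's angle is its share of the total multiplied by 360 degrees. A short sketch, using invented category counts:

```python
# Hypothetical frequencies for three categories (illustrative only).
frequencies = {"Category A": 50, "Category B": 30, "Category C": 20}

total = sum(frequencies.values())

# Angle of each sector = (frequency / total) * 360 degrees.
angles = {name: 360 * n / total for name, n in frequencies.items()}

for name, angle in angles.items():
    share = 100 * frequencies[name] / total
    print(f"{name}: {share:.0f}% of subjects -> {angle:.0f} degrees")
```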
Describing your data - displaying quantitative variables
• The frequencies with which different possible values of a
quantitative variable occur may be summarised as a frequency
distribution. The frequency distribution of individual values is
seldom helpful, unless the overall number of observations is quite
small. It is more useful to group the values taken by the variable and
to report the frequencies (or percentage frequencies) of subjects
in each group.
• The first step when forming a frequency distribution is to identify
the lowest and highest values. Then the number and size of the
groups are determined. The number of groups will depend on the
observations: with too few groups (wide groups), too much
information will be lost, while too many groups (narrow groups)
may be impractical. Where possible, each group should be the same
width, the starting points should be whole numbers, and there
should be no gaps between groups.
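The grouping steps described here can be sketched in Python. The function `frequency_distribution` is a hypothetical helper, the CD4 values are invented for illustration, and the sketch assumes whole-number observations with a whole-number starting point and group width.

```python
def frequency_distribution(values, start, width):
    """Count whole-number values in equal-width groups beginning at `start`.

    Groups are start..start+width-1, start+width..start+2*width-1, and so on,
    so every value falls in exactly one group and there are no gaps.
    """
    if min(values) < start:
        raise ValueError("start must not exceed the lowest value")
    n_groups = (max(values) - start) // width + 1
    counts = [0] * n_groups
    for v in values:
        counts[(v - start) // width] += 1
    return [
        (start + i * width, start + (i + 1) * width - 1, c)
        for i, c in enumerate(counts)
    ]


# Hypothetical CD4 cell counts (cells per microlitre) for ten patients.
cd4_counts = [35, 120, 180, 210, 250, 290, 330, 410, 460, 520]

table = frequency_distribution(cd4_counts, start=0, width=100)
total = len(cd4_counts)
for lower, upper, count in table:
    print(f"{lower:>3}-{upper:<3} {count:>3} {100 * count / total:6.1f}%")
```

Changing `width` shows the trade-off directly: a larger width hides detail, while a smaller one produces many sparsely filled groups.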
• Example 2: A study of HIV-infected
patients presenting at a hospital in
The Gambia. A total of 1084 patients have been
classified by CD4 cell count (cells/µl) and
HIV type at their first presentation to
hospital.
Presentation of quantitative data using graphs
• Graphical presentation of quantitative variables can take three forms:
▪ Histograms - A histogram is similar to a bar chart, but the values of the variable are (usually)
grouped into several categories and the bars represent the frequencies. The bars touch one
another to indicate the continuous nature of the variable. If the widths of the groups differ,
this should be reflected in the widths of the bars; the area of each bar
should be proportional to the frequency. A histogram can be used to illustrate the distribution of a
single quantitative variable or the distribution of a quantitative variable across the levels of a
categorical variable.
▪ Cumulative frequency curves - The cumulative frequency is the number of observations less than
(or equal to) a particular value.
▪ Scatter plots - A simple graph used to examine the relationship between two quantitative
variables. Each pair of values is represented by a symbol whose horizontal position is
determined by the value of the first variable (exposure) and whose vertical position is determined
by the value of the second variable (outcome).
These initial displays of the data are particularly useful for identifying outliers or unusual values,
and revealing possible errors.
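Cumulative frequencies are running totals of the group frequencies. A minimal sketch using `itertools.accumulate`, with invented group labels and counts:

```python
from itertools import accumulate

# Hypothetical grouped frequency distribution (illustrative values only).
groups = ["0-99", "100-199", "200-299", "300-399", "400-499", "500-599"]
frequencies = [1, 2, 3, 1, 2, 1]

# Running total: number of observations up to and including each group.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]
cumulative_pct = [100 * c / total for c in cumulative]

for group, freq, cum, pct in zip(groups, frequencies, cumulative, cumulative_pct):
    print(f"{group:<8} {freq:>3} {cum:>4} {pct:6.1f}%")
```

Plotting the cumulative percentage against the upper limit of each group gives the cumulative frequency curve.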
A histogram of the haemoglobin (Hb) values in the 70 women is given in Figure 3. In a
histogram, it is the area of the rectangle which represents the frequency (or percentage):
the vertical scale is measured in frequency per unit of value and the horizontal scale is
measured in units of value. Note that the rectangles are drawn from 8 up to 9, 9 up to 10 etc.,
not from 8 up to 8.9, 9 up to 9.9 etc., which would correspond to the actual range of recorded
values.
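The idea that area, not height, represents frequency matters most when groups have different widths. The sketch below computes bar heights as frequency per unit of value for a set of invented groups of unequal width; the group limits and counts are hypothetical and are not the haemoglobin data from Figure 3.

```python
# Hypothetical groups: (lower limit, upper limit, frequency).
# The last two groups are wider than the first two.
groups = [(8, 9, 4), (9, 10, 10), (10, 12, 20), (12, 16, 8)]

# Bar height = frequency per unit of value = frequency / group width.
heights = [freq / (upper - lower) for lower, upper, freq in groups]

# Check: area of each bar (height x width) recovers the frequency.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, groups)]

for (lower, upper, freq), h in zip(groups, heights):
    print(f"{lower}-{upper}: frequency {freq}, width {upper - lower}, height {h}")
```

Here the 10-12 group has twice the frequency of the 9-10 group but the same bar height, because it is twice as wide.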
