

BY THE END OF THIS TOPIC YOU SHOULD BE ABLE TO:
 Use statistical terms and definitions correctly in analytical chemistry,
 Calculate measures of central tendency for environmental analysis data,
 Calculate measures of dispersion for environmental analysis data,
 Test hypotheses employing appropriate statistical tests.
 Statistics: Collection of methods for planning experiments, obtaining
data, and then organizing, summarizing, presenting, analyzing,
interpreting, and drawing conclusions.
 Variable: Characteristic or attribute that can assume different values
 Random Variable: A variable whose values are determined by chance.
 Population: All subjects possessing a common characteristic that is
being studied.
 Sample: A subgroup or subset of the population.
 Parameter: Characteristic or measure obtained from a population.
 Statistic (not to be confused with Statistics): Characteristic or measure
obtained from a sample.
DEFINITIONS
 Descriptive Statistics: Collection, organization, summarization, and presentation of data.
 Inferential Statistics: Generalizing from samples to populations using probabilities.
Performing hypothesis testing, determining relationships between variables, and
making predictions.

 Qualitative Variables: Variables which assume non-numerical values.
 Quantitative Variables: Variables which assume numerical values.
 Discrete Variables: Variables which assume a finite or countable number of possible
values. Usually obtained by counting, e.g. number of bacteria colonies in a culture.
 Continuous Variables: Variables which assume an infinite number of possible values.
Usually obtained by measurement, e.g. pH of a sample, patient cholesterol.
Give more examples of;
1. Qualitative variables
2. Quantitative variables
3. Discrete variables
4. Continuous variables
 Nominal Level: Level of measurement which classifies data into mutually
exclusive, all-inclusive categories in which no order or ranking can be
imposed on the data. e.g. gender, blood group.
 Ordinal Level: Level of measurement which classifies data into categories
that can be ranked. Precise differences between the ranks do not exist (e.g. mild,
moderate or severe illness). Often ordinal variables are coded to be
 Interval Level: Level of measurement which classifies data that can be
ranked and differences are meaningful. However, there is no meaningful
zero, so ratios are meaningless.
 Ratio Level: Level of measurement which classifies data that can be
ranked, differences are meaningful, and there is a true zero. True ratios exist
between the different units of measure.
ACTIVITY 1A:
Which among the following are quantitative or qualitative?
a. colour change,
b. temperature
c. turbidity values
d. states of matter
e. Concentration of Total Organic Compounds
Separate the discrete from the continuous variables in the following;
f. time measurements,
g. Electric conductivity measurements,
h. Total Bacterial Count (TBC)
i. Water hardness (mg/L CaCO3)
Give examples of;
1. Nominal
2. Ordinal
3. Interval &
4. Ratio data
Which basic statistical calculations are meaningful for each one?
 Parameters: Quantities that describe a population
characteristic. They are usually unknown and we wish to
make statistical inferences about parameters. Different to

 Descriptive Statistics: Quantities and techniques used to
describe a sample characteristic or illustrate the sample
data e.g. mean, standard deviation, box-plot

 An (Empirical) Frequency Distribution or Histogram for a
continuous variable presents the counts of observations
grouped within pre-specified classes or groups
 A Relative Frequency Distribution presents the
corresponding proportions of observations within the classes
 A Barchart presents the frequencies for a categorical variable

Blood samples were taken from 36 male volunteers as part of a study to
determine the natural variation in CK concentration. The serum CK
concentrations, measured in U/L, are as follows:

Serum CK (U/L)   Frequency   Relative Frequency   Cumulative Rel. Frequency
20-39                1            0.028                  0.028
40-59                4            0.111                  0.139
60-79                7            0.194                  0.333
80-99                8            0.222                  0.555
100-119              8            0.222                  0.777
120-139              3            0.083                  0.860
140-159              2            0.056                  0.916
160-179              1            0.028                  0.944
180-199              0            0.000                  0.944
200-219              2            0.056                  1.000
Total               36            1.000
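The relative and cumulative relative frequencies in the table can be reproduced from the class counts alone. The sketch below is a minimal Python illustration; the variable names are chosen here for illustration and are not part of the original material. Cumulating the counts before dividing avoids the small rounding drift that appears when already-rounded proportions are added (for example 0.556 rather than 0.555 for the 80-99 class).

```python
# Class limits and counts copied from the serum CK frequency table.
classes = ["20-39", "40-59", "60-79", "80-99", "100-119",
           "120-139", "140-159", "160-179", "180-199", "200-219"]
counts = [1, 4, 7, 8, 8, 3, 2, 1, 0, 2]

total = sum(counts)  # number of volunteers

# Relative frequency: share of observations falling in each class.
relative = [c / total for c in counts]

# Cumulative relative frequency: running count divided by the total.
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(running / total)

for label, c, r, cum in zip(classes, counts, relative, cumulative):
    print(f"{label:>8} {c:>3} {r:6.3f} {cum:6.3f}")
```

Adding the relative frequencies of the 60-79 and 80-99 classes gives the share of volunteers with CK between 60 and 100 U/L, about 42%.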
Summary statistics for the serum CK data (U/L):

Quantiles
100.0%  maximum    203.00
 99.5%             203.00
 97.5%             203.00
 90.0%             154.60
 75.0%  quartile   118.75
 50.0%  median      94.50
 25.0%  quartile    67.25
 10.0%              54.30
  2.5%              25.00
  0.5%              25.00
  0.0%  minimum     25.00

Moments
Mean              98.277778
Std Dev           40.380767
Std Err Mean      6.7301278
upper 95% Mean    111.94066
lower 95% Mean    84.614892
N                 36

[Figure: histogram of serum CK; x-axis from 20 to 220 U/L, y-axis frequency.]
[Figure: relative frequency histogram of serum CK; x-axis from 20 to 220 U/L, y-axis
relative frequency. The area of the bars between 60 and 100 U/L is the percentage of
males with CK values in that range, i.e. 42%. The left tail is short and the right tail
is long (skewed).]

 The four types of data you will encounter are: Nominal, Ordinal,
Interval, and Ratio.
 Nominal & Ordinal easiest to distinguish, often don't involve
 Nominal data is finite and has no ordering. If you asked a group of
people a yes/no question, the data from that would be nominal. If you
asked a group of people what their favorite color is, the resulting data,
consisting of different colors, is nominal.
 If people pick their lucky number from a finite list of numbers, the data
would also be nominal because those numbers are merely labels.
They have no mathematical meaning.
 Ordinal data is much like Nominal data except there is a
perceived order among the possible choices but beyond
that, has no arithmetical value.
 If you were surveying which sized drink people buy, small,
medium, or large, the data is Ordinal.
 If you asked people if they liked hot, mild, or cold weather,
the data is ordinal because while there is a clear order
among the choices, the different data points cannot be
compared arithmetically.
 Interval and Ratio data differ from Nominal and Ordinal
because the data has mathematical meaning and therefore,
these data types require a numerical choice.
Interval data
 Data where addition & subtraction have meaning but division & multiplication do not.
 If you asked a group of people at which time they normally went to bed, the data
collected would be interval because there is meaning in comparing the bed times by
calculating the differences between them (using subtraction), but it does not make
sense to divide bed times.
 It wouldn't make sense, for example, to say that my bedtime at 9pm is 3/4 sooner than
his at 12am. But it does make sense to note that my bed time is 3 hours sooner than his.
 Another example is birth years because adding & subtracting years in order to say who is
older/younger & by how much makes sense, but dividing by birth years has no value. The
same is true for temperatures (in °C) and pH.
Ratio data
 Ratio data differs from Interval data because division and multiplication have meaning.
 While dividing birth years doesn't provide any useful information about your sample, if
you divided their ages that could be useful.
 Data that pertains to someone's age is an example of ratio data because you can divide
ages to see how much older someone is compared to someone else.
 Another example of ratio data could be the amounts of money different people have in
their bank accounts. Again, division has meaning in this case. If you have K10,000 in
your account and I have K1,000, then that makes you ten times richer than me.
 Other examples are concentration, weight, mass, volume, pressure, length…
 Before we conduct statistical analysis, we need to measure our dependent
variable. Exactly how the measurement is carried out depends on the type
of variable involved.
 Different types are measured differently. To measure the time taken to
respond to a stimulus, you might use a stop watch. Stop watches are of no
use, of course, when it comes to measuring someone's attitude towards a
political candidate. A rating scale is more appropriate in this case (with
labels like "very favorable," "somewhat favorable," …). For a dependent
variable such as "favorite color," you simply note the color-word (e.g. "red").
 Although procedures for measurement differ, they can be classified using a few
fundamental categories. In a given category, all of the procedures share
some properties that are important. The categories are called scale types or
Nominal Scale
 When measuring using a nominal scale, one simply names or categorizes responses.
Gender, handedness, favorite color, and religion are examples of variables measured
on a nominal scale.
 The essential point about nominal scales is that they do not imply any ordering among
the responses. For example, when classifying people according to their favorite color,
there is no sense in which green is placed "ahead of" blue.
 Responses are merely categorized. Nominal scales embody the lowest level of
measurement.
Ordinal Scale
 A researcher wishing to measure consumers' satisfaction with their microwave ovens
might ask them to specify their feelings as either "very dissatisfied," "somewhat
dissatisfied," "somewhat satisfied," or "very satisfied."
 The items in this scale are ordered, ranging from least to most satisfied. This is what
distinguishes ordinal from nominal scales.
 Ordinal scales allow comparisons of the degree to which two subjects possess the
dependent variable. For example, our satisfaction ordering makes it meaningful to assert
that one person is more satisfied than another with their microwave ovens.
 Such an assertion reflects the first person's use of a verbal label that comes later in
the list than the label chosen by the second person.
 Interval scales are numerical scales in which intervals have the same interpretation
throughout. As an example, consider the Celsius scale of temperature. The
difference between 30 degrees and 40 degrees represents the same
temperature difference as the difference between 80 degrees and 90 degrees.
 This is because each 10-degree interval has the same physical
meaning (in terms of kinetic energy of molecules).
 Interval scales are not perfect, however. In particular, they do not
have a true zero point even if one of the scaled values happens to
carry the name "zero."
 The Celsius scale illustrates the issue. 0 °C does not represent the complete
absence of temperature (the absence of any molecular kinetic energy). In reality,
the label "0" is applied to its temperature for accidental reasons connected to the
history of temperature measurement.
 Since an interval scale has no true zero point, it does not make sense to compute
ratios of temperatures. For example, there is no sense in which the ratio of 40 to 20 °C
is the same as the ratio of 100 to 50 °C; no interesting physical property is
preserved across the two ratios. After all, if the "zero" label were applied at the
temperature that Celsius happens to label as 10 degrees, the two ratios would
instead be 30 to 10 and 90 to 40, no longer the same!
 For this reason, it does not make sense to say that 80 °C is "twice as hot" as 40 °C.
Such a claim would depend on an arbitrary decision about where to "start" the
temperature scale, namely, what temperature to call zero (whereas the claim is
intended to make a more fundamental assertion about the underlying physical
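The zero-shift argument can be checked numerically. The following is a small Python sketch, with a helper name (`shift_zero`) invented here for illustration: it moves the zero label to the temperature Celsius calls 10 degrees and compares ratios and differences before and after.

```python
def shift_zero(temps, new_zero):
    """Re-express temperatures on a scale whose zero sits at `new_zero`
    on the original scale (hypothetical helper for this illustration)."""
    return [t - new_zero for t in temps]

celsius = [40, 20, 100, 50]
shifted = shift_zero(celsius, 10)            # zero label moved to 10 degrees C

# Ratios on the original Celsius scale look equal...
ratio_a = celsius[0] / celsius[1]            # 40 / 20
ratio_b = celsius[2] / celsius[3]            # 100 / 50

# ...but they change, and no longer agree, once the zero is moved.
shifted_ratio_a = shifted[0] / shifted[1]    # 30 / 10
shifted_ratio_b = shifted[2] / shifted[3]    # 90 / 40

# Differences are unaffected by where the zero is placed.
diff_original = celsius[0] - celsius[1]
diff_shifted = shifted[0] - shifted[1]

print(ratio_a, ratio_b, shifted_ratio_a, shifted_ratio_b)
print(diff_original, diff_shifted)
```

Differences survive the change of zero while ratios do not, which is exactly the property that separates interval from ratio measurement.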
Ratio Scale
 The ratio scale is the most informative scale. It is an interval scale with the
additional property that its zero position indicates the absence of the quantity
being measured.
 Think of a ratio scale as the three earlier scales rolled up in one. Like a nominal
scale, it provides a name or category for each object (the numbers serve as labels).
Like an ordinal scale, the objects are ordered (in terms of the ordering of the
numbers). Like an interval scale, the same difference at two places on the scale has
the same meaning. And in addition, the same ratio at two places on the scale also
carries the same meaning.
 The Celsius scale for temperature has an arbitrary zero point and is therefore not a
ratio scale.
 However, zero on the Kelvin scale is absolute zero. This makes the Kelvin scale a ratio
scale. For example, if one temperature is twice as high as another as measured on the
Kelvin scale, then it has twice the kinetic energy of the other temperature.
Decide whether one would collect nominal, ordinal or interval data in the following;
a. Preference of water source among borehole, covered shallow well, lake and river,
b. CO2 concentrations at a point at 6:00hrs, 12:00hrs and 18:00hrs
c. Highest and lowest temperatures for various stations in Malawi,
d. Soil type in estate agriculture in Northern Malawi
e. Residual chlorine concentration in water
 Why is the type of scale that measures a dependent variable
important? The crux of the matter is the relationship between the
variable's level of measurement and the statistics that can be
meaningfully computed with that variable.
 For example, consider a hypothetical study in which 5 children are
asked to choose their favorite color from blue, red, yellow, green, and
purple. The researcher codes the results as follows:
Color     Code
Blue      1
Red       2
Yellow    3
Green     4
Purple    5
 This means that if a child said her favorite color was "Red," then the choice was coded
as "2," if the child said her favorite color was "Purple," then the response was coded
as "5," ….
 Consider the following hypothetical data:
Subject   Color     Code
1         Blue      1
2         Blue      1
3         Green     4
4         Green     4
5         Purple    5
• Each code is a number, so nothing prevents us from computing the average code
assigned to the children.
• The average happens to be 3, but it would be senseless to conclude that the average
favorite color is yellow (the color with a code of 3).
• Such nonsense arises because favorite color is measured on a nominal scale, so
averaging its numerical labels is meaningless.
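The favorite-color example can be reproduced in a few lines. This is a minimal Python sketch using the codes and responses from the tables above; the variable names are illustrative only. It shows that the mean of the codes points at a color nobody chose, whereas the mode, which only counts categories, remains meaningful for nominal data.

```python
from statistics import mean, multimode

# Coding scheme and responses copied from the hypothetical study.
codes = {"Blue": 1, "Red": 2, "Yellow": 3, "Green": 4, "Purple": 5}
responses = ["Blue", "Blue", "Green", "Green", "Purple"]

coded = [codes[color] for color in responses]
average_code = mean(coded)

# Look up which color carries the average code (only works because the
# average happens to be a whole number here).
color_for_code = {code: color for color, code in codes.items()}
average_color = color_for_code[average_code]

# The mode is the appropriate summary for nominal data.
most_common = multimode(responses)

print(average_code, average_color, most_common)
```

The "average color" is Yellow even though no child chose it, while the modes are Blue and Green, the colors actually chosen most often.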
 This is a difficult question, one that statisticians have debated for
 You will be able to explore this issue yourself in your literature
review and reach your own conclusion.
 The prevailing (but by no means unanimous) opinion of statisticians
is that for almost all practical situations, the mean of an ordinally-
measured variable is a meaningful statistic.
 However, as you will find out in literature, there are extreme
situations in which computing the mean of an ordinally-measured
variable can be very misleading.
Independent and dependent variables
 Variables are properties or characteristics of some event, object, or person that can
take on different values or amounts (as opposed to constants such as π that do not vary).
 When conducting research, experimenters often manipulate variables. For example, an
experimenter might compare the effectiveness of four types of antidepressants. In this
case, the variable is "type of antidepressant."
 When a variable is manipulated by an experimenter, it is called an independent
variable. The experiment seeks to determine the effect of the independent variable on
relief from depression.
 In this example, relief from depression is called a dependent variable. In general,
the independent variable is manipulated by the experimenter and its effects on the
dependent variable are measured.
VARIABLES
We wish to
• define and distinguish between independent & dependent variables
• define and distinguish between discrete and continuous variables
• define and distinguish between qualitative & quantitative variables
 Can blueberries slow down aging? A study indicates that antioxidants
found in blueberries may slow down the process of aging. In this
study, 19-month-old rats (equivalent to 60-year-old humans) were fed
either their standard diet or a diet supplemented by either blueberry,
strawberry, or spinach powder. After 8 weeks, the rats were given
memory and motor skills tests. Although all supplemented rats showed
improvement, those supplemented with blueberry powder showed the
most notable improvement.
 What is the independent variable? (dietary supplement: none,
blueberry, strawberry, and spinach)
 What are dependent variables? (memory & motor skills test)
Levels of an Independent Variable
 If an experiment compares an experimental treatment with a control treatment, then the
independent variable (type of treatment) has two levels: experimental and control.
 If an experiment were comparing five types of diets, then the independent variable
(type of diet) would have 5 levels. In general, the number of levels of an independent
variable is the number of experimental conditions.
EXERCISE
Explain a theoretical experiment with 4 levels of the independent variable.
Identify the dependent variable in your experiment.
Find & study literature on significant figures.

Isolate the independent and dependent variable from the following;
 Experiment to determine effect of pH on solubility of NaHCO3,
 Experiment to examine temperature-dependence of rate of reaction,
 Research on effectiveness of chlorination in water with different turbidity values
What is the level of the independent variable in each of the following;
 Rates of reaction in acid-base reactions with strong & weak acids,
 Measures of electric conductivity in 0.4, 0.6, 0.8 & 1.0M PbSO4(aq)
 Total Bacterial Count (TBC) in raw and treated wastewater
 Water hardness (mg/L CaCO3) in raw, partially & fully treated water
Measures of Central Tendency
 A measure of central tendency (also referred to as measures of
centre or central location) is a summary measure that attempts to
describe a whole set of data with a single value that represents
the middle or centre of its distribution.
 There are three main measures of central tendency: the mode,
the median and the mean.
 Each of these measures describes a different indication of the
typical or central value in the distribution.
MODE:
 The mode is the most commonly occurring value in a distribution. Consider this dataset
showing the bacterial count in 11 cultures:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
 Table below shows a simple frequency distribution of the bacterial count data.
Colony count   Frequency
54             3
55             1
56             1
57             2
58             2
60             2
The most commonly occurring value is 54, therefore the mode of this distribution is 54.
 Advantage of the mode: unlike the median and the mean, the mode can be found for both
numerical & categorical (non-numerical) data.
Limitations of the mode:
 In some distributions, the mode may not reflect the centre of the distribution very well. When
the distribution of colony counts is ordered from lowest to highest value, 54, 54, 54, 55, 56,
57, 57, 58, 58, 60, 60, it is easy to see that the centre of the distribution is 57 colonies, but the
mode is lower, at 54 colonies.
 It is also possible to have more than one mode for the same distribution of data (bi-modal, or
multi-modal).
 The presence of more than one mode can limit the ability of the mode in describing the centre
or typical value of the distribution because a single value to describe the centre cannot be
identified.
 In some cases, particularly where the data are continuous, the distribution may have no mode
at all (i.e. if all values are different). In cases such as these, it may be better to consider using
the median or mean, or group the data into appropriate intervals, and find the modal class.
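Finding the mode amounts to building the frequency table and picking the value or values with the highest count. A minimal Python sketch follows; the function name `modes` is an illustrative choice, and returning a list is a design decision made here so that bi-modal data are reported in full.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency, sorted."""
    counts = Counter(values)          # value -> frequency
    top = max(counts.values())        # highest frequency in the data
    return sorted(v for v, c in counts.items() if c == top)

colony_counts = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print(modes(colony_counts))           # single mode
print(modes([1, 1, 2, 2, 3]))         # bi-modal example
print(modes(["A", "O", "O", "B"]))    # works for categorical data too
```

If every value occurs once, the function returns all of them, which corresponds to the "no mode" situation described above.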
 The median is the middle value in distribution when values are arranged in ascending
or descending order. The median divides the distribution in half (there are 50% of
observations on either side of the median value).
 In a distribution with an odd number of observations, the median value is the middle
value. Looking at the colony counts distribution (which has 11 observations), the
median is the middle value, which is 57 colony counts.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
 When the distribution has an even number of observations, the median value is the
mean of the two middle values. In the following distribution, the two middle values
are 56 and 57, therefore the median equals 56.5 colonies:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
 Advantage of the median: It is less affected by outliers & skewed data than the mean,
hence preferred measure of central tendency for skewed distributions
 Limitation of the median: Cannot be identified for categorical nominal data, as it
cannot be logically ordered.
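The odd and even cases of the median can be written out directly. The sketch below is a plain Python illustration of the rule described above (the function name is chosen here; the standard library's `statistics.median` behaves the same way for these inputs).

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values
    when the number of observations is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # odd n: single middle value
    return (data[mid - 1] + data[mid]) / 2    # even n: average the two

odd_counts = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]        # 11 values
even_counts = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]   # 12 values

print(median(odd_counts), median(even_counts))
```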
 Mean: The mean is the sum of the value of each observation in a distribution divided
by the number of observations:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Looking at the colony counts distribution again: 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
 Calculated by adding all the values obtained
(54+54+54+55+56+57+57+58+58+60+60 = 623) & dividing by the number of observations (11),
which equals 56.6 colonies.
The population mean is indicated by the Greek symbol µ (pronounced 'mu'). When the mean
is calculated on a distribution from a sample it is indicated by the symbol x̅
(pronounced X-bar).
 Advantage: Can be used for both continuous and discrete numeric data.
 Limitations: Cannot be calculated for categorical data, as the values cannot be summed.
Includes every value in the distribution in its calculation, hence sensitive to the
influence of outliers and skewed distributions.
 When a distribution is symmetrical, the mode, median and mean are all in the middle of the
distribution.
 For a skewed distribution the mode remains the most commonly occurring value, the median
remains the middle value in the distribution, but the mean is generally 'pulled' in the
direction of the tails. In such a distribution, the median is often a preferred measure of
central tendency, as the mean is not usually in the middle.
 A distribution is said to be positively or right skewed when the tail on the right side of the
distribution is longer than the left side. In a positively skewed distribution it is common for
the mean to be ‘pulled’ toward the right tail of the distribution. Although there are exceptions
to this rule, generally, most of the values, including the median value, tend to be less than the
mean value.
 A distribution is said to be negatively or left skewed when the tail on the left side of the
distribution is longer than the right side. In a negatively skewed distribution, it is common for
the mean to be ‘pulled’ toward the left tail of the distribution. Although there are exceptions
to this rule, generally, most of the values, including the median value, tend to be greater than
the mean value.
How do outliers influence the measures of central tendency?
 An outlier has a value which is very different to the rest of the distribution.
 It is important to detect outliers within a distribution, because they can alter the results of
the data analysis. The mean is more sensitive to outliers than the median or mode.
 Consider the colony counts dataset again, with one difference; the last observation of 60
has been replaced with a colony count of 81. This value is much higher than the other
values, & could be considered an outlier. However, it has not changed the middle of the
distribution, & hence the median value is still 57 counts: 54, 54, 54, 55, 56, 57, 57, 58, 58,
60, 81
 As all values are included in the calculation of the mean, the outlier will influence the mean
value: (54+54+54+55+56+57+57+58+58+60+81 = 644), divided by 11 = 58.5 counts. In
this distribution the outlier value has increased the mean value.
 Despite the existence of outliers in a distribution, the mean can still be an appropriate
measure of central tendency, especially if the rest of the data is normally distributed.
 If an outlier is confirmed as a valid extreme value, it should not be removed from the dataset.
Outliers can be treated by a variety of techniques that attempt to minimise distortion
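The effect of the single outlier on the mean and the median can be verified with Python's standard `statistics` module. This is a small sketch using the colony counts from the text; the variable names are illustrative.

```python
from statistics import mean, median

original = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
with_outlier = original[:-1] + [81]   # last 60 replaced by 81

for label, data in [("original", original), ("with outlier", with_outlier)]:
    # The mean uses every value, so it moves; the median only depends
    # on the middle of the ordered data, so it stays put.
    print(f"{label:>12}: sum={sum(data)} mean={mean(data):.1f} median={median(data)}")
```

The mean rises from 56.6 to 58.5 colonies while the median remains 57, matching the hand calculation.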
 Measures of dispersion characterise how spread out the
distribution is, i.e., how variable the data are.
 Commonly used measures of dispersion include:
1. Range
2. Variance & Standard deviation
3. Coefficient of Variation (or relative standard deviation)
4. Inter-quartile range
 The sample Range is the difference between the largest and smallest
observations in the sample
 Easy to calculate;
Example: The systolic blood pressures of seven middle aged men were
as follows: 151, 124, 132, 170, 146, 124 and 113. In this example:
min=113 and max=170, so the range = 170 - 113 = 57 mmHg
 Useful for "best" or "worst" case scenarios
 Sensitive to extreme values

 The sample variance, s², is the sum of the squared deviations from the sample
mean divided by n - 1:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
 The sample standard deviation, s, is the square-root of the variance:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
 s has the advantage of being in the same units as the original variable x
Data          Deviation     Deviation²
151             13.86         192.02
124            -13.14         172.73
132             -5.14          26.45
170             32.86        1079.59
146              8.86          78.45
124            -13.14         172.73
113            -24.14         582.88
Sum = 960.0   Sum = 0.00    Sum = 2304.86

\[ \bar{x} = \frac{960.0}{7} = 137.14 \]
\[ \sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86 \]
Therefore,
\[ s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6 \]
 The coefficient of variation (CV) or relative standard deviation
(RSD) is the sample standard deviation expressed as a percentage
of the mean, i.e.

\[ CV = \frac{s}{\bar{x}} \times 100\% \]
 The CV is not affected by multiplicative changes in scale
 Consequently, a useful way of comparing the dispersion of
variables measured on different scales

The CV of the blood pressure data is:
\[ CV = 100 \times \frac{19.6}{137.1}\,\% = 14.3\% \]
i.e., the standard deviation is 14.3% as large as the mean.
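A short Python sketch can confirm both the 14.3% figure and the claim that the CV is unchanged by a multiplicative change of scale. The conversion of mmHg to kPa (factor 0.133322) is used here purely as an example of such a rescaling; the function name `cv` is illustrative.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: sample SD as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

blood_pressure_mmhg = [151, 124, 132, 170, 146, 124, 113]

# Rescale every value by the same factor (mmHg -> kPa).
blood_pressure_kpa = [x * 0.133322 for x in blood_pressure_mmhg]

cv_mmhg = cv(blood_pressure_mmhg)
cv_kpa = cv(blood_pressure_kpa)

print(f"CV in mmHg: {cv_mmhg:.1f}%")
print(f"CV in kPa:  {cv_kpa:.1f}%")
```

Because both the standard deviation and the mean are multiplied by the same factor, it cancels in the ratio.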

 The Median divides a distribution into two halves.
 The first and third quartiles (denoted Q1 and Q3) are defined as follows:
 25% of the data lie below Q1 (and 75% is above Q1),
 25% of the data lie above Q3 (and 75% is below Q3)
 The inter-quartile range (IQR) is the difference between the third &
first quartiles, i.e.
IQR = Q3 - Q1
The ordered blood pressure data is:

113 124 124 132 146 151 170

Here Q1 = 124 and Q3 = 151, so the Inter Quartile Range (IQR) is 151 - 124 = 27 mmHg
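The quartiles above can be obtained in code. The sketch below assumes one particular convention, namely that Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the middle observation left out of both halves when n is odd. This assumption reproduces Q1 = 124 and Q3 = 151 for the blood pressure data; statistical software often uses interpolation rules that can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using medians of the lower and upper halves.

    Assumption: for odd n, the middle observation is excluded from both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                  # smallest half of the data
    upper = data[len(data) - half:]      # largest half of the data
    return median(lower), median(data), median(upper)

blood_pressure = [151, 124, 132, 170, 146, 124, 113]
q1, q2, q3 = quartiles(blood_pressure)
iqr = q3 - q1

print(f"Q1={q1} median={q2} Q3={q3} IQR={iqr}")
```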
Find the Inter Quartile Range (IQR) for the data below
40, 32, 55, 67, 61, 39, 33, 48, 44
 A box-plot is a visual description of the distribution based on:
 Minimum
 Q1
 Median
 Q3
 Maximum
 Useful for comparing large sets of data
The pulse rates of 12 individuals arranged in increasing
order are:

62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80

Q1 = (68+70)/2 = 69, Q3 = (76+78)/2 = 77

IQR = (77 – 69) = 8




[Figure: box-plots comparing four samples: AG_04659_AS.cel, AG_11745_AS.cel, KB_5828_AS.cel, KB_8840_AS.cel]

 An outlier is an observation which does not
appear to belong with the other data
 Outliers can arise because of a measurement or
recording error or because of equipment failure
during an experiment, etc.
 An outlier might be indicative of a sub-
population, e.g. an abnormally low or high value
in a medical test could indicate presence of an
illness in the patient.
 Re-define the upper and lower limits of the boxplots (the
whisker lines) as:
Lower limit = Q1 - 1.5 × IQR, and
Upper limit = Q3 + 1.5 × IQR

 Note that the lines may not go as far as these limits

 If a data point is < lower limit or > upper limit, the data
point is considered to be an outlier.
 Determine if there is an outlier in the CK data
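The 1.5 × IQR rule is easy to apply in code. The Python sketch below is an illustration on the colony count data in which one value was replaced by 81, not on the CK data. It assumes quartiles are taken as medians of the lower and upper halves of the ordered data (middle value excluded when n is odd); the function names and the parameter `k` are choices made here.

```python
from statistics import median

def iqr_fences(values, k=1.5):
    """Return (lower limit, upper limit) = (Q1 - k*IQR, Q3 + k*IQR)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])               # median of the lower half
    q3 = median(data[len(data) - half:])   # median of the upper half
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Values lying strictly outside the lower or upper limit."""
    lower, upper = iqr_fences(values, k)
    return [x for x in values if x < lower or x > upper]

colony_counts = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81]

print(iqr_fences(colony_counts))
print(find_outliers(colony_counts))
```

For these counts the limits are 48 and 64, so only the value 81 is flagged. The same functions can be applied to raw data from any of the other examples.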


 A scatter plot displays the relationship between two
continuous variables
 Useful in the early stage of analysis when
exploring data and determining if a linear
regression analysis is appropriate
 May show potential outliers in your data



