
Cebu Doctors’ University

College of Arts and Sciences


Physical Sciences Department

MODULE 1.2 – DATA ANALYSIS


Compiled by: Joselito R. Tumulak Jr., RChT, MS (cand.)
Analytical Chemistry Professor

INTENDED LEARNING OUTCOMES


At the end of this module, you will be able to:
1. Apply measures of central tendency and dispersion to data gathered from chemical analysis in
order to summarize and describe the data;
2. Explain the impact of experimental errors on the precision and accuracy of the data; and
3. Understand the impact of outliers on measures of central tendency and dispersion, and how to
identify and handle outliers in data analysis;

UNIT OUTLINE
Topic
I. Characterizing Measurements and Results
A. Measures of Central Tendency
B. Measures of Dispersion
II. Characterizing Experimental Errors
A. Accuracy and Precision
B. Types of Experimental Errors
C. Treatments of Systematic Errors
D. Treatment of Outliers

I. CHARACTERIZING MEASUREMENTS AND RESULTS

 In order to improve the reliability and to obtain information about the variability of results, replicates
of a sample are usually carried through an entire analytical procedure. Individual results from a set
of measurements are seldom the same, so we usually consider the best estimate to be the central
value for the set.
 The best estimates for replicates are justified in two ways:
1. The central value of a set should be more reliable than any of the individual results.
2. An analysis of the variation in the data allows us to estimate the uncertainty associated
with the central value.
 To help us understand this, let us suppose you want to know the mass of a 5-peso coin. You gather
preliminary data by weighing a total of seven coins, whose masses are shown below. As expected,
the values will differ, so we cannot use the mass of a single coin to draw a specific
conclusion about the mass of any other coin. We can, however, use the justifications above to
characterize these data.
Coin Mass (g)
1 3.080
2 3.094
3 3.107
4 3.056
5 3.112
6 3.174
7 3.198
Table 1. Masses of Seven 5-Peso Coins

Page 1 of 8
A. Measures of Central Tendency

 A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. Using this characterization for the previous
problem of knowing the mass of a 5-peso coin, we assume that the masses of individual coins are
scattered around a central value that provides the best estimate of a coin’s true mass.
 Two common ways to report this estimate of central tendency:
1. Mean
- The mean, X̄, is the numerical average obtained by dividing the sum of the individual
measurements by the number of measurements:

X̄ = (ΣXᵢ) / n

where Xᵢ is the i-th measurement and n is the number of independent
measurements.

Example:
What is the mean for the data in Table 1?

1. Add all the results.

3.080 + 3.094 + 3.107 + 3.056 + 3.112 + 3.174 + 3.198 = 21.821

2. Divide the sum by the number of measurements.

X̄ = 21.821 / 7 = 3.117 g

- The mean is the most common estimator of central tendency. However, it is not a
very robust estimator because extreme measurements, those much larger or smaller
than the remainder of the data, strongly influence the mean’s value. For example,
mistakenly recording the mass of the third coin as 31.07 g instead of 3.107 g
changes the mean from 3.117 g to 7.112 g.
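The mean calculation above, and its sensitivity to a single extreme value, can be sketched with Python's standard statistics module (an illustrative check, not part of the original module):

```python
import statistics

# Masses of the seven 5-peso coins from Table 1, in grams
masses = [3.080, 3.094, 3.107, 3.056, 3.112, 3.174, 3.198]

print(round(statistics.mean(masses), 3))  # 3.117

# Mistyping 3.107 as 31.07 pulls the mean far from the rest of the data
mistyped = [3.080, 3.094, 31.07, 3.056, 3.112, 3.174, 3.198]
print(round(statistics.mean(mistyped), 3))  # 7.112
```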

2. Median
- The median, 𝑋𝑚𝑒𝑑 , is the middle value when data are ordered from the smallest to
the largest value. When the data include an odd number of measurements, the
median is the middle value. For an even number of measurements, the median is the
average of the n/2 and the (n/2) + 1 measurements, where n is the number of
measurements.

Example:
What is the median for the data in Table 1?

1. Order the data from the smallest to the largest value.


3.056 3.080 3.094 3.107 3.112 3.174 3.198
Since there is a total of seven measurements, the median is the fourth
value in the ordered data set; thus, the median is 3.107 g.

- As shown in the examples, the mean and median provide similar estimates of central
tendency when all data are similar in magnitude. The median, however, provides a
more robust estimate of central tendency since it is less sensitive to measurements
with extreme values. For example, introducing the transcription error discussed
earlier for the mean only changes the median’s value from 3.107 g to 3.112 g.
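The median calculation, its robustness to the same transcription error, and the rule for an even number of measurements can be sketched the same way:

```python
import statistics

masses = [3.080, 3.094, 3.107, 3.056, 3.112, 3.174, 3.198]
print(statistics.median(masses))  # 3.107 (middle value of 7 ordered results)

# The same 3.107 -> 31.07 transcription error barely moves the median
mistyped = [3.080, 3.094, 31.07, 3.056, 3.112, 3.174, 3.198]
print(statistics.median(mistyped))  # 3.112

# For an even number of measurements, the median averages the two middle values
print(statistics.median([1.0, 2.0, 3.0, 4.0]))  # 2.5
```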

B. Measures of Dispersion

 If the mean or median provides an estimate of a coin’s true mass, then the dispersion (spread) of
the individual measurements must provide an estimate of the variability in the masses of individual
coins. Although dispersion is often defined relative to a specific measure of central tendency, its
magnitude is independent of the central value. Changing all measurements in the same direction,
by adding or subtracting a constant value, changes the mean or median, but will not change the
magnitude of the spread.
 Three common measures of dispersion are:
1. Range
- The range, w, is the difference between the largest and smallest values in the data set.
Range = w = Xlargest − Xsmallest
- The range provides information about the total variability in the data set, but does not
provide any information about the distribution of individual measurements.

Example:
What is the range of data set in Table 1?
𝑤 = 3.198 𝑔 − 3.056 𝑔 = 0.142 𝑔

2. Standard Deviation
- The absolute standard deviation, s, describes the spread of individual measurements
about the mean and is given as
s = √[ Σ(Xᵢ − X̄)² / (n − 1) ]

where Xᵢ is the i-th measurement, n is the number of measurements, and X̄ is the
mean.
- Frequently, the relative standard deviation, sr, is reported:

sr = s / X̄

- The percent relative standard deviation is obtained by multiplying sr by 100%.

Example:
What are the standard deviation, the relative standard deviation, and the percent relative
standard deviation for the data in Table 1?
1. To calculate the standard deviation:
o Obtain the difference between each measurement and the mean value
(3.117 g from the previous example), square the resulting differences, and
add them to determine the sum of the squares.
(3.080 – 3.117)2 = (–0.037)2 = 0.00137
(3.094 – 3.117)2 = (–0.023)2 = 0.00053
(3.107 – 3.117)2 = (–0.010)2 = 0.00010
(3.056 – 3.117)2 = (–0.061)2 = 0.00372
(3.112 – 3.117)2 = (–0.005)2 = 0.00003
(3.174 – 3.117)2 = (+0.057)2 = 0.00325
(3.198 – 3.117)2 = (+0.081)2 = 0.00656
Sum = 0.01556
o Divide the sum of the squares by n – 1, where n is the number of
measurements, and take the square root.
s = √(0.01556 / (7 − 1)) = 0.051

2. The relative standard deviation and percent relative standard deviation are:
sr = 0.051 / 3.117 = 0.016

sr(%) = 0.016 × 100% = 1.6%

3. Variance
- Another common measure of dispersion is the square of the standard deviation, or the
variance (s2).

Example:
What is the variance for the data in Table 1?
s² = (0.051)² = 0.0026
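The three dispersion measures worked out above (range, standard deviation with its relative forms, and variance) can be sketched with Python's standard statistics module, which uses the same n − 1 divisor:

```python
import statistics

masses = [3.080, 3.094, 3.107, 3.056, 3.112, 3.174, 3.198]
mean = statistics.mean(masses)

w = max(masses) - min(masses)     # range
s = statistics.stdev(masses)      # sample standard deviation (n - 1 divisor)
s_r = s / mean                    # relative standard deviation
s2 = statistics.variance(masses)  # variance, s^2

print(round(w, 3))          # 0.142
print(round(s, 3))          # 0.051
print(round(s_r, 3))        # 0.016
print(round(s_r * 100, 1))  # 1.6  (percent relative standard deviation)
print(round(s2, 4))         # 0.0026
```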

II. CHARACTERIZING EXPERIMENTAL ERRORS

A. Accuracy and Precision

 Accuracy is a measure of how close a measure of central tendency is to the true (expected)
value, μ. It is usually expressed either as absolute error, E:

E = X̄ − μ

or as percent relative error, Er:

Er = [(X̄ − μ) / μ] × 100%
1. Although the mean is used as the measure of central tendency in the equations above, the
median could also be used.
 Precision refers to how close the measurements in a set are to one another. It gives a
measure of the reproducibility and repeatability of the measurement. The measures of dispersion
(range, standard deviation, or variance) discussed earlier are used to express precision.
 In simpler terms, when data are accurate, they are correct. When data are precise, they are
consistent.

Figure 1. Illustration of Accuracy and Precision

 We can determine precision just by measuring replicate samples. Accuracy is often more difficult
to determine because the true value is usually unknown. An accepted value must be used instead.
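As a quick arithmetic check, the error formulas above can be sketched in Python. The 19.8 ppm and 20.0 ppm figures are the iron-determination values quoted in the next section, used here only as sample inputs:

```python
# Measured mean and expected value for an iron determination
# (19.8 ppm Fe found against an expected 20.0 ppm Fe)
x_bar = 19.8
mu = 20.0

E = x_bar - mu                 # absolute error
E_r = (x_bar - mu) / mu * 100  # percent relative error

print(round(E, 1))    # -0.2  (ppm Fe)
print(round(E_r, 1))  # -1.0  (%)
```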

B. Types of Experimental Errors

 A chemical analysis is affected by three types of errors:


1. Systematic (Determinate) Errors
- Affect the accuracy of an analysis and are characterized by a systematic
deviation from the true value.
- Have definite values and assignable causes.
- For example, the data below show results from six replicate determinations of iron
(Fe) in an aqueous solution containing 20.0 ppm Fe. The mean value of the six
determinations is 19.8 ppm, which means that it has a systematic error of -0.2 ppm
Fe.

Figure 2. Results of 6-replicate iron determination of solution containing 20 ppm Fe

- In general, a systematic error in a series of replicate measurements causes all the
results to be too high or too low. An example of a systematic error is the loss of a
volatile analyte while heating a sample.

2. Random (Indeterminate) Errors


- Cause data to be scattered more or less symmetrically around a mean value.
- Do not affect each determination in the same manner.
- For example, in Figure 1, the first and second dart boards illustrate data sets with
smaller random errors than the third and fourth dart boards.
- In general, then, the random error in a measurement is reflected by its precision.

3. Gross Errors
- Usually occur only occasionally, are often large, and may cause a result to be either
high or low. They are often the product of human errors.
- Gross errors lead to outliers, results that appear to differ markedly from all other
data in a set of replicate measurements. We will deal with outliers later in the text.
- For example, if part of a precipitate is lost before weighing, analytical results will be
low.

C. Treatments of Systematic Errors

 As mentioned, systematic errors have a definite value, an assignable cause, and are of the same
magnitude for replicate measurements made in the same way. They lead to bias in
measurement results. Bias measures the systematic error associated with an analysis. It has a
negative sign if it causes the results to be low and a positive sign otherwise.
 There are three types of systematic errors:
1. Instrumental errors
- Caused by nonideal instrument behavior, by faulty calibrations, or by use under
inappropriate conditions.

- For example, pipets, burets, and volumetric flasks may hold or deliver volumes
slightly different from those indicated by their graduations. Electronic instruments are
also subject to systematic errors. These can arise from several sources, such as
when voltage of a battery-operated power supply decreases with use or when
instruments are not calibrated frequently or if they are calibrated incorrectly.

2. Method errors
- Arise from nonideal chemical or physical behavior of analytical systems.
- Such sources of nonideality include the slowness of some reactions, the
incompleteness of others, the instability of some species, the lack of specificity of
most reagents, and the possible occurrence of side reactions that interfere with the
measurement process.
- Errors inherent in a method are often difficult to detect and are thus the most serious
of the three types of systematic error.

3. Personal errors
- Result from the carelessness, inattention, or personal limitations of the experimenter.
- For example, the endpoint of a titration is signaled by an evident change in the
coloration of the sample being titrated. An analyst who is insensitive to this color
change will tend to use excess reagent.
- A universal source of personal error is prejudice. Most of us, no matter how honest,
have a natural, subconscious tendency to estimate scale readings in a direction that
improves the precision in a set of results. Alternatively, we may have a preconceived
notion of the true value for the measurement. We then subconsciously cause the
results to fall close to this value.

 One or more of the following steps can be taken to recognize and adjust for a systematic error in an
analytical method:
1. Periodic Calibration of Instruments
- Periodic calibration of equipment treats instrumental errors because the response of
most instruments changes with time as a result of component aging, corrosion, or
mistreatment.

2. Analysis of Standard Samples


- The best way to estimate the bias of an analytical method is by analyzing standard
reference materials (SRMs), materials that contain one or more analytes at known
concentration levels.
- SRMs can be purchased from a number of governmental and industrial sources, such
as the National Institute of Standards and Technology (NIST).
- The components in SRMs have been determined in one of three ways:
i. by analysis with a previously validated reference method,
ii. by analysis by two or more independent, reliable measurement methods,
iii. by analysis by a network of cooperating laboratories that are technically
competent and thoroughly knowledgeable about the material being tested.

3. Independent Analysis
- If standard samples are not available, a second independent and reliable
analytical method can be used in parallel with the method being evaluated. The
independent method should differ as much as possible from the one under study.
This practice minimizes the possibility that some common factor in the sample has
the same effect on both methods.

4. Blank Determination
- A blank contains the reagents and solvents used in a determination, but no analyte.
Often, many of the sample constituents are added to simulate the analyte
environment, which is called the sample matrix.
- In a blank determination, all steps of the analysis are performed on the blank material.
The results are then applied as a correction to the sample measurements.
- Blank determinations reveal errors due to interfering contaminants from the reagents
and vessels employed in the analysis.

D. Treatment of Outliers

 As mentioned previously, gross errors lead to outliers: occasional results in replicate
measurements that differ markedly from the rest of the results.
 You do not automatically reject an outlier. Rejection is usually based on a significance test; a
common one is Dixon’s Q-test. The Q-test compares the difference between the suspected
outlier and its nearest numerical neighbor to the range of the entire data set.
 Using the Q-Test:
1. Rank data from smallest to largest so that the suspected outlier is either the first or the last
data point.
2. Calculate the test statistic, Qexp:
- If the suspected outlier has the lowest value (X₁):

Qexp = (X₂ − X₁) / (Xₙ − X₁)

- If the suspected outlier has the highest value (Xₙ):

Qexp = (Xₙ − Xₙ₋₁) / (Xₙ − X₁)

where n is the number of members in the data set, including the suspected outlier.
3. Compare Qexp with the critical value, Qcrit, at a particular significance level. If Qexp
is greater than Qcrit, the outlier is rejected. If Qexp is less than Qcrit, the outlier is
retained.

Table 2. Q Critical Values (columns: n and Qcrit at the 90%, 95%, 96%, 98%, and 99%
confidence levels)

Example:
The following masses, in grams, were recorded in another experiment to
determine the mass of a 5-peso coin.

3.067 3.049 3.039 2.514 3.048 3.079 3.094 3.109 3.102

Determine if the value 2.514 g is an outlier at 95% confidence level.

1. Place the masses in order from smallest to largest:

2.514 3.039 3.048 3.049 3.067 3.079 3.094 3.102 3.109

2. Calculate Qexp:

Qexp = (X₂ − X₁) / (X₉ − X₁) = (3.039 − 2.514) / (3.109 − 2.514) = 0.882
3. Compare Qexp with Qcrit.
The critical value for Q at 95% confidence for n = 9, based on Table 2, is 0.493. Since
Qexp > Qcrit, the value 2.514 g is considered an outlier and can be rejected.
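The steps of the Q-test can be sketched in Python. The helper function dixon_q below is illustrative, and the hard-coded Qcrit of 0.493 is the 95% value for n = 9 quoted in the example; in practice Qcrit is read from a table for the actual n and confidence level:

```python
def dixon_q(data):
    """Return (Q_exp, suspect value) for the more extreme end of the data."""
    x = sorted(data)
    w = x[-1] - x[0]              # range of the full data set
    q_low = (x[1] - x[0]) / w     # suspect is the smallest value, X1
    q_high = (x[-1] - x[-2]) / w  # suspect is the largest value, Xn
    return (q_low, x[0]) if q_low >= q_high else (q_high, x[-1])

masses = [3.067, 3.049, 3.039, 2.514, 3.048, 3.079, 3.094, 3.109, 3.102]
q_exp, suspect = dixon_q(masses)
print(round(q_exp, 3), suspect)  # 0.882 2.514

Q_CRIT = 0.493  # Qcrit at 95% confidence for n = 9, as quoted above
print("reject" if q_exp > Q_CRIT else "retain")  # reject
```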
