
Lesson 1: Data Management and Its Descriptive Summary Measures

Data Management and Its Descriptive Summary Measures (Central Tendency and Dispersion)

Statistics, as a field of science, involves the collection, organization or presentation, analysis, and
interpretation of data. It is generally divided into two main divisions. The first, called descriptive
statistics, covers the collection, organization or presentation, and analysis of data. Its main goal is to
provide a description of a data set. The second, called inferential statistics, is concerned with
interpreting and drawing conclusions or generalizations from the analysis of random sample data.
Measures of central tendency, dispersion, and relative position are used to describe or summarize a
sample data set.

The Measures of Central Tendency

A measure of central tendency is a statistic, obtained from a set of observations or scores, that
represents the data set as a whole. It is often useful to find a single numerical value located at the
center of the distribution of the data set. It is also defined as the tendency of the observations or
scores to cluster about a single point.

Example: Jonecis is planning to put up a business that sells different brands of computer printers. He
conducted a survey on the prices of different brands of printers in the market and made projections of
the possible selling price for each item once he starts the business. The data below present the
projected selling prices for the various brands of printers.

Brand             A            B            C            D            E

Price in Pesos    Php 20,990   Php 14,990   Php 16,484   Php 15,799   Php 21,984

The central tendency of the selling prices for the different brands of printers is the average selling
price of the five items, a “center” or “central” value about which the different amounts tend to
cluster.

There are three commonly used measures of central tendency that locate the center or central value of
a given data set.

The Mean
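For the printer data, the mean selling price is the sum of the five prices divided by 5:

$$\bar{x} = \frac{20{,}990 + 14{,}990 + 16{,}484 + 15{,}799 + 21{,}984}{5} = \frac{90{,}247}{5} = 18{,}049.40$$

This summary measure suggests that the average selling price Jonecis can expect for his proposed
business is about Php 18,049.40.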
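As a quick check of the arithmetic, the mean can be computed in a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the lesson.

```python
# Projected selling prices (Php) for printer brands A to E, from the table above.
prices = [20990, 14990, 16484, 15799, 21984]

# Arithmetic mean: the sum of the values divided by the number of values.
mean_price = sum(prices) / len(prices)

print(f"Mean selling price: Php {mean_price:,.2f}")  # Php 18,049.40
```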

Example: JM is a college instructor handling classes in human anatomy. He recently conducted a study
involving his classes in the subject. He is interested in determining the average academic performance
of his entire class, which is composed of two sections. He randomly selected the same number of students
from each of his two sections to make up the sample and recorded their respective prelim grades, shown
as follows.

Student         1     2     3     4     5     6     7     8     9     10

Prelim Grade    2.2   1.9   1.7   2.0   1.8   1.5   1.4   2.1   1.6   2.3
 

Solution: The students involved in the study represent a random sample of the study population.
Hence, the sample mean is appropriate for describing the average performance of the sampled
students. It is computed as follows.

$$\bar{x} = \frac{\sum x}{n} = \frac{2.2 + 1.9 + 1.7 + 2.0 + 1.8 + 1.5 + 1.4 + 2.1 + 1.6 + 2.3}{10} = \frac{18.5}{10} = 1.85 \approx 1.9$$

Therefore, the average performance of the sampled students in his two classes in Human Anatomy
is about 1.9.

Another measure of central tendency is the median.

The Median

The median is the middlemost value in an ordered array of data. It is unaffected by extreme values in
a data set. Hence, whenever an extreme value is present, it is more appropriate to use the median than
the mean to describe the data set.

To find the median of a given data set, arrange the scores in an array, that is, in order of increasing
numerical value, and apply the following formulas:

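Case 1: Even number of observations

$$Md = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}$$

Case 2: Odd number of observations

$$Md = x_{((n+1)/2)}$$

Here $x_{(k)}$ denotes the $k$th value in the ordered array and $n$ is the number of observations.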

Example: The following data represent the number of hours that 30 hospital patients slept following
the administration of a certain anesthetic.

            7          10        12        4          8          7          3          8          5

            12        11        3         8          1          1          13        10        4

            4          5         5         8          7          7          3          2          3

            3          1         17

            Determine the median measure of the data.

Solution: The ordered sequence of the number of hours slept by the 30 hospital patients after
administration of a certain anesthetic is given by

1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 11, 12, 12, 13, and 17.

Since the data set has an even number of scores, the location of the median value is determined
using Case 1: the median is the average of the two middle scores. Thus, for n = 30, the median Md is
the average of the 15th and 16th scores in the ordered sequence, that is, Md = (5 + 7)/2 = 6.
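The two cases for locating the median can be sketched in Python as follows. This is an illustrative check, not part of the original lesson; the standard-library `statistics.median` function is used only to confirm the hand computation.

```python
import statistics

# Hours slept by the 30 patients, in the order listed in the example.
hours = [7, 10, 12, 4, 8, 7, 3, 8, 5,
         12, 11, 3, 8, 1, 1, 13, 10, 4,
         4, 5, 5, 8, 7, 7, 3, 2, 3,
         3, 1, 17]

ordered = sorted(hours)
n = len(ordered)

if n % 2 == 0:
    # Even n: average the scores at 1-based positions n/2 and n/2 + 1.
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Odd n: take the score at 1-based position (n + 1)/2.
    median = ordered[n // 2]

print(n, median)                  # 30 6.0
print(statistics.median(hours))   # same value from the standard library
```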
The Mode

          The mode is the value in a data set that appears most frequently. Unlike the arithmetic mean, the
mode is not easily affected by the occurrence of any extreme values.

Example: The mode of the ordered sequence of the number of hours slept by the 30 hospital patients
after administration of a certain anesthetic is 3, since in the sequence

1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 11, 12, 12, 13, and 17,

the score 3 has the highest frequency of occurrence (it appears five times).
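Counting frequencies by hand is error-prone for longer lists, so the mode can be checked with a short sketch like the one below. It uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# Ordered sequence from the example above.
ordered = [1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5,
           7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 11, 12, 12, 13, 17]

# Tally how often each score occurs, then take the most frequent one.
counts = Counter(ordered)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 3 5
```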

The Weighted Mean

The weighted mean is a variation of the arithmetic mean in which each score is assigned a weight
that reflects its importance relative to the other scores.

The weighted mean can be obtained using the formula

$$\bar{x}_w = \frac{\sum (x \cdot w)}{\sum w}$$

where $x$ is a score and $w$ is the weight assigned to it.

Example: The table below shows JM’s grades during the prelim term of the first semester.

Course                               Prelim Grade    Course Units

Mathematics in the Modern World      1.4             3

Biostatistics                        1.8             3

Understanding the Self               1.5             3

Purposive Communication              1.3             3

Science Technology and Society       1.9             3

Life and Works of Rizal              2.0             3

Art Appreciation                     2.1             3

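Solution: Each course is worth 3 units. Hence, the sum of all credit units is 21 units for the seven
courses. Thus, JM’s grade point average (GPA) during the prelim term of the first semester is given by

$$\text{GPA} = \frac{3(1.4 + 1.8 + 1.5 + 1.3 + 1.9 + 2.0 + 2.1)}{21} = \frac{36}{21} \approx 1.71$$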
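The GPA computation is a weighted mean with the course units as weights. A minimal Python sketch, using the grades and units from the table (the data structure and names are illustrative), is:

```python
# (course, prelim grade, units) taken from the table in the example.
courses = [
    ("Mathematics in the Modern World", 1.4, 3),
    ("Biostatistics", 1.8, 3),
    ("Understanding the Self", 1.5, 3),
    ("Purposive Communication", 1.3, 3),
    ("Science Technology and Society", 1.9, 3),
    ("Life and Works of Rizal", 2.0, 3),
    ("Art Appreciation", 2.1, 3),
]

# Weighted mean: sum of (grade x units) divided by the sum of units.
total_units = sum(units for _, _, units in courses)
weighted_sum = sum(grade * units for _, grade, units in courses)
gpa = weighted_sum / total_units

print(total_units, round(gpa, 2))  # 21 1.71
```

Because every course here carries the same weight, the weighted mean equals the ordinary mean of the seven grades; the two differ only when the units differ.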
Measures of Dispersion

The measures of dispersion determine the amount of variation or spread in the data. They are helpful
in detecting inconsistencies or insufficiencies of values in the data set.

Consider, for instance, two coffee dispenser machines that supply coffee to customers. The table below
shows the amount of coffee, in oz, that each machine dispensed into five cups.

Machine Dispenser 1 (oz)    Machine Dispenser 2 (oz)

8.55                        8.09

6.54                        7.95

10.50                       7.90

5.99                        8.05

8.42                        8.01

The mean amount of coffee dispensed per cup by machine 1 is 8 oz. However, the amount of coffee
dispensed per cup is very inconsistent. Some cups seem to overflow while others receive a much smaller
amount of coffee. This suggests that the first machine dispenser needs calibration. On the other hand,
machine dispenser 2, whose mean is also 8 oz, is very consistent in the amount of coffee dispensed,
with only very small deviations among the values. This indicates that the second machine dispenser
needs no further calibration, for it works well in serving the customers.

The situation discussed above suggests that the mean alone is not enough to describe a data set, for
it carries no information about the deviation or spread of the data values. Thus, there is a need to
introduce further descriptive measures that characterize the spread of the data values, namely the
range and the standard deviation.

The Range

The range is the difference between the largest and smallest observations in a set of data, given by
the formula:

     Range = Xlargest  -  Xsmallest

Example: Find the range of the amount of coffee in oz dispensed in 5 cups from machine dispenser 1.

Solution: The largest amount of coffee dispensed in a cup is 10.50 oz while the smallest amount is 5.99
oz. The range is 10.50 – 5.99 = 4.51 oz.

Although the range is easy to compute, it is very sensitive to extreme values and provides no
information about the spread of the values between the two extreme scores.
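To see how the range separates the two dispensers even though their means agree, the following Python sketch computes both summary measures for each machine. The helper name `value_range` is illustrative and not from the lesson.

```python
# Amount of coffee (oz) dispensed into five cups by each machine.
machine_1 = [8.55, 6.54, 10.50, 5.99, 8.42]
machine_2 = [8.09, 7.95, 7.90, 8.05, 8.01]

def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

for name, cups in (("Machine 1", machine_1), ("Machine 2", machine_2)):
    mean = sum(cups) / len(cups)
    print(f"{name}: mean = {mean:.2f} oz, range = {value_range(cups):.2f} oz")
# Machine 1: mean = 8.00 oz, range = 4.51 oz
# Machine 2: mean = 8.00 oz, range = 0.19 oz
```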

 The Standard Deviation

The standard deviation is a measure of variation that takes into account how all the values in the
data set are distributed. It evaluates how the values fluctuate about the mean and is less sensitive to
extreme values than the range.

Although most statistical computations involve a sample rather than a population, the formulas for
both the sample and the population standard deviation are provided as follows.

Standard Deviations for Samples and Populations

Sample: $$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Population: $$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Here $\bar{x}$ and $n$ are the sample mean and sample size, while $\mu$ and $N$ are the population
mean and population size.

Example 1: Consider the examination scores for the sample students randomly selected in the GE MMW
class.

                                                            15, 17, 19, 23, 25

Find the standard deviation of the sample.
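The worked solution for this example is not shown in the text. A minimal Python sketch of the computation, assuming the usual sample formula with n - 1 in the denominator, is given below; `statistics.stdev` from the standard library uses the same formula and serves as a cross-check.

```python
import math
import statistics

scores = [15, 17, 19, 23, 25]
n = len(scores)

# Step 1: sample mean.
mean = sum(scores) / n                      # 19.8

# Step 2: squared deviations from the mean.
squared_devs = [(x - mean) ** 2 for x in scores]

# Step 3: divide by n - 1 (sample formula) and take the square root.
sample_sd = math.sqrt(sum(squared_devs) / (n - 1))

print(round(sample_sd, 2))                  # 4.15
print(round(statistics.stdev(scores), 2))   # same value from the standard library
```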

Example 2: A study was conducted to determine which of two companies received more consistent job
satisfaction ratings from its employees, based on a survey that used a scale of 1 (strongly
dissatisfied) to 7 (strongly satisfied). The ratings are given below.

Company    Job Satisfaction Rating

           1    2    3    4    5    6

A          7    4    7    5    6    6

B          5    5    6    7    6    6

Which of the two companies received the more consistent job satisfaction ratings from its
employees?

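Solution: The mean job satisfaction rating for each company is 35/6 ≈ 5.83.

Treating the ratings as samples (n - 1 = 5 in the denominator), Company A has a job satisfaction
rating standard deviation of about 1.17, while Company B has a standard deviation of about 0.75.

The variability results suggest that Company B, which has the smaller standard deviation, received
the more consistent job satisfaction ratings from its employees.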
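The comparison can be reproduced with the short Python sketch below. It assumes the ratings are treated as samples, so `statistics.stdev` (which divides by n - 1) is used; the dictionary layout is illustrative.

```python
import statistics

ratings = {
    "A": [7, 4, 7, 5, 6, 6],
    "B": [5, 5, 6, 7, 6, 6],
}

# Both companies share the same mean, so the spread decides consistency.
means = {company: statistics.mean(values) for company, values in ratings.items()}
sds = {company: statistics.stdev(values) for company, values in ratings.items()}

# The company with the smaller standard deviation is the more consistent one.
more_consistent = min(sds, key=sds.get)

for company in ratings:
    print(company, round(means[company], 2), round(sds[company], 2))
print("More consistent:", more_consistent)
```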

The Measures of Relative Position

Suppose a student is taking examinations in Mathematics in the Modern World (MMW) and Art
Appreciation (AA). The student got a score of 45 in MMW and a score of 50 in Art Appreciation. The mean
score of all students taking the MMW exam in the class is 40 with a standard deviation of 5. On the
other hand, the mean score of all students taking the Art Appreciation exam is 45 with a standard
deviation of 8. In which of the two subjects did the student perform better?

The student’s performance in the two subjects cannot be compared right away, since the two sets of
examination scores have different amounts of variability. Hence, for a comparison to be possible, the
student’s scores in MMW and Art Appreciation must first be transformed.

The z – Score

The z-score of a given data value x is the number of standard deviations that the value lies above
or below the mean. The transformation of an x-score to a z-score is defined by the equations

$$z = \frac{x - \mu}{\sigma} \ \text{(population)}, \qquad z = \frac{x - \bar{x}}{s} \ \text{(sample)}$$

Example: Comparing the performance of the student in MMW and Art Appreciation using the scores
provided above, we have

$$z_{MMW} = \frac{45 - 40}{5} = 1.00, \qquad z_{AA} = \frac{50 - 45}{8} = 0.625$$

The result indicates that the student scored 1.00 standard deviation above the mean in MMW and 0.625
standard deviation above the mean in Art Appreciation. The resulting z-scores suggest that the student
performed better in MMW than in Art Appreciation.
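The z-score comparison is easy to reproduce in code. The sketch below is illustrative only; the function name `z_score` is an assumption and not part of the lesson.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Class statistics and the student's scores from the example.
z_mmw = z_score(45, mean=40, sd=5)
z_aa = z_score(50, mean=45, sd=8)

print(z_mmw, z_aa)  # 1.0 0.625

# The larger z-score marks the relatively better performance.
better = "MMW" if z_mmw > z_aa else "Art Appreciation"
print("Better relative performance:", better)
```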

The Percentiles

Example: Suppose the median annual travel expense of personnel in a certain academic institution is
Php 100,000. If the 85th percentile of the annual travel expenses of personnel is Php 110,000, find the
percentage of personnel whose annual travel expenses were:

i) greater than Php 100,000

ii) less than Php 110,000


iii) between Php 100,000 and Php 110,000

Solution:

i) Since the median is the same as the 50th percentile, 50% of the personnel incurred travel
expenses greater than Php 100,000.

ii) Since Php 110,000 is the 85th percentile, 85% of the personnel in the academic institution
incurred travel expenses less than Php 110,000.

iii) Using parts (i) and (ii), 85% - 50% = 35% of the personnel incurred travel expenses between
Php 100,000 and Php 110,000.

The Percentile of a Data Value

Example: In a licensure examination for civil engineers given to 1500 students, Yurie scored 550, which
is higher than the scores of 900 students who took the examination. What is the percentile of Yurie’s
examination score?
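The solution is not shown in the text. A minimal sketch follows, assuming the common textbook definition that the percentile of a score is the number of scores below it divided by the total number of scores, times 100. The function name is hypothetical.

```python
def percentile_of_value(count_below, total):
    """Percentile of a score, assuming (scores below it) / (all scores) * 100."""
    return 100 * count_below / total

# Yurie scored higher than 900 of the 1500 examinees.
yurie_percentile = percentile_of_value(900, 1500)

print(f"Yurie's score is at the {yurie_percentile:.0f}th percentile")  # 60th
```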

The Quartiles

Example: Consider the data below, which represent the number of dining rooms occupied in a beach
resort over a 15-day period.

100      95        65        80        88        89        91        70       

72        55        60        99        84        78        75       

Find the first, second and third quartile of the data.

Solution: Listing the data values in array form, we have

55, 60, 65, 70, 72, 75, 78, 80, 84, 88, 89, 91, 95, 99, 100
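The quartile values are not shown in the text. The sketch below assumes the median-of-halves method: the second quartile is the median of all the data, and the first and third quartiles are the medians of the values below and above it. Other quartile conventions exist and can give slightly different results on other data sets.

```python
import statistics

rooms = [100, 95, 65, 80, 88, 89, 91, 70,
         72, 55, 60, 99, 84, 78, 75]

ordered = sorted(rooms)
n = len(ordered)

# Second quartile: the median of the whole data set.
q2 = statistics.median(ordered)

# Lower and upper halves, excluding the middle value when n is odd.
lower_half = ordered[: n // 2]
upper_half = ordered[(n + 1) // 2:]

q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)

print(q1, q2, q3)  # 70 80 91
```

Together with the smallest value (55) and the largest value (100), these three quartiles give the five numbers needed for a box-and-whisker plot.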

Summarizing Data Using a Box-and-Whisker Plot

The box-and-whisker plot is a graphical way of providing a visual summary of a set of data that
involves the median, quartiles, and extreme values, which characterize the distribution. The following
figure illustrates the components of a box-and-whisker plot.

The Use of Stem-and-Leaf Diagrams in Organizing Data

The stem-and-leaf display is a tabular way of organizing data. It is formed by splitting each data
value into two parts. The “tens” digit forms the “stem” while the units digit forms the “leaf”. The
stem can also be extended to hundreds, thousands, and so on as the leading digits, depending on the
values in a given data set.

Example: The following table shows the ages of customers who own a motorcycle. Construct a
stem-and-leaf display for the data.

                                    31        42        62        21        60        66        68

                                    63        54        22        37        42        75        61

                                    44        52       33        36        47        45        51

                                    51        28        39        43        52        53        33

Solution: A stem-and-leaf diagram can be constructed by writing all the stems in a column in ascending
order while indicating the corresponding leaf to the right of the vertical line as illustrated in the figure
below.

Stems | Leaves

2     | 1 2 8

3     | 1 7 3 6 9 3

4     | 2 2 4 7 5 3

5     | 4 2 1 1 2 3

6     | 2 0 6 8 3 1

7     | 5
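The construction can be sketched in Python by splitting each age into its tens digit (stem) and units digit (leaf). This is an illustrative sketch; leaves are kept in the order the ages appear in the table, matching the display above.

```python
from collections import defaultdict

# Ages read row by row from the table in the example.
ages = [31, 42, 62, 21, 60, 66, 68,
        63, 54, 22, 37, 42, 75, 61,
        44, 52, 33, 36, 47, 45, 51,
        51, 28, 39, 43, 52, 53, 33]

stems = defaultdict(list)
for age in ages:
    stem, leaf = divmod(age, 10)   # tens digit, units digit
    stems[stem].append(leaf)

# Print the stems in ascending order with their leaves to the right of the bar.
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Sorting each list of leaves before printing would give an ordered stem-and-leaf display, which makes the median and quartiles easier to read off.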
Lesson 2: The Normal Distribution and Bivariate Correlation and Regression Analysis

The Normal Distribution and Its Applications


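The normal distribution (Gaussian distribution) is a continuous probability distribution that
describes data whose values cluster about the mean. The graph of the Gaussian distribution is the
normal curve defined by the density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.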

Abraham de Moivre (1667-1754) derived the normal curve mathematically in 1733 as an approximation
to the binomial distribution; his work was rediscovered by Karl Pearson (1857-1936) in 1924.
Pierre-Simon Laplace (1749-1827) used the normal curve to describe the distribution of errors in 1783,
and in 1809 Carl Friedrich Gauss (1777-1855) used it in analyzing astronomical data. In modern times,
the normal curve, also known as the bell-shaped curve, plays an important role in modeling real-life
problems involving normally distributed data values.

The normal curve has the following characteristics, which allow users to estimate the probability of
occurrence of a phenomenon represented by the values of a normally distributed variable.

a.      The distribution is bell-shaped.

b.      The normal curve is symmetric about the mean.

c.      The mean, median, and mode coincide.

d.      The total area under the curve is 1 or 100%.

e.      The area that falls within 1 standard deviation of the mean is about 68%, within 2 standard
deviations about 95%, and within 3 standard deviations about 99.7%.

          The figure below shows the area specified for each region.

Example: A survey of 200 rice distribution outlets in the province found that the selling price per sack
of rice is approximately normally distributed with a mean of Php 2,250 and a standard deviation of
Php 25. How many of the rice distribution outlets sell at

i) less than Php 2,200 per sack of rice?

ii) between Php 2,225 and Php 2,275 per sack of rice?

iii) more than Php 2,275 per sack of rice?

Solution:

i) The Php 2,200 selling price per sack of rice is 2 standard deviations below the mean. Since about
47.7% of all the data fall between the mean and 2 standard deviations below the mean, approximately

                                                   (47.7%)(200) = (0.477)(200) ≈ 95

of the distribution outlets sell between Php 2,200 and the mean price of Php 2,250 per sack of rice.
Moreover, half of the 200 outlets, that is, 100 outlets, sell at prices below the mean price of
Php 2,250. Thus, the number of distribution outlets that sell at less than Php 2,200 per sack is about
100 - 95 = 5 outlets.

ii) The Php 2,225 selling price per sack of rice is 1 standard deviation below the mean, while the price
of Php 2,275 is 1 standard deviation above the mean. For a normal distribution, about 68% of all the
data lie within 1 standard deviation of the mean. Hence, approximately (68%)(200) = (0.68)(200) = 136
distribution outlets sell between Php 2,225 and Php 2,275 per sack of rice.

iii) The selling price of Php 2,275 per sack of rice is 1 standard deviation above the mean. In a normal
distribution, about 68% of all the data fall within 1 standard deviation of the mean. This implies that
32% of the data fall either more than 1 standard deviation below the mean or more than 1 standard
deviation above the mean. By symmetry, 16% of the data are more than 1 standard deviation above the
mean, which corresponds to

                                    (16%)(200) = (0.16)(200) = 32 distribution outlets.
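The empirical-rule estimates above can be compared with exact normal areas. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative. Because the 68% and 95% figures are rounded, the exact counts differ slightly from the hand estimates.

```python
from statistics import NormalDist

# Selling price per sack: mean Php 2,250, standard deviation Php 25.
price = NormalDist(mu=2250, sigma=25)
outlets = 200

# Expected number of outlets in each price band (area under the curve x 200).
below_2200 = outlets * price.cdf(2200)
between_2225_2275 = outlets * (price.cdf(2275) - price.cdf(2225))
above_2275 = outlets * (1 - price.cdf(2275))

print(round(below_2200, 2))         # about 4.55, close to the estimate of 5
print(round(between_2225_2275, 2))  # about 136.54, close to the estimate of 136
print(round(above_2275, 2))         # about 31.73, close to the estimate of 32
```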

The Standard Normal Distribution

The standard normal distribution is a normal probability distribution that has a mean of 0 and a
standard deviation of 1. It results from transforming normally distributed x-scores into z-scores
through the formula:

$$z = \frac{x - \mu}{\sigma}$$

Example: The average age of marketing supervisors is 38 years old. Assume that the ages are normally
distributed. If the standard deviation is 6 years, find the proportion of marketing supervisors whose
ages are between 35 and 42 years old.

Solution: The z-scores for the x-scores of 35 and 42 years old are computed as

$$z_{35} = \frac{35 - 38}{6} = -0.50, \qquad z_{42} = \frac{42 - 38}{6} \approx 0.67$$

This means that 30.85% of the data lie below z = -0.50 while 74.86% of the data values fall below
z = 0.67. Since the proportion of marketing supervisors to be determined is between 35 and 42 years
old, or equivalently between z = -0.50 and z = 0.67, we have 74.86% - 30.85% = 44.01%. Thus, the
proportion of marketing supervisors whose ages are between 35 and 42 years old is 44.01%.
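The table lookups in this solution can be checked with the standard normal distribution from Python's `statistics` module. In this sketch the upper z-score is rounded to two decimals, as it would be when reading a z-table; that rounding is an assumption made to match the figures in the text.

```python
from statistics import NormalDist

mean_age, sd_age = 38, 6

# Convert the two ages to z-scores.
z_low = (35 - mean_age) / sd_age               # -0.5
z_high = round((42 - mean_age) / sd_age, 2)    # 0.67 after rounding

# Areas to the left of each z-score under the standard normal curve.
std_normal = NormalDist()                      # mean 0, standard deviation 1
area_low = std_normal.cdf(z_low)
area_high = std_normal.cdf(z_high)

proportion = area_high - area_low
print(round(area_low, 4), round(area_high, 4), round(proportion, 4))
```

The printed areas agree with the 30.85% and 74.86% quoted above to table precision, and their difference is about 44%.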

Remark: The proportion of values bounded by z-scores under the standard normal curve can also be
determined using the z-table, which gives the area under the standard normal curve. A proportion can
also be thought of as a percentage, a probability, or an area when dealing with problems involving
normally or approximately normally distributed data values.

Explore:

1.      Using the Excel “NORMSDIST” function, find

a.      the proportion of marketing supervisors whose ages are greater than 38 years old.

b.      the percentage of marketing supervisors whose ages are less than 40 years old.

2.      Verify the results obtained in 1-a and 1-b using the z-table.

Linear Correlation
Some studies aim to determine whether two variables of interest are related to one another. If so, a
model can be established that characterizes the strength of their relationship and the amount of
impact one variable has on the other.

Linear correlation is a measure used to determine the strength of the linear relationship between
two variables of interest, X and Y. The population linear correlation coefficient is denoted by the
Greek letter rho (ρ). Rho is a parameter, which is estimated by the statistic r, called Pearson’s
Coefficient of Correlation, given by the formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Pearson’s Coefficient of Correlation r can assume values ranging from -1 to 1. The following cases
illustrate the different types of linear correlation between two variables of interest.

Case 1: If r > 0, then the variables of interest, say X and Y, are positively correlated. That is, Y
tends to increase linearly as X increases. If r = 1, then there is a perfect positive linear correlation
between X and Y. The scatter plots below illustrate the relationships r > 0 and r = 1.

Case 2: If r < 0, then there is a negative correlation between X and Y. That is, Y tends to decrease
linearly as X increases. Whenever r = -1, there is a perfect negative correlation between X and Y. The
scatter plots shown as follows illustrate the relationships r < 0 and r = -1.

Case 3: If r = 0, then the two variables of interest X and Y have no linear correlation. This implies
a lack of linearity between X and Y, but not necessarily a lack of association. The following is the
scatter plot for r = 0.

The strength of the linear correlation between two variables of interest can be categorized and
interpreted as follows, depending on the resulting value of Pearson’s Coefficient of Correlation r.
a)      If 0 < |r| < 0.30, then the linear correlation is weak.
b)      If 0.30 < |r| < 0.70, then the linear correlation is moderately strong.
c)      If |r| > 0.70, then the linear correlation is very strong.

Example: A psychologist studied the relationship between the hours of sleep and the number of
minutes it takes college students to complete a problem solving task. A group of 10 students
underwent the experiment, in which their number of hours of sleep was recorded the night
before the day of the treatment. The data obtained are as follows.

Number of Hours of Sleep, X                        7.1   5.3   6.2   7.5   8.7    6.6   8.0    7.1   7.3   5.9

Minutes to Complete the Problem Solving Task, Y    9.6   8.3   9.1   8.8   10.2   9.0   10.0   8.1   9.8   8.0

i) Find the linear correlation coefficient for the relationship between the students’ number
    of hours of sleep and the number of minutes to complete the problem solving task.
ii) Classify the strength of the relationship between the two variables based on the Pearson r
     obtained in (i).
Solution:
i) Using the Pearson’s Coefficient of Correlation formula with n = 10, Σx = 69.7, Σy = 90.9,
Σxy = 638.96, Σx² = 494.95, and Σy² = 831.99,

$$r = \frac{10(638.96) - (69.7)(90.9)}{\sqrt{\left[10(494.95) - (69.7)^2\right]\left[10(831.99) - (90.9)^2\right]}} = \frac{53.87}{\sqrt{(91.41)(57.09)}} \approx 0.75$$

Thus, the linear correlation coefficient for the relationship between the students’ number
of hours of sleep and the number of minutes to complete the problem solving task is 0.75.
ii) The linear correlation coefficient r = 0.75 indicates a very strong positive linear correlation
between the students’ number of hours of sleep and the number of minutes for them to complete
the problem solving task.
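The value r = 0.75 can be verified from the raw data with the computational form of Pearson's formula. The sketch below is illustrative; it reads the two wrapped entries in the Y row of the table as 10.2 and 10.0, which is the reading that reproduces r = 0.75.

```python
import math

sleep_hours = [7.1, 5.3, 6.2, 7.5, 8.7, 6.6, 8.0, 7.1, 7.3, 5.9]   # X
minutes = [9.6, 8.3, 9.1, 8.8, 10.2, 9.0, 10.0, 8.1, 9.8, 8.0]     # Y

n = len(sleep_hours)
sum_x = sum(sleep_hours)
sum_y = sum(minutes)
sum_xy = sum(x * y for x, y in zip(sleep_hours, minutes))
sum_x2 = sum(x * x for x in sleep_hours)
sum_y2 = sum(y * y for y in minutes)

# Pearson's r: (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 2))  # 0.75
```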
The linear correlation coefficient for the relationship between the students’ number of
hours of sleep and the number of minutes to complete the problem solving task can also be
computed using the Excel function given below.

Simple Linear Regression

Example: Consider again the study in which a psychologist determined the relationship
between the hours of sleep and the number of minutes it takes college students to complete a
problem solving task. A group of 10 students underwent the experiment, in which their number
of hours of sleep was recorded the night before the day of the treatment. The data are shown as
follows.

Number of Hours of Sleep, X                        7.1   5.3   6.2   7.5   8.7    6.6   8.0    7.1   7.3   5.9

Minutes to Complete the Problem Solving Task, Y    9.6   8.3   9.1   8.8   10.2   9.0   10.0   8.1   9.8   8.0

 
a) Determine the prediction equation of the regression line for the data set using the
    least squares method.
b) Construct the scatter plot diagram for the data and the regression line.
c) Use the prediction equation to predict the number of minutes to complete the
    problem solving task if a certain student had 6 hours of sleep the night before
    doing the task.
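Solution:
a) The parameters β0 (intercept) and β1 (slope) of the regression line are estimated by the sample
values b0 and b1, respectively, using the least squares method given by

$$b_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b_0 = \bar{y} - b_1\bar{x}$$

b) The scatter plot for the data and the regression line is given by: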
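Parts (a) and (c) can be checked numerically. The sketch below applies the standard least squares formulas for the slope and intercept to the sleep data and then evaluates the fitted line at 6 hours of sleep. It reads the two wrapped Y entries in the table as 10.2 and 10.0, and the variable names are illustrative; the numerical results are computed here and are not quoted from the original text.

```python
sleep_hours = [7.1, 5.3, 6.2, 7.5, 8.7, 6.6, 8.0, 7.1, 7.3, 5.9]   # X
minutes = [9.6, 8.3, 9.1, 8.8, 10.2, 9.0, 10.0, 8.1, 9.8, 8.0]     # Y

n = len(sleep_hours)
sum_x = sum(sleep_hours)
sum_y = sum(minutes)
sum_xy = sum(x * y for x, y in zip(sleep_hours, minutes))
sum_x2 = sum(x * x for x in sleep_hours)

# Least squares estimates of the slope (b1) and intercept (b0).
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)

def predict(hours):
    """Predicted minutes to complete the task: y-hat = b0 + b1 * hours."""
    return b0 + b1 * hours

print(f"y-hat = {b0:.3f} + {b1:.3f} x")
print(f"Predicted minutes after 6 hours of sleep: {predict(6):.2f}")
```

With these data the fitted line is approximately y-hat = 4.982 + 0.589x, so a student with 6 hours of sleep is predicted to need about 8.52 minutes.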

  
