
BASIC CONCEPTS

Statistics deals with scientific methods of collecting, organizing, summarizing, presenting, and
analyzing data, as well as drawing valid conclusions and making reasonable decisions on the basis of this
analysis.

The basic concern in the study of statistics is the presentation and interpretation of chance outcomes that
occur in a planned or scientific investigation.

In order to gain information, a statistician collects data for variables used to describe an event.
Data are the values that the variables can assume. Variables whose values are determined by chance are
called random variables. These data can be used in different ways. There are two types of variables –
qualitative and quantitative. Qualitative variables are words or codes that represent a class or category.
On the other hand, quantitative variables are numbers that represent an amount or a count.

EXERCISES: Classify each variable as quantitative or qualitative.


1. the height of a giraffe living in India
2. the religious affiliation of the people in the Philippines
3. favorite movie
4. the daily intake of proteins
5. the monthly phone bills

Quantitative variables can be further classified as discrete or continuous. Discrete variables can be
assigned values such as 0, 1, 2, 3, … and are said to be countable. On the other hand, continuous
variables can assume any value between two specific values, including fractional values such as 0.5 or
1.2. For example, the length of a wire is a continuous variable, while the number of persons in a room is
a discrete variable.

EXERCISES: Classify each variable as discrete or continuous.


1. The number of loaves of bread baked each day
2. The air temperature in a city yesterday
3. The income of single parents living in Quezon City
4. The weights of newborn infants
5. The capacity (in liters) of water in a swimming pool

DESCRIPTIVE AND INFERENTIAL STATISTICS

Statistical Methods are those procedures used in the collection, presentation, analysis, and interpretation
of data. We shall categorize these methods as belonging to one of two major areas called descriptive
statistics and statistical inference.

Descriptive Statistics comprises those methods concerned with collecting and describing a set of data so
as to yield meaningful information.

Descriptive statistics provides information only about the collected data and in no way draws inferences
or conclusions concerning a larger set of data. The construction of tables, charts, graphs, and other
relevant computations in various newspapers and magazines usually falls under this area.

Statistical Inference comprises those methods concerned with the analysis of a subset of data leading to
predictions or inferences about the entire set of data.

Inferential statistics consists of methods that are used to infer characteristics of a population from
observations on a sample or to formulate general laws on the basis of repeated observations. Considered as
the central function of modern statistics, statistical inference is concerned with two types of problems:
estimation of population parameters and tests of hypotheses.

EXERCISES:

Classify the following as belonging to the area of descriptive or inferential statistics:

1. As a result of recent cutbacks by the oil-producing nations, we can expect the price of gasoline to
double next year.
2. At least 5% of all fires reported last year in a certain city were deliberately set by arsonists.
3. Of all patients who have received this particular type of drug at a local clinic, 60% later
developed significant side effects.
4. Assuming that less than 20% of the Colombian coffee beans were destroyed this past winter, we
should expect an increase of no more than 30 cents for a kilogram of coffee by the end of the
year.
5. As a result of a recent poll, most Americans are in favor of building additional nuclear power
plants.

POPULATIONS AND SAMPLES

A population consists of the totality of the observations with which we are concerned.

A sample is a subset of a population.

EXERCISES: Indicate which of the following examples refer to population or sample.


1. a group of 25 students selected to test a new teaching technique
2. the total machines produced by a factory in one week
3. the yearly expenditures on food for 10 families
4. the ages of employees of all companies in Metro Manila
5. the number of subscribers of telephone companies

PARAMETER AND STATISTIC

Parameter – any numerical value describing a characteristic of a population.

Statistic – any numerical value describing a characteristic of a sample.

LEVEL OF MEASUREMENT

Aside from being classified as qualitative or quantitative, variables can also be classified according to
how they are categorized, counted, or measured.

1. Nominal level
This is characterized by data that consist of names, labels, or categories only. The data cannot be
arranged in an ordering scheme. There is no criterion as to which values can be identified as greater than
or less than other values.

2. Ordinal level
This involves data that may be arranged in some order, but differences between data values
either cannot be determined or are meaningless.

3. Interval level
This is the same as the ordinal level, with an additional property that we can determine
meaningful amounts of differences between the data. Data at this level may lack an inherent zero
starting point.

4. Ratio level
This is an interval level modified to include the inherent zero starting point. The difference and
ratios of data are meaningful. This is also the highest level of measurement.

EXERCISES: Classify each as nominal, ordinal, interval, or ratio-level data


1. social security number
2. the total annual incomes for a sample of families
3. the ages of students enrolled in a cooking class
4. the rankings of tennis players
5. the salaries of fastfood chain attendants
6. the scores of Tom and Jerry in a sample of ten basketball games

CLASSIFICATION OF STATISTICAL TECHNIQUES

A statistical technique may be classified as univariate, bivariate, or multivariate depending on the
number of variables involved in the analysis. The technique is univariate if it applies to a single
variable, bivariate if it applies to two, and multivariate when more than two variables are involved.

The technique is inferential when it involves estimation of population parameters and tests of
hypotheses.

SAMPLING PROCEDURES

Samples can be broken down into two basic types: nonprobability and probability. In the
nonprobability type, there is no way of estimating the probability that each individual or element will be
included in the sample. In probability sampling, in the most frequently encountered situations, each
individual has an equal chance of becoming a part of the sample.

Nonprobability Sampling

The students in a class may constitute the entire sample because they happen to be in a class
whose instructor is interested in doing some research. Such samples are called accidental or incidental
samples. Another type of nonprobability sampling is quota sampling. In this type of sampling, the
proportions of the various subgroups in the population are determined and the sample is drawn (usually
not at random) to have the same percentages in it. The third type is purposive sampling. For example,
the cities in the Philippines that have voted for the winners in a series of past presidential elections could
be identified. We could study these cities, and from the voters’ preferences in them make a prediction on
the outcome of a national election.

The major advantage of samples like these is that they are convenient and economical.

Probability Sampling

The basic type is simple random sampling. In a simple random sample, each individual in the population
has an equal chance of being drawn into the sample. This could be done by drawing lots or by the use of
random numbers. When sampling procedures are not carried out like this, the result is said to be biased.
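
As a rough illustration of drawing lots with random numbers, the sketch below uses Python's standard `random` module. The population of 40 student ID numbers, the sample size of 5, and the function name `simple_random_sample` are assumptions made only for this example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; every element is equally likely to be chosen."""
    rng = random.Random(seed)           # the seed only makes the draw repeatable
    return rng.sample(population, n)    # sampling without replacement

# Hypothetical population: ID numbers 1..40 for a class of 40 students.
students = list(range(1, 41))
sample = simple_random_sample(students, 5, seed=1)
print(sample)
```

Leaving `seed` as `None` gives a different sample on each run, which is what one would normally want outside of a classroom demonstration.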

Systematic sampling selects every kth element in the population for the sample, with the starting point to
be determined at random from the first k elements. Systematic samples are very easy to obtain and are
often used as if they were random samples. In fact, some systematic samples can lead to precise
inferences concerning population parameters simply because the sample values spread evenly over the
entire population. However, the real danger exists if one happens to choose a sampling interval that
corresponds to a hidden periodicity.
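
The selection rule described above can be sketched in a few lines of Python. The list of 100 customers and the interval k = 12 are hypothetical values chosen for illustration; `systematic_sample` is not a library function.

```python
import random

def systematic_sample(population, k, seed=None):
    """Take every k-th element, starting at a random position among the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)            # random start at index 0 .. k-1
    return population[start::k]

customers = list(range(1, 101))         # hypothetical list of 100 customers
picked = systematic_sample(customers, k=12, seed=3)
print(picked)
```

If the list itself repeats with a period related to k (for example, every 12th customer arrives at the same time of day), the sample inherits that pattern, which is the hidden-periodicity danger mentioned above.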

Cluster sampling selects a sample containing either all, or a random selection, of the elements from
clusters that have themselves been selected randomly from the population. It has the advantage of being
more cost efficient when the population is widely scattered. When the clusters are geographic areas,
such as regions of a state, or subdivisions of a large city, the sampling procedure is called area sampling.
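
A minimal sketch of the "take all elements from randomly chosen clusters" variant is given below. The buildings and teacher labels are invented for the example, and `cluster_sample` is a hypothetical helper.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every element in the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical clusters: teachers grouped by the building they work in.
buildings = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3", "c4"],
    "D": ["d1", "d2"],
}
chosen, teachers = cluster_sample(buildings, n_clusters=2, seed=7)
print(chosen, teachers)
```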

Stratified random sampling selects simple random samples from mutually exclusive subpopulations, or
strata, of the population. Here, the population is divided into strata such that the data of interest are fairly
homogeneous within a given stratum. Stratification of a population results in strata of various sizes.
Consideration must therefore be given to the size of the random samples selected from these strata. This
could be done using proportional allocation which chooses sample sizes proportional to the size of the
different strata.

EXERCISES: Classify each sample as random, stratified, systematic, or cluster.
1. Every 12th customer entering a shopping mall is asked to select his or her favorite store.
2. In a university, all teachers from three buildings are interviewed to determine whether they
believe the students have higher grades now than in previous years.
3. Supervisors are selected using random numbers in order to determine annual salaries.
4. A teacher writes the name of each student on a card, shuffles the cards, and then draws five names.
5. A head nurse selects 10 patients from each floor of a hospital.

Sample Sizes for Proportional Allocation. If we divide a population of size $N$ into $k$ strata of sizes
$N_1, N_2, \ldots, N_k$, and select samples of sizes $n_1, n_2, \ldots, n_k$, respectively, from the $k$ strata,
the allocation is proportional if

$$n_i = \frac{N_i}{N}\, n, \quad \text{for } i = 1, 2, 3, \ldots, k,$$

where $n$ is the total size of the stratified random sample.

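
The allocation rule can be tried out with a short Python sketch. The strata names and sizes below are hypothetical (they are not taken from the exercises in this handout), and `proportional_allocation` is an illustrative helper.

```python
def proportional_allocation(strata_sizes, n):
    """Return n_i = (N_i / N) * n for each stratum, before any rounding."""
    N = sum(strata_sizes.values())
    return {name: N_i * n / N for name, N_i in strata_sizes.items()}

# Hypothetical strata sizes, chosen so that the allocations come out whole.
strata = {"first year": 500, "second year": 300, "third year": 200}
alloc = proportional_allocation(strata, n=100)
print(alloc)
```

In practice the values of n_i are rarely whole numbers. They have to be rounded, and the rounded values should be checked so that they still add up to n.
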
THE VARIABLE

A variable is a characteristic or entity that can assume different values. The variation in values for a
given characteristic is the primary concern of statistical description. The total set of values for a
particular characteristic is known as the distribution of the variable. A variable that can theoretically
assume any value between two given values is continuous; otherwise it is discrete.

MEASURE OF CENTRAL LOCATION

One of the important ways of describing a group of measurements, whether it be a sample or a
population, is by the use of an average.

An average is the measure of the center of a set of data when the data are arranged in an
increasing or decreasing order of magnitude.

Measure of Central Location or Measure of Central Tendency – any measure indicating the center
of a set of data, arranged in an increasing or decreasing order of magnitude. The most commonly used
are the mean, median, and mode.

POPULATION MEAN:

If the set of data $x_1, x_2, \ldots, x_n$, not necessarily all distinct, represents a finite population of size $n$,
then the population mean is

$$\mu = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example 1: The following are the scores of 11 graduating students in Math 112 Real Analysis II.
Compute the mean.
65 76 78 83 90 86 45 52 37 56 72

$$\mu = \frac{65 + 76 + 78 + 83 + 90 + 86 + 45 + 52 + 37 + 56 + 72}{11} = \frac{740}{11} \approx 67.3$$

Example 2: The Intelligence Quotients (IQs) of five members of a family are 112, 125, 130, 98, and 96.
Find the mean IQ.

$$\mu = \frac{112 + 125 + 130 + 98 + 96}{5} = \frac{561}{5} = 112.2$$
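
The two computations above can be checked with a few lines of Python. This is only a sketch using the data of Examples 1 and 2; the variable names are chosen for the illustration.

```python
from statistics import mean

scores = [65, 76, 78, 83, 90, 86, 45, 52, 37, 56, 72]   # Example 1
iqs = [112, 125, 130, 98, 96]                            # Example 2

mu_scores = sum(scores) / len(scores)   # 740 / 11
mu_iqs = mean(iqs)                      # 561 / 5

print(round(mu_scores, 1), round(mu_iqs, 1))   # 67.3 112.2
```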

SAMPLE MEAN

If the set of data $x_1, x_2, \ldots, x_n$, not necessarily all distinct, represents a finite sample of size $n$, then
the sample mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example 3: On a midterm exam in Elementary Statistics, 7 students obtained the following grades: 89,
82, 76, 79, 65, 92, and 54. Treating these results as a sample, compute the mean.

$$\bar{x} = \frac{89 + 82 + 76 + 79 + 65 + 92 + 54}{7} = \frac{537}{7} \approx 76.71$$

MEDIAN:

The median of a set of observations arranged in an increasing or decreasing order of magnitude
is the middle value when the number of observations is odd, or the arithmetic mean of the two middle
values when the number of observations is even.

Example 4: The median in Example 2 is 112.

96, 98, 112, 125, 130

Example 5: The median in Example 1 is 72.

37, 45, 52, 56, 65, 72, 76, 78, 83, 86, 90
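
The definition of the median translates directly into code. The sketch below reuses the data of Examples 4 and 5; the last call uses a made-up set of four values to show the even case, and `median_of` is a name chosen for this illustration.

```python
def median_of(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([112, 125, 130, 98, 96]))                        # Example 4: 112
print(median_of([65, 76, 78, 83, 90, 86, 45, 52, 37, 56, 72]))   # Example 5: 72
print(median_of([40, 46, 50, 49]))    # hypothetical even-sized set: 47.5
```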

MODE
The mode of a set of observations is that value which occurs most often or with the greatest
frequency.

The mode does not always exist. This is certainly true when all observations occur with the same
frequency.

Example 6: The numbers of participants present on each day of a five-day workshop-seminar are as
follows: 40, 46, 50, 49, and 46. Find the mode.

The mode is 46.

Example 7: The numbers of incorrect answers on a true-false competency test for a random sample of 16
students were recorded as follows: 2, 1, 3, 0, 1, 3, 3, 6, 0, 3, 3, 5, 2, 1, 4, and 2. Find the
mode.

The mode is 3.
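
Finding the mode amounts to counting how often each value occurs. The sketch below does this for Examples 6 and 7 with `collections.Counter`; the helper `modes` and the convention of returning an empty list when no mode exists are choices made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if all values tie."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                      # no mode: every value occurs equally often
    return sorted(v for v, c in counts.items() if c == top)

print(modes([40, 46, 50, 49, 46]))                               # Example 6: [46]
print(modes([2, 1, 3, 0, 1, 3, 3, 6, 0, 3, 3, 5, 2, 1, 4, 2]))   # Example 7: [3]
print(modes([1, 2, 3]))                                          # []
```
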
Remarks:

MEAN
1. The mean is the most commonly used measure of location in statistics.
2. It is easy to calculate and it employs all the variable information.
3. The disadvantage of the mean is that it is adversely affected by extreme values.
MEDIAN
1. The median is easy to compute if the number of observations is relatively small.
2. It is not affected by extreme values.
MODE
1. The mode is the least used measure of the three.
2. Its value is almost useless for small sets of data.
3. It requires no calculation.
4. It can be used for both quantitative and qualitative data.

MEASURES OF VARIATION
The measures of central location do not give an adequate description of our data. We need to
know how the observations spread out from the average.

The most important statistics for measuring the variability of a set of data are the range and the
variance:

RANGE

The range of a set of data is the difference between the largest and smallest number in a set.

Example 8: The range in Example 2 is $130 - 96 = 34$.

Example 9: The range in Example 1 is $90 - 37 = 53$.
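
As a quick check of Examples 8 and 9, the range can be computed from the largest and smallest values. `data_range` is a name chosen for this sketch.

```python
def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

iqs = [112, 125, 130, 98, 96]                            # Example 2
scores = [65, 76, 78, 83, 90, 86, 45, 52, 37, 56, 72]   # Example 1

print(data_range(iqs), data_range(scores))   # 34 53
```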

The range is very simple to compute. However, it is a poor measure of variation, particularly if
the size of the sample is large. It only considers the extreme values and it tells us nothing about the
distribution of numbers in between.

POPULATION VARIANCE:

Given the finite population $x_1, x_2, \ldots, x_n$, the population variance is

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$

Example 10. In Example 1, solve for the variance, using $\mu \approx 67.27$:

$$\sigma^2 = \frac{1}{11}\Big[(65 - 67.27)^2 + (76 - 67.27)^2 + (78 - 67.27)^2 + (83 - 67.27)^2 + (90 - 67.27)^2 + (86 - 67.27)^2$$
$$\qquad + (45 - 67.27)^2 + (52 - 67.27)^2 + (37 - 67.27)^2 + (56 - 67.27)^2 + (72 - 67.27)^2\Big]$$

$$\sigma^2 = \frac{5.1529 + 76.2129 + 115.1329 + 247.4329 + 516.6529 + 350.8129 + 495.9529 + 233.1729 + 916.2729 + 127.0129 + 22.3729}{11}$$

$$\sigma^2 = \frac{3106.1819}{11} = 282.3802$$

$$\sigma = \sqrt{282.3802} \approx 16.80 \quad \text{(standard deviation)}$$

SAMPLE VARIANCE

Given a random sample $x_1, x_2, \ldots, x_n$, the sample variance is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Example 11. Find the variance in Example 3.

COMPUTING FORMULA FOR SAMPLE VARIANCE:

$$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$$
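
The variance formulas can be compared numerically with a short Python sketch. It recomputes the population variance of Example 10 and then applies both the defining formula and the computing formula for the sample variance to the grades of Example 3. The function names are chosen for this illustration, and the printed sample variance is a computed value, not one quoted from the handout.

```python
scores = [65, 76, 78, 83, 90, 86, 45, 52, 37, 56, 72]   # Example 1 (population)
grades = [89, 82, 76, 79, 65, 92, 54]                    # Example 3 (sample)

def population_variance(data):
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def sample_variance(data):
    x_bar = sum(data) / len(data)
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

def sample_variance_shortcut(data):
    n = len(data)
    return (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))

sigma2 = population_variance(scores)
print(round(sigma2, 2), round(sigma2 ** 0.5, 2))      # 282.38 16.8
print(round(sample_variance(grades), 2))              # 178.57
print(round(sample_variance_shortcut(grades), 2))     # 178.57
```

The computing formula gives the same result as the definition but avoids subtracting the mean from every observation, which is convenient for hand calculation.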

The standard deviation provides a method for converting observed variances to standard form so
that they can be more easily understood and compared. The variance and standard deviation provide the
most powerful estimate of variation because they consider the value of each score.

Remarks:

1. The range is the least reliable of the measures and is used only when one is in a hurry to get a
measure of variability. It may be used for ordinal, interval, or ratio data.

2. The most important measures of variability are the standard deviation and its square, the
variance. The variance is the average of the squared deviation around the mean.

3. The standard deviation is used whenever a distribution approximates a normal distribution. It is
the basis for most statistics used in analysis of data. It is used with interval and ratio data.

FREQUENCY DISTRIBUTIONS AND GRAPHICAL REPRESENTATIONS OF DATA

Important characteristics of a large mass of data can be readily assessed by grouping the data into
different classes and then determining the number of observations that fall in each class. Such an
arrangement in tabular form is called a frequency distribution.

Data that are presented in the form of a frequency distribution are called grouped data. The data of
a sample are often grouped into intervals to produce a better overall picture of the unknown population,
but in so doing the identity of the individual observations is lost.

Example:
Frequency Distribution for the Weights of 50 Pieces of Luggage

Weight          Number of
(kilograms)     Pieces
19-21            7
16-18           19
13-15           14
10-12            8
7-9              2

The smallest and largest values that can fall in a class interval are called class limits. The number of
observations falling in a particular class interval is called the class frequency. The numerical difference
between the upper and lower class boundaries of a class interval is defined to be the class width. The
midpoint of the class interval, called the class mark, is the average of the class limits.

The total frequency of all values less than the upper class boundary of a given class interval is called the
cumulative frequency up to and including that class.

The steps in grouping a large set of data into a frequency distribution may be summarized as
follows:

1. Decide on the number of class intervals required. (We can choose between 5 and 20 class
intervals)
2. Determine the range.
3. Divide the range by the number of classes to estimate the approximate width of the interval.
4. List the lower class limit of the bottom interval and then the lower class boundary. Add the class
width to the lower class boundary to obtain the upper class boundary. Write down the upper class
limit.
5. List all the class limits and class boundaries.
6. Determine the class marks.
7. Tally the frequencies for each class.
8. Sum the frequency column and check against the total number of observations

Example 12: The following scores represent the final examination grades in an elementary statistics course:

23 60 79 32 57 74 52 70 82 36
80 77 81 95 41 65 92 85 55 76
52 10 64 75 78 25 80 98 81 67
41 71 83 54 64 72 88 62 74 43
60 78 89 76 84 48 84 90 15 79
34 67 17 82 69 74 63 80 85 61

Using 9 intervals with the lowest starting at 10,


a. set up the frequency distribution.
b. construct a cumulative frequency distribution.
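
One possible way to carry out the tallying for Example 12 is sketched below. It assumes a class width of 10, obtained by dividing the range 98 − 10 = 88 by 9 and rounding up, so the classes are 10-19, 20-29, …, 90-99. The variable names are chosen for this illustration.

```python
from itertools import accumulate

grades = [
    23, 60, 79, 32, 57, 74, 52, 70, 82, 36,
    80, 77, 81, 95, 41, 65, 92, 85, 55, 76,
    52, 10, 64, 75, 78, 25, 80, 98, 81, 67,
    41, 71, 83, 54, 64, 72, 88, 62, 74, 43,
    60, 78, 89, 76, 84, 48, 84, 90, 15, 79,
    34, 67, 17, 82, 69, 74, 63, 80, 85, 61,
]

start, width, k = 10, 10, 9          # assumed: 9 classes of width 10 from 10
freq = [0] * k
for g in grades:
    freq[(g - start) // width] += 1  # index of the class containing g

cum = list(accumulate(freq))         # cumulative frequencies, lowest class first

for j in range(k):
    lo = start + j * width
    print(f"{lo}-{lo + width - 1}  f={freq[j]:2d}  cf={cum[j]:2d}")
```

The last cumulative frequency should equal the number of observations (60), which is the check described in step 8 above.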

GRAPHICAL REPRESENTATIONS
The information provided by a frequency distribution in tabular form is easier to grasp if
presented graphically. A visual picture is beneficial in understanding the essential features of a
frequency distribution.

Bar Chart – plotting the class frequency against the class limits.

Frequency Histogram – frequency against the class boundaries.

Frequency Polygon – frequency against the class marks

Cumulative Frequency Polygon or Ogive – cumulative frequency against the upper class boundaries

EXERCISES:
1. In a survey, 500 families were asked the question “Where are you planning to spend your vacation
this summer?” The survey resulted in the following distribution. Construct a pie graph for the data and
summarize the results.

Place Number of People


Davao 50
Boracay 200
Palawan 125
Tagaytay 90
Baguio 35

2. Construct a bar chart for the number of health conditions per 100 reported by the elderly in a survey.

Condition Number
Arthritis 48
Hypertension 36
Heart Disease 32
Cataracts 17
Diabetes 11

3. Construct a bar chart for the number of typhoons reported for the selected months from the year 2001
to 2005.
Month Number
June 6
July 8
August 12
September 10
October 5

4. The frequency distribution shows the number of freshmen, sophomores, juniors, and seniors who are
working as assistants in the different offices at the university. Construct a pie graph.

Rank Frequency
Freshmen 9
Sophomores 27
Juniors 36
Seniors 18

GROUPED DATA: FORMULAS FOR MEASURES OF CENTRAL TENDENCY

$$\text{MEAN} = \frac{\sum f_i x_i}{n}$$

where: $f_i$ = the frequency of class interval $i$
       $x_i$ = the midpoint of class interval $i$
       $\sum f_i x_i$ = the sum of the products of the frequency and midpoint of each class interval
       $n$ = the sample size

or by the use of codes:

$$\text{MEAN} = A + \left(\frac{\sum f_i u_i}{n}\right) c$$

where: $A$ = the midpoint of the class interval assigned a code of zero
       $f_i$ = the frequency of class interval $i$
       $u_i$ = the code of class interval $i$
       $c$ = the class width
       $\sum f_i u_i$ = the sum of the products of the frequency and the code of each class interval

$$\text{MEDIAN} = L_1 + \left(\frac{\frac{n}{2} - cf}{f}\right) c$$

where: $L_1$ = the lower class boundary of the median class
       $n$ = the sample size
       $cf$ = the cumulative frequency of the class right below the median class
       $f$ = the frequency of the median class
       $c$ = the class width
       median class = the class interval where the $\left(\frac{n}{2}\right)$th observation falls

$$\text{MODE} = L_1 + \left(\frac{d_1}{d_1 + d_2}\right) c$$

where: $L_1$ = the lower class boundary of the modal class
       $d_1$ = the difference between the frequencies of the modal class and the class right below the
       modal class
       $d_2$ = the difference between the frequencies of the modal class and the class right above the
       modal class
       $c$ = the class width
       modal class = the class interval having the highest frequency
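
To see the three grouped-data formulas in action, the sketch below applies them to the luggage-weight frequency distribution given earlier. It assumes that the weights were recorded to the nearest kilogram, so each lower class boundary is the lower class limit minus 0.5. The helper `locate` and the printed results are part of this illustration, not of the original handout.

```python
# (lower limit, upper limit, frequency), lowest class first
classes = [(7, 9, 2), (10, 12, 8), (13, 15, 14), (16, 18, 19), (19, 21, 7)]

n = sum(f for _, _, f in classes)
c = classes[1][0] - classes[0][0]                  # class width (3)
mids = [(lo + hi) / 2 for lo, hi, _ in classes]    # class marks
freqs = [f for _, _, f in classes]

mean = sum(f * x for f, x in zip(freqs, mids)) / n

def locate(position):
    """Return (L1, cf below, f) for the class containing the given position."""
    cf = 0
    for lo, _, f in classes:
        if cf + f >= position:
            return lo - 0.5, cf, f                 # assumed boundary: limit - 0.5
        cf += f

L1, cf, f = locate(n / 2)
median = L1 + (n / 2 - cf) / f * c

m = freqs.index(max(freqs))                        # modal class (interior here)
d1 = freqs[m] - freqs[m - 1]                       # versus the class right below
d2 = freqs[m] - freqs[m + 1]                       # versus the class right above
mode = (classes[m][0] - 0.5) + d1 / (d1 + d2) * c

print(round(mean, 2), round(median, 2), round(mode, 2))   # 15.26 15.66 16.38
```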

GROUPED DATA: FORMULAS FOR MEASURES OF VARIATION

$$s^2 = \frac{n \sum f_i x_i^2 - \left(\sum f_i x_i\right)^2}{n(n - 1)}$$

where: $f_i$ = the frequency of class interval $i$
       $x_i$ = the midpoint of class interval $i$
       $\sum f_i x_i$ = the sum of the products of the frequency and midpoint of each class interval

or by the use of codes:

$$s^2 = \left[\frac{n \sum f_i u_i^2 - \left(\sum f_i u_i\right)^2}{n(n - 1)}\right] c^2$$

where: $f_i$ = the frequency of class interval $i$
       $u_i$ = the code of class interval $i$
       $c$ = the class width
       $\sum f_i u_i$ = the sum of the products of the frequency and the code of each class interval

The Quartile Deviation

The quartile deviation is used when the median is used as an average; when the data depart
noticeably from the normal. It is used for ordinal data.

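The quartile deviation, $Q$, is frequently called the semi-interquartile range. It is half of the
distance between the two quartile points, $Q_1$ and $Q_3$.

In symbols:

$$Q = \frac{Q_3 - Q_1}{2}$$

where

$$Q_1 = L_1 + \left(\frac{\frac{n}{4} - cf}{f}\right) c, \qquad Q_3 = L_1 + \left(\frac{\frac{3n}{4} - cf}{f}\right) c$$

Here $L_1$, $cf$, and $f$ refer to the class containing the quartile, in the same way that they refer to
the median class in the formula for the median.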

Example on Grouped data

Given the following data.

Class        f
interval
36-40        2
31-35        8
26-30       12
21-25       18
16-20       10

Complete the table and find the mean, median, mode, variance, standard deviation, Q1, Q3, and Q.
Make the bar chart, histogram, frequency polygon, and ogive.

Solution:
Class      f    Class        Midpoint    u    fu    fu²    fx     fx²    cf
interval        boundaries   (x)
36-40      2    35.5-40.5    38          2     4     8     76    2888    50
31-35      8    30.5-35.5    33          1     8     8    264    8712    48
26-30     12    25.5-30.5    28          0     0     0    336    9408    40
21-25     18    20.5-25.5    23         -1   -18    18    414    9522    28
16-20     10    15.5-20.5    18         -2   -20    40    180    3240    10
Total     50                                 -26    74   1270   33770

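a) Mean

$$\text{MEAN} = \frac{\sum f_i x_i}{n} = \frac{1270}{50} = 25.4$$

$$\text{MEAN} = A + \left(\frac{\sum f_i u_i}{n}\right) c = 28 + \left(\frac{-26}{50}\right)(5) = 28 - 2.6 = 25.4$$

b) Median

$$\text{MEDIAN} = L_1 + \left(\frac{\frac{n}{2} - cf}{f}\right) c = 20.5 + \left(\frac{\frac{50}{2} - 10}{18}\right)(5) \approx 24.67$$

c) Mode

$$\text{MODE} = L_1 + \left(\frac{d_1}{d_1 + d_2}\right) c = 20.5 + \left(\frac{8}{8 + 6}\right)(5) \approx 23.36$$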

e) Variance

$$s^2 = \frac{n \sum f_i x_i^2 - \left(\sum f_i x_i\right)^2}{n(n - 1)} = \frac{50(33770) - (1270)^2}{50(49)} = \frac{1688500 - 1612900}{2450} = \frac{75600}{2450} \approx 30.86$$

$$s^2 = \left[\frac{n \sum f_i u_i^2 - \left(\sum f_i u_i\right)^2}{n(n - 1)}\right] c^2 = \left[\frac{50(74) - (-26)^2}{50(49)}\right](5)^2 = \left(\frac{3700 - 676}{2450}\right)(25) \approx 30.86$$

f) SD

$$s = \sqrt{30.86} \approx 5.56$$

g) $$Q_1 = L_1 + \left(\frac{\frac{n}{4} - cf}{f}\right) c = 20.5 + \left(\frac{\frac{50}{4} - 10}{18}\right)(5) = 20.5 + 0.69 = 21.19$$

h) $$Q_3 = L_1 + \left(\frac{\frac{3n}{4} - cf}{f}\right) c = 25.5 + \left(\frac{\frac{3}{4}(50) - 28}{12}\right)(5) = 25.5 + 3.96 = 29.46$$

i) $$Q = \frac{Q_3 - Q_1}{2} = \frac{29.46 - 21.19}{2} = 4.135$$
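
The whole solution can be re-derived with a short Python sketch, which is a convenient way to check hand computations on grouped data. The code assumes, as the table does, that each lower class boundary is the lower limit minus 0.5. The helper `point` and the variable names are chosen for this illustration.

```python
# Classes from the worked example, lowest first: (lower limit, upper limit, frequency)
classes = [(16, 20, 10), (21, 25, 18), (26, 30, 12), (31, 35, 8), (36, 40, 2)]
c = 5                                              # class width
freqs = [f for _, _, f in classes]
mids = [(lo + hi) / 2 for lo, hi, _ in classes]    # 18, 23, 28, 33, 38
n = sum(freqs)

sum_fx = sum(f * x for f, x in zip(freqs, mids))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, mids))
mean = sum_fx / n
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
sd = variance ** 0.5

# Same variance from the coded formula, with code 0 assigned to midpoint A = 28.
A = 28
codes = [round((x - A) / c) for x in mids]         # -2, -1, 0, 1, 2
sum_fu = sum(f * u for f, u in zip(freqs, codes))
sum_fu2 = sum(f * u * u for f, u in zip(freqs, codes))
variance_coded = (n * sum_fu2 - sum_fu ** 2) / (n * (n - 1)) * c ** 2

def point(position):
    """Interpolate inside the class that contains the given cumulative position."""
    cf = 0                                         # cumulative frequency below
    for lo, _, f in classes:
        if cf + f >= position:
            return (lo - 0.5) + (position - cf) / f * c
        cf += f

median, q1, q3 = point(n / 2), point(n / 4), point(3 * n / 4)
Q = (q3 - q1) / 2

m = freqs.index(max(freqs))                        # modal class (interior here)
d1, d2 = freqs[m] - freqs[m - 1], freqs[m] - freqs[m + 1]
mode = (classes[m][0] - 0.5) + d1 / (d1 + d2) * c

print(round(mean, 2), round(median, 2), round(mode, 2))              # 25.4 24.67 23.36
print(round(variance, 2), round(variance_coded, 2), round(sd, 3))    # 30.86 30.86 5.555
print(round(q1, 2), round(q3, 2), round(Q, 2))                       # 21.19 29.46 4.13
```

The small differences from the hand solution (for example Q ≈ 4.13 here versus 4.135 above) arise because the hand solution rounds intermediate results to two decimal places before combining them.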

EXERCISES:

I. Evaluate the following.


6 8 5
1
1. i
i 1
2
3. x
i 1
2
 8 x  16 5. x
i 1 x
8 7
2i
2.  i(i  2)
i2
4. 
i 3 i  4

II. Let $x_1 = 3$, $x_2 = 1$, $x_3 = 0$, $x_4 = 2$, $x_5 = 2$. Calculate the following.


2
5
 5  5
1.  xi
i 1
2
2.   xi 
 i 1 
3.  x
i 1
i  2
2

III.

1.
Tom 12 10 5 8 15 10 13 3 7 27
Jerry 14 8 7 9 10 12 11 13 10 16

a. Describe the data using all the appropriate statistics.


b. Compare the performance of Tom and Jerry using the statistics computed in (a).
c. If you are to choose between the two players based on their performance, whom are you going
to choose? Support your answer.

2. The following data represent the length of life in minutes, measured to the nearest tenth, of a
random sample of black flies subjected to a new spray in a controlled laboratory experiment:

2.4 0.7 3.9 2.8 1.3


1.6 2.9 2.6 3.7 2.1
3.2 3.5 1.8 3.1 0.3
4.6 0.9 3.4 2.3 2.5
0.4 2.1 2.3 1.5 4.3
1.8 2.4 1.3 2.6 1.8
2.7 0.4 2.8 3.5 1.4
1.7 3.9 1.1 5.9 2.0
5.3 6.3 0.2 2.0 1.9

Using 8 intervals with the lowest starting at 0.1,


a. set up the frequency distribution.
b. Construct a cumulative frequency distribution.
c. Compute for the median, mode, mean and standard deviation.

d. Graph using histogram, frequency polygon, and cumulative frequency polygon or
ogive.

3. Frequency Distribution of Body Weights (in Kilograms) of a Sample of Workers


Weights f
(in kgs.)
110-119 5
100-109 4
90-99 13
80-89 15
70-79 18
60-69 11
50-59 4

a. compute for the median, mode, mean and standard deviation.


b. Graph using histogram, frequency polygon, and cumulative frequency polygon or
ogive.
c. Compute for the semi-interquartile deviation.

4. The following data represent the scores (in words per minute) of computer encoders on a speed test.

Scores Frequency
54 – 58 2
59 – 63 5
64 – 68 8
69 – 73 5
74 – 78 4
79 – 83 5
84 – 88 1

Proportional allocation

1. Among the 350 employees of the local office of an international insurance company, the
following data were gathered. The employees were classified according to the following criterion:

Race Number of employees

Whites 200
Blacks 90
Orientals 60

If we use proportional allocation to select a stratified random grievance committee of 15
employees, how many employees must we take from each race?

2. In a certain university, 1560 freshman students may be classified according to the following
scheme:

Colleges Number of students


Engineering 115
Business & Accountancy 340
Fisheries 180
Agriculture 165
Arts & Sciences 321
Education 439

If we use proportional allocation to select a stratified random sample of size 300, how many
students must we take from each college?

3. Faculty members in MSU – GSC are classified according to the following scheme:

classification Number of faculty


permanent 700
probationary 600
substitute 150
lecturer 100

If one uses proportional allocation to select a sample of size 200, how many must be taken from
each stratum?
