
PSAE Region IV – Agricultural Engineering Board Review Materials IV- 1

REVIEW ON BASIC STATISTICS

Roselle V. Collado
Asst. Professor
Institute of Statistics, College of Arts and Sciences
University of the Philippines at Los Baños

I. Introduction
Statistics has become a very important part of almost every aspect of life; some consider it almost a way of life. Statistics are used as bases for decision-making and, in most instances, in drawing information from gathered data.
The term statistics is used in two ways. In the singular sense, it is the science that deals with the collection, organization, presentation, analysis, and interpretation of data. In the plural sense, it refers to a set of numerical information, i.e., processed data. Some examples are population statistics, statistics on births, and enrollment statistics.
The science of statistics has two phases. Descriptive statistics deals with the methods of organizing, summarizing, presenting, and interpreting data, while inferential statistics is concerned with making generalizations about a larger set of data of which only a part is examined. This second phase of statistics uses the inductive method of reasoning. Its two main concerns are estimation and hypothesis testing. In estimation, the objective is to come up with a value or a range of values, computed from the sample, which we propose as the value of the population characteristic. In hypothesis testing, a claim or hypothesis about the population(s) is tested for possible rejection or acceptance based on sample evidence.

II. Basic Concepts


For any problem situation, certain concepts need to be identified to ensure that sound statistical treatment may be applied. At the problem formulation stage, the universe of the study needs to be identified and properly defined. This may be done by answering the question "Who do we want to study?" The universe, when properly defined, may be either finite or infinite.
Next to be determined are the variables in the study. These refer to the information we want to gather or observe from the elements of the universe and answer the question "What do we want to know about the elements of the universe?" Variables may be either qualitative or quantitative. Those that assume numerical possible values with defined concepts of magnitude are called quantitative variables, while those that assume alphanumeric possible values are called qualitative variables.
The population is the set of all possible values of the variable(s) of interest. For a
particular population it is commonly of interest to determine the distribution, the pattern of
variation of the variable, displaying how often each value occurs in the data set. In most
situations a sample is taken to represent a whole population or universe. This is a subset of
the population or universe to be used as evidence for inferential statistics.

Levels of Measurement
Aside from considering the nature of the possible values of a variable, the variables’
levels of measurement are differentiated in terms of the amount of mathematical treatment
and the type of information that may be derived from the gathered data. Consequently, the
level of measurement of a variable of interest determines which statistical procedures are
appropriate to use and apply given the data gathered. Table 1 summarizes the four levels of
measurement and their properties.


Table 1. The levels of measurement and their properties.

Level of        Properties                              Defined Operations
Measurement
Nominal         - purely categorical possible values    - counts and percentages
Ordinal         - categorical possible values with      - counts and percentages
                  inherent ordering of categories       - ordinal operations such as =, > and <
Interval        - quantitative values with distinct     - counts and percentages
                  distances between values              - ordinal operations such as =, > and <
                - arbitrary zero point                  - +, -
Ratio           - quantitative values with distinct     - counts and percentages
                  distances between values              - ordinal operations such as =, > and <
                - fixed zero point                      - +, -, x, /

III. Data Gathering


Any study requiring the use of a set of data may use either of two types of data. Data are called primary data if the user directly obtained the information from the units in the universe. On the other hand, data gathered not directly from the units in the universe are called secondary data. Data may be gathered in three ways. The objective method involves taking actual measurements and direct observations, while the subjective method involves the respondent providing the data. The third method, the use of existing records, involves using data previously collected by some other person or institution. The decision on which method of data gathering to use considers the nature of the variable of interest and the problem at hand.

IV. Data Presentation


Data gathered by any of the three methods mentioned will be of little use unless
organized and presented in the most appropriate manner. Three methods of data
presentation may be selected from and/or combined to impart the salient information
contained in the data. A textual presentation is a narrative describing the characteristics of the universe or the population based on the data collected and organized; this is usually used when there is very little numerical information to mention. If there are many values and
information to present, these are organized into properly labeled rows and columns to come
up with a tabular presentation. If details are not very important and a quick picture is
desired, particularly trends and distribution, the graphical presentation is used. The choice of
which method of data presentation to use is determined by the nature of the information to
be relayed using the data gathered.
A special tabular presentation of a given data set, particularly used in storing secondary
data, is the Frequency Distribution Table (FDT). It consists of classes/categories that can be
created from the data together with the frequencies of observations belonging to a
class/category. Whether the variable is qualitative or quantitative, an FDT may be
constructed.

FDT Construction

To construct a quantitative FDT, the steps involved are:


a. Determine the lowest (LV) and highest value (HV) in the data set. Compute the Range,
R=HV-LV.


b. Determine the number of classes, k, using the formula k = √N, where N is the number of observations in the data set. Round off k to the nearest integer.
c. Calculate the class size, c, using the formula c = R/k. Round off c to the nearest value with the same precision as the raw data.
d. Construct the classes as follows. Each class is an interval of values defined by its lower
and upper class limits.
• The lower limit (LL) of the lowest class is LV. The lower limits of the succeeding classes are obtained by simply adding c to the lower limit of the preceding class.
• The upper limit (UL) of the lowest class is computed as the lower limit of the next class minus one unit of measure. The upper limits of the succeeding classes are obtained by adding c to the UL of the preceding class.
e. Tally the data, counting the number of observations that belong to each of the classes.
f. Other columns of information that may be constructed:
• Class Mark/Midpoint – the midpoint of a class, obtained by adding the lower and upper class limits and dividing by two
• Relative Frequency – the frequency of a class expressed as a percent of the total number of observations
• Cumulative Frequency
<CF – the number of observations less than or equal to the upper limit of a class
>CF – the number of observations greater than or equal to the lower limit of a class
• Relative Cumulative Frequency – the cumulative frequency expressed as a percent of the total number of observations
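The steps above can be sketched in Python. The data set and the square-root rule for k are assumptions for illustration; the class size is rounded up here so that the classes cover the whole range (the text rounds to the data's precision instead).

```python
import math

# Hypothetical whole-number data set (so one unit of measure = 1)
data = [12, 15, 18, 21, 21, 24, 27, 30, 33, 36, 38, 41, 44, 47, 50, 53]

n = len(data)
lv, hv = min(data), max(data)
r = hv - lv                          # a. Range = HV - LV
k = round(math.sqrt(n))              # b. number of classes (assumed rule: sqrt(N))
c = math.ceil(r / k)                 # c. class size, rounded up to cover the range

# d. build class limits; e. tally frequencies per class
classes = []
ll = lv
for _ in range(k):
    ul = ll + c - 1                  # upper limit = next lower limit minus one unit
    freq = sum(ll <= x <= ul for x in data)
    classes.append((ll, ul, freq))
    ll += c

# f. extra columns: class mark, relative and cumulative frequency
cum = 0
for ll, ul, freq in classes:
    cum += freq
    mark = (ll + ul) / 2
    rel = 100 * freq / n
    print(f"{ll:>3}-{ul:<3} mark={mark:5.1f} f={freq} rf={rel:5.1f}% <CF={cum}")
```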

Stemplot Construction
A special graphical presentation that presents the distribution of a data set while preserving the true values of the data points is the stem-and-leaf plot, commonly called the stemplot.
To construct a stemplot, the steps involved are:
a. Arrange the observations in increasing order.
b. For each data point, identify the leaf (the unit’s digit) and its stem (all other digits).
c. List the stems vertically in increasing order from top to bottom.
d. Draw a vertical line to the right of the stems.
e. List the leaves beside their corresponding stems, to the right of the line, in increasing order.
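A minimal sketch of these steps, assuming two-digit data so that the stem is the tens digit and the leaf is the units digit:

```python
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88]

# a. sort; b. split each value into stem (all but last digit) and leaf (units digit)
data.sort()
stems = {}
for x in data:
    stems.setdefault(x // 10, []).append(x % 10)

# c-e. list stems top to bottom, a vertical line, then leaves in increasing order
for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"{stem} | {leaves}")
```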

V. Numerical Descriptive Measures


Another way of presenting or describing the universe or the population is by computing certain quantities from a given set of data, called numerical descriptive measures. They serve to summarize and describe the information collected by the researcher. Those commonly used are the measures of location, dispersion, skewness and kurtosis.
Measures of Location are values within the range of the data that describe its location or position relative to the entire set of data. Measures of Central Tendency (MCT) are values, computed from the data, around which the observations tend to "center" or "cluster". The three MCTs are:
1) Arithmetic Mean: the sum of all the observations divided by the number of observations, x̄ = Σxᵢ / n

2) Median: Md = the middle value of the array (the average of the two middle values when the number of observations is even)

3) Mode: Mo = the observation that occurs most frequently

Percentiles are values that divide the array into 100 equal parts. Deciles are values that divide the data set into ten equal parts, denoted by D1, D2, ..., D9, and quartiles are values that divide the data set into four equal parts, denoted by Q1, Q2, and Q3. These are usually computed for large data sets. To find the jth percentile, the following steps are used:
a) Arrange the data in increasing order.

b) Compute k = jN/100, where N is the total number of observations.

c) If k is a whole number, then the jth percentile is the average of the values in the kth and (k+1)th positions. Otherwise, it is the observation in the next higher whole-number position.
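These steps can be sketched as a small function; the score data are a hypothetical example.

```python
import math

def percentile(data, j):
    """jth percentile using the positional rule described above."""
    arr = sorted(data)                    # a. arrange in increasing order
    n = len(arr)
    k = j * n / 100                       # b. k = jN/100
    if k == int(k):                       # c. whole number: average of kth and (k+1)th
        k = int(k)
        return (arr[k - 1] + arr[k]) / 2  # positions are 1-based
    return arr[math.ceil(k) - 1]          # otherwise: next higher whole position

scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile(scores, 50))   # -> 55.0 (median: average of 5th and 6th values)
print(percentile(scores, 25))   # -> 30 (k = 2.5, so the 3rd value)
```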

Measures of Dispersion (MD) are values used to describe the extent to which the data are dispersed. These are interpreted such that large values of the MDs indicate large variation in the data. These are:

1) Range: R = MAX – MIN

2) Variance: s² = Σ(xᵢ – x̄)² / (n – 1) for a sample; σ² = Σ(xᵢ – μ)² / N for a population

3) Standard Deviation: the positive square root of the variance, s = √s²
The Measure of Skewness is a value that measures the extent of departure of the distribution from symmetry. It is measured by the coefficient of skewness (SK), commonly computed using Pearson's formula:

SK = 3(Mean – Median) / s

The skewness of a distribution is determined by interpreting the value of SK in the


following manner:
• SK = 0 indicates that the distribution of the data is symmetric
• SK < 0 indicates that the distribution is negatively skewed
• SK > 0 indicates that the distribution is positively skewed
The value that measures the flatness or peakedness of the data distribution is the
Measure of Kurtosis. The distribution of the data is bell-shaped if K is zero. If the shape
of the distribution is relatively flat, K < 0. If the shape of the distribution is relatively peaked,
K > 0.
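The measures above can be computed with the standard library's statistics module. The data set is a hypothetical example, and Pearson's formula for SK is an assumption, since the source does not show the formula it intends.

```python
import statistics

data = [4, 7, 7, 8, 9, 10, 12, 15, 18, 30]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)            # most frequent observation
r = max(data) - min(data)               # range
var = statistics.variance(data)         # sample variance (n - 1 divisor)
s = statistics.stdev(data)              # sample standard deviation

# Pearson's coefficient of skewness (assumed formula): SK = 3(mean - median)/s
sk = 3 * (mean - median) / s

print(mean, median, mode, r, round(var, 2), round(sk, 2))
```

Here the large value 30 pulls the mean above the median, so SK comes out positive, i.e. the distribution is positively skewed.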

VI. Probability

In the practice of inferential statistics, a very important foundation is the concept of probability. Developing the concept of probability requires knowledge of the random experiment, a process of making observations that can be repeated under basically the same conditions and that leads to well-defined possible outcomes. The set of all possible outcomes of a random experiment is called the sample space. Subsets of the sample space are called events. It is of interest to measure the chances of these events being observed, and thus the probability of these events. A probability is a value between 0 and 1, inclusive of limits, that measures how likely a particular
event will occur. Probability of events is measured via three possible approaches. The a priori
approach uses a theoretical model to measure the chances of an event occurring while the a
posteriori approach uses the relative frequency of the event's occurrence to measure its
chances. The subjective approach on the other hand uses the perception of the person to
determine the probability of an event.

Counting Methods

1. Fundamental Principle of Counting


Suppose an operation can be done in n1 ways. If for each of these ways a second operation can be done in n2 ways, for each of the first two a third operation can be done in n3 ways, and so on to the kth operation, which can be done in nk ways, then the k operations together can be done in n1·n2·n3···nk ways.

2. A permutation is an ordered arrangement of all or part of a given set of objects. The number of permutations of n objects taken r at a time, denoted by nPr, is defined as:

nPr = n! / (n – r)!

3. A combination is a group formed by taking all or part of a given set of objects without regard to the order in which the objects are selected to form the group. The number of combinations of n distinct objects taken r at a time, denoted by nCr, is defined as:

nCr = n! / [r!(n – r)!]
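Both counting formulas are available in Python's math module; the shelf and committee scenarios below are illustrative assumptions.

```python
import math

# permutation: arrange 3 of 5 distinct books on a shelf (order matters)
print(math.perm(5, 3))   # 5!/(5-3)! = 60

# combination: choose a 3-member committee from 5 people (order does not matter)
print(math.comb(5, 3))   # 5!/(3!*2!) = 10

# fundamental principle of counting: 4 shirts x 3 pants x 2 pairs of shoes
print(4 * 3 * 2)         # 24 possible outfits
```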

Theorems on Probability

Let A, B, and C be any events defined on a sample space, S.


1. P(Aᶜ) = 1 – P(A)
2. P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0
3. P(A ∩ B) = P(A|B)·P(B)
            = P(B|A)·P(A)
            = P(A)·P(B) if the two events are independent
4. P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
5. P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C)
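These identities can be checked by enumerating a small sample space; the single-die experiment below is an illustrative assumption.

```python
from fractions import Fraction

# sample space: one roll of a fair die
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}            # event: an even number comes up
B = {4, 5, 6}            # event: a number greater than 3 comes up

def P(E):
    """A priori probability: favorable outcomes over total outcomes."""
    return Fraction(len(E & S), len(S))

# addition rule: P(A u B) = P(A) + P(B) - P(A n B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# conditional probability: P(A|B) = P(A n B) / P(B)
p_a_given_b = P(A & B) / P(B)
print(p_a_given_b)       # 2/3: of {4, 5, 6}, the even outcomes are {4, 6}
```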

VII. Sampling
In addition to the concept of probability, of utmost importance in the practice of inferential statistics is sampling. It is the process of selecting a part of the universe or the population; the part taken is called the sample. It is desired that the sample taken be representative of the whole population. The set of rules or procedures employed in selecting the sample is the sampling design. It includes the sampling scheme, the manner by which the samples are taken, and the sample size, the number of sample units taken from the universe or population.
Samples may be taken by either of two methods of sampling. Probability sampling schemes assign a known probability of selection to all possible samples. This known probability structure allows for the computation of sampling errors, a very important piece of information in doing

inferential statistics. This is the error in inference inherent to the fact that only a sample was
observed.

In non-probability sampling schemes, the elements of the universe/population have no known chance of being included in the sample. These schemes do not allow the computation of sampling errors.

Some probability sampling procedures are:

1. Simple Random Sampling (SRS) - the elements of the universe/population have an equal chance of being included in the sample. It is most applicable when the universe is believed to be homogeneous.

2. Stratified Random Sampling (StRS) - the elements of the universe are first grouped into strata and an SRS is taken from each stratum. Most applicable when:
a. information is required for certain subdivisions of the population
b. population is extremely heterogeneous
c. problem of sampling may differ in different parts of the population

3. Cluster Sampling - the elements are grouped into clusters (the clusters may be inherent to the universe, e.g., geographical location), a simple random sample of clusters is selected, and all the elements of the selected clusters are included in the sample. The main advantage of cluster sampling is that it is cheaper, especially when sampling necessitates extensive traveling. The major disadvantage is that the information will be less precise than that obtained through SRS (for the same sample size).

4. Systematic Sampling – sampling scheme characterized by adopting a skipping pattern in


the selection of the sample units. Distinguished as the only sampling scheme that allows
sample selection without a sampling frame.

5. Multi-stage Sampling – sampling scheme characterized by sampling being done in stages


before the ultimate sampling units are selected.
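Schemes 1, 2, and 4 can be sketched with the standard library's random module; the 20-unit sampling frame and the strata below are illustrative assumptions.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# hypothetical sampling frame: 20 numbered farms
frame = list(range(1, 21))

# 1. Simple Random Sampling: every element has an equal chance of selection
srs = random.sample(frame, k=5)

# 2. Stratified Random Sampling: group into strata, then take an SRS per stratum
strata = {"upland": frame[:8], "lowland": frame[8:]}
strs_sample = {name: random.sample(units, k=2) for name, units in strata.items()}

# 4. Systematic Sampling: random start, then every k-th unit (skipping pattern)
k = len(frame) // 5
start = random.randrange(k)
systematic = frame[start::k]

print(srs, strs_sample, systematic)
```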

Some non-probability sampling schemes are:


1. Purposive sampling
2. Quota Sampling
3. Judgment Sampling
4. Accidental Sampling

VIII. Estimation
One of the concerns of inferential statistics is the process of estimation. It refers to finding a value or a range of values for an unknown attribute of the population called a parameter. An estimate may be found using a single value, as done in point estimation, or using a range of values with some measure of confidence, as done in interval estimation. Generally, estimates are found using attributes of a sample selected from the population of interest. These quantities, known as statistics, are used to estimate the parameters of the population. Estimators, however, vary depending on the sampling scheme employed.

SRS Estimators
Point and interval estimators for the different parameters using Simple Random Sampling are given in Table 2.


Table 2. Parameters and their estimators based on a simple random sample of size n.

PARAMETER                   POINT ESTIMATOR                  (1 – α) × 100% CONFIDENCE
                                                             INTERVAL ESTIMATOR

Population Mean (μ)         x̄ = Σxᵢ / n                      x̄ ± t(α/2, n–1) · s/√n

Population Variance (σ²)    s² = Σ(xᵢ – x̄)² / (n – 1)        ( (n–1)s²/χ²(α/2) , (n–1)s²/χ²(1–α/2) )

Population Proportion (P)   p̂ = a/n, where a = number of     p̂ ± z(α/2) · √( p̂(1 – p̂)/n )
                            units with the attribute of
                            interest
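As a sketch, the point and interval estimates of the population mean under SRS can be computed with the standard library. The sample data and the 95% confidence level are assumptions, and a z-interval is used in place of the t-interval because t quantiles are not in the standard library.

```python
import math
import statistics

sample = [23, 25, 28, 30, 31, 33, 36, 38, 40, 42]   # hypothetical SRS, n = 10

n = len(sample)
xbar = statistics.mean(sample)                       # point estimator of the mean
s = statistics.stdev(sample)                         # sample standard deviation

# 95% confidence interval for the mean (z-interval sketch; a t-interval
# is more appropriate for small n)
z = statistics.NormalDist().inv_cdf(0.975)           # two-sided 95% quantile
half = z * s / math.sqrt(n)
print(f"{xbar:.2f} +/- {half:.2f}  ->  ({xbar - half:.2f}, {xbar + half:.2f})")
```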
StRS Estimators
Point estimators for stratified random samples are given in Table 3.

Table 3. Point estimators for different population parameters using stratified random samples.

PARAMETER                   POINT ESTIMATOR

Population Mean (μ)         x̄st = Σh Wh·x̄h , where Wh = Nh/N is the weight of stratum h

Population Variance (σ²)    s²st = Σh Wh·s²h

Population Proportion (P)   p̂st = Σh Wh·p̂h

IX. Test of Hypothesis


Another concern in inferential statistics is the test of hypothesis. This is performed when some statement or claim about the parameters of a population needs to be tested using the evidence contained in the sample. The stated null hypothesis is tested for possible rejection based on a test statistic computed from the sample. At a given level of significance, the computed test statistic is then compared to a tabular value. Depending on the decision criterion of the test, the decision to either reject or accept the null hypothesis is then made. In the event that the null hypothesis is rejected, the alternative hypothesis, also referred to as the researcher's hypothesis, is the one accepted as true.
In a statistical test of hypothesis, however, two possible errors may be committed. The
first is the error of rejecting a true null hypothesis which is called the Type I error and the
second is the error of accepting a false null hypothesis called the Type II error. Unfortunately,
these two errors are inversely related and trying to decrease one will only increase the other.
The only way that both errors may be assured of being low is to take a large sample size.
In performing a test of hypothesis, it is important to know which test procedure to
employ to satisfy the given problem. What follows are brief descriptions of possible test
procedures that may be performed.
1. Z-test and T-test

– test procedures used when one wants to compare the mean of a population to a hypothesized value
– test procedures used to compare the means of two populations

2. Binomial Test
– a statistical procedure to test whether a hypothesized value for the population proportion is acceptable or not.
3. Regression Analysis
- a statistical technique used for determining the functional form of relationship
between two or more variables.
- the ultimate objective is usually to be able to predict the value of the dependent
variable given the values of independent or concomitant variables.
4. Correlation Analysis
– statistical technique used to determine the strength or degree of linear relationship
between two variables.
5. Analysis of Variance
– statistical technique to compare the means of two or more populations based on
partitioning the total variance of the variable of interest into several sources or
components.
6. Chi-square test of Goodness of fit
- statistical procedure to test whether the observed frequencies are in agreement with the expected or hypothesized frequencies.
7. Chi-square test of Independence
- statistical procedure to test whether two variables (in at least the nominal scale) are
independent of each other.
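As a sketch of the first procedure (comparing a population mean to a hypothesized value), the one-sample t statistic can be computed by hand. The measurements and the hypothesized mean are assumptions; in practice the result is compared to a tabular t value at the chosen level of significance.

```python
import math
import statistics

sample = [49.1, 50.3, 48.7, 51.2, 50.8, 49.5, 50.1, 51.6]  # hypothetical measurements
mu0 = 50.0                                                  # hypothesized population mean

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)

# one-sample t statistic: t = (xbar - mu0) / (s / sqrt(n)), with n - 1 df
t = (xbar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.3f} with {n - 1} degrees of freedom")
# reject the null hypothesis if |t| exceeds the tabular t value
```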

