Roselle V. Collado
Asst. Professor
Institute of Statistics, College of Arts and Sciences
University of the Philippines at Los Baños
I. Introduction
Statistics has become an important part of almost every aspect of life, so much so that some
consider it almost a way of life. Statistics are used as bases for decision-making and, in most
instances, for drawing information from gathered data.
The term statistics is used in two ways. In the singular sense, it is the science that deals
with the collection, organization, presentation, analysis and interpretation of data. In the
plural sense, it refers to a set of numerical information or processed data. Examples
are population statistics, statistics on births and enrollment statistics.
The science of statistics has two phases. Descriptive statistics deals with the methods of
organizing, summarizing, presenting and interpreting data, while inferential statistics is
concerned with making generalizations about a larger set of data when only a part of it is
examined. This second phase uses the inductive method of reasoning. Its two
main concerns are estimation and hypothesis testing. In estimation, the objective is to
come up with a value or a range of values, computed from the sample, which is proposed
as the value of the population characteristic. In hypothesis testing, a claim or hypothesis
about the population(s) is tested for possible rejection or acceptance based on sample
evidence.
Levels of Measurement
Aside from considering the nature of the possible values of a variable, the variables’
levels of measurement are differentiated in terms of the amount of mathematical treatment
and the type of information that may be derived from the gathered data. Consequently, the
level of measurement of a variable of interest determines which statistical procedures are
appropriate to use and apply given the data gathered. Table 1 summarizes the four levels of
measurement and their properties.
Statistics
PSAE Region IV – Agricultural Engineering Board Review Materials IV- 2
FDT Construction
b. Determine the number of classes, k, using the formula k = , where N is the number
of observations in the data set. Round off k to the nearest integer.
c. Calculate the class size, c, using the formula c = R/k. Round off c to the nearest value
with the same precision as the raw data.
d. Construct the classes as follows. Each class is an interval of values defined by its lower
and upper class limits.
The lower limit (LL) of the lowest class is LV. The lower limits of the succeeding
classes are obtained by simply adding c to the lower limit of the preceding class.
The upper limit (UL) of the lowest class is computed as the lower limit of the next
class minus one unit of measure. The upper limits of the succeeding classes are
obtained by adding c to the UL of the preceding class.
e. Tally the data, counting the number of observations that belong to each of the classes.
f. Other columns of information that may be constructed:
Class Mark/Midpoint – is the midpoint of a class, obtained by adding the lower
and upper class limits and dividing by two.
Relative Frequency – the frequency of a class expressed as a percentage of the total
number of observations
Cumulative Frequency
<CF – the number of observations less than or equal to the upper limit of a class
>CF – the number of observations greater than or equal to the lower limit of a class
Relative Cumulative Frequency – the cumulative frequency expressed as a percentage of
the total number of observations
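The construction steps above can be sketched in Python. This is only an illustrative sketch under stated assumptions: the number of classes k is supplied by the caller, the range R is taken as the highest value minus the lowest value, the data are whole numbers (so one unit of measure is 1), and an extra class is appended if rounding c leaves the highest value uncovered. The function name build_fdt and the sample scores are hypothetical.

```python
def build_fdt(data, k):
    """Build a frequency distribution table for whole-number data.

    k (number of classes) is supplied by the caller.
    Assumes R = highest value - lowest value and a unit of measure of 1.
    """
    lv, hv = min(data), max(data)
    c = max(1, round((hv - lv) / k))      # class size c = R/k, rounded to whole units
    rows = []
    ll = lv                               # LL of the lowest class is LV
    while ll <= hv:                       # keep adding classes until HV is covered
        ul = ll + c - 1                   # UL = LL of the next class minus one unit
        freq = sum(ll <= x <= ul for x in data)   # tally
        rows.append({"LL": ll, "UL": ul, "f": freq, "mark": (ll + ul) / 2})
        ll += c                           # next LL = previous LL + c
    n = len(data)
    running = 0
    for row in rows:
        running += row["f"]
        row["RF%"] = 100 * row["f"] / n       # relative frequency in percent
        row["<CF"] = running                  # observations <= this UL
        row[">CF"] = n - running + row["f"]   # observations >= this LL
    return rows


# Hypothetical data set used only to illustrate the steps.
scores = [10, 12, 15, 21, 22, 25, 28, 30, 33, 37, 41, 49]
rows = build_fdt(scores, k=4)
for row in rows:
    print(row)
```

With these hypothetical scores the classes come out as 10-19, 20-29, 30-39 and 40-49, with frequencies 3, 4, 3 and 2.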
Stemplot Construction
A special graphical presentation that is able to present the distribution of a data set while
preserving the true values of the data points is the stem and leaf plot commonly called the
stemplot.
To construct a stemplot, the steps involved are:
a. Arrange the observations in increasing order.
b. For each data point, identify the leaf (the units digit) and its stem (all other digits).
c. List the stems vertically in increasing order from top to bottom.
d. Draw a vertical line to the right of the stems.
e. List the leaves of each stem to the right of the line in increasing order.
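The stemplot steps can be illustrated with a short Python sketch. It assumes non-negative whole-number data, and it also lists stems that have no leaves so that gaps in the distribution remain visible. That last choice is an assumption of this sketch, not something the steps above require. The function name and the data are hypothetical.

```python
from collections import defaultdict


def stemplot(data):
    """Return the lines of a stem-and-leaf plot for non-negative whole numbers."""
    leaves = defaultdict(list)
    for x in sorted(data):                  # step a: increasing order
        stem, leaf = divmod(x, 10)          # step b: leaf = units digit, stem = the rest
        leaves[stem].append(leaf)
    lines = []
    # steps c-e: stems top to bottom, a vertical bar, then the ordered leaves
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines


# Hypothetical observations used only for illustration.
plot = stemplot([23, 25, 31, 12, 18, 25, 47, 30])
print("\n".join(plot))
```

Because the leaves keep the units digit of every observation, the original values (12, 18, 23, ...) can be read back from the plot.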
2) Median:
Percentiles are values that divide the array into 100 equal parts. Deciles are values
that divide the data set into ten equal parts, denoted by D1, D2, D3, D4, D5, D6, D7, D8 and D9.
Quartiles are values that divide the data set into four equal parts, denoted by Q1, Q2 and
Q3. These are usually computed for large data sets. To find the jth percentile, the following
steps are used:
a) Arrange the data in increasing order.
c) If k is a whole number, then the jth percentile is the average of the values in the kth and
(k+1)th positions. Otherwise, it is the observation in the next higher whole number
position.
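A minimal Python sketch of the percentile rule follows. The step that defines k is not reproduced above, so the sketch assumes the commonly used position k = N × j / 100, where N is the number of observations. Treat that formula as an assumption. The function name and the data are hypothetical, and j is expected to be a whole number between 1 and 99.

```python
def percentile(data, j):
    """Return the jth percentile (1 <= j <= 99), assuming k = N * j / 100."""
    values = sorted(data)                 # step a: increasing order
    n = len(values)
    k, remainder = divmod(n * j, 100)     # assumed position k = N*j/100
    if remainder == 0:
        # k is a whole number: average the kth and (k+1)th observations
        return (values[k - 1] + values[k]) / 2
    # otherwise take the observation in the next higher whole-number position
    return values[k]                      # 0-based index k is position k + 1


# Hypothetical data used only for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]
q1 = percentile(data, 25)       # first quartile, Q1 = P25
median = percentile(data, 50)   # Q2 = D5 = P50
d9 = percentile(data, 90)       # ninth decile, D9 = P90
print(q1, median, d9)
```

Deciles and quartiles reuse the same function because D1 to D9 are the 10th to 90th percentiles and Q1 to Q3 are the 25th, 50th and 75th percentiles.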
Measures of Dispersion (MD) are values used to describe the extent to which the data are
dispersed. These are interpreted such that large values of the MDs indicate large variation in
the data. These are:
2) Variance:
3) Standard Deviation:
The Measure of Skewness is a value that measures the extent of departure of the
distribution from symmetry. It is measured by the coefficient of skewness (SK),
computed as:
VI. Probability
In the practice of inferential statistics, a very important foundation is the concept of
probability. Developing the concept of probability requires knowledge of the random
experiment, a process of making observations that can be repeated under basically the same
conditions and that leads to well-defined possible outcomes. The set of all possible outcomes of
the experiment is called the sample space. Subsets of the sample space are called events. It is of
interest to measure the chances of these events being observed, and thus the probability of these
events. The probability is a value between 0 and 1, inclusive, that measures how likely it is that
a particular event will occur. The probability of an event is measured via three possible
approaches. The a priori approach uses a theoretical model to measure the chances of an event
occurring, while the a posteriori approach uses the relative frequency of the event's occurrence
to measure its chances. The subjective approach, on the other hand, uses the perception of the
person to determine the probability of an event.
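The a priori and a posteriori approaches can be contrasted with a small Python sketch. The random experiment here, rolling a fair six-sided die, and the event "an even number turns up" are illustrative assumptions. The seed and the number of trials are arbitrary.

```python
import random
from fractions import Fraction

# Sample space of the assumed random experiment: one roll of a fair die.
sample_space = [1, 2, 3, 4, 5, 6]
event = {x for x in sample_space if x % 2 == 0}   # event: an even number

# A priori: equally likely outcomes, so P(event) = (outcomes in event) / (all outcomes).
a_priori = Fraction(len(event), len(sample_space))

# A posteriori: repeat the experiment many times and use the relative frequency.
rng = random.Random(1)          # fixed seed so the run is repeatable
trials = 10_000
hits = sum(rng.choice(sample_space) in event for _ in range(trials))
a_posteriori = hits / trials

print("a priori:", a_priori)
print("a posteriori (relative frequency):", a_posteriori)
```

As the number of trials grows, the relative frequency tends to settle near the theoretical value, which is why the two approaches usually agree for well-modeled experiments.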
Counting Methods
3. A combination is a group formed by taking all or part of a given set of objects without
regard to the order in which the objects are selected to form the group. The number of
combinations of n distinct objects taken r at a time, denoted by nCr, is defined as:
nCr = n! / [r! (n - r)!]
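As a quick check of the counting rule for combinations, the sketch below computes nCr from the standard factorial formula n! / [r! (n - r)!] and compares it with an explicit listing of the groups. The crop names are hypothetical sample objects.

```python
import math
from itertools import combinations


def n_choose_r(n, r):
    """Number of combinations of n distinct objects taken r at a time."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


# Hypothetical set of 5 distinct objects.
crops = ["rice", "corn", "cassava", "coconut", "banana"]
groups = list(combinations(crops, 2))   # every unordered pair, listed once

print(n_choose_r(5, 2), len(groups))
```

Since order is disregarded, ("rice", "corn") and ("corn", "rice") count as the same group, which is why the listing has 10 entries rather than 20.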
Theorems on Probability
VII. Sampling
In addition to the concept of probability, sampling is of utmost importance in the practice of
inferential statistics. It is the process of selecting a part of the universe or
population; the part taken is called the sample. It is desired that the sample taken be
representative of the whole population. The set of rules or procedures employed in selecting the
sample is the sampling design. It includes the sampling scheme (the manner by which the
samples are taken) and the sample size (the number of sample units taken from the universe or
population).
Samples may be taken by either of two methods of sampling. Probability sampling
schemes assign a known probability of selection to all possible samples. This known probability
structure allows for the computation of sampling errors, which is very important information in doing
inferential statistics. The sampling error is the error in inference that arises because only a
sample, not the whole population, was observed.
1. Simple Random Sampling (SRS) - the elements of the universe/population have an equal
chance of being included in the sample. It is most applicable when the universe is believed
to be homogeneous.
2. Stratified Random Sampling (StRS) - the elements of the universe are first grouped into
strata and a simple random sample is taken from each stratum. Most applicable when:
a. information is required for certain subdivisions of the population
b. population is extremely heterogeneous
c. problem of sampling may differ in different parts of the population
3. Cluster Sampling - the elements are grouped into clusters (the clusters may be inherent to
the universe, e.g. geographical location), a simple random sample of clusters is selected,
and all the elements of the selected clusters are included in the sample. The main
advantage of cluster sampling is that it is cheaper, especially when sampling necessitates
extensive traveling. The major disadvantage is that the information will be less precise than
through SRS (for the same sample size).
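The three probability sampling schemes can be sketched in Python as follows. The population of farms, the province names used as strata and clusters, the sample sizes and the seed are all hypothetical. The helper function names are likewise illustrative only.

```python
import random


def simple_random_sample(units, n, rng):
    """SRS: every element has an equal chance of inclusion (without replacement)."""
    return rng.sample(units, n)


def stratified_random_sample(units, key, n_per_stratum, rng):
    """StRS: group the elements into strata, then take an SRS from each stratum."""
    strata = {}
    for u in units:
        strata.setdefault(u[key], []).append(u)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(units, key, n_clusters, rng):
    """Cluster sampling: take an SRS of clusters and keep all of their elements."""
    clusters = {}
    for u in units:
        clusters.setdefault(u[key], []).append(u)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for name in chosen for u in clusters[name]]


# Hypothetical universe of 15 farms in three provinces.
provinces = ["Laguna"] * 6 + ["Batangas"] * 4 + ["Quezon"] * 5
farms = [{"id": i, "province": p} for i, p in enumerate(provinces)]

rng = random.Random(42)   # fixed seed so the run is repeatable
srs = simple_random_sample(farms, 5, rng)
strs = stratified_random_sample(farms, "province", 2, rng)
clus = cluster_sample(farms, "province", 2, rng)
print(len(srs), len(strs), len(clus))
```

In the cluster sample the final sample size depends on which provinces are drawn, whereas the stratified sample always contains the same number of farms from every province.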
VIII. Estimation
One of the concerns of inferential statistics is the process of estimation. It refers to the
finding of a value or a range of values for an unknown attribute of the population called a
parameter. An estimate may be found using a single value as done in point estimation or
using a range of values with some measure of confidence as done in interval estimation.
Generally, estimates are found using attributes of a sample selected from the population of
interest. These quantities, known as statistics, are used to estimate the parameters of the
population. The estimators, however, vary depending on the sampling scheme employed.
SRS Estimators
Point and interval estimators for the different parameters using Simple Random Sampling
are given in Table 2.
Table 2. Parameters and their estimators based on a simple random sample of size n.
PARAMETER                    POINT ESTIMATOR      (1-α) × 100% CONFIDENCE INTERVAL ESTIMATOR
Population Proportion (P)
    a = number of units with the attribute of interest
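As an illustration of point and interval estimation for a population proportion, the sketch below uses the sample proportion a/n as the point estimate and a large-sample normal-approximation interval with no finite population correction. These formulas are assumptions of this sketch and may differ from the entries of Table 2. The function name and the figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_estimates(a, n, alpha=0.05):
    """Point estimate and (1 - alpha) x 100% interval for a proportion.

    Assumes a simple random sample of size n with a units having the
    attribute, and a normal approximation (large n, no finite
    population correction).
    """
    p_hat = a / n                                   # point estimate
    z = NormalDist().inv_cdf(1 - alpha / 2)         # standard normal critical value
    margin = z * sqrt(p_hat * (1 - p_hat) / n)      # half-width of the interval
    return p_hat, (p_hat - margin, p_hat + margin)


# Hypothetical sample: 40 of 100 sampled units have the attribute of interest.
p_hat, (low, high) = proportion_estimates(a=40, n=100, alpha=0.05)
print(round(p_hat, 3), round(low, 3), round(high, 3))
```

The point estimate is a single value, while the interval attaches a confidence level, here 95%, to a range of plausible values for P.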
StRS Estimators
Point estimators for stratified random samples are given in Table 3.
Table 3. Point estimators for different population parameters using Stratified Random Samples.
– test procedures when one wants to compare the mean of a population to a
hypothesized value
– test procedure to compare the means of two populations
2. Binomial Test
– a statistical procedure to test whether a hypothesized value for the population proportion
is acceptable or not.
3. Regression Analysis
– a statistical technique used for determining the functional form of the relationship
between two or more variables.
– the ultimate objective is usually to be able to predict the value of the dependent
variable given the values of the independent or concomitant variables.
4. Correlation Analysis
– statistical technique used to determine the strength or degree of linear relationship
between two variables.
5. Analysis of Variance
– statistical technique to compare the means of two or more populations based on
partitioning the total variance of the variable of interest into several sources or
components.
6. Chi-square test of Goodness of fit
– statistical procedure to test whether the observed frequencies are in agreement with the
expected or hypothesized frequencies.
7. Chi-square test of Independence
– statistical procedure to test whether two variables (in at least the nominal scale) are
independent of each other.
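To make the distinction between correlation analysis and regression analysis concrete, the sketch below computes the Pearson correlation coefficient (strength of linear relationship) and the least-squares line (functional form used for prediction) for one small data set. The use of Pearson's r and of simple linear least squares is an assumption of this sketch, and the function name and data are hypothetical.

```python
from math import sqrt


def correlation_and_line(x, y):
    """Return (r, slope, intercept) for paired observations x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / sqrt(sxx * syy)            # correlation: strength of linear relationship
    slope = sxy / sxx                    # regression: least-squares slope
    intercept = mean_y - slope * mean_x  # regression: least-squares intercept
    return r, slope, intercept


# Hypothetical paired data: x = independent variable, y = dependent variable.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r, slope, intercept = correlation_and_line(x, y)
predicted = intercept + slope * 6        # predicted y for a new x value of 6
print(round(r, 4), round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The coefficient r only summarizes how closely the points follow a line, while the fitted line is what allows a value of the dependent variable to be predicted from a given value of the independent variable.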