Statistics and Data Management Overview
Learning Outcomes
At the end of the lesson, the students are able to
4. distinguish between the nominal, ordinal, interval and ratio methods of data measurement;
6. identify the features that describe a data distribution.
Statistics is the study of the collection, organization, analysis, interpretation, and presentation of data.
It deals with all aspects of data, including the planning of its collection in terms of the design of
surveys and experiments. Some consider statistics a mathematical body of science that pertains to the
collection, analysis, interpretation or explanation, and presentation of data, while others consider it a
branch of mathematics concerned with collecting and interpreting data. Because of its empirical roots
and its focus on applications, statistics is usually considered a distinct mathematical science rather than
a branch of mathematics.
4.1 Basic Concepts
Statistics is defined as a branch of mathematics concerned with facilitating wise decision-making
in the face of uncertainty and that, therefore, develops and utilizes techniques for the collection,
effective presentation, and proper analysis of data.
Branches of Statistics
1. Descriptive Statistics is concerned with the description and summarization of data. It deals with
the techniques used in the collection, presentation, organization, and analysis of the data on hand.
2. Inferential Statistics is concerned with the drawing of conclusions from data. It deals with the
techniques used in generalizing from samples to populations, performing estimations and hypothesis
tests, determining relationships among variables, and making predictions.
Functions of Statistics
1. Condensation. To condense generally means to reduce or to lessen. Condensation aims at making
a huge mass of data understandable by summarizing it in only a few observations.
2. Comparison. Classification and tabulation are the two methods that are used to condense the
data. They help us to compare data collected from different sources. Grand totals, measures
of central tendency, measures of dispersion, graphs and diagrams, coefficients of correlation, etc.
provide ample scope for comparison. As statistics is an aggregate of facts and figures, comparison
is always possible, and in fact comparison helps us to understand the data in a better way.
3. Forecasting. By the word forecasting, we mean to predict or to estimate beforehand. Given the
data of the last ten years on the number of students enrolled in PUP, it is possible to
predict or forecast the number of students that will enroll in the near future. In business also,
forecasting plays a dominant role in connection with production, sales, profits, etc. Time series
analysis and regression analysis play an important role in forecasting.
4. Estimation. One of the main objectives of statistics is to draw inferences about a population
from the analysis of a sample drawn from that population.
5. Tests of Hypothesis. A statistical hypothesis is a statement about the probability distribution
characterizing a population, made on the basis of the information available from the sample
observations. In the formulation and testing of hypotheses, statistical methods are extremely
useful. Whether the grades of students increased because they are motivated, or whether a new
teaching method is effective in discussing a particular topic, are examples of hypotheses, and
these are tested with proper statistical tools.
Scope of Statistics
1. Statistics and Industry. Statistics is widely used in many industries. In industries, control charts
are widely used to maintain a certain quality level. In production engineering, to find whether the
product is conforming to specifications or not, statistical tools, namely inspection plans, control
charts, etc., are of extreme importance. In inspection plans we have to resort to some kind of
sampling - a very important aspect of Statistics.
2. Statistics and Commerce. Statistics is the lifeblood of successful commerce. No businessman
can afford either understocking or overstocking his goods. In the beginning he estimates the
demand for his goods and then takes steps to adjust his output or purchases accordingly.
Thus statistics is indispensable in business and commerce.
3. Statistics and Economics. Statistical methods are useful in measuring numerical changes in
complex groups and interpreting collective phenomena. Nowadays statistics is used abundantly
in any economic study. Both in economic theory and practice, statistical methods
play an important role.
4. Statistics and Education. Statistics is widely used in education. Research has become a
common feature in all branches of activities. Statistics is necessary for the formulation of policies
to start new courses, for the consideration of facilities available for new courses, etc. There are
many people engaged in research work to test past knowledge and evolve new knowledge. These
are possible only through statistics.
5. Statistics and Planning. Statistics is indispensable in planning. In the modern world, which can
be termed the “world of planning”, almost all the organizations in the government are seeking
the help of planning for efficient working, for the formulation of policy decisions and execution of
the same. In order to achieve these goals, statistical data relating to production, consumption,
demand, supply, prices, investments, income, expenditure, etc., and various advanced statistical
techniques for processing, analyzing and interpreting such complex data are of importance. In
India, statistics plays an important role in planning and commissioning at both the central and
state government levels.
6. Statistics and Medicine. In the medical sciences, statistical tools are widely used. To test
the efficiency of a new drug or medicine, a t-test is used; to compare the efficiency of two drugs
or two medicines, a two-sample t-test is used. More and more applications of statistics are
at present used in clinical investigation.
7. Statistics and Modern Applications. Recent developments in the fields of computer technology
and information technology have enabled statistical models to be integrated into, and thus
become a part of, the decision-making procedures of many organizations. Many software
packages are available for solving problems in design of experiments, forecasting, simulation, etc.
Limitations of Statistics
2. Statistics does not study individuals. Statistics does not give any specific importance to
individual items; in fact, it deals with aggregates of objects. Individual items, taken individually,
do not constitute statistical data and do not serve any purpose for a statistical enquiry.
3. Statistical laws are not exact. It is well known that mathematical and physical sciences are
exact. But statistical laws are not exact and statistical laws are only approximations. Statistical
conclusions are not universally true. They are true only on an average.
4. Statistics may be misused. Statistics must be used only by experts; otherwise, statistical
methods are the most dangerous tools in the hands of the inexpert. The use of statistical tools
by inexperienced and untrained persons might lead to wrong conclusions.
5. Statistics is only one of the methods of studying a problem. Statistical methods do
not provide a complete solution to problems because problems are to be studied taking the
In statistics, we are often interested in gathering information from a group of objects. If the group
under consideration consists of a large number of objects, we try to obtain information about the
group by examining a subgroup of it.
Definition 14
The total collection of all the elements that we are interested in is called a population. A
subgroup of the population that will be studied in detail is called a sample.
In order for the data from the sample to be informative about the population, the sample must be
representative of the population. Being representative does not mean that the characteristics of the
sample are exactly those of the total population; rather, it means that the sample was obtained in
such a way that every member of the population had an equal chance of being included in the sample.
Definition 15
A sample of k members of a population is called a random sample, also called a simple random
sample, if the members are chosen in such a way that all possible choices of the k members are
equally likely.
After a random sample is obtained from the population, we can use statistical inference to draw
generalizations about the population by examining the members of the sample.
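Definition 15 can be illustrated with a minimal Python sketch. The population of ID numbers 1 to 1000, the sample size k = 10 and the seed are assumptions made only for illustration; they do not come from the text.

```python
import random

# Hypothetical population: ID numbers 1..1000 (an assumption for illustration).
population = list(range(1, 1001))

# A seeded generator so that the draw can be reproduced.
rng = random.Random(2024)

# random.sample draws k distinct members without replacement, so that
# every possible choice of k members is equally likely.
k = 10
sample = rng.sample(population, k)

print(sorted(sample))
```

Each run with a different seed gives a different, equally likely, subset of 10 members.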
2. Collection of data
3. Summarization and tabulation of data
(a) This refers to the organization of data in text, tables, graphs and charts, so that logical
conclusions can be derived from them.
(b) Explore the data to obtain additional insight that could contribute to the study.
4. Analysis of data
(a) This pertains to the process of deriving from the given data relevant information from which
numerical descriptions can be formulated.
(b) Summarized data must be examined so that insights and meaningful information can be
produced to support decision-making or solutions to the question or problem at hand.
5. Interpretation of data and results
(a) Refers to the task of drawing conclusions from the analyzed data.
(b) Results must be able to answer the research problem and give recommendations.
1. Simple Random Sampling. A probability sampling technique wherein all possible subsets
consisting of n elements selected from the N elements of the population have the same chance of
selection.
2. Systematic Sampling. This is a probability sampling technique wherein the selection of the
first element is at random and the selection of other elements in the sample is systematic by
subsequently taking every kth element from the random start where k is the sampling interval.
3. Stratified Random Sampling. A probability sampling method where we partition the population
into non-overlapping strata or groups and then choose a proportional sample from each stratum.
The actual sample is the union of the samples drawn from each stratum.
4. Cluster Sampling. A probability sampling technique wherein we partition the population into
non-overlapping groups or clusters consisting of one or more elements, and then select a sample
of clusters. Every member of each selected cluster is included in the sample.
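The systematic, stratified and cluster techniques described above can be sketched in Python as follows. This is only an illustrative sketch: the function names, the population of 100 numbered units, the two strata and the ten clusters are all hypothetical and are not taken from the text.

```python
import random

def systematic_sample(units, k, rng):
    """Pick a random start among the first k units, then every k-th unit after it."""
    start = rng.randrange(k)
    return units[start::k]

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from every stratum (proportional allocation)."""
    chosen = []
    for members in strata.values():
        size = round(len(members) * fraction)
        chosen.extend(rng.sample(members, size))
    return chosen

def cluster_sample(clusters, n_clusters, rng):
    """Select whole clusters at random and keep every member of each one."""
    picked = rng.sample(sorted(clusters), n_clusters)
    return [member for name in picked for member in clusters[name]]

rng = random.Random(7)
units = list(range(1, 101))  # hypothetical population of 100 units
strata = {"first_year": units[:60], "second_year": units[60:]}
clusters = {f"section_{i}": units[i * 10:(i + 1) * 10] for i in range(10)}

print(len(systematic_sample(units, 10, rng)))    # 10 units, one from each block of 10
print(len(stratified_sample(strata, 0.1, rng)))  # 6 + 4 = 10 units
print(len(cluster_sample(clusters, 2, rng)))     # 2 clusters of 10 members = 20 units
```

Simple random sampling itself corresponds to a single call to `rng.sample(units, n)`.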
1. Accidental Sampling. The researcher chooses the sample by obtaining members of the
population in a convenient, often haphazard way.
2. Quota Sampling. A specified number of persons of certain types is included in the sample.
The researcher is aware of categories within the population and draws samples from each category.
The size of each categorical sample is proportional to the proportion of the population that belongs
to that category.
3. Purposive Sampling. The researcher uses his or her judgment in choosing the members that he
or she believes are representative of the population.
4. Snowball Sampling. This technique is also called referral sampling. A primary set of samples
is chosen based on the criteria set by the researcher. Information on where to find the succeeding
set of samples meeting the same criteria is gathered from this primary set in order to expand
the number of samples.
1. Slovin’s Formula. Slovin’s formula is used to estimate the size n of a random sample from a
population of size N, given a margin of error E. We compute
\[ n = \frac{N}{1 + N E^{2}}. \]
Example 27. A researcher plans to conduct a survey about the food preferences of BS Stat students. If
the population of students is 1000, use Slovin’s formula to find the sample size if the margin of error is 5%.
Solution. Using Slovin’s formula, we get
\[ n = \frac{1000}{1 + 1000(0.05)^{2}} \approx 285.71, \]
which we round up to 286 students.
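The computation in Example 27 can be checked with a short Python sketch. The function name `slovin` is chosen here only for illustration.

```python
import math

def slovin(population_size, margin_of_error):
    """Sample size from Slovin's formula, n = N / (1 + N * E**2)."""
    return population_size / (1 + population_size * margin_of_error ** 2)

n = slovin(1000, 0.05)
print(round(n, 2))   # 285.71
print(math.ceil(n))  # 286, after rounding up to a whole respondent
```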
2. Minimum Sample Size for Estimating a Population Mean. The estimated minimum sample
size n needed to estimate a population mean $\mu$ to within E units at $100(1-\alpha)\%$ confidence is
\[ n = \frac{(z_{\alpha/2})^{2}\,\sigma^{2}}{E^{2}}, \]
where $\sigma$ is the known population standard deviation, E is the margin of error, and $z_{\alpha/2}$ is a
value which can be obtained from the z-table.
Example 28. Suppose we want to know the average age of STEM students. We would like to be 99%
confident about our results. From a previous study, we know that the standard deviation for the
population is 1.3. How many students should be chosen for a survey if the margin of error is 0.2?
Solution. For 99% confidence, $\alpha = 0.01$ and $z_{\alpha/2} \approx 2.58$. Thus,
\[ n = \frac{(2.58)^{2}(1.3)^{2}}{(0.2)^{2}} \approx 281.23, \]
which we round up to 282, since it is impossible to take a fractional observation. We need 282 STEM
students as a sample for our study.
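As a rough check of Example 28, the formula can be coded as below. The function name is illustrative. The last lines use `statistics.NormalDist` from the Python standard library to obtain an unrounded z-value, which is an addition to the text; because the text rounds z to 2.58, the two answers differ by one.

```python
import math
from statistics import NormalDist

def min_sample_size_mean(z, sigma, margin_of_error):
    """Smallest whole n with n >= (z * sigma / E)**2."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Using the table value z = 2.58, as in Example 28.
print(min_sample_size_mean(2.58, 1.3, 0.2))     # 282

# Using an unrounded z-value for 99% confidence (alpha = 0.01).
z_exact = NormalDist().inv_cdf(1 - 0.01 / 2)
print(round(z_exact, 4))                        # 2.5758
print(min_sample_size_mean(z_exact, 1.3, 0.2))  # 281
```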
3. Minimum Sample Size for Estimating a Population Proportion. The estimated minimum
sample size n needed to estimate a population proportion p to within E at $100(1-\alpha)\%$ confidence
is
\[ n = \frac{(z_{\alpha/2})^{2}\,\hat{p}(1-\hat{p})}{E^{2}}. \]
This is also called the Cochran Formula.
The dilemma here is that the formula for estimating how large a sample to take contains the
number $\hat{p}$, which we know only after we have taken the sample. There are two ways out of this
dilemma.
• First, typically the researcher will have some idea as to the value of the population proportion
p, hence of what the sample proportion $\hat{p}$ is likely to be. For example, if last month 37% of
all voters thought that state taxes are too high, then it is likely that the proportion with that
opinion this month will not be dramatically different, and we would use the value 0.37 for $\hat{p}$
in the formula.
• The second approach to resolving the dilemma is simply to replace $\hat{p}$ in the formula by 0.5.
This is because if $\hat{p}$ is large then $1-\hat{p}$ is small, and vice versa, which limits their product to
a maximum value of 0.25, which occurs when $\hat{p} = 0.5$. This is called the most conservative
estimate, since it gives the largest possible estimate of n.
Example 29. Suppose we are doing a study on the inhabitants of a large town, and want to find out
how many households serve breakfast in the mornings. We don’t have much information on the subject
to begin with, so we’re going to assume that half of the families serve breakfast: this gives us maximum
variability, so $\hat{p} = 0.5$. Suppose further that we want 95% confidence and a margin of error of 5%.
A 95% confidence level leaves $\alpha/2 = 0.025$ in each tail, and the closest z-score for 0.025 in the
z-table is 1.96. Hence we get
\[ n = \frac{(1.96)^{2}(0.5)(1-0.5)}{(0.05)^{2}} \approx 384.16. \]
Hence, a random sample of 385 households in our target population should be enough to give us the
confidence level we need.
If the population is small then the sample size can be reduced slightly. This is because a given sample size
provides proportionately more information for a small population than for a large population. The
formula is
\[ n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}, \]
where $n_0$ is Cochran’s sample size recommendation, N is the population size and n is the new adjusted
sample size.
Example 30. In the preceding example, if there were just 1000 households in the target population, we
would calculate
\[ n = \frac{385}{1 + \dfrac{385 - 1}{1000}} \approx 278.18. \]
All we need are 279 households in our sample, a substantially smaller sample size.
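Examples 29 and 30 can be reproduced with the following sketch. The function names `cochran` and `finite_population_adjustment` are hypothetical labels for the two formulas above.

```python
import math

def cochran(z, p_hat, margin_of_error):
    """Cochran's formula: n0 = z**2 * p_hat * (1 - p_hat) / E**2."""
    return z ** 2 * p_hat * (1 - p_hat) / margin_of_error ** 2

def finite_population_adjustment(n0, population_size):
    """Adjusted sample size for a small population: n = n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1 + (n0 - 1) / population_size)

# Example 29: 95% confidence (z = 1.96), p_hat = 0.5, E = 0.05.
n0 = math.ceil(cochran(1.96, 0.5, 0.05))
print(n0)  # 385

# Example 30: only 1000 households in the target population.
n = math.ceil(finite_population_adjustment(n0, 1000))
print(n)   # 279
```

Both results are rounded up, since a fractional household cannot be sampled.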
Example: Gender (male, female), Zip Code, Color, Nationality, Political affiliation, Religious
affiliation.
2. The ordinal level of measurement classifies data into categories that can be ranked; however,
precise differences between the ranks do not exist.
Example: Grade (A, B, C, D, F), Rating Scale/Likert scale, Ranking of tennis players, Judging (first
place, second place, etc.).
3. The interval level of measurement ranks data, and precise differences between units of measure
do exist; however, there is no meaningful zero.
4. The ratio level of measurement possesses all the characteristics of interval measurement, and
there exists a true zero. In addition, true ratios exist when the same variable is measured on two
different members of the population.
4.7 Presentation of Data
After data have been collected, the researcher can present them using the following methods.
1. Textual Form. Data are presented in paragraph form. The text highlights the important figures
or results that the researcher wishes to focus on.
Table 1
Frequency Distribution of the Students Enrolled for the Last 6 Years

Year     Frequency
2012     13,450
2013     13,200
2014     15,389
2015     16,790
2016     18,900
2017     19,500
Total    97,229
Table 2
Number of Students Enrolled for the Last 6 Years When Grouped According to Sex

                              Year
Sex        2012    2013    2014    2015    2016    2017    Total
Male       5560    6095    7386    8056    7945    6451    41493
Female     7890    7105    8003    8734   10955   13049    55736
Total     13450   13200   15389   16790   18900   19500    97229
3. Graphical Form. Data or relationships among variables can be presented in visual form through
the following:
(a) Bar Graph (Vertical Bar/Column Chart) is applicable for showing a comparison of the
amount of a variable of interest collected over time.
[Figure: Simple Chart]
(b) Histogram is similar to the bar graph, but the base of each rectangle has a length exactly
equal to the class width of the corresponding interval. Also, there are no spaces between
rectangles.
[Figure: Histogram]
(c) Pictograph is similar to the bar chart, but instead of bars, we use pictures or symbols to
represent a value or an amount.
[Figure: Pictograph]
(d) Pie Chart is a circular graph partitioned into several sections, depicting relative percentages
with respect to the total distribution.
[Figure: Pie Chart]
(e) Line Graph is a graph used to visualize data that changes continuously over time.
[Figure: Statistical Map]
Mean
Definition 16
Suppose that a variable x assumes values $x_1, x_2, \ldots, x_n$. The arithmetic mean $\bar{x}$ of these values
is defined as
\[ \bar{x} = \frac{\sum x}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}. \]
The (arithmetic) mean of x is obtained by adding all its observed values and dividing the sum by the
total number of observations.
Example 31. The scores of 15 students in Mathematics in the Modern World on an exam consisting
of 25 items are 25, 20, 18, 18, 17, 15, 15, 15, 14, 14, 13, 12, 12, 10, 10. Determine the mean score for this exam.
Solution. Let x denote the score of a random student from the sample of 15 students in Mathematics in
the Modern World. The sum of these scores is $\sum x = 228$. Hence, the mean score of the 15 students is
\[ \bar{x} = \frac{\sum x}{n} = \frac{228}{15} = 15.2. \]
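The arithmetic in Example 31 can be verified with a few lines of Python. `statistics.mean` is part of the standard library; the variable names are illustrative.

```python
from statistics import mean

scores = [25, 20, 18, 18, 17, 15, 15, 15, 14, 14, 13, 12, 12, 10, 10]

total = sum(scores)         # the sum of all observed values
n = len(scores)             # the number of observations

print(total, n)             # 228 15
print(total / n)            # 15.2
print(mean(scores))         # 15.2, the same result from the library function
```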
There are cases when the observations in a data set carry respective weights. When the weights are
positive integers, we can call them frequencies. The following gives a formula for the weighted
mean of a weighted data set.
Definition 17
Given the x values $x_1, x_2, \ldots, x_n$ assuming respective weights $w_1, w_2, \ldots, w_n$, the weighted mean
is defined as
\[ \bar{x} = \frac{\sum wx}{\sum w} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}. \]
Example 32. Suppose that we are asked to get the mean of the data set 1, 1, 3, 3, 3, 3, 4, 4, 4, 6, 6, 8.
Using the original formula for the arithmetic mean we find that
\[
\begin{aligned}
\bar{x} &= \frac{(1 + 1) + (3 + 3 + 3 + 3) + (4 + 4 + 4) + (6 + 6) + 8}{12} \\
        &= \frac{2 \cdot 1 + 4 \cdot 3 + 3 \cdot 4 + 2 \cdot 6 + 1 \cdot 8}{2 + 4 + 3 + 2 + 1} \\
        &= \frac{2 + 12 + 12 + 12 + 8}{12} \\
        &= \frac{46}{12} \\
        &\approx 3.833.
\end{aligned}
\]
We can interpret the mean of the data values as the fulcrum or center of gravity in a balance scale as
shown below.
[Figure: the data values stacked on a number line from 1 to 8, balancing at mean = 3.8333]
Example 33. Calculate the General Weighted Average (GWA) of Julius Garde for the first semester of
school year 2019-2020 as shown in the following table.

Course    Grade   Units
BM 112    1.25    3
BM 101    1.00    3
AC 103    1.25    6
MG 101    1.00    3
EC 111    1.50    3
MK 101    1.50    3
FM 111    1.20    3
PE 1      1.00    2

Solution. To solve for the GWA, we first consider the entries in the second column of the table as the
points $x_i$ and the entries in the third column as the corresponding weights $w_i$. By constructing a fourth
column consisting of the products $w_i x_i$ and finding the column totals, we get the table below.

Course    x_i     w_i     w_i x_i
BM 112    1.25    3       3.75
BM 101    1.00    3       3.00
AC 103    1.25    6       7.50
MG 101    1.00    3       3.00
EC 111    1.50    3       4.50
MK 101    1.50    3       4.50
FM 111    1.20    3       3.60
PE 1      1.00    2       2.00
Total             26      31.85

We see from the column totals that $\sum w = 26$ and $\sum wx = 31.85$. Therefore, the weighted mean or the
general weighted average (GWA) of Julius Garde for the first semester of AY 2019-2020 is
\[ \bar{x} = \frac{\sum wx}{\sum w} = \frac{31.85}{26} \approx 1.225. \]
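A minimal sketch of the weighted mean in Python follows. The function name `weighted_mean` is illustrative; the data are the values and frequencies of Example 32 and the grades and units listed in the table of Example 33.

```python
def weighted_mean(values, weights):
    """Sum of w * x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Example 32: distinct values with their frequencies as weights.
values = [1, 3, 4, 6, 8]
frequencies = [2, 4, 3, 2, 1]
print(round(weighted_mean(values, frequencies), 4))  # 3.8333

# Example 33: grades weighted by the number of units, as listed in the table.
grades = [1.25, 1.00, 1.25, 1.00, 1.50, 1.50, 1.20, 1.00]
units = [3, 3, 6, 3, 3, 3, 3, 2]
print(sum(units))                                    # 26
print(round(weighted_mean(grades, units), 3))        # 1.225
```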
Median
Definition 18
The median, usually denoted by x̃, is the middle value of a data set if the observations are
arranged either in increasing or decreasing order.
Outliers in the data set have little or no effect on the median. Thus, the median is preferred over the
mean as a measure of central tendency when the data contain outliers. To find the median, begin by
listing the data in order from smallest to largest, or largest to smallest.
If the number of data values, N, is odd, then the median is the middle data value. Its position can be
found by rounding N/2 up to the next whole number. If the number of data values is even, there is no
one middle value, so we find the mean of the two middle values (values N/2 and N/2 + 1).
Example 34. Given the scores of 15 students in Mathematics in the Modern World on an exam consisting
of 25 items:
25, 20, 18, 18, 17, 15, 15, 15, 14, 14, 13, 12, 12, 10, 10
Since the data are already arranged in decreasing order and there are 15 observations, we round
$15/2 = 7.5$ up to the next whole number, which is 8, and take the 8th observation from the left (or
right). Therefore, the median is $\tilde{x} = 15$. In comparison, the mean computed in Example 31 is 15.2.
[Figure: number line from 10 to 26 marking the positions of the mean and the median]
Remark. In general, the median need not equal the mean.
Example 35. The data given below are the total numbers of hours lost due to tardiness and absences of
employees in a company in a given year. Find the median.

Month        Hours Lost
January      55
February     23
March        24
April        37
May          37
June         48
July         42
August       27
September    20
October      40
November     30
December     32

Arranged in increasing order, the data are
20, 23, 24, 27, 30, 32, 37, 37, 40, 42, 48, 55.
Since there are 12 observations (even), we take note of the two middle observations, then compute
\[ \tilde{x} = \frac{32 + 37}{2} = 34.5. \]
Therefore, the median number of hours lost due to tardiness and absences of employees in the company
in the given year is 34.5 hours.
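The positional rule for the median can be sketched in Python as below and compared with `statistics.median` from the standard library. The helper name `median_by_position` is hypothetical; the data are those of Examples 34 and 35.

```python
import math
from statistics import median

def median_by_position(data):
    """Median via the rule in the text: sort, then use position N/2 (rounded up) or average two."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        position = math.ceil(n / 2)       # 1-based position of the middle value
        return ordered[position - 1]
    lower, upper = ordered[n // 2 - 1], ordered[n // 2]  # values N/2 and N/2 + 1
    return (lower + upper) / 2

scores = [25, 20, 18, 18, 17, 15, 15, 15, 14, 14, 13, 12, 12, 10, 10]
hours_lost = [55, 23, 24, 37, 37, 48, 42, 27, 20, 40, 30, 32]

print(median_by_position(scores), median(scores))          # 15 15
print(median_by_position(hours_lost), median(hours_lost))  # 34.5 34.5
```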
Mode
Definition 19
The mode is the most frequent observation in a given data set.
Outliers in the data set do not affect the mode. It is possible that the mode of a data set does not
exist, and it is not always unique. It is the appropriate measure of average for data measured only at
the nominal level. We will denote the mode using the symbol x̂.
Example 36. Suppose that we wanted to know the “average color” of cars used by the residents in a
given village. In our vehicle color survey, we collected the following data.

Color    Frequency
Blue     3
Green    5
Red      4
White    3
Black    2
Grey     3

Since the color of a vehicle is measured only at the nominal level, the most appropriate measure for the
“average color” is the mode. The most frequent color is Green, with a total of 5 vehicles. Therefore, the
“average color” in our survey data is Green.
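The mode in Example 36 can be found with `statistics.multimode` (available in Python 3.8 and later), which returns every value that attains the highest frequency. The second data set below is a made-up illustration of a case with two modes.

```python
from statistics import multimode

# Rebuild the survey data from the frequency table in Example 36.
frequencies = {"Blue": 3, "Green": 5, "Red": 4, "White": 3, "Black": 2, "Grey": 3}
colors = [color for color, count in frequencies.items() for _ in range(count)]

print(len(colors))        # 20 vehicles in total
print(multimode(colors))  # ['Green']

# A hypothetical data set in which two values tie for the highest frequency.
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```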
It is possible for a given data set to have more than one mode. Such a data set is said to be multimodal.
If a given set has only one mode, the data set is unimodal. If it has two modes, the data set is bimodal,
and so on.
4.9
Range
Definition 20
The range is the difference between the largest and the smallest observations or items in a set of
data.
The range of a data set is easy to compute, but it is a limited measure because it depends on only two
of the numbers (the highest and the lowest) in the data set. Hence, the range can easily be affected
by outliers. Also, it does not provide any information about how the data are concentrated around the
center.
Example 37. The following are the scores of 20 students from two different sections, 10 from each
section, in a 50-item exam in MMW.

Section 1    40  38  42  40  39  39  43  40  39  40
Section 2    46  37  40  33  42  36  40  47  34  45

For section 1, the highest score is 43, while the lowest score is 38. Thus,
range = 43 − 38 = 5.
On the other hand, for section 2, the highest score is 47, while the lowest score is 33. Thus,
range = 47 − 33 = 14.
Therefore, the scores of the students surveyed from section 2 have a wider range than those of the
students surveyed from section 1.
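The two ranges in Example 37 can be computed with a one-line helper. The name `data_range` is illustrative (it avoids shadowing Python's built-in `range`).

```python
def data_range(values):
    """Difference between the largest and the smallest observation."""
    return max(values) - min(values)

section_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]
section_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]

print(data_range(section_1))  # 5
print(data_range(section_2))  # 14
```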
Variance and Standard Deviation
Suppose that the center of a population data set $\{x_1, x_2, \ldots, x_N\}$ is best described by the arithmetic
mean $\mu$ and that our goal is to get the average “distance” of each data point $x_i$ from $\mu$. Naturally, we
might first consider the average of the deviations from the mean,
\[ \sum_{i=1}^{N} \frac{(x_i - \mu)}{N} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu). \]
However, using the properties of summations, and the fact that $N\mu = x_1 + x_2 + \cdots + x_N$, we can check
that
\[ \sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} \mu = N\mu - N\mu = 0. \]
In other words, the sum of the deviations from the mean is 0, and therefore, we cannot have a meaningful
measure of variability this way. The reason behind this fact is that some of the deviations from the mean
are negative (those which are to the left of the mean) and some are positive (those which are to the right
of the mean) and they cancel each other out. However, we can work our way out of this unfortunate
situation if we can ignore the signs of these deviations. One way to do this is to take the squares of these
deviations.
Definition 21
The variance of a population data set $\{x_1, x_2, \ldots, x_N\}$ with population mean $\mu$ is defined as
\[ \sigma^{2} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^{2}. \]
On the other hand, the variance of a sample data set $\{x_1, x_2, \ldots, x_n\}$ with sample mean $\bar{x}$ is
defined as
\[ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^{2}. \]
As we may have noticed, the formula for the sample variance differs from the formula for
the population variance mainly because of the divisor n − 1. The reason behind this is rather technical
and mathematical in nature. Simply put, the divisor n − 1 removes the “bias” in $s^{2}$ when we want it
to estimate $\sigma^{2}$ for the purposes of making inferences.
Notice that the variance is a nonnegative quantity because it comes from averaging squared quantities.
We also realize that there is one major drawback to using the variance. If we follow the steps in
calculating the variance, we find that the variance is measured in square units because we took the
squares of the deviations. For example, if our sample data are measured in meters, then the variance
is given in square meters.
In order to standardize the units, we can take the square root of the variance. This eliminates the
problem of squared units and gives us a measure of the spread that has the same units as our original
sample or population data.
Definition 22
The population (sample) standard deviation is the nonnegative square root of the population
(sample) variance. In symbols,
\[ \sigma = \sqrt{\sigma^{2}} \quad\text{and}\quad s = \sqrt{s^{2}}. \]
Example 38. Using the sample data sets in Example 37, determine which section exhibits a greater
variability in terms of standard deviations.
Solution. Let x denote the scores of students sampled from section 1 and let y denote the scores of
students sampled from section 2. To calculate the standard deviation of each sample, we first take note
that the sample means from each section are
\[ \bar{x} = \frac{\sum x}{n} = \frac{400}{10} = 40 \quad\text{and}\quad \bar{y} = \frac{\sum y}{n} = \frac{400}{10} = 40. \]

x      y      x − x̄    y − ȳ    (x − x̄)²   (y − ȳ)²
40     46      0        6        0          36
38     37     −2       −3        4           9
42     40      2        0        4           0
40     33      0       −7        0          49
39     42     −1        2        1           4
39     36     −1       −4        1          16
43     40      3        0        9           0
40     47      0        7        0          49
39     34     −1       −6        1          36
40     45      0        5        0          25
Σx = 400   Σy = 400   Σ(x − x̄)² = 20   Σ(y − ȳ)² = 224

Therefore, the sample variance for the sample from section 1 is
\[ s^{2} = \frac{\sum (x - \bar{x})^{2}}{n-1} = \frac{20}{9} \approx 2.2222, \]
while the sample variance for the sample from section 2 is
\[ s^{2} = \frac{\sum (y - \bar{y})^{2}}{n-1} = \frac{224}{9} \approx 24.8889. \]
Taking square roots, we find that the sample standard deviations of section 1 and section 2 are,
respectively, $\sqrt{2.2222} \approx 1.49$ and $\sqrt{24.8889} \approx 4.99$. We can conclude that, for these samples, the
one from section 1 exhibits less variability than the one from section 2. We comment that even though
the two samples have equal means, the standard deviations show the actual difference between the two
data sets.
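The computations in Example 38 can be checked with the sketch below. The helper `sample_variance` is an illustrative, direct translation of the sample variance formula in Definition 21; `statistics.variance` and `statistics.stdev` are standard-library functions that also use the divisor n − 1.

```python
import math
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

section_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]
section_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]

for scores in (section_1, section_2):
    s2 = sample_variance(scores)
    # The hand-written formula and the library function should agree.
    assert math.isclose(s2, variance(scores))
    print(round(s2, 4), round(math.sqrt(s2), 2), round(stdev(scores), 2))
# 2.2222 1.49 1.49
# 24.8889 4.99 4.99
```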