
BASIC CONCEPTS Chapter 1

“I keep saying the sexy job in the next ten years will be statisticians.”
Hal Varian, Chief Economist of Google (2009)
Objectives
At the end of this chapter, the students will be able to explain
basic statistical concepts and measures. Specifically,
1. Define statistics;
2. Identify the type of variable and the scale of measurement to
which a variable belongs;
3. Identify which method of data presentation is appropriate to
use;
4. Compute the measures of central tendency and variation;
and
5. Determine when to use the different measures of central
tendency and variation.
Statistical literacy
▪ Statistics is essential for making good decisions in an uncertain
world.
▪ Statistical literacy is “people’s ability to interpret and critically
evaluate statistical information and data-based arguments appearing
in diverse media channels, and their ability to discuss their opinions
regarding such statistical information” (Gal, as cited by Rumsey, 2002)
▪ Statistical literacy is important because you are faced with statistics
problems in your personal and professional lives.
▪ Be able to evaluate media reports about opinion surveys, medical
research studies, the state of the economy, and environmental issues.
Statistical literacy
▪ You’ll face financial decisions such as choosing between an
investment with a sure return and one that could make you
more money but could possibly cost you your entire investment.
▪ How can you evaluate evidence about global warming?
▪ Is there bias against women in appointing managers?
▪ Are cell phones dangerous to your health?
Statistical literacy

▪ How likely are you to win the lotto?


▪ How can you analyze whether a diet really works?
▪ How can you predict the selling price of a house?
▪ What’s the chance that you will get a passing grade in this
class?
Origin of the word “Statistics”
The term statistics came from the Latin phrase “ratio status”,
which means the study of practical politics or the statesman’s
art.

In the middle of the 18th century, the term statistik was used, a
German term defined as “the political science of several
countries”.

From statistik it became statistics, defined as a statement in
figures and facts of the present condition of a state.
What is Statistics?
The art and science of answering questions and
exploring ideas through the processes of
gathering data, describing data, and making
generalizations about a population on the basis
of a smaller sample

(https://onlinecourses.science.psu.edu/stat200/node/113)
What is Statistics?
The art and science of designing studies and
analyzing the data that those studies produce. Its
ultimate goal is translating data into knowledge and
understanding of the world around us. In short,
statistics is the art and science of learning from
data.
(Agresti and Franklin, 2013)
What is statistics?
Statistics (Singular)
▪ science that deals with techniques for collecting,
presenting, analyzing, and drawing conclusions from data
▪ science of data

Statistics (Plural)
▪ numerical descriptions by which we enhance
understanding of data
▪ summary measures used to describe a sample
Example
1. Statistics are facts or data, either
numerical or nonnumerical.
2. Statistics is the science of organizing
and summarizing numerical or
nonnumerical information.
Variables
▪ When we obtain our sample, we obtain data values on one or
more variables
▪ A variable is a characteristic or attribute which varies from one
entity to another entity (tensile strength, no. of buildings in VSU
campus, income of engineers)
▪ A qualitative variable is one which classifies/identifies/describes
an element of a sample or population (sex of an engineer,
academic rank of faculty)
▪ A quantitative variable is one which quantifies an element of a
sample or population (age, length of service)
Variables
▪ A quantitative variable that can assume only a finite or countably
infinite number of possible values (usually integers) is called a
discrete variable (no. of board passers, enrolment in BSCE per
semester)
▪ A quantitative variable that can theoretically assume any value in a
specified interval (i.e., continuum) is called a continuous variable
(temperature, wind speed, weight of beams)
▪ measuring instruments limit the number of decimal places of
values of continuous variables
Summary
Variables
▪ Qualitative
▪ Quantitative (discrete or continuous)
Measurement

▪ the process or rule of assigning labels or values to a variable

Importance:
▪ Identifying the level of measurement of a variable is one of
the factors in choosing a statistical method
Levels or Scales of Measurement

Ratio
Interval
Ordinal
Nominal
Nominal
▪ objects are classified into categories based on some defined
characteristics
▪ categories are mutually exclusive and have no implicit or
explicit ordering
▪ numbers are sometimes used as category codes but arithmetic
should not be performed on these number codes
▪ frequencies or counts of observations that belong to each
category are usually obtained to summarize the data

Examples:
sex, degree program, religion, civil status
Ordinal
▪ objects are classified into ordered categories
▪ categories are mutually exclusive but have some logical order
(one category is “higher” than others)
▪ categories are scaled according to the amount of the particular
characteristic they possess
▪ the difference between two categories has no numerical meaning
▪ the differences between adjacent categories are not necessarily equal

Examples:
faculty position in a university, faculty performance rating,
grading system at VSU
Interval
▪ possesses the properties of the ordinal scale
▪ equal differences between values on the scale represent equal
differences in the characteristic being measured
▪ the zero point is arbitrary and is just another point on the scale.
▪ ratios of two values in the scale are without meaning

Examples:
temperature (in °C), achievement test score, IQ
Ratio
▪ Has all the properties of the interval scale
▪ the zero point reflects an absence of the characteristic
(absolute zero point)
▪ ratios of two values in the scale are meaningful

Examples:
weight, price, distance
Example
Identify the type of variable and the scale of
measurement of the following variables:
1. Age (e.g., 19 years and 5 months old)
2. Total no. of units enrolled in the current semester
3. Sex (e.g., 0-Male and 1-Female)
4. Temperature (in Kelvin)
5. ID number
Population and sample
▪ A population is the entire collection of objects or outcomes
about which data are collected
▪ A sample is a subset of the population containing the
observed objects or the outcomes and the resulting data
▪ Numerical measures computed from population data are called
parameters
▪ Numbers computed using the data obtained from a sample
are called statistics
▪ Statistics are used to estimate parameters
[Diagram: sampling draws a sample from the population; inference (generalization) goes from the sample back to the population]
Sampling
▪ is the process of selecting a small number of
elements from a larger defined target group of
elements such that the information gathered from the
small group will allow judgments to be made about
the larger group
▪ is the process of selecting a number of individuals for
a study in such a way that the individuals represent
the larger group from which they were selected
Sampling
▪ A physician would like to know the characteristics of
a person’s blood (blood type, Rh factor, blood sugar,
etc.). To do this, the physician extracts a few
milliliters of blood from the person’s arm. The physician subjects this to
laboratory analysis and concludes that the
characteristic obtained from the blood sample is the
characteristic of the person’s blood.
Reasons for sampling
1. Reduced cost
2. Greater speed or timeliness
3. Greater efficiency and accuracy
4. Greater scope
5. Convenience
6. Necessity
7. Ethical considerations
Probability sampling
▪ each element in the population has a known chance of
being included in the sample
▪ the likelihood of inclusion is operationalized by the use of a
random mechanism (e.g., a device that generates random
numbers), and the probability assigned to each unit
is specified
▪ requires a listing of the population units and assigning a
unique label or identifier (usually counting numbers) to each
one (sampling frame)
▪ generally referred to as random samples
▪ allows drawing of valid generalizations about the population
Non-probability sampling
▪ the manner in which the units are selected from
the population depends on some inclusion rule as
specified by the sampler.
▪ sampling frame is not always required and the
operational cost is relatively cheaper
▪ inability to provide objective measurement of
accuracy
▪ inference can only be made by making
assumptions regarding the representativeness of the
sample
Sampling Methods
Probability sampling
▪ simple random sampling
▪ systematic sampling
▪ stratified sampling
▪ cluster sampling
Non-probability sampling
▪ purposive sampling
▪ convenience sampling
▪ quota sampling
Simple Random Sampling
▪ each element in the population has a known and
equal probability of selection
▪ each possible sample of a given size (n) has a known
and equal probability of being the sample actually
selected
▪ may not be practical to implement especially for
large populations due to the absence of good quality
sampling frame and the possibility that the selected
units may be extremely scattered thus making it
doubly difficult to implement
▪ may be done with replacement or without replacement
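As a minimal sketch of simple random sampling, the snippet below draws a sample with and without replacement using Python's standard `random` module. The frame of 100 labeled units, the sample size of 10, and the fixed seed are assumptions made for illustration only.

```python
import random

# Hypothetical sampling frame: units labeled 1..N
N = 100
frame = list(range(1, N + 1))
n = 10

rng = random.Random(2024)  # fixed seed only so the sketch is reproducible

# Without replacement: no unit can appear twice in the sample
srs_without = rng.sample(frame, n)

# With replacement: a unit may be drawn more than once
srs_with = [rng.choice(frame) for _ in range(n)]

print(srs_without)
print(srs_with)
```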
Systematic Sampling
▪ the sample is chosen by selecting a random starting
point and then picking every kth element in succession
from the sampling frame
▪ k is the sampling interval and is determined by
dividing the population size N by the sample size n and
rounding to the nearest integer
▪ a random number (called the random start) is selected
from 1 to k; the unit assigned this number is then
included in the sample, along with every kth unit thereafter
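The procedure can be sketched as follows. The function name `systematic_sample` and the frame of 100 units are hypothetical. When N/n is not a whole number, the achieved sample size may differ slightly from n.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Pick a random start r in 1..k, then take every kth unit after it."""
    N = len(frame)
    k = max(1, round(N / n))   # sampling interval, rounded to an integer
    r = rng.randint(1, k)      # random start; both ends are inclusive
    # 1-based positions r, r+k, r+2k, ... converted to 0-based indices
    return k, r, [frame[i - 1] for i in range(r, N + 1, k)]

frame = list(range(1, 101))    # hypothetical frame of N = 100 labeled units
k, r, sample = systematic_sample(frame, n=20, rng=random.Random(7))
print(k, r, sample)
```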
Stratified Sampling
▪ the population is divided into mutually exclusive sub-
populations called strata based on a stratification
variable that is closely related to the characteristic of
interest
▪ the elements within a stratum should be as
homogeneous as possible, but the elements in different
strata should be as heterogeneous as possible.
▪ independent simple random samples are obtained from
each stratum
▪ the overall sample size (n) can be distributed into the
strata sizes (nh) using equal allocation, proportional
allocation or optimum allocation
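A minimal sketch of proportional allocation, where each stratum receives n_h = n × N_h / N units and an independent simple random sample is then drawn within it. The strata names and sizes are made up for illustration.

```python
import random

def proportional_allocation(stratum_sizes, n):
    """n_h = n * N_h / N, rounded; rounding can make the total differ from n."""
    N = sum(stratum_sizes.values())
    return {h: round(n * N_h / N) for h, N_h in stratum_sizes.items()}

# Hypothetical strata: year levels with made-up enrolment counts
strata = {
    "first year": list(range(0, 400)),
    "second year": list(range(400, 700)),
    "third year": list(range(700, 900)),
    "fourth year": list(range(900, 1000)),
}
sizes = {h: len(units) for h, units in strata.items()}
n_h = proportional_allocation(sizes, n=50)

# Independent simple random sample within each stratum
rng = random.Random(1)
sample = {h: rng.sample(units, n_h[h]) for h, units in strata.items()}
print(n_h)
```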
Cluster Sampling
▪ in many applications, units of the population are naturally
grouped (e.g. villages); these groupings are referred to
as clusters
▪ a random sample of clusters is selected, based on a
probability sampling technique such as SRS
▪ for each selected cluster, either all the elements are
included in the sample (one-stage) or a sample of elements
is drawn probabilistically (two-stage)
▪ elements within a cluster should be as heterogeneous as
possible, but clusters themselves should be as
homogeneous as possible. Ideally, each cluster should be
a small-scale representation of the population
▪ it is administratively convenient to implement
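The one-stage and two-stage variants can be sketched as follows. The eight villages of five households each are invented for illustration, and clusters are selected by simple random sampling.

```python
import random

# Hypothetical population: 8 villages (clusters), each with 5 household labels
clusters = {
    f"village_{v}": [f"v{v}_h{h}" for h in range(1, 6)] for v in range(1, 9)
}

rng = random.Random(3)
chosen = rng.sample(sorted(clusters), 3)   # SRS of 3 clusters

# One-stage: take every element of each selected cluster
one_stage = [unit for c in chosen for unit in clusters[c]]

# Two-stage: draw an SRS of 2 elements within each selected cluster
two_stage = [unit for c in chosen for unit in rng.sample(clusters[c], 2)]

print(chosen)
print(one_stage)
print(two_stage)
```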
Descriptive statistics
▪ Use of numerical information to summarize, simplify,
and present data.
▪ Organize and summarize data for clear presentation
and easy interpretation
▪ Computation of measures of location and variation
▪ Construction of tables and graphs
Inferential statistics
▪ techniques that use sample data to make general statements
about a population
▪ making decisions and drawing conclusions about a
population based on data obtained from a sample taken from
the population
▪ allows meaningful generalizations only if the subjects in the
sample are representative of the population
▪ Estimation and hypothesis testing
Data
▪ refers to facts or figures from which conclusions can be drawn
▪ refers to the values or labels assigned to a variable
▪ it is information collected, organized, analyzed, and
interpreted by statisticians
▪ it is needed whenever we undertake studies or research
designed to answer particular problems, or to
provide a basis on which certain decisions may be formulated
Data
▪ Can be quantitative or qualitative
▪ Qualitative data can be transformed into quantitative data
using number codes
▪ Can be primary or secondary
Methods of data collection
▪ Sample survey (personal, phone, online)
▪ Controlled experiments (field or lab)
▪ Observation (e.g., in psychiatric wards)
▪ Registration method (e.g., as required by law)
▪ Focus group discussion (qualitative data)
▪ Use of existing records (secondary data)
Presenting data
▪ Generally, there are three ways of presenting statistical data:
textual, tabular, and graphical
▪ In a textual presentation, statistics are incorporated in a
text or paragraph.
▪ In a tabular presentation, statistics are organized in rows
and columns with appropriate labels
▪ In a graphical presentation, statistics are shown in pictorial
form
Textual presentation
Poverty incidence among Filipinos in 2015 was estimated at
21.6 percent. During the same period in 2012, poverty
incidence among Filipinos was recorded at 25.2 percent. On the
other hand, subsistence incidence among Filipinos, or the
proportion of Filipinos whose incomes fall below the food
threshold, was estimated at 8.1 percent in 2015. In 2012, the
subsistence incidence among Filipinos was 10.4
percent. Subsistence incidence among Filipinos is often
referred to as the proportion of Filipinos in extreme or
subsistence poverty.

Source: Philippine Statistics Authority


Tabular presentation
▪ arranges figures in a systematic manner using rows and
columns
▪ data can more readily be understood and comparisons may
more easily be made

General or reference tables
▪ used to present data in such a way that individual items may easily
be found by a reader; often placed in an appendix.
Summary or text tables
▪ usually small in size and designed to guide the reader in analyzing
the data; usually accompany a text discussion.
PARTS OF A STATISTICAL TABLE
1. Heading - consists of the table number, table title, and headnote
(when necessary)
2. Boxhead - portion of the table which consists of the spanner and
column heads or captions describing the data in each column
3. Stub - contains the stub head, center heads and line captions; the
first column on the left, where the line descriptions are
4. Field (body) - depository of information appearing in the cells
5. Footnote - statement qualifying or explaining the information
presented in, or omitted from, specific cells, columns or lines;
may be a general footnote, a specific footnote, or a source note
Table 3.6 Distribution of Families by Poor and Non-Poor Category by Sex of Head of the Family (Numbers in Thousands)

                          Total Families        Head
Year   Category           Number      %      Women    Men
2006   Total Families     18,110    100       18.8   81.2
       Poor                3,809     21       11.7   23.2
       Non-poor           14,301     79       88.3   76.8
2009   Total Families     19,734    100       21.3   78.7
       Poor                4,037   20.5         12   22.8
       Non-poor           15,697   79.5         88   77.2
2012   Total Families     21,427    100       22.7   77.3
       Poor                4,215   19.7       11.4   22.1
       Non-poor           17,212   80.3       88.6   77.9

Note: Details may not add up to total due to rounding
Source of basic data: Family Income and Expenditure Survey (FIES), PSA

Parts labeled in the figure: HEADING (table number and title), BOXHEAD (column heads), STUB (Year and Category columns), BODY (cell values).
Frequency Distributions
▪ A frequency distribution shows the number of observations
falling into each of several ranges of values.
▪ Frequency distributions are portrayed as frequency tables,
histograms, or polygons.
▪ Frequency distributions can show either the actual number
of observations falling in each range or the percentage of
observations.
▪ In the latter instance, the distribution is called a relative
frequency distribution.
Frequency Distributions
Table 1. Distribution of 25 patients afflicted with
HIV according to blood type
Blood Type Number of Patients
O 7
A 6
B 7
AB 5
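As a small illustration, the counts in Table 1 can be turned into a relative frequency distribution. The sketch rebuilds the 25 observations from the tabulated counts, since the raw data are not given.

```python
from collections import Counter

# Rebuild the 25 observations from the counts in Table 1
observations = ["O"] * 7 + ["A"] * 6 + ["B"] * 7 + ["AB"] * 5

freq = Counter(observations)      # frequency of each blood type
total = sum(freq.values())        # total number of patients
rel_freq = {k: round(100 * f / total, 1) for k, f in freq.items()}

for blood_type in freq:
    print(blood_type, freq[blood_type], rel_freq[blood_type])
```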
WHAT ARE FREQUENCY
DISTRIBUTIONS?
Table 2. Distribution of the number of cars owned by
families in a certain residential area
Number of cars Number of families
0 12
1 6
2 7
3 3
4 2
WHAT ARE FREQUENCY
DISTRIBUTIONS?
Table 3. Distribution of the weights of 30 female PE major students.
Weights Number of Students Percent
85 - 94 1 3.3
95 - 104 3 10.0
105 - 114 4 13.3
115 - 124 6 20.0
125 - 134 9 30.0
135 - 144 6 20.0
145 - 154 1 3.3
TERMS
Class interval - the numbers defining the class
Class limits - the end numbers of the class
Class frequency - the number of observations falling in a
class/category
Class mark - midpoint of a class interval
Relative frequency - class frequency divided by the total
number of observations; expressed in percent
Class size (class width) - the difference between the lower
limits of two adjacent classes or the difference between the
lower class boundaries of two adjacent classes
TERMS
Class boundaries - the true class limits; only for continuous
data
▪ the lower class boundary (LCB) is defined as the
number halfway between the lower class limit of the
class and the upper class limit of the preceding class
▪ the upper class boundary (UCB) is the number halfway
between the upper class limit of the class and the lower
class limit of the next class
Cumulative frequency - the number of observations that lie
above (or below) a particular value in a data set; less than
CF or greater than CF
Table 3. Distribution of the weights of 30 female PE major students.
Weights     Number of Students   CM      RF(%)   Class Boundaries   <CF   >CF
85 - 94     1                    89.5    3.3     84.5 - 94.5        1     30
95 - 104    3                    99.5    10.0    94.5 - 104.5       4     29
105 - 114   4                    109.5   13.3    104.5 - 114.5      8     26
115 - 124   6                    119.5   20.0    114.5 - 124.5      14    22
125 - 134   9                    129.5   30.0    124.5 - 134.5      23    16
135 - 144   6                    139.5   20.0    134.5 - 144.5      29    7
145 - 154   1                    149.5   3.3     144.5 - 154.5      30    1
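The derived columns of Table 3 can be reproduced from the class limits and frequencies alone. This is a sketch that assumes the weights are recorded to the nearest whole unit, so each boundary sits half a unit outside the class limits.

```python
from itertools import accumulate

limits = [(85, 94), (95, 104), (105, 114), (115, 124),
          (125, 134), (135, 144), (145, 154)]
freqs = [1, 3, 4, 6, 9, 6, 1]
n = sum(freqs)

class_marks = [(lo + hi) / 2 for lo, hi in limits]          # midpoints
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]    # true limits
rel_freq = [round(100 * f / n, 1) for f in freqs]           # in percent
less_than_cf = list(accumulate(freqs))                      # <CF
greater_than_cf = list(accumulate(freqs[::-1]))[::-1]       # >CF

for row in zip(limits, freqs, class_marks, rel_freq,
               boundaries, less_than_cf, greater_than_cf):
    print(*row)
```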
CONSTRUCTION OF A FREQUENCY
DISTRIBUTION WITH EQUAL CLASS SIZES
1. Determine the range (R)
R=Highest Value minus Lowest Value

2. Determine the number of classes (K)
Sturges’ approximation of K:
K = 1 + 3.3222(log n), where n is the number of observations
and log is the base-10 logarithm
Always round off K to the nearest whole number
3. Determine the class size (C)
C = R/K
Round off C to the nearest number with the same number
of decimals as the raw data
CONSTRUCTION OF A FREQUENCY
DISTRIBUTION WITH EQUAL CLASS SIZES
4. Determine the lower limit of the first class
▪ the first class must include the lowest observed value
▪ Use the lowest observed value as the lower limit of the
first class

5. Enumerate the class intervals.

6. Tally the observations to determine the class frequencies


EXAMPLE
Construct a frequency distribution of the scores obtained by
50 students in a class in elementary statistics:
23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64,
77, 15, 21, 51, 54, 72, 68, 36, 65, 52, 60, 27, 34, 47, 48, 55, 58,
59, 62, 51, 48, 50, 41, 57, 65, 54, 43, 56, 44, 30, 46, 67, 53

R = 77 - 12 = 65
K = 1 + 3.3222(log 50) = 6.644318148 ≈ 7
C = 65/7 = 9.28571428571429 ≈ 9
EXAMPLE
Class
Classes Frequency CM RF (%) Boundaries <CF >CF
12 - 20 2 16 4 11.5 - 20.5 2 50
21 - 29 4 25 8 20.5 - 29.5 6 48
30 - 38 6 34 12 29.5 - 38.5 12 44
39 - 47 9 43 18 38.5 - 47.5 21 38
48 - 56 14 52 28 47.5 - 56.5 35 29
57 - 65 10 61 20 56.5 - 65.5 45 15
66 - 74 3 70 6 65.5 - 74.5 48 5
75 - 83 2 79 4 74.5 - 83.5 50 2
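The six construction steps can be sketched in code using the 50 scores from the example. The sketch assumes whole-number data, so a class with lower limit L and size C has upper limit L + C - 1. Because C was rounded down to 9, an eighth class is needed to cover the highest score even though Sturges' approximation gave K = 7.

```python
import math

scores = [23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64,
          77, 15, 21, 51, 54, 72, 68, 36, 65, 52, 60, 27, 34, 47, 48, 55, 58,
          59, 62, 51, 48, 50, 41, 57, 65, 54, 43, 56, 44, 30, 46, 67, 53]

n = len(scores)
R = max(scores) - min(scores)            # step 1: range
K = round(1 + 3.3222 * math.log10(n))    # step 2: Sturges' approximation
C = round(R / K)                         # step 3: class size

# steps 4-5: start at the lowest value, add classes until the maximum is covered
classes = []
lower = min(scores)
while lower <= max(scores):
    classes.append((lower, lower + C - 1))
    lower += C

# step 6: tally the observations falling in each class
freqs = [sum(lo <= x <= hi for x in scores) for lo, hi in classes]

for (lo, hi), f in zip(classes, freqs):
    print(f"{lo} - {hi}: {f}")
```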
GRAPHICAL PRESENTATION
Graph (or chart) - any device used in presenting numerical
values or relationships in pictorial form

Advantages
1. There is a better comprehension of data than is possible
with textual matter alone.
2. There is a more penetrating analysis of the subject than is
possible in written text.
3. There is a check of accuracy
SOME COMMONLY USED CHARTS
Line Chart
▪ oldest, simplest, most familiar, and most widely used method of
presenting statistics graphically
▪ the plotted points of the data are connected by a line.
▪ the fluctuations of this line show the variations in the trend.
▪ the distance of the plotted points from the base line of the graph
indicates the quantity
Uses:
1. for emphasizing movement rather than actual amount
2. for depicting time series (data across time)
3. for comparing several series
4. when data cover a long period of time
5. when estimates or forecasts are to be shown
Source: PSA
SOME COMMONLY USED CHARTS
Column Chart
▪ to depict numerical values of a given item over a period of
time
▪ values are represented by the height of the column
▪ preferable to the line chart when a sharper delineation of
trend is to be shown

Types:
1. Grouped-column chart - used to compare two or sometimes
three independent series over a period of time.
2. Subdivided-column chart - shows the component parts of a total.
These should be few in number and each should carry a
distinctive pattern so that it may be readily identified.
Source: PSA
SOME COMMONLY USED CHARTS
Horizontal Bar Chart
▪ simplest form of graph comparing different items at a
specified date
▪ especially suited to represent categorical data
▪ bars may be arranged in numerical or alphabetical order,
depending on the purpose of the chart and the given data
SOME COMMONLY USED CHARTS
Pie Chart
▪ a circular diagram that is divided into sections to show the
composition of a whole
▪ size of each section is indicative of the proportion to the
total of the corresponding component
▪ useful when there are few components to a whole
▪ many components (more than six) would diminish the
visual impact of the chart
SOME COMMONLY USED CHARTS
Statistical Map
▪ used to present geographical statistics
▪ should be used only when geographic distribution is of
permanent importance and when data can be readily and
correctly interpreted in this form
Types
1. Shaded or cross-hatched map

2. Dot-map chart
Source: PSA
GRAPHICAL PRESENTATION OF
FREQUENCY DISTRIBUTIONS
Frequency histogram
▪ bar graph showing the class boundaries on the x-axis and
the frequencies on the y-axis
▪ the border of each bar is erected at the class boundaries

Relative frequency histogram
▪ the relative frequencies (y-axis) are plotted against the
class boundaries (x-axis)
FREQUENCY HISTOGRAM
[Figure: frequency histogram of the scores; x-axis marked at the class boundaries 11.5, 20.5, 29.5, 38.5, 47.5, 56.5, 65.5, 74.5, 83.5]
GRAPHICAL PRESENTATION OF
FREQUENCY DISTRIBUTIONS
Frequency polygon
▪ a line chart that is constructed by plotting the frequencies at
the class marks
▪ the polygon is formed by bringing down the line at each end
to the horizontal axis at the class marks of extra classes at
both ends

Frequency curve
▪ smoothed frequency/relative frequency polygon drawn to
show the frequency/relative frequency distribution of a
population or sample with continuous data.
FREQUENCY POLYGON
[Figure: frequency polygon of the scores; x-axis: Scores, plotted at the class marks 16, 25, 34, 43, 52, 61, 70, 79; y-axis: Frequency]
FREQUENCY CURVE
[Figure: frequency curve of the scores; x-axis: Score (0 to 80); y-axis: Density]
Measures of central tendency
▪ These are statistics which provide a description of the
“center” of a data set
▪ The three popular measures of central tendency or location
are: mean, median and mode
▪ When data is fairly symmetric, the most appropriate measure
is the mean.
▪ The median is considered most appropriate for data sets
with extreme values
▪ For qualitative data, the mode is most appropriate
The Mean
▪ When data is fairly symmetric, the most appropriate measure
is the mean.
▪ Can be used for at least interval scaled data
▪ The mean is denoted by μ (population) or x̄ (sample)
▪ Given sample data, the mean is computed as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
▪ What is the mean of the following data set: 12, 21, 13, 7, 10?
$$\bar{x} = \frac{12+21+13+7+10}{5} = 12.6$$
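The same computation as a short sketch, done once by hand and once with `statistics.mean` from the Python standard library:

```python
from statistics import mean

data = [12, 21, 13, 7, 10]

x_bar = sum(data) / len(data)   # sum of the values divided by n
print(x_bar)
print(mean(data))               # the library function gives the same value
```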
The Median
▪ The median is the middle value when the data is arranged
from lowest to highest
▪ Can be used for at least ordinal scaled data
▪ Example 1: 12, 21, 13, 7, 10
Array: 7, 10, 12, 13, 21 → median = 12
▪ Example 2: 17, 45, 12, 25, 50, 32
Array: 12, 17, 25, 32, 45, 50 → median = (25+32)/2 = 28.5
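A minimal sketch of the two cases, an odd and an even number of observations. The helper name `median_of` is hypothetical; `statistics.median` is the standard-library equivalent.

```python
from statistics import median

def median_of(values):
    """Middle value of the sorted array; average of the two middle values if n is even."""
    arr = sorted(values)
    n = len(arr)
    mid = n // 2
    return arr[mid] if n % 2 == 1 else (arr[mid - 1] + arr[mid]) / 2

example1 = [12, 21, 13, 7, 10]
example2 = [17, 45, 12, 25, 50, 32]

print(median_of(example1), median(example1))
print(median_of(example2), median(example2))
```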
The Mode
▪ The mode is the observation with the highest frequency.
▪ Can be used for all scales of measurement

▪ Example 1: 23, 34, 23, 45, 46, 37, 45, 23, 45, 23
Mode=23 (unimodal)

▪ Example 2: red, orange, blue, red, green, blue, blue, red, green
Mode=red and blue (bimodal)
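Both examples can be checked with `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest frequency and so handles the unimodal and bimodal cases alike.

```python
from statistics import multimode

example1 = [23, 34, 23, 45, 46, 37, 45, 23, 45, 23]
example2 = ["red", "orange", "blue", "red", "green",
            "blue", "blue", "red", "green"]

modes1 = multimode(example1)   # one value: unimodal
modes2 = multimode(example2)   # two values tied: bimodal

print(modes1)
print(modes2)
```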
Measures of variation
▪ These are statistics which measure the extent to which data
values are spread out from their center
▪ Also known as measures of dispersion or scatter
▪ Common measures of variation are: range and variance
(standard deviation)
▪ The range (R) is the difference between the largest and
smallest values in a data set.
Example: The range of the data set {12, 21, 13, 7, 10} is
equal to 14
The variance and standard deviation
▪ The average of the squared deviations of each value from
their mean is the variance.
▪ Based on sample data, the variance (s²) is computed as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
▪ The positive square root of the variance is known as the
standard deviation (s).
The variance
Example:
The following values represent the force (lb) required to
pull-off connectors in a prototype engine: 12.6, 12.9, 13.4,
12.3, 13.5, 13.6, 12.6, 13.1. Compute and interpret the
standard deviation.
Answer: s=0.48 lb
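The answer can be checked with a short sketch that applies the sample variance formula (divisor n - 1) step by step and compares the result with `statistics.stdev`:

```python
import math
from statistics import stdev

force = [12.6, 12.9, 13.4, 12.3, 13.5, 13.6, 12.6, 13.1]   # in lb

n = len(force)
x_bar = sum(force) / n
squared_devs = [(x - x_bar) ** 2 for x in force]   # squared deviations from the mean
s2 = sum(squared_devs) / (n - 1)                   # sample variance
s = math.sqrt(s2)                                  # sample standard deviation

print(round(x_bar, 2), round(s2, 4), round(s, 2))
```

One way to read the result: the pull-off forces typically differ from their mean of 13 lb by roughly half a pound.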
The coefficient of variation
▪ Sometimes one wishes to compare the degree of variation of
two or more data sets
▪ When the data sets have different units of measure, say
income variation vs height variation, the standard deviation
cannot be used as the basis for comparison
▪ Similarly, when the data sets differ much in their means, the
standard deviation cannot be used as the basis for
comparison
▪ The coefficient of variation is the alternative
The coefficient of variation
▪ The coefficient of variation is computed by dividing the
standard deviation by the mean
▪ Usually expressed in percent
▪ The closer the CV to zero the less varied are the values in the
data set
The coefficient of variation
Example:
The following values represent the force (lb) required to
pull-off connectors in a prototype engine: 12.6, 12.9, 13.4,
12.3, 13.5, 13.6, 12.6, 13.1. Compute and interpret the
coefficient of variation.
Answer: s = 0.48 lb
x̄ = 13 lb
CV = (0.48/13) × 100% ≈ 3.69%
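A sketch of the CV computation for the same data. With the unrounded standard deviation the CV comes out at about 3.68%. The 3.69% in the answer results from using the rounded value s = 0.48.

```python
from statistics import mean, stdev

force = [12.6, 12.9, 13.4, 12.3, 13.5, 13.6, 12.6, 13.1]   # in lb

x_bar = mean(force)
s = stdev(force)                       # sample standard deviation (divisor n - 1)

cv = 100 * s / x_bar                   # CV in percent, unrounded s
cv_rounded_s = 100 * round(s, 2) / x_bar   # CV in percent, using s = 0.48

print(round(cv, 2), round(cv_rounded_s, 2))
```

Either way the CV is under 4%, so the forces vary little relative to their mean.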
