
Statistics

Felimon A. Pimentel Jr., RChE, ME, PhD, CESE, FRIChE


• Statistics – from the Latin word status, which means “state”.
• The term became popular only in the 18th century.
• Its original definition was “the science dealing with
data about the condition of a state or community”.
• The practice of statistics dates back to the early
biblical times when nations or states collected data
pertinent in the administration of the affairs of the
state.
• The use of statistics in government is very evident
these days.
What is statistics?
• Collecting data - questionnaires and tests
• Presenting data – tables and figures
• Analysis of data – descriptive and inferential
• Interpretation of data – implications and
conclusions
• Data – facts and figures
• Information – data + context
• Knowledge – information + meaning
• Types of data with respect to its nature
• 1. Quantitative – numerical
• 2. Qualitative – descriptive (non-numerical)
• Types of data with respect to its source
• 1. Primary data – taken directly from the
source
• 2. Secondary data – taken from existing
repository of information
• Types of data with respect to its measure
• 1. Discrete data – countable; taken as whole units
• 2. Continuous data – measurable; can take any value within a range
• Types of data with respect to arrangement
• 1. Ungrouped data – raw scores
• 2. Grouped data – arranged in a frequency
distribution

• Frequency distribution – arrangement of data in class intervals with their corresponding frequencies.

Two divisions of Statistics
• Descriptive statistics – describes the behavior
of the data. Measure of central tendency
(mean, median, mode); measure of variability
(range, MAD, SD, coefficient of variation);
measure of position (quartiles, quintiles,
deciles, percentiles)
• Inferential statistics – infer the characteristics
of the population from the sample.
Hypothesis testing, various statistical tests.

Sample → Statistic (mean, sd)
Population → Parameter (μ, σ)
• Population – refers to the complete set of all
the observations, elements, or objects under
consideration.
• Sample – refers to the representative portion of
the population or the subset of the population.
• Parameter – refers to the numerical description
of a population.
• Statistic – refers to the numerical description of
a sample.
• In statistics, data are facts or figures that
indicate a variable.
• Variable – refers to anything that varies.
• Independent variable – a variable that can
stand on its own.
• Dependent variable – a variable that relies on
another variable (independent variable)
• Extraneous variable – a variable that
influences other variables but is not under
consideration.
Level of Measurement
Ratio level (numerical)
1.The numbers in the data are used to classify a
person/object into distinct, non-overlapping,
and exhaustive categories;
2.The data are arranged into categories
according to magnitude.
3.The data have a fixed unit of measure
representing a set size throughout the scale.
4.The data have absolute zero.
Ex: temperature in Kelvin, daily allowance
• Interval level (numerical)
1.The numbers in the data are used to classify a
person/object into distinct, non-overlapping,
and exhaustive categories;
2.The data are arranged into categories
according to magnitude.
3.The data have a fixed unit of measure
representing a set size throughout the scale.
Ex: temperature in Celsius, IQ scores
• Ordinal level (categorical, rankable)
1.The numbers in the data are used to classify a
person/object into distinct, non-overlapping,
and exhaustive categories;
2.The data are arranged into categories
according to magnitude.
Ex: shirt size, academic rank
• Nominal level (categorical)
1.The numbers in the data are used to classify a
person/object into distinct, non-overlapping,
and exhaustive categories.
Ex: sex, nationality
Exercises
• 1. postal zip code
• 2.performance rating as O, VS, S, MS, NI
• 3.student number
• 4.Body temperature in Celsius
• 5.Ranking in class (1st honor, etc)
• 6.annual salary
• 7. TIN
Data Collection
• 1. Interview or direct method
• 2.Questionnaire or indirect method
• 3.Registration method
• 4.Observation
– Participant observation
– Non-participant observation
5.Experiment
Sampling – the process of selecting sample units from the population.

• Determining the sample size (n) from the population size (N)
• Slovin’s Formula: n = N / (1 + Ne²)
e = margin of error (e.g., 5%)

• Target population – the population we want
to study
• Sampled population – the population from
where we actually select the sample
• Elementary units – element of the population
whose measurement on the variable of
interest is what we wish to examine
• Sampling unit –unit of the population that we
select in our sample
• Sampling frame – a list or map showing all the
sampling units in the population
• Sampling error occurs when we collect data
from the sample and not from the population
Sampling Techniques
• Random (Probability) sampling – gives every population unit an equal chance of being included in the sample.
• 1.Lottery – draw lots
• 2.Random numbers – generated by calculator
• 3.Systematic – every kth unit, where k = N/n
• 4.Stratified proportional- every group is represented
• 5.Cluster- only the chosen group is represented
• 6.Multistage – for large population
» Proportional allocation
• Total population: N = 170 (1st yr = 100, 2nd yr = 50, 3rd yr = 20)
• Slovin’s formula: n = 170/[(170 × 0.0025) + 1] = 170/1.425 = 119.3 ≈ 120
• 1st yr: 100(120/170) = 70.6 ≈ 71
• 2nd yr: 50(120/170) = 35.3 ≈ 36
• 3rd yr: 20(120/170) = 14.1 ≈ 15
• Total sample after rounding up: 71 + 36 + 15 = 122
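The proportional allocation above can be reproduced in Python; the stratum sizes come from the example, and rounding each share up yields the same totals:

```python
import math

# population strata from the example above
strata = {"1st yr": 100, "2nd yr": 50, "3rd yr": 20}
N = sum(strata.values())                   # 170
n = math.ceil(N / (1 + N * 0.05 ** 2))     # Slovin: 119.3 -> 120

# proportional allocation: each stratum gets N_i * (n / N), rounded up
alloc = {yr: math.ceil(size * n / N) for yr, size in strata.items()}
print(alloc)                               # {'1st yr': 71, '2nd yr': 36, '3rd yr': 15}
print(sum(alloc.values()))                 # 122
```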
• Non-random (Non-probability) sampling – population units do not have equal chances of being included in the sample.
• 1.Purposive – with criteria
• 2.Convenience-most accessible, or easiest to
contact.
• 3.Accidental – by chance
• 4.Quota – fixed number
• 5.Snowball-networking
Data Presentation
• Textual – paragraph
• Tabular – tables
• Graphical – graphs and charts
– Line graph
– Bar graph (column graph)
– Circular graph (pie chart)
– Pictograph
– Map graph
Organization of Data
• Raw data – data in original form
• Array – sorted data or ordered data
Constructing a Frequency Distribution

• 1.Determine the range (R = HS − LS)
• 2.Determine the number of class intervals (5 to 20)
– Using Sturges’s Rule: k = 1 + 3.322 log n
k = number of class intervals
n = sample size
• 3.Determine the class size (c = R/k)
• 4.List the class intervals
• 5.Tally all the observed values in each class
interval
• 6.Determine the class frequency (total tally for
each class interval)
• 7.Determine class boundaries and class marks
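Steps 1 to 6 above can be sketched in Python. The sample data are invented for illustration, and the class size is rounded up so the last interval covers the highest score:

```python
import math

def freq_distribution(data):
    """Steps 1-6: range, Sturges's rule, class size, intervals, frequencies."""
    R = max(data) - min(data)                       # step 1: range
    k = round(1 + 3.322 * math.log10(len(data)))    # step 2: Sturges's rule
    c = math.ceil((R + 1) / k)                      # step 3: class size (rounded up)
    table = []
    low = min(data)
    for _ in range(k):                              # steps 4-6: intervals, tally, frequency
        high = low + c - 1
        f = sum(low <= x <= high for x in data)
        table.append(((low, high), f))
        low += c
    return table

for interval, f in freq_distribution([12, 15, 21, 9, 30, 25, 18, 22, 27, 14]):
    print(interval, f)
```

Class boundaries (step 7) are the interval limits ± 0.5, and class marks are the interval midpoints.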
Exercises
• 175 156 194 138 117 94 120 145 168 43
• 247 129 118 110 225 75 135 103 97 55
Graphical Representation of
Frequency Distribution
• Frequency Histogram (bar graph)
– Y-axis – frequency or relative frequency
– X-axis – class marks
• Frequency Polygon (line graph)
• Ogives
– < ogive: Y = less-than cumulative frequency; X = upper class boundaries
– > ogive: Y = greater-than cumulative frequency; X = lower class boundaries
Measure of Central Tendency
• Ungrouped Data
• Mean – sum of all the scores divided by the
number of scores

• Median – the middlemost score in an array
• Mode – the score which appears most
frequently
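For ungrouped data, Python’s standard library computes all three measures directly; the scores below are illustrative:

```python
from statistics import mean, median, mode

scores = [23, 26, 30, 23, 26, 28, 26, 29]   # illustrative raw scores
print(mean(scores))     # 26.375
print(median(scores))   # 26.0 (average of the two middle scores)
print(mode(scores))     # 26 (appears three times)
```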
• Grouped Data
• Mean = Σfx / n (f = class frequency, x = class mark)
• Median = LCBm + C[(n/2 − Σfm−1) / fm]
LCBm – lower class boundary of the median class
C – class size
n – sample size
Σfm−1 – cumulative frequency before the median class
fm – frequency of the median class
• Mode = LCBmo + C[Δ1 / (Δ1 + Δ2)]
LCBmo – lower class boundary of the modal class
C – class size
Δ1 – difference between the frequency of the modal class and the frequency just above it.
Δ2 – difference between the frequency of the modal class and the frequency just below it.
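The grouped median and mode formulas translate directly into Python. The distribution values below (boundary 19.5, class size 5, n = 40, and the frequencies) are hypothetical, chosen only to show the arithmetic:

```python
def grouped_median(lcb_m, c, n, cf_before, f_m):
    """Median = LCBm + C * ((n/2 - cumulative f before median class) / fm)."""
    return lcb_m + c * ((n / 2 - cf_before) / f_m)

def grouped_mode(lcb_mo, c, d1, d2):
    """Mode = LCBmo + C * (d1 / (d1 + d2))."""
    return lcb_mo + c * (d1 / (d1 + d2))

# hypothetical median/modal class 20-24: LCB 19.5, c = 5, n = 40,
# cumulative frequency before the class = 12, class frequency = 10
print(grouped_median(19.5, 5, 40, 12, 10))   # 19.5 + 5*(8/10) = 23.5
print(grouped_mode(19.5, 5, 4, 6))           # 19.5 + 5*(4/10) = 21.5
```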
Exercises
• 23 26 30 23 26
• 28 29 29 25 24
• 20 18 23 27 30
• 21 15 28 21 17
• 25 20 29 18 15
• 22 19 25 17 21
• 23 22 22 20 24
• 30 25 20 15 28
Do the following:
• 1.Construct a frequency distribution
• 2.Determine the mean, median and mode of
ungrouped data.
• 3.determine the mean, median and mode of
the grouped data
• 4.Draw the histogram of the grouped data
• 5.Draw the frequency polygon of the grouped
data
• 6.Draw the < and > ogives
Normal Curve
In a normal distribution, the graph appears as the classical, symmetrical "bell-shaped curve." The mean (average), the median, and the mode (the maximum point on the curve) are equal.

• To understand the probability factors of a normal distribution, you need to understand the following rules:
• 1. The total area under the curve is equal to 1 (100%).
2. About 68% of the area under the curve falls within 1 standard deviation of the mean.
3. About 95% of the area under the curve falls within 2 standard deviations of the mean.
4. About 99.7% of the area under the curve falls within 3 standard deviations of the mean.
• Items 2, 3, and 4 are sometimes referred to as the ‘empirical rule’ or the
68-95-99.7 rule. In terms of probability, once we determine that the data
is normally distributed (bell curved) and we calculate the mean
and standard deviation, we are able to determine the probability that a
single data point will fall within a given range of possibilities.
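The 68-95-99.7 rule can be verified directly from the standard normal distribution using Python’s standard library:

```python
from statistics import NormalDist

Z = NormalDist()                      # standard normal: mean 0, sd 1
for k in (1, 2, 3):
    area = Z.cdf(k) - Z.cdf(-k)       # area within k standard deviations of the mean
    print(f"within {k} SD: {area:.1%}")
```

This prints approximately 68.3%, 95.4%, and 99.7%, matching the empirical rule.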
Standard Score
• The major purpose of standard scores is to place scores for
any individual on any variable having any mean and standard
deviation on the same standard scale so that comparisons can
be made. Without some standard scale, comparisons across
individuals and/or across variables would be difficult to make
(Lomax,2001, p. 68). In other words, a standard score is
another way to compare a student's performance to that of
the standardization sample. A standard score (or scaled score)
is calculated by taking the raw score and transforming it to a
common scale. A standard score is based on a normal
distribution with a mean and a standard deviation.
The basic z score formula for a population is:
z = (x – μ) / σ

• For a sample, the same formula is used with x̄ (the sample mean) in place of μ (the population mean) and s (the sample standard deviation) in place of σ (the population standard deviation):
z = (x – x̄) / s
The steps for solving it are exactly the same.
• The z score tells you how many standard
deviations from the mean your score is.
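A one-line function makes the computation concrete; the test score, mean, and standard deviation below are invented for illustration:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# a score of 85 on a test with mean 70 and standard deviation 10:
print(z_score(85, 70, 10))   # 1.5, i.e. 1.5 SDs above the mean
print(z_score(55, 70, 10))   # -1.5, i.e. 1.5 SDs below the mean
```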
Skewness
• Skewness is asymmetry in a statistical
distribution, in which the curve appears
distorted or skewed either to the left or to the
right. Skewness can be quantified to define
the extent to which a distribution differs from
a normal distribution.
• When a distribution is skewed to the left, the tail on the curve's left-hand side is longer than the tail on the right-hand side, and the mean is less than the mode. This situation is also called negative skewness.
• When a distribution is skewed to the right, the tail on the curve's right-hand side is longer than the tail on the left-hand side, and the mean is greater than the mode. This situation is also called positive skewness.
Skewness
• Population skewness: γ₁ = Σ(x − μ)³ / (Nσ³)
• Sample skewness: g₁ = [Σ(x − x̄)³ / n] / s³, where s = √[Σ(x − x̄)² / n]
• The latter formula is referred to as the Fisher-Pearson coefficient of skewness.
• Skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero.
Negative values for the skewness indicate data that
are skewed left and positive values for the skewness
indicate data that are skewed right. By skewed left,
we mean that the left tail is long relative to the right
tail. Similarly, skewed right means that the right tail
is long relative to the left tail. If the data are multi-
modal, then this may affect the sign of the skewness.
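The Fisher-Pearson coefficient can be computed from the central moments m₂ and m₃; the two data sets below are invented to show a symmetric case and a right-skewed case:

```python
def skewness(data):
    """Fisher-Pearson coefficient of skewness: g1 = m3 / m2**1.5."""
    n = len(data)
    xbar = sum(data) / n
    m2 = sum((x - xbar) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - xbar) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))        # symmetric data -> 0.0
print(skewness([1, 1, 1, 2, 10]) > 0)   # long right tail -> True (positive skew)
```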
Kurtosis
• Kurtosis is a measure of the thickness or heaviness of a distribution's tails; in other words, it is a measure of the "tailedness" of the distribution. Hence, it is considered a common measure of shape. Outliers in the data have a strong effect on this measure. Moreover, it does not have any unit.
• The distribution with kurtosis equal to 3 is known as mesokurtic. A random variable which follows the normal distribution has kurtosis 3.
• If the kurtosis is less than three, the distribution is called platykurtic. Here, the distribution has shorter and thinner tails than the normal distribution. Moreover, the peak is lower and broader compared to the normal distribution.
• If the kurtosis is greater than three, the distribution is called leptokurtic. Here, the distribution has longer and fatter tails than the normal distribution. Moreover, the peak is higher and sharper compared to the normal distribution.
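Kurtosis in this (non-excess) convention is the ratio m₄ / m₂²; a flat, uniform-like data set illustrates a platykurtic value below 3:

```python
def kurtosis(data):
    """Kurtosis = m4 / m2**2; equals 3 for a normal distribution."""
    n = len(data)
    xbar = sum(data) / n
    m2 = sum((x - xbar) ** 2 for x in data) / n   # second central moment
    m4 = sum((x - xbar) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2

print(kurtosis([1, 2, 3, 4, 5]))   # 1.7 < 3, so flat data are platykurtic
```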
Hypothesis Testing
• Hypothesis testing refers to the formal procedures used by
statisticians to accept or reject statistical hypotheses.
• A hypothesis is an educated guess about something in the
world around you. It should be testable, either by experiment
or observation. 
• Null hypothesis. The null hypothesis, denoted by H0, is usually
the hypothesis that sample observations result purely from
chance.
• Null hypothesis – a statement of no significance.
• Alternative hypothesis. The alternative hypothesis, denoted
by H1 or Ha, is the hypothesis that sample observations are
influenced by some non-random cause.
• Alternative hypothesis – a statement of significance.
• "significance" refers to something that is
extremely useful and important. But in
statistics, "significance" means "not by chance" or
"probably true". 
• The level of significance for a statistical hypothesis test is defined as the fixed probability of wrongly rejecting the null hypothesis when it is in fact true. The significance level is the probability of a Type I error and is preset by the researcher in view of the consequences of that error. 
• A confidence level refers to the percentage of
all possible samples that can be expected to
include the true population parameter.
• Type I errors happen when we reject a true
null hypothesis.
• Type II errors happen when we fail to reject a
false null hypothesis.
• STEPS IN STATISTICAL HYPOTHESIS TESTING
  Step 1: State the null hypothesis, H0, and the alternative hypothesis, Ha. The alternative
hypothesis represents what the researcher is trying to prove. The null hypothesis represents
the negation of what the researcher is trying to prove.
• Step 2: State the size(s) of the sample(s). This represents the amount of evidence that is being used to make a decision. State the significance level, α, for the test. The significance level is the probability of making a Type I error. A Type I error is a decision in favor of the alternative hypothesis when, in fact, the null hypothesis is true. A Type II error is a decision to fail to reject the null hypothesis when, in fact, the null hypothesis is false.
• Step 3: State the test statistic that will be used to conduct the hypothesis test (the appropriate test statistic for each kind of hypothesis test must be identified).
• Step 4: Find the critical value for the test. This value represents the cutoff point for the test statistic. If the null hypothesis were true, there would be only a probability of α of obtaining a value of the test statistic at least this extreme. If the value of the test statistic computed from the sample data is beyond the critical value, the decision will be made to reject the null hypothesis in favor of the alternative hypothesis.
• Step 5: Calculate the value of the test statistic, using the sample data. (If you are using Excel or SPSS, or some similar computer package, you will calculate the value of the test statistic along with a p-value.)
• Step 6: Decide, based on a comparison of the calculated value of the test statistic and the critical value of the test, whether to reject the null hypothesis in favor of the alternative. (If you have a calculated p-value, then decide based on a comparison of the p-value with α: if the p-value is less than α, reject H0; otherwise, fail to reject H0.)
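The six steps can be walked through in Python. The sample data and the null value μ₀ = 100 are invented; with only n = 10 observations a t critical value is the proper choice, but this sketch uses the normal approximation available in the standard library:

```python
from statistics import NormalDist, mean, stdev

# Step 1: H0: mu = 100 versus Ha: mu != 100 (two-tailed)
# Step 2: a hypothetical sample and significance level alpha = 0.05
sample = [104, 98, 110, 105, 102, 99, 108, 107, 103, 106]
mu0, alpha = 100, 0.05
n = len(sample)

# Step 3: test statistic (sample mean - mu0) / (s / sqrt(n))
stat = (mean(sample) - mu0) / (stdev(sample) / n ** 0.5)

# Step 4: critical value for a two-tailed test at level alpha
crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

# Steps 5-6: compare the computed statistic with the critical value and decide
decision = "reject H0" if abs(stat) > crit else "fail to reject H0"
print(round(stat, 2), round(crit, 2), decision)
```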
Statistical Tests
(Parametric tests assume normally distributed data; non-parametric tests suit ordinal or skewed data. Each row gives the dependent (outcome) and independent (explanatory) variable types, then the parametric and non-parametric tests.)
A. Comparing:
• The averages of two INDEPENDENT groups – scale outcome, nominal (binary) explanatory – Independent t-test; Mann-Whitney test / Wilcoxon rank sum
• The averages of 3+ independent groups – scale outcome, nominal explanatory – One-way ANOVA; Kruskal-Wallis test
• The average difference between paired (matched) samples, e.g. weight before and after a diet – scale outcome, nominal/ordinal explanatory – Paired t-test; Wilcoxon signed rank test
• The 3+ measurements on the same subject – scale outcome, nominal/ordinal explanatory – Repeated measures ANOVA; Friedman test
B. Establishing relationship/association:
• Scale vs. scale – Pearson r; Spearman rho
C. Involving prediction:
• Scale vs. scale – Regression Analysis; Nonparametric Regression
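The decision table above can be encoded as a small lookup; this is only a sketch, and the goal labels are invented shorthand for the rows of the table:

```python
# decision table from the summary above: (goal, data normally distributed?) -> test
TESTS = {
    ("2 independent groups", True):   "Independent t-test",
    ("2 independent groups", False):  "Mann-Whitney / Wilcoxon rank sum",
    ("3+ independent groups", True):  "One-way ANOVA",
    ("3+ independent groups", False): "Kruskal-Wallis test",
    ("paired samples", True):         "Paired t-test",
    ("paired samples", False):        "Wilcoxon signed rank test",
    ("repeated measures", True):      "Repeated measures ANOVA",
    ("repeated measures", False):     "Friedman test",
    ("association", True):            "Pearson r",
    ("association", False):           "Spearman rho",
    ("prediction", True):             "Regression analysis",
    ("prediction", False):            "Nonparametric regression",
}

def choose_test(goal, normally_distributed=True):
    """Parametric test when the data are normal; non-parametric otherwise."""
    return TESTS[(goal, normally_distributed)]

print(choose_test("2 independent groups", normally_distributed=False))
```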
• The significance level is conventionally set at 5% (α = 0.05)

• Thank You.
