
COM508 BUSINESS STATISTICS
REVIEWER

UNIT 1 Intro to data collection
A. Preview of Business stats

1. Origin of statistics
➔ From "statista" (statesman)
➔ First used by Gottfried Achenwall at Marburg and Göttingen
➔ Introduced to England by Dr. E.A.W. Zimmermann
➔ Popularized by Sir John Sinclair in his “Statistical Account of Scotland, 1791-1799”

2. Definition
➔ A collection of numerical facts or data
➔ An academic discipline
➔ The collection, presentation, analysis, and utilization of numerical data to make inferences and reach decisions in the face of uncertainty or incompleteness of data

3. Divisions
a. Descriptive statistics
➔ Summarizing and describing a body of data in the form of tables, charts, graphs, and other forms of graphic display, often done with the use of frequency distributions
➔ It consists of procedures for describing some characteristics of the data through the use of descriptive tools such as measures of central tendency, dispersion, shape, and association.

b. Inferential statistics
➔ The process of reaching generalizations about a population by examining a sample
➔ Involves inductive reasoning
➔ To be valid, it must be based on a sample that fully reflects the characteristics and properties of the population from which it is drawn
➔ Possibility of error exists
➔ Probability theory is essential; it is also referred to as deductive statistics
➔ Statistical inference is a branch of statistics called decision theory.

4. Relevance
➔ The application of statistics in a wide variety of business, economic, and daily situations shows its importance and relevance
➔ Statistics are the sets of mathematical equations that we use to analyze things; they keep us informed about what is happening in the world

B. Concepts of Data, Variables, Elements, and Observations

1. Distinction of data, variables, elements, and observations
Data
- The facts and figures collected, summarized for presentation, analyzed, and interpreted
- All the data collected in a particular study are referred to as the data set for the study
Elements (like sample)
Variables
- The characteristic of interest for elements
- Measurements (observed values) are collected on each variable
Observation
- The set of measurements obtained for a particular element; thus if there are 50 elements there are also 50 observations
2. Types of Data
According to the nature of data
Categorical/Qualitative Data
➔ Dichotomous
➔ Multinomial
Numerical/Quantitative Data
➔ Discrete
➔ Continuous

Categorical data
Dichotomous
- There are only 2 categories/choices and the data is binary coded
- Ex: sex, gender, answer
Multinomial
- There are 3 or more categories/choices and zero is avoided in coding
- Ex: Nationality, languages, religion
Numerical Data
Discrete
- Countable numbers and whole numbers
- Ex: People, chairs, tables
Continuous
- Numbers that can take fractional or decimal values
- Ex: price, income, weight

Time periods and number of elements
Cross sectional
➔ One time period only (current) and many elements
Time series
➔ Many time periods and one element
Panel or longitudinal
➔ Many time periods and many elements

Grouping
Ungrouped - raw data
Grouped - ranges are made

Filling
Stacked - cross sectional or time series
Unstacked - Panel or longitudinal

3. Levels of measurement scale
Qualitative: Nominal and ordinal
Quantitative: Interval and ratio

Nominal
➔ Consists of or uses labels to identify an attribute of the element
➔ A numerical as well as non-numerical code/label may be used
Ordinal
➔ Exhibits the properties of nominal data but in addition the order or rank of the data is meaningful
➔ A non-numeric or numerical code can be used to record it, but in assigning the code the order of values should be maintained
Interval
➔ The data has all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measurement
➔ Always numerical
➔ The differences between values are meaningful
➔ Characterized by not having a meaningful zero value
Ratio
➔ The data has all the properties of interval data and the ratio of two values is meaningful
➔ This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point

4. Relationships among variables
★ Dependent
★ Independent
★ Moderating
★ Control
★ Intervening
1. Dependent variable
➔ The presumed effect, response, or affected variable; its outcome can be predicted
➔ Since its value is predicted from the models, it can also be called an endogenous variable
2. Independent Variable
➔ The one causing the change; the main factor presumed to be the cause or stimulus that manipulates the value of the dependent variable
➔ Sometimes called the main explanatory variable
➔ The value of this variable is derived outside of the model and is thus called an exogenous variable
3. Moderating Variable
➔ It is the second independent variable or another explanatory variable
➔ Believed to have a significant contributory or contingent effect on the original IV-DV relationship
4. Control Variable
➔ Another extraneous variable or explanatory variable which is considered to have a confounding effect on the hypothesized IV-DV relationship
5. Intervening Variable (Qualitative)
➔ A conceptual mechanism through which the IV and MV might affect the DV
➔ The factors which theoretically affect the observed phenomenon but cannot be seen, measured, or manipulated; their effect must be inferred from the effects of the IV and MV on the observed phenomenon

C. Data Collection Methods

Sources of Data
1. Secondary Data
2. Primary Data

Secondary Data
➔ It involves the acquisition of data from secondhand sources or from data previously gathered or generated by other agencies, public or private, for other purposes
➔ Different sources of secondary data
◆ Printed reports or publications
◆ Electronic databases
◆ Gov agencies, universities, commercial info service groups, etc.

Primary Data
➔ It involves the gathering of data from the direct source for the first time for the purpose of the researchers' study
Different techniques to gather primary data
1. Surveys
2. Observations
3. Experimentation
4. Unobtrusive
5. Delphi technique
6. Focus group discussion
7. Projective method

Survey
Interview
➔ Personal or telephone, using an interview guide or interview schedule
➔ Interview guide - unstructured/open ended
➔ Interview schedule - structured/close ended
Questionnaire
➔ Self-administered; mailed; email/internet
➔ An important consideration is the response/retrieval rate: only about 25%-30% of questionnaires are returned, so there is a need to follow up persistently
Guide in constructing a good questionnaire
Content
Purpose - should adequately answer the objectives and hypothesis of the study
Language - the words to be used should be familiar and nearest to the respondents’ level of comprehension
Questions - open-ended or close-ended
Measurement - determine the type of response categories to be used for the close-ended questions
- Validity and reliability
Order - should be arranged in a manner which facilitates the respondent’s replies

Other Features
Introduction - must contain an introductory part
- State the purpose and significance of the study and the reason for choosing the respondent, and ask for his/her cooperation
Appearance - should motivate the respondent
Length - short to encourage the respondent
Personal data - standard practice that demographic information is collected

Observation
➔ This method requires the use of one’s senses by looking at behavioral and non-behavioral phenomena with the aid of a camera, one-way mirror, and/or recording instrument, without directly posing questions to the respondents
➔ The approach can be participant or non-participant observation
Experiment
➔ Data is derived under controlled conditions
➔ Time and cost issues should be carefully considered under this method

Unobtrusive
➔ Other sources aside from individuals, such as employee attendance, report cards, diaries, bills, etc.
Delphi Technique
➔ Qualitative process of acquiring information on issues
➔ Involves forecasting or predicting trends or outcomes
➔ Entails choosing a panel of key informants who are considered experts on the topic under study to be the respondents
Focus group discussions
➔ It involves gathering 5-15 selected participants to elicit opinions, perceptions, ideas, and beliefs about a specific topic of interest
Projective method
➔ These data can be useful in studies on consumer preferences and workers' motivation
➔ Uses standardized psychological tests such as inkblots, sentence completion, or thematic apperception to probe deeper into the minds, behavior, and attitudes of respondents
Triangulation/Quadrangulation
➔ This process means that a combination of different procedures applied to the same group of respondents can strengthen confidence in one’s result, since the data from one procedure can be validated with the data of the other procedure

D. Sampling Theory
➔ Sampling design - method used to select a sample from the population
➔ The main objective of the sampling design is to provide guidelines for selecting a sample that will provide a specific amount of information about the population at a minimum cost
➔ If the elements in the population are relatively uniform, then any small sample will provide acceptable results
➔ When the elements in the population are not relatively uniform, we must take care in determining how to obtain a sample of data
➔ We want to obtain a sample that is representative of all the elements of the population

Terms:
Sampling design - the set of decisions that must be made before the data are collected
Population - a set or collection of all possible observations of some specific characteristics
Sample - representative portion of the population
Parameter - a summary measure computed from data collected from the population
Statistic - a summary measure computed from data collected from the sample
Census - every element in the population is recorded
Elementary unit or element - object/person on which a measurement is taken
Frame - a listing of all elementary units in a given problem

Since the sample is only part of a population, any inferences made about the population characteristics may be erroneous. Despite this possibility, there are various reasons for taking a sample rather than a census of the entire population.

Reasons for using sampling
➔ Expense
➔ Speed of response
➔ Infinite number of observations
➔ Destructive sampling
➔ Accuracy

Steps in Sampling process:
1. Define the population from which the sample is to be drawn
2. Specify the population frame from which the sample will be taken
3. Choose the sampling method for selecting samples
4. Determine the sample size required for the study
5. Select the actual samples

Sampling Methods/Designs
According to element selection
➔ Unrestricted - any element from the population has the chance to become a sample
➔ Restricted - certain elements are given the chance to become a sample given certain qualifications
According to representation basis
➔ Probability - everyone is given an equal chance to become a respondent
➔ Non-Probability - not everyone is given an equal chance to become a respondent

Probability
Simple random sampling
➔ Each element in the population has an equal and known chance of being chosen as a respondent
Systematic sampling
➔ Allows the elements of the population to be selected using a constant number (K) derived from dividing the total population (N) by the sample size (n)
➔ K = N/n
➔ The first respondent is selected by random sampling and the succeeding respondents are every Kth element after it
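As a rough illustration of systematic sampling with K = N/n, here is a minimal Python sketch. The function name `systematic_sample`, the numbered frame, and the choice to round K down are assumptions made for the example, not part of the reviewer.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick n elements at a constant interval K = N // n after a random start."""
    N = len(population)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the population size")
    k = N // n                                   # K = N/n, rounded down (assumption)
    start = random.Random(seed).randrange(k)     # random start among the first K elements
    return [population[start + i * k] for i in range(n)]

# Hypothetical frame of 100 numbered respondents, sample of 10 (K = 10)
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=1)
```

Every selected respondent is exactly K positions after the previous one, so only the starting point is random.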
Stratified sampling
➔ The population is divided into strata or subgroups. It establishes homogeneity within each subgroup so that clear differences between groups are determined
➔ Once stratification is made, the final respondents can be selected either by simple random or systematic sampling
Clustered Sampling
➔ Involves grouping or division of the elements of the population into heterogeneous groups; it should be noted that each cluster sample is composed of respondents with different perspectives and interests
Area Sampling
➔ It pertains to the grouping of the population into geographical divisions before selecting the respondents. This sampling can be done if there exists a clear delineation of communities where the respondents can be found
Double stage sampling
➔ Getting a smaller sample from the initial large sample
➔ Called a sample within a sample
➔ Done when the researcher intends to gather more in-depth, focused data on the topic of investigation

Sample size determination
Sampling from a finite population
➔ Sampling without replacement
➔ Sampling with replacement
Sampling from an infinite population
➔ Each element selected comes from the same population
➔ Each element is selected independently

Sample size determination in research involving proportions normally uses the formula which came from the Philippine Social Science Council survey series
Another is the simple formula (Slovin’s formula)
n = N / (1 + Ne^2), where n = sample size, N = population size, and e = margin of error

Guidelines for min number of respondents
Descriptive studies - min 100
Correlational studies - min 50
Experimental and causal - min 30

Sampling errors
➔ Happen when an unrepresentative sample is used to represent the characteristics of the population; they result when non-random sampling is done

Non-sampling errors
➔ Mistakes made in data acquisition or from sample observations being selected improperly
Coverage Errors
➔ If the research objective and the population from which the sample is drawn are not aligned, the data collected cannot accomplish the research objective
Non-response error
➔ When segments of the target population are systematically underrepresented or overrepresented in the samples because these groups are either less likely or more likely to respond
Measurement error
➔ Incorrect measurement of the characteristics of interest: questions that are ambiguous or difficult for respondents to answer, and responses that do not accurately reflect the intended answers of the respondents. Also included are interviewer’s errors and processing errors made during the interview, recording, or data preparation
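Slovin's formula from the sample size discussion, n = N / (1 + Ne^2), can be sketched in a few lines of Python. The function name and the population figures are hypothetical, and rounding up to a whole respondent is an assumption about convention.

```python
import math

def slovin_sample_size(N, e):
    """Sample size n = N / (1 + N * e^2), rounded up to a whole respondent (assumption)."""
    return math.ceil(N / (1 + N * e ** 2))

# Hypothetical populations with a 5% margin of error
n_small = slovin_sample_size(1000, 0.05)
n_large = slovin_sample_size(10000, 0.05)
```

With a fixed margin of error, the required sample grows much more slowly than the population.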
Unit 2 Data Presentation
Data presentation techniques
3 ways of presenting data
1. Tabular
2. Graphical
3. Textual

Things to remember when using tabular techniques
All tabular presentations have the following
● Table number
● Table title
● Totals (row & column)
● Box headings (row & column heading)
○ Variable of interest
○ Categories
○ Frequency (f)
○ Relative frequency (rf)
○ Percentage (%)
Ex:
Table 1 frequency distribution of respondents by category

Tabular Data Presentation
Categorical/Qualitative
● Frequency distribution table or univariate table
● Relative distribution
● Percentage distribution
● Cross tabulation or cross classification or contingency table or multivariate table

Frequency distribution table
➔ Univariate
➔ Presents the categories and the counts of 1 variable
➔ Minimum number of categories is two (dichotomous) and maximum is twenty, and the categories are mutually exclusive or non-overlapping
Relative frequency distribution
➔ Lists the categories and the proportion with which each occurs
➔ Frequency of each category is divided by the total frequency; the sum of the proportions equals 1.00
➔ Formula = f/total
Percentage distribution
➔ Lists the categories and the percentage with which each occurs
➔ Relative frequency multiplied by 100; the sum is 100%
➔ Formula = rf*100
Ex:
Table 1 frequency distribution of respondents by sex

Cross tabulation/contingency table
➔ Presents the categories and the counts of two or more variables, or the relationship between two or more variables, and likewise compares two or more sets of data
➔ For presentation purposes only, the column is used for the variable with fewer categories
➔ For establishing relationships, the column is used for the independent or explanatory variable and the row is for the dependent variable
Ex:
Table 2. Frequency Distribution of Respondents by Sex & Location of Residence
Numerical/Quantitative
Array
➔ Arranging the data in ascending or descending order (vertical or horizontal)
Frequency distribution table
➔ presents the classes and the counts of a single variable; the minimum number of classes or class intervals is five and the maximum is twenty, and the classes must be mutually exclusive

Steps in creating a frequency distribution table for numerical or quantitative data
1. Determine the number of class intervals or classes
➔ Use the Sturges formula
➔ 1 + 3.3*log(N)
➔ Round up if the result is a fraction
2. Determine the class width
➔ range / number of classes
➔ Range = highest value - lowest value
3. Determine the class limits. Class limits must be chosen so that each data item belongs to one and only one class. The lower class limit identifies the smallest possible data value assigned to the class while the upper class limit identifies the highest possible data value assigned to the class.

Relative frequency distribution
➔ lists the classes and the proportion with which each occurs
➔ frequency of each class divided by the total frequency; the sum of which equals 1.00
Percent or percentage distribution
➔ lists the classes and the percentage with which each occurs
➔ relative frequency multiplied by 100; the sum of which equals 100%
Cumulative Frequency distribution/cf
➔ presents the classes and the accumulated counts (either in ascending or descending order) of a single variable
➔ Formula: f + previous cf
Cumulative Relative frequency distribution/crf
➔ lists the classes and the accumulated proportion (either in ascending or descending order) with which each occurs
➔ Formula: rf + previous crf
Cumulative Percent or percentage distribution or cpf or c%
➔ lists the classes and the accumulated percentage (either in ascending or descending order) with which each occurs
➔ Formula: % + previous c%
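The steps for building a frequency distribution table for numerical data can be sketched roughly as follows. The scores are hypothetical whole numbers, the helper name `frequency_table` is invented for the example, and pushing the class width up to the next whole unit (so the last class still contains the highest value) is an assumption about rounding.

```python
import math

def frequency_table(values):
    """Build class intervals with f, rf, % and cf for whole-number data."""
    n = len(values)
    k = math.ceil(1 + 3.3 * math.log10(n))      # Sturges formula, rounded up
    low, high = min(values), max(values)
    width = (high - low) // k + 1               # range / classes, pushed up to the next whole unit
    rows, cf = [], 0
    for i in range(k):
        lower = low + i * width                 # lower class limit
        upper = lower + width - 1               # upper class limit
        f = sum(lower <= v <= upper for v in values)
        cf += f                                 # cf = f + previous cf
        rows.append({"class": (lower, upper), "f": f,
                     "rf": f / n, "pct": 100 * f / n, "cf": cf})
    return rows

# Hypothetical data set of 20 whole-number scores
scores = [10, 12, 13, 15, 16, 18, 19, 21, 22, 22,
          24, 25, 27, 28, 30, 31, 33, 35, 38, 39]
table = frequency_table(scores)
```

For these 20 values the Sturges formula gives 1 + 3.3*log(20) ≈ 5.29, which rounds up to 6 classes of width 5.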
Cross tabulation or contingency table or Multivariate table
➔ presents the classes and the counts of two or more variables, or relationships between two or more variables, and likewise compares two or more sets of data.
➔ For presentation purposes only, the column is used for the variable with fewer classes
➔ For establishing relationships, the column is used for the independent or explanatory variable and the row is used for the dependent variable.

Simpson’s Paradox
➔ can happen for cross tabulations of qualitative data, quantitative data, or a mixture of qualitative & quantitative data.
➔ It happens when conclusions drawn from two or more separate cross tabulations are reversed when the data are aggregated into a single cross tabulation

Graphical Data Presentation
Categorical/Qualitative
● Pie chart
● Bar chart
○ Simple bar
○ component/stacked
○ compound/side by side
● Pictogram
● Mapgraphs/Cartogram

Things to remember when using graphical techniques
All graphical presentations have the following parts
● Chart or figure number
● Chart or figure title (centered or left-aligned)

Categorical Or Qualitative
Pie chart
➔ A graphical device for presenting categorical data summaries based on subdivision of a circle into sectors that correspond to the relative frequency or percent frequency for each class
➔ It is best used to represent a percentage distribution since the pie represents 100%

Bar Graph
Simple bar graph
➔ A graphical device for depicting categorical data that have been summarized in a frequency distribution table, wherein the emphasis is on the frequency or actual count for each category
➔ usually used for univariate tables
Component/stacked bar graph
➔ Bar chart in which each bar is broken into rectangular segments of a different color showing the relative frequency of each category or class in a manner similar to a pie chart
Compound/side by side bar graph
➔ A graphical display for depicting multiple subcategories per class on the same display

Component and compound bar graphs are used for cross-tabulation/multivariate tables

Pictogram
➔ Picture graphs
➔ Picture symbols are used to represent values

Map graphs
➔ a map is drawn and divided into the desired regions. Each region may be distinguished from other regions using varied lines, shadings with different colors, or other symbols.
➔ It is always accompanied by a legend which tells the meaning of the lines, colors, or other symbols used

Numerical/Quantitative
● Stem and leaf display
● Dot plots
● Scatter diagram
● Line graph or trendline
● Histogram
● Pareto diagram
● Polygon
● Ogive

Stem and leaf display
➔ A graphical display used to show simultaneously the rank order and shape of a distribution of data

Dot plot
➔ A graphical device that summarizes data by the number of dots above each data value on the horizontal axis
Scatter Diagram
➔ A graphical display of the relationship between two quantitative variables. The independent variable is on the x axis and the dependent variable is on the y axis

Line graph/trendline
➔ A line that provides an approximation of the relationship between two variables or the changes in a particular variable over a span of time

Histogram
➔ A graphical display of the frequency, relative frequency, or percent frequency distribution, constructed by placing the class intervals on the horizontal axis and the frequencies on the vertical axis

Pareto Diagram
➔ A type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars and the cumulative total is represented by the line

Polygon
➔ the line graph in the histogram or Pareto diagram representing the values of the frequency distribution itself.

Ogive
➔ the line graph in the histogram or Pareto diagram representing the cumulative percent frequency distribution
Graphical Excellence
➔ It is the term we apply to techniques that are informative and concise and that impart information clearly to their viewers.
➔ We discuss an equally important concept: Graphical Integrity and its enemy, graphical deception

5 Characteristics that should be applied to achieve Graphical Excellence
1. The graph presents large data sets concisely and coherently.
2. The ideas & concepts the statistics practitioner wants to deliver are clearly understood by the viewer.
3. The graph encourages the viewer to compare two or more variables.
4. The display induces the viewer to address the substance of the data and not the form of the graph.
5. There is no distortion of what the graph reveals.

Edward Tufte (professor of Statistics at Yale University) summarized graphical excellence this way:
➔ It is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design.
➔ It is that which gives the viewers the greatest number of ideas in the shortest time with the least ink in the smallest space.
➔ It is nearly always multivariate.
➔ It requires telling the truth about the data.

General guidelines on how to increase the likelihood that the display will effectively convey the key information in the data.
1. Give the display a clear and concise title
2. Keep the display simple. Do not use three dimensions if two dimensions are sufficient.
3. Clearly label each axis and provide the units of measure
4. If color is used to distinguish categories, make sure that the colors are distinct.
5. If multiple colors or line types are used, use a legend to define how they are used and place the legend close to the representation of the data.

GRAPHICAL DECEPTION:
1. Graph without a scale on one axis. There is no label or variable to measure on the y-axis although the x axis is the time period. So what variable is being measured on the y-axis: sales, profit, expenses, or something else?
2. Same graph with different captions. Your impression of the trend might differ depending on which caption you read.
3. Perspectives of the chart are often distorted by showing changes in absolute values rather than percentage changes.
4. Distortions on the chart can be made by drastic stretching of the vertical or horizontal axis or an expanded scale on either axis
5. Be on the lookout for size distortions, as in the case of pictograms used to enhance appeal, where the increase in sales is shown by an increase in both height and width (a bigger soft drink bottle over the years). One is less likely to be misled if the focus is on the numerical values rather than the graph representing the value
Unit 3 Data Interpretation
DESCRIPTIVE MEASURES
● Measures of Central Tendency or Location
● Measures of Relative Location
● Measures of Dispersion or Spread or Variability
● Measures of Shape
● Measures of Association or Linear Relationship
Measures of central tendency
Mean
➔ The sum of observations divided by the number of observations
Formula:
Ungrouped, population: µ = Σx/N
Ungrouped, sample: x̄ = Σx/n
Grouped, population: µ = Σfx/N
Grouped, sample: x̄ = Σfx/n

Median
➔ Middle value of the ordered observations
➔ Data must be in array
Formula:
Ungrouped, population: Med = the value at position (N + 1) / 2 of the array
Ungrouped, sample: Med = the value at position (n + 1) / 2 of the array
Grouped, population: Med = L + {[(N/2) – F] / fm} * c
Grouped, sample: Med = L + {[(n/2) – F] / fm} * c
Where L = lower limit of the median class
F = sum of the frequencies up to but not including the median class
fm = frequency of the median class
c = class size or width

Mode
➔ Most frequent observation
Formula:
Ungrouped (population or sample): the most frequently occurring value.
Grouped (population or sample): Mode = L + [d1 / (d1 + d2)] * c
Where L = lower limit of the modal class
d1 = frequency of the modal class less the frequency of the preceding class
d2 = frequency of the modal class less the frequency of the succeeding class
c = class size or width
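The grouped median and mode formulas can be checked with a small Python sketch. The class limits and frequencies are hypothetical, the function names are invented for the example, and L is taken as the stated lower class limit, as in the formulas.

```python
def grouped_median(lower_limits, freqs, width):
    """Med = L + {[(n/2) - F] / fm} * c for equal-width classes in ascending order."""
    n = sum(freqs)
    cumulative = 0                              # F: frequencies before the median class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= n / 2:             # first class whose cf reaches n/2
            return lower + ((n / 2 - cumulative) / f) * width
        cumulative += f

def grouped_mode(lower_limits, freqs, width):
    """Mode = L + [d1 / (d1 + d2)] * c, using the first class with the highest frequency."""
    i = freqs.index(max(freqs))
    before = freqs[i - 1] if i > 0 else 0
    after = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1, d2 = freqs[i] - before, freqs[i] - after
    return lower_limits[i] + (d1 / (d1 + d2)) * width

# Hypothetical classes 10-14, 15-19, ..., 35-39 (c = 5) with n = 20
limits = [10, 15, 20, 25, 30, 35]
freqs = [3, 4, 6, 3, 2, 2]
med = grouped_median(limits, freqs, 5)          # 20 + ((10 - 7) / 6) * 5
mode = grouped_mode(limits, freqs, 5)           # 20 + (2 / (2 + 3)) * 5
```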
Midhinge
Formula:
(Q1 + Q3) / 2
Midrange
Formula:
(LV+ HV) / 2
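For ungrouped data, the measures of central tendency can be computed with Python's standard `statistics` module, as in this minimal sketch. The six scores are hypothetical.

```python
import statistics

scores = [496, 502, 502, 505, 510, 545]       # hypothetical ungrouped sample

mean = statistics.mean(scores)                # Σx / n
median = statistics.median(scores)            # position (n + 1)/2 = 3.5, so the 3rd and 4th values are averaged
mode = statistics.mode(scores)                # most frequently occurring value
midrange = (min(scores) + max(scores)) / 2    # (LV + HV) / 2
```

The single high score of 545 pulls the mean and midrange above the median and mode, which is why several measures are usually reported together.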

Measures of Relative Location


Q1/Quartile 1
Formula (ungrouped, population or sample):
the value that lies at position (N + 1) / 4 or (n + 1) / 4 of the array

Q3/Quartile 3
Formula (ungrouped, population or sample):
the value that lies at position 3 * (N + 1) / 4 or 3 * (n + 1) / 4 of the array

Deciles
➔ the distribution is divided into 10 parts
Percentiles
➔ the distribution is divided into 100 parts

FIVE NUMBER SUMMARY & BOX-AND-WHISKER PLOT
➔ The five number summary summarizes the data set by identifying the minimum value, Q1,
median, Q3 , maximum value.
➔ The box-and-whisker plot is the graphical display of the five number summary. Comparative boxplots can also be used to display two or more groups and facilitate visual comparisons among the groups.
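A short sketch of the five number summary, using hypothetical scores. It assumes Python's `statistics.quantiles` with `method="exclusive"`, which places the quartiles at positions (n + 1)/4, 2(n + 1)/4, and 3(n + 1)/4 of the ordered data, matching the Q1 and Q3 position formulas.

```python
import statistics

scores = sorted([505, 496, 509, 500, 545, 502, 507])   # hypothetical sample, n = 7

# Quartile positions: (7 + 1)/4 = 2nd, 4th, and 6th values of the array
q1, q2, q3 = statistics.quantiles(scores, n=4, method="exclusive")

five_number_summary = (min(scores), q1, q2, q3, max(scores))
iqr = q3 - q1                                  # interquartile range
midhinge = (q1 + q3) / 2                       # (Q1 + Q3) / 2
```

Other software may use slightly different quartile conventions, so values can differ a little for small samples.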
Measures of spread, dispersion, or variability
Range
Formula:
Highest Value – Lowest Value
Interquartile Range
Formula:
Q3 – Q1
Variance
Formula:
Ungrouped, population: σ^2 = Σ(X − µ)^2 / N
Ungrouped, sample: s^2 = Σ(X − x̄)^2 / (n − 1)
Grouped, population: σ^2 = Σf(X − µ)^2 / N
Grouped, sample: s^2 = Σf(X − x̄)^2 / (n − 1)
Standard deviation
Formula (the square root of the variance):
Ungrouped, population: σ = √[Σ(X − µ)^2 / N]
Ungrouped, sample: s = √[Σ(X − x̄)^2 / (n − 1)]
Grouped, population: σ = √[Σf(X − µ)^2 / N]
Grouped, sample: s = √[Σf(X − x̄)^2 / (n − 1)]
Coefficient of variation
Formula:
Population (ungrouped or grouped): V = σ/µ
Sample (ungrouped or grouped): V = s/x̄
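The variance, standard deviation, and coefficient of variation formulas can be checked against Python's `statistics` module, as sketched below. The eight values are hypothetical, and the variable names are chosen only for this example.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]               # hypothetical ungrouped values, mean = 5

pop_var = statistics.pvariance(data)          # Σ(X − µ)^2 / N
pop_sd = statistics.pstdev(data)              # square root of the population variance
samp_var = statistics.variance(data)          # Σ(X − x̄)^2 / (n − 1)
samp_sd = statistics.stdev(data)              # square root of the sample variance
cv = pop_sd / statistics.mean(data)           # V = σ / µ
```

The sample versions divide by n − 1 instead of N, so they are always a little larger than the population versions for the same data.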

Empirical rule
➔ When data are believed to have a symmetrical or bell-shaped distribution, the Empirical Rule can be used to determine the percentage of data values that lie within a specified number of standard deviations of the mean.
➔ If the skewness is equal to zero, then the basis of interpreting the standard deviation is the Empirical Rule. For the Empirical Rule, get the 1st up to the 3rd std dev values before making the interpretations.
◆ Approximately 68.26% of the data values will be within one std deviation of the mean.
◆ Approximately 95.44% of the data values will be within two std deviations of the mean.
◆ Approximately 99.74% (almost all) of the data values will be within three std deviations of the mean.
Ex:
1st std dev: negative = Mean - Std dev = 497.24; positive = Mean + Std dev = 519.76
Approximately 68.26% of the students have Math scores within 497.24 to 519.76 points, which fall within the 1st std deviation from the mean.

2nd std dev: negative = Mean - (2*std dev) = 485.99; positive = Mean + (2*std dev) = 531.01
Approximately 95.44% of the students have Math scores within 485.99 to 531.01 points, which fall within the 2nd std deviation from the mean.

3rd std dev: negative = Mean - (3*std dev) = 474.73; positive = Mean + (3*std dev) = 542.27
Approximately 99.74% of the students have Math scores within 474.73 to 542.27 points, which fall within the 3rd std deviation from the mean.
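The interval limits in the example can be reproduced with a short sketch. It uses the Math-score mean (508.5) and standard deviation (11.25738929) reported in the interpretation examples of this reviewer; the helper name `sd_interval` is an assumption for illustration.

```python
def sd_interval(mean, sd, k):
    """Return (mean - k*sd, mean + k*sd), rounded to two decimals."""
    return round(mean - k * sd, 2), round(mean + k * sd, 2)

mean, sd = 508.5, 11.25738929                  # Math-score figures quoted in this reviewer
empirical = {k: sd_interval(mean, sd, k) for k in (1, 2, 3)}
shares = {1: 68.26, 2: 95.44, 3: 99.74}        # approximate % of values inside each interval
```

Pairing `empirical[k]` with `shares[k]` gives the same statements as the worked example, such as about 95.44% of scores between 485.99 and 531.01.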

Chebyshev’s theorem
➔ It enables us to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean, for any data set regardless of the shape of the distribution.
➔ If the skewness is not equal to zero, then the basis of interpreting the standard deviation is
Chebyshev's theorem. For Chebyshev's theorem, get the 2nd up to the 4th std dev values before
making the interpretations.
◆ At least 75% of the data must be within two std deviations of the mean.
◆ At least 89% of the data must be within three std deviations of the mean.
◆ At least 94% of the data must be within four std deviations of the mean.
Ex:
2nd std dev: negative = Mean - (2*std dev) = 485.99; positive = Mean + (2*std dev) = 531.01
At least 75% of the students have Math scores between 485.99 and 531.01 points, which fall within the 2nd std deviation from the mean.

3rd std dev: negative = Mean - (3*std dev) = 474.73; positive = Mean + (3*std dev) = 542.27
At least 89% of the students have Math scores between 474.73 and 542.27 points, which fall within the 3rd std deviation from the mean.

4th std dev: negative = Mean - (4*std dev) = 463.47; positive = Mean + (4*std dev) = 553.53
At least 94% of the students have Math scores between 463.47 and 553.53 points, which fall within the 4th std deviation from the mean.
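The 75%, 89%, and 94% figures come from the general Chebyshev bound 1 − 1/k^2 for k standard deviations. The reviewer quotes only the three percentages, so the general formula is stated here as a standard result, and the function name is hypothetical.

```python
def chebyshev_min_share(k):
    """Minimum proportion of values within k std deviations of the mean: 1 - 1/k^2."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

# Minimum percentages for 2, 3, and 4 std deviations, rounded to whole percents
bounds = {k: round(100 * chebyshev_min_share(k)) for k in (2, 3, 4)}
```

Because the bound holds for any distribution shape, it is weaker than the Empirical Rule, which is why the reviewer pairs it with skewed data.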

Determining Outliers
3 approaches to determine outliers:
1. The IQR approach for both symmetrical and asymmetrical data distributions.
➔ Formula for the Lower Limit = Q1-(1.5*IQR) & for the Upper Limit = Q3+(1.5*IQR)
➔ any value beyond the lower and upper limits is considered an Outlier.
2. The Empirical rule for symmetrical data. Any value beyond the 3rd std deviation is considered an
Outlier.
3. Chebyshev's theorem for asymmetrical data. Any value beyond the 4th std deviation is considered
an Outlier.
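The IQR approach (the first of the three) can be sketched as follows. The scores are hypothetical, the function name is invented for the example, and the quartiles use the (n + 1) position convention via `statistics.quantiles(..., method="exclusive")`.

```python
import statistics

def iqr_outliers(values):
    """Return the lower limit, upper limit, and values beyond them (IQR approach)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr                     # Lower Limit = Q1 - (1.5*IQR)
    upper = q3 + 1.5 * iqr                     # Upper Limit = Q3 + (1.5*IQR)
    return lower, upper, [v for v in values if v < lower or v > upper]

# Hypothetical scores: Q1 = 500, Q3 = 509, IQR = 9
lower, upper, outliers = iqr_outliers([496, 500, 502, 505, 507, 509, 545])
```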
Kurtosis
➔ peakedness of the distribution
◆ ku =3 = normal or mesokurtic
◆ ku >3 = Leptokurtic/positive(thin)
◆ ku <3 = Platykurtic/negative (flat)
Skewness
➔ symmetry of the distribution.
➔ If sk = 0, the distribution is symmetrical, as in a bell-shaped normal distribution;
➔ it is asymmetrical if sk > 0 (positively or right-skewed) or if sk < 0 (negatively or left-skewed).
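A rough sketch of how kurtosis and skewness can be computed and classified with the cut-offs above. It assumes the simple moment-based definitions, under which a normal curve has kurtosis 3. Spreadsheet and statistical packages often apply sample adjustments or report excess kurtosis, so their values can differ. The function names and data are hypothetical.

```python
def shape_measures(values):
    """Moment-based skewness and kurtosis (normal curve: skewness 0, kurtosis 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def describe_shape(sk, ku):
    """Label the distribution using the sk and ku cut-offs listed above."""
    skew = "symmetrical" if sk == 0 else "right-skewed" if sk > 0 else "left-skewed"
    peak = "mesokurtic" if ku == 3 else "leptokurtic" if ku > 3 else "platykurtic"
    return skew, peak

sk, ku = shape_measures([1, 2, 3, 4, 5])       # hypothetical symmetric data
labels = describe_shape(sk, ku)
```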

Measures of association or linear relationship


Covariance
➔ It shows the manner of linear association or relationship of two variables whether the variables
are positively or directly related OR negatively or inversely related.
➔ However, the main problem of using Covariance as a measure of strength of the linear
relationship is that the value depends on the units of measurement of the variables
Correlation coefficient
➔ This measures the relationship between two variables that is not affected by the units of
measurement of X and Y.
➔ It can measure both the manner and degree of strength of a relationship.
The degree of strength (based on the absolute value of the coefficient) can be:
➔ none or no relationship (0)
➔ very weak or very low (0.01-0.19)
➔ weak or low (0.2-0.39)
➔ moderate (0.4-0.59)
➔ strong or high (0.6-0.79)
➔ very strong or very high (0.8-0.99)
➔ perfect relationship (1.0)
➔ The values can range from -1 to +1, inclusive of zero.
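A minimal sketch of the sample covariance and correlation coefficient, with the strength bands applied to the absolute value of r. The data and function names are hypothetical, and values falling between two listed bands (for example 0.195) are assigned to the lower band as an assumption.

```python
import math

def covariance_and_correlation(x, y):
    """Sample covariance and Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy, sxy / (sx * sy)                # r has no units, unlike the covariance

def strength(r):
    """Degree of strength from the bands above, applied to |r|."""
    a = abs(r)
    if a == 0:
        return "none"
    if math.isclose(a, 1.0):
        return "perfect"
    for cutoff, label in ((0.2, "very weak"), (0.4, "weak"),
                          (0.6, "moderate"), (0.8, "strong")):
        if a < cutoff:
            return label
    return "very strong"

cov, r = covariance_and_correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

The sign of `cov` or `r` gives the manner of the relationship (direct or inverse), while `strength(r)` gives the degree.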

Interpretation examples:

MATH-SCORE (measured in points): value and interpretation

Mean = 508.5: the students or respondents got an average math score of 508.5 points, or the students or respondents got more or less 508.5 points in their math scores

Standard Error = 1.453322708

Median or Q2 = 505: half or 50% of the students or respondents got a math score of 505 points or less and the remaining half or 50% of the students or respondents got a math score higher than 505 points

Mode = 502: the most common math score among the respondents or students is 502 points

Standard Deviation = 11.25738929

Sample Variance = 126.7288136

Kurtosis = 4.15186183: the data distribution of math scores is leptokurtic because the value of kurtosis is greater than 3.

Skewness = 2.222002828: the data distribution of math scores is asymmetrical and right-skewed or positively-skewed because the value of skewness is more than zero and positive

Range = 49: the difference between the highest and lowest math scores is 49 points

Minimum = 496: the lowest math score is 496 points

Maximum = 545: the highest math score is 545 points

Sum = 30510

Count = 60: there are 60 respondents or students who have math scores, or 60 elements

Confidence Level (95.0%) = 2.908092021

Q1 = 502: one-fourth or 25% of the students or respondents got a math score of 502 points or less and the remaining three-fourths or 75% of the students or respondents got a math score higher than 502 points

Q3 = 509: three-fourths or 75% of the students or respondents got a math score of 509 points or less and the remaining one-fourth or 25% of the students or respondents got a math score higher than 509 points

Midhinge = 505.5

Midrange = 520.5

Interquartile range (IQR) = 7

correlation of Math & Science 0.19 the manner of linear relationship or linear
association between Math and Science scores is
positive or direct since the correlation coefficient
is a positive value but the degree is very weak or
very low since the correlation value falls between
0.01-0.19.
correlation of Math & Language 0.12 the manner of linear relationship or linear
association between Math and Language scores is
positive or direct since the correlation coefficient
is a positive value but the degree is very weak or
very low since the correlation value falls between
0.01-0.19.

covariance of Math & Science 27.18 the manner of linear relationship or linear
association between Math & Science scores is
positive or direct since the sign of the coefficient
of covariance is positive.

covariance of Math & Language 21.78 the manner of linear relationship or linear
association between Math & Language scores is
positive or direct since the sign of the coefficient
of covariance is positive.

coefficient of variation 0.022138425 the data distribution of Math scores is


homogeneous since the coefficient value is less
than 0.5. - (meaning that the Math scores are
relatively the same or similar to each other)
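
Several rows in the table can be derived from the others. The sketch below recomputes them from the mean, standard deviation, count, quartiles, minimum, and maximum given above, using the standard formulas (standard error = s / √n, coefficient of variation = s / mean). The variable names are only illustrative.

```python
from math import sqrt

# Values copied from the MATH-SCORE table.
count = 60
mean_score = 508.5
std_dev = 11.25738929
q1, q3 = 502, 509
lowest, highest = 496, 545

standard_error = std_dev / sqrt(count)  # s / sqrt(n)
sample_variance = std_dev ** 2          # variance is the squared std dev
coeff_variation = std_dev / mean_score  # s / mean
score_range = highest - lowest
iqr = q3 - q1
midhinge = (q1 + q3) / 2                # average of the quartiles
midrange = (lowest + highest) / 2       # average of the extremes
total = mean_score * count              # sum = mean * n

print(standard_error, sample_variance, coeff_variation)
print(score_range, iqr, midhinge, midrange, total)
```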

Note:
➔ To interpret the standard deviation, first look at the value of the skewness to decide which rule applies
➔ When asked to interpret the default std dev, use the 1st std dev rule for the Empirical rule and
the 2nd std dev rule for Chebyshev's theorem.
➔ When asked whether there are outlier values and to identify them, state the rule that is
being used before answering Yes or No and listing the outlier values.
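
As a worked illustration of these notes, the sketch below applies them to the MATH-SCORE summary values. It assumes the usual coverage figures: about 68% of values within 1 standard deviation under the Empirical rule, and at least 1 - 1/k² = 75% within k = 2 standard deviations under Chebyshev's theorem. Variable names are hypothetical.

```python
# Summary values copied from the MATH-SCORE interpretation table.
mean_score = 508.5
std_dev = 11.25738929
skewness = 2.222002828
lowest, highest = 496, 545

def band(k):
    """Return (lower, upper) for mean +/- k standard deviations."""
    return (mean_score - k * std_dev, mean_score + k * std_dev)

# Notes' convention: sk = 0 means symmetrical (Empirical rule),
# anything else means asymmetrical (Chebyshev's theorem).
if skewness == 0:
    rule, k_interpret, k_outlier, share = "Empirical rule", 1, 3, "about 68%"
else:
    rule, k_interpret, k_outlier, share = "Chebyshev's theorem", 2, 4, "at least 75%"

low, high = band(k_interpret)
out_low, out_high = band(k_outlier)
has_outliers = lowest < out_low or highest > out_high

print(f"{rule}: {share} of scores lie between {low:.2f} and {high:.2f}")
print(f"Outlier cutoffs: below {out_low:.2f} or above {out_high:.2f}")
print("Outliers present:", has_outliers)
```

Because the skewness is not zero, Chebyshev's theorem is the rule in use here, and both the minimum (496) and the maximum (545) fall inside the 4 standard deviation cutoffs.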
Unit 4 - Probability distributions

● Discrete
● Continuous

Discrete:
● The variables are countable
● The probability function f(x) provides the probability that the random variable x assumes various
values.
● Types
○ Uniform or Univariate
○ Bivariate
○ Binomial
○ Poisson
○ Hypergeometric

1. Uniform or Univariate
- Distribution of a single variable
- Ex. Develop the probability distribution of the number of televisions per household
2. Bivariate
- About the relationship of two variables
- Provides probabilities of combinations of two variables
- Ex. After analyzing several months of sales the owner of an appliance store produces the
probability distribution of the number of refrigerators and stoves sold daily
- Financial Portfolios - amount of investment and interest rate/rate of return
3. Binomial
- The binomial experiment consists of a fixed number of trials, represented by n
- Each trial has two possible outcomes: success or failure
- The probability of success is p and the probability of failure is 1-p
- The trials are independent, which means that the outcome of one trial does not affect the outcomes
of any other trials
- If properties 2, 3, and 4 are present, then each trial is a Bernoulli process
- Ex. The leading brand of dishwasher detergent has a 30% market share. A sample of 25
dishwasher detergent customers was taken, what is the probability that 10 or fewer customers
chose the leading brand?
4. Poisson
- The number of successes that occur in an interval of time or a region of space
- The number of successes that occur in any interval is independent of the number of successes that
occur in any other interval
- The probability of success in an interval is the same for all equal size intervals
- The probability of more than one success in an interval approaches 0 as the interval becomes
smaller
- Ex. The number of students who seek assistance with their statistics assignments is Poisson distributed
with a mean of 2 per day. What is the probability that no student seeks assistance tomorrow? Find
the probability that 10 students seek assistance in a week.
5. Hypergeometric
- The trials are not independent and the probability of success changes from trial to trial
- Ex. More and more shoppers prefer to do their holiday shopping online. Suppose that in a group of
10 shoppers, 7 prefer to do their holiday shopping online while 3 prefer to do their holiday
shopping in stores. A random sample of 3 of these shoppers is selected for a more in-depth study of how
the economy has impacted their shopping behavior. What is the probability that exactly 2 prefer shopping
online?
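
The binomial, Poisson, and hypergeometric examples above can be worked out with the standard probability formulas. The sketch below is one way to do it in Python. The function names are hypothetical, and the weekly Poisson mean assumes a 7-day week (2 × 7 = 14); use 2 × 5 = 10 if only school days count.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """P(X = k) successes in an interval whose mean number of successes is mu."""
    return exp(-mu) * mu ** k / factorial(k)

def hypergeometric_pmf(k, population, successes, sample):
    """P(X = k) successes when sampling without replacement."""
    return (comb(successes, k) * comb(population - successes, sample - k)
            / comb(population, sample))

# Binomial: 30% market share, 25 customers, 10 or fewer chose the brand.
p_binomial = sum(binomial_pmf(k, 25, 0.30) for k in range(11))

# Poisson: mean of 2 students per day.
p_none_tomorrow = poisson_pmf(0, 2)
p_ten_in_week = poisson_pmf(10, 2 * 7)  # assumption: 7-day week

# Hypergeometric: 10 shoppers, 7 prefer online, sample of 3, exactly 2 online.
p_two_online = hypergeometric_pmf(2, population=10, successes=7, sample=3)

print(round(p_binomial, 4), round(p_none_tomorrow, 4),
      round(p_ten_in_week, 4), round(p_two_online, 4))
```

For the hypergeometric case the arithmetic is small enough to check by hand: C(7,2) × C(3,1) / C(10,3) = 21 × 3 / 120 = 0.525.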

Continuous
● The variables are uncountable
● In fraction or decimal form
● The probability density function f(x) does not provide the probability values directly. Instead,
probabilities are given by areas under the curve or graph of the probability density function f(x).
● Types
○ Uniform
○ Normal
○ Exponential
○ Others

1. Uniform
- Range
- The following requirements apply to a probability density function whose range is 𝑎 ≤ 𝑥 ≤ 𝑏
- 𝑓(𝑥) ≥ 0 for all x between a and b
- The total area under the curve between a and b is equal to 1.0
2. Normal
- The curve is symmetric about its mean, and the random variable ranges between −∞ and +∞
- This is a two-parameter distribution
- The parameters are the mean and the standard deviation
3. Exponential
- This is a one-parameter distribution; the distribution is completely specified once the value of
lambda (λ) is known
- The mean and standard deviation are equal to each other
- Ex. The time between breakdowns of aging machines is known to be exponentially distributed with a mean of 25
hours. The machine has just been repaired. Determine the probability that the next breakdown
occurs more than 50 hours from now.
4. Others
- T distribution
- Chi-square distribution
- F distribution
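
For continuous distributions, probabilities are areas under the density curve. The sketch below works the exponential breakdown example (λ = 1 / mean, P(X > x) = e^(-λx)) and adds two small illustrations for the uniform and normal cases. The function names, the uniform range 0 to 10, and the standard normal example are hypothetical and not from the reviewer.

```python
from math import exp
from statistics import NormalDist

def exponential_tail(x, mean):
    """P(X > x) for an exponential distribution, where lambda = 1 / mean."""
    lam = 1 / mean
    return exp(-lam * x)

def uniform_prob(a, b, c, d):
    """P(c <= X <= d) for a uniform distribution on [a, b], with a <= c <= d <= b."""
    return (d - c) / (b - a)  # area of a rectangle of height 1 / (b - a)

# Exponential example: mean of 25 hours, next breakdown more than 50 hours away.
p_breakdown = exponential_tail(50, mean=25)

# Hypothetical uniform example: X uniform on [0, 10], P(2 <= X <= 5).
p_uniform = uniform_prob(0, 10, 2, 5)

# Hypothetical normal example: area within 1 std dev of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
p_normal = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(round(p_breakdown, 4), round(p_uniform, 4), round(p_normal, 4))
```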

UNIT 5

Inferential statistics - concerned with empirical verification (trying to evaluate or verify if your
hypothesis is true, valid, and correct)
hypothesis - an educated guess or theory (in Filipino, hula)
- has to be validated; true = accept, false = reject

Hypothesis testing
- the general goal of a hypothesis test is to rule out chance (sampling error) as a plausible
explanation for the results from a research study
- it is a technique to infer that what is true of a part is true of the whole (inductive reasoning)
- we should minimize both sampling error and non-sampling error; sampling error emanates from the
incorrect computation of sample size and incorrect application of sampling
methodology/technique

STEPS IN HYPOTHESIS TESTING

STEP #1 DETERMINE THE TYPE OF STATISTICAL TEST TO BE USED


- Determine which statistical test is to be used based on whether it is a test of comparison or a test of
relationship, the number of groupings, and the level of measurement scale
- determining the statistical test is important because using the wrong one gives GIGO (garbage in, garbage out)

Test of comparisons
- the default is the population standard deviation; otherwise, the sample standard deviation or variance will be
given/stated explicitly

Test variable: Mean, population standard deviation is unknown (sample std is given)
- One group: T-test
- Two groups, independent: T-test assuming equal variances, or T-test assuming unequal variances
- Two groups, related measures: Paired T-test
- Three or more groups, independent: Anova single factor or one-way Anova or completely randomized
design; Fisher LSD multiple pairwise comparisons or Tukey-Kramer pairwise comparison; factorial design
or two-way Anova or two-factor Anova with replication
- Three or more groups, related measures: Randomized block design or two-factor Anova without
replication

Test variable: Mean, population standard deviation is known
- One group: Z-test
- Two groups: Z-test; Pooled Z-test

sample size - depends on the groups

● if the population standard deviation is unknown but the sample size is more than 100/200, the Z-test will be
used (when there are many observations, the sample approximates the population)
● if the population standard deviation is known, the Z-test will be used even if the sample size is very
small
● if the sample standard deviation is known and the sample size is more than 100/200, you can use the
Z-test
● the T-test is used when the sample standard deviation/variance is given and the sample size is small (less
than 100)
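
The rules above can be sketched as a small decision helper. The cutoff of 100 is taken from the "100/200" guideline in these notes and is exposed as a parameter (`large_n`) so it can be changed. The test statistic shown is the standard one-sample form, (sample mean - hypothesized mean) / (sd / √n), which the notes do not spell out. The hypothesized mean of 505 is purely hypothetical, and the function names are illustrative.

```python
from math import sqrt

def choose_mean_test(pop_sd_known, n, large_n=100):
    """Pick Z-test or T-test for a mean, following the bullets above."""
    if pop_sd_known:
        return "Z-test"  # known population sd: Z-test even for small n
    if n > large_n:
        return "Z-test"  # large sample approximates the population
    return "T-test"      # sample sd given and small sample

def one_sample_statistic(sample_mean, hypothesized_mean, sd, n):
    """Standard one-sample test statistic: (x-bar - mu0) / (sd / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (sd / sqrt(n))

# MATH-SCORE summary values from Unit 3; the standard of 505 is hypothetical.
test_name = choose_mean_test(pop_sd_known=False, n=60)
statistic = one_sample_statistic(508.5, 505, 11.25738929, 60)

print(test_name, round(statistic, 3))
```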

Mean - Ratio data; the test variables are therefore ratio data
ex: GWA, Income, Prices, Height, etc.
Medians - Interval and Ordinal data

Test of comparison groups


One group - Comparing the variable to a standard
Two Group - Comparing a variable to another variable

Independent
- having different number of observations
Related measures
- Matched/paired (partnered with another group)
- post test and pre test
- Repeated measures

Test variable: Variance
- One group: X^2 (Chi-square)
- Two groups: F-test for variances
- Three or more groups: Levene's test of homogeneity

Test variable: Proportion
- One group: Z-test
- Two groups: Z-test; Pooled Z-test
- Three or more groups: X^2 (Chi-square); Marascuilo pairwise comparison procedure

Test variable: Median
- Two groups, independent: Wilcoxon rank-sum test (Mann-Whitney U test)
- Two groups, related measures: Wilcoxon signed-rank test
- Three or more groups, independent: Kruskal-Wallis test
- Three or more groups, related measures: Friedman test

- Related measures for 2 groups or 3 or more groups means that the groups are either paired or
matched, or that a pre-test and post-test (before-and-after scenario) took place
- multiple pairwise comparisons - statistical procedures that can be used to conduct comparisons
between pairs of groups, such as Fisher's LSD, Tukey-Kramer, or Marascuilo procedures

Test of relationship

Test Variables | Test to be used
Test for independence of 2 categorical variables, assuming that the expected frequency is at least 5 for each category | X^2 (Chi-square)
Goodness of fit test for multinomial probability distribution or normal probability distribution, assuming that the expected frequency is at least 5 for each category | X^2 (Chi-square)
Linear association between two variables that are continuous (when variables are ratio) | Pearson's correlation coefficient
Non-parametric alternative to test the association of two variables (when variables are interval/ordinal) | Spearman's rank correlation coefficient

Parametric – data set is numerical ratio
Non-parametric - data set may be numerical/non-numerical but uses interval/ordinal data

- you are testing the goodness of fit for a multinomial/normal probability distribution, or testing the
parametric linear association between two ratio variables
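
As an illustration of the chi-square test for independence, the sketch below computes the expected frequencies (row total × column total / grand total), the chi-square statistic Σ (observed - expected)² / expected, the degrees of freedom (rows - 1) × (columns - 1), and checks the "expected frequency at least 5" assumption. The function name and the 2 × 2 table are hypothetical. The result still has to be compared with a chi-square critical value or p-value, which this sketch does not do.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom, expected_ok) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected frequency of each cell if the two variables were independent.
    expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

    statistic = sum(
        (obs - exp_) ** 2 / exp_
        for obs_row, exp_row in zip(table, expected)
        for obs, exp_ in zip(obs_row, exp_row)
    )
    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    expected_ok = all(e >= 5 for row in expected for e in row)
    return statistic, degrees_of_freedom, expected_ok

# Hypothetical counts: rows are two groups, columns are two categories.
observed = [[10, 20],
            [20, 10]]
stat, df, ok = chi_square_independence(observed)
print(round(stat, 4), df, ok)
```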
