
MODULE 1: NATURE OF STATISTICS

Introduction

Statistical Thinking will one day be as necessary for efficient citizenship as the ability to read
and write (H. G. Wells).

In 2017, The Economist published a report on one of the most striking changes in the world economy. It
claimed that the world’s most valuable resource is no longer oil, but data. The five biggest tech giants – Google,
Amazon, Apple, Facebook, and Microsoft – had been taking advantage of and profiting from consumer
data. This phenomenon prompted professionals to use statistics and later popularized the concept of
data science.

To date, many companies all over the world are hiring statisticians and data scientists
to further their competitive advantage. Since nearly everything generates data and everyone needs it analyzed, you
need to learn the essential knowledge and skills of statistics. As Florence Nightingale put it,

“Statistics… is the most important science in the whole world: for upon it depends the practical
application of every other science and of every art; the one science essential to all political and social
administration, all education, all organization based upon experience, for it only gives the results of our
experience.”

Lesson 1: History and Development of Statistics

Statistics as a science and an art has undergone a series of developments and refinements through
time all over the world. Experts from different fields such as medicine and health, philosophy,
mathematics, and science contributed to strengthening the foundations of the field of statistics. Some
of these notable developments are highlighted in the Timeline of Statistics designed by Tom Fryer and
in the notes on the history of statistics by Sweetland.

• 450BC Hippias of Elis uses the average length of a king’s reign (the mean) to work out the
date of the first Olympic Games, some 300 years before his time.
• 431BC Attackers besieging Plataea in the Peloponnesian War calculate the height of the wall by
counting the number of bricks. The count was repeated several times by different soldiers, and the
most frequent value (the mode) was taken to be the most likely. Multiplying it by the height of one
brick allowed them to calculate the length of the ladders needed to scale the walls.
• 400BC In the Indian epic Mahabharata, King Rtuparna estimates the number of fruits and leaves
(2095 fruits and 50 000 000 leaves) on two great branches of a vibhitaka tree by counting the
number on a single twig, then multiplying by the number of twigs. The estimate is found to be very
close to the actual number. This is the first recorded example of sampling – “but this knowledge
is kept secret”, says the account.
• 2AD Chinese census under the Han Dynasty finds 57.67 million people in 12.36 million households
– the first census from which data survives, and still considered by scholars to have been accurate.
• 7AD Census by Quirinius, governor of the Roman province of Judea, is mentioned in Luke’s
Gospel as causing Joseph and Mary to travel to Bethlehem to be counted.
• 840 Islamic mathematician Al-Kindi uses frequency analysis – the most common symbol in a
coded message will stand for the most common letter – to break secret codes. Al-Kindi also
introduces Arabic numerals to Europe.
• 10th century The earliest known graph, in a commentary on a book by Cicero, shows the
movements of the planets through the sky. It is apparently intended for use in monastery schools.
• 1086 Domesday Book: survey for William the Conqueror of farms, villages, and livestock in his
new kingdom – the start of official statistics in England.
• 1150 Trial of the Pyx, an annual test of the purity of coins from the Royal Mint, begins. Coins are
drawn at random, in fixed proportions to the number minted. It continues to this day.
• 1188 Gerald of Wales completes the first population census of Wales.
• 1303 A Chinese diagram entitled “The Old Method Chart of the Seven Multiplying Squares”
shows the binomial coefficients up to the eighth power – the numbers that are fundamental to
the mathematics of probability, and that appeared some 350 years later in the west as
Pascal’s triangle.
• 1346 Giovanni Villani’s Nuova Cronica gives statistical information on the population and trade
of Florence.
• 1560 Gerolamo Cardano calculates the probabilities of different dice throws for gamblers.
• 1570 Astronomer Tycho Brahe uses the arithmetic mean to reduce errors in his estimates of the
locations of stars and planets.
• 1644 Michael van Langren draws the first known graph of statistical data that shows the size of
possible errors. It is of different estimates of the distance between Toledo and Rome.
• 1654 Pascal and Fermat correspond about dividing stakes in gambling games and together create
the mathematical theory of probability.
• 1657 Huygens’s On the Reasoning in Games of Chance is the first book on probability to be
published. He also invented the pendulum clock.
• 1663 John Graunt uses parish records to estimate the population of London.
• 1693 Edmund Halley prepares the first mortality tables statistically relating death rates to age –
the foundation of life insurance. He also drew a stylized map of the path of a solar eclipse over
England – one of the first data visualizations.
• 1713 Jacob Bernoulli’s Ars conjectandi derives the law of large numbers – the more often you
repeat an experiment, the more accurately you can predict the result.
• 1728 Voltaire and his mathematician friend de la Condamine spot that a Paris bond lottery is
offering more prize money than the total cost of the tickets; they corner the market and win
themselves a fortune.
• 1749 Gottfried Achenwall coins the word statistics (in German, Statistik); he means the
information you need to run a nation-state.
• 1757 Casanova becomes a trustee of, and may have had a hand in devising, the French national
lottery.
• 1761 The Rev. Thomas Bayes proves Bayes’ theorem – the cornerstone of conditional probability
and the testing of beliefs and theories.
• 1786 William Playfair introduces graphs and bar charts to show economic data.
• 1789 Gilbert White and other clergymen-naturalists keep records of temperatures, dates of first
snowdrops and cuckoos, etc.; the data is later useful for the study of climate change.
• 1790 First US census, taken by men on horseback directed by Thomas Jefferson, counts
3.9 million Americans.
• 1791 First use of the word statistics in English, by Sir John Sinclair in his Statistical Account of
Scotland.
• 1805 Adrien-Marie Legendre introduces the method of least squares for fitting a curve to a given
set of observations.
• 1808 Gauss, with contributions from Laplace, derives the normal distribution – the bell-shaped
curve fundamental to the study of variation and error.
• 1833 The British Association for the Advancement of Science sets up a statistics section. Thomas
Malthus, who analyzed population growth, and Charles Babbage are members. It later becomes
the Royal Statistical Society.
• 1835 Belgian Adolphe Quetelet’s Treatise on Man introduces social science statistics and the
concept of the average man – his height, body mass index, and earnings.
• 1839 The American Statistical Association is formed. Alexander Graham Bell, Andrew Carnegie,
and President Martin Van Buren will become members.
• 1840 William Farr sets up the official system for recording causes of death in England and Wales.
This allows epidemics to be tracked and diseases compared – the start of medical statistics.
• 1849 Charles Babbage designs his difference engine, embodying the ideas of data handling and
the modern computer. Ada Lovelace, Lord Byron’s daughter, writes the world’s first computer
program for it.
• 1854 John Snow’s cholera map pins down the source of an outbreak as a water pump in Broad
Street, London, beginning the modern study of epidemiology.
• 1859 Florence Nightingale uses statistics of Crimean War casualties to influence public opinion
and the War Office. She shows casualties month by month on a circular chart she devises, the
Nightingale rose, a forerunner of the pie chart. She is the first woman member of the Royal Statistical
Society and the first overseas member of the American Statistical Association.
• 1868 Minard’s graphic diagram of Napoleon’s march on Moscow shows on one diagram the
distance covered, the number of men still alive at each kilometer of the march, and the
temperatures they encountered on the way.
• 1877 Francis Galton, Darwin’s cousin, describes regression to the mean. In 1888 he introduces the
concept of correlation. At a guess-the-weight-of-an-ox contest in Devon he describes the wisdom
of crowds – that the average of many uninformed guesses is close to the correct value.
• 1886 Philanthropist Charles Booth begins his survey of the London poor, to produce his poverty
map of London. Areas were colored black, for the poorest, through to yellow for the upper-middle
class and wealthy.
• 1894 Karl Pearson introduces the term standard deviation. If errors are normally distributed, 68%
of samples will lie within one standard deviation of the mean. Later he develops chi-squared tests
for whether two variables are independent of each other.
• 1898 Von Bortkiewicz’s data on deaths of soldiers in the Prussian army from horse kicks shows
that apparently rare events follow a predictable pattern, the Poisson distribution.
• 1900 Louis Bachelier shows that fluctuations in stock market prices behave in the same way as
the random Brownian motion of molecules – the start of financial mathematics.
• 1908 William Sealy Gosset, chief brewer for Guinness in Dublin, describes the t-test. It uses a small
number of samples to ensure that every brew tastes equally good.
• 1911 Herman Hollerith, inventor of punchcard devices used to analyze data in the US census,
merges his company to form what will become IBM, pioneers of machines to handle business
data, and early computers.
• 1916 During the First World War, car designer Frederick Lanchester develops statistical laws to
predict the outcomes of aerial battles: if you double their size, land armies are only twice as
strong, but air forces are four times as strong.
• 1924 Walter Shewhart invents the control chart to aid industrial production and management.
• 1935 George Zipf finds that many phenomena – river lengths, city populations – obey a power
law, so that the largest is twice the size of the second-largest, three times the size of the third,
and so on.
• 1935 R. A. Fisher revolutionizes modern statistics. His Design of Experiments gives ways of
deciding which results of scientific experiments are significant and which are not.
• 1937 Jerzy Neyman introduces confidence intervals in statistical testing. His work leads to modern
scientific sampling.
• 1940-45 Alan Turing at Bletchley Park cracks the German wartime Enigma code, using advanced
Bayesian statistics and Colossus, the first programmable electronic computer.
• 1944 The German tank problem: the Allies desperately need to know how many Panther tanks
they will face in France on D-Day. Statistical analysis of the serial numbers on gearboxes from
captured tanks indicates how many are being produced. Statisticians predict 270 a month;
reports from intelligence sources predict many fewer. The total turned out to be 276. Statistics
had outperformed spies.
• 1948 Claude Shannon introduces information theory and the bit – fundamental to the digital age.
• 1948-53 The Kinsey Report gathers objective data on human sexual behavior. A large-scale survey
of 5000 men and, later, 5000 women causes an outrage.
• 1950 Richard Doll and Bradford Hill establish the link between cigarette smoking and lung cancer.
Despite fierce opposition, the result is conclusively proved, to huge public health benefit.
• 1950s Genichi Taguchi’s statistical methods to improve the quality of automobile and electronics
components revolutionize Japanese industry, which far overtakes its western rivals.
• 1958 The Kaplan-Meier estimator gives doctors a simple statistical way of judging which
treatments work best. It has saved millions of lives.
• 1972 David Cox introduces the proportional hazards model and the concept of partial likelihood.
• 1977 John Tukey introduces the box-plot or box-and-whisker diagram, which shows the quartiles,
median, and spread in a single image.
• 1979 Bradley Efron introduces bootstrapping, a simple way to estimate the distribution of almost
any sample statistic.
• 1982 Edward Tufte self-publishes The Visual Display of Quantitative Information, setting new
standards for the graphic visualization of data.
• 1988 Margaret Thatcher becomes the first world leader to call for action on climate change.
• 1993 The statistical programming language R is released; it is now a standard statistical tool.
• 1997 The term Big Data first appears in print.
• 2002 The amount of information stored digitally surpasses non-digital. Paul DePodesta uses
statistics – sabermetrics – to transform the fortunes of the Oakland Athletics baseball team; the
film Moneyball tells the story.
• 2004 Launch of Significance magazine
• 2008 Hal Varian, chief economist at Google, says that statistics will be the sexy profession of the
next ten years.
• 2012 Nate Silver, statistician, successfully predicts the result in all 50 states in the US Presidential
election. He becomes a media star and sets up what may be an over-reliance on statistical analysis
for the 2016 election. The Large Hadron Collider confirms the existence of a Higgs boson at the
five-standard-deviation level, meaning there is only a tiny probability that the data is a coincidence.

Lesson 2: Basic Concepts of Statistics

Statistics refers to the scientific study that deals with the collection, organization, presentation,
analysis, and interpretation of data.

Two Divisions of Statistics


1. Descriptive Statistics. This refers to the statistical procedures concerned with
describing the characteristics and properties of a group of persons, places, or things.
It covers the presentation, description, and interpretation of data gathered, without
trying to infer anything that goes beyond the data. The most common measures used
to describe data include the measures of central tendency (mean, median, mode),
measures of variation (range, variance, standard deviation, etc.), kurtosis, and
skewness.

Sample Research Questions (Objectives):


1. What is the demographic profile of the respondents? (describe the
demographic profile of the respondents)
2. What characteristics and qualifications do school principals look for in
a potential teacher applicant? (determine the characteristics and qualifications
that principals look for in a potential teacher applicant)
3. Which group of learners has the best performance in the national
achievement test? (identify which group of learners has the best
performance in the national achievement test)
4. How did the graduates of teacher education institutions perform in the
licensure examinations? (assess the licensure examination performance of the
graduates from teacher education institutions)
5. What are the factors that affect the implementation of the school program?
(determine the different factors that affect the implementation of the school
program)
2. Inferential Statistics. This refers to statistical procedures that are used to draw
inferences about a large group of people, places, or things (population) based on the
information obtained from a small portion (sample) taken from the large group. The
most common procedures include tests of difference between and among groups,
tests of relationship and association, and tests of effects.

Sample Research Questions (Objectives):


1. To what degree do NCAE ratings predict freshman college GPA? (ascertain if
freshman college GPA can be predicted by NCAE ratings)
2. To what extent do entry-level qualifications of graduates of teacher education
programs increase the likelihood of developing proficient teachers? (ascertain if
entry- level qualifications of graduates of teacher education programs increase the
likelihood of developing proficient teachers)
3. How do K-3 pupils from different socio-economic status compare in their
reading and mathematics achievement after adjusting for family type? (compare
the reading and mathematics achievement of the K-3 pupils from different socio-
economic status after adjusting for family type)
4. How do male and female learners differ in the national achievement test?
(ascertain if results of the national achievement test differ between sexes)
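The descriptive measures named above (mean, median, mode, range, variance, standard deviation) can be computed with Python's standard `statistics` module. A minimal sketch with invented ratings, purely for illustration:

```python
import statistics

# Hypothetical achievement-test ratings for ten learners (invented data)
ratings = [88, 92, 75, 92, 80, 85, 90, 92, 78, 88]

mean = statistics.mean(ratings)      # measure of central tendency
median = statistics.median(ratings)  # middle value of the sorted ratings
mode = statistics.mode(ratings)      # most frequent rating
rng = max(ratings) - min(ratings)    # measure of variation: range
var = statistics.variance(ratings)   # sample variance
sd = statistics.stdev(ratings)       # sample standard deviation

print(mean, median, mode, rng)
```

Inferential procedures (tests of difference, relationship, and effects) build on these same quantities but require distributional assumptions, which later modules cover.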

Population and Sample


Population refers to a large collection of people, objects, places, or things. Any numerical value
that describes a population is called a parameter. A sample refers to a small portion or subset of the
population. Any numerical value that describes a sample is called a statistic.

Example: The Department of Education, in a press briefing, stated that the average rating of the 1 000 000
high school students all over the country who took the examination is 94%. A division supervisor would
like to study the performance of high school students in the national achievement test from their
schools division. The eighteen thousand high school students from that division had an average rating
of 92%.

Population: All high school students who took the national achievement test
Parameter: N = 1 000 000 high school students, average rating of 94%
Sample: All high school students from the specific schools division who took the national achievement
test
Statistic: n = 18 000 high school students, average rating of 92%
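The parameter/statistic distinction can be simulated in code: treat a large generated list of ratings as the population (its mean is a parameter) and a random subset as the sample (its mean is a statistic). All numbers below are invented for illustration:

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population of 10 000 ratings centered at 94 (its mean is a parameter)
population = [random.gauss(94, 3) for _ in range(10_000)]
parameter_mean = statistics.mean(population)

# A random sample of 500 ratings drawn without replacement (its mean is a statistic)
sample = random.sample(population, 500)
statistic_mean = statistics.mean(sample)

# The statistic estimates the parameter but rarely equals it exactly
print(round(parameter_mean, 2), round(statistic_mean, 2))
```

Running this shows the sample mean close to, but not identical with, the population mean, which is exactly the gap inferential statistics quantifies.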

Variable, Data, and Indicators

A variable is a characteristic or property of a population or sample which makes the members


different from each other. Variables can be classified as follows.
1. Independent Variable. This is the one thing you change; it is the variable that affects
another variable.
2. Dependent Variable. It is the variable being affected by another variable. The change
that happens is due to the influence of the independent variable.
3. Controlled Variable. This is the variable that you want to remain constant and unchanging.
4. Quantitative Variable. This is expressed as a number or can be quantified.
Types of Quantitative Variable
1. Discrete Variable. This variable has a countable number of possible values.
2. Continuous Variable. This variable can take on any value between two
specified values.
5. Qualitative Variable. This is information that can’t be expressed as a number; thus, it
is not quantifiable.

Example: To what degree do NCAE ratings predict college freshman GPA?


Independent Variable: NCAE ratings
Dependent Variable: college freshman GPA
Controlled Variable: Type of examination and test items

Both the NCAE ratings and college freshman GPA are quantitative, continuous variables. The number
of college freshman students is a discrete variable. The profile variables of the college freshman
students, such as program of study, sex, and school last attended, are qualitative variables.

Data are facts or values gathered or observed from the samples or population being
studied. Indicators are data that directly measure the variable being studied. To be able to gather
significant and relevant data, indicators for each variable of interest must be established first. This will
make analysis and interpretation a lot easier and more convenient.

Example: The school principal wants to know if the feeding program implemented among K-3 pupils
for the past 6 months has been successful. The data/indicators that she may look for are the pupils'
weight and height before and after the feeding program.

Example: What is the socio-economic status of pupils and students in different private and public
schools in Batangas?
The variable socio-economic status is broad in scope and data may vary depending on the group of
persons being studied. Data or indicators may include parents’ educational attainment, parents’
occupation, household income, and other household conditions (house ownership,
appliances/gadgets, etc.)
Most research data can be classified into one of the three basic categories.

Category 1: A single group of participants with one score per participant. This type of data often
exists in research studies that are conducted simply to describe individual variables as they exist
naturally. Although several variables may be measured, the intent is to look at them one at a
time; there is no attempt to examine relationships or differences between variables.

Category 2: A single group of participants with two or more variables measured for each participant.
The research study is specifically intended to examine relationships or differences between variables.
However, there is no attempt to control or manipulate the variables.

Category 3: Two or more groups of scores, with each score a measurement of the same variable. This
involves independent-measures and repeated-measures designs.

Levels of Measurement

Measurement levels refer to different types of variables that imply how to analyze them.

1. Nominal. A nominal variable is one whose values don’t have an undisputed order. It may
have two or more exhaustive, non-overlapping categories, but there is no intrinsic
ordering of the categories. Examples: sex, socioeconomic status, civil status, school
division, religious affiliation, mother tongue
2. Ordinal. An ordinal variable holds values that have an undisputed order but no fixed unit
of measurement. It is similar to a nominal variable except that there is a clear ordering
of the categories, although the difference between ranks cannot be stated with
certainty. Examples: rating scales (Likert scales), shoe/shirt sizes, ranking, monthly
income (range)
3. Interval. An interval variable is similar to an ordinal variable except that the intervals are
equally spaced. It has a fixed unit of measurement, but zero is arbitrary and does not
indicate the absence of the quantity. Examples: temperature in Celsius, IQ score, mental
ability ratings
4. Ratio. A ratio variable is an interval variable with a true zero: it has a fixed unit of
measurement, and zero means the absence of the quantity. Examples: weight, height,
age, income
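The level of measurement dictates which summaries are meaningful: the mode for nominal data, the median for ordinal data, and the mean only for interval and ratio data. A small sketch with invented data:

```python
import statistics

# Nominal: unordered categories, so only the mode is meaningful
civil_status = ["single", "married", "single", "widowed", "single"]
most_common = statistics.mode(civil_status)

# Ordinal: ordered Likert responses (1 = strongly disagree ... 5 = strongly agree);
# the median is meaningful, while averaging would assume equal spacing
likert = [1, 3, 4, 2, 5, 4, 4]
middle = statistics.median(likert)

# Ratio: fixed unit and a true zero, so means (and ratios) are meaningful
weights_kg = [45.0, 50.5, 60.0, 55.5]
average = statistics.mean(weights_kg)

print(most_common, middle, average)
```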
Lesson 3: Summation Notation

Summation notation is a convenient and simple way to give a concise expression for the sum of
values of a variable. It is commonly used to express statistical formulas. It involves the following
symbols: the Greek capital letter Σ (sigma), read as “the sum of”; the index of summation i; and the
lower and upper limits of summation.

Example: The summation Σ_{i=1}^{4} x_i expands to the sum of individual terms x_1 + x_2 + x_3 + x_4.
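The expansion and the common summation rules can be checked numerically. A short sketch with made-up values x_1, ..., x_4 and a constant c:

```python
# Hypothetical values x_1, ..., x_4 (invented for illustration)
x = [2, 5, 1, 4]
n = len(x)
c = 3  # a constant

sum_x = sum(x)                       # Sigma x_i = 2 + 5 + 1 + 4 = 12
sum_x_sq = sum(xi ** 2 for xi in x)  # Sigma x_i^2 = 4 + 25 + 1 + 16 = 46
sq_of_sum = sum(x) ** 2              # (Sigma x_i)^2 = 144 (note: not equal to Sigma x_i^2)
sum_const = sum(c for _ in x)        # Sigma c = n * c = 4 * 3 = 12

print(sum_x, sum_x_sq, sq_of_sum, sum_const)
```

The third line illustrates a frequent student error: the square of a sum is not the sum of squares.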


MODULE 2: DATA COLLECTION AND SAMPLING DESIGN

Introduction

After identifying your research problem, the next step is to collect appropriate and relevant
data. Data collection is crucial to the success of any investigation or study. If the investigator is not
able to collect enough relevant data, the findings and results of the study will be affected; thus,
conclusions, generalizations, or implications derived from the available data may not be reliable or valid.
Becoming an expert in data collection methods and techniques requires time and effort.
Guidance from an experienced researcher or statistician may help you in working out your data
collection and sampling design.

Lesson 1: Sources of Data and Data Collection Methods

Data collection is a methodical process of gathering and analyzing specific information to give
solutions to relevant research questions.

Characteristics of Good Data


Ortega (2017) outlines seven (7) characteristics that define quality data.

1. Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot have any
erroneous elements and must convey the correct message without being misleading. Accuracy
and precision have a component that relates to the data’s intended use. Without understanding
how the data will be consumed, ensuring accuracy and precision could be off-target or more costly
than necessary. For example, accuracy in healthcare might be more important than in another
industry (which is to say, inaccurate data in healthcare could have more serious consequences)
and, therefore, justifiably worth higher levels of investment.
2. Legitimacy and Validity: Requirements governing data set the boundaries of this characteristic.
For example, on surveys, items such as gender, ethnicity, and nationality are typically limited to a
set of options, and open answers are not allowed. Any answers other than these would not be
considered valid or legitimate based on the survey’s requirements. This is the case for most data
and must be carefully considered when determining its quality. The people in each department of
an organization understand what data is valid to them, so the requirements must be leveraged
when evaluating data quality.
3. Reliability and Consistency: Many systems in today’s environments use and/or collect the same
source data. Regardless of what source collected the data or where it resides, it cannot contradict
a value residing in a different source or collected by a different system. There must be a stable and
steady mechanism that collects and stores the data without contradiction or unwarranted variance.
4. Timeliness and Relevance: There must be a valid reason to collect the data to justify the effort
required, which also means it has to be collected at the right moment in time. Data collected too
soon or too late could misrepresent a situation and drive inaccurate decisions.
5. Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate data. Gaps
in data collection lead to a partial view of the overall picture. Without a complete picture of how
operations are running, uninformed actions will occur. It’s important to understand the complete
set of requirements that constitute a comprehensive set of data to determine whether or not the
requirements are being met.
6. Availability and Accessibility: This characteristic can be tricky at times due to legal and regulatory
constraints. Regardless of the challenge, though, individuals need the right level of access to the
data to perform their jobs. This presumes that the data exists and is available for access to be granted.
7. Granularity and Uniqueness: The level of detail at which data is collected is important because
confusion and inaccurate decisions can otherwise occur. Aggregated, summarized, and
manipulated collections of data could offer a different meaning than the data implied at a lower
level of detail. An appropriate level of granularity must be defined to provide sufficient uniqueness
and allow distinctive properties to become visible. This is a requirement for operations to function
effectively.

Types of Data
1. Primary Data. These are data collected by the investigators themselves for a specific
purpose. For instance, the data collected by an investigator for their own research
project is an example of primary data.
2. Secondary Data. These are data collected by someone else for some other purpose but
utilized by the current investigator for a new purpose. For instance, census data used
to analyze the impact of education on career choice and earnings is an example of
secondary data.

Data Collection Tools and Instruments (Bhat, 2020)

1. Interview Method. The interviews conducted to collect quantitative data are more structured,
wherein the researcher asks only a standard set of questions and nothing more. There are three
major types of interviews conducted for data collection.
• Telephone interviews: For years, telephone interviews ruled the charts of data collection
methods. However, nowadays, there is a significant rise in conducting video interviews using the
internet, Skype, or similar online video calling platforms.
• Face-to-face interviews: It is a proven technique to collect data directly from the participants. It
helps in acquiring quality data as it provides scope to ask detailed questions and probe further to
collect rich and informative data. Literacy requirements of the participant are irrelevant, as face-
to-face interviews offer ample opportunities to collect non-verbal data through observation or to
explore complex and unknown issues. Although it can be an expensive and time-consuming
method, the response rates for face-to-face interviews are often higher.
• Computer-Assisted Personal Interviewing (CAPI): It is a setup similar to the face-to-face
interview, except that the interviewer carries a desktop or laptop at the time of the interview to
upload the data obtained directly into a database. CAPI saves a lot of time in updating and
processing the data and also makes the entire process paperless, as the interviewer does not
carry a bunch of papers and questionnaires.

2. Survey or Questionnaire Method. Checklist and rating-scale questions make up the bulk of
quantitative surveys, as they help in simplifying and quantifying the attitudes or behaviors of the
respondents.
• Web-based questionnaire: This is one of the ruling and most trusted methods for internet-based
or online research. In a web-based questionnaire, respondents receive an email containing the
survey link; clicking on it takes the respondent to a secure online survey tool from which he/she
can take the survey or fill in the survey questionnaire.
• Mail Questionnaire: In a mail questionnaire, the survey is mailed out to a host of the sample
population, enabling the researcher to connect with a wide range of audiences. The mail
questionnaire typically consists of a packet containing a cover sheet that introduces the
respondent to the type of research and the reason it is being conducted, along with a prepaid
return envelope to collect the data.
3. Observation Method. In this method, researchers collect quantitative data through systematic
observations, using techniques like counting the number of people present at a specific event at
a particular time and venue, or the number of people attending an event in a designated place.
Structured observation is used to collect quantitative rather than qualitative data.
• Structured observation: In this type of observation method, the researcher makes careful
observations of one or more specific behaviors in a more comprehensive or structured setting
compared to naturalistic or participant observation. In a structured observation, the researchers,
rather than observing everything, focus only on very specific behaviors of interest. This allows
them to quantify the behaviors they are observing. When the observations require a judgment on
the part of the observers, the process is often described as coding, which requires clearly defining
a set of target behaviors.
4. Documents and Records. Document review is a process used to collect data by reviewing existing
documents. It is an efficient and effective way of gathering data, as documents are manageable
and a practical resource for obtaining qualified data from the past. Three primary document types
are analyzed for collecting supporting quantitative research data.
• Public Records: Under this document review, official, ongoing records of an organization are
analyzed for further research. For example, annual reports, policy manuals, student activities,
game activities in the university, etc.
• Personal Documents: In contrast to public documents, this type of document review deals with
individual accounts of people’s actions, behavior, health, physique, etc. For example, the height
and weight of students, the distance students travel to attend school, etc.
• Physical Evidence: Physical evidence or physical documents deal with previous achievements of
an individual or of an organization in terms of monetary and scalable growth.
Lesson 2: Sampling Design
Sampling is a statistical procedure that is concerned with the selection of individual observations.
It allows us to make statistical inferences about the population.

Approaches to Determine the Sample Size


1. Using a census for a small population (N ≤ 200). This eliminates sampling error and provides data
on all the members or elements of the population.
2. Using the sample size of a similar study. The disadvantage of using the same method used by
other research is the possibility of repeating the same errors that were made in determining the
sample size for that study.
3. Using published tables. (research-advisors.com/tools/SampleSize.htm)
4. Using a formula
a) http://www.raosoft.com/samplesize.html
b) https://www.surveymonkey.com/mp/sample-size-calculator/
c) http://sphweb.bumc.bu.edu/otlt/mph- modules/bs/bs704_power/BS704_Power_print.ht
ml
In using a formula to compute the sample size, the basic information needed is as follows.
a) Margin of error. It is the amount of error that you can tolerate. If 90% of respondents answer yes, while 10% answer no, you may be able to tolerate a larger amount of error than if the respondents are split 50-50 or 45-55. A lower margin of error requires a larger sample size.
b) Confidence level. It is the amount of uncertainty you can tolerate. Suppose that you have 20 yes-no questions in your survey. With a confidence level of 95%, you would expect that for one of the questions (1 in 20), the percentage of people who answer yes would be more than the margin of error away from the true answer. The true answer is the percentage you would get if you exhaustively interviewed everyone. A higher confidence level requires a larger sample size.
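As a sketch of the formula approach, one widely used sample-size formula is Slovin's formula, n = N / (1 + Ne²); the module does not name a specific formula, so treat this choice as an illustrative assumption rather than the module's prescribed method:

```python
import math

def slovin(N, e):
    # Slovin's formula: n = N / (1 + N * e^2), rounded up to a whole respondent
    return math.ceil(N / (1 + N * e ** 2))

print(slovin(1000, 0.05))  # 286 respondents for N = 1000 at a 5% margin of error
```

Note how the required sample grows as the tolerated margin of error shrinks, matching the statement above.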

Sampling Techniques

1. Probability Sampling. It is a sampling technique wherein the members of the population are given
an (almost) equal chance to be included as a sample.
• Simple Random Sampling. All members of the population have an equal chance of being included in the sample.
Example: lottery method, random numbers
• Systematic Random Sampling (with a random start). It selects every kth member of the population, with the starting point determined at random. Example: selecting every 5th member of N = 1000 to get a sample of about 200. For instance, starting at the 7th member, we take the 12th, 17th, 22nd, and so on.
• Stratified Random Sampling. This is used when the population can be divided into several smaller non-overlapping groups (strata); the sample is then randomly selected from each group.
• Cluster Sampling. Also called area sampling, in which groups or clusters, instead of individuals, are randomly selected as the sample.
• Multi-stage Sampling. If the population is too big, two or more sampling techniques may be used until the desired sample is obtained.
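The systematic random sampling example above (every 5th member of N = 1000, starting at the 7th member) can be sketched in Python; the `systematic_sample` helper is a name introduced here for illustration:

```python
import random

def systematic_sample(population, k, start=None):
    # Take every k-th member, beginning at a randomly chosen start
    # among the first k positions (a "random start").
    if start is None:
        start = random.randrange(k)
    return population[start::k]

members = list(range(1, 1001))                    # N = 1000 members labelled 1..1000
sample = systematic_sample(members, 5, start=6)   # start at the 7th member (index 6)
print(sample[:4])  # [7, 12, 17, 22]
```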

2. Non-probability Sampling. It is a sampling technique wherein the sample is determined by set criteria, purpose, or personal judgment.
1. Purposive or Judgment Sampling. The sample is selected based on predetermined criteria set by the researcher. Example: to determine the difficulties encountered by students in the 2017 national achievement test, only the Grade 6 pupils of the said school will be included in the sample.
2. Convenience or Accidental Sampling. It relies on data collected from population members who are conveniently available to participate in the study. Facebook polls or questions are a popular example of convenience sampling.
3. Quota Sampling. It is a non-probability sampling technique in which researchers look for a specific characteristic in their respondents and then take a tailored sample that is in proportion to the population of interest.
4. Snowball Sampling. The samples are determined by referrals made by previous members of the sample.
MODULE 3: DATA PRESENTATION AND VISUALIZATION

Introduction

Data visualization is a graphical representation of information and data. The different data
visualization tools provide an accessible way to see and understand trends, outliers, and patterns in
data. Being another form of visual art, data visualization grabs the interest and attention of the
audience on the message. It helps to tell the important stories by curating data into a form easier to
understand, highlighting the most important aspect of the data set. However, data presentation and
visualization are not as simple as creating graphs and tables. Effective presentation and visualization
of data involve a balance between form (aesthetics) and function.

Lesson 1: Graphical Presentation of Data

A statistical graph (or chart) is a tool that helps readers to understand the characteristics of a
distribution of sample or a population. Effective data presentation follows the following principles.

Five Essential Elements of Data Visualization (Data Craze, 2020)

1. Consistent Style and Colors. Carefully choose a style and maintain it across your visualizations. Remember that the true meaning and value of data lie not just in the numbers themselves but also in how they are presented.
2. Select the Right Visualization. A bar or pie chart is not the only visualization method in your arsenal. Adjust what you want to present based on the purpose and type of data you have.
3. Less is More. Focus on the quality of what you want to present. An excessive number of charts or indicators is distracting. The less information there is to analyze, the faster the message comes across.
4. Effective Visualization. The difference between effective and impressive visualization can be huge. The data presented in the application should first and foremost give value – an effect in the form of specific insights.
5. Data Quality. The trust of users is difficult to build but easy to lose. Unexpected information is desirable; errors are not. Try to detect errors at an early stage.

What Graphs Should You Use?

Data should be matched appropriately to the right information visualization. The following are
some of the most common graphs used to present data (Klipfolio, Inc., 2020).

1. Bar Graph. It organizes data into rectangular bars that make it convenient to compare
related data
When to Use: to compare two or more values in the same category; to compare parts of a whole when you do not have too many groups (fewer than 10); and to relate multiple similar data sets. When Not to Use: the category you are visualizing has only one value associated with it, or the data is continuous.
Design Best Practice: Use consistent colors and labelling throughout for identifying
relationships more easily. Simplify the length of the y-axis labels and don’t forget to start from
0.

2. Line Chart. It organizes data so readers can rapidly scan information to understand trends.


When to Use: to understand trends, patterns, and fluctuations; to compare different yet
related data sets with multiple series; and to make projections beyond your data
When Not to Use: to demonstrate an in-depth view of your data
Design Best Practice: Use different colors for each category you are comparing. Use solid lines
to keep the line chart clear and concise. Try not to compare more than four categories in one
line chart.
3. Scatter Plot. It organizes many different data points to highlight similarities in the given
data set. It is useful when looking for outliers and identifying correlation between two
variables.
When to Use: to show the relationship between variables and to have a compact data visualization
When Not to Use: to rapidly scan information or when you need clear and precise data points
Design Best Practice: Use 1 or 2 trend lines to avoid confusion. Start the y-axis at 0.

4. Histogram. It shows the distribution of data over a continuous interval or a certain period. It gives an estimate of where values are concentrated, what the extremes are, and whether there are any gaps or unusual values throughout the data set.
When to Use: To make comparison in data sets over an interval or time and to show a
distribution of data
When Not to Use: to compare three or more variables in data sets
Design Best Practice: Avoid bars that are too wide that can hide important details or too
narrow that can cause a lot of noise. Use equal round numbers to create bar sizes. Use
consistent colors and labelling throughout.
5. Box Plot. Also known as a box and whisker diagram, it is a visual representation of a distribution of data, usually across groups, based on a five-number summary: minimum, first quartile, median, third quartile, and maximum. It also shows the outliers.
When to Use: To display or compare a distribution of data and identify the minimum, maximum
and median of data.
When Not to Use: to visualize individual, unconnected data sets
Design Best Practice: Ensure font sizes for labels and legends are big enough and line widths
are thick enough. Use different symbols, line styles or colors to differentiate multiple data sets.
Remove unnecessary clutter from the plots.
Other useful graphs and charts, with their description, use, and other important features may be found
at The Data Visualization Catalogue via datavizcatalogue.com

Here are some tips on improving your charts and graphs (Visme, 2020).
1. Our eyes do not follow a specific order, so you need to create that order. Create a visualization that deliberately takes viewers on a predefined visual path.
2. Our eyes first focus on what stands out, so be intentional with your focal point. Create charts and graphs with one clear message that can be effortlessly understood.
3. Our eyes can only handle a few things at once, so do not overcrowd your design. Simplify your charts so that they highlight the one main point you want your audience to see.
4. Our brains are designed to immediately look for connections and try to find meaning in the data. Assign colors deliberately to improve the functionality of your charts.
5. We are guided by cultural conventions, so design with your audience's conventions in mind.

Lesson 2: Tabular Presentation of Data

Almost all research and technical reports use tables to present data. Tabular presentation of
data is a systematic and logical arrangement of data into rows and columns with respect to the
characteristics of data.

Components of Tables

1. Table Number and Title. It is included for easy reference and identification. It should indicate the nature of the information included in the table.
2. Stub (Row Labels). It is placed on the left side of the tabular form, indicating the specific items described in the rows.
3. Captions (Column Headings). These are placed at the top of the columns of a table to explain the figures in the body.
4. Body. The most important part of the table, which comprises the numerical contents and reveals the whole story of the investigated facts.
5. Footnote. It provides further explanation that may be needed for any item included in the table.
6. Source note. It is placed at the bottom of the table to indicate the sources of data.
Tabular Presentation of Nominal and Ordinal Data

Nominal or ordinal data are presented using a frequency table or frequency distribution table.
The table displays frequency count and percentages for each value of a variable.

Example: Suppose your research objective is to determine the profile of the respondents. The data may
be presented as follows.
A contingency table or crosstabulation can also be used to display the relationship between categorical variables. This type of presentation allows us to examine a hypothesis regarding the independence or dependence between variables.

Example: Suppose your research objective is to determine the profile of the respondents. The data may
be presented in crosstabulation as follows.
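Since the module's crosstabulation table is not reproduced here, the following sketch uses hypothetical sex-by-program data to show how a contingency table can be tallied with Python's standard library:

```python
from collections import Counter

# Hypothetical respondent profiles; the module's own data table is not shown
respondents = [("Male", "BSA"), ("Female", "BSA"), ("Female", "BSIT"),
               ("Male", "BSIT"), ("Female", "BSA"), ("Female", "BSIT")]

crosstab = Counter(respondents)  # one count per (sex, program) cell
for (sex, program), count in sorted(crosstab.items()):
    print(f"{sex:6s} | {program:4s} | {count}")
```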
Tabular Presentation of Interval and Ratio Data

The data on the interval or ratio scale are organized using a frequency distribution table.
These are the steps in constructing a frequency distribution table.

1. Determine the number of class intervals k = 1 + 3.322 log N, the range R = maximum value – minimum value, and the class size c = R/k.
2. Construct the class intervals based on the class size. The first and last class intervals should contain the minimum and maximum values, respectively. It is advisable to start the first class interval with the minimum value.
3. Arrange the data in either ascending or descending order. Then tally the scores based on the class intervals in step 2.
4. Add columns for class boundaries, class mark or class midpoint, relative frequency, and cumulative frequencies.

The class interval contains the lower (L) and upper limits (U). (e g. In the class interval 46
– 65, the lower limit is 46 and the upper limit is 65)

The class mark or class midpoint (X) is the value in the middle of the class interval. (e.g. In the class interval 46 – 65, the class mark is 55.5; that is, (46 + 65)/2 = 55.5.)

The class boundaries are the true class limits of the class intervals. It is halfway below the lower
limit and halfway above the upper limit. (e. g. In the class interval 46 – 65, the class boundary
is 44.5 – 65.5)

The relative frequency (also known as percentage frequency) is computed using the formula %f = (f/n) × 100, where f is the frequency of the class interval and n is the total of the frequencies.

The less than cumulative frequency (<cf) and greater than cumulative frequency (>cf) are
obtained by adding the frequencies from top to bottom and from bottom to top, respectively.

Example: Using the scores of 50 students in a 55-item Mathematics test, construct a frequency
distribution table.
43 30 35 37 42 19 26 48 34 15
35 18 46 41 27 18 13 40 29 14

40 17 10 21 28 13 14 39 30 5

19 50 36 20 31 28 48 32 20 38

25 12 33 31 28 16 40 32 26 35

Solution:

Step 1: Determine the number of class intervals, the range, and the class size.

k = 1 + 3.322 log N = 1 + 3.322 log 50 = 6.643978 ≈ 7
R = maximum – minimum = 50 – 5 = 45
c = R/k = 45/7 = 6.43 ≈ 7

Step 2: Construct the class intervals based on the class size.


Since our minimum value is 5 and the class size is 7, the first class interval is 5 – 11. Note that this class interval contains 7 values: 5, 6, 7, 8, 9, 10, 11.
To construct the succeeding intervals, add the class size to the lower and upper limits.

Class
Intervals
5 – 11
12 – 18
19 – 25
26 – 32
33 – 39
40 – 46
47 – 53

Step 3: Arrange the data in either ascending or descending order. Then tally the scores based on the class intervals in step 2.
5 14 18 20 27 30 32 35 40 43

10 14 18 21 28 30 33 36 40 46

12 15 19 25 28 31 34 37 40 48

13 16 19 26 28 31 35 38 41 48

13 17 20 26 29 32 35 39 42 50

This data set can be organized or sorted using a stem-and-leaf plot. A stem-and-leaf plot is a special table where each data value is split into a stem (the first digit or digits) and a leaf (the last digit).

Stem Leaf

0 5

1 0233445678899

2 00156678889

3 001122345556789

4 000123688

5 0
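The stem-and-leaf plot above can be reproduced from the raw scores with a short Python sketch:

```python
from collections import defaultdict

scores = [43, 30, 35, 37, 42, 19, 26, 48, 34, 15, 35, 18, 46, 41, 27, 18, 13,
          40, 29, 14, 40, 17, 10, 21, 28, 13, 14, 39, 30, 5, 19, 50, 36, 20,
          31, 28, 48, 32, 20, 38, 25, 12, 33, 31, 28, 16, 40, 32, 26, 35]

leaves = defaultdict(list)
for s in sorted(scores):
    leaves[s // 10].append(s % 10)   # stem = tens digit, leaf = ones digit

for stem in sorted(leaves):
    print(stem, "".join(str(leaf) for leaf in leaves[stem]))
```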

Step 4. Add columns for class boundaries, class mark or class midpoint, relative frequency, and
cumulative frequencies.
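Steps 1 to 4 can be sketched end to end in Python for the same 50 scores; each row below holds the interval, frequency, class mark, relative frequency, and less than cumulative frequency:

```python
scores = [43, 30, 35, 37, 42, 19, 26, 48, 34, 15, 35, 18, 46, 41, 27, 18, 13,
          40, 29, 14, 40, 17, 10, 21, 28, 13, 14, 39, 30, 5, 19, 50, 36, 20,
          31, 28, 48, 32, 20, 38, 25, 12, 33, 31, 28, 16, 40, 32, 26, 35]

k, c = 7, 7                        # number of classes and class size from step 1
low, n = min(scores), len(scores)

table, cum = [], 0
for i in range(k):
    L = low + i * c                # lower class limit
    U = L + c - 1                  # upper class limit
    f = sum(L <= s <= U for s in scores)
    cum += f                       # running <cf (less than cumulative frequency)
    table.append((f"{L}-{U}", f, (L + U) / 2, round(100 * f / n, 1), cum))

for row in table:
    print(row)
```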
10/15/21, 5:06 PM STAT.APP(BSA3-4_ 7:00-10:00 SATURDAY) - https://ubian.ub.edu.ph/student_lesson/show/2750742?from=%2Fstudent_lesson…

STAT.APP(BSA3-4_ 7:00-10:00 SATURDAY)


MODULE 4 : DESCRIPTIVE STATISTICS



Lesson 1: Measures of Central Tendency
The measure of central tendency or average is a value that best characterizes or describes a set of data.

A. MEAN. It is the balance point of a data set and the most reliable, most sensitive measure of average. It is always unique and is affected by extreme values. It is used when data are interval or ratio and the data are normally distributed.
For ungrouped data, the mean is computed as x̄ = Σx / n, the sum of the values divided by the number of values.

Example: The following are the scores of 10 students in a 50-item Qualifying Examination.
45 44 42 40 45 48 49 50 50 47

The mean score of 10 students in a 50-item Qualifying Examination is 46. In general, the group of students did well in the examination.
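A quick check of this result with Python's standard library:

```python
from statistics import mean

scores = [45, 44, 42, 40, 45, 48, 49, 50, 50, 47]
print(mean(scores))  # 46
```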

Example: A high school teacher conducts a semester evaluation of the Math textbook used in Calculus. The following are data collected among the 40 students.


Table 1 shows the textbook evaluation results for Calculus. The students strongly agreed that the Calculus book is organized appropriately for the users (M
= 3.63). The students also agreed that the Calculus book has relevant content (M = 3.38), developmentally appropriate supplementary activities (M = 3.25), and
other features that promote higher order thinking skills (M = 3.08).

For grouped data (data arranged in a frequency distribution table), we compute the mean using x̄ = Σfx / n, where f is the class frequency and x is the class mark; an alternative is the assumed mean formula illustrated below.

Example: Compute and interpret the mean of the data set presented in the following frequency distribution table.

Alternative Solution: Assumed Mean Formula
Choose any class mark and designate it as the assumed mean A, with deviation d = 0. In the deviations column d, assign consecutive integers to the rest of the class marks (negative for class marks lower than A, positive for class marks higher than A). Compute fd, the product of the frequencies and deviations, and get the total of the fd column. The mean is then x̄ = A + (Σfd / n) × c.
For this problem, let A = 29.

The mean score of 50 students who took the 55-item Mathematics Test is 29. The mean

score suggests that the students did not perform satisfactorily in the mathematics test.
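As a sketch, the grouped mean x̄ = Σfx / n can be verified from the class marks and frequencies of the module's frequency distribution table:

```python
# Class marks and frequencies from the module's frequency distribution table
midpoints   = [8, 15, 22, 29, 36, 43, 50]
frequencies = [2, 10, 6, 13, 9, 7, 3]

n = sum(frequencies)                                         # 50 students
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
print(grouped_mean)  # 29.0
```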

B. MEDIAN. It is the middle value in an ordered set of data; hence, it is not affected by outliers and is always unique. It is used when data are ordinal, when there are a few extreme values in the data set, when some values are missing or underestimated, or when the distribution is open-ended.
For the ungrouped data, the median can be determined as follows:

1. Arrange the data in ascending (or descending) order.


2. If n is odd, the middle entry is the median. If n is even, get the average of the two middle numbers.

Example: Consider the following data set. Compute the median values for each group.

Group A 22 33 21 18 19 15 16 18 16

Group B 15 15 15 16 17 20 21 20 20 28 25 27

Solution: Arrange the data in ascending order.

Group A 15 16 16 18 18 19 21 22 33 (n is odd, so the median is the middle value, 18)

Group B 15 15 15 16 17 20 20 20 21 25 27 28 (n is even, so the median is (20 + 20)/2 = 20)
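These medians can be verified with Python's standard library:

```python
from statistics import median

group_a = [22, 33, 21, 18, 19, 15, 16, 18, 16]
group_b = [15, 15, 15, 16, 17, 20, 21, 20, 20, 28, 25, 27]

print(median(group_a))  # 18   (middle of 9 sorted values)
print(median(group_b))  # 20.0 (average of the two middle values)
```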


Example: Compute and interpret the median of the data set presented in the following frequency distribution table.

The median, equivalent to 29.27, means that 50% of the students have scores less than 29.27 or 50% have scores greater than 29.27.
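A sketch of the grouped-data median, using the standard formula median = L + ((n/2 − cf)/f) × c with the boundaries and frequencies from the module's table (the formula itself is not reproduced in the text above, so it is restated here):

```python
# Lower class boundaries, frequencies, and class size from the module's table
boundaries  = [4.5, 11.5, 18.5, 25.5, 32.5, 39.5, 46.5]
frequencies = [2, 10, 6, 13, 9, 7, 3]
c = 7
n = sum(frequencies)

cum = 0
for L, f in zip(boundaries, frequencies):
    if cum + f >= n / 2:                       # median class: <cf first reaches n/2
        grouped_median = L + (n / 2 - cum) / f * c
        break
    cum += f

print(round(grouped_median, 2))  # 29.27
```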

C. MODE. It is the most frequently occurring value in the data set; hence, it is not always unique. That is, a data set may have no mode, one mode, or multiple modes. For this reason, it is the most unreliable among the three averages. It is used when data are on a nominal scale.

For the ungrouped data, the mode can be determined by inspection.

Example: Consider the following data set. Determine the mode for each group.

Group A 22 33 21 18 19 15 16 18 16

Group B 15 15 15 16 17 20 21 20 20 28 25 27

Solution: By inspection, the modes of Group A are 16 and 18; hence, it is a bimodal distribution. The modes of Group B are 15 and 20, which is also a bimodal distribution.
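Python's `statistics.multimode` returns every value tied for the highest frequency, which makes the bimodal result easy to check:

```python
from statistics import multimode

group_a = [22, 33, 21, 18, 19, 15, 16, 18, 16]
group_b = [15, 15, 15, 16, 17, 20, 21, 20, 20, 28, 25, 27]

print(sorted(multimode(group_a)))  # [16, 18]
print(sorted(multimode(group_b)))  # [15, 20]
```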
For grouped data, we can compute the mode using the formula Mode = L + (d1/(d1 + d2)) × c, where L is the lower boundary of the modal class, d1 and d2 are the differences between the modal class frequency and the frequencies of the classes before and after it, and c is the class size.

Example: Compute and interpret the mode of the data set presented in the following frequency distribution table.


The mode, approximately equal to 30, means that the scores of the students who took the math test are concentrated around 30.

Watch the following videos on measures of central tendency to understand more about these concepts.
Khan Academy. (14 November 2011). Finding mean, median, and mode [Video clip]. Retrieved 25 July 2020 from

The Organic Chemistry Tutor. (26 January 2019). Mean, median, and mode of grouped data & frequency distribution tables statistics [Video clip]. Retrieved 25
July 2020 from
Emmanuel, E. (11 February 2019). Mean, Median, and Mode (grouped data) [Video clip].

Retrieved 25 July 2020 from




Lesson 2: Measures of Position


The three commonly used measures of position (also known as quantiles) are quartiles, deciles, and percentiles, which divide the distribution into 4, 10, and 100 equal parts, respectively.
To determine the different measures of position for ungrouped data we will use the following procedures.
(Note: The procedures outlined in the succeeding discussion are based on MS Excel calculations.)

Quartiles

Method 1: Conventional Method

1. Arrange the values in ascending (or descending) order.


2. Determine the median and separate the lower and upper 50%.
3. To determine the first quartile, determine the “median” of the lower 50%.
4. To determine the third quartile, determine the “median” of the upper 50%.

Example: Determine the first and third quartile of the following data sets.

Group A: 85, 88, 87, 89, 86, 86, 85, 87, 89, 88
Group B: 94, 80, 79, 88, 96, 86, 83, 81, 85, 99, 92

Solution
For Group A, we arrange the values in an array: 85, 85, 86, 86, 87, 87, 88, 88, 89, 89

The median is (87 + 87)/2 = 87.

The lower 50% includes 85, 85, 86, 86, 87. The middle value, 86, is the first quartile.

The upper 50% includes 87, 88, 88, 89, 89. The middle value, 88, is the third quartile.

For Group B, we arrange the values in an array: 79, 80, 81, 83, 85, 86, 88, 92, 94, 96, 99 The median is 86.
The lower 50% includes 79, 80, 81, 83, 85. The middle value, 81, is the first quartile.
The upper 50% includes 88, 92, 94, 96, 99. The middle value, 94, is the third quartile.

Method 2: Using a Formula

For the first quartile:
1. Compute k = (n + 1)/4.
2. If k is an integer, the first quartile is the kth value in the array, x(k). If k is not an integer, the first quartile is x(⌊k⌋) + [x(⌊k⌋ + 1) − x(⌊k⌋)] × (fractional part of k), where ⌊k⌋ is k rounded down.

For the third quartile:
1. Compute t = 3(n + 1)/4.
2. If t is an integer, the third quartile is x(t). If t is not an integer, the third quartile is x(⌊t⌋) + [x(⌊t⌋ + 1) − x(⌊t⌋)] × (fractional part of t).

Example: Determine the first and third quartile of the following data sets.

Group A: 85, 88, 87, 89, 86, 86, 85, 87, 89, 88

Group B: 94, 80, 79, 88, 96, 86, 83, 81, 85, 99, 92

Solution:
For Group A, we arrange the values in an array: 85, 85, 86, 86, 87, 87, 88, 88, 89, 89

For the first quartile, we compute k = (10 + 1)/4 = 2.75. Since k is not an integer, the first quartile is computed as follows.
Let x(2) = 85, x(3) = 86, and the fractional part of k = 0.75. Thus, Q1 = 85 + (86 − 85)(0.75) = 85.75.

For the third quartile, we compute t = 3(10 + 1)/4 = 8.25. Since t is not an integer, the third quartile is computed as follows.
Let x(8) = 88, x(9) = 89, and the fractional part of t = 0.25. Thus, Q3 = 88 + (89 − 88)(0.25) = 88.25.

Using MS Excel

To compute for the values of the quartiles using excel, we will use the formula

=QUARTILE.EXC(array, quart). The “quart” in the formula may be 1, 2, or 3 for first, second, or third quartile, respectively.

For Group B, we arrange the values in an array: 79, 80, 81, 83, 85, 86, 88, 92, 94, 96, 99

For the first quartile, we compute k = (11 + 1)/4 = 3. Since k is an integer, the first quartile is Q1 = x(3) = 81.

For the third quartile, we compute t = 3(11 + 1)/4 = 9. Since t is an integer, the third quartile is Q3 = x(9) = 94.


The computation for deciles and percentiles follows the same procedures as the computation of quartiles using the formula.

Example: Determine the 10th, 80th, and 90th percentiles of the following data set.

Group A: 85, 88, 87, 89, 86, 86, 85, 87, 89, 88

Solution:

For Group A, we arrange the values in an array: 85, 85, 86, 86, 87, 87, 88, 88, 89, 89

For the 10th percentile: k = (10 + 1)(0.10) = 1.1. Since k is not an integer, P10 = 85 + (85 − 85)(0.1) = 85.

For the 80th percentile: k = (10 + 1)(0.80) = 8.8. Since k is not an integer, P80 = 88 + (89 − 88)(0.8) = 88.8.

For the 90th percentile: k = (10 + 1)(0.90) = 9.9. Since k is not an integer, P90 = 89 + (89 − 89)(0.9) = 89.

Figure 8 shows the MS Excel output using the formula =PERCENTILE.EXC(array, k), where k has values between 0 and 1. That is, for the 80th percentile, k = 0.8.
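Both the quartile and percentile calculations above follow Excel's exclusive interpolation method (position = (n + 1) × p); a sketch of that method as a single function, with `percentile_exc` being a name introduced here for illustration:

```python
import math

def percentile_exc(data, p):
    # Exclusive method used by Excel's PERCENTILE.EXC / QUARTILE.EXC:
    # position = (n + 1) * p, interpolating between neighbouring sorted values.
    x = sorted(data)
    pos = (len(x) + 1) * p
    k = math.floor(pos)
    frac = pos - k
    if frac == 0:
        return x[k - 1]                       # integer position: take that value
    return x[k - 1] + frac * (x[k] - x[k - 1])

group_a = [85, 88, 87, 89, 86, 86, 85, 87, 89, 88]
group_b = [94, 80, 79, 88, 96, 86, 83, 81, 85, 99, 92]

print(percentile_exc(group_a, 0.25))  # 85.75 (Q1)
print(percentile_exc(group_a, 0.75))  # 88.25 (Q3)
print(percentile_exc(group_b, 0.25))  # 81    (Q1, integer position)
```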


For the grouped data, we will use the following formulas. Note that these formulas are very similar to the formula for the median.

Example: Compute and interpret the first and third quartile, 3rd decile, and 10th and 90th percentile of the data set presented in the following frequency
distribution table.
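Assuming the grouped-data quantile formula follows the same pattern as the median, Q = L + ((p·n − cf)/f) × c, a sketch using the module's frequency table (the function name is introduced here for illustration):

```python
def grouped_quantile(boundaries, frequencies, c, p):
    # Q = L + ((p*n - cf) / f) * c, the same pattern as the grouped-data median:
    # find the class whose cumulative frequency first reaches position p*n.
    n = sum(frequencies)
    target, cum = p * n, 0
    for L, f in zip(boundaries, frequencies):
        if cum + f >= target:
            return L + (target - cum) / f * c
        cum += f

boundaries  = [4.5, 11.5, 18.5, 25.5, 32.5, 39.5, 46.5]
frequencies = [2, 10, 6, 13, 9, 7, 3]

print(round(grouped_quantile(boundaries, frequencies, 7, 0.25), 2))  # 19.08 (Q1)
print(round(grouped_quantile(boundaries, frequencies, 7, 0.30), 2))  # 22.0  (D3)
print(round(grouped_quantile(boundaries, frequencies, 7, 0.90), 2))  # 44.5  (P90)
```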



Lesson 3: Measures of Variability
To get a complete description of the data set, it is not enough to compute averages. We also consider other statistics, called measures of variability, to determine the homogeneity or heterogeneity of the data in a distribution. Measures of variability describe how spread out or scattered a set of data is.

As shown in Figure 6, data sets may have the same average but a totally different interpretation due to the spread (variation or differences) of the data values.

A. Absolute Variability

Range. This refers to the difference between the maximum and minimum values.

Mean Absolute Deviation. This refers to the average of the absolute deviations of the values from the mean.

Ungrouped data: MAD = Σ|x − x̄| / n

Grouped data: MAD = Σf|x − x̄| / n

Variance and Standard Deviation. The variance refers to the average of the squared deviations of the values from the mean. The standard deviation is the square root of the variance.

Ungrouped data (sample): s² = Σ(x − x̄)² / (n − 1)

Grouped data (sample): s² = [nΣfx² − (Σfx)²] / [n(n − 1)]


Interquartile Range and Quartile Deviation. The interquartile range, IQR = Q3 − Q1, measures the dispersion of the middle 50% of the distribution; the quartile deviation is QD = (Q3 − Q1)/2.

Example: Consider the following raw data on performance rating set taken from a sample of 10 respondents.

i 1 2 3 4 5 6 7 8 9 10

xi 84 86 83 81 87 90 93 85 80 94

Compute the range, mean absolute deviation, variance, standard deviation, interquartile range, and quartile deviation.

Solution:
a. Range = 94 − 80 = 14

The ratings span 14 points from the minimum to the maximum value.

To compute the mean absolute deviation, variance, and standard deviation, the mean must be determined first. Then, the column for |x − x̄| is computed by getting the absolute value of the difference between each rating and the mean. The column for (x − x̄)² is computed by squaring the values in the column for |x − x̄|. Finally, get the total for each column.
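The whole computation for the ungrouped example can be sketched with the standard library; note that this assumes the sample (n − 1) divisor for the variance, which the module does not state explicitly:

```python
from statistics import mean, stdev, variance

ratings = [84, 86, 83, 81, 87, 90, 93, 85, 80, 94]

m = mean(ratings)                                   # 86.3
mad = sum(abs(x - m) for x in ratings) / len(ratings)

print(max(ratings) - min(ratings))   # 14    (range)
print(round(mad, 2))                 # 3.76  (mean absolute deviation)
print(round(variance(ratings), 2))   # 22.68 (sample variance, divisor n - 1)
print(round(stdev(ratings), 2))      # 4.76  (sample standard deviation)
```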


For grouped data, consider the frequency distribution of the 50 Mathematics test scores.

Class Intervals   Class Boundaries    f     x     fx      fx²
5 – 11            4.5 – 11.5          2     8     16      128
12 – 18           11.5 – 18.5        10    15    150     2250
19 – 25           18.5 – 25.5         6    22    132     2904
26 – 32           25.5 – 32.5        13    29    377    10933
33 – 39           32.5 – 39.5         9    36    324    11664
40 – 46           39.5 – 46.5         7    43    301    12943
47 – 53           46.5 – 53.5         3    50    150     7500
Total                                50         1450    48322

Using the computational formula, s² = [nΣfx² − (Σfx)²] / [n(n − 1)] = [50(48322) − 1450²] / [50(49)] = 128, so s = √128 ≈ 11.31.

Example. Which of the following groups is most varied in terms of their performance ratings? Compute their coefficients of variation.

Group1: 85, 85, 87, 89, 86, 86, 85, 87, 89, 88, 87

Group 2: 94, 80, 79, 88, 96, 85, 83, 81, 85, 99, 87

Group 3: 70, 80, 79, 88, 89, 89, 85, 100, 90, 100

Solution: Notice that the groups have the same mean value of 87. Hence, we need another measure to compare them. First, compute the standard deviation.

You can verify the following values using MS Excel or your calculator.

Group   Mean   Standard Deviation   Sample Size

1 87 1.490712 10

2 87 7.055337 10

3 87 9.201449 10

Notice that Group 3 is more varied than the other two groups. Furthermore, the coefficient of variation, CV = (s/x̄) × 100%, leads to the same conclusion.
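A sketch computing the coefficients of variation from the means and standard deviations reported in the table above:

```python
# (mean, standard deviation) for each group, taken from the table above
groups = {1: (87, 1.490712), 2: (87, 7.055337), 3: (87, 9.201449)}

cv = {g: round(s / m * 100, 2) for g, (m, s) in groups.items()}
print(cv)  # {1: 1.71, 2: 8.11, 3: 10.58}
```

Group 3's CV of about 10.58% confirms it is the most varied relative to its mean.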


Example. Which of the following groups is most varied in terms of their performance ratings? Compute their quartile deviations.

Group1: 85, 85, 87, 89, 86, 86, 85, 87, 89, 88, 87

Group 2: 94, 80, 79, 88, 96, 85, 83, 81, 85, 99, 87

Group 3: 70, 80, 79, 88, 89, 89, 85, 100, 90, 100

Solution: After arranging the terms in an array, we get the following values for the first and third quartiles.

Group First Quartile Third Quartile

1 85.75 88.25

2 80.75 94.5

3 79.75 92.5

Since some groups have outliers, let us focus on the middle 50% of the distribution.

In terms of the variation in the middle 50%, Group 2 is the most varied among the three groups, since it has the largest quartile deviation.
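A sketch computing each group's quartile deviation, QD = (Q3 − Q1)/2, from the tabled quartiles:

```python
# (Q1, Q3) per group, taken from the quartile table above
quartiles = {1: (85.75, 88.25), 2: (80.75, 94.5), 3: (79.75, 92.5)}

qd = {g: (q3 - q1) / 2 for g, (q1, q3) in quartiles.items()}
print(qd)                   # {1: 1.25, 2: 6.875, 3: 6.375}
print(max(qd, key=qd.get))  # 2 -> Group 2's middle 50% is the most spread out
```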


MODULE 5 - PROBABILITY

MODULE 5.1: PROBABILITY
 

Introduction

The theory of probability is a particularly important concept in the study of inferential statistics. It allows us to analyze claims and
hypotheses to determine whether conclusions can be drawn legitimately about a phenomenon as well as to make predictions about future events. A
good foundational knowledge of probability concepts will help you understand inferential statistics and the concept of random variation that is
involved in hypothesis testing.
 

Probability theory, as an established branch of mathematics, has applications in several areas of scholarly activity, from music to physics, and in daily experience, ranging from weather prediction to predicting the risks of new medical treatments.
 

In this module, you will learn the basics of counting rules and probability theory. The topics covered in this module are:
   1.  Counting rules (Fundamental Principle of Counting, Permutation, Combination)
   2.  Basic Probability
   3.  Random Variables and Probability Distributions
   4.  Normal Distribution

Learning Outcomes

At the end of this module, you will be able to:

1. define probability and sample spaces


2. determine the number of outcomes in a sequence of events using appropriate counting rules
3. compute the probability of an event using the classical and empirical probability
4. determine the probability of compound events using addition and multiplication rules

Expected Outputs

1. Quiz

Date: October 28 – November 14, 2020

Lesson 1: Counting Rules


 
It would be very tedious and cumbersome to list all the equally likely outcomes in a sample space by brute force. Thus, more efficient methods are employed to determine the number of outcomes in a sample space.
 

Fundamental Principle of Counting

Rule of Sum: If there are n choices for one action, and m choices for another action and the two actions cannot be done at the same time, then there
are n + m ways to choose one of these actions.
 
Rule of Product: If there are n ways of doing something, and m ways of doing another thing after that, then there are n x m ways to perform both of these actions.
 

Example: A college student from UB, who resides in Mindoro can choose from 2 small craft passenger services or 3 big craft passenger services to
go to the port of Batangas. From the port of Batangas, he can choose from 4 van services or 3 jeepney services to go to the city proper. How many
ways are there for him to get to Batangas City?
 

Solution:

He has 2 + 3 = 5 ways to travel from Mindoro to the port of Batangas (Rule of Sum). From there, he has 4 + 3 = 7 ways to get to the city proper (Rule of Sum). Hence, he has 5 x 7 = 35 ways to get to Batangas City in total (Rule of Product).
 

Example: How many three-digit numbers can be formed using the digits 2, 3, 4, 5, 6?

Solution: Since there are 5 digits to choose from (with repetition allowed) and three place values to fill in, we have 5 x 5 x 5 = 125 three-digit numbers.
 

Another method (but very tedious) to determine all possible outcomes (sample space) is by the use of listing (enumeration) or tree diagram.
 

Example: In how many ways can you arrange the letters A, B, C?


 

Solution: By listing or enumerating all possible outcomes, we have 6 ways to arrange the letters A, B, C. The sample space consists of ABC, ACB, BAC, BCA, CAB, CBA.

Example: If a fair coin is tossed three times, what are the possible outcomes?

Solution: There will be two outcomes for each toss. Hence, there will be 2 x 2 x 2 = 8 possible outcomes. Using a tree diagram, the possible outcomes will be HHH, HHT, HTH, THH, HTT, THT, TTH, TTT.
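The listing and tree-diagram results above can be reproduced with a short Python sketch using the standard library's itertools (an illustrative check, not part of the original module):

```python
from itertools import permutations, product

# All arrangements of the letters A, B, C (the sample space of the first example).
letters = ["".join(p) for p in permutations("ABC")]
print(letters)  # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']

# All outcomes of tossing a fair coin three times (2 x 2 x 2 = 8 outcomes).
tosses = ["".join(t) for t in product("HT", repeat=3)]
print(tosses)   # ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']
```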
          
                       
     
       Case 4: Circular Permutation – Formula: (n – 1)!
    
   

Example: In a gathering, 4 Filipinos, 5 Americans, and 6 Japanese will be seated at a round table.

a) In how many ways can the guests be seated?

b) In how many ways can the guests be seated so that all Filipinos are seated together?

c) In how many ways can the guests be seated so that those of the same nationality are seated together?

     

Solution:

a) Since there are 15 guests to be seated at a round table, there will be (15 – 1)! = 14! = 87 178 291 200 ways.

b) Since the 4 Filipinos must sit together, treat them as a single unit; together with the 11 other guests, this gives 12 units, which can be seated in (12 – 1)! = 11! = 39 916 800 ways. The Filipinos themselves can be arranged within their block in 4! = 24 ways. Hence, there will be 24 x 39 916 800 = 958 003 200 arrangements.

c) There are 3 nationalities, so the blocks can be arranged in (3 – 1)! = 2 ways. There are 4! = 24 ways to arrange the Filipinos, 5! = 120 ways to arrange the Americans, and 6! = 720 ways to arrange the Japanese. Thus, there will be 2 x 24 x 120 x 720 = 4 147 200 possible arrangements.
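As a quick check, the three circular-permutation answers can be verified with math.factorial (a sketch, not part of the module):

```python
from math import factorial

# (a) 15 guests around a round table: (15 - 1)!
print(factorial(14))                 # 87178291200

# (b) Filipinos together: treat them as one block -> 12 units -> (12 - 1)!,
#     then arrange the 4 Filipinos within the block in 4! ways.
print(factorial(11) * factorial(4))  # 958003200

# (c) Nationalities together: 3 blocks -> (3 - 1)!, times the internal orders.
print(factorial(2) * factorial(4) * factorial(5) * factorial(6))  # 4147200
```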

Combination

A combination is a selection of all or part of a set of objects, without regard to the order (sequence) in which the objects are selected. The number of combinations of n objects taken r at a time is

nCr = n!/(r!(n – r)!)
Example: A class is composed of 10 boys and 25 girls.
1. In how many ways can a committee of 5 be chosen from this class?
2. In how many ways can a committee of 5 be chosen so that 3 are boys?
3. In how many ways can a committee of 5 be chosen if 2 girls are always together?

      4. In how many ways can a committee of 5 be chosen if 2 girls do not want to be in the same group?

Solution:

1. Since there are 35 students, there will be 35C5 = 324 632 possible committees.

2. The three boys can be selected from the 10 boys in 10C3 = 120 ways. The two remaining members must be girls, and they can be selected in 25C2 = 300 ways. Hence, there will be 120 x 300 = 36 000 possible committees.

3. Since the 2 girls are always together in the group, two positions are already filled. The remaining 3 members will be selected from the 33 other students, giving 33C3 = 5 456 possible committees.

4. There are 324 632 ways to select a committee of 5 from the class. Of these, 5 456 committees have the two girls together. Hence, the number of committees of 5 where the two girls are not together is 324 632 – 5 456 = 319 176.
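Each count above can be verified with Python's math.comb (an illustrative check, assuming Python 3.8+):

```python
from math import comb

print(comb(35, 5))                # 324632  (any committee of 5 from 35 students)
print(comb(10, 3) * comb(25, 2))  # 36000   (exactly 3 boys and 2 girls)
print(comb(33, 3))                # 5456    (two particular girls included; 3 more from 33)
print(comb(35, 5) - comb(33, 3))  # 319176  (total minus "always together")
```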

To know more about fundamental principle of counting, permutation and combination, watch the following videos.
Khan Academy. (21 November 2014). Permutation Formula [Video Clip].

Khan Academy. (20 November 2014). Factorial and counting seat arrangements.

Khan Academy. (22 November 2014). Combination Formula.

Practice Sets

Test your understanding by answering these practice sets.


1. Permutations. https://bit.ly/30TuOsy
2. Combination. https://bit.ly/36UpuZW
3. Permutation and Combination. https://bit.ly/36RQOIb

  

Lesson 2: Basic Probability


 
A probability is a number that represents the likelihood of an uncertain event.

Probabilities are always between 0 and 1, inclusive.

    

Example: In a standard deck of cards, what is the probability that


1. a red card is drawn?
2. a red face card is drawn?
3. a black ace card is drawn?
4. a face card is drawn?

Solution:

1. There are 26 red cards in a standard deck of 52 cards; thus, the probability of drawing a red card is 26/52 = 1/2.

2. There are 6 red face cards in a standard deck of 52 cards; thus, the probability of drawing a red face card is 6/52 = 3/26.

3. There are 2 black aces in a standard deck of 52 cards; thus, the probability of drawing a black ace is 2/52 = 1/26.

4. There are 12 face cards in a standard deck of 52 cards; thus, the probability of drawing a face card is 12/52 = 3/13.
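The four card probabilities reduce to simple fractions; Python's fractions module can confirm the simplified forms (a sketch, not part of the module):

```python
from fractions import Fraction

print(Fraction(26, 52))  # 1/2   red cards
print(Fraction(6, 52))   # 3/26  red face cards
print(Fraction(2, 52))   # 1/26  black aces
print(Fraction(12, 52))  # 3/13  face cards
```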


   

Conditional Probability

If A and B are two events, then the conditional probability of A given B, denoted by P(A|B), is the likelihood that event A occurs given that event B has occurred. The formula is:

P(A|B) = P(A ∩ B)/P(B)

where P(A ∩ B) is the probability that both events A and B occur at the same time and P(B) is the probability that event B occurs.


 

Example: Below is information about students in a university.

Program                      Scholar: Yes    Scholar: No    Total
Education                    10              230            240
Business & Accountancy       15              540            555
Engineering                  40              1500           1540
Tourism                      10              200            210
Total                        75              2470           2545

1. What is the probability that the student is a scholar given that he is taking education?
2. What is the probability that the student is not a scholar given that he is taking engineering?

Solution: Let A be the event that the student is a scholar, B the event that the student is not a scholar, C the event that the student is taking Education, and D the event that the student is taking Engineering.

1. P(A|C) = P(A ∩ C)/P(C) = (10/2545)/(240/2545) = 10/240 ≈ 0.042

2. P(B|D) = P(B ∩ D)/P(D) = (1500/2545)/(1540/2545) = 1500/1540 ≈ 0.974

Example: A couple has three children. What is the probability that all three children are boys, given that one of the children is a boy? Assume that
when a child is born, it has an equal chance of being a boy or girl.
 

Solution: The sample space of this experiment contains BBB, BBG, BGB, GBB, BGG, GBG, GGB, GGG (8 possible outcomes).
Let A be the event that all three children are boys; that is, A = {BBB}.
Let B be the event that one of the children is a boy; that is, B = {BBB, BBG, BGB, GBB, BGG, GBG, GGB}. So, A ∩ B = {BBB}.
 

So, the conditional probability that all three children are boys, given that one of the children is a boy, is

P(A|B) = P(A ∩ B)/P(B) = (1/8)/(7/8) = 1/7.
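The conditional probability can also be found by brute-force enumeration of the eight equally likely outcomes (an illustrative Python sketch):

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("BG", repeat=3))              # the 8 equally likely families
at_least_one_boy = [o for o in outcomes if "B" in o]  # event B (7 outcomes)
all_boys = [o for o in at_least_one_boy if o == ("B", "B", "B")]  # A intersect B
print(Fraction(len(all_boys), len(at_least_one_boy)))  # 1/7
```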
    
Independent & Dependent Events

Two events are independent if the incidence of one event does not affect the probability of the other event. If the incidence of one event
does affect the probability of the other event then the events are dependent.
 

Example: Below is information about students in a university.

Program                      Scholar: Yes    Scholar: No    Total
Education                    360             240            600
Business & Accountancy       1 090           560            1 650
Tourism                      350             400            750
Total                        1 800           1 200          3 000

1. Are the events “students enrolled in Tourism program” and “student is not a scholar” independent?

      2. Are the events “students enrolled in Education program” and “student is a not a scholar” independent?

Solution:

a) Let A be the event that a student is enrolled in the Tourism program and B the event that a student is not a scholar. The probability that a randomly selected student is enrolled in the Tourism program is

P(A) = 750/3000 = 0.25,

while the probability that a randomly selected student is enrolled in the Tourism program given that he is not a scholar is

P(A|B) = 400/1200 ≈ 0.33.

Since P(A|B) ≠ P(A), non-scholars are more likely than students in general to be enrolled in the Tourism program. Hence the events are NOT independent (they are dependent).

b) Let A be the event that a student is enrolled in the Education program and B the event that a student is not a scholar. The probability that a randomly selected student is enrolled in the Education program is

P(A) = 600/3000 = 0.2,

while the probability that a randomly selected student is enrolled in the Education program given that he is not a scholar is

P(A|B) = 240/1200 = 0.2.

Knowing that a student is not a scholar does not change the probability that he is enrolled in the Education program. Hence the events are independent.
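Both independence checks boil down to comparing P(A) with P(A|B); a small Python sketch over the counts from the table above:

```python
from fractions import Fraction

totals = {"Tourism": 750, "Education": 600}        # program totals
non_scholars = {"Tourism": 400, "Education": 240}  # "No" column
grand_total, non_scholar_total = 3000, 1200

for program in ("Tourism", "Education"):
    p_a = Fraction(totals[program], grand_total)                      # P(A)
    p_a_given_b = Fraction(non_scholars[program], non_scholar_total)  # P(A|B)
    print(program, "independent" if p_a == p_a_given_b else "dependent")
```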

 
Rule of Product for Probability

Let A and B be independent events. Then P(A ∩ B) = P(A) x P(B).

Example: A fair die is rolled twice. What is the probability that both rolls have a result of 4?

Solution: Each roll of the die is independent; that is, a 4 on the first roll does not affect the chance of a 4 on the second roll.

The probability of rolling a 4 on each individual roll is 1/6. So, P(1st roll is 4 and 2nd roll is 4) = 1/6 x 1/6 = 1/36.

Example: Luke has 5 different ties (exactly one of which is green), 6 different shirts (exactly one of which is gray), and 4 different pants (exactly
one of which is black). If a combination of this
outfit is randomly selected, what is the probability that he will wear the green tie, gray shirt, and black pants?
 

Solution: Let A be the event of wearing the green tie, B the event of wearing the gray shirt, and C the event of wearing the black pants. Then,

P(A ∩ B ∩ C) = P(A) x P(B) x P(C) = 1/5 x 1/6 x 1/4 = 1/120.

Alternative Solution: There is exactly one outfit consisting of the green tie, gray shirt, and black pants. Using the fundamental principle of counting, there are 5 x 6 x 4 = 120 different outfits. Therefore, the probability of choosing the desired outfit is 1/120.
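Both product-rule computations can be checked exactly with fractions (an illustrative sketch):

```python
from fractions import Fraction

# Two independent rolls of a fair die both showing 4.
print(Fraction(1, 6) * Fraction(1, 6))                   # 1/36

# Green tie, gray shirt, and black pants chosen independently.
print(Fraction(1, 5) * Fraction(1, 6) * Fraction(1, 4))  # 1/120
```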


 
 

Rule of Sum for Probability

Let A and B be mutually exclusive events. Then the probability of the union of these events is

P(A ∪ B) = P(A) + P(B)


 

Mutually exclusive events are events which cannot occur at the same time.

Example: What is the probability of getting a black card or a diamond in a standard deck of cards?

Solution: The event of getting a black card (A) and the event of getting a diamond (B) in a standard deck of cards are mutually exclusive because a card cannot be both black and a diamond (all diamonds are red). Hence,

P(A ∪ B) = P(A) + P(B) = 26/52 + 13/52 = 39/52 = 3/4.

Practice Sets

To know more about probabilities, watch the videos and answer the practice sets found in the link below.
 

Khan Academy. (n.d.). Probability in College Statistics. https://www.khanacademy.org/math/ap-statistics/probability-ap


 
 
 

Lesson 3: Random Variables


A random variable is a variable that assumes real-number values derived from the outcomes of an experiment. Alternatively, a random variable is a function that maps the outcomes of a random process to numeric values. There are two types of random variables: discrete and continuous.
 

A discrete random variable has a countable number of possible values, while a continuous random variable takes infinitely many values, which typically result from measuring.
 

Discrete Random Variables, Mean, and Variance

A discrete probability distribution has the following properties.

1. The probability of each event in the sample space must be between 0 and 1, inclusive.
2. The sum of the probabilities of all the events in the sample space must be equal to 1 (100%).

Example: In a box are 7 balls: 2 green, 2 blue, and 3 red. Two balls are picked at a time. Construct the probability distribution table of the number of red balls drawn.

Solution: Let X be the number of red balls drawn. The sample space contains 21 equally likely outcomes. (You may want to list all outcomes or use a tree diagram.)

X    P(X)
0    2/7
1    4/7
2    1/7
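The distribution can be derived by enumerating all C(7, 2) = 21 equally likely pairs of balls (a Python sketch; the ball labels are illustrative):

```python
from itertools import combinations
from fractions import Fraction

balls = ["G", "G", "B", "B", "R", "R", "R"]       # 2 green, 2 blue, 3 red
pairs = list(combinations(range(len(balls)), 2))  # 21 equally likely draws

for x in (0, 1, 2):
    count = sum(1 for pair in pairs
                if sum(balls[i] == "R" for i in pair) == x)
    print(x, Fraction(count, len(pairs)))  # 0 -> 2/7, 1 -> 4/7, 2 -> 1/7
```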

To compute the mean (expected value) and variance of a discrete random variable, we use the following formulas:

μ = Σ[X · P(X)]

σ² = Σ[(X – μ)² · P(X)] = Σ[X² · P(X)] – μ²
Example: The following are the returns on investment offered by two banks.

Bank A
X (in thousands of pesos)    10      20      30      40      50
P(X)                         0.30    0.25    0.20    0.15    0.10

Bank B
X (in thousands of pesos)    10      20      30      40      50
P(X)                         0.20    0.20    0.20    0.20    0.20

1. Which bank offers a better plan?


2. Which offer is more consistent?

    

Solution: Compute the mean to determine the better (higher) offer, then compute the variance/standard deviation to determine which is more consistent (lower variation). For Bank A, μ = 25 (thousand pesos) and σ² = 175; for Bank B, μ = 30 and σ² = 200. Hence Bank B offers the better plan, while Bank A's returns are more consistent.
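The comparison can be computed directly from the definitions of μ and σ² (an illustrative sketch using exact fractions to avoid rounding error):

```python
from fractions import Fraction as F

def mean_var(xs, ps):
    """Mean and variance of a discrete random variable."""
    mu = sum(x * p for x, p in zip(xs, ps))
    var = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))
    return mu, var

x = [10, 20, 30, 40, 50]  # returns in thousands of pesos
bank_a = mean_var(x, [F(30, 100), F(25, 100), F(20, 100), F(15, 100), F(10, 100)])
bank_b = mean_var(x, [F(20, 100)] * 5)
print(bank_a)  # mean 25, variance 175
print(bank_b)  # mean 30, variance 200
```

Bank B has the higher expected return; Bank A has the lower variance and is therefore more consistent.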
      

Example: A local club plans to invest P10 000 to host a baseball game. They expect to sell tickets worth P15 000. But if it rains on the day of the
game, they won’t sell any tickets and the club will lose all the money invested. If the weather forecast for the day of the game is 20% possibility of
rain, is this a good investment?
 

Solution: Let X be the event of gain (loss) from the baseball game.

Gain/Loss (X) P(X)

5 000 0.80

-10 000 0.20

The expected value is 𝜇 = 5000(0.80) + (−10000)(0.20) = 2000.

This suggests that they will earn P2 000 on average per game in the long run. Hence, it is still a good investment.

Example: A car insurance company offers to pay P500 000 if a car is stolen or destroyed beyond repair. The insurance policy costs P24 000 and the
probability that the company will need to pay the amount of insurance is 0.002. Find the expected value of the insurance to car owners.
 

Solution:

Gain/Loss (X) P(X)

- 24 000 0.998

476 000 0.002

The expected value for the car owner is –23 000, that is, a loss of P23 000. The policy is designed to the advantage of the insurance company, but car owners still buy the insurance policy because of the security it provides.
To know more about random variables and probability functions, watch the videos and try out the practice problems found in the link below.
 

Khan Academy. (n.d.). Random variables in Statistics and Probability. https://www.khanacademy.org/math/statistics-probability/random-variables-stats-library#random-variables-discrete
 

  

Lesson 4: Normal Distribution


 
A normal probability distribution is a type of continuous probability distribution. This will be used for experiments or events that yield
continuous random variables. It is an arrangement of data sets in which most values are clustered in the middle of the range and the rest taper off
symmetrically towards the extremes.
 

Some random variables that follow a normal distribution include heights of people, sizes of items produced by machines, errors in measurement, blood pressure, and marks on a test.
 

The normal distribution has the following characteristics.

1. It is bell-shaped.
2. It is symmetrical about the mean.
3. It is unimodal.
4. Its tails never touch the x-axis (the curve is asymptotic to it).
5. The mean, median, and mode are located at the center of the distribution.
6. The total area under the curve is equal to 1 (100%).
7. The area to the right or to the left of the mean is 0.5 (50%).

Standard Normal Distribution

A standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. Any normal random variable x can be converted to a standard score (z-score) using

z = (x – μ)/σ
Example: Women's heights have a mean of 161.5 cm and a standard deviation of 6.35 cm. Compute the standard score of the following heights.

a) x = 177.8 cm

b) x = 155.43 cm

Compute the value of the random variable (x) for the given standard scores.

c) z = 2.33

d) z = -1.34

Solution:

a) z = (177.8 – 161.5)/6.35 ≈ 2.57

b) z = (155.43 – 161.5)/6.35 ≈ -0.96

c) x = 161.5 + 2.33(6.35) ≈ 176.30 cm

d) x = 161.5 + (-1.34)(6.35) ≈ 152.99 cm
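The four conversions follow directly from z = (x – μ)/σ and x = μ + zσ (an illustrative Python sketch):

```python
mu, sigma = 161.5, 6.35  # women's heights in cm

def z_score(x):
    return (x - mu) / sigma

def x_value(z):
    return mu + z * sigma

print(round(z_score(177.8), 2))   # 2.57
print(round(z_score(155.43), 2))  # -0.96
print(round(x_value(2.33), 2))    # 176.3
print(round(x_value(-1.34), 2))   # 152.99
```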

Area Under the Normal Curve

The area under the normal curve corresponds to the proportion or probability of occurrence for a random variable. These areas can be determined using the z-table found in the Appendix.
 
Example: Successive stopwatch time intervals of the service crews of a fast-food chain showed that they can serve a meal in 10.5 minutes on average, with a standard deviation of 2 minutes. Assuming that the service time is normally distributed, what is the probability that the crew serves a meal in

a) less than 9 minutes?

b) more than 12 minutes?

c) between 8.5 minutes and 12.7 minutes?

Solution:

a) z = (9 – 10.5)/2 = -0.75, so P(X < 9) = P(z < -0.75) ≈ 0.2266

b) z = (12 – 10.5)/2 = 0.75, so P(X > 12) = 1 – P(z < 0.75) ≈ 1 – 0.7734 = 0.2266

c) z = (8.5 – 10.5)/2 = -1 and z = (12.7 – 10.5)/2 = 1.1, so P(8.5 < X < 12.7) ≈ 0.8643 – 0.1587 = 0.7056
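Instead of a z-table, the same areas can be computed with statistics.NormalDist (Python 3.8+); tiny differences from the table-based answers come from table rounding:

```python
from statistics import NormalDist

service = NormalDist(mu=10.5, sigma=2)  # service time in minutes

print(round(service.cdf(9), 4))                        # P(X < 9)  ~ 0.2266
print(round(1 - service.cdf(12), 4))                   # P(X > 12) ~ 0.2266
print(round(service.cdf(12.7) - service.cdf(8.5), 4))  # P(8.5 < X < 12.7) ~ 0.7057
```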

Example: A group of 1200 freshman college students has a mean IQ score of 108 with a standard deviation of 10.

1. What should be the minimum IQ score of a student to be part of the top 10%?
2. What should be the maximum IQ score of a student to be in the bottom 20%?
3. How many students have an IQ between 100 and 110?

Solution:

1. The top 10% begins at z ≈ 1.28, so the minimum score is 108 + 1.28(10) ≈ 120.8, or about 121.

2. The bottom 20% ends at z ≈ -0.84, so the maximum score is 108 – 0.84(10) ≈ 99.6, or about 99.

3. The z-scores are (100 – 108)/10 = -0.8 and (110 – 108)/10 = 0.2, so the proportion is ≈ 0.5793 – 0.2119 = 0.3674, which corresponds to about 0.3674 x 1200 ≈ 441 students.
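The cutoffs are inverse-CDF (percentile) lookups, which statistics.NormalDist also provides (an illustrative sketch, Python 3.8+):

```python
from statistics import NormalDist

iq = NormalDist(mu=108, sigma=10)

print(round(iq.inv_cdf(0.90), 1))  # minimum score for the top 10%    ~ 120.8
print(round(iq.inv_cdf(0.20), 1))  # maximum score for the bottom 20% ~ 99.6
share = iq.cdf(110) - iq.cdf(100)  # proportion with IQ between 100 and 110
print(round(share * 1200))         # ~ 441 of the 1200 students
```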

Practice Sets

To know more about normal distribution and its application, watch the following videos and try out the practice sets.
 

Khan Academy. (n. d.) Normal distributions review. https://bit.ly/31bC3MP


 

Khan Academy. (n. d.) Normal distribution problem: z-scores. https://bit.ly/34JhB6F


 

Khan Academy. (n.d.) Normal distribution: Area above or below point. https://bit.ly/2SLwvDS

     

Further Readings
 
The following sources were used in the preparation of this independent learning material.

Khan Academy. (21 November 2014). Permutation Formula [Video Clip].

Khan Academy. (20 November 2014). Factorial and counting seat arrangements.

Khan Academy. (22 November 2014). Possible letter words.

Khan Academy. (21 November 2014). Introduction to Combinations.

Khan Academy. (22 November 2014). Combination Formula.

Khan Academy. (n.d.). Random variables in Statistics and Probability. https://www.khanacademy.org/math/statistics-probability/random-variables-stats-library#random-variables-discrete

Khan Academy. (n.d.). Probability in College Statistics. https://www.khanacademy.org/math/ap-statistics/probability-ap


 

Khan Academy. (n. d.) Normal distributions review. https://bit.ly/31bC3MP


 

Khan Academy. (n. d.) Normal distribution problem: z-scores. https://bit.ly/34JhB6F


 
Khan Academy. (n.d.) Normal distribution: Area above or below point. https://bit.ly/2SLwvDS
 

   
          
      
  

 
 
                

         
