
Sampling: Probability and non-probability:

Sampling is the process of selecting a group of individuals from a larger population for the
purpose of studying and making inferences about the population as a whole. There are two
main types of sampling: probability sampling and non-probability sampling.

Probability sampling: Probability sampling is a type of sampling in which every member of the
population has a known and non-zero probability of being selected for the sample. This type of
sampling is considered to be more reliable because it allows for the use of statistical theory to
make inferences about the population based on the sample. Some examples of probability
sampling methods include simple random sampling, stratified sampling, and cluster sampling.
● Example
○ Simple random sampling: A researcher wants to conduct a study on the
effectiveness of a new study technique. They create a list of all the students in
their school and use a random number generator to select a sample of 50
students for the study.
○ Stratified sampling: A researcher wants to conduct a study on the reading
habits of students in their school. They divide the students into three strata:
elementary, middle, and high school. They then use simple random sampling to
select a proportional number of students from each stratum for the study.
○ Cluster sampling: A researcher wants to conduct a study on the physical activity
levels of children in their county. They divide the county into five regions and use
simple random sampling to select two regions for the study. All children in the
selected regions are then included in the sample.
● Characteristics
○ Every member of the population has a known and non-zero probability of being
selected for the sample.
○ The sample is representative of the population, as long as the sampling process
is truly random.
○ It allows for the use of statistical theory to make inferences about the population
based on the sample.
○ It may require a complete list of the population, which may not always be
available.

Non-probability sampling: Non-probability sampling is a type of sampling in which the probability of each member of the population being selected for the sample is unknown or zero. This type of
sampling is considered to be less reliable because it is not based on statistical theory and it is
more difficult to make inferences about the population based on the sample. Some examples of
non-probability sampling methods include convenience sampling, quota sampling, and snowball
sampling.
● Example
○ Convenience sampling: A researcher wants to conduct a study on the nutrition
habits of college students. They set up a table on campus and ask any students
who pass by to participate in the study.
○ Quota sampling: A researcher wants to conduct a study on the shopping habits
of women in their city. They divide the city into four regions and set quotas for the
number of women to be sampled from each region. They then recruit participants
from each region until their quotas are met.
○ Snowball sampling: A researcher wants to conduct a study on the social media
habits of LGBTQ+ individuals. They ask a few members of the LGBTQ+
community to participate in the study and then ask those participants to refer
other LGBTQ+ individuals who might be interested in participating.
● Characteristics
○ The probability of each member of the population being selected for the sample
is unknown or zero.
○ The sample may not be representative of the population, as it is not selected randomly.
○ It is not based on statistical theory and it is more difficult to make inferences
about the population based on the sample.
○ It may be more practical or cost-effective to implement than probability sampling.

Various methods of sampling:


● Simple random sampling: Simple random sampling is a method of sampling in which
every member of the population has an equal and independent probability of being
selected for the sample. This method is easy to implement, but it requires a complete list
of the population, which may not always be available. An example of simple random
sampling would be selecting names from a hat to choose a sample of participants for a
study.
○ Example: A researcher wants to conduct a study on the effectiveness of a new
study technique. They create a list of all the students in their school and use a
random number generator to select a sample of 50 students for the study.
○ Characteristics:
■ Every member of the population has an equal and independent probability
of being selected for the sample.
■ The sample is representative of the population, as long as the sampling
process is truly random.
■ It requires a complete list of the population, which may not always be
available.
■ It is easy to implement.
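As a sketch of the procedure just described, the following Python snippet draws a simple random sample of 50 from a hypothetical student roster (the names and roster size are made up for illustration):

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical roster of every student in the school.
students = [f"student_{i}" for i in range(1, 1201)]

# Each student has an equal, independent chance of being chosen.
sample = random.sample(students, 50)
print(len(sample))  # → 50
```

Note that `random.sample` draws without replacement, so no student can appear in the sample twice.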

● Stratified sampling: Stratified sampling is a method of sampling in which the population is divided into subgroups (strata) and a sample is selected from each subgroup. This
method is used when the population is heterogeneous and it is important to represent
each subgroup in the sample. An example of stratified sampling would be selecting a
sample of students for a study by dividing the population into subgroups based on grade
level and selecting a proportional number of students from each subgroup.
○ Example: A researcher wants to conduct a study on the reading habits of
students in their school. They divide the students into three strata: elementary,
middle, and high school. They then use simple random sampling to select a
proportional number of students from each stratum for the study.
○ Characteristics:
■ The population is divided into subgroups (strata) and a sample is selected
from each subgroup.
■ The sample is more representative of the population than a simple
random sample, as it includes a proportional number of members from
each subgroup.
■ It requires a complete list of the population and knowledge of the
subgroups.
■ It is more time-consuming to implement than simple random sampling.
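The proportional-allocation idea can be sketched in Python as follows; the strata sizes and total sample size here are hypothetical:

```python
import random

random.seed(1)

# Hypothetical strata: elementary, middle, and high school students.
strata = {
    "elementary": [f"elem_{i}" for i in range(600)],
    "middle": [f"mid_{i}" for i in range(300)],
    "high": [f"high_{i}" for i in range(300)],
}
total = sum(len(members) for members in strata.values())
sample_size = 120

# Each stratum contributes in proportion to its share of the population.
sample = []
for name, members in strata.items():
    n = round(sample_size * len(members) / total)
    sample.extend(random.sample(members, n))

print(len(sample))  # → 120
```

In general, rounding the per-stratum allocations can make the total miss the target by one or two; here the shares divide evenly.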

● Systematic sampling: Systematic sampling is a method of sampling in which members of the population are selected at regular intervals from a list. This method is easy to
implement, but it can be biased if the list is not representative of the population. An
example of systematic sampling would be selecting every 10th name on a list of
registered voters to create a sample for a political poll.
○ Example: A researcher wants to conduct a study on the political views of adults
in their town. They obtain a list of all registered voters in the town and use
systematic sampling to select every 10th person on the list for the study.
○ Characteristics:
■ Members of the population are selected at regular intervals from a list.
■ It is easy to implement, but it can be biased if the list is not representative
of the population.
■ It requires a complete list of the population.
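A minimal Python sketch of systematic sampling with a random starting point (the voter list is hypothetical):

```python
import random

random.seed(1)

# Hypothetical list of 1,000 registered voters.
voters = [f"voter_{i}" for i in range(1000)]

k = 10  # sampling interval: take every 10th person
start = random.randrange(k)  # random start within the first interval
sample = voters[start::k]

print(len(sample))  # → 100
```

Starting at a random point within the first interval, rather than always at the top of the list, removes one easy source of bias.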

● Cluster sampling: Cluster sampling is a method of sampling in which the population is divided into groups (clusters) and a sample of clusters is selected. All members of the
selected clusters are then included in the sample. This method is used when it is
impractical or too expensive to create a list of the entire population. An example of
cluster sampling would be selecting a sample of schools to participate in a study and
then including all students from the selected schools in the sample.
○ Example: A researcher wants to conduct a study on the physical activity levels of
children in their county. They divide the county into five regions and use simple
random sampling to select two regions for the study. All children in the selected
regions are then included in the sample.
○ Characteristics:
■ The population is divided into groups (clusters) and a sample of clusters
is selected.
■ All members of the selected clusters are included in the sample.
■ It is used when it is impractical or too expensive to create a list of the
entire population.
■ It is less precise than simple random sampling, as the sample is not
selected independently from the population.
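The two-step process above can be sketched in Python, assuming five hypothetical regions of 200 children each:

```python
import random

random.seed(1)

# Hypothetical county: five regions (clusters), 200 children per region.
regions = {f"region_{r}": [f"child_{r}_{i}" for i in range(200)]
           for r in range(5)}

# Stage 1: randomly select two whole clusters.
chosen = random.sample(list(regions), 2)

# Every child in a selected cluster is included in the sample.
sample = [child for region in chosen for child in regions[region]]
print(len(sample))  # → 400
```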
● Multistage sampling: Multistage sampling is a method of sampling that combines two
or more of the above methods. This method is used when it is not practical to create a
list of the entire population or when the population is too large to study as a whole. An
example of multistage sampling would be selecting a sample of schools using cluster
sampling, and then selecting a sample of students from each selected school using
stratified sampling.
○ Example: A researcher wants to conduct a study on the nutrition habits of
children in their state. They use cluster sampling to select a sample of schools
from each region of the state. They then use stratified sampling to select a
sample of students from each selected school, stratifying by grade level.
○ Characteristics
■ Combines two or more sampling methods.
■ It is used when it is not practical to create a list of the entire population or
when the population is too large to study as a whole.
■ It is more time-consuming and complex to implement than single-stage
sampling methods.
■ It is less precise than simple random sampling, as the sample is not
selected independently from the population.
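A sketch of the two stages described above, using hypothetical schools and grade-level strata:

```python
import random

random.seed(1)

# Hypothetical state: nine schools, each with two grade-level strata.
schools = {
    f"school_{s}": {
        "lower_grades": [f"s{s}_lower_{i}" for i in range(40)],
        "upper_grades": [f"s{s}_upper_{i}" for i in range(60)],
    }
    for s in range(9)
}

# Stage 1 (cluster sampling): select three schools at random.
chosen_schools = random.sample(list(schools), 3)

# Stage 2 (stratified sampling): sample 10% from each grade stratum.
sample = []
for school in chosen_schools:
    for stratum, students in schools[school].items():
        n = len(students) // 10
        sample.extend(random.sample(students, n))

print(len(sample))  # → 30
```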

Sampling and non-sampling errors:


Sampling error: Sampling error refers to the difference between the sample estimate and the
true population parameter. In probability sampling, sampling error occurs because the sample is
selected randomly from the population, and the sample may not perfectly represent the
population.
● Example:
○ A researcher wants to conduct a study on the political views of adults in their city.
They use simple random sampling to select a sample of 500 adults from the
population of 50,000. After conducting the study, they find that 45% of the sample
identifies as Democrats.
○ However, the true proportion of Democrats in the population is 40%. The
difference between the sample estimate (45%) and the true population parameter
(40%) is an example of sampling error. This error occurred because the sample
was selected randomly from the population, and it is possible that the sample did
not perfectly represent the population.
● Characteristics:
○ Sampling error is a measure of the difference between the sample estimate and
the true population parameter.
○ Sampling error occurs because the sample is selected randomly from the
population, and the sample may not perfectly represent the population.
○ Sampling error can be estimated using statistical theory, such as the standard
error of the mean.
○ Sampling error can be reduced by increasing the sample size or using a more
representative sampling method.
○ Sampling error is not the same as non-sampling error, which refers to errors arising from sources other than the random selection of the sample.
○ Sampling error is an inherent part of probability sampling and cannot be
completely eliminated, but it can be minimized through careful planning and
execution of the sampling process.
● Types (strictly, only the first two are forms of sampling error; the remaining items are usually classed as non-sampling errors, but they are often listed alongside it):
○ Sampling bias: This occurs when the sampling method used is not
representative of the population, resulting in a biased sample.
○ Sampling variability: This occurs when the sample estimate differs from the true
population parameter due to random chance.
○ Sampling error of non-response: This occurs when some members of the
sample do not respond or cannot be contacted, resulting in a biased sample.
○ Sampling error of response: This occurs when respondents do not answer
honestly or do not understand the questions, resulting in inaccurate data.
○ Sampling error of measurement: This occurs when the measurement
instrument is not accurate or is used incorrectly, resulting in inaccurate data.
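Sampling variability can be made concrete with a small simulation. This sketch assumes a hypothetical population of 50,000 adults of whom 40% identify as Democrats, mirroring the example above:

```python
import random

random.seed(42)  # fixed seed for a reproducible illustration

# Hypothetical population: 40% Democrats (1) among 50,000 adults (0 = other).
population = [1] * 20000 + [0] * 30000

# One simple random sample of 500; the estimate will rarely be exactly 0.40.
sample = random.sample(population, 500)
estimate = sum(sample) / len(sample)
sampling_error = estimate - 0.40

# Theoretical standard error of the proportion for n = 500.
se = (0.40 * 0.60 / 500) ** 0.5
print(round(se, 3))  # → 0.022
```

Repeating the draw many times would show the estimates clustering around 0.40 with a spread of roughly one standard error.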

Non-sampling error: Non-sampling error refers to any error in a study that does not arise from the random selection of the sample. Non-sampling errors can be caused by a variety of factors, such as errors in measurement, response bias, and data entry errors.
● Example:
○ A researcher wants to conduct a study on the eating habits of college students.
They use stratified sampling to select a sample of 500 students from the
population of 20,000. They send out a survey to the selected students, but only
300 students respond.
○ The non-response rate of 40% (200 non-respondents out of 500 selected
students) is an example of non-sampling error. This error occurred because
some students did not respond or could not be contacted, resulting in a biased
sample. The bias in the sample may affect the accuracy of the study's findings.
● Characteristics:
○ Non-sampling error refers to any error in the study that is not due to the random selection of the sample.
○ Non-sampling error can be caused by a variety of factors, such as errors in
measurement, response bias, and data entry errors.
○ Non-sampling error can be difficult to quantify and measure.
○ Non-sampling error can affect the reliability and validity of the study's findings.
○ Non-sampling error can be reduced by careful planning and execution of the
sampling process and by using rigorous data collection and analysis methods.
○ Non-sampling error is not the same as sampling error, which refers to the
difference between the sample estimate and the true population parameter.
● Types:
○ Measurement error: This occurs when the measurement instrument is not
accurate or is used incorrectly, resulting in inaccurate data.
○ Response bias: This occurs when respondents do not answer honestly or do not
understand the questions, resulting in inaccurate data.
○ Data entry errors: This occurs when data is entered incorrectly, resulting in
inaccurate data.
○ Sampling frame error: This occurs when the list of the population used to select
the sample is not representative of the population, resulting in a biased sample.
○ Non-response error: This occurs when some members of the sample do not
respond or cannot be contacted, resulting in a biased sample.
○ Interviewer bias: This occurs when the interviewer's behavior or characteristics
influence the respondent's answers, resulting in inaccurate data.

Methods of minimizing sampling error

Of the two types of errors, sampling error is easier to identify. The main techniques for reducing sampling error are:

● Increase the sample size.
○ A larger sample size leads to a more precise result because estimates from larger samples fluctuate less around the true population values.

● Divide the population into groups.
○ Instead of a purely random sample, sample groups according to their size in the population. For example, if people of a certain demographic make up 35% of the population, make sure 35% of the sample is drawn from that group.

● Know your population.
○ The error of population specification is when a research team selects an
inappropriate population to obtain data. Know who buys your product, uses it,
works with you, and so forth. With basic socio-economic information, it is possible
to reach a consistent sample of the population. In cases like marketing research,
studies often relate to one specific population like Facebook users, Baby
Boomers, or even homeowners.

● Increase the sample size: Increasing the sample size can reduce sampling error
because a larger sample is more likely to be representative of the population.
○ Example: A researcher wants to conduct a study on the eating habits of college
students. They use simple random sampling to select a sample of 500 students
from the population of 20,000. The sample estimate is based on the responses of
all 500 students.
● Use a more representative sampling method: Using a sampling method that is more
representative of the population, such as stratified sampling or cluster sampling, can
reduce sampling error.
○ Example: A researcher wants to conduct a study on the shopping habits of women in their city. They divide the city into four regions (strata) and use simple random sampling to select a proportional number of women from each region.

● Use a probability sampling method: Using a probability sampling method, such as simple random sampling or stratified sampling, can reduce sampling error because every member of the population has a known and non-zero probability of being selected for the sample.
○ Example: A researcher wants to conduct a study on the political views of adults
in their city. They create a list of all registered voters in the city and use simple
random sampling to select a sample of 500 voters for the study.

● Use a high-quality sampling frame: Using a sampling frame that is accurate and
representative of the population can reduce sampling error.
○ Example: A researcher wants to conduct a study on the nutrition habits of
children in their county. They create a list of all the schools in the county and use
stratified sampling to select a sample of schools based on school size and
location.
● Use a high-quality measurement instrument: Using a measurement instrument that is
reliable and valid can reduce measurement error, which is a type of non-sampling error.
○ Example: A researcher wants to conduct a study on the physical activity levels of
adults in their city. They use a validated questionnaire to measure physical
activity levels.

● Use rigorous data collection and analysis methods: Careful planning and execution
of the sampling process and the use of rigorous data collection and analysis methods
can reduce non-sampling error.
○ Example: A researcher wants to conduct a study on the social media habits of
LGBTQ+ individuals. They use snowball sampling to recruit participants and
carefully follow the data collection and analysis procedures to ensure the
reliability and validity of the study's findings.
Methods of minimizing non-sampling error
● Thoroughly Pretest your Survey Mediums
○ It is very important to ensure that your survey and its invites run smoothly through any medium or on any device your potential respondents might use. People are much more likely to ignore survey
requests if loading times are long, questions do not fit properly on their screens,
or they have to work to make the survey compatible with their device. The best
advice is to acknowledge your sample's different forms of communication
software and devices and pre-test your surveys and invites on each, ensuring
your survey runs smoothly for all your respondents.
● Avoid Rushed or Short Data Collection Periods
○ One of the worst things a researcher can do is limit their data collection time in
order to comply with a strict deadline. Your study's level of nonresponse bias will
climb dramatically if you are not flexible with the time frames respondents have to
answer your survey. Fortunately, flexibility is one of the main advantages to
online surveys since they do not require interviews (phone or in person) that must
be completed at certain times of the day. However, keeping your survey live for
only a few days can still severely limit a potential respondent's ability to answer.
Instead, it is recommended to extend a survey collection period to at least two
weeks so that participants can choose any day of the week to respond according
to their own busy schedule.
● Send Reminders to Potential Respondents
○ Sending a few reminder emails throughout your data collection period has been
shown to effectively gather more completed responses. It is best to send your
first reminder email midway through the collection period and the second near
the end of the collection period. Make sure you do not harass the people on your
email list who have already completed your survey! You can manage your
reminders and invites on FluidSurveys through the trigger options found in the
invite tool.
● Ensure Confidentiality
○ Any survey that requires information that is personal in nature should include
reassurance to respondents that the data collected will be kept completely
confidential. This is especially the case in surveys that are focused on sensitive
issues. Make certain someone reading your invite understands that the
information they provide will be viewed as part of the whole sample and not
individually scrutinized.
● Use Incentives
○ Many people refuse to respond to surveys because they feel they do not have
the time to spend answering questions. An incentive is usually necessary to
motivate people into taking part in your study. Depending on the length of the
survey, the difficulty in finding the correct respondents (e.g., one-legged, 15th-century spoon collectors), and the information being asked, the incentive can range from minimal to substantial in value. Remember, most respondents won't have a vested interest in your study and must feel that the survey is
worth their time!

Concept of Probability and Probability distribution


Probability: Probability is a measure of the likelihood of an event occurring. It is expressed as a
number between 0 and 1, where 0 indicates that an event is impossible and 1 indicates that an
event is certain to occur. For example, the probability of flipping a coin and getting heads is 0.5,
or 50%.
Characteristics:
● Probability is a measure of the likelihood of an event occurring.
● Probability is expressed as a number between 0 and 1, where 0 indicates that an event
is impossible and 1 indicates that an event is certain to occur.
● The sum of the probabilities of all possible outcomes of an event is equal to 1.
● The probability of an event occurring is equal to the number of ways the event can occur
divided by the total number of possible outcomes.
● Probability can be used to make predictions about the likelihood of an event occurring in
the future.
● For independent events, probability is not affected by past outcomes. For example, the probability of flipping heads on a fair coin is not affected by the outcomes of previous flips.
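The "ways the event can occur divided by total outcomes" rule above can be shown with a fair die:

```python
# Probability = number of ways the event can occur / total possible outcomes.
# Event: rolling an even number on a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
favorable = [o for o in outcomes if o % 2 == 0]

p = len(favorable) / len(outcomes)
print(p)  # → 0.5
```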

Probability distribution: A probability distribution is a function that describes the probability of a random variable taking on a particular value or set of values. A random variable is a variable
that takes on different values randomly, such as the outcome of a coin flip or the height of a
person.

There are several types of probability distributions, including the binomial distribution, the
normal distribution, and the uniform distribution. The type of probability distribution that is used
depends on the characteristics of the data and the type of random variable being studied.

For example, the binomial distribution is used to model the probability of a certain number of
successes in a fixed number of independent trials, such as the probability of flipping heads three
times in a row when flipping a coin. The normal distribution is used to model continuous data
that is symmetrical and bell-shaped, such as the heights of people in a population. The uniform
distribution is used to model data that is evenly distributed over a range, such as the roll of a fair
die.

Characteristics:
● A probability distribution is a function that describes the probability of a random variable
taking on a particular value or set of values.
● A probability distribution is used to model the probability of different outcomes for a
random variable.
● The sum of the probabilities of all possible outcomes for a probability distribution is equal
to 1.
● The shape of a probability distribution can vary, depending on the characteristics of the
data and the type of random variable being studied.
● Some common types of probability distributions include the binomial distribution, the
normal distribution, and the uniform distribution.
● Probability distributions can be used to make predictions about the likelihood of different
outcomes for a random variable.
● Probability distributions describe likelihoods, not certainties. For example, the distribution of adult heights is shaped by factors such as genetics, but any individual's height remains a random outcome.
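As a tiny illustration of the "probabilities sum to 1" property, here is the uniform distribution for a fair six-sided die, using exact fractions:

```python
from fractions import Fraction

# Uniform distribution: each face of a fair die has probability 1/6.
distribution = {face: Fraction(1, 6) for face in range(1, 7)}

# The probabilities of all possible outcomes sum to exactly 1.
print(sum(distribution.values()))  # → 1
```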

Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a continuous probability
distribution that is symmetrical and bell-shaped. It is a common distribution used to model data
that is continuous and follows a pattern of being more likely to occur around the mean, with the
likelihood decreasing as the data gets further from the mean.

Characteristics of the normal distribution:


● The normal distribution is symmetrical, with the mean, median, and mode all occurring at
the same point.
● The normal distribution is bell-shaped, with the highest point at the mean and the
probability decreasing as the data gets further from the mean.
● The normal distribution is defined by its mean and standard deviation. The mean is the
center of the distribution, and the standard deviation is a measure of the spread of the
data.
● The normal distribution is continuous, meaning that it can take on any value within a
given range.
● The normal distribution is a common distribution used to model data in many fields,
including psychology, economics, and biology.

Example:
A researcher wants to study the heights of adult men in a population. They measure the heights
of 100 men and find that the mean height is 70 inches and the standard deviation is 2 inches.
They use the normal distribution to model the heights of the men in the population. The normal
distribution is symmetrical and bell-shaped, with the mean of 70 inches and a standard deviation
of 2 inches. The probability of selecting a man who is taller than 74 inches (two standard deviations above the mean) is about 0.023, or 2.3%.
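The tail probability for this kind of example can be computed with the error function from Python's math module; the height figures below are the example's hypothetical values:

```python
import math

def normal_tail(x, mu, sigma):
    """P(X > x) for a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# P(height > 74) when heights are normal with mean 70 and std dev 2.
p = normal_tail(74, mu=70, sigma=2)
print(round(p, 4))  # → 0.0228
```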

The formula for the normal distribution is:

f(x) = (1 / (sigma * sqrt(2 * pi))) * e^(-(x - mu)^2 / (2 * sigma^2))

Where:
● f(x) is the probability density function for the normal distribution
● x is the value of the random variable
● mu is the mean of the distribution
● sigma is the standard deviation of the distribution
● pi is the mathematical constant approximately equal to 3.14
● e is the mathematical constant approximately equal to 2.718

The probability density function describes the probability of a random variable taking on a
particular value. It is expressed as a continuous function, with the area under the curve
representing the probability of the random variable falling within a given range.

In the normal distribution, the mean, mu, is the center of the distribution, and the standard
deviation, sigma, is a measure of the spread of the data. The probability density function is
symmetrical and bell-shaped, with the highest point at the mean and the probability decreasing
as the data gets further from the mean.

The normal distribution is often used to model data that is continuous and follows a pattern of
being more likely to occur around the mean, with the likelihood decreasing as the data gets
further from the mean. It is a common distribution used in many fields, including psychology,
economics, and biology.

Poisson Distribution
The Poisson distribution is a discrete probability distribution that models the probability of a
certain number of events occurring in a given time period or space. It is used to model data that
is countable, such as the number of defects in a manufacturing process or the number of
customers arriving at a store.

Characteristics of the Poisson distribution:


● The Poisson distribution is a discrete distribution, meaning that it models the probability
of a countable number of events occurring.
● The Poisson distribution is defined by a single parameter, lambda, which represents the
average number of events occurring in a given time period or space.
● The Poisson distribution has both mean and variance equal to lambda, with the mode near lambda; it is skewed to the right for small lambda and becomes approximately symmetrical and bell-shaped as lambda grows large.
● The Poisson distribution is often used to model data that follows a pattern of occurring
randomly and independently, such as the number of defects in a manufacturing process
or the number of customers arriving at a store.
Example:
A researcher wants to study the number of accidents occurring on a certain stretch of road over
a period of one year. They collect data on the number of accidents occurring each month and
find that the average number of accidents occurring per month is 4. They use the Poisson
distribution, with lambda = 4, to model the number of accidents occurring each month. The probability of there being exactly 6 accidents in a given month is about 0.104, or 10.4%.
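A quick Python sketch of the Poisson probability of exactly 6 accidents in a month when the monthly average is 4:

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return (lam ** x) * math.exp(-lam) / math.factorial(x)

print(round(poisson_pmf(6, lam=4), 4))  # → 0.1042
```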

The formula for the Poisson distribution is:

f(x) = (lambda^x * e^(-lambda)) / x!

Where:
● f(x) is the probability of x events occurring
● lambda is the average number of events occurring in a given time period or space
● x is the number of events occurring
● e is the mathematical constant approximately equal to 2.718
● x! is the factorial of x, which is the product of all the positive integers from 1 to x.

The Poisson distribution models the probability of a certain number of events occurring in a
given time period or space. It is defined by the parameter lambda, which represents the average
number of events occurring. The probability of x events occurring is given by the formula above,
where x is the number of events occurring and lambda is the average number of events
occurring.

The Poisson distribution has its mean at lambda (with the mode near lambda), and the probability decreases as the number of events gets further from lambda; it is right-skewed for small lambda and approximately bell-shaped for large lambda. It is
often used to model data that follows a pattern of occurring randomly and independently, such
as the number of defects in a manufacturing process or the number of customers arriving at a
store.
Binomial Distribution
The binomial distribution is a discrete probability distribution that models the probability of a
certain number of successes in a fixed number of independent trials. It is used to model data
that follows a pattern of having only two possible outcomes, such as the result of a coin flip or
the outcome of a medical treatment.

Characteristics of the binomial distribution:


● The binomial distribution is a discrete distribution, meaning that it models the probability
of a countable number of events occurring.
● The binomial distribution is defined by two parameters: the probability of success, p, and
the number of trials, n.
● The binomial distribution has its mean at n * p, with the probability decreasing as the number of successes gets further from n * p; it is symmetrical and bell-shaped when p = 0.5 and increasingly skewed as p moves toward 0 or 1.
● The binomial distribution is often used to model data that follows a pattern of having only
two possible outcomes, such as the result of a coin flip or the outcome of a medical
treatment.

Example:

A researcher wants to study the success rate of a medical treatment. They treat 100 patients
and find that the treatment is successful in 70% of cases. They use the binomial distribution to
model the number of successful treatments, with n = 100 and p = 0.7. The mean of the distribution is n * p = 70 treatments, and the probability decreases as the number of successes gets further from 70. The probability of there being exactly 75 successful treatments is about 0.05, or 5%.
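The probability of exactly 75 successes out of 100 trials with p = 0.7 can be sketched directly from the binomial formula:

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) successes in n independent trials, success probability p."""
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

prob = binomial_pmf(75, n=100, p=0.7)
print(round(prob, 3))  # → 0.05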

The formula for the binomial distribution is:

P(x) = (n! / (x! * (n - x)!)) * p^x * (1 - p)^(n - x)

Where:
● P(x) is the probability of x successes occurring
● n is the number of trials
● x is the number of successes occurring
● p is the probability of success
● x! is the factorial of x, which is the product of all the positive integers from 1 to x
● (n - x)! is the factorial of (n - x), which is the product of all the positive integers from 1 to
(n - x)
The binomial distribution models the probability of a certain number of successes in a fixed
number of independent trials. It is defined by the parameters p and n, where p is the probability
of success and n is the number of trials. The probability of x successes occurring is given by the
formula above, where x is the number of successes occurring, p is the probability of success,
and n is the number of trials.

The binomial distribution has its mean at n * p, with the probability decreasing as the number of successes gets further from n * p; it is symmetrical when p = 0.5 and skewed otherwise. It is
often used to model data that follows a pattern of having only two possible outcomes, such as
the result of a coin flip or the outcome of a medical treatment.
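As a quick sketch of the formula above, here is a minimal Python implementation using only the standard library's math.comb; the function name is our own, and the numbers mirror the treatment example:

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for a Binomial(n, p) random variable."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Treatment example: n = 100 patients, success probability p = 0.7.
# The distribution is centered at n * p = 70 successes.
print(round(binom_pmf(70, 100, 0.7), 4))   # probability of exactly 70 successes
print(round(binom_pmf(75, 100, 0.7), 4))   # probability of exactly 75 successes, about 0.05
```

Summing the pmf over all possible counts of successes (0 through n) gives 1, which is a handy sanity check on any pmf implementation.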

Descriptive statistics
Descriptive statistics is a branch of statistics that deals with the collection, analysis,
interpretation, presentation, and organization of data. It is used to describe and summarize data
in a meaningful way, such as by calculating measures of central tendency (mean, median,
mode) and measures of dispersion (range, variance, standard deviation).

Characteristics of descriptive statistics:


● Descriptive statistics deals with the collection, analysis, interpretation, presentation, and
organization of data.
● Descriptive statistics is used to describe and summarize data in a meaningful way.
● Descriptive statistics calculates measures of central tendency (mean, median, mode)
and measures of dispersion (range, variance, standard deviation).
● Descriptive statistics is used to visualize data using graphs and plots, such as
histograms and box plots.

Examples:
● A researcher wants to study the income of a group of people. They collect data on the
income of 100 people and use descriptive statistics to calculate the mean, median, and
mode of the data.
● A researcher wants to study the weight of a group of dogs. They collect data on the
weight of 50 dogs and use descriptive statistics to calculate the range, variance, and
standard deviation of the data.
● A researcher wants to visualize the data on the weight of the dogs. They use a
histogram to show the distribution of the data and a box plot to show the range and
dispersion of the data.
Central Tendency
Central tendency is a statistical measure that describes the central or typical value of a dataset.
It is used to summarize and describe the data in a meaningful way, and can be calculated using
measures such as the mean, median, and mode.

Characteristics of central tendency:

● Central tendency is a statistical measure that describes the central or typical value of a
dataset.
● Central tendency can be calculated using measures such as the mean, median, and
mode.
● The mean is the arithmetic average of the data, and is calculated by summing all the
values and dividing by the number of values.
● The median is the middle value of the data when it is sorted in order, and is calculated by
finding the value that separates the data into two equal halves.
● The mode is the most frequently occurring value in the data, and is calculated by finding
the value that appears the most number of times.

Examples:

● A researcher wants to study the income of a group of people. They collect data on the
income of 100 people and use the mean to calculate the central tendency of the data.
The mean income is $50,000.
● A researcher wants to study the weight of a group of dogs. They collect data on the
weight of 50 dogs and use the median to calculate the central tendency of the data. The
median weight is 50 pounds.
● A researcher wants to study the number of siblings in a group of people. They collect
data on the number of siblings for 100 people and use the mode to calculate the central
tendency of the data. The mode is 2 siblings, as this is the number of siblings that occurs
the most number of times in the data.
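All three measures are available in Python's standard library statistics module. The sketch below uses an invented income dataset that includes one high outlier, so the mean comes out well above the median and mode:

```python
import statistics

# Hypothetical incomes; the 120,000 value is an outlier.
incomes = [40_000, 45_000, 50_000, 50_000, 52_000, 58_000, 120_000]

print(statistics.mean(incomes))    # arithmetic average, pulled up by the outlier
print(statistics.median(incomes))  # middle value after sorting: 50_000
print(statistics.mode(incomes))    # most frequent value: 50_000
```

This also previews a point made later: the mean is sensitive to outliers, while the median and mode are not.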

Mean
The mean is a statistical measure that represents the average value of a dataset. It is calculated
by summing all the values in the dataset and dividing by the number of values. The mean is also
known as the arithmetic average or the average.

Characteristics of the mean:


● The mean is a statistical measure that represents the average value of a dataset.
● The mean is calculated by summing all the values in the dataset and dividing by the
number of values.
● The mean is sensitive to outliers, or values that are significantly higher or lower than the
rest of the data.
● The mean is often used to describe and summarize data, and is a common measure of
central tendency.
Example:
A researcher wants to study the income of a group of people. They collect data on the income of
100 people and calculate the mean income. The mean income is $50,000.

The formula for the mean is:


mean = (sum of all values) / (number of values)

Mode:
The mode is a statistical measure that represents the most frequently occurring value in a
dataset. It is calculated by finding the value that appears the most number of times in the
dataset. The mode is often used to describe and summarize data, and is a common measure of
central tendency.

Characteristics of the mode:


● The mode is a statistical measure that represents the most frequently occurring value in
a dataset.
● The mode is calculated by finding the value that appears the most number of times in the
dataset.
● The mode may not exist for some datasets, such as those with continuous variables or
those with multiple modes.
● The mode is often used to describe and summarize data, and is a common measure of
central tendency.

How to calculate mode:


The mode is calculated by finding the value that appears the most number of times in the
dataset. There is no specific formula for calculating the mode, as it is simply the value that
occurs the most number of times.

Example:
A researcher wants to study the number of siblings in a group of people. They collect data on
the number of siblings for 100 people and calculate the mode. The mode is 2 siblings, as this is
the number of siblings that occurs the most number of times in the data.
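Since the mode is just the most frequent value, it is simple to compute. A small sketch with the standard library's statistics module, using made-up sibling counts; statistics.multimode handles the multiple-modes case mentioned above:

```python
import statistics

# Hypothetical sibling counts for a small group of people.
siblings = [0, 1, 2, 2, 2, 3, 1, 2, 4, 1]
print(statistics.mode(siblings))        # 2, the most frequent value

# A dataset can have more than one mode; multimode returns all of them.
scores = [1, 1, 2, 2, 3]
print(statistics.multimode(scores))     # [1, 2]
```
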
Median
The median is a statistical measure that represents the middle value of a dataset when it is
sorted in order. It is calculated by finding the value that separates the data into two equal
halves. The median is often used to describe and summarize data, and is a common measure
of central tendency.

Characteristics of the median:


● The median is a statistical measure that represents the middle value of a dataset when it
is sorted in order.
● The median is calculated by finding the value that separates the data into two equal
halves.
● The median is not affected by outliers, or values that are significantly higher or lower
than the rest of the data.
● The median is often used to describe and summarize data, and is a common measure of
central tendency.

How to find median:


The median is calculated by first sorting the data in order and then finding the value that
separates the data into two equal halves. There is no specific formula for calculating the
median, as it is simply the value that falls in the middle of the dataset when it is sorted in order.

Example:
A researcher wants to study the weight of a group of dogs. They collect data on the weight of 50
dogs and calculate the median. The median weight is 50 pounds, as this is the value that
separates the data into two equal halves when it is sorted in order.

Dispersion
Dispersion is a statistical measure that describes the spread or variability of a dataset. It is used
to quantify how much the values in a dataset differ from each other and from the mean of the
data. Dispersion can be calculated using measures such as the range, variance, and standard
deviation.

Characteristics of dispersion:
● Dispersion is a statistical measure that describes the spread or variability of a dataset.
● Dispersion is used to quantify how much the values in a dataset differ from each other
and from the mean of the data.
● Dispersion can be calculated using measures such as the range, variance, and standard
deviation.
● Dispersion is used to describe and summarize data and to understand the distribution of
the data.
Formulas for measures of dispersion:

● Range: The range is the difference between the highest and lowest values in the
dataset. It is calculated using the formula: range = max value - min value
● Variance: The variance is a measure of the spread of the data around the mean. It is
calculated using the formula: variance = sum of (x - mean)^2 / (n - 1)
● Standard deviation: The standard deviation is a measure of the spread of the data
around the mean. It is calculated using the formula: standard deviation = sqrt(variance)

Example:

A researcher wants to study the weight of a group of dogs. They collect data on the weight of 50
dogs and calculate the range, variance, and standard deviation of the data. The range is 40
pounds, the variance is 500, and the standard deviation is 22.4. These measures of dispersion
help the researcher understand the spread and variability of the data on the weight of the dogs.
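The three measures of dispersion can be computed as follows; the dog weights are invented for illustration, and statistics.variance divides by n - 1, matching the sample-variance formula given above:

```python
import statistics

weights = [30, 42, 48, 50, 55, 60, 70]  # hypothetical dog weights (pounds)

data_range = max(weights) - min(weights)
variance = statistics.variance(weights)  # sample variance, divides by n - 1
std_dev = statistics.stdev(weights)      # square root of the sample variance

print(data_range, round(variance, 2), round(std_dev, 2))
```

Note that the standard deviation is in the same units as the data (pounds), while the variance is in squared units, which is why the standard deviation is usually easier to interpret.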

Skewness
Skewness is a statistical measure that describes the symmetry of a distribution. It is used to
quantify the degree to which a distribution is skewed or asymmetrical. Skewness can be either
positive or negative, depending on the direction of the skew. Positive skewness indicates that
the distribution is skewed to the right, with a long tail on the positive side and a short tail on the
negative side. Negative skewness indicates that the distribution is skewed to the left, with a long
tail on the negative side and a short tail on the positive side.

Characteristics of skewness:
● Skewness is a statistical measure that describes the symmetry of a distribution.
● Skewness can be either positive or negative, depending on the direction of the skew.
● Positive skewness indicates that the distribution is skewed to the right, with a long tail on
the positive side and a short tail on the negative side.
● Negative skewness indicates that the distribution is skewed to the left, with a long tail on
the negative side and a short tail on the positive side.

Formula for skewness:


The formula for skewness (Pearson's first skewness coefficient) is:

skewness = (mean - mode) / SD

Where:
● mean is the mean of the distribution
● mode is the mode of the distribution
● SD is the standard deviation of the distribution
There are two types of skewness: positive skewness and negative skewness.
● Positive skewness: Positive skewness indicates that the distribution is skewed to the
right, with a long tail on the positive side and a short tail on the negative side. Positive
skewness is often characterized by a mean that is greater than the median.

● Negative skewness: Negative skewness indicates that the distribution is skewed to the
left, with a long tail on the negative side and a short tail on the positive side. Negative
skewness is often characterized by a mean that is less than the median.

Example:
A researcher wants to study the weight of a group of dogs. They collect data on the weight of 50
dogs and calculate the skewness of the data. The mean weight is 50 pounds, the mode is 45
pounds, and the standard deviation is 22.4. The skewness is (50 - 45) / 22.4 ≈ 0.22, which
indicates that the distribution is slightly skewed to the right.
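Pearson's mode skewness is just one division, so a sketch is short; the function name is our own, and the inputs are the dog-weight summary numbers:

```python
def pearson_mode_skewness(mean: float, mode: float, std_dev: float) -> float:
    """Pearson's first skewness coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / std_dev

# Dog-weight numbers: mean 50 lb, mode 45 lb, standard deviation 22.4 lb.
skew = pearson_mode_skewness(mean=50, mode=45, std_dev=22.4)
print(round(skew, 2))   # 0.22: a slight positive (rightward) skew
```
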

Kurtosis
Kurtosis is a statistical measure that describes the peakedness or flatness of a distribution
relative to a normal distribution. Using the formula below, a normal distribution has kurtosis 3.
Kurtosis greater than 3 (positive excess kurtosis) indicates that the distribution is more peaked
and heavier-tailed than a normal distribution, while kurtosis less than 3 (negative excess
kurtosis) indicates that it is flatter than a normal distribution.

Characteristics of kurtosis:
● Kurtosis is a statistical measure that describes the peakedness or flatness of a
distribution.
● Kurtosis is used to quantify the degree to which a distribution is peaked or flat relative to
a normal distribution.
● Kurtosis is compared with 3, the kurtosis of a normal distribution; the difference
(kurtosis minus 3) is called the excess kurtosis.
● Kurtosis greater than 3 indicates that the distribution is more peaked than a normal
distribution.
● Kurtosis less than 3 indicates that the distribution is flatter than a normal distribution.

Formula for kurtosis:

The formula for kurtosis is:

kurtosis = (sum of (x - mu)^4 / n) / (SD^4)

Where:
● kurtosis is the kurtosis of the distribution
● x is a value in the dataset
● mu is the mean of the distribution
● n is the number of values in the dataset
● SD is the standard deviation of the distribution

Distributions are commonly classified by their excess kurtosis, which is the kurtosis minus 3
(the kurtosis of a normal distribution):
● Leptokurtic: positive excess kurtosis (kurtosis > 3); the distribution is more peaked and
heavier-tailed than a normal distribution.
● Platykurtic: negative excess kurtosis (kurtosis < 3); the distribution is flatter and
lighter-tailed than a normal distribution.
● Mesokurtic: zero excess kurtosis (kurtosis = 3); the distribution has the same
peakedness as a normal distribution.

Example:
A researcher wants to study the height of a group of trees. They collect data on the height of 50
trees and calculate the kurtosis of the data. The mean height is 50 feet, the standard deviation
is 10 feet, and the kurtosis is 2.0. Since 2.0 is less than 3, the distribution of the tree heights is
flatter (platykurtic) than a normal distribution.
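The kurtosis formula translates directly into code. This sketch implements it from scratch (the function and dataset are ours, invented to show a flat, platykurtic case):

```python
def kurtosis(data: list[float]) -> float:
    """Raw (Pearson) kurtosis: the fourth moment about the mean divided by SD^4.
    A normal distribution has raw kurtosis 3; excess kurtosis subtracts 3."""
    n = len(data)
    mu = sum(data) / n
    variance = sum((x - mu) ** 2 for x in data) / n   # population variance
    fourth_moment = sum((x - mu) ** 4 for x in data) / n
    return fourth_moment / variance ** 2

# Evenly spread (flat) data: kurtosis comes out below 3, i.e. platykurtic.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(round(kurtosis(flat), 2))       # 1.77
print(round(kurtosis(flat) - 3, 2))   # -1.23, the (negative) excess kurtosis
```
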

Hypothesis testing: Formulation and types; null hypothesis, alternate hypothesis, type I and
type II errors, level of significance, power of the test, p-value. Concept of standard error and
confidence interval.

Hypothesis testing: Formulation and types; null hypothesis, alternate hypothesis
Hypothesis testing is a statistical procedure that is used to determine whether there is sufficient
evidence to support a hypothesis about a population. The process of hypothesis testing involves
formulating a hypothesis about the population, collecting data from a sample, and then
evaluating the data to determine whether the hypothesis is supported or rejected.

Formulation of a hypothesis:
A hypothesis is a statement about a population that is tested using statistical analysis. In order
to formulate a hypothesis, it is important to identify the population of interest and the variable of
interest. The hypothesis should be a clear and concise statement about the relationship
between the variable and the population.

Types of hypothesis:
There are two types of hypothesis: the null hypothesis and the alternative hypothesis.
● Null hypothesis: The null hypothesis is a statement that there is no relationship between
the variable and the population. The null hypothesis is denoted as H0.
● Alternative hypothesis: The alternative hypothesis is a statement that there is a
relationship between the variable and the population. The alternative hypothesis is
denoted as H1.

Example:
A researcher wants to study the relationship between diet and weight loss. They formulate the
following hypothesis:
● H0: There is no relationship between diet and weight loss.
● H1: There is a relationship between diet and weight loss.

The researcher collects data from a sample of people who are on different diets and measures
their weight loss. They then use statistical analysis to evaluate the data and determine whether
the null hypothesis or the alternative hypothesis is supported. If the data supports the alternative
hypothesis, the researcher can conclude that there is evidence of a relationship between diet
and weight loss. If the data does not support the alternative hypothesis, the researcher fails to
reject the null hypothesis, which means there is insufficient evidence of a relationship, not proof
that no relationship exists.

Null Hypothesis (in more detail)


The null hypothesis is a statistical hypothesis that is used in hypothesis testing to determine
whether there is sufficient evidence to support a hypothesis about a population. The null
hypothesis is a statement that there is no relationship between the variable of interest and the
population. The null hypothesis is denoted as H0.

Characteristics of the null hypothesis:

● The null hypothesis is a statement that there is no relationship between the variable of
interest and the population.
● The null hypothesis is denoted as H0.
● The null hypothesis is tested using statistical analysis to determine whether there is
sufficient evidence to reject it in favor of an alternative hypothesis.

Example:
A researcher wants to study the relationship between diet and weight loss. They formulate the
following null hypothesis:
● H0: There is no relationship between diet and weight loss.

The researcher collects data from a sample of people who are on different diets and measures
their weight loss. They then use statistical analysis to evaluate the data and determine whether
the null hypothesis is supported or rejected. If the data does not support the null hypothesis, the
researcher can conclude that there is evidence of a relationship between diet and weight loss. If
the data is consistent with the null hypothesis, the researcher fails to reject it; this indicates
insufficient evidence of a relationship rather than proof that no relationship exists.

Alternative Hypothesis (in more detail)


The alternative hypothesis is a statistical hypothesis that is used in hypothesis testing to
determine whether there is sufficient evidence to support a hypothesis about a population. The
alternative hypothesis is a statement that there is a relationship between the variable of interest
and the population. The alternative hypothesis is denoted as H1.

Characteristics of the alternative hypothesis:


● The alternative hypothesis is a statement that there is a relationship between the
variable of interest and the population.
● The alternative hypothesis is denoted as H1.
● The alternative hypothesis is tested using statistical analysis to determine whether there
is sufficient evidence to support it over the null hypothesis.

Example:
A researcher wants to study the relationship between diet and weight loss. They formulate the
following alternative hypothesis:

H1: There is a relationship between diet and weight loss.

The researcher collects data from a sample of people who are on different diets and measures
their weight loss. They then use statistical analysis to evaluate the data and determine whether
the alternative hypothesis is supported or rejected. If the data supports the alternative
hypothesis, the researcher can conclude that there is evidence of a relationship between diet
and weight loss. If it does not, the researcher fails to reject the null hypothesis, which indicates
insufficient evidence of a relationship rather than proof that none exists.

Type I and Type II Error


Type I and type II errors are statistical errors that can occur when conducting hypothesis testing.

Type I error:
A type I error, also known as a false positive, is a statistical error that occurs when the null
hypothesis is rejected, but it is actually true. In other words, a type I error occurs when the
researcher concludes that there is a relationship between the variable of interest and the
population, but there is actually no relationship.

Characteristics of type I error:


● A type I error, also known as a false positive, is a statistical error that occurs when the
null hypothesis is rejected, but it is actually true.
● A type I error occurs when the researcher concludes that there is a relationship between
the variable of interest and the population, but there is actually no relationship.
● The probability of a type I error is denoted as alpha (α).

Example:
A researcher wants to study the relationship between diet and weight loss. They formulate the
following null and alternative hypotheses:
● H0: There is no relationship between diet and weight loss.
● H1: There is a relationship between diet and weight loss.

The researcher collects data from a sample of people who are on different diets and measures
their weight loss. They set the probability of a type I error at 0.05 (α = 0.05). After conducting the
statistical analysis, the researcher finds that the data supports the alternative hypothesis.
However, after further investigation, it turns out that the relationship between diet and weight
loss was due to a confounding variable that was not controlled for in the study. In this case, the
researcher has made a type I error, as they rejected the null hypothesis (H0), but it was actually
true.

Type II error:
A type II error, also known as a false negative, is a statistical error that occurs when the null
hypothesis is not rejected, but it is actually false. In other words, a type II error occurs when the
researcher concludes that there is no relationship between the variable of interest and the
population, but there is actually a relationship.

Characteristics of type II error:

● A type II error, also known as a false negative, is a statistical error that occurs when the
null hypothesis is not rejected, but it is actually false.
● A type II error occurs when the researcher concludes that there is no relationship
between the variable of interest and the population, but there is actually a relationship.
● The probability of a type II error is denoted as beta (β).

Example:
A researcher wants to study the relationship between diet and weight loss. They formulate the
following null and alternative hypotheses:

● H0: There is no relationship between diet and weight loss.
● H1: There is a relationship between diet and weight loss.

The researcher collects data from a sample of people who are on different diets and measures
their weight loss. They set the probability of a type II error at 0.20 (β = 0.20). After conducting
the statistical analysis, the researcher finds that the data does not support the alternative
hypothesis. However, after further investigation, it turns out that the sample size was too small
to detect the relationship between diet and weight loss. In this case, the researcher has made a
type II error, as they did not reject the null hypothesis (H0), but it was actually false.
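The meaning of a type I error rate can be made concrete by simulation. In the sketch below (a toy scenario, not the diet study), the null hypothesis is actually true, yet a fixed fraction of experiments still reject it, and that fraction comes out close to the chosen α:

```python
import random

random.seed(42)

# Simulate many experiments in which H0 is TRUE: a fair coin with p = 0.5.
# A two-sided z-test at alpha = 0.05 rejects H0 when the head count in
# n = 100 flips is more than 1.96 standard deviations (9.8 heads) from 50.
# The fraction of these (wrong) rejections estimates the type I error rate.
n, trials = 100, 10_000
cutoff = 1.96 * (0.25 / n) ** 0.5 * n   # 1.96 SDs of the head count = 9.8

rejections = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(n))
    if abs(heads - 50) > cutoff:
        rejections += 1

rate = rejections / trials
print(rate)   # close to alpha = 0.05
```

This is the sense in which α is the probability of a type I error: even with a true null hypothesis, about 5% of experiments will reject it by chance.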
Level of significance:
The level of significance is the probability of making a type I error when conducting hypothesis
testing. It is denoted as alpha (α) and is typically set at 0.05 or 0.01. A lower level of significance
means that the researcher is more conservative in rejecting the null hypothesis, while a higher
level of significance means that the researcher is less conservative in rejecting the null
hypothesis.

Characteristics of level of significance:

● The level of significance is the probability of making a type I error when conducting
hypothesis testing.
● It is denoted as alpha (α) and is typically set at 0.05 or 0.01.
● A lower level of significance means that the researcher is more conservative in rejecting
the null hypothesis.
● A higher level of significance means that the researcher is less conservative in rejecting
the null hypothesis.

Example:

A researcher wants to study the relationship between diet and weight loss. They formulate the
following null and alternative hypotheses:

● H0: There is no relationship between diet and weight loss.
● H1: There is a relationship between diet and weight loss.

The researcher collects data from a sample of people who are on different diets and measures
their weight loss. They set the level of significance at 0.01 (α = 0.01). After conducting the
statistical analysis, the researcher finds that the data supports the alternative hypothesis at the
0.01 level of significance. This means that, if there were truly no relationship, a result at least
this extreme would occur less than 1% of the time by chance, so the researcher concludes that
there is a real relationship between the two variables.

Power of the test:

The power of the test is the probability of rejecting the null hypothesis when it is false. It is
denoted as 1 - beta (1 - β) and is typically set at 0.80 or 0.90. A higher power of the test means
that the researcher is more likely to reject the null hypothesis when it is false, while a lower
power of the test means that the researcher is less likely to reject the null hypothesis when it is
false.

Characteristics of power of the test:


● The power of the test is the probability of rejecting the null hypothesis when it is false.
● It is denoted as 1 - beta (1 - β) and is typically set at 0.80 or 0.90.
● A higher power of the test means that the researcher is more likely to reject the null
hypothesis when it is false.
● A lower power of the test means that the researcher is less likely to reject the null
hypothesis when it is false.

Example:
A researcher wants to study the relationship between diet and weight loss. They formulate the
following null and alternative hypotheses:

● H0: There is no relationship between diet and weight loss.
● H1: There is a relationship between diet and weight loss.

The researcher collects data from a sample of people who are on different diets and measures
their weight loss. They design the study, for example by choosing a large enough sample size,
so that the power of the test is 0.90 (1 - β = 0.90). This means that if a relationship between
diet and weight loss truly exists, the test has a 90% chance of detecting it, that is, of correctly
rejecting the null hypothesis.

P-value:
The p-value is the probability of obtaining a test statistic that is at least as extreme as the
observed test statistic, given that the null hypothesis is true. It is used to determine whether the
observed data supports the null hypothesis or the alternative hypothesis. If the p-value is less
than the level of significance (α), then the null hypothesis is rejected in favor of the alternative
hypothesis. If the p-value is greater than the level of significance (α), then the null hypothesis is
not rejected.

Characteristics of p-value:
● The p-value is the probability of obtaining a test statistic that is at least as extreme as the
observed test statistic, given that the null hypothesis is true.
● It is used to determine whether the observed data supports the null hypothesis or the
alternative hypothesis.
● If the p-value is less than the level of significance (α), then the null hypothesis is rejected
in favor of the alternative hypothesis.
● If the p-value is greater than the level of significance (α), then the null hypothesis is not
rejected.

Example:
A researcher wants to study the relationship between diet and weight loss. They formulate the
following null and alternative hypotheses:

● H0: There is no relationship between diet and weight loss.
● H1: There is a relationship between diet and weight loss.

The researcher collects data from a sample of people who are on different diets and measures
their weight loss. They set the level of significance at 0.05 (α = 0.05). After conducting the
statistical analysis, the researcher calculates the p-value to be 0.03. Since the p-value is less
than the level of significance (0.03 < 0.05), the researcher rejects the null hypothesis in favor of
the alternative hypothesis. This means that the researcher concludes that there is a relationship
between diet and weight loss.
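For a z test statistic, a two-sided p-value can be computed with nothing more than the standard library's error function. A small sketch (the z value 2.17 is a made-up illustration chosen to give p ≈ 0.03):

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) under the null hypothesis, for a z test statistic."""
    return 2 * (1 - normal_cdf(abs(z)))

# Hypothetical study result: a z statistic of 2.17 gives p of about 0.03,
# which is below alpha = 0.05, so the null hypothesis would be rejected.
print(round(two_sided_p_value(2.17), 3))
```

As a cross-check, z = 1.96 gives a p-value of 0.05, which is why 1.96 is the familiar two-sided critical value at the 0.05 level.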

Standard error:

The standard error is the standard deviation of a sampling distribution; it measures how much a
sample statistic (such as the sample mean) varies from sample to sample. For the mean, it is
estimated as the sample standard deviation divided by the square root of the sample size.

Characteristics of standard error:


● The standard error is the standard deviation of a sampling distribution.
● For the mean, it is estimated as the sample standard deviation divided by the square
root of the sample size.
● The standard error quantifies how precisely a sample statistic estimates the
corresponding population parameter.

Example:

A researcher wants to study the height of students in a school. They collect data from a sample
of 100 students and calculate the mean height to be 175 cm. The standard deviation of the
sample is 15 cm. The standard error of the mean is calculated as follows:

Standard error = standard deviation / square root of sample size

Standard error = 15 cm / √100 = 15 cm / 10

Standard error = 1.5 cm

This means that the standard error of the mean height of the students is 1.5 cm.
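The same calculation in Python, using the numbers from the example:

```python
from math import sqrt

sample_sd = 15.0   # cm, sample standard deviation from the example
n = 100            # sample size

standard_error = sample_sd / sqrt(n)
print(standard_error)   # 1.5 (cm)
```
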

Confidence interval:

A confidence interval is a range of values that is likely to include the true value of a population
parameter with a certain level of confidence. It is calculated by adding and subtracting a margin
of error (the standard error multiplied by a critical value, such as 1.96 for 95% confidence) from
the point estimate of the population parameter. The level of confidence is typically set at 95%
or 99%.

Characteristics of confidence interval:


● A confidence interval is a range of values that is likely to include the true value of a
population parameter with a certain level of confidence.
● It is calculated by adding and subtracting a margin of error (the standard error
multiplied by a critical value) from the point estimate of the population parameter.
● The level of confidence is typically set at 95% or 99%.

Example:

A researcher wants to study the height of students in a school. They collect data from a sample
of 100 students and calculate the mean height to be 175 cm. The standard deviation of the
sample is 15 cm and the standard error of the mean is 1.5 cm. The 95% confidence interval for
the mean height of the students is calculated as follows:

Lower limit = mean height - (standard error x 1.96)

Lower limit = 175 cm - (1.5 cm x 1.96)

Lower limit = 172.06 cm

Upper limit = mean height + (standard error x 1.96)

Upper limit = 175 cm + (1.5 cm x 1.96)

Upper limit = 177.94 cm

This means that the 95% confidence interval for the mean height of the students is between
172.06 cm and 177.94 cm: the researcher is 95% confident that the true mean height of the
students in the school falls within this range.
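The interval arithmetic can be checked in a few lines of Python (note that the margin of error is 1.5 × 1.96 = 2.94 cm):

```python
from math import sqrt

mean_height = 175.0   # cm, sample mean from the example
sample_sd = 15.0      # cm, sample standard deviation
n = 100               # sample size
z = 1.96              # critical value for a 95% confidence level

se = sample_sd / sqrt(n)   # standard error: 1.5 cm
margin = z * se            # margin of error: 2.94 cm
lower, upper = mean_height - margin, mean_height + margin
print(round(lower, 2), round(upper, 2))   # 172.06 177.94
```

Using z = 2.576 instead of 1.96 would give the wider 99% confidence interval.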
