
- Population:

A population is the complete set of all individuals, items, or
observations of interest for a particular statistical study. It
encompasses every member of the group being studied, defined by a
set of shared characteristics.
Example:
You are conducting a study to determine the average height of adults
in Pakistan. In this case:
The population would include all adults living in Pakistan.

- Sample:
A sample is a subset of individuals, items, or observations selected
from a larger population, which is used to represent and analyze the
characteristics of the entire population in a statistical study.
Example:
Suppose you want to study the average income of households in
Pakistan. It would be impractical to collect data from every household
(the population). Instead, you might select a sample of 1,000
households from various regions. By analyzing this smaller,
representative group, you can infer the average income for the entire
population.

- Parameter:
A parameter is a numerical value or characteristic that describes a
specific feature of a population in statistical analysis. It represents a
fixed value, as it pertains to the entire population, but is often
unknown and estimated using a sample.
Example:
Suppose you're studying the average income of all households in
Pakistan (the population).
The parameter would be the actual mean income of all households in
Pakistan.
- Variable:
A variable is a measurable characteristic, feature, or quantity that can
assume different values or attributes across individuals, objects, or
over time. It serves as the fundamental unit of observation and analysis
in statistics and research, facilitating the study of relationships,
patterns, and behaviors within a given dataset or population.
Independent Variables: Variables that are manipulated or controlled
to observe their effect on another variable. For example, the dosage of
medicine in a clinical trial.
Dependent Variables: Variables that are measured to assess the
effect of the independent variable. For example, the patient’s blood
pressure as affected by the medicine.
Controlled Variables: Variables that are kept constant to avoid
confounding the results.
Extraneous Variables: Variables that are not of interest but could
influence the dependent variable if not controlled.

- Descriptive and inferential statistics:


Descriptive statistics:
Descriptive statistics is a branch of statistics that deals with
summarizing and describing the main features of a dataset. It provides
methods for organizing, visualizing, and presenting data in a meaningful
and informative way. Descriptive statistics describe the characteristics of the
data set under study without generalizing beyond the analyzed data.
Common measures and techniques in descriptive statistics:
measures of central tendency (such as mean, median, and mode),
measures of dispersion (such as range, variance, and standard
deviation), frequency distributions (histograms, frequency tables), and
graphical representations (box plots, bar charts, pie charts, etc.). These
methods help to provide a clear and concise summary of the data,
facilitating easier interpretation and understanding.
Measures of Central Tendency:
Measures of central tendency represent the center or typical value of a
dataset. They provide insight into where the bulk of the data points lie.
The three main measures of central tendency are:
Mean: The arithmetic average of all the values in the dataset.
The mean is used when the data is roughly symmetrical, free of
outliers, meets parametric assumptions, and comes from a large sample.
It is also used for continuous (interval or ratio) data.
Median: The middle value of the dataset when it is arranged in ascending
or descending order. When the number of observations is even, the
median is the average of the two middle values.
It is used when it is important to know the middle point of the data
rather than the arithmetic average.
It is used when the data contains outliers, is nonparametric or ordinal,
or is not normally distributed (skewed).
It is also used when the questionnaire is designed by the researcher
himself; otherwise the mean is used.
Mode: The value that occurs most frequently in the dataset.
It is used when you want to identify the most common or frequent
category or value in the data.
It is used when the data is nominal, discrete and skewed.
It is also used when a simple interpretation is needed for practical decisions.
We use:
Mean: When the data is normally distributed and free of outliers.
Median: When the data is skewed or contains outliers.
Mode: When dealing with non-numeric or categorical data (e.g.,
favorite color).
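
A minimal Python sketch of the three measures, using the standard
library's statistics module and made-up exam scores:

import statistics

scores = [70, 75, 75, 80, 85, 90, 95]        # hypothetical exam scores

print("Mean:  ", statistics.mean(scores))    # arithmetic average
print("Median:", statistics.median(scores))  # middle value of the sorted data
print("Mode:  ", statistics.mode(scores))    # most frequent value (75 here)
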
Measures of Dispersion:
Measures of dispersion quantify the spread or variability of data points
around the central tendency. They indicate how much the individual
data points deviate from the average. Common measures of dispersion
include:
Range: The difference between the maximum and minimum values in
the dataset.
Variance: The average of the squared differences between each data
point and the mean.
Standard Deviation: The square root of the variance, representing the
average distance of data points from the mean.
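
A short sketch of the three measures of dispersion in Python (standard
library only; the data values are invented for illustration):

import statistics

data = [4, 8, 6, 5, 3, 7]                   # hypothetical measurements

data_range = max(data) - min(data)          # range: maximum minus minimum
variance = statistics.pvariance(data)       # population variance
std_dev = statistics.pstdev(data)           # population standard deviation

print(data_range, variance, std_dev)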

Frequency Distributions and Graphical Representations:


Frequency distributions display the frequency of occurrence of different
values or ranges in a dataset. They help to visualize the distribution of
data across various categories. Common graphical representations
used in descriptive statistics include:
Bar charts: Bar graphs display data using rectangular bars where the
length of each bar is proportional to the value it represents. The bars
can be arranged vertically or horizontally.
We use them when comparing frequencies or counts across different
categories (e.g., survey responses by category).
Suitable for nominal or ordinal data.
Example: Showing the number of students favoring different subjects
(Math, Science, History).

Histograms: Bar charts that display the frequency of data points
within predefined intervals or bins.
We use them when describing the frequency distribution of continuous
(interval or ratio) data (e.g., test scores, heights).
They emphasize the shape, spread, and central tendency of the data.
Example: A histogram displaying the distribution of exam scores
grouped into ranges like 0–10, 11–20, etc.

Pie Charts: Circular charts representing the proportions of different
categories within a dataset.
We use them when illustrating parts of a whole, such as percentages of
a total.
Best for a limited number of categories (typically less than 6-8) to keep
the chart clear.
Example: A pie chart showing the market share percentages of
different smartphone brands.

Inferential statistics:
Inferential statistics, on the other hand, involve making inferences,
predictions, or generalizations about a larger population based on data
collected from a sample of that population.
It extends the findings from a sample to the population from which the
sample was drawn. Inferential statistics allow researchers to draw
conclusions, test hypotheses, and make predictions about populations,
even when it is impractical or impossible to study the entire population
directly.

Key methods in inferential statistics include: hypothesis testing, where
researchers test hypotheses about population parameters using sample
data; regression analysis, where relationships between variables are
examined and used to make predictions; and confidence intervals, which
provide estimates of population parameters and their uncertainty levels.

Hypothesis Testing:
Hypothesis testing is a fundamental technique in inferential statistics
used to make decisions or draw conclusions about a population
parameter based on sample data. It involves formulating a null
hypothesis (H0) and an alternative hypothesis (Ha), collecting sample
data, and using statistical tests to determine whether there is enough
evidence to reject the null hypothesis in favor of the alternative
hypothesis. Common statistical tests for hypothesis testing include t-
tests, chi-square tests, ANOVA (Analysis of Variance), and z-tests.
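
A minimal sketch of a one-sample t-test with SciPy; the sample values
and the hypothesized population mean of 100 are assumptions made only
for this example:

from scipy import stats

sample = [102, 98, 110, 105, 99, 104, 108, 101]    # hypothetical measurements
mu_0 = 100                                         # mean claimed by H0

t_stat, p_value = stats.ttest_1samp(sample, mu_0)  # two-tailed one-sample t-test

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject H0 in favor of Ha")
else:
    print(f"p = {p_value:.3f}: fail to reject H0")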

Regression Analysis:
Regression analysis is a statistical technique used to examine the
relationship between one or more independent variables (predictors)
and a dependent variable (outcome) and to make predictions based on
this relationship. It helps to identify and quantify the strength and
direction of the association between variables and to predict the
dependent variable's value for given independent variable values.
Common types of regression analysis include linear, logistic,
polynomial, and multiple regression.
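
A small sketch of simple linear regression using scipy.stats.linregress,
with invented study-hours and exam-score data:

from scipy import stats

hours = [1, 2, 3, 4, 5, 6, 7, 8]           # independent variable (predictor)
scores = [52, 55, 61, 64, 70, 72, 78, 83]  # dependent variable (outcome)

result = stats.linregress(hours, scores)

print("slope:", result.slope)              # change in score per extra hour
print("intercept:", result.intercept)      # predicted score at 0 hours
print("r:", result.rvalue)                 # strength/direction of the relationship
print("p-value:", result.pvalue)           # tests H0: slope = 0

# Predict the score for a new value of the independent variable
print("predicted score for 5.5 hours:", result.intercept + result.slope * 5.5)
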
Confidence Intervals:
Confidence intervals provide a range of values within which the true
population parameter is likely to lie with a certain level of confidence
based on sample data. They quantify the uncertainty associated with
estimating population parameters from sample data. Confidence
intervals are calculated using point estimates, such as sample means
or proportions, and their standard errors. The confidence level
represents the probability that the interval contains the true population
parameter. Commonly used confidence levels include 90%, 95%, and
99%.
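
A sketch of a 95% confidence interval for a population mean, built from
the point estimate and its standard error with a t critical value (the
sample values are hypothetical):

import math
import statistics
from scipy import stats

sample = [23.1, 25.4, 22.8, 24.9, 26.0, 23.7, 25.1, 24.3]  # hypothetical data
n = len(sample)

mean = statistics.mean(sample)                   # point estimate
se = statistics.stdev(sample) / math.sqrt(n)     # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)            # critical value for 95% confidence

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")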

- Discrete and Continuous data:


Discrete data:
Discrete data consists of distinct and separate values. It is countable
and often involves whole numbers, with no possible intermediate
values.
Example:
Number of students in a class (e.g., 25, 30, 32).
Number of cars in a parking lot (e.g., 10, 20, 35).
Continuous Data:
Continuous data consists of values that can take any value within a
range. It is measurable and can include fractions or decimals.
Example:
Height of individuals (e.g., 5.5 ft, 6.1 ft).
Temperature (e.g., 22.3°C, 25.5°C)

- Theoretical and applied statistics:


Theoretical Statistics:
Concerned with the development and understanding of statistical
theories, methods, and concepts. It involves deriving formulas, proving
theorems, and studying statistical models and their properties.
Example: Deriving the formula for standard deviation or creating new
statistical distributions.
Applied Statistics:
Using statistical methods to solve real-world problems and analyze
data in practical situations. It deals with the application of established
statistical techniques to analyze data and draw meaningful
conclusions.
Example: Using regression analysis to predict sales figures for a
company.

Theoretical statistics is abstract and focuses on the "why" and "how"
behind statistical techniques, while applied statistics is practical and
focuses on "using" those techniques in real-world scenarios. In
theoretical statistics, the rules are fixed and based on strict
assumptions to ensure clarity in defining the methods.
In applied statistics, flexibility is introduced based on the context,
such as whether the population standard deviation is known and the
shape of the data's distribution.

The contrast between theoretical and applied statistics regarding the
use of z-tests and t-tests often arises due to practical adjustments
made in real-world scenarios versus textbook or theoretical guidelines.
Theoretical Perspective (Traditional Rule of n ≥ 30 for Z-Test, n
< 30 for T-Test):
Why n ≥ 30 for Z-Test:
The Central Limit Theorem states that, for sample sizes of 30 or more,
the sampling distribution of the mean tends to approximate a normal
distribution, regardless of the population's actual distribution. This
allows the use of the z-test.
The population standard deviation (σ) is assumed to be known, a key
requirement for the z-test.
Why n < 30 for T-Test:
For small sample sizes, the data may not approximate a normal
distribution as well, so the t-distribution is used, which accounts for
more variability due to smaller sample sizes.
The t-test is designed for situations where the population standard
deviation is unknown, and the sample standard deviation is used as an
estimate, introducing extra uncertainty.
Applied Perspective (More Flexibility in Practice):
Why Z-Test Can Be Used for Small Samples (n < 30):
In applied statistics, if the population standard deviation (σ) is known,
the z-test can still be applied, even for small sample sizes. This is
because the formula and assumptions remain valid as long as σ is
known, and the data is approximately normally distributed.
Why T-Test Can Be Used for Large Samples (n ≥ 30):
Practically, the t-test can be used for large samples because the t-
distribution approaches the normal distribution as the sample size
increases. Therefore, the distinction between z and t becomes
negligible for larger datasets.
Additionally, in most real-world scenarios, the population standard
deviation is unknown, making the t-test more suitable regardless of the
sample size.
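
A sketch contrasting the two tests on the same small hypothetical
sample: a z-test under the assumption that the population standard
deviation is known, and a t-test when it must be estimated from the
sample:

import math
import statistics
from scipy import stats

sample = [72, 75, 69, 78, 74, 71, 77, 73]   # hypothetical sample (n = 8 < 30)
mu_0 = 70                                   # hypothesized population mean

# z-test: usable even for small n if sigma is known and data are roughly normal
sigma = 4.0                                 # assumed known population std dev
z = (statistics.mean(sample) - mu_0) / (sigma / math.sqrt(len(sample)))
p_z = 2 * stats.norm.sf(abs(z))             # two-tailed p-value from the normal

# t-test: sigma unknown, estimated by the sample standard deviation
t_stat, p_t = stats.ttest_1samp(sample, mu_0)

print(f"z-test: z = {z:.2f}, p = {p_z:.3f}")
print(f"t-test: t = {t_stat:.2f}, p = {p_t:.3f}")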

- Types of data:
Nominal Data:
Nominal data consists of categories without any inherent order or
ranking. The numbers or labels simply represent different groups.
Example:
Gender (e.g., male, female)
Blood type (e.g., A, B, AB, O)
Types of cuisine (e.g., Italian, Chinese, Mexican)
Ordinal Data:
Ordinal data represents categories that have a meaningful order or
ranking, but the differences between the ranks are not precisely
measured.
Example:
Educational level (e.g., primary, secondary, tertiary)
Likert scale responses (e.g., strongly disagree, disagree, neutral, agree,
strongly
agree)
Performance ratings (e.g., bronze, silver, gold)

Interval Data:
Interval data features numerical values with equal intervals between
measurements, which allows for meaningful differences between
values; however, there is no true zero point.
Example:
Temperature in Celsius or Fahrenheit (the difference between 20°C and
30°C is the same as between 30°C and 40°C, but 0°C does not indicate
the absence of temperature).
Calendar dates (the interval between years is equal, but there is no
“zero” year in the modern calendar).
Ratio Data:
Ratio data is numerical data with equal intervals and a meaningful zero
point that indicates the absence of the quantity being measured. This
allows for comparisons in both differences and ratios.
Example:
Weight (e.g., 0 kg means no weight, and 20 kg is twice as heavy as 10
kg)
Height (e.g., 0 cm means no height)
Time duration (e.g., 0 seconds signifies no time elapsed)

- Primary and secondary data:


Primary Data:
Primary data is information collected firsthand by the researcher
specifically for the purpose of their study. This data is original and
tailored to answer the researcher's specific questions.
Characteristics:
Collected Directly: Data comes directly from its source through
methods like surveys, interviews, experiments, or observations.
Specific Purpose: It's gathered with a particular research objective in
mind.
Control: The researcher controls the data collection process, ensuring it
is relevant and of high quality.
Examples:
Surveys and Questionnaires: Distributing a survey to students to
understand their study habits.
Interviews: Conducting one-on-one interviews to gather personal
experiences or opinions.
Experiments: Running a controlled experiment to test the effect of a
new teaching method on student performance.
Observations: Recording behaviors in a natural setting, such as
observing customer interactions in a store.
Secondary Data:
Secondary data is information that was collected by someone else for
another purpose but is later used by a researcher for their study. This
data is not gathered firsthand but is already available.
Characteristics:
Already Available: Data is obtained from existing sources such as
databases, research reports, government publications, or previous
studies.
Broad Use: Can be used as background information or to support
findings from primary data.
Cost and Time Effective: Since it's already collected, it saves time and
resources.
Examples:
Government Reports: Utilizing census data or economic indicators
published by the government.
Research Articles and Journals: Referencing previously published
studies in academic journals.
Organizational Records: Using company sales data or customer
behavior reports.
Online Databases: Data repositories like those provided by research
institutions, the World Bank, or other international organizations.
- Frequency:
Frequency refers to the number of times a particular value or event
occurs within a dataset.
It is used for categorical data, i.e., nominal data. Because nominal data
is the most basic level of measurement, only limited analyses can be
applied to it. As the level of measurement rises, the analyses available
at the lower levels remain applicable alongside the additional ones.
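
A quick sketch of a frequency table for nominal data using Python's
collections.Counter (the category labels are invented):

from collections import Counter

blood_types = ["A", "O", "B", "O", "AB", "A", "O", "A", "B", "O"]  # nominal data

freq = Counter(blood_types)                  # count occurrences of each category
for category, count in freq.most_common():   # most frequent category first
    print(category, count)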

- Level of significance:
The level of significance (denoted as α) is a statistical threshold
used to determine whether the results of a hypothesis test are
statistically significant. It defines the probability of rejecting the
null hypothesis when it is actually true (a Type I error).
It sets the standard for determining how much risk of error is
acceptable in a test.
A lower significance level reduces the chance of a Type I error but
makes it harder to detect true effects.
Researchers choose this level based on the desired balance between
minimizing errors and ensuring sensitivity.

A result is considered highly significant when the p-value is
extremely small, indicating very strong evidence against the null
hypothesis. Among the commonly used significance levels:
0.01 (1%): This threshold indicates a highly significant result. If the
p-value is less than 0.01, there is less than a 1% probability of
obtaining such a result by chance alone if the null hypothesis were true.
0.05 (5%): While still significant, the result at this level is less rigorous
compared to 0.01. It means there’s up to a 5% chance of a false
positive (rejecting a true null hypothesis).
- One tail and two tail:
One-Tailed Test:
A test where the region of rejection is only on one side (tail) of the
probability distribution.
Unidirectional Hypothesis: A one-tailed test is used when you have
a specific direction in mind for the effect (positive or negative).
Example:
Null Hypothesis (H₀): The new teaching method has no effect (mean
score is equal to the standard).
Alternative Hypothesis (H₁): The new teaching method improves scores
(mean score is greater than the standard).
Used when you're only interested in testing whether a value is greater
than or less than a certain point.
Example:
Testing whether a new drug increases recovery rates compared to the
standard drug.
Two-Tailed Test:
A test where the region of rejection is divided between both tails of the
probability distribution.
Two-Directional Hypothesis: A two-tailed test is used when you’re
testing for any difference, regardless of direction.
Example:
Null Hypothesis (H₀): The new teaching method has no effect (mean
score is equal to the standard).
Alternative Hypothesis (H₁): The new teaching method has an effect
(mean score is not equal to the standard).
Used when you're interested in testing whether a value is different
from a certain point, in either direction.
Example:
Testing whether a new drug has a different effect (either better or
worse) compared to the standard drug.
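
A sketch of the same one-sample t-test run two-tailed and one-tailed
with SciPy; the recovery scores and the benchmark of 13 are invented,
and the alternative argument requires a reasonably recent SciPy version:

from scipy import stats

treated = [12, 15, 14, 16, 13, 17, 15, 14]   # hypothetical recovery scores
mu_0 = 13                                    # standard-treatment benchmark

# Two-tailed: Ha says the mean differs from 13 in either direction
t2, p_two = stats.ttest_1samp(treated, mu_0, alternative="two-sided")

# One-tailed: Ha says the mean is greater than 13 (one specific direction)
t1, p_one = stats.ttest_1samp(treated, mu_0, alternative="greater")

print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")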

- Standard deviation:
Standard deviation (denoted as σ for a population or s for a sample) is
a measure of how spread out or dispersed the data is around the mean
(average) of a dataset. It provides insight into the variability or
consistency of data values.
Small Standard Deviation:
Data points are closely clustered around the mean.
Indicates low variability.
Example: Exam scores like 88, 89, 90, 91 (small spread).
Large Standard Deviation:
Data points are widely spread out from the mean.
Indicates high variability.
Example: Exam scores like 60, 70, 85, 100 (large spread).
Helps compare datasets: A dataset with a smaller standard
deviation is more consistent than one with a larger standard
deviation.
The higher the standard deviation, the greater the dispersion of the
data about the mean and the greater the variability of the data.
Small Standard Deviation (Homogeneous): Data points are tightly
packed near the mean → low variability.
Large Standard Deviation (Heterogeneous): Data points are more
spread out from the mean → high variability.
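
The two exam-score examples above can be checked with a few lines of
Python:

import statistics

consistent = [88, 89, 90, 91]    # small spread around the mean
variable = [60, 70, 85, 100]     # large spread around the mean

print(statistics.stdev(consistent))  # small standard deviation (homogeneous)
print(statistics.stdev(variable))    # large standard deviation (heterogeneous)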

- Quartiles, Deciles, and Percentiles:


These are statistical measures that divide a dataset into equal parts,
providing a way to summarize and interpret data distributions. Here's a
breakdown of each:
Quartiles:
Divides data into 4 equal parts, each containing 25% of the data
points.
Includes:
Q1 (1st Quartile): 25th percentile (25% of the data lies below it).
Q2 (2nd Quartile/Median): 50th percentile (the middle of the data).
Q3 (3rd Quartile): 75th percentile (75% of the data lies below it).
Helps describe the spread of data, detect outliers, and analyze ranges
(e.g., interquartile range, IQR).
Deciles:
Divides data into 10 equal parts, each containing 10% of the data
points.
Example:
D1 (1st Decile): the value below which the bottom 10% of the data fall.
D9 (9th Decile): the value below which 90% of the data fall (the top
10% lies above it).
Offers a finer level of detail compared to quartiles, often used in
income distribution and academic ranking analysis.
Percentiles:
Divides data into 100 equal parts, each containing 1% of the data
points.
Example:
P90 (90th Percentile): Data point below which 90% of values fall.
Widely used in education (test scores), health (growth charts), and
sports performance.
We use them:
To summarize large datasets: They condense data into manageable
summaries.
To compare individuals within a group: Helps rank and assess
relative performance (e.g., test scores or income brackets).
To detect trends or outliers: Highlights unusual patterns in data.
In education: Percentiles are used to rank students (e.g., "90th
percentile in test scores").
In economics: Deciles analyze income distribution (e.g., top 10% of
earners).
In healthcare: Percentiles track child growth patterns.
In business: Quartiles measure data variability (e.g., sales
performance).
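
A sketch of quartiles, deciles, and percentiles computed with NumPy
(the dataset is invented):

import numpy as np

data = np.array([12, 15, 17, 20, 22, 25, 28, 30, 33, 36, 40, 45])  # hypothetical values

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartiles (Q2 is the median)
d1, d9 = np.percentile(data, [10, 90])          # 1st and 9th deciles
p90 = np.percentile(data, 90)                   # 90th percentile (same point as D9)
iqr = q3 - q1                                   # interquartile range

print(q1, q2, q3, iqr)
print(d1, d9, p90)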

- Discarding data:
If the data is incomplete (e.g., missing critical values) or contains
significant errors that cannot be corrected, it may need to be
discarded.
Example: Surveys with too many unanswered questions or corrupted
files in electronic data.
If the data violates critical assumptions required for analysis (e.g.,
normality, independence), it may not be usable.
Example: In a study requiring random sampling, biased or non-
randomly collected data may need to be excluded.
Data that does not align with the research objectives or is unrelated to
the research question should be discarded.
Example: Collecting information on heights when studying the
correlation between sleep and mental health.
Extreme outliers that are clearly errors or unrelated to the population
of interest may need to be removed.
Outliers should be carefully evaluated; they may represent valid
phenomena rather than errors.
Data collected without proper consent or in violation of ethical
guidelines must be discarded.
Example: Using personal health data without the subject’s explicit
permission.
Duplicate or redundant data that skews analysis and results should be
removed.
Example: Accidentally including the same participant's responses
more than once.
If the data is suspected to have been tampered with or fabricated, it
must be excluded.
Example: Suspiciously identical responses in a survey or manipulated
experimental data.
If errors occur during data transcription, encoding, or entry that cannot
be corrected, the affected data may need to be discarded.

Always document why and how data is discarded for transparency.


Follow ethical guidelines and institutional protocols while removing
data.
Use methods like sensitivity analysis to determine the impact of
excluding specific data.

- Parametric statistics:
Parametric statistics involve statistical methods that assume the data
follows a certain distribution (usually normal distribution) and meets
specific conditions (e.g., homogeneity of variance).
Used to analyze numerical data and draw conclusions about
relationships or differences in populations.
Example: Techniques like correlation and regression are parametric if
the data meets these assumptions.
Under parametric analysis, various tools are used based on the type
of data and the relationship being studied. Two commonly used
methods are correlation and regression.

Correlation:
Correlation measures the strength and direction of a relationship
between two variables. It tells you whether increases (or decreases) in
one variable are associated with increases (or decreases) in another
variable.
Correlation does not imply causation. Even if two variables have a
strong correlation, one does not necessarily cause the other.
Ranges from -1 to +1.
+1: Perfect positive linear relationship
-1: Perfect negative linear relationship
0: No linear relationship
(a) Spearman’s Rank Correlation:
Denoted as ρ or rs.
Non-parametric and works with ranked (ordinal) data.
Measures the monotonic relationship between the variables (a consistent
but not constant rate of change) and doesn’t assume linearity.
Suitable for data that doesn’t meet parametric assumptions (e.g.,
skewed distributions).
Example: Evaluating the relationship between students’ ranks in math
and science exams.

You use Spearman’s method when dealing with data that is ranked or
when the relationship isn’t necessarily linear.
(b) Pearson’s Correlation:
Denoted as r.
Parametric and requires continuous data on an interval or ratio scale.
Assumes a linear relationship between two continuous variables (the
rate of change between them is constant) and normally distributed data.
Often applied when working with Likert scale responses or continuous
variables.
Example: Studying the correlation between time spent studying and
exam scores.

Pearson is ideal when working with Likert scale data, which is treated
as continuous if it has sufficient scale points.
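
A sketch computing both coefficients on an invented study-time and
exam-score dataset with SciPy:

from scipy import stats

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [50, 54, 60, 62, 68, 71, 77, 85]

r, p_pearson = stats.pearsonr(study_hours, exam_scores)       # linear relationship
rho, p_spearman = stats.spearmanr(study_hours, exam_scores)   # monotonic, rank-based

print(f"Pearson r = {r:.2f} (p = {p_pearson:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_spearman:.3f})")
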
Regression:
Regression goes beyond correlation by modeling the relationship
between a dependent variable (Y) and one or more independent
variables (X).
Hypotheses in Regression:

The null hypothesis (H₀) states there is no relationship between
the variables.

The alternative hypothesis (H₁) posits a relationship exists.

Example: Predicting student scores (Y) based on hours of study (X).

- Impact of regression:
Understand relationships between things.
e.g. How studying affects grades, or how exercise affects
weight loss.
Make predictions about the future.
e.g. if you know how much studying improves grades, you can predict
how well someone will do if they study for 5 hours.
Make better decisions.
e.g. Businesses use regression to decide how much to spend on
advertising to get more sales.
Doctors use it to predict how different treatments affect health.
How regression supports correlation:
Correlation measures the strength and direction of a relationship
between two variables (ranging from -1 to +1).
Regression uses that relationship to predict and explain how one
variable affects another.
If there is a correlation, regression builds on it by giving a prediction
equation.

- Difference between parametric and non-parametric tests:


Parametric Tests:
Based on assumptions about the population distribution (usually
normal distribution).
Require interval or ratio scale data (continuous data).
More powerful when assumptions are met.
Common Parametric Tests:
t-test
ANOVA
Pearson correlation
Regression analysis
Assumptions:
Normal distribution of data
Equal variances (homogeneity of variance)
Interval or ratio data
Independent observations
Non-Parametric Tests:
No strict assumptions about the population distribution (works on
any distribution).
Used when data is ordinal, ranked, or categorical, or when
parametric assumptions are violated.
Less powerful but more flexible.
Common Non-Parametric Tests:
Chi-Square test
Mann-Whitney U test (alternative to independent t-test)
Wilcoxon Signed-Rank test (alternative to paired t-test)
Kruskal-Wallis test (alternative to ANOVA)
Spearman correlation
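
A sketch contrasting a parametric test with its non-parametric
alternative on the same two hypothetical groups:

from scipy import stats

group_a = [23, 25, 28, 30, 27, 26, 24, 29]   # hypothetical scores, group A
group_b = [31, 33, 29, 35, 32, 34, 30, 36]   # hypothetical scores, group B

# Parametric: independent t-test (assumes normality and equal variances)
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# Non-parametric alternative: Mann-Whitney U (rank-based, fewer assumptions)
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test p = {p_t:.4f}")
print(f"Mann-Whitney U p = {p_u:.4f}")
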
- Skewness:
Skewness is a statistical measure that describes asymmetry or lack
of symmetry in a data distribution. In simpler terms, it shows whether
the data is more concentrated on one side of the mean.
If the tail of the distribution extends more to the right (higher values),
it is called positively skewed or right-skewed.
If the tail extends more to the left (lower values), it is called
negatively skewed or left-skewed.
If the distribution is symmetric (zero skewness), as in a normal
distribution, it is neither right- nor left-skewed.
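
A sketch checking the direction of skew with scipy.stats.skew on
invented data:

from scipy import stats

right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]         # long tail toward high values
left_skewed = [1, 7, 11, 12, 13, 13, 14, 14, 15, 15]   # long tail toward low values

print(stats.skew(right_skewed))   # positive -> positively (right) skewed
print(stats.skew(left_skewed))    # negative -> negatively (left) skewed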

- T-test and ANOVA:


T-test:
A t-test compares the means of two groups to see if they are
statistically significantly different from each other.

Types of T-tests:
Independent T-test: Compares two separate groups (e.g.,
treatment vs. control).
Paired sample t-test: Compares the same group at two different
times (pre-test/post-test).
One sample t-test: Compares the mean of a single sample to a
known or hypothesized population mean.
When to use a t-test:
You have two groups.
The dependent variable is continuous (e.g., height, weight, test
scores).
You want to know if the mean of one group is different from the other.

ANOVA:
ANOVA stands for Analysis of Variance. It compares the means of
three or more groups to see if there is at least one significant
difference among them.
Types of ANOVA:
One-way ANOVA: Tests one independent variable with 3+ groups.
Two-way ANOVA: Tests two independent variables (factors) and
their interaction.
Repeated Measures ANOVA: Tests the same subjects under
different conditions or times.
When to use ANOVA:
You have three or more groups.
Your dependent variable is continuous.
You want to know if there’s any difference in means among the
groups.

T-test Example:
Are males and females different in average test scores?
2 groups: male and female.
Use an independent t-test.
ANOVA Example:
Are test scores different among students who get no tutoring, 1-
hour tutoring, and 3-hour tutoring?
3 groups: no tutoring, 1 hour, 3 hours.
Use a One-way ANOVA.
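
A sketch of the two examples above with SciPy; all score values are
invented:

from scipy import stats

# Independent t-test: two groups
male_scores = [72, 75, 78, 70, 74, 76]
female_scores = [80, 77, 82, 79, 81, 78]
t_stat, p_t = stats.ttest_ind(male_scores, female_scores)
print(f"t-test p = {p_t:.3f}")

# One-way ANOVA: three or more groups
no_tutoring = [60, 62, 65, 58, 63]
one_hour = [68, 70, 72, 66, 71]
three_hours = [75, 78, 80, 74, 79]
f_stat, p_anova = stats.f_oneway(no_tutoring, one_hour, three_hours)
print(f"ANOVA p = {p_anova:.3f}")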

- Assumptions of t-test and ANOVA:


Common Assumptions for Both t-Test and ANOVA:
Normality
The data in each group should be normally distributed.
Homogeneity of Variance (Equal Variances)
The variances across groups should be roughly equal.
In SPSS, you can test this using Levene’s Test.
Independence of Observations
Each observation (data point) should be independent of the others
(no repeated measures unless you're using a repeated measures test).
Specifics for Each Test:
t-Test:
Compares 2 groups.
Assumptions apply to each group.
Types:
Independent t-test (2 different groups)
Paired t-test (same group tested twice)
ANOVA:
Compares 3 or more groups.
Same assumptions but applied to all groups together.
If you find a significant result, you usually do post hoc tests (like
Tukey) to see which groups differ.
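
A sketch of checking these assumptions in Python, using the
Shapiro-Wilk test for normality and Levene's test for equal variances
(the groups are hypothetical):

from scipy import stats

group_1 = [60, 62, 65, 58, 63, 61, 64]   # hypothetical scores
group_2 = [68, 70, 72, 66, 71, 69, 73]

# Normality per group: p > 0.05 means no evidence against normality
print(stats.shapiro(group_1).pvalue, stats.shapiro(group_2).pvalue)

# Homogeneity of variance: p > 0.05 means the variances look equal
print(stats.levene(group_1, group_2).pvalue)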

- Mediation and moderation:


Mediator:
A mediator explains how or why one variable affects another.
It’s like a middle step in the relationship.
Example:
Exercise leads to Weight Loss, but Calories Burned is the reason
why.
So:
Exercise -> Calories Burned (Mediator) -> Weight Loss
Mediation explains the process or the mechanism of how X affects
Y.
Moderator:
A moderator changes the strength or direction of the relationship
between two variables.
It tells you when or for whom the relationship is stronger or weaker.
Example:
Stress affects Performance, but it depends on Experience Level.
For experienced people, stress might improve performance.
For beginners, stress might hurt performance.
Experience is the moderator.
A moderator changes the effect between X and Y

- Type-I and type-II error:


Type-I Error (False Positive):
You reject a true null hypothesis (H₀).
You conclude there is an effect or difference, but there really isn't.
Example: A test says someone is sick, but they are actually healthy.
Controlled by: Alpha (α), usually set at 0.05.

Type-II Error (False Negative):


You fail to reject a false null hypothesis (H₀).
You conclude there is no effect, but there actually is one.
Example: A test says someone is healthy, but they are actually sick.
Controlled by: Beta (β). Reducing β increases power.
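
A small simulation sketch: when H0 is actually true, the proportion of
tests that wrongly reject it should be close to α:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
false_positives = 0
n_trials = 2000

for _ in range(n_trials):
    # Sample from a population whose true mean really is 50, so H0 is true
    sample = rng.normal(loc=50, scale=10, size=30)
    _, p = stats.ttest_1samp(sample, 50)
    if p < alpha:                  # rejecting H0 here is a Type-I error
        false_positives += 1

print(false_positives / n_trials)  # should come out close to 0.05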

- Characteristics of normal distribution curve:


Symmetry: The curve is perfectly symmetrical around its center (the
mean).
Unimodal: It has a single peak (mode) at the center.
Mean, Median, and Mode are Equal: These three measures of
central tendency are all located at the center of the distribution.
Asymptotic Tails: The curve's tails extend infinitely towards the x-
axis (the horizontal axis) but never touch it.
Empirical Rule (68-95-99.7 Rule): Approximately 68% of the data
falls within one standard deviation of the mean, 95% within two
standard deviations, and 99.7% within three standard deviations.
Described by Two Parameters: The normal distribution is fully
defined by its mean (μ) and standard deviation (σ).
Area under the Curve: The total area under the normal curve
represents 100% (or 1.0) of the data.
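
The empirical rule can be checked directly from the standard normal
CDF (a quick SciPy sketch):

from scipy import stats

for k in (1, 2, 3):
    # Probability of falling within k standard deviations of the mean
    prob = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {prob:.4f}")   # roughly 0.68, 0.95, 0.997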

- Chi-square test:
It is used:
when data is in the form of frequencies or counts, not
measurements.
when dealing with categorical variables (like gender, preference,
yes/no answers).
When data is in categories, not continuous numbers.
(Example: Male/Female, Pass/Fail, Yes/No)
To test relationships between categories, like:
Is there a relationship between gender and voting preference?
Chi-Square is for frequency data and categorical variables, not
means or averages.
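
A sketch of a chi-square test of independence on an invented
gender-by-preference contingency table:

from scipy import stats

# Rows: Male / Female; Columns: Candidate A / Candidate B (observed counts)
observed = [[30, 20],
            [25, 35]]

chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
# A small p-value suggests gender and voting preference are not independent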

- 2X2 design:
A 2 x 2 design refers to a study with two independent variables,
each of which has 2 levels (or categories).
Example:
Independent Variable 1 (IV1) = Gender (Male / Female)
Independent Variable 2 (IV2) = Treatment (Drug / Placebo)
So, there are 4 groups in total:
Male + Drug
Male + Placebo
Female + Drug
Female + Placebo

- Research design:
It’s the plan or strategy you use to conduct research, collect data,
and answer your research questions.
Types of research designs:
1. Experimental Design
You manipulate one or more independent variables and measure
their effect on a dependent variable.
Random assignment is used to control for bias.
Example: Giving one group a drug and another group a placebo, then
comparing outcomes.
Goal: Establish cause and effect.

2. Quasi-Experimental Design
Like experimental, but no random assignment.
Often used when randomization isn’t possible (like in real-world
settings).
Example: Comparing two schools’ test scores where students weren't
randomly assigned.

3. Non-Experimental / Observational Design


No manipulation of variables.
You observe and measure variables as they are.
Example: Surveying people about their TV watching habits and health.
Goal: Look for relationships, not cause-and-effect.
4. Descriptive Research
Describes characteristics or behaviors.
Uses surveys, observations, or case studies.
Example: Describing how many students prefer online learning.

5. Correlational Research
Looks for relationships between two variables.
No manipulation.
Example: Is there a correlation between hours of study and GPA?
Note: Correlation ≠ causation!

- Title of the research:


The title of the research should contain at least 14 words and a
maximum of 15 to 17 words.
The I.V. and D.V. should be mentioned.
The type of sample you have taken should also be mentioned, e.g., male
or female, young or adults, etc.
