
This note was prepared by AI; read it with care.

Khan

August 7, 2023.

Feature | Null Hypothesis (H0) | Alternative Hypothesis (Ha)
Definition | The statement that there is no difference between the two groups or variables. | The statement that there is a difference between the two groups or variables.
Symbol | H0 | Ha
Significance level | The probability of rejecting the null hypothesis when it is actually true (α). | The related quantity β is the probability of failing to reject the null hypothesis when it is actually false.
P-value | The probability of obtaining the observed results, or more extreme results, if the null hypothesis is true. | The p-value is always computed under the null hypothesis; a small p-value is taken as evidence in favor of the alternative.
Decision | Reject the null hypothesis if the p-value is less than the significance level. | Do not reject the null hypothesis if the p-value is greater than the significance level.
P-value and significance level are two important concepts in statistics that are often confused. The p-value
is a measure of the probability of obtaining the observed results, or more extreme results, if the null
hypothesis is true. The significance level is the probability of rejecting the null hypothesis when it is
actually true.

A p-value of 0.05 is often used as the cutoff for statistical significance: if the p-value is less than 0.05, we reject the null hypothesis at the 5% significance level. For example, suppose we are testing the effectiveness of a new drug and hypothesize that it reduces the risk of heart attack. We collect data from a group of patients who have taken the drug and a group who have not, and compare the rates of heart attack in the two groups. If the p-value for this test is less than 0.05, we reject the null hypothesis and conclude that the drug is effective in reducing the risk of heart attack. If the p-value is greater than 0.05, we cannot reject the null hypothesis and cannot conclude that the drug is effective.
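As a rough sketch, this comparison could be requested in SPSS syntax as follows; the variable names drug and heart_attack are hypothetical, not taken from the note.

* Compare heart attack rates between the drug and no-drug groups (illustrative names).
CROSSTABS
  /TABLES=drug BY heart_attack
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW.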

A t-test is a statistical hypothesis test used to compare the means of two groups and determine if there is
a significant difference between them. It's widely used for making inferences about population means
based on sample data. The t-test assesses whether the observed differences between the sample means
are likely to have occurred due to random sampling variation or if they represent true differences in the
population.

Test Type | Description | Use Case
One-Sample t-test | Determines if the sample mean is significantly different from a known or hypothesized population mean. | Compare sample mean to a population mean.
Independent Samples t-test | Compares the means of two independent groups to determine if they are significantly different. | Compare means of two distinct groups.
Paired Samples t-test | Compares the means of two related groups (matched pairs or repeated measures) to determine if they are significantly different. | Compare means before and after an intervention.
One-Way ANOVA | Extends the independent samples t-test to compare means of three or more independent groups. | Compare means across multiple groups.
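A minimal sketch of how each test can be requested in SPSS syntax; the variable names (score, group, pre, post) and the test value 100 are illustrative assumptions.

* One-sample t-test against a hypothesized population mean of 100.
T-TEST /TESTVAL=100 /VARIABLES=score.
* Independent samples t-test for two groups coded 1 and 2.
T-TEST GROUPS=group(1 2) /VARIABLES=score.
* Paired samples t-test for before/after measurements.
T-TEST PAIRS=pre WITH post (PAIRED).
* One-way ANOVA across three or more groups.
ONEWAY score BY group.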

The chi-square test is a statistical test used to determine if there is a significant association or relationship
between two categorical variables. It compares the observed frequencies of categories in a contingency
table to the expected frequencies that would occur if the variables were independent. The chi-square test
assesses whether the observed distribution differs significantly from what would be expected by chance.

Type of Chi-Square Test | Purpose | Example
Chi-Square Test for Independence | Determine if there is an association between two categorical variables. | Testing if gender and voting preference are associated.
Chi-Square Test for Goodness of Fit | Determine if observed frequencies differ significantly from expected frequencies. | Testing if observed blood type distribution matches expected distribution.
Chi-Square Test for Homogeneity | Compare distributions across multiple groups or populations. | Comparing product preferences across different cities.
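A hedged sketch of the first two tests in SPSS syntax; the variable names and the expected percentages are assumptions for illustration.

* Test for independence between two categorical variables.
CROSSTABS /TABLES=gender BY vote /STATISTICS=CHISQ.
* Goodness-of-fit test of observed blood type counts against expected proportions.
NPAR TESTS /CHISQUARE=blood_type /EXPECTED=40 30 20 10.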

Level of Measurement | Characteristics | Examples
Nominal | Categories with no order or ranking; no meaningful numeric difference; mode is the only meaningful measure. | Gender (Male, Female); eye color (Blue, Brown, Green); country of birth
Ordinal | Categories with order/rank; differences between categories are not uniform or equal; median and percentiles are meaningful. | Educational levels (High School, College, Bachelor's); Likert scale (Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree); socio-economic status
Interval | Equal intervals between values; no true zero point; ratios are not meaningful (e.g., 40°C is not twice as hot as 20°C). | Temperature (Celsius/Fahrenheit); IQ scores; pH level
Ratio | Equal intervals and a true zero point; ratios are meaningful (e.g., 100 cm is twice as long as 50 cm); all arithmetic operations are valid. | Height, weight, income; age; number of customers; distance
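In SPSS, the measurement level of each variable can be declared explicitly; note that SPSS collapses interval and ratio into a single SCALE level. The variable names below are illustrative.

* Declare measurement levels (SPSS uses SCALE for both interval and ratio data).
VARIABLE LEVEL gender (NOMINAL) /educ_level (ORDINAL) /income (SCALE).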
Outliers are data values that are surprisingly extreme when compared to the other values in the data set. There are two types of outliers: univariate outliers and multivariate outliers.

Feature | Univariate Outliers | Multivariate Outliers
Definition | Data points that are significantly different from the rest of the data on a single variable. | Data points that are significantly different from the rest of the data across multiple variables.
Identification | Using methods such as the interquartile range (IQR) or z-scores. | Using methods such as Mahalanobis distance or Cook's distance.
Impact | Can skew the results of statistical analyses that are based on a single variable. | Can skew the results of statistical analyses that are based on multiple variables.
Treatment | Can be removed from the data set or transformed to reduce their impact. | Can be removed from the data set, transformed to reduce their impact, or treated as separate groups.
Example | Exam scores: one student has a significantly lower score than the rest. | Income vs. age: one person has a much lower income than others in their age group.
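A simple univariate screen in SPSS syntax, assuming an illustrative income variable: DESCRIPTIVES with /SAVE adds a standardized copy (Zincome) whose extreme values flag potential outliers.

* Create z-scores; cases with |Zincome| > 3 are candidate univariate outliers.
DESCRIPTIVES VARIABLES=income /SAVE.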

Feature | Recode | Compute
Purpose | Modifies an existing variable's values. | Creates a new variable with transformed values.
Original Variable | Values are changed in the same variable. | Original variable remains unchanged.
Syntax Example | RECODE variable (old_value = new_value) (...). | COMPUTE new_variable = expression.
Data Impact | Changes affect the original data. | No impact on the original data.
Storage and Usage | Original variable is overwritten. | New variable can be used for specific analyses.
Common Use Cases | Simplify categorical variable values. | Create groups or categories for analysis.
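A brief sketch of both commands, with illustrative variable names (satisfaction, q1, q2): RECODE overwrites the variable in place, while COMPUTE builds a new one.

* Collapse a 5-point scale in place: 1-2 become 1, 3 becomes 2, 4-5 become 3.
RECODE satisfaction (1,2=1) (3=2) (4,5=3).
* Derive a new total score from two existing items.
COMPUTE total = q1 + q2.
EXECUTE.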

Feature | Correlation | Regression
Definition | A measure of the strength and direction of the linear relationship between two variables. | A statistical model that predicts the value of one variable (the dependent variable) from the value of another variable (the independent variable).
Symbol | r | β
Significance | The correlation coefficient can be used to determine whether the relationship between the two variables is statistically significant. | The regression coefficients can be used to determine the strength of the relationship between the two variables and to predict the value of the dependent variable.
Direction | The correlation coefficient can be positive or negative, indicating whether the two variables move in the same direction or in opposite directions. | The regression coefficients can be positive or negative, indicating whether the independent variable has a positive or negative effect on the dependent variable.
Use | Correlation is used to summarize the relationship between two variables. | Regression is used to predict the value of one variable from the value of another variable.
Limitations | Correlation does not indicate causation. | Regression can be affected by outliers and by multicollinearity.
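A minimal sketch in SPSS syntax, with hypothetical variables x and y: the correlation summarizes the relationship, and the regression predicts y from x.

* Pearson correlation between the two variables.
CORRELATIONS /VARIABLES=x y.
* Simple linear regression of y on x.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x.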
Feature | Recode into Same Variable | Recode into Different Variable
Purpose | Modifies an existing variable's values. | Creates a new variable with transformed values.
Original Variable | Values are changed in the same variable. | Original variable remains unchanged.
Syntax Example | RECODE variable (old_value = new_value) (...). | RECODE variable (old_value = new_value) (...) INTO new_variable.
Data Impact | Changes affect the original data. | No impact on the original data.
Storage and Usage | Original variable is overwritten. | New variable can be used for specific analyses.
Common Use Cases | Simplify categorical variable values; combine or recode ordinal data. | Create groups or categories for analysis; preserve original data while transforming values.
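A hedged example of recoding into a different variable, using an illustrative age variable banded into a new age_group variable.

* Band a continuous variable into a new grouping variable, leaving age unchanged.
RECODE age (LO THRU 29=1) (30 THRU 49=2) (50 THRU HI=3) INTO age_group.
EXECUTE.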

Feature | Type I Error | Type II Error
Definition | Rejecting the null hypothesis when it is actually true. | Failing to reject the null hypothesis when it is actually false.
Symbol | α | β
Consequences | Can lead to false positives. | Can lead to false negatives.
Control | Controlled by the significance level. | Controlled by the power of the test.
Types of Computer Bus

1. System bus:
   1. Address bus - carries memory addresses from the processor to other components such as primary storage and input/output devices.
   2. Data bus - carries the data between the processor and other components.
   3. Control bus - carries control signals from the processor to other components.

2. Expansion bus: supplements the system bus and transmits information between input, output, and peripheral devices.

Merge: Adding Variables

Open SPSS: Start SPSS and open the dataset to which you want to add variables.

Data > Merge Files > Add Variables: Go to "Data" in the top menu, select "Merge Files," then choose "Add Variables."

Select Dataset: Choose the dataset that contains the variables you want to add.

Matching Variables: Specify the key variables that will be used to match cases between the two datasets.

Options: Choose which variables to include from each dataset and handle duplicate key values if necessary.

OK: Click "OK" to perform the addition of variables.
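The same merge can be written in syntax. A sketch under the assumption that both files are sorted by an illustrative key variable id:

* Add variables from another file, matching cases on the key (both files sorted by id).
MATCH FILES /FILE=*
  /FILE='OtherDataset.sav'
  /BY id.
EXECUTE.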

Merge: Adding Cases

Open SPSS: Start SPSS and open the two datasets you want to merge.

Data > Merge Files > Add Cases:

Go to "Data" in the top menu, then select "Merge Files" and choose "Add Cases."

Target Dataset: Select the dataset you want to merge into (the one where you want to add cases).

Source Dataset: Choose the dataset you want to merge from (the one containing the cases you want to
add).

Variable Pairing: SPSS pairs variables between the two datasets by name; confirm the pairing and rename or exclude any unpaired variables.

Options: Define how unpaired variables are handled and, if desired, add a variable indicating each case's source file.

OK: Click "OK" to perform the merge.
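The syntax equivalent stacks the cases of the second file under the first; the file name here is illustrative.

* Append the cases from another dataset to the active file.
ADD FILES /FILE=*
  /FILE='OtherDataset.sav'.
EXECUTE.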


Testing normality: descriptive statistics, statistical tests, normality plots

Descriptive statistics

Feature | Skewness | Kurtosis
Definition | A measure of the asymmetry of a distribution. | A measure of the "peakedness" of a distribution.
Normal distribution | 0 | 3
Positive value | Positive skew: the tail of the distribution is longer on the right side. | Kurtosis above 3: the distribution is more peaked than a normal distribution.
Negative value | Negative skew: the tail of the distribution is longer on the left side. | Kurtosis below 3: the distribution is flatter than a normal distribution.
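Both statistics can be obtained with FREQUENCIES; the variable name is illustrative. Note that SPSS reports excess kurtosis, so a normal distribution shows a value near 0 rather than 3.

* Skewness and kurtosis with standard errors, plus a histogram with a normal curve.
FREQUENCIES VARIABLES=score
  /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
  /HISTOGRAM NORMAL.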
Statistical tests

Kolmogorov-Smirnov test: The Kolmogorov-Smirnov test is a non-parametric test that is based on the
cumulative distribution function of the data. It is a less sensitive test for departures from normality than
the Shapiro-Wilk test, but it is more robust to outliers.
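A one-sample Kolmogorov-Smirnov test against the normal distribution can be requested as follows; the variable name is an assumption.

* One-sample K-S test of score against a normal distribution.
NPAR TESTS /K-S(NORMAL)=score.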

Normality plots

Feature | Detrended Q-Q Plot | Q-Q Plot | Box Plot
Definition | A graphical method for assessing normality that removes the trend from the data before plotting. | A graphical method for assessing normality that plots the quantiles of the sample distribution against the quantiles of the standard normal distribution. | A graphical method for displaying the distribution of data through its quartiles. It displays the median, quartiles, and potential outliers.
Uses | Used to assess whether the data is normally distributed after removing the trend. | Used to assess whether the data is normally distributed. | Used to assess the distribution of data, including its central tendency, dispersion, and outliers.
Strengths | Makes departures from normality easier to see, because deviations appear around a horizontal reference line. | Sensitive to departures from normality. | Easy to interpret and understand.
Weaknesses | Can be sensitive to outliers. | Not as sensitive to departures from normality as other methods. | Does not provide information about the shape of the distribution.
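All three plots, along with the Kolmogorov-Smirnov and Shapiro-Wilk tests, come from a single EXAMINE command; the variable name is illustrative.

* NPPLOT requests Q-Q and detrended Q-Q plots plus the normality tests.
EXAMINE VARIABLES=score
  /PLOT BOXPLOT NPPLOT
  /STATISTICS DESCRIPTIVES.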

Histogram: A histogram is a graphical method for assessing normality. It shows the distribution of the data as the number of observations in each bin. If the data is normally distributed, the histogram will be roughly bell-shaped.
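A histogram with a superimposed normal curve serves as a quick visual check (illustrative variable name):

* Histogram of score with a normal curve overlaid.
GRAPH /HISTOGRAM(NORMAL)=score.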
A normal distribution is a continuous probability distribution that is symmetric around the mean, has a bell-shaped curve, and tapers off towards the tails. The normal distribution is one of the most important distributions in statistics, and it is used in a wide variety of applications, including testing hypotheses, estimating parameters, making predictions, and data visualization.

Feature | Parametric Tests | Nonparametric Tests
Assumptions | Make assumptions about the underlying distribution of the data. | Do not make assumptions about the underlying distribution of the data.
Power | More powerful when the assumptions are met. | Less powerful, but more robust to violations of assumptions.
Interpretability | More interpretable, as the results can be expressed in terms of the population parameters. | Less interpretable, as the results are expressed in terms of ranks or other non-parametric statistics.
Examples | t-tests, ANOVA, linear regression | Mann-Whitney U test, Wilcoxon signed-rank test, Kruskal-Wallis test
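A hedged sketch of two common nonparametric tests in SPSS syntax; the variable names and group codes are illustrative.

* Mann-Whitney U test for two independent groups coded 1 and 2.
NPAR TESTS /M-W=score BY group(1 2).
* Kruskal-Wallis test across groups coded 1 through 3.
NPAR TESTS /K-W=score BY group(1 3).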
Data transformation involves changing the format, structure, or values of your data to make it more
suitable for analysis or to meet specific requirements. Data transformation can include tasks such as
cleaning, reformatting, aggregating, and deriving new variables. In SPSS, you can perform various data
transformations using different commands and functions. Here are some common data transformation
tasks and the corresponding SPSS commands:

Recoding Variables:

You can recode values of a variable to new values based on specific criteria.

RECODE VariableName (OldValue = NewValue) (OldValue2 = NewValue2) INTO NewVariableName.
EXECUTE.

Creating New Variables:

You can create new variables based on calculations or combinations of existing variables.

COMPUTE NewVariable = ExistingVariable1 + ExistingVariable2.

Aggregating Data:

You can aggregate data to summarize values within groups or categories.

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=GroupVariable
  /NewVariable = MEAN(ExistingVariable).

Merging Datasets:

You can combine multiple datasets by merging them based on common variables.

MATCH FILES /FILE=*
  /TABLE='AnotherDataset.sav'
  /BY CommonVariable.
