
Unit 5:

A statistical hypothesis is a statement about a population parameter


that is tested using statistical methods. There are two types of
statistical hypotheses: null hypotheses and alternative hypotheses.
The null hypothesis (H0) is a statement that assumes there is no
significant difference between the observed data and the expected
data. This hypothesis is used as the starting point in statistical
testing, and it is assumed to be true until the evidence shows otherwise.
The alternative hypothesis (Ha) is the opposite of the null
hypothesis: it states that there is a significant difference
between the observed data and the expected data. The alternative
hypothesis is tested against the null hypothesis to determine
whether the null hypothesis can be rejected in its favor.
For example, in a study to determine if there is a relationship
between education level and income, the null hypothesis might be
that there is no significant relationship between education level and
income. The alternative hypothesis would be that there is a
significant relationship between education level and income. A
statistical test would then be used to determine whether the null
hypothesis can be rejected based on the data collected.
Errors in hypothesis testing:
Sometimes, our conclusions about an experiment may not match
reality. There can be errors in analysis. The two types of errors in
hypothesis testing are: Type I Error and Type II Error.
Type I Error
A Type I error occurs when the null hypothesis is true but we reject it
in favor of the alternative. It is sometimes known as a false positive:
we falsely conclude that differences before and after an experiment are
due to the treatment when that is not the case.
Type II Error
A Type II error occurs when the null hypothesis is false but we fail to
reject it. It is sometimes known as a false negative: we conclude that
differences before and after an experiment are due only to chance when
they are in fact due to the treatment. This wastes experimental
resources, since a real effect was achieved but goes undetected.
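The two error types can be illustrated with a small simulation. This is a hedged sketch, not from the text: the helper `z_test_mean` and all numbers here are illustrative. When the null hypothesis is actually true, a test run at α = 0.05 should commit a Type I error (a false positive) about 5% of the time in the long run.

```python
import math
import random

def z_test_mean(sample, mu0, sigma):
    # Two-sided z-test p-value for H0: population mean == mu0,
    # with known sigma (illustrative helper).
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Standard normal CDF expressed via erf.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
# H0 is true here: every sample really comes from a mean-0 population,
# so each rejection at alpha = 0.05 is a Type I error.
trials = 2000
type_i = sum(
    z_test_mean([random.gauss(0, 1) for _ in range(30)], mu0=0, sigma=1) < 0.05
    for _ in range(trials)
)
print(type_i / trials)  # long-run Type I error rate, close to 0.05
```

A Type II error would be the mirror image: generate samples from a population where H0 is false and count how often the test fails to reject.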

Level of significance :
The level of significance is defined as the fixed probability of
wrongly rejecting the null hypothesis when it is in fact true. It
equals the probability of a Type I error and is set by the researcher
in advance, in view of the consequences of such an error. The level of
significance is the threshold for statistical significance: it
determines whether the null hypothesis is rejected or retained, i.e.
whether a result is statistically significant enough for the null
hypothesis to be rejected.
Level of Significance Symbol
The level of significance is denoted by the Greek symbol α (alpha).
Therefore, the level of significance is defined as follows:
Significance Level = P(Type I error) = α
Values or observations are less likely the farther they fall from the
mean. Results are reported as "significant at x%".
Example: Significant at 5% means the p-value is less than 0.05, i.e.
p < 0.05. Similarly, significant at 1% means that the p-value is less
than 0.01.
The level of significance is conventionally taken as 0.05, or 5%. A
low p-value means that the observed values are significantly different
from the population value that was hypothesised at the outset: the
lower the p-value, the stronger the evidence, and a very small p-value
indicates a highly significant result. Most commonly, p-values smaller
than 0.05 are called significant, since under the null hypothesis a
p-value below 0.05 occurs infrequently.

Steps used in testing of hypothesis:


There are several steps that are typically followed when testing a
hypothesis:
-Clearly define the hypothesis that you want to test. The hypothesis
should be specific and testable.
-Determine the appropriate statistical test to use. There are many
different statistical tests that can be used to test a hypothesis, and
the appropriate test depends on the specific characteristics of the
data and the nature of the hypothesis being tested.
-Collect and analyze the data. This typically involves collecting a
sample of data and using statistical techniques to analyze it.
-Calculate the p-value. The p-value is the probability of obtaining
results at least as extreme as those observed, assuming the null
hypothesis is true. If the p-value is below a certain threshold
(usually 0.05), the results are considered statistically significant
and the null hypothesis is rejected in favor of the alternative. If
the p-value is above the threshold, the null hypothesis is not rejected.
-Interpret the results. If the null hypothesis is rejected, this
suggests that there is a statistically significant relationship between
the variables being tested. If the null hypothesis is not rejected,
this suggests that a statistically significant relationship between the
variables has not been demonstrated.
-Report the results. It is important to report the results of the
hypothesis test in a clear and concise manner, including a
description of the statistical test used, the sample size, the p-value,
and a summary of the results.
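The steps above can be sketched end to end in code. This is an illustrative example, not from the text: the data (560 successes in 1000 trials), the hypothesized value, and the helper name are assumptions, and a normal-approximation z-test for a proportion stands in for whichever test fits the data at hand.

```python
import math

def one_sample_proportion_test(successes, n, p0):
    # Steps 1-4: H0 is "true proportion = p0", Ha is "proportion != p0";
    # the test statistic is a z-score under the normal approximation.
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Steps 5-6: interpret and report at alpha = 0.05.
z, p = one_sample_proportion_test(successes=560, n=1000, p0=0.5)
alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.4f}: {decision}")
```

The report would state the test used (one-sample z-test for a proportion), the sample size (1000), the p-value, and the decision.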
Unit 6

Central Limit Theorem: The central limit theorem states that whenever a random sample of size n is taken from any distribution with finite mean and variance, the sample mean will be approximately normally distributed, with the corresponding mean and variance. The larger the sample size, the better the approximation to the normal.
Consider x1, x2, x3, ..., xn independent and identically distributed with mean μ and finite variance σ², and define the random variable Zn as
Zn = (x1 + x2 + ... + xn − nμ) / (σ√n)
Then the distribution function of Zn converges to the standard normal distribution function as n increases without bound.
Sketch of the proof: write Ui = (xi − μ)/σ, so that Zn = (U1 + U2 + ... + Un)/√n. Since the xi are independent random variables, the Ui are also independent. This implies
M_Zn(t) = [M_U(t/√n)]^n
where M denotes the moment generating function. Taking logarithms and expanding,
ln M_Zn(t) = n ln M_U(t/√n) = n ln(1 + x) = n(x − x²/2 + x³/3 − ...)
where x = M_U(t/√n) − 1 = t²/(2n) + terms of smaller order. Multiply each term by n, and as n → ∞, all terms but the first go to zero, leaving
M_Zn(t) → e^(t²/2)
which is the moment generating function for a standard normal random variable.
Example: The record of weights of the female population follows a normal distribution. Its mean and standard deviation are 65 kg and 14 kg, respectively. If a researcher considers the records of 50 females, then what would be the standard deviation of the chosen sample mean?
Solution:
Mean of the population μ = 65 kg
Standard deviation of the population σ = 14 kg
Sample size n = 50
The standard deviation of the sample mean is given by σ/√n
= 14/√50
= 14/7.071
≈ 1.98

Explain the statement "The central limit theorem forms the basis of inferential statistics": The central limit theorem states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately normally distributed. This result holds regardless of the shape of the distribution of the population from which the samples are drawn.
In inferential statistics, we are often interested in estimating the value of a population parameter (such as the mean) based on a sample from the population. The central limit theorem tells us that, as long as the sample size is large enough, the distribution of the sample mean will be approximately normal, which means we can use normal distribution-based methods to make inferences about the population parameter.
For example, if we want to estimate the mean income of a certain population, we can take a sample of incomes from that population, calculate the mean of the sample, and use the central limit theorem to construct a confidence interval around that mean. This interval will give us a range of values that we can be confident contains the true mean of the population, based on the normal distribution of the sample mean.
So the central limit theorem forms the basis of inferential statistics because it provides a way to use the sample mean to make inferences about the population mean, by assuming that the sample mean is approximately normally distributed.

Unit 1: What do you mean by sampling distribution? Why is this study important in statistics? A sampling distribution is the distribution of a statistic (such as the mean) computed from a sample of a population. It describes how the statistic is likely to vary across different samples from the same population.
Studying the sampling distribution of a statistic is important in statistics because it allows us to understand the behavior of the statistic and to make inferences about the population based on samples from the population. For example, if we know the sampling distribution of the mean of a certain population, we can use this information to construct confidence intervals around the sample mean, which can be used to make inferences about the population mean.
The sampling distribution of a statistic is often approximated using the central limit theorem, which states that, given a sufficiently large sample size, the distribution of the statistic will be approximately normal, regardless of the shape of the distribution of the population. This allows us to use normal distribution-based methods to make inferences about the population based on samples from the population.
Studying the sampling distribution of a statistic is also important in hypothesis testing, where we use the sampling distribution to determine the likelihood of observing a particular sample statistic if a certain hypothesis about the population is true.

The Sampling Distribution of the Mean: The mean of the sampling distribution of the mean is the mean of the population from which the items are sampled. If the population distribution is normal, then the sampling distribution of the mean is likely to be normal for samples of all sizes. Following are the main properties of the sampling distribution of the mean:
1. Its mean is equal to the population mean, thus μX̄ = μ (μX̄ = mean of the sample means, μ = population mean).
2. The population standard deviation divided by the square root of the sample size is equal to the standard deviation of the sampling distribution of the mean, thus σX̄ = σ/√n (σ = population standard deviation, n = sample size).
3. The sampling distribution of the mean is normally distributed. This means the distribution of sample means for a large sample size is normally distributed irrespective of the shape of the universe, provided the population standard deviation (σ) is finite. Generally, a sample size of 30 or more is considered large for statistical purposes. If the population is normal, then the distribution of sample means will be normal irrespective of the sample size.

The Sampling Distribution of Proportion measures the proportion of success, i.e. the chance of occurrence of certain events, by dividing the number of successes by the sample size n. Thus, the sample proportion is defined as p = x/n.
The sampling distribution of proportion obeys the binomial probability law if the random sample of n is obtained with replacement. If the population is infinite and the probability of occurrence of an event is π, then the probability of non-occurrence of the event is (1 − π). Now consider all possible samples of size n drawn from the population, and estimate the proportion p of success for each. Then the mean (μp) and the standard deviation (σp) of the sampling distribution of proportion can be obtained as:
μp = π = mean of proportion
σp = √(π(1 − π)/n) = standard error of proportion, which measures the success (chance) variations of sample proportions from sample to sample
π = population proportion, which is defined as π = X/N, where X is the number of elements that possess a certain characteristic and N is the total number of items in the population
n = sample size; if the sample size is large (n ≥ 30), then the sampling distribution of proportion is likely to be normally distributed.
The following formula is used when the population is finite and the sampling is made without replacement:
σp = √(π(1 − π)/n) × √((N − n)/(N − 1))

Explain how sample size affects the margin of error with examples: The margin of error is a measure of the precision of an estimate. It is the amount by which an estimate may differ from the true value of the parameter being estimated. The margin of error is usually calculated as a function of the sample size, the standard deviation of the sample, and the desired level of confidence.
As the sample size increases, the margin of error decreases. This is because a larger sample size provides more information about the population, which leads to a more precise estimate of the population parameter.
For example, consider a survey that aims to estimate the proportion of a population that supports a particular candidate in an election. The survey is conducted by randomly sampling 1000 people from the population. The sample shows that 55% of the people support the candidate. The margin of error for this estimate can be calculated as follows:
Margin of error = 1.96 × √(0.55 × 0.45 / 1000) ≈ 0.03
This means that we can be 95% confident that the true proportion of people in the population who support the candidate is somewhere between 52% and 58%.
Now consider the same survey, but with a sample size of 100 people instead of 1000. The sample shows that 52% of the people support the candidate. The margin of error for this estimate is now:
Margin of error = 1.96 × √(0.52 × 0.48 / 100) ≈ 0.10
This means that we can be 95% confident that the true proportion of people in the population who support the candidate is somewhere between 42% and 62%.
As you can see, the margin of error is larger when the sample size is smaller. This means that the estimate is less precise when the sample size is smaller.

Explain different criteria of good estimator: There are several criteria that can be used to evaluate the quality of an estimator, including:
Unbiasedness: An unbiased estimator is one whose mean is equal to the true value of the parameter being estimated. A biased estimator consistently over- or under-estimates the true value of the parameter.
Efficiency: An efficient estimator is one that has a small variance, meaning it is less prone to producing extreme or improbable estimates.
Consistency: A consistent estimator is one that becomes increasingly accurate as the sample size increases.
Sufficiency: A sufficient estimator is one that makes use of all of the information in the sample in order to produce the estimate.
Robustness: A robust estimator is one that is not sensitive to small deviations from the assumptions that were made in deriving the estimator.
Range: The range of an estimator is the set of values that the estimator can take on. An estimator with a narrow range is preferred because it is less likely to produce extreme or implausible estimates.
Interval estimator: An interval estimator provides a range of values within which the true value of the parameter is likely to lie. An interval estimator that is narrow and accurate is preferred.
Point estimator: A point estimator provides a single estimate of the value of the parameter. A point estimator that is accurate and has a small variance is preferred.

Unit 3:

Parametric Tests | Non-Parametric Tests
Greater statistical power. | Lower statistical power.
Applied to normal or interval variables. | Applied to categorical variables.
Used for large samples. | Used for small samples.
Its data distribution is normal. | The form of data distribution is not known.
Make a lot of assumptions. | Don't make many assumptions.
Demand a higher condition of validity. | Require a lower validity condition.
Less chance of errors. | Higher probability of errors.
The calculation is complicated. | The calculation is less complicated.
The hypotheses are based on numerical data. | Hypotheses are based on ranges, median, and data frequency.
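The margin-of-error arithmetic in the survey example earlier (55% support from 1000 people, 52% from 100) can be checked with a short sketch. The helper name is an illustrative assumption; 1.96 is the standard z-value for 95% confidence. Note that evaluating the formula exactly gives about 0.03 for n = 1000 and about 0.10 for n = 100.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    # 95% margin of error for an estimated proportion p_hat from n samples.
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

print(round(margin_of_error(0.55, 1000), 3))  # 0.031
print(round(margin_of_error(0.52, 100), 3))   # 0.098
```

Because n sits under the square root, quadrupling the sample size only halves the margin of error.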
What do you mean by non parametric test? Write down advantages of non parametric test over parametric test:
Nonparametric tests are statistical tests that do not assume a particular shape or distribution for the data. They are generally less powerful than parametric tests, which do make assumptions about the data, but they are more flexible and can be applied to a wider range of data sets.
There are several advantages of nonparametric tests over parametric tests:
1. Nonparametric tests do not require assumptions about the underlying distribution of the data, which makes them more robust and less sensitive to deviations from these assumptions.
2. Nonparametric tests are generally easier to understand and interpret, as they do not rely on complex statistical models or assumptions.
3. Nonparametric tests can be applied to data that are not normally distributed, or to data that are ordinal (ranked) rather than continuous.
4. Nonparametric tests can be used when the sample size is small, as they do not require large sample sizes to be effective.
5. Nonparametric tests are less sensitive to the presence of outliers or extreme values in the data.
6. Nonparametric tests can be used when the data are collected in a way that does not allow for calculation of means and standard deviations (e.g., if the data are categorical rather than continuous).

Discuss assumptions of non parametric test:
There are generally fewer assumptions associated with nonparametric tests than with parametric tests. However, there are still some assumptions that are commonly made in nonparametric tests:
Independence: The observations in the sample should be independent of one another.
Random sampling: The sample should be drawn randomly from the population.
Equal sample sizes: Nonparametric tests often assume that the sample sizes in each group being compared are equal. When the sample sizes are not equal, the results of the test may not be reliable.
Normality: Some nonparametric tests assume that the underlying distribution of the data is normal, even though they do not make this assumption explicitly. This is because they are based on the normal distribution in some way (e.g., the Wilcoxon rank sum test is based on the normal distribution of the ranks of the observations).
Symmetry: Some nonparametric tests assume that the distribution of the data is symmetrical. This assumption may be violated if the data are heavily skewed.
Equal variances: Some nonparametric tests assume that the variances of the groups being compared are equal. When the variances are not equal, the results of the test may not be reliable.
It is important to carefully consider these assumptions when choosing and applying a nonparametric test, as violations of these assumptions can affect the reliability of the test results.

Write down steps of non parametric test:
The steps of a nonparametric test are generally similar to those of a parametric test, with some key differences:
Define the research question and null hypothesis: As with any statistical test, it is important to start by defining the research question and the corresponding null hypothesis that you are testing.
Select the appropriate nonparametric test: Based on the research question and the nature of the data, choose the appropriate nonparametric test.
Check the assumptions of the nonparametric test: Make sure that the assumptions of the nonparametric test are satisfied by the data. This may include checking for independence, random sampling, equal sample sizes, normality, symmetry, and equal variances, depending on the specific test being used.
Calculate the test statistic: Use the formula for the chosen nonparametric test to calculate the test statistic.
Determine the p-value: Use the test statistic to determine the p-value, which is the probability of observing a result as extreme or more extreme than the one observed, given that the null hypothesis is true.
Interpret the results: Compare the p-value to the chosen significance level (usually 0.05). If the p-value is less than the significance level, the null hypothesis can be rejected and the alternative hypothesis can be accepted. If the p-value is greater than the significance level, the null hypothesis cannot be rejected.

Advantages and disadvantages of non parametric test:
Nonparametric tests are statistical tests that do not assume a specific distribution for the data. Here are some advantages and disadvantages of nonparametric tests:
Advantages:
1. Nonparametric tests do not require assumptions about the underlying distribution of the data, so they can be used with a variety of data types.
2. They are often more robust than parametric tests, meaning they can still be reliable even if the assumptions of the parametric test are not met.
Disadvantages:
1. Nonparametric tests may have lower statistical power than parametric tests, meaning they may be less able to detect differences between groups.
2. Fewer nonparametric tests are available, so there may be less flexibility in analyzing the data.
3. The interpretation of nonparametric tests can be more difficult than that of parametric tests, because the results are not presented in terms of the familiar normal distribution.

What is binomial test? Why is it used?
The binomial test is a statistical test used to determine whether the probability of success in a binary outcome (such as a "success" or "failure") is different from a hypothesized value. It is based on the binomial distribution, which describes the probability of a specific number of successes in a fixed number of trials, given a probability of success on each trial.
The binomial test is often used in situations where there are two possible outcomes (e.g., "success" or "failure") and the probability of success is known or can be estimated. For example, the binomial test might be used to determine whether the probability of winning a game is different from 50%, or to determine whether the probability of a product defect is different from a certain value.
Method of binomial test:
To conduct a binomial test, the following steps are typically followed:
Define the research question and null hypothesis: As with any statistical test, it is important to start by defining the research question and the corresponding null hypothesis that you are testing.
Determine the hypothesized probability of success: Determine the hypothesized probability of success that is being tested against.
Collect the data: Collect data on the number of successes and failures in a fixed number of trials.
Calculate the test statistic: Use the formula for the binomial test to calculate the test statistic.
Determine the p-value: Use the test statistic to determine the p-value, which is the probability of observing a result as extreme or more extreme than the one observed, given that the null hypothesis is true.
Interpret the results: Compare the p-value to the chosen significance level (usually 0.05). If the p-value is less than the significance level, the null hypothesis can be rejected and the alternative hypothesis can be accepted. If the p-value is greater than the significance level, the null hypothesis cannot be rejected.

Discuss the process of the Kolmogorov-Smirnov test.
The Kolmogorov-Smirnov (K-S) test is a nonparametric statistical test used to determine whether two samples come from the same population. It is often used when the assumptions for parametric tests are not met, such as when the data is not normally distributed.
To perform the K-S test, follow these steps:
1. Calculate the sample size for each group.
2. Calculate the cumulative distribution functions (CDFs) for each sample. The CDF is a function that describes the probability that a random variable will be less than or equal to a certain value.
3. Calculate the maximum absolute difference between the two CDFs. This is known as the K-S statistic.
4. Use a table or computer software to determine the p-value for the K-S statistic.
5. Compare the p-value to the alpha level (typically 0.05) to determine whether the samples come from the same population.
The K-S test is a powerful tool for comparing two samples, but it is sensitive to sample size. When the sample sizes are small, the test may not have enough power to detect a difference between the samples. In this case, it may be necessary to use a different statistical test or to increase the sample size.

What do you mean by median test? Describe procedure of median test.
The median test is a nonparametric statistical test used to determine whether two samples come from populations with a common median. It is often used when the assumptions for parametric tests are not met, such as when the data is not normally distributed.
To perform the median test, follow these steps:
1. Determine the sample size for each group.
2. Rank the combined data from lowest to highest value.
3. Determine the median value for each group by selecting the value at the middle position of the ranked data for that group.
4. Calculate the difference between the two medians.
5. Use a table or computer software to determine the p-value for the difference between the medians.
6. Compare the p-value to the alpha level (typically 0.05) to determine whether the difference between the medians is statistically significant.
It is important to note that the median test can only be used to compare the medians of two groups, not the means. If you want to compare the means of two groups, you will need to use a different statistical test.

Describe the function and procedure of chi square test for goodness of fit.
The chi-square test for goodness of fit is a statistical test used to determine whether a sample comes from a population with a specific distribution. It is often used to test whether a sample is consistent with a hypothesis about the population distribution.
To perform the chi-square test for goodness of fit, follow these steps:
1. Determine the sample size and the number of categories or bins in the data.
2. Calculate the expected frequency for each category or bin based on the hypothesized population distribution.
3. Calculate the observed frequency for each category or bin based on the sample data.
4. Calculate the chi-square statistic using the formula:
chi-square = sum((observed frequency − expected frequency)² / expected frequency)
5. Use a table or computer software to determine the p-value for the chi-square statistic.
6. Compare the p-value to the alpha level (typically 0.05) to determine whether the sample comes from a population with the hypothesized distribution.
It is important to note that the chi-square test for goodness of fit assumes that the sample is a random and representative sample from the population. If the sample is not representative, the test may not be accurate. Additionally, the chi-square test is sensitive to sample size, and may not be powerful when the sample size is small. In these cases, it may be necessary to use a different statistical test or to increase the sample size.

Discuss the rationale and method of chi square test for independence of attributes.
The chi-square test for independence of attributes is a statistical test used to determine whether there is a relationship between two categorical variables. It is often used to test whether there is an association between the two variables, or whether one variable is independent of the other.
To perform the chi-square test for independence of attributes, follow these steps:
1. Determine the sample size and the number of categories or bins in each variable.
2. Construct a contingency table that shows the frequency of each combination of categories or bins in the two variables.
3. Calculate the expected frequency for each combination of categories or bins based on the assumption of independence between the two variables.
4. Calculate the observed frequency for each combination of categories or bins based on the sample data.
5. Calculate the chi-square statistic using the formula:
chi-square = sum((observed frequency − expected frequency)² / expected frequency)
6. Use a table or computer software to determine the p-value for the chi-square statistic.
7. Compare the p-value to the alpha level (typically 0.05) to determine whether there is a relationship between the two variables.
The chi-square test for independence of attributes is a powerful tool for testing relationships between categorical variables, but it is sensitive to sample size. When the sample size is small, the test may not have enough power to detect a relationship between the variables. In this case, it may be necessary to use a different statistical test or to increase the sample size. It is also important to note that the chi-square test assumes that the sample is a random and representative sample from the population. If the sample is not representative, the test may not be accurate.

Differentiate between chi-square test and KS test for uniformity.
The chi-square test is used to test whether the distribution of nominal variables matches a hypothesized distribution (uniform or otherwise), while the Kolmogorov-Smirnov (K-S) test is used only to test the goodness of fit for continuous data. The K-S test compares the continuous cdf, F(x), of the uniform distribution to the empirical cdf, SN(x), of the sample of N observations. By definition, if the sample from the random-number generator is R1, R2, ..., RN, then the empirical cdf, SN(x), is defined by
SN(x) = (number of R1, R2, ..., RN that are ≤ x) / N
As N becomes larger, SN(x) should become a better approximation to F(x), provided that the null hypothesis is true. The Kolmogorov-Smirnov test is based on the largest absolute deviation or difference between F(x) and SN(x) over the range of the random variable, i.e. it is based on the statistic
D = max |F(x) − SN(x)|
The chi-square test uses the sample statistic
chi-square = Σ (Oi − Ei)² / Ei
where Oi is the observed number in the ith class, Ei is the expected number in the ith class, and n is the number of classes. For the uniform distribution, Ei, the expected number in each class, is given by Ei = N/n, where N is the total number of observations.
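The chi-square statistic for uniformity described above can be sketched as follows. The function name and default class count are illustrative assumptions: each of the n equal-width classes over [0, 1) expects Ei = N/n observations, and the statistic sums (Oi − Ei)²/Ei over the classes.

```python
def chi_square_uniformity(observations, n_classes=10):
    # Oi: observed count in each equal-width class over [0, 1).
    N = len(observations)
    observed = [0] * n_classes
    for r in observations:
        observed[min(int(r * n_classes), n_classes - 1)] += 1
    expected = N / n_classes  # Ei = N/n for the uniform distribution
    return sum((o - expected) ** 2 / expected for o in observed)

# A perfectly even sample gives a statistic of 0; larger values are
# compared against the chi-square critical value with n_classes - 1
# degrees of freedom (16.92 at alpha = 0.05 for 9 df).
even = [i / 100 for i in range(100)]
print(chi_square_uniformity(even))  # 0.0
```

A strongly non-uniform sample (e.g. all values bunched in one class) drives the statistic far above the critical value, leading to rejection of uniformity.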
Describe the function and procedure of mann whitney u test: Discuss the function and procedure for Kruskals Wallis H t What do you mean by multiple correlation? Write down the
Describe the function and procedure of Mann-Whitney U test:
The Mann-Whitney U test, also known as the Mann-Whitney-Wilcoxon (MWW) test or the Wilcoxon rank-sum test, is a nonparametric statistical test used to determine whether two independent samples come from the same population. It is often used when the assumptions for parametric tests are not met, such as when the data are not normally distributed or the variances of the two groups are not equal.
To perform the Mann-Whitney U test, follow these steps:
1. Determine the sample size for each group (n1 and n2).
2. Combine the data from both groups and rank the values from lowest to highest.
3. Calculate the sum of the ranks for each group (R1 and R2), and from these compute the U statistics: U1 = R1 - n1(n1 + 1)/2 and U2 = R2 - n2(n2 + 1)/2. The test statistic U is the smaller of the two.
4. Use a table or computer software to determine the p-value for the U statistic.
5. Compare the p-value to the alpha level (typically 0.05) to decide whether the samples come from the same population.
The Mann-Whitney U test is a powerful tool for comparing two samples, but it is sensitive to sample size. When the sample sizes are small, the test may not have enough power to detect a difference between the samples. In this case, it may be necessary to use a different statistical test or to increase the sample size.

Discuss the function and procedure for Kruskal-Wallis H test:
The Kruskal-Wallis H test is a nonparametric statistical hypothesis test used to compare the medians of three or more independent samples. It is similar to the one-way analysis of variance (ANOVA), but is more robust and easier to compute. The test is based on the ranks of the observations within each group, rather than the observations themselves.
The Kruskal-Wallis H test is used to test the null hypothesis that the medians of all groups are equal. If the p-value is less than the chosen level of significance (usually 0.05), the null hypothesis can be rejected and it can be concluded that there is a significant difference between at least two of the groups.
To conduct the Kruskal-Wallis H test, the following steps are typically followed:
1. Rank all of the observations from all of the groups together.
2. Calculate the sum of the ranks for each group (Ri).
3. Calculate the test statistic H = [12 / (N(N + 1))] * sum(Ri^2 / ni) - 3(N + 1), where N is the total number of observations and ni is the number of observations in group i.
4. Calculate the p-value using the test statistic (H approximately follows a chi-square distribution with k - 1 degrees of freedom for k groups).
5. Compare the p-value to the chosen level of significance.
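The rank-sum procedure above can be sketched in pure Python. This is a minimal illustration, not a full implementation: the sample data are made up, tied values receive the average of their ranks, and the normal approximation for the p-value is omitted for brevity.

```python
def ranks(values):
    # Assign ranks 1..N, giving tied values the average of their ranks.
    ordered = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            rank[ordered[k]] = avg
        i = j + 1
    return rank

def mann_whitney_u(sample1, sample2):
    # Steps 1-3 of the procedure: pool, rank, and compute the U statistic.
    n1, n2 = len(sample1), len(sample2)
    r = ranks(list(sample1) + list(sample2))
    r1 = sum(r[:n1])                # rank sum of group 1
    u1 = r1 - n1 * (n1 + 1) / 2     # U statistic for group 1
    u2 = n1 * n2 - u1               # identity: U1 + U2 = n1 * n2
    return min(u1, u2)

# Hypothetical data: two small independent samples.
print(mann_whitney_u([3, 5, 8, 9], [1, 2, 4, 6]))  # prints 3.0
```

The resulting U would then be compared against a critical-value table (or converted to a p-value by software), as in steps 4 and 5.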
What is Wilcoxon matched-pair signed-rank test? Why is it superior to the sign test?
The Wilcoxon matched-pairs signed-rank test is a nonparametric statistical hypothesis test used to compare two related samples. It is similar to the paired t-test, but is more robust and easier to compute. The test is based on the ranks of the differences between the pairs of observations, rather than the differences themselves.
The Wilcoxon test is superior to the sign test in several ways. First, the Wilcoxon test takes into account the magnitude of the differences between the pairs, while the sign test only looks at whether the differences are positive or negative. This means that the Wilcoxon test is more sensitive to differences between the pairs and is less likely to produce a Type II error (failing to reject the null hypothesis when it is actually false).
Second, the Wilcoxon test has more statistical power than the sign test, which means that it is more likely to detect a difference between the pairs when one exists. This is especially important when the sample size is small.
Finally, the Wilcoxon test is more robust to departures from the assumptions of the paired t-test (such as normality of the differences), making it a more reliable test in practical situations.

Discuss the rationale and method of Wilcoxon matched-pair signed-rank test.
The Wilcoxon matched-pairs signed-rank test is used to compare two related samples, where each individual in one sample is matched with an individual in the other sample. The test is based on the ranks of the differences between the pairs of observations, rather than the differences themselves.
The rationale for using the Wilcoxon test is that it is a nonparametric test, which means that it makes no assumptions about the underlying distribution of the data. This is especially useful when the data are not normally distributed or when the sample size is small.
To conduct the Wilcoxon test, the following steps are typically followed:
1. Calculate the difference between each pair of observations.
2. Determine the sign (positive or negative) of each difference; zero differences are usually discarded.
3. Rank the absolute values of the differences, assigning tied differences the average of their ranks.
4. Calculate the sum of the ranks for the positive differences (W+) and the sum of the ranks for the negative differences (W-).
5. Calculate the test statistic, which is the smaller of W+ and W-.
6. Compare the test statistic to the critical value from a table of the Wilcoxon signed-rank test, or use a software package to calculate the p-value.
7. If the p-value is less than the chosen level of significance (usually 0.05), the null hypothesis can be rejected and it can be concluded that there is a significant difference between the two samples.

What do you mean by Friedman two-way ANOVA test? Describe the process of the test.
The Friedman two-way analysis of variance (ANOVA) is a nonparametric statistical test that is used to compare the mean ranks of three or more groups on a single dependent variable. It can be viewed as a nonparametric alternative to the repeated-measures (within-subjects) ANOVA.
The Friedman test is used when the independent variable is a within-subjects (repeated measures) factor and the dependent variable is a continuous or ordinal variable. It is used to determine whether there are significant differences between the conditions in the dependent variable.
To conduct the Friedman test, you would first need to collect data from a sample of subjects under each condition. The data should be collected in a way that allows you to rank each subject's scores across the conditions and compute the mean rank for each condition.
Next, you would compare the mean ranks using the Friedman test statistic, which approximately follows a chi-square distribution with k - 1 degrees of freedom for k conditions.
If the results of the test are significant, you would then conduct post-hoc tests to determine which pairs of conditions are significantly different from each other. This can be done using the Nemenyi test or the Conover test.
It is important to note that the Friedman test should only be used when its assumptions are met: the same subjects are measured under every condition, and the dependent variable is at least ordinal. If these assumptions are not met, it may be necessary to use a different statistical test.

Differentiate between Kruskal-Wallis one-way ANOVA test and Friedman two-way ANOVA test.
The Kruskal-Wallis one-way analysis of variance (ANOVA) test and the Friedman two-way ANOVA test are both nonparametric statistical tests that are used to compare the mean ranks of multiple groups on a single dependent variable. However, there are a few key differences between these two tests:
The number of groups: The Kruskal-Wallis test is used to compare the mean ranks of two or more groups, while the Friedman test is used to compare the mean ranks of three or more related measurements.
The type of independent variable: The Kruskal-Wallis test is used when the independent variable is a between-subjects factor (i.e., each subject is in only one group), while the Friedman test is used when the independent variable is a within-subjects (repeated measures) factor (i.e., each subject appears in every condition).
The type of dependent variable: Both the Kruskal-Wallis test and the Friedman test can be used with continuous or ordinal dependent variables.
Post-hoc tests: If the results of the Kruskal-Wallis test are significant, you can use the Mann-Whitney U test to determine which pairs of groups are significantly different from each other. If the results of the Friedman test are significant, you can use the Nemenyi test or the Conover test to determine which pairs of conditions are significantly different.

Unit 4:

What do you mean by partial correlation? Write down the relationship between partial and simple correlation coefficients.
Partial correlation is a statistical measure that quantifies the strength of the relationship between two variables while controlling for (or partialing out) the effects of one or more other variables. It can be used to determine whether the relationship between two variables is independent of the relationships between each of those variables and one or more other variables.
Simple correlation, on the other hand, is a measure of the strength of the relationship between two variables without controlling for any other variables.
The relationship between partial and simple correlation coefficients is that partial correlation allows you to control for the effects of other variables on the relationship between two variables, while simple correlation does not. This can be useful when you want to understand the unique relationship between two variables, rather than the relationship that is influenced by other variables.

What do you mean by multiple correlation? Write down the relationship between multiple correlation coefficient and simple correlation coefficient.
Multiple correlation refers to the correlation between multiple predictor variables and a single outcome variable. It can be represented by the multiple correlation coefficient, which is a measure of the strength of the relationship between the predictor variables taken together and the outcome variable.
The multiple correlation coefficient is similar to the simple correlation coefficient, which measures the correlation between two variables. However, the multiple correlation coefficient takes into account the relationships between multiple predictor variables and the outcome variable, whereas the simple correlation coefficient only measures the relationship between two variables.
In general, the multiple correlation coefficient will be at least as large as the largest absolute simple correlation between any single predictor and the outcome variable. This is because the multiple correlation coefficient takes into account the relationships between all of the predictor variables and the outcome variable, whereas the simple correlation coefficient only considers the relationship between two variables.
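For two predictors, the multiple correlation coefficient can be written directly in terms of the three simple correlations: R^2 = (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2). A pure-Python sketch with made-up data checks this relationship and the fact that R is never smaller than the largest simple correlation:

```python
import math

def pearson_r(x, y):
    # Simple (zero-order) correlation between two variables.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def multiple_r(x1, x2, y):
    # Multiple correlation R of y on two predictors, built from the
    # three simple correlations (two-predictor formula).
    ry1, ry2, r12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    r2 = (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)
    return math.sqrt(r2)

# Hypothetical data: y depends on both predictors (roughly x1 + x2).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3.1, 2.9, 7.2, 6.8, 11.1, 10.9]
R = multiple_r(x1, x2, y)
# R is at least as large as either simple correlation with y:
print(R >= max(abs(pearson_r(y, x1)), abs(pearson_r(y, x2))))  # True
```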
Write down the properties of multiple correlation coefficient.
Multiple correlation coefficient is a statistical measure that quantifies the strength of the relationship between multiple predictor variables and a single criterion variable. Here are some properties of the multiple correlation coefficient:
Range: The multiple correlation coefficient R ranges from 0 to 1. Unlike a simple correlation coefficient, it cannot be negative: 0 indicates no linear relationship, and 1 indicates a perfect linear relationship between the predictors taken together and the criterion variable.
Interpretation: A multiple correlation coefficient of 0.7, for example, corresponds to R^2 = 0.49, meaning the predictor variables together explain 49% of the variance in the criterion variable.
Statistical significance: The multiple correlation coefficient can be tested for statistical significance to determine whether the relationship between the predictor variables and the criterion variable is likely to be due to chance or whether it is a real relationship.
Sensitivity: The multiple correlation coefficient is sensitive to the presence of outliers in the data.
Multicollinearity: The multiple correlation coefficient can be affected by multicollinearity, which is the presence of strong correlations between predictor variables. In such cases, it may be difficult to interpret the individual contributions of the predictor variables to the criterion variable.
Linearity: The multiple correlation coefficient assumes a linear relationship between the predictor variables and the criterion variable. If the relationship is non-linear, the multiple correlation coefficient may not accurately represent the strength of the relationship.

Differentiate between multiple and partial correlation coefficient.
The multiple correlation coefficient is a measure of the strength of the relationship between multiple predictor variables and a single outcome variable. It takes into account the relationships between all of the predictor variables and the outcome variable.
On the other hand, the partial correlation coefficient is a measure of the strength and direction of the relationship between two variables, while controlling for the effects of one or more other variables. It is used to identify the unique contribution of each predictor variable to the prediction of the outcome variable, while controlling for the effects of the other predictor variables.
For example, consider three predictor variables X1, X2, and X3, and an outcome variable Y. The multiple correlation coefficient would measure the relationship between X1, X2, and X3 taken together and Y. The partial correlation coefficient between X1 and Y, controlling for X2 and X3, would measure the relationship between X1 and Y after removing the effects of X2 and X3. This allows us to identify the unique contribution of X1 to the prediction of Y.

What is multiple regression? Write down the method of obtaining the regression line.
Multiple regression is a statistical technique used to predict the value of a criterion or dependent variable based on the values of two or more predictor or independent variables. It is used to model the relationship between multiple predictor variables and a single criterion variable.
Multiple regression analysis involves the following steps:
1. Collect data on the predictor variables and the criterion variable.
2. Determine the regression model that best fits the data.
3. Estimate the regression coefficients for the predictor variables.
4. Test the significance of the regression coefficients.
5. Use the regression equation to predict the value of the criterion variable for given values of the predictor variables.
The regression equation for a multiple regression model is:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
where y is the criterion variable, b0 is the intercept term, and b1, b2, ..., bn are the regression coefficients for the predictor variables x1, x2, ..., xn, respectively.
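Step 3 above (estimating the coefficients) is typically done by least squares, solving the normal equations (X'X)b = X'y. A self-contained sketch for two predictors, using made-up data generated exactly by y = 1 + 2*x1 + 3*x2 so the estimates can be checked by eye:

```python
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def multiple_regression(x1, x2, y):
    # Least-squares estimates for y = b0 + b1*x1 + b2*x2 via the
    # normal equations (X'X) b = X'y.
    X = [[1.0, a, b] for a, b in zip(x1, x2)]
    XtX = [[sum(row[r] * row[c] for row in X) for c in range(3)]
           for r in range(3)]
    Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(3)]
    return solve(XtX, Xty)

# Hypothetical data generated exactly by y = 1 + 2*x1 + 3*x2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
b0, b1, b2 = multiple_regression(x1, x2, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))  # recovers approximately 1, 2, 3
```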
What are underlying assumptions of linear regression model?
Linear regression is a statistical method used to model the linear
relationship between a dependent variable and one or more
independent variables. There are several underlying assumptions of the
linear regression model:
Linearity: There is a linear relationship between the dependent
variable and the independent variables.
Independence of errors: The errors (residuals) of the model are
independent of each other.
Homoscedasticity: The errors have constant variance
(homoscedasticity).
Normality of errors: The errors are normally distributed.
Absence of multicollinearity: There is no multicollinearity among the
independent variables.
It is important to check whether these assumptions are met before
fitting a linear regression model to the data, as violating these
assumptions can lead to incorrect model estimates and predictions.
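A minimal first check on these assumptions is to inspect the residuals of a fitted model. The sketch below fits a one-predictor model to hypothetical data; with an intercept in the model, ordinary least-squares residuals always sum to zero, and patterns or changing spread in them would signal violations of linearity or homoscedasticity:

```python
def fit_simple(x, y):
    # Ordinary least squares for y = b0 + b1*x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical data, roughly y = 2*x.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
b0, b1 = fit_simple(x, y)
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
# With an intercept, OLS residuals sum to zero (up to rounding).
print(abs(sum(residuals)) < 1e-9)  # True
```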
What do you mean by standard error of estimate? Write down the role of it in regression analysis.
The standard error of estimate is a measure of the variability of the
predicted values around the true value of the criterion variable. It is an
estimate of the standard deviation of the error or difference between
the observed values of the criterion variable and the predicted values.
The standard error of estimate plays an important role in regression
analysis because it helps to determine the accuracy of the predictions
made by the regression model. A low standard error of estimate
indicates that the predicted values are close to the true values, while a
high standard error of estimate indicates that the predicted values are
far from the true values.
The standard error of estimate is also used to construct confidence
intervals around the predicted values. This allows researchers to
estimate the range of values within which the true value of the
criterion variable is likely to fall with a certain level of confidence.
The standard error of estimate can be used to compare the performance
of different regression models and to choose the model that provides
the best prediction accuracy. It can also be used to test the statistical
significance of the regression coefficients and to determine whether
the predictor variables have a significant effect on the criterion
variable.
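For a one-predictor model, the standard error of estimate is SEE = sqrt(SSE / (n - 2)), where SSE is the sum of squared differences between observed and predicted values and n - 2 is the degrees of freedom. A pure-Python sketch with made-up data:

```python
import math

def fit_simple(x, y):
    # Ordinary least squares for y = b0 + b1*x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

def standard_error_of_estimate(x, y):
    # SEE = sqrt(SSE / (n - 2)) for a one-predictor model.
    b0, b1 = fit_simple(x, y)
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))

# Hypothetical data, roughly y = 2*x: predictions are close to the
# observations, so the SEE is small.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.0, 8.1, 9.9]
print(standard_error_of_estimate(x, y))  # small, roughly 0.11
```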
What do you mean by coefficient of determination? How is it different from correlation coefficient?
The coefficient of determination, also known as the R-squared or the
squared correlation coefficient, is a measure of the strength of the
relationship between a predictor variable and a criterion variable. It is
a statistical measure that indicates the proportion of the variance in the
criterion variable that is explained by the predictor variable.
The coefficient of determination is calculated as the square of the Pearson correlation coefficient, which is a measure of the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient can range from -1 to 1, where -1 indicates a perfect negative relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive relationship.
The coefficient of determination can range from 0 to 1, where 0
indicates that the predictor variable does not explain any of the
variance in the criterion variable, and 1 indicates that the predictor
variable explains all of the variance in the criterion variable.
The coefficient of determination is different from the Pearson
correlation coefficient in that it represents the proportion of variance
explained, rather than the strength and direction of the relationship. It
is a useful measure of the fit of the regression model because it
provides a single value that summarizes the overall goodness of fit of
the model.
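The identity behind this distinction, R^2 = r^2 for a one-predictor model, can be checked numerically. In the sketch below (hypothetical data), R^2 is computed both as 1 - SSE/SST from the fitted line and as the square of the Pearson correlation, and the two agree:

```python
import math

def pearson_r(x, y):
    # Strength and direction of the linear relationship.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_squared(x, y):
    # Proportion of variance in y explained by the least-squares line.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1 - sse / sst

# Hypothetical data, roughly y = 2*x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.0, 8.1, 9.9]
r = pearson_r(x, y)
print(round(r ** 2, 6), round(r_squared(x, y), 6))  # both approximately 0.999083
```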