Stats
What do you mean by sampling distribution? Why is this study important in statistics? A sampling distribution is the distribution of a statistic (such as the mean) computed from a sample of a population. It describes how the statistic is likely to vary across different samples from the same population.
Studying the sampling distribution of a statistic is important in statistics because it allows us to understand the behavior of the statistic and to make inferences about the population based on samples from it. For example, if we know the sampling distribution of the mean of a certain population, we can use this information to construct confidence intervals around the sample mean, which can then be used to make inferences about the population mean.
The sampling distribution of a statistic is often approximated using the central limit theorem, which states that, given a sufficiently large sample size, the distribution of the statistic will be approximately normal, regardless of the shape of the distribution of the population. This allows us to use methods based on the normal distribution to make inferences about the population from samples.
Studying the sampling distribution of a statistic is also important in hypothesis testing, where we use the sampling distribution to determine the likelihood of observing a particular sample statistic if a certain hypothesis about the population is true.

Central Limit Theorem : The central limit theorem states that whenever a random sample of size n is taken from any distribution with finite mean and variance, the sample mean will be approximately normally distributed, with mean equal to the population mean and variance equal to the population variance divided by n. The larger the sample size, the better the approximation to the normal.
Consider x1, x2, x3, ..., xn independent and identically distributed with mean μ and finite variance σ², and define the random variable
Zn = (x̄ − μ)/(σ/√n) = (1/√n)(U1 + U2 + ... + Un), where Ui = (xi − μ)/σ.
Then the distribution function of Zn converges to the standard normal distribution function as n increases without bound. Since the xi are independent random variables, the Ui are also independent. This implies that the moment generating function of Zn is M(t) = [MU(t/√n)]ⁿ. Writing MU(t/√n) = 1 + x, where x = t²/(2n) plus terms of higher order in 1/n, and using the expansion ln(1 + x) = x − x²/2 + x³/3 − ...,
ln M(t) = n ln(1 + x) = n(x − x²/2 + x³/3 − ...).
Multiplying each term by n and letting n → ∞, all terms but the first go to zero, so ln M(t) → t²/2 and M(t) → e^(t²/2), which is the moment generating function of a standard normal random variable.
Example: The weights of a female population follow a normal distribution with mean 65 kg and standard deviation 14 kg. If a researcher considers the records of 50 females, what is the standard deviation of the chosen sample's mean?
Solution:
Mean of the population: μ = 65 kg
Standard deviation of the population: σ = 14 kg
Sample size: n = 50
The standard deviation of the sample mean is
σx̄ = σ/√n = 14/√50 = 14/7.071 ≈ 1.98 kg.

Explain the statement "The central limit theorem forms the basis of inferential statistics" : The central limit theorem states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately normally distributed. This result holds regardless of the shape of the distribution of the population from which the samples are drawn.
In inferential statistics, we are often interested in estimating the value of a population parameter (such as the mean) based on a sample from the population. The central limit theorem tells us that, as long as the sample size is large enough, the distribution of the sample mean will be approximately normal, which means we can use normal-distribution-based methods to make inferences about the population parameter.
For example, if we want to estimate the mean income of a certain population, we can take a sample of incomes from that population, calculate the mean of the sample, and use the central limit theorem to construct a confidence interval around that mean. This interval gives a range of values that we can be confident contains the true mean of the population, based on the normal distribution of the sample mean.
So the central limit theorem forms the basis of inferential statistics because it provides a way to use the sample mean to make inferences about the population mean, by assuming that the sample mean is approximately normally distributed.

The Sampling Distribution of the Mean is the distribution of sample means over all possible samples drawn from the population. If the population distribution is normal, the sampling distribution of the mean is normal for samples of all sizes. Following are the main properties of the sampling distribution of the mean:
1. Its mean is equal to the population mean: μx̄ = μ (x̄ = sample mean, μ = population mean).
2. Its standard deviation is the population standard deviation divided by the square root of the sample size: σx̄ = σ/√n (σ = population standard deviation, n = sample size).
3. The sampling distribution of the mean is normally distributed. That is, the distribution of sample means for a large sample size is normally distributed irrespective of the shape of the population, provided the population standard deviation (σ) is finite. Generally, a sample size of 30 or more is considered large for statistical purposes. If the population itself is normal, the distribution of sample means is normal irrespective of the sample size.

The Sampling Distribution of Proportion measures the proportion of successes, i.e. the chance of occurrence of certain events, by dividing the number of successes by the sample size n. Thus the sample proportion is defined as p = x/n.
The sampling distribution of proportion obeys the binomial probability law if the random sample of size n is obtained with replacement. If the population is infinite and the probability of occurrence of an event is π, then the probability of non-occurrence of the event is (1 − π). Now consider all possible samples of size n drawn from the population and estimate the proportion p of successes for each. Then the mean (μp) and the standard deviation (σp) of the sampling distribution of proportion can be obtained as:
μp = π, where π is the population proportion, defined as π = X/N (X = number of elements that possess a certain characteristic, N = total number of items in the population);
σp = √(π(1 − π)/n), the standard error of proportion, which measures the chance variation of sample proportions from sample to sample (n = sample size).
If the sample size is large (n ≥ 30), the sampling distribution of proportion is likely to be normally distributed. When the population is finite and the sampling is made without replacement, the standard error includes the finite-population correction:
σp = √(π(1 − π)/n) × √((N − n)/(N − 1)).

Explain how sample size affects the margin of error with examples : The margin of error is a measure of the precision of an estimate. It is the amount by which an estimate may differ from the true value of the parameter being estimated. The margin of error is usually calculated as a function of the sample size, the standard deviation of the sample, and the desired level of confidence.
As the sample size increases, the margin of error decreases. This is because a larger sample size provides more information about the population, which leads to a more precise estimate of the population parameter. For example, consider a survey that aims to estimate the proportion of a population that supports a particular candidate in an election. The survey is conducted by randomly sampling 1000 people from the population, and 55% of the sample supports the candidate. The margin of error for this estimate at 95% confidence is:
Margin of error = 1.96 × √(0.55 × 0.45 / 1000) ≈ 0.03
This means that we can be 95% confident that the true proportion of people in the population who support the candidate is somewhere between 52% and 58%. Now consider the same survey, but with a sample size of 100 people instead of 1000, where 52% of the sample supports the candidate. The margin of error for this estimate is now:
Margin of error = 1.96 × √(0.52 × 0.48 / 100) ≈ 0.10
This means that we can be 95% confident that the true proportion of people in the population who support the candidate is somewhere between 42% and 62%.
As you can see, the margin of error is larger when the sample size is smaller, so the estimate is less precise.

Explain different criteria of a good estimator : There are several criteria that can be used to evaluate the quality of an estimator, including:
Unbiasedness: An unbiased estimator is one whose mean is equal to the true value of the parameter being estimated. A biased estimator consistently over- or under-estimates the true value of the parameter.
Efficiency: An efficient estimator is one that has a small variance, meaning it is less prone to producing extreme or improbable estimates.
Consistency: A consistent estimator is one that becomes increasingly accurate as the sample size increases.
Sufficiency: A sufficient estimator is one that makes use of all of the information in the sample in order to produce the estimate.
Robustness: A robust estimator is one that is not sensitive to small deviations from the assumptions that were made in deriving the estimator.
Range: The range of an estimator is the set of values that the estimator can take on. An estimator with a narrow range is preferred because it is less likely to produce extreme or implausible estimates.
Interval estimator: An interval estimator provides a range of values within which the true value of the parameter is likely to lie. An interval estimator that is narrow and accurate is preferred.
Point estimator: A point estimator provides a single estimate of the value of the parameter. A point estimator that is accurate and has a small variance is preferred.

Unit 3:

Parametric Tests | Non-Parametric Tests
Greater statistical power. | Lower statistical power.
Applied to normal or interval variables. | Applied to categorical variables.
Used for large samples. | Used for small samples.
The data distribution is normal. | The form of the data distribution is not known.
Make a lot of assumptions. | Don't make many assumptions.
Demand a higher condition of validity. | Require a lower validity condition.
Less chance of errors. | Higher probability of errors.
Hypotheses are based on numerical data. | Hypotheses are based on ranks, the median, and data frequencies.
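The convergence claimed by the central limit theorem can be checked with a short simulation: draw many samples from a clearly skewed distribution, standardize each sample mean as Zn = (x̄ − μ)/(σ/√n), and confirm that the resulting values have mean near 0 and variance near 1, as a standard normal would. A minimal sketch (the exponential distribution, sample size, and repetition count are illustrative choices):

```python
import math
import random

random.seed(0)

mu, sigma = 1.0, 1.0   # mean and standard deviation of Exponential(1), a skewed distribution
n, reps = 50, 20000    # sample size and number of repeated samples

# Standardize each sample mean: Zn = (xbar - mu) / (sigma / sqrt(n))
z = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z.append((xbar - mu) / (sigma / math.sqrt(n)))

mean_z = sum(z) / reps
var_z = sum((v - mean_z) ** 2 for v in z) / reps
print(round(mean_z, 2), round(var_z, 2))  # both should be close to 0 and 1
```

Repeating this with larger n pushes the distribution of Zn closer to the standard normal, which is exactly the statement of the theorem.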
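The two margin-of-error calculations in the survey example follow directly from the formula z × √(p(1 − p)/n); they can be reproduced in a few lines (the function name is just for illustration; 1.96 is the z-value for 95% confidence):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion p from n respondents,
    at the confidence level implied by z (default: 95%)."""
    return z * math.sqrt(p * (1 - p) / n)

# n = 1000, sample proportion 0.55
print(round(margin_of_error(0.55, 1000), 2))  # 0.03
# n = 100, sample proportion 0.52
print(round(margin_of_error(0.52, 100), 2))   # 0.10
```

Since n appears under the square root in the denominator, quadrupling the sample size only halves the margin of error.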
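The unbiasedness criterion for estimators can be illustrated with the two common estimators of a population variance: dividing the sum of squared deviations by n gives a biased estimator (its expectation is (n − 1)/n × σ²), while dividing by n − 1 gives an unbiased one. A small simulation sketch (the normal population, sample size, and repetition count are arbitrary choices):

```python
import random

random.seed(1)
sigma2 = 4.0        # true population variance (normal population with sd = 2)
n, reps = 5, 100000

biased_sum, unbiased_sum = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(0.0, 2.0) for _ in range(n)]
    m = sum(x) / n
    ss = sum((v - m) ** 2 for v in x)
    biased_sum += ss / n          # expectation (n-1)/n * sigma2 = 3.2: biased
    unbiased_sum += ss / (n - 1)  # expectation sigma2 = 4.0: unbiased

print(round(biased_sum / reps, 1), round(unbiased_sum / reps, 1))
```

Averaged over many samples, the /n estimator settles below the true variance while the /(n − 1) estimator settles on it, which is what "unbiased" means.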
What do you mean by non-parametric test? Write down advantages of non-parametric tests over parametric tests: Nonparametric tests are statistical tests that do not assume a particular shape or distribution for the data. They are generally less powerful than parametric tests, which do make assumptions about the data, but they are more flexible and can be applied to a wider range of data sets.
There are several advantages of nonparametric tests over parametric tests:
1.Nonparametric tests do not require assumptions about the underlying distribution of the data, which makes them more robust and less sensitive to deviations from these assumptions.
2.Nonparametric tests are generally easier to understand and interpret, as they do not rely on complex statistical models or assumptions.
3.Nonparametric tests can be applied to data that are not normally distributed, or to data that are ordinal (ranked) rather than continuous.
4.Nonparametric tests can be used when the sample size is small, as they do not require large sample sizes to be effective.
5.Nonparametric tests are less sensitive to the presence of outliers or extreme values in the data.
6.Nonparametric tests can be used when the data are collected in a way that does not allow for the calculation of means and standard deviations (e.g., if the data are categorical rather than continuous).

Discuss assumptions of non-parametric tests :
There are generally fewer assumptions associated with nonparametric tests than with parametric tests. However, some assumptions are still commonly made:
Independence: The observations in the sample should be independent of one another.
Random sampling: The sample should be drawn randomly from the population.
Equal sample sizes: Nonparametric tests often assume that the sample sizes in each group being compared are equal. When the sample sizes are not equal, the results of the test may not be reliable.
Normality: Some nonparametric tests rely on the normal distribution indirectly, even though they do not assume normality of the data explicitly (e.g., the large-sample form of the Wilcoxon rank-sum test uses a normal approximation to the distribution of the rank sum).
Symmetry: Some nonparametric tests assume that the distribution of the data is symmetrical. This assumption may be violated if the data are heavily skewed.
Equal variances: Some nonparametric tests assume that the variances of the groups being compared are equal. When the variances are not equal, the results of the test may not be reliable.
It is important to carefully consider these assumptions when choosing and applying a nonparametric test, as violations of these assumptions can affect the reliability of the test results.

Write down steps of non-parametric test :
The steps of a nonparametric test are generally similar to those of a parametric test, with some key differences:
Define the research question and null hypothesis: As with any statistical test, it is important to start by defining the research question and the corresponding null hypothesis that you are testing.
Select the appropriate nonparametric test: Based on the research question and the nature of the data, choose the appropriate nonparametric test.
Check the assumptions of the nonparametric test: Make sure that the assumptions of the nonparametric test are satisfied by the data. This may include checking for independence, random sampling, equal sample sizes, normality, symmetry, and equal variances, depending on the specific test being used.
Calculate the test statistic: Use the formula for the chosen nonparametric test to calculate the test statistic.
Determine the p-value: Use the test statistic to determine the p-value, which is the probability of observing a result as extreme or more extreme than the one observed, given that the null hypothesis is true.
Interpret the results: Compare the p-value to the chosen significance level (usually 0.05). If the p-value is less than the significance level, the null hypothesis can be rejected and the alternative hypothesis accepted. If the p-value is greater than the significance level, the null hypothesis cannot be rejected.

Advantages and disadvantages of non-parametric tests :
Nonparametric tests are statistical tests that do not assume a specific distribution for the data. Here are some advantages and disadvantages of nonparametric tests:
Advantages:
1.Nonparametric tests do not require assumptions about the underlying distribution of the data, so they can be used with a variety of data types.
2.They are often more robust than parametric tests, meaning they can still be reliable even if the assumptions of the parametric test are not met.
Disadvantages:
1.Nonparametric tests may have lower statistical power than parametric tests, meaning they may be less able to detect differences between groups.
2.Fewer nonparametric tests are available, so there may be less flexibility in analyzing the data.
3.The interpretation of nonparametric tests can be more difficult than that of parametric tests because the results are not presented in terms of the familiar normal distribution.

What is the binomial test? Why is it used? The binomial test is a statistical test used to determine whether the probability of success in a binary outcome (such as a "success" or "failure") is different from a hypothesized value. It is based on the binomial distribution, which describes the probability of a specific number of successes in a fixed number of trials, given a probability of success on each trial.
The binomial test is often used in situations where there are two possible outcomes (e.g., "success" or "failure") and the probability of success is known or can be estimated. For example, the binomial test might be used to determine whether the probability of winning a game is different from 50%, or to determine whether the probability of a product defect is different from a certain value.
Method of binomial test : To conduct a binomial test, the following steps are typically followed:
Define the research question and null hypothesis: As with any statistical test, it is important to start by defining the research question and the corresponding null hypothesis that you are testing.
Determine the hypothesized probability of success: Determine the hypothesized probability of success that is being tested against.
Collect the data: Collect data on the number of successes and failures in a fixed number of trials.
Calculate the test statistic: Use the formula for the binomial test to calculate the test statistic.
Determine the p-value: Use the test statistic to determine the p-value, which is the probability of observing a result as extreme or more extreme than the one observed, given that the null hypothesis is true.
Interpret the results: Compare the p-value to the chosen significance level (usually 0.05). If the p-value is less than the significance level, the null hypothesis can be rejected and the alternative hypothesis accepted. If the p-value is greater than the significance level, the null hypothesis cannot be rejected.

Discuss the process of the Kolmogorov-Smirnov test.
The Kolmogorov-Smirnov (K-S) test is a nonparametric statistical test used to determine whether two samples come from the same population. It is often used when the assumptions for parametric tests are not met, such as when the data is not normally distributed.
To perform the K-S test, follow these steps:
1.Calculate the sample size for each group.
2.Calculate the cumulative distribution function (CDF) for each sample. The CDF is a function that describes the probability that a random variable will be less than or equal to a certain value.
3.Calculate the maximum absolute difference between the two CDFs. This is known as the K-S statistic.
4.Use a table or computer software to determine the p-value for the K-S statistic.
5.Compare the p-value to the alpha level (typically 0.05) to determine whether the samples come from the same population.
The K-S test is a powerful tool for comparing two samples, but it is sensitive to sample size. When the sample sizes are small, the test may not have enough power to detect a difference between the samples. In this case, it may be necessary to use a different statistical test or to increase the sample size.

What do you mean by median test? Describe the procedure of the median test.
The median test is a nonparametric statistical test used to determine whether two samples come from populations with a common median. It is often used when the assumptions for parametric tests are not met, such as when the data is not normally distributed.
To perform the median test, follow these steps:
1.Determine the sample size for each group.
2.Combine the data from both groups and rank it from lowest to highest value.
3.Determine the grand median, the value at the middle position of the combined ranked data.
4.Count, for each group, how many observations fall above and below the grand median, and arrange the counts in a 2×2 table.
5.Use a chi-square statistic (from a table or computer software) to determine the p-value for the table.
6.Compare the p-value to the alpha level (typically 0.05) to determine whether the difference between the medians is statistically significant.
It is important to note that the median test can only be used to compare the medians of two groups, not the means. If you want to compare the means of two groups, you will need to use a different statistical test.

Describe the function and procedure of the chi-square test for goodness of fit. The chi-square test for goodness of fit is a statistical test used to determine whether a sample comes from a population with a specific distribution. It is often used to test whether a sample is consistent with a hypothesis about the population distribution.
To perform the chi-square test for goodness of fit, follow these steps:
1.Determine the sample size and the number of categories or bins in the data.
2.Calculate the expected frequency for each category or bin based on the hypothesized population distribution.
3.Calculate the observed frequency for each category or bin based on the sample data.
4.Calculate the chi-square statistic using the formula:
chi-square = sum((observed frequency − expected frequency)² / expected frequency)
5.Use a table or computer software to determine the p-value for the chi-square statistic.
6.Compare the p-value to the alpha level (typically 0.05) to determine whether the sample comes from a population with the hypothesized distribution.
It is important to note that the chi-square test for goodness of fit assumes that the sample is a random and representative sample from the population. If the sample is not representative, the test may not be accurate. Additionally, the chi-square test is sensitive to sample size, and may not be powerful when the sample size is small. In these cases, it may be necessary to use a different statistical test or to increase the sample size.

Discuss the rationale and method of the chi-square test for independence of attributes.
The chi-square test for independence of attributes is a statistical test used to determine whether there is a relationship between two categorical variables. It is often used to test whether there is an association between the two variables, or whether one variable is independent of the other.
To perform the chi-square test for independence of attributes, follow these steps:
1.Determine the sample size and the number of categories or bins in each variable.
2.Construct a contingency table that shows the frequency of each combination of categories or bins in the two variables.
3.Calculate the expected frequency for each combination of categories or bins based on the assumption of independence between the two variables.
4.Calculate the observed frequency for each combination of categories or bins based on the sample data.
5.Calculate the chi-square statistic using the formula:
chi-square = sum((observed frequency − expected frequency)² / expected frequency)
6.Use a table or computer software to determine the p-value for the chi-square statistic.
7.Compare the p-value to the alpha level (typically 0.05) to determine whether there is a relationship between the two variables.
The chi-square test for independence of attributes is a powerful tool for testing relationships between categorical variables, but it is sensitive to sample size. When the sample size is small, the test may not have enough power to detect a relationship between the variables. In this case, it may be necessary to use a different statistical test or to increase the sample size. It is also important to note that the chi-square test assumes that the sample is a random and representative sample from the population. If the sample is not representative, the test may not be accurate.

Differentiate between the chi-square test and the K-S test for uniformity. The chi-square test is used to test whether the distribution of nominal variables matches a hypothesized distribution (as well as other distribution matches), whereas the Kolmogorov-Smirnov (K-S) test is used to test the goodness of fit for continuous data.
The K-S test compares the continuous cdf, F(x), of the uniform distribution to the empirical cdf, SN(x), of the sample of N observations. If the sample from the random-number generator is R1, R2, ..., RN, then the empirical cdf, SN(x), is defined by
SN(x) = (number of R1, R2, ..., RN that are ≤ x) / N.
As N becomes larger, SN(x) should become a better approximation to F(x), provided that the null hypothesis is true. The Kolmogorov-Smirnov test is based on the largest absolute deviation or difference between F(x) and SN(x) over the range of the random variable, i.e. it is based on the statistic
D = max |F(x) − SN(x)|.
The chi-square test uses the sample statistic
χ² = Σ (Oi − Ei)² / Ei,
where Oi is the observed frequency in the i-th class and Ei is the expected frequency in that class.
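The binomial-test steps above can be sketched as an exact two-sided test computed directly from the binomial probability mass function. This follows the common "minimum-likelihood" two-sided convention (sum the probabilities of all outcomes no more likely than the observed one); the data here are made up, testing whether a success probability differs from 0.5:

```python
from math import comb

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n trials with success probability p
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_test_two_sided(k, n, p0):
    """Exact two-sided binomial test: sum the probabilities of all
    outcomes no more likely than the observed one."""
    p_obs = binomial_pmf(k, n, p0)
    return sum(binomial_pmf(i, n, p0) for i in range(n + 1)
               if binomial_pmf(i, n, p0) <= p_obs + 1e-12)

# Observed: 60 successes in 100 trials, H0: p = 0.5
p_value = binomial_test_two_sided(60, 100, 0.5)
print(round(p_value, 3))
```

With these numbers the p-value comes out slightly above 0.05, so the null hypothesis of a fair process is (narrowly) not rejected at the usual significance level.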
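The goodness-of-fit steps above can be sketched for testing whether a die is fair. The observed counts are made up; the hypothesized distribution puts equal expected frequency in each of the six categories:

```python
observed = [22, 17, 18, 25, 16, 22]   # made-up counts from 120 rolls of a die
n = sum(observed)
expected = [n / 6] * 6                # fair-die hypothesis: equal frequencies

# chi-square = sum((observed - expected)^2 / expected)
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 3.1

# Compare with the chi-square critical value for 6 - 1 = 5 degrees of
# freedom at alpha = 0.05 (11.07): 3.1 < 11.07, so H0 is not rejected.
```

In practice the p-value would come from a chi-square table or software, as step 5 describes; comparing the statistic with the critical value is the equivalent shortcut.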
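The independence-of-attributes steps above can be sketched on a 2×2 contingency table with made-up data. The expected frequency for each cell under independence is (row total × column total) / n:

```python
# 2x2 contingency table: rows and columns are two categorical attributes
table = [[30, 20],
         [20, 30]]   # made-up observed frequencies

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n  # independence assumption
        chi_square += (observed - expected) ** 2 / expected

print(chi_square)  # 4.0
# 4.0 exceeds 3.84 (chi-square, 1 df, alpha = 0.05), so independence is rejected.
```

The degrees of freedom for an r × c table are (r − 1)(c − 1), which is 1 here.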
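The K-S statistic for uniformity described above can be computed directly: sort the sample, compare the empirical cdf SN(x) with F(x) = x, and take the largest absolute deviation. A sketch with a made-up sample of N = 5 random numbers:

```python
# K-S statistic for testing uniformity of random numbers on [0, 1].
r = sorted([0.44, 0.81, 0.14, 0.05, 0.93])  # made-up sample
N = len(r)

# For sorted data R(1) <= ... <= R(N), the largest deviation occurs at a
# data point: D+ = max_i (i/N - R(i)),  D- = max_i (R(i) - (i-1)/N)
d_plus = max((i + 1) / N - x for i, x in enumerate(r))
d_minus = max(x - i / N for i, x in enumerate(r))
D = max(d_plus, d_minus)
print(round(D, 3))  # 0.26
```

The computed D would then be compared with the tabulated K-S critical value for N = 5 at the chosen alpha; D below the critical value means uniformity is not rejected.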
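One common form of the median test (Mood's median test) classifies each observation as above or below the grand median of the combined data and applies a chi-square statistic to the resulting 2×2 table. A sketch with made-up samples:

```python
group_a = [12, 15, 11, 18, 14, 16]
group_b = [22, 19, 13, 25, 21, 17]   # made-up data

combined = sorted(group_a + group_b)
m = len(combined)
grand_median = (combined[m // 2 - 1] + combined[m // 2]) / 2  # even m

# 2x2 table: counts above / below the grand median per group
table = []
for g in (group_a, group_b):
    above = sum(1 for v in g if v > grand_median)
    below = sum(1 for v in g if v < grand_median)
    table.append([above, below])

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
n = sum(row_totals)
chi_square = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
                 / (row_totals[i] * col_totals[j] / n)
                 for i in range(2) for j in range(2))
print(round(chi_square, 2))
```

If both groups shared a common median, roughly half of each group would fall on each side of the grand median; a large chi-square value (here compared against 3.84 for 1 df at alpha = 0.05) signals that the medians differ.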