Stats
What do you mean by sampling distribution? Why is this study important in statistics? A sampling distribution is the distribution of a statistic (such as the mean) computed from a sample of a population. It describes how the statistic is likely to vary across different samples from the same population.
Studying the sampling distribution of a statistic is important in statistics because it allows us to understand the behavior of the statistic and to make inferences about the population based on samples from it. For example, if we know the sampling distribution of the mean of a certain population, we can use this information to construct confidence intervals around the sample mean, which can then be used to make inferences about the population mean.
The sampling distribution of a statistic is often approximated using the central limit theorem, which states that, given a sufficiently large sample size, the distribution of the statistic will be approximately normal, regardless of the shape of the distribution of the population. This allows us to use methods based on the normal distribution to make inferences about the population from samples.
Studying the sampling distribution of a statistic is also important in hypothesis testing, where we use the sampling distribution to determine the likelihood of observing a particular sample statistic if a certain hypothesis about the population is true.

Central Limit Theorem : The central limit theorem states that whenever a random sample of size n is taken from any distribution with finite mean and variance, the sample mean will be approximately normally distributed, with mean equal to the population mean and variance equal to the population variance divided by n. The larger the sample size, the better the approximation to the normal.
Consider x1, x2, x3, ..., xn independent and identically distributed with mean μ and finite variance σ², and define the random variable
Zn = (x̄ − μ)/(σ/√n) = (1/√n)(U1 + U2 + ... + Un), where Ui = (xi − μ)/σ.
Then the distribution function of Zn converges to the standard normal distribution function as n increases without bound. Since the xi are independent random variables, the Ui are also independent. This implies that the moment generating function of Zn is M(t) = [MU(t/√n)]ⁿ. Writing MU(t/√n) = 1 + x, where x = t²/(2n) plus terms of higher order in 1/n, and using the expansion ln(1 + x) = x − x²/2 + x³/3 − ...,
ln M(t) = n ln(1 + x) = n(x − x²/2 + x³/3 − ...).
Multiplying each term by n and letting n → ∞, all terms but the first go to zero, so ln M(t) → t²/2 and M(t) → e^(t²/2), which is the moment generating function of a standard normal random variable.
Example: The weights of a female population follow a normal distribution with mean 65 kg and standard deviation 14 kg. If a researcher considers the records of 50 females, what is the standard deviation of the chosen sample's mean?
Solution:
Mean of the population: μ = 65 kg
Standard deviation of the population: σ = 14 kg
Sample size: n = 50
The standard deviation of the sample mean is
σx̄ = σ/√n = 14/√50 = 14/7.071 ≈ 1.98 kg.

Explain the statement "The central limit theorem forms the basis of inferential statistics" : The central limit theorem states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately normally distributed. This result holds regardless of the shape of the distribution of the population from which the samples are drawn.
In inferential statistics, we are often interested in estimating the value of a population parameter (such as the mean) based on a sample from the population. The central limit theorem tells us that, as long as the sample size is large enough, the distribution of the sample mean will be approximately normal, which means we can use normal-distribution-based methods to make inferences about the population parameter.
For example, if we want to estimate the mean income of a certain population, we can take a sample of incomes from that population, calculate the mean of the sample, and use the central limit theorem to construct a confidence interval around that mean. This interval gives a range of values that we can be confident contains the true mean of the population, based on the normal distribution of the sample mean.
So the central limit theorem forms the basis of inferential statistics because it provides a way to use the sample mean to make inferences about the population mean, by assuming that the sample mean is approximately normally distributed.

The Sampling Distribution of the Mean is the distribution of sample means over all possible samples drawn from the population. If the population distribution is normal, the sampling distribution of the mean is normal for samples of all sizes. Following are the main properties of the sampling distribution of the mean:
1. Its mean is equal to the population mean: μx̄ = μ (x̄ = sample mean, μ = population mean).
2. Its standard deviation is the population standard deviation divided by the square root of the sample size: σx̄ = σ/√n (σ = population standard deviation, n = sample size).
3. The sampling distribution of the mean is normally distributed. That is, the distribution of sample means for a large sample size is normally distributed irrespective of the shape of the population, provided the population standard deviation (σ) is finite. Generally, a sample size of 30 or more is considered large for statistical purposes. If the population itself is normal, the distribution of sample means is normal irrespective of the sample size.

The Sampling Distribution of Proportion measures the proportion of successes, i.e. the chance of occurrence of certain events, by dividing the number of successes by the sample size n. Thus the sample proportion is defined as p = x/n.
The sampling distribution of proportion obeys the binomial probability law if the random sample of size n is obtained with replacement. If the population is infinite and the probability of occurrence of an event is π, then the probability of non-occurrence of the event is (1 − π). Now consider all possible samples of size n drawn from the population and estimate the proportion p of successes for each. Then the mean (μp) and the standard deviation (σp) of the sampling distribution of proportion can be obtained as:
μp = π, where π is the population proportion, defined as π = X/N (X = number of elements that possess a certain characteristic, N = total number of items in the population);
σp = √(π(1 − π)/n), the standard error of proportion, which measures the chance variation of sample proportions from sample to sample (n = sample size).
If the sample size is large (n ≥ 30), the sampling distribution of proportion is likely to be normally distributed. When the population is finite and the sampling is made without replacement, the standard error includes the finite-population correction:
σp = √(π(1 − π)/n) × √((N − n)/(N − 1)).

Explain how sample size affects the margin of error with examples : The margin of error is a measure of the precision of an estimate. It is the amount by which an estimate may differ from the true value of the parameter being estimated. The margin of error is usually calculated as a function of the sample size, the standard deviation of the sample, and the desired level of confidence.
As the sample size increases, the margin of error decreases. This is because a larger sample size provides more information about the population, which leads to a more precise estimate of the population parameter. For example, consider a survey that aims to estimate the proportion of a population that supports a particular candidate in an election. The survey is conducted by randomly sampling 1000 people from the population, and 55% of the sample supports the candidate. The margin of error for this estimate at 95% confidence is:
Margin of error = 1.96 × √(0.55 × 0.45 / 1000) ≈ 0.03
This means that we can be 95% confident that the true proportion of people in the population who support the candidate is somewhere between 52% and 58%. Now consider the same survey, but with a sample size of 100 people instead of 1000, where 52% of the sample supports the candidate. The margin of error for this estimate is now:
Margin of error = 1.96 × √(0.52 × 0.48 / 100) ≈ 0.10
This means that we can be 95% confident that the true proportion of people in the population who support the candidate is somewhere between 42% and 62%.
As you can see, the margin of error is larger when the sample size is smaller, so the estimate is less precise.

Explain different criteria of a good estimator : There are several criteria that can be used to evaluate the quality of an estimator, including:
Unbiasedness: An unbiased estimator is one whose mean is equal to the true value of the parameter being estimated. A biased estimator consistently over- or under-estimates the true value of the parameter.
Efficiency: An efficient estimator is one that has a small variance, meaning it is less prone to producing extreme or improbable estimates.
Consistency: A consistent estimator is one that becomes increasingly accurate as the sample size increases.
Sufficiency: A sufficient estimator is one that makes use of all of the information in the sample in order to produce the estimate.
Robustness: A robust estimator is one that is not sensitive to small deviations from the assumptions that were made in deriving the estimator.
Range: The range of an estimator is the set of values that the estimator can take on. An estimator with a narrow range is preferred because it is less likely to produce extreme or implausible estimates.
Interval estimator: An interval estimator provides a range of values within which the true value of the parameter is likely to lie. An interval estimator that is narrow and accurate is preferred.
Point estimator: A point estimator provides a single estimate of the value of the parameter. A point estimator that is accurate and has a small variance is preferred.

Unit 3:

Parametric Tests | Non-Parametric Tests
Greater statistical power. | Lower statistical power.
Applied to normal or interval variables. | Applied to categorical variables.
Used for large samples. | Used for small samples.
The data distribution is normal. | The form of the data distribution is not known.
Make a lot of assumptions. | Don't make many assumptions.
Demand a higher condition of validity. | Require a lower validity condition.
Less chance of errors. | Higher probability of errors.
Hypotheses are based on numerical data. | Hypotheses are based on ranks, the median, and data frequencies.
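The convergence claimed by the central limit theorem can be checked with a short simulation: draw many samples from a clearly skewed distribution, standardize each sample mean as Zn = (x̄ − μ)/(σ/√n), and confirm that the resulting values have mean near 0 and variance near 1, as a standard normal would. A minimal sketch (the exponential distribution, sample size, and repetition count are illustrative choices):

```python
import math
import random

random.seed(0)

mu, sigma = 1.0, 1.0   # mean and standard deviation of Exponential(1), a skewed distribution
n, reps = 50, 20000    # sample size and number of repeated samples

# Standardize each sample mean: Zn = (xbar - mu) / (sigma / sqrt(n))
z = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z.append((xbar - mu) / (sigma / math.sqrt(n)))

mean_z = sum(z) / reps
var_z = sum((v - mean_z) ** 2 for v in z) / reps
print(round(mean_z, 2), round(var_z, 2))  # both should be close to 0 and 1
```

Repeating this with larger n pushes the distribution of Zn closer to the standard normal, which is exactly the statement of the theorem.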
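The two margin-of-error calculations in the survey example follow directly from the formula z × √(p(1 − p)/n); they can be reproduced in a few lines (the function name is just for illustration; 1.96 is the z-value for 95% confidence):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion p from n respondents,
    at the confidence level implied by z (default: 95%)."""
    return z * math.sqrt(p * (1 - p) / n)

# n = 1000, sample proportion 0.55
print(round(margin_of_error(0.55, 1000), 2))  # 0.03
# n = 100, sample proportion 0.52
print(round(margin_of_error(0.52, 100), 2))   # 0.10
```

Since n appears under the square root in the denominator, quadrupling the sample size only halves the margin of error.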
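The unbiasedness criterion for estimators can be illustrated with the two common estimators of a population variance: dividing the sum of squared deviations by n gives a biased estimator (its expectation is (n − 1)/n × σ²), while dividing by n − 1 gives an unbiased one. A small simulation sketch (the normal population, sample size, and repetition count are arbitrary choices):

```python
import random

random.seed(1)
sigma2 = 4.0        # true population variance (normal population with sd = 2)
n, reps = 5, 100000

biased_sum, unbiased_sum = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(0.0, 2.0) for _ in range(n)]
    m = sum(x) / n
    ss = sum((v - m) ** 2 for v in x)
    biased_sum += ss / n          # expectation (n-1)/n * sigma2 = 3.2: biased
    unbiased_sum += ss / (n - 1)  # expectation sigma2 = 4.0: unbiased

print(round(biased_sum / reps, 1), round(unbiased_sum / reps, 1))
```

Averaged over many samples, the /n estimator settles below the true variance while the /(n − 1) estimator settles on it, which is what "unbiased" means.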
What do you mean by non-parametric test? Write down advantages of non-parametric tests over parametric tests: Nonparametric tests are statistical tests that do not assume a particular shape or distribution for the data. They are generally less powerful than parametric tests, which do make assumptions about the data, but they are more flexible and can be applied to a wider range of data sets.
There are several advantages of nonparametric tests over parametric tests:
1.Nonparametric tests do not require assumptions about the underlying distribution of the data, which makes them more robust and less sensitive to deviations from these assumptions.
2.Nonparametric tests are generally easier to understand and interpret, as they do not rely on complex statistical models or assumptions.
3.Nonparametric tests can be applied to data that are not normally distributed, or to data that are ordinal (ranked) rather than continuous.
4.Nonparametric tests can be used when the sample size is small, as they do not require large sample sizes to be effective.
5.Nonparametric tests are less sensitive to the presence of outliers or extreme values in the data.
6.Nonparametric tests can be used when the data are collected in a way that does not allow for the calculation of means and standard deviations (e.g., if the data are categorical rather than continuous).

Discuss assumptions of non-parametric tests :
There are generally fewer assumptions associated with nonparametric tests than with parametric tests. However, some assumptions are still commonly made:
Independence: The observations in the sample should be independent of one another.
Random sampling: The sample should be drawn randomly from the population.
Equal sample sizes: Nonparametric tests often assume that the sample sizes in each group being compared are equal. When the sample sizes are not equal, the results of the test may not be reliable.
Normality: Some nonparametric tests rely on the normal distribution indirectly, even though they do not assume normality of the data explicitly (e.g., the large-sample form of the Wilcoxon rank-sum test uses a normal approximation to the distribution of the rank sum).
Symmetry: Some nonparametric tests assume that the distribution of the data is symmetrical. This assumption may be violated if the data are heavily skewed.
Equal variances: Some nonparametric tests assume that the variances of the groups being compared are equal. When the variances are not equal, the results of the test may not be reliable.
It is important to carefully consider these assumptions when choosing and applying a nonparametric test, as violations of these assumptions can affect the reliability of the test results.

Write down steps of non-parametric test :
The steps of a nonparametric test are generally similar to those of a parametric test, with some key differences:
Define the research question and null hypothesis: As with any statistical test, it is important to start by defining the research question and the corresponding null hypothesis that you are testing.
Select the appropriate nonparametric test: Based on the research question and the nature of the data, choose the appropriate nonparametric test.
Check the assumptions of the nonparametric test: Make sure that the assumptions of the nonparametric test are satisfied by the data. This may include checking for independence, random sampling, equal sample sizes, normality, symmetry, and equal variances, depending on the specific test being used.
Calculate the test statistic: Use the formula for the chosen nonparametric test to calculate the test statistic.
Determine the p-value: Use the test statistic to determine the p-value, which is the probability of observing a result as extreme or more extreme than the one observed, given that the null hypothesis is true.
Interpret the results: Compare the p-value to the chosen significance level (usually 0.05). If the p-value is less than the significance level, the null hypothesis can be rejected and the alternative hypothesis accepted. If the p-value is greater than the significance level, the null hypothesis cannot be rejected.

Advantages and disadvantages of non-parametric tests :
Nonparametric tests are statistical tests that do not assume a specific distribution for the data. Here are some advantages and disadvantages of nonparametric tests:
Advantages:
1.Nonparametric tests do not require assumptions about the underlying distribution of the data, so they can be used with a variety of data types.
2.They are often more robust than parametric tests, meaning they can still be reliable even if the assumptions of the parametric test are not met.
Disadvantages:
1.Nonparametric tests may have lower statistical power than parametric tests, meaning they may be less able to detect differences between groups.
2.Fewer nonparametric tests are available, so there may be less flexibility in analyzing the data.
3.The interpretation of nonparametric tests can be more difficult than that of parametric tests because the results are not presented in terms of the familiar normal distribution.

What is the binomial test? Why is it used? The binomial test is a statistical test used to determine whether the probability of success in a binary outcome (such as a "success" or "failure") is different from a hypothesized value. It is based on the binomial distribution, which describes the probability of a specific number of successes in a fixed number of trials, given a probability of success on each trial.
The binomial test is often used in situations where there are two possible outcomes (e.g., "success" or "failure") and the probability of success is known or can be estimated. For example, the binomial test might be used to determine whether the probability of winning a game is different from 50%, or to determine whether the probability of a product defect is different from a certain value.
Method of binomial test : To conduct a binomial test, the following steps are typically followed:
Define the research question and null hypothesis: As with any statistical test, it is important to start by defining the research question and the corresponding null hypothesis that you are testing.
Determine the hypothesized probability of success: Determine the hypothesized probability of success that is being tested against.
Collect the data: Collect data on the number of successes and failures in a fixed number of trials.
Calculate the test statistic: Use the formula for the binomial test to calculate the test statistic.
Determine the p-value: Use the test statistic to determine the p-value, which is the probability of observing a result as extreme or more extreme than the one observed, given that the null hypothesis is true.
Interpret the results: Compare the p-value to the chosen significance level (usually 0.05). If the p-value is less than the significance level, the null hypothesis can be rejected and the alternative hypothesis accepted. If the p-value is greater than the significance level, the null hypothesis cannot be rejected.

Discuss the process of the Kolmogorov-Smirnov test.
The Kolmogorov-Smirnov (K-S) test is a nonparametric statistical test used to determine whether two samples come from the same population. It is often used when the assumptions for parametric tests are not met, such as when the data is not normally distributed.
To perform the K-S test, follow these steps:
1.Calculate the sample size for each group.
2.Calculate the cumulative distribution function (CDF) for each sample. The CDF is a function that describes the probability that a random variable will be less than or equal to a certain value.
3.Calculate the maximum absolute difference between the two CDFs. This is known as the K-S statistic.
4.Use a table or computer software to determine the p-value for the K-S statistic.
5.Compare the p-value to the alpha level (typically 0.05) to determine whether the samples come from the same population.
The K-S test is a powerful tool for comparing two samples, but it is sensitive to sample size. When the sample sizes are small, the test may not have enough power to detect a difference between the samples. In this case, it may be necessary to use a different statistical test or to increase the sample size.

What do you mean by median test? Describe the procedure of the median test.
The median test is a nonparametric statistical test used to determine whether two samples come from populations with a common median. It is often used when the assumptions for parametric tests are not met, such as when the data is not normally distributed.
To perform the median test, follow these steps:
1.Determine the sample size for each group.
2.Combine the data from both groups and rank it from lowest to highest value.
3.Determine the grand median, the value at the middle position of the combined ranked data.
4.Count, for each group, how many observations fall above and below the grand median, and arrange the counts in a 2×2 table.
5.Use a chi-square statistic (from a table or computer software) to determine the p-value for the table.
6.Compare the p-value to the alpha level (typically 0.05) to determine whether the difference between the medians is statistically significant.
It is important to note that the median test can only be used to compare the medians of two groups, not the means. If you want to compare the means of two groups, you will need to use a different statistical test.

Describe the function and procedure of the chi-square test for goodness of fit. The chi-square test for goodness of fit is a statistical test used to determine whether a sample comes from a population with a specific distribution. It is often used to test whether a sample is consistent with a hypothesis about the population distribution.
To perform the chi-square test for goodness of fit, follow these steps:
1.Determine the sample size and the number of categories or bins in the data.
2.Calculate the expected frequency for each category or bin based on the hypothesized population distribution.
3.Calculate the observed frequency for each category or bin based on the sample data.
4.Calculate the chi-square statistic using the formula:
chi-square = sum((observed frequency − expected frequency)² / expected frequency)
5.Use a table or computer software to determine the p-value for the chi-square statistic.
6.Compare the p-value to the alpha level (typically 0.05) to determine whether the sample comes from a population with the hypothesized distribution.
It is important to note that the chi-square test for goodness of fit assumes that the sample is a random and representative sample from the population. If the sample is not representative, the test may not be accurate. Additionally, the chi-square test is sensitive to sample size, and may not be powerful when the sample size is small. In these cases, it may be necessary to use a different statistical test or to increase the sample size.

Discuss the rationale and method of the chi-square test for independence of attributes.
The chi-square test for independence of attributes is a statistical test used to determine whether there is a relationship between two categorical variables. It is often used to test whether there is an association between the two variables, or whether one variable is independent of the other.
To perform the chi-square test for independence of attributes, follow these steps:
1.Determine the sample size and the number of categories or bins in each variable.
2.Construct a contingency table that shows the frequency of each combination of categories or bins in the two variables.
3.Calculate the expected frequency for each combination of categories or bins based on the assumption of independence between the two variables.
4.Calculate the observed frequency for each combination of categories or bins based on the sample data.
5.Calculate the chi-square statistic using the formula:
chi-square = sum((observed frequency − expected frequency)² / expected frequency)
6.Use a table or computer software to determine the p-value for the chi-square statistic.
7.Compare the p-value to the alpha level (typically 0.05) to determine whether there is a relationship between the two variables.
The chi-square test for independence of attributes is a powerful tool for testing relationships between categorical variables, but it is sensitive to sample size. When the sample size is small, the test may not have enough power to detect a relationship between the variables. In this case, it may be necessary to use a different statistical test or to increase the sample size. It is also important to note that the chi-square test assumes that the sample is a random and representative sample from the population. If the sample is not representative, the test may not be accurate.

Differentiate between the chi-square test and the K-S test for uniformity. The chi-square test is used to test whether the distribution of nominal variables matches a hypothesized distribution (as well as other distribution matches), whereas the Kolmogorov-Smirnov (K-S) test is used to test the goodness of fit for continuous data.
The K-S test compares the continuous cdf, F(x), of the uniform distribution to the empirical cdf, SN(x), of the sample of N observations. If the sample from the random-number generator is R1, R2, ..., RN, then the empirical cdf, SN(x), is defined by
SN(x) = (number of R1, R2, ..., RN that are ≤ x) / N.
As N becomes larger, SN(x) should become a better approximation to F(x), provided that the null hypothesis is true. The Kolmogorov-Smirnov test is based on the largest absolute deviation or difference between F(x) and SN(x) over the range of the random variable, i.e. it is based on the statistic
D = max |F(x) − SN(x)|.
The chi-square test uses the sample statistic
χ² = Σ (Oi − Ei)² / Ei,
where Oi is the observed frequency in the i-th class and Ei is the expected frequency in that class.
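The binomial-test steps above can be sketched as an exact two-sided test computed directly from the binomial probability mass function. This follows the common "minimum-likelihood" two-sided convention (sum the probabilities of all outcomes no more likely than the observed one); the data here are made up, testing whether a success probability differs from 0.5:

```python
from math import comb

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n trials with success probability p
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_test_two_sided(k, n, p0):
    """Exact two-sided binomial test: sum the probabilities of all
    outcomes no more likely than the observed one."""
    p_obs = binomial_pmf(k, n, p0)
    return sum(binomial_pmf(i, n, p0) for i in range(n + 1)
               if binomial_pmf(i, n, p0) <= p_obs + 1e-12)

# Observed: 60 successes in 100 trials, H0: p = 0.5
p_value = binomial_test_two_sided(60, 100, 0.5)
print(round(p_value, 3))
```

With these numbers the p-value comes out slightly above 0.05, so the null hypothesis of a fair process is (narrowly) not rejected at the usual significance level.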
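The goodness-of-fit steps above can be sketched for testing whether a die is fair. The observed counts are made up; the hypothesized distribution puts equal expected frequency in each of the six categories:

```python
observed = [22, 17, 18, 25, 16, 22]   # made-up counts from 120 rolls of a die
n = sum(observed)
expected = [n / 6] * 6                # fair-die hypothesis: equal frequencies

# chi-square = sum((observed - expected)^2 / expected)
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 3.1

# Compare with the chi-square critical value for 6 - 1 = 5 degrees of
# freedom at alpha = 0.05 (11.07): 3.1 < 11.07, so H0 is not rejected.
```

In practice the p-value would come from a chi-square table or software, as step 5 describes; comparing the statistic with the critical value is the equivalent shortcut.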
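The independence-of-attributes steps above can be sketched on a 2×2 contingency table with made-up data. The expected frequency for each cell under independence is (row total × column total) / n:

```python
# 2x2 contingency table: rows and columns are two categorical attributes
table = [[30, 20],
         [20, 30]]   # made-up observed frequencies

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n  # independence assumption
        chi_square += (observed - expected) ** 2 / expected

print(chi_square)  # 4.0
# 4.0 exceeds 3.84 (chi-square, 1 df, alpha = 0.05), so independence is rejected.
```

The degrees of freedom for an r × c table are (r − 1)(c − 1), which is 1 here.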
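The K-S statistic for uniformity described above can be computed directly: sort the sample, compare the empirical cdf SN(x) with F(x) = x, and take the largest absolute deviation. A sketch with a made-up sample of N = 5 random numbers:

```python
# K-S statistic for testing uniformity of random numbers on [0, 1].
r = sorted([0.44, 0.81, 0.14, 0.05, 0.93])  # made-up sample
N = len(r)

# For sorted data R(1) <= ... <= R(N), the largest deviation occurs at a
# data point: D+ = max_i (i/N - R(i)),  D- = max_i (R(i) - (i-1)/N)
d_plus = max((i + 1) / N - x for i, x in enumerate(r))
d_minus = max(x - i / N for i, x in enumerate(r))
D = max(d_plus, d_minus)
print(round(D, 3))  # 0.26
```

The computed D would then be compared with the tabulated K-S critical value for N = 5 at the chosen alpha; D below the critical value means uniformity is not rejected.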
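One common form of the median test (Mood's median test) classifies each observation as above or below the grand median of the combined data and applies a chi-square statistic to the resulting 2×2 table. A sketch with made-up samples:

```python
group_a = [12, 15, 11, 18, 14, 16]
group_b = [22, 19, 13, 25, 21, 17]   # made-up data

combined = sorted(group_a + group_b)
m = len(combined)
grand_median = (combined[m // 2 - 1] + combined[m // 2]) / 2  # even m

# 2x2 table: counts above / below the grand median per group
table = []
for g in (group_a, group_b):
    above = sum(1 for v in g if v > grand_median)
    below = sum(1 for v in g if v < grand_median)
    table.append([above, below])

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
n = sum(row_totals)
chi_square = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
                 / (row_totals[i] * col_totals[j] / n)
                 for i in range(2) for j in range(2))
print(round(chi_square, 2))
```

If both groups shared a common median, roughly half of each group would fall on each side of the grand median; a large chi-square value (here compared against 3.84 for 1 df at alpha = 0.05) signals that the medians differ.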