
Assignment-1(Answer)

Q: 4 Explain which statistical test/technique you would use: Two different
weight loss drugs are tried on participants who come in with a certain disease.
Each participant is randomly assigned one of the two drugs. The reduction in
weight (as a percentage) is recorded for each participant. We would like to
see if one drug is more effective than the other.

Ans: 4: A two independent samples t test is appropriate here: each participant
receives exactly one of the two drugs, so the two groups are independent, and the
outcome is a continuous measurement. The worked example that follows applies the
same test to a cholesterol trial. A new drug (drug 1) is proposed to lower total
cholesterol. A randomized controlled trial is designed to evaluate the efficacy
of the medication in lowering cholesterol. Thirty participants are enrolled in
the trial and are randomly assigned to receive either the new drug or a placebo
(drug 2). The participants do not know which treatment they are assigned. Each
participant is asked to take the assigned treatment for 6 weeks. At the end of
6 weeks, each patient's total cholesterol level is measured and the sample
statistics are as follows.

Treatment   Sample Size   Mean    Standard Deviation
Drug 1      15            195.9   28.7
Drug 2      15            227.4   30.3

Is there statistical evidence of a reduction in mean total cholesterol in patients
taking the new drug for 6 weeks as compared to participants taking placebo? We
will run the test using the five-step approach.

 Step 1. Set up hypotheses and determine level of significance

H0: μ1 = μ2 H1: μ1 < μ2                         α=0.05

 Step 2. Select the appropriate test statistic.  

Because both samples are small (< 30), we use the t test statistic. Before
implementing the formula, we first check whether the assumption of equality of
population variances is reasonable. The ratio of the sample variances,
$s_1^2/s_2^2 = 28.7^2/30.3^2 = 0.90$, falls between 0.5 and 2, suggesting that
the assumption of equality of population variances is reasonable. The appropriate
test statistic is:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

where $S_p$ is the pooled estimate of the common standard deviation.

 Step 3. Set up decision rule.  

This is a lower-tailed test, using a t statistic and a 5% level of significance. The
appropriate critical value can be found in a t table. In order to determine the
critical value of t we need the degrees of freedom, df, defined as
df = n1 + n2 - 2 = 15 + 15 - 2 = 28. The critical value for a lower-tailed test
with df = 28 and α=0.05 is -1.701, and the decision rule is: Reject H0 if
t < -1.701.

 Step 4. Compute the test statistic.  

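We now substitute the sample data into the formula for the test statistic identified
in Step 2. Before substituting, we will first compute Sp, the pooled estimate of the
common standard deviation.

$$S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{14(28.7)^2 + 14(30.3)^2}{28}} = \sqrt{870.89} = 29.5$$

Now the test statistic,

$$t = \frac{195.9 - 227.4}{29.5\sqrt{\frac{1}{15} + \frac{1}{15}}} = \frac{-31.5}{10.77} = -2.92$$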

 Step 5. Conclusion.  

We reject H0 because -2.92 < -1.701. We have statistically significant evidence at
α=0.05 to show that the mean total cholesterol level is lower in patients taking the
new drug for 6 weeks as compared to patients taking placebo, p < 0.005.

The clinical trial in this example finds a statistically significant reduction in total
cholesterol, whereas in the previous example, where we had a historical control (as
opposed to a parallel control group), we did not demonstrate efficacy of the new
drug. Notice that the mean total cholesterol level in patients taking placebo is 227.4,
which is very different from the mean cholesterol reported among all Americans in
2002 of 203 and used as the comparator in the prior example. The historical control
value may not have been the most appropriate comparator, as cholesterol levels
have been increasing over time. In the next section, we present another design that
can be used to assess the efficacy of the new drug.

Q:5 Consider a modified k-NN method in which once the k nearest neighbors
to the query point are identified, you do a linear regression fit on them and
output the fitted value for the query point. Which of the following is/are true
regarding this method?

Justify your answer.

(a) This method makes an assumption that the data is locally linear.

(b) In order to perform well, this method would need densely distributed
training data.

(c) This method has higher bias compared to k-NN.

(d) This method has higher variance compared to k-NN.

Ans: (a), (b), (d)

Since we do a linear fit in the k-neighborhood, we are making an assumption of
local linearity. Hence, (a) holds. The method would need densely distributed
training data to perform well, since if the training data are sparse, the
k-neighborhood would end up being quite spread out (not really local anymore), and
the assumption of local linearity would not give good results. Hence, (b) holds.
The method has higher variance, since we now fit two parameters (slope and
intercept, for a single input variable) instead of one as in conventional k-NN. (In
the conventional case, we just try to fit a constant, and the average happens to be
the constant which minimizes the squared error.) Hence, (d) holds. Option (c) does
not hold: a constant is a special case of a linear fit, so the more flexible local
linear model generally does not have higher bias than conventional k-NN.
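
To make the comparison concrete, here is a minimal sketch of both predictors, assuming NumPy is available and Euclidean distance is used. The function names `knn_mean_predict` and `knn_linear_predict` are hypothetical. On exactly linear data, the local linear fit recovers the true value at the edge of the data, while the plain average does not.

```python
import numpy as np

def knn_indices(X, q, k):
    """Indices of the k training points closest to query q (Euclidean distance)."""
    distances = np.linalg.norm(X - q, axis=1)
    return np.argsort(distances)[:k]

def knn_mean_predict(X, y, q, k):
    """Conventional k-NN regression: fit a constant (the mean) to the neighbors."""
    idx = knn_indices(X, q, k)
    return float(y[idx].mean())

def knn_linear_predict(X, y, q, k):
    """Modified k-NN: least-squares linear fit on the neighbors, evaluated at q."""
    idx = knn_indices(X, q, k)
    # Design matrix with an intercept column followed by the neighbor inputs.
    A = np.hstack([np.ones((len(idx), 1)), X[idx]])
    coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    return float(np.concatenate(([1.0], q)) @ coef)

# Toy data (an assumption for illustration): y = 2x + 1 on x = 0, 1, ..., 9.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
q = np.array([0.0])

mean_pred = knn_mean_predict(X, y, q, k=3)      # averages y at x = 0, 1, 2
linear_pred = knn_linear_predict(X, y, q, k=3)  # fits a line through them
print(mean_pred, linear_pred)
```

The extra slope parameter is what lets the local fit follow the trend, and it is also why the fit is more sensitive to noise in a small neighborhood.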

When KNN is used for classification, the output can be calculated as the class with
the highest frequency among the K most similar instances. Each instance in essence
votes for its class, and the class with the most votes is taken as the prediction.

Class probabilities can be calculated as the normalized frequency of samples that
belong to each class in the set of K most similar instances for a new data instance.
For example, in a binary classification problem (class is 0 or 1):

p(class=0) = count(class=0) / (count(class=0)+count(class=1))
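
The voting rule and the class probability formula can be sketched as follows. This is a minimal illustration using only the standard library and Euclidean distance; the function names and the toy training set are assumptions, not part of the original answer.

```python
from collections import Counter
import math

def knn_class_probs(train, query, k):
    """Normalized class frequencies among the k nearest training instances.

    `train` is a list of (point, label) pairs; points are tuples of floats.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in counts.items()}

def knn_classify(train, query, k):
    """Predict the class with the most votes among the k nearest instances."""
    probs = knn_class_probs(train, query, k)
    return max(probs, key=probs.get)

# Toy binary problem: three instances of class 0 and two of class 1.
train = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
         ((5.0, 5.0), 1), ((5.0, 6.0), 1)]
probs = knn_class_probs(train, (0.5, 0.5), k=5)
prediction = knn_classify(train, (0.5, 0.5), k=5)
print(probs, prediction)
```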

If you have an even number of classes (e.g. 2), it is a good idea to choose an odd
value of K to avoid a tie. Conversely, use an even value of K when you have an odd
number of classes.

Ties can be broken consistently by expanding K by 1 and looking at the class of
the next most similar instance in the training dataset. KNN works well with a small
number of input variables (p), but struggles when the number of inputs is very
large.
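
The tie-breaking rule of expanding K by 1 can be sketched as below. This is one possible reading of the rule, assuming a non-empty training set and Euclidean distance; the function name `classify_with_tie_break` is hypothetical.

```python
from collections import Counter
import math

def classify_with_tie_break(train, query, k):
    """Vote among the k nearest instances; on a tie, expand k by 1 and retry.

    Returns (predicted label, value of k actually used). Assumes `train` is a
    non-empty list of (point, label) pairs.
    """
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    k = min(k, len(ranked))
    while True:
        counts = Counter(label for _, label in ranked[:k]).most_common()
        unique_winner = len(counts) == 1 or counts[0][1] > counts[1][1]
        # Stop when there is a clear winner or no more instances to add.
        if unique_winner or k == len(ranked):
            return counts[0][0], k
        k += 1

# With k = 2 the nearest labels are "a" and "b" (a tie), so k grows to 3.
train = [((0.0,), "a"), ((1.0,), "b"), ((2.0,), "b"), ((3.0,), "a")]
result = classify_with_tie_break(train, (0.4,), k=2)
print(result)
```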

Each input variable can be considered a dimension of a p-dimensional input space.
For example, if you had two input variables x1 and x2, the input space would be
2-dimensional.

As the number of dimensions increases, the volume of the input space increases at
an exponential rate. In high dimensions, points that may be similar may have very
large distances. All points will be far away from each other, and our intuition for
distances in simple 2- and 3-dimensional spaces breaks down. This might feel
unintuitive at first; the general problem is called the “Curse of Dimensionality”.
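
A small simulation illustrates the effect. The sketch below, which assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper `mean_pairwise_distance`, shows the average distance between random points growing as the number of dimensions increases.

```python
import math
import random

def mean_pairwise_distance(dim, n_points=50, seed=0):
    """Average Euclidean distance between random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    distances = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    return sum(distances) / len(distances)

# The average distance grows with the dimension, so "nearest" neighbors
# become less and less local.
for dim in (2, 3, 10, 100):
    print(dim, round(mean_pairwise_distance(dim), 2))
```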
