Ans: 4: For example, a new drug (drug1) is proposed to lower total cholesterol. A
randomized controlled trial is designed to evaluate the efficacy of the medication
in lowering cholesterol. Thirty participants are enrolled in the trial and are
randomly assigned to receive either the new drug or a placebo (drug2). The
participants do not know which treatment they are assigned. Each participant is
asked to take the assigned treatment for 6 weeks. At the end of 6 weeks, each
patient's total cholesterol level is measured and the sample statistics are as follows.
Because both samples are small (< 30), we use the t test statistic. Before
implementing the formula, we first check whether the assumption of equality of
population variances is reasonable. The ratio of the sample variances,
$s_1^2/s_2^2 = 28.7^2/30.3^2 = 0.90$, falls between 0.5 and 2, suggesting that the assumption
of equality of population variances is reasonable. The appropriate test statistic is:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Assignment-1(Answer)
We now substitute the sample data into the formula for the test statistic identified
in Step 2. Before substituting, we will first compute Sp, the pooled estimate of the
common standard deviation.
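The pooled computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the original worked solution: the function names `pooled_sd` and `two_sample_t` are hypothetical, the standard deviations 28.7 and 30.3 come from the variance ratio quoted earlier, and the group size of 15 per arm is an assumption (thirty participants split evenly), since the table of sample statistics is not reproduced here.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled estimate of the common standard deviation Sp."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """Pooled-variance t statistic and its degrees of freedom."""
    sp = pooled_sd(s1, n1, s2, n2)
    t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# s1 and s2 are taken from the text; n = 15 per group is an ASSUMPTION.
sp = pooled_sd(28.7, 15, 30.3, 15)
print(round(sp, 2))  # → 29.51
```

Passing the two sample means to `two_sample_t` then gives the statistic to compare against the critical t value for the returned degrees of freedom.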
Step 5. Conclusion.
The clinical trial in this example finds a statistically significant reduction in total
cholesterol, whereas in the previous example where we had a historical control (as
opposed to a parallel control group) we did not demonstrate efficacy of the new
drug. Notice that the mean total cholesterol level in patients taking placebo is 217.4
which is very different from the mean cholesterol reported among all Americans in
2002 of 203 and used as the comparator in the prior example. The historical control
value may not have been the most appropriate comparator as cholesterol levels
have been increasing over time. In the next section, we present another design that
can be used to assess the efficacy of the new drug.
Q:5 Consider a modified k-NN method in which once the k nearest neighbors
to the query point are identified, you do a linear regression fit on them and
output the fitted value for the query point. Which of the following is/are true
regarding this method?
(a) This method makes an assumption that the data is locally linear.
(b) In order to perform well, this method would need dense distributed
training data.
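To make the modified method concrete, here is a minimal sketch of k-NN followed by a local linear fit. It assumes one-dimensional inputs so that ordinary least squares has a simple closed form; the function name `knn_local_linear` and the toy data are hypothetical and not part of the question.

```python
def knn_local_linear(xs, ys, query, k):
    """Fit a straight line to the k nearest neighbours of `query`
    and return the fitted value at `query` (1-D inputs only)."""
    # Indices of the k training points closest to the query.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    nx = [xs[i] for i in nearest]
    ny = [ys[i] for i in nearest]
    n = len(nearest)
    mx = sum(nx) / n
    my = sum(ny) / n
    sxx = sum((x - mx) ** 2 for x in nx)
    if sxx == 0:
        # All neighbours share the same x: fall back to their mean response.
        return my
    slope = sum((x - mx) * (y - my) for x, y in zip(nx, ny)) / sxx
    return my + slope * (query - mx)

# Toy data lying exactly on y = 2x + 1 (illustrative only).
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * x + 1 for x in xs]
pred = knn_local_linear(xs, ys, 2.5, 3)
print(pred)  # → 6.0
```

The fit uses only the k selected points, so it is trustworthy only when the data are roughly linear within that neighbourhood and when the neighbours lie close to the query, which is what options (a) and (b) describe.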
When KNN is used for classification, the output is the class with the highest
frequency among the K most similar instances. Each instance in essence votes for
its class, and the class with the most votes is taken as the prediction.
With two classes, it is a good idea to choose an odd value of K so that the vote
cannot end in a tie. A common heuristic for the inverse case is to use an even K
when the number of classes is odd, although with more than two classes ties
remain possible for any K, so a tie-breaking rule is still needed.
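The voting rule can be sketched as follows. This is an illustrative implementation under stated assumptions: Euclidean distance as the similarity measure, a hypothetical function name `knn_classify`, and made-up training points. When counts are tied, `Counter.most_common` returns the tied class that was counted first, which here is the one with the nearest member.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train is a list of (features, label) pairs; returns the majority
    label among the k training points nearest to `query`."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set.
train = [((0, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_classify(train, (0.2, 0.3), 3))  # → a
```

With two classes and k = 3 the vote is always 3-0 or 2-1, which is why an odd K avoids ties in the binary case.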
As the number of dimensions increases, the volume of the input space grows at
an exponential rate. In high dimensions, even points that are similar can be
separated by very large distances: all points end up far away from each other,
and our intuition for distances in simple 2- and 3-dimensional spaces breaks
down. This may feel unintuitive at first; the general problem is called the
“Curse of Dimensionality”.
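A small simulation illustrates the effect. The sketch below draws points uniformly from the unit hypercube (an assumption made only for illustration) and reports the average distance from each point to its nearest neighbour; the function name and the sample size are hypothetical choices.

```python
import math
import random

def mean_nearest_neighbour_distance(n_points, dim, seed=0):
    """Average Euclidean distance from each point to its nearest
    neighbour, for points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(pts):
        total += min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
    return total / n_points

# Same number of points, increasing dimension.
for dim in (2, 10, 100):
    print(dim, round(mean_nearest_neighbour_distance(200, dim), 3))
```

With the number of points held fixed, the nearest neighbour moves farther away as the dimension grows, so the "nearest" neighbours used by KNN become less and less local.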