
Full name: Nguyễn Tuấn Hưng

MSSV: K194141723
MID TERM

1. Explain the effect of Type I and Type II errors on a model's result. Give an example.

A Type I error, often known as a false positive, happens when a researcher incorrectly rejects a true null hypothesis. This means you report that your findings are significant when in fact they occurred by chance.

A Type II error (false negative) occurs when you fail to detect an effect that is actually present. In practice, your study may not have had sufficient statistical power to identify a significant effect.

Example: Based on your mild symptoms, you decide to get tested for COVID-19.
There are two types of errors that could occur:

Type I error (false positive): the test result indicates that you have coronavirus when
you don't.

Type II error (false negative): the test result indicates that you are free of
coronavirus, while you are in fact infected.
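The two error types in the COVID-19 example can be sketched with a few invented test results (all values below are hypothetical, purely for illustration):

```python
# `truth` is the real infection status, `test` is the test result
# (1 = positive); both lists are invented for illustration.
truth = [0, 0, 1, 1, 1, 0, 1, 0]
test = [1, 0, 1, 0, 1, 0, 0, 0]

# Type I error (false positive): test says infected, person is not.
type_1 = sum(1 for t, p in zip(truth, test) if t == 0 and p == 1)
# Type II error (false negative): test says healthy, person is infected.
type_2 = sum(1 for t, p in zip(truth, test) if t == 1 and p == 0)

print(type_1, type_2)  # 1 2
```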

2. In your opinion, what is the most important thing to do in data mining process?
Explain and give example.

The most important step in data mining is data cleaning. It is important because feeding dirty data directly into mining can disrupt operations and produce erroneous findings.

For example, data cleaning includes removing or imputing missing values, correcting spelling errors, standardizing formats across the data set, repairing issues such as missing codes and empty fields, and finding and removing duplicate records.
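A minimal pure-Python sketch of these cleaning steps, applied to an invented set of records (names and values are hypothetical):

```python
# Invented raw records with typical quality problems.
rows = [
    {"name": "An", "age": 21},
    {"name": "An", "age": 21},     # exact duplicate
    {"name": "Binh", "age": None},   # missing value
    {"name": " Chi ", "age": 23},    # inconsistent whitespace
]

cleaned, seen = [], set()
for row in rows:
    if row["age"] is None:       # drop records with missing values
        continue
    name = row["name"].strip()   # standardize the text field
    key = (name, row["age"])
    if key in seen:              # drop duplicate records
        continue
    seen.add(key)
    cleaned.append({"name": name, "age": row["age"]})

print(cleaned)  # two unique, complete records remain
```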

3. Is it possible to apply classification method for a dataset if both input and output
are interval? Explain.
The answer is no, not directly. Classification requires a categorical (discrete) target, so when the output is measured on an interval scale the problem is really a regression problem. Classification only becomes applicable if the interval output is first discretized into classes, for example by binning the values into "low", "medium", and "high" groups.
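As a sketch, an interval-valued output can be made classifiable by discretizing it first (the data and cutoffs below are invented for illustration):

```python
# Hypothetical continuous output values.
scores = [3.2, 7.8, 5.1, 9.4, 1.0, 6.6]

def to_class(y, cutoffs=(4.0, 7.0)):
    # Map a continuous value to one of three invented class labels.
    if y < cutoffs[0]:
        return "low"
    if y < cutoffs[1]:
        return "medium"
    return "high"

labels = [to_class(y) for y in scores]
print(labels)  # ['low', 'high', 'medium', 'high', 'low', 'medium']
```

After binning, any ordinary classifier can be trained on `labels` instead of the raw interval values.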

4. Give your comments on the yellow line shown below.

In my opinion, this model's performance is poor, even useless. Its AUC is lower than 0.5, which implies that its predictive ability is worse than random guessing; in fact, simply inverting its predictions would perform better.
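This can be sketched by computing AUC as a rank statistic on invented scores: an AUC below 0.5 means the model ranks negatives above positives, so flipping the scores does better than the model itself.

```python
# Hypothetical labels and model scores, invented for illustration.
y_true = [0, 0, 1, 1]
scores = [0.9, 0.6, 0.4, 0.2]

def auc(y, s):
    # AUC = probability that a randomly chosen positive is scored
    # above a randomly chosen negative (ties count half).
    pos = [si for yi, si in zip(y, s) if yi == 1]
    neg = [si for yi, si in zip(y, s) if yi == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc(y_true, scores))                # 0.0: worse than random
print(auc(y_true, [-s for s in scores]))  # 1.0: the flipped model
```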

5. Is it possible to have equal “macro avg” and “weighted avg”? Explain.

Yes, it is possible. The macro average weights every class equally, while the weighted average weights each class by its support. The two therefore coincide when all classes have the same support, or when the per-class scores themselves are equal.
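A small sketch with invented per-class precisions shows the two averages coinciding when the supports are equal:

```python
# Hypothetical per-class precisions and supports.
precision = {"0": 0.8, "1": 0.6}
support = {"0": 50, "1": 50}  # equal support per class

macro = sum(precision.values()) / len(precision)
total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total

print(macro, weighted)  # the two averages coincide
```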

7. Should we use the same train/test split ratio for any cases of data mining? Explain
why and give example.

There is no fixed rule for splitting a dataset into training and test sets; the right ratio depends on the dataset. Too few training samples can leave the model underfit, while too few test samples make the performance estimate unreliable. A sufficiently large test set better reflects how the model will behave in the real world. There are two competing concerns: with less training data, the parameter estimates have greater variance; with less testing data, the performance statistic has greater variance. Broadly speaking, you should divide the data so that neither variance is too high. For example, with millions of records even a 99/1 split leaves a large test set, whereas a dataset of a few hundred records may call for a 70/30 split or cross-validation.
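The trade-off can be sketched with a simple reproducible split helper (the data and the ratio below are invented for illustration):

```python
import random

def train_test_split(data, test_ratio, seed=0):
    # Shuffle indices reproducibly, then cut off the test portion.
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(data) * test_ratio)
    test = [data[i] for i in idx[:n_test]]
    train = [data[i] for i in idx[n_test:]]
    return train, test

data = list(range(100))
train, test = train_test_split(data, test_ratio=0.2)
print(len(train), len(test))  # 80 20
```

Changing `test_ratio` shifts samples between the two competing concerns: a larger test set stabilizes the performance estimate at the cost of noisier parameter estimates.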

9. What happens if “True Negative + False Positive = 0”? Give your opinion about a model that produces this phenomenon.

“True Negative + False Positive” counts the samples whose actual label is negative ('0'). If this sum is zero, the test set contains no actual negative samples at all. Specificity and the false positive rate then become undefined (0/0), so the evaluation cannot tell whether the model is able to recognize negatives. In my opinion, this phenomenon says more about a badly skewed test set than about the model itself, and the model should be re-evaluated on data that contains both classes.
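A small sketch with invented labels shows how this arises:

```python
# Hypothetical labels: every true label is positive, so there are
# no actual negatives and TN + FP must be zero.
y_true = [1, 1, 1, 1]
y_pred = [1, 0, 1, 1]

tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

print(tn + fp)  # 0: no actual negatives in the test set
```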

10. Explain the effect of overfitting & underfitting to the result. Give example

- Overfitting: Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data. Two main effects follow. First, the model performs deceptively well on the training set even though its error on the test set remains high. Second, because the training sample is limited and non-representative, predictions may be very poor once the model is applied to a wider sample.

- Underfitting: The model is unable to capture the underlying patterns of the training data, resulting in poor predictions on both the training and the test data.
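Both effects can be sketched on synthetic data (all numbers invented): a "model" that memorizes the training set overfits, while one that ignores the input underfits, and both predict poorly on unseen test data.

```python
import random

# Synthetic data from a linear trend y ~ 2x plus Gaussian noise.
rng = random.Random(0)
train = [(float(x), 2 * x + rng.gauss(0, 1)) for x in range(20)]
test = [(x + 0.5, 2 * (x + 0.5) + rng.gauss(0, 1)) for x in range(20)]

def mse(model, data):
    # Mean squared error of a prediction function over (x, y) pairs.
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# "Overfit" model: memorize every training point, predict 0 elsewhere.
lookup = dict(train)
def overfit(x):
    return lookup.get(x, 0.0)

# "Underfit" model: ignore x and always predict the training mean of y.
mean_y = sum(y for _, y in train) / len(train)
def underfit(x):
    return mean_y

print(mse(overfit, train))  # exactly 0: perfect on the training set
print(mse(overfit, test))   # large: memorization does not generalize
print(mse(underfit, test))  # large: the trend in x is never learned
```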

13.
