
Computer Science 3202/6915

Assignment 2 – Explore Regression Methods

Due date: Sunday, February 18th, by 11:30 pm (closing date: March 10th at
midnight).

Learning goals:
1. Explore pre-processing techniques on a dataset.
2. Get familiar with the regression approaches available in scikit-learn.
3. Practice applying regression approaches.
4. Practice using cross-validation to select the best performing approach.

Instructions:
In the Brightspace folder for this assignment, there is a dataset available
(A2data.tsv). This dataset consists of 99 numeric inputs and one numeric label for
48 instances. The dataset is given as a tab-delimited text file with a column
header and one instance per line. The first 99 columns are the features and the
last column is the output/label.
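As a rough illustration of the file layout described above, the sketch below reads a tab-delimited table with a header row and splits it into features and label. It assumes pandas is available; the helper name `load_features_and_label` and the tiny in-memory table (with made-up column names) are hypothetical stand-ins so that the example runs without the course file.

```python
import io

import pandas as pd


def load_features_and_label(source):
    """Read a tab-delimited table with a header row.

    All columns except the last are treated as features; the last
    column is treated as the output/label.
    """
    df = pd.read_csv(source, sep="\t")
    X = df.iloc[:, :-1].to_numpy(dtype=float)
    y = df.iloc[:, -1].to_numpy(dtype=float)
    return X, y


# Tiny in-memory stand-in (hypothetical column names and values),
# used only so the sketch is runnable without A2data.tsv.
demo = io.StringIO("f1\tf2\tlabel\n1.0\t2.0\t3.5\n0.5\t1.5\t2.0\n")
X_demo, y_demo = load_features_and_label(demo)
print(X_demo.shape, y_demo.shape)
```

With the real file, the same helper would be called as `load_features_and_label("A2data.tsv")`, and the feature matrix should then have 48 rows and 99 columns.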

Your job is to work with this data to generate a regression model. You are allowed
to use any pre-processing technique, feature selection and regression method
available in scikit-learn. You are required to assess model performance using
cross-validation. These are the steps to complete this assignment:

1. Select, implement and assess the performance of a baseline regression
approach. That is, use a simple regression method directly on the data as
given (i.e., don’t do any pre-processing or feature selection) and obtain the
cross-validation root mean square error (RMSE) of this baseline model.
   1. This is a small dataset (48 instances), so carefully consider which
   cross-validation scheme would be appropriate (10-fold CV, 5-fold CV, LOO-CV).
2. Generate at least two alternative regression models. You are allowed to
choose how to create them. For example, each of the following counts as an
alternative regression model: applying a pre-processing technique with the
same simple regression method you used in step 1; using a different
regression method with the original data; or any combination of
pre-processing, feature selection and regression method.
3. Evaluate all the generated regression models using cross-validation and
create a plot to visualize and compare the performance of the models
(some suitable visualizations are box plots of the RMSE, interval plots of the
RMSE, or scatterplots showing actual output vs predicted output).
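The three steps above can be sketched roughly as follows. This is one possible arrangement, not the expected solution. The synthetic data (same shape as the assignment data), the model choices, `alpha=1.0`, `k=10` and the repeated 5-fold scheme are all illustrative assumptions. Wrapping the pre-processing and feature selection in a pipeline means they are refit on the training part of every fold, so the held-out fold does not leak into them.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the assignment data (48 x 99).
# These values are NOT the contents of A2data.tsv.
rng = np.random.default_rng(0)
X = rng.normal(size=(48, 99))
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=48)

# Step 1 baseline plus two alternatives (illustrative choices only).
models = {
    "baseline: OLS": LinearRegression(),
    "scale + ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "scale + top-10 + OLS": make_pipeline(
        StandardScaler(), SelectKBest(f_regression, k=10), LinearRegression()
    ),
}

# Repeated 5-fold CV: 5 folds x 3 repeats = 15 RMSE values per model.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)


def cv_rmse(model, X, y, cv):
    """Return the per-fold RMSE of `model` under the splitter `cv`."""
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return -scores  # scikit-learn reports the negated RMSE


rmse = {name: cv_rmse(model, X, y, cv) for name, model in models.items()}

# Step 3: box plot of the per-fold RMSE of every model.
fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(list(rmse.values()))
ax.set_xticklabels(list(rmse.keys()))
ax.set_ylabel("RMSE per fold")
ax.set_title("Cross-validated RMSE by model (synthetic data)")
fig.tight_layout()
```

Calling `fig.savefig(...)` with a file name of your choice would write the plot to disk. Note that with LOO-CV each fold holds a single instance, so the per-fold RMSE reduces to the absolute error of that one prediction.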

Submission:
1. Python code used to complete this assignment. Include instructions on how
to run your code.
2. A report in a single PDF file containing:
   1. Brief description and justification of your choice of methods.
      1. Explain which method you chose as the baseline and why.
      2. Explain which methods you chose to generate at least two
      alternative models and why.
   2. A screenshot of a run of your program showing its output.
   3. The data visualization(s) generated in step 3.
   4. A table with the average RMSE ± standard deviation per method.
   5. A brief concluding paragraph summarizing and interpreting your results.
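For the table of average RMSE ± standard deviation, a minimal sketch of one way to format per-fold scores is shown below. The helper name `rmse_table`, the use of the sample standard deviation (`ddof=1`), and the placeholder scores are assumptions for illustration. The numbers are not results from A2data.tsv.

```python
import numpy as np


def rmse_table(rmse_by_model, decimals=3):
    """Format per-fold RMSE values as 'mean ± std' rows, one per method."""
    width = max(len("Method"), *(len(name) for name in rmse_by_model))
    lines = [f"{'Method':<{width}}  RMSE (mean ± std)"]
    for name, scores in rmse_by_model.items():
        scores = np.asarray(scores, dtype=float)
        mean = scores.mean()
        std = scores.std(ddof=1)  # sample standard deviation across folds
        lines.append(f"{name:<{width}}  {mean:.{decimals}f} ± {std:.{decimals}f}")
    return "\n".join(lines)


# Placeholder per-fold scores, used only to show the output format.
placeholder = {"model A": [1.0, 2.0, 3.0], "model B": [2.0, 2.0, 2.0]}
table = rmse_table(placeholder)
print(table)
```

Whichever standard deviation convention is used (sample or population), it is worth stating it in the report so the table can be reproduced.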

Resources which might be useful:
1. Available methods in scikit-learn:
https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
2. Cross-validation with linear regression:
https://www.kaggle.com/code/jnikhilsai/cross-validation-with-linear-regression

Winter 2024
