
Capstone Project

General Instructions:
1. Provide appropriate comments in your code.
2. Perform all tasks programmatically using Python libraries.

The ‘XYZ’ hardware service center specializes in servicing CPUs. The center maintains details of the
CPUs it has serviced, which are available in “machine_data.csv”. The dataset has 209 rows
and 9 columns. [Source of raw dataset: UCI repository] The details of the columns are as follows:

• vendor: represents the manufacturer of the CPU.
• model: represents the model number of the CPU.
• cycle_time: represents the time taken by the CPU for internal data transfer, in nanoseconds.
• min_memory: represents the minimum main memory required by the CPU.
• max_memory: represents the maximum main memory supported by the CPU.
• cache: represents the size of cache memory required by the CPU.
• min_threads: represents the number of threads that run in the CPU when it is just switched on.
• max_threads: represents the maximum number of threads that can be run on the CPU.
• score: represents the performance score of the CPU.

Based on this data, the ‘XYZ’ hardware service center would like to build a predictive model that predicts the
performance score of new CPUs.
As a data science expert, you are expected to build the best model for the given scenario.

Problem statement:

Perform the following activities to build the model:

1. Import the dataset “machine_data.csv”.


2. As part of data preprocessing, perform the following activities:
a. Encode the categorical column – ‘vendor’ using a label encoder.
b. Identify the vendors that have manufactured fewer than 5 CPUs and drop the rows
corresponding to those vendors from the given dataset.
c. Drop the column ‘model’.

[Note: The preprocessed dataset should be used further.]
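
The preprocessing activities in step 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: the small DataFrame is made-up stand-in data (not taken from “machine_data.csv”), only a few of the nine columns are shown, and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for pd.read_csv("machine_data.csv"); the values are
# illustrative only and do not come from the real dataset.
df = pd.DataFrame({
    "vendor": ["ibm"] * 6 + ["harris"] * 5 + ["apollo"] * 2,
    "model": [f"m{i}" for i in range(13)],
    "cycle_time": list(range(100, 230, 10)),
    "score": list(range(10, 140, 10)),
})

# 2a. Label-encode 'vendor'. LabelEncoder sorts the class names first,
#     so codes follow alphabetical order of the vendor names.
encoder = LabelEncoder()
df["vendor"] = encoder.fit_transform(df["vendor"])

# 2b. Find vendors with fewer than 5 rows and drop their rows.
vendor_counts = df["vendor"].value_counts()
rare_vendors = vendor_counts[vendor_counts < 5].index
df = df[~df["vendor"].isin(rare_vendors)].reset_index(drop=True)

# 2c. Drop the 'model' column.
df = df.drop(columns=["model"])

print(df.shape)
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

Encoding before filtering keeps the codes tied to the full list of vendors, which appears consistent with the later note that 'harris' maps to 14.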

3. Select ‘score’ as the target variable to be predicted and the remaining features as predictors.
4. Split the data into training and testing sets in the ratio 80:20.
5. As part of model building, perform the following activities:
a. Based on the training data, build a Linear Regression model.
b. Find the train and test scores for the built model.
c. Calculate the adjusted R-squared values on both the train and the test data.
6. Calculate the VIF values for all the features considered while building the model, using the training data.
7. Using the model built, predict the performance score of the new test sample (a new CPU instance)
given below:

(Note: In the above hardware instance, the vendor value '14' is the label-encoded value for the vendor
'harris'.)
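
Steps 4 to 6 could be sketched as below. This is a minimal illustration, not a reference solution: the feature matrix is synthetic random data standing in for the preprocessed dataset, the helper names adjusted_r2 and vif are hypothetical, and the VIF is computed by hand from its definition (1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features) rather than with a dedicated library routine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed predictors and 'score' target.
rng = np.random.default_rng(0)
n_rows = 100
X = rng.normal(size=(n_rows, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n_rows)

# Step 4: 80:20 train/test split (random_state is an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 5a and 5b: fit the model; .score() returns R-squared.
model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Step 5c: adjusted R-squared on both splits.
train_adj_r2 = adjusted_r2(train_r2, *X_train.shape)
test_adj_r2 = adjusted_r2(test_r2, *X_test.shape)


def vif(features):
    """Return one VIF per column by regressing it on the remaining columns."""
    values = []
    for j in range(features.shape[1]):
        others = np.delete(features, j, axis=1)
        target = features[:, j]
        r2_j = LinearRegression().fit(others, target).score(others, target)
        values.append(1.0 / (1.0 - r2_j))
    return np.array(values)


# Step 6: VIF values from the training predictors only.
vif_values = vif(X_train)

print(train_r2, test_r2, train_adj_r2, test_adj_r2)
print(vif_values)
```

For step 7, the same fitted model would be given a single-row array with the features in the training column order via model.predict; the sample values themselves are not reproduced here.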
