
FAQ - ReCell

1. How should one approach the ReCell project?

• Before starting the project, please read the problem statement carefully and go through the criteria and descriptions mentioned in the rubric.

• Once you understand the task, download the dataset and import it into a Jupyter notebook to get started with the project.

• To work on the project, you should start with data preprocessing and EDA using descriptive statistics and visualizations.

• Once the EDA is completed and the data is preprocessed, you can use the data to build a model, check its performance, and check whether or not it satisfies the necessary assumptions.

• It is important to close the analysis with key findings and recommendations to the business.

2. Since we have missing values in the dataset, what is the best way to handle or treat those missing
values?

The strategy to deal with missing values varies with the problem at hand, the data provided, and other factors. Some of the common strategies are listed below, followed by a short illustrative sketch after the list.

• Drop the missing values

• Impute the missing values

o Using central tendency measures (mean, median, mode) of a column

▪ With mean: Missing values are imputed with the mean of the column. Preferred for continuous data with no outliers

▪ With median: Missing values are imputed with the median of the column. Preferred for continuous data with outliers

▪ With mode: Missing values are imputed with the mode of the column. Preferred for categorical data

o Using central tendency measures (mean, median, mode) of a column grouped by categories of a categorical column: Preferred for cases where the data under similar categories of a categorical column are likely to have similar properties
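
Below is a minimal sketch of the overall-median and group-wise-median imputation approaches using a toy DataFrame; the column names 'ram' and 'brand_name' are only illustrative and may differ from the actual dataset.

import numpy as np
import pandas as pd

# Toy data with missing values in a numeric column
df = pd.DataFrame({
    "brand_name": ["A", "A", "B", "B", "B"],
    "ram": [4.0, np.nan, 8.0, np.nan, 6.0],
})

# Impute with the overall median of the column
df["ram_overall"] = df["ram"].fillna(df["ram"].median())

# Impute with the median of the column grouped by a categorical column
df["ram_by_group"] = df["ram"].fillna(
    df.groupby("brand_name")["ram"].transform("median")
)

print(df)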

3. In what order should one do EDA and Data Preprocessing?

Whenever one extracts or gets some data, the first step generally is to explore the data (check the
distribution, summary statistics, interactions between variables). More often than not, the data will be in
a state that would need some amount of preprocessing before any exploration can be performed. As
such, the step of data preprocessing is both preceded and followed by some amount of data exploration.
The initial exploration of the data helps in identifying the kind of preprocessing needed for the data. For
example, if the data has missing values, the data distribution will help you decide on the strategy to use
to treat the missing values. Once you have treated the missing values, you would want to check the
distribution before going ahead with modeling.

The exact steps to be taken for preprocessing, the kinds of analysis to perform, and the order of their
execution will depend on the data and problem at hand.

4. X=sm.add_constant(X) is not working for my project as a new column is not created. Why is this
happening? How can this be resolved?

add_constant() does not add a constant column to the data if a constant column already exists in
it. Please check if the data has a constant column before using add_constant().
Since none of the independent variables should ideally be constant (they all have some variability to begin with), the step at which the variable(s) became constant has to be identified. The outlier treatment step is a good place to start.
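
As a rough sketch (assuming X is the DataFrame of independent variables being passed to the model), one can list any constant columns before calling add_constant():

import statsmodels.api as sm

# Columns that take only one unique value are already constant
constant_cols = [c for c in X.columns if X[c].nunique() == 1]
print("Constant columns:", constant_cols)

# By default, add_constant() skips adding 'const' if a constant column already exists;
# has_constant='add' forces the 'const' column to be added anyway
X = sm.add_constant(X, has_constant="add")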

5. What should one do if the p-values are high (> 0.05) for some dummies but not for the other dummies of a categorical variable?

The dummy variables with p-value > 0.05 should be dropped one by one, starting with the one having the highest p-value, until there are no such variables. After removing each variable, the regression should be run again, and the p-values of the remaining variables should be checked.

If all the dummy variables of a categorical column have a p-value > 0.05, then all the dummy variables for
that column can be dropped at once.
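
A rough sketch of this iterative procedure is shown below, assuming X_train (including the constant and dummy columns) and y_train are already defined; the variable names are illustrative.

import statsmodels.api as sm

cols = list(X_train.columns)

while True:
    model = sm.OLS(y_train, X_train[cols]).fit()
    p_values = model.pvalues.drop("const", errors="ignore")
    if p_values.max() <= 0.05:
        break
    # Drop the predictor with the highest p-value and refit
    cols.remove(p_values.idxmax())

print("Remaining predictors:", cols)
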
6. What should one do if the VIF is high (> 5) for some dummies but not for the other dummies of a categorical variable?

The VIF values for dummy variables can be ignored.

If, however, the VIF value is inf or NaN, then one should check if one of the dummy variables was
dropped during one-hot encoding. If the VIF value is still inf or NaN, a different dummy variable than the
one dropped by using drop_first=True should be dropped and VIF values should be checked again.

For example, if a categorical variable 'Season' has four levels 'Spring', 'Summer', 'Fall' and 'Winter', and
using drop_first=True drops the dummy variable for 'Fall', then one can keep the dummy variable for
'Fall' and drop the dummy variable for 'Summer', and then check the VIF values.
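
For reference, a minimal sketch for computing the VIF values, assuming X is the predictor DataFrame (including the constant column):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)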

7. Do we need to treat the outliers?

It is not mandatory to treat the outliers in the data. Depending upon the EDA performed, one can
determine if the outlier values are proper values or not, and then decide if outlier treatment is needed
or not and which columns should be treated. Some ways to treat outliers are the following:

1. Cap the values by the IQR method (a short sketch is given below)

2. Drop the outliers

It is important to provide a proper explanation for the chosen approach in the submission.
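
A minimal sketch of IQR-based capping, assuming df is the DataFrame and col stands for a numeric column chosen for treatment:

Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR

# Values outside the whiskers are capped at the whisker values
df[col] = df[col].clip(lower=lower_whisker, upper=upper_whisker)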

8. I am getting this error while building the model:

MissingDataError: exog contains inf or nans

How to resolve it?

The error occurs due to the presence of missing value(s) in the data passed to the OLS model. The presence of missing values can be checked using the following code:

dataframe.isnull().sum()

If missing values are present, then one can use appropriate methods to treat all missing values before
feeding the data to the model.
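
Note that the same error is also raised when the data contains inf values, so it can help to check for those as well. A small sketch, assuming X is the (numeric) data passed to sm.OLS:

import numpy as np

print(X.isnull().sum())                                      # missing values per column
print(np.isinf(X.select_dtypes(include=np.number)).sum())    # inf values per column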

9. Why am I getting the MAPE as NaN?

NaN values for MAPE usually occur due to a minor mistake while defining the dependent and independent variables. When the dependent variable is defined as a 2D array, as shown below,

# dependent variable
y = df[["target"]]

the function that computes MAPE receives a 2D array (actual targets) and a 1D array (predicted targets) as input, resulting in a shape mismatch and, consequently, a NaN value for MAPE.

In order to rectify this error, one should define the target variable as a 1D array as follows:

# dependent variable
y = df["target"]

10. I am getting a p-value < 0.05 from the Goldfeld-Quandt test, but the scatter plot of the residuals does not show any pattern, which suggests that the homoscedasticity assumption is almost satisfied. What to do in this case?

For the homoscedasticity assumption, one can rely on a visual check based on the scatter plot of residuals vs fitted values. If the p-value of the Goldfeld-Quandt test is less than 0.05 but the scatter plot shows no clear pattern, one can conclude that the assumption is satisfied on the basis of the residuals vs fitted values plot.

However, it is good practice to try and ensure that the statistical test results match the visual results. To do so, we can add more variables to try and get the Goldfeld-Quandt test to give a p-value > 0.05. We can also experiment with different transformation methods and/or feature engineering to obtain a p-value > 0.05. However, these steps are not mandatory for the scope of the project if the scatter plot does not show any clear pattern.
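
For reference, one way to run the Goldfeld-Quandt test with statsmodels, assuming olsmodel is the fitted OLS results object and x_train is the data it was fitted on (names are illustrative):

import statsmodels.stats.api as sms

f_stat, p_value, _ = sms.het_goldfeldquandt(olsmodel.resid, x_train)
print("F statistic:", f_stat, "p-value:", p_value)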

11. I am getting this error while creating the dummy variables:

TypeError: unhashable type series

How to resolve it?

The message “TypeError: unhashable type” appears in a Python program when one tries to use a datatype that is not hashable in a portion of the code that requires hashable data. When creating dummies, the values in the column being encoded must be hashable.

One of the ways the data can become unhashable is during missing value imputation. If one imputes missing values using a central tendency value (like the median) of a column and misses adding () at the end of the function, i.e., writes

df['column1'] = df['column1'].fillna(df['column1'].median)

instead of

df['column1'] = df['column1'].fillna(df['column1'].median()),

the missing values are replaced with the function object itself rather than the central tendency value (like the median). As a result, the column's dtype becomes object, and its values are no longer hashable.

So, it is important to ensure that the function for the central tendency value is properly called (with parentheses) while imputing the missing values.

12. I am getting this error while building the model:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

How to resolve it?

This kind of error generally occurs when one attempts to fit a regression model in Python before
converting the categorical variables to dummy variables.

All categorical variables must be converted to dummy variables. One can use the pandas.get_dummies() function to convert the categorical variables into numerical ones.
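
A minimal sketch with pandas, assuming df is the DataFrame; the column names listed are only illustrative and should be replaced with the actual categorical columns of the dataset:

import pandas as pd

df = pd.get_dummies(
    df,
    columns=["brand_name", "os"],  # categorical columns to encode (illustrative names)
    drop_first=True,               # drop one level per column to avoid redundancy
    dtype=int,                     # ensure numeric (0/1) dummies rather than booleans
)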

13. I am getting a FutureWarning in my code. How to resolve it?

FutureWarnings notify users about functionality that will change in an upcoming version of a package/library. They can be ignored for the current run as they have no effect on the code execution.

One can check the documentation of the relevant package to identify changes that need to be made to the code to avoid such warnings.

One can also suppress the warning if needed (though it is not a preferred approach). To suppress the
warning, one can use the code below at the start of the Python notebook:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
