
Building good training sets (UNIT 1 - PART 2)
Data Pre-processing

• Data preprocessing is the process of preparing raw data and making it
suitable for a machine learning model. It is the first and crucial step
when creating a machine learning model.
• When working on a machine learning project, we do not always come
across clean and well-formatted data.
• Before performing any operation with data, it is essential to clean it
and put it into a consistent format. This is what the data preprocessing
step does.
Why do we need Data Preprocessing?
• Real-world data generally contains noise, missing values, and may be
in an unusable format that cannot be used directly by machine
learning models.
• Data preprocessing is the task of cleaning the data and making it
suitable for a machine learning model, which also increases the
accuracy and efficiency of the model.
Dealing with Missing Data:
• Dealing with missing data is a crucial step in the preprocessing phase
of machine learning. Missing data can negatively impact the
performance and accuracy of your model, so it's essential to handle it
appropriately. Here are some common techniques for dealing with
missing data:
1. Identify Missing Data:
Begin by identifying the missing values in your dataset. You can use functions like isnull()
or info() in Python libraries such as Pandas to identify missing values.
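A minimal sketch of this step (using a small, hypothetical Pandas DataFrame named df) could look like the following:
Example:
import pandas as pd

# Hypothetical toy DataFrame with a missing value in each column
df = pd.DataFrame({'age': [25, None, 40], 'city': ['Pune', 'Delhi', None]})

print(df.isnull().sum())   # count of missing values per column
df.info()                  # non-null counts and dtypes per column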
2. Remove Missing Data:
If the proportion of missing data is small and removing those instances won't
significantly impact your analysis, you can simply remove rows with missing values using
the dropna() function.
Syntax: df.dropna(inplace=True)
3. Imputation:
Replace missing values with estimated or calculated values. Common imputation
techniques include:
Mean, Median, or Mode Imputation
Forward Fill or Backward Fill
Interpolation
• Mean, Median, or Mode Imputation: Replace missing values with the mean,
median, or mode of the respective column.
Syntax: df['column_name'].fillna(df['column_name'].mean(), inplace=True)
• Forward Fill or Backward Fill: Propagate the last known value forward or the next
known value backward.
Syntax: df.fillna(method='ffill', inplace=True) # Forward fill
• Interpolation: Interpolate missing values based on the values of other data points.
Syntax: df.interpolate(method='linear', inplace=True)
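Putting these three imputation techniques together, a small illustrative sketch (using a hypothetical numeric column) might look like this:
Example:
import pandas as pd

# Hypothetical sensor readings with gaps
df = pd.DataFrame({'temperature': [21.0, None, 23.5, None, 25.0]})

df['temp_mean'] = df['temperature'].fillna(df['temperature'].mean())    # mean imputation
df['temp_ffill'] = df['temperature'].ffill()                            # forward fill
df['temp_interp'] = df['temperature'].interpolate(method='linear')      # linear interpolation
print(df)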
4. Data Augmentation:
For time-series data, use techniques like data augmentation to create new samples by introducing small
perturbations to the existing data.
5. Multiple Imputation:
Generate multiple imputed datasets and analyze each one separately, then combine the results to get more
robust estimates.
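One practical way to approximate multiple imputation in Python is scikit-learn's experimental IterativeImputer, run several times with different random seeds; this is an illustrative sketch, not the only way to do it:
Example:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])

# Draw several imputed versions of the dataset by varying the seed,
# analyze each one separately, then pool the results
imputed_versions = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
print(imputed_versions[0])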
6. Domain-Specific Imputation:
Use domain knowledge to fill in missing values, considering the specific characteristics of the data.
7. Machine Learning Algorithms Resilient to Missing Data:
Some algorithms are relatively robust to missing data. For example, certain tree-based models (and
gradient-boosting libraries such as XGBoost and LightGBM) can handle missing values natively, without
explicit imputation.
Important Considerations:
•Data Distribution: Ensure that imputed values maintain the original distribution of the data.
•Evaluation: Evaluate the impact of missing data imputation on your model's performance using cross-
validation or other appropriate metrics.
Handling Categorical Data

• Categorical data is a set of predefined categories or groups that an
observation can fall into.
• Categorical data can be found everywhere. For instance, survey
responses like marital status, profession, educational qualifications,
etc.
• Categorical data can only take up a finite set of values. However, due
to human error while filling out a survey form, or for any other reason,
some bogus values may be found in the dataset.
Importing Libraries
Python libraries make it very easy for us to handle categorical data in a DataFrame
and perform typical and complex tasks with a single line of code.
•Pandas – This library helps load the data into a DataFrame (a 2D tabular format) and has
multiple functions to perform analysis tasks in one go.
•NumPy – NumPy arrays are very fast and can perform large computations in a very
short time.
•Matplotlib/Seaborn – These libraries are used to draw visualizations.
•Sklearn – This module contains multiple libraries with pre-implemented functions
to perform tasks from data preprocessing to model development and evaluation.
Common techniques for handling categorical data:
1. Label Encoding:
2. One-Hot Encoding
3. Ordinal Encoding
4. Frequency Encoding
5. Target Encoding (Mean Encoding)
6. Binary Encoding
7. Hashing Trick
Let us discuss these one-by-one.
1. Label Encoding:
• Assign a unique numerical label to each category. This method is
suitable for ordinal data, where the order matters.
• Syntax:
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• df['categorical_column'] = le.fit_transform(df['categorical_column'])
2. One-Hot Encoding:
• Create binary columns for each category, indicating the presence or
absence of that category. This method is suitable for nominal data,
where there is no inherent order.
• Syntax: df = pd.get_dummies(df, columns=['categorical_column'],
prefix='category')
• Limitation:
Curse of Dimensionality: Be cautious of the dimensionality increase,
especially with one-hot encoding, which may lead to sparse datasets.
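Besides pd.get_dummies(), scikit-learn's OneHotEncoder achieves the same result and fits naturally into pipelines; the column name below is hypothetical:
Example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# sparse_output=False returns a dense array (scikit-learn 1.2+; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['color']])

print(pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['color'])))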
3. Ordinal Encoding:
• Assign numerical values to categories based on their inherent order.
This is suitable for ordinal data.
• Syntax:
• ordinal_mapping = {'Low': 1, 'Medium': 2, 'High': 3}
• df['ordinal_column'] = df['ordinal_column'].map(ordinal_mapping)
4. Frequency Encoding:
• Encode categorical values based on their frequency in the dataset. This can capture
information about the distribution of categories.
Example:
frequency_map = df['categorical_column'].value_counts().to_dict()
df['categorical_column'] = df['categorical_column'].map(frequency_map)

5. Target Encoding (Mean Encoding):


• Encode categorical values based on the mean of the target variable for each category.
This can be useful for binary classification problems.
Example:
target_mean = df.groupby('categorical_column')['target_variable'].mean()
df['categorical_column'] = df['categorical_column'].map(target_mean)
6. Binary Encoding:
• Represent each category with binary code. This reduces the number
of columns compared to one-hot encoding.
Example:
• import category_encoders as ce
• encoder = ce.BinaryEncoder(cols=['categorical_column'])
• df_binary = encoder.fit_transform(df)
7. Hashing Trick:
• Apply a hash function to map categories to a fixed range of values.
This is useful when the number of categories is large.
Example:
• df['hashed_column'] = df['categorical_column'].apply(hash) % 1000
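Note that Python's built-in hash() is not stable across interpreter sessions, so a dedicated hasher is usually safer. A small sketch using scikit-learn's FeatureHasher (with a hypothetical column) is shown below:
Example:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'city': ['Pune', 'Delhi', 'Mumbai', 'Pune']})

# Hash each category into a fixed number of columns (8 here)
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[c] for c in df['city']])
print(hashed.toarray())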
Partitioning a Dataset in Training and Test Sets:

• The train-test split is used to estimate the performance of machine learning
algorithms on prediction-based applications.
• It is a fast and easy procedure to perform, and it allows us to compare the
predictions of our machine learning model against known outcomes.
• A common convention is to keep 30% of the data as the test set and 70% as the
training set (scikit-learn's train_test_split uses a 25% test set if no size is specified).
• We need to split a dataset into train and test sets to evaluate how well our
machine learning model performs.
• The train set is used to fit the model, and the statistics of the train set are
known. The second set is called the test dataset; this set is used solely for
predictions.
What is Training Dataset?
• The training data is the biggest (in size) subset of the original
dataset, which is used to train or fit the machine learning model.
Firstly, the training data is fed to the ML algorithm, which lets it
learn how to make predictions for the given task.
What is Test Dataset?
• Once we train the model with the training dataset, it's time to test the
model with the test dataset. This dataset evaluates the performance of
the model and ensures that the model can generalize well with the new
or unseen dataset. The test dataset is another subset of original data,
which is independent of the training dataset.
Need of Splitting dataset into Train and Test set:
Splitting the dataset into train and test sets is one of the important parts
of data pre-processing, as by doing so, we can improve the performance
of our model and hence give better predictability.
Dataset Splitting:
• Scikit-learn alias sklearn is the most useful and robust library for
machine learning in Python.
• The scikit-learn library provides us with the model_selection module
in which we have the splitter function train_test_split().
• Syntax:
train_test_split(*arrays, test_size=None, train_size=None, random_state=None,
shuffle=True, stratify=None)
Parameters:
1.*arrays: inputs such as lists, arrays, data frames, or matrices.
2.test_size: a float value between 0.0 and 1.0 representing the proportion of the
dataset to include in the test split. Its default value is None.
3.train_size: a float value between 0.0 and 1.0 representing the proportion of the
dataset to include in the train split. Its default value is None.
4.random_state: this parameter controls the shuffling applied to the data before the
split. It acts as a seed for the random number generator so that you always get the
same result; a commonly used value is 42.
5.shuffle: whether to shuffle the data before splitting. Its default value is True.
6.stratify: this parameter is used to split the data in a stratified fashion.
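A typical usage sketch (with a small, hypothetical DataFrame) looks like this:
Example:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'height': [150, 160, 170, 180, 165, 175],
                   'weight': [50, 60, 65, 80, 62, 72],
                   'label':  [0, 0, 1, 1, 0, 1]})
X = df[['height', 'weight']]
y = df['label']

# 70/30 split with a fixed seed, stratified on the class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)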
Overfitting and Underfitting issues:
• Overfitting and underfitting are the most common problems that occur in the
Machine Learning model.
• A model can be said as overfitted when it performs quite well with the training
dataset but does not generalize well with the new or unseen dataset.
• The issue of overfitting occurs when the model tries to cover all the data points
and hence starts capturing the noise present in the data.
• Due to this, it can't generalize well to the new dataset. Because of these issues, the
accuracy and efficiency of the model degrade.
• Generally, the complex model has a high chance of overfitting.
• There are various ways to avoid overfitting in a model, such as using
cross-validation, stopping the training early, or applying regularization.
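For instance, cross-validation gives a more reliable estimate of generalization than a single train-test split; a minimal sketch using scikit-learn might look like this:
Example:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as a validation set
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())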
Underfitting:
• On the other hand, the model is said to be under-fitted when it is not
able to capture the underlying trend of the data.
• It means the model shows poor performance even with the training
dataset. In most cases, underfitting issues occur when the model is not
perfectly suitable for the problem that we are trying to solve.
Training data vs. Testing Data:
• The main difference between training data and testing data is that
training data is the subset of original data that is used to train the
machine learning model, whereas testing data is used to check the
accuracy of the model.
• The training dataset is generally larger in size compared to the testing
dataset. The general ratios of splitting train and test datasets are 80:20,
70:30, or 90:10.
• Training data is well known to the model as it is used to train the
model, whereas testing data is like unseen/new data to the model.
How do training and testing data work in ML?
• Machine Learning algorithms enable the machines to make predictions and solve problems on
the basis of past observations or experiences.
• An algorithm takes these experiences or observations from the training data that is fed to
it. Further, one of the great things about ML algorithms is that they can learn and improve over
time, as they are trained with relevant training data.
• Once the model has been trained sufficiently on the relevant training data, it is tested with the test data.
We can understand the whole process of training and testing in three steps, which are as follows:
1.Feed: Firstly, we train the model by feeding it the training input data.
2.Define: The training data is tagged with the corresponding outputs (in supervised learning),
and the model transforms the training data into feature vectors or a set of data features.
3.Test: In the last step, we test the model by feeding it with the test data/unseen dataset. This step
ensures that the model is trained efficiently and can generalize well.
Normalization:
• Normalization is an essential step in the preprocessing of data for
machine learning models, and it is a feature scaling technique.
• Normalization is especially crucial for data manipulation, scaling
down, or up the range of data before it is utilized for subsequent stages
in the fields of soft computing, cloud computing, etc.
• Min-max scaling and Z-Score Normalization (Standardization) are the
two methods most frequently used for normalization in feature scaling.
• Normalization is scaling the data to be analyzed to a specific range
such as [0.0, 1.0] to provide better results.
Why do we need Data Normalization in ML?
• There are several reasons why data normalization is needed:
• Throughout the learning process, normalization guarantees that every feature
contributes equally, preventing larger-magnitude features from overshadowing others.
• It enables faster convergence of algorithms for optimization, especially those that
depend on gradient descent. Normalization improves the performance of distance-based
algorithms like k-Nearest Neighbours.
• Normalization improves overall performance by addressing model sensitivity problems
in algorithms such as Support Vector Machines and Neural Networks.
• Because it assumes uniform feature scales, it also supports the use of regularization
techniques like L1 and L2 regularization.
• In general, normalization is necessary when working with attributes that have different
scales; otherwise, the effectiveness of a significant attribute that is equally important (on
a lower scale) could be diluted due to other attributes having values on a larger scale.
Data Normalization Techniques:
1. Min-Max normalization:
• This method normalizes the data through a linear transformation of the original
values. The minimum and maximum values of the data are obtained, and each value v
is then changed to a normalized value v' using the formula:

v' = (v - min) / (max - min)

• The formula works by subtracting the minimum value from the original value to
determine how far the value is from the minimum.
• It then divides this difference by the range of the variable (the difference between the
maximum and minimum values).
• This division scales the variable to a proportion of the entire range. As a result, the
normalized value falls between 0 and 1.
2. Normalization through decimal scaling:
The data is normalized by shifting the decimal point of its values. Each data value v is
divided by a power of ten chosen from the maximum absolute value of the data, so that
every normalized value v' has an absolute value less than 1:

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Normalization of Z-score or Zero Mean (Standardization):

• Using the mean and standard deviation of the data, values are
normalized in this technique to follow a standard normal distribution
(mean 0, standard deviation 1). The equation that is applied is:

v' = (v - mean) / standard deviation
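Both techniques are available in scikit-learn; a brief illustrative sketch:
Example:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # min-max scaling to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # z-score standardization (mean 0, std 1)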
Selecting Meaningful Features:
• Feature selection:
• Feature selection is a process that chooses a subset of features from the original
features so that the feature space is optimally reduced according to a certain
criterion.
• Feature selection is a critical step in the feature construction process.
• In text categorization problems, some words simply do not appear very often.
Perhaps the word “groovy” appears in exactly one training document, which is
positive. Is it really worth keeping this word around as a feature? It is a
dangerous endeavor because, with just one training example, it is hard to tell
whether the word is really correlated with the positive class or is just noise. You
could hope that your learning algorithm is smart enough to figure it out, or you
could just remove it.
The role of feature selection in ML:
1. To reduce the dimensionality of feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
Need of Selecting Meaningful Features:
• While building a machine learning model for a real-life dataset, we come across
many features, and not all of these features are important every time.
• Adding unnecessary features while training the model reduces the overall
accuracy of the model, increases its complexity, decreases its generalization
capability, and makes the model biased.
• Even the saying “Sometimes less is better” goes as well for the machine learning
model.
• Hence, feature selection is one of the important steps while building a machine
learning model.
• Its goal is to find the best possible set of features for building a machine learning
model.
Features Selection Algorithms are as follows:
1. Instance based approaches: There is no explicit procedure for feature
subset generation. Many small data samples are sampled from the data.
Features are weighted according to their roles in differentiating instances
of different classes for a data sample. Features with higher weights can be
selected.
2. Nondeterministic approaches: Genetic algorithms and simulated
annealing are also used in feature selection.
3. Exhaustive/complete approaches: Branch and Bound evaluates
estimated accuracy, and ABB (Automated Branch and Bound) checks an
inconsistency measure that is monotonic. Both start with the full feature
set and search until the preset bound can no longer be maintained.
Types of feature selection:
• There are three general classes of feature selection algorithms:
• Filter methods,
• Wrapper methods
• Embedded methods.
1. Filter Methods:
• These methods are generally used while doing the pre-processing step.
• These methods select features from the dataset irrespective of the use
of any machine learning algorithm. In terms of computation, they are
very fast and inexpensive and are very good for removing duplicated,
correlated, redundant features but these methods do not remove
multicollinearity.
• Each feature is evaluated individually, which can work well when features act in
isolation (have no dependency on other features), but it falls short when a
combination of features would increase the overall performance of the model.
Some techniques used are:
• Information Gain – It is defined as the amount of information
provided by the feature for identifying the target value and measures
reduction in the entropy values. Information gain of each attribute is
calculated considering the target values for feature selection.
• Chi-square test – The chi-square (χ²) method is generally used to test
the relationship between categorical variables. It compares the
observed values of different attributes of the dataset with their expected
values.
•Correlation Coefficient – Pearson’s Correlation Coefficient is a measure of quantifying
the association between the two continuous variables and the direction of the relationship
with its values ranging from -1 to 1.
•Variance Threshold – It is an approach where all features are removed whose variance
doesn’t meet the specific threshold. By default, this method removes features having zero
variance. The assumption made using this method is higher variance features are likely to
contain more information.
•Mean Absolute Difference (MAD) – This method is similar to variance threshold
method but the difference is there is no square in MAD. This method calculates the mean
absolute difference from the mean value.
•Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean
(AM) to that of Geometric mean (GM) for a given feature. Its value ranges from +1 to
∞ as AM ≥ GM for a given feature. Higher dispersion ratio implies a more relevant
feature.
•Mutual Dependence – This method measures if two variables are mutually dependent,
and thus provides the amount of information obtained for one variable on observing the
other variable. Depending on the presence/absence of a feature, it measures the amount
of information that feature contributes to making the target prediction.
•Relief – This method measures the quality of attributes by randomly sampling an
instance from the dataset and then updating each feature's weight based on how well it
distinguishes the selected instance from its two nearest instances of the same and
opposite classes.
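A short scikit-learn sketch of two of these filter techniques (variance threshold and the chi-square test) on the Iris dataset:
Example:
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance is below 0.2
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-square test: keep the 2 features most related to the target
# (chi2 requires non-negative feature values, which holds for Iris)
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)
print(X.shape, X_var.shape, X_chi2.shape)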
2. Wrapper methods:
• Wrapper methods, also referred to as greedy algorithms, train the algorithm
iteratively using a subset of features.
• Based on the conclusions drawn from the previous round of training, features
are added or removed.
• The stopping criterion for selecting the best subset is usually pre-defined by the
person training the model, such as when the performance of the model decreases
or when a specific number of features has been reached.
• The main advantage of wrapper methods over filter methods is that they
provide an optimal set of features for training the model, thus resulting in better
accuracy than filter methods, but they are computationally more expensive.
Some techniques used are:
• Forward selection – This is an iterative approach where we start with an
empty set of features and, after each iteration, add the feature that best improves the
model. The procedure stops when adding a new variable no longer improves the
performance of the model.
• Backward elimination – This is also an iterative approach where we start
with all features and, after each iteration, remove the least significant feature. The
procedure stops when no improvement in the performance of the model is observed
after removing a feature.
• Bi-directional elimination – This method uses both forward selection and backward
elimination technique simultaneously to reach one unique solution.
•Exhaustive selection – This technique is considered as the brute force approach for
the evaluation of feature subsets. It creates all possible subsets and builds a learning
algorithm for each subset and selects the subset whose model’s performance is best.
•Recursive elimination – This greedy optimization method selects features by
recursively considering smaller and smaller sets of features. The estimator is trained
on an initial set of features, and their importance is obtained from the model's feature
importance attribute (for example, feature_importances_ or coef_ in scikit-learn). The
least important features are then removed from the current set of features until we are
left with the required number of features.
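Recursive feature elimination is available in scikit-learn as RFE; a minimal sketch on the Iris dataset:
Example:
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model and drop the least important feature
# until only 2 features remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)
print(selector.support_)   # mask of the selected features
print(selector.ranking_)   # 1 = selected; larger values were eliminated earlier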
3. Embedded methods:
• In embedded methods, the feature selection algorithm is blended as
part of the learning algorithm, thus having its own built-in feature
selection methods.
• Embedded methods overcome the drawbacks of filter and wrapper
methods and combine their advantages.
• These methods are faster, like filter methods, and more accurate than
filter methods, and they also take combinations of features into
consideration.
Some techniques used are:
• Regularization – This method adds a penalty to different parameters
of the machine learning model to avoid over-fitting of the model. This
approach of feature selection uses Lasso (L1 regularization) and
Elastic nets (L1 and L2 regularization). The penalty is applied over the
coefficients, thus bringing down some coefficients to zero. The
features having zero coefficient can be removed from the dataset.
• Tree-based methods – Methods such as Random Forest and Gradient
Boosting provide feature importances that can also be used to select
features. Feature importance tells us which features have the greatest
impact on the target feature.
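As an illustration of the embedded approach, Lasso combined with SelectFromModel keeps only the features whose coefficients remain non-zero (a sketch on scikit-learn's diabetes dataset):
Example:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# L1 regularization drives some coefficients to exactly zero;
# SelectFromModel keeps only the features with non-zero coefficients
selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)
print('Selected feature indices:', selector.get_support(indices=True))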
Thank you!
