
Unit - 3

• Feature Engineering:
• Feature Extraction and Engineering,
• Feature Engineering on Numeric Data,
• Feature Engineering on Categorical Data,
• Feature Engineering on Text Data,
• Feature Engineering on Temporal Data,
• Feature Engineering on Image Data,
• Feature Scaling,
• Feature Selection.
• Dimensionality Reduction:
• Feature extraction with Principal Component Analysis.
What is feature?
• A feature is an attribute of a data set that is used in a machine learning
process.
• The features in a data set are also called its dimensions.
• So a data set having ‘n’ features is called an n-dimensional data set.
• Example: in a tabular data set, each column is a feature (dimension), and
the column being predicted is the class variable or dependent variable.
What is Feature Engineering?
• Feature engineering refers to the process of translating a data set
into features such that these features are able to represent the data
set more effectively and result in a better learning performance.
• Feature engineering is an important pre-processing step for machine
learning. It has two major elements:
1. feature transformation
2. feature subset selection
Why Feature Engineering?
• Better representation of data
• Better performing models
• Essential for model building and evaluation:
• More flexibility on data types
How Do You Engineer Features?
• Feature engineering strategies can be applied to the following data types:
• Numeric data, Categorical data
• Text data, Temporal data
• Image data
• Another aspect of feature engineering has recently gained prominence.
Here, the machine itself tries to detect patterns and extract useful data
representations from the raw data, which can be used as features. This
process is also known as “auto feature generation”.
• Deep Learning has proved to be extremely effective in this area and neural
network architectures like convolutional neural networks (CNNs),
recurrent neural networks (RNNs), and Long Short Term Memory
networks (LSTMs) are extensively used for auto feature engineering and
extraction.
Feature Engineering on Categorical Data
• Any attribute or feature that is categorical in nature represents discrete
values that belong to a specific finite set of categories or classes.
• Category or class labels can be text or numeric in nature. Usually there
are two types of categorical variables—nominal and ordinal.
• Nominal Categorical feature - doesn’t have any order
Example: video game genres, weather seasons, country names
• Ordinal Categorical feature – have specified order
Example: Clothing size, education levels
Engineer features from categorical data
• Transforming Nominal features
• Transforming Ordinal features
• Encoding Categorical features
• One hot encoding
• Dummy coding scheme
• Effect coding scheme
• Bin – counting scheme
• Feature Hashing scheme
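As a quick illustration of the first two encoding schemes above, here is a minimal sketch using pandas; the column name "genre" and its values are made up for this example.

```python
import pandas as pd

# Hypothetical nominal feature: video game genres
df = pd.DataFrame({"genre": ["Action", "RPG", "Sports", "RPG"]})

# One hot encoding: one binary column per distinct category
one_hot = pd.get_dummies(df["genre"], prefix="genre")
print(one_hot)

# Dummy coding: drop one category to avoid redundancy (k-1 columns)
dummy = pd.get_dummies(df["genre"], prefix="genre", drop_first=True)
print(dummy)
```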
Feature Engineering on Text Data
• In the case of unstructured data like text documents, the first challenge is
dealing with the unpredictable nature of the syntax, format, and
content of the documents, which makes it hard to extract useful
information for building models.
• The second challenge is transforming these textual representations into
numeric representations that can be understood by Machine Learning
algorithms.
• There exist various feature engineering techniques employed by data
scientists daily to extract numeric feature vectors from unstructured
text.
Load some sample text documents and do some basic pre-processing.
Text pre processing
• Text tokenization and lower casing
• Removing special characters
• Removing stop words
• Correcting spellings
• Stemming
• Lemmatization
Tokenization
Corpus: collection of text documents.
Tokens: Tokens are the basic meaningful units of a sentence or a document.
N-Grams: An n-gram is a combination of N contiguous words or characters.
For example, for the sentence “I Love My Phone”, the bi-grams are
“I Love”, “Love My”, and “My Phone”.
Tokenization: process of splitting a text object into smaller units known
as tokens. Examples of tokens can be words, characters, numbers,
symbols, or n-grams.

Regular expression tokenization: a regular expression
pattern is used to get the tokens. For example, consider a
string containing multiple delimiters; we can split it into tokens by
passing a splitting pattern.
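A minimal sketch of both approaches, assuming NLTK is installed and its "punkt" tokenizer data has been downloaded; the example strings are made up for illustration.

```python
import re
from nltk.tokenize import word_tokenize

sentence = "I Love My Phone"

# Word tokenization: split the sentence into word tokens
tokens = word_tokenize(sentence)          # ['I', 'Love', 'My', 'Phone']

# Regular expression tokenization: split a string with multiple delimiters
text = "numeric data;text data, temporal data|image data"
regex_tokens = re.split(r"[;,|]\s*", text)

print(tokens)
print(regex_tokens)
```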
Normalization: Normalization is the process of converting a token into
its base form. Two popular methods used for normalization are stemming
and lemmatization.
• Stemming: the process of removing inflectional forms from a given token. For
example laughing, laughed, laughs, and laugh will all become laugh after
the stemming process.

Stemming is not always a good process for normalization, since it can
produce non-meaningful words which are not present in the dictionary.
Consider the sentence “His teams are not winning”. After stemming we
get “hi team are not winn”. Notice that the keyword winn is not a
regular word. Also, “hi” has changed the context of the entire sentence.
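A minimal sketch with NLTK's PorterStemmer (assuming NLTK is installed); exact outputs depend on which stemming algorithm is used.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of "laugh" all reduce to the same stem
words = ["laughing", "laughed", "laughs", "laugh"]
print([stemmer.stem(w) for w in words])   # ['laugh', 'laugh', 'laugh', 'laugh']

# Stems are not guaranteed to be dictionary words
sentence = "His teams are not winning"
print([stemmer.stem(w) for w in sentence.split()])
```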
Lemmatization: Lemmatization is a systematic process of removing the
inflectional form of a token and transforming it into a lemma. It makes use of
word structure, vocabulary, part of speech tags, and grammar relations.
• The output of lemmatization is a root word called a lemma. For example,
“am”, “are”, and “is” will be converted to “be”. Similarly, “running”, “runs”,
and “ran” will be replaced by “run”.
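A minimal sketch with NLTK's WordNetLemmatizer (assuming NLTK and its WordNet data are available); note that the part-of-speech tag must be supplied to lemmatize verbs correctly.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatizing as verbs maps inflected forms to their lemma
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("ran", pos="v"))      # 'run'
print(lemmatizer.lemmatize("is", pos="v"))       # 'be'
print(lemmatizer.lemmatize("are", pos="v"))      # 'be'
```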
Removing Stop Words: words that carry little useful information are
referred to as stop words.
• A stop word is a commonly used word (such as “the”, “a”, “an”, “in”)
that a search engine has been programmed to ignore, both when
indexing entries for searching and when retrieving them as the result
of a search query. 
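A minimal sketch using NLTK's built-in English stop word list (assuming the "stopwords" and "punkt" data have been downloaded); the sentence is made up for illustration.

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The quick brown fox jumps over a lazy dog in the park")

# Keep only tokens that are not in the stop word list
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
```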
Bag of Words model
• The core principle of this model is to convert text documents into
numeric vectors.
• The dimension or size of each vector is N, where N is the number of
distinct words across the corpus of documents.
• Each document once transformed is a numeric vector of size N where
the values or weights in the vector indicate the frequency of each word
in that specific document.
The output represents a numeric term frequency based feature vector
for each document, as mentioned before. To understand it better,
we can represent it using the feature names and view it as a data frame.
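A minimal sketch with scikit-learn's CountVectorizer on a toy corpus (documents made up for illustration; scikit-learn ≥ 1.0 assumed for get_feature_names_out).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the sky is blue",
          "the sun is bright",
          "the sun in the sky is bright"]

cv = CountVectorizer()
bow = cv.fit_transform(corpus)            # sparse document-term matrix

# View the term-frequency feature vectors as a data frame
df = pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out())
print(df)
```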
Bag of N-grams model
• An n-gram is basically a collection of word tokens from a text
document such that these tokens are contiguous and occur in a
sequence.
• Bi-grams indicate n-grams of order 2 (two words), Tri-grams indicate
n-grams of order 3 (three words), and so on.
• We can easily extend the bag of words model to use a bag of n-grams
model to give us n-gram based feature vectors.
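The same vectorizer can be extended to n-grams; a minimal sketch using bi-grams on the same toy corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the sky is blue",
          "the sun is bright",
          "the sun in the sky is bright"]

# ngram_range=(2, 2) extracts only bi-grams; (1, 2) would keep single words too
bv = CountVectorizer(ngram_range=(2, 2))
bon = bv.fit_transform(corpus)

print(bv.get_feature_names_out())
print(bon.toarray())
```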
TF-IDF
• TF-IDF stands for Term Frequency-Inverse Document Frequency,
which uses a combination of two metrics in its computation, namely:
term frequency (TF) and inverse document frequency (IDF).

• Term Frequency (TF):
• It gives the frequency of a word in each document in the corpus.
• It is the ratio of the number of times the word appears in a
document to the total number of words in that document.
• It increases as the number of occurrences of that word within the
document increases.
TF-IDF
• Inverse Document Frequency (IDF):
• It is used to determine the weight of rare words across all documents in
the corpus.
• Words that occur rarely in the corpus have a high IDF score.
• Combining these two, we compute the TF-IDF score (w) for a word in a
document in the corpus.
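• A commonly used textbook formulation (libraries such as scikit-learn apply slightly smoothed variants) is:
• tf(w, d) = (number of times word w appears in document d) / (total number of words in d)
• idf(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w
• tfidf(w, d) = tf(w, d) × idf(w)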
Suppose we have 3 texts and we need to vectorize these texts using TF-
IDF

Create a term frequency matrix where rows are documents and columns
are distinct terms throughout all documents. Count word occurrences in
every text.
Compute inverse document frequency (IDF) using the previously
explained formula.

Multiply the TF matrix by the corresponding IDF values, term by term.
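These steps can also be carried out directly with scikit-learn's TfidfVectorizer; a minimal sketch on three made-up texts.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the sky is blue",
         "the sun is bright",
         "the sun in the sky is bright"]

tfidf = TfidfVectorizer()                 # computes TF and IDF internally
matrix = tfidf.fit_transform(texts)

# TF-IDF feature vectors as a data frame, one row per document
print(pd.DataFrame(matrix.toarray(), columns=tfidf.get_feature_names_out()))
```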


Document Similarity:
• This is very useful in domains like search engines, document
clustering, and information retrieval.
• Document similarity is the process of using a distance or similarity
based metric to identify how similar a text document
is to another document, based on features extracted from the
documents such as bag of words or TF-IDF.
• Pairwise document similarity in a corpus involves computing
document similarity for each pair of documents in a corpus.
• Thus if you have C documents in a corpus, you end up with a C
x C matrix in which the entry at row i and column j is the similarity
score between document i and document j.
• There are several similarity and distance metrics that are used to
compute document similarity. These include cosine similarity,
Euclidean distance, Jaccard distance, and so on.
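A minimal sketch of pairwise document similarity using cosine similarity over TF-IDF features (scikit-learn assumed; corpus made up for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the sky is blue",
          "the sun is bright",
          "the sun in the sky is bright"]

features = TfidfVectorizer().fit_transform(corpus)

# C x C matrix: entry (i, j) is the similarity between document i and document j
sim_matrix = cosine_similarity(features)
print(sim_matrix)
```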
Euclidean Distance:
• It is probably one of the best-known formulas for computing the
distance between two points, applying the Pythagorean theorem.
• To compute it, subtract the two vectors element-wise, square the
differences, add them up, and take the square root of the sum.
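For two vectors x and y, d(x, y) = sqrt(sum_i (x_i - y_i)^2); a minimal NumPy sketch with made-up vectors.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Subtract, square, sum, square root
dist = np.sqrt(np.sum((x - y) ** 2))
print(dist)                         # 5.0
print(np.linalg.norm(x - y))        # same result using the norm helper
```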
Topic Models
• The idea of topic models revolves around the process of extracting key
themes or concepts from a corpus of documents which are represented
as topics.
• Each topic can be represented as a bag or collection of words/terms
from the document corpus.
• Together, these terms signify a specific topic, theme or a concept and
each topic can be easily distinguished from other topics by virtue of
the semantic meaning conveyed by these terms.
• Topic models are extremely useful for summarizing large corpora of
text documents to extract and depict key concepts. They are also
useful for extracting features from text data that capture latent
patterns in the data.
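One common way to extract such topics is Latent Dirichlet Allocation; a minimal sketch with scikit-learn, where the corpus and the number of topics are illustrative assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the sky is blue and the sun is bright",
          "machine learning models learn from data",
          "deep learning uses neural networks",
          "the sun in the sky makes the day bright"]

cv = CountVectorizer(stop_words="english")
dtm = cv.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(dtm)

# Show the top terms for each extracted topic
terms = cv.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}:", top_terms)
```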
Feature selection
• Feature selection is a way of selecting the subset of the most relevant
features from the original features set by removing the redundant,
irrelevant, or noisy features.
• Below are some benefits of using feature selection in machine learning:
• It helps in avoiding the curse of dimensionality.
• It helps in the simplification of the model so that it can be easily
interpreted by the researchers.
• It reduces the training time.
• It reduces overfitting and hence enhances generalization.
Feature selection techniques
Wrapper methods
• In wrapper methodology, selection of features is done by considering it as
a search problem, in which different combinations are made, evaluated,
and compared with other combinations.
• It trains the algorithm iteratively, using a different subset of features each time.
• Some techniques of wrapper methods are:
• Forward selection
• Backward elimination
• Exhaustive feature selection
• Recursive feature elimination
Forward selection:
• Forward selection is an iterative process, which begins with an
empty set of features.
• In each iteration, it adds one more feature and evaluates the
performance to check whether the performance improves.
• The process continues until the addition of a new variable/feature
does not improve the performance of the model.
Backward elimination:
• It is also an iterative approach, which begins the process by
considering all the features and removes the least significant
feature.
• This elimination process continues until removing the features does
not improve the performance of the model.
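Both forward selection and backward elimination are available in scikit-learn's SequentialFeatureSelector (version ≥ 0.24 assumed); a minimal sketch on the Iris data, with the estimator and the number of features chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start empty, add one feature per step
forward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                    direction="forward").fit(X, y)

# Backward elimination: start with all features, remove one per step
backward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print(forward.get_support(), backward.get_support())
```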
Exhaustive feature selection:
• Exhaustive feature selection is one of the most thorough feature selection
methods; it evaluates every feature subset by brute force.
• It tries every possible combination of features and returns the
best performing feature set.

Recursive feature elimination:
• Recursive feature elimination is a recursive greedy optimization
approach, where features are selected by recursively considering
smaller and smaller subsets of features.
• An estimator is trained on each subset of features, and the
importance of each feature is determined using the coef_ attribute or
the feature_importances_ attribute.
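A minimal sketch with scikit-learn's RFE, which ranks features via the estimator's coef_ or feature_importances_ attribute; the estimator and feature count are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively drop the weakest feature until two remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # mask of the selected features
print(rfe.ranking_)   # rank 1 = selected
```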
Filter methods
• In Filter Method, features are selected on the basis of statistics
measures. This method does not depend on the learning algorithm and
chooses the features as a pre-processing step.
• The filter method filters out the irrelevant features and redundant
columns from the model by using different metrics through ranking.
• Some techniques of filter methods are:
• Information Gain
• Chi-Square Test
• Fisher’s Score
• Missing Value ratio
Information Gain:
• Information gain determines the reduction in entropy while
transforming the dataset.
• It can be used as a feature selection technique by calculating the
information gain of each variable with respect to the target
variable.
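Information gain is closely related to mutual information; scikit-learn's mutual_info_classif can be used to score each feature against the target. A minimal sketch on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Higher score = the feature removes more uncertainty about the target
scores = mutual_info_classif(X, y, random_state=42)
print(scores)
```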

Chi-square Test:
• Chi-square test is a technique to determine the relationship between
the categorical variables.
• The chi-square value is calculated between each feature and the
target variable, and the desired number of features with the best
chi-square value is selected.
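A minimal sketch with SelectKBest and the chi2 score function (feature values must be non-negative, e.g. counts); the choice of k is an illustrative assumption.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # all feature values are non-negative

# Keep the 2 features with the best chi-square scores
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)
print(X_new.shape)                  # (150, 2)
```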
Fisher's Score:
• Fisher's score is one of the popular supervised techniques of feature
selection. It returns the ranks of the variables based on Fisher's criterion in
descending order. We can then select the variables with a large Fisher's
score.
Missing Value Ratio:
• The missing value ratio of each feature can be evaluated
against a threshold value. A variable whose missing value ratio is higher
than the threshold can be dropped.
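A minimal pandas sketch, assuming a made-up data frame and a threshold of 0.3 (30% missing values).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 40, np.nan, 31],
                   "salary": [50000, 60000, np.nan, 58000, 52000],
                   "city":   ["A", "B", "A", "C", "B"]})

# Fraction of missing values per column
missing_ratio = df.isnull().mean()
print(missing_ratio)

# Drop columns whose missing value ratio exceeds the threshold
threshold = 0.3
df_selected = df.loc[:, missing_ratio <= threshold]
print(df_selected.columns.tolist())
```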
Embedded methods
• Embedded methods combine the advantages of both filter and
wrapper methods by considering the interaction of features along with
low computational cost.
• These are fast processing methods similar to the filter method but
more accurate than the filter method.
• Some of the techniques are:
• Regularization
• Random forest importance
Regularization:
• Regularization adds a penalty term to different parameters of the
machine learning model for avoiding overfitting in the model.
• The penalty is applied to the coefficients; with L1 regularization it shrinks
some coefficients to exactly zero.
• Those features with zero coefficients can be removed from the dataset.
• Common regularization techniques are L1 regularization (Lasso
regularization) and Elastic Net (a combination of L1 and L2 regularization).
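A minimal sketch of L1 regularization with scikit-learn's Lasso; features whose coefficients are shrunk to zero can be dropped. The data set and alpha value are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)

# Features with non-zero coefficients survive the L1 penalty
selected = np.where(lasso.coef_ != 0)[0]
print(lasso.coef_)
print("selected feature indices:", selected)
```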
• Random Forest Importance -
• Different tree-based methods of feature selection help us with feature
importance to provide a way of selecting features.
• Here, feature importance specifies which feature has more importance
in model building or has a great impact on the target variable.
• Random Forest is such a tree-based method; it is a type of bagging
algorithm that aggregates a number of different decision trees.
• It automatically ranks the nodes by their performance, or decrease in
impurity (Gini impurity), over all the trees.
• Nodes are arranged by their impurity values, which allows pruning of
the tree below a specific node.
• The remaining nodes create a subset of the most important features.
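A minimal sketch of random forest based feature importance with scikit-learn, using the Iris data for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Mean decrease in Gini impurity, averaged over all trees
for name, score in zip(data.feature_names, rf.feature_importances_):
    print(f"{name}: {score:.3f}")
```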
Filter vs Wrapper vs Embedded methods
