
2) Theoretical Background

This chapter defines the following parts of the project: EDA (Exploratory Data Analysis), Feature
Engineering, Feature Selection, and Model Building, and it forms the basis for the rest of the project.
2.1 EDA (Exploratory Data Analysis)
Exploratory Data Analysis refers to the critical process of performing initial investigations on data
so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help
of summary statistics and graphical representations.

It is good practice to understand the data first and to gather as many insights from it as possible.
EDA is all about making sense of the data in hand before getting our hands dirty with it.

 In EDA we load the dataset
 Describe the data using some functions (df.describe(), df.info())
 Find any missing values (df.isna())
 Find the outliers (using a histogram, box plot)
 Visualize the data (using Matplotlib, Seaborn)
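A minimal sketch of the loading and inspection steps in pandas is shown below; the file name "data.csv" is a placeholder for the project's actual dataset, not a name taken from the project itself.

import pandas as pd

# Load the dataset ("data.csv" is a placeholder file name)
df = pd.read_csv("data.csv")

# Summary statistics for numeric columns
print(df.describe())

# Column names, types, and non-null counts (prints directly)
df.info()

# Count missing values per column
print(df.isna().sum())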
Figure 2.1: A histogram plot.
The histogram above (Figure 2.1) conveys various pieces of information about a variable
(feature or column), such as its distribution and skew.
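Plots like Figure 2.1 can be produced with Matplotlib and Seaborn. The sketch below assumes df is the DataFrame loaded in the previous snippet and that "price" is a hypothetical numeric column used only for illustration.

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram: shows the distribution and skew of a numeric column
sns.histplot(df["price"], bins=30)
plt.title("Distribution of price")
plt.show()

# Box plot: highlights potential outliers beyond the whiskers
sns.boxplot(x=df["price"])
plt.show()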
2.2 Feature Engineering

Feature engineering is the next step in a data science or machine learning project
after EDA (Exploratory Data Analysis).

Feature engineering is the process of using domain knowledge to extract features
(characteristics, properties, attributes) from raw data. A feature is a property shared by
independent units on which analysis or prediction is to be done. Features are used by predictive
models and influence results.
In EDA we only get to know the data (missing values, outliers), but in feature engineering
we clean the data by handling missing values and removing outliers.

Handling Missing Values
 Drop missing values
 Fill missing values with the mean, median, or mode

Handling Outliers
 Using the standard deviation
 Normal distribution
 IQR (Inter-Quartile Range)
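A minimal pandas sketch of these cleaning steps follows; df and the column name "price" are the same hypothetical examples used above, not names from the project's dataset.

import pandas as pd

df = pd.read_csv("data.csv")  # placeholder file name, as before

# Option 1: drop rows that contain missing values
df_dropped = df.dropna()

# Option 2: fill missing values with a summary statistic
df["price"] = df["price"].fillna(df["price"].median())

# Handle outliers with the IQR (inter-quartile range) rule:
# values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df["price"] >= lower) & (df["price"] <= upper)]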

2.3 Feature Selection


Feature selection is the process of reducing the number of input variables when developing a
predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of
modeling and, in some cases, to improve the performance of the model.
Statistical-based feature selection methods involve evaluating the relationship between each
input variable and the target variable using statistics and selecting those input variables that
have the strongest relationship with the target variable. These methods can be fast and
effective, although the choice of statistical measures depends on the data type of both the input
and output variables.

As such, it can be challenging for a machine learning practitioner to select an appropriate
statistical measure for a dataset when performing filter-based feature selection.

In this section, we discuss how to choose statistical measures for filter-based feature
selection with numerical and categorical data.

 There are two main types of feature selection techniques: supervised and unsupervised,
and supervised methods may be divided into wrapper, filter and intrinsic.

 Filter-based feature selection methods use statistical measures to score the correlation
or dependence between input variables that can be filtered to choose the most relevant
features.

 Statistical measures for feature selection must be carefully chosen based on the data
type of the input variable and the output or response variable.
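As a concrete illustration of a filter method, the sketch below uses scikit-learn's SelectKBest with the ANOVA F-test (f_classif), which suits numerical inputs and a categorical target. The iris dataset is used only as placeholder data; the project's actual features and target would take its place.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Placeholder data: numerical inputs, categorical target
X, y = load_iris(return_X_y=True)

# Score each feature against the target with the ANOVA F-test
# and keep the two highest-scoring features
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # per-feature F-statistics
print(X_selected.shape)   # (n_samples, 2)

For other data-type combinations, a different score function would be chosen (for example, chi-squared statistics for categorical inputs), in line with the point above about matching the measure to the input and output types.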
2.4 Model Building

A machine learning model is built by learning and generalizing from training data, then
applying that acquired knowledge to new data it has never seen before to make predictions and
fulfill its purpose. A lack of data will prevent you from building the model, and merely having
access to data isn't enough.
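A minimal sketch of this train-then-predict workflow with scikit-learn is given below; the logistic regression model and the iris data are placeholders for whatever model and dataset the project ultimately uses.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder dataset; in the project this would be the cleaned,
# feature-selected data from the previous steps
X, y = load_iris(return_X_y=True)

# Hold out unseen data to test generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)       # learn from training data

y_pred = model.predict(X_test)    # predict on unseen data
print(accuracy_score(y_test, y_pred))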
