Data Preprocessing and Feature Engineering: Key to Improving AI Algorithm Performance
Table of Contents

1 Introduction to Data Preprocessing and Feature Engineering
2 Understanding Data Types
3 Handling Missing Values
4 Identifying and Handling Outliers
5 Encoding Categorical Data
6 Feature Scaling
01 Introduction to Data Preprocessing and Feature Engineering
The Importance of Data Quality

Garbage In, Garbage Out (GIGO)
High-quality data ensures reliable AI model outputs; flawed data leads to inaccurate or biased results, emphasizing data quality's critical role.

Real-world Examples
Cases of poor data quality lead to project failure or biased models, highlighting the significance of data preprocessing and feature engineering.

Impact on Model Accuracy
Data preprocessing and feature engineering directly influence model performance; well-prepared data yields more accurate and reliable models.
Overview of Data Preprocessing

Definition and Goals
Data preprocessing involves cleaning, transforming, and organizing raw data to enhance data quality and prepare it for analysis.

Key Steps in Data Preprocessing
Essential steps in data preprocessing include data cleaning, integration, transformation, and reduction; each step enhances data quality.

Benefits of Data Preprocessing
Benefits of data preprocessing include improved accuracy, reduced bias, and better model performance, enhancing overall analytical effectiveness.
Overview of Feature Engineering

Definition and Goals
Feature engineering extracts the most relevant information from raw data, improving model accuracy and efficiency.

Methods in Feature Engineering
Feature creation, selection, and transformation are the primary methods; each provides effective tools for optimizing the input data.

Benefits of Feature Engineering
Better model interpretability, higher accuracy and generalization, and dimensionality reduction are key benefits, leading to more insight and efficiency.
02 Understanding Data Types
Numerical Data

01 Continuous Data
Continuous data can take any value within a range; examples include temperature, height, and weight, often requiring scaling.

02 Discrete Data
Discrete data consists of distinct, separate values; examples include counts or integers, which may need specific handling.

03 Importance of Understanding Numerical Data
Understanding data properties is key to preprocessing and feature engineering and determines the appropriate techniques to use.
Categorical Data

Nominal Data
Nominal data represents categories without any inherent order; examples include colors or types of fruit, often requiring encoding.

Ordinal Data
Ordinal data represents categories with a meaningful order or ranking; examples include ratings, which need careful encoding to preserve order.

Importance of Understanding Categorical Data
Appropriate handling of categorical data ensures accurate model representation, leading to better insights from qualitative variables.
Time Series Data

Date/Time Components
Time series data includes specific date/time values; extract components like year, month, day, and hour to reveal temporal patterns.

Time-Based Features
Creating lag features, rolling statistics, and seasonal decompositions is crucial to understanding time-based trends in the data.

Importance of Understanding Time Series Data
Proper feature engineering rests on recognizing seasonality, which enhances a model's predictive capabilities over time.
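A minimal sketch of these ideas in pandas, assuming a hypothetical daily sales table (the column names and values are illustrative only):

import pandas as pd

# Hypothetical daily sales data.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [120, 135, 128, 150, 160, 155, 170, 165, 180, 175],
})

# Extract date/time components to expose temporal patterns.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek

# Lag feature: yesterday's sales as a predictor for today.
df["sales_lag1"] = df["sales"].shift(1)

# Rolling statistic: a 3-day moving average smooths short-term noise.
df["sales_roll3"] = df["sales"].rolling(window=3).mean()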
03 Handling Missing Values
Identifying Missing Values

Techniques for Detecting Missing Data
Check for null values and use summary statistics to identify missing data in your dataset; then address it accurately.

Visualizing Missing Data
Visual tools such as heatmaps and missingness matrices can help reveal patterns and correlations in data gaps and guide handling strategies.

Common Causes of Missing Data
Data entry mistakes, system errors, or non-response can create missing data; understanding the causes is key to informing handling strategies.
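A quick detection sketch with pandas, on a hypothetical two-column frame:

import numpy as np
import pandas as pd

# Hypothetical dataset with gaps.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "income": [52000, 61000, np.nan, 78000, 45000],
})

# Count of missing values per column.
print(df.isnull().sum())

# Share of missing values per column, useful when choosing a strategy.
print(df.isnull().mean())

# Summary statistics: the 'count' row also reveals missing entries.
print(df.describe())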
Strategies for Handling Missing Values

Deletion Methods
Remove rows or columns with too many missing values.

Imputation with Statistical Measures
Fill missing values with the mean, median, or mode; suitable for numerical data, but consider the data's distribution.

Using the fillna() Method in Pandas
The fillna() method in Pandas can be applied to fill missing values quickly; choose the filling method appropriate for the data.

Imputation with Predictive Models
K-Nearest Neighbors (KNN) or regression models can predict missing values from the remaining features.
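A sketch of all three strategies, on a similar hypothetical frame:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, 52],
    "income": [52000, 61000, np.nan, 78000, 45000],
})

# Deletion: drop any row that still contains a missing value.
dropped = df.dropna()

# Statistical imputation: fill with the column median (robust to skew).
filled = df.fillna(df.median(numeric_only=True))

# Predictive imputation: KNN estimates each gap from similar rows.
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)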
Best Practices for Handling Missing Values

Documenting the Missing-Value Strategy
Keep a detailed log of the methods used for dealing with missing values, including their rationale, during preprocessing.

Evaluating the Impact of Imputation
Evaluate the change in model performance before and after handling missing values to ensure data fidelity after changes.

Sensitivity Analysis
Assess how different handling strategies affect your results by applying a range of strategies and measuring stability.
04 Identifying and Handling Outliers
Understanding Outliers

Definition of Outliers
Outliers are data points that deviate significantly from the rest of the dataset; understand and handle them carefully.

Sources of Outliers
Outliers can arise from data entry errors, measurement errors, or genuine extreme values; determine their origin before deciding how to treat them.

Impact of Outliers on Models
Outliers can distort statistical analyses and model predictions, reducing accuracy and reliability; hence their importance in preprocessing.
Methods for Detecting Outliers

Statistical Methods (Z-score, IQR)
The Z-score measures the number of standard deviations from the mean; the IQR method identifies outliers based on the interquartile range.

Visual Inspection
Box plots, scatter plots, and histograms can visually identify data points that lie far from central clusters.

Clustering Techniques (DBSCAN)
Density-Based Spatial Clustering of Applications with Noise groups dense points together and marks points in sparse areas as outliers.
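A sketch of the two statistical detectors on a hypothetical income column (the values are synthetic):

import numpy as np
import pandas as pd

# 99 typical values plus one planted extreme value.
rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(50, 5, 99), 300), name="income_k")

# Z-score: flag points more than 3 standard deviations from the mean.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# IQR: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]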
Strategies for Handling Outliers

Transformation Techniques (Log, Winsorizing)
Logarithmic scaling reduces the impact of outliers, while winsorizing caps extreme values, preserving data integrity.

Trimming Outliers
Removing extreme values from the dataset enhances robustness and reduces misleading interpretations; consider the impact on variability.

Keeping Outliers
Retain outliers if they are valid, representative data points; consider their potential impact on the overall representation.
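A sketch of the three options, again on synthetic values:

import numpy as np
import pandas as pd

s = pd.Series([20, 25, 22, 30, 28, 500], name="income_k")

# Log transform: compresses the right tail, shrinking outlier influence.
log_s = np.log1p(s)  # log(1 + x) also handles zeros safely

# Winsorizing: cap values at chosen percentiles instead of removing them.
lower, upper = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=lower, upper=upper)

# Trimming: drop the flagged extremes entirely (here, above the cap).
trimmed = s[s <= upper]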
05 Encoding Categorical Data
Why Encode Categorical Data?

Machine Learning Algorithm Requirements
Machine learning algorithms typically require numerical input, making categorical encoding necessary for processing data effectively.

Transforming Categorical Features
Encoding transforms categorical variables into a numerical format, enabling their use in machine learning models and analyses.

Improved Model Performance
Proper encoding improves model performance by representing categorical data in a format that algorithms can interpret.
Common Encoding Techniques

Label Encoding
Label Encoding assigns a unique integer to each category; simple, but it may imply unintended ordinal relationships.

Ordinal Encoding
Ordinal Encoding maps categories to integers based on their inherent order; important for preserving sequence information.

One-Hot Encoding
One-Hot Encoding creates binary columns for each category, providing clarity but also increasing dimensionality.

Using Pandas get_dummies()
Pandas get_dummies() simplifies One-Hot Encoding: a single function call converts categorical values into dummy variables.
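A sketch of the three techniques, using hypothetical "color" (nominal) and "size" (ordinal) columns:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],     # nominal
    "size": ["small", "large", "medium", "small"],  # ordinal
})

# One-Hot Encoding via get_dummies: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal Encoding: map categories to integers in a meaningful order.
order = [["small", "medium", "large"]]
df["size_enc"] = OrdinalEncoder(categories=order).fit_transform(df[["size"]]).ravel()

# Label Encoding: unique integer per category (the order is arbitrary).
df["color_label"] = LabelEncoder().fit_transform(df["color"])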
Considerations When Choosing Encoding Methods

01 Type of Categorical Variable
Consider the variable type when encoding: One-Hot Encoding for nominal variables and Label or Ordinal Encoding for ordinal ones, to preserve data structure.

02 Potential for High Cardinality
Beware of high-cardinality variables, as they can cause dimensionality issues; group or refine categories as appropriate.

03 Impact on Model Interpretability
Ensure encoding enhances, not obscures, model interpretability, so that meaningful insights can be derived from machine learning processes.
06 Feature Scaling
Overview of Feature Scaling

Definition and Importance
Feature scaling standardizes variables to a similar range, preventing domination by features with large values and improving model efficiency.

Common Scaling Techniques
Normalization and Standardization are the main techniques; each uses a distinct mathematical function to scale features.

When to Apply Feature Scaling
Apply scaling when the algorithm leverages distances, gradients, or regularization; this ensures such algorithms work well.
Normalization (Min-Max Scaling)

How It Works
Normalization scales values to a specific range, usually between 0 and 1, preserving the shape of the original distribution.

Formula and Implementation
Formula: X_scaled = (X - X_min) / (X_max - X_min); implemented with scikit-learn's MinMaxScaler.

Use Cases and Scenarios
Suitable for data without a known bell-shaped distribution; often used in techniques like image processing.
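A minimal sketch with scikit-learn (the input matrix is illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [15.0], [40.0]])

# Min-Max scaling maps each column into [0, 1]:
# equivalent to (X - X.min()) / (X.max() - X.min()) per column.
X_scaled = MinMaxScaler().fit_transform(X)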
Standardization (Z-Score Scaling)

How It Works
Standardization rescales features to zero mean and unit variance.

Formula and Implementation
Formula: X_scaled = (X - μ) / σ; implemented with scikit-learn's StandardScaler for efficient processing.

Use Cases and Scenarios
Standardization benefits algorithms sensitive to feature scaling, as it transforms data to zero mean and unit variance.
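The companion sketch with scikit-learn, on the same illustrative matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [15.0], [40.0]])

# Standardization subtracts each column's mean and divides by its
# standard deviation, giving zero mean and unit variance per feature.
X_std = StandardScaler().fit_transform(X)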
Choosing Between Normalization and Standardization

Data Distribution Considerations
Normalization keeps values within a defined range and preserves the shape of a non-normal distribution; Standardization re-centers and rescales the data to zero mean and unit variance.

Algorithm Requirements
Standardization is less sensitive to outliers than min-max scaling and suits algorithms that assume roughly normal inputs; choose the method carefully.

Practical Implications
Normalization maintains the original distribution within a fixed range, while Standardization ensures compatibility with algorithms that expect zero-mean, unit-variance inputs.
07 Pandas Data Cleansing Case Study
Case Study Overview

Dataset Description
A description of the dataset used in the case study, including its source, size, and the key features to know before cleaning.

Objectives of the Case Study
The objective is to illustrate how to use Pandas for data cleaning tasks such as handling missing values, outliers, and data formatting.

Tools and Libraries Used
A listing of the Python tools and libraries used, with Pandas as the primary one, and how each is applied within the cleansing process.
Data Cleaning Steps

Importing and Exploring the Data
Load the CSV with Pandas and inspect headers, data types, and basic statistics to reveal potential data quality problems.

Handling Missing Values in Pandas
Address nulls using fillna and dropna, with strategic imputation such as the mean or median based on the data distribution.

Outlier Detection and Treatment in Pandas
Spot outliers using box plots and Z-scores, then transform the extreme values or drop them.

Data Transformation and Formatting in Pandas
Standardize data in Pandas by converting strings into numerical structures and reformatting datatypes for consistent analysis.
08 Big Quiz: Knowledge Assessment
Questions on Data Types

1. Which of the following examples represents continuous numerical data in a dataset?
A) Number of students in a classroom
B) Age of individuals recorded to the nearest year
C) Temperature measured in Celsius
D) Postal codes of customer addresses

2. In a dataset, which of the following variables is an example of discrete numerical data?
A) Height of students in centimeters
B) Number of cars sold by a dealership in a month
C) Speed of a vehicle measured in km/h
D) Amount of rainfall in millimeters

3. You are analyzing survey data from a customer satisfaction study. The dataset contains the following two columns:
• Column A: Type of payment method used (Credit Card, Debit Card, UPI, Cash)
• Column B: Satisfaction level rated as (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied)
You want to choose the correct data visualization methods for each column. Which of the following is the most appropriate approach?
A) Use a pie chart for Column A and calculate the median for Column B
B) Use a bar chart for Column B and compute the mode for Column A
C) Use a histogram for both columns
D) Use the mean for both columns to summarize the data
Questions on Handling Missing Values and Outliers

1. You are handling a dataset containing household income values. Several records are missing in the "Annual Income" column. You are considering imputing the missing values with either the mean or the median. Which of the following statements is most appropriate?
A) Mean imputation is always better because it uses all the data points.
B) Median imputation is preferable if the data is skewed or contains outliers.
C) Mean and median imputation yield identical results in all cases.
D) Missing values should never be imputed in a dataset.

2. A data analyst is working with a dataset containing annual household income values. The distribution is highly right-skewed due to a few extremely high-income entries. The analyst wants to reduce the impact of these outliers before training a machine learning model. Which of the following is the best approach in this scenario?
A) Apply a log transformation to normalize the distribution
B) Remove all rows with the highest income values
C) Use one-hot encoding on the income column
D) Standardize the data without handling outliers
Questions on Encoding and Feature Scaling

Quiz Question 1
You are preparing a dataset for a machine learning model. One column, "City", contains the names of cities (e.g., Bangalore, Mumbai, Delhi, Chennai). You want to encode this feature for use in a decision tree classifier. Which encoding technique should you choose, and why?
A) Use Label Encoding because city names are strings and the model can interpret the labels as ordinal data
B) Use One-Hot Encoding to prevent the model from assuming any ordinal relationship among the city names
C) Use Label Encoding because it reduces dimensionality and improves accuracy
D) Use One-Hot Encoding only when the variable has more than 50 categories

Quiz Question 2
You are preparing features for a machine learning model. The data is normally distributed. You are using a linear regression algorithm. Which scaling method is most appropriate?
A) Min-Max Scaling, as it works well with normal distributions
B) Standard Scaling, because it centers the data and preserves normality
C) No scaling is needed since linear regression is scale-invariant
D) Use Min-Max Scaling only if there are outliers
Thank you for listening.