
Chapter 2: Data Transformation



Data transformation
• Data transformation techniques in machine learning are used to transform data from one form to another while keeping the essence of the data.
• Transformers are functions that aim to produce (approximately) normally distributed data.
• Normal distribution: A normal (Gaussian) distribution is a probability distribution that is symmetric about its mean and follows a bell-shaped curve; its mean, median and mode are equal. A normal distribution with mean 0 and standard deviation 1 is known as the standard normal distribution. About 99.7% of the values lie within 3 standard deviations of the mean.
• Data normalization in ML transforms data into a common format so that it can be used in analytics and machine learning algorithms. It is typically used to transform raw data into a more useful form for ML algorithms such as linear regression, logistic regression, and neural networks. Data normalization can be applied to numerical and categorical data, and it can help reduce the complexity of the data.
• Skewness: Skewness of a distribution is defined as the lack of symmetry. In a symmetrical distribution, the mean, median and mode are equal. The normal distribution has a skewness of 0. Skewness tells us about the shape of the distribution of our data.
• Skewness is of two types:
Positive skewness: When the tail on the right side of the distribution is longer or fatter, we say the data is positively skewed. For positive skewness, mean > median > mode.
Negative skewness: When the tail on the left side of the distribution is longer or fatter, we say the distribution is negatively skewed. For negative skewness, mean < median < mode.
Coefficient of Skewness: Pearson developed two methods to find skewness in a sample.
Pearson's Coefficient of Skewness using mode:
SK1 = (mean - mode)/sd
where sd is the standard deviation for the sample.
• The value of this coefficient would be zero in a symmetrical distribution.
• If the mean is greater than the mode, the coefficient of skewness is positive; otherwise it is negative.
Pearson's Coefficient of Skewness using median:
SK2 = 3(mean - median)/sd
where sd is the standard deviation for the sample. It is generally used when the mode is unknown.
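A minimal sketch in Python (assuming NumPy is available) of how these coefficients could be computed for a small hypothetical sample:

    import numpy as np
    from statistics import mode

    data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 9])  # hypothetical sample

    mean = data.mean()
    median = np.median(data)
    mode_value = mode(data.tolist())   # most frequent value
    sd = data.std(ddof=1)              # sample standard deviation

    sk1 = (mean - mode_value) / sd      # Pearson's coefficient using the mode
    sk2 = 3 * (mean - median) / sd      # Pearson's coefficient using the median
    print(sk1, sk2)                     # both positive: the sample is right-skewed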
• Kurtosis: It is a measure of the tailedness of a distribution. Tailedness is how often outliers occur. Tails are the tapering ends on either side of a distribution. They represent the probability/frequency of values that are extremely high or low compared to the mean.
• Kurtosis is of three types:
Mesokurtic: Distributions with medium kurtosis (medium tails) are mesokurtic. When the tails of the distribution are similar to those of the normal distribution, it is mesokurtic. The kurtosis of the normal distribution is 3.
Platykurtic: Distributions with low kurtosis (thin tails) are platykurtic. The kurtosis is less than 3, which implies thinner tails, i.e. fewer outliers than the normal distribution. In this case, the bell-shaped distribution is broader and its peak is lower than the mesokurtic one.
Leptokurtic: Distributions with high kurtosis (fat tails) are leptokurtic. If the kurtosis is greater than 3, the distribution is leptokurtic. In this case, the tails are heavier than those of the normal distribution, which means many outliers are present in the data. It can be recognized as a thin bell-shaped distribution with a peak higher than the normal distribution.
• If the data is skewed, it means that the process is not centered around
the target value, and may have more variation in one direction than the
other. This can lead to more defects, waste, or customer dissatisfaction.
• If the data is kurtotic, it means that the process has more or less
extreme values than expected, which can indicate instability, outliers, or
non-normality. This can affect the validity of the statistical tests and
assumptions, and reduce the process capability.
Types of kurtosis
Distributions can be categorized into three groups based on their kurtosis:

                      Mesokurtic     Platykurtic   Leptokurtic
Tailedness            Medium-tailed  Thin-tailed   Fat-tailed
Outlier frequency     Medium         Low           High
Kurtosis              Moderate (3)   Low (< 3)     High (> 3)
Excess kurtosis       0              Negative      Positive
Example distribution  Normal         Uniform       Laplace

• Data transformations are mathematical operations that can be used to correct skewness and kurtosis in the data, making the data more normal or symmetrical.
• There are 3 types of data transformation techniques:
Function Transformers
Power Transformers
Quantile Transformers
1) Function Transformers
• Function transformers are a type of data transformation technique that uses a particular function to transform the data towards a normal distribution.
• There are 5 types of function transformers that are commonly used to address non-normality:
a) Log Transform
b) Square Transform
c) Square Root Transform
d) Reciprocal Transform
e) Custom Transform
a) Log Transform
• Log transform is one of the simplest transformations, in which the logarithm is applied to every single observation of the data.
• It transforms right-skewed data into normally distributed data very well.
• The logarithm is defined only for positive values, so we cannot apply the log transformation to 0 or negative numbers.
• The transformation is: x' = log(x), where x is an attribute in the dataset.

b) Square Transform
• Square transform is the type of transformer in which the square of the data is considered instead of the original data.
• In this case, the square function is applied to the data, and the square of every single observation is taken as the final transformed value. Since squaring stretches larger values, it is mainly useful for reducing left (negative) skew.
• The transformation is: x' = x², where x is an attribute in the dataset.
c) Square Root Transform
• In this transform, the square root of the data is calculated.
• Like the log transform, the square root compresses larger values, so it performs well on right-skewed data and efficiently transforms moderately right-skewed data into approximately normally distributed data.
• The transformation is: x' = √x, where x is an attribute in the dataset.

d) Reciprocal Transform
• In this transform, the reciprocal of every observation is considered.
• This transformation can only be used for non-zero values.
• The transformation is: x' = 1/x, where x is an attribute in the dataset.
e) Custom Transform
• The log and square root transforms cannot be used on every dataset, as every dataset can have different patterns and complexity.
• Based on domain knowledge of the data, custom transformations can be applied to transform the data towards a normal distribution.
• The custom transform can be any function, such as sin, cos, tan, cube, cube root, etc., as sketched in the example below.
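A minimal sketch (assuming scikit-learn and NumPy) of how these function transformers could be applied; the data values are hypothetical:

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer

    X = np.array([[200.0], [450.0], [300.0], [800.0], [150.0]])  # hypothetical right-skewed attribute

    log_tf = FunctionTransformer(np.log1p)        # log transform; log1p = log(1 + x), avoids log(0)
    sqrt_tf = FunctionTransformer(np.sqrt)        # square root transform
    recip_tf = FunctionTransformer(np.reciprocal) # reciprocal transform (non-zero values only)
    custom_tf = FunctionTransformer(np.cbrt)      # custom transform: cube root

    X_log = log_tf.fit_transform(X)
    print(X_log.ravel())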
2) Power Transformers
• Power transformation techniques are a type of data transformation technique in which a power is applied to the data observations to transform the data.
• There are two types of Power Transformation techniques:
a) Box-Cox Transform
b) Yeo-Johnson Transform
a) Box-Cox Transform
• This transform technique is mainly used for transforming the data observations by applying a power to them.
• The power applied to the data observations is denoted by lambda (λ).
• There are mainly two conditions associated with the power in this transform: lambda equal to zero and lambda not equal to zero.
• The transformation is:
x' = (x^λ - 1)/λ,  if λ ≠ 0
x' = log(x),       if λ = 0
where x is a strictly positive attribute in the dataset and λ is the transformation parameter.
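A minimal sketch (assuming SciPy) of the Box-Cox transform, where λ is estimated from the data by maximum likelihood; the generated sample is hypothetical:

    import numpy as np
    from scipy import stats

    x = np.random.default_rng(0).exponential(scale=2.0, size=500)  # hypothetical right-skewed, positive data
    x_transformed, fitted_lambda = stats.boxcox(x)                 # λ estimated by maximum likelihood
    print(fitted_lambda)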

b) Yeo-Johnson Transform
• This transformation technique is also a power transform technique, where a power of the data observations is applied to transform the data.
• It is an advanced form of the Box-Cox transformation technique and can be applied even to zero and negative values of data observations.
• The mathematical formulation of this transformation technique is as follows:
x' = ((x + 1)^λ - 1)/λ,             if λ ≠ 0 and x ≥ 0
x' = log(x + 1),                    if λ = 0 and x ≥ 0
x' = -((1 - x)^(2-λ) - 1)/(2 - λ),  if λ ≠ 2 and x < 0
x' = -log(1 - x),                   if λ = 2 and x < 0
where x is an attribute in the dataset and λ is the transformation parameter.
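A minimal sketch (assuming scikit-learn) of both power transforms; PowerTransformer estimates λ automatically, and the data here is hypothetical:

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    X = np.random.default_rng(0).normal(size=(500, 1)) ** 2 - 0.5  # hypothetical skewed data with negatives

    # Yeo-Johnson handles zero and negative values directly
    yj = PowerTransformer(method='yeo-johnson')
    X_yj = yj.fit_transform(X)

    # Box-Cox requires strictly positive inputs, so the data is shifted for illustration
    bc = PowerTransformer(method='box-cox')
    X_bc = bc.fit_transform(X - X.min() + 1e-6)

    print(yj.lambdas_, bc.lambdas_)  # fitted λ values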
3) Quantile Transformers
• Quantile transformation techniques are a type of data transformation technique that can be applied to numerical data observations.
• A quantile determines how many values in a distribution are above or below a certain limit.
• The Quantile Transformer works better on larger datasets than the Power Transformer.
Figures: positively skewed and negatively skewed data before quantile transformation become approximately normally distributed after quantile transformation.
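A minimal sketch (assuming scikit-learn) of mapping skewed data to an approximately normal distribution with a quantile transformer; the data is hypothetical:

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    X = np.random.default_rng(0).lognormal(size=(1000, 1))  # hypothetical positively skewed data

    qt = QuantileTransformer(output_distribution='normal', n_quantiles=1000, random_state=0)
    X_normal = qt.fit_transform(X)  # values now approximately follow a standard normal distribution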
Handling imbalanced data
• A balanced dataset is one in which the target variable has equal or nearly
equal observations in all the classes.
• An unbalanced dataset is one in which the target variable has more
observations in one specific class than the others.
• The main problem with imbalanced dataset prediction is how accurately we actually predict both the majority and the minority class.
• Models may exhibit bias toward the majority class, resulting in poor predictions for the minority class.
• Example: Let's assume we are going to predict disease from an existing dataset where, for every 100 records, only 5 patients are diagnosed with the disease. So, the majority class is 95% with no disease and the minority class is only 5% with the disease. Now, if the model predicts that all 100 out of 100 patients have no disease, it is still 95% accurate, yet it fails to identify a single patient from the minority (disease) class.

• Approaches to Handle Imbalanced Data Set Problem:
Resampling (Oversampling and Undersampling): This technique is used to upsample the minority class or downsample the majority class. There are two main ways to perform resampling:
a) Undersampling: deleting samples from the majority class.
b) Oversampling: duplicating samples from the minority class.
• After resampling the data we get a balanced dataset for both the majority and minority classes. So, when both classes have a similar number of records present in the dataset, we can assume that the classifier will give equal importance to both classes, as in the sketch below.
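A minimal sketch (assuming pandas and scikit-learn) of random oversampling of a hypothetical minority class:

    import pandas as pd
    from sklearn.utils import resample

    # hypothetical imbalanced dataset with a binary 'target' column (95 vs 5 records)
    df = pd.DataFrame({'feature': range(100), 'target': [0] * 95 + [1] * 5})

    majority = df[df['target'] == 0]
    minority = df[df['target'] == 1]

    # Oversampling: duplicate minority samples (with replacement) up to the majority size
    minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
    balanced = pd.concat([majority, minority_upsampled])
    print(balanced['target'].value_counts())  # both classes now have 95 records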
SMOTE (Synthetic Minority Oversampling Technique): It is another technique to oversample the minority class. Simply adding duplicate records of the minority class often does not add any new information to the model. In SMOTE, new instances are synthesized from the existing data: SMOTE looks at minority class instances, uses k nearest neighbours to select a random nearest neighbour, and creates a synthetic instance at a random point between them in feature space, as sketched below.
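A minimal sketch of SMOTE, assuming the third-party imbalanced-learn package is installed; the dataset is hypothetical:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # hypothetical imbalanced dataset: roughly 95% class 0 and 5% class 1
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

    X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
    print(Counter(y), Counter(y_resampled))  # the classes are balanced after SMOTE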
Time series data
• A time series is a group of observations on a single entity over time
(regular time intervals).
• It is a type of data that tracks the evolution of a variable over time, such as
sales, stock prices, temperature, heart-rate etc.
• The regular time intervals can be daily, weekly, monthly, quarterly, or
annually, and the data is often represented as a line graph or time-series
plot.
• Time series data is commonly used in fields such as economics, finance,
weather forecasting, and operations management, among others, to analyze
trends and patterns, and to make predictions or forecasts.
• Time series analysis is a machine learning technique that forecasts a target value based solely on a known history of target values. It is a specialized form of regression, known as an auto-regressive model.
• Example of a time-series dataset: a CSV file containing the monthly balance of a user's bank account from January 1973 to September 1977.
Components of a time series data:
• A time series can be analyzed in detail by breaking it down into its primary components. This process is called time series decomposition. Time series data is composed of Trend, Seasonality, Cyclic and Residual components.

1) Trend Component: A trend in time series data refers to a long-term upward or downward movement in the data, indicating a general increase or decrease over time. The trend represents the underlying structure of the data, capturing the direction and magnitude of change over a longer period. There are several types of trends in time series data:
a) Upward Trend: A trend that shows a general increase over time, where
the values of the data tend to rise over time.
b) Downward Trend: A trend that shows a general decrease over time,
where the values of the data tend to decrease over time.
c) Horizontal Trend: A trend that shows no significant change over time,
where the values of the data remain constant over time.
d) Non-linear Trend: A trend that shows a more complex pattern of change
over time, including upward or downward trends that change direction or
magnitude over time.
e) Damped Trend: A trend that shows a gradual decline in the magnitude
of change over time, where the rate of change slows down over time.
2) Seasonality Component: Seasonality in time series data refers to
patterns that repeat over a regular time period, such as a day, a week, a
month, or a year. These patterns arise due to regular events, such as holidays,
weekends, or the changing of seasons. There are several types of seasonality in
time series data, including:
a) Weekly Seasonality: A type of seasonality that repeats over a 7-day period
and is commonly seen in time series data such as sales, energy usage, or
transportation patterns.
b) Monthly Seasonality: A type of seasonality that repeats over a 30- or 31-
day period and is commonly seen in time series data such as sales or weather
patterns.
c) Annual Seasonality: A type of seasonality that repeats over a 365- or 366-
day period and is commonly seen in time series data such as sales, agriculture,
or tourism patterns.
d) Holiday Seasonality: A type of seasonality that is caused by special events
such as holidays, festivals, or sporting events and is commonly seen in time
series data such as sales, traffic, or entertainment patterns.
3) Cycles Component: Cyclicity in time series data refers to the repeated
patterns or periodic fluctuations that occur in the data over a specific
time interval. It can be due to various factors such as seasonality (daily,
weekly, monthly, yearly), trends, and other underlying patterns. Cycles
usually occur along a large time interval, and the lengths of time between
successive peaks or troughs of a cycle are not necessarily the same.
4) Irregular Component: The component left after explaining the trend,
seasonal and cyclical movements is the irregular/residual component.
Irregularities in time series data refer to unexpected or unusual fluctuations in
the data that do not follow the general pattern of the data. These fluctuations
can occur for various reasons, such as measurement errors, unexpected
events, or other sources of noise.
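A minimal sketch (assuming statsmodels) of decomposing a hypothetical monthly series into trend, seasonal and residual components:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    # hypothetical monthly series (Jan 1973 to Sep 1977): upward trend + yearly seasonality + noise
    idx = pd.date_range('1973-01-01', periods=57, freq='MS')
    t = np.arange(57)
    values = 2 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(0).normal(0, 1, 57)
    series = pd.Series(values, index=idx)

    result = seasonal_decompose(series, model='additive', period=12)
    print(result.trend.head(14))     # long-term movement
    print(result.seasonal.head(14))  # repeating yearly pattern
    print(result.resid.head(14))     # irregular/residual component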
