SK2 = 3 (mean - median) / sd
where sd is the standard deviation of the sample. This form (Pearson's second
skewness coefficient) is generally used when the mode is unknown.
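As a minimal sketch of the formula above, the coefficient can be computed with Python's standard library. The function name and the sample values are hypothetical and chosen only for illustration; the sketch assumes sd is the sample standard deviation (n - 1 denominator).

```python
import statistics

def pearson_second_skewness(sample):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return 3 * (mean - median) / sd

# Hypothetical sample with a long right tail (mean > median).
right_skewed = [2, 3, 3, 4, 4, 5, 6, 9, 15]
# Hypothetical symmetric sample (mean == median).
symmetric = [1, 2, 3, 4, 5]

sk2_right = pearson_second_skewness(right_skewed)
sk2_symmetric = pearson_second_skewness(symmetric)

print("right-skewed sample:", round(sk2_right, 3))
print("symmetric sample:", sk2_symmetric)
```

A positive value indicates that the mean lies above the median (right skew), a negative value indicates left skew, and a value near zero indicates a roughly symmetric sample.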
• Kurtosis: It is a measure of the tailedness of a distribution. Tailedness is how
often outliers occur. Tails are the tapering ends on either side of a distribution.
They represent the probability/frequency of values that are extremely high or
low compared to the mean.
• There are three types of kurtosis:
Mesokurtic: Distributions with medium kurtosis (medium tails) are
mesokurtic. When the tails of a distribution are similar to those of the normal
distribution, it is mesokurtic. The kurtosis of the normal distribution is 3.
Platykurtic: Distributions with low kurtosis (thin tails) are platykurtic.
The kurtosis is less than 3, which implies thinner tails and fewer outliers than
the normal distribution. Such a distribution is often drawn as a broader bell
with a lower peak than the mesokurtic curve.
Leptokurtic: Distributions with high kurtosis (fat tails) are leptokurtic. If the
kurtosis is greater than 3, the distribution is leptokurtic. In this case, the tails
are heavier than those of the normal distribution, which means more outliers are
present in the data. Such a distribution is often drawn as a narrower bell with a
higher peak than the normal distribution.
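The three categories above can be illustrated with a small sketch that computes moment-based kurtosis (fourth central moment divided by the squared variance, so the normal distribution gives about 3). The function names, the tolerance used to decide what counts as "close to 3", and the sample data are assumptions made for this illustration only.

```python
import random

def kurtosis(sample):
    """Moment-based kurtosis m4 / m2**2 (about 3 for normal data)."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return m4 / m2 ** 2

def classify_kurtosis(k, tol=0.5):
    """Label a kurtosis value; tol is an arbitrary illustrative tolerance."""
    if k < 3 - tol:
        return "platykurtic"
    if k > 3 + tol:
        return "leptokurtic"
    return "mesokurtic"

rng = random.Random(42)
normal_like = [rng.gauss(0, 1) for _ in range(20000)]  # simulated normal data
flat = list(range(1, 101))                             # evenly spread, thin tails
heavy_tailed = [0] * 98 + [-10, 10]                    # mostly 0 plus two outliers

for name, sample in [("normal-like", normal_like),
                     ("flat", flat),
                     ("heavy-tailed", heavy_tailed)]:
    k = kurtosis(sample)
    print(name, round(k, 2), classify_kurtosis(k))
```

Some libraries report excess kurtosis (kurtosis minus 3) instead, in which case the normal distribution gives 0 rather than 3.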
• If the data is skewed, it means that the process is not centered around
the target value, and may have more variation in one direction than the
other. This can lead to more defects, waste, or customer dissatisfaction.
• If the data is kurtotic, it means that the process has more or less
extreme values than expected, which can indicate instability, outliers, or
non-normality. This can affect the validity of the statistical tests and
assumptions, and reduce the process capability.
Types of kurtosis
Distributions can be categorized into three groups
based on their kurtosis: mesokurtic, platykurtic, and leptokurtic.
b) Square Transform
• The square transform is a transformation in which the square of the data is
used instead of the original data.
• The square function is applied to the data, so the square of every
single observation becomes the final transformed data.
• The transformation is: $x' = x^2$, where x is an attribute in the dataset.
c) Square Root Transform
• In this transform, the square root of the data is calculated.
• This transform works well on right-skewed data, because it compresses large
values more than small ones and so brings the data closer to a normal
distribution. It can only be used for non-negative values.
• The transformation is: $x' = \sqrt{x}$, where x is an attribute in the dataset.
d) Reciprocal Transform
• In this transform, the reciprocal of every observation is used.
• This transformation can only be used for non-zero values.
• The transformation is: $x' = 1/x$, where x is an attribute in the dataset.
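The square, square root, and reciprocal transforms can be sketched in a few lines of Python. The helper names, the moment-based skewness function, and both samples are hypothetical and exist only to show the effect on made-up data: here the square root pulls in a long right tail, and the square pulls in a long left tail.

```python
import math

def skewness(sample):
    """Moment-based skewness m3 / m2**1.5 (0 for symmetric data)."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    return m3 / m2 ** 1.5

def square(x):
    return x ** 2

def square_root(x):
    return math.sqrt(x)  # raises ValueError for negative x

def reciprocal(x):
    if x == 0:
        raise ValueError("reciprocal transform is undefined for zero")
    return 1 / x

# Hypothetical samples: one with a long right tail, one with a long left tail.
right_skewed = [1, 1, 2, 2, 3, 4, 6, 9, 16, 36]
left_skewed = [1, 5, 7, 8, 9, 9, 10, 10, 10, 10]

skew_before_sqrt = skewness(right_skewed)
skew_after_sqrt = skewness([square_root(x) for x in right_skewed])
skew_before_square = skewness(left_skewed)
skew_after_square = skewness([square(x) for x in left_skewed])

print("square root:", round(skew_before_sqrt, 2), "->", round(skew_after_sqrt, 2))
print("square:     ", round(skew_before_square, 2), "->", round(skew_after_square, 2))
print("reciprocal of 4:", reciprocal(4))
```

In both cases the skewness moves toward zero but does not reach it, so the result should be checked (for example with a histogram) rather than assumed to be normal.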
e) Custom Transform
• The log and square root transforms cannot be used on every dataset, as
each dataset can have different patterns and complexity.
• Based on domain knowledge of the data, custom transformations can be
applied to bring the data closer to a normal distribution.
• A custom transform can be any function, such as sin, cos, tan,
cube, or cube root.
2) Power Transformers
• Power transformations are data transformation techniques in which a
power (exponent) is applied to the data observations to transform the
data.
• There are two types of Power Transformation techniques:
a) Box-Cox Transform
b) Yeo-Johnson Transform
a) Box-Cox Transform
• This technique transforms the data observations by raising them to a
power.
• The power applied to the data observations is denoted by lambda (λ).
• The transform has two cases, depending on the power: lambda equal to
zero and lambda not equal to zero.
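The two cases can be made concrete with a short sketch. It assumes the standard textbook definition of Box-Cox, which the text above does not spell out: (x^λ - 1)/λ when λ is not zero, and the natural logarithm ln(x) when λ equals zero, for strictly positive x. The function name is hypothetical.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value x with power lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)        # case 1: lambda equals zero
    return (x ** lam - 1) / lam   # case 2: lambda not equal to zero

values = [1.0, 2.0, 4.0, 8.0]  # hypothetical positive observations

for lam in (0, 0.5, 1, 2):
    transformed = [round(box_cox(v, lam), 4) for v in values]
    print("lambda =", lam, "->", transformed)
```

The logarithm is used at λ = 0 because it is the limit of (x^λ - 1)/λ as λ approaches zero, so the two cases join smoothly. In practice λ is usually estimated from the data rather than picked by hand; for example, `scipy.stats.boxcox` returns both the transformed data and a fitted λ when no λ is supplied.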
• After resampling the data, we get a dataset that is balanced between the
majority and minority classes. When both classes have a similar number of
records in the dataset, we can assume that the classifier will give equal
importance to both classes.
SMOTE (Synthetic Minority Oversampling Technique): It is another technique to
oversample the minority class. Simply adding duplicate records of the minority
class often does not add any new information to the model. In SMOTE, new
instances are synthesized from the existing data. For a minority class instance,
SMOTE finds its k nearest minority class neighbors, selects one of them at
random, and creates a synthetic instance at a random point between the two in
feature space.
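The idea behind SMOTE can be sketched in plain Python. This is a simplified illustration, not a reference implementation: the function name, the default k, the seed, and the sample points are all hypothetical, and it assumes numeric features with Euclidean distance.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # random point on the line segment between base and neighbor
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points with two features each.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]

new_points = smote_sketch(minority_points, n_new=5, k=2)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Because each synthetic point lies between two existing minority points, it adds variation that plain duplication does not. For real projects, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is likely a better choice than a hand-written version.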
Time series data
• A time series is a sequence of observations of a single entity recorded over
time, usually at regular time intervals.
• It is a type of data that tracks the evolution of a variable over time, such as
sales, stock prices, temperature, heart-rate etc.
• The regular time intervals can be daily, weekly, monthly, quarterly, or
annually, and the data is often represented as a line graph or time-series
plot.
• Time series data is commonly used in fields such as economics, finance,
weather forecasting, and operations management, among others, to analyze
trends and patterns, and to make predictions or forecasts.
• Time series forecasting is a machine learning technique that predicts a target
value based solely on the known history of that target value. It is a
specialized form of regression, known as an auto-regressive model.
• Example of a time-series dataset: a CSV file that contains the
monthly balance of a user's bank account from
January 1973 to September 1977.
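The auto-regressive idea mentioned above, predicting the next value from the series' own history, can be sketched with the simplest case: a first-order model y_t = a + b * y_(t-1) fitted by ordinary least squares. The function names are hypothetical, and the balances are made-up numbers generated from a known rule so the fit can be checked; they are not taken from the CSV file described above.

```python
def fit_ar1(series):
    """Least-squares fit of y_t = intercept + slope * y_(t-1)."""
    previous = series[:-1]   # predictor: value at time t-1
    current = series[1:]     # target: value at time t
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_curr = sum(current) / n
    sxx = sum((p - mean_prev) ** 2 for p in previous)
    sxy = sum((p - mean_prev) * (c - mean_curr)
              for p, c in zip(previous, current))
    slope = sxy / sxx
    intercept = mean_curr - slope * mean_prev
    return intercept, slope

def forecast_next(series, intercept, slope):
    """One-step-ahead forecast from the last observed value."""
    return intercept + slope * series[-1]

# Hypothetical monthly balances generated by y_t = 10 + 0.8 * y_(t-1).
balances = [100.0]
for _ in range(5):
    balances.append(10 + 0.8 * balances[-1])

intercept, slope = fit_ar1(balances)
next_balance = forecast_next(balances, intercept, slope)

print("intercept:", round(intercept, 4), "slope:", round(slope, 4))
print("forecast for next month:", round(next_balance, 4))
```

Real series are noisy and often need more than one lag, so the fit will not recover the coefficients exactly as it does on this synthetic data.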
Components of a time series data:
• A time series can be analyzed in detail by breaking it down into its primary
components. This process is called time series decomposition. Time
series data is composed of trend, seasonality, cyclic, and residual
components.