
MODULE 3

Chapter 1: Data



Contents
• Types of Data: Structured and Unstructured Data, Quantitative and
Qualitative Data.
• Four Levels of data (Nominal, Ordinal, Interval, Ratio Level).
1. Structured vs Unstructured
 Structured (Organized) Data: Data stored in a row/column structure.
• Every row represents a single observation, and each column represents a characteristic of that observation.
• Unstructured (Unorganized) Data: Data that is in free form and does not follow any standard format/hierarchy.
• E.g.: Text or raw audio signals that must be parsed further to become organized.
Pros of Structured Data
 Structured data is generally thought of as being much easier to work with and analyze.
 Most statistical and machine learning models were built with structured data in mind and cannot work with loosely formatted, unstructured data.
 The natural row and column structure is easy to digest for human and machine eyes.
Example of Data Pre-processing for Text Data
• Text data is generally unstructured, hence there is a need to transform it into structured form.
• A few characteristics that describe the data and assist transformation are:
Word/phrase count
The existence of certain special characters
The relative length of text
Picking out topics
Example: A Tweet
• This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn
skies.

• Pre-processing is necessary for this tweet because a vast


majority of learning algorithms require numerical data.
• Pre-processing allows us to explore features that have been
created from the existing features.
• For example, we can extract features such as word count and
special characters from the mentioned tweet.
1. Word/phrase counts:
• We may break down a tweet into its word/phrase count.
• For example, the word "this" appears in the tweet once, while "the" appears twice.
• We can represent this tweet in a structured format, converting the unstructured set of words into a row/column format, as in the sketch below:
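A minimal sketch of computing such word counts in Python (the tokenization is deliberately simplistic; real pre-processing would handle punctuation and case more carefully):

```python
from collections import Counter

tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# Lowercase each token and strip surrounding punctuation before counting.
words = [w.strip(".,?&").lower() for w in tweet.split()]
words = [w for w in words if w]  # drop empty tokens left by the bare "&"

word_counts = Counter(words)
print(word_counts["this"])  # 1
print(word_counts["the"])   # 2
```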

2. Presence of certain special characters
• We can likewise record, as columns, whether special characters appear in the tweet; this tweet contains both a question mark and an ampersand.
3. Relative length of text
• This tweet is 121 characters long.
• The average tweet, as discovered by analysts, is about 30
characters in length.
• So, we calculate a new characteristic, called relative length (the length of the tweet divided by the average length): 121/30 ≈ 4.03.
• This tweet is therefore 4.03 times as long as the average tweet.
4. Picking out topics
• This tweet is about astronomy, so we can add that information as a
column.
• Thus, we can convert a piece of text into structured/organized
data, ready for use in our models and exploratory analysis.

Topic: Astronomy
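Putting the four characteristics together, here is a minimal sketch of turning the tweet into a single structured row (the particular special characters tracked, the average length of 30, and the hand-assigned topic label follow the text; in practice the topic might come from a topic model):

```python
import pandas as pd

tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

AVG_TWEET_LENGTH = 30  # average tweet length in characters, per the text

row = {
    "word_count": len(tweet.split()),
    "has_question_mark": "?" in tweet,
    "has_ampersand": "&" in tweet,
    "relative_length": len(tweet) / AVG_TWEET_LENGTH,  # ≈ 4 times the average
    "topic": "astronomy",  # assigned by hand for this example
}

df = pd.DataFrame([row])  # one observation, one row; columns are features
print(df)
```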
2. Qualitative/Quantitative
1. Quantitative data: Data that can be described using numbers, and on which basic mathematical procedures, including addition and subtraction, can be performed.
2. Qualitative data: Data that cannot be described using numbers and on which basic mathematics cannot be performed; it is instead described using "natural" categories and natural language.
Example of Qualitative/Quantitative
Coffee Shop Data
Observations of coffee shops in a major city were made, and the following characteristics were recorded.
1. Name of coffee shop
2. Revenue (in thousands of dollars)
3. Zip code
4. Average monthly customers
5. Country of coffee origin
Let us try to classify each characteristic as Qualitative OR
Quantitative
1. Name of coffee shop
• Qualitative
• The name of a coffee shop is not expressed as a number
and we cannot perform math on the name of the shop.
2. Revenue
• Revenue – Quantitative
3. Zip code
• This one is tricky!
• Zip code – Qualitative
• A zip code is always represented using numbers, but what makes it qualitative is that it does not fit the second part of the definition of quantitative: we cannot perform basic mathematical operations on a zip code.
• Adding together two zip codes gives a nonsensical result.
4. Average monthly customers
• Average monthly customers – Quantitative
5. Country of coffee origin
• Country of coffee origin – Qualitative
Example 2: World alcohol consumption data

• Classification of attributes as Quantitative or Qualitative:
• country: Qualitative
• beer_servings: Quantitative
• spirit_servings: Quantitative
• wine_servings: Quantitative
• total_litres_of_pure_alcohol: Quantitative
• continent: Qualitative
Quantitative data can be broken down, one step further, into discrete and continuous quantities.
• Continuous: It can take any value in an interval, e.g. in [1, 10] the values can be 1, 1.3, 2.46, 5.378, ... Continuous data is measured. Example: temperature (22.6 C, 83.46 F).
• Discrete: It can only take specific values, with no decimal values, e.g. 1, 2, 3, 4, 5, ... Discrete data is counted. Example: rolling a die (1, 2, 3, 4, 5, 6).

Examples: The speed of a car – Continuous
The number of cats in a house – Discrete
Your weight – Continuous
The number of students in a class – Discrete
The number of books in a shelf – Discrete
The height of a person – Continuous
Exact age - Continuous
Four Levels of Data
• It is generally understood that a specific characteristic
(feature/column) of structured data can be broken
down into one of four levels of data. The levels are:
 The nominal level
 The ordinal level
 The interval level
 The ratio level
The nominal level
• The first level of data, the nominal level, consists of
data that is described purely by name or category
with no rank order.
• Basic examples include gender, nationality, species, the name of a student, color of hair, etc.
• No rank order means that we cannot say one hair color ranks above another.
• Nominal data are not described by numbers and are therefore qualitative.
Mathematical operations allowed
• We cannot perform mathematics on the nominal level of data except basic equality and set membership checks. For example: being a tech entrepreneur is the same as being in the tech industry, but not vice versa.
Measures of center
• A measure of center is a number that describes what the data tends toward.
• It is sometimes referred to as the balance point of the data.
• Common examples include the mean, median, and mode.
• In order to find the center of nominal data, we generally turn to the
mode (the most common element) of the dataset.
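For example, a minimal sketch of finding the mode of a nominal column (the hair-color values are made up for illustration):

```python
from collections import Counter

hair_colors = ["black", "brown", "black", "red", "brown", "black"]

# The measure of center for nominal data is the mode: the most common category.
mode, count = Counter(hair_colors).most_common(1)[0]
print(mode, count)  # black 3
```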
2. The Ordinal level
• Ordinal data is categorical in nature, but carries an inherent order or rank, where each option has a different value.
Examples:
Income levels (Low, Medium, High)
Levels of agreement (Disagree, Neutral, Agree)
Levels of satisfaction (Poor, Average, Good, Excellent)
All these options are still categorical, but they carry different values (a ranking difference), as the sketch below shows.
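A minimal sketch of encoding ordinal data with pandas' ordered categoricals (the satisfaction responses are illustrative):

```python
import pandas as pd

responses = pd.Series(["good", "poor", "excellent", "average", "good"])

# Declare the categories with an explicit rank order.
satisfaction = pd.Categorical(
    responses,
    categories=["poor", "average", "good", "excellent"],
    ordered=True,
)

# Ordering and comparison are now meaningful; arithmetic still is not.
print(satisfaction.min(), satisfaction.max())  # poor excellent
```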
3. Interval level
It is numerical (quantitative) data: data measured in numbers by default.
Examples:
Credit scores ( 300 – 850 )
GMAT scores ( 200 – 800 )
Temperature ( Fahrenheit )
Example of Interval level: Temperature

• If it is 100 degrees Fahrenheit in Texas and 80 degrees Fahrenheit in Istanbul, Turkey, then Texas is 20 degrees warmer than Istanbul.
• Thus, Data at the interval level allows meaningful
subtraction between data points.
Mathematical operations allowed
• We can use all the operations allowed at the nominal and ordinal levels (ordering, comparisons, and so on), along with two other notable operations:
• Addition
• Subtraction
Measures of center
• We can use the mean, median and mode to describe this data.
• Usually the most accurate description of the center of
data would be the arithmetic mean, more commonly
referred to as, simply, "the mean".
• At the previous levels, addition was meaningless; therefore, the mean would have had no meaning.
• It is only at the interval level and above that the
arithmetic mean makes sense.
Example: Temperature of Fridge
• Suppose we look at the temperature of a fridge containing a pharmaceutical company's new vaccine. We measure the temperature every hour, with the following data points (in Fahrenheit):
• 31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26
Finding Measure of Centre
• Let’s find the mean and median of the data:
• temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]
• Mean = 30.73
• Median = 31.0
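These values can be reproduced in a couple of lines of Python:

```python
import statistics

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

print(round(statistics.mean(temps), 2))  # 30.73
print(statistics.median(temps))          # 31
```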
Drawback with Interval Data:
• Data at the interval level does not have a "natural starting point or a natural zero".
• For example, being at zero degrees Celsius does not mean that you have "no temperature".
4. Ratio level
• A ratio variable, has all the properties of an interval
variable, and also has a clear definition of 0.0. When
the variable equals 0.0, there is none of that variable.
Examples of classifying variables at these levels:
Type of chocolate: Nominal data.
Satisfied: Ordinal data.
Age, groceries and choco-bars: Interval/Ratio data.
Chapter 2: Data Transformation



Data transformation
• Data transformation techniques in machine learning are used to transform the data from one form to another while keeping the essence of the data.
• Transformers are functions that aim to produce (approximately) normally distributed data.
• Normal distribution: A normal (Gaussian) distribution is a probability distribution that is symmetric about the mean and follows a bell-shaped curve; almost 99.7% of its values lie within 3 standard deviations of the mean, and its mean, median and mode are equal. The special case with mean 0 and standard deviation 1 is known as the standard normal distribution.
• Skewness: Skewness of a distribution is defined as its lack of symmetry. In a symmetrical distribution, the Mean, Median and Mode are equal. The normal distribution has a skewness of 0. Skewness tells us about the shape of the distribution of our data.
• Skewness is of two types:
Positive skewness: When the tail on the right side of the distribution is longer or
fatter, we say the data is positively skewed. For a positive skewness, mean >
median > mode.
Negative skewness: When the tail on the left side of the distribution is longer or
fatter, we say that the distribution is negatively skewed. For a negative skewness,
mean < median < mode.
• Coefficient of Skewness: Pearson developed two methods to find skewness
in a sample.
1. Pearson's Coefficient of Skewness using the mode:
SK1 = (mean - mode) / sd
where sd is the standard deviation of the sample.
2. Pearson's Coefficient of Skewness using the median:
SK2 = 3 (mean - median) / sd
where sd is the standard deviation of the sample. It is generally used when the mode is unknown.
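A minimal sketch of both coefficients in Python, using a small right-skewed sample made up for illustration:

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 9, 12]  # illustrative right-skewed sample

mean = statistics.mean(data)      # 5.5
median = statistics.median(data)  # 4.5
mode = statistics.mode(data)      # 3
sd = statistics.stdev(data)       # sample standard deviation

sk1 = (mean - mode) / sd        # Pearson's coefficient using the mode
sk2 = 3 * (mean - median) / sd  # Pearson's coefficient using the median

print(round(sk1, 2), round(sk2, 2))  # both positive, i.e. right-skewed
```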
• Kurtosis: It is a measure of the tailedness of a distribution. Tailedness is how
often outliers occur. Tails are the tapering ends on either side of a distribution.
They represent the probability/frequency of values that are extremely high or
low compared to the mean.
• Kurtosis is of three types:
Mesokurtic: Distributions with medium kurtosis (medium tails) are mesokurtic. When the tails of the distribution are similar to those of the normal distribution, it is mesokurtic. The kurtosis of the normal distribution is 3.
Platykurtic: Distributions with low kurtosis (thin tails) are platykurtic. The kurtosis is less than 3, which implies thinner tails and fewer outliers than the normal distribution. In this case, the bell-shaped distribution is broader and its peak lower than in the mesokurtic case.
Leptokurtic: Distributions with high kurtosis (fat tails) are leptokurtic. If the kurtosis is greater than 3, the distribution is leptokurtic. In this case, the tails are heavier than those of the normal distribution, which means many outliers are present in the data. It can be recognized as a thin bell-shaped distribution with a peak higher than the normal distribution.
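For reference, SciPy reports both measures directly. Note that `scipy.stats.kurtosis` returns excess kurtosis by default (normal ≈ 0); passing `fisher=False` gives the convention used above, where the normal distribution has kurtosis 3:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)  # roughly normal sample

print(skew(data))                    # close to 0: symmetric
print(kurtosis(data, fisher=False))  # close to 3: mesokurtic
```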
• If the data is skewed, it means that the process is not centered around
the target value, and may have more variation in one direction than the
other. This can lead to more defects, waste, or customer dissatisfaction.
• If the data is kurtotic, it means that the process has more or less
extreme values than expected, which can indicate instability, outliers, or
non-normality. This can affect the validity of the statistical tests and
assumptions, and reduce the process capability.
• Data transformations are mathematical operations that can be used to
correct skewness and kurtosis in the data, making the data more normal or
symmetrical.
• There are 3 types of data transformation techniques:
Function Transformers
Power Transformers
Quantile Transformers
1) Function Transformers
• Function transformers are data transformation techniques that use a particular function to transform the data toward a normal distribution.
• There are 5 types of function transformers in common use:
a) Log Transform
b) Square Transform
c) Square Root Transform
d) Reciprocal Transform
e) Custom Transform
a) Log Transform
• The log transform is one of the simplest transformations of the data, in which the logarithm is applied to every single observation.
• It transforms right-skewed data into normally distributed data very well.
• The logarithm is defined only for positive values, so we cannot apply the log transformation to 0 or negative numbers.
• The transformation is: x → log(x), where x is an attribute in the dataset.

b) Square Transform
• The square transform is the type of transformer in which the square of the data is considered instead of the original data.
• Here the square function is applied to the data: the square of every single observation is taken as the final transformed data.
• The transformation is: x → x², where x is an attribute in the dataset.
c) Square Root Transform
• In this transform, the square root of the data is calculated.
• This transform compresses large values, so it works well on right-skewed data, efficiently transforming it toward a normal distribution.
• The transformation is: x → √x, where x is an attribute in the dataset.

d) Reciprocal Transform
• In this transform, the reciprocal of every observation is considered.
• This transformation can only be used for non-zero values.
• The transformation is: x → 1/x, where x is an attribute in the dataset.
e) Custom Transform
• The log and square root transforms cannot be used on every dataset, as different data can have different patterns and complexity.
• Based on domain knowledge of the data, custom transformations can be applied to transform the data toward a normal distribution.
• A custom transform can be any function, such as sin, cos, tan, cube, cube root, etc. A sketch of these function transforms follows.
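A minimal sketch of these function transforms with NumPy and scikit-learn's `FunctionTransformer` (the column is illustrative; `np.log1p`, i.e. log(1 + x), is used so that zeros do not break the log transform):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [2.0], [4.0], [10.0], [100.0]])  # right-skewed column

log_t = FunctionTransformer(np.log1p)         # log transform
sqrt_t = FunctionTransformer(np.sqrt)         # square root transform
square_t = FunctionTransformer(np.square)     # square transform
recip_t = FunctionTransformer(np.reciprocal)  # reciprocal (non-zero values only)
custom_t = FunctionTransformer(np.sin)        # custom transform, e.g. sin

print(log_t.fit_transform(X).ravel())   # compressed, less skewed values
print(sqrt_t.fit_transform(X).ravel())
```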
2) Power Transformers
• Power transformation techniques are data transformation techniques in which a power is applied to the data observations to transform the data.
• There are two types of Power Transformation techniques:
a) Box-Cox Transform
b) Yeo-Johnson Transform
a) Box-Cox Transform
• This transform technique is mainly used for transforming the data
observations by applying power to them.
• The power of the data observations is denoted by Lambda(λ).
• There are two cases associated with the power in this transform: λ equal to zero and λ not equal to zero.

• The transformation is:
x' = (x^λ - 1) / λ, if λ ≠ 0
x' = log(x), if λ = 0
where x is an attribute in the dataset and λ is the transformation parameter.

b) Yeo-Johnson Transform
• This transformation technique is also a power transform technique, where a power of the data observations is applied to transform the data.
• It is an advanced form of the Box-Cox transformation technique that can be applied even to zero and negative values of data observations. A sketch of both power transforms follows.
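A minimal sketch with scikit-learn's `PowerTransformer` (the data is illustrative; note that Box-Cox requires strictly positive values, while Yeo-Johnson also accepts zero and negative values):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
X = rng.exponential(scale=2.0, size=(1000, 1))  # positively skewed, all > 0

boxcox = PowerTransformer(method="box-cox")
X_bc = boxcox.fit_transform(X)
print(boxcox.lambdas_)  # fitted lambda for the column

yeojohnson = PowerTransformer(method="yeo-johnson")  # the default method
X_yj = yeojohnson.fit_transform(X - 1.0)  # works even with negative values
```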
3) Quantile Transformers
• Quantile transformation techniques are the type of data transformation
technique that can be applied to numerical data observations.
• A quantile determines how many values in a distribution are above or below
a certain limit.
• Quantile Transformer works better on larger datasets than Power
Transformer.
(Figures: a positively skewed and a negatively skewed distribution before quantile transformation, each approximately normal after quantile transformation.)
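A minimal sketch with scikit-learn's `QuantileTransformer`, mapping a positively skewed column onto a normal distribution (the data and parameters are illustrative):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(7)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))  # positively skewed

qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000)
X_norm = qt.fit_transform(X)

# The transformed column is approximately standard normal.
print(round(float(X_norm.mean()), 2), round(float(X_norm.std()), 2))
```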
Handling imbalanced data
• A balanced dataset is one in which the target variable has equal or nearly
equal observations in all the classes.
• An unbalanced dataset is one in which the target variable has more
observations in one specific class than the others.
• The main problem with imbalanced dataset prediction is how accurately we actually predict both the majority and the minority class.
• Models may exhibit bias toward the majority class, resulting in poor
predictions for the minority class.
• Example: Let's assume we are going to predict disease from an existing dataset where, for every 100 records, only 5 patients are diagnosed with the disease. So, the majority class is 95% with no disease and the minority class is only 5% with the disease. Now, if the model predicts that all 100 out of 100 patients have no disease, it is 95% accurate while never detecting a single case of the disease.

• Approaches to Handle the Imbalanced Dataset Problem:
Resampling (Oversampling and Undersampling): This technique is used to upsample the minority class or downsample the majority class. There are two main ways to perform resampling:
a) Undersampling — Deleting samples from the majority class.
b) Oversampling — Duplicating samples from the minority class.

• After resampling we get a balanced dataset for both majority and minority classes. So, when both classes have a similar number of records present in the dataset, we can assume that the classifier will give equal importance to both classes.
SMOTE (Synthetic Minority Oversampling Technique): This is another technique to oversample the minority class. Simply adding duplicate records of the minority class often does not add any new information to the model. In SMOTE, new instances are synthesized from the existing data: SMOTE looks at minority-class instances, uses k-nearest neighbors to select a random nearest neighbor, and creates a synthetic instance at a random point between them in feature space. A sketch with the imbalanced-learn library follows.
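A minimal sketch of both ideas using the imbalanced-learn library (assumed installed as `imbalanced-learn`; the synthetic dataset mirrors the 95%/5% disease example):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Build a 95% / 5% imbalanced two-class dataset.
X, y = make_classification(
    n_samples=1000, n_features=4, weights=[0.95, 0.05], random_state=0
)
print(Counter(y))  # roughly 950 majority vs 50 minority records

# Synthesize new minority-class instances from nearest neighbors.
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))  # both classes now have the same number of records
```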
Time series data
• A time series is a group of observations on a single entity over time
(regular time intervals).
• It is a type of data that tracks the evolution of a variable over time, such as
sales, stock prices, temperature, heart-rate etc.
• The regular time intervals can be daily, weekly, monthly, quarterly, or
annually, and the data is often represented as a line graph or time-series
plot.
• Time series data is commonly used in fields such as economics, finance,
weather forecasting, and operations management, among others, to analyze
trends and patterns, and to make predictions or forecasts.
• Time series analysis is a machine learning technique that forecasts a target value based solely on the known history of target values. It is a specialized form of regression, known as an auto-regressive model.
• Example of a time-series dataset: a CSV file containing the monthly balance of a user's bank account from January 1973 to September 1977.
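A minimal sketch of loading such a dataset with pandas (the file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical CSV with columns "month" (e.g. 1973-01) and "balance".
series = pd.read_csv(
    "bank_balance.csv",
    parse_dates=["month"],
    index_col="month",
)["balance"]

print(series.head())
series.plot(title="Monthly bank balance")  # time-series line plot (matplotlib)
```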
Components of a time series data:
• A time series can be analyzed in detail by breaking it down into its primary components. This process is called time series decomposition. Time series data is composed of Trend, Seasonality, Cyclic and Residual components.

1) Trend Component: A trend in time series data refers to a long-term upward or downward movement in the data, indicating a general increase or decrease over time. The trend represents the underlying structure of the data, capturing the direction and magnitude of change over a longer period.
There are several types of trends in time series data:
a) Upward Trend: A trend that shows a general increase over time, where
the values of the data tend to rise over time.
b) Downward Trend: A trend that shows a general decrease over time,
where the values of the data tend to decrease over time.
c) Horizontal Trend: A trend that shows no significant change over time,
where the values of the data remain constant over time.
d) Non-linear Trend: A trend that shows a more complex pattern of change
over time, including upward or downward trends that change direction or
magnitude over time.
e) Damped Trend: A trend that shows a gradual decline in the magnitude
of change over time, where the rate of change slows down over time.
2) Seasonality Component: Seasonality in time series data refers to
patterns that repeat over a regular time period, such as a day, a week, a
month, or a year. These patterns arise due to regular events, such as holidays,
weekends, or the changing of seasons. There are several types of seasonality in
time series data, including:
a) Weekly Seasonality: A type of seasonality that repeats over a 7-day period
and is commonly seen in time series data such as sales, energy usage, or
transportation patterns.
b) Monthly Seasonality: A type of seasonality that repeats over a 30- or 31-
day period and is commonly seen in time series data such as sales or weather
patterns.
c) Annual Seasonality: A type of seasonality that repeats over a 365- or 366-
day period and is commonly seen in time series data such as sales, agriculture,
or tourism patterns.
d) Holiday Seasonality: A type of seasonality that is caused by special events
such as holidays, festivals, or sporting events and is commonly seen in time
series data such as sales, traffic, or entertainment patterns.
3) Cycles Component: Cyclicity in time series data refers to the repeated
patterns or periodic fluctuations that occur in the data over a specific
time interval. It can be due to various factors such as seasonality (daily,
weekly, monthly, yearly), trends, and other underlying patterns. Cycles
usually occur along a large time interval, and the lengths of time between
successive peaks or troughs of a cycle are not necessarily the same.
4) Irregular Component: The component left after explaining the trend,
seasonal and cyclical movements is the irregular/residual component.
Irregularities in time series data refer to unexpected or unusual fluctuations in
the data that do not follow the general pattern of the data. These fluctuations
can occur for various reasons, such as measurement errors, unexpected
events, or other sources of noise.
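A minimal sketch of decomposition with statsmodels' `seasonal_decompose`, reusing the hypothetical monthly bank-balance series from earlier (`period=12` reflects annual seasonality in monthly data):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series with a DatetimeIndex, as loaded earlier.
series = pd.read_csv(
    "bank_balance.csv", parse_dates=["month"], index_col="month"
)["balance"]

result = seasonal_decompose(series, model="additive", period=12)

trend = result.trend        # long-term movement
seasonal = result.seasonal  # repeating within-year pattern
residual = result.resid     # irregular component left over
result.plot()               # one panel per component
```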
