
GEN101

Introductory
Artificial Intelligence
College of Engineering

Working with Data


Data in Machine Learning

How is data organized?


• Data typically presents as a table
• A single row of data: instance, sample, record, or observation
• A single cell in the row: attribute, factor, or feature
• Datasets are a collection of instances
• Datasets are used to train and test AI algorithms

| Record/Sample/Instance No. | Feature 1 Name | Feature 2 Name |
| Record/Sample/Instance 1 | Feature 1 Value in Sample 1 | Feature 2 Value in Sample 1 |
| Record/Sample/Instance 2 | Feature 1 Value in Sample 2 | Feature 2 Value in Sample 2 |
Data in Machine Learning
Sample Data

The same table structure, shown with concrete data:

| Patient Number (Sample ID) | Blood Pressure (Feature 1) | Glucose Level (Feature 2) | Pre-diabetic (Class/Label/Target) |
| Patient 1 | 120/80 | 90 | No |
| Patient 2 | 130/90 | 120 | Yes |
Data Types

Numerical Data: Any form of measurable data, such as your height, weight, or the cost of your phone bill. You can determine whether a set of data is numerical by attempting to average the numbers or sort them in ascending or descending order.

Categorical Data: Sorted by defining characteristics. Examples: gender, social class, ethnicity, hometown, or the industry you work in. Order is not important.

Ordinal Data: Mixes numerical and categorical data. The data fall into categories, but the numbers placed on the categories have meaning. For example, rating a restaurant on a scale from 0 (lowest) to 4 (highest) stars gives ordinal data. Order is important.

Time Series Data: Consists of data points indexed at specific points in time. More often than not, this data is collected at consistent intervals.

Textual Data: Words, sentences, or paragraphs that can provide some level of insight to your machine learning models. Often grouped together or analyzed using methods such as word frequency, text classification, or sentiment analysis.
Explore Your Data
Know what you’re working with

You need to answer a set of basic questions about the dataset:
• How many observations do I have?
• How many features?
• What are the data types of my features? Are they numeric? Categorical?
• Do I have a target/class variable?
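As a quick sketch, each of these questions maps to a one-liner in pandas; the file name diabetes.csv and the target column name are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical dataset; substitute your own file and column names.
df = pd.read_csv("diabetes.csv")

print(df.shape)    # (number of observations, number of features)
print(df.dtypes)   # data type of each feature (numeric vs. object/categorical)
print(df.head())   # a quick look at the first few records
print("Pre-diabetic" in df.columns)  # do we have a target/class variable?
```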
Collecting Data
1. Collect it yourself

| Manual | Automatic |
| Contains far fewer errors | Cheaper |
| Takes more time to collect | Gathers everything you can find |
| More expensive in general | |
Collecting Data
2. Someone has already collected it for you
• Google’s Dataset Search
• Microsoft Research Open Data
• Amazon Datasets
• UCI Machine Learning Repository
• Government Datasets
Collecting Data
The Size and Quality of a Data Set

“Garbage in, garbage out”

• Your model is only as good as your data
• How do you measure your data set’s quality and improve it?
• How much data do you need to get useful results?
Collecting Data
Why is Collecting a Good Dataset Important?

The Google Translate team has more training data than they can use. Rather than tuning their model, the team has earned bigger wins by using the best features in their data.

“...one of our most impactful quality advances since neural machine translation has been in identifying the best subset of our training data to use”
- Software Engineer, Google Translate

“...most of the times when I tried to manually debug interesting-looking errors they could be traced back to issues with the training data.”
- Software Engineer, Google Translate

“Interesting-looking” errors are typically caused by the data. Faulty data may cause your model to learn the wrong patterns, regardless of what modeling techniques you try.
Collecting Data
The Size of a Dataset

• Your AI should train on at least an order of magnitude more examples than trainable parameters
• Simple AI models on large data sets generally beat fancy models on small data sets
• Google has had great success training simple linear regression models on large data sets
• What counts as “a lot” of data? It depends on the project
• Datasets come in a variety of sizes:

| Data set | Size (number of examples) |
| Iris flower data set | 150 (total set) |
| MovieLens (the 20M data set) | 20,000,263 (total set) |
| Google Gmail SmartReply | 238,000,000 (training set) |
| Google Books Ngram | 468,000,000,000 (total set) |
| Google Translate | Trillions |
Collecting Data
The Quality of a Dataset

• It’s no use having a lot of data if it’s bad data; quality matters, too.
• A dataset is good if it helps you accomplish its intended task.
• However, while collecting data, it’s helpful to have a more concrete definition of quality. Certain aspects of quality tend to correspond to better-performing models:
  • reliability
  • feature representation
  • minimizing skew
Explore Your Data
Why should you do that?

The purpose of exploratory analysis is to “get to know” the dataset:
• You’ll gain valuable hints for Data Cleaning, which can make or break your models
• You’ll think of ideas for Feature Engineering, which can take your models from good to great
• You’ll get a “feel” for the dataset, which will help you communicate results and deliver greater impact
Missing Data
Handle Missing Data – Do NOT Ignore It

Missing data is like a missing puzzle piece. If you drop the observation, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.
Missing Data
Types of Missing Data

Missing categorical data: The best way to handle missing data for categorical features is to simply label them as ’Missing’! You’re essentially adding a new class for the feature. This tells the algorithm that the value was missing. This also gets around the technical requirement for no missing values.

Missing numeric data: For missing numeric data, you should flag and fill the values. Flag the observation with an indicator variable of missingness. Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.
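A minimal pandas sketch of both strategies; the column names category and amount are made up for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical data with missing values in both column types.
df = pd.DataFrame({"category": ["A", None, "B"],
                   "amount": [10.0, np.nan, 25.0]})

# Categorical: label missing values as their own class.
df["category"] = df["category"].fillna("Missing")

# Numeric: flag missingness with an indicator, then fill with 0.
df["amount_missing"] = df["amount"].isna().astype(int)
df["amount"] = df["amount"].fillna(0)

print(df)
```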
Data Interpolation
Missing Data Interpolation
• Interpolation is a mathematical method that fits a function to your data and uses this function to estimate the missing data.
• The simplest type of interpolation is linear interpolation, which places the missing value on the straight line between the value before the gap and the value after it:
$$y = \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1) + y_1$$
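In pandas this is one line; a small sketch with hypothetical values:

```python
import pandas as pd
import numpy as np

# Hypothetical series with a gap: the missing value at index 2
# is filled on the straight line between its neighbors.
s = pd.Series([10.0, 12.0, np.nan, 16.0])
print(s.interpolate(method="linear"))  # fills the NaN with 14.0
```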
Data Cleaning
Better Data > Fancier Algorithms

• Garbage in, garbage out.
• In fact, a simple algorithm can outperform a complex one just because it was given enough high-quality data.
• Quality data beats fancy algorithms.
• Different types of data will require different types of cleaning.
Remove Unwanted Observations
Duplicate Observations

Duplicate observations most frequently arise during data collection, such as when you:
• Combine datasets from multiple places
• Scrape data
• Receive data from clients/other departments
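A quick sketch of de-duplication in pandas, with made-up rows:

```python
import pandas as pd

# Hypothetical dataset with one duplicated record.
df = pd.DataFrame({"years": [1, 2, 2], "salary": [8, 11, 11]})

# Keep the first occurrence of each duplicated row.
df = df.drop_duplicates()
print(df)
```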
Remove Unwanted Observations
Irrelevant Observations

Irrelevant observations are those that don’t actually fit the specific problem you’re trying to solve.
• For example, if you were building a model for Villas only, you wouldn’t want observations for Apartments in there.
• Checking for irrelevant observations before engineering features can save you time and effort.
Fix Structural Errors

• Structural errors are those that arise during measurement, data transfer, or other types of “poor housekeeping.”
• Check for:
  • Typos
  • Inconsistent capitalization
  • Mislabeled classes
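A small pandas sketch of these fixes; the position column and its typo values are illustrative (they mirror the breakout activity later in this deck):

```python
import pandas as pd

# Hypothetical column with a typo and inconsistent capitalization.
df = pd.DataFrame({"position": ["Staff", "staff", "Supervisr", "Manager"]})

# Normalize capitalization, then map known typos to the correct label.
df["position"] = df["position"].str.strip().str.title()
df["position"] = df["position"].replace({"Supervisr": "Supervisor"})
print(df["position"].unique())  # ['Staff' 'Supervisor' 'Manager']
```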
Fix Structural Errors
Example – Before and After Applying the Fix (figures not shown)
Filter Unwanted Outliers

• An outlier is an observation that lies an abnormal distance from other values.
• Examine your data carefully to decide whether to remove an outlier or not.
• You should never remove an outlier just because it’s a “big number.” That big number could be very informative for your model.
• Removing an outlier for a legitimate reason, however, can help your model’s performance.
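One common way to flag candidate outliers is the interquartile-range (IQR) rule, sketched below with hypothetical salaries; the 1.5 multiplier is a conventional choice, not a rule from these slides, and flagged values should still be examined before removal:

```python
import pandas as pd

# Hypothetical salary data with one suspicious value.
s = pd.Series([8, 11, 14, 17, 14, 23, 26, 300])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Flag values outside 1.5 * IQR; inspect before removing them.
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)  # 300 is flagged for review
```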
One Hot Encoding
Definition
• One-hot encoding is used in machine learning as a method to quantify categorical data.
• It splits the column that contains categorical data into as many columns as there are categories present in that column. Each new column contains “1” if the sample belongs to that category and “0” otherwise.

| Fruit | Categorical Value | Price |
| Apple | 1 | 5 |
| Mango | 2 | 10 |
| Apple | 1 | 15 |
| Orange | 3 | 20 |

becomes

| Apple | Mango | Orange | Price |
| 1 | 0 | 0 | 5 |
| 0 | 1 | 0 | 10 |
| 1 | 0 | 0 | 15 |
| 0 | 0 | 1 | 20 |
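A sketch reproducing the fruit example with pandas (note that get_dummies prefixes the new columns with the original column name):

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Mango", "Apple", "Orange"],
                   "Price": [5, 10, 15, 20]})

# One column per category, holding 1/0 membership indicators:
# Fruit_Apple, Fruit_Mango, Fruit_Orange.
encoded = pd.get_dummies(df, columns=["Fruit"], dtype=int)
print(encoded)
```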
Data Augmentation
Enlarge your Dataset
How do I get more data, if I don’t have “more data”?

Data Augmentation
Common Augmentation Methods

1. Mirroring
2. Random Cropping
3. Rotation
4. Shearing
5. Color Shifting
6. Brightness

(Each method is illustrated by transforming the original image; figures not shown.)
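A minimal sketch of these six methods using torchvision transforms; the library choice, the parameter values, and the image.jpg path are assumptions for illustration only:

```python
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),        # 1. mirroring
    transforms.RandomCrop(200),                    # 2. random cropping (image assumed >= 200 px)
    transforms.RandomRotation(degrees=30),         # 3. rotation
    transforms.RandomAffine(degrees=0, shear=15),  # 4. shearing
    transforms.ColorJitter(hue=0.1,                # 5. color shifting
                           brightness=0.3),        # 6. brightness
])

image = Image.open("image.jpg")  # placeholder path
augmented = augment(image)       # a new, randomly transformed sample
```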
Data Distribution/Histogram

(Example histograms of categorical, numerical, and ordinal data; figures not shown.)
Data Transformation

Data Normalization
Definition
• Transform data features to be on a similar scale, which improves the performance and training stability of the machine learning model.
• Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data.
Data Normalization
Normalization Techniques

• Linear Scaling: x' = (x − x_min) / (x_max − x_min); rescales values to the range [0, 1]
• Feature Clipping: caps feature values outside an expected range to the nearest bound
• Log Scaling: x' = log(x); compresses a wide range when a few values are very large
• Z-Score: x' = (x − mean) / std; rescales values to have mean 0 and standard deviation 1
• Linear Scaling vs. Z-Score: linear scaling suits data spread fairly uniformly across its range; Z-score suits data with some extreme values, since it does not squeeze everything into [0, 1]
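A numpy sketch of the four techniques, applied to the Age column from the worked example that follows; the clipping bounds here are illustrative:

```python
import numpy as np

# Ages from the worked example below.
x = np.array([35.0, 45.0, 23.0, 55.0, 65.0, 27.0, 33.0])

linear = (x - x.min()) / (x.max() - x.min())  # linear scaling to [0, 1]
clipped = np.clip(x, 25, 60)                  # feature clipping to illustrative bounds [25, 60]
log_scaled = np.log(x)                        # log scaling
z = (x - x.mean()) / x.std(ddof=1)            # z-score, using the sample std dev

print(linear.round(2))  # [0.29 0.52 0.   0.76 1.   0.1  0.24]
print(z.round(2))       # [-0.35  0.3  -1.14  0.95  1.61 -0.88 -0.49]
```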
Data Normalization
Linear Scaling

| Age | Income | #Credit Cards | Buy Insurance |
| 35 | 15,000 | 1 | No |
| 45 | 32,000 | 3 | No |
| 23 | 7,000 | 4 | No |
| 55 | 45,000 | 3 | Yes |
| 65 | 12,000 | 0 | Yes |
| 27 | 20,000 | 2 | No |
| 33 | 25,000 | 2 | Yes |

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0–4 (4 credit cards)

Normalization Needed!
Linear Scaling
Age min = 23 Age max = 65 Age max - Age min = 42
Age Age – Age min Age – Age min Age'
------------------
Age max - Age min
35 35-23 = 12 12/42 = 0.29 0.29

45 45-23 = 22 22/42 = 0.52 0.52

23 23-23 = 0 00/42 = 0.00 0.00

55 55-23 = 32 32/42 = 0.76 0.76

65 65-23 = 42 42/42 = 1.00 1.00

27 27-23 = 4 04/42 = 0.10 0.10

33 33-23 =10 10/42 = 0.24 0.24


Linear Scaling
Income min = 7k, Income max = 45k, Income max − Income min = 38k

| Income | Income − Income min | (Income − Income min) / (Income max − Income min) | Income' |
| 15,000 | 8,000 | 8,000/38,000 | 0.21 |
| 32,000 | 25,000 | 25,000/38,000 | 0.66 |
| 7,000 | 0 | 0/38,000 | 0.00 |
| 45,000 | 38,000 | 38,000/38,000 | 1.00 |
| 12,000 | 5,000 | 5,000/38,000 | 0.13 |
| 20,000 | 13,000 | 13,000/38,000 | 0.34 |
| 25,000 | 18,000 | 18,000/38,000 | 0.47 |
Linear Scaling
#Credit Cards min = 0, #Credit Cards max = 4, #Credit Cards max − #Credit Cards min = 4

| #Credit Cards | #Credit Cards − min | (#Credit Cards − min) / (max − min) | #Credit Cards' |
| 1 | 1 | 1/4 | 0.25 |
| 3 | 3 | 3/4 | 0.75 |
| 4 | 4 | 4/4 | 1.00 |
| 3 | 3 | 3/4 | 0.75 |
| 0 | 0 | 0/4 | 0.00 |
| 2 | 2 | 2/4 | 0.50 |
| 2 | 2 | 2/4 | 0.50 |
Linear Scaling

| Age' | Income' | #Credit Cards' | Buy Insurance |
| 0.29 | 0.21 | 0.25 | No |
| 0.52 | 0.66 | 0.75 | No |
| 0.00 | 0.00 | 1.00 | No |
| 0.76 | 1.00 | 0.75 | Yes |
| 1.00 | 0.13 | 0.00 | Yes |
| 0.10 | 0.34 | 0.50 | No |
| 0.24 | 0.47 | 0.50 | Yes |

Normalization Completed!
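For reference, scikit-learn’s MinMaxScaler performs the same linear scaling on all columns at once, assuming scikit-learn is available:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "Age": [35, 45, 23, 55, 65, 27, 33],
    "Income": [15000, 32000, 7000, 45000, 12000, 20000, 25000],
    "CreditCards": [1, 3, 4, 3, 0, 2, 2],
})

# (x - min) / (max - min), applied column by column.
scaled = MinMaxScaler().fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns).round(2))
```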


Data Normalization
Clipping

| Room Temperature (C) | People Inside | Humidity (%) | Cooling Needed |
| 23.2 | 30 | 100 | High |
| 24.8 | 150 | 65 | Medium |
| 22.1 | 20 | 67 | Low |
| 23.7 | 30 | 70 | Medium |
| 44.3 | 40 | 80 | High |
| 22.5 | 20 | 69 | Low |
| 14 | -10 | 73 | Low |

Expected ranges:
Room Temperature = 16–27 degrees
People Inside = 0–50
Humidity = 60–85

Normalization Needed!
Data Normalization
Clipping

Values outside the expected ranges are clipped to the nearest bound:

| Room Temperature' (C) | People Inside' | Humidity' (%) | Cooling Needed |
| 23.2 | 30 | 100 → 85 | High |
| 24.8 | 150 → 50 | 65 | Medium |
| 22.1 | 20 | 67 | Low |
| 23.7 | 30 | 70 | Medium |
| 44.3 → 27.0 | 40 | 80 | High |
| 22.5 | 20 | 69 | Low |
| 14.0 → 16.0 | -10 → 0 | 73 | Low |
Data Normalization
Clipping

| Room Temperature' (C) | People Inside' | Humidity' (%) | Cooling Needed |
| 23.2 | 30 | 85 | High |
| 24.8 | 50 | 65 | Medium |
| 22.1 | 20 | 67 | Low |
| 23.7 | 30 | 70 | Medium |
| 27.0 | 40 | 80 | High |
| 22.5 | 20 | 69 | Low |
| 16.0 | 0 | 73 | Low |

Normalization Completed!
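A numpy sketch reproducing the clipping above with np.clip:

```python
import numpy as np

temperature = np.array([23.2, 24.8, 22.1, 23.7, 44.3, 22.5, 14.0])
people = np.array([30, 150, 20, 30, 40, 20, -10])
humidity = np.array([100, 65, 67, 70, 80, 69, 73])

# Clip each feature to its expected range.
print(np.clip(temperature, 16, 27))  # [23.2 24.8 22.1 23.7 27.  22.5 16. ]
print(np.clip(people, 0, 50))        # [30 50 20 30 40 20  0]
print(np.clip(humidity, 60, 85))     # [85 65 67 70 80 69 73]
```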
Data Normalization
Z-Score

| Age | Income | #Credit Cards | Buy Insurance |
| 35 | 15,000 | 1 | No |
| 45 | 32,000 | 3 | No |
| 23 | 7,000 | 4 | No |
| 55 | 45,000 | 3 | Yes |
| 65 | 12,000 | 0 | Yes |
| 27 | 20,000 | 2 | No |
| 33 | 25,000 | 2 | Yes |

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0–4 (4 credit cards)

Normalization Needed!
Z-Score
Age mean = 40.43, Age std dev = 15.31 (sample standard deviation)

| Age | Age − Age mean | (Age − Age mean) / Age std dev | Age' |
| 35 | -5.43 | -0.35 | -0.35 |
| 45 | 4.57 | 0.30 | 0.30 |
| 23 | -17.43 | -1.14 | -1.14 |
| 55 | 14.57 | 0.95 | 0.95 |
| 65 | 24.57 | 1.61 | 1.61 |
| 27 | -13.43 | -0.88 | -0.88 |
| 33 | -7.43 | -0.49 | -0.49 |

Age' mean = 0, Age' std dev = 1
Breakout Session
Clean your Data!
Activity
1. Fill the missing data using interpolation
2. Remove duplicate observations
3. Fix structural errors
4. Remove outliers
5. Apply one-hot encoding

| Years of Experience | Position | Salary (k AED) |
| 1 | Staff | 8 |
| 2 | Staff | 11 |
| 3 | Staff | _ |
| 4 | Staff | 17 |
| 3 | Staff | 14 |
| 6 | Staff | _ |
| 7 | Staff | 26 |
| 7 | Manager | 20 |
| 8 | Supervisor | 30 |
| 9 | Supervisor | 33 |
Clean your Data!
Activity – Solution
1. Fill the missing data using interpolation
2. Remove duplicate observations
3. Fix structural errors: Supervisr, staff
4. Remove outliers
5. Apply one-hot encoding

| Years of Experience | Staff | Supervisor | Manager | Salary (k AED) |
| 1 | 1 | 0 | 0 | 8 |
| 2 | 1 | 0 | 0 | 11 |
| 3 | 1 | 0 | 0 | 14 |
| 4 | 1 | 0 | 0 | 17 |
| 3 | 1 | 0 | 0 | 14 |
| 6 | 1 | 0 | 0 | 23 |
| 7 | 1 | 0 | 0 | 26 |
| 7 | 0 | 0 | 1 | 20 |
| 8 | 0 | 1 | 0 | 30 |
| 9 | 0 | 1 | 0 | 33 |
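A sketch of steps 1 and 5 in pandas, assuming (as the solution table implies) that salaries are interpolated against years of experience; the remaining steps would use drop_duplicates, the typo fixes shown earlier, and an outlier check:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "years": [1, 2, 3, 4, 3, 6, 7, 7, 8, 9],
    "position": ["Staff"] * 7 + ["Manager", "Supervisor", "Supervisor"],
    "salary": [8, 11, np.nan, 17, 14, np.nan, 26, 20, 30, 33],
})

# Step 1: interpolate missing salaries linearly against years of experience.
known = df[df["salary"].notna()].sort_values("years")
missing = df["salary"].isna()
df.loc[missing, "salary"] = np.interp(df.loc[missing, "years"],
                                      known["years"], known["salary"])
# -> fills 14 (3 years) and 23 (6 years), as in the solution table

# Step 5: one-hot encode the position column.
df = pd.get_dummies(df, columns=["position"], dtype=int)
print(df)
```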
Breakout Session
Data Augmentation
Class Activity

Apply each required configuration to the original image (images column not shown):

| Data Augmentation | Configuration Required |
| Translate | Right; Left |
| Scale | Smaller; Bigger |
| Rotate | 45° clockwise; 45° counterclockwise |
| Crop | From top; From bottom; From side |
Data Augmentation
Class Activity – Solution (example augmented images not shown)
Breakout Session
Normalize the Rest of the Data

Data Normalization
Linear Scaling

| Age | Income | #Credit Cards | Buy Insurance |
| 35 | 15,000 | 1 | No |
| 45 | 32,000 | 3 | No |
| 23 | 7,000 | 4 | No |
| 55 | 45,000 | 3 | Yes |
| 65 | 12,000 | 0 | Yes |
| 27 | 20,000 | 2 | No |
| 33 | 25,000 | 2 | Yes |

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0–4 (4 credit cards)

Normalization Needed!