
Artificial Intelligence

Lecture 4 - Machine learning pipeline

1
Contents
Data collection
- Sources for data collection
- High-quality datasets
- Crawling techniques
Preprocessing:
- Encode data
- Detect and handle missing data
- Detect and handle outlier data
2
Motivation
In a typical Machine Learning process, what is the most important part?

3
The importance of data
- In the past, only a small amount of data was available → people focused on
traditional machine learning algorithms (e.g. KNN, K-means, Linear
Regression, Decision Trees, …) to make use of that data.
- When Big Data became available, modern deep learning algorithms were developed
to take advantage of the large amount of data (and of the available computational
power).
- Now that deep learning algorithms have reached some limitations, people are
turning their attention back to data, i.e. collecting higher-quality data
→ How to collect high-quality data?

4
Data collection
Data can come not only from your own organization; it can also be licensed
from a third-party data collection agency or consumer service, or created from
scratch

5
Internal Data Collection: Digital
Most organizations have their own information systems, e.g. Sales,
Manufacturing, CRM, HRM, …
These information systems generate data and keep it in many forms (from structured
databases to unstructured log files).

Organizations may save all the data for future use, or provide it to data scientists to
analyze and continuously build intelligent systems
6
Types of data storage in an organization
Structured databases:
- Excel format
- Database Management Systems (DBMS): SQL Server, MySQL, Oracle, …
Unstructured databases:
- NoSQL databases, e.g. MongoDB
- Data warehouses, data lakes

7
Internal Data Collection: Digital
What if you want to build an AI system but have not made data collection a
priority? Even in this scenario you may already have more data than you think
→ do an internal data exploration and come up with a data collection strategy
- The first part of data exploration consists of identifying and listing all existing
digital systems used in your organization
- this can be data explicitly stored by the system (e.g., customer records) or just
system usage data saved in log files
- Ask whether this data can easily be accessed or exported so that it can be used within
other systems (e.g., the AI system you are building)
Some possible access methods:
8
Application Programming Interface (API)
Best option: the existing system provides a well-documented API to access the data
- APIs are preferable since they are secure, easily and programmatically
accessed, and provide real-time access to data
- APIs can provide convenient capabilities on top of the raw data being stored,
such as roll-up statistics or other derived data
E.g. Google API, Facebook API, …
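As an illustration only (the endpoint, parameters, and API key below are placeholders, not a real service), pulling records from a REST API in Python could look like this:

```python
# Hypothetical sketch: fetching records from a REST API with the `requests` library.
import requests

API_URL = "https://api.example.com/v1/customers"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"                           # obtained from the system's administrator

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"updated_since": "2024-01-01", "page_size": 100},  # assumed query parameters
    timeout=30,
)
response.raise_for_status()      # fail loudly on HTTP errors
records = response.json()        # assuming the API returns a JSON list of records
print(f"Fetched {len(records)} records")
```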

9
File Export
If a system does not have a convenient API for exporting data, it might have a file
export capability. This capability typically lives in the system's user interface and allows
an end user to export data in a standardized format such as a comma-separated
values (CSV) file
- Only certain data may be exportable
- Data with a lot of internal structure may be harder to export in a single file
- Exports may have to be performed manually and periodically
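A minimal sketch of loading such an export with pandas (the file name is hypothetical):

```python
# Sketch: loading a manually exported CSV file with pandas.
import pandas as pd

df = pd.read_csv("sales_export_2024_01.csv")  # hypothetical export file
print(df.shape)    # number of rows and columns
print(df.dtypes)   # check that columns were parsed with sensible types
print(df.head())
```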

10
Direct Database Connection
If the system does not provide any supported data exporting capabilities and it is
infeasible or not cost-effective to add one, you could instead connect directly
to the system's internal database if one exists.
● This involves setting up a secure connection to the database that the system
uses in order to directly access database tables
● An important point to keep in mind is that you should access this data in a
read-only fashion so you don't inadvertently affect the application.
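A minimal sketch of such a read-only connection with SQLAlchemy and pandas; the connection string, credentials, table name, and the Postgres backend are assumptions, and a dedicated read-only database user is assumed to exist:

```python
# Sketch: reading a table directly from the application's database without writing to it.
import pandas as pd
from sqlalchemy import create_engine

# Assumed PostgreSQL backend; the read-only user ensures the application cannot be affected.
engine = create_engine("postgresql://readonly_user:password@db-host:5432/appdb")

df = pd.read_sql("SELECT * FROM customers LIMIT 1000", engine)  # SELECT only, no writes
print(df.head())
```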

11
Internal Data Collection: Physical
Identify existing manual processes that produce data in physical form → How to digitize it?
Data in its physical form must be digitized before it can be used to create an AI
system, but digitization is not a trivial task → need to consider:
- The amount of time and cost required to digitize the data
- Is it sufficient to start collecting digital data from now on?
If the physical data is valuable
→ you should start replacing the manual process with a digital system

12
Data Collection via Licensing
If you have not been collecting data, or you require data that you are unable to
collect internally → data licensing
- Companies whose business model is selling data
(https://onesignal.com/pricing)
- Free datasets (may not fit your purpose completely)
- https://www.kaggle.com/
- https://project-awesome.org/awesomedata/awesome-public-datasets
- https://www.data.gov/
- Data licensing companies (YCharts, Thomson Reuters)
- Other sources depending on your domain (geospatial, agriculture,
transportation, …) → may require some dialogue and discussion

13
Datasets from Kaggle
https://www.kaggle.com/datasets
An online community of data scientists and machine learning practitioners
Allows users to
- find and publish data sets,
- explore and build models in a web-based data-science environment,
- work with other data scientists and machine learning engineers,
- and enter competitions to solve data science challenges

14
Crawl data from public websites
A web crawler, spider, or search engine bot downloads and indexes content from
all over the Internet. The goal of such a bot is to learn what (almost) every
webpage on the web is about, so that the information can be retrieved when it's
needed. They're called "web crawlers" because crawling is the technical term for
automatically accessing a website and obtaining data via a software program.
Practice:
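A minimal crawling sketch with requests and BeautifulSoup; the URL and the CSS selectors are placeholders and must be adapted to the actual HTML of the target site (and the site's robots.txt and terms of use should be respected):

```python
# Sketch: crawl one listing page and write the extracted fields to a CSV file.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/listings"          # placeholder page to crawl
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select("div.listing"):       # hypothetical listing element
    title = item.select_one("h3").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    rows.append({"title": title, "price": price})

with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
```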

15
Exercise
Prepare data for these problems; crawl at least 1,000 data samples for each task
and write the results to CSV files:
1. Predict second-hand car prices (https://bonbanh.com/oto)
2. News article classification (https://vnexpress.net/kinh-doanh)

16
Data Collection via Licensing
Notes:
- Data licensing pricing, especially in larger deals, should be negotiable based on
how you are planning to use the data. Scale-based pricing might be
advantageous to you and also allow the licensing company greater upside based
on your success.
- Avoid becoming beholden to the licensing company for their data:
- The company may suddenly stop its service
- Use the licensed data only to bootstrap your system
- E.g.: if you are the CTO of Grab, do you use the Google Maps API, or develop your own maps?
→ Use the licensed data to build the initial system, but start collecting your own data
from your AI system for later use

17
Data Collection via Crowdsourcing
Crowdsourcing platforms consist of two different types of users:
- Users who have questions that need to be answered
- Post your unlabelled data to a crowdsourcing market
- Users who will answer these questions
- Monetarily incentivized to answer questions quickly and with high accuracy
- Typically, the same question is asked to multiple people for consistency. If there are
discrepancies for a single question, perhaps the image is ambiguous. If one particular user
has many discrepancies, this might mean the user is answering randomly and should be
removed from the job or that the user did not understand the prompt

Some crowdsourcing platforms: Figure Eight, Mechanical Turk, Microworkers, …
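A small sketch of how redundant crowd answers might be aggregated by majority vote, with the agreement ratio used to flag ambiguous items; the answer format (one item, annotator, label triple per answer) and the toy values are assumptions:

```python
# Sketch: majority-vote aggregation of redundant crowd labels.
from collections import Counter, defaultdict

answers = [  # assumed format: (item_id, annotator, label)
    ("img_001", "worker_a", "cat"), ("img_001", "worker_b", "cat"), ("img_001", "worker_c", "dog"),
    ("img_002", "worker_a", "dog"), ("img_002", "worker_b", "dog"), ("img_002", "worker_c", "dog"),
]

by_item = defaultdict(list)
for item_id, worker, label in answers:
    by_item[item_id].append((worker, label))

for item_id, votes in by_item.items():
    counts = Counter(label for _, label in votes)
    label, n = counts.most_common(1)[0]
    agreement = n / len(votes)   # low agreement -> possibly ambiguous item
    print(item_id, label, f"agreement={agreement:.2f}")
```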

18
Leveraging the Power of Existing Systems
There are already a number of intelligent systems available that can be used to
generate a dataset, e.g. Google, Flickr

Case study: use Google Images with a search of “daytime pictures” and then use
a browser extension “Image Downloader” to download all the images on that page

19
Your tasks
Try your best to find as much data as possible for your project from:
- Licensing
- Kaggle-like websites
- Data licensing companies
- Other sources based on your domain
- Crowdsourcing
- Find out which platforms of this kind are available in Vietnam
- Existing intelligent systems

20
How to know your data is high-quality
Data that is used to build AI systems is typically referred to as ground truth—that
is, the truth that underpins the knowledge in an AI system.
Supervised machine learning models are trained on labeled data that are
considered “ground truth” for the model to identify patterns that predict those
labels on new data.
→ A high-quality dataset has labels that are close to the ground truth of the real-world
scenario

21
Ground truth examples
YouTube wants to improve its speech-to-text system to automatically generate subtitles
for Vietnamese videos.
Watch and listen carefully to this video, and try to write a subtitle along with the song in
the video:
- What is your own subtitle? → Label
- What is generated by the current speech-to-text system? → Predicted
- What is the correct answer (the lyrics provided by the author)? → Ground truth
Does your label match the ground truth exactly?
- Yes → Your dataset is high quality
- No → You have to find another annotator
22
Ground truth
Good ground truth typically comes from or is produced by organizational systems
already in use.
If no existing data is available, subject matter experts (SMEs) can manually create
this ground truth
There are typically two key methods to build the distribution of your ground truth
when building an AI model:
- balanced number of examples for each class you want to recognize
- proportionately representative of how your system will be used

23
Data Preprocessing
- Encode data
- Detect and handle missing data
- Detect and handle outlier data

24
Encode data
Data encoding in machine learning refers to the process of converting categorical
or non-numeric data into a numerical format that can be utilized by machine
learning algorithms
Encoding data is important:
● Data Representation: How data is represented impacts model performance.
● Algorithms Expect Numerical Input: Most machine learning algorithms expect
numerical input.
● Avoiding Bias: Proper encoding helps avoid bias in models.

25
Types of encoding data
● Label Encoding
● One-Hot Encoding
● Ordinal Encoding
● Frequency Encoding
● Embedding
https://www.linkedin.com/pulse/useful-encoding-techniques-machine-learning-heba-al-haddad/

26
Label Encoding
Label Encoding is a straightforward technique that assigns a unique integer value
to each category.

This encoding is suitable for ordinal variables, where the categories have a
meaningful order. However, one must be cautious when using Label Encoding with
algorithms that interpret the encoded integers as ordinal values, as it may
introduce unintended bias.
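A minimal sketch with scikit-learn's LabelEncoder on a toy column; note that the assigned integers follow the alphabetical order of the categories, not their real-world order, which illustrates the caution above:

```python
# Sketch: Label Encoding with scikit-learn on a toy list of categories.
from sklearn.preprocessing import LabelEncoder

sizes = ["small", "medium", "large", "medium", "small"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(sizes)
print(encoder.classes_)  # alphabetical order: 'large' < 'medium' < 'small'
print(encoded)           # integer codes based on that alphabetical order
```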
27
One-Hot Encoding
One-Hot Encoding is a popular technique used to convert categorical variables
into a binary representation. Each category is transformed into a binary vector,
with a '1' indicating the presence of the category and '0' for all other categories.

One-Hot Encoding is suitable for nominal variables with no inherent ordinal
relationship between categories.
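A minimal sketch using pandas.get_dummies on a toy nominal column:

```python
# Sketch: One-Hot Encoding a nominal column with pandas.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)  # one binary column per category (0/1 or True/False depending on pandas version)
```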

28
Ordinal Encoding
Ordinal Encoding is ideal for ordinal variables, where categories have a
meaningful order but lack evenly spaced numerical representations. Unlike Label
Encoding, Ordinal Encoding uses user-defined mappings to assign numerical
values to categories, preserving the ordinal relationship between them.
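A minimal sketch using a user-defined mapping with pandas; the category order chosen here is an assumption for illustration:

```python
# Sketch: Ordinal Encoding with an explicit, user-defined order.
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
size_order = {"small": 1, "medium": 2, "large": 3}  # meaningful order defined by the user
df["size_encoded"] = df["size"].map(size_order)
print(df)
```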

29
Frequency Encoding
Frequency Encoding, also known as Count Encoding, replaces each category with
its corresponding frequency or count in the dataset. This technique is useful when
dealing with high cardinality categorical variables, as it captures the importance of
each category based on its prevalence in the data.
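A minimal sketch with pandas, replacing each category by its count in the dataset:

```python
# Sketch: Frequency (Count) Encoding with pandas.
import pandas as pd

df = pd.DataFrame({"city": ["Hanoi", "Hue", "Hanoi", "Da Nang", "Hanoi", "Hue"]})
counts = df["city"].value_counts()      # number of occurrences of each category
df["city_freq"] = df["city"].map(counts)
print(df)
```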

30
Embedding
Embedding is a technique commonly used in Natural Language Processing (NLP)
tasks. It maps categorical variables to dense vectors in a continuous space,
capturing semantic relationships between categories. Embeddings allow the
model to learn the context and relationships within the data, making it powerful for
certain machine learning applications.
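A minimal sketch of a learnable embedding layer in PyTorch; the vocabulary size and embedding dimension are arbitrary choices for illustration:

```python
# Sketch: mapping already label-encoded category indices to dense, learnable vectors.
import torch
import torch.nn as nn

num_categories = 1000   # e.g. vocabulary size (assumed)
embedding_dim = 16      # size of each dense vector (assumed)

embedding = nn.Embedding(num_categories, embedding_dim)
indices = torch.tensor([3, 42, 917])   # category indices
vectors = embedding(indices)           # shape: (3, 16); trained together with the model
print(vectors.shape)
```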

31
Detect and handle missing data
Missing data is defined as values that are not stored (or not present) for
some variables in the given dataset.

Reasons for missing data:
● Past data might get corrupted due to improper maintenance.
● Observations are not recorded for certain fields for various reasons, for example
a failure in recording the values due to human error.
● The user intentionally did not provide the values.
32
Detect and handle missing data
It is important to handle missing values appropriately.
● Many machine learning algorithms fail if the dataset contains missing values.
However, algorithms like K-Nearest Neighbors and Naive Bayes can handle data with
missing values.
● You may end up building a biased machine learning model, leading to
incorrect results, if the missing values are not handled properly.
● Missing data can lead to a lack of precision in the statistical analysis.
Check for missing values in Python: isnull().sum()
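A minimal sketch on a toy DataFrame:

```python
# Sketch: counting missing values per column with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
})
print(df.isnull().sum())    # number of missing values in each column
print(df.isnull().mean())   # fraction of missing values in each column
```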

33
Handling Missing Values
● Deleting the Missing values
○ Deleting the entire row (listwise deletion)
○ Deleting the entire column
● Imputing the Missing Values
○ Replacing with an arbitrary value
○ Replacing with the mean
○ Replacing with the mode
○ Replacing with the median
○ Replacing with the next or previous value (forward/backward fill), …

https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/
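A minimal sketch of a few of these strategies with pandas on a toy DataFrame:

```python
# Sketch: deletion and simple imputation strategies for missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "city": ["Hanoi", "Hue", None, "Hanoi"]})

dropped_rows = df.dropna()                                 # listwise deletion
df["age_mean"] = df["age"].fillna(df["age"].mean())        # replace with the mean
df["age_median"] = df["age"].fillna(df["age"].median())    # replace with the median
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])  # replace with the mode
df["age_ffill"] = df["age"].ffill()                        # replace with the previous value
print(df)
```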

34
Detect and handle outlier data
Outliers are those data points that are significantly different from the rest of the
dataset. They are often abnormal observations that skew the data distribution, and
arise due to inconsistent data entry, or erroneous observations.
For example, in a dataset of house prices, if you find a few houses priced at
around $1.5 million, much higher than the median house price, they are likely
outliers. However, if the dataset contains a significantly large number of houses
priced at $1 million and above, they may be indicative of an increasing trend in
house prices, so it would be incorrect to label them all as outliers. In this case, you
need some knowledge of the real estate domain.

35
Detect outlier data
Detect Outliers Using Standard Deviation
Detect Outliers Using the Z-Score
Detect Outliers Using the Interquartile Range (IQR)
Detect Outliers Using Percentile
https://www.freecodecamp.org/news/how-to-detect-outliers-in-machine-learning/
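A minimal sketch of the IQR rule on a toy column; the 1.5 × IQR threshold is the common convention:

```python
# Sketch: flagging outliers with the interquartile range (IQR) rule.
import pandas as pd

prices = pd.Series([120, 135, 150, 142, 128, 131, 900])  # toy data; 900 is the odd one out

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers)  # points outside [q1 - 1.5*IQR, q3 + 1.5*IQR]
```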

37
Homework

38
