
AI for Managers – S.

Prof. Soumen Manna


Data Curation and Data Governance

Building an intelligent system has two main components:

1. The collection of algorithms that build the machine learning models underlying the technology.

2. The data that is fed into these algorithms.


Data Curation

• Historically, the field of machine learning has focused its research on improving the algorithms to produce increasingly better models over time.

• Recently, however, the algorithms have improved to a point where they are no longer the bottleneck in the race for improved AI technology.

• These algorithms are now capable of consuming vast amounts of data and storing that intelligence in complex internal structures.

• Today, the race for improved AI systems has turned its focus to
improvements in data, both in quality and volume.

• Data that is used to build AI systems is typically referred to as ground truth—that is, the truth that underpins the knowledge in an AI system.

• This ground truth has to be genuine and representative of the real population. If no existing data is available, subject matter experts (SMEs) can manually create this ground truth, though it would not necessarily be as accurate.
Data Preprocessing
1. Data Cleaning:
Raw data often contains irrelevant, missing, or erroneous parts. Data cleaning handles these problems, chiefly missing data and noisy data.

(a) Missing Data Cleaning:

I. Ignore the rows or columns:
This approach is suitable only when a particular row or column has more than 50% missing values, so that dropping it loses little information.
II. Fill the missing values:
There are various ways, sketched in code after this list:

• Mean imputation
• Median imputation
• Mode imputation
• KNN imputation
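Below is a minimal Python sketch of these strategies using pandas and scikit-learn's KNNImputer. The column names and sample values are illustrative assumptions, not course data.

import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with missing entries (None becomes NaN).
df = pd.DataFrame({
    "age":    [25, None, 47, 31, None],
    "income": [50000, 62000, None, 58000, 61000],
})

# I. Ignore rows or columns: drop any column with more than 50% missing
#    values (thresh = minimum number of non-missing values to keep it).
df = df.dropna(axis=1, thresh=int(0.5 * len(df)) + 1)

# II. Fill the missing values:
df["age_mean"]   = df["age"].fillna(df["age"].mean())     # mean imputation
df["age_median"] = df["age"].fillna(df["age"].median())   # median imputation
df["age_mode"]   = df["age"].fillna(df["age"].mode()[0])  # mode imputation

# KNN imputation: estimate each missing value from the k most similar rows.
knn = KNNImputer(n_neighbors=2)
df[["age", "income"]] = knn.fit_transform(df[["age", "income"]])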

(b) Noisy Data:
Noisy data is meaningless data that machines cannot interpret. It can arise from faulty data collection, data entry errors, etc. It can be handled in the following ways (a brief sketch follows the list):

I. Outlier removal
II. Binning method
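A minimal Python sketch of both methods. The 1.5 × IQR rule and equal-width bins are common conventions, and the sample values are illustrative assumptions.

import pandas as pd

values = pd.Series([4, 5, 5, 6, 7, 6, 120, 5, 6, 4])  # 120 is a likely entry error

# I. Outlier removal: keep only values within 1.5 * IQR of the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
cleaned = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]

# II. Binning method: partition values into equal-width bins, then smooth
#     each value by replacing it with the mean of its bin.
bins = pd.cut(cleaned, bins=3)
smoothed = cleaned.groupby(bins).transform("mean")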
2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves the following methods (a combined sketch follows the list):

• Normalization:
Done to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
• Attribute creation:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
• Discretization:
Done to replace the raw values of a numeric attribute with interval labels or conceptual labels.
• Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
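A minimal Python sketch of all four transformation steps. The column names, age boundaries, and city-to-country lookup are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "salary": [30000, 48000, 75000],
    "bonus":  [3000, 2000, 9000],
    "age":    [23, 41, 58],
    "city":   ["Mumbai", "Paris", "Tokyo"],
})

# Normalization: min-max scaling into [0.0, 1.0], x' = (x - min) / (max - min).
df["salary_norm"] = (df["salary"] - df["salary"].min()) / \
                    (df["salary"].max() - df["salary"].min())

# Attribute creation: construct a new attribute from existing ones.
df["total_pay"] = df["salary"] + df["bonus"]

# Discretization: replace raw numeric ages with interval labels.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])

# Concept hierarchy generation: map a low-level attribute to a higher level.
city_to_country = {"Mumbai": "India", "Paris": "France", "Tokyo": "Japan"}
df["country"] = df["city"].map(city_to_country)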
Questions?
