
CIB 3013

TOPIC 4: DATA PRE-PROCESSING

CLO2: Discuss information technology concepts to provide the necessary data and
information for business analytics
CLO3: Apply the appropriate data mining technique to answer business questions
related to large data

Learning Objectives:
• Discuss data quality dimensions (CLO2)
• Discuss how to handle error and missing data (CLO3)
• Discuss how to handle noisy data (CLO3)
• Discuss feature selection methods (CLO3)
TOPIC 4: DATA PRE-PROCESSING
Data Quality Dimensions



WHAT ARE THE 6 DIMENSIONS OF DATA QUALITY?

1. Completeness
This dimension can cover a variety of attributes depending on the entity.
For customer data, it shows the minimum information essential for a productive engagement. For example, if the
customer address includes an optional landmark attribute, data can be considered complete even when the landmark
information is missing.

For products or services, completeness can suggest vital attributes that help customers compare and choose. If a product
description does not include any delivery estimate, it is not complete. Financial products often include historical
performance details for customers to assess alignment with their requirements. Completeness measures whether the data is
sufficient to deliver meaningful inferences and decisions.
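
To make this concrete, here is a minimal sketch of a completeness check that only counts the required attributes, ignoring optional ones such as a landmark; the customer table and column names below are hypothetical.

```python
# A minimal sketch of a completeness check over required attributes only;
# the customer table, column names, and choice of required fields are hypothetical.
import pandas as pd

customers = pd.DataFrame({
    "name":     ["Aina", "Ben", None],
    "address":  ["12 Jalan A", None, "5 Jalan C"],
    "landmark": [None, "near mall", None],   # optional attribute, excluded from the check
})

required = ["name", "address"]
completeness = customers[required].notna().all(axis=1).mean() * 100
print(f"Complete records: {completeness:.0f}%")   # 1 of 3 rows has all required fields
```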

2. Accuracy
Data accuracy is the degree to which data represents the real-world scenario and
conforms to a verifiable source. Accuracy of data ensures that the associated real-world
entities can participate as planned. For example, an accurate phone number for an
employee guarantees that the employee is always reachable; inaccurate birth details,
on the other hand, can deprive the employee of certain benefits.

For example, you can verify customer bank details against a certificate from the bank,
or by processing a transaction. Data accuracy depends heavily on how data is
preserved through its entire journey, and successful data governance can promote this
dimension of data quality.

High data accuracy can power factually correct reporting and trusted business outcomes. Accuracy is very
critical for highly regulated industries such as healthcare and finance.
3. Consistency
This dimension represents whether the same information, stored and used in multiple instances, matches. It is expressed
as the percentage of matched values across various records. Data consistency ensures that analytics correctly capture
and leverage the value of data.

Consistency is difficult to assess and requires planned testing across multiple data sets. For example, if one
enterprise system stores a customer phone number with the international code as a separate field, and another system stores it
with the code prefixed, these formatting inconsistencies can be resolved quickly. However, if the underlying
information itself is inconsistent, resolving it may require verification with another source. For example, if one patient
record puts the date of birth as May 1st, and another record shows it as June 1st, you may first need to assess the
accuracy of the data from both sources.
Data consistency is often associated with data accuracy, and any data set scoring high on both will be a high-quality data set.
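
As a rough illustration, the sketch below compares the same customers' phone numbers held in two hypothetical systems, stripping formatting first so that only true mismatches count; the table layout and values are assumptions.

```python
# A minimal sketch of a consistency check across two hypothetical systems that
# store the same customers' phone numbers in different formats.
import pandas as pd

system_a = pd.DataFrame({"customer_id": [1, 2], "phone": ["+60 12-345 6789", "+60 19-876 5432"]})
system_b = pd.DataFrame({"customer_id": [1, 2], "phone": ["60123456789", "60198765431"]})

def normalize(phone: str) -> str:
    """Keep digits only so purely formatting differences do not count as mismatches."""
    return "".join(ch for ch in phone if ch.isdigit())

merged = system_a.merge(system_b, on="customer_id", suffixes=("_a", "_b"))
matches = merged["phone_a"].map(normalize) == merged["phone_b"].map(normalize)
print(f"Consistency: {matches.mean() * 100:.0f}% of records match")  # 50% in this toy data
```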

4. Validity
This dimension signifies whether attribute values conform to the specific domain or
requirement. For example, ZIP codes are valid if they contain the correct characters for the region. In a
calendar, months are valid if they match the standard global names. Using business rules is a systematic
approach to assessing the validity of data.

Any invalid data will affect the completeness of data. You can define rules to ignore or resolve the invalid
data to ensure completeness.
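
A simple way to express such a business rule in code is shown below; the five-digit ZIP pattern is an assumption chosen only for illustration.

```python
# A minimal sketch of a validity check using a business rule; the 5-digit ZIP
# pattern below is an illustrative assumption.
import re

zip_codes = ["50450", "4OOO1", "81300", "123"]
zip_rule = re.compile(r"^\d{5}$")          # business rule: exactly five digits

valid = [z for z in zip_codes if zip_rule.match(z)]
invalid = [z for z in zip_codes if not zip_rule.match(z)]
print("Valid:", valid)      # ['50450', '81300']
print("Invalid:", invalid)  # ['4OOO1', '123'] -> ignore or resolve per defined rules
```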

5. Uniqueness
This dimension indicates whether each entity appears as a single recorded instance in the data set. Uniqueness is the most
critical dimension for ensuring there is no duplication or overlap. Data uniqueness is measured against all
records within a data set or across data sets. A high uniqueness score assures minimized duplicates or
overlaps, building trust in data and analysis.

Identifying overlaps can help in maintaining uniqueness, while data cleansing and deduplication can
remediate the duplicated records. For example: Unique customer profiles go a long way in offensive and
defensive strategies for customer engagement. Data uniqueness also improves data governance and
speeds up compliance.
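
The sketch below illustrates measuring uniqueness and removing duplicated records; the customer columns and the choice of email as the matching key are assumptions.

```python
# A minimal sketch of measuring uniqueness and deduplicating customer records;
# the column names and matching key are hypothetical.
import pandas as pd

profiles = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    "name":  ["Aina", "Ben", "Aina", "Chen"],
})

uniqueness = profiles["email"].nunique() / len(profiles) * 100
print(f"Uniqueness: {uniqueness:.0f}%")          # 75% -> one duplicated profile

deduplicated = profiles.drop_duplicates(subset="email", keep="first")
print(deduplicated)                              # duplicated profile removed
```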

6. Integrity
As data travels and is transformed across systems, its attribute relationships can be affected. Integrity indicates that the
attributes are maintained correctly, even as data gets stored and used in diverse systems.

Data integrity affects relationships. For example, a customer profile includes the customer name and one or
more customer addresses. In case one customer address loses its integrity at some stage in the data journey, the
related customer profile can become incomplete and invalid.

Measuring data quality dimensions helps you identify the opportunities to improve data quality. With adaptive
rules and a continuous ML-based approach, predictive data quality brings you trusted data to drive real-time,
consistent, innovative business decisions.

PREPROCESSING:

 Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
DATA CLEANING
NOISY DATA:
BINNING EXAMPLE

Question:
 Binning: Let's say you have a dataset of test scores for a class of students.
 85, 92, 78, 65, 88, 95, 72, 68, 90, 82, 75, 98, 60, 88, 93

Solution:
 If you want to create bins for these scores, you might decide to use the following ranges: 0-59, 60-69, 70-79, 80-89, and 90-100.
 Here's how you could bin these scores:
 - 0-59: (none)
 - 60-69: 60, 65, 68
 - 70-79: 72, 75, 78
 - 80-89: 82, 85, 88, 88
 - 90-100: 90, 92, 93, 95, 98
REGRESSION EXAMPLE:

Question:
 Suppose you have a dataset with information about houses, including features such as:
 Size (in square feet)
 Number of bedrooms
 Number of bathrooms
 Distance to the city center
 Neighborhood crime rate
 Objective: Your goal is to build a regression model to predict the selling price of a house based on these features.

Solution:
 After training the regression model, you can use it to predict the selling price of a house with specific characteristics. For instance, if you input a house with 2 bedrooms, 2 bathrooms, a size of 1500 square feet, a distance of 10 miles from the city center, and a low crime rate, the model might predict a selling price based on the learned relationships between these features and house prices in the training data.
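
A minimal sketch of this regression example is shown below, using scikit-learn's LinearRegression; the training values and the predicted figure it produces are made up purely for illustration.

```python
# A minimal sketch of the house-price regression described above, using
# scikit-learn with small made-up training data (all values are illustrative).
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: size (sq ft), bedrooms, bathrooms, distance to city center (miles), crime rate
X_train = np.array([
    [1200, 2, 1,  5, 0.2],
    [1800, 3, 2,  8, 0.1],
    [2400, 4, 3, 12, 0.3],
    [1500, 3, 2, 10, 0.2],
])
y_train = np.array([250_000, 340_000, 420_000, 290_000])  # selling prices

model = LinearRegression().fit(X_train, y_train)

# Predict the price of the house described in the solution:
# 1500 sq ft, 2 bedrooms, 2 bathrooms, 10 miles from the center, low crime rate
new_house = np.array([[1500, 2, 2, 10, 0.1]])
print(f"Predicted selling price: {model.predict(new_house)[0]:,.0f}")
```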
Lab
DATA TRANSFORMATION:

The process of data transformation involves stages such as smoothing, feature construction, aggregation, normalization,
and discretization. Data transformation stages:
 Smoothing: Involves removing noise from data through processes like binning, regression, and grouping

 Feature construction: Feature construction involves building intermediate features from the original descriptors in a dataset

 Aggregation: This process includes summarizing or aggregating the data. It is a very popular technique used in sales.

 Normalization: Normalization helps in scaling data attribute values into a common range, for example min-max scaling to [0, 1] (see the sketch after this list)

 Discretization: Refers to converting huge data sets into smaller ones and converting data attribute values into finite sets of
intervals and associating every interval with specific values
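
The min-max normalization mentioned above can be written in a few lines; this is a minimal sketch with illustrative values.

```python
# A minimal sketch of min-max normalization, scaling an attribute into [0, 1];
# the age values are illustrative only.
import numpy as np

ages = np.array([18, 25, 30, 45, 60], dtype=float)
normalized = (ages - ages.min()) / (ages.max() - ages.min())
print(normalized)   # 18 -> 0.0, 60 -> 1.0, the rest scaled proportionally in between
```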
AGGREGATION:

TYPES OF SAMPLING:

FEATURE SELECTION


 Feature selection refers to the process of selecting the most important variables (features) related to your prediction
variable; in other words, selecting the attributes that contribute most to your model. Some techniques for this
approach can be applied either automatically or manually:
1. Correlation Between Features: This is the most common approach; it drops features that have a
high correlation with other features.
2. Statistical Tests: Another alternative is to use statistical tests to select the features, checking the relationship
of each feature individually with the output variable (a sketch of both approaches follows below).
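
The sketch below illustrates both approaches on a small made-up data set; the column names, the 0.95 correlation threshold, and the choice of k=2 are assumptions.

```python
# A minimal sketch of the two feature selection approaches named above, on a
# small made-up data set (column names, threshold, and k are assumptions).
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

data = pd.DataFrame({
    "size_sqft":  [1200, 1800, 2400, 1500, 2000],
    "size_sqm":   [111, 167, 223, 139, 186],     # nearly a copy of size_sqft
    "bedrooms":   [2, 3, 4, 3, 3],
    "crime_rate": [0.2, 0.1, 0.3, 0.2, 0.15],
})
price = pd.Series([250_000, 340_000, 420_000, 290_000, 360_000])

# 1. Correlation between features: drop one of any pair correlated above 0.95
corr = data.corr().abs()
to_drop = [col for i, col in enumerate(corr.columns)
           if any(corr.iloc[i, :i] > 0.95)]
print("Dropped (highly correlated):", to_drop)   # size_sqm duplicates size_sqft

# 2. Statistical tests: keep the k features most related to the output variable
selector = SelectKBest(score_func=f_regression, k=2).fit(data.drop(columns=to_drop), price)
print("Selected:", data.drop(columns=to_drop).columns[selector.get_support()].tolist())
```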
STEPS IN DATA PREPROCESSING



• Step 1: Importing the Dataset
• Step 2: Handling the Missing Data
• Step 3: Encoding Categorical Data
• Visualize Output
• Step 4: Splitting the Dataset into the Training and Test sets
• Training set
• Test set
• Step 5: Modelling
• Step 6: Evaluate
• Step 7: Interpret
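
A minimal sketch of Steps 1 to 4 is shown below; the file name houses.csv and the column names city and price are hypothetical.

```python
# A minimal sketch of Steps 1-4 above (import, handle missing data, encode
# categorical data, train/test split); file name and columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: Importing the dataset (hypothetical CSV with a categorical 'city' column)
df = pd.read_csv("houses.csv")

# Step 2: Handling the missing data (here: fill numeric gaps with the column mean)
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Step 3: Encoding categorical data (one-hot encoding)
df = pd.get_dummies(df, columns=["city"])

# Step 4: Splitting the dataset into training and test sets
X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```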
