
KONERU LAKSHMAIAH EDUCATION FOUNDATION

(Deemed to be University estd. u/s 3 of the UGC Act, 1956)

(NAAC Accredited “A++” Grade University)

Green Fields, Guntur District, A.P., India – 522502

B.Tech. IIInd Year


PROGRAM
A.Y.2023-24
Even Semester
22CS2227 - DATA ANALYTICS AND VISUALIZATION
CO 2

Session 25: Data Discretization

1. Course Description
Data Discretization Techniques in Data Science is an advanced course designed to provide
students with a comprehensive understanding of the fundamental concepts and methodologies
related to data discretization. In this course, students will delve into the principles, techniques, and
applications of data discretization, a critical step in the data preprocessing pipeline that enables
efficient and effective analysis of large datasets.
Prerequisites include basic knowledge of data structures and algorithms, familiarity with
programming fundamentals in languages such as Python or R, and an understanding of fundamental
concepts in statistics and data analysis.

2. Aim

The aim of data discretization is to transform continuous or numerical data into discrete or categorical form,
facilitating the analysis and processing of data by reducing complexity and noise while preserving the
underlying patterns and relationships within the data. This process allows for more efficient computation and
analysis, particularly in the context of machine learning algorithms and data mining tasks.

3. Instructional Objectives (Course Objectives)

• Gain a comprehensive understanding of data discretization concepts and
methodologies.
• Develop proficiency in implementing various data discretization techniques using
programming languages and data preprocessing tools.
• Evaluate the impact of data discretization on data quality, model performance, and
data analysis outcomes.
• Apply data discretization techniques to real-world datasets and analyze their
implications on data mining tasks.
4. Learning Outcomes (Course Outcome)

By the end of the course, students will be able to proficiently implement diverse data
discretization techniques using programming languages and critically evaluate their
impact on data quality.
5. Module Description (CO-2 Description)

Applying various data discretization methods to explore the data.

6. Session Introduction
In this session, we will explore the fundamental concepts and methodologies related to
data discretization, emphasizing the significance of this preprocessing step in enabling
efficient data analysis. Through interactive discussions and practical demonstrations, we
aim to deepen your understanding of various discretization methods and their implications
for data quality and analysis outcomes. Get ready to delve into the intricacies of data
discretization and its role in shaping effective data science pipelines.

7. Session description

Data discretization is defined as a process of converting continuous data attribute values into
a finite set of intervals with minimal loss of information and associating with each interval
some specific data value or conceptual labels.

Data discretization is considered a data reduction mechanism because it reduces data
from large domains of numeric values to a subset of categorical values.

Example: age can be transformed into intervals (0-10, 11-20, ...) or into conceptual labels
such as youth, adult, and senior.
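
The age example can be sketched in a few lines of Python. This is a minimal illustration, not
a prescribed method: the cut points 18 and 60 and the function names are assumptions made for
the example, and ages are assumed to be whole numbers. In practice, pandas offers `cut` for
the same purpose.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, chosen only for illustration.
AGE_EDGES = [18, 60]                      # boundaries between consecutive groups
AGE_LABELS = ["youth", "adult", "senior"]


def age_to_label(age):
    """Map a numeric age to a conceptual label."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


def age_to_interval(age, width=10):
    """Map a whole-number age to an interval such as '0-10' or '11-20'."""
    if age <= width:
        return f"0-{width}"
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


ages = [7, 19, 35, 64]
print([age_to_interval(a) for a in ages])
print([age_to_label(a) for a in ages])
```

Both functions lose the exact age but keep the information that matters at the chosen level
of detail, which is the trade-off discretization makes.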

Fits the problem statement


Continuous data (such as weight) is often easier to understand when it is divided and stored
in meaningful categories or groups. For example, we can divide a continuous variable,
weight, into the following groups:
Under 100 lbs (light), between 140–160 lbs (mid), and over 200 lbs (heavy).

We would consider this structure useful if there is no objective difference between values
falling under the same weight class.
In our example, weights of 85 lbs and 56 lbs convey the same information (the object is
light). Therefore, discretization makes our data easier to understand when it fits the
problem statement.

Methods of Data Discretization


1. Binning
2. Histogram analysis
3. Cluster analysis
4. Decision tree analysis
5. Correlation analysis

1. Binning:
• Binning is a top-down splitting technique based on a specified number of bins.
• The main challenge in this discretization is choosing the number of intervals or bins
and deciding on their width.
• Binning methods smooth a sorted data value by consulting its “neighborhood”, that is,
the values around it. The sorted values are distributed into several “buckets” or bins.
Because binning methods consult the neighborhood of values, they perform local
smoothing.
• Attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively.
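
The binning steps described above can be sketched as follows. This is a minimal illustration
using only the Python standard library; the function names and the price values are
assumptions made for the example. In practice, pandas offers `cut` (equal width) and `qcut`
(equal frequency) for the same purpose.

```python
from statistics import mean, median


def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins bins holding (nearly) equal counts."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width over [min, max]."""
    data = sorted(values)
    low, high = data[0], data[-1]
    bins = [[] for _ in range(n_bins)]
    if high == low:              # all values identical: one bin holds everything
        bins[0] = data
        return bins
    width = (high - low) / n_bins
    for v in data:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - low) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


def smooth(bins, summary=mean):
    """Replace every value in a bin by the bin's summary (mean or median)."""
    return [[summary(b)] * len(b) for b in bins if b]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]    # illustrative data
freq_bins = equal_frequency_bins(prices, 3)
print(freq_bins)
print(smooth(freq_bins, mean))      # smoothing by bin means
print(smooth(freq_bins, median))    # smoothing by bin medians
print(equal_width_bins(prices, 3))
```

Equal-frequency binning gives every bin the same number of values, while equal-width binning
gives every bin the same value range, so the two can place the same value in different bins.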

2. Histogram analysis:
• Histograms (or frequency histograms) are at least a century old and are widely used.
• “Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of
poles.
• Plotting histograms is a graphical method for summarizing the distribution of a given
attribute, X.
• If X is nominal, such as automobile model or item type, then a pole or vertical bar is
drawn for each known value of X.
• The height of the bar indicates the frequency (i.e., count) of that X value.
• The resulting graph is more commonly known as a bar chart.
• If X is numeric, the term histogram is preferred.
• The range of values for X is partitioned into disjoint consecutive subranges.
• The subranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X.
• The range of a bucket is known as the width.
• Typically, the buckets are of equal width.
For example, a price attribute with a value range of $1 to $200 (rounded up to the nearest
dollar) can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on.
For each subrange, a bar is drawn with a height that represents the total count of items
observed within the subrange.
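
The price example can be sketched as a simple bucket count. This is a minimal illustration;
the function name and the price values are assumptions made for the example, and prices are
assumed to be whole dollars.

```python
from collections import Counter


def histogram(prices, width=20, low=1):
    """Count how many prices fall into each equal-width bucket.

    Buckets are low..low+width-1, low+width..low+2*width-1, and so on.
    Only buckets that contain at least one price are returned.
    """
    counts = Counter((p - low) // width for p in prices)
    return {
        (low + k * width, low + (k + 1) * width - 1): counts[k]
        for k in sorted(counts)
    }


prices = [5, 12, 20, 21, 35, 40, 41, 59, 150, 200]    # illustrative data
for (start, end), count in histogram(prices).items():
    # Draw one '#' per item as a text stand-in for the bar height.
    print(f"{start:>3}-{end:<3} {'#' * count} ({count})")
```

Each printed row corresponds to one bar of the histogram, and the bucket boundaries can be
reused directly as the intervals of a discretized price attribute.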

3. Cluster analysis:
Cluster analysis is a popular data discretization method.
A clustering algorithm can be applied to discretize a numeric attribute, A, by partitioning the
values of A into clusters or groups based on similarity and storing only the cluster
representation (e.g., centroid and diameter).
Properties of clusters:
(i) All the data points in a cluster should be similar to each other.
(ii) The data points from different clusters should be as different as possible.
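
As one possible illustration of clustering-based discretization, the sketch below applies a
basic one-dimensional k-means to a numeric attribute and keeps only the centroid and diameter
of each cluster. The choice of k-means, the function names, the initialisation rule, and the
data are all assumptions made for the example; other clustering algorithms could be used.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster numeric values into k groups with a basic 1-D k-means."""
    data = sorted(values)
    # Initialise centroids at evenly spaced positions in the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:    # assignments are stable, so stop
            break
        centroids = updated
    return centroids, clusters


def summarize(clusters):
    """Return (centroid, diameter) for each non-empty cluster."""
    return [(sum(c) / len(c), max(c) - min(c)) for c in clusters if c]


values = [1, 2, 3, 20, 21, 22, 50, 51, 52]    # illustrative data
centroids, clusters = kmeans_1d(values, k=3)
print(clusters)
print(summarize(clusters))
```

Each cluster plays the role of one discrete category, and the (centroid, diameter) pairs are
the compact representation that replaces the original values.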

4. Decision tree analysis:

A decision tree is a hierarchical model used in decision support that depicts decisions and
their potential outcomes, incorporating chance events, resource costs, and utility. It is a
non-parametric, supervised learning model built from conditional control statements and is
useful for both classification and regression tasks. The tree consists of a root node,
branches, internal nodes, and leaf nodes. When a tree is used for discretization, the split
points it selects for a numeric attribute serve as the interval boundaries.

Example of Decision Tree


Let’s understand decision trees with the help of an example:

In the diagram below, the tree first asks about the weather: is it sunny, cloudy, or rainy?
Depending on the answer, it moves to the next feature, which is humidity or wind. It then
checks whether the wind is strong or weak; if the wind is weak and it is rainy, the person
may go and play.
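
To connect decision trees back to discretization, the sketch below finds the single threshold
on a numeric attribute that gives the highest information gain, which is how a tree would
choose a split point. This is a minimal illustration: the function names, the humidity values,
and the class labels are assumptions made for the example, not data from the diagram.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, information_gain) of the best binary split."""
    pairs = sorted(zip(values, labels))
    base = entropy([lab for _, lab in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue    # no boundary can be placed between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


humidity = [65, 70, 75, 80, 85, 90]                        # illustrative data
decision = ["play", "play", "play", "stay", "stay", "stay"]
threshold, gain = best_split(humidity, decision)
print(threshold, gain)
```

Applying the same search again inside each resulting interval would produce further cut
points, giving a supervised, top-down discretization of the attribute.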

5. Correlation analysis:
Correlation analysis is a statistical method used to measure the strength of the linear
relationship between two variables and compute their association. It quantifies how much
one variable tends to change when the other changes. A high
correlation points to a strong relationship between the two variables, while a low correlation
means that the variables are weakly related.
Researchers use correlation analysis to analyze quantitative data collected through research
methods like surveys and live polls for market research. They try to identify relationships,
patterns, significant connections, and trends between two variables or datasets. There is a
positive correlation between two variables when an increase in one variable is accompanied by
an increase in the other. On the other hand, a negative correlation means that when one variable
increases, the other decreases and vice versa.
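
The strength and direction of a linear relationship can be illustrated with the Pearson
correlation coefficient. The sketch below is a minimal implementation; the function name and
the sample values are assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


hours = [1, 2, 3, 4]            # illustrative data
score = [2, 4, 6, 8]            # rises with hours: positive correlation
errors = [8, 6, 4, 2]           # falls as hours rise: negative correlation
print(pearson(hours, score))
print(pearson(hours, errors))
```

The coefficient ranges from -1 to 1: values near 1 indicate a strong positive relationship,
values near -1 a strong negative one, and values near 0 a weak linear relationship.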

8. Activities/ Case studies/related to the session

Case Study: Discretization Impact on Classification Models

Background:
Participants are provided with a dataset containing continuous variables, and they are tasked
with applying different discretization techniques such as equal width and equal frequency
discretization. They will then train classification models using the discretized data and
evaluate the impact on model performance metrics such as accuracy, precision, and recall.
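
The evaluation step of the case study can be sketched as follows. This is a minimal
illustration of how accuracy, precision, and recall are computed from a model's predictions;
the function name and the label values are assumptions made for the example, and the
classifier itself is outside the scope of the sketch.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, and recall for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when no positives are predicted/present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


actual = [1, 1, 0, 0, 1]        # illustrative true labels
predicted = [1, 0, 0, 1, 1]     # illustrative model output
print(classification_metrics(actual, predicted))
```

Running the same function on predictions from models trained with equal-width and
equal-frequency discretized features gives a like-for-like comparison of the two techniques.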

9. Examples & contemporary extracts of articles/practices to convey the idea of the session

Enhancing Privacy Preservation through Data Discretization in Healthcare Analytics:

Example:

In the field of healthcare analytics, data discretization has emerged as a key strategy for
preserving patient privacy while enabling comprehensive analysis. A recent study by
Johnson et al. (2023) showcased how the application of differential privacy techniques
combined with data discretization methods allowed for effective analysis of patient health
records, ensuring compliance with privacy regulations without compromising the utility of
the data for research and analysis purposes.

10. SAQs - Self Assessment Questions

1. What is the primary goal of data discretization?


a) To increase data dimensionality
b) To reduce the number of data points
c) To handle continuous data
d) To remove outliers

2. How does data discretization contribute to handling missing or noisy data?


a) It removes the noisy data points
b) It reduces the impact of missing data
c) It increases the overall data complexity
d) It has no impact on missing or noisy data

11. Summary
Data discretization is a data preprocessing technique that involves transforming
continuous data into discrete form, enabling easier analysis and interpretation. It simplifies
complex datasets by partitioning numerical values into intervals or categories, reducing
computational complexity and noise. Discretization can also support data privacy,
particularly in sensitive domains such as healthcare and finance, by replacing exact
values with coarser ranges. It can improve the performance of some machine learning models
by reducing overfitting and improving generalization, although the effect depends on the
data and the model. Through techniques like equal width
and equal frequency discretization, it facilitates the identification of meaningful patterns
and trends in the data, enabling more informed decision-making. Moreover, it plays a
critical role in data mining tasks, including classification, clustering, and association rule
mining, by facilitating efficient data exploration and pattern recognition.
12. Terminal Questions

1) What are the different methods of data discretization?

2) Give examples of real-world applications in which data discretization is commonly used.


13. Case Studies (CO Wise)

Enhancing Privacy Preservation through Data Discretization in Healthcare Analytics:

Example:

In the field of healthcare analytics, data discretization has emerged as a key strategy for
preserving patient privacy while enabling comprehensive analysis. A recent study by
Johnson et al. (2023) showcased how the application of differential privacy techniques
combined with data discretization methods allowed for effective analysis of patient health
records, ensuring compliance with privacy regulations without compromising the utility of
the data for research and analysis purposes.

14. Answer Key

Solution:
1. Data discretization is defined as a process of converting continuous data attribute values
into a finite set of intervals with minimal loss of information and associating with each
interval some specific data value or conceptual labels.

Data discretization is considered a data reduction mechanism because it reduces data
from large domains of numeric values to a subset of categorical values.

Example: age can be transformed into intervals (0-10, 11-20, ...) or into conceptual labels
such as youth, adult, and senior.

Fits the problem statement

Continuous data (such as weight) is often easier to understand when it is divided and stored
in meaningful categories or groups. For example, we can divide a continuous variable,
weight, into the following groups:
Under 100 lbs (light), between 140–160 lbs (mid), and over 200 lbs (heavy).

We would consider this structure useful if there is no objective difference between values
falling under the same weight class.
In our example, weights of 85 lbs and 56 lbs convey the same information (the object is
light). Therefore, discretization makes our data easier to understand when it fits the
problem statement.

Methods of Data Discretization

1. Binning
2. Histogram analysis
3. Cluster analysis
4. Decision tree analysis
5. Correlation analysis

2. Data discretization is commonly employed in various real-world applications across
different domains to enable effective data analysis and decision-making. Some notable
examples of its usage include:

Healthcare: Patient health records often contain sensitive and continuous data, such
as medical test results and vital signs. Data discretization techniques are applied to
preserve patient privacy while enabling analysis for medical research and predictive
modeling.

Finance: Financial institutions use data discretization to analyze transactional data
for fraud detection and risk assessment. By discretizing transactional patterns, they
can identify irregularities and suspicious activities, enhancing security measures and
reducing financial risks.

Marketing: Customer data, including purchase history and demographic information,
is discretized to identify customer segments and behavior patterns. This facilitates
targeted marketing campaigns and personalized product recommendations,
improving customer engagement and retention.

Telecommunications: Telecom companies utilize data discretization to analyze
customer usage patterns and network performance. By discretizing network traffic
data and customer behavior, they can optimize network resource allocation and
improve service quality based on different usage categories.

Manufacturing: Data discretization is employed in quality control processes to
categorize production data and identify patterns related to product defects and
machine performance. This aids in improving production efficiency, minimizing
defects, and ensuring product quality and reliability.

Education: Educational institutions use data discretization to analyze student
performance data and identify learning patterns and trends. By discretizing academic
performance metrics, educators can personalize learning experiences and
interventions, leading to improved educational outcomes for students.

These examples illustrate the diverse applications of data discretization across various
industries, demonstrating its vital role in facilitating data analysis, pattern recognition, and
decision-making processes.

15. Glossary

Textual Annotation: The practice of adding comments, labels, or metadata to textual content to
provide additional information, context, or insights.
Labels: Short descriptions or tags attached to text to categorize or classify it, making it easier to
organize and search for.
Contextual Information: Additional data or details that surround the text, offering a better
understanding of the content's significance.

16. Reference Books:

1. Python Data Science Handbook, by Jake VanderPlas, Released November 2016.
Publisher(s): O'Reilly Media, Inc. ISBN: 9781491912058

Sites and Web links:

Text and Annotation | Python Data Science Handbook (jakevdp.github.io)

17. Keywords
Discretization, Data preprocessing, Continuous data, Categorical data, Equal width
discretization, Equal frequency discretization, Supervised discretization, Unsupervised
discretization, Information gain, Clustering-based discretization, Decision tree-based
discretization
