Discrete data                                          | Continuous data
-------------------------------------------------------|-------------------------------------------------------
Common examples: the number of students, the number of children, shoe size | Common examples: height, weight, length, time, temperature, age
Represented by ordinal and integer values              | Represented by decimal numbers and fractions
Easily counted on something as simple as a number line | Requires more in-depth measurement tools and methods like curves and skews
Remains constant over a specific time interval         | Varies over time and can have different values at any given point
What is continuous data?
• Continuous data is a type of numerical data that refers to the unspecified number of
possible measurements between two realistic points.
• These numbers are not always clean and tidy like those in discrete data, as
they're usually collected from precise measurements.
• Over time, measuring a particular subject allows us to create a defined range,
where we can reasonably expect to collect more data.
• Continuous data is all about accuracy. Variables in these data sets often
carry decimal points, with the number to the right stretched out as far as possible.
• This level of detail is paramount for scientists, doctors, and manufacturers, to name a few.
Examples of continuous data
• Some examples of continuous data include:
• The weight of newborn babies
• The daily wind speed
• The temperature of a freezer
Notations:
x = number of categories
freq = frequency of a category
n = number of values in the data
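As a minimal illustration of these notations, here is one way to compute them with pandas; the sample values are made up.

```python
import pandas as pd

# Hypothetical categorical sample, used only to illustrate the notation.
data = pd.Series(["red", "blue", "red", "green", "red", "blue"])

n = len(data)               # n = number of values in the data (6)
x = data.nunique()          # x = number of categories (3)
freq = data.value_counts()  # freq = frequency of each category

print(freq)  # red: 3, blue: 2, green: 1
```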
• Supervised Binning:
• Supervised binning is a type of binning that transforms a numerical or
continuous variable into a categorical variable, taking the target
class label into account.
• It refers to the target class label when selecting discretization cut
points. Entropy-based binning is a type of supervised binning.
• 1. Entropy-based Binning:
• The entropy-based binning algorithm partitions the continuous or
numerical variable so that the majority of values in each bin belong to
the same class label.
• It calculates the entropy of the target class labels and chooses the
split with maximum information gain (see the sketch below).
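A minimal sketch of an entropy-based split search, assuming NumPy and a small made-up feature/label pair; real implementations typically apply this search recursively with a stopping rule such as a minimum information gain.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_entropy_split(values, labels):
    # Scan candidate cut points (midpoints between consecutive distinct
    # values) and return the cut with maximum information gain.
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    base = entropy(labels)
    best_gain, best_cut = -1.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        cut = (values[i] + values[i - 1]) / 2
        w = i / len(values)
        gain = base - w * entropy(labels[:i]) - (1 - w) * entropy(labels[i:])
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

# Hypothetical data: a numeric feature and a binary class label.
vals = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labs = np.array([0, 0, 0, 1, 1, 1])
print(best_entropy_split(vals, labs))  # -> (6.5, 1.0)
```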
Problem with One-Hot Encoding when handling a large number of categories
A: "What's new in handling categorical data? We have the label encoding and
one-hot encoding techniques from scikit-learn."
B: "Well, I have OHE to save me here also."
A: "Not when a feature has a very large number of distinct categories. There is
another technique: it doesn't encode the categorical values directly, but
gathers some stats about them, say conditional probability, odds ratio, or
log-odds ratio, and then makes a count table and inserts these stats into it."
B: "It sounds interesting... Go on..."
A: "The idea of bin counting is deviously simple: rather than using the value
of the categorical variable as the feature, use the conditional probability of
the target under that value."
B: "It seems like a lot of work before you get what you want. Which library
function do you use to implement this?"
A: "I didn't find anything implemented as such, but I have written my own
function. Yes, it incurs a slight delay in the learning pipeline, but it is
good for Kaggle competitions."
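A minimal sketch of that idea, assuming a pandas DataFrame with a high-cardinality categorical column and a binary target; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: user IDs (many distinct values) and a binary click target.
df = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "click": [1, 1, 0, 0, 0, 1],
})

# P(click = 1 | user): the conditional probability of the target
# under each categorical value.
p_click = df.groupby("user")["click"].mean()

# Use the probability as the feature instead of the raw category,
# so one numeric column replaces thousands of one-hot columns.
df["user_enc"] = df["user"].map(p_click)
print(df)
```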
We'll focus on the MPG column and make 3 bins: fuel-efficient, average fuel efficiency, and gas guzzlers.
Making the Bins
• Our first step will be to determine the ranges for these bins. This can be tricky as
it can be done in a few different ways.
• One way to do this is to divide the bins up evenly based on the distribution of
values. Basically, we would go through the same process as if we were creating
a histogram.
• Since Python makes generating histograms easy with matplotlib, let's visualize
this data in a histogram and set the bins parameter to 3 (a sketch follows below).
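A sketch of this step, with made-up MPG values standing in for the dataset's MPG column; the bin edges returned by the histogram are reused to label the three bins.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical MPG values standing in for the real column.
mpg = pd.Series([9, 14, 15, 18, 21, 24, 25, 27, 30, 33, 37, 44])

# A histogram with bins=3 shows where three evenly spaced edges fall.
counts, edges, _ = plt.hist(mpg, bins=3)
plt.xlabel("MPG")
plt.ylabel("Count")
plt.show()

# Reuse the same edges to assign each value to a fuel-efficiency bin.
labels = ["gas guzzler", "average fuel efficiency", "fuel-efficient"]
mpg_binned = pd.cut(mpg, bins=edges, labels=labels, include_lowest=True)
print(mpg_binned.value_counts())
```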
Pros and Cons of Bin Counting approach
Bin-counting Scheme
• The encoding schemes we discussed so far work quite well on categorical
data in general, but they start causing problems when the number of distinct
categories in any feature becomes very large.
• Essentially, for any categorical feature with m distinct labels, you get m
separate features.
• This can easily increase the size of the feature set, causing problems like
storage issues and model training problems with regard to time, space, and
memory (the short sketch after this list makes the blow-up concrete).
• Besides this, we also have to deal with the 'curse of dimensionality': with
an enormous number of features and not enough representative samples, model
performance starts getting affected, often leading to overfitting.
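To make the blow-up concrete, a small sketch (the feature and its values are made up) showing that one-hot encoding a feature with m distinct labels produces m columns:

```python
import pandas as pd

# Hypothetical high-cardinality feature, e.g. a user or item ID
# with 10,000 distinct labels.
ids = pd.Series([f"id_{i}" for i in range(10_000)])

one_hot = pd.get_dummies(ids)
print(one_hot.shape)  # (10000, 10000): one new column per distinct label
```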
• The bin-counting scheme is a useful scheme for dealing with categorical
variables having many categories.
• In this scheme, instead of using the actual label values for encoding, we use
probability-based statistical information about the value and the actual
target or response value which we aim to predict in our modeling efforts (see
the sketch below).
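One way to build the count table described above, assuming a pandas DataFrame with hypothetical column names; the statistics stored are the ones named earlier, the conditional probability and the log-odds (with +1 smoothing so single-class categories don't divide by zero).

```python
import numpy as np
import pandas as pd

# Hypothetical data: a high-cardinality categorical feature and a binary target.
df = pd.DataFrame({
    "device": ["a", "a", "a", "b", "b", "c", "c", "c", "c"],
    "target": [1, 1, 0, 0, 0, 1, 0, 1, 1],
})

# Count table: positives, negatives, and totals per category.
table = df.groupby("device")["target"].agg(N_pos="sum", N_total="count")
table["N_neg"] = table["N_total"] - table["N_pos"]

# Probability-based statistics per category.
table["p_pos"] = table["N_pos"] / table["N_total"]
table["log_odds"] = np.log((table["N_pos"] + 1) / (table["N_neg"] + 1))
print(table)

# Encode the feature by looking up its statistic in the count table.
df["device_enc"] = df["device"].map(table["log_odds"])
```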