
Assignment 3

1. Feature selection and feature extraction are two different techniques
used in machine learning for dimensionality reduction and for improving
the performance of models. Here is the key difference between the two:
a) Feature Selection: Feature selection is the process of selecting a
subset of relevant features (columns/variables) from the original set
of features. We identify and retain the most informative features that
contribute the most to the prediction task, while discarding irrelevant
or redundant features. This technique does not alter the original
features; it only selects a subset of them. Some common feature
selection methods include filter methods (e.g., correlation, mutual
information), wrapper methods (e.g., recursive feature elimination),
and embedded methods (e.g., LASSO regression). The main goal is to
improve the model's performance by eliminating irrelevant or
redundant features without transforming them. It helps in reducing
the complexity of the model, making it easier to interpret.
b) Feature Extraction: Feature extraction, on the other hand, is the
process of transforming or projecting the original set of features into
a new lower-dimensional space. This technique creates new features
based on combinations or transformations of the original features.
This new set of features can be easier to work with and can often
improve the performance of machine learning algorithms. Some
common feature extraction methods include Principal Component
Analysis (PCA), Linear Discriminant Analysis (LDA), Independent
Component Analysis (ICA), and autoencoder neural networks. The
goal is to reduce the dimensionality of the dataset by creating new
features that still capture the most essential information from the
original dataset. This can lead to improved model performance and
reduced computational costs, but at the expense of potentially
losing interpretability, since the new features are combinations of the
original ones and may not have a clear meaning.
In summary, feature selection selects a subset of the original features,
while feature extraction creates new features based on transformations
or combinations of the original features. Feature selection retains the
original meaning and interpretation of the selected features, whereas
feature extraction creates new features that may not have a clear
interpretation.
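
To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and a synthetic dataset; the feature counts and parameters are arbitrary) that applies a filter-style selector and PCA-based extraction to the same data:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Feature selection: keep the 3 original columns with the highest mutual information.
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)
print("Indices of retained original features:", selector.get_support(indices=True))

# Feature extraction: project all 10 columns onto 3 new principal components.
pca = PCA(n_components=3)
X_extracted = pca.fit_transform(X)
print("Variance explained by the new features:", pca.explained_variance_ratio_)

Note that X_selected keeps original, interpretable columns, while the columns of X_extracted are linear combinations with no direct meaning, which mirrors the trade-off described above.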
Curse of Dimensionality: The curse of dimensionality refers to the
challenges and problems that arise when dealing with high-dimensional
data in data preprocessing and machine learning tasks. As the number
of features (dimensions) in a dataset increases, several issues can
occur, making data analysis and modeling more difficult. The curse of
dimensionality manifests itself in the following ways:
a) Sparsity: Data points become increasingly spread out and far apart in
high-dimensional spaces, making it difficult to find meaningful
patterns or clusters.
b) Computational Complexity: As the number of dimensions grows, the
computational complexity of many algorithms increases
exponentially, leading to longer training times and increased memory
requirements.
c) Noise and Irrelevant Features: High-dimensional datasets often
contain irrelevant or redundant features that contribute little useful
information and can introduce noise.
d) Distance Metrics: Distance metrics become less meaningful and
discriminative in high-dimensional spaces, making it difficult to
distinguish between data points.
e) Overfitting: With more features, machine learning models have more
parameters to learn, increasing the risk of overfitting and poor
generalization performance.
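
The distance-metric issue in point d) can be illustrated with a small numerical experiment; this is only a sketch using randomly generated points, not data from the assignment:

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))   # 500 random points in the unit hypercube
    query = rng.random(d)           # one query point
    dists = np.linalg.norm(points - query, axis=1)
    # As d grows, the farthest point is barely farther than the nearest one.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative distance contrast={contrast:.3f}")

The printed contrast shrinks as the dimensionality grows, which is why nearest-neighbour style reasoning degrades in high dimensions.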
2. Sampling is the process of selecting a subset of data from a larger
population or dataset for analysis or model building. Instead of working
with the entire dataset, which can be computationally expensive and
time-consuming, sampling allows us to work with a representative
subset of the data. Common sampling techniques include Simple
Random Sampling, Stratified Sampling, Cluster Sampling, and
Systematic Sampling. Sampling can affect the accuracy and efficiency
of data mining processes in the following ways:
a) Accuracy: If the sample is truly representative of the population,
meaning it captures the underlying patterns and distributions, the
analysis or model built on the sample can provide accurate results
and insights about the entire population. However, if the sampling
process introduces bias (e.g., non-random sampling, excluding
certain subgroups), the sample may not accurately represent the
population, leading to biased and inaccurate results.
b) Efficiency: Sampling can significantly reduce the computational
cost and time required for data mining processes. Analyzing a
smaller dataset requires less memory, processing power, and time,
which is particularly beneficial when dealing with large-scale data or
when computational resources are limited.
We apply three sampling methods to draw a sample of five records (tuples) as
follows:
a) Simple Random Sampling: We randomly select five data points from
the dataset without replacement: {T387 (senior), T284
(middle_aged), T69 (senior), T290 (youth), T307 (youth)}. These were
chosen using a random picker.
b) Systematic Sampling: We choose the records whose positions are
multiples of 3. So, our sample is {T307 (youth), T263 (middle_aged),
T326 (middle_aged), T284 (middle_aged), T876 (middle_aged)}.
c) Stratified Sampling: Since the data is categorized into three groups
(youth, middle-aged, and senior), we divide the dataset into strata
based on these categories. Then we randomly pick from each
stratum proportional to their frequency in the entire dataset to
maintain representation. We get {T876(middle_aged),
T263(middle_aged), T138(middle_aged), T387(senior), T307(youth)}.
Considering representativeness, the stratified sample is the most
representative of the population. It ensures that each age group is
proportionally represented in the sample according to their presence in
the full dataset. With this method, despite the small size of the sample,
every group is included, and therefore, it provides a miniature yet
proportionate reflection of the entire dataset.
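
As a sketch of how these three schemes could be reproduced programmatically (the DataFrame below is a stand-in with assumed IDs and group proportions, not the actual assignment table):

import pandas as pd

df = pd.DataFrame({
    "id": [f"T{i}" for i in range(1, 21)],
    "age_group": ["youth"] * 6 + ["middle_aged"] * 9 + ["senior"] * 5,
})

# a) Simple random sampling: 5 records without replacement.
simple = df.sample(n=5, replace=False, random_state=42)

# b) Systematic sampling: every 4th record starting from a fixed offset.
systematic = df.iloc[2::4]

# c) Stratified sampling: draw 25% of each age group to keep proportions.
stratified = df.groupby("age_group", group_keys=False).sample(frac=0.25, random_state=42)

print(simple, systematic, stratified, sep="\n\n")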

3. Calculating a correlation matrix is not association analysis, although
both involve relationships among attributes. A correlation matrix
quantifies the linear relationship between pairs of continuous variables,
providing specific measures like Pearson’s correlation coefficient.
Association analysis, like market basket analysis, looks for patterns or
rules (e.g., "if-then" statements) in the data, often without quantifying
the relationship or implying linearity. It uses techniques like the Apriori
algorithm to find items that frequently occur together, focusing on the
co-occurrence rather than a numeric relationship measure.
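
For contrast, computing a correlation matrix is a single aggregation over numeric columns; a minimal sketch with pandas and made-up values:

import pandas as pd

df = pd.DataFrame({
    "age":            [27, 35, 44, 50, 52],
    "hours_per_week": [38, 40, 45, 48, 50],
    "income":         [30_000, 42_000, 55_000, 61_000, 64_000],
})
print(df.corr(method="pearson"))   # symmetric matrix of pairwise correlation coefficients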

E-commerce platforms can use association rule mining to analyze
customer purchasing patterns and recommend products that are often
bought together. This is done by finding correlations between
products in transaction data and using those correlations to predict
what a customer might be interested in purchasing based on their
current shopping cart or past buying behavior.

Consider an e-commerce platform specializing in gardening supplies
that uses association rule mining to analyze customer purchase
patterns. Through this analysis, they discover a less obvious association
rule:
Rule: If a customer buys heirloom tomato seeds (Item A), there is a 60%
confidence that they will purchase organic fertilizer (Item B) within the
next two weeks.

Using this association rule, the e-commerce platform can:


a) Seasonal Recommendations: Suggest organic fertilizer to customers
buying heirloom tomato seeds, especially at the start of the planting
season.
b) Bundled Offers: Create a "Gardener's Bundle" that includes both
heirloom tomato seeds and organic fertilizer at a slight discount. This
encourages customers to purchase both items together, simplifying
their shopping experience.
c) Complementary Product Recommendations: When a customer
views heirloom tomato seeds, display organic fertilizer as a
complementary product, with a note that says, "Customers who
bought this item also frequently buy organic fertilizer."
d) Checkout Prompts: Add a prompt at the checkout page for
customers who have heirloom tomato seeds in their cart, suggesting
they add organic fertilizer to their order with one click.
e) Customer Reviews and Testimonials: Showcase customer reviews or
testimonials that mention the success of using both products
together, potentially influencing other customers to make the same
pairing.
f) Educational Content: Develop blog posts or videos on the benefits of
using organic fertilizer with heirloom tomatoes and feature these
products together on the content page.
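
A minimal sketch of how such "frequently bought together" rules could be derived from raw baskets without any external library (the item names, basket contents, and the 0.6 confidence threshold are all illustrative):

from itertools import combinations
from collections import Counter

baskets = [
    {"heirloom_tomato_seeds", "organic_fertilizer"},
    {"heirloom_tomato_seeds", "organic_fertilizer", "garden_trowel"},
    {"heirloom_tomato_seeds"},
    {"organic_fertilizer", "garden_trowel"},
    {"heirloom_tomato_seeds", "organic_fertilizer"},
]

item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))

def recommend(cart_item, min_confidence=0.6):
    # Return items B where the rule {cart_item} -> {B} meets the confidence threshold.
    suggestions = []
    for pair, together in pair_counts.items():
        if cart_item in pair:
            other = next(i for i in pair if i != cart_item)
            confidence = together / item_counts[cart_item]
            if confidence >= min_confidence:
                suggestions.append((other, round(confidence, 2)))
    return sorted(suggestions, key=lambda s: -s[1])

print(recommend("heirloom_tomato_seeds"))   # -> [('organic_fertilizer', 0.75)]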

4. An association rule X → Y in the context of a dataset of transactions,
where X and Y are sets of items, implies a relationship where the
presence of items in set X within a transaction tends to indicate the
presence of items in set Y in the same transaction.
Support (s): This is a measure of how frequently the itemset appears in
the dataset. The support of the rule X → Y is defined as the proportion of
transactions in the data that contain both X and Y. A support of s means
that X and Y occur together in s% of all transactions. Support is an
indication of how popular or common the combination of items is in
the dataset.
Confidence (c): This is a measure of the strength of the implication of
the rule. A confidence of c means that, given the transactions that
contain X, c % of them also contain Y. It is the probability of seeing the
itemset Y in transactions given that these transactions also contain X.
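
The two measures reduce to simple counts; a short sketch with a made-up transaction list:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
X, Y = {"bread"}, {"milk"}

n = len(transactions)
count_X  = sum(1 for t in transactions if X <= t)         # transactions containing X
count_XY = sum(1 for t in transactions if (X | Y) <= t)   # transactions containing both X and Y

support = count_XY / n            # 3/5 = 0.60
confidence = count_XY / count_X   # 3/4 = 0.75
print(f"support = {support:.2f}, confidence = {confidence:.2f}")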

5. When dealing with a missing value in a dataset where the missing data
cannot be ignored, there are several imputation methods that can be
used to fill in the gap. Here are three alternatives for imputing the
missing value in the HOURS_PER_WEEK column:

a) Mean or Median Imputation over the Whole Dataset:
The missing value is replaced by the mean (44.5) or median (40) of the
HOURS_PER_WEEK for all individuals in the dataset.
Advantages:
• Easy to implement.
• Does not introduce significant bias if the data is normally
distributed and not heavily skewed.
Disadvantages:
• Can reduce the variability of the data.
• The central tendency of the whole dataset might not be a good
representation of every class.
• The mean is sensitive to outliers, while the median may not reflect
the overall distribution if the data is skewed.
b) Mode Imputation for Each Class:
The missing value is replaced by the mode (40) of the
HOURS_PER_WEEK within the same class (income category) as the
individual with the missing value.
Advantages:
• Reflects the most common working hours for individuals in the
same income class.
• Considers the distribution of HOURS_PER_WEEK within the same
class as the record with the missing value.
Disadvantages:
• If the mode is not a good representation of the variability within
each class, this can lead to misleading analysis.
• The mode does not consider the individual's characteristics
outside of the class.

c) Predictive Modeling Using Bayesian Inference:
A predictive model, such as a Bayesian regression model, is built using
the other attributes in the data to predict the missing HOURS_PER_WEEK
value. Using online tools, this value comes out to be approximately 43.68.
Advantages:
• Can provide a more accurate imputation by considering the
relationships between variables.
• Utilizes the information from other attributes, not just the
distribution of HOURS_PER_WEEK.
Disadvantages:
• More complex to implement and requires a model-building
process.
• There is a risk of overfitting, especially if the dataset is not large
enough or if the model is too complex.
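
A sketch of the three options in code, assuming the records sit in a pandas DataFrame with columns named HOURS_PER_WEEK, CLASS, AGE, and EDUCATION_NUM (the column names, the file name, and the choice of BayesianRidge are assumptions for illustration, not the exact tool used above):

import pandas as pd
from sklearn.linear_model import BayesianRidge

df = pd.read_csv("adult_sample.csv")   # hypothetical file holding the assignment data

# a) Mean or median over the whole dataset.
df["hours_mean"]   = df["HOURS_PER_WEEK"].fillna(df["HOURS_PER_WEEK"].mean())
df["hours_median"] = df["HOURS_PER_WEEK"].fillna(df["HOURS_PER_WEEK"].median())

# b) Mode within the same income class.
df["hours_class_mode"] = df.groupby("CLASS")["HOURS_PER_WEEK"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)

# c) Predictive imputation: a Bayesian regression on the other numeric attributes.
predictors = ["AGE", "EDUCATION_NUM"]
known   = df[df["HOURS_PER_WEEK"].notna()]
missing = df[df["HOURS_PER_WEEK"].isna()]
model = BayesianRidge().fit(known[predictors], known["HOURS_PER_WEEK"])
df.loc[missing.index, "hours_predicted"] = model.predict(missing[predictors])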
We can transform the attribute EDUCATION_NUM into a new attribute
with the following three possible values:
• 9-10: Secondary Education (Group 1)
• 11-13: Some College or Associate Degree (Group 2)
• 14 and above: Advanced Education (Group 3)
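
A sketch of this grouping with pandas (the EDUCATION_NUM values shown are illustrative):

import pandas as pd

education_num = pd.Series([9, 10, 11, 13, 14, 16])   # illustrative values
groups = pd.cut(
    education_num,
    bins=[8, 10, 13, float("inf")],
    labels=["Secondary Education", "Some College or Associate Degree", "Advanced Education"],
)
print(groups)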
Now we discretize the AGE attribute into 4 equi-width intervals:
The range is 54 - 27 + 1 = 28 years (the difference between the maximum and
minimum age, counted inclusively). The width is therefore 28 / 4 = 7 years, giving the following bins:
Bin 1: 27-33 (ages 27 to 33, inclusive)
Bin 2: 34-40 (ages 34 to 40, inclusive)
Bin 3: 41-47 (ages 41 to 47, inclusive)
Bin 4: 48-54 (ages 48 to 54, inclusive)
Here is the histogram:

We can see that the frequency generally increases with age, with the
highest frequency of 10 occurring in the 45-55 range of the histogram.
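
A sketch of the equi-width binning with pandas, using the 20 AGE values that appear in the equi-depth listing below (the data is taken from this assignment, but the code itself is only illustrative):

import pandas as pd

ages = pd.Series([27, 28, 29, 30, 35, 36, 37, 38, 40, 44,
                  45, 47, 48, 49, 49, 49, 50, 52, 52, 54])
# Edges chosen so the integer bins are 27-33, 34-40, 41-47, 48-54.
equi_width = pd.cut(ages, bins=[26, 33, 40, 47, 54])
print(equi_width.value_counts().sort_index())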
Now we discretize the AGE attribute into 4 equi-depth intervals:
Since there are 20 data points and we want 4 bins, each bin would ideally
contain 20/4 = 5 points. Assign the data points to each bin until the bin
reaches 5 data points. The highest value in one bin and the lowest value in the
next bin are the boundaries. Here are the bins:
Bin 1: [27, 28, 29, 30, 35] (ages 27 to 35, inclusive)
Bin 2: [36, 37, 38, 40, 44] (ages 36 to 44, inclusive)
Bin 3: [45, 47, 48, 49, 49] (ages 45 to 49, inclusive)
Bin 4: [49, 50, 52, 52, 54] (ages 49 to 54, inclusive)

Notice that the age 49 appears in both Bin 3 and Bin 4 due to its frequency in
the data set. In such cases, we may need to make a choice about where to
split the bin. In practice, one might decide to put an equal number of the
repeated value in each bin, or all in one bin, depending on the distribution and
desired analysis.
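
The same equal-frequency split can be sketched with pandas' quantile-based binning (again using the assumed 20 AGE values); note that pandas resolves the tie at age 49 automatically by keeping all the 49s in one bin, which is one of the options discussed above:

import pandas as pd

ages = pd.Series([27, 28, 29, 30, 35, 36, 37, 38, 40, 44,
                  45, 47, 48, 49, 49, 49, 50, 52, 52, 54])
equi_depth = pd.qcut(ages, q=4)   # four bins with (approximately) equal counts
print(equi_depth.value_counts().sort_index())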

For the attribute AGE, we will create intervals based on the values of k, where
each interval has the form [mean - (k+1)*sd, mean - k*sd) with mean = 42 and
sd = 8, as follows:
For k = 0: we have the interval [mean - sd, mean), which is [42 - 8, 42) or [34, 42).
For k = -1: we have the interval [mean, mean + sd), which is [42, 42 + 8) or [42, 50).
For k = 1: we have the interval [mean - 2*sd, mean - sd), which is [42 - 16, 42 - 8) or [26, 34).

We will continue this process to create a set of intervals based on the
multiples of the standard deviation from the mean. Since the AGE values
range from 27 to 54, we can consider k values from -2 to 1 for this dataset.
The intervals will be:
Interval for k = -2: [42 + 8, 42 + 2*8) → [50, 58)
Interval for k = -1: [42, 42 + 8) → [42, 50)
Interval for k = 0: [42 - 8, 42) → [34, 42)
Interval for k = 1: [42 - 2*8, 42 - 8) → [26, 34)

Now, let us bin the AGE values into these intervals:
Ages 27-33 fall into the interval [26, 34).
Ages 34-41 fall into the interval [34, 42).
Ages 42-49 fall into the interval [42, 50).
Ages 50-54 fall into the interval [50, 58).

With these bins, we have effectively created a set of equal-width intervals,
each one standard deviation wide, that partition the AGE attribute relative
to its mean and standard deviation.
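
A sketch of this standard-deviation based binning, using mean = 42 and sd = 8 exactly as in the text above (with real data these values would be computed from the sample):

import pandas as pd

ages = pd.Series([27, 28, 29, 30, 35, 36, 37, 38, 40, 44,
                  45, 47, 48, 49, 49, 49, 50, 52, 52, 54])
mean, sd = 42, 8
edges = [mean + i * sd for i in range(-2, 3)]      # [26, 34, 42, 50, 58]
sd_bins = pd.cut(ages, bins=edges, right=False)    # [26,34), [34,42), [42,50), [50,58)
print(sd_bins.value_counts().sort_index())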
