CHALLENGES
- Random data can generate apparently interesting association rules
- The more rules you produce, the greater this danger
- Rules based on large numbers of records are less subject to this danger
POSSIBLE ACTIONS BASED ON ASSOCIATION RULE
BARBIE DOLL -> CANDY
“If you’ve got terabytes of data, and you’re relying on data mining to find interesting things in there for you, you’ve lost before you’ve even begun.” –Herb Edelstein
WHAT IS PATTERN DISCOVERY?
- According to SAS, it is one of the broad categories of analytical methods associated with data mining.
- It is a process of uncovering patterns from massive data sets.
PATTERN DISCOVERY CAUTION
Poor data quality
Opportunity
Interventions
Separability
Obviousness
Non-stationarity
PATTERN DISCOVERY KEY IDEAS
Patterns are not known
But data which are believed to possess patterns are given
Examples:
o Clustering: grouping similar samples into clusters
o Associative rule mining: discover certain features that often appear together in data
Patterns are known beforehand and are observed/described by:
o Explicit samples
o Similar samples (usually)
Modeling approaches
o Build a model for each pattern
o Find the best fit model for new data
Usually require training using observed samples
1. CLUSTERING
- Clustering aims at dividing the data set into
groups (clusters) where the inter-cluster
similarities are minimized while the
similarities within each cluster are
maximized
2. ASSOCIATION RULE MINING
- Association rule discovery on usage data
results in finding groups of pages that are
commonly accessed
3. SEQUENTIAL PATTERN MINING
4. CLASSIFICATION
5. DECISION TREES
6. NAÏVE BAYES CLASSIFIER
SUPPORT
- The support for the rule A => B is the probability that the two item sets occur together
CALCULATING FOR THE SUPPORT
TRANSACTION ID    ITEMS
Lanz              Bread, Milk
Ong               Bread, Lady’s Choice, Eggs
Gonzales          Bread, Milk, Butter
Gepilano          Bread, Diaper, Coke
Francisco         Bread, Diaper, Milk
Tupaz             Bread, Milk, Butter
support for the bread => milk is 4/6
CONFIDENCE
- Defines the likelihood that the consequent is in the cart given that the cart already has the antecedents
- Confidence is estimated using the following: confidence(A => B) = support(A and B) / support(A)
- Technically, confidence is the conditional probability of occurrence of consequent given the antecedent.
CALCULATING FOR THE CONFIDENCE
TRANSACTION ID    ITEMS
Trans A           Beer, Peanut, Egg
Trans B           Beer, Milk, Peanut, Diaper
Trans C           Milk, Diaper, Egg
Trans D           Peanut, Egg, Diaper
Trans E           Beer, Peanut, Egg
Trans F           Egg, Beer, Peanut
Trans G           Beer, Diaper, Peanut
RULE INTERPRETATION
15. Observe the table below and compute for the support for beer -> peanut [4/7]
16. Observe the table below and compute for the support for SD Card => phone case [0.3]
17/20
TRANSACTION ID    ITEMS
Trans A Beer, Peanut, Egg
Trans B Beer, Milk, Peanut, Diaper
Trans C Milk, Diaper, Egg
Trans D Peanut, Egg, Diaper
Trans E Beer, Peanut, Egg
Trans F Egg, Beer, Peanut
Trans G Beer, Diaper, Peanut
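The support and confidence definitions can be checked against the Trans A to Trans G table above with a short script. This is only a sketch: the function names are made up for illustration, and the lift ratio is computed with the usual textbook definition (confidence divided by the support of the consequent), which these notes do not spell out.

```python
from fractions import Fraction

# Transactions copied from the Trans A to Trans G table in these notes.
TRANSACTIONS = {
    "Trans A": {"Beer", "Peanut", "Egg"},
    "Trans B": {"Beer", "Milk", "Peanut", "Diaper"},
    "Trans C": {"Milk", "Diaper", "Egg"},
    "Trans D": {"Peanut", "Egg", "Diaper"},
    "Trans E": {"Beer", "Peanut", "Egg"},
    "Trans F": {"Egg", "Beer", "Peanut"},
    "Trans G": {"Beer", "Diaper", "Peanut"},
}


def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in TRANSACTIONS.values() if items <= basket)
    return Fraction(hits, len(TRANSACTIONS))


def confidence(antecedent, consequent):
    """support(A and B) / support(A); fails if the antecedent never occurs."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """Assumed standard definition: confidence(A => B) / support(B)."""
    return confidence(antecedent, consequent) / support(consequent)


print(support({"Milk", "Egg"}))                    # 1/7
print(confidence({"Egg"}, {"Peanut"}))             # 4/5
print(round(float(lift({"Egg"}, {"Peanut"})), 2))  # 0.93
```

Using Fraction keeps the results exact, so they can be compared directly with answers written as fractions such as 4/6.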
A => B
A => B
Answer: 1.5
Answer: 0.43
10. Observe the table below and compute for the support of Beer=>peanut
Answer: 4/7
11. Observe the table below and compute for the support of SD Card => Phone case
13. When it comes to association analysis, the more rules you produce, the greater the risk is. [True]
14. Suppose you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired, which among the following is the ideal value for your alpha? [0.8]
15. Which of the following is not an advantage of using association rule? [Assumes transaction database is memory resident.]
16. A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences. [True]
17. It measures the overall impact. [Support]
18. Clustering aims to discover certain features that often appear together in data. [False]
19. It is another type of association analysis that involves using sequence data. [Association Rule]
20. Observe the table below and compute for the support of SD Card => Phone case
Answer: 0.3
5. It is the process of discovering useful patterns and trends in large data sets. [data mining]
6. Observe the table below and compute for the lift ratio of egg -> Peanut
Answer: 0.93
7. Observe the table below and compute for the confidence of Airpods->powerbank
Answer: 0.5
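The alpha question (item 14) can be illustrated with a small sketch. It assumes that alpha is the smoothing constant of simple exponential smoothing, which the notes do not state outright, and it uses a made-up series with one level shift from 10 to 20.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[t] = alpha * y[t] + (1 - alpha) * s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical observations with a real change in level (10 -> 20).
observations = [10, 10, 10, 20, 20, 20]

fast = exponential_smoothing(observations, alpha=0.8)
slow = exponential_smoothing(observations, alpha=0.2)

for name, values in (("alpha=0.8", fast), ("alpha=0.2", slow)):
    print(name, [round(v, 2) for v in values])
```

Three periods after the shift, the alpha = 0.8 series is already near 19.92 while the alpha = 0.2 series is still near 14.88, which is why a larger alpha suits cases where a rapid response to a real change is wanted.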
12. Lift ratio shows how effective the rule is
in finding consequents. [True]
13. observe the table below and compute
for the lift ratio of
[0.50]
5. Observe the table below and compute for
milk->egg.
[0.14]
6. Observe the table below and compute for
lift ratio diaper->milk.
[0.40]
8. Observe the table below and compute for
the confidence of egg->peanut.
[1.5]
7. Observe the table below and compute for
the confidence of beer->diaper.
[0.8]
17. Observe the table below and compute for the lift
ratio of powerbank->airpods.
[0.33]
[0.67]
13. Which of the following is not an application of a sequential pattern? [IDENTIFYING FAKE NEWS]
20. Observe the table below and compute for the support for Phone case->SD card.
8    airpods, powerbank
9    phone case, airpods
19. Observe the table below and compute for the support for diaper⇒peanut
Transaction ID    Items
MISSING VALUES
- Having null values in your data set could affect the accuracy of the model.
OUTLIERS
- When your data has outliers, it could affect the distribution of your data.
INCONSISTENT DATA
IMPROPERLY FORMATTED DATA
LIMITED FEATURES
THE NEED FOR TECHNIQUES SUCH AS FEATURE ENGINEERING
Variables Selection
Observe minimizing garbage in, garbage out (GIGO)
Procedures
1. Backward-selection
2. Forward-selection
[M6-ST1] DATA ANOMALIES
Applications
Collapse the categories based on the number
of observations in a category.
Collapse the categories based on the
reduction in the chi-square test of
association between the categorical input
and the target.
Use smoothed weight of evidence coding to
convert the categorical input to a continuous
input.
Forward Selection
STEPS OF BACKWARD-SELECTION
BACKWARD SELECTION
EXAMPLE
STEPWISE PROCEDURE
Caloocan
Makati
Quezon
None of the choices
Data preparation affects:
The objectives of the research
The quality of the research
The research approach
The sample size
These anomalies have values that significantly deviate from the other data points that exist in the same context.
Contextual outliers
It is a manipulation of scale values to ensure comparability with other scales:
Scale transformation
When there’s a missing value for a categorical variable, it is ideal to supply it by computing for the average of the data values available.
False
It is a best practice to divide your dataset into train and test dataset.
True
Outlier analysis can provide good product quality.
True
Answer: Data Inconsistency
A review of the questionnaires is essential in order to:
Select the data analysis strategy
Find new insights
Increase the quality of the data
Increase accuracy and precision of the collected data
Anomaly detection can cause a bad user experience.
False (answer uncertain)
You can use histogram to detect outliers.
True
When a subset of points within a set is anomalous to the entire data set, those values are:
Collective outliers
Feature selection maps the original feature space to a new feature space with lower dimensions by combining the original feature space.
True
False (it should be feature extraction)
This happens when inserting vital data into the database is not possible because other data is not already there.
Insertion anomaly
These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database).
Anomalies
Unnecessary predictors will add noise to the estimation of other quantities that we are interested in.
True
You can also use regression when handling noisy data.
True
- Quality of the research
Given the following values for age, what is the problem with the data?
Age
16
27
-8990
19
15
18
The following can be done to treat unsatisfactory response except:
Returning to the field
A good rule of thumb in having a right amount of data is to have 10 records for every predictor value.
True
Given are the following records for the attribute rating. What is the problem with the data?
application_rating
Data Inconsistency
Supply the missing values given for the attribute city.
City
16/20
It is a best practice to divide your dataset into train and test dataset. True
A homogeneous data set is a data set whose data records have the same target value. True
This happens when the deletion of unwanted information causes desired information to be deleted as well.
Deletion anomaly
Supply for the missing values.
- 19.6
Young = 12 – 17
Adult = 18 – 34
Old = 35 – 60
What kind of data preparation was practiced? Data Cleaning
It is the process of integrating multiple databases, data cubes, or files. data integration
These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database).
Anomalies
It is a manipulation of scale values to ensure comparability with variables with other scales:
Scale transformation
You can also use regression when handling noisy data. True
Supply the missing value in the given data below.
The procedure starts with an empty set of features [reduced set]. Forward Selection
It is the simplest of all variable selection procedures and can be easily implemented without special software (Use lowercase for your answer)
Backward Selection
The forward selection procedure starts with no variables in the model. True
Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric.
True
The figure below illustrates the first step in doing backward selection. False (no picture 😊)
It is intended to select the “best” subset of predictors. (Use lowercase for your answer)
Backward
Prior to variable selection, one must identify outliers and influential points - maybe exclude them at least temporarily. True
Application_rating
1
2
A
17/20
16
27
-8990
19
15
18
[DATA INCONSISTENCY]
7. Histogram is used to see missing data [FALSE]
8. Clustering can also detect outliers [TRUE]
9. These outliers exist far outside the entirety of a data set [GLOBAL OUTLIERS]
15. Forward selection is the simplest variable selection model [FALSE]
16. These are variables that significantly influence Y and so should be in the model but are excluded [OMITTED VARIABLES]
17. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. [TRUE]
18. The first step in stepwise procedure is to select the predictor most highly correlated with the target. [FALSE]
19. Prior to variable selection, one must identify outliers and influential points – maybe exclude them at least temporarily [TRUE]
20. The procedure starts with an empty set of features [reduced set]. [FORWARD SELECTION]
WMA
-
AGD
AGD
4. The following are techniques to treat missing values except: [RETURNING TO THE FIELD]
5. Supply for the missing values. [19.6]
16/20
1. Given the following values for age, what is the problem with the data?
Age
16
27
-8990
19
15
18
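The age values above contain an entry that cannot be a real age, which is why the reviewer labels the problem as data inconsistency. A minimal sketch of a domain-rule check follows; the function name and the 0 to 120 bounds are assumptions made for illustration, not something the notes specify.

```python
# Age values copied from the question above.
ages = [16, 27, -8990, 19, 15, 18]


def find_inconsistent_ages(values, lowest=0, highest=120):
    """Return values that violate the assumed valid range for an age."""
    return [v for v in values if not (lowest <= v <= highest)]


print(find_inconsistent_ages(ages))  # [-8990]
```

A rule like this flags values that break what is known about the attribute, regardless of how the rest of the data is distributed.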
AGD
AGD False
These outliers exist far outside the entirety of a data set. Global outliers
Forward selection is the opposite of stepwise selection. False
These are also known as point anomalies. Global outliers
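A global outlier (point anomaly) can also be flagged statistically rather than by a domain rule. The sketch below uses a modified z-score based on the median absolute deviation; this technique is swapped in for illustration and is not named in these notes, and the 0.6745 constant and 3.5 cutoff are conventional choices assumed here.

```python
from statistics import median


def global_outliers(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the (assumed) threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Age values that appear earlier in these notes.
print(global_outliers([16, 27, -8990, 19, 15, 18]))  # [-8990]
```

The median is used instead of the mean because a single extreme value such as -8990 would drag the mean and standard deviation far enough to hide itself.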